Category

Linguistics

  • Enron Email Dataset

    Offsite — From the CALO Project at Carnegie Mellon University, a massive dataset of emails recovered from discovery documents in the Enron trials. About: This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a ...
  • Word List - 100,000 + Official Crossword Words (Excel readable)

    Free Download — A word list with over 100,000 entries that are officially permitted in crossword games like Scrabble™. This word list is available in a simple, alphabetically ordered Excel format, making it convenient for reference, spell-checking, or, in more sophisticated applications, for developers looking to build a custom spelling dictionary. The entries include variants of words: ...
  • Word List of 64,000+ Common English Dictionary Words (most with definitions, Excel format)

    Free Download — Over 64,000 common dictionary words. A list of words found in two or more published dictionaries. This gives the developer of a custom spelling checker a good starting pool of relatively common words.
  • Word List 80,000+ Official Crossword Words (with most definitions, Excel format)

    Free Download — A list of over 80,000 words officially permitted in crossword games like Scrabble™, with some but not all of their definitions. The words are compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list includes variants of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spelling dictionary. It is a reference ...
  • Word List - 100,000+ official crossword words (Excel readable)

    Free Download — 113,809 official crossword words. A list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list includes all forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spelling dictionary.
  • Word List - 350,000+ Simple English Words (Excel readable)

    Free Download — Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
  • A list of all 22,802 words in the Scribblenauts dictionary.

    Free Download — List of summonable objects from the Nintendo DS game Scribblenauts, from AARDVARK, ABOMINABLE SNOWMAN and ABSCONDER to ZOMBIE, ZUNICERATOPS and ZYGOTE. Via the Scribblenauts Wikipedia entry: Scribblenauts is an emergent puzzle action video game with the tagline “Write Anything, Solve Everything”. Its objective is to complete puzzles by summoning any object (from a ...
  • Wordnet

    Offsite — WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts ...
  • Measuring Worth: Interest Rates - US, UK, China, Japan

    Offsite — The mission of the site is to make available to the public the highest quality and most reliable historical data on important economic aggregates, with particular emphasis on nominal measures. The data have been created using the highest standards of the fields of economics and history and are rigorously refereed by the most distinguished researchers in the fields. ...
  • List of Dirty, Obscene, Banned and otherwise unacceptable words

    Free Download — A banned word list compiled from many lists around the web of words considered socially unacceptable for one reason or another. What to do with a banned word list? Use this dirty word list to screen for spammers and griefers; to censor dissidents; to better understand the semiotic role of taboo signifiers in an online modality; to monitor user ...
  • Web FAQ collection | ILPS

    Offsite
  • NIST: Topic Detection and Tracking (TDT)

    Offsite — Topic Detection and Tracking (TDT) research was pursued as an integral part of the DARPA Translingual Information Detection, Extraction, and Summarization (TIDES) program. The goal of the TIDES program is to enable English-speaking users to access, ...
  • Computer hacker wordlists from packetstormsecurity.org

    Offsite
  • Word List - 250,000+ Hyphenated, Capitalized and Compound English words

    Free Download — A common word list with over 250,000 entries of hyphenated, capitalized and compound English words. The download consists of entries containing more than one word, as well as capitalized words and acronyms. Phrases are considered “common” if they or variations of them occur in a standard dictionary or thesaurus. This word list is available in a simple, ...
  • Word List - 350,000+ Simple English Words (with some Definitions, Excel format)

    Free Download — Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. Some, but not all of the words have definitions. This list does not exclude archaic words or significant variant spellings.
  • Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format)

    Free Download — This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.
  • Word List - 1000 Most Frequent Words from an Internet Corpus

    Free Download — This file consists of the 1,000 most frequently used English words as used on the Internet in 1992.
  • Westbury Lab Usenet Corpus: 28M postings from 47000+ newsgroups 2005-2009

    Offsite — A USENET corpus (2005-2009). This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2010, and covers 47,860 English-language, non-binary-file newsgroups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be ...
  • Speech Accent Archive: 1200+ speech samples from a variety of language backgrounds

    Offsite — The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph, and their readings are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers. The elicitation paragraph: Please call Stella. Ask her ...
  • Word Frequencies in Written & Spoken English from British National Corpus (100M-word)

    Offsite — By Geoffrey Leech, Paul Rayson, and Andrew Wilson. Books of English word frequencies have in the past suffered from severe limitations of sample size and breadth. They have also tended to be restricted to word forms alone. Most importantly, almost all have dealt only with written language. This book overcomes these limitations. It is derived from ...