Tag

corpora

Showing 1 - 20 out of 32 datasets
  • Text Messages sent on 9/11/2001 (wikileaks.org)

    Offsite — 9/11 tragedy pager intercepts. The following are more than half a million national US pager intercepts released by wikileaks.org. This covers the September 11 tragedy from 3am on the same day (Tuesday) until 3am the following day, a 24 hour period surrounding the attacks in New York and Washington. The fields presented are: Date Time Pager-Network Pager-number ...
  • Word List - 100,000 + Official Crossword Words (Excel readable)

    Free Download — A word list with over 100,000 entries that are officially permitted in crossword games like Scrabble™. This word list is available in a simple, alphabetically-ordered Excel format, making it convenient for reference, spell-checking, or in more sophisticated application, for developers looking to build a custom spelling dictionary. The entries include variants of words: ...
  • Word List of 64,000+ Common English Dictionary Words (most with definitions, Excel format)

    Free Download — Over 64,000 common dictionary words — A list of words in common with two or more published dictionaries. This gives the developer of a custom spelling checker a good beginning pool of relatively common words.
  • Word List - 10,000+ Common Place Names

    Free Download — U.S. place names for more than 10,000 entries. This U.S. place name list is available in a simple, alphabetically-ordered .txt format, making it convenient for reference, spell-checking, or in more sophisticated application, for developers looking to build a custom location tool or database. The entries represent a sampling of U.S. place names: 10,196 places in total.
  • Word List 80,000+ Official Crossword Words (with most definitions, Excel format)

    Free Download — A list of over 80,000 words officially permitted in crossword games like Scrabble™ with some but not all of their definitions. The words are compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has variants of words: -ing, -ed, -s, and so on, it makes a good addition when building a custom spelling dictionary. It is an reference ...
  • Word List - 100,000+ official crossword words (Excel readable)

    Free Download — 113,809 official crosswords A list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has all forms: -ing, -ed, -s, and so on of words, it makes a good addition when building a custom spelling dictionary.
  • Word List - 350,000+ Simple English Words (Excel readable)

    Free Download — Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
  • VoxForge

    Offsite — About > VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac). > We will make available all submitted audio files under the GPL license, and then ‘compile’ them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK ...
  • Password Dictionary

    Offsite — A list of 1,717,680 passwords. Useful for verifying whether or not users are displaying good password hygiene.
  • Statistical Machine Translation - Europarl Parallel Corpus

    Offsite — About Overview: > The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. > The goal of the extraction and processing was to generate sentence aligned text for ...
  • Biomedical Text corpora and related data collection resources.

    Offsite
  • Word List - 250,000+ Hyphenated, Capitalized and Compound English words

    Free Download — A common word list with over 250,000 entries of hyphenated, capitalized and compound English words. The download consists of entries containing more than one word, as well as capitalized words and acronyms. Phrases are considered “common” if they or variations of them occur in a standard dictionary or thesaurus. This word list is available in a simple, ...
  • The New York Times Annotated Corpus

    Offsite — From [website](http://ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19): The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff ...
  • Word List - 1000 Most Frequent Words from an Internet Corpus

    Free Download — This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.
  • Word List - 350,000+ Simple English Words (with some Definitions, Excel format)

    Free Download — Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. Some, but not all of the words have definitions. This list does not exclude archaic words or significant variant spellings.
  • Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format)

    Free Download — This file consists of the 1,000 most frequently used English words from a wide variety of common texts listed in decreasing order of frequency
  • Westbury Lab Usenet Corpus: 28M postings from 47000+ newsgroups 2005-2009

    Offsite — A USENET corpus (2005-2009) This corpus is a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2010, and covers 47860 English language, non-binary-file news groups. Despite our best effots, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be ...
  • Word Frequencies in Written & Spoken English from British National Corpus (100M-word)

    Offsite — by Geoffrey Leech, Paul Rayson, Andrew Wilson Overview Download word lists Books of English word frequencies have in the past suffered from severe limitations of sample size and breadth. They have also tended to be restricted to word forms alone. Most importantly, almost all have dealt only with written language. This book overcomes these limitations. It is derived from ...
  • Word List - 1,000+ Most Frequent words in King James Bible

    Free Download — 1,185 King James Version frequent substrings (KJVfreq.txt) The most frequently occurring 1,185 substrings in the King James Version Bible ranked and counted by order of frequency.
  • Letter frequency - Substring frequency in an Amy Tan Novel

    Free Download — 467 current fiction substrings (fiction.txt) The most frequently occurring 467 character sequences (n-grams) occurring in a best-selling novel by Amy Tan in 1990.