Tag

corpus

Showing 41 - 60 out of 62 datasets
  • Speech Accent Archive: 1200+ speech samples from a variety of language backgrounds

    Offsite — The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph, and their readings are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers. The elicitation paragraph: Please call Stella. Ask her ...
  • The arXiv in your pocket - Downloadable Physics Pre-Print Archive

    Offsite — The arXiv Physics pre-print publishing corpus
  • Word Frequencies in Written & Spoken English from British National Corpus (100M-word)

    Offsite — By Geoffrey Leech, Paul Rayson, and Andrew Wilson. Books of English word frequencies have in the past suffered from severe limitations of sample size and breadth. They have also tended to be restricted to word forms alone. Most importantly, almost all have dealt only with written language. This book overcomes these limitations. It is derived from ...
  • The IBL Corpus

    Offsite — The IBL Corpus was collected by the University of Plymouth and the University of Edinburgh as part of the EPSRC-funded project IBL, Instruction-based Learning for Mobile Robots (GR/M90023, GR/M90160). IBL focused on the problem of how natural language instructions can be used by an intelligent embodied agent to build a hierarchy of complex functions based on a ...
  • eXtended WordNet

    Offsite — From the website: WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical applications. Since WordNet was designed as a lexical database, it exhibits some limitations when used for knowledge processing applications. Often one needs to retrieve words that are ...
  • Lyricsfly Lyrics REST API

    Offsite — An Application Programming Interface is available to anyone who wishes to use our database for their own music project, website, or program. If you currently use the web to search out lyrics, or use code tricks to access other lyrics websites to display relevant lyrics text for your content, you can now have a reliable source without the hassle. Example code for PHP: ...
  • Word List - 1,000+ Most Frequent words in King James Bible

    Free Download — 1,185 King James Version frequent substrings (KJVfreq.txt). The 1,185 most frequently occurring substrings in the King James Version Bible, ranked and counted in order of frequency.
  • Letter frequency - Substring frequency in an Amy Tan Novel

    Free Download — 467 current fiction substrings (fiction.txt). The 467 most frequently occurring character sequences (n-grams) in a best-selling novel by Amy Tan in 1990.
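The two substring-frequency lists above rank character sequences by how often they occur. As a rough illustration of that kind of count, here is a minimal Python sketch. It assumes fixed-length sequences and an overlapping sliding window; the listing does not say how KJVfreq.txt or fiction.txt were actually produced, and the function name `top_substrings` is hypothetical.

```python
from collections import Counter


def top_substrings(text: str, n: int, k: int) -> list[tuple[str, int]]:
    """Return the k most frequent length-n character sequences in text.

    Assumption: sequences are counted with an overlapping sliding window.
    The published lists may have used a different method.
    """
    # Slide a window of width n across the text and tally each sequence.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # most_common sorts by count, highest first.
    return counts.most_common(k)


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    for sequence, count in top_substrings(sample, 2, 5):
        print(repr(sequence), count)
```

To approximate a ranked list like the ones described, the same function could be run over the full text of a book for several values of `n` and the results merged.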
  • IMSLP / Petrucci Music Library

    Offsite — Welcome to the International Music Score Library Project! IMSLP attempts to create a virtual library containing all public domain musical scores, as well as scores from composers who are willing to share their music with the world without charge. You can read the full list of goals that IMSLP will try to achieve. IMSLP also encourages the exchange of musical ...
  • Word List - Official Scrabble (TM) Player's Dictionary (OSPD) 2nd ed (with Definitions, Excel format

    Free Download — 4,160 official crosswords delta (crswd-d.txt). When combined with the 113,809 crosswords file, it produces the official crossword list compatible with the second edition of the Official Scrabble Players Dictionary. (Scrabble is a registered ...
  • Ted Pedersen - Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Ent

    Offsite — Contains data where ambiguous entity names in text have been disambiguated. The data has either been manually disambiguated, or created by conflating multiple names into a single ambiguous pseudo-name.
  • Word List - List of Acronyms

    Free Download — 6,213 acronyms (acronyms.txt). Common acronyms and abbreviations.
  • Word List - 350,000+ Words

    Free Download — Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
  • Word List - Official Scrabble (TM) Player's Dictionary (OSPD) 2nd ed

    Free Download — 4,160 official crosswords delta (crswd-d.txt). When combined with the 113,809 crosswords file, it produces the official crossword list compatible with the second edition of the Official Scrabble Players Dictionary. (Scrabble is a registered trademark of Milton-Bradley licensed to Merriam-Webster.)
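Combining the delta file with the larger crosswords file can be sketched as a set union. This Python example is an illustration only, under three assumptions: both files hold one word per line, the delta is purely additive, and case differences can be ignored. The listing does not confirm any of these, the function names are hypothetical, and the path of the 113,809-word file is not given, so it is left as a parameter.

```python
def merge_word_lists(base_words, delta_words):
    """Return the sorted union of two iterables of words.

    Assumptions: one word per item, blanks are skipped, and words are
    compared case-insensitively (everything is lowercased).
    """
    merged = {w.strip().lower() for w in base_words if w.strip()}
    merged.update(w.strip().lower() for w in delta_words if w.strip())
    return sorted(merged)


def merge_files(base_path, delta_path, out_path):
    """Merge two one-word-per-line files and return the merged word count."""
    with open(base_path, encoding="utf-8") as base, \
            open(delta_path, encoding="utf-8") as delta:
        # File objects iterate line by line, so they can be passed directly.
        words = merge_word_lists(base, delta)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(words) + "\n")
    return len(words)
```

Calling `merge_files` with the path to the larger crosswords file, `"crswd-d.txt"`, and an output path would write the combined list. If the two files share no words, the result would contain the sum of their sizes.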
  • Word List - 21,000+ Common Given Names (US & Great Britain)

    Free Download — 21,986 names (names.txt). This database contains the most common names used in the United States and Great Britain. Spelling checkers may want to supplement their basic word list with this one.
  • Word List - 4,900+ Common Female Given Names (English-speaking Countries)

    Free Download — 4,946 female names (names-f.txt). Frequent given names of females in English-speaking countries. Spelling checkers may want to supplement their basic word list with this one.
  • Word List - 3,800+ Common Male Given Names (English-speaking Countries)

    Free Download — 3,800 male names. Frequent given names of males in English-speaking countries. Spelling checkers may want to supplement their basic word list with this one.
  • Word List - Commonly Misspelled English Words

    Free Download — 366 often misspelled words (oftenmis.txt). Many of the most commonly misspelled words in English-speaking countries.
  • Kabul War Diary - Over 90,000 Military documents from the War in Afghanistan (from Wikileaks.org)

    Offsite — The Afghan War Diary is an extraordinary secret compendium of over 91,000 reports covering the war in Afghanistan from 2004 to 2010. The reports describe the majority of lethal military actions involving the United States military. They include the ...
  • Google Books Ngrams

    Offsite — Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set). Each of the links will directly download a fragment of the given corpus. For ...