Category


Linguistics

  • Word List - 1,000+ Most Frequent Words in King James Bible

    Free Download — 1,185 King James Version frequent substrings (KJVfreq.txt) The 1,185 most frequently occurring substrings in the King James Version Bible, ranked and counted in order of frequency.
  • Letter frequency - Substring frequency in an Amy Tan Novel

    Free Download — 467 current fiction substrings (fiction.txt) The 467 most frequently occurring character sequences (n-grams) in a best-selling novel by Amy Tan in 1990.
  • Word List - Official Scrabble (TM) Player's Dictionary (OSPD) 2nd ed (with Definitions, Excel format)

    Free Download — 4,160 official crosswords delta (crswd-d.txt) When combined with the 113,809 crosswords file, it produces the official crossword list compatible with the second edition of the Official Scrabble Players Dictionary. (Scrabble is a registered ...
  • Word List - List of Acronyms

    Free Download — 6,213 acronyms (acronyms.txt) Common acronyms and abbreviations.
  • Word List - 350,000+ Words

    Free Download — Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
  • Word List - Official Scrabble (TM) Player's Dictionary (OSPD) 2nd ed

    Free Download — 4,160 official crosswords delta (crswd-d.txt) When combined with the 113,809 crosswords file, it produces the official crossword list compatible with the second edition of the Official Scrabble Players Dictionary. (Scrabble is a registered trademark of Milton-Bradley licensed to Merriam-Webster.)
  • Word List - 21,000+ Common Given Names (US & Great Britain)

    Free Download — 21,986 names (names.txt) This database contains the most common names used in the United States and Great Britain. Spelling checkers may want to supplement their basic word list with this one.
  • Word List - 4,900+ Common Female Given Names (English-speaking Countries)

    Free Download — 4,946 female names (names-f.txt) Frequent given names of females in English-speaking countries. Spelling checkers may want to supplement their basic word list with this one.
  • Word List - 3,800+ Common Male Given Names (English-speaking Countries)

    Free Download — 3,800 male names Frequent given names of males in English-speaking countries. Spelling checkers may want to supplement their basic word list with this one.
  • Word List - Commonly Misspelled English Words

    Free Download — 366 often misspelled words (oftenmis.txt) Many of the most commonly misspelled words in English-speaking countries.
  • Big Huge Thesaurus API: Access 145,000 Words and Phrases

    Offsite — This site sports a very simple API for retrieving the synonyms for any word and also an actual Big Huge Thesaurus. License: You may use the service for any legal and non-slimy purpose* so long as you link to this site in your website or application credits as follows: Thesaurus service provided by words.bighugelabs.com. THE SERVICE IS PROVIDED “AS IS” WITHOUT WARRANTY OF ...
  • The Quantz Corpus (Dinosaur Comics)

    Offsite — This is all the text from every Dinosaur Comic ever made, in convenient XML format. It was released by the author, Ryan North, as a tool to help solve an anagram presented in the comic for March 1, 2010. The text was also sorted and counted by word, two-word phrase, and three-word phrase by Paul Stansifer.
  • Google Books Ngrams

    Offsite — Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set). Each of the links will directly download a fragment of the given corpus. For ...
  • Rap Data Pack #1: Jay-Z

    Free Download — Here is the first Rap Data Pack™ released by StapleCrops.com. It includes the raw data from Hov’s complete body of work: word count, readability, release dates, Geo Codes, etc.
  • Linguistic Data Consortium (LDC) - Collection of Linguistic Corpora and Datasets

    Offsite — The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s host institution. The LDC was founded in 1992 with a grant from the Advanced ...
  • List of bad words

    Free Download — A list of bad words.