  • ISO 639-3 - Codes for the Representation of Names of Languages

    Offsite — ISO 639-3 is a list of three-letter codes for languages: “ISO 639-3 attempts to provide as complete an enumeration of languages as possible, including living, extinct, ancient, and constructed languages, whether major or minor, written or unwritten.” Access/re-use: The list can be [viewed](http://www.sil.org/iso639-3/codes.asp) or ...
  • Word List - 250,000+ Hyphenated, Capitalized and Compound English words

    Free Download — A common word list with over 250,000 entries of hyphenated, capitalized and compound English words. The download consists of entries containing more than one word, as well as capitalized words and acronyms. Phrases are considered “common” if they or variations of them occur in a standard dictionary or thesaurus. This word list is available in a simple, ...
  • Wiktionary

    Offsite — “Welcome to the English-language Wiktionary, a collaborative project to produce a free, multilingual dictionary with definitions, etymologies, pronunciations, sample quotations, synonyms, antonyms and translations. Wiktionary is the lexical companion to the open-content encyclopedia Wikipedia.”
  • Languages of the World (Multilingual RDF Descriptions)

    Offsite — Lingvoj means languages in Esperanto. From the front page of <http://www.lingvoj.org/>: http://www.lingvoj.org/lingvoj.rdf is the complete RDF file gathering currently the description of 507 languages, including all languages defined by ISO 639-1 and most of ISO 639-2 codes (a few exceptions remain, for which Wikipedia articles are not consistent with ...
  • GEneral Multilingual Environmental Thesaurus

    Offsite — A thesaurus in 20+ languages for terms related to the environment and environmental data. Published by the European Environment Agency. Available in RDF without reuse constraints.
  • Word List - 1000 Most Frequent Words from an Internet Corpus

    Free Download — This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.
  • Linguistic Data Consortium (LDC) - Collection of Linguistic Corpora and Datasets

    Offsite — The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s host institution. The LDC was founded in 1992 with a grant from the Advanced ...
  • Word List - 350,000+ Simple English Words (with some Definitions, Excel format)

    Free Download — Over 354,000 single words, excluding proper names, acronyms, and compound words and phrases. Some, but not all, of the words have definitions. This list does not exclude archaic words or significant variant spellings.
  • Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format)

    Free Download — This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.
  • Word Frequencies in Written & Spoken English from British National Corpus (100M-word)

    Offsite — By Geoffrey Leech, Paul Rayson, and Andrew Wilson. Books of English word frequencies have in the past suffered from severe limitations of sample size and breadth. They have also tended to be restricted to word forms alone. Most importantly, almost all have dealt only with written language. This book overcomes these limitations. It is derived from ...
  • The IBL Corpus

    Offsite — The IBL Corpus was collected by the University of Plymouth and the University of Edinburgh as part of the EPSRC-funded project IBL, Instruction-based Learning for Mobile Robots (GR/M90023, GR/M90160). IBL focused on the problem of how natural language instructions can be used by an intelligent embodied agent to build a hierarchy of complex functions based on a ...
  • eXtended WordNet

    Offsite — From the website: “WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical applications. Since WordNet was designed as a lexical database, it exhibits some limitations when used for knowledge processing applications. Often one needs to retrieve words that are ...
  • Apertium

    Offsite — “Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs.” Language-pair data includes: Spanish–Catalan (apertium-es-ca), Spanish–Portuguese (apertium-es-pt), Spanish–Galician ...
  • Word List - 1,000+ Most Frequent Words in King James Bible

    Free Download — 1,185 King James Version frequent substrings (KJVfreq.txt). The 1,185 most frequently occurring substrings in the King James Version Bible, counted and ranked in order of frequency.
  • Letter frequency - Substring frequency in an Amy Tan Novel

    Free Download — 467 current fiction substrings (fiction.txt). The 467 most frequently occurring character sequences (n-grams) in a best-selling 1990 novel by Amy Tan.
  • Language Icon

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of Language Icon.
  • Language Family

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of Language Family.
  • Programming Language

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of Programming Language.
  • Word List - Official Scrabble (TM) Player's Dictionary (OSPD) 2nd ed (with Definitions, Excel format)

    Free Download — 4,160 official crosswords delta (crswd-d.txt). When combined with the 113,809 crosswords file, it produces the official crossword list compatible with the second edition of the Official Scrabble Players Dictionary. (Scrabble is a registered ...
  • Language

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of Language.
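
The substring-frequency lists in this listing (KJVfreq.txt, fiction.txt) rank character sequences by how often they occur in a text. The sketch below shows one plausible way to compute such a ranking. It is not the method used to build those files: the function name `ngram_frequencies`, its parameters, and the sample string are all hypothetical, and the real lists may use different sequence lengths or text normalization.

```python
from collections import Counter


def ngram_frequencies(text, n, top):
    """Return the `top` most common character n-grams in `text` with their counts.

    Hypothetical helper for illustration only; overlapping n-grams are counted.
    """
    # Slide a window of width n across the text and tally each substring.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # most_common sorts by descending count.
    return counts.most_common(top)


# Assumed sample input, not taken from either dataset.
sample = "in the beginning"
for gram, count in ngram_frequencies(sample, 2, 3):
    print(repr(gram), count)
```

For a real corpus, reading the file contents into `text` and choosing `n` to match the sequence length of interest would be the main changes.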