Showing 1 - 20 out of 26 datasets
  • Children Who Speak a Language Other Than English at Home: 2000 to 2004

    Free Download — The Statistical Abstract files are distributed by the US Census Bureau as Microsoft Excel files. These files have data mixed with notes and references, multiple tables per sheet, and, worst of all, table headers that are not easily matched to their rows and columns. A few files had extraneous characters in the title. These were corrected to be consistent. A few files ...
  • The Kids Open Dictionary Builder

    Offsite — About From the creators: > The purpose of this project is to create a free, open simple dictionary for students to use. The words in the dictionary will be reviewed for quality and appropriateness and ultimately “frozen” for export into a variety of formats, including text, PDF, ebooks, wikis, web, etc., for use on a variety of platforms. > The site also includes a ...
  • 80 Million Tiny Images

    Offsite — Visual dictionary presents a visualization of all the nouns in the English language arranged by semantic meaning. Each of the tiles in the mosaic is an arithmetic average of images relating to one of 53,464 nouns. The images for each word were obtained using Google’s Image Search and other engines. A total of 7,527,697 images were used, each tile being the average of 140 ...
  • German English Parallel Corpus "de-news", Daily News 1996-2000

  • Dict.cc - English German Dictionary

    Offsite — About From [about page](http://www.dict.cc/?s=about%3A): > dict.cc is not only an online dictionary. It’s an attempt to create a platform where users from all over the world can share their knowledge in the field of translations. Every visitor can suggest new translations and correct or confirm other users’ suggestions. The challenging and most important part of the ...
  • MIDAS - Heritage project

    Offsite — From the website: > What is MIDAS? > MIDAS sets out an agreed list of the items or ‘units’ of information that should be included in an inventory or other systematic record of the historic environment. These units of information are grouped together under broad headings or ‘information schemes’. These cover areas such as Monument Character, Events, People and ...
  • The JMdict (Japanese-Multilingual Dictionary) project

    Offsite — About Overview: > The JMdict (Japanese-Multilingual Dictionary) project has as its aim the compilation of a multilingual lexical database with Japanese as the pivot language. The project began in 1999 as an offshoot of the EDICT Japanese-English Electronic Dictionary project. It involved a major rebuild of the main files, with a more complex structure using XML. > ...
  • Renascence Editions

    Offsite — These [public domain works] are provided for nonprofit purposes only; unique site content is copyright ©1992-2007 the editors and The University of Oregon. Corrections and comments to the publisher, Risa Stephanie Bear, M.S., M.A., rbear[at]uoregon.edu…. Early Modern texts published by Renascence Editions are not peer reviewed. While we have done our best to ensure ...
  • Eurfa

    Offsite — About Eurfa consists of an English-Welsh and a Welsh-English dictionary. There are currently around 13,000 words. Authored by Kevin Donnelly, it is currently being used in the development of [Apertium-cy](http://www.cymraeg.org.uk), a Welsh-English translator. For further information, see [the project website](http://eurfa.org.uk/) and [this blog ...
  • Ding: German-English Dictionary

    Offsite — About: German-English dictionary from Frank Richter at [Chemnitz University of Technology](http://www.tu-chemnitz.de/). It has been maintained since 1995 (see the [readme file](http://ftp.tu-chemnitz.de/pub/Local/urz/ding/de-en/Readme)). There are now over 216,000 entries. Format: .txt. Access/re-use: licensed under the GPL.
  • Oxford English Dictionary (OED)

    Offsite — Scans of the first edition of the Oxford English Dictionary, along with some software to search those scans. [The post](http://lists.canonical.org/pipermail/kragen-tol/2006-March/000816.html) details work up to volume 6 (as of March 2006). It is not clear whether any more digitization has been done since then, but a search of the Internet Archive (where the scans are ...
  • Word List - 1000 Most Frequent Words from an Internet Corpus

    Free Download — This file consists of the 1,000 English words most frequently used on the Internet in 1992.
  • Linguistic Data Consortium (LDC) - Collection of Linguistic Corpora and Datasets

    Offsite — The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s host institution. The LDC was founded in 1992 with a grant from the Advanced ...
  • Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format)

    Free Download — This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.
  • Word Lists Collection

    Offsite — The data is a smorgasbord of word lists, including spell check oriented word lists, an inflection database, parts of speech word list, jargon file word lists, the contents from Ispell, spell check dictionaries, tables that convert between American, British and Canadian spellings, and links to several other word lists.
  • Speech Accent Archive: 1200+ speech samples from a variety of language backgrounds

    Offsite — The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph and are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers. The Elicitation Paragraph: Please call Stella. Ask her ...
  • Word Frequencies in Written & Spoken English from British National Corpus (100M-word)

    Offsite — By Geoffrey Leech, Paul Rayson, and Andrew Wilson. Books of English word frequencies have in the past suffered from severe limitations of sample size and breadth. They have also tended to be restricted to word forms alone. Most importantly, almost all have dealt only with written language. This book overcomes these limitations. It is derived from ...
  • English Football League System Cells

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of English Football League System Cells.
  • Word List - 1,000+ Most Frequent words in King James Bible

    Free Download — 1,185 King James Version frequent substrings (KJVfreq.txt): the 1,185 most frequently occurring substrings in the King James Version Bible, ranked and counted by order of frequency.
  • English Case Infobox

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of English Case Infobox.