Tag

corpus

Showing 1 - 20 out of 62 datasets
  • Article Search API - NYTimes.com

    Offsite — With the Article Search API, you can search New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata. Along with standard keyword searching, the API also offers faceted searching. The available facets include Times-specific fields such as sections, taxonomic classifiers and ...
  • Text Messages sent on 9/11/2001 (wikileaks.org)

    Offsite — 9/11 tragedy pager intercepts. The following are more than half a million national US pager intercepts released by wikileaks.org. This covers the September 11 tragedy from 3am on the same day (Tuesday) until 3am the following day, a 24 hour period surrounding the attacks in New York and Washington. The fields presented are: Date Time Pager-Network Pager-number ...
  • Enron Email Dataset

    Offsite — From the CALO Project at Carnegie-Mellon University a massive dataset of emails recovered from discovery documents in the Enron trials About This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a ...
  • Word List - 100,000 + Official Crossword Words (Excel readable)

    Free Download — A word list with over 100,000 entries that are officially permitted in crossword games like Scrabble™. This word list is available in a simple, alphabetically-ordered Excel format, making it convenient for reference, spell-checking, or in more sophisticated application, for developers looking to build a custom spelling dictionary. The entries include variants of words: ...
  • Word List of 64,000+ Common English Dictionary Words (most with definitions, Excel format)

    Free Download — Over 64,000 common dictionary words — A list of words in common with two or more published dictionaries. This gives the developer of a custom spelling checker a good beginning pool of relatively common words.
  • Word List - 10,000+ Common Place Names

    Free Download — U.S. place names for more than 10,000 entries. This U.S. place name list is available in a simple, alphabetically-ordered .txt format, making it convenient for reference, spell-checking, or in more sophisticated application, for developers looking to build a custom location tool or database. The entries represent a sampling of U.S. place names: 10,196 places in total.
  • Word List 80,000+ Official Crossword Words (with most definitions, Excel format)

    Free Download — A list of over 80,000 words officially permitted in crossword games like Scrabble™ with some but not all of their definitions. The words are compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has variants of words: -ing, -ed, -s, and so on, it makes a good addition when building a custom spelling dictionary. It is an reference ...
  • Word List - 100,000+ official crossword words (Excel readable)

    Free Download — 113,809 official crosswords A list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has all forms: -ing, -ed, -s, and so on of words, it makes a good addition when building a custom spelling dictionary.
  • Word List - 350,000+ Simple English Words (Excel readable)

    Free Download — Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
  • TalkBank

    Offsite — About About TalkBank: > The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of the subfields studying communication. It will use these databases to advance the development of standards and tools for creating, sharing, searching, and commenting upon primary materials via ...
  • The New York Times Annotated Corpus « YooName - named entity recognition

    Offsite
  • German English Parallel Corpus "de-news", Daily News 1996-2000

    Offsite
  • Tagged datasets for named entity recognition tasks

    Offsite
  • Enron Email Dataset

    Offsite
  • VoxForge

    Offsite — About > VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac). > We will make available all submitted audio files under the GPL license, and then ‘compile’ them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK ...
  • phishingcorpus [JoseWiki]

    Offsite
  • Web FAQ collection | ILPS

    Offsite
  • Public Resources: Courts

    Offsite — Bulk.resource.org is a service of Public.Resource.Org, the system contains unsupported, as-is copies of selected U.S. government archives. These resources are pertaining to court information with topics like, fiches and scans, cases, courthouse news service, federal judicial center, JURIS database, request for clarification, and video proceedings.
  • A Million Syllabi

    Free Download — A data set of over a million syllabi gathered by Dan Cohen’s Syllabus Finder tool from 2002 to 2009. It could be the largest collection of syllabi ever gathered by several orders of magnitude. See a more detailed description on Dan Cohen’s blog Format Data are formatted as json records separated by newlines. Caution: this data is messy and comes with no warranty.
  • Web as Corpus

    Offsite