  • Home Page for 20 Newsgroups Data Set

  • Statistical NLP / corpus-based computational linguistics resources

  • Python Cheese Shop : Shakespeare 0.4

  • Linguistic Data Consortium (LDC) - Collection of Linguistic Corpora and Datasets

    Offsite — The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s host institution. The LDC was founded in 1992 with a grant from the Advanced ...
  • Given Name Frequency Project: Analysis of Given Name Popularity

    Offsite — This Given Name Frequency Project provides analysis, tools, and data to spur further work on given names. Data provided includes popular given names in the US from 1801 to 1999, samples of names from England before 1800 from a diverse set of sources, the popularity of the name Mary over the past 800 years, and a sample of cotton workers in Manchester, England from ...
  • E-Mail Index

    Offsite — This directory contains 1,039 messages from the archives that illustrate what the traffic going into a large web service looks like. The full archive is fairly massive, containing over 50 megabytes of ASCII text. In addition, there are another few hundred megabytes of mail files stashed away in other places, messages such as the massive mailbombing of Santa Claus ...
  • TREC-9 Filtering Track Collections - MEDLINE Extract with Relevance Measures

    Offsite — The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. License: The National Library of Medicine has agreed to ...
  • CMU Machine Learning Project

    Offsite — Links to several data sets including: The StarPlus fMRI contains data, software and documentation on the fMRI data set for the StarPlus data. The Berkeley Segmentation Dataset and Benchmark contains a collection of 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects. Half of the segmentations were obtained from presenting the subject ...
  • 20 Newsgroups Dataset (De-Duped Version)

    Free Download — The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is speculated that it was originally collected by Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, though he does not explicitly mention this collection. The 20 Newsgroups collection has become a ...
  • The arXiv in your pocket - Downloadable Physics Pre-Print Archive

    Offsite — The arXiv Physics pre-print publishing corpus
  • TechTC - Technion Repository of Text Categorization Data Sets

    Offsite — The Technion Repository of Text Categorization Datasets provides a large number of diverse test collections for use in text categorization research.
  • Twitter API Documentation

    Offsite — Twitter exposes its data via an Application Programming Interface (API). This document is the official reference for that functionality.
  • Rail Text Color

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of Rail Text Color.
  • Text Superimpose

    Free Download — This dataset consists of a collection of Infoboxes from Wikipedia on the topic of Text Superimpose.
  • Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets

    Offsite — The following datasets have been provided for evaluating duplicate detection, record linkage, and identity uncertainty systems. Several of these are not yet available for downloading; please contact the authors. The datasets include a segmented citation dataset based on the Cora research paper search engine, a collection of 864 restaurant records from the Fodor’s and ...
  • United States Patent and Trademark Office - Patent Full-Text Database

    Offsite — Searchable database of patents from 1976 to present in full-text.
  • United States Patent and Trademark Office - Patent Application Full-Text Database

    Offsite — US Patent Applications in Full-Text from March 2001 to present.
  • USPTO Assignments on the Web

    Offsite — The database contains all recorded Patent Assignment information from August 1980 to December 8, 2010.
  • Google Books Ngrams

    Offsite — Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set). Each of the links will directly download a fragment of the given corpus. For ...
  • FindTheBest.com Religions listing

    Offsite — Find and compare world religions based on the number of followers, the geographic region it is practiced in, the number of deities, the religion’s sacred text and more.