-
Offsite
—
With the Article Search API, you can search New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata. Along with standard keyword searching, the API also offers faceted searching. The available facets include Times-specific fields such as sections, taxonomic classifiers and ...
-
Offsite
—
9/11 tragedy pager intercepts. The following are more than half a million national US pager intercepts released by wikileaks.org. They cover the September 11 tragedy from 3am on that day (Tuesday) until 3am the following day, a 24-hour period surrounding the attacks in New York and Washington. The fields presented are: Date Time Pager-Network Pager-number ...
-
Offsite
—
From the CALO Project at Carnegie Mellon University, a massive dataset of emails recovered from discovery documents in the Enron trials. This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a ...
-
Offsite
—
Because of privacy concerns, it is very hard to obtain large, realistic email corpora. Here you can find a few email data sets, as well as a data set of newsgroup text, annotated with personal name spans. The email corpora given here were extracted from the Enron corpus, made public by the Federal Energy Regulatory Commission. As a second type of informal text, ...
-
Offsite
—
DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from the social bookmarking service delicious.com, the content rating system ICRA, and the search engine Google. The data set is freely available for other research.
Michael G. Noll
-
Offsite
—
The Reuters Spotlight service provides Reuters.com content in the form of multimedia articles, pictures, videos and text news through a set of standards-based consumer XML APIs. The Spotlight service also provides an option to receive the content automatically annotated with rich semantic metadata.
-
Offsite
—
Bulk.resource.org is a service of Public.Resource.Org. The system contains unsupported, as-is copies of selected U.S. government archives. These resources pertain to court information, with topics such as fiches and scans, cases, Courthouse News Service, the Federal Judicial Center, the JURIS database, requests for clarification, and video proceedings.
-
Free Download
—
A data set of over a million syllabi gathered by Dan Cohen’s Syllabus Finder tool from 2002 to 2009. It could be the largest collection of syllabi ever gathered by several orders of magnitude.
See a more detailed description on Dan Cohen’s blog
Format
Data are formatted as JSON records separated by newlines.
Caution: this data is messy and comes with no warranty.
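Since the records are newline-separated JSON and the data is described as messy, a tolerant reader may be useful. The following is a minimal sketch, not the author's tooling: the function name `iter_json_records` and the sample field names (`url`, `title`) are assumptions made for illustration, as the actual record schema is not documented here.

```python
import json
from typing import Iterable, Iterator


def iter_json_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per parseable line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # messy data: skip lines that are not valid JSON
        if isinstance(record, dict):
            yield record


# Hypothetical sample lines; the real field names are an assumption.
sample = [
    '{"url": "http://example.edu/syllabus1", "title": "Intro to History"}',
    '',
    'not valid json',
    '{"url": "http://example.edu/syllabus2"}',
]
records = list(iter_json_records(sample))
print(len(records))
```

Because the function accepts any iterable of strings, an open file handle for the downloaded data can be passed in directly, so the full file never has to be loaded into memory at once.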
-
Offsite
—
The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.