Tag

text

Showing 1 - 20 out of 40 datasets
  • Article Search API - NYTimes.com

    Offsite — With the Article Search API, you can search New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata. Along with standard keyword searching, the API also offers faceted searching. The available facets include Times-specific fields such as sections, taxonomic classifiers and ...
  • Text Messages sent on 9/11/2001 (wikileaks.org)

    Offsite — 9/11 tragedy pager intercepts. The following are more than half a million national US pager intercepts released by wikileaks.org. They cover September 11 from 3am that day (Tuesday) until 3am the following day, a 24-hour period surrounding the attacks in New York and Washington. The fields presented are: Date Time Pager-Network Pager-number ...
  • Enron Email Dataset

    Offsite — From the CALO Project at Carnegie Mellon University, a massive dataset of emails recovered from discovery documents in the Enron trials. This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a ...
  • Email Data Sets

    Offsite — Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find a few email data sets, as well as a dataset of newsgroup text annotated with personal name spans. The email corpora given here were extracted from the Enron corpus, made public by the Federal Energy Regulatory Commission. As a second type of informal text, ...
  • Document Metadata Based on a Sample of Web Documents from the Open Directory

    Offsite — DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from the social bookmarking service delicious.com, the content rating system ICRA, and the search engine Google. The data set is freely available for other research. Michael G. Noll
  • Reuters Spotlight - Article and Media API

    Offsite — The Reuters Spotlight service provides Reuters.com content in the form of multimedia articles, pictures, videos and text news through a set of standards-based consumer XML APIs. The Spotlight service also provides an option to receive the content automatically annotated with rich semantic metadata.
  • German English Parallel Corpus "de-news", Daily News 1996-2000

    Offsite
  • PMC FTP Service

    Offsite
  • Enron Email Dataset

    Offsite
  • Enron Dataset

    Offsite
  • Data for Data Mining

    Offsite
  • phishingcorpus [JoseWiki]

    Offsite
  • Public Resources: Courts

    Offsite — Bulk.resource.org is a service of Public.Resource.Org. The system contains unsupported, as-is copies of selected U.S. government archives. These resources pertain to court information, with topics such as fiches and scans, cases, courthouse news service, federal judicial center, JURIS database, request for clarification, and video proceedings.
  • A Million Syllabi

    Free Download — A data set of over a million syllabi gathered by Dan Cohen’s Syllabus Finder tool from 2002 to 2009. It could be the largest collection of syllabi ever gathered by several orders of magnitude. See a more detailed description on Dan Cohen’s blog. Format: data are formatted as JSON records separated by newlines. Caution: this data is messy and comes with no warranty.
  • Text Analytics Solutions from ClearForest

    Offsite
  • ualberta dependency based thesaurus and word count data

    Offsite
  • Natural Language Processing

    Offsite — The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.
  • Biomedical Text corpora and related data collection resources.

    Offsite
  • Twitter Development Talk - API Documentation

    Offsite
  • Access to Web Research Collections VLC2/WT10g/WT2g

    Offsite
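
The "A Million Syllabi" entry above describes its data as JSON records separated by newlines and warns that the data is messy. The following is a minimal sketch of how such a file might be read defensively in Python, skipping blank or unparseable lines. The function name `iter_json_records` and the sample field names (`title`, `year`) are hypothetical and used only for illustration; the listing does not document the actual record schema.

```python
import io
import json
from typing import Iterator, TextIO


def iter_json_records(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per line, skipping blank lines and lines that do not parse."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # ignore empty lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the data is described as messy, so tolerate bad lines
        if isinstance(record, dict):
            yield record  # keep only JSON objects, not bare lists or scalars


# Hypothetical sample standing in for a real file; field names are assumptions.
sample = io.StringIO(
    '{"title": "Intro to Statistics", "year": 2004}\n'
    "\n"
    "not valid json\n"
    '{"title": "American History", "year": 2007}\n'
)
records = list(iter_json_records(sample))
print(len(records))
```

For a real download, the same function could be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to wherever the file was saved. Reading line by line keeps memory use low even for very large files.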