10 datasets
  • Delicious bookmarks, September 2009

    Offsite — A record of all bookmarking activity on delicious.com for a roughly 10-day period in September 2009. The data comes from Arvind Narayanan, a post-doctoral researcher in Computer Science at Stanford University. Format is JSON, one record per line. There are 1.25 million entries. Download size is 170 MB. Sample record: {"updated": "Tue, 08 Sep 2009 08:45:00 +0000", ...
  • Tagged datasets for named entity recognition tasks

  • phishingcorpus [JoseWiki]

  • Main Task QA Data

  • ADL Gazetteer Development

  • AG's corpus of news articles

  • Big Huge Thesaurus API: Access 145,000 Words and Phrases

    Offsite — This site sports a very simple API for retrieving the synonyms for any word and also an actual Big Huge Thesaurus. License: You may use the service for any legal and non-slimy purpose* so long as you link to this site in your website or application credits as follows: "Thesaurus service provided by words.bighugelabs.com". THE SERVICE IS PROVIDED “AS IS” WITHOUT WARRANTY OF ...
  • Linguistic Data Consortium (LDC) - Collection of Linguistic Corpora and Datasets

    Offsite — The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s host institution. The LDC was founded in 1992 with a grant from the Advanced ...
  • Ted Pedersen - Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Ent

    Offsite — Contains data where ambiguous entity names in text have been disambiguated. The data has either been manually disambiguated, or created by conflating multiple names into a single ambiguous pseudo-name.
  • OpenCalais API

    Offsite — The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But Calais goes well beyond classic entity identification and returns the facts and events hidden within ...
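
The Delicious bookmarks entry above describes its format as JSON, one record per line. A minimal Python sketch of reading such a file as a stream follows. It assumes only the "updated" field shown in the sample record; the helper names (`iter_records`, `count_by_day`) and the in-memory sample data are hypothetical stand-ins, not part of the dataset or any published tooling.

```python
import io
import json
from email.utils import parsedate_to_datetime


def iter_records(lines):
    """Yield one dict per non-empty, well-formed line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole scan
        if isinstance(record, dict):
            yield record


def count_by_day(records):
    """Count records per calendar day, keyed by ISO date of the 'updated' field."""
    counts = {}
    for record in records:
        stamp = record.get("updated")
        if not isinstance(stamp, str):
            continue
        try:
            # The sample timestamp is in RFC 2822 style, which this parser handles.
            day = parsedate_to_datetime(stamp).date().isoformat()
        except (TypeError, ValueError):
            continue  # unparseable timestamp
        counts[day] = counts.get(day, 0) + 1
    return counts


# Hypothetical in-memory stand-in for the real 170 MB download.
sample = io.StringIO(
    '{"updated": "Tue, 08 Sep 2009 08:45:00 +0000"}\n'
    '\n'
    'not json\n'
    '{"updated": "Wed, 09 Sep 2009 10:00:00 +0000"}\n'
    '{"updated": "Tue, 08 Sep 2009 23:59:59 +0000"}\n'
)
daily_counts = count_by_day(iter_records(sample))
print(daily_counts)
```

Reading line by line keeps memory use flat regardless of file size; for the real download, an open file handle can be passed to `iter_records` in place of the `StringIO` object.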