4 datasets
  • Document Metadata Based on a Sample of Web Documents from the Open Directory

    Offsite — DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from the social bookmarking service delicious.com, the content rating system ICRA, and the search engine Google. The data set is freely available for other research. (Michael G. Noll)
  • Reuters Spotlight - Article and Media API

    Offsite — The Reuters Spotlight service provides Reuters.com content in the form of multimedia articles, pictures, videos, and text news through a set of standards-based consumer XML APIs. The Spotlight service also provides an option to receive the content automatically annotated with rich semantic metadata.
  • Web Content Extraction

    Offsite — The dataset contains the HTML version as well as the true content of a web page. Here, true content means the text excluding ads, navigational links and text, comments, etc. For example, for a blog post, only the content of the post is extracted, not the comments or other surrounding text. The dataset contains the HTML source and text content (true content) ...
  • The ClueWeb09 Dataset

    Offsite — The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference. Dataset specifications: 1,040,809,705 web pages in 10 languages; 5 TB, ...