5 datasets
  • Web FAQ collection | ILPS

  • The Cornell Web Lab

  • YouTube Dataset

  • Crunchbase database crawl

    Free Download — From: http://github.com/petewarden/crunchcrawl This module lets you index and download the company information held in Crunchbase. A full scrape of the data is also included. Before using it, double-check http://www.crunchbase.com/robots.txt and the API conditions to ensure you're obeying the terms of service. It contains various scripts to index and pull down the latest ...

  • The ClueWeb09 Dataset

    Offsite — The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference. Dataset specifications: 1,040,809,705 web pages in 10 languages; 5 TB, ...
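
The Crunchbase entry above recommends checking robots.txt before crawling. A minimal sketch of such a check, using Python's standard `urllib.robotparser`, might look like the following. The helper name `is_allowed`, the user agent string, and the sample robots.txt content are all hypothetical and chosen only for illustration; the real file at http://www.crunchbase.com/robots.txt may differ, and this check does not cover the API conditions or terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    agent = "example-crawler"  # hypothetical user agent
    print(is_allowed(SAMPLE_ROBOTS, agent, "http://www.crunchbase.com/company/example"))
    print(is_allowed(SAMPLE_ROBOTS, agent, "http://www.crunchbase.com/private/data"))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the robots.txt file before `can_fetch()` is called.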