The ClueWeb09 Dataset

Added By Infochimps

The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

Dataset Specifications

Web Pages:

  • 1,040,809,705 web pages, in 10 languages
  • 5 TB compressed (25 TB uncompressed)

See the Record Counts Section on the Dataset Information and Sample Files page for detailed information on the distribution of records and languages.
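
As a rough planning aid, the figures above imply a compression ratio and an average page size. The short Python sketch below works them out. It assumes decimal terabytes (10^12 bytes), which the source does not specify, so treat the results as approximate.

```python
# Figures copied from the specifications above.
PAGES = 1_040_809_705
COMPRESSED_TB = 5
UNCOMPRESSED_TB = 25

TB = 10**12  # assumption: decimal terabytes; the source does not say

compression_ratio = UNCOMPRESSED_TB / COMPRESSED_TB
avg_uncompressed_kb = UNCOMPRESSED_TB * TB / PAGES / 1000
avg_compressed_kb = COMPRESSED_TB * TB / PAGES / 1000

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"average page: {avg_uncompressed_kb:.1f} kB uncompressed, "
      f"{avg_compressed_kb:.1f} kB compressed")
```

Under that assumption, the average page is roughly 24 kB uncompressed.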

Web Graph:

Entire Dataset:

  • Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
  • Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)

TREC Category B (first 50 million English pages):

  • Unique URLs: 428,136,613 (30 GB uncompressed, 10 GB compressed)
  • Total Outlinks: 454,075,638 (3 GB uncompressed, 1 GB compressed)
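
To get a feel for the storage cost per graph element, the sketch below divides the uncompressed sizes listed above by the corresponding counts. It assumes decimal gigabytes (10^9 bytes) and says nothing about how the graph files are actually laid out; the dictionary keys are illustrative names, not part of the dataset.

```python
GB = 10**9  # assumption: decimal gigabytes; the source does not say

# Counts and uncompressed sizes copied from the web graph figures above.
graph = {
    "entire dataset": {
        "urls": 4_780_950_903, "urls_gb": 325,
        "outlinks": 7_944_351_835, "outlinks_gb": 71,
    },
    "TREC Category B": {
        "urls": 428_136_613, "urls_gb": 30,
        "outlinks": 454_075_638, "outlinks_gb": 3,
    },
}

per_item = {}
for name, g in graph.items():
    per_item[name] = (
        g["urls_gb"] * GB / g["urls"],          # bytes per unique URL
        g["outlinks_gb"] * GB / g["outlinks"],  # bytes per outlink
    )
    print(f"{name}: {per_item[name][0]:.0f} bytes/URL, "
          f"{per_item[name][1]:.1f} bytes/outlink")
```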

The web graph for both the entire dataset and for the TREC Category B dataset (first 50 million English pages) is complete. We are in the process of retrieving the data and performing the final formatting of the web graph.
Information on how the crawl progressed is also available.

Dataset Distribution:

The ClueWeb09 dataset and subsets are distributed in several different ways.

  • Full, 4 × 1.5TB: The full dataset is distributed as tarred/gzipped files on four 1.5 terabyte (TB) hard disk drives, in Linux ext3 format. The physical drives are standard SATA 3 Gbit/sec (SATA/300) 3.5" drives that should be compatible with any SATA/300 interface, including external USB to SATA/300 enclosures.
  • Full, 2 × 3.0TB: The full dataset is also available as tarred/gzipped files on two 3.0 terabyte (TB) hard disk drives, in Linux ext3 format. The physical drives are standard SATA 6 Gbit/sec (SATA/600) 3.5" drives. NOTE: Check that your system's operating system and hardware are compatible with 3 TB drives (not all external SATA enclosures are compatible with these drives).
  • CatB, 1 × 500GB: The TREC “Category B” subset of the full dataset is distributed as tarred/gzipped files on one 500 gigabyte (GB) hard disk drive, in Linux ext3 format. The physical drive is a SATA 3 Gbit/sec (SATA/300) 3.5" drive that should be compatible with any SATA/300 interface, including external USB to SATA/300 enclosures.
  • JA, 1 × 500GB: The Japanese subset of the full dataset is distributed as tarred/gzipped files on one 500 gigabyte (GB) hard disk drive, in Linux ext3 format. The physical drive is a SATA 3 Gbit/sec (SATA/300) 3.5" drive that should be compatible with any SATA/300 interface, including external USB to SATA/300 enclosures.
  • T11Crowd, web: The subset used by the TREC 2011 Crowdsourcing track is downloaded from the web.

File Formats and Sample Data:

Web pages are in the WARC file format. Each WARC file is about 1 gigabyte, uncompressed. Each WARC file contains several tens of thousands of web pages (e.g., 40,000). Each WARC file is compressed by gzip.
Please see the Dataset Information and Sample Files page for a detailed description of the contents of the dataset including the format of the dataset and sample files.

Obtaining a Copy of the ClueWeb09 Dataset

The ClueWeb09 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained from Carnegie Mellon by signing a data license agreement with Carnegie Mellon University, and paying a fee that covers the cost of distributing the dataset.