Tag: huge

Showing 21 - 26 out of 26 datasets
  • USPTO (US Patent Office) patents: Bulk Downloads of Full Text, Scans or OCR

    Offsite — The following USPTO patent products are available for free download from Google. Patent Grants: Patent Grant Multi-Page Images (1790 – present); Patent Grant Full Text with Embedded Images (2001 – present); Patent Grant Full Text (1976 – present); Patent Grant Bibliographic Data (1976 – present); Patent Grant OCR Text (1920 – 1979); Patent Grant Single-Page Images (Oct ...
  • Google Labs - Books Ngram Viewer

    Offsite — Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set). Each of the links below will directly download a fragment of the given corpus. For instance, ...
  • Allen Brain Atlas - complete gene expression pattern of mouse brain

    Offsite — “The Allen Brain Atlas that shows the expression pattern of almost every gene in the mouse brain, detailed in a huge series of microscopic images. This resource, which is available to everyone on the Internet, is a wonderful tool for brain researchers” (David Linden). The Allen Mouse Brain Atlas is an interactive, genome-wide image database of gene expression. Find ISH ...
  • Complete and Latest English Wikipedia raw dump with edit history

    Offsite — This is a direct link to the raw Wikipedia data dump, roughly 7TB uncompressed. The data is in XML format and is compressed with bz2, gz, and 7z. A higher-level view of the data is available at http://dumps.wikimedia.org/. As explained on http://en.wikipedia.org/wiki/Wikipedia:Database_download, downloading data of this size uses a lot of bandwidth, which ...
  • 1000 Genomes Data

    Offsite — The 1000 Genomes data is an open dataset from the biological research community containing genetic sequencing data. The complete dataset is huge, at roughly 150TB uncompressed.
  • The ClueWeb09 Dataset

    Offsite — The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference. Dataset Specifications: Web Pages: 1,040,809,705 web pages, in 10 languages; 5 TB, ...
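
Several of the datasets listed here, such as the Wikipedia dump (bz2-compressed XML, roughly 7TB uncompressed), are too large to decompress or load in one go. The following is a minimal Python sketch of one way to stream such a file, not an official tool for any of these datasets. The function name `count_pages` and the file name in the usage note are hypothetical, and the sketch assumes the dump wraps each article in a `<page>` element, as the MediaWiki XML export format does.

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages(dump_path):
    """Count <page> elements in a bz2-compressed XML dump by streaming it."""
    pages = 0
    # bz2.open decompresses incrementally, so the full dump never sits in memory.
    with bz2.open(dump_path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        # The first event is the start of the root element; keep a handle on it.
        _event, root = next(context)
        for event, elem in context:
            # Dumps usually declare an XML namespace, so compare the local name only.
            if event == "end" and elem.tag.rsplit("}", 1)[-1] == "page":
                pages += 1
                # Discard everything parsed so far to keep memory use small.
                root.clear()
    return pages
```

For example, `count_pages("sample-dump.xml.bz2")` (a hypothetical local file) would return the number of pages in that fragment. Counting is only a placeholder for real processing; the same loop could extract titles or revisions before clearing the root element.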