Dataset

Google Books Ngrams

Added By Ganglion

Description

Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set).

Each of the links directly downloads a fragment of the given corpus. For instance, the first hundred links collectively comprise the 1-gram (i.e., individual word) counts for English, as collected from Google’s scanned books around July 15, 2009. The corpus construction is described in detail in the Science article by J.B. Michel et al.; only a summary is given here.

File format

Each of the files is zipped tab-separated data. (Yes, we know the files have .csv extensions.) Each line has the following format:

ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

As an example, here are the 30,000,000th and 30,000,001st lines from file 0 of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

circumvallate 1978 313 215 85
circumvallate 1979 183 147 77

The first line tells us that in 1978, the word circumvallate (which means “surround with a rampart or other fortification”, in case you were wondering) occurred 313 times overall, on 215 distinct pages and in 85 distinct books from our sample.

Here’s the 9,000,000th line from file 0 of the English 5-grams (googlebooks-eng-all-5gram-20090715-0.csv.zip):

analysis is often described as 1991 1 1 1

In 1991, the phrase “analysis is often described as” occurred one time (that’s the first 1), on one page (the second 1), and in one book (the third 1).
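A minimal Python sketch of reading one line in the format above. The names NgramRecord and parse_line are hypothetical, chosen for illustration. The key point is to split on TAB only, since an ngram such as the 5-gram above contains spaces.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One row of an ngram file (hypothetical helper type, not part of the dataset)."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_line(line: str) -> NgramRecord:
    """Parse 'ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE'."""
    # Split on TAB only: the ngram itself may contain spaces (e.g. 5-grams).
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    ngram, year, match_count, page_count, volume_count = fields
    return NgramRecord(
        ngram, int(year), int(match_count), int(page_count), int(volume_count)
    )


record = parse_line("analysis is often described as\t1991\t1\t1\t1\n")
print(record.ngram, record.year, record.match_count)
```

Returning a named tuple keeps the five columns addressable by name (record.page_count) while still comparing equal to a plain tuple.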

Inside each file the ngrams are sorted alphabetically and then chronologically. Note that the files themselves aren’t ordered with respect to one another. A French two-word phrase starting with ‘m’ will be in the middle of one of the French 2-gram files, but there’s no way to know which without checking them all.
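Because the files aren’t ordered relative to one another, looking up a single ngram means scanning each file in turn. The sketch below is one way to do that in Python; rows_for_ngram and find_ngram are hypothetical names, and it assumes the zipped data is UTF-8 text and that all rows for a given ngram live in a single file. It relies only on the sorting described above: rows for one ngram are contiguous, so the scan of a file can stop as soon as the run of matches ends.

```python
import io
import zipfile
from typing import Iterable, Iterator, List


def rows_for_ngram(zip_path: str, target: str) -> Iterator[List[str]]:
    """Yield the tab-split rows for `target` from one zipped ngram file."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            with archive.open(member) as raw:
                # Assumption: the data is UTF-8 encoded text.
                text = io.TextIOWrapper(raw, encoding="utf-8")
                found = False
                for line in text:
                    fields = line.rstrip("\r\n").split("\t")
                    if fields[0] == target:
                        found = True
                        yield fields
                    elif found:
                        # Rows are sorted, so matches are contiguous:
                        # once the run ends there is nothing more to find.
                        return


def find_ngram(zip_paths: Iterable[str], target: str) -> List[List[str]]:
    """Check each file in turn, since the files are not mutually ordered."""
    for path in zip_paths:
        rows = list(rows_for_ngram(path, target))
        if rows:
            # Assumption: a given ngram appears in only one file.
            return rows
    return []
```

Streaming line by line avoids unpacking a whole file into memory. If the single-file assumption turns out not to hold, the loop in find_ngram can collect rows from every file instead of returning at the first hit.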

If datasets aren’t yet complete, that means we’re still busy uploading them. They’ll be available soon.

Usage

This compilation is licensed under a Creative Commons Attribution 3.0 Unported License.