Offsite
—
DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from the social bookmarking service delicious.com, the content rating system ICRA, and the search engine Google. The data set is freely available for other research.
Michael G. Noll
Free Download
—
The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months. These private searches are perfect for research and mining. The data is sorted by anonymous user ID and sequentially arranged. The collection can be used for personalization, query reformulation or ...
Offsite
—
This python module is a quasi-API to make it easier to authenticate into Google Trends for those who want to squeeze the extra level of functionality out of their data. The advantage of programmatic access is that the data can be automatically trended and merged. It can be snuck into a 9:00 AM daily email to the VP of Marketing so that she knows to ramp up Google Adwords ...
Offsite
—
Apache Pig is a platform for analyzing large data sets. Pig’s language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions but you can also create your own user-defined functions to do special-purpose processing. Pig Latin ...
Offsite
—
Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time. This site has a visualization of Google Flu Trends in comparison to the CDC’s data. There is also a link to a dataset of Google Flu Trends weekly influenza activity estimates for the world, from December 2002 to the present. Each week, millions of ...
Offsite
—
Visual dictionary presents a visualization of all the nouns in the English language arranged by semantic meaning. Each of the tiles in the mosaic is an arithmetic average of images relating to one of 53,464 nouns. The images for each word were obtained using Google’s Image Search and other engines. A total of 7,527,697 images were used, each tile being the average of 140 ...
Offsite
—
Description
From the front page:
> “DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.”