Dataset

80 Million Tiny Images

Added By Infochimps

The Visual Dictionary is a poster that visualizes all the nouns in the English language, arranged by semantic meaning. Each tile in the mosaic is the arithmetic average of the images relating to one of 53,464 nouns. The images for each word were obtained using Google’s Image Search and other search engines. A total of 7,527,697 images were used, each tile being the average of about 140 images. The average reveals the dominant visual characteristics of each word. For some words the average turns out to be a recognizable image; for others it is a colored blob. The list of nouns was obtained from WordNet, a database compiled by lexicographers that records the semantic relationships between words. From this database the authors extracted a tree-structured semantic hierarchy, which they used to arrange the tiles within the poster. They tessellated the poster according to the hierarchy so that the proximity of two tiles reflects the semantic distance between their words. The poster thus explores the relationship between visual and semantic similarity.
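
To make the "arithmetic average of images" concrete, here is a minimal sketch in Python. It assumes each image is available as a raw byte string of identical length (for example 32×32×3 bytes); the function name average_tile is hypothetical, and this is an illustration of per-pixel averaging, not the authors' actual code.

```python
from typing import Sequence


def average_tile(images: Sequence[bytes]) -> bytes:
    """Return the per-byte arithmetic mean of equally sized raw images.

    Assumption: every image is a raw buffer of 8-bit values with the
    same length and the same pixel ordering.
    """
    if not images:
        raise ValueError("need at least one image")
    size = len(images[0])
    if any(len(img) != size for img in images):
        raise ValueError("all images must have the same size")

    # Accumulate the sum of each byte position across all images.
    sums = [0] * size
    for img in images:
        for i, value in enumerate(img):
            sums[i] += value

    # Divide by the image count, rounding half up, so each result
    # fits back into the 0..255 range of a single byte.
    n = len(images)
    return bytes((s + n // 2) // n for s in sums)
```

Averaging many dissimilar images pushes every pixel toward a mid-range value, which is why some tiles end up as a colored blob rather than a recognizable picture.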

The Tiny Images dataset consists of 79,302,017 images, each a 32×32 color image. The data is stored in large binary files that can be accessed with a MATLAB toolbox written by the authors. You will need around 400 GB of free disk space to store all the files. In total there are 5 files to download. Three are large binary files containing (i) the images themselves; (ii) their associated metadata (filename, search engine used, ranking, etc.); and (iii) Gist descriptors for each image. The other two are the MATLAB toolbox and an index data file, which together let you easily load data from the binaries.
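
For readers without MATLAB, the following Python sketch shows how a fixed-size record could be located in the image binary. It rests on assumptions not stated above: that each image is stored as 32×32×3 = 3,072 consecutive unsigned bytes, that records are contiguous with no file header, and that indices start at zero. The names record_offset and read_image are hypothetical, and the channel and pixel ordering inside a record is not covered here. Under these assumptions the image file alone would occupy 79,302,017 × 3,072 = 243,615,796,224 bytes (about 244 GB), which fits within the 400 GB figure once the metadata and Gist descriptors are included.

```python
import io
from typing import BinaryIO

SIDE = 32
CHANNELS = 3
RECORD_BYTES = SIDE * SIDE * CHANNELS  # 3,072 bytes per image (assumed layout)
NUM_IMAGES = 79_302_017                # image count given in the description


def record_offset(index: int) -> int:
    """Byte offset of image `index`, assuming contiguous headerless records."""
    if not 0 <= index < NUM_IMAGES:
        raise IndexError(f"image index out of range: {index}")
    return index * RECORD_BYTES


def read_image(f: BinaryIO, index: int) -> bytes:
    """Read the raw bytes of one image from an open binary file."""
    f.seek(record_offset(index))
    data = f.read(RECORD_BYTES)
    if len(data) != RECORD_BYTES:
        raise EOFError("file ended before a full image record was read")
    return data


# Usage with an in-memory stand-in for the real file: two fake records,
# the first all zeros and the second filled with the value 7.
fake_file = io.BytesIO(bytes(RECORD_BYTES) + bytes([7]) * RECORD_BYTES)
second = read_image(fake_file, 1)
```

Seeking to a computed offset avoids loading the whole file, which matters for a binary of this size.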