Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
Overview
The following datasets are provided for evaluating duplicate detection, record linkage, and identity uncertainty systems. Several of these are not yet available for download; please contact the authors. The datasets include a segmented citation dataset based on the Cora research paper search engine; a collection of 864 restaurant records from the Fodor’s and Zagat’s restaurant guides that contains 112 duplicates; four citation datasets from the Citeseer scientific literature digital library; and a citation dataset based on the DBLP (Digital Bibliography & Library Project) computer science bibliography.
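As a rough illustration of how a labeled duplicate dataset of this kind might be used for evaluation, the sketch below scores a set of predicted duplicate pairs against ground-truth cluster labels using pairwise precision and recall. The record ids, cluster labels, and function names are hypothetical and are not taken from the datasets themselves; the actual file formats are not described here and would need their own loader.

```python
from itertools import combinations


def true_duplicate_pairs(cluster_of):
    """Return every pair of record ids that share a ground-truth cluster label."""
    by_cluster = {}
    for record_id, cluster in cluster_of.items():
        by_cluster.setdefault(cluster, []).append(record_id)
    pairs = set()
    for members in by_cluster.values():
        # Sorting makes each pair canonical, e.g. ("r1", "r2") and never ("r2", "r1").
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted_pairs, cluster_of):
    """Compute (precision, recall) of predicted duplicate pairs against the labels."""
    truth = true_duplicate_pairs(cluster_of)
    predicted = {tuple(sorted(pair)) for pair in predicted_pairs}
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data: five records, two true duplicate pairs.
toy_clusters = {"r1": "A", "r2": "A", "r3": "B", "r4": "C", "r5": "C"}
toy_predictions = [("r2", "r1"), ("r3", "r4")]

precision, recall = pairwise_scores(toy_predictions, toy_clusters)
print(precision, recall)
```

Pairwise scoring is only one possible choice; cluster-level measures may suit some systems better.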
Stats
| Sources | |
|---|---|
| Added by | Infochimps |
| Collection | Pete Skomoroch's Bookmarks |
| Link | http://www.cs.utexas.edu/users/ml/riddle/data.html |
| Created | about 3 years ago |
| Updated | about 1 year ago |
