Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets

The following datasets have been provided for evaluating duplicate detection, record linkage, and identity uncertainty systems. Several of these are not yet available for download; please contact the authors. The datasets include a segmented citation dataset based on the Cora research paper search engine; a collection of 864 restaurant records from the Fodor’s and Zagat’s restaurant guides that contains 112 duplicates; four citation datasets from the Citeseer scientific literature digital library; and a citation dataset based on the DBLP (Digital Bibliography & Library Project) computer science bibliography.