Tag

record_linkage

14 datasets
  • Email Data Sets

    Offsite — Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find a few email data sets, as well as a dataset of newsgroup text annotated with personal name spans. The email corpora given here were extracted from the Enron corpus, made public by the Federal Energy Regulatory Commission. As a second type of informal text, ...
  • 99 Wikipedia Sources Aiding the Semantic Web » AI3:::Adaptive Information

    Offsite
  • INFO 747 - Social and Economic Data

    Offsite
  • Cogblog » Blog Archive » Cogmap APIs

    Offsite
  • Cogmap: The Org Chart Wiki

    Offsite
  • CiteULike: Available datasets

    Offsite
  • ZoomInfo - Welcome to the ZoomInfo Developer API

    Offsite — The ZoomInfo Public API provides free access to ZoomInfo’s people database and company database that contain over 40 million people and nearly 4 million companies, respectively. The ZoomInfo people search API gives you the ability to search for any person in the database by name. The ZoomInfo company search API gives you the ability to search for any company in the ...
  • Given Name Frequency Project: Analysis of Given Name Popularity

    Offsite — The Given Name Frequency Project provides analysis, tools, and data to spur further work on given names. Data provided includes popular given names in the US from 1801 to 1999, samples of names from England before 1800 drawn from a diverse set of sources, the popularity of the name Mary over the past 800 years, and a sample of cotton workers in Manchester, England from ...
  • 1990 Census Name Files

    Offsite — Three separate datasets obtained from the 1990 census. One set contains last names, one contains male first names, and one contains female first names. Each file gives the following data: the name, frequency in percent, cumulative frequency in percent, and rank.
  • New SwetoDblp RDF dataset released with 11M triples

    Offsite — The LSDIS (Large Scale Distributed Information Systems) lab at the University of Georgia has released a new version of the SwetoDblp dataset. SwetoDblp is a large-size ontology (spin-off of SWETO ontology) focused on bibliography data of Computer Science publications where the main data source is DBLP (Digital Bibliography & Library Project). The dataset has about 11M ...
  • LSDIS : SwetoDblp

    Offsite — SwetoDblp is a large-size ontology (spin-off of SWETO ontology) focused on bibliography data of Computer Science publications where the main data source is DBLP (Digital Bibliography & Library Project). SwetoDblp was created from a large XML document available at DBLP’s website and other datasets that are used to add relationships to other entities such as Publishers, ...
  • Vaccines: IIS/Tech/Deduplication Test Cases

    Offsite — The National Immunization Program (NIP, now called the National Center for Immunization and Respiratory Diseases) developed a toolkit to assist immunization information systems (IIS) in the evaluation of their deduplication algorithms. This toolkit helps registries assess their system’s ability to prevent or remove duplicate records. The data and procedures in this toolkit can help identify strengths and ...
  • Ted Pedersen - Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Ent

    Offsite — Contains data where ambiguous entity names in text have been disambiguated. The data has either been manually disambiguated, or created by conflating multiple names into a single ambiguous pseudo-name.
  • Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets

    Offsite — The following datasets have been provided for evaluating duplicate detection, record linkage, and identity uncertainty systems. Several of these are not yet available for downloading; please contact the authors. The datasets include a segmented citation dataset based on the Cora research paper search engine, a collection of 864 restaurant records from the Fodor’s and ...