Collection
Pete Skomoroch's Bookmarks
Showing 201 - 250 out of 375 datasets
Pete Skomoroch is President and Lead Consultant at Data Wrangling in Arlington, VA, a firm which specializes in mining large datasets to solve problems in search, finance, and recommendation systems.
He maintains an ever-expanding list of datasets (nearly 400 at last count!), all of which have now been incorporated into the Infochimps repository.
-
PMC FTP Service
Offsite — -
The "USpop2002" data set
Offsite — -
Internet Archive: Details: Amazon ASIN listing and similarity graph
Offsite — -
European Climate Assessment Daily Weather Data
Offsite — -
Poverty Data Sets General Information
Offsite — A collection of subnational, spatially explicit poverty data sets. These are available for selected proxy measures of poverty at global and national scales. The global data are of varying resolution, but primarily coarse; the national data sets are of considerably higher resolution. The global data sets include two proxy poverty measurements: Malnutrition (underweight ... -
StatLib---Datasets Archive
Offsite — -
National Household Travel Survey (NHTS) Data
Offsite — -
RealClearPolitics - Election 2008 - Democratic Presidential Nomination
Offsite — -
Pew Internet & American Life Project
Offsite — -
Home - Numbrary
Offsite — -
Main Page - OpenTextMining
Offsite — -
Metafilter Infodump
Offsite — -
WEBSPAM-UK2007 | Datasets | Web Spam Detection
Offsite — -
Google to Host Terabytes of Open-Source Science Data | Wired Science from Wired.com
Offsite — -
Zillow - Labs - Neighborhood Boundaries
Offsite — -
Trust network datasets - TrustLet
Offsite — -
Crime in the United States 2006
Offsite — -
TaskForces/CommunityProjects/LinkingOpenData/DataSets - ESW Wiki
Offsite — -
Some Datasets Available on the Web » Data Wrangling Blog
Offsite — -
XML.com: GovTrack.us, Public Data, and the Semantic Web
Offsite — -
CiteULike: Available datasets
Offsite — -
Archive-It.org
Offsite — -
Challenge: Synopsis - Causality Workbench
Offsite — There are 4 datasets available (REGED, SIDO, CINA and MARTI), which have been progressively introduced, see the Dataset page. No new datasets will be introduced until the end of the challenge. -
Natural Language Processing
Offsite — The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person. -
Linguistic Data Consortium (LDC) - Collection of Linguistic Corpora and Datasets
Offsite — The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s host institution. The LDC was founded in 1992 with a grant from the Advanced ... -
1990 Census Name Files
Offsite — Three separate datasets obtained from the 1990 census. One set includes last names, one has male first names, and one has female first names. They contain the following data: the name, frequency in percent, cumulative frequency in percent, and rank. -
Given Name Frequency Project: Analysis of Given Name Popularity
Offsite — This Given Name Frequency Project provides analysis, tools, and data to spur further work on given names. Data provided includes popular given names in the US from 1801 to 1999, samples of names from England before 1800 from a diverse set of sources, the popularity of the name Mary over the past 800 years, and a sample of cotton workers in Manchester, England from ... -
Email Data Sets
Offsite — Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find a few email data sets, as well as a dataset of newsgroup text, annotated with personal name spans. The email corpora given here were extracted from the Enron corpus, made public by the Federal Energy Regulatory Commission. As a second type of informal text, ... -
ZoomInfo - Welcome to the ZoomInfo Developer API
Offsite — The ZoomInfo Public API provides free access to ZoomInfo’s people database and company database that contain over 40 million people and nearly 4 million companies, respectively. The ZoomInfo people search API gives you the ability to search for any person in the database by name. The ZoomInfo company search API gives you the ability to search for any company in the ... -
Ted Pedersen - Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Ent
Offsite — Contains data where ambiguous entity names in text have been disambiguated. The data has either been manually disambiguated, or created by conflating multiple names into a single ambiguous pseudo-name. -
Developers Area - eBay Market Data Documentation - eBay Market Data Documentation
Offsite — AERS delivers the industry’s leading transaction-based data solutions using our advanced decision support and visualisation tools, and exclusive access to a broad base of consumer marketplace data. As the exclusive licensor of eBay transactional data, AERS is able to provide unique insights and tools that help your business make critical decisions about investments, ... -
New SwetoDblp RDF dataset released with 11M triples
Offsite — The LSDIS (Large Scale Distributed Information Systems) lab at the University of Georgia has released a new version of the SwetoDblp dataset. SwetoDblp is a large-size ontology (spin-off of SWETO ontology) focused on bibliography data of Computer Science publications where the main data source is DBLP (Digital Bibliography & Library Project). The dataset has about 11M ... -
LSDIS : SwetoDblp
Offsite — SwetoDblp is a large-size ontology (spin-off of SWETO ontology) focused on bibliography data of Computer Science publications where the main data source is DBLP (Digital Bibliography & Library Project). SwetoDblp was created from a large XML document available at DBLP’s website and other datasets that are used to add relationships to other entities such as Publishers, ... -
StrikeIron Super Data Pack Web Service 1.0 - StrikeIron Marketplace
Offsite — Offer comprehensive services that can provide vital information about US and international companies. Get the scoop on executives, revenue history, public company stock performance – both past and present. And pull vital demographic data about any state, the District of Columbia or Puerto Rico. After all, the more you know, the more prepared you’ll be to succeed. -
Vaccines: IIS/Tech/Deduplication Test Cases
Offsite — NIP (now called the National Center for Immunization and Respiratory Diseases) developed a toolkit to assist immunization information systems (IIS) in the evaluation of their deduplication algorithms. This toolkit helps registries assess their system’s ability to prevent/remove duplicate records. The data and procedures in this toolkit can help identify strengths and ... -
Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
Offsite — The following datasets have been provided for evaluating duplicate detection, record linkage, and identity uncertainty systems. Several of these are not yet available for downloading; please contact the authors. The datasets include a segmented citation dataset based on the Cora research paper search engine, a collection of 864 restaurant records from the Fodor’s and ... -
INFO 747 - Social and Economic Data
Offsite — -
Overstock.com Affiliate Program API Access
Offsite — -
Amazon Web Services Developer Connection : Can Alexa WS provide detailed ...
Offsite — -
Market Data — eBay Developers Program
Offsite — The eBay Market Data Program offers rich consumer insight data about what is purchased on eBay and who is purchasing it. Access to this business intelligence can help you make effective buying and selling decisions for your business, regardless of whether you do business on eBay or outside of the eBay Marketplace. -
It’s a Pitch-by-Pitch Scouting Report, Minus the Scout - New York Times
Offsite — -
Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending
Offsite — -
Welcome to USAspending.gov
Offsite — A single searchable website, accessible to the public at no cost, which includes for each Federal award the name of the entity receiving the award; the amount of the award; information on the award including transaction type, funding agency, etc; the location of the entity receiving the award; and a unique identifier of the entity receiving the award. Prime award ... -
Campaign Finance Reports and Data
Offsite — The Federal Election Commission provides information on various campaign expenditures through databases, reports, and a disclosure data catalog. Some of the information included is candidate, PAC, and party summaries, disclosure reports filed electronically by House and Presidential campaigns, parties and PACs, actual financial disclosure reports filed by House, Senate ... -
Machine Learning and Data Mining - Datasets
Offsite — Included are the datasets used for the ICML 2005 paper “Clustering through ranking on Manifolds,” as well as the Yale Face Database of photographs used in the study. Also included are the indexes for the images that were used in the random 90/10 splits. -
GIS for Schools
Offsite — -
Cardiac MRI dataset - York University
Offsite — This dataset consists of short axis cardiac MR images and the ground truth of their left ventricles’ endocardial and epicardial segmentations. The dataset was first compiled and used as part of the following paper: Alexander Andreopoulos, John K. Tsotsos, Efficient and Generalizable Statistical Models of Shape and Appearance for Analysis of Cardiac MRI, Medical Image ... -
Google Trends API coming soon | Tech news blog - CNET News.com
Offsite — -
MIT Media Lab: Reality Mining
Offsite — The Reality Mining data is available as a Matlab workspace. Reality Mining defines the collection of machine-sensed environmental data pertaining to human social behavior. This new paradigm of data mining makes possible the modeling of conversation context, proximity sensing, and temporospatial location throughout large communities of individuals. Mobile phones (and ... -
RL Competition 2008 - Home
Offsite —