Pete Skomoroch's Bookmarks

Showing 251 - 300 out of 375 datasets

Pete Skomoroch is President and Lead Consultant at Data Wrangling in Arlington, VA, a firm that specializes in mining large datasets to solve problems in search, finance, and recommendation systems.

He maintains an ever-expanding list of datasets (nearly 400 at last count), which have now been incorporated into the Infochimps repository.

  • Vehicle Routing Data Sets

    Offsite — A number of data sets gathered from various sources for the Capacitated Vehicle Routing Problem, and charts listing the characteristics of the various instances, as well as an optimal solution where one is known.
  • EIA - Petroleum Data, Reports, Analysis, Surveys

    Offsite — Find statistics on crude oil, gasoline, diesel, propane, jet fuel, ethanol, and other liquid fuels, and information on petroleum prices, crude reserves and production, refining and processing, imports/exports, stocks, and consumption/sales.
  • Document Metadata Based on a Sample of Web Documents from the Open Directory

    Offsite — DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from the social bookmarking service, the content rating system ICRA, and the search engine Google. The data set is freely available for other research. Michael G. Noll
  • CMU Machinelearning Project

    Offsite — Links to several data sets including: The StarPlus fMRI page contains data, software, and documentation for the StarPlus fMRI data set. The Berkeley Segmentation Dataset and Benchmark contains a collection of 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects. Half of the segmentations were obtained from presenting the subject ...
  • Carnegie Mellon University - CMU Graphics Lab - Motion Capture Library

    Offsite — The database contains human motions captured with infrared cameras that you can download and use. The categories of motions include human interaction, interaction with environment, physical activities, sports, and test motions. The motion capture data may be copied, modified, or redistributed without permission.
  • Financial Forecast Center's Historical Economic and Market Data

  • Bureau of Labor Statistics Data

    Offsite — The Bureau of Labor Statistics of the U.S. Department of Labor is the principal Federal agency responsible for measuring labor market activity, working conditions, and price changes in the economy. Its mission is to collect, analyze, and disseminate essential economic information to support public and private decision-making. As an independent statistical agency, BLS ...
  • Browse Business Cycle Indicators Data

    Offsite — These series comprise the old BCI data set which was published in the Survey of Current Business. The BCI is a composite of indexes used to forecast changes in the direction of the overall economy of a country. They can be used to confirm or predict highs and lows of the business cycle. These same series are now published by the Conference Board in a periodical named ...
  • The Numbers Guy : Aspiring to Be the Wikipedia of Numbers

  • Social characteristics of the Marvel Universe

  • Word Lists Collection

    Offsite — The data is a smorgasbord of word lists, including spell check oriented word lists, an inflection database, parts of speech word list, jargon file word lists, the contents from Ispell, spell check dictionaries, tables that convert between American, British and Canadian spellings, and links to several other word lists.
  • ERS/USDA Data - International Macroeconomic Data Set

    Offsite — The International Macroeconomic Data Set provides data from 1969 through 2020 for real (adjusted for inflation) gross domestic product (GDP), population, real exchange rates, and other variables for the 190 countries and 34 regions that are most important for U.S. agricultural trade. The data presented are a key component of the USDA Baseline projections process, and can ...
  • State Agency Databases - GODORT

    Offsite — In every U.S. state and the District of Columbia, agencies are creating databases of useful information – information on businesses, licensed professionals, plots of land, even dates of fish stocking. Because some of this content is not findable through search engines, this wiki has been created by GODORT (Government Documents Round Table of the American Library ...
  • The 2000 U.S. Census: 1 Billion RDF Triples

  • See Who's Editing Wikipedia - Diebold, the CIA, a Campaign

  • Dataset Generator - Perfect data for an imperfect world.

    Offsite — This site hosts a computer program that produces data. The intended use of the program is to help with the empirical analysis of other programs, particularly those that consume data. For example, it can produce data to test sorting programs.
  • National Bureau of Economic Research: Data

    Offsite — Founded in 1920, the National Bureau of Economic Research is a private, nonprofit, nonpartisan research organization dedicated to promoting a greater understanding of how the economy works. The NBER is committed to undertaking and disseminating unbiased economic research among public policy makers, business professionals, and the academic community. The Bureau ...
  • Entree Chicago Recommendation Data

    Offsite — This data contains a record of user interactions with the Entree Chicago restaurant recommendation system. This is an interactive system that recommends restaurants to the user based on factors such as cuisine, price, style, atmosphere, etc. or based on similarity to a restaurant in another city (e.g. find me a restaurant similar to the Patina in Los Angeles). The user ...
  • Community Resource Guide: New York City

    Offsite — A guide to New York City community-based data.
  • Social Science Data on the Net

    Offsite — Links to a large variety of social science data. A sample list of information includes: Public Use Microdata Samples (PUMS), African Population Database, San Diego / Tijuana Border Maps and Graphs, UNICEF Statistics, and Environmental Treaties and Resource Indicators (ENTRI).
  • Federal Highway Administration - Bridges - NBI ASCII Files

    Offsite — Recording and coding guide for the structure inventory and appraisal of the nation's bridges.
  • List of films: A - Wikipedia, the free encyclopedia

    Offsite — This is an alphabetical list of film articles (or sections within articles about films), beginning at A. It includes made-for-television films.
  • The arXiv on your harddrive

  • Sunlight Foundation

    Offsite — The Sunlight Foundation is a non-profit, nonpartisan organization that uses the power of the Internet to catalyze greater government openness and transparency, and provides new tools and resources for media and citizens alike. They are committed to improving access to government information by making it available online, indeed redefining “public” information as meaning ...
  • Technophilia: Where to find public records online - Lifehacker

    Offsite — A list of sources that can aid you in finding public records. All of the web sites and methods of discovery included are absolutely free, unless stated otherwise.
  • Enron Email Dataset


  • US Maps and Data from

    Offsite — Download GIS, geodata and map data for state, county and local governments.
  • CIA World Factbook Grep in Python

    Offsite — This project is a Python script that parses the CIA’s World Factbook, searches for a specific property, like population, and prints this property for all countries in the database. Type “capital” and you’ll get a list of countries and capitals. It works for almost all properties specified in the CIA dataset. The program makes a good start for any CIA data ...
  • Richard Nixon - Presidential Recordings

    Offsite — Between February 16, 1971 and July 18, 1973 Richard Nixon secretly recorded roughly 3,700 hours of conversations and meetings in five different locations. With the exception of the manually-operated equipment in the Cabinet Room, Nixon’s recording system was sound-activated and recorded a wide range of conversations of varying audio and substantive quality. The original ...
  • Deborah Jeane Palfrey Legal Defense Fund

  • UC San Diego Data Mining Competition - 2007 - Datasets

  • package - MoinMaster

  • Retail Industry Financial Ratios & Benchmarks

  • POI Factory | New and Interesting Places for your GPS

    Offsite — POI Factory is a web site where GPS users get together to share interesting locations. POI files contain coordinates and descriptions for a set of locations (which are commonly referred to as “POI” or “Points of Interest”). Many newer GPS models are able to use this data to show what’s nearby or even give an alert as you get close to a location. Some of the most ...
  • GPS Forums - Point of Interest Indexes

    Offsite — A forum focused on Points of Interest from all over the world. Indexes include Speed Traps and Cameras, Chain Stores, Lodging, Restaurants, Dealerships, Gas Stations, Banks, Sports, and Holidays/Leisure, all downloadable to various brands of car navigation systems.
  • Collective Dynamics Group

  • Jester Jokes and Joker Recommender System Ratings

    Offsite — Jester uses a collaborative filtering algorithm called Eigentaste to recommend jokes to you based on your ratings of previous jokes. Three datasets: Dataset 1 contains over 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes, that’s right…jokes, from 73,421 users collected between April 1999 – May 2003; Dataset 2 contains over 1.7 million continuous ratings ...
  • TricTrac: Video Dataset

    Offsite — TRICTRAC is a 3-year project in the field of image processing. The goal is the development of algorithms for the tracking of objects in real time in one or more live video streams. TRICTRAC is a joint project between the INTELSIG group at the Montefiore Institute of the Université de Liège, the tele group at the Université Catholique de Louvain and the research center ...
  • Premium Business Information Databases - AlacraWiki

    Offsite — A list of hundreds of databases related to business information.
  • The SEC's EDGAR Database

    Offsite — This directory contains raw data from the U.S. Securities and Exchange Commission’s EDGAR (Electronic Data-Gathering, Analysis, and Retrieval) database. Since 1934, the SEC has required disclosure in forms and documents. In 1984, EDGAR began collecting electronic documents to help investors get information. Format: This directory contains XML-transformed SEC feeds ...
  • E-Mail Index

    Offsite — This directory contains 1,039 messages from the archives that illustrate what the traffic going into a large web service looks like. The full archive is fairly massive, containing over 50 megabytes of ASCII text. In addition, there are another few hundred megabytes of mail files stashed away in other places, messages such as the massive mailbombing of Santa Claus ...
  • Anthracite Idioms

    Offsite — Common “idioms” used frequently in Anthracite. There are many ways to use Anthracite to accomplish web mining tasks.
  • Advance Monthly Sales for Retail and Food Services

    Offsite — Monthly Retail Sales and Seasonal Factors from 1992 – 2006.
  • NIST: Topic Detection and Tracking (TDT)

    Offsite — Topic Detection and Tracking research was pursued under the DARPA Translingual Information Detection, Extraction, and Summarization (TIDES) program, of which it is an integral part. The goal of the TIDES program is to enable English-speaking users to access, ...
  • Volume of Retail Sales in Great Britain: Social Trends 33

    Offsite — The volume of retail sales in Great Britain has increased steadily over the last decade and this continued in 2001 when the annual average increased by 6 per cent in comparison to 2000. Retail sales follow a strong seasonal pattern and peak in December of each year. The weekly average in December 2001 was just over a third higher than the average for the year as a whole. ...

  • U.S. Company Filings and Annual Reports

    Offsite — These resources were selected for their authority, ease of use, and accessibility. If you need more assistance, please contact the Watson Library research librarians. Additional resources can be found in CLIO, Columbia University’s library catalog. For news and research tips from the Watson Library, please visit the Watson Library Blog.
  • FTP Information - EDGAR Database

    Offsite — Information for FTP Users: The Securities & Exchange Commission’s File Transfer Protocol (FTP) server for EDGAR filings allows comprehensive access to the SEC’s EDGAR (Electronic Data Gathering, Analysis, and Retrieval system) filings by corporations, funds, and individuals. These filings are disseminated to the public through the EDGAR Dissemination Service, currently ...
  • Data Mining For Investing