12 datasets
  • Mushroom Data Set

    Offsite — This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no ...
  • Hunch API

    Offsite — RESTful API for programmatically accessing Hunch, a questions and answers service that harnesses collective knowledge to offer solutions to user-entered problems. Hunch is designed so that every time it’s used, it learns something new. Query for questions, responses, topics, search results and categories as well as statistics pertaining to THAY (Teach Hunch About You) ...
  • Trust Network Datasets

    Offsite — Collection of network datasets in which there are entities (people, robots) and some social relationship connecting two of these entities. From TrustLet, a cooperative environment for the scientific research of trust metrics on social networks. Released datasets are licensed under Creative Commons Attribution 3.0.
  • Failure Trace Archive

    Offsite — The Failure Trace Archive (FTA) is a centralized public repository of availability traces of parallel and distributed systems, and tools for their analysis. The purpose of this archive is to facilitate the design, validation and comparison of fault-tolerant models and algorithms. Submitted by: Jeremy Cowles
  • UCI Machine Learning Repository

    Offsite — Repository of data sets for machine learning research.
  • UCI KDD Archive

    Offsite — Repository of large data sets for knowledge discovery research. See also: UCI Machine Learning Repository.
  • Netflix Prize

    Offsite — “The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.” Register to download the dataset.
  • OpenCyc API

    Offsite — Programmatic access to the open source version of the Cyc Knowledge Base, the world’s largest and most complete general knowledge base and commonsense reasoning engine. The Cyc Knowledge Base is a formalized representation of a vast quantity of fundamental human knowledge: facts, rules of thumb, and heuristics for reasoning about the objects and events of everyday ...
  • UMass Amherst Linguistics Sentiment Corpora

    Offsite — N-gram counts extracted from over 700,000 online product reviews in Chinese, English, German and Japanese. Formatted to be read as R data frames. By Noah Constant, Christopher Davis, Christopher Potts and Florian Schwarz. Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
  • Splog Blog Dataset

    Offsite — Dataset of 3,000 blog homepages, of which 700 have been labeled as spam-blogs or splogs and another 700 as authentic blogs.
  • StatLib Datasets Archive

    Offsite — Collection of datasets from books and articles via Carnegie Mellon University’s StatLib community.
  • Twitter Spammers List

    Free Download — A flat text list of human-classified spam accounts from http://twitter.com. Fields: twitter_user_screen_name (Twitter screen name of the spam account). Source(s): http://www.writing.com/main/view_item/item_id/1618035-Twitter-Spammers