Category

Technology

  • Open-Of-Course

    Offsite — Open-Of-Course is a multilingual, interactive portal for open content courses and tutorials. It is built on Moodle, a free software e-learning environment, and people are welcome to add their own open educational content to the system.
  • TAGora » Integrated IMDB and Netflix Dataset

    Offsite — To support the investigation of communal data structures, such as folksonomies, in the context of recommendation, we have created a large knowledge base about movies and how users rate movies. To achieve this, a large portion of the Internet Movie Database (IMDB) was downloaded to provide information about movies, actors and production personnel, as well as a large set ...
  • CMU Machinelearning Project

    Offsite — Links to several data sets, including: StarPlus fMRI, which contains data, software and documentation for the StarPlus fMRI data set; and the Berkeley Segmentation Dataset and Benchmark, which contains a collection of 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects. Half of the segmentations were obtained from presenting the subject ...
  • 20 Newsgroups Dataset (De-Duped Version)

    Free Download — The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is speculated that it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a ...
  • The SEC's EDGAR Database

    Offsite — This directory contains raw data from the U.S. Securities and Exchange Commission's EDGAR (Electronic Data-Gathering, Analysis, and Retrieval) database. Since 1934, the SEC has required disclosure in forms and documents. In 1984, EDGAR began collecting electronic documents to help investors get information. Format: This directory contains XML-transformed SEC feeds ...
  • E-Mail Index

    Offsite — This directory contains 1,039 messages from the archives that illustrate what the traffic going into a large web service looks like. The full archive is fairly massive, containing over 50 megabytes of ASCII text. In addition, there are another few hundred megabytes of mail files stashed away in other places, messages such as the massive mailbombing of Santa Claus ...
  • Features and Friends - Color spectra and Image Features of 20k user profile pictures with user stats

    Offsite — “Features_and_friends.csv” contains 33 image features for 19,217 MySpace profile pictures. Also included is the number of friends for each user in the sample. The columns are (roughly): n: number of brightness levels; pn: a measure ...
  • Image Parsing Datasets

    Offsite — Indeed, the “dataset issue” is a big challenge for every researcher who takes Computer Vision seriously. There are dozens of problems that remain unanswered, such as: How to build a general image database without bias to purpose? How to create a benchmark that reflects the real-world difficulty of image understanding? How to guarantee the correctness of annotation? ...
  • Web Content Extraction

    Offsite — The dataset contains the HTML version as well as the true content of a web page. True content is used to mean the text excluding the ads, navigational links/text, comments, etc. For example, for a blog post only the content of the post and not the comments and other surrounding text will be extracted. The dataset contains the HTML source and text content (true content) ...
  • Word List - 1000 Most Frequent Words from an Internet Corpus

    Free Download — This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.
  • Securities & Exchange Commission's Public Information Server

    Offsite — This server features SEC public documents, information of interest to the investing public, rule-making activities, and access to the Commission’s electronic filing database, EDGAR. The public will be able to query the EDGAR database for any company currently filing electronically with the SEC. These filings are updated 24 hours after they are filed with the ...
  • SEC Edgar Interactive Data Filings and RSS Feeds

    Offsite — You can search information collected by the SEC in several ways: company or fund name, ticker symbol, CIK (Central Index Key), file number, state, country, or SIC (Standard Industrial Classification); most recent filings; full text (past four years); Boolean and advanced searching, including addresses; key mutual fund disclosures; mutual fund voting records; mutual fund ...
  • Twitter Graph Metrics: StrongLinks

    No Data — The service for this API has ceased. Our apologies for the inconvenience this may cause. Twitter Strong Links measures a user’s connection potency across the social network. The API uses an innovative algorithm to assess the strength of any Twitter user’s relative connection to other Twitter users. In other words, the API measures how engaged a user is with other ...
  • SEC Corporate Ownership Linked Data, 2003-2006

    Offsite — This is a semantic web, RDF, linked-data, and SPARQL interface to U.S. corporate ownership information derived from filings to the U.S. Securities and Exchange Commission in its EDGAR database. There are three parts to this database: Part I: Individual Ownership via SEC forms 3, 4, 5; Part II: Subsidiary Information via 10-K Filings via CorpWatch; and Part III: Links to ...
  • Twitter Census: Smileys

    Free Download — Twitter smiley data from billions of tweets! This is a free download of Twitter data from March 2006 to November 2009. The data set consists of smileys, or emoticons that follow a convention similar to these examples: :-) ;-) :D, etc. The data comes from analysis on the full set of tweets during that time period, which is 40 million users, 1.6 billion tweets, and ...
  • Hostnames of Internet addresses suspected of SSH password authentication attacks

    Offsite — Dragon Research Group (DRG) sshpwauth report. Entries consist of fields with identifying characteristics of a source IP address that has been seen attempting to remotely log in to a host using SSH password authentication. This report lists hosts that are highly suspicious and are likely conducting malicious SSH password authentication attacks. Each entry is sorted ...
  • Internet Routing Registry

    Offsite — The Internet Routing Registry (IRR) is a distributed routing database development effort. Data from the Internet Routing Registry may be used by anyone worldwide to help debug, configure, and engineer Internet routing and addressing. The IRR provides a mechanism for validating the contents of BGP announcement messages or mapping an origin AS number to a list of networks.
  • TechTC - Technion Repository of Text Categorization Data Sets

    Offsite — The Technion Repository of Text Categorization Datasets provides a large number of diverse test collections for use in text categorization research.
  • GovTrack.us U.S. Congress Legislative Data

    Offsite — U.S. Congress data, including all Members of Congress since the beginning of the United States, and legislative data (bills, sponsorship, roll call votes) since around 1990.
  • XDXF - XML Dictionary Exchange Format

    Offsite — About: According to the website, XDXF is a project to unite all existing open dictionaries and provide both users and developers with a universal XML-based format, convertible to and from other popular dictionary formats. There are currently 308 dictionary files in various languages. Format: It appears dictionary files are in XML. Access/Re-use: The [SourceForge XDXF ...