Tag: huge

Showing 1 - 20 out of 26 datasets
  • Twitter Census - Conversation Metrics: One year of URLs, Hashtags, Smileys usage (monthly)

    Free Download — Twitter data from millions of tweets! This is a download of Twitter data from March 2006 to November 2009. The data set consists of “tokens,” which are hashtags (#data), URLs, or emoticons (Twitter smileys or other “faces” created using keyboard characters). The data comes from analysis on the full set of tweets during that time period, which is 35 million users, over ...
  • Twitter Census - Conversation Metrics: One Year of URLs, Hashtags, Smileys Usage (by Hour)

    Free Download — Twitter data from millions of tweets! This is a download of Twitter data from March 2006 to November 2009. The data set consists of “tokens,” which are hashtags (#data), URLs, or emoticons (Twitter smileys or other “faces” created using keyboard characters). The data comes from analysis on the full set of tweets during that time period, which is 40 million users, 1.6 ...
  • Twitter Census - Conversation Metrics: One year of URLs, Hashtags, Smileys usage (Smiley Counts)

    Free Download — Twitter smiley data from millions of tweets! This is a free download of Twitter data from March 2006 to November 2009. The smiley data comes from analysis on the full set of tweets during that time period, which is 35 million users, over 500 ...
  • Freebase Wikipedia Extraction (WEX)

    Offsite — The Freebase Wikipedia Extraction (WEX) is a processed dump of the English-language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. Freebase WEX is provided as a set of database tables in TSV format ...
  • NYSE Daily 1970-2010 Open, Close, High, Low and Volume

    Free Download — Historical NYSE stock data from 1970 – 2010, including daily open, close, low, high and trading volume figures. Data is organized alphabetically by ticker symbol. Tickers are filed in spreadsheets titled with the corresponding letter of the alphabet (for example, Daily Prices for CAE appear in the NYSE_daily_prices_C file, and Dividends for CAE appear in the ...
  • NASDAQ Exchange Daily 1970-2010 Open, Close, High, Low and Volume

    Free Download — Historical NASDAQ stock data from 1970 – 2010, including daily open, close, low, high and trading volume figures. Data is organized alphabetically by ticker symbol. Tickers are filed in spreadsheets titled with the corresponding letter of the alphabet (for example, Daily Prices for HNZ appear in the NASDAQ_daily_prices_H file, and Dividends for HNZ appear in the ...
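
The NYSE and NASDAQ entries both file tickers by first letter. As a minimal sketch of that rule, assuming the file stems follow the two examples given (NYSE_daily_prices_C and NASDAQ_daily_prices_H), the helper below maps a ticker to the stem of its daily-prices file. The function name `daily_prices_stem` is hypothetical, and the file extension and the dividends naming are not stated in the listing, so neither is guessed here.

```python
def daily_prices_stem(exchange: str, ticker: str) -> str:
    """Return the stem of the daily-prices file expected to hold `ticker`.

    Assumption: stems look like <EXCHANGE>_daily_prices_<FIRST LETTER>,
    as in the listing's examples. The extension is left out on purpose.
    """
    ticker = ticker.strip().upper()
    if not ticker or not (ticker[0].isascii() and ticker[0].isalpha()):
        raise ValueError(f"ticker must start with a letter: {ticker!r}")
    return f"{exchange.strip().upper()}_daily_prices_{ticker[0]}"


print(daily_prices_stem("NYSE", "CAE"))    # NYSE_daily_prices_C
print(daily_prices_stem("NASDAQ", "HNZ"))  # NASDAQ_daily_prices_H
```

Check the stem against the actual file names in the download before relying on it.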
  • Retrosheet: Game Logs (box scores) for Major League Baseball Games

    Offsite — Retrosheet baseball data for Major League games played from 1871 – 2008. Retrosheet provides a listing of the date and score of each game. Baseball data records may include team statistics, winning and losing pitchers, linescores, attendance, starting pitchers, umpires and more. There are 161 fields in each record, described in more detail in the Guide to Retrosheet ...
  • Global Daily Weather Data from the National Climate Data Center (NCDC)

    Offsite — Weather data provided by the National Climate Data Center (NCDC). The downloads available include Global Surface Summary of Day (GSOD) data provided by the NOAA division, National Climate Data Center. Explore the download for weather data, temperature data and more. You can fetch your own copy with wget -r -l3 --no-clobber --no-parent --no-verbose -a /tmp/wget_log.log ...
  • DBPedia Main

    Offsite — DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies. The knowledge base consists of 274 million pieces of ...
  • Austin Daily Weather (extracted from National Climate Data Center (NCDC) Data)

    Free Download — About: This is an extract from the “Global Daily Weather Data from the National Climate Data Center (NCDC)” dataset covering only Austin. Contents: There are several files in the package. austin_daily_weather.tsv contains daily weather from the operational weather station closest to Austin (Mueller Airport 1948-1999, Bergstrom for the last part of 1999, and Camp Mabry 2000-2009). ...
  • Retrosheet MLB Park IDs

    Free Download — Most of the Retrosheet data uses a Park ID in place of the name of the field. This dataset resolves the park ID to a field name. Format: Column headers are in the first row: PARKID|NAME|CITY|STATE|START DATE|END DATE|LEAGUE. License: The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at ...
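
A pipe-delimited file with a header row, as described for the Park IDs dataset, can be read with Python's standard csv module. The sketch below assumes only the column names listed in the entry. The sample rows, including the park codes and the date format, are invented placeholders rather than real Retrosheet data, and `load_parks` is a hypothetical helper name.

```python
import csv
import io

# Invented sample in the documented column layout; not real Retrosheet data.
SAMPLE = (
    "PARKID|NAME|CITY|STATE|START DATE|END DATE|LEAGUE\n"
    "XXX01|Example Field|Sampletown|TX|01/01/1900|12/31/1910|NL\n"
    "XXX02|Another Park|Otherville|NY|04/15/1920||AL\n"
)


def load_parks(handle):
    """Map each PARKID to its row, using the first line as column headers."""
    reader = csv.DictReader(handle, delimiter="|")
    return {row["PARKID"]: row for row in reader}


parks = load_parks(io.StringIO(SAMPLE))
print(parks["XXX01"]["NAME"])  # Example Field
```

To read the real file, pass an open file object instead of the in-memory sample, for example `open(path, newline="")`, where the path is whatever the download provides.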
  • Amazon Web Services Public Datasets » Data Wrangling Blog

    Offsite
  • A Million Syllabi

    Free Download — A data set of over a million syllabi gathered by Dan Cohen’s Syllabus Finder tool from 2002 to 2009. It could be the largest collection of syllabi ever gathered by several orders of magnitude. See a more detailed description on Dan Cohen’s blog. Format: Data are formatted as JSON records separated by newlines. Caution: this data is messy and comes with no warranty.
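
Because the syllabi are newline-separated JSON records and the entry warns that the data is messy, a tolerant reader that skips lines it cannot parse is a reasonable starting point. This is a sketch under assumptions: the listing does not describe the record schema, so the `title` field in the sample is made up, and `iter_records` is a hypothetical helper name.

```python
import io
import json


def iter_records(handle):
    """Yield (line_number, record) for each line that parses as a JSON object."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole read
        if isinstance(record, dict):
            yield line_number, record


# Made-up sample; the "title" field is an assumption, not the real schema.
SAMPLE = io.StringIO(
    '{"title": "Intro to Examples"}\n'
    "not json at all\n"
    "\n"
    '{"title": "Advanced Examples"}\n'
)
records = list(iter_records(SAMPLE))
print(len(records))  # 2
```

Keeping the line number with each record makes it easier to go back and inspect the lines that were skipped.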
  • FreeBase

    Offsite — Description: “Freebase is an open database of the world’s information. It is built by the community and for the community—free for anyone to query, contribute to, build applications on top of, or integrate into their websites.” Openness: OPEN. License: cc-by + GFDL for the (large) Wikipedia-derived part. Access: ok but no bulk (perhaps via their query engine API but ...
  • The Open Library

    Offsite — About: One web page for every book ever published. It’s a lofty, but achievable, goal. To build it, we need hundreds of millions of book records, a brand new database infrastructure for handling huge amounts of dynamic information, a wiki interface, multi-language support, and people who are willing to contribute their time, effort, and book data. To date, we ...
  • Westbury Lab Usenet Corpus: 28M postings from 47000+ newsgroups 2005-2009

    Offsite — A USENET corpus (2005-2009): This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2010, and covers 47,860 English-language, non-binary-file newsgroups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be ...
  • Stack Overflow Data Dump - Posts, Comments, Users, Votes & Badges

    Offsite — Stack Overflow Creative Commons Data Dump: We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license. All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki. You are free: to Share (to copy, distribute, and transmit the work); to Remix: to ...
  • ICPSR (Inter-university Consortium for Political & Social Research): 500,000 data sets metaindex

    Offsite — ICPSR offers more than 500,000 digital files containing social science research data. Disciplines represented include political science, sociology, demography, economics, history, gerontology, criminal justice, public health, foreign policy, ...
  • Freebase Data Dump

    Offsite — Freebase data dumps provide all of the current facts and assertions within the Freebase system. The data dumps are complete, general-purpose extracts of the Freebase data in a variety of formats. Freebase releases a fresh data dump every three months. Freebase is an open database of the world’s information, covering millions of topics across hundreds of categories. ...
  • Google Books Ngrams

    Offsite — Description: Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set). Each of the links will directly download a fragment of the given corpus. For ...