Parse.ly 30 million news headlines and summaries from 500K web sources (30M entries, >20GB data)
Overview
Since 2009, Parse.ly’s crawlers have fetched nearly 30M articles from thousands of sources across the web. This InfoChimps exclusive dump provides web application developers, computational linguists, researchers, and other interested parties with a slice of online news media that is unparalleled in recency, size, scope, completeness, and structure.
The structure of the database is straightforward. The articles collection contains the following fields: title, summary, link, pub_date.
Two further fields, source_link and source_name, describe the source the article came from: its link and, in many cases, a humanized name.
Additional metadata includes created and updated, which are timestamps from the RSS/Atom feeds. These can differ from pub_date, and can indicate whether a story was revised after publication.
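As a rough illustration of how these timestamps might be compared, here is a minimal Python sketch. The timestamps are those of the sample record in this listing; the function name `was_revised` and the one-minute tolerance are assumptions made for the example, not part of the dataset or of Parse.ly's own tooling.

```python
from datetime import datetime, timedelta

def was_revised(article, tolerance=timedelta(minutes=1)):
    """Guess whether a feed entry was revised after it was first seen.

    Assumption: a gap between 'updated' and 'created' larger than
    `tolerance` suggests a revision. The threshold is arbitrary.
    """
    return article["updated"] - article["created"] > tolerance

# Timestamps taken from the sample record in this listing.
article = {
    "created": datetime(2009, 9, 21, 6, 46, 28, 647000),
    "pub_date": datetime(2009, 9, 20, 16, 36, 54),
    "updated": datetime(2009, 9, 21, 6, 46, 28, 688000),
}

revised = was_revised(article)
# How long after publication the entry first appeared in the feed.
feed_lag = article["created"] - article["pub_date"]

print(revised)   # False: 'updated' is only 41 ms after 'created'
print(feed_lag)
```

A stricter or looser tolerance may suit other uses; the right value depends on how the feeds in question report their timestamps.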
Finally, identity information is provided: signature_hash and unique_id. The signature_hash is calculated based upon a “text signature” — you can group by these to eliminate dupes and near-duplicates. The unique_id is a short unique code we generated for link shortening purposes. This is usually more convenient than referencing a URL, which can be long.
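Grouping by signature_hash could look something like the following Python sketch. It keeps the earliest-published article in each group, which is only one possible policy. The first record uses the hash and unique_id of the sample record in this listing; the other two records, and the function name `dedupe`, are hypothetical and exist only to make the example runnable.

```python
from datetime import datetime

def dedupe(articles):
    """Keep one article per signature_hash: the one with the earliest pub_date."""
    earliest = {}
    for article in articles:
        key = article["signature_hash"]
        if key not in earliest or article["pub_date"] < earliest[key]["pub_date"]:
            earliest[key] = article
    return list(earliest.values())

articles = [
    # Hash and unique_id from the sample record in this listing.
    {"unique_id": "FeR3",
     "signature_hash": "af3fa290282ad1ba1fe68a3db25eee3c",
     "pub_date": datetime(2009, 9, 20, 16, 36, 54)},
    # Hypothetical near-duplicate: same signature, published later.
    {"unique_id": "XXX1",
     "signature_hash": "af3fa290282ad1ba1fe68a3db25eee3c",
     "pub_date": datetime(2009, 9, 20, 18, 0, 0)},
    # Hypothetical unrelated article with a made-up hash.
    {"unique_id": "XXX2",
     "signature_hash": "00000000000000000000000000000000",
     "pub_date": datetime(2009, 9, 21, 9, 0, 0)},
]

unique_articles = dedupe(articles)
print([a["unique_id"] for a in unique_articles])  # ['FeR3', 'XXX2']
```

For the full 30M-entry collection, the same grouping would more likely be done inside the database than in memory.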
An example article
{'created': datetime(2009, 9, 21, 6, 46, 28, 647000),
 'link': 'http://trueslant.com/johnkinsellagh/2009/09/20/obamas-health-care-media-blitz-continues/',
 'pub_date': datetime(2009, 9, 20, 16, 36, 54),
 'signature_hash': 'af3fa290282ad1ba1fe68a3db25eee3c',
 'source_link': 'trueslant.com',
 'source_name': 'True/Slant',
 'summary': " … from not only the White House, but from the Speaker of the House, is it any wonder that many Americans are confused about Obama's health care …",
 'title': "Obama's health care media blitz can't save his 'plan'",
 'unique_id': 'FeR3',
 'updated': datetime(2009, 9, 21, 6, 46, 28, 688000)}
There are almost 30 million entries like this one, ranging from 2009 to 2011. Here are 10 news headlines with sources chosen at random:
- Tesla to begin trading stock Tuesday (San Jose Mercury News)
- Palin Calls For Tort Reform (RealClearPolitics)
- A new look for fashion (Toronto Sun)
- Lynch holds Aug. 25 senior forum on healthcare (Wicked Local)
- PayPal Mobile 2.5 Covers 23,000 Charities (Coated)
- Google Ventures Hires an Entrepreneur-in-Residence (NYTimes Blog)
- Review: Lady Gaga enthralls her army of little monsters (Vancouver Sun)
- Supervisors vote to reopen MLK Hospital (ABC7)
- Hackers Put Social Networks Such as Twitter in Crosshairs (PCWorld)
- Federal judge OKs $925M UnitedHealth settlement (Associated Press)
Two more collections are included for completeness. “sources” provides a raw list of nearly 500,000 web sources that were crawled by Parse.ly. This has both the URLs and the humanized names (when available). This could be used as the basis for a modern online news crawler, to give a “seed set” of crawl URLs. Here are some example sources:
- { "url" : "www.nytimes.com", "name" : "New York Times" }
- { "url" : "freakonomics.blogs.nytimes.com", "name" : "New York Times (blog)" }
- { "url" : "norris.blogs.nytimes.com", "name" : "New York Times (blog)" }
- { "url" : "www.salon.com", "name" : "Salon.com" }
- { "url" : "www.slate.com", "name" : "Slate" }
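To show how the sources list might feed a crawler, here is a small Python sketch that turns source records into a de-duplicated seed set of crawl URLs. The function name `seed_urls` is hypothetical, and prepending `http://` is an assumption: the records shown above carry bare hostnames with no scheme.

```python
def seed_urls(sources, scheme="http"):
    """Build a de-duplicated list of crawl URLs from source records.

    Assumption: each record's "url" is a bare hostname, as in the
    examples above, so a scheme has to be added.
    """
    seen = set()
    urls = []
    for source in sources:
        host = source.get("url", "").strip().lower()
        if not host or host in seen:
            continue
        seen.add(host)
        urls.append(f"{scheme}://{host}/")
    return urls

# Records copied from the example sources above, plus one repeat.
sources = [
    {"url": "www.nytimes.com", "name": "New York Times"},
    {"url": "freakonomics.blogs.nytimes.com", "name": "New York Times (blog)"},
    {"url": "www.salon.com", "name": "Salon.com"},
    {"url": "www.nytimes.com", "name": "New York Times"},
]

seeds = seed_urls(sources)
for url in seeds:
    print(url)
```

Since names are only present "when available", the sketch reads only the url field and does not rely on name.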
Finally, the “interests” collection is a set of user interests that have been cleaned and anonymized for research purposes. Parse.ly Reader allowed users to enter free-form interests, and the list of these is included. Interests with 2 or fewer hits are excluded, since these are noisy. The interests are organized under keys created from the terms by lowercasing them and applying a stemming algorithm, which groups close variants together. For example, “green technology”, “Green technology” and “green technologies” are all grouped under the label “green technology”. (This particular example has 9 hits in our database.) Also preserved are “rank” values, each one of “Most”, “Extremely”, “Very”, “Moderately”, “Somewhat” or “Unranked”. A rank is nothing more than a qualitative indication of the importance of the interest to that individual user, and may be useful for separating top-of-mind interests from background ones.
Example interest:
{ "label" : "pet",
  "hits" : 4,
  "entries" : [
    { "term" : "pets", "rank" : "Most" },
    { "term" : "pets", "rank" : "Extremely" },
    { "term" : "Pets", "rank" : "Very" },
    { "term" : "pets", "rank" : "Most" } ]
}
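The grouping described above (lowercase, stem, then drop labels with 2 or fewer hits) can be sketched in Python as follows. The stemming algorithm actually used is not stated here, so `naive_stem` is a toy plural-stripper written for this example and will not reproduce the real labels in general. The function names and the "knitting" entry are likewise hypothetical.

```python
from collections import defaultdict

def naive_stem(word):
    """Toy stemmer (an assumption, not the dataset's algorithm): strips simple plurals."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def interest_key(term):
    """Lowercase the term and stem each word to form a grouping key."""
    return " ".join(naive_stem(word) for word in term.lower().split())

def group_interests(entries, min_hits=3):
    """Group entries by key, dropping groups with fewer than min_hits entries."""
    groups = defaultdict(list)
    for entry in entries:
        groups[interest_key(entry["term"])].append(entry)
    return [
        {"label": label, "hits": len(items), "entries": items}
        for label, items in groups.items()
        if len(items) >= min_hits
    ]

entries = [
    {"term": "pets", "rank": "Most"},
    {"term": "pets", "rank": "Extremely"},
    {"term": "Pets", "rank": "Very"},
    {"term": "pets", "rank": "Most"},
    {"term": "knitting", "rank": "Somewhat"},  # hypothetical one-off, gets dropped
]

interests = group_interests(entries)
print(interests[0]["label"], interests[0]["hits"])  # pet 4
print(interest_key("Green technologies"))           # green technology
```

A production pipeline would more plausibly use an established stemmer, at the cost of labels that are word stems rather than readable words.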
Unsurprisingly, given the initial audience for the Parse.ly Reader, the two most popular interests are “technology” (365 hits) and “startups” (228). The content itself spans a wide range of topics (general news, technology, business, politics, entertainment, health, science, niche interests, you name it). Unlike general web content databases, you can be sure that almost all of these entries point to long-form article content.
In terms of format, the data is provided as 3 JSON files, which were exported from a MongoDB store using the mongoexport command. Therefore, it is simple to create 3 MongoDB collections (articles, sources, interests) which make this data instantly queryable and usable in a programming context. If using MongoDB is not an option, the JSON files can also be easily parsed using any of the client libraries listed at http://www.json.org/
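For readers going the non-MongoDB route, the following Python sketch reads mongoexport-style output using only the standard library. It assumes one JSON document per line (mongoexport's default layout) and that dates appear as `{"$date": <milliseconds since the epoch>}`; the exact date representation varies between MongoDB versions, so check the actual files first. The file name in the closing comment, the function names, and the sample lines are all hypothetical.

```python
import json
from datetime import datetime, timezone

def decode_dates(obj):
    """json object_hook: convert {"$date": <ms since epoch>} into a datetime.

    Assumption: dates were exported as integer milliseconds and are UTC.
    Other representations are passed through unchanged.
    """
    if set(obj) == {"$date"} and isinstance(obj["$date"], (int, float)):
        return datetime.fromtimestamp(obj["$date"] / 1000, tz=timezone.utc)
    return obj

def iter_documents(lines):
    """Yield one parsed document per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line, object_hook=decode_dates)

# Hypothetical sample lines in the assumed export layout.
sample_lines = [
    '{"url": "www.nytimes.com", "name": "New York Times"}',
    '',
    '{"unique_id": "FeR3", "pub_date": {"$date": 1253464614000}}',
]

docs = list(iter_documents(sample_lines))
print(docs[0]["name"])
print(docs[1]["pub_date"].isoformat())

# With a real export, the same generator can stream a file line by line:
# with open("articles.json", encoding="utf-8") as f:   # hypothetical file name
#     for doc in iter_documents(f):
#         ...
```

Streaming line by line matters here because the article file is far too large to load with a single `json.load` call.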
If you’ve made it this far, you’re probably interested. So, you can @ me on Twitter (http://twitter.com/amontalenti) if you have any questions or have feedback about the pricing, structure, etc. And of course, check out http://parse.ly if you are the type of engineer interested in this corpus — we may be hiring :)
Happy hacking!
NOTE: The actual size of this complete dataset is over 20GB, but the files have been split into 10 packages of ~2GB each. By purchasing this dataset you are purchasing the full 20GB of data, but you will initially only be able to download the first package through the website. To download the other 9, please contact us and we will provide you access.
Stats
| Added by: | amontalenti | |
|---|---|---|
| Created: | 10 months ago | |
| Updated: | 4 months ago | |
