20 Newsgroups Dataset (De-Duped Version)
Overview
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is believed to have been collected originally by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection there. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
Note: This version applies additional processing to the original data set available at http://people.csail.mit.edu/jrennie/20Newsgroups/
Format
1. The category and base file name were combined for each document to create a unique document id.
2. Every newline was replaced with a single whitespace character.
3. Documents were concatenated into one file.
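The three steps above can be sketched in Python. This is only an illustration, not the script actually used to build this file. It assumes the original layout of one directory per newsgroup containing one file per article, and it assumes a tab as the separator between the id and the text, since the page does not state the delimiter. The function names, the `sep` parameter, and the Latin-1 input encoding are all hypothetical choices.

```python
from pathlib import Path


def make_record(category, base_name, raw_text):
    """Steps 1 and 2: build the document id and flatten newlines."""
    doc_id = f"{category}.{base_name}"
    # Assumption: a CRLF pair counts as one newline. Each newline then
    # becomes exactly one space.
    text = raw_text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", " ")
    return doc_id, text


def iter_records(root):
    """Yield (doc_id, text) for every file under root/<category>/<file>."""
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue
        for path in sorted(category_dir.iterdir()):
            if path.is_file():
                # Assumption: Latin-1 decodes any byte sequence without error.
                raw = path.read_text(encoding="latin-1")
                yield make_record(category_dir.name, path.name, raw)


def write_combined(root, out_path, sep="\t"):
    """Step 3: write all documents to one file, one record per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc_id, text in iter_records(root):
            out.write(f"{doc_id}{sep}{text}\n")
```

Calling `write_combined("20news", "20news_combined.tsv")` (both paths hypothetical) would produce one line per article. Article text that itself contains a tab would need escaping under this assumed separator.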
Fields
| Short name | Type | Description |
|---|---|---|
| document_id | string | A unique identifier for the document of the form ‘category’.‘base file name’, for example alt.atheism.53536. See the original source above. |
| text | string | The full text of the article itself, with every newline replaced by a single whitespace character |
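As a rough sketch of how a record with these two fields might be read back, the Python below splits a line into `document_id` and `text` and recovers the category from the id. The tab separator and the function names are assumptions, not something this page specifies. Splitting on the last dot relies on the base file name containing no dots, which holds for the ids shown in the snippet.

```python
def split_document_id(document_id):
    """Split 'category.base_file_name' at the last dot.

    Category names contain dots themselves (e.g. alt.atheism), so the
    split is taken from the right.
    """
    category, _, base_name = document_id.rpartition(".")
    return category, base_name


def parse_line(line, sep="\t"):
    """Parse one record; `sep` is an assumed delimiter."""
    document_id, _, text = line.rstrip("\n").partition(sep)
    category, base_name = split_document_id(document_id)
    return {
        "document_id": document_id,
        "category": category,
        "base_name": base_name,
        "text": text,
    }
```

Because `partition` splits only at the first separator, any later tabs stay inside `text` rather than creating extra fields.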
Snippet
| document_id | text |
|---|---|
| alt.atheism.53536 | From: kmr4@po.CWRU.edu (Keith M. Ryan) Subject: Re: Smullyanism for the day….. In article (…snip…) |
| alt.atheism.51164 | From: mccullou@snake2.cs.wisc.edu (Mark McCullough) Subject: Re: Idle questions for fellow a (…snip…) |
| alt.atheism.53448 | From: sandvik@newton.apple.com (Kent Sandvik) Subject: Age of Reason Was: Who has read Rushd (…snip…) |
| alt.atheism.53753 | Subject: Re: The Inimitable Rushdie From: kmagnacca@eagle.wesleyan.edu In article <115621@b (…snip…) |
| alt.atheism.53290 | From: Andrew Newell Subject: Re: Christian Morality is In article Subject: Re: Amusing atheists and agnostics >DA (…snip…) |
| alt.atheism.51255 | From: keith@cco.caltech.edu (Keith Allan Schneider) Subject: Re: <Political Atheists? mathe (…snip…) |
| alt.atheism.53535 | From: healta@saturn.wwc.edu (Tammy R Healy) Subject: getting to the point! To all a.a reade (…snip…) |
Stats
| Added by | Ganglion |
|---|---|
| Created | about 1 year ago |
| Updated | about 1 year ago |
