Dataset
20 Newsgroups Dataset (De-Duped Version)
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is believed to have been collected originally by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though the paper does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning, such as text classification and text clustering.
Note: This version applies additional processing to the data set available at http://people.csail.mit.edu/jrennie/20Newsgroups/
Format
1. The category and base file name were combined for each document to create a unique document id.
2. Every newline was replaced with a single whitespace character.
3. All documents were concatenated into one file.
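The three steps above can be sketched in Python as follows. This is a minimal illustration, not the script that produced this file: the directory layout (`root/<category>/<base file name>`), the `latin-1` encoding, the tab separator, and all function names are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterator, Tuple


def make_document_id(category: str, base_name: str) -> str:
    """Step 1: combine category and base file name into one id."""
    return f"{category}.{base_name}"


def flatten_newlines(text: str) -> str:
    """Step 2: replace every newline with a single space."""
    return text.replace("\n", " ")


def build_records(root: Path) -> Iterator[Tuple[str, str]]:
    """Yield (document_id, text) for each file under root/<category>/ (assumed layout)."""
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for doc_path in sorted(p for p in category_dir.iterdir() if p.is_file()):
            # Encoding is an assumption; latin-1 never fails to decode.
            raw = doc_path.read_text(encoding="latin-1")
            yield make_document_id(category_dir.name, doc_path.name), flatten_newlines(raw)


def write_combined(root: Path, out_path: Path, sep: str = "\t") -> int:
    """Step 3: concatenate all documents into one file, one record per line.

    The separator between id and text is an assumption of this sketch.
    Returns the number of records written.
    """
    count = 0
    with out_path.open("w", encoding="latin-1", newline="\n") as out:
        for document_id, text in build_records(root):
            out.write(f"{document_id}{sep}{text}\n")
            count += 1
    return count
```

Because step 2 removes every newline from the article text, each record fits on a single line of the combined file, which is what makes step 3 a simple concatenation.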
Fields
| Short name | Type | Description |
|---|---|---|
| document_id | string | A unique identifier for the document of the form ‘category’.‘base file name’. See the original source above. |
| text | string | The full text of the article, with every newline replaced by a single whitespace character |
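A short sketch of how the two fields might be read back. The on-disk delimiter is not stated here, so the tab separator is an assumption, as are the names `Record`, `parse_line`, and `split_document_id`. Splitting the id on its last dot relies on the base file name containing no dot, which holds for the ids shown in this document but is not guaranteed.

```python
from typing import NamedTuple, Tuple


class Record(NamedTuple):
    document_id: str
    text: str


def parse_line(line: str, sep: str = "\t") -> Record:
    """Split one line into its two fields (separator is an assumption)."""
    document_id, text = line.rstrip("\n").split(sep, 1)
    return Record(document_id, text)


def split_document_id(document_id: str) -> Tuple[str, str]:
    """Recover (category, base file name) by splitting on the last dot.

    Category names contain dots themselves, so only the final dot is used.
    """
    category, _, base_name = document_id.rpartition(".")
    return category, base_name
```

For example, `split_document_id("alt.atheism.53536")` yields the category `alt.atheism` and the base file name `53536`.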
Snippet
| document_id | text |
|---|---|
| alt.atheism.53536 | From: kmr4@po.CWRU.edu (Keith M. Ryan) Subject: Re: Smullyanism for the day….. In article (…snip…) |
| alt.atheism.51164 | From: mccullou@snake2.cs.wisc.edu (Mark McCullough) Subject: Re: Idle questions for fellow a (…snip…) |
| alt.atheism.53448 | From: sandvik@newton.apple.com (Kent Sandvik) Subject: Age of Reason Was: Who has read Rushd (…snip…) |
| alt.atheism.53753 | Subject: Re: The Inimitable Rushdie From: kmagnacca@eagle.wesleyan.edu In article <115621@b (…snip…) |
| alt.atheism.53290 | From: Andrew Newell Subject: Re: Christian Morality is In article Subject: Re: Amusing atheists and agnostics >DA (…snip…) |
| alt.atheism.51255 | From: keith@cco.caltech.edu (Keith Allan Schneider) Subject: Re: <Political Atheists? mathe (…snip…) |
| alt.atheism.53535 | From: healta@saturn.wwc.edu (Tammy R Healy) Subject: getting to the point! To all a.a reade (…snip…) |