Dataset added by Ganglion
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is speculated that it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection there. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
Note: This version was derived from the original distribution at http://people.csail.mit.edu/jrennie/20Newsgroups/ with the following additional processing:
1. The category and base file name were combined for each document to create a unique document id.
2. Every newline was replaced with a single whitespace character.
3. All documents were concatenated into one file.
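The three steps above can be sketched in Python as follows. This is a minimal illustration, not the script actually used to build this data set. It assumes the original distribution has been unpacked into one directory per newsgroup with one file per article; the function names, the `latin-1` input encoding, and the tab-separated output format are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterator, Tuple


def make_document_id(category: str, base_name: str) -> str:
    # Step 1: combine the category and base file name into a unique id.
    return f"{category}.{base_name}"


def flatten_newlines(text: str) -> str:
    # Step 2: replace every newline with a single space.
    return text.replace("\n", " ")


def build_rows(root: Path) -> Iterator[Tuple[str, str]]:
    # Assumed layout: root/<category>/<base file name>
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for doc_path in sorted(p for p in category_dir.iterdir() if p.is_file()):
            # latin-1 is an assumption; it decodes any byte sequence without error.
            text = doc_path.read_text(encoding="latin-1")
            yield make_document_id(category_dir.name, doc_path.name), flatten_newlines(text)


def write_combined(root: Path, out_path: Path) -> int:
    # Step 3: concatenate all documents into one file, one document per line.
    # The tab separator is an assumption; articles that contain tabs would
    # need escaping in a real pipeline.
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for doc_id, text in build_rows(root):
            out.write(f"{doc_id}\t{text}\n")
            count += 1
    return count
```

Calling `write_combined(Path("20news"), Path("combined.tsv"))`, with both paths hypothetical, would write one line per article and return the number of documents written.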
|Field||Type||Description|
|document_id||string||A unique identifier for the document, of the form ‘category’.‘base file name’. See the original source above.|
|text||string||The full text of the article itself, with all newlines replaced by whitespace.|
|document_id||text|
|alt.atheism.53536||From: kmr4@po.CWRU.edu (Keith M. Ryan) Subject: Re: Smullyanism for the day….. In article (…snip…)|
|alt.atheism.51164||From: email@example.com (Mark McCullough) Subject: Re: Idle questions for fellow a (…snip…)|
|alt.atheism.53448||From: firstname.lastname@example.org (Kent Sandvik) Subject: Age of Reason Was: Who has read Rushd (…snip…)|
|alt.atheism.53753||Subject: Re: The Inimitable Rushdie From: email@example.com In article <115621@b (…snip…)|
|alt.atheism.53290||From: Andrew Newell Subject: Re: Christian Morality is In article Subject: Re: Amusing atheists and agnostics >DA (…snip…)|
|alt.atheism.51255||From: firstname.lastname@example.org (Keith Allan Schneider) Subject: Re: <Political Atheists? mathe (…snip…)|
|alt.atheism.53535||From: email@example.com (Tammy R Healy) Subject: getting to the point! To all a.a reade (…snip…)|
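Because each document_id has the form ‘category’.‘base file name’, the newsgroup label can be recovered from the id alone. The sketch below shows one way to do this; the function name is hypothetical, and it assumes the base file name contains no dot, which holds for the numeric names in the sample rows above.

```python
from typing import Tuple


def split_document_id(document_id: str) -> Tuple[str, str]:
    # Split on the LAST dot, since category names such as "alt.atheism"
    # contain dots themselves. Assumes the base file name has no dot.
    category, _, base_name = document_id.rpartition(".")
    if not category or not base_name:
        raise ValueError(f"not of the form category.basename: {document_id!r}")
    return category, base_name


category, base_name = split_document_id("alt.atheism.53536")
print(category)   # alt.atheism
print(base_name)  # 53536
```

The recovered category can then serve as the class label for text classification experiments.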