20 Newsgroups Dataset (De-Duped Version)

Added By Ganglion

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is believed to have been originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Note: The following additional processing was applied to the original version of the data set:


1. The category and base file name were combined for each document to create a unique document id.
2. Every newline was replaced with a single whitespace character.
3. All documents were concatenated into one file.
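
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used to build this file. It assumes the original data is laid out as one directory per category with one file per document, that a tab separates the id from the text in the output, and that the files can be read as Latin-1; the function names are hypothetical.

```python
import os


def build_document_id(category, base_name):
    # Step 1: combine the category and the base file name into a unique id.
    return f"{category}.{base_name}"


def flatten_text(raw):
    # Step 2: replace every newline with a single whitespace character.
    return raw.replace("\n", " ")


def concatenate_corpus(root_dir, out_path):
    # Step 3: write all documents into one file, one document per line.
    # Assumptions (not confirmed by the source): tab separator, Latin-1 encoding,
    # and a root_dir/<category>/<base file name> directory layout.
    count = 0
    with open(out_path, "w", encoding="latin-1") as out:
        for category in sorted(os.listdir(root_dir)):
            category_dir = os.path.join(root_dir, category)
            if not os.path.isdir(category_dir):
                continue
            for base_name in sorted(os.listdir(category_dir)):
                doc_path = os.path.join(category_dir, base_name)
                with open(doc_path, encoding="latin-1") as doc:
                    raw = doc.read()
                doc_id = build_document_id(category, base_name)
                out.write(doc_id + "\t" + flatten_text(raw) + "\n")
                count += 1
    return count
```

Because each document ends up on a single line, the combined file can be streamed line by line without loading the whole corpus into memory.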


Short name    Type     Description
document_id   string   A unique identifier for the document of the form ‘category’.‘base file name’. See the original source above.
text          string   The full text of the article itself, all newlines replaced with whitespace


alt.atheism.53536 From: (Keith M. Ryan) Subject: Re: Smullyanism for the day….. In article (…snip…)
alt.atheism.51164 From: (Mark McCullough) Subject: Re: Idle questions for fellow a (…snip…)
alt.atheism.53448 From: (Kent Sandvik) Subject: Age of Reason Was: Who has read Rushd (…snip…)
alt.atheism.53753 Subject: Re: The Inimitable Rushdie From: In article <115621@b (…snip…)
alt.atheism.53290 From: Andrew Newell Subject: Re: Christian Morality is In article Subject: Re: Amusing atheists and agnostics >DA (…snip…)
alt.atheism.51255 From: (Keith Allan Schneider) Subject: Re: <Political Atheists? mathe (…snip…)
alt.atheism.53535 From: (Tammy R Healy) Subject: getting to the point! To all a.a reade (…snip…)
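
The sample rows above suggest that each line holds a document_id followed by the article text. A reader for that layout might look like the sketch below. It assumes the two fields are separated by the first run of whitespace on the line, which the samples imply but the description does not state; the `Record` type and both function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Record(NamedTuple):
    document_id: str
    text: str


def parse_record(line: str) -> Record:
    # Assumption: the id ends at the first whitespace; the rest is the text.
    parts = line.rstrip("\r\n").split(None, 1)
    if not parts:
        raise ValueError("empty line")
    document_id = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return Record(document_id, text)


def split_document_id(document_id: str) -> Tuple[str, str]:
    # The id has the form 'category'.'base file name'; category names contain
    # dots themselves, so split on the last dot only.
    category, _, base_name = document_id.rpartition(".")
    return category, base_name
```

For example, `split_document_id("alt.atheism.53536")` yields the category `alt.atheism` and the base file name `53536`, which is the label a text classifier would typically be trained on.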
