20-Newsgroups Corpus

AKA: 20NG, 20 Newsgroups Collection.
Context:
- It was created ~1997.
- It can (typically) be used as a Text Clustering Task benchmark dataset.
Example(s):
- 20news-19997.tar.gz,
- 20news-bydate.tar.gz,
- 20news-18828.tar.gz,
- …
Counter-Example(s):
See: Text Corpus, Text Dataset, Text Classification, Natural Language Processing.
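
The example archives listed above can be inspected by counting documents per newsgroup. The following is a minimal sketch, not an official loader. It assumes (this is not stated on this page) that each document is stored in the tarball as `<top-level-dir>/<newsgroup>/<document-id>`. The function names `count_docs_per_group` and `build_toy_archive`, and the `toy-20news` directory, are hypothetical and used only for illustration; the demonstration runs on a tiny in-memory archive rather than on real corpus data.

```python
import io
import tarfile
from collections import Counter


def count_docs_per_group(path=None, fileobj=None):
    """Count regular files per newsgroup directory in a 20NG-style .tar.gz.

    Assumption: members are laid out as <top-dir>/<newsgroup>/<doc-id>,
    so the second-to-last path component is taken as the newsgroup name.
    """
    counts = Counter()
    with tarfile.open(name=path, fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            parts = member.name.strip("/").split("/")
            # Skip directories and anything not nested under a newsgroup folder.
            if member.isfile() and len(parts) >= 3:
                counts[parts[-2]] += 1
    return counts


def build_toy_archive(layout):
    """Build an in-memory .tar.gz from {newsgroup: [text, ...]} (toy data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for group, docs in layout.items():
            for i, text in enumerate(docs):
                data = text.encode("utf-8")
                info = tarfile.TarInfo(name=f"toy-20news/{group}/{i}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


# Toy stand-in for a real archive: two newsgroups, three documents in total.
toy = build_toy_archive({"rec.autos": ["first post", "second post"],
                         "sci.space": ["third post"]})
counts = count_docs_per_group(fileobj=toy)
print(dict(counts))
```

To try this on a real archive, pass a local path instead, for example `count_docs_per_group(path="20news-18828.tar.gz")`, assuming the file has already been downloaded. The resulting total can then be compared with the document counts quoted in the references.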

References

(Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557066
- QUOTE: We perform clustering experiments on three datasets: TDT2, LA Times (from TREC), and 20-newsgroups (20NG). We selected … all 19,997 documents in 20-newsgroups. … All 20 classes of 20NG are used for testing.
(Chen et al., 2009) ⇒ Bo Chen, Wai Lam, Ivor Tsang, and Tak-Lam Wong. (2009). “Extracting Discriminative Concepts for Domain Adaptation in Text Mining.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557045
- QUOTE: We use the 20-Newsgroup corpus to conduct experiments on document classification. This corpus consists of 18,846 newsgroup articles harvested from 20 different Usenet newsgroups. It can be observed that the marginal distributions of the articles among different newsgroups are not identical. There exists distribution shift from one newsgroup to any other newsgroups. However, we observe that some newsgroups are related. For example, the newsgroups rec.autos and rec.motorcycles are related to car. The newsgroups comp.sys.mac.hardware and comp.sys.ibm.pc.hardware are related to hardware, etc. …

(20Newsgroups, 1997) ⇒ http://people.csail.mit.edu/jrennie/20Newsgroups/
- QUOTE: The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.