TDT-2 Benchmark Task

AKA: TDT2, 1998 Topic Detection and Tracking Evaluation, TDT2 Dataset.
Context:
- It consists of 64,527 news stories from AP WorldStream, NY Times News Service, CNN Headline News, ABC World News Tonight, Voice of America World News, and Public Radio International The World, all drawn from the first half of 1998.
- It associates one or more human-labeled topics with each News Story.
See: Topic Detection Algorithm, TDT, Reuters Corpora.

References

http://www.nist.gov/speech/tests/tdt/tdt98/index.htm
http://projects.ldc.upenn.edu/TDT/
- http://projects.ldc.upenn.edu/TDT2/
  - The TDT2 English Corpus has been designed to include six months of material drawn on a daily basis from six English news sources. The period of time covered is from January 4 to June 30, 1998. The six sources are the New York Times News Service, the Associated Press Worldstream News Service, CNN "Headline News", ABC "World News Tonight", Public Radio International's "The World", and the Voice of America.
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T57

(Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557066
- QUOTE: We selected 7,094 documents in TDT2 that have a unique class label ... The ten classes selected from TDT2 are 20001, 20015, 20002, 20013, 20070, 20044, 20076, 20071, 20012, and 20023.

(Xu et al., 2003) ⇒ Wei Xu, Xin Liu, and Yihong Gong. (2003). “Document Clustering Based on Non-Negative Matrix Factorization.” In: Proceedings of the 26th ACM SIGIR Conference (SIGIR 2003). doi:10.1145/860435.860485
- QUOTE: We conducted the performance evaluations using the TDT2² and the Reuters³ document corpora. These two document corpora have been among the ideal test sets for document clustering purposes because documents in the corpora have been manually clustered based on their topics and each document has been assigned one or more labels indicating which topic/topics it belongs to. The TDT2 corpus consists of 100 document clusters, each of which reports a major news event occurred in 1998. It contains a total of 64527 documents from six news agencies such as ABC, CNN, VOA, NYT, PRI and APW, among which 7803 documents have a unique category label. The number of documents for different news events is very unbalanced, ranging from 1 to 1485. In our experiments, we excluded those events with less than 5 documents, which left us with a total of 56 events. The final test set is still very unbalanced, with some large clusters more than 100 times larger than some small ones.
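- NOTE: The filtering described in the quote above (keep only documents with a unique topic label, then drop events with fewer than 5 documents) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `filter_single_label_docs`, the `doc_topics` mapping from document ID to a list of topic labels, and the toy data are all hypothetical, and the actual LDC distribution format is not modeled here. It also assumes that event sizes are counted over the single-label documents only, which the quote does not state explicitly.

```python
from collections import Counter


def filter_single_label_docs(doc_topics, min_docs_per_topic=5):
    """Keep single-label documents, then drop topics that are too small.

    doc_topics: hypothetical mapping {doc_id: [topic_label, ...]}.
    Returns a mapping {doc_id: topic_label} for the retained documents.
    """
    # Step 1: keep only documents that carry exactly one topic label.
    single = {
        doc_id: topics[0]
        for doc_id, topics in doc_topics.items()
        if len(topics) == 1
    }
    # Step 2: count documents per topic among the single-label documents
    # (assumption: the size threshold applies to this reduced set).
    counts = Counter(single.values())
    # Step 3: drop topics with fewer than min_docs_per_topic documents.
    return {
        doc_id: topic
        for doc_id, topic in single.items()
        if counts[topic] >= min_docs_per_topic
    }


if __name__ == "__main__":
    # Toy data only; the labels reuse the 2xxxx style of TDT2 topic IDs.
    toy = {
        "d1": ["20001"],
        "d2": ["20001"],
        "d3": ["20001", "20002"],  # multi-label: dropped in step 1
        "d4": ["20002"],           # topic too small: dropped in step 3
        "d5": [],                  # unlabeled: dropped in step 1
    }
    print(filter_single_label_docs(toy, min_docs_per_topic=2))
```

  With the default threshold of 5, applying the same function to a full label mapping would mirror the "excluded those events with less than 5 documents" step; the toy call uses a threshold of 2 only so that the small example retains something.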

http://www.itl.nist.gov/iad/mig/tests/tdt/1998/
- QUOTE: The TDT-2 Corpus when complete will contain approximately 60,000 news stories from AP WorldStream, NY Times News service, CNN Headline News, ABC World News Tonight, Voice of America World News, and Public Radio International The World (transcripts are provided for the audio sources). A set of 100 target topics will be identified for the corpus and the corpus will be divided into training, development test, and evaluation test subsets of approximately equal size. See the LDC Website or Contact the LDC to obtain the TDT-2 training material. The Development Test material will be made available this summer and the Evaluation Test Material will be released in the Fall prior to the TDT-2 Evaluation in December.
- The TDT-1 Pilot Corpus contains 15,863 news stories from Reuters North American and CNN Broadcast Transcripts. A set of 25 target events has been identified for the corpus. See the LDC Website or Contact the LDC to obtain the TDT-1 Pilot Corpus.
- Documentation regarding the TDT-1 and TDT-2 corpora is available on the LDC TDT Website.