1999 AReExaminationOfTextCategorizationMethods

Subject Headings: Text Classification Algorithm, Empirical Algorithm Comparison Study.


Cited By



  • This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (less than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).

1 Introduction

  • Automated text categorization (TC) is a supervised learning task, defined as assigning pre-defined category labels to new documents based on the likelihood suggested by a training set of labeled documents. It has raised open challenges for statistical learning methods, requiring empirical examination of their effectiveness in solving real-world problems which are often high-dimensional and have a skewed category distribution over labeled documents. Topic spotting for newswire stories, for example, is one of the most commonly investigated application domains in the TC literature. An increasing number of learning approaches have been applied, including regression models [9, 32], nearest neighbor classification [17, 29, 33, 31, 14], Bayesian probabilistic approaches [25, 16, 20, 13, 12, 18, 3], decision trees [9, 16, 20, 2, 12], inductive rule learning [1, 5, 6, 21], neural networks [28, 22], on-line learning [6, 15] and Support Vector Machines [12].
  • While the rich literature provides valuable information about individual methods, clear conclusions about cross-method comparison have been difficult because often the published results are not directly comparable.

6 Conclusions

  • In this paper we presented a controlled study with significance analyses on five well-known text categorization methods. Our main conclusions are:
    • Significance analyses can be applied to both a micro-level and macro-level evaluation of text categorization systems, and jointly used for cross-method comparison.
    • The outcome of a significance test depends on the choice of performance measure, the sensitivity of the test, and the training-set frequency of categories being tested.
    • For the micro-level performance on pooled category assignments, both a sign test and an error-based proportion test suggest that SVM and kNN significantly outperform the other classifiers, while NB significantly underperforms all the other classifiers.
    • With respect to the macro-level (category-level) performance analysis using F1, all the significance tests we conducted suggest that SVM, kNN and LLSF belong to the same class, significantly outperforming NB and NNet.
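
The distinction between micro-level and macro-level evaluation, and the idea of a sign test on paired decisions, can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy counts, and the use of an exact one-sided binomial sign test are assumptions made for illustration, and the tests in the paper may differ in detail. F1 is computed as 2·tp / (2·tp + fp + fn), which equals the harmonic mean of precision and recall.

```python
from math import comb


def f1(tp, fp, fn):
    """F1 from contingency counts; returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: one (tp, fp, fn) tuple per category (hypothetical input format).

    Micro-averaging pools the counts over all categories before computing F1,
    so frequent categories dominate. Macro-averaging computes F1 per category
    and takes the unweighted mean, so rare categories weigh as much as common ones.
    """
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp, fp, fn = (sum(col) for col in zip(*counts))
    micro = f1(tp, fp, fn)
    return micro, macro


def sign_test_p(a_correct, b_correct):
    """One-sided exact sign test on paired binary decisions.

    a_correct[i] / b_correct[i] say whether system A / B got decision i right.
    Ties are discarded; under the null hypothesis the number of decisions won
    by A is Binomial(n, 0.5). Returns P(X >= k) for the observed k.
    """
    k = sum(a and not b for a, b in zip(a_correct, b_correct))
    n = sum(a != b for a, b in zip(a_correct, b_correct))
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence either way
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


# Toy data: one common category and one rare category.
micro, macro = micro_macro_f1([(90, 10, 10), (1, 0, 4)])

# Toy paired decisions: A wins 8 of the 10 decisions on which the systems differ.
p_value = sign_test_p([True] * 8 + [False] * 2, [False] * 8 + [True] * 2)
```

With these toy counts the micro-averaged score stays close to the common category's F1, while the macro-averaged score is pulled down by the rare category. This is why the two levels of evaluation can rank methods differently when the category distribution is skewed.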


References

  • Chidanand Apté, Fred Damerau, and Sholom M. Weiss. (1994). “Towards Language Independent Automated Learning of Text Categorization Models.” In: Proceedings of the 17th ACM SIGIR Conference Retrieval.
  • Chidanand Apté, Fred Damerau, and Sholom M. Weiss. (1998). “Text Mining with Decision Rules and Decision Trees.” In: Proceedings of the Conference on Automated Learning and Discovery, Workshop 6: Learning from Text and the Web.
  • L. Douglas Baker, Andrew McCallum, Distributional clustering of words for text classification, Proceedings of the 21st ACM SIGIR Conference retrieval, p.96-103, August 24-28, 1998, Melbourne, Australia  doi:10.1145/290941.290970
  • D. Berry and B.W. Lindgren. Statistics: Theory and Methods. Brooks/Cole, Pacific Grove, California, 1990.
  • William W. Cohen. Text categorization and relational learning. In The Twelfth International Conference on Machine Learning (ICML'95). Morgan Kaufmann, 1995.
  • William W. Cohen, Yoram Singer, Context-sensitive learning methods for text categorization, Proceedings of the 19th ACM SIGIR Conference retrieval, p.307-315, August 18-22, 1996, Zurich, Switzerland  doi:10.1145/243199.243278
  • Corinna Cortes, Vladimir N. Vapnik, Support-Vector Networks, Machine Learning, v.20 n.3, p.273-297, Sept. 1995  doi:10.1023/A:1022627411411
  • Belur V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. McGraw-Hill Computer Science Series. IEEE Computer Society Press, Los Alamitos, California, 1991.
  • N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, and K. Tzeras. AIR/X - a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO'91, pages 606-623, 1991.
  • Philip J. Hayes, Steven P. Weinstein, CONSTRUE/TIS: A System for Content-based Indexing of a Database of News Stories, Proceedings of the Second Conference on Innovative Applications of Artificial Intelligence, p.49-64, May 01-03, 1990
  • Makoto Iwayama, Takenobu Tokunaga, Cluster-based text categorization: a comparison of category search strategies, Proceedings of the 18th ACM SIGIR Conference retrieval, p.273-280, July 09-13, 1995, Seattle, Washington, United States  doi:10.1145/215206.215371
  • Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proceedings of the 10th European Conference on Machine Learning, p.137-142, April 21-23, 1998
  • Daphne Koller, Mehran Sahami, Hierarchically Classifying Documents Using Very Few Words, Proceedings of the Fourteenth International Conference on Machine Learning, p.170-178, July 08-12, 1997
  • Wai Lam, Chao Yang Ho, Using a generalized instance set for automatic text categorization, Proceedings of the 21st ACM SIGIR Conference retrieval, p.81-89, August 24-28, 1998, Melbourne, Australia  doi:10.1145/290941.290961
  • David D. Lewis, Robert E. Schapire, James P. Callan, Ron Papka, Training algorithms for linear text classifiers, Proceedings of the 19th ACM SIGIR Conference retrieval, p.298-306, August 18-22, 1996, Zurich, Switzerland  doi:10.1145/243199.243277
  • D.D. Lewis and M. Ringuette. Comparison of two learning algorithms for text categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), 1994.
  • Brij Masand, Gordon Linoff, David Waltz, Classifying news stories using memory based reasoning, Proceedings of the 15th ACM SIGIR Conference retrieval, p.59-65, June 21-24, 1992, Copenhagen, Denmark  doi:10.1145/133160.133177
  • Andrew McCallum and K. Nigam. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
  • Tom M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997
  • I. Moulinier. Is learning bias an issue on the text categorization problem? In Technical report, LAFORIA-LIP6, Universite Paris VI, 1997.
  • I. Moulinier, G. Raskinis, and J. Ganascia. Text categorization: a symbolic approach. In: Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, 1996.
  • Hwee Tou Ng, Wei Boon Goh, Kok Leong Low, Feature selection, perceptron learning, and a usability case study for text categorization, Proceedings of the 20th ACM SIGIR Conference retrieval, p.67-73, July 27-31, 1997, Philadelphia, Pennsylvania, United States
  • Edgar Osuna, Robert Freund, Federico Girosi, Support Vector Machines: Training and Applications, Massachusetts Institute of Technology, Cambridge, MA, 1997
  • J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
  • Kostas Tzeras, Stephan Hartmann, Automatic indexing based on Bayesian inference networks, Proceedings of the 16th ACM SIGIR Conference retrieval, p.22-35, June 27-July 01, 1993, Pittsburgh, Pennsylvania, United States  doi:10.1145/160688.160691
  • C. J. van Rijsbergen, Information Retrieval, Butterworth-Heinemann, Newton, MA, 1979
  • Vladimir N. Vapnik, The nature of statistical learning theory, Springer-Verlag New York, Inc., New York, NY, 1995
  • E. Wiener, J.O. Pedersen, and A.S. Weigend. A neural network approach to topic spotting. In: Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR'95), 1995.
  • Yiming Yang, Expert network: effective and efficient learning from human decisions in text categorization and retrieval, Proceedings of the 17th ACM SIGIR Conference retrieval, p.13-22, July 03-06, 1994, Dublin, Ireland
  • Y. Yang. Sampling strategies and learning efficiency in text categorization. In: Proceedings of AAAI Spring Symposium on Machine Learning in Information Access, pages 88-95, 1996.
  • Yiming Yang, An Evaluation of Statistical Approaches to Text Categorization, Information Retrieval, v.1 n.1-2, p.69-90, 1999
  • Yiming Yang, Christopher G. Chute, An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems (TOIS), v.12 n.3, p.252-277, July 1994  doi:10.1145/183422.183424
  • Yiming Yang, Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, Proceedings of the Fourteenth International Conference on Machine Learning, p.412-420, July 08-12, 1997

1999 AReExaminationOfTextCategorizationMethods

  • Author: Yiming Yang, Xin Liu
  • title: A Re-examination of Text Categorization Methods
  • journal: Proceedings of the ACM SIGIR Conference
  • titleUrl: http://nyc.lti.cs.cmu.edu/yiming/Publications/yang-sigir99.pdf
  • doi: 10.1145/312624.312647
  • year: 1999