2000 TextClassificationfromLabeledandUnlabeledDocumentsUsingEM

From GM-RKB

Subject Headings: Supervised Text Classification Algorithm, Semi-Supervised Classification Algorithm, EM Algorithm, Multinomial Naive Bayes.

Notes

Cited By

Quotes

Author Keywords

Abstract

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a [[small training dataset|small number]] of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.

We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
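The basic EM procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: documents are assumed to be bag-of-words count matrices, `lam` plays the role of the weighting factor for the unlabeled data (the paper's first extension), and all function names are illustrative. The multiple-mixture-components-per-class extension is omitted for brevity.

```python
import numpy as np

def train_nb(X, resp, alpha=1.0):
    """M-step: fit a multinomial naive Bayes model from soft class
    responsibilities (Laplace-smoothed with alpha)."""
    class_prior = resp.sum(axis=0) + alpha
    class_prior /= class_prior.sum()
    word_counts = resp.T @ X + alpha                      # (n_classes, n_words)
    word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
    return np.log(class_prior), np.log(word_probs)

def e_step(X, log_prior, log_word):
    """E-step: posterior class probabilities for each document."""
    log_post = log_prior + X @ log_word.T
    log_post -= log_post.max(axis=1, keepdims=True)       # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_nb(X_l, y_l, X_u, n_classes, lam=1.0, n_iter=10):
    """EM over labeled (X_l, y_l) and unlabeled (X_u) documents."""
    resp_l = np.eye(n_classes)[y_l]                       # hard labels, held fixed
    log_prior, log_word = train_nb(X_l, resp_l)           # initial classifier
    for _ in range(n_iter):
        resp_u = e_step(X_u, log_prior, log_word)         # probabilistically label
        X = np.vstack([X_l, X_u])
        resp = np.vstack([resp_l, lam * resp_u])          # down-weight unlabeled data
        log_prior, log_word = train_nb(X, resp)           # retrain on all documents
    return log_prior, log_word
```

With `lam=1.0` every unlabeled document counts as much as a labeled one; shrinking `lam` toward zero modulates the unlabeled contribution, which is the mechanism the paper uses when the generative assumptions are violated.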


References

  • 1. Avrim Blum, Tom Mitchell, Combining Labeled and Unlabeled Data with Co-training, Proceedings of the Eleventh Annual Conference on Computational Learning Theory, p.92-100, July 24-26, 1998, Madison, Wisconsin, USA doi:10.1145/279943.279962
  • 2. Vittorio Castelli, Thomas M. Cover, On the Exponential Value of Labeled Samples, Pattern Recognition Letters, v.16 n.1, p.105-111, Jan. 1995 doi:10.1016/0167-8655(94)00074-D
  • 3. Peter Cheeseman, John Stutz, Bayesian Classification (AutoClass): Theory and Results, Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, Menlo Park, CA, 1996
  • 4. William W. Cohen, Yoram Singer, Context-sensitive Learning Methods for Text Categorization, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.307-315, August 18-22, 1996, Zurich, Switzerland doi:10.1145/243199.243278
  • 5. Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, 1991
  • 6. Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Seán Slattery, Learning to Extract Symbolic Knowledge from the World Wide Web, Proceedings of the Fifteenth National Conference on Artificial Intelligence / Tenth Conference on Innovative Applications of Artificial Intelligence, p.509-516, July 1998, Madison, Wisconsin, USA
  • 7. Dagan, I. & Engelson, S.P. (1995). Committee-based Sampling for Training Probabilistic Classifiers. Machine Learning: Proceedings of the Twelfth International Conference (ICML '95) (pp. 150-157).
  • 8. Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
  • 9. Thomas G. Dietterich, Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, v.10 n.7, p.1895-1923, Oct. 1998 doi:10.1162/089976698300017197
  • 10. Pedro Domingos, Michael Pazzani, On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning, v.29 n.2-3, p.103-130, Nov./Dec. 1997 doi:10.1023/A:1007413511361
  • 11. Jerome H. Friedman, On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality, Data Mining and Knowledge Discovery, v.1 n.1, p.55-77, 1997 doi:10.1023/A:1009778005914
  • 12. Ghahramani, Z. & Jordan, M.I. (1994). Supervised Learning from Incomplete Data via An EM Approach. In Advances in Neural Information Processing Systems 6 (pp. 120-127). Morgan Kaufmann.
  • 13. Tommi S. Jaakkola, Michael I. Jordan, Improving the Mean Field Approximation via the Use of Mixture Distributions, Learning in Graphical Models, MIT Press, Cambridge, MA, 1999
  • 14. Thorsten Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Proceedings of the Fourteenth International Conference on Machine Learning, p.143-151, July 08-12, 1997
  • 15. Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proceedings of the 10th European Conference on Machine Learning, p.137-142, April 21-23, 1998
  • 16. Daphne Koller, Mehran Sahami, Hierarchically Classifying Documents Using Very Few Words, Proceedings of the Fourteenth International Conference on Machine Learning, p.170-178, July 08-12, 1997
  • 17. Lang, K. (1995). Newsweeder: Learning to Filter Netnews. Machine Learning: Proceedings of the Twelfth International Conference (ICML '95) (pp. 331-339).
  • 18. Leah S. Larkey, W. Bruce Croft, Combining Classifiers in Text Categorization, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.289-297, August 18-22, 1996, Zurich, Switzerland doi:10.1145/243199.243276
  • 19. David D. Lewis, An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task, Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.37-50, June 21-24, 1992, Copenhagen, Denmark doi:10.1145/133160.133172
  • 20. David D. Lewis, A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data, ACM SIGIR Forum, v.29 n.2, p.13-19, Fall 1995 doi:10.1145/219587.219592
  • 21. David D. Lewis, Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, Proceedings of the 10th European Conference on Machine Learning, p.4-15, April 21-23, 1998
  • 22. David D. Lewis, William A. Gale, A Sequential Algorithm for Training Text Classifiers, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.3-12, July 03-06, 1994, Dublin, Ireland
  • 23. David D. Lewis, Kimberly A. Knowles, Threading Electronic Mail: A Preliminary Study, Information Processing and Management: An International Journal, v.33 n.2, p.209-217, March 1997 doi:10.1016/S0306-4573(96)00063-5
  • 24. Lewis, D.D. & Ringuette, M. (1994). A Comparison of Two Learning Algorithms for Text Categorization. Third Annual Symposium on Document Analysis and Information Retrieval (pp. 81-93).
  • 25. Hang Li, Kenji Yamanishi, Document Classification Using a Finite Mixture Model, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, p.39-47, July 07-12, 1997, Madrid, Spain doi:10.3115/976909.979623
  • 26. Ray Liere, Prasad Tadepalli, Active Learning with Committees for Text Categorization, Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, p.591-596, July 27-31, 1997, Providence, Rhode Island
  • 27. McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI Press. http://www.cs.cmu.edu/~mccallum.
  • 28. Andrew McCallum, Kamal Nigam, Employing EM and Pool-Based Active Learning for Text Classification, Proceedings of the Fifteenth International Conference on Machine Learning, p.350-358, July 24-27, 1998
  • 29. Andrew McCallum, Ronald Rosenfeld, Tom M. Mitchell, Andrew Y. Ng, Improving Text Classification by Shrinkage in a Hierarchy of Classes, Proceedings of the Fifteenth International Conference on Machine Learning, p.359-367, July 24-27, 1998
  • 30. McLachlan, G. & Basford, K. (1988). Mixture Models. New York: Marcel Dekker.
  • 31. McLachlan, G.J. & Krishnan, T. (1997). The EM Algorithm and Extensions. New York: John Wiley and Sons.
  • 32. Miller, D.J. & Uyar, H.S. (1997). A Mixture of Experts Classifier with Learning based on Both Labelled and Unlabelled Data. In Advances in Neural Information Processing Systems 9 (pp. 571-577). The MIT Press.
  • 33. Thomas M. Mitchell, Machine Learning, McGraw-Hill, Inc., New York, NY, 1997
  • 34. Andrew Y. Ng, Preventing "Overfitting" of Cross-Validation Data, Proceedings of the Fourteenth International Conference on Machine Learning, p.245-253, July 08-12, 1997
  • 35. Michael Pazzani, Jack Muramatsu, Daniel Billsus, Syskill & Webert: Identifying Interesting Web Sites, Proceedings of the Thirteenth National Conference on Artificial Intelligence, p.54-61, August 04-08, 1996, Portland, Oregon
  • 36. Rissanen, J. (1983). A Universal Prior for Integers and Estimation by Minimum Description Length. Annals of Statistics, 11(2), 416-431.
  • 37. Robertson, S.E. & Sparck-Jones, K. (1976). Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27(3), 129-146.
  • 38. Rocchio, J. (1971). Relevance Feedback in Information Retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice Hall.
  • 39. Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian Approach to Filtering Junk E-mail. AAAI-98 Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI Press. http://robotics.stanford.edu/users/sahami/papers.html.
  • 40. Salton, G. (1991). Developments in Automatic Text Retrieval. Science, 253(5023), 974-980.
  • 41. Dale Schuurmans, A New Metric-based Approach to Model Selection, Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, p.552-558, July 27-31, 1997, Providence, Rhode Island
  • 42. Shahshahani, B. & Landgrebe, D. (1994). The Effect of Unlabeled Samples in Reducing the Small Sample Size Problem and Mitigating the Hughes Phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32(5), 1087-1095.
  • 43. Shavlik, J. & Eliassi-Rad, T. (1998). Intelligent Agents for Web-based Tasks: An Advice-taking Approach. AAAI-98 Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI Press. http://www.cs.wisc.edu/~shavlik/mlrg/publications.html.
  • 44. Stolcke, A. & Omohundro, S.M. (1994). Best-first Model Merging for Hidden Markov Model Induction. Tech. Rep. TR-94-003, ICSI, University of California, Berkeley. http://www.icsi.berkeley.edu/techreports/1994.html.
  • 45. Yiming Yang, Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.13-22, July 03-06, 1994, Dublin, Ireland
  • 46. Yiming Yang, An Evaluation of Statistical Approaches to Text Categorization, Information Retrieval, v.1 n.1-2, p.69-90, 1999 doi:10.1023/A:1009982220290
  • 47. Yiming Yang, Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, Proceedings of the Fourteenth International Conference on Machine Learning, p.412-420, July 08-12, 1997


2000 TextClassificationfromLabeledandUnlabeledDocumentsUsingEM
Author: Kamal Nigam +, Andrew Kachites McCallum +, Sebastian Thrun + and Tom M. Mitchell +
title: Text Classification from Labeled and Unlabeled Documents Using EM +
doi: 10.1023/A:1007692713085 +
year: 2000 +