2000 Text Classification from Labeled and Unlabeled Documents Using EM

From GM-RKB

Subject Headings: Supervised Text Classification Algorithm, Semi-Supervised Classification Algorithm, EM Algorithm, Multinomial Naive Bayes.

Notes

Cited By

Quotes

Author Keywords

Abstract

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a [[small training dataset|small number]] of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.

We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
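As a concrete illustration, the following is a minimal sketch of this basic EM procedure for a multinomial naive Bayes model, written in Python with NumPy. It is not the authors' implementation: the names (train_em_nb, lam, alpha, n_iter) are illustrative, lam stands in for the paper's weighting factor on the unlabeled data (lam = 1 recovers plain EM), and the second extension (multiple mixture components per class) is omitted.

```python
import numpy as np

def train_em_nb(X_l, y_l, X_u, n_classes, lam=1.0, n_iter=10, alpha=1.0):
    """Semi-supervised multinomial naive Bayes trained with EM (a sketch).

    X_l: (n_labeled, n_words) word-count matrix of labeled documents
    y_l: (n_labeled,) integer class labels
    X_u: (n_unlabeled, n_words) word-count matrix of unlabeled documents
    lam: weight on the unlabeled data (illustrative stand-in for the
         paper's weighting-factor extension; lam = 1 is plain EM)
    alpha: Laplace smoothing count
    """
    n_words = X_l.shape[1]
    # Hard (one-hot) class responsibilities for labeled docs; never change.
    R_l = np.eye(n_classes)[y_l]

    def m_step(R_u):
        # Refit the classifier from expected counts over ALL documents,
        # with unlabeled responsibilities down-weighted by lam.
        R = np.vstack([R_l, lam * R_u])
        X = np.vstack([X_l, X_u])
        class_count = R.sum(axis=0)          # expected documents per class
        word_count = R.T @ X                 # expected word counts per class
        log_prior = np.log(class_count + alpha) - np.log(
            class_count.sum() + n_classes * alpha)
        log_cond = np.log(word_count + alpha) - np.log(
            word_count.sum(axis=1, keepdims=True) + n_words * alpha)
        return log_prior, log_cond

    def e_step(log_prior, log_cond, X):
        # Posterior P(class | doc) under the multinomial model,
        # computed in log space for numerical stability.
        log_post = X @ log_cond.T + log_prior
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Initial classifier from the labeled documents alone
    # (zero responsibilities silence the unlabeled pool).
    log_prior, log_cond = m_step(np.zeros((X_u.shape[0], n_classes)))
    for _ in range(n_iter):
        R_u = e_step(log_prior, log_cond, X_u)  # E: probabilistically label
        log_prior, log_cond = m_step(R_u)       # M: retrain on all documents
    return log_prior, log_cond
```

A new document x (a word-count vector) is then classified as np.argmax(x @ log_cond.T + log_prior). In practice one would iterate until the expected log-likelihood stops improving rather than for a fixed n_iter; the paper sets the weighting factor by cross-validation on the labeled documents.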


References



Author: Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom M. Mitchell
Title: Text Classification from Labeled and Unlabeled Documents Using EM
Journal: Machine Learning, 39(2/3)
DOI: 10.1023/A:1007692713085
Year: 2000