2002 MachineLearningInAutoTextCateg


Subject Headings: Text Classification Algorithm, Supervised Learning Algorithm, Survey.

Notes

Cited By

Quotes

Abstract

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
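The three problems named in the abstract (document representation, classifier construction, and classifier evaluation) can be illustrated with a minimal sketch. It assumes a bag-of-words representation and a multinomial naive Bayes learner with add-one smoothing, chosen only as a simple stand-in; the abstract does not single out any particular learner. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def represent(text):
    """Document representation: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Classifier construction: collect per-category counts from preclassified documents."""
    class_docs = Counter()              # number of training documents per category
    term_counts = defaultdict(Counter)  # term frequencies per category
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        bag = represent(text)
        term_counts[label].update(bag)
        vocab.update(bag)
    return class_docs, term_counts, vocab


def classify(model, text):
    """Return the category with the highest naive Bayes log-score."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    bag = represent(text)
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for term, freq in bag.items():
            if term in vocab:  # terms never seen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                p = (term_counts[label][term] + 1) / (total_terms + len(vocab))
                score += freq * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall(model, test_docs, category):
    """Classifier evaluation: precision and recall for one category."""
    tp = fp = fn = 0
    for text, gold in test_docs:
        pred = classify(model, text)
        if pred == category and gold == category:
            tp += 1
        elif pred == category:
            fp += 1
        elif gold == category:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for illustration only.
TRAIN = [
    ("wheat harvest grain export", "grain"),
    ("grain prices wheat corn", "grain"),
    ("bank interest rate loan", "finance"),
    ("loan rate bank credit", "finance"),
]
TEST = [
    ("corn and wheat export", "grain"),
    ("bank credit rate", "finance"),
]

model = train(TRAIN)
print(classify(model, "corn and wheat export"))
print(precision_recall(model, TEST, "grain"))
```

Swapping `train` and `classify` for another learner leaves `represent` and `precision_recall` unchanged, which is one way to read the abstract's separation of the three problems.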

References

  • 1. Gianni Amati, Fabio Crestani, Probabilistic learning for selective dissemination of information, Information Processing and Management: an International Journal, v.35 n.5, p.633-654, Sept. 1999  doi:10.1016/S0306-4573(99)00012-6
  • 2. Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, Constantine D. Spyropoulos, An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages, Proceedings of the 23rd ACM SIGIR Conference retrieval, p.160-167, July 24-28, 2000, Athens, Greece  doi:10.1145/345508.345569
  • 3. Chidanand Apté, Fred Damerau, Sholom M. Weiss, Automated learning of decision rules for text categorization, ACM Transactions on Information Systems (TOIS), v.12 n.3, p.233-251, July 1994  doi:10.1145/183422.183423
  • 4. ATTARDI, G., DI MARCO, S., AND SALVI, D. (1998). Categorization by context. J. Univers. Comput. Sci. 4, 9, 719-736.
  • 5. L. Douglas Baker, Andrew McCallum, Distributional clustering of words for text classification, Proceedings of the 21st ACM SIGIR Conference retrieval, p.96-103, August 24-28, 1998, Melbourne, Australia  doi:10.1145/290941.290970
  • 6. Nicholas J. Belkin, W. Bruce Croft, Information filtering and information retrieval: two sides of the same coin?, Communications of the ACM, v.35 n.12, p.29-38, Dec. 1992  doi:10.1145/138859.138861
  • 7. P. Biebricher, N. Fuhr, G. Lustig, M. Schwantner, G. Knorz, The automatic indexing system AIR/PHYS - from research to applications, Proceedings of the 11th ACM SIGIR Conference retrieval, p.333-342, May 1988, Grenoble, France  doi:10.1145/62437.62470
  • 8. Harold Borko, Myrna Bernick, Automatic Document Classification, Journal of the ACM (JACM), v.10 n.2, p.151-162, April 1963  doi:10.1145/321160.321165
  • 9. Maria Fernanda Caropreso, Stan Matwin, Fabrizio Sebastiani, A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization, Text databases & document management: theory & practice, Idea Group Publishing, Hershey, PA, 2001
  • 10. CAVNAR, W. B. AND TRENKLE, J. M. (1994). N-gram-based text categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1994), 161-175.
  • 11. Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan. (1998). “Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies.” In: The VLDB Journal — The International Journal on Very Large Data Bases, 7(3)  doi:10.1007/s007780050061
  • 12. Soumen Chakrabarti, Byron Dom, and Piotr Indyk, (199). “Enhanced Hypertext Categorization Using Hyperlinks.” In: Proceedings of the 1998 ACM SIGMOD Conference.
  • 13. Chris Clack, Johnny Farringdon, Peter Lidwell, Tina Yu, Autonomous document classification for business, Proceedings of the first International Conference on Autonomous agents, p.201-208, February 05-08, 1997, Marina del Rey, California, United States  doi:10.1145/267658.267716
  • 14. Cyril Cleverdon, Optimizing convenient online access to bibliographic databases, Document retrieval systems, Taylor Graham Publishing, London, UK, 1988
  • 15. William W. Cohen 1995a. Learning to classify English text with ILP methods. In Advances in Inductive Logic Programming, L. De Raedt, ed. IOS Press, Amsterdam, The Netherlands, 124-143.
  • 16. William W. Cohen 1995b. Text categorization and relational learning. In: Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, CA, 1995), 124-132.
  • 17. COHEN, W. W. AND HIRSH, H. (1998). Joins that generalize: text classification using WHIRL. In: Proceedings of KDD-98, 4th International Conference on Knowledge Discovery and Data Mining (New York, NY, 1998), 169-173.
  • 18. William W. Cohen, Yoram Singer, Context-sensitive learning methods for text categorization, ACM Transactions on Information Systems (TOIS), v.17 n.2, p.141-173, April 1999  doi:10.1145/306686.306688
  • 19. William S. Cooper, Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval, ACM Transactions on Information Systems (TOIS), v.13 n.1, p.100-111, Jan. 1995  doi:10.1145/195705.195735
  • 20. Robert H. Creecy, Brij M. Masand, Stephen J. Smith, David L. Waltz, Trading MIPS and memory for knowledge engineering, Communications of the ACM, v.35 n.8, p.48-64, Aug. 1992  doi:10.1145/135226.135228
  • 21. Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, Iain Campbell, “Is this document relevant?…probably”: a survey of probabilistic models in information retrieval, ACM Computing Surveys (CSUR), v.30 n.4, p.528-552, Dec. 1998  doi:10.1145/299917.299920
  • 22. DAGAN, I., KAROV, Y., AND Dan Roth. (1997). Mistake-driven learning in text categorization. In: Proceedings of EMNLP-97, 2nd Conference on Empirical Methods in Natural Language Processing (Providence, RI, 1997), 55-63.
  • 23. DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. (1990). Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391-407.
  • 24. DENOYER, L., ZARAGOZA, H., AND GALLINARI, P. (2001). HMM-based passage models for document classification and ranking. In: Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, Germany, 2001).
  • 25. DIAZ ESTEBAN, A., DE BUENAGA RODRIGUEZ, M., URENA LOPEZ, L. A., AND GARCIA VEGA, M. (1998). Integrating linguistic resources in an uniform way for text classification tasks. In: Proceedings of LREC-98, 1st International Conference on Language Resources and Evaluation (Granada, Spain, 1998), 1197-1204.
  • 26. Pedro Domingos, Michael J. Pazzani, On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning, v.29 n.2-3, p.103-130, Nov./Dec. 1997
  • 27. DRUCKER, H., VAPNIK, V., AND WU, D. (1999). Automatic text categorization and its applications to text retrieval. IEEE Trans. Neural Netw. 10, 5, 1048-1054.
  • 28. Susan Dumais, Hao Chen, Hierarchical classification of Web content, Proceedings of the 23rd ACM SIGIR Conference retrieval, p.256-263, July 24-28, 2000, Athens, Greece  doi:10.1145/345508.345593
  • 29. Susan Dumais, John Platt, David Heckerman, Mehran Sahami, Inductive learning algorithms and representations for text categorization, Proceedings of the seventh International Conference on Information and knowledge management, p.148-155, November 02-07, 1998, Bethesda, Maryland, United States  doi:10.1145/288627.288651
  • 30. Gerard Escudero, Lluís Màrquez, German Rigau, Boosting Applied to Word Sense Disambiguation, Proceedings of the 11th European Conference on Machine Learning, p.129-141, May 31-June 02, 2000
  • 31. FIELD, B. 1975. Towards automatic indexing: automatic assignment of controlled-language indexing and classification from free indexing. J. Document. 31, 4, 246-265.
  • 32. FORSYTH, R. S. (1999). New directions in text categorization. In Causal Models and Intelligent Data Management, A. Gammerman, ed. Springer, Heidelberg, Germany, 151-185.
  • 33. Paolo Frasconi, Giovanni Soda, Alessandro Vullo, Hidden Markov Models for Text Categorization in Multi-Page Documents, Journal of Intelligent Information Systems, v.18 n.2-3, p.195-217, March-May 2002
  • 34. FUHR, N. (1985). A probabilistic model of dictionary-based automatic indexing. In: Proceedings of RIAO-85, 1st International Conference "Recherche d'Information Assistee par Ordinateur" (Grenoble, France, 1985), 207-216.
  • 35. Norbert Fuhr, Models for retrieval with probabilistic indexing, Information Processing and Management: an International Journal, v.25 n.1, p.55-72, 1989  doi:10.1016/0306-4573(89)90091-5
  • 36. Norbert Fuhr, Chris Buckley, A probabilistic learning approach for document indexing, ACM Transactions on Information Systems (TOIS), v.9 n.3, p.223-248, July 1991  doi:10.1145/125187.125189
  • 37. FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G., SCHWANTNER, M., AND TZERAS, K. (1991). AIR/X - a rule-based multistage indexing system for large subject fields. In: Proceedings of RIAO-91, 3rd International Conference "Recherche d'Information Assistee par Ordinateur" (Barcelona, Spain, 1991), 606-623.
  • 38. N. Fuhr, G. E. Knorz, Retrieval test evaluation of a rule based automatic indexing (AIR/PHYS), Proceedings of the third joint BCS and ACM symposium on Research and development in information retrieval, p.391-408, November 1984, King's College, Cambridge
  • 39. Norbert Fuhr, Ulrich Pfeifer, Probabilistic information retrieval as a combination of abstraction, inductive learning, and probabilistic assumptions, ACM Transactions on Information Systems (TOIS), v.12 n.1, p.92-115, Jan. 1994  doi:10.1145/174608.174612
  • 40. Johannes Fürnkranz, Exploiting Structural Information for Text Classification on the WWW, Proceedings of the Third International Symposium on Advances in Intelligent Data Analysis, p.487-498, August 01, 1997
  • 41. Luigi Galavotti, Fabrizio Sebastiani, Maria Simi, Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization, Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries, p.59-68, September 18-20, 2000
  • 42. GALE, W. A., CHURCH, K. W., AND YAROWSKY, D. (1993). A method for disambiguating word senses in a large corpus. Comput. Human. 26, 5, 415-439.
  • 43. Norbert Gövert, Mounia Lalmas, Norbert Fuhr, A probabilistic description-oriented approach for categorizing web documents, Proceedings of the eighth International Conference on Information and knowledge management, p.475-482, November 02-06, 1999, Kansas City, Missouri, United States  doi:10.1145/319950.320053
  • 44. GRAY, W. A. AND HARLEY, A. J. 1971. Computer-assisted indexing. Inform. Storage Retrieval 7, 4, 167-174.
  • 45. Louise Guthrie, Elbert Walker, Joe Guthrie, Document classification by machine: theory and practice, Proceedings of the 15th conference on Computational linguistics, August 05-09, 1994, Kyoto, Japan  doi:10.3115/991250.991322
  • 46. P. J. Hayes, P. M. Andersen, I. B. Nirenburg, L. M. Schmandt, TCS: a shell for content-based text categorization, Proceedings of the sixth conference on Artificial intelligence applications, p.320-326, January 1990, Santa Barbara, California, United States
  • 47. HEAPS, H. 1973. A theory of relevance for automatic document classification. Inform. Control 22, 3, 268-278.
  • 48. William Hersh, Chris Buckley, T. J. Leone, David Hickam, OHSUMED: an interactive retrieval evaluation and new large test collection for research, Proceedings of the 17th ACM SIGIR Conference retrieval, p.192-201, July 03-06, 1994, Dublin, Ireland
  • 49. David Hull, Improving text retrieval for the routing problem using latent semantic indexing, Proceedings of the 17th ACM SIGIR Conference retrieval, p.282-291, July 03-06, 1994, Dublin, Ireland
  • 50. David A. Hull, Jan O. Pedersen, Hinrich Schütze, Method combination for document filtering, Proceedings of the 19th ACM SIGIR Conference retrieval, p.279-287, August 18-22, 1996, Zurich, Switzerland  doi:10.1145/243199.243275
  • 51. ITTNER, D. J., LEWIS, D. D., AND AHN, D. D. (1995). Text categorization of low quality images. In: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1995), 301-315.
  • 52. Makoto Iwayama, Takenobu Tokunaga, Cluster-based text categorization: a comparison of category search strategies, Proceedings of the 18th ACM SIGIR Conference retrieval, p.273-280, July 09-13, 1995, Seattle, Washington, United States  doi:10.1145/215206.215371
  • 53. Raj D. Iyer, David D. Lewis, Robert E. Schapire, Yoram Singer, Amit Singhal, Boosting for document routing, Proceedings of the ninth International Conference on Information and knowledge management, p.70-77, November 06-11, 2000, McLean, Virginia, United States  doi:10.1145/354756.354794
  • 54. Thorsten Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Proceedings of the Fourteenth International Conference on Machine Learning, p.143-151, July 08-12, 1997
  • 55. Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proceedings of the 10th European Conference on Machine Learning, p.137-142, April 21-23, 1998
  • 56. Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the Sixteenth International Conference on Machine Learning, p.200-209, June 27-30, 1999
  • 57. Thorsten Joachims, Fabrizio Sebastiani, Guest Editors' Introduction to the Special Issue on Automated Text Categorization, Journal of Intelligent Information Systems, v.18 n.2-3, p.103-105, March-May 2002
  • 58. JOHN, G. H., KOHAVI, R., AND PFLEGER, K. (1994). Irrelevant features and the subset selection problem. In: Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), 121-129.
  • 59. JUNKER, M. AND ABECKER, A. (1997). Exploiting thesaurus knowledge in rule induction for text classification. In: Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing (Tzigov Chark, Bulgaria, 1997), 202-207.
  • 60. JUNKER, M. AND HOCH, R. (1998). An experimental evaluation of OCR text representations for learning document classifiers. Internat. J. Document Analysis and Recognition 1, 2, 116-122.
  • 61. Brett Kessler, Geoffrey Nunberg, Hinrich Schütze, Automatic detection of text genre, Proceedings of the 35th annual meeting on Association for Computational Linguistics, p.32-38, July 07-12, 1997, Madrid, Spain
  • 62. Yu-Hwan Kim, Shang-Yoon Hahn, Byoung-Tak Zhang, Text filtering by boosting naive Bayes classifiers, Proceedings of the 23rd ACM SIGIR Conference retrieval, p.168-175, July 24-28, 2000, Athens, Greece  doi:10.1145/345508.345572
  • 63. Ralf Klinkenberg, Thorsten Joachims, Detecting Concept Drift with Support Vector Machines, Proceedings of the Seventeenth International Conference on Machine Learning, p.487-494, June 29-July 02, 2000
  • 64. Kevin Knight, Mining online text, Communications of the ACM, v.42 n.11, p.58-61, Nov. 1999  doi:10.1145/319382.319394
  • 65. Gerhard Knorz, A decision theory approach to optimal automatic indexing, Proceedings of the 5th annual ACM conference on Research and development in information retrieval, p.174-193, May 18-20, 1982, West Berlin, Germany
  • 66. Daphne Koller, Mehran Sahami, Hierarchically Classifying Documents Using Very Few Words, Proceedings of the Fourteenth International Conference on Machine Learning, p.170-178, July 08-12, 1997
  • 67. Robert R. Korfhage, Information storage and retrieval, John Wiley & Sons, Inc., New York, NY, 1997
  • 68. Savio L. Y. Lam, Dik Lun Lee, Feature Reduction for Neural Network Based Text Categorization, Proceedings of the Sixth International Conference on Database Systems for Advanced Applications, p.195-202, April 19-21, 1999
  • 69. Wai Lam, Chao Yang Ho, Using a generalized instance set for automatic text categorization, Proceedings of the 21st ACM SIGIR Conference retrieval, p.81-89, August 24-28, 1998, Melbourne, Australia  doi:10.1145/290941.290961
  • 70. Wai Lam, LOW, K. F., and HO, C. Y. (1997). Using a Bayesian network induction approach for text categorization. In: Proceedings of IJCAI-97, 15th International Joint Conference on Artificial Intelligence (Nagoya, Japan, 1997), 745-750.
  • 71. Wai Lam, Miguel Ruiz, Padmini Srinivasan, Automatic Text Categorization and Its Application to Text Retrieval, IEEE Transactions on Knowledge and Data Engineering, v.11 n.6, p.865-879, November 1999  doi:10.1109/69.824599
  • 72. LANG, K. (1995). NEWSWEEDER: learning to filter netnews. In: Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, CA, 1995), 331-339.
  • 73. Leah S. Larkey, Automatic essay grading using text categorization techniques, Proceedings of the 21st ACM SIGIR Conference retrieval, p.90-95, August 24-28, 1998, Melbourne, Australia  doi:10.1145/290941.290965
  • 74. Leah S. Larkey, A patent search and classification system, Proceedings of the fourth ACM conference on Digital libraries, p.179-187, August 11-14, 1999, Berkeley, California, United States  doi:10.1145/313238.313304
  • 75. Leah S. Larkey, W. Bruce Croft, Combining classifiers in text categorization, Proceedings of the 19th ACM SIGIR Conference retrieval, p.289-297, August 18-22, 1996, Zurich, Switzerland  doi:10.1145/243199.243276
  • 76. David D. Lewis, An evaluation of phrasal and clustered representations on a text categorization task, Proceedings of the 15th ACM SIGIR Conference retrieval, p.37-50, June 21-24, 1992, Copenhagen, Denmark  doi:10.1145/133160.133172
  • 77. David Dolan Lewis, Representation and learning in information retrieval, University of Massachusetts, Amherst, MA, 1992
  • 78. David D. Lewis, Evaluating and optimizing autonomous text classification systems, Proceedings of the 18th ACM SIGIR Conference retrieval, p.246-254, July 09-13, 1995, Seattle, Washington, United States  doi:10.1145/215206.215366
  • 79. David D. Lewis, A sequential algorithm for training text classifiers: corrigendum and additional data, ACM SIGIR Forum, v.29 n.2, p.13-19, Fall 1995  doi:10.1145/219587.219592
  • 80. LEWIS, D. D. 1995c. The TREC-4 filtering track: description and analysis. In: Proceedings of TREC-4, 4th Text Retrieval Conference (Gaithersburg, MD, 1995), 165-180.
  • 81. David D. Lewis, Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, Proceedings of the 10th European Conference on Machine Learning, p.4-15, April 21-23, 1998
  • 82. LEWIS, D. D. AND CATLETT, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In: Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), 148-156.
  • 83. David D. Lewis, William A. Gale, A sequential algorithm for training text classifiers, Proceedings of the 17th ACM SIGIR Conference retrieval, p.3-12, July 03-06, 1994, Dublin, Ireland
  • 84. LEWIS, D. D. AND HAYES, P. J. (1994). Guest editorial for the special issue on text categorization. ACM Trans. Inform. Syst. 12, 3, 231.
  • 85. LEWIS, D. D. AND RINGUETTE, M. (1994). A comparison of two learning algorithms for text categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1994), 81-93.
  • 86. David D. Lewis, Robert E. Schapire, James P. Callan, Ron Papka, Training algorithms for linear text classifiers, Proceedings of the 19th ACM SIGIR Conference retrieval, p.298-306, August 18-22, 1996, Zurich, Switzerland  doi:10.1145/243199.243277
  • 87. Hang Li, Kenji Yamanishi, Text classification using ESC-based stochastic decision lists, Proceedings of the eighth International Conference on Information and knowledge management, p.122-130, November 02-06, 1999, Kansas City, Missouri, United States  doi:10.1145/319950.319966
  • 88. LI, Y. H. AND JAIN, A. K. (1998). Classification of Text Documents. Comput. J. 41, 8, 537-546.
  • 89. Elizabeth D. Liddy, Woojin Paik, Edmund S. Yu, Text categorization for multiple users based on semantic features from a machine-readable dictionary, ACM Transactions on Information Systems (TOIS), v.12 n.3, p.278-295, July 1994  doi:10.1145/183422.183425
  • 90. LIERE, R. AND TADEPALLI, P. (1997). Active learning with committees for text categorization. In: Proceedings of AAAI-97, 14th Conference of the American Association for Artificial Intelligence (Providence, RI, 1997), 591-596.
  • 91. Joo-Hwee Lim, Learnable visual keywords for image classification, Proceedings of the fourth ACM conference on Digital libraries, p.139-145, August 11-14, 1999, Berkeley, California, United States  doi:10.1145/313238.313290
  • 92. Christopher D. Manning, Hinrich Schütze, Foundations of statistical natural language processing, MIT Press, Cambridge, MA, 1999
  • 93. M. E. Maron, Automatic Indexing: An Experimental Inquiry, Journal of the ACM (JACM), v.8 n.3, p.404-417, July 1961  doi:10.1145/321075.321084
  • 94. Brij Masand, Optimizing confidence of text classification by evolution of symbolic expressions, Advances in genetic programming, MIT Press, Cambridge, MA, 1994
  • 95. Brij Masand, Gordon Linoff, David Waltz, Classifying news stories using memory based reasoning, Proceedings of the 15th ACM SIGIR Conference retrieval, p.59-65, June 21-24, 1992, Copenhagen, Denmark  doi:10.1145/133160.133177
  • 96. Andrew McCallum, Kamal Nigam, Employing EM and Pool-based Active Learning for Text Classification, Proceedings of the Fifteenth International Conference on Machine Learning, p.350-358, July 24-27, 1998
  • 97. Andrew McCallum, Ronald Rosenfeld, Tom M. Mitchell, Andrew Y. Ng, Improving Text Classification by Shrinkage in a Hierarchy of Classes, Proceedings of the Fifteenth International Conference on Machine Learning, p.359-367, July 24-27, 1998
  • 98. MERKL, D. (1998). Text classification with self-organizing maps: Some lessons learned. Neurocomputing 21, 1/3, 61-77.
  • 99. Tom M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997
  • 100. Dunja Mladenić, Feature Subset Selection in Text-Learning, Proceedings of the 10th European Conference on Machine Learning, p.95-100, April 21-23, 1998
  • 101. MLADENIC, D. AND GROBELNIK, M. (1998). Word sequences as features in text-learning. In: Proceedings of ERK-98, the Seventh Electrotechnical and Computer Science Conference (Ljubljana, Slovenia, 1998), 145-148.
  • 102. Isabelle Moulinier, Jean-Gabriel Ganascia, Applying an existing machine learning algorithm to text categorization, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, p.343-354, January 1996
  • 103. MOULINIER, I., RASKINIS, G., AND GANASCIA, J.-G. (1996). Text categorization: a symbolic approach. In: Proceedings of SDAIR-96, 5th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1996), 87-99.
  • 104. Kary Myers, Michael J. Kearns, Satinder P. Singh, Marilyn A. Walker, A Boosting Approach to Topic Spotting on Subdialogues, Proceedings of the Seventeenth International Conference on Machine Learning, p.655-662, June 29-July 02, 2000
  • 105. Hwee Tou Ng, Wei Boon Goh, Kok Leong Low, Feature selection, perceptron learning, and a usability case study for text categorization, Proceedings of the 20th ACM SIGIR Conference retrieval, p.67-73, July 27-31, 1997, Philadelphia, Pennsylvania, United States
  • 106. Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom M. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, v.39 n.2-3, p.103-134, May-June 2000
  • 107. Hyo-Jung Oh, Sung Hyon Myaeng, Mann-Ho Lee, A practical hypertext categorization method using links and incrementally available class information, Proceedings of the 23rd ACM SIGIR Conference retrieval, p.264-271, July 24-28, 2000, Athens, Greece  doi:10.1145/345508.345594
  • 108. International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, January 1997
  • 109. Ellen Riloff, Little words can make a big difference for text classification, Proceedings of the 18th ACM SIGIR Conference retrieval, p.130-136, July 09-13, 1995, Seattle, Washington, United States  doi:10.1145/215206.215349
  • 110. Ellen Riloff, Wendy Lehnert, Information extraction as a basis for high-precision text classification, ACM Transactions on Information Systems (TOIS), v.12 n.3, p.296-333, July 1994  doi:10.1145/183422.183428
  • 111. ROBERTSON, S. E. AND HARDING, P. (1984). Probabilistic automatic indexing by learning from human indexers. J. Document. 40, 4, 264-270.
  • 112. ROBERTSON, S. E. AND SPARCK JONES, K. 1976. Relevance weighting of search terms. J. Amer. Soc. Inform. Sci. 27, 3, 129-146. Also reprinted in Willett (1988), pp. 143-160.
  • 113. Dan Roth, Learning to resolve natural language ambiguities: a unified approach, Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, p.806-813, July 1998, Madison, Wisconsin, United States
  • 114. Miguel E. Ruiz, Padmini Srinivasan, Hierarchical neural networks for text categorization (poster abstract), Proceedings of the 22nd ACM SIGIR Conference retrieval, p.281-282, August 15-19, 1999, Berkeley, California, United States  doi:10.1145/312624.312700
  • 115. SABLE, C. L. AND HATZIVASSILOGLOU, V. (2000). Text-based approaches for non-topical image categorization. Internat. J. Dig. Libr. 3, 3, 261-275.
  • 116. Gerard M. Salton, Christopher Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management: an International Journal, v.24 n.5, p.513-523, 1988  doi:10.1016/0306-4573(88)90021-0
  • 117. Gerard M. Salton, A. Wong, C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, v.18 n.11, p.613-620, Nov. 1975  doi:10.1145/361219.361220
  • 118. Tefko Saracevic, Relevance: a review of and a framework for the thinking on the notion in information science, Readings in information retrieval, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1997
  • 119. Robert E. Schapire, Yoram Singer, BoosTexter: A Boosting-based System for Text Categorization, Machine Learning, v.39 n.2-3, p.135-168, May-June 2000
  • 120. Robert E. Schapire, Yoram Singer, Amit Singhal, Boosting and Rocchio applied to text filtering, Proceedings of the 21st ACM SIGIR Conference retrieval, p.215-223, August 24-28, 1998, Melbourne, Australia  doi:10.1145/290941.290996
  • 121. Hinrich Schütze, Automatic word sense discrimination, Computational Linguistics, v.24 n.1, March 1998
  • 122. Hinrich Schütze, David A. Hull, Jan O. Pedersen, A comparison of classifiers and document representations for the routing problem, Proceedings of the 18th ACM SIGIR Conference retrieval, p.229-237, July 09-13, 1995, Seattle, Washington, United States  doi:10.1145/215206.215365
  • 123. Sam Scott, Stan Matwin, Feature Engineering for Text Classification, Proceedings of the Sixteenth International Conference on Machine Learning, p.379-388, June 27-30, 1999
  • 124. Fabrizio Sebastiani, Alessandro Sperduti, Nicola Valdambrini, An improved boosting algorithm and its application to text categorization, Proceedings of the ninth International Conference on Information and knowledge management, p.78-85, November 06-11, 2000, McLean, Virginia, United States  doi:10.1145/354756.354804
  • 125. Amit Singhal, Mandar Mitra, Chris Buckley, Learning routing queries in a query zone, Proceedings of the 20th ACM SIGIR Conference retrieval, p.25-32, July 27-31, 1997, Philadelphia, Pennsylvania, United States
  • 126. Amit Singhal, Gerard M. Salton, Mandar Mitra, Chris Buckley, Document length normalization, Information Processing and Management: an International Journal, v.32 n.5, p.619-633, Sept. 1996  doi:10.1016/0306-4573(96)00008-8
  • 127. SLONIM, N. AND TISHBY, N. (2001). The power of word clusters for text classification. In: Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, Germany, 2001).
  • 128. Karen Spärck Jones, Peter Willett, Readings in information retrieval, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1997
  • 129. Hirotoshi Taira, Masahiko Haruno, Feature selection in SVM text categorization, Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence, p.480-486, July 18-22, 1999, Orlando, Florida, United States
  • 130. D. R. Tauritz, J. N. Kok, I. G. Sprinkhuizen-Kuyper, Adaptive information filtering using evolutionary computation, Information Sciences: an International Journal, v.122 n.2-4, p.121-140, Feb. 2000  doi:10.1016/S0020-0255(99)00123-1
  • 131. TUMER, K. AND GHOSH, J. (1996). Error correlation and error reduction in ensemble classifiers. Connection Sci. 8, 3-4, 385-403.
  • 132. Kostas Tzeras, Stephan Hartmann, Automatic indexing based on Bayesian inference networks, Proceedings of the 16th ACM SIGIR Conference retrieval, p.22-35, June 27-July 01, 1993, Pittsburgh, Pennsylvania, United States  doi:10.1145/160688.160691
  • 133. VAN RIJSBERGEN, C. J. 1977. A theoretical basis for the use of co-occurrence data in information retrieval. J. Document. 33, 2, 106-119.
  • 134. C. J. van Rijsbergen, Information Retrieval, Butterworth-Heinemann, Newton, MA, 1979
  • 135. Andreas S. Weigend, Erik D. Wiener, Jan O. Pedersen, Exploiting Hierarchy in Text Categorization, Information Retrieval, v.1 n.3, p.193-216, October 1999  doi:10.1023/A:1009983522080
  • 136. Sholom M. Weiss, Chidanand Apte, Fred J. Damerau, David E. Johnson, Frank J. Oles, Thilo Goetz, Thomas Hampp, Maximizing Text-Mining Performance, IEEE Intelligent Systems, v.14 n.4, p.63-69, July 1999  doi:10.1109/5254.784086
  • 137. WIENER, E. D., PEDERSEN, J. O., AND WEIGEND, A. S. (1995). A neural network approach to topic spotting. In: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1995), 317-332.
  • 138. Peter Willett, Document retrieval systems, Taylor Graham Publishing, London, UK, 1988
  • 139. Jacqueline W. T. Wong, W. K. Kan, Gilbert Young, ACTION: automatic classification for full-text documents, ACM SIGIR Forum, v.30 n.1, p.26-41, April 1996  doi:10.1145/381984.381987
  • 140. Yiming Yang, Expert network: effective and efficient learning from human decisions in text categorization and retrieval, Proceedings of the 17th ACM SIGIR Conference retrieval, p.13-22, July 03-06, 1994, Dublin, Ireland
  • 141. Yiming Yang, Noise reduction in a statistical approach to text categorization, Proceedings of the 18th ACM SIGIR Conference retrieval, p.256-263, July 09-13, 1995, Seattle, Washington, United States  doi:10.1145/215206.215367
  • 142. Yiming Yang, An Evaluation of Statistical Approaches to Text Categorization, Information Retrieval, v.1 n.1-2, p.69-90, 1999
  • 143. Yiming Yang, Christopher G. Chute, An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems (TOIS), v.12 n.3, p.252-277, July 1994  doi:10.1145/183422.183424
  • 144. Yiming Yang, Xin Liu, A re-examination of text categorization methods, Proceedings of the 22nd ACM SIGIR Conference retrieval, p.42-49, August 15-19, 1999, Berkeley, California, United States  doi:10.1145/312624.312647
  • 145. Yiming Yang, Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, Proceedings of the Fourteenth International Conference on Machine Learning, p.412-420, July 08-12, 1997
  • 146. Yiming Yang, Seán Slattery, and Rayid Ghani, (2002). “A Study of Approaches to Hypertext Categorization.” In: Journal of Intelligent Information Systems, 18(2-3).
  • 147. Kwok Leung Yu, Wai Lam, A new on-line learning algorithm for adaptive text filtering, Proceedings of the seventh International Conference on Information and knowledge management, p.156-160, November 02-07, 1998, Bethesda, Maryland, United States  doi:10.1145/288627.288652


2002 MachineLearningInAutoTextCateg
  • Author: Fabrizio Sebastiani
  • title: Machine Learning in Automated Text Categorization
  • journal: Association of Computing Machinery Computing Surveys
  • titleUrl: http://www.cis.uni-muenchen.de/~heller/SuchMasch/jobsuchmasch/ACMCS02.pdf
  • doi: 10.1145/505282.505283
  • year: 2002