2003 AnEvalOnFeatureSelectForTextCluster


Subject Headings: Feature Selection Algorithm





Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, we first give empirical evidence that feature selection methods can improve the efficiency and performance of text clustering algorithms. Then we propose a new feature selection method called “Term Contribution (TC)” and perform a comparative study on a variety of feature selection methods for text clustering, including Document Frequency (DF), Term Strength (TS), Entropy-based (En), Information Gain (IG) and χ² statistic (CHI). Finally, we propose an “Iterative Feature Selection (IF)” method that addresses the unavailability of labels by utilizing effective supervised feature selection methods to iteratively select features and perform clustering. Detailed experimental results on Web Directory data are provided in the paper.
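
The abstract names Document Frequency (DF) and Term Contribution (TC) without defining them. The following is a minimal Python sketch of how two such unsupervised scores could be computed, under stated assumptions: DF is taken as the number of documents containing a term, and TC is assumed to be the sum, over all ordered pairs of distinct documents, of the product of the term's tf-idf weights in the two documents. That TC definition, the tf-idf weighting with L2 normalisation, and all function names are assumptions made for illustration and are not taken from this page.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df


def tfidf_vectors(docs):
    """Return one L2-normalised tf*idf weight dict per document (assumed weighting)."""
    df = document_frequency(docs)
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        if norm > 0:
            weights = {t: v / norm for t, v in weights.items()}
        vectors.append(weights)
    return vectors


def term_contribution(vectors):
    """Assumed TC(t): sum over ordered pairs i != j of f(t, d_i) * f(t, d_j).

    Computed as (sum_i f_i)^2 - sum_i f_i^2 to avoid the double loop.
    """
    total = Counter()
    squares = Counter()
    for vec in vectors:
        for t, v in vec.items():
            total[t] += v
            squares[t] += v * v
    return {t: total[t] * total[t] - squares[t] for t in total}


def select_top(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, already tokenised; purely illustrative data.
docs = [
    ["feature", "selection", "text"],
    ["text", "clustering", "feature"],
    ["clustering", "web", "text"],
    ["web", "directory", "data"],
]
df = document_frequency(docs)
tc = term_contribution(tfidf_vectors(docs))
top_by_df = select_top(df, 1)
```

Under this assumed definition, a term that occurs in only one document contributes to no document pair, so its TC score is zero, whereas its DF score is one.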


  • Aggarwal, C. C., & Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. Proceedings of SIGMOD’00 (pp. 70-81).
  • Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2001). On Feature Distributional Clustering for Text Categorization. Proceedings of SIGIR’01 (pp. 146-153).
  • Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 245-271.
  • Bottou, L., & Bengio, Y. (1995). Convergence properties of the k-means algorithms. Advances in Neural Information Processing Systems, 7, 585-592.
  • Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Proceedings of SIGIR’92 (pp. 318-329).
  • Dash, M., & Liu, H. (1997). Feature selection for classification. International Journal of Intelligent Data Analysis, 1(3), 131-156.
  • Dash, M., & Liu, H. (2000). Feature Selection for Clustering. Proceedings of PAKDD-00 (pp. 110-121).
  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.
  • Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249-266.
  • Galavotti, L., Sebastiani, F., & Simi, M. (2000). Feature selection and negative evidence in automated text categorization. Proceedings of KDD-00.
  • Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4-37.
  • Jolliffe, I.T. (1986). Principal Component Analysis. Springer Series in Statistics.
  • Koller, D., & Sahami, M. (1996). Toward Optimal Feature Selection. Proceedings of ICML’96 (pp. 284-292).
  • Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proceedings of ICML-97 (pp. 170-178).
  • Kowalski, G. (1997). Information Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers.
  • Law, M. H. C., Figueiredo, M. A. T., & Jain, A. K. (2002). Feature saliency in unsupervised learning (Technical Report). Michigan State University.
  • Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Massachusetts.
  • Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. Proceedings of SIGIR’00 (pp. 208-215).
  • Wilbur, J.W., & Sirotkin, K. (1992). The automatic identification of stop words. Journal of Information Science, 18, 45-55.
  • Wyse, N., Dubes, R., & Jain, A.K. (1980). A critical evaluation of intrinsic dimensionality algorithms. Pattern Recognition in Practice (pp. 415-425). North-Holland.
  • Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proceedings of SIGIR’95 (pp. 256-263).
  • Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of ICML-97 (pp. 412-420).
  • Zamir, O., Etzioni, O., Madani, O., & Karp, R. M. (1997). Fast and Intuitive Clustering of Web Documents. Proceedings of KDD-97 (pp. 287-290).

Author: Tao Liu, Shengping Li, Zheng Chen, Wei-Ying Ma
Title: An Evaluation on Feature Selection for Text Clustering
Journal: Proceedings of the Twentieth International Conference on Machine Learning
URL: http://www.aaai.org/Papers/ICML/2003/ICML03-065.pdf
Year: 2003