2003 AnEvalOnFeatureSelectForTextCluster

(Liu et al., 2003) ⇒ Tao Liu, Shengping Li, Zheng Chen, Wei-Ying Ma. (2003). “An Evaluation on Feature Selection for Text Clustering.” In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).

Subject Headings: Feature Selection Algorithm

Notes
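
The quoted abstract names Document Frequency (DF) and Term Contribution (TC) without defining them. The sketch below is one possible reading, not the authors' implementation: DF counts the documents that contain a term, and TC is taken here to be the sum, over all ordered pairs of distinct documents, of the product of the term's weights in the two documents. The tf-idf weighting with L2 normalisation and the names `tfidf_vectors`, `term_contribution`, and `top_k` are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (document frequencies, one L2-normalised {term: weight} dict per doc).

    `docs` is a list of token lists. The tf * log(N / df) weighting is an
    assumption of this sketch, not something the abstract specifies.
    """
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        w = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        # Guard against an all-zero vector (every term occurs in every doc).
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return df, vectors


def term_contribution(vectors):
    """Assumed TC(t): sum over ordered pairs i != j of w(t, d_i) * w(t, d_j).

    Computed as (sum of weights)^2 - (sum of squared weights), which avoids
    the quadratic loop over document pairs.
    """
    total = Counter()
    squares = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
            squares[t] += w * w
    return {t: total[t] * total[t] - squares[t] for t in total}


def top_k(scores, k):
    """Keep the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["banana", "cherry", "cherry"],
    ["date"],
]
df, vectors = tfidf_vectors(docs)
tc = term_contribution(vectors)
selected_by_df = top_k(df, 2)
selected_by_tc = top_k(tc, 2)
```

Under this reading, a term that occurs in only one document contributes nothing to any document pair, so its TC score is zero regardless of its weight.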

Cited By

~102 http://scholar.google.com/scholar?cites=6664053621488137342

Quotes

Abstract

Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, we first give empirical evidence that feature selection methods can improve the efficiency and performance of text clustering algorithm. Then we propose a new feature selection method called “Term Contribution (TC)” and perform a comparative study on a variety of feature selection methods for text clustering, including Document Frequency (DF), Term Strength (TS), Entropy-based (En), Information Gain (IG) and χ² statistic (CHI). Finally, we propose an “Iterative Feature Selection (IF)” method that addresses the unavailability of label problem by utilizing effective supervised feature selection methods to iteratively select features and perform clustering. Detailed experimental results on Web Directory data are provided in the paper.

References

Aggarwal, C. C., & Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. Proceedings of SIGMOD’00 (pp. 70-81).
Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2001). On Feature Distributional Clustering for Text Categorization. Proceedings of SIGIR’01 (pp. 146-153).
Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 245-271.
Bottou, L., & Bengio, Y. (1995). Convergence properties of the k-means algorithms. Advances in Neural Information Processing Systems, 7, 585-592.
Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Proceedings of SIGIR’92 (pp. 318-329).
Dash, M., & Liu, H. (1997). Feature selection for classification. International Journal of Intelligent Data Analysis, 1(3), 131-156.
Dash, M., & Liu, H. (2000). Feature Selection for Clustering. Proceedings of PAKDD-00 (pp. 110-121).
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249-266.
Galavotti, L., Sebastiani, F., & Simi, M. (2000). Feature selection and negative evidence in automated text categorization. Proceedings of KDD-00.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4-37.
Jolliffe, I.T. (1986). Principal Component Analysis. Springer Series in Statistics.
Koller, D., & Sahami, M. (1996). Toward Optimal Feature Selection. Proceedings of ICML’96 (pp. 284-292).
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proceedings of ICML-97 (pp. 170-178).
Kowalski, G. (1997). Information Retrieval Systems Theory and Implementation. Kluwer Academic Publishers.
Law, M. H. C., Figueiredo, M. A. T., & Jain, A. K. (2002). Feature Saliency in unsupervised learning (Technical Report 2002). Michigan State University.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Massachusetts.
Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. Proceedings of SIGIR’00 (pp. 208-215).
Wilbur, J.W., & Sirotkin, K. (1992). The automatic identification of stop words. Journal of Information Science, 18, 45-55.
Wyse, N., Dubes, R., & Jain, A.K. (1980). A critical evaluation of intrinsic dimensionality algorithms. Pattern Recognition in Practice (pp. 415-425). North-Holland.
Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proceedings of SIGIR’95 (pp. 256-263).
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of ICML-97 (pp. 412-420).
Zamir, O., Etzioni, O., Madani, O., & Karp, R. M. (1997). Fast and Intuitive Clustering of Web Documents. Proceedings of KDD-97 (pp. 287-290).

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2003 AnEvalOnFeatureSelectForTextCluster	Wei-Ying Ma; Zheng Chen; Tao Liu; Shengping Li			An Evaluation on Feature Selection for Text Clustering		Proceedings of the Twentieth International Conference on Machine Learning	http://www.aaai.org/Papers/ICML/2003/ICML03-065.pdf			2003
