2004 AProbabilisticFrameworkForSemiSupClust

(Basu et al., 2004) ⇒ Sugato Basu, Mikhail Bilenko, and Raymond Mooney. (2004). “A Probabilistic Framework for Semi-Supervised Clustering.” In: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004). doi:10.1145/1014052.1014062

Subject Headings: Semi-Supervised Clustering, Pairwise Constraint.

Notes

Cited By

~370 http://scholar.google.com/scholar?q=%22A+Probabilistic+Framework+for+Semi-Supervised+Clustering%22+2004

Quotes

Abstract

Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.

References

1. Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
2. Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Suvrit Sra, Generative model-based clustering of directional data, Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003, Washington, D.C. doi:10.1145/956750.956757
3. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In: Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), 2004.
4. Nikhil Bansal, Avrim Blum, Shuchi Chawla, Correlation Clustering, Proceedings of the 43rd Symposium on Foundations of Computer Science, p.238, November 16-19, 2002
5. A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equivalence relations. In: Proceedings of 20th International Conference on Machine Learning (ICML-03), pages 11--18, 2003.
6. Sugato Basu, Arindam Banerjee, Raymond Mooney, Semi-supervised Clustering by Seeding, Proceedings of the Nineteenth International Conference on Machine Learning, p.27-34, July 08-12, 2002
7. Sugato Basu, A. Banerjee, and Raymond Mooney. Active semi-supervision for pairwise constrained clustering. In: Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), 2004.
8. Sugato Basu, Arindam Banerjee, Raymond Mooney, Semi-supervised Clustering by Seeding, Proceedings of the Nineteenth International Conference on Machine Learning, p.27-34, July 08-12, 2002
9. J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological), 48(3):259--302, 1986.
10. Mikhail Bilenko, Raymond Mooney, Adaptive duplicate detection using learnable string similarity measures, Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003, Washington, D.C. doi:10.1145/956750.956759
11. Avrim Blum, Tom M. Mitchell, Combining Labeled and Unlabeled Data with Co-training, Proceedings of the eleventh annual conference on Computational learning theory, p.92-100, July 24-26, 1998, Madison, Wisconsin, United States doi:10.1145/279943.279962
12. Y. Boykov, O. Veksler, R. Zabih, Markov Random Fields with Efficient Approximations, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p.648, June 23-25, 1998
13. D. Cohn, Rich Caruana, and Andrew McCallum. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell University, 2003.
14. Thomas M. Cover, Joy A. Thomas, Elements of information theory, Wiley-Interscience, New York, NY, 1991
15. A. Demiriz, K. P. Bennett, and M. J. Embrechts. Semi-supervised clustering using genetic algorithms. In Artificial Neural Networks in Engineering (ANNIE-99), pages 809--814, 1999.
16. Arthur P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1--38, 1977.
17. Inderjit S. Dhillon, Yuqiang Guan, Information Theoretic Clustering of Sparse Co-Occurrence Data, Proceedings of the Third IEEE International Conference on Data Mining, p.517, November 19-22, 2003
18. Inderjit S. Dhillon, Dharmendra S. Modha, Concept Decompositions for Large Sparse Text Data Using Clustering, Machine Learning, v.42 n.1-2, p.143-175, January-February 2001
19. B. E. Dom. An information-theoretic external cluster-validity measure. Research Report RJ 10219, IBM, 2001.
20. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, USA, 95:14863--14848, 1998.
21. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721--742, 1984.
22. J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
23. D. Hochbaum and D. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180--184, 1985.
24. Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the Sixteenth International Conference on Machine Learning, p.200-209, June 27-30, 1999
25. S. D. Kamvar, D. Klein, and Christopher D. Manning. Spectral learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 561--566, 2003.
26. M. Kearns, Y. Mansour, and A. Y. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In: Proceedings of 13th Conference on Uncertainty in Artificial Intelligence (UAI-97), pages 282--293, 1997.
27. Dan Klein, Sepandar D. Kamvar, Christopher D. Manning, From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering, Proceedings of the Nineteenth International Conference on Machine Learning, p.307-314, July 08-12, 2002
28. Jon Kleinberg, Éva Tardos, Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields, Proceedings of the 40th Annual Symposium on Foundations of Computer Science, p.14, October 17-18, 1999
29. J. MacQueen. Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281--297, 1967.
30. E. M. Marcotte, I. Xenarios, A. van der Bliek, and D. Eisenberg. Localizing proteins in the cell from their phylogenetic profiles. Proceedings of the National Academy of Science, 97:12115--20, 2000.
31. K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., 2nd edition, 2000.
32. Radford M. Neal, Geoffrey E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, Learning in graphical models, MIT Press, Cambridge, MA, 1999
33. Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom M. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, v.39 n.2-3, p.103-134, May-June 2000
34. (Pearl, 1988) ⇒ Judea Pearl. (1988). “Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.” Morgan Kaufmann. ISBN:1558604790
35. Fernando Pereira, Naftali Tishby, Lillian Lee, Distributional clustering of English words, Proceedings of the 31st annual meeting on Association for Computational Linguistics, p.183-190, June 22-26, 1993, Columbus, Ohio doi:10.3115/981574.981598
36. E. Segal, H. Wang, and Daphne Koller. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19:i264-- i272, July 2003.
37. A. Strehl, J. Ghosh, and Raymond Mooney. Impact of similarity measures on web-page clustering. In: Proceedings of AAAI2000 Workshop on Artificial Intelligence for Web Search, pages 58--64, July 2000.
38. (Wagstaff et al., 2001) ⇒ Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. (2001). “K-meansClusteringWithBackgroundKnowledge.pdf Constrained K-means Clustering with Background Knowledge.” In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001).
39. (Xing, 2003) ⇒ Eric P. Xing, Andrew Y. Ng , Michael I. Jordan and Stuart Russell. (2003). “Distance Metric Learning, with Application to Clustering with Side-Information.” In: Advances in Neural Information Processing Systems 15.

,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2004 AProbabilisticFrameworkForSemiSupClust	Sugato Basu Mikhail Bilenko Raymond J. Mooney			A Probabilistic Framework for Semi-Supervised Clustering		Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining	http://dx.doi.org/10.1145/1014052.1014062	10.1145/1014052.1014062		2004