2009 FastApproximateSpectralClustering

From GM-RKB

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Abstract

Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n³) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
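The KASP procedure described in the abstract (local k-means as the distortion-minimizing transformation, spectral clustering on the representative points, then lifting labels back to the original data) can be sketched as follows. This is an illustrative reconstruction from the abstract, not the authors' implementation; the representative count `n_reps`, the Gaussian affinity bandwidth `sigma`, and the helper `_kmeans` are assumptions made for the sketch.

```python
import numpy as np

def _kmeans(Y, k, rng, n_iter=50):
    """Plain Lloyd k-means with farthest-first seeding (illustrative helper)."""
    centers = [Y[rng.integers(len(Y))]]
    for _ in range(k - 1):                      # farthest-first seeding
        d = ((Y[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(Y[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((Y[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(0)
    return labels

def kasp(X, k, n_reps=20, sigma=1.0, n_iter=30, seed=0):
    """KASP sketch: (1) k-means with n_reps >> k centers as the local
    transformation, (2) spectral clustering on the n_reps representative
    points, (3) each original point inherits its representative's label."""
    rng = np.random.default_rng(seed)
    # Step 1: local k-means on the full data set
    centers = X[rng.choice(len(X), n_reps, replace=False)].copy()
    for _ in range(n_iter):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(n_reps):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    # Step 2: normalized spectral clustering on the representatives only
    d2 = ((centers[:, None] - centers[None]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian affinity
    np.fill_diagonal(W, 0.0)
    dinv = 1.0 / np.sqrt(W.sum(1))
    L = dinv[:, None] * W * dinv[None, :]       # symmetrically normalized affinity
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -k:]                            # top-k eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    rep_labels = _kmeans(U, k, rng)
    # Step 3: lift labels back to the original points
    return rep_labels[assign]
```

The speedup claimed in the abstract comes from step 2: the cubic-cost eigendecomposition is applied to the n_reps representative points rather than to all n observations, so the spectral step costs O(n_reps³) instead of O(n³).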

References

  • 1. David Arthur, Sergei Vassilvitskii, K-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, p.1027-1035, January 07-09, 2007, New Orleans, Louisiana
  • 2. Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, Angela Y. Wu, An Optimal Algorithm for Approximate Nearest Neighbor Searching Fixed Dimensions, Journal of the ACM (JACM), v.45 n.6, p.891-923, Nov. 1998 doi:10.1145/293347.293348
  • 3. A. Asuncion and D. Newman. UCI Machine Learning Repository, Department of Information and Computer Science, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
  • 4. Francis R. Bach, Michael I. Jordan, Learning Spectral Clustering, With Application To Speech Separation, The Journal of Machine Learning Research, 7, p.1963-2001, 12/1/2006
  • 5. Mihai Bădoiu, Sariel Har-Peled, Piotr Indyk, Approximate Clustering via Core-sets, Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montreal, Quebec, Canada doi:10.1145/509907.509947
  • 6. Paul S. Bradley, Usama M. Fayyad, Refining Initial Points for K-Means Clustering, Proceedings of the Fifteenth International Conference on Machine Learning, p.91-99, July 24-27, 1998
  • 7. Leo Breiman, Random Forests, Machine Learning, v.45 n.1, p.5-32, October 2001 doi:10.1023/A:1010933404324
  • 8. Sanjoy Dasgupta, Yoav Freund, Random Projection Trees and Low Dimensional Manifolds, Proceedings of the 40th Annual ACM Symposium on Theory of Computing, May 17-20, 2008, Victoria, British Columbia, Canada doi:10.1145/1374376.1374452
  • 9. Inderjit S. Dhillon, Yuqiang Guan, Brian Kulis, Weighted Graph Cuts Without Eigenvectors A Multilevel Approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.29 n.11, p.1944-1957, November 2007 doi:10.1109/TPAMI.2007.1115
  • 10. P. Drineas and M. W. Mahoney. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-based Learning. In: Proceedings of COLT, p.323-337, 2005.
  • 11. Shai Fine, Katya Scheinberg, Efficient SVM Training Using Low-rank Kernel Representations, The Journal of Machine Learning Research, 2, 3/1/2002
  • 12. Charless Fowlkes, Serge Belongie, Fan Chung, Jitendra Malik, Spectral Grouping Using the Nyström Method, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.26 n.2, p.214-225, January 2004 doi:10.1109/TPAMI.2004.1262185
  • 13. R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325--2383, 1998.
  • 14. Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan, Fast Iterative Kernel Principal Component Analysis, The Journal of Machine Learning Research, 8, p.1893-1918, 12/1/2007
  • 15. John A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., New York, NY, 1975
  • 16. J. A. Hartigan and M. A. Wong. A K-means Clustering Algorithm. Applied Statistics, 28(1):100--108, 1979.
  • 17. Bruce Hendrickson, Robert Leland, A Multilevel Algorithm for Partitioning Graphs, Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), p.28-es, December 04-08, 1995, San Diego, California, United States doi:10.1145/224170.224228
  • 18. L. Huang, D. Yan, M. I. Jordan, and N. Taft. Spectral Clustering with Perturbed Data. In Advances in Neural Information Processing Systems (NIPS), December 2008.
  • 19. A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys (CSUR), v.31 n.3, p.264-323, Sept. 1999 doi:10.1145/331499.331504
  • 20. Ravi Kannan, Santosh Vempala, Adrian Vetta, On Clusterings: Good, Bad and Spectral, Journal of the ACM (JACM), v.51 n.3, p.497-515, May 2004 doi:10.1145/990308.990313
  • 21. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu, An Efficient K-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.24 n.7, p.881-892, July 2002 doi:10.1109/TPAMI.2002.1017616
  • 22. George Karypis, Vipin Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal on Scientific Computing, v.20 n.1, p.359-392, Aug. 1998 doi:10.1137/S1064827595287997
  • 23. Amit Kumar, Yogish Sabharwal, Sandeep Sen, A Simple Linear Time (1 + ε)-Approximation Algorithm for K-Means Clustering in Any Dimensions, Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, p.454-462, October 17-19, 2004 doi:10.1109/FOCS.2004.7
  • 24. J. F. Lu, J. B. Tang, Z. M. Tang, J. Y. Yang, Hierarchical Initialization Approach for K-Means Clustering, Pattern Recognition Letters, v.29 n.6, p.787-795, April, 2008 doi:10.1016/j.patrec.2007.12.009
  • 25. David Madigan, Nandini Raghavan, William Dumouchel, Martha Nason, Christian Posse, Greg Ridgeway, Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction, Data Mining and Knowledge Discovery, v.6 n.2, p.173-190, April 2002 doi:10.1023/A:1014095614948
  • 26. M. Meila and J. Shi. Learning Segmentation with Random Walk. In Advances in Neural Information Processing Systems (NIPS), 2001.
  • 27. Pabitra Mitra, C. A. Murthy, Sankar K. Pal, Density-Based Multiscale Data Condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.24 n.6, p.734-747, June 2002 doi:10.1109/TPAMI.2002.1008381
  • 28. A. Y. Ng, M. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and An Algorithm. In Advances in Neural Information Processing Systems (NIPS), 2002.
  • 29. J. M. Peña, J. A. Lozano, P. Larrañaga, An Empirical Comparison of Four Initialization Methods for the K-Means Algorithm, Pattern Recognition Letters, v.20 n.10, p.1027-1040, Oct. 1999 doi:10.1016/S0167-8655(99)00069-0
  • 30. Stephen J. Redmond, Conor Heneghan, A Method for Initialising the K-means Clustering Algorithm Using Kd-trees, Pattern Recognition Letters, v.28 n.8, p.965-973, June, 2007 doi:10.1016/j.patrec.2007.01.001
  • 31. Jianbo Shi, Jitendra Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.22 n.8, p.888-905, August 2000 doi:10.1109/34.868688
  • 32. Alex J. Smola, Bernhard Schölkopf, Sparse Greedy Matrix Approximation for Machine Learning, Proceedings of the Seventeenth International Conference on Machine Learning, p.911-918, June 29-July 02, 2000
  • 33. G. Stewart. Introduction to Matrix Computations. Academic Press, 1973.
  • 34. U. Von Luxburg, M. Belkin, and O. Bousquet. Consistency of Spectral Clustering. Annals of Statistics, 36(2):555--586, 2008.
  • 35. C. Williams and M. Seeger. Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems, 2001.
  • 36. D. Yan, L. Huang, and M. I. Jordan. Fast Approximate Spectral Clustering. Technical Report, Department of Statistics, UC Berkeley, 2009.
  • 37. Stella X. Yu, Jianbo Shi, Multiclass Spectral Clustering, Proceedings of the Ninth IEEE International Conference on Computer Vision, p.313, October 13-16, 2003
  • 38. P. L. Zador. Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension. IEEE Transactions on Information Theory, 28:139--148, 1982.



Donghui Yan, Ling Huang, and Michael I. Jordan. (2009). "Fast Approximate Spectral Clustering." In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009). doi:10.1145/1557019.1557118