# 2000 EfficientClusteringOfHighDimDataSetsWithAppsToRefMatching

- (McCallum et al., 2000b) ⇒ Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. (2000). “Efficient Clustering of High-dimensional Data Sets with Application to Reference Matching.” In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000). (doi:10.1145/347090.347123).

**Subject Headings:** String Distance Function, Record Coreference Resolution Algorithm, Greedy Algorithm, Agglomerative Clustering Algorithm, High-Dimensionality Clustering Algorithm.

## Notes

## Cited By

- ~265 ...

## Quotes

### Abstract

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once — for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, highdimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.

## References

- 1. Hitotogu Akaike. On entropy maximization principle. Applications of Statistics, pages 27-41, 1977.
- 2. M. R. Anderberg. Cluster Analysis for Application. Academic Press, 1973.
- 3. P. S. Bradley, Usama M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In: Proceedings of 4th International Conference on Knowledge Discovery and Data Mining (KDD-98). AAAI Press, August 1998.
- 4. I. P. Felligi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183-1210, 1969..
- 5. Jerome H. Freidman, Jon Louis Bentley, Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software (TOMS), v.3 n.3, p.209-226, Sept. 1977 doi:10.1145/355744.355745.
- 6. C. Lee Giles, Kurt D. Bollacker, Steve Lawrence, CiteSeer: an automatic citation indexing system, Proceedings of the third ACM conference on Digital libraries, p.89-98, June 23-26, 1998, Pittsburgh, Pennsylvania, United States doi:10.1145/276675.276685.
- 7. Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States
- 8. H. Hirsh. Integrating mulitple sources of information in text classification using whril. In Snowbird Learning Conference, April 2000.
- 9. J. Hylton. Identifying and merging related bibliographic records. MIT LCS Masters Thesis, 1996.
- 10. B. Kilss and W. Alvey, editors. Record Linkage Techniques-1985, (1985). Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http://www.fcsm.gov/.
- 11. (McCallum et al., 2000) ⇒ Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. (2000). “Automating the Construction of Internet Portals with Machine Learning" In: Information Retrieval, 3(2). (doi:10.1023/A:1009953814988).
- 12. A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/ mccallum/bow, 1996.
- 13. Alvaro E. Monge and C. Elkan. The field-matching problem: algorithm and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
- 14. Alvaro E. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, May 1997.
- 15. Andrew W. Moore, Very fast EM-based mixture model clustering using multiresolution kd-trees, Proceedings of the 1998 conference on Advances in Neural Information Processing Systems II, p.543-549, July 1999
- 16. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954-959, 1959.
- 17. S. Omohundro. Five balltree construction algorithms. Technical report 89-063, International Computer Science Institute, Berkeley, California, 1989.
- 18. K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, 1998.
- 19. Gerard M. Salton, Christopher Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management: an International Journal, v.24 n.5, p.513-523, 1988 doi:10.1016/0306-4573(88)90021-0
- 20. M. Sankaran, S. Suresh, M. Wong, and D. Nesamoney. Method for incremental aggregation of dynamically increasing database data sets. U.S. Patent 5,794,246, 1998.
- 21. D. Sanko and J. B. Kruskal. Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
- 22. John W. Tukey and J. O. Pedersen. Method and apparatus for information access employing overlapping clusters. U.S. Patent 5,787,422, 1998.
- 23. Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD Conference, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada,

Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|

2000 EfficientClusteringOfHighDimDataSetsWithAppsToRefMatching | Andrew McCallum Kamal Nigam Lyle H. Ungar | Efficient Clustering of High-dimensional Data Sets with Application to Reference Matching | Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | http://www-2.cs.cmu.edu/~knigam/papers/canopy-kdd00.pdf | 10.1145/347090.347123 | 2000 |

Author | Andrew McCallum +, Kamal Nigam + and Lyle H. Ungar + |

doi | 10.1145/347090.347123 + |

journal | Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining + |

title | Efficient Clustering of High-dimensional Data Sets with Application to Reference Matching + |

titleUrl | http://www-2.cs.cmu.edu/~knigam/papers/canopy-kdd00.pdf + |

year | 2000 + |