2002 SurveyOfClusteringDMTechniques

From GM-RKB

Subject Headings: Clustering Algorithm, Survey Paper, k-Medoids, k-Means.

Notes

Cited By

Quotes

Abstract

Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data. Clustering is therefore related to many disciplines and plays an important role in a broad range of applications. The applications of clustering usually deal with large datasets and data with many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

1 Introduction

We provide a comprehensive review of different clustering techniques in data mining. Clustering refers to the division of data into groups of similar objects. Each group, or cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. When representing a quantity of data with a relatively small number of clusters, we achieve some simplification, at the price of some loss of detail (as in lossy data compression, for example). Clustering is a form of data modeling, which puts it in a historical perspective rooted in mathematics and statistics. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Clustering as applied to data mining applications encounters three additional complications: (a) large databases, (b) objects with many attributes, and (c) attributes of different types. These complications tend to impose severe computational requirements that present real challenges to classic clustering algorithms. These challenges led to the emergence of powerful, broadly applicable data mining clustering methods developed on the foundation of classic techniques. These clustering methods are the subject of this survey.

1.1 Notations

To fix the context and clarify terminology, consider a dataset X consisting of data points (which may in turn represent objects, instances, cases, patterns, tuples, transactions, and so forth) [math]\displaystyle{ x_i = (x_{i1}, \ldots, x_{id}) }[/math], [math]\displaystyle{ i = 1:N }[/math], in attribute space [math]\displaystyle{ A }[/math], where each component [math]\displaystyle{ x_{il} \in A_l }[/math], [math]\displaystyle{ l = 1:d }[/math], is a numerical or a nominal categorical attribute (which may represent a feature, variable, dimension, component, or field). For a discussion of attribute data types see [Han & Kamber 2001]. This point-by-attribute data format conceptually corresponds to an [math]\displaystyle{ N \times d }[/math] matrix and is used by the majority of algorithms reviewed later. However, data of other formats, such as variable-length sequences and heterogeneous data, are not uncommon.
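The point-by-attribute format is easy to picture in code. The following is our own minimal sketch, not part of the survey; it assumes NumPy and uses synthetic values:

```python
# A dataset X of N data points with d attributes, stored as an N x d matrix
# (the point-by-attribute format). Values are synthetic, for illustration only.
import numpy as np

N, d = 6, 3                              # N data points, d attributes
X = np.random.default_rng(0).normal(size=(N, d))
x_i = X[0]                               # one data point x_i = (x_i1, ..., x_id)
x_il = X[0, 2]                           # one component x_il from attribute A_l
```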

The simplest subset in an attribute space is a direct Cartesian product of subranges [math]\displaystyle{ C = \prod C_l \subset A }[/math], [math]\displaystyle{ C_l \subset A_l }[/math], called a segment (or a cube, cell, or region). A unit is an elementary segment whose subranges consist of a single category value or a small numerical bin. Describing the number of data points per unit represents an extreme case of clustering, a histogram. The histogram is a very expensive representation and not a very revealing one. User-driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain subdomains. Unlike segmentation, clustering is assumed to be automatic, and so it is unsupervised in the machine learning sense.
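As a small illustration of units and histograms (our sketch, not the survey's; the bin counts are arbitrary), the code below cuts each numerical attribute into bins and counts the data points that fall into every elementary cell:

```python
# Population of elementary units: each attribute is split into bins, and the
# d-dimensional histogram counts the data points per resulting cell (unit).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))             # N = 1000 points, d = 2
counts, edges = np.histogramdd(X, bins=(10, 10))
print(counts.shape)                        # (10, 10): a grid of 100 units
print(int(counts.sum()))                   # 1000: every point lands in some unit
```

Even in this tiny example, 100 counts are needed to describe 1,000 points in just two dimensions, and the number of units grows exponentially with the number of attributes, which is why the histogram is an expensive and unrevealing representation.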

The goal of clustering is to assign data points to a finite system of [math]\displaystyle{ k }[/math] subsets (clusters). These subsets do not intersect (although this requirement is sometimes violated in practice), and their union equals the full dataset, with the possible exception of outliers.
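This goal can be stated operationally. The sketch below is ours (the labels are invented for illustration): a hard assignment gives every point one of k cluster labels, with -1 marking outliers, so the clusters are disjoint by construction and, together with the outliers, cover the dataset:

```python
# A hard clustering as a label vector: clusters are pairwise disjoint, and
# cluster members plus outliers account for every data point.
import numpy as np

labels = np.array([0, 0, 1, 2, 1, -1, 2])   # assignment for N = 7 points; -1 = outlier
k = 3
clusters = [np.flatnonzero(labels == j) for j in range(k)]
outliers = np.flatnonzero(labels == -1)
assert sum(len(c) for c in clusters) + len(outliers) == len(labels)
```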

1.2 Plan of Further Presentation

For the reader’s convenience we provide a classification of clustering algorithms closely followed by this survey:

  • Hierarchical methods
    • Agglomerative algorithms
    • Divisive algorithms
  • Partitioning relocation methods
    • Probabilistic clustering
    • k-medoids methods
    • k-means methods
  • Density-based partitioning methods
    • Density-based connectivity clustering
    • Density functions clustering
  • Grid-based methods
  • Methods based on co-occurrence of categorical data
  • Other clustering techniques
    • Constraint-based clustering
    • Graph partitioning
    • Clustering algorithms and supervised learning
    • Clustering algorithms in machine learning
  • Scalable clustering algorithms
  • Algorithms for high-dimensional data
    • Subspace clustering
    • Coclustering techniques

1.3 Important Issues

The properties of clustering algorithms of concern in data mining include:

  • Type of attributes an algorithm can handle
  • Scalability to large datasets
  • Ability to work with high-dimensional data
  • Ability to find clusters of irregular shape
  • Handling outliers
  • Time complexity (we often simply use the term complexity)
  • Data order dependency
  • Labeling or assignment (hard or strict vs. soft or fuzzy)
  • Reliance on a priori knowledge and user-defined parameters
  • Interpretability of results

2 Hierarchical Clusters of Arbitrary Shapes

For spatial data, linkage metrics based on Euclidean distance naturally generate clusters of convex shapes. Meanwhile, visual inspection of spatial images frequently reveals clusters with more complex shapes.

Guha et al. (1998) introduced the hierarchical agglomerative clustering algorithm CURE (Clustering Using REpresentatives). This algorithm has a number of novel and important features. CURE takes special steps to handle outliers and to provide labeling in the assignment stage. It also uses two techniques to achieve scalability: data sampling (Sect. 8) and data partitioning. CURE creates p partitions, so that fine-granularity clusters are constructed within partitions first. A major feature of CURE is that it represents a cluster by a fixed number, c, of points scattered around it. The distance between two clusters used in the agglomerative process is the minimum of the distances between their scattered representatives. Therefore, CURE takes a middle approach between the graph (all-points) methods and the geometric (one-centroid) methods. Single-link and average-link closeness are replaced by the representatives' aggregate closeness. Selecting representatives scattered around a cluster makes it possible to cover nonspherical shapes. As before, agglomeration continues until the requested number k of clusters is achieved. CURE employs one additional trick: the originally selected scattered points are shrunk toward the geometric centroid of the cluster by a user-specified factor α. Shrinkage decreases the impact of outliers, which happen to be located further from the cluster centroid than the other scattered representatives. CURE is capable of finding clusters of different shapes and sizes. Because CURE uses sampling, estimation of its complexity is not straightforward. For low-dimensional data, Guha et al. provide a complexity estimate of [math]\displaystyle{ O(N_{sample}^2) }[/math] defined in terms of the sample size. More exact bounds depend on the input parameters, which include the shrink factor α, the number of representative points c, the number of partitions p, as well as the sample size. Figure 1a illustrates agglomeration in CURE. Three clusters, each with three representatives, are shown before and after the merge and shrinkage. The two closest representatives are connected.
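The agglomeration scheme just described can be sketched compactly. The code below is our own illustration, not Guha et al.'s implementation: it omits CURE's sampling, partitioning, outlier elimination, and heap/k-d-tree machinery, and the helper names (select_representatives, shrink, cure) are ours. It assumes NumPy and rescans all cluster pairs at every merge, so it is far slower than the real algorithm:

```python
import numpy as np

def shrink(points, alpha):
    """Move each representative toward the cluster centroid by factor alpha."""
    centroid = points.mean(axis=0)
    return points + alpha * (centroid - points)

def select_representatives(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    reps = [points[0]]
    while len(reps) < min(c, len(points)):
        dist = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[dist.argmax()])
    return np.array(reps)

def cluster_distance(reps_a, reps_b):
    """CURE's inter-cluster distance: the minimum over representative pairs."""
    diff = reps_a[:, None, :] - reps_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min()

def cure(X, k, c=5, alpha=0.3):
    """Agglomerate until k clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(X))]
    reps = [shrink(select_representatives(X[m], c), alpha) for m in clusters]
    while len(clusters) > k:
        # Naive scan for the closest pair (CURE itself uses a heap and a k-d tree).
        pairs = [(cluster_distance(reps[i], reps[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] += clusters.pop(j)
        reps.pop(j)
        members = X[np.array(clusters[i])]
        reps[i] = shrink(select_representatives(members, c), alpha)
    return clusters
```

On a small numeric array, cure(X, k=3) returns the index sets of the three final clusters. Shrinking the scattered points by α is what damps the influence of outlying representatives, as described above; the complexity quoted by Guha et al. relies on maintaining the closest pair in a heap with nearest representatives found via a k-d tree, rather than the naive pairwise scan used here.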

While the CURE algorithm works with numerical attributes (particularly low-dimensional spatial data), …

References

  • AGGARWAL, C.C., HINNEBURG, A., and KEIM, D.A. (2000). On the surprising behavior of distance metrics in high dimensional space. IBM Research report, RC 21739.
  • AGGARWAL, C.C., PROCOPIUC, C., WOLF, J.L., YU, P.S., and PARK, J.S. 1999a. Fast algorithms for projected clustering. In: Proceedings of the ACM SIGMOD Conference, 61-72, Philadelphia, PA.
  • AGGARWAL, C.C., WOLF, J.L., and YU, P.S. 1999b. A new method for similarity indexing of market basket data. In: Proceedings of the ACM SIGMOD Conference, 407-418, Philadelphia, PA.
  • AGGARWAL, C.C. and YU, P.S. (2000). Finding generalized projected clusters in high dimensional spaces. SIGMOD Record, 29, 2, 70-92.
  • AGRAWAL, R., FALOUTSOS, C., and SWAMI, A. (1993). Efficient similarity search in sequence databases. In: Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, Chicago, IL.
  • AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., and RAGHAVAN, P. (1998). Automatic Subspace Clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD Conference, 94-105, Seattle, WA.
  • AL-SULTAN, K. (1995). A Tabu search approach to the clustering problem. Pattern Recognition, 28, 9, 1443-1451.
  • ANDERBERG, M. 1973. Cluster Analysis and Applications. Academic Press, New York.
  • ANKERST, M., BREUNIG, M., KRIEGEL, H.-P., and SANDER, J. (1999). OPTICS: Ordering points to identify clustering structure. In: Proceedings of the ACM SIGMOD Conference, 49-60, Philadelphia, PA.
  • ARABIE, P. and HUBERT, L.J. (1996). An overview of combinatorial data analysis, in: Arabie, P., Hubert, L.J., and Soete, G.D. (Eds.) Clustering and Classification, 5-63, World Scientific Publishing Co., NJ.
  • ARSLAN, A.N. and EGECIOGLU, O. (2000). Efficient algorithms for normalized edit distance. Journal of Discrete Algorithms, 1, 1.
  • BABU, G.P. and MURTY, M.N. (1993). A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recogn. Lett. 14, 10, 763-769.
  • BABU, G.P. and MURTY, M.N. (1994). Clustering with evolution strategies. Pattern Recognition, 27, 2, 321-329.
  • BAEZA-YATES, R. (1992). Introduction to data structures and algorithms related to information retrieval. In Frakes, W.B. and Baeza-Yates, R. (Eds.) Information Retrieval, Data Structures and Algorithms, 13-27, Prentice-Hall.
  • BAKER, L.D. and Andrew McCallum (1998). Distributional clustering of words for text classification. In: Proceedings of the 21st ACM SIGIR Conference, Melbourne, Australia.
  • BALL, G. and HALL, D. 1965. ISODATA, a novel method of data analysis and classification. Technical Report AD-699616, SRI, Stanford, CA.
  • BANERJEE, A. and GHOSH, J. (2002). On scaling up balanced clustering algorithms. In: Proceedings of the 2nd SIAM ICDM, 333-349, Arlington, VA.
  • BANFIELD, J. and RAFTERY, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.
  • BARBARA, D. and CHEN, P. (2000). Using the fractal dimension to cluster datasets. In: Proceedings of the 6th ACM SIGKDD, 260-264, Boston, MA.
  • BECHER, J., BERKHIN, P., and FREEMAN, E. (2000). Automating exploratory data analysis for efficient data mining. In: Proceedings of the 6th ACM SIGKDD, 424-429, Boston, MA.
  • BECKMANN, N., KRIEGEL, H-P., SCHNEIDER, R., and SEEGER, B. (1990). The R*-tree: An efficient access method for points and rectangles. In: Proceedings of International Conference on Geographic Information Systems, Ottawa, Canada.
  • BERCHTOLD, S., BÖHM, C., and KRIEGEL, H-P. (1998). The Pyramid-technique: towards breaking the curse of dimensionality. In: Proceedings of the ACM SIGMOD Conference, 142-153, Seattle, WA.
  • BERKHIN, P. and BECHER, J. (2002). Learning Simple Relations: Theory and Applications. In: Proceedings of the 2nd SIAM ICDM, 420-436, Arlington, VA.
  • BEN-DOR, A. and YAKHINI, Z. (1999). Clustering gene expression patterns. In: Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB 99), 11-14, Lyon, France.
  • BERRY, M.W. and BROWNE, M. (1999). Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM.
  • BERRY, M., DUMAIS, S., LANDAUER, T., and O'BRIEN, G. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37, 4, 573-595.
  • BOTTOU, L. and BENGIO, Y. (1995). Convergence properties of the K-means algorithms. In Tesauro, G. and Touretzky, D. (Eds.) Advances in Neural Information Processing Systems 7, 585-592, The MIT Press, Cambridge, MA.
  • BEYER, K., GOLDSTEIN, J., RAMAKRISHNAN, R., and SHAFT, U. (1999). When is nearest neighbor meaningful? In: Proceedings of the 7th ICDT, Jerusalem, Israel.
  • BEZDEK, D. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, NY.
  • BOCK, H.H. (1996). Probability models in partitional cluster analysis. In Ferligoj, A. and Kramberger, A. (Eds.) Developments in Data Analysis, 3-25, Slovenia.
  • BOLEY, D.L. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2, 4, 325-344.
  • BOZDOGAN, H. (1983). Determining the number of component clusters in the standard multivariate normal mixture model using model-selection criteria. TR UIC/DQM/A83-1, Quantitative Methods Department, University of Illinois, Chicago, IL.
  • BOZDOGAN, H. (1994). Mixture-model cluster analysis using model selection criteria and a new information measure of complexity. In: Proceedings of the 1st US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, 69-113, Dordrecht, Netherlands.
  • BRADLEY, P. S., BENNETT, K. P., and DEMIRIZ, A. (2000). Constrained k-means clustering. Technical Report MSR-TR-2000-65. Microsoft Research, Redmond, WA.
  • BRADLEY, P. and Usama M. Fayyad (1998). Refining initial points for k-means clustering. In: Proceedings of the 15th ICML, 91-99, Madison, WI.
  • BRADLEY, P., Usama M. Fayyad, and REINA, C. (1998). Scaling clustering algorithms to large databases. In: Proceedings of the 4th ACM SIGKDD, 9-15, New York, NY.
  • BREUNIG, M., KRIEGEL, H-P., KROGER, P., and SANDER, J. (2001). Data Bubbles: quality preserving performance boosting for hierarchical clustering. In: Proceedings of the ACM SIGMOD Conference, Santa Barbara, CA.
  • BREUNIG, M.M., KRIEGEL, H.-P., NG, R.T., and SANDER, J. (2000). LOF: identifying density-based local outliers. In: Proceedings of the ACM SIGMOD Conference, 93-104, Dallas, TX.
  • BROWN, D. and HUNTLEY, C. (1991). A practical application of simulated annealing to clustering. Technical report IPC-TR-91-003, University of Virginia.
  • BUSYGIN, S., JACOBSEN, G., and KRÄMER, E. (2002). Double conjugated clustering applied to leukemia microarray data, 2nd SIAM ICDM, Workshop on clustering high dimensional data, Arlington, VA.
  • CADEZ, I., GAFFNEY, S., and Padhraic Smyth (2000). A general probabilistic framework for clustering individuals. Technical Report UCI-ICS 00-09.
  • CADEZ, I., Padhraic Smyth, and Heikki Mannila (2001). Probabilistic modeling of transactional data with applications to profiling, visualization, and prediction. In: Proceedings of the 7th ACM SIGKDD, 37-46, San Francisco, CA.
  • CARPENTER, G.A., GROSSBERG, S., and ROSEN, D.B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759-771.
  • CHEESEMAN, P. and STUTZ, J. (1996). Bayesian Classification (AutoClass): Theory and Results. In Usama M. Fayyad, Piatetsky-Shapiro, G., Padhraic Smyth, and Uthurusamy, R. (Eds.) Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press.
  • CHENG, C., FU, A., and ZHANG, Y. (1999). Entropy-based subspace clustering for mining numerical data. In: Proceedings of the 5th ACM SIGKDD, 84-93, San Diego, CA.
  • CHIU, T., FANG, D., CHEN, J., and Wang, Y. (2001). A Robust and scalable clustering algorithm for mixed type attributes in large database environments. In: Proceedings of the 7th ACM SIGKDD, 263-268, San Francisco, CA.
  • COOLEY, R., MOBASHER, B., and SRIVASTAVA, J. (1999). Data preparation for mining world wide web browsing. Journal of Knowledge Information Systems, 1, 1, 5-32.
  • CORTER, J. and GLUCK, M. (1992). Explaining basic categories: feature predictability and information. Psychological Bulletin, 111, 291-303.
  • COVER, T.M. and THOMAS, J.A. (1990). Elements of Information Theory. John Wiley & Sons, New York, NY.
  • CRISTOFOR, D. and SIMOVICI, D.A. (2002). An information-theoretical approach to clustering categorical databases using genetic algorithms. 2nd SIAM ICDM, Workshop on clustering high dimensional data, Arlington, VA.
  • CUTTING, D., KARGER, D., PEDERSEN, J., and TUKEY, J. (1992). Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th ACM SIGIR Conference, 318-329, Copenhagen, Denmark.
  • DANIEL, C. and WOOD, F.C. (1980). Fitting Equations To Data: Computer Analysis of Multifactor Data. John Wiley & Sons, New York, NY.
  • DAY, W. and EDELSBRUNNER, H. (1984). Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1, 7, 7-24.
  • DEFAYS, D. 1977. An efficient algorithm for a complete link method. The Computer Journal, 20, 364-366.
  • Arthur P. Dempster, LAIRD, N., and RUBIN, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1, 1-38.
  • DHILLON, I. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th ACM SIGKDD, 269-274, San Francisco, CA.
  • DHILLON, I., FAN, J., and GUAN, Y. (2001). Efficient clustering of very large document collections. In: Robert L. Grossman, C. Kamath, P. Kegelmeyer, Vipin Kumar, and R.R. Namburu (Eds.) Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers.
  • DHILLON, I., GUAN, Y., and KOGAN, J. (2002). Refining clusters in high dimensional data. 2nd SIAM ICDM, Workshop on clustering high dimensional data, Arlington, VA.
  • DHILLON, I., MALLELA, S., and KUMAR, R. (2002). Enhanced Word Clustering for Hierarchical Text Classification, In: Proceedings of the 8th ACM SIGKDD, 191-200, Edmonton, Canada.
  • DHILLON, I. and MODHA, D. (1999). A data clustering algorithm on distributed memory multiprocessors. 5th ACM SIGKDD, Large-scale Parallel KDD Systems Workshop, 245-260, San Diego, CA.
  • DUBES, R.C. (1993). Cluster Analysis and Related Issues. In Chen, C.H., Pau, L.F., and Wang, P.S. (Eds.) Handbook of Pattern Recognition and Computer Vision, 3-32, World Scientific Publishing Co., River Edge, NJ.
  • DUDA, R. and HART, P. 1973. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, NY.
  • DUMOUCHEL, W., VOLINSKY, C., JOHNSON, T., CORTES, C., and PREGIBON, D. (1999). Squashing flat files flatter. In: Proceedings of the 5th ACM SIGKDD, 6-15, San Diego, CA.
  • ENGLEMAN, L. and HARTIGAN, J. 1969. Percentage points of a test for clusters. Journal of the American Statistical Association, 64, 1647-1648.
  • ERTOZ, L., STEINBACH, M., and KUMAR, V. (2002). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, Technical Report.
  • Martin Ester, FROMMELT A., KRIEGEL H.-P., and SANDER J. (2000). Spatial data mining: database primitives, algorithms and efficient DBMS support. Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 4, 2-3, 193-216.
  • Martin Ester, KRIEGEL, H-P., SANDER, J. and XU, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM SIGKDD, 226-231, Portland, Oregon.
  • Martin Ester, KRIEGEL, H-P., and XU, X. (1995). A database interface for clustering in large spatial databases. In: Proceedings of the 1st ACM SIGKDD, 94-99, Montreal, Canada.
  • ESTIVILL-CASTRO, V. and LEE, I. (2000). AMOEBA: Hierarchical Clustering Based on Spatial Proximity Using Delaunay Diagram. In: Proceedings of the 9th International Symposium on Spatial Data Handling. Beijing, China.
  • EVERITT, B. (1993). Cluster Analysis (3rd ed.). Edward Arnold, London, UK.
  • FALOUTSOS, C. and LIN, K. (1995). FastMap: A fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. In: Proceedings of the ACM SIGMOD Conference, 163-174, San Jose, CA.
  • FALOUTSOS, C., RANGANATHAN, M., and MANOLOPOULOS, Y. (1994). Fast subsequence matching in time-series databases. In: Proceedings of the ACM SIGMOD Conference, 419-429, Minneapolis, MN.
  • FASULO, D. (1999). An analysis of recent work on clustering algorithms. Technical Report UW-CSE-01-03-02, University of Washington.
  • FISHER, D. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139-172.
  • FISHER, D. (1996). Iterative optimization and simplification of hierarchical clustering. Journal of Artificial Intelligence Research, 4, 147-179.
  • FORGY, E. 1965. Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics, 21, 768-780.
  • FORREST, S., HOFMEYR, S.A., and SOMAYAJI, A. (1997). Computer immunology. Communications of the ACM, 40, 88-96.
  • FOSS, A., WANG, W., and ZAANE, O. (2001). A non-parametric approach to Web log analysis. 1st SIAM ICDM, Workshop on Web Mining, 41-50, Chicago, IL.
  • FRALEY, C. and RAFTERY, A. (1999). MCLUST: Software for model-based cluster and discriminant analysis, Tech Report 342, Dept. Statistics, Univ. of Washington.
  • FRALEY, C. and RAFTERY, A. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41, 8, 578-588.
  • Jerome H. Friedman, BENTLEY, J.L., and FINKEL, R.A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software, 3, 3, 209-226.
  • K. FUKUNAGA. (1990). Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA.
  • GANTI, V., GEHRKE, J. and RAMAKRISHNAN, R. 1999a. CACTUS-Clustering Categorical Data Using Summaries. In: Proceedings of the 5th ACM SIGKDD, 73-83, San Diego, CA.
  • GANTI, V., RAMAKRISHNAN, R., GEHRKE, J., POWELL, A., and FRENCH, J. 1999b. Clustering large datasets in arbitrary metric spaces. In: Proceedings of the 15th ICDE, 502-511, Sydney, Australia.
  • GENNARI, J., Pat Langley, and FISHER, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11-61.
  • GERSHO, A. and GRAY, R. M. (1992). Vector Quantization and Signal Compression. Communications and Information Theory. Kluwer Academic Publishers, Norwell, MA.
  • GHOSH, J. (2002). Scalable Clustering Methods for Data Mining. In Nong Ye (Ed.) Handbook of Data Mining, Lawrence Erlbaum, to appear.
  • GHOSH, A.K., SCHWARTZBARD, A., and SCHATZ. M. (1999). Learning program behavior profiles for intrusion detection. In: Proceedings of the SANS Conference and Workshop on Intrusion Detection and Response, San Francisco, CA.
  • GIBSON, D., Jon Kleinberg, and RAGHAVAN, P. (1998). Clustering categorical data: An approach based on dynamic systems. In: Proceedings of the 24th International Conference on Very Large Databases, 311-323, New York, NY.
  • GOIL, S., NAGESH, H., and CHOUDHARY, A. (1999). MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report CPDC-TR-9906-010, Northwestern University.
  • GOLDBERG, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
  • GONZALEZ, T.F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 293-306.
  • GOVAERT, G. (1995). Simultaneous clustering of rows and columns. Control and Cybernetics, 24, 437-458.
  • GOWDA, K.C. and KRISHNA, G. 1978. Agglomerative clustering using the concept of mutual nearest neighborhood. Pattern Recognition, 10, 105-112.
  • (Guha et al., 1998) ⇒ Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. (1998). “CURE: an efficient clustering algorithm for large databases.” In: Proceedings of the 1998 ACM SIGMOD Conference (SIGMOD 1998).
  • GUHA, S., RASTOGI, R., and SHIM, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In: Proceedings of the 15th ICDE, 512-521, Sydney, Australia.
  • GURALNIK, V. and KARYPIS, G. (2001). A Scalable algorithm for clustering sequential data. IEEE ICDM 2001, Silicon Valley, CA.
  • GUTTMAN, A. (1984). R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference, 47-57, Boston, MA.
  • HALL, L.O., OZYURT, B., and BEZDEK, J.C. (1999). Clustering with a genetically optimized approach. IEEE Trans. on Evolutionary Computation, 3, 2, 103-112.
  • HAN, E-H., KARYPIS, G., KUMAR, V., and MOBASHER, B. (1997). Clustering based on association rule hypergraphs. ACM SIGMOD Conference, Data Mining Workshop (DMKD'97), Tucson, Arizona.
  • Jiawei Han and KAMBER, M. (2001). Data Mining. Morgan Kaufmann Publishers.
  • Jiawei Han, KAMBER, M., and TUNG, A. K. H. (2001). Spatial clustering methods in data mining: A survey. In Miller, H. and Jiawei Han (Eds.) Geographic Data Mining and Knowledge Discovery, Taylor and Francis.
  • HAREL, D. and KOREN, Y. (2001). Clustering spatial data using random walks, In Proceedings of the 7th ACM SIGKDD, 281-286. San Francisco, CA.
  • HARTIGAN, J. 1975. Clustering Algorithms. John Wiley & Sons, New York, NY.
  • HARTIGAN, J. and WONG, M. 1979. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, 100-108.
  • HEER, J. and CHI, E. (2001). Identification of Web user traffic composition using multimodal clustering and information scent. 1st SIAM ICDM, Workshop on Web Mining, 51-58, Chicago, IL.
  • HINNEBURG, A. and KEIM, D. (1998). An efficient approach to clustering large multimedia databases with noise. In: Proceedings of the 4th ACM SIGKDD, 58-65, New York, NY.
  • HINNEBURG, A. and KEIM, D. (1999). Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th Conference on VLDB, 506-517, Edinburgh, Scotland.
  • HUANG, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2, 3, 283-304.
  • HULTEN, G., SPENCER, L., and Pedro Domingos (2001). Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD, 97-106, San Francisco, CA.
  • JAIN, A. and DUBES, R. (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
  • JAIN, A.K. and FLYNN, P.J. 1996. Image segmentation using clustering. In Advances in Image Understanding: A Festschrift for Azriel Rosenfeld, IEEE Press, 65-83.
  • JAIN, A.K. and MAO, J. (1996). Artificial neural networks: A tutorial. IEEE Computer, 29, 3, 31-44.
  • JAIN, A.K, MURTY, M.N., and FLYNN P.J. (1999). Data clustering: a review. ACM Computing Surveys, 31, 3, 264-323.
  • JARVIS, R.A. and PATRICK, E.A. 1973. Clustering using a similarity measure based on shared nearest neighbors. IEEE Transactions on Computers, C-22, 11.
  • JEBARA, T. and JAAKKOLA, T. (2000). Feature selection and dualities in maximum entropy discrimination. In: Proceedings of the 16th UAI Conference, Stanford, CA.
  • JOLIFFE, I. (1986). Principal Component Analysis. Springer-Verlag, New York, NY.
  • KALTON, A., Pat Langley, WAGSTAFF, K., and YOO, J. (2001). Generalized clustering, supervised learning, and data assignment. In: Proceedings of the 7th ACM SIGKDD, 299-304, San Francisco, CA.
  • KANDOGAN, E. (2001). Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Proceedings of the 7th ACM SIGKDD, 107-116, San Francisco, CA.
  • KARYPIS, G., AGGARWAL, R., KUMAR, V., and SHEKHAR, S. (1997). Multilevel hypergraph partitioning: application in VLSI domain, In: Proceedings ACM/IEEE Design Automation Conference.
  • KARYPIS, G., HAN, E.-H., and KUMAR, V. 1999a. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling, COMPUTER, 32, 68-75.
  • KARYPIS, G., HAN, E.-H., and KUMAR, V. 1999b. Multilevel refinement for hierarchical clustering. Technical Report # 99-020.
  • KARYPIS, G. and HAN, E.-H. (2000). Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Technical Report TR-00-016, Department of Computer Science, University of Minnesota, Minneapolis.
  • KARYPIS, G. and KUMAR, V. (1999). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20, 1.
  • KASS, R. and RAFTERY, A. (1995). Bayes factors. Journal of Amer. Statistical Association, 90, 773-795.
  • KAUFMAN, L. and ROUSSEEUW, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, NY.
  • (Keogh et al., 2001c) ⇒ Eamonn Keogh, K. Chakrabarti, S. Mehrotra, and Michael J. Pazzani. (2001). “Locally adaptive dimensionality reduction for indexing large time series databases.” In: Proceedings of the ACM SIGMOD Conference (SIGMOD 2001).
  • (Keogh et al., 2001a) ⇒ Eamonn Keogh, K. Chakrabarti, Michael J. Pazzani, and S. Mehrotra, (2001). “Dimensionality reduction for fast similarity search in large time series databases.” In: Journal of Knowledge and Information Systems, 3, 3.
  • (Keogh et al., 2001b) ⇒ Eamonn Keogh, S. Chu, and Michael J. Pazzani. (2001). “Ensemble-index: A new approach to indexing large databases.” In: Proceedings of the 7th ACM SIGKDD (KDD-2001).
  • KERNIGHAN, B.W. and LIN, S. 1970. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49, 2, 291-307.
  • Teuvo Kohonen. (1990). The self-organizing map. Proceedings of the IEEE, 78, 9, 1464-1480.
  • Teuvo Kohonen. (2001). Self-Organizing Maps. Springer Series in Information Sciences, 30, Springer.
  • KOLATCH, E. (2001). Clustering Algorithms for Spatial Databases: A Survey. PDF is available on the Web.
  • KOLLER, D. and SAHAMI, M. (1996). Toward optimal feature selection. In: Proceedings of the 13th ICML, 284-292, Bari, Italy.
  • KNORR, E. and NG, R. (1998). Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the 24th Conference on VLDB, 392-403, New York, NY.
  • KNORR, E.M., NG, R.T., and ZAMAR, R.H. (2001). Robust Space Transformations for distance-based operations. In: Proceedings of the 7th ACM SIGKDD, 126-135, San Francisco, CA.
  • KRIEGEL H.-P., SEEGER B., SCHNEIDER R., and BECKMANN N. (1990). The R*-tree: an efficient access method for geographic information systems. In: Proceedings International Conference on Geographic Information Systems, Ottawa, Canada.
  • KULLBACK, S. and LEIBLER, R.A. 1951. On information and sufficiency. Annals of Mathematical Statistics, 22, 76-86.
  • LANCE, G. and WILLIAMS, W. 1967. A general theory of classificatory sorting strategies. Computer Journal, 9, 373-386.
  • LARSEN, B. and AONE, C. (1999). Fast and effective text mining using linear-time document clustering. In: Proceedings of the 5th ACM SIGKDD, 16-22, San Diego, CA.
  • LEE, C-Y. and ANTONSSON, E.K. (2000). Dynamic partitional clustering using evolution strategies. In: Proceedings of the 3rd Asia-Pacific Conference on Simulated Evolution and Learning, Nagoya, Japan.
  • LEE, W. and STOLFO, S. (1998). Data mining approaches for intrusion detection. In Proceedings of the 7th USENIX Security Symposium, San Antonio, TX.
  • LIEBOVITCH, L. and TOTH, T. (1989). A fast algorithm to determine fractal dimensions by box counting. Physics Letters, 141A, 8.
  • LIN, D. (1998). An information-theoretic definition of similarity. In: Proceedings of the 15th ICML, 296-304, Madison, WI.
  • LIU, B., XIA, Y., and YU, P.S. (2000). Clustering through decision tree construction. In SIGMOD 2000.
  • LIU, H. and SETIONO, R. (1996). A probabilistic approach to feature selection - a filter solution. In: Proceedings of the 13th ICML, 319-327, Bari, Italy.
  • MANNILA, H. and RUSAKOV, D. (2001). Decomposition of event sequences into independent components. In: Proceedings of the 1st SIAM ICDM, Chicago, IL.
  • MAO, J. and JAIN, A.K. (1996). A Self-organizing network for hyperellipsoidal clustering (HEC). IEEE Transactions on Neural Networks, 7, 1, 16-29.
  • MARDIA, K., KENT, J. and BIBBY, J. (1980). Multivariate Analysis. Academic Press.
  • MARROQUIN, J.L. and GIROSI, F. (1993). Some extensions of the k-means algorithm for image segmentation and pattern classification. A.I. Memo 1390. MIT, Cambridge, MA.
  • MASSART, D. and KAUFMAN, L. (1983). The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. John Wiley & Sons, New York, NY.
  • MCCALLUM, A., NIGAM, K., and Lyle H. Ungar (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD, 169-178, Boston, MA.
  • MCLACHLAN, G. and BASFORD, K. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, NY.
  • MCLACHLAN, G. and KRISHNAN, T. (1997). The EM Algorithm and Extensions. John Wiley & Sons, New York, NY.
  • MICHALSKI, R.S. and STEPP, R. (1983). Learning from observations: conceptual clustering. In Machine Learning: An Artificial Intelligence Approach. San Mateo, CA, Morgan Kaufmann.
  • MILLIGAN, G. and COOPER, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.
  • MIRKIN, B. (1996). Mathematical Classification and Clustering. Kluwer Academic Publishers.
  • Tom M. Mitchell. (1997). Machine Learning. McGraw-Hill, New York, NY.
  • MODHA, D. and SPANGLER, W. (2002). Feature weighting in k-means clustering. Machine Learning, 47.
  • MOORE, A. (1999). Very fast EM-based mixture model clustering using multiresolution kd-trees. Advances in Neural Information Processing Systems, 11.
  • Rajeev Motwani and RAGHAVAN, P. (1995). Randomized Algorithms. Cambridge University Press.
  • MURTAGH, F. (1983). A survey of recent advances in hierarchical clustering algorithms. Computer Journal, 26, 4, 354-359.
  • MURTAGH, F. (1985). Multidimensional Clustering Algorithms. Physica-Verlag, Vienna.
  • NAGESH, H., GOIL, S., and CHOUDHARY, A. (2001). Adaptive grids for clustering massive data sets, In: Proceedings of the 1st SIAM ICDM, Chicago, IL.
  • NG, R. and Jiawei Han (1994). Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th Conference on VLDB, 144-155, Santiago, Chile.
  • NISHISATO, S. (1980). Analysis of Categorical Data: Dual Scaling and Its Applications. University of Toronto.
  • OLIVER, J., BAXTER, R. and WALLACE, C. (1996). Unsupervised learning using MML. In Proceedings of the 13th ICML, Bari, Italy.
  • OLSON, C. (1995). Parallel algorithms for hierarchical clustering. Parallel Computing, 21, 1313-1325.
  • OYANAGI, S., KUBOTA, K., and NAKASE, A. (2001). Application of matrix clustering to Web log analysis and access prediction. 7th ACM SIGKDD, WEBKDD Workshop, San Francisco, CA.
  • PADMANABHAN, B. and TUZHILIN, A. (1999). Unexpectedness as a measure of interestingness in knowledge discovery. Decision Support Systems Journal, 27, 3, 303-318.
  • PADMANABHAN, B. and TUZHILIN, A. (2000). Small is beautiful: discovering the minimal set of unexpected patterns. In: Proceedings of the 6th ACM SIGKDD, 54-63, Boston, MA.
  • PELLEG, D. and MOORE, A. (1999). Accelerating exact k-means algorithms with geometric reasoning. In: Proceedings of the 5th ACM SIGKDD, 277-281, San Diego, CA.
  • PELLEG, D. and MOORE, A. (2000). X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In: Proceedings 17th ICML, Stanford University.
  • PIATETSKY-SHAPIRO, G. and MATHEUS, C.J. (1994). The interestingness of deviations. In Proceedings of the AAAI-94 Workshop on Knowledge Discovery in Databases.
  • RAMASWAMY, S., RASTOGI, R., and SHIM, K. (2000). Efficient algorithms for mining outliers from large data sets. SIGMOD Record, 29, 2, 427-438.
  • RAND, W.M. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846-850.
  • Philip Resnik (1995). Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of IJCAI-95, 448-453, Montreal, Canada.
  • RISSANEN, J. 1978. Modeling by shortest data description. Automatica, 14, 465-471.
  • RISSANEN J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., Singapore.
  • SANDER, J., ESTER, M., KRIEGEL, H.-P., and XU, X. (1998). Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. In Data Mining and Knowledge Discovery, 2, 2, 169-194.
  • SARAFIS, I., ZALZALA, A.M.S., and TRINDER, P.W. (2002). A genetic rule-based data clustering toolkit. To be published in Congress on Evolutionary Computation (CEC), Honolulu, USA.
  • SAVARESI, S. and BOLEY, D. (2001). On performance of bisecting k-means and PDDP. In Proceedings of the 1st SIAM ICDM, Chicago, IL.
  • SAVARESI, S.M., BOLEY, D.L., BITTANTI, S., and GAZZANIGA, G. (2002). Cluster selection in divisive clustering algorithms. In: Proceedings of the 2nd SIAM ICDM, 299-314, Arlington, VA.
  • SCHALKOFF, R. (1991). Pattern Recognition. Statistical, Structural and Neural Approaches. John Wiley & Sons, New York, NY.
  • SCHIKUTA, E. (1996). Grid-clustering: a fast hierarchical clustering method for very large data sets. In: Proceedings 13th International Conference on Pattern Recognition, 2, 101-105.
  • SCHIKUTA, E. and ERHART, M. (1997). The BANG-clustering system: grid-based data analysis. In: Proceedings of Advances in Intelligent Data Analysis, Reasoning about Data, 2nd International Symposium, 513-524, London, UK.
  • SCHWARZ, G. 1978. Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
  • (Scott, 1992) ⇒ David W. Scott. (1992). “Multivariate Density Estimation: Theory, Practice, and Visualization.” Wiley. ISBN:0471547700
  • SHEIKHOLESLAMI, G., CHATTERJEE, S., and ZHANG, A. (1998). WaveCluster: A multiresolution clustering approach for very large spatial databases. In: Proceedings of the 24th Conference on VLDB, 428-439, New York, NY.
  • SIBSON, R. 1973. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, 30-34.
  • SILBERSCHATZ, A. and TUZHILIN, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Eng., 8, 6.
  • SLONIM, N. and TISHBY, N. (2000). Document clustering using word clusters via the Information Bottleneck Method. In: Proceedings SIGIR, 208-215.
  • SLONIM, N. and TISHBY, N. (2001). The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research.
  • Padhraic Smyth (1998). Model selection for probabilistic clustering using cross-validated likelihood. ICS Tech Report 98-09, Statistics and Computing.
  • Padhraic Smyth (1999). Probabilistic model-based clustering of multivariate and sequential data. In: Proceedings of the 7th International Workshop on AI and Statistics, 299-304.
  • SPATH H. (1980). Cluster Analysis Algorithms. Ellis Horwood, Chichester, England.
  • STEINBACH, M., KARYPIS, G., and KUMAR, V. (2000). A comparison of document clustering techniques. 6th ACM SIGKDD, World Text Mining Conference, Boston, MA.
  • STREHL, A. and GHOSH, J. (2000). A scalable approach to balanced, high-dimensional clustering of market baskets, In: Proceedings of 17th International Conference on High Performance Computing, Springer LNCS, 525-536, Bangalore, India.
  • THOMOPOULOS, S., BOUGOULIAS, D., and WANN, C-D. (1995). Dignet: An unsupervised-learning clustering algorithm for clustering and data fusion. IEEE Trans. on Aerospace and Electr. Systems, 31, 1, 21-38.
  • TISHBY, N., Fernando Pereira, and BIALEK, W. (1999). The information bottleneck method. In: Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 368-377.
  • TUKEY, J.W. 1977. Exploratory Data Analysis. Addison-Wesley.
  • TUNG, A.K.H., HOU, J., and Jiawei Han (2001). Spatial clustering in the presence of obstacles. In: Proceedings of the 17th ICDE, 359-367, Heidelberg, Germany.
  • TUNG, A.K.H., NG, R.T., LAKSHMANAN, L.V.S., and Jiawei Han (2001). Constraint-based Clustering in Large Databases, In: Proceedings of the 8th ICDT, London, UK.
  • VOORHEES, E.M. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing and Management, 22, 6, 465-476.
  • WALLACE, C. and DOWE, D. (1994). Intrinsic classification by MML - the Snob program. In: Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, 37-44, UNE, World Scientific Publishing Co., Armidale, Australia.
  • WALLACE, C. and FREEMAN, P. (1987). Estimation and inference by compact coding. Journal of the Royal Statistical Society, Series B, 49, 3, 240-265.
  • WANN C.-D. and THOMOPOULOS, S.A. (1997). A Comparative study of self-organizing clustering algorithms Dignet and ART2. Neural Networks, 10, 4, 737-743.
  • WARD, J.H. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 301, 235-244.
  • WANG, W., YANG, J., and MUNTZ, R. (1997). STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd Conference on VLDB, 186-195, Athens, Greece.
  • WANG, W., YANG, J., and MUNTZ, R.R. (1998). PK-tree: a spatial index structure for high dimensional point data. In: Proceedings of the 5th International Conference of Foundations of Data Organization.
  • WANG, W., YANG, J., and MUNTZ, R.R. (1999). STING+: An approach to active spatial data mining. In: Proceedings 15th ICDE, 116-125, Sydney, Australia.
  • XU, X., ESTER, M., KRIEGEL, H.-P., and SANDER, J. (1998). A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of the 14th ICDE, 324-331, Orlando, FL.
  • YAO, A. (1982). On constructing minimum spanning trees in k-dimensional space and related problems. SIAM Journal on Computing, 11, 4, 721-736.
  • ZHANG, B. (2001). Generalized k-harmonic means - dynamic weighting of data in unsupervised learning. In: Proceedings of the 1st SIAM ICDM, Chicago, IL.
  • ZHANG, T., RAMAKRISHNAN, R. and LIVNY, M. (1996). BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD Conference, 103-114, Montreal, Canada.
  • ZHANG, T., Ramakrishnan, R., and LIVNY, M. (1997). BIRCH: A new data clustering algorithm and its applications. Journal of Data Mining and Knowledge Discovery, 1, 2, 141-182.
  • ZHANG, Y., FU, A.W., CAI, C.H., and Heng. P.-A. (2000). Clustering categorical data. In: Proceedings of the 16th ICDE, 305, San Diego, CA.

Pavel Berkhin. (2002). “A Survey of Clustering Data Mining Techniques.” http://www.springerlink.com/content/x321256p66512121/