Clustering Algorithm

Jump to: navigation, search

A clustering algorithm is a unsupervised learning algorithm that can solve a clustering task.



    • Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are probably a few dozen (if not over 100) published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms.

      There is no objectively "correct" clustering algorithm, but as it was noted, "clustering is in the eye of the beholder." The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. It should be noted that an algorithm that is designed for one kind of models has no chance on a data set that contains a radically different kind of models. For example, k-means cannot find non-convex clusters.


  • (Wikipedia, 2009) ⇒
    • Data clustering algorithms can be hierarchical. Hierarchical algorithms find successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
    • Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.
    • Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
    • Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
    • Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.






  • (Berkhin, 2002) ⇒ Pavel Berkhin. (2002). “A Survey of Clustering Data Mining Techniques." Technical Report, Accrue Software.
    • For the reader’s convenience we provide a classification of clustering algorithms closely followed by this survey:
      • Hierarchical methods: Agglomerative algorithms, Divisive algorithms
      • Partitioning relocation methods: Probabilistic clustering, k-medoids methods, k-means methods
      • Density-based partitioning methods: Density-based connectivity clustering, Density functions clustering, Grid-based methods
      • Methods based on co-occurrence of categorical data
      • Other clustering techniques: Constraint-based clustering, Graph partitioning, Clustering algorithms and supervised learning, Clustering algorithms in machine learning
      • Scalable clustering algorithms
      • Algorithms for high-dimensional data: Subspace clustering, Coclustering techniques
    • The properties of clustering algorithms of concern in data mining include:
      • Type of attributes an algorithm can handle
      • Scalability to large datasets
      • Ability to work with high-dimensional data
      • Ability to find clusters of irregular shape
      • Handling outliers
      • Time complexity (we often simply use the term complexity)
      • Data order dependency
      • Labeling or assignment (hard or strict vs. soft or fuzzy)
      • Reliance on a priori knowledge and user-defined parameters
      • Interpretability of results


  • J. F. Hair et al. (1998). “Multivariate Data Analysis, 5th edition. Chapter 9, pages 469-517.
  • L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, March 1990.





  • MacQueen, J. B. (1967). “Some methods for classification and analysis of multivariate observations.” In: Proceedings of the Fifth Symposium on Math, Statistics, and Probability (pp. 281{297).



  • H. Steinhaus. (1956). “Sur la division des corp materiels en parties.” In: Bull. Acad. Polon. Sci., C1. III vol IV.