2001 CureAnEfficientClusteringAlgori

(Guha et al., 2001) ⇒ Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. (2001). “Cure: An Efficient Clustering Algorithm for Large Databases.” In: Information Systems Journal, 26(1). doi:10.1016/S0306-4379(01)00008-4

Subject Headings: CURE Clustering Algorithm, Agglomerative Clustering Algorithm, ROCK Clustering Algorithm.

Notes

This is a journal paper version of (Guha et al., 1998) in SIGMOD 1998 http://dx.doi.org/10.1145/276304.276312

Cited By

2005

(Xy & Wunsch II, 2005) ⇒ Rui Xu, and Donald Wunsch II. (2005). “Survey of Clustering Algorithms.” In: IEEE Transactions on Neural Networks, 16(3). doi:10.1109/TNN.2005.845141
- QUOTE: Noticing the restriction of centroid-based HC, which is unable to identify arbitrary cluster shapes, Guha, Rastogi, and Shim developed a HC algorithm, called CURE, to explore more sophisticated cluster shapes [116]. The crucial feature of CURE lies in the usage of a set of well-scattered points to represent each cluster, which makes it possible to find rich cluster shapes other than hyperspheres and avoids both the chaining effect [88] of the minimum linkage method and the tendency to favor clusters with similar sizes of centroid. These representative points are further shrunk toward the cluster centroid according to an adjustable parameter in order to weaken the effects of outliers. CURE utilizes random sample (and partition) strategy to reduce computational complexity. Guha et al. also proposed another agglomerative HC algorithm, ROCK, to group data with qualitative attributes [117]. They used a novel measure “link” to describe the relation between a pair of objects and their common neighbors. Like CURE, a random sample strategy is used to handle large data sets. Chameleon is constructed from graph theory and will be discussed in Section II-E.

2002

(Berkhin, 2002) ⇒ Pavel Berkhin. (2002). “A Survey of Clustering Data Mining Techniques." Technical Report, Accrue Software.
- QUOTE: Guha et al. [207] introduced the hierarchical agglomerative clustering algorithm CURE (Clustering Using REpresentatives). This algorithm has a number of novel and important features. CURE takes special steps to handle outliers and to provide labeling in the assignment stage. It also uses two techniques to achieve scalability: data sampling (Sect. 8), and data partitioning. CURE creates p partitions, so that fine granularity clusters are constructed in partitions first. A major feature of CURE is that it represents a cluster by a fixed number, c, of points scattered around it. The distance between two clusters used in the agglomerative process is the minimum of distances between two scattered representatives. Therefore, CURE takes a middle approach between the graph (all-points) methods and the geometric (one centroid) methods. Single link and average link closeness are replaced by representatives’ aggregate closeness. Selecting representatives scattered around a cluster makes it possible to cover nonspherical shapes. As before, agglomeration continues until the requested number k of clusters is achieved. CURE employs one additional trick: originally selected scattered points are shrunk to the geometric centroid of the cluster by a user-specified factor α. Shrinkage decreases the impact of outliers; outliers happen to be located further from the cluster centroid than the other scattered representatives. CURE is capable of finding clusters of different shapes and sizes. Because CURE uses sampling, estimation of its complexity is not straightforward. For low-dimensional data, Guha et al. provide a complexity estimate of O(N2 sample) defined in terms of the sample size. More exact bounds depend on the input parameters, which include the shrink factor</math>α</math>, the number of representative points c, the number of partitions p, as well as the sample size. Figure 1a illustrates agglomeration in CURE. Three clusters, each with three representatives, are shown before and after the merge and shrinkage. The two closest representatives are connected. While the CURE algorithm works with numerical attributes (particularly low-dimensional spatial data), …

Quotes

Author Keywords

Data Mining; Knowledge Discovery; Clustering Algorithms

Abstract

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

References

;

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2001 CureAnEfficientClusteringAlgori	Sudipto Guha Rajeev Rastogi Kyuseok Shim			Cure: An Efficient Clustering Algorithm for Large Databases				10.1016/S0306-4379(01)00008-4		2001