1999 AutomaticSubspClustOfHighDimData

From GM-RKB

Subject Headings: Subspace Clustering Algorithm, High-Dimensional Data, Patent.

Notes

  • Assignee: International Business Machines Corporation

Cited By

~1303 http://scholar.google.com/scholar?cites=6578639591891317356

Quotes

Abstract

  • A method for finding clusters of units in high-dimensional data having the steps of determining dense units in selected subspaces within a data space of the high-dimensional data, determining each cluster of dense units that are connected to other dense units in the selected subspaces within the data space, determining maximal regions covering each cluster of connected dense units, determining a minimal cover for each cluster of connected dense units, and identifying the minimal cover for each cluster of connected dense units.
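The first two steps of the abstract (finding dense units, then grouping connected dense units into clusters) can be sketched for a single, already-selected subspace. This is only an illustrative sketch under stated assumptions, not the patented procedure: the function names, the grid resolution `xi`, the density threshold `tau` (taken here as a fraction of all points), and the rule that units are connected when they share a face are all assumptions introduced for the example. The search over subspaces and the maximal-region and minimal-cover steps are omitted.

```python
from collections import Counter, deque

def dense_units(points, dims, xi, tau, lo=0.0, hi=1.0):
    """Return grid cells (units) of subspace `dims` holding more than tau * len(points) points.

    Assumes every coordinate lies in [lo, hi]; each axis is cut into xi equal intervals.
    """
    width = (hi - lo) / xi
    counts = Counter()
    for p in points:
        # Map each selected coordinate to an interval index, clamping the upper edge.
        cell = tuple(min(int((p[d] - lo) / width), xi - 1) for d in dims)
        counts[cell] += 1
    threshold = tau * len(points)
    return {cell for cell, n in counts.items() if n > threshold}

def connected_clusters(units):
    """Group dense units into clusters; two units are neighbours if they share a face."""
    remaining = set(units)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster, queue = {start}, deque([start])
        while queue:
            cell = queue.popleft()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    # Neighbour differs by one interval along exactly one axis.
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in remaining:
                        remaining.remove(nb)
                        cluster.add(nb)
                        queue.append(nb)
        clusters.append(cluster)
    return clusters

# Hypothetical 3-attribute data; clustering is done in the subspace of attributes 0 and 1.
points = [
    (0.05, 0.05, 0.9), (0.08, 0.02, 0.1), (0.15, 0.05, 0.5), (0.12, 0.08, 0.3),
    (0.85, 0.85, 0.2), (0.88, 0.82, 0.7), (0.55, 0.35, 0.4),
]
units = dense_units(points, dims=(0, 1), xi=10, tau=0.2)
clusters = connected_clusters(units)
```

In this toy data the third attribute is spread out and plays no role, which is the situation the method targets: clusters that exist only in a subset of the attributes.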

Description

  • Clustering is a descriptive task associated with data mining that identifies homogeneous groups of objects in a dataset. Clustering techniques have been studied extensively in statistics, pattern recognition, and machine learning. Examples of clustering applications include customer segmentation for database marketing, identification of sub-categories of spectra from the database of infra-red sky measurements, and identification of areas of similar land use in an earth observation database.
  • Clustering techniques can be broadly classified into partitional techniques and hierarchical techniques. Partitional clustering partitions a set of objects into K clusters such that the objects in each cluster are more similar to each other than to objects in different clusters. For partitional clustering, the value of K can be specified by a user, and a clustering criterion must be adopted, such as a mean square error criterion, like that disclosed by P. H. Sneath et al., Numerical Taxonomy, Freeman, 1973. Popular K-means methods, such as the FastClust in SAS Manual, 1995, from the SAS Institute, iteratively determine K representatives that minimize the clustering criterion and assign each object to the cluster whose representative is closest to the object. Enhancements to the partitional clustering approach for working on large databases have been developed, such as CLARANS, as disclosed by R. T. Ng et al., Efficient and effective clustering methods for spatial data mining, Proceedings of the VLDB Conference, Santiago, Chile, September 1994; Focussed CLARANS, as disclosed by M. Ester et al., A database interface for clustering in large spatial databases, Proceedings of the 1st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995; and BIRCH, as disclosed by T. Zhang et al., BIRCH: An efficient data clustering method for very large databases, Proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.
  • Hierarchical clustering produces a nested sequence of partitions. Agglomerative hierarchical clustering starts by placing each object in its own atomic cluster and then merges the atomic clusters into larger and larger clusters until all objects are in a single cluster. Divisive hierarchical clustering reverses the process by starting with all objects in a single cluster and subdividing it into smaller pieces. For theoretical and empirical comparisons of hierarchical clustering techniques, see, for example, A. K. Jain et al., Algorithms for Clustering Data, Prentice Hall, 1988; P. Mangiameli et al., Comparison of some neural network and hierarchical clustering methods, European Journal of Operational Research, 93(2):402-417, September 1996; P. Michaud, Four clustering techniques, FGCS Journal, Special Issue on Data Mining, 1997; and M. Zait et al., A Comparative study of clustering methods, FGCS Journal, Special Issue on Data Mining, 1997.
  • Emerging data mining applications place special requirements on clustering techniques, such as the ability to handle high dimensionality, assimilation of cluster descriptions by users, description minimization, and scalability and usability. Regarding high dimensionality, an object typically has dozens of attributes, and the domains of those attributes are large. Clusters formed in a high-dimensional data space are not likely to be meaningful clusters because the expected average density of points anywhere in the high-dimensional data space is low. The requirement for high dimensionality in a data mining application is conventionally addressed by requiring a user to specify the subspace for cluster analysis. For example, the IBM data mining product, Intelligent Miner described in the IBM Intelligent Miner User's Guide, version 1 release 1, SH12-6213-00 edition, July 1996, and incorporated by reference herein, allows specification of "active" attributes for defining a subspace in which clusters are found. This approach is effective when a user can correctly identify appropriate attributes for clustering.
  • A variety of approaches for reducing dimensionality of a data space have been developed. Classical statistical techniques include principal component analysis and factor analysis, both of which reduce dimensionality by forming linear combinations of features. For example, see R. O. Duda et al., Pattern Classification and Scene Analysis, John Wiley and Sons, 1973, and K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990. For the principal component analysis technique, also known as Karhunen-Loeve expansion, a lower-dimensional representation is found that accounts for the variance of the attributes, whereas the factor analysis technique finds a representation that accounts for the correlations among the attributes. For an evaluation of different feature selection methods, primarily for image classification, see A. Jain et al., Algorithms for feature selection: An evaluation, Technical report, Department of Computer Science, Michigan State University, East Lansing, Mich., 1996. Unfortunately, dimensionality reductions obtained using these conventional approaches conflict with the requirements placed on the assimilation aspects of data mining.
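The point about linear combinations can be made concrete with a small sketch of principal component analysis. The code below is an illustration under assumptions, not anything taken from the patent: the function name and the toy data are hypothetical, and power iteration is swapped in for a full eigendecomposition to keep the example free of external libraries. It finds the direction of greatest variance and projects the data onto it.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, projections) for the leading principal component of `rows`.

    Uses power iteration on the sample covariance matrix (an assumption made
    for brevity; a library eigensolver would normally be used instead).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered attributes.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: the start vector carries no variance
        v = [x / norm for x in w]
    # Each new coordinate is a weighted sum of ALL original attributes.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections

# Hypothetical two-attribute records in which the second attribute is twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projections = first_principal_component(rows)
```

Here `direction` mixes both attributes, so a cluster found along the projected axis is described in terms of a blend of attributes rather than ranges of the original ones, which is the assimilation problem the passage above refers to.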

References



  • Prabhakar Raghavan, Johannes Ernst Gehrke, Dimitrios Gunopulos, and Rakesh Agrawal. (1999). "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications." http://www.google.com/patents?vid=USPAT6003029