Clustering Task
A clustering task is a mapping task that requires the creation of a cluster set (with item clusters whose members have low in-group variation and high out-group variation given some similarity function).
- AKA: Grouping, Partitioning, Cluster Analysis.
- Context:
- input:
- a Data Record Set.
- optional: the number [math]k[/math] of Clusters to be generated; a Similarity Function.
- output: Clustering Result/Cluster Set.
- Clustering Performance Metric: explainability.
- It can be solved by a Clustering System (that applies a clustering algorithm).
- It can range from being a Heuristic Clustering Task to being a Data-Driven Clustering Task (such as unsupervised clustering).
- It can range from being a Crisp-Clusters Clustering Task to being a Fuzzy Clustering Task.
- It can range from being a Low-Dimensional Clustering Task to being a High-Dimensional Clustering Task.
- It can range from being a Small-Dataset Clustering Task to being a Large-Dataset Clustering Task.
- It can range from being a One-Sided Clustering Task, to being a Two-Sided Clustering Task, to being an n-Sided Clustering Task.
- It can range from being a Partitional Clustering Task to being an Agglomerative Clustering Task.
- It can be the focus of a Clustering Discipline.
- It can range from being a Constrained Clustering Task to being an Unconstrained Clustering Task.
- It can be an Algorithm-specific Clustering Task, such as: k-Means Clustering, k-Medoids Clustering, ...
- It can provide information to other Data Mining Tasks (rather than provide actionable information).
- Example(s):
- Counter-Example(s):
- See: Unsupervised Learning Task, Distance Function, Cluster Set; Categorical Data Clustering; Cluster Editing; Cluster Ensembles; Consensus Clustering; Clustering from Data Streams; Correlation Clustering; Cross-Language Document Clustering; Density-Based Clustering; Dirichlet Process; Evolutionary Clustering; Graph Clustering; Model-Based Clustering; Projective Clustering; Sublinear Clustering.
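The input/output contract listed in the Context above (a data record set, an optional number of clusters, an optional similarity function, and a cluster set as output) can be sketched in Python. This is a minimal illustration only, assuming a Lloyd-style k-means procedure as one possible solver. The names `cluster_records` and `squared_euclidean`, and the choice to seed the centers with the first k records, are hypothetical and not part of the task definition.

```python
from typing import Callable, List, Sequence

Record = Sequence[float]

def squared_euclidean(a: Record, b: Record) -> float:
    """Dissimilarity between two records (smaller means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_records(
    records: List[Record],
    k: int = 2,
    dissimilarity: Callable[[Record, Record], float] = squared_euclidean,
    max_iter: int = 100,
) -> List[List[int]]:
    """Return a crisp cluster set as k lists of record indices.

    Assumption: Lloyd-style k-means, seeded with the first k records.
    The mean-based center update is only well motivated for squared
    Euclidean dissimilarity.
    """
    if not 1 <= k <= len(records):
        raise ValueError("k must be between 1 and the number of records")
    centers = [list(r) for r in records[:k]]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: dissimilarity(r, centers[c]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the procedure has converged
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [records[i] for i, a in enumerate(assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [[i for i, a in enumerate(assignment) if a == c] for c in range(k)]

# Hypothetical toy data: two visibly separated groups of 2-D records.
records = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
clusters = cluster_records(records, k=2)
```

Passing a different `dissimilarity` or `k` changes the resulting cluster set, which is why both appear as optional inputs of the task rather than as fixed properties of it.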
References
2013
- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/Cluster_analysis
- Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape") and typological analysis. The subtle differences are often in the usage of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification primarily their discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.
2011
- (Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Clustering.” In: (Sammut & Webb, 2011) p.180
2009
- (WordNet, 2009) ⇒ http://wordnetweb.princeton.edu/perl/webwn?s=clustering
- S: (v) cluster, constellate, flock, clump (come together as in a cluster or flock) "The poets constellate in this town every summer"
- http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm
- Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Other techniques, such as Q-mode factor analysis, multidimensional scaling, and latent class analysis, also perform clustering and are discussed separately.
- http://www.statsoft.com/textbook/stcluan.html
- In the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever one needs to classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.
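The criterion quoted in the 2009 entries above (minimize within-group variation, maximize between-group variation) can be made concrete with sums of squared deviations. The sketch below is illustrative only: it assumes numeric records and squared Euclidean deviation, and the names `centroid` and `within_and_between_ss` are hypothetical.

```python
from typing import List, Sequence, Tuple

Record = Sequence[float]

def centroid(points: List[Record]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of records."""
    return [sum(col) / len(points) for col in zip(*points)]

def within_and_between_ss(clusters: List[List[Record]]) -> Tuple[float, float]:
    """Return (within-group, between-group) sums of squared deviations."""
    all_points = [p for cluster in clusters for p in cluster]
    grand = centroid(all_points)
    within = 0.0
    between = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        # Spread of the members around their own cluster center.
        within += sum(
            sum((x - c) ** 2 for x, c in zip(p, center)) for p in cluster
        )
        # Size-weighted spread of the cluster center around the grand mean.
        between += len(cluster) * sum(
            (c - g) ** 2 for c, g in zip(center, grand)
        )
    return within, between

# The same four hypothetical points, grouped two different ways.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
poor = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
good_within, good_between = within_and_between_ss(good)
poor_within, poor_between = within_and_between_ss(poor)
```

For a fixed data set the two quantities sum to the total variation, so under this particular measure lowering within-group variation and raising between-group variation are the same goal; the `good` grouping has the smaller within-group value and the larger between-group value.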
2003
- http://www.nature.com/nrg/journal/v4/n9/glossary/nrg1155_glossary.html
- CLUSTER ANALYSIS: A mathematical algorithm that organizes a set of items according to their similarity. For example, genes can be clustered according to their similarity in pattern of expression.
2002
- (Berkhin, 2002) ⇒ Pavel Berkhin. (2002). “A Survey of Clustering Data Mining Techniques.” Technical Report, Accrue Software.
- Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data.
- The goal of clustering is to assign data points to a finite system of [math]k[/math] subsets (clusters).
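One way to write the "finite system of [math]k[/math] subsets" formally is shown below. This is a sketch that assumes crisp, non-overlapping clusters that together cover the whole data set; the symbols [math]X[/math] (the data set) and [math]C_i[/math] (the i-th cluster) are notation introduced here for illustration.

```latex
X = C_1 \cup C_2 \cup \dots \cup C_k,
\qquad
C_i \cap C_j = \emptyset \quad \text{for } i \neq j
```

A fuzzy clustering task relaxes the disjointness condition by giving each data point a degree of membership in each cluster.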
2000
- (Witten & Frank, 2000) ⇒ Ian H. Witten, and Eibe Frank. (2000). “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.” Morgan Kaufmann.
- … In clustering, groups of examples that belong together are sought.
1999
- (Jain et al., 1999) ⇒ Anil K. Jain, M. N. Murty, and P. J. Flynn. (1999). “Data Clustering: A Review.” In: ACM Computing Surveys (CSUR) Journal, 31(3). doi:10.1145/331499.331504
- QUOTE: Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters).