2007 SurveyGraphClustering

From GM-RKB

Subject Headings: Graph Clustering; Graph Clustering Algorithm

Notes

Cited By

Quotes

Abstract

In this survey we overview the definitions and methods for graph clustering, that is, finding sets of ''related'' vertices in graphs. We review the many definitions for what is a cluster in a graph and measures of cluster quality. Then we present global algorithms for producing a clustering for the entire vertex set of an input graph, after which we discuss the task of identifying a cluster for a specific seed vertex by local computation. Some ideas on the application areas of graph clustering algorithms are given. We also address the problematics of evaluating clusterings and benchmarking cluster algorithms.

1. Introduction

Any nonuniform data contains underlying structure due to the heterogeneity of the data. The process of identifying this structure in terms of grouping the data elements is called clustering, also called data classification [152]. The resulting groups are called clusters. The grouping is usually based on some similarity measure defined for the data elements. Clustering is closely related to unsupervised learning in pattern recognition systems [81]. A basic task in unsupervised learning is to classify a data set into two or more classes based on a similarity measure over the data, without resorting to any a priori information on how the classification should be done.

Graphs are structures formed by a set of vertices (also called nodes) and a set of edges that are connections between pairs of vertices. Graph clustering is the task of grouping the vertices of the graph into clusters taking into consideration the edge structure of the graph in such a way that there should be many edges within each cluster and relatively few between the clusters. Graph clustering in the sense of grouping the vertices of a given input graph into clusters, which is the topic of this survey, should not be confused with the clustering of sets of graphs based on structural similarity; such clustering of graphs as well as measures of graph similarity is addressed in other literature [38,124,168,169,202,206], although many of the techniques involved are closely related to the task of finding clusters within a given graph.
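The "many edges within each cluster and relatively few between" criterion can be made concrete by counting both edge types for a given partition. The toy graph and two-cluster partition below are assumptions for illustration, not an example from the survey:

```python
# Hypothetical toy graph: edge list over vertices 0..5, with an
# assumed two-cluster partition {0, 1, 2} and {3, 4, 5}.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# An edge is intra-cluster when both endpoints share a cluster label.
intra = sum(1 for u, v in edges if cluster[u] == cluster[v])
inter = len(edges) - intra

print(intra, inter)  # 6 edges within clusters, 1 between
```

A good clustering of this graph makes `intra` large relative to `inter`; the single (2, 3) edge is the only inter-cluster connection.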

As the field of graph clustering has grown quite popular and the number of published proposals for clustering algorithms as well as reported applications is high, we do not even pretend to be able to give an exhaustive survey of all the methods, but rather an explanation of the methodologies commonly applied and pointers to some of the essential publications related to each research branch.

3. Graph clustering

In this section, we begin the difficult work of defining what constitutes a cluster in a graph and what a clustering should be like; we also discuss some special classes of graphs. In some of the clustering literature, a cluster in a graph is called a community [186,107].

Formally, given a data set, the goal of clustering is to divide the data set into clusters such that the elements assigned to a particular cluster are similar or connected in some predefined sense. However, not all graphs have a structure with natural clusters. Nonetheless, a clustering algorithm outputs a clustering for any input graph. If the structure of the graph is completely uniform, with the edges evenly distributed over the set of vertices, the clustering computed by any algorithm will be rather arbitrary. Quality measures – and if feasible, visualizations – will help to determine whether there are significant clusters present in the graph and whether a given clustering reveals them or not.

In order to give a more concrete idea of what clusters are, we present here a small example. On the left in Fig. 1 we have an adjacency matrix of a graph with n = 210 vertices and m = 1505 edges: the 2m black dots (two for each edge) represent the ones of the matrix, whereas white areas correspond to zero elements. When the vertices are ordered randomly, there is no apparent structure in the adjacency matrix and one cannot trivially interpret the presence, number, or quality of clusters inherent in the graph. However, once we run a graph clustering algorithm (in this example, an algorithm of Schaeffer [205]) and re-order the vertices according to their respective clusters, we obtain a diagonalized version of the adjacency matrix, shown on the right in Fig. 1. Now the cluster structure is evident: there are 17 dense clusters of varying orders and some sparse connections between the clusters. Matrix diagonalization in itself is an important application of clustering algorithms, as there are efficient computational methods available for processing diagonalized matrices, for example, to solve linear systems. Such computations in turn enable efficient algorithms for graph partitioning [214], as the graph partitioning problem can be written in the form of a set of linear equations. The goal in graph partitioning is to minimize the number of edges that cross from one subgroup of vertices to another, usually posing limits on the number of groups as well as on the relative sizes of the groups.
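The reordering step described above can be sketched in a few lines of NumPy. The small synthetic matrix here is an assumption for illustration, not the survey's 210-vertex example: permuting rows and columns by cluster label moves the dense blocks onto the diagonal.

```python
import numpy as np

# Assumed small example: 12 vertices in 3 planted clusters.
rng = np.random.default_rng(0)
n, k = 12, 3
labels = rng.permutation(np.repeat(np.arange(k), n // k))

# Dense within clusters, sparse between (illustrative probabilities).
p_in, p_out = 0.9, 0.05
same = labels[:, None] == labels[None, :]
probs = np.where(same, p_in, p_out)
A = (rng.random((n, n)) < probs).astype(int)
A = np.triu(A, 1)
A = A + A.T                        # symmetric adjacency, no self-loops

# Re-order vertices by cluster label: blocks appear on the diagonal.
order = np.argsort(labels)
A_diag = A[np.ix_(order, order)]
```

Printing `A_diag` shows the block-diagonal pattern that Fig. 1 illustrates at a larger scale.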

Fig. 1 – The adjacency matrix of a 210-vertex graph with 1505 edges composed of 17 dense clusters. On the left, the vertices are ordered randomly and the graph structure can hardly be observed. On the right, the vertex ordering is by cluster and the 17-cluster structure is evident. Each black dot corresponds to an element of the adjacency matrix that has the value one, the white areas correspond to elements with the value zero.

Fig. 2 – Two graphs, both of which have 84 vertices and 358 edges. The graph on the left is a uniform random graph of the [math]\displaystyle{ G_{n,m} }[/math] model [84,85] and the graph on the right has a relaxed caveman structure [228]. Both graphs were drawn with spring-force visualization [203].

3.1. Generation models for clustered graphs

Gilbert [106] presented in 1959 a process for generating uniform random graphs with [math]\displaystyle{ n }[/math] vertices: each of the [math]\displaystyle{ \binom{n}{2} }[/math] possible edges is included in the graph with probability [math]\displaystyle{ p }[/math], considering each pair of vertices independently. In such uniform random graphs, the degree distribution is Poissonian. Also, as the edges are by construction distributed uniformly, no dense clusters can be expected.
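Gilbert's process can be implemented directly by flipping an independent biased coin for each vertex pair. This is only a sketch; the function name and parameters are illustrative, not from the survey:

```python
import random
from itertools import combinations

def gilbert_graph(n, p, seed=None):
    """Gilbert's G(n, p) model: include each of the C(n, 2) possible
    edges independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2)
            if rng.random() < p]

edges = gilbert_graph(100, 0.05, seed=42)
# Expected number of edges: p * n * (n - 1) / 2 = 0.05 * 4950 = 247.5,
# so the expected vertex degree is p * (n - 1) = 4.95.
```

Because every pair is treated identically, no subset of vertices is systematically denser than another, which is why such graphs lack natural clusters.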

A generalization of the Gilbert model, especially designed to produce clusters, is the planted [math]\displaystyle{ \ell }[/math]-partition model [59]: a graph is generated with [math]\displaystyle{ n = \ell \cdot k }[/math] vertices that are partitioned into [math]\displaystyle{ \ell }[/math] groups, each with [math]\displaystyle{ k }[/math] vertices. Two probability parameters p and q < p are used to construct the edge set: each pair of vertices in the same group shares an edge with the higher probability [math]\displaystyle{ p }[/math], whereas each pair of vertices in different groups shares an edge with the lower probability [math]\displaystyle{ q }[/math]. The goal of the planted partition problem is to recover such a planted partition into [math]\displaystyle{ \ell }[/math] clusters of [math]\displaystyle{ k }[/math] vertices each, instead of optimizing some measure on the partition. McSherry [173] also discusses planted versions of other problems such as [math]\displaystyle{ \ell }[/math]-clique and graph colouring.
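A minimal sketch of the planted ℓ-partition generator, following the definitions above (the function name and parameter values are illustrative assumptions):

```python
import random
from itertools import combinations

def planted_partition(ell, k, p, q, seed=None):
    """Planted ell-partition model: ell groups of k vertices; pairs in
    the same group get an edge with probability p, pairs in different
    groups with the lower probability q < p."""
    rng = random.Random(seed)
    group = [v // k for v in range(ell * k)]   # the planted partition
    edges = [(u, v) for u, v in combinations(range(ell * k), 2)
             if rng.random() < (p if group[u] == group[v] else q)]
    return edges, group

edges, group = planted_partition(ell=4, k=10, p=0.8, q=0.05, seed=1)
```

With p well above q, intra-group edges dominate, and an algorithm solving the planted partition problem should recover `group` (up to relabeling) from `edges` alone.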

Fig. 2 shows two graphs of the same order and size, one of which is a uniform random graph and the other of which has a clearly clustered structure. The graph on the right is a relaxed caveman graph. Caveman graphs were an early attempt in the social sciences to capture the clustering properties of social networks: a ring of small complete graphs called "caves" is linked together by moving one of the edges in each cave to point to another cave [231]. The graph represented in Fig. 1 was also generated with the caveman model. These graphs are especially constructed to contain a clear cluster structure, possibly with a hierarchical structure where clusters can be further divided into natural subclusters.

The procedure to create a relaxed caveman graph is the following [228]: a connection probability p ∈ (0, 1] for the top level of the hierarchy is given as a parameter, together with a scaling coefficient s that adjusts the density of the lower-level caves. The minimum [math]\displaystyle{ n_{\min} }[/math] and maximum [math]\displaystyle{ n_{\max} }[/math] for the numbers of subcomponents (subcaves at higher levels, vertices at the bottom level) are given as parameters. The
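The description of the procedure is cut off here, so the following is only a loose sketch of one plausible recursive reading: each component splits into between n_min and n_max subcomponents, bottom-level caves become near-cliques, and the scaling coefficient s makes lower levels denser. The function name, the exact scaling rule, and the parameter values are all assumptions, not the construction of [228]:

```python
import random

def relaxed_caveman(levels, p, s, n_min, n_max, rng):
    """Assumed sketch: recursively build subcomponents, then link
    vertices in different subcomponents with this level's probability.
    Lower levels use a probability scaled up by s, so bottom-level
    caves approach complete graphs."""
    if levels == 0:
        return 1, []                  # a single vertex, no edges
    n, edges, owner = 0, [], []
    for part in range(rng.randint(n_min, n_max)):
        sub_n, sub_edges = relaxed_caveman(
            levels - 1, min(1.0, p * s), s, n_min, n_max, rng)
        edges += [(u + n, v + n) for u, v in sub_edges]
        owner += [part] * sub_n       # which subcomponent owns each vertex
        n += sub_n
    # Sparse links between different subcomponents at this level.
    for u in range(n):
        for v in range(u + 1, n):
            if owner[u] != owner[v] and rng.random() < p:
                edges.append((u, v))
    return n, edges

rng = random.Random(7)
n, edges = relaxed_caveman(levels=2, p=0.05, s=18.0,
                           n_min=3, n_max=5, rng=rng)
```

Under this reading, a two-level graph has 3–5 caves of 3–5 vertices each; within-cave pairs are connected with probability min(1, p·s) = 0.9, while cross-cave pairs use p = 0.05, producing the dense-cluster/sparse-bridge structure visible in Fig. 2.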

References

Satu Elisa Schaeffer (2007). "Graph Clustering" (survey). Computer Science Review. doi:10.1016/j.cosrev.2007.05.001.