
In graph theory and network analysis, the centrality of a vertex measures the relative importance of that vertex within a graph. Centrality plays a key role in network analysis and has been widely studied using different methods. Inspired by the idea of vertex centrality, a novel centrality guided clustering (CGC) algorithm is proposed in this paper. Unlike traditional clustering methods, which usually choose the initial center of a cluster randomly, the CGC algorithm starts from a “LEADER” (a vertex with the highest centrality score), and a new “member” is added to the same cluster as the “LEADER” when some criterion is satisfied. The CGC algorithm also supports overlapping membership. Experiments on three benchmark social network datasets are presented, and the results indicate that the proposed CGC algorithm works well in social network clustering.

Clustering is a process of partitioning a set of data into meaningful subsets so that all data in the same group are similar and the data in different groups are dissimilar in some sense. It is a method of data exploration and a way of looking for patterns or structure in the data that are of interest. Clustering has wide applications in social science, biology, chemistry, and information sciences. A general review of cluster analysis can be found in many references such as [

The commonly used clustering methods are partitional clustering and hierarchical clustering. Partitional algorithms typically determine all clusters at once.

Hierarchical clustering is either agglomerative or divisive. Agglomerative algorithms begin with each element as a separate cluster and successively merge the two clusters separated by the shortest distance. Most hierarchical clustering algorithms are agglomerative, such as SLINK [

In recent years, social network analysis has gained much attention. Social network analysis is the study of social relations in terms of networks. A social network is usually modeled as a directed or undirected graph. The nodes of the graph represent individual members, and the edges represent relationships between the individuals, such as friendship or coauthorship. A fundamental problem related to social networks is the discovery of clusters or communities. Porter et al. [

In this work, a novel hierarchical clustering algorithm is proposed for social network clustering. Traditional clustering methods, such as

The paper is organized as follows. Different centrality measurements are discussed in Section

Centrality is one of the most widely studied concepts in social network analysis. In graph theory and network analysis, the centrality of a vertex measures the relative importance of that vertex within the graph. For example, it can capture how important a person is within a social network or how heavily used a road is within an urban network. Over the years, various measures of the centrality of a vertex have been proposed. Degree centrality, betweenness centrality, and eigenvector centrality are among the most popular.

Degree centrality is the simplest centrality measurement. Given a graph

Degree centrality considers only the local topology of the network. It can be interpreted as a measure of immediate influence, as opposed to global effect in the network [
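As an illustration of how local this measure is, the following minimal Python sketch computes degree centrality for an unweighted, undirected graph stored as a dictionary of neighbor sets. The normalization by n - 1, the function name `degree_centrality`, and the toy graph are assumptions made for this example and are not taken from the paper.

```python
def degree_centrality(adj):
    """Return the degree of each vertex divided by (n - 1).

    adj maps each vertex to the set of its neighbors (undirected graph).
    The (n - 1) normalization is an assumption of this sketch.
    """
    n = len(adj)
    if n <= 1:
        return {v: 0.0 for v in adj}
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


# Hypothetical toy graph: "a" is linked to everyone, and "b"-"c" are linked.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
degree_scores = degree_centrality(toy_graph)
# The vertex with the highest score is a candidate starting point for a cluster.
degree_leader = max(degree_scores, key=degree_scores.get)
```

Only the immediate neighbors of a vertex enter the score, which is why the measure cannot distinguish a well-connected vertex at the periphery from one at the core of the network.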

The betweenness centrality for any

Betweenness centrality is one of the most popular centrality measures that consider the global structure of the network. It characterizes how influential a vertex is in communicating between vertex pairs [
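To make the role of shortest paths concrete, the sketch below computes unnormalized betweenness centrality on an unweighted, undirected graph. It uses Brandes' accumulation scheme, which is a standard technique and not necessarily the procedure used by the authors. The function name and the path graph are hypothetical.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized betweenness for an unweighted, undirected graph.

    adj maps each vertex to the set of its neighbors.
    Uses Brandes' algorithm: one BFS per source, then back-propagation
    of pair dependencies along shortest-path predecessors.
    """
    scores = {v: 0.0 for v in adj}
    for source in adj:
        order = []                          # vertices in order of discovery
        preds = {v: [] for v in adj}        # shortest-path predecessors
        sigma = {v: 0 for v in adj}         # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:             # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:                        # farthest vertices first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2.0 for v, s in scores.items()}


# Hypothetical path graph a - b - c - d - e.
path_graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
betweenness_scores = betweenness_centrality(path_graph)
```

On the path graph, the middle vertex lies on the most shortest paths and the endpoints lie on none, which matches the interpretation of betweenness as control over communication between vertex pairs.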

The eigenvector centrality score of the

Eigenvector centrality is an extension of degree centrality in which the score of a vertex is proportional to the sum of the centralities of its neighbors. A vertex has a large eigenvector centrality score if it is connected to many other vertices or if it is connected to vertices that themselves have high eigenvector centrality [
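The proportionality between a vertex's score and the sum of its neighbors' scores can be sketched with power iteration. The example below is a minimal illustration for a connected, unweighted, undirected graph. Iterating with A + I instead of A (to avoid oscillation on bipartite graphs), scaling so that the largest score equals 1, and the names used are all assumptions of this sketch.

```python
def eigenvector_centrality(adj, max_iter=1000, tol=1e-10):
    """Approximate eigenvector centrality by power iteration.

    adj maps each vertex to the set of its neighbors (undirected graph).
    Each step replaces x[v] by x[v] + sum of x over the neighbors of v,
    i.e. multiplies by (A + I), which has the same eigenvectors as A.
    Scores are rescaled so that the maximum is 1 (an assumption here).
    """
    if not adj:
        return {}
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        scale = max(new.values())
        new = {v: value / scale for v, value in new.items()}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x


# Hypothetical star graph: "a" is the hub, "b", "c", "d" are leaves.
star_graph = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a"},
}
eigen_scores = eigenvector_centrality(star_graph)
```

For the star graph, the hub receives the top score and each leaf receives 1/sqrt(3) of it, since the leading eigenvalue of the adjacency matrix is sqrt(3).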

Because different centrality measures are based on different aspects of a network, the final centrality scores and the ranking of the nodes in the network may differ. The difference will be discussed in Section

In this section, some notation and terminology are introduced and the centrality guided clustering (CGC) algorithm is presented.

Given an input dataset, the dataset is modeled as a weighted graph

Let

The density of the subgraph

As discussed in Section

For any vertex

A vertex

there exists a vertex

Condition (a) says that a vertex must be a neighbor of the subgraph

The overall structure of the CGC algorithm is shown in Algorithm

(Initialization)

while

(GROUPING) Cluster the vertices in

(MERGING) Merge those groups with large percentage of overlap.

(CONTRACTION) Contract those vertices in the same groups to a new vertex,

calculate the edge weights in the contracted graph.

Denote the contracted graph as

The details of the GROUPING step are shown in Algorithm

Calculate the centrality score of each vertex

Order

that

while

Create a new group

while

Find the

Calculate the contribution to

Sort

if

else

In the CGC algorithm, every vertex is allowed to belong to more than one group, so after the GROUPING step different groups may share some elements. If the number of overlapping elements in two groups exceeds some threshold, it is better to merge all vertices in the two groups into a new, larger group. The following criterion is applied to determine whether two groups should be merged. Given any two groups, say

clustered into

List all groups of

that

while

if

else
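The idea behind the MERGING step can be illustrated with a minimal Python sketch. It assumes that overlap is measured as the fraction of the smaller group that is shared with the other group, and that two groups are merged when this fraction reaches a hypothetical threshold of 0.5. Both the overlap measure and the threshold value are assumptions of this example; the criterion used by the CGC algorithm may differ.

```python
def merge_overlapping(groups, threshold=0.5):
    """Repeatedly merge pairs of groups whose overlap ratio is large.

    groups: iterable of vertex collections (overlap between groups is allowed).
    threshold: hypothetical cutoff on |A & B| / min(|A|, |B|).
    Returns a list of sets in which no pair meets the threshold.
    """
    groups = [set(g) for g in groups if g]  # drop empty groups
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                a, b = groups[i], groups[j]
                ratio = len(a & b) / min(len(a), len(b))
                if ratio >= threshold:
                    groups[i] = a | b       # replace the pair by its union
                    del groups[j]
                    merged = True
                    break
            if merged:
                break                       # restart the scan after a merge
    return groups


# Hypothetical groups: the first two share two of their three vertices.
example_groups = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
merged_groups = merge_overlapping(example_groups)
```

Restarting the scan after every merge lets a newly formed group absorb further groups that overlap with it, at the cost of some repeated comparisons.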

After the MERGING step, each group

merged into

List all groups of

Generate a new vertex

for

for

To evaluate the effectiveness of the CGC algorithm, three benchmark datasets on social network analysis are tested. The three benchmark datasets and the clustering results are described in Sections

To test whether the centrality measures will influence the results, different centrality measures are applied to the CGC algorithm independently and the clustering results are compared in Section

Zachary’s karate club dataset is a standard dataset used to test clustering algorithms in social network analysis. It is a social network of friendships between 34 members of a karate club at a US university [

The social network of Zachary’s karate club. Red dots denote the supporters of the instructor and blue squares denote the supporters of the president. The dashed curve is the partition by the CGC algorithm.

When the Girvan-Newman algorithm is applied to this dataset, node 3 is misclassified. The partition by the CGC algorithm is shown as the dashed curve in Figure

The dendrogram of the karate club dataset by the CGC algorithm.

The dolphin social network dataset is another representative dataset to test the accuracy of clustering algorithms. It is a social network of frequent associations between dolphins in a community in Doubtful Sound, New Zealand [

The social network of the dolphins. The dashed curve denotes the division of the network into two equal-size groups found by the standard spectral partitioning method, and the solid curve represents the division found by the modularity-based method by Newman [

The ground truth groups are represented by the shapes of the vertices in Figure

The dendrogram of the dolphin dataset by the CGC algorithm.

The third example is a social network map of political books based on purchase patterns from the online book seller Amazon.com. This dataset is provided by Krebs [

The ground truth partition of the political books. Triangles for neutral books, dots for conservative books, and squares for liberal books.

In order to see the clustering results based on the book copurchase information, the Girvan-Newman algorithm [

The clustering result of the political books by the Girvan-Newman algorithm. Red for neutral books, blue for conservative books, and black for liberal books.

The clustering result of the political books by the CGC algorithm. Red for neutral books, blue for conservative books, and black for liberal books.

As mentioned in previous sections, the centrality score of a node can be viewed as a measure of how important that node is in the network, and the nodes can be ranked by their centrality scores from largest to smallest. When different centrality measures are applied to the same dataset, the ordering of the nodes may be different.

The purpose of this subsection is to test whether the starting node influences the final clustering result and to compare the effectiveness of different centrality measures when combined with the CGC algorithm. In this subsection, degree centrality, eigenvector centrality, and betweenness centrality are independently applied to the CGC algorithm, and the same three datasets as in Sections

Table

The number of misclassified members by the CGC algorithm based on different centrality measures.

Centrality measure | Karate club | Dolphin | Political books |
---|---|---|---|
Degree | 0 | 1 | 17 |
Eigenvector | 0 | 2 | 16 |
Betweenness | 0 | 0 | 16 |

In this work, the importance of the centrality score of vertices in a network is discussed and a centrality guided clustering method is proposed. The CGC algorithm initiates the clustering process at a vertex with the highest centrality score, which is a potential leader of a community. The CGC algorithm is applied to several benchmark social network datasets. Experimental results show that the CGC algorithm works well on social network clustering.

The choice of centrality measure may influence the results of the CGC algorithm. Degree centrality is a very local measure, while betweenness centrality and eigenvector centrality search for global “leaders” of the entire network. The experimental results show that betweenness centrality works better than the other two centrality measures for the CGC algorithm.

One may notice that in Figure

The CGC algorithm is a hierarchical clustering algorithm. One direction for future research would be to apply the centrality score guided idea to other clustering methods such as

The authors declare that there is no conflict of interests regarding the publication of this article.

The authors would like to thank the anonymous reviewers for suggesting many ways to improve the paper. The work is partially supported by the National Natural Science Foundation of China (no. 61202312); the NSA Grant H98230-12-1-0233, the NSF Grant DMS-1264800; the Fundamental Research Funds for the Central Universities, China (no. JUSRP11231); and the Shandong Province Natural Science Foundation of China (no. ZR2010AQ018).