In complex networks, cluster structure, identified by the heterogeneity of nodes, is a common and important topological property. Network clustering methods are therefore significant for the study of complex networks. Many typical clustering algorithms currently suffer from weaknesses such as inaccuracy and slow convergence. In this paper, we propose a clustering algorithm based on the core influence of nodes. The clustering process simulates the process of cluster formation in sociology. The algorithm detects the nodes with core influence through their betweenness centrality and builds each cluster’s core structure using discriminant functions. It then obtains the final cluster structure by clustering the remaining nodes with an optimization method. Experiments on different datasets show that the clustering accuracy of this algorithm is superior to that of the classical Fast-Newman (FN) algorithm. It also clusters faster and helps reveal the real cluster structure of complex networks precisely.

With the popularization of information networks and the discovery of the small-world effect and the scale-free property, research on complex networks has become a trend. The study of complex networks involves graph theory, statistical physics, computer science, ecology, sociology, and economics [

One of the most important features of complex networks is cluster structure. Many studies have shown that some networks have cluster structures rather than consisting of a large number of randomly linked nodes. Heterogeneity has been found in many real-world networks: nodes of similar types share many connections, while nodes of different types share few. The subgraphs formed by nodes of similar types and their connections are called “clusters.”

Clustering algorithms play a fundamental role in studying the cluster structure of complex networks. They have not only important theoretical significance for researching complex network topology, understanding network function, revealing hidden laws, and predicting network behavior but also broad application prospects. Clustering algorithms have been applied to social network analysis, biological network analysis, search engines, spatial data clustering, image segmentation, and many other areas [

According to the analysis strategy, complex network clustering methods are divided into optimization methods and heuristic methods. The earlier clustering algorithms like spectral method [

The KL algorithm is also based on the idea of graph partitioning and aims to minimize the difference between the number of intercluster connections and the number of internal connections. By continuously adjusting the clusters, the algorithm accepts the candidate solutions that minimize the objective function. The KL algorithm is very sensitive to the initial solution and highly dependent on prior knowledge, so it often returns locally optimal results.
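The adjustment step of a Kernighan-Lin style bisection can be illustrated with a minimal sketch. It assumes an unweighted, undirected graph stored as a dictionary of neighbor sets; the function names `cut_size` and `swap_gain` are hypothetical and chosen only for this illustration. The gain of swapping two nodes is the standard quantity D_a + D_b - 2c_ab, where D is the number of external minus internal connections of a node.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def cut_size(graph: Graph, part_a: Set[Hashable]) -> int:
    """Count edges with exactly one endpoint in part_a (intercluster edges)."""
    return sum(1 for u in part_a for v in graph[u] if v not in part_a)


def swap_gain(graph: Graph, part_a: Set[Hashable], a: Hashable, b: Hashable) -> int:
    """Reduction in cut size if a (inside part_a) and b (outside) trade places."""

    def d_value(node: Hashable, node_in_a: bool) -> int:
        # External connections minus internal connections of `node`.
        external = sum(1 for v in graph[node] if (v in part_a) != node_in_a)
        internal = len(graph[node]) - external
        return external - internal

    c_ab = 1 if b in graph[a] else 0  # an edge a-b stays cut after the swap
    return d_value(a, True) + d_value(b, False) - 2 * c_ab
```

Because each swap is accepted only when it improves the current partition, the result depends on the starting partition, which is consistent with the sensitivity to the initial solution noted above.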

Girvan and Newman proposed GN algorithm [

Based on the Maximum Flow-Minimum Cut Theorem, Flake et al. proposed a heuristic clustering algorithm, the Maximum Flow Community [

Newman proposed a fast clustering algorithm based on local search [

Based on the FN algorithm, Guimera and Amaral similarly adopted the

Although optimization algorithms based on the

Since the community detection results through the clustering algorithms based on optimization depend on the objective function to be optimized, “biased” objective function will inevitably lead to “biased” solution. The

As complex networks grow in scale, the calculation of the objective function and the iterative process become more complex, consuming more and more time and resources.
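To make the cost of evaluating an objective function concrete, the following sketch computes Newman's modularity, the quantity that the FN algorithm greedily maximizes. It is an illustration under stated assumptions: the graph is unweighted and undirected, stored as a dictionary of neighbor sets, and the name `modularity` and the `labels` mapping are choices made here, not notation from this paper.

```python
from typing import Dict, Hashable, Set


def modularity(graph: Dict[Hashable, Set[Hashable]],
               labels: Dict[Hashable, Hashable]) -> float:
    """Modularity Q = sum over clusters of (e_c / m) - (d_c / 2m)^2."""
    two_m = sum(len(nbrs) for nbrs in graph.values())  # 2 * number of edges
    if two_m == 0:
        return 0.0
    intra_ends: Dict[Hashable, int] = {}  # 2 * internal edges per cluster
    degree_sum: Dict[Hashable, int] = {}  # total degree per cluster
    for u, nbrs in graph.items():
        c = labels[u]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        intra_ends[c] = intra_ends.get(c, 0) + sum(1 for v in nbrs if labels[v] == c)
    return sum(intra_ends[c] / two_m - (degree_sum[c] / two_m) ** 2
               for c in degree_sum)
```

A single evaluation visits every edge of the network, so recomputing the objective from scratch for each candidate move becomes expensive as the network grows; practical implementations usually update it incrementally.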

Although clustering algorithms based on heuristic methods can handle large-scale data in complex networks, they have lower clustering accuracy than optimization algorithms and cannot give high-precision clustering results.

To solve the above problems, we propose a novel clustering algorithm based on the core influence of nodes. The algorithm combines heuristic and optimization methods. Its clustering process is designed to simulate the process that drives cluster formation in sociology, to reflect the clustering process of nodes in real networks more accurately, and to achieve “unbiased,” precise clustering as far as possible.

The rest of this paper is organized as follows. In the next section, we introduce the clustering algorithm based on the core influence of nodes, then the experimental results and analysis are illustrated in Section

The basic idea of the clustering algorithm based on the core influence of nodes is to identify the nodes with core influence using betweenness centrality, build the core structure of clusters from these nodes through the evaluation function, and, finally, cluster the remaining nodes in the network using an optimization method. In this way, the clusters of the whole network are obtained.

The core influence of nodes in complex networks is measured by node centrality. Centrality uses a metric to evaluate how central a node’s position is in the network. It describes whether the network has cores, how many cores there are, and how these cores are distributed in the network.

Centrality has many definitions in complex networks, such as the degree centrality, the compactness centrality, the betweenness centrality, and the flow betweenness centrality. In order to reveal the role the nodes play in the transferring process of information, material, and energy in the complex network, this paper uses the betweenness [

Geodesic is defined as the path with least edges between two nodes. Thus, betweenness centrality of node

Betweenness centrality partially describes the core influence of nodes in complex networks. However, betweenness centrality itself is a global evaluation parameter, which cannot accurately describe the relative influence of nodes in the local environment, especially in large-scale complex networks. Therefore, combining the betweenness centrality and local clustering features of nodes, the core influence of nodes is denoted as

The definition of core influence above describes how important a node is in its clustering environment. A higher core influence indicates a larger contribution and a heavier load in the information dissemination process in a complex network. Unlike simple degree centrality, the node with the highest core influence is not necessarily the node with the maximum degree or a topological center of the network structure.
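Since the core influence builds on betweenness centrality, a minimal sketch of that ingredient may help. The code below follows Brandes' accumulation scheme for unweighted, undirected graphs stored as a dictionary of neighbor sets. It returns unnormalized values, which may differ from the normalization used in this paper, and it does not implement the local clustering term of the core influence. The function name is an assumption made for illustration.

```python
from collections import deque
from typing import Dict, Hashable, Set


def betweenness_centrality(graph: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, float]:
    """Unnormalized betweenness: for each node, the sum over node pairs of the
    fraction of geodesics between the pair that pass through the node."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting geodesics (sigma) to each node.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {v: value / 2 for v, value in bc.items()}
```

Sorting the result, for example with `sorted(bc, key=bc.get, reverse=True)`, gives the nodes in descending order of betweenness centrality.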

In complex networks, the core structure of clusters is usually not only a simple single node with high core influence but possibly also a certain structure composed of several active nodes with high influence [

The goal of the

According to Fortunato and Barthélemy’s study of the

After determining the core cluster structure, the algorithm clusters the remaining nodes using an optimization method. The remaining nodes are centralized by rearranging all nodes according to their core influence. A “centralized” network can thus be obtained, where nodes are arranged from inside to outside. The objective “centralization function” that reflects the level of centralization is then defined as

The objective function shows that if all nodes have the same core influence, which indicates that the network is noncore, then

The strategy for searching and accepting candidate solutions is as follows. First, arrange all nodes in descending order of core influence. Then, for each node, change the cluster it belongs to and calculate the corresponding “centralization function.” Accept the candidate solution that maximizes the sum of the “centralization function” over the whole network. The process ends when every node has been assigned to a cluster.
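The search strategy can be sketched generically as a greedy loop. Because the formula of the centralization function is not reproduced in this text, the sketch takes the objective as a caller-supplied function `score`; that parameter, the function name `greedy_assign`, and the integer cluster identifiers are all assumptions made for illustration and do not represent the authors' implementation.

```python
from typing import Callable, Dict, Hashable, Sequence


def greedy_assign(nodes_by_influence: Sequence[Hashable],
                  core_labels: Dict[Hashable, int],
                  n_clusters: int,
                  score: Callable[[Dict[Hashable, int]], float]) -> Dict[Hashable, int]:
    """Assign each unlabeled node, in the given order, to the cluster that
    maximizes the caller-supplied objective `score` (a hypothetical stand-in
    for the whole-network centralization function)."""
    labels = dict(core_labels)  # nodes in the core structures keep their cluster
    for node in nodes_by_influence:
        if node in labels:
            continue
        best_cluster, best_score = None, float("-inf")
        for cluster in range(n_clusters):
            labels[node] = cluster  # candidate solution
            candidate = score(labels)
            if candidate > best_score:
                best_cluster, best_score = cluster, candidate
        labels[node] = best_cluster  # accept the best candidate
    return labels
```

Ties are resolved in favor of the lowest cluster index in this sketch; a real objective would need its own tie-breaking rule.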

According to the algorithm, the actual steps of the clustering algorithm based on the core influence are as follows.

(1) Sort all nodes by betweenness centrality in descending order

(2) Set up three groups of nodes;

(3) Select the node

(4) Judge whether

(5) If

(6) For all nodes connected to Cluster 2’s core, classify those in

(7) Traverse the nodes in

(8) Return to step (4) and iterate and traverse all the nodes.

For more objective and comprehensive evaluation, the algorithm is tested on three datasets (Neural Network [

In this section, the algorithm is tested on the dataset “Neural Network.” The dataset is a complex network of neurons in a living system, where each node represents a complete and independent neuron and the edge denotes the connection between neurons. The properties of the network are shown in Table

Neural Network dataset properties.

Properties | Values |
---|---|
Number of nodes | 297 |
Average clustering coefficient | 0.2924 |
Number of edges | 2359 |
Diameter | 5 |
Number of triangles | 3241 |
Average shortest path length | 2.4553 |

The number of nodes and edges and the diameter describe the overall size of the network. The average clustering coefficient, the number of triangles, and the average shortest path length describe the relative tightness of the network and how obvious the clustering feature is.
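As an illustration of how these six properties can be obtained, the following sketch computes them for a small graph. It assumes an undirected, connected graph stored as a dictionary of neighbor sets and uses the common definition of the local clustering coefficient; the conventions actually used to produce the reported values (for example, the treatment of disconnected components) are not stated here, and the function name is hypothetical.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Hashable, Set


def network_properties(graph: Dict[Hashable, Set[Hashable]]) -> Dict[str, float]:
    """Size and tightness statistics of an undirected, connected graph."""
    n = len(graph)
    m = sum(len(nbrs) for nbrs in graph.values()) // 2

    # Clustering: count linked neighbor pairs around every node.
    closed_pairs = 0
    local_cc = []
    for nbrs in graph.values():
        k = len(nbrs)
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        closed_pairs += links
        local_cc.append(2 * links / (k * (k - 1)) if k > 1 else 0.0)

    # Shortest paths: one breadth-first search per source node.
    diameter, total_dist, ordered_pairs = 0, 0, 0
    for s in graph:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        diameter = max(diameter, max(dist.values()))
        total_dist += sum(dist.values())
        ordered_pairs += len(dist) - 1

    return {
        "nodes": n,
        "edges": m,
        "triangles": closed_pairs // 3,  # each triangle is seen from 3 nodes
        "average_clustering": sum(local_cc) / n if n else 0.0,
        "diameter": diameter,
        "average_shortest_path": total_dist / ordered_pairs if ordered_pairs else 0.0,
    }
```

For networks of the size considered here, the all-pairs breadth-first search takes time proportional to the number of nodes times the number of edges, which remains practical.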

The evaluating values of the clustering effect of two algorithms are shown in Figure

Evaluating values of the clustering effect of the Neural Network dataset.

In this dataset, neurons have explicit functions, and no neuron has access to global information. As a result, the FN algorithm cannot cluster precisely. The influence algorithm proposed here, however, identifies neurons with similar functional properties more precisely by considering the role each neuron plays in the process of information dissemination, and it gives the structural relationships among neurons of similar functions and among neuron clusters of different functions. The clustering results help medical researchers better understand the mechanism of the nervous system so that they can analyze the causes of neurological diseases and provide theoretical support for cures [

In this section, the algorithm is tested on the dataset “Political Blogs.” The dataset is a political blog network, a complex social network in which each node represents a politician and each edge denotes a real social relation between them. Compared with the Neural Network dataset, the Political Blogs dataset is larger: it has about 4.1 times as many nodes (1222 versus 297) and about 7.1 times as many edges (16717 versus 2359). The connections between nodes are therefore denser, and both the clustering coefficient and the number of closed triangles increase. On the other hand, the average shortest path between nodes becomes longer, indicating that the gain in tightness between nodes is limited even though the network is larger. The properties of the network are shown in Table

Political Blogs dataset properties.

Properties | Values |
---|---|
Number of nodes | 1222 |
Average clustering coefficient | 0.3203 |
Number of edges | 16717 |
Diameter | 8 |
Number of triangles | 101043 |
Average shortest path length | 2.7375 |

The evaluating values of the clustering effect of two algorithms are shown in Figure

Evaluating values of the clustering effect of the Political Blogs dataset.

In the comparison with Figure

The cluster analysis of the Political Blogs dataset by the influence algorithm provides a theoretical basis for studying information diffusion and behavior spread in politics. For politicians, the clustering results help predict the support and resistance that their political opinions will meet as they spread. The results also help predict the probability that a political proposal will pass and even election results.

In this section, the algorithm is tested on the dataset “Email,” a social network built from sent and received emails. Each node represents an email address, and two nodes are connected if they have exchanged emails.

Compared with the first two datasets, the “Email” dataset has sparser connections: with 1133 nodes and 5452 edges, it has fewer edges per node than either the Neural Network or the Political Blogs dataset. Thus, it has a lower clustering coefficient and a larger average shortest path length. In this case, the locality of nodes is stronger, and nodes are less likely to grasp global information. The properties of the network are shown in Table

Email dataset properties.

Properties | Values |
---|---|
Number of nodes | 1133 |
Average clustering coefficient | 0.2202 |
Number of edges | 5452 |
Diameter | 8 |
Number of triangles | 5453 |
Average shortest path length | 3.6060 |

The evaluating values of the clustering effect of two algorithms are shown in Figure

Evaluating values of the clustering effect of the Email dataset.

The experimental results on the three datasets show that the influence algorithm improves clustering accuracy on complex networks to varying degrees compared with the FN algorithm. The improvement is especially prominent for large-scale networks and for networks with high heterogeneity. Studies have shown that clusters of 50 to 100 nodes are relatively stable and realistic, and in this range the clustering algorithm based on the core influence of nodes performs much better than the FN algorithm.

In this paper, to address the bias of traditional clustering methods, we proposed a clustering algorithm based on the core influence of nodes. Building on the core influence of nodes, the algorithm simulates the process that drives cluster formation in sociology. It combines the advantages of heuristic and optimization algorithms and reflects the real clustering process more accurately. Clustering experiments on different datasets show that the clustering accuracy of this algorithm is superior to that of the classic FN algorithm in complex networks. The algorithm also runs faster and helps reveal the real cluster structure of complex networks.

Future work can proceed in two directions. The first is to improve the algorithm based on the core influence of nodes to achieve higher accuracy and to prove the “unbiased” nature of its clustering results. The second is to optimize the iterative strategy of the algorithm to further improve clustering efficiency on large-scale networks.

The authors declare that there is no conflict of interests regarding the publication of this paper.

This work is supported by the Research Fund of the State Key Laboratory of Software Development Environment under Grant no. BUAA SKLSDE-2012ZX-17, the National Natural Science Foundation of China under Grant nos. 61170296 and 61190125, the Program for New Century Excellent Talents in University under Grant no. NECT-09-0028, the Natural Science Foundation of Beijing, China, under Grant no. 4123101, and the Science Foundation of China University of Petroleum, Beijing (no. KYJJ2012-05-22).