Semisupervised Community Detection by Voltage Drops

Semisupervised community detection is an important topic in the study of complex networks and has attracted considerable attention in many applications. In this paper, based on the notion of voltage drops and discrete potential theory, a simple and fast semisupervised community detection algorithm is proposed. Label propagation is accomplished through discrete potential transmission by using voltage drops. The complexity of the proposal is 𝑂(|𝑉| + |𝐸|) for a sparse network with |𝑉| vertices and |𝐸| edges. The voltage value obtained for a vertex clearly reflects the relationship between that vertex and a community. Experimental results on four real networks and three benchmarks indicate that the proposed algorithm is effective and flexible. Furthermore, the algorithm is easily applied to graph-based machine learning methods.


Introduction
From the point of view of mathematics, many real-world systems in nature and society can be effectively modeled as complex networks or graphs. Specifically, the entities of the system are represented by the vertices and the interactions between the entities are represented by the edges. Examples include social relationships, the spreading of viruses and diseases, the World Wide Web, author cooperation networks, citation networks, and biochemical networks. It has been shown that many real-world networks have a structure of modules or communities, where the nodes within a community are more densely connected to each other than to nodes in other communities. Community structure plays an important role in the functional properties of a complex network, and finding such a structure can be of significant practical importance.
Identifying community structure in specific networks is of considerable practical merit because it gives insight into the structure-functionality relationship. In the past decades, plenty of techniques have been proposed to detect the community structure hidden in networks. The more typical algorithms for community detection can be found in [1]. Very recently, Chen et al. [2] defined antimodularity as a quantitative measure of anticommunity partitioning on a network and showed its reliability as a measure of the quality of an anticommunity partition. A vertex similarity probability model that finds community structure without prior knowledge of the type of complex network structure was presented in [3]. By studying the community structure of the Chinese character network, Zhang et al. [4] found that community structure is one of the most significant features of complex networks and plays an important role in their topology and function. Palla et al. [5] revealed that complex network models exhibit an overlapping community structure, also called fuzzy community. These complicated structures make it harder to construct algorithms that uncover them. Along this line, researchers have made great contributions to community detection [6-10].
The methods mentioned above are unsupervised community detection methods, since only the topological information of the network is used and its background knowledge is ignored. In fact, some prior information is of great value in identifying the community structure. Based on a discussion of the equivalence between the objective functions of symmetric nonnegative matrix factorization and the maximization of modularity density, Ma et al. [11] introduced a semisupervised clustering algorithm for

The Graph and Discrete Potential Method
The graph can be mathematically represented as $G = (V, E)$, where $V = \{v_1, v_2, \dots, v_n\}$ is the set of vertices and $E \subset V \times V$ denotes the set of edges. Generally, the graph can be expressed by its adjacency matrix $A$, whose elements $a_{ij}$ are equal to 1 if $v_i$ points to $v_j$ and 0 otherwise. We denote $d_i$ as the degree of vertex $v_i$. The degree matrix $D$ is a diagonal matrix containing the vertex degrees $d_i$ ($i = 1, 2, \dots, n$) of the graph on the diagonal. Then the Laplacian matrix $L$ can be defined as
$$L = D - A. \quad (1)$$
Denote $x_i^s$ as the potential of vertex $v_i$ in the electrostatic field generated by vertices with label $s$. Assign a zero potential to all labeled vertices with labels other than $s$ and a unit potential to the vertices labeled $s$. The process of potential transmission for each electrostatic field is a circuit theory problem and can be modeled by the combinatorial Dirichlet problem [15]. By using the Laplacian matrix $L$, a combinatorial formulation of the Dirichlet integral is in the form [15,19]
$$D[x] = \frac{1}{2} x^T L x, \quad (2)$$
where $x$ is the vector of potentials of all vertices; the desired potentials are those minimizing (2).
Reassigning the order of all vertices of the graph and putting the labeled vertices forward, (2) can be rewritten into
$$D[x_U] = \frac{1}{2} \begin{bmatrix} x_M^T & x_U^T \end{bmatrix} \begin{bmatrix} L_M & B \\ B^T & L_U \end{bmatrix} \begin{bmatrix} x_M \\ x_U \end{bmatrix} = \frac{1}{2} \left( x_M^T L_M x_M + 2 x_U^T B^T x_M + x_U^T L_U x_U \right), \quad (3)$$
where $x_M$ and $x_U$ are two vectors whose elements represent the potentials of labeled vertices and unlabeled vertices, respectively. Setting the derivative of $D[x_U]$ with respect to $x_U$ equal to zero, one can obtain a system of linear equations
$$L_U x_U = -B^T x_M, \quad (4)$$
where $x_U$ is a $|V_U|$-dimensional vector whose elements are the unknown quantities to be solved for, $V_U$ being the set of unlabeled vertices. If the graph is connected, or if every connected component contains a seed, then (4) will be nonsingular.
For each label $s$, a system of linear equations can be established as
$$L_U x^s = -B^T m^s, \quad (5)$$
where $m^s$ is the vector with a unit entry for each labeled vertex with label $s$ and a zero entry for every other labeled vertex. If one assigns a unit potential to the labeled vertices with label $s$ and zero to the other labeled vertices, it will generate an electrostatic field. The potentials of the unlabeled vertices can be obtained from the solution of (5). Each unlabeled vertex is then assigned the label $s$ for which its potential $x_i^s$ is greatest. Thus the community structure is detected.
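The procedure above can be illustrated by a minimal sketch, assuming an undirected, unweighted graph given as an edge list. The function names `solve_linear` and `dirichlet_labels`, the exact Gauss-Jordan solver, and the small path graph are illustrative assumptions and are not taken from [15]. For each label, the sketch builds the Laplacian, extracts the block belonging to the unlabeled vertices, solves the linear system of (5), and assigns each unlabeled vertex the label of greatest potential.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def dirichlet_labels(n, edges, seeds):
    """seeds maps a labeled vertex to its label (hypothetical input format)."""
    # Laplacian L = D - A of the undirected graph.
    lap = [[0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    unlabeled = [v for v in range(n) if v not in seeds]
    # Block of L restricted to the unlabeled vertices.
    l_u = [[lap[i][j] for j in unlabeled] for i in unlabeled]
    potentials = {}
    for s in sorted(set(seeds.values())):
        # Right-hand side: minus the coupling to the seeds carrying label s.
        rhs = [-sum(lap[i][j] for j in seeds if seeds[j] == s) for i in unlabeled]
        potentials[s] = solve_linear(l_u, rhs)
    labels = dict(seeds)
    for idx, v in enumerate(unlabeled):
        labels[v] = max(potentials, key=lambda s: potentials[s][idx])
    return labels, potentials


# Path 0-1-2-3-4-5 with one seed at each end.
path = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
labels, potentials = dirichlet_labels(6, path, {0: "a", 5: "b"})
print(labels)
```

On this path the potentials of label "a" decay linearly from the seed, so the two vertices nearest each seed receive that seed's label. A direct solve of this kind costs cubic time in the number of unlabeled vertices, which is why iterative solvers are used in practice.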
From the perspective of discrete potential theory, the solution to (5) can be interpreted in terms of circuit theory. Based on the three fundamental equations of circuit theory, Kirchhoff's Current Law, Ohm's Law, and Kirchhoff's Voltage Law, one can also get an equivalent system of (5) [15,19].
In [15], the solutions of (5) were obtained by the conjugate gradient descent algorithm, and a novel semisupervised community detection algorithm was proposed. Several experimental results demonstrate the effectiveness of that approach.

The Proposed Algorithm
It should be noted that the coefficient matrix $L_U$ in (5) must be symmetric positive definite when the nonhomogeneous linear equations (5) are solved by the conjugate gradient descent algorithm. Obviously, the Laplacian matrix $L$ is not positive definite: since every row of the Laplacian matrix sums to zero, 0 is always an eigenvalue, and the corresponding eigenvector is $(1, 1, \dots, 1)$. This fact compels us to develop a new method to detect communities in a network while considering the network as an electric circuit. In [16], Wu and Huberman introduced an unsupervised method that solves a system like (5) to discover the communities in a complex network in linear time. Since there is no class information in advance, they employed a bipartition strategy and some additional techniques for the case of multiple communities. In this work, we extend their work to the case of semisupervised community detection.
In what follows, we present a novel method to find community structure in complex networks by the process of voltage transmission.
For a given network, we suppose each edge to be a resistor with the same resistance. One attaches all the labeled vertices with label $s$ to the positive pole of a battery and the other labeled vertices to the negative pole, so that they have fixed voltages, say 1 and 0. Based on these assumptions, the network can be viewed as an electric circuit with current flowing through each edge (resistor). By solving the Kirchhoff equations, one can obtain the voltage value of each unlabeled vertex, which lies between 0 and 1. In this case, the voltage value of each vertex can be thought of as a membership degree similar to that in the FCM algorithm, which clearly reflects the relationship between a vertex and the $s$th community. In turn, we can get $k$ voltages of a vertex for the different labels if there are $k$ classes. In semisupervised learning methods, it is required that at least one sample be labeled in each class. This indicates that the class parameter $k$ is known in advance.
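Physically, if node $v$ connects to $m$ neighbors $v_1, v_2, \dots, v_m$ in an electric circuit, the Kirchhoff equation [20] tells us that the total current flowing into $v$ should sum up to zero; that is,
$$\sum_{j=1}^{m} I_j = \sum_{j=1}^{m} \frac{V_{v_j} - V_v}{R} = 0, \quad (6)$$
where $I_j$ is the current flowing from $v_j$ to $v$, $V_{v_j}$ is the voltage at neighbor node $v_j$, $V_v$ is the voltage at $v$, and $R$ is the common resistance of the edges.
It is easy to rewrite (6) into the following form:
$$V_v = \frac{1}{m} \sum_{j=1}^{m} V_{v_j}. \quad (7)$$
That is to say, the voltage of a node is the average of the voltages of its neighbors. Suppose the number of communities to be $k$; then the label set is $S = \{1, 2, \dots, k\}$. In addition, we also assume that there must be at least one labeled vertex in each community. Divide the vertex set $V$ into two parts, the labeled vertices $V_M$ and the unlabeled vertices $V_U$, and let $V_M^s \subseteq V_M$ be the set of labeled vertices with label $s$. Denote $V_i^s$ as the voltage of vertex $v_i$ in the electrostatic field generated by vertices with label $s$ and $N(v_i)$ as the set of neighbors of $v_i$.
If we reassign the order of all vertices of the graph, putting the labeled vertices forward and the labeled vertices with label $s$ first, then, following (6), one can get the system
$$V_i^s = 1, \quad \text{for } i = 1, 2, \dots, p, \quad (8)$$
$$V_i^s = 0, \quad \text{for } i = p + 1, p + 2, \dots, q, \quad (9)$$
$$\sum_{v_j \in N(v_i)} \left( V_j^s - V_i^s \right) = 0, \quad \text{for } i = q + 1, q + 2, \dots, n, \quad (10)$$
where $p$ is the number of labeled vertices with label $s$ and $q$ is the total number of labeled vertices. Equation (10) is a linear system with $n - q$ variables and can be put into a symmetrical matrix form. Generally, it will take $O((n - q)^3)$ time to solve this system directly. The Wu-Huberman algorithm [16] skillfully avoids this difficulty by solving (8)-(10) for $p = 1$ and $q = 2$. This method lends itself naturally to semisupervised learning. We now extend it to the case of semisupervised learning. Specifically, we first set $V_i^s = 1$, for $v_i \in V_M^s$, and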

Mathematical Problems in Engineering
Starting from $N(v_i)$ ($v_i \in V_M^s$), one consecutively updates the voltages of $v_j \in V_U$ to
$$V_j^s = \frac{1}{d_j} \sum_{v_l \in N(v_j)} V_l^s.$$
The updating process adopts the breadth-first search algorithm and ends when voltages have been obtained for all vertices in $V_U$. This process is called a round. One spends $O(d_i)$ time calculating the neighbor voltages of vertex $v_i$ and $|V|$ time setting the initial voltages; therefore the complexity of one round is $O(|V| + |E|)$. After repeating the updating process for a finite number of rounds, one reaches an approximate solution of (14) within a certain precision, which depends only upon the number of iteration rounds. Unlike Wu and Huberman's method [20], we do not need to compute the ideal voltage gap or to know roughly the size of each community. As a result, we get an $(n - q)$-dimensional voltage vector. The component $V_i^s$ reflects the relationship between vertex $v_i$ and the $s$th community. For each label $s$ in the label set $S = \{1, 2, \dots, k\}$, we repeat this process. Therefore, for each vertex $v_i$, we obtain a voltage vector (

Experiments
To validate the proposed algorithm, we test it on four real networks and three benchmarks which are widely used to test the validity of community division methods. The experimental platform is Windows 7 Ultimate Service Pack 1 with an Intel® Core™ i5-3470 CPU at 3.20 GHz, 4.00 GB of memory, a 64-bit operating system, and Java 1.8 with Eclipse RCP Luna SR1.

Three Evaluation Indices of Clustering.
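To assess the quality of a partition, we use the F-measure, the P-measure (purity), and the modularity $Q$ to quantify the clustering results. The F-measure is a harmonic combination of the precision and recall values used in information retrieval [21].
If $n_i$ is the number of members of class $i$, $n_j$ is the number of members of cluster $j$, and $n_{ij}$ is the number of members of class $i$ in cluster $j$, then the precision $P(i, j)$ and recall $R(i, j)$ can be defined as
$$P(i, j) = \frac{n_{ij}}{n_j}, \qquad R(i, j) = \frac{n_{ij}}{n_i}.$$
The F-measure of class $i$ with respect to cluster $j$ is denoted by
$$F(i, j) = \frac{2 P(i, j) R(i, j)}{P(i, j) + R(i, j)}.$$
The corresponding F-measure (FM) of the whole clustering result is defined as
$$FM = \sum_i \frac{n_i}{n} \max_j F(i, j),$$
where $n$ is the total number of members in the data set.
In general, a higher value of the F-measure indicates a better clustering result.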
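As a minimal sketch of this index, assuming the ground-truth classes and the detected clusters are given as two parallel lists (the function name `f_measure` and the input format are illustrative assumptions), the F-measure of a whole clustering result can be computed as follows.

```python
from collections import Counter


def f_measure(classes, clusters):
    """classes[i] and clusters[i] are the true class and detected cluster of item i."""
    n = len(classes)
    class_size = Counter(classes)
    cluster_size = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for c, n_c in class_size.items():
        best = 0.0
        for k, n_k in cluster_size.items():
            n_ck = joint[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        # Each class contributes its best-matching cluster, weighted by class size.
        total += n_c / n * best
    return total


print(f_measure([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```

A partition identical to the ground truth (up to renaming of clusters) gives the maximal value 1.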
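The purity of a cluster represents the fraction of the cluster corresponding to the largest class of data assigned to that cluster; thus the purity of cluster $j$ is defined as
$$P_j = \frac{1}{n_j} \max_i n_{ij},$$
where $n_j$ is the number of members of cluster $j$ and $n_{ij}$ is the number of members of class $i$ in cluster $j$. The purity of the whole clustering result is defined as
$$PM = \sum_j \frac{n_j}{n} P_j,$$
where $n$ is the total number of members in the data set. In general, the larger the purity value is, the better the clustering result is.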
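The purity can be sketched in the same way. The function name `purity` and the parallel-list input are again illustrative assumptions. Since the cluster sizes cancel in the weighted sum, the overall purity reduces to the total size of the dominant class in each cluster divided by the number of items.

```python
from collections import Counter


def purity(classes, clusters):
    """Overall purity of a clustering against ground-truth classes."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    largest = {}
    for (k, _), count in joint.items():
        # Keep the size of the dominant class inside cluster k.
        largest[k] = max(largest.get(k, 0), count)
    return sum(largest.values()) / n


print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```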
In order to quantify the validity of the community division of a complex network and to optimize the chosen splitting, we use, following [22], the concept of modularity. It is defined as follows: given a network division, let $e_{ij}$ be the fraction of edges in the network that connect vertices in group $i$ to those in group $j$, and let $a_i = \sum_j e_{ij}$. Then the modularity $Q$ is defined as
$$Q = \sum_i \left( e_{ii} - a_i^2 \right).$$
It measures the fraction of edges that fall within communities minus the expected value of the same quantity in a random graph with the same community division. A larger $Q$ corresponds to a better community structure.
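A minimal sketch of the modularity computation is given below, assuming an undirected edge list and a dictionary mapping each vertex to its group (the function name `modularity` and both input formats are illustrative assumptions). The fraction of edge ends attached to a group is computed from the degree sum of the group, which corresponds to the symmetric convention in which an edge between two different groups is split equally between the two off-diagonal entries.

```python
def modularity(edges, community):
    """Modularity of a partition: sum over groups of e_ii - a_i^2."""
    m = len(edges)
    inside = {}      # number of edges with both ends in the group
    degree_sum = {}  # number of edge ends attached to the group
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by a single edge, split into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(modularity(edges, groups))
```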

Experiment on Four Real Networks.
Testing an algorithm essentially means analyzing a network with a well-defined community structure and recovering its communities. In this subsection, four classical complex networks with known community structures are selected to test the introduced algorithm. Descriptions of these four networks can be found in [1,11,16,23]. Taking the Zachary Karate Club network with two communities as an example, we first randomly choose one node in each community and label it. The algorithm is then run on this network and a community division is detected. The values of FM, PM, and $Q$ can be computed from the obtained partition. The community division may change with a different selection of initial labeled nodes. To evaluate the validity of the proposal objectively, we calculate the average values of the three indices over 10 randomly chosen groups of initial labeled nodes. In the same way, we also compute the three index values as the number of labeled nodes in each community is increased. In Table 1, we list the average values of the three indices obtained by randomly selecting 10 groups of different labeled nodes. From Table 1, it is easy to see that the proposed algorithm detects an ideal community division for these four networks when we label 3 nodes in each community. The accuracy of the network partition is greater than or equal to 94% except for the polbooks network. The three index values increase or vary only slightly as the number of labeled nodes increases. These results also show that one can detect a good network partition by labeling a small quantity of nodes in each community. For the football network, we get the same partition accuracy as in [15] when the number of randomly selected labeled vertices ranges from 1 to 4.
In Table 2, the average run times of the proposed algorithm for the four real networks are presented. The run times decrease as the number of labeled nodes increases. This is reasonable because the number of nodes that need to be divided is reduced.
Figures 1 and 2 show the variation of the run time of the proposed algorithm and of the values of the three indices for the dolphins network and the karate network, respectively.

Experiment on Three Benchmarks.
For testing community detection algorithms on graphs with overlapping communities, several artificial networks or benchmarks have been introduced. Among them, the most famous benchmark for community detection is a class of networks introduced by Girvan and Newman (GN) [24]. Each network has 128 nodes, divided into four communities of 32 nodes each. The average degree of the network is 16 and the nodes have approximately the same degree, as in a random graph.
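A network of this kind can be generated with the following minimal sketch. It assumes a planted-partition construction in which each pair of nodes is linked independently, and it assumes that the mixing parameter `mu` is the expected fraction of a node's 16 links that leave its own community. The function name `gn_benchmark` and its parameters are illustrative and are not taken from [24].

```python
import random


def gn_benchmark(mu, seed=0):
    """Return (edges, community) for a 128-node, 4-community random graph."""
    rng = random.Random(seed)
    n, groups, degree = 128, 4, 16
    size = n // groups
    community = {v: v // size for v in range(n)}
    # Link probabilities chosen so that the expected degree is 16.
    p_in = degree * (1 - mu) / (size - 1)
    p_out = degree * mu / (n - size)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community


edges, community = gn_benchmark(mu=0.2, seed=1)
print(len(edges), 2 * len(edges) / 128)
```

Small values of `mu` give well-separated communities, while values above 0.5 give graphs in which most links of a node leave its community.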
In what follows, we apply the proposed algorithm to detect the communities on this benchmark. For each fixed number of labeled nodes, we again randomly select 10 groups of different initial labeled nodes to compute the average values of the three indices. The benchmark can be regarded as a network with apparent community structure if the mixing parameter $\mu < 0.5$. From Table 3, one can see that the four communities in this benchmark are detected accurately when $\mu < 0.25$ and the number of labeled nodes is equal to or greater than 4. If we take $\mu = 0.5$ and label 10 nodes in each community, 90% of the nodes in this benchmark can be partitioned correctly. When $\mu > 0.5$, this benchmark has overlapping community structures. Although the partition accuracy becomes higher and higher as the number of labeled nodes increases, we cannot find ideal communities in this network. In particular, our algorithm fails to divide it into four groups when $\mu = 1$ and the number of labeled nodes in each group is less than 10.
Assuming that both the degree and the community size distributions are power laws, Lancichinetti et al. [25] designed a more general benchmark for testing community detection algorithms on graphs. Some parameters used in this benchmark are explained as follows: $\mu$: mixing parameter; $t_1$: minus exponent for the degree sequence; $t_2$: minus exponent for the community size distribution; minc: minimum for the community sizes; maxc: maximum for the community sizes; on: number of overlapping nodes; om: number of memberships of the overlapping nodes; $C$: average clustering coefficient (not mandatory).
In this benchmark, $N$ (number of nodes), $k$ (average degree), maxk (maximum degree), and $\mu$ have to be specified. For the others, the program can use default values: $t_1 = 2$; $t_2 = 1$; on = 0; om = 0; minc and maxc will be chosen close to the degree sequence extremes.
To test the validity of our algorithm on a large network, we apply the proposed algorithm to this benchmark with parameters $N = 10^5$, $k = 20$, maxk $= 10^4$, and $t_2 = 1$. The mixing parameter $\mu$ is varied from 0.1 to 0.6. For each fixed $\mu$, we take $t_1 = 2$ and $t_1 = 3$, respectively. Unlike the GN benchmark, the community size follows a power law in this network. Therefore, it is proper to label nodes in each community in terms of node proportion. The minimal proportion which we will take is 10% because of the requirement that there
descending with the increasing of mixing parameter $\mu$. This shows that a good network partition will not be found by the proposed algorithm for a network whose communities overlap seriously. Figure 3 presents the comparison of the run time of our algorithm on the two benchmarks with different parameters and numbers or proportions of labeled nodes. An increase in the number or proportion of labeled nodes implies that the number of unlabeled nodes in the benchmarks decreases, and therefore less and less time is needed to partition the network into groups.
We now present our experimental results on the LFR benchmark and further compare our proposal with the GN algorithm [24], the spectral clustering algorithm [1], the NMF algorithm [20], and the SNMF-SS algorithm [11] by a normalized mutual information index (NMI).
The LFR benchmark was designed by Lancichinetti et al. [25] and is widely employed to test the performance of community structure identification. It allows the user to specify distributions for both the community sizes and the degrees and then generates vertices and communities by sampling from those distributions. The mix parameter $\mu$ represents the average ratio of intracommunity adjacencies to total adjacencies. A large $\mu$ corresponds to a network with apparent community structure. In this paper, the input parameters of the LFR benchmark are the same for our algorithm and the comparative algorithms. For each value of $\mu \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$, we generated 50 instances of LFR benchmark graphs whose node degree is taken from a power law distribution with exponent 2 and whose community size is taken from a power law distribution with exponent 1. Each graph has 1000 vertices, an average degree of 15, a maximum degree of 50, a maximum community size of 50, and a minimum community size of 5. The definition of NMI can be found in [11,15,26].
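For illustration, a normalized mutual information index between a detected partition and the ground truth can be sketched as follows. The sketch assumes the common normalization $2I(A;B)/(H(A)+H(B))$, which may differ in detail from the definitions in [11,15,26]. The function name `nmi` and the parallel-list input are illustrative assumptions.

```python
import math
from collections import Counter


def nmi(a, b):
    """Normalized mutual information of two labelings given as parallel lists."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mutual = sum(c / n * math.log(c * n / (count_a[x] * count_b[y]))
                 for (x, y), c in joint.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        # Both labelings put every item in a single group.
        return 1.0
    return 2 * mutual / (h_a + h_b)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Identical partitions (up to renaming of the groups) give the value 1, and statistically independent partitions give the value 0.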
From Figure 4, we can see that the values of NMI obtained by our algorithm are larger than those obtained by the other four algorithms. The peak value of our approach is 0.732 at $\mu_{avg} = 0.9$. This value is larger than the value 0.7 obtained by the SNMF-SS algorithm. Because a decrease of $\mu$ means that the LFR benchmark has an obscure community structure, it becomes difficult for the five algorithms to detect communities correctly. It is reasonable that the NMI values obtained by the five algorithms become smaller and smaller as $\mu$ decreases. The NMF algorithm seems to be stable since its NMI decreases slowly. The performance of our proposal decreases greatly while $\mu$ is greater than 0.6. This fact implies that our algorithm is not suited to networks with nonapparent community structure. However, compared with the other four algorithms, our algorithm gains the best performance.

Conclusions
In this paper, we propose a semisupervised community detection algorithm for partitioning a network into groups. This approach combines discrete potential theory with the Wu-Huberman algorithm. The complexity $O(|V| + |E|)$ of the introduced approach indicates that it can be applied to detect communities in large networks. The validity of our proposal is demonstrated by applying it to four real networks and three benchmarks. The experimental results show that a good community division of a complex network is obtained by labeling a small quantity of nodes in each community. However, it is difficult for our method to classify correctly a network with heavily overlapping communities or obscure community structure. This can be seen from the experimental results on the LFR benchmark. Therefore, it is worthwhile to develop new and fast algorithms to deal with this case.

Figure 3: The comparison of run time of our algorithm on two benchmarks.
$V_i^1, V_i^2, \dots, V_i^k$). The element $V_i^s$ can be considered as the membership degree with which vertex $v_i$ belongs to the $s$th community. The vertex $v_i$ is within the $s$th community if $V_i^s = \max_{1 \le t \le k} V_i^t$. That is to say, the largest voltage of each vertex indicates to which community the vertex $v_i$ should belong.
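The voltage-propagation procedure can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the function name `voltage_labels`, the adjacency-dictionary input, the fixed number of rounds, and the in-place update of voltages within a round are all choices made for the example. For each label, the seeds of that label are held at voltage 1 and all other seeds at 0. The unlabeled vertices are visited in breadth-first order from the seeds, and each visit replaces a vertex voltage by the average voltage of its neighbors. Every vertex is finally assigned the label of its largest voltage.

```python
from collections import deque


def voltage_labels(adj, seeds, rounds=100):
    """adj: vertex -> list of neighbors; seeds: labeled vertex -> label."""
    labels = sorted(set(seeds.values()))
    voltages = {}
    for s in labels:
        # Seeds with label s are fixed at 1, everything else starts at 0.
        volt = {v: 1.0 if seeds.get(v) == s else 0.0 for v in adj}
        # Breadth-first order of the unlabeled vertices, starting from the seeds of s.
        queue = deque(v for v in adj if seeds.get(v) == s)
        seen = set(queue)
        order = []
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
                    if w not in seeds:
                        order.append(w)
        # One round costs time proportional to the number of vertices plus edges.
        for _ in range(rounds):
            for v in order:
                volt[v] = sum(volt[w] for w in adj[v]) / len(adj[v])
        voltages[s] = volt
    assignment = {v: max(labels, key=lambda s: voltages[s][v]) for v in adj}
    return assignment, voltages


# Two triangles joined by the edge 2-3, with one labeled vertex in each triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
assignment, voltages = voltage_labels(adj, {0: "a", 5: "b"})
print(assignment)
```

In this example the voltage of vertex 2 in the field of label "a" converges to 5/7, so the two triangles are recovered as the two communities.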

Table 1: The average values of indices for four real networks.

Table 2: The average run time of the proposed algorithm for four real networks.

Table 3: The average values of indices on GN benchmark.

Table 4: The mean values of indices on power law benchmarks.
network while $\mu < 0.2$ and 10% of the nodes in each community are labeled. In this case, there is no distinct variation in the three index values as the label proportion increases. This fact indicates that one can detect a good community division on a network with apparent community structure even though only a few nodes are labeled. The values in each column are