A New Multiobjective Evolutionary Algorithm for Community Detection in Dynamic Complex Networks

Community detection in dynamic networks is an important research topic and has received an enormous amount of attention in recent years. Previous detection methods have used modularity as the measure to quantify the quality of a community partition. However, modularity is known to suffer from a resolution limit. In this paper, we propose a novel multiobjective evolutionary algorithm for community detection in dynamic networks, based on the framework of the nondominated sorting genetic algorithm. Modularity density, which addresses the limitations of the modularity function, is adopted to measure the snapshot cost, and normalized mutual information is selected to measure the temporal cost. Problem-specific knowledge is used in designing the genetic operators. Furthermore, a local search operator is designed to improve the effectiveness and efficiency of community detection. Experimental studies on synthetic datasets show that the proposed algorithm obtains better performance than the compared algorithms.


Introduction
Many real-world complex systems take the form of networks. Acquaintance networks, the Internet, power grids, and neural networks are some examples. Networks can be modeled as graphs, where the individual objects are represented by nodes and the interactions among these objects are represented by edges. One important property of these networks is community structure: the vertices are often found to cluster into tightly knit groups with a high density of within-group edges and a lower density of between-group edges [1]. Detecting such community structure is of great practical value.
Traditional community detection treats the network as a static graph, which is either derived by aggregating data over all time or taken as a snapshot of the data at a particular time. Researchers have proposed many effective detection methods for static networks. Dynamic networks, in contrast, capture the modifications of interconnections over time, which allows the changes of network structure to be traced across time steps. Community detection in dynamic networks is attracting increasing interest.
Recently, a framework called temporal smoothness has been applied to community detection in dynamic networks. In this framework, significant changes of the cluster structure within a short time period are not desirable [2]. In order to smooth each community over time, two competing objectives must be traded off: snapshot quality, which requires that the clustering reflect as accurately as possible the data arriving during the current time step, and temporal cost, which requires that each clustering not shift dramatically from one time step to the next. Using this idea, Folino and Pizzuti proposed a multiobjective approach named DYN-MOGA to discover communities in dynamic networks by employing genetic algorithms [3]. In DYN-MOGA, Community Score (CS) and Normalized Mutual Information (NMI) were selected as the two objectives to be optimized simultaneously. At the end of each timestamp, DYN-MOGA returns the set of solutions contained in the Pareto front. They adopted modularity as the criterion to automatically select one solution from this set. Following this work, Gong et al. introduced a novel multiobjective immune algorithm with local search to solve the community detection problem in dynamic networks [4]. They adopted modularity and NMI as the two objectives to optimize.
However, modularity suffers from a resolution limit [5][6][7]. Fortunato and Barthélemy [5] found that modularity optimization may fail to identify modules smaller than a scale which depends on the total size of the network and on the degree of interconnectedness of the modules, even in cases where modules are unambiguously defined. To overcome the limitations of the modularity function, a new measure named modularity density was presented in [8].
In this paper, we present a novel multiobjective algorithm, named DNCD-MOEA, to detect communities in dynamic networks. The algorithm adopts modularity density as one objective to measure how well the clustering found represents the data at the current time, and it adopts NMI as the other objective to measure the distance between two clusterings at consecutive time steps. New genetic operators based on problem-specific knowledge are designed to solve the proposed model. In order to improve the quality of the solutions, a local search operator is also designed.
The outline of the paper is as follows. Section 2 introduces related background. In Section 3 we present our algorithm and explain its steps. Experimental studies are presented in Section 4. Section 5 concludes this paper.

Background
2.1. Notation. We define a static network $N^t$ at time $t$ as $N^t = (V^t, E^t)$, where $V^t$ is a set of objects, each $v_i \in V^t$ denotes a node, and $E^t$ is a set of links, each $e_{ij} \in E^t$ represents an edge that connects two nodes $v_i$ and $v_j$ at time $t$. The dynamic network $N$ can be defined as a sequence of static networks $N^t = (V^t, E^t)$; that is, $N = \{N^1, N^2, \ldots, N^T\}$.
Under the temporal smoothness framework, the cost of a community structure at time $t$ is defined as

$$\text{cost} = \alpha \cdot SC + (1 - \alpha) \cdot TC, \tag{1}$$

where SC denotes the snapshot cost, which measures how well a community structure $C^t$ represents the data at time $t$, and TC denotes the temporal cost, which measures how similar the community structure $C^t$ is to the previous community structure $C^{t-1}$. The input parameter $\alpha$ is used by the user to control the level of emphasis on each of the two objectives. When $\alpha = 1$, the framework returns the clustering without temporal smoothing.
When $\alpha = 0$, however, the framework produces the same clustering structure as the previous time step, that is, $C^t = C^{t-1}$. Thus the preference for each subcost can be controlled by changing the value of the parameter $\alpha$ between 0 and 1.

Proposed Algorithm
As in [3], we adopt the multiobjective framework for dynamic community detection, which treats the snapshot cost SC and the temporal cost TC as two competing objectives. Modularity density and NMI are employed as the snapshot cost SC and the temporal cost TC, respectively. The main advantage of this method is that the control parameter $\alpha$ does not need to be fixed.
The proposed algorithm DNCD-MOEA is realized under the framework of NSGA-II [9]. The details of the objective functions and the representation method are given as follows.

Objective Functions.
To denote the snapshot cost SC, we adopt modularity density as an objective function to measure the quality of the community structure. Given an undirected network $G = (V, E)$ consisting of the vertex set $V = \{v_1, v_2, \ldots, v_n\}$ ($n$ is the cardinality of $V$) and the edge set $E$, the modularity density is defined as follows:

$$D = \sum_{i=1}^{m} \frac{L(V_i, V_i) - L(V_i, \overline{V_i})}{|V_i|}, \tag{2}$$

where $\{V_i\}_{i=1}^{m}$ is a partition of the vertex set $V$ into $m$ groups, $\overline{V_i}$ is the complement of $V_i$ with respect to $V$, $L(V_1, V_2) = \sum_{i \in V_1, j \in V_2} A_{ij}$ ($A = (A_{ij})$ is the adjacency matrix of $G$), and $|V_i|$ is the cardinality of $V_i$.
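As an illustration of the modularity density defined above, the following minimal Python sketch evaluates it on a toy graph. The adjacency-dictionary input format, the function name `modularity_density`, and the toy graph itself are assumptions made for this example and are not taken from the paper; communities are assumed to be nonempty.

```python
def modularity_density(adj, communities):
    """Sum over communities of (L(V_i, V_i) - L(V_i, complement)) / |V_i|.

    adj: dict mapping each node to the set of its neighbours (undirected graph).
    communities: iterable of nonempty node collections forming a partition.
    """
    total = 0.0
    for nodes in communities:
        members = set(nodes)
        internal = 0  # L(V_i, V_i): every internal edge is counted twice
        external = 0  # L(V_i, complement): edges leaving the community
        for u in members:
            for v in adj[u]:
                if v in members:
                    internal += 1
                else:
                    external += 1
        total += (internal - external) / len(members)
    return total


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
d_split = modularity_density(toy_adj, [{0, 1, 2}, {3, 4, 5}])
d_merged = modularity_density(toy_adj, [{0, 1, 2, 3, 4, 5}])
print(d_split, d_merged)
```

On this toy graph the two-triangle partition scores higher than the single merged community, which is the behaviour one would expect from a measure that rewards dense, well-separated groups.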
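Normalized Mutual Information (NMI) is employed as the second objective function, that is, the temporal cost. Danon et al. have shown NMI to be a reliable similarity measure [10].
Let $A = \{A_1, A_2, \ldots, A_a\}$ and $B = \{B_1, B_2, \ldots, B_b\}$ denote two partitions of a network into communities, and let $C$ denote the confusion matrix whose element $C_{ij}$ is the number of nodes of the community $A_i \in A$ that are also in the community $B_j \in B$. The $NMI(A, B)$ is defined as follows:

$$NMI(A, B) = \frac{-2 \sum_{i=1}^{a} \sum_{j=1}^{b} C_{ij} \log\left(\dfrac{C_{ij}\, n}{C_{i\cdot}\, C_{\cdot j}}\right)}{\sum_{i=1}^{a} C_{i\cdot} \log\left(\dfrac{C_{i\cdot}}{n}\right) + \sum_{j=1}^{b} C_{\cdot j} \log\left(\dfrac{C_{\cdot j}}{n}\right)}, \tag{3}$$

where $C_{i\cdot}$ ($C_{\cdot j}$) is the sum of the elements of $C$ in row $i$ (column $j$) and $n$ is the number of nodes. If $A = B$, then $NMI(A, B) = 1$.
If $A$ and $B$ are completely different, then $NMI(A, B) = 0$. In this paper, the two objectives $D(C^t)$ and $NMI(C^{t-1}, C^t)$ are maximized simultaneously.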
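The NMI between two partitions can be computed directly from the confusion matrix. The sketch below is one possible pure-Python rendering, assuming each partition is given as a list of community labels indexed by node; this input format and the function name `nmi` are choices made for the example, not something specified in the paper. Zero entries of the confusion matrix are skipped, since they contribute nothing to the sum.

```python
import math


def nmi(labels_a, labels_b):
    """Normalized mutual information of two partitions given as label lists."""
    n = len(labels_a)
    confusion = {}  # (label in A, label in B) -> number of shared nodes
    row_sum = {}    # community sizes in partition A
    col_sum = {}    # community sizes in partition B
    for a, b in zip(labels_a, labels_b):
        confusion[(a, b)] = confusion.get((a, b), 0) + 1
        row_sum[a] = row_sum.get(a, 0) + 1
        col_sum[b] = col_sum.get(b, 0) + 1

    numerator = -2.0 * sum(
        c * math.log(c * n / (row_sum[a] * col_sum[b]))
        for (a, b), c in confusion.items()
    )
    denominator = (
        sum(r * math.log(r / n) for r in row_sum.values())
        + sum(c * math.log(c / n) for c in col_sum.values())
    )
    if denominator == 0.0:
        # Both partitions consist of one single community, so they are identical.
        return 1.0
    return numerator / denominator


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same grouping, relabelled
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # groupings share no information
print(same, independent)
```

Because the measure depends only on how nodes are grouped, relabelling the communities does not change the result.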

Solution Selection.
The proposed algorithm DNCD-MOEA returns the Pareto front at the end of each timestamp, which contains a set of solutions. Each of these solutions corresponds to a different tradeoff between the two objectives and thus to a different partitioning of the network, consisting of a varying number of clusters. A criterion is therefore needed to automatically select the solution that represents the best partitioning of the current network at each time step. Community score, introduced in [11], has proved to be very effective in detecting communities. In this paper, we adopt community score as the criterion and select, among the solutions found, the one having the highest community score. It is defined as follows.
Let $C^t = \{C_1^t, C_2^t, \ldots, C_k^t\}$ be the set of all communities in network $N^t$ at time $t$, where $C_j^t$ denotes the $j$th community at time $t$. The community score of $C^t$ is defined as

$$CS(C^t) = \sum_{j=1}^{k} \text{score}(C_j^t),$$

where $\text{score}(C_j^t)$ is the local score of the community $C_j^t$ as defined in [11]. The parameter $\mu_i = (1/|C_j^t|) \sum_{l \in C_j^t} A_{il}$ denotes the fraction of edges which connect each node $i$ of $C_j^t$ to the nodes in the same community $C_j^t$. The community score takes into account both the fraction of interconnections among the nodes and the number of interconnections contained in the module $C_j^t$. It gives a global measure of the network division into communities by summing up the local score of each module found. Thus a larger community score indicates a stronger community structure, and we select the solution with the maximum community score in the set of solutions as the best solution.

Genetic Representation.
The locus-based adjacency representation proposed by Park and Song [12] is used in our community detection algorithm, similar to [11]. As a graph $G = (V, E)$ has $n$ vertices, an arbitrary individual of the population consists of $n$ genes and can be represented as a string $(g_1, g_2, \ldots, g_n)$, where $g_i = j$ denotes that there exists a link between the nodes $i$ and $j$. This means that the nodes $i$ and $j$ will be in the same community. In the decoding step, all of the connected components of the corresponding graph must be identified. The nodes belonging to the same component are assigned to the same community. The main advantage of this representation is that the decoding step can be done in linear time, and it has been verified as an effective encoding scheme for community detection, as shown in [3]. Additionally, the number $k$ of clusters is determined automatically in the decoding step by the number of components contained in an individual [11]. An example of the encoded genotype and the corresponding network is shown in Figure 1. As shown in Figure 1(a), the example network consists of seven nodes numbered from 1 to 7. The network can clearly be partitioned into two groups, visualized by different shapes. A possible genotype, shown in Figure 1(b) and corresponding to the optimal solution, is translated into the graph structure given in Figure 1(c). Each connected component provides a grouping of nodes that corresponds to the partitioning of the network in Figure 1(a).
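The decoding step described above amounts to finding connected components of the graph induced by the genotype. A minimal sketch using a union-find structure is given below. Nodes are numbered from 0 here (the paper numbers them from 1), and the seven-gene example genotype is hypothetical; it is not the genotype of Figure 1.

```python
def decode(genotype):
    """Turn a locus-based genotype into a list of communities (sets of nodes).

    genotype[i] = j means that node i is linked to node j, so both nodes
    end up in the same connected component and hence the same community.
    """
    parent = list(range(len(genotype)))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(genotype):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    groups = {}
    for node in range(len(genotype)):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Hypothetical genotype for a seven-node network (0-based node indices).
example_genotype = [1, 2, 0, 4, 5, 6, 3]
communities = decode(example_genotype)
print(communities)
```

The number of communities is simply `len(communities)`, so it never has to be supplied in advance, which matches the property noted in the text.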

Initialization.
If an individual is created randomly, it may not be a feasible solution. In fact, a randomly generated individual could contain an allele value $j$ in the $i$th position even though no connection exists between the two nodes $i$ and $j$; that is, the edge $(i, j)$ is not present. To overcome this limitation, each individual is checked for safety after it is created. When the individual is not safe, that is, a gene $i$ contains a value $j$ but the link $(i, j)$ does not exist, it is repaired so that it becomes safe. Safe individuals improve the convergence of the method because the space of possible solutions is restricted.
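The safety check and repair can be sketched as follows. The paper does not spell out the repair rule in this section, so the rule used here (replace an invalid allele with a randomly chosen neighbour of the corresponding node) is an assumption, as are the function names and the toy graph. The sketch also assumes that every node has at least one neighbour.

```python
import random


def is_safe(individual, adj):
    """An individual is safe if every gene i holds a neighbour of node i."""
    return all(j in adj[i] for i, j in enumerate(individual))


def repair(individual, adj, rng):
    """Replace every invalid allele with a random neighbour (assumed rule)."""
    return [
        j if j in adj[i] else rng.choice(sorted(adj[i]))
        for i, j in enumerate(individual)
    ]


def random_safe_individual(adj, rng):
    """Biased initialization: sample each gene directly from the neighbours."""
    return [rng.choice(sorted(adj[i])) for i in range(len(adj))]


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
toy_adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
rng = random.Random(42)
unsafe = [5, 0, 3, 0, 3, 4]  # genes 0 and 3 point at non-neighbours
fixed = repair(unsafe, toy_adj, rng)
fresh = random_safe_individual(toy_adj, rng)
print(is_safe(unsafe, toy_adj), is_safe(fixed, toy_adj), is_safe(fresh, toy_adj))
```

Sampling genes directly from the neighbour sets, as in `random_safe_individual`, avoids the repair pass altogether and yields the same restricted search space.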

Crossover and Mutation.
As in [3], we use uniform crossover, which guarantees that the effective connections of the nodes in the network are maintained in the child individual. Given two arbitrary safe parent individuals, a random binary vector is created; a gene is taken from the first parent if the corresponding vector element is 1 and from the second parent if the element is 0. The selected genes are then combined to form a child. Because of the biased initialization, a child created from two safe parents is also safe: if a position $i$ in the child contains a value $j$, then the edge $(i, j)$ exists. The uniform crossover is shown in Figure 2.
The crossover operator is regarded as a macroscopic operation on individuals, while the mutation operator is regarded as a microscopic operation on individuals. A mutation operator that randomly changes the value $j$ of the $i$th gene would cause a useless exploration of the search space. In order to guarantee that the mutated child is safe, as with the crossover operation, the possible values of an allele after mutation are restricted to the neighbors of the corresponding node.
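The two variation operators can be sketched in a few lines of Python. The function names, the per-gene mutation rate parameter, and the toy graph below are assumptions made for illustration; the paper does not give parameter values in this section.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Take each gene from parent_a where the random mask is 1, else parent_b."""
    mask = [rng.randint(0, 1) for _ in parent_a]
    return [a if bit == 1 else b for a, b, bit in zip(parent_a, parent_b, mask)]


def neighbor_mutation(individual, adj, rate, rng):
    """Mutate each gene with probability `rate`, choosing only among neighbours."""
    child = list(individual)
    for i in range(len(child)):
        if rng.random() < rate:
            child[i] = rng.choice(sorted(adj[i]))
    return child


def is_safe(individual, adj):
    """Every gene i must hold a neighbour of node i."""
    return all(j in adj[i] for i, j in enumerate(individual))


# Hypothetical toy graph and two safe parents (0-based node indices).
toy_adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
parent_a = [1, 2, 0, 4, 5, 3]
parent_b = [2, 0, 3, 2, 3, 4]
rng = random.Random(7)
child = uniform_crossover(parent_a, parent_b, rng)
mutant = neighbor_mutation(child, toy_adj, rate=0.5, rng=rng)
print(child, mutant)
```

Since every gene of the child is copied from a safe parent at the same position, and every mutated gene is drawn from the neighbour set, both operators preserve safety without a separate repair step.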

The Pseudocode of the Proposed Algorithm.
The pseudocode of our DNCD-MOEA algorithm is described in Algorithm 1.

Local Search Strategy.
Local search has proved to be an effective technique. The mutation operator is regarded as a microscopic operation on individuals and can act as a local search by moving single nodes between communities. Inspired by this idea, we adopt the mutation operator in our local search algorithm.
In the local search algorithm, the multiple objectives need to be converted into a single objective function. In our study, we select a weighted objective as follows [3]:

$$F(x) = \sum_{i=1}^{2} w_i f_i(x),$$

where $f_i(x)$ is the objective function described in (2) and (3), respectively, and $w_i$ are the nonnegative weights for the two objectives. The weights are calculated from $f_i^{\max}$ and $f_i^{\min}$, the maximum and minimum values of each objective function $f_i(x)$ in the obtained dominant population, such that $\sum_{i=1}^{2} w_i = 1$. The detailed pseudocode of the local search algorithm is given in Algorithm 2.
respectively. So the total time complexity of the main program is ( 2 ). The local search operation mainly contains two loops and its computational complexity is (). So, the total computational complexity of the DNCD-MOEA algorithm is ( 2 ) + (). Because  < ,  = 2, and in practice  < , the time complexity can be simplified to ( 2 ).

Experiments
In order to check the ability of our approach on a dynamic network, we adopt the method proposed in [3] to generate data simulating a dynamic network. Firstly, we generate synthetic datasets by following the procedure suggested by Girvan and Newman [1]. The datasets are generated using the software package designed by Lancichinetti et al. [13]. The data have 128 nodes, which are divided into 4 communities of 32 vertices each. Every node has an average degree of 16 and shares a number $z_{out}$ of edges with nodes in other communities, where $z_{out}$ represents the average number of edges from a node to nodes in other communities. If we increase $z_{out}$, the noise level of the network increases.
On the other hand, if we decrease $z_{out}$, the noise level in the network decreases. A parameter $\mu$, which represents the average ratio of external degree to total degree for each node, is used to control the noise level in the dynamic networks. If the value of $\mu$ is increased, the network becomes noisier in the sense that the community structure becomes less obvious and harder to detect. In this study, datasets under two different noise levels are generated by setting $\mu = 0.1$ and $\mu = 0.3$. In order to introduce dynamics into the network, we let the community structure of the network evolve in the following way. After time step 1, at each time step 5% of the nodes are randomly chosen to leave their original community and are randomly assigned to the other three communities. After the community memberships are decided, links are generated according to the parameter $\mu$. We generate the network with community evolution in this way for 10 time steps.
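The evolution procedure described above can be approximated with a short, self-contained sketch. It does not use the software package of [13]; instead it assumes a simple planted-partition model in which within-community and between-community edge probabilities are chosen so that the expected degree is 16 and the expected external fraction is $\mu$ for equally sized communities. All function names and the default seed are hypothetical.

```python
import random


def evolve_membership(membership, n_comm, fraction, rng):
    """Move a random fraction of nodes to a different, randomly chosen community."""
    updated = list(membership)
    n_movers = int(round(fraction * len(updated)))
    for node in rng.sample(range(len(updated)), n_movers):
        others = [c for c in range(n_comm) if c != updated[node]]
        updated[node] = rng.choice(others)
    return updated


def sample_edges(membership, n_comm, avg_degree, mu, rng):
    """Planted-partition edges (an assumed stand-in for the generator of [13])."""
    n = len(membership)
    size = n // n_comm  # nominal community size; real sizes drift after moves
    p_in = avg_degree * (1.0 - mu) / (size - 1)
    p_out = avg_degree * mu / (n - size)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if membership[u] == membership[v] else p_out
            if rng.random() < p:
                edges.add((u, v))
    return edges


def generate_dynamic_network(n=128, n_comm=4, avg_degree=16, mu=0.1,
                             steps=10, fraction=0.05, seed=0):
    """Return a list of (membership, edges) snapshots, one per time step."""
    rng = random.Random(seed)
    membership = [node * n_comm // n for node in range(n)]  # 4 blocks of 32
    snapshots = []
    for step in range(steps):
        if step > 0:  # the first snapshot keeps the original communities
            membership = evolve_membership(membership, n_comm, fraction, rng)
        edges = sample_edges(membership, n_comm, avg_degree, mu, rng)
        snapshots.append((list(membership), edges))
    return snapshots


snapshots = generate_dynamic_network()
print(len(snapshots), len(snapshots[0][1]))
```

With 128 nodes, 5% rounds to 6 nodes moving per step. Because community sizes drift away from 32 as nodes move, the edge probabilities based on the nominal size are only approximate in later snapshots.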
Figure 3 shows the statistical average value of normalized mutual information with respect to the ground truth over the 10 networks for the 10 timestamps when $\mu = 0.1$ (Figure 3(a)) and $\mu = 0.3$ (Figure 3(b)). Both figures show that the proposed algorithm DNCD-MOEA achieves better accuracy than the compared method. In particular, the average values of NMI at each time step obtained by DNCD-MOEA are close to 1 when $\mu = 0.1$.

Figure 1: (a) A graph modeled as a network, (b) the representation of a possible genotype, and (c) the graph structure of the genotype to be decoded.

Figure 4 reports the community score obtained by the two algorithms at each time step when $\mu = 0.1$ (Figure 4(a)) and $\mu = 0.3$ (Figure 4(b)). A larger community score indicates that the corresponding network is more densely connected within each subnetwork. From Figure 4, it can be found that the algorithm DNCD-MOEA outperforms

Algorithm 1: Pseudocode of the DNCD-MOEA algorithm.
Notation: The dynamic network is $N = \{N^1, N^2, \ldots, N^T\}$, where each $N^t$ represents the snapshot of nodes and edges at time $t$. Let $C^t = \{C_1^t, C_2^t, \ldots, C_k^t\}$ be the set of all communities in network $N^t$ at time $t$, where $C_j^t$ denotes the $j$th community at time $t$.
Input: The number $T$ of time steps, the sequence of dynamic network $N = \{N^1, N^2, \ldots, N^T\}$
Output: The sequence of community structure detected in the dynamic network $C = \{C^1, C^2, \ldots, C^T\}$
Begin
Step 1: Set $t = 1$. Generate the initial community structure $C^1 = \{C_1^1, C_2^1, \ldots, C_k^1\}$ of the network $N^1$ using the GA-Net algorithm. Set $t = t + 1$.
Step 2: If $t > T$, return the sequence of community structure $C = \{C^1, C^2, \ldots, C^T\}$ as the output and stop; else, go to Step 3.
Step 3: Set $s = 1$. Randomly generate individuals whose length equals the number of nodes $n^t = |V^t|$ of network $N^t$ as an initial population $P_s$.
Step 4: While the termination condition is not satisfied do
  Step 4.1: Create a new population $Q_s$ of offspring by applying the variation operators on population $P_s$;
  Step 4.2: Combine the parents $P_s$ and offspring $Q_s$ into a new pool $R_s$;
  Step 4.3: Decode each individual $I$ of the population $R_s$ to generate the partitioning $C^t = \{C_1^t, C_2^t, \ldots, C_k^t\}$ of the network $N^t$ in $k$ connected components;
  Step 4.4: Evaluate the two fitness values of the translated individuals;
  Step 4.5: Partition $R_s$ into fronts, assign a rank to each individual, and sort them according to nondomination rank;
  Step 4.6: Select individuals based on rank and crowding distance to comprise the new population $P_{s+1}$;
  Step 4.7: Select the dominant individuals $D_s$ in $P_{s+1}$;
  Step 4.8: Perform the local search algorithm on the selected individuals in $D_s$ to generate the new dominant population $D'_s$. Update the dominant population $D_s$ with $D'_s$ in $P_{s+1}$.
Step 5: Select the individual which has the maximum Community Score on the Pareto front. Decode the selected individual to get the community structure $C^t = \{C_1^t, C_2^t, \ldots, C_k^t\}$ of the network $N^t$.
Step 6: Set $t = t + 1$, go to Step 2.