Hierarchical Modular Structure Identification with Its Applications in Gene Coexpression Networks

Network module (community) structure has been a hot research topic in recent years. Many methods have been proposed for module detection and identification. Hierarchical structure of modules is shown to exist in many networks such as biological networks and social networks. Compared to the partitional module identification methods, less research is done on the inference of hierarchical modular structure. In this paper, we propose a method for constructing the hierarchical modular structure based on the stochastic block model. Statistical tests are applied to test the hierarchical relations between different modules. We give both artificial networks and real data examples to illustrate the performance of our approach. Application of the proposed method to yeast gene coexpression network shows that it does have a hierarchical modular structure with the modules on different levels corresponding to different gene functions.


Introduction
Networks are widely applied to model complex systems, including biological systems, social organizations, World-Wide-Webs, and so on. In a network, the nodes (vertices) represent the members in the system, while the edges represent the interactions among the members. If two nodes have interactions in a network, there will be an edge connecting them. With such a representation, the complex systems can be analyzed by computational methods.
Module (community) structure is a common property of many different types of networks. Modules are the dense subgroups of a network, where the nodes in the same module are more likely to connect each other than the nodes in other modules. In general, the members in the same module share some common properties or play similar roles. For example, in a gene coexpression network, the genes in the same module may belong to the same functional category such as lipid metabolism and acute-phase response [1]. Since the paper published by [2], module detection and identification becomes a hot research topic in several different areas such as computer science, physics, and statistics. A large number of related works have been published with the physicists making the most contributions [3][4][5][6][7][8][9][10][11][12]. Several recent review papers provide details and comparisons of the module identification methods [6,9,13]. Reference [13] compares the performance of several existing methods for both computation time and output. Reference0020 [6] is a thorough, more recent discussion. Reference [9] contrasts different perspectives of the methods and sheds light on some important similarities of several methods. A recent comparison of some popular methods is shown in [14]. Among the compared methods, the method by maximizing the average degree within modules and minimizing the average connections between different modules outperforms other methods in identification accuracy. Its computational speed is also competitive [14]. Besides these computational methods, theoretical analysis on module identifications is presented very recently. Bickel and Chen gave the first statistical analysis on the properties of modules [15]. There based on the stochastic block model, they gave the sufficient conditions for a modularity to be a consistent estimator of modules and presented a new consistent modularity. However, the computation of maximizing this modularity is very time consuming.
Although so many related works are published, how to choose an appropriate number of modules keeps being an open problem. Different methods output different solutions when they are applied to the same network. In reality, all 2 The Scientific World Journal of the different choices may be reasonable because different choices of this number may correspond to the modules on different levels. As explained in [16], some modular networks may have hierarchical structure. For example, in a friendship network, on the large scale, the modules may correspond to people from different countries. On the smaller scales, people in the same module may graduate from the same university, grow up in the same community, or even be born in the same family. Such hierarchical modular structure appears in different kinds of networks. For example, Meunier and colleagues gave an example on hierarchical modular structures in human brains [17]. Figure 1 shows an example of hierarchical modular network. There are two levels of the modules. We can identify three modules corresponding to different shapes of nodes on the lowest level or two modules with nodes represented by cubes and circles being combined together on the higher level.
Compared to the module identification in a partitional way (all the modules are on the same level), there are much fewer works on computational methods for hierarchical modular structure analysis [18][19][20]. Although these papers present some methods to construct the hierarchical modular structure, they do not give a clear picture on how these modules are organized and what the relationship among the modules is. In this paper, we mainly consider the problem of hierarchical modular structure in unweighted networks. Based on the module identification method presented in [14], we give the method on how to construct the hierarchical structure from all the possible modules in Section 2. Numerical experiments for both simulated networks and real data networks are presented to show the performance of our proposed method in Section 3. The application of the proposed method to yeast gene coexpression network shows that it does have a hierarchical structure, which corresponds to the different levels of gene functions. Conclusion remarks are given finally. By constructing the hierarchical structure, we aim to explore the functions of modules on different levels and explain why the number of modules may differ for different identification methods.

Methodology
Before going to the details on how to construct the hierarchical structure, we give its definition first. We consider a network G(V , E) with n nodes, where V denotes the set of nodes and E denotes the set of edges. The adjacency matrix is denoted as A with each entry being 0 or 1. The hierarchical structure of a network is defined based on the stochastic block model, which is a direct extension of the Erdös-Rényi random graph model [21]. The network is obtained by starting with a set of n nodes and adding edges between them in a probabilistic fashion. The presence of an edge between any two nodes is a Bernoulli event where the probability may be vertex-pair dependent. At the beginning, we assume there are K modules in the network. The network is generated in two steps. First, any node is assigned to a module M i with a probability µ i , where µ = (µ 1 , µ 2 , . . . , µ K ) satisfies K i=1 µ i = 1. Then any two nodes u, v ∈ V and u ∈ M i , v ∈ M j are connected with probability P i, j depending on M i , M j , and P is symmetric. If there is the modular structure in the network, then P i, j < min{P i,i , P j, j }. With this model, the hierarchical structure of a network can be defined recursively. For any three modules M i , M j , and M k , if P i, j > max{P i,k , P j,k }, we say there is hierarchical structure among these three modules and M i , M j can be combined to a new module parallel to M k .
To construct the hierarchical structure, we use the bottom-up strategy. We first find all the possible modules on the lowest level and then build the hierarchical structure. We use the method presented in [14] to find all the possible modules. Suppose K is given first. We let N k denote the number of nodes in subnetwork V k , L kk denote twice the total number of edges in subnetwork V k , and L kl denote the total number of connections between the subnetworks V k and V l , where k, l = 1, 2, . . . , K. The module identification problem is formulated as where P is a partition of the network. In matrix form, if we let the problem is formulated as Here 1 is a vector with all elements being 1. The objective function aims to both maximize the average degree within each module and minimize the average connections between different modules. We expect to achieve a good balance of the module size and make correct inference The Scientific World Journal 3 on the modules. The problem (3) is solved with an approximate method similar to the spectral clustering. We first compute the K eigenvectors of the matrix 2A − D. By clustering these K eigenvectors as a matrix of n objects with K dimensions, we get the assignment of the n nodes into K modules. Now, we discuss how to determine the lowest level of all the possible modules K. For any node i ∈ V , the degree can be written as where which defines the connections that node i has in the subnetwork V k . To determine the number of possible modules, we compare the average connectivity within a subnetwork and the average connectivity between it and any other subnetwork. If the average connectivity within a subnetwork is greater, we take it as a module, that is, Alternatively, it can also be written as if we multiply both sides with N k . This condition is very weak, thus with it, we hope we find all the modules as on the lowest level. We do the clustering for K increasing from two until the condition (6) does not hold and get all the possible modules. The efficiency of the above algorithm can be seen in [14]. Based on the above results, we construct the hierarchical structure in an agglomerative way (bottom-to-up). We directly use connection probability, which is computed from the clustering results through maximum likelihood estimation, to measure the distance between different modules. This connection probability matrix is denoted as P 0 . First the maximum connection probability between different modules is found, and we assume it is P 0 i0, j0 with the corresponding two modules i 0 , j 0 being recorded. The second largest connection probability for these two modules i 0 , j 0 are also found, and we assume they are P 0 i0,k0 and P 0 j0,l0 with the corresponding modules being k 0 and l 0 . To determine whether there is a hierarchical structure for these modules, we use Fisher exact test to see whether the connection probabilities P 0 i0,k0 and P 0 j0,l0 are the same as P 0 i0, j0 . That is, we need to test P 0 i0, j0 = P 0 i0,k0 and P 0 i0, j0 = P 0 j0,l0 . Here we take a P value threshold to be 0.05. Three different cases may occur for these two relations. (1) Both of these two null hypotheses are rejected. In this case, there is hierarchical structure and the modules i 0 , j 0 are on the lower level than k 0 and l 0 . We combine the two modules i 0 and j 0 and take them as one module. (2) Only one of P 0 i0, j0 = P 0 i0,k0 and P 0 i0, j0 = P 0 j0,l0 is accepted. The corresponding modules having the same connection probability are combined together. We look for the next largest connection probability for these three modules and test the relationship again. If two modules are tested to have the same connection probability, they are combined into one group, and the same step is implemented again. (3) Both of these two null hypotheses are accepted. These modules are taken as on the same level and combine together. We search the next largest connection probability to these four modules and do the statistical test until the hierarchical structure occurs or all the modules are combined together. After the above steps are finished, the connection probability between different modules is recalculated and recorded as P 1 . The above search and test steps are repeated for P 1 . Such steps are implemented recursively until all the modules are combined into one big module/network. For the statistical tests, we can also use ttest to test the relations between the connection probabilities if the distribution of the connections between different modules can be approximated by normal distribution. With this method, we can efficiently combine the modules with the same connection probability into the same level.

Numerical Experiments
In this section, we evaluate the performance of our proposed method through its application to several examples. We first start with two artificial networks having comparatively clear module structures. We then apply our method to two real networks to evaluate its performance. The first real network is the well-known karate club network and the second one is a yeast gene coexpression network.
The pattern of the adjacency matrix is shown in Figure 2(a). From upper-left to lower-right, we denote the four modules as M 1 , M 2 , M 3 , and M 4 , which correspond to the position in the connection probability matrix. We can see the hierarchical structure of the network from the adjacency matrix. We apply our proposed method to this network. The condition (6) is satisfied until K = 4. The estimated connection probability matrix is We apply statistical tests to the corresponding modules, and finally we get the hierarchical structure as shown in

A Randomly Generated Network.
In this example, we also consider a network with 200 nodes and 4 modules.
The size of each module is 10, 45, 45, and 100. We set the degree of each node within its module to be 6, 15, 15, and 30. Then the connections between different nodes are randomly generated. We keep all the edges generated for each node. So finally the average degree within each module is greater than the prespecified number. The connection probability between different modules is 0.002. The pattern of the adjacency matrix is shown in Figure 3.
By using the statistical tests, these four modules are determined as parallel modules, which is the same as that in our network generation strategy.

Karate Club Network.
We consider the Zachary's network of karate club members [22] in this example. There are 34 nodes in this network corresponding to the members in a karate club. This dataset has been applied as a benchmark to test many module identification algorithms since the true modules are known in this network. The people in the club were observed for a period of three years. The edges represent connections of the individuals outside the activities of the club. At some point, the administrator and the instructor of the club broke up due to a conflict between them. The club was separated into two groups supporting the administrator and the instructor. Figure 4 shows the network. Originally, there are two modules, which have 16 nodes (squares and pentagons in the figure) and 18 nodes (circles and triangles in the figure), respectively. We apply our proposed method to this network. The criterion (6) is satisfied until K = 4. The result is shown in Figure 4, with different shapes of the nodes denoting different modules. The estimated connection probability matrix is From this matrix, it is easy to see that M 3 and M 4 are more likely to connect each other. With statistical tests, we can get that the connection probability among M 3 , M 4 , and M 1 is the same. Although M 2 has no connections to M 3 and M 4 , it has a larger connection probability to M 1 than M 3 , M 4 to M 1 . Thus these four modules are on the same level. In [19], the authors considered constructing the hierarchical modular structure of this network too. At first, they also found four modules on the lowest level. Then they found that this network has two modules with some nodes (3,9,10,14,31) belonging to both of them. We did not consider the overlapping nodes in this article. However, we can see that because these overlapping nodes belong to both M 1 and M 3 , and they connect both parts closely, our method detect M 1 and M 3 , M 3 and M 4 as having the same connectivity. Coexpression Network. In this section, we apply our proposed approach to analyze a gene coexpression network of yeast. The data set we use was generated by Brem and Kruglyak from a cross between two distinct isogenic strains BY and RM [23]. As described in [23], a total of 5740 ORFs were obtained after data preprocessing. In our analysis, we only use the 1,800 most differentially expressed genes as input to construct coexpression network and derive modules. When constructing the adjacency matrix of the network, we use the hard thresholding, that is: if the absolute value of Pearson correlation coefficient between two genes is greater than some given value, we assign an edge between them; otherwise, there is no edge. We compute the linear regression coefficient between the frequency of degree d (log 10( f (d))) and the log 10 transformed degree d (log 10(d)), and choose the threshold that leads to approximately scale free property of the network as described in [24]. Finally, the threshold is set to be 0.705, R is about 0.75. By such a setting, this gene coexpression network is divided into 690 unconnected parts with the largest part of size 788. Here, we only analyze the hierarchical modular structure of the largest connected network.

Hierarchical Modular Structure in Yeast Gene
Starting from K = 2, we apply the method in [14] to this network, and the condition (6) holds until K = 10. To make the solution more accurate, we do a global maximization by changing the module index of boundary nodes starting from the approximate solution. Since the approximate solution is already good, this step is very fast. The structure of the network is shown in Figure 5(a), with different colors and shapes denoting different modules as described in Table 1. Then we construct the hierarchical modular structure as shown in Figure 5(b). On the lowest level, there are ten modules, while on the highest level, there are four modules.
Since coexpressed genes tend to be coregulated and possibly have similar functions, genes in the same module are expected to be enriched for some function categories. In order to understand the biological basis of the network modules, we consider each identified module for enrichment of annotations from gene ontology (GO) [25]. In our analysis, the enrichment analysis was performed by GO stats from Bioconductor. For each module, the statistically most significant GO categories are analyzed. Table 1 shows the enrichment results for the ten modules. "M-size" and "G-size" are the size of both the modules and the GO categories, respectively. "Overlap" is the overlap size of the module and the GO category. The Scientific World Journal  We use the software REViGO to check the hierarchical structure of the enriched GO categories [26]. We consider the enriched GO categories in Tables 1 and 2 except the category "regulation of translational termination" because its G-size is very small and the P value is comparatively large. Figure 6 shows the tree map of the most enriched GO categories. The subgraph that we do not mark with the module corresponds to the combined module M 1 , M 2 , M 3 , M 4 , M 7 , M 8 . Here the modules M 6 , M 9 and other modules are parallel to each other, which is consistent with our results. M 3 and M 7 belong to a large category, which is "branched chain family amino acid metabolic process". This large category is different from the most enriched category for the combined module M 3 and M 7 . This may come from the fact that since M 7 is very small, it does not cover a large part of its enriched category. M 1 and M 4 are parallel to each other which is also consistent with our analysis. All these results show that our proposed method can explain some of the hierarchical structure of the GO categories. Due to the network size, we did not handle all the genes of yeast. This may be a reason why some of our computational results are not consistent with the GO function tree map.

Conclusion
Module identification problem has attracted much attention from different fields and it continues being a hot research topic. How to determine the number of modules in a modular network has been an open problem during the study of module identification methods. This problem may come from the hierarchical structure of modular networks. The different numbers correspond to the different levels of