Modeling Gene Networks in Saccharomyces cerevisiae Based on Gene Expression Profiles

Detailed and innovative analysis of gene regulatory network structures may reveal novel insights to biological mechanisms. Here we study how gene regulatory network in Saccharomyces cerevisiae can differ under aerobic and anaerobic conditions. To achieve this, we discretized the gene expression profiles and calculated the self-entropy of down- and upregulation of gene expression as well as joint entropy. Based on these quantities the uncertainty coefficient was calculated for each gene triplet, following which, separate gene logic networks were constructed for the aerobic and anaerobic conditions. Four structural parameters such as average degree, average clustering coefficient, average shortest path, and average betweenness were used to compare the structure of the corresponding aerobic and anaerobic logic networks. Five genes were identified to be putative key components of the two energy metabolisms. Furthermore, community analysis using the Newman fast algorithm revealed two significant communities for the aerobic but only one for the anaerobic network. David Gene Functional Classification suggests that, under aerobic conditions, one such community reflects the cell cycle and cell replication, while the other one is linked to the mitochondrial respiratory chain function.


Introduction
The difference between aerobic and anaerobic conditions at the molecular level has drawn considerable attention in last twenty years. Man and Pilpel [1] found that the transcription rate of mitochondrial gene under aerobic conditions is significantly higher than that under anaerobic conditions, while glycolytic genes are more active under anaerobic conditions. Hou et al. [2] found that under anaerobic conditions SPT3 and SPT15 are overexpressed, which may not only enhance resistance to ethanol and stress but also upregulate the fermentation transcription factors in Saccharomyces cerevisiae. Jiang et al. [3] found that, under anaerobic condition, mitochondrial function is weakened while fermentation capacity is enhanced in Saccharomyces cerevisiae. This suggests competition between aerobic and anaerobic metabolisms as a result of evolution.
A fundamental goal of genomics is to understand how gene regulatory networks actually give rise to cellular phenotypes. Modeling and reconstruction of gene regulatory networks based on high-throughput data have become one of the most common goals of systems biology. Diverse models of gene regulatory networks have been developed such as Boolean network [4][5][6], probabilistic Boolean network (PBN) [7,8], and Bayesian network [9][10][11][12]. The main objective of these network models is to study the logical interactions of genes based on large-scale microarray data and further get meaningful biological information. Recently using network method to study gene interactions of Saccharomyces cerevisiae has attracted the interest of several researchers [13][14][15][16]. For example, Zhang et al. [14] examined the integrated gene interaction network of Saccharomyces cerevisiae and found many enriched multicolor network motifs corresponding to different biological themes. They concluded that significantly enriched motifs in the network are often signatures of network themes, higher-order network structures that correspond to biological phenomena. Lee et al. [15] modified a probabilistic functional gene network of the baker's yeast, Saccharomyces cerevisiae, and experimentally verified the function of the yeast RNA binding protein Puf6 in 60S ribosomal subunit biogenesis. Hu et al. [16] profiled transcriptional responses in Saccharomyces cerevisiae strains with individual deletions of 263 transcription factors. Then they reconstructed a functional transcriptional regulatory network between these transcription factors and analyzed the enrichment of promoter motifs on these transcription factors.
In 2005, Bowers et al. [17] introduced a computational approach called logic analysis of phylogenetic profiles (LAPP), which identified detailed logic relationships among gene triplets on the basis of genomic data. This method may be used for functional annotation of proteins and genes and for designing biochemical experiments to elucidate biological mechanisms. Lately, further progress has been achieved on the theory and application of higher logic. Zhang et al. [18] described a three-way gene interaction model that captures the dynamics of coexpression relationships between two genes. Shoemaker and Panchenko [19] proposed ways to address the defects of the LAPP method, such as high computational complexity, strong dependence on information spectrum, and the uncertainty of homology detection at large genetic distances. Sprinzak et al. [20] detected coordinated regulation of multiple protein complexes using logic analysis of gene expression data and identified protein complexes by mapping specific kinds of gene triples to multicomplexes triplets. Notably, the LAPP method is related to stochastic logic. Modeling approaches using stochastic logic, such as stochastic Boolean network (SBN) and stochastic multiplevalued network (SMN), have been already proposed [21][22][23][24].
In this paper, we focus on the construction, analysis, and comparisons of structural characteristics of logic networks inferred from the gene expression profiles of Saccharomyces cerevisiae under aerobic and anaerobic conditions. Firstly, gene expression profiles are discretized into multiple values. Secondly, the logical AND, OR, and NOT operators are given by algebraic formulation on multiple values. Down-and upregulation self-entropy as well as joint entropy are used to compute the uncertainty coefficient. Four parameters of the logic network are generalized to more complex networks for our purpose. Putative regulator genes of respiratory mechanisms are identified by contrasting the differences of the four structural parameters. Lastly, the Newman fast algorithm is used to discover the community structures of the logic network. We find that the gene logic network under aerobic condition (aerobic network) has two significant communities while the anaerobic network only has one in this framework. Furthermore, David Gene Functional Classification (http://david.abcc.ncifcrf.gov) [25] reveals the possible biological function of these communities.

Expression Data.
The gene expression profiles data of DNA chip in this study are taken from the GSE11452 database using GPL90 platform in the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov) (NCBI) and comprise 42 and 52 expression vectors with more than 20,000 genes in yeast under aerobic and anaerobic conditions ( Table 1). Such a large number of genes incur an intractable computational complexity, so we select candidate genes based on the Wilcoxon rank sum test [26] as follows. Let = ( 1 , 2 , . . . , 0 ), = ( 1 , 2 , . . . , 1 ) denote 0 and 1sized random sample of expression of gene under aerobic and anaerobic conditions, respectively. The Wilcoxon rank sum test statistic is derived from the concatenation of and , which results in vector . This is done by first sorting all points in by ascending order, obtaining the sets of ranks for the and points, respectively, then calculating the corresponding sums 0 and 1 , and finally defining the test statistic as 0 / 0 − 1 / 1 . Under the null hypothesis the expected value of this statistic is zero; otherwise the alternative hypothesis holds, in which case then gene is called a candidate gene. Setting the significance level to be 10 −5 , we obtain 73 candidate genes finally. . In order to calculate firstorder and second-order logical relations using LAPP method, we define the following quantities for this discretized vector: (2) Upregulation self-entropy is is present iff and are both present or and are both absent And total joint entropy is ( , are the corresponding frequencies of component in vectors and , respectively. For discretized vectors and , the uncertainty coefficient ( value) is defined as This quantity informs on the probability that regulates . Note that for simplicity and denote not only expression vectors but also the corresponding genes. The first-order logical relationship between genes and is determined as where 1 is the proper functions of first-order logic of to . The uncertainty coefficient of ( , ) → is where 2 is one of the proper functions of second-order logic of to . Table 2 lists ten types of proper functions and the corresponding algebraic operations. Using these operations on and , we get . Table 2 also gives the algebraic representation of three basic operators for multiple values: logical NOT is represented as ¬( = or ) = ( = or ), logical AND and OR can be represented as min( , ), max( , ), where , are the th component of , .
We normalize the value for each gene triplet and database by replacing it with / max , where max is the maximum value in the own database. For simplicity, the normalized values are also denoted by . The condition  ( | 2 ( , )) ≥ max{ ( | ) + , ( | ) + } is used to filter out all gene triplets. The combination requires gene to be better predicted from genes and together than just gene alone or gene alone. Figure 1 shows an example for a gene triplet, for which the logical AND operation on gene and gene is denoted by ∨ → . As in the LAPP method, all such gene triplets, with the corresponding the values, give rise to the gene logic networks further studied in our present work.

Structural Parameters Definition of Logic Network.
In fact, the logic network can be seen as a directed and weighted network without multiple edges and self-loops. Let = ( , , ) be a logic network with gene node-set , directed edge-set , and function : → that assigns each edge ∈ and weight ( ) ∈ . In fact, the edge weights can be interpreted as the uncertainty coefficients that express interaction strength between gene triplets. Various structural parameters have used to study network structure in complex network [27]. For the logic network, structural parameters are detected again including average degree, average path length, average clustering coefficient, and average betweenness to (1) Average Degree ( ). The degree of a node is the number of nodes adjacent to it. The average degree is the average value of the degrees of all nodes. For the logic network, the definitions of in-degree, out-degree need to be reconsidered based on the principle that the sum of in-degree and that of out-degree of all the nodes in a network are equivalent. Based on the principle, in-degree and out-degree can be defined according to second-order logical relationship; if there are activations of , then in-degree of increases by /2. However, the out-degrees of and are determined by the proportion of their contributions to the second-order logical relationship.
Here we can assume that the proportion contribution to the second-order logical relationship from and that from are always equivalent. In other words, the out-degree increment of is the same as that of .
(2) Average Clustering Coefficient ( ). For a certain node with second-order logical relationships, we need to define doublets of second-order logical relationships to measure the clustering coefficient. A doublet of second-order logical relationships is a combination of two second-order logical relationships with at least one common node.
If the common node is V, we call this second-order logic doublet centered on V. Figure 2 shows all possible secondorder logic doublets centered on V. These three types of second-order logic doublets are named according to the different positions of V as "both-in," "both-out," and "inout" doublets. If there are two common nodes in a secondorder logic doublet, we call the doublet strong connected. Figure 3 shows all possible strong connected second-order logic doublets centered on V. The number of second-order logic doublets (including both-in, both-out, and in-out doublets) centered on V is denoted by doub (V), and the number of strong connected second-order logic doublets is denoted by sc-doub (V). The clustering coefficient of node V, denoted by doub (V), in a logic network with second-order logical relationships is defined as doub (V) = sc-doub (V)/ doub (V). So the average clustering coefficient of network is defined as = (1/ ) ∑ V∈ doub (V).
(3) Average Path Length ( ). Path and its length should be reconsidered according to the different second-order logic types: AND, OR, and XOR. Take a second-order logical relationship ( , ) → , and an arbitrary node in other than (say ), the shortest directed path , and the distance from to are defined as follows.
Case 1 (second-order logic type is AND). Namely, nodes and regulate node cooperatively.
(1) is or (say without loss of generality). The shortest paths from to are denoted by , if there is at least one directed path from to . The directed path from to arrives at through and then the second-order logical relationship to . The distance from to is the total sum of and ( , ) → , where ( , ) → is the length of the second-order logical relationship. In our study, ( , ) → is estimated by the reciprocal of the uncertainty coefficient of this second-order logical relationship. On the other hand, if V and are not connected by a directed path, there is no path from V to .
(2) is neither nor . is reachable from at most one of and . Then there is no directed path starting from and ending at . is reachable from both and . Then there is at least one directed path connecting and . The distance from to is = max{ , } + ( , ) → . The directed path first reaches the nearer one of and followed by the other and finally .
Case 2 (second-order logic type is OR). Either or can regulate independently. The distance from to is estimated by the probability that activation of results from . The next step is to distribute the uncertainty coefficient ( | 2 ( , )) of the second-order logical relationship. Calculate the uncertainty coefficients caused by and , denoted by ( | ) and ( | ), respectively, as follows: On the basis of above obtained uncertainty coefficients, the shortest directed path and the distance from to can be determined as follows.
(1) is or (say without loss of generality). There is at least one path starting with and ending at .  Choose one of the shortest directed paths from to randomly, denoted by . The total sum of and the reciprocal of ( | ) are denoted by (1) and (2) , respectively. If (1) is less than (2) , the shortest directed path goes directly from to through the second-order logical relationship, and the distance equals (1) . Otherwise, first arrives at and then , and is equal to (2) . There is no path from to . The path from to is only the second-order logical relationship, and the distance equals the reciprocal of ( | ).
(2) is neither nor . There is at least one path starting from V towards or . Therefore, the distance from to is = min{ + ( , ) → , + ( , ) → }. The shortest directed path from to is the corresponding path to the choice of distance. Note that when is unreachable from (or ), (or ) is infinite. Neither nor is reachable from . The distance from to is infinite and no path connects them in this case.
Case 3 (second-order logic type is XOR). Both and can activate cooperatively or independently. Therefore, XOR type of second-order logical relationship is a combination of AND type and OR type. However, when only one of and is reachable from , the condition is the same as OR logic. When both and are reachable from , or is or , it is the same as AND logic.
According to the definition above, all shortest directed paths and all distances can be found in a logic network. And the average path length of a logic network is defined as = (1/| |) ∑ ( , )∈ , where is the set of ordered pairs of nodes and the distance from the first one to the second one is finite; that is, = {( , ) | , ∈ , < +∞}. | | is the number of the elements in .
(4) Average Betweenness ( ). Betweenness centrality [28] is one indicator of a node's centrality in complex network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. A node with high betweenness centrality has a large influence on the transfer of items through the network, under the assumption that item transfer follows the shortest paths. Let be the set of all the shortest paths (allowing more than one shortest paths between two nodes) in the logic network . If node V appears at least once in a directed path starting with and ending at , then V is referred to as intermediate node in this path, denoted by V ∈ . The standard betweenness of V, denoted by Logic (V), is defined as The standard betweenness centrality ranges from zero to one. Higher betweenness means larger possibility to appear in the shortest paths. It is more important in the structure and information transfer of a network. Therefore betweenness can help to discover crucial nodes that may have significant impacts on the structural characteristics of the logic network.

Community Structures in the Logic Network.
Another important feature of complex network is its community structures, which depict the organization of vertices into clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters. The community is a good tool to describe network structures and provides better understanding of network functions. Several researchers [29][30][31] proposed algorithms to detect community structures. Newman fast algorithm [32] is a greedy modularity algorithm starting from a set of isolated nodes. The links of the original graph are iteratively added such that the largest possible increase of the modularity at each step is achieved. The fast algorithm is used to find the community structure in the logic network. In addition, the inside-to-outside-of-community ratio = 1 / 2 is used to evaluate how close of community connections in the network, where 1 is number of edges within a community and 2 is the number of edges cooperating the internal and external node in a community.

Results
In order to highlight the characteristics of the logic network structures, we contrast the change curves ( Figure 4) of four structural parameters along with threshold from 0.1 to 0.9 between the aerobic network and the anaerobic network with step length 0.1. We find that the average degree, average clustering coefficient, and average betweenness of the aerobic network are greater than those of the anaerobic network for all thresholds. It can be seen that the average path length of the aerobic network is greater than that of the anaerobic network in some threshold ranging from 0.3 to 0.7. The significant changes of parameters mean that the energy metabolism conditions for Saccharomyces cerevisiae in aerobic and anaerobic respirations actually differ on the molecular level.
Each node of the logic network corresponds to a different structural parameter. From this, we obtain the degree, clustering coefficient, betweenness, and path length of each node. By calculating and ranking the difference values of the four parameters for each node, we capture the top five gene nodes. In Table 3, -Difference, -Difference, -Difference, and -Difference denote the difference values of the four structural parameters, respectively. Finally we get the intersection of these gene nodes which includes genes ATP6, YIG1, RGI2, BAG7, and COX1. The structural parameters of these genes change significantly comparing the aerobic network with the anaerobic network. For example, some genes are allocated a higher degree in the aerobic network than in the anaerobic network. That is to say, these genes have great contribution to degree structure changes for the two networks, so we define them as the structural key genes.
If gene connects other genes by a certain second-order logic relationship, then these genes together from one set are denoted by . For example, if ∩ → , then ∈ , ∈ ; if ∩ → , then ∈ . If the gene of has a particular function, then we predict that gene also has the same function. David Gene Functional Classification analysis is carried out on . The second column of Table 4 shows that the functional annotation of these genes has been detected in David database while the final column lists the predicted functions of them. Table 5 shows nonisolated nodes, the numbers, and modularity of community structures. Utilizing Newman fast algorithm, we find that the aerobic network has two obvious community structures. The modularity is 0.3756 including 47 and 15 nonisolated nodes. The corresponding ratio of inside and outside of community is 87.3 and 19.8, respectively. There are three community structures in anaerobic network, including 13, 16, and 5 nodes. The corresponding ratio of Table 3: Difference of parameters for some genes between the aerobic and the anaerobic network.   Gene  COX1  ATP6  COB  BAG7  YIG1  RGI2  -Difference  22  22  21  20  16  15  Gene  COX1  YIG1  BAG7  BI3  ATP6  RGI2  -Difference  12  11  11  10  9  7  Gene  COB  ATP6  COX1  BAG7  YIG1    inside and outside of community is 1.8, 6.0 and 0.57 (see Table 6).

Conclusions
By David Gene Functional Classification, we predict that the genes ATP6 COX1 BAG7 and YIG1 are involved in the yeast mitochondrial respiratory chain while RGI2 is involved in membrane transportation (Table 4). In fact, ATP6 [33] belongs to the family of genes, which is called the mitochondrial respiratory chain complex. It participates in the respiratory chain and provides information for synthesis of a protein encoded by mitochondrial DNA that is essential for normal mitochondrial function. YIG1 [34] encodes protein and is involved in regulating anaerobic glycerol metabolism in Saccharomyces cerevisiae. Deletion or overexpression of YIG1 significantly affects growth yield or glycerol yield in anaerobic batch cultures. This is consistent with the previously proposed low flux control exerted at the Gpp level. BAG7 [35] is involved in the energy metabolism under aerobic conditions; its expression is induced under carbon limitation and suppressed under high glucose; COX1 [36] belongs to the family of cytochrome C-oxidase, catalyzing the reduction of oxygen to water in the respiratory chain. RGI2 [37] is a protein of unknown function associated with metabolism under conditions of aerobic respiration; its expression is induced under carbon limitation and suppressed under high glucose. It is involved in the control of metabolism and significantly contributes to cell fitness, especially under respiratory growth conditions. Some biological functions of organism are realized by gene interactions of proteins. These genes are closely linked, thus revealing the community phenomenon of the logic network. We produce the spider diagrams in Figures 5 and 6 using the Pajek software (http://mrvar.fdv.uni-lj.si/pajek/). The aerobic network is shown to have two distinct communities while the anaerobic network has an obvious community. These communities may correspond to certain biological functions. So we apply the David Gene Functional Classification analysis to these communities. The result reveals that, in the first community of the aerobic network, seven genes including 4533 at, 5527 at, 8444 at, 8426 at, 10446 s at, 6116 at, and 6908 at are related to the cell cycle function. Ten genes including 2565 s at, 2498 s at, 3215 f at, 8822 at, 6044 at, 2361 s at, 6590 at, 10644 at, 10041 at, 8431 at, and 2425 at are associated with the DNA transcription. In the second community of the aerobic network, 12 genes including 3975 at, 3970 s at, 3974 at, 3959 at, 4008 at, 3988 at, 3987 at, 2623 s at, 2622 s at, 2845 g at, 2793 s at, and 3966 i at are involved in the mitochondrial respiratory chain.

Discussions
For the nonbinary gene expression profiles, this work gives a novel approach in calculating the self-entropy (including downward self-entropy, upregulation self-entropy) and joint entropy for gene vectors. LAPP method is utilized to find 8 Computational and Mathematical Methods in Medicine  all gene triplets. Furthermore the gene logic networks are constructed. The relationships between the structure and the function of the network could help us to understand metabolisms of Saccharomyces cerevisiae on the molecular level. To analyze the structural difference between two networks, parameters such as average degree, average clustering coefficient, average path length, and average betweenness, which reveal that the second-order logical types have significant differences among the different experimental gene sets, have been generalized to the networks. The differences may provide us with a new idea and some reference to the biologists on their research work. However, how the other analytical methods, such as community structures, can be generalized into the logic networks is still an interesting issue. By applying the Newman fast algorithm to the logic networks, we find that the aerobic network has two significant communities, while the anaerobic network has only one significant community. Furthermore the David Gene Functional Classification shows that one of two communities possibly relates to cell cycle and cell replication functional group; the other possibly correlates mitochondrial respiratory chain functional group in the aerobic network.