msiDBN: A Method of Identifying Critical Proteins in Dynamic PPI Networks

Dynamics of protein-protein interactions (PPIs) reveals the recondite principles of biological processes inside a cell. Shown in a wealth of study, just a small group of proteins, rather than the majority, play more essential roles at crucial points of biological processes. This present work focuses on identifying these critical proteins exhibiting dramatic structural changes in dynamic PPI networks. First, a comprehensive way of modeling the dynamic PPIs is presented which simultaneously analyzes the activity of proteins and assembles the dynamic coregulation correlation between proteins at each time point. Second, a novel method is proposed, named msiDBN, which models a common representation of multiple PPI networks using a deep belief network framework and analyzes the reconstruction errors and the variabilities across the time courses in the biological process. Experiments were implemented on data of yeast cell cycles. We evaluated our network construction method by comparing the functional representations of the derived networks with two other traditional construction methods. The ranking results of critical proteins in msiDBN were compared with the results from the baseline methods. The results of comparison showed that msiDBN had better reconstruction rate and identified more proteins of critical value to yeast cell cycle process.


Introduction
A biological process is a complexity of spatial and temporal interactions among innumerable molecules. Understanding dynamic biological processes and revealing the mechanisms behind dynamic systems are of great value for a wide variety of important biological and medical issues, such as understanding aging, cancers, and other perplexing diseases. Dynamic biological network mining has attracted increasing attention from biologists in the past few years [1][2][3]. For example, we want to know which gene or protein is of critical effect to disease development. In this domain, microarray gene expression data offers useful dynamic information and is generally exploited to locate differentially expressed genes that may be related to specific abnormal conditions. A few tools are available for finding differentially expressed genes under varying conditions among which statistical methods are widely accepted, including methods based on -test and SAM [4,5]. However, change level of gene expression is such a representation that is far from satisfaction to explain the complex dynamic mechanism, considering that it is not capable of investigating the dynamic changes of relationships of proteins in consecutive protein-protein interaction networks (PPINs). For instance, methods based on sole gene expression analysis cannot capture genes with medium expression, but in contrast more accurate and complete understanding can be achieved by putting them into the PPIN.
There are mainly two challenges in dynamic network analysis. The first one is to construct the dynamic networks that accurately model the dynamic processes. Proteins perform their functions at specific times under distinguished conditions, which we call them in their active forms. Dynamic PPINs reveal the instant relationships of functional proteins. However, the publicly available PPI datasets are mostly aggregates of all possible interactions obtained under different examined conditions or time points [6] and are oblivious to the temporal changes of these networks. Dynamic analysis involves extracting dynamic PPI networks from these known datasets and the methods mainly fall into two directions. The first way is based on the differential coexpression correlations. Studies [7] have shown that highly positive coexpressed proteins tend to form the most static modules appearing at all times and at the center of which there are some hubs with high degree being referred to as "party" hubs. Further, some less positive coexpressed proteins interact at particular time points, the hubs therefore being referred to as "date" hubs that are believed to cause dynamic interactions and plausibly induce aberrant pathways and molecular disorders. Taylor et al. [8] also observed multimodal distribution of correlation coefficients of gene expression using curated sources from the literature. The second way to construct dynamic PPINs is based on expression variance [9] by determining the peak time points of expression for each protein. Thus, if a protein is at its peak point, it is considered to be in its active form, the status at which a protein can interact with its active neighbors. This assumption allows computing scored gene expression activity using a single threshold [10] or a systematical threshold [11]. In the present work, we assert that coexpression correlation may describe only the possible coregulated relationships of proteins, while existence of a specific interaction at a certain time point would depend on the activities of the two associated proteins. Hence, the integration of these two aspects becomes necessary in the construction of dynamic networks and a comprehensive way of defining the existence of dynamic PPIs is needed. In addition, some researchers argue that the gene expression data contain far more noise that will induce unauthentic factors. For example, the genes are sent into a filter that defines a criterion for genes of being dynamic or stable in Xiao et al. 's paper [12], and the stable ones are left out of the subsequent construction of dynamic networks. However, the definition of dynamic networks is slightly different from Xiao et al. 's. In our case, the stable active proteins are impartially included in the dynamic networks.
The second challenge of dynamic network analysis is to identify the most critical proteins out of a series of dynamic networks. As discussed above, Han et al. [7] concluded that hubs can be divided into two categories: the "party" and "date" hubs, among which the latter ones are more essential to global connectivity and functions that cells process. In this paper, the proteins exhibiting dramatic structural changes in the set of consecutive networks are defined as critical proteins which serve a compensation of the definition of "date" hubs to some extent. The intuition is that a series of networks in the same biological process should share a certain degree of consistence in structure. By extracting the consistence and reserving the structural difference of dynamic networks, we are able to find the critical proteins that are extremely important for the dynamic process. To this end, the consistent and varying properties of local structures in dynamic PPINs are studied in this work and a critical node detection method based on integration of multiple deep belief networks are proposed to identify the most critical proteins that are responsible for dynamic changes during a certain time period, and specifically, the case of yeast cell cycle processes is studied in our present work. There are several comparative methods of extracting consistent information from multiple graphs, such as the most straightforward average network and the joint nonnegative matrix factorization (JNMF) [13]. NMF tries to decompose the original graph to linear combination of basis vectors and is usually used in clustering problems, graph partition problems, and so on. In this work the hierarchical fashion of PPIN structures is taken into consideration by building a multisource integrated deep belief network (msiDBN) as a joint multilayer model that extracts the common higher levels of structural units. Our network is based on the previous work in [14]; however, in this msiDBN model, we decipher the structural varieties of nodes at different time points as the residuals of reconstruction and believe that the nodes with dramatic changes of structure in the networks play an important role in the progression of the cell cycle.
The framework of this present work is shown in Figure 1. It is assumed that a small part of proteins in the network is associated with the changing of dynamic processes, marked by red circles in Figure 1 as an example. The changing of local structure is studied through our msiDBN method and in summary this work contributes in there ways.

(i) A new method for constructing dynamic coregulated
PPINs has been proposed and it gets better representation of dynamic process by comparing to other construction methods. (ii) A multisource integrated deep belief network (msiDBN) is developed to extract the common representation of multiple networks, reconstruct the dynamic networks, analyze the residual of reconstruction, and identify the critical proteins in the yeast cell cycle processes. (iii) Experimental results show that our strategy of dynamic network construction is superior to the other baseline methods and the msiDBN is able to reconstruct the dynamic networks with the lowest root mean square error (RMSE) than the comparative methods for it extracts the consistent hierarchical structures while others do not have any deep insight of the networks.
The rest of the paper is organized as follows: the proposed dynamic PPIN construction method is described in Section 2; Section 3 defines the critical protein identification problem in an anomaly detection fashion and introduces the msiDBN method and the critical node ranking criteria; and the proposed methods are evaluated in Section 4 from different aspects. Finally, the conclusion of this work is given in Section 5.

Dynamic PPI Network Construction
In the traditional dynamic network construction methods, the instant interactions between proteins are determined using either coexpression correlation or gene expression level shift. However, we construct the dynamic networks by integrating both assumptions in the present work.    Ins  12267  13034  13653  10463  9345  8370  8133  13263  16302  13586  13640  10010  3segma, AP  531  545  505  393  364  343  361  603  688  473  545  449  3segma Ins  3003  3325  3140  1841  1462  1278  1588  4352  6380  3654  4123  2611 protein activities. Here we have assumed that proteins are active at their most peak points of gene expressions as discussed in Wang et al. 's work [11] and a similar threshold is set for the expression of each gene that is collected under continuous conditions. The active score is determined by where thr 1 ( ) is the mean of the gene expression of protein , which is also denoted as ( ), thr 2 ( ) = ( ) × ( ), where ( ) is the standard deviation of the gene expression of protein , and ( ) = 1/(1 + ( )). As seen from (1), ( ) is a weight function of ( ) and occurs in the range of (0, 1). An empirical parameter was set for maintaining the active score AcScore within the range of ( ( ), ( ) + ( ( ) 3 /(1 + ( ) 2 ))). The performances of different empirical have been discussed in the experimental section.
By setting such an active score threshold, the activity PPI networks Act were built for each timestamp: where is a column vector representing the activity of proteins at time and is the transpose of the column vector. Each element in is determined by the binary threshold function as shown below:

Combining with Coexpression Correlation and Static
PPIN. It has been demonstrated previously that functionally related genes are frequently coexpressed across multiple conditions and different organisms [15]. Coexpression correlation coefficient is used as a measure of coexpressed genes having the same expression variance patterns across different conditions [16]. We have used the Pearson correlation coefficient [17] (normalized to the range of 0 to 1) to calculate the coexpression correlation and build coregulation protein networks. Since the computation of correlation coefficient requires expression data that cover a period of time, a time window was set on the original expression dataset which covered the time points from −1 to +1, where t is the current time point. The correlation coefficient matrix at time is denoted as CoE . Combining the static PPIN and the activity PPIN provides the dynamic coregulation protein network at each time point: where denotes the static PPI network adjacency matrix and ∘ represents element-wise multiplication.
Given the adjacency matrices of networks, the structural difference of networks can be studied in many different ways. The most important point is to incorporate the changes induced by neighbours' behaviors. Hence, we use higher order of the adjacency matrices to mimic random walks on these networks while keeping the nonnegative property at the same time.

Definition of Critical Node Detection Problem in Dynamic
Networks. Given a set of PPINs { 1 , 2 , . . . , } under time points, they are naturally evolving all along the biological process. The structure of network is represented by high order of adjacency matrix in this paper, which can be considered as the reachability of one node to the other in certain steps of rand walk. PPINs exhibit hierarchical structure and the trigger of changes in biological processes can be a small but complex set of molecules [18]. At a certain time point, the proteins in the PPIN are taken as nodes and each row of the high order adjacency matrix that a node corresponds to is seen as its feature at that time. There are sources about the nodes. To rank the most critical proteins, our intuition is that a node will receive low score if its topological structures of neighborhoods are consistent across the evolving networks and vice versa. It is impossible to directly compare the network structures because of the noise, sparsity, and indirect paths issues. However, because of the hierarchy of PPINs, we can extract hierarchical latent layers hidden in the networks that explain the evolution of network structures and protein functions. In other words, the hidden layers can be seen as the implied reasons of dynamic changes, by which the proteins fall into different groups of different characters of structural changes.
The flow of msiDBN is shown in Figure 2 where matrices of evolving networks are fed in as inputs. The msiDBN model tries to find the latent layers ( ) , representing the hidden variables of the th layer for the th network and the symmetrical weighted connections between input layers and hidden layers, that is, ( ) . As shown in Figure 2, multiple inputs are trained separately at the beginning and then are all combined to extract the common factors in the top layer.

DBN in Critical Protein Detection.
To explain the framework of DBN, we should first go through the concept of restricted Boltzmann machines (RBMs), which are stacked one on top of each other to compose the DBNs [19]. RBM is defined as a network of symmetrically coupled binary random variables or units. As shown in Figure 3, these units can be divided into two groups: the visible variables, V ∈ {0, 1} |V| , and the hidden variables, ℎ ∈ {0, 1} |ℎ| (| ⋅ | gets the dimension of the object inside it). The visible variables can be the original input or the transformed results from last layer according to the position of current RBM in the whole DBN model. The hidden variables imply the dependencies among the visible variables through their mutual interactional relationships as mimicked by the weighted matrix of . In RBM, the interactions among visible-to-visible variables and among hidden-to-hidden ones are ignored [20]. Hence, we get a bipartite graph with completed connections. The RBM defines an energy function between the visible and hidden layer variables: where ℎ and V are row vectors in and , respectively, and are the bias to the visible layer and hidden layer, and is the weights between two layers. In RBM the training purpose is to learn the weights and biases between adjacent layers so that the energy function achieves its lowest level. The joint probability distribution of RBM with a normalization factor is With the restricted conditions, the hidden variables are independent given the visible variables and this property factorizes the individual activation probabilities of a hidden variable as follows: Likewise, we have the individual activation probabilities of a visible variable as where the sigmoid represents the logistic sigmoid function. To train the probabilistic models, we typically adapt and find the best parameters that maximize the likelihood of the training data. The most straightforward way is to maximize the likelihood following the log-likelihood gradient. However, in the gradient of the log-likelihood, there are terms that are intractable, that is, the ones that compute the expectations over the joints of variables V and ℎ. There are several ways of dealing with this problem, like the contrastive divergence (CD) [21] which uses a very short Gibbs chain to estimate the expectation of the joints of V and ℎ. The reliability of CD has been proved by different groups of researchers [22][23][24].
The RBM model extracts the latent variables hidden in the training data and several RBMs are stacked one on top of others, using the hidden variables derived from lower models as the input, to get deeper layer variables that explain the hierarchical factorizations of PPINs. Given layers of RBMs, the joint distribution is (V, ℎ 1 , ℎ 2 , . . . , ℎ ) As the variables inside each layer are independent and considering the biases for each layer, we get (ℎ = 1 | ℎ +1 ) = sigmoid ( + ∑ ℎ ( +1) ) . (10)

Critical Protein Detection from Reconstruction of msiDBN.
The msiDBN model to detect the critical proteins is built upon the assumption that most proteins have similar behavior patterns across the time courses while the most critical proteins that are responsible for the progression of the yeast cell cycle exhibit different expression levels and more importantly they engage in different interactions with contemporary neighbors. The other intuition is that the integrated deep belief networks extract the common features at the top layer ( Figure 2) which represent the hidden deeper reasons for which the interactions change at different time points. Although we get merely hidden variables at the top level, the feature space it can represent is scale of 2 which is a much larger space than the common NMF clustering method gets. The joint probability of msiDBN is given as follows: where (ℎ (1) 2 , . . . , ℎ ( ) 2 , ℎ) is where is the bias variable of ℎ ( ) and is the bias of the top hidden variable ℎ. As discussed above, the conditional distributions can be derived according to the independency conditions as follows: The parameters of msiDBN can be learned approximately by greedy layer-wise training using CD. Therefore, with the common hidden variables and the trained weight matrices we are able to build an auto-encoder machine to reconstruct the input dynamic networks. The pseudo code in Algorithm 1 shows how to train the msiDBN model. With the common representation of multiple networks, we reconstruct them using (13) for the sampled data from (ℎ ( ) 2 | ℎ) which can be viewed as the approximation of original data.
We quantify the reconstruction error using root mean square error (RMSE) which is denoted by Er and calculated as follows: where ( ) denotes the reconstructed network and Er ∈ R is a vector representing the RMSE of protein between the original data and reconstructed data across time 0 to . The dispersion of Er is rated based on the relative standard error (RSD) which is RSD = / , with denoting the mean and denoting the standard deviation of Er . The lower RSD scores correspond to the proteins that are well recovered by the model and also show average smoothness across time courses, while the higher RSD scores reveal the ones that are more likely having varying structures at different time points and are expected to play important roles during research like drug targets design.

Datasets.
The gene expression data from GSE3431 [25] was used as the time course data to construct time course PPINs. GSE3431 is an expression profiling of yeast over three successive metabolic cycles. Further, we also adopted another reference cell cycle gene expression data for yeast indexed by GSE7645 to alleviate the bias of expression in the calculation of mean and variance for each gene. In the experiment generating GSE7645, S. cerevisiae was cultured under oxidative stress induced by cumene hydroperoxide (CHP) and the transcriptional profile is collected at = 0 (immediately before adding CHP) and at 3, 6, 12, 20, 40, 70, and 120 minutes after adding the oxidant. The static PPIN of yeast was collected from BioGRID dataset for yeast and the cell cycle regulated protein dataset was downloaded from http://nar.oxfordjournals.org/content/ 38/suppl 1/D699 which will serve as the golden data in validation. We also constructed the cell cycle related static PPIN based on these proteins and their first neighbors in BioGRID PPIN. Finally we get a static PPIN with 2069 proteins and 43462 interactions between them.
The function of detected modules was validated by adopting the CYC2008 human-curated complex dataset as benchmark data [26]. CYC2008 is a comprehensive catalogue of manually curated 408 heteromeric protein complexes in S. cerevisiae reliably backed by small-scale experiments from the literature.

Evaluation of Dynamic Network Construction.
We compare the proposed dynamic network construction method with traditional methods, that is, methods from the work of Tang et al. [10] and Wang et al. [11]. It is widely believed that the dynamic network reveals more accurate functional interactions between proteins than static PPIN and also a better dynamic network construction method should achieve better functional module analysis results. Hence, we run two traditional clustering methods on the different sets of dynamic networks and compare the precision of module detection results. Known complexes in the CYC2008 dataset served as a gold-standard data to evaluate the experimental results.
It is expected for a module detection method that the predicted clusters ( ) and the reference complexes ( ) match as much as possible. The overlapping scores OL( , ) are used to find the matched complexes: where | | is the size of the predicted cluster, | | is the size of the known complex, and | ∩ | is the number of the intersections of the predicted cluster and the known complex. and are considered to be matched if their OL score is larger than a threshold , which is typically chosen as 0.2 [27,28]. Precision is defined as Prec = TP/(TP + FP), where TP (true positive) is the number of the predicted clusters matched with known complexes by OL > , and FP (False Positive) is the number of the unmatched known complexes in the predicted clusters.

Comparison of Different Dynamic Networks Construction Methods.
In the experiments, a fixed threshold (Th = 0.7) was set to Tang's method, and the second construction method is described in Wang et al. 's paper [11] which uses a three-sigma threshold to determine the status of proteins. Table 1 shows the numbers of active proteins derived from three different threshold setting methods. In our method, the parameter was chosen to be 1.5. The performance of MCL and spectral clustering method on different dynamic PPI construction methods (Figure 4) indicates that our integrative method is more effective in constructing dynamic networks, and both functional module detection methods benefit from it by achieving the highest precisions compared to the other two dynamic PPIN construction methods.

Parameter
Setting. Initially, we analyzed the effect of change in parameters on our dynamic network construction method. Thus, we performed the spectral method to detect dynamic functional modules at 12 time points and compared the results with the CYC2008 dataset. The Precs of the results under different parameter settings have been compared as shown in Table 2 and the mean and variance of Prec are shown in Figure 5. From the results of comparison, it was obvious that on fixing at 1.5 the precision of the functional module detection achieved the highest score. Thus, in the following comparisons with other module detection methods, this prime parameter setting has been used.

Critical Proteins Identification with msiDBN.
In the msiDBN model, the visible input variables were chosen as the high order of the adjacency matrices; in this case the 2nd-order was used, and the self-transmissions were ignored which meant that the diagonal elements in each matrix were set to 0. We built the separate training DBNs as 2-layer models and combined different sources on the third layer as shown in Figure 2. We compared our method with two basic reconstruction methods and also with the original DBN to verify the effectiveness of our method.
An online resource of cell-cycle-related gene list from database was used as the golden data which contains 150 genes that are proved to be related to yeast cell cycle process. The evaluation metric of this experiment was also Prec with the same definition as the one introduced above.
The baseline methods include joint NMF (JNMF) method, the straightforward average network, and the original DBN method. The JNMF method learns a common base matrix from different sources that best approximates the original sources. It is often used in clustering problems and dimension reduction problems. In our experiments the prior low dimension of JNMF was set as 500 by which the approximation to the original data generally achieved the best position. And the method which adopts the average network,  denoted as AVG in the following content, simply extracted the average of the 2nd order adjacencies of the series of dynamic networks. Comparing with our msiDBN, the DBN method just processes our 12 networks through one straightforward deep structure of three layers to get the common representation and derive the reconstruction errors. By comparing the RMSEs, it is easy to see in Figure 6 that the msiDBN method obtains the best reconstruction while the AVG gets the worst in all of the four methods. The property of msiDBN is to extract a hierarchy of hidden features which naturally meets the characteristics of PPIN. The JNMF is analogous to one layer feature extraction model that does not fit in to its best within this scenario. In addition, our method surpasses the traditional DBN which considers all the networks identically and shows the promising results of the framework of multisource integrated deep belief networks. As we know, RSD is a measure that quantifies whether a set of variables are constant or have more variabilities. A high RSD number indicates that the data is more varied. The RSD scores of each protein were calculated and ranked to get the top 150 proteins in these three methods. The proteins in the top 150 RSD lists, derived from these three methods, were compared with the golden standard list from Cyclebase database. We ran msiDBN and JNMF 100 times separately to get the average performance and the comparison of precision results is shown in Figure 7. The matched proteins that are truly associated with cell cycle process in msiDBN were around 75, which was much higher than JNMF and AVG method. Among those unmatched proteins inside our top 150 list, we saw an interesting thing: according to the gene ontology annotations, a few of the unmatched proteins are relative to the true cell-cycle-related proteins, as shown in Table 3 where just a part of the unmatched proteins are listed due to limitations on space. We also checked in the static PPI network and discovered that most of the unmatched proteins are directly linked to the cell-cycle-related proteins or within short length of link distance. This phenomenon is consistent with our structural changing assumption about critical proteins in which for a critical protein, which has been varied structure during the cell cycle progression, the structure of its neighbors must change along with it.

Conclusion
In this paper, the structural variability of dynamic PPINs was studied to identify the critical proteins in the yeast cell cycle process. A comprehensive method of constructing dynamic active PPI networks was proposed which simultaneously modeled the activity of proteins and assembled the dynamic coregulation protein network at each time point. And then a critical node detection method that integrated multiple networks into deep belief network model was developed, in which the reconstruction results were ranked by the variabilities of the reconstruction errors across time courses and finally the top proteins in rank order were selected to be the critical ones that may play important roles in dynamic mechanisms. We evaluated our network construction method by comparing the functional representations of proteins in the derived networks with that from two traditional construction methods and our method achieved superior function analysis results. The critical protein ranking results from msiDBN were compared with results from JNMF reconstruction method and the comparison of results showed that msiDBN had better reconstruction rate and identified more proteins of critical value to yeast cell cycle process.
The fact that a few proteins among the unmatched protein lists are truly relevant to the cell cycle process inspires an interesting idea that the system analysis of dynamic networks should be done to reveal groups of critical proteins with the same or relative functional roles in the dynamic mechanism. We will focus more on a system level study of the dynamic networks in the future research.