An Improved Method for Completely Uncertain Biological Network Alignment

With the continuous development of biological experiment technology, more and more data related to uncertain biological networks needs to be analyzed. However, most of current alignment methods are designed for the deterministic biological network. Only a few can solve the probabilistic network alignment problem. However, these approaches only use the part of probabilistic data in the original networks allowing only one of the two networks to be probabilistic. To overcome the weakness of current approaches, an improved method called completely probabilistic biological network comparison alignment (C_PBNA) is proposed in this paper. This new method is designed for complete probabilistic biological network alignment based on probabilistic biological network alignment (PBNA) in order to take full advantage of the uncertain information of biological network. The degree of consistency (agreement) indicates that C_PBNA can find the results neglected by PBNA algorithm. Furthermore, the GO consistency (GOC) and global network alignment score (GNAS) have been selected as evaluation criteria, and all of them proved that C_PBNA can obtain more biologically significant results than those of PBNA algorithm.


Introduction
The development of biological experiment technology has generated more and more biological network data such as protein-protein interaction and gene transcriptional regulatory network data, which brings considerable number of pieces of information about the interactions and relationships between biological organisms. For this reason scientists carried out a lot of research in this area. Comparative analysis, namely, biological network alignment, is an important method in biological network research. Biological networks alignment can simply be described as the analysis of biological networks by comparing the data to find the correlation between structure and function of organisms and thus to help the study of biological development and evolution. This study demonstrates great potentials to discover basic functions and to reveal essential mechanisms for various biological phenomena, by understanding biological systems not at individual component level but at a system-wide level [1,2]. Ogata et al. first proposed the graph comparison approach to identify local similarities between two graphs, which allows gaps and mismatch of nodes and edges and is especially suitable for detecting biological features in 2000 [3]. They used the above-mentioned comparative method to discover the relationship between enzymes and positions of their corresponding gene encodings in the entire genome. After analyzing these results, they found the local structure similarities corresponding to functionally related enzyme clusters. Thereafter, the graph comparison research attracted many scholars' research interests in this field. Kelley et al. in 2003 introduced the value of the concept called BLASTE into protein interaction network and thereby described a new way to detect the highly conserved pathway and the highly conserved functional module in the two networks [4]. Subsequently, Koyutürk et al. took advantage of the duplication/divergence model to translate protein-protein interaction (PPI) network comparison into the maximum weight subgraphs problems and used the greedy method to solve the problem [5]. In 2007, Singh et al. proposed the IsoRank algorithm, which converted the problem to a constraint-based optimization objective function problem. Then they also introduced an algorithm called IsoRankN, which was an approach similar to the PageRank-Nibble algorithm, to align multiple PPI network [6]. The Matchand-Split algorithm proposed by Narayanan and Karp, 2007, with the idea of dividing and conquering strategy divided biological networks into submodules. This approach deals with biological networks alignment by comparing smaller modules [7]. In 2009, Klau [8] normalized the problem to an optimization problem and solved the problem with the integer linear programming method. So far, most of the researches are focused on determining biological network data.
However due to the size, density, and redundancy of interacting molecules in the network and even errors in biological experiments [9], interactions in biological networks are probabilistic events. For example, in a living cell, DNA binding proteins are believed to be in equilibrium between the bound and unbound states, thus introducing uncertainties in protein-DNA interactions. Similar circumstance holds for protein-protein interactions, which are crucial to cellular functions both in assembling protein machinery and in signaling cascades. Therefore we abstract the biological networks into the uncertain networks whose edges are denoted by the values, respectively. Our solution is closer to realistic situation. Incorporation of uncertain information will bring more challenges to the biological networks alignment and analysis.
To the best of our knowledge, there are only two biological network alignment algorithms that can deal with uncertain network. Weighted IsoRankN [10] based on the IsoRank was proposed to deal with the probabilistic case. But the probability information in Weighted IsoRankN was considered as "weight" rather than the true "probability. " Essentially, the Weighted IsoRankN algorithm merely simplifies the probability graph into the deterministic diagram. Hence a majority of pieces of information were neglected via this measure. PBNA (probabilistic biological network alignment) [11] proposed by Todor et al. in 2013 was an advanced version of the IsoRank algorithm. The core of this algorithm was to replace the determining variables in the IsoRank with a random variable so as to establish a model for biological network alignment problem. Then this probability algorithm was optimized by using conditional probability distribution knowledge to reduce its complexity. However, the PBNA approach requires at least one of the networks participating in alignment be determined diagram. In other words, if the participating networks are all uncertain networks, one of them will be considered automatically as a deterministic graph. Clearly, the neglect of the probability information of the networks may lead to the deviation.
In order to simplify the discussion in the rest of the cases, "deterministic network alignment" refers to the algorithm in which two participators are all deterministic network and "part probabilistic network alignment" refers to the algorithm in which one of participators is probability graph and "complete probabilistic network alignment" refers to the algorithm in which both participators are uncertain graphs.
In this paper, we develop a method called "complete probabilistic biological network alignment" (C PBNA) based on "part probabilistic biological network alignment (PBNA). " Our approach can take full advantage of the information for the uncertain network alignment with two uncertain networks. Finally, we conduct 122 groups of contrast tests based on uncertain protein interactions network data preprocessed by Todor et al. from MINT database. We use the agreement to tell the difference between two algorithms. The biological significance of the network comparison is quantified by global network alignment score, gene ontology consistency, and functional coherence of the alignments. The experiment results indicate that C PBNA can find the results neglected by PBNA algorithm. Furthermore, C PBNA can obtain more biologically significant results than those of PBNA algorithm.
The rest of the paper is organized as follows. In Section 2, we describe the C PBNA algorithm. In Section 3, we apply the C PBNA algorithm into MINT [12] and analyze the results. In Section 4, the conclusions are given.

Methods
The C PBNA proposed in the paper is an advanced biological networks alignment algorithm derived from the PBNA [9]. Both of the algorithms are based on the framework of IsoRank for deterministic network. The PBNA approach takes the uncertainties of the networks into consideration. However, the precondition of PBNA is that at least one of the biological networks participating in alignment must be deterministic network. Our approach can deal with alignment of two uncertain biological networks. Furthermore it can directly deal with the deterministic and partially probabilistic situation. In the following sections, we start by analyzing the probabilistic biological networks which are dealt with by the C PBNA algorithm then and the results discovered by C PBNA. Our whole approach is described in five sections: (1) Probabilistic Biological Network; (2) Probabilistic Support Matrix; (3) Probabilistic Model of the Eigenvector; (4) Extracting Alignment Results; (5) C PBNA Algorithm.

Probabilistic Biological Network.
Traditional deterministic biological network can be characterized by a two-tuple = ( , ), where denotes the vertex and denotes the set of graph edges. Different types of biological networks correspond to different graphs; for instance, PPI network (protein-protein interaction network) can be abstracted into an undirected graph with the vertices tagged by labels denoting different proteins and the edges denoting the interaction relationship of proteins.
It is important to note that biological networks usually are indeterminate; for instance, the interactions of proteins often exist at certain probability. Therefore, we consider the uncertain biological network as a network in which the proteins are denoted by determinate nodes and the interactions of proteins are denoted by edges with a probability value.
Uncertain network is characterized by a three-tuple = ( , , Pr ), where , denote the vertex set and edge set, respectively [13]. Consider Note that Pr is a function which denotes a value probability in [0, 1] for each edge = ( , V); specifically Pr = 1 indicates that edge = ( , V) definitely exists: = ( , , Pr ) denotes uncertain graph and = ( , ) denotes deterministic graph, respectively. Particularly, uncertain graph contains deterministic , which is abbreviated as ⇒ , when and only when = and ⊆ ∩ ( × ), where ∩( × ) denotes an edge set in which two endpoints of the edge are both in the vertex set .

Example 1.
Observe that original uncertain graph in Figure 1 has three probability sides. Hence, it contains 8 kinds of different deterministic networks with different probability. This means that in a probabilistic network with | | edges there are actually 2 | | deterministic networks which occur at a certain probability.
Note that, with the uncertainty added, the complexity of the graph increases greatly. For instance, the MINT [12] network data used in the experiments contain 2 313 implication graphs for the maximum organism containing 313 edges. Precise comparison seems almost impossible for such a large amount of data.

Probabilistic Support Matrix.
Firstly, our approach proposed in the paper is based on the primal framework of IsoRank algorithm which aimed at deterministic graph alignment. One of the core ideas of IsoRank is that the similarity between two vertices may be determined by all the neighborhood vertices' similarity. First of all, we introduce the simple case of pairwise global network alignment (GNA). For deterministic networks 1 = ( 1 , 1 ) and 2 = ( 2 , 2 ), the similarity score between vertexes V and V can be calculated by where V ∈ 1 , V ∈ 2 , ( ), ( ), respectively, denote the neighbor vertexes set of V and V , and , V , respectively, denote the degrees of V and V V . We assume that = | 1 |, = | 2 |, all similarity scores (0 ≤ ≤ , 0 ≤ ≤ ) constituting an × dimensional similarity score vector . can be seen as a vector form converted from an × matrix. Therefore (4) can be rewritten in matrix form: = , where . . . . . . The term 1/ denotes that point or point V is an acnode. As can be seen, formula, = , indicates that vector is a characteristic feature vector of matrix when the eigenvalue is 1.
The important improvement of PBNA algorithm is to replace the deterministic variable in original algorithm with a random variable so as to simplify the model by calculating the expectation ( ) instead of itself. It should be stressed that, due to the complexity in calculating desired ( ), PBNA alignment algorithm requires that one of the graphs must be determined. Considering this idea as a reference, we propose C PBNA algorithm which can be extended to the network alignment problem in which two graphs, 1 and 2 , are both uncertain graphs. Hence, the degrees of uncertain graph nodes set of V and V V are both uncertain rather than deterministic values. The degree values V , are denoted by discrete random variables V , respectively, and then (4) can be rewritten as where V , are discrete distribution: ( V = ), = 0, 1, . . . , max , V is the degree distribution of node V V . We assume that V is a set of edges connecting to point V V ; hence, ( V ) can be obtained via probabilistic graphical models shown in Table 1.
Clearly, adding uncertain information increases the complexity of the algorithm. As a result, the time complexity for calculating each node degree distribution sequence increase to exponential time, because the neighbor points degrees in the matrix as an item subject to the discrete distribution rather than a certain value. Therefore, based on the core idea of literature [9], we use ( ) instead of involved in the calculation. The following section summarizes the calculation arriving at expectation ( ) of matrix .

Probabilistic Model of the Eigenvector.
We start by discussing (5) in the first case. Clearly, as discussed earlier, (5) as follows: Because of assuming that the edges of the network 1 and 2 are independent events, so the and V are independent too, we can derive as follows: In the next case of (5), the probability is denoted by ( V = 0). Note that and V are also independent; we can get (8), after some manipulations as follows: Similarly, substituting the results of (7) and (8) for 0 and 1 2 , respectively, we obtain where the probability distributions of and V are calculated as discussed in Table 1; thus we get the ( ). However, as discussed above in Section 2.2, calculating ( = ) directly means that the computational complexity of the algorithm can reach (2 V + ). In order to reduce the high complexity, we use the probability generating function introduced in the literature.
Definition 2 (see [14]). Assume that is a discrete random integer variable ranging from 0 to ; therefore the probability generating function (PGF) of may be defined as a polynomial of : As we see in Definition 2, the coefficient distribution sequence corresponds to discrete random variables distribution of in probability generating function. Clearly, as long as the probability generating function is calculated, the probability distribution will be obtained easily. Moreover, the probability generating function may be calculated by Theorem 3.  Theorem 3 (see [14]). Suppose that = ( , , Pr ) is an uncertain graph and V denotes a set of edges connecting with V endpoint; hence, the degree of V is a discrete random variable whose probability generating function is shown as For example, Figure 1 shows an uncertain network, in which there are two edges connecting with the node and the appearance probability of those two edges is 0.4 and 0.9, respectively. As a result, the probability generating function of degree distribution for node is ( ) = (0.6 + 0.4 )(0.1 + 0.9 ) = 0.06 + 0.58 + 0.36 2 ; then we can easily get the distribution of degree for node as in Table 2.
Therefore, we can calculate the probability generating function of the node degree distribution for V and then obtain node distribution sequence via the probability generating function. The computational complexity of this process can decrease Naive Approach Complexity from (2 Our ultimate objective is the conditional probability distribution of the degree. In other words, the purpose is to calculate the distribution sequence of node V with the presupposition that there exists an edge connecting to node V and edge ∈ V . As a result, the probability generating function of the conditional probability can be obtained by simply dividing (1 − + ). Now we can easily calculate the probability generating function of the conditional probability ( V = | ∈ V ) = 1, 2, . . . , max V . And we can get the support matrix ( ) within the time expected in the polynomial equation (9). In particular, the sequence similarity method is also added to the score vector according to the literature [5] as where is the vector denoting normalization of sequence similarity score BLAST and is a constant used for balancing influence of topological similarity and sequence similarity on calculating the pairwise similarity. Finally, we use a power iteration method [15,16] to calculate and record all similarity score. See Algorithm 2.

Extracting Alignment Results.
After calculating similarity vector , the last step of our model is to extract the final alignment results from vector . In order to extract the final alignment results, we introduce a breadth first searching approach by using maximum weight bipartite matching technique [17,18]. First of all, we introduce the concept of bipartite graph and maximum weight bipartite matching. Bipartite graph: a graph 12 = ( 12 , 12 ) is bipartite if there exists 12 = 1 ∪ 2 with 1 ∩ 2 = 0. And for each  edge ∈ 12 , the two end vertices must belong to the two different subsets 1 and 2 . Maximum weight bipartite matching: given a bipartite graph 12 = ( 12 , 12 ) with bipartition ( 1 , 2 ) and weight function : → find matching of maximum weight where the weight of matching is given by ( ) = ∑ ∈ 12 ( ).
The process of extracting results from the feature vector is the process to find a max-weight matching in 12 .
Let us call a function : ( 1 ∪ 2 ) → a potential if ( ) + ( ) ≤ ( , ) for each ∈ 1 , ∈ 2 . The value of potential is ∑ V∈ 1 ∪ 2 (V). It can be seen that the cost of each perfect matching is at least the value of each potential. The Hungarian method finds perfect matching and a potential with equal value which proves the optimality of both. In fact it finds perfect matching of tight edges: an edge is called tight for a potential if ( ) + ( ) = ( , ). See Algorithm 3.

C PBNA Algorithm.
The C PBNA algorithm can be roughly divided into three steps, constructing the support matrix, calculating the eigenvector of the matrix, and extracting alignment results, as in Figure 2. These steps will give detailed descriptions by Algorithms 1, 2, and 3.
First we build probabilistic support matrix based on the conclusions of Section 2.2 and calculate ( ) based on formula (9) in Section 2.3. The pseudocode can be seen in Algorithm 1. Secondly, we calculate the feature vector by using an iterative approach denoted as in Algorithm 2. Thirdly, we extract optimal comparison by interpreting as encoding a bipartite graph and finding the maximum weight bipartite matching, which is denoted as in Algorithm 3.
In Algorithm 1 we have the following.
Input: Probabilistic graph 1 = ( 1 , 1 ), Probabilistic graph 2 = ( 2 , 2 ) Output: ( ) // Construct PGF of 1 2 (1) for all ∈ 1 , V ∈ 2 do: (2) construct PGF of (3) construct PGF of V (4) end for // Compute every entry in ( )  After getting the desired ( ), we use an iterative approach to calculate the feature vector . Set each of values in the eigenvalue equal to constant 1/ , the original variable called 0 . indicates the normalized vector of sequence homologies. The values have been studied in the literature [5,16], so we directly use the best value 0.6. is a sufficiently small constant; iteration will eventually converge to approximate similarity score vector . The process calculation of is shown in Algorithm 2.
In Algorithm 2 we have the following.
(1) Line 1-Line 4. Set the initial value of the feature vector 0 .
After getting the feature vector , the last step of the C PBNA algorithm is extracted by final comparison results from as shown in Algorithm 3. C PBNA adopted the method mentioned in Section 2.4; this method is to find perfect matching .
In Algorithm 3 we have the following.
(1) Line 1-Line 3. Build bipartite graph 12 and compute the weight matrix by the feature vector .
(3) Line 7-Line 11. Find an optimal augmenting path cover (V , V V ) by the max-flow min-cut theorem and then update the feasible labeling .

Experiments and Results
The experiments in this research include two main parts. The first part shows that C PBNA algorithm can obtain the results which are neglected by PBNA. Further, the second part of the experiments proves that results of C PBNA are more biologically significant using GOC and GNAS (global network alignment score) as evaluation standards.
The uncertain dataset used in the experiment obtained from the MINT database is network data of protein-protein interactions preprocessed by Todor et al. [9,10]. As a result, providing that MINT network is of enough biological importance, the network information offered by KEGG database is divided into several smaller networks. Then, only the network with more than 10 nodes remains. Finally, we obtain 198 protein-protein interaction networks coming from 10 organisms. Table 4 shows statistical information of this network.
There are 198 networks comparing with each other, which will result in 2 198 = 19503 groups of experiments. However the networks through KEGG are divided into a set of vertices function associated with proteins, and KEGG used a label to mark the set of proteins. The proteins from different sets share less similarity, which makes little sense to do the network comparison. Therefore we can get 122 groups of comparison experiments from the KEGG database.

Coherence Comparison of C PBNA and PBNA.
In order to prove that C PBNA can discover the results neglected by PBNA, agreement evaluation criterion [9] is introduced in this research.
The definition of agreement is based on the same dataset. Hence, the proportion of common results discovered by both C PBNA and PBNA in all alignments is shown as Alignments in common All alignments .
The score of this evaluation criterion is between 0 and 1. The larger the score is, the more common results these two  Figure 3.
The ordinate values are the 122 agreement values after sorting, and the abscissa values are the serial number of the experiment. Table 5 shows the detailed agreement statistics of 122 groups of experimentation. In particular, the left Pie Chart is divided into 4 parts corresponding to the percentage of each category and Category 1 is not described in the Pie Chart due to its percentage of 0. For instance, Category 5 in Table 5 indicates that there are 30 experiments with the agreement  between 0.8 and 1, which accounts for 24.6% of the total experiments.
The general distribution of the agreements is shown intuitively in Figure 4. We can see that the Agreement values are distributed within the range from 0.2 to 0.95. From the Pie Chart, we can further see that agreement scores less than 0.8 experiments accounted for 75% of the experiments, among which only 14% of the total experiment is less than 0.4 points. It indicates that both the C PBNA results and PBNA results have many overlapping parts but have noticeable difference at the same time. The reason is that one of the most basic differences is that C PBNA concludes all of the uncertain information while the PBNA method only utilizes half the uncertain information.
Therefore, we may draw a conclusion that neglecting uncertain information could lead to deviation. In addition, all the above shows that much more innovative result can be obtained through C PBNA algorithm. The corresponding biological significance will be demonstrated by the following experiments.

Gene Ontology Consistency
Comparisons of C PBNA and PBNA. Gene ontology consistency (GOC) has been where GO( ) denotes the set of GO terms which label a protein in gene ontology database. Then, the GOC of each pair of proteins in alignment results is calculated, respectively. The bigger the GOC is, the more similar function these proteins have; especially, the maximum of GOC is 1 which means that these proteins have totally the same function. All GO data in this study comes from GO Consortium [19] and literature [14].
Similarly, in order to get alignment results of C PBNA and PBNA, respectively, the GOC (the value of GOC is between 0 and 1) is calculated for each pair of proteins in 122 groups of experiments. Finally, we get 122 groups of results, among which there are 2 GOCs in each group. The distribution of the GOC is shown in Figure 4 and Table 6.
In Figure 4, -coordinate denotes GOC value of C PBNA algorithm and -coordinate denotes GOC value of PBNA algorithm. Table 6 shows 122 groups of GOC value and the diversity of every two GOC values in each group.
As we can see in Table 6, for most of the results the value of -coordinate is larger than -coordinate. For instance, it includes 96 groups of experiments, 78.7% of the total experiments, in Category 3 and Category 4. Furthermore, the -coordinate is higher than the -coordinate more than 10% in 17 groups of experiment. These all indicate that in most of experiments C PBNA algorithm may discover more biologically significant results based on the same evaluation standard GOC.
In conclusion, C PBNA and PBNA can obtain diverse results for biological networks alignment which is proved through the first experiment. Moreover, C PBNA is demonstrated in the second experiment to be superior to PBNA in discovering biologically significant results since it uses all uncertain information while the PBNA algorithm neglects some uncertain information in biological networks.

Functional
Coherence of the Alignments. The functional coherence of the alignments is motivated by the lack of automated and direct measures of ortholog-list quality. Comparing with GOC, the functional coherence of the alignments reports the average of the medians instead of the sum. And it maps each GO term to one or more of a standardized set of GO terms.
We can see in Figure 5 that PBNA and C PBNA have very similar functional coherence values with only a few minor differences. One of the reasons is that the functional coherence function computes the similarity of a standardized set of GO terms instead of the aligned proteins directly. The other reason is that it reports the average of the medians, so it cannot tell whether a mapping has many highly similar terms or not. Since the median of a distribution is not an accurate representation of the entire distribution, the result it returns is not sensitive enough to tell the difference between different alignments.

GNAS Valuation Comparison of C PBNA and PBNA.
As one of the biological networks alignment algorithms, GNAS (global network alignment score) [20] is adopted as evaluation criterion in this paper defined in (15). Specifically the larger value of GNAS indicates more conserved interactions and higher sequence similarity: In this formula, seq( , V) denotes sequence similarity values of the two nodes; | | denotes the number of edges of 12 (for the definition of 12 , please refer to Section 2.4). Based on the common dataset, two groups of GNAS with 122 values in each group are obtained through C PBNA and PBNA, respectively.
The average values of GNAS and | | are showed in Figure 6. As we can see in Figure 6, the values of | | and GNAS obtained from C PBNA are superior to those from PBNA since C PBNA adopts full uncertain information which increases the amount of conserved interactions.

Time Analysis.
The running time of PBNA and C PBNA is evaluated in this experiment. The most time-consuming step for both algorithms is constructing similarity matrix, which takes about 90% of the entire running time. Therefore, it is reasonable that we measure only this step's running time in order to evaluate the entire algorithm time efficiency. The results are shown in Table 7. Table 7 indicates that the time spent in constructing similarity in C PBNA is longer than its counterpart in PBNA, because C PBNA deals with both probabilistic networks, which takes more information into consideration. Network comparison with two uncertain networks is much more complex as we can see in Sections 2.1 and 2.2. When computing ( ), in fact, the complexity of C PBNA is (( max while the complexity of PBNA is (( max V ) 2 ) by PGF method mentioned above in Section 2.3. Although C PBNA spends more time dealing with both probabilistic networks, its time performance is still acceptable. Furthermore, the results which we get have more biological significance, and C PBNA can deal with the probabilistic networks directly instead of the preprocessing and transforming of the data into deterministic network.

Conclusions
Biological networks alignment is an important topic in bioinformatics. However, the network data has its inherent complexity of and the combine optimizes features of biological networks alignment are not clear, which make relevant algorithm study extremely challenging. A majority of the classic biological networks alignment algorithms are based on deterministic network while the alignment method for probabilistic networks is still under discussion.
In this paper, we propose a complete probabilistic model and a complete probabilistic biological algorithm for network comparison. Our approach has several advantages. First, our approach is based on complete probabilistic network, which takes the uncertainties of both networks instead of the single one into consideration. Consequently, our approach can take full advantage of the uncertainties properties of network comparison. Second, we model the network alignment using two probability matrices. Therefore, the uncertainties can be quantified by the probabilities of connections in the networks. As a result, our approach is capable of comparing two networks which both have uncertain properties. Third, we use a unified probabilistic model for different types of network alignment (deterministic, part probabilistic, and complete probabilistic), unlike other alignments which use different methods for different types of networks. Finally, the evaluation criteria including GOC and GNAS are used in the experiments to demonstrate that the results of C PBNA and PBNA are different and that the results of the former algorithm are more biologically significant.
Usually, affinity propagation in probabilistic networks is random and probability factors have not been taken into consideration in this paper, and the effect of these factors on results will remain an open problem for the future research. The computational time increases as a result of using more probability information, which is a subject we will study in the next step.