RecRWR: A Recursive Random Walk Method for Improved Identification of Diseases

High-throughput methods such as next-generation sequencing or DNA microarrays lack precision, as they return hundreds of genes for a single disease profile. Several computational methods applied to physical interaction of protein networks have been successfully used in identification of the best disease candidates for each expression profile. An open problem for these methods is the ability to combine and take advantage of the wealth of biomedical data publicly available. We propose an enhanced method to improve selection of the best disease targets for a multilayer biomedical network that integrates PPI data annotated with stable knowledge from OMIM diseases and GO biological processes. We present a comprehensive validation that demonstrates the advantage of the proposed approach, Recursive Random Walk with Restarts (RecRWR). The obtained results outline the superiority of the proposed approach, RecRWR, in identifying disease candidates, especially with high levels of biological noise and benefiting from all data available.


Introduction
A major research domain in molecular biology is the study of the causal association between genomic variations and clinical phenotypes [1][2][3]. Classical methods use a manual approach where one or a limited number of genomic targets are individually tested. However, due to the resources needed to systematically perform this procedure and due to the difficulty in controlling all experimental variables, improved strategies were required. The possibility to use computational methods to identify the best disease candidates to be further validated was a major breakthrough [4][5][6][7][8]. A common constraint of most methods is the need for training data, which is scarce and difficult to validate.
A recent research trend consists of exploiting the topological properties of protein-protein interaction (PPI) networks combined with other biological data to envisage the underlying mechanisms of genetic diseases. Barabási et al. [9] and Joy et al. [10] discuss the role of proteins with high betweenness as mediators of relevant metabolic processes. Ma and Zeng [11] explore the use of the closeness centrality to quickly identify the top central metabolites in large scale networks. Approaches proposed by Erten et al. [12] and Arrais and Oliveira [13] explore the potentialities of the nodes with high degree for the prioritization of disease-associated genes.
While the previous methods focus on evaluating the weights given to each node, a complementary strategy consists of evaluating the proximity of two given nodes in the network. Common methods for this task are the shortest path, log-likelihood, propagation matrix, and Random Walk with Restarts (RWR). Previous studies confirm that RWR outperforms the other methods [14][15][16][17][18]. One common limitation of these studies is that they assume the graph is single concept, meaning that every node is treated equally. However, as we demonstrate in this study, those methods perform poorly when the graph integrates nodes from distinct data types.
In this paper, we propose a novel method to improve selection of the best disease targets for multiconcept graphs. Towards this aim we build a multilayer biomedical graph that stores PPI data, annotated with stable knowledge from OMIM diseases and Biological Process from Gene Ontology. The inherent improvements of the proposed method are (a) the use of multilayer networks formed with PPI data and by the terms' associations; (b) the combination of this data to establish new associations among nodes; and (c) the use of degree-based methods to evaluate node weights.
Finally, we present comprehensive validation that demonstrates the superiority of the proposed approach, Recursive Random Walk with Restarts (RecRWR).

Methods
The method proposed herein uses a graph representation of biomedical knowledge centred on proteins enriched with biomedical terms. The first step consists of selecting and curating the required data and using it to construct the graph. For performance reasons this network is represented as an adjacency matrix. On this base biomedical graph we apply a modified version of the Hubs and Authorities (HITS) [13] algorithm, adapted to this particular subject, in order to obtain normalized and more accurate associations among relations. Although here we are interested in protein-disease association, it is important to stress that the method can be extended to the study of general many-to-many associations of biomedical terms. Finally, we formulate how the proposed method, RecRWR, can be applied to this subject.

Multiconcept Graph Modelling.
A graph-based representation is used to store the relations among the biomedical terms. Since we are integrating three distinct data sources, three interconnected subgraphs are obtained.
(i) PPI data are retrieved from the STRING database [19], where the average confidence level is considered. A filter is applied to select only human proteins.
(ii) Disease data are extracted from the OMIM morbid map [20], where the genotype-phenotype associations are preserved. The morbid map is also used to extract the mapping relation for known protein-disease associations.
(iii) The Directed Acyclic Graph structure of the Biological Process branch of Gene Ontology (GO) [21] is extracted and replicated. The GO-GO mapping is also retrieved and stored.
For each of the previous data sources a curated set of terms $\vec{t}$ is extracted, with $t_i$ representing the content of the $i$th term ($i \in [1, m] \subset \mathbb{N}$). Each term is a tuple of three elements that can be represented as $t_i = (a, b, s)$, where the element $a$ has an association with the element $b$ with a confidence score $s$, with $a, b \in [1, n] \subset \mathbb{N}$ and $s \in [0, 1000] \subset \mathbb{N}$.
The set of vectors $\vec{t}$ is modelled as a nonoriented weighted graph $G = (V, E, w)$.
(i) $V$ is obtained by identifying the unique entries, $a$ or $b$, of all the association tuples contained in vector $\vec{t}$. The vertices are labelled by their unique identifier.
(ii) Each edge $e \in E$ connects vertices $(v_a, v_b)$, representing an association between the terms represented by the vertices $v_a$ and $v_b$ contained in vector $\vec{t}$.
(iii) The weight $w_{v_a,v_b}$ of each edge corresponds to the score between the two nodes.
The graph $G = (V, E, w)$ is then mapped to an adjacency matrix representation that consists of a $|V| \times |V| = n \times n$ matrix $A$:
$$A_{ab} = \begin{cases} w_{v_a,v_b} & \text{if } (v_a, v_b) \in E \\ 0 & \text{otherwise} \end{cases}$$
Because the graph is undirected, the adjacency matrix is symmetric and therefore $A = A^{\top}$.
The compiled graph resulted in 60,000 nodes with an average degree of 5. The memory space required to represent the graph is $\Theta(|E|)$, which is realistically equivalent to a memory space of 6.0 MB, excluding hash tables required for node mapping. The adjacency matrix requires a memory space of $\Theta(|V|^2)$, 7.2 GB.
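The mapping from association tuples to a symmetric adjacency matrix can be illustrated with a minimal sketch. The identifiers `P1`, `P2`, `D1`, and `GO1` are hypothetical toy labels, the function names are ours, and the 2 bytes per matrix cell is an assumption chosen because a score in [0, 1000] fits in 16 bits and it reproduces the 7.2 GB figure.

```python
from collections import defaultdict


def build_graph(terms):
    """Build an undirected weighted graph from (a, b, score) tuples."""
    adjacency = defaultdict(dict)
    for a, b, score in terms:
        adjacency[a][b] = score
        adjacency[b][a] = score  # undirected: store both directions
    return dict(adjacency)


def to_matrix(adjacency):
    """Map the graph to a dense |V| x |V| adjacency matrix."""
    nodes = sorted(adjacency)  # vertices labelled by unique identifier
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for a, neighbours in adjacency.items():
        for b, score in neighbours.items():
            matrix[index[a]][index[b]] = score
    return nodes, matrix


# Hypothetical toy terms: one PPI edge, one protein-disease edge, one protein-GO edge.
terms = [("P1", "P2", 900), ("P2", "D1", 700), ("P1", "GO1", 400)]
nodes, matrix = to_matrix(build_graph(terms))

# Dense storage estimate, assuming 2 bytes per cell (our assumption).
dense_bytes = 60_000 ** 2 * 2
print(nodes, dense_bytes / 1e9, "GB")
```

The dense form trades memory for constant-time lookups, which is why the node-to-index hash table is kept separately.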

RecRWR: Recursive Random Walk with Restarts.
Next we formulate the RecRWR algorithm including a detailed pseudocode description of the algorithm (Algorithm 1). The three main components are (i) Random Walk with Restarts; (ii) recursive cross subgraph mapping; (iii) node replacement.

Random Walk with Restarts. The final probability vector of the random walker is defined as
$$\vec{p}_{t+1} = (1 - r)\, W \vec{p}_t + r\, \vec{p}_0$$
where $W$ is the column-normalized adjacency matrix, $r$ is the restart probability, and $\vec{p}_t$ is a vector in which the $i$th element holds the probability of being at node $i$ at time step $t$. The vector $\vec{p}_0$ holds the probability of the initial states and is constructed such that equal probabilities are assigned to the list of seed nodes, where the sum of the probabilities is equal to 1. This is obtained from a given list of seed nodes $S$, where $S \subset V$.
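As an illustration of the random walk just described, the following sketch iterates the standard RWR update on a small dense matrix. It is not the authors' implementation: the function names, the restart value of 0.3, the tolerance, and the toy path graph are all assumptions made for the example.

```python
def column_normalize(matrix):
    """Divide every column by its sum so that each column adds up to 1."""
    size = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    return [
        [matrix[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(size)]
        for i in range(size)
    ]


def rwr(matrix, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Random Walk with Restarts: p_next = (1 - r) * W * p + r * p0."""
    size = len(matrix)
    w = column_normalize(matrix)
    # Equal probability on every seed node, summing to 1.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(size)]
    p = p0[:]
    for _ in range(max_iter):
        p_next = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(size)) + restart * p0[i]
            for i in range(size)
        ]
        # Stop when the total change between two iterations is negligible.
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p


# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only seed (assumed example).
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
scores = rwr(path, seeds={0})
print(scores)
```

On this toy graph the seed keeps the highest score and the score decays with distance from it, which is the proximity behaviour the ranking relies on.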

Algorithm 1: Recursive Random Walk with Restarts
Input: Adjacency matrix A of size N × N; vector p0 of size N with n seed proteins; restart probability r
Output: Sorted list of candidate diseases
(1) Let m1, . . ., mk be the binary vectors of length N
(2) Let W be the column-normalized A matrix of size N × N
(3)
(4) Let p := p0;
(5) Let i := 1, j := 1;

Recursive Cross Subgraph Mapping.
We extend the previous formulation to a symmetric matrix composed of $k^2/2$ submatrices, where $k$ corresponds to the number of data sources. The submatrix $W_{ij}$ that corresponds to the mapping between the subgraphs $i$ and $j$ is obtained by restricting $W$ to the entries selected by $\vec{m}_i$ and $\vec{m}_j$, where $\vec{m}_i$ and $\vec{m}_j$ are binary vectors with $n$ elements that represent the masks of the source and target subgraphs, with $i, j \in [1, k] \subset \mathbb{N}$ and $|\vec{m}_i| = |\vec{m}_j| = n$.
The result of each iteration of the Random Walk with Restarts is the vector $\vec{p}_{t+1}$, and the algorithm stabilizes when the following condition is met:
$$\vec{m}_1 \cdot |\vec{p}_{t+1} - \vec{p}_t| < \epsilon$$
where $\vec{m}_1$ is the disease mask vector and $\vec{p}_t$ is the weight vector at time $t$. The product results in a scalar that corresponds to the sum of the differences between two iterations. The condition is true when this value is less than a given constant $\epsilon$.
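The masking and the stopping condition can be sketched as follows. This is a hedged illustration only: the function names are ours, the default threshold is an arbitrary choice, and the orientation used in `mask_submatrix` (rows selected by one mask, columns by the other) is an assumption, since the text only states that the masks select the source and target subgraphs.

```python
def mask_submatrix(w, row_mask, col_mask):
    """Keep w[i][j] only where row_mask[i] and col_mask[j] are both 1.

    Which mask plays the source and which the target is an assumption here.
    """
    return [
        [w[i][j] if row_mask[i] and col_mask[j] else 0 for j in range(len(w))]
        for i in range(len(w))
    ]


def masked_change(mask, p_next, p_prev):
    """Scalar product of the mask with the absolute per-node difference."""
    return sum(m * abs(a - b) for m, a, b in zip(mask, p_next, p_prev))


def has_converged(mask, p_next, p_prev, epsilon=1e-6):
    """True when the summed change on the masked nodes is below epsilon."""
    return masked_change(mask, p_next, p_prev) < epsilon


# Toy example: only nodes 1 and 2 are disease nodes (assumed layout).
disease_mask = [0, 1, 1]
p_prev = [0.1, 0.3, 0.2]
p_next = [0.5, 0.3, 0.2]
print(has_converged(disease_mask, p_next, p_prev))
```

Restricting the check to the disease mask means the walk can stop as soon as the disease scores are stable, even if scores elsewhere in the graph are still moving.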

Node Replacement.
The recursive iteration of the cross subgraph mapping returns a new term. A node replacement strategy is used to replace the genes to be used. The index $i$ of the node to be replaced by the node with index $j$ is given by the minimum value of the stationary vector, $i = \min_{0 \le s \le |S|} \vec{p}_{\infty}$, and the candidate node is given by $j = \max_{0 \le v \le |V|} \vec{m} * \vec{p}_{\infty}$.
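One possible reading of the replacement step is sketched below: the seed with the lowest stationary score is swapped for the highest-scoring non-seed node allowed by a mask. The function name, the exclusion of current seeds from the candidates, and the toy scores are assumptions, not details confirmed by the text.

```python
def replace_weakest_seed(seeds, scores, candidate_mask):
    """Swap the lowest-scoring seed for the best-scoring masked candidate."""
    weakest = min(seeds, key=lambda i: scores[i])
    candidates = [
        i for i in range(len(scores)) if candidate_mask[i] and i not in seeds
    ]
    if not candidates:
        return set(seeds)  # nothing to swap in, keep the seed set unchanged
    best = max(candidates, key=lambda i: scores[i])
    return (set(seeds) - {weakest}) | {best}


# Toy stationary scores for four nodes; nodes 0 and 1 are the current seeds.
scores = [0.40, 0.05, 0.30, 0.25]
new_seeds = replace_weakest_seed({0, 1}, scores, candidate_mask=[1, 1, 1, 0])
print(new_seeds)
```

In this toy case seed 1 is dropped because it has the lowest score, and node 2 enters as the only masked candidate outside the seed set.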

Results and Discussion
In this section we explore and evaluate the performance of the proposed method. We present a systematic evaluation using synthetic datasets based on well-known disease profiles. We also show how the results of RecRWR can be used to explore the mechanisms that breast cancer shares with related diseases.

Validation Procedure.
For each selected phenotype entry in the OMIM database we created a dataset with the associated genotypes. We selected 100 phenotype diseases with at least 10 associated genotypes each. Then, we iteratively replaced genes from the original dataset with random ones, in 20% increments, so that the dataset is progressively converted to a fully random dataset. We use each of these protein datasets as seed nodes on the graph. We end up with a test space of 600 gene sets (6 randomness levels times 100 diseases).
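The progressive randomization can be sketched as below, assuming that the genes replaced at one level stay replaced at the next. The function name, the fixed random seed, and the gene labels `G0`..`G9` and `R0`..`R49` are hypothetical; only the six levels and the 100 diseases come from the text.

```python
import random


def noisy_series(genes, background, levels, seed=0):
    """Return one gene set per randomness level, replacing genes progressively."""
    rng = random.Random(seed)
    original = list(genes)
    rng.shuffle(original)  # random order in which genes get replaced
    known = set(genes)
    pool = [g for g in background if g not in known]
    replacements = rng.sample(pool, len(original))
    series = {}
    for level in levels:
        k = round(level * len(original))  # number of genes replaced at this level
        series[level] = replacements[:k] + original[k:]
    return series


levels = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
genes = [f"G{i}" for i in range(10)]                # hypothetical disease genes
background = genes + [f"R{i}" for i in range(50)]   # hypothetical gene universe
series = noisy_series(genes, background, levels)

# Six levels for each of the 100 selected diseases.
test_space = len(levels) * 100
print(test_space)
```

Each resulting gene set would then be used as the seed list for one run, giving the 600 gene sets of the test space.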

Information Paradox.
Previous uses of RWR in molecular biology typically concentrate on PPI networks. One would expect that including additional data would improve the overall result. Figure 1 presents a comparison of the relative frequency of the ranks for each of the analysed datasets, for two of the tested methods (RWR over only PPI data and RWR over the whole network) and for four levels of randomness. From the analysis of this graph it is clear that including external annotations on the original PPI network brings no improvement. Indeed, for the original dataset, without random effect, there are no perceptible differences between the two methods. The gap becomes sharper when we test progressive levels of randomness. For instance, when 20% of the genes in the dataset are random, RWR over PPI ranks the disease in the top 3 in 55% of the cases, while with RWR over all data this frequency drops to 48%. For 60% randomness, RWR over PPI ranks the disease in the top 5 in 35% of the cases, while with RWR over all data the frequency drops to 23%. These results were the primary motivation for the work presented in this paper, as they clearly show that the RWR method is not suitable for dealing with multiple types of biological data.

RecRWR Results on Synthetic Datasets.
We evaluate the performance of the RecRWR method using receiver operating characteristic (ROC) curves, where each curve contains the results for one level of randomness. A higher AUC (area under the curve) corresponds to a better overall performance. Figure 2 and Table 1 present the results. Once randomness is introduced, the proposed method shows a strong improvement.
With 20% randomness the RecRWR AUC is 0.9834, which compared to 0.9453 for the RWR corresponds to a 4.0% improvement. For RecRWR itself, 20% randomness has no real impact (−0.22%) on the obtained AUC.
For 40% and 60% randomness the difference is even higher (7.1% for 40% and 7.6% for 60%), demonstrating the greater capability of the proposed method.
It is also relevant to note that a TPR of 1.0 (true positive rate, y-axis on the graphs in Figure 2), meaning that the disease is always correctly identified, is consistently reached at a lower FPR (false positive rate, x-axis).

RecRWR Results on Breast Cancer. Breast cancer (MIM: 114480) is considered a complex disorder having 23 known genotypes that are shared with other cancer-related disorders. We have used RecRWR over the common expression profiles of breast cancer to explore the network of diseases that share common mechanisms. The diseases most closely related to breast cancer are hepatocellular carcinoma, bladder cancer, and lung cancer.
From analysing the network of associations, we can see that the proteins most related to breast cancer are responsible for important cellular functions, such as DNA repair, cell cycle arrest and its regulation, induced cell death (apoptosis), and tumor suppression. We can also see that the GO terms most closely associated with breast cancer are protein binding and apoptotic process. This suggests that the probable causes of breast cancer are related to the impairment of these functions. For instance, a genetic mutation causing loss of function of a tumor suppressor gene product (such as the cellular tumor antigen p53, P04637) would result in unrestrained cellular proliferation. Conversely, the transformation of a protooncogene (a gene that participates in a cell-growth pathway) into an oncogene (a gene that can induce cancer in animals) requires a gain-of-function mutation that allows its permanent activation. An example of this is the epidermal growth factor receptor (EGFR, P00533), also present in Figure 3. EGFR is involved in the conversion of extracellular stimuli into cellular responses. Also, transcription errors are usually immediately corrected by DNA repair proteins, like the DNA repair and recombination protein RAD54-like (Q92698), shown in the network. A mutation in this gene would result in defective proteins, and subsequently the correction of transcription and translation errors would cease. Finally, the protein caspase-8 also seems to be a possible contributor to breast cancer. Since caspase-8 is involved in the apoptotic process, impairment of this protein would result in the absence of apoptosis, and defective cells would not be destroyed. The shortest path between the two diseases is mediated by the cellular tumor antigen p53. There are, however, other connections between the two nodes. For instance, the proteins receptor tyrosine-protein kinase erbB-2 (P04626), GTPase KRas (P01116), and caspase-8 (Q14790) also connect the two cancers.
The influence of caspase-8 mutations on the onset of cancer was explained above. ERBB2 is a protooncogene, with the potential of being converted into an oncogene and inducing cancer. The GTPase KRas is involved in a great variety of important biological processes, including regulation of both cell proliferation and gene expression, signal transduction, and cell signalling. The majority of the proteins analysed here are part of the same KEGG pathways: pathways in cancer (hsa05200), neurotrophin signalling pathway (hsa04722), and focal adhesion (hsa04510). The first pathway consists of an integration of the various cancer pathways. The neurotrophin signalling pathway is responsible for the differentiation and survival of neural cells. However, this second pathway is heavily regulated by other intracellular signalling cascades, in which some of the proteins presented in Figure 3 participate. The focal adhesion pathway plays important roles in the proliferation, differentiation, and survival of cells and in gene expression. If any of the proteins involved in this pathway is compromised, cellular communication becomes defective, which can also result in cancer.

Conclusion
In this paper, we have proposed a graph-based approach to address the problem of selecting the best disease targets for multiconcept graphs. Towards this aim we build a multilayer biomedical graph that stores PPI data, annotated with stable knowledge from OMIM diseases and Biological Process from Gene Ontology. The inherent improvements of the proposed method are the use of multilayer networks formed with the PPI data and by the terms' associations; combination of this data to establish new associations among nodes; and use of degree-based methods for evaluating node weights. Finally, we have presented comprehensive validation that demonstrates the superiority of the proposed approach, Recursive Random Walk with Restarts (RecRWR). The obtained results outline the superiority of the proposed approach in identifying disease candidates, especially with high levels of biological noise and benefiting from all data available.