1. Introduction

BMRI

BioMed Research International

2314-6141 2314-6133

Hindawi Publishing Corporation

416323

10.1155/2014/416323

416323

Review Article

Approaches for Recognizing Disease Genes Based on Network

Zou

Quan

¹ Li

Jinjin

¹ Wang

Chunyu

² Zeng

Xiangxiang

¹ Huang

Tao

School of Information Science and Technology

Xiamen University

Xiamen 361005

China

xmu.edu.cn

School of Computer Science and Technology

Harbin Institute of Technology

Harbin 150001

China

hit.edu.cn

2014

2422014

2014 06 12 2013 06 01 2014 09 01 2014 24 2 2014

2014

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Diseases are closely related to genes, thus indicating that genetic abnormalities may lead to certain diseases. The recognition of disease genes has long been a goal in biology, which may contribute to the improvement of health care and understanding gene functions, pathways, and interactions. However, few large-scale gene-gene association datasets, disease-disease association datasets, and gene-disease association datasets are available. A number of machine learning methods have been used to recognize disease genes based on networks. This paper states the relationship between disease and gene, summarizes the approaches used to recognize disease genes based on network, analyzes the core problems and challenges of the methods, and outlooks future research direction.

1. Introduction

Although the human genome project has been accomplished and has achieved great success, and new methods that verify gene function with high-throughput have been applied, studying genetic problems that induce diseases is still one of the major challenges facing humanity [1]. The traditional gene mapping method is based on family genetic disease. First, genes inducing diseases are located in a chain interval. Most recent studies at recognizing disease gene that involves linkage analysis or association studies have resulted in a genomic interval of 0.5 cm to 10 cm, which contains 300 genes [2, 3]. Second, using the biological experiment method to identify each gene located in a chain interval requires a large number of human resources and capital support [4]. In addition, recognizing disease gene by checking the genes set in the interval is often not possible [5]. However, the study of candidate association works well when using a set of known functional candidate genes, which have a clear biological relationship to the disease [6]. Selecting known functional candidate genes is not easy and is often limited by a good deal of factors. The selection of functional candidate genes and prioritization candidate genes has been one of the keys in recognizing disease genes because several reorganization approaches are based on the functions of these genes.

In recent years, a number of recognizing disease gene approaches and computer tools have been developed through building mathematic models based on functional annotation, sequence-based features, protein interaction, and disease phenotype [7–16], such as sequence features [15], functional annotation [7, 8, 10, 13], and physical interactions [12, 13, 17]. Based on the above features, an approach to rank candidate disease genes by computing a correlation score that stands for the correlation between genes and diseases has been introduced. However, various factors may affect the association between genes and diseases.

System biology has indicated that diseases with overlapping clinical manifestations are induced by one or more mutations from the same function module [18–21]. Researches in biological experiments of human disease and patterns have found that genes causing similar disease phenotype often interact with each other directly or indirectly [22–24]. These discoveries have shown that positive correlation exists between disease phenotype and disease gene. Many researchers have proposed disease gene prediction methods based on gene interaction and disease phenotype similarity [7, 25–29]. Recently, many approaches making full use of gene interaction and disease similarity have established the gene interaction and disease phenotype similarity network to predict disease genes. Some typical methods based on networks will be introduced in detail in this paper.

2. Datasets

In the field of biological information, construction dataset is the data foundation of all subsequent work. The validity of datasets directly affects the validity and reliability of the learning algorithm and test. Thus, building a dataset is a basic and important preparatory work.

The recognition of disease gene datasets is mainly obtained from two databases: Online Mendelian Inheritance in Man (OMIM), which is a synthesis database [30–32], and Human Protein Reference Database (HPRD) [33, 34]. Although none of the datasets from OMIM or HPRD are currently complete, they are comprehensive enough [6].

OMIM has the most abundant information, most extensive resources, most comprehensive, authoritative, and timely human genes and genetic disorders based on knowledge composed to support human genetics research and education and the clinical genetics research. OMIM is daily updated and has free access and acquisition at http://www.omim.org/. In OMIM, each item has a short text summary of a generally determined phenotype or gene and a large number of links to other genetic databases [30]. Datasets of disease phenotype and gene-disease phenotype can be obtained from OMIM. However, the data from OMIM need to be disposed to recognize the disease genes [35].

HPRD is a database which curetted proteomic information suited to human proteins. Even though HPRD is updated relatively slow, it is a full-scale resource for studying the relationship between human diseases and genes [36] and is linked to an outline of human signaling paths. HPRD is also available for free at http://www.hprd.org/ [34]. The dataset of gene interaction can be obtained from HPRD [35].

3. Networks

Most research on recognizing disease genes use networks, including the disease phenotype network, protein-protein interaction network, and gene-disease phenotype network, among others. In this study, we introduce only the most commonly used networks. GPPI represents the gene (proteins) interaction network, GDP represents the disease phenotype network, and GP-DP represents the gene-disease phenotype network [6, 16].(1)

GPPI=(G,EG) is an undirected graph and denotes the gene-gene interaction. G={g1,g2,…,gn} is the subset of the gene set, and EG⊂G×G expresses the interaction of genes with weight. Figure 1(a) shows GPPI. In the gene-gene interaction network, the relationship between genes is obtained from the gene-gene relationship database, which is one of the most important databases in the biological information field.

(2)

GDP=(D,ED) is also an undirected graph and denotes the disease phenotype network. D={d1,d2,…,dn} is the subset of the disease phenotype set, and ED⊂D×D represents the similarity of the disease phenotype with weight. Figure 1(b) shows GDP. In the disease phenotype network, the relationship between disease phenotypes is obtained from the phenotype relationship database, which can also be replaced by the disease-disease relationship database.

(3)

GP-DP=(G,D,EP-DP), which is an undirected biograph, is a gene-disease phenotype network. G is the subset of the gene set, and D is the subset of the disease phenotype set. EP-DP⊂G×D expresses the link between the known gene and the disease phenotype. Figure 1(c) stands for GP-DP. The association between disease gene and disease phenotype can be obtained from the gene-disease relationship database.

Illustration of the network by using a specific example: (a) is a gene-gene network, (b) is a disease phenotype network, and (c) is a gene-disease phenotype network [6].

(a) (b) (c)

4. Methods

In previous research, various methods, such as CIPHER, RWRH, Prince, Meta-path, Katz, Catapult, Diffusion Kernel [5], and ProDiGe, were used to recognize disease genes. In the current paper, we introduce several types of typical recognition disease gene methods.

4.1. CIPHER

CIPHER [6] is a tool for predicting and prioritizing disease genes. Furthermore, CIPHER is applied to general genetic phenotypes, which do better in the genome-wide scan of disease genes; furthermore, they are extendable for exploring gene cooperatives in complex diseases. CIPHER is based on an assumption that if two genes have the closest connection in the gene interaction, then the two genes can lead to more similar phenotypes. A regression model can be formulated according to this assumption. A score assessing how likely a gene is associated with a specific phenotype is obtained from the regression model. To construct the regression model, the similarity between phenotypes, interaction between proteins and genes, and list of associations between known disease gene and phenotype must be prepared. The next paragraph expresses the procedures of prioritizing disease genes.

For a given query phenotype and candidate genes, CIPHER first combines the gene interaction network, disease phenotype network, and gene-disease phenotype network into a single network. The similarity scores of the query phenotype with all known phenotypes in the disease phenotype network are derived directly from the phenotype network and the topological distances between the candidate genes. All known disease genes in the gene interaction network are counted and grouped on the basis of their phenotypes. The correlation between phenotypes and disease genes is obtained and acts as the concordance score for each candidate gene by using the regression model. Finally, all candidate genes for the query phenotype are ranked in line with the concordance scores. On account of different neighborhood systems, two ways are available to define the topological distance: direct neighbor and shortest path. Thus, there are two versions of CIPHER which are CIPHER-SP and CIPHER-DN [6].

The similarity scores of the query phenotype and all phenotypes in the disease phenotype network are calculated by the following formulation: (1)Spp′=Cp+∑g∈G(p)∑g′∈G(p′)βpge-Lgg′2. In the above formulation, Spp′ is the similarity score of the query phenotype and another phenotype in the disease phenotype network. Lgg′ is the topological distance between the candidate genes and g′ in the gene interaction network. There exit two ways to define the topological distance, Lgg′, on the basis of how to consider indirect the interaction. One way to define the topological distance is shortest path; Lgg′ is the graph theory shortest path length between genes g and g′ in the gene interaction network. The other way to define the topological distance is direct neighbor; Lgg′ is infinity when g and g′ are indirect neighbors. G(p) indicates all disease genes belonging to phenotype p. Cp is a constant and can act as the basal similarity between p and other phenotypes whose causative genes are not connected to those of p in the gene interaction network. βpg is the coefficient of the regression model and stands for the level of gene g contributing to the similarity of the phenotype p to any other phenotype p′. To denote the association between a phenotype and a gene, the following formulation (2) is defined: (2)Φgp′=∑g′∈G(p′)e-Lgg′2. The following vector is used to denote the similarities between the query phenotype and all phenotypes in the disease phenotype network: Sp=(Spp1,Spp2,Spp3,…,Sppn).

In the same way, the following vector is used to denote the closeness between the genes and the phenotypes in the disease phenotype network: Φg=(Φgp1,Φgp2,Φgp3,…,Φgpn). Synthesizing Formulas (1) and (2) and two vectors extends Formula (1) to the following form: (3)Sp=Cp+∑g∈G(p)βpgΦg. In this regression model, the concordance score is defined by Formula (4): (4)CSpg=cov⁡(Sp,Φg)σ(Sp)σ(Φg), where cov⁡ and σ are the covariance and standard deviation, respectively. The candidate genes for the query phenotype are ranked according to the values obtained from Formula (4). If a gene that does not connect to any disease genes exists, then Formula (4) cannot be used and the gene will rank at the tail.

4.2. PRINCE

PRINCE is another approach based on networks for ranking candidate disease genes for a given disease and inferring the complex associations between genes. PRINCE is on account of formulating constraints on the ranking function that involved usage of prior information and its smoothness over the network.

Before using PRINCE, the gene disease composed of the phenotype network (set of gene-disease associations), gene-gene interaction network (set of gene-gene association), and at least a query disease phenotype is prepared. G=(V,E,w) denotes the gene-gene interaction network, where V is the set of genes, E is the set of interactions, and w is a weight function denoting the reliability of each interaction. Given a query disease phenotype, PRINCE ranks all the genes in V.

Suppose a gene v∈V, the direct neighborhood of gene v is denoted by N(v). The prioritization candidate disease gene function is denoted by F:V→ℜ, and F(v)=q reflects the relevance of v to q. Another function is defined as prior knowledge function denoted by Y:V→{0,1}. In the prior knowledge function, gene v is related to q,V(v)=1; otherwise, V(v)=0. PRINCE computes function F that is smooth over the network. Thus, function F is a combination of two conditions: (5)F(v)=α[∑u∈N(v)F(u)w′(v,u)]+(1-α)Y(v), where the parameter α∈(0,1) weighs the relative importance of gene v to gene u. w′ is a normalized form of w. Formally, a diagonal matrix D is defined, and D(i,i) is the sum of row i of W. W is normalized by W′=D-1/2WD-1/2, which obtains a symmetric matrix. Here, Wij′=Wij/D(i,i)D(j,j). Formula (5) can be expressed in linear form as follows: (6)F=αW′F+(1-α)Y⟺F=(I-αW′)-1(1-α)Y, where F and V are viewed as vectors of size |V|. W′ is a matrix whose values are given by w′. Given that the eigenvalues of W′ are set in [−1, 1], α∈(0,1), and the eigenvalues of (I-αW′) are positive. In addition, (I-αW′)-1 exists.

The above linear system can be solved accurately because an iterative propagation-based algorithm works fast and is guaranteed to converge to the system solution for larger networks. Formula (6) is transferred to an iterative algorithm and is denoted as follows: (7)Ft=αW′Ft-1+(1-α)Y, where F1=Y. Every node propagates the information received in the previous iteration to its neighbors. Finally, the values obtained from Formula (7) rank all the candidate disease genes for a query disease phenotype.

4.3. RWRH

Random walk with restart on heterogeneous network (RWRH) is extended from the random walk with restart algorithm to the heterogeneous network. The heterogeneous network is constructed by connecting the gene-gene interaction network and disease phenotype network by using the gene-disease phenotype relationship information. In brief, the gene-disease phenotype network is the heterogeneous network. RWRH prioritizes the genes and the phenotypes simultaneously, which is inspired by the coranking framework [37]. Given a query disease, seed nodes as genes and phenotypes are associated with the disease, and the top ranked phenotype is the most similar to the query disease.

Random walk is defined as an iterative walker transition from its current node to a randomly selected neighbor, starting at a given source node v in the network. However, RWRH allows the restart of the walk in every time step at node v with probability r. P0 is the probability vector at step 0, indicating that it is the initial probability vector with the sum of probabilities equal to 1. Similarly, Ps is the probability vector at step s, in which the ith element holds the probability of finding the random walker at node i at step s. The probability vector at step s+1 is denoted as follows: (8)Ps+1=(1-γ)MTPs+γP0, where M is the transition matrix of the heterogeneous network; Mij is the transition probability from node i to node j; γ∈(0,1) is the restart probability in every time step. After several iterations, P∞ reaches a steady-state that is obtained by performing the iteration until the change between Ps and Ps+1 falls below 10⁻¹⁰. P∞ is the measure of closing to seed nodes. In vector P∞, when P∞(i)>P∞(j), node i is more likely to be the seed node than node j.

M is the transition matrix of the heterogeneous network. In addition, M consists of four subnetwork transition networks and is denoted as follows: (9)M=[MGMGPMPGMp], where MG is the transition matrix of the gene-gene interaction network, which is the intrasubnetwork of the heterogeneous network. Mp is the transition matrix of the disease phenotype network, which is also the intrasubnetwork of the heterogeneous network. MPG and MGP are the intersubnetwork transition matrixes. Supposing the probability of jumping from gene-gene interaction network to the disease phenotype network is λ, the reverse is the same. In the gene-gene interaction network, λ=0 if a node is not connected to the phenotype. If a node is directly linked to the disease phenotype network, then the node will jump to the disease phenotype network with probability λ. The node will jump to other nodes in the gene-gene interaction network with probability 1-λ. Thus, the transition probability from gi to pj can be denoted as follows: (10)(MGP)i,j=P(pj∣gi)={λBij∑jBij,if ∑jBij≠00,otherwise. In the same way, the transition probability from pi to gj can be denoted as follows: (11)(MPG)i,j=P(gj∣pi)={λBji∑jBji,if ∑jBji≠00,otherwise. The transition probability from gi to gj, which is the element of MG at the ith row and jth column, can be denoted as follows: (12)(MG)i,j={(AG)i,j∑j(AG)i,j,if ∑jBij=0(1-λ)(AG)i,j∑j(AG)i,j,otherwise. The transition probability from pi to pj, which is the element of MP at the ith row and the jth column, can be denoted as follows: (13)(MP)i,j={(AP)i,j∑j(AP)i,j,if ∑jBij=0(1-λ)(AP)i,j∑j(AP)i,j,otherwise.

In the above four formulations, AG(n×n), AP(m×m), and B(n×m) are the adjacency matrixes for the gene-gene interaction network, disease phenotype network, and gene-disease phenotype network, respectively. The adjacency matrix of the heterogeneous network can be denoted as follows: (14)A=[AGBBTAP].

The initial probability of the gene-gene interaction network and phenotype network is denoted by u0 and v0, respectively. The initial probability of the gene network u0 makes the equal probabilities to all the seed nodes in the gene network, with the sum of the probabilities equal to 1. The initial probability of the phenotype network v0 is the same as the gene-gene interaction network. Thus, the initial probability vector of the heterogeneous network is denoted as P0=[(1-η)u∞ ηv∞]T. In the initial probability vector of the heterogeneous network, the parameter η∈(0,1) acts as the judge to weight the importance of each subnetwork. When η=0.5, the importance of the gene-gene interaction network and the disease phenotype network are equal. If η>0.5, then the importance of the gene-gene interaction network is greater than the disease phenotype network. When η<0.5, the gene-gene interaction network is more important than the disease phenotype network is. P0 and the transition matrix M are substituted into Formula (8). After many iterations, steady-state P∞ is denoted as P∞=[(1-η)u∞ ηv∞]T. In this way, the steady probabilities u∞ and v∞ are used to rank the genes and disease phenotypes. A web server named GeneWanderer, which is a computational method that prioritizes a set of candidate genes according to their probability to become involved in a particular disease or phenotype using HWRH or diffusion kernel, is used.

4.4. Katz

The Katz method is successfully applied to social network link prediction. Predicting the social network link is close to the problem of predicting disease genes. The Katz approach, which is based on a graph, finds the similar nodes for the query nodes in the network [38].

An adjacency matrix A is available in an undirected unweighted graph. The Katz approach counts the number of walks of different lengths that connects i and j. These walks act as the similarity of the two nodes i and j. (Al)ij is the number of walks of length l that connect i and j. (Al)ij gives a measure of similarity between i and j. A single similarity measure based on the different walk lengths is necessary. The measure is given below, in which β is a constant that restrains contributions of longer walks: (15)Sij=∑l=1kβl(Al)ij. The above measure is denoted as follows: (16)S=∑l=1kβlAl. If l→∞, βl→0. In this study, setting βl=βl leads to the well-known Katz method: (17)Skatz=∑l≥1βlAl=(I-βA)-1-I, where β is chosen, such that β<1/∥A∥2. In the case of the Katz method, the connections in the graph are weighed so that Aij is the strength of the connection between nodes i and j. For the choice of k, the sum over infinitely many path lengths is not necessarily considered. According to the experimental results, small values of k (k=3 or k=4) obtain good performance in the task of recommending similar nodes.

The adjacency matrix of the heterogeneous network is denoted as follows: (18)A=[AGBBTAP]. One of the advantages of Katz is A, which can represent the other species if we want to study human disease phenotypes and other species disease phenotypes. (19)B=[PHSPS], AP=[APHS00APS]. Here, APHS and APS represent human phenotypes and the other species phenotypes, respectively. PHS and PS indicate gene-disease phenotype association of human and other species, respectively. When an experiment on human is conducted, set PS=0 and APS=0. By synthesizing expressions (18) and (19), we substitute matrix A into Formula (17) and obtain the similarity of genes and phenotypes from the similarity matrix.

4.5. CATAPULT

Combining data across species by using positive-unlabeled learning techniques is abbreviated to CATAPULT. And CATAPULT is a supervised machine learning method which uses a biased support vector machine (SVM), where the features are derived from walks in a heterogeneous gene-trait network.

Given a query disease phenotype, a gene is not associated with the query phenotype. Scholars report positive association between genes and phenotypes; however, the negative associations are rarely reported. In the CATAPULT approach, the unlabeled gene-disease phenotype pairs act as negative associations. The characteristics of the dataset are that only the positive associations are known, and the negative associations and a large number of unlabeled gene-disease phenotype pairs as negative associations are unknown. The general idea of CATAPULT is that the examples are not known to be negative. The false positives are not penalized heavily, but the false negatives are penalized heavily.

CATAPULT uses a biased SVM to classify the gene-phenotype pairs of humans with a single training phase. This approach draws a random bootstrap sample of a few unlabeled examples from the set of all unlabeled examples and trains a classifier to classify the bootstrap samples as negatives along with the positive samples. CATAPULT also uses the bagging technique to obtain an aggregate classifier by using positive and unlabeled examples. The algorithm description is shown in Algorithm 1. T denotes the number of bootstraps, A is the set of positive, n+ denotes the number of examples in A, U denotes the set of unlabeled gene-disease phenotype pairs, C- is a penalty for false positives, and C+ is a relatively larger penalty for false negatives. The source code can be downloaded from http://marcottelab.org/index.php/Catapult.

<bold>Algorithm 1: </bold>CATAPULT algorithm description.

Initialize θ=0, s(x)=0, ∀x∈U, and n(x)=1, ∀x∈U

For t = 1,2 , 3 , … , T :

Step 1. Draw a bootstrap sample Ut⊆U of size n+

Step 2. Train a linear classifier θt using the positive training examples A and Ut as negative examples by solving:

min⁡θ′∈Rd12∥θ′∥2+C-∑i∈Utξi+C+∑i∈Aξi

Subject to ξi≥0, ∀i∈A∪Ut

〈Φ(xi),θ′〉≥1-ξi, ∀i∈A

-〈Φ(xi),θ′〉≥1-ξi, ∀i∈Ut

Step 3. For any x∈U∖Ut update:

(i) n(x)←n(x)+1

(ii) s(x)←s(x)+〈θt,Φ(x)〉

Return s(x)←s(x)/n(x), ∀x∈U.

Before applying any supervised machine learning approach, extracting the features for gene-disease phenotypes is essential. The features are derived from the paths in the heterogeneous network. For a given gene-disease phenotype pair, different walks of the same length and walks of different lengths can be used as features for the gene-disease phenotype pair.

4.6. Meta-Path

The meta-path approach mainly uses the technology of multilabel classification. The multilabel classification method is useful for recognizing disease genes. A gene may exhibit many diseases caused by the gene. In the above example, the gene is an instance, and various diseases are different labels. Given an instance, a large space of all possible label sets may exist, which may be exponential to the number of candidate labels. The frequently used approach to solve the above problem is exploiting correlations among different labels. In the network, exploiting the correlations among different labels denoted by nodes is an advantage.

Meta-path is defined as a sequence of relations in the network. The objects in the network are linked through multiple-type associations. Multiple-type associations help exploit the correlations among different labels for multilabel classification. In recognizing the disease genes, the labels of the genes are diseases, and the labels of the diseases are genes. The explanation of the correlations among genes is summed up in this study.

Given a set of meta-paths among the gene nodes acting as labels, Sl={P1,P2,…,Pcl}, the meta-path-based label correlations can be used as follows: (20)∀i, P(Yi∣xi)=∏k=1qP(Yik∣xi,YiP1(k),YiP2(k),…,YiPcl(k)), where Pj(k) denotes the index set of the genes linked to the kth gene through the meta-path Pj∈Sl. xi denotes the feature vector of node i in the input space. Yi denotes the association between a gene and a gene set. The set of all candidate genes is denoted as Vl={l1,l2,…,lq}, and Yi is denoted as Yi=(Yi1,Yi2,…,Yiq)T∈{0,1}q. In the same way, given a set of meta-paths among disease phenotype nodes acting as instances, SI={P1′,…,PcI′}, the meta-path-based label correlations can be used as follows: (21)P(Y∣X)≈∏iP(Yi∣xi,Yp1′(i),…,YpcI′(i)), where Pj′(i) denotes the index set of disease phenotypes linked to the ith disease phenotype through meta-path Pj′∈SI.

To perform multilabel collective classification more effectively in heterogeneous information network, both meta-path-based label correlations and meta-path-based instance correlations are performed simultaneously: (22)P(Y∣X)≈∏i ∏k=1qP(Yik∣xi,Yipj(k),…,Ypj′(i)k), where the gene is set as the label set, and the disease phenotype is set as the instance set. On the contrary, the disease phenotype is set as the label set, and the gene set is as the instance set.

Some research has proposed algorithms based on multilabel collective classification. We briefly introduce the multilabel collective classification algorithm called PIPL. The algorithm roughly includes the following steps.(1)

Meta-path constructions: extracting all nonredundant meta-paths for label correlations and instance correlations.

(2)

Training initialization: construction of q extended training sets for all 1≤k≤q, Dk={(xik,yik)} by converting each instance xi to xik by using the functions in Algorithm 2. Training one classifier on each label by using the extended training sets.

(3)

Iterative inference: the inference step is an iterative classification algorithm. It updates the testing instance label set predictions and the relational features of label and instance correlations.

<bold>Algorithm 2: </bold>The function of construction relational features for meta-path-based label correlations and meta-path-based instance correlations.

(i) xl=LabelPathFeature(lk,Yi)

For each meta-path Ρj′∈SI:

(1) Get related labels for node lk through meta-path Pj.

(2) xi=Aggregation({Yi∣i∈C})

Return (…,xjT,…)T

(ii) xI=LabelPathFeature(i,Y)

For each meta-path Ρj∈SI:

(1) Get related labels for node Ii through meta-path Pj′.

(2) xi=Aggregation({Yi∣i∈C})

Return (…,xjT,…)T

5. Evaluation Methods

A comprehensive comparison should be conducted among these methods. In the next several paragraphs, we will introduce some of the key comparisons for recognizing the disease genes reported so far.

Cross-validation is the most frequently used approach in evaluating these methods. However, this method is similar to that used in a previous work, which performs leave-one-out. Each of the known gene-disease phenotype associations is taken as a test case, and a set of genes is assigned as the negative control for each test case. In each round of cross-validation, the disease phenotype is held out, a link between the disease phenotype and one of the associated genes is removed, and the link removed gene is added into the test genes. The rank of the test genes is obtained according to the recognizing methods. Several processing approaches are available for the rank, such as the enrichment score, setting a threshold, precision, recall, and receiver-operating characteristic (ROC).

5.1. Enrichment Score

Suppose the number of test genes is 100. If a recognition disease gene method ranks the actual disease gene as the highest and is sequenced first, then an enrichment of 50-fold exists. The formula of the enrichment score is Enrichment = 50/rank.

5.2. ROC Analysis

ROC analysis denotes the true-positive rate (TPR) versus the false-positive rate (FPR) subject to the threshold dividing the prediction classes. The TPR/FPR is the rate of correctly/incorrectly classified samples of all samples classified to the positive class. To evaluate the scores of disease gene predictions, ROC is explained as a plot of the number of the disease genes above the threshold versus the number of the disease genes below the threshold. The area under the ROC curve for each curve is calculated to compare the different curves obtained by the ROC analysis.

5.3. Setting a Threshold

Concordance score is calculated for each test gene. If the true disease gene ranks first based on the concordance score, then the prediction is successful, and precision is used as the proportion of the successful predictions among all predictions. Another evaluation approach is setting a threshold, in which the highest score of all test genes in this case is not less than the threshold. Thus, recall is the fraction of true disease genes predicted among all disease genes.

6. Materials and Results

In the above section, several recognition disease gene methods and evaluation methods have been mentioned. This part introduces the data used and the comparison results.

Figure 2 [17] denotes the comparison of different recognition methods by setting a threshold. In Figure 2, six recognizing methods are shown, including Catapult, Katz, ProDiGe, RWRH, PRINCE, and Degree. HumanNet gene network, which is a part of the OMIM dataset, is the dataset used to compare the different recognition methods. Figure 2 shows that Katz and Catapult do better than the others with the HumanNet gene network and the evaluation method.

Figure 2

Setting a threshold to compare recognition disease gene methods.

RWRH is compared with CIPHER-DN and CIPHER-SP. The evaluation experiment is based on gene network containing 34,364 interactions between 8919 genes, the phenotypic similarity matrix between 5080 phenotype entities calculated by using MimMiner, and 1428 gene-phenotype links between 937 genes and 1216 phenotype entities. The comparison result is denoted by Figure 3 When γ=0.7, λ=η=0.5, RWRH successfully ranks 814 known disease genes as top 1. The result is denoted by L001 in Figure 3. The column of L002 is the result of removing a known gene-phenotype link and using the phenotype and the rest of the disease genes associated with this phenotype as seed nodes. The identification of disease genes for phenotype from the genome is called ab initio prediction. The ab initio method removes all the links from a phenotype to disease genes and uses the phenotype entity as seed node to run RWRH. If one of the disease genes associated to the phenotype ranks top 1, then the prediction is successful. The result of ab initio is shown in Figure 3. From the L001, L002, and ab initio, RWRH is better than CIPHER-SP and CIPHER-DN. Figure 4 denotes the result of the comparison between RWR and RWRH. Leave-one-out cross-validation is conducted for each disorder. In each cross-validation, a disease gene is selected, and the links between the phenotype entries and the disease gene are removed. The rest of the disease genes and the phenotype entry are used as seed nodes. The selected disease gene and all disease genes in the artificial linkage are ranked by RWRH and RWR. ROC analysis is used to evaluate the two recognizing approaches.

Figure 3

The comparison of different methods.

Figure 4

ROC curve of RWR and RWRH.

7. Conclusion

Identifying disease genes is one of the fundaments of medical care and has been a goal in biology. Although traditional linkage analysis and modern high-throughput techniques often provide hundreds of disease gene candidates, identifying disease genes in the candidate genes by using the biological experiment method time-consuming and expensive. To deal with the above issues, the methods based on networks have been proposed. Many methods based on network have been created to recognize disease genes. In this paper, five typical algorithms based on networks, namely, CIPHER, PRINCE, RWRH, Katz, and CATAPULT, are introduced in detail.

Some novel methods have been put forward to recognize and prioritize disease genes. For instance, BRIDGE [39] takes advantage of multiple regression models with penalty to automatically weight different data sources. A researcher employed the ensemble boosting learning technique to combine variant computational approaches for gene prioritization to improve overall performance [40].

Biological relationships are showed by networks, which brings forth new ideas. A network can be used to denote the association between genes and disease to recognize the gene-disease phenotype and to obtain a more complete understanding of the biological system. Networks have been successfully used in biology. However, combining experiments with networks results in the challenge of defining node similarities. Different ways to define node similarity may lead to different effects.

With the development of biology and the emergence of a large number of relevant data, disease gene research based on networks constantly matures. New machine learning methods and technologies will be used to predict disease genes. Research on disease gene recognition will achieve new breakthroughs. The disease gene research will open a new era of medical treatment.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work was supported by the Natural Science Foundation of China (no. 61370010, no. 61202011, no. 61271346, no. 61172098, and no. 60932008) and the Ph.D. Programs Foundation of Ministry of Education of China (20120121120039).

Adams

M. D.

Kelley

J. M.

Gocayne

J. D.

Dubnick

Polymeropoulos

M. H.

Xiao

Merril

C. R.

Olde

Moreno

R. F.

Kerlavage

A. R.

McCombie

W. R.

Venter

J. C.

Complementary DNA sequencing: expressed sequence tags and human genome project

Science 1991 252 5013 1651 1656

2-s2.0-0025879424

Glazier

A. M.

Nadeau

J. H.

Aitman

T. J.

Genetics: finding genes that underline complex traits

Science 2002 298 5602 2345 2349

2-s2.0-0037146578

10.1126/science.1076641

Botstein

Risch

Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease

Nature Genetics 2003 33 228 237

2-s2.0-0037373275

10.1038/ng1090

Lin

Gottschalk

Collision detection between geometric models: a survey

Proceedings of the IMA Conference on the Mathematics of Surfaces

1998

Köhler

Bauer

Horn

Robinson

P. N.

Walking the interactome for prioritization of candidate disease genes

American Journal of Human Genetics 2008 82 4 949 958

2-s2.0-41549139527

10.1016/j.ajhg.2008.02.013

Jiang

Zhang

M. Q.

Network-based global inference of human disease genes

Molecular Systems Biology 2008 4 article 189

2-s2.0-43249114206

10.1038/msb.2008.27

Perez-Iratxeta

Bork

Andrade

M. A.

Association of genes to genetically inherited diseases using data mining

Nature Genetics 2002 31 3 316 319

2-s2.0-0036649742

10.1038/ng895

Freudenberg

Propping

A similarity-based method for genome-wide prediction of disease-relevant human genes

Bioinformatics 2002 18 supplement 2 S110 S115

2-s2.0-0037580196

van Driel

M. A.

Cuelenaere

Kemmeren

P. P. C. W.

Leunissen

J. A. M.

Brunner

H. G.

A new web-based data mining tool for the identification of candidate genes for human genetic disorders

European Journal of Human Genetics 2003 11 1 57 63

2-s2.0-0037265647

10.1038/sj.ejhg.5200918

Turner

F. S.

Clutterbuck

D. R.

Semple

C. A. M.

POCUS: mining genomic sequence annotation to predict disease genes

Genome Biology 2003 4 11, article R75

2-s2.0-1542787787

10.1186/gb-2003-4-11-r75

Tiffin

Kelso

J. F.

Powell

A. R.

Pan

Bajic

V. B.

Hide

W. A.

Integration of text- and data-mining using ontologies successfully selects disease gene candidates

Nucleic Acids Research 2005 33 5 1544 1552

2-s2.0-15044341082

10.1093/nar/gki296

Aerts

Lambrechts

Maity

Van Loo

Coessens

De Smet

Tranchevent

L.-C.

De Moor

Marynen

Hassan

Carmeliet

Moreau

Gene prioritization through genomic data fusion

Nature Biotechnology 2006 24 5 537 544

2-s2.0-33646568805

10.1038/nbt1203

Franke

Van Bakel

Fokkens

De Jong

E. D.

Egmont-Petersen

Wijmenga

Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes

American Journal of Human Genetics 2006 78 6 1011 1025

2-s2.0-33646884801

10.1086/504300

Adie

E. A.

Adams

R. R.

Evans

K. L.

Porteous

D. J.

Pickard

B. S.

SUSPECTS: enabling fast and effective prioritization of positional candidates

Bioinformatics 2006 22 6 773 774

2-s2.0-33645104470

10.1093/bioinformatics/btk031

Adie

E. A.

Adams

R. R.

Evans

K. L.

Porteous

D. J.

Pickard

B. S.

Speeding disease gene discovery by sequence based candidate prioritization

BMC Bioinformatics 2005 6 1, article 55

2-s2.0-25444482801

10.1186/1471-2105-6-55

López-Bigas

Ouzounis

C. A.

Genome-wide identification of genes likely to be involved in human genetic disease

Nucleic Acids Research 2004 32 10 3108 3114

2-s2.0-3543106033

10.1093/nar/gkh605

Oti

Snel

Huynen

M. A.

Brunner

H. G.

Predicting disease genes using protein-protein interactions

Journal of Medical Genetics 2006 43 8 691 698

2-s2.0-33747035889

10.1136/jmg.2006.041376

Jiménez

Thomas

Torras

3D collision detection: a survey

Computers and Graphics 2001 25 2 269 285

2-s2.0-0035311963

10.1016/S0097-8493(00)00130-8

Grisart

Coppieters

Farnir

Karim

Ford

Berzi

Cambisano

Mni

Reid

Simon

Spelman

Georges

Snell

Positional candidate cloning of a QTL in dairy cattle: identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition

Genome Research 2002 12 2 222 231

2-s2.0-0036175484

10.1101/gr.224202

Thaller

Kühn

Winter

Ewald

Bellmann

Wegner

Zühlke

Fries

DGAT1, a new positional and functional candidate gene for intramuscular fat deposition in cattle

Animal Genetics 2003 34 5 354 357

2-s2.0-0141483650

10.1046/j.1365-2052.2003.01011.x

Clop

Marcq

Takeda

Pirottin

Tordoir

Bibé

Bouix

Caiment

Elsen

J.-M.

Eychenne

Larzul

Laville

Meish

Milenkovic

Tobin

Charlier

Georges

A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep

Nature Genetics 2006 38 7 813 818

2-s2.0-33745577150

10.1038/ng1810

Oti

Brunner

H. G.

The modular nature of genetic diseases

Clinical Genetics 2007 71 1 1 11

2-s2.0-33845545435

10.1111/j.1399-0004.2006.00708.x

Wood

L. D.

Parsons

D. W.

Jones

Lin

Sjo"blom

Leary

R. J.

Shen

Boca

S. M.

Barber

Ptak

Silliman

Szabo

Dezso

Ustyanksky

Nikolskaya

Nikolsky

Karchin

Wilson

P. A.

Kaminker

J. S.

Zhang

Croshaw

Willis

Dawson

Shipitsin

Willson

J. K.

Sukumar

Polyak

Park

B. H.

Pethiyagoda

C. L.

Pant

P. V.

Ballinger

D. G.

Sparks

A. B.

Hartigan

Smith

D. R.

Suh

Papadopoulos

Buckhaults

Markowitz

S. D.

Parmigiani

Kinzler

K. W.

Velculescu

V. E.

Vogelstein

The genomic landscapes of human breast and colorectal cancers

Science Signaling 2007 318 5853 1108 1113

Lim

Hao

Shaw

Patel

A. J.

Szabó

Rual

J.-F.

Fisk

C. J.

Smolyar

Hill

D. E.

Barabási

A.-L.

Vidal

Zoghbi

H. Y.

A protein-protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration

Cell 2006 125 4 801 814

2-s2.0-33646687963

10.1016/j.cell.2006.03.032

van Driel

M. A.

Bruggeman

Vriend

Brunner

H. G.

Leunissen

J. A. M.

A text-mining analysis of the human phenome

European Journal of Human Genetics 2006 14 5 535 542

2-s2.0-33646058032

10.1038/sj.ejhg.5201585

Zhang

Constructing biological networks through combined literature mining and microarray analysis: a LMMA approach

Bioinformatics 2006 22 17 2143 2150

2-s2.0-33748669610

10.1093/bioinformatics/btl363

Gaulton

K. J.

Mohlke

K. L.

Vision

T. J.

A computational system to select candidate genes for complex human traits

Bioinformatics 2007 23 9 1132 1140

2-s2.0-34249850089

10.1093/bioinformatics/btm001

van Heyningen

Yeyati

P. L.

Mechanisms of non-Mendelian inheritance in genetic disease

Human Molecular Genetics 2004 13 supplement 2 R225 R233

2-s2.0-5444259253

10.1093/hmg/ddh254

Lage

Karlberg

E. O.

Størling

Z. M.

Ólason

P. Í.

Pedersen

A. G.

Rigina

Hinsby

A. M.

Tümer

Pociot

Tommerup

Moreau

Brunak

A human phenome-interactome network of protein complexes implicated in genetic disorders

Nature Biotechnology 2007 25 3 309 316

2-s2.0-33947095027

10.1038/nbt1295

Hamosh

Scott

A. F.

Amberger

J. S.

Bocchini

C. A.

McKusick

V. A.

Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders

Nucleic Acids Research 2005 33 supplement 1 D514 D517

2-s2.0-13444266370

10.1093/nar/gki033

Hamosh

Scott

A. F.

Amberger

Bocchini

Valle

McKusick

V. A.

Onlined Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders

Nucleic Acids Research 2002 30 1 52 55

2-s2.0-0035998736

Hamosh

Scott

A. F.

Amberger

Valle

McKusick

V. A.

Online Mendelian inheritance in man (OMIM)

Human Mutation 2000 15 1 57 61

Baolin

HPRD: a high performance RDF database

Network and Parallel Computing 2007 4672

Berlin, Germany

Springer

364 374 Lecture Notes in Computer Science

Keshava Prasad

T. S.

Goel

Kandasamy

Keerthikumar

Kumar

Mathivanan

Telikicherla

Raju

Shafreen

Venugopal

Balakrishnan

Marimuthu

Banerjee

Somanathan

D. S.

Sebastian

Rani

Ray

Harrys Kishore

C. J.

Kanth

Ahmed

Kashyap

M. K.

Mohmood

Ramachandra

Y. L.

Krishna

Rahiman

B. A.

Mohan

Ranganathan

Ramabadran

Chaerkady

Pandey

Human protein reference database—2009 update

Nucleic Acids Research 2009 37 supplement 1 D767 D772

Patra

J. C.

Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network

Bioinformatics 2010 26 9 1219 1224

2-s2.0-77952808324

10.1093/bioinformatics/btq108

btq108

Peri

Navarro

J. D.

Kristiansen

T. Z.

Amanchy

Surendranath

Muthusamy

Gandhi

T. K. B.

Chandrika

K. N.

Deshpande

Suresh

Rashmi

B. P.

Shanker

Padma

Niranjan

Harsha

H. C.

Talreja

Vrushabendra

B. M.

Ramya

M. A.

Yatish

A. J.

Joy

Shivashankar

H. N.

Kavitha

M. P.

Menezes

Choudhury

D. R.

Ghosh

Saravana

Chandran

Mohan

Jonnalagadda

C. K.

Prasad

C. K.

Kumar-Sinha

Deshpande

K. S.

Pandey

Human protein reference database as a discovery resource for proteomics

Nucleic Acids Research 2004 32 supplement 1 D497 D501

2-s2.0-9144236198

Zhou

Orshanskiy

S. A.

Zha

Giles

C. L.

Co-ranking authors and documents in a heterogeneous network

Proceedings of the 7th IEEE International Conference on Data Mining (ICDM '07)

October 2007

Omaha, Neb, USA

739 744

2-s2.0-49749140070

10.1109/ICDM.2007.57

Katz

A new status index derived from sociometric analysis

Psychometrika 1953 18 1 39 43

2-s2.0-0002827622

10.1007/BF02289026

Chen

Jiang

Integrating human omics data to prioritize candidate genes

BMC Medical Genomics 2013 6 1, article 57

Lee

P. F.

Soo

V. W.

An ensemble rank learning approach for gene prioritization

Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13)

2013

Osaka, Japan

3507 3510