Diseases are closely related to genes, and genetic abnormalities may lead to certain diseases. Recognizing disease genes has long been a goal in biology; it may contribute to improving health care and to understanding gene functions, pathways, and interactions. However, few large-scale gene-gene, disease-disease, and gene-disease association datasets are available. A number of machine learning methods have been used to recognize disease genes based on networks. This paper describes the relationship between diseases and genes, summarizes the network-based approaches used to recognize disease genes, analyzes the core problems and challenges of these methods, and outlines future research directions.
Although the Human Genome Project has been completed with great success, and new high-throughput methods for verifying gene function have been applied, studying the genetic problems that induce diseases is still one of the major challenges facing humanity [
In recent years, a number of disease gene recognition approaches and computational tools have been developed by building mathematical models based on functional annotation, sequence-based features, protein interaction, and disease phenotype [
Systems biology has indicated that diseases with overlapping clinical manifestations are induced by one or more mutations in the same functional module [
In bioinformatics, constructing the dataset is the foundation of all subsequent work. The validity of the datasets directly affects the validity and reliability of the learning algorithm and of its evaluation. Thus, building a dataset is a basic and important preparatory step.
Datasets for disease gene recognition are mainly obtained from two databases: Online Mendelian Inheritance in Man (OMIM), which is a comprehensive database [
OMIM is a comprehensive, authoritative, and timely knowledge base of human genes and genetic disorders, compiled to support human genetics research and education and clinical genetics research. OMIM is updated daily and is freely accessible at
HPRD is a database of curated proteomic information pertaining to human proteins. Even though HPRD is updated relatively slowly, it is a full-scale resource for studying the relationship between human diseases and genes [
Most research on recognizing disease genes uses networks, including the disease phenotype network, the protein-protein interaction network, and the gene-disease phenotype network, among others. In this study, we introduce only the most commonly used networks.
Illustration of the networks using a specific example: (a) a gene-gene network, (b) a disease phenotype network, and (c) a gene-disease phenotype network [
In previous research, various methods, such as CIPHER, RWRH, PRINCE, Meta-path, Katz, CATAPULT, Diffusion Kernel [
CIPHER [
For a given query phenotype and a set of candidate genes, CIPHER first combines the gene interaction network, disease phenotype network, and gene-disease phenotype network into a single network. The similarity scores between the query phenotype and all known phenotypes are taken directly from the disease phenotype network. The topological distances between each candidate gene and all known disease genes in the gene interaction network are then computed and grouped by the phenotypes of those disease genes. Using a regression model, the correlation between the phenotype similarities and the gene closeness values is obtained and serves as the concordance score of each candidate gene. Finally, all candidate genes for the query phenotype are ranked by their concordance scores. Because the neighborhood can be defined in two ways, by direct neighbor or by shortest path, there are two versions of CIPHER: CIPHER-DN and CIPHER-SP [
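The ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the closeness measure (a sum of exp(-d^2) over shortest-path distances d, in the spirit of the shortest-path variant), the use of a plain Pearson correlation as the concordance score, and all gene names, phenotype names, and similarity values are assumptions made for illustration.

```python
from collections import deque
from math import exp, sqrt


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop distance from source to every reachable gene."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def closeness(adjacency, gene, disease_genes):
    """Assumed closeness: sum of exp(-d^2) over the known genes of one phenotype."""
    distance = shortest_path_lengths(adjacency, gene)
    return sum(exp(-distance[g] ** 2) for g in disease_genes if g in distance)


def pearson(x, y):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def concordance_scores(adjacency, candidates, phenotype_genes, similarity):
    """Correlate phenotype similarity with gene closeness for each candidate."""
    phenotypes = sorted(phenotype_genes)
    sim = [similarity[p] for p in phenotypes]
    scores = {}
    for gene in candidates:
        close = [closeness(adjacency, gene, phenotype_genes[p]) for p in phenotypes]
        scores[gene] = pearson(sim, close)
    return scores


# Hypothetical toy data: a five-gene path and three known phenotypes.
adjacency = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2", "g4"],
             "g4": ["g3", "g5"], "g5": ["g4"]}
phenotype_genes = {"p1": ["g1"], "p2": ["g5"], "p3": ["g3"]}
similarity = {"p1": 0.9, "p2": 0.1, "p3": 0.5}  # query phenotype vs. known ones

scores = concordance_scores(adjacency, ["g2", "g4"], phenotype_genes, similarity)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy network the candidate g2 lies next to the gene of the phenotype most similar to the query, so it is ranked ahead of g4.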
The similarity scores between the query phenotype and all phenotypes in the disease phenotype network are calculated by the following formula:
In the same way, the following vector is used to denote the closeness between the genes and the phenotypes in the disease phenotype network:
PRINCE is another network-based approach for ranking candidate disease genes for a given disease and inferring the complex associations between genes. PRINCE is based on formulating constraints on the ranking function that relate to its use of prior information and its smoothness over the network.
Before PRINCE is applied, three inputs are prepared: the gene-disease phenotype network (a set of gene-disease associations), the gene-gene interaction network (a set of gene-gene associations), and at least one query disease phenotype.
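The two constraints mentioned above, agreement with prior information and smoothness over the network, are commonly met by iterative score propagation. The sketch below is a minimal illustration under assumptions, not the published implementation: the update rule F = alpha * W' * F + (1 - alpha) * Y, the symmetric normalization of the interaction weights, the parameter name `alpha`, and the toy network are all chosen here for illustration.

```python
from math import sqrt


def normalize(weights):
    """Symmetric normalization: W'[i][j] = W[i][j] / sqrt(degree_i * degree_j)."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    return [[weights[i][j] / sqrt(degree[i] * degree[j]) if weights[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def propagate(weights, prior, alpha=0.8, tol=1e-9, max_iter=1000):
    """Iterate F = alpha * W' * F + (1 - alpha) * prior until the scores settle."""
    w = normalize(weights)
    n = len(prior)
    score = list(prior)
    for _ in range(max_iter):
        updated = [alpha * sum(w[i][j] * score[j] for j in range(n))
                   + (1 - alpha) * prior[i] for i in range(n)]
        if max(abs(a - b) for a, b in zip(updated, score)) < tol:
            return updated
        score = updated
    return score


# Hypothetical toy network: four genes on a path 0 - 1 - 2 - 3.
weights = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
prior = [1.0, 0.0, 0.0, 0.0]  # gene 0 is already linked to the query disease

scores = propagate(weights, prior)
print([round(s, 3) for s in scores])
```

A larger `alpha` gives more weight to the network neighbors and less to the prior, so scores spread further from the known disease gene.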
Suppose a gene
The above linear system can be solved exactly; for large networks, however, an iterative propagation-based algorithm is faster and is guaranteed to converge to the solution of the system. Formula (
Random walk with restart on heterogeneous network (RWRH) extends the random walk with restart algorithm to a heterogeneous network. The heterogeneous network is constructed by connecting the gene-gene interaction network and the disease phenotype network through the gene-disease phenotype relationships. In brief, the gene-disease phenotype network is the heterogeneous network. RWRH prioritizes genes and phenotypes simultaneously, an idea inspired by the coranking framework [
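As a point of reference, the single-network random walk with restart that RWRH builds on can be sketched as follows. This is a minimal illustration, not the RWRH implementation: it covers one homogeneous network only, and the function names, the restart probability of 0.7, and the toy graph are assumptions. The heterogeneous version would apply the same iteration to a transition matrix assembled from the gene network, the phenotype network, and the links between them.

```python
def column_normalize(adjacency):
    """Turn an adjacency matrix into a column-stochastic transition matrix."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    return [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p = (1 - restart) * W * p + restart * p0 until p stops changing."""
    n = len(adjacency)
    w = column_normalize(adjacency)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = list(p0)
    for _ in range(max_iter):
        updated = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n))
                   + restart * p0[i] for i in range(n)]
        if sum(abs(a - b) for a, b in zip(updated, p)) < tol:
            return updated
        p = updated
    return p


# Hypothetical toy network: five nodes on a path, with node 0 as the seed.
adjacency = [[0, 1, 0, 0, 0],
             [1, 0, 1, 0, 0],
             [0, 1, 0, 1, 0],
             [0, 0, 1, 0, 1],
             [0, 0, 0, 1, 0]]

probabilities = random_walk_with_restart(adjacency, seeds={0})
print([round(p, 4) for p in probabilities])
```

The steady-state probabilities sum to one when no node is isolated, and candidate nodes are ranked by their probability of being visited.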
A random walk is defined as an iterative transition of a walker from its current node to a randomly selected neighbor, starting at a given source node
In the above four formulations,
The initial probability of the gene-gene interaction network and phenotype network is denoted by
The Katz method has been successfully applied to link prediction in social networks, a problem closely related to predicting disease genes. The Katz approach is graph-based and finds the nodes most similar to a query node in the network [
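A minimal sketch of a truncated Katz score is given below. It assumes the usual form of the measure, a sum over walk lengths l of beta^l times the number of walks of length l between two nodes, cut off at a finite length; the damping value, the cutoff, and the four-node network are illustrative assumptions rather than settings reported in the studies reviewed here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def katz_scores(adjacency, beta=0.1, max_length=3):
    """Sum beta^l * A^l for l = 1 .. max_length (truncated Katz measure)."""
    n = len(adjacency)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adjacency]  # holds A^l, starting with l = 1
    weight = beta
    for _ in range(max_length):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = matmul(power, adjacency)
        weight *= beta
    return scores


# Hypothetical heterogeneous network: genes g0, g1, g2 (indices 0 to 2) and
# one disease d (index 3). Edges: g0-g1, g1-g2, and the known association g0-d.
adjacency = [[0, 1, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 0, 0]]

scores = katz_scores(adjacency)
print(scores[1][3], scores[2][3])  # candidate genes g1 and g2 against disease d
```

Here g1 reaches the disease by one walk of length 2 and g2 by one walk of length 3, so g1 receives the higher score; shorter connections count more because each extra step is damped by `beta`.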
An adjacency matrix
The adjacency matrix of the heterogeneous network is denoted as follows:
CATAPULT stands for combining data across species using positive-unlabeled learning techniques. It is a supervised machine learning method that uses a biased support vector machine (SVM) whose features are derived from walks in a heterogeneous gene-trait network.
For a given query disease phenotype, it is rarely possible to state that a gene is not associated with it: scholars report positive associations between genes and phenotypes, but negative associations are rarely reported. The dataset is therefore characterized by known positive associations and a large number of unlabeled gene-disease phenotype pairs, with no confirmed negatives. In the CATAPULT approach, the unlabeled pairs are treated as negative associations. Because these examples are not known to be truly negative, false positives are penalized lightly, whereas false negatives are penalized heavily.
CATAPULT uses a biased SVM to classify human gene-phenotype pairs in a single training phase. It also uses bagging to obtain an aggregate classifier from positive and unlabeled examples: in each round, it draws a random bootstrap sample of a few unlabeled examples from the set of all unlabeled examples and trains a classifier that treats the bootstrap sample as negatives alongside the positive examples. The algorithm description is shown in Algorithm
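The bagging procedure with a biased classifier can be sketched as follows. This is an illustration under stated assumptions, not the CATAPULT code: the linear SVM is trained with a simple subgradient method on a cost-weighted hinge loss, each unlabeled example is scored only by the rounds in which it was not drawn as a negative, and the costs, learning rate, number of rounds, and two-dimensional feature vectors are all hypothetical.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_biased_svm(samples, labels, costs, epochs=200, lr=0.01, reg=0.01):
    """Linear SVM via subgradient descent on a cost-weighted hinge loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y, cost in zip(samples, labels, costs):
            margin = y * (dot(w, x) + b)
            w = [wi - lr * reg * wi for wi in w]  # regularization shrinks w
            if margin < 1:  # hinge loss is active: step toward the correct side
                w = [wi + lr * cost * y * xi for wi, xi in zip(w, x)]
                b += lr * cost * y
    return w, b


def bagged_pu_scores(positives, unlabeled, rounds=20, c_pos=10.0, c_neg=1.0, seed=0):
    """Average out-of-bag decision values of classifiers trained on bootstrap negatives."""
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(rounds):
        # Bootstrap sample of unlabeled examples, treated as negatives this round.
        drawn = [rng.randrange(len(unlabeled)) for _ in positives]
        samples = list(positives) + [unlabeled[i] for i in drawn]
        labels = [1] * len(positives) + [-1] * len(drawn)
        # Errors on known positives cost more than errors on assumed negatives.
        costs = [c_pos] * len(positives) + [c_neg] * len(drawn)
        w, b = train_biased_svm(samples, labels, costs)
        for i, x in enumerate(unlabeled):
            if i not in drawn:
                totals[i] += dot(w, x) + b
                counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Hypothetical features of gene-phenotype pairs (for example, walk counts).
positives = [(2.0, 2.0), (2.5, 1.5), (1.5, 2.5)]
unlabeled = [(2.0, 2.2), (1.8, 2.0),  # resemble the known positives
             (-2.0, -2.0), (-2.5, -1.5), (-1.5, -2.5),
             (-2.0, -1.0), (-1.0, -2.0), (-3.0, -2.0)]

scores = bagged_pu_scores(positives, unlabeled)
print([round(s, 2) for s in scores])
```

Unlabeled pairs that resemble the known positives end up with the highest aggregate scores, even though some rounds treated them as negatives during training.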
Before any supervised machine learning approach is applied, features must be extracted for the gene-disease phenotype pairs. The features are derived from the paths in the heterogeneous network. For a given gene-disease phenotype pair, different kinds of walks of the same length, as well as walks of different lengths, can be used as features.
The meta-path approach relies mainly on multilabel classification, which is useful for recognizing disease genes because a single gene may cause many diseases. In this setting, the gene is an instance and the diseases are its labels. For a given instance, the space of possible label sets can be very large, growing exponentially with the number of candidate labels. A common way to address this problem is to exploit the correlations among different labels. A network is well suited to this, because the labels are represented by nodes and their correlations can be read from the links between them.
A meta-path is defined as a sequence of relations in the network. The objects in the network are linked through associations of multiple types, and these associations help exploit the correlations among different labels for multilabel classification. In recognizing disease genes, the labels of the genes are diseases, and the labels of the diseases are genes. This section summarizes the correlations among genes.
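To make the definition concrete, the sketch below follows one meta-path, gene to disease to gene, and counts how many path instances connect a starting gene to every other gene. It only illustrates the notion of a meta-path as a sequence of relations; the gene and disease names and the function name are hypothetical.

```python
def follow_meta_path(start, relations):
    """Count the path instances from start to each end node along a meta-path.

    `relations` is a list of dictionaries, one per step of the meta-path,
    each mapping a node to the nodes it is linked to by that relation.
    """
    counts = {start: 1}
    for relation in relations:
        reached = {}
        for node, n_paths in counts.items():
            for neighbor in relation.get(node, ()):
                reached[neighbor] = reached.get(neighbor, 0) + n_paths
        counts = reached
    return counts


# Hypothetical gene-disease associations and their inverse relation.
gene_to_disease = {"g1": ["d1", "d2"], "g2": ["d1"], "g3": ["d2"]}
disease_to_gene = {"d1": ["g1", "g2"], "d2": ["g1", "g3"]}

# Meta-path: gene -> disease -> gene (genes that share a disease with g1).
related = follow_meta_path("g1", [gene_to_disease, disease_to_gene])
print(related)  # {'g1': 2, 'g2': 1, 'g3': 1}
```

Genes reached by many path instances share more diseases with the starting gene, which is the kind of correlation a meta-path-based classifier can use as a relational feature.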
Given a set of meta-paths among the gene nodes acting as labels,
To perform multilabel collective classification more effectively in a heterogeneous information network, both meta-path-based label correlations and meta-path-based instance correlations are exploited simultaneously:
Several algorithms based on multilabel collective classification have been proposed. We briefly introduce one of them, called PIPL. The algorithm roughly includes the following steps.
(i) Meta-path construction: all nonredundant meta-paths for label correlations and instance correlations are extracted.
(ii) Training initialization: construction of
(iii) Iterative inference: the inference step is an iterative classification algorithm that updates the label set predictions of the testing instances and the relational features of label and instance correlations.
(i) For each meta-path
    (1) Get related labels for node
    (2) Return
(ii) For each meta-path
    (1) Get related labels for node
    (2) Return
A comprehensive comparison should be conducted among these methods. In the next several paragraphs, we will introduce some of the key comparisons for recognizing the disease genes reported so far.
Cross-validation is the most frequently used approach for evaluating these methods; most studies follow earlier work and perform leave-one-out cross-validation. Each known gene-disease phenotype association is taken as a test case, and a set of genes is assigned as the negative control for each test case. In each round of cross-validation, a disease phenotype is held out: the link between it and one of its associated genes is removed, and the gene whose link was removed is added to the test genes. The test genes are then ranked by the recognition method under evaluation. The resulting ranks can be processed in several ways, such as the enrichment score, a score threshold, precision, recall, and the receiver operating characteristic (ROC) curve.
Suppose the number of test genes is 100. If a disease gene recognition method ranks the actual disease gene first, the enrichment is 50-fold, because a random ordering would place the gene at about rank 50 on average. The enrichment score is therefore Enrichment = 50/rank.
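The enrichment formula can be written as a short function. The version below reproduces Enrichment = 50/rank for 100 test genes; generalizing the numerator to half the number of test genes is an assumption made here, not a definition taken from the studies reviewed.

```python
def enrichment(rank, n_test_genes=100):
    """Fold enrichment of the true disease gene over a random ordering.

    With 100 test genes this is 50 / rank. Using n_test_genes / 2 as the
    numerator for other set sizes is an illustrative assumption.
    """
    if not 1 <= rank <= n_test_genes:
        raise ValueError("rank must lie between 1 and the number of test genes")
    return (n_test_genes / 2) / rank


print(enrichment(1))   # 50.0: the disease gene is ranked first
print(enrichment(5))   # 10.0
print(enrichment(50))  # 1.0: no better than a random ordering
```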
ROC analysis plots the true-positive rate (TPR) against the false-positive rate (FPR) as the threshold that separates the prediction classes is varied. The TPR is the fraction of actual positive samples that are classified as positive, and the FPR is the fraction of actual negative samples that are classified as positive. To evaluate the scores of disease gene predictions, the ROC curve is interpreted as a plot of the number of disease genes above the threshold versus the number of disease genes below the threshold. The area under each ROC curve (AUC) is calculated to compare the different curves obtained by the ROC analysis.
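A minimal sketch of how an ROC curve and its area can be computed from prediction scores is shown below. It uses the standard definitions of TPR and FPR and the trapezoidal rule; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the threshold is lowered past each score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with identical scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical prediction scores; label 1 marks a true disease gene.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(round(auc, 4))
```

In this example, five of the six (disease gene, non-disease gene) pairs are ordered correctly, which matches an AUC of 5/6.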
A concordance score is calculated for each test gene. If the true disease gene ranks first by concordance score, the prediction is counted as successful, and precision is defined as the proportion of successful predictions among all predictions. Another evaluation approach is to set a threshold and to require that the highest score among all test genes in a given case is not less than the threshold. Thus,
The preceding sections described several disease gene recognition methods and evaluation methods. This section introduces the data used and the comparison results.
Figure
Setting a threshold to compare disease gene recognition methods.
RWRH is compared with CIPHER-DN and CIPHER-SP. The evaluation experiment is based on a gene network containing 34,364 interactions among 8919 genes, a phenotypic similarity matrix covering 5080 phenotype entities calculated by using MimMiner, and 1428 gene-phenotype links between 937 genes and 1216 phenotype entities. The comparison result is shown in Figure
The comparison of different methods.
ROC curve of RWR and RWRH.
Identifying disease genes is one of the foundations of medical care and has long been a goal in biology. Although traditional linkage analysis and modern high-throughput techniques often provide hundreds of candidate disease genes, identifying the true disease genes among the candidates by biological experiments is time-consuming and expensive. To address this issue, many network-based methods have been proposed to recognize disease genes. In this paper, five typical network-based algorithms, namely, CIPHER, PRINCE, RWRH, Katz, and CATAPULT, are introduced in detail.
Some novel methods have been put forward to recognize and prioritize disease genes. For instance, BRIDGE [
Biological relationships can be represented as networks, which brings forth new ideas. A network can be used to denote the associations between genes and diseases, to recognize gene-disease phenotype relationships, and to obtain a more complete understanding of the biological system. Networks have been successfully used in biology. However, combining experiments with networks raises the challenge of defining node similarity, and different definitions of node similarity may lead to different results.
With the development of biology and the emergence of large amounts of relevant data, network-based disease gene research continues to mature. New machine learning methods and technologies are expected to be applied to disease gene prediction, and further progress in disease gene recognition may open new possibilities for medical treatment.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The work was supported by the Natural Science Foundation of China (no. 61370010, no. 61202011, no. 61271346, no. 61172098, and no. 60932008) and the Ph.D. Programs Foundation of Ministry of Education of China (20120121120039).