Network Based Integrated Analysis of Phenotype-Genotype Data for Prioritization of Candidate Symptom Genes

Background. Symptoms and signs (symptoms in brief) are the essential clinical manifestations for individualized diagnosis and treatment in traditional Chinese medicine (TCM). To gain insights into the molecular mechanisms of symptoms, we develop a computational approach to identify the candidate genes of symptoms. Methods. This paper presents a network-based approach for the integrated analysis of multiple phenotype-genotype data sources and the prioritization of candidate genes for the associated symptoms. The method first calculates the similarities between symptoms and diseases based on the symptom-disease relationships retrieved from the PubMed bibliographic database. Disease-gene associations and protein-protein interactions are then used to construct a phenotype-genotype network. Finally, the PRINCE algorithm is used to rank the potential genes for the associated symptoms. Results. The proposed method produces a gene rank list with an AUC (area under the ROC curve) of 0.616 in classification. Some novel genes such as CALCA, ESR1, and MTHFR were predicted to be associated with headache; they are not recorded in the benchmark data set but have been reported in recently published literature. Conclusions. Our study demonstrates that integrating phenotype-genotype relationships into a complex network framework provides an effective approach to identifying candidate genes of symptoms.


Introduction
Traditional Chinese medicine (TCM) is an essential part of the healthcare system in China. TCM diagnosis and treatment are based on a comprehensive analysis of the clinical manifestations obtained through four main procedures: observation, listening, questioning, and pulse analysis [1]. Patients with different diseases often manifest different symptoms and signs, such as anorexia and pain, which are the evidence considered by physicians for clinical diagnosis in TCM [2].
Although symptoms play an important role in modern biomedical diagnosis and disease classification, most modern biomedical research attempts to understand the molecular mechanisms of disease phenotypes [3], including investigating the genotypes of diseases or disease categories. Likewise, in the TCM field, attempts have been made to investigate the genotypes or molecular mechanisms of the diagnosis (i.e., TCM syndrome) [4,5].
A recent study showed that metabolic biomarkers exist for clinical manifestations such as symptoms and syndromes in different types of rheumatoid arthritis (RA) [6]. However, there is no clear understanding of the underlying molecular mechanisms of symptoms or of the principles of TCM syndromes in the TCM field.
Large-scale diagnosis and phenotype-genotype association data, including both published literature and manually curated databases, have been gathered in the last decades [7]. PubMed, a publicly available biomedical bibliographic database, provides a significant resource for studying the associations between diseases and clinical manifestations [8]. Phenotype-genotype association databases such as OMIM [9] contain high-quality data on relationships between diseases and genes. In addition, large-scale molecular network data are available [10][11][12], such as protein-protein interaction data, metabolic pathway data, and gene regulation data. These provide important resources for exploring the molecular correlations of symptoms.
In this paper, we first extracted the symptom-disease relationships from PubMed bibliographic records and used cosine similarity to evaluate the association between symptoms and diseases. We then integrated the symptom-disease relationships with disease-gene associations and protein-protein interactions (PPI) to construct a new database recording the associations between symptoms and genes. Finally, we used the PRINCE algorithm to rank the potential genes of symptoms. We evaluated the predictions using a manually curated symptom-gene data set and PubMed literature searches. The evaluation suggests that the results offer medically meaningful insights.

Related Work
Network-based approaches to understanding human disease have found multiple potential biological and clinical applications [13]. Further understanding of the effects of cellular interconnectedness on disease progression can lead to the identification of disease biomarker genes and the pathways causing the associated diseases [14], which, in turn, offer effective targets for new drug development. Many human genetic diseases are caused by multiple genes. Genes that are associated with the same or similar phenotypes are likely to be functionally related. Such relations can be exploited to aid in searching for novel disease genes. Computational approaches have recently been proposed to predict associations between genes and diseases [15][16][17]. Vanunu et al. developed a network-based approach, known as the PRINCE algorithm, for predicting causal genes and protein complexes involved in a disease of interest [18]. The availability of large-scale phenotype-genotype association data such as OMIM, CTD [19], and PharmGKB [20] provides valuable resources for studying disease-gene associations.
Interest in the molecular mechanisms of symptoms has been increasing recently. The underlying molecular mechanisms of several symptoms, such as depression, pain, and high blood pressure, have been discussed previously [21][22][23]. However, until recently no work had systematically investigated the mechanisms of symptoms. Zhou et al. used a large-scale biomedical literature database to construct a symptom-based human disease network and to investigate the associations between clinical manifestations of diseases and the underlying molecular interactions [24]. Their results showed that the symptom-based similarity of diseases correlates strongly with the number of shared genetic associations and the extent to which their associated proteins interact. This indicates that symptoms have underlying molecular mechanisms that need to be further explored. In this paper, we attempt to develop a new data mining framework to explore the relationships between symptoms and genes, which may provide scientific evidence for individualized diagnosis and treatment in traditional Chinese medicine, because symptoms are the main clinical manifestations captured by TCM physicians for both diagnosis and treatment.

Phenotype-Genotype Data Integration.
In order to extract the associations between symptoms and genes, we first built symptom-disease associations based on a large body of medical literature in PubMed [25] and the Medical Subject Headings (MeSH). Using the co-occurrence of diseases and symptoms, we construct two vectors $d$ and $s$ to calculate the similarity between a symptom and a disease, in which $d$ denotes a disease vector represented by its co-occurring symptoms and $s$ denotes a symptom vector, likewise represented by its co-occurring symptoms. Given a dictionary with $n$ symptom items, both the disease and the symptom are represented by an $n$-feature vector. Based on the vectors of diseases and symptoms, we calculate the similarity between symptom and disease using the cosine correlation:
$$\cos(d, s) = \frac{\sum_{i=1}^{n} d_i s_i}{\sqrt{\sum_{i=1}^{n} d_i^{2}}\,\sqrt{\sum_{i=1}^{n} s_i^{2}}} \tag{1}$$
In this study, we integrated three publicly available disease-gene databases (OMIM, CTD, and PharmGKB) and five protein-protein interaction databases (HPRD, BioGrid, IntAct, MINT, and DIP) into a single database (Figure 1). Based on these data sets, a heterogeneous network is constructed, with nodes representing symptoms, diseases, and proteins, and links representing symptom-disease relationships, disease-gene associations, and protein-protein interactions.
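The cosine correlation between a disease vector and a symptom vector can be illustrated with a minimal Python sketch. The dictionary terms and the co-occurrence counts below are invented for illustration and are not taken from the PubMed data; the function name `cosine_similarity` is likewise an assumption of this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a vector with no co-occurrences is treated as unrelated
    return dot / (norm_u * norm_v)

# Hypothetical dictionary of n = 4 symptom terms (illustration only).
symptom_terms = ["headache", "nausea", "fever", "fatigue"]

# Hypothetical co-occurrence counts over that dictionary.
disease_vec = [120, 45, 0, 10]  # a disease, described by co-occurring symptoms
symptom_vec = [300, 60, 5, 20]  # a symptom, described by co-occurring symptoms

similarity = cosine_similarity(disease_vec, symptom_vec)
print(round(similarity, 3))
```

Because co-occurrence counts are non-negative, the resulting similarity always lies between 0 and 1, which is what allows a fixed cutoff to be applied to it later.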

Network Inference for Prioritization of Symptom Candidate Genes.
The network-based disease gene prediction approach PRINCE is used to predict the genes associated with a symptom. The PRINCE algorithm is initialized with the symptom-disease correlations, disease-gene associations, and protein-protein interactions. It uses a propagation-based algorithm [26] to infer a scoring function that estimates the strength of an association. A score is defined for each gene, reflecting the prior information about that gene with respect to the related diseases. The score is then used in combination with a PPI network to identify the proteins involved in the given symptom, as shown in Figure 2.

Computing the Prioritization Function.
The input consists of a query symptom $s$, the given symptom-disease associations (denoted by $A$), disease-gene associations ($B$), and a protein-protein interaction network $G = (V, E)$, where $V$ is a set of proteins and $E$ is a set of interactions between proteins. The goal of the algorithm is to prioritize all the proteins in $V$ with respect to $s$.
Figure 1: The integration of phenotype-genotype data. Symptom-disease associations are extracted based on the fact that the symptom and disease appear in the same bibliographic record (including title, abstract, and MeSH) of PubMed. Three disease-gene association databases (i.e., OMIM, CTD, and PharmGKB) and five human PPI databases (i.e., HPRD, BioGrid, IntAct, MINT, and DIP) are integrated in this study. The relationships among symptoms (denoted s1-s4), diseases (denoted d1-d7), and proteins (denoted p1-p14) are then extracted.
Figure 2: The approach for predicting the genes with respect to a symptom using the PRINCE algorithm. A query symptom S has varying degrees of relationship with diseases, denoted by d1-d5 (where the thickness of the lines represents the degree of correlation between the symptom and the diseases). p1-p9 comprise the protein set of a protein-protein interaction network, where interactions are denoted by lines with different thickness (confidence). PRINCE uses an iterative propagation method to assign a score to each protein. The protein with the higher score is considered to be the causal gene candidate for symptom S.
Let $F: V \to \mathbb{R}$ represent a prioritization function; $F(v)$ reflects the relevance of $v$ ($v \in V$) to $s$. Let $Y: V \to [0, 1]$ represent a prior knowledge function, where 1 is assigned to a protein that is known to be related to a disease associated with $s$, and 0 otherwise. In other words, $Y$ is the vector of genes which are known to be causal genes of the diseases related to the symptom. To obtain $Y$, we first analyzed the distribution of similarity between symptoms and diseases and found that a symptom has a high probability of being related to a disease when their similarity is above 0.1. Because we want to choose the diseases most likely to be associated with a symptom, so that their genes can be used to build $Y$, the top 10% of the disease-symptom relationships with similarities larger than 0.1 are chosen (in our experiment the resulting threshold is 0.57). Finally, we selected the ten most related diseases as the diseases corresponding to the symptom and used their causal genes to build $Y$.
By iterative procedures, the information is transferred between neighbors, as defined by
$$F^{t} := \alpha W' F^{t-1} + (1 - \alpha) Y,$$
where $F^{1} := Y$, $W'$ is a $|V| \times |V|$ matrix which is a normalized form of $W$ (described below), and $F$ and $Y$ are viewed here as vectors of size $|V|$. The details on the inference of $F$ in the PRINCE algorithm can be found in [18]. The parameter $\alpha \in (0, 1)$ weighs the relative importance of the two constraints, smoothness over the network and consistency with the prior knowledge, with respect to one another. Here $\alpha$ is set to 0.9, since the PRINCE study suggests that values of $\alpha$ above 0.5 give fast convergence and that 0.9 yields comparatively the highest performance [18].
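The iterative propagation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the PRINCE implementation of [18]: the names `F`, `Y`, `W`, and `alpha` are labels chosen for this sketch, the normalization is assumed to be the symmetric form $D^{-1/2} W D^{-1/2}$ (with $D$ the diagonal matrix of row sums), and the five-protein chain network is invented.

```python
import math

def normalize(W):
    """Return D^(-1/2) W D^(-1/2), where D holds the row sums of W."""
    degree = [sum(row) for row in W]
    n = len(W)
    return [
        [W[i][j] / math.sqrt(degree[i] * degree[j]) if W[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]

def propagate(W, Y, alpha=0.9, tol=1e-9, max_iter=1000):
    """Iterate F <- alpha * W' F + (1 - alpha) * Y, starting from F = Y."""
    W_norm = normalize(W)
    n = len(Y)
    F = list(Y)
    for _ in range(max_iter):
        F_next = [
            alpha * sum(W_norm[i][j] * F[j] for j in range(n))
            + (1 - alpha) * Y[i]
            for i in range(n)
        ]
        # Stop once no score changes by more than the tolerance.
        if max(abs(a - b) for a, b in zip(F_next, F)) < tol:
            return F_next
        F = F_next
    return F

# Hypothetical toy PPI chain p0 - p1 - p2 - p3 - p4; each interaction scored 1.
W = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
# Prior: only p0 is a known causal gene of a disease related to the symptom.
Y = [1.0, 0.0, 0.0, 0.0, 0.0]

F = propagate(W, Y, alpha=0.9)
ranking = sorted(range(len(F)), key=lambda i: F[i], reverse=True)
print(ranking)
```

In this toy chain the score decays with distance from the seed protein, so proteins closer to known causal genes rank higher. That is the intuition behind using propagation for prioritization.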

Evaluation Methods.
We use the Human Phenotype Ontology (HPO) [27] as the benchmark data to evaluate the results. HPO was manually curated from OMIM records and constructed with the goal of covering all phenotypic abnormalities that are commonly encountered in human monogenic diseases [28]. In this study we use the T184 (Sign or Symptom) semantic type of UMLS [29] to filter the phenotype terms and construct a subset of HPO phenotypes (349 records). Restricting the phenotype-genotype associations to these symptoms results in 7,262 symptom-gene records and 1,275 related genes. Because HPO uses different symptom terms from MeSH, we used UMLS to map HPO symptom terms to MeSH. We finally obtained 3,418 symptom-gene records with 139 symptoms and 937 genes, which were used for evaluation. Although HPO contains high-quality data on phenotype ontology and genotype-phenotype associations (mainly on diseases and disorders), the data are rather incomplete and still lack many well-known symptom-gene associations. We evaluated the symptom-gene prediction results by three approaches: (1) comparing our rank list with the genes in HPO and calculating recall and AUC [30], (2) comparing our result with the random case, and (3) evaluating randomly chosen results against recently published literature.
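Recall and AUC for a ranked gene list can be computed against a benchmark gene set as in the following minimal Python sketch. The gene labels are placeholders, the function names are assumptions of this sketch, and the AUC is computed in its rank-based form (the fraction of benchmark/non-benchmark gene pairs that are ordered correctly), assuming no tied scores. The paper does not state which AUC implementation was used.

```python
def recall_at_k(ranked_genes, benchmark, k):
    """Fraction of benchmark genes recovered among the top k ranked genes."""
    hits = sum(1 for gene in ranked_genes[:k] if gene in benchmark)
    return hits / len(benchmark)

def auc_from_ranking(ranked_genes, benchmark):
    """Rank-based AUC: share of (benchmark, other) pairs ordered correctly."""
    n_pos = sum(1 for gene in ranked_genes if gene in benchmark)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both benchmark and non-benchmark genes")
    pos_seen = 0
    correct_pairs = 0
    for gene in ranked_genes:  # best-scored gene first
        if gene in benchmark:
            pos_seen += 1
        else:
            # Every benchmark gene already seen outranks this gene.
            correct_pairs += pos_seen
    return correct_pairs / (n_pos * n_neg)

# Hypothetical ranked list and benchmark set (placeholder gene labels).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
benchmark_genes = {"g1", "g4"}

print(recall_at_k(ranked, benchmark_genes, k=3))
print(auc_from_ranking(ranked, benchmark_genes))
```

An AUC of 0.5 corresponds to a random ordering and 1.0 to a ranking that places every benchmark gene above every other gene.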

Results
We extracted 125,226 symptom-disease associations with 322 symptoms and 4,219 diseases from PubMed bibliographic records and calculated the cosine similarity between symptoms and diseases. We compiled 94,536 protein-protein interactions among 14,221 proteins and integrated 28,336 disease-gene associations (shown in Table 1). Each protein pair was assigned a score of 1 if the two proteins interact. We used these scores to construct the adjacency matrix $W$. As a result, we obtained a total of 4,211,956 symptom-gene associations between 290 symptoms and 14,221 genes with correlation values greater than zero. The distribution of correlations between symptoms and genes is depicted in Figure 3. Note that 83% of the correlations are below 0.001, and only about 0.24% are greater than 0.01. Genes with correlation scores greater than 0.01 are therefore more likely to be related to a symptom than most genes (i.e., the 83% of genes with scores below 0.001), and we consider them the potential symptom-related genes in this study. Using the HPO benchmark data, we quantify the accuracy of the prediction by comparing the predicted gene list of symptoms with that of the benchmark data. The area under the ROC curve (AUC) of the proposed method is 0.616 (Figure 4).
In order to evaluate the effectiveness of the gene ranking, we also compared the result with the random prediction case. We counted the genes contained in HPO at the top of our gene list ( < 0.05) and compared this count with the average count obtained when the same number of genes is selected at random. The number of true positive candidate genes is 10-fold that of the random prediction, with the best case being 249-fold. We take the symptom Muscle Cramp as an example. Of the 27 genes given in HPO, 10 are included in the top 251 genes ( < 0.05) of our candidate gene list. If 251 genes are chosen at random among all 14,221 genes, the probability that a given gene is a causal gene is 0.0018986 (27/14,221, under the assumption that all the genes in HPO are causal genes). The expected number of HPO genes is then 0.477 (0.0018986 × 251); that is, on average 0.477 true causal genes from the HPO gene list are found when 251 genes are randomly selected. So the number of true positive candidate genes is approximately 21-fold (10/0.477) that of the random prediction.
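The Muscle Cramp comparison is plain arithmetic and can be reproduced with a short Python sketch. The function name `fold_over_random` and its parameter names are chosen here for illustration; the four input numbers are the ones reported in the text.

```python
def fold_over_random(hits_in_top, top_k, n_benchmark, n_total):
    """Compare observed benchmark hits with the count expected at random."""
    # Chance that one randomly drawn gene is a benchmark (HPO) gene.
    p_single = n_benchmark / n_total
    # Expected number of benchmark genes among top_k randomly drawn genes.
    expected_hits = p_single * top_k
    return p_single, expected_hits, hits_in_top / expected_hits

# Muscle Cramp: 27 HPO genes, 14,221 genes overall,
# 10 HPO genes observed among the top 251 candidates.
p_single, expected_hits, fold = fold_over_random(
    hits_in_top=10, top_k=251, n_benchmark=27, n_total=14221
)
print(round(p_single, 7), round(expected_hits, 3), round(fold, 1))
```

The same function can be applied to any other symptom by substituting its own benchmark size, cutoff, and hit count.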
To demonstrate the effectiveness of this method, we list the suggested genes for headache and hemiplegia as examples. By analyzing the distribution of the scores of symptom-related genes, we found that most scores (95% on average) are very low (i.e., on the order of 0.01), with a few genes scoring much higher than these low values. Table 2 shows the 46 top-ranked genes, out of 13,966 genes, whose correlation scores with respect to headache are greater than 0.01. Among them, TNF and EDNRA are listed in HPO as causal genes for headache.
We were aware that HPO is an incomplete database. To evaluate the prediction result more comprehensively, we manually searched the literature in PubMed for symptom-gene associations. Among the top 10 genes of our list, we found that five additional genes, CALCA, TGFBR2, ESR1, KCNK18, and MTHFR (bold font in Table 2), are all considered to be related to headache in recently published literature [31][32][33][34], although they are not recorded in the HPO database. In total, we thus identified 7 possible causal genes of headache (CALCA, TGFBR2, TNF, ESR1, EDNRA, KCNK18, and MTHFR) among the top 10 genes.
The relationship between symptoms and diseases is complicated. Some symptoms are manifested more specifically in certain diseases than in others. This kind of clinical association is likely to have underlying molecular mechanisms. To explore the interactions of the related genes of symptoms and diseases in the context of the PPI network, we show a subset of protein-protein interactions with respect to headache in Figure 5, which is constructed from the genes directly connected with 6 diseases related to headache. In Figure 5, genes connected with the same disease are marked in the same color. We found that 15 of the 32 HPO genes in our subnetwork (marked in boxes) are causal genes of the diseases or are located on the shortest paths between them. It is possible that the causal genes of a disease that has the symptom as a particular phenotype are also related genes for the symptom (marked in pink boxes), or that the candidate genes for the symptom are located on the shortest paths between the genes of diseases that have the related symptom as a general phenotype (marked in red boxes). To give a clearer view of the relationships between the candidate genes of symptoms and the causal genes of the diseases that have the corresponding symptoms as particular manifestations, we also constructed a network showing the direct relationships among the causal genes of diseases related to headache and the genes in HPO (Figure 6; genes in HPO are marked in red, and genes connected with different diseases are marked with different colors). The genes CALCA, TGFBR2, TNF, ESR1, EDNRA, MTHFR, and so forth, from our top 10 rank list (mentioned above), are underlined. We found that the high-scoring candidate genes for headache are causal genes of diseases, such as migraine, for which headache is a distinctive symptom. It is possible that the causal genes of diseases with distinctive symptoms are also related to those symptoms.
Table 3 lists the 83 top-ranked genes with respect to hemiplegia with correlation greater than 0.01. Of the causal genes of hemiplegia in HPO, four genes, namely, COL4A1, CACNA1A, ATP1A2, and SCN1A, are found among the top 83 candidate genes (recall is 66.7%), whereas the gene DOCK8 is ranked 6667th in the whole list of 14,221 genes. However, after manually searching the PubMed literature, we found no publications indicating a relationship between hemiplegia and the other 8 genes of the top 10 (that is, excluding the 2 genes included in HPO).
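Whether a candidate gene lies on a shortest path between two disease genes in a PPI network can be checked with breadth-first search, as in the minimal Python sketch below. This illustrates the idea only and is not the procedure used to build Figure 5. The protein labels and edges are invented, the network is assumed to be unweighted and undirected, and the function names are assumptions of this sketch.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Hop distance from source to every reachable node (unweighted graph)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

def on_shortest_path(adjacency, gene_a, gene_b, candidate):
    """True if candidate lies on at least one shortest path from a to b."""
    from_a = bfs_distances(adjacency, gene_a)
    from_b = bfs_distances(adjacency, gene_b)
    if gene_b not in from_a or candidate not in from_a or candidate not in from_b:
        return False  # some node is unreachable
    return from_a[candidate] + from_b[candidate] == from_a[gene_b]

def build_adjacency(edges):
    """Undirected adjacency lists from a list of protein pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

# Hypothetical PPI edges: p1 - p2 - p3 - p4, with p5 hanging off p2.
ppi = build_adjacency([("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p2", "p5")])

print(on_shortest_path(ppi, "p1", "p4", "p3"))
print(on_shortest_path(ppi, "p1", "p4", "p5"))
```

The distance test works because a node is on some shortest path between two endpoints exactly when its distances to both endpoints add up to the distance between them.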

Discussion
As established clinical manifestations in TCM clinical practice, symptoms provide key information for classifying the state of human disease and for personalized herbal treatment. Symptoms are essentially objective, although the observation and description of symptoms incorporate subjective factors like human sense and language. Therefore, investigating the underlying molecular mechanisms of symptoms is more feasible than investigating those of TCM syndromes.
BioMed Research International
Through integrating disease-symptom associations and multiple phenotype-genotype data sources, this paper proposes a network inference method to predict the candidate gene list for symptoms. Like similar work on disease gene prediction [35,36], the rank list of symptom-related candidate genes can promote the discovery of the molecular mechanisms of symptoms and thereby help draw the picture of the connections between symptoms and genes with respect to diseases. Evaluation shows the effectiveness of the method in identifying genes related to symptoms. As with the predicted genes of headache, more predicted genes could be further investigated for their medical relevance, which would ultimately help researchers confirm the causal genes of symptoms in laboratory studies.
It should be noted that this paper is intended to introduce the proposed integrated network framework for predicting symptom candidate genes. Several aspects of the method could be improved in future work. Firstly, a carefully curated and evaluated database needs to be established as a benchmark data set. Although HPO currently provides a starting point, more effort is needed to obtain high-quality symptom-gene databases. Once such a database is curated, it would offer a reliable benchmark for evaluation and possible supervision for machine learning methods.
In addition, because of the complicated confounders involved in detecting symptom-disease relations from the biomedical literature, a comprehensive database of disease-symptom relationships would also be very helpful. Secondly, because the similarities between diseases and symptoms indicate different degrees of correlation, they could be systematically utilized to improve the iterative computing procedures of random-walk-related network inference methods. Thirdly, it is highly valuable to investigate the molecular correlations between symptoms and diseases in order to detect the molecular patterns connecting these two phenotype entities. Once network characteristics underlying this connection are discovered, they would provide a guiding framework for the development of symptom-gene prediction methods.