Predicting Metabolite-Disease Associations Based on Linear Neighborhood Similarity with Improved Bipartite Network Projection Algorithm

In recent years, a large number of clinical observations have shown that metabolites are involved in a variety of important human diseases. Nonetheless, the inherent noise and incompleteness of existing biological datasets limit the prediction accuracy of current computational methods. To address this problem, we propose a prediction method, IBNPLNSMDA, which uses an improved bipartite network projection method to predict latent metabolite-disease associations based on linear neighborhood similarity. Specifically, the linear neighborhood similarity matrix of metabolites (diseases) is reconstructed from a new feature matrix obtained from the known metabolite-disease associations and the relevant integrated similarities. The improved bipartite network projection method is then adopted to infer potential associations between metabolites and diseases. IBNPLNSMDA achieves reliable performance in LOOCV (AUC of 0.9634), outperforming the compared methods. In addition, case studies of four common human diseases confirm the utility of our method in discovering latent metabolite-disease pairs. Thus, we believe that IBNPLNSMDA could serve as a reliable computational tool for metabolite-disease association prediction.


Introduction
Metabolites are the final products of cellular regulatory processes, and their levels can be regarded as the ultimate response of biological systems to genetic or environmental changes; they therefore have significant effects on the human body [1]. Meanwhile, with rapidly developing biomedical instruments and analytical platforms, disease research increasingly looks for effects at the molecular level [2,3], and metabolic processes disrupted by a disease state are widely identified as disease signatures [4].
Although many metabolite signatures of diseases have gradually been identified by high-throughput metabolomics technologies [4,5], a large number of metabolite-disease associations remain unconfirmed. Furthermore, conventional biological experiments are inefficient at producing useful results because of constraints on time, funding, and accuracy. Thus, developing computational methods that efficiently and reliably uncover potential metabolite-disease associations is significant for human health and medical progress, and it also reduces the time and labor that experiments require. RWRMDA [6] is the first method to explore latent associations between metabolites and diseases, and it advanced the development of computational methods in metabolomics. However, it does not consider disease similarity when calculating the final prediction results. Although KATZMDA [7] considers two similarities, it integrates relatively little similarity information, which is a disadvantage. Additionally, similarity measures based on the biological characteristics of diseases or metabolites, such as functional similarity or semantic similarity, have been widely used in other fields of bioinformatics. However, such biological characteristics are unavailable for some disease or metabolite pairs. Other similarity measures, such as Gaussian interaction profile kernel similarity [8] or cosine similarity [9], which are based on pairwise topological similarities between diseases or metabolites, are not robust enough, as noted in [10].
In this paper, we put forward an improved bipartite network projection method based on linear neighborhood similarity for predicting unconfirmed metabolite-disease associations (IBNPLNSMDA) (see Figure 1). Firstly, a new feature matrix is obtained by applying WKNKN and integrating metabolite (disease) similarities in order to make full use of the existing data. Secondly, the relevant linear neighborhood similarity is constructed from the new feature matrix by reconstructing data points from their neighbors. Thirdly, the improved bipartite network projection algorithm is used to predict potential associations by combining the linear neighborhood similarity of metabolites (diseases) with the known metabolite-disease associations, which supports the accuracy of the predictions. Finally, IBNPLNSMDA obtains an AUC value of 0.9634 in LOOCV, which outperforms the other methods. In addition, case studies on four diseases demonstrate the reliability and feasibility of IBNPLNSMDA.

Human Metabolite-Disease Associations.
Firstly, the known human metabolite-disease associations are extracted from the Human Metabolome Database (HMDB). Because disease semantic similarity and disease functional similarity must be calculated, data from Disease Ontology [11] and DisGeNET [12] (http://www.disgenet.org/web/DisGeNET/menu) are also required. Secondly, we select the diseases that have a DOID according to Disease Ontology. Then, the diseases common to DisGeNET and the diseases selected in the previous step become the final disease set. Finally, the known human metabolite-disease network (see Figure 2) is constructed from the final disease set [13]; it contains 3589 distinct experimentally confirmed human metabolite-disease associations between 2121 metabolites and 130 diseases. Based on these associations, we construct an nd × nm adjacency matrix M, where nd and nm denote the numbers of diseases and metabolites, respectively. If a disease d(i) has been experimentally verified to be associated with a metabolite m(j), then M(i,j) equals 1; otherwise it equals 0.
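As a minimal sketch of how the adjacency matrix M could be assembled from a list of confirmed pairs, assuming the pairs are available as (disease, metabolite) tuples; the function name `build_adjacency` and the toy identifiers are hypothetical and not taken from HMDB:

```python
import numpy as np

def build_adjacency(pairs, diseases, metabolites):
    """Return the nd x nm 0/1 matrix M with M[i, j] = 1 for each known pair."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(metabolites)}
    M = np.zeros((len(diseases), len(metabolites)), dtype=int)
    for disease, metabolite in pairs:
        M[d_index[disease], m_index[metabolite]] = 1
    return M

# Toy data with hypothetical names (rows are diseases, columns are metabolites).
diseases = ["d1", "d2"]
metabolites = ["m1", "m2", "m3"]
pairs = [("d1", "m1"), ("d1", "m3"), ("d2", "m2")]
M = build_adjacency(pairs, diseases, metabolites)
```

With the real data, M would have 130 rows and 2121 columns and 3589 nonzero entries, as stated above.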

Diseases Functional Similarity.
Disease functional similarity rests on the assumption that the more related genes two diseases share, the more similar the two diseases are. From DisGeNET, we extract the associations between diseases and their relevant genes. Then, we construct an adjacency matrix GD in which rows represent diseases and columns represent genes, and we use the cosine similarity measure to calculate disease similarity as the cosine of the angle between the gene-association vectors of the two diseases [9], where the vector of disease q denotes the associations of disease q with all the genes.
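The cosine measure described here can be sketched as follows. This is an illustration only: the names `cosine_similarity_rows` and `gene_matrix` are hypothetical, the toy matrix is not real DisGeNET data, and assigning zero similarity to a disease with no associated genes is an assumption.

```python
import numpy as np

def cosine_similarity_rows(A):
    """Pairwise cosine similarity between the rows of A."""
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=1)
    # Assumption: rows with no associations get similarity 0 with everything.
    safe = np.where(norms == 0, 1.0, norms)
    unit = A / safe[:, None]
    return unit @ unit.T

# Toy disease-gene matrix: rows are diseases, columns are genes.
gene_matrix = np.array([[1, 1, 0],
                        [1, 0, 0],
                        [0, 0, 0]])
DFS = cosine_similarity_rows(gene_matrix)
```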

Diseases Semantic Similarity.
A disease can be described as a directed acyclic graph (DAG) according to the MeSH database [14,15]. Taking disease D as an example, we use DAG(D) = (D, T(D), E(D)) to represent it, where T(D) is the node set consisting of disease D and its ancestor nodes, and E(D) is the edge set containing the direct edges from parent nodes to child nodes. The semantic value of disease D is then computed from the contributions of the nodes in T(D) [16], where Δ is the layer contribution factor, which is set to 0.5 in this study as in previous literature [17]. Diseases located in the same layer contribute the same semantic value to disease D, and the contribution of a disease decreases by a factor Δ for each additional layer between it and D. Two diseases that share a larger part of their DAGs are considered more similar. Thus, we define the semantic similarity between d_i and d_j in terms of the nodes shared by their two DAGs.
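A minimal sketch of this DAG-based measure is given below. It assumes the formulation commonly used with this layer contribution factor: a disease contributes 1 to itself, each ancestor contributes Δ times the largest contribution among its children, and the similarity is the summed contributions of shared nodes divided by the sum of the two semantic values. The function names and the toy DAG are hypothetical.

```python
from collections import deque

def semantic_contributions(disease, parents, delta=0.5):
    """Map every node of DAG(disease) to its contribution to `disease`."""
    contrib = {disease: 1.0}
    queue = deque([disease])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution reaching this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                queue.append(parent)
    return contrib

def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared-node contributions divided by the two semantic values."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = c_i.keys() & c_j.keys()
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))

# Toy DAG given as child -> list of parents (hypothetical terms).
parents = {"A": ["root"], "B": ["root"], "C": ["A"]}
sim_ab = semantic_similarity("A", "B", parents)
sim_ac = semantic_similarity("A", "C", parents)
sim_aa = semantic_similarity("A", "A", parents)
```

In the toy DAG, A and C share both A and the root, so they come out more similar than the siblings A and B, which share only the root.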

Metabolite Functional Similarity.
The metabolite functional similarity rests on the idea that two functionally similar metabolites tend to be associated with similar diseases. Assume that DT_A and DT_B represent the groups of diseases associated with metabolites m_A and m_B, respectively. Firstly, the similarity between a disease dt and a disease group DT = {dt_1, dt_2, ..., dt_k} is defined as the maximum semantic similarity between dt and the diseases in DT [6]. Then, the functional similarity between metabolites m_A and m_B, denoted MFS(m_A, m_B), is calculated from these disease-to-group similarities between DT_A and DT_B. Two metabolites are connected if their similarity score is greater than 0, and the score is used as the edge weight in the metabolite functional similarity network.

Gaussian Interaction Profile Kernel Similarity for Metabolites and Diseases.
Because the adjacency matrix M contains the information about the associations between diseases and metabolites, the interaction profiles IP(d_i) and IP(m_j) are the ith row and the jth column of M, respectively. The Gaussian interaction profile kernel similarity of diseases and metabolites is computed from these profiles [15], where the parameters ω_d and ω_m control the kernel bandwidth. They are obtained by normalizing the original bandwidths ω_d′ and ω_m′ (both set to 1 based on previous work). Two matrices, GD and GM, are thus obtained, in which the entry GD(d_i, d_j) represents the Gaussian interaction profile kernel similarity between diseases d_i and d_j, and GM(m_i, m_j) represents the Gaussian interaction profile kernel similarity between metabolites m_i and m_j.
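The Gaussian interaction profile kernel can be sketched as below. The sketch assumes the usual form of this kernel, exp(-ω‖IP(a) - IP(b)‖²), with ω obtained by dividing the original bandwidth by the mean squared norm of the profiles; the function name `gip_kernel` and the toy matrix are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, bandwidth_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    P = np.asarray(profiles, dtype=float)
    sq_norms = (P ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    if mean_sq == 0:
        raise ValueError("all interaction profiles are empty")
    omega = bandwidth_prime / mean_sq          # normalized bandwidth
    # Squared Euclidean distances between all pairs of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (P @ P.T)
    sq_dist = np.maximum(sq_dist, 0.0)         # guard against round-off
    return np.exp(-omega * sq_dist)

# Toy adjacency matrix: rows are diseases, columns are metabolites.
M = np.array([[1, 0, 1],
              [0, 1, 0]])
GD = gip_kernel(M)      # disease profiles are the rows of M
GM = gip_kernel(M.T)    # metabolite profiles are the columns of M
```

Passing M gives the disease kernel and passing its transpose gives the metabolite kernel, so a single function covers both matrices.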

Integrated Similarity for Metabolites and Diseases.
We denote the final similarity matrix of metabolites as SM ∈ R^(nm × nm), which is constructed from the matrices MFS and GM. Likewise, the final similarity matrix of diseases is denoted SD ∈ R^(nd × nd), which is constructed from the matrices DFS, DSS, and GD. By combining the relevant matrices, the integrated similarity for metabolites and diseases is defined as follows:

Features of Diseases and Metabolites.
In this section, we use the interaction profile vectors IP of diseases and metabolites from the adjacency matrix M as the initial feature vectors [18-20]. However, most associations between diseases and metabolites have not been verified, which makes these features very sparse. To solve this problem, WKNKN [21] is used as a preprocessing procedure to infer an interaction likelihood score for the latent pairs based on their known neighbors. Specifically, WKNKN replaces each M(i,j) = 0 with an interaction likelihood value in three steps (Algorithm 1): (1) Taking metabolite m_i as an example, we obtain the K known metabolites nearest to m_i according to the integrated similarity matrix SM and use their interaction profiles to estimate the interaction likelihood profile of m_i. (2) As in step (1), the interaction likelihood profile of disease d_j is calculated according to SD. (3) The two likelihood profiles are combined to replace the zero entries of M, which yields the new feature matrix MF.

Linear Neighborhood Similarity.
Roweis et al. [22] showed that a data point and its neighbors lie on or close to a locally linear patch of the manifold, and Wang et al. [10] found that each point can be optimally reconstructed from its neighbors. In addition, Zhang et al. [23,24] applied linear neighborhood similarity in bioinformatics and achieved better prediction performance. Based on these studies [6,23,25], we reconstruct the pairwise similarities of metabolites (diseases). Taking diseases as an example, let X_i denote the feature vector of the ith disease in MF. We minimize the error ε_i of reconstructing X_i as a weighted combination of its neighbors, where N(X_i) is the set of the k nearest neighbors of X_i (k is a free parameter) determined by Euclidean distance, X_ij is the jth neighbor of X_i, and w_i,ij denotes the contribution of X_ij to the reconstruction of X_i, which can be regarded as their similarity.
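For reference, the reconstruction objective described above is usually written in the following form in the linear neighborhood literature cited here [10,23]. The constraint set (weights that are nonnegative and sum to one) is an assumption taken from that standard formulation, and the index notation is chosen for this sketch:

```latex
\min_{w_i}\ \varepsilon_i
  = \Bigl\| X_i - \sum_{j:\,X_{i_j} \in N(X_i)} w_{i,i_j}\, X_{i_j} \Bigr\|^{2}
\quad \text{s.t.} \quad
\sum_{j:\,X_{i_j} \in N(X_i)} w_{i,i_j} = 1,
\qquad w_{i,i_j} \ge 0 .
```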
Then, ε_i can be rewritten as a quadratic form in the weights. A Tikhonov regularization term, which penalizes the norm of the reconstruction weight vector w_i, is added to the objective function to avoid overfitting, where w_i = (w_i1, w_i2, ..., w_ik) and α is the penalty parameter of the regularization term, which is set to 1 for simplicity. Standard quadratic programming is used to solve (16), and the solution gives the reconstruction weights of X_i. After the weights of every disease feature vector in MF have been calculated, we obtain a weight matrix W of dimension nd × nd, which is treated as the disease linear neighborhood similarity (SD*). Similarly, when the metabolite feature vectors in MF are used as input, we obtain a weight matrix W of dimension nm × nm, which is treated as the metabolite linear neighborhood similarity (SM*).
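The regularized reconstruction step can be sketched as follows. This sketch assumes the weights are constrained to be nonnegative and to sum to one, and it swaps the quadratic programming solver mentioned above for projected gradient descent so that only NumPy is needed; the names `project_to_simplex` and `lns_weights` and the iteration count are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def lns_weights(X, k, alpha=1.0, n_iter=500):
    """Linear neighborhood weight matrix for the rows of X (alpha > 0)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of rows")
    W = np.zeros((n, n))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                     # a point is not its own neighbor
        nbrs = np.argsort(dist)[:k]          # k nearest neighbors (Euclidean)
        diff = X[i] - X[nbrs]
        G = diff @ diff.T + alpha * np.eye(k)   # Gram matrix plus Tikhonov term
        step = 1.0 / (2.0 * np.linalg.eigvalsh(G)[-1])
        w = np.full(k, 1.0 / k)
        for _ in range(n_iter):
            # Gradient of w^T G w is 2 G w; project back onto the simplex.
            w = project_to_simplex(w - step * 2.0 * (G @ w))
        W[i, nbrs] = w
    return W

rng = np.random.default_rng(0)
X = rng.random((6, 4))          # six toy feature vectors
W = lns_weights(X, k=3)
```

Each row of W holds the weights of one point over its k neighbors, so W is generally not symmetric.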

Improved Bipartite Network Projection Recommendation Algorithm.
The baseline bipartite network projection recommendation algorithm [26] is a two-round resource transfer process that uses only the information in the known metabolite-disease matrix M and does not consider the weights of the relevant similarities. However, resource allocation for each metabolite (disease) is biased toward specific diseases (metabolites) and toward similar metabolites (diseases). Inspired by the idea that a potential metabolite (disease) can be predicted from related similar metabolites (diseases) [27-29], we take the similarity weights of metabolites and diseases into account when the resources are allocated, through two weighted matrices W_m ∈ R^(nd × nm) and W_d ∈ R^(nd × nm), which correspond to the different similarities. Here nm is the number of metabolites, nd is the number of diseases, and M is the adjacency matrix defined above. Each metabolite is then assigned a score after a two-round resource distribution. In the first round, the initial resource on MM = {m_1, m_2, ..., m_nm} flows to D = {d_1, d_2, ..., d_nd} [30] according to W_m and W_d, and each node d_j of D gains a share of this resource. Then, all the resources located on the D nodes return to MM through W_m and W_d [28], which gives the final resource located on node m_i. In these expressions, k_m(m_i) is the sum of the ith row of the weighted matrix W_m, and k_d(d_l) is the sum of the lth row of the weighted matrix W_d. R_m(m_i) is the initial resource located on m_i based on the weighted matrix W_m, and R_d(m_i) is the initial resource located on m_i based on the weighted matrix W_d. For simplicity, the damping factors c and μ, which balance the scores obtained from W_m and W_d, are both set to 0.5.
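The two-round resource transfer can be illustrated with the sketch below. It is not the exact weighted formulation of this section: the normalization by column and row sums follows the baseline projection scheme, the way W_m and W_d would be built from the similarities is left open, and the names `two_round_transfer` and `predict_scores` are hypothetical. In the toy run both weight matrices are set to M, which reduces the sketch to the baseline method.

```python
import numpy as np

def two_round_transfer(W, initial):
    """Send resource from metabolites to diseases and back through W (nd x nm)."""
    W = np.asarray(W, dtype=float)
    col = W.sum(axis=0)                      # total weight of each metabolite
    row = W.sum(axis=1)                      # total weight of each disease
    col_safe = np.where(col == 0, 1.0, col)
    row_safe = np.where(row == 0, 1.0, row)
    on_diseases = W @ (initial / col_safe)   # round 1: metabolites -> diseases
    return W.T @ (on_diseases / row_safe)    # round 2: diseases -> metabolites

def predict_scores(M, W_m, W_d, c=0.5, mu=0.5):
    """Score every metabolite for every disease (assumed combination rule)."""
    M = np.asarray(M, dtype=float)
    scores = np.zeros_like(M)
    for d in range(M.shape[0]):
        initial = M[d]   # known metabolites of disease d hold one unit each
        scores[d] = (c * two_round_transfer(W_m, initial)
                     + mu * two_round_transfer(W_d, initial))
    return scores

M = np.array([[1, 0, 1],
              [0, 1, 1]])
scores = predict_scores(M, W_m=M, W_d=M)
```

With c + μ = 1 and no empty rows or columns, the total resource of each disease is conserved across the two rounds, which gives a quick sanity check on any choice of weights.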

Results
Leave-one-out cross validation (LOOCV) is used to evaluate the prediction accuracy of IBNPLNSMDA. Each known metabolite-disease association is selected in turn as the test sample, and the remaining associations are used as training samples. All metabolite-disease pairs without confirmed associations are considered negative samples, while the positive samples consist of the known associations. We then rank each test sample against all metabolite-disease pairs without known associations according to the predicted scores. Test samples ranked above a given threshold are regarded as successful predictions. Two measures are used to evaluate performance: AUC, the area under the receiver operating characteristic (ROC) curve, which plots the true-positive rate (TPR) against the false-positive rate (FPR), and AUPR, the area under the precision-recall (PR) curve. In LOOCV, IBNPLNSMDA obtains an AUC value of 0.9634 and an AUPR value of 0.4971, which indicates that IBNPLNSMDA has satisfactory prediction performance.
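The LOOCV ranking procedure can be sketched as follows. The sketch summarizes the ranks as the average fraction of unknown pairs scored below each held-out association, which is a rank-based estimate of the AUC and an assumption about how the ROC curve is aggregated; `loocv_auc` and the predictor argument are hypothetical names, and any scoring function that maps a training matrix to a score matrix can be plugged in.

```python
import numpy as np

def loocv_auc(M, predict):
    """Rank-based AUC estimate from leave-one-out over known associations."""
    M = np.asarray(M, dtype=float)
    negatives = (M == 0)                      # pairs without known association
    fractions = []
    for i, j in zip(*np.nonzero(M)):
        train = M.copy()
        train[i, j] = 0.0                     # hide the test association
        scores = predict(train)
        neg_scores = scores[negatives]
        below = (neg_scores < scores[i, j]).sum()
        ties = (neg_scores == scores[i, j]).sum()
        fractions.append((below + 0.5 * ties) / neg_scores.size)
    return float(np.mean(fractions))

M = np.array([[1, 0, 1],
              [0, 1, 0]])
# A predictor that scores every pair identically carries no information.
auc_constant = loocv_auc(M, lambda train: np.zeros_like(train))
```

A constant predictor yields 0.5 under this estimate, which matches the usual reading of AUC as the probability that a positive is ranked above a negative.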

Comparison.
In this section, we explore the influence of the main components on the accuracy of our method and compare it with other methods, such as RWRMDA and KATZMDA, on the same known metabolite-disease association data. Firstly, we compare the baseline bipartite network projection method (BBNP), which uses only the information of known metabolite-disease pairs, and RWRMDA [6], which uses the random walk model and, as noted in the Introduction, does not consider disease similarity. Figures 3 and 4 indicate that our method performs better than the compared methods. The reason for the higher predictive performance is that we use WKNKN to build the new feature matrix before constructing the linear neighborhood similarity, which reduces the influence of missing and imbalanced known information. According to these tests, the main components are crucial for improving the prediction accuracy of our method, and our method is expected to be a reliable biomedical research tool for predicting latent metabolite-disease associations.

Algorithm 1: WKNKN. Input: matrices M ∈ R^(nd × nm), SD ∈ R^(nd × nd), SM ∈ R^(nm × nm), neighborhood size K, and decay term T. Output: new feature matrix MF.
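Algorithm 1 (WKNKN), whose inputs M, SD, SM, K, and T and output MF are listed in this section, could be sketched as follows. The decayed similarity weighting and the final combination step (average of the two estimates, then the elementwise maximum with M) follow the commonly used WKNKN formulation and are assumptions here, as are the function names and the choice to rank all other entities by similarity.

```python
import numpy as np

def _neighbor_profiles(Y, S, K, T):
    """Estimate each row of Y from the rows of its K most similar entities."""
    Y = np.asarray(Y, dtype=float)
    S = np.asarray(S, dtype=float)
    n = Y.shape[0]
    k = min(K, n - 1)
    decay = T ** np.arange(K)                 # 1, T, T^2, ...
    out = np.zeros_like(Y)
    for i in range(n):
        sim = S[i].copy()
        sim[i] = -np.inf                      # exclude the entity itself
        nn = np.argsort(-sim)[:k]             # neighbors by decreasing similarity
        z = S[i, nn].sum()
        if z > 0:
            weights = decay[:k] * S[i, nn]
            out[i] = weights @ Y[nn] / z
    return out

def wknkn(M, SD, SM, K=5, T=0.5):
    """Return the new feature matrix MF built from M (nd x nm)."""
    M = np.asarray(M, dtype=float)
    from_diseases = _neighbor_profiles(M, SD, K, T)
    from_metabolites = _neighbor_profiles(M.T, SM, K, T).T
    # Assumed combination: average, then keep known associations at 1.
    return np.maximum(M, (from_diseases + from_metabolites) / 2.0)

# Toy example with three diseases and two metabolites.
M = np.array([[1, 0], [0, 0], [0, 1]])
SD = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
SM = np.array([[1.0, 0.5], [0.5, 1.0]])
MF = wknkn(M, SD, SM, K=1, T=0.5)
```

In the toy run the second disease, which has no known metabolite, receives a nonzero score for the first metabolite from its most similar disease, which is the sparsity-reducing effect described in the text.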

Parameter Analysis.
According to the previous study [31], K in WKNKN is set to 5, and T, a decay term with T ≤ 1, is set to 0.5 for convenience. The numbers of linear neighbors of diseases and metabolites are denoted kd and km, respectively; they are varied over the range [10, 50], and the results of 10-fold cross validation are used to analyze these parameters. In this study, we set km = 50 according to Table 1 (see also Figure 5).

Case Study.
In this section, four diseases (hepatitis, leukemia, obesity, and Alzheimer's disease) are selected for case studies to explore their pathogenic mechanisms from the perspective of metabolites. For these four diseases, 10, 9, 9, and 7 of the top 10 predicted metabolites, respectively, could be verified in the literature. The prediction network of these diseases and their top 10 predicted metabolites, including their associations and relevant neighborhoods, is shown in Figure 6.
Hepatitis, a general term for inflammation of the liver, usually refers to the destruction of liver cells and damage to liver function caused by many pathogenic factors, such as viruses, bacteria, parasites, chemical poisons, drugs, alcohol, and autoimmune factors. We conducted a case study on hepatitis using our method. As illustrated in Table 3, the top 10 predicted metabolites associated with hepatitis were selected, and all of them were verified to be related to the disease. For instance, high concentrations of homocysteine (ranked 2nd) could have favorable consequences in chronic hepatitis C virus (HCV) infection [32].
Leukemia is a group of life-threatening malignant disorders of the blood and bone marrow. Most patients have common first symptoms, including fever, progressive anemia, a significant bleeding tendency, or bone and joint pain, and the disease is prevalent in the adolescent and young adult (AYA) population. We carried out a case study of leukemia with our method, and 9 of the top 10 predicted metabolites associated with leukemia were verified to be related to the disease (see Table 4). For example, the level of orotic acid has been verified to be higher in the milk of cows with leukemia [33]. It has been confirmed that combining parthenolide with inhibitors of L-cystine uptake achieves greater toxicity to childhood T-cell acute lymphoblastic leukemia [34]. Furthermore, several subunits of GABAAR (gamma-aminobutyric acid type A receptor) have been shown to be significantly increased in children with acute lymphoblastic leukemia (ALL) compared with non-ALL children [35].

Figure caption: Orange represents target diseases, and green represents their relevant neighborhoods. Purple represents the predicted metabolites, and blue represents their relevant neighborhoods.

Obesity is a common disease caused by metabolic disorder. When the human body takes in more calories than it requires, the excess calories are stored in the body as fat; when the stored fat exceeds normal physiological requirements and reaches a certain level, the person becomes obese. A detailed metabolic phenotype of obese individuals will play a valuable role in understanding the pathophysiology of metabolic disorders. In the obesity-related metabolite prediction results, 9 of the top 10 predicted metabolites have been verified by published references (see Table 5). For example, L-alanine (Ala) has been reported to regulate pancreatic β-cell physiology and to prevent body fat accumulation in diet-induced obesity [36].
Alzheimer's disease, a neurodegenerative disorder with insidious onset and slow progression, is a growing global health concern with huge implications for individuals and society. It is reported that about 5.4 million Americans have Alzheimer's disease, and the number of people living with the disease in the United States is still growing. Someone in the country develops Alzheimer's disease every 66 seconds. The costs of Alzheimer's care may place a substantial financial burden on families. Thus, new therapeutic strategies aim to move from treatment to prevention. Studying disease-related metabolites and observing changes in their concentrations is one such preventive measure. In this study, we select the top 10 latent associations with Alzheimer's disease, and 7 of the top 10 predicted metabolites have been verified by published references (see Table 6).

Discussion
In this paper, we propose an improved bipartite network projection method based on linear neighborhood similarity for metabolite-disease association prediction (IBNPLNSMDA). At the beginning of the method, we use the integrated similarities to obtain a new feature matrix, from which the linear neighborhood similarity is constructed. We then improve the baseline bipartite network recommendation algorithm by adding similarity weights when resources are allocated. Furthermore, LOOCV and several case studies on important human diseases have been carried out. As a result, IBNPLNSMDA performs well both in LOOCV and in the case studies. The excellent performance of IBNPLNSMDA is mainly attributed to the following factors. Firstly, different data sources, such as the MeSH database and DisGeNET, are used to construct the integrated disease similarity and the integrated metabolite similarity, which makes full use of various similarity information and lays a foundation for obtaining the new features. Secondly, the application of linear neighborhood similarity (LNS) together with WKNKN alleviates the sparsity and incompleteness of the current dataset. Last but not least, similarities are used as weights in the baseline bipartite network recommendation algorithm, which significantly improves the prediction results.

[Table fragment: Acetic acid, Unconfirmed; 10, Gamma-aminobutyric acid, PMID: 27080467]
Despite the efficiency and practicability of the proposed method, it still has some limitations in identifying disease-related metabolites. First of all, the method depends on the number of confirmed human metabolite-disease associations; more confirmed associations would improve the development and performance of computational metabolite-disease prediction methods. Furthermore, reliable metabolite (disease) similarity matrices derived from other biological features could be integrated with the linear neighborhood similarity.

Abbreviations
DAG: Directed acyclic graph
GIP: Gaussian interaction profile
LOOCV: Leave-one-out cross validation
TPR: True-positive rate
FPR: False-positive rate
ROC: Receiver operating characteristic
AUC: Area under the curve

Data Availability
The data about metabolite-disease associations used in this study are from the website https://hmdb.ca/.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Authors' Contributions
CZ carried out the method IBNPLNSMDA to predict the latent associations of metabolites and diseases, participated in designing, and drafted the manuscript. XJL helped to draft the manuscript. All authors read and approved the final manuscript.