Graph Based Link Prediction between Human Phenotypes and Genes

Background: Deep phenotyping is the detailed and precise analysis of phenotypic abnormalities, through which genotype-phenotype associations and the history of human disease are learned. Understanding and detecting the interactions between phenotype and genotype is a fundamental step in translating precision medicine to clinical practice. Recent advances in machine learning make it possible to predict these interactions between abnormal human phenotypes and genes. Methods: In this study, we developed a framework to predict links between Human Phenotype Ontology (HPO) terms and genes. Annotation data from a heterogeneous knowledge resource, Orphanet, was parsed to obtain human phenotype-gene associations. Embeddings for the nodes (HPO terms and genes) were generated with the node2vec algorithm, which samples nodes of the graph via biased random walks and then learns features over these sampled sequences. The embeddings were used in a downstream task to predict the presence of a link between node pairs with five different supervised machine learning algorithms. Results: In the downstream link prediction task, the gradient-boosted decision tree model (LightGBM) achieved the best AUROC of 0.904 and AUCPR of 0.784. In addition, LightGBM achieved an optimal weighted F1 score of 0.87. Compared with the other four methods, LightGBM finds more accurate interactions/links between human phenotype-gene pairs.


Introduction
Today, many humans have diseases caused by abnormalities in the genome, and because of their nonuniformity, diseases are missed or undiagnosed. The analysis of human phenotypes plays a crucial role in clinical practice [1]. The Human Phenotype Ontology (HPO) brings precision medicine into clinical practice through deep phenotyping, an in-depth analysis of phenotypic abnormalities in which individual phenotype components are investigated and interpreted [2]. HPO is a resource that systematically defines and logically organizes human phenotypes [3]. It is derived from the medical literature and knowledge resources such as DECIPHER, the database of chromosomal imbalance and human phenotypes [4], OMIM [5], and Orphanet [6]. The majority of research identifies relevant phenotypes mainly through topological and ancestral relationships between pairs of nodes in directed acyclic graphs. However, such approaches do not use feature learning through node embeddings to detect associations that are difficult to infer from the graph structure alone. Traditional classification methods have been applied to predict interactions and associations between HPO terms and genes. These methods introduce inconsistencies because target values are predicted without considering inherited relationships within the ontology. For instance, when predicting a human phenotype-gene link, a plain classifier may associate the HPO term "Squamous Cell Carcinoma" with a gene but fail to associate "Abnormality of the Skin," leading to an inconsistent prediction. To appropriately handle the hierarchical relationships between HPO terms, we use node embeddings. Node embeddings transform graph nodes into distributed representations, carrying relationships in the graph over into an embedding space. Node2Vec is a common method to create node embeddings [7]. It performs a flexible neighborhood sampling strategy using biased random walks and passes this sampling data to the word2vec model as input [8]. In this research, a framework was designed to estimate the interaction between human phenotypes and genes.
The dataset used for this study was taken from heterogeneous knowledge resources, specifically Orphanet. We converted this knowledge resource into an undirected graph and used this graph to create node embeddings that learn node features. These node features were then used to build downstream machine learning models to predict interactions. We conducted a quantitative analysis of the output of these models using different metrics [9-16].

Related Work
Several studies have used the Human Phenotype Ontology graph structure to understand associations and interactions between phenotype ontology terms, genes, and proteins in the clinical service domain. The HPO2GO [17] study shows ways to predict associations between human phenotype terms and genes [18]. OntoFUNC [19] integrated pharmacogenomics databases to identify associations between chemical pathways using analysis of a chemical ontology.
The HPOAnnotator [20] infers large-scale protein-human phenotype associations using protein-protein interaction (PPI) information and the HPO graph. A low-rank approximation was employed to solve the sparsity problem in finding associations. However, none of these methods uses node embeddings and advanced machine learning models to extract appropriate node features for predicting human phenotype-gene associations [9, 21-26].

Methodology
3.1. Data Preparation. Link prediction aims to find missing links or identify future link interactions between nodes based on the currently observed partial network. Predicting links between nodes has been one of the most significant topics in the field of graphs. To use machine learning for predicting interactions between unconnected pairs in a graph, we need to represent the graph with features. In Figure 1, the graph has seven nodes and the unconnected node pairs A-F, B-D, B-E, B-G, and E-G. Suppose we analyze the graph data at time t and find new connections that have formed by time t + n (red links in Figure 2). We extract the unconnected node pairs A-F, B-D, B-E, B-G, and E-G; once the graph is updated at time t + n, the three new links (red lines in Figure 2) give the node pairs A-F, B-D, and B-E the label 1, while the node pairs B-G and E-G are labeled 0 because at time t + n there is still no link between them.
In this example, when we access the graph at time t + n, we can obtain labels for the target variable. However, in real-world networks, we typically have access only to a single large graph at one point in time. Since connections between nodes develop gradually over time, randomly hiding some edges in the given graph and then creating labels solves this problem. When we remove links or edges, we always avoid deleting an edge whose removal would lead to an isolated node, i.e., a node without any edge, or an isolated subnetwork.
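The edge-hiding step described above can be sketched in plain Python (an illustrative reconstruction, not the authors' code): a candidate edge becomes a positive sample only if both endpoints keep at least one other edge and a breadth-first search confirms the graph stays connected without it.

```python
import random
from collections import deque

def stays_connected(adj, removed_edge):
    """BFS reachability check with one edge treated as removed."""
    u, v = removed_edge
    nodes = list(adj)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if {x, y} == {u, v}:
                continue  # pretend the candidate edge is gone
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) == len(adj)

def sample_positive_edges(adj, n_samples, seed=0):
    """Randomly hide edges as positive (label 1) examples, refusing any
    removal that would isolate a node or split the graph."""
    rng = random.Random(seed)
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    rng.shuffle(edges)
    positives = []
    for u, v in edges:
        if len(positives) == n_samples:
            break
        if len(adj[u]) > 1 and len(adj[v]) > 1 and stays_connected(adj, (u, v)):
            positives.append((u, v))
            adj[u].remove(v)
            adj[v].remove(u)
    return positives
```

On a triangle A-B-C with a pendant node D attached to C, the edge C-D can never be hidden (D would become isolated), while any triangle edge can.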
A significant part of the undirected graph obtained from the Orphanet HPO annotation dataset belongs to the negative population, i.e., unconnected node pairs. To find these node pairs, we create an adjacency matrix, as shown in Figure 3. This adjacency matrix is a square matrix in which both rows and columns are indexed by the nodes of the graph.
The values in the matrix denote edges or links: a value of one means an edge exists, and a value of zero means no edge was found between that node pair. In Figure 3, nodes HP:0000951 and ORPHA:763 have a value of 0 at their intersection in the adjacency matrix, and correspondingly there is no edge between them in the graph.
Because the matrix is symmetric, we only consider elements either above or below the diagonal. The traversal procedure finds the positions of the negative samples. A few edges are randomly dropped to obtain the positive samples. However, dropping too many edges might disconnect fragments and nodes. To avoid this, we first check whether dropping a node pair would split the graph; if it would not, that node pair can be safely dropped. By following these steps, we obtained a total of 1,665,146 candidate node pairs, of which only 125,304 were positive samples, i.e., around 7.5% (a highly imbalanced dataset). The heterogeneous knowledge resource, Orphanet, contains 133,733 associations between human phenotypes and genes. Statistics for the undirected graph generated from this data can be found in Figure 4.
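The traversal of the upper triangle for negative samples can be sketched as follows (a minimal reconstruction using an adjacency dictionary rather than a dense matrix, since the real matrix is large and sparse):

```python
def unconnected_pairs(nodes, adj):
    """Scan the upper triangle of the adjacency structure and collect
    node pairs with no edge (the negative, label-0 population)."""
    negatives = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:          # upper triangle only: j > i
            if v not in adj.get(u, set()):
                negatives.append((u, v))
    return negatives
```

For a path graph A-B-C, the only unconnected pair above the diagonal is (A, C).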

Feature Extraction.
To extract features from graph G, we use the node2vec algorithm. It creates vector-space embeddings to represent node features. Node2Vec starts with weighted random walks from every node of the graph and interprets the random walks as sentences whose terms are embedded by a skip-gram model into Euclidean space [21]. The aim of this approach is to maximize the probability of a node's context K within a contextual window of length l:
\[ \max \sum_{i} \sum_{\substack{k_j \in K(k_i) \\ |j - i| \le l}} \log p\left( k_j \mid k_i \right), \]
where k_i denotes the i-th term in the node sequence generated by a random walk. The output of the skip-gram model, defined by the softmax function, is p(k_j | k_i):
\[ p\left( k_j \mid k_i \right) = \frac{\exp\left( \mathbf{n}_j^{\top} \mathbf{n}_i \right)}{\sum_{m} \exp\left( \mathbf{n}_m^{\top} \mathbf{n}_i \right)}, \]
where n_j and n_i are the vector representations of terms k_j and k_i in the hidden layer of the skip-gram model. The node2vec library was used to train a model with walks of length 30, 200 walks per node, and 128 embedding dimensions. This node2vec model is applied to every node pair in the dataset.
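The biased-walk sampling that node2vec performs before skip-gram training can be sketched as below. This is a simplified, dependency-free reconstruction, not the node2vec library the study used; the transition weights 1/p (return to the previous node), 1 (common neighbor of the previous node), and 1/q (move outward) follow the node2vec paper.

```python
import random

def node2vec_walk(adj, start, walk_length, p=1.0, q=1.0, rng=random):
    """One second-order biased random walk in the node2vec style."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)      # step back
            elif nxt in adj[prev]:
                weights.append(1.0)          # stay near the previous node
            else:
                weights.append(1.0 / q)      # explore outward
        walk.append(rng.choices(neighbors, weights=weights)[0])
    return walk

def simulate_walks(adj, num_walks, walk_length, **kw):
    """Sample num_walks walks from every node; these node sequences are
    what a skip-gram model would consume as 'sentences'."""
    return [node2vec_walk(adj, n, walk_length, **kw)
            for _ in range(num_walks) for n in sorted(adj)]
```

In the study itself, walk_length would be 30 and num_walks 200, with the resulting sequences fed to a 128-dimensional skip-gram model.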

Downstream Prediction Using ML Algorithms.
The node features from node2vec are fed to the machine learning models. To verify their performance, we split the complete dataset into training (80%) and testing (20%) sets. We used the following methods for this supervised link prediction task.
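One common way to turn two node embeddings into a single edge feature vector is the elementwise (Hadamard) product; whether the study used this exact operator is not stated, so treat it as an assumption. A minimal sketch of the feature construction and the 80/20 split:

```python
import random

def edge_features(emb, pairs):
    """Elementwise (Hadamard) product of the two node embeddings — one
    common node2vec recipe for turning node vectors into edge features."""
    return [[a * b for a, b in zip(emb[u], emb[v])] for u, v in pairs]

def train_test_split(X, y, test_frac=0.2, seed=0):
    """Shuffle indices once, then cut off the last test_frac as the test set."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [X[i] for i in te],
            [y[i] for i in tr], [y[i] for i in te])
```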

Logistic Regression.
Logistic regression, borrowed from the field of statistics, classifies binary outcomes. It uses the sigmoid function to restrict the output to values between 0 and 1. The coefficients are estimated from the training set by maximum-likelihood estimation. It is a common algorithm, but it makes explicit assumptions about the distribution of the data. For this study, we used the L-BFGS solver [22].
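The sigmoid squashing described above can be written out in a couple of lines (a sketch of the scoring step only; the study fitted the coefficients with scikit-learn's L-BFGS solver, not by hand):

```python
import math

def sigmoid(z):
    """Map any real score into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Logistic-regression score: squash the linear combination of the
    edge features into a link probability between 0 and 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```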

Random Forest.
Random forest is a widely used bagging method for supervised machine learning tasks. Decision trees are ensembled into a forest using bagging, which here refers to building multiple decision trees independently and later merging their outputs to get a more accurate prediction. The algorithm can be used for both classification and regression tasks. In our case, we train the random forest model for binary classification to predict the link between two nodes. The maximum depth used to train this model was 12 [23-29].

Neural Network.
Neural networks have revolutionized the field of machine learning through recent advancements. The algorithm tries to simulate the human brain to find patterns in data. Neural networks can be used for a wide variety of tasks, including clustering similar data, classifying different objects, and many more. A neural network is initialized with random weights and thresholds. The training data are fed as vectors to the input layer and passed through the succeeding hidden layers, with the weights and thresholds adjusted until the network yields outputs close to the true labels. For this study, we use a neural network with two hidden layers, each with ReLU activation, and a sigmoid activation at the output layer to obtain output values between 0 and 1. The Adam optimizer, an extension of stochastic gradient descent (SGD) that updates the network weights based on the training data, is used with a learning rate of 1e-3.
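The forward pass of the two-hidden-layer architecture just described can be sketched as follows (a dependency-free illustration; the weights in the usage example below are arbitrary placeholders, and real training would update them with Adam):

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(W, b, x):
    """One fully connected layer: W is a list of rows, one per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(params, x):
    """Two ReLU hidden layers, then a single sigmoid output unit."""
    h1 = relu(dense(params["W1"], params["b1"], x))
    h2 = relu(dense(params["W2"], params["b2"], h1))
    z = dense(params["W3"], params["b3"], h2)[0]
    return 1.0 / (1.0 + math.exp(-z))  # link probability in (0, 1)
```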

XGBoost.
XGBoost (eXtreme Gradient Boosting) is a go-to algorithm in the machine learning community for solving most classification and regression problems. It uses a boosting technique, training models sequentially so that each new model tries to correct the mistakes of the previous ones. Models are added until no further improvement can be made. Gradient boosting trains each new model on the errors, or residuals, of the previous models.
This method is computationally efficient and fast compared with other boosting methods. For this study, we used a learning rate of 0.1, a maximum depth of 12, and a scale weight of 0.99 to handle the imbalanced nature of the data [24].

LightGBM.
LightGBM is a gradient boosting method that is very fast and computationally efficient. The framework is used for many downstream prediction tasks such as classification, regression, and ranking. It uses the same concepts as XGBoost with one key difference: it grows trees leaf-wise rather than level-wise, which yields a larger loss reduction when growing from the same leaf. We use a maximum depth of 10 to avoid building an overly complex model. A scale_pos_weight of 99 was used to deal with the class imbalance problem in this study.
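Both XGBoost and LightGBM expose a scale_pos_weight parameter for class imbalance. A common heuristic for setting it (an assumption here; the text does not say how the study's values of 0.99 and 99 were chosen) is the negative-to-positive count ratio:

```python
def scale_pos_weight(labels):
    """Common heuristic for the scale_pos_weight parameter in XGBoost /
    LightGBM: number of negative samples over number of positives."""
    pos = sum(labels)
    return (len(labels) - pos) / pos
```

For a dataset that is about 7.5% positive, as reported above, this heuristic gives roughly 12.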

Results and Discussion
4.1. Evaluation Metrics. Due to the high class imbalance, the choice of model evaluation metrics is important. In this study, we try to understand how each class performs rather than focusing only on overall metrics. Below are the metrics used to validate the performance of these machine learning methods.

AUROC.
AUROC stands for the area under the receiver operating characteristic (ROC) curve. In a binary classification task, it is a very common summary statistic for the goodness of a predictor. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), and the area under this curve measures the degree of separability, i.e., how well the model can distinguish between the classes.

Microaverage Precision/Recall/F1 Score.
PrecisionMicroAvg can be defined as the sum of the true positives over all classes divided by the sum of all positive predictions:
\[ \mathrm{Precision}_{\mathrm{MicroAvg}} = \frac{\sum_{i=1}^{N} \mathrm{TP}_i}{\sum_{i=1}^{N} \left( \mathrm{TP}_i + \mathrm{FP}_i \right)}. \]
RecallMicroAvg can be defined as the sum of the true positives over all classes divided by the sum of all actual positives (not the predicted positives):
\[ \mathrm{Recall}_{\mathrm{MicroAvg}} = \frac{\sum_{i=1}^{N} \mathrm{TP}_i}{\sum_{i=1}^{N} \left( \mathrm{TP}_i + \mathrm{FN}_i \right)}. \]
F1MicroAvg can be defined as the harmonic mean of PrecisionMicroAvg and RecallMicroAvg:
\[ \mathrm{F1}_{\mathrm{MicroAvg}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{MicroAvg}} \cdot \mathrm{Recall}_{\mathrm{MicroAvg}}}{\mathrm{Precision}_{\mathrm{MicroAvg}} + \mathrm{Recall}_{\mathrm{MicroAvg}}}. \]
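The micro-averaged definitions pool the per-class counts before dividing; a small sketch (the per-class (TP, FP, FN) tuples below are hypothetical inputs, not the study's results):

```python
def micro_prf(counts):
    """counts: per-class (TP, FP, FN) tuples. Micro-averaging sums the
    counts over all classes before computing precision, recall, and F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```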
4.1.4. Macroaverage Precision/Recall/F1 Score.
The macroaverage is used when we want to treat all classes equally when evaluating performance, regardless of how frequent each class is in the dataset [25].
PrecisionMacroAvg can be defined as the arithmetic mean of the precision over all classes:
\[ \mathrm{Precision}_{\mathrm{MacroAvg}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Precision}_i. \]
RecallMacroAvg can be defined as the arithmetic mean of the recall over all classes:
\[ \mathrm{Recall}_{\mathrm{MacroAvg}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Recall}_i. \]
F1MacroAvg can be defined as the mean of the per-class F1 scores:
\[ \mathrm{F1}_{\mathrm{MacroAvg}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{F1}_i, \]
where i is the class index and N is the total number of classes.
4.1.5. Weighted Average Precision/Recall/F1 Score.
PrecisionWeightedAvg can be calculated by multiplying each class's precision by a class weight based on the true labels:
\[ \mathrm{Precision}_{\mathrm{WeightedAvg}} = \sum_{i=1}^{N} W_i \, \mathrm{Precision}_i, \]
where W_i is the class weight based on the true labels, i.e., the fraction of samples whose true label is class i. Similarly, RecallWeightedAvg can be calculated by multiplying the class weights with the corresponding recall scores:
\[ \mathrm{Recall}_{\mathrm{WeightedAvg}} = \sum_{i=1}^{N} W_i \, \mathrm{Recall}_i. \]
F1WeightedAvg can be calculated by multiplying the class weights with the corresponding F1 scores:
\[ \mathrm{F1}_{\mathrm{WeightedAvg}} = \sum_{i=1}^{N} W_i \, \mathrm{F1}_i. \]
From Table 1, it is much easier to identify which model does a good job of identifying each class. Based on these metrics, we can see that XGBoost and random forest perform well at identifying positive samples that are truly positive, i.e., when they predict a link between nodes, they are correct 99% and 100% of the time, respectively. In contrast, LightGBM beats all other methods at identifying the actual positives: it correctly predicts 82% of all the links between these nodes. From Table 2, we can infer that LightGBM is better than all the other models in terms of AUROC and AUCPR.
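The macro and weighted averages defined above can be computed from per-class counts; a small illustrative sketch (the (TP, FP, FN) tuples are hypothetical, and the class weight is the true-label support TP + FN divided by the total):

```python
def per_class_prf(tp, fp, fn):
    """Precision, recall, and F1 for a single class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_and_weighted_f1(counts):
    """counts: per-class (TP, FP, FN). Macro weights every class equally;
    the weighted average weights each class by its true-label support."""
    stats = [per_class_prf(*c) for c in counts]
    supports = [tp + fn for tp, _, fn in counts]
    total = sum(supports)
    macro_f1 = sum(f for _, _, f in stats) / len(stats)
    weighted_f1 = sum(s * f for s, (_, _, f) in zip(supports, stats)) / total
    return macro_f1, weighted_f1
```

On an imbalanced toy example, the weighted F1 tracks the majority class while the macro F1 exposes the weak minority class, which is exactly why both are reported for this dataset.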
Figure 4 shows the graph statistics and degree distribution. Figure 5 shows the AUROC and AUCPR for each classifier.

Conclusion
In this research, an approach is proposed to predict links between human phenotypes and genes using a heterogeneous knowledge resource, Orphanet.
The most important part of this study is to represent the data as a graph and then to find a way to turn this graph into an appropriate feature set for downstream tasks such as link prediction. In essence, we provided a way to obtain embedding vectors using the node2vec algorithm and then used these embeddings to build five different machine learning models. We evaluated and compared their performance using different quantitative metrics, including AUROC, AUCPR, and micro-, macro-, and weighted-averaged precision, recall, and F1 score. Some of these metrics were calculated per class to better understand the behavior on the imbalanced class, in our case the positive samples. Based on these metrics, we found some interesting results. If we want the predicted links to be trustworthy, i.e., a high fraction of the predicted associations are actual associations in the graph (precision), we may use either the XGBoost or the random forest algorithm. On the other hand, if we want to recover as many of the actual links in the graph as possible (recall), then LightGBM is the best choice.

Figure 4: Graph statistics and degree distribution.