Identifying the Risky SNP of Osteoporosis with ID3-PEP Decision Tree Algorithm

. In the past 20 years, much progress has been made on the genetic analysis of osteoporosis. A number of genes and SNPs associated with osteoporosis have been found through GWAS method. In this paper, we intend to identify the suspected risky SNPs of osteoporosis with computational methods based on the known osteoporosis GWAS-associated SNPs. The process includes two steps. Firstly, we decided whether the genes associated with the suspected risky SNPs are associated with osteoporosis by using random walk algorithm on the PPI network of osteoporosis GWAS-associated genes and the genes associated with the suspected risky SNPs. In order to solve the overfitting problem in ID3 decision tree algorithm, we then classified the SNPs with positive results based on their features of position and function through a simplified classification decision tree which was constructed by ID3 decision tree algorithm with PEP (Pessimistic-Error Pruning). We verified the accuracy of the identification framework with the data set of GWAS-associated SNPs, and the result shows that this method is feasible. It provides a more convenient way to identify the suspected risky SNPs associated with osteoporosis.


Introduction
Osteoporosis is a type of systemic skeletal disease that is characterized by reduced bone mass and microarchitecture deterioration of bone tissues, thereby leading to the loss of strength and increased risk of fractures [1].It is one of the agerelated diseases with arteriosclerosis, hypertension, diabetes, and cancer.Currently, none of the medical methods is safe and effective to cure osteoporosis.Therefore, it is necessary to provide theoretical basis for developing a medical strategy to cure the disease from the pathogenesis of osteoporosis.
With the completion of the International HapMap Project and 1000 Genomes Project, about ten millions SNPs of human were annotated, among which more than 3 million are common SNPs.Genetic analysis has reached the stage of genome-wide association study (GWAS).The GWAS is applied to the study of 40 kinds of diseases that are related to more than 500 thousands SNPs [2].
Osteoporosis is a complex and polygenic disease of bone system with the heritability of bone mass is about 60-80% [3].Much progress has been made on the genetic analysis of osteoporosis in the past 20 years and it has been found that a lot of genes and SNPs are associated with osteoporosis through GWAS [4,5].
Computational biology refers to the development and application of data analysis, the theory of data method, mathematical modeling, and computer simulation technology, used in the study of biology, behavioral, and social group system of a discipline [6].The rapid mass of biological data accumulation is unprecedented in the history of human science.Now, a variety of methods and tools of computational biology through the Internet have been successfully applied in every aspect in the field of biological research.They are powerful for post-GWAS studies [7] and could identify the potential and promising causal SNPs that require experimental tests for follow-up functional studies.Extensive work has

Material and Method
We identified the suspected risky SNPs associated with osteoporosis by algorithm based on the analysis of osteoporosis GWAS-associated SNPs with the method mentioned above [9].It is assumed that the SNPs that are similar to the osteoporosis GWAS-associated SNPs are possible risky SNPs associated with osteoporosis.The identification process of the suspected risky SNPs includes two steps in general.Firstly, we constructed a Protein-Protein Interaction (PPI) network based on the Protein-Protein Interaction analysis of the osteoporosis GWAS-associated genes and the genes associated with suspected risky SNPs and identify whether the genes associated with the suspected risky SNPs are associated with osteoporosis through random walk algorithm based on Markov chain.By the algorithm, we also selected the suspected risky SNPs whose associated genes are identified to be associated with osteoporosis.We then classified those SNPs based on their characteristics of function and their loci features by a classification decision tree, and the decision tree was constructed by ID3 decision tree algorithm with Pessimistic-Error Pruning. Figure 1 describes the process to identify the osteoporosis risky SNPs.

The Identification of Genes Associated with Suspected Risky
SNPs.According to the modular property of the genetic diseases, many scholars have proposed prioritization algorithms to predict the disease-causing genes based on the PPI, Human Disease Network, and DISEASOME recently [12][13][14][15][16]. Similarly, we obtained the scores of the genes associated with the suspected risky SNPs through the random walk algorithm based on the PPI of the osteoporosis GWAS-associated genes and the genes associated with suspected risky SNPs.Then, the result was acquired by setting up a threshold , and the genes associated with suspected risky SNPs are probably the osteoporosis associated genes if their scores are greater than .

The Random Walk Algorithm Based on Markov Chain.
Kohler proposed a method for the problem of candidategene prioritization by random walk algorithm based on the global network distance of PPI.The results indicate that the algorithm is more effective than the local network distance algorithm [17].The random walk algorithm was applied to Protein-Protein Interaction network of all associated genes.
An undirected graph  = (, ) is defined for the Protein-Protein Interaction network of all associated genes.In the undirected graph ,  is the set of vertices for the interactors of the network.And  is defined as  = {V 1 , V 2 , . . ., V  };  is the set of edges; and  is defined as Every edge in the set of edges corresponds to two nodes of the set of vertices for the interaction between the interactors.Moreover, it is assumed that a random process meets the condition of Markov chain.The random process should be as follows: (a) The probability distribution of time +1 is only related to the state of time , and it is not related to the state before time .
(b) The state transition is not related to the value of  from the time  to time  + 1.Therefore, the Markov chain model is defined as is a nonempty set that consists of all the possible states of the system.It is a state space that can be a limited and denumerable set or a nonempty set. = [  ] × is the state transfer-probability matrix,   is the probability that the system is in the state  at time  to the state  at time  + 1.  is the number of system states. = { 0 ,  1 , . . .,  −1 } is the initial probability distribution of the system,   is the  probability that the system in state  at the initial time, and Based on the above theory model, the random walk on graphs is defined as an iterative walk's transition from its current node to a randomly selected neighbor starting at given source node [17].The random walk is defined as is a vector in which the th element holds the probability of being at node  at time step . is a constant between 0 and 1 that it is the restart of the walk in every step at the node  with probability , and  ∈ (0, 1] [17]. 0 is a row vector of 1 ×  which is the initial state of the system, and  is the element number of .The value of known elements of  0 is equal, and the sum of them is 1.And the value of other elements is 0.  is the transition probability matrix which is defined as is an adjacency matrix of undirected graph .Every element   of  is defined as follows: if there is interaction between V  and V  in the network, the element   = 1; otherwise,   = 0 the formula is defined as is a diagonal matrix.Each element   of  is defined as follows: if  =  then it should have   =   ; otherwise   = 0.   is the degree of V  in the network.The formula is defined as The transition probability matrix  is also a rownormalized adjacency matrix of the graph.Formula (2) meets the state of stationary distribution of Markov train model obviously, so the central point of random walk algorithm is evaluating the stationary distribution state of the probability of the nodes in the undirected network  which consists of PPI.Firstly, the transition probability matrix  should be obtained and the initial value is set for  0 .Then, process  times iteration based on formula (2) until lim( +1 −   ) = 0,  +1 is a convergence vector.A threshold is set for the probability value, and if the probability value of the nodes (or genes) is greater than the threshold, they are osteoporosis associated genes.

Classify the Suspected Risky SNPs by ID3 Decision Tree
Algorithm.ID3 decision tree algorithm is a classification algorithm for tree structure [18,19].The goal of the algorithm is to predict target variable based on multiple input variables and deduce a classification rule with decision tree form from a group of irregular samples.We assume that all input characteristic elements have a limited discrete domain and need an individual characteristic element as a category.The nonleaf nodes of a classification decision tree classify the samples by characteristics of samples, and each leaf node of the tree is a class or classes of probability distribution.Therefore, we chose decision tree to classify the SNP based on the condition of the training set and algorithm characteristics.
SNPs located within the promoter or distant enhancer region of genes may alter the binding of TFs with DNA and subsequently regulate gene expression [20].The suspected risky SNPs are classified by ID3 decision tree algorithm based on four features of significant position on genes, mapping on putative enhancer region, mapping on distal interaction, and the region where the SNPs are located [21].
The decision tree algorithm chooses the attribute with the maximum information gain after it is split, and the algorithm searches the decision-space by way of top-down greedy algorithm. is defined as the training set of SNPs with their loci features, and the training set is divided into  classes.That is,  = { 1 ,  2 , . . .,   }.The number of the training instances in th class is defined as |  | =   .The number of the training instances in  is ||.The probability that a training instance belongs to the th class is (  ).And a formula is defined as For the training set , () is defined as the information entropy of , and we have the formula The greater the value of information entropy () is, the smaller the degree of uncertainty for the division of  is.The attribute  is selected as the test attribute which is the loci features of the training set SNPs, and the value set for attribute  is  = { 1 ,  2 , . . .,   }.The probability of the attribute belongs to th class when  =   can be formulated as is the number of training instances which belongs to th class.
When the attribute  =   , a formula is used to define the conditional entropy of the attribute  as is the training instances set of training set .
The information entropy of attribute  is defined as We built a top-down decision tree and classified the training instances by choosing the attribute with the maximum information entropy based on the formulas above.
However, the overfitting problem could not be avoided if there were many noise samples in the training set, because of a complicated classification decision tree constructed by ID3 decision tree algorithm with a fair amount of noise samples in the training set.To solve the problem, a PEP (Pessimistic-Error Pruning) algorithm was exerted on the ID3 decision tree classification algorithm.PEP is the most accurate topdown pruning strategy which deals with the pruning problem without separating the training set.
We define a decision tree  which grows on a large scale based on the training set of SNPs with their loci features. 1 is a nonleaf node set,  2 is a leaf node set, and  3 is for all nodes of .The formula is Before pruning, we define () as the error rate of node  in the decision tree.The formula is () is the number of samples in node , and () is the number of samples that does not belong to node  actually.We define   as a subtree of the decision tree , and  is the root node of   .So the error rate of the subtree   is is the leaf node set of subtree   , and we define Apparently, the formula for error rate of the subtree   is binomial distribution.We define a continuity correction factor   () in order to make the binomial distribution approach the normal distribution.And the formula is Therefore, we deduce the continuity correction factor for the subtree   .The formula is In order to simplify the formula, we define   () as the error sample number instead of error rate.So the error sample number of node  in the decision tree  is Therefore, the error sample number of the subtree   is Similarly, the formula for the error sample number of subtree   is binomial distribution.And the standard deviation for   (  ) is defined as Finally, we deduce from formulas above that the subtree   will be cut if the node  meets the condition: The process of the PEP algorithm is as follows: Algorithm: PEP Begin Input: decision tree  before pruned Output: decision tree  after pruned (1) Get the nonleaf node set  1 of the decision tree (2) For  = 1 to length ( 1 ) (3) Do get a subtree   whose root node is We classified the suspected risky SNPs effectively based on their loci characteristics and studied their functions according the ID3 decision tree algorithm and PEP.

Results
By the end of 2014, nine GWAS and nine meta-analyses had reported 107 genes and 129 SNPs (lead SNP) that were associated with BMD, osteoporosis, or fractures with a significant threshold of 5 × 10 −8 .222 SNPs linked to osteoporosis GWAS-associated lead SNPs had also been identified by using LD information in the Caucasians population via HapMap website [9].Moreover, we obtained 107 known osteoporosis GWAS-associated genes which showed significant connectivity among proteins.And there were interactions between The first column is part of osteoporosis GWAS-associated SNPs; (b) the column of "bda," "td," and "enhancer" means whether the SNP is on significant TFs binding affinity, mapping on distal interaction, and mapping on putative enhancer region; (c) the last column is the category the SNP belong to.
osteoporosis GWAS-associated genes and interactors.We used the common Protein-Protein Interaction databases, such as Human Protein Interaction database (HPID) and General Repository for Interaction Data (GRID), to find the interactors which had interactions with the osteoporosis GWAS-associated genes and their interactions.Then, we obtained the interaction network graph by Cytoscape v3.4.0. Figure 2 is the PPI of osteoporosis GWAS-associated genes.
The result was verified by 10-fold cross-validation based on the data set of osteoporosis GWAS-associated genes and SNPs.We divided the data set of 129 osteoporosis GWASassociated lead SNPs and 222 SNPs linked with them into 10 samples.One sample was then randomly chosen and saved as the validation set to verify the model from the 10 samples, and the other 9 samples were saved as training set.The verification process was repeated 10 times so that each sample was the validation set once, and the accuracy was calculated every time.A 10-fold cross-validation was completed by the process above.
We set a threshold  ( > 10 −3 ) as a result of the validation.The recall was calculated, which was the true positive result to positive result ratio.The 10-fold crossvalidation was repeated for ten times and the average recall rate of every validation was calculated.The result was shown in Figure 3.
The classification result was also verified by 10-fold crossvalidation.The osteoporosis GWAS-associated SNPs were used as the data set.The SNPs of training set were classified based on their loci features.Part of classification of the training set was shown in Table 1.We classified the SNPs of validation set through ID3 decision tree algorithm and recorded the accuracy of classification, which was the proportion of classification accurate samples to all the samples.
Then, the process of validation was repeated for ten times and calculated the average accuracy rate and average classification reliability.The result was shown in Figure 4.
We also used genome-wide association studies (GWAS) of type 2 diabetes (T2D) data as negative data to verify our method [22].50 lead SNPs of T2D were obtained with their position features and associated genes.We searched the interactors of the associate genes from the PPI database and constructed the PPI network with the known osteoporosis GWAS-associated genes.The random walk algorithm was used on the PPI network.
We then used PEP for ID3 decision tree to construct a simplified classification decision tree.We combined the two steps of the risky SNPs identification method and verified the method by 10-fold cross-validation.Finally, we found  that not only was the computation efficiency improved, but also the accuracy rate of the result by using ID3 decision tree algorithm with PEP in the identification method was higher.The improvement is due to the fact that we had cut the subtrees which were constructed by the noise samples and solved the overfitting problem.While we defined ID3 decision tree algorithm with PEP in the identification method as ID3-PEP and ID3 decision tree algorithm as ID3, the result comparison of these two classification algorithm in the identification method was described by Figure 5.According to the result, we concluded that the ID3-PEP in the identification method was more stable than ID3 algorithm, and it had better effect for the classification problem.C4.5 is the optimization of ID3.They have the same way to learn training set and build a classification decision tree, but the difference of them is the way of choosing split attribute.C4.5 algorithm chooses the maximum attribute with information gain ratio to split.In order to solve the problem of overfitting in ID3 decision tree algorithm, C4.5 algorithm needs to scan the data set and rank them in every step.This calculation method and process of the algorithm have low operational efficiency.ID3-PEP algorithm solved the problem and was more accurate than C4.5.We made a comparison of these two algorithms through ROC curve, which is shown in Figure 6.Result shows that ID3-PEP is better than C4.5 in our classification.

Discussion and Conclusion
Since SNP plays a key role in the process of pathology and susceptibility of osteoporosis [23], it is necessary to find the unknown risky SNPs.Using the data set of known osteoporosis GWAS-associated SNPs and genes [8], we identified the genes of suspected risky SNPs associated with osteoporosis by random walk algorithm on the PPI network constructed by osteoporosis GWAS-associated genes and the genes associated with suspected risky SNPs.The suspected risky SNPs were classified based on the features of their loci position and function.We used 10-fold cross-validation to verify our method.
The result of the experiment above showed that the identification method for risky SNPs of osteoporosis was correct and effective.Our method efficiently achieved the process of identifying osteoporosis suspected risky SNPs.However, there is still a need to perfect the identification method.First of all, we need to search the loci features of suspected risky SNPs associated with osteoporosis and the interactors of associated genes manually.The training set for our method is the known osteoporosis GWAS-associated SNPs, which is not large enough to identify the risky SNPs accurately.Therefore, further research is needed.Firstly, a workflow can be constructed to improve the identification process, aiming to automatically identify the suspected risky SNPs' features.In order to improve the accuracy of our method, more features of the SNPs should be examined, such as the conservation of SNPs and the influence of the SNPs on miRNA binding site.Finally, we use our method to predict risky SNPs associated with osteoporosis by constructing the PPI network of all the human genes.

Figure 1 :
Figure 1: Process to identify the suspected risky SNPs associated with osteoporosis.

Figure 2 :Figure 3 :
Figure2: PPI of osteoporosis GWAS-associated genes (the pink nodes indicated those which had interactions with the osteoporosis GWASassociated genes, and the yellow nodes indicated the osteoporosis GWAS-associated genes).

Figure 4 :
Figure 4: Result of ID3 decision tree (the blue credibility refers to the average accuracy values of 10-fold cross-validation, and the orange credibility refers to the average reliability value).

Figure 5 :
Figure 5: Comparison of two classification algorithm (the blue credibility refers to the classification accuracy by ID3 algorithm, and the yellow credibility refers to the classification accuracy by ID3 decision tree algorithm with PEP).

Table 1 :
Part of the classification of training set.