Over the past 20 years, much progress has been made in the genetic analysis of osteoporosis. A number of genes and SNPs associated with osteoporosis have been found through genome-wide association studies (GWAS). In this paper, we aim to identify suspected risky SNPs of osteoporosis with computational methods based on the known osteoporosis GWAS-associated SNPs. The process includes two steps. First, we decided whether the genes associated with the suspected risky SNPs are associated with osteoporosis by running a random walk algorithm on the PPI network of osteoporosis GWAS-associated genes and the genes associated with the suspected risky SNPs. Second, we classified the SNPs with positive results based on their position and function features through a simplified classification decision tree, which was constructed by the ID3 decision tree algorithm with PEP (Pessimistic-Error Pruning) in order to address the overfitting problem of ID3. We verified the accuracy of the identification framework with the data set of GWAS-associated SNPs, and the results show that the method is feasible. It provides a more convenient way to identify suspected risky SNPs associated with osteoporosis.
Osteoporosis is a type of systemic skeletal disease that is characterized by reduced bone mass and microarchitecture deterioration of bone tissues, thereby leading to the loss of strength and increased risk of fractures [
With the completion of the International HapMap Project and the 1000 Genomes Project, about ten million human SNPs were annotated, among which more than 3 million are common SNPs. Genetic analysis has reached the stage of the genome-wide association study (GWAS). GWAS has been applied to the study of 40 kinds of diseases that are related to more than 500 thousand SNPs [
Osteoporosis is a complex, polygenic disease of the skeletal system, with the heritability of bone mass being about 60–80% [
Computational biology refers to the development and application of data-analytical and theoretical methods, mathematical modeling, and computational simulation techniques to the study of biological, behavioral, and social systems [
The methods of computational biology can also be used to study and understand these osteoporosis-susceptibility genes and the function of SNPs. Sequence information for all the osteoporosis-associated genes and SNPs (including linkage disequilibrium (LD) SNPs) was collected and aggregated from the National Center for Biotechnology Information (NCBI) database, and the effects of osteoporosis GWAS-associated lead SNPs and their linked SNPs on transcription factor (TF) binding affinity were studied through the JASPAR database. At the same time, the osteoporosis GWAS-associated genes were analyzed with the online Protein-Protein Interaction (PPI) network analysis tool STRING. Combining this with GO and pathway analysis, we found that the associated hub proteins and the Wnt signaling pathway were related to mesenchymal stem cell differentiation and to hormone signaling related to the metabolism of osteoporosis [
In the BIBM workshop paper [
We identified the suspected risky SNPs associated with osteoporosis by algorithm based on the analysis of osteoporosis GWAS-associated SNPs with the method mentioned above [
Process to identify the suspected risky SNPs associated with osteoporosis.
Based on the modular property of genetic diseases, many scholars have recently proposed prioritization algorithms to predict disease-causing genes based on the PPI, Human Disease Network, and DISEASOME [
Kohler proposed a random walk method for candidate-gene prioritization based on the global network distance in the PPI network. The results indicate that it is more effective than algorithms based on local network distance [
An undirected graph The probability distribution of time The state transition is not related to the value of
Based on the theoretical model above, a random walk on a graph is defined as a walker's iterative transition from its current node to a randomly selected neighbor, starting at a given source node [
The transition probability matrix
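The iterative walk described above can be sketched in a few lines. The following is a minimal illustration, assuming the restart variant of the random walk that is commonly used for candidate-gene prioritization on a PPI network; the restart probability `restart`, the convergence tolerance, and the toy graph with node names `seed`, `n1`, `n2`, and `n3` are hypothetical and not taken from this study.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    adjacency: dict mapping each node to the set of its neighbours (undirected).
    seeds:     set of source nodes (e.g. known disease-associated genes).
    restart:   probability of jumping back to a seed at each step (assumed value).
    """
    nodes = sorted(adjacency)
    # Initial distribution: equal probability on every seed node.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: part of the probability mass returns to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        # Walk term: each node passes the rest of its mass equally to its neighbours.
        for u in nodes:
            degree = len(adjacency[u])
            if degree == 0:
                continue  # isolated nodes have no neighbour to move to
            share = (1.0 - restart) * p[u] / degree
            for v in adjacency[u]:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network: a chain seed - n1 - n2 - n3.
ppi = {
    "seed": {"n1"},
    "n1": {"seed", "n2"},
    "n2": {"n1", "n3"},
    "n3": {"n2"},
}
scores = random_walk_with_restart(ppi, seeds={"seed"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this sketch, nodes closer to the seed in the network receive higher scores, which is the property used to rank candidate genes against known disease genes.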
The ID3 decision tree algorithm is a classification algorithm that builds a tree structure [
SNPs located within the promoter or distant enhancer region of genes may alter the binding of TFs with DNA and subsequently regulate gene expression [
At each split, the decision tree algorithm chooses the attribute that yields the maximum information gain, and it searches the decision space with a top-down greedy strategy.
For the training set
The greater the value of information entropy
When the attribute
The information entropy of attribute
Based on the formulas above, we built a top-down decision tree and classified the training instances by choosing, at each node, the attribute with the maximum information gain.
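The attribute selection step can be illustrated with a short sketch of the standard entropy and information gain calculations. The four rows and the class labels `risk` and `neutral` below are hypothetical toy values chosen for illustration; only the attribute names `bda` and `td` follow the feature names used in this paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    total = len(rows)
    # Weighted entropy of the partitions produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Greedy ID3 choice: the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy training set (not data from this study).
toy_rows = [
    {"bda": "Y", "td": "Y"},
    {"bda": "Y", "td": "N"},
    {"bda": "N", "td": "Y"},
    {"bda": "N", "td": "N"},
]
toy_labels = ["risk", "risk", "neutral", "neutral"]

gain_bda = information_gain(toy_rows, toy_labels, "bda")
gain_td = information_gain(toy_rows, toy_labels, "td")
chosen = best_attribute(toy_rows, toy_labels, ["bda", "td"])
```

In this toy set, `bda` separates the two classes perfectly while `td` carries no information, so the greedy step selects `bda` as the split attribute.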
However, overfitting cannot be avoided when the training set contains many noise samples, because the ID3 decision tree algorithm then constructs an overly complicated classification decision tree. To solve this problem, a PEP (Pessimistic-Error Pruning) algorithm was applied to the decision tree built by ID3. PEP is the most accurate top-down pruning strategy, and it handles pruning without splitting a separate pruning set off from the training set.
We define a decision tree
Before pruning, we define
We define
Apparently, the formula for error rate of the subtree
Therefore, we deduce the continuity correction factor for the subtree
In order to simplify the formula, we define
Therefore, the error sample number of the subtree
Similarly, the formula for the error sample number of subtree
Finally, we deduce from formulas above that the subtree
The process of the PEP algorithm is as follows:
We classified the suspected risky SNPs effectively based on their loci characteristics and studied their functions according to the ID3 decision tree algorithm with PEP.
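The pruning rule can be sketched as follows. This is a minimal illustration assuming the standard textbook formulation of pessimistic-error pruning: a continuity correction of 1/2 per leaf, and a subtree replaced by a leaf when the corrected error of the leaf does not exceed the corrected error of the subtree plus one standard error. The `Node` class and the class counts are hypothetical and are not taken from the tree built in this study.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    counts: dict                                   # class label -> training samples reaching this node
    children: list = field(default_factory=list)   # empty list means the node is a leaf

    @property
    def n(self):
        return sum(self.counts.values())

    @property
    def errors(self):
        # Samples misclassified if this node predicted its majority class.
        return self.n - max(self.counts.values(), default=0)


def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def pep_prune(node):
    """Top-down pessimistic-error pruning; modifies the tree in place."""
    if not node.children:
        return node
    subtree_leaves = leaves(node)
    # Corrected error of the subtree: leaf errors plus 1/2 per leaf.
    subtree_err = sum(l.errors for l in subtree_leaves) + 0.5 * len(subtree_leaves)
    # Corrected error if the whole subtree were replaced by a single leaf.
    node_err = node.errors + 0.5
    # Standard error of the corrected subtree error (guarded against negatives).
    std_err = sqrt(max(0.0, subtree_err * (node.n - subtree_err) / node.n))
    if node_err <= subtree_err + std_err:
        node.children = []          # prune: the subtree becomes a leaf
        return node
    for child in node.children:
        pep_prune(child)
    return node


# Hypothetical subtree that mostly fits noise: the split barely reduces errors.
noisy = Node({"A": 18, "B": 2}, [Node({"A": 17, "B": 1}), Node({"A": 1, "B": 1})])
# Hypothetical subtree with a useful split: the children are pure.
useful = Node({"A": 10, "B": 10}, [Node({"A": 10}), Node({"B": 10})])

pep_prune(noisy)
pep_prune(useful)
```

Here the `noisy` subtree is collapsed into a leaf, while the `useful` split is kept because removing it would raise the corrected error well beyond one standard error.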
By the end of 2014, nine GWAS and nine meta-analyses had reported 107 genes and 129 SNPs (lead SNP) that were associated with BMD, osteoporosis, or fractures with a significant threshold of
PPI of osteoporosis GWAS-associated genes (the pink nodes indicated those which had interactions with the osteoporosis GWAS-associated genes, and the yellow nodes indicated the osteoporosis GWAS-associated genes).
The result was verified by 10-fold cross-validation based on the data set of osteoporosis GWAS-associated genes and SNPs. We divided the data set of 129 osteoporosis GWAS-associated lead SNPs and 222 SNPs linked with them into 10 subsets. In each round, one subset was held out as the validation set to verify the model, and the other 9 subsets were used as the training set. The verification process was repeated 10 times so that each subset served as the validation set exactly once, and the accuracy was calculated each time.
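The partitioning step of this procedure can be sketched as below. The function name `k_fold_indices` and the random seed are hypothetical; the total of 351 samples is simply the 129 lead SNPs plus the 222 linked SNPs mentioned above, and how the folds were actually drawn in this study is not specified.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # assumed: random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized subsets
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


# 129 lead SNPs + 222 linked SNPs = 351 samples.
splits = list(k_fold_indices(351, k=10))
fold_sizes = [len(validation) for _, validation in splits]
```

Each sample appears in exactly one validation set, so averaging the per-fold accuracy gives an estimate in which every sample has been used for validation once.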
We set a threshold
Result of random walk (the ten colors of the points indicated ten 10-fold cross-validation, and the same color of points indicated the validation process. The points connected by a line were the average recall value of ten experiments. The
The classification result was also verified by 10-fold cross-validation. The osteoporosis GWAS-associated SNPs were used as the data set. The SNPs of training set were classified based on their loci features. Part of classification of the training set was shown in Table
Part of the classification of training set.
SNP | bda | td | Enhancer | Gene region | Class |
---|---|---|---|---|---|
rs7524102 | Y | Y | Y | Intergenic | C |
rs34920465 | Y | Y | Y | Control region | D |
rs6426749 | Y | N | Y | Control region | G |
rs1430742 | N | N | N | Coding sequence | B |
rs6929137 | Y | Y | Y | Missense | A |
rs479336 | Y | Y | N | Coding sequence | K |
rs11898505 | Y | N | Y | Intergenic | F |
rs17040773 | Y | Y | Y | Coding sequence | E |
rs344081 | Y | N | Y | Coding sequence | H |
rs6909279 | Y | Y | N | Intergenic | I |
(a) The first column lists part of the osteoporosis GWAS-associated SNPs; (b) the columns “bda,” “td,” and “enhancer” indicate whether the SNP significantly affects TF binding affinity, maps to a distal interaction, and maps to a putative enhancer region, respectively; (c) the last column is the category the SNP belongs to.
The validation process was then repeated ten times, and the average accuracy rate and average classification reliability were calculated. The result was shown in Figure
Result of ID3 decision tree (the blue credibility refers to the average accuracy values of 10-fold cross-validation, and the orange credibility refers to the average reliability value).
We also used data from genome-wide association studies (GWAS) of type 2 diabetes (T2D) as negative data to verify our method [
We then used PEP with the ID3 decision tree to construct a simplified classification decision tree. We combined the two steps of the risky SNP identification method and verified the method by 10-fold cross-validation. We found that using the ID3 decision tree algorithm with PEP in the identification method improved both the computational efficiency and the accuracy rate of the result. The improvement arises because pruning removed the subtrees that had been constructed from noise samples, which addressed the overfitting problem. Denoting the ID3 decision tree algorithm with PEP as ID3-PEP and the unpruned ID3 decision tree algorithm as ID3, the comparison of these two classification algorithms in the identification method was described by Figure
Comparison of two classification algorithms (the blue credibility refers to the classification accuracy by ID3 algorithm, and the yellow credibility refers to the classification accuracy by ID3 decision tree algorithm with PEP).
C4.5 is an optimization of ID3. Both learn from the training set and build a classification decision tree in the same way; they differ in how the split attribute is chosen. The C4.5 algorithm splits on the attribute with the maximum information gain ratio. In order to solve the problem of overfitting in the ID3 decision tree algorithm, the C4.5 algorithm needs to scan and sort the data set at every step, which gives it low operational efficiency. The ID3-PEP algorithm avoided this problem and was more accurate than C4.5. We compared these two algorithms through their ROC curves, which is shown in Figure
The comparison of ID3-PEP and C4.5.
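The difference in split criterion between ID3 and C4.5 can be made concrete with a small sketch of the standard gain ratio, which divides the information gain by the entropy of the attribute's own value distribution. The toy rows, the `snp_id` attribute, and the class labels below are hypothetical and serve only to show how the ratio penalizes attributes with many distinct values.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(rows, labels, attribute):
    """ID3 criterion: entropy reduction obtained by splitting on the attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    total = len(rows)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())


def gain_ratio(rows, labels, attribute):
    """C4.5 criterion: information gain normalized by the split information."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute has a single value and cannot split the data
    return information_gain(rows, labels, attribute) / split_info


# Hypothetical toy data: 'snp_id' is unique per row, 'bda' is a real binary feature.
toy_rows = [
    {"snp_id": "s1", "bda": "Y"},
    {"snp_id": "s2", "bda": "Y"},
    {"snp_id": "s3", "bda": "N"},
    {"snp_id": "s4", "bda": "N"},
]
toy_labels = ["risk", "risk", "neutral", "neutral"]

ratio_bda = gain_ratio(toy_rows, toy_labels, "bda")
ratio_id = gain_ratio(toy_rows, toy_labels, "snp_id")
```

Both attributes have the same information gain on this toy set, so plain ID3 cannot distinguish them, whereas the gain ratio prefers `bda` over the identifier-like `snp_id`.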
Since SNP plays a key role in the process of pathology and susceptibility of osteoporosis [
The result of the experiment above showed that the identification method for risky SNPs of osteoporosis was correct and effective. Our method efficiently achieved the process of identifying osteoporosis suspected risky SNPs.
However, the identification method still needs to be improved. At present, the loci features of suspected risky SNPs associated with osteoporosis and the interactors of the associated genes have to be searched manually. In addition, the training set for our method consists of the known osteoporosis GWAS-associated SNPs, which is not large enough to identify the risky SNPs accurately. Therefore, further research is needed. First, a workflow can be constructed to improve the identification process, aiming to automatically identify the features of suspected risky SNPs. Second, in order to improve the accuracy of our method, more features of the SNPs should be examined, such as the conservation of SNPs and the influence of the SNPs on miRNA binding sites. Finally, we plan to use our method to predict risky SNPs associated with osteoporosis by constructing the PPI network of all human genes.
The authors declare that there are no conflicts of interest regarding the publication of this paper and the received funding did not lead to any conflicts of interest regarding the publication of this manuscript.
This research is supported by the National Natural Science Foundation of China (Grants nos. 61532008 and 31371275), the National Social Science Foundation of China (no. 14BYY093), and the Fundamental Research Funds for the Central Universities (no. CCNU17TS0003).