Comparison of Classification Algorithms with Wrapper-Based Feature Selection for Predicting Osteoporosis Outcome Based on Genetic Factors in a Taiwanese Women Population

An essential task in a genomic analysis of a human disease is limiting the number of strongly associated genes when studying susceptibility to the disease. The goal of this study was to compare computational tools with and without feature selection for predicting osteoporosis outcome in Taiwanese women based on genetic factors such as single nucleotide polymorphisms (SNPs). To elucidate relationships between osteoporosis and SNPs in this population, three classification algorithms were applied: multilayer feedforward neural network (MFNN), naive Bayes, and logistic regression. A wrapper-based feature selection method was also used to identify a subset of major SNPs. Experimental results showed that the MFNN model with the wrapper-based approach was the best predictive model for inferring disease susceptibility based on the complex relationship between osteoporosis and SNPs in Taiwanese women. The findings suggest that patients and doctors can use the proposed tool to enhance decision making based on clinical factors such as SNP genotyping data.

Several gene polymorphisms may cooperatively contribute to the development of osteoporosis in Taiwanese women. Accumulating evidence reveals that SNPs are potential genetic markers for predicting osteoporosis outcome in Taiwanese women [9]. Chang et al. [19] also proposed a novel odds ratio-based genetic algorithm (OR-GA) method that uses odds ratios to quantitatively measure the disease risk associated with various SNP combinations and thereby determine susceptibility to osteoporosis in Taiwanese women. Taiwanese women who carry risk alleles in two or more of these SNPs are likely to be at increased risk of osteoporosis because several partial deficiencies in these pathways may severely diminish bone density. Therefore, SNPs, and interacting polymorphisms in particular, may indicate risk of osteoporosis in Taiwanese women and may be useful in clinical association studies to determine the genetic basis of disease susceptibility.

In [9], the effects of age, BMI, and genetic factors on BMD were evaluated in pre- and postmenopausal Taiwanese women. Eleven interacting polymorphisms in nine genes were studied in terms of their effects on the incidence of low BMD (Table 1). Combinations of SNPs were evaluated for genotype associations in women with osteoporosis. The findings showed that specific SNP combinations may be risk factors for postmenopausal osteoporosis in Taiwanese women. In addition to these specific SNP combinations, BMI and age also showed independent associations with BMD in postmenopausal Taiwanese women.

International Journal of Endocrinology

Table 1: Panel of 11 SNPs [9].

No.  Gene           rs number   Genotype 1  Genotype 2  Genotype 3
1    TNFα-857       rs1799724   TT          TC          CC
2    TGFβ1-509      rs1800469   TT          TC          CC
3    Osteocalcin    rs1800247   CC          CT          TT
4    TNFα-308       rs1800629   AA          AG          GG
5    PTH (BstB I)

Although an apparent association between SNPs and osteoporosis has been identified in Taiwanese women, a continuing challenge in genomic studies of Taiwanese women lies in identifying significant genes. Exhaustive computation over the model space is infeasible if the model space is very large, as there are $2^p$ models with $p$ SNPs [20,21]. Feature selection techniques are designed to find the genes and SNPs responsible for certain diseases. By selecting a small number of SNPs with significantly larger effects compared to other SNPs and by disregarding SNPs of lesser significance, researchers can focus on the most promising candidate genes and SNPs for use in diagnosis and therapy [21,22].
In [9], combined polymorphisms in different genomic regions were evaluated for associations with BMD variation. The findings showed that a combination of several gene polymorphisms contributes to the development of osteoporosis in Taiwanese women. However, that study did not report a subset of SNPs that can be used to predict osteoporosis outcome in this population. Therefore, the current study used the same dataset as [9] to elucidate the relationship between osteoporosis and SNPs in Taiwanese women through a performance comparison of three different classification algorithms with wrapper-based feature selection [23]: multilayer feedforward neural network (MFNN) [24][25][26][27][28], naive Bayes [29], and logistic regression [30].

MFNNs have proven particularly effective for nonlinear mapping based on human knowledge and are now attracting interest for use in solving complex classification problems [24]. An MFNN containing layers of simple computing nodes, which is analogous to brain neural networks, has proven effective for approximating nonlinear continuous functions and for revealing previously unknown relationships between given input and output variables [25,26]. The unique structure of MFNNs enables them to learn by using algorithms such as backpropagation and evolutionary algorithms [31,32]. Potential medical applications of MFNNs include solving problems in which the relationship between independent variables and clinical outcome is poorly understood [33]. Because MFNNs are capable of self-training with minimal human intervention, many studies of large epidemiology databases have used MFNNs, in addition to conventional statistical methods, for further insight into the interrelationships among variables.

A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. The classifier obtained by using this set of discriminant functions and by estimating the relevant probabilities from the training set is often called the naive Bayesian classifier because, if the attributes are "naively" assumed to be independent given the class, direct application of the Bayes theorem easily confirms that this classifier is optimal in terms of minimizing the misclassification rate or zero-one loss [34,35].

Logistic regression is a statistical method of predicting the outcome of a variable that is categorical (i.e., it can have several different categories) and is dependent on one or more predictor variables. A logistic function can be used to model the probabilities describing the possible outcome of a single trial as a function of explanatory variables. Logistic regression is typically used to measure the relationship between a categorical dependent variable and one or more continuous independent variables by converting the dependent variable to probability scores [36].
The wrapper-based feature selection method [23], in which the feature selection algorithm acts as a wrapper around the classification algorithm, was also used to identify an SNP subset with sufficient predictive power to distinguish between high- and low-risk alleles. In the wrapper-based approach, the function used to evaluate feature subsets uses the classification algorithm itself to perform a best-first search for a good subset [23]. Starting from an empty feature set, it searches forward for potential feature subsets by performing greedy hill climbing augmented with a backtracking technique [37]. The wrapper-based feature selection method is applied here because Huang et al. [21] showed that it may be superior to hybrid approaches combining chi-square and information-gain methods reported in the literature. A comprehensive literature review found no attempts to predict osteoporosis outcome in Taiwanese women using genetic factors (SNPs) and the three above-mentioned classification algorithms with the wrapper-based feature selection method. This study therefore compared the performance of three classification algorithms (MFNN, naive Bayes, and logistic regression) with and without wrapper-based feature selection. Identifying the genes and SNPs associated with osteoporosis in the Taiwanese female population would enable researchers to focus on the candidate genes and SNPs that are most promising for use in diagnosis and therapy. The results of this study could be generalized to SNP searches in genetic studies of human disorders and to the development of new molecular diagnostic/prognostic tools. However, before routine application of genomic analysis in clinical practice, genetic markers must be validated in prospective clinical trials.

Materials and Methods
2.1. Subjects. The dataset in this study, which included SNPs, age, menopause, and BMI, was the same dataset used in a previous study by the first author of this paper [9]. The T-score was calculated according to WHO classifications using a locally derived reference range provided by the manufacturer. The subjects were divided into two BMD groups according to T-score [38][39][40]. Subjects with T-scores > −1 were enrolled in the high BMD group, and those with T-scores ≤ −1 were enrolled in the low BMD group. The overall dataset was derived from 295 cases, including (i) 247 postmenopausal cases (83.73%) and 48 premenopausal cases (16.27%); (ii) 112 high BMD cases (37.97%) and 183 low BMD cases (62.03%). Table 2 presents the demographic characteristics of the study subjects. Post-menopause was defined as the absence of menstruation for >6 months or age ≥ 50 years [9]. Clinical data used for diagnosis were further converted into numerical form, that is, 1 for "high BMD" and 0 for "low BMD."

Candidate Genes. Table 1 shows the 11 SNPs analyzed in this study, which were the same as those analyzed previously by the first author of this paper [9]. The nine candidate genes included TNFα, transforming growth factor-beta 1 (TGFβ1), osteocalcin, parathyroid hormone (PTH), interleukin 1 receptor antagonist (IL1_ra), HSP, calcitonin receptor (CTR), and bone morphogenetic protein-4 (BMP-4), with three genotypes per locus.
2.3. Classification Algorithms. The three families of classification algorithms used as the basis for comparisons in this study were MFNN, naive Bayes, and logistic regression. These classifiers were implemented using the Waikato Environment for Knowledge Analysis (WEKA) software [37].
An MFNN is an artificial neural network (ANN) model in which connections between the units do not form a directed cycle [24][25][26][27][28][30]. From an algorithmic perspective, the underlying process of an MFNN can be divided into retrieving and learning phases [24]. Assume an $L$-layer feedforward neural network with $N_l$ units at the $l$th layer. In the retrieving phase, the MFNN iterates through all layers to produce the retrieval response $\{a_i(L)\}$ at the output layer based on test pattern inputs $\{a_i(0)\}$, the known weights of the network, and the nonlinear activation function (e.g., sigmoid function). In the learning phase of this MFNN, the backpropagation algorithm [30] and evolutionary algorithms [31,32] are used in the learning scheme. The backpropagation algorithm is used as a simple gradient descent approach. The weight updating mechanism is a backpropagation of corrective signals from the output layer to the hidden layers. The goal is to iteratively select a set of weights $\{w_{ij}(l)\}$ for all layers such that the squared error function $E$ is minimized over pairs of input training patterns $\{a_i(0)\}$ and target training patterns $\{t_i\}$. Mathematically, the iterative gradient descent formulation for updating each specific weight $w_{ij}(l)$ can be expressed by the following equation:

$$w_{ij}^{(m+1)}(l) = w_{ij}^{(m)}(l) - \eta \frac{\partial E}{\partial w_{ij}^{(m)}(l)},$$

where $\eta$ is the learning rate and $\partial E / \partial w_{ij}^{(m)}(l)$ can be effectively calculated through a numerical chain rule by backpropagating the error signal from the output layer to the input layer.
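The retrieving and learning phases described above can be illustrated with a minimal sketch. It is not the WEKA implementation used in this study: it assumes a single hidden layer, a single sigmoid output, per-pattern (online) updates, and a squared error of 0.5(y − t)² per pattern. The function names, the hidden-layer size, and the default learning rate `eta` are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_hid, w_out, x):
    """Retrieving phase: propagate input x through the hidden and output layers."""
    # A constant 1.0 is appended so the last weight in each row acts as a bias.
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hid]
    y = sigmoid(sum(w * v for w, v in zip(w_out, h + [1.0])))
    return h, y


def predict(w_hid, w_out, x):
    return forward(w_hid, w_out, x)[1]


def train_mfnn(patterns, targets, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    """Learning phase: gradient descent on squared error via backpropagation."""
    rng = random.Random(seed)
    n_in = len(patterns[0])
    w_hid = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            h, y = forward(w_hid, w_out, x)
            # Error signal at the output unit (chain rule through the sigmoid).
            delta_out = (y - t) * y * (1.0 - y)
            # Error signals backpropagated to the hidden units.
            delta_hid = [delta_out * w_out[j] * h[j] * (1.0 - h[j])
                         for j in range(n_hidden)]
            # Weight update: w <- w - eta * dE/dw.
            for j, v in enumerate(h + [1.0]):
                w_out[j] -= eta * delta_out * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hid[j][i] -= eta * delta_hid[j] * v
    return w_hid, w_out
```

In this sketch each input pattern is a list of floats (for example, numerically coded genotypes) and each target is 1.0 or 0.0, matching the high/low BMD coding used in this study.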
Structurally, however, an MFNN is a spatial and iterative neural network with several layers of hidden neuron units between the input and output neuron layers. The basic function of each neuron is the linear basis function, and activation is modeled with a non-decreasing and differentiable sigmoid function. This approach uses an MFNN to model osteoporosis outcome. Inputs contain the information about clinical factors, for example, SNPs, that are needed for the database. Outputs contain the information about the osteoporosis outcome.
In summary, the MFNN is trained first by repeatedly providing input-output training pairs and by executing the backpropagation learning algorithm. After this training process is complete, the MFNN is tested with the testing data.

Second, all features in naive Bayes, which is the simplest Bayesian network, are assumed to be conditionally independent [34]. Let $(X_1, X_2, \ldots, X_p)$ be features (i.e., SNPs) used to predict class $C$ (i.e., disease status, 1 = high BMD or 0 = low BMD). Given a data instance with genotype $(x_1, x_2, \ldots, x_p)$, the best prediction of the disease class is given by the class $c$ that maximizes the conditional probability $\Pr(C = c \mid X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p)$. Bayes theorem is used to estimate this conditional probability, which is decomposed into a product of conditional probabilities.
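The naive Bayes decomposition can be sketched as follows: the posterior for each class is proportional to the class prior multiplied by one per-SNP genotype likelihood per feature. This is a minimal illustration rather than the WEKA classifier used in this study. The function name, the add-one (Laplace) smoothing, and the assumption of three genotype levels per locus coded 1, 2, and 3 are choices made for the example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(genotypes, labels, n_levels=3):
    """Return a function mapping a genotype row to posterior class probabilities."""
    class_counts = Counter(labels)
    # counts[c][j][g] = number of class-c subjects with genotype g at SNP j
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(genotypes, labels):
        for j, g in enumerate(row):
            counts[c][j][g] += 1
    n = len(labels)

    def posterior(row):
        scores = {}
        for c, n_c in class_counts.items():
            p = n_c / n  # class prior Pr(C = c)
            for j, g in enumerate(row):
                # Smoothed estimate of Pr(X_j = g | C = c); smoothing is an assumption.
                p *= (counts[c][j][g] + 1) / (n_c + n_levels)
            scores[c] = p
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

    return posterior
```

The predicted class is the key with the largest posterior value, for example `max(post, key=post.get)` for `post = posterior(row)`.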
Third, the logistic regression generates the coefficients for the following formula used for logit transformation of the probability $p$ of a patient having a characteristic of interest: $\mathrm{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$ [41]. The formula used to calculate the probability of the characteristic of interest in this study is $p = 1 / (1 + e^{-\mathrm{logit}(p)})$, where 1 = high BMD and 0 = low BMD.
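As a small worked illustration of the logit transformation, the sketch below computes the linear predictor and maps it to a probability with the logistic function. The coefficient values in the usage note are hypothetical and are not estimates from this study.

```python
import math


def logit_score(beta0, betas, x):
    """Linear predictor: beta0 + beta1 * x1 + ... + betap * xp."""
    return beta0 + sum(b * v for b, v in zip(betas, x))


def probability(beta0, betas, x):
    """Logistic function applied to the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit_score(beta0, betas, x)))
```

For example, with hypothetical coefficients `beta0 = 0.0` and `betas = [0.0, 0.0]`, any input yields a probability of 0.5, and the log odds `log(p / (1 - p))` of the returned probability recovers the linear predictor.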

Feature Selection. The wrapper-based feature selection approach [23], in which a feature selection algorithm acts as a wrapper around a classification algorithm, was used to find a subset of SNPs that maximizes the performance of the prediction model. Figure 1 shows that, in the wrapper approach, the feature subset is selected by using a black box classification algorithm (i.e., selection is performed using the interface alone and does not require knowledge of the algorithm). To search for a good subset, the feature subset selection algorithm includes the classification algorithm itself in the evaluation function. The accuracy of the induced classifiers is estimated using accuracy estimation techniques. The search space is organized such that each state represents a feature subset. For $n$ features, each state has $n$ bits, and each bit indicates whether a feature is present (1) or absent (0). To determine the connectivity between the states, this study used operators that add or delete a single feature from each state, where the states correspond to the search space commonly used in stepwise methods [23]. Figure 2 shows an example of the state space and operators obtained by the stepwise method in a four-feature problem. The size of the search space for $n$ features is $O(2^n)$ [23]. In the four-feature example, the classification algorithms are used to calculate a performance measure for each of 16 different subsets.

Therefore, the wrapper-based approach conducts a best-first search for a good subset by including the classification algorithm itself (MFNN, naive Bayes, or logistic regression) in the feature subset evaluation [23]. To search for potential feature subsets, the best-first search starts from an empty feature set and searches forward by greedy hill climbing augmented with a backtracking technique [37]. Figure 3 shows how MFNN, naive Bayes, and logistic regression were applied in the wrapper-based approach.

Evaluating Predictive Performance. The performance of the prediction models was measured in terms of receiver operating characteristic (ROC) and area under the ROC curve (AUC) [42]. The AUC of a classifier can be interpreted as the probability of the classifier ranking a randomly chosen positive example higher than a randomly chosen negative one [42]. Most researchers have now adopted AUC for evaluating the predictive capability of classifiers since AUC is a better performance metric compared to accuracy [42]. This study used the AUC value for performance comparison of different prediction models using the same dataset. The higher the AUC, the better the learning performance [43]. Other calculations included sensitivity, the proportion of correctly predicted responders out of all tested responders, and specificity, the proportion of correctly predicted nonresponders out of all tested nonresponders.
To investigate the generalization of the prediction models produced by the above algorithms, the repeated 10-fold cross-validation method was used [44]. First, the whole dataset was randomly divided into ten distinct parts. The model was then trained with nine-tenths of the data and tested on the remaining tenth of the data to estimate its predictive performance. This procedure was repeated nine more times. Each time, a different tenth of the data was used as testing data, and a different nine-tenths of the data were used as training data. Finally, the average estimate over all runs was reported by running the above regular 10-fold cross-validation 100 times with different splits of data. In repeated 10-fold cross-validation testing, the performance of all models was evaluated with and without feature selection.
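The repeated 10-fold procedure can be sketched as an index generator plus an averaging helper. This is an illustration under stated assumptions: the splits are plain random partitions without class stratification, the seed and function names are arbitrary, and `fit_and_score` is a hypothetical callback that trains a model on the training indices and returns a performance value (for example, AUC) on the test indices.

```python
import random


def repeated_kfold_indices(n_samples, k=10, repeats=100, seed=0):
    """Yield (train, test) index lists for `repeats` rounds of k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a different random split in every repeat
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


def repeated_cv_score(n_samples, fit_and_score, k=10, repeats=100, seed=0):
    """Average the per-fold scores over all folds and all repeats."""
    scores = [fit_and_score(train, test)
              for train, test in repeated_kfold_indices(n_samples, k, repeats, seed)]
    return sum(scores) / len(scores)
```

With 295 cases, each test fold in this sketch contains 29 or 30 subjects, and 100 repeats produce 1,000 train/test evaluations per model.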

Results
Tables 3 and 4 summarize the results of the repeated 10-fold cross-validation experiments for MFNN, naive Bayes, and logistic regression using SNPs with and without feature selection. First, the AUC, sensitivity, and specificity were calculated for the three predictive models without wrapper-based feature selection. Table 3 shows that the average AUC values for the MFNN, the naive Bayes, and the logistic regression prediction models were 0.489, 0.462, and 0.485, respectively. In terms of AUC, the MFNN model (AUC = 0.489) outperformed the naive Bayes (AUC = 0.462) and logistic regression (AUC = 0.485) models.
The classifiers were also compared with and without feature selection. Feature selection using the wrapper-based approach clearly improved performance in the MFNN and logistic regression models (Tables 3 and 4). Table 4 shows that the AUCs did not significantly differ between the MFNN model with wrapper-based feature selection (AUC = 0.631) and the logistic regression model with wrapper-based feature selection (AUC = 0.620). However, the MFNN classifier with wrapper-based feature selection required fewer SNPs (four) than the logistic regression classifier with wrapper-based feature selection. That is, by selecting a small number of SNPs with significantly larger effects compared to other SNPs and by disregarding relatively insignificant SNPs, the MFNN model with wrapper-based feature selection successfully identified a subset of four major SNPs that could be used to predict osteoporosis outcome in the study population (rs1800469 (TGFβ1-509), VNTR (IL1_ra), rs2227956 (HSP70 hom), and rs1801197 (CTR)). After confirming that the MFNN model outperforms the logistic regression model, the next objective was finding the candidate genes and SNPs that are most promising for diagnosing osteoporosis, designing therapies, and predicting outcome in the studied population of Taiwanese women with osteoporosis.

Discussion
This study compared three classification algorithms (MFNN, naive Bayes, and logistic regression), with and without feature selection, in terms of accuracy in predicting osteoporosis outcome in a population of Taiwanese women. Accounting for models is not a trivial task because even a relatively small set of candidate genes yields a large number of possible models [20]. For example, the 11 candidate SNPs studied yielded $2^{11}$ (2,048) possible models. The three classifiers were chosen for comparison because they represent different modeling techniques: neural network (MFNN), probabilistic (naive Bayes), and regression (logistic regression) models [43]. The proposed procedures can also be implemented using the publicly available software WEKA [37] and are thus easily applicable in genomic studies. To the best of our knowledge, this study is the first to propose the use of three classification algorithms (MFNN, naive Bayes, and logistic regression) and the wrapper-based feature selection method for modeling osteoporosis outcome in Taiwanese women based on genetic factors such as SNPs.
In this paper, the wrapper-based feature selection approach was used to find a subset of SNPs that maximizes the performance of the prediction model according to how feature selection search is incorporated in the classification algorithms. The results showed that the MFNN classifier with the wrapper-based approach was superior to the other tested algorithms and achieved the greatest AUC with the smallest number of SNPs when distinguishing between high and low BMD in Taiwanese women. These results suggest that the MFNN model is a good method of modeling complex nonlinear relationships among clinical factors and the responsiveness of osteoporosis outcome in Taiwanese women. The wrapper-based approach does not require knowledge of the classification algorithm used in the feature selection process, in which features are optimized by using the classification algorithm as part of the evaluation function [21,23]. Another advantage of the wrapper-based method is its inclusion of the interaction between feature subset search and the classification model [21]. However, the risk of overfitting is high when using the wrapper-based method [21,45]. In the current study, use of the wrapper-based feature selection approach to assess high and low BMD individuals revealed a panel of genetic markers, including TGFβ1-509, IL1_ra, HSP70 hom, and CTR, which were more prominent compared to other markers observed in the examined population of Taiwanese women with osteoporosis.
A noted limitation of this study is that, due to the small sample size, the AUC values were too low (<0.7) to obtain good dataset classifications. A dataset based on a larger sample size is needed for improved accuracy. Therefore, further prospective clinical trials are recommended to determine whether the observed outcome associations with these candidate genes are reproducible in a larger population of Taiwanese women with osteoporosis.

Conclusion
This study used an MFNN methodology with the wrapper-based feature selection method to predict osteoporosis outcome in Taiwanese women based on clinical factors such as SNPs. The trained MFNN model showed good responsiveness in inferring osteoporosis outcome. The findings suggest that patients and doctors can use the proposed tool to enhance decision making based on clinical factors such as SNP genotyping data. However, genetic markers require validation in further prospective clinical trials before routine clinical use of genomic analysis for predicting osteoporosis outcome.