Logistic regression (LR) is a conventional statistical technique used for data classification problems. It is a model-based method that uses a nonlinear model structure. Another technique used for classification is the feedforward artificial neural network, a data-based method that can capture nonlinear relationships through its activation function. In this study, a hybrid approach combining model-based logistic regression and data-based artificial neural networks is proposed for classification. The proposed approach was applied to lung cancer data, and its results were compared with those of logistic regression and feedforward artificial neural networks used on their own. The proposed hybrid approach was superior to both with respect to many criteria.
Data classification problems are encountered in many fields, such as medicine and economics. The basic statistical method used for data classification in the literature is logistic regression. Logistic regression is a white-box method in which the efficacy of the explanatory variables in the model can be tested. Another technique used for classification is the artificial neural network (ANN). Although the artificial neural network, a black-box method, has strong modeling ability, its coefficients cannot be interpreted. Its advantage over logistic regression, however, is that it is a data-based rather than a model-based approach. In their study, Dreiseitl and Ohno-Machado [
LR is a regression method for predicting a binary dependent variable. The dependent variable takes the value 0 or 1. The conditional probability of the dependent variable is given as follows:
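In the standard formulation of logistic regression, this probability has the logistic form written out here. The notation is assumed for illustration only: $Y$ is the binary dependent variable, $x_1, \dots, x_k$ are the explanatory variables, and $\beta_0, \dots, \beta_k$ are the coefficients to be estimated. The second expression is the equivalent log-odds (logit) form, which is linear in the coefficients.

```latex
P(Y = 1 \mid x_1, \dots, x_k)
  = \frac{\exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_j\right)}
         {1 + \exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_j\right)},
\qquad
\log \frac{P(Y = 1 \mid x_1, \dots, x_k)}{1 - P(Y = 1 \mid x_1, \dots, x_k)}
  = \beta_0 + \sum_{j=1}^{k} \beta_j x_j .
```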
An artificial neural network is a data-processing mechanism created by simulating human nerve cells and the nervous system in a computer environment. Its most important feature is its ability to learn from examples. Despite having a simpler structure than the human nervous system, artificial neural networks give successful results in problems such as forecasting, pattern recognition, and classification.
Although there are many types of artificial neural networks in the literature, feedforward artificial neural networks are frequently used for many problems. Feedforward artificial neural networks consist of an input layer, one or more hidden layers, and an output layer. An example of a feedforward artificial neural network architecture is shown in Figure
Multilayer feedforward artificial neural network with one output neuron.
In feedforward artificial neural networks, learning is the determination of the weights that generate outputs closest to the target values corresponding to the network's inputs. Learning is achieved by minimizing the total error with respect to the weights. Several training algorithms for feedforward artificial neural networks exist in the literature. One of the most widely used is the Levenberg-Marquardt (LM) algorithm, which was also used in this study. The MATLAB Neural Network Toolbox was used for the ANN solutions.
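The relationship between weights, outputs, and total error can be illustrated with a minimal sketch of a one-hidden-layer network with a single output neuron. It is not the MATLAB implementation used in the study: the logistic activation, the sum-of-squares error, and all function and variable names are assumptions made for illustration, and the training step itself (for example Levenberg-Marquardt) is not shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic activation (assumed; the text does not name one)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network with one output neuron.

    x        : list of input values
    w_hidden : one list of input weights per hidden neuron (same length as x)
    b_hidden : one bias per hidden neuron
    w_out    : one hidden-to-output weight per hidden neuron
    b_out    : bias of the output neuron
    """
    hidden = [
        sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(b_out + sum(v * h for v, h in zip(w_out, hidden)))


def sum_squared_error(inputs, targets, params):
    """Total error over the data; training searches for weights that minimize it."""
    return sum((t - forward(x, *params)) ** 2 for x, t in zip(inputs, targets))


# Hypothetical toy network: 2 inputs, 2 hidden neurons, all weights and biases zero.
params = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0], 0.0)
output = forward([1.0, 2.0], *params)
error = sum_squared_error([[1.0, 2.0]], [1.0], params)
```

A training algorithm would repeatedly adjust `params` so that `sum_squared_error` decreases on the training data.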
Logistic regression is a model-based white-box method in which the coefficients can be interpreted. Therefore, the efficacy of the explanatory variables in the model can be tested, and variable selection can be done easily. Forward, backward, and stepwise selection methods have been used for variable selection in logistic regression. Additionally, Bayesian variable selection methods were applied to the logistic regression model in Chen and Dey [
Flow diagram of proposed hybrid method.
The data consist of 178 observations. Patients who presented with hemoptysis to the Ondokuz Mayıs University Department of Chest Diseases were prospectively evaluated between November 2003 and September 2006. Posteroanterior chest radiography, complete blood count, and renal and hepatic function tests were performed for each patient. Other examinations, such as thorax computed tomography, bronchoscopy, and various laboratory and pathological diagnostic modalities, were done if needed. Of the 178 observations, 160 were used as training data, and 18 randomly selected observations were used as test data. The LR method was applied to the data first. Stepwise variable selection was applied, and four significant independent variables were selected: age, time of the hemoptysis (THM), number of hemoptysis (NHM), and RAL. Summarized information about these variables is given below.
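The random split into 160 training and 18 test observations can be sketched as follows. This is only an illustration of the idea: the function name, the fixed seed, and the use of Python's `random` module are assumptions, and the study's actual sampling procedure is not described beyond the test observations being randomly selected.

```python
import random

N_TOTAL = 178  # observations in the data set
N_TEST = 18    # randomly selected test observations


def split_indices(n_total, n_test, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split.

    The seed is a hypothetical choice that only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    test = sorted(rng.sample(range(n_total), n_test))
    test_set = set(test)
    train = [i for i in range(n_total) if i not in test_set]
    return train, test


train_idx, test_idx = split_indices(N_TOTAL, N_TEST)
```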
Hepatic function test. The hepatic function test, also known as the liver function test (LFT), is used to evaluate the liver for injury, infection, or inflammation. This test measures the blood levels of total protein, albumin, bilirubin, and liver enzymes. Enzymes that are often measured in LFTs include gamma-glutamyl transferase (GGT), alanine aminotransferase (ALT or SGPT), aspartate aminotransferase (AST or SGOT), and alkaline phosphatase (ALP). LFTs may also include prothrombin time (PT), a measure of how long it takes for the blood to clot. High or low levels may mean that liver damage or disease is present.
Time of hemoptysis (THM). Hemoptysis is the expectoration, or coughing up of blood, from the lower respiratory tract. There are three types: minor, moderate, and massive hemoptysis. Distinguishing between minor, moderate, and massive hemoptysis is important, as severity determines the need for emergency treatment. Minor hemoptysis is defined as small specks of blood or clots in the patient’s sputum and is generally not life threatening. Moderate hemoptysis can include larger clots up to the loss of 200 mL of blood within a 24 hour period. Finally, massive hemoptysis is anything greater than a loss of 200 mL of blood within 24 hours and is always a medical emergency, as patient asphyxiation can occur rapidly.
Number of hemoptysis (NHM). Hemoptysis can be classified as mild, moderate, or massive, depending on the amount of blood expectorated: < 100 mL in 24 h (mild); 100–600 mL in 24 h (moderate).
RAL. RAL is a breath sound, like a scrunch, from the lungs.
Complete blood count (CBC). The complete blood count is often used as a broad screening test to determine an individual’s general health status. It can be used to screen for a wide range of conditions and diseases.
Parameter estimations, standard errors of estimation, and significance values of these estimations are given in a Table
Estimation results of logistic regression.
Variables | Estimations | Standard error | Test statistic | Significance |
---|---|---|---|---|
Constant | −5.19 | 1.02 | −5.07 | 0.00 |
Age | 0.06 | 0.01 | 4.03 | 0.00 |
Time of HM | 0.03 | 0.01 | 2.36 | 0.01 |
Number of HM | 1.44 | 0.45 | 3.20 | 0.00 |
RAL | −1.26 | 0.49 | −2.56 | 0.01 |
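To show how such estimates turn into a predicted probability, the following minimal sketch plugs the reported coefficients into the logistic function. The patient values are hypothetical, and the coding of the variables (the units of time of HM, and 0/1 coding for number of HM and RAL) is an assumption, since the coding is not stated here. The sketch also checks that, for the number of HM row, the reported statistic matches the estimate divided by its standard error.

```python
import math

# Estimates and standard errors as reported in the estimation results table.
ESTIMATES = {"constant": -5.19, "age": 0.06, "thm": 0.03, "nhm": 1.44, "ral": -1.26}
STD_ERRORS = {"constant": 1.02, "age": 0.01, "thm": 0.01, "nhm": 0.45, "ral": 0.49}


def predicted_probability(age, thm, nhm, ral):
    """Logistic-regression probability of the positive class for one patient."""
    linear = (
        ESTIMATES["constant"]
        + ESTIMATES["age"] * age
        + ESTIMATES["thm"] * thm
        + ESTIMATES["nhm"] * nhm
        + ESTIMATES["ral"] * ral
    )
    return 1.0 / (1.0 + math.exp(-linear))


def estimate_to_se_ratio(name):
    """Estimate divided by its standard error (a Wald-type statistic)."""
    return ESTIMATES[name] / STD_ERRORS[name]


# Hypothetical patient; the variable coding used here is assumed, not from the study.
p = predicted_probability(age=60, thm=10, nhm=1, ral=0)
nhm_ratio = estimate_to_se_ratio("nhm")
```

Because the published estimates are rounded to two decimals, ratios recomputed this way agree only approximately with the reported statistics for some rows.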
ROC curve for training data in logistic regression.
ROC curve for test data in logistic regression.
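The area under an ROC curve (AUC) equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise computation follows; the function name and the toy labels and scores are hypothetical and are not taken from the lung cancer data.

```python
def auc(labels, scores):
    """AUC by pairwise comparison of positive and negative scores.

    labels : list of 0/1 class labels
    scores : list of predicted scores or probabilities, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 3 of the 4 positive-negative pairs are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```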
When Table
Second, the feedforward artificial neural network method was applied to the data. The inputs of the ANN are the independent variables age, time of HM, number of HM, and RAL. The target of the ANN is the diagnosis of lung cancer. The architecture of the ANN used is given in Figure
Optimal weights of neural network.
Hidden layer neuron | Input neuron 1 | Input neuron 2 | Input neuron 3 | Input neuron 4 | Output neuron | Bias (input-hidden) | Bias (hidden-output) |
---|---|---|---|---|---|---|---|
1 | 0.32 | 0.32 | −1.57 | −6.13 | 6.371389 | −13.34 | 0.50 |
2 | −0.52 | −0.19 | −0.82 | 3.89 | 8.982728 | −5.31 | |
3 | −0.13 | −1.64 | −10.51 | −2.97 | −5.06077 | −8.56 | |
4 | 1.04 | 1.36 | 5.78 | 6.87 | −7.04415 | −11.90 | |
5 | 0.16 | 0.62 | −8.57 | 1.71 | −4.95642 | 12.34 | |
6 | 0.31 | −0.08 | 3.75 | −3.69 | 4.679567 | −15.32 | |
7 | −1.14 | 3.94 | 12.95 | −7.24 | 3.461384 | 7.62 | |
8 | −1.50 | −0.46 | −2.81 | 2.81 | −10.4971 | −1.13 | |
9 | −0.46 | 0.22 | 6.29 | −2.44 | −9.7104 | −4.79 | |
10 | −2.66 | 1.57 | −4.24 | −5.29 | −1.1065 | 9.20 | |
ROC curve for training data in ANN.
ROC curve for test data in ANN.
The proposed method was applied to the data. In the first step of the hybrid approach, the important explanatory variables and the forecasts of logistic regression were obtained using stepwise logistic regression. Explanatory variables (age, time of the hemoptysis (THM), number of hemoptysis (NHM), and RAL) obtained from stepwise logistic regression and forecasts of
Artificial neural network architecture of proposed hybrid method.
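The way the hybrid method builds its network inputs can be sketched as follows: the four selected explanatory variables are extended with the logistic regression forecast, giving five inputs. This is a minimal illustration under stated assumptions. The forecast is taken to be the predicted probability (the text does not say whether a probability or a predicted class is used), the patient values and variable coding are hypothetical, and the function names are invented for the sketch. The coefficients are the reported logistic regression estimates.

```python
import math

# Reported logistic regression estimates: constant, then age, THM, NHM, RAL.
INTERCEPT = -5.19
COEFS = [0.06, 0.03, 1.44, -1.26]


def lr_forecast(x, intercept, coefs):
    """Logistic regression forecast, assumed here to be the predicted probability."""
    linear = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-linear))


def hybrid_inputs(x, intercept, coefs):
    """ANN input vector: selected explanatory variables plus the LR forecast."""
    return list(x) + [lr_forecast(x, intercept, coefs)]


# Hypothetical patient (age, THM, NHM, RAL); the coding is assumed.
patient = [60, 10, 1, 0]
ann_input = hybrid_inputs(patient, INTERCEPT, COEFS)
```

The resulting vector has five entries, matching the five input neurons of the hybrid network; it would then be passed to a feedforward network trained on the diagnosis as target.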
Artificial neural network which was given in Figure
Optimal weights of neural network in hybrid method.
Hidden layer neuron | Input neuron 1 | Input neuron 2 | Input neuron 3 | Input neuron 4 | Input neuron 5 | Output neuron | Bias (input-hidden) | Bias (hidden-output) |
---|---|---|---|---|---|---|---|---|
1 | −6.61 | 2.65 | −8.75 | −1.33 | 5.44 | −11.65 | 6.31 | −72.62 |
2 | −0.78 | −2.55 | 5.75 | −5.05 | 0.03 | 0.10 | −7.15 | |
3 | 0.82 | −1.39 | 11.42 | 15.77 | 1.74 | 95.32 | −4.06 | |
4 | −11.75 | 8.74 | 2.75 | 3.24 | 7.74 | 10.75 | −8.36 | |
5 | 1.32 | 0.58 | 1.51 | −5.39 | 5.72 | −60.09 | −8.33 | |
6 | −0.29 | −0.59 | 0.25 | −3.59 | −5.53 | −1.63 | −5.80 | |
7 | 0.68 | 8.74 | 2.30 | 4.72 | −0.58 | −66.45 | −8.72 | |
8 | 0.70 | 0.13 | −16.37 | −39.04 | 140.65 | 94.49 | −58.77 | |
9 | −0.05 | 0.32 | 3.01 | 4.68 | −6.32 | 100.77 | −4.97 | |
10 | 10.60 | −10.53 | −5.43 | 0.85 | −5.27 | 8.80 | 7.26 | |
Results of LR, ANN, and LR-ANN hybrid method for lung cancer data.
Model | AUC (training data) | AUC (test data) | CCR (training data) | CCR (test data) | Sensitivity (training data) | Sensitivity (test data) | Specificity (training data) | Specificity (test data) |
---|---|---|---|---|---|---|---|---|
LR model | 0.8468 | 0.8923 | 0.8125 | 0.7778 | 0.5208 | 0.6000 | 0.9375 | 0.8462 |
ANN model | 0.8631 | 0.9185 | 0.8375 | 0.8889 | 0.5417 | 0.6000 | 0.9643 | 1.0000 |
Proposed hybrid LR-ANN | 0.9050 | 0.9231 | 0.8812 | 0.9444 | 0.6875 | 0.8000 | 0.9645 | 1.0000 |
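CCR, sensitivity, and specificity all follow from the four cells of a confusion matrix, as the minimal sketch below shows. CCR is read here as the correct classification rate, which is an assumption because the abbreviation is not expanded in the text. The counts are not reported in the study: they are inferred as the only counts over 18 test observations that reproduce the reported test-data rates for the LR model and the hybrid model.

```python
def classification_metrics(tp, fn, tn, fp):
    """CCR, sensitivity, and specificity from confusion-matrix counts.

    tp / fn : positive cases classified correctly / incorrectly
    tn / fp : negative cases classified correctly / incorrectly
    """
    total = tp + fn + tn + fp
    return {
        "ccr": (tp + tn) / total,          # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of positives detected
        "specificity": tn / (tn + fp),     # share of negatives detected
    }


# Inferred (not reported) test-set counts: 5 positive and 13 negative cases.
lr_test = classification_metrics(tp=3, fn=2, tn=11, fp=2)
hybrid_test = classification_metrics(tp=4, fn=1, tn=13, fp=0)
```

Under these inferred counts, the hybrid method's gain on the test data corresponds to one additional positive case and two additional negative cases being classified correctly compared with LR.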
ROC curve for training data in proposed hybrid LR-ANN method.
ROC curve for test data in proposed hybrid LR-ANN method.
The comparisons of proposed hybrid method, logistic regression, and artificial neural network are made in Table
Logistic regression and artificial neural networks, both of which have advantages and disadvantages, are two methods used for classification. In logistic regression, meaningful variables are selected via statistical techniques; in feedforward artificial neural networks, which are black-box models, variable selection cannot be done with statistical techniques. In this study, a new method using both logistic regression and feedforward artificial neural networks was proposed. In the proposed method, variable selection is done by stepwise logistic regression. The explanatory variables selected by logistic regression, together with its forecasts, are then taken as inputs of the feedforward artificial neural network. Thus, the proposed method has the advantages of both LR and ANN. When the results given in Table