Breast Cancer Detection with Reduced Feature Set

This paper explores the feature reduction properties of independent component analysis (ICA) in a breast cancer decision support system. The Wisconsin diagnostic breast cancer (WDBC) dataset is reduced to a one-dimensional feature vector by computing an independent component (IC). The original data with 30 features and the reduced single feature (IC) are used to evaluate the diagnostic accuracy of classifiers such as k-nearest neighbor (k-NN), artificial neural network (ANN), radial basis function neural network (RBFNN), and support vector machine (SVM). The proposed classification using the IC is also compared with the original feature set under different validation (5/10-fold cross-validation) and partitioning (20%–40%) methods. The classifiers are evaluated on how effectively they categorize tumors as benign or malignant in terms of specificity, sensitivity, accuracy, F-score, Youden's index, discriminant power, and the receiver operating characteristic (ROC) curve with its criterion values, including area under the curve (AUC) and 95% confidence interval (CI). The approach represents an improvement in the diagnostic decision support system while reducing computational complexity.


Introduction
Breast cancer is one of the leading causes of death among all cancers for women [1]. Early detection and correct diagnosis of cancer are essential for the treatment of the disease. However, the traditional approach to cancer diagnosis depends highly on the experience of doctors and their visual inspections. Naturally, human beings can make mistakes due to their limitations. Humans can recognize patterns easily. However, they fail when probabilities have to be assigned to observations [2]. Although several tests are applied, exact diagnosis may be difficult even for an expert. That is why automatic diagnosis of breast cancer is investigated by many researchers. Computer aided diagnostic tools are intended to help physicians in order to improve the accuracy of the diagnosis [3][4][5].
A study was carried out to demonstrate that machine learning may improve the accuracy of diagnosis. In Brause's work, the results show that the most experienced physician can diagnose with 79.97% accuracy, while 91.1% correct diagnosis is achieved with the help of machine learning [6].
Tumors are classified as benign or malignant. Benign tumors are not cancerous or life threatening. However, they can increase the risk of getting breast cancer. Malignant tumors are cancerous and more alarming than benign tumors. Although significant studies have been performed on early detection, about 20% of all women with malignant tumors die from this disease [7].
In order to improve the accuracy of breast mass classification as benign or malignant, the performance of a backpropagation artificial neural network (ANN) was evaluated [8]. Moreover, the fast learning rates and generalization capabilities of radial basis function neural networks (RBFNN) have shown excellent accuracy in the microcalcification detection task [9,10]. The advantages of RBFNN are a simple structure, good performance in approximating nonlinear functions, and fast convergence velocity. Thus, it has been widely used in pattern recognition and system modeling [11,12]. On the other hand, the structure of RBFNN grows when the net's input dimension increases. Moreover, irrelevant components in the inputs decrease the generalization performance of RBFNN [13]. Support vector machine (SVM) is an effective statistical learning method for classification [14]. SVM is based on finding the optimal hyperplane to separate different classes by mapping input data into a higher-dimensional feature space. SVM has the advantage of a fast training technique, even with a large number of input data [15,16]. Therefore, it has been used for many recognition problems such as object recognition and face detection [17][18][19].
Principal component analysis (PCA) is a technique to reduce dimensionality using second order statistical information [20]. Independent component analysis (ICA) is a recently developed method in pattern recognition and signal processing fields [21,22]. It involves higher order statistics to extract independent components that carry richer information than PCA. ICA can be used to reduce dimensionality before training k-NN, ANN, RBFNN, and SVM. Consequently, the complexity of classifiers can be reduced, and convergence velocity and performance can be increased [13,23].
The objective of the proposed study is to analyze the effect of feature reduction using ICA on the classification of tumors as benign or malignant. Thus, the dimension of the WDBC dataset is reduced to only one feature using ICA. The reduced data is subdivided into test and training data using 5/10-fold cross-validation and 20% partitioning to evaluate the performance of k-NN, ANN, RBFNN, and SVM. Performance measures including accuracy, specificity, sensitivity, F-score, Youden's index, and discriminant power are computed, and the receiver operating characteristic (ROC) curve is plotted to compare the classifiers. Section 2 summarizes background knowledge on the dataset, ICA, k-NN, ANN, RBFNN, SVM, and performance measures. In Section 3, the methodology deployed in this study is described. In Sections 4 and 5, experimental results are presented and discussed. Finally, conclusions are given in Section 6. The ten real-valued features given in Table 1 were calculated for each cell nucleus, and the mean, standard error, and "worst" or largest (mean of the three largest values) of these features were calculated for each image, resulting in 30 features [24].

Independent Component Analysis.
The basic model of ICA is as follows. Suppose that the observed signal is the linear combination of two independently distributed sources. The observed signal can be written as $x = As$, where $s$ is a vector that consists of the source signals, $A$ is an unknown mixing matrix composed of constant elements, and $x$ is a vector of observed values. The unknown mixing matrix, $A$, is estimated using ICA, and then the separating matrix $W$ is computed, which is the inverse of $A$. The original signal can be found by $\hat{s} = Wx$.
The computation of the independent components (ICs) begins with centering the data by removing the mean value of each variable, as in principal component analysis (PCA). Whitening, also known as sphering the data, is the next step. Data which have been whitened are uncorrelated (as in PCA). In addition, all variables have variances of one. PCA can be used for both of these computations because it decorrelates the data and gives information on the variance of the decorrelated data in the form of the eigenvalues [25]. ICs are determined by applying a linear transformation to the uncorrelated data: $\mathrm{ic} = w^{T}x$, where ic is the independent component, $x$ is the whitened data, and $w$ is the vector used to reconstruct ic. There are many different approaches to estimate $w$ using an objective function that relates to variable independence. In this study, the FASTICA algorithm has been used to compute ICs, due to its flexibility and interactive mode [26].
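The centering, whitening, and projection steps described above can be sketched in a few lines. This is only an illustrative one-unit fixed-point iteration with a tanh nonlinearity on synthetic data, not the FASTICA package used in the study; the function names `whiten` and `one_unit_fastica`, the toy sources, and the mixing matrix are all assumptions made for the example.

```python
import numpy as np

def whiten(X):
    """Center X (samples x features) and whiten it using a PCA eigendecomposition."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    keep = eigvals > 1e-10  # drop numerically null directions
    return Xc @ eigvecs[:, keep] / np.sqrt(eigvals[keep])

def one_unit_fastica(Z, n_iter=200, tol=1e-8, seed=0):
    """Estimate one unit-norm weight vector w on whitened data Z (tanh nonlinearity)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        proj = Z @ w
        g = np.tanh(proj)
        g_prime = 1.0 - g ** 2
        # Fixed-point update: w <- E[z g(w^T z)] - E[g'(w^T z)] w, then renormalize.
        w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w

# Hypothetical toy data: two independent non-Gaussian sources mixed linearly (x = A s).
rng = np.random.default_rng(1)
S = np.column_stack([rng.uniform(-1, 1, 2000), rng.laplace(size=2000)])
A = np.array([[1.0, 0.5], [0.3, 2.0]])
X = S @ A.T

Z = whiten(X)
w = one_unit_fastica(Z)
ic = Z @ w  # one-dimensional feature vector (a single IC per sample)
```

Because the data are whitened and `w` has unit norm, the resulting `ic` has zero mean and unit variance by construction; which source it recovers depends on the random initialization.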

Artificial Neural Networks.
Feedforward neural network (FFNN) is the most popular ANN structure due to its simplicity in mathematical analysis and good representational capabilities [27,28]. FFNN has been applied successfully to various applications such as control, signal processing, and pattern classification. The FFNN architecture is shown in Figure 2, which indicates the number of input patterns and the number of neurons in the hidden layer. Neurons in the hidden layer receive weighted inputs from a previous layer and transfer output to the neurons in the next layer in FFNN, and these computations can be described as $\mathrm{net} = w_{0} + \sum_{i} w_{i}x_{i}$, $\mathrm{out} = f(\mathrm{net})$, and $e = \mathrm{obs} - \mathrm{out}$, where $w_{0}$ is the bias, $w_{i}$ is the weight of each input neuron, $x_{i}$ is the input neuron, net is the summation of weighted inputs, out is the output of the system, $f(\mathrm{net})$ denotes the nonlinear activation function, obs is the observed output value of the neural network, and $e$ is the error between the output value and the network result [29]. An RBFNN also consists of a feedforward architecture with three layers, but the hidden layer mostly uses a Gaussian function and is called the radial basis layer. Each neuron consists of a radial basis function (RBF) centered on a point. The centers and spreads are computed by the training. A hidden neuron computes the Euclidean distance of the input vector (the test case) from the neuron's center point and then applies the RBF kernel function to this distance using the spread values.
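As a rough illustration of the two neuron types described above, the sketch below computes the output of a single feedforward neuron (weighted sum plus bias passed through a log-sigmoid) and of a single Gaussian RBF neuron. The Gaussian form exp(-d^2 / (2 * spread^2)) is one common parameterization and is an assumption here, as are the function names and all numeric values.

```python
import math

def logsig(v):
    """Log-sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(inputs, weights, bias):
    """Feedforward neuron: net = bias + sum(w_i * x_i), out = f(net)."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return logsig(net)

def rbf_neuron(inputs, center, spread):
    """RBF neuron: Gaussian of the Euclidean distance between input and center.

    Assumes the parameterization exp(-d^2 / (2 * spread^2)).
    """
    dist_sq = sum((x - c) ** 2 for x, c in zip(inputs, center))
    return math.exp(-dist_sq / (2.0 * spread ** 2))

# Hypothetical values, for illustration only.
out = neuron_output([0.5, -1.0], [0.8, 0.2], bias=0.1)
error = 1.0 - out  # error between a hypothetical observed value (1.0) and the output
rbf_at_center = rbf_neuron([1.0, 2.0], center=[1.0, 2.0], spread=0.5)
rbf_far_away = rbf_neuron([3.0, 2.0], center=[1.0, 2.0], spread=0.5)
```

The RBF neuron responds most strongly when the input coincides with its center and decays with distance, at a rate controlled by the spread.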

Support Vector Machine (SVM).
SVM is a supervised learning algorithm studied for data classification and regression. It was proposed by Boser et al. [30] and Vapnik [31]. The SVM algorithm is used to find a hyperplane that separates the classes, minimizing training error and maximizing the margin in order to increase generalization capability.
When the datasets are linearly separable, a linear SVM algorithm can be used to classify them. The algorithm tries to maximize the margin. Support vectors are the points lying on the margins that are shown in Figure 3.
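The discriminant function of the hyperplane can be described by the following equation: $g(x) = w^{T}x + b$, where $x$ describes the data points, $w$ is a coefficient vector, and $b$ shows the offset from the origin. In the case of linear SVM, $g(x) \geq 0$ for the closest point of one class and $g(x) < 0$ for the closest point belonging to the other class. The margin ($2/\|w\|_{2}$) should be maximized for better generalization ability by minimizing the cost function as follows: $J(w) = \frac{1}{2}\|w\|^{2}$ subject to $y_{i}(w^{T}x_{i} + b) \geq 1$, where $y_{i} \in \{-1, +1\}$ is the class label of training sample $x_{i}$. This is a quadratic optimization task with respect to a set of linear inequality constraints. From the Karush-Kuhn-Tucker (KKT) conditions the Lagrange function is found by $L_{p}(w, b, \alpha) = \frac{1}{2}\|w\|^{2} - \sum_{i}\alpha_{i}\left[y_{i}(w^{T}x_{i} + b) - 1\right]$, where $\alpha_{i}$ are Lagrange multipliers and $L_{p}$ must be minimized to find the optimal $w$ and $b$. The optimization equation can be written as $\max_{\alpha}\left(\sum_{i}\alpha_{i} - \frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}\right)$ subject to $\sum_{i}\alpha_{i}y_{i} = 0$ and $\alpha_{i} \geq 0$. SVM can also solve nonlinear classification problems through the kernel trick. The kernel function maps data points onto a higher-dimensional space in order to construct a hyperplane separating the classes. The new discriminant function is found by $g(x) = w^{T}\Phi(x) + b$, where $\Phi(x)$ represents the mapping of input vectors onto the kernel space. Therefore, the optimization equation can be written as $\max_{\alpha}\left(\sum_{i}\alpha_{i} - \frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}K(x_{i}, x_{j})\right)$, where $K(x_{i}, x_{j})$ is the kernel function, equal to $\langle\Phi(x_{i}), \Phi(x_{j})\rangle$. The kernel functions can be radial basis function (RBF), polynomial, or any symmetric functions which satisfy the Mercer conditions [32].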
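To make the roles of the discriminant function, the margin, and the kernel more concrete, here is a minimal sketch. It evaluates a linear discriminant w^T x + b, the margin width 2/||w||, and a kernel-based discriminant written in the standard support-vector expansion sum_i alpha_i * y_i * K(x_i, x) + b. It does not solve the quadratic optimization problem; the weights, multipliers, support vectors, and the RBF parameterization exp(-||x - z||^2 / (2 * sigma^2)) are hypothetical choices for illustration.

```python
import math

def linear_decision(x, w, b):
    """Linear discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Margin of a linear SVM, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def rbf_kernel(x, z, sigma):
    """RBF kernel, assuming the form exp(-||x - z||^2 / (2 * sigma^2))."""
    dist_sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

def kernel_decision(x, support_vectors, labels, alphas, b, sigma):
    """Kernel discriminant g(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf_kernel(sv, x, sigma)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

# Hypothetical parameters (not fitted to any data).
g_lin = linear_decision([1.0, 1.0], w=[3.0, 4.0], b=-5.0)
width = margin_width([3.0, 4.0])

svs, ys, alphas = [[0.0], [2.0]], [-1, 1], [1.0, 1.0]
g_near_positive = kernel_decision([2.0], svs, ys, alphas, b=0.0, sigma=1.0)
g_near_negative = kernel_decision([0.0], svs, ys, alphas, b=0.0, sigma=1.0)
```

The sign of the discriminant gives the predicted class; in the one-dimensional toy case, points near the positive support vector receive a positive value and points near the negative one a negative value.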

Performance Measures.
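There are several ways to evaluate the performance of classifiers. A confusion matrix keeps the correct and incorrect classification results to measure the quality of the classifier. Table 2 shows a confusion matrix for binary classification, where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative counts, respectively. The most common empirical measure to assess effectiveness is the accuracy of the classifier, and it is calculated by $\mathrm{Accuracy} = (\mathrm{TP} + \mathrm{TN})/(\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$. Sensitivity measures the proportion of actual positives which are correctly identified, and specificity measures the proportion of negatives which are correctly identified. These are formulated by $\mathrm{Sensitivity} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ and $\mathrm{Specificity} = \mathrm{TN}/(\mathrm{TN} + \mathrm{FP})$. F-score is a measure of test accuracy. It considers both the precision and the recall. These are calculated by $\mathrm{Precision} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$, $\mathrm{Recall} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$, and $F\text{-score} = (\beta^{2} + 1)\cdot\mathrm{Precision}\cdot\mathrm{Recall}/(\beta^{2}\cdot\mathrm{Precision} + \mathrm{Recall})$, where $\beta$ is the bias and the F-score is balanced when $\beta = 1$. With this formula, it favors precision when $\beta < 1$ and favors recall otherwise.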
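The measures above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the tables of this study.

```python
def basic_measures(tp, tn, fp, fn, beta=1.0):
    """Accuracy, sensitivity, specificity, precision, and F-score from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    recall = sensitivity
    f_score = ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }

# Hypothetical confusion-matrix counts, for illustration only.
m = basic_measures(tp=90, tn=80, fp=20, fn=10)
```

With beta = 1 the F-score reduces to 2TP / (2TP + FP + FN), which gives a quick sanity check on the computation.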
Two other measures used to analyze the performance of a classifier in medical diagnosis are discriminant power (DP) and Youden's index. DP evaluates how well a classifier distinguishes between positive and negative samples: $\mathrm{DP} = \frac{\sqrt{3}}{\pi}(\log X + \log Y)$, where $X = \mathrm{sensitivity}/(1 - \mathrm{sensitivity})$ and $Y = \mathrm{specificity}/(1 - \mathrm{specificity})$. The result can be summarized as follows: DP < 1 indicates a "poor discriminant," DP < 2 a "limited discriminant," DP < 3 a "fair discriminant," and all other cases a "good discriminant." Youden's index evaluates a classifier's ability to avoid failure [33] and is described as $\gamma = \mathrm{sensitivity} - (1 - \mathrm{specificity})$. Youden's index is used as a summary measure of the receiver operating characteristic (ROC) curve. The diagnostic performance of a test or a classifier in distinguishing diseased cases from normal cases is evaluated using ROC curve analysis [34].
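Discriminant power and Youden's index can be computed from sensitivity and specificity alone. The sketch below assumes the commonly used forms DP = (sqrt(3)/pi) * (ln X + ln Y), with X = sensitivity / (1 - sensitivity) and Y = specificity / (1 - specificity), and Youden's index = sensitivity - (1 - specificity). The use of the natural logarithm, the function names, and the input values are assumptions for illustration.

```python
import math

def discriminant_power(sensitivity, specificity):
    """DP = (sqrt(3) / pi) * (ln X + ln Y); natural logarithm assumed."""
    x = sensitivity / (1.0 - sensitivity)
    y = specificity / (1.0 - specificity)
    return (math.sqrt(3.0) / math.pi) * (math.log(x) + math.log(y))

def dp_label(dp):
    """Map a DP value to the qualitative categories used in the text."""
    if dp < 1:
        return "poor"
    if dp < 2:
        return "limited"
    if dp < 3:
        return "fair"
    return "good"

def youden_index(sensitivity, specificity):
    """Youden's index = sensitivity - (1 - specificity)."""
    return sensitivity - (1.0 - specificity)

# Hypothetical sensitivity and specificity values.
dp = discriminant_power(0.9, 0.8)   # roughly 1.98
label = dp_label(dp)                # "limited"
gamma = youden_index(0.9, 0.8)      # roughly 0.7
```

Both measures are undefined in the degenerate case where sensitivity or specificity equals exactly 1 (division by zero in DP), which a production implementation would need to guard against.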
In this study, an attempt has been made to evaluate the performance of the classifiers by computing the aforementioned measures for 5/10-fold cross-validation (CV) and 20% data partitioning. For 5-CV or 10-CV, the data are divided into 5 or 10 subsets, and each subset is sequentially deployed as test data while the others are deployed as training data. Thus, 5 or 10 iterations are evaluated to determine the distinguishing capability of the classification model. Data partitioning is easier but less reliable than the CV method. In our simulations, 20% of the data is randomly selected as test data and the remaining samples are used for training.
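The two evaluation schemes can be sketched as index generators. This is a minimal illustration using only the standard library; shuffling before forming the folds, the fixed seed, and the function names are assumptions, and only the sample count of 569 comes from the dataset description.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each fold serves as test data exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def holdout_indices(n_samples, test_fraction=0.2, seed=0):
    """Randomly select a fraction of the samples as test data; the rest is training data."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

splits = list(k_fold_indices(569, k=10))
train_idx, test_idx = holdout_indices(569, test_fraction=0.2)
```

In cross-validation every sample is tested exactly once, whereas a single random partition tests only the selected 20%, which is one reason its estimate is less reliable.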

Methodology
In this study, the original 30 features of the WDBC data and the single feature obtained by ICA reduction are deployed to evaluate the performance of the classifiers on breast cancer decisions. The proposed model shown in Figure 4 is applied to the WDBC data, which have 30 features and 569 instances (patients) that were used to train and test the models.
First, the dimensionality of the data is reduced using ICA, and the data are partitioned into subsamples using 5/10-CV and 20% partitioning to evaluate the classifiers. The subsamples have been used sequentially to train and test ANN, RBFNN, SVM, and k-NN. The outputs of the classifiers have been evaluated to obtain the performance measures.
ICA is used to compute the ICs. Since the first IC has a distinctly large eigenvalue, as shown in Figure 5, it has been selected as the feature vector.
In other words, one IC can successfully represent the thirty features, retaining 98.205% of the nonzero eigenvalues. In addition, the distribution of the IC is given in Figure 6 to indicate its distinguishing capability.
The data are divided into subsets using 5/10-CV and 20% partitioning to test and train the classifiers. After the training process, the test data are used to evaluate the diagnostic performance of the classifiers in terms of sensitivity, specificity, accuracy, F-score, Youden's index, DP, and ROC curve. For the k-NN classifier, the one-dimensional Euclidean distance $d = \sqrt{(x_{\mathrm{test}} - x_{\mathrm{training}})^{2}}$ between test and training samples is used [35]. The results of the k-NN classifier are obtained for k values from 1 to 25, and then the performance measures at the best k value are stored. The ANN model is selected as a feedforward neural network with one hidden layer. The total number of neurons in the hidden layer is sequentially increased to find the maximum accuracy. Moreover, the activation function of the hidden layer of the network has been chosen as the log-sigmoid transfer function. In order to train the network, the gradient descent with momentum and adaptive learning rate backpropagation algorithm is used. RBFNN is also evaluated by varying the spread value. For SVM, linear, quadratic, and RBF kernels are used to explore which type of separating hyperplane is more suitable for breast cancer classification.
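The k-NN procedure on a one-dimensional feature can be sketched as follows: the distance reduces to an absolute difference, and k is swept from 1 to 25 to keep the value with the highest test accuracy. The toy feature values and labels, the tie-breaking behaviour, and the function names are assumptions made for this illustration, not the data or code of the study.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest 1-D training samples.

    In one dimension sqrt((a - b)^2) equals abs(a - b). On a tied vote, the label
    encountered first among the nearest neighbours wins (an arbitrary choice).
    """
    neighbors = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def knn_accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def best_k(train_x, train_y, test_x, test_y, k_values=range(1, 26)):
    """Return (k, accuracy) for the smallest k that reaches the highest accuracy."""
    best = (None, -1.0)
    for k in k_values:
        acc = knn_accuracy(train_x, train_y, test_x, test_y, k)
        if acc > best[1]:
            best = (k, acc)
    return best

# Hypothetical one-dimensional feature values: "B" = benign, "M" = malignant.
train_x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
train_y = ["B"] * 4 + ["M"] * 4
test_x = [-1.2, 0.8, -0.1, 1.7]
test_y = ["B", "M", "B", "M"]

k_star, acc_star = best_k(train_x, train_y, test_x, test_y)
```

Selecting k on the test data, as done here for brevity, gives an optimistic accuracy estimate; a separate validation split would be the more cautious choice.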

Results
The one-dimensional feature vector of the WDBC data reduced using ICA is used for training and testing the classifiers. The accuracy, sensitivity, and specificity obtained with this one-dimensional feature have been computed using the 5/10-CV technique and with 20% of the data as test data. The success of breast cancer classification is generally evaluated on the basis of the sensitivity value because correctly classifying a malignant mass is more important than correctly classifying a benign mass.
The accuracy of the k-NN classifier has been computed for k values between 1 and 25. The comparison graph of the effect of ICA on the accuracy of the k-NN classifier is shown in Figure 7.
The maximum accuracy with 20% test data and 30 features is 96.49%, obtained at k = 5. However, the single feature vector reduced using ICA provides an accuracy of 92.98% at k = 5 when 20% test data is selected. Moreover, the accuracy of the k-NN classifier decreases from 93.15% (30 features) to 91.04% (1 feature by ICA) when 10-CV is used to test and train.
Accuracy graph of ANN has been plotted varying neuron numbers in the hidden layer for 10/5-CV and 20% test data. The accuracy graph of ANN classifier is given in Figure 8.
The ANN classifier has a nearly perfect accuracy value of 99.12% (with four neurons) when the original 30 features and 20% test data are selected. Reducing the data to one feature using ICA changes the accuracy value to 91.23%, obtained with nine neurons. In addition, the accuracy value changes from 97.54% to 90.51% when 10-CV is used.
The spread value of RBFNN is adjusted between 0 and 60 to get the maximum accuracy for the 20% test data ratio and 10/5-CV. The accuracy graph of RBFNN is shown in Figure 9.
Referring to the accuracy graph of RBFNN, the maximum accuracy, 95.12%, is obtained at a spread value of 48 for 20% test data. This value decreases to 90.35% when the one-dimensional feature vector reduced by ICA is used. However, when 10-CV is used, ICA increases the accuracy from 87.18% (with 30 features) to 90.49% (with 1 feature reduced by ICA).
The accuracy of SVM has been computed for kernel functions including linear, polynomial, and RBF, with kernel function parameters such as the RBF sigma value for the RBF kernel and the polynomial degree for the polynomial kernel. The accuracy graph of the SVM classifier is presented in Figure 10, where the polynomial degree axis indicates a linear kernel when its value equals one. Generally, the SVM classifier with a linear kernel provides more accurate results than the polynomial and RBF kernels. Its accuracy is 98.25% for 30 features and 90.35% for the reduced single feature when 20% of the data is used as test data. In contrast to the polynomial kernel, ICA increases the accuracy of SVM with the RBF kernel from 89.47% (30 features) to 91.23% (1 feature). When 10-CV is used, the accuracy decreases from 97.54% (30 features, linear kernel) and 95.25% (30 features, RBF kernel) to 90.33% and 90.86% (1 feature reduced by ICA), respectively.
k-NN, ANN, RBFNN, and SVM have been trained and tested to find the maximum accuracy by adjusting their parameters. The performance measures of the classifiers, such as accuracy, specificity, sensitivity, F-score, Youden's index, and discriminant power, are compared to each other. The parameters of each classifier which provide maximum accuracy are selected for comparison with the other classifiers. In addition to these performance measures, the ROC curves of the classifiers are plotted to make the comparison easier to visualize.
10-CV and the one-dimensional feature vector reduced by ICA are used to compare the performances of the classifiers. The classifier outputs for the test data are compared to the original class labels to obtain the TP, TN, FP, and FN values. These values for the classifiers are given in the form of confusion matrices in Table 3.
RBFNN classification using the 30 original features provides worse performance than with the reduced one-dimensional feature vector; refer to Table 3. The other classifiers used with 30 features have slightly more correct classifications than with the single feature reduced by ICA. The performance measures of the k-NN, ANN, RBFNN, and SVM classifiers, such as sensitivity, specificity, accuracy, F-score, discriminant power (DP), and Youden's index, are given in Table 4 to compare the effect of ICA on the classification.
Discriminant power evaluates how well a classifier distinguishes between positive and negative samples. The DP of ANN and SVM with the 30 original features exceeds 3, which means a good discriminant. When ICA is used to reduce the data to one dimension, DP falls to 2.769 (SVM) and 2.655 (ANN). In other words, the discriminants become fair.
A higher value of Youden's index shows a better ability to avoid failure. k-NN results in the highest value of Youden's index; refer to Table 4. Youden's index also serves as a summary measure of the ROC curve of a classifier. In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different cut-off points. The ROC curve can be used to compute the area under the ROC curve (AUC) and the 95% confidence interval (CI). AUC equals 1 when all test data are assigned to their true class labels. A higher AUC indicates higher accuracy. The 95% CI is another indicator of the ROC curve which can be used to test whether a classifier can distinguish the classes. If the interval does not include 0.5, the classifier can distinguish the classes. The ROC curves of the k-NN, ANN, RBFNN, and SVM classifiers using the one-dimensional feature vector reduced by ICA and the 30 features are presented in Figure 11.
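The construction of a ROC curve and its AUC can be illustrated with a short sketch: the cut-off is swept over the classifier scores, each cut-off yields one (false positive rate, true positive rate) point, and the area is obtained with the trapezoidal rule. The scores, labels, and function names are hypothetical; confidence intervals are not computed here.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the cut-off over the scores.

    labels: 1 for positive (malignant), 0 for negative (benign).
    A sample is predicted positive when its score is >= the cut-off.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = auc_trapezoid(points)                       # 8/9 for this toy example
max_youden = max(tpr - fpr for fpr, tpr in points)  # largest sensitivity - (1 - specificity)
```

The AUC equals the fraction of positive-negative pairs in which the positive sample receives the higher score (8 of 9 pairs here), and the cut-off that maximizes Youden's index corresponds to the ROC point farthest above the diagonal.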
The criterion values of the ROC curves of the classifiers are given in Table 5. The AUC values of ANN (0.966) and SVM (0.949) are the highest when the 30 original features are used. However, when classification with 1 feature reduced by ICA is evaluated, k-NN (0.897) and SVM (0.885) result in the highest AUC. This means that the k-NN and SVM classifiers using the reduced single feature distinguish samples more correctly. Table 5 shows that the accuracy of k-NN (91.03%) is better than the accuracy of ANN, RBFNN, and SVM (90.50%, 90.49%, and 90.86%). Generally, the single feature reduced by ICA decreases the accuracy of k-NN, ANN, and SVM. However, it increases the accuracy of RBFNN. The aforementioned classification methods are also analyzed in terms of computing time, given in Table 6, to compare their computational complexity with that of classification using the original 30 features.
The proposed methods have lower computing time when compared to classification of the original dataset. In the case of neural network classification with 30 features, network construction consumes considerably more time than classification with one IC. The measured durations of 13.9 and 20.03 seconds decrease to 11.12 and 14.9 seconds when ANN and RBFNN with 20% partitioning are deployed. The effect of using the IC as a feature on complexity is particularly visible when 10-CV is selected. The time consumed by ANN and RBFNN decreases from 118.21 and 129.84 seconds to 76.72 and 90.49 seconds, respectively. In addition, ICA decreases the computational times of the SVM and k-NN classifications, but the reduction rates are smaller than for the neural networks.
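The relative savings implied by the reported durations can be checked with simple arithmetic. The sketch below uses only the timings quoted in the paragraph above; the function name and dictionary keys are illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction in computing time, in percent."""
    return 100.0 * (before - after) / before

# (seconds with 30 features, seconds with one IC), as quoted in the text.
timings = {
    "ANN, 20% partitioning": (13.9, 11.12),
    "RBFNN, 20% partitioning": (20.03, 14.9),
    "ANN, 10-CV": (118.21, 76.72),
    "RBFNN, 10-CV": (129.84, 90.49),
}

reductions = {name: round(percent_reduction(before, after), 1)
              for name, (before, after) in timings.items()}
# -> 20.0, 25.6, 35.1, and 30.3 percent, respectively
```

These figures correspond to reductions of roughly 20% to 26% with 20% partitioning and roughly 30% to 35% with 10-CV, consistent with the observation that the saving is most visible under cross-validation.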

Discussions
Sensitivity and specificity indicate the proportions of actual positives and negatives, respectively, which are correctly identified. While the use of the one-dimensional feature vector reduced by ICA decreases accuracy slightly, it increases the sensitivity values of the SVM and RBFNN classifiers. The maximum sensitivity belongs to SVM with the RBF kernel when one feature is used. The graph of the effect of ICA on the sensitivity measures of the classifiers is shown in Figure 12. Sensitivity refers to the successfully identified malignant samples in cancer classification. Thus, higher sensitivity means higher diagnostic capability for malignant tumors, and it can help physicians to diagnose a cancerous mass more correctly. The accuracy and sensitivity measures of previous classification studies and of this study on the WDBC dataset are given in Table 7 to compare the effect of feature reduction using ICA. It should be noted that the studies on WDBC differ from studies on the WBC dataset, which consists of 699 instances with 10 attributes.
A higher number of features used to classify breast cancer as benign or malignant results in slightly higher accuracy. Feature reduction to one feature using ICA decreases the accuracy of k-NN, ANN, and SVM slightly. However, it increases the accuracy of RBFNN and the sensitivity values of SVM and RBFNN.
Referring to Table 7, the sensitivity measures of the classifiers used with the one-dimensional feature vector reduced by ICA in this study are better than those of the other studies. However, the accuracy rates of the proposed classifications (90.53% ± 0.34) are lower than those of the previous methods (94.93% ± 2.07). The study by the creators of the WDBC dataset [39] has the highest accuracy (97.50%), using the multisurface method tree (MSM-T) with 3 selected features. Similarly, hybrid methods are more successful than the others. Breast cancer classification using a probabilistic neural network (PNN) with hybrid feature reduction using discrete wavelet transform (DWT) and ICA [40], and classification using SVM with a 6-dimensional feature space obtained by the k-means algorithm [41], have accuracy rates of 96.31% and 97.38% for 10-CV. In particular, SVM based studies [36,38] with 30 features have scores close to our one-dimensional results.

Conclusions
In this study, the effect of dimensionality reduction using independent component analysis (ICA) on breast cancer decision support systems with several classifiers, namely artificial neural network (ANN), k-nearest neighbor (k-NN), radial basis function neural network (RBFNN), and support vector machine (SVM), is investigated. The results obtained with the original thirty features of the Wisconsin diagnostic breast cancer (WDBC) dataset are compared with those obtained with the single dimension reduced by ICA. The accuracy rates of the classifications with the thirty original features, except for RBFNN, have slightly decreased from 97.53%, 91.03%, and 95.25% to 90.5%, 91.03%, and 90.86%, respectively. However, the one-dimensional feature vector makes the RBFNN classifier more discriminating, with accuracy increasing from 87.17% to 90.49%. Furthermore, the sensitivity rates, which describe the successfully recognized malignant samples, increase from 93.5% to 96.63% for RBFNN and from 96.07% to 97.47% for SVM, while the others show a slight decrease of between 0.96% and 3.09%. If the objective is to increase the rate of successfully identified malignant breast cancer using RBFNN or to decrease computational complexity without loss of a high accuracy rate, feature reduction applying ICA can be a high performance solution.