An Efficient Diagnosis System for Parkinson's Disease Using Kernel-Based Extreme Learning Machine with Subtractive Clustering Features Weighting Approach

A novel hybrid method named SCFW-KELM, which integrates effective subtractive clustering features weighting (SCFW) and a fast classifier, the kernel-based extreme learning machine (KELM), is introduced for the diagnosis of Parkinson's disease (PD). In the proposed method, SCFW is used as a data preprocessing tool that aims at decreasing the variance in the features of the PD dataset, in order to further improve the diagnostic accuracy of the KELM classifier. The impact of the type of kernel function on the performance of KELM has been investigated in detail. The efficiency and effectiveness of the proposed method have been rigorously evaluated on the PD dataset in terms of classification accuracy, sensitivity, specificity, area under the receiver operating characteristic (ROC) curve (AUC), f-measure, and kappa statistic value. Experimental results demonstrate that the proposed SCFW-KELM significantly outperforms SVM-based, KNN-based, and ELM-based approaches and other methods in the literature and achieves the highest classification results reported so far via a 10-fold cross validation scheme, with a classification accuracy of 99.49%, a sensitivity of 100%, a specificity of 99.39%, an AUC of 99.69%, an f-measure value of 0.9964, and a kappa value of 0.9867. The proposed method might therefore serve as a new candidate among powerful methods for the diagnosis of PD with excellent performance.


Introduction
Parkinson's disease (PD) is a degenerative disease of the nervous system that belongs to a large group of neurological conditions called motor system disorders, which result from the loss of dopamine-producing brain cells. The main symptoms of PD are as follows: (1) tremor or trembling in hands, arms, legs, jaw, or head, (2) rigidity or stiffness of the limbs and trunk, (3) bradykinesia or slowness of movement, and (4) postural instability or impaired balance (http://www.ninds.nih.gov/research/parkinsonsweb/index.htm, last accessed: April 2012). At present, PD affects about 1% of the worldwide population over the age of 50; however, this proportion is increasing as people live longer [1]. To date, there is no medical treatment that cures PD, and medication is available only for relieving the symptoms of the disease [2]. It is therefore important to gain more insight into the problem and to improve the methods for dealing with PD. Here we focus on dysphonia, a group of vocal impairment symptoms that is reported to be one of the most significant symptoms of PD [3]. Research has shown that about 90% of people with PD exhibit such vocal evidence. The dysphonic indicators of PD make speech measurements an important part of diagnosis [4]. Dysphonic measures have been proposed as a reliable tool to detect and monitor PD [5,6].
Computational and Mathematical Methods in Medicine
Previous studies on the PD problem based on machine learning methods have been undertaken by various researchers. Little et al. [6] used a support vector machine (SVM) classifier with a Gaussian radial basis kernel function to predict PD, together with a feature selection method to reduce the feature space, and the best accuracy rate of 91.4% was obtained by the proposed model. Shahbaba and Neal [7] presented a nonlinear model based on Dirichlet mixtures for PD classification and compared it with multinomial logit models, decision trees, and SVM; the classification accuracy of 87.7% was achieved by the proposed model. Das [8] conducted a comparative study of neural networks (NN), DMneural, regression, and decision trees for the diagnosis of PD; the experimental results showed that the NN method achieved the overall classification performance of 92.9%. Sakar and Kursun [9] combined a mutual information measure with SVM for the diagnosis of PD and achieved the classification result of 92.75%. Psorakis et al. [10] introduced sample selection strategies and model improvements for multiclass multikernel relevance vector machines and achieved the classification accuracy of 89.47% in the PD dataset. Guo et al. [11] combined genetic programming and expectation maximization (EM) to diagnose PD in the ordinary feature data and achieved the classification accuracy of 93.1%. Luukka [12] proposed a new method that combined fuzzy entropy measures with the similarity classifier to predict PD, and the mean classification accuracy of 85.03% was achieved. Li et al. [13] introduced a fuzzy-based nonlinear transformation approach together with SVM in the PD dataset; the best classification accuracy of 93.47% was obtained.
Ozcift and Gulten [14] combined the correlation-based feature selection method with the rotation forest ensemble classifier of 30 machine learning algorithms to distinguish PD; the proposed model obtained the best classification accuracy of 87.13%. Åström and Koker [15] achieved the highest classification accuracy of 91.2% by using a parallel neural network model for PD diagnosis. Spadoto et al. [16] adopted an evolutionary-based method together with the optimum-path forest (OPF) classifier for PD diagnosis, and the best classification accuracy of 84.01% was obtained. Polat [17] applied fuzzy c-means (FCM) clustering feature weighting (FCMFW) together with the k-nearest neighbor classifier for detecting PD; the classification accuracy of 97.93% was obtained. Chen et al. [18] proposed a model that used principal component analysis based feature extraction together with the fuzzy k-nearest neighbor method to predict PD and achieved the best classification accuracy of 96.07%. Daliri [19] presented a chi-square distance kernel-based SVM to discriminate subjects with PD from healthy control subjects using gait signals, and the classification result of 91.2% was obtained. Zuo et al. [20] used a new diagnosis model based on particle swarm optimization (PSO) to strengthen the fuzzy k-nearest neighbor classifier for the diagnosis of PD, and the mean classification accuracy of 97.47% was achieved.
From these works, it can be seen that most of the common classifiers from the machine learning community have been used for PD diagnosis. For nonlinear classification problems, data preprocessing methods such as feature weighting, normalization, and feature transformation can increase the performance of a classifier used alone. The choice of an efficient feature preprocessing method and an excellent classifier is therefore of significant importance for the PD diagnosis problem. Aiming at improving the efficiency and effectiveness of the classification performance for the diagnosis of PD, in this paper, an efficient features weighting method called subtractive clustering features weighting (SCFW) and a fast classification algorithm named kernel-based extreme learning machine (KELM) are examined. The SCFW method is used to map the features according to the data distributions in the dataset and to transform a linearly nonseparable dataset into a linearly separable one. In this way, similar data within each feature tend to gather together, so that the distinction between classes is increased and the PD dataset can be classified correctly. It is reported that the SCFW method can help improve the discrimination abilities of classifiers in many applications, such as traffic accident analysis [21] and medical dataset transformation [22]. KELM is the improved version of the ELM algorithm based on kernel functions [23]. The advantage of KELM is that only two parameters (the penalty parameter $C$ and the kernel parameter $\gamma$) need to be adjusted, unlike ELM, which needs suitable values of weights and biases to be specified for improving the generalization performance [24]. Furthermore, KELM not only trains as fast as ELM but also achieves good generalization performance. The objective of the proposed method is to explore the performance of PD diagnosis using a two-stage hybrid modeling procedure that integrates SCFW with KELM.
First, the proposed method adopts SCFW to construct the discriminative feature space by weighting features, and then the weighted features serve as the input of the trained KELM classifier. To evaluate the performance of the proposed hybrid method, classification accuracy (ACC), sensitivity, specificity, AUC, f-measure, and kappa statistic value have been used. Experimental results show that, with a proper kernel function, the proposed method achieves very promising results under 10-fold cross validation (CV).
The main contributions of this paper are summarized as follows.
(1) For the first time, we propose integrating the SCFW approach with the KELM classifier to detect PD in an efficient and effective way.
(2) In the proposed system, the SCFW method is employed as a data preprocessing tool to strengthen the discrimination between classes, further improving the classification performance of the KELM classifier.
(3) Compared with the existing methods in previous studies, the proposed diagnostic system has achieved excellent classification results.
The rest of the paper is organized as follows. Section 2 offers brief background knowledge on SCFW and KELM. The detailed implementations of the diagnosis system are presented in Section 3. In the next section, the detailed experiment design is described, and Section 5 gives the experiment results and discussions of the proposed method. Finally, conclusions and recommendations for future work are summarized in Section 6.

Subtractive Clustering Features Weighting (SCFW).
Subtractive clustering is an improved version of the mountain clustering algorithm. The problem with mountain clustering is that its computation grows exponentially with the dimension of the problem. Subtractive clustering solves this problem by using data points as the candidates for cluster centers, instead of grid points as in mountain clustering, so the computational cost is proportional to the problem size instead of the problem dimension [25]. The subtractive clustering algorithm can be briefly summarized as follows.
Step 1. Consider a collection of $n$ data points $\{x_1, x_2, \ldots, x_n\}$ in $M$-dimensional space. Since each data point is a candidate for a cluster center, the density measure at data point $x_i$ is defined as
$$D_i = \sum_{j=1}^{n} \exp\left(-\frac{\|x_i - x_j\|^2}{(r_a/2)^2}\right),$$
where $r_a$ is a positive constant defining a neighborhood radius; it is used to determine the number of cluster centers. So, a data point will have a high density value if it has many neighboring data points. The data points outside the neighborhood radius contribute only slightly to the density measure. Here, $r_a$ is set to 0.5.
Step 2. After the density measure of each data point has been calculated, the data point with the highest density measure is selected as the first cluster center. Let $x_{c_1}$ be the point selected and $D_{c_1}$ its density measure. Next, the density measure for each data point $x_i$ is revised as follows:
$$D_i = D_i - D_{c_1} \exp\left(-\frac{\|x_i - x_{c_1}\|^2}{(r_b/2)^2}\right),$$
where $r_b$ is a positive constant and $r_b = \eta \cdot r_a$; $\eta$ is a constant greater than 1 that prevents cluster centers from being in too close proximity. In this paper, $r_b$ is set to 0.8.
Step 3. After the density measure for each data point has been revised, the next cluster center $x_{c_2}$ is selected and all the density measures for the data points are revised again. The process is repeated until a sufficient number of cluster centers are generated.
For the SCFW method, first the cluster centers of each feature are calculated by using subtractive clustering. After the centers of the features have been calculated, the ratios of the means of the features to their cluster centers are calculated, and these ratios are multiplied with the data of each feature [21]. The pseudocode of the SCFW method is given in Algorithm 1, and the flowchart of the weighting process is shown in Figure 1.
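The three steps and the weighting rule above can be sketched in Python with NumPy. This is a minimal illustration and not the authors' Algorithm 1: the function names, the stopping rule (`stop_ratio`, `max_centers`), the assumption that the features are scaled to [0, 1], and the choice to average the cluster centers per feature before forming the ratio are all assumptions made for the example.

```python
import numpy as np


def subtractive_clustering(X, r_a=0.5, r_b=0.8, stop_ratio=0.15, max_centers=10):
    """Pick cluster centers among the rows of X (assumed scaled to [0, 1]).

    stop_ratio and max_centers are hypothetical stopping rules; the text only
    says the process repeats until enough centers are generated.
    """
    # Pairwise squared Euclidean distances between all data points.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Step 1: density measure of every data point.
    density = np.exp(-sq_dist / (r_a / 2.0) ** 2).sum(axis=1)
    first_density = density.max()
    centers = []
    while len(centers) < max_centers:
        c = int(np.argmax(density))
        d_c = density[c]
        if d_c < stop_ratio * first_density:
            break
        centers.append(X[c].copy())
        # Steps 2 and 3: subtract the influence of the newly selected center.
        density = density - d_c * np.exp(-sq_dist[:, c] / (r_b / 2.0) ** 2)
    return np.array(centers)


def scfw_transform(X, **kwargs):
    """Weight each feature by (feature mean) / (mean of its cluster centers)."""
    centers = subtractive_clustering(X, **kwargs)
    center_mean = centers.mean(axis=0)
    # Guard (assumption): a zero center mean is replaced by 1 to avoid dividing by zero.
    center_mean = np.where(center_mean == 0, 1.0, center_mean)
    weights = X.mean(axis=0) / center_mean
    return X * weights, weights


# Toy data standing in for a normalized dataset (not the PD dataset).
rng = np.random.default_rng(0)
X_demo = rng.random((30, 3))
centers = subtractive_clustering(X_demo)
X_weighted, weights = scfw_transform(X_demo)
```

Because the centers are always actual data points, the cost is driven by the number of samples rather than by a grid over the feature space, which is the advantage over mountain clustering noted above.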

Kernel-Based Extreme Learning Machine (KELM).
ELM is an algorithm originally developed for training single hidden layer feed-forward neural networks (SLFNs) [26]. The essence of ELM is that the parameters of the hidden neurons are randomly generated instead of being tuned, which fixes the nonlinearities of the network without iteration. Figure 2 shows the structure of ELM.
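For $N$ given samples $(\mathbf{x}, \mathbf{y})$, $L$ hidden neurons, and activation function $h(\mathbf{x})$, the output function of ELM is defined as follows:
$$f(\mathbf{x}) = \sum_{i=1}^{L} \beta_i h_i(\mathbf{x}) = h(\mathbf{x})\boldsymbol{\beta},$$
where $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_L]$ is the output weight vector connecting the hidden nodes to the output nodes. $\mathbf{H} = \{h_{ij}\}$ ($i = 1, \ldots, N$ and $j = 1, \ldots, L$) is the hidden layer output matrix of the neural network. $h(\mathbf{x})$ actually maps the data from the $d$-dimensional input space to the $L$-dimensional hidden layer feature space $\mathbf{H}$, and thus $h(\mathbf{x})$ is indeed a feature mapping.
The output weights are determined by the least squares method:
$$\boldsymbol{\beta} = \mathbf{H}^{+}\mathbf{T},$$
where $\mathbf{H}^{+}$ is the Moore-Penrose generalized inverse [26] of the hidden layer output matrix $\mathbf{H}$ and $\mathbf{T}$ is the target matrix of the training samples.
To improve the generalization capability of ELM in comparison with the least squares solution-based ELM, Huang et al. [23] proposed a kernel-based method for the design of ELM. They suggested adding a positive value $1/C$ (where $C$ is a user-defined parameter) when calculating the output weights, such that
$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}.$$
Therefore, the output function is expressed as follows:
$$f(\mathbf{x}) = h(\mathbf{x})\boldsymbol{\beta} = h(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}.$$
When the hidden feature mapping function $h(\mathbf{x})$ is unknown, a kernel matrix for ELM is used according to the following equation:
$$\boldsymbol{\Omega}_{\mathrm{ELM}} = \mathbf{H}\mathbf{H}^{T}: \quad \Omega_{\mathrm{ELM}\,i,j} = h(\mathbf{x}_i) \cdot h(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j),$$
where $K(\mathbf{x}_i, \mathbf{x}_j)$ is a kernel function. Many kernel functions, such as linear, polynomial, and radial basis function kernels, can be used in kernel-based ELM. The output function of the KELM classifier can now be expressed as
$$f(\mathbf{x}) = \begin{bmatrix} K(\mathbf{x}, \mathbf{x}_1) \\ \vdots \\ K(\mathbf{x}, \mathbf{x}_N) \end{bmatrix}^{T}\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{ELM}}\right)^{-1}\mathbf{T}.$$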
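A compact way to see how the kernel form of the output function works is the following NumPy sketch. It is an illustration under stated assumptions rather than the authors' implementation: the class name `KELM`, the RBF parameterization $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^2)$, the $\pm 1$ target encoding with one column per class, and the toy data are all choices made for the example.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2); this parameterization is an assumption."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


class KELM:
    """Kernel ELM sketch: f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega)^(-1) T."""

    def __init__(self, C=1.0, gamma=1.0):
        self.C = C
        self.gamma = gamma

    def fit(self, X, y):
        self.X_train_ = X
        self.classes_ = np.unique(y)
        # Target matrix T: +1 in the column of the true class, -1 elsewhere (assumption).
        T = np.where(y[:, None] == self.classes_[None, :], 1.0, -1.0)
        omega = rbf_kernel(X, X, self.gamma)
        # Solve (I/C + Omega) alpha = T instead of forming the inverse explicitly.
        self.alpha_ = np.linalg.solve(np.eye(len(X)) / self.C + omega, T)
        return self

    def decision_function(self, X):
        return rbf_kernel(X, self.X_train_, self.gamma) @ self.alpha_

    def predict(self, X):
        return self.classes_[np.argmax(self.decision_function(X), axis=1)]


# Toy two-class data (not the PD dataset): two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
model = KELM(C=100.0, gamma=1.0).fit(X_train, y_train)
pred = model.predict(np.array([[0.0, 0.0], [3.0, 3.0]]))
```

Training reduces to a single linear solve of size $N \times N$, which is why only $C$ and the kernel parameter need to be tuned and no hidden-layer weights have to be chosen.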

The Proposed SCFW-KELM Diagnosis System
This work proposes a novel hybrid method for PD diagnosis. The proposed model comprises two stages, as shown in Figure 3. In the first stage, the SCFW algorithm is applied to preprocess the data in the PD dataset. The purpose of this method is to map the features according to their distributions in the dataset and to transform a linearly nonseparable space into a linearly separable one. With this method, similar data in the same feature are gathered, which substantially helps improve the discrimination ability of classifiers. In the next stage, KELM is evaluated on the weighted feature space with different types of activation functions to perform the classification. Finally, the best parameters and the suitable activation function are obtained based on the performance analysis. The detailed pseudocode of the hybrid method is given in Algorithm 2. For SVM, the LIBSVM implementation was used, which was originally developed by Chang and Lin [27]. The empirical experiment was conducted on Intel Dual-Core TM (2.0 GHz CPU) with 2 GB of RAM.

Experimental Design
In order to guarantee valid results, k-fold CV (with k = 10) was used to evaluate the classification results [28]. Each time, nine of the ten subsets were put together to form a training set and the remaining subset was used as the test set. Then the average result across all 10 trials was calculated. Thanks to this method, all the test sets were independent and the reliability of the results could be improved. Because of the arbitrariness of the partition of the dataset, the predicted results of the model at each iteration were not necessarily the same. To evaluate the performance on the PD dataset accurately, the experiment was repeated 10 times and then the results were averaged.

[Figure 3: flowchart of the proposed model: train KELM with the best combination $(C, \gamma)$; predict the labels on the remaining test set using the trained KELM model; repeat until K = 10; average the prediction results on the ten independent test sets; obtain the optimal model.]

Begin
  Weight features using subtractive clustering algorithm;
  For i = 1:k /* Performance estimation by using k-fold CV, where k = 10 */
    Training set = k-1 subsets;
    Test set = remaining subset;
    Train KELM classifier in the weighted training data feature space, store the best parameter combination;
    Test the trained KELM model on the test set using the achieved best parameter combination;
  End For
  Return the average classification results of KELM over the k test sets;
End
Algorithm 2: Pseudocode for the proposed model.
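The evaluation protocol described in this section (10-fold CV, repeated 10 times and averaged) can be sketched with the Python standard library alone. The helper names `kfold_splits` and `repeated_cv`, the use of the run index as the shuffling seed, and the `score_fn(train_idx, test_idx)` callback are assumptions for illustration; the source does not say whether the folds were stratified, so plain shuffled folds are used here.

```python
import random
from statistics import mean


def kfold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def repeated_cv(score_fn, n_samples, k=10, repeats=10):
    """Average score_fn(train, test) over k folds, then over repeated runs.

    score_fn is a hypothetical callback that trains a model on the training
    indices and returns a score (for example ACC) on the test indices.
    """
    run_means = []
    for run in range(repeats):
        scores = [score_fn(train, test) for train, test in kfold_splits(n_samples, k, seed=run)]
        run_means.append(mean(scores))
    return mean(run_means)


# Example with the 195 samples of the PD dataset and a dummy scorer.
splits = list(kfold_splits(195, k=10))
dummy_result = repeated_cv(lambda train, test: 1.0, n_samples=195)
```

Each sample appears in exactly one test fold per run, so the ten test sets of a run are independent of each other, and reseeding the shuffle per run reflects the arbitrariness of the partition mentioned above.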

Measure for Performance Evaluation.
In order to evaluate the prediction performance of the SCFW-KELM model, we used six performance metrics: ACC, sensitivity, specificity, AUC, f-measure, and the kappa statistic value. These metrics are defined in terms of the entries of the confusion matrix. In the confusion matrix, TP is the number of true positives, that is, cases of the PD class that are correctly classified as PD. FN is the number of false negatives, that is, cases of the PD class that are classified as healthy. TN is the number of true negatives, that is, cases of the healthy class that are correctly classified as healthy, and FP is the number of false positives, that is, cases of the healthy class that are classified as PD. ACC is a widely used metric for determining the class discrimination ability of classifiers. The receiver operating characteristic (ROC) curve is usually plotted as the true positive rate versus the false positive rate as the discrimination threshold of the classification algorithm is varied. The area under the ROC curve (AUC) is widely used in classification studies and is a good summary of the performance of a classifier [29]. The f-measure is a measure of a test's accuracy that is usually used to assess the performance of a binary classifier; it is based on the harmonic mean of the classifier's precision and recall. Kappa error (KE), or Cohen's kappa statistic (KS), is adopted to compare the performances of different classifiers. KS is a good measure for inspecting classifications that may be due to chance. The closer the KS value of a classifier is to 1, the more its performance is assumed to be real rather than due to chance. Thus, the KS value is a recommended metric to consider in the performance analysis of classifiers, and it is calculated as [30]
$$\mathrm{KS} = \frac{P(A) - P(E)}{1 - P(E)},$$
where $P(A)$ is the total agreement probability and $P(E)$ is the agreement probability due to chance.
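As a worked illustration of these metrics, the sketch below computes ACC, sensitivity, specificity, f-measure, and KS from the four confusion-matrix counts using their standard textbook definitions (ACC = (TP + TN) / total, sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)). The function name `binary_metrics` and the example counts are hypothetical and are not results from the paper; AUC is omitted because it needs ranked scores rather than counts.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics; assumes every denominator is nonzero."""
    n = tp + fn + tn + fp
    acc = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # recall on the PD (positive) class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement P(A) versus chance agreement P(E).
    p_a = acc
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (p_a - p_e) / (1 - p_e)
    return {
        "acc": acc,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
m = binary_metrics(tp=40, fn=10, tn=45, fp=5)
```

With these counts, ACC is 85/100 = 0.85, the chance agreement is (50·45 + 50·55)/100² = 0.5, and KS is therefore (0.85 − 0.5)/(1 − 0.5) = 0.7.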

Experimental Results and Discussions
Experiment 1 (classification in the PD dataset). In this experiment, we first evaluated KELM in the original feature space without SCFW. It is known that different types of kernel functions have great influence on the performance of KELM. Therefore, we present the results of our investigation on the influence of different types of kernel functions.
To investigate whether the SCFW method can improve the performance of KELM, we further evaluated the model on the PD dataset in the feature space weighted by SCFW. The proposed system consisted of two stages. First, the SCFW approach was used to weight the features of the PD dataset, which constructed the weighted feature space. Table 4 lists the cluster centers of the features in the PD dataset obtained with the SCFW method. Figure 4 depicts the box plot representation of the original and weighted PD dataset with all 22 features. Figure 5 shows the distribution of the two classes of the original and weighted 195 samples formed by the best three principal components obtained with the principal component analysis (PCA) algorithm [31]. From Figures 4 and 5, it can be seen that the discriminative ability of the original PD dataset has been improved substantially by the SCFW approach. After the data preprocessing stage, the classification algorithms were used to discriminate the weighted PD dataset. The detailed results obtained by SCFW-KELM with four different types of kernel functions are presented in Table 5. As seen from Table 5, all these best results were much higher than the ones obtained in the original feature space without SCFW. The classification performance in the PD dataset was significantly improved by using the SCFW method.
Compared with KELM with the RBF kernel function in the original feature space, KELM with the RBF kernel based on the SCFW method increased the performance by 3.6%, 3.65%, 3.67%, and 3.65% in terms of ACC, sensitivity, specificity, and AUC and obtained the highest f-measure value of 0.9966 and the highest KS value of 0.9863. The KELM models with the other three kernel functions also improved greatly in terms of the six performance metrics. Table 6 presents the comparison of the confusion matrices obtained by SCFW-KELM and KELM. As seen from Table 6, SCFW-KELM correctly classified 194 of the 195 cases and misclassified only one patient with PD as a healthy person, while KELM without the SCFW method correctly classified only 187 of the 195 cases and misclassified 6 patients with PD as healthy persons and 2 healthy persons as patients with PD.
For the SVM classifier, we used an SVM with the RBF kernel. It is known that the performance of SVM is sensitive to the combination of the penalty parameter $C$ and the kernel parameter $\gamma$. Thus, the best combination of $(C, \gamma)$ needs to be selected for the classification task. Instead of manually setting the parameters $(C, \gamma)$ of SVM, the grid-search technique [32] was adopted using 10-fold CV to find the best parameter values. The parameters were varied over $C \in \{2^{-15}, 2^{-14}, \ldots, 2^{11}\}$ and $\gamma \in \{2^{-15}, 2^{-14}, \ldots, 2^{5}\}$. All combinations of $(C, \gamma)$ were tried, and the one with the best classification accuracy was chosen as the parameter values of the RBF kernel for training the model.
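The grid search over powers of two can be sketched as follows. The function `grid_search` and the `cv_accuracy(C, gamma)` callback are hypothetical names introduced for illustration; in practice the callback would run 10-fold CV for the given parameters, whereas here a toy scoring function with a known optimum stands in for it.

```python
import math
from itertools import product


def grid_search(cv_accuracy, c_exponents, gamma_exponents):
    """Try every (2**c, 2**g) pair and return (best_C, best_gamma, best_score).

    cv_accuracy is a hypothetical callback returning the cross-validated
    accuracy for a given (C, gamma) pair.
    """
    best_C, best_gamma, best_score = None, None, float("-inf")
    for c_exp, g_exp in product(c_exponents, gamma_exponents):
        C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = cv_accuracy(C, gamma)
        if score > best_score:
            best_C, best_gamma, best_score = C, gamma, score
    return best_C, best_gamma, best_score


def toy_score(C, gamma):
    """Stand-in for a CV run: peaks at C = 2**3 and gamma = 2**-2."""
    return -abs(math.log2(C) - 3) - abs(math.log2(gamma) + 2)


# Exponent ranges matching the SVM grid stated in the text.
best_C, best_gamma, best_score = grid_search(toy_score, range(-15, 12), range(-15, 6))
```

Searching on a logarithmic grid keeps the number of CV runs manageable (27 × 21 combinations for the ranges above) while still covering many orders of magnitude for both parameters.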
For the original ELM, the classification performance of ELM with the sigmoid additive function is sensitive to the number of hidden neurons $L$, so the value of $L$ needs to be specified by users. Figure 6 presents the detailed results of ELM in the original and weighted PD dataset with the number of hidden neurons ranging from 1 to 50. Specifically, the average results of 10 runs of 10-fold CV were recorded for every specified number of neurons. As shown in Figure 6, the classification rates of ELM at first improved as the number of hidden neurons increased and then gradually fluctuated. In the original dataset, ELM achieved the highest mean classification accuracy with 40 hidden neurons, while in the dataset weighted with the SCFW method, the highest mean classification accuracy was obtained with only 26 hidden neurons.
For the KNN classifier, the influence of the neighborhood size $k$ on the classification performance in the PD dataset has been investigated. In this study, the value of $k$ was increased from 1 to 10. The results obtained from the KNN classifier with different values of $k$ in the PD dataset are shown in Figure 7. From the figure, we can see that the best results were obtained by the 1-NN classifier and that the performance decreased as $k$ increased, while in the PD dataset weighted with the SCFW method the better results were achieved by 2-NN.
For the KELM classifier, two parameters, the penalty parameter $C$ and the kernel parameter $\gamma$, need to be specified. In this study, we conducted the experiments on KELM with the best combination of $(C, \gamma)$ found by the grid-search strategy. The parameters $C$ and $\gamma$ were both varied in the range $\{2^{-15}, 2^{-14}, \ldots, 2^{15}\}$, that is, with a step size of 1 in the exponent. Figure 8 shows the classification accuracy surface in one run of the 10-fold CV procedure, where the $x$-axis and $y$-axis are $\log_2 C$ and $\log_2 \gamma$, respectively. Each mesh node in the $(x, y)$ plane of the classification accuracy surface represents a parameter combination, and the $z$-axis denotes the test accuracy achieved with each parameter combination. Table 7 summarizes the comprehensive results achieved by the four classifiers and by their SCFW-based counterparts in terms of ACC, sensitivity, specificity, AUC, f-measure, and KS value over 10 runs of 10-fold CV. In addition, the sum of the computational time of training and that of testing, in seconds, was recorded. In this table, we can see that, with the aid of the SCFW method, all the best results were much higher than the ones obtained in the original feature space. The SCFW-KELM model achieved the highest results of 99.49%, 100%, 99.39%, and 99.69% in terms of ACC, sensitivity, specificity, and AUC and obtained the highest f-measure of 0.9966 and KS value of 0.9863, outperforming the other three algorithms. Compared with KELM without SCFW, SCFW-KELM improved the average performance by 3.6%, 3.65%, 3.67%, and 3.65% in terms of ACC, sensitivity, specificity, and AUC. Note that the running time of SCFW-KELM was extremely short, at only 0.0126 seconds.
In comparison with SVM, SCFW-SVM achieved results of 97.95%, 96.67%, 98.71%, and 97.6% in terms of ACC, sensitivity, specificity, and AUC and improved the performance by 2.57%, 11.58%, 0.04%, and 5.72%, respectively. KNN was also significantly improved by the SCFW method. The ELM classifier achieved its best results with 36 hidden neurons in the original feature space, while the best performance of SCFW-ELM was achieved with fewer hidden neurons (only 26). This means that the combination of SCFW and ELM not only significantly improved the performance but also made the network structure of ELM more compact. Moreover, the sensitivity results of SVM and ELM were significantly improved, by 11.58% and 21.84%, respectively. In both the original and the weighted feature space, KELM with the RBF kernel was superior to the other three models by a large margin in terms of ACC, sensitivity, specificity, AUC, f-measure, and KS value. Although SVM achieved the specificity of 98.67%, its sensitivity, AUC, f-measure, and KS value were lower than those of KELM with the RBF kernel. We can also see that the performance of KELM with the RBF kernel was much higher than that of ELM with the sigmoid function. The reason may lie in the fact that the relation between class labels and features in the PD dataset is linearly nonseparable; the kernel-based strategy works better in this case by transforming a linearly nonseparable dataset into a linearly separable one. However, the performances obtained by the SCFW-SVM approach were close to those of SCFW-KNN. This means that, after data preprocessing, SVM can achieve the same ability to discriminate the PD dataset as KNN. Additionally, it is interesting to find that the standard deviation (SD) of SCFW-KELM was much lower than that of KELM and was the smallest among all the models, which means that SCFW-KELM became more robust and reliable by means of the SCFW method.
In addition, the reason why the SCFW method outperforms FCM may be that SCFW is more suitable for linearly nonseparable datasets. It considers the density measure of data points to reduce the influence of outliers, whereas FCM tends to select outliers as initial centers.
For comparison purposes, the classification accuracies achieved by previous methods that addressed the PD diagnosis problem are presented in Table 8. As shown in the table, our method obtains better classification results than all available methods proposed in previous studies.
Experiment 2 (classification in two other benchmark datasets). Besides the PD dataset, two benchmark datasets from the UCI machine learning repository, namely, the Cleveland Heart and Wisconsin Diagnostic Breast Cancer (WDBC) datasets, have been used to further evaluate the efficiency and effectiveness of the proposed method. We used the same flow as for the PD dataset in the experiments on the two datasets. The weighted feature space of each dataset was constructed using SCFW, and then the weighted features were evaluated with the four mentioned algorithms. For the sake of brevity, only the classification results of the four algorithms are given. Table 9 shows the results obtained in the classification of the original and weighted Cleveland Heart dataset by the SCFW-KELM model. Table 10 presents the results achieved in the classification of the original and weighted WDBC dataset using the SCFW-KELM model. As seen from these results, the proposed method also achieved excellent results on these datasets, which indicates the generality of the proposed method.

Conclusions and Future Work
In this work, we have developed a new hybrid diagnosis method for addressing the PD problem. The main novelty of this paper lies in the proposed approach: the combination of the SCFW method and KELM with different types of kernel functions allows the detection of PD in an efficient and fast manner. Experimental results have demonstrated that the proposed system performs very well in discriminating between patients with PD and healthy subjects. In addition, comparisons were conducted among KELM, SVM, KNN, and ELM. The experimental results show that the SCFW-KELM method performs better than the other three methods in terms of ACC, sensitivity, specificity, AUC, f-measure, and kappa statistic value. Furthermore, the proposed system outperforms the existing methods proposed in the literature. The empirical analysis indicates that the proposed method can be used as a promising alternative tool in medical decision making for PD diagnosis.