A Multiple-Classifier Framework for Parkinson's Disease Detection Based on Various Vocal Tests

Recently, the use of speech pattern analysis to build predictive telediagnosis and telemonitoring models for Parkinson's disease (PD) has attracted many researchers. Several datasets of voice samples exist for this purpose; the UCI dataset named “Parkinson Speech Dataset with Multiple Types of Sound Recordings” contains a variety of vocal tests, including sustained vowels, words, numbers, and short sentences, compiled from a set of speaking exercises performed by healthy people and people with Parkinson's disease (PWP). Some researchers claim that summarizing the multiple recordings of each subject with central tendency and dispersion metrics is an efficient strategy for building a predictive model for PD. However, they have overlooked the point that a PD patient may have more difficulty pronouncing certain terms than others. Thus, summarizing the vocal tests may lead to a loss of valuable information. To address this issue, the classification setting must take the differing discriminating ability of each vocal test into account. As a solution, we introduce a new framework that applies an independent classifier to each vocal test. The final classification result is a majority vote over all of the classifiers. When combined with filter-based feature selection, our methodology enhances classification accuracy by up to 15%.


Introduction
Parkinson's disease was first described in 1817 by Doctor James Parkinson as "shaking palsy" [1]. It is the second most common neurological disease after Alzheimer's disease and is most common among the elderly [2,3]. PD is a progressive disease in which an area of the brain becomes damaged over the years. It causes various signs and symptoms, which can be grouped into two major categories: motor symptoms and nonmotor symptoms. Motor symptoms are those that affect movement and muscles, whereas nonmotor symptoms include neurobehavioral and cognitive problems, sleep problems, sensory problems, and autonomic neuropathy (dysautonomia) [4].
Speech disturbance is one of the most common motor problems of PD [4]. Research has shown that about 90% of PWP are affected with motor problems, especially speech impairment [5,6]. In addition to the prevalence of vocal impairments in PD patients, gathering speech samples and doing signal processing of their voice has low cost and it is appropriate for telemonitoring and telediagnosis systems [7,8]. Therefore, PD diagnosis from speech impairments is becoming more widespread.
In Parkinsonism patients, speech disorders result from neurologic impairments which are associated with weakness, slowness, or incoordination of the muscles used to produce speech [9,10]. Speech disturbance usually occurs in the following forms: hypophonia, which is soft speech that results from weakness in the vocal musculature, monotonic speech, which deals with speech quality in the cases that are soft, hoarse, and monotonous, and festination speech, which is when the speech becomes excessively rapid, soft, breathy, and poorly intelligible [4].
Many approaches have been proposed in order to find the severity of each speech impairment sign. The two best known types of vocal tests for this purpose are sustained phonation [11,12] and running speech [12] tests. In sustained phonation, the patient is asked to say a single vowel while holding its pitch as constant and for as long as possible. In running speech, the patient says a standard sentence which includes representative linguistic units that can show possible impairment signs of vocal disorder. The main focus of this research is on the latter problem statement. Previous studies had two main flaws: (a) all the voice samples were classified by a single classifier; (b) the vocal samples of each subject were summarized with the help of statistical metrics irrespective of the discriminating ability of each vocal test.
Since most studies in the area of speech-based PD detection are done on datasets gathered from just one or a few types of vocal tests, we have turned our attention to a dataset with multiple sound recordings. The main contributions of this study are twofold: (1) to suggest a new classification framework which applies a separate classifier to the vocal samples of each type, for example, a classifier just for vowel "a," rather than applying a single classifier to all vocal samples and (2) to show which vocal tests are more representative and to indirectly omit less discriminating vocal tests by embedding majority voting in our proposed method.
The rest of the paper is organized as follows. Section 2 reviews previous studies of this domain. In Section 3, a brief description of the dataset, evaluation metrics, and the proposed method can be found. Section 4 demonstrates the results of this work and, finally, Section 5 presents the conclusion of this study.

Related Work
In recent years the detection of vocal disorders with the help of machine learning has become a hot topic. Various research papers have attempted to solve this problem by considering acoustic measurements of dysphonia as effective features to distinguish normal (control) from disordered cases [7,8,13,14]. Studies in this field can be categorized into two main groups: (1) those that attempt to find the most effective vocal features and produce new datasets [8,13,15] and (2) those that try to find more effective features from existing datasets and concentrate on enhancing classification accuracy [14, 16–26].
Some studies focused on how to produce new datasets based on their research findings. Little et al. in [8] aimed to analyze the effectiveness of nonstandard measurements. Their work led to the introduction of a new dysphonia measurement named PPE (pitch period entropy). In their study, they collected sustained vowel "a" phonations from 31 subjects, of whom 23 were PD patients, and they reached a classification accuracy of 91.4%. In [13], Sakar et al. presented a dataset of 40 subjects including 20 PD patients. Each individual was trained to say a set of 26 distinct disorder-representative terms consisting of sustained vowels, words, numbers, and short sentences. This dataset is the focus of the current work. They applied the summarized leave-one-out (s-LOO) validation technique, in which all the voice samples of each individual are summarized using central tendency and dispersion metrics such as median, mean, standard deviation, trimmed mean, interquartile range, and mean absolute deviation. Their approach obtained 77.5% classification accuracy. Tsanas et al. in [15] focused on monitoring PD progression with the help of features extracted using signal processing techniques applied to a large dataset of about 6000 voice samples from 42 patients with early-stage PD. They attempted to estimate the unified Parkinson's disease rating scale (UPDRS) using linear and nonlinear regression. Their results show a difference of about 7.5 points from clinical UPDRS estimations. These three datasets are the main publicly available datasets in the area of speech-based PD studies.
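The per-subject summarization used in s-LOO can be illustrated with a minimal sketch. It assumes one feature measured across a subject's recordings and computes only four of the listed metrics (mean, median, standard deviation, and mean absolute deviation); the function name `summarize` and the sample values are hypothetical and are not taken from [13].

```python
import statistics

def summarize(values):
    """Collapse one feature, measured over a subject's recordings, into summary metrics."""
    mean = statistics.fmean(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "mad": statistics.fmean([abs(v - mean) for v in values]),  # mean absolute deviation
    }

# Hypothetical values of a single feature over five recordings of one subject.
recordings = [1.0, 2.0, 3.0, 4.0, 10.0]
summary = summarize(recordings)
```

After this step each subject is represented by one summary vector instead of many recordings, which is exactly where information about individual vocal tests is lost.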
Other studies tried to improve the PD detection rate using the existing datasets. For instance, Tsanas et al. in [14] computed 132 new dysphonia measurements on an existing dataset consisting of 263 vowel "ahh. . ." phonations from 43 cases and, by applying feature selection techniques, obtained 99% overall classification accuracy. In another work, Sakar and Kursun [16] tried to assess the relevance and correlation between the features and PD score by applying a mutual information-based selection algorithm with a permutation test and fed the data, with selected features ranked based on maximum-relevance-minimum-redundancy (mRMR), into an SVM classifier. They used leave-one-subject-out (LOSO) as the cross validation technique of their model in order to avoid bias. In the LOSO validation scheme, all the voice samples of the individual who is the candidate testing sample are left out of the rest of the data. Their approach gained 92.75% classification accuracy [8]. Shahbaba and Neal [17] presented a nonlinear model based on Dirichlet mixtures and obtained a classification accuracy of 87.7%. Das [18] conducted a comparative study of neural networks (NN), DMneural, regression, and decision trees for PD diagnosis; the study resulted in a classification performance of 92.9% based on NN. Guo et al. [19] applied a combination of genetic programming and expectation maximization (EM) and obtained a classification accuracy of 93.1%. Luukka [20] proposed a method that used fuzzy entropy measures and a similarity classifier and resulted in a mean accuracy of 85.03%. Li et al. [21] introduced a fuzzy-based nonlinear transformation approach combined with SVM; their best classification accuracy was 93.47%. Ozcift and Gulten [22] proposed classifier ensemble construction with a rotation forest approach, which achieved a classification accuracy of 87.13%. Aström and Koker [23] achieved a classification accuracy of 91.2% by using a parallel neural network model.
Polat [24] applied the fuzzy C-means clustering feature weighting together with the k-nearest neighbor classifier; their best obtained classification accuracy was 97.93%. Chen et al. [25] proposed a model which combined PCA and the fuzzy k-nearest neighbor method; their classification approach achieved an accuracy of 96.07%. Zuo et al. [26] used a diagnosis model based on particle swarm optimization (PSO) to strengthen the fuzzy k-nearest neighbor classifier which resulted in mean classification accuracy of 97.47%.
In most of the studies, SVM was used as the base classifier to distinguish healthy subjects from PWP [8,14,27] and the success of the diagnostic system is measured with ROC curves, AUC, and reporting True Positive and False Positive rates [28].
The speech datasets used in the field of PD diagnosis consist of multiple speech recordings per subject [29]. These datasets can be grouped into two categories: (1) those that contain the repetition of one term and (2) those that consist of different vocal terms. The majority of datasets fall into the first category. Hence, most of the studies on PD diagnosis are conducted on these datasets [14, 16–26]; however, none of them could obtain 100% classification accuracy. The most popular and available datasets of this type are "Parkinson's Data Set" [7] and "Parkinson's Telemonitoring Data Set" [15], both accessible from the UCI Machine Learning Repository. The only dataset of the second category that is available in the form of a processed data matrix was produced by Sakar et al. [13]. Less research has been conducted on this type of dataset; also, the corresponding classification accuracies have not been promising so far. The aim of this study is to show that this type of data collection can lead to high PD detection rates just by altering the classification strategy.
The dataset used in this study is the one produced by Sakar et al. [13], which is available on the University of California, Irvine (UCI) machine learning dataset repository website. This dataset consists of 40 subjects, including 20 PD patients and 20 control subjects. For each subject, 26 different sound recordings have been gathered, consisting of three sustained vowels, numbers one through 10, nine words, and four short sentences. There are 26 features extracted from these recordings. Table 1 lists the features gathered in Sakar et al.'s work and their corresponding groups (see [13] for more details).

Overview of the Proposed
where $\bar{x} = (1/n) \sum_{i=1}^{n} x_i$ and similarly $\bar{y} = (1/n) \sum_{i=1}^{n} y_i$. This equation gives a value between −1 and +1, where +1 is maximum positive correlation, 0 is no correlation, and −1 is the strongest negative correlation. The p-values were calculated using Student's t-distribution for a transformation of the correlation. Those features in the correlation coefficient matrix with p-values less than 0.05 were selected.
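The correlation-based filter described above can be sketched as follows. This is a sketch under stated assumptions and not the authors' implementation: it assumes the transformation is the usual statistic t = r·sqrt((n − 2)/(1 − r²)), and instead of computing p-values it compares |t| with the tabulated two-sided 5% critical value for 38 degrees of freedom (about 2.024, matching 40 subjects). The function names and the toy data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0  # constant input: treat as uncorrelated

def t_statistic(r, n):
    """Assumed transformation of r that follows a t-distribution with n - 2 d.o.f."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))

def select_features(columns, labels, t_crit):
    """Return indices of feature columns whose |t| exceeds the critical value."""
    n = len(labels)
    return [j for j, col in enumerate(columns)
            if abs(t_statistic(pearson_r(col, labels), n)) > t_crit]

# Toy data: 20 controls (0) followed by 20 patients (1).
labels = [0] * 20 + [1] * 20
feature_a = [float(i) for i in range(40)]  # rises with the label
feature_b = [0.0, 1.0] * 20                # unrelated to the label
T_CRIT_DF38 = 2.024  # tabulated two-sided 5% critical value, 38 d.o.f.
selected = select_features([feature_a, feature_b], labels, T_CRIT_DF38)
```

In practice a statistics library would return the p-value directly; the critical-value comparison is used here only to keep the sketch dependency-free.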

(3) MCFS and A-MCFS. When the Pearson Correlation Coefficient feature selection is applied, some vocal tests may remain with no relevant features. We call those vocal tests unsuccessful vocal tests. Two approaches for dealing with unsuccessful vocal tests are taken in this study. The first is the MCFS approach, in which the unsuccessful vocal tests are included in the analysis using only the most prevalent features of the other vocal tests. Table 2 shows each vocal test and its corresponding correlated features after applying feature selection, and Figure 2 shows the frequency of each feature. Features 2 and 4, with frequencies of six and five, respectively, were the most frequently selected features. The third most frequent position was shared by features 25 and 26, each with a frequency of four. These four most frequent features were used in MCFS as the selected features for unsuccessful vocal tests. The other methodology, A-MCFS, is to simply omit unsuccessful vocal tests.
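The difference between the two strategies can be sketched in a few lines. The sketch assumes the per-test selections are available as a mapping from vocal test name to a list of feature indices; the function `resolve_feature_sets`, the test names, and the selections below are hypothetical and do not reproduce Table 2.

```python
from collections import Counter

def resolve_feature_sets(selected, top_k=4, mode="MCFS"):
    """Decide which features each vocal test uses.

    selected: dict mapping vocal test name -> list of selected feature indices
              (an empty list marks an unsuccessful vocal test).
    mode:     "MCFS" gives unsuccessful tests the top_k most frequent features;
              "A-MCFS" drops unsuccessful tests entirely.
    """
    counts = Counter(f for feats in selected.values() for f in feats)
    fallback = sorted(f for f, _ in counts.most_common(top_k))
    resolved = {}
    for test, feats in selected.items():
        if feats:
            resolved[test] = list(feats)
        elif mode == "MCFS":
            resolved[test] = fallback
        # in A-MCFS mode an unsuccessful test is simply not added
    return resolved

# Hypothetical selections; "number_3" is an unsuccessful vocal test.
selected = {
    "vowel_a": [2, 4],
    "vowel_o": [2, 4, 25],
    "word_1": [2, 26],
    "word_2": [7, 2],
    "number_3": [],
    "sentence_1": [4, 25, 26, 2],
}
mcfs = resolve_feature_sets(selected, mode="MCFS")
a_mcfs = resolve_feature_sets(selected, mode="A-MCFS")
```

Ties among equally frequent features are resolved here by first occurrence, which is an arbitrary choice of this sketch.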
(4) Classification and Majority Voting. After doing feature selection on each subset, for each of them, a classifier is built. Since we have 26 vocal tests, 26 classifiers are built. Each of these classifiers will predict the class label of its own subset.
Leave-one-out cross validation was used for all of these classifiers. Since each subject has only one record in each subset, we do not have to worry about how to treat each subject during cross validation, as was the case in LOSO or previous approaches.
The majority vote of the classifiers decides which class the person belongs to. Each classifier votes on whether the subject has PD or not. A subject whom the majority of the classifiers have voted to be a PD patient is labeled "1," showing the presence of PD, and "0" otherwise.
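The classification and voting steps can be sketched as follows. This is a minimal illustration, assuming a 1-nearest-neighbor classifier with Euclidean distance for every vocal test and three toy subsets with four subjects each; all function names and data are hypothetical. Ties are resolved to "0", following the "otherwise" rule above.

```python
import math

def nearest_neighbor_label(train_X, train_y, query):
    """Predict the label of the closest training sample (1-NN, Euclidean)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return train_y[best]

def loo_predictions(X, y):
    """Leave-one-out: predict each subject from all other subjects of one subset."""
    preds = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        preds.append(nearest_neighbor_label(train_X, train_y, X[i]))
    return preds

def majority_vote(votes_per_test):
    """Label a subject 1 if more than half of the classifiers voted 1, else 0."""
    n_tests = len(votes_per_test)
    n_subjects = len(votes_per_test[0])
    return [1 if sum(votes[s] for votes in votes_per_test) > n_tests / 2 else 0
            for s in range(n_subjects)]

# One row per subject in each subset; subject order is the same in every subset.
y = [0, 0, 1, 1]
subsets = [
    [[0.0], [0.1], [1.0], [1.1]],  # informative vocal test
    [[0.0], [0.2], [0.9], [1.0]],  # informative vocal test
    [[0.0], [1.0], [0.1], [1.1]],  # poorly discriminating vocal test
]
votes = [loo_predictions(X, y) for X in subsets]
final = majority_vote(votes)
```

In this toy case the third classifier misclassifies every subject, yet the vote of the other two still yields the correct final labels, which is the effect majority voting is meant to have on less discriminating vocal tests.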
Here, TP (True Positive) is the number of PD patients who are correctly classified as Parkinsonism patients by the model, TN (True Negative) is the number of control subjects who are correctly labeled as healthy by the model, FN (False Negative) is the number of patients whom the model falsely labeled as healthy, and finally FP (False Positive) is the number of healthy cases who are incorrectly labeled as having PD by the classifier. Accuracy is simply the ratio of the correctly classified samples to the total number of instances:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
A well-known metric in machine learning which can be used for evaluating the quality of a binary classifier is the Matthews correlation coefficient (MCC). This metric is reliable since it takes TP, TN, FN, and FP into account, which makes it stable even if the classes are of very different sizes. MCC is in fact a correlation coefficient between the observed (actual) labels of the samples and those predicted by the binary classifier:
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
This equation returns a value between −1 and +1. A coefficient of +1 shows a perfect prediction, 0 represents the fact that the classifier is not better than random guessing, and finally −1 indicates a complete disagreement between the actual values and the predicted ones.
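The evaluation metrics can be computed directly from the four confusion counts. The sketch below uses the standard definitions of accuracy, sensitivity, specificity, and MCC; the function name `evaluate` and the example counts (for 40 subjects) are hypothetical and chosen only for illustration.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # share of PD patients detected
    specificity = tn / (tn + fp)  # share of healthy subjects recognized
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}

# Hypothetical outcome for 20 patients and 20 controls.
scores = evaluate(tp=16, tn=15, fp=5, fn=4)
```

Returning 0 when the MCC denominator vanishes is a common convention and an assumption of this sketch.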

Results and Discussions
After separating the data into subsets, the z-score normalization process was applied on each subset. In other words, after transformation, the mean of each feature is equal to zero and its standard deviation is equal to one. Then the proposed framework was applied to the normalized data. Four classifiers, namely k-NN, SVM, discriminant analysis, and Naïve Bayes, were applied to the preprocessed data. The distance metric used for the k-NN classifier was Euclidean distance, and k values of 1, 3, 5, and 7 were used. The SVM classifier was applied using linear and radial basis function (RBF) kernels with a scaling factor (sigma) of 3 and a penalty parameter (C) of 1. Table 3 includes the accuracy, sensitivity, specificity, and MCC obtained from applying the mentioned classifiers under LOSO, s-LOO, and the proposed frameworks. The results reveal that k-NN classifier performance is almost analogous to random guessing when it is used with the LOSO cross validation technique. Besides that, s-LOO could not perform much better than LOSO when it comes to k-NN since its best overall accuracy and MCC are 65.00% and 0.3062, respectively. Results show that A-MCFS outperforms s-LOO's best result with an overall accuracy of 77.50%, which is a 12.5% improvement, and an MCC of 0.5507. When k is 1, 3, and 5, MCFS results are better than s-LOO by at least 5% and at most 12.5%, but its accuracy is 2.5% lower than s-LOO when k is 7.
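The z-score normalization applied to each subset before classification can be sketched as follows. The sketch assumes the sample standard deviation (n − 1 in the denominator) and leaves constant features unscaled; the function name `zscore_columns` and the toy rows are hypothetical.

```python
import statistics

def zscore_columns(rows):
    """Normalize each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # sample standard deviation
    stds = [s if s > 0 else 1.0 for s in stds]         # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy subset: three subjects, two features on very different scales.
subset = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normalized = zscore_columns(subset)
```

Putting features on a common scale matters most for distance-based classifiers such as k-NN, where a feature with a large numeric range would otherwise dominate the Euclidean distance.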
Sensitivity is another important factor, especially in biomedical sciences, which should be investigated closely in the results. As the results show, A-MCFS also has improved the sensitivity up to 80.00% and its lowest sensitivity (70.00%) is still better than that of LOSO and s-LOO when k-NN is used. k-NN achieved its best results when A-MCFS was applied; besides this, LOSO and s-LOO could not reach MCFS's results except for k = 7.
In addition, the best classification accuracy obtained by applying A-MCFS is 87.5% which is a 10% accuracy enhancement in comparison to the best accuracy obtained by s-LOO. Figure 3 gives a better demonstration of classification accuracies obtained by different methods.
In order to examine the correctness of our approach toward finding less discriminating vocal tests, we have reported the classification accuracy of each vocal test prior to the majority voting phase. The results are shown in Table 4. Comparing the results shown in Tables 2 and 4 reveals that the features which were excluded in A-MCFS are those that achieved a mean accuracy of below 55%. This shows the reason of the superiority of A-MCFS over MCFS. As a result, a closer investigation toward finding more effective vocal tests is necessary. 6 International Journal of Telemedicine and Applications

Conclusion
PWP detection based on vocal samples has been an attractive area of research. Discriminating PD patients from healthy people based on different vocal tests had been less accurate since all the vocal terms were treated by a single classifier. The proposed method treated each vocal test separately and used majority voting to resolve any potential confusion. The results obtained in this research showed that more accurate PD detection based on multiple vocal tests is achievable. Another important result of this study was that the discriminating ability of the vocal terms is not the same; even some of the vocal terms that have been considered discriminating in the literature, such as vowel "a," failed to be successful. As a result, our study may encourage other researchers to conduct further studies on different vocal terms from the proposed perspective.
As future work, we plan to devise a laboratory setting to collect data from PWP and healthy subjects with several vocal tests in various languages and to extend our results to other languages.