Data Clustering Improves Siamese Neural Networks Classification of Parkinson’s Disease

Parkinson’s disease (PD) is a clinical neurodegenerative disease with symptoms such as tremor, rigidity, and postural instability. According to Harvard, about 60,000 American citizens are diagnosed with PD yearly, with more than 10 million people affected worldwide. An estimated 4% of people with PD are diagnosed before the age of 50; however, the incidence increases with age. Diagnosis of PD relies on the expertise of the physician and depends on several established clinical criteria. This makes the diagnosis subjective and inefficient. Hence, continuous efforts are being made to enhance the diagnosis of PD using deep learning approaches that rely on experienced neurologists. Siamese neural networks mainly work on two different input vectors and are used to compare the resulting output vectors. Moreover, clustering a dataset before applying classification enhances the distribution of similar samples among groups. In addition, applying the Siamese network can overcome the limitation of samples per class in the dataset by guiding the network to learn differences between samples rather than focusing on learning specific classes. In this paper, a Siamese neural network is applied to diagnose PD. Siamese networks predict the sample class by estimating how similar a sample is to other samples. The idea behind this work is to cluster the dataset before training the network, because pairs of different classes that belong to the same cluster are the most likely to be mistaken by the network for matched pairs. To overcome this problem, the dataset is first clustered, and then the network is fed pairs drawn from the same cluster. The proposed framework is evaluated by comparing its performance on clustered data against its performance on unclustered data. The accuracy achieved for classifying PD patients reached 76.75% on unclustered data, while it reached 84.02% on clustered data. The significance of this study is in the enhanced performance achieved due to the clustering of data, which shows a promising framework to enhance the diagnostic capability of computer-aided disease diagnostic tools.


Introduction
Machine and deep learning are increasingly used in numerous fields. Medical and health applications are among those fields, where machine learning and deep learning are used to diagnose, detect, and predict at an early stage diseases such as Alzheimer's [1], cardiovascular disease [2], cancer [3], and Parkinson's disease [4]. The models used for diagnosing, detecting, and predicting diseases include algorithms such as decision trees [5], clustering [6], support vector machines [5], naïve Bayes [6], logistic regression [7], and neural networks [3,5,6]. Neural networks, especially deep neural networks, have high classification accuracy; however, these models fail when the number of samples used for training is small. The Siamese neural network [1,4] is one type of neural network model that works well under this limitation. The Siamese neural network was first presented in [4] for signature verification, and this work was later extended to text similarity [8], face recognition [9,10], video object tracking [11], and other image classification work [1,12]. This work is motivated by the importance of Parkinson's disease (PD), which is a neurodegenerative disease (NDD) that has motor-related symptoms, such as tremor and instability [13]. The prevalence of Parkinson's disease is expected to reach 1% and 4% among people aged 60 and 80, respectively [14]. The key concern of computer-aided PD diagnosis is the early detection of the first signs of PD, for a better quality of life for patients [15]. PD is diagnosed using many techniques and types of data. Some techniques rely on biomarkers [16]. Other methods use handwriting dynamics and speech assessment for PD diagnosis. Various machine learning techniques are used to diagnose PD using a variety of data types [17].
Siamese neural networks mainly work on two different input vectors and are used to compare the resulting output vectors. Moreover, clustering a dataset before applying classification enhances the distribution of similar samples among groups. In addition, applying the Siamese network can overcome the limitation of samples per class in the dataset by guiding the network to learn differences between samples rather than focusing on learning specific classes. A few studies have addressed the concept of clustering prior to classification. The concept was applied to a number of datasets in [18], including lung cancer and Coli2000, and the clustering of data showed improved classification accuracy. Moreover, in [19], a combination of clustering with classification was shown to give an increase of up to 10% in accuracy. However, the choice of classification and clustering techniques is critical in achieving the increased performance. In this paper, a framework is adopted that first clusters the Parkinson's dataset [20] and then applies a Siamese neural network to classify the patients of this dataset into three classes, namely, Parkinson's disease (PD), REM sleep behavior disorder (RBD), and healthy controls (HC). The accuracy achieved for classifying PD patients reached 76.75% for unclustered data, while it reached 84.02% for clustered data. The significance of this study lies in this increase in accuracy, which shows that adding clustering before classification is a promising enhancement for computer-aided diagnostic systems. The main contributions of this work are as follows:
(i) Proposing a framework for clustering data before classification.
(ii) Employing k-means clustering in conjunction with Siamese neural networks.
(iii) Applying the proposed model to Parkinson's disease patients.
(iv) Showing that data clustering improves the diagnostic ability of computer-aided diagnostic systems.
The remaining sections of this paper are organized as follows. Section 2 reviews the work relevant to clustering, Siamese neural networks, and PD diagnosis. Section 3, the Methodology section, presents the proposed framework. Section 4, the Results and Discussion section, compares the application of the Siamese neural network to the clustered Parkinson's dataset and to the Parkinson's dataset without clustering. Finally, Section 5 concludes the paper and highlights some suggestions for future work.

Related Work
In this section, the state-of-the-art methods of clustering and a review of Siamese neural networks are presented, in addition to an overview of Parkinson's disease diagnostic systems.

Clustering.
Clustering is an unsupervised learning technique that groups instances together based on feature similarities, with the objective of increasing similarity within the same cluster and decreasing similarity between clusters. The effectiveness of clustering is measured by its ability to identify unknown patterns. This is achieved by using distance measures such as Euclidean distance, Manhattan distance, and Minkowski distance [21]. Data clustering is used in many different applications, among which are machine learning, pattern recognition, and disease prediction. According to [21], data clustering can be categorized into linear and nonlinear clustering. Linear clustering algorithms include k-means clustering, quality threshold clustering, hierarchical clustering, fuzzy c-means clustering, and Gaussian clustering, while nonlinear clustering algorithms include density-based clustering techniques such as minimum spanning tree-based clustering and kernel k-means clustering. Clustering algorithms can be further categorized into partition-based, hierarchical, fuzzy, density-based, distribution-based, graph theory-based, model-based, and grid-based clustering [22]. Partition-based clustering groups instances according to the center of the data points [22]. Algorithms based on partitioning include but are not limited to k-means [23], k-medoids [24], and CLARANS [25]. K-means is considered the most popular clustering technique; it is implemented by iteratively updating the center of each cluster (the center of its data) until a convergence criterion is met. K-medoids is an improved version of k-means that deals with discrete data. Hierarchical clustering algorithms are based on constructing hierarchical relationships among data [26]. Among hierarchical clustering algorithms are BIRCH [27] and CURE [28]. Fuzzy clustering algorithms label instances by changing the discrete value of belonging to a certain cluster into a continuous interval. FCM [29], FCS [30], and MM [31] are among the well-known fuzzy clustering algorithms. Density-based algorithms use the density of data for clustering, where the data belonging to high-density regions are considered to be in the same cluster [32]. Distribution-based clustering, like DBCLASD [33] and GMM [34], generates clusters of data from the same distribution. In graph theory-based clustering, data is represented as a graph, where the nodes represent the data points and the edges represent relationships among data. An example of this category is minimum spanning tree-based clustering [35]. Model-based clustering selects a specific model to represent each cluster and then selects the data that best fits each cluster according to that model, using statistical learning or neural network learning. An example of this category is GMM [34]. Grid clustering represents the original data on a grid and uses this structure to cluster data points. Algorithms for grid clustering include STING [36] and CLIQUE [37]. More recent clustering algorithms include those based on kernels, ensembles, quantum theory, swarm intelligence, spectral graph theory, and affinity propagation, as well as clustering algorithms for large-scale data and spatial data [22].
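The three distance measures named above are closely related: Manhattan and Euclidean distance are the Minkowski distance of order 1 and 2, respectively. The following is a minimal sketch of that relationship in plain Python; the function names are illustrative and are not taken from the paper or from any library.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan_distance(a, b):
    """Minkowski distance with p = 1 (sum of absolute differences)."""
    return minkowski_distance(a, b, 1)


def euclidean_distance(a, b):
    """Minkowski distance with p = 2 (straight-line distance)."""
    return minkowski_distance(a, b, 2)


if __name__ == "__main__":
    u, v = [0.0, 0.0], [3.0, 4.0]
    print(manhattan_distance(u, v))  # 7.0
    print(euclidean_distance(u, v))  # 5.0
```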

Siamese Neural Networks.
A Siamese neural network is a type of neural network that is used to solve the problem of one-shot learning [38], where the class must be correctly predicted even if only a few examples are available for each new class. A Siamese neural network is structured as two identical neural networks (sometimes called twins); Figure 1 shows the architecture of a Siamese neural network. The input to the Siamese neural network, used for training, is a pair of samples, one sample for the top twin and the other for the bottom one, in addition to a label that shows whether the two samples belong to the same class or not. The output of each twin network is a feature vector; these two feature vectors are combined through a cost function, whose output is a scalar energy. The output of the cost function is then combined with the label through a loss function. In the training phase, the network parameters are updated using the backpropagation method such that the energy is minimized for pairs that belong to the same class and maximized for pairs that belong to different classes. In [6,39], the authors used a contrastive energy function as a loss function. Koch et al. [40] used a different approach, namely an L1-distance function followed by a sigmoid activation function. The concept of the Siamese neural network was extended to what is called "triplet loss," where three samples are used as input to the neural network, namely, the anchor, the positive sample (which belongs to the same class as the anchor), and the negative sample (which belongs to a class different from the anchor class) [41,42]. The idea behind the triplet loss is to minimize the distance between the anchor and the positive sample and to maximize the distance between the anchor and the negative sample. In 2018, Utkin et al. [43] introduced an alternative approach to the Siamese neural network, called the Siamese deep forest. It is based on gcForest [44], a structure consisting of multiple layers, where each layer consists of groups of decision trees. The Siamese deep forest mitigates the overfitting that occurs in conventional neural networks when the available training data are limited.
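The triplet loss idea described above can be illustrated with a short sketch. It assumes the common hinge formulation, max(0, d(anchor, positive) − d(anchor, negative) + margin), applied to plain Euclidean distances; the margin value and the function names are illustrative assumptions and are not taken from [41,42].

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, illustrative margin).

    The loss is zero once the negative sample is farther from the anchor
    than the positive sample by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    near, far = [0.0, 1.0], [3.0, 4.0]
    print(triplet_loss(anchor, near, far))  # 0.0: well-separated triplet
    print(triplet_loss(anchor, far, near))  # 5.0: positive is farther than negative
```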

Parkinson's Disease Diagnosis.
Different sources of data are employed to help in the diagnosis of PD, which reflects the significance of the disease. Among those data types are images, speech, sensor data, and handwriting motor data.
The study presented in [45] on the diagnosis of Parkinson's disease (PD) discusses the features that can be used in diagnosis. The authors employed a MEDLINE search to assess the clinical characteristics of PD. The study highlighted the importance of clinical characteristics, such as tremor, rigidity, and loss of postural reflexes, in differentiating PD from other diseases. Based on this study, the dataset used in this paper was selected. Moreover, it was concluded that genetic variations and neuroimaging tests may also be used to improve diagnosis. Pereira et al. [46] used various CNN architectures to distinguish PD from non-PD patients and proposed a model with an accuracy of up to 95% using handwritten images. In addition, Moetesum et al. [47] used a CNN to extract visual features from different samples of handwritten images, and a support vector machine (SVM) classifier was used for classification with an accuracy of 83%. Singh and Nasoz [48] also worked on handwritten images to decrease the loss function of the images from the validation set. Classification accuracy reached 83.11% and 90.38% for meander and spiral tests, respectively, using CNNs and SVM. An accuracy of 88% was achieved by Khatamino et al. [49] using a CNN system that employs dynamic features of spirals, as well as visual attributes, to detect PD. A fusion of a fuzzy system with neural networks was also proposed in [50], with improved performance over other classification methods. An expert system using a genetic algorithm and wavelet kernel extreme learning was proposed in [51] with an accuracy of 96.81%.
Other methods rely on speech data. For example, Al-Fatlawi et al. [52] used a DBN with two stacked RBMs. The proposed system reached an accuracy of 94%. Furthermore, in [53], speech impairments were used to diagnose PD using a deep neural network (DNN) classifier. The obtained accuracy was up to 93.79%. Additionally, Gunduz [54] proposed two CNN-based frameworks to classify PD using vocal features with an accuracy of 86.9%. The work in [55] used a speech dataset with many machine learning techniques. The highest performance reported was for the Light Gradient Boosting model, with an area under the curve of 0.951.
Other techniques use image analysis on different scans. In this context, Segovia et al. [56] classified brain images using SVM and Partial Least Squares. Accuracy reached up to 94.7%. In [57], a deep CNN model was used on SPECT images. This approach reached an accuracy of 90.7%. Finally, Sivaranjini and Sujatha [58] employed a CNN architecture to classify the MRI scans of healthy controls and PD patients with an accuracy of 88.9%.
Techniques that use sensor data are also employed in some studies for the diagnosis of PD. Aich et al. [59] showed that decision trees outperformed other methods, obtaining an accuracy of 88.46% to study the effect of medicine on the [...] [61] presented an automated gait differentiation procedure for the diagnosis of PD through a holistic, nonintrusive method that uses Vertical Ground Reaction Force (VGRF). Gait features are extracted from the VGRF to train a neural network that achieves an accuracy of 97.4%.

Methodology
This section explains the methods used in the proposed framework to classify Parkinson's disease patients. In this context, k-means clustering is first applied to the dataset; then the Siamese neural network is employed to classify the clustered dataset. The k-means method is used because it does not need complex computation. However, the need to know the number of clusters in advance is a disadvantage of this method. This is overcome by choosing the optimal number of clusters, which minimizes the distance between each sample vector in the dataset and the centroid of its corresponding cluster.

Parkinson Disease Dataset.
The original Parkinson's disease dataset includes 130 patients and was prepared by professional neurologists, so that the proposed framework is not affected by the subjectivity of nonprofessional neurologists and clinicians. Patients are classified into three classes, namely, early untreated Parkinson's disease (PD), REM sleep behavior disorder (RBD), and healthy controls (HC). The dataset contains 30 patients with early untreated PD, 50 patients with RBD, who have a high risk of developing the disease or other synucleinopathies, and 50 healthy controls. Each sample in the dataset is described by 65 features collected from patients [20]. These features include age, gender, positive history of Parkinson's disease in the family, medications, dosage, motor examination, speech, facial expression, measures of tremor, measures of rigidity, finger taps, hand movements, leg agility, posture, and respiration. The features in the dataset represent clinical data, as well as data obtained from hand movements (as in handwriting images) and data that measure speech features. Figure 2 shows the distribution of the dataset over age, gender, age of disease onset, and duration of disease from first symptoms. For gender, 0 represents male and 1 represents female. Year 0 in the graph indicates a healthy person. The dataset consists of 103 males and 27 females; their ages range from 34 to 83 years, the age at which they first experienced symptoms ranges from 30 to 81 years, and the duration of disease from first symptoms ranges from 0.5 to 17 years.

Data Preprocessing.
Data preprocessing prepares the data so that it is suitable for the proposed framework. The preprocessing steps are as follows:
(i) The "patient code" field is eliminated, as it has no effect on the learning process.
(ii) The string values of the data field "Gender" are enumerated, where 0 represents male and 1 represents female.
(iii) The fields "Positive," "Antidepressant therapy," "Antiparkinsonian medication," "Antipsychotic medication," and "Benzodiazepine medication" are enumerated.
(iv) The healthy control class has no applicable value for the "Age of disease onset" and "Duration of disease from the first symptoms" fields; therefore, those values are replaced by zeros.
(v) The missing values in the dataset are replaced by the value −1. This approach is selected to avoid deleting samples and losing information and to avoid using calculated values, like the median or mean, which may lead to variance in results.
The 130-patient dataset can generate 130 × 129 = 16,770 pairs, of which 747 and 296 pairs are used for training and testing, respectively. The pairs are created randomly from the same clusters such that the ratio of positive pairs to negative pairs is 1 : 2, because there are 3 classes.
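The pair-creation step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: it enumerates unordered pairs (the count of 16,770 above corresponds to ordered pairs), keeps only pairs whose two samples share a cluster, and subsamples so that there are two negative pairs for every positive pair. The function name `make_pairs`, the label encoding (1 for a positive pair, 0 for a negative pair), and the toy labels are all hypothetical.

```python
import random
from itertools import combinations


def make_pairs(labels, clusters, neg_per_pos=2, seed=0):
    """Build (i, j, y) pairs from samples that share a cluster.

    y is 1 when samples i and j have the same class label (positive pair)
    and 0 otherwise (negative pair). Negatives are subsampled so that the
    positive-to-negative ratio is 1 : neg_per_pos.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for i, j in combinations(range(len(labels)), 2):
        if clusters[i] != clusters[j]:
            continue  # pairs are drawn only from within a cluster
        if labels[i] == labels[j]:
            positives.append((i, j))
        else:
            negatives.append((i, j))
    # Keep only as many positives as can be matched with enough negatives.
    n_pos = min(len(positives), len(negatives) // neg_per_pos)
    chosen_pos = rng.sample(positives, n_pos)
    chosen_neg = rng.sample(negatives, n_pos * neg_per_pos)
    pairs = [(i, j, 1) for i, j in chosen_pos]
    pairs += [(i, j, 0) for i, j in chosen_neg]
    rng.shuffle(pairs)
    return pairs


if __name__ == "__main__":
    # Toy data: eight samples, three classes, two clusters.
    labels = ["PD", "PD", "RBD", "HC", "PD", "HC", "RBD", "RBD"]
    clusters = [0, 0, 0, 0, 1, 1, 1, 1]
    for i, j, y in make_pairs(labels, clusters):
        print(i, j, y)
```

Restricting pairs to a single cluster means the negative pairs shown to the network are the harder ones, that is, samples of different classes that nevertheless lie close together in feature space.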

K-Means Clustering.
Clustering is carried out via the k-means clustering algorithm, which is used to cluster the Parkinson's disease dataset. The input to the k-means algorithm is the number of clusters, M, and the size of the dataset, N. Choosing the optimal number of clusters depends on minimizing the cluster inertia, I, which measures the sum of the squared Euclidean distances between each sample vector in the dataset and the centroid of its corresponding cluster according to

$$I = \sum_{i=1}^{N} \sum_{j=1}^{M} w_{ij} \, \lVert X_i - C_j \rVert^2,$$

where $i \in \{1, 2, \ldots, N\}$, $j \in \{1, 2, \ldots, M\}$, $X_i$ is the vector that represents sample $i$ in the dataset, $C_j$ is the centroid of cluster $j$, and $w_{ij}$ equals 1 when $X_i$ belongs to cluster $j$ and 0 otherwise.
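As a concrete illustration of the inertia, the sketch below sums, for every sample, the distance to the centroid of its assigned cluster. It assumes the usual k-means convention of squared Euclidean distance; the function name and the toy data are illustrative only.

```python
def inertia(samples, centroids, assignment):
    """Sum of squared Euclidean distances from each sample to its centroid.

    samples:    list of feature vectors X_i
    centroids:  list of centroid vectors C_j
    assignment: assignment[i] is the index j of the cluster containing X_i
                (this plays the role of the indicator w_ij)
    """
    total = 0.0
    for x, j in zip(samples, assignment):
        c = centroids[j]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total


if __name__ == "__main__":
    samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
    centroids = [[0.0, 1.0], [10.0, 1.0]]
    assignment = [0, 0, 1, 1]
    print(inertia(samples, centroids, assignment))  # 4.0
```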
There is a trade-off between the number of clusters and the inertia: the inertia keeps decreasing as clusters are added. To resolve this trade-off, the elbow method [62] is used. The elbow method examines the relation between the number of clusters and the inertia, finds the point after which the decrease in inertia is no longer significant, and takes the number of clusters corresponding to this point as the optimal number of clusters. Figure 3 shows how to obtain the optimal number of clusters graphically.
To obtain the elbow point mathematically, the perpendicular distance between each point of the curve and the line, L, that connects the first point of the curve with the last point is calculated. The elbow point corresponds to the maximum distance, $d_{\max}$, as shown in Figure 4. The distance $d_{\max}$ is calculated using

$$d_{\max} = \max_{k} \frac{\lvert (y_M - y_1)\, x_k - (x_M - x_1)\, y_k + x_M y_1 - y_M x_1 \rvert}{\sqrt{(y_M - y_1)^2 + (x_M - x_1)^2}},$$

where $(x_1, y_1)$ is the first point of the curve, $(x_M, y_M)$ is the last point of the curve, and $(x_k, y_k)$ is the point from which the distance to the line, L, is calculated. After the optimal number of clusters is determined, the k-means algorithm is implemented according to the following steps:
(1) Initialize the centroid of each cluster randomly.
(2) Assign each sample to its nearest centroid.
(3) Recompute the centroid of each cluster as the mean of the samples assigned to it.
(4) Repeat steps (2) and (3) until the convergence criterion is met.
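The perpendicular-distance rule for locating the elbow can be sketched in a few lines. The function below takes the cluster counts as x values and the corresponding inertias as y values and returns the index of the point farthest from the line joining the first and last points. The function name and the inertia values in the example are made up for illustration and are not results from the paper.

```python
import math


def elbow_index(xs, ys):
    """Index of the curve point farthest from the first-to-last chord."""
    x1, y1 = xs[0], ys[0]
    xm, ym = xs[-1], ys[-1]
    chord_length = math.hypot(xm - x1, ym - y1)
    if chord_length == 0:
        raise ValueError("first and last points must differ")

    def distance_to_chord(xk, yk):
        # Standard point-to-line distance for the line through
        # (x1, y1) and (xm, ym).
        numerator = abs((ym - y1) * xk - (xm - x1) * yk + xm * y1 - ym * x1)
        return numerator / chord_length

    distances = [distance_to_chord(x, y) for x, y in zip(xs, ys)]
    return max(range(len(distances)), key=distances.__getitem__)


if __name__ == "__main__":
    ks = [1, 2, 3, 4, 5, 6]               # candidate numbers of clusters
    inertias = [100, 60, 30, 25, 22, 20]  # illustrative values only
    print(ks[elbow_index(ks, inertias)])  # 3
```

Because the two axes usually have very different scales, some implementations normalize both to the unit interval first; for a fixed chord this does not change which point is farthest along the vertical direction, but it does change the reported distance.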

Siamese Neural Network Framework.
The Siamese neural network is a feedforward network trained with error backpropagation. The network comprises two identical feedforward neural networks joined at their output, as shown in Figure 1. During training, each network reads a profile made of real values and processes these values at each layer. The network activates some of its neurons based on these values and updates its weights through error backpropagation, and at the end it creates an output profile that is compared with the output of the other network. The Siamese neural network compares the outputs of the upper and lower networks by calculating a distance metric. Through this distance, the network states whether the two outputs are different or similar. The algorithm then labels the instance as positive or negative, depending on the distance metric. The final output can then be compared with its corresponding ground truth value.
The proposed Siamese neural network framework consists of twin neural networks (top and bottom networks), each of which consists of three dense layers (other than the input layer). The input layer consists of 64 neurons, and [...] The activation function applied for the first three layers is "tanh." To avoid overfitting, dropout with a fraction of 10% is applied between layer 1 and layer 2 and between layer 2 and layer 3. Table 1 shows a summary of both the top and bottom neural networks. The outputs of the two twin networks are joined together to form the output layer with one neuron, where the Euclidean distance is applied between the outputs of the top and bottom networks to measure the similarity between the two outputs.
The dataset is split into two partitions, such that two-thirds of the dataset are dedicated to training and the rest is dedicated to testing. Random pairs of records of the dataset are created; a pair is positive if the two records belong to the same class; otherwise, the pair is negative. The data is fit to the model in batches of size 128. The network is trained for 200 epochs. The objective of the network is to minimize the distance of positive pairs while maximizing the distance of negative pairs. This is done using the contrastive loss function, which in its standard form is

$$L = D(X, Y_p)^2 + \max\bigl(0,\; m - D(X, Y_n)\bigr)^2,$$

where $L$ is the loss function, $(X, Y_p)$ is a positive pair, $(X, Y_n)$ is a negative pair, $D$ is the distance between the two records of the same pair, and $m$ is the margin value beyond which the two records of a negative pair are considered distant enough. To find the values of the network weights at which the loss is optimal, the RMSprop optimization algorithm is applied.
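The behavior of a contrastive loss on individual pairs can be illustrated with a small sketch. It assumes the widely used form in which a positive pair contributes its squared distance and a negative pair contributes the squared shortfall from the margin; the margin value of 1.0, the function names, and the toy vectors are assumptions, since the source does not state them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between the two twin-network output vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(distance, is_positive, margin=1.0):
    """Contrastive loss for one pair (assumed standard form).

    Positive pairs are pulled together: the loss grows with the distance.
    Negative pairs are pushed apart: the loss is zero once the distance
    reaches the margin.
    """
    if is_positive:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def mean_contrastive_loss(pairs, margin=1.0):
    """Average loss over a batch of (output_a, output_b, is_positive) tuples."""
    losses = [contrastive_loss(euclidean(a, b), y, margin) for a, b, y in pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    batch = [
        ([0.0, 0.0], [0.0, 0.5], True),   # close positive pair: small loss
        ([0.0, 0.0], [0.0, 0.5], False),  # close negative pair: penalized
        ([0.0, 0.0], [3.0, 4.0], False),  # distant negative pair: zero loss
    ]
    print(mean_contrastive_loss(batch))
```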

Results and Discussion
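This section presents the results of the experiment on the Parkinson's disease dataset. The proposed model is analyzed with the performance metrics accuracy, precision, recall, specificity, and F1-score [20]. In a binary classification problem (where there are only positive and negative classes), the performance metrics are calculated using True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$

$$\text{Specificity} = \frac{TN}{TN + FP}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

However, our classification problem is multiclass (there are more than two classes); in this case, the performance metrics are calculated with respect to each class, and then the weighted average is calculated for each metric. The weighted average for metric $p$ is $\sum_{i=1}^{C} N_i p_i / N$, where $C$ is the number of classes, $N_i$ is the number of test samples in class $i$, $N$ is the total number of test samples, and $p_i$ is the value of this metric with respect to class $i$.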
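The weighted averaging can be sketched from a multiclass confusion matrix. The code below treats each class in turn as the positive class (one-versus-rest), computes the usual binary metrics, and weights them by the number of test samples in that class. The standard binary definitions of the metrics are assumed, and the confusion matrix in the example is invented for illustration; it is not taken from Table 2.

```python
def class_counts(cm, k):
    """Return (TP, FP, FN, TN) for class k, treated as the positive class.

    cm[r][c] is the number of samples of true class r predicted as class c.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def weighted_metrics(cm):
    """Class-weighted average of each metric: sum_i N_i * p_i / N."""
    total = sum(sum(row) for row in cm)
    averages = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0,
                "specificity": 0.0, "f1": 0.0}
    for k in range(len(cm)):
        tp, fp, fn, tn = class_counts(cm, k)
        precision = ratio(tp, tp + fp)
        recall = ratio(tp, tp + fn)
        per_class = {
            "accuracy": ratio(tp + tn, total),
            "precision": precision,
            "recall": recall,
            "specificity": ratio(tn, tn + fp),
            "f1": ratio(2 * precision * recall, precision + recall),
        }
        n_k = tp + fn  # number of test samples whose true class is k
        for name, value in per_class.items():
            averages[name] += n_k * value / total
    return averages


if __name__ == "__main__":
    # Rows: true class, columns: predicted class (illustrative counts).
    confusion = [[8, 1, 1],
                 [2, 6, 2],
                 [0, 2, 8]]
    for name, value in weighted_metrics(confusion).items():
        print(f"{name}: {value:.4f}")
```

With per-class (one-versus-rest) accuracy, the true negatives of the other classes count as correct, so the weighted accuracy can exceed the weighted recall; in this example they are 74/90 and 22/30, respectively.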

Parkinson's Disease Experiment.
Applying the k-means algorithm to the Parkinson's disease dataset partitions the data into eight clusters. Two-thirds of the dataset are used to create 747 pairs for training the proposed Siamese neural network framework, and the rest is used to create 296 pairs for testing it. Figure 5 shows the overall accuracy on both the training set and the validation set for the proposed framework during training. Figure 6 shows the loss on both the training set and the validation set for the proposed framework during training. The performance of the proposed framework shows no significant change after epoch 175; hence, training can be terminated at that point.
To study the effect of clustering the dataset, the Siamese neural network model is first applied to the dataset before clustering and then to the clustered dataset. Table 2 shows the performance metrics in the two cases, the clustered and the unclustered Parkinson's dataset. The weighted average Precision/Recall/F1 (PRF) of the unclustered model is 71.55%, 70.83%, and 71.19%, and the accuracy is 79.17%. The PRF of the clustering model is 73.06%, 72.00%, and 72.53%, and the accuracy is 85.40%. This increase in accuracy comes from better prediction of the true-negative and true-positive samples.
The above experiment is repeated 10 times. Table 3 shows the overall accuracy for each iteration. Applying the model to the clustered dataset outperforms applying it to the dataset without clustering. On average, the overall accuracy is 84.02% and 76.75%, respectively.

Conclusions
In this work, we adopt a new method of training Siamese neural networks on a Parkinson's disease dataset. The new method depends on clustering the dataset prior to the training phase and concentrates on training the network on pairs from the same clusters. We have compared the proposed framework to the conventional framework. We ran both models (the conventional and the proposed framework) 10 times. On average, we obtain an overall accuracy of 76.75% when applying the conventional model and 84.02% when applying the proposed framework. Moreover, both frameworks were analyzed with the performance metrics accuracy, precision, recall, specificity, and F1-score. In terms of class-weighted averages, the conventional model achieves 79.17% accuracy, 71.55% precision, 70.83% recall, 83.33% specificity, and 71.19% F1-score, while the proposed framework achieves 85.4% accuracy, 73.06% precision, 72% recall, 90.06% specificity, and 72.53% F1-score. These experimental results show that the proposed framework outperforms the conventional framework. The proposed framework shows a promising improvement in performance, and hence, the model is expected to be further tested on multiple datasets for Parkinson's disease, in addition to other classification problems.

Data Availability
The data used in this manuscript were taken from [20] and are available at https://archive.ics.uci.edu/ml/machine-learning-databases/00392/. Any other data required are available upon request from the author.

Conflicts of Interest
The authors have no conflicts of interest.