Discrimination between Alzheimer's Disease and Mild Cognitive Impairment Using SOM and PSO-SVM

In this study, an MRI-based classification framework is proposed to distinguish patients with AD and MCI from normal participants by using multiple features and different classifiers. First, we extracted volume and shape features from MRI data through a series of image processing steps. Subsequently, we applied principal component analysis (PCA) to convert the set of possibly correlated features into a smaller set of linearly uncorrelated variables, thereby decreasing the dimension of the feature space. Finally, we developed a novel data mining framework that combines a support vector machine (SVM) with particle swarm optimization (PSO) for AD/MCI classification. To compare the hybrid method with traditional classifiers, two further classifiers, an SVM and a self-organizing map (SOM), were trained for patient classification. With the proposed framework, the classification accuracy reached 82.35% and 77.78% for patients with AD and MCI, respectively. By combining the volumetric and shape features and using PCA, the accuracy reached 94.12% and 88.89% for AD and MCI, respectively. The present results suggest that novel multivariate methods of pattern matching reach a clinically relevant accuracy for the a priori prediction of the progression from MCI to AD.


Introduction
Alzheimer's disease (AD) [1] is the most common type of dementia. Clinically, it is characterized by progressive cognitive deterioration, together with declining activities of daily living and neuropsychiatric symptoms or behavioral changes. The early detection of AD is challenging for several reasons. First, there are no known biomarkers. Second, the disease usually has an insidious onset, which can result from a combination of genetic and environmental factors. Third, it is difficult to differentiate AD from other types of dementia.
Mild cognitive impairment (MCI) is a transitional stage between normal aging and dementia. The syndrome is defined by cognitive decline greater than that of age- and education-matched individuals, but without interference with daily function [2]. In terms of major symptoms, MCI is characterized by memory loss and cognitive impairment. Research has reported that patients with MCI have a risk of between 10% and 64% of developing AD [3,4]. AD is a progressive neurodegenerative disorder and is distinguished from MCI by the progressive deterioration of daily function. The prevalence of AD increases dramatically after age 65; it affects approximately 26 million people worldwide, a number that may increase fourfold by the year 2050. Recent reports on the treatment or prevention of AD have led to growing interest in early diagnosis. Detecting the changes in brain tissue that reflect the pathological processes of MCI could therefore help to prevent or postpone disease progression, either from normal control to MCI or from MCI to AD. If MCI can be diagnosed at an early stage and effectively treated, it may be possible to reduce later damage.
Since poor performance in memory and executive function indicates a high risk of dementia, probable AD patients are usually evaluated by standardized neuropsychological tests [5][6][7][8]. Additionally, many studies have examined the predictive ability of nuclear imaging with respect to AD and other dementing illnesses [9][10][11][12][13]. However, considering imaging cost and the need for a noninvasive examination, magnetic resonance imaging (MRI) has been widely used for early detection and diagnosis of MCI and AD [14][15][16][17].
Atrophy typically starts in the medial temporal and limbic areas, subsequently extends to parietal association areas, and finally reaches frontal and primary cortices. Early changes in the hippocampus and entorhinal cortex have been demonstrated with the help of MRI, and these changes are consistent with the underlying pathology of MCI and AD. Many studies have used manual or automatic methods to measure the hippocampus and entorhinal cortex [18][19][20]. Hippocampal volumes and entorhinal cortex measures have been found to be equally accurate in distinguishing between AD and cognitively normal elderly subjects [21]. However, the segmentation and identification of the hippocampus or entorhinal cortex are usually sensitive to the subjective opinion of the operator and are also time consuming. In addition, the enlargement of the ventricles due to neuronal loss is a significant characteristic of AD. The ventricles are filled with cerebrospinal fluid (CSF) and surrounded by gray matter (GM) and white matter (WM). As a result, the hemispheric atrophy rate measured through ventricular enlargement shows a higher correlation with disease progression.
In this study, we designed an MRI-based classification framework to distinguish patients with MCI and AD from normal individuals using multiple features and different classifiers. Since the features adopted here are volume-related and shape-related, we also investigated whether combining statistical analysis with principal component analysis (PCA) improves classification accuracy compared with using volume-related features alone, shape-related features alone, or all features. Our hypothesis was that the combination of all MRI-based features helps to distinguish patients with early Alzheimer's disease from subjects with mild cognitive impairment and from healthy controls.
The remainder of this paper is organized as follows. Section 2 describes the proposed scheme, including feature extraction and the classifiers used, that is, the self-organizing map (SOM), support vector machine (SVM), particle swarm optimization (PSO), and the proposed hybrid PSO-SVM. Statistical analysis, experimental results, and discussion are presented in Section 3. Finally, conclusions are given in Section 4.
Figure 1 shows the flowchart of the proposed system. In the Feature Extraction step, spatial normalization is performed by coregistering the brain MRI data of each individual to a T1-weighted MRI template so that the images of all investigated subjects are in the same scale space. Next, with the aid of segmentation and morphological procedures, all MRI brain images are segmented into GM, WM, CSF, and ventricle tissues, and shape descriptors are computed. Volume-related and shape-related features are then used for classification. The Feature Reduction step is divided into two parts: (1) the Mann-Whitney U test is adopted to filter out the features with low discriminative power; (2) principal component analysis (PCA) is applied to reduce the dimension of the feature space. Route I uses only the U test; Route II combines the U test with PCA. Finally, a classifier (SOM, SVM, or PSO-SVM) is employed to classify the tested volunteers into three categories: normal individuals, MCI patients, and AD patients. The details of the proposed method are described below.

Spatial Normalization of MRI Data.
Spatial normalization of brain images is useful for determining what happens generically across individuals. It is a procedure that registers an MRI data set to a standard coordinate system, also known as the Talairach and Tournoux coordinate system [22]. In this study, all images were spatially normalized to the stereotactic space ICBM-152 [23] via a 12-degrees-of-freedom affine transformation, which normalizes the brain in terms of dimensions, position, and spatial orientation.

Volume Features Extraction.
The volumes of brain tissues such as GM, WM, and CSF carry important information, especially in degenerative brain diseases [24]. A clustering-based segmentation algorithm provided by SPM8 [25] uses a modified Gaussian mixture model to extract GM, WM, and CSF probability maps from whole-brain MRI data. The intensities of the voxels belonging to each of these clusters conform to a normal distribution, which can be described by a mean, a variance, and the number of voxels belonging to the distribution. Here, the volumes of GM, WM, CSF, and the whole brain are calculated by (1) from the gray levels of the voxels assigned to each tissue cluster, where tissue stands for GM, WM, or CSF. Figure 2 illustrates the segmentation results of the normal individual and the AD patient used in this study.
Next, we employ region growing and a double-threshold algorithm [26] to extract binary ventricle volume data defined over the voxel coordinates $(x, y, z)$. Morphological operators, for example, erosion and dilation, are used to obtain the binary ventricle regions, and the edges of the binary images are detected by applying the Sobel operator on a slice-by-slice basis. The segmented region then forms a binary mask image, in which 1 (white) denotes a ventricle pixel and 0 (black) denotes a nonventricle pixel. Finally, the volume of the cerebral ventricle is calculated by (2) from the gray levels of all pixels of the mask image.
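The volume computation on a binary mask can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the volume is the count of mask voxels equal to 1 multiplied by the volume of one voxel, and the names `ventricle_volume`, `voxel_volume_mm3`, and `toy_mask` are hypothetical.

```python
def ventricle_volume(mask, voxel_volume_mm3=1.0):
    """Sum a binary 3D mask (nested lists: slices -> rows -> voxels,
    1 = ventricle, 0 = background) and scale by the volume of one voxel.
    Assumption: volume = number of ventricle voxels * voxel volume."""
    voxel_count = sum(v for slice_ in mask for row in slice_ for v in row)
    return voxel_count * voxel_volume_mm3


# Hypothetical mask with 2 slices of 2 x 3 voxels, for illustration only.
toy_mask = [
    [[0, 1, 1], [0, 1, 0]],
    [[0, 0, 1], [0, 0, 0]],
]
volume = ventricle_volume(toy_mask, voxel_volume_mm3=1.0)
```

Because the mask is binary, summing its gray levels is the same as counting ventricle voxels; the voxel volume would come from the acquisition resolution.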

Shape Features Extraction.
The volume features, which are extracted from the whole three-dimensional volume, cannot capture variation in anatomical shape. Wang et al. [27,28] proposed a ventricle shape-based method for improved classification of Alzheimer's patients. Therefore, to enhance classification accuracy, we added ventricle shape features to the volume features. Figure 3 shows the sagittal view of the ventricle that we segmented. The shape features we analyzed are of two types: three-dimensional shape features and two-dimensional shape features. The algorithms used to obtain these features are described in the following subsections.

3D Shape Features.
To obtain the 3D shape feature, a leave-one-out method is used to construct the training and testing sets, following Wang's method. Three probability maps, one for each type of subject (normal control, AD, and MCI), were then built by (3) from the gray levels of the ventricular mask images and the number of training samples of that type. To compare patients (AD and MCI) with normal controls, we subtracted the normal probability map from the patient probability map to obtain the discriminate map. Finally, a matching coefficient (MC) between a testing input and the discriminate map is calculated by (4) from the discriminate map over $(x, y, z)$ and the testing ventricular mask image.
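The probability map, discriminate map, and matching coefficient can be sketched as below. This is a hedged illustration under stated assumptions: the probability map is taken to be the voxel-wise mean of the binary training masks, and the matching coefficient is taken to be the sum of discriminate-map values over the voxels set in the test mask. The exact forms used in the paper may differ (for example, by a normalization). Masks are flattened to 1D lists for brevity, and all function and variable names are hypothetical.

```python
def probability_map(masks):
    """Voxel-wise mean of binary masks (each mask is a flat list of 0/1).
    Assumed form of the per-group probability map."""
    n = len(masks)
    return [sum(voxels) / n for voxels in zip(*masks)]


def discriminate_map(patient_masks, normal_masks):
    """Patient probability map minus normal probability map, voxel by voxel."""
    p_patient = probability_map(patient_masks)
    p_normal = probability_map(normal_masks)
    return [pp - pn for pp, pn in zip(p_patient, p_normal)]


def matching_coefficient(disc_map, test_mask):
    """Assumed form: sum of discriminate-map values where the test mask is 1."""
    return sum(d * t for d, t in zip(disc_map, test_mask))


# Toy training masks (4 voxels each); the held-out test subject is excluded.
patient_masks = [[1, 1, 1, 0], [1, 1, 0, 0]]
normal_masks = [[1, 0, 0, 0], [1, 0, 0, 0]]
disc = discriminate_map(patient_masks, normal_masks)

mc_patient_like = matching_coefficient(disc, [1, 1, 1, 0])
mc_normal_like = matching_coefficient(disc, [1, 0, 0, 0])
```

In this toy case a test mask that extends into the voxels more often occupied in patients receives a larger MC than a mask confined to the voxels shared by both groups.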

2D Shape Features.
The 2D shape features are extracted from the segmented ventricles on a slice-by-slice basis. From a 2D viewpoint, each case contains many ventricle slices. To compare cases effectively, we selected the slice with the maximum area from each 3D ventricle data set as the datum plane. The 2D shape features used here follow the work of Yang et al. [29] and are listed as follows: (1)

Learning Methods for Classification.
Machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm or the type of input available during training. They are often divided into supervised, non-supervised, and reinforcement learning (RL). Supervised learning requires the explicit provision of input-output (I/O) pairs, and the task is to construct a mapping from one to the other. Non-supervised learning has no concept of target data and performs processing only on the input data. In contrast, RL uses a scalar reward signal to evaluate I/O pairs and hence discovers, through trial and error, the optimal output for each input. In this sense, RL can be thought of as intermediate between supervised and non-supervised learning, since some form of supervision is present, albeit in the weaker guise of the reward signal. The trained algorithm may be treated as a "black box" that encapsulates knowledge gleaned from the training data and produces the expected outcome from its inputs. For this reason, machine learning and computer-aided diagnosis (CAD) have been of growing interest in medical applications. To evaluate the performance of supervised and non-supervised methods, we used three classifiers.
In many pattern recognition studies, the dataset is divided into two subsets, training and testing. The former is used to create the model, and the latter is used to assess the accuracy of the model in predicting unknown samples. This is called the train-and-test method. Cross-validation is an experimental method for effectively estimating the generalization error. In this study, leave-one-out cross-validation (LOOCV) is adopted for the three classifiers to obtain a dependable estimate of the generalization error. LOOCV uses a single observation from the original sample as the validation data and the remaining observations as the training data, repeated so that each observation is used once for validation.
The classifiers we adopted are described in the following subsections.
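The LOOCV procedure can be sketched in a few lines. This is an illustrative sketch only: a nearest-neighbor rule stands in for the SOM, SVM, and PSO-SVM classifiers, and the feature values and labels are toy data, not data from this study.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier: return the label of the closest training sample
    (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(train_x[k], x)),
    )
    return train_y[best]


def loocv_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy one-dimensional features with two well-separated groups.
features = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = ["NC", "NC", "NC", "AD", "AD", "AD"]
accuracy = loocv_accuracy(features, labels)
```

Any classifier with the same `predict(train_x, train_y, x)` signature could be passed in place of the stand-in.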

Self-Organizing Map Architecture.
A self-organizing map (SOM) is a type of artificial neural network for the visualization of high-dimensional data. In general, an SOM operates in two modes: training and mapping. Training builds the map, called a Kohonen map [30], from input examples. An SOM consists of components called nodes or neurons, and each node has a set of neighbors. When a node wins a competition, its weight is adjusted and the weights of its neighbors are adjusted as well, although by a smaller amount: the further a neighbor is from the winner, the smaller its weight change. Furthermore, as training proceeds, the neighborhood gradually shrinks. At the end of training, the neighborhoods have shrunk to zero size.
When a training example is fed to the network, its Euclidean distance to all weight vectors is computed by
$$d_j(t) = \sqrt{\sum_{i=1}^{n} \left( x_i(t) - w_{ji}(t) \right)^2}, \tag{5}$$
where $n$ denotes the dimension of the data and $t$ is the index of the data item in a given sequence. The neuron whose weight vector is most similar to the input is called the best matching unit (BMU). The weights of the BMU and of the neurons close to it in the SOM lattice are adjusted towards the input vector. The magnitude of the change decreases with time and with distance from the BMU. The update formula for a neuron $j$ with weight vector $w_j(t)$ is
$$w_j(t+1) = w_j(t) + \alpha(t)\, h_{cj}(t) \left[ x(t) - w_j(t) \right], \tag{6}$$
where $\alpha(t)$ is a monotonically decreasing learning coefficient, $x(t)$ is the input vector, and $c$ is the index of the BMU. The neighborhood function $h_{cj}(t)$, given by (7), depends on the lattice distance between the BMU and neuron $j$.
Figure 4 illustrates the procedure of the SOM classifier. In this study, we use a two-stage method for learning [31]. In the first stage, we use fewer iterations, a higher learning rate, and a large neighborhood distance so that the network converges quickly. After repeating this stage many times, we keep the network parameters that give the best convergence. In the second stage, we combine more iterations, a lower learning rate, and a small neighborhood distance with the network parameters obtained in the first stage and adjust the network parameters slowly. The final parameters are as follows: the number of iterations is set to 1000 epochs, ordering phase learning rate = 0.9, tuning phase learning rate = 0.5, and tuning phase neighborhood distance = 0.5. To verify the stability of the SOM in generalizing the correct tendency, the classifier was trained 10 times to obtain reliable results. Twenty-two cases (AD = 7, Normal = 7, MCI = 8) were chosen randomly as the training set. Scaling of variables is of special importance in our model because the SOM algorithm uses the Euclidean metric to measure distances between vectors. We therefore linearly scaled all variables so that their variances were equal to one.
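A single SOM training step (find the BMU, then pull the BMU and its lattice neighbors towards the input) can be sketched as below. This is a minimal sketch under assumptions that are not taken from the paper: the lattice is one-dimensional, and the neighborhood function is Gaussian in the lattice distance. The learning rate and radius reuse the tuning-phase values quoted above purely as example numbers, and all names are hypothetical.

```python
import math


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest (Euclidean) to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def som_update(weights, x, learning_rate, radius):
    """One SOM step on a 1-D lattice, updating `weights` in place.
    Assumption: Gaussian neighborhood over the lattice distance |j - bmu|.
    Returns the BMU index."""
    bmu = best_matching_unit(weights, x)
    for j, w in enumerate(weights):
        h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
        weights[j] = [wi + learning_rate * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu


# Three nodes on a line, each with a 2-D weight vector (toy values).
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
sample = [0.9, 0.9]
before = [math.dist(w, sample) for w in weights]
bmu = som_update(weights, sample, learning_rate=0.5, radius=0.5)
after = [math.dist(w, sample) for w in weights]
```

After the step every node is closer to the sample, with the BMU moving the most and distant nodes moving very little, which is the behavior the text describes.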

Support Vector Machine.
SVM is a supervised learning method that has shown advantages in reducing training and testing errors, resulting in higher recognition accuracy [32]. However, some feature data are not linearly separable; in particular, features are often not perfectly separable at the border between categories. To allow some flexibility in separating the categories, SVMs use a cost parameter, denoted as $C$, to control the trade-off between allowing training errors and forcing rigid margins. The cost function with $C$ is defined as
$$\min_{w,\, b,\, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i, \tag{8}$$
where $\xi_i$ is a slack variable and $\|w\|$ is the Euclidean norm of $w$. The patterns are mapped into a high-dimensional feature space by combining features to form a kernel matrix. The kernel matrix is usually constructed by a kernel function that takes two patterns as arguments and outputs a value. In this study, a radial basis function (RBF) kernel is employed:
$$K(x, x_i) = \exp\left( -\gamma \|x - x_i\|^2 \right), \tag{9}$$
where $x$ denotes the input vector, $x_i$ denotes the $i$th prototype vector, and $\gamma$ is the kernel parameter. We use an ensemble of one-against-rest classifiers, each of which distinguishes one class from all the others. Finally, the optimal solution can be obtained by the Lagrange method (10), which introduces Lagrange multipliers and solves the resulting dual problem. $C$ and $\gamma$ control the trade-off between training errors and generalization ability in an SVM with an RBF kernel. Therefore, PSO was used to find the optimal combination of $C$ and $\gamma$, with the fitness of a candidate defined as Fit = (number of correctly classified testing data)/(total number of testing data).
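The kernel matrix mentioned above can be sketched as follows. This is an illustration under an assumption: the RBF kernel is written in the common parameterization K(x, z) = exp(-gamma * ||x - z||^2), and the symbol `gamma`, the sample values, and the function names are hypothetical rather than taken from the paper.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(samples, gamma):
    """Pairwise kernel values between all samples (symmetric, ones on the diagonal)."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


# Toy 2-D feature vectors, for illustration only.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = kernel_matrix(samples, gamma=0.5)
```

A larger `gamma` makes the kernel fall off faster with distance, so each training sample influences a smaller region; this is one reason the kernel parameter has to be tuned together with the cost parameter.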

Hybrid PSO-SVM.
The particle swarm optimization (PSO) algorithm [33,34] uses particles moving in an $n$-dimensional space to search for solutions of an optimization problem with $n$ variables. In our approach, PSO is initialized and searches for the optimal particle iteratively. Each particle represents a candidate solution, and an SVM classifier is built for each candidate solution to evaluate its performance. The velocity and position of the particles are updated by
$$v_{id}^{t+1} = w\, v_{id}^{t} + c_1\, \mathrm{rand}_1 \left( pbest_{id} - x_{id}^{t} \right) + c_2\, \mathrm{rand}_2 \left( gbest_{d} - x_{id}^{t} \right), \qquad x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}, \tag{11}$$
where $t$ is the evolutionary generation, $v_{id}$ is the velocity of particle $i$ on dimension $d$, and $x_{id}$ stands for the position of particle $i$ on dimension $d$; $pbest_{id}$ and $gbest_{d}$ are the best positions found so far by particle $i$ and by the whole swarm, respectively. The inertia weight $w$ is used to balance global exploration and local exploitation, $\mathrm{rand}_1$ and $\mathrm{rand}_2$ are random functions, and $c_1$ and $c_2$ are the personal and social learning factors. If the number of particles is too large, the optimization process may become time consuming. Conversely, if it is too small, it is hard to find the optimal solution because of the limited search area. It has been reported [35] that good solutions can be obtained when the number of particles is between 20 and 40. In this work, the number of iterations and the number of particles are set to 200 and 30, respectively. Similarly, the parameters $c_1$, $c_2$, and $w$ affect the convergence of the optimization process. If they are set too large, the particle velocity becomes too high and the optimal solution cannot be obtained. If they are set too small, finding the optimal solution is time consuming [36]. Therefore, we set $c_1$, $c_2$, and $w$ to 2, 2, and 0.8, respectively.
More specifically, based on the approach in [37], the proposed hybrid PSO-SVM aims at optimizing the accuracy of the SVM classifier by randomly generating the parameters ($C$ and $\gamma$) and estimating the best values of the regularization and kernel parameters for the SVM model. The basic operation of the hybrid PSO-SVM proposed in this paper is given in Figure 5.
This process continues until the performance of SVM converges. The termination criteria are that the iteration number reaches the maximum number of iterations (100%) or the value of global optimal fitness does not improve after 200 consecutive iterations. In this study, 22 cases were chosen (AD = 7, Normal = 7, MCI = 8) to be the training set.
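The particle update loop can be sketched as below, using the settings quoted in the text (30 particles, 200 iterations, inertia weight 0.8, learning factors 2 and 2). Everything else is an assumption made for illustration: the objective is a toy stand-in for the classification error (1 - Fit) of an SVM trained with the candidate parameters, the two search dimensions are taken to be the base-2 logarithms of the cost and kernel parameters, and positions are clamped to the search box as an added safeguard that the paper does not describe.

```python
import random


def pso_minimize(objective, bounds, n_particles=30, n_iterations=200,
                 w=0.8, c1=2.0, c2=2.0, seed=0):
    """Basic global-best PSO. Returns (best position, best value, initial best value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    initial_best = gbest_val
    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                lo, hi = bounds[d]
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Clamp to the search box (safeguard added for this sketch).
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val, initial_best


def toy_error(p):
    """Hypothetical smooth error surface standing in for 1 - Fit of an SVM."""
    log2_c, log2_gamma = p
    return (log2_c - 3.0) ** 2 + (log2_gamma + 5.0) ** 2


bounds = [(-5.0, 15.0), (-15.0, 3.0)]  # assumed search ranges, not from the paper
best_pos, best_val, initial_best = pso_minimize(toy_error, bounds)
```

In the hybrid scheme, `toy_error` would be replaced by a function that trains an SVM with the candidate parameters and returns one minus the fitness, so that minimizing the error is equivalent to maximizing Fit.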

Materials.
According to previous research [4], most patients with Alzheimer's disease are aged 65 years or older. Therefore, most of the subjects in the data set we chose are over 65 years old. The image data used in this study were provided by Chang Gung Memorial Hospital, Lin-Kou, Taiwan. The degree of clinical severity of each participant was evaluated by experienced clinicians, who conducted independent semistructured interviews that included a set of questions regarding the functional status of the participant, along with standardized neurologic, psychiatric, and health examinations. This interview generates an overall Clinical Dementia Rating (CDR) and a Mini Mental State Examination (MMSE) score. The whole dataset consists of three groups: normal control, MCI, and AD. Demographic information is provided in Table 1.
The whole-brain MRI scans were obtained with a 3T MR scanner (Trio A TIM system, Siemens, Erlangen, Germany). T1-weighted images were acquired with a magnetization-prepared rapid gradient-echo (T1-MPRAGE) sequence using 180-degree radio-frequency pulses. The following imaging parameters were used: repetition time (TR) = 2000 ms, echo time (TE) = 4.16 ms, flip angle = 9 degrees, matrix = 224 × 256, and slice thickness = 1 mm in 160 slices.

Statistical Analysis and Classification.
Through image processing techniques, we obtained individual volume and shape features. To confirm whether these features have a significant effect on classification, we use the statistical MW test to compare the differences between the three groups on the various features (continuous variables).
The MW test, also called the Mann-Whitney or Mann-Whitney-Wilcoxon test, is a nonparametric rank-based test for identifying the difference between populations with respect to their medians or means. The test does not require sample data to be normal (sample > 30), and it is relatively insensitive to nonhomogeneity of the variance of the sample data. The null hypothesis is that the two populations from which the samples have been drawn have equal medians or means. The alternative is that the populations do not have equal medians. The two samples are combined, and all sample observations are ranked from smallest to largest. The test was performed on each feature to evaluate its discriminative power:
$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2, \qquad U_{\mathrm{obt}} = \min(U_1, U_2), \tag{12}$$
where $U_{\mathrm{obt}}$ is the smaller of $U_1$ and $U_2$, $n_1$ and $n_2$ are the sizes of the first and second samples, and $R_1$ and $R_2$ are the sums of the ranks of the first and second samples, respectively. The $p$ value obtained from the test is the probability that the statistic would assume a value at least as extreme as the observed value strictly by chance. A $p$ value less than the predetermined significance level (0.05) results in the rejection of the null hypothesis at the 5% significance level. All statistical results for the volume and shape features we adopted ($p < 0.05$), comprising three volume features and seventeen shape features, are shown in Table 2.
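The rank-sum computation behind the U statistic can be sketched as follows. This is a minimal illustration using the standard textbook definition (U_k = n1*n2 + n_k(n_k + 1)/2 - R_k, with tied values sharing their average rank); it does not compute the p value, and the sample values are toy numbers rather than data from this study.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (U_obt, U1, U2) computed from the rank sums of the pooled samples."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])
    r2 = sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2.0 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2.0 - r2
    return min(u1, u2), u1, u2


# Toy feature values for two groups with no overlap.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0, 7.0]
u_obt, u1, u2 = mann_whitney_u(group_a, group_b)
```

With completely separated groups the smaller statistic is 0, the most extreme value possible; U1 and U2 always sum to n1*n2, which is a convenient sanity check.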

Results.
Although the features we adopted differ significantly ($p < 0.05$) between the three groups, some of them may be redundant or highly correlated. Therefore, principal component analysis (PCA) [38] is used to reduce the dimensionality of the data set, which consists of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated and are ordered so that the first few retain most of the variation present in all of the original variables. The reduction also shortens the computation time required for classification. To represent the data effectively, we used the PCs that captured 95% of the total variation in the data set. For the volume-feature-based classification, the first two principal components were adopted; for the shape-feature-based classification, the first eight; and when volume and shape features were integrated, the first six principal components were used to represent all of the features. Table 3 gives the variances and the coefficients of the PCs when the analysis is done on the correlation matrix. The symbol * indicates that the PCA coefficient is used as a feature for classification. SOM, SVM, and PSO-SVM were used to train classifiers, and the results are presented in Tables 4, 5

Discussion.
In this study, we investigated the feasibility of using anatomical MR images to extract different types of features as predictive markers for AD and MCI. We employed multiple features and different classifiers to distinguish patients with AD and MCI from normal participants. The results show that volumetric analysis of gray matter, white matter, and cerebrospinal fluid, together with local shape analysis of the ventricle, provides significant atrophy information.
In particular, gray matter volume, ventricular area, elongation, mean signature value, and distances show statistical significance ($p < 0.01$). This implies that the volume and shape features have the potential to discriminate among normal control, AD, and MCI. By combining the volumetric and shape features, the classification accuracy of SOM reached 76.47% and 66.67% for patients with AD and MCI, respectively; with the help of PCA, it improved to 88.24% and 72.22%. The classification accuracy of SVM reached 76.47% and 77.78% for patients with AD and MCI, respectively; with the help of PCA, it improved to 82.35% and 83.33%. With the hybrid classification framework based on PSO, the accuracy reached 82.35% and 77.78% for AD and MCI, respectively; with the help of PCA, it improved to 94.12% and 88.89%. According to these results, combining PSO-SVM with statistical analysis and principal component analysis (PCA) improves classification accuracy.
It was also noted that classification performance was higher for AD and normal controls than for patients with MCI. MCI is a transitional stage between normal cognitive aging and dementia. Therefore, the characteristics of patients with MCI may be similar to those of AD subjects, but they may also be similar to those of normal participants. Combination with other features is therefore essential to improve classification accuracy for patients with MCI at an early stage.

Conclusion
In this paper, we compared different methods for the classification of patients with AD and MCI based on anatomical T1-weighted MRI. To evaluate and compare the performance of each method, two classification experiments were performed: CN versus AD and CN versus MCI. We observed that volume features and shape features can be integrated to increase classification accuracy with low computational complexity. The classification results also support our hypothesis that the combination of multiple types of features, including volume and shape features, outperforms a single type of feature, possibly because different features are mutually complementary. Furthermore, the results show that feature reduction by statistical analysis and PCA achieves better accuracy than using all of the adopted features. Among the classifiers used here, PSO-SVM achieved the best accuracy, sensitivity, and specificity for both CN versus AD and CN versus MCI. At present, the classification results are better for patients with AD and normal participants than for patients with MCI. The framework can provide clinically useful information for large-scale population-based screening studies. The results may also be useful for predicting disease progression and for providing an objective evaluation of cognitive rehabilitation treatments for dementing illness.