Inclusion of Neuropsychological Scores in Atrophy Models Improves Diagnostic Classification of Alzheimer's Disease and Mild Cognitive Impairment

Brain atrophy in mild cognitive impairment (MCI) and Alzheimer's disease (AD) is difficult to demarcate, which complicates assessing the progression of AD. This study presents a statistical framework on the basis of MRI volumes and neuropsychological scores. A feature selection technique using backward stepwise linear regression together with linear discriminant analysis is designed to classify cognitively normal (CN) subjects, early MCI (EMCI), late MCI (LMCI), and AD subjects in an exhaustive two-group classification process. Results show a dominance of the neuropsychological parameters like MMSE and RAVLT. Cortical volumetric measures of the temporal, parietal, and cingulate regions are found to be significant classification factors. Moreover, an asymmetrical distribution of the volumetric measures across hemispheres is seen for CN versus EMCI and EMCI versus AD, showing dominance of the right hemisphere, whereas CN versus LMCI and EMCI versus LMCI show dominance of the left hemisphere. A 2-fold cross-validation showed an average accuracy of 93.9%, 90.8%, and 94.5% for CN versus AD, CN versus LMCI, and EMCI versus AD, respectively. The accuracy for groups that are difficult to differentiate, like EMCI versus LMCI, was 73.6%. With the inclusion of the neuropsychological scores, a significant improvement (24.59%) was obtained over using MRI measures alone.


Introduction
Perhaps one of the most challenging research issues in Alzheimer's disease (AD) is in identifying relevant measures which could define the different stages of AD as a progressive neurodegenerative disorder [1,2]. Targeted treatment and early intervention procedures could be prescribed on the basis of such findings.
Brain imaging and neuropsychological testing are the main research domains used to determine specific cognitive, structural, functional, and biological measures to study AD and its prodromal stages. Structural MRI [3][4][5][6][7] and functional imaging modalities like Single-Photon Emission Computed Tomography (SPECT) [8,9], Positron Emission Tomography (PET) [10,11], synchronous neural interactions (SNI) obtained using magnetoencephalography (MEG) [12,13], and cerebrospinal fluid (CSF) measures [6], as well as electroencephalography (EEG) [14][15][16], have been used with varying degrees of success in identifying AD. Clinicians regularly use these biomarkers as guides, and, more recently, combinations of two or more biomarkers are being explored to improve our understanding of AD [4][5][6][7][10]. Exemplifying such combinations, biomarkers of MRI and CSF reportedly yield better accuracy as compared to their individual results. In similar studies, Fan et al. combined MRI and PET biomarkers [5], while the group of Walhovd et al. and the group of Zhang et al. worked on a combination of MRI, PET, and CSF biomarkers and reported results with conclusive indicators in the diagnosis of AD or Mild Cognitive Impairment (MCI) [4,10].
Many other studies focused on the combination of neuropsychological testing with medical imaging modalities. In a notable study, Ewers and his colleagues combined the main biomarkers of MRI and CSF with neuropsychological tests to predict the conversion from MCI to AD [17]. Their study, which included 81 AD patients and 101 elderly control subjects, demonstrated that single-predictor models yield accuracies comparable to those of multipredictor models. It showed that when the entorhinal cortex is used as the single predictor, the accuracy of the results ranged from the mid-60s to a high of 68.5%. In another study involving the prediction of MCI to AD conversion over a 2-year period, Gomar et al. researched the usefulness of combining different variables drawn from a series of biomarkers including cognitive markers and the different risk factors involved [18]. Using brain volumes, CSF, and other cognitive markers, they determined that cognitive markers at baseline are better predictors of the conversion of MCI to AD than temporal neurobiological markers. They also show that, in contrast to biomarkers, a sharp decline in functional ability could serve as a better predictor of the conversion of MCI to AD. This latter finding concurs with their results that show that, with the inclusion of neuropsychological data, the accuracy increased to 90% in delineating AD patients from controls.
Both these studies, which primarily focus on the conversion process of MCI to AD, use a manual selection of the volumetric measures of the different regions of the brain and rely on the ADNI (Alzheimer's Disease Neuroimaging Initiative) public database. The proposed study, which relates well to these two studies, uses instead a fully automated approach to rank the neurobiological variables and volumetric measures. Thus, a more global approach is provided for constructing patterns of structural and physiological abnormalities in their entirety [5], with statistical proofs in support of the choice of the different variables and measures considered.
Other studies have focused their research efforts on determining the distinctive features that could delineate early MCI (EMCI) from late MCI (LMCI) [19,20]. For example, Ferman et al. showed that nonamnestic MCIs (naMCI) were more likely to develop dementia with Lewy bodies (DLB), whereas patients with amnestic MCIs (aMCI) are more likely to convert to AD [21].
The proposed study examines the classification of AD, EMCI, and LMCI on the basis of a combination of subcortical and cortical MRI volumes with a slate of neuropsychological tests that include the Mini-Mental State Examination (MMSE), the Rey Auditory Verbal Learning Test (RAVLT), and the Clinical Dementia Rating Scale Sum of Boxes Scores (CDRSb). The study reveals the importance of including neuropsychological tests in classifying the different stages of AD by using a combination of MMSE, RAVLT, and select volumetric variables. This study also proposes a fully automated feature extraction technique, with a ranking that provides statistical significance to the variables to be used in a multidimensional decisional space for optimal classification. Figure 1 illustrates the general structure of the entire process. The steps include acquisition of the MRI and neuropsychological parameters, the selection of significant variables using pairwise backward stepwise linear regression models, and the classification process using a well-established linear discriminant analysis (LDA). The fully automated data-driven technique allows for the possibility of replacing LDA with other algorithms such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) and probabilistic classifiers such as Quadratic Discriminant Analysis (QDA).

Study Data.
Data used in this study were obtained from the ADNI database (http://adni.loni.usc.edu/). ADNI, launched in 2003, aims to test whether magnetic resonance imaging, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD.
Baseline demographic, clinical, neuropsychological, and volumetric MRI data for 385 subjects (55 diagnosed with mild AD, 91 with LMCI, 114 with EMCI, and 125 cognitively normal (CN) cases) were explored as outlined in Table 1.
All subjects had (1) a neurological and medical evaluation by a physician; (2) a full battery of neuropsychological tests [22], all in accordance with the National Alzheimer's Coordinating Center protocol (http://www.alz.washington.edu/), along with RAVLT [23]; and (3) structural volumetric MRI scans of the brain. The CDRSb was used as the index of functional ability, and the MMSE was used as the index of cognitive ability.
The cognitive diagnosis was made using a combination of the physician's diagnosis and neuropsychological diagnosis, as described previously [24]. The etiological diagnosis was made by the examining physician. The diagnosis of CN required that the physician's diagnosis was CN and no cognitive test scores were ≥1.5 SD below age- and education-corrected means. A probable AD diagnosis required a dementia syndrome and National Institute of Neurological and Communicative Disorders and Stroke/Alzheimer's Disease and Related Disorders Association criteria for AD [25].

Volumetric Correction.
The FreeSurfer pipeline (version 5.1.0) was applied to the MRI scans to produce 115 cortical and subcortical volumetric variables. These 115 regional volumes were corrected for head size variation using FreeSurfer's estimate of total intracranial volume (TIV), which has been considered highly accurate in adults [26]. Regional MRI volumes, normalized to total intracranial volume, were obtained from the ADNI database using data derived by researchers at the University of California, San Francisco, as part of the ADNI2 cohort. Each regional volume of the brain was divided by the subject's TIV to estimate regional volumetric ratios, which were used as features for the classification models. The volumes were not corrected for age, gender, and education since these demographic variables are later used as model terms in the statistical testing as detailed in Section 2.4.
Stepwise linear regression models (SLRM) have been used in studies of Alzheimer's disease for assessing hippocampal atrophy [27], functional decline in cognitive ageing [28,29], and cortical atrophy [30], to name a few. In this study, the SLRM method is used as a feature extraction technique to determine those variables (demographic, clinical, neuropsychological, or volumetric) that are significant towards classification of the different subtypes of the disease. The procedure followed is as shown in Figure 2; its final step retains only those significant parameters that appear more than 75% of the time for each classification pair.
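The head-size correction described in this section amounts to a per-subject division of each regional volume by that subject's TIV. A minimal sketch follows, assuming the volumes are held in a NumPy array with one row per subject; the function name `normalize_by_tiv` and the toy values are hypothetical and are not ADNI data.

```python
import numpy as np

def normalize_by_tiv(regional_volumes, tiv):
    """Divide each subject's regional volumes by that subject's TIV.

    regional_volumes: array of shape (n_subjects, n_regions)
    tiv: array of shape (n_subjects,), same units as the volumes
    Returns unitless regional volumetric ratios of the same shape.
    """
    regional_volumes = np.asarray(regional_volumes, dtype=float)
    tiv = np.asarray(tiv, dtype=float)
    if np.any(tiv <= 0):
        raise ValueError("TIV must be positive for every subject")
    # Broadcast the per-subject TIV across all regions of that subject.
    return regional_volumes / tiv[:, np.newaxis]

# Toy values for two subjects and two regions (hypothetical, in mm^3).
volumes = np.array([[3500.0, 1500.0],
                    [4200.0, 1800.0]])
tiv = np.array([1.4e6, 1.6e6])
ratios = normalize_by_tiv(volumes, tiv)
```

Because the result is a ratio, it is unitless, so regions can be compared across subjects with different head sizes.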

Statistical
The feature extraction step begins by randomly dividing the feature set of 120 parameters (age, gender, education, 2 neuropsychological test scores, namely, RAVLT and MMSE, and 115 corrected volumetric MRI regional ratios) into two groups. One of these groups is assigned as a training group and the other is discarded in this initial step. For the training group, pairwise stepwise linear regression models are used to estimate the best set of parameters, which yields the highest correlation between the diagnostic gold standard and the feature set as given in (1). The pairwise models are trained for each of the six classification pairs, namely, CN versus EMCI, CN versus LMCI, CN versus AD, EMCI versus LMCI, EMCI versus AD, and LMCI versus AD. Consider
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n, \tag{1}$$
where $y$ is the model response (a logical variable showing the class in the pairwise classification pair), $x_1, x_2, \ldots, x_n$ are the feature variables, $\beta_1, \beta_2, \ldots, \beta_n$ are their respective coefficients, and $\beta_0$ is a constant. For the current study $n = 121$; however, the model is completely scalable to accommodate a smaller or larger number of features if needed.
The significance threshold for adding a feature to the model is fixed at 0.1 for the model $R^2$; that is, if the increase in the $R^2$ of the model is larger than 0.1, the corresponding feature is added to the model. On the other hand, a feature is removed from the model if it fails to improve the $R^2$ of the model by more than 0.05. The thresholds for adding or removing a term from the model were empirically adjusted so as to obtain a stable model, which can explain the variance in the data.
Stepwise regression models, as executed in the paper, fit an initial model comprised of a single feature and then grow to accommodate other features. The increment in $R^2$ of 0.1 for adding a parameter to a model was chosen as a more conservative approach towards limiting the feature space. Complex models with a large number of features often tend to overfit, capturing the noise in the data rather than the underlying phenomenon. The removal threshold was fixed at 0.05, again to eliminate only weak features that resulted in less than a 0.05 improvement in the $R^2$ of the model.
To account for the varied nature of the disease and the random distribution of the data under investigation, the SLRM are repeated 50 times for each diagnostic pair and only those features which appear to be significant more than 75% of the time are retained in the final feature set as was shown in Figure 2. The final features that are deemed significant are bound to constitute an optimal decisional space.
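The selection procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it follows the description in the text (start from a single feature, add a feature when the $R^2$ gain exceeds 0.1, drop a feature whose contribution to $R^2$ is below 0.05, repeat on 50 random half-splits, and keep features selected more than 75% of the time). The function names `r_squared`, `stepwise_select`, and `stable_features`, the iteration cap, and the synthetic data are all hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def stepwise_select(X, y, add_gain=0.1, drop_gain=0.05, max_iter=100):
    """Grow a model feature by feature using R^2 gain thresholds."""
    selected, current = [], 0.0
    for _ in range(max_iter):  # iteration cap is an added safeguard
        changed = False
        # Addition step: try the candidate with the largest R^2 gain.
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        gains = [(r_squared(X[:, selected + [j]], y) - current, j)
                 for j in candidates]
        if gains:
            best_gain, best_j = max(gains)
            if best_gain > add_gain:
                selected.append(best_j)
                current += best_gain
                changed = True
        # Removal step: drop features that contribute less than drop_gain.
        for j in list(selected):
            if len(selected) > 1:
                rest = [k for k in selected if k != j]
                r2_without = r_squared(X[:, rest], y)
                if current - r2_without < drop_gain:
                    selected, current, changed = rest, r2_without, True
        if not changed:
            break
    return selected

def stable_features(X, y, n_repeats=50, keep_fraction=0.75, seed=0):
    """Keep features selected in more than keep_fraction of random halves."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_repeats):
        train = rng.permutation(len(y))[: len(y) // 2]  # other half discarded
        for j in stepwise_select(X[train], y[train]):
            counts[j] += 1
    return np.flatnonzero(counts / n_repeats > keep_fraction)

# Synthetic two-class data: only features 0 and 1 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
stable = stable_features(X, y)
```

Repeating the selection on random halves and thresholding the selection frequency is what makes the final feature set robust to any single unlucky split.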

Linear Discriminant Analysis (LDA) Classifier.
LDA is a technique widely used in pattern recognition, statistics, and machine learning, among others, for determining characteristic features that can aid in difficult segmentation tasks [31][32][33]. The LDA classifier used in this study attempts to estimate a posterior probability for each subject to enable its classification into either of the two groups for each of the six pairwise classifications.
The significant features determined in the feature extraction step are used to train a classifier to estimate the parameters of the linear discriminant functions for the two classes as given in
$$L_1 = a_1 x_1 + a_2 x_2 + \cdots + a_m x_m, \qquad L_2 = b_1 x_1 + b_2 x_2 + \cdots + b_m x_m, \tag{2}$$
where $a_1, a_2, \ldots, a_m$ and $b_1, b_2, \ldots, b_m$ are the LDA parameters for the two groups, respectively, and $x_1, x_2, \ldots, x_m$ are the set of significant parameters for each classification pair, where $m \le n$. The training algorithm assumes a prior probability of 0.5, suggesting that a given subject has an equal probability of belonging to either one of the two classes. The classification algorithm assigns posterior probabilities $P_1$ or $P_2$ on the basis of the linear scores $L_1$ and $L_2$, respectively, as described in (3). The posterior probabilities as calculated in (4) signify the likelihood of a subject belonging to either one of the groups. A higher posterior probability determines the grouping of the subjects.
All of the experiments conducted in this study were based on 2-fold cross validation, distributing the subjects equally between training and testing sets. The training and testing sets were randomly assigned while the number of subjects in each group remained fixed. To limit the potential data partitioning error introduced by random data assignment and cross validation, the same experiment with random data assignment was run 20 times and the average metrics of accuracy, sensitivity, specificity, precision, and F-measure are reported.
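To make the classification step concrete, the sketch below uses the textbook two-class LDA with a pooled covariance matrix, which may differ in detail from the formulation used in this study. In that textbook form, each class has a linear score $L_k(x) = w_k^{\top} x + c_k$, and the posterior is $P_k = e^{L_k} / (e^{L_1} + e^{L_2})$. The function names `fit_lda`, `posteriors`, and `repeated_two_fold` and the synthetic data are hypothetical; only accuracy is computed here, whereas the study also reports sensitivity, specificity, precision, and F-measure.

```python
import numpy as np

def fit_lda(X, y, prior=0.5):
    """Fit two-class LDA; the linear score for class k is X @ W[k] + c[k]."""
    means = [X[y == k].mean(axis=0) for k in (0, 1)]
    centered = np.vstack([X[y == k] - means[k] for k in (0, 1)])
    pooled = centered.T @ centered / (len(y) - 2)  # pooled covariance
    inv = np.linalg.pinv(np.atleast_2d(pooled))
    W = np.array([inv @ m for m in means])
    c = np.array([-0.5 * m @ inv @ m + np.log(p)
                  for m, p in zip(means, (prior, 1.0 - prior))])
    return W, c

def posteriors(X, W, c):
    """Posterior probability of each class from the linear scores."""
    scores = X @ W.T + c
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

def repeated_two_fold(X, y, n_repeats=20, seed=0):
    """Mean and SD of accuracy over repeated stratified 2-fold splits."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        fold = np.zeros(len(y), dtype=int)
        for k in (0, 1):  # split each group in half so group sizes stay fixed
            idx = rng.permutation(np.flatnonzero(y == k))
            fold[idx[len(idx) // 2:]] = 1
        for test_fold in (0, 1):
            train, test = fold != test_fold, fold == test_fold
            W, c = fit_lda(X[train], y[train])
            pred = posteriors(X[test], W, c).argmax(axis=1)
            accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs)), float(np.std(accs))

# Synthetic, well-separated groups (hypothetical data, 60 subjects each).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(3.0, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
mean_acc, sd_acc = repeated_two_fold(X, y)
```

Each subject is assigned to the class with the higher posterior, which with equal priors is the class with the higher linear score.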

Significant Features.
The significant features as determined by the SLRM technique for each of the six classifications are provided in Table 2. The results show that a combination, which includes neuropsychological parameters, demographic variables, and the volumetric variables, could act as the best linear model to estimate diagnostic patterns in pairwise comparisons. The number of features that are selected for each pairwise comparison varies from 5 in the case of LMCI versus AD to as many as 14 in the case of CN versus EMCI. On average, for diagnostic groups which are closely related to each other in disease progression, that is, CN versus EMCI, EMCI versus LMCI, and LMCI versus AD, fewer significant parameters are seen as compared to groups that are diagnostically well separable. Such a trend was expected since closely related diagnostic groups have fewer marked atrophy changes that are visible with disease progression.
The SLRM offer the opportunity to also rank these variables on the basis of their significance to each classification pair. The features listed in Table 2 are ranked according to the value reflecting the significance of the variable towards the models. The ranking of the features displays the potential discriminating power of the different features in classification of the stages of AD. The anatomical distribution of the volumetric features, both cortical and subcortical, is displayed in Figures 3 and 4.
An important observation that can be made from Table 2 is the dominance of the neuropsychological parameters. Where it is selected, the MMSE appears at very high ranks: it is ranked as a significant feature in 4 of the 6 pairwise comparisons, whereas RAVLT shows up as significant in 2 of the 6 pairwise comparisons.
Interesting findings can also be seen in Figures 3 and 4. The asymmetrical distribution of the volumetric measures across hemispheres can be a potential indicator of the shifts in atrophy patterns in the different stages of the disease. In the case of CN versus AD a more bihemispherical layout of the variables is observed, as was reported in another study [34]. A closer inspection of the results shows that the top ranked volumetric variables, for example, hippocampus [35][36][37], ventricular [38,39], cortical [35,37,40], and amygdala [39,41], are all regions that have been shown to be effective predictors of AD and/or MCI by many other research groups. This observation is a strong indicator of the accuracy and usability of the ranking system developed in this study.
Figures 3 and 4 (caption): anatomical distribution of the significant volumetric features listed in Table 2. For each classification pair the letter "L" or "R" signifies the left and right hemisphere, respectively. Additionally, for the cortical representations, the top images in each set show the lateral view and the bottom images show the medial views.
Also, cortical volumetric measures of the temporal, parietal, and cingulate regions show a marked presence in the significant groups. Other recent studies have also demonstrated the utility of regional temporal brain atrophy [38] and involvement of frontal lobe atrophy as important markers for AD staging [42].

Classification Performance.
Figures 5(a)-5(f) show the classification accuracy, sensitivity, specificity, and precision for the 6 pairwise comparisons studied in this paper. The study performs an incremental analysis, whereby the classification is performed by adding one feature at a time to the model, starting with a single-feature model. In other words, first only the top variable is used for classification and the performance is recorded. Following this, the top 2 features are employed in the LDA classifier, and so on. The results show a typical saturation effect whereby increasing the number of features in the classifier beyond a point does not improve the accuracy of the classifier. Table 3 lists the highest accuracies obtained for each classification along with the number of features that were used to yield such accuracies. All results, displayed as Average ± Standard Deviation, indicate that, in all the cases except in the CN versus AD classification, the highest accuracy is indeed obtained when all the significant parameters, as listed in Table 2, are included in the LDA. In the case of CN versus AD, the first seven out of the ten features were able to define the optimal decisional space for the classification. Table 3 also highlights the fact that the best classification accuracies are those obtained for groups which are well separated diagnostically. For example, CN versus AD, CN versus LMCI, and EMCI versus AD show accuracies of 93.9%, 90.8%, and 94.5%, respectively. However, for groups which are not clearly differentiable, like EMCI versus LMCI, an accuracy of 73.6% was achieved. Although this accuracy seems low, it is comparable to, if not better than, that of other studies found in the literature, which have reported a similar accuracy in classification of MCI subjects [35,43,44].
Additionally, the choice of the prior probabilities used in the classification model can be derived from population's empirical estimates; that is, for the classification of AD subjects from CN the priors for the groups can be chosen to be 55/180 and 125/180 for the AD and CN groups, respectively. Although the choice of empirical prior probabilities may improve performance, the algorithm refrains from assigning empirical prior probabilities to reflect the nature of the problem in clinical environments, where distribution of subjects can be unknown.
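The effect of switching from equal to empirical priors can be illustrated with Bayes' rule for two classes: the prior enters the posterior only as an additive log-odds term. The sketch below uses the group sizes stated in the text (55 AD and 125 CN); the function name `posterior_ad` and the notion of a "score difference" (the data-driven log-likelihood ratio of AD versus CN) are assumptions made for illustration.

```python
import math

n_ad, n_cn = 55, 125
prior_ad = n_ad / (n_ad + n_cn)   # 55/180
prior_cn = n_cn / (n_ad + n_cn)   # 125/180

def posterior_ad(score_diff, prior):
    """Posterior probability of AD given the data-driven log-likelihood
    ratio (AD versus CN) and the prior probability of AD."""
    logit = score_diff + math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-logit))

# For a subject whose features favour neither class (score_diff = 0),
# the posterior simply equals the prior.
equal_prior_posterior = posterior_ad(0.0, 0.5)
empirical_prior_posterior = posterior_ad(0.0, prior_ad)

# Log-odds shift towards CN introduced by the empirical priors.
shift_towards_cn = math.log(prior_cn / prior_ad)
```

With empirical priors, a borderline subject is pushed towards the larger CN group, which is the behaviour the equal-prior choice deliberately avoids.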

Impact of Neuropsychological and Volumetric Features.
Results shown in Figure 5 together with those given in Table 4 clearly illustrate the merits of combining neuropsychological measures with structural measures, where the combined model showed much improved accuracies for all the two-group classifications. For example, inclusion of the third ranked feature (MMSE) results in a sharp increase in accuracy in the CN versus AD classification.
In order to assess the merits of each category of these features, neuropsychological versus MRI measures, the classification algorithm was modified to operate separately using only either neuropsychological scores or MRI measures. For the neuropsychological model all neuropsychological features are used in the analysis, whereas in the MRI model only the 115 MRI measures are used as features. In the combined model all neuropsychological and MRI measures are used concurrently. Note that the demographic variables (age, gender, and education) are used as features in all models to account for the variability due to age- and gender-related changes often seen in Alzheimer's disease. Table 4 lists the average accuracy results obtained using a 2-fold LDA with 20 repetitions for classification of the different categories using only neuropsychological scores or MRI measures and then the combination of both.
It can be seen that the difference between the combined model and neuropsychological model alone is extremely small with the combined model offering a relative improvement of only 3.12% on average for the different classification pairs over a range of (0.33%-11.36%). However, it is interesting to note that the improvement offered by the combined model in case of EMCI versus LMCI, which are diagnostically very similar in cognition, is up to 11.36%. Additionally, MRI models alone offer on average accuracies of 71.2% (60%-84%). A large improvement in relative accuracy (24.59%) is thus obtained by combining neuropsychological scores with the MRI models.
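As a worked check of how a relative improvement is computed, the sketch below applies the usual definition, (new − baseline) / baseline, to the average accuracies reported in this paper (combined 87.6%, neuropsychological 85.3%, MRI 71.2%). The function name `relative_improvement` is hypothetical. The values obtained this way (about 23.0% and 2.7%) differ slightly from the reported 24.59% and 3.12%; a plausible explanation, stated here as an assumption, is that the reported figures average the relative improvement over the six classification pairs of Table 4 rather than taking the ratio of the averaged accuracies.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Average accuracies (percent) as reported in the text.
combined, neuro_only, mri_only = 87.6, 85.3, 71.2

vs_mri = relative_improvement(combined, mri_only)      # ratio of averages
vs_neuro = relative_improvement(combined, neuro_only)  # ratio of averages
```

The mean of per-pair ratios and the ratio of means generally do not coincide, so both conventions should be stated when comparing figures across studies.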
Also, it is seen that for most of the classification pairs the Neuropsychological model offers a higher accuracy than the MRI model except for the EMCI versus LMCI classification pair. This finding shows that cognitive scores are not very good markers to differentiate between EMCI and LMCI and in such classification studies MRI based atrophy measures offer an added advantage.

Comparative Analysis.
A comparison of classification performance with multiple studies that have appeared in the literature is shown in Table 5, providing details of the respective imaging modality/biomarkers used in each study, the nature of the dataset used, and the classifier statistics in terms of the results obtained. The proposed approach achieved a very good performance relative to these studies. It is seen that for classification of AD subjects from CN the proposed technique fares better than most of the studies. Another important point to note is that, except for Cuingnet et al. [36] and Zhou et al. [34], this study considered a larger database.
Moreover, most of the studies listed do not differentiate between the EMCI and LMCI groups but pool them into a single larger MCI group. The availability of such data from the ADNI2 cohort made it possible for this study to explore this specific two-group classification and evaluate their distinctive features in the progression stages of AD. The study by Klöppel et al. [37] shows the best results of all studies that only use MRI imaging based markers. In that study, the diagnosis of patients in Group I and Group II was confirmed either histopathologically or neuropathologically with the aid of a biopsy or autopsy. Such a method of validation is extremely useful and most reliable for providing a more definite diagnosis that is less susceptible to errors caused by the subjectivity of the physician's diagnosis. Diagnosis in living subjects like those included in the proposed study naturally tends to be more subjective in nature. However, this does not hinder the intent or the merit of this study, which is to achieve high classification accuracy on a large population group with ease of applicability in a clinical setting.
Another important feature that can be observed from Table 5 is the comparative performance of the proposed study relative to multimodal imaging studies. It is shown that the classification performance achieved in this study is competitive with, and sometimes even better than, that of other studies that relied on multiple imaging modalities. Although a combination of modalities like PET and CSF provides valuable insight into the disease, their lack of availability across imaging centers and medical institutions hinders their potential integration in the decision making process in such facilities. A significant merit of the proposed technique is its ability to achieve a very good classification performance using only the MRI modality in conjunction with neuropsychological scores, all of which are routinely carried out for the diagnosis of AD.
The main limitation of this study was that the diagnosis of MCI and AD and the distinction of these two entities from normal aging are based on clinical measures, of which memory measures are paramount. On the other hand, structural MRI, as used in this study, measures volumetric changes in the brain. The severity of Alzheimer's pathology is only weakly correlated to cognitive and functional changes during life (in part because several other variables such as age and cognitive and brain reserve capacity can modify the correlations), but the pathology is strongly correlated to volumetric MRI changes. Furthermore, about 30−40% of cognitively normal elderly individuals have the pathology of Alzheimer's disease in their brains and these changes will be reflected as volumetric changes in their MRI scans, but not in their cognitive measures. Hence, using MRI measures, there is considerable overlap built in between "normal aging" and MCI/AD and so the classification between CN and MCI/AD is automatically at a disadvantage as compared to cognitive measures. It is, therefore, not surprising that we found MRI measures to provide only a small additional effect in separating CN from MCI/AD. A major advantage to using structural MRI scans is in distinguishing between different causes of dementia, such as separating the signature or atrophy patterns in Alzheimer's disease from frontotemporal dementia, vascular dementia and hydrocephalus, and distinguishing subtypes of Alzheimer's disease by the pattern of atrophy. These issues were not addressed as they were beyond the scope of this study, in which we addressed only the magnitude of contribution of volumetric MRI measures to the cognitive measures.

Conclusions
This study introduced a novel framework that combined MRI volumetric measures with neuropsychological scores in a statistically meaningful way. Consequently, a ranking of these features, structural and cognitive, proved very useful in constructing optimal decisional spaces for high accuracy in two-group classifications. The highly ranked MRI measures proved effective in extracting the significant brain atrophy regions associated with AD and its prodromal stages for classifying cognitively normal subjects, EMCI, LMCI, and AD subjects. The feature extraction technique is based on a backward stepwise linear regression analysis, which demonstrated dominance of neuropsychological parameters like MMSE and RAVLT in delineating the different groups. The extracted features are also dominated by the presence of well-known subcortical atrophy regions like the hippocampus, amygdala, and ventricles and various temporoparietal and cingulate cortical regions. Classification results in two-group comparisons revealed accuracies of 93.9%, 85.6%, 90.8%, 73.6%, 94.5%, and 90.1% for CN versus AD, CN versus EMCI, CN versus LMCI, EMCI versus LMCI, EMCI versus AD, and LMCI versus AD, respectively. The study also showed that a combination of MRI measures and neuropsychological parameters yields better diagnostic results on average (accuracy: 87.6%) than using either MRI (accuracy: 71.2%) or cognitive scores (accuracy: 85.3%) alone. A practical merit of the proposed technique is its ability to achieve high classification accuracy using only the MRI modality together with neuropsychological scores, all of which are routinely carried out for diagnosing AD.

Disclosure
All data for this research were acquired from the Alzheimer's Disease Neuroimaging Initiative (ADNI), which were obtained and analyzed anonymously.