White Matter Lesion Assessment in Patients with Cognitive Impairment and Healthy Controls: Reliability Comparisons between Visual Rating, a Manual, and an Automatic Volumetrical MRI Method—The Gothenburg MCI Study

Age-related white matter lesions (WML) are a risk factor for stroke, cognitive decline, and dementia. Different requirements are imposed on methods for the assessment of WML in clinical settings and for research purposes, but reliability analysis is of major importance. In this study, WML assessment with three different methods was evaluated. In the Gothenburg mild cognitive impairment study, MRI scans from 152 participants were used to assess WML with the Fazekas visual rating scale on T2 images, a manual volumetric method on FLAIR images, and FreeSurfer volumetry on T1 images. Reliability was acceptable for all three methods. For low WML volumes (2/3 of the patients), reliability was overall lower and nonsignificant for the manual volumetric method. Unreliability in the assessment of patients with low WML with manual volumetry may mainly be due to intensity variation in the FLAIR sequence used; hence, intensity standardization and normalization methods must be used for more accurate assessments. The FreeSurfer segmentations resulted in smaller WML volumes than the volumes acquired with the manual method and showed deviations from visible hypointensities in the T1 images, which quite likely reduces validity.


Introduction
Age-related white matter lesions (WML) mainly affect information processing speed and executive function [1] and entail an increased risk for cognitive decline and disability [2]. In a meta-analytic study, high hazard ratios were also reported for incident stroke (3.3) and dementia (1.9) [3], but results from studies on the association of WML with dementia subtypes are inconclusive [4][5][6]. The prevalence of WML increases with age, and in a population study, 51% of randomly selected healthy subjects aged 44-48 had WML [7]. For the age range 60-64, all had WML, and 49% of these had at least one large (>12 mm) WML region [8].
In magnetic resonance (MR) imaging, white matter hypointensities in T1 weighted images and white matter hyperintensities in T2 weighted and FLAIR images are regarded as visualizations of WML. The conditions for demarcation of WML are enhanced in FLAIR images due to suppression of the signal from free fluid. In contrast to T2 and T1 weighted images, this suppression causes fluid-filled cavities in FLAIR images to be hypointense and excluded from WML by the intensity definition. However, the intensity suppression in the FLAIR sequence entails incident imaging artifacts in the border region between WML and free fluid, for example, in the periventricular region [9]. WML visible in MR imaging reflect demyelination, axonal loss, gliosis, or edema. However, MR imaging of WML does not clearly separate these pathological substrates.
It is customary to use CT or MR imaging of WML among the diagnostic criteria for subcortical vascular dementia, for example, in Erkinjuntti et al. [10]. However, standardization of WML estimation is needed in order to establish uniform diagnostic criteria in clinical practice. Visual rating is simple and fast and therefore an important candidate method for standardized clinical use. There are several visual rating methods for WML research purposes [11,12]. Among these, the Fazekas visual rating [13,14] is frequently used in research, and it has been shown to have good reliability compared to two other rating scales [12], but the results on its correlation to volumetrical assessment diverge. For example, the Fazekas visual rating has shown the highest [11] and lowest [15] correlation to volumetrical assessment compared to other visual rating scales. Manual volumetrical assessments have shown higher reliability than visual rating scales [11,15] and would be valuable in clinical settings if made less labour intensive. In Gouw et al. [16], visual rating of WML and WML volumetry had similar correlations to neuropsychological performance, but in Garrett et al. [17], a correlation to neuropsychological performance was found only with WML volumetry and not with visual WML rating. Several segmentation and thresholding techniques have been used to manually assess WML volume in the literature, but one of the few techniques with a methodological description is reported in Gurol et al. [18].
Several automatic volumetry methods have been developed for WML classification [19][20][21]. FreeSurfer is one of these methods, and it contains automatic assessment of neuroanatomical subregions as well as WML hypointensity volumes. FreeSurfer has been frequently used for neuroanatomical subregional volumetry and has been shown to be comparable in accuracy to manual labeling for many tasks [22,23] and to perform well compared to other automated segmentation tools [24]. However, few publications have reported FreeSurfer WML volumes at all, and only one study reports intermethod reliability figures [25]. The latter study also reported that regional WML predicted executive dysfunction. One study found that total WML was significant for AD versus controls [26], and another found that total WML predicted functional decline almost as well as the best-predicting WML regions [5].
Reliability analysis is the first step in assessing the accuracy of a WML method, and the often reported excellent or near excellent reliability may be a reason why methodological issues about reliability have not always been given enough attention. The aim of the present study was to compare three types of assessment methods of total WML in a clinical sample in order to examine different aspects of their reliability and to determine features in need of further methodological development. The Fazekas rating was included in this study because of its simplicity. The manual MRIcron WML volumetry method was included on the basis of a presumed superiority in accuracy. FreeSurfer WML volumetry was included because it assesses WML automatically and because the common use of the method calls for further validation.

Method
In the Gothenburg mild cognitive impairment (MCI) study, subjects between 40 and 86 years of age (mean 65.6, SD 7.7 years) with subjective or objective cognitive impairment were recruited from the Memory Clinic at the Sahlgrenska University Hospital. The selection process has been described in more detail by Eckerström et al. [27]. Exclusion criteria were acute somatic disease, severe psychiatric disorder, pseudodementia, and substance abuse or confusion caused by drugs. Controls were recruited from other medical studies and senior citizen organizations, and baseline exclusion criteria were subjective or objective signs of cognitive disorder. The study was approved by the Ethics Committee of the University of Gothenburg, and the subjects gave their informed consent to participate in the study.
The study subjects were biannually assessed and classified according to the Global Deterioration Scale (GDS) [28]. Subjects with a GDS 2 or GDS 3 score received an MCI classification. MCI subjects remaining in the GDS 2 to GDS 3 range at followup were classified as "MCI stable," while MCI subjects receiving a GDS 4 (mild dementia) or higher classification at followup were classified as "MCI converting." Twenty-eight controls, 69 MCI stable, 9 MCI converting, and 46 dementia subjects were included in the present study. WML was measured on the first 1.5T MR imaging acquisitions in the study, between the years 2005 and 2007.
Imaging was performed on a Siemens Symphony 1.5T scanner. Axial T2 weighted images were used for the Fazekas rating, coronal FLAIR images were manually segmented in the MRIcron software, and coronal T1 weighted images were analyzed with the FreeSurfer package (stable release version 4.0.5). The scan parameters used in the study are presented in Table 1. All raters and operators were blinded to header data such as the identity and cognitive status of the subject. This study report was outlined to be in accordance with the guidelines for reliability studies in Kottner et al. [29].
2.1. Visual Rating. WML severity was rated using modifications of the Fazekas scale [13]. Periventricular WML and deep WML were not rated separately, in accordance with the recommendations by Fazekas et al. [30]. In contrast to the original four-graded Fazekas scale, only three grades were used, as in Inzitari et al. [14], and in the present study, grade 1 also included possible absence of WML. For each subject, the slice with the largest WML occurrence visible was used to determine the WML load as belonging to one of three grades. The assessments were performed independently by two raters (J. Berge and J. Eldblom) who compared the images with the axial template images of the different grades of WML given by Inzitari et al. [14]. Metrical measurements were not used. Only cerebral WML were included in the ratings. The raters J. Berge and J. Eldblom had substantial knowledge of neuroanatomy and went through limited training in visual rating before entering the study assessment.

MRIcron.
Cerebral WML segmentation and intensity thresholding were performed on 1.5T coronal 5 mm FLAIR images using the MRIcron software with a modification (described below) of the method used in Gurol et al. [18]. The segmentation method involves an initial rough manual demarcation of WML, separating them from noncerebral regions and septum pellucidum within the same intensity span, as in Holland et al. [31], followed by an intensity threshold set to separate WML from adjacent tissue types.
In the Gothenburg MCI cohort, artifactual image intensity differences between slices and between series (using the same FLAIR sequence) were common. No automatic intensity normalization between patients was done, but in order to consistently analyze each patient's series in similar intensity settings, the MRIcron grayscale mapping was used on a window setting containing all brain tissue in a certain slice (containing the quadrigeminal plate). Due to intensity inhomogeneities, the manual volumetry method adapted from Gurol et al. [18] required a modification towards a more WML-specific manual segmentation, where the contour of the WML was demarcated quite closely. One rater (E. Olsson) demarcated 152 subjects, while a second rater (J. Berge) independently demarcated 27 randomly selected subjects for the determination of interrater reliability. The rater E. Olsson has longstanding experience in MRI segmentation, and the rater J. Berge went through substantial training in the manual WML volumetrical method before entering the study assessment.
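The two-step procedure above (manual demarcation followed by an intensity threshold) can be illustrated with a minimal Python sketch. It is not the MRIcron implementation: the function name, the nested-list image layout, the toy intensities, and the 1 x 1 mm in-plane voxel size are all assumptions made for illustration; only the 5 mm slice thickness is taken from the text.

```python
def wml_volume_ml(intensities, mask, threshold, voxel_volume_mm3):
    """Volume (mL) of voxels inside the manual mask at or above the threshold.

    intensities and mask are nested lists indexed [slice][row][column];
    mask holds True where the rater's rough demarcation includes the voxel.
    """
    count = 0
    for slice_vals, slice_mask in zip(intensities, mask):
        for row_vals, row_mask in zip(slice_vals, slice_mask):
            for value, inside in zip(row_vals, row_mask):
                if inside and value >= threshold:
                    count += 1
    return count * voxel_volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Hypothetical single-slice example (all numbers invented).
toy_intensities = [[[100, 180, 200],
                    [90, 210, 170]]]
toy_mask = [[[False, True, True],
             [False, True, True]]]
VOXEL_MM3 = 1.0 * 1.0 * 5.0  # assumed 1 x 1 mm in-plane, 5 mm slice thickness

volume_strict = wml_volume_ml(toy_intensities, toy_mask, 175, VOXEL_MM3)
volume_lenient = wml_volume_ml(toy_intensities, toy_mask, 165, VOXEL_MM3)
```

In this toy case, lowering the threshold from 175 to 165 adds one voxel to the count, which shows how a small threshold change can matter proportionally more when the total lesion volume is low.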

FreeSurfer.
Hypointensity volumetrics determined as WML were estimated by an operator (N. Klasson) running the FreeSurfer analysis (stable release version 4.0.5). FreeSurfer is a highly automatic image analysis suite and is available for download online (http://surfer.nmr.mgh.harvard.edu/). FreeSurfer uses a probabilistic atlas generated from manually segmented MR scans to execute the segmentation. The probabilistic information has been mapped into Talairach space so that each location therein contains specific probabilities for each tissue type. Given a specific location and tissue type, the probabilities are given as (1) a Gaussian intensity distribution, (2) probability of occurrence, and (3) probability of neighboring tissue types. During segmentation, a number of processes take place including motion correction and intensity normalization, removal of nonbrain tissue, Talairach registration, and labeling of voxels into tissue types. An initial labeling takes place where each voxel is assigned its most probable tissue type. An iterative algorithm then uses the found tissue probabilities to calculate new probabilities for the voxel labels. This process continues until no changes of tissue types take place. Manual edits were done by the operator (N. Klasson) to reduce inaccuracies in white and grey matter classification. The segmentation process has been described in detail elsewhere [22,23]. (Table 3).
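The labeling loop described above (an initial most-probable label per voxel, followed by iterative relabeling until no label changes) can be sketched in a few lines of Python. This is a one-dimensional toy in the style of iterated conditional modes, not FreeSurfer's code. The class parameters, the neighbor bonus, and the intensity values are hypothetical, and the neighbor term is simplified to a fixed bonus per agreeing neighbor rather than an atlas-derived probability.

```python
import math


def log_gaussian(x, mean, sd):
    """Log of a Gaussian density, up to an additive constant."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd)


def iterative_labeling(values, classes, neighbor_bonus=1.0, max_iter=100):
    """Label a 1D row of voxels.

    classes maps label -> (mean intensity, sd, prior probability of occurrence).
    """
    def unary(x, label):
        mean, sd, prior = classes[label]
        return log_gaussian(x, mean, sd) + math.log(prior)

    # Initial labeling: most probable class from intensity and prior alone.
    labels = [max(classes, key=lambda c: unary(x, c)) for x in values]

    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(values):
            neighbors = [labels[j] for j in (i - 1, i + 1) if 0 <= j < len(values)]

            def score(c):
                # Add a fixed bonus for every neighbor currently carrying label c.
                return unary(x, c) + neighbor_bonus * sum(1 for n in neighbors if n == c)

            best = max(classes, key=score)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # stop when a full pass changes no label
            break
    return labels


# Hypothetical T1-like intensities: white matter bright, WML darker.
example_classes = {"WM": (110.0, 10.0, 0.9), "WML": (80.0, 10.0, 0.1)}
example_values = [112, 108, 86, 109, 111, 82, 79, 84, 110]
example_labels = iterative_labeling(example_values, example_classes)
```

In this toy row, the isolated dark voxel (86) is first labeled WML from its intensity alone and is then relabeled as WM because both neighbors are WM, whereas the three contiguous dark voxels keep the WML label. The same mechanism suggests how isolated punctate hypointensities could be suppressed by a neighborhood term, although whether this occurs in FreeSurfer itself is not established by the sketch.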

Statistical
Table 2: Study characteristics. P values were assessed by the two-tailed t-test, except for gender, where chi-square was used. Gr 4 is group 4 (dementia). Gr 2 and MCI-s are stable MCI. MCI-c is converting MCI. Gr 2 and Gr 4 had significantly lower scores than the group specified in the cell for the pertaining variable; for example, Gr 2 (stable MCI) had significantly lower age than group 4 (dementia group). There was no significant difference in gender.
No significant systematic differences between raters were found for Fazekas visual rating or for MRIcron manual volumetry in the Wilcoxon matched pairs test (details not shown), but the FreeSurfer automatic volumetry differed significantly from the manual MRIcron volumetry. Almost all FreeSurfer volumes were lower than the corresponding manual MRIcron volume (Figure 1). In order to evaluate reliabilities in the dense and the sparse parts of the data distribution (Figure 2), respectively, the WML volumes were separated into tertiles (Table 3). The Fazekas interrater reliability (Spearman's rho) measured 0.65 for the aggregated lower two tertiles and 0.92 for the upper WML tertile; this difference was significant. The manual MRIcron interrater reliability was nonsignificant for the lower two tertiles (which make up about 10 percent of the whole volume range) but significant for the upper WML tertile, with an interrater reliability of 0.94. For the anterior part of the brain taken separately, there was still no significant reliability for the lower aggregated WML tertiles, but it was significant for the posterior part with a rho value of 0.56. For the intermethod reliability between the manual MRIcron volumetry and the automatic FreeSurfer volumetry, the rho value was also lower in the lower aggregated tertiles (0.38) than in the upper tertile (0.74).
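The tertile-wise comparison can be illustrated with a short, self-contained Python sketch that computes Spearman's rho separately for the aggregated lower two tertiles and the upper tertile. The volumes are invented, and ordering subjects by the mean of the two raters before splitting is an assumption, since the text does not state which measure defined the tertiles. Significance testing is omitted.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # undefined if either input is constant


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def split_lower_two_vs_upper_tertile(rater_a, rater_b):
    """Sort subject pairs by mean volume; return (lower two tertiles, upper tertile)."""
    pairs = sorted(zip(rater_a, rater_b), key=lambda p: (p[0] + p[1]) / 2)
    cut = (2 * len(pairs)) // 3
    return pairs[:cut], pairs[cut:]


# Hypothetical WML volumes (mL) from two raters, skewed toward low volumes.
rater_a = [1.0, 1.2, 0.8, 1.5, 1.1, 0.9, 6.0, 9.0, 14.0]
rater_b = [1.3, 0.9, 1.2, 1.0, 1.4, 1.1, 6.5, 8.0, 15.0]

lower, upper = split_lower_two_vs_upper_tertile(rater_a, rater_b)
rho_lower = spearman_rho([a for a, _ in lower], [b for _, b in lower])
rho_upper = spearman_rho([a for a, _ in upper], [b for _, b in upper])
rho_all = spearman_rho(rater_a, rater_b)
```

With these invented numbers the raters agree on the rank order of the three high-volume subjects but not of the six low-volume subjects, so the whole-sample rho conceals poor agreement in the dense low-volume range.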
For ICC values, see Table 3. The reason why we do not report them in the text is stated in Section 4.
Journal of Aging Research
Table 3: Interrater and intermethod reliability. * Correlation was significant at the 0.05 level (2-tailed). ** Correlation was significant at the 0.01 level (2-tailed). ManWML is manual MRIcron volumetry. Anterior is the manual MRIcron assessed volume anterior of the quadrigeminal plate. Posterior is the manual MRIcron assessed volume posterior of the quadrigeminal plate. ManWML versus AutoWML is the manual MRIcron volumetry correlation to the automatic FreeSurfer volumetry. Rho below T3 is the Spearman correlation, and ICC below T3 is the intraclass correlation in the lower two tertiles. Rho in T3 is the Spearman correlation and ICC in T3 the intraclass correlation in the highest tertile. The reliability was significantly higher in the highest tertile for all methods.
Figure 3(a) shows the Bland-Altman (BA) plots of absolute volume interrater differences in the manual MRIcron method, demonstrating increasing differences with higher volumes. However, the variation in rater differences was larger in the assessments of the lowest range of the WML volumes, comprising about 80% of the subjects. When comparing the difference as a percentage of the mean measures of each subject (Figure 3(b)), the variation was even more pronounced with regard to the lowest part of the WML volumes, while there was no clear indication of an increase in rater differences with larger volumes. The intermethod BA plots (Figure 1) for the manual MRIcron volumetry and automatic FreeSurfer volumetry showed a similar pattern as the interrater BA plots, with a pronounced variation for low WML volumes in Figure 1.
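The two kinds of Bland-Altman quantities discussed here, absolute differences and differences as a percentage of the per-subject mean, can be computed as in the following Python sketch. The function names and the example volumes are hypothetical. The 95% limits of agreement are the conventional Bland-Altman summary and are included as an assumption; the text does not state whether they were reported.

```python
def bland_altman(rater_a, rater_b):
    """Per-subject (mean, absolute difference, difference as % of mean).

    Assumes strictly positive volumes, so the per-subject mean is never zero.
    """
    rows = []
    for a, b in zip(rater_a, rater_b):
        mean = (a + b) / 2
        diff = a - b
        rows.append((mean, diff, 100 * diff / mean))
    return rows


def limits_of_agreement(diffs):
    """Return (lower limit, bias, upper limit) as bias +/- 1.96 sample SD."""
    n = len(diffs)
    bias = sum(diffs) / n
    sd = (sum((d - bias) ** 2 for d in diffs) / (n - 1)) ** 0.5
    return bias - 1.96 * sd, bias, bias + 1.96 * sd


# Hypothetical WML volumes (mL): one low-volume and one high-volume subject.
ba_rows = bland_altman([2.0, 10.0], [1.0, 10.5])
low_subject, high_subject = ba_rows
```

In this invented pair, the low-volume subject has a 1.0 mL difference, about 67% of its mean, while the high-volume subject has a 0.5 mL difference, about 5% of its mean. This shows why the percentage plot accentuates variation at low volumes.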

Reliability.
Conventional interrater reliabilities were acceptable for the Fazekas rating and the manual MRIcron volumetry. The intermethod reliability for the manual MRIcron and automatic FreeSurfer methods was also acceptable. It is common to evaluate reliability in WML research with intraclass correlation (ICC) with excellent results, for example, in Gao et al. and Smith et al. [11,25]. However, it is misleading to use such an analysis in a data structure where the distribution is skewed, with very sparse data points in the upper third of the volume range. While the ICC showed excellent reliability for the whole sample, as in previous studies, it was nonsignificant in the lower aggregated tertiles for the manual method (Table 3). The density of low-burden WML (Figure 2) probably represents a very common clinical distribution, and reliability analysis must under these conditions be performed and interpreted with caution.
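The effect of a skewed distribution on the ICC can be made concrete with a small Python sketch. It uses the one-way random-effects form, ICC(1,1), for two raters. The text does not state which ICC form was used, so this choice is an assumption, and the volumes are invented.

```python
def icc_one_way(pairs):
    """One-way random-effects ICC(1,1) for two ratings per subject."""
    n = len(pairs)
    k = 2  # ratings per subject
    subject_means = [(a + b) / k for a, b in pairs]
    grand_mean = sum(subject_means) / n
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2 for (a, b), m in zip(pairs, subject_means)
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical WML volumes (mL): many low-burden subjects, few high-burden ones.
low_pairs = [(1.0, 1.3), (1.2, 0.9), (0.8, 1.2), (1.5, 1.0), (1.1, 1.4), (0.9, 1.1)]
high_pairs = [(6.0, 6.5), (9.0, 8.0), (14.0, 15.0)]

icc_whole_sample = icc_one_way(low_pairs + high_pairs)
icc_low_burden = icc_one_way(low_pairs)
```

In this invented sample, the three high-burden subjects inflate the between-subject variance, so the whole-sample ICC is close to 1 even though the raters disagree substantially among the low-burden subjects, where the ICC is low.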
The finding in the present study of lower reliability in the Fazekas rating with lower WML burden is congruent with the finding by Wardlaw et al. [32], where cohorts with lower WML burden showed lower reliability. The higher reliability for high WML burden has been considered a ceiling effect, but there might as well be floor effects in visual rating [33]. Presumed ceiling and floor effects do not disqualify visual rating as a candidate for clinical use, but such effects would in general lower its usefulness for WML research, for example, the possibility to find valid correlations with psychometrics.
The nonsignificant Spearman correlation in the aggregated lower tertiles between manual MRIcron volumetry raters could possibly be due to the high density of data affecting the rank order of the ratings. It is unclear if the high Spearman correlation for high-burden WML implies an overestimation of the reliability due to the sparse data density (leading to less error with ranked data) or if there really was a higher reliability as the BA plot of fractional differences seems to imply (Figure 3(b)). In our opinion, the low reliability in low-burden WML is most probably not due to the combination of high density of data and the nature of the Spearman correlation. Rather, it may be a real rater and intermethod variation, as is visible in the BA plots, which seem to indicate an increased variation in the lowest quarter of the volume range. The higher rater variation in low WML may in turn be due to difficulties in the handling of intensity distortion affecting the thresholding step in the manual volumetry.
Figure 4: Sagittally reformatted slice from an example subject showing the common anterior intensity shift between the fourth and fifth slices in the coronal 2D FLAIR sequence.
Intensity inhomogeneities in MR images and variation in grayscale level between scan series result in inconsistent classifications of hyperintensities that in turn introduce measurement errors in the manual WML assessment. In order to assess WML volumes as accurately as possible under conditions with varying intensity levels through the image slices of a subject, a methodology was chosen where the thresholding was adjusted as a compromise to best fit the visible WML volume through all image slices of a subject's brain. Since the localization and extent of WML vary, the accuracy of the thresholding can be expected to be decreased by these intensity distortions. A particular shift in grayscale level was observed in the anterior part of all FLAIR image series (Figure 4), which may be due to gradient eddy currents or cross talk between 2D FLAIR slices [34,35]. The unreliability for low WML seems to emanate mainly from the anterior half of the image data, and the intensity shift in the anterior part may well be a reason for more complex considerations in the thresholding of low WML burden cases. For higher WML volumes, the thresholding step in general only affects the amount of WML selected, but for lower volumes, the thresholding more often affects the presence or absence of WML in a slice, which means a more straightforward thresholding with higher volumes. In short, increased complexity in the thresholding may be the main cause of an increased rater variation for low WML volumes.
The FreeSurfer suite includes intensity normalization steps which to some extent will limit intensity distortion, but which may also extinguish some of the hypointensities. Further, the T1 MR sequence used for the FreeSurfer volumetry does in general have lower WML definition and somewhat smaller areas of WML hypointensities compared to the hyperintensities in the FLAIR sequence. A visual inspection of the T1 weighted images and the FreeSurfer segmentations confirmed that the substantial deviations between the volumetrical methods in the detection of WML in various locations may mainly be due to the lower definition in the T1 weighted images. The FreeSurfer segmentation often omitted large amounts of deep WML seen in the FLAIR images (Figure 5). On a few occasions, FreeSurfer classified sulcal cortical voxels as WML. Occasionally, FreeSurfer detected more WML in the periventricular region than visible in the T1 weighted images. Compared to WML seen in FLAIR images, the amount of WML found by FreeSurfer was still lower, even in the periventricular regions. The T1 MRI images in Figures 5 and 6 are examples showing that the noise level makes the detection of punctate WML a difficult task; nevertheless, FreeSurfer detects several punctate WML patches in this region. The single punctate WML patch in the FLAIR slice in Figure 6 is hardly visible in the T1 slice, and it is conceivable that no punctate WML patch in this location is detected in FreeSurfer. In some cases, FreeSurfer detected small punctate WML not detected in the manual volumetry (Figure 6(b)), but occasionally, it detected less punctate WML than the manual volumetry (Figure 6(c)). Hence, FreeSurfer seems to detect punctate WML in a largely random way. However, a more accurate punctate WML detection may be a disadvantage in dementia research since punctate WML often originate from perivascular spaces and have been found to have low progression [36].
In short, the FreeSurfer volumetry generally detects less WML, which is likely due to a combination of the method, the characteristics of WML, and the visibility of WML in T1 weighted images and leads to larger differences between the methods when measuring larger WML volumes (Figure 1).
WML progression generally begins in the periventricular region and subsequently expands radially to more peripheral locations [37]. The skewness of the distribution of WML, with a low fraction of high-burden WML subjects also in the dementia part of the sample, can be expected to exist also for clinical samples in general. The statistical atlas used in FreeSurfer was generated from a small sample (http://surfer.nmr.mgh.harvard.edu/fswiki/Buckner40Testing), where high-burden WML cases might be rare or lacking. An atlas generated with a low frequency of high-burden exposure may contribute to the FreeSurfer inability to find deep WML.

4.2. Limitations. The Fazekas rating was performed by comparing the largest WML aggregation to the templates, and no metrical measurements were used, which possibly affected the classification of the subjects close to the Fazekas definitions of the metrical borders. The visual rating in three grades differs from the design in other studies, which limits the comparability of results. It is not feasible to use a zero grade WML in T2 images with any accuracy, and in a later control of the FLAIR images, it was ascertained that no subjects with complete absence of WML were present in the study. In visual rating, the FLAIR sequence can be expected to be advantageous compared to the T2 sequence due to the signal suppression of free fluid in FLAIR images. However, the FLAIR sequence in this study had coronal orientation and 5 mm slice thickness, which made it unsuitable to use with the axial template images. Therefore, the axial T2 scan sequence was used.
In the manual method, the basal ganglia nuclei in FLAIR images often had no clear intensity separation from white matter, which may have contributed to an inaccurate volume assessment in this region. The hyperintensities along the ventricular lining in the basal ganglia region were included in the manually assessed WML, which may be unsuitable and possibly confound the differentiation, primarily in patients with low WML volumes. The different scan sequences used in the different methods confound the methodological comparison; for example, only the FLAIR sequence does to some extent show separate intensity ranges for fluid-filled cavities and WML. By contrast, in T1 and T2 sequences, fluid-filled cavities have intensities in the same range as WML. The moderate interrater correlation for the manual volumetry could possibly be a consequence of intensity distortions in the FLAIR images. Although steps were taken in the thresholding methodology to minimize the impact of intensity distortions, sample tests of WML volume using different thresholdings unveiled considerable variation in measurements. The FLAIR sequence had obvious intensity inhomogeneities, and no automatic intensity normalization was performed, which may limit the possibilities to interpret the present findings further.

Conclusions
Reliability analysis showed acceptable overall results but with lower reliability for all methods in the lower aggregated tertiles. Despite excellent overall reliability for manual volumetry, no significant reliability was found in the lower aggregated tertiles. Hence, the results of intraclass correlation in WML samples, which are commonly skewed, might be misleading, and reliability analysis of WML methods should be considered with caution. Image intensity distortion seems to be the major cause of the reliability deficits. Optimized MR imaging and postscan intensity standardization and normalization are quite likely among the most important factors to consider for more accurate measurements of WML. FreeSurfer yielded lower volumes than the manual method, probably due to the T1 sequence it uses, and was not able to detect punctate WML in a consistent manner.

Future Directions.
A medically oriented follow-up paper focusing on the predictive power of the different WML measurements is being prepared. In another study, we intend to refine the visual comparison between the methods, using a larger sample of subjects and coregistration of images from different scan series. Finally, our long-term goal is an improved manual method that excludes thresholding but includes an adequate intensity normalization.

Conflict of Interests
The authors have no actual or potential conflict of interests with the work in the present study, either financial or otherwise.