Statistical Evaluations of the Reproducibility and Reliability of 3-Tesla High Resolution Magnetization Transfer Brain Images: A Pilot Study on Healthy Subjects

Magnetization transfer imaging (MT) may have considerable promise for early detection and monitoring of subtle brain changes before they are apparent on conventional magnetic resonance images. At 3 Tesla (T), MT affords higher resolution and increased tissue contrast associated with macromolecules. The reliability and reproducibility of a new high-resolution MT strategy were assessed in brain images acquired from 9 healthy subjects. Repeated measures were taken for 12 brain regions of interest (ROIs): genu, splenium, and the left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter. Spearman's correlation coefficient, coefficient of variation, and intraclass correlation coefficient (ICC) were computed. Multivariate mixed-effects regression models were used to fit the mean ROI values and to test the significance of the effects due to region, subject, observer, time, and manual repetition. A sensitivity analysis of various model specifications and the corresponding ICCs was conducted. Our statistical methods may be generalized to many similar evaluative studies of the reliability and reproducibility of various imaging modalities.


Introduction
Magnetization transfer (MT) imaging is a quantitative approach for detecting subtle or occult abnormalities in brain tissue. In previous studies, the magnetization transfer ratio (MTR), an index of MT imaging, was sensitive to brain changes in patients with mild cognitive impairment, an Alzheimer's disease prodrome [1,2], to new lesions in patients with multiple sclerosis [3], and to changes associated with progression in chronic neurological disorders [4]. The higher magnetic field strength afforded by 3T allows MT image resolution to be augmented compared with conventional MT acquisition at 1.5T [5][6][7]. We developed a high resolution MT technique to detect subtle changes in anatomically small, functionally eloquent brain structures. The increased field strength affords whole-brain coverage with considerably thinner slices, potentially reducing partial volume artifacts. However, even among healthy subjects, numerous factors may introduce variability in measures derived from magnetic resonance (MR) data, such as static field (B0) signal dropout and RF nonuniformity. Measurement variation may also be introduced by scan repetitions, repositioning at different time points, and image postprocessing. Moreover, 3T may be susceptible to variation associated with increased field strength [8]. Such variability may pose limitations when conducting clinical comparisons to differentiate normal and diseased brains or in developing statistically predictive algorithms.
International Journal of Biomedical Imaging
To validate high resolution MT for detecting early disease or for monitoring progression in chronic neurological disease, it is necessary to collect information on normative values and to evaluate the reliability and reproducibility of the measurements when measured across time in healthy controls. This investigation evaluated observer-agreement of high-resolution MT measurements determined from repeated brain scans of 9 healthy volunteers. We postulated that MT values would remain stable during the one month study interval. We evaluated the reliability and reproducibility of the high resolution MT measurements in 12 brain regions of interest (ROIs), applied statistical measures to the data and used complex multivariate mixed-effects models to test the statistical significance of several effects due to region, subject, observer, time, and manual repetition.

Study Subjects.
The study was approved by the IRB at the NorthShore University Health System and conducted following the ethical principles outlined in the Declaration of Helsinki. Eleven healthy adult volunteers, randomly selected from a database maintained at the Center for Advanced Imaging, Radiology Department, NorthShore University Health System, provided written informed consent and were evaluated against eligibility criteria. To protect the subjects' confidentiality, all data were de-identified and handled according to the guidelines specified by the Health Insurance Portability and Accountability Act (HIPAA) in the USA.

Image Acquisition. Brain images were acquired using a 3T General Electric (GE) HDx system (Waukesha, WI, USA). Each volunteer was scanned twice, with a randomly selected interval of 1 to 4 weeks between scans. Methods for reducing random errors in image acquisition included the use of a body coil for excitation to control B1 nonuniformities and an 8-channel quadrature receive-only coil [9]. Images were acquired with ($M_S$) and without ($M_0$) an MT saturation pulse applied at an offset frequency from the water resonance. To accelerate the scan for whole-brain coverage while maintaining thin slices, the imaging protocol was optimized for 3T using a three-dimensional spoiled gradient recalled (3D SPGR) acquisition [5]. The Gaussian-sinc MT pulse had a duration of 8 ms and was applied at a 1200 Hz offset. The stability of the scanner and set-up procedure was addressed by using a fixed set of parameters per subject. The imaging protocol included the following parameters: TR 34 to 35 ms, TE 4 to 8 ms, flip angle (FA) 5°, bandwidth 15.6 kHz, 0.75 NEX, phase FOV 0.75, and voxel dimensions 0.9 × 0.9 × (0.9 to 1.3) mm³. The whole brain was covered in 90 to 140 slices, with acquisition times ranging from 7 minutes 40 seconds to 10 minutes 20 seconds using a partial k-space acquisition.

2.3. Image Analysis. MTR maps were generated off-line on a General Electric AW Workstation (General Electric, Milwaukee, WI, USA) using the standard equation:

$$\mathrm{MTR} = \frac{M_0 - M_S}{M_0} \times 100\%,$$

where $M_S$ and $M_0$ were the signal intensities in a given voxel obtained with and without the MT saturation pulse, respectively. MTR maps generated from the high resolution MT images are shown in Figure 1. The 12 ROIs were: genu, splenium, and the left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter (Figure 2). Each ROI measured approximately 30 to 43 mm² and was manually and independently placed by Observers 1 and 2 (Authors S.S. and Y.W.), following procedures in classical and standard agreement studies [10]. After an initial consensus was reached regarding the sizes and locations of the 12 ROIs, the observers performed manual segmentations of the ROIs independently on each set of images. Each observer repeated this ROI placement procedure the following week. MTR values were extracted using the manually defined ROIs for each combination of observer, time point, and repetition (Table 1). The means and SDs of the ROI values were calculated. Meta-data were stored in a SAS 9.1 (SAS, Cary, NC, USA) dataset, with individual volunteer identification numbers withheld and replaced by a sequence of 1 to 9 for each subject.
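As a rough illustration of the image analysis step, the voxel-wise MTR and its ROI average could be sketched as follows. This is a minimal sketch, not the workstation software used in the study: the function names `mtr_percent` and `roi_mean_mtr` are hypothetical, the signal intensities are made up, and expressing the MTR in percent units is an assumption.

```python
import math


def mtr_percent(m0, ms):
    """Return the MTR of one voxel in percent: 100 * (M0 - Ms) / M0."""
    if m0 <= 0:
        return math.nan  # no unsaturated signal: the ratio is undefined
    return 100.0 * (m0 - ms) / m0


def roi_mean_mtr(m0_voxels, ms_voxels):
    """Average the voxel-wise MTR over one ROI, skipping undefined voxels."""
    values = [mtr_percent(m0, ms) for m0, ms in zip(m0_voxels, ms_voxels)]
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid)


# Made-up signal intensities for a four-voxel ROI (not study data).
m0_roi = [1000.0, 980.0, 1020.0, 0.0]
ms_roi = [600.0, 588.0, 612.0, 0.0]
print(roi_mean_mtr(m0_roi, ms_roi))
```

Voxels with no unsaturated signal are skipped here so that background does not bias the ROI mean; other conventions (for example, masking before averaging) would serve equally well.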
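For each ROI $m$, the sample mean and SD of the ROI values $Y_{ijklm}$ (subject $i$, observer $j$, time point $k$, repetition $l$) were

$$\bar{Y}_{\bullet\bullet\bullet\bullet m} = \frac{Y_{\bullet\bullet\bullet\bullet m}}{N_m}, \qquad \mathrm{SD}_m = \sqrt{\frac{1}{N_m - 1} \sum_{i,j,k,l} \left(Y_{ijklm} - \bar{Y}_{\bullet\bullet\bullet\bullet m}\right)^2}, \tag{2}$$

where $N_m = I \times J \times K \times L = 9 \times 2^3 = 72$ measurements and the operator "•" denotes the marginal sum over the corresponding index.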
The 95-percentile normality range was approximately the interval with the following lower and upper bounds:

$$\left(\bar{Y}_{\bullet\bullet\bullet\bullet m} - 2\,\mathrm{SD}_m,\ \bar{Y}_{\bullet\bullet\bullet\bullet m} + 2\,\mathrm{SD}_m\right), \tag{3}$$

where $\bar{Y}_{\bullet\bullet\bullet\bullet m}$ and $\mathrm{SD}_m$ are the sample mean and SD of the values for ROI $m$. The term "normality range," as used in Europe, may be defined arbitrarily according to the number of standard deviations away from the mean [11]. Thus, it should not be viewed as the range of the entire dataset, but rather as an interval useful for estimating the population value by one or several standard deviations away from the mean. Here the critical value of 2 was chosen as recommended by Bland and Altman [12].

Additionally, we justified this choice using a Student's t-distribution with $N_m - 1 = 71$ degrees of freedom. For a tail probability of $\alpha/2$ (e.g., 0.025 for a 95-percent normality range), we used the corresponding quantile of this t-distribution, such that

$$t_{71,\,1-\alpha/2} = t_{71,\,0.975} \approx 1.99.$$

This value is close to the recommended multiplier of 2. Therefore, we rounded it to 2 in (3) for convenience.
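The normality range and the t-quantile argument above can be checked numerically. The sketch below uses only the Python standard library; the helper names (`normality_range`, `t_quantile`) and the nine ROI values are hypothetical, and the t-quantile is obtained by simple numerical integration and bisection rather than by the software used in the study.

```python
import math
import statistics


def normality_range(values, multiplier=2.0):
    """Return (mean - multiplier * SD, mean + multiplier * SD) using the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - multiplier * sd, mean + multiplier * sd


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, n=2000):
    """CDF via Simpson's rule on [0, |x|] plus the symmetry of the density."""
    a = abs(x)
    h = a / n
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p, df):
    """Quantile by bisection on the CDF over a wide bracket."""
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Made-up mean MTR values for one ROI (not study data).
roi_values = [41.2, 42.0, 40.5, 43.1, 41.8, 42.6, 40.9, 41.5, 42.3]
print(normality_range(roi_values))
print(t_quantile(0.975, 71))  # multiplier for a 95-percent range with 71 df
```

Passing the computed quantile as `multiplier` instead of the rounded value 2 changes the bounds only slightly, which is the point of the rounding argument.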

Concordance Using Spearman's Rank Correlation Coefficients.
We first explored and measured the concordance between the various measurements fully nonparametrically via Spearman's rank correlation coefficient. Suppose that we correlate the ROI values of Observers $j = 1$ and $j' = 2$, and denote the marginal ranks across subjects by $R_{ijklm} = \mathrm{rank}_i(Y_{ijklm})$ and $R_{ij'klm} = \mathrm{rank}_i(Y_{ij'klm})$, respectively, for $j \neq j'$. The sample version of Pearson's product-moment correlation coefficient computed on the ranks of the data is equivalent to Spearman's rank correlation coefficient [13]:

$$r_S = \frac{\sum_{i=1}^{I} \left(R_{ijklm} - \bar{R}_{\bullet jklm}\right)\left(R_{ij'klm} - \bar{R}_{\bullet j'klm}\right)}{\sqrt{\sum_{i=1}^{I} \left(R_{ijklm} - \bar{R}_{\bullet jklm}\right)^2 \sum_{i=1}^{I} \left(R_{ij'klm} - \bar{R}_{\bullet j'klm}\right)^2}},$$

where $\bar{R}_{\bullet jklm}$ denotes the mean rank over subjects.
Assuming that there were no ties, since the ROI values were continuous random variables, Spearman's rank correlation coefficient between Observers $j$ and $j'$ was

$$r_S = 1 - \frac{6 \sum_{i=1}^{I} D_{i \bullet klm}^2}{I\left(I^2 - 1\right)},$$

where the difference between a pair of marginal ranks for Observers $j$ and $j'$ was denoted by $D_{i \bullet klm} = R_{ijklm} - R_{ij'klm}$, for $j \neq j'$. Consequently, all of the raw mean ROI values were converted to their marginal ranks, and the differences between the ranks of each observation on the two variables were computed. Spearman's rank correlation coefficient was also computed for the ROI values between the two time points $k = 1$ and $k' = 2$.
The strength of the concordance and the benchmark values have been discussed [14]. Bar diagrams were made to display the Spearman's rank correlation coefficients between observers or time points for each ROI.
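The equivalence between the Pearson correlation of the ranks and the rank-difference formula can be illustrated with a short sketch. The function names and the observer values below are hypothetical, and the ranking helper assumes no ties, as stated in the text.

```python
import math


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(u, v):
    """Sample Pearson product-moment correlation of two equal-length lists."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mean_u) ** 2 for a in u)
                    * sum((b - mean_v) ** 2 for b in v))
    return num / den


def spearman_from_ranks(x, y):
    """Spearman's coefficient as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spearman_from_differences(x, y):
    """Spearman's coefficient as 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Made-up mean MTR values for 9 subjects in one ROI (not study data).
observer_1 = [41.2, 42.0, 40.5, 43.1, 41.8, 42.6, 40.9, 41.5, 42.3]
observer_2 = [41.0, 42.4, 40.8, 42.9, 41.3, 42.7, 40.6, 41.9, 42.1]
print(spearman_from_ranks(observer_1, observer_2))
print(spearman_from_differences(observer_1, observer_2))
```

The two functions agree only when there are no ties; with tied values, the Pearson-on-ranks form with average ranks would be the one to keep.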

Reproducibility Using Coefficients of Variation.
We used a normalized measure of the dispersion of a distribution to evaluate the reproducibility of the measurements [15]. The measure was the coefficient of variation (CV), defined as the ratio of the SD to the mean:

$$\mathrm{CV}_m = \frac{\mathrm{SD}_m}{\bar{Y}_{\bullet\bullet\bullet\bullet m}}, \tag{7}$$

where both the numerator (i.e., the sample SD) and the denominator (i.e., the sample mean) are provided in (2). For skewed data such as those generated by an exponential distribution, the underlying population mean and standard deviation are equal, and thus the CV equals 1. Hence, CV < 1 would generally represent low variability, and CV > 1 would represent high variability. As in (4) and (6), further stratified computations of the CV for different observers, time points, or repetitions were achieved using formulae similar to (7).
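A stratified CV computation of the kind described above might look like the following sketch. The record layout (dictionaries with `observer` and `mtr` keys), the function names, and the values are assumptions made for illustration only.

```python
import statistics
from collections import defaultdict


def coefficient_of_variation(values):
    """Sample SD divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)


def cv_by_group(records, key):
    """Compute the CV of the 'mtr' field separately for each level of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["mtr"])
    return {level: coefficient_of_variation(vals) for level, vals in groups.items()}


# Made-up measurements for one ROI (not study data).
records = [
    {"observer": 1, "mtr": 40.0},
    {"observer": 1, "mtr": 42.0},
    {"observer": 1, "mtr": 44.0},
    {"observer": 2, "mtr": 41.0},
    {"observer": 2, "mtr": 42.0},
    {"observer": 2, "mtr": 43.0},
]
for observer, cv in cv_by_group(records, "observer").items():
    print(observer, f"{100 * cv:.1f}%")
```

The same helper can stratify by time point or repetition by passing a different `key`, provided the records carry that field.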

Normality and Significance Tests for the Effects via a Multivariate Regression Analysis. Because overall variability was likely a result of the effects illustrated in Table 1, we employed a multivariate mixed-effects regression analysis to directly model the ROI values. A variance-component approach has advantages over many stratified analyses, especially in studies with a limited sample size. Here, because of the novel imaging modality using MT and 3T acquisitions with labor-intensive manual segmentation procedures, a large number of subjects would not have been feasible. To conduct an analysis of variance (ANOVA) based on the various effects, a distributional assumption of normality was necessary and convenient. Therefore, we conducted marginal normality tests using the Shapiro-Wilk test [16]. We demonstrate (see Section 3.4) that the normality assumption was generally satisfactory.
Thus, we could consider adopting a linear random-effects model with all pairwise interactions, in addition to a third-order interaction term. The effects represented the following: $\mu_m$ as the intercept, $S_i$ as subjects, $O_j$ as observers, $T_k$ as time points, $R_l$ as repetitions, and $\varepsilon_{ijklm}$ as the error term. A random-effects model assumed that each of the effects would have an independent normal distribution with mean 0 and its own variance.
If normality had failed, and because the data were mean ROI values that were positive, we would recommend a Box-Cox transformation, $h(Y_{ijklm}, \lambda)$, of the outcome variable with an optimal power coefficient $\lambda$ [17][18][19]. Note that the log-normal becomes a special case when the power coefficient $\lambda = 0$. This normality transformation is given by

$$h(Y_{ijklm}, \lambda) = \begin{cases} \dfrac{Y_{ijklm}^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log Y_{ijklm}, & \lambda = 0. \end{cases}$$

A profile log-likelihood of $\lambda$ given the observations $y_{ijklm}$, $\mathrm{llik}(\lambda)$, would be maximized to estimate an optimal Box-Cox transformation (equivalently, its negative would be minimized by a nonlinear minimization routine). The log-likelihood is defined up to a constant $c$ that is free of the power coefficient to be optimized.
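To make the Box-Cox step concrete, the sketch below evaluates the transformation and picks the power coefficient by a grid search over a profile log-likelihood. It assumes a simplified setting of independent, identically distributed observations with a common mean, which is not the mixed-effects setting of the study; the textbook profile log-likelihood used here, the grid, and the function names are all assumptions.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value; the log is the lam = 0 case."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_loglik(values, lam):
    """Profile log-likelihood of lam for an iid normal model, up to a constant.

    llik(lam) = -(n / 2) * log(var_hat(lam)) + (lam - 1) * sum(log y),
    where var_hat is the maximum-likelihood variance of the transformed data.
    """
    n = len(values)
    h = [box_cox(y, lam) for y in values]
    mean_h = sum(h) / n
    var_h = sum((v - mean_h) ** 2 for v in h) / n
    return -0.5 * n * math.log(var_h) + (lam - 1.0) * sum(math.log(y) for y in values)


def best_lambda(values, grid=None):
    """Return the grid value of lam that maximizes the profile log-likelihood."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]  # -2.00 to 2.00 in steps of 0.01
    return max(grid, key=lambda lam: profile_loglik(values, lam))


# Made-up positive values whose logarithms are symmetric about zero.
sample = [math.exp(z) for z in (-2.0, -1.0, 0.0, 1.0, 2.0)]
print(best_lambda(sample))
```

A grid search is used only to keep the example dependency-free; a numerical optimizer applied to the negative log-likelihood, as described in the text, would serve the same purpose.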
Due to the limited number of subjects, however, even with an optimal normality transformation, over-fitting and non-convergence might be issues. Alternatively, we could regard all of the observers, time points, and repetitions as fixed and specify a mixed-effects model. The significance of each source of variability was tested via a restricted maximum likelihood (REML) approach. For our multivariate analysis, the significance threshold for two-tailed P-values was set at P ≤ .05.

2.4.5. Interobserver Reliability Using the ICCs. Stratified by the time points within each ROI, a two-way ANOVA was performed by regarding all of the observers, time points, and repetitions as fixed. We specified a mixed-effects model for simplicity. Due to the complexity of the variance components, we instead adopted a hybrid approach by considering two effects at once. For example, all subjects were segmented by the same observers, who were regarded as the entire population of observers. In other words, the subject effect was always assumed to be random, while the remaining effect (e.g., here the observer) was assumed to be fixed. We computed the Case-3 ICCs accordingly [20].
We simplified our notation by keeping only the indices for the subject and observer effects of interest. We decomposed the data as follows:

$$Y_{ij} = \mu + S_i + o_j + (S \times o)_{ij} + \varepsilon_{ij},$$

where the subject effect $S_i$, written in upper case, was assumed to be random, with a normal distribution with mean 0 and variance $\sigma_S^2$, for all $i = 1, \ldots, I$ (here $I = 9$); the observer effect $o_j$, written in lower case, was considered to be fixed, with the constraint $\sum_{j=1}^{J} o_j = 0$ and the corresponding variance-like parameter $\theta_o^2 = \frac{1}{J-1} \sum_{j=1}^{J} o_j^2$, for all $j = 1, \ldots, J$ (here $J = 2$); the interaction term between the subject and the observer, $(S \times o)_{ij}$, was the degree to which the $j$th observer departed from his or her usual rating tendencies for the $i$th subject, with a normal distribution with mean 0 and variance $\sigma_{S \times o}^2$; and the error terms $\varepsilon_{ij}$ were assumed to be independent and identically distributed (iid) normal with mean 0 and variance $\sigma_E^2$. For the same $i$th subject, the interaction effects were further assumed to satisfy the constraint $\sum_{j=1}^{J} (S \times o)_{ij} = 0$ over all of the observers. The corresponding two-way ANOVA table is given in Table 3.
Shrout and Fleiss defined the ICC as the ratio of the subject variance to the total variance; its estimate (12) was obtained from the ANOVA quantities (Table 3) [19].

2.4.6. Intraobserver Reliability Using the ICCs. Similar to the analysis described above, we adopted a hybrid approach by considering two effects at once, with the subject effect always assumed to be random and the time point assumed to be fixed. The associated model was given by

$$Y_{ik} = \mu + S_i + t_k + (S \times t)_{ik} + \varepsilon_{ik},$$

where the interaction term between the subject and the time point, $(S \times t)_{ik}$, had a normal distribution with a mean of 0 and variance $\sigma_{S \times t}^2$. As in (12), the intraobserver agreement and its estimate were obtained from the corresponding ANOVA quantities.
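For readers who want to see how a Case-3 ICC can be estimated from a two-way layout, the following sketch computes the ANOVA mean squares for a subjects-by-observers matrix and the commonly used estimator ICC(3,1) = (BMS - EMS) / (BMS + (k - 1) EMS). The abbreviations BMS, JMS, and EMS, the function names, and the rating matrix are assumptions for illustration; they are not taken from Table 3 or from the study data. With one value per cell, the interaction and error cannot be separated, so EMS is the residual mean square.

```python
def two_way_mean_squares(ratings):
    """Mean squares for ratings[i][j] = value for subject i and observer j.

    Returns (BMS, JMS, EMS): between-subject, between-observer, and residual
    mean squares of a two-way layout with one value per cell.
    """
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of observers (or time points)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_observers = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_subjects - ss_observers
    bms = ss_subjects / (n - 1)
    jms = ss_observers / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return bms, jms, ems


def icc_3_1(ratings):
    """Case-3 single-rating ICC: (BMS - EMS) / (BMS + (k - 1) * EMS)."""
    bms, _, ems = two_way_mean_squares(ratings)
    k = len(ratings[0])
    return (bms - ems) / (bms + (k - 1) * ems)


# Small illustrative rating matrix: 6 subjects (rows) by 4 raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_3_1(example), 2))
```

The same function applies to the intraobserver case by placing the two time points, rather than the two observers, in the columns.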

Sensitivity Analyses of the ICCs under Various Models.
We performed a sensitivity analysis by computing 6 different ICC values. Shrout and Fleiss previously proposed assumptions for ICCs (

Descriptive Statistics.
Eleven healthy adults provided written informed consent to be evaluated, and 9 underwent brain scans. The mean age of the participants who were scanned was 37.9 ± 14.2 years; 7 participants were men and 2 were women. The mean ROI values varied across regions (Table 5). The left and right hemispheres tended to yield similar results when averaged over these healthy subjects.

Concordance Using Spearman's Rank Correlation Coefficients.
Spearman's rank correlation coefficients showed that a majority of the correlations within each observer were above 0.5, suggesting moderate to high concordance (Figure 3). Time point 2 tended to yield higher concordance between the observers, which suggested a possible learning effect over time (Figure 4). Because of the limited sample size in this pilot study, Figures 3 and 4 show the effect of observers after averaging over the repetitions by each observer and, similarly, the effect of time points after averaging over the repetitions at each time point.

Reproducibility Using Coefficients of Variation.
Overall, CVs ranged from 1.2% in the genu for Observer 2 to 7.0% in the right hippocampus for Observer 1 (Table 6). Because all of the CVs were at most 7.0%, and thus below 10%, the reproducibility was reasonably high.

Normality and Significance Tests via a Multivariate Analysis. Marginal tests of the normality assumption using the Shapiro-Wilk test indicated that this assumption was violated only occasionally (e.g., for the left caudate, left and right putamen, and right hippocampus; see Table 7). Therefore, it was reasonable to specify the linear mixed-effects models and two-way ANOVAs reported in Sections 3.5 and 3.6.

Interobserver Reliability Using the ICCs.
At time point 1, ICCs were greater than 0.7 in the genu and the left and right putamen, whereas ICCs ranged from 0.5 to 0.7 in the splenium, left and right hippocampus, left caudate, and right cerebral white matter (Table 8). These results indicated moderate to strong interobserver reliability. In comparison, at time point 2, ICCs were greater than 0.7 in the genu, splenium, left and right caudate, putamen, and cerebral white matter, and left hippocampus and thalamus, while ICCs ranged from 0.5 to 0.7 in the right hippocampus and thalamus. These results suggested a learning effect over time. However, for some ROIs, such as the left cerebral white matter, right caudate, and right thalamus, ICCs increased from 0.2 (at time point 1) to 0.9 (at time point 2), making it difficult to determine whether this represents a learning effect.

ICC forms and their assumptions:
ICC(2,1): All subjects are rated by the same observers, who are assumed to be a random subset of all possible observers.
ICC(3,1): All subjects are rated by the same observers, who are assumed to be the entire population of observers.
ICC(1,2): Same assumptions as ICC(1,1), but reliability for the mean of 2 ratings.
ICC(2,2): Same assumptions as ICC(2,1), but reliability for the mean of 2 ratings.
ICC(3,2): Same assumptions as ICC(3,1), but reliability for the mean of 2 ratings. Assumes additionally that there is no subject × observer interaction.
3.6. Intraobserver Reliability Using the ICCs. At each time point, intraobserver agreement was at least 0.5 for a majority of the regions (Table 9).

Sensitivity Analyses of the ICCs under Various Models.
Six different methods for generating ICCs exhibited similar patterns for high vs. low reliability results in different ROIs (Table 10). Thus, reliability appeared to be sensitive to ROI.
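The six ICC forms compared in such a sensitivity analysis can be written in terms of ANOVA mean squares, following the estimators commonly attributed to Shrout and Fleiss. The sketch below is illustrative only: the function name, the dictionary keys, and the mean-square values are assumptions and are not taken from Table 10. Here BMS is the between-subject mean square, WMS the within-subject mean square, JMS the between-observer mean square, EMS the residual mean square, n the number of subjects, and k the number of ratings per subject.

```python
def shrout_fleiss_iccs(bms, wms, jms, ems, n, k):
    """Return the six ICC estimates for n subjects and k ratings per subject."""
    return {
        "ICC(1,1)": (bms - wms) / (bms + (k - 1) * wms),
        "ICC(2,1)": (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),
        "ICC(3,1)": (bms - ems) / (bms + (k - 1) * ems),
        "ICC(1,k)": (bms - wms) / bms,
        "ICC(2,k)": (bms - ems) / (bms + (jms - ems) / n),
        "ICC(3,k)": (bms - ems) / bms,
    }


# Illustrative mean squares for 6 subjects and 4 raters (not study data).
estimates = shrout_fleiss_iccs(bms=11.24, wms=6.26, jms=32.49, ems=1.02, n=6, k=4)
for name, value in estimates.items():
    print(name, round(value, 2))
```

In the study's setting, k = 2, so the averaged forms correspond to ICC(1,2), ICC(2,2), and ICC(3,2). Comparing the six values for each ROI shows whether the ranking of high- and low-reliability regions depends on the assumed model.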

Conclusions and Discussion
We present statistical methods for evaluating high resolution MT brain images acquired at 3T. Our image analysis may provide useful pilot information for future investigations. These mathematical and statistical methods may easily be generalized to practical studies with larger sample sizes or to studies of patients with active disease. We acquired repeat brain measurements based on a high resolution MT imaging protocol at 3T in 9 healthy adults. Our results indicate moderate to high reproducibility, supporting the validity of this method for further studies. Overall, higher intraobserver reliability was observed at the second time point than at the initial time point, suggesting a possible learning curve effect for both observers. Interobserver reliability was generally lower than intraobserver reliability, suggesting a strong observer effect in this comparison, which may be a factor in future investigations using MT imaging.
Our analyses examined different aspects of a typical observer-agreement study, using measures for concordance, reproducibility, reliability, variance-component analysis, and multivariate analysis. In other studies, all or some of these methods may be considered. A simpler study, with either several observers or one observer with several repetitions at different sessions or time points, may require only some of our methods. Only a small sample of healthy volunteers was evaluated in this initial pilot study. Therefore, the generalizability of the 95-percentile normality range may be limited with respect to the wider spectrum of brain mechanisms represented in the broader population. For instance, demonstrating summary measures using all possible observer and time point combinations may not lead to meaningful interpretations in all cases. Nevertheless, since the technology is new, this research may provide useful pilot information for future investigations. Moreover, the statistical methods employed and illustrated here may easily be generalized to studies with larger sample sizes and diseased subjects.
Another limitation was that this study aimed to evaluate only the reproducibility and reliability, rather than the accuracy in a more comprehensive validation study. In the absence of a true gold standard, such as one based on digital phantoms where realistic variability may still not be simulated, or on histopathology, improved reliability may not be equated with improved accuracy [21]. Both sensitivity and specificity are of interest. Further research would benefit from a useful algorithm to perhaps statistically and optimally estimate the underlying spatial "ground truth" [22,23]. Finally, future research may be directed to evaluating the diagnostic utility of high resolution MT for early detection of Alzheimer's disease, multiple sclerosis or other neurological disorders and for monitoring progression across the clinical course.