Radial Basis Function-Sparse Partial Least Squares for Application to Brain Imaging Data

Magnetic resonance imaging (MRI) data is an invaluable tool in brain morphology research. Here, we propose a novel statistical method for investigating the relationship between clinical characteristics and brain morphology based on three-dimensional MRI data via radial basis function-sparse partial least squares (RBF-sPLS). Our data consisted of MRI image intensities for multimillion voxels in a 3D array along with 73 clinical variables. This dataset represents a suitable application of RBF-sPLS because of a potential correlation among voxels as well as among clinical characteristics. Additionally, this method can simultaneously select both effective brain regions and clinical characteristics based on sparse modeling. This is in contrast to existing methods, which consider prespecified brain regions because of the computational difficulties involved in processing high-dimensional data. RBF-sPLS employs dimensionality reduction in order to overcome this obstacle. We have applied RBF-sPLS to a real dataset composed of 102 chronic kidney disease patients, while a comparison study used a simulated dataset. RBF-sPLS identified two brain regions of interest from our patient data: the temporal lobe and the occipital lobe, which are associated with aging and anemia, respectively. Our simulation study suggested that such brain regions are extracted with excellent accuracy using our method.


Introduction
Recently, brain morphometry research has gained considerable attention for its proposed utility in the early detection of dementia and assessment of regional cerebral atrophy. Furthermore, several authors have reported an association between brain morphology and clinical characteristics such as age, chronic disease, and genetics [1][2][3] using magnetic resonance imaging (MRI) data. Voxel-based morphometry (VBM) is a commonly used technique for such analyses [4]. This method is based on general linear models with the values of each MRI voxel (in units of pixels, preprocessed for standardization) as a dependent variable and clinical characteristics (including the group indicator variable and covariates) as explanatory variables. However, this approach has some drawbacks which have been discussed by Davatzikos [5]. For example, a multiple-comparison correction requires several assumptions that are difficult to verify. An alternative to this approach is to use prespecified assemblies of voxels based on anatomical knowledge, which is known as a region of interest (ROI) approach. Therefore, the ROI approach requires the investigator to have precise and accurate knowledge of true anatomical boundaries. Moreover, variables need to be selected carefully in order to minimize the influence of irrelevant variables in the statistical model. We have taken a data-mining approach using an entire brain region and have used voxel intensity levels for the dependent variables and clinical characteristics (including patient background and blood test results) as explanatory variables.
There are two important statistical problems in the regression model that concern our use of large, complex data. The first is the selection of a set of relevant variables among a large number of both dependent and explanatory variables that are highly correlated. The partial least squares (PLS) regression, which was introduced by Wold [6], is a latent factor approach that is suitable for data with correlated variables. It has been used as an alternative approach to ordinary least squares (OLS) regression in ill-conditioned linear regression models that arise in several disciplines such as chemistry, economics, and medicine [7,8]. Tibshirani has used PLS in neuroimaging [9]. The second problem is a problem in variable selection that often arises when the sample size is much smaller than total number of variables ( ; the so-called "large small problem") for both dependent and explanatory variables. Utilizing the sparsity principle with 1 -penalty has been promoted as an effective solution [9,10]. This version of sparse PLS (sPLS) combines the 1 -penalty and has been proposed by Lê Cao et al. [11] and Chun and Keleş [12]. The number of applications for this approach is steadily increasing in not only neuroimaging fields but also bioinformatics and chemometrics. This technique produces sparse, linear combinations of the explanatory variables and achieves both dimension reduction and variable selection simultaneously. The pioneering application of this method to brain imaging data has been used to investigate genetic polymorphisms and functional imaging data [3]. However, it is based on PLS regression in its symmetric (also called canonical) mode. In this paper, we consider the PLS in its regression mode based on the singular value decomposition (PLS-SVD). The difference lies in the fact that factors are orthogonal in the canonical mode, contrary to PLS-SVD, in which the loadings are orthogonal. The main concern in this approach is the restriction of analysis to prespecified brain regions. Using brain regions that have not been specified a priori would be more data-driven approach that may yield new and unexpected results, but such approaches typically introduce computational difficulties because of the large number of voxels to be analyzed. For this reason, we decided to combine this approach with a first step of dimension reduction on brain images using basis expansion.
In this paper, we propose a sparse PLS approach with basis expansion (RBF-sPLS; radial basis function-sparse partial least squares) and provide an application for real data using three-dimensional MRI brain scans with about a million voxels and 73 clinical characteristics from chronic kidney disease (CKD) patients. In addition, we conducted a simulation study to compare our proposed method with the original method. Our proposed, RBF-sPLS, prediction model with dimension reduction devices offers discriminant functions with excellent prediction performance in terms of sensitivity and specificity. This paper is organized as follows. Section 2 provided a discussion of three-dimensional MRI data and their preprocessing. Section 3 states the proposed statistical methods. In Section 4, we report a simulation study for the characteristics of sPLS with basis expansion (RBF-sPLS) or without it (sPLS).

Subjects.
Between 2009 and 2012, we recruited 102 patients (mean age: 61 ± 11 years, 52% male, 48% female) with chronic kidney disease (CKD) to participate in our study. We examined brain volume using MRI scanning, and clinical parameters were measured on the same day. Patients were eligible if they were between 20 and 80 years old and had no prior history of brain injury such as stroke, traumatic brain injury, or brain tumor. The participant characteristics are shown in Table 1. Fifty-five percent of participants had a history of smoking (47 former and 9 current smokers). Blood pressure in the brachial artery was measured with the subjects in a sitting position after a 10 min rest. All patients provided informed consent. Kyushu University Institutional Review Board approved all procedures.

Image Data.
Brain MRI was acquired from each subject using a 3.0 tesla MRI scanner of the same model. No major hardware upgrades occurred during the period. All subjects were scanned with identical pulse sequences: 124 contiguous, 3.0 mm thick axial planes of three-dimensional T1-weighted images (spoiled gradient recalled acquisition in steady state: echo time, 7 ms; flip angle, 30; voxel size, 1.02 × 1.02 × 1.5 mm).
We used the Statistical Parametric Mapping 8 software (SPM8, Wellcome Department of Cognitive Neurology, London, UK) to preprocess brain images. The segmentation algorithm from SPM8 was applied to every T1-weighted MRI scan to extract tissue maps corresponding to gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). The temporary common space of rigidly registered tissues is necessary as a starting point for the DARTEL algorithm. Next, the segmented tissues maps were used to create a custom template and associated warping fields were generated using the DARTEL template creation tool [4]. This tool estimates a best set of smooth, nonlinear deformations from each subject's tissues to their common average, applies the deformations to create a new average, and then reiterates until convergence.

Sparse Partial Least Squares.
Let Y denote an × dependent variable matrix and let X denote an × explanatory variable matrix. The core assumption of PLS regression is a latent decomposition of Y and X as follows: where T is an × score matrix, is the number of components, P and Q are × and × loading matrices, and E and F are × and × matrices of random errors.
The version of sparse PLS (sPLS) regression proposed by Lê Cao et al. [11] invokes singular value decomposition Among variations of PLS regression, this is called PLS-SVD. From these, we can obtain the regression form For ease of explanation for estimation, suppose that = 1, then the objective function with a 1 penalization on u and v, which are column vectors of U and V, respectively, is given as follows: where and are 1 penalization parameters for the weight vectors of matrices X and Y, respectively. This function is a minimized subject to ‖u‖ 2 = ‖v‖ 2 = 1. The amplitudes of and correspond to the increases and decreases of the number of X and Y variables, which contribute to the regression. For example, in the case of X, if the value of is large, then a large number of variables X would be selected. The same is the case for Y. Therefore, the sPLS concerns selection and modeling in a one-step procedure. This optimization problem is performed by the soft-thresholding function ( ) = sign( )(| | − ) + , where ( ) + = max(0, ) at each iteration of the NIPALS inner loop. Weight vectors u and v are computed using the following algorithm.
(1) Initialize u and v using, for instance, the first pair of singular vectors of the matrix X Y and normalize u ← u/‖u‖ 2 and v ← v/‖v‖ 2 .
(3) t = Xu, p = X t/t t, and q = Y t/t t, where t, p, and q correspond to column vector of T, P, and Q, respectively.
For the general case of > 1, the above algorithm is repeated for times with the deflation step X ← X − tp and Y ← Y − tq as the fourth step. The final solution can be obtained as T = (t 1 , t 2 , . . . , t ), Q = (q 1 , q 2 , . . . , q ), and P = (p 1 , p 2 , . . . , p ), where the elements are obtained at each step among steps.

Choice of Tuning Parameters.
The choice of penalization parameters , and number of components is important in model construction. We use a criteria called 2 proposed by Tenenhaus [14], which were used to select the number of components in the sPLS model in Lê Cao et al. [11] by performing cross-validation. We used 10-fold cross validation. Thus, our 2 has a functional form of , , and and is defined as where PRESS is the prediction error sum of squares and RSS is the residual sum of squares for the th-dependent variable and the PLS model with components defined as follows. Let : {1, 2, . . . , } → {1, 2, . . . , 10} be an indexing function that indicates the partition to which observation is allocated to ( )th part of the data by the randomization: (− (− )) ( , , ) is the predicted value for the thdependent variable from the sPLS model with penalization parameters and and number of components and estimated weight vectors from ( )th part of the data removed. That is, for any subject, we predict that̂( − ( )) ( , , ) = x̂( − ( )) ( , , ), wherê( − ( )) ( , , ) is the th column of estimated regression coefficient matrixB from the sPLS model with penalization parameters and and number of components and ( )th part of the data removed.
( ) ( , , ) is the predicted value with the same definition aŝ( − ( )) ( , , ) except for the estimated weight vector from all available subjects. We select the optimal set ( , , ) based on the maximization of 2 ( , , ) among given candidates. This is implemented by the grid search.

Simulation Studies
In this section, we will illustrate the proposed methods in a simulation study. We demonstrate the impact of knot distance in affecting the representation of the results and clarify the advantage of dimension reduction by RBF by comparison to the method without basis expansion.

Data Sets.
Consider patients and explanatory variables. We generated 100 datasets according to the following sPLS model with two components where MVN(0, Σ) denoted -dimensional multidimensional normal distribution with zero mean and variance covariance matrix Σ. P = (P 1 , P 2 ) is the × 2 matrix with P 1 = b ⨂ (1 /20 , 0 3 /20 ) , P 2 = b ⨂(0 /20 , 1 /20 , 0 /10 ) , and b = (5, 2, 1, −2, −5) , where ⨂ is the Kronecker product, 0 is a -dimensional vector with all elements 0, and 1 is adimensional vector with all elements 1. Q is the × 2 matrix whose columns are vectorized true images displayed in We performed this step in order to assess how much the performance of sPLS is influenced by the basis expansion and by the number of clinical parameters kept by the filter and to select the best pair of parameters. We provided a comparison with the original method (sPLS without the basis expansion) and also analyzed the impact of the distance between adjacent knots in our method for ℎ = 2, 4, and 8. We tested our pattern of data set; = 50/ = 40, = 50/ = 80, = 100/ = 40, and = 100/ = 80 to replicate the sample size of the CKD patients data set and the number of covariates . The images Y's were unfolded to obtain vectors of size = 100 × 100 = 10, 000.

Results.
We estimated P and Q from simulated data by the method described in Section 3. All results yielded the correct number of components. We computed the probability images by averaging up the estimated Q's from 100 datasets. The middle and bottom panels of Figure 2 display binary images converted from probability images with threshold 0.95 for the first and second components, respectively, in the case of = 50/ = 40. The top of Figure 2 shows the combined true image. The result for sPLS without the basis expansion showed nothing at all because the maximum probability calculated was 0.7. On the other hand, the sPLS with the basis expansion with distance between knots ℎ = 2 had a good shape, while for ℎ = 4 and 8, the true image could not be reconstructed.
To assess how effectively the estimated model predicts each variable, sensitivity, specificity, and c-index = sensitivity − (1 − specificity) were computed and averaged over 100 sets. As shown in Table 2, the mean values of the c-index for the proposed method with ℎ = 2 were relatively smaller than those for the method without the basis expansion and

Real Data Application
We applied sPLS with basis expansion to our MRI dataset of the CKD patients described in Section 2. We assessed additional demographic and health-related variables, as well as laboratory data obtained on the same day. These data were used as covariates in our statistical analyses. The number of covariates is = 73. Among the 2,122,945 (121 × 145 × 121) voxels for one subject, the voxels that represent brain regions are extracted, resulting in 839,089 voxels. The dimension of the basis function is = 13, 047 because of the 4-voxel (ℎ 0 = 4; therefore, ℎ = √ 3 × 4 2 = 6.93) equal spacing knots. The number of components was selected as = 2. The number of selected variables in the first component of X was 17, and 14 variables were in the second component. For Y, 785 and 947 variables were selected in the same manner. Figure 3 shows the results by the axial view of brain. The left image shows the coefficient image estimated as the first component. Similarly, the right one shows the second component. Our model revealed a relatively strong association between the bilateral temporal lobes and clinical markers of chronic kidney disease. The temporal lobes are one of the four main regions of the cerebral cortex. Structures of the limbic system, including the olfactory cortex, amygdala, and the hippocampus are located within the temporal lobes. The temporal lobes play an important role in organizing sensory input, auditory perception, language and speech production, and memory association and formation. These regions linked the 17 factors, in particular, age, sex, underlying disease (diabetes mellitus), smoking status, weight, serum albumin level, serum creatinine, total cholesterol, glucose, HDLcholesterol, LDL-cholesterol, glycoalbumin, cholinesterase, number of red blood cell, whole parathyroid hormone, pulse wave velocity, and coronary artery calcification score.
The occipital lobes were selected by our analysis as the second component. The occipital lobes are positioned at the back region of the cerebral cortex and are the main centers for visual processing, involved in several functions of the body including visual perception and color recognition. This region linked the following factors: sex, body height, body weight, diastolic blood pressure, ratio of toe to brachial systolic blood pressure, total bilirubin, glucose, chloride, serum iron levels, number of red blood cells, hemoglobin, hematocrit, plasminogen activator inhibitor-1, and transferrin saturation.
The variables selected as the first component are considered to be the factors most closely related to aging and arterial stiffness, while those associated with the second region are more closely related to markers of anemia. The extent of atherosclerosis, calcification, and renal anemia are important complications in CKD patients. Recently, these factors have been suggested to be involved in brain atrophy and depressed cerebral oxygen metabolism [15,16], but its mechanism remains to be elucidated. We also found a significant correlation between regional gray matter volume and hemoglobin level after adjusting for age, gender, residual renal function, underlying kidney disease, history of smoking, diastolic blood pressure, and LDL cholesterol level using multiple-linear regression methods [17]. In this analysis, we used only the whole gray matter volume as an objective variable, because multiple variables cannot be applied to conventional linear regression models, whereas the sPLS could select variables and modeling in a one-step procedure and use many objective variables.

Discussion
This paper describes that the radial basis function-sparse partial least squares (RBF-sPLS) technique was proposed Computational and Mathematical Methods in Medicine 7 and was applied to high dimensional brain imaging data. The original sPLS is a useful regression model to analyze data in which both dependent and explanatory variables are multivariate and correlated with one another. The most difficult problem in analyzing real brain data is the high dimensionality of these datasets. While prespecified regions were used in previous neuroimaging analyses, our method successfully handled a whole brain region following the basis expansion. The basis function has a spherical shape, but it was able to approximate the cross shape used in the simulation study. This would be expected because of the narrow spanned knots location. Thus, we set as close knots each other as possible in the real-data application, using 4-voxel equal spacing knots because computation using 2-voxel spacing was not possible. This method may be applicable to not only real brain data, but also general imaging datasets, because actual lesions would cause aggregates in adjacent voxels. Although the relative advantage of our proposed method was shown through the comparison between simulations run with and without the basis function conducted in the fair setting, further simulation studies with more realistic constraints are necessary. However, these simulations lie beyond the scope of the present paper and will be dealt with in the future. The significance of this study is to clarify the characteristics of RBF-sPLS presented visually for the analysis of imaging data.
We obtained clinically relevant findings about the relationship between aging, anemia, and brain morphology from the real-data application in our study. We are currently in the process of collecting longitudinal data and normal controls to expand this confirmatory evidence for future work. In summary, RBF-sPLS can help revealing the relationships between complex, large datasets, including brain imaging data.