Linearized and Kernelized Sparse Multitask Learning for Predicting Cognitive Outcomes in Alzheimer's Disease

Alzheimer's disease (AD) imposes not only a substantial financial burden on the health care system but also an emotional burden on patients and their families. Predicting the cognitive performance of subjects from their magnetic resonance imaging (MRI) measures and identifying relevant imaging biomarkers are important research topics in the study of Alzheimer's disease. Recently, multitask learning (MTL) methods with sparsity-inducing norms (e.g., the ℓ2,1-norm) have been widely studied to select a discriminative feature subset from MRI features by incorporating inherent correlations among multiple clinical cognitive measures. However, these previous works formulate the prediction tasks as a linear regression problem and therefore assume a linear relationship between the MRI features and the cognitive outcomes. Some multikernel-based MTL methods have been proposed and have shown better generalization ability owing to their nonlinear modeling capacity. We quantify the power of existing linear and nonlinear MTL methods by evaluating their performance on cognitive score prediction of Alzheimer's disease. Moreover, we extend the traditional ℓ2,1-norm to a more general ℓqℓ1-norm (q ≥ 1). Experiments on the Alzheimer's Disease Neuroimaging Initiative database showed that the nonlinear ℓ2,1ℓq-MKMTL method not only achieved better prediction performance than the state-of-the-art competitive methods but also effectively fused the multimodality data.


Introduction
Alzheimer's disease (AD) is a severe neurodegenerative disorder that results in a loss of mental function due to the deterioration of brain tissue, leading directly to death [1]. It accounts for 60-70% of age-related dementia; it affected an estimated 30 million individuals in 2011, and the number is projected to be over 114 million by 2050 [2]. The cause of AD is poorly understood, and there is currently no cure. AD has a long preclinical phase, lasting a decade or more. There is increasing research emphasis on detecting AD in the preclinical phase, before the onset of the irreversible neuron loss that characterizes the dementia phase of the disease, since therapies are most likely to be effective in this early phase. The Alzheimer's Disease Neuroimaging Initiative (ADNI, http://adni.loni.usc.edu/) has been facilitating the scientific evaluation of neuroimaging data, including magnetic resonance imaging (MRI) and positron emission tomography (PET), along with other biomarkers and clinical and neuropsychological assessments, for predicting the onset and progression of MCI (mild cognitive impairment) and AD. Early diagnosis of AD is key to the development, assessment, and monitoring of new treatments for AD.
Recently, rather than predicting categorical variables as in classification, various studies have started to estimate continuous clinical variables from brain images. Instead of classifying a subject into binary or multiple predetermined categories or stages of the disease, regression focuses on estimating continuous values, which may help to assess a patient's disease progression. The most commonly used cognitive measures are the Alzheimer's Disease Assessment Scale (ADAS) cognitive total score, the Mini Mental State Exam (MMSE) score, and the Rey Auditory Verbal Learning Test (RAVLT). Regression analyses are commonly used to predict cognitive scores from imaging measures. The relationship between commonly used cognitive measures and structural changes in MRI has been previously studied with regression models, and the results demonstrated that there exists a relationship between baseline MRI features and cognitive measures [3,4]. For example, Wan et al. proposed an elegant regression model called CORNLIN that employs a sparse Bayesian learning algorithm to predict multiple cognitive scores based on 98 structural MRI regions of interest (ROIs) for Alzheimer's disease patients. The polynomial model used in CORNLIN can detect either a nonlinear or a linear relationship between brain structure and cognitive decline [3]. Stonnington et al. adopted relevance vector regression, a sparse kernel method formulated in a Bayesian framework, to predict four sets of cognitive scores using MRI voxel-based morphometry measures [4]. One of the biggest challenges in predicting cognitive outcomes from MRI is the high dimensionality, which affects the computational performance and can lead to wrong estimation and identification of the relevant predictors.
To reduce the high dimensionality and identify the relevant biomarkers, sparse methods have attracted a great deal of research effort in the neuroimaging field due to their sparsity-inducing property. Ye et al. applied sparse logistic regression with stability selection to ADNI data for robust feature selection [5]; they successfully predicted the conversion from MCI to probable AD and identified a small subset of biosignatures.
It is known that there exist inherent correlations among multiple clinical cognitive variables of a subject. However, many works do not model the dependence between multiple tasks and neglect the correlation between clinical tasks, which is potentially useful. When the tasks are believed to be related, learning multiple related tasks jointly can improve the performance relative to learning each task separately. Multitask learning (MTL) is a statistical learning framework which aims at learning several models in a joint manner. It has been commonly used to obtain better generalization performance than learning each task individually [6,7]. The critical issues in MTL are to identify how the tasks are related and to build learning models that capture such task relatedness. The most recent studies [6,8,9] employed multitask learning with ℓ2,1-norm [7] regularization and aimed to select features that could predict all or most clinical scores. Because the ℓ2,1-norm is chosen as the regularizer, the ℓ2,1-norm regularized regression model is able to select some common features across all the tasks. However, in these learning methods, each task is traditionally formulated as a linear regression problem, in which the cognitive score is a linear function of the neuroimaging measures.
Kernel methods have been studied to model the cognitive scores as nonlinear functions of neuroimaging measures. Recently, many kernel-based classification or regression methods with faster optimization speed or stronger generalization performance have been proposed and investigated through theoretical analysis and experimental evaluation [10,11]. Multiple kernel learning (MKL) [12], which learns the optimal kernel for a given task as a weighted linear combination of predefined candidate kernels, has been introduced to handle the problem of kernel selection. The multiple kernel learning method not only learns an optimal combination of given base kernels but also provides a flexible framework to exploit the nonlinear relationship between MRI measures and cognitive scores.
Kernels have been widely used in building predictive models for classification or regression in AD; therefore, it is important to extend the existing kernel-based learning methods to the case of multitask learning. In this paper, we apply the two nonlinear multikernel-based multitask learning methods of [13] for building regression models, to exploit and investigate the nonlinear relationship between MRI measures and cognitive scores. Moreover, an ℓqℓ1-norm is used to extend the traditional ℓ2,1-norm. The goal of our work is to (1) predict subjects' cognitive scores in a number of neuropsychological assessments using their MRI measures across the entire brain, (2) compare the performance of the nonlinear methods with the linear ℓqℓ1-norm MTL and other MTL methods with different assumptions, since no previous studies have systematically and extensively examined the prediction performance of linear and nonlinear MTL methods, and (3) assess the learning capacity of the multikernel framework for fusing multimodality data.
The rest of the paper is organized as follows. In Section 2, we provide a description of the multitask learning formulation. A linearized MTL and two multikernel-based MTL methods with the ℓqℓ1-norm are provided in Section 3. In Section 4, we present the experimental results and compare the performance of linearized and kernelized MTL methods on the ADNI-1 dataset. The conclusion is drawn in Section 5.

Multitask Learning
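Consider a multitask learning (MTL) setting with $k$ tasks. Let $p$ be the number of covariates, shared across all the tasks, and $n$ be the number of samples. Let $X \in \mathbb{R}^{n \times p}$ denote the matrix of covariates, $Y \in \mathbb{R}^{n \times k}$ be the matrix of responses with each row corresponding to a sample, and $\Theta \in \mathbb{R}^{p \times k}$ denote the parameter matrix, with column $\theta_{\cdot m} \in \mathbb{R}^{p}$ corresponding to task $m$, $m = 1, \ldots, k$, and row $\theta_{h \cdot} \in \mathbb{R}^{k}$ corresponding to feature $h$, $h = 1, \ldots, p$.
The MTL formulation focuses on the following regularized loss function:

$$\min_{\Theta} \; \mathcal{L}(Y, X, \Theta) + \lambda \mathcal{R}(\Theta), \tag{1}$$

where $\mathcal{L}(\cdot)$ denotes the loss function and $\mathcal{R}(\cdot)$ is the regularizer. In the current context, we assume the loss to be the square loss; that is,

$$\mathcal{L}(Y, X, \Theta) = \|Y - X\Theta\|_F^2 = \sum_{i=1}^{n} \|y_i - x_i \Theta\|_2^2, \tag{2}$$

where $y_i \in \mathbb{R}^{1 \times k}$ and $x_i \in \mathbb{R}^{1 \times p}$ are the $i$th rows of $Y$ and $X$, respectively, corresponding to the multitask response and covariates for the $i$th sample. We note that the MTL framework can be easily extended to other loss functions. Based on some prior knowledge, we then add the penalty $\mathcal{R}(\Theta)$ to encode the relatedness among tasks.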
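The regularized square-loss objective described above can be evaluated in a few lines. The following is a minimal sketch, not the authors' code: the names `X`, `Y`, `Theta`, `lam`, and `mtl_objective` are illustrative assumptions, and the squared Frobenius norm is used only as a placeholder regularizer.

```python
import numpy as np

def mtl_objective(X, Y, Theta, lam, regularizer):
    """Square loss summed over all samples and tasks, plus lam * regularizer(Theta)."""
    residual = Y - X @ Theta          # shape (n samples, k tasks)
    loss = np.sum(residual ** 2)      # squared Frobenius norm of the residual
    return loss + lam * regularizer(Theta)

# Synthetic data: n = 20 samples, p = 5 covariates, k = 3 tasks (assumed sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
Theta_true = rng.standard_normal((5, 3))
Y = X @ Theta_true                    # noiseless responses

# Placeholder regularizer (assumption): squared Frobenius norm.
frob_sq = lambda T: np.sum(T ** 2)
value = mtl_objective(X, Y, Theta_true, lam=0.1, regularizer=frob_sq)
print(value)
```

With noiseless responses the loss term vanishes at `Theta_true`, so the printed value is just the penalty term; swapping `regularizer` for a mixed norm changes how tasks are coupled.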
Computational and Mathematical Methods in Medicine 3

ℓqℓ1-Norm Regularized Linearized Multitask Learning, ℓqℓ1-MTL
The ℓ2,1-norm is widely used in multitask feature learning [14]. All the existing algorithms for multitask feature learning assume a linear relationship between MRI features and cognitive scores and aim to learn a common subset of features for all tasks. Since the ℓ2,1-norm regularizer imposes sparsity across features and nonsparsity across tasks, the features that are discriminative for all tasks receive large weights. However, the ℓ2,1-norm is a fixed, nonadaptive penalty. To obtain an adaptive regularization that better suits different data structures, we extend the ℓ2,1-norm to a larger class of mixed norms, ℓqℓ1, that can be adapted to the data. The objective function of linear ℓqℓ1-MTL is formulated as

$$\min_{\Theta} \; \|Y - X\Theta\|_F^2 + \lambda \sum_{h=1}^{p} \|\theta_{h \cdot}\|_q. \tag{3}$$

When $q = 1$, problem (3) reduces to the ℓ1-regularized problem; when $q = 2$, problem (3) reduces to the ℓ2,1-regularized problem.
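To make the mixed norm concrete, the sketch below computes the ℓqℓ1 penalty as the sum over features (rows of the parameter matrix) of the ℓq-norm taken across tasks. The function name `lq1_norm` and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def lq1_norm(Theta, q):
    """Sum over rows (features) of the l_q norm of each row (taken across tasks)."""
    row_norms = np.linalg.norm(Theta, ord=q, axis=1)
    return float(np.sum(row_norms))

# Toy parameter matrix: 3 features (rows) x 2 tasks (columns).
Theta = np.array([[3.0, 4.0],
                  [0.0, 0.0],
                  [1.0, 0.0]])

print(lq1_norm(Theta, 2))   # l_{2,1}: 5 + 0 + 1
print(lq1_norm(Theta, 1))   # q = 1: plain l_1 norm of all entries
```

For q = 1 the penalty coincides with the entrywise ℓ1-norm, so no coupling across tasks remains; for q = 2 whole rows are shrunk together, which is what selects features shared by all tasks.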
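We solve the ℓqℓ1-regularized problem with an efficient algorithm based on the accelerated gradient method, which is applicable for all values of $q$ larger than 1.
First, construct the following model for approximating the composite function $\mathcal{M}(\cdot)$ at the point $\Theta^{(i)}$:

$$\mathcal{M}_{L,\Theta^{(i)}}(\Theta) = \left[\mathcal{L}(\Theta^{(i)}) + \langle \nabla\mathcal{L}(\Theta^{(i)}), \Theta - \Theta^{(i)} \rangle\right] + \lambda \mathcal{R}(\Theta) + \frac{L}{2}\|\Theta - \Theta^{(i)}\|_F^2, \tag{4}$$

where $L > 0$. In the model $\mathcal{M}_{L,\Theta^{(i)}}(\Theta)$, we apply the first-order Taylor expansion at the point $\Theta^{(i)}$ (including all terms in the square bracket) for the smooth loss function $\mathcal{L}(\cdot)$ and directly put the nonsmooth penalty $\mathcal{R}(\cdot)$ into the model. The regularization term $(L/2)\|\Theta - \Theta^{(i)}\|_F^2$ prevents $\Theta$ from walking far away from $\Theta^{(i)}$, and thus the model can be a good approximation to $\mathcal{M}(\Theta)$ in the neighborhood of $\Theta^{(i)}$.

The accelerated gradient method is based on two sequences $\{\Theta^{(i)}\}$ and $\{\Gamma^{(i)}\}$, in which $\{\Theta^{(i)}\}$ is the sequence of approximate solutions and $\{\Gamma^{(i)}\}$ is the sequence of search points. The search point $\Gamma^{(i)}$ is the affine combination of $\Theta^{(i-1)}$ and $\Theta^{(i)}$:

$$\Gamma^{(i)} = \Theta^{(i)} + \beta^{(i)}\left(\Theta^{(i)} - \Theta^{(i-1)}\right), \tag{5}$$

where $\beta^{(i)}$ is a properly chosen coefficient. The approximate solution $\Theta^{(i+1)}$ is computed as the minimizer of $\mathcal{M}_{L^{(i)},\Gamma^{(i)}}(\Theta)$:

$$\Theta^{(i+1)} = \arg\min_{\Theta} \mathcal{M}_{L^{(i)},\Gamma^{(i)}}(\Theta), \tag{6}$$

where $L^{(i)}$ is determined by line search, for example, the Armijo-Goldstein rule, so that $L^{(i)}$ is appropriate for $\Gamma^{(i)}$.

The key subroutine is (6), which can be computed as

$$\Theta^{(i+1)} = \arg\min_{\Theta} \frac{1}{2}\|\Theta - V\|_F^2 + \frac{\lambda}{L^{(i)}} \sum_{h=1}^{p} \|\theta_{h \cdot}\|_q, \qquad V = \Gamma^{(i)} - \frac{1}{L^{(i)}} \nabla\mathcal{L}(\Gamma^{(i)}). \tag{7}$$

Note that the features $h = 1, \ldots, p$ in (7) are independent. The method in [15] handles a general set of independent groups, that is, a penalty of the form $\sum_{g=1}^{G} \|\theta_g\|_q$, where $G$ is the number of independent groups. In our paper, we focus on how the method deals with the multitask learning problem in (7), where $G$ is equal to $p$ and each group denotes the corresponding feature shared across the multiple tasks. Thus, the optimization in (7) decouples into a set of $p$ independent ℓq-regularized Euclidean projection problems:

$$\min_{\theta_{h \cdot}} \frac{1}{2}\|\theta_{h \cdot} - v_{h \cdot}\|_2^2 + \tilde{\lambda}\|\theta_{h \cdot}\|_q, \tag{8}$$

where $v_{h \cdot}$ is the $h$th row of $V$ and $\tilde{\lambda} = \lambda / L^{(i)}$. The optimal solution $\theta^{*}_{h \cdot}$ of (8) is characterized through the dual exponent $\bar{q} = q/(q - 1)$, so that $q$ and $\bar{q}$ satisfy $1/q + 1/\bar{q} = 1$. The algorithm ℓqℓ1-MTL is summarized in Algorithm 1.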
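The search-point and proximal steps described above can be sketched for the special case q = 2, where the row-wise subproblem has the well-known closed form of group soft-thresholding. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the loss is taken as one half of the squared Frobenius residual, the line search is replaced by a fixed step from the Lipschitz constant, and the momentum coefficients follow the common FISTA schedule. All function and variable names are hypothetical.

```python
import numpy as np

def prox_l21(V, tau):
    """Row-wise group soft-thresholding: prox of tau * sum_h ||theta_h||_2."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V

def accelerated_l21_mtl(X, Y, lam, n_iter=200):
    """Accelerated proximal gradient for 0.5*||Y - X Theta||_F^2 + lam*||Theta||_{2,1}."""
    # Assumption: fixed step 1/L with L = ||X||_2^2 instead of a line search.
    L = np.linalg.norm(X, 2) ** 2
    p, k = X.shape[1], Y.shape[1]
    Theta_prev = np.zeros((p, k))
    Theta = np.zeros((p, k))
    t_prev, t = 1.0, 1.0
    for _ in range(n_iter):
        beta = (t_prev - 1.0) / t                      # momentum coefficient
        Gamma = Theta + beta * (Theta - Theta_prev)    # search point
        grad = X.T @ (X @ Gamma - Y)                   # gradient of the smooth loss
        Theta_prev = Theta
        Theta = prox_l21(Gamma - grad / L, lam / L)    # proximal (projection) step
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return Theta

# Synthetic example: only the first 2 of 6 features are relevant to the 3 tasks.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
Theta_true = np.zeros((6, 3))
Theta_true[:2] = rng.standard_normal((2, 3))
Y = X @ Theta_true + 0.01 * rng.standard_normal((60, 3))
Theta_hat = accelerated_l21_mtl(X, Y, lam=5.0)
print(np.linalg.norm(Theta_hat, axis=1))   # per-feature row norms
```

For general q > 1 the row-wise problem has no such simple formula; one standard route is the Moreau decomposition, which expresses the solution as the input minus its projection onto the ball of the dual norm with exponent q/(q − 1).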

Multikernel Learning.
The limitation of the traditional ℓ2,1-norm MTL model is that a subject's cognitive score under a task is modeled as a linear function of his/her MRI measures. Kernel methods, for example, SVM or SVR, can model the nonlinear distribution of the data by mapping the input data into a nonlinear feature space by kernel embedding. In this section, we consider the case where ℓ2,1-norm regularized MTL is extended to kernel methods. Let us define the mapping function $\phi(x): \mathbb{R}^{p} \to \mathbb{R}^{\hat{p}}$, which maps the data samples from an input space to a feature space (a high-dimensional Hilbert space $\mathcal{H}$), where $\hat{p}$ denotes the dimensionality of the feature space and $x$ is a sample from the input space. A kernel function is capable of attaining the inner product of two mapped samples in $\mathcal{H}$, $\kappa(x, x') = \phi(x) \cdot \phi(x')$, in the original space without explicitly computing the mapped data. The associated Gram matrix $K$ has entries $K(i, j) = \kappa(x_i, x_j)$.
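As an illustration of how a Gram matrix is obtained without ever forming the mapped data, the sketch below uses a Gaussian kernel. The kernel form exp(−‖x − x′‖² / (2σ²)), the function name `gaussian_gram`, and the data sizes are assumptions for this example.

```python
import numpy as np

def gaussian_gram(X, bandwidth):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared distances from inner products only.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negatives from rounding
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))            # 8 samples, 4 input features
K = gaussian_gram(X, bandwidth=1.0)
print(K.shape)
```

The resulting matrix is symmetric and positive semidefinite, which is what allows it to stand in for inner products in the feature space.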
The most suitable types and parameters of the kernels for a particular task are often unknown, and the selection of the optimal kernel by exhaustive search on a predefined pool of kernels is usually time-consuming and sometimes causes overfitting. Multiple kernel learning (MKL) attempts to achieve better results by combining several base kernels instead of using only one specific kernel. MKL assumes that $x$ can be mapped to $m$ different Hilbert spaces, $x \to \phi_j(x)$, $j = 1, \ldots, m$, implicitly with $m$ nonlinear mapping functions, and the objective of MKL is to seek the optimal kernel combination $\hat{\kappa}(x, x') = \sum_{j=1}^{m} d_j \kappa_j(x, x')$, with $d_j \ge 0$ and $\sum_{j=1}^{m} d_j = 1$, where $d$ is the kernel weight vector. In the primal objective function (10) of the multiple kernel regression model, MKL learns both the weights of the kernel combination $d$ and the parameters of the regression $\tilde{w}$ by solving a single joint optimization problem.
Using $\alpha$ to denote the Lagrange multipliers, the objective value of the dual problem of (10) is expressed through the combined Gram matrix $\hat{K} = \sum_{j=1}^{m} d_j K_j$, where $K_j$, $j = 1, \ldots, m$, are the Gram matrices of the given set of base kernels.
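The combined Gram matrix is simply a convex combination of the base Gram matrices. A minimal sketch follows; the function name `combine_kernels` and the toy matrices are hypothetical, and the kernel weights are fixed by hand here rather than learned.

```python
import numpy as np

def combine_kernels(grams, d):
    """Weighted sum of base Gram matrices with weights on the probability simplex."""
    d = np.asarray(d, dtype=float)
    if np.any(d < 0) or not np.isclose(d.sum(), 1.0):
        raise ValueError("kernel weights must be nonnegative and sum to 1")
    # Contract the weight vector against the stacked (m, n, n) array of Gram matrices.
    return np.tensordot(d, np.stack(grams), axes=1)

K1 = np.eye(3)                 # toy base kernel 1
K2 = np.ones((3, 3))           # toy base kernel 2
K_hat = combine_kernels([K1, K2], [0.25, 0.75])
print(K_hat)
```

Because the weights are nonnegative, the combination of positive semidefinite base kernels remains positive semidefinite, so it is itself a valid kernel.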

ℓqℓ1-Norm Regularized Multikernel Multitask Learning, ℓqℓ1-MKMTL.
We follow the multiple kernel learning scheme and use the ℓq,1-norm to model the relationship between the tasks, learning a common kernel representation by imposing a sparsity constraint on the kernel weights. The method, called ℓqℓ1-MKMTL, assumes that few base kernels are important for the tasks; it encourages a linear combination of only a few kernels and assumes that the few selected kernels are similar across the tasks. We rewrite the ℓqℓ1-MKMTL formulation in a convenient form which can be efficiently solved using mirror-descent based algorithms. We introduce some more notation: let $\Delta_{d,r} = \{ z \equiv [z_1, \ldots, z_d]^{\top} \mid \sum_{i=1}^{d} z_i^{r} \le 1,\ z_i \ge 0,\ i = 1, \ldots, d \}$ and, with slight abuse of notation, let $\Delta_{d,1} = \Delta_d$. Next, we note the following [16].

Lemma 1. Let $a_i \ge 0$, $i = 1, \ldots, d$, and $1 \le r < \infty$. Then, for $\Delta_{d,r}$ defined as before,

$$\min_{\eta \in \Delta_{d,r}} \sum_{i=1}^{d} \frac{a_i}{\eta_i} = \left( \sum_{i=1}^{d} a_i^{r/(r+1)} \right)^{1 + 1/r},$$

and the minimum is attained at

$$\eta_i = \frac{a_i^{1/(r+1)}}{\left( \sum_{i=1}^{d} a_i^{r/(r+1)} \right)^{1/r}},$$

with the convention that $a/0$ is 0 if $a = 0$ and is $\infty$ if $a \ne 0$.

Using the result of the lemma (with $r = 1$), we introduce variables $\lambda = [\lambda_1, \ldots, \lambda_m]^{\top}$. Introducing dual variables $\gamma_j = [\gamma_{j1}, \ldots, \gamma_{jk}]^{\top}$, $j = 1, \ldots, m$, and using the notion of dual norm [17], the objective in the ℓqℓ1-MKMTL formulation can then be written in terms of $\lambda$ and $\gamma$, where $\bar{q} = q/(q - 2)$. Using $\alpha$ to denote the Lagrange multipliers, we form the Lagrangian $L(\tilde{w}, b, \alpha)$ and, by Lagrange duality, solve the original problem through $\max_{\alpha} \min_{\tilde{w}, b} L(\tilde{w}, b, \alpha)$.
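Lemma 1 can be checked numerically. The sketch below assumes the lemma in its commonly stated form, namely that the minimum of the sum of a_i/η_i over nonnegative η with Σ η_i^r ≤ 1 equals (Σ a_i^{r/(r+1)})^{1+1/r}, attained at η_i proportional to a_i^{1/(r+1)}. The function name `lemma_minimum` and the test vector are illustrative assumptions.

```python
import numpy as np

def lemma_minimum(a, r):
    """Closed-form minimum value and minimizer assumed by Lemma 1."""
    a = np.asarray(a, dtype=float)
    s = np.sum(a ** (r / (r + 1.0)))
    value = s ** (1.0 + 1.0 / r)
    eta = a ** (1.0 / (r + 1.0)) / s ** (1.0 / r)
    return value, eta

a = np.array([1.0, 4.0, 9.0])

# Case r = 1 (the case used in the derivation): value is (sum of square roots)^2.
value, eta = lemma_minimum(a, 1.0)
print(value, eta)

# Compare against random feasible points on the simplex.
rng = np.random.default_rng(0)
random_values = [np.sum(a / z) for z in rng.dirichlet(np.ones(3), size=1000)]
print(min(random_values) >= value - 1e-9)
```

For r = 1 the minimizer lies on the probability simplex, which is why the lemma turns a squared sum of norms into a weighted sum with simplex-constrained weights that mirror descent can handle.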
To begin, we consider the inner minimization: for fixed $\alpha$, we solve for the minimizing $\tilde{w}$ and $b$ by setting the derivatives of $L$ with respect to $b$ and $\tilde{w}$ to zero. The resulting expressions involve $\alpha_t$, the vector of multipliers corresponding to the $t$th task in the ℓqℓ1-MKMTL formulation, and $\Phi_{tj}$, the data matrix with columns $\phi_j(x_{ti})$, $i = 1, \ldots, n$. We can then solve the problem by maximizing the Lagrangian with respect to $\alpha$, substituting the above expressions for $b$ and $\tilde{w}$, which yields an unconstrained maximization.
Here, $y_t$ is the vector of scores of the $t$th task training data points and $K_{tj}$ represents the Gram matrix of the $t$th task training data points with respect to the $j$th kernel. Equation (21) is just a quadratic in $\alpha$. As such, we can find the optimum as the solution of a linear system.
Then, (17) can be rewritten and transformed accordingly. The algorithm ℓqℓ1-MKMTL is summarized in Algorithm 2.

(1) $i = 0$
(2) repeat
(3)   initiate $\lambda$ and $\gamma$
(4)   for $t = 1$ to $k$ do
(5)     with fixed $\lambda$ and $\gamma$, compute $\alpha_t^{*}$ by using an SVR solver
(6)   end for
(7)   optimize $\lambda$ with the mirror-descent algorithm
(8)   optimize $\gamma$
(9)   $i = i + 1$
(10) until convergence criterion is satisfied

The linearized ℓqℓ1-MTL assumes a linear relationship between the MRI features and the cognitive outcomes. Such a model lacks the capability to capture nonlinear predictive information from the features. Although ℓqℓ1-MKMTL builds a nonlinear relationship between the features and the tasks by mapping to a high-dimensional space, it assumes that the tasks to be learned share a common subset of kernel representations, without capturing the interrelationships between different cognitive measures over the feature space.
To overcome the weaknesses of the previous two methods, we project the original feature vectors to a high-dimensional space using multiple nonlinear mapping functions to perform the regression task in a nonlinear manner, and we utilize multitask learning in the multiple kernel spaces to model the disease's cognitive scores with a joint ℓ2,1ℓq sparsity-inducing regularizer. Moreover, we construct new features as orthogonal transforms of the given features, that is, $L_j^{\top} \phi_j(x)$, where $L_j$ is an orthogonal matrix which is to be learned. Again, low empirical risk over each task implies minimizing the following quadratic loss: $\sum_{t=1}^{k} \sum_{i=1}^{n} \left( \sum_{j=1}^{m} \tilde{w}_{tj}^{\top} L_j^{\top} \phi_j(x_{ti}) - y_{ti} \right)^2$. Before describing the regularization term, we introduce some more notation: let the entries of $\tilde{w}_{tj}$ be $\tilde{w}_{tjl}$, $l = 1, \ldots, \hat{p}_j$, where $\hat{p}_j$ is the dimensionality of the feature space induced by the $j$th kernel. By $\tilde{w}_{\cdot jl}$ we denote the vector with entries $\tilde{w}_{tjl}$, $t = 1, \ldots, k$.
Mathematically, the ℓ2,1ℓq-MKMTL formulation is posed over the weights $\tilde{w}$ and the matrices $L_j \in \mathcal{O}^{\hat{p}_j}$, where $\mathcal{O}^{d}$ represents the set of all orthogonal matrices of dimensionality $d$. In the following text, we rewrite this formulation in a form which is convenient to solve using a mirror-descent (MD) based algorithm.
Again, we substitute the expressions for $b$ and $\tilde{w}$ obtained from the optimality conditions. Denoting $L_j \Lambda_j L_j^{\top}$ by $Q_j$ and eliminating the variables $\gamma$, $\lambda$, and the $L_j$'s leads to a formulation in $Q_j$. The difficulty in working with this formulation is that the explicit mappings $\phi_j$ are required. We now describe a way of overcoming this problem and efficiently kernelizing the formulation (refer to [1] also). Let $\Phi_j \equiv [\Phi_{1j}, \ldots, \Phi_{kj}]$ and let the compact SVD of $\Phi_j$ be $U_j \Sigma_j V_j^{\top}$. Then, we introduce a symmetric positive semidefinite $\bar{Q}_j$ with the same rank as that of $\Phi_j$ such that $Q_j = U_j \bar{Q}_j U_j^{\top}$. By eliminating $Q_j$, we can rewrite the above problem using $\bar{Q}_j$, with $M_{tj} = \Sigma_j^{-1} V_j^{\top} \Phi_j^{\top} \Phi_{tj}$. Note that the calculation of $M_{tj}$ does not require the kernel-induced features explicitly, since $\Phi_j^{\top} \Phi_{tj}$ consists of kernel evaluations, and hence the formulation is kernelized. It can be further transformed using a block diagonal matrix $B$ whose blocks $B_j$ are formed from the matrices $M_{tj}$, $t = 1, \ldots, k$. $\bar{Q}_j$ can be solved by mirror descent; the gradient with respect to $\bar{Q}_j$ is evaluated using $B_j^{(i)}$, the value of $B_j$ obtained with the optimal $\alpha$ found while evaluating the objective at $\bar{Q}_j^{(i)}$.

Experimental Setup.
We use 10-fold cross validation to evaluate our model and conduct the comparison. In each of the ten trials, a 5-fold nested cross validation procedure is employed to tune the regularization parameters. Data were z-scored before applying regression methods. The range of each parameter varied from $10^{-1}$ to $10^{3}$. The candidate kernels are as follows: Gaussian kernels with six different bandwidths ($2^{-2}, 2^{-1}, \ldots, 2^{3}$), polynomial kernels of degrees 1 to 3, and a linear kernel, which yields 10 kernels in total. The kernel matrices were precomputed and normalized to have unit trace. The reported results were the best results of each method with the optimal parameter. For the quantitative performance evaluation, we employed the metrics of Correlation Coefficient (CC) and Root Mean Squared Error (rMSE) between the predicted clinical scores and the target clinical scores for each regression task. Moreover, to evaluate the overall performance on all the tasks, the normalized mean squared error (nMSE) [7,18] and weighted R-value (wR) [4] are used. The nMSE and wR are defined as follows:

$$\mathrm{nMSE}(Y, \hat{Y}) = \frac{\sum_{t=1}^{k} \|Y_t - \hat{Y}_t\|_2^2 / \sigma(Y_t)}{\sum_{t=1}^{k} n_t}, \qquad \mathrm{wR}(Y, \hat{Y}) = \frac{\sum_{t=1}^{k} \mathrm{Corr}(Y_t, \hat{Y}_t)\, n_t}{\sum_{t=1}^{k} n_t},$$

where $Y$ and $\hat{Y}$ are the ground truth cognitive scores and the predicted cognitive scores, respectively, $\sigma(Y_t)$ is the variance of the scores of task $t$, and $n_t$ is the number of samples of task $t$. A smaller value of nMSE and rMSE and a higher value of CC and wR represent better regression performance. We report the mean and standard deviation based on 10 iterations of experiments on different splits of data for all comparable experiments.
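The two overall metrics can be sketched as follows. This is an illustration under assumptions rather than the evaluation code used in the experiments: nMSE is taken as the per-task squared error divided by the variance of the true scores and averaged over all samples and tasks, wR is taken as the sample-weighted average of per-task Pearson correlations, and every task is assumed to have the same number of samples. The function names are hypothetical.

```python
import numpy as np

def nmse(Y_true, Y_pred):
    """Normalized MSE: per-task squared error over target variance, averaged over n * k."""
    n, k = Y_true.shape
    sq_err = np.sum((Y_true - Y_pred) ** 2, axis=0)   # one value per task
    var = np.var(Y_true, axis=0)                      # population variance (assumption)
    return float(np.sum(sq_err / var) / (n * k))

def weighted_r(Y_true, Y_pred):
    """Sample-weighted average of per-task Pearson correlations."""
    n, k = Y_true.shape
    corrs = np.array([np.corrcoef(Y_true[:, t], Y_pred[:, t])[0, 1] for t in range(k)])
    weights = np.full(k, n)                           # equal task sizes (assumption)
    return float(np.sum(corrs * weights) / np.sum(weights))

rng = np.random.default_rng(0)
Y_true = rng.standard_normal((30, 3))                 # 30 subjects, 3 cognitive scores
Y_noisy = Y_true + 0.1 * rng.standard_normal((30, 3))
print(nmse(Y_true, Y_noisy), weighted_r(Y_true, Y_noisy))
```

Under this definition, predicting the per-task mean for every subject gives an nMSE of exactly 1, which is a convenient baseline when reading the reported values.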
In ADNI, all participants received 1.5-Tesla (T) structural MRI. The MRI features used in our experiments are based on the imaging data from the ADNI database processed by a team from UCSF (University of California at San Francisco), who performed cortical reconstruction and volumetric segmentations with the FreeSurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/) according to the atlas generated in [19]. In total, 48 cortical regions and 44 subcortical regions are generated. For each cortical region, the cortical thickness average (TA), standard deviation of thickness (TS), surface area (SA), and cortical volume (CV) were calculated as features. For each subcortical region, the subcortical volume was calculated as a feature. The SA of the left and right hemispheres and the total intracranial volume (ICV) were also included. This yielded a total of $p = 319$ MRI features extracted from cortical/subcortical ROIs in each hemisphere (including 275 cortical and 44 subcortical features). Details of the analysis procedure are available at http://adni.loni.usc.edu/methods/mri-analysis/.
Ten widely used clinical/cognitive assessment scores [3,20,21] were employed in this study, including Alzheimer's Disease Assessment Scale (ADAS) cognitive total score, Mini Mental State Exam (MMSE) score, Rey Auditory Verbal Learning Test (RAVLT) involving total score of the first 5 learning trials (TOTAL), Trial 6 total number of words recalled (TOT6), 30-minute delay score (T30), and 30-minute delay recognition score (RECOG), FLU involving animal total score (ANIM) and vegetable total score (VEG), and TRAILS including Trail Making test A score and B score.
Experimental results are reported in Tables 1 and 2, where the best results are boldfaced. A first glance at the results shows that ℓ2,1ℓq-MKMTL generally outperforms all the other compared methods on both metrics and across all the cognitive tasks. Additionally, a statistical analysis is performed on the results. As can be seen, our proposed method achieves statistically significant improvements over all the other methods on most of the results. These results reveal several interesting points: (1) All the compared multitask learning methods (ℓqℓ1-MTL, ℓqℓ1-MKMTL, and ℓ2,1ℓq-MKMTL) improve the predictive performance over the independent regression algorithms (Ridge, Lasso, and MKL). This justifies the motivation of learning multiple tasks simultaneously.
(2) The two multikernel-based MTL methods outperform the linearized ℓqℓ1-MTL in terms of nMSE, and ℓ2,1ℓq-MKMTL outperforms the linearized ℓqℓ1-MTL in terms of wR. This indicates that the nonlinear MTL models via kernel functions can capture complex patterns between brain images and the corresponding cognitive measures.
(3) With the appropriate ℓ2,1ℓq regularization, the ℓ2,1ℓq-MKMTL model enables us (i) to capture nonlinear associations between MRI and cognitive outcomes, (ii) to obtain the intrinsic relationships between multiple related tasks in $\mathcal{H}$, and (iii) to promote sparse kernel combinations that support interpretability and scalability. The outcomes demonstrate that ℓ2,1ℓq-MKMTL outperforms ℓqℓ1-MTL, which neglects the inherently nonlinear relationship between MRI and cognitive outcomes, and ℓqℓ1-MKMTL, which neglects the correlation among multiple related tasks in the feature space.
(4) Compared with the other multitask learning methods with different assumptions, our proposed methods belong to the multitask feature learning methods with [...] ApoE genotyping (MPD). Different from the above experiments, the samples from ADNI-2 are used instead of ADNI-1, since the number of patients with PET is sufficient. From ADNI-2, we obtained all the patients with both MRI and PET, 756 samples in total. The PET imaging data are from the ADNI database processed by the UC Berkeley team, who use a native-space MRI scan for each subject that is segmented and parcellated with FreeSurfer to generate summary cortical and subcortical ROIs, coregister each florbetapir scan to the corresponding MRI, and calculate the mean florbetapir uptake within the cortical and reference regions. The procedure of image processing is described in http://adni.loni.usc.edu/updated-florbetapirav-45-pet-analysis-results/. In ℓqℓ1-MKMTL and ℓ2,1ℓq-MKMTL, the ten kernel functions described in the first experiment are used for each modality. To show the advantage of the kernel-based methods, we compare them with the linear ℓqℓ1-MTL method, which concatenates the features of the multiple modalities into one long feature vector. The prediction performance results are shown in Tables 3 and 4. From the results, it is clear that the methods with multimodality outperform the methods using one single modality of data. This validates our assumption that the complementary information among different modalities is helpful for cognitive function prediction.

Figure 1: Scatter plots of actual versus predicted values of cognitive scores on each fold testing data using three comparable MTL methods (ℓqℓ1-MTL, ℓqℓ1-MKMTL, and ℓ2,1ℓq-MKMTL) based on MRI features. Panel (d): ANIM.
Regardless of whether two or three modalities are used, ℓ2,1ℓq-MKMTL achieved better performance than the linear multitask learning method in most cases, as in the single-modality learning task above.

Conclusion
Many multitask learning methods with sparsity-inducing regularization for modeling AD cognitive outcomes have been proposed in the past decades. However, the current formulations remain restricted to linear models and cannot capture a nonlinear relationship between the MRI features and cognitive outcomes. To address these shortcomings, we applied two multikernel multitask learning methods with a joint sparsity-inducing regularization to model the more complicated but more flexible relationship between MRI features and cognitive outcomes, and we demonstrated their effectiveness compared with linearized multitask learning methods by applying them to the ADNI data for predicting cognitive outcomes from MRI scans. Extensive experiments