Group Guided Fused Laplacian Sparse Group Lasso for Modeling Alzheimer's Disease Progression

As the most common cause of dementia, Alzheimer's disease (AD) places a serious burden on patients and their families, mostly in financial, psychological, and emotional terms. To assess the progression of AD and develop new treatments for the disease, it is essential to infer the trajectories of patients' cognitive performance over time and to identify biomarkers that connect the patterns of brain atrophy with AD progression. In this article, a structured regularized regression approach termed group guided fused Laplacian sparse group Lasso (GFL-SGL) is proposed to infer disease progression by jointly considering multiple predictions of the same cognitive scores at different time points (longitudinal analysis). The proposed GFL-SGL simultaneously exploits the interrelated structures within the MRI features and among the tasks through the sparse group Lasso (SGL) norm and introduces a novel group guided fused Laplacian (GFL) regularization. This combination effectively incorporates both the relatedness among multiple longitudinal time points, through general weighted (undirected) dependency graphs, and the useful inherent group structure in the features. Furthermore, an algorithm based on the alternating direction method of multipliers (ADMM) is derived to optimize the nonsmooth objective function of the proposed approach. Experiments on the dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) show that the proposed GFL-SGL outperformed some other state-of-the-art algorithms and effectively fused the multimodality data. The compact sets of cognition-relevant imaging biomarkers identified by our approach are consistent with the results of clinical studies.


Introduction
Alzheimer's disease (AD) is a chronic neurodegenerative disease that mainly affects memory function, and its progression ultimately culminates in a state of dementia in which all cognitive functions are affected. AD is therefore a devastating disease for those who are affected and presents a major burden to caretakers and society. According to reports conducted by the Alzheimer's Disease Neuroimaging Initiative (ADNI), the worldwide prevalence of AD would be 131.5 million by the year 2050, which is nearly three times the number in 2016 (i.e., 46.8 million) [1]. Moreover, the total worldwide cost of dementia caused by AD is about 818 billion US dollars, and it will become a trillion dollar disease by 2018 [1]. Machine learning approaches have been employed in AD research. Compared with the clinical criteria, these machine learning approaches are always data-oriented. That is, they seek to infer patients' cognitive abilities and track the disease progression of AD from biomarkers in neuroimaging data such as magnetic resonance imaging (MRI) and positron emission tomography (PET).
Regression-based models, which can explore the relationship between patients' cognitive abilities and factors that may cause AD or affect disease development, have been widely applied in AD analysis. Some early studies established regression models for different cognitive scores, or for the same cognitive score over time, independently. However, researchers have found that there exist inherent correlations among different cognitive scores or the same cognitive score over time, largely because the underlying pathology is the same and there is a clear pattern in disease progression over time [7][8][9][10]. To achieve more accurate predictions, multitask learning (MTL) was introduced for AD analysis to learn all of the models jointly rather than separately [11]. Many studies have shown that MTL can obtain better generalization performance than approaches that learn each task individually [12,13]. An intuitive way to characterize the relationships among multiple tasks is to assume that all tasks are related and that their respective models are similar to each other. In [14], Zhang et al. considered regression models of different targets (such as MMSE and ADAS-Cog) as a multitask learning problem. In their method, all regression models are constrained to share a common set of features so that the relationship among different tasks can be captured. Wan et al. [15] proposed an approach called sparse Bayesian multitask learning. In this approach, the correlation structure among tasks is learnt adaptively by constraining the coefficient vectors of the regression models to be similar. In [16], the sparse group Lasso (SGL) method was also adopted to consider a two-level hierarchy with feature-level and group-level sparsity and parameter coupling across tasks.
Besides, some studies have focused on analyzing longitudinal data of AD with MTL. In this setting, each task models a given cognitive score at a given time step, and different tasks model different time steps for the same cognitive score. For AD, longitudinal data usually consist of measurements at a starting time point (t = 0), after 6 months (t = 6), after 12 months (t = 12), after 24 months (t = 24), and so on, usually up to 48 months (t = 48). Zhou et al. employed an MTL algorithm for longitudinal data analysis of AD [9]. In that work, temporal group Lasso (TGL) regularization was developed to capture the relatedness of multiple tasks. However, since TGL forces the different regression models to select the same features at all time steps, the temporal patterns and variability of the biomarkers during disease progression may be ignored. To handle this issue, an MTL algorithm based on convex fused sparse group Lasso (cFSGL) was proposed [10]. Through a sparse group Lasso penalty, cFSGL can simultaneously select a common set of biomarkers at all time steps and a specific set of biomarkers at different time steps. Meanwhile, the fused Lasso penalty in cFSGL also takes the temporal smoothness of adjacent time steps into consideration [17]. Since cFSGL is nonsmooth, the MTL problem with cFSGL regularization was solved by a variant of the accelerated gradient method.
Though TGL and cFSGL have been successfully applied to AD analysis, a major limitation is that the complex relationships among different time points and the structures within the ROIs are often ignored. Specifically, (1) the fused Lasso in TGL and cFSGL only takes into account the association between two consecutive time points and is therefore likely to miss useful task dependencies beyond immediate neighbors. In other words, if every task (time step) is viewed as a node of a graph whose edges determine the task dependencies, cFSGL uses a graph with edges between tasks t and t + 1, t = 1, . . . , T − 1, and no other edges. Assuming that the scores at two consecutive time points are close is quite reasonable [18]. In medical practice, however, this assumption does not always hold. Figure 1 shows how the real ADAS, MMSE, and RAVLT scores of several subjects from our dataset changed over the years: stable periods are interleaved with sharp falls and occasional improvements. This suggests that longitudinal clinical scores may evolve in a more intricate way than simple linear trends with local temporal relationships [19]. (2) Moreover, many MRI features are interrelated and jointly reflect the cognitive activities of the brain [20]. In our data, multiple shape measures (including volume, area, and thickness) from the same region provide a detailed quantitative assessment of cortical atrophy and tend to be selected together as joint predictors. Our earlier work proposed a framework that uses prior knowledge to guide multitask feature learning. This model is an effective approach that uses group information to enforce intragroup similarity [21].
Thus, exploring and utilizing these interrelated structures is important for finding and selecting important and structurally correlated features together. In our previous work [22], we proposed an algorithm that generalized a fused group Lasso regularization to multitask feature learning to exploit the underlying structures. This method considers a graph structure among tasks by constructing an undirected graph whose edge weights are the pairwise Pearson correlation coefficients between tasks. Meanwhile, the method jointly learns a group structure from the image features by applying group Lasso to each pair of correlated tasks. Thus, the regularization considers only the relationship between two time points in the graph.
To overcome these two limitations, a structured regularized regression approach, group guided fused Laplacian sparse group Lasso (GFL-SGL), is proposed in this paper. The proposed GFL-SGL can exploit commonalities at the feature level, brain region level, and task level simultaneously, so as to accurately identify the biomarkers relevant to the current cognitive status and to disease progression. Specifically, we design a novel mixed structured sparsity norm, called group guided fused Laplacian (GFL), to capture more general weighted (undirected) dependency graphs among the tasks and ROIs. This regularizer is based on the natural assumption that if some ROIs are important for one time point, they have similar but not identical importance for other time points. To discover such dependency structures, we employ the graph Laplacian of the task dependency matrix to uncover the relationships among time points. In our work, we consider weighted task dependency graphs based on a Gaussian kernel over the time steps, which yields a fully connected graph with decaying weights. At the same time, by considering the group structure among predictors, group information is incorporated into the regularization through the task-specific $G_{2,1}$-norm, which enforces intragroup similarity together with group sparsity. Besides, by incorporating the task-common $G_{2,1}$-norm and Lasso penalties into the GFL model, we can better understand the underlying associations among the prediction tasks of the cognitive measures, allowing more stable identification of cognition-relevant imaging markers. The task-common $G_{2,1}$-norm combines multitask and sparse group learning and learns shared subsets of ROIs for all the tasks. This method has been demonstrated to be an effective approach in our previous study [23]. The Lasso penalty maintains sparsity among features.
The resulting formulation is challenging to solve due to the use of nonsmooth penalties, including the GFL, $G_{2,1}$-norm, and Lasso. In this work, we propose an effective ADMM algorithm to tackle the complex nonsmoothness.
We perform extensive experiments using longitudinal data from the ADNI. Five types of cognitive scores are considered. Then, we empirically evaluate the performance of the proposed GFL-SGL methods along with several baseline methods, including ridge regression, Lasso, and the temporal smoothness models TGL [9] and cFSGL [24]. Experimental results indicate that GFL-SGL outperforms both the baselines and the temporal smoothness methods, which demonstrates that incorporating sparse group learning into temporal smoothness and multitask learning can improve predictive performance. Furthermore, based on the GFL-SGL models, stable MRI features and key regions of interest (ROIs) with significant predictive power are identified and discussed. We found that the results corroborate previous studies in neuroscience. Finally, in addition to the MRI features, we use multimodality data including PET, CSF, and demographic information for GFL-SGL as well as temporal smoothness models. While the additional modalities improve the predictive performance of all the models, GFL-SGL continues to significantly outperform other methods.
The rest of the paper is organized as follows. In Section 2, we provide a description of the preliminary methodology: multitask learning (MTL), two types of group Lasso norms, and fused Lasso norm. In Section 3, we present the GFL-SGL model and discuss the details of the ADMM algorithm proposed for the optimization. We present experimental results and evaluate the performance using the MRI data from the ADNI-1 and multimodality data from the ADNI-2 in Section 4. The conclusions are presented in Section 5.

Multitask Learning.
Consider a multitask learning (MTL) setting with $k$ tasks [19,21]. Let $p$ be the number of covariates, shared across all tasks, and $n$ the number of samples. Let $X \in \mathbb{R}^{n \times p}$ denote the matrix of covariates, $Y \in \mathbb{R}^{n \times k}$ the matrix of responses, with each row corresponding to a sample, and $\Theta \in \mathbb{R}^{p \times k}$ the parameter matrix, with column $\theta_{.m} \in \mathbb{R}^{p}$ corresponding to task $m$, $m = 1, \ldots, k$, and row $\theta_{j.} \in \mathbb{R}^{k}$ corresponding to feature $j$, $j = 1, \ldots, p$. The MTL problem can then be posed as estimating the parameters by minimizing a suitably regularized loss function. To associate the imaging markers with the cognitive measures, the MTL model minimizes the following objective:

$$\min_{\Theta} \; L(Y, X, \Theta) + R(\Theta), \tag{1}$$

where $L(\cdot)$ denotes the loss function and $R(\cdot)$ the regularizer. In the present context, we assume the loss to be the squared loss, i.e.,

$$L(Y, X, \Theta) = \|Y - X\Theta\|_F^2 = \sum_{i=1}^{n} \|y_i - x_i \Theta\|_2^2, \tag{2}$$

where $y_i \in \mathbb{R}^{1 \times k}$ and $x_i \in \mathbb{R}^{1 \times p}$ denote the $i$-th rows of $Y$ and $X$, i.e., the multitask response and the covariates of the $i$-th sample. We note that the MTL framework can easily be extended to other loss functions. Clearly, different choices of the penalty $R(\Theta)$ can result in quite different multitask methods. Based on prior knowledge, we add a penalty $R(\Theta)$ that encodes the relatedness among tasks.
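The regularized squared-loss objective described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the function names `squared_loss` and `mtl_objective`, the synthetic data, and the ridge penalty used as a stand-in regularizer are assumptions for the example and are not taken from the paper.

```python
import numpy as np

def squared_loss(X, Y, Theta):
    """Sum over samples of ||y_i - x_i Theta||_2^2 (squared Frobenius norm)."""
    residual = Y - X @ Theta
    return float(np.sum(residual ** 2))

def mtl_objective(X, Y, Theta, regularizer):
    """Loss plus a user-supplied penalty R(Theta)."""
    return squared_loss(X, Y, Theta) + regularizer(Theta)

# Synthetic example: n samples, p covariates, k tasks (all hypothetical).
rng = np.random.default_rng(0)
n, p, k = 20, 5, 3
X = rng.normal(size=(n, p))
Theta_true = rng.normal(size=(p, k))
Y = X @ Theta_true

# Stand-in regularizer (ridge); any penalty on Theta could be plugged in.
ridge = lambda T: 0.1 * float(np.sum(T ** 2))
value = mtl_objective(X, Y, Theta_true, ridge)
```

Passing the penalty as a function keeps the loss fixed while different choices of R(Θ) are swapped in, which mirrors how the choice of penalty determines the multitask method.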
2.2. $G_{2,1}$-Norm. One attractive property of the $\ell_{2,1}$-norm regularization is that it encourages multiple predictors from different tasks to share similar parameter sparsity patterns. The $\ell_{2,1}$-norm regularization,

$$\|\Theta\|_{2,1} = \sum_{j=1}^{p} \|\theta_{j.}\|_2, \tag{3}$$

is suitable for simultaneously enforcing sparsity over the features of all tasks. The key point of equation (3) is the use of the $\ell_2$-norm for $\theta_{j.}$, which forces the weights corresponding to the $j$-th feature across multiple tasks to be grouped and tends to select features based on the joint strength of the $k$ tasks. Moreover, multiple cognitive tests are related to one another. A common hypothesis is that a relevant imaging predictor affects each of these scores to a greater or lesser degree and that only a subset of brain regions is relevant to each assessment. By using the $\ell_{2,1}$-norm, the relationship information among different tasks can be embedded into the framework to build a more suitable predictive model while identifying a subset of the features. The rows of $\Theta$ are treated equally in the $\ell_{2,1}$-norm, which means that potential structures among the predictors are not taken into account.
Despite the achievements mentioned above, few regression models consider the covariance structure among predictors. In serving a particular function, brain imaging measures are usually correlated with one another. For MRI data, the groups correspond to specific regions of interest (ROIs) in the brain, for instance, the entorhinal and hippocampus. Individual features are specific properties of those regions, for example, cortical volume and thickness. In this work, for each region (group), multiple features are extracted to measure the atrophy information of the ROI, including cortical thickness, surface area, and volume from gray matter and white matter. The multiple shape measures from the same region provide a comprehensive quantitative evaluation of cortical atrophy and tend to be selected together as joint predictors [23].
We assume that the $p$ covariates are divided into $q$ disjoint groups $G_l$, $l = 1, \ldots, q$, where each group has $\nu_l$ covariates. In the context of AD, each group corresponds to a region of interest (ROI) in the brain, and the covariates of each group correspond to specific features of that region. For AD, the number of features in each group, $\nu_l$, is 1 or 4, whereas the number of groups $q$ can be in the hundreds. We then introduce two different $G_{2,1}$-norms according to the relationship between the brain regions (ROIs) and the cognitive tasks: $\|\Theta\|^{c}_{G_{2,1}}$, which encourages a shared subset of ROIs for all the tasks, and $\|\Theta\|^{s}_{G_{2,1}}$, which encourages a task-specific subset of ROIs.
The task-common $G_{2,1}$-norm $\|\Theta\|^{c}_{G_{2,1}}$ is defined as

$$\|\Theta\|^{c}_{G_{2,1}} = \sum_{l=1}^{q} w_l \sqrt{\sum_{j \in G_l} \|\theta_{j.}\|_2^2}, \tag{4}$$

where $w_l = \sqrt{\nu_l}$ is the weight of each group. The task-common $G_{2,1}$-norm enforces the $\ell_2$-norm on the features within the same ROI (intragroup) and keeps sparsity among the ROIs (intergroup) with the $\ell_1$-norm, which facilitates the selection of ROIs. $\|\Theta\|^{c}_{G_{2,1}}$ makes it possible to learn the shared feature representations and the ROI representations simultaneously.
The task-specific $G_{2,1}$-norm $\|\Theta\|^{s}_{G_{2,1}}$ is defined as

$$\|\Theta\|^{s}_{G_{2,1}} = \sum_{l=1}^{q} \sum_{m=1}^{k} w_l \, \|\theta_{G_l, m}\|_2, \tag{5}$$

where $\theta_{G_l, m} \in \mathbb{R}^{\nu_l}$ is the coefficient vector for group $G_l$ and task $m$. The task-specific $G_{2,1}$-norm makes it possible to select specific ROIs while learning a small number of common features for all tasks. It is more flexible because it decouples the group sparse regularization across tasks, so that different tasks can use different groups. The difference between these two norms is illustrated in Figure 2(a).
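To make the difference between the row-wise $\ell_{2,1}$-norm and the two group norms concrete, the following Python sketch evaluates all three on a toy parameter matrix. It assumes groups are given as lists of row indices and uses the group weight $w_l = \sqrt{\nu_l}$ from the text; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def l21_norm(Theta):
    """Sum of the l2-norms of the rows of Theta."""
    return float(np.sum(np.linalg.norm(Theta, axis=1)))

def task_common_g21(Theta, groups):
    """One l2 (Frobenius) norm per group, taken over all tasks jointly."""
    total = 0.0
    for idx in groups:
        w = np.sqrt(len(idx))                      # w_l = sqrt(nu_l)
        total += w * np.linalg.norm(Theta[idx, :])
    return float(total)

def task_specific_g21(Theta, groups):
    """One l2-norm per (group, task) pair, so tasks can use different groups."""
    total = 0.0
    for idx in groups:
        w = np.sqrt(len(idx))
        total += w * np.sum(np.linalg.norm(Theta[idx, :], axis=0))
    return float(total)

# Toy example: 3 features, 2 tasks, two groups (an ROI with 2 features, one with 1).
Theta = np.array([[3.0, 4.0],
                  [0.0, 0.0],
                  [0.0, 2.0]])
groups = [[0, 1], [2]]
common = task_common_g21(Theta, groups)
specific = task_specific_g21(Theta, groups)
```

On this toy matrix the task-specific value is larger than the task-common one, because the first group is charged once per task instead of once overall.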

Fused Lasso.
Fused Lasso was first proposed by Tibshirani et al. [25]. It is a variant of the Lasso in which pairwise differences between variables are penalized using the $\ell_1$-norm, which results in successive variables being similar. The fused Lasso norm is defined as

$$\|\Theta H^{\top}\|_1 = \sum_{m=1}^{k-1} \|\theta_{.m} - \theta_{.m+1}\|_1, \tag{6}$$

where $H$ is a $(k-1) \times k$ sparse matrix with $H_{m,m} = 1$ and $H_{m,m+1} = -1$. It encourages $\theta_{.m}$ and $\theta_{.m+1}$ to take the same value by shrinking the difference between them toward zero. This approach has been employed to incorporate temporal smoothness when modeling disease progression. In the longitudinal model, it is assumed that the difference between the cognitive scores at two successive time points is relatively small. The fused Lasso norm is illustrated in Figure 2(b).
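As a small illustration of the difference matrix $H$ described above, the Python sketch below builds $H$ for $k$ tasks and evaluates the fused Lasso norm as the sum of absolute differences between adjacent task columns. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def fused_matrix(k):
    """(k-1) x k matrix with H[m, m] = 1 and H[m, m+1] = -1."""
    H = np.zeros((k - 1, k))
    idx = np.arange(k - 1)
    H[idx, idx] = 1.0
    H[idx, idx + 1] = -1.0
    return H

def fused_lasso_norm(Theta):
    """l1-norm of the differences between consecutive columns of Theta."""
    H = fused_matrix(Theta.shape[1])
    return float(np.sum(np.abs(Theta @ H.T)))

# Toy example: 2 features, 3 time points.
Theta = np.array([[1.0, 1.0, 3.0],
                  [0.0, 2.0, 2.0]])
penalty = fused_lasso_norm(Theta)
```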

Group Guided Fused Laplacian Sparse Group Lasso (GFL-SGL)
3.1. Formulation. In longitudinal studies, the cognitive scores of the same subject are measured at several time points. Consider a multitask learning problem over $k$ tasks, where each task corresponds to a time point $t = 1, \ldots, k$. For each time point $t$, we consider a regression task based on data $(X_t, y_t)$, where $X_t \in \mathbb{R}^{n \times p}$ denotes the matrix of covariates and $y_t \in \mathbb{R}^{n}$ is the vector of responses. Let $\Theta \in \mathbb{R}^{p \times k}$ denote the regression parameter matrix over all tasks, so that column $\theta_{.t} \in \mathbb{R}^{p}$ corresponds to the parameters of the task at time step $t$. Since the prediction of cognitive scores at a single time point is treated as a regression task, tasks at different time points are temporally related to each other. To encode the dependency graph among all the tasks, we construct the Laplacian fused regularized penalty $\|\Theta D\|_1$ (equation (7)), where the matrix $D \in \mathbb{R}^{k \times k}$ (equation (8)) is built from the pairwise task weights $w_{|t-\ell|}$. We adopt a viewpoint inspired by local nonparametric regression, specifically kernel-based linear smoothers such as the Nadaraya-Watson kernel estimator [26], and model the local approximation accordingly. In our work, the weights are computed with a Gaussian kernel, as stated in equation (9), where $\sigma$ is the kernel bandwidth, which must be specified. When $\sigma$ is small, the Gaussian curve decays quickly, so the weights $w_{|t-\ell|}$ decline rapidly as $|t-\ell|$ increases; conversely, when $\sigma$ is large, the Gaussian curve decays gradually, so the weights $w_{|t-\ell|}$ decline slowly as $|t-\ell|$ increases. The matrix $D$ is therefore symmetric, with $w_{t,\ell} = w_{\ell,t}$ a function of $|t-\ell|$. Taking into account the covariance structure among predictors, we extend the Laplacian fused norm into the group guided Laplacian fused norm.
The task-specific $G_{2,1}$-norm is used here to decouple the group sparse regularization across tasks. It allows for more flexibility, so that different fused tasks are regularized by different groups. The group guided fused Laplacian (GFL) regularization is defined as

$$\|\Theta D\|^{s}_{G_{2,1}} = \sum_{l=1}^{q} \sum_{t=1}^{k} w_l \, \|(\Theta D)_{G_l, t}\|_2. \tag{10}$$

The GFL regularization enforces the $\ell_2$-norm on the fused features within the same ROI and keeps sparsity among the ROIs with the $\ell_1$-norm, which facilitates the selection of ROIs. The GFL regularization is illustrated in Figure 3, which involves two matrices: (1) the parameter matrix (left), where, for convenience, each group is shown in correspondence with a time point in the transformation matrix, although the transformation matrix in fact operates on all groups; and (2) the Gaussian kernel weighted fused Laplacian matrix with $\sigma = 1$ (right), where, since this matrix is symmetric, the columns are represented as rows.
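The Gaussian-kernel task graph and the GFL penalty can be sketched as follows. This is a sketch under explicit assumptions, since the exact kernel normalization and the exact form of $D$ are not recoverable from the text here: the weights are assumed to be $\exp(-d^2 / (2\sigma^2))$ for time-index distance $d$, and $D$ is assumed to be the unnormalized graph Laplacian of the weight matrix. All function names are hypothetical.

```python
import numpy as np

def gaussian_weights(k, sigma):
    """Fully connected task graph; weight decays with |t - l| (assumed kernel form)."""
    t = np.arange(k)
    d = np.abs(t[:, None] - t[None, :])
    W = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)           # no self-loops
    return W

def laplacian(W):
    """Unnormalized graph Laplacian, an assumed stand-in for the matrix D."""
    return np.diag(W.sum(axis=1)) - W

def gfl_penalty(Theta, D, groups):
    """Task-specific G_{2,1}-norm applied to the fused matrix Theta @ D."""
    fused = Theta @ D
    return float(sum(np.sqrt(len(g)) * np.sum(np.linalg.norm(fused[g, :], axis=0))
                     for g in groups))

k = 4
D = laplacian(gaussian_weights(k, sigma=1.0))
groups = [[0, 1], [2]]
# Coefficients that are constant over time incur no fusion penalty.
Theta_const = np.outer([1.0, 2.0, 3.0], np.ones(k))
flat_penalty = gfl_penalty(Theta_const, D, groups)
```

Under these assumptions, a smaller bandwidth concentrates the weight on neighboring time points, while a larger bandwidth couples distant time points more strongly.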
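The clinical score data are incomplete at some time points for many patients, i.e., some entries of the target vector $y_i \in \mathbb{R}^{k}$ may be missing. In order not to reduce the number of samples significantly, we use a matrix $\Lambda \in \mathbb{R}^{n \times k}$ to indicate the missing entries instead of simply removing all patients with missing values. Let $\Lambda_{i,j} = 0$ if the target value of sample $i$ is missing at the $j$-th time point, and $\Lambda_{i,j} = 1$ otherwise. We use the componentwise operator $\odot$, i.e., $(A \odot B)_{i,j} = A_{i,j} B_{i,j}$, so that the loss is evaluated only on the observed entries. By adding the task-common $G_{2,1}$-norm and Lasso penalties to the GFL model, the objective function of group guided fused Laplacian sparse group Lasso (GFL-SGL) is given by the following optimization problem:

$$\min_{\Theta} \; L(\Theta) + R^{\lambda_1}_{\lambda_2}(\Theta) + \lambda_3 \|\Theta D\|^{s}_{G_{2,1}}, \tag{11}$$

where $L(\Theta) = \|\Lambda \odot (Y - X\Theta)\|_F^2$ is the squared loss on the observed entries, $R^{\lambda_1}_{\lambda_2}(\Theta) = \lambda_1 \|\Theta\|_1 + \lambda_2 \|\Theta\|^{c}_{G_{2,1}}$, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the regularization parameters.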
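The role of the indicator matrix $\Lambda$ can be illustrated with a short Python sketch in which missing scores are stored as NaN and excluded from the squared loss. The NaN encoding, the function name, and the toy data are assumptions for the example.

```python
import numpy as np

def masked_squared_loss(X, Y, Theta, mask):
    """Squared loss over observed entries only (mask is 1 if observed, 0 if missing)."""
    Y_filled = np.where(mask == 1, Y, 0.0)   # keep NaNs out of the arithmetic
    residual = mask * (Y_filled - X @ Theta)
    return float(np.sum(residual ** 2))

# Toy example: 2 subjects, 2 time points, one missing score.
X = np.eye(2)
Theta = np.array([[1.0, 0.0],
                  [0.0, 3.0]])
Y = np.array([[1.0, np.nan],
              [2.0, 3.0]])
mask = (~np.isnan(Y)).astype(float)          # plays the role of Lambda
loss = masked_squared_loss(X, Y, Theta, mask)
```

Only the three observed entries contribute to the loss, so the subject with a missing visit is still used at the time points where a score exists.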

ADMM.
Recently, ADMM has become popular because it makes distributed convex problems easy to parallelize. In ADMM, the solutions of small local subproblems are coordinated to find the global optimum of a problem of the following form [27][28][29]:

$$\min_{x, z} \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c. \tag{12}$$

The augmented Lagrangian used by ADMM is formulated as

$$L_\rho(x, z, u) = f(x) + g(z) + u^{\top}(Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|_2^2, \tag{13}$$

where $f$ and $g$ are convex functions, $u$ is the augmented Lagrangian multiplier (dual variable), and $\rho$ is a nonnegative penalty parameter. In each iteration, ADMM solves this problem by alternately minimizing $L_\rho(x, z, u)$ over $x$ and $z$ and then updating $u$. In the $(k+1)$-th iteration, ADMM performs the updates

$$x^{k+1} = \arg\min_{x} L_\rho(x, z^{k}, u^{k}), \quad z^{k+1} = \arg\min_{z} L_\rho(x^{k+1}, z, u^{k}), \quad u^{k+1} = u^{k} + \rho (Ax^{k+1} + Bz^{k+1} - c). \tag{14}$$
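The alternating structure of ADMM can be seen on a textbook problem that is much simpler than GFL-SGL: the Lasso, with $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $g(z) = \lambda \|z\|_1$, and the constraint $x - z = 0$. The Python sketch below is an illustration of the general scheme with an unscaled multiplier $u$, not the algorithm proposed in this paper; the function names and default parameters are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=500):
    """ADMM for min 0.5 * ||Ax - b||^2 + lam * ||z||_1 subject to x - z = 0."""
    p = A.shape[1]
    x, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    M = A.T @ A + rho * np.eye(p)      # constant matrix of the x-subproblem
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(M, Atb + rho * z - u)      # x-update (quadratic)
        z = soft_threshold(x + u / rho, lam / rho)     # z-update (proximal step)
        u = u + rho * (x - z)                          # dual update
    return z

# With A = I the Lasso solution is soft-thresholding of b at lam.
b = np.array([3.0, 0.5, -2.0])
z_hat = admm_lasso(np.eye(3), b, lam=1.0)
```

The same three-step pattern (a smooth subproblem, a proximal step, and a dual update) recurs in the GFL-SGL algorithm, with more slack variables.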

Efficient Optimization for GFL-SGL.
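We put forward an efficient algorithm to solve the objective function in equation (11), which is equivalent to the following constrained optimization problem:

$$\min_{\Theta, Q, \Gamma} \; L(\Theta) + R^{\lambda_1}_{\lambda_2}(Q) + \lambda_3 \|\Gamma\|^{s}_{G_{2,1}} \quad \text{s.t.} \quad \Theta - Q = 0, \; \Theta D - \Gamma = 0, \tag{15}$$

where $Q$ and $\Gamma$ are slack variables. The solution of equation (15) can then be obtained by ADMM. The augmented Lagrangian is

$$L_\rho(\Theta, Q, \Gamma, U, V) = L(\Theta) + R^{\lambda_1}_{\lambda_2}(Q) + \lambda_3 \|\Gamma\|^{s}_{G_{2,1}} + \langle U, \Theta - Q \rangle + \langle V, \Theta D - \Gamma \rangle + \frac{\rho}{2} \|\Theta - Q\|_F^2 + \frac{\rho}{2} \|\Theta D - \Gamma\|_F^2, \tag{16}$$

where $U$ and $V$ are the augmented Lagrangian multipliers. Update $\Theta$: from the augmented Lagrangian in equation (16), the update of $\Theta$ at the $(s+1)$-th iteration is

$$\Theta^{(s+1)} = \arg\min_{\Theta} L_\rho(\Theta, Q^{(s)}, \Gamma^{(s)}, U^{(s)}, V^{(s)}), \tag{17}$$

which has a closed-form solution, obtained by setting the gradient of the objective in equation (17) to zero.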
Note that $D$ is a symmetric matrix. We define $\Phi = DD$, where $\Phi$ is also a symmetric matrix and $\Phi_{t,l}$ denotes the weight of the pair $(t, l)$. With this linearization, $\Theta$ can be updated in parallel through the individual columns $\theta_{.t}$. Thus, in the $(s+1)$-th iteration, $\theta^{(s+1)}_{.t}$ can be updated efficiently using a Cholesky factorization.
The above optimization problem is quadratic, so its optimal solution $\theta^{(s+1)}_{.t}$ is available in closed form. Computing $\theta^{(s+1)}_{.t}$ involves solving a linear system, which is the most time-consuming part of the whole algorithm. To compute $\theta^{(s+1)}_{.t}$ efficiently, we calculate the Cholesky factorization of the system matrix $F$ once, at the beginning of the algorithm, $F = A_t^{\top} A_t$. Note that $F$ is a constant, positive definite matrix. Using the Cholesky factorization, we only need to solve two linear systems at each iteration, first one with $A_t^{\top}$ and then one with $A_t$. Since $A_t$ is an upper triangular matrix, solving these two linear systems is very efficient.
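The "factor once, solve twice per iteration" idea can be sketched in Python as follows. The matrix `F` below is only a hypothetical stand-in of the form $X^{\top}X + \rho I$ (the exact system matrix of the paper is not reproduced here), and the function names are assumptions. NumPy has no dedicated triangular solver, so the sketch calls the general `np.linalg.solve` on the triangular factors; a triangular routine would be used in practice to exploit the structure.

```python
import numpy as np

def factor_once(F):
    """Upper triangular Cholesky factor A with F = A.T @ A."""
    return np.linalg.cholesky(F).T     # NumPy returns the lower factor, so transpose

def solve_with_factor(A, b):
    """Solve F theta = b through the two triangular systems A.T y = b and A theta = y."""
    y = np.linalg.solve(A.T, b)        # forward substitution step
    return np.linalg.solve(A, y)       # backward substitution step

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
rho = 1.0
F = X.T @ X + rho * np.eye(4)          # hypothetical constant, positive definite matrix
A = factor_once(F)                     # computed once, before the iterations start

b = rng.normal(size=4)                 # right-hand side changes at every iteration
theta = solve_with_factor(A, b)
```

Because `F` does not change across iterations, only the right-hand side has to be recomputed, and each iteration costs two triangular solves instead of a fresh factorization.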

Computational and Mathematical Methods in Medicine 7
Update $Q$: updating $Q$ requires solving a subproblem that amounts to computing the proximal operator of $R^{\lambda_1}_{\lambda_2}(\cdot)$. Specifically, this proximal operator is evaluated at $\Omega^{(s+1)} = \Theta^{(s+1)} + (1/\rho) U^{(s)}$, and its value is denoted by $\Psi$, so that $Q^{(s+1)} = \Psi$ can be computed efficiently. The proximal operator of the composite regularizer can be computed in two steps [30,31]: equation (25a) applies the proximal operator of the Lasso penalty, and equation (25b) applies the proximal operator of the task-common $G_{2,1}$-norm to the result. Both steps can be carried out efficiently with suitable extensions of soft-thresholding. The update in equation (25a) is computed with the soft-thresholding operator $\zeta_{\lambda_1/\rho}(\Omega^{(s+1)})$, which is defined elementwise as

$$[\zeta_{\lambda}(\Omega)]_{i,j} = \operatorname{sign}(\Omega_{i,j}) \max(|\Omega_{i,j}| - \lambda, 0). \tag{26}$$

We then turn to the update in equation (25b), which is equivalent to computing the proximal operator of the $G_{2,1}$-norm (equation (27)). Since the groups $G_\ell$ used in our work are disjoint, equation (27) can be decoupled across groups (equation (28)). Because $\phi(q_{j.})$ is strictly convex, $q^{(s+1)}_{j.}$ is its unique minimizer. We then introduce the following lemma [32] to solve equation (28).
where $q_{j.}$ is the $j$-th row of $Q^{(s+1)}$. Update $\Gamma$: updating $\Gamma$ requires solving a subproblem that is equivalent to computing the proximal operator of the GFL-norm, evaluated at a matrix $Z^{(s+1)}$. We then introduce the following lemma [32].

Lemma 2. For any λ
where $\gamma_{G_\ell t}$ and $z_{G_\ell t}$ are the rows in group $G_\ell$ for task $t$ of $\Gamma^{(s+1)}$ and $Z^{(s+1)}$, respectively. Dual update for $U$ and $V$: following the standard ADMM dual update, the dual variables in our setting are updated as

$$U^{(s+1)} = U^{(s)} + \rho (\Theta^{(s+1)} - Q^{(s+1)}), \qquad V^{(s+1)} = V^{(s)} + \rho (\Theta^{(s+1)} D - \Gamma^{(s+1)}).$$

The dual updates can be carried out in an elementwise parallel way. Algorithm 1 summarizes the entire algorithm. MATLAB code for the proposed algorithm is available at https://XIAOLILIU@bitbucket.org/XIAOLILIU/gfl-sgl.
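The proximal steps used in the $Q$ and $\Gamma$ updates can be sketched in Python as elementwise soft-thresholding followed by group-wise shrinkage. This is a sketch under assumptions: groups are lists of row indices, the group weight is $w_\ell = \sqrt{\nu_\ell}$, and the shrinkage follows the standard group soft-thresholding form; the function names are hypothetical and the authors' MATLAB code may differ in detail.

```python
import numpy as np

def soft_threshold(M, tau):
    """Elementwise soft-thresholding (proximal operator of the Lasso penalty)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def prox_task_common(M, groups, tau):
    """Shrink each group block (all tasks jointly) toward zero by tau * w_l."""
    out = np.zeros_like(M)
    for g in groups:
        block = M[g, :]
        norm = np.linalg.norm(block)
        thr = tau * np.sqrt(len(g))
        if norm > thr:
            out[g, :] = (1.0 - thr / norm) * block
    return out

def prox_task_specific(M, groups, tau):
    """Shrink each (group, task) sub-vector separately, as needed for Gamma."""
    out = np.zeros_like(M)
    for g in groups:
        block = M[g, :]
        norms = np.linalg.norm(block, axis=0)
        thr = tau * np.sqrt(len(g))
        safe = np.where(norms > 0, norms, 1.0)          # avoid division by zero
        scale = np.where(norms > thr, 1.0 - thr / safe, 0.0)
        out[g, :] = block * scale
    return out

def prox_sparse_group(Omega, groups, lam1, lam2, rho):
    """Two-step proximal operator: Lasso step first, then the group step."""
    return prox_task_common(soft_threshold(Omega, lam1 / rho), groups, lam2 / rho)

Omega = np.array([[3.0, 4.0],
                  [0.0, 0.0],
                  [0.0, 0.5]])
groups = [[0, 1], [2]]
Q = prox_sparse_group(Omega, groups, lam1=0.0, lam2=1.0, rho=1.0)
G = prox_task_specific(Omega, groups, tau=1.0)
```

In this toy example the weak single-feature group is removed entirely, while the strong group is only scaled down, which is the behavior that produces ROI-level sparsity.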

Convergence.
The convergence of Algorithm 1 is established by the following theorem.
The condition for convergence in Theorem 1 is easy to satisfy: $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the regularization parameters, which are always required to be positive. The detailed proof is given in Cai et al. [33]. In contrast to Cai et al., we do not require $L(\Theta)$ to be differentiable; instead, we explicitly handle the nondifferentiability of $L(\Theta)$ through its subgradient $\partial L(\Theta)$, a strategy similar to that used by Ye and Xie [28].

Experimental Results and Discussions
In this section, we present an empirical analysis to demonstrate the effectiveness of the proposed model in characterizing AD progression, using a dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) [34]. The principal objective of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of MCI and early AD. Approaches for characterizing AD progression are expected to assist both researchers and clinicians in developing new therapies and monitoring their efficacy. Moreover, a better understanding of disease progression is expected to improve both the safety and the efficiency of drug development and to potentially lower the time and cost of clinical trials.

Experimental Setup.
The ADNI project is a longitudinal study in which the selected subjects are classified into three baseline diagnostic cohorts, Cognitively Normal (CN), Mild Cognitive Impairment (MCI), and Alzheimer's Disease (AD), and are followed repeatedly at intervals of six or twelve months. The date on which a subject is scheduled for screening becomes the baseline (BL) after approval, and the time point of each follow-up visit is denoted by the time elapsed since the baseline. For example, we use the notation Month 6 (M6) to denote the time point half a year after the first visit. ADNI currently provides follow-up data up to Month 48 for some patients. However, some patients drop out of the study for various reasons.
The current work focuses on the MRI data. The MRI features used in our experiments are based on the imaging data from the ADNI database processed by a team from UCSF (University of California at San Francisco), who performed cortical reconstruction and volumetric segmentation with the FreeSurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/). In the current study, we remove the features that have more than 10% missing entries. Tables 1 and 2 [19,21] list the names of the cortical and subcortical regions. For each cortical region, the cortical thickness average (TA), standard deviation of thickness (TS), surface area (SA), and cortical volume (CV) were calculated as features. For each subcortical region, the subcortical volume was calculated as a feature. The SA of the left and right hemispheres and the total intracranial volume (ICV) were also included. This yielded a total of p = 319 MRI features extracted from cortical/subcortical ROIs in each hemisphere (including 275 cortical and 44 subcortical features). Details of the analysis procedure are available at http://adni.loni.ucla.edu/research/mri-post-processing/.
For predictive modeling, five sets of cognitive scores [25,35] are examined: Alzheimer's Disease Assessment Scale (ADAS), Mini-Mental State Exam (MMSE), Rey Auditory Verbal Learning Test (RAVLT), Category Fluency (FLU), and Trail Making Test (TRAILS). ADAS is regarded as the gold standard for cognitive function assessment in AD drug trials and is the most widely used cognitive testing tool for measuring the severity of the most important symptoms of AD. MMSE measures cognitive impairment, including orientation to time and place, attention and calculation, immediate and delayed recall of words, and language and visuoconstructional functions. RAVLT is a measure of episodic memory used to diagnose memory disturbances; it comprises eight recall trials and a recognition test. FLU is a measure of semantic memory (verbal fluency and language), in which the subject is asked to name different exemplars from a given semantic category. TRAILS is a test of processing speed and executive function consisting of two parts, in which the subject is instructed to connect a set of twenty-five dots as quickly as possible while maintaining accuracy. The specific scores we used are listed in Table 3. Note that the proposed GFL-SGL models are trained to model progression for each of these scores, with different time steps serving the role of distinct tasks. Since the five sets of cognitive scores include a total of ten different scores (see Table 3), results will be reported on each of these ten scores separately.
In all experiments, 10-fold cross validation is employed to evaluate our model and to carry out the comparison. For each experiment, 5-fold cross validation on the training set is carried out to select the regularization parameters (hyperparameters) $(\lambda_1, \lambda_2, \lambda_3)$. The estimated model uses these regularization parameters for prediction on the test set. In this inner cross validation, for a fixed set of hyperparameters, four folds are used for training and one fold is used for evaluation with nMSE. For hyperparameter selection, we consider a grid of regularization parameter values in which each regularization parameter varies between $10^{-1}$ and $10^{3}$ on a log scale. The data were z-scored before the regression methods were applied. The reported results are the best results of each method with its optimal parameters. For the quantitative performance evaluation, we used the correlation coefficient (CC) and the root mean squared error (rMSE) between the predicted clinical scores and the target clinical scores for each regression task. In addition, to evaluate the overall performance across the tasks, the normalized mean squared error (nMSE) [12,24] and the weighted R-value (wR) [36] are used. The nMSE follows the definition in [12,24] and the wR follows that in [36]; in both, $Y$ and $\hat{Y}$ denote the ground truth cognitive scores and the predicted cognitive scores, respectively. A smaller value of nMSE and rMSE and a higher value of CC and wR represent better regression performance. We report the mean and standard deviation based on 10 iterations of experiments on different splits of data for all comparable experiments.
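As an illustration of the four evaluation measures, the following minimal sketch computes CC and rMSE for a single time point and nMSE and wR across time points. It assumes the forms commonly used in the multitask AD progression literature: nMSE sums the squared error of each task divided by the variance of that task's ground truth and then divides by the total number of samples, and wR is the sample-size-weighted average of the per-task correlation coefficients. The function names and the use of the population variance are assumptions made for this sketch, not details taken from the paper.

```python
import math


def pearson_cc(y, y_hat):
    """Correlation coefficient (CC) between true and predicted scores."""
    n = len(y)
    mean_y, mean_p = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_p)


def rmse(y, y_hat):
    """Root mean squared error for one task (time point)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def nmse(tasks):
    """Normalized MSE over all tasks; `tasks` is a list of (y, y_hat) pairs.

    Assumption: each task's squared error is divided by the population
    variance of its ground truth, and the sum is divided by the total
    number of samples.
    """
    numerator, total = 0.0, 0
    for y, y_hat in tasks:
        n = len(y)
        mean_y = sum(y) / n
        variance = sum((a - mean_y) ** 2 for a in y) / n
        squared_error = sum((a - b) ** 2 for a, b in zip(y, y_hat))
        numerator += squared_error / variance
        total += n
    return numerator / total


def weighted_r(tasks):
    """Sample-size-weighted average of per-task correlation coefficients."""
    weighted = sum(pearson_cc(y, y_hat) * len(y) for y, y_hat in tasks)
    return weighted / sum(len(y) for y, _ in tasks)
```

Each element of `tasks` corresponds to one time point, so later time points with fewer samples contribute proportionally less to both nMSE and wR.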
We also performed paired t-tests on the corresponding cross validation performances measured by the nMSE and wR between predicted and actual scores to compare the proposed method and the other comparison methods [9,24,35,37].
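The paired comparison can be sketched as follows. The example computes the paired t statistic from per-fold nMSE values of two methods and compares it with the two-sided critical value for 9 degrees of freedom at the 0.05 level (2.262). The per-fold values are made up purely for illustration and are not results from this study; the function name is also hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical nMSE values over 10 cross validation folds (illustration only)
proposed = [0.60, 0.58, 0.63, 0.61, 0.59, 0.62, 0.60, 0.57, 0.64, 0.61]
baseline = [0.66, 0.63, 0.65, 0.67, 0.62, 0.66, 0.65, 0.61, 0.68, 0.66]

t, df = paired_t_statistic(proposed, baseline)

# Two-sided critical value of the t distribution for df = 9, alpha = 0.05
T_CRIT_DF9 = 2.262
significant = abs(t) > T_CRIT_DF9
```

A negative t here means the first method has the lower nMSE on average; an exact p value would require the t distribution CDF, which a statistics package can supply.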
The p values were provided to examine whether these improvements in prediction performance were significant. A significant improvement has a low p value (less than 0.05, for example). To assess the sensitivity of the three hyperparameters in the GFL-SGL formulation (equation (11)), we explored the 3D hyperparameter space and plotted the nMSE for all combinations of values, as in our recent study [19]. The sensitivity study is important for understanding the impact of each term in the GFL-SGL formulation and for guiding how to set the hyperparameters appropriately. The hyperparameter space is defined as $\lambda_1, \lambda_2, \lambda_3 \in [0.1, 100]$. The reported nMSE was calculated on the test set. Owing to space constraints, Figure 4 shows the plots only for the ADAS and MMSE cognitive scores. It can be observed that, for all of the cognitive scores, smaller values of $\lambda_3$ resulted in poor regression performance, which suggests that the temporal smoothness penalty contributes substantially to the prediction and should be taken into account. Moreover, larger values of $\lambda_2$ (associated with the task-common group Lasso penalty) tend to improve the results for smaller $\lambda_1$. As $\lambda_1$ increases, more sparsity is enforced on the θ parameters, which breaks the group structure present in the data.
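A sweep over the 3D hyperparameter space of this kind can be sketched as below. The grid spacing (one value per decade between 0.1 and 100) and the scoring function are assumptions for illustration; in practice the scoring function would train the model with the given $(\lambda_1, \lambda_2, \lambda_3)$ and return the nMSE.

```python
import itertools
import math


def log_grid(low_exp, high_exp):
    """Values 10**low_exp, ..., 10**high_exp, one per decade (assumed spacing)."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def sweep(grid, score_fn):
    """Evaluate score_fn on every (lambda1, lambda2, lambda3) combination.

    Returns the full score table and the combination with the lowest score.
    """
    scores = {lams: score_fn(*lams) for lams in itertools.product(grid, repeat=3)}
    best = min(scores, key=scores.get)
    return scores, best


def toy_nmse(lam1, lam2, lam3):
    # Hypothetical stand-in for "train the model, return nMSE";
    # its minimum is placed at (1, 10, 100) purely for the example.
    return (math.log10(lam1) ** 2
            + (math.log10(lam2) - 1) ** 2
            + (math.log10(lam3) - 2) ** 2)


grid = log_grid(-1, 2)  # [0.1, 1.0, 10.0, 100.0]
scores, best_lams = sweep(grid, toy_nmse)
```

Keeping the full `scores` table rather than only the minimum is what makes the sensitivity plots possible, since every combination of values is retained.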

Prediction Performance Based on MRI Features.
We compare the performance of GFL-SGL with different regression methods, including ridge regression [38] and Lasso [39], which are applied independently to each time point, and temporal group Lasso (TGL) [9] and convex fused sparse group Lasso (cFSGL) [24], which are state-of-the-art methods for characterizing longitudinal AD progression. TGL incorporates three penalty terms to capture task relatedness: two ℓ2-norms to prevent overfitting and enforce temporal smoothness and one ℓ2,1-norm to introduce joint feature selection. The objective function is formulated as $\min_{\Theta} L(\Theta) + \lambda_1 \|\Theta\|_F^2 + \lambda_2 \|R\Theta^T\|_F^2 + \lambda_3 \|\Theta\|_{2,1}$. cFSGL allows the simultaneous selection of a common set of biomarkers for multiple time points and specific sets of biomarkers for different time points using the sparse group Lasso (SGL, $\lambda_1 \|\Theta\|_{2,1} + \lambda_2 \|\Theta\|_1$) penalty and at the same time incorporates temporal smoothness using the fused Lasso penalty ($\sum_{t=1}^{k-1} |\theta_{t\cdot} - \theta_{t+1\cdot}|$). The codes of TGL and cFSGL were downloaded from the authors' websites, and the AGM algorithm is used as the optimization method. Recall that each experiment focuses on a particular cognitive score, with the different time points serving as different tasks for the multitask learning formulations. Since there are ten cognitive scores in total, we carry out the experiments and report the results separately for each score. The average and standard deviation of the performance measures are calculated with 10-fold cross validation on different splits of data and are summarized in Table 4.
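To make the penalty terms above concrete, the following sketch evaluates the ℓ2,1-norm, the ℓ1-norm, and the fused Lasso term for a small coefficient matrix. It assumes that the rows of the matrix are features and the columns are consecutive time points, and it attaches a separate weight `lam3` to the fused term; both choices, like the function names, are assumptions for illustration rather than the authors' implementation.

```python
import math


def l21_norm(theta):
    """Sum over features (rows) of the l2 norm across time points."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in theta)


def l1_norm(theta):
    """Sum of absolute values of all coefficients."""
    return sum(abs(v) for row in theta for v in row)


def fused_lasso(theta):
    """Sum of absolute differences between adjacent time points."""
    return sum(abs(row[t] - row[t + 1])
               for row in theta
               for t in range(len(row) - 1))


def sgl_fused_penalty(theta, lam1, lam2, lam3):
    """SGL penalty plus fused Lasso term (lam3 is an assumed weight)."""
    return (lam1 * l21_norm(theta)
            + lam2 * l1_norm(theta)
            + lam3 * fused_lasso(theta))


# Two features, two time points; the second feature is entirely zero
theta = [[3.0, 4.0],
         [0.0, 0.0]]
penalty = sgl_fused_penalty(theta, 1.0, 1.0, 1.0)
```

The all-zero second row contributes nothing to any term, which is the sense in which the ℓ2,1-norm favors dropping a feature at all time points jointly.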
The results show that the multitask temporal smoothness models (TGL, cFSGL, and GFL-SGL) are more effective than the single-task learning models (ridge and Lasso) in terms of both nMSE and wR over all scores, especially for the tasks at the later time points, where the training samples are limited. Both the fused Lasso norms (TGL and cFSGL) and the group guided fused Laplacian norm (GFL-SGL) improve performance, which demonstrates that taking into account the local structure within the tasks improves the prediction performance. Furthermore, GFL-SGL achieved better performance than TGL and cFSGL, which indicates that it is beneficial to simultaneously employ a transform matrix that takes into account all the time points and the group structure information among the features. Two types of group penalties are used in our model ($G^{c}_{2,1}$-norm and $G^{F}_{2,1}$-norm). The former learns a shared subset of ROIs for all the tasks, whereas the latter learns a task-specific subset of Laplacian fused ROIs. That GFL-SGL performs consistently better than TGL and cFSGL further demonstrates that exploiting the underlying dependence structure may be advantageous and that exploiting the structure among tasks and features simultaneously results in significantly better prediction performance. The statistical hypothesis tests reveal that GFL-SGL is significantly better than the competing methods for most of the scores.
We show scatter plots of the actual values against the predicted values on the test set. Due to lack of space, we illustrate only two sets of scatter plots, for ADAS and MMSE, in Figures 5 and 6, respectively. Owing to the small sample size at the M36 and M48 time points, we show the scatter plots for the first four time points. As the scatter plots indicate, the predicted values and the actual values are highly correlated for both of these tasks. The scatter plots also show that the prediction performance for ADAS is better than that for MMSE. Section 4.4 incorporates more modalities, including PET, CSF, and demographic information, with the aim of improving performance.

Identification of MRI Biomarkers.
In Alzheimer's disease research, researchers are interested not only in improving the prediction of cognitive scores but also in identifying which brain regions are most affected by the disease, which can help in diagnosing the early stages of the disease and in understanding how it spreads. We therefore turn to the identification of MRI biomarkers. Our GFL-SGL is a group sparse model capable of identifying a compact set of relevant neuroimaging biomarkers at the region level owing to the group Lasso on the features, which is expected to give better interpretability of the brain regions. Due to lack of space, we only show the top 30 ROIs for ADAS and MMSE, obtained from the regression weights of all ROIs in each hemisphere for six time points, in Figure 7. The value of each item (i, j) in the heat map indicates the weight of the i-th ROI for the j-th time point and is calculated from the regression coefficients of the MRI features in that ROI, where k denotes the k-th MRI feature. The larger the absolute value of a coefficient is, the more important its corresponding brain region is in predicting the corresponding time point of that cognitive score. The figure illustrates that the proposed GFL-SGL clearly yields sparse results across all time points, which demonstrates that these biomarkers are longitudinally important owing to the smooth temporal regularization. We also observe that different time points share similar ROIs for these two cognitive measures, which demonstrates that there is a strong correlation among the multiple tasks of score prediction at multiple time points.
Moreover, the top 30 selected MRI features and brain regions (ROIs) for ADAS and MMSE are shown in Table 5. We also show the brain maps of the top ROIs in Figures 8 and 9, including cortical ROIs and subcortical ROIs. Note that the top features and ROIs are obtained by calculating the overall weights for the six time points. From the top 30 features, we can examine the group sparsity of the GFL-SGL model at the ROI level. It can be seen clearly that many top features come from the same ROI due to the consideration of the group structure. Some important brain regions are also selected by our GFL-SGL, such as middle temporal [20,[40][41][42], hippocampus [42], entorhinal [20], inferior lateral ventricle [35,43], and parahippocampal [44], which are highly relevant to cognitive impairment. These results are consistent with the established understanding of the pathological pathway of AD. These regions have been identified in the recent literature and have been shown to be highly correlated with clinical functions. For instance, the hippocampus is located in the temporal lobe of the brain and plays a role in memory and spatial navigation. The entorhinal cortex is the first region of the brain to be affected, and it is also the most severely impaired cortex in Alzheimer's disease [45]. In addition, some recent findings stress the importance of parahippocampal atrophy as an early biomarker of AD, because parahippocampal volume discriminates better than hippocampal volume between healthy aging, MCI, and mild AD, particularly in the early stage of the disease [44]. The findings also reveal that changes in the thickness of the inferior parietal lobule occur early in the progression from normal to MCI and are associated with neuropsychological performance [46].

Fusion of Multimodality.
Clinical and research studies commonly demonstrate that complementary brain images can give a more accurate and rigorous assessment of disease status and cognitive function. The previous experiments were conducted on MRI, which measures the structure of the cerebrum and has turned out to be an efficient tool for detecting the structural changes caused by AD or MCI. Fluorodeoxyglucose PET (FDG-PET), a technique for measuring glucose metabolism, can determine the likelihood of deterioration of mental status. Each neuroimaging modality could offer valuable information, and biomarkers from different modalities could offer complementary information for different aspects of a given disease process [4,14,[47][48][49].
Since the multimodality data of ADNI-1 are largely missing, the samples from ADNI-2 are used instead. The PET imaging data are from the ADNI database and were processed by the UC Berkeley team, who use a native-space MRI scan for each subject that is segmented and parcellated with FreeSurfer to generate summary cortical and subcortical ROIs, coregister each florbetapir scan to the corresponding MRI, and calculate the mean florbetapir uptake within the cortical and reference regions. The image processing procedure is described at http://adni.loni.usc.edu/updatedflorbetapir-av-45-pet-analysis-results/. Because the number of patients with MRI at M48 is small (29 subjects) and there are no PET data at M6, data from four time points were used. Furthermore, there is no score measure for FLU.ANIM and there are too few samples for FLU.VEG and TRAILS, so we use ADAS, MMSE, and RAVLT, for a total of 6 scores in this experiment. We followed the same experimental procedure as described in Section 4.1, which yields a total of n = 897 subjects at baseline; for the M12, M24, and M36 time points, the sample sizes are 671, 470, and 62, respectively.
To estimate the effect of combining multimodality data with our GFL-SGL method and to provide a more comprehensive comparison of our group guided method with the methods without group structure, we further perform the following experiments: (1) employing only the MRI modality, (2) employing only the PET modality, (3) combining two modalities, MRI and PET (MP), and (4) combining four modalities, MRI, PET, CSF, and demographic information including age, gender, years of education, and ApoE genotyping (MPCD). Note that, for the CSF modality, the original three measures (i.e., Aβ42, t-tau, and p-tau) are directly used as features without any feature selection step. We compare the performance of TGL, cFSGL, and GFL-SGL in fusing multiple modalities for predicting the disease progression measured by the clinical scores (ADAS-Cog, MMSE, and RAVLT). For TGL and cFSGL, the features from the different modalities are concatenated into one long feature vector, while for our GFL-SGL, the features from the same modality are considered as a group.
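The difference between plain concatenation and modality-level grouping can be sketched as follows. The helper below assigns each modality a contiguous block of column indices in the concatenated feature vector, so that a group penalty can treat each modality as one group. The MRI and PET feature counts are hypothetical placeholders; the CSF and demographic counts follow the three CSF measures and four demographic variables listed above. The function name is an assumption for this sketch.

```python
def build_modality_groups(modality_sizes):
    """Map each modality to its column indices in the concatenated vector."""
    groups = {}
    start = 0
    for name, size in modality_sizes.items():
        groups[name] = list(range(start, start + size))
        start += size
    return groups


# MRI and PET sizes are hypothetical; CSF has 3 measures and the
# demographic block has 4 variables (age, gender, education, ApoE).
modality_sizes = {"MRI": 4, "PET": 3, "CSF": 3, "DEMO": 4}
groups = build_modality_groups(modality_sizes)

# One subject's features, concatenated in the same modality order
subject = {
    "MRI": [0.1, 0.2, 0.3, 0.4],
    "PET": [1.1, 1.2, 1.3],
    "CSF": [2.1, 2.2, 2.3],
    "DEMO": [3.1, 3.2, 3.3, 3.4],
}
concatenated = [v for name in modality_sizes for v in subject[name]]
```

Methods without group structure see only `concatenated`, while a group guided method additionally receives `groups`, which records which columns belong together.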
The prediction performance results are shown in Table 6. It is clear that the methods with multimodality data outperform the methods using a single modality. This validates our assumption that the complementary information among different modalities is helpful for cognitive function prediction. In particular, when two modalities (MRI and PET) are used, the performance is improved significantly compared with using unimodal (MRI or PET) information. Moreover, when four modalities (MRI, PET, CSF, and demographic information) are used, the performance is further improved. With either two or four modalities, the proposed multitask learning GFL-SGL achieves better performance than TGL and cFSGL. This justifies the motivation of learning multiple tasks simultaneously while considering groups of variables, whether the groups come from the ROI structure or from the modality structure.

Conclusion
In this paper, we investigated the progression of longitudinal Alzheimer's disease (AD) by means of multiple cognitive scores and multimodality data. We proposed a multitask learning formulation with group guided regularization that can exploit the correlation of different time points and the importance of ROIs or multiple modalities for predicting the cognitive scores. An alternating direction method of multipliers (ADMM) algorithm is presented to efficiently tackle the associated optimization problem. Experiments and comparisons of this model with the baseline and temporal smoothness methods illustrate that GFL-SGL offers consistently better performance than other algorithms on both MRI features and multimodality data.
In the current work, group guided information is only considered for each cognitive score separately, with multiple tasks corresponding to the same cognitive score across multiple time points. In addition, the group guided information used in this work is predefined; the model cannot automatically learn the feature groups. Since the cognitive scores measure the same underlying medical condition in different ways and the features have different structures, we expect that a more general group guided framework that learns group information automatically can be developed for all cognitive scores across all time points simultaneously. While the current study illustrates the power of our proposed method, we expect to perform more general experiments to validate its effectiveness in our future work. All of the regions processed by UCSF are used in this work; in future work, we will take the medical background into account and screen these features. To compare the significance of the performance differences between the methods more effectively, we will randomly split the subjects into training and test sets and repeat this many times to obtain enough scores for statistical analysis.

Data Availability
The data used to support the findings of this study can be obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/).

Conflicts of Interest
The authors declare that they have no conflicts of interest.