Assisted Diagnosis of Alzheimer’s Disease Based on Deep Learning and Multimodal Feature Fusion



Introduction
Magnetic resonance imaging (MRI) is a medical imaging technology that has developed rapidly in recent years. It offers high soft-tissue contrast, high spatial resolution, and a noninvasive acquisition process. It is widely used for various cardiovascular and cerebrovascular diseases and has promoted the progress of contemporary medicine. At present, structural MRI (sMRI) and functional MRI (fMRI) are both widely used in the diagnosis of Alzheimer's disease (AD).
A complete and clear picture of intracranial anatomy can be obtained through hierarchical scanning with sMRI, which helps to analyze the morphological structure of brain gray matter, white matter, and cerebrospinal fluid and to determine whether disease or injury exists. Structural imaging analysis of AD patients and normal controls (NC) has found that the gray matter volume of AD patients is significantly lower than that of normal people, and the gray matter in the hippocampus, temporal poles, and temporal islands also shows significant shrinkage [1]. Comparing different stages of AD, hippocampal atrophy is significant in the initial stage; then the inferior lateral area of the temporal lobe changes noticeably, and finally the frontal lobe begins to shrink [2]. fMRI measures the hemodynamic changes caused by neuronal activity, which can show the location and extent of brain activation and can detect dynamic changes in the brain over a period of time.
The application of artificial intelligence (AI) in medicine has become a research hotspot for scholars at home and abroad [3,4]. AI combined with machine learning methods is applied to medical image processing to obtain biomarkers and to assist doctors in making correct diagnoses. Deep learning is an important branch of machine learning, and its application in medical imaging has attracted widespread attention. Ehsan Hosseini-Asl [5] and Adrien Payan [6] used 3D convolutional neural networks and autoencoders to capture AD biomarkers. Zhenbing Liu [7] used a multiscale residual neural network to collect multiscale information from a series of image slices and to classify AD, mild cognitive impairment (MCI), and NC. Ciprian D. Billones [8] improved the VGG-16 network to construct a classification model for AD, MCI, and NC. Deep learning algorithms are also widely used in fMRI-assisted diagnosis of brain diseases. Junchao Xiao [9] used stacked autoencoders and functional connection matrices to classify migraine patients and normal people. Meszlényi Regina [10] proposed a dynamic time normalization distance matrix, Pearson correlation coefficient matrix, warping path distance matrix, and convolutional neural network to realize AD-assisted diagnosis.
With the development of deep learning research, networks have grown continuously deeper, their structures have become more and more complex, and the demands on the hardware environment have gradually increased. To reduce the environmental demands of such models and to promote their application and improvement, lightweight networks such as MobileNet [11] and ShuffleNet [12] were born. In this paper, the ShuffleNet model is improved and an AD-assisted diagnosis model based on 3DShuffleNet is proposed, which directly uses sMRI images preprocessed by the VBM-DARTEL [13] method and uses deep image features to classify AD, MCI, and NC. The proposed method not only removes the voting step that the slicing method needs to obtain test results but also, because it uses a lightweight network, is more conducive to the promotion and application of the model in low-computing-power environments.
The high-dimensional, small-sample characteristics of datasets such as fMRI data often make classification and modeling difficult. Therefore, in this paper, the anatomical automatic labeling (AAL) template is used to calculate the functional connection matrix after preprocessing of the original images. The functional connection matrix is a universal and effective way to analyze the correlation characteristics of brain regions and can greatly reduce the data dimension.
Feature redundancy still exists in the functional connection matrix, so dimensionality reduction and feature extraction are usually applied. Principal component analysis network (PCANet) is an unsupervised deep learning feature extraction algorithm that can effectively cope with the problem of insufficient experimental samples. In this study, PCANet is used to extract features from the matrix, and a support vector machine (SVM) classifier is used for classification.
In addition, kernel canonical correlation analysis (KCCA) is used to fuse the features of the two modalities before classification, so that their information complements each other and the impact of the inherent defects of any single-modal feature is reduced.

Data Introduction and Preprocessing
sMRI data are helpful for observing changes in brain structure during the course of illness, while fMRI reflects the influence of illness on brain function by detecting brain activity. The sMRI and fMRI images come from the Alzheimer's Disease Neuroimaging Initiative (ADNI); to facilitate the fusion of the two modalities, each subject is required to have both types of data, acquired at close times. Because early MCI and late MCI both belong to the MCI process and differ only slightly, they are regarded as the same category. Datasets including 34 cases of AD, 36 cases of MCI (18 early MCI, 18 late MCI), and 50 cases of NC were finally selected as experimental data.
The VBM-DARTEL [13] method is used to preprocess the sMRI images, including segmentation, specific template generation, flow field generation, and normalization. These preprocessing steps are all implemented with SPM8 software. The medical image processing software DPABI is used to preprocess the fMRI images, including removal of the first 10 time points, slice timing, realignment, normalization, smoothing, detrending, filtering, and extraction of time series to calculate the functional connection matrix. Figure 1(a) shows the coronal, sagittal, and cross-sectional views of gray matter images obtained by sMRI preprocessing, and Figure 1(b) is an example of the functional connection matrix obtained by preprocessing fMRI.

Method
The amount of experimental data in this paper is very small. To avoid as much as possible the overfitting that often occurs in deep convolutional neural networks, this paper uses the lightweight network ShuffleNet, which has few parameters, and PCANet, which requires no feedback parameter adjustment, to implement deep feature extraction and classification.

MRI Feature Extraction and ShuffleNet.
ShuffleNet is a deep learning network designed especially for mobile devices with limited computing power. It uses pointwise group convolution and channel shuffling to achieve its high-efficiency architecture [12], reducing computational complexity while ensuring that the network still has good classification performance.
The network consists of one convolution layer, one maximum pooling layer, three groups of ShuffleUnit structures, one global pooling layer, and one fully connected (FC) layer. Each group of ShuffleUnits consists of one ShuffleUnit module like that in Figure 2(b) connected with several ShuffleUnit modules like that in Figure 2(c), and the number of units in series is set by the Repeat parameter. ShuffleNet has outstanding performance in image classification [15] and has been applied to face recognition [16].
This network integrates the strengths of many classic networks. It inherits the bottleneck module from the classic deep learning network ResNet [14], as shown in Figure 2(a), using the idea of residual learning to speed up model convergence and enhance model performance. It combines MobileNet's depthwise separable convolution and the AlexNet [17] grouping method to reduce computational complexity. The ShuffleUnit, shown in Figure 2, is improved from the bottleneck in the ResNet network, and the unit output uses the idea of residual learning. During training, the residual learning unit learns the difference between the input layer and the output layer through the parameterized network layers. Reference [14] shows that residual learning is easier to train and yields higher classification accuracy than directly learning the mapping from input to output. The ShuffleUnit not only uses elementwise summation to ensure the transmission of original information to later layers but also lets the first unit in each group of ShuffleUnits use concat to increase the number of channels, thereby fusing original information and global information.
The depthwise separable convolution introduced by MobileNet is applied to the convolutions of the ShuffleUnit. Depthwise separable convolution splits the ordinary convolution process into two steps. First, a 3 × 3 convolution kernel is applied to each channel to obtain a single-channel feature map. Then, pointwise (1 × 1) convolution is used to combine the full-channel features. Depthwise separable convolution can obtain a similar effect to ordinary convolution while greatly reducing the amount of calculation. The reduction factor of the calculation amount is shown in formula (1):

(H × W × C_in × 3 × 3 + H × W × C_in × C_out) / (H × W × 3 × 3 × C_in × C_out) = 1/C_out + 1/9, (1)

where H × W refers to the size of the feature map, and C_in and C_out, respectively, represent the numbers of input and output channels of the convolutional layer. The idea of grouped convolution was originally derived from the hardware limitations encountered when running the AlexNet network, for which Hinton split the information across two GPUs [17]. Considering that pointwise convolution still involves a large amount of calculation, the ShuffleUnit adopts a grouping operation for pointwise convolution to further reduce the amount of calculation, with the reduction factor shown in formula (2):

(H × W × C_in × C_out / g) / (H × W × C_in × C_out) = 1/g, (2)

where g is the number of groups. In order to strengthen the flow of information between groups, reduce the constraints between channels, and enhance the ability of information presentation, the channel shuffle method is used to realize information exchange between groups.
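As a rough numerical illustration of the two reduction factors above, the multiply-add counts can be computed directly (a minimal sketch; the 3 × 3 kernel and the layer dimensions below are illustrative, not taken from the paper's model):

```python
def conv_cost(h, w, c_in, c_out, k=3):
    """Multiply-adds for an ordinary k x k convolution on an H x W feature map."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_cost(h, w, c_in, c_out, k=3):
    """Depthwise k x k convolution per channel, then 1 x 1 pointwise combination."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

def grouped_pointwise_cost(h, w, c_in, c_out, g):
    """1 x 1 convolution with channels split into g independent groups."""
    return h * w * c_in * c_out // g

# Illustrative layer: 28 x 28 feature map, 144 input and output channels, 3 groups.
h, w, c_in, c_out, g = 28, 28, 144, 144, 3

ratio_separable = depthwise_separable_cost(h, w, c_in, c_out) / conv_cost(h, w, c_in, c_out)
ratio_grouped = grouped_pointwise_cost(h, w, c_in, c_out, g) / (h * w * c_in * c_out)

# ratio_separable equals 1/C_out + 1/k^2; ratio_grouped equals 1/g
print(round(ratio_separable, 4), round(ratio_grouped, 4))
```

With these illustrative sizes, the separable convolution costs roughly 12% of an ordinary convolution, and grouping cuts the pointwise cost to 1/g.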
Most convolutional networks proposed at present are designed for 2D color images and use 2D convolution to extract image features. To adapt 3D brain volumes to such networks, the slicing method was proposed in [5] and [6]. Although the slicing method is convenient for training and applying existing 2D convolutional neural networks, each result can only represent the category of the corresponding brain slice rather than of the whole volume. Therefore, the slicing method usually requires majority voting to integrate the results of each part and evaluate the overall category, which complicates the process. To avoid this complication, the classification features of the entire sample are obtained directly, which also facilitates subsequent fusion with fMRI features. A 3D form of ShuffleNet is therefore implemented, which is also beneficial for retaining more three-dimensional spatial information.
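The channel shuffle operation used in the ShuffleUnit amounts to a reshape-transpose-flatten over the channel axis; a minimal pure-Python sketch (channel indices stand in for feature maps):

```python
def channel_shuffle(channels, g):
    """Interleave channels across g groups: reshape to (g, c/g), transpose, flatten."""
    c = len(channels)
    assert c % g == 0, "channel count must be divisible by the number of groups"
    n = c // g
    # rows = the g groups produced by the preceding grouped convolution
    groups = [channels[i * n:(i + 1) * n] for i in range(g)]
    # transpose so each output group mixes one channel from every input group
    return [groups[i][j] for j in range(n) for i in range(g)]

print(channel_shuffle([0, 1, 2, 3, 4, 5], 2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, every group of channels fed to the next grouped convolution contains information from all previous groups, which is exactly the cross-group information exchange described above.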
In 3DShuffleNet, the number of groups in grouped convolution is set to 3, and, to adapt to the 3D structure of gray matter images, the 2D convolutions are changed to 3D convolutions. The parameters of the model structure are shown in Table 1.
Amyloid deposition and neurofibrillary tangles in the brain are typical pathological changes in patients with AD, which can cause brain nerve cells to atrophy and die or signal transmission between cells to become abnormal. Experienced doctors can distinguish AD by observing the degree of atrophy of specific parts of sMRI images. The gray matter of the brain is densely packed with neuronal cell bodies and is the center of information processing; through it, the distribution and number of neuronal cells in a patient can be analyzed to screen for AD. In this paper, the sMRI gray matter image obtained after preprocessing is fed into 3DShuffleNet to obtain the auxiliary diagnosis results, and the outputs of the penultimate layer, i.e., the inputs to the classification layer, are taken as the classification features.

fMRI Feature Extraction.
fMRI records the changes in cerebral hemodynamics over a period of time, so the high-dimensional, small-sample characteristics are particularly prominent. How to effectively extract the information expressed by brain imaging while reducing the dimensionality is the primary problem in establishing an auxiliary diagnostic model. Current fMRI data processing methods include ALFF value analysis, functional connection matrix analysis, and local consistency analysis. Among them, the most common is the functional connection matrix. It measures the coordination and consistency of the work of two brain regions by calculating the Pearson correlation coefficient between their time series, and it can greatly reduce the data dimension. Because disease can change the connection characteristics of certain brain areas, it retains the most AD diagnostic information. In this paper, the functional connection matrix obtained by fMRI preprocessing is used as the input of the auxiliary diagnosis network.
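A minimal sketch of how such a matrix is computed from region time series (toy data; the real input would be the 90 or 116 AAL region series extracted during preprocessing):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def connectivity_matrix(series):
    """Symmetric matrix of pairwise Pearson correlations between region time series."""
    r = len(series)
    return [[pearson(series[i], series[j]) for j in range(r)] for i in range(r)]

# Toy example with 3 "regions" and 4 time points.
regions = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with region 0
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with region 0
]
m = connectivity_matrix(regions)
```

Each entry m[i][j] in [-1, 1] summarizes how consistently regions i and j co-activate, which is why an r-region scan collapses to an r × r matrix regardless of the number of time points.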
PCANet is a simple deep learning baseline proposed by Tsung-Han Chan [18], which consists of cascaded principal component analysis, binary hashing, and block histograms and is widely used in face recognition [19], age estimation [20], deception detection [21], and other fields. The network has good deep feature extraction capabilities. It can be roughly divided into three stages, of which the first and second are PCA convolution stages and the third is the feature output stage [22].
In the first stage, PCA and convolution are used to obtain features. Suppose the number of input samples is N, the size of each sample matrix is [m, n], and the size of the sliding window is [k1, k2]. Image blocks are obtained with the sliding window as shown in the following equation:

x_{i,1}, x_{i,2}, . . ., x_{i,(m−k1+1)(n−k2+1)}, (3)

Then, the image matrix X after removing the mean value from each image block of all samples is

X = [X̄_1, X̄_2, . . ., X̄_N], (4)

where X̄_i denotes the mean-removed blocks of the i-th sample.

Assume that the number of convolution filters in the first stage is L1. PCA is used to learn the convolution filters, whose parameters are

W_l^1 = mat_{k1,k2}(q_l(X X^T)), l = 1, 2, . . ., L1, (5)

where mat_{k1,k2}(q_l(X X^T)) represents the mapping from a vector of size k1 × k2 to a matrix W_l^1 of size [k1, k2], and q_l(X X^T) represents the eigenvector of the l-th principal component. The second stage is similar to the first. Assume that the size of the second-stage filters is [k3, k4] and their number is L2. In the first stage, the output of the l-th convolution filter for the i-th image is

I_i^l = I_i ∗ W_l^1, (6)

where l = 1, 2, . . ., L1, i = 1, 2, . . ., N, and ∗ represents two-dimensional convolution.
Applying the same mean-removal operation as in the first stage to the outputs of the l-th convolution filter, the result over all samples is described as

Y^l = [Ī_1^l, Ī_2^l, . . ., Ī_N^l], (7)

where Ī_i^l denotes the mean-removed blocks of I_i^l. Combining the results of each convolution kernel performing the operation shown in equation (7), we get

Y = [Y^1, Y^2, . . ., Y^{L1}]. (8)

Then, the second-stage convolution filter parameters can be obtained by

W_r^2 = mat_{k3,k4}(q_r(Y Y^T)), r = 1, 2, . . ., L2. (9)

The output of the second stage is

O_i^{l,r} = I_i^l ∗ W_r^2, (10)

where r = 1, 2, . . ., L2. In this way, each feature map input to the second stage produces L2 outputs. The third stage is the feature output stage, which includes binary hash coding, block histogram encoding, and downsampling operations.
The feature maps output by the second stage are binarized with the Heaviside function and assigned different weights to obtain the encoded decimal feature map, as shown in the following equation:

T_i^l = Σ_{r=1}^{L2} 2^{r−1} H(O_i^{l,r}), (11)

where H(·) represents the Heaviside function. The feature map T is divided into several blocks of the same size, and histogram statistics are computed for each block. All block histograms are concatenated to obtain the output feature, as described by

f_i = [B(T_i^1), B(T_i^2), . . ., B(T_i^{L1})], (12)

where B(·) stands for block histogram statistics.
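The binary hashing step can be sketched in pure Python (a minimal illustration with toy 1 × 2 feature maps; real PCANet maps are image-sized):

```python
def heaviside(x):
    """H(x) = 1 for positive input, 0 otherwise."""
    return 1 if x > 0 else 0

def binary_hash(feature_maps):
    """Encode L2 binarized second-stage maps into one decimal map:
    T = sum over r of 2^(r-1) * H(O_r), so each pixel lands in 0 .. 2^L2 - 1."""
    rows = len(feature_maps[0])
    cols = len(feature_maps[0][0])
    T = [[0] * cols for _ in range(rows)]
    for r, fmap in enumerate(feature_maps):  # r = 0 .. L2-1
        weight = 2 ** r                      # 2^(r-1) with 1-based r
        for i in range(rows):
            for j in range(cols):
                T[i][j] += weight * heaviside(fmap[i][j])
    return T

# Two toy 1 x 2 "second-stage outputs" (L2 = 2).
maps = [[[1.0, -1.0]], [[-1.0, 1.0]]]
print(binary_hash(maps))  # [[1, 2]]
```

The block histograms in the final step then simply count how often each of the 2^L2 code values occurs inside each block.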
PCANet is applied to extract effective AD classification features, with the functional connection matrix as input, and a linear SVM classifier outputs the auxiliary diagnosis results.

Multimodal Features Fusion Method.
sMRI and fMRI images have their own characteristics, providing information about AD from different angles. This information can be complemented through feature fusion, so as to obtain a more accurate description of the samples.
At present, there is little research on feature fusion in the field of AD-assisted diagnosis, and it mainly improves the diagnosis effect through simple concatenation of features. In this paper, we take the features extracted from the sMRI and fMRI data as the fusion objects. Since the images are from the same subjects and were obtained on very close dates, it can be assumed that certain correlations exist between the descriptions of the disease in sMRI and fMRI, so the two can be fused by analyzing the canonical correlation between the two feature vectors. Considering that the correlation may be nonlinear as well as linear, the features from the two modalities are fused with the KCCA [23] method.
KCCA is similar to CCA; it is the extension of the CCA method to kernel space. The difference is that the two sets of variables are first projected into a high-dimensional space before CCA is applied. The radial basis function (rbf) kernel is usually chosen to realize the space mapping, as shown in equation (13):

k(x, y) = exp(−‖x − y‖² / (2σ²)). (13)
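A minimal sketch of the rbf kernel and the resulting Gram matrix that KCCA operates on (toy vectors; the bandwidth σ below is illustrative, not the tuned value from the grid search):

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(samples, sigma=1.0):
    """Kernel (Gram) matrix over a list of feature vectors; KCCA runs CCA on
    such matrices instead of on the raw features."""
    return [[rbf_kernel(xi, xj, sigma) for xj in samples] for xi in samples]

K = gram_matrix([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
```

Because k(x, x) = 1 and k is symmetric, K always has a unit diagonal and K[i][j] = K[j][i].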
CCA [24] is a multivariate statistical analysis method that uses the correlation between pairs of comprehensive variables to reflect the overall correlation between two sets of indicators. The specific steps are as follows.
First, the pair of linear combinations with the greatest correlation is found, one from each group of variables, as the first pair of canonical variables, described by

u_1 = a_1^T Y, v_1 = b_1^T Z, (14)

where u_1 and v_1 represent the canonical variables and a_1 and b_1 are the canonical coefficient vectors, subject to the constraints

Var(u_1) = Var(v_1) = 1. (15)

Second, the second pair of canonical variables, uncorrelated with the first pair within each group, is found as the pair of linear combinations with the second largest correlation.
The process of finding canonical correlation variables is repeated, with each newly found pair uncorrelated with the existing ones in its group, until all the variables are extracted.
Assume that Y and Z are the sMRI and fMRI features and that A and B are the corresponding kernel canonical correlation coefficients. Differences in unit scale may exist between the features of the different modalities. If the features of the two modalities are fused directly, the features with large unit scales will dominate, while the features with small scales may be ignored. In order to eliminate the influence of unit and scale differences and to treat each feature dimension equally, the most common feature processing method, z-score standardization, is used to map the feature vectors to the same distribution:

z = (x − x̄) / σ, (16)
where x̄ is the mean value of the data and σ is the standard deviation. Since the canonical correlation variables within each group in KCCA are mutually uncorrelated, the linear combinations of the canonical variables of the two groups are also uncorrelated. The fusion of the two modalities' features can therefore be achieved by adding the corresponding canonical variables of the two sets, as shown in the following equation:

F = U + V, (17)

where U and V are the canonical variables obtained by projecting the two modalities' features with A and B.
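The standardization and additive fusion steps can be sketched as follows (a minimal illustration; the toy vectors below stand in for the projected canonical variables, not real sMRI/fMRI features):

```python
import math

def z_score(values):
    """Map one feature dimension to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def fuse_by_sum(u, v):
    """Additive fusion of corresponding canonical variables of the two modalities."""
    return [a + b for a, b in zip(u, v)]

# u, v stand in for the projected sMRI / fMRI canonical variables.
u = z_score([1.0, 2.0, 3.0, 4.0])
v = z_score([10.0, 30.0, 20.0, 40.0])
fused = fuse_by_sum(u, v)
```

After z-scoring, both vectors have mean 0 and standard deviation 1, so neither modality dominates the sum purely because of its unit scale.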

Experimental Setup and Model Evaluation
The experimental results in this paper are all obtained on a server equipped with an Nvidia TITAN Xp GPU, 32 GB RAM, a 256 GB SSD, a 2 TB HDD, a quad-core Intel Xeon E5-2620 v3 2.4 GHz processor, Windows 10, and CUDA 10.2. The training and testing sets account for 70% and 30% of the data, respectively.
In the sMRI feature extraction and classification experiments, the preprocessed gray matter images are input into the 3DShuffleNet network for training and classification. The training batch size is set to 4, and the Adam optimization algorithm is used. The weight decay value is set to 1e-3; the initial learning rate is set to 1e-3 and decays exponentially as training progresses, with an attenuation rate of 0.9. The total number of iterations is 50. The 3DShuffleNet model initializes the 3D convolution weights from a normal distribution. The weights of the BatchNorm3D layers are initialized to a fixed value of 1, the weights of the fully connected layer are initialized from a normal distribution with mean 0 and variance 0.001, and the bias values are all set to 0. In addition, to improve the reliability of the experimental results, the proposed model and the comparison models were each trained and tested 10 times.
In the fMRI feature extraction and classification experiments, the influence of the PCA kernel size, the number of PCA kernels, and the block size on the diagnosis results is explored.
In the multimodal data fusion experiment, the grid search method is used to tune the KCCA parameters. The classification results of the proposed method on the experimental dataset are then compared with those of CCA and serial fusion.
To effectively evaluate the proposed method, accuracy (Acc), sensitivity (Sen), specificity (Spec), Precision, Recall, F1 score, and AUC are calculated.
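These metrics (apart from AUC, which needs ranked scores) follow directly from confusion-matrix counts; a minimal sketch with made-up counts for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)      # sensitivity = recall (true positive rate)
    spec = tn / (tn + fp)     # specificity (true negative rate)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sen / (prec + sen)
    return {"Acc": acc, "Sen": sen, "Spec": spec,
            "Precision": prec, "Recall": sen, "F1": f1}

# Hypothetical counts, e.g. AD as the positive class in an AD versus NC run.
m = metrics(tp=40, tn=50, fp=10, fn=0)
```

Reporting sensitivity and specificity alongside accuracy matters here because the classes are unbalanced (e.g. 34 AD versus 50 NC), so accuracy alone can hide a bias toward the majority class.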

Classification Experiments of sMRI.
In order to prove the superiority of the 3D model proposed in this paper, some classic models are compared, and the results on sMRI data using 3DShuffleNet are shown in Table 2.
Table 2 shows that the proposed 3DShuffleNet has significant advantages, achieving better classification results on AD versus NC and AD versus MCI. However, the classification performance on MCI versus NC is poor. It is speculated that, on the one hand, because MCI is an early stage of AD, the brain gray matter structure has not changed significantly and the network finds it difficult to locate the disease characteristics; on the other hand, because the experimental samples are relatively scarce, the model is not fully trained. Similarly, because the difference between LMCI and EMCI is very slight, the LMCI versus EMCI result is the worst.
In addition, complexity is evaluated with FLOPs, the number of floating-point multiply-add operations. To demonstrate the advantages of the proposed 3DShuffleNet over other networks, its FLOPs are compared with those of the 3D forms of ResNet and DenseNet, which are widely used in image classification. 3DShuffleNet needs 0.79 GFLOPs of computation, much smaller than the comparison models: the 10-layer 3DResNet and the 121-layer 3DDenseNet require 38.97 and 89.71 GFLOPs, respectively. The parameter counts were also obtained: 3DShuffleNet has 957.72 thousand parameters, while the 10-layer 3DResNet and the 121-layer 3DDenseNet have 14.36 and 11.24 million parameters, respectively. The proposed network thus obtains relatively good classification results at a small computational cost.

Classification Experiments of fMRI.
If the AAL template is used to calculate the functional connections, a 90 × 90 or 116 × 116 functional connection matrix is obtained, using the 90 cerebrum regions or all 116 regions, respectively. Two datasets with different matrix sizes are therefore obtained. The whole-brain functional connection matrix is selected as the experimental data, and the effects of three variables on the classification results are analyzed separately. First, the impact of different PCA kernel sizes on the experimental results is compared and analyzed. The initial number of PCA kernels is set to L1 = L2 = 8, and the block size to 16. Because the categories are unbalanced, the average of sensitivity and specificity is used as the evaluation criterion. The detailed results are shown in Figure 3.
From Figure 3, we can see that, as the PCA kernel size increases, the classification result first improves and then worsens. It is speculated that this phenomenon is related to receptive field theory, as in traditional convolutional neural networks: the larger the receptive field, the more image information can be obtained, so PCANet gains better expressive ability. However, as the PCA kernel size continues to increase, the number of parameters soars, which reduces computing efficiency.
Next, the impact of the number of PCA kernels on the classification results is discussed, with the results shown in Figure 4. Because the PCA kernel size corresponding to the best classification result differs among the classification combinations, the PCA kernel sizes for the different combinations are set to 3 × 3, 5 × 5, 7 × 7, and 3 × 3, with the block size unchanged.
The experimental results show that, within a certain range, increasing the number of PCA kernels retains more data information as the dimension increases, which makes localization of the disease more accurate. When the number of PCA kernels exceeds a certain level, the experimental result decreases, because too many PCA kernels introduce noise.
Finally, the influence of the block size (for histogram calculation) on the robustness of the experimental results is analyzed. With the PCA kernel sizes set to 3 × 3, 5 × 5, 7 × 7, and 3 × 3 and the numbers of PCA kernels set to 8, 8, 6, and 6, the experimental results in Figure 5 are obtained.
The results show that an appropriate block size provides better robustness, but blindly increasing the block size sacrifices model performance. After the above controlled-variable optimization, the experimental results are shown in Table 3.
Considering that the PCA kernel size, the number of PCA kernels, and the histogram block size may affect each other, the grid search method is used for further experiments. The PCA kernel sizes searched are [3, 5, 7, . . ., 11], and the numbers of PCA kernels are [1, 2, . . ., 11]. The side length of the histogram block is set to a multiple of 4, with a maximum of half the side length of the functional connection matrix. The experimental results are shown in Table 4.
Both the controlled-variable method and the grid search method are used to adjust the parameters, with the whole-brain functional connection matrix as the experimental data. Tables 3 and 4 show that the grid search method is better than the controlled-variable method for parameter adjustment, because the three variables are closely related and influence each other. The experimental results obtained from the whole-brain and cerebrum-only functional connection matrices are also compared. In general, better performance is obtained using the whole-brain functional connection matrix as the experimental sample: the classification accuracies of AD versus NC and MCI versus NC both increased by 4%, while the classification accuracy of AD versus MCI was equal. The presumed reason is that, although AD lesions mainly appear in the cerebrum, when one brain area is affected, the connection characteristics of the other, intact brain areas also change.
Therefore, adding the cerebellum enriches the feature information and yields a better diagnosis result. In addition to the above results, we also apply our method to classify EMCI and LMCI. The results show that the PCANet network is sensitive to the progression from EMCI to LMCI and can observe the changes in functional brain characteristics, which demonstrates the feasibility and effectiveness of feature extraction using PCANet.

Classification Experiments of Feature Fusion.
In this paper, the z-score standardization method is used to normalize the features of the two modalities to the same scale, the KCCA feature fusion algorithm is used to obtain the fused sMRI and fMRI features, and an SVM classifier is used for training and recognition.
To demonstrate the effectiveness of the KCCA fusion algorithm, in addition to comparing the classification performance of single-modal features against fused features, the experimental results obtained with CCA and with serial concatenation are also compared. In the SVM classifier, the sigmoid kernel is used for training and recognition. The experimental results are shown in Table 5, where the sMRI features extracted by 3DShuffleNet are fused with the fMRI features extracted by PCANet.
Table 5 shows that, compared with the CCA fusion method, KCCA with the rbf kernel significantly improves the recognition results, realizing complementary information between the two modalities. The KCCA algorithm considers the influence of nonlinear features during feature fusion, which makes the feature description more reasonable and enhances the discriminative ability of the subsequent classifier. This also explains why feature fusion using CCA is not satisfactory. Compared with the traditional serial fusion method, the KCCA fusion algorithm still shows advantages in the experiments.

Conclusions
Using deep learning algorithms to assist doctors in diagnosing AD has broad research prospects, and feature fusion brings an obvious improvement. In this paper, 3DShuffleNet is used to build an sMRI-assisted diagnosis model, and PCANet is used to build an fMRI-assisted diagnosis model. Both methods achieve good results and can help with the correct diagnosis and early detection of AD. At the same time, fusion of the features of the two kinds of data is realized, and better classification results are obtained with multiple modalities than with a single modality. The addition of fMRI features not only further improves the diagnostic advantages of the sMRI-assisted diagnosis model on AD versus NC and AD versus MCI but also avoids the disadvantages of sMRI in the MCI versus NC and LMCI versus EMCI experiments. In addition, the multimodal method overcomes the shortcoming of single-modal recognition that target features cannot be fully exploited. The proposed method also has low requirements for computing capability, which helps its promotion in practical applications.

Data Availability
The data in this paper come from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, an open-source third-party database. The specific dataset of the experiment cannot be provided due to copyright reasons. For the experiments, subjects who have both fMRI and sMRI were selected: 34 cases of AD, 18 cases of early MCI, 18 cases of late MCI, and 50 cases of NC. ADNI database link: http://adni.loni.usc.edu/.

Figure 5:
Figure 5: Accuracy for different block sizes.

Table 3:
Experimental results of adjusting parameters by controlled variable method (%).

Table 4:
Experimental results of adjusting parameters by grid search method (%).

Table 5:
Comparison results of three feature fusion methods and single-modal method (%).