Feature Selection and Classification for High-Dimensional Incomplete Multimodal Data

Missing values make incomplete datasets ubiquitous in multimodal settings, yet most existing multimodal data fusion methods require complete data. We propose a feature selection and classification method for incomplete, high-dimensional multimodal data. The method focuses on extracting the most relevant features from the high-dimensional feature space and thereby improving classification accuracy. Experimental results on the ADNI and Office datasets show that our method performs considerably better on incomplete multimodal data than on complete data alone.


Introduction
In the Internet era, data come in many different modalities, such as images, video, and text. Different modalities can provide complementary information; therefore, multimodal classification generally achieves better accuracy and reliability than classification on an individual modality. The diagnosis of Alzheimer's Disease (AD) by multimodal classification is a good example and has achieved remarkable success compared to single-modality methods in multiple experiments. Pang et al. [1] explored the possibility of improving emotion prediction by exploiting highly nonlinear relationships between low-level features in different modalities. Zhang et al. [2] incorporated three modalities of biomarkers (structural MR imaging (MRI), Positron-Emission Tomography (PET), and cerebrospinal fluid (CSF)) to discriminate AD (or mild cognitive impairment (MCI)) from healthy controls. Pang et al. [3] recommended using multilabel multiple-kernel learning with visual and textual features for multilabel image classification. Hu et al. [4] utilized multimodal data including both tag features and visual features for popularity prediction on social media. Ballard [5] suggested a multimodal learning interface that could learn words from natural interactions with users. Liu et al. [6] introduced a multihypergraph learning (MHL) method to deal with multimodal data; this method achieved promising results in AD/MCI classification. Zhang et al. [7] proposed multimodal multitask learning to jointly predict multiple variables from multimodal data. Liu et al. [8] proposed linearized and kernelized sparse multitask learning for predicting cognitive outcomes in Alzheimer's Disease. Li et al. [9, 10] proposed a multitask deep learning method for diagnosing Alzheimer's Disease by combining MRI, PET, and the Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) with the restricted Boltzmann machine. Wang et al. [11] presented a novel multimodality multicenter classification method for autism spectrum disorder diagnosis; they regarded the classification of each imaging center as one task and solved the classification for all imaging centers by introducing task-task and modality-modality regularizations. Liu et al. [12] proposed a view-aligned hypergraph learning (VAHL) method and utilized incomplete multimodal data for AD/MCI diagnosis.
Complete data is a prerequisite of most existing multimodal data fusion methods. Since complete data require the modality types and the number of modalities to be consistent across samples, they are rare in reality. In the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, for example, only about 1/3 of the samples contain complete MRI, PET, and CSF data at baseline. Incomplete multimodal data are usually handled by imputing the missing values [13, 14] or by discarding incomplete samples; these approaches either waste data or introduce unpredictable noise. To address incomplete multimodal data, Zhao et al. [15] proposed an unsupervised method that transforms the original, incomplete data into a new, complete representation in a latent space. Thung et al. [16] identified AD patients from an incomplete multimodal dataset via matrix shrinkage and completion. Li et al. [17] proposed a pioneering work that handles the two-modality incomplete data case by projecting the partial data into a common latent subspace via nonnegative matrix factorization (NMF) and an $\ell_1$ sparse regularizer. Following this line, Shao et al. [18] proposed a similar idea with weighted NMF and an $\ell_{2,1}$ regularizer. Most existing methods for incomplete multimodal data are inefficient on high-dimensional data. Motivated by this, we propose a feature selection and classification method for incomplete multimodal high-dimensional data. Our method has the following features: (1) It focuses on incomplete data and makes full use of the data from different modalities without wasting data. (2) It selects the most relevant features in the high-dimensional space and facilitates the discovery of the inherent relationships between features. (3) It achieves better classification accuracy than the other methods.
The rest of the paper presents the details of the proposed approach, experiments on several datasets, and a comparison between our method and current state-of-the-art methods.
Feature selection in the first-order space can hardly reveal the high-order dependency relationships between features. Because incomplete data are limited in both the number of samples and the feature dimension, we need more features to discover the correlations among them. We can consider using different kernel functions to extend the data, for example, a linear kernel function or a Gaussian kernel function. The linear kernel function directly linearizes the data, which makes it difficult to reflect the correlations between the features and may result in a loss of information. The Gaussian kernel function is too expensive to calculate. Therefore, we transform low-dimensional features into high-dimensional features through an explicit nonlinear kernel feature map and then reveal the high-order correlations between features.
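As a concrete illustration of such an explicit expansion, the following minimal sketch builds the feature map of the degree-2 polynomial kernel $(\mathbf{x}^{T}\mathbf{y} + c)^2$ and checks numerically that the inner product of the mapped vectors equals the kernel value. The function names `quadratic_feature_map` and `quadratic_kernel` and the sample vectors are illustrative assumptions, not part of the proposed method.

```python
import math
from itertools import combinations


def quadratic_feature_map(x, c=1.0):
    """Explicit map phi such that phi(x) . phi(y) == (x . y + c) ** 2."""
    squares = [v * v for v in x]                                     # x_i^2
    cross = [math.sqrt(2.0) * a * b for a, b in combinations(x, 2)]  # sqrt(2) x_i x_j, i < j
    linear = [math.sqrt(2.0 * c) * v for v in x]                     # sqrt(2c) x_i
    return squares + cross + linear + [c]                            # constant term c


def quadratic_kernel(x, y, c=1.0):
    """Degree-2 polynomial kernel evaluated directly in the input space."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** 2


# Hypothetical input vectors, used only to exercise the two functions.
x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
phi_x = quadratic_feature_map(x)
phi_y = quadratic_feature_map(y)
explicit_value = sum(a * b for a, b in zip(phi_x, phi_y))
kernel_value = quadratic_kernel(x, y)
print(len(phi_x), explicit_value, kernel_value)
```

For an input of dimension $n$ the mapped vector has $n + n(n-1)/2 + n + 1 = (n + 2)(n + 1)/2$ entries, so the three-dimensional example above is expanded to 10 dimensions.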
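For degree-$d$ polynomials, the polynomial kernel is defined as

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^{T}\mathbf{y} + c)^{d}, \quad (1)$$

where $\mathbf{x}$ and $\mathbf{y}$ are vectors in the input space, and $c \ge 0$ is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. As a kernel, $K$ corresponds to an inner product in a feature space based on some mapping $\varphi$:

$$K(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle. \quad (2)$$

Let $d = 2$, and we get the special case of the quadratic kernel. After using the multinomial theorem and regrouping,

$$K(\mathbf{x}, \mathbf{y}) = \Big(\sum_{i=1}^{n} x_i y_i + c\Big)^{2} = \sum_{i=1}^{n} (x_i^2)(y_i^2) + \sum_{i=2}^{n} \sum_{j=1}^{i-1} (\sqrt{2}\, x_i x_j)(\sqrt{2}\, y_i y_j) + \sum_{i=1}^{n} (\sqrt{2c}\, x_i)(\sqrt{2c}\, y_i) + c^2. \quad (3)$$

From this, the explicit feature mapping of the polynomial kernel is

$$\varphi(\mathbf{x}) = \langle x_n^2, \ldots, x_1^2, \sqrt{2}\, x_n x_{n-1}, \ldots, \sqrt{2}\, x_2 x_1, \sqrt{2c}\, x_n, \ldots, \sqrt{2c}\, x_1, c \rangle. \quad (4)$$

Compared with the linear case, the second-order feature map contains the dependencies between feature pairs. The key problem of this explicit feature mapping is that the extended feature space is high dimensional: for a polynomial kernel expansion of degree $d$, the dimension of the feature map is $\binom{n+d}{d}$, which grows rapidly with $n$ and $d$. When $d = 2$ and $n$ is the original feature dimension, the extended dimension is $(n + 2)(n + 1)/2$.

In problem (5), the constant $C$ is a regularization parameter that trades off the model complexity against the fitness of the feature selection. By introducing the dual variable $\alpha \in \mathcal{A} = \{\alpha \mid \alpha_i \ge 0, \forall i\}$, the Lagrangian function $\mathcal{L}(\mathbf{w}, \xi, \alpha)$ of (5) can be written using the inner product $\langle \cdot, \cdot \rangle$. By setting the derivatives of $\mathcal{L}(\mathbf{w}, \xi, \alpha)$ with respect to $\mathbf{w}$ and $\xi$ to zero, we obtain the Karush-Kuhn-Tucker (KKT) conditions, which express $\mathbf{w}$ through the masked samples $\mathbf{x}_i \odot \mathbf{d}$ and the dual variable, and $\xi$ through the dual variable. By substituting these results into the Lagrangian function, problem (5) can be transformed into its dual formulation. Since the feature selection vector is a zero-one vector, this is still a nonconvex problem, which we relax following the convex relaxation in [19].

(1) Initialize $\alpha = \mathbf{1}$ and the constraint subset $\mathcal{C} = \emptyset$.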
(2) Find the most active constraint $\mathbf{d}_t$, and update the set $\mathcal{C}$ by $\mathcal{C} = \mathcal{C} \cup \{\mathbf{d}_t\}$.
Algorithm 1: The cutting plane algorithm.
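By introducing an additional variable $\theta \in \mathbb{R}$, the above problem can be converted into a convex quadratically constrained quadratic program (QCQP), problem (9). Problem (9) is very hard to solve directly because it contains a prohibitively large number of quadratic inequality constraints, one for each feasible $\mathbf{d} \in \mathcal{D}$, so we solve it by the cutting plane algorithm [20]. At each iteration, we generate an active constraint and add it to an active constraint set $\mathcal{C}$, which is initialized to the empty set $\emptyset$. The active constraint set $\mathcal{C}$ is a subset of $\mathcal{D}$; i.e., $\mathcal{C} \subseteq \mathcal{D}$. Based on the new active constraint set $\mathcal{C}$, we need to solve a QCQP problem to update $\alpha$. Specifically, we need to solve the following problem:

$$\max_{\alpha \in \mathcal{A},\, \theta \in \mathbb{R}} \ \theta \quad \text{s.t.} \quad \theta \le f(\alpha, \mathbf{d}_t), \ \forall \mathbf{d}_t \in \mathcal{C}. \quad (10)$$

The cutting plane algorithm is presented in Algorithm 1.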

Learning 𝑑.
The cutting plane algorithm mainly deals with how to find the most active constraint $\mathbf{d}_t$ of problem (10) at the $t$th iteration. Let $s_j = c_j^2$, and the optimization problem becomes problem (11). Because $d_j \in \{0, 1\}$, problem (11) can be solved by sorting the $s_j$ and then selecting the largest ones.
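The sorting step can be sketched as follows, assuming that problem (11) has the linear form $\max_{\mathbf{d}} \sum_j s_j d_j$ subject to $\|\mathbf{d}\|_1 \le B$ and $d_j \in \{0, 1\}$, which is consistent with the sorting-based solution described above. The function name `most_active_constraint`, the argument `budget` (standing for $B$), and the sample values are hypothetical.

```python
def most_active_constraint(scores, budget):
    """Return a 0/1 selection vector that keeps the `budget` largest scores."""
    # Indices of the features, ordered from the largest score to the smallest.
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    d = [0] * len(scores)
    for j in order[:budget]:
        d[j] = 1
    return d


# Hypothetical per-feature values c_j; the score is s_j = c_j ** 2.
c = [0.3, -1.2, 0.05, 0.9, -0.4]
s = [v * v for v in c]
d = most_active_constraint(s, budget=2)
print(d)
```

Because the assumed objective is linear in $\mathbf{d}$ and every $s_j$ is nonnegative, keeping the largest scores up to the budget is optimal, so no search over subsets is needed.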

The Optimization of 𝛼.
After updating the active constraint set $\mathcal{C}$, we solve problem (10) with the constraints defined by $\mathcal{C}$. Because the number of constraints in $\mathcal{C}$ is no longer large, we can use a subgradient method to solve this problem. However, obtaining the dual variables $\alpha$ is very expensive when the number of samples is very large.
For convenience, we define $\Omega(\omega) = \frac{1}{2}\big(\sum_{t=1}^{T} \|\omega_t\|\big)^2$ and $F(\omega) = \Omega(\omega) + p(\omega)$. We solve the primal problem by using the accelerated proximal gradient (APG) method [21], which minimizes a quadratic approximation of (12), in which $\nabla p$ denotes the gradient of $p$ at the point $\mathbf{v}$, $\tau > 0$ denotes the Lipschitz constant, and $\mathbf{g} = \mathbf{v} - (1/\tau)\nabla p(\mathbf{v})$. This requires solving the Moreau projection problem (14). Problem (14) has a unique closed-form solution, which can be obtained via the Moreau projection [22]. Let $S_\tau(\mathbf{g})$ be the optimal solution to problem (14) and $\mathbf{u} = [u_1, u_2, \cdots, u_T] \in \mathbb{R}^{T}$ be an intermediate variable. Then $S_\tau(\mathbf{g})$ is unique and can be calculated from $\mathbf{u}$ for each $t \in \{1, 2, \cdots, T\}$. The intermediate vector $\mathbf{u}$ can be calculated via the soft-threshold operator in [22, 23], and the threshold value can be calculated as in step (4) of Algorithm 2. The overall APG algorithm for solving problem (14) is summarized in Algorithm 3. We can obtain $\omega$ from APG and then predict the result of each modality from the masked features $\mathbf{x} \odot \mathbf{d}$. Eventually, the prediction results of the individual modalities are integrated to perform the classification.
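To make the projection step concrete, the sketch below assumes that problem (14) is the proximal step $\min_{\omega} \frac{\tau}{2}\|\omega - \mathbf{g}\|^2 + \frac{1}{2}\big(\sum_t \|\omega_t\|\big)^2$. Under that assumption every group keeps the direction of $\mathbf{g}_t$ and its norm is reduced by a common threshold, which is found after sorting the group norms. The function name `prox_squared_group_norm` and the sample input are hypothetical, and the threshold rule is derived from the assumed objective rather than copied from Algorithm 2.

```python
import math


def prox_squared_group_norm(groups, tau):
    """Minimize (tau/2)*||w - g||^2 + 0.5*(sum_t ||w_t||)^2 over w, group by group."""
    norms = [math.sqrt(sum(v * v for v in g)) for g in groups]
    s = 1.0 / tau
    running, threshold = 0.0, 0.0
    # Scan the norms in decreasing order; a group stays active while its norm
    # exceeds the common threshold implied by the groups scanned so far.
    for k, value in enumerate(sorted(norms, reverse=True), start=1):
        running += value
        candidate = s * running / (1.0 + k * s)
        if value - candidate > 0.0:
            threshold = candidate
        else:
            break
    result = []
    for g, norm in zip(groups, norms):
        shrunk = max(norm - threshold, 0.0)          # soft-thresholded group norm
        scale = shrunk / norm if norm > 0.0 else 0.0
        result.append([scale * v for v in g])
    return result


# Hypothetical input: two groups, the second one much smaller than the first.
result = prox_squared_group_norm([[3.0, 4.0], [0.1, 0.0]], tau=1.0)
print(result)
```

In this example the first group has norm 5 and is shrunk to norm 2.5, while the second group falls below the threshold and is set to zero, which illustrates how the projection removes weak feature groups.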

Performance Evaluation
3.1. Datasets. We evaluate the performance of our method on the ADNI and Office datasets. The ADNI dataset was launched in 2003 by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies, and nonprofit organizations. The primary purpose of the ADNI project was to study the effects of combining multiple biomarkers, such as MRI, PET, and CSF data, together with neuropsychological assessments, to predict the progression of MCI and early AD. We employ a 3-modality (MRI, PET, and CSF) dataset with 103 subjects, which include 51 AD patients and 52 healthy controls. The multimodal data have 189 features per subject: 93 features from the MRI image, another 93 features from the PET image, and 3 features from the CSF biomarkers. Since the feature dimension is relatively small, the nonlinear explicit expansion is used to expand the dimension of the data. After each term becomes one dimension, the 189 features are expanded into 8940 features. In this way we obtain high-order combined features related to the disease.
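The reported count of 8940 can be checked with the degree-2 dimension formula $(n + 2)(n + 1)/2$. The sketch below assumes that the expansion is applied to each modality separately, which is an inference from the arithmetic rather than a statement in the text; the name `expanded_dim` is illustrative.

```python
def expanded_dim(n):
    """Dimension of the explicit degree-2 polynomial feature map for n inputs."""
    return (n + 2) * (n + 1) // 2


# Feature counts per modality as reported for the ADNI data.
modalities = {"MRI": 93, "PET": 93, "CSF": 3}
per_modality = {name: expanded_dim(n) for name, n in modalities.items()}
total = sum(per_modality.values())
joint = expanded_dim(sum(modalities.values()))
print(per_modality, total, joint)
```

Expanding each modality on its own gives 4465 + 4465 + 10 = 8940 features, which matches the reported number, whereas expanding all 189 features jointly would give 18145.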
The Office dataset contains three domains: amazon (images downloaded from the Internet), webcam (low-resolution images captured by web cameras), and dslr (high-resolution images taken with digital SLR cameras). Each domain has 10 object classes. Surf and Decaf features are extracted for all the images, where Decaf-LeNet and Decaf-AlexNet denote the Decaf features obtained by training the LeNet and AlexNet models, respectively. The feature dimension of Surf is 800, and the feature dimension of each of the two Decaf features is 4096. Table 2 summarizes the Office dataset. We expand these features into 180902-dimensional features by applying the nonlinear explicit polynomial expansion.

Results on ADNI.
We first use a 10-fold cross-validation strategy to classify AD and healthy controls with a single modality. We select 29 samples as training data and 10 samples as testing data from the ADNI dataset. For robustness and repeatability, this process is repeated 10 times, and the average classification accuracy is reported as the final classification accuracy. The results are shown in Table 3. For complete data, the classification accuracies on the individual modalities MRI, PET, and CSF are 83.50%, 77.50%, and 78.70%, respectively. With the MRI and PET combination, the accuracy is 83.30%. With the PET and CSF combination, the accuracy is only 81.00%. The combination of all three biomarkers (MRI, PET, and CSF) achieves a classification accuracy of 81.40%. Furthermore, because complete data require every modality to be present, the incomplete dataset is larger than the complete one. Specifically, our multimodal classification method achieves a classification accuracy of 91.10% on incomplete data, while the classification accuracy on complete data is only 81.40%. As Table 3 shows, incomplete data yield much better performance than complete data in classifying AD and healthy controls. Incomplete data are also more flexible than complete data, because they make use of all valuable samples and waste no data.

Given input $\mathbf{g} = [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_T]$ and $s = 1/\tau$.
(2) Sort $\hat{u}$ to obtain $\bar{u}$ such that
Initialization: Initialize the Lipschitz constant $L_t = L_{t-1}$ and set $\omega_{-1} = \omega_0$ by warm start, $\tau_0 = L_t$, $\eta \in (0, 1)$, parameter $\varrho_{-1} = \varrho_0 = 1$, and
Algorithm 3: Accelerated proximal gradient for solving problem (10).
In Figure 2, we plot the classification accuracy on complete and incomplete data for different numbers of iterations. The classification accuracies on incomplete data are higher than those on complete data.
As mentioned in Section 2.3, $B$ controls the sparsity of feature selection and has an important effect on the feature selection process. Figure 3 shows that, when MRI is used, different values of $B$ produce different classification accuracies, so the classification accuracy depends strongly on the choice of $B$. In Figure 3, the mean classification accuracy is highest when $B = 30$; therefore, we choose $B = 30$. So far, our method demonstrates much better performance on incomplete data.
In Table 4, we use incomplete multimodal data to compare the proposed method with other methods, including the domain transfer support vector machine (denoted as DTSVM) [24] and the multiple-kernel learning method (denoted as MKL) proposed in [2], using Lasso for feature selection. Since our method uses a nonlinear explicit kernel expansion that maps features into a high-dimensional feature space, it is better at revealing high-order correlations between features. As Table 4 shows, our method outperforms the other methods for AD and HC classification, achieving a classification accuracy of 91.10% with 90.00% sensitivity and 91.38% specificity. These results further validate the effectiveness of our multimodal classification method.
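For reference, accuracy, sensitivity, and specificity can be computed from the confusion counts as in the sketch below, assuming that AD is treated as the positive class and HC as the negative class. The function name `binary_metrics` is illustrative, and the counts in the example are hypothetical values, not results from Table 4.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from binary confusion counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,    # fraction of all subjects classified correctly
        "sensitivity": tp / (tp + fn),    # fraction of positive (AD) subjects detected
        "specificity": tn / (tn + fp),    # fraction of negative (HC) subjects detected
    }


# Hypothetical counts for a test set of 20 subjects.
metrics = binary_metrics(tp=9, fn=1, tn=8, fp=2)
print(metrics)
```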

Results on Office Dataset.
In this section, we evaluate our method on the Office dataset, which includes three modalities: Surf, Decaf-LeNet, and Decaf-AlexNet. We start by performing image classification with our method on the different modalities. Then we compare the classification accuracy on incomplete and complete multimodal data on amazon, dslr, and webcam, respectively. In the experiments, we expand the feature dimension to 180902. We test the classification performance on the different datasets. Tables 5-7 show the comparison of classification accuracy between incomplete and complete multimodal data on amazon, dslr, and webcam, respectively. As Tables 5-7 show, the classification performance on incomplete multimodal data is better than that on complete multimodal data. We want to emphasize that our method maps the low-order features to a high-dimensional space, which helps to discover nonlinearly related features. Incomplete data not only make the best use of the precious samples but also exploit the inherent relations and knowledge across all modalities.

Conclusion
We proposed a feature selection and classification method for incomplete multimodal high-dimensional data. Our algorithm produces considerably better classification performance on incomplete data than on complete data. Incomplete data are more flexible than complete data: our method takes advantage of all valuable data samples and wastes no data. In addition, our method focuses on extracting the relevant features from

Figure 1: Multimodal classification framework based on high-dimensional feature selection.

Figure 2: Classification accuracy of complete and incomplete data with respect to different iterations.

Figure 3: Performances of our method using different B parameters.
2.3. Feature Selection and Classification. At first, we introduce a feature selection vector $\mathbf{d}$ with entries $d_j \in \{0, 1\}$, which are 1 for selected features and 0 otherwise. Let $\mathcal{D} = \{\mathbf{d} \mid d_j \in \{0, 1\}\}$ be the domain of $\mathbf{d}$. We use $\|\mathbf{d}\|_1 \le B$ to control the sparsity of the feature selection, where $B$ controls the number of selected features; then the proposed problem can be written as $\min_{\mathbf{d}} \min_{\mathbf{w}, \xi}$

Table 2: Summary of the Office dataset.
5 or 1.0 are considered as AD. Table 1 lists the demographics of all these subjects.

Table 3: Comparison of classification accuracy of incomplete and complete multimodal data.

Table 4: Comparison of performance of different multimodal classification methods.

Table 4 lists the comparison of different methods for AD and HC classification.

Table 5: Comparison of classification accuracy of incomplete and complete multimodality data on amazon.