Kernel-Based Multiview Joint Sparse Coding for Image Annotation

1 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, No. 10, Xitucheng Road, Haidian District, Beijing 100876, China
2 School of Electronics and Information Engineering, North China University of Technology, No. 5, Jinyuanzhang Road, Shijingshan District, Beijing 100144, China
3 School of Computer Science, North China University of Technology, No. 5, Jinyuanzhang Road, Shijingshan District, Beijing 100144, China


Introduction
With the surge of private and online images, automatic image annotation has become of great research interest in computer vision, since it is a vital step for image retrieval and management [1, 2]. Image annotation aims to automatically predict a set of semantic labels, such as "sea," "beach," and "sand," for unannotated images by learning the relevance among images. However, visually similar images may not be correlated with each other in semantics, which keeps the problem challenging.
In the past decade, considerable research efforts [3-12] have been made toward automatic annotation. They can be roughly classified into three types of models: discriminative models, generative models, and nearest neighbor models. Discriminative models [3-5] treat image annotation as a multilabel classification problem, in which each label is a single class. Given a test image, a semantic label is propagated to it only if the corresponding classifier decides that the image belongs to that class. However, this type of model neglects the correlation between different labels, which is also very important for image annotation. Generative models [6, 7] attempt to infer the correlations or joint probabilities between images and semantic concepts. By learning statistical models, the test image is annotated based on probability computation. However, these models contain many parameters, which leads to a heavy computational cost for parameter estimation. The nearest neighbor (NN) based model [8-12] predicts labels by taking a weighted combination of the label absence or presence among neighboring images. Because of their simplicity and efficiency, NN-based methods have attracted increasing research attention.

Mathematical Problems in Engineering
Recently, the sparse coding scheme and its variations have been successfully used in the image annotation task. It is substantially related to the NN-based model since the test image is often represented by a few representative samples in a low-dimensional manifold. For example, Wang et al. [13] reconstructed the test image sparsely and transferred labels via the sparse coefficients. Gao et al. [14] considered each (sub)class of images as a (sub)group and employed multilayer group sparsity to classify and annotate single-label images concurrently. Cao et al. [15] utilized the group sparse reconstruction framework to learn the label correlation for the dictionary and reconstructed the test image for label prediction under a weakly supervised setting. Lu et al. [16] presented a more descriptive and robust visual bag-of-words (BOW) representation through a semantic sparse recoding method for image annotation and classification. Jing et al. [17] learned a label-embedded dictionary as well as a sparse representation for image annotation. In addition, sparse representation can be further enhanced by exploiting a kernel mapping, which maps nonlinearly separable features into a higher-dimensional feature space, in which features with similar labels are closely grouped together while those without the same labels become linearly separable. Moran and Lavrenko [18] introduced a sparse kernel learning framework for the continuous relevance model, which can adaptively select different kernels for different features, and obtained great improvement.
Despite these efforts, most existing annotation methods combine the information from different image features by concatenating them into a long feature vector, which treats all the features equally and ignores their different contributions to the final decision. To solve this problem, Kalayeh et al. [19] introduced a multiview learning technique into image annotation, in which each type of feature as well as the label matrix is considered as a view and all the views are adaptively integrated to exploit the complementary information, but the sparsity prior on the training samples used to reconstruct the test image is not considered. Liu et al. [20] introduced a multiview sparse coding framework for semisupervised image annotation, but they assumed that the different views share a common sparse pattern, which ignores the diversity between the views. Yuan et al. [21] adopted a multitask joint sparse representation for image classification, allowing different coefficient representations for different tasks and enforcing the similarity among different tasks by joint sparsity. However, image annotation is a multilabel classification problem, which cannot use that framework directly.
Motivated by the previous research, we formulate image annotation as a kernel-based multiview joint sparse coding (KMVJSC) learning problem. Figure 1 describes the framework of KMVJSC. In particular, we map the feature views and the label view to an implicit high-dimensional space and integrate all the views adaptively by multiview learning to strengthen the power of the multiple modalities of an image. We aim to find a set of optimal sparse representations as well as dictionaries for each view simultaneously. On the one hand, different views of an image should have similar coding coefficients to jointly represent the same image; on the other hand, these coefficients should retain some diversity to reflect the distinctive properties of different views. Thus, we adopt different sparse coefficients for different views, allowing each view to be flexibly coded over its associated dictionary; at the same time, we employ a joint sparsity constraint to make the sparse coefficients among different views similar. The optimization algorithm and label prediction scheme under the proposed framework are developed. Given a test image, we also map its multiple features into the same kernel space and reconstruct the test image by joint sparse coding using the learned atom representation dictionary. The product of the atom representation dictionary and the corresponding sparse coefficients is considered as scores of the near neighbors of the test image, and a greedy label transfer scheme is used to obtain the annotation. Experiments on three datasets demonstrate the effectiveness of our proposed method and its competitive performance compared with related methods.
In summary, the major contributions of this paper are as follows: (1) An effective kernel-based multiview joint sparse coding framework is proposed and successfully applied in image annotation; (2) the optimization algorithm is proposed by extending the accelerated proximal gradient (APG) and K-singular value decomposition (KSVD) algorithms into a kernel-based multiview case; (3) a label prediction algorithm is proposed for kernel sparse representation framework based on the sparse reconstruction and weighted greedy label transfer scheme.
The rest of this paper is organized as follows: Section 2 briefly discusses the related work. Section 3 describes the details of our kernel-based multiview joint sparse representation, the optimization algorithm, and the label prediction scheme. Experimental results are reported and analyzed in Section 4. Finally, we conclude this paper in Section 5.

Related Work
In this section, we will review the related work including sparse representation, kernel sparse learning, and multiview learning.

Sparse Representation Based Image Annotation.
Sparse coding aims to represent the observed data as a linear combination of dictionary entries (or training samples), with the constraint that each image feature vector is represented by only a subset of all the available dictionary atoms. Denote the feature matrix formed by the original $N$ training samples as $\mathbf{X}=[\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_N]\in\mathbb{R}^{d\times N}$, where $\mathbf{X}_i\in\mathbb{R}^{d}$ $(i=1,2,\ldots,N)$ is the feature vector of the $i$th sample image and $d$ is the feature dimension. Then, the conventional sparse representation seeks a sparse coefficient matrix $\mathbf{W}=[\mathbf{W}_1,\mathbf{W}_2,\ldots,\mathbf{W}_N]\in\mathbb{R}^{n_d\times N}$ and the associated dictionary $\mathbf{D}=[\mathbf{D}_1,\mathbf{D}_2,\ldots,\mathbf{D}_{n_d}]\in\mathbb{R}^{d\times n_d}$ by solving the following optimization problem [22]:

$$\arg\min_{\mathbf{D},\mathbf{W}}\ \|\mathbf{X}-\mathbf{D}\mathbf{W}\|_F^2+\lambda\sum_{i=1}^{N}\|\mathbf{W}_i\|_1, \tag{1}$$

where $\|\mathbf{W}_i\|_1$ is the $\ell_1$-norm sparsity regularization and $\lambda$ is a trade-off parameter used to balance the sparsity and the reconstruction error.
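With the dictionary fixed, the coding step of problem (1) reduces to an independent lasso problem per sample, which can be solved, for example, by the iterative shrinkage-thresholding algorithm (ISTA). The following is a minimal sketch of that idea (our own illustration with our own variable names, not the authors' implementation):

```python
import numpy as np

def soft_threshold(z, t):
    """Entry-wise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_ista(X, D, lam, n_iter=200):
    """Approximately solve min_W ||X - D W||_F^2 + lam * sum_i ||W_i||_1
    for a fixed dictionary D using ISTA."""
    # Lipschitz constant of the gradient of the quadratic term
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    W = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ W - X)          # gradient of ||X - D W||_F^2
        W = soft_threshold(W - grad / L, lam / L)  # gradient step + shrinkage
    return W
```

Each iteration takes a gradient step on the reconstruction term and then shrinks the coefficients, so entries below the threshold become exactly zero, which is what produces the sparsity in (1).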
Figure 1: Schematic illustration of the proposed KMVJSC approach for automatic image annotation. First, we extract multiple features from all the training images; each type of feature, as well as the label matrix, is treated as a view. Then, we map the different views into a kernel space, where joint sparse coding and dictionary learning are conducted. To label a test image, we extract its multiple features and reconstruct the test image using the learned atom representation dictionary in the mapped space by joint sparse coding. Finally, the product of the atom representation dictionary and the corresponding sparse coefficients gives scores for the near neighbors of the test image, and a greedy label transfer scheme is used to obtain the annotation.
For label transfer, given a test image, [13] adopts all the training images as the dictionary and propagates the labels from the training images to the test image directly through the product of the label matrix of the training images with the learned sparse vector. The method of [14] predicts the test image labels based on the reconstruction errors in the (sub)groups and assigns labels from the (sub)group with the minimum reconstruction error to the test image. However, the discriminative power of sparse representation in the original space is still limited compared to the kernel space.

Kernel Sparse Representation.
Kernel sparse coding improves (1) by introducing a kernel trick on both the training images and the dictionary and has achieved great success in image classification. Let $\varphi(\mathbf{X})$ denote the matrix whose columns are obtained by embedding the input features with the mapping $\varphi$, and let $\varphi(\mathbf{D})=[\varphi(\mathbf{D}_1),\varphi(\mathbf{D}_2),\ldots,\varphi(\mathbf{D}_{n_d})]$ denote the dictionary in this mapped feature space; the kernel sparse representation problem [23] is defined as

$$\arg\min_{\mathbf{D},\mathbf{W}}\ \|\varphi(\mathbf{X})-\varphi(\mathbf{D})\mathbf{W}\|_F^2+\lambda\sum_{i=1}^{N}\|\mathbf{W}_i\|_1. \tag{2}$$

Equation (2) can be rewritten as (3), which depends only on a Mercer kernel function $\kappa(\cdot,\cdot)$, not on the mapping $\varphi$:

$$\arg\min_{\mathbf{D},\mathbf{W}}\ \sum_{i=1}^{N}\left[\kappa(\mathbf{X}_i,\mathbf{X}_i)-2\mathbf{W}_i^{T}\mathbf{K}_{\mathbf{D}\mathbf{X}_i}+\mathbf{W}_i^{T}\mathbf{K}_{\mathbf{D}\mathbf{D}}\mathbf{W}_i\right]+\lambda\sum_{i=1}^{N}\|\mathbf{W}_i\|_1, \tag{3}$$

where $\mathbf{K}_{\mathbf{D}\mathbf{D}}\in\mathbb{R}^{n_d\times n_d}$ is a positive semidefinite matrix whose entries $\kappa(\mathbf{D}_i,\mathbf{D}_j)$ are computed from the Mercer kernel, a function representing the nonlinear similarity between the two vectors $\mathbf{D}_i$ and $\mathbf{D}_j$. Some commonly used kernels include polynomial kernels, Gaussian kernels, and histogram intersection kernels (HIK).
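As an illustration, the histogram intersection kernel mentioned above can be computed directly from feature histograms; a minimal sketch (function name ours):

```python
import numpy as np

def hik(H1, H2):
    """Histogram intersection kernel between two sets of histograms
    stored as columns. Returns the Gram matrix K with
    K[i, j] = sum_k min(H1[k, i], H2[k, j])."""
    # Broadcast pairwise minima over the bin axis; fine for moderate sizes.
    return np.minimum(H1[:, :, None], H2[:, None, :]).sum(axis=0)
```

Note that the HIK is parameter-free, which is why it is chosen for the experiments in Section 4.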
For image classification, a given test image is assigned to the class with the minimum reconstruction error obtained by the learned dictionary and sparse coefficients for that class in the mapped space. However, image annotation is a multilabel classification problem and cannot use this decision rule for label transfer directly. In addition, the label transfer scheme [13] in the original space cannot be moved to the mapped space since the mapping function is unknown. Thus, the application of kernel sparse representation to image annotation has been limited.

Multiview Learning.
While the previous framework has proven successful for many tasks, it has only been applied to the single-view case. For inputs with multiple types of modalities, [20] proposed a multiview learning based sparse representation model, which rests on the following general framework:

$$\arg\min_{\{\mathbf{D}^{(v)}\},\mathbf{W}}\ \sum_{v=1}^{K}\left\|\mathbf{X}^{(v)}-\mathbf{D}^{(v)}\mathbf{W}\right\|_F^2+\lambda\sum_{i=1}^{N}\|\mathbf{W}_i\|_1. \tag{4}$$

Here, each image sample is represented by $K$ different feature views, $\mathbf{X}^{(v)}$ and $\mathbf{D}^{(v)}$ are the feature matrix of all the training images and the learned dictionary of the $v$th view, respectively, and $\mathbf{W}$ is the sparse coefficient matrix shared by all views.
In [20], labels are treated as an additional $(K+1)$th view; then $\mathbf{X}^{(K+1)}$ and $\mathbf{D}^{(K+1)}$ represent the label view matrix of the training samples and the associated label view dictionary, respectively. For a new test sample $\mathbf{y}=\{\mathbf{y}^{(1)},\mathbf{y}^{(2)},\ldots,\mathbf{y}^{(K)}\}$, the corresponding sparse code $\mathbf{w}$ can be obtained by solving the convex problem

$$\arg\min_{\mathbf{w}}\ \sum_{v=1}^{K}\left\|\mathbf{y}^{(v)}-\mathbf{D}^{(v)}\mathbf{w}\right\|_2^2+\lambda\|\mathbf{w}\|_1, \tag{5}$$

and the label view of the test sample is then predicted directly by $\mathbf{y}^{(K+1)}=\mathbf{D}^{(K+1)}\mathbf{w}$. Although [20] implements image annotation effectively, it adopts shared sparse coefficients for all views, which does not hold in reality because of the diversity of different views.

The Proposed Method
In this section, we introduce in detail our proposed method, the optimization procedure, and the label propagation algorithm. Throughout this paper, given a matrix $\mathbf{M}$, we use $\mathbf{M}_i$ to denote its $i$th column vector and $\mathbf{M}_{i,\cdot}$ to denote its $i$th row vector.

Kernel-Based Multiview Joint Sparse Representation.
Motivated by the previous works [20, 21], we use different sparse coefficient representations for different views, allowing the flexibility of sparse coefficients from distinct views; we then introduce a joint sparsity constraint into our kernel-based multiview sparse learning framework to keep the correlation among multiple views. Specifically, let $\{\mathbf{X}^1,\mathbf{X}^2,\ldots,\mathbf{X}^K\}$ be the features of $N$ training image samples obtained from $K$ different feature views, where each view contains the sample features $\mathbf{X}^v=[\mathbf{X}^v_1,\mathbf{X}^v_2,\ldots,\mathbf{X}^v_N]\in\mathbb{R}^{d_v\times N}$ and $\mathbf{X}^v_i$ is the $i$th sample. The label information is considered as another view $\mathbf{X}^{K+1}$, where $\mathbf{X}^{K+1}_i$ is the label vector of the $i$th image and each entry is either 1 or 0, representing the presence or absence of a given label in the image. Let $\mathbf{D}^v=[\mathbf{D}^v_1,\mathbf{D}^v_2,\ldots,\mathbf{D}^v_{n_d}]\in\mathbb{R}^{d_v\times n_d}$ be an overcomplete dictionary ($n_d>d_v$), where $\mathbf{D}^v_j$ denotes the $j$th column of $\mathbf{D}^v$. Let $\varphi(\mathbf{X}^v)$ and $\varphi(\mathbf{D}^v)$ denote the sample matrix and the dictionary of the $v$th view in the space induced by the mapping $\varphi$, respectively. Then, the objective function of kernel-based multiview joint sparse representation is defined as

$$\arg\min_{\{\mathbf{D}^v\},\{\mathbf{W}^v\}}\ \sum_{v=1}^{K+1}\left\|\varphi(\mathbf{X}^v)-\varphi(\mathbf{D}^v)\mathbf{W}^v\right\|_F^2+\lambda\sum_{i=1}^{N}\|\mathbf{\Omega}_i\|_{1,2}, \tag{6}$$

where $\mathbf{W}^v_i\in\mathbb{R}^{n_d}$ is the sparse representation coefficient of $\mathbf{X}^v_i$ over the dictionary $\mathbf{D}^v$ in the mapped space, $\mathbf{\Omega}_i=[\mathbf{W}^1_i,\mathbf{W}^2_i,\ldots,\mathbf{W}^{K+1}_i]$ $(1\le i\le N)$, and $\|\mathbf{\Omega}_i\|_{1,2}$ is defined as the sum of the $\ell_2$-norms of all rows of the matrix $\mathbf{\Omega}_i$, which encourages $\mathbf{\Omega}_i$ to be sparse in the column direction and dense in the row direction. This helps enforce the sparse coefficients of different views to share a similar pattern. $\lambda$ is the tuning parameter controlling the regularization term.
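To make the joint sparsity penalty concrete, the following small helper (our own illustration) computes the sum of the row-wise L2 norms; penalizing this quantity drives entire rows of the coefficient matrix to zero, so all views of a sample end up selecting the same small set of dictionary atoms:

```python
import numpy as np

def l12_norm(Omega):
    """The L_{1,2} norm of Omega (n_d x (K+1)): the sum of the L2 norms
    of its rows. Rows correspond to dictionary atoms, columns to views."""
    return np.linalg.norm(Omega, axis=1).sum()
```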
The dictionaries from different views must be correlated with each other since they are learned from the same training images. To keep this correlation, we set $\mathbf{D}^v=\mathbf{X}^v\mathbf{A}$, since the dictionary atoms lie within the subspace spanned by the input data, where $\mathbf{A}\in\mathbb{R}^{N\times n_d}$ is called the atom representation dictionary and $\mathbf{X}^v$ is called the base dictionary [24]. Then, (6) can be rewritten as

$$\arg\min_{\mathbf{A},\{\mathbf{W}^v\}}\ \sum_{v=1}^{K+1}\left\|\varphi(\mathbf{X}^v)-\varphi(\mathbf{X}^v)\mathbf{A}\mathbf{W}^v\right\|_F^2+\lambda\sum_{i=1}^{N}\|\mathbf{\Omega}_i\|_{1,2}. \tag{7}$$

Compared with a dictionary built by using all the training samples directly, in which each dictionary atom corresponds to a training sample, the dictionaries learned in (6) have no explicit physical meaning in their structure; that is, atoms located in the same column of different dictionary views may not originate from the same training sample, and the correlations among different views are lost. By introducing $\varphi(\mathbf{D}^v)=\varphi(\mathbf{X}^v)\mathbf{A}$, we can learn a shared implicit dictionary $\mathbf{A}$ among multiple views in a similar way as (6), while the explicit nature (atom location information) of the dictionary is maintained in the base dictionary $\varphi(\mathbf{X}^v)$, which is very important for the subsequent label transfer in kernel space.

Optimization.
To solve the problem in (7), we adopt an iterative procedure that alternately optimizes [25] with respect to $\mathbf{\Omega}_i$ $(i=1,2,\ldots,N)$ and $\mathbf{A}$ while holding the other fixed. In the following, we provide a brief description of the alternating optimization for our method. For convenience, the Notation list collects the important notations used in this paper.
First, keeping $\mathbf{A}$ fixed, the problem in (7) simplifies to

$$\arg\min_{\{\mathbf{W}^v\}}\ \sum_{v=1}^{K+1}\left\|\varphi(\mathbf{X}^v)-\varphi(\mathbf{X}^v)\mathbf{A}\mathbf{W}^v\right\|_F^2+\lambda\sum_{i=1}^{N}\|\mathbf{\Omega}_i\|_{1,2}. \tag{8}$$

In the next step, keeping $\mathbf{\Omega}_i$ fixed, the problem in (7) simplifies to

$$\arg\min_{\mathbf{A}}\ \sum_{v=1}^{K+1}\left\|\varphi(\mathbf{X}^v)-\varphi(\mathbf{X}^v)\mathbf{A}\mathbf{W}^v\right\|_F^2, \tag{9}$$

which can also be represented in the matrix form

$$\arg\min_{\mathbf{A}}\ \sum_{v=1}^{K+1}\left[\operatorname{tr}\!\left(\mathbf{K}^v\right)-2\operatorname{tr}\!\left(\mathbf{K}^v\mathbf{A}\mathbf{W}^v\right)+\operatorname{tr}\!\left((\mathbf{W}^v)^{T}\mathbf{A}^{T}\mathbf{K}^v\mathbf{A}\mathbf{W}^v\right)\right]. \tag{10}$$

Here, $\mathbf{K}^v=\varphi(\mathbf{X}^v)^{T}\varphi(\mathbf{X}^v)$ is the kernel Gram matrix of the $v$th view. This process is iterated until the solutions of $\mathbf{\Omega}_i$ and $\mathbf{A}$ converge to a local minimum. The following gives a detailed description of the optimization algorithms solving problems (8) and (10).
Learning Sparse Codes. Equation (8) can be decoupled into $N$ distinct subproblems; the $i$th $(1\le i\le N)$ subproblem is formulated as

$$\arg\min_{\mathbf{\Omega}_i}\ \sum_{v=1}^{K+1}\left\|\varphi(\mathbf{X}^v_i)-\varphi(\mathbf{X}^v)\mathbf{A}\mathbf{W}^v_i\right\|_2^2+\lambda\|\mathbf{\Omega}_i\|_{1,2}. \tag{11}$$

The objective function in (11) is a nonsmooth convex function since $\|\mathbf{\Omega}_i\|_{1,2}$ is nondifferentiable; it can be solved by the accelerated proximal gradient (APG) method [26, 27], extended so that the signals now live in the multiview kernel space with a very high dimension.
Then, the optimization process can be iterated alternately by generalized gradient mapping and aggregation steps [27].
In the generalized gradient mapping step, we denote

$$F(\mathbf{\Omega}_i)=\sum_{v=1}^{K+1}\left\|\varphi(\mathbf{X}^v_i)-\varphi(\mathbf{X}^v)\mathbf{A}\mathbf{W}^v_i\right\|_2^2; \tag{12}$$

then, the gradient of $F(\mathbf{\Omega}_i)$ with respect to $\mathbf{W}^v_i$ is calculated as

$$\nabla_{\mathbf{W}^v_i}F=2\mathbf{A}^{T}\mathbf{K}^v\mathbf{A}\mathbf{W}^v_i-2\mathbf{A}^{T}\mathbf{K}^v_i, \tag{13}$$

where $\mathbf{K}^v_i$ denotes the $i$th column of $\mathbf{K}^v$. We then apply the gradient mapping to (11):

$$\mathbf{\Omega}_i=\arg\min_{\mathbf{\Omega}}\ \lambda\|\mathbf{\Omega}\|_{1,2}+\frac{\eta}{2}\left\|\mathbf{\Omega}-\left(\mathbf{U}-\frac{1}{\eta}\nabla F(\mathbf{U})\right)\right\|_F^2. \tag{14}$$

It comprises the regularization term $\|\mathbf{\Omega}\|_{1,2}$ and the approximation of $F(\mathbf{\Omega}_i)$ by its first-order Taylor expansion at $\mathbf{U}$, regularized by the Euclidean distance between $\mathbf{\Omega}$ and $\mathbf{U}$, where $\eta$ is a parameter controlling the step penalty.
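The gradient-mapping subproblem above is separable across rows and has a closed-form solution given by the proximal operator of the scaled L_{1,2} norm, that is, row-wise group soft-thresholding. A minimal sketch (helper name ours):

```python
import numpy as np

def prox_l12(V, t):
    """Proximal operator of t * ||.||_{1,2}: row-wise group soft-thresholding.
    Each row of V is shrunk toward zero by t in L2 norm; rows whose norm is
    below t become exactly zero, giving joint sparsity across views."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return scale * V
```

In the gradient mapping step, this operator would be applied to the point obtained from the gradient step, with threshold proportional to the regularization weight over the step parameter.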

Label Prediction.
The proposed method treats labels as an additional view, so we can infer the label information from the sparse coefficients. In particular, given a test image represented by multiple feature views $\{\mathbf{y}^1,\mathbf{y}^2,\ldots,\mathbf{y}^K\}$, the visual features $\mathbf{X}^v$ of the training images, the learned atom representation dictionary $\mathbf{A}$, and the labels of the training images, we obtain the sparse coefficient vectors $\{\boldsymbol{\beta}^1,\boldsymbol{\beta}^2,\ldots,\boldsymbol{\beta}^K\}$ $(\boldsymbol{\beta}^v\in\mathbb{R}^{n_d},\ v=1,2,\ldots,K)$ for the test image over the learned dictionary by solving the following convex problem, which is solved in the same way as problem (11):

$$\arg\min_{\{\boldsymbol{\beta}^v\}}\ \sum_{v=1}^{K}\left\|\varphi(\mathbf{y}^v)-\varphi(\mathbf{X}^v)\mathbf{A}\boldsymbol{\beta}^v\right\|_2^2+\lambda\left\|[\boldsymbol{\beta}^1,\boldsymbol{\beta}^2,\ldots,\boldsymbol{\beta}^K]\right\|_{1,2}. \tag{18}$$

Since the magnitude of the product term $\mathbf{A}\boldsymbol{\beta}^v$ can be considered as the importance of the corresponding training samples in the reconstruction of the test image, the priority of the samples used for label propagation is based on the magnitude of this product term. Although $\boldsymbol{\beta}^{K+1}$ is unknown, it must share a similar pattern with $\boldsymbol{\beta}^v$ $(1\le v\le K)$, since they represent the same image. So, for each view $v$, we choose the samples in $\mathbf{X}^{K+1}$ corresponding to the top five values in $\mathbf{A}\boldsymbol{\beta}^v$. The tag propagation for the test image is then based on a weighted version of the greedy label transfer scheme [8]. The proposed label prediction and propagation procedure is summarized in Algorithm 2.

Algorithm 2: Label prediction procedure.
Input: the feature views of the test image $\{\mathbf{y}^1,\ldots,\mathbf{y}^K\}$; the training features $\mathbf{X}^v$ and label view $\mathbf{X}^{K+1}$; the atom representation dictionary $\mathbf{A}$; the tuning parameter $\lambda$.
Output: predicted labels $\mathbf{y}^{K+1}$ for the test image.
(1) Calculate $\boldsymbol{\beta}^v$ $(v=1,2,\ldots,K)$ by Eq. (18).
(2) For $v=1,2,\ldots,K$:
(3)   Find the five largest values in $\mathbf{A}\boldsymbol{\beta}^v$ and their corresponding indices.
(4)   Find the sample label column vectors in $\mathbf{X}^{K+1}$ corresponding to these indices.
(5) End for.
(6) Compute and rank the weighted frequency of the labels appearing in all the found samples, with weights equal to the corresponding entries of $\mathbf{A}\boldsymbol{\beta}^v$.
(7) Transfer labels according to their weighted frequency.
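The weighted greedy label transfer can be sketched as follows (a simplified illustration with hypothetical helper names, not the authors' code; it assumes the per-view codes have already been computed):

```python
import numpy as np

def transfer_labels(A, betas, label_matrix, n_labels=5, top_k=5):
    """Weighted greedy label transfer sketch.
    A            : (N, n_d) atom representation dictionary
    betas        : list of per-view sparse codes, each of shape (n_d,)
    label_matrix : (n_total_labels, N) binary label view of the training set
    For each view, the top_k training samples with the largest entries of
    A @ beta vote for their labels, weighted by those entries."""
    scores = np.zeros(label_matrix.shape[0])
    for beta in betas:
        s = A @ beta                          # importance of each training sample
        idx = np.argsort(s)[-top_k:]          # most important samples
        for i in idx:
            scores += s[i] * label_matrix[:, i]  # weighted label frequency
    return np.argsort(scores)[::-1][:n_labels]   # indices of top-ranked labels
```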

Experiments
To validate the effectiveness of the proposed KMVJSC for automatic image annotation task, we conduct experiments and compare the results with related algorithms on three popular benchmark databases.

Experimental Settings.
The three datasets we used to evaluate the performance of our approach are as follows.
The Corel5K dataset is the most popular dataset for annotation evaluation [8-10, 13, 17-19, 28, 29]. It consists of 5000 images from 50 different topics, such as "beach," "aircraft," and "horse," and each topic includes 100 similar images. The training and testing sets contain 4500 and 500 images, respectively. Each image is manually annotated with 1 to 5 labels from a dictionary of 260 labels, with 3.5 labels on average.
The IAPRTC12 dataset [30] consists of 19,627 images of natural scenes dealing with sports and actions and photographs of people, animals, cities, landscapes, and so on. In this dataset, 17,665 images are selected for training and the remaining 1962 images are used for testing. The number of labels in this dataset is 291, and each image has up to 23 labels, with an average of 5.7 labels per image. Besides, each label is associated with 347 images on average.
MIRFlickr25K [31] contains 25,000 images collected from the social photography site https://www.flickr.com and equally split into training and test sets. It provides 38 labels as the ground-truth annotation, such as "animals," "baby," and "baby*," where words with "*" indicate a stricter annotation for those labels. Each image is associated with up to 17 labels, with 4.7 labels on average, while each label is associated with 1560 images on average. Besides, the dataset provides 1386 tags. Since the Flickr tags are noisy, we kept the tags that appear at least 50 times, resulting in a vocabulary of 457 tags, and use it as another type of feature.
We adopt histogram intersection kernel as the kernel function because it is a parameter-free kernel, and it has obtained excellent performance in evaluating the similarity between two histograms [14].
Following [32], we adopt 15 visual features but limit the size of the large feature vectors. Specifically, in order to reduce the computational complexity of both the training and testing procedures, we use a color histogram with 10 bins per color channel. For features encoded with spatial layout information, we quantize the color with 8 bins in each channel and reduce the k-means cluster centers to 500 for SIFT features. Thus, the 15 distinct features include 3 color features (RGB, LAB, HSV, 1000D), 2 SIFT features (DenseSIFT, HarrisSIFT, 1000D), 2 Hue features (DenseHue, HarrisHue, 100D), the 7 above features with layout encoding (RGBV3H1, LABV3H1, HSVV3H1, 1536D; DenseSIFTV3H1, HarrisSIFTV3H1, 1500D; DenseHueV3H1, HarrisHueV3H1, 300D), and a GIST feature (512D). In addition, for the MIRFlickr25K dataset, we endow each image with a binary vector indicating the absence or presence of each tag, which leads to a Tag feature of length 457D.
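As an illustration of the histogram features, a color histogram with 10 bins per channel can be computed along the following lines (a sketch with our own function name; the paper's exact binning and normalization may differ):

```python
import numpy as np

def per_channel_histogram(image, bins=10):
    """Per-channel color histogram with `bins` bins per channel,
    concatenated and L1-normalized.
    image: (H, W, 3) uint8 array in any color space."""
    feats = []
    for c in range(3):
        h, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        feats.append(h)
    hist = np.concatenate(feats).astype(float)
    return hist / hist.sum()  # normalize so histograms are comparable
```

A histogram feature of this kind pairs naturally with the histogram intersection kernel adopted above.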
There are two parameters to be tuned, namely, the dictionary size $n_d$ and the regularization parameter $\lambda$. In this paper, $n_d$ is selected from {2000, 2500, 3000, 3500, 4000, 4500}, and $\lambda$ is selected from {0.0001, 0.001, 0.01, 0.1, 1, 10}. We perform KMVJSC with fivefold cross validation on the training set and conduct a sensitivity analysis in Section 4.3. Finally, we set the parameters of our KMVJSC algorithm to $n_d=2000$, $\lambda=0.1$ on the Corel5K dataset and $n_d=3000$, $\lambda=0.001$ on the other two datasets. Because of random entries in the initialization, we repeat all the experiments 5 times and report the average results.
Following most research [10, 13, 17, 19, 32], all the test images are annotated with the top 5 relevant labels. To estimate performance, we calculate precision ($P$), recall ($R$), and the $F_1$-measure for each label. For a given label $w$, the three metrics are defined by $P=N_c/N_p$, $R=N_c/N_g$, and $F_1=2\times P\times R/(P+R)$. Here, $N_g$ is the number of images labeled with $w$ in the ground truth, $N_p$ is the number of images labeled with $w$ by our automatic annotation algorithm, and $N_c$ is the number of images correctly labeled with $w$. We report the mean value over all labels for each metric. Besides, the number of labels with nonzero recall ($N^+$) is also used.
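These per-label metrics can be computed as follows (a sketch assuming binary image-by-label matrices for the ground truth and the predictions):

```python
import numpy as np

def per_label_metrics(Y_true, Y_pred):
    """Mean per-label precision, recall, F1, and N+ (number of labels with
    nonzero recall). Y_true, Y_pred: (n_images, n_labels) binary matrices."""
    Ng = Y_true.sum(axis=0)              # images carrying each label (ground truth)
    Np = Y_pred.sum(axis=0)              # images assigned each label (predicted)
    Nc = (Y_true * Y_pred).sum(axis=0)   # correctly labeled images per label
    P = np.divide(Nc, Np, out=np.zeros_like(Nc, dtype=float), where=Np > 0)
    R = np.divide(Nc, Ng, out=np.zeros_like(Nc, dtype=float), where=Ng > 0)
    F1 = np.divide(2 * P * R, P + R, out=np.zeros_like(P), where=(P + R) > 0)
    return P.mean(), R.mean(), F1.mean(), int((R > 0).sum())
```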

The Compared Methods.
We compare our KMVJSC with several sparse coding based baselines, including multilabel sparse coding (MSC) [13], kernel sparse coding (KSC) [23], multiview Hessian discriminative sparse coding (MVHDSC) [20], multiview joint sparse coding (MVJSC) in the original space, and kernel-based multiview sparse coding (KMVSC) without the joint sparsity constraint across views. For the single-view methods (MSC and KSC), we concatenate the multiple features into a long vector. For KSC, we use the same label transfer scheme as our method except that the learned dictionary and reconstruction coefficients are used directly; for MVJSC, we calculate the average sparse coefficient over the multiple feature views, use it as the sparse coefficient of the label view, and transfer labels in the same way as MVHDSC; and for KMVSC, we use the same label transfer method as our KMVJSC. Moreover, we take the same parameter selection strategy for these methods, and the training, validation, and test sets are exactly the same as those used for our method.

Experimental Results
Table 1 lists some image examples from the three datasets and the labels predicted by our method, together with the ground truths. Each listed example contains at least one mismatched label compared with the ground-truth labels (perfectly matched annotations are not listed here), and the differences in the predicted labels are marked in italic font. The results in Table 1 show that, in many cases, some predicted labels missed by the ground-truth annotation can still explain the image well, such as "meadow" for the first image from the Corel5K dataset. Besides, some semantically similar words are also treated as errors, such as "cathedral" for the last image from the IAPRTC12 dataset, which shows the potential effectiveness of our proposed method for the automatic image annotation task.
The annotation performance of our proposed algorithm on the three datasets, along with that of the other related methods, is listed in Table 2, where the methods with "*" denote implementations using our features, and KMVJSC+T refers to annotation with our method using the additional Tag feature for MIRFlickr25K besides the 15 feature views and the label view. The best values are shown in bold. From the results in Table 2, we can make the following observations.
(1) For the single-view methods, KSC is slightly better than MSC; for the multiview methods, the kernel methods KMVSC and KMVJSC both outperform MVJSC, which demonstrates the power of kernel mapping. We can also see that MVHDSC is still comparable with KMVSC, probably because MVHDSC additionally enforces structured sparsity on the dictionary.

(2) All multiview methods dramatically outperform the single-view methods on all datasets. Although MSC takes multilabel information into consideration, its performance is inferior to that of the multiview methods, including MVJSC, which shows that multiview learning can harness the label view and the feature views in a more natural way and capture both the relations among multiple labels and those between labels and visual features for discrimination.

(3) Our KMVJSC clearly outperforms KMVSC, which neglects the joint sparsity across multiple views; this shows that the joint sparsity constraint across different views is valuable for enforcing robustness in coefficient estimation among different views.

(4) Adding the Flickr tags as features further improves the annotation performance, which illustrates that tags are another type of information complementary to the visual features and the label view and can be effectively used in multiview learning.

Sensitivity Analysis of Parameters.
We investigate the sensitivity of the parameters $n_d$ and $\lambda$ in our approach on the three datasets. The corresponding results are presented in Figures 2 and 3.
It should be noted that the results in these two figures are obtained by fivefold cross validation on the training image sets of the three datasets, respectively. Figure 2 illustrates the $F_1$ measure versus different dictionary sizes with $\lambda$ fixed (0.1 for the Corel5K dataset and 0.001 for the other two datasets). From the results, we observe that although the performance in terms of the $F_1$ measure varies with the value of $n_d$, it is not very sensitive to this choice and remains fairly stable. Considering that a high dimension corresponds to an expensive computing cost, we set $n_d=2000$ for the Corel5K dataset and $n_d=3000$ for the IAPRTC12 and MIRFlickr25K datasets to balance performance and cost. Figure 3 shows the $F_1$ measure versus different $\lambda$ with the selected $n_d$ fixed. It is observed that the $\ell_{1,2}$-norm joint sparsity regularization term is effective when $\lambda$ is neither too small nor too large: if it is too small, the correlations between different views are lost, while large values limit the flexibility of the individual views. The best result of our approach is reached at $\lambda=0.1$ on the Corel5K dataset and $\lambda=0.001$ on the other two datasets.

Conclusions
In this paper, we have presented a kernel-based multiview joint sparse coding framework for the image annotation problem. It learns a set of optimized sparse representations as well as dictionaries in a multiview kernel space. We consider the label view (or tag view) as an additional view and adaptively utilize the relationship between the label view and the visual feature views to find a more discriminative sparse representation by incorporating multiview learning and the $\ell_{1,2}$-norm joint sparsity. We extend KSVD and APG to a multiview kernel version to solve our optimization problem. The priority of the sparse coefficients in the kernel space is used to propagate labels to the test images by a weighted greedy label transfer scheme. Experimental results on three widely used datasets show the effectiveness of our proposed method compared to the related methods for image annotation tasks. Considering that some non-ground-truth annotations can still describe the image well in our experiments, in future research we plan to involve human perception [33] to measure the relevancy between the non-ground-truth annotations and the images, showing the potential flexibility of our proposed framework. Besides, we are going to apply deep learning features [10, 34] and multiple kernel learning [18] in our framework.

Figure 2: $F_1$ measure of our KMVJSC with $n_d$ changing on the training sets of the three datasets. The horizontal axis represents the size of the dictionary, from 2000 to 4500 with a step size of 500; the vertical axis represents the $F_1$ performance.

Figure 3: $F_1$ measure of our KMVJSC with $\lambda$ changing on the training sets of the three datasets. The horizontal axis represents the tuning parameter $\lambda$, from $10^{-4}$ to 10 in multiplicative steps of 10; the vertical axis represents the $F_1$ performance.

Notation
$N$: number of training samples with labels
$\mathbf{X}^v$: training sample data of the $v$th view
$\mathbf{X}^v_i$: $i$th training sample of the $v$th view
$d_v$: dimension of the $v$th view
$\varphi$: kernel mapping
$\mathbf{D}^v$: dictionary of the $v$th view
$K$: number of feature views
$n_d$: number of dictionary atoms
$\mathbf{A}$: atom representation dictionary
$\mathbf{W}^v_i$: sparse coefficient of $\mathbf{X}^v_i$ over $\mathbf{D}^v$ in the $\varphi$-mapped space
$\mathbf{W}^v$: $[\mathbf{W}^v_1, \mathbf{W}^v_2, \ldots, \mathbf{W}^v_N]$
$\mathbf{\Omega}_i$: $[\mathbf{W}^1_i, \mathbf{W}^2_i, \ldots, \mathbf{W}^{K+1}_i]$
$\lambda$: parameter of the regularization term
$\mathbf{y}^v$: test image data of the $v$th view
$\boldsymbol{\beta}^v$: sparse coefficient of $\mathbf{y}^v$ over $\mathbf{D}^v$ in the $\varphi$-mapped space

Table 2: Performance comparison of different methods on the Corel5K, IAPRTC12, and MIRFlickr25K datasets. Methods with "*" denote implementations using our features. The best scores are highlighted in bold. Using the same features, KMVJSC gets the best results on all three datasets. KMVJSC+T is the KMVJSC method using an additional Tag feature on the MIRFlickr25K dataset.