Online Multikernel Learning Based on a Triple-Norm Regularizer for Semantic Image Classification

Currently, image classifiers based on multikernel learning (MKL) mostly use a batch approach, which is slow and difficult to scale up to large datasets. In the meantime, the standard MKL model neglects the correlations among examples associated with a specific kernel, which makes it infeasible to adjust the kernel combination coefficients. To address these issues, a new and efficient multikernel multiclass algorithm called TripleReg-MKL is proposed in this work. Taking the principle of strongly convex optimization into consideration, we propose a new triple-norm regularizer (TripleReg) to constrain the empirical loss objective function, which exploits the correlations among examples to tune the kernel weights. The algorithm combines a multivariate hinge loss with a conservative updating strategy to filter noisy samples, thereby reducing the model complexity. This novel MKL formulation is then solved in an online mode using a primal-dual framework. A theoretical analysis of the complexity and convergence of TripleReg-MKL is presented. It shows that the new algorithm has a complexity of O(CMT) and achieves a fast convergence rate of O(log T/T). Extensive experiments on four benchmark datasets demonstrate the effectiveness and robustness of this new approach.


Introduction
Semantic image classification is a challenging task in the field of computer vision. Researchers are constantly searching for efficient learning methods with good scalability to categorize large and complex image datasets [1][2][3][4][5][6]. Among the current approaches used for image categorization, multikernel learning (MKL) [7][8][9][10][11] has been the subject of many recent studies. It delivers state-of-the-art performance by solving a joint optimization problem that comprises the sample coefficients of the base kernel classifier and the optimal weights for combining multiple kernels associated with multiple cues [5].
However, most MKL methods use batch learning [6][12][13][14][15][16], which is slow at classification and does not scale well to large training datasets. To this end, various online methods have been proposed to facilitate efficient learning and real-time application [17][18][19][20][21][22][23]. These online MKL methods differ in their regularization techniques and updating rules. For example, Hoi et al. [17] used a Perceptron algorithm [23] to learn a base classifier for each given kernel before applying the Hedge algorithm [24] to combine the multiple classifiers linearly. However, regularization was not considered in this formulation. Cavallanti et al. [22] proposed an $\ell_p$-norm multiview Perceptron algorithm, in which differences in the cue-related kernel space were neglected.
Given the complex parameter structures of multikernel models, more researchers are using mixed-norm regularization terms to integrate sophisticated prior knowledge and to handle the parameter fluctuations caused by data noise. The (2, 1)-norm was first proposed as a regularizer for multiclass MKL [9]. This approach induced absolute sparsity in the domain of the kernels, but it might weaken the convexity of the optimization problem or lead to poor performance [25]. Thus, a general type of group norm, the (2, p)-norm (1 ≤ p ≤ 2), was proposed in [19,26] to provide greater flexibility when tuning the level of sparsity required for a task. However, the algorithm has difficulty achieving convergence when p is close to 1. In addition, an elastic net form of regularization is available for MKL, which allows the solution to contain exact zeros and is effective for filtering invalid kernels [20,27]. In summary, the aforementioned norms impose a constraint on kernels or classes, whereas they neglect the correlations among examples.
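As an illustration of the group norms discussed above, the following minimal Python sketch computes a (2, p) group norm, assuming the usual definition as the p-norm of the per-group Euclidean norms. The function name `group_norm` and the toy groups are hypothetical and only serve the example; with p = 1 the norm sums the group norms (favoring sparsity across groups), while p = 2 recovers the plain Euclidean norm.

```python
def group_norm(groups, p):
    """(2, p) group norm: the p-norm of the vector of per-group Euclidean norms."""
    inner = [sum(x * x for x in g) ** 0.5 for g in groups]
    return sum(v ** p for v in inner) ** (1.0 / p)


# Toy weights split into three groups (for example, one group per kernel).
toy_groups = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]

norm_p1 = group_norm(toy_groups, 1.0)   # sum of the group norms
norm_p2 = group_norm(toy_groups, 2.0)   # Euclidean norm of all entries
```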
In this paper, a new algorithm called TripleReg-MKL is proposed. It defines a triple-norm regularizer (abbreviated as TripleReg) with strong convexity to constrain the empirical risk on the incoming samples. An online solution for this new MKL formulation is derived using a primal-dual framework. Because the correlations among examples are considered in TripleReg-MKL to tune the sparsity of the model, the updating of the kernel weights involves the historical cumulative effects of the overall online training procedure [19,20]. The algorithm also combines a multivariate hinge loss with a conservative updating strategy to filter noisy samples. A theoretical derivation is presented, together with an analysis of the complexity and convergence. Extensive experiments on four benchmark datasets verify the claims.
In the following, $\mathbf{x}_t = [\mathbf{x}_t^1, \ldots, \mathbf{x}_t^M]$ denotes the corresponding $M$ measurements of features, and each $\mathbf{x}_t^m$ ($m = 1, \ldots, M$) is a multivariate feature vector that describes a visual characteristic of the $t$th sample. Then, the multiclass classifier is defined by
$$\tilde{y} = \arg\max_{c \in \mathcal{Y}} f_c(\mathbf{x}),$$

TripleReg-MKL Algorithm
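where $f_c(\mathbf{x})$ is the value of the score function when the instance $\mathbf{x}$ is assigned to the class $c$, and $\tilde{y}$ is the predicted class, for which the score function achieves its highest value. To integrate multiple kernels, the score function $f_c(\mathbf{x})$ is defined as
$$f_c(\mathbf{x}) = \sum_{m=1}^{M} \langle w^{\cdot,m}, \Phi^m(\mathbf{x}, c) \rangle,$$
where $w^{\cdot,m} = (w^{1,m}, \ldots, w^{C,m})$ collects the weights of all classes for the $m$th kernel. In the multiclass setup, the concept of class is introduced in the definition of the mapping function. This is different from traditional kernel machines, which ignore the class label information in the kernel definition. Specifically, we define
$$\Phi^m(\mathbf{x}, c) = [\mathbf{0}, \ldots, \mathbf{0}, \phi^m(\mathbf{x}), \mathbf{0}, \ldots, \mathbf{0}],$$
with $\phi^m(\mathbf{x})$ placed in the $c$th block, where $c \in \mathcal{Y}$ and $\phi^m(\cdot)$ is a label-free feature map [19].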
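The classification rule described above can be sketched in kernel form. The following minimal Python example assumes that each class score is a weighted sum, over kernels, of kernel evaluations against stored samples. The names `score`, `predict`, the coefficient layout `coef[i][c][m]`, the fixed kernel weights, and the two toy kernels are illustrative assumptions, not the implementation used in the paper.

```python
import math


def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf_kernel(a, b, gamma=0.5):
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def score(x, c, support, coef, kernels, kernel_weights):
    """Score of class c: sum over kernels m of beta_m * sum_i coef[i][c][m] * K^m(x_i[m], x[m])."""
    total = 0.0
    for m, (kernel, beta) in enumerate(zip(kernels, kernel_weights)):
        total += beta * sum(coef_i[c][m] * kernel(x_i[m], x[m])
                            for x_i, coef_i in zip(support, coef))
    return total


def predict(x, classes, support, coef, kernels, kernel_weights):
    """Return the class with the highest score (the arg max rule)."""
    return max(classes, key=lambda c: score(x, c, support, coef, kernels, kernel_weights))


# Toy model: every sample carries M = 2 feature vectors, one per kernel.
kernels = [linear_kernel, rbf_kernel]
kernel_weights = [0.5, 0.5]                      # hypothetical kernel weights
support = [[[1.0, 0.0], [1.0]],                  # stored sample of class 0
           [[0.0, 1.0], [-1.0]]]                 # stored sample of class 1
coef = [[[1.0, 1.0], [-1.0, -1.0]],              # coef[i][c][m]
        [[-1.0, -1.0], [1.0, 1.0]]]

label_a = predict([[0.9, 0.1], [0.8]], [0, 1], support, coef, kernels, kernel_weights)
label_b = predict([[0.1, 0.9], [-0.7]], [0, 1], support, coef, kernels, kernel_weights)
```

Each prediction evaluates every kernel against every stored sample for every class, which is where a per-sample cost proportional to the numbers of classes, kernels, and stored samples comes from.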
To obtain the solution of the optimization problem, the primal objective function to be minimized is defined in (6) as the sum of two terms. The first term is the triple-norm regularizer (TripleReg), which measures the complexity of $w$ and constrains the problem to a low-complexity domain. The second term is the global loss, which accumulates the hinge losses over all samples in the training set. $\ell_t(w)$ is the instantaneous loss function that measures the discrepancy between the predicted answer and the correct answer; at the $t$th iteration it can be written as $\ell(w, (\mathbf{x}_t, y_t))$. A convex multivariate hinge loss function is used here for multiclass categorization. A nonnegative parameter trades off the significance of the empirical loss against that of the regularization term. From (6), the essence of this optimization problem is to learn the optimal weight $w$ that minimizes the cumulative loss incurred over the sequence of observations under regularization.
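A multivariate hinge loss of the kind described above can be sketched as follows. The example assumes the standard Crammer-Singer style multiclass hinge loss with unit margin, which may differ in detail from the definition used in the paper; the function name `multiclass_hinge` is hypothetical. It also returns the strongest competing class, since that class is the one a conservative update would penalize.

```python
def multiclass_hinge(scores, y):
    """Return (loss, rival) with loss = max(0, 1 - (scores[y] - scores[rival])),
    where rival is the highest-scoring class other than the true class y."""
    rival = max((c for c in range(len(scores)) if c != y), key=lambda c: scores[c])
    loss = max(0.0, 1.0 - (scores[y] - scores[rival]))
    return loss, rival


# Correct class wins with a margin of at least 1: no loss.
loss_ok, rival_ok = multiclass_hinge([2.0, 0.5, 0.1], 0)
# Correct class is beaten by class 1: positive loss.
loss_bad, rival_bad = multiclass_hinge([0.2, 0.5, 0.1], 0)
```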

Triple-Norm Regularizer.
According to Section 2.1, each component of $w$ associated with a specific class and kernel, that is, $w^{c,m}$ ($c = 1, \ldots, C$, $m = 1, \ldots, M$), is a coefficient vector. This inherently implies a triple structure of the model parameter. Thus, the triple-norm-based regularization in (8) is designed over this triple structure.

Online Solution Using a Primal-Dual Framework.
A primal-dual algorithmic framework [28][29][30] is adopted to derive the optimal solution of (6). Let $w_0$ denote the optimal fixed solution of the minimization problem (6), which is both objective and imaginary. It may be considered objective because it can be selected retrospectively from a class of hypotheses, using the entire sequence of training data pairs, as an acceptable competing hypothesis [28]. It can be considered imaginary because it may require a long training period to obtain. Let $w_t$ be the actual model parameter at the $t$th iteration. The algorithm is expected to satisfy $w_t = w_0$ at each iteration. Therefore, the primal problem (6) can be rewritten with constraints as (10), in which the feasible set is the set of all possible hypotheses.
Introducing one Lagrangian multiplier for each constraint, and using the definition of the conjugate function together with the identity $\inf_x g(x) = -\sup_x (-g(x))$, the dual objective function (13) is obtained. In this way, the constrained quadratic programming problem in the primal domain, (10), is converted into the dual objective function (13). Differentiating both sides of (14) with respect to the scaled negative sum of the Lagrangian multipliers then gives the condition that links the primal and dual variables.
We denote the sum of the current $t$ Lagrangian multipliers by $\theta_t$, where $\theta_t$ is the dual variable with the same triple structure as the primal variable $w_t$. Thus, the solution of the primal objective is obtained in the dual domain, which is formulated as $w_0 = \nabla f^{*}(\theta_t)$, where $f^{*}$ is the Fenchel conjugate of the regularizer $f$. Specifically, this step uses the dual norm of (8). A summary of the algorithmic framework of TripleReg-MKL is shown in Algorithm 1.
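The step that maps the dual variable back to the primal weight through the gradient of the conjugate can be illustrated on a simpler regularizer. The sketch below assumes $f(w) = \frac{1}{2}\|w\|_q^2$, whose Fenchel conjugate is $f^{*}(\theta) = \frac{1}{2}\|\theta\|_p^2$ with $1/p + 1/q = 1$. This is a stand-in chosen for illustration, not the triple-norm regularizer of the paper, and the names `p_norm` and `grad_conjugate` are hypothetical.

```python
def p_norm(v, p):
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def grad_conjugate(theta, p):
    """Gradient of f*(theta) = 0.5 * ||theta||_p^2 for p > 1:
    component i equals sign(theta_i) * |theta_i|^(p-1) * ||theta||_p^(2-p)."""
    norm = p_norm(theta, p)
    if norm == 0.0:
        return [0.0 for _ in theta]
    scale = norm ** (2.0 - p)
    return [scale * (abs(x) ** (p - 1.0)) * (1.0 if x >= 0 else -1.0) for x in theta]


theta = [3.0, -4.0]                      # toy dual variable
w_p2 = grad_conjugate(theta, 2.0)        # p = 2: the map is the identity
w_p15 = grad_conjugate(theta, 1.5)       # p closer to 1 reshapes the weights
```

A useful sanity check for this family is that the inner product of the recovered weight with the dual variable equals the squared dual norm.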
Algorithm 1 shows that the TripleReg-MKL algorithm updates the weight $w_t$ through line (11). Here, the scalar factor of this update is denoted as the kernel coefficient at step $t+1$. Thus, the updating rule simplifies to $w^{c,m}_{t+1}$ being the product of this kernel coefficient and the dual parameter $\theta^{c,m}_{t+1}$, which indicates that $w$ is determined by the kernel coefficient and the dual parameter $\theta$. The relationship between the parameters during online learning is shown in Figure 1.
Algorithm 1: Pseudocode of the TripleReg-MKL algorithm. The model parameters are initially set to zero, as online learning algorithms usually do [31]. Intuitively, the model parameters are empty at the start, when no sample has arrived.
In Figure 1, the kernel weight is updated using information from the newly arriving sample and from all previous samples.It allows each sample to make different contributions to the model.In other words, the correlations among samples are introduced to tune the level of sparsity in the domain of the kernels because of the close relationship and high similarity among samples in the same class.
To derive the TripleReg-MKL algorithm, the kernel trick is applied to avoid the difficult definition of $\Phi(\mathbf{x}, c)$ and the expensive calculation of the inner product in a high-dimensional transformed space [32]. By setting an indicator to $\operatorname{sign}(\ell(w_t))$ and using the relationship $K^m(\mathbf{x}, \mathbf{x}_t) = \langle \Phi^m(\mathbf{x}, c), \Phi^m(\mathbf{x}_t, y_t) \rangle$, the inner product between $w^{c,m}_{t+1}$ and $\Phi^m(\mathbf{x}, c)$ can be calculated from kernel evaluations on the previous samples. During the training process, the TripleReg-MKL algorithm applies a conservative updating strategy. That is, an update is performed, only in the model of the current class and that of the interfering class, when the loss function is greater than zero, as shown in line (8) of Algorithm 1. This "one positive and one negative" approach is also used to increase the gap between the correct model and the strongest interfering model.
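The conservative "one positive and one negative" strategy described above can be sketched as a simple online loop. The example is a deliberately reduced illustration under stated assumptions: a single kernel, a fixed step size, a unit-margin hinge loss, and no kernel reweighting by the triple-norm regularizer. The names `train_conservative`, `step`, and the toy data are hypothetical.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def train_conservative(samples, labels, n_classes, kernel, step=1.0):
    """Online training that stores a sample only when its hinge loss is positive."""
    support, coef = [], []          # coef[i][c]: coefficient of stored sample i for class c
    for x, y in zip(samples, labels):
        # Scores are computed with kernel evaluations against stored samples only.
        scores = [sum(a[c] * kernel(s, x) for s, a in zip(support, coef))
                  for c in range(n_classes)]
        rival = max((c for c in range(n_classes) if c != y), key=lambda c: scores[c])
        loss = max(0.0, 1.0 - (scores[y] - scores[rival]))
        if loss > 0.0:              # conservative: samples with zero loss are skipped
            a = [0.0] * n_classes
            a[y] += step            # "one positive": the correct class
            a[rival] -= step        # "one negative": the strongest interfering class
            support.append(x)
            coef.append(a)
    return support, coef


samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
labels = [0, 1, 0, 1]
support, coef = train_conservative(samples, labels, 2, linear_kernel)
```

In this toy run the third and fourth samples are already classified with a sufficient margin, so they trigger no update and the model keeps only two stored samples.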
Algorithm 1 shows that the time required by TripleReg-MKL is dominated by line (5) in each iteration, which has a complexity of O(CMT) in the worst case. C, M, and T are the numbers of classes, kernels, and previous samples, respectively. This complexity is common to other state-of-the-art online MKL algorithms, such as OM-2 and UFO-MKL.

Convergence Analysis
In this section, we analyze a theoretical guarantee of the convergence rate of the TripleReg-MKL algorithm. Theorem 1 is derived from the regret bound of primal-dual optimization in Theorem 2 in [29] (the proof of Theorem 1 is given in Appendix B).
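The link between a logarithmic regret bound and an O(log T/T) convergence rate can be made concrete with a short numerical sketch. It assumes that the cumulative regret is bounded by a constant times a harmonic sum, as is typical for strongly convex online optimization; the constant `const` and the function names are hypothetical and do not come from Theorem 1.

```python
import math


def harmonic(T):
    """Harmonic sum 1 + 1/2 + ... + 1/T, which is at most 1 + ln(T)."""
    return sum(1.0 / t for t in range(1, T + 1))


def average_regret_bound(T, const=1.0):
    """Upper bound on the average regret, const * (1 + ln T) / T."""
    return const * (1.0 + math.log(T)) / T


# The bound on the average regret shrinks as the number of rounds grows.
bounds = [average_regret_bound(T) for T in (10, 100, 1000, 10000)]
```

Dividing a regret bound of order log T by the number of rounds T gives the O(log T/T) rate for the average loss gap.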

Experiment
An experimental evaluation of TripleReg-MKL is presented in terms of classification performance and capacity to combine features. A comparison with four state-of-the-art online MKL algorithms, that is, OM-2 [19], UFO-MKL [20], OMCL [21], and Perceptron [23], is performed on the benchmark Caltech-101 [34], Caltech-256 [35], Oxford Flowers (102) [36], and MNIST [37] datasets. Caltech-101 [34] is a collection of 9144 images from 102 object categories. The number of images in each category varies from 40 to 800. Most of the object categories contain 50 images. Caltech-256 [35] is an extension of Caltech-101 containing 29781 images from 256 object categories. The minimum, average, and maximum numbers of images in each category are 80, 119, and 827, respectively. Oxford Flowers (102) [36] contains 8189 images that cover 102 flower categories. Each class contains 40 to 258 images. MNIST [37] is a large dataset of 60000 training examples and 10000 test examples from 10 handwritten digit categories. The digits have been size-normalized to 28 × 28 gray-scale images and centered in the fixed-size images. This dataset is good for testing learning techniques on real-world data since it requires minimal preprocessing and formatting effort. These four datasets are characterized by high image diversity, large sample volumes and numbers of categories, or great vagueness among classes, which presents great challenges for classification. The codes of the three comparison algorithms are obtained from DOGMA [31].

Experimental Setup.
Thirty images of each category were selected randomly for training from Caltech-101 and Caltech-256, and the rest were used for testing. For the Oxford Flowers (102) dataset, the predefined training and testing splits recommended in previous studies [36,49] were used in this experiment; that is, only 10 images of each class were used for training. Unless stated otherwise, the experimental process was replicated 10 times using a different random test set or sample sequence. The averages and standard deviations are reported. To obtain better experimental results, the model parameters were determined using a fivefold cross-validation procedure. Because online learning has a relatively slow convergence rate, we enlarged the training set by cycling the training examples through multiple epochs [45].
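The evaluation protocol described above (a fixed number of randomly chosen training images per class, repeated runs, and reported averages and standard deviations) can be sketched as follows. The function names, the `evaluate` callback, and the seed are hypothetical; `evaluate` stands for any routine that trains on the given training indices and returns the test accuracy.

```python
import random
import statistics


def split_per_class(labels, n_train, rng):
    """Randomly pick n_train indices per class for training; the rest go to testing."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        idxs = idxs[:]
        rng.shuffle(idxs)
        train += idxs[:n_train]
        test += idxs[n_train:]
    return train, test


def repeated_evaluation(labels, evaluate, n_train=30, repeats=10, seed=0):
    """Repeat the random split and return the mean and standard deviation of the accuracy."""
    rng = random.Random(seed)
    accs = [evaluate(*split_per_class(labels, n_train, rng)) for _ in range(repeats)]
    return statistics.mean(accs), statistics.stdev(accs)


# Toy label list with two classes of 40 and 50 images.
toy_labels = [0] * 40 + [1] * 50
toy_train, toy_test = split_per_class(toy_labels, 30, random.Random(0))
```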

Comparing the Effect of Using Single Kernels or Combining-All.
In this experiment, the performance of TripleReg-MKL with a combined kernel is compared with its performance with a single kernel. The experimental results are shown in Table 1.
Table 1 shows that "combining-all" using the TripleReg-MKL algorithm results in a significant improvement in classification performance on every dataset compared with a single kernel. For example, the test accuracy increased by approximately 9.38% and 10.0% on the two large object class datasets, that is, Caltech-101 and Caltech-256, compared with that obtained using the best single kernel.
In the large-scale setting with more than one million samples, the data size is simulated by means of a multipass strategy in the experiment. Because some of the results cannot be seen clearly in Figure 2, the test accuracy of the four algorithms at the final iteration is presented in Table 2. A test accuracy close to 50% was achieved using only about 25% of the images from Caltech-256 for training. To the best of our knowledge, this is the best performance achieved using MKL methods on the Caltech-256 dataset. The test accuracy of 88.75% on Caltech-101 is also the highest result obtained with an online algorithm. Oxford Flowers (102) is characterized by great variation within classes and vague gaps between classes. Only 12.5% of the images were used for training, and an accuracy of 62.56% was obtained using TripleReg-MKL on this benchmark dataset.
Table 3: Accuracy (%).
Method | Caltech-101 | Caltech-256
MKL-SRC [14] | 75.7 | -
MKSR [15] | 82.9 | 46.9
SM1MKL [16] | 88.42 | 47.7
SM2MKL [16] | 88.51 | 49.76
Proposed | 88.75 | 49.70
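The training fractions quoted above follow directly from the dataset sizes and the per-class training counts given in the experimental description. The short Python check below uses only those figures; the function name `training_fraction` is hypothetical.

```python
def training_fraction(per_class, n_classes, n_images):
    """Share of the dataset used for training when per_class images are drawn from each class."""
    return per_class * n_classes / n_images


# Caltech-256: 30 training images for each of 256 categories, 29781 images in total.
caltech256_fraction = training_fraction(30, 256, 29781)
# Oxford Flowers (102): 10 training images for each of 102 categories, 8189 images in total.
flowers_fraction = training_fraction(10, 102, 8189)
```

Both values agree with the rounded figures in the text: roughly a quarter of Caltech-256 and roughly one eighth of Oxford Flowers (102).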

Comparing with the State-of-the-Art Batch MKL Methods.
We compared the performance of TripleReg-MKL with some recent batch MKL methods, including MKL-SRC [14], MKSR [15], and Soft Margin Multiple Kernel Learning [16] (abbreviated as SM1MKL and SM2MKL, corresponding to the hinge loss and squared hinge loss setups, respectively). The comparison experiments are conducted on Caltech-101 and Caltech-256 because [14,15] provide results for the different training conditions on these two benchmark datasets. We downloaded the SMMKL implementation code (https://sites.google.com/site/xinxingxu666/) and used a one-versus-all strategy for its multiclass extension. The optimal SVM regularization parameters for SM1MKL and SM2MKL were searched within [1, 100] using the median method. Concretely, the optimal SVM regularization parameters for SM1MKL and SM2MKL were set to 10 and 2.5, respectively. Table 3 shows the test accuracy of TripleReg-MKL and the three batch MKL baseline algorithms. The advantage of TripleReg-MKL over MKL-SRC and MKSR is relatively striking on both datasets, demonstrating the generalization performance of our algorithm. The same holds for the comparison between TripleReg-MKL and SM1MKL on the Caltech-256 object dataset, which has more classes. In contrast, the advantage is not obvious when comparing the test accuracy of TripleReg-MKL with that of SM1MKL and SM2MKL on Caltech-101. To summarize, the proposed TripleReg-MKL compares well with the state-of-the-art batch MKL methods in terms of test accuracy.

Conclusions
In this paper, a new TripleReg-MKL algorithm was proposed. In this approach, a novel triple-norm regularizer (TripleReg) was designed for MKL and an efficient online solution was derived. TripleReg introduces a constraint among correlated examples, so that the kernel weights are updated using the information of all previous samples. This yields greater flexibility for tuning the level of sparsity in the domain of kernels. The online solution allows the algorithm to be readily adapted to learning tasks with millions of samples in an efficient manner. To examine the empirical performance of the proposed TripleReg-MKL algorithm, extensive experiments were conducted on a testbed using four diverse real datasets. The experimental results verified its high capacity for heterogeneous feature fusion, which is particularly important for recognizing large numbers of classes with crowded spaces and vague gaps between classes. It also achieved the highest classification performance compared with four state-of-the-art online MKL methods, that is, OM-2, UFO-MKL, OMCL, and Perceptron.

Mathematical Problems in Engineering
Definition A.1. A function $f$ is $\beta$-strongly smooth with respect to a norm $\|\cdot\|$ if $f$ is differentiable everywhere and if, for all $\mathbf{x}, \mathbf{y} \in \{\mathbf{k} : f(\mathbf{k}) < \infty\}$ and $\alpha \in (0, 1)$, the defining strong-smoothness inequality holds.
The following theorem states that strong convexity and strong smoothness are dual properties [50].
Theorem A.2 (strong convexity/strong smoothness duality). Assume that $f$ is a closed and convex function. The Fenchel conjugate of $f$ is denoted by $f^{*}$. Then, $f$ is $\beta$-strongly convex with respect to a norm $\|\cdot\|$ if and only if $f^{*}$ is $1/\beta$-strongly smooth with respect to the dual norm $\|\cdot\|_{*}$.

B. Proof of Theorem 1
To complete the proof, Theorem 2 in [29] is rewritten as follows.
Let $\ell_1, \ldots, \ell_T$ be a sequence of functions such that, for all $t \in [T] = \{1, 2, \ldots, T\}$, $\ell_t = \sigma_t f + g_t$, where $f$ is strongly convex with respect to a norm $\|\cdot\|$ and $g_t$ is a convex and closed function. Then, any algorithm that can be derived from the "template algorithm for online strongly convex optimization" [29] satisfies the regret bound stated in that theorem.

Figure 1: Relationships between the parameters during online training.

Table 1: Comparison of the results obtained with a single kernel or combining-all.
With the Oxford Flowers (102) dataset, "combining-all" outperformed the best single kernel by approximately 35.57%. For the largest-scale dataset, MNIST, the error rate is used as the performance index to make the numerical comparison clearer. A single SPHOG kernel had a low error rate of 0.72% on MNIST, but "combining-all" resulted in a lower error rate of 0.65%. That is, a 9.72% relative reduction in the error rate is achieved by using the TripleReg-MKL algorithm to fuse the SPHOG, GIST, and LBP features.
[...] examples from the epoch repetitions after the model reached the steady state. In contrast, the curve of "training sample size versus test accuracy" tends to rise with the number of iterations during online updating, after some early variability, until it reaches a steady state. This trend agrees with the gradual optimization behavior of online learning. A comparison of the results obtained with all four approaches showed that TripleReg-MKL had the best classification performance after reaching the steady state, as indicated by the position of the red curve above all of the other curves. The UFO-MKL algorithm ranked second.
The trend curve of the training time is different for each dataset. This is because the quality and distribution of the training samples vary across the four datasets, and these factors determine the number of noisy samples to be filtered and thus the runtime. Regardless, Figure 2 also shows that the TripleReg-MKL algorithm is well suited to learning with more than one million samples.

Table 2: Test accuracy at the last iteration.