Exploring the Best Classification from Average Feature Combination

Abstract and Applied Analysis


Introduction
Object classification is a difficult task because large intraclass diversity and interclass correlation usually exist, even within a small image dataset. Existing single features, for example, SIFT [1], SURF [2], and HOG [3], are powerful for some classes but do not seem sufficient to handle all classes alone. Feature combination has therefore been proposed to combine the strengths of multiple complementary features and produce better performance than any single one. While classifier fusion [4] can also be used to improve classification performance, in this paper we focus on combination at the feature level. More specifically, we use an SVM classifier, so that feature combination is translated into kernel combination [5].
Multiple kernel learning (MKL) is one popular approach to kernel combination. MKL seeks the best combination performance by jointly optimizing the weights $\beta_m$ on the individual kernels in $k^*(x, y) = \sum_{m=1}^{M} \beta_m k_m(x, y)$ together with the SVM parameters $\alpha$ and $b$ [6][7][8][9][10]. Unlike this canonical MKL, which adopts a uniform weighting scheme over the whole input space, [11] presented a sample-specific MKL algorithm in which kernel weights are determined based on both the kernel functions and the samples. This algorithm produces some performance improvement at the cost of a heavy computational load and the risk of overfitting. Between these two extremes, [10] proposed a group-sensitive MKL that trades off between canonical and sample-specific MKL. In contrast to MKL algorithms that optimize weights and SVM parameters jointly, [12] presented an LPBoost algorithm in which the weights and SVM parameters are trained separately in two steps.
While various MKL-like kernel combination algorithms have been published, the controversy surrounding these optimization based approaches has also become evident. On one hand, optimization based algorithms are usually computationally expensive, and the optimization process consumes a large amount of memory. On the other hand, the real effectiveness of these algorithms in improving performance has been called into question. In [12] it is noted that when all participating features are carefully designed to be powerful, sophisticated optimization algorithms, for example, MKL, show no evident advantage over the baseline average combination. Only when both strong and weak features are combined do the optimization based approaches suppress the effect of weak features and perform better than average combination. In the supplement to [12] the authors further claim that MKL-like combination algorithms seem to be overestimated in the literature, due to missing comparisons with the simple yet powerful average combination. Moreover, the supplement states that there seems to be an agreement on the fact that MKL almost never improves performance.
The tiny, if any, performance improvement from MKL-like algorithms, together with their large computation and memory consumption, seems to indicate that the existing optimization based combination approaches are approaching a bottleneck and that a new framework is needed to generate further evident performance improvement. This observation motivated us to investigate the behaviors of features in average combination, in an endeavor to find the underlying mechanism of feature combination. In fact, our work is consistent with the shift of research focus from heuristic combination algorithms to theoretical explanation of the combination mechanism [13,14]. In [15] we found that if we add features into average combination one by one in descending order of their discriminative power, the classification performance of the combination first rises, then peaks, and finally drops. In other words, average combination with a sample of the most powerful features produces better classification results than with all features, and the performance gain from using such a sample can be quite large in some cases. This means that it may not be convincing to claim the superiority of a new combination algorithm by comparing it only with the ordinary average combination. This observation further makes it necessary to explore the potential of average combination and present a better baseline combination algorithm.
While the experiments in [15] show that it is possible to improve the classification performance of average combination, some problems are left unsolved. Firstly, in [15] we tested the feature combination performance by adding features into combination in descending order, ascending order, and mixed order and found that the best classification results are obtained with a sample of the most powerful features in the descending order. However, it is not clear whether some other order, for example, random ordering, can produce better results than descending order. Secondly, in descending order the best sample size needs to be determined in order to obtain the best classification performance. Thirdly, we are interested in finding out how the features used in combination influence the final classification results. In order to solve these problems, in this paper we first compare the random order with the best performing descending order. As a result, we find that the behaviors of feature combination can be elegantly integrated into the k-Nearest-Neighbor (kNN) framework. Based on this framework, we then present a selection based average combination (SBAC) algorithm to obtain the best classification results from average combination. This SBAC algorithm is simple yet powerful and can serve as a better baseline algorithm in feature combination. Furthermore, the kNN framework provides a reasonable explanation for the observed relation between the features combined and the resulting performance gain. All these results enable us to conclude that the kNN framework sheds some light on the mechanism underlying feature combination and is therefore helpful in motivating novel feature combination algorithms. Although in this paper we focus on image classification, we would like to highlight that the idea of combining features to improve classification performance is also applicable to other domains [16][17][18].

Figure 1: Illustration of recognition rate curves in descending, ascending, and mixed orders, as observed in [15].

The remainder of this paper is organized as follows. In Section 2 we compare the descending order with random order and then present the kNN framework to explain the behaviors of features in average combination. Section 3 details the SBAC algorithm and experimental results. In Section 4 we conclude the paper.

kNN Framework
In this section we investigate how the order in which features are added into average combination influences classification performance. To begin with, we review the three orders tested in [15]. The discriminative power of a feature is evaluated by 10-fold cross-validation with its corresponding kernel matrix. In descending order, the features are sorted by decreasing discriminative power and added into combination one by one. In ascending order, the features are sorted by increasing discriminative power and added in the same way. In mixed order, the features are still sorted in descending order, but we take features alternately from the top and the bottom of the list and add them into combination one by one.
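The three orders can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, and `cv_accuracy` is assumed to hold one 10-fold cross-validation accuracy per feature, with made-up values.

```python
def descending_order(cv_accuracy):
    """Feature indices sorted from most to least discriminative."""
    return sorted(range(len(cv_accuracy)),
                  key=lambda i: cv_accuracy[i], reverse=True)


def ascending_order(cv_accuracy):
    """Feature indices sorted from least to most discriminative."""
    return descending_order(cv_accuracy)[::-1]


def mixed_order(cv_accuracy):
    """Alternate between the top and the bottom of the descending list."""
    ranked = descending_order(cv_accuracy)
    order = []
    top, bottom = 0, len(ranked) - 1
    take_top = True
    while top <= bottom:
        if take_top:
            order.append(ranked[top])
            top += 1
        else:
            order.append(ranked[bottom])
            bottom -= 1
        take_top = not take_top
    return order


# Illustrative accuracies for five hypothetical features.
cv_accuracy = [62.0, 80.5, 45.3, 71.2, 55.0]
orders = {
    "descending": descending_order(cv_accuracy),
    "ascending": ascending_order(cv_accuracy),
    "mixed": mixed_order(cv_accuracy),
}
```

In each order, the first N indices of the returned list are the features that take part in the N-feature average combination.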
Based on the experiments in [15], the behaviors of features in the three orders are illustrated in Figure 1. In descending order, the curve of recognition rates shows a "rise-peak-drop" shape as features are added into combination. In ascending order, each newly added (and thus more discriminative) feature improves the classification, so that the best result in this order is obtained with all features. In mixed order, the strong and weak features alternately push up and drag down the recognition rates. In all cases, the ascending and mixed orders fail to outperform the descending order.
From our description and Figure 1 we see that, among the three orders, the best classification results come from the descending order, when a sample of the most discriminative features is combined. Therefore we select the descending order as the best performing one of the three orders.
Besides the three orders tested above, there is another possible ordering method, that is, random sampling. In this order we randomly sample a number of features from all and use them in average combination. Although we have no theoretical or intuitive ground to support random sampling as a better order than the descending one, we cannot trivially discard this order without extensive experiments. Therefore we compare random sampling with descending order as follows. For each number N of features in combination, that is, from 2 to the number of all features, we randomly sample N features and use them in combination. Note that the sampled N features must not be exactly the same as the ones used in descending mode. We repeat the random sampling 50 times and use the best result to represent this order.
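The random sampling protocol can be sketched as below. This is an illustration under stated assumptions rather than the original implementation: `random_subsets`, `best_random_result`, and the `evaluate` callback (which would return the recognition rate of the average combination of a subset) are hypothetical names. When N equals the number of all features only one subset exists, so the sketch simply returns it.

```python
import random


def random_subsets(num_features, n, top_n, repeats=50, seed=0):
    """Draw `repeats` random n-feature subsets that differ from `top_n`.

    `top_n` is the set of n features used in descending mode; a draw that
    reproduces exactly this set is rejected and redrawn.
    """
    rng = random.Random(seed)
    forbidden = frozenset(top_n)
    everything = list(range(num_features))
    if n == num_features:
        # Only one subset exists, and it coincides with the descending one.
        return [everything]
    subsets = []
    while len(subsets) < repeats:
        candidate = rng.sample(everything, n)
        if frozenset(candidate) != forbidden:
            subsets.append(sorted(candidate))
    return subsets


def best_random_result(subsets, evaluate):
    """Represent the random order by the best result over all draws."""
    return max(evaluate(subset) for subset in subsets)


# Example: 7 features, N = 3, descending mode uses features 0, 1, 2.
subsets = random_subsets(num_features=7, n=3, top_n=[0, 1, 2])
```

Reporting the maximum over the 50 draws favors the random order, which makes any remaining advantage of the descending order a conservative finding.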
In the experiments we use the same four datasets as in [15]. The experimental setups are briefly listed as follows. Event-8 [19]: 70 randomly selected images per class for training and 60 for testing. Scene-15 [20]: 100 randomly selected images per class for training and all the others for testing. Oxford Flower-17 [21]: 20 randomly selected images per class for training and 20 for testing. Caltech-101 [22]: 30 randomly selected images per class for training and 15 for testing.
The features adopted in our experiments include 500-bin Gabor filters, Bag-of-SIFT descriptors in gray, HSV, and CIE-Lab space, 20-bin oriented and 40-bin unoriented PHOG [23], and a 64-bin gray value histogram. In building the Bag-of-SIFT descriptors, SIFT descriptors [1] are extracted on regular grids with a spacing of 10 pixels and with patches of radii r = 4, 8, 12, 16 to allow for scale variation, and are then quantized into a 500-bin vocabulary. Altogether we used 7 types of features for Caltech-101, Event-8, and Flower-17 and 5 types for Scene-15 (which contains only gray images). For each feature, we build the descriptors in a spatial pyramid from level 0 to level 2. In total, we have 21 descriptors for the three color datasets and 15 for Scene-15.
For each descriptor, the kernel matrix is built with each entry in the form $k(x, y) = \exp(-d_0^{-1} d(x, y))$, where $d$ is the pairwise $\chi^2$ distance and $d_0$ is the mean of the pairwise distances. We adopt the $\chi^2$ distance to build kernels as it performs the best among several other commonly used kernels [15,24]. In all our experiments the multiclass SVM is trained in a one-versus-all manner and the regularization parameter $C$ is fixed to 1000. The performance measure is reported as the mean recognition rate per class. For each dataset, we test with 10 different training-testing splits and report the mean of the classification results in Figure 2.
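The kernel construction can be sketched as follows, writing the kernel entry as exp(-d(x, y) / d_0). Two details are assumptions, because the text does not specify them: the χ² distance is taken as half the sum of (x_i - y_i)² / (x_i + y_i) over histogram bins, and d_0 is averaged over all entries of the distance matrix. The function names are hypothetical.

```python
import math


def chi2_distance(x, y):
    """Chi-square distance between two histograms (assumed 1/2 convention)."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def chi2_kernel_matrix(histograms):
    """Kernel entries exp(-d(x, y) / d0), with d0 the mean pairwise distance."""
    n = len(histograms)
    dist = [[chi2_distance(histograms[i], histograms[j]) for j in range(n)]
            for i in range(n)]
    d0 = sum(sum(row) for row in dist) / (n * n)
    if d0 == 0:
        d0 = 1.0  # all histograms identical; avoid division by zero
    return [[math.exp(-d / d0) for d in row] for row in dist]


# Three toy 2-bin histograms.
histograms = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
K = chi2_kernel_matrix(histograms)
```

Normalizing by d_0 puts the kernels of different descriptors on a comparable scale, which matters when they are averaged without learned weights.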
Although randomly sampling 50 times is not an exhaustive search, we can see in Figure 2 that random sampling rarely outperforms descending mode. At the same time, we note that there do exist some cases where the best of random order performs a little better than descending order. This observation can be attributed to the ordering criterion, that is, cross-validation accuracy. Since cross-validation accuracy is only an approximate estimate of the discriminative power, not a precise measure, the top N features in descending order may not be exactly the N most discriminative features. In this case, a random sample may capture the N most discriminative features by accident, while descending mode does not. This also explains why the recognition rate curves of descending order do not follow the "rise-peak-drop" shape strictly. Since in the random order we only report the best results, this observation does not influence our conclusion that the descending order performs the best of the four orders. To the best of our knowledge, we have listed all the possible orderings in average combination, and we conclude that descending order performs the best and can be used to produce better performance than the ordinary average combination.
When we look at the recognition rate curves of the four orders in Figures 1 and 2, we find that the behavior of features in average combination can be elegantly explained in the framework of k-Nearest-Neighbor (kNN) classification. Regarding the most discriminative features as the closest training examples and the least discriminative features as the furthest ones, we readily understand why the recognition rate curves of the three orders have the shapes illustrated in Figure 1 and why the descending order is able to produce better results than the ordinary average combination. Furthermore, we arrive at some interesting conclusions from this kNN framework. If the discriminative powers of individual features vary widely, the weak features added later drag down the performance curves significantly and make the average combination with all features much inferior to the one with only a sample of the most discriminative features. This is where the optimization based algorithms show their advantage over ordinary average combination by suppressing the effect of weak features. On the other hand, if all features are of similar discriminative power, the recognition rate curves are only dragged down marginally, as all training examples are at a similar distance. Therefore the average combination with all features performs similarly to the one with a sample of the most discriminative features. This leaves little room for optimization based algorithms to improve and explains why, with a set of carefully designed features, the optimization based combination algorithms perform no better than the baseline average combination, as observed in [12]. We shall see, in Section 3, that experiments validate these two arguments.

Selection Based Average Combination
Now we are able to present the selection based average combination (SBAC) algorithm as a better baseline algorithm for feature combination. Since in the descending order the recognition rate curves follow the "rise-peak-drop" trend, what is left for us to do is to determine where the peak is reached, that is, the appropriate value of k in kNN. Cross-validation is an effective measure of the discriminative powers of features, and we have used it in the ordering of features. Here we can also use 10-fold cross-validation to determine the best sample size in combination. Specifically, we sort the features in descending order and add them into combination one by one. Each time a feature is added, we use 10-fold cross-validation to assess and record the discriminative power of the combined kernel matrix. We then take the sample size at which the cross-validation accuracy peaks as the estimate of where the recognition rate curve peaks. In order to support this method, we compare the actual recognition rates and the cross-validation accuracy in Figure 2. Although the curves of recognition rate and cross-validation accuracy do not have exactly the same shapes, they do have very similar trends, and the peaks of the cross-validation accuracy curves indicate the location of the recognition rate peak correctly.
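The selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sbac_select` and `average_kernels` are hypothetical names, and `cv_score` is an assumed callback that returns the 10-fold cross-validation accuracy of an SVM trained on a given combined kernel matrix. A toy scoring function stands in for it here so that the example is self-contained.

```python
def average_kernels(kernels):
    """Entry-wise mean of equally sized kernel matrices (lists of lists)."""
    m = len(kernels)
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) / m for j in range(n)]
            for i in range(n)]


def sbac_select(kernels, single_scores, cv_score):
    """Select the top-k features whose average kernel maximizes `cv_score`.

    kernels       -- one kernel matrix per feature
    single_scores -- cross-validation accuracy of each feature alone
    cv_score      -- callback mapping a combined kernel to a CV accuracy
    Returns the selected feature indices and the CV accuracy curve.
    """
    order = sorted(range(len(kernels)),
                   key=lambda i: single_scores[i], reverse=True)
    best_score, best_size = float("-inf"), 0
    curve = []
    for size in range(1, len(order) + 1):
        combined = average_kernels([kernels[i] for i in order[:size]])
        score = cv_score(combined)
        curve.append(score)
        if score > best_score:  # keep the first (smallest) peak on ties
            best_score, best_size = score, size
    return order[:best_size], curve


# Toy data: 1x1 "kernels" and a stand-in score that peaks at value 0.8.
toy_kernels = [[[0.9]], [[0.5]], [[1.0]], [[0.1]]]
toy_single_scores = [80.0, 60.0, 90.0, 20.0]
toy_cv_score = lambda K: -(K[0][0] - 0.8) ** 2
selected, curve = sbac_select(toy_kernels, toy_single_scores, toy_cv_score)
```

In practice `cv_score` would wrap an SVM that accepts precomputed kernels. The sketch evaluates every sample size and keeps the global maximum, which is more robust than stopping at the first drop when the curve is not strictly "rise-peak-drop".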
It is evident from Figure 2 that SBAC performs better than the ordinary average combination with all features. In our experiments, the performance gains are 1.5, 1.1, 5.8, and 5.6 percent for Event-8, Scene-15, Flower-17, and Caltech-101, respectively. Considering that MKL algorithms, for example, in [12], usually outperform the ordinary average combination by only several percent, which is of the same order of magnitude as our performance gains, we believe these results highlight the importance of exploring the best results from average combination and presenting a better baseline combination algorithm. An interesting observation here is that the performance gains vary from dataset to dataset, that is, they are quite large with Flower-17 and Caltech-101 and fairly small with Event-8 and Scene-15. In Section 2 we attributed this behavior to the variance of the discriminative power of individual features. In order to validate this explanation from the kNN framework, we list the 10-fold cross-validation accuracy of individual features in descending order in Figure 3. Comparing Figure 3 to Figure 2, we observe a clear correlation between the discriminative power variances and the performance gains. In fact, in our experiments the standard deviation of the cross-validation accuracy of individual features is 15.49, 14.56, 19.39, and 17.56 for Event-8, Scene-15, Flower-17, and Caltech-101, respectively. This explains why the performance gains are large for Flower-17 and Caltech-101 and small for Event-8 and Scene-15. These observations further confirm the effectiveness of the kNN framework in explaining the behaviors of features in average combination.
Although in this paper we focus on image classification, the idea of combining multiple features and classifiers to obtain better classification performance is also applicable to other related domains, for example, document classification, speech recognition, fault diagnosis, and others [25][26][27]. Therefore in the next step we plan to continue our work in two directions. Firstly, we shall investigate the behaviors of features in average combination in these domains and check whether the kNN framework is still valid. This investigation should deepen our understanding of the feature combination mechanism and help motivate novel and more powerful feature combination algorithms. Secondly, we plan to make an in-depth study of the existing feature combination algorithms in these domains to see whether they can be applied to image classification. Altogether, we aim for a more precise and universal understanding of the feature combination mechanism and the best classification performance from average combination.

Conclusion
In this paper we investigated the behaviors of features in average combination through extensive experiments on four diverse datasets. As a result, we found that average feature combination can be integrated into the k-Nearest-Neighbor framework, where the most discriminative features are regarded as the closest training examples and the least discriminative features as the furthest ones. Based on this framework, we presented a selection based average combination algorithm which performs evidently better than the ordinary average combination and thus serves as a better baseline combination algorithm. Since the kNN framework can explain all the behaviors we observed in average feature combination, we believe it is helpful in understanding the feature combination mechanism and motivating novel feature combination algorithms.

Figure 2: Comparison of average combination in descending order and in random order. The cross-validation accuracy in descending order is also illustrated.

Figure 3: The cross-validation accuracy of individual features used in average combination.