Sampling Based Average Classifier Fusion

Classifier fusion is used to combine multiple classification decisions and improve classification performance. While various classifier fusion algorithms have been proposed in the literature, average fusion is almost always selected as the baseline for comparison. Little has been done to explore the potential of average fusion and to propose a better baseline. In this paper we empirically investigate the behavior of soft labels and classifiers in average fusion. We find that, by proper sampling of soft labels and classifiers, the average fusion performance can be clearly improved. This result presents sampling based average fusion as a better baseline; that is, a newly proposed classifier fusion algorithm should at least perform better than this baseline in order to demonstrate its effectiveness.


Introduction
Object classification is an important task in pattern recognition. Owing to differences in lighting conditions, viewing angles, occlusions, and so forth, real image datasets usually exhibit large intraclass diversity and interclass similarity. This presents great challenges to designing practical object classification systems. While many feature detectors, descriptors, and classification algorithms have been proposed in the literature, none of these algorithms alone is able to generate satisfactory classification results for real image datasets. For this reason, classifier fusion and feature combination [1] have been proposed to combine the decisions of multiple complementary classifiers and produce better performance than any single classifier. In this paper we focus on classifier fusion.
Majority voting is one of the simplest algorithms in classifier fusion. This algorithm uses only the class labels and discards the probability information of the labels, and thus may lead to performance loss. In order to make use of the class probabilities, average fusion combines the posterior probabilities of all training classes, that is, the soft labels. Other popular algorithms of this kind include weighted sum [2], logistic regression [3], Dempster-Shafer rules [4], and neural networks [5]. In this paper we focus on image classification. However, classifier fusion algorithms are also applicable to other domains [6][7][8].
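The difference between majority voting and average fusion described above can be illustrated with a minimal Python sketch. The function names and the toy posterior values are assumptions made for illustration only; they do not come from the experiments in this paper.

```python
from collections import Counter

def majority_vote(soft_labels):
    """Each classifier votes for its most probable class; the most voted class wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in soft_labels]
    return Counter(votes).most_common(1)[0][0]

def average_fusion(soft_labels):
    """Average the posterior vectors over classifiers and pick the best class."""
    n_classes = len(soft_labels[0])
    mean = [sum(p[c] for p in soft_labels) / len(soft_labels)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Hypothetical posteriors of three classifiers over three classes.
soft_labels = [
    [0.40, 0.35, 0.25],
    [0.45, 0.40, 0.15],
    [0.05, 0.90, 0.05],
]

print(majority_vote(soft_labels))   # 0: two of the three classifiers vote for class 0
print(average_fusion(soft_labels))  # 1: class 1 has the highest mean posterior
```

In this toy case the two rules disagree because the third classifier is very confident, and only average fusion takes that confidence into account.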
When proposing a new classifier fusion algorithm, researchers usually compare it with average fusion to show the advantage of the new algorithm. While simple, average fusion assigns equal weights to all classifiers regardless of how powerful they are. Intuitively, this harms the discriminative power of average fusion and in turn makes the claimed advantage of newly proposed classifier fusion algorithms less convincing. With this consideration in mind, in this paper we empirically investigate the impact of soft labels and classifiers on classifier fusion performance. We find that the behaviors of soft labels and classifiers in average fusion can be explained in the framework of kNN classification. This framework gives rise to a sampling based average fusion algorithm, which is shown to clearly outperform ordinary average fusion in experiments on four diverse image datasets. This result leads us to believe that our sampling based average fusion algorithm exploits the potential of average fusion and qualifies as a better baseline. A newly proposed algorithm should be compared with this new baseline to demonstrate its advantage.
The remainder of this paper is organized as follows. In Section 2 we introduce the experimental setups used in our classifier fusion experiments. Sections 3 and 4 present the details of our work on investigating the behaviors of soft labels and classifiers in average fusion, respectively. In Section 5 we present the sampling based average fusion algorithm based on the experimental results in Sections 3 and 4. Finally, Section 6 concludes the paper.

Experimental Setups
We use SVM in classification experiments on four diverse datasets. The regularization parameter is fixed to 1000 and the multiclass SVM is trained in a one-versus-all manner. In all experiments we test with 10 different training-testing splits and report the average recognition rate.

Datasets.
We use the following four datasets in experiments.
The Event-8 dataset [10] contains images from 8 categories of sports events. Each category is composed of 130 to 250 images with different lighting conditions, postures, and so forth. Following the experimental setup in [10], we randomly select 70 images per class for training and another 60 images for testing and report the overall recognition rate.
The Scene-15 dataset [11] is composed of images from 15 scene categories with 200 to 400 images in each category. We use the same experimental setup as in [15], that is, we randomly select 100 images per class for training and use all the others for testing, and report the mean recognition rate per class.
The Oxford Flower-17 dataset [16] consists of 1360 flower images evenly distributed over 17 categories. As in [16], we randomly select 40 images per class as training examples and 20 images as testing examples. The overall accuracy is reported.
For the well-known Caltech-101 dataset [15], we use 30 images per class for training across all 102 classes and select up to 15 of the remaining images per class for testing. The mean recognition rate per class is reported.
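The two reporting conventions used above, the overall recognition rate and the mean recognition rate per class, differ whenever classes have unequal numbers of test images. The following Python sketch shows the difference; the function names and the toy labels are illustrative assumptions, not data from the experiments.

```python
def overall_rate(y_true, y_pred):
    """Fraction of all test samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_per_class_rate(y_true, y_pred):
    """Recognition rate computed per class, then averaged over classes."""
    rates = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        rates.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(rates) / len(rates)

# Hypothetical unbalanced test set: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

print(overall_rate(y_true, y_pred))         # 5 of 6 samples are correct
print(mean_per_class_rate(y_true, y_pred))  # 0.75: mean of 1.0 and 0.5
```

The per-class mean gives every category the same weight, which matters for datasets such as Caltech-101 where the number of test images per class can vary.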

Features.
We use the following features to build the kernels used in SVM classification. These features are popular due to their discriminative power in object classification, for example, in [13,17,18]. This makes our conclusions drawn from experiments convincing and meaningful.
PHOG Shape Descriptor. We construct oriented (20 bins) and unoriented (40 bins) PHOG descriptors [19] from level 0 to 3 and obtain 8 descriptors in total. Unlike the implementation in [19], in this paper the descriptor at level $l$ is formed only by its $2^{2l}$ windows.
Bag-of-SIFT. The SIFT descriptors [20] on patches of radius $r$ with a spacing of 8 pixels are extracted and quantized into a 500-bin vocabulary, and we select $r$ = 4, 8, 12, 16 to allow for scale variation. These descriptors are extracted in gray space for the Scene-15 dataset, which contains only gray images, and in gray, HSV, and CIE-Lab spaces for Flower-17, Event-8, and Caltech-101. We build the visual word histograms from level 0 to 2 and obtain 3 or 9 descriptors.
Local Binary Patterns. The histograms of the basic local binary patterns (LBP) [21] are adopted from level 0 to 2.
Gist Descriptor. We extract the global gist descriptor [22] from level 0 to 1.
Self-Similarity Descriptor. The self-similarity descriptors [23] of 30 dimensions (10 orientations and 3 radial bins) are extracted and used to build a 500-bin vocabulary. The histograms are then built from level 0 to 2.
Gabor and RFS Filters. We use two texture features, that is, Gabor and RFS filters [23], to build histograms of 500 bins from level 0 to 2.
Gray Value Histogram. We also use the 64-bin gray value histograms from level 0 to 3.
For all these features, we use the $\chi^2$ distance to build kernels in the form of $K(x, y) = \exp(-d_0^{-1} d(x, y))$, where $d$ is the pairwise distance and $d_0$ is the mean of the pairwise distances. Here the $\chi^2$ distance is selected due to its great distinctive power, as illustrated in [13,[24][25][26].
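The kernel construction above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the chi-square distance is taken in one of its common forms (with a factor of 1/2 and a small `eps` to avoid division by zero), and the mean distance is computed over distinct pairs of samples. Neither choice is specified in the text, and the function names and toy histograms are hypothetical.

```python
import math

def chi2_distance(x, y, eps=1e-12):
    """One common form of the chi-square distance between two histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, y))

def chi2_kernel_matrix(histograms):
    """Kernel exp(-d / d0), where d0 is the mean pairwise distance."""
    n = len(histograms)
    dist = [[chi2_distance(histograms[i], histograms[j]) for j in range(n)]
            for i in range(n)]
    # Assumption: d0 is averaged over distinct pairs (i < j) only.
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    d0 = sum(pairs) / len(pairs)
    return [[math.exp(-dist[i][j] / d0) for j in range(n)] for i in range(n)]

# Hypothetical normalized histograms with two bins each.
histograms = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
K = chi2_kernel_matrix(histograms)
for row in K:
    print([round(v, 3) for v in row])
```

The sketch needs at least two histograms that are not all identical, since otherwise the mean distance is undefined or zero. Dividing by the mean distance makes the kernel bandwidth adapt to each feature, so no per-feature bandwidth has to be tuned.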

Behavior of Soft Labels
In majority voting, each classifier assigns to the testing image only the one label with the largest probability. We count how many times each label is selected and adopt the most frequently selected label as the prediction. This approach discards the probability of each label, which may be useful in classifier fusion. Therefore soft labels, that is, the posterior probabilities of the training labels, have been proposed for use in classifier fusion. Between the two extremes, that is, using only the most probable label and using all soft labels in fusion, we ask whether it is possible to achieve better performance by adopting a sample of the soft labels.
We evaluate the impact of soft label sampling on average fusion performance as follows. For each classifier, we sort all labels in descending order according to their posterior probability. Then we use in average fusion only the top $k$ labels, that is, the labels corresponding to the $k$ largest probabilities, where $k$ ranges from 1 to the number of all training labels. The experimental results are reported in Figure 1.
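The sampling procedure described above can be sketched as follows. The text does not say how the discarded labels enter the average; this sketch assumes that their posteriors are set to zero. The function names and the toy posteriors are hypothetical and only illustrate the mechanism.

```python
def keep_top_k(posterior, k):
    """Keep the k largest posteriors of one classifier; zero the rest (assumption)."""
    top = sorted(range(len(posterior)),
                 key=lambda c: posterior[c], reverse=True)[:k]
    return [p if c in top else 0.0 for c, p in enumerate(posterior)]

def top_k_average_fusion(soft_labels, k):
    """Average fusion that uses only the top-k soft labels of each classifier."""
    truncated = [keep_top_k(p, k) for p in soft_labels]
    n_classes = len(truncated[0])
    mean = [sum(p[c] for p in truncated) / len(truncated)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Hypothetical posteriors of three classifiers over four classes.
soft_labels = [
    [0.40, 0.35, 0.25, 0.00],
    [0.40, 0.00, 0.25, 0.35],
    [0.00, 0.36, 0.34, 0.30],
]

print(top_k_average_fusion(soft_labels, 2))  # 0: only the top-2 labels are fused
print(top_k_average_fusion(soft_labels, 4))  # 2: all soft labels are fused
```

In this toy case class 2 is never the first choice of any classifier, but it collects enough low-ranked probability mass to win when all soft labels are averaged. Truncating to the top 2 labels removes that effect.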
It is evident from Figure 1 that, for average fusion, neither adopting only the most probable label nor using all the soft labels is the best choice. Instead, the several most probable soft labels generate the best classification results. This resembles the kNN classification framework, in that the top $k$ most probable soft labels produce the best classification results. Although the best $k$ differs across the 4 datasets, $k$ = 2 seems an appropriate option, as it produces the best or near-best performance for all 4 datasets.
Another interesting observation is that, as the number of object categories increases, the performance gain obtained by using the $k$ most probable soft labels instead of all soft labels grows. From Event-8 to Caltech-101, the performance gain ranges from roughly 0.1 to 10 percentage points. This indicates the importance of soft label sampling, especially for large datasets with many object categories. On the other hand, this

Behavior of Classifiers
As classifier fusion uses multiple classifiers to improve classification performance, another question of interest is whether more classifiers necessarily lead to better average fusion performance.
We evaluate the impact of the number of classifiers on average fusion performance as follows. First, we use the recognition rate of 10-fold cross-validation to estimate the powerfulness of each classifier. Second, we sort the classifiers in descending order of powerfulness. Then we add the classifiers into fusion one by one and record the fusion performance. The performance with different numbers of classifiers is reported in Figure 2. Note that, in this experiment, we first fuse the classifiers from different levels of the same feature, for example, all 3 levels of LBP, and regard the fused decision as that of one single classifier. In this way we have 11 classifiers for Caltech-101, Event-8, and Flower-17 and 9 classifiers for Scene-15. This allows different classifiers (features) to be compared more clearly.
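The evaluation steps above (rank the classifiers, then add them to the fusion one by one) can be sketched in Python as follows. The cross-validation accuracies are taken as given inputs here, and all names and toy values are hypothetical; this is an illustration of the procedure, not the code used for Figure 2.

```python
def fuse(soft_labels):
    """Plain average fusion over a list of posterior vectors."""
    n_classes = len(soft_labels[0])
    mean = [sum(p[c] for p in soft_labels) / len(soft_labels)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

def incremental_fusion_curve(cv_accuracy, posteriors, y_true):
    """
    cv_accuracy[i]   : cross-validation accuracy of classifier i
    posteriors[i][n] : posterior vector of classifier i for test sample n
    Returns the ranked classifier order and the fusion accuracy obtained
    with the top m classifiers, for m = 1 .. number of classifiers.
    """
    order = sorted(range(len(cv_accuracy)),
                   key=lambda i: cv_accuracy[i], reverse=True)
    curve = []
    for m in range(1, len(order) + 1):
        selected = order[:m]
        preds = [fuse([posteriors[i][n] for i in selected])
                 for n in range(len(y_true))]
        curve.append(sum(p == t for p, t in zip(preds, y_true)) / len(y_true))
    return order, curve

# Hypothetical setup: three classifiers, two test samples, two classes.
cv_accuracy = [0.60, 0.90, 0.75]
posteriors = [
    [[0.05, 0.95], [0.40, 0.60]],  # classifier 0 (weakest)
    [[0.80, 0.20], [0.30, 0.70]],  # classifier 1 (strongest)
    [[0.60, 0.40], [0.45, 0.55]],  # classifier 2
]
y_true = [0, 1]

order, curve = incremental_fusion_curve(cv_accuracy, posteriors, y_true)
print(order)  # [1, 2, 0]
print(curve)  # [1.0, 1.0, 0.5]
```

In this toy case the weakest classifier is confidently wrong on the first sample, so adding it last lowers the fused accuracy.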
As in the case of soft labels, Figure 2 shows that, with average fusion, the best performance is obtained with the several most powerful classifiers. Adding more classifiers of lower powerfulness into the fusion only decreases the final classification performance. Figure 2 suggests that 4 is an appropriate choice for the number of classifiers.

Sampling Based Average Fusion
In the last two sections we found that using a small sample of the most probable soft labels and a small sample of the most powerful classifiers, each applied separately, helps produce the best fusion performance. Although the number of soft labels, that is, 2, and the number of classifiers, that is, 4, were obtained empirically, they are applied to all four datasets without special tuning to individual datasets. We now test the combined effect of sampling both soft labels and classifiers. In experiments on the four datasets, we compare the average fusion performance with and without sampling of soft labels and classifiers and show the results in Tables 1 and 2. In the tables, "average1" denotes the recognition rates from average fusion with all classifiers and all soft labels, whereas "average2" denotes the corresponding results with sampling of soft labels and classifiers, that is, with the 2 most probable soft labels and the 4 most powerful classifiers. We also compare our algorithm with the state-of-the-art ones on these datasets. From the comparison we see that, with average fusion, using a small sample of soft labels and classifiers always produces a significant improvement in object classification performance. This means that sampling based average fusion (SBAF) can serve as a better baseline than ordinary average fusion. Our algorithm also performs comparably to the state-of-the-art ones on these datasets. In fact, on Event-8 and Scene-15 our algorithm produces better results than the state-of-the-art ones, and on Flower-17 and Caltech-101 our results are close to the best ones to date. Given that in the experiments we use only simple features and average fusion, we believe that this is a very encouraging result which validates the effectiveness of our SBAF algorithm.
Tables 1 and 2 (cell values in extraction order; the row and column layout is not recoverable): 85.0 ± 0.5 [9]; 84.2 ± 1.0 [9]; 84.1 ± 0.5 [10]; 73.4 [11]; 81.4 ± 0.5; 86.0 ± 1.5; Average2; 71.0 ± 1.2 [12]; 88.3 ± 0.3 [13]; 77.8 ± 0.4 [13]; 85.5 ± 3.0 [14]; 66.2 ± 0.5.
Since in this paper we present SBAF as a better baseline rather than as a novel fusion method, we compare this algorithm only with ordinary average fusion and not with other fusion methods, for example, [2,4]. Another observation from the experiments is that the behaviors of soft labels and classifiers can be explained in the framework of kNN classification. Regarding the most probable soft labels and the most powerful classifiers as the nearest neighbors, we can readily explain all the experimental observations within the kNN framework. This framework provides theoretical support for the following conclusions. First, the best performance of average fusion is achieved not with all soft labels and all classifiers, but with a sample of the most probable soft labels and the most powerful classifiers. This gives rise to SBAF as a better baseline. Second, with a dataset of tens to hundreds of categories, the performance gain of SBAF over average fusion can be rather large (over 10 percentage points for Caltech-101). Given the explosive growth in the number and categories of images today, this observation highlights the importance of soft label and classifier sampling and the necessity of adopting SBAF as the baseline.
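The SBAF rule discussed in this section can be written compactly as follows. The notation is introduced here for illustration and is an assumption, not the paper's own: $p_i(c \mid x)$ is the posterior of class $c$ given by classifier $i$ for image $x$, $T_k^{(i)}(x)$ is the set of the $k$ most probable labels of classifier $i$, and $S_m$ is the set of the $m$ classifiers with the highest cross-validation recognition rates. Setting the discarded posteriors to zero is likewise an assumption.

```latex
\tilde{p}_i(c \mid x) =
\begin{cases}
  p_i(c \mid x), & c \in T_k^{(i)}(x), \\
  0,             & \text{otherwise},
\end{cases}
\qquad
\hat{y}(x) = \arg\max_{c} \; \frac{1}{|S_m|} \sum_{i \in S_m} \tilde{p}_i(c \mid x)
```

With $k$ equal to the number of classes and $S_m$ containing all classifiers, this reduces to ordinary average fusion. The settings used in the experiments correspond to $k = 2$ and $m = 4$.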
Although in this paper we focus on image classification, the idea of classifier fusion is also useful in other related domains, for example, document classification, speech recognition, and fault diagnosis [27][28][29]. As a next step we plan to explore the possibility of extending the work to more domains [30][31][32].

Conclusion
In this paper we investigated the impact of soft label and classifier sampling on average classifier fusion performance through experiments on four diverse datasets. We found that the behaviors of soft labels and classifiers in average fusion can be elegantly explained in the framework of kNN classification. This framework further gives rise to a sampling based average fusion method, that is, using a sample of the most probable soft labels and the most powerful classifiers in fusion to obtain the best performance. Experiments indicate that this sampling based average fusion performs clearly better than the ordinary one and thus can serve as a better baseline for comparison. Our results on the four datasets are also comparable to the state of the art in the literature. As the kNN framework elegantly captures the behaviors of soft labels and classifiers in classifier fusion, we believe that it can be helpful in designing novel classifier fusion methods.