Toward a General-Purpose Heterogeneous Ensemble for Pattern Classification

We perform an extensive study of the performance of different classification approaches on twenty-five datasets (fourteen image datasets and eleven UCI data mining datasets). The aim is to find General-Purpose (GP) heterogeneous ensembles (requiring little to no parameter tuning) that perform competitively across multiple datasets. The state-of-the-art classifiers examined in this study include the support vector machine, Gaussian process classifiers, random subspace of AdaBoost, random subspace of rotation boosting, and deep learning classifiers. We demonstrate that a heterogeneous ensemble based on the simple fusion by sum rule of different classifiers performs consistently well across all twenty-five datasets. The most important result of our investigation is that some very recent approaches, including the heterogeneous ensemble we propose in this paper, are capable of outperforming an SVM classifier (implemented with LibSVM), even when both kernel selection and SVM parameters are carefully tuned for each dataset.


Introduction
The present trend in machine learning is focused on building optimal classification systems for very specific, well-defined problems. Another research focus, however, would be to work on building General-Purpose (GP) systems that are capable of handling a broader range of problems as well as multiple data types. Ideally, GP systems would work well out of the box, requiring little to no parameter tuning but would still perform competitively against less flexible systems that have been optimized for very specific problems and datasets. One promising avenue of exploration is to build ensembles that are composed of diverse classifiers that merge their hypotheses [1], thereby resulting in a better approximation of a true hypothesis [2].
Many ensemble construction techniques are available. One approach is to perturb the information that is given to the base classifiers. The basic assumption behind this approach is that the base classifiers make errors that are independent of each other, so that combined in an ensemble they offer stronger classificatory power. To build an ensemble using this approach, training sets are first created, and classifiers are then trained on each of the training sets. The results of the classifiers are combined using some decision rule such as majority voting, sum rule, max rule, min rule, product rule, median rule, or Borda count. Different types of perturbation methods have been developed to maximize the classifier diversity in an ensemble. These methods focus on perturbing the training patterns, the feature sets, the classifiers, or some combination of these.
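As an illustration of the decision rules listed above, the following minimal sketch fuses the class scores of several classifiers for a single pattern. The function name `combine`, the score layout, and the example values are assumptions made for this illustration only; Borda count is omitted for brevity.

```python
from math import prod
from statistics import median


def combine(scores, rule="sum"):
    """Fuse per-classifier class scores for one pattern.

    scores[k][c] is the score that classifier k assigns to class c.
    Returns the index of the winning class (first index wins on ties).
    """
    n_classes = len(scores[0])
    if rule == "vote":
        # Majority voting: each classifier votes for its top-scoring class.
        fused = [0] * n_classes
        for s in scores:
            fused[s.index(max(s))] += 1
    else:
        # Score-level rules: reduce the scores of each class across classifiers.
        reducers = {"sum": sum, "product": prod, "max": max,
                    "min": min, "median": median}
        reduce_fn = reducers[rule]
        fused = [reduce_fn([s[c] for s in scores]) for c in range(n_classes)]
    return fused.index(max(fused))


# Three hypothetical classifiers, two classes.
scores = [
    [0.60, 0.40],
    [0.55, 0.45],
    [0.10, 0.90],
]
decisions = {rule: combine(scores, rule)
             for rule in ("vote", "sum", "product", "max", "min", "median")}
```

In this toy case the rules disagree: voting and the median rule follow the two mildly confident classifiers, whereas the sum, product, max, and min rules are swayed by the single highly confident one.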
In pattern perturbation, new training sets are created (commonly following an iterative approach) by perturbing the original training set, and a different classifier is trained on each new set. Some well-known pattern perturbation techniques include Bagging [3], Arcing [4], Class Switching [5], and Decorate [6]. In Bagging [3], new training sets are subsets of the original training data. In Arcing [4], each new training set is created based on the misclassified patterns in the previous iteration. In Class Switching [5], new training sets are created by randomly changing the labels of a subset of the original training data. Decorate [6] creates new training sets by adding artificial patterns misclassified by the combined decision of the ensemble.
In feature perturbation, new training sets are created by perturbing the feature set. Some well-known feature perturbation techniques include Random Subspace (RS) [7] and Input Decimated Ensemble [8]. In RS [7], new training sets are randomly generated from subsets of the feature set. In Input Decimated Ensemble [8], new training sets are generated using the principal component analysis (PCA) transform, where PCA is calculated on the training patterns belonging to each particular class. The ensemble size is thus bounded by the number of classes. This limitation can be avoided, however, as shown in [9], if PCA is performed on training patterns that have been partitioned into clusters.
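The sampling step behind three of these perturbation methods can be sketched as follows. This is only an illustration: the function names are hypothetical, Bagging is assumed to draw a bootstrap replicate (sampling with replacement), and the switching rate and subspace fraction are arbitrary example values.

```python
import random


def bagging_sample(n_patterns, rng):
    """Indices of one bootstrap replicate (assumed: sampling with replacement)."""
    return [rng.randrange(n_patterns) for _ in range(n_patterns)]


def class_switching(labels, classes, switch_rate, rng):
    """Copy of labels in which a random fraction is switched to another class."""
    switched = list(labels)
    n_switch = int(round(switch_rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_switch):
        switched[i] = rng.choice([c for c in classes if c != labels[i]])
    return switched


def random_subspace(n_features, fraction, rng):
    """Sorted indices of a random subset of the features."""
    k = max(1, int(round(fraction * n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
labels = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
bag = bagging_sample(len(labels), rng)
noisy = class_switching(labels, classes=[0, 1, 2], switch_rate=0.3, rng=rng)
subspace = random_subspace(n_features=10, fraction=0.5, rng=rng)
```

Each call yields the training material for one member of the ensemble; repeating the call with the same generator produces the differing training sets that give the ensemble its diversity.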
New ensembles can also be composed by mixing the two perturbation methods discussed above. For example, Random Forest [10] uses a bagging ensemble of decision trees, where a random selection of features is used to split a given node.
Finally, in classifier perturbation, each classifier of the same type (homogeneous ensembles) can be given different parameter values, or different classifiers (heterogeneous ensembles) can be combined and trained on the same training set. Classifier perturbation methods for building ensembles have been the least studied in the literature, but recently several studies have focused on this type of ensemble [11]. Moreover, several papers have investigated building GP heterogeneous ensembles [12,13]. In [12], an ensemble combining the RS approach with an ensemble using an editing approach to reduce outliers was compared with other state-of-the-art methods across sixteen benchmark datasets representing very different problems (numerous medical problems, image problems, a vowel dataset, a credit dataset, etc.). Although none of the ensembles investigated in [12] worked consistently well across all sixteen datasets, one GP ensemble worked well across all the image datasets. Moreover, in some cases, the GP ensemble performed better than an SVM whose parameters had been optimally tuned on a specific dataset.
GP ensembles that exploit information available in different feature extraction methods and representations of the data have also been explored. In [13], for instance, the goal was to search for a GP ensemble for protein classification that combined an optimal set of different protein representations and descriptors and that performed well across fourteen protein classification datasets representing different protein classification tasks. It was discovered in [13] that large descriptors work better when a large training set is available (due to the curse of dimensionality). Although no ensemble was discovered that provided the best performance across all fourteen datasets, it was shown that it is always possible to find a more limited GP ensemble that performed well across each type of dataset.
In this work the focus is on testing different classifiers and their combinations across twenty-five datasets (fourteen image datasets and eleven UCI data mining datasets). In the image datasets, two state-of-the-art texture descriptors are utilized: Local Ternary Patterns [14] and Local Phase Quantization [15]. As the majority of machine learning papers published in the literature are based on the LibSVM implementation of SVM, the aim of this work is to compare the performance of the LibSVM library with several recently proposed classifiers (Gaussian process classifiers, RS of AdaBoost, RS of rotation boosting, and deep learning) and to show that a heterogeneous GP ensemble of classifiers works well across the different datasets. For all the classifiers compared in this study, we use well-known toolboxes that have been extensively tested and that are freely available. Moreover, to make results reproducible and to gain a wider diffusion of this type of research, the MATLAB code/interface for building the GP heterogeneous ensembles proposed in this work is provided. We hope this tool will also prove useful for practitioners.
The most interesting result obtained from our experiments is that the best GP ensemble proposed in this paper outperforms each stand-alone classifier without any ad hoc tuning on the dataset: the same fusion rule is used for all twenty-five datasets tested in this work. As a result, we are confident that the proposed GP ensemble can easily be extended to other problems and should prove useful to researchers who want a reliable classifier that works well without tuning it. It should be noted, however, that there is a cost associated with using heterogeneous ensembles: increased computational time.
The remainder of this paper is organized as follows. In Section 2, the different classifiers explored in this paper are briefly described. In Section 3, the feature descriptors are outlined. In Section 4, we provide an overview of the twenty-five datasets. In Section 5, we present the experimental results along with our best GP heterogeneous ensembles. We conclude in Section 6 with a few reflections and remarks on some issues involved in developing GP ensembles and list some future directions of research. The MATLAB code for all the classifiers used in the proposed ensembles is available at https://www.dei.unipd.it/node/2357. Moreover, for the purpose of reproducing and comparing results, the split training/testing sets are also available at the above website.

Classifiers
Since the aim of this study is to find a heterogeneous multiclassifier system that works well with a large number of datasets, we examined the fusion by sum rule of several state-of-the-art classifiers: the Support Vector Machine (SVM), Gaussian process classifier (GPC), RS of AdaBoost (RS AB), RS of rotation boosting (RS RB), and deep learning (DL). The sum rule [2] simply sums the matching scores (normalized to mean 0 and standard deviation 1) provided by each of the different classifier systems. Each of the classifiers examined in this study is described briefly below.

Support Vector Machine (SVM).
SVM [16] is a binary classifier and is used as the core classifier in several of our ensembles. An SVM performs classification by cutting the $n$-dimensional space ($n$ being the number of features) into two regions associated with two distinct classes, often referred to as the positive class and the negative class. The regions are separated by an $(n-1)$-dimensional hyperplane that has the largest possible distance from the training vectors of the two classes. Three kernels are tested in our experiments: linear, radial basis function, and polynomial. For each kernel, a dataset-driven fine-tuning of parameters is performed. SVM is implemented using LibSVM, available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
In addition to SVM implemented with LibSVM, we test an RS ensemble with SVM as the classifier (using fifty different subspaces that include 50% of the original features). This ensemble is called RS SVM.

Gaussian Process Classifier (GPC).
GPC [17] is a probabilistic approach for learning in kernel machines. It considers two procedures for approximating inference for binary classification: (1) Laplace's approximation, which is based on an expansion around the mode of the posterior, and (2) the Expectation Propagation algorithm, which is based on moment-matching approximations to the marginals of the posterior. GPC is implemented using the MATLAB code available at http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html.

Random Subspace of AdaBoost (RS AB).
AdaBoost [18] is a supervised learning algorithm that boosts the classification performance of a simple binary classifier by combining a collection of weak classifiers. The outputs of the weak learners are combined into a weighted sum that represents the final output of the boosted classifier. In this work, we combine AdaBoost with the RS method [7] for building ensembles. This results in a pseudorandom selection of subsets of components in the feature vector that are then used for training the different classifiers of the ensemble. We use RS to construct fifty different subspaces that include 50% of the original features. A different AdaBoost.M2 [18] classifier is trained on each subspace. AdaBoost.M2 gives the weak learner (a neural network in our studies) more expressive power. The 50 classifiers are combined by sum rule.
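The sum rule with score normalization used throughout this section can be sketched as follows. This is a minimal illustration rather than the released MATLAB code: the function names are hypothetical, each classifier is assumed to output one score per pattern (the binary case), and the mean and standard deviation are assumed to be computed over the patterns being scored, using the population standard deviation.

```python
from statistics import mean, pstdev


def zscore(values):
    """Normalize one classifier's scores to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def sum_rule(score_lists):
    """score_lists[k][i] is the score of classifier k for pattern i.

    Returns one fused score per pattern.
    """
    normalized = [zscore(scores) for scores in score_lists]
    return [sum(column) for column in zip(*normalized)]


# Two hypothetical classifiers with scores on very different scales.
margin_scores = [-2.0, 0.0, 2.0]       # e.g. signed distances from a hyperplane
probability_scores = [0.1, 0.9, 0.5]   # e.g. posterior probabilities
fused = sum_rule([margin_scores, probability_scores])
```

Normalizing first keeps a classifier with a wide score range from dominating the sum; the same routine applies unchanged whether the fused members are the fifty subspace classifiers of one RS ensemble or different classifier types.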

Random Subspace of Rotation Boosting (RS RB).
RS RB is the random subspace version of rotation boosting (RB). RB [19] is an ensemble of decision trees based on randomly splitting the feature set into subsets. In each subset a feature transform is applied (PCA without feature reduction in the original version of RB). Instead of PCA, the feature transform used in this study is Neighborhood Preserving Embedding (NPE) as in [9]. NPE is described in Section 3.3.

Deep Learning (DL).
Deep learning is a recent approach to Artificial Intelligence and one of the best performing; it revolutionized the field when it was first proposed in 2006 [20]. The main feature of deep learning is its layered structure: there are several layers of processing nodes between its input and output, with every layer adding a certain level of abstraction to the overall representation. For example, image interpretation is a task that can be performed through several steps: at a lower level, small image patches are considered, leading to features like edges and texture. Such low-level descriptors can be combined to build a more complex representation: at the second level, for example, features like larger image patches and contours can be considered. Moving toward upper levels, the elements considered by the network are of increased complexity and are extracted from larger areas in the image (i.e., larger sets of input data).
Another major feature of deep learning networks is that they are able to exploit unlabeled data, a crucial feature when dealing with huge sets of data. Deep learning has been widely used in computer vision and image understanding applications, including object recognition in 3D (RGB-D) data [21] and face detection and verification [22].
In this work, we test the deep learning approach based on the Feedforward Backpropagation Neural Network (FBNN) [23] with a sigmoid activation function. We train FBNN for 10000 epochs with minibatches of size 25. Three versions of DL are tested: DL1, DL2, and DL3. Each has an input layer that is the size of the input feature vector and an output layer that is the size of the number of classes. DL1 has a hidden layer of size 100. DL2 has two hidden layers each of size 100, and DL3 has two hidden layers each of size 500. DL is implemented using the MATLAB code available at http://it.mathworks.com/matlabcentral/fileexchange/38310-deep-learning-toolbox.
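The three layer layouts can be made concrete with a small forward-pass sketch. It only illustrates the architectures: the weight initialization range is an assumption, the helper names are hypothetical, and training by backpropagation (10000 epochs, minibatches of 25) is not shown.

```python
import math
import random

# Hidden-layer sizes of the three tested configurations.
DL_HIDDEN_LAYERS = {"DL1": [100], "DL2": [100, 100], "DL3": [500, 500]}


def init_network(n_features, n_classes, hidden, rng):
    """Create (weights, biases) for every layer; small random weights (assumed)."""
    sizes = [n_features] + hidden + [n_classes]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, x):
    """Propagate one feature vector through all layers with sigmoid activations."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


rng = random.Random(0)
net = init_network(n_features=8, n_classes=3,
                   hidden=DL_HIDDEN_LAYERS["DL2"], rng=rng)
out = forward(net, [0.5] * 8)  # one score per class
```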

Feature Extraction
In Sections 3.1 and 3.2, we briefly describe the texture features used with the fourteen image datasets. Many methods are available for extracting features from texture. Two of the best-performing methods are Local Ternary Patterns (LTP) and Local Phase Quantization (LPQ). In Section 3.3, we describe the NPE descriptor that is used as a transform in RS RB.

Local Ternary Patterns (LTP).
LTP [14] is an extension of the canonical Local Binary Pattern (LBP) operator designed to be more discriminant and less sensitive to noise in uniform regions. The LBP operator [24] is computed at each pixel of an image by considering the differences between the grey-level value $q_c$ of the pixel and the grey-level values $q_p$ of a small circular neighborhood (with radius $R$ pixels):
$$\mathrm{LBP}(P, R) = \sum_{p=0}^{P-1} s(q_p - q_c)\, 2^{p}, \tag{1}$$
where $P$ is the number of pixels in the circular neighborhood and $s(x)$ is a threshold function such that
$$s(x) = \begin{cases} 1, & x \ge 0, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$
In the LTP, the threshold function $s(x)$ is substituted by a ternary coding function that makes the operator more robust to noise.
The ternary coding $s(x)$ is defined as
$$s(x) = \begin{cases} 1, & x \ge \tau, \\ 0, & |x| < \tau, \\ -1, & x \le -\tau, \end{cases} \tag{3}$$
where $\tau$ is a threshold fixed to 3 in this work.
The ternary code obtained by the $s(x)$ function is split into two binary codes by considering its positive and negative components, according to the following binary function $b_v(s(x))$:
$$b_v(x) = \begin{cases} 1, & x = v, \\ 0, & \text{otherwise}, \end{cases} \qquad v \in \{-1, 1\}. \tag{4}$$
The resulting binary codes are used to create two histograms of LTP values.
In this work two values of $R$ and $P$ are used: ($R = 1$; $P = 16$) and ($R = 2$, $P = 16$); hence, we have four codes (two sets of parameters, with a positive and a negative code for each of them).
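The ternary coding and its split into a positive and a negative binary code can be sketched for a single pixel as follows. The helper names are hypothetical, the neighbor grey levels are assumed to be already sampled and given in a fixed order, and eight neighbors are used only to keep the example short.

```python
def ternary(x, tau=3):
    """Ternary coding of a grey-level difference with threshold tau."""
    if x >= tau:
        return 1
    if x <= -tau:
        return -1
    return 0


def ltp_codes(center, neighbors, tau=3):
    """Return the (positive, negative) binary codes of one pixel.

    Bit p of the positive code is set when neighbor p is coded +1,
    bit p of the negative code when it is coded -1.
    """
    positive = negative = 0
    for p, value in enumerate(neighbors):
        t = ternary(value - center, tau)
        if t == 1:
            positive |= 1 << p
        elif t == -1:
            negative |= 1 << p
    return positive, negative


center = 50
neighbors = [54, 52, 50, 47, 46, 50, 53, 49]
# Differences: 4, 2, 0, -3, -4, 0, 3, -1  ->  ternary: 1, 0, 0, -1, -1, 0, 1, 0
positive, negative = ltp_codes(center, neighbors)
```

Here neighbors 0 and 6 exceed the center by at least the threshold, giving the positive code 1 + 64 = 65, while neighbors 3 and 4 fall below it by at least the threshold, giving the negative code 8 + 16 = 24. Accumulating each code over all pixels yields the two histograms.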

Local Phase Quantization (LPQ).
LPQ is a texture descriptor [15] based on the blur invariance of the Fourier transform phase. For each pixel position $\mathbf{x}$ of the image $f(\mathbf{x})$, the 2D short-term Fourier transform (STFT) is computed over a rectangular neighborhood of size $M \times M$, and four complex coefficients, corresponding to the 2D frequencies, are considered and quantized to construct the final descriptor:
$$\mathbf{u}_1 = [a, 0]^T, \quad \mathbf{u}_2 = [0, a]^T, \quad \mathbf{u}_3 = [a, a]^T, \quad \mathbf{u}_4 = [a, -a]^T,$$
where $a$ is a scalar frequency parameter.
The four complex coefficients at the frequencies $[\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3, \mathbf{u}_4]$ need to be decorrelated before quantization to become statistically independent and maximally preserve the information. Assuming a Gaussian distribution and a fixed correlation coefficient $\rho$ between adjacent pixel values, a whitening transform can be obtained from the singular value decomposition of the covariance matrix of the transform coefficient vector. After decorrelation, the vector $G_{\mathbf{x}} \in \mathbb{R}^8$ that contains the decorrelated STFT coefficients for the pixel $\mathbf{x}$ is quantized using a scalar quantizer $s(x)$ (already defined in (2)). Then the final LPQ code is represented as an integer between 0 and 255 using the binary coding:
$$\mathrm{LPQ}(\mathbf{x}) = \sum_{j=1}^{8} s(g_j)\, 2^{j-1},$$
where $g_j$ is the $j$th component of $G_{\mathbf{x}}$. Finally, a histogram of these integer values is composed and used as a feature vector. In this work, we tested LPQ using two sizes for the local window (3 and 5), both with Gaussian derivative quadrature filter pairs for local frequency estimation. LPQ is implemented using the MATLAB code available at http://www.cse.oulu.fi/CMV/Downloads/LPQMatlab.
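The quantization and coding steps can be sketched for a single pixel as below. This is a simplified illustration and not the referenced MATLAB implementation: it uses a plain uniform-window STFT instead of Gaussian derivative quadrature filters, skips the decorrelation step, assumes a frequency parameter equal to the reciprocal of the window size, and orders the eight components by interleaving real and imaginary parts. The function name `lpq_code` is hypothetical.

```python
import cmath


def lpq_code(image, row, col, window=3):
    """Simplified LPQ code (0..255) of one interior pixel, without decorrelation."""
    half = window // 2
    a = 1.0 / window  # assumed scalar frequency parameter
    frequencies = [(a, 0.0), (0.0, a), (a, a), (a, -a)]
    components = []
    for u_row, u_col in frequencies:
        coefficient = 0j
        # STFT coefficient over the window centered on (row, col).
        for d_row in range(-half, half + 1):
            for d_col in range(-half, half + 1):
                phase = -2j * cmath.pi * (u_row * d_row + u_col * d_col)
                coefficient += image[row - d_row][col - d_col] * cmath.exp(phase)
        components.extend([coefficient.real, coefficient.imag])
    # Sign quantizer: 1 if the component is >= 0, else 0; pack the 8 bits.
    return sum((1 if g >= 0 else 0) << j for j, g in enumerate(components))


# Small synthetic image; codes are computed for the 3 x 3 interior pixels.
image = [[(3 * r + 5 * c) % 7 for c in range(5)] for r in range(5)]
histogram = [0] * 256
for r in range(1, 4):
    for c in range(1, 4):
        histogram[lpq_code(image, r, c)] += 1
```

The 256-bin histogram accumulated in the last loop plays the role of the feature vector described above.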

Neighborhood Preserving Embedding (NPE).
First proposed in [25], the NPE transformation is a global approach that preserves the local neighborhood structure on the data manifold. PCA, in contrast, preserves the global Euclidean structure. Thus, NPE is less sensitive to outliers than PCA.
As described in Section 2, we use NPE as the transform for dimensionality reduction in RB.
Given a set of points $x_1, x_2, \ldots, x_m \in \mathbb{R}^n$, the idea behind NPE is to find a transformation matrix $A$ that maps these points into another set $y_1, y_2, \ldots, y_m \in \mathbb{R}^d$, where $d \ll n$. In this way, $y_i = A^T x_i$ represents $x_i$ in a space with significantly fewer dimensions.
NPE begins by building a weight matrix to describe the relationships between data points: each point is described as a weighted combination of its neighbors. An optimal embedding is sought such that the neighborhood structure is preserved in the reduced space.
The algorithm can be formalized in three steps:
(1) Build an adjacency graph: define a graph $G$ with $m$ nodes. The $i$th node represents the point $x_i$. There is an edge between $i$ and $j$ if and only if $x_j$ is one of the $K$ nearest neighbors of $x_i$.
(2) Compute weights: in this step weights on edges are calculated. $W$ is the weight matrix and $W_{ij}$ is the weight of the edge from node $i$ to node $j$. The matrix can be computed by minimizing the objective function
$$\min \sum_{i} \Bigl\| x_i - \sum_{j} W_{ij} x_j \Bigr\|^2$$
subject to $\sum_{j} W_{ij} = 1$, $i = 1, 2, \ldots, m$.
(3) Compute the projection: in this step the linear projection is computed. The following eigenvector problem is solved:
$$X M X^T a = \lambda X X^T a, \qquad M = (I - W)^T (I - W),$$
where $X = [x_1, x_2, \ldots, x_m]$. The local manifold structure is then preserved using the following transformation matrix $A$ that maps $x_i$ to $y_i$:
$$y_i = A^T x_i, \qquad A = (a_0, a_1, \ldots, a_{d-1}),$$
where $a_0, \ldots, a_{d-1}$ are the eigenvectors associated with the $d$ smallest eigenvalues.
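For the weight computation in step (2), the constrained minimization has a standard closed-form solution for each point, which the short derivation below sketches. The notation is an assumption made for this illustration: $x_i$ is a data point, $x_j$ and $x_k$ range over its nearest neighbors, $W_{ij}$ is the weight of neighbor $j$, $C^{(i)}$ is the local Gram matrix of point $x_i$, and $\mathbf{1}$ is a vector of ones.

```latex
% Local Gram matrix of point x_i over its nearest neighbors x_j, x_k:
C^{(i)}_{jk} = (x_i - x_j)^{T} (x_i - x_k)

% Because the weights of x_i sum to one, its reconstruction error becomes
\Bigl\| x_i - \sum_{j} W_{ij} x_j \Bigr\|^{2}
  = \Bigl\| \sum_{j} W_{ij} (x_i - x_j) \Bigr\|^{2}
  = \sum_{j,k} W_{ij} W_{ik} C^{(i)}_{jk}

% Minimizing with a Lagrange multiplier for the sum-to-one constraint gives
W_{i\cdot} = \frac{\bigl(C^{(i)}\bigr)^{-1} \mathbf{1}}
                  {\mathbf{1}^{T} \bigl(C^{(i)}\bigr)^{-1} \mathbf{1}}
```

When the number of neighbors exceeds the input dimension, $C^{(i)}$ is singular, and a small multiple of the identity is commonly added before inversion; whether the implementation used in this work does so is not stated here.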

Datasets
To assess their generalizability, the approaches proposed in this paper were tested across twenty-five datasets: fourteen image classification datasets and eleven UCI data mining datasets.

Image Classification Datasets.
The following fourteen image classification datasets that represent very different computer vision problems were selected to evaluate the generalizability of our approach:
(i) PS: this Pap Smear dataset [26] contains 917 images representing cells that are used in the diagnosis of cervical cancer.
(ii) VI: this dataset, reported in [27], contains images of viruses. A split training/testing set is provided by the authors and is used in this paper. The masks for subtracting image backgrounds were not utilized.
(iv) SM: this dataset, reported in [29], contains images extracted from video-based smoke detection surveillance systems. The same division of the dataset into training/testing sets, reported in [29], is used in this paper.
(v) HI: this dataset, reported in [30], contains images extracted from four fundamental tissues.
(vii) PR: this is a dataset containing 118 DNA-binding proteins and 231 non-DNA-binding proteins. Texture descriptors are extracted from the 2D distance matrix that represents each protein. This matrix is obtained from the 3D tertiary structure of a given protein considering only atoms that belong to the protein backbone (see [32] for details).
(viii) HE: the 2D HeLa dataset [28] contains single cell images divided into 10 staining classes that were taken from fluorescence microscope acquisitions on HeLa cells.
(ix) LO: the locate endogenous mouse subcellular organelles dataset [33] contains 502 images unevenly distributed among 10 classes of endogenous proteins or features of specific organelles.
(x) TR: the locate transfected mouse subcellular organelles dataset [33] contains 553 images unevenly distributed in 11 classes of fluorescence-tagged or epitope-tagged proteins transiently expressed in specific organelles.
(xiii) PA: this dataset, reported in [35], contains 2338 paintings by 50 painters, representative of 13 different painting styles: abstract expressionism, baroque, constructivism, cubism, impressionism, neoclassical, pop art, postimpressionism, realism, renaissance, romanticism, surrealism, and symbolism. A split training/testing set is provided by the authors [35] and is used in this paper.
(xiv) LE: this dataset contains images of 20 species of Brazilian flora [36]. A total of 400 samples, divided into 20 classes (20 samples per class), were collected. Three windows were extracted from each sample. A constraint to the fivefold cross-validation technique was added that required that all windows extracted from a given leaf belong either to the training set or to the testing set, not both.
A descriptive summary of each dataset, along with the URL where each dataset can be downloaded, is reported in Table 1. If a dataset contains RGB images, these were converted to grey-level images before the feature extraction step. The testing protocol used with these datasets is the fivefold cross-validation method, with the exception of three datasets, SM, VI, and PA, where the protocols and testing/training sets defined by the datasets were used (these protocols, which are briefly described above, were obtained from the creators of each of these datasets).

UCI Data Mining Datasets.
We report results obtained using eleven datasets from the UCI repository [37]. In Table 2 we list each dataset used in this study and describe each of them according to the number of attributes (#A), the number of samples (#S), and the number of classes (#C) that each contains. The testing protocol is the fivefold cross-validation method. All features in these datasets were linearly normalized between 0 and 1 before classification, with the normalization parameters computed on the training data only and then applied to the test data. In all the tests the testing set is completely blind.
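The normalization protocol, in which the test data never influence the scaling parameters, can be sketched as follows. The function names and the toy data are assumptions for this illustration.

```python
def fit_min_max(train):
    """Per-feature minimum and maximum, computed on the training rows only."""
    columns = list(zip(*train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, lo, hi):
    """Linearly rescale each feature with the training minimum and maximum."""
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


train = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
test = [[2.0, 25.0], [6.0, 10.0]]

lo, hi = fit_min_max(train)          # parameters come from the training set
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
```

Because the test set is blind, a test value may fall outside the training range and map outside [0, 1], as the first feature of the second test row does here (1.25).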

Results and Discussion
The performance indicator used in all experiments is the area under the ROC curve (AUC) because it provides a better overview of classification results [38]. In the multiclass problem, AUC is calculated using the one-versus-all approach, where a given class is considered "positive" and all other classes are considered "negative." The average AUC is reported in all tables. In order to better compare the ensembles, we also consider their average AUC (Av) and the average rank (RA) obtained in all datasets. RA should be minimized: if a classifier obtains the best performance in all the datasets, its RA is 1. The last row labelled Av in all the tables included in this section reports the average AUC performance on all the datasets. To statistically validate these experiments, the Wilcoxon Signed-Rank test [39] was used for all methods.
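The two summary indicators can be sketched as follows. This is an illustration under stated assumptions: AUC is computed as the fraction of positive/negative pairs ranked correctly (ties counted as one half), the one-versus-all AUCs are averaged without class weighting, and tied methods share the average of their ranks. All function names and numbers are hypothetical.

```python
def auc(scores, positives):
    """AUC as the probability that a positive pattern outscores a negative one."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_all_auc(score_rows, labels, classes):
    """score_rows[i][c] is the score of pattern i for the c-th class."""
    per_class = [
        auc([row[c_idx] for row in score_rows], [y == c for y in labels])
        for c_idx, c in enumerate(classes)
    ]
    return sum(per_class) / len(per_class)


def average_ranks(auc_table):
    """auc_table[d][m] is the AUC of method m on dataset d; rank 1 is best."""
    totals = [0.0] * len(auc_table[0])
    for row in auc_table:
        for m, value in enumerate(row):
            better = sum(v > value for v in row)
            equal = sum(v == value for v in row)  # includes the method itself
            totals[m] += better + (equal + 1) / 2.0
    return [t / len(auc_table) for t in totals]


binary_auc = auc([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
multi_auc = one_vs_all_auc(
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    labels=[0, 1, 2], classes=[0, 1, 2])
ranks = average_ranks([[0.90, 0.80, 0.70],
                       [0.95, 0.85, 0.90]])
```

In the rank example the first method is best on both datasets and therefore reaches the minimum RA of 1, while the other two methods each obtain one second and one third place.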
For the purpose of reproducing and comparing results, the split training/testing sets used for each dataset are available at the website listed at the end of the Introduction. For the image datasets, we also provide both the LTP and LPQ features that were extracted from each dataset for this study.

Experimental Results in Image Classification.
The first set of experiments is aimed at comparing the performance of the proposed approaches with the stand-alone methods described in Section 2. The results are reported in Tables 3 and 4. In these tables, A + B denotes the sum rule applied to classifier A and classifier B, after classifier scores have been normalized to mean 0 and standard deviation 1.
In Table 5, we compare the classifier performances given in Tables 3 and 4 using the Wilcoxon Signed-Rank test. Three symbols are used in Table 5:
(i) " " indicates that the method in the given row exhibits a lower performance (with $p$-value < 0.10) than the method listed in the corresponding column (i.e., the classifier in that row is the "loser" compared with the column classifier).
(ii) "ND" indicates that there is no statistically significant difference between the performances of the two methods.
(iii) " " indicates that the method in the given row exhibits a higher performance, with $p$-value < 0.10, than the method listed in the corresponding column (i.e., the classifier in that row is the "winner" compared with the column classifier).
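A pairwise comparison of this kind can be sketched with an exact two-sided Wilcoxon signed-rank test on the per-dataset AUCs of two methods. This is a minimal illustration, not the procedure used to produce Table 5: the function name and AUC values are hypothetical, zero differences are discarded, and the exact null distribution is enumerated, whereas statistical packages may switch to a normal approximation.

```python
def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, w_minus, p_value); zero differences are discarded.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Doubled average ranks, so that tied ranks (x.5) remain integers.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # 2 * mean of ranks i+1..j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    w_minus2 = sum(r for r, d in zip(ranks2, diffs) if d < 0)
    # Null distribution of the doubled positive-rank sum:
    # each difference is positive or negative with equal probability.
    counts = {0: 1}
    for r in ranks2:
        updated = dict(counts)
        for total, c in counts.items():
            updated[total + r] = updated.get(total + r, 0) + c
        counts = updated
    smaller = min(w_plus2, w_minus2)
    tail = sum(c for total, c in counts.items() if total <= smaller)
    p_value = min(1.0, 2.0 * tail / 2 ** n)
    return w_plus2 / 2.0, w_minus2 / 2.0, p_value


# Hypothetical AUCs of two methods on five datasets.
method_a = [0.91, 0.85, 0.78, 0.88, 0.95]
method_b = [0.90, 0.83, 0.75, 0.84, 0.90]
w_plus, w_minus, p = wilcoxon_signed_rank(method_a, method_b)
row_wins = p < 0.10 and w_plus > w_minus
```

With five datasets and the first method ahead on every one, the rank sums are 15 and 0 and the exact p-value is 2/32 = 0.0625, which is below the 0.10 threshold, so the first method would be marked as the winner.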
Analyzing the results reported in Table 5, we observe some very interesting results:
(i) Both SVM and RS-SVM fail to outperform any of the other state-of-the-art approaches.
(iv) The simple ensemble S D obtains a performance that is comparable with all the other base methods.
(v) RS, which involved no parameter tuning step, outperforms SVM (implemented with LibSVM) where the parameters are optimally tuned for each dataset.
Table 7: Comparisons between all the pairs of methods tested in Table 6.

Experimental Results in Data Mining Datasets.
The same tests reported in the previous section using the image datasets are also run for the data mining datasets. The results reported in Tables 6 and 7 clearly show the usefulness of our ensemble approach.
In these tests we obtain results that are similar to those reported in Tables 3 and 4, but there is an important difference: RS-SVM works poorly in the two datasets containing few features (HA and TR). The reason for this is simple: when only a few features are used to describe a pattern, they are likely to be uncorrelated, so an RS approach is not advised. When the datasets HA and TR are removed from consideration, RS-SVM outperforms SVM. Notice as well that in this test the ensembles outperform the base methods.

Conclusions
The aim of this paper was to compare and combine several state-of-the-art classifiers in order to propose a GP ensemble that works well across a broad set of datasets (fourteen different image datasets and eleven UCI data mining datasets) with no parameter tuning. No single approach was discovered that outperformed all the other classifier systems in all the tested datasets. This finding lends support to the "no free lunch" hypothesis/metaphor that claims that "any two algorithms are equivalent when their performance is averaged across all possible problems" [40]. Nonetheless, several interesting findings are obtained when examining the classifier results across the twenty-five datasets:
(i) Among the different state-of-the-art methods, there is no winner.
(ii) The GP ensembles clearly outperform the state-of-the-art methods without any complex fusion rule (the simple sum rule is used throughout the experiments). In particular, the GP ensembles outperform SVM implemented with the LibSVM toolbox, which is probably the most used classification toolbox reported in the literature.
In our opinion, a heterogeneous system based on different state-of-the-art classifiers (including classifiers that are themselves an ensemble, such as a random subspace of rotation boosting) is the most feasible way of avoiding the "curse" of the "no free lunch" metaphor.
There are many avenues for exploring GP ensembles further. A suggested list of future explorations is the following:
(i) Test the performance of ensembles using more complex fusion rules.
(ii) Test systems on data mining problems where a large set of features is available.
(iii) Expand the base methods to combine different deep learning approaches, such as an extreme learning machine [41] or convolutional neural networks [42] where the input is the whole image and not a feature vector.