We perform an extensive study of the performance of different classification approaches on twenty-five datasets (fourteen image datasets and eleven UCI data mining datasets). The aim is to find General-Purpose (GP) heterogeneous ensembles (requiring little to no parameter tuning) that perform competitively across multiple datasets. The state-of-the-art classifiers examined in this study include the support vector machine, Gaussian process classifiers, random subspace of AdaBoost, random subspace of rotation boosting, and deep learning classifiers. We demonstrate that a heterogeneous ensemble based on the simple fusion by sum rule of different classifiers performs consistently well across all twenty-five datasets. The most important result of our investigation is demonstrating that some very recent approaches, including the heterogeneous ensemble we propose in this paper, are capable of outperforming an SVM classifier (implemented with LibSVM), even when both kernel selection and SVM parameters are carefully tuned for each dataset.
The present trend in machine learning is focused on building optimal classification systems for very specific, well-defined problems. Another research focus, however, would be to work on building General-Purpose (GP) systems that are capable of handling a broader range of problems as well as multiple data types. Ideally, GP systems would work well out of the box, requiring little to no parameter tuning but would still perform competitively against less flexible systems that have been optimized for very specific problems and datasets. One promising avenue of exploration is to build ensembles that are composed of diverse classifiers that merge their hypotheses [
Many ensemble construction techniques are available. One approach is to perturb the information that is given to the base classifiers. The basic assumption behind this approach is that each of the base classifiers makes errors that are independent of each other, but as part of an ensemble they offer stronger classificatory power. To build an ensemble using this approach,
In pattern perturbation,
Feature perturbation techniques manipulate a set of original features into new training sets composed of perturbed features. Some important examples of feature perturbation include random subspace (RS) [
New ensembles can also be composed by mixing the two perturbation methods discussed above. For example, Random Forest [
Finally, in classifier perturbation, each classifier of the same type (homogeneous ensembles) can be given different parameter values, or different classifiers (heterogeneous ensembles) can be combined and trained on the same training set. Classifier perturbation methods for building ensembles have been the least studied in the literature, but recently several studies have focused on this type of ensemble [
GP ensembles that exploit information available in different feature extraction methods and representations of the data have also been explored. In [
In this work the focus is on testing different classifiers and their combinations across twenty-five datasets (fourteen image datasets and eleven UCI data mining datasets). In the image datasets, two state-of-the-art texture descriptors are utilized: Local Ternary Patterns [
The most interesting result obtained from our experiments is that the best GP ensemble proposed in this paper outperforms each stand-alone classifier without any ad hoc tuning on the dataset: the same fusion rule is used for all twenty-five datasets tested in this work. As a result, we are confident that the proposed GP ensemble can easily be extended to other problems and should prove useful to researchers who want a reliable classifier that works well without tuning it. It should be noted, however, that there is a cost associated with using heterogeneous ensembles: increased computational time.
The remainder of this paper is organized as follows. In Section
Since the aim of this study is to find a heterogeneous multiclassifier system that works well with a large number of datasets, we examined the fusion by sum rule of several state-of-the-art classifiers: the Support Vector Machine (SVM), Gaussian process classifier (GPC), RS of AdaBoost (RS_AB), RS of rotation boosting (RS_RB), and deep learning (DL). The sum rule [
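As an illustration, the sum rule can be sketched as follows (a minimal sketch, assuming each classifier outputs one continuous score per class; scores are normalized to zero mean and unit standard deviation before summing, matching the normalization described with the results):

```python
import numpy as np

def sum_rule(score_matrices):
    """Fuse classifiers by the sum rule: normalize each classifier's
    score matrix (samples x classes) to mean 0 / standard deviation 1,
    sum the normalized scores class-wise, and predict the class with
    the largest fused score for each sample."""
    fused = np.zeros_like(score_matrices[0], dtype=float)
    for scores in score_matrices:
        fused += (scores - scores.mean()) / scores.std()
    return fused.argmax(axis=1)
```

For example, fusing the scores of two classifiers on two samples, `sum_rule([a, b])` returns one predicted class index per sample.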
SVM [
In addition to SVM implemented with LibSVM, we test an RS ensemble with SVM as the classifier (using fifty different subspaces that include 50% of the original features). This ensemble is called RS_SVM.
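A minimal sketch of such an ensemble follows (the class name is ours; scikit-learn's SVC stands in for LibSVM, and fusing the members' probability outputs by the sum rule is an assumption of this sketch):

```python
import numpy as np
from sklearn.svm import SVC

class RandomSubspaceSVM:
    """Random subspace ensemble of SVMs: each member is trained on a
    random subset containing `subspace_ratio` of the features."""

    def __init__(self, n_estimators=50, subspace_ratio=0.5, seed=0):
        self.n_estimators = n_estimators
        self.subspace_ratio = subspace_ratio
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_feats = X.shape[1]
        k = max(1, int(round(self.subspace_ratio * n_feats)))
        self.members = []
        for _ in range(self.n_estimators):
            idx = self.rng.choice(n_feats, size=k, replace=False)
            clf = SVC(probability=True).fit(X[:, idx], y)
            self.members.append((idx, clf))
        return self

    def predict(self, X):
        # Sum-rule fusion of the members' class-probability outputs.
        votes = sum(clf.predict_proba(X[:, idx]) for idx, clf in self.members)
        return self.members[0][1].classes_[votes.argmax(axis=1)]
```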
GPC [
RS_AB [
RS_RB is the random subspace version of rotation boosting (RB). RB [
Deep learning is one of the most recent and best-performing approaches in Artificial Intelligence, a field it revolutionized when it was first proposed in 2006 [
Another major feature of deep learning networks is that they are able to exploit unlabeled data, a crucial feature when dealing with huge sets of data. Deep learning has been widely used in computer vision and image understanding applications, including object recognition in 3D (RGB-D) data [
In this work, we test the deep learning approach based on the Feedforward Backpropagation Neural Network (FBNN) [
In Sections
LTP [
In the LTP, the threshold function
The ternary coding
The ternary code obtained by the
Thus, the
The resulting binary codes are used to create two histograms of LTP values.
In this work two values of
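For a single 3 × 3 neighborhood, the ternary coding and its split into the two binary patterns can be sketched as follows (illustrative only; the neighbor ordering and the convention that the "upper" pattern keeps the +1 entries and the "lower" pattern the −1 entries follow the standard LTP formulation and are assumptions here):

```python
import numpy as np

def ltp_codes(patch, tau):
    """Ternary LTP codes for the 3x3 neighborhood `patch`.
    Each neighbor n of the center c is coded 1 if n >= c + tau,
    -1 if n <= c - tau, and 0 otherwise; the +1 and -1 entries are
    split into two 8-bit binary codes (upper, lower)."""
    c = patch[1, 1]
    # the 8 neighbors, clockwise from the top-left corner
    neigh = patch.ravel()[[0, 1, 2, 5, 8, 7, 6, 3]]
    ternary = np.where(neigh >= c + tau, 1,
                       np.where(neigh <= c - tau, -1, 0))
    weights = 2 ** np.arange(8)
    upper_code = int(((ternary == 1) * weights).sum())
    lower_code = int(((ternary == -1) * weights).sum())
    return upper_code, lower_code
```

Sliding this over the image and histogramming the two codes yields the two LTP histograms described above.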
LPQ is a texture descriptor [
The four complex coefficients
Finally, a histogram of these integer values is composed and used as a feature vector. In this work, we tested LPQ using two sizes for the local window
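The pipeline just described (four local Fourier coefficients per pixel, eight sign bits, a 256-bin histogram) can be sketched as follows; the uniform, non-Gaussian window and this particular set of four low frequencies are simplifying assumptions, and the decorrelation step used in full LPQ implementations is omitted:

```python
import numpy as np
from scipy.signal import convolve2d

def lpq_descriptor(image, win=3):
    """Simplified LPQ: separable STFT over a win x win window at four
    low frequencies; the signs of the real and imaginary parts of the
    four coefficients form an 8-bit code per pixel, and the descriptor
    is the normalized 256-bin histogram of those codes."""
    r = (win - 1) // 2
    x = np.arange(-r, r + 1)
    a = 1.0 / win
    w0 = np.ones_like(x, dtype=complex)   # DC basis vector
    w1 = np.exp(-2j * np.pi * a * x)      # frequency a
    w2 = np.conj(w1)                      # frequency -a
    img = image.astype(float)

    def stft(row_f, col_f):
        # separable 2-D filtering: rows first, then columns
        tmp = convolve2d(img, row_f[:, None], mode='valid')
        return convolve2d(tmp, col_f[None, :], mode='valid')

    coeffs = [stft(w0, w1), stft(w1, w0), stft(w1, w1), stft(w1, w2)]
    codes = np.zeros(coeffs[0].shape, dtype=int)
    for i, c in enumerate(coeffs):
        codes |= (c.real > 0).astype(int) << (2 * i)
        codes |= (c.imag > 0).astype(int) << (2 * i + 1)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()
```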
First proposed in [
Given a set of points
NPE begins by building a weight matrix to describe the relationships between data points: each point is described as a weighted combination of its neighbors. An optimal embedding is sought such that the neighborhood structure is preserved in the reduced space.
The algorithm can be formalized in three steps:
(i) Build an adjacency graph: define a graph
(ii) Compute the weights: in this step the weights on the edges are calculated.
(iii) Compute the projection: in this step the linear projection is computed by solving the following eigenvector problem:
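The three steps above can be sketched in code; the least-squares reconstruction weights and the generalized eigenproblem X M X^T a = λ X X^T a are the standard NPE formulation and are assumed in this sketch:

```python
import numpy as np
from scipy.linalg import eigh

def npe(X, k=5, d=2):
    """Minimal NPE sketch: k-NN graph, least-squares reconstruction
    weights, linear projection from a generalized eigenproblem."""
    n, D = X.shape
    # Step 1: k nearest neighbours of each point (Euclidean distance).
    dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(dist, np.inf)           # exclude self-matches
    nbrs = np.argsort(dist, axis=1)[:, :k]
    # Step 2: weights W such that x_i is reconstructed from neighbours.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                # centre neighbours on x_i
        G = Z @ Z.T + 1e-6 * np.eye(k)       # regularized local Gram
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs[i]] = w / w.sum()
    # Step 3: smallest generalized eigenvectors give the projection.
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    lhs = X.T @ M @ X
    rhs = X.T @ X + 1e-6 * np.eye(D)
    _, vecs = eigh(lhs, rhs)                 # ascending eigenvalues
    return X @ vecs[:, :d]                   # embedded coordinates
```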
To assess their generalizability, the approaches proposed in this paper were tested across twenty-five datasets: fourteen image classification datasets and eleven UCI data mining datasets.
The following fourteen image classification datasets, which represent very different computer vision problems, were selected to evaluate the generalizability of our approach:
(i) PS: this Pap Smear dataset [
(ii) VI: this dataset, reported in [
(iii) CH: this dataset, reported in [
(iv) SM: this dataset, reported in [
(v) HI: this dataset, reported in [
(vi) BR: this dataset, reported in [
(vii) PR: this is a dataset containing 118 DNA-binding proteins and 231 non-DNA-binding proteins. Texture descriptors are extracted from the 2D distance matrix that represents each protein. This matrix is obtained from the 3D tertiary structure of a given protein considering only atoms that belong to the protein backbone (see [
(viii) HE: the 2D HeLa dataset [
(ix) LO: the locate endogenous mouse subcellular organelles dataset [
(x) TR: the locate transfected mouse subcellular organelles dataset [
(xi) PI: this dataset, reported in [
(xii) RN: this is a dataset containing 200 fluorescence microscopy images evenly distributed among 10 classes of fly cells subjected to a set of gene knockdowns using RNAi. The cells were stained with DAPI to visualize their nuclei.
(xiii) PA: this dataset, reported in [
(xiv) LE: this dataset contains images of 20 species of Brazilian flora [
A descriptive summary of each dataset, along with the URL where each dataset can be downloaded, is reported in Table
Descriptive summary of the image datasets.
Dataset | Number of classes | Number of samples | Sample size | URL for download
---|---|---|---|---
PS | 2 | 917 | Various |
VI | 15 | 1500 | 41 × 41 |
CH | 5 | 327 | 512 × 382 |
SM | 2 | 2868 | 100 × 100 |
HI | 4 | 2828 | Various | Upon request to Loris Nanni [
BR | 2 | 584 | Various | Upon request to Geraldo Braz Junior [
PR | 2 | 349 | Various | Upon request to Loris Nanni [
HE | 10 | 862 | 512 × 382 |
LO | 10 | 502 | 768 × 512 |
TR | 11 | 553 | 768 × 512 |
PI | 13 | 903 | Various |
RN | 10 | 200 | 1024 × 1024 |
PA | 13 | 2338 | Various |
LE | 20 | 1200 | 128 × 128 | Upon request to
We report results obtained using eleven datasets from the UCI repository [
UCI datasets and their features: number of attributes (#A), number of samples (#S), and number of classes (#C).
Dataset | Acronym | #A | #S | #C | Brief description
---|---|---|---|---|---
BREAST | BR | 9 | 699 | 2 | For breast tumor diagnosis
HEART | HE | 13 | 303 | 2 | For detecting heart disease; the “goal” field refers to the presence of heart disease in the patient
PIMA | PI | 8 | 768 | 2 | For forecasting the onset of diabetes mellitus
Spam | SP | 57 | 4601 | 2 | For classifying E-mail as spam or nonspam
SONAR | SO | 60 | 208 | 2 | For discriminating between sonar signals bounced off a metal cylinder and those bounced off a rough cylindrical rock
IONOSPHERE | IO | 34 | 351 | 2 | For classifying radar returns from the ionosphere
Liver | LI | 7 | 345 | 2 | For classifying liver disorders that might arise from excessive alcohol consumption
Haberman | HA | 3 | 306 | 2 | A dataset that contains cases on the survival of patients who had undergone surgery for breast cancer
Vote | VO | 16 | 435 | 2 | For classifying Republican versus Democrat US representatives (this dataset includes votes for each member of the US House of Representatives on 16 key votes)
Australian | AU | 14 | 690 | 2 | For credit card applications
Transfusion | TR | 5 | 748 | 2 | From the donor database of the Blood Transfusion Service Center; the aim is to predict whether a person donated blood in March 2007
The performance indicator used in all experiments is the area under the ROC curve (AUC) because it provides a better overview of classification results [
In order to better compare the ensembles, we also consider their average AUC (Av) and their average rank (RA) across the datasets.
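For instance, with scikit-learn the per-dataset AUC and the average-rank summary can be computed as follows (the mean-rank handling of ties is an assumption of this sketch):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

# AUC of one classifier on one dataset, from labels and scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_score)

def average_rank(auc_table):
    """auc_table: (methods x datasets) array of AUC values. Rank the
    methods on each dataset (rank 1 = highest AUC, ties share the mean
    rank) and average the ranks across datasets."""
    ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, auc_table)
    return ranks.mean(axis=1)
```

A lower average rank therefore indicates a method that is more consistently near the top across datasets, which is how the RA columns in the tables below should be read.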
For the purpose of reproducing and comparing results, the split training/testing sets used for each dataset are available at the website listed at the end of the Introduction. For the image datasets, we also provide both the LTP and LPQ features that were extracted from each dataset for this study.
The first set of experiments is aimed at comparing the performance of the proposed approaches with the stand-alone methods described in Section
The following fusions are evaluated:
(i) S_D: (DL1 + DL2 + DL3)/3.
(ii) E1: GPC + RS_AB.
(iii) E2: GPC + RS_AB + RS_RB.
(iv) E3: GPC + RS_AB + RS_RB + RS-SVM.
(v) E4: GPC + RS_AB + RS_RB + RS-SVM + S_D.
Performance (AUC) obtained in different image datasets using LTP as texture descriptor.
LTP | PS | VI | CH | SM | HI | BR | PR | HE | LO | TR | PI | RN | PA | LE | Av | RA
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SVM | 0.9144 | 0.9349 | | 0.9975 | 0.9156 | 0.9692 | 0.8968 | 0.9814 | 0.9949 | 0.9926 | 0.9286 | 0.9696 | 0.8903 | 0.9792 | 0.9546 | 7.0
RS-SVM | 0.9071 | 0.9352 | | | 0.9195 | 0.9763 | 0.9030 | 0.9826 | 0.9950 | 0.9924 | 0.9316 | 0.9713 | 0.8944 | 0.9807 | 0.9562 | 5.9
GPC | 0.9086 | 0.9131 | 0.9997 | 0.9971 | 0.9198 | 0.9789 | 0.8865 | 0.9816 | 0.9964 | 0.9930 | 0.9090 | 0.9769 | 0.8968 | 0.9752 | 0.9523 | 8.1
RS_AB | 0.9121 | 0.9254 | 0.9998 | 0.9974 | 0.8924 | 0.9810 | 0.9079 | 0.9813 | 0.9965 | 0.9953 | 0.9242 | 0.9771 | 0.8959 | 0.9738 | 0.9543 | 7.0
RS_RB | 0.9110 | 0.9293 | 0.9999 | 0.9953 | 0.9136 | 0.9739 | 0.8886 | 0.9806 | 0.9969 | 0.9955 | 0.9178 | 0.9900 | 0.8940 | 0.9738 | 0.9543 | 7.6
DL1 | 0.8927 | 0.9173 | 0.9999 | 0.9952 | 0.9072 | 0.9811 | 0.8486 | 0.9801 | 0.9962 | 0.9965 | 0.9147 | 0.9837 | 0.8865 | | 0.9486 | 8.7
DL2 | 0.8965 | 0.9220 | 0.9998 | 0.9956 | 0.8945 | 0.9815 | 0.8780 | 0.9806 | 0.9956 | 0.9959 | 0.9061 | 0.9878 | 0.8869 | 0.7900 | 0.9365 | 9.4
DL3 | 0.7802 | 0.9239 | 0.9999 | 0.9963 | 0.9082 | 0.9815 | 0.8779 | 0.9812 | 0.9958 | 0.9962 | 0.9014 | 0.9916 | 0.8895 | 0.9525 | 0.9412 | 8.5
S_D | 0.8985 | 0.9244 | 1.000 | 0.9958 | 0.9143 | | 0.8783 | 0.9818 | 0.9958 | 0.9966 | 0.9151 | | 0.8960 | 0.9806 | 0.9537 | 6.0
E1 | 0.9130 | 0.9196 | 0.9997 | 0.9974 | 0.9162 | 0.9812 | 0.8999 | 0.9816 | 0.9962 | 0.9945 | 0.9191 | 0.9798 | 0.8983 | 0.9748 | 0.9551 | 7.2
E2 | | 0.9337 | 0.9998 | 0.9973 | 0.9184 | 0.9809 | 0.9007 | 0.9837 | 0.9969 | 0.9960 | 0.9238 | 0.9884 | 0.9030 | 0.9768 | 0.9583 | 5.1
E3 | | 0.9361 | | 0.9975 | | 0.9816 | | 0.9843 | 0.9970 | 0.9968 | 0.9313 | 0.9835 | 0.9080 | 0.9796 | 0.9603 | 2.8
E4 | 0.9164 | | | 0.9975 | 0.9235 | 0.9824 | 0.9059 | | | | | 0.9864 | | 0.9808 | |
Performance obtained on the different image datasets using LPQ as texture descriptor.
LPQ | PS | VI | CH | SM | HI | BR | PR | HE | LO | TR | PI | RN | PA | LE | Av | RA
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SVM | 0.9039 | | 0.9999 | 0.9986 | 0.9138 | 0.9565 | 0.8618 | 0.9757 | 0.9764 | 0.9767 | 0.9071 | 0.9532 | 0.8834 | | 0.9461 | 8.4
RS-SVM | 0.8951 | 0.9485 | 0.9999 | 0.9988 | 0.9251 | 0.9568 | 0.8727 | 0.9786 | 0.9809 | 0.9817 | 0.9128 | 0.9531 | 0.8854 | 0.9891 | 0.9485 | 7.5
GPC | 0.9020 | 0.9282 | 0.9991 | 0.9985 | 0.9199 | 0.9720 | 0.8883 | 0.9793 | 0.9891 | | 0.9073 | 0.9439 | 0.8867 | 0.9782 | 0.9490 | 7.4
RS_AB | 0.9013 | 0.9417 | 0.9998 | 0.9989 | 0.8783 | 0.9671 | 0.8843 | 0.9781 | 0.9868 | 0.9907 | 0.9255 | 0.9478 | 0.8777 | 0.9826 | 0.9472 | 7.9
RS_RB | 0.8994 | 0.9393 | 0.9992 | 0.9978 | 0.9120 | 0.9711 | 0.8999 | 0.9741 | 0.9800 | 0.9889 | 0.9116 | 0.9562 | 0.8806 | 0.9799 | 0.9493 | 8.6
DL1 | 0.8701 | 0.9382 | 0.9994 | 0.9982 | 0.9083 | 0.9684 | 0.8758 | 0.9815 | 0.9847 | 0.9873 | 0.9110 | 0.9537 | 0.8858 | 0.9819 | 0.9460 | 9.0
DL2 | 0.8081 | 0.9379 | 0.9989 | 0.9979 | 0.9025 | 0.9682 | 0.8745 | 0.9813 | 0.9851 | 0.9852 | 0.9033 | 0.9550 | 0.8783 | 0.9853 | 0.9401 | 10.2
DL3 | 0.8717 | 0.9401 | 0.9990 | 0.9983 | 0.9097 | 0.9647 | 0.8694 | 0.9813 | 0.9861 | 0.9854 | 0.9038 | | 0.8785 | 0.9833 | 0.9456 | 9.3
S_D | 0.8864 | 0.9415 | 0.9997 | 0.9982 | 0.9165 | 0.9687 | 0.8807 | | 0.9871 | 0.9885 | 0.9118 | 0.9594 | 0.8894 | 0.9848 | 0.9497 | 6.3
E1 | 0.9045 | 0.9345 | 0.9994 | 0.9989 | 0.9137 | 0.9726 | 0.8884 | 0.9794 | 0.9899 | 0.9931 | 0.9202 | 0.9469 | 0.8860 | 0.9807 | 0.9506 | 6.3
E2 | 0.9065 | 0.9441 | 0.9995 | | 0.9168 | | 0.8942 | 0.9793 | 0.9883 | 0.9932 | 0.9219 | 0.9574 | 0.8910 | 0.9834 | 0.9535 | 4.2
E3 | | 0.9467 | 0.9999 | 0.9990 | 0.9238 | 0.9716 | 0.8968 | 0.9805 | 0.9891 | 0.9927 | 0.9228 | 0.9581 | 0.8981 | 0.9867 | 0.9554 | 3.3
E4 | 0.9097 | | | 0.9990 | | 0.9714 | | 0.9825 | | 0.9926 | | 0.9635 | | 0.9870 | |
A + B is the sum rule applied to classifier A and classifier B, after classifier scores have been normalized to mean 0 and standard deviation 1.
In the following table, “W” indicates that the method in the row performs statistically significantly better than the method in the column, “ND” indicates that there is no statistically significant difference between the performances of the two methods, and “L” indicates that the method in the row performs statistically significantly worse than the method in the column.
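As an illustration, a pairwise comparison of this kind can be computed over per-dataset AUCs; the choice of the Wilcoxon signed-rank test and the 0.05 significance level are assumptions of this sketch:

```python
from scipy.stats import wilcoxon

def compare(auc_a, auc_b, alpha=0.05):
    """Return 'W' if method A is significantly better than method B
    across the paired per-dataset AUCs, 'L' if significantly worse,
    and 'ND' if there is no significant difference."""
    _, p = wilcoxon(auc_a, auc_b)
    if p >= alpha:
        return 'ND'
    diff = sum(a - b for a, b in zip(auc_a, auc_b))
    return 'W' if diff > 0 else 'L'
```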
Comparisons between all the pairs of tested methods.
 | SVM | RS-SVM | GPC | RS_AB | RS_RB | DL1 | DL2 | DL3 | S_D | E1 | E2 | E3 | E4
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SVM | — | L | ND | L | ND | ND | ND | ND | ND | L | L | L | L |
RS-SVM | — | — | ND | L | ND | ND | ND | ND | ND | L | L | L | L |
GPC | — | — | — | ND | ND | ND | W | W | ND | L | L | L | L |
RS_AB | — | — | — | — | ND | W | W | W | ND | ND | L | L | L |
RS_RB | — | — | — | — | — | W | W | ND | ND | L | L | L | L |
DL1 | — | — | — | — | — | — | ND | ND | L | L | L | L | L |
DL2 | — | — | — | — | — | — | — | ND | L | L | L | L | L |
DL3 | — | — | — | — | — | — | — | — | ND | L | L | L | L |
S_D | — | — | — | — | — | — | — | — | — | ND | L | L | L |
E1 | — | — | — | — | — | — | — | — | — | — | L | L | L |
E2 | — | — | — | — | — | — | — | — | — | — | — | L | L |
E3 | — | — | — | — | — | — | — | — | — | — | — | — | L |
Analyzing the results reported in the tables above, we note the following:
(i) Both SVM and RS-SVM fail to outperform any of the other state-of-the-art approaches.
(ii) E2, E3, and E4 outperform all the other approaches.
(iii) E4 (the fusion of all the base methods) always obtains the highest performance.
(iv) The simple ensemble S_D obtains a performance that is comparable with that of the other base methods.
(v) RS-SVM, which involves no parameter tuning step, outperforms SVM (implemented with LibSVM) whose parameters are optimally tuned for each dataset.
The same tests reported in the previous section using the image datasets are also run for the data mining datasets. The results reported in Tables
Performance on the different data mining datasets.
Method | BR | HE | PI | SP | SO | IO | LI | HA | VO | AU | TR | Av | RA
---|---|---|---|---|---|---|---|---|---|---|---|---|---
SVM | 0.9941 | 0.8809 | 0.824 | 0.9708 | 0.9517 | 0.9814 | 0.7558 | | 0.9855 | 0.9164 | 0.714 | 0.8796 | 7.3077
RS-SVM | 0.9931 | 0.9076 | 0.8221 | 0.9771 | | 0.9795 | 0.7411 | 0.6399 | 0.9853 | 0.9221 | 0.6931 | 0.8745 | 8.6923
GPC | 0.9924 | 0.9024 | 0.827 | 0.979 | 0.9409 | 0.9713 | 0.729 | 0.6804 | 0.9882 | 0.9267 | 0.7295 | 0.8788 | 8.000
RS_AB | 0.991 | 0.9101 | 0.8229 | | 0.9371 | 0.9788 | 0.7581 | 0.6727 | 0.9887 | 0.9313 | 0.735 | 0.8831 | 7.000
RS_RB | 0.9925 | | 0.8208 | 0.9873 | 0.9334 | | 0.7664 | 0.6071 | 0.9884 | 0.9326 | 0.674 | 0.8731 | 7.3846
DL1 | | 0.8852 | 0.8252 | 0.966 | 0.8794 | 0.9222 | 0.7541 | 0.6751 | 0.9795 | 0.9155 | 0.7338 | 0.8664 | 8.7692
DL2 | 0.9941 | 0.8754 | 0.8149 | 0.9691 | 0.8789 | 0.9242 | 0.7478 | 0.6679 | 0.9808 | 0.9088 | 0.7318 | 0.8631 | 10.3077
DL3 | | 0.8941 | 0.8193 | 0.9684 | 0.8501 | 0.9022 | 0.6966 | 0.6537 | 0.9787 | 0.9154 | 0.7351 | 0.8553 | 10.2308
S_D | 0.9942 | 0.883 | 0.8238 | 0.9683 | 0.8781 | 0.9297 | 0.751 | 0.6772 | 0.9813 | 0.9186 | 0.7357 | 0.8674 | 8.6154
E1 | 0.992 | 0.9096 | 0.8277 | 0.9856 | 0.9426 | 0.9772 | 0.7532 | 0.6868 | | 0.9331 | | 0.885 | 6.000
E2 | 0.9924 | 0.9124 | 0.8285 | | 0.9426 | 0.9817 | | 0.6724 | 0.9897 | | 0.7257 | 0.8856 | 5.3846
E3 | 0.9933 | 0.9141 | 0.8288 | 0.9873 | 0.9508 | 0.9819 | 0.7723 | 0.6726 | 0.989 | 0.9343 | 0.7258 | |
E4 | 0.9934 | 0.9113 | | 0.9862 | 0.942 | 0.9805 | 0.7717 | 0.6794 | 0.9895 | 0.9339 | 0.7297 | 0.8861 | 5.2308
Comparisons between all the pairs of methods tested in Table
 | SVM | RS-SVM | GPC | RS_AB | RS_RB | DL1 | DL2 | DL3 | S_D | E1 | E2 | E3 | E4
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SVM | — | ND | ND | ND | ND | W | W | W | W | ND | L | L | L |
RS-SVM | — | — | ND | ND | W | ND | ND | W | ND | L | L | L | L |
GPC | — | — | — | L | ND | W | W | W | W | L | L | L | L |
RS_AB | — | — | — | — | ND | W | W | W | W | ND | L | L | L |
RS_RB | — | — | — | — | — | ND | ND | W | ND | ND | L | L | L |
DL1 | — | — | — | — | — | — | W | W | ND | L | L | L | L |
DL2 | — | — | — | — | — | — | — | W | ND | L | L | L | L |
DL3 | — | — | — | — | — | — | — | — | L | L | L | L | L |
S_D | — | — | — | — | — | — | — | — | — | L | L | L | L |
E1 | — | — | — | — | — | — | — | — | — | — | ND | ND | ND |
E2 | — | — | — | — | — | — | — | — | — | — | — | ND | ND |
E3 | — | — | — | — | — | — | — | — | — | — | — | — | ND |
In these tests we obtain results that are similar to those reported in Tables
The aim of this paper was to compare and combine several state-of-the-art classifiers in order to propose a GP ensemble that works well across a broad set of datasets (fourteen image datasets and eleven UCI data mining datasets) with no parameter tuning. No single approach was found to outperform all the other classifier systems on all the tested datasets. This finding lends support to the “no free lunch” hypothesis/metaphor, which claims that “any two algorithms are equivalent when their performance is averaged across all possible problems” [
Our main conclusions are the following:
(i) Among the different state-of-the-art methods, there is no clear winner.
(ii) The GP ensembles clearly outperform the state-of-the-art methods without any complex fusion rule (the simple sum rule is used throughout the experiments).
(iii) In particular, the GP ensembles outperform SVM implemented with the LibSVM toolbox, which is probably the most widely used classification toolbox in the literature.
In our opinion, a heterogeneous system based on different state-of-the-art classifiers (including classifiers that are themselves an ensemble, such as a random subspace of rotation boosting) is the most feasible way of avoiding the “curse” of the “no free lunch” metaphor.
There are many avenues for exploring GP ensembles further. A suggested list of future explorations is the following:
(i) Test the performance of ensembles using more complex fusion rules.
(ii) Test systems on data mining problems where a large set of features is available.
(iii) Expand the base methods to combine different deep learning approaches, such as an extreme learning machine [
The authors declare that there is no conflict of interests regarding the publication of this paper.