Improving Classification Performance through an Advanced Ensemble of Heterogeneous Extreme Learning Machines

Extreme Learning Machine (ELM) is a fast learning algorithm for single-hidden-layer feedforward neural networks (SLFNs) and often achieves good generalization performance. However, it may overfit the training data when it uses more hidden nodes than needed. To improve generalization, we adopt a heterogeneous ensemble approach and propose an Advanced ELM Ensemble (AELME) for classification, which combines Regularized-ELM, L2-norm-optimized ELM (ELML2), and Kernel-ELM. The ensemble is constructed by training a randomly chosen ELM classifier on a subset of the training data selected through random resampling, and it evolves under an objective that increases both the diversity and the accuracy of the final ensemble. The class label of unseen data is then predicted by majority vote. Splitting the training data into subsets and incorporating heterogeneous ELM classifiers yield higher prediction accuracy, better generalization, and fewer base classifiers than other models (Adaboost, Bagging, Dynamic ELM ensemble, data-splitting ELM ensemble, and ELM ensemble). The validity of AELME is confirmed through classification experiments on several real-world benchmark datasets.


Introduction
Ensemble learning is a machine learning process that achieves better prediction performance by strategically combining the predictions of multiple learning algorithms [1]. Ensembles are known to reduce the risk of selecting a wrong model by aggregating all candidate models [2,3].
In the process of improving ensemble accuracy and stability, different techniques have been established. These techniques vary in how they treat the training data, the type of algorithms used, and the combination methods followed. Bagging [4], Boosting [5], and their variants, such as Adaboost [6], are some of the popular ensembling techniques.
Traditional neural-network-based classifiers usually suffer from overfitting and local-optimum issues and have remained an active research subject for performance improvement through different ensemble methods. Recently, however, the Extreme Learning Machine (ELM) has gained popularity for solving classification problems. ELM is a learning algorithm for single-hidden-layer feedforward networks (SLFNs). Unlike classic gradient-based learning algorithms, which work only with differentiable activation functions and are prone to issues such as local optima, improper learning rates, and overfitting, ELM can handle nondifferentiable activation functions and reaches its solution directly without such issues [7]. Random initialization of the input-to-hidden-layer parameters in ELM avoids any tuning of the hidden-layer parameters, which greatly shortens learning time. Although ELM is fast and achieves good generalization performance, there is still room for improvement. Several modifications of the base ELM algorithm have recently been introduced to improve accuracy and generalization, such as the optimally pruned Extreme Learning Machine (OP-ELM) [8] and Regularized-ELM [9][10][11][12].
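As context for the modifications discussed above, the core ELM training step can be sketched as follows. This is a minimal NumPy illustration of the standard algorithm, not the authors' implementation; the sigmoid activation and the function names are our own choices.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """Train a basic ELM: random hidden layer, least-squares output layer.

    X: (n_samples, n_features) inputs; T: (n_samples, n_classes) one-hot targets.
    """
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                             # Moore-Penrose pseudoinverse solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)
```

Because the input weights are never tuned, training reduces to one matrix pseudoinverse, which is the source of ELM's speed.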

Computational Intelligence and Neuroscience
In contrast, ensemble learning offers an inexpensive alternative for performance optimization. Several ELM-based ensemble approaches have been proposed, such as DELM [13], EnELM [14], and DSELME [15]. Such ensembles of Extreme Learning Machine classifiers have achieved good performance for hyperspectral image classification and segmentation in a semisupervised and spatially regularized process [16]. Bagging-ELM (B-ELM) [17] is another ELM ensemble classifier, which leverages the bag of little bootstraps technique and has been found efficient for large-scale data classification. An online sequential ELM (OS-ELM) based framework supports ensemble methods including Bagging, subspace partitioning, and cross-validation [18].
Diversity among the individual classifiers in an ensemble is essential when combining the predictions of several member classifiers, and different techniques are used to introduce it. Cross-validation has been used to validate each ELM before adding it to the ensemble [13,14]. Prior work on ELM ensembles in the literature used a homogeneous base classifier algorithm to train the members of the model [13][14][15]. Motivated by the accuracy gains of enhanced ELM algorithms and of the ensemble approach, we propose a heterogeneous ensemble model that trains its members with different ELM algorithms. More specifically, we adopt three types of ELM algorithms, namely, Regularized-ELM [10], ELML2 [11], and Kernel-ELM [19]; these are briefly described in Section 2. These base classifiers enhance the standard ELM algorithm and are chosen for their better generalization, regularization, and resilience to outliers. A random resampling strategy is used to split the training data into subsets, and each member classifier is trained on a randomly chosen data subset with a randomly selected base ELM algorithm. The proposed ensemble algorithm evolves by monitoring the diversity and generalization performance of the updated ensemble during training. The majority voting method is used to combine the predictions of the member classifiers in AELME. Ten real-world benchmark datasets (Iris, Climate, Credit, Wave, Satellite, Letter, Firm, Colon, Liver, and Vowel) are used for detailed performance analysis and comparison. Experimental results, as reported in this work, show that the proposed AELME approach gives better classification accuracy on the benchmark datasets. The remainder of this paper is organized as follows: the AELME ensemble algorithm and implementation details are elaborated in Section 3.
Performance analysis of the proposed AELME algorithm (see Algorithm 1) is reported in Section 4, where its accuracy is compared with the base classifiers {RELM, ELML2, KELM, ELM, SVM} and with other ensemble methods, including DELM, DSELME, and EnELM. Finally, the paper is concluded in Section 5.

Background
2.1. The Base Classifiers. Three types of ELM classifiers, namely, ELML2, RELM, and KELM, are used as base classifiers to build the AELME ensemble. Here we briefly introduce the strengths of the selected base ELM classifiers. ELML2 [11] is a regularized algorithm-based ELM that retains all the basic ELM advantages for regression, binary, and multiclass classification. Moreover, it introduces a Lagrange-multiplier-based constrained optimization method, so the resulting solution is more stable and generalizes better with different types of hidden nodes (feature mappings). KELM [19] is an optimization-method-based Extreme Learning Machine, which links the minimal-weight-norm property of ELM to the maximal margin of Support Vector Machines (SVM) for classification. It has been shown that, through standard optimization for ELM, a so-called support vector network with better generalization can be obtained using ELM kernels. Compared with standard SVM, however, Kernel-ELM is less sensitive to the user-specified parameters and has fewer optimization constraints. RELM [10] is a constrained-optimization-based ELM for regression and multiclass classification. For better generalization, RELM trades off the structural risk (weight norm) against the empirical risk (least-squares error) by regulating their proportion during optimization; to achieve this balance, the empirical risk in the objective function is weighted by a regulating factor gamma. For more details, the reader can refer to [10,11,19].
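To make the distinctions above concrete, the closed-form output-weight solutions of the regularized and kernel variants can be sketched as follows. This is a hedged illustration of the standard formulations described in [10, 11, 19]; the function names, the RBF kernel choice, and the parameter names C and gamma are our own assumptions, not the paper's.

```python
import numpy as np

def regularized_output_weights(H, T, C=1.0):
    # Regularized least squares: beta = (H^T H + I/C)^{-1} H^T T,
    # trading off weight norm (structural risk) against training error (empirical risk).
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)

def kelm_fit(X, T, C=1.0, gamma=1.0):
    # Kernel ELM: alpha = (K + I/C)^{-1} T with an RBF kernel on the training data,
    # so no explicit hidden layer is ever formed.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    alpha = np.linalg.solve(np.exp(-gamma * sq) + np.eye(len(X)) / C, T)
    return X, gamma, alpha

def kelm_predict(X_new, model):
    X_tr, gamma, alpha = model
    sq = ((X_new[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    return np.argmax(np.exp(-gamma * sq) @ alpha, axis=1)
```

In both cases the regularization term I/C is what distinguishes these variants from the plain pseudoinverse solution of basic ELM.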
For an ensemble to be effective, the base classifiers must commit their errors on different instances, which is the informal meaning of the term diversity. We use three variants of ELM to improve diversity among the base classifiers. Overall, the proposed ensemble is designed to improve both accuracy and stability.

Advanced Ensemble for Classification Using Extreme Learning Machines
Unlike the design of a single classifier in the traditional pattern recognition field, ensemble learning aims at constructing multiple diverse classifiers and combining their outputs to form a hybrid predictive model. Consequently, the overall classification performance of an ensemble classifier tends to be better than that of a single classifier. Because ELM uses random input weights, the behavior of a single ELM can vary considerably from run to run, which makes it well suited to ensembling.
To improve classification performance, a number of ensemble-learning-based multiclassifiers have been proposed [21,22]. In this work, we split the training data, use three types of ELM algorithms as base learners to build classifiers on the split data, and use majority voting to combine the outputs of all member classifiers in the ensemble pool. Different training parameters of the base ELM learning algorithms allow each member classifier to generate a different decision boundary; hence different errors are made, reducing the overall error of the ensemble. The training data distribution affects the generalization of the learned classifier. For example, a training set may contain instances of a particular class whose feature values are skewed towards a particular intraclass member. To address this issue, we divide the training dataset into different parts through random resampling, which tends to preserve the original data distribution; consequently, classifiers with large diversity and different errors are produced. For example, if we divide the training data D into 3 parts, D = {D1, D2, D3}, then we have three training subsets: T1 = D1 ∪ D2, T2 = D1 ∪ D3, and T3 = D2 ∪ D3. A sufficient and necessary condition for the ensemble to outperform its base members is that the component learners are simultaneously accurate and diverse; therefore, a new member is added to AELME only if it increases both the diversity (in terms of disagreement) and the accuracy of the model. A general description of the model is shown as a flowchart in Figure 1.
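The data-splitting step described above can be sketched as follows. This is an illustrative Python helper; the function name and the bootstrap-style resampling details are our assumptions about one reasonable realization of "random resampling", not the paper's exact procedure.

```python
import numpy as np

def make_training_subsets(n_samples, m, rng):
    """Split sample indices into m equal parts; each training subset is a
    bootstrap resample drawn from the union of the other m - 1 parts."""
    idx = rng.permutation(n_samples)
    parts = np.array_split(idx, m)
    subsets = []
    for i in range(m):
        # pool together the m - 1 parts that exclude part i
        pool = np.concatenate([parts[j] for j in range(m) if j != i])
        # bootstrap resample (with replacement) from that pool
        subsets.append(rng.choice(pool, size=len(pool), replace=True))
    return subsets
```

With m = 3 this yields exactly the three training subsets T1, T2, T3 of the example above, up to resampling.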

Architecture.
The training dataset is divided randomly into m equal-size subsets; if we have N samples, the size of each subset is N/m. To maximize diversity among the reconstructed training datasets, each new training set is obtained by resampling on m − 1 out of the m subsets. Training on each subset is then done with one of the three base classifier learners, selected at random. The trained classifier is added to the ensemble, and the process is repeated for all remaining subsets. In each iteration, if the diversity and accuracy of the current ensemble E improve with the addition of the newly trained classifier E_K (the K-th member), it is retained in the updated ensemble and excluded otherwise. The final ensemble model is a mixture of the classifiers trained on all subsets. Our model uses three types of ELM algorithms, specifically Regularized-ELM, ELML2, and Kernel-ELM. Once training is complete, labels for test data are obtained by applying the majority voting method to the outputs of the member classifiers in the evolved ensemble.
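A minimal sketch of the training loop described in this section, assuming hypothetical callables for the base learners and for the ensemble's accuracy and diversity evaluation (these names and the accept-only-if-both-improve rule as coded here are our simplification of the procedure, not the paper's exact algorithm):

```python
import random

def build_aelme(subsets, base_learners, eval_accuracy, eval_diversity, seed=0):
    """Grow the ensemble greedily: a newly trained classifier is kept only if
    it improves both the accuracy and the diversity of the current ensemble."""
    rng = random.Random(seed)
    ensemble, best_acc, best_div = [], 0.0, 0.0
    for subset in subsets:
        learner = rng.choice(base_learners)       # randomly chosen ELM variant
        candidate = ensemble + [learner(subset)]  # train on this resampled subset
        acc, div = eval_accuracy(candidate), eval_diversity(candidate)
        if not ensemble or (acc >= best_acc and div >= best_div):
            ensemble, best_acc, best_div = candidate, acc, div
    return ensemble
```

The evaluation callables would, in practice, measure validation accuracy and pairwise disagreement of the candidate ensemble.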

Majority Vote.
The implementation procedure for the ensemble construction and training stage is described in the AELME algorithm (Algorithm 1).
Given a testing instance (X, t), an ensemble of m × K predictors is created. For pattern X, we use majority voting to make the final decision. Suppose we have a C-class problem: if the k-th ELM in the ensemble predicts pattern X as class c, we assign that class a vote of one, and zero otherwise. Once all votes have been cast, the class that receives the most votes from all predictors is taken as the predicted class.
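The voting rule just described can be sketched as follows (an illustrative Python helper; the array layout and function name are our assumptions):

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_classifiers, n_samples) array of integer class labels.
    Returns, for each sample, the class receiving the most votes."""
    P = np.asarray(predictions)
    labels = []
    for column in P.T:  # all members' votes for one sample
        labels.append(np.bincount(column).argmax())
    return np.array(labels)
```

On ties, `argmax` favors the lowest class index; a real implementation might break ties differently.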

Weighted Sum.
Given a testing instance (X, t), an ensemble of m × K predictors is created. For pattern X, the ensemble uses the weighted sum to make the final decision. Suppose there is a C-class problem; we calculate the weighted sum over all classifiers for each class, and the class that receives the maximum weighted sum from all predictors is taken as the predicted class:

class(X) = arg max_{c = 1,...,C} Σ_k α_k f_k(c),

where α_k is the weight of base learner k and f_k(c) is its prediction result for class c.
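The weighted-sum rule can be sketched as follows (illustrative Python; the array layout and names are our assumptions):

```python
import numpy as np

def weighted_sum_vote(scores, alphas):
    """scores: (n_classifiers, n_samples, n_classes) per-class outputs f_k(c);
    alphas: (n_classifiers,) base-learner weights alpha_k.
    Returns the class with the maximum weighted sum for each sample."""
    # tensordot contracts the classifier axis: sum_k alpha_k * f_k
    combined = np.tensordot(np.asarray(alphas, float),
                            np.asarray(scores, float), axes=1)
    return combined.argmax(axis=1)
```

Unlike majority voting, this rule lets a confident or well-weighted member outvote several weak ones.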

Simulation Settings.
To test the performance of the model, we carry out simulation experiments on ten diverse datasets from several domains, with different characteristics and diversity in size and input feature dimensions. The datasets come from the UCI machine learning repository [23], plus one dataset from LIBSVM [24] that is sourced from [25]. A brief description of the datasets is included in Table 1, and the parameter settings are listed in Table 2. To study the generalization performance of AELME over combinations of (C, λ), we select a medium-size dataset (Wave). From Figure 2, it can be noticed that changing the values of the C and λ parameters does not have a significant effect on accuracy, so the model appears to have low sensitivity to the parameter combination (C, λ).

Metrics.
We use a set of measures to evaluate the efficiency of the AELME model. Accuracy indicates the correctness of the classification output. The standard deviation of the accuracy rates indicates ensemble stability: the lower the standard deviation, the more stable the method. Training the ensemble on a set slightly larger or smaller than the original data should not change the ensemble accuracy significantly. We use the decrease or increase in average absolute error, averaged over all our datasets, assuming they represent a reasonable real-world distribution of datasets. The average relative error reduction is also used. For two algorithms A and B [26] with errors e1 and e2, the decrease in relative error between A and B is (e1 − e2)/e1. The average relative error is the average (over all our datasets) of the relative error between the pair of compared algorithms. We compare our model with all other approaches; a negative value implies that our model reduces error, while a positive value corresponds to an increase in error for our model. Time costs of Adaboost, Bagging, EnELM [14], DSELME [15], DELM [13], and AELME are also compared.
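The relative-error measure defined above can be computed with a small helper (function names are ours):

```python
import statistics

def relative_error_decrease(e1, e2):
    # Decrease in relative error between algorithms A (error e1) and B (error e2):
    # (e1 - e2) / e1. Negative values mean A has lower error than B.
    return (e1 - e2) / e1

def average_relative_error(errors_a, errors_b):
    # Averaged over all datasets for one pair of compared algorithms.
    return statistics.mean(relative_error_decrease(a, b)
                           for a, b in zip(errors_a, errors_b))
```

For instance, if A has error 0.10 and B has error 0.20 on a dataset, the relative error is (0.10 − 0.20)/0.10 = −1.0, i.e., A halves B's error.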

Diversity Measures.
It is not straightforward to express the actual diversity among classifiers in an ensemble through a standard diversity measure: there are measures that approximate it, but no perfect one exists [27,28]. Here, we use the disagreement measure and also the Q-statistic, which is recommended in [29].

Disagreement.
The diversity within the whole ensemble is calculated by averaging the disagreement measure [30] over all pairs of base classifiers:

Dis = (2 / (L(L − 1))) Σ_{j=1}^{L−1} Σ_{k=j+1}^{L} dis_{j,k},

where L is the number of base classifiers and dis_{j,k} is the disagreement between classifier j and classifier k, that is, the fraction of samples on which they predict differently. The measure is built on the intuition that two diverse classifiers perform differently on the same training data. It is used to test the diversity within the whole set of base classifiers; the diversity increases with the value of the disagreement measure.

Q-Statistic.
Yule's Q-statistic [31] measures the similarity between two classifiers C_i and C_j. It can be calculated as follows:

Q_{i,j} = (ad − bc) / (ad + bc),

where a and d are the numbers of samples on which both classifiers are correct and both are wrong, respectively, while b and c are the numbers of samples on which exactly one of the two is wrong (b: C_i correct and C_j wrong; c: C_i wrong and C_j correct). The averaged Q value for more than two classifiers can then be calculated as

Q_av = (2 / (C(C − 1))) Σ_{i=1}^{C−1} Σ_{j=i+1}^{C} Q_{i,j},

where C is the number of classifiers and Q ∈ [−1, 1]. When Q equals zero, the classifiers are independent; when Q equals one, the classifiers are identical (dependent). A positive Q value means that the classifiers tend to classify the same inputs correctly, and a negative value means that they commit errors on different inputs. Diversity increases as the Q value decreases and vice versa; however, it is not easy to attain a large negative Q value for more than two classifiers [29]. For the Q calculations, we use the diversity measure toolbox (http://pages.bangor.ac.uk/~mas00a/ensemble_diversity.html).
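Both measures can be computed directly from the members' predictions, as in this illustrative Python sketch (function names are ours; the formulas follow the standard definitions cited above):

```python
import numpy as np

def disagreement(pred_j, pred_k):
    # Fraction of samples on which the two classifiers predict differently.
    return np.mean(np.asarray(pred_j) != np.asarray(pred_k))

def ensemble_disagreement(preds):
    # Average disagreement over all pairs of base classifiers.
    L = len(preds)
    pairs = [(j, k) for j in range(L) for k in range(j + 1, L)]
    return sum(disagreement(preds[j], preds[k]) for j, k in pairs) / len(pairs)

def q_statistic(correct_i, correct_j):
    """Yule's Q for one pair; correct_*: boolean arrays, True where right.
    Assumes ad + bc is nonzero."""
    ci, cj = np.asarray(correct_i), np.asarray(correct_j)
    a = np.sum(ci & cj)    # both correct
    b = np.sum(ci & ~cj)   # i correct, j wrong
    c = np.sum(~ci & cj)   # i wrong, j correct
    d = np.sum(~ci & ~cj)  # both wrong
    return (a * d - b * c) / (a * d + b * c)
```

Two identical classifiers give Q = 1, while classifiers that err on disjoint inputs drive Q negative.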

Wilcoxon Test.
The Wilcoxon test is a nonparametric statistical test [32]. Its purpose is to compare two models over several data samples, to measure the difference between them, and to determine whether that difference is significant. It is insensitive to sample size and outliers. Our null hypothesis (H0) is that there is no difference between our model and the one to which it is being compared. The alternative hypothesis (H1) is that our model performs significantly better than the compared model. We use a significance level of 0.05 (95% confidence). Small p values cast doubt on the null hypothesis: a small p value indicates that one approach is significantly better than the other. The procedure is as follows: find the performance (Pf) difference between the two compared algorithms; rank the absolute values of Pf in ascending order (smallest value = 1, second smallest = 2, and so on), assigning the average rank to tied values; compute the negative and positive rank sums according to the sign of Pf; take the minimum of the two sums (the Wilcoxon statistic W); and compare W with its critical value [33], which corresponds to the number of datasets at the chosen significance level, to examine whether the null hypothesis can be rejected. For more details, the reader can refer to [34].
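Under the setup just described, the test can be run directly with SciPy; the per-dataset accuracy values below are purely hypothetical placeholders, not results from the paper:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset accuracies for AELME and one compared model.
acc_aelme = [0.95, 0.93, 0.91, 0.96, 0.90, 0.88, 0.92, 0.94, 0.89, 0.97]
acc_other = [0.94, 0.91, 0.88, 0.92, 0.85, 0.82, 0.85, 0.86, 0.80, 0.87]

# Paired signed-rank test over the per-dataset differences; the returned
# statistic is the smaller of the two signed rank sums (W).
stat, p_value = wilcoxon(acc_aelme, acc_other)
reject_h0 = p_value < 0.05  # significance level 0.05
```

Since every difference here favors the first sample, the smaller rank sum is zero and the null hypothesis is rejected.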

Friedman Test.
The Friedman test is a nonparametric statistical test [35]. Its purpose is to compare the performance of multiple models over several data samples, to measure the differences between them, and to determine whether there is a significant difference or they are equal. Our null hypothesis (H0) is that there is no difference between AELME and the other algorithms. The alternative hypothesis (H1) is that there is a significant difference between at least two of the compared models. The significance level used is 0.05 (95% confidence).
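Similarly, the Friedman test over multiple algorithms can be run with SciPy. The accuracy columns below are hypothetical placeholders (one list per algorithm, rows aligned by dataset), not the paper's results:

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-dataset accuracies for three compared algorithms.
acc_a = [0.95, 0.91, 0.88, 0.93, 0.90, 0.89, 0.94, 0.92]
acc_b = [0.93, 0.89, 0.86, 0.91, 0.88, 0.87, 0.92, 0.90]
acc_c = [0.91, 0.87, 0.84, 0.89, 0.86, 0.85, 0.90, 0.88]

# Ranks the algorithms within each dataset and tests for rank differences.
stat, p_value = friedmanchisquare(acc_a, acc_b, acc_c)
significant = p_value < 0.05
```

With a perfectly consistent ranking across the 8 hypothetical datasets, the chi-square statistic is 16 and the difference is significant.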

Statistical Results.
The p value of the Wilcoxon test comparing AELME with each of the other algorithms is shown in Table 5. It is less than 0.05 in all cases, which implies that the null hypothesis (H0) is rejected and the alternative hypothesis (H1) is accepted.
There is enough evidence that our model performs significantly better than the other models reported in this work. Moreover, the Friedman test statistic is 22.08, with a p value of 0.0086; we therefore reject the null hypothesis and accept the alternative hypothesis that there are differences between the compared models.
The Q-statistic results are reported in Tables 8 and 9. There exists a range of improvements against these Q values, while the top improvements are dispersed across a wide spectrum of negative Q values. Almost all ensembles in our experiments show an accuracy improvement over the single best classifier (ensemble accuracy minus maximum individual accuracy). Nevertheless, we cannot conclude that there is a strong relationship between accuracy and diversity for all ensembles, because it depends on the experiment settings; more dedicated, in-depth research is needed to investigate this relationship.

Performance Analysis and Discussion.
The classification experiments on the datasets are performed using the Bagging, Adaboost, Regularized-ELM (RELM) [10], ELML2 [11], Kernel-ELM (ELMK) [19], EnELM [14], DELM [13], DSELME [15], and AELME algorithms. The average classification accuracy rates, with their corresponding standard deviations over ten runs, are shown in Table 4. The accuracy rates on the tested datasets show the strength of the model: our model achieves the highest accuracy rates in most cases.
The base classifiers of our model, ELML2, Regularized-ELM, and Kernel-ELM, have lower accuracy rates than the ensemble model. From Table 4 we observe that the accuracy rates of the Bagging and Adaboost algorithms are lower than our model's on almost all datasets, with particularly low rates on the Wave, Credit, and Firm datasets. Moreover, we also test AELME on unseen data using the weighted sum method over all base classifiers. Table 6 shows the accuracy rates using the weighted sum; we observe that the weighted sum method outperforms majority vote on most datasets. Stability is an important factor in whether the ensemble classifier can improve the classification accuracy rate. To analyze the stability of AELME, we repeat the experiments 10 times on the Climate dataset; 200 + 8, 200 + 16, 200 + 32, 200 + 64, and 200 + 128 instances are selected in sequence, corresponding to the 1st to the 5th group, respectively. The standard deviation of the accuracy rates is calculated over these 10 runs. Stable classifiers are less likely to overfit; however, to make use of the variations of the training set, the base classifier should be unstable [35], otherwise the resulting ensemble will be a collection of almost identical classifiers. As shown in Table 7, our ensemble classifier is more stable than all the base classifiers. Disagreement is a measure of diversity; as shown in Table 3, it mostly increases with the size of the dataset, which demonstrates that diversity between the base classifiers in AELME is increased. The mean absolute error (MAE) of our model is the lowest among all algorithms on all datasets, as shown in Table 10. Our model achieves a relative error reduction compared to almost all other ensembles tested on all the datasets in this research. For example, on the Letter dataset, there is an error reduction of 147% compared to Bagging, 157% compared to Adaboost, 175% compared to EnELM, 54% compared to DSELME, and 171% compared to DELM.
The average absolute error of our model is 0.2758, the smallest among all ensembles, as shown in Table 10. There is an average error reduction of 39% over DELM, 43% over DSELME, 35% over EnELM, 44% over Adaboost, and 59% over Bagging. We also compare the time costs of the Bagging, Adaboost, and AELME algorithms; the same five groups of instances are selected in sequence, corresponding to the 1st to the 5th group, respectively, as shown in Figure 3. On the Climate dataset, Adaboost is the most time-consuming algorithm, and the time cost of AELME is lower than that of both Bagging and Adaboost. The average training time of all algorithms is compared by averaging ten runs over the ten datasets. Table 11 shows the average training time for the different algorithms; our algorithm's training time on some datasets is higher than that of other ensembles due to the additional computations in our algorithm.

Conclusion
In this work, we have presented an advanced ensemble approach for classification using different ELM algorithms, namely, Regularized-ELM, ELML2, and Kernel-ELM. Each member learner is trained independently of the others to achieve diversity within the proposed ELM ensemble (AELME). Using different types of ELM algorithms in the ensemble and a different training dataset for each classifier allows the base classifiers to generate different decision boundaries and commit different errors, thereby reducing the total error; the combination of all the classifiers thus achieves better classification accuracy and improved generalization. Experimental results show that the proposed AELME model is accurate and stable and outperforms the other models. Identifying the optimal number of classifiers in an ensemble for improving overall accuracy would be an interesting direction for future work. Furthermore, we plan to explore applications of the proposed method in practical fields such as the Internet of Things and cyberphysical systems [36,37].