Multiclass AdaBoost ELM and Its Application in LBP Based Face Recognition

Extreme learning machine (ELM) is a competitive machine learning technique that is simple in theory and fast in implementation; it can identify faults quickly and precisely compared with traditional identification techniques such as support vector machines (SVM). As verified by simulation results, ELM tends to have better scalability and can achieve much better generalization performance and much faster learning speed than traditional SVM. In this paper, we introduce a multiclass AdaBoost based ELM ensemble method. In our approach, the ELM algorithm is selected as the basic ensemble predictor due to its rapid speed and good performance. Compared with the existing boosting ELM algorithms, our algorithm can be used directly in multiclass classification problems. We also carried out comparative experiments on face recognition datasets. The experimental results show that the proposed algorithm not only makes the prediction result more stable but also achieves better generalization performance.


Introduction
Extensive research on feedforward neural networks has shown that they can not only approximate complex nonlinear mappings but also provide models for natural and artificial problems that classic parametric techniques are unable to handle.
Recently, Huang et al. [1] proposed a new, simple algorithm based on single layer feedforward networks (SLFNs) called extreme learning machine (ELM). Because ELM randomly generates the parameters of the network, its learning speed can be thousands of times faster than that of traditional feedforward network learning algorithms such as the back-propagation (BP) algorithm, which needs many iterations to reach optimal parameters.
In addition, Huang [2] also shows that, in theory, ELMs (with the same kernels) tend to outperform SVM and its variants in both regression and classification applications, with much easier implementation. Based on this conclusion, Wong et al. [3] explore the superiority of ELM in terms of fault identification time.
In view of the advantages of the algorithm, Cao et al. applied it to areas such as landmark recognition [4] and protein sequence classification [5]. Besides, Cao et al. [6] proposed an improved learning algorithm that incorporates the voting method into ELM for classification applications and outperforms the original ELM algorithm as well as several recent classification algorithms.
AdaBoost [7] is one of the most popular classifier ensemble algorithms for improving generalization performance. Wang and Li [8] proposed an algorithm named dynamic AdaBoost ensemble ELM (named DAEELM in this paper). It takes the ELM as the basic classifier and applies AdaBoost to solve binary classification problems. Similarly, Tian and Mao [9] combined the modified AdaBoost.RT [10] with ELM to propose a new hybrid artificial intelligence technique called ensemble ELM.

A Review of Related Work
In this section, the original ELM algorithm, PCA, multiclass AdaBoost, and LBP based face recognition are reviewed.
For $N$ arbitrary distinct samples $(\mathbf{x}_j, \mathbf{t}_j)$, with $\mathbf{x}_j \in \mathbb{R}^n$ and $\mathbf{t}_j \in \mathbb{R}^m$, a standard SLFN with $L$ hidden nodes and activation function $h(x)$ is modeled as

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i \, h(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{t}_j, \quad j = 1, \ldots, N.$$

Here, $\mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the weight vector connecting the $i$th hidden node and the input nodes, $\boldsymbol{\beta}_i = [\beta_{i1}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node and the output nodes, and $b_i$ is the threshold of the $i$th hidden node.
The standard SLFNs with $L$ hidden nodes with activation function $h(x)$ can be compactly written as follows:

$$H \boldsymbol{\beta} = T,$$

where $H$ is the $N \times L$ hidden layer output matrix with entries $H_{ji} = h(\mathbf{w}_i \cdot \mathbf{x}_j + b_i)$, $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_L]^T$, and $T = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^T$. Different from the conventional gradient-based solution of SLFNs, ELM simply solves the function by

$$\boldsymbol{\beta} = H^{+} T,$$

where $H^{+}$ is the Moore-Penrose generalized inverse of matrix $H$. As Huang et al. have pointed out in [14], $H^{+}$ can be represented by

$$H^{+} = \left( \frac{I}{C} + H^T H \right)^{-1} H^T,$$

where $I$ is an identity matrix with the same dimension as $H^T H$ and $C$ is a constant that can be set by the user. Adding $I/C$ avoids the situation in which $H^T H$ is singular. Huang et al. [1] successfully applied ELM to binary classification problems, and Huang et al. [14] extended ELM to solve multiclass classification problems directly. Since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the bias of the activation function, its performance may not be very stable. As an alternative, other approaches such as the PCA algorithm are worth trying.

PCA. Principal component analysis (PCA) was invented in 1901 by Pearson [15], as an analogue of the principal axes theorem in mechanics; it was later independently developed (and named) by Hotelling in the 1930s [16]. Now, it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or by singular value decomposition of a data matrix, usually after mean centering (and normalizing or using z-scores) the data matrix for each attribute [17]. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
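The procedure of PCA is as follows.
Step 1. Compute the matrix $\Sigma$, which is the covariance matrix of the data matrix $X$.
Step 2. Compute the eigenvalues $\lambda$ and eigenvectors $\mathbf{v}$ of $\Sigma$ by solving $(\lambda I - \Sigma)\mathbf{v} = 0$, where $I$ is an identity matrix with the same dimension as $\Sigma$.
Step 3. Form the projection matrix $P$ from the eigenvectors that correspond to the largest eigenvalues, and compute $Y = XP$. The matrix $Y$ consists of $N$ row vectors, where each vector is the projection of the corresponding data vector from matrix $X$.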
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent if the dataset is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.
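The covariance-based route described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used in the paper; the function name `pca_project`, the parameter name `d` for the reduced dimension, and the toy data are assumptions introduced here.

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the d leading principal components.

    Hypothetical helper for illustration; X is an (n_samples, n_features) array.
    """
    X_centered = X - X.mean(axis=0)             # mean-center each attribute
    cov = np.cov(X_centered, rowvar=False)      # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric input, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:d]       # indices of the d largest eigenvalues
    components = eigvecs[:, order]              # columns are the principal directions
    return X_centered @ components, eigvals[order]

# Toy usage on correlated random data (not a face dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Y, variances = pca_project(X, d=3)
```

Each row of `Y` is the projection of the corresponding row of `X`, and the projected variables are uncorrelated, with variances equal to the selected eigenvalues in decreasing order.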

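Original AdaBoost and Multiclass AdaBoost.
AdaBoost has been very successfully applied to binary classification problems. The original AdaBoost was proposed in [7]. Before stating the AdaBoost algorithm, the indicator function $I(x)$ is predefined as

$$I(x) = \begin{cases} 1, & \text{if } x \text{ is true}, \\ 0, & \text{otherwise}. \end{cases}$$

The AdaBoost algorithm is summarized as follows.
(1) Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
(2) For $m = 1 : M$,
(a) fit a classifier $T^{(m)}(x)$ to the training data using weights $w_i$;
(b) compute the weighted error
$$\mathrm{err}^{(m)} = \frac{\sum_{i=1}^{N} w_i \, I\big(c_i \neq T^{(m)}(x_i)\big)}{\sum_{i=1}^{N} w_i};$$
(c) compute the weight of the $m$th classifier
$$\alpha^{(m)} = \log \frac{1 - \mathrm{err}^{(m)}}{\mathrm{err}^{(m)}};$$
(d) update the weight of sample data, for all $i = 1, 2, \ldots, N$,
$$w_i \leftarrow w_i \exp\big(\alpha^{(m)} I(c_i \neq T^{(m)}(x_i))\big);$$
(e) renormalize $w_i$.
(3) Output
$$C(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha^{(m)} T^{(m)}(x) \right).$$
Here, $c_i$ is +1 or −1. In binary classification, any classifier whose generalization performance is better than 1/2 is a weak classifier. For the original AdaBoost, we have the following.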
(1) For the $a$th and the $b$th classifiers, if $\mathrm{err}^{(a)} < \mathrm{err}^{(b)} < 1/2$, we have $\alpha^{(a)} > \alpha^{(b)} > 0$, which means the final ensemble classifier places more value on the $a$th classifier's result. Specifically, if $\mathrm{err}^{(m)} = 1/2$, then $\alpha^{(m)} = 0$, which means the final ensemble classifier simply ignores the classifier, since its effect is the same as a random guess.
(2) If the $m$th classifier misclassifies the $i$th sample, the $i$th sample will have a large weight in the next iteration. As a result, the $(m + 1)$th classifier will pay more attention to it. On the contrary, if the $m$th classifier classifies the $i$th sample correctly, the $i$th sample will have a small weight in the next iteration, which means the $(m + 1)$th classifier will pay less attention to it.
However, for a $K$-class classification problem, we have $c_i \in \{1, 2, \ldots, K\}$ and $K > 2$. If a classifier's generalization performance is better than $1/K$ (which may be much smaller than 1/2), it can be called a weak classifier. Since the original AdaBoost only accepts a classifier whose generalization performance is better than 1/2 as a weak classifier, it cannot be applied directly to multiclass conditions in which $K$ is bigger than 2. Freund and Schapire [11] extend the original AdaBoost to the multiclass condition. The weight of the $m$th classifier is modified as

$$\alpha^{(m)} = \log \frac{1 - \mathrm{err}^{(m)}}{\mathrm{err}^{(m)}} + \log(K - 1).$$

Similar to the binary condition, for the $a$th and the $b$th classifiers, if $\mathrm{err}^{(a)} < \mathrm{err}^{(b)} < 1 - 1/K$, we have $\alpha^{(a)} > \alpha^{(b)} > 0$, which means the final ensemble classifier places more value on the $a$th classifier's result. In particular, if $\mathrm{err}^{(m)} = 1 - 1/K$, then $\alpha^{(m)} = 0$.
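As a quick check of the boundary cases, assume the multiclass classifier weight has the form $\alpha^{(m)} = \log\frac{1 - \mathrm{err}^{(m)}}{\mathrm{err}^{(m)}} + \log(K - 1)$ with natural logarithms (the form is an assumption wherever the displayed equation is not available). Substituting $\mathrm{err}^{(m)} = 1 - 1/K$ gives

$$\alpha^{(m)} = \log \frac{1/K}{1 - 1/K} + \log(K - 1) = -\log(K - 1) + \log(K - 1) = 0,$$

and for $K = 2$ the extra term $\log(K - 1)$ vanishes, so the binary weight is recovered. As an illustrative example with assumed values $K = 15$ and $\mathrm{err}^{(m)} = 0.8$, the multiclass weight is $\log(0.25) + \log(14) = \log(3.5) \approx 1.25 > 0$, whereas the binary weight $\log(0.25) \approx -1.39$ would be negative even though the classifier is far better than random guessing among 15 classes.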

LBP Based Face Recognition.
The original LBP operator goes through each 3×3 neighborhood in a picture. It takes the center pixel as the threshold value of the neighborhood, thresholds the eight surrounding pixels against it, and considers the resulting binary pattern as a decimal number. The LBP operator is shown in Figure 1. Then, the texture of the picture can be represented by the histogram of all the decimal numbers.
To apply the LBP operator to face recognition, Ahonen et al. [12] divided the face image into several windows and calculated the histogram of each window with the LBP operator. The final feature vector is obtained by combining the histograms into a spatially enhanced histogram. The spatially enhanced histogram carries three levels of information: the patterns at the pixel level, the patterns at the regional level, and the global patterns of the face image. Experiments in [12] have shown that the LBP description is more robust against variations in pose or illumination than holistic methods. All our experiments in Section 4 are done with the original LBP operator.
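The operator and the spatially enhanced histogram can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code used in the experiments: the function names `lbp_image` and `spatial_histogram` are hypothetical, a neighbor counts as 1 when it is greater than or equal to the center, and the bit order of the eight neighbors is a convention chosen here.

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP code for every interior pixel of a 2-D grayscale array."""
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbor offsets, clockwise from the top-left pixel (assumed bit order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Threshold the neighbor against the center and place the bit.
        codes += (neighbor >= center).astype(np.int32) << bit
    return codes

def spatial_histogram(img, grid=(5, 5)):
    """Concatenate the 256-bin LBP histograms of grid[0] x grid[1] windows."""
    codes = lbp_image(img)
    hists = []
    for row_block in np.array_split(codes, grid[0], axis=0):
        for window in np.array_split(row_block, grid[1], axis=1):
            hists.append(np.bincount(window.ravel(), minlength=256))
    return np.concatenate(hists)
```

With a 5 × 5 grid the feature vector has 25 × 256 entries; it is this concatenated vector, rather than the raw pixels, that would be passed to the classifier.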

MAELM and Face Recognition Structure
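In this part, the multiclass AdaBoost ELM (MAELM) algorithm is proposed, and a face recognition structure based on LBP and ELM is also included. The MAELM algorithm proceeds as follows.
(1) Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
(2) For $m = 1 : M$,
(a) fit a classifier $\mathrm{ELM}^{(m)}(x)$ to the training data using weights $w_i$;
(b) compute the weighted error
$$\mathrm{err}^{(m)} = \frac{\sum_{i=1}^{N} w_i \, I\big(c_i \neq \mathrm{ELM}^{(m)}(x_i)\big)}{\sum_{i=1}^{N} w_i};$$
(c) compute the weight of the $m$th classifier
$$\alpha^{(m)} = \log \frac{1 - \mathrm{err}^{(m)}}{\mathrm{err}^{(m)}} + \log(K - 1);$$
(d) update the weight of sample data, for all $i = 1, 2, \ldots, N$,
$$w_i \leftarrow w_i \exp\big(\alpha^{(m)} I(c_i \neq \mathrm{ELM}^{(m)}(x_i))\big);$$
(e) renormalize $w_i$.
(3) Output
$$C(x) = \arg\max_{k} \sum_{m=1}^{M} \alpha^{(m)} I\big(\mathrm{ELM}^{(m)}(x) = k\big).$$
Part (2)(a) of the proposed algorithm deserves particular attention. Neither [8] nor [9] gives any detail of how to fit the basic classifier $\mathrm{ELM}^{(m)}(x)$ with weighted samples, although this is a very important part of AdaBoost. Zong et al. [18] proposed an algorithm named weighted ELM by introducing a diagonal matrix $W \in \mathbb{R}^{N \times N}$, whose element $W_{ii}$ denotes the weight of the $i$th training sample. To handle weighted samples, we adopt the weighted ELM algorithm. Obviously, it reduces to the original ELM when the weight matrix is the identity matrix.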
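One boosting round and the final weighted vote can be sketched independently of the base learner. The sketch below assumes the multiclass weight $\alpha = \log((1 - \mathrm{err})/\mathrm{err}) + \log(K - 1)$ and labels coded as integers $0, \ldots, K-1$; the function names `samme_round` and `samme_vote` and the small clipping guard are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def samme_round(weights, y_true, y_pred, n_classes):
    """One boosting round: returns (err, alpha, renormalized sample weights)."""
    miss = (y_true != y_pred).astype(float)            # indicator I(c_i != prediction)
    err = np.sum(weights * miss) / np.sum(weights)     # weighted error
    err = np.clip(err, 1e-12, 1.0 - 1e-12)             # numerical guard (assumption)
    alpha = np.log((1.0 - err) / err) + np.log(n_classes - 1.0)
    new_weights = weights * np.exp(alpha * miss)       # boost the misclassified samples
    return err, alpha, new_weights / new_weights.sum()

def samme_vote(predictions, alphas, n_classes):
    """predictions: (M, N) integer labels; returns the N ensemble labels."""
    scores = np.zeros((n_classes, predictions.shape[1]))
    for pred, alpha in zip(predictions, alphas):
        scores[pred, np.arange(pred.size)] += alpha    # add alpha to the voted class
    return scores.argmax(axis=0)

# Toy usage: four samples, three classes, one misclassified sample.
w0 = np.full(4, 0.25)
y = np.array([0, 1, 2, 0])
err, alpha, w1 = samme_round(w0, y, np.array([0, 1, 2, 1]), n_classes=3)
```

In this toy round only the misclassified sample gains weight, so the next classifier fitted with `w1` would concentrate on it.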
The proposed method retains the advantages of the original ELM: (1) it is simple in theory and convenient in implementation; (2) a wide range of feature mapping functions or kernels is available for the proposed framework; (3) it can be applied directly to multiclass classification tasks. In addition, after integration with the weighting scheme, the weighted ELM is able to deal with data with imbalanced class distribution while maintaining the same good performance on well-balanced data as the unweighted ELM; by assigning different weights to each example according to the users' needs, the weighted ELM can be generalized to cost sensitive learning.
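Under the weighted circumstance, the solution of $\boldsymbol{\beta}$ becomes

$$\boldsymbol{\beta} = \left( \frac{I}{C} + H^T W H \right)^{-1} H^T W T,$$

where $H$ is the hidden layer output matrix, $T$ is the target matrix, $W$ is the diagonal sample weight matrix, $I$ is an identity matrix, and $C$ is the user-defined constant.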
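A weighted ELM fit of this kind can be sketched as below, assuming the solution $\boldsymbol{\beta} = (I/C + H^T W H)^{-1} H^T W T$, a sigmoid activation, input weights and biases drawn uniformly from $[-1, 1]$, and targets coded as $+1$ for the true class and $-1$ otherwise. The function names, the sampling range, the target coding, and the toy data are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_elm(X, labels, n_classes, sample_weights, L=50, C=1.0, seed=0):
    """Fit an ELM with L hidden nodes using per-sample weights."""
    rng = np.random.default_rng(seed)
    N, n_features = X.shape
    W_in = rng.uniform(-1.0, 1.0, size=(n_features, L))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)                    # random hidden biases
    H = sigmoid(X @ W_in + b)                             # N x L hidden layer output
    T = -np.ones((N, n_classes))                          # +1 / -1 target coding
    T[np.arange(N), labels] = 1.0
    HtW = H.T * sample_weights                            # H^T W without forming N x N W
    beta = np.linalg.solve(np.eye(L) / C + HtW @ H, HtW @ T)
    return W_in, b, beta

def predict_elm(model, X):
    W_in, b, beta = model
    return (sigmoid(X @ W_in + b) @ beta).argmax(axis=1)

# Toy usage: three well-separated clusters, unit weights (i.e. W = I).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
y = np.repeat(np.arange(3), 30)
X = centers[y] + 0.5 * rng.normal(size=(90, 2))
model = train_weighted_elm(X, y, n_classes=3, sample_weights=np.ones(90), L=20, C=10.0)
accuracy = np.mean(predict_elm(model, X) == y)
```

With unit weights the solve reduces to the unweighted regularized solution; inside a boosting loop, `sample_weights` would instead be the current observation weights.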

Application in LBP Based Face Recognition.
This paper combines LBP based feature vectors with ELM to build a face recognition structure. Several papers [19,20] have applied ELM to face recognition. However, the existing ELM based face recognition structures are all based on statistical features, for example, PCA [21] and LDA [22].
In order to get better generalization performance, the proposed face recognition structure uses the LBP based method to obtain the feature vector and ELM as the classifier. It has been shown in [12] that the LBP based method is more robust than PCA and LDA when lighting, facial expression, and pose change. At the same time, ELM is very fast in classification and has very good generalization performance. So, it is reasonable to combine the LBP method and ELM to build the face recognition structure.
The proposed face recognition structure has two steps. The first step is to train the classifier with ELM or MAELM. In this step, the training samples are represented by LBP based feature vectors. Then, the feature vectors are used to train the classifier model by ELM or MAELM; see Figure 2. The second step is to predict the labels of the test samples. The test samples are also represented by LBP based feature vectors. Then, the classifier model trained in the first step is used to predict the labels of the test samples; see Figure 3.

Experiments
In this paper, two of the most widely used face recognition datasets, Yale and ORL, are used to demonstrate the effectiveness of the proposed algorithm. To make the results reliable, except in Section 4.2, the average testing accuracy is obtained over 20 trials with randomly generated training and test sets. This paper chooses the sigmoid function as the activation function because it is the most commonly used one. The parameters to set and their meanings in the experiments are listed in Table 1. For example, if the experiment uses $M = 10$, $C = 1$, $L = 1000$, 5 training images per person, and a window setting of 5, then 5 images of each person are selected to build the training set and the remaining images build the test set. Each image is divided into 5 × 5 windows. After the training and test sets are built, ELM with $C = 1$, $L = 1000$ and MAELM, which combines 10 ELMs with $C = 1$, $L = 1000$, are evaluated on the built sets.

Performance Changes with 𝐶 and 𝐿.
Although ELM is less sensitive to its parameters than SVM, its performance still changes with the number of hidden nodes $L$ and the constant value $C$.
Suppose we have $N$ training samples; Huang et al. [1] rigorously prove that SLFNs (with $N$ hidden nodes) with random bias and input weights can exactly learn the $N$ distinct observations. If some training error is allowed, the number of hidden nodes can be much smaller than $N$. At the same time, the constant value $C$ also affects the solution of the Moore-Penrose generalized inverse of $H$.
It is obvious that neither ELM nor MAELM is very sensitive to parameter changes. The difference between ELM and MAELM lies mainly in the region of the $(C, L)$ grid where one of the two parameters is very small and the other is very large. From Figure 4, one can conclude that ELM performs badly in this region, since its accuracy rate is below 0.6. On the contrary, MAELM is still very stable in this region; its accuracy rate is above 0.8.
After seeing PCA's good performance in the field of face recognition, we wonder whether PCA could give stable and better performance when it replaces the random construction of the input weights.
The experiment is also conducted on the Yale dataset with the same parameters. Besides, the new parameter, the dimension after reduction, cannot be set larger than the number of input nodes. In view of the dimension of the dataset and other limitations in the experiment, this parameter is set to 10, 20, . . ., 60. Since the results are hard to read in a figure because they vary unevenly as the parameters change, we present them in a table. The performance of ELM (Figure 4(a)) and MAELM (Figure 4(b)) with PCA is listed in Table 2; the best accuracy rate in the table is shown in bold.
It is clear that neither ELM nor MAELM with PCA is very sensitive to parameter changes. According to Table 2, the difference between them is concentrated in the corner of the parameter grid where one parameter is very small and another is very large: MAELM with PCA performs better in part of this region, whereas ELM with PCA is accurate and stable in the opposite corner. Where both parameters are very large, ELM with PCA performs almost as well as MAELM with PCA, with an accuracy rate above 0.85.

Prediction Stability Analysis.
Since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the bias of the activation function, its performance changes from run to run even for the same training and test sets. That is to say, the performance of the original ELM may not be very stable. The proposed algorithm successfully reduces this instability.
From Figure 4, one is able to conclude that ELM tends to get better performance when $C = 1$, while $C = 10^3$ is better for MAELM. Let $M = 10$, $C = 10^3$, $L = 1000$, with the remaining two parameters of Table 1 set to 5 and 3, for MAELM, and $C = 1$, $L = 1000$, with the same remaining two parameters, for ELM. Besides, ELM and MAELM with PCA (with the reduced dimension set to 20) are also included under the corresponding settings because of their considerable performance above. Experiments are done on the Yale dataset. In order to show that the proposed algorithm is more stable than the original ELM, experiments are repeated 20 times on the same (randomly generated) training set and test set. The result is shown in Figure 5.
In Figure 5, it is obvious that the performance of MAELM is much more stable than that of the original ELM. ELM and MAELM with PCA are far more stable than the original ELM and MAELM, since they use PCA instead of random weights between the input layer and the hidden layer and a random bias of the activation function. However, their accuracy rates, which consistently lie in the middle range in Figure 5, are still not as good as those of the original MAELM. The results of Figure 5 are summarized in Table 3. Please notice that although the generalization performance ...
From Figure 6, it is obvious that as $M$ increases, the generalization performance also improves. However, the improvement slows as $M$ increases. From Figure 7, one can conclude that when $M$ is small, the performance decreases slightly as $M$ increases, whereas once $M$ exceeds 25 the performance improves again, although the trend is not as stable as that of the original MAELM. This indicates that in real-world applications $M$ does not need to be very large. Good generalization performance can be obtained by setting $M$ below 30 in the original MAELM, which outperforms MAELM with PCA under the same settings.
The experiment indicates that MAELM has better generalization performance on both the Yale and ORL datasets under different window sizes; see Figure 8 for details. In Figure 9, it is obvious that ELM with PCA performs much better on both the Yale and ORL datasets under different window sizes. In addition, this algorithm remains more stable than any of the other algorithms on both datasets.

The Performance in PCA.
From all these experiments, we can conclude that although MAELM with PCA does not perform as well as the original MAELM, ELM with PCA performs much better than the original ELM, especially in the experiment in Section 4.2. The performance of the methods with PCA lies between that of the original ELM and that of MAELM.
What is more, since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the bias of the activation function, its performance is not very stable. The proposed algorithm with PCA successfully reduces this instability, which is very important in real-world applications.
Although PCA improves the performance of ELM to a certain degree, it still cannot match MAELM with random weights and bias. Overall, the proposed MAELM algorithm performs much better in solving the multiclass classification problem.

Complexity Comparison.
Very similar to MAELM, DAEELM [8] also takes the ELM as the weak classifier and uses AdaBoost as the ensemble method. The difference is that MAELM uses multiclass AdaBoost, which can be applied directly to multiclass classification problems, while DAEELM uses dynamic ensemble AdaBoost [23], which aims to solve the binary classification problem.
Many methods have been developed to apply binary classifiers to multiclass problems. One-against-all (OAA) [24] and one-against-one (OAO) [25] are the most widely used. For a $K$-class classification problem, under the OAA condition, $K$ classifiers have to be trained. Each of them separates a single class from all the remaining classes. Under the OAO condition, $K(K - 1)/2$ classifiers have to be trained. Each of them separates a pair of classes.
Suppose that both MAELM and DAEELM run $M$ iterations. For a $K$-class classification problem, MAELM only needs to train $M$ ELMs, while DAEELM needs to train $M \times K$ and $(M \times K \times (K - 1))/2$ classifiers under the OAA and OAO conditions, respectively. Although DAEELM may stop iterating earlier, it is obvious that, in theory, MAELM's computational complexity is much lower than that of DAEELM for a $K$-class classification problem, especially when $K$ is very large.
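The classifier counts in this comparison can be tabulated with a small helper. The function name `n_classifiers` and the values $M = 10$, $K = 15$ are illustrative assumptions; the counts are upper bounds for DAEELM, since it may stop iterating earlier.

```python
def n_classifiers(M, K):
    """Number of ELMs trained in M boosting iterations on a K-class problem."""
    return {
        "MAELM": M,                              # one multiclass ELM per iteration
        "DAEELM (OAA)": M * K,                   # K binary problems per iteration
        "DAEELM (OAO)": M * K * (K - 1) // 2,    # one binary problem per class pair
    }

counts = n_classifiers(M=10, K=15)
for method, count in counts.items():
    print(f"{method}: {count}")
```

For these illustrative values the gap is already one to two orders of magnitude, and the OAO count grows quadratically in $K$.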
The authors of DAEELM have not published their code, and DAEELM has its own parameters, which MAELM does not have. DAEELM also does not provide details of how it trains ELM on weighted data, so it would be unfair to compare the performance of MAELM and DAEELM empirically. However, the conclusion that MAELM is much faster than DAEELM in multiclass classification problems can be drawn from the complexity analysis above.

Train ELM with Weighted Data.
Section 3.1 has mentioned that training ELM with weighted data is a key problem when applying AdaBoost. However, neither [8] nor [9] addresses this key point at all.
Toh [26] first applied ELM to classify imbalanced data with two classes. ELM tries to minimize the training error of the data, while the algorithm proposed in [26] tends to minimize the total error rate (TER), which takes the weights of the positive and negative data into consideration.
In Section 3.1, the weighted ELM is applied in MAELM. Actually, the weighted ELM is inspired by, and very similar to, the regularized ELM proposed by Deng et al. in [27]. The regularized ELM aims to minimize the weighted training error of the weighted data.

Conclusion
This paper proposes a new boosting ELM named MAELM, which applies multiclass AdaBoost to an ELM ensemble to solve multiclass classification problems directly. A face recognition structure that combines the LBP based method and ELM is also presented. What is more, this paper proposes a way of combining ELM with PCA instead of using random weights between the input layer and the hidden layer and a random bias of the activation function.
Experiments in LBP based face recognition show stable and good performance to a certain degree. Although PCA improves the performance of ELM, it still cannot outperform MAELM with random weights and bias. Experiments show that in the LBP based face recognition problem, the recognition result of MAELM is more stable than that of the original ELM and better than that of any other algorithm listed in the paper.
Overall, the proposed MAELM algorithm, which applies multiclass AdaBoost to ELM and is combined with the LBP method, performs much better in solving the multiclass classification problem. MAELM is also compared theoretically with DAEELM on multiclass classification problems, which indicates that MAELM has much lower computational complexity than DAEELM. Moreover, this paper clarifies how to train ELM on weighted data.


3.1. Proposed MAELM Algorithm. By applying the multiclass AdaBoost to ELM, this paper proposes the multiclass AdaBoost ELM (MAELM) algorithm. The algorithm takes a number of ELM classifiers as the weak classifiers. $\mathrm{ELM}^{(m)}(x)$ denotes the $m$th ELM classifier. The proposed algorithm is put as follows.

Figure 8: Performances in Yale and ORL.

Figure 9: Performances in Yale and ORL.

Table 2: Performance of ELM and MAELM with PCA.

Table 3: Performance of ELM and MAELM under the same training set and test set.

4.3. Performance Changes with $M$. In order to evaluate how performance changes with $M$, the experiment in this part uses $C = 1$, $L = 1000$, and the remaining two parameters of Table 1 set to 5 and 4 for the original MAELM, a reduced dimension of 20 for MAELM with PCA, and $M = 2, 4, 6, \ldots, 50$. The average test accuracy is obtained over 20 trials with randomly generated training and test sets. The Yale dataset is used for the experiment. The result is presented in Figures 6 and 7.