Extreme learning machine (ELM) is a competitive machine learning technique that is simple in theory and fast to implement; it can identify faults quickly and precisely compared with traditional identification techniques such as support vector machines (SVM). As verified by simulation results, ELM tends to scale better and can achieve much better generalization performance and much faster learning speed than traditional SVM. In this paper, we introduce a multiclass AdaBoost based ELM ensemble method. In our approach, the ELM algorithm is selected as the base predictor of the ensemble because of its speed and good performance. Unlike the existing boosting ELM algorithm, our algorithm can be used directly in multiclass classification problems. We also carried out comparative experiments on face recognition datasets. The experimental results show that the proposed algorithm not only makes the prediction results more stable but also achieves better generalization performance.
1. Introduction
Much research has been done on feedforward neural networks, showing that they can not only approximate complex nonlinear mappings but also provide models for natural and artificial problems that classic parametric techniques are unable to handle.
Recently, Huang et al. [1] proposed a simple new algorithm for single-hidden-layer feedforward networks (SLFNs) called extreme learning machine (ELM). Because ELM randomly generates the hidden-layer parameters of the network, its learning speed can be thousands of times faster than that of traditional feedforward network learning algorithms such as back-propagation (BP), which needs many iterations to obtain optimal parameters.
In addition, Huang [2] shows that, in theory, ELMs (with the same kernels) tend to outperform SVM and its variants in both regression and classification applications while being much easier to implement. Building on this conclusion, Wong et al. [3] explore the advantage of ELM in fault identification time.
In view of these advantages, Cao et al. applied the algorithm to areas such as landmark recognition [4] and protein sequence classification [5]. In addition, Cao et al. [6] proposed an improved learning algorithm that incorporates a voting method into ELM for classification applications and outperforms the original ELM algorithm as well as several recent classification algorithms.
AdaBoost [7] is one of the most popular classifier ensemble algorithms for improving generalization performance. Wang and Li [8] proposed an algorithm named dynamic AdaBoost ensemble ELM (called DAEELM in this paper), which takes ELM as the base classifier and applies AdaBoost to solve binary classification problems. Similarly, Tian and Mao [9] combined the modified AdaBoost.RT [10] with ELM to propose a new hybrid artificial intelligence technique called ensemble ELM, which aims to improve ELM’s performance in regression problems.
However, little work has so far been done on applying AdaBoost to ELM directly for multiclass classification problems. Freund and Schapire [11] give two extensions of their boosting algorithm to multiclass prediction problems, in which each example belongs to one of several possible classes (rather than just two). Since ELM can work directly on multiclass classification problems, this paper proposes an algorithm named multiclass AdaBoost ELM (MAELM), which applies multiclass AdaBoost as an ensemble method to a number of ELMs. In addition, this paper proposes a structure for applying ELM and MAELM to the local binary patterns (LBP) [12] based face recognition problem. Experiments in LBP based face recognition show that the proposed algorithm outperforms the original ELM.
This paper is an extension of our previous work [13]. In this paper, we extend our previous work by proposing a new way to combine ELM with PCA instead of using random weights between the input layer and the hidden layer, as well as the bias of the activation function. Experiments in LBP based face recognition will show the stable and good performance with our extended approach.
The rest of the paper is organized as follows. Section 2 gives a brief review of the ELM and PCA, original and multiclass AdaBoost and LBP. The proposed MAELM is presented in Section 3. The experimental result will be shown in Section 4 and a short discussion about the proposed algorithm will be presented in Section 5. Finally, in Section 6, we conclude the paper.
2. A Review of Related Work
In this section, we review the original ELM algorithm, PCA, the original and multiclass AdaBoost, and LBP based face recognition.
2.1. ELM
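For N arbitrary distinct samples $(x_i, t_i)$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T \in \mathbb{R}^d$ and $t_i = [t_{i1}, t_{i2}, \ldots, t_{iK}]^T \in \mathbb{R}^K$, standard SLFNs with L hidden nodes and activation function h(x) are mathematically modeled as follows:
$$\sum_{i=1}^{L} \beta_i h_i(x_j) = \sum_{i=1}^{L} \beta_i h_i(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N. \tag{1}$$
Here, $w_i = [w_{i1}, w_{i2}, \ldots, w_{id}]^T$ is the weight vector connecting the ith hidden node and the input nodes, $\beta_i = [\beta_{i1}, \ldots, \beta_{iK}]^T$ is the weight vector connecting the ith hidden node and the output nodes, and $b_i$ is the threshold of the ith hidden node.
The standard SLFNs with L hidden nodes with activation function h(x) can be compactly written as follows:
$$H\beta = T, \tag{2}$$
where
$$H = \begin{bmatrix} h_1(w_1 \cdot x_1 + b_1) & \cdots & h_L(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ h_1(w_1 \cdot x_N + b_1) & \cdots & h_L(w_L \cdot x_N + b_L) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}. \tag{3}$$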
Different from the conventional gradient-based solution of SLFNs, ELM simply solves the system by
$$\beta = H^{+} T. \tag{4}$$
Here $H^{+}$ is the Moore-Penrose generalized inverse of matrix H. As Huang et al. have pointed out in [14], $H^{+}$ can be represented by
$$H^{+} = H^T \left( \frac{I}{C} + H H^T \right)^{-1}, \tag{5}$$
where I is an identity matrix with the same dimension as $HH^T$ and C is a constant that can be set by the user. Adding I/C avoids the situation in which $HH^T$ is singular. Huang et al. [1] successfully applied ELM to solve binary classification problems, and Huang et al. [14] extended ELM to solve multiclass classification problems directly.
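As an illustration of Eqs. (1) to (5), the following is a minimal NumPy sketch of ELM training and prediction. It is not the authors' implementation: the uniform range used for the random weights, the one-hot target encoding, the function names, and the toy data are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, L, C, rng):
    """Fit an ELM. X: (N, d) inputs, T: (N, K) one-hot targets (assumed encoding)."""
    N, d = X.shape
    W = rng.uniform(-1.0, 1.0, size=(d, L))  # random input weights w_i (range is an assumption)
    b = rng.uniform(-1.0, 1.0, size=L)       # random hidden thresholds b_i
    H = sigmoid(X @ W + b)                   # (N, L) hidden-layer output matrix, Eq. (3)
    # Eqs. (4)-(5): beta = H^T (I/C + H H^T)^{-1} T, solved without forming the inverse
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1)

# Toy data (hypothetical): three well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
labels = np.repeat(np.arange(3), 30)
X = centers[labels] + 0.3 * rng.standard_normal((90, 2))
T = np.eye(3)[labels]

model = train_elm(X, T, L=50, C=1e3, rng=rng)
train_acc = np.mean(predict_elm(model, X) == labels)
```

Note that only the output weights are learned; the hidden layer is drawn once and never updated, which is why no iterative training loop appears.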
Since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the biases of the activation function, its performance may not be very stable. Deterministic alternatives such as PCA are therefore worth trying.
2.2. PCA
Principal component analysis (PCA) was invented in 1901 by Pearson [15], as an analogue of the principal axes theorem in mechanics, which was later independently developed (and named) by Hotelling in the 1930s [16]. Now, it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute [17]. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point) and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
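The procedure of PCA is as follows. Let the data matrix be
$$X = (x_{ij})_{n \times p} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}. \tag{6}$$
Step 1. Compute the matrix V, which is the covariance matrix of X.
Step 2. Find the eigenvalues of V by solving $|V - \lambda E| = 0$, ordered so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$.
Step 3. Compute the standardized eigenvectors $\beta_1, \beta_2, \ldots, \beta_p$ from $(V - \lambda E)\beta = 0$.
Step 4. Yield the principal components $Y_r = \beta_r^{\prime} X$ $(r = 1, 2, \ldots, p)$.
E is an identity matrix with the same dimension as V. The matrix Y consists of n row vectors, each of which is the projection of the corresponding data vector from matrix X.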
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent if the dataset is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.
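Steps 1 to 4 above can be sketched in NumPy as follows. This is an illustrative sketch rather than the code used in the paper: the function name `pca`, the explicit mean centering, and the random test matrix are assumptions, and `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca(X, r):
    """Project the rows of X (n samples x p variables) onto the top r principal axes."""
    Xc = X - X.mean(axis=0)                    # mean-center each variable
    V = np.cov(Xc, rowvar=False)               # Step 1: p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(V)       # Steps 2-3: eigenvalues (ascending), unit eigenvectors
    order = np.argsort(eigvals)[::-1]          # reorder so that lambda_1 >= ... >= lambda_p
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Y = Xc @ eigvecs[:, :r]                    # Step 4: component scores, one row per sample
    return Y, eigvals, eigvecs

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated toy data
Y, eigvals, eigvecs = pca(X, r=2)
```

The covariance matrix of the returned scores is diagonal, with the leading eigenvalues on the diagonal, which is the "linearly uncorrelated" property described above.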
2.3. Original AdaBoost and Multiclass AdaBoost
AdaBoost has been applied very successfully to binary classification problems. The original AdaBoost was proposed in [7]. Before presenting the AdaBoost algorithm, the function I(x) is predefined as
$$I(x) = \begin{cases} 1, & \text{if } x = \text{true}, \\ 0, & \text{if } x = \text{false}. \end{cases} \tag{7}$$
AdaBoost algorithm is summarized as follows.
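Given the training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^d$ denotes the ith input feature vector with d dimensions and $y_i \in \{-1, +1\}$ denotes the label of the ith input feature vector. Use $T_j(x)$ to denote the jth weak classifier and suppose M weak classifiers will be combined.
(1) Initialize the observation weights $\omega_i = 1/N$, $i = 1, 2, \ldots, N$.
(2) For m = 1:M,
(a) fit a classifier $T_m(x)$ to the training data using weights $\omega_i$;
(b) compute the weighted error
$$err_m = \frac{\sum_{i=1}^{N} \omega_i I(y_i \neq T_m(x_i))}{\sum_{i=1}^{N} \omega_i}; \tag{8}$$
(c) compute the weight of the mth classifier
$$\alpha_m = \log \frac{1 - err_m}{err_m}; \tag{9}$$
(d) update the weights of sample data, for all $i = 1, 2, \ldots, N$,
$$\omega_i = \omega_i \cdot \exp\left(\alpha_m \cdot I(y_i \neq T_m(x_i))\right); \tag{10}$$
(e) renormalize $\omega_i$, for all $i = 1, 2, \ldots, N$.
(3) Output
$$C(x) = \arg\max_{k} \sum_{m=1}^{M} \alpha_m \cdot I(T_m(x) = k). \tag{11}$$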
Here, k is +1 or -1. In binary classification, any classifier whose accuracy is better than 1/2 is a weak classifier. The original AdaBoost has the following two properties.
For the ith and the jth classifiers, if $err_i < err_j < 1/2$, we have $\alpha_i > \alpha_j > 0$, which means that the final ensemble classifier gives more weight to the ith classifier’s result. In particular, if $err_j = 1/2$, then $\alpha_j = 0$, which means that the final ensemble classifier simply ignores that classifier, since it is no better than random guessing.
If the pth classifier misclassifies the qth sample, the qth sample receives a larger weight in the next iteration, so the (p+1)th classifier pays more attention to it. Conversely, if the pth classifier classifies the qth sample correctly, the qth sample has a relatively smaller weight after renormalization, so the (p+1)th classifier pays less attention to it.
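To make the reweighting behaviour concrete, here is a small pure-Python sketch of a single boosting round using Eqs. (8) to (10). The five labels and the weak-classifier predictions are hypothetical values chosen for illustration, and the helper name `adaboost_round` is not from the paper.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of Eqs. (8)-(10) followed by renormalization.

    Assumes 0 < err < 1; a classifier with zero weighted error would need special handling.
    """
    miss = [int(t != p) for t, p in zip(y_true, y_pred)]           # I(y_i != T_m(x_i))
    err = sum(w * m for w, m in zip(weights, miss)) / sum(weights)  # Eq. (8)
    alpha = math.log((1.0 - err) / err)                             # Eq. (9)
    updated = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]  # Eq. (10)
    total = sum(updated)
    return err, alpha, [w / total for w in updated]

y_true = [+1, -1, +1, +1, -1]
y_pred = [+1, +1, +1, +1, -1]   # hypothetical weak classifier: only the second sample is wrong
weights = [1.0 / 5] * 5
err, alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

In this example the single misclassified sample ends up carrying half of the total weight after renormalization, while each correctly classified sample drops from 0.2 to 0.125.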
However, for a K-class classification problem, we have $y_i \in \{1, 2, \ldots, K\}$ and K > 2. If a classifier’s accuracy is better than 1/K (which may be much smaller than 1/2), it can be called a weak classifier. Since the original AdaBoost only accepts as a weak classifier one whose accuracy is better than 1/2, it cannot be applied directly to the multiclass case in which K is bigger than 2. Freund and Schapire [11] extend the original AdaBoost to the multiclass case. The weight of the mth classifier is modified as
$$\alpha_m = \log \frac{1 - err_m}{err_m} + \log(K - 1). \tag{12}$$
Similar to the binary case, for the ith and the jth classifiers, if $err_i < err_j < 1 - 1/K$, we have $\alpha_i > \alpha_j > 0$, which means that the final ensemble classifier gives more weight to the ith classifier’s result. In particular, if $err_j = 1 - 1/K$, then $\alpha_j = 0$.
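The effect of the extra term in Eq. (12) can be checked numerically. The short sketch below evaluates the classifier weight for a few error rates; the value K = 15 and the error rates are hypothetical, and the function name `classifier_weight` is introduced here only for illustration.

```python
import math

def classifier_weight(err, K):
    """Eq. (12). For K = 2 the term log(K - 1) vanishes and Eq. (9) is recovered."""
    return math.log((1.0 - err) / err) + math.log(K - 1)

K = 15                                               # hypothetical number of classes
alpha_good = classifier_weight(0.30, K)              # fairly accurate classifier
alpha_poor = classifier_weight(0.60, K)              # error above 1/2, but still better than chance
alpha_chance = classifier_weight(1.0 - 1.0 / K, K)   # error of random guessing
alpha_binary = classifier_weight(0.60, 2)            # the same error under binary AdaBoost
```

A classifier with error 0.6 would receive a negative weight under the binary rule, but with K = 15 it still receives a positive weight because it is far better than the chance error of 1 - 1/K.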
2.4. LBP Based Face Recognition
The original LBP operator goes through each 3×3 neighborhood in a picture. It thresholds the eight neighboring pixels against the value of the center pixel and reads the resulting binary pattern as a decimal number. The LBP operator is shown in Figure 1. The texture of the picture can then be represented by the histogram of all the decimal numbers.
Figure 1: Basic LBP operator.
To apply the LBP operator to the face recognition problem, Ahonen et al. [12] divided the face image into several windows and calculated the histogram of each window with the LBP operator. The final feature vector is obtained by concatenating the histograms into a spatially enhanced histogram. The spatially enhanced histogram carries three levels of information: the patterns at the pixel level, the patterns at the regional level, and the global patterns of the face image. Experiments in [12] have shown that the LBP description is more robust against variations in pose or illumination than holistic methods. All our experiments in Section 4 are done with the original LBP operator.
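The basic operator and the spatially enhanced histogram can be sketched as follows. This is an assumption-laden illustration, not the paper's code: the neighbor ordering (clockwise from the top-left), the use of `>=` for the threshold, the omission of border pixels, and the window splitting via `numpy.array_split` are all choices made for this example.

```python
import numpy as np

def lbp_codes(img):
    """Basic 3x3 LBP code for every interior pixel of a 2-D grayscale array."""
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Eight neighbors, clockwise from the top-left (the bit order is a convention)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
        codes += (neighbor >= center).astype(np.int32) << bit   # threshold against the center
    return codes

def spatial_histogram(img, w):
    """Concatenate 256-bin LBP histograms over a w x w grid of windows."""
    codes = lbp_codes(img)
    hists = []
    for band in np.array_split(codes, w, axis=0):
        for window in np.array_split(band, w, axis=1):
            hists.append(np.bincount(window.ravel(), minlength=256))
    return np.concatenate(hists)

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
code = int(lbp_codes(patch)[0, 0])   # neighbors >= 6 set bits 0, 4, 5, 6, 7

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(32, 32))   # random stand-in for a face image
feature = spatial_histogram(face, w=3)
```

With w windows per side the feature vector has 256·w² entries, so the window count directly controls the input dimension seen by the classifier.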
3. MAELM and Face Recognition Structure
In this part, the multiclass AdaBoost ELM (MAELM) algorithm is proposed and a structure of face recognition based on LBP and ELM is also included.
3.1. Proposed MAELM Algorithm
By applying multiclass AdaBoost to ELM, this paper proposes the multiclass AdaBoost ELM (MAELM) algorithm. The algorithm takes a number of ELM classifiers as the weak classifiers; $ELM_i(x)$ denotes the ith ELM classifier, and $c_i$ denotes the class label of the ith training sample. The proposed algorithm is as follows.
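(1) Initialize the observation weights $\omega_i = 1/N$, $i = 1, 2, \ldots, N$.
(2) For m = 1:M,
(a) fit a classifier $ELM_m(x)$ to the training data using weights $\omega_i$;
(b) compute the weighted error
$$err_m = \frac{\sum_{i=1}^{N} \omega_i I(c_i \neq ELM_m(x_i))}{\sum_{i=1}^{N} \omega_i}; \tag{13}$$
(c) compute the weight of the mth classifier
$$\alpha_m = \log \frac{1 - err_m}{err_m} + \log(K - 1); \tag{14}$$
(d) update the weights of sample data, for all $i = 1, 2, \ldots, N$,
$$\omega_i = \omega_i \cdot \exp\left(\alpha_m \cdot I(c_i \neq ELM_m(x_i))\right); \tag{15}$$
(e) renormalize $\omega_i$.
(3) Output
$$C(x) = \arg\max_{k} \sum_{m=1}^{M} \alpha_m \cdot I(ELM_m(x) = k). \tag{16}$$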
Part (2)(a) of the proposed algorithm deserves particular attention. Neither [8] nor [9] gives any detail on how to fit the base classifier $ELM_m(x)$ with weighted samples, although this is a very important part of AdaBoost. Zong et al. [18] proposed an algorithm named weighted ELM by introducing a diagonal matrix $W \in \mathbb{R}^{N \times N}$ whose element $W_{i,i}$ denotes the weight of the ith training sample. We adopt this weighted ELM to fit each base classifier. It reduces to the original ELM when the weight matrix is the identity matrix.
The proposed method retains the advantages of the original ELM: (1) it is simple in theory and convenient to implement; (2) a wide range of feature mapping functions or kernels is available for the proposed framework; (3) the proposed method can be applied directly to multiclass classification tasks. In addition, with the weighting scheme, the weighted ELM can deal with data with an imbalanced class distribution while maintaining the good performance of unweighted ELM on well-balanced data; by assigning different weights to each example according to the user’s needs, the weighted ELM can be generalized to cost-sensitive learning.
Under the weighted circumstance, the solution of β becomes
$$\beta = H^T \left( \frac{I}{C} + W H H^T \right)^{-1} W T. \tag{17}$$
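Putting Eqs. (13) to (17) together, a compact NumPy sketch of MAELM might look as follows. It is a sketch under stated assumptions, not the authors' code: the random weight range, the one-hot targets, the rescaling of the sample weights by N (so that uniform weights give W = I and Eq. (17) reduces to Eq. (5)), the clipping of the error to avoid log(0), and the toy data are all choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_elm(X, T, sample_w, L, C, rng):
    N, d = X.shape
    W_in = rng.uniform(-1.0, 1.0, size=(d, L))   # random input weights (assumed range)
    b = rng.uniform(-1.0, 1.0, size=L)
    H = sigmoid(X @ W_in + b)
    Wd = np.diag(N * sample_w)                   # assumption: rescale so uniform weights give W = I
    # Eq. (17): beta = H^T (I/C + W H H^T)^{-1} W T
    beta = H.T @ np.linalg.solve(np.eye(N) / C + Wd @ H @ H.T, Wd @ T)
    return W_in, b, beta

def predict_elm(model, X):
    W_in, b, beta = model
    return np.argmax(sigmoid(X @ W_in + b) @ beta, axis=1)

def fit_maelm(X, c, K, M, L, C, rng):
    N = X.shape[0]
    T = np.eye(K)[c]                             # one-hot targets (assumed encoding)
    w = np.full(N, 1.0 / N)                      # step (1)
    models, alphas = [], []
    for _ in range(M):                           # step (2)
        model = fit_weighted_elm(X, T, w, L, C, rng)        # (a)
        miss = predict_elm(model, X) != c
        err = np.sum(w * miss) / np.sum(w)                  # (b), Eq. (13)
        err = np.clip(err, 1e-10, 1.0 - 1e-10)              # guard against log(0); not in the source
        alpha = np.log((1.0 - err) / err) + np.log(K - 1)   # (c), Eq. (14)
        w = w * np.exp(alpha * miss)                        # (d), Eq. (15)
        w = w / w.sum()                                     # (e)
        models.append(model)
        alphas.append(alpha)
    return models, alphas

def predict_maelm(models, alphas, X, K):
    votes = np.zeros((X.shape[0], K))
    rows = np.arange(X.shape[0])
    for model, alpha in zip(models, alphas):
        votes[rows, predict_elm(model, X)] += alpha          # step (3), Eq. (16)
    return np.argmax(votes, axis=1)

# Toy data (hypothetical): three well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
c = np.repeat(np.arange(3), 30)
X = centers[c] + 0.3 * rng.standard_normal((90, 2))
models, alphas = fit_maelm(X, c, K=3, M=5, L=20, C=1.0, rng=rng)
acc = np.mean(predict_maelm(models, alphas, X, K=3) == c)
```

Because each round draws fresh random hidden weights, the ensemble members differ both through the sample weights and through their random hidden layers.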
3.2. Application in LBP Based Face Recognition
This paper combines LBP based feature vectors with ELM to build a face recognition structure. Some papers [19, 20] have applied ELM to the face recognition problem. However, the existing ELM based face recognition structures are all based on statistical features, for example, PCA [21] and LDA [22].
To obtain better generalization performance, the proposed face recognition structure uses the LBP based method to extract the feature vector and ELM as the classifier. It has been shown in [12] that the LBP based method is more robust than PCA and LDA when lighting, facial expression, and pose change. At the same time, ELM is very fast in classification and has very good generalization performance. It is therefore reasonable to combine the LBP method and ELM to build the face recognition structure.
The proposed face recognition structure has two steps. The first step is to train the classifier on the training samples with ELM or MAELM. In this step, the training samples are represented by LBP based feature vectors, which are then used to train the classifier model by ELM or MAELM; see Figure 2. The second step is to predict the labels of the test samples. The test samples are also represented by LBP based feature vectors, and the classifier model trained in the first step is used to predict their labels; see Figure 3.
Figure 2: Training the samples by ELM or MAELM.
Figure 3: Predicting the labels of test samples.
4. Experiments
In this paper, two of the most widely used face recognition datasets, Yale and ORL, are used to demonstrate the effectiveness of the proposed algorithm. To make the results reliable, except in Section 4.2, the average testing accuracy is obtained over 20 trials, each with a randomly generated training set and test set. This paper uses the sigmoid function as the activation function because it is the most commonly used one.
The parameters to be set in the experiments and their meanings are listed in Table 1. For example, setting M=10, C=1, L=1000, t=5, and w=5 means that 5 images of each person are selected to build the training set, the remaining images build the test set, and each image is divided into 5×5 windows. After the training and test sets are built, ELM with C=1, L=1000 and MAELM, which combines 10 ELMs with C=1, L=1000, are evaluated on them.
Table 1: Parameter list.

| Parameter | Meaning |
| --- | --- |
| M | Number of the basic classifiers |
| C | Constant value in generalized inverse of H |
| L | Number of hidden nodes in ELM |
| t | Number of training images of each person |
| w | Divide each face image into w × w windows |
| r | The dimension after reduction |
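The role of the parameter t can be illustrated with a small sketch of the per-person random split. The helper name `split_per_person` and the label layout (15 people with 11 images each) are assumptions made for this example, not details taken from the experimental code.

```python
import numpy as np

def split_per_person(labels, t, rng):
    """Pick t random images of each person for training; the rest form the test set."""
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for person in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == person))
        train_idx.extend(idx[:t])
        test_idx.extend(idx[t:])
    return np.array(train_idx), np.array(test_idx)

labels = np.repeat(np.arange(15), 11)   # hypothetical layout: 15 people x 11 images
train_idx, test_idx = split_per_person(labels, t=5, rng=np.random.default_rng(0))
```

Repeating this split with a different seed for each trial gives the "randomly generated training set and test set" over which the testing accuracy is averaged.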
4.1. Performance Changes with C and L
Although ELM is less sensitive to its parameters than SVM, its performance still changes with the number of hidden nodes L and the constant value C.
Suppose we have N training samples. Huang et al. [1] rigorously prove that SLFNs with N hidden nodes and random biases and input weights can learn the N distinct observations exactly. If some training error is allowed, the number of hidden nodes can be much smaller than N. The constant value C also affects the solution for the Moore-Penrose generalized inverse of H.
In this part, the experiment is conducted on the Yale dataset with M=20, t=5, and w=3. In addition, L is set to 100, 400, 700, …, 1900 and C is set to $10^{-5}, 10^{-4}, \ldots, 1, 10^{1}, 10^{2}, \ldots, 10^{5}$. The performance of ELM and MAELM is shown in Figure 4.
Figure 4: The performance of ELM (a). The performance of MAELM (b).
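For reference, the parameter grid described above can be written down in a few lines of NumPy. The variable names are illustrative only, and the evaluation of each grid point is left out.

```python
import numpy as np

L_values = np.arange(100, 2000, 300)    # 100, 400, 700, ..., 1900
C_values = 10.0 ** np.arange(-5, 6)     # 1e-5, 1e-4, ..., 1e5
grid = [(int(L), float(C)) for L in L_values for C in C_values]
```

Each of the resulting (L, C) pairs corresponds to one point of the accuracy surfaces in Figure 4.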
Both ELM and MAELM are largely insensitive to changes in these parameters. They differ mainly in the region where L is very small and C is very large. From Figure 4, one can conclude that ELM performs badly in this region, with an accuracy rate below 0.6, whereas MAELM remains very stable there, with an accuracy rate above 0.8.
Given PCA’s good performance in the field of face recognition, we ask whether PCA can give a stable and better performance when it replaces the original construction of the matrix H.
This experiment is also conducted on the Yale dataset with the same parameters. The new parameter r, the dimension after reduction, cannot be larger than the number of input nodes. In view of the dimension of the dataset and other limitations in the experiment, r is set to 10, 20, …, 60. Because the results vary too irregularly across parameter settings to be shown clearly in a plot, we report them in a table. The performance of ELM (Figure 4(a)) and MAELM (Figure 4(b)) with PCA is listed in Table 2; the best accuracy rate in the table is in bold.
Table 2: Performance of ELM and MAELM with PCA (accuracy rates given as ELM/MAELM).

| C \ r | 10 | 20 | 30 | 40 | 50 | 60 |
| --- | --- | --- | --- | --- | --- | --- |
| 10^-5 | 0.19/0.38 | 0.29/0.33 | 0.20/0.32 | 0.23/0.41 | 0.09/0.15 | 0.22/0.30 |
| 10^-4 | 0.19/0.20 | 0.22/0.40 | 0.21/0.21 | 0.36/0.35 | 0.27/0.28 | 0.26/0.38 |
| 10^-3 | 0.27/0.39 | 0.38/0.35 | 0.24/0.46 | 0.37/0.36 | 0.33/0.31 | 0.38/0.30 |
| 10^-2 | 0.43/0.34 | 0.76/0.32 | 0.85/0.32 | 0.86/0.43 | 0.86/0.37 | 0.86/0.35 |
| 10^-1 | 0.79/0.36 | 0.84/0.43 | 0.88/0.40 | 0.88/0.36 | 0.90/0.42 | 0.91/0.53 |
| 10^0 | 0.84/0.58 | 0.95/0.76 | 0.86/0.81 | 0.90/0.85 | 0.87/0.82 | 0.91/0.87 |
| 10^1 | 0.77/0.66 | 0.90/0.87 | 0.89/0.88 | 0.91/0.91 | **0.98**/0.97 | 0.91/0.88 |
| 10^2 | 0.77/0.73 | 0.85/0.87 | 0.91/0.91 | 0.90/0.92 | 0.94/0.92 | 0.92/0.92 |
| 10^3 | 0.77/0.74 | 0.90/0.89 | 0.92/0.93 | 0.94/0.94 | 0.93/0.95 | 0.90/0.92 |
| 10^4 | 0.74/0.55 | 0.91/0.88 | 0.93/0.93 | 0.88/0.88 | 0.91/0.91 | 0.85/0.86 |
| 10^5 | 0.79/0.71 | 0.87/0.87 | 0.87/0.87 | 0.97/0.97 | 0.96/0.96 | 0.91/0.91 |
Both ELM and MAELM with PCA are not very sensitive to changes in the parameters. The difference between them is mainly in the region where L is very small and C is very large. From Table 2, one can conclude that MAELM with PCA performs better when C is very small, whereas when C is large and r is small, ELM with PCA performs rather well and stably. In addition, the performance of ELM with PCA is almost as good as that of MAELM with PCA in the region where C and r are both very large, where its accuracy rate is about 0.85 or higher.
4.2. Prediction Stability Analysis
Since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the biases of the activation function, its performance changes from run to run even for the same training and test sets. That is to say, the performance of the original ELM may not be very stable. The proposed algorithm reduces this instability.
From Figure 4, one can conclude that ELM tends to perform better when C=1, while $C=10^{3}$ is better for MAELM. We therefore let M=10, $C=10^{3}$, L=1000, t=5, and w=3 for MAELM and C=1, L=1000, t=5, and w=3 for ELM. ELM and MAELM with PCA (r=20) are also included under the corresponding settings because of their considerable performance above. Experiments are done on the Yale dataset. To show that the proposed algorithm is more stable than the original ELM, each experiment is repeated 20 times on the same (randomly generated) training set and test set. The results are shown in Figure 5.
Figure 5: Performance of ELM and MAELM under the same training set and test set.
In Figure 5, the performance of MAELM is clearly much more stable than that of the original ELM. ELM and MAELM with PCA are far more stable than the original ELM and MAELM (since they use PCA instead of random weights between the input layer and the hidden layer and random biases of the activation function), but their accuracy rates, which always lie in the middle in Figure 5, are still not as good as that of the original MAELM. We summarize the results of Figure 5 in Table 3. Note that although the generalization performance of MAELM appears much better than that of ELM in the table, it would be improper to conclude from this alone that MAELM performs better, because the training set and test set are fixed: one cannot exclude the possibility that MAELM outperforms ELM only on this particular split. Further experiments in the following sections demonstrate MAELM’s better generalization performance.
Table 3: Performance of ELM and MAELM under the same training set and test set.

| Algorithm | Mean accuracy rate | Standard deviation |
| --- | --- | --- |
| ELM | 0.8972 | 0.0213 |
| MAELM | 0.9361 | 0.0157 |
| ELM_PCA | 0.9222 | 0 |
| MAELM_PCA | 0.9222 | 0 |
4.3. Performance Changes with M
To evaluate how performance changes with M, the experiment in this part sets C=1, t=5, w=4, and L=1000 for the original MAELM, r=20 for MAELM with PCA, and M=2,4,6,…,50. The average test accuracy is obtained over 20 trials with randomly generated training and test sets. The Yale dataset is used for the experiment. The results are presented in Figures 6 and 7.
Figure 6: MAELM’s performance.
Figure 7: MAELM with PCA’s performance.
From Figure 6, the generalization performance clearly improves as M increases, although the improvement slows down as M grows. From Figure 7, one can conclude that the performance of MAELM with PCA decreases slightly as M increases while M is small and improves once M exceeds 25, although the trend is not as stable as that of the original MAELM. This indicates that in real-world applications M does not need to be very large. Good generalization performance can be obtained by setting M below 30 in the original MAELM, which performs better than MAELM with PCA under the same settings.
4.4. Better Generalization Performance Than ELM
In this part, experiments are done on both the Yale and ORL datasets. The parameters of the algorithms are set as follows: C=1, L=1000, t=5, M=20 (MAELM), and r=20 (PCA). The experiments consider 3×3, 4×4, 5×5, 6×6, and 7×7 windows, that is, w=3,…,7. The average testing accuracy is obtained over 20 trials with randomly generated training and test sets.
The experiments indicate that MAELM has better generalization performance on both the Yale and ORL datasets under different window sizes; see Figure 8 for details. In Figure 9, ELM with PCA clearly performs much better on both the Yale and ORL datasets under different window sizes. In addition, this algorithm remains more stable than any of the other algorithms on both datasets.
Figure 8: Performances in Yale and ORL.
Figure 9: Performances in Yale and ORL.
4.5. The Performance in PCA
From all these experiments, we can conclude that although MAELM with PCA does not perform as well as the original MAELM, ELM with PCA performs much better than before, especially in the experiment in Section 4.2. The performance of the experiments with PCA lies between that of the original ELM and that of the original MAELM.
Moreover, since the original ELM randomly generates the weights between the input layer and the hidden layer, as well as the biases of the activation function, its performance is not very stable. The proposed algorithm with PCA reduces this instability, which is very important in real-world applications.
Although PCA improves the performance of ELM to a certain degree, it still does not match MAELM with random weights and biases. We therefore conclude that the proposed MAELM algorithm performs much better in solving multiclass classification problems.
5. Discussion
5.1. Complexity Comparison
Like MAELM, DAEELM [8] takes ELM as the weak classifier and uses AdaBoost as the ensemble method. The difference is that MAELM uses multiclass AdaBoost, which can be applied directly to multiclass classification problems, while DAEELM uses dynamic ensemble AdaBoost [23], which targets binary classification problems.
Many methods have been developed for applying binary classifiers to multiclass problems. One-against-all (OAA) [24] and one-against-one (OAO) [25] are the most widely used. For a K-class classification problem, K classifiers have to be trained under OAA, each separating a single class from all the remaining classes. Under OAO, K(K-1)/2 classifiers have to be trained, each separating a pair of classes.
Suppose that both MAELM and DAEELM run M iterations. For a K-class classification problem, MAELM only needs to train M ELMs, while DAEELM needs to train M×K and (M×K×(K-1))/2 classifiers under OAA and OAO, respectively. Although DAEELM may stop iterating earlier, in theory MAELM’s computational complexity is much lower than DAEELM’s for K-class classification problems, especially when K is very large.
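The counts above are easy to tabulate. The sketch below does so for one hypothetical setting (M = 20 iterations and K = 15 classes); the function name and dictionary keys are illustrative, and early stopping of the binary ensemble is ignored.

```python
def n_base_classifiers(M, K):
    """Number of ELMs trained by each scheme after M boosting iterations on K classes."""
    return {
        "MAELM": M,                          # one multiclass ELM per iteration
        "binary_OAA": M * K,                 # one boosted binary ensemble per class
        "binary_OAO": M * K * (K - 1) // 2,  # one boosted binary ensemble per class pair
    }

counts = n_base_classifiers(M=20, K=15)   # hypothetical values
```

For this setting the three schemes train 20, 300, and 2100 ELMs respectively, so the gap grows linearly in K for OAA and quadratically for OAO.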
The authors of DAEELM have not published their code, and DAEELM has its own parameters that MAELM does not have. DAEELM also does not provide details of how it trains ELM on weighted data, so a direct performance comparison between MAELM and DAEELM would be unfair. However, the complexity analysis above supports the conclusion that MAELM is much faster than DAEELM in multiclass classification problems.
5.2. Train ELM with Weighted Data
Section 3.1 has mentioned that training ELM with weighted data is a key problem when applying AdaBoost. However, neither [8] nor [9] addresses this point.
Toh [26] first applied ELM to classify imbalanced two-class data. ELM tries to minimize the training error of the data, while the algorithm proposed in [26] tends to minimize the total error rate (TER), which takes the weights of the positive and negative data into consideration.
In Section 3.1, the weighted ELM is applied in MAELM. The weighted ELM is in fact inspired by, and very similar to, the regularized ELM proposed by Deng et al. [27], which aims to minimize the weighted training error of the weighted data.
6. Conclusion
This paper proposes a new boosting ELM named MAELM, which applies multiclass AdaBoost to an ELM ensemble to solve multiclass classification problems directly. A face recognition structure combining the LBP based method and ELM is also presented. Moreover, this paper proposes a way of combining ELM with PCA instead of using random weights between the input layer and the hidden layer and random biases of the activation function.
Experiments in LBP based face recognition show that the approach with PCA gives stable and, to a certain degree, good performance. Although PCA improves the performance of ELM, it still does not outperform MAELM with random weights and biases. Experiments show that, in the LBP based face recognition problem, the recognition results of MAELM are more stable than those of the original ELM and better than those of any other algorithm listed in the paper.
In summary, the proposed MAELM algorithm, which applies multiclass AdaBoost to ELM and is combined with the LBP method, performs much better in solving multiclass classification problems.
In addition, MAELM is compared with DAEELM in theory for multiclass classification problems, which indicates that MAELM has much lower computational complexity than DAEELM. Finally, this paper clarifies how to train ELM with weighted data.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is based on work supported in part by the National Natural Science Foundation of China (61370173, 61173123) and the Natural Science Foundation Project of Zhejiang Province under Project LR13F030003.
References
[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, Extreme learning machine: theory and applications.
[2] G.-B. Huang, An insight into extreme learning machines: random neurons, random features and kernels.
[3] P. K. Wong, Z. Yang, C. M. Vong, and J. Zhong, Real-time fault diagnosis for gas turbine generator systems using extreme learning machine.
[4] J. W. Cao, T. Chen, and J. Fan, Fast online learning algorithm for landmark recognition based on BoW framework, Proceedings of the 9th IEEE Conference on Industrial Electronic and Application, 2014, pp. 1163-1168.
[5] J. Cao and L. Xiong, Protein sequence classification with improved extreme learning machine algorithms.
[6] J. Cao, Z. Lin, G.-B. Huang, and N. Liu, Voting based extreme learning machine.
[7] J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting.
[8] G. Wang and P. Li, Dynamic Adaboost ensemble extreme learning machine, Proceedings of the 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE '10), August 2010, Chengdu, China, IEEE, pp. V3-54-V3-58, doi:10.1109/ICACTE.2010.5579726.
[9] H.-X. Tian and Z.-Z. Mao, An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace.
[10] D. P. Solomatine and D. L. Shrestha, AdaBoost.RT: a boosting algorithm for regression problems, Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, 2004, pp. 1163-1168.
[11] Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting.
[12] T. Ahonen, A. Hadid, and M. Pietikäinen, Face recognition with local binary patterns.
[13] Y. Shen, Y. L. Jiang, W. Liu, and Y. Liu, Multi-class AdaBoost ELM, Proceedings of the International Conference on Extreme Learning Machines (ELM '14), December 2014, Singapore.
[14] G. B. Huang, H. Zhou, X. Ding, and R. Zhang, Extreme learning machine for regression and multiclass classification.
[15] K. Pearson, On lines and planes of closest fit to systems of points in space.
[16] H. Hotelling, Analysis of a complex of statistical variables into principal components.
[17] H. Abdi and L. J. Williams, Principal component analysis.
[18] W. Zong, G.-B. Huang, and Y. Chen, Weighted extreme learning machine for imbalance learning.
[19] W. Zong and G.-B. Huang, Face recognition based on extreme learning machine.
[20] A. A. Mohammed, R. Minhas, Q. M. Jonathan Wu, and M. A. Sid-Ahmed, Human face recognition based on multidimensional PCA and extreme learning machine.
[21] M. Turk and A. Pentland, Eigenfaces for recognition.
[22] K. Etemad and R. Chellappa, Discriminant analysis for recognition of human face images.
[23] R. Li, J. Lu, Y. Zhang, and T. Zhao, Dynamic Adaboost learning with feature selection based on parallel genetic algorithm for image annotation.
[24] B. Heisele, P. Ho, J. Wu, and T. Poggio, Face recognition: component-based versus global approaches.
[25] E. L. Allwein, R. E. Schapire, and Y. Singer, Reducing multiclass to binary: a unifying approach for margin classifiers.
[26] K.-A. Toh, Deterministic neural classification.
[27] W. Deng, Q. Zheng, and L. Chen, Regularized extreme learning machine, Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), April 2009, Nashville, Tenn, USA, pp. 389-395, doi:10.1109/CIDM.2009.4938676.