Two-Dimensional Extreme Learning Machine

Extreme learning machine (ELM) has attracted wide attention due to its faster learning speed compared with conventional models such as support vector machines (SVM) and back-propagation (BP) networks. However, like many other methods, ELM was originally proposed to handle vector patterns, while nonvector patterns that arise in real applications, such as image data, still need to be explored. We propose the two-dimensional extreme learning machine (2DELM), based on the very natural idea of dealing with matrix data directly. Unlike the original ELM, which handles vectors, 2DELM takes matrices as input features without vectorization. Empirical studies on several real image datasets show the efficiency and effectiveness of the algorithm.


1. Introduction
Pattern representation is probably one of the most basic problems in machine learning; almost all learning algorithms aim to build a mapping function from input to output. The output of a learning model is usually straightforward, while different input representations can influence the results considerably. In statistical learning, an input pattern is commonly represented by a vector that contains the values of the corresponding features. Even when the original data are not sampled as vectors, there is a standard preprocessing step named vectorization, which transforms the original data into vectors for the convenience of computation. Taking face images, for example, each sample of a $d_1$-by-$d_2$ face image is usually transformed into a vector of length $d_1 \times d_2$ by concatenating all columns or rows, so that the sample can be processed by popular learning algorithms such as support vector machines (SVM) or artificial neural networks. "Input vector" has almost become another name for "input sample", and those vectors with discriminative ability that define the margin of largest separation are called support vectors in SVM [1].
On the one hand, vectorization helps the input data fit mature models and accelerates computation through popular linear algebra libraries. On the other hand, the drawbacks of vectorizing image data are obvious from at least two aspects [2, 3]. (1) Structural or contextual information may be lost during the transformation due to the changes of relative position of the pixels, and the reason is quite intuitive. (2) Vectorization needs more parameters and thus leads to the curse of dimensionality. For example, in order to classify $1024 \times 1024$ images by neural networks with 1000 hidden nodes, one needs about $10^9$ parameters in the first layer, and the feedforward computation can be slow.

Now look at the general class of mapping functions adopted by many discriminative models, which take the sample vector as input and a classification label or regression value as output:

$f(\mathbf{x}) = g\big(\sum_{i=1}^{L} \beta_i h_i(\mathbf{x})\big)$,  (1)

where $\mathbf{x} \in \mathbb{R}^{d}$ is the input vector and $h_i(\mathbf{x})$ is the $i$th output value of the hidden layer in a three-layer neural network, or the $i$th output value of another two-layer model such as least squares regression or logistic regression. $\beta_i$ is the parameter which connects $h_i(\mathbf{x})$ and the final output value. In order to obtain a scalar output easily, a linear or nonlinear transformation needs to be conducted on the input space; thus $h_i(\mathbf{x})$ is sometimes regarded as a point in the feature space. The function $g(\cdot)$ controls the final output value according to the specific learning task. The feature mapping function is defined as

$h_i(\mathbf{x}) = \sigma(\mathbf{w}_i^{T}\mathbf{x} + b_i)$,  (2)

where $\mathbf{w}_i \in \mathbb{R}^{d}$ is the weight vector that connects the input nodes and the $i$th hidden node in neural network models and $b_i$ is the bias of the $i$th hidden node in this case. $\sigma(\cdot)$ is usually a nonlinear continuous function. For linear regression models as well as back-propagation networks, the $\mathbf{w}_i$ are the main parameters that need to be learned. The feature mapping stage here is a linear transformation, and the output of each hidden node is obtained from a linear combination of input units and corresponding weights. Similar to the vector case, the feature mapping function for a matrix pattern $X \in \mathbb{R}^{d_1 \times d_2}$ takes the following form [4]:

$h_i(X) = \sigma(\mathbf{u}_i^{T} X \mathbf{v}_i + b_i)$,  (3)

where $\mathbf{u}_i \in \mathbb{R}^{d_1}$ and $\mathbf{v}_i \in \mathbb{R}^{d_2}$ are two weight vectors playing the role of $\mathbf{w}_i$ in the vector pattern. This might be the simplest way to transform a matrix into a scalar using vector inner products, similar to (2), since a matrix-vector product is essentially a sum of several vector inner products. We can see that only $d_1 + d_2$ parameters are needed for each hidden node instead of the $d_1 \times d_2$ in (2). From this point of view, using the matrix pattern reduces model complexity with fewer parameters, even if the original sample is not a matrix, as long as the vector can be recombined into a matrix. Take the single layer feedforward neural network (SLFN), for example: Figure 1 shows the differences between the two input patterns, where (a) needs $d_1 \times d_2$ nodes in the input layer while (b) just needs $d_2$ for the same input sample.
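To make the parameter counts in (2) and (3) concrete, the following minimal NumPy sketch evaluates one hidden node under both patterns; the 32 × 32 size, the sigmoid activation, and all variable names are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 32, 32                      # illustrative image height and width
X = rng.standard_normal((d1, d2))    # one matrix-pattern sample

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Vector pattern, eq. (2): one hidden node needs d1 * d2 = 1024 weights.
w = rng.standard_normal(d1 * d2)
b = rng.standard_normal()
h_vec = sigmoid(w @ X.reshape(-1) + b)

# Matrix pattern, eq. (3): one hidden node needs only d1 + d2 = 64 weights.
u = rng.standard_normal(d1)
v = rng.standard_normal(d2)
h_mat = sigmoid(u @ X @ v + b)
```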
In contrast to vector based learning methods, two-dimensional methods have been applied to feature extraction as well as to conventional learning models over the last decade. Yang et al. [5] proposed two-dimensional principal component analysis (2DPCA) for image representation, which turned out to be advantageous over PCA in several respects. Ye et al. [6] proposed two-dimensional linear discriminant analysis (2DLDA), which works with data in matrix representation and can overcome the singularity problem in conventional LDA. Wang et al. [3] provided a fully matrixed approach, applied in both feature extraction and classifier design, including their previous work [4] which proposed MatLSSVM, that is, least squares support vector machines (LS-SVM) based on matrix patterns, together with its fuzzy version. Empirical studies in these works showed that two-dimensional methods helped to improve classification performance and to reduce computational and space complexity compared with the base models.
A more general representation pattern than the matrix is the tensor, which takes $\mathcal{X} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_K}$ as the input. Tao et al. [7] described a supervised tensor learning framework and an alternating projection optimization to obtain the solution. Conventional models like SVM and Fisher discriminant analysis are contained in this framework. Possible solutions for tensor based ELM will be discussed later.
Inspired by the very natural idea of letting ELM process matrices directly, and by related work on matrix patterns [3], we propose the two-dimensional extreme learning machine (2DELM) in this paper. Our main contributions can be summarized as follows: (i) providing a simple method to process matrix patterns in SLFN; (ii) analyzing the random feature mapping from a probabilistic perspective for both ELM and 2DELM; (iii) comparing the proposed algorithm with the original ELM on image datasets based on a statistical approach.
The remainder of this paper is organized as follows. Section 2 reviews the vector based ELM. Section 3 describes 2DELM and related concepts, including a sparse version, kernel tricks, and tensor based ELM. We evaluate our methods on several image datasets in Section 4. Finally, Section 5 concludes this paper.

2. Extreme Learning Machine: A Vector Case
Extreme learning machine [8] was proposed as an efficient learning algorithm for single hidden layer feedforward neural networks (SLFN), which outperforms gradient-based methods for learning the same architecture. The structure is also shown in Figure 1, where $H$ is defined as the hidden layer output matrix of the $N$ training samples, $\beta$ is the output weights vector that connects the hidden layer and the output layer (with $L$ hidden nodes), and $\mathbf{y} \in \mathbb{R}^{N}$ is the target vector that contains real values for regression and class labels for classification:

$H_{ij} = h_j(\mathbf{x}_i), \quad i = 1, \ldots, N, \; j = 1, \ldots, L$,  (4)

where $h_j(\mathbf{x}_i)$ is the output of the $j$th hidden node for the $i$th input vector and typically has the same form as (2). The significant characteristic of ELM lies in the random choice of the weights $\mathbf{w}$ that connect the input layer and the hidden layer, as well as the bias $\mathbf{b}$ of the hidden layer; this is different from traditional algorithms like back-propagation where all parameters need to be tuned. It makes the hidden layer output matrix available at hand, and only the output weights $\beta$ need to be learned. According to a general principle for learning machines, that is, empirical risk minimization (ERM), ELM aims to reach the smallest training error by

$\min_{\beta} \|H\beta - \mathbf{y}\|_2^2$.  (5)

Under the ERM principle, the optimal solution $\beta^{*}$ to (5) can be analytically resolved as

$\beta^{*} = H^{\dagger}\mathbf{y}$,  (6)

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$. The idea that hidden node parameters need not be learned has been extended to many other models beyond neural networks, such as SVM, RBF networks, and so forth [9]. The simplicity of ELM has also been extended to form a unified framework, which mainly takes the three steps as follows.
(1) Randomly choose the parameters of the first layer of the SLFN for feature mapping. (2) Use various activation functions to generate new feature representations. (3) Obtain a fast solution for the required parameters of the last layer of the SLFN.

Kernel tricks such as those in SVM can be used in ELM to obtain more powerful classification ability [10, 11]. The fast solution in step (3) makes online learning and real time prediction possible. The whole procedure is also suitable for many other models in ensemble learning; the weights of multiple predictors can be determined in a way similar to (5). To be more specific, the last layer of the SLFN can be viewed as a linear combination of multiple weak predictors that forms a strong predictor, which is consistent with the design of ensemble learning.
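The three steps condense into a short routine; below is a minimal NumPy sketch assuming a sigmoid activation and the pseudoinverse solution (6), with all function and variable names our own.

```python
import numpy as np

def elm_train(X, y, L, seed=0):
    """Vector based ELM. X: (N, d) inputs, y: (N,) targets, L: hidden nodes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Step (1): randomly choose the first-layer parameters; they are never tuned.
    W = rng.standard_normal((d, L))
    b = rng.standard_normal(L)
    # Step (2): nonlinear activation yields the new feature representation H.
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Step (3): fast analytic solution of the last layer, eq. (6).
    beta = np.linalg.pinv(H) @ y
    return W, b, beta
```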
Because of the above properties, ELM and its variants have been widely used in many areas such as face recognition [12], object recognition [13], large scale data analysis [14], network security [15], and so forth. Almost all these applications deal with vector patterns, even when the objects are images. It is therefore worthwhile to extend ELM to matrix patterns, so that it can be used in a more general form in practice.

3. Two-Dimensional ELM
3.1. Basic Formulas. The goal of 2DELM is to process matrix patterns directly, instead of first vectorizing them by concatenating all columns or rows. At the feature mapping stage, which corresponds to the input layer and the hidden layer in the SLFN architecture, each hidden node encodes all original features of a sample in some way.
Assume the activation function is the sigmoid $\sigma(z) = 1/(1 + \exp(-z))$; vector based ELM takes a linear combination of all features as the input of the activation function $\sigma$, and the linear weights are randomly generated. Following ELM, the first-layer parameters, which actually perform the feature mapping, need not be tuned in the SLFN. In order to get random features in the hidden layer as in ELM, we can randomly choose $\mathbf{u}_j$, $\mathbf{v}_j$, and $b_j$ in (3) at the first layer of the SLFN. As we mentioned, $\mathbf{u}^{T} X \mathbf{v}$ might be the simplest way to transform a matrix $X$ into a scalar using vector inner products. The entries of the hidden layer output matrix $H_{N \times L}$ of the SLFN are formally defined as

$H_{ij} = \sigma(\mathbf{u}_j^{T} X_i \mathbf{v}_j + b_j)$,  (7)

where $H_{ij}$ is the output of the $j$th hidden node for the $i$th input matrix sample. Each hidden node thus receives the information of all entries of the matrix $X_i$ while staying diverse via the different random weights $\mathbf{u}_j$, $\mathbf{v}_j$, and $b_j$. For a complete learning model, we have random parameters $U \in \mathbb{R}^{d_1 \times L}$, $V \in \mathbb{R}^{d_2 \times L}$, and bias $\mathbf{b} \in \mathbb{R}^{L}$; $\mathbf{u}_j$ and $\mathbf{v}_j$ in (7) are the $j$th columns of $U$ and $V$, respectively. Having the hidden layer output matrix at hand, the next step is the same as in ELM: solve the optimal weights by (6). With $L$ hidden nodes, we can see that $(d_1 + d_2) \times L$ input parameters are needed here, while the number is $(d_1 \times d_2) \times L$ after vectorization. Conversely, we could also reformat a vector into a matrix to reduce the parameters, as long as the length of the vector is not a prime.
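A sketch of computing $H$ as in (7), assuming the samples are stacked into an $N \times d_1 \times d_2$ array; the einsum contraction and the names are ours.

```python
import numpy as np

def hidden_matrix_2delm(Xs, U, V, b):
    """Eq. (7): H[i, j] = sigma(u_j^T X_i v_j + b_j).
    Xs: (N, d1, d2) samples; U: (d1, L); V: (d2, L); b: (L,)."""
    # Contract each sample X_i with the node-specific vectors u_j and v_j.
    Z = np.einsum('ipq,pj,qj->ij', Xs, U, V) + b
    return 1.0 / (1.0 + np.exp(-Z))

# Only (d1 + d2 + 1) * L random numbers are drawn here, versus
# (d1 * d2 + 1) * L after vectorization.
```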
In order to get a stable solution, ridge regression [16] can be applied to solve $\beta$ for ELM [9, 17] as well as for 2DELM. The corresponding objective function is defined as

$\min_{\beta} \|H\beta - \mathbf{y}\|_2^2 + \lambda \|\beta\|_2^2$,  (8)

where $\lambda$ is the parameter that balances the loss and the $\ell_2$ regularizer. This problem can be analytically solved by $\beta = (H^{T}H + \lambda I)^{-1} H^{T} \mathbf{y}$. The whole procedure of 2DELM is illustrated in Algorithm 1:

(1) Randomly generate the input weights $U$, $V$, and bias $\mathbf{b}$;
(2) Compute the hidden layer output matrix $H$ as in (7);
(3) Solve the output parameters $\beta$ by (6) or (8).

Algorithm 1: 2DELM.

As we can see, using the same number of hidden nodes, ELM and 2DELM have the same training speed for computing $\beta$, since they share the size of $H$ and $\mathbf{y}$. In theory, however, the computational complexity to build $H$ is different: $O(d_1 \times d_2)$ for ELM and $O(d_1 + d_2)$ for 2DELM. In practice the speed to compute $H$ also depends on how the original data are stored: 2DELM tends to outperform ELM if samples are stored as matrices, and vice versa.
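A one-line sketch of the ridge solution of (8); lam stands for the balance parameter, and the names are ours.

```python
import numpy as np

def solve_ridge(H, y, lam):
    """Eq. (8): argmin ||H beta - y||^2 + lam * ||beta||^2,
    solved in closed form as (H^T H + lam I)^{-1} H^T y."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
```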

3.2. Further Discussion.
As mentioned in Section 1, the tensor is a more general representation pattern than the matrix pattern. Since the essence of a learning model is to transform the input into the desired output, a tensor based ELM can be obtained by an extension similar to 2DELM. The most important step lies in step (2) of Algorithm 1, that is, computing the hidden layer output matrix $H$. For a tensor pattern $\mathcal{X} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_K}$, we can define the entries of $H$ by

$H_{ij} = \sigma\big(\sum_{i_1=1}^{d_1} \cdots \sum_{i_K=1}^{d_K} (\mathcal{X}_i)_{i_1 \cdots i_K} (\mathbf{u}_{j,1})_{i_1} \cdots (\mathbf{u}_{j,K})_{i_K} + b_j\big)$,  (9)

where the weight vectors $\mathbf{u}_{j,k} \in \mathbb{R}^{d_k}$ are randomly chosen. Once the hidden layer output matrix $H$ is ready, the rest of the training is the same as in ELM. A sparse weight vector $\beta$ can also be obtained via an $\ell_1$ norm regularizer in tensor based ELM. Castaño et al. [18] proposed PCA-ELM, a robust and pruned ELM based on PCA, which aims to determine the hidden nodes of ELM with the information retrieved from a principal component analysis of the training data. PCA-ELM reduces the model parameters by taking low-dimensional training data, which is different from our method. Explicit vectorization is needed in these methods; however, such frameworks are not in contradiction with 2DELM, since the latter focuses on the pattern representation. In other words, PCA-related techniques can be combined with the idea of 2DELM in practice.
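A possible sketch of the tensor mapping in (9) for one hidden node, contracting the sample with one random vector per mode; this is our reading of the extension under the stated assumptions, not code from the paper.

```python
import numpy as np

def hidden_entry_tensor(X, us, b):
    """One entry of H as in eq. (9). X: tensor of shape (d_1, ..., d_K);
    us: list of K random vectors, us[k] of length d_k; b: scalar bias."""
    z = X
    for u in us:
        # Contract away the leading mode with the corresponding vector.
        z = np.tensordot(u, z, axes=([0], [0]))
    return 1.0 / (1.0 + np.exp(-(z + b)))   # z is now a scalar
```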

4. Experiments
In this section, we mainly compare 2DELM and ELM on image datasets for multiclass classification. Assume the number of classes is $c$; we transform the label vector $\mathbf{y}$ into a ground truth matrix $T_{N \times c}$ in both the training and the testing stage. The entries of $T$ are defined as

$T_{ij} = 1$ if the $i$th sample belongs to class $j$; otherwise $T_{ij} = 0$.  (10)

The primary solution of $B_{L \times c}$ in all experiments is based on the Moore-Penrose generalized inverse (we replace $\mathbf{y}$ by $T$ in (6)), since it needs only one user-defined parameter: the number of hidden nodes $L$. In practice, it is very time consuming to choose the parameter $\lambda$ over a wide range when using ridge regression; moreover, in many cases, the Moore-Penrose generalized inverse solution tends to be stable as well. At the prediction stage, for each tested sample $x$, we use

$\mathbf{t} = \mathbf{h}(x) B$,  (11)

to get the output vector $\mathbf{t}$, and then the index of the largest entry of $\mathbf{t}$ is taken as the label, where $\mathbf{h}(x)$ is the hidden layer output vector of sample $x$, of length $L$, and each entry has the same form as (7).
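The multiclass pipeline of (10) and (11) in a short sketch; the 0/1 coding of $T$ and all names are assumptions for illustration.

```python
import numpy as np

def train_predict_multiclass(H_train, labels, H_test):
    """labels: (N,) integer class indices in {0, ..., c-1}."""
    N, c = len(labels), int(labels.max()) + 1
    # Eq. (10): ground truth matrix T, one row per sample (0/1 coding assumed).
    T = np.zeros((N, c))
    T[np.arange(N), labels] = 1.0
    # Replace y by T in eq. (6): B = pinv(H) T, sized L-by-c.
    B = np.linalg.pinv(H_train) @ T
    # Eq. (11): the predicted label is the index of the largest output entry.
    return np.argmax(H_test @ B, axis=1)
```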

4.1. Data Description.
In order to show the effectiveness of 2DELM, we collected several popular image datasets whose application background covers face recognition and other image classification tasks. Specifically, there are five face databases and four OCR datasets in the following discussion; their data sizes and dimensions vary over a wide range.
(5) PIE face: the CMU Pose, Illumination, and Expression (PIE) database, provided by [20]. There are 41,368 images of 68 people, collected in 2000. These images were taken with each person under 13 different poses, 43 different illumination conditions, and with 4 different expressions. We use a subset containing 11,554 images in our experiment, each of size 32 × 32. Figure 2 shows samples from the face databases. More details of these datasets are provided in Table 1. Similar to [4], we introduce the ratio $r = (d_1 \times d_2)/(d_1 + d_2)$ to indicate the ratio of input parameters needed for the vector pattern versus the matrix pattern. The last column of the table indicates whether the training data and the testing data are provided separately.

4.2. Comparison Results
We first compare the processing time of ELM and 2DELM when dealing with matrix patterns. Since most of the datasets in Table 1 are provided as vectors, we selected two image databases whose samples are stored as original pictures. We use a Matlab cell structure containing the matrices as elements as the input and count the CPU time for each trial. Note that the implementation via vectorization is accelerated by linear algebra libraries; here, to keep the same conditions as 2DELM, we time the computation of $H$ in ELM without vectorization. The time needed to calculate the hidden layer output matrix $H$ for ELM or 2DELM also depends on the number of hidden nodes $L$. Figure 3 shows the results when $L$ is in the range {100, 200, . . ., 1000}. For each $L$, the mean time and standard deviation over ten trials are shown in the figure. We can see that 2DELM achieves faster speed under the same implementation conditions and tends to be more stable with random parameters.
We set the number of hidden nodes $L = 1000$ in both ELM and 2DELM in the following comparison. Table 2 shows the average training time and testing time. We can see that the training time (the time needed for calculating $\beta$ once $H$ is available) of ELM and 2DELM stays roughly the same, due to the same size of $H$ they share, as does the testing time.
The accuracy comparison results are shown in Table 3. A bold number indicates the better mean testing accuracy, and • indicates that this advantage is significant under pairwise t-tests at the 95% confidence level (∘ otherwise). We can see that 2DELM achieves better testing accuracy than ELM in most cases, with the same number of hidden nodes and far fewer input parameters.

5. Conclusion
In this paper we propose 2DELM, a matrix pattern representation based ELM algorithm, which takes matrices as input instead of the commonly used vectors in the SLFN. The key difference between 2DELM and ELM lies in the feature mapping stage: vectorization is not needed when dealing with matrices, which reduces the number of input weights compared with the vector pattern case. The learning stage is kept the same as in ELM and inherits most characteristics of ELM. The comparative experiments on several image datasets show the effectiveness of the proposed algorithm: in most cases, 2DELM achieves better or comparable testing accuracy to ELM while using fewer input weight parameters.
From ELM to 2DELM, we aim to simplify the learning model by reducing parameters while keeping the prediction accuracy under the basic ELM framework. The method is also consistent with the general principle of Occam's razor [21] in classifier design. Moreover, for dealing with high dimensional data, the matrix or tensor pattern representation may provide another perspective besides traditional dimensionality reduction techniques.


Figure 1: Two cases of the single layer feedforward neural network. (a) The vector pattern needs $d_1 \times d_2$ nodes in the first layer, while (b) the matrix pattern just needs $d_2$; reshape(·) here denotes the vectorization process.


Figure 2: Sample images from the face databases used in the experiments.

Figure 3: The time needed for computing the hidden layer output matrix $H$ on the UMist data and the Georgia Tech face database.

Table 1: Summary of the image datasets for multiclass classification, $r = (d_1 \times d_2)/(d_1 + d_2)$. "Separate" indicates whether the training data and the testing data are provided separately.

Fifty trials were conducted for each problem when comparing ELM and 2DELM on all datasets. For the face datasets, which do not provide separate training and testing sets, we randomly chose 2/3 of the total samples as the training set and the rest as the testing set in each trial. For the other datasets, we conducted the experiments with fifty different random initializations of ELM and 2DELM. The averaged training accuracy, testing accuracy, training time, and testing time over all trials are recorded, and the comparison of testing accuracy is based on pairwise t-tests at the 95% confidence level.

Table 2: Training time and testing time (in seconds) for ELM and 2DELM, averaged over fifty trials.

Table 3: Mean accuracy (± standard deviation) over fifty trials for ELM and 2DELM. Testing accuracy is compared based on pairwise t-tests at the 95% confidence level; • or ∘ indicates whether the advantage is significant or not.