Deep Extreme Learning Machine and Its Application in EEG Classification

Recently, deep learning has aroused wide interest in machine learning. Deep learning is a multilayer perceptron artificial neural network algorithm that can approximate complicated functions and alleviate the optimization difficulty associated with deep models. The multilayer extreme learning machine (MLELM) is a learning algorithm for artificial neural networks that combines the advantages of deep learning and the extreme learning machine: MLELM not only approximates complicated functions but also requires no iteration during training. In this paper, we combine MLELM with the extreme learning machine with kernel (KELM) to put forward the deep extreme learning machine (DELM) and apply it to EEG classification. The paper focuses on applying DELM to the classification of the visual feedback experiments, using MATLAB and the second brain-computer interface (BCI) competition datasets. Simulation and analysis of the experimental results confirm the effectiveness of DELM for EEG classification.


Introduction
Brain-computer interface (BCI) is a technology that enables people to communicate with a computer or to control devices using EEG signals [1]. The core technologies of BCI are extracting features from preprocessed EEG and classifying the processed EEG; this paper is mainly about classification analysis. In recent years, BCI has advanced greatly with the rapid development of computer technology. BCI has been applied to many fields, such as medicine and the military [2][3][4]. Currently, many different methods have been proposed for EEG classification, including decision trees, the local backpropagation (BP) algorithm, the Bayes classifier, k-nearest neighbors (KNN), the support vector machine (SVM), the batch incremental support vector machine (BISVM), and ELM [5][6][7][8]. However, most of them are shallow neural network algorithms whose capability to approximate complex functions is subject to certain restrictions, and there is no such restriction in deep learning.
Deep learning is an artificial neural network learning algorithm with multilayer perceptrons. Deep learning achieves an approximation of complex functions and alleviates the optimization difficulty associated with deep models [9][10][11]. In 2006, the concept of deep learning was first proposed by Hinton and Salakhutdinov, who presented a deep structure of multilayer autoencoders [12]. The deep belief network was proposed by Hinton [13]. LeCun et al. put forward the first practical deep learning algorithm, convolutional neural networks (CNNs) [14]. More and more new algorithms based on deep learning were then put forward, such as the convolutional deep belief network [15]. In 2013, the model of the multilayer extreme learning machine (MLELM) was proposed by Kasun et al. [16]; MLELM takes advantage of both deep learning and the extreme learning machine. The extreme learning machine (ELM), proposed by Huang et al., is a simple and efficient learning algorithm for single layer feed-forward neural networks (SLFNs) [17,18]. In addition, several deformation algorithms based on ELM have been put forward, such as the regularized extreme learning machine (RELM) [19], the extreme learning machine with kernel (KELM) [20], the optimally pruned extreme learning machine (OP-ELM) [21], and the evolving fuzzy optimally pruned extreme learning machine (eF-OP-ELM) [22]. We combine MLELM and KELM to put forward the deep extreme learning machine (DELM) and apply it to EEG classification. The paper is organized as follows: Section 2 gives the models of ELM, RELM, and KELM. Section 3 describes the model structure of MLELM. Section 4 details the model structure of DELM. Section 5 first evaluates the usefulness of DELM on UCI datasets and then applies DELM to EEG classification. Section 6 draws the conclusion.

Basic Extreme Learning Machine (Basic ELM)

The model of ELM consists of an input layer, a single hidden layer, and an output layer. The model structure of ELM is shown in Figure 1, with n input layer nodes, l hidden layer nodes, m output layer nodes, and hidden layer activation function g(x).
For N distinct samples (x_j, t_j), with x_j ∈ R^n and t_j ∈ R^m, the SLFN is modeled as

sum_{i=1}^{l} v_i g(w_i · x_j + b_i) = o_j, j = 1, 2, ..., N.

The above equation can be written compactly as

HV = T,

where w_i = [w_{i1}, w_{i2}, ..., w_{in}]^T are the weights connecting the input nodes and the ith hidden node, b_i is the bias of the ith hidden node, v_i = [v_{i1}, v_{i2}, ..., v_{im}]^T are the weights connecting the ith hidden node and the output layer, and H = [g(w_i · x_j + b_i)]_{N×l} is the output matrix of the neural network. We need to set only the input weights w_i and the hidden-layer biases b_i; the output weights V can then be obtained by a series of linear-equation transformations.
In conclusion, using ELM to obtain the output weights V can be divided into three steps.
Step 1. Randomly select values between 0 and 1 to set the input weights w_i and the hidden-layer biases b_i.
Step 2. Calculate the hidden-layer output matrix H.
Step 3. Calculate the output weights V:

V = H†T,

where H† represents the Moore-Penrose generalized inverse of the output matrix H.
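As an illustration, the three steps can be sketched in a few lines of NumPy. This is a minimal sketch with a sigmoid hidden layer; the function names (`elm_fit`, `elm_predict`) and the toy two-class dataset are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    # Step 1: randomly select input weights W and biases b between 0 and 1.
    W = rng.uniform(0, 1, size=(X.shape[1], n_hidden))
    b = rng.uniform(0, 1, size=n_hidden)
    # Step 2: calculate the hidden-layer output matrix H.
    H = sigmoid(X @ W + b)
    # Step 3: output weights V = H† T via the Moore-Penrose pseudoinverse.
    V = np.linalg.pinv(H) @ T
    return W, b, V

def elm_predict(X, W, b, V):
    # The output layer is linear: f(x) = h(x) V.
    return sigmoid(X @ W + b) @ V

# Toy two-class problem: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
T = np.stack([X[:, 0] > 0, X[:, 0] <= 0], axis=1).astype(float)
W, b, V = elm_fit(X, T, n_hidden=50, rng=rng)
acc = (elm_predict(X, W, b, V).argmax(axis=1) == T.argmax(axis=1)).mean()
```

Because the output weights are obtained in closed form, no iterative training is involved, which is the property the paper emphasizes.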
Regularized Extreme Learning Machine (RELM).

RELM augments ELM with a regularization term, minimizing

L_RELM = (1/2)‖V‖^2 + (λ/2)‖T − HV‖^2,

where λ is a scale parameter which adjusts the empirical risk and the structural risk. By setting the gradient of L_RELM with respect to V to zero, we obtain closed-form solutions. When the number of training samples is more than the number of hidden-layer nodes, the output weight matrix V in RELM can be expressed as

V = (H^T H + I/λ)^{-1} H^T T;

when the number of training samples is less than the number of hidden-layer nodes, the output weight matrix V in RELM can be expressed as

V = H^T (HH^T + I/λ)^{-1} T.
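The two closed forms are algebraically equivalent (a push-through identity); choosing between them only changes whether an l × l or an N × N matrix is inverted. A hedged NumPy sketch, with the function name `relm_output_weights` being illustrative:

```python
import numpy as np

def relm_output_weights(H, T, lam):
    # Pick the formulation that inverts the smaller Gram matrix.
    N, L = H.shape
    if N >= L:
        # V = (H^T H + I/lam)^{-1} H^T T
        return np.linalg.solve(H.T @ H + np.eye(L) / lam, H.T @ T)
    # V = H^T (H H^T + I/lam)^{-1} T
    return H.T @ np.linalg.solve(H @ H.T + np.eye(N) / lam, T)
```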

Extreme Learning Machine with Kernel (KELM).
Huang et al. combined the kernel method with the extreme learning machine to put forward the extreme learning machine with kernel (KELM). The outputs of the hidden layer of ELM can be regarded as a nonlinear mapping of the samples. When this mapping is unknown, we can construct a kernel function in place of h(x_i)h(x_j)^T:

Ω_{i,j} = h(x_i) · h(x_j) = K(x_i, x_j).

The most popular kernel used in KELM is the Gaussian kernel K(x_i, x_j) = exp(−‖x_i − x_j‖^2/γ), where γ is the kernel parameter.
Thus, the output weight matrix V in KELM can be expressed as (12), and the classification formula of KELM can be expressed as (13):

V = (I/C + Ω)^{-1} T,   (12)

f(x) = [K(x, x_1), ..., K(x, x_N)](I/C + Ω)^{-1} T,   (13)

where C is the regularization parameter.
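Formulas (12) and (13) can be sketched directly with a Gaussian kernel. The helper names below are illustrative, and the toy dataset is only meant to show the closed-form fit:

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # K(x, y) = exp(-||x - y||^2 / gamma)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / gamma)

def kelm_fit(X, T, C, gamma):
    # Solve (I/C + Omega) alpha = T, with Omega_ij = K(x_i, x_j), cf. (12).
    Omega = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + Omega, T)

def kelm_predict(X_new, X_train, alpha, gamma):
    # f(x) = [K(x, x_1), ..., K(x, x_N)] alpha, cf. (13).
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
T = np.stack([X[:, 0] + X[:, 1] > 0, X[:, 0] + X[:, 1] <= 0], axis=1).astype(float)
alpha = kelm_fit(X, T, C=100.0, gamma=2.0)
train_acc = (kelm_predict(X, X, alpha, 2.0).argmax(axis=1) == T.argmax(axis=1)).mean()
```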

Extreme Learning Machine-Autoencoder (ELM-AE).
The autoencoder is an artificial neural network model commonly used in deep learning. It is an unsupervised neural network whose outputs are the same as its inputs; that is, it reproduces the input signal as closely as possible. ELM-AE, proposed by Kasun et al., is a new neural network method which can reproduce the input signal just as an autoencoder does.
The model of ELM-AE consists of an input layer, a single hidden layer, and an output layer. The model structure of ELM-AE is shown in Figure 2, with n input layer nodes, l hidden layer nodes, n output layer nodes, and hidden layer activation function g(x). According to how the output of the hidden layer represents the input signal, ELM-AE can be divided into three different representations, as follows.
n > l: Compressed Representation: features are mapped from a higher-dimensional input signal space to a lower-dimensional feature space.

n = l: Equal Dimension Representation: features are mapped from an input signal space whose dimension equals that of the feature space.

n < l: Sparse Representation: features are mapped from a lower-dimensional input signal space to a higher-dimensional feature space.
There are two differences between ELM-AE and the traditional ELM. First, ELM is a supervised neural network and its output is a label, whereas ELM-AE is an unsupervised neural network and its output is the same as its input. Second, the input weights of ELM-AE are orthogonal and the hidden-layer biases of ELM-AE are also orthogonal, which is not the case for ELM. For N distinct samples x_i ∈ R^n (i = 1, 2, ..., N), the outputs of the ELM-AE hidden layer can be expressed as (14), and the numerical relationship between the outputs of the hidden layer and the outputs of the output layer can be expressed as (15):

h(x_i) = g(w · x_i + b), with w^T w = I and b^T b = 1,   (14)

h(x_i)V = x_i^T, i = 1, 2, ..., N.   (15)
Using ELM-AE to obtain the output weights V can also be divided into three steps, but the calculation method of the output weights V of ELM-AE in Step 3 is different from that of ELM.
For the sparse and compressed ELM-AE representations, the output weights V are calculated by (16) and (17). When the number of training samples is more than the number of hidden-layer nodes,

V = (I/λ + H^T H)^{-1} H^T X;   (16)

when the number of training samples is less than the number of hidden-layer nodes,

V = H^T (I/λ + HH^T)^{-1} X.   (17)

For the equal-dimension ELM-AE representation, the output weights V are calculated by V = H^{-1} X.

Multilayer Extreme Learning Machine (MLELM)

MLELM makes use of ELM-AE to train the parameters in each layer, and the MLELM hidden-layer activation functions can be either linear or nonlinear piecewise. If the activation function of the ith MLELM hidden layer is g(x), then the parameters between the ith MLELM hidden layer and the (i − 1)th hidden layer (if i − 1 = 0, this layer is the input layer) are trained by ELM-AE, and the activation function of that ELM-AE should be g(x), too. The numerical relationship between the outputs of the ith MLELM hidden layer and the outputs of the (i − 1)th hidden layer can be expressed as

H_i = g((V_i)^T H_{i−1}),

where H_i represents the outputs of the ith MLELM hidden layer (if i − 1 = 0, this layer is the input layer, and H_{i−1} represents the inputs of MLELM). The model of MLELM is shown in Figure 3; V_i represents the output weights of the ELM-AE whose input is H_{i−1}, and the number of ELM-AE hidden-layer nodes is identical to the number of nodes in the ith MLELM hidden layer when the parameters between the ith MLELM hidden layer and the (i − 1)th hidden layer are trained by ELM-AE.
The output weights connecting the last hidden layer and the output layer can be calculated analytically using regularized least squares.
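The ELM-AE training rule and the stacking relation H_i = g((V_i)^T H_{i−1}) can be sketched as follows. The sketch covers only the compressed case (hidden dimension not larger than the input dimension, so the random input weights can be made orthonormal), and all names are illustrative; samples are stored as rows here, so the stacking step appears as `H @ V.T`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_ae_weights(X, n_hidden, lam, rng):
    # Orthonormal random input weights and a unit-norm random bias,
    # as ELM-AE requires (compressed case: n_hidden <= X.shape[1]).
    Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1], n_hidden)))
    b = rng.normal(size=n_hidden)
    b /= np.linalg.norm(b)
    H = sigmoid(X @ Q + b)
    # Output weights: V = (I/lam + H^T H)^{-1} H^T X, cf. (16).
    return np.linalg.solve(np.eye(n_hidden) / lam + H.T @ H, H.T @ X)

def mlelm_features(X, layer_sizes, lam, rng):
    # Stack ELM-AE layers: H_i = g(H_{i-1} V_i^T) in row-sample form.
    H = X
    for size in layer_sizes:
        V = elm_ae_weights(H, size, lam, rng)
        H = sigmoid(H @ V.T)
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
H_last = mlelm_features(X, [10, 5], lam=1e3, rng=rng)
```

The classifier weights for the final layer would then be solved on `H_last` with regularized least squares, as described above.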

Deep Extreme Learning Machine (DELM)
MLELM makes use of ELM-AE to train the parameters in each layer, and the MLELM hidden-layer activation functions can be either linear or nonlinear piecewise, so the mapping of MLELM is linear or nonlinear. When the mapping is unknown, we can add one hidden layer and construct a kernel function. In other words, the outputs H_k of the last MLELM hidden layer (a matrix of size N × l_k) become the inputs of KELM, and we can construct a kernel function instead of h_{k+1}(x_i)h_{k+1}(x_j)^T. The model of DELM is shown in Figure 4; V_i (i ∈ [1, ..., k]) represents the output weights of the ELM-AE whose input is H_{i−1}, and the number of ELM-AE hidden-layer nodes is identical to the number of nodes in the ith DELM hidden layer when the parameters between the ith DELM hidden layer and the (i − 1)th hidden layer are trained by ELM-AE. Constructing the kernel function Ω_{i,j} = K(h_k(x_i), h_k(x_j)) in place of h_{k+1}(x_i)h_{k+1}(x_j)^T, the output weight matrix V in DELM can be expressed as (21), and the classification formula of DELM can be expressed as (22):

V = (I/C + Ω)^{-1} T,   (21)

f(x) = [K(h_k(x), h_k(x_1)), ..., K(h_k(x), h_k(x_N))](I/C + Ω)^{-1} T.   (22)
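Putting the pieces together, a minimal end-to-end DELM sketch stacks ELM-AE layers and then solves the kernel output layer (21) on the last hidden representation. This assumes the compressed ELM-AE case, and all names and the toy dataset are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_ae_layer(H, size, lam, rng):
    # One ELM-AE layer (compressed case), then the stacking map g(H V^T).
    Q, _ = np.linalg.qr(rng.normal(size=(H.shape[1], size)))
    b = rng.normal(size=size)
    b /= np.linalg.norm(b)
    G = sigmoid(H @ Q + b)
    V = np.linalg.solve(np.eye(size) / lam + G.T @ G, G.T @ H)
    return sigmoid(H @ V.T)

def gaussian_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / gamma)

def delm_fit(X, T, sizes, lam, C, gamma, rng):
    H = X
    for s in sizes:                       # ELM-AE-trained hidden layers
        H = elm_ae_layer(H, s, lam, rng)
    Omega = gaussian_kernel(H, H, gamma)  # kernel layer on H_k, cf. (21)
    alpha = np.linalg.solve(np.eye(len(H)) / C + Omega, T)
    return H, alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))
T = np.stack([X[:, 0] > 0, X[:, 0] <= 0], axis=1).astype(float)
H_k, alpha = delm_fit(X, T, sizes=[12, 8], lam=1e3, C=100.0, gamma=2.0, rng=rng)
scores = gaussian_kernel(H_k, H_k, gamma=2.0) @ alpha  # training outputs, cf. (22)
```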

Experiments and Analysis
The execution environment of the experiments is MATLAB 2012b. All activation functions of ELM, MLELM, and DELM are the sigmoid function, and the kernel functions of KELM and DELM are the Gaussian kernel. ELM, MLELM, and DELM were executed 100 times, and the average values and the best values are reported.

UCI Datasets Classification.
In this part, the UCI datasets were used to test the performance of DELM; the details of the UCI datasets, the ionosphere dataset and the diabetes dataset, are presented in Table 1. As shown in Figure 5, we choose 50 and 40 as the numbers of ELM hidden-layer nodes on the ionosphere dataset and the diabetes dataset; the regularized parameter C and the kernel parameter γ of KELM on the ionosphere dataset are 10^3 and 10^2, and on the diabetes dataset they are 10^2 and 10^1. The structure of MLELM on the ionosphere dataset is 34-30-30-50-2, where the parameter for layer 34-30 is 10^3, the parameter for layer 30-50 is 10^-2, and the parameter for layer 50-2 is 10^8. The structure of MLELM on the diabetes dataset is 8-10-10-40-2, where the parameter for layer 8-10 is 10^6, the parameter for layer 10-40 is 10^8, and the parameter for layer 40-2 is 10^5. The structure of DELM on the ionosphere dataset is 34-30-30-L-2, where the parameter for layer 34-30 is 10^1, the parameter for layer L-2 is 10^3, and the kernel parameter γ is 10^2. The structure of DELM on the diabetes dataset is 8-10-10-L-2, where the parameter for layer 8-10 is 10^1, the parameter for layer L-2 is 10^2, and the kernel parameter γ is 10^1.
The performance comparison of DELM with ELM, KELM, and MLELM on the UCI datasets is shown in Table 2. It is clearly observed that the testing accuracy of DELM is higher than that of MLELM, in both the average and the maximum, and the best testing accuracy of DELM is higher than those of ELM and KELM. DELM has the longest training time, but there is little difference among the testing times. Sigillito et al. investigated the ionosphere dataset using backpropagation and the perceptron training algorithm; they found that a "linear" perceptron achieved 90.7%, a "nonlinear" perceptron achieved 92%, and backpropagation achieved an average of over 96% accuracy [23]. Although the average accuracy of DELM on the ionosphere dataset only reaches 94.74%, its best value reaches 99.34%.

EEG Classification.
The effectiveness of DELM has been confirmed above, so this part tests the effectiveness of applying DELM to EEG classification.

Visual Feedback Experiment (Healthy Subject).

The performance of DELM on the second BCI competition dataset IA is tested in this section; this dataset comes from the visual feedback experiment (healthy subject) provided by the University of Tuebingen [24].
The datasets were taken from a healthy subject. The subject was asked to move a cursor up and down on a computer screen while his cortical potentials were recorded. Cortical positivity leads to a downward movement of the cursor on the screen; cortical negativity leads to an upward movement. Each trial lasted 6 s, with visual feedback presented from second 2 to second 5.5. Only this 3.5-second interval of every trial is provided for training and testing. The sampling rate of 256 Hz and the recording length of 3.5 s result in 896 samples per channel for every trial; the details are presented in Table 3.
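As a quick sanity check of these numbers, the 5376-dimensional input used for this dataset follows from the recording setup; the channel count of 6 is our inference from 5376/896, not stated in the text above.

```python
# 256 Hz over the 3.5 s feedback interval gives 896 samples per channel;
# with 6 channels (inferred from the 5376-input network structure),
# each trial unrolls into 6 * 896 = 5376 input features.
samples_per_channel = int(256 * 3.5)
n_channels = 6
n_features = n_channels * samples_per_channel
```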
As shown in Figure 6, we choose 3000 as the number of ELM hidden-layer nodes on BCI competition II dataset IA; the regularized parameter C and the kernel parameter γ of KELM are 10^3 and 10^4. The structure of MLELM is 5376-500-500-3000-2, where the parameter for layer 5376-500 is 2^1, the parameter for layer 500-3000 is 2^8, and the parameter for layer 3000-2 is 2^-7. The structure of DELM is 5376-500-500-L-2, where the parameter for layer 5376-500 is 10^-1, the parameter for layer L-2 is 10^-1, and the kernel parameter γ is 10^2.
The performance comparison of DELM with ELM, KELM, and MLELM on the BCI competition II dataset IA is shown in Table 4. It is clearly observed that the testing accuracy of DELM is higher than that of MLELM, in both the average and the maximum, and the best testing accuracy of DELM is higher than those of ELM and KELM. MLELM has the longest training time, and the testing times of MLELM and DELM are shorter than that of ELM. The comparison of DELM with the results of BCI competition II on dataset IA is shown in Table 5. The average error of DELM on BCI competition II dataset IA is 13.50%, but the minimum error is as low as 8.19%, which is much lower than the results of BCI competition II.

Visual Feedback Experiment (ALS Patient).
The performance of DELM on the second BCI competition dataset IB is tested in this section; this dataset comes from the visual feedback experiment (ALS patient) provided by the University of Tuebingen.
The datasets were taken from an artificially respirated ALS patient. The subject was asked to move a cursor up and down on a computer screen while his cortical potentials were recorded. Cortical positivity leads to a downward movement of the cursor on the screen; cortical negativity leads to an upward movement. Each trial lasted 8 s, with visual feedback presented from second 2 to second 6.5. Only this 4.5-second interval of every trial is provided for training and testing. The sampling rate of 256 Hz and the recording length of 4.5 s result in 1152 samples per channel for every trial; the details are presented in Table 6.
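Again as a sanity check, the 8064-dimensional input used for dataset IB follows from the recording setup; the channel count of 7 is our inference from 8064/1152, not stated in the text above.

```python
# 256 Hz over the 4.5 s feedback interval gives 1152 samples per channel;
# with 7 channels (inferred from the 8064-input network structure),
# each trial unrolls into 7 * 1152 = 8064 input features.
samples_per_channel = int(256 * 4.5)
n_channels = 7
n_features = n_channels * samples_per_channel
```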
As shown in Figure 7, we choose 2000 as the number of ELM hidden-layer nodes on BCI competition II dataset IB; the regularized parameter C and the kernel parameter γ of KELM are 10^-1 and 10^3. The structure of MLELM is 8064-500-500-2000-2, where the parameter for layer 8064-500 is 10^1, the parameter for layer 500-2000 is 10^8, and the parameter for layer 2000-2 is 10^4. The structure of DELM is 8064-500-500-L-2, where the parameter for layer 8064-500 is 10^-2, the parameter for layer L-2 is 10^-8, and the kernel parameter γ is 10^1.
The performance comparison of DELM with ELM, KELM, and MLELM on the BCI competition II dataset IB is shown in Table 7. It is clearly observed that the best testing accuracy of DELM is not lower than those of MLELM, ELM, and KELM.

Figure 2 :
Figure 2: The model structure of ELM-AE.

Figure 5 :
Figure 5: Basic ELM and KELM for UCI dataset.

Figure 6 :
Figure 6: Basic ELM and KELM for the BCI competition II dataset IA.

Figure 7:
Figure 7: Basic ELM and KELM for the BCI competition II dataset IB.
In 2006, Hinton et al. put forward an effective method of establishing a multilayer neural network on unsupervised data: first the parameters in each layer are obtained by unsupervised training, and then the network is fine-tuned by supervised learning. In 2013, MLELM was proposed by Kasun et al. Like other deep learning models, MLELM makes use of unsupervised learning to train the parameters in each layer, but the difference is that MLELM does not need to fine-tune the network. Thus, compared with other deep learning algorithms, MLELM does not need to spend a long time on network training.

Table 1 :
The details of UCI datasets.

Table 2 :
Performance comparison of DELM with ELM, KELM, and MLELM on UCI datasets.

Table 3 :
The details of the second BCI competition dataset IA.

Table 4 :
Performance comparison of DELM with ELM, KELM, and MLELM on the BCI competition II dataset IA.

Table 5 :
Performance comparison of DELM with the results of BCI competition II on dataset IA.