Sparse Representation-Based Extreme Learning Machine for Motor Imagery EEG Classification

Classification of motor imagery (MI) electroencephalogram (EEG) plays a vital role in brain-computer interface (BCI) systems. Recent research has shown that nonlinear classification algorithms perform better than their linear counterparts, but most of them cannot extract sufficient significant information, which leads to less efficient classification. In this paper, we propose a novel approach called FDDL-ELM, which combines the discriminative power of the extreme learning machine (ELM) with the reconstruction capability of sparse representation. Firstly, the common spatial pattern (CSP) algorithm is adopted to perform spatial filtering on the raw EEG data to enhance task-related neural activity. Secondly, the Fisher discrimination criterion is employed to learn a structured dictionary and obtain sparse coding coefficients from the filtered data, and these discriminative coefficients are then used to acquire reconstructed feature representations. Finally, a nonlinear ELM classifier is used to identify these features across different MI tasks. The proposed method is evaluated on the 2-class Datasets IVa and IIIa of BCI Competition III and the 4-class Dataset IIa of BCI Competition IV. Experimental results show that our method achieved superior performance over the other existing algorithms, yielding accuracies of 80.68%, 87.54%, and 63.76% across all subjects in the three datasets, respectively.

Sparse representation (SR), originally proposed by Olshausen et al. [6], attempts to simulate the working mechanism of the primary visual cortex in the human visual system. The basic idea is to represent the data as a linear combination of atoms in a dictionary, with the requirement that the coefficients are sparse, i.e., that they contain only a small number of nonzero elements. In the last two decades, SR has been widely studied for the reconstruction, representation, and compression of high-dimensional noisy data in fields such as computer vision, pattern recognition, and bioinformatics [7][8][9]. Recently, SR techniques have also yielded promising results in BCI systems [10][11][12][13][14][15]. Although SR is a powerful tool for reconstructing the originals from noisy and imperfect data, using the original training samples as the dictionary may not fully exploit the discriminative information hidden in those samples. To address this problem, Yang et al. [16] proposed the Fisher discrimination dictionary learning (FDDL) framework, which learns a structured dictionary with good reconstruction capability for the training samples, yielding a 3.2% improvement over the sparse representation-based classification (SRC) algorithm on the AR dataset in face recognition.
Recently, Huang et al. developed a new efficient learning algorithm called the extreme learning machine (ELM) [17,18] for training single-hidden-layer feedforward neural networks (SLFNs), featuring faster learning speed and better generalization capability in comparison with the well-known back-propagation (BP) neural networks and support vector machines (SVMs). ELM has been applied to pattern recognition tasks in BCI systems and has shown superior performance over traditional classification approaches [19][20][21][22]. In light of this advancement, efforts have been made to develop algorithms that integrate ELM and SR, thus exploiting the speed advantage and discriminative power of ELM together with the antinoise performance and reconstruction ability of SR. A recent approach called extreme sparse learning (ESL) was proposed in [23], which simultaneously learns a sparse representation of the input signal and trains the ELM classifier. In the study by Yu et al. [24], the sparse coding technique is adopted to map the inputs to the hidden layer, instead of the random mapping used in classic ELM. Other ELM-SR hybrid models have also been studied extensively, in which the ELM classifier is first employed to estimate noisy signals and a further identification of the estimated signals is then carried out using the SRC algorithm [25][26][27].
Most of the existing ELM methods employ a single-hidden-layer network structure. While benefitting from a relatively fast training speed, such a network always trains on the original training sample set, which can limit its robustness. Furthermore, due to its shallow architecture, feature learning using SLFNs may not be effective for natural signals (e.g., EEG). To incorporate a deeper network structure, a hierarchical extreme learning machine (H-ELM) method has recently been developed, which allows for a layer-wise architecture design and has been shown to yield strong classification performance [28]. In addition, multilayer structures have been incorporated into ELM in [29,30] as well. Inspired by these works, we propose a new layer-wise framework called FDDL-ELM, which combines the idea of SR with ELM to learn a powerful nonlinear classifier. The proposed method first employs the Fisher discrimination criterion to learn a structured dictionary. With the learned dictionary, more discriminative sparse coding coefficients can be obtained, and more robust feature information can be extracted. Subsequently, the ELM classifier is utilized to discriminate the extracted features. The classification accuracy of the proposed method is demonstrated on several benchmark datasets, as well as on 2-class and 4-class real-world EEG data from BCI Competition III Datasets IVa and IIIa and BCI Competition IV Dataset IIa. The rest of the paper is organized as follows: Section 2 presents a brief introduction to basic ELM and FDDL and provides a detailed description of the proposed FDDL-ELM algorithm. Section 3 evaluates the performance of the FDDL-ELM method through a series of experiments on several benchmark datasets, as well as on motor imagery EEG datasets. Finally, we conclude the paper and outline future work in Section 4.

Methodology
2.1. Classic ELM. ELM was originally implemented for single-hidden-layer feedforward neural networks and was then extended to generalized feedforward networks. By using random hidden node parameters and a tuning-free strategy, ELM offers notable advantages such as easy implementation, fast learning speed, and superior generalization performance [17], making it a suitable choice for recognizing EEG signals in different motor imagery tasks.
Consider a dataset containing $N$ training samples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]^T \in \mathbb{R}^p$ is an input vector and $\mathbf{y}_i = [y_{i1}, y_{i2}, \ldots, y_{iq}]^T \in \mathbb{R}^q$ is its corresponding desired output, with $T$ denoting the transpose operation. Assuming that $m$ is the number of hidden neurons and $g(\cdot)$ is the activation function, the output function of ELM is mathematically modeled as

$$\sum_{i=1}^{m} \boldsymbol{\beta}_i\, g(\mathbf{a}_i \cdot \mathbf{x}_j + b_i) = \mathbf{y}_j, \quad j = 1, 2, \ldots, N, \tag{1}$$

where $\boldsymbol{\beta}_i = [\beta_{i1}, \beta_{i2}, \cdots, \beta_{iq}]^T$ is the weight vector that connects the $i$-th hidden neuron and the output neurons, $\mathbf{a}_i = [a_{i1}, a_{i2}, \cdots, a_{ip}]^T$ is the randomly chosen input weight vector connecting the $i$-th hidden neuron and the input neurons, $b_i$ is the randomly chosen bias of the $i$-th hidden node, and $\mathbf{y}_j$ is the actual output corresponding to input $\mathbf{x}_j$. For convenience, Equation (1) is written in matrix notation as

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{Y}, \tag{2}$$

where $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_N]^T_{N \times q}$ is the expected network output, $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \cdots, \boldsymbol{\beta}_m]^T_{m \times q}$ denotes the output-layer weights, and $\mathbf{H}$ is the hidden-layer output matrix, defined as

$$\mathbf{H} = \begin{bmatrix} g(\mathbf{a}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{a}_m \cdot \mathbf{x}_1 + b_m) \\ \vdots & \ddots & \vdots \\ g(\mathbf{a}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{a}_m \cdot \mathbf{x}_N + b_m) \end{bmatrix}_{N \times m}. \tag{3}$$

To obtain better generalization performance, the regularization parameter $C$ is introduced in [19], and the corresponding objective function is given by

$$\arg\min_{\boldsymbol{\beta}}\ \frac{1}{2}\|\boldsymbol{\beta}\|_2^2 + \frac{C}{2}\|\mathbf{Y} - \mathbf{H}\boldsymbol{\beta}\|_2^2, \tag{4}$$

where $\|\cdot\|_2$ denotes the $l_2$-norm of a matrix or a vector. We can obtain the output weight matrix $\boldsymbol{\beta}$ using the Moore-Penrose generalized inverse. The solution of Equation (4) is

$$\boldsymbol{\beta} = \mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{Y}.$$

2.2. Fisher Discrimination Dictionary Learning. Sparse representation-based classification (SRC) was proposed for face recognition; it directly uses the training samples of all classes as the dictionary to code a query face image and classifies the image by evaluating which class yields the minimal reconstruction error [7]. However, such a dictionary may not be effective enough to represent the query images, owing to the uncertain and noisy information in the original training images, and the discriminative information hidden in the training samples is not sufficiently exploited by such a naïve supervised dictionary learning approach [16]. To address these problems, the FDDL method was proposed, which utilizes both the discriminative information in the reconstruction error and that in the sparse coding coefficients.
Denote by $\mathbf{A} = [\mathbf{A}_1, \mathbf{A}_2, \cdots, \mathbf{A}_c]$ the training set, where $\mathbf{A}_i$ is the subset of training samples from class $i$ and $c$ is the total number of classes, and by $\mathbf{D} = [\mathbf{D}_1, \mathbf{D}_2, \cdots, \mathbf{D}_c]$ an overcomplete dictionary, where $\mathbf{D}_i$ is the class-specific subdictionary associated with class $i$. Let $\mathbf{X}$ be the coding coefficient matrix of $\mathbf{A}$ over $\mathbf{D}$; we can write $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_c]$, where $\mathbf{X}_i$ is the submatrix containing the coding coefficients of $\mathbf{A}_i$ over $\mathbf{D}$. The objective function is written as follows:

$$J_{(\mathbf{D},\mathbf{X})} = \arg\min_{(\mathbf{D},\mathbf{X})} \left\{ r(\mathbf{A}, \mathbf{D}, \mathbf{X}) + \lambda_1 \|\mathbf{X}\|_1 + \lambda_2 f(\mathbf{X}) \right\}, \tag{5}$$

where $r(\mathbf{A}, \mathbf{D}, \mathbf{X})$ is the discriminative fidelity term, $\|\mathbf{X}\|_1$ is the sparsity constraint in which $\|\cdot\|_1$ denotes the $l_1$-norm, $f(\mathbf{X})$ is a discrimination constraint on the coefficients, and $\lambda_1$ and $\lambda_2$ are scalar parameters.
The discriminative fidelity term is defined as

$$r(\mathbf{A}, \mathbf{D}, \mathbf{X}) = \sum_{i=1}^{c} \left( \|\mathbf{A}_i - \mathbf{D}\mathbf{X}_i\|_F^2 + \|\mathbf{A}_i - \mathbf{D}_i\mathbf{X}_i^i\|_F^2 + \sum_{j \ne i} \|\mathbf{D}_j\mathbf{X}_i^j\|_F^2 \right), \tag{6}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\mathbf{X}_i^i$ is the coding coefficient of $\mathbf{A}_i$ over the subdictionary $\mathbf{D}_i$, and $\mathbf{X}_i^j$ is the coding coefficient of $\mathbf{A}_i$ over the subdictionary $\mathbf{D}_j$. Minimizing $r(\mathbf{A}, \mathbf{D}, \mathbf{X})$ requires that each class of samples be well reconstructed both by the whole dictionary and by its own subdictionary $\mathbf{D}_i$, while the contribution of the other subdictionaries $\mathbf{D}_j$ ($j \ne i$) is kept small. This constraint makes the sparse coefficients more discriminative.
$f(\mathbf{X})$ is a discriminative coefficient term given by

$$f(\mathbf{X}) = \mathrm{tr}\left(\mathbf{S}_W(\mathbf{X})\right) - \mathrm{tr}\left(\mathbf{S}_B(\mathbf{X})\right) + \eta \|\mathbf{X}\|_F^2, \tag{7}$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $\mathbf{S}_W(\mathbf{X})$ is the within-class scatter of $\mathbf{X}$, $\mathbf{S}_B(\mathbf{X})$ is the between-class scatter of $\mathbf{X}$, and $\eta$ is a parameter.
The scatter matrices are defined as

$$\mathbf{S}_W(\mathbf{X}) = \sum_{i=1}^{c} \sum_{\mathbf{x}_k \in \mathbf{X}_i} (\mathbf{x}_k - \mathbf{m}_i)(\mathbf{x}_k - \mathbf{m}_i)^T, \qquad \mathbf{S}_B(\mathbf{X}) = \sum_{i=1}^{c} n_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T,$$

where $\mathbf{m}_i$ and $\mathbf{m}$ are the mean vectors of $\mathbf{X}_i$ and $\mathbf{X}$, respectively, and $n_i$ is the number of samples in class $\mathbf{A}_i$.
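The two scatter traces used in $f(\mathbf{X})$ can be computed directly from the class-partitioned coefficients. The following is a minimal sketch (the helper name `scatter_terms` is ours); coefficient vectors are stored as columns.

```python
import numpy as np

def scatter_terms(X_list):
    """Within- and between-class scatter of the coefficients (sketch).

    X_list[i]: (k, n_i) coding coefficients of class i, columns are samples.
    Returns tr(S_W) and tr(S_B).
    """
    X = np.hstack(X_list)
    m = X.mean(axis=1, keepdims=True)                # global mean vector m
    k = X.shape[0]
    S_W = np.zeros((k, k))
    S_B = np.zeros((k, k))
    for Xi in X_list:
        mi = Xi.mean(axis=1, keepdims=True)          # class mean m_i
        d = Xi - mi
        S_W += d @ d.T                               # sum_k (x_k - m_i)(x_k - m_i)^T
        S_B += Xi.shape[1] * (mi - m) @ (mi - m).T   # n_i (m_i - m)(m_i - m)^T
    return np.trace(S_W), np.trace(S_B)
```

Minimizing $\mathrm{tr}(\mathbf{S}_W) - \mathrm{tr}(\mathbf{S}_B)$ pulls coefficients of the same class together while pushing class means apart; the $\eta\|\mathbf{X}\|_F^2$ term keeps the objective well behaved.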

2.3. The Proposed FDDL-ELM Method. In this section, we propose a novel nonlinear classification model that rests on a new ELM framework for multilayer perceptrons (MLPs), named FDDL-ELM. The framework consists of two stages: an encoding stage and a classification stage. The former uses the FDDL approach to map the input features into a midlevel feature space, and the ELM algorithm then performs the final decision making in the latter stage. The framework of FDDL-ELM is shown in Figure 1.
Let $\mathbf{A} = [\mathbf{A}_1, \mathbf{A}_2, \cdots, \mathbf{A}_c]$ be the input training set and $\mathbf{Y}$ the corresponding desired output, where $\mathbf{A}_i$ is the subset of the input from class $i$ and $c$ is the total number of classes.
Step (1): utilize the FDDL algorithm to learn a structured dictionary D.
By incorporating Equations (6) and (7) into Equation (5), the objective function is rewritten as

$$J_{(\mathbf{D},\mathbf{X})} = \arg\min_{(\mathbf{D},\mathbf{X})} \left\{ \sum_{i=1}^{c} \left( \|\mathbf{A}_i - \mathbf{D}\mathbf{X}_i\|_F^2 + \|\mathbf{A}_i - \mathbf{D}_i\mathbf{X}_i^i\|_F^2 + \sum_{j \ne i} \|\mathbf{D}_j\mathbf{X}_i^j\|_F^2 \right) + \lambda_1\|\mathbf{X}\|_1 + \lambda_2\left( \mathrm{tr}(\mathbf{S}_W(\mathbf{X})) - \mathrm{tr}(\mathbf{S}_B(\mathbf{X})) + \eta\|\mathbf{X}\|_F^2 \right) \right\}.$$

The optimization of this objective consists of two alternating steps: first, update $\mathbf{X}$ while fixing $\mathbf{D}$; then, update $\mathbf{D}$ while fixing $\mathbf{X}$. These steps are iterated until the desired discriminative dictionary $\mathbf{D}$ and the discriminative coefficients $\mathbf{X}$ are obtained, as done in [16].
Step (2): reconstruct the signals for the high-level sparse feature information.
With the dictionary $\mathbf{D}$ and the coefficients $\mathbf{X}$ obtained in Step (1), we can compute the reconstructed signals $\mathbf{B}$, which uncover important information hidden in the original signals:

$$\mathbf{B} = \mathbf{D}\mathbf{X}.$$

Step (3): discriminate the reconstructed signals $\mathbf{B}$ using the ELM classification method.
(1) Randomly generate the hidden node parameters $(\mathbf{a}_i, b_i)$ for $i = 1, 2, \cdots, m$. (2) Compute the new hidden-layer output matrix $\mathbf{G}$, whose entries are $g(\mathbf{a}_i \cdot \mathbf{b}_j + b_i)$ for the reconstructed samples $\mathbf{b}_j$ in $\mathbf{B}$. (3) Introduce the regularization parameter $C$ and calculate the output weight $\boldsymbol{\beta}$ as

$$\boldsymbol{\beta} = \mathbf{G}^T\left(\frac{\mathbf{I}}{C} + \mathbf{G}\mathbf{G}^T\right)^{-1}\mathbf{Y}.$$

Extensive efforts have been devoted to the optimal selection of $C$; the leave-one-out (LOO) cross-validation strategy combined with the predicted residual sum of squares (PRESS) statistic is one of the most effective methods [26].
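For ridge-type solutions such as the one above, the LOO PRESS statistic has a closed form, so $C$ can be selected without explicitly refitting $N$ times. The sketch below illustrates this idea under our own naming (`press_select_C`); it is not the authors' code.

```python
import numpy as np

def press_select_C(G, Y, C_grid):
    """Choose the regularization parameter C by the LOO PRESS statistic (sketch).

    For ridge regression, the leave-one-out residual has the closed form
    e_i = (y_i - yhat_i) / (1 - HAT_ii), where
    HAT = G (G^T G + I/C)^{-1} G^T, so no explicit LOO loop is needed.
    G: (N, m) hidden-layer output matrix, Y: (N, q) targets.
    """
    best_C, best_press = None, np.inf
    m = G.shape[1]
    for C in C_grid:
        M = np.linalg.solve(G.T @ G + np.eye(m) / C, G.T)  # (G^T G + I/C)^{-1} G^T
        hat_diag = np.einsum('ij,ji->i', G, M)             # diagonal of the HAT matrix
        resid = Y - G @ (M @ Y)                            # ordinary training residuals
        e = resid / (1.0 - hat_diag)[:, None]              # exact LOO residuals
        press = np.sum(e ** 2)
        if press < best_press:
            best_C, best_press = C, press
    return best_C
```

Because the LOO residuals are exact for the ridge estimator, this selection is as reliable as an explicit LOO loop at a fraction of the cost.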
Step (4): for the test data $\mathbf{A}_{\mathrm{test}}$ with labels $\mathbf{Y}_{\mathrm{test}}$ and the learned dictionary $\mathbf{D}$, reconstruct $\mathbf{A}_{\mathrm{test}}$ in the encoding stage and then calculate the predicted labels $\mathbf{Y}_{\mathrm{predict}}$ using the ELM classifier.

Experimental Results and Discussion
In this section, several experiments on benchmark datasets and EEG datasets were performed to evaluate the performance of the proposed FDDL-ELM method in comparison with other state-of-the-art approaches. All methods were implemented in the MATLAB 2014b environment on a computer with a 2.6 GHz processor and 8.0 GB RAM.

Experiment on Benchmark Datasets
3.1.1. Description. In order to evaluate its performance, the proposed FDDL-ELM method was first applied to four popular benchmark datasets from the UCI repository [31]. The details of these datasets are shown in Table 1.
The Liver Disorders dataset is a medical application consisting of 345 samples belonging to 2 categories, with 6 features extracted from each sample. In Table 1, the column "Random perm" denotes whether the training and test data are randomly assigned. In each data partition, the ratio between training and test samples is 1:1. The classification process was repeated ten times, and the average of these outcomes was taken as the final classification rate.

3.1.2. Experimental Setup. The proposed FDDL-ELM algorithm has 5 tuning parameters: $\lambda_1$, $\lambda_2$, and $\eta$ in the encoding stage, as well as the number of hidden nodes $m$ and the regularization parameter $C$ in the classification stage. In all the experiments, the optimal parameters $\lambda_1$ and $\lambda_2$ were searched using five-fold cross-validation over the small set $\{0.001, 0.005, 0.01, 0.05, 0.1\}$, and $\eta$ was set to 1, as done in [16]. The optimal parameters $m$ and $C$ were determined from $m \in \{100, 200, \cdots, 1500\}$ and $C \in \{e^{-5}, e^{-4}, \cdots, e^{5}\}$ using the LOO cross-validation strategy based on the minimum PRESS statistic [27]. Note that $C$ is chosen automatically and is not fixed across the ten repetitions in the classification stage. The settings of these tuning parameters for the four benchmark datasets are summarized in Table 2.
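The five-fold search over the $\lambda_1$, $\lambda_2$ grid can be sketched as below. The `fit_score` callback, which would wrap the full FDDL-ELM train-and-evaluate pipeline, is a hypothetical placeholder of our own, as is the helper name.

```python
import itertools
import numpy as np

def grid_search_lambdas(samples, labels, fit_score, k=5, seed=0):
    """Five-fold CV over the small lambda grid used in the paper (sketch).

    fit_score(train_idx, val_idx, lam1, lam2) -> validation accuracy
    is a hypothetical callback; samples/labels index the dataset.
    """
    grid = [0.001, 0.005, 0.01, 0.05, 0.1]
    n = len(labels)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)                 # k roughly equal folds
    best, best_acc = None, -1.0
    for lam1, lam2 in itertools.product(grid, grid):
        accs = []
        for f in range(k):
            val = folds[f]
            train = np.hstack([folds[j] for j in range(k) if j != f])
            accs.append(fit_score(train, val, lam1, lam2))
        acc = float(np.mean(accs))
        if acc > best_acc:
            best, best_acc = (lam1, lam2), acc
    return best, best_acc
```

With a 5x5 grid and five folds, this costs 125 pipeline fits per dataset, which is why the paper restricts the grid to a small candidate set.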

3.1.3. Comparisons with Other State-of-the-Art Algorithms. In this experiment, we compared the proposed FDDL-ELM with three baseline algorithms: ELM, FDDL, and H-ELM. The classification performance is evaluated in terms of average accuracy and standard deviation (acc ± sd). Table 3 summarizes the results of the four methods on the benchmark datasets. From Table 3, it is evident that the FDDL-ELM algorithm achieved performance comparable to the other state-of-the-art methods, namely single-layer ELM, FDDL, and H-ELM with its deep architecture. For the Diabetes dataset, FDDL-ELM achieved more than a 9% improvement over FDDL. For the Liver Disorders dataset, although the average classification accuracy of H-ELM (74.01%) was better than that of FDDL-ELM (72.38%), the accuracy of FDDL-ELM was higher than that of ELM by 0.23% and that of FDDL by 6.61%. For the Waveform dataset, FDDL-ELM yielded a mean accuracy of 85.02%, a 0.57% improvement over ELM and a 0.30% improvement over H-ELM. The average classification accuracy on the COIL-20 dataset obtained by FDDL-ELM was 98.33%, higher than those of ELM (96.14%) and H-ELM (97.13%). Based on these observations, the proposed FDDL-ELM approach outperformed the original ELM and FDDL methods on all four datasets and performed comparably to H-ELM on most of them.

3.1.4. The Impact of the Parameters. There are five parameters in our algorithm: $\lambda_1$, $\lambda_2$, $\eta$, $m$, and $C$. Since $\eta$ is set to 1 [16] and $C$ is chosen automatically [27], this section investigates the impact of the other three parameters ($\lambda_1$, $\lambda_2$, and $m$) on the performance of our algorithm. $\lambda_1$ and $\lambda_2$ are each varied over $\{0.001, 0.005, 0.01, 0.05, 0.1\}$. The parameter $m$ determines the number of hidden neurons, and its value is selected from $\{100, 200, \cdots, 1500\}$. Figures 2(a)-5(a) show the testing results of our algorithm as $m$ changes on the Diabetes, Liver Disorders, Waveform, and COIL-20 datasets, respectively. As can be seen, the performance of our algorithm is relatively stable with respect to $m$. Figures 2(b)-5(b) plot the testing accuracies as $\lambda_1$ and $\lambda_2$ vary on the four datasets, respectively. From these results, it can be observed that the performance is affected more pronouncedly by $\lambda_1$ and $\lambda_2$ than by $m$. As a tradeoff between classification performance and computational complexity, a comparatively small number of hidden nodes $m$ can be used when selecting $\lambda_1$ and $\lambda_2$ in real EEG applications.

Description. This section evaluates the performance of the proposed FDDL-ELM method on MI EEG datasets.
There are three datasets for analysis, including two for binary classification and one for multiclass classification, as described below: (1) Dataset IVa, BCI Competition III [32]: this dataset contains EEG signals from 5 subjects who performed 2-class MI tasks: right hand and foot. EEG signals were recorded using 118 electrodes. A training set and a testing set were available for each subject, and their sizes differed across subjects. More precisely, 280 trials were available for each subject, among which 168, 224, 84, 56, and 28 composed the training sets for subjects A1, A2, A3, A4, and A5, respectively, and the remaining trials composed the testing sets.
(2) Dataset IIIa, BCI Competition III [33]: this dataset comprises EEG signals from 3 subjects who performed left hand, right hand, foot, and tongue MI. EEG signals were recorded using 60 electrodes. For the purpose of binary classification, only the 2-class EEG signals (left- and right-hand MI) were used, as done in [34]. Both training and testing sets were available for each subject; each set contains 45 trials per class for subject B1 and 30 trials per class for subjects B2 and B3. (3) Dataset IIa, BCI Competition IV [35]: this dataset consists of EEG signals from 9 subjects who performed 4-class MI tasks: left hand, right hand, foot, and tongue MI. EEG signals were recorded using 22 electrodes. The training and testing sets each contain 288 trials per subject.

Experimental Setup. Data preprocessing was first performed on the raw EEG data. In particular, for each trial, we extracted features from the time segment spanning 0.5 s to 2.5 s after the cue instructing the subject to perform MI. Each trial was band-pass filtered in 8-30 Hz using a fifth-order Butterworth filter. Next, the dimension of the EEG signal was reduced using the common spatial pattern (CSP) algorithm, a spatial filtering method widely used in MI-based BCIs [12,34]. Finally, the CSP-filtered EEG signals were discriminated by the different classification methods in our experiment.

In this work, the classification process was repeated ten times, and the average accuracy was recorded for further analysis. The selection process for the parameters $\lambda_1$, $\lambda_2$, $\eta$, and $C$ was the same as that described in Section 3.1.2, and the hidden node number $m$ was selected from $\{10, 20, \cdots, 100\}$.

The classification results are summarized in Table 4. From Table 4, it can be seen that the proposed method outperformed the ELM and FDDL algorithms on almost all subjects (except subject B2) in the binary-classification applications. For subject B2, the ELM method obtained an average accuracy of 68.33%, a 0.33% improvement over FDDL-ELM. Compared with H-ELM, which adopts a deep architecture, FDDL-ELM yielded comparable performance on all 8 subjects and performed better on 4 of them (A2, A3, A4, and B2). Furthermore, the proposed algorithm yielded the highest average accuracy on Datasets IVa and IIIa. For Dataset IVa, the FDDL-ELM approach achieved a mean accuracy of 80.68%, a 0.73% improvement over ELM and a 1.35% improvement over H-ELM. Moreover, a paired t-test revealed no significant difference between the FDDL-ELM and H-ELM approaches (p = 0.626) and a significant difference between the FDDL-ELM and ELM approaches (p = 0.04). For Dataset IIIa, the average classification accuracy obtained by FDDL-ELM was 87.54%, higher than that of ELM (87.35%), FDDL (83.26%), and H-ELM (85.63%).
Furthermore, a paired t-test revealed no significant difference between the FDDL-ELM and H-ELM approaches (p = 0.596). These results show that the FDDL-ELM method achieves strong classification capability in binary-classification applications.
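The preprocessing pipeline described above (8-30 Hz band-pass filtering followed by CSP spatial filtering) can be sketched as follows. This is an illustration under our own assumptions: the helper names, the sampling rate `fs`, and the variance-ratio CSP formulation via a generalized eigenproblem are not taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.signal import butter, filtfilt

def preprocess_trial(trial, fs=250.0):
    """Band-pass filter one trial (channels x samples) in 8-30 Hz
    with a fifth-order Butterworth filter, as in the paper."""
    b, a = butter(5, [8.0 / (fs / 2), 30.0 / (fs / 2)], btype='band')
    return filtfilt(b, a, trial, axis=1)

def csp_filters(trials_a, trials_b, n_pairs=3):
    """Common spatial patterns for two classes (sketch).

    trials_*: lists of (channels x samples) arrays. Returns the
    2*n_pairs spatial filters that extremize the variance ratio
    between the two classes.
    """
    cov = lambda ts: np.mean([t @ t.T / np.trace(t @ t.T) for t in ts], axis=0)
    Ca, Cb = cov(trials_a), cov(trials_b)
    # Generalized eigenproblem Ca w = lambda (Ca + Cb) w
    vals, vecs = eigh(Ca, Ca + Cb)
    order = np.argsort(vals)
    pick = np.r_[order[:n_pairs], order[-n_pairs:]]  # both extremes
    return vecs[:, pick].T                           # (2*n_pairs, channels)
```

The log-variances of the spatially filtered trials are then typically used as the feature vector fed into the encoding stage.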

Comparisons with Related Algorithms. In the 4-class classification application on BCI Competition IV Dataset IIa, the average classification accuracies for the 9 subjects using the four algorithms are also shown in Table 4. Note that our method also outperformed ELM and FDDL on 8 of the 9 subjects (except subject C8). For subject C8, ELM obtained the best result (81.87%) compared with FDDL (63.72%), FDDL-ELM (80.60%), and H-ELM (76.46%). The FDDL-ELM approach performed best on 4 subjects (C1, C3, C6, and C7), whereas H-ELM achieved the best result on 4 subjects (C2, C4, C5, and C9). The average classification accuracy over the 9 subjects using FDDL-ELM was 63.76%, slightly lower than that of H-ELM (64.46%). A paired t-test revealed no significant difference between the FDDL-ELM and H-ELM approaches (p = 0.519) and a significant difference between the FDDL-ELM and FDDL approaches (p < 0.01).
These results show that, compared with the H-ELM algorithm, our method can achieve similar results without a deep architecture.
In these experiments, the proposed FDDL-ELM method exhibited excellent performance in both the binary-classification and multiclass-classification cases. The nonlinear property of FDDL-ELM allowed for its superior performance over the FDDL approach when processing the nonstationary EEG signals. Furthermore, FDDL-ELM is more suitable than basic ELM for analyzing noisy EEG data because its encoding stage can acquire a higher-level representation of the raw signals and extract more effective feature information. Compared with H-ELM, an algorithm with a deep architecture, our method also yielded comparable results. In particular, on the binary-classification datasets (BCI Competition III Datasets IVa and IIIa), our method attained higher average accuracies (80.68% and 87.54%) than H-ELM (79.33% and 85.63%), respectively.

Conclusion
In this paper, we have proposed a new ELM framework called FDDL-ELM, which achieves a sparse representation of the raw input data through layer-wise encoding while still benefiting from the universal approximation capability of the original ELM. We verified the generalizability and capability of FDDL-ELM using publicly available benchmark databases and MI-BCI datasets. In these applications, FDDL-ELM demonstrated superior classification performance over the other relevant state-of-the-art methods. However, several questions remain to be investigated in future work. The nonstationary nature of EEG signals means that a classification model built on earlier data cannot adequately reflect the changes that have since taken place in the signals; consequently, online updates to the classification model are needed. Recently, the ensemble of subset online sequential extreme learning machine (ESOS-ELM) method was proposed for class imbalance learning [36]. In addition, an online sequential extreme learning machine with kernels (OS-ELMK) has been proposed for the prediction of nonstationary time series [37]. In the study by Mirza et al. [38], a multilayer online sequential extreme learning machine was proposed for image classification. In future work, we will investigate an online learning algorithm for FDDL-ELM to analyze MI EEG signals.

Data Availability
Three datasets were employed in this study, including two datasets for binary classification and one dataset for multiclass classification, all of which are publicly available: (1) Dataset IVa, BCI Competition III [31]: this dataset contains EEG signals from 5 subjects who performed 2-class MI tasks: right hand and foot. (2) Dataset IIIa, BCI Competition III [34]: this dataset comprises EEG signals from 3 subjects who performed left hand, right hand, foot, and tongue MI. (3) Dataset IIa, BCI Competition IV [32]: this dataset consists of EEG signals from 9 subjects who performed 4-class MI tasks: left hand, right hand, foot, and tongue MI.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.