Embedding Tangent Space Extreme Learning Machine for EEG Decoding in Brain Computer Interface Systems

In motor imagery brain computer interface system, the spatial covariance matrices of EEG signals which carried important discriminative information have been well used to improve the decoding performance of motor imagery. However, the covariance matrices often suffer from the problem of high dimensionality, which leads to a high computational cost and overfitting. +ese problems directly limit the application ability and work efficiency of the BCI system. To improve these problems and enhance the performance of the BCI system, in this study, we propose a novel semisupervised locality-preserving graph embedding model to learn a low-dimensional embedding. +is approach enables a low-dimensional embedding to capture more discriminant information for classification by efficiently incorporating information from testing and training data into a Riemannian graph. Furthermore, we obtain an efficient classification algorithm using an extreme learning machine (ELM) classifier developed on the tangent space of a learned embedding. Experimental results show that our proposed approach achieves higher classification performance than benchmark methods on various datasets, including the BCI Competition IIa dataset and in-house BCI datasets.


Introduction
Brain computer interface (BCI) systems have been developed to provide a novel and promising alternative method for humans to interact with their environment. Among currently operational BCI systems, many have been based on electroencephalography (EEG) signals, owing to the noninvasive nature of the technique and affordable data acquisition equipment, both of which also facilitate real-time operation [1]. Although many EEG signals such as P300 waves or steady-state visual evoked potentials (SSVEP) could be used for BCI [2], in this study, we focus on motor imagery-based BCI systems, with which trained subjects can voluntarily generate EEG control signals by imagining movements of different parts of their bodies [3]. With the advantage not requiring external stimulation, motor imagery-based BCI has proven effective in practice. e common spatial pattern (CSP) algorithm has been shown to be one of the most effective methods for feature extraction from motor imagery signals [4]. e success of CSP techniques in BCI applications depends heavily on learning a proper spatial filter from a training dataset. Some extensions of CSP were proposed to improve the performance of feature extraction in motor imagery classification. In [5], a novel regularized CSP combined with correlationbased channel selection was proposed to extract effective features of EEG. In [6], a discriminative canonical pattern matching combined with CSP was used to extract the movement-related features. In [7], to capture the shared salient information across multiple related spatial patterns, a multiview learning-based sparse optimization was proposed to jointly extract robust CSP features with L1L2-norm regularization. In [8], a generic model training by weighted linear discriminant analysis was proposed to reduce the correction time of the BCI system. In [9], a fusion method for internal feature selection was proposed to fuse different feature selection rules. It can consume less computational cost and obtain high performance of motor imagery classification. Typically, in such motor imagery classification system, some discriminant information contained in the testing dataset remains unutilized. However, graph-based models are able to utilize such information because they naturally incorporate diverse types of information, such as the relationship between training and testing datasets [10].
Graph construction is an essential step in graph-based models. Two conventional methods are commonly used, including the k-nearest neighbor method and the ε-ballbased method [11]: (1) k-nearest neighbor methods consider data samples x i and x j as neighbors if x i is among the k nearest neighbors of x i or x j is among the k nearest neighbors of x i , where the nearest neighbors are determined as the samples with the smallest Euclidean distance (2) ε-ball neighborhood methods consider data samples where ‖·‖ 2 is the Euclidean norm and ε ∈ R e above traditional graph construction methods are relatively simple to implement and have been applied effectively in practice. However, if the data samples considered do not lie in a Euclidean space, the k-nearest neighbor and ε-ball-based methods cannot guarantee the connectivity of the whole graph and often result in several separated subgraphs because neighbors defined by Euclidean distance cannot characterize the real geometrical relations among samples well [12]. An obvious example is the covariance feature data used in motor imagerybased BCI. Because covariance features of EEG signals with the form of symmetric positive-definite (SPD) matrices lie on a Riemannian manifold [13], we propose an extension of the graph construction approach based on Riemannian geodesic distance to construct a proper graph for EEG signals, namely, a Riemannian graph. In fact, EEG classification techniques on high-dimensional Riemannian manifolds have recently received increased attention [14]. For example, an efficient classification algorithm called tangent space linear discriminant analysis (TS + LDA), was proposed in [14]. TS + LDA mapped all the data points in a Riemannian manifold into its tangent space, which is known as the best hyperplane for classification tasks [15], and then applied linear discriminant analysis (LDA) for classification in the tangent space.
After constructing a Riemannian graph for an EEG signal, many graph-based algorithms can be exploited on the constructed Riemannian graph. Such graph embedding algorithms are among the most successful in graphbased data processing on high-dimensional Riemannian manifolds because the most important information is always contained in a low-dimensional embedding of a high-dimensional Riemannian manifold. Many state-ofthe-art semisupervised and unsupervised algorithms have been proposed. Semisupervised learning methods can utilize both training and test data to boost the algorithmic performance. Many related works, such as [16][17][18], were proposed to classify EEG based on training and test data in the context of semisupervised learning. In [16], a self-training support vector machine (SVM) method was proposed to use the test data for updating the parameter of SVM model. In [17], an iterative semisupervised SVM method was proposed for channel selection and classifier training. In [18], a semisupervised expectation maximization (EM) was incorporated into CSP to extract more effective features. On the contrary, many unsupervised dimensionality reduction methods also are proposed to learn a low-dimensional embedding preserving similarities between vertex pairs on such graphs, including locally linear embedding [19], locality-preserving projection [20], and Laplacian eigenmaps [21].
Considering locality preservation, in this study, we propose a bilinear locality-preserving embedding learning algorithm on a constructed Riemannian graph. Because the learned embedding was also an SPD matrix space, inspired by the TS + LDA algorithm, we applied an extreme learning machine (ELM) [22,23] method on the embedding tangent space for classification. Many related works of graph embedding methods and ELM-based methods have been recently proposed. In [24], a hybrid-graph discriminant learning method was proposed to reveal the complex highorder relationship of the original data and to extract the discriminant features. In [25], a local constrained manifold structure collaborative preserving embedding was proposed to enhance the intraclass compactness and the interclass separability. In [26], to improve the classification performance of motor imagery, a sparse Bayesian ELM algorithm was proposed to automatically control the model complexity and exclude redundant hidden neurons. In [27], a multikernel ELM method integrating two types of kernels was proposed to effectively explore the supplementary information from multiple nonlinear feature spaces.
In [28], a clustering-based multitask feature learning method was proposed to improve EEG pattern decoding. Most of these studies have achieved great success in graph embedding and classification. However, the above methods learn embedding or classifier on Euclidean space and ignore that the covariance matrices lie on Riemannian manifold. e contributions of this study are threefold.
(1) A graph model of the covariance features of motor imagery signals based on Riemannian geodesic distance was constructed. e neighborhood relationships among data samples were characterized by Riemannian geodesics because the EEG covariance features endowed with a Riemannian metric formed a Riemannian manifold.
(2) A novel framework of graph embedding is proposed for the constructed Riemannian graph. e calculation of graph embedding is formulated as an eigenvalue problem.
(3) A classification algorithm, called an embedding tangent space extreme learning machine (ETS + ELM), is proposed to learn low-dimensional embeddings. Experimental results show that the proposed algorithm achieved higher classification performance than prior methods used for comparison e remainder of this paper is organized as follows. In Section 2, some basic concepts concerning SPD matrix spaces and BCI data models are briefly reviewed. In Section 3, we introduce the construction of Riemannian graphs and propose a bilinear locality-preserving graph embedding algorithm and an embedding tangent space ELM for classification. Extensive experimental results are presented in Section 4 to demonstrate the effectiveness of the proposed method. Finally, some conclusions are presented in Section 5.

Riemannian Geometry.
In mathematics, Riemannian manifold M is defined as a smooth manifold equipped with a Riemannian metric 〈, 〉, i.e., a smoothly varying inner product (〈, 〉 p , p ∈ M), on its tangent space [29]. A smooth manifold is a topological manifold with a smooth structure locally similar to Euclidean space. Every point on this topological manifold has a neighborhood for which there exists a homeomorphism mapping the neighborhood to a Euclidean space. e materials and methods section should contain sufficient detail so that all procedures can be repeated. It may be divided into headed subsections if several methods are described. For many applications, the space of SPD matrices endowed with a Riemannian metric is one of most important Riemannian manifolds [30]. Owing to the properties of SPD matrices, explicit formulae exist for the major concepts of Riemannian geometry. erefore, these concepts can be applied relatively easily in the context of pattern recognition.

Space of Symmetric Positive-Definite Matrices.
Denoting the space of symmetric matrices by and the space of positive-definite matrices by the space of SPD matrices is defined as e space of SPD matrices lies on a differentiable Riemannian manifold [30]. Hence, all the mathematical tools defined in Riemannian geometry can be applied to any SPD(N).
e Riemannian distance between two matrices P 1 , P 2 ∈ calSPD(N) is the minimum length of the curves connecting P 1 and P 2 on a Riemannian manifold [31]. ere are many possible mathematical definitions of the Riemannian distance. In this study, we adopt the following formula for Riemannian distance: where ‖·‖ F is the Frobenius norm of a matrix and η i is the ith real eigenvalue of P − 1 1 P 2 . e associated Euclidean tangent space is critical in the analysis of a Riemannian manifold, as illustrated in Figure 1. As a preliminary step to define the tangent space, we first, respectively, introduce the logarithmic mapping and exponential mapping operators as where Log P (P j ) is a mapping from a point on the manifold to the tangent space at P, while Exp P (S j ) is a mapping from the tangent space at P to the manifold. ese two operators are both one-to-one mapping operators between the Riemannian manifold and the tangent space.
Because the tangent space T P (N) � Log P (P j ), P j ∈ SPD(N) is the space of symmetric matrices, it comprises only N(N + 1)/2 independent elements. A minimal representation of the tangent space T P (N) [10] is defined as where the upper (·) operator is used to retain the upper triangular portion of the symmetric matrix and obtain its vectorization.
e mean of the SPD matrices plays an important role in classification. Since the neighborhood of Riemannian manifold is local homeomorphic to its tangent space, the trinational classifier can be performed on tangent space to obtain high classification performance [14]. However, large neighborhood will lead to large distortion between neighborhood and its tangent space, particularly when the selection of tangent point is not representative. e mean of SPD matrices is the best representation of all data points. It is defined as the point P m � SPD(N) with the minimum sum of squared distances to all SPD matrices in a dataset C: 2.3. BCI Data Model. e recorded EEG signal was a multilag/multichannel signal: where N and L denote the number of channels and sampled points, respectively, and is the snapshot vector. In motor imagery BCIs, the secondorder statistical information (covariance feature) of X(t) often provides discriminative information for classification. e covariance feature of X(t) is represented by its sample covariance matrix (SCM) [32] as Journal of Control Science and Engineering

Riemannian Graph Construction.
In SPD matrix spaces, because the data lie on a Riemannian manifold, it is natural to construct a Riemannian graph G R � (V, W) via Riemannian geodesic distance, as illustrated in Figure 2. V represents the set of graph vertices, and W represents the graph weight matrix. For a given dataset, the Riemannian graph is constructed as follows. First, to construct an adjacency graph, a Riemannian distance-based k nearest neighbor method is used. e points P i , P j ∈ V are considered neighbors if P i is among the k nearest neighbors of P i or P j is among the k nearest neighbors of P i , where k is a positive integer and k nearest neighbors are measured by the Riemannian geodesic distance d ij � δ R (P i , P j ).
Second, to calculate graph weights, many approaches may be used, such as heat kernels [33], inverse Euclidean distance [34], and local linear reconstruction coefficients [19]. In this study, we use a heat kernel based on the Riemannian geodesic distance to construct graph weights: where W ij is the (i, j)th element of W and the parameter σ controls the width of the neighborhoods.

Bilinear Locality-Preserving Embedding and ELM on
Tangent Space. Because the covariance features studied in this paper lie in SPD matrix spaces, we propose a bilinear mapping with V ∈ R M×N and V T under the constraint VV T � I M , to construct a system learning a low-dimensional embedding. e low-dimensional embedding P e � VPV T ∈ R M×M was expected to preserve the locality geometry of a high-dimensional Riemannian manifold. Locality preservation attempts to ensure that if P i and P j in a high-dimensional manifold are "close," then VP i V T and VP j V T in low-dimensional embedding are also "close." A reasonable criterion for locality preservation is to minimize the following objective function: where the F-norm ‖ · ‖ F is used to measure the "closeness" of data points in the embedding. ere is no closed-form optimal solution for the optimization problem (11). It should be noted that the objective function (11) is the trade-off model between optimized solution and Riemannian locality preserving. In this study, we propose an iterative strategy to obtain a local optimal solution of (11). In the proposed strategy, the matrix V is first initialized as V 0 � [I M , 0], and the new optimal V t+1 of the (t + 1)th iteration can be obtained by solving the following eigendecomposition problem, given the tth iteration of V and V t : where L � |C| i,j�1 (P i − P j )V T t V t (P i − P j ) T W ij . Negating (11), the minimization problem becomes a maximization problem. Because the optimization problem in (12) is a generalized eigenvector problem [35,36], the optimal transformation matrix V t+1 is the first M generalized eigenvectors of L corresponding to the M largest eigenvalues. e iteration is repeated until the solution converges, and a  local optimal solution of (11) is obtained. Further details are provided in Algorithm 1. After learning the optimized bilinear mapping matrix V opt , the proposed approach maps all points in the embedding into the tangent space according to (6). It has been reported that the tangent space located at the Riemannian mean is the best hyperplane for classification, as it is associated with the minimal distance loss [15]. A three-layer ELM network was constructed on a tangent space for classification because the ELM classifier is more flexible to handle complex and multivariate data than LDA and SVM classifier. e mapping function from the jth input tangent vector s j to the ith hidden node was defined as where θ � a i , b i denotes the parameters of the sigmoid mapping function. Based on the principles of ELMs, we randomly chose parameters a i and b i within the range [− 1, 1]. We denote by h(s j ) ∈ R 1×N h the output vector of the hidden layer with respect to s j and by β ∈ R N h ×N o the output weights that connect the hidden layer to the output layer, where N h and N o are the number of hidden neurons and output neurons, respectively. Moreover, the output weights β can be learned by solving an optimization problem [22]: where H � [h(s 1 ) T , . . . , h(s j ) T , . . . , h(s |C| ) T ] T and λ parameters are used to balance the loss and L 2 is a regularizer. e parameter Y is a matrix of the target output, and the jth row of Y is a binary vector with only one entry (corresponding to the class to which s j belongs) equal to one. e framework of ETS + ELM algorithm is shown in Figure 3; the proposed ETS + ELM algorithm includes a Riemannian graph construction for data representation, locality preserving with bilinear mapping for dimensionality reduction, tangent space mapping for feature extraction, and 3-layer extreme learning machine network for classification. (1) CSP + LDA: the CSP in [4] and linear discriminant analysis (LDA) in [38] were applied for evaluation (2) CSP + SVM: the number of CSP spatial filters was set to 8, as suggested in [4], for feature extraction, and a support vector machine (SVM) [39] was used for classification (3) MDRM: minimum distance to Riemannian mean [14] was used for evaluation (4) TS + LDA : LDA was used to classify signals on a tangent space [14] 4.1.3. Parameter Setting. In all experiments, we empirically set k � 40 and σ � 1 during Riemannian graph construction.

Experimental
In ETS + ELM, the number of iterations was set to N ite � 8, and the iterative stop threshold to τ � 0.01. In fact, most information of the original space was lost during the process, while the number of embedding dimensions was too small. However, all of the methods evaluated suffer from computational cost problems under conditions of excessively high dimensionality. Hence, the dimension M was empirically set to 10. Because the ELM method suffers from an overfitting problem in the case of excessively hidden neurons, a regularizer parameter was used to control the overfitting problem. erefore, the regularizer parameter λ was empirically set to 0.001, while the number of hidden neurons was 2000.

Experiments for EEG Classification.
To evaluate the performance of the proposed algorithms, ETS + ELM, some competing algorithms (CSP + LDA, CSP + SVM, MDRM [14], and TS + LDA [14]) were compared on the EEG datasets from the 2008 BCI Competition and our in-house BCI system. First, we evaluated the convergence performance of the iterative algorithm for optimization problem (12). Because the Kappa coefficient [40] scales linearly from 0 to 1 in the range between random and perfect classification, it was selected as a performance measure [40] for the classification algorithms in this study. In each iteration step of (12), we were able to obtain a learned mapping matrix V t and a corresponding low-dimensional embedding. e performance of the ETS + ELM classification algorithm under the same parameter settings was evaluated on these embeddings. e average performance of nine subjects on the BCI Competition IIa dataset is shown in Figure 4. It is clear that the mapping matrix V t learned from the proposed algorithm converged quickly. Normally, it required only two iteration steps.

Journal of Control Science and Engineering
We also compared the performance of various methods on classification of the BCI Competition 2008 data. In Table 1, the performance of the proposed methods, MDRM, TS + LDA, and the top three Competition winners for dataset IIa of BCI Competition 2008 (1st of BCI08, 2nd of BCI08, and 3rd of BCI08) are presented for comparison. From the results shown in Table 1, it may be observed that the proposed ETS + ELM method achieves a mean performance of 0.577, which was the best performance of all the methods compared.
We also tested the performance of the proposed algorithms in terms of classification accuracy on the BCI Competition 2008 IIa dataset with a 30-fold cross-validation procedure. e cross-validation reflects the generalization performance of the proposed method. From the results in Table 2, it is clear that, for all subjects, ETS + ELM Input: given SPD matrix set C with samples Ρ i ∈ R N×N , i � 1, . . . , |C|, dimensions of embedding M, iterative number N ite , stop threshold τ, and ELM parameter a i , b i , N h , and N o ; (1) Construct a Riemannian graph G R over all samples based on Riemannian geodesic distance (Section 3.1); (2) Initialize: V 0 � [I M , 0]; (3) For t � 1 : 1:(N ite ), do; Calculate the matrices L (Section 3.2); Obtain mapping matrix V t+1 ∈ R M×N by (12); where V opt is convergent matrix in step 4; (5) Calculate the Riemannian mean of all data on embedding: Ρ cm � argmin Ρ c i δ 2 R (Ρ c , Ρ ci ) (6) Project embedding point onto tangent space on Riemannian mean by (6) T(N) � s j ; (7) Apply a three-layer ELM in the tangent space, learning the weight β from (14), using the training data; (8) e label of the testing data may be obtained by H t β, where H t is the mapping matrix of the testing data; Output: the label of the testing data. outperformed the reference methods. is indicates that, by finding a low-dimensional embedding with the property of locality preservation, the classification method ETS + ELM exhibited the best performance.
Moreover, we evaluated the performance of all algorithms on the dataset recorded from our in-house BCI systems. As shown in Table 3, the proposed ETS + ELM showed a significant performance improvement over the prior methods (an approximately 2% improvement in classification accuracy), even when compared with the benchmark algorithm CSP [4]. In conclusion, the experimental results on the competition dataset and in-house dataset demonstrate that the proposed ETS + ELM approached achieved a higher performance than the other methods. In additional, we provide the ANOVA and t-test analysis of Tables 1-3. e repeated measures ANOVA results for Tables 1-3 are p � 0.2 × 10 − 10 , 0.7 × 10 − 7 , and 0.03, respectively. It indicates that all studied methods have statistically significant difference on classification performance. e t-test analysis of Tables 1-3 are shown in Table 4.
We also compared the computational loads of the proposed algorithm with the competing methods. In the BCI competition dataset, the training time of proposed algorithm is 24.7 s, which is far longer than the training time of MDRM (0.12 s), TSLDA (1.17 s), CSPLDA (0.18 s), and CSPSVM (0.19 s). It is because that the proposed algorithm is a semisupervised learning model where the training and test dataset are simultaneously used for learning features, but the competing methods only used training dataset to learn feature. Other reason is that the proposed algorithm takes a lot of time in iterative strategy of optimization (equation (11)), but the competing methods solve the optimization problem by quick eigenvalue decomposition. However, the test time of proposed algorithm is 0.06 s, which is close to the test time of MDRM (0.17 s), TSLDA (0.03 s), CSPLDA (0.04 s), and CSPSVM (0.0.3 s). e computational loads on in-house dataset are similar with those of competition dataset.
Finally, we compared our bilinear locality-preserving embedding with other state-of-the-art manifold learning algorithms. We show the 2-dimensional embedding learning from ISOMAP, LLE, and the proposed algorithm in Figure 5. e motor imagery EEG data are a spatiotemporal signal, and the correlations of EEG signals in high-dimensional space are expected to be destroyed in 2dimensional embedding. However, compared with feature distributions extracted by other manifold learning methods, the proposed method can also obtain a moderately good separability of classes in the feature space.

Conclusions
Inspired by the fact that the covariance feature of EEG signals in the form of SPD matrices lies on a Riemannian manifold, in this study, we have presented an approach to construct a Riemannian graph for motor imagery classification. A bilinear locality preservation method has been proposed to find low-dimensional embeddings carrying discriminant information. A classification algorithm, called ETS + ELM, was further proposed on a tangent space of the learned embedding. Experimental results on BCI competition and our in-house datasets show that the proposed bilinear locality-preserving embedding was able to capture the useful locality and discriminant information of the original manifold, and the proposed classification algorithms have higher performance than the other competing algorithms compared. e proposed methods can also be applied to many other pattern-recognition problems with input data in the form of SPD matrices, such as pedestrian detection problem [15] or hyperspectral image processing [41]. In future, we will further study weight calculation method to construct preferable Riemannian graph and study the interpretation of the feature learned by Riemannianbased methods.

Data Availability
Previously reported BCI competition EEG data used to support this study are available at https://www.bbci.de/ competition/iv/#dataset2a. ese prior studies (and datasets) are cited at relevant places within the article as references. e in-house EEG data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper. Journal of Control Science and Engineering 9