EEG Signal Classification Using Manifold Learning and Matrix-Variate Gaussian Model

In brain-computer interfaces (BCI), feature extraction is key to recognition accuracy. EEG signals contain important local structural information that is effective for classification, and this locality exists not only in the spatial channel positions but also in the frequency domain. To retain sufficient spatial structure and frequency information, we preprocess the data and extract preliminary features with one-versus-rest filter bank common spatial patterns (OVR-FBCSP). On this basis, we investigate feature extraction methods. One-dimensional methods such as linear discriminant analysis (LDA) may destroy this structural information, while traditional manifold learning methods and two-dimensional feature extraction methods cannot extract both types of information at the same time. We introduce a bilinear structure and the matrix-variate Gaussian model into the two-dimensional discriminant locality preserving projection (2DDLPP) algorithm and decompose EEG signals into spatial and spectral parts; the most discriminative features are then selected through a weight calculation. We tested the method on BCI competition IV data set 2a, BCI competition III data set IIIa, and a data set collected by our laboratory, reporting results as recognition accuracy. The cross-validation accuracies were 75.69%, 70.46%, and 54.49%, respectively. The average recognition accuracy of the new method is improved by 7.14%, 7.38%, 4.86%, and 3.8% compared with LDA, two-dimensional linear discriminant analysis (2DLDA), discriminant locality preserving projections (DLPP), and 2DDLPP, respectively. We therefore consider the proposed method effective for EEG classification.


Introduction
Brain-computer interface (BCI) is a kind of real-time communication system connecting the brain and external devices. BCIs based on electroencephalography (EEG) convert the information sent by the brain into commands to drive external devices, so as to realize communication between people and the outside world [1]. There are several control signal types in BCI; among them, motor imagery (MI) is one of the most studied [2]. MI is an independent BCI paradigm that uses the motor cortex as a signal source: the user imagines moving his or her limbs without any actual muscular movement. Studies of EEG signals indicate that performing motor imagery tasks causes event-related desynchronization (ERD) and event-related synchronization (ERS) of oscillations in the alpha band (8-13 Hz) and beta band (14-30 Hz) [3]. Due to these characteristics, researchers can process and analyze EEG signals in the relevant frequency bands to classify motor imagery tasks.
In an MI-based BCI system, a well-known problem is how to handle the very large number of features extracted from multichannel EEG signals, and feature extraction is a common approach to it. To address the computational complexity and data storage problems caused by the high dimensionality of the signals, many dimensionality reduction methods have been used in traditional BCI technology. Kumar et al. [4] used independent component analysis (ICA) to remove artifacts from EEG signals and principal component analysis (PCA) to reduce the high-dimensional data. Although PCA learns a projection that preserves the main energy of the data, it carries no discriminative information. Linear discriminant analysis (LDA) uses label information to enlarge the between-class distance and reduce the within-class distance [5]. Although LDA uses label information in MI, it ignores the important local information in the EEG signal: motor imagery EEG signals are collected through electrodes spread over the cerebral cortex, and when a particular imagery task is performed, ERD/ERS occurs in specific local areas of the brain [6].
The goal of manifold learning methods is to keep the locality-geometric structure of data in a neighborhood and to find the inherent features of a nonlinear manifold [17]. Locality preserving projections (LPP) [18], neighborhood preserving embedding (NPE) [19], and locally linear embedding (LLE) [8] are popular manifold learning methods that preserve locality, but they share a limitation: they de-emphasize discriminant information, which is important in recognition problems. Discriminant locality preserving projections (DLPP) [20] aims to find the subspace that best discriminates between classes by minimizing the within-class distance while maximizing the between-class distance. However, these are all vector-based methods. EEG signals, as multichannel data, contain rich spatial information; here, spatial information means that the EEG signal has matrix form, and the matrix carries two-dimensional information closely related to the spatial structure. Significant spatial channel information is lost by vector-based methods [21]. For these reasons, two-dimensional dimensionality reduction is considered. Two-dimensional discriminant locality preserving projection (2DDLPP) [22] is the 2D extension of DLPP; it can extract two-dimensional information and preserve the geometric structure of the original data. But for EEG signals, both spatial and spectral information are effective for classification, and 2DDLPP computes only a single-sided projection, so it cannot use the spatial and spectral characteristics of the EEG signal at the same time.
In addition to manifold learning, sparse learning has been increasingly applied to feature extraction; for example, sparse linear discriminant analysis (SLDA) [23] and sparse two-dimensional discriminant locality preserving projection (S2DDLPP) [17] learn a sparse discriminant subspace for feature extraction.
Research has shown that sparsity-based methods can effectively reduce computational complexity and are robust to noise [24-26]. But these methods are somewhat sensitive to the choice of the number of dimensions, since the discriminability of each projection direction is fixed [27].
To address these issues and combine the important local information in the EEG signal with 2D matrix processing, we propose an extension of 2DDLPP based on the matrix-variate Gaussian model. The main idea of the matrix-variate Gaussian model [28] is that it implies a separable structure for the covariance matrix of the vectorized data: the covariance between any two spatial-spectral features can be decomposed into two terms. On this basis, we compute the eigenvalues and eigenvectors of the spatial term and the spectral term separately; the two sets of eigenvalues are then multiplied pairwise and sorted, and the features with large weights are selected.
This allows the spectral and spatial characteristics to be analyzed at the same time, ensuring that the extracted features are the most discriminative. This paper proposes a bilinear two-dimensional discriminant locality preserving projection (B2DDLPP) algorithm based on a matrix-variate Gaussian model. Compared with 2DDLPP, B2DDLPP uses a bilinear structure to fully extract the connections between the channels of the EEG signal; both spatial and frequency information are considered at the same time, so it is better suited to feature extraction from motor imagery EEG signals.

Related Methods.
The input data of the feature extraction algorithms introduced in this paper are the feature matrices produced by OVR-FBCSP: each sample yields a feature matrix of size N_f × N_g, where N_f is the number of frequency bands in FBCSP (9 in this article), N_g = 2 × m × Numclass, m is the number of pairs of features, and Numclass is the total number of classes. When using a vector-based feature extraction method, we vectorize each sample's feature matrix as input.
The class of each feature matrix is the same as that of the corresponding sample; the OVR-FBCSP algorithm is introduced in Section 2.3.1.
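As a quick sanity check on the dimensions above, the shape of one sample's feature matrix can be computed directly (a sketch with illustrative names; `fbcsp_feature_shape` is not from the paper's code):

```python
# Sketch of the OVR-FBCSP output dimensions described above.
def fbcsp_feature_shape(n_bands: int, m_pairs: int, n_classes: int) -> tuple:
    """Return (N_f, N_g) for the OVR-FBCSP feature matrix of one trial."""
    n_f = n_bands                    # one row per band-pass filter
    n_g = 2 * m_pairs * n_classes    # 2*m CSP features per one-vs-rest model
    return n_f, n_g

# With the settings used here: 9 bands, m = 2 feature pairs, 4 MI classes
print(fbcsp_feature_shape(9, 2, 4))  # -> (9, 16)
```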

Discriminant Locality Preserving Projections (DLPP).
DLPP extends LPP by incorporating discriminant information, which gives it better performance than LPP on classification problems. The objective function of DLPP is

J(a) = \frac{\sum_{s=1}^{Z} \sum_{i,j=1}^{n_s} (y_i^s - y_j^s)^2 W_{ij}^s}{\sum_{i,j=1}^{Z} (m_i - m_j)^2 B_{ij}},

where Z is the number of motor imagery classes, n_s is the number of samples in the sth class, y_i^s denotes the projection of the ith sample in the sth class, and m_i and m_j are the mean projected values of the ith and jth classes, with n_i and n_j the numbers of samples in those classes. W_{ij}^s is the weight between y_i^s and y_j^s, and B_{ij} is the weight between m_i and m_j. Note that the s in y_i^s and W_{ij}^s is a superscript index, not a power.

Suppose that a is a transformation vector, i.e., y = a^T x. By simple algebra, the objective function becomes

J(a) = \frac{a^T X L X^T a}{a^T F H F^T a},

where L = D - W is the within-class Laplacian; the within-class weights are W_{ij}^s = \exp(-\|x_i^s - x_j^s\|^2 / t), and D is a diagonal matrix whose entries are the column (or row) sums of W. Here f_i is the mean of the samples in the ith class, F = [f_1, f_2, ..., f_Z], and H = E - B, with between-class weights B_{ij} = \exp(-\|f_i - f_j\|^2 / t) and E a diagonal matrix whose entries are the column (or row) sums of B. The DLPP subspace is spanned by vectors a satisfying the generalized eigenvalue problem

X L X^T a = \lambda F H F^T a.

The numerator of the objective reflects the within-class distance and the denominator the between-class distance, so the vectors a_i that minimize the objective are the eigenvectors associated with the smallest eigenvalues.
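The steps above can be sketched in a few lines of NumPy/SciPy. This is a minimal, hypothetical implementation of the generalized eigenproblem for DLPP, with a small ridge added to keep the right-hand matrix positive definite; the function and variable names are our own, not the authors':

```python
import numpy as np
from scipy.linalg import eigh

def dlpp(X, y, d, t=1.0):
    """Minimal DLPP sketch. X: (p, n) data with samples as columns;
    y: (n,) integer labels. Returns a (p, d) projection whose columns solve
    X L X^T a = lambda F H F^T a for the d smallest eigenvalues."""
    p, n = X.shape
    classes = np.unique(y)
    # Within-class Gaussian weights W (block structure over classes)
    W = np.zeros((n, n))
    for c in classes:
        idx = np.where(y == c)[0]
        for i in idx:
            for j in idx:
                W[i, j] = np.exp(-np.sum((X[:, i] - X[:, j]) ** 2) / t)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Between-class weights B over the class means
    F = np.stack([X[:, y == c].mean(axis=1) for c in classes], axis=1)  # (p, Z)
    Z = F.shape[1]
    B = np.exp(-np.array([[np.sum((F[:, i] - F[:, j]) ** 2) for j in range(Z)]
                          for i in range(Z)]) / t)
    E = np.diag(B.sum(axis=1))
    H = E - B
    Sw = X @ L @ X.T
    Sb = F @ H @ F.T
    # Smallest generalized eigenvalues of Sw a = lambda Sb a (ridge for stability)
    vals, vecs = eigh(Sw, Sb + 1e-8 * np.eye(p))
    return vecs[:, :d]
```

The double loop is O(n^2) and only meant to mirror the formulas; a vectorized pairwise-distance computation would be used in practice.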

Two-Dimensional Discriminant Locality Preserving Projection (2DDLPP).

2DDLPP is the 2D extension of DLPP. Its main advantage over DLPP is a more accurate approximation of the original signals, which avoids losing information that is important for recognition. The objective function of 2DDLPP is

J(A) = \frac{\sum_{s=1}^{Z} \sum_{i,j=1}^{n_s} \|Y_i^s - Y_j^s\|^2 W_{ij}^s}{\sum_{i,j=1}^{Z} \|M_i - M_j\|^2 B_{ij}},

where Y_i^s and Y_j^s denote the projected feature matrices in class s corresponding to the original EEG signals, Z is the number of motor imagery classes, W_{ij}^s and B_{ij} are the within-class and between-class weight matrices, and M_i and M_j are the mean matrices of the projected signals in classes i and j. Suppose that X is an N_f × N_g feature matrix and A denotes the transformation matrix; the linear transformation is Y = A^T X. The objective function can then be rewritten as

J(A) = \frac{tr(A^T S_W A)}{tr(A^T S_B A)},

with

S_W = \sum_{s=1}^{Z} \sum_{i,j=1}^{n_s} W_{ij}^s (X_i^s - X_j^s)(X_i^s - X_j^s)^T,
S_B = \sum_{i,j=1}^{Z} B_{ij} (F_i - F_j)(F_i - F_j)^T,

where F_i is the mean matrix of the ith class, B_{ij} = \exp(-\|F_i - F_j\|^2 / t) is the weight between any two class mean matrices, and W_{ij}^s = \exp(-\|X_i^s - X_j^s\|^2 / t) is the weight between any two samples in the sth class; t is a parameter that can be set empirically and is set to 1 in this paper.
Here D^s is a diagonal matrix whose entries are the column (or row) sums of W^s. The projection directions are obtained by minimizing the objective function; the minimization reduces to the generalized eigenvalue problem

S_W a = \lambda S_B a,

whose solutions, ordered by their eigenvalues \lambda_0 \le \lambda_1 \le ... \le \lambda_{d-1}, give A = [a_1, a_2, ..., a_d].
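The "single-sided" nature of the 2DDLPP transformation Y = A^T X can be seen directly from the shapes: only the row (spectral) dimension is compressed, while the column (spatial) dimension passes through unchanged. A small sketch with assumed dimensions:

```python
import numpy as np

# Single-sided 2DDLPP projection Y = A^T X: only the row (spectral) side
# of X is compressed; the column (spatial) structure passes through.
rng = np.random.default_rng(0)
N_f, N_g, d = 9, 16, 4
X = rng.standard_normal((N_f, N_g))   # one OVR-FBCSP feature matrix
A = rng.standard_normal((N_f, d))     # stand-in for a learned projection
Y = A.T @ X                           # projected features
print(Y.shape)                        # -> (4, 16)
```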

Matrix-Variate Gaussian Model and Its Combination with 2DDLPP.
Suppose that the feature matrices of each class Ω_i follow a matrix-variate Gaussian distribution, where Z is the total number of classes, M_i denotes the mean matrix of class Ω_i, and ϕ_i and Ψ_i denote the covariance matrices of class Ω_i. In this paper, ϕ_i and Ψ_i denote the spectral covariance and the spatial covariance, respectively. The conditional probability of a matrix X given class Ω_i is determined by M_i, ϕ_i, and Ψ_i as

p(X | Ω_i) = \frac{\exp(-\frac{1}{2} tr[ϕ_i^{-1} (X - M_i) Ψ_i^{-1} (X - M_i)^T])}{(2π)^{N_f N_g / 2} \det(ϕ_i)^{N_g / 2} \det(Ψ_i)^{N_f / 2}},

where det(·) represents the determinant of a matrix. Define column vectorization as vec(·); then the mean of vec(X) equals vec(M_i). Vectorizing the matrix Gaussian distribution converts it into a multivariate Gaussian distribution over vector data:

vec(X) \sim N(vec(M_i), Ψ_i ⊗ ϕ_i),

where ϕ ∈ R^{N_f × N_f}, Ψ ∈ R^{N_g × N_g}, and ⊗ represents the Kronecker product operator. This shows that the covariance matrix of the vectorized data has a separable structure: the Kronecker product of two smaller matrices.
The matrix-variate model in equation (11) corresponds to a specific structure for the covariance of the vectorized data: it implies that this covariance matrix can be decomposed into two parts. This separability property will be used in the algorithms proposed in this paper.
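The separability property can be checked numerically through the standard identity vec(AXB) = (B^T ⊗ A) vec(X) for column-stacking vec(·), which is what makes the covariance of the vectorized data factor into a Kronecker product:

```python
import numpy as np

# Check the vec/Kronecker identity underlying the separable covariance:
# with column-major vec(.), vec(A X B) = (B^T kron A) vec(X).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
X = rng.standard_normal((3, 4))
lhs = (A @ X @ B).flatten(order="F")           # column-major vec
rhs = np.kron(B.T, A) @ X.flatten(order="F")
assert np.allclose(lhs, rhs)
```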

Bilinear Two-Dimensional Discriminant Locality Preserving Projection (B2DDLPP).
The bilinear two-dimensional discriminant locality preserving projection (B2DDLPP) method is based on the matrix-variate Gaussian model. This model indicates that the within-class covariance between any two spatial-spectral features can be decomposed into two parts.
Exchanging the numerator and denominator of the 2DDLPP objective function, the objective function of B2DDLPP becomes

J(A) = \frac{tr(A^T S_B A)}{tr(A^T S_W A)},

and the projection directions are obtained by maximizing it. Combining the moment estimation of the separable covariance matrix in separable LDA [29] with the high similarity between LDA and DLPP [30], we estimate the corresponding within-class covariance matrices by decomposing S_W into a spectral part and a spatial part:

ϕ ∝ \sum_{s=1}^{Z} \sum_{i,j=1}^{n_s} W_{ij}^s (X_i^s - X_j^s)(X_i^s - X_j^s)^T,
Ψ ∝ \sum_{s=1}^{Z} \sum_{i,j=1}^{n_s} W_{ij}^s (X_i^s - X_j^s)^T (X_i^s - X_j^s),

where X_i^s and X_j^s denote the ith and jth feature matrices in class s, W_{ij}^s = \exp(-\|X_i^s - X_j^s\|^2 / t) is the weight between them, n_s is the number of samples in the sth class, Z is the total number of classes, and ^T denotes the matrix transpose.
Besides, the between-class scatter matrix can be given a separable structure. Vectorizing the mean matrix of each class gives S_B = S_BR ⊗ S_BL, with

S_BL ∝ \sum_{i,j=1}^{Z} B_{ij} (F_i - F_j)(F_i - F_j)^T,
S_BR ∝ \sum_{i,j=1}^{Z} B_{ij} (F_i - F_j)^T (F_i - F_j),

where u denotes the vectorization of F, u = vec(F); F_s = (1/n_s) \sum_{i=1}^{n_s} X_i^s is the mean matrix of the sth class; and B_{ij} = \exp(-\|F_i - F_j\|^2 / t) is the weight between any two class mean matrices. Under this assumption, we compute the eigenvalues and eigenvectors of ϕ^{-1} S_BL, denoted λ_l and u_l, and those of Ψ^{-1} S_BR, denoted γ_j and v_j. Sorting λ_l and γ_j in descending order, the corresponding projection matrices are constructed as

U = [u_1, u_2, ..., u_{N_f}], V = [v_1, v_2, ..., v_{N_g}],

and the feature matrix is defined as

Y = U^T X V.

Finally, in order to obtain d-dimensional features, we choose the entries y_lj of Y that correspond to the d largest products λ_l γ_j.
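The final selection step, choosing the entries y_lj of Y = U^T X V with the d largest products λ_l γ_j, can be sketched as follows (a hypothetical helper, not the authors' code):

```python
import numpy as np

def select_bilinear_features(Y, lam, gam, d):
    """Pick the d entries y_lj of the bilinear feature matrix Y = U^T X V
    whose discriminability weights lam[l] * gam[j] are largest."""
    weights = np.outer(lam, gam)                  # lam_l * gam_j for every (l, j)
    order = np.argsort(weights, axis=None)[::-1]  # flat indices, descending weight
    rows, cols = np.unravel_index(order[:d], weights.shape)
    return Y[rows, cols]
```

For example, with lam = [2, 1] and gam = [3, 1], the weight matrix is [[6, 2], [3, 1]], so the first two selected entries are y_11 and y_21.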

Preprocessing and the Flow Chart of the Experiment.
We conducted feature extraction using the proposed B2DDLPP on three databases and compared the B2DDLPP algorithm with LDA, two-dimensional linear discriminant analysis (2DLDA), DLPP, and 2DDLPP. The same preprocessing was applied to all three data sets in this experiment: before using the feature extraction methods mentioned above, we apply the one-versus-rest (OVR) multiclass extension of the filter bank common spatial patterns (FBCSP) method [31, 32] to the data. Taking four classes as an example, we take one class of samples as positive and the remaining samples as negative and perform FBCSP on the data to obtain one set of features; by analogy, a total of four sets of features are obtained. Combining the four groups of features finally gives a feature matrix of size N_f × N_g, where N_f is the number of frequency bands in FBCSP (9 in this article), N_g = 2 × m × Numclass, m is the number of pairs of features, and Numclass is the total number of classes.
This method reduces the dimensionality of the EEG data beforehand, ensuring that the feature dimension is smaller than the number of samples, which allows the subsequent feature extraction methods to proceed smoothly. At the same time, effective spatial structure and frequency information are retained to obtain better classification results. As the alpha band (8-13 Hz) and beta band (14-30 Hz) contain rich information for MI tasks, we divide all EEG signals into frequency sub-bands. The FBCSP employs a filter bank covering 4-40 Hz, comprising 9 band-pass filters of 4 Hz each [31]; and, in order to get a flatter delay response and low signal distortion, we use a 6th-order Chebyshev type II filter in this paper. The specific process is shown in Figure 1.
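The filter bank stage described above can be sketched with SciPy. The paper specifies 9 band-pass filters covering 4-40 Hz at 4 Hz each and a 6th-order Chebyshev type II design; the stop-band attenuation below is an assumed value, since the text does not state it:

```python
import numpy as np
from scipy.signal import cheby2, sosfiltfilt

def filter_bank(eeg, fs=250, order=6, stop_atten_db=30):
    """Band-pass an EEG array of shape (..., samples) through the 9 FBCSP
    sub-bands. stop_atten_db is an assumed stop-band attenuation."""
    bands = [(4 + 4 * k, 8 + 4 * k) for k in range(9)]  # (4,8), (8,12), ..., (36,40)
    out = []
    for lo, hi in bands:
        # 6th-order Chebyshev type II band-pass, as stable second-order sections
        sos = cheby2(order, stop_atten_db, [lo, hi], btype="bandpass",
                     fs=fs, output="sos")
        out.append(sosfiltfilt(sos, eeg, axis=-1))      # zero-phase filtering
    return np.stack(out)                                # shape (9, ..., samples)
```

Zero-phase `sosfiltfilt` is used here so that the filtering itself adds no phase distortion; the authors' exact filtering procedure may differ.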
Considering that the SVM classifier has been widely used in EEG classification [33-35], we use an SVM classifier in this paper. We divide each data set into a training set and a test set. The performance of BCI algorithms depends strongly on the dimensionality of the feature space at the classifier's input, denoted by d. For each method, the optimal dimensionality d_op of the feature space is determined on the training set from the average performance of each subject over all the validation runs; the feature-space dimensionality on the test set is then set to this d_op. To make the different methods fit the data better when searching for d_op, the parameter m of the preprocessing method is selected from 1 to 4; the value of m affects the data dimension after preprocessing. The flow chart of the experiment is shown in Figure 2. All experiments were performed in MATLAB R2017a on Windows 10, with an AMD 2600X CPU and 16 GB RAM.
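The dimensionality-selection loop of the validation phase can be sketched with scikit-learn (an illustrative reimplementation, not the authors' MATLAB code): features of each trial, already ordered by priority, are truncated to each candidate d, and d_op is the value with the best 5-fold cross-validated SVM accuracy.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pick_dimension(features, labels, candidate_dims):
    """Return d_op: the feature dimensionality with the best mean 5-fold
    cross-validated SVM accuracy. `features` is (n_trials, d_max), with
    columns assumed pre-sorted by feature priority."""
    scores = {d: cross_val_score(SVC(), features[:, :d], labels, cv=5).mean()
              for d in candidate_dims}
    return max(scores, key=scores.get)
```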

Data Set
(1) Set 1: BCI competition IV, data set 2a (Exp.1). The ultimate purpose of this experiment is to classify the following motor imagery tasks: left hand, right hand, feet, and tongue movement.
This data set contains EEG signals of nine healthy subjects. It was recorded in two sessions using 22 Ag/AgCl electrodes at a 250 Hz sampling rate. Each session consists of six runs, each of which includes 48 trials of 3 seconds in length, yielding a total of 288 trials per session. The two sessions are used as training set and test set, respectively. It should be mentioned that this data set also includes three electrooculogram (EOG) channel recordings, which can be used as a reference for denoising. In this paper, we chose the time period of 3 s to 6 s, and in order to preserve complete spatial information, all channels are retained.
(2) Set 2: BCI competition III, data set IIIa (Exp.2). This data set consists of recordings from three healthy subjects (k3b, k6b, and l1b). Each subject sat in a relaxing chair with armrests and was asked to perform imagery movements for four different tasks: left hand, right hand, foot, and tongue. Each subject completed 60 trials per class. Recordings were made with a 60-channel EEG amplifier from Neuroscan, with the left mastoid as reference and the right mastoid as ground. EEG signals were recorded at a sampling rate of 250 Hz and filtered between 1 and 50 Hz with the notch filter on. In this paper, we chose the time period of 3 s to 7 s, and all channels are retained.

(3) Set 3: data set 3 (Exp.3). The third data set used in this paper was obtained by our laboratory. It consists of recordings from 10 subjects; each subject sat in a relaxing chair with armrests and was asked to perform imagery movements for three different tasks: left hand, right hand, and the idle state. The motor imagery duration is 4 seconds. The signals were recorded using 62 Ag/AgCl electrodes at a 1000 Hz sampling rate, and each subject completed 125 trials per class, yielding a total of 375 trials, of which 300 are used as the training set and the remainder as the test set. To reduce data redundancy and improve processing efficiency, we downsampled the signals to 250 Hz.

Results and Discussion
In this section, we present the results of the different methods on the training sets and the test sets. We used 5-fold cross-validation on the training set and then used the optimal parameters obtained there for classification of the test set. All results are reported as classification accuracy. The pseudocode for training the B2DDLPP feature extractor is presented in Table 1, and the cross-validation results for the three experiments are shown in Tables 2-4, respectively. For each subject, the highest average recognition rate over all the cross-validation runs and the corresponding feature dimension of each algorithm are reported. We also studied the case where no feature extraction method is used and the FBCSP features are passed directly to the classifier. Table 2 shows that the average recognition accuracy of B2DDLPP is improved by 2.81%, 8.47%, 7.07%, 4.86%, and 3.8% compared with None, LDA, 2DLDA, DLPP, and 2DDLPP, respectively. B2DDLPP achieves the highest accuracy for every subject, and the comparison shows that both the 2D processing and the manifold learning components have an important influence on EEG classification accuracy. In LDA and DLPP, vectorizing the extracted matrix features causes the loss of important information.
There are indispensable interconnections within the matrix structure, and they have a great influence on classification accuracy. The accuracies of 2DLDA and 2DDLPP, which use the 2D method, improve over the methods without 2D processing; and, compared with 2DLDA, the accuracy of 2DDLPP increases by 3.27%, which shows that keeping the locality-geometric structure of EEG data in a neighboring area plays an effective role in improving classification accuracy.
When compared with 2DDLPP, B2DDLPP improves the results considerably. 2DDLPP only transforms the rows or columns of the matrix data, which inevitably leads to unnecessary information loss. B2DDLPP uses the bilinear structure to decompose the covariance matrix into a row part and a column part, which gives a better grasp of the internal interconnections of the spatial data; therefore, B2DDLPP can extract more discriminative information. We can draw the same conclusion from Table 3. The overall recognition accuracy in Table 4 is not as high as in the other two tables. The reason may be that the subjects of this data set had no long-term specialized motor imagery training, so the original data contain less effective information, whereas in Exp.1 and Exp.2 the subjects were specially trained and generated less noise when performing motor imagery tasks.

Table 1: Pseudocode for training the B2DDLPP feature extractor.
Algorithm: B2DDLPP
Inputs: training samples X of size N_f × N_g; the total number of samples is N, and the number of training samples in class i is N_i, 1 ≤ i ≤ Z.
Outputs: the feature extraction operators U (N_f × N_f) and V (N_g × N_g), and the corresponding λ_l and γ_j values, which determine the priority in selecting the elements of the feature matrix.
Procedure:
1. Calculate the spatial covariance matrix Ψ and the spectral covariance matrix ϕ according to (18) and (19).
2. Calculate S_BL and S_BR according to (21) and (22).
3. Calculate the eigenvalues λ_l and the corresponding eigenvectors u_l of ϕ^{-1} S_BL, 1 ≤ l ≤ N_f, and the eigenvalues γ_j and the corresponding eigenvectors v_j of Ψ^{-1} S_BR, 1 ≤ j ≤ N_g.
4. Construct U and V.
5. Calculate the feature matrix Y according to (24).
6. Choose the y_lj elements of Y that correspond to the d largest λ_l γ_j values, where d is the dimension after feature extraction.
Table 4 shows that the average recognition accuracy of B2DDLPP is improved by 2.34%, 6.75%, 5.7%, 4.66%, and 3.66% compared with None, LDA, 2DLDA, DLPP, and 2DDLPP, respectively. This result again demonstrates the effectiveness of the 2D method and the manifold learning method for extracting spatial and structural information from EEG signals.
It can be seen from the three tables that, although B2DDLPP has the highest accuracy, the classification accuracies of the other feature extraction methods listed are somewhat lower than the accuracy obtained without feature extraction. This can be attributed to two points. First, LDA and DLPP are not very robust to noise: among the acquisition channels, only some are related to motor imagery, and since we use the data of all channels to preserve the integrity of the channel structure, redundant data and noise are introduced. Second, the extraction of effective information is insufficient: 2DLDA and 2DDLPP transform only the rows or columns of the matrix data, and this one-sided compression does not fully extract the effective features in the spatial-spectral matrix.
It should be noted that the parameter t used in the weight matrices of DLPP, 2DDLPP, and B2DDLPP is set to 1. The results on the test sets for Exp.1 and Exp.2 are presented in Figures 3 and 4, respectively. Note that the feature-space dimensionality for each method on the test set is set to the value of d_op found in the validation phase.
The performance on the test data shows a trend very similar to that observed during the cross-validation phase.
In addition, we studied the influence of the feature dimension on these methods. Figure 5 shows the recognition accuracy of all methods for different numbers of dimensions for the three subjects in Exp.2; the results are the average accuracy of 5-fold cross-validation. The overall trend of the various methods rises slightly with increasing dimension and finally flattens out. Across the curves of the three subjects, the accuracy of the B2DDLPP method is the highest, followed by 2DDLPP. For each method and each subject, the optimal m related to FBCSP's output and the optimal dimension (d_op) are also reported.

Conclusions
In this paper, we propose the B2DDLPP algorithm. By adding the matrix-variate Gaussian model and a bilinear structure, B2DDLPP decomposes the within-class covariance matrix and the between-class scatter matrix to obtain the optimal projection matrices. As a matrix-based method, B2DDLPP is more effective at extracting spatial information than vector-based methods, and the bilinear structure further enhances this effect.
To fully verify the effectiveness of the algorithm, we apply B2DDLPP to three EEG data sets. The results show that B2DDLPP has higher feature extraction performance than the other methods compared.

Data Availability
The Exp.1 and Exp.2 data used to support the findings of this study are available at http://www.bbci.de/competition/.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.