An Adaptive EEG Classification Algorithm Based on CSSD and ELM_Kernel for Small Training Samples

Rehabilitation technologies based on brain-computer interfaces (BCIs) have become a promising approach for helping patients with dyskinesia regain movement. BCI experiments usually require a calibration measurement stage before feedback applications can be used. To reduce the time required for this initial training, it is important to have a method that can learn to classify electroencephalogram (EEG) signals from a small amount of training data. In this paper, a novel combination of feature extraction and classification algorithms is proposed for classifying EEG signals with a small number of training samples. For feature extraction, the motor imagery EEG signals are preprocessed, and a relative distance criterion is defined to select the optimal combination of channels. Subsequently, the common spatial subspace decomposition (CSSD) algorithm and the extreme learning machine with kernel (ELM_Kernel) algorithm are used to classify the motor imagery tasks. Simulation results demonstrate that the proposed method achieves a high average classification accuracy of 99.1% on BCI Competition III dataset IVa and 76.92% on BCI Competition IV dataset IIa, outperforming state-of-the-art algorithms.


Introduction
A brain-computer interface (BCI) is a communication system between the brain and a computer that does not depend on the brain's conventional output channels. Its essence is to identify people's intentions from electroencephalogram (EEG) signals so as to realize human-machine communication. BCI has extensive application prospects in many fields such as rehabilitation engineering, auxiliary control, and entertainment [1][2][3].
When taking a machine learning approach to BCI, one has to use labelled training data to train the classifier. To this end, the user usually performs a calibration measurement stage before the feedback applications. One important objective in BCI research is to reduce the time required for this initial measurement. Therefore, it is of great importance to have a method that can obtain high classification accuracy with a small number of training samples.
Multichannel EEG signals are usually necessary for spatial pattern identification. Most BCI systems require multichannel EEG data to achieve good performance [4]. However, multichannel EEG data contain redundant information and noise, which complicates data processing and is inconvenient in practical applications [5][6][7]. In several cases, there is no clear agreement on the number and location of the channels necessary for motor imagery EEG [8]. Thus, channel selection is necessary for improving the performance of motor imagery-based BCI.
Currently, many feature extraction methods have been widely researched. The time-varying, nonstationary, and subject-specific characteristics of EEG signals, as well as physical state, mood, posture, and other factors, pose great difficulties for the analysis of EEG signals. Lin et al. proposed the adaptive autoregressive (AAR) model algorithm [9]. The AAR model parameters can better reflect the changes of event-related EEG signals, but the model is sensitive to artifacts. The common spatial pattern (CSP) algorithm of [10] is a classical algorithm for the analysis of EEG signals. The CSP algorithm, however, handles only two types of tasks, for which it maximizes separability in the projected space, and its performance is affected by the nonstationarity of EEG signals and by frequency filtering. Wavelet decomposition (WD) [11] has a lower time resolution and a higher frequency resolution at low frequency and a higher time resolution and a lower frequency resolution at high frequency, which leads to the loss of feature information. Wavelet packet decomposition (WPD) [12] is a development of WD and already has extensive applications [13]. WPD decomposes the low and high frequency information simultaneously in order to improve the time resolution. The coefficients of the wavelet packets include signal information of the frequency bands, but the band-crossing phenomenon exists in every layer of WPD [14]. The classification accuracy may not be high if the band-crossing phenomenon is not considered. The Hilbert-Huang Transform (HHT) [15], an adaptive data-driven method, is widely applied for the analysis of nonstationary data. However, the stopping criterion used when sifting intrinsic mode functions (IMFs) and the boundary extension method of the HHT algorithm affect the feature extraction. Common spatial subspace decomposition (CSSD) is an algorithm that is often employed in multichannel EEG data filtering.
Each spatial model describes the distribution pattern of specific signals that are located in different regions of the brain. The cooperative working mechanism of multiple brain regions is of great research value [16]. A regularization factor was introduced into the CSSD algorithm in [17], and the K-nearest neighbor (KNN) algorithm was used to identify motor imagery EEG. The resulting R-CSSD algorithm produced good classification accuracy with less time consumption. In order to relax the presumption of strictly linear patterns between source signals and recorded EEG in CSSD, Gao et al. proposed the kernel CSSD algorithm, extended it to multiclass problems, and obtained better classification results [18]. Grosse-Wentrup et al. proposed a method based on self-adaptive spatial filtering that improves the spatial filter using the beamforming technique, and the classification rate was improved [19]. Under the small training samples condition, Tomioka and Aihara applied logistic regression with the dual spectral (LRDS) method and obtained good classification accuracy [20]. Park and Lee divided 4-40 Hz band EEG signals into nine sub-bands, and Fisher's Linear Discriminant (FLD) was applied to the features of Regularized CSP (RCSP) extracted from the individual sub-bands; the proposed method yielded good classification accuracy in the vicinity of the motor area of the cerebral cortex and performed particularly well in small-sample settings [21]. Alwasiti et al. proposed a preprocessing pipeline and a triplet network that provide a promising method to classify MI-BCI EEG signals with far fewer training samples [22]. Singh et al. proposed a new framework that transforms covariance matrices into a lower dimension through a spatial filter regularized by data from other subjects. The efficacy of the proposed approach was validated on a small-sample scenario dataset [23]. Hou et al. proposed a novel framework based on bispectrum, entropy, and common spatial pattern (BECSP) for identifying multiclass EEG signals. This algorithm fused features extracted by higher order spectrum, entropy, and the CSP algorithm. A tree-based feature selection algorithm was used to select the required features to achieve dimensionality reduction and performance improvement [24].
As for classification, SVM [25] has been widely used as a classifier for EEG and has been reported to achieve low error and high classification accuracy, but finding appropriate SVM parameters takes considerable time. Other classification techniques are also used, such as the back-propagation neural network (BP-NN) and KNN. BP-NN is computationally expensive to train and prone to falling into a local optimum. KNN has shown performance comparable to other state-of-the-art methods, but its efficiency is greatly reduced when it encounters a large amount of training data.
Extreme learning machine (ELM) [26][27][28][29][30] is an algorithm for single-hidden layer feedforward neural networks (SLFNs) with randomly chosen hidden nodes and analytically determined output weights. Since only the output weights between the hidden and output layers are trained, it improves the generalization ability and accelerates training. Gu and Hua proposed a fusion feature that combined temporal and spatial features as the final feature data. The fusion features were input to the trained ELM classifier, and the ELM model achieved better classification accuracy [31]. The extreme learning machine with kernel (ELM_Kernel) algorithm introduces a kernel function into the ELM algorithm and obtains the least-squares optimal solution. It solves the problem of random initialization in the ELM algorithm, produces better robustness and better generalization performance, and is more stable with respect to the model learning parameters [32].
Thus, for the aforementioned problem, combining the ELM_Kernel algorithm with feature extraction based on the CSSD algorithm can achieve a good balance between classification accuracy and computational efficiency.
In this paper, a novel combination of feature extraction and classification algorithms is proposed for identifying EEG signals with a small number of training samples. The motor imagery EEG signals are preprocessed, and a relative distance criterion is defined to select the optimal EEG channels. Subsequently, the CSSD algorithm and the ELM_Kernel algorithm are used to classify the types of imagery tasks. Simulation results demonstrate that channel selection based on the relative distance criterion can enhance the performance of BCI by removing task-irrelevant and redundant channels. The proposed method achieves a high average classification accuracy of 99.1% on BCI Competition III dataset IVa and 76.92% on BCI Competition IV dataset IIa, outperforming state-of-the-art algorithms, and obtains good accuracy for small training samples. This effectively reduces the time required for the initial measurement in BCI systems and helps to pave the way for using BCI systems in the rehabilitation field.
The remainder of this paper is organized as follows: Section 2 analyses the EEG channel selection and the CSSD algorithm. Section 3 presents the details of the ELM_Kernel algorithm. Section 4 shows the experimental results and analysis. Section 5 provides the conclusions.

The Acquisition of Experimental Data.
The experimental data in this paper come from BCI Competition III dataset IVa and BCI Competition IV dataset IIa. BCI Competition III dataset IVa poses the challenge of coping with only a small amount of training data. The recording was made using BrainAmp amplifiers and a 128 channel Ag/AgCl electrode cap from ECI. 118 EEG channels were measured at positions of the extended international 10/20-system. The dataset was recorded from five healthy subjects. Subjects sat in a comfortable chair with arms resting on armrests. Imagery tasks included imagery right hand movement and imagery right foot movement. The dataset provides continuous signals of 118 EEG channels and markers that indicate the time points of 280 cues for each of the 5 subjects (Aa, Al, Av, Aw, and Ay). For some markers, no target class information is provided (value NaN); these trials are used for testing. Table 1 shows the respective number of training (labelled) trials and test (unlabelled) trials for each subject. BCI Competition IV dataset IIa [33] consists of EEG data from 9 subjects (S1-S9). The cue-based BCI paradigm consisted of four different motor imagery tasks. From the four types, we considered only two, namely imagery left hand movement and imagery right hand movement. EEG signals were recorded and sampled at a rate of 250 Hz using 22 EEG and 3 EOG channels. Only the EEG channels were selected for this study. All subjects performed two sessions, one for training and the other for testing. The total number of trials per session was 288, with 72 trials per class.

Preprocessing.
Event-related desynchronization (ERD) occurs in the mu and beta frequency bands and can be utilized to estimate a subject's cognitive and emotional states. The raw EEG data are filtered by a band-pass filter with a passband of 8-31 Hz, in which the ERD physiological feature is apparent.
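As an illustration of the 8-31 Hz band-pass step, the following is a minimal sketch that assumes an ideal FFT-mask filter implemented with NumPy. The filter design is not specified in the text, so the function name `bandpass_fft`, the masking approach, and the 100 Hz sampling rate in the example are all assumptions made for illustration.

```python
import numpy as np


def bandpass_fft(eeg, fs, low=8.0, high=31.0):
    """Keep only the frequency bins inside [low, high] Hz.

    eeg: array of shape (n_channels, n_samples); fs: sampling rate in Hz.
    This is an ideal (brick-wall) filter, assumed here for simplicity.
    """
    eeg = np.asarray(eeg, dtype=float)
    n_samples = eeg.shape[-1]
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectrum = np.fft.rfft(eeg, axis=-1)
    keep = (freqs >= low) & (freqs <= high)
    spectrum[..., ~keep] = 0.0  # discard everything outside the passband
    return np.fft.irfft(spectrum, n=n_samples, axis=-1)


# Example: a 5 Hz component (outside the band) plus a 10 Hz mu-band component.
fs = 100.0  # hypothetical sampling rate
t = np.arange(1000) / fs
signal = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 10 * t)
filtered = bandpass_fft(signal[None, :], fs)  # one channel, 1000 samples
```

A brick-wall mask is convenient for offline analysis of whole trials; a causal IIR or FIR design would be the more usual choice for online use.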

Channel Selection.
Physiological studies on motor imagery demonstrate that the spatial distribution of EEG differs between imagery movements. ERD of the mu rhythm (8-13 Hz) appears over specific areas corresponding to each imagery state. ERD represents a change of the ongoing EEG activity characterized by a decrease of power in the given frequency bands. Different imagery tasks activate different degrees of ERD. Figure 1 compares the AR model power spectra of 3 randomly selected channels for the imagery right hand and imagery right foot movements on BCI Competition III dataset IVa. For channels P1 and C3, the power spectra of the two types of tasks differ distinctly, although the power spectrum values of the two channels are not the same. For channel AF3, the difference in the power spectrum is not obvious. It can be clearly observed that the intensity of the ERD phenomenon for the two types of imagery tasks varies across channels. In other words, different channels contribute differently to the EEG classification, and the contribution is related to the channel position. This provides the evidence for channel selection.
Multichannel EEG data applied in BCI systems may contain redundant information and cause inconvenience in practical applications. Channel selection can enhance the performance of BCI by removing task-irrelevant and redundant channels. The ERD phenomenon arises in specific brain regions. When the right hand or right foot imagery movement is performed, only a small number of channels are activated, and some of the channels remain in the stationary state. Therefore, a relative distance criterion is defined to measure the contribution of different channels to task identification so as to select the optimal group of channels.
The power spectra of the two types of imagery tasks differ most distinctly at 8-13 Hz, which corresponds to the mu rhythm. The relative distance criterion is defined by the difference in the power spectrum between the two types of imagery tasks as follows:
where $P_{i,k}(f)$ denotes the power spectral density of the $k$-th channel for the $i$-th class, $i = 1$ denotes the class of right hand imagery movement, $i = 2$ denotes the class of right foot imagery movement, $f$ represents frequency, and $f_T$ represents the frequency set of 8-13 Hz. Since $h(k) \in [0, 1]$, the greater the value of $h(k)$, the bigger the difference in the power spectrum of the two types of imagery tasks on the same channel and the higher the contribution to classification. The relative distances of all channels are shown in Figure 2. We select the first 25 channels for further analysis, which correspond to channels F5, FFC3, Fz, F4, CCP3, FC5, FC3, FFC1, FC4, CFC3, CCP5, C3, C1, Cz, C4, CFC4, CCP1, CP3, CPz, CP4, CFC2, P3, Pz, P4, and P8.
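A channel-ranking criterion of this kind can be sketched as follows. The sketch assumes the normalized form $h(k) = |S_{1,k} - S_{2,k}| / (S_{1,k} + S_{2,k})$ with $S_{i,k} = \sum_{f \in f_T} P_{i,k}(f)$. This particular normalization is an assumption chosen only because it guarantees $h(k) \in [0, 1]$, and it may differ from the paper's exact definition. The function names `relative_distance` and `select_channels` are hypothetical.

```python
import numpy as np


def relative_distance(psd, freqs, band=(8.0, 13.0)):
    """Assumed relative distance per channel.

    psd: array of shape (2, n_channels, n_freqs) holding the class-averaged
         power spectral density for class 1 and class 2.
    freqs: array of shape (n_freqs,) with the frequency of each bin in Hz.
    """
    psd = np.asarray(psd, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_power = psd[:, :, in_band].sum(axis=2)  # shape (2, n_channels)
    diff = np.abs(band_power[0] - band_power[1])
    total = band_power[0] + band_power[1]
    # Channels with no power in the band get h = 0 instead of a division by zero.
    return np.divide(diff, total, out=np.zeros_like(diff), where=total > 0)


def select_channels(h, n_keep=25):
    """Indices of the n_keep channels with the largest relative distance."""
    order = np.argsort(h)[::-1]
    return order[:n_keep]


# Toy example with 3 channels and 5 frequency bins.
freqs = np.array([6.0, 8.0, 10.0, 12.0, 14.0])
psd = np.array([
    [[1, 3, 3, 3, 1], [1, 1, 1, 1, 1], [0, 1, 1, 1, 0]],  # class 1
    [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]],  # class 2
], dtype=float)
h = relative_distance(psd, freqs)
best = select_channels(h, n_keep=2)
```

In the toy example, the second channel has identical spectra for both classes and therefore receives the lowest score, which is the behaviour the criterion is meant to capture.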

Common Spatial Subspace Decomposition Algorithm.
CSSD is an algorithm that is often employed in multichannel EEG data filtering. It constructs spatial filters that can distinguish two types of EEG signals based on simultaneous diagonalization of two real symmetric matrices and spatiotemporal source modeling.
Suppose the two types of tasks are A and B. Each subject completes task A and task B the same number of times, denoted by $k$. $X_A \in \mathbb{R}^{n \times M}$ and $X_B \in \mathbb{R}^{n \times M}$ denote the EEG of the two types of tasks for one subject, where $n$ is the number of EEG channels and $M$ is the number of samples per channel in one trial. The feature extraction steps based on the CSSD algorithm are as follows:
Step 1. Estimate the covariance matrices $R_A \in \mathbb{R}^{n \times n}$ and $R_B \in \mathbb{R}^{n \times n}$ of the two types of imagery EEG signals. The covariance matrices of A and B are given by the following equation:
$$R_A = \frac{X_A X_A^{T}}{\operatorname{trace}(X_A X_A^{T})}, \qquad R_B = \frac{X_B X_B^{T}}{\operatorname{trace}(X_B X_B^{T})},$$
where $X_A^{T}$ is the transpose of $X_A$, $X_B^{T}$ is the transpose of $X_B$, and $\operatorname{trace}(\cdot)$ is the trace of the matrix.
Step 2. Calculate the sum covariance matrix $R$ of the two types of imagery EEG signals and compute its eigenvalues and eigenvectors to obtain the whitening transformation matrix $P$ as follows:
$$R = R_A + R_B = U \Sigma U^{T}, \qquad P = \Sigma^{-1/2} U^{T},$$
where $\Sigma \in \mathbb{R}^{n \times n}$ denotes the eigenvalue matrix of $R$ and $U \in \mathbb{R}^{n \times n}$ is the eigenvector matrix.
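Step 3. $S_A$ and $S_B$ are obtained by applying the whitening transformation to $R_A$ and $R_B$ as follows:
$$S_A = P R_A P^{T} = U_A \Sigma_A U_A^{T}, \qquad S_B = P R_B P^{T} = U_B \Sigma_B U_B^{T},$$
where $\Sigma_A$ and $\Sigma_B$ denote the eigenvalue matrices, with $\Sigma_A + \Sigma_B = I$, and $U_A \in \mathbb{R}^{n \times n}$ and $U_B \in \mathbb{R}^{n \times n}$ denote the corresponding eigenvector matrices, with $U_A = U_B$. For the same eigenvector, if $S_A$ has a larger eigenvalue, $S_B$ has a smaller eigenvalue, and vice versa.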
Step 4. Build the spatial filters of the two types of imagery EEG signals. Select the $J$ largest eigenvalues from $\Sigma_A$ and from $\Sigma_B$, and use the corresponding eigenvectors to form the eigenvector matrices $W_A, W_B \in \mathbb{R}^{n \times J}$. The spatial filters of the two types of EEG signals are as follows:
$$SF_A = W_A^{T} P, \qquad SF_B = W_B^{T} P.$$
Step 5. Suppose $X \in \mathbb{R}^{n \times M}$ is the preprocessed EEG signal. The signal is filtered by the two spatial filters, and the feature of the EEG is given by the following equation:
where
The feature vector of the two types of EEG signals is given by the following equation:
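The five steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: covariances are averaged over trials, the filters are named `sf_a` and `sf_b`, and the final feature is taken to be the normalized log-variance of the filtered signals, which is a common choice but is assumed here. All function names are hypothetical.

```python
import numpy as np


def normalized_cov(trials):
    """Step 1: trace-normalized covariance, averaged over trials.

    trials: array of shape (k, n, M) = (trials, channels, samples).
    """
    covs = []
    for x in trials:
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)


def cssd_filters(trials_a, trials_b, n_filters):
    """Steps 2-4: whitening, joint diagonalization, spatial filters."""
    r_a = normalized_cov(trials_a)
    r_b = normalized_cov(trials_b)
    # Step 2: eigendecomposition of R = R_A + R_B and whitening matrix P.
    evals, u = np.linalg.eigh(r_a + r_b)  # assumes R is positive definite
    p = np.diag(evals ** -0.5) @ u.T
    # Step 3: S_A and S_B share eigenvectors; eigenvalues sum to one.
    s_a = p @ r_a @ p.T
    lam_a, u_a = np.linalg.eigh(s_a)  # eigenvalues in ascending order
    # Step 4: largest eigenvalues of S_A, and largest of S_B (smallest of S_A).
    w_a = u_a[:, ::-1][:, :n_filters]
    w_b = u_a[:, :n_filters]
    return w_a.T @ p, w_b.T @ p


def cssd_features(x, sf_a, sf_b):
    """Step 5 (assumed form): normalized log-variance of the filtered trial."""
    z = np.vstack([sf_a @ x, sf_b @ x])
    var = z.var(axis=1)
    return np.log(var / var.sum())


# Toy example: class A is strong on channel 0, class B on channel 3.
rng = np.random.default_rng(0)
trials_a = rng.standard_normal((10, 4, 200)) * np.array([5.0, 1, 1, 1])[:, None]
trials_b = rng.standard_normal((10, 4, 200)) * np.array([1.0, 1, 1, 5])[:, None]
sf_a, sf_b = cssd_filters(trials_a, trials_b, n_filters=1)
features = cssd_features(trials_a[0], sf_a, sf_b)
```

Because $S_A$ and $S_B$ share eigenvectors, a single eigendecomposition of $S_A$ is enough: the directions with the smallest eigenvalues of $S_A$ are those with the largest eigenvalues of $S_B$.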

Basic ELM Algorithm.
The ELM algorithm was first proposed by Huang et al. for SLFNs with randomly chosen input weights and hidden nodes and analytically determined output weights. It possesses impressive generalization performance. A standard ELM classifier is shown in Figure 3. Its $M$ hidden nodes use infinitely differentiable activation functions and can approximate arbitrary samples with zero error, which means that, given a training set $\{(x_i, T_i)\}_{i=1}^{N}$,
$$\sum_{j=1}^{M} \beta_j f(w_j \cdot x_i + b_j) = o_i = T_i, \qquad i = 1, 2, \ldots, N, \tag{9}$$
where $\beta_j$ is the weight vector that connects the $j$-th hidden node with the output nodes, $o_i$ is the SLFN output vector for the $i$-th sample, $T_i$ is the label vector of the $i$-th sample, $w_j$ is the weight vector connecting the input nodes and the $j$-th hidden node, $b_j$ is the bias of the $j$-th hidden node, and $f(\cdot)$ is the activation function. Equation (9) can be replaced by the following equation:
$$H_{w,b,x}\,\beta = T,$$
where $H_{w,b,x}$ is named the hidden-layer output matrix.
The smallest training error can be achieved by computing the corresponding least-squares solution
$$\hat{\beta} = H_{w,b,x}^{\dagger}\, T.$$
Altogether, the ELM training algorithm consists of the following three steps:
Step 1: Randomly assign the hidden node parameters $w_j$ and $b_j$, $j = 1, 2, \cdots, M$.
Step 2: Calculate the hidden-layer output matrix $H_{w,b,x}$ and its Moore-Penrose generalized inverse $H_{w,b,x}^{\dagger}$.
Step 3: Calculate the output weight β.
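The three training steps can be sketched in NumPy as follows. The sigmoid activation, the standard normal initialization, and the function names `elm_train` and `elm_predict` are assumptions made for illustration and are not specified in the text.

```python
import numpy as np


def _hidden(x, w, b):
    """Hidden-layer output matrix H with a sigmoid activation (assumed)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def elm_train(x, t, n_hidden, rng):
    """x: (N, d) inputs, t: (N, c) label vectors, n_hidden: number of hidden nodes."""
    # Step 1: randomly assign the hidden node parameters.
    w = rng.standard_normal((x.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    # Step 2: hidden-layer output matrix and its Moore-Penrose inverse.
    h = _hidden(x, w, b)
    h_pinv = np.linalg.pinv(h)
    # Step 3: output weights as the least-squares solution.
    beta = h_pinv @ t
    return w, b, beta


def elm_predict(x, w, b, beta):
    return _hidden(x, w, b) @ beta


# Toy example: 6 samples, 3 features, 2 classes (one-hot label vectors).
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))
t = np.eye(2)[[0, 1, 0, 1, 1, 0]]
w, b, beta = elm_train(x, t, n_hidden=30, rng=rng)
outputs = elm_predict(x, w, b, beta)
```

With at least as many hidden nodes as training samples, the hidden-layer matrix generically has full row rank, so the training targets are reproduced up to numerical precision. Different random seeds give different hidden layers, which is the source of the variability that the kernel variant removes.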

ELM_Kernel Algorithm.
The training process aims to minimize both the training error $\|T - H\beta\|^{2}$ and the norm of the output weight $\|\beta\|$. It can be represented as a constrained optimization problem in which the constant $C$ is used as a regularization factor to control the tradeoff between the closeness to the training data and the smoothness of the decision function, so that generalization performance is improved. The Lagrange multiplier technique is used to solve this optimization problem. If the matrix $(I/C + H^{T}H)$ is not singular, the solution $\beta$ can be obtained as follows:
$$\beta = \left(\frac{I}{C} + H^{T}H\right)^{-1} H^{T} T. \tag{13}$$
The kernel technique can be applied to ELM based on Mercer's condition. Therefore, based on equation (13), the output vector $f(x)$ of ELM_Kernel can be represented as follows:
where $K = H^{T}H =$
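A kernel-based classifier of this type can be sketched as follows. The sketch assumes the commonly used kernel ELM output $f(x) = [K(x, x_1), \ldots, K(x, x_N)]\,(I/C + \Omega)^{-1} T$ with $\Omega_{ij} = K(x_i, x_j)$ and an RBF kernel $K(u, v) = \exp(-\gamma \|u - v\|^{2})$. The kernel width `gamma` and the function names are hypothetical, and the notation may differ from the paper's own.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a (n_a, d) and b (n_b, d)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off


def kelm_train(x, t, c=10.0, gamma=1.0):
    """Solve (I/C + Omega) alpha = T for the output coefficients alpha."""
    omega = rbf_kernel(x, x, gamma)
    return np.linalg.solve(np.eye(len(x)) / c + omega, t)


def kelm_predict(x_new, x_train, alpha, gamma=1.0):
    """f(x) = [K(x, x_1), ..., K(x, x_N)] alpha, one row per new sample."""
    return rbf_kernel(x_new, x_train, gamma) @ alpha


# Toy example: two well-separated clusters with one-hot label vectors.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2.0, 0.3, size=(10, 2)),
               rng.normal(2.0, 0.3, size=(10, 2))])
labels = np.array([0] * 10 + [1] * 10)
alpha = kelm_train(x, np.eye(2)[labels], c=10.0, gamma=1.0)
predicted = kelm_predict(x, x, alpha, gamma=1.0).argmax(axis=1)
```

No random hidden-layer parameters appear anywhere in this formulation, so repeated training on the same data gives identical results. The only tunable quantities are the regularization factor and the kernel parameter.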

Experimental Results and Analysis
The raw EEG signals are large in volume and high in dimensionality, and using them directly for classification would increase the computing time. Based on the channel selection described in Section 2, this paper selects 25 channels on BCI Competition III dataset IVa for further analysis. We apply the CSSD algorithm for EEG feature extraction and select the first J = 10 eigenvalues; the signals filtered by the resulting spatial filters form the feature vectors. These feature vectors are fed into the SLFN, and the ELM_Kernel algorithm is applied as the classification method. This paper sets the regularization factor C = 10, and the RBF function is used as the kernel function.
The training feature vectors are extracted from the training trials of the 5 subjects shown in Table 1, and the testing feature vectors are extracted from the unlabelled trials. Figure 4 shows the classification accuracy and training time of the 5 subjects on BCI Competition III dataset IVa. The proposed algorithm requires only 0.117 s for training even though subject Al has many more training samples, and the training time decreases as the number of training samples decreases. Table 2 summarizes the performance of different combinations of feature extraction and classification methods on BCI Competition III dataset IVa. ACC represents the average classification accuracy, and STD represents the standard deviation. The average classification accuracy of the 5 subjects with the proposed algorithm is improved by 2.8% compared with that of the proposed algorithm without channel selection, because some channels are irrelevant or redundant. The average classification accuracy of the 5 subjects with the proposed algorithm is improved by 4.1% compared with that of the CSSD and ELM algorithm. The reason is that the ELM_Kernel algorithm solves the problem of random initialization in the ELM algorithm, produces better robustness and better generalization performance, and is more stable with respect to the model learning parameters.
It can be seen from Table 3 that the feature vectors extracted with the CSSD algorithm can effectively characterize EEG signals and that the classification accuracy is higher than each item of the results of the 1st BCI Competition and the SBRCSP algorithm on BCI Competition III dataset IVa. The average classification accuracy of the 5 subjects with the proposed algorithm is improved by 5% and 16.4% compared with that of the 1st BCI Competition and the SBRCSP algorithm, respectively. The high accuracy of the proposed algorithm depends on the optimal combination of channels and on the strong function approximation ability and better generalization performance of the ELM_Kernel algorithm, which is more stable with respect to the model learning parameters. Figure 5 shows the classification accuracy for different numbers of eigenvalues selected in the CSSD algorithm. It can be observed that 3 eigenvalues characterize the EEG signals poorly, and the classification accuracy is low. The 5 curves share the same trend: the classification accuracy rises as the number of eigenvalues increases. More eigenvalues characterize the EEG signals more effectively but are more time consuming. We obtain high classification accuracy with the first 10 eigenvalues forming the feature vector. Figure 6 shows the classification accuracy for different proportions of the training samples. It can be clearly observed that the proposed algorithm obtains good accuracy even when the proportion of training samples is only 0.1. The accuracy differs among the 5 subjects because the EEG signals are affected by physical state, mood, posture, and other factors. We also compute the classification accuracy with 20 training samples for each subject to further verify the effectiveness of the proposed algorithm under small training samples on BCI Competition III dataset IVa, and the classification accuracy is shown in Table 4.
It can be seen that the proposed algorithm obtains higher classification accuracy than the BECSP algorithm even though the BECSP algorithm uses a larger number of training samples. The average classification accuracy of the 5 subjects with the proposed algorithm is improved by 11% compared with that of the BECSP algorithm. The high accuracy of the proposed algorithm depends on the optimal combination of channels and the better generalization performance of the ELM_Kernel algorithm, which is more stable with respect to the model learning parameters. It should be noted that the BECSP algorithm obtains higher classification accuracy than the proposed algorithm for subject Al. The reason is that the number of training samples used by the BECSP algorithm is much larger than that used by the proposed algorithm.
In order to check the robustness of the proposed algorithm, we also report a comparison of the proposed algorithm with the BECSP algorithm on BCI Competition IV dataset IIa in Table 5. The simulation results show that the proposed algorithm can effectively characterize EEG signals and that its classification accuracy is higher than each item of the results of the BECSP algorithm. The average classification accuracy of the 9 subjects with the proposed algorithm is improved by 5.3% compared with that of the BECSP algorithm.

Conclusions
In this paper, a novel combination of feature extraction and classification algorithms, based on the CSSD and ELM_Kernel algorithms, is proposed for classifying EEG signals with a small amount of training data. The motor imagery EEG is preprocessed, and a relative distance criterion is defined to select the optimal combination of EEG channels. The CSSD algorithm combined with the ELM_Kernel algorithm is used to classify the types of imagery tasks. Simulation results demonstrate that channel selection can enhance the performance of BCI by removing task-irrelevant and redundant channels, that the feature vectors can effectively characterize EEG signals, and that the proposed method produces high classification accuracy and outperforms state-of-the-art algorithms for small training samples. The excellent classification performance is obtained because the stable ELM_Kernel algorithm is applied for classification. The advantages of the ELM_Kernel algorithm in terms of both training time and classification accuracy lay a foundation for online classification of EEG. In future studies, the proposed method will be applied to more EEG classification problems and will be further improved and tested so as to make it suitable for clinical applications in the rehabilitation field.

Data Availability
The data used to support the experiments and the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
Journal of Healthcare Engineering