Transfer Kernel Common Spatial Patterns for Motor Imagery Brain-Computer Interface Classification

Motor-imagery-based brain-computer interfaces (BCIs) commonly use the common spatial pattern (CSP) algorithm as a preprocessing step before classification. Because CSP is a supervised algorithm, a large amount of time-consuming training data is needed to build the model. To address this issue, one promising approach is transfer learning, in which a learning model extracts discriminative information from other subjects for the target classification task. To this end, we propose a transfer kernel CSP (TKCSP) approach that learns a domain-invariant kernel by directly matching the distributions of source subjects and target subjects. Dataset IVa of BCI Competition III is used to demonstrate the validity of the proposed method. In the experiments, we compare the classification performance of TKCSP against CSP, CSP for subject-to-subject transfer (CSP SJ-to-SJ), regularized CSP (RCSP), stationary subspace CSP (ssCSP), multitask CSP (mtCSP), and the combined mtCSP and ssCSP (ss + mtCSP) method. The results indicate that TKCSP achieves a superior mean classification accuracy of 81.14%, especially when the source subjects have few training samples. Comprehensive experimental evidence on the dataset verifies the effectiveness and efficiency of the proposed TKCSP approach over several state-of-the-art methods.


Introduction
The brain-computer interface (BCI) offers a new pathway of communication between the brain and an external device by transforming metabolic or electrophysiological brain activities into control messages for devices and applications. The electroencephalogram (EEG) records multivariate time series data at multiple sensors placed on the scalp, thereby capturing the electrical potentials induced by brain activity. Noninvasive BCI systems use these signals to convert the intention of a subject into a control message for a device such as a computer, a neuroprosthesis, or a wheelchair [1][2][3][4].
Currently, improving the classification performance of EEG-based BCI systems poses significant challenges. For one, a new subject must complete a lengthy calibration session to collect sufficient training samples for building subject-specific feature extractors and classifiers. The later test session then employs these extractors and classifiers to classify the subject's brain signals. Recent BCI studies have shown that reducing training sessions is very important because the calibration session is a time-consuming, tedious process. As a result, improving performance with a scarce labeled set is more desirable than relying on a large one. Nevertheless, suitable methods must be identified to strengthen performance, because a short calibration session means only a few training samples are available for the target user, which may result in overfitting or suboptimal feature extractors and classifiers.
To address the above problem, transfer learning is a promising approach [5,6]. It applies data represented in different feature spaces or drawn from different distributions to compensate for insufficient labeled data. In the BCI field, transfer learning has attracted considerable attention because it enables the establishment of subject-independent classifiers and/or spatial filters and lowers calibration time. Some studies have concentrated on feature representation transfer methods in EEG classification [7][8][9][10][11]. In this setting, knowledge that traverses domains is encoded into a new feature representation. Accordingly, accurate classification performance can be expected in small-sample settings.
A proposed schedule for practical applications of subject-transfer-based BCI systems is presented in Figure 1 [12]. The datasets provided by the source subjects can be stored as a dataset group. The BCI device can then acquire transfer data from the source subject groups when it is prepared to execute classification for the user. In this paper, we thus propose the transfer kernel common spatial patterns (TKCSP) method. TKCSP is formulated as an optimization problem over multiple subjects, thereby incorporating data from other subjects to establish a common feature space.

Transfer Kernel Common Spatial Patterns
This section presents a new feature extraction method, TKCSP, which combines two previous approaches: kernel common spatial patterns (KCSP) [13] and transfer kernel learning (TKL) [6]. KCSP is a feature extraction approach for motor imagery, and TKL is a promising transfer learning method. In Sections 2.1 and 2.2, we describe the KCSP algorithm and the TKL algorithm, respectively. In Section 2.3, we propose the TKCSP algorithm by combining the two.

Kernel Common Spatial Patterns. The KCSP algorithm, based on CSP, is used to find the components with the largest energy difference between two experimental conditions [13][14][15]. Its basic idea is to find the optimal spatial filter that maximizes the component energy under one experimental condition while minimizing it under the other after spatial filtering.
The first step is to calculate the covariance of the two classes of signals. Consider E_i, an N × T matrix representing the i-th trial of EEG signals, wherein N represents the number of channels and T represents the number of time points. The class-specific spatial covariance matrix can hence be acquired as

C_c = (1/|I_c|) Σ_{i ∈ I_c} K_i / trace(K_i), with K_i(m, n) = k(e_m, e_n) = ⟨φ(e_m), φ(e_n)⟩, (1)

where c ∈ {1, 2} represents the class label, I_c indexes the trials of class c, k(·, ·) represents the kernel function, and ⟨·, ·⟩ denotes the inner product. Thus we can compute the aggregate spatial covariance matrix as C = C_1 + C_2.

Additionally, we can factor C as C = U_0 Λ U_0^T, where U_0 ∈ R^{N×N} represents the matrix whose columns are the eigenvectors of C, while Λ represents the diagonal matrix of eigenvalues sorted in descending order.

The variances can be equalized by a whitening transformation within the space spanned by the eigenvectors in U_0, P = Λ^{-1/2} U_0^T, such that P C P^T equals the identity matrix I. Thirdly, the whitening matrix P can be used to transform C_1 and C_2 into S_1 = P C_1 P^T and S_2 = P C_2 P^T. S_1 and S_2 have the same eigenvectors; that is, if S_1 = B Λ_1 B^T, then S_2 = B Λ_2 B^T with Λ_1 + Λ_2 = I. At this point, the sum of each pair of corresponding eigenvalues is always one. Hence, the eigenvectors having the smallest eigenvalues for S_1 have the largest eigenvalues for S_2 and vice versa. This property enables the eigenvectors in B to separate the two classes.

Finally, with W = (B^T P)^T as the common spatial filters, the common spatial patterns are the columns of W^{-1}, which can be regarded as the source distribution vectors for time-invariant EEG. Algorithm 1 shows a summary of the complete KCSP procedure.

Input: Data E_i. Output: Common spatial patterns W^{-1} and common spatial filters W.
(1) Compute the spatial covariance matrices C_c, c = 1, 2, by (1); the total spatial covariance matrix is C = C_1 + C_2.
(2) Eigendecomposition C = U_0 Λ U_0^T; whitening transformation P = Λ^{-1/2} U_0^T.
(3) Transform the covariance matrices S_1 = P C_1 P^T, S_2 = P C_2 P^T, and eigendecompose S_1 = B Λ_1 B^T, S_2 = B Λ_2 B^T.
(4) Construct the spatial filter W = (B^T P)^T.

Algorithm 1: Kernel common spatial pattern algorithm.
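As a concrete illustration of Algorithm 1, the following sketch implements its linear-kernel special case (classical CSP) in NumPy. The function name and the trial-list layout are our own choices for illustration, not from the paper.

```python
import numpy as np

def csp_filters(trials_1, trials_2):
    """Classical CSP: the linear-kernel special case of Algorithm 1.

    trials_c: list of (channels x time) arrays for class c.
    Returns the spatial filter matrix W; the patterns are columns of inv(W).
    """
    def class_cov(trials):
        # Trial-wise trace-normalized covariance, averaged over the class.
        covs = [E @ E.T / np.trace(E @ E.T) for E in trials]
        return np.mean(covs, axis=0)

    C1, C2 = class_cov(trials_1), class_cov(trials_2)
    C = C1 + C2

    # Whitening: C = U0 L U0^T, P = L^(-1/2) U0^T, so that P C P^T = I.
    eigvals, U0 = np.linalg.eigh(C)
    P = np.diag(eigvals ** -0.5) @ U0.T

    # S1 and S2 share eigenvectors; corresponding eigenvalues sum to one.
    S1 = P @ C1 @ P.T
    _, B = np.linalg.eigh(S1)   # eigenvalues in ascending order
    return (B.T @ P).T           # common spatial filters W
```

Filtering a trial with the extreme columns of W then yields the components whose variance best discriminates the two classes.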

Transfer Kernel Learning. TKL can directly match the source distribution and the target distribution to learn a domain-invariant kernel space, using knowledge of the source domain to help complete the learning tasks in the target domain. This section begins with definitions of the terminology used, and the Notations section presents a summary of commonly used notations.

Definition 1. A domain D consists of a feature space X and a marginal probability distribution P(x); that is, D = {X, P(x)}, where x ∈ X. In general, if two domains D_s and D_t have different feature spaces or marginal distributions, they differ; that is, D_s ≠ D_t.

Definition 2. Given a domain D, a classifier f(x) and a label set Y of cardinality C compose a task T; that is, T = {Y, f(x)}, in which y ∈ Y, and f(x) = Q(y | x) can be interpreted as the conditional probability distribution.
In general, if two tasks T_s and T_t have different conditional distributions or label spaces, they differ; that is, T_s ≠ T_t. Firstly, calculate the target kernel, the source kernel, and the cross-domain kernel. Assume an input kernel function k is given, for example, the Laplacian kernel k(x, y) = exp(−‖x − y‖/σ) or the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)); then the target kernel K_X, the source kernel K_Z, and the cross-domain kernel K_{ZX} can be computed. A domain-invariant kernel K_{Z∪X} can be learned by utilizing these three kernels. In this challenging situation, sufficient matching of the marginal distributions plays an indispensable role in efficient learning of the domain transfer.
To require two datasets (for example, target data X and source data Z) to conform to similar distributions in the kernel feature space, that is, P(φ(Z)) ≃ P(φ(X)), it is sufficient to require them to have similar kernel matrices, that is, K_Z ≃ K_X [16]. Nevertheless, kernel matrices depend on the data, and direct evaluation of the closeness between the kernels is impossible because of their differing dimensions; that is, K_X ∈ R^{n×n}, K_Z ∈ R^{m×m} [17]. To solve this issue, the Nyström kernel approximation idea is adopted to generate an extrapolated source kernel K̄_Z ∈ R^{m×m} from the eigensystem of the target kernel K_X. The kernel K_Z then serves as the ground-truth source kernel, and K̄_Z is comparable to a spectral kernel design. Figure 2 shows the whole learning procedure.
Secondly, Nyström kernel approximation is adopted to execute eigensystem extrapolation [16]. To this end, standard eigendecomposition is applied to the target kernel, which provides the eigensystem {Λ_X, Φ_X} of the target kernel K_X. Thirdly, we assess this eigensystem on the source data Z by utilizing the Nyström approximation theorem. The extrapolated source eigenvectors are derived as Φ_Z = K_{ZX} Φ_X Λ_X^{-1}, where K_{ZX} ∈ R^{m×n} is the cross-domain kernel matrix between Z and X, assessed by kernel function k.
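A minimal sketch of this extrapolation step, assuming the kernel matrices are supplied as NumPy arrays; the near-zero-eigenvalue cut-off is our numerical safeguard, not part of the paper.

```python
import numpy as np

def nystrom_extrapolate(K_X, K_ZX, tol=1e-10):
    """Nystrom extrapolation of the target eigensystem onto source data.

    K_X  : (n x n) target kernel matrix.
    K_ZX : (m x n) cross-domain kernel between source Z and target X.
    Returns (lam, Phi_X, Phi_Z) with Phi_Z = K_ZX Phi_X diag(lam)^(-1).
    """
    lam, Phi_X = np.linalg.eigh(K_X)
    order = np.argsort(lam)[::-1]        # sort eigenvalues in descending order
    lam, Phi_X = lam[order], Phi_X[:, order]
    keep = lam > tol * lam[0]            # drop near-zero eigenvalues for stability
    lam, Phi_X = lam[keep], Phi_X[:, keep]
    Phi_Z = (K_ZX @ Phi_X) / lam         # broadcasting divides column i by lam_i
    return lam, Phi_X, Phi_Z
```

When K_X is full rank, this satisfies K_ZX = Φ_Z Λ_X Φ_X^T exactly, which is the identity the extrapolation relies on.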
The initial Nyström method directly utilizes the target eigenvalues Λ_X and the extrapolated source eigenvectors Φ_Z to approximate the source kernel as K̄_Z = Φ_Z Λ_X Φ_Z^T. In essence, the distribution difference across domains is embodied by the Nyström approximation error; that is, the error is close to 0 if and only if P(Z) ≃ P(X). An invariant kernel extrapolated across the domains will be achieved if an extrapolated kernel can be found that minimizes the Nyström approximation error, thereby facilitating a more efficient cross-domain generalization.

Figure 2: Complete procedure of transfer kernel learning (eigendecompose the target kernel K_X to form Λ_X, Φ_X by (5); extrapolate Φ_Z using the Nyström method by (6); relax Λ_X to Λ; spectrally design K̄_Z by (7); minimize the approximation error between K̄_Z and K_Z by (8)).
The spectral kernel design idea is adopted to establish a new kernel matrix from the extrapolated eigensystem so as to reduce the Nyström approximation error [18]. The key structure of the target kernel can thus be preserved by the kernel matrix generated via the extrapolated eigensystem Φ_Z, while flexibility in reshaping the eigenspectrum is retained to keep the distribution difference minimized.
Fourthly, the eigenspectrum Λ_X of the primary Nyström approach can be relaxed to learnable parameters Λ, resulting in a kernel family extrapolated from the target eigensystem yet assessed on the source data. The extrapolated source kernel is obtained as K̄_Z = Φ_Z Λ Φ_Z^T. The critical structures of the target domain, that is, the eigenvectors Φ_Z, are preserved by this kernel family, while the free eigenspectrum Λ remains undetermined. Unlike a conventional spectral kernel design, which learns the parameters Λ by training the spectral kernel towards a prior kernel computed in the same domain, here the kernel matching is performed across domains.
Fifthly, we strive to minimize the approximation error between the ground-truth source kernel K_Z and the extrapolated source kernel K̄_Z, thereby explicitly minimizing the distribution difference, by utilizing the squared loss

min_Λ ‖Φ_Z Λ Φ_Z^T − K_Z‖_F²,  s.t. λ_i ≥ ζ λ_{i+1}, i = 1, …, n − 1, (8)

where Λ = diag(λ_1, …, λ_n) holds the nonnegative eigenspectrum parameters, and ζ ≥ 1 is the eigenspectrum damping factor [19].
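The nonnegative core of this fit can be sketched as a linear least-squares problem over the diagonal of Λ. The sketch below uses SciPy's nonnegative least squares and deliberately omits the damping constraint λ_i ≥ ζ λ_{i+1}, which the paper handles via the QP formulation described next; the function name is our own.

```python
import numpy as np
from scipy.optimize import nnls

def learn_eigenspectrum(Phi_Z, K_Z):
    """Fit nonnegative lam minimizing ||Phi_Z diag(lam) Phi_Z^T - K_Z||_F^2.

    Simplified sketch: enforces lam_i >= 0 but not the damping constraint
    lam_i >= zeta * lam_{i+1} from (8).
    """
    m, r = Phi_Z.shape
    # Column i of the design matrix is vec(phi_i phi_i^T), so the residual
    # of the linear system equals the Frobenius-norm objective.
    A = np.stack(
        [np.outer(Phi_Z[:, i], Phi_Z[:, i]).ravel() for i in range(r)], axis=1
    )
    lam, _ = nnls(A, K_Z.ravel())
    return lam
```

With orthonormal Φ_Z the columns of the design matrix are mutually orthogonal, so the fit recovers the eigenspectrum uniquely.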
The marginal distributions of multiple source domains can be matched with the target domain using a generalized transfer kernel learning (TKL) approach. This is done by first learning a source-specific eigenspectrum Λ for every source domain separately. Secondly, existing multiple-source learning algorithms are used to produce a consensus forecast for the target domain based on the predictions from the multiple source domains [20,21].
Sixthly, standard quadratic programming (QP) with linear constraints is used to solve the TKL optimization problem (8). Here, the eigenspectrum parameter is denoted as λ = (λ_1, …, λ_n); that is, Λ = diag(λ). By linear algebra, Equation (8) can be reformulated in matrix form as a standard QP in λ, with the linear constraints encoding λ_i ≥ ζ λ_{i+1} ≥ 0, where ζ ≥ 1 is the eigenspectrum damping factor and the only tunable parameter within TKL. Additionally, I ∈ R^{n×n} denotes the identity matrix appearing in the constraint matrix.
Finally, constructing the domain-invariant kernel K on the combined target and source data A = Z ∪ X is straightforward with the learned optimal eigenspectrum parameters Λ. According to the spectral kernel design, we can generate K from the extrapolated eigensystem as K = Φ Λ Φ^T, where Φ ≜ [Φ_Z; Φ_X] collects the extrapolated eigenvectors on all data A. We can directly feed the domain-invariant kernel K to normal kernel machines, for example, KCSP, to facilitate cross-domain generalization and prediction. Algorithm 2 shows a summary of the complete procedure.

Algorithm 3: Transfer kernel common spatial pattern algorithm.

Transfer Kernel CSP.
When the transfer kernel K replaces the kernel k(·, ·) in (1), we obtain the TKCSP. For all kernel-based methods, we adopt the linear kernel; that is, k(x, y) = x^T y. Then C = K/trace(K) can be used to estimate the spatial covariance. Algorithm 3 presents a summary of the complete TKCSP procedure.
We can compute the filtering of a trial E by Z = W^T E with the projection matrix W [14]. Decomposing the EEG in this way yields the features utilized for classification. For every imagined movement direction, the classifier construction employs the variances of merely a small number m of signals that are fittest for discrimination. The signals z_p (p = 1, …, 2m) maximizing the variance difference of left- versus right-motor-imagery EEG are those associated with the largest eigenvalues in Λ_1 and Λ_2; because of the eigenvalue ordering, they occupy the first and last m rows of Z. The features are computed as f_p = log(var(z_p)/Σ_{i=1}^{2m} var(z_i)). A linear classifier can then be calculated using the feature vectors of left and right trials. The log-transformation contributes to approximating a normal data distribution.
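A minimal sketch of this feature extraction step; the filtering convention Z = W^T E, the function name, and the argument layout are our assumptions for illustration.

```python
import numpy as np

def logvar_features(E, W, m=2):
    """Log-variance CSP features for one trial (sketch).

    E: (channels x time) trial; W: spatial filters as columns (as produced
    by Algorithm 1); m: filters kept from each end of the eigenvalue spectrum.
    """
    Z = W.T @ E                                   # spatially filtered signals
    sel = np.r_[0:m, Z.shape[0] - m:Z.shape[0]]   # first and last m rows of Z
    v = np.var(Z[sel], axis=1)
    return np.log(v / v.sum())                    # log-normalized variances
```

The resulting 2m-dimensional vectors are what the linear classifier is trained on.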

Data Preparation.
In this study, we employed dataset IVa from BCI Competition III [22]. The dataset includes EEG data for a two-class motor imagery classification task: (1) imagined movement of the right hand (denoted by R) and (2) imagined movement of the right foot (denoted by F). EEG signals were measured with 118 electrodes in every trial from five different subjects, and each subject performed 280 trials. Table 1 presents a summary of the data, in which subjects av, aw, and ay have fewer training samples than test samples. Each trial was treated as an N × T matrix E, in which N represents the number of electrodes and T the number of sampled time points. The measured EEG signals were bandpass filtered. An SVM (support vector machine) with a linear kernel was used as the classifier. The proportion of correctly classified samples to the total number of samples used in the test was employed to evaluate the classification accuracy.
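A plausible preprocessing sketch under stated assumptions: the paper does not give the pass band, so the 8-30 Hz band common in motor imagery work is used purely for illustration, and the SVM step is indicated only as a comment.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(trials, fs, lo=8.0, hi=30.0, order=4):
    """Zero-phase Butterworth bandpass applied to each (channels x time) trial.

    The 8-30 Hz band is an assumption for illustration; the paper does not
    state its exact cut-off frequencies.
    """
    b, a = butter(order, [lo / (fs / 2.0), hi / (fs / 2.0)], btype="band")
    return [filtfilt(b, a, E, axis=1) for E in trials]

# Hypothetical classification step (names are ours, not the paper's):
#   clf = sklearn.svm.SVC(kernel="linear").fit(features_train, y_train)
#   accuracy = clf.score(features_test, y_test)  # fraction correctly classified
```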
Our establishment of datasets (containing the target domain and source domains) for cross-domain classification is described as follows. The dataset of each subject could serve as the target domain (aa, al, av, aw, ay), while the datasets of the other subjects could serve as the source domains. This construction strategy ensured the relevance between the domains of unlabeled and labeled data, as they were located in the same top-level categories. Accordingly, C(4,1) + C(4,2) + C(4,3) + C(4,4) = 4 + 6 + 4 + 1 = 15 source-domain datasets were generated for each target domain. Five dataset groups could thus be generated, comprising 5 × 15 = 75 datasets.

Experimental Results.
In this section, TKCSP and six competitive methods are evaluated based on classification accuracy [8,11,23]. We established five dataset groups from the dataset described above. Each dataset group includes four source subjects as source domains and one target subject as the target domain. If a subject is the target domain, it no longer appears in the source domains, so each target domain corresponds to 15 different source domains. The first column of Table 2 shows the different source domains, and the second to sixth columns of Table 2 show the classification accuracy of each target domain over its source domains, respectively. Among them, the highest classification accuracy of target domain aa was 68.10%, with corresponding source domain al + aw; the highest classification accuracy of target domain al was 93.88%, with corresponding source domain aw; the highest classification accuracy of target domain av was 68.47%, with corresponding source domain al + ay; the corresponding entries for target domains aw and ay are given in Table 2. In Figure 3, the blue dashed line indicates the classification accuracy of the CSP algorithm, the red solid line indicates the classification accuracy of the TKCSP algorithm for the different source domains, and the green square indicates the best classification accuracy of TKCSP; the horizontal position of the green square corresponds to the optimal source domain. The results show that the classification accuracy of the TKCSP method is better than that of the CSP algorithm. Table 3 lists the classification (recognition) precisions of the five comparison approaches and TKCSP on dataset IVa, and Figure 4 visually depicts these results for improved accessibility. The performance achieved by TKCSP is significantly better than those achieved by the five comparison approaches. Several observations can be made from these results.
Firstly, TKCSP achieves classification precisions on the aw and aa datasets of 90.58% and 68.47%, respectively, which are higher than those of the five comparison approaches. Moreover, TKCSP achieves an average classification precision on these datasets of 81.14%, a significant improvement of 1.97% over ss + mtCSP, the best competing approach. The consistent performance improvements on these datasets strongly verify that TKCSP can successfully establish powerful domain-invariant kernels for cross-domain motor imagery classification.
Secondly, CSP for subject-to-subject transfer (CSP SJ-to-SJ) determines a composite covariance matrix as a weighted sum of the covariance matrices of the involved subjects, resulting in a composite CSP. This approach achieves an average classification precision of 75.97%.
Thirdly, regularized CSP (RCSP) regularizes the covariance matrix towards the mean covariance matrix of the other subjects to improve its estimation. Such regularization is particularly promising in small-sample settings. This approach achieves an average classification precision of 77.98%.
Finally, the stationary subspace CSP (ssCSP) focuses on the nonstationarity issue, while multitask CSP (mtCSP) focuses on the estimation issue. The combined mtCSP and ssCSP (ss + mtCSP) method employs both approaches: the data are first projected onto the stationary subspace acquired by ssCSP, and then the spatial filters are computed with mtCSP using the regularization parameters acquired by applying it to the initial data. These three approaches achieve average classification precisions of 78.99%, 76.81%, and 79.17%, respectively.
In particular, TKCSP can assess the various cluster structures and naturally matches them between multiple domains. TKCSP achieves this by matching the source domain kernel with the kernel extrapolated from the target domain, while simultaneously amplifying the domain-invariant components of the eigenspectrum and damping the domain-dependent ones. The superior performance of TKCSP can be explained by this advantage.

Conclusion
In this paper, we proposed the TKCSP method to lower the number of training trials and improve performance by learning a domain-invariant kernel. To this end, direct matching of the distributions between target subjects and source subjects within the kernel space is conducted. TKCSP and six competitive approaches were evaluated on the EEG dataset provided by BCI Competition III. The results showed that the performance of the best approach, TKCSP, was better than that of the strongest competitor, ss + mtCSP, by nearly 1.97% in terms of mean classification precision. The results also revealed that TKCSP can perform effective subject-to-subject transfer. Therefore, behaviors consistent with neurophysiological knowledge can be classified by the TKCSP approach.

Conflicts of Interest
The authors declare that there are no conflicts of interest.