Supervised and Semisupervised Manifold Embedded Knowledge Transfer in Motor Imagery-Based BCI

A long calibration procedure limits the practical use of a motor imagery (MI)-based brain-computer interface (BCI) system. To tackle this problem, we consider supervised and semisupervised transfer learning. However, both must cope with the high intersession/subject variability of MI electroencephalographic (EEG) signals. Based on the framework of unsupervised manifold embedded knowledge transfer (MEKT), we propose a supervised MEKT algorithm (sMEKT) and a semisupervised MEKT algorithm (ssMEKT). sMEKT has only limited labelled samples from a target subject and abundant labelled samples from multiple source subjects. Compared to sMEKT, ssMEKT adds comparably more unlabelled samples from the target subject. After performing Riemannian alignment (RA) and tangent space mapping (TSM), both sMEKT and ssMEKT execute domain adaptation to reduce the differences among subjects. During domain adaptation, to make full use of the available samples, both algorithms preserve the source domain discriminability, and ssMEKT additionally preserves the geometric structure embedded in the labelled and unlabelled target domains. Moreover, to obtain a subject-specific classifier, sMEKT minimizes the joint probability distribution shift between the labelled target and source domains, whereas ssMEKT minimizes the joint probability distribution shift between the unlabelled target domain and all labelled domains. Experimental results on two publicly available MI datasets, with varying sizes of the labelled and unlabelled target domains, demonstrate that our algorithms outperform six competing algorithms.
In particular, for target subjects with 10 labelled samples and 270/190 unlabelled samples, ssMEKT achieves 5.27% and 2.69% increases in average accuracy on the two abovementioned datasets compared to the previous best semisupervised transfer learning algorithm (RA-regularized common spatial patterns-weighted adaptation regularization, RA-RCSP-wAR), respectively. Therefore, our algorithms can effectively reduce the need for labelled samples from the target subject, which is of great importance for MI-based BCI applications.

However, EEG data analysis is challenging due to the low signal-to-noise ratio and heavy artifacts [11,12]. Moreover, a long calibration procedure hinders the development of MI-based BCI. Each subject usually undergoes a tedious calibration session to train a subject-specific classifier before performing real-time MI tasks. Since MI EEG signals are evoked by spontaneous movement imagination without external stimuli, they exhibit high intersession/subject variability. Thus, it is difficult to build a generic classifier that fits all sessions/subjects. Instead, it is more realistic to train a subject-specific classifier, which usually requires sufficient labelled data from the subject. Nevertheless, a long calibration procedure may unfortunately lead to large intersession differences and user frustration.
To cope with this problem, it is crucial to reduce the target subject's need for large amounts of labelled samples and to effectively utilize the available samples. Rapid progress in machine learning motivates a variety of studies on how to make full use of the available samples [13][14][15][16][17][18]. EEG dataset reduction can reduce the feature dimensionality of the available EEG signals and improve the system learning speed. However, it cannot reduce the calibration time for the target subject [19,20]. Likewise, different polynomial-based recurrence algorithms are promising techniques for signal processing due to their special capabilities in feature extraction. Nevertheless, they also reduce only the computational cost rather than the calibration time [21][22][23]. Deep learning has been widely used in computer vision, natural language processing, and physiological signal analysis [24][25][26][27][28]. Nevertheless, deep learning needs many labelled samples from the target subject to show its superiority. The artificial data generation method can generate numerous artificial labelled data by recombining the few original labelled data in the time and frequency domains [29]. However, this method relies heavily on the quantity and quality of the labelled samples. Transfer learning is a popular machine learning technique that usually transfers labelled samples from different source sessions/subjects to a new target session/subject with no or few labelled samples [30]. Semisupervised learning can simultaneously use a limited labelled set and a comparably large unlabelled set from the same subject [31]. Therefore, we pay particular attention to transfer learning and its combination with semisupervised learning.
In general, transfer learning can be divided into three categories: supervised, unsupervised, and semisupervised transfer learning, depending on whether the samples from the target domain are all labelled, all unlabelled, or partially labelled. Note that all samples from the source domains are labelled regardless of whether the transfer learning is supervised, unsupervised, or semisupervised. Here, a domain means a subject or a session. Then, a labelled domain consists of the labelled samples from a subject/session, while an unlabelled domain consists of the unlabelled samples from a subject/session. To the best of our knowledge, most studies focus on supervised transfer learning since it can effectively use the discriminative labelled samples from the target domain to select and adjust the labelled samples from the source domains. In fact, it is of great importance for unsupervised and semisupervised transfer learning algorithms to exploit the geometric information embedded in the large amounts of unlabelled samples from the target domain. As an unsupervised transfer learning algorithm, manifold embedded knowledge transfer (MEKT) performed Riemannian alignment (RA), tangent space mapping (TSM), and domain adaptation to progressively minimize the differences among domains in ERP-based and MI-based BCIs [32]. In our opinion, it is difficult to apply unsupervised transfer learning to MI-based BCI due to the existence of BCI illiteracy; it is better to collect some initial labelled samples from the target domain. Therefore, inspired by MEKT, we develop a supervised MEKT algorithm (sMEKT) and a semisupervised MEKT algorithm (ssMEKT) to explore more possibilities of MEKT in all cases and to improve the efficiency of transfer learning by using all available samples.
The main contributions of our work can be summarized as follows: (1) We extend MEKT to supervised and semisupervised versions to further verify the effectiveness of transfer learning on the Riemannian manifold and its tangent space for MI-based BCI. The rest of this paper is structured as follows. In Section 2, we introduce the related work on supervised, unsupervised, and semisupervised transfer learning. In Section 3, we present the applied MI datasets and the detailed methods of EEG processing, including preprocessing, RA, TSM, all MEKT-based algorithms, and a shrinkage linear discriminant analysis (sLDA) classifier. The results of classification accuracy and computation time are shown in Section 4. The experimental results are discussed in Section 5. Finally, our conclusions are drawn in Section 6.

Related Work
Besides MI, the other EEG paradigms mentioned above, such as SSVEP and ERP, also use transfer learning to reduce or eliminate the calibration time. Thus, different EEG-based BCIs can learn from each other in terms of transfer learning.
Most transfer learning algorithms are supervised and are designed to address the shortage of labelled samples from the target domain. In MI-based BCI, the common spatial patterns (CSP) method is a classical feature extraction algorithm for a single subject. However, it performs poorly in the small labelled set scenario [33,34]. Originating from CSP, regularized CSP (RCSP) calculated the regularized average spatial covariance matrix for each class by assigning different regularization parameters to the labelled samples from the source and target domains [35]. Based on the framework of RCSP, different distance metrics, such as the Frobenius norm and cosine distance, were used to measure the similarity between the labelled source and target domains [36,37]. Combined CSP (CCSP) simply concatenated the labelled samples from the target domain and multiple source domains with equal weight to compute the spatial filters [38]. A dynamic time warping RCSP (DTW-RCSP) method performed domain adaptation by aligning the labelled source domains to the labelled target domain of the same class using an optimal warping path [39]. These RCSP algorithms inherited the advantage of CSP in MI-based BCI. However, the regularization parameters, which were used to evaluate the differences between the labelled source and target domains, were often set manually or obtained by cross-validation. Recently, since affine transformation can bring the covariance matrices of EEG data from different domains closer, RA-based supervised transfer learning algorithms have received widespread attention in different EEG-based BCIs [40][41][42][43]. Zanini first performed RA for each domain using the Riemannian mean of its resting trials as the reference matrix and then concatenated all aligned matrices from the labelled domains to train a minimum distance to mean (MDM) classifier based on Riemannian Gaussian distributions [44].
A Riemannian Procrustes analysis (RPA) algorithm executed a sequence of transformations (translation, scaling, and rotation) on the covariance matrices from the source and target domains to reduce their differences and then constructed an MDM classifier using the transformed matrices from the labelled domains [45]. Owing to the good performance of RA, our proposed sMEKT also belongs to the RA-based supervised transfer learning algorithms. Many unsupervised transfer learning algorithms utilize the unlabelled samples from the target domain, as well as abundant labelled samples from the source domains, to realize zero-training for the target domain. In ERP-based BCI, Waytowich presented an unsupervised transfer learning method where independent models were first trained using labelled samples from different source subjects, the classification decisions of the independent models were then combined to classify unlabelled samples from the target subject, and finally each model's decision was weighted based on the inferred accuracy of direct classification [46]. This method did not utilize the inherent information embedded in the unlabelled samples. In MI-based BCI, Xu proposed an unsupervised cross-dataset transfer learning method, where EEGNet and ShallowConvNet were trained on the labelled source dataset, unsupervised domain adaptation was then performed between the labelled source dataset and the unlabelled target dataset, and finally the pretrained model was validated on the unlabelled target dataset [47]. However, this algorithm did not realize domain adaptation between subjects.
To minimize the differences among subjects, in ERP-based and MI-based BCIs, MEKT first executed RA for the covariance matrices from different subjects, then extracted the tangent feature vectors in the TSM module, and finally performed joint probability distribution shift minimization, labelled source domain discriminability preservation, and unlabelled target domain locality preservation [32]. Our proposed algorithms are based on the framework of MEKT, since MEKT can not only shorten the differences between domains but also preserve the characteristics of the labelled source domain and unlabelled target domain as much as possible.
Theoretically, semisupervised transfer learning can provide more information than unsupervised transfer learning due to the existence of a labelled target domain. To achieve epileptic seizure classification from EEG signals, Jiang integrated transfer learning, semisupervised learning, and a Takagi-Sugeno-Kang fuzzy system [48]. This method achieved good performance at the cost of computation time. In ERP-based BCI, Wu proposed online and offline weighted adaptation regularization (wAR) algorithms, which performed domain adaptation between the labelled source domain and the entire target domain by integrating loss function minimization, structural risk minimization, marginal probability distribution adaptation, and conditional probability distribution adaptation [49]. Although wAR is a semisupervised transfer learning algorithm, its architecture is similar to that of unsupervised MEKT.
In summary, the approaches mentioned above inspired the design of our supervised and semisupervised transfer learning algorithms in MI-based BCI.

Description of Datasets.
Two publicly available MI datasets were used to assess the effectiveness of our proposed transfer learning algorithms. More details about these datasets are described as follows: (

The Frameworks of Different MEKT-Based Algorithms.
Our proposed supervised MEKT (sMEKT) and semisupervised MEKT (ssMEKT) are based on the unsupervised MEKT. To better understand them, the frameworks of MEKT, sMEKT, and ssMEKT are shown in Figure 1. The detailed steps of the different algorithms are outlined below: (1) Preprocess the original EEG trials from different subjects. Note that the original EEG trials from all source subjects are labelled and used in all algorithms. For MEKT, all original EEG trials from the target subject are unlabelled. However, for sMEKT and ssMEKT, they are partitioned into labelled and unlabelled EEG trials. The labelled ones are used for sMEKT and ssMEKT, whereas the unlabelled ones are only used for ssMEKT. (2) Convert the filtered labelled/unlabelled EEG trials from each subject into the corresponding labelled/unlabelled covariance matrices. Then, we perform RA for these covariance matrices using their Riemannian mean as the reference matrix to obtain the corresponding aligned matrices in the different MEKT-based algorithms.
(3) In the TSM module, the aligned matrices are transformed into Euclidean tangent feature vectors. The tangent feature vectors from different source subjects are concatenated into the labelled source tangent feature vector set, which is input to ssMEKT, MEKT, and sMEKT along with all, the unlabelled, and the labelled target tangent feature vectors, respectively. Note that the target tangent feature vectors are the feature vectors from the target subject. (4) All MEKT-based algorithms utilize the available tangent feature vector sets to yield different projection matrices, which are used to generate new lower dimensional feature sets. (5) All new labelled feature sets from different subjects are fed into the sLDA classifier [52] to train a subject-specific model, which is then used to classify the new unlabelled target feature set.
Next, we introduce the above procedures in detail.

Preprocessing of EEG Data.
For dataset 1, all original EEG trials from each subject were band-pass filtered between 8 and 30 Hz using a fifth-order Butterworth filter. Then, the filtered EEG trials were extracted from the time interval between 0.5 and 2.5 s after the visual cue signalling the start of imagery. For dataset 2, all original EEG trials from each subject were spectrally filtered by a fiftieth-order finite impulse response filter with cut-off frequencies of 8 and 30 Hz and temporally segmented from 0.5 to 3.5 s after the visual cue onset.
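For concreteness, the dataset-1 pipeline above (fifth-order Butterworth, 8-30 Hz, 0.5-2.5 s window) might be sketched as follows; the sampling rate `fs` and the cue position at sample 0 are illustrative assumptions, not values stated in this excerpt:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250  # Hz; assumed sampling rate for illustration

# Fifth-order Butterworth band-pass filter, 8-30 Hz
b, a = butter(5, [8, 30], btype="bandpass", fs=fs)

def preprocess_trial(raw_trial, t_start=0.5, t_end=2.5):
    """Zero-phase band-pass filter a (channels x samples) trial,
    then crop the 0.5-2.5 s window after the cue (cue assumed at sample 0)."""
    filtered = filtfilt(b, a, raw_trial, axis=1)
    return filtered[:, int(t_start * fs):int(t_end * fs)]

trial = np.random.randn(25, 5 * fs)  # 25 channels, 5 s of data
epoch = preprocess_trial(trial)
```

The same sketch adapts to dataset 2 by swapping in an FIR filter and the 0.5-3.5 s window.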
We used only channels over the central area of the brain, where the sensorimotor rhythms (SMR) of MI are most active. 25 and 29 channels were selected for dataset 1 and dataset 2, respectively. In Figure 2, the selected channels for the two datasets are marked in red.

Riemannian Alignment.
Although there are high intersubject variances in MI-based BCI, RA can make the marginal distributions of EEG trials from different subjects closer.
Let X_i ∈ R^(n×t) be the ith filtered trial from a subject, where n is the number of channels and t is the number of sample points in the selected time window. The spatial covariance matrix of X_i can be defined as [40]

C_i = (1/(t − 1)) X_i X_i^T.

Since the covariance matrices belong to a smooth Riemannian manifold of symmetric positive definite (SPD) matrices, they can be viewed as points on the manifold [44]. To show the benefits of RA, we first present some basic concepts of Riemannian geometry.
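A minimal sketch of the trial-to-covariance step, using the usual sample-covariance normalisation (the reference's exact scaling constant may differ):

```python
import numpy as np

def spatial_covariance(X):
    """Sample spatial covariance of a trial X (n channels x t samples).
    Channels are mean-centred; normalisation by (t - 1) is the standard
    sample-covariance convention."""
    X = X - X.mean(axis=1, keepdims=True)
    return X @ X.T / (X.shape[1] - 1)

X = np.random.randn(8, 500)  # toy trial: 8 channels, 500 samples
C = spatial_covariance(X)
```

With more samples than channels, `C` is symmetric positive definite, i.e. a valid point on the SPD manifold.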

The Riemannian Distance.
Suppose that C_i and C_j are two points on the Riemannian manifold. The Riemannian distance δ(C_i, C_j) between these two points can be defined as the length of the shortest curve (named the geodesic) connecting them [53]:

δ(C_i, C_j) = ‖log(C_i^(−1) C_j)‖_F = [Σ_{k=1}^{n} log² λ_k]^(1/2),

with the Frobenius norm ‖·‖_F and the eigenvalues {λ_k}_{k=1}^{n} of C_i^(−1) C_j.
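The distance formula above can be computed directly from the generalized eigenvalues of the pair (C_j, C_i); a small sketch:

```python
import numpy as np
from scipy.linalg import eigvalsh

def riemannian_distance(Ci, Cj):
    """Affine-invariant Riemannian distance between SPD matrices:
    sqrt(sum_k log^2 lambda_k), with lambda_k the eigenvalues of
    Ci^{-1} Cj, obtained as generalized eigenvalues of (Cj, Ci)."""
    lam = eigvalsh(Cj, Ci)  # solves Cj v = lambda Ci v
    return np.sqrt(np.sum(np.log(lam) ** 2))
```

For example, δ(I, e·I) on 2×2 matrices equals √2, since both eigenvalues of I^(−1)(e·I) are e.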

The Riemannian Mean.
The Riemannian mean is usually used as the statistical descriptor of a set of SPD matrices on the manifold. Given N SPD matrices C_1, …, C_N, their Riemannian mean M_R is defined as below [44]:

M_R = argmin_M Σ_{i=1}^{N} δ²(C_i, M).

For N = 2, M_R is the middle point of the geodesic connecting the two points. For N > 2, there is no closed-form solution, but M_R can be effectively calculated by an iterative procedure [40].
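The iterative procedure for N > 2 is commonly a fixed-point scheme that averages matrix logarithms in the tangent space at the current estimate; the following sketch assumes that standard scheme (the excerpt does not spell out its exact solver):

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm, inv

def riemannian_mean(mats, n_iter=20):
    """Fixed-point estimate of the Riemannian mean of a list of SPD
    matrices: map points to the tangent space at the current estimate M,
    average, and map back via the exponential."""
    M = np.mean(mats, axis=0)  # arithmetic mean as initial guess
    for _ in range(n_iter):
        M_half = sqrtm(M)
        M_ihalf = inv(M_half)
        # average of matrix logarithms in the tangent space at M
        T = np.mean([logm(M_ihalf @ C @ M_ihalf) for C in mats], axis=0)
        M = M_half @ expm(T) @ M_half
    return np.real(M)
```

Consistent with the N = 2 case above, the mean of I and 4I is the geodesic midpoint 2I (the geometric mean).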

Congruence Invariance.
Congruence invariance is an important property of the Riemannian distance: the distance between two points remains invariant after an affine transformation with an invertible matrix, as below [45]:

δ(W^T C_i W, W^T C_j W) = δ(C_i, C_j), for all W ∈ GL(n),

where GL(n) is the set of invertible n × n square matrices. Based on these concepts, RA executes the affine transformation for N points on the manifold using their Riemannian mean as the reference matrix. Then, the aligned matrix of C_i is [44]

C̃_i = M_R^(−1/2) C_i M_R^(−1/2).

We perform RA for all covariance matrices from each domain using their own Riemannian mean as the reference matrix.
Then, all aligned matrices from each domain are centred at the identity matrix. This property can be verified as follows [32]:

M(C̃_1, C̃_2, …, C̃_N) = M_R^(−1/2) M(C_1, C_2, …, C_N) M_R^(−1/2) = M_R^(−1/2) M_R M_R^(−1/2) = I,

where M(C_1, C_2, …, C_N) is the Riemannian mean operation and I is the identity matrix. Thus, RA makes the aligned matrices from different domains comparable and preliminarily reduces their differences.
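Putting the alignment step into code, a sketch (the `mean_fn` argument is a hypothetical placeholder for any Riemannian-mean routine):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def riemannian_align(covs, mean_fn):
    """Centre a set of covariance matrices at the identity:
    C_i_tilde = M_R^{-1/2} C_i M_R^{-1/2}, with M_R the Riemannian mean
    of the set as returned by mean_fn."""
    M_R = mean_fn(covs)
    W = fractional_matrix_power(M_R, -0.5)  # symmetric inverse square root
    return [W @ C @ W for C in covs]
```

For commuting SPD matrices the Riemannian mean reduces to the geometric mean, which makes the recentring property above easy to check numerically.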
Additionally, as in [40], the filtered trial X_i can be spatially whitened by performing RA as follows:

X̃_i = M_R^(−1/2) X_i.

3.5. Tangent Space Mapping.
Most traditional classifiers, such as linear discriminant analysis (LDA) and support vector machine (SVM), are designed for the Euclidean space rather than the Riemannian space. To inherit the advantage of RA, we transform the aligned matrices into Euclidean tangent feature vectors. The SPD matrices lie in a differentiable Riemannian manifold.
Their derivatives at a reference point on the manifold compose a tangent space. As mentioned in [54], choosing the Riemannian mean as the reference point yields a tangent space that locally approximates the manifold. Figure 3 shows a Riemannian manifold and its tangent space at the Riemannian mean point M_R.
As shown in Figure 3, the logarithmic map projects C_i onto the tangent space at the Riemannian mean point M_R by [40]

S_i = Log_{M_R}(C_i) = M_R^(1/2) log(M_R^(−1/2) C_i M_R^(−1/2)) M_R^(1/2).

As in (6), the identity matrix I is the Riemannian mean of all aligned matrices from each domain. Then, the logarithmic mapping of the aligned matrix C̃_i onto the normalized tangent space can be calculated as follows [55]:

T_i = log(C̃_i).

Computational Intelligence and Neuroscience
To obtain a minimal representation, we vectorize the above logarithmic mapping T_i by keeping its upper triangular part, applying unity weight to its diagonal elements and √2 weight to its off-diagonal elements [40]:

F_i = [T_i^(1,1), √2 T_i^(1,2), …, √2 T_i^(1,n), T_i^(2,2), √2 T_i^(2,3), …, T_i^(n,n)]^T,

where T_i^(j,k) ∈ T_i. Then, the aligned matrix C̃_i is transformed into the tangent feature vector F_i.
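A sketch of this vectorization (with the tangent point taken at the identity, as for the aligned matrices above); the √2 weighting makes the Euclidean norm of F_i equal the Frobenius norm of T_i:

```python
import numpy as np
from scipy.linalg import logm

def tangent_vector(C_aligned):
    """Map an aligned SPD matrix to its tangent vector at the identity:
    T = logm(C), then keep the upper triangle with unit weight on the
    diagonal and sqrt(2) weight off the diagonal."""
    T = np.real(logm(C_aligned))
    iu = np.triu_indices(T.shape[0])
    w = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))  # diag vs off-diag weights
    return w * T[iu]
```

The resulting vector has length d = n(n + 1)/2, the dimensionality referred to in Section 3.6.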

Manifold Embedded Knowledge Transfer.
The tangent feature vectors from different domains have similar marginal probability distributions inherited from the corresponding aligned matrices. However, their dimensionality d = n(n + 1)/2 is very high. To further reduce their differences and dimensionality, MEKT-based algorithms aim to find optimal projection matrices for these tangent feature vectors.
As shown in Figure 1, all labelled tangent feature vectors from multiple source subjects are concatenated into a labelled source tangent feature vector set. For convenience, this set is called the labelled source domain F_S = {F_{S,i}}_{i=1}^{n_S}, where F_{S,i} and n_S are the ith tangent feature vector and the size of F_S, respectively. Likewise, all tangent feature vectors from the target subject form the target domain F_T = {F_{T,i}}_{i=1}^{n_T}, where F_{T,i} and n_T are the ith tangent feature vector and the size of F_T, respectively. For MEKT, F_T is unlabelled.
Since MEKT transfers a labelled source domain to an unlabelled target domain, it is an unsupervised transfer learning algorithm.
MEKT seeks the optimal projection matrix P_S ∈ R^(d×q) for F_S and the optimal projection matrix P_T ∈ R^(d×q) for F_T, which not only make the lower dimensional features P_S^T F_S and P_T^T F_T close but also preserve the labelled source domain discriminability and the unlabelled target domain locality. Note that q ≪ d is the dimensionality of the shared subspace; MEKT sets q = 10. The following four properties are designed:

Joint Probability Distribution Shift Minimization.
The traditional maximum mean discrepancy (MMD) is usually used to reduce the marginal and conditional probability distribution discrepancies between different domains [56]. For simplicity, in MEKT, the joint probability MMD is proposed to measure and minimize the joint probability distribution shift between the source and target domains as follows [32]:

min_{P_S,P_T} D_{S,T} = min_{P_S,P_T} D(Q(F_S, y_S), Q(F_T, ŷ_T)) = min_{P_S,P_T} ‖P_S^T F_S N_S − P_T^T F_T N_T‖_F²,

where D(·,·) measures the discrepancy between the joint probability distributions Q(·,·) of the two domains and ŷ_T is the pseudolabel vector of the target domain. Let n_S^k and F_{S,i}^k be the size and the ith tangent feature vector of the labelled source domain belonging to the kth class, respectively. Likewise, n_T^k and F_{T,i}^k are the size and the ith tangent feature vector of the unlabelled target domain predicted to be the kth class, respectively. Here, only binary classification is considered. Then, N_S = Y_S/n_S and N_T = Y_T/n_T, where Y_S is the one-hot encoding matrix of y_S and Y_T is the one-hot encoding matrix of ŷ_T. For example, the one-hot encoding matrix will be [1, 0; 0, 1; 0, 1] if its corresponding true/pseudolabel vector is [class 1; class 2; class 2].

Figure 3: A Riemannian manifold and its tangent space (this figure was adopted from Barachant et al. [40]).
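The one-hot bookkeeping in the example above, plus a deliberately simplified per-class-mean version of the joint shift (a sketch of the kind of quantity minimised, not MEKT's exact objective):

```python
import numpy as np

def one_hot(y, n_classes=2):
    """One-hot encoding matrix Y (n_samples x n_classes), as in the
    [class 1; class 2; class 2] -> [1,0; 0,1; 0,1] example."""
    Y = np.zeros((len(y), n_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def joint_mmd(FS, yS, FT, yT):
    """Simplified joint-shift surrogate: squared distance between
    per-class means of the source and target feature sets."""
    shift = 0.0
    for k in (0, 1):
        shift += np.sum((FS[yS == k].mean(axis=0) - FT[yT == k].mean(axis=0)) ** 2)
    return shift
```

When the two domains share identical per-class statistics, the shift is zero, which is the state domain adaptation drives toward.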

Source Domain Discriminability Preservation.
The source domain discriminability can be defined by the within-class and between-class scatter matrices.
Thus, after projection, it can be maintained by [32]

min_{P_S} tr(P_S^T S_w^s P_S), subject to: P_S^T S_b^s P_S = I,

where tr(·) is the trace operator, S_w^s = Σ_{k=1}^{2} F_S^k H_{n_S^k} (F_S^k)^T is the within-class scatter matrix of the labelled source domain, S_b^s = Σ_{k=1}^{2} n_S^k (m_S^k − m_S)(m_S^k − m_S)^T is the between-class scatter matrix, F_S^k, H_{n_S^k} = I − 1_{n_S^k}/n_S^k, and m_S^k represent the labelled source domain belonging to class k, its centring matrix, and its mean, respectively, and m_S is the mean of the labelled source domain [57]. Note that 1_{n_S^k} ∈ R^(n_S^k × n_S^k) is an all-one matrix.

Target Domain Locality Preservation.
Although the target domain is unlabelled, its local manifold structure can be formulated by the graph Laplacian matrix. MEKT constructs the normalized graph Laplacian L = I − D^(−1/2) W D^(−1/2), where D is the diagonal degree matrix of the similarity matrix W, whose entries are

W_{i,j} = exp(−‖F_{T,i} − F_{T,j}‖²/(2σ²)) if F_{T,i} ∈ Near_K(F_{T,j}) or F_{T,j} ∈ Near_K(F_{T,i}), and 0 otherwise,

in which σ is a scaling parameter and Near_K(F_{T,j}) is the set of K nearest neighbours of F_{T,j} under the Euclidean metric. MEKT sets σ = 1 and K = 10.
To maintain the target domain locality after projection and remove the scaling effect, a graph regularization is minimized under the following constraints [32]:

min_{P_T} tr(P_T^T F_T L F_T^T P_T), subject to: P_T^T F_T H_{n_T} F_T^T P_T = I,

where H_{n_T} = I − 1_{n_T}/n_T is also a centring matrix.
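A sketch of the locality-preserving graph construction described above, using a kNN Gaussian similarity and the symmetric normalised Laplacian (the exact neighbourhood symmetrisation rule is an assumption on our part):

```python
import numpy as np

def normalized_laplacian(F, K=10, sigma=1.0):
    """kNN similarity graph with Gaussian weights and its symmetric
    normalised Laplacian L = I - D^{-1/2} W D^{-1/2}. F is
    (n_samples x n_features); sigma=1 and K=10 follow MEKT's settings."""
    n = F.shape[0]
    d2 = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:K + 1]  # K nearest neighbours, skipping self
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)  # symmetrise: keep an edge if either end selects it
    d = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(n) - D_isqrt @ W @ D_isqrt
```

The resulting L is symmetric positive semidefinite, which is what makes the graph regularization term tr(P_T^T F_T L F_T^T P_T) a valid smoothness penalty.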

Parameter Transfer and Regularization.
The following constraints are imposed on the projection matrices for better similarity and generalization performance [32]:

min_{P_S,P_T} (‖P_T − P_S‖_F² + ‖P_T‖_F²).

Then, the four properties above are integrated into an overall loss function of MEKT using different weights α, β, γ, and θ [32]:

min_{P_S,P_T} αD_{S,T} + β tr(P_S^T S_w^s P_S) + γ tr(P_T^T F_T L F_T^T P_T) + θ(‖P_T − P_S‖_F² + ‖P_T‖_F²),

where α, β, γ, and θ are manually set to 1, 0.01, 0.1, and 20, respectively. For convenience, let P = [P_S; P_T] be the overall projection matrix (P ∈ R^(2d×q)). The Lagrange function is designed as below [32]:

T = tr(P^T(αA + βB + γE + θG)P + μ(I − P^T J P)),

where A, B, E, and G are the matrices collecting the joint probability distribution shift, source domain discriminability, target domain locality, and parameter transfer/regularization terms, respectively, and J collects the constraints. To obtain the optimal P, MEKT sets the derivative of T with respect to P to zero, which yields the generalized eigen-decomposition

(αA + βB + γE + θG)p = λ J p.

Note that μ = 10^(−3) is also a weight. After the generalized eigen-decomposition, P is composed of the q trailing eigenvectors, where q = 10 is the dimensionality of the new features. Consequently, P_S and P_T can be obtained from P. For the matrix A, N_T relates to the pseudolabel vector ŷ_T. Since ŷ_T is unknown initially, it is first set to an all-zero vector. At each subsequent iteration, ŷ_T is updated using sLDA as the classifier. MEKT performs five iterations in total.
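The final solve reduces to a generalized eigenproblem whose q trailing (smallest-eigenvalue) eigenvectors form P; a sketch with a small ridge added for numerical stability (the ridge is our assumption, not part of MEKT):

```python
import numpy as np
from scipy.linalg import eigh

def solve_projection(M, J, q=10):
    """Solve the generalized eigenproblem M p = lambda J p and keep the
    q trailing eigenvectors as the projection matrix P. M plays the role
    of (alpha*A + beta*B + gamma*E + theta*G); J collects the constraints."""
    vals, vecs = eigh(M, J + 1e-6 * np.eye(J.shape[0]))  # ridge keeps J PD
    return vecs[:, :q]  # scipy returns eigenvalues in ascending order
```

On a toy diagonal problem the trailing eigenvectors pick out the directions with the smallest cost, as expected.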

The Proposed Supervised Manifold Embedded Knowledge Transfer.
Supervised manifold embedded knowledge transfer (sMEKT) is an extension of MEKT to the supervised setting. For sMEKT, all tangent feature vectors from the target subject are divided into the labelled and unlabelled target domains. Only the labelled source and target domains are used to train a subject-specific classifier. The remaining unlabelled target domain is used to evaluate the performance of sMEKT.
To make full use of the available data, the following two questions should be considered for sMEKT: (1) How should the appropriate regularization terms and constraints be chosen in the formulation of sMEKT? (2) How can the unlabelled target domain be mapped into the projected subspace, given that sMEKT only obtains the projection matrices for the labelled source and target domains?
This section presents the corresponding solutions to the above questions. Let P_TL be the projection matrix of the labelled target domain.
First, we exclude the target domain locality preservation because the limited labelled target domain cannot effectively capture the entire geometric structure of the target domain in the absence of the unlabelled target domain.
Secondly, we keep the source domain discriminability preservation as in (12) and do not add a labelled target domain discriminability preservation term of the form

min_{P_TL} tr(P_TL^T S_w^TL P_TL), subject to: P_TL^T S_b^TL P_TL = I,

where S_w^TL and S_b^TL are the within-class and between-class scatter matrices of the labelled target domain, respectively. The reason is that sMEKT aims at transferring the labelled source domain to the target domain. Therefore, the source domain discriminability should be taken into consideration, rather than the target domain discriminability.
In addition, the crucial work of MEKT is to minimize the joint probability distribution shift between the source and target domains. In sMEKT, since we only have the labelled source and target domains, the joint probability distribution shift minimization is updated as below:

min_{P_S,P_TL} D_{S,TL} = min_{P_S,P_TL} ‖P_S^T F_S N_S − P_TL^T F_TL N_TL‖_F²,

where y_TL and n_TL are the label vector and the size of the labelled target domain F_TL, respectively, n_TL^k and F_{TL,i}^k are the size and the ith tangent feature vector of F_TL belonging to the kth class, and N_TL = Y_TL/n_TL, in which Y_TL is the one-hot encoding matrix of y_TL.
Finally, we remove the parameter transfer and regularization. In (15) and (16), MEKT pays more attention to minimizing the differences between P_S and P_T because it sets θ = 20. In our opinion, this minimization can further shorten the gap between the labelled source domain and the unlabelled target domain, which benefits the classification of the latter. If we imposed the similar constraint min_{P_S,P_TL} ‖P_S − P_TL‖_F² with the same weight, it might not play the same role as the constraint in (15), since the limited labelled target domain may not represent the remaining unlabelled target domain well. Calculating the optimal θ with cross-validation could make up for this limitation, but it would add computational burden. Thus, the overall loss function of sMEKT can be formulated by

min_{P_S,P_TL} αD_{S,TL} + β tr(P_S^T S_w^s P_S), subject to: P_S^T S_b^s P_S = I. (22)

Let P = [P_S; P_TL] (P ∈ R^(2d×q)). Then, the corresponding Lagrange function is

T = tr(P^T(αA′ + βB′)P + μ(I − P^T J′ P)),

where A′, B′, and J′ are defined analogously to A, B, and J in (17). Then, (23) can be solved by the same means as (17). Since Y_TL in N_TL is the one-hot encoding matrix of the label vector y_TL, rather than that of a pseudolabel vector, we can obtain the optimal P without multiple iterations.
As for the second question presented above, we assume that the labelled and unlabelled target domains have similar joint probability distributions. Moreover, we minimize the joint probability distribution shift between the labelled source and target domains in (21). Thus, we define the average of the projection matrices of the labelled source and target domains as the projection matrix of the unlabelled target domain:

P_TU = (P_S + P_TL)/2.

Therefore, the new unlabelled target feature set is P_TU^T F_TU, where F_TU is the unlabelled target domain.

The Proposed Semisupervised Manifold Embedded Knowledge Transfer.
Semisupervised manifold embedded knowledge transfer (ssMEKT) utilizes the labelled source and target domains, as well as the unlabelled target domain.
We construct the following regularization terms and constraints for ssMEKT.
First, we keep the target domain locality preservation because of the existence of the labelled and unlabelled target domains. Let F_T = F_TL ∪ F_TU. Note that F_TL and F_TU are obtained separately after performing RA and TSM on their original domains. Then, we can minimize the graph regularization using all data from the target domain as in Section 3.6. Accordingly, we retain the projection matrix for the entire target domain, denoted as P_T. Then, like sMEKT, the source domain discriminability preservation is considered.
Additionally, to benefit the classification of the unlabelled target domain, we reduce the joint probability distribution discrepancies between the labelled and unlabelled domains, instead of those between the source and target domains. Actually, MEKT also minimizes the joint probability distribution shift between the labelled and unlabelled domains, since its source domain is labelled and its target domain is unlabelled. For ssMEKT, the joint probability distribution shift minimization can be rewritten as

min_{P_S,P_T,P_TL} D_{L,U} = min_{P_S,P_T,P_TL} D(Q(F_S ∪ F_TL, y_S ∪ y_TL), Q(F_TU, ŷ_TU)),

which is expanded in the same joint-MMD form as in Section 3.7, where ŷ_TU and n_TU are the pseudolabel vector and the size of the unlabelled target domain F_TU, respectively, and F_{TU,i}^k and n_TU^k are the ith tangent feature vector and the size of the unlabelled target domain predicted to be the kth class, respectively. Let N_TU = Y_TU/n_TU, in which Y_TU is the one-hot encoding matrix of ŷ_TU. For simplicity, P_T is temporarily used as the projection matrix of the unlabelled target domain since it relates to the unlabelled target domain.
Finally, we keep and update the parameter transfer and regularization, since there are abundant tangent feature vectors in the target domain. We want the projection matrix P_T learned on the entire target domain to be similar to the projection matrix P_S learned on the source domain and to the projection matrix P_TL learned on the labelled target domain. For better generalization performance, we also avoid extreme values for these projection matrices. Therefore, we redefine the following constraints:

min_{P_S,P_T,P_TL} (‖P_T − P_S‖_F² + ‖P_T − P_TL‖_F² + ‖P_T‖_F²).

After integrating the regularization terms and the constraints above, the overall loss function of ssMEKT can be formulated as follows:

min_{P_S,P_T,P_TL} αD_{L,U} + β tr(P_S^T S_w^s P_S) + γ tr(P_T^T F_T L F_T^T P_T) + θ(‖P_T − P_S‖_F² + ‖P_T − P_TL‖_F² + ‖P_T‖_F²).

Given the overall projection matrix P = [P_S; P_T; P_TL] (P ∈ R^(3d×q)), the corresponding Lagrange function can be reformulated as

T = tr(P^T(αA″ + βB″ + γE″ + θG″)P + μ(I − P^T J″ P)),

where A″, B″, E″, G″, and J″ are defined analogously to the corresponding matrices in (17). Then, we can obtain the optimal P, P_S, P_T, and P_TL in the same way as in MEKT and sMEKT. Like MEKT, N_TU is updated along with the change of Y_TU at each iteration.
Finally, we choose and average the two most relevant projection matrices, P_T and P_TL, to form the projection matrix of the unlabelled target domain:

P_TU = (P_T + P_TL)/2.

3.9. Classification.
As depicted in Figure 1, the tangent feature vector sets from different domains can be transformed into new feature sets by the different MEKT-based algorithms. However, only the labelled feature sets are input to the supervised sLDA classifier to build a subject-specific model. For MEKT and ssMEKT, the pseudolabels of the new unlabelled target features assigned by the model can be used to iteratively update the projection matrices and the model. Additionally, the goal of LDA is to find an optimal hyperplane that simultaneously maximizes the between-class variance and minimizes the within-class variance of the two-class projected data. To cope with high-dimensional data, sLDA uses a shrinkage estimate for the average covariance matrix of each class in the LDA algorithm. More details can be found in [52].
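The sLDA step can be approximated with scikit-learn's LDA using Ledoit-Wolf shrinkage; this is a stand-in for the sLDA of [52], shown here on synthetic two-class data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Shrinkage LDA: the 'lsqr' solver with shrinkage='auto' uses a
# Ledoit-Wolf shrinkage estimate of the covariance, which regularises
# the classifier for high-dimensional tangent features.
slda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

rng = np.random.default_rng(0)
# Two well-separated synthetic classes, 20-dimensional features
X = np.vstack([rng.normal(-1, 1, (40, 20)), rng.normal(1, 1, (40, 20))])
y = np.array([0] * 40 + [1] * 40)
slda.fit(X, y)
```

In the full pipeline, `X` would be the concatenated labelled projected features P_S^T F_S (and P_TL^T F_TL), and prediction on P_TU^T F_TU would yield the pseudolabels used in the iterative update.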

Baseline Algorithms.
We compared the following six baseline algorithms with various transfer learning properties: (1) CSP-LDA is the classical combination of feature extraction and classification for MI. No source domain is used at all; only the labelled target domain is used to design the CSP spatial filters and then to train the LDA classifier [33]. (2) RA-CCSP-LDA separately performs RA for the labelled source and target domains as in (7), then concatenates them with equal weight to compute the CSP spatial filters, and finally feeds the resulting features into the LDA classifier [38]. (3) RA-RCSP-LDA is similar to RA-CCSP-LDA except for the way the spatial filters are generated: RCSP weights the labelled source and target domains using different regularization parameters [35]. To reduce the computational cost and give a bigger weight to the labelled target domain, we manually set both regularization parameters to 0.1.
(4) RA-CCSP-wAR successively executes RA and CCSP before wAR. wAR is also a semisupervised transfer learning algorithm, which performs weighted domain adaptation between the labelled source domain and the entire target domain using SVM as the base classifier [49]. (5) RA-RCSP-wAR sequentially performs RA, RCSP, and wAR. For RA-CCSP-wAR and RA-RCSP-wAR, the hyperparameters of wAR were set according to the corresponding publication [49]. Note that three pairs of spatial filters were used for all spatial filtering-based algorithms in our experiments. (6) MEKT-sLDA first performs unsupervised MEKT and then feeds the new labelled features from the source domain into the supervised sLDA classifier [32].
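The RA step shared by all these pipelines can be sketched as recentring the trial covariance matrices at the identity. For brevity this sketch uses the arithmetic mean as the reference matrix; RA proper uses the Riemannian mean (e.g. pyRiemann's `mean_riemann`):

```python
import numpy as np

def inv_sqrtm(R):
    """Inverse square root of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(w)) @ V.T

def riemannian_alignment(covs):
    """Recentre trial covariance matrices at the identity, RA-style:
    C_i -> R^{-1/2} C_i R^{-1/2}. The reference R is the arithmetic mean
    here as a simplification; the actual algorithm uses the Riemannian
    mean of the covariance matrices."""
    R = covs.mean(axis=0)
    W = inv_sqrtm(R)
    return np.array([W @ C @ W for C in covs])
```

Applied per domain, this whitening makes the source and target trials comparable before spatial filtering or tangent space mapping.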
A summary of the six baseline algorithms and the two proposed algorithms is shown in Table 1.

Experimental Design.
For each dataset, all trials from each target subject were randomly partitioned into two portions over twenty repetitions. The first portion was the labelled set used to train a subject-specific classifier for the supervised and semisupervised algorithms, while the second portion was the unlabelled set used to build the classifier for the unsupervised and semisupervised algorithms and to evaluate the effectiveness of the different algorithms. All trials from the remaining source subjects were labelled and concatenated into a source domain to be transferred to the target domain. For each target subject, we varied the number of labelled trials from 10 to 50 in steps of 10 to investigate the robustness of all algorithms. For simplicity, we denote a trial from the target subject as a target trial and a trial from a source subject as a source trial.
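The partitioning protocol above can be sketched with a small numpy helper (the function name and signature are ours, not from the paper):

```python
import numpy as np

def split_target_trials(y, n_labelled, rng):
    """Randomly pick n_labelled trials (split evenly across the two classes)
    as the labelled target set; all remaining trials form the unlabelled set
    used for adaptation and evaluation."""
    labelled = []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        labelled.extend(idx[: n_labelled // 2])
    labelled = np.sort(np.array(labelled))
    unlabelled = np.setdiff1d(np.arange(len(y)), labelled)
    return labelled, unlabelled
```

Repeating this split twenty times with different random states and averaging the accuracies reproduces the evaluation scheme described above.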

Classification Accuracy with Few Labelled and/or More Unlabelled Target Trials.
The goal of our proposed algorithms is to achieve good classification performance even with few labelled target trials. Thus, we first conducted experiments with few labelled and/or more unlabelled target trials. For the two MI datasets, ten labelled target trials, with an equal number per class, were randomly selected over twenty repetitions. Then, 270 and 190 unlabelled target trials were available for dataset 1 and dataset 2, respectively. For each target subject, the classification accuracy on the unlabelled target trials was averaged over the twenty repetitions. Detailed results for the two MI datasets are given in Tables 2 and 3. The bold-faced and italic numbers indicate the best and second-best classification accuracies, respectively.
Defining BCI illiteracy has been challenging because different researchers use different criteria to distinguish between good and bad subjects. Early work suggested that 70% accuracy is sufficient for effective binary classification [58, 59]. In Table 2, the nontransfer learning algorithm CSP-LDA reaches or approaches the benchmark accuracy of 70% for subjects al, aw, and ay. Therefore, in this paper, subjects al, aw, and ay are grouped as good subjects, whereas subjects aa and av are grouped as bad ones. The transfer learning algorithms clearly improve the classification performance for the bad subjects aa and av, and only MEKT-sLDA and ssMEKT-sLDA obtain satisfactory accuracy for bad subject av. Furthermore, sMEKT-sLDA outperforms the other supervised transfer learning algorithms, RA-CCSP-LDA and RA-RCSP-LDA, on average when the number of labelled target trials is as low as 10. RA-CCSP-wAR and RA-RCSP-wAR perform slightly better than their corresponding supervised transfer learning algorithms. Our proposed ssMEKT-sLDA stands out among all semisupervised transfer learning algorithms.
In Table 3, CSP-LDA provides the benchmark accuracy of about 70% for subjects e, f, and g, leading to the following categorization in this paper: good subjects e, f, and g; bad subjects a, b, c, and d. For the bad subjects a, b, c, and d, the improvement in classification performance of the transfer learning algorithms over CSP-LDA is substantial. However, sMEKT-sLDA yields worse classification performance than RA-CCSP-LDA and RA-RCSP-LDA on average. A possible explanation is that more than half of the subjects in dataset 2 perform MI tasks poorly. In sMEKT-sLDA, the target domain can be seriously affected by a bad source domain, since the target and source domains are simultaneously adapted towards each other during the minimization of the joint probability distribution shift. In contrast, RA-CCSP-LDA and RA-RCSP-LDA use both the target and source domains to design the spatial filters without the domains directly influencing each other. In addition, RA-CCSP-wAR and RA-RCSP-wAR still perform better than their corresponding supervised transfer learning algorithms.
As shown in Tables 2 and 3, ssMEKT-sLDA achieves slightly higher classification performance than the unsupervised MEKT-sLDA thanks to the ten labelled target trials. In addition, for the target subjects with 10 labelled samples and 270/190 unlabelled samples, ssMEKT-sLDA shows a 5.27% and 2.69% increase in average accuracy on the two datasets compared to the best competing semisupervised algorithm, RA-RCSP-wAR, respectively. We then performed paired-sample t-tests between the six baseline approaches and our proposed algorithms on the two datasets to check whether the performance differences among the algorithms were significant. The p-values are shown in Table 4. The paired-sample t-tests show that the results of our proposed algorithms are statistically higher than those of CSP-LDA (p < 0.005). Although both proposed algorithms are based on MEKT-sLDA, their performance differences are considerable. In most cases, ssMEKT-sLDA shows its superiority in terms of the p-values, especially on dataset 1. The performance differences between sMEKT-sLDA and the other transfer learning algorithms are small when sMEKT-sLDA performs well.

Table 3: Classification accuracy (%) with 10 labelled target trials and/or 190 unlabelled target trials on dataset 2, for CSP-LDA [33], RA-CCSP-LDA [38], RA-RCSP-LDA [35], RA-CCSP-wAR [49], RA-RCSP-wAR [49], MEKT-sLDA [32], and our proposed algorithms.

Table 4: The p-values of paired-sample t-tests on the two datasets between our proposed algorithms and CSP-LDA [33], RA-CCSP-LDA [38], RA-RCSP-LDA [35], RA-CCSP-wAR [49], RA-RCSP-wAR [49], and MEKT-sLDA [32].

As depicted in Figures 4(b), 4(d), and 4(e), for the good subjects al, aw, and ay, CSP-LDA shows better classification performance than the supervised and semisupervised transfer learning algorithms as the number of labelled target trials increases. As shown in Table 2, these subjects perform well even with few labelled target trials. Thus, as their labelled target trials increase, their increasing between-class discriminability gradually reduces their dependence on transfer learning. As illustrated in Figures 4(a) and 4(c), for the bad subjects aa and av, most transfer learning algorithms outperform CSP-LDA in most cases; transferring the labelled source trials is necessary for them because of their poor between-class discriminability. In Figures 4(a), 4(b), 4(c), and 4(e), the average performance of sMEKT-sLDA is higher than that of RA-RCSP-LDA for the bad subjects aa and av, while RA-RCSP-LDA performs better than sMEKT-sLDA on average for the good subjects al and ay. The reason is that many more good source subjects are available for the bad target subjects aa and av than for the good target subjects al and ay. The bad target subjects aa and av benefit from the domain adaptation used in sMEKT-sLDA, while the good target subjects al and ay benefit from their bigger weights in RA-RCSP-LDA. As shown in Figure 4(f), for all subjects, the classification performance of MEKT-sLDA decreases as the number of unlabelled target trials is reduced. On average, ssMEKT-sLDA shows compelling validity.
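The paired-sample t-test used for Table 4 reduces to `scipy.stats.ttest_rel` applied to per-subject accuracies. The numbers below are made up for illustration and are not the paper's results:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-subject accuracies (%) for a baseline and a proposed
# algorithm, evaluated on the same subjects (hence a *paired* test).
acc_baseline = np.array([65.2, 58.1, 90.3, 88.7, 84.0])
acc_proposed = np.array([71.0, 66.5, 94.1, 93.2, 88.4])

# Two-sided paired-sample t-test on the per-subject differences.
t_stat, p_value = ttest_rel(acc_proposed, acc_baseline)
```

A small p-value indicates that the per-subject improvements are unlikely to be due to chance alone.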

Classification Accuracy with Varying Numbers of Labelled and/or Unlabelled Target Trials.
As shown in Figures 5(e)-5(g), for the good subjects e, f, and g, owing to their good between-class discriminability, the classification performance of CSP-LDA approaches that of the transfer learning algorithms as the labelled target trials increase. As depicted in Figures 5(b) and 5(c), for the bad subjects b and c, owing to their poor between-class discriminability, the transfer learning algorithms maintain clear advantages over CSP-LDA as the labelled target trials increase. As illustrated in Figures 5(a)-5(h), on average, sMEKT-sLDA performs worse than most supervised transfer learning algorithms, since a poor source domain can affect the discriminability of the target domain during domain adaptation. For subjects a, d, e, and f, RA-RCSP-wAR achieves higher accuracy than RA-CCSP-wAR in most cases. All semisupervised transfer learning algorithms perform better than their supervised or unsupervised counterparts. Moreover, ssMEKT-sLDA shows its superiority for four out of seven subjects. The computation times of all algorithms are summarized in Table 5, where the best performance is highlighted in bold.
As shown in Table 5, CSP-LDA has the shortest computation time among all algorithms. The computational cost of sMEKT-sLDA is slightly higher than that of RA-RCSP-LDA due to its higher-dimensional features. ssMEKT-sLDA requires more time than the other MEKT-based algorithms because more trials are available to it. However, among all semisupervised transfer learning algorithms, even with higher-dimensional features, ssMEKT-sLDA runs faster than RA-CCSP-wAR and RA-RCSP-wAR on average. A possible reason is that wAR is computationally more complex than ssMEKT.

Discussion
In this section, we discuss the experimental results from various aspects.

Effectiveness of Riemannian Alignment and Transfer Learning.

In our experiments, all algorithms except CSP-LDA perform RA and transfer learning. They first execute RA for the different domains in an unsupervised way, which not only makes the domains comparable but also mitigates the impact of limited labelled target trials. For all spatial filtering-based transfer learning algorithms (RA-CCSP-LDA, RA-RCSP-LDA, RA-CCSP-wAR, and RA-RCSP-wAR), the EEG trials are whitened by RA. Likewise, for all MEKT-based transfer learning algorithms, the tangent feature vectors from different subjects are brought close to each other by RA. Thus, RA shortens the differences between domains, which benefits the subsequent transfer learning. As mentioned above, the different transfer learning algorithms combine the domains in different ways. In Tables 2 and 3, even with few labelled and/or more unlabelled target trials, the supervised, unsupervised, and semisupervised transfer learning algorithms achieve better average classification accuracies than CSP-LDA with the help of abundant labelled source trials. This implies that both RA and transfer learning contribute to the good classification performance.
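The tangent space mapping (TSM) step that follows RA in the MEKT-based pipelines can be sketched as follows. After RA the reference point is the identity, so the tangent vector is simply the weighted upper triangle of the matrix logarithm; this simplified identity-reference form is our assumption (pyRiemann's `tangent_space` implements the general case):

```python
import numpy as np

def spd_logm(C):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def tangent_space_features(aligned_covs):
    """Map RA-aligned SPD covariance matrices to tangent vectors at the
    identity: vectorise the upper triangle of logm(C), scaling off-diagonal
    entries by sqrt(2) so the Euclidean norm of the vector matches the
    Frobenius norm of the matrix."""
    d = aligned_covs.shape[1]
    iu = np.triu_indices(d)
    weights = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return np.array([weights * spd_logm(C)[iu] for C in aligned_covs])
```

These d(d+1)/2-dimensional vectors are the features that the MEKT-based algorithms subsequently project and classify in Euclidean space.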
However, as shown in Figures 4 and 5, for the good target subjects al, aw, ay, e, f, and g, the between-class discriminability increases with the number of labelled trials; thus, the classification accuracies of CSP-LDA gradually approach, or even exceed, those of the transfer learning algorithms. In contrast, for the bad target subjects aa, av, b, and c, the transfer learning algorithms consistently outperform CSP-LDA as the labelled target trials increase. Therefore, the number of labelled target trials and the extent of between-class discriminability of the target subjects greatly affect the performance of the transfer learning algorithms.

Differences between Spatial Filtering and Manifold Embedded Knowledge Transfer.
The CSP features used in all spatial filtering-based transfer learning algorithms have lower dimensions than the transformed tangent space features.

Table 5: Computation times (seconds) of CSP-LDA [33], RA-CCSP-LDA [38], RA-RCSP-LDA [35], RA-CCSP-wAR [49], RA-RCSP-wAR [49], MEKT-sLDA [32], and our proposed algorithms on the two datasets.

To keep or highlight the importance of the target domain, the CCSP-based algorithms weight the target domain equally with the source domain, whereas the RCSP-based algorithms weight the target domain more heavily than the source domain. Consequently, the source and target domains play their roles independently in the spatial filtering-based algorithms. However, to further reduce the differences between the source and target domains, all MEKT-based transfer learning algorithms focus on minimizing the joint probability distribution shift between domains. Therefore, the target domain is easily affected by the source domain during this minimization, and the degree of positive transfer depends on the quality of the source domain.
As shown in Tables 2-4, most MEKT-based transfer learning algorithms perform better than the spatial filtering-based ones. However, on dataset 2, the average classification accuracy of sMEKT-sLDA is inferior to that of the spatial filtering-based transfer learning algorithms, though still superior to that of CSP-LDA. A possible reason is that only three out of seven subjects perform MI tasks well on dataset 2.
Thus, the poor source domain hampers the positive transfer of sMEKT-sLDA, and our supervised transfer learning algorithm sMEKT-sLDA should therefore identify the most suitable source subjects instead of using all available source subjects. In addition, owing to the unlabelled target domain, MEKT-sLDA and ssMEKT-sLDA reduce the negative impact of a poor source domain. As given in Table 5, in terms of computation time, all supervised transfer learning algorithms run efficiently. Furthermore, ssMEKT-sLDA takes less time than RA-CCSP-wAR and RA-RCSP-wAR. Overall, our proposed MEKT-based algorithms provide comparably good performance with efficient running times.

Impact of Labelled and/or Unlabelled Target Trials.
To investigate the roles of labelled and unlabelled target trials, the algorithms can be divided into supervised, unsupervised, and semisupervised ones. As shown in Tables 2 and 3, with abundant unlabelled target trials, MEKT-sLDA outperforms the spatial filtering-based algorithms and sMEKT-sLDA on average. Thus, a large unlabelled target domain is beneficial for classification because of its embedded geometric structure. Moreover, with the help of a few labelled target trials, the performance of ssMEKT-sLDA is slightly better than that of MEKT-sLDA. As illustrated in Figures 4 and 5, as the number of labelled target trials increases and the number of unlabelled target trials decreases, the performance improvement of ssMEKT-sLDA over MEKT-sLDA grows. As depicted in Figures 4(f) and 5(h), the curve of CSP-LDA rises quickly as the labelled target trials increase, which implies that the labelled target trials are crucial for classification performance. Since the labelled and unlabelled target trials are far fewer than the labelled source trials, the performance improvements of the supervised and semisupervised transfer learning algorithms are not as apparent as the labelled target trials increase.
Additionally, as shown in Figures 4 and 5, for the good target subjects al, ay, e, f, and g, the classification accuracies of ssMEKT-sLDA are always similar to those of MEKT-sLDA. A possible reason is that both the unlabelled and labelled target trials of good target subjects provide important information. However, for the bad target subjects aa, av, a, and d, the differences between ssMEKT-sLDA and MEKT-sLDA are comparably obvious. A possible explanation is that the labelled target trials of bad target subjects provide more valuable information than their unlabelled target trials. Overall, the labelled target trials play an important role in classification, and the unlabelled target trials are also beneficial, especially for good target subjects.
Note that our proposed ssMEKT-sLDA does not utilize the unlabelled target trials to train the classifier, which limits its performance improvement.

Conclusion
To shorten the calibration time for the target subject, we propose a supervised MEKT algorithm (sMEKT) and a semisupervised MEKT algorithm (ssMEKT) for MI-based BCI; both are combined with the sLDA classifier. Owing to high intersubject variability, it is better to build a subject-specific classifier rather than a generic one. Both sMEKT and ssMEKT transfer abundant labelled samples from multiple source subjects to a specific target subject. First, they perform RA in an unsupervised way to preliminarily reduce the differences among subjects. Then, they convert the aligned covariance matrices from the different subjects into the corresponding tangent feature vectors for classification in the Euclidean space. Finally, to further cope with variations among subjects, sMEKT performs domain adaptation between the labelled source and target domains, whereas ssMEKT performs domain adaptation between the labelled source domain and the entire target domain. During adaptation, both sMEKT and ssMEKT not only minimize the joint probability distribution shift among domains but also maintain the source domain discriminability as much as possible. Moreover, ssMEKT preserves the locality of the entire target domain to exploit its geometric structure. In addition, ssMEKT performs parameter transfer and regularization to make the projection matrices of the different domains closer. For the target subjects with 10 labelled samples and 270/190 unlabelled samples, ssMEKT stands out with average accuracies of 82.50% and 81.94% on dataset 1 and dataset 2, respectively, and sMEKT also obtains higher classification performance (79.02%) than the other spatial filtering-based transfer learning algorithms on dataset 1. Therefore, the experimental results show that our proposed algorithms can reduce the need for labelled target trials.
In the future, we will not only choose the most beneficial source subjects for the target subjects, but also make use of the unlabelled target samples in the classification module. In addition, our proposed ssMEKT is designed offline since the unlabelled samples from the target subject are obtained a priori, instead of on-the-fly. Future work will be dedicated to updating ssMEKT in real-time BCI applications, where domain adaptation is performed between the labelled source domain and the increasing target domain.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.