Multiclass Informative Instance Transfer Learning Framework for Motor Imagery-Based Brain-Computer Interface

A widely discussed paradigm for brain-computer interface (BCI) is the motor imagery task using the noninvasive electroencephalography (EEG) modality. It often requires a long training session to collect a large amount of EEG data, which exhausts the user. One approach to shortening this session is to use instances from past users to train the learner for a new user. In this work, direct transfer from past users is investigated and applied to multiclass motor imagery BCI. Then, active learning (AL) driven informative instance transfer learning is attempted for multiclass BCI. Informative instance transfer performs better than direct instance transfer: it reaches the benchmark using less training data (49% less on average) for 6 out of 9 subjects. However, neither method is superior for all subjects. To obtain a generic transfer learning framework for BCI, an optimal ensemble of informative and direct transfer methods is designed and applied. The optimized ensemble outperforms both direct and informative transfer for all subjects except one in the BCI competition IV multiclass motor imagery dataset. It achieves the benchmark performance for 8 out of 9 subjects using 75% less training data on average. Thus, the amount of training data required from a new user is reduced significantly.


Introduction
Brain-computer interface (BCI) is a system that establishes a communication channel between the brain and control devices without using the neuromuscular system of human body [1].
One of the noninvasive modalities of BCI is electroencephalography (EEG). BCI uses different types of EEG control signal from the external scalp of the brain. Some of the control signals used in BCI are visual evoked potential (VEP), P300 evoked potential, slow cortical potential (SCP), and sensory-motor rhythms (SMR) [2]. SMR can be modulated by actual as well as imagery motor task by user [3,4]. Thus, SMRs are used in BCI as the control signal for translating motor task (hand and foot movement) [5] and referred to as motor imagery-(MI-) BCI. Hence, MI-BCIs are used for supporting patients with spinal cord injury and stroke [6][7][8]. MI-based BCI system possesses some drawbacks such as lack of robustness, complex setup, and long calibration time [1,9,10].
Generally, it is recommended to use at least five times more training data per class than the number of features [11,12]. Channel-frequency-time information makes the feature vector of the EEG signal very high-dimensional [13,14]. These high-dimensional features require a large number of EEG epochs to train the classifier [15,16]. However, EEG data acquisition is a lengthy and exhausting task for the user. For motor imagery, it is sometimes a day-long process [10,16]. EEG signals recorded from the scalp are highly subject-specific: they vary from one subject to another for the same task and even differ for the same subject across sessions [4]. Consequently, each individual has to go through this long data collection process every time the system is used. Long calibration time has therefore become one of the bottlenecks of BCI systems, as the calibration time reduction approaches reported in the literature also reflect [17][18][19][20][21][22].
If labeled samples for certain tasks are available from other users, these samples can be used for a new user. The objective is to utilize the knowledge from data spaces of past users to learn a predictive function for a new user. This process of conveying knowledge and information from other domains is known as transfer learning (TL) [23].

Computational Intelligence and Neuroscience
TL has been applied to BCI in two ways: domain adaptation and rule or knowledge sharing [24]. Some of the domain adaptation approaches are subject-invariant common space [18,19,25,26,27,28], common stationary subspace transfer [29][30][31], conditional and marginal distribution adaptation [32,33], and subject-to-subject adaptation [34]. Rule adaptation, or sharing prior learning to learn the prediction function of a new user, has been attempted in [20,26,28,35]. An active transfer learning (ATL) approach was proposed and implemented by Wu et al. in [36] for VEP-based BCI. In their work, actively learned samples from the domain of the new subject were combined with the samples of historical subjects. Query by committee (QBC) was used as the active learning method to select samples from the subject-specific domain. The authors used all samples from other subjects directly without any adaptation or selection. An improved version of ATL was proposed and implemented for binary MI-based BCI in our preliminary work [37,38]. Both works implemented ATL for binary classification: in [36], the authors classified target and nontarget VEP, while our preliminary work classified left-hand and right-hand motor imagery with two different feature extraction processes in sequence. Since instances are transferred directly from the source to the target domain, this approach is named direct transfer with active learning (DTAL). DTAL still needs to be investigated for multiclass BCI. Instead of direct transfer, an informative and functional subdomain transfer from source to target also needs to be introduced in DTAL. In addition to finding actively learned samples from the target domain (as in DTAL), active learning based on the most uncertain samples from the source to the target domain is introduced in this work. To serve these purposes, the following attempts are made in this paper:
(i) Multiclass direct transfer with active learning (mDTAL) is formulated and implemented. It is the multiclass extension of active transfer learning proposed in [36] for motor imagery BCI (Section 3.1).
(ii) Then, aligned instance transfer is introduced for multiclass MI-based BCI (Section 3.2).
(iii) After that, informative instances transfer framework is formulated and implemented with and without aligned subspace. Here, multiclass entropy as uncertainty criterion is applied in the source to target domain transfer (Section 3.3).
(iv) To address the subject-dependent performance variation of different methods, a generic optimized weighted ensemble of all proposed methods is constructed and applied (Section 3.4).
The main goal of this work is to develop an informative transfer learning framework for MI-BCI which is expected to perform better than direct transfer (mDTAL). The rest of the paper is organized as follows: Section 2 describes the concepts and methods used in the development of the algorithms. Section 3 describes the developed multiclass frameworks and the optimized ensemble method. Section 4 describes the experimental setup. Section 5 analyzes and discusses the results. Finally, Section 6 concludes the paper with the scope of future improvement.

Transfer Learning (TL).
At first, we need to define some terms to state our problem in the scope of transfer learning.

Transfer Learning [23]. Given a source domain $D_S$ and learning task $T_S$, a target domain $D_T$, and learning task $T_T$, transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$ or $T_S \neq T_T$.

The dataset of EEG epochs from a new user is the target domain. Labeled EEG epochs from past users form the source domain. The same feature extraction method is applied to both target and source EEG epochs, so the feature spaces are the same: $\mathcal{X}_S = \mathcal{X}_T$. The same types of classes are labeled for imagery EEG epochs in both domains, so the label spaces are the same: $\mathcal{Y}_S = \mathcal{Y}_T$. However, the neural responses of different subjects to the same motor imagery action have different characteristics. As a result, the marginal and conditional distributions differ between the source and target domains [33], that is, $P(X_S) \neq P(X_T)$ and $P(Y_S \mid X_S) \neq P(Y_T \mid X_T)$. Samples from the source domain therefore cannot represent the target domain correctly, and a subdomain of the source that is closely related to the target domain needs to be found efficiently. The aim of TL is to learn a target prediction function $f_T : X_T \to Y_T$ so that the expected error on $D_T$ is as low as possible while $P(X_S) \neq P(X_T)$ and $P(Y_S \mid X_S) \neq P(Y_T \mid X_T)$.
In this paper, our approach is to select the most informative instances from the source domains with the help of a few samples of the target domain. Then, we add them to the target domain samples to train a classifier for predicting the labels of independent test data of the target domain.

Active Learning (AL).
An active learning method queries the unlabeled samples that have the most uncertainty [40]. A hypothesis trained on labeled samples is uncertain about some unlabeled samples, which lie closer to the decision boundary. Labeling these uncertain samples accelerates the learning process of the model, so they carry more information than samples the model is already certain about (Figure 1). In this work, active learning is applied at two ends. First, query by committee is applied to select the most informative samples from the target domain. Then, entropy is applied as an uncertainty measure to select informative samples from the source domain.
Query by Committee (QBC) [41]. A hypothesis is a particular set of parameters that is tuned on a training set and can make predictions on new data. The hypothesis space $\mathcal{H}$ is the set of all possible hypotheses. The version space is the subset of these hypotheses that are consistent with the labeled training set $L$ (Figure 2). Consistent means that a member of the version space makes a correct prediction on all instances of $L$. One of the aims of AL is to select instances that narrow down this version space, so that the target prediction function is learned more precisely with fewer labeled instances.
QBC maintains a committee of hypotheses (version space) $C = \{\theta_0, \theta_1, \theta_2, \ldots, \theta_N\}$ (Figure 2). Each member of this committee is trained on labeled data and represents a candidate hypothesis ($h_1, h_2, \ldots, h_N$). Then, each member of the committee votes on the label of each unlabeled sample. The instances on which the members disagree most are considered the most informative. From an analytical perspective, a QBC implementation has two steps:
(i) Construction of a committee of hypotheses which depict various regions of the version space, from specific to general (Figure 2).
(ii) Quantification of the disagreement among the members of the committee.
In this work, linear discriminant analysis (LDA) is our learning model. For each binary problem, this model gives a negative decision score for one class and a positive score for the other, so the decision boundary is ideally the zero-score line. It is unlikely that a single sample receives both an extreme negative score (−1) and an extreme positive score (+1) from different members. For a certain instance, most members agree, and the sum of their decision scores is extreme. For an uncertain instance, no class receives an extreme score, and the absolute value of the summed score is close to zero. In the case of LDA, an ensemble sum of decision scores close to zero therefore represents more disagreement among the members, and the instances with the lowest absolute value of the algebraic sum of decision scores from the committee members are the most informative.
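The disagreement measure described above can be illustrated with a minimal sketch. The function name `select_most_uncertain` and the score values are hypothetical and only serve the illustration; the sketch assumes that each committee member returns one signed decision score per unlabeled sample.

```python
def select_most_uncertain(committee_scores, n_select):
    """Return indices of the n_select samples whose summed score is closest to zero.

    committee_scores[j][i] is the signed decision score of committee member j
    for unlabeled sample i (hypothetical layout chosen for this sketch).
    """
    n_samples = len(committee_scores[0])
    # Algebraic sum of the decision scores over all committee members.
    summed = [sum(member[i] for member in committee_scores) for i in range(n_samples)]
    # Smallest absolute sum first: these samples show the most disagreement.
    ranked = sorted(range(n_samples), key=lambda i: abs(summed[i]))
    return ranked[:n_select]


# Hypothetical scores from a three-member committee on four unlabeled samples.
scores = [
    [0.9, -0.8, 0.10, -0.2],
    [0.7, -0.9, -0.20, 0.3],
    [0.8, -0.7, 0.05, -0.3],
]
picked = select_most_uncertain(scores, n_select=2)
print(picked)
```

In this toy example the members agree strongly on the first two samples (large positive or negative sums), so the last two samples, whose sums are near zero, are the ones queried.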
Entropy. Entropy is the amount of information needed to encode a distribution [42]. It is used as a measure of uncertainty. For binary classification, entropy is maximal at posterior probability 0.5. For multiclass classification, entropy is highest in the central confusing region of the posterior probabilities, because it considers the probability distribution over all classes:
$$H(x_i) = -\sum_{k=1}^{K} P_\theta(y_k \mid x_i)\,\log P_\theta(y_k \mid x_i). \tag{1}$$
Here, $P_\theta(y_k \mid x_i)$ is the predicted probability of the $i$th sample for class $k$ by the model $\theta$. $K$ is the number of classes.
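As a small numerical illustration of entropy as an uncertainty measure, the sketch below evaluates it for a few hypothetical posterior distributions. The base-10 logarithm is an assumption: it is the base that reproduces the cut-off value of 0.29228 for a 60 : 40 split quoted later for the MIITAL method. The function name `entropy` is chosen for this sketch only.

```python
import math


def entropy(probs):
    """Entropy of a discrete distribution with base-10 logarithm (assumed base).

    Zero probabilities are skipped, following the convention 0 * log(0) = 0.
    """
    return -sum(p * math.log10(p) for p in probs if p > 0.0)


# Hypothetical posterior distributions.
h_confident = entropy([0.97, 0.01, 0.01, 0.01])  # one class clearly dominates
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])    # maximally confusing four-class case
h_60_40 = entropy([0.6, 0.4])                    # binary split used for the cut-off
print(round(h_confident, 5), round(h_uniform, 5), round(h_60_40, 5))
```

The uniform four-class distribution has the highest entropy, and the value decreases as one class starts to dominate, which is why high-entropy samples are treated as the informative ones.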

Feature Extraction: Common Spatial Pattern (CSP).
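This method maximizes the variance for one task and minimizes the variance for the other task, and therefore generates discriminating features of two classes for EEG classification [43][44][45]. Let $X_i \in \mathbb{R}^{ch \times T}$ be the $i$th single-trial bandpass EEG signal and $Z_i \in \mathbb{R}^{ch \times T}$ the spatially filtered signal obtained with the CSP projection matrix $W \in \mathbb{R}^{ch \times ch}$:
$$Z_i = W X_i. \tag{2}$$
Here, $ch$ is the number of channels and $T$ is the number of time points in a single-trial bandpass EEG epoch.
$\Delta_1$ and $\Delta_2$ are the covariance matrices of the EEG signals for the two classes, which can be obtained by
$$\Delta_c = \frac{1}{N_c} \sum_{i \in I_c} X_i X_i^{\top}, \quad c \in \{1, 2\}. \tag{3}$$
Here, $I_c$ is the set of indices of trials corresponding to class $c$ and $N_c$ is the total number of trials from class $c$. $W$ is the transformation matrix satisfying the following optimization, stated for each row $w$ of $W$:
$$\max_{w} \frac{w \Delta_1 w^{\top}}{w \Delta_2 w^{\top}}. \tag{4}$$
This CSP filter matrix can be obtained by solving
$$\Delta_1 W^{\top} = \Delta_2 W^{\top} \Lambda. \tag{5}$$
Here, $\Lambda$ is a diagonal matrix and it contains the eigenvalues. Generally, the first and last $m$ rows of $W$ (represented by $W^{*} \in \mathbb{R}^{2m \times ch}$) make the spatially filtered signal $Z_i^{*}$ [46]:
$$Z_i^{*} = W^{*} X_i. \tag{6}$$
Finally, the logarithm of the variance of each row of $Z_i^{*}$ gives the feature vector $f_i$:
$$f_i = \log\left(\operatorname{var}(Z_i^{*})\right). \tag{7}$$
This CSP is for the binary case. We have used four one-versus-rest (OVR) binary CSP filters for the four-class implementation [44].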

Linear Discriminant Analysis (LDA).
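LDA is simple and fast to compute [47,48], and it has been paired very successfully with CSP feature extraction for MI-based BCI. For binary classification, it deals with two scatter matrices $S_w$ and $S_b$, named within-class and between-class scatter, which are defined as follows:
$$S_w = \frac{1}{N_x} \sum_{k=1}^{K} \sum_{x_i \in c_k} (x_i - \mu_k)(x_i - \mu_k)^{\top}, \tag{8}$$
$$S_b = \frac{1}{N_x} \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^{\top}. \tag{9}$$
Here, $\mu_k$ denotes the mean vector of the $k$th class $c_k$, which contains $N_k$ samples, and $\mu$ denotes the total mean vector. $K$ and $N_x$ are the number of classes and the total number of samples, respectively. The objective is to find the projection $\omega$ that maximizes the between-class scatter and minimizes the within-class scatter:
$$\omega = \underset{\omega}{\operatorname{arg\,max}}\; \frac{\omega^{\top} S_b\, \omega}{\omega^{\top} S_w\, \omega}. \tag{10}$$
In this work, four one-versus-rest (OVR) LDA classifiers are used for 4-class classification.
The decision score is calculated by
$$D(x) = \omega^{\top} x + b. \tag{11}$$
Here, $b$ is the bias value and the sign of $D(x)$ gives the class label.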
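The following sketch illustrates how a linear decision score of this form can be used in a one-versus-rest setting with four binary classifiers. The weights, biases, and the rule of taking the class with the largest score are assumptions made for the illustration; the text does not state how the four binary outputs are fused into one label.

```python
def decision_score(weights, bias, features):
    """Linear decision score: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_ovr(models, features):
    """Score a sample with every binary model and pick the largest score.

    models is a list of (weights, bias) pairs, one per class (class versus rest).
    Choosing the maximum score is an assumed fusion rule for this sketch.
    """
    scores = [decision_score(w, b, features) for w, b in models]
    label = max(range(len(scores)), key=scores.__getitem__)
    return label, scores


# Hypothetical four-class model on two features.
models = [
    ([1.0, 0.0], 0.0),
    ([0.0, 1.0], 0.0),
    ([-1.0, 0.0], 0.0),
    ([0.0, -1.0], 0.0),
]
label, scores = predict_ovr(models, [0.2, 0.9])
print(label, scores)
```

Each score is positive when the sample lies on the "class" side of the corresponding boundary and negative on the "rest" side, so scores near zero mark samples close to that boundary.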

Multiclass Direct Transfer with Active Learning (mDTAL).
Multiclass extension of direct transfer learning with active learning, or ATL [36], is formulated for MI-based BCI. CSP is used for feature extraction combined with an LDA classifier, since this combination is very successful for MI-based BCI [16,46]. For mDTAL, we have considered the one-versus-rest (OVR) approach [49,50] in three sections of this algorithm. The stepwise process of the mDTAL algorithm is described as follows.

Figure 4: Multiclass aligned instance transfer with active learning (AITAL).

Algorithm: mDTAL
Step 1. Start with randomly chosen labeled samples $X_l$ with equal class proportion and unlabeled instances $X_u$ from the target domain. $N$ other subjects, with labeled instances $X_n$ from the $n$th subject, are available. Independent test samples $X_t$ of the new subject are given for performance evaluation.
Step 2. Train classifier $M_0$ using the $X_l$ samples. Then, $M_0$ calculates the decision score for each class of the $X_u$ samples as $D_0^{u}$.
Step 3. Train the combined classifier $M_n$ using the $X_l \cup X_n$ combined training samples.
Step 4. Get the 10-fold cross-validation accuracy $w_n$ on the $X_l \cup X_n$ training samples. Repeat Steps 3 and 4 for all historic subjects.
Step 5. Get the ensemble weighted average decision score for each class on $X_u$ using the following equation:
$$D^{u} = \frac{\sum_{n=0}^{N} w_n D_n^{u}}{\sum_{n=0}^{N} w_n}. \tag{12}$$
Here, $D_n^{u}$ is the decision score calculated using classifier $M_n$ on the unlabeled samples $X_u$. For $M_0$, the weight $w_0$ is assigned as 1 to give the subject-specific classifier higher priority over the combined classifiers. Similarly, the ensemble weighted average decision score for the test data set $X_t$ is calculated as follows:
$$D^{t} = \frac{\sum_{n=0}^{N} w_n D_n^{t}}{\sum_{n=0}^{N} w_n}. \tag{13}$$
It is the ultimate output of the algorithm in each iteration.
Step 6. The linear classifier LDA gives a negative score for one class and a positive score for the other, so a decision score close to zero represents more uncertainty. Equation (12) calculates the ensemble decision score of the $N + 1$ models, that is, of a committee of models. Unlabeled samples whose absolute decision score is lowest (closest to zero) are the most useful for learning the decision boundary. In the multiclass case, $D^{u}(:, k)$ gives the decision score for class $k$ under OVR. So, the lowest absolute decision scores of $D^{u}(:, k)$ give the most uncertain samples near the class $k$ OVR boundary as follows (Figure 6):
$$X_q^{(k)} = \underset{x \in X_u}{\operatorname{arg\,min}^{(m)}}\; \lvert D^{u}(x, k) \rvert. \tag{14}$$
Here, $k = 1, 2, \ldots, K$ (number of classes), $\operatorname{arg\,min}^{(m)}$ returns the $m$ samples with the smallest values, and $m$ is the number of samples to be selected from each class (Figure 3).
Step 7. All selected unlabeled subject-specific samples are queried for their labels. Then these newly labeled samples are added to $X_l$ and removed from $X_u$. Steps 2 to 7 are repeated until the maximum number of iterations is reached.
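Steps 5 and 6 of the algorithm can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that the ensemble score is a weighted average normalized by the sum of the weights, that the subject-specific classifier has weight 1, and that the combined classifiers are weighted by their cross-validation accuracy. The function names and all numbers are hypothetical.

```python
def weighted_ensemble(score_sets, weights):
    """Weighted average of decision scores over all classifiers.

    score_sets[n][i][k] is the score of classifier n for sample i and class k.
    """
    total = sum(weights)
    n_samples = len(score_sets[0])
    n_classes = len(score_sets[0][0])
    return [
        [sum(w * s[i][k] for w, s in zip(weights, score_sets)) / total
         for k in range(n_classes)]
        for i in range(n_samples)
    ]


def query_per_class(ensemble_scores, m):
    """For every class, pick the m samples whose absolute score is lowest."""
    n_classes = len(ensemble_scores[0])
    chosen = set()
    for k in range(n_classes):
        ranked = sorted(range(len(ensemble_scores)),
                        key=lambda i: abs(ensemble_scores[i][k]))
        chosen.update(ranked[:m])
    return sorted(chosen)


# Two classifiers (subject-specific and one combined), three samples, two classes.
score_sets = [
    [[0.4, -0.4], [0.1, -0.9], [-0.8, 0.05]],   # subject-specific classifier
    [[0.1, -0.1], [-0.2, -0.6], [-0.5, 0.2]],   # combined classifier
]
weights = [1.0, 0.5]  # weight 1 for the first, assumed CV accuracy 0.5 for the second
ensemble = weighted_ensemble(score_sets, weights)
queried = query_per_class(ensemble, m=1)
print(queried)
```

Because the same sample can be the most uncertain one for more than one class, the sketch returns the union of the per-class selections; how such duplicates are handled in the original experiments is not stated in the text.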

Multiclass Aligned Instance Transfer with Active Learning (AITAL).
There is no adaptation or selection from historic subjects in the mDTAL method; it directly transfers all labeled samples from historic subjects. However, not all samples from historic subjects may be compatible with the domain of the new subject, which may lead to a negative transfer effect [51]. The idea is therefore to transfer only the samples that are aligned with the decision boundary of the new subject (Figure 4). The subject-specific model $M_0$ classifies some samples from historic subjects accurately. It can be assumed that these accurately classified samples agree with the decision boundary of the target domain classifier $M_0$, so they are considered to be aligned with the target domain.
AITAL is similar to the mDTAL algorithm except for Step 3, where it does not take all of the samples $X_n$ from the $n$th historic subject. Instead, it takes the aligned samples $X_n^{a}$ (see (15)) from the $n$th historic subject, which are determined by the subject-specific model $M_0$ (Figure 4):
$$X_n^{a} = \{\, x \in X_n : \hat{y}_n^{0}(x) = y_n(x) \,\}. \tag{15}$$
Here, $\hat{y}_n^{0}$ is the label for samples from the $n$th historic subject predicted by the subject-specific classifier $M_0$, and $y_n$ is the true class label for these samples.

Most Informative Instance Transfer with Active Learning (MIITAL).
According to the active learning query method, samples lying close to the decision boundary are the most uncertain to predict, which makes them more informative for learning the decision boundary than other samples. Transferring informative samples from historic subjects is therefore expected to be more effective for learning the classifier of a new user. In this work, the entropy of an instance is used to quantify the information it carries:
$$H(x_i) = -\sum_{k} P_\theta(y_k \mid x_i)\,\log P_\theta(y_k \mid x_i). \tag{16}$$
Here, $P_\theta(y_k \mid x_i)$ is the probability of sample $x_i$ being in class $k$, which is determined by the subject-specific classifier $M_0$, represented as model $\theta$. For this work, we consider four one-versus-rest (OVR) entropy calculations. Our goal is to find uncertain samples that are close to each OVR decision line. Ideally, samples with a 50 : 50 probability ratio are the most uncertain and have maximum entropy. We consider samples whose probability ratio is 60 : 40 or closer to 50 : 50. According to (16), this transfers samples with entropy equal to or greater than 0.29228, which is the entropy of a 60 : 40 split when the logarithm is taken to base 10. This entropy limit is named the information limit or cut-off ($H_c$).
There are two combinations of this algorithm:
(i) Transfer the most informative samples that are also aligned (most informative and aligned instances transfer with AL (MIAITAL)):
$$X_n^{ia} = \{\, x \in X_n : H(x) \geq H_c \ \text{and}\ \hat{y}_n^{0}(x) = y_n(x) \,\}. \tag{17}$$
(ii) Transfer the most informative samples and ignore whether they are aligned or not (most informative instances transfer with AL (MIITAL) (Figure 5)):
$$X_n^{i} = \{\, x \in X_n : H(x) \geq H_c \,\}. \tag{18}$$
Here, $X_n$ denotes the labeled samples of the $n$th historic subject, $\hat{y}_n^{0}$ the labels predicted for them by $M_0$, and $y_n$ their true labels. The algorithm of MIAITAL or MIITAL is the same as mDTAL except for Step 3. In Step 3, MIAITAL or MIITAL takes $X_n^{ia}$ or $X_n^{i}$ according to (17) and (18) in place of $X_n$. MIAITAL attempts to transfer the most informative samples that are correctly classified by the classifier from the previous iteration, whereas MIITAL attempts to transfer the most informative samples (determined by entropy) and ignores the alignment of those informative samples (Figure 5).
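The difference between the two selection rules can be made concrete with a short sketch for a single one-versus-rest classifier. It assumes a base-10 logarithm (which reproduces the 0.29228 cut-off for a 60 : 40 split), a 0.5 probability threshold for the predicted label, and hypothetical probabilities and labels. How the selections of the four OVR classifiers are merged is not specified in the text and is left out here.

```python
import math

CUT_OFF = 0.29228  # cut-off quoted in the text for a 60:40 split


def binary_entropy(p):
    """Base-10 entropy of the two-outcome distribution (p, 1 - p)."""
    return -sum(q * math.log10(q) for q in (p, 1.0 - p) if q > 0.0)


def select_instances(probs, true_labels, require_aligned):
    """Return indices of source samples to transfer.

    probs[i] is the probability (from the target-subject classifier) that
    source sample i belongs to the class; true_labels[i] is 1 for the class
    and 0 for the rest. require_aligned=True mimics the aligned variant.
    """
    selected = []
    for i, (p, y) in enumerate(zip(probs, true_labels)):
        informative = binary_entropy(p) >= CUT_OFF
        aligned = (1 if p >= 0.5 else 0) == y
        if informative and (aligned or not require_aligned):
            selected.append(i)
    return selected


# Hypothetical source samples.
probs = [0.55, 0.95, 0.42, 0.61, 0.48]
labels = [1, 1, 1, 0, 0]
informative_only = select_instances(probs, labels, require_aligned=False)
informative_and_aligned = select_instances(probs, labels, require_aligned=True)
print(informative_only, informative_and_aligned)
```

In this example the sample with probability 0.42 is informative but misclassified, so it is transferred by the informative-only rule and dropped by the aligned variant.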

Optimized Ensemble for Multiclass Actively Learned Space Transfer.
EEG epochs produced by various motor imagery actions are not stable, so finding prominent features and then learning a classifier does not always yield the expected result. As a result, performance is not generic across subjects; it is subject-dependent. Some methods perform well for some subjects but not for others. An ensemble of different methods can give a general and steady performance for all subjects. An optimized weighted ensemble is proposed to serve this purpose (Figure 7). An arbitrary weight $w_{jk}$ is assigned to each class $k$ in each method $j$. Then, these weights are optimized for minimum loss on a validation set, using a loss function defined for each subject (19). Here, the validation set is 10 percent of the subject-specific training set and is randomly chosen from that data set. The initial value of each weight is a random value in the range [0, 1]. Then, $w_{jk}$ is optimized by the genetic algorithm using the loss function from (19). The ensemble decision-making probability on the test set using the optimized $w_{jk}$ is then obtained by
$$P_k = \sum_{j} w_{jk}\, p_{jk}. \tag{20}$$
Here, $p_{jk}$ is the probability generated by the $j$th method for class $k$ and $w_{jk}$ is the optimized weight for the corresponding class $k$ and method $j$.
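A minimal sketch of such a weighted ensemble is given below. Several parts are assumptions made only for illustration: the class probabilities are combined as a plain weighted sum, the loss is taken to be the misclassification rate on the validation set, and a simple seeded random search stands in for the genetic algorithm. All function names and numbers are hypothetical.

```python
import random


def ensemble_predict(probs, weights):
    """Combine class probabilities of several methods and return the best class.

    probs[j][k] is the probability from method j for class k and
    weights[j][k] is the weight of method j for class k.
    """
    n_classes = len(probs[0])
    combined = [sum(weights[j][k] * probs[j][k] for j in range(len(probs)))
                for k in range(n_classes)]
    return max(range(n_classes), key=combined.__getitem__)


def validation_loss(weights, val_probs, val_labels):
    """Misclassification rate on the validation set (assumed loss)."""
    errors = sum(ensemble_predict(p, weights) != y
                 for p, y in zip(val_probs, val_labels))
    return errors / len(val_labels)


def random_search(val_probs, val_labels, n_methods, n_classes, n_trials=200, seed=0):
    """Stand-in optimizer: sample weights in [0, 1] and keep the best set."""
    rng = random.Random(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(n_trials):
        w = [[rng.random() for _ in range(n_classes)] for _ in range(n_methods)]
        loss = validation_loss(w, val_probs, val_labels)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Hypothetical validation set: two methods, two classes, three samples.
val_probs = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.2, 0.8], [0.6, 0.4]],
    [[0.7, 0.3], [0.3, 0.7]],
]
val_labels = [0, 1, 0]
best_w, best_loss = random_search(val_probs, val_labels, n_methods=2, n_classes=2)
print(best_loss)
```

A genetic algorithm would replace `random_search` with selection, crossover, and mutation over the same weight matrix; the prediction and loss functions would stay unchanged.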

Experiment Setup
4.1. Experimental Data. Algorithms described in Section 3 are implemented for BCI competition IV dataset 2A [39]. This dataset consists of 9 subjects. In this dataset, each subject performs four types of motor imagery action for left hand, right hand, both feet, and tongue movement. Data is recorded in two sessions for each subject. In each session, a subject performs 72 trials per class which turns into 288 in total.
In each session, there are approximately 5 minutes of electrooculogram (EOG) recording with eyes open, eyes closed, and eye movement, followed by the runs of trials (Figure 8(a)). Each subject faced a computer screen that showed different cues. Each single trial starts ($t = 0$ s) with a fixation cross on the screen in front of the subject. After 2 seconds ($t = 2$ s), a cue in the form of an arrow appears on the screen, indicating the desired movement (left hand, right hand, foot, or tongue). 1.25 seconds after the cue appears, the subject starts to imagine the motor action and continues until $t = 6$ s. A short break (black screen) is given until the next trial starts (Figure 8(b)). The EEG epoch of 2 seconds starting 0.5 seconds after the cue appears is taken as training data. 22 channels (Ag/AgCl) are used for EEG signal recording and three other monopolar electrodes are used for EOG recording. The montage of the electrodes was according to the international 10-20 system. Both the EEG and EOG channels were sampled at 250 Hz and then filtered using a 0.5 Hz to 100 Hz bandpass filter. A 50 Hz notch filter was also enabled during recording to suppress line noise. The sensitivity of the amplifier was set to 100 μV and 1 mV for EEG and EOG recording, respectively.
$\mu$ rhythms (8-12 Hz) of EEG signals are related to motor actions and sometimes correlate with $\beta$ rhythms [54,55]. For this reason, the corrected EEG signal is bandpass filtered using a causal Chebyshev Type II filter between 8 Hz and 32 Hz. After that, CSP is applied and features are extracted according to (6) and (7). Here, $m$ is set to 2 for $W^{*}$ in (6), so 4 features are obtained from each EEG epoch.

Experiment and Simulation.
For all methods, the first session of each subject is used as the training set and the second session is used as the test set.
For comparison purposes, a baseline method is also implemented. In the baseline method, the full training set of the respective subject is used to train the LDA classifier. No sample from other users is used. After applying data preprocessing as described in Section 4.2, four one-versus-rest (OVR) LDA classifiers are trained. These models are applied to predict the labels for the respective independent test session (Figure 8(c)).
The accuracy achieved by this baseline method is the benchmark performance by an individual user. The purpose of other methods in this work is to achieve this performance using a reduced amount of training samples. This baseline process is followed for each internal model training and testing phase of other algorithms. As benchmark performance is a static value and does not depend on the iterative increment of subjective training samples, it is a straight line parallel to the horizontal axis.
Other methods in this work are iterative: samples from the new subject are added to the training pool iteratively. Each subject in turn is considered the new user (target) while the other 8 subjects are considered past users (source). Each simulation starts with 40 random samples (the initial labeled set $X_l$ in Step 1 of the mDTAL algorithm) with equal class distribution from the target domain. Then, 2 samples per class (the per-class query size $m$ in (14)) are added in each iteration for 20 iterations (the maximum number of iterations in Step 7 of Section 3). So, with four classes, a maximum of 40 + 20 × 2 × 4 = 200 subject-specific samples is reached at the final iteration. This amount of training data from the new user is sufficient to observe whether the new subject can reach the benchmark using fewer training samples, which is why the maximum number of iterations is set to 20. The whole simulation is repeated 20 times for each subject to negate the effect of the random starting samples. Then, the average over the repeats in each iteration is taken as the performance of that iteration.
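The sample budget of this protocol follows from simple arithmetic, sketched below with the numbers given above and the 288 trials per training session of the dataset. The variable names are chosen for this sketch only.

```python
initial_samples = 40       # random starting samples from the target subject
samples_per_class = 2      # samples queried per class in each iteration
n_classes = 4              # left hand, right hand, feet, tongue
n_iterations = 20          # maximum number of iterations
session_trials = 72 * n_classes  # 72 trials per class in the training session

# Number of subject-specific training samples before and after each iteration.
schedule = [initial_samples + it * samples_per_class * n_classes
            for it in range(n_iterations + 1)]
final_samples = schedule[-1]
fraction_of_session = final_samples / session_trials
print(schedule[0], final_samples, round(fraction_of_session, 3))
```

Even at the final iteration, only 200 of the 288 available training trials of the target subject are used, so any iteration that reaches the benchmark does so with less data than the baseline.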
Only the first session of each historical subject is taken as the source domain, because the labels for the second session were withheld in BCI competition IV. Training samples from the first session of the target subject are added iteratively, and the classification performance in each iteration is computed on the independent second session of the target subject.

Results and Discussion
The performance of the proposed methods is evaluated based on the following two criteria: first, whether the method reaches the maximum baseline performance; second, the number of subjects for which the method reaches the maximum baseline performance. The direct transfer method (mDTAL) is the multiclass extension of active transfer learning [36] for motor imagery BCI. The proposed informative space transfer algorithms (AITAL, MIAITAL, and MIITAL) are compared with mDTAL based on the evaluation criteria mentioned above. Figure 9 presents the accuracy of all methods for comparison. The following observations can be drawn from this result based on these criteria:
(i) The mDTAL method fails to achieve the baseline performance except for subjects A03, A06, and A08.
Direct transfer (mDTAL) reaches the baseline for only three out of nine subjects. This implies that informative subspace transfer enables the new subject to achieve the baseline performance with less training data for more subjects than direct transfer. Table 1 shows the number of subject-specific training samples required to reach the baseline or close to it. It also shows that the MIITAL method reaches the baseline or close to it using on average 49% less subject-specific data for 6 out of 9 subjects.
Though informative instance transfer achieves better performance for most of the subjects, this is not a generic outcome for all subjects. Subject A05 and subject A07 are much closer to baseline, but they do not reach it. Exceptionally, subject A04 is very far from the expected line for all the methods. To find a generic solution for all subjects, an optimized weighted ensemble of the proposed four methods is applied (Figure 7). Performance of optimized weighted ensemble method is shown in Figure 9 (solid black line).
The optimized ensemble of all methods achieves the baseline, and sometimes exceeds it, with fewer subject-specific samples for 8 out of 9 subjects. As per the results in Table 1, the optimized ensemble method reaches the baseline or close to it using on average 75.5% less subject-specific training samples for 8 out of 9 subjects. To get a generic view irrespective of subjects, the mean accuracy over all subjects is presented in Figure 10. It shows that the proposed informative transfer learning methods MIITAL and MIAITAL always perform better than direct transfer (mDTAL), which suggests that informative transfer has advantages over direct transfer. However, the mean performance of both algorithms is behind the baseline performance by 4-5%. On the other hand, the mean of the optimized ensemble is much closer to the mean baseline of all subjects (it differs by only 1-2%). A subject-specific adaptation of the combination might yield better results than the optimized ensemble of the methods; this will be considered in a future study.
Another observation is that none of the methods used in this study improves the performance of subject A04. This may be because the EEG responses of some subjects are completely dissimilar to those of others [55]. When a completely dissimilar subspace is transferred and combined with the target subject (A04), it contributes little to improving the predictive function for the target domain. A remedy for this issue could be clustering closely related subjects [28]: closely related subjects or domains form a cluster, and nonrelated or dissimilar subjects are excluded from it. Informative subspace transfer from this cluster could then make the transfer more effective. For EEG epochs with a large number of channels, EEG channel selection could further improve robustness [56][57][58][59][60].
The presented results indicate that a single method works well for some subjects and not for others, which implies that the performance of the proposed TL methods is subject-dependent. Automatic selection of the best approach for a subject is an open question to be investigated. One possible cause of the performance variation is that CSP is applied to extract features from a broad range of $\mu$ and $\beta$ rhythms. Subject-specific frequency ranges can lead to better feature extraction and selection [49]. Incorporating such subject-adaptive frequency ranges would ensure that features are transferred from the subject-specific range and would thus lead to better feature transfer in the proposed TL algorithms. Another concern is that the mean baseline performance of multiclass BCI is not up to the mark. Advanced feature extraction and learning algorithms could be applied to raise this baseline; incorporating them into MIITAL would consequently raise the performance of the MIITAL algorithm.
In summary, this paper presents two slightly different informative subspace transfer frameworks (MIAITAL and MIITAL) for multiclass BCI. Though MIITAL achieves the expected result for a good number of subjects, it still lags behind the baseline in general. The optimized ensemble of these methods closes most of this gap. The primary goal of this work is to investigate the benefit of informative subspace transfer over direct transfer for multiclass BCI. Though it succeeded for most of the subjects, there is considerable scope for improving the proposed framework. The secondary goal is to find the comparatively better informative transfer approach. The empirical results indicate that MIITAL serves the purpose better than MIAITAL.

Conclusion
In this work, we applied direct transfer learning with active learning on multiclass motor imagery BCI. To improve the performance, an informative instances transfer framework is proposed. Its key advantage compared with direct transfer methods is transferring informative instances that narrow down the search spaces more precisely around the decision line. Hence, it reduced training data significantly for most of the subjects (6 out of 9). A generic optimized ensemble of proposed methods is also implemented. It has achieved expected accuracy with fewer subject-specific samples (using average 75% less training samples) for 8 out of 9 subjects.
The results achieved in this paper also point out some directions for future work. Subject-adaptive method selection could give a more fine-tuned performance. Cluster-based transfer combined with informative transfer could also lead to better performance for the underperforming subject. Another direction is filtering subjects and subspaces based on distribution similarity. Domain adaptation based on marginal and conditional distributions could introduce more generalized adaptation into the proposed TL framework. All these improvements can reduce the calibration effort remarkably and lead us towards a generic TL framework for BCI applications.