Instance Transfer Subject-Dependent Strategy for Motor Imagery Signal Classification Using Deep Convolutional Neural Networks

In the process of brain-computer interface (BCI), variations across sessions/subjects result in differences in the properties of potential of the brain. This issue may lead to variations in feature distribution of electroencephalogram (EEG) across subjects, which greatly reduces the generalization ability of a classifier. Although subject-dependent (SD) strategy provides a promising way to solve the problem of personalized classification, it cannot achieve expected performance due to the limitation of the amount of data especially for a deep neural network (DNN) classification model. Herein, we propose an instance transfer subject-independent (ITSD) framework combined with a convolutional neural network (CNN) to improve the classification accuracy of the model during motor imagery (MI) task. The proposed framework consists of the following steps. Firstly, an instance transfer learning based on the perceptive Hash algorithm is proposed to measure similarity of spectrogram EEG signals between different subjects. Then, we develop a CNN to decode these signals after instance transfer learning. Next, the performance of classifications by different training strategies (subject-independent- (SI-) CNN, SD-CNN, and ITSD-CNN) are compared. To verify the effectiveness of the algorithm, we evaluate it on the dataset of BCI competition IV-2b. Experiments show that the instance transfer learning can achieve positive instance transfer using a CNN classification model. Among the three different training strategies, the average classification accuracy of ITSD-CNN can achieve 94.7 ± 2.6 and obtain obvious improvement compared with a contrast model (p < 0.01). Compared with other methods proposed in previous research, the framework of ITSD-CNN outperforms the state-of-the-art classification methods with a mean kappa value of 0.664.


Introduction
A brain-computer interface (BCI) is a communication method between a user and a computer that does not rely on the normal neural pathways of the brain and muscles. Motor imagery (MI), one of the paradigms of BCI, is a way of thinking that imitates the motor intention without real motion output; that is, the brain imagines the entire movement without contracting the muscles [1]. Research has shown that motor imagery (MI) can produce the same change of sensory motor rhythms as a real movement [2]. This phenomenon will cause energy increase or decrease in specific frequency bands of EEG, which are called eventrelated desynchronization (ERD) and event-related synchrony (ERS) [3]. The differences of ERD/ERS are always used to decode mental intentions, control a robot, and execute rehabilitation training for stroke patients [4]. During this process, the accurate decoding of MI is the essential factor that determines the effectiveness and quality of the rehabilitation. However, due to the differences in physiological structure and physiological condition across subjects/trials, there will be obvious variations in feature distribution for EEG signals. Especially, as a spontaneous potential activity, the signal of MI is extremely weak and always accompanied with nonlinearity and nonstationary. It brings a huge challenge for the decoding model for MI.
With the development of machine learning (ML) and deep learning (DL) technology, more classification models are widely used for EEG decoding [5]. During the training stage of the classification model, strategy can be divided into two ways: subject-dependent (SD) and subject-independent (SI). As shown in Figure 1, SD strategy is aimed at training a subject-specific model using their own data. In contrast, SI strategy utilizes data from other subjects to train a generalized decoding model for a new subject [6]. One of the main assumptions of ML and DL is that training data and test data belong to the same feature space and subject to the same probability distribution. But it is often violated in the field of EEG signal processing. In other words, the SI strategy cannot satisfy performance of accuracy and generalization due to the individual differences across subject/sessions. SD strategy provides an effective way to optimize this issue; however, it requires long calibration sessions to collect the high-quality and large amounts of EEG datasets. All these restrictions greatly affect the application of BCI in practice.
One effective approach to solve this problem is instance transfer learning (ITL) [7], which combined the advantages of training strategy of SI and SD, i.e., training personalization classification model with enough data. The definition of transfer learning is that given a source domain D s and learning task T s and a target domain D T and learning task T T transfer learning are aimed at helping improve the performance of target predictive function f T ð·Þ using the knowledge from D s . ITL is one of the typical TL methods, which transfer instance knowledge by reweight the data from D s to improve generalization ability for f T ð·Þ.
The essence of ITL does not change the feature space or property of signals in MI task, but it finds the optimal transfer weighting coefficient for source data by similarity measurement [8][9][10]. The transfer weighting coefficient is then weighted with the number of corresponding data from D s . As shown in Figure 2, w k S i /T represents the transfer weighting coefficient for D s data. k represents the serial number of the subject, and i is the ith trials. During the training stage, weighted data from D s were combined with D T data to train classifier. Based on this, we could utilize similar EEG data from other subjects or sessions to help reduce system calibration time and train decoding model for target subject [11]. For example, Azab et al. proposed a weighted transfer learning for improving MI task that they use Kullback-Leibler divergence to measure the similarity between two feature spaces of signal. According to the results of similarity, the weight coefficient is assigned to the source data to optimize the small sample problem in classification model training [12]. A Jensen-Shannon ratio method is used to measure similarity between target data with source data in Giles et al.'s work [13]. Based on this method, they propose a target subject identification framework based on rule adaptation transfer learning, which can reduce the calibration time of the online BCI system by reusing the data with the highest similarity between D s and D T .
However, due to the obvious individual differences across subjects, the direct instance transfer method may bring negative transfer effects. In addition, traditional measurement does not concentrate on the specific feature of EEG data, which will affect the performance of the transfer learning. Especially for motor imagery signals, the traditional timeseries signals cannot effectively reflect the feature of motor intention, but the energy feature of signal can represent the distribution of feature well. Therefore, choose the translocatable objects and assign transfer weights reasonably to the core research for instance transfer learning [14]. In the computer vision field, content-based image retrieval (CBIR) is an important research topic [15]. The goal of CBIR is to find images from the source domain that belongs to the same category. MI spectrogram image contains abundant information of frequency and energy feature, which is suitable for extracting feature of motor intention. Therefore, we 2 Computational and Mathematical Methods in Medicine assume that technology of CBIR may implement effective data matching across subject and achieve the effective instance transfer from D s to D T . The perceptive Hash (p-Hash) algorithm is one of the typical CBIR methods, which is used to judge the similarity between different images by transforming these images to perceptual hash code and measure its distance [16].
With the development of deep neural network (DNN) technology, EEG decoding based on DNN has attracted wide attention. Due to the excellent ability of fitting and automatic feature extraction, DNNs achieve outperformed results for EEG classification. In Reference [17], a convolutional neural network (CNN) and variational autoencoder (VAE) were used for two-class MI classification task. The CNN utilized multiple hidden layers to extract the features, and the VAE was used for feature classification. The CNN-VAE method achieved a 3% improvement in classification accuracy than the best methods in their referred literature. Lu et al. [18] proposed a novel method based on restricted Boltzmann machines (RBMS) for EEG classification. Fast Fourier transform (FFT) and wavelet packet decomposition (WPD) were used to extract the frequency-domain features of signals, which were used as inputs of the network. Three RBMs were stacked with an additional output layer to train the classification network. The authors verified that the classification performance of this network was better than stateof-art methods evaluated by public datasets ðp < 0:01Þ. In a recent study [19], the researchers compared the classification performance of a CNN and long short-term memory (LSTM) network for classifying the time-frequency domain signals of MI. The authors evaluated the adaptability between different network structures to individual differences and showed that CNN provided better performance for detecting the differences across subjects, and its classification rate was significantly higher than that of the LSTM. In summary, CNN shows satisfactory classification performance in MI-BCI task compared to traditional machine learning methods or other networks. However, the limitation of the amount of dataset hinders the practical applications of DNN. Especially for SD training strategy, it is difficult for a subject to collect enough high-quality EEG data. Therefore, we propose a novel instance transfer learning based on p-Hash to improve the utilization efficiency of data and build a CNN for MI classification.
Based on the problems mentioned above, we propose a novel instance transfer learning strategy combined with CNN for subject-dependent MI classification. The main contributions of this paper are as follows: (1) To address the issue of individual differences across subjects/sessions in MI classification, we creatively apply the methodology of content-based image retrieval to EEG classification. Based on this, we proposed a novel instance transfer learning (ITL) strategy using the p-Hash algorithm, which is aimed at calculating the transfer weight coefficient between the trails from different subjects/sessions (2) There are two main limitations of subject-dependent and subject-independent training strategies: smallscale dataset and large difference of signal across subjects. Therefore, we apply instance transfer learning to optimize the traditional training strategies. Similarity measurement in feature space is executed to calculate the transfer weight coefficient across subjects/sessions, which is aimed at exploring the correlation between different trials. And then we expand the training set for target subject based on instance transfer by weighted (3) To improve the classification performance in MI-BCI task, we combine CNN with transfer learning strategy using SD training strategy (ITSD-CNN) to classify MI signal. Experiments evaluate that the ITSD-CNN can achieve outperformed results than state-of-art methods The step of ITSD-CNN can divide into these steps: firstly, we preprocess the raw MI-EEG signals and adopt short-time Fourier transform (STFT) to transform the raw MI signal into a 2-D spectrogram signal. Then, an ITL based on the perceptive Hash algorithm is proposed to measure the similarity of MI signals between D s and D T . Next, we build a convolutional neural network to classify the MI data after transfer learning. The BCI competition IV-2b dataset is used to verify the effectiveness of this framework. Our results show that the proposed approach can significantly improve the classification performance. Meanwhile, the ITSD provides a new training strategy to optimize the performance of SD training. The rest of the sections is organized as follows. Section 2 explains the materials and methods for ITSD and CNN. Section 3 introduces the experimental results and discussion. Discussion is described in Section 4, and Section 5 is the conclusion of the paper.

Description of Datasets.
In this paper, we utilize BCI competition IV dataset 2b [20]. This dataset was provided by the BCI Research Institute in Berlin and contained two parts: the standard set and the evaluation set. Nine subjects participated this experiment, and three channels (C3, C4, and CZ) were used to record EEG with a 250 Hz sampling rate. Each subject is required to imagine the movement of left and right hands according to the cue. And all of them underwent 5-session experiments. The experimental process is shown in Figure 3.
The potential activity of MI always causes the variation of energy in the contralateral cortex and ipsilateral cortex during MI, which is recorded by C3, Cz, C4, and surrounding channels [3]. However, this phenomenon cannot reflect in time domain clearly. To describe features in a better form, we transform time-series signals to spectrogram signals after filtering. As shown in Figure 4, the three channels are converted into a two-dimensional form and are mosaicked into an image using vertical stacking. The variations along X-axis and Y-axis represent the trend of time series and frequency, respectively. And the depth of color indicates the energy feature. For one trial, we chose the data from 3 to 7 s (period of imagery) and set the window size of the STFT as 256. After transformation, all spectrogram images were resized to 64 × 64 for convenience and consistency in the subsequent calculation.

Instance Transfer Learning Based on the Perceptive Hash
Algorithm. The spectrogram signal of MI can vividly reflect the feature variations especially for the energy of frequency band. The perceptive hash (p-Hash) algorithm can obtain the most sensitive information by discrete cosine transform (DCT) in the human and machine vision system [16]. This transformation concentrates the energy on the main diagonal of the image matrix and has effectively removed redundant and irrelevant components. Under specific EEG task, the feature distribution of signals across subjects may exist difference but the form of feature is consistent. Therefore, we assume that change between different modes for MI can be effectively perceived and recognized by p-Hash.
This paper uses p-Hash to measure the similarity of spectrogram data across subject. Then, the obtained similarity is transformed into the ITL coefficient that inputs into a classifier combined with corresponding data. The implementation of ITL based on p-Hash are as follows.
Firstly, some denotation of symbols should be explained. We define D T representing the target subject's data and D S is other subjects' data. Let us define G i n = fg i t g n t=1 ∈ D s l×l which is a set of single-trial EEGs represented by spectrogram from D S , Q i n = fq i t g n t=1 ∈ D t l×l for target subject, where t is the number of EEG trials, l is the dimensions of a square matrix, i represents the i-th subject.q i n ð1/nÞ∑ n t=1 q i tqt represent the average spectrogram for the current target subject.
Before calculation, spectrogram images from D S and D T are separately resized to 64 × 64 and converted to grayscale level. Then, discrete cosine transform (DCT) is utilized to compress an image: siy,y cos αðuÞ and αðvÞ are coefficient matrixes after transformation. G u,u and Q v,v are results after transformation. The energy variations of the image after DCT are mainly concentrated in the low-frequency part [21]. Therefore, the 8 × 8 matrix d located in left diagonal is extracted for subsequent In addition, the mean value of DCT coefficients is set as threshold standard to compare with each coefficient. By the rule of threshold, the two-dimensional matrix of n × n can be compressed into one dimension of 1 × n matrix H i .
where h i is the bit of the perceptual hash at position i, m is the mean value of the DCT coefficients, and b i ði = 0, 1, ⋯, N − 1Þ is DCT coefficient of the array. The obtained 1 × n matrix is H i which represents perceptual hash code [22].
Finally, respectively, calculate the Hamming distance d H of perceptual hash code from D T and D S and set the distance d H ðH T , H S Þ as the ITL weight coefficient from source domain to target domain. For each trial from source subject, weight w S i /T t can be calculated: The calculation processing of transfer weight is shown in Figure 5.

Convolutional Neural Network.
Researches show that the CNN has obvious advantages in processing MI signals [23]. CNN is a multilayer neural network composed of a sequence of convolution, pooling, and fully connected layers. The convolution layer extracts different levels of feature of input image by kennel size, while the pooling layer reduces the complexity of the model by subsampling. With the increase of layers, the more advanced features can be extracted. The fully connected layer will transform the output matrix from the last layer to a n-dimensional vector (n is the number of classes). Backpropagation is utilized to decrease the classification error.
In the convolution layer, the input image can be convolved with a spatial filter to form the feature map and output function, which is expressed as This formula describes the jth feature map in layer l. X l j is calculated by the previous feature map X l−1 i multiplied by the convolution kernel W l ij and bias parameter b l j . Finally, the mapping is completed by RELU function f ðÞ.
The pooling layer is sandwiched in the continuous convolution layer to compress the amount of data and parameters and reduce overfitting. The max pooling method in this work is chosen as follows: where x i is the ith feature map and y i represents an output probability distribution. The gradient of back-propagation was calculated according to the cross-entropy loss function.
And we used the stochastic gradient descent (SGD) optimizer with a learning rate of 1e − 4 to improve the speed of network training.
where μ is the learning rate, while W k represents the weight matrix for kernel k and b k represents the bias value. E represents the difference between desired output and real output. There are eight layers in the proposed network ( Figure 6). The first layer is the input layer, and the second layer is a convolutional layer with kernel size 3 × 3; the next layer is the max pooling layer with kernel size 2 × 2. The next two layers have the same kernel size and function. Two fully connected layers, respectively, consist of 10 and 2 neuros to compute the predicted labels. The gradient of backpropagation is calculated according to the cross-entropy loss function. The stochastic gradient descent with momentum (SGDM) optimizer is used for optimization with learn rate = 1e − 4, momentum = 0:9, and decay = 1e − 6. We set the minibatch size to 50 and the max epoch to 6. To reduce computation time and prevent overfitting, we adopt to the dropout operation. The proposed CNN model is summarized in Table 1.
To evaluate the effectiveness of instance transfer learning, three training strategies are compared. As for the subject dependent method (Figure 7), a total of 720 trials for one subject are divided into the training data and test data using10fold cross-validation.
A generalized model is trained using data from other subjects (D s ) in the subject-independent training stage ( Figure 8).
In the ITSD method, weighted data from D s together with target data are input into the training set. And data from D T are used to test model performance; the method of data partition is shown in Figure 9.
To show the size of training and test data more clearly, we briefly summarize the number of data for three training strategy in Table 2.

Experimental Results and Analysis
3.1. Performance of the Proposed Framework. In this paper, we use BCI competition IV dataset 2b to verify the proposed methods. During the training stage for each subject in ITSD, we supply dataset from other subjects by instance transfer (Figure 9). After the mixture of source data and target data, we adjust the number of instances to keep the class balance in the training stage. To evaluate the performance of different training strategies, we compared the classification accuracy of different methods.
As depicted in Table 3, the SD training strategy is better than SI based on the CNN classifier even though SI obtains more training data. This indicates that MI-EEG from different subjects causes an obvious difference of feature under the same label. The average classification accuracy of ITSD-CNN is superior to that of SD-CNN, which obtains a 14.1% improvement. It is worth noting that subjects 2 and 3 can better adapt to model preference by efficiently data transfer to greatly improve the classification accuracy.
To verify the significance of results, analysis of variance was performed. As shown in Figure 10, there is no significant difference between SD-CNN and SI-CNN, while the strategy of ITSD-CNN performs satisfied convergence and high accuracy than the other two methods (p < 0:01).
By observing the training process, the weakness of small sample can directly influence the results of network training. Effective data transfer can increase the number of samples to improve the generalization of network and prevent underfitting. Moreover, this method can validly reduce the influence of classifier result from individual differences.

Comparisons with Previous
Research. Numbers methods have been proposed for MI classification using BCI competition IV dataset 2b. In this section, we further compared our method with that of a commonly used strategy by the metric of mean kappa value. Based on the analysis of Table 4 and Figure 11, we can observe that ITSD-CNN outperforms the baseline and the state-of-the-art methods. It is worth noting that the hybrid framework based on CNN obtains an ideal result among these methods. This indicates that CNN has strong robustness and high accuracy in MI classification. In addition, instance transfer effectively improves the classification performance of CNN using the same model and parameters.

Discussion
Compared with traditional methods, the application of deep learning for EEG classification has successfully improved the performance [28]. However, there are still some limitation hinder its application in practice. The feature distribution of EEGs always shows a difference in the same mental task across subject/session, which may cause the overfitting during the network training. Transfer learning turns out to be instrumental in subject/session classification performance. It can be used to initialize a BCI using knowledge transfer from other subjects for a naive subject. At the same time, this strategy may help a classifier to learn global features from all subjects without falling into the local optimal. Therefore, transfer learning combines the advantages of SI and SD strategy and outperforms them. In future studies, it is valuable to    Another limitation is small-scale sample for classifier training. Strict requirements of quality and collection of EEG data make it difficult to obtain large datasets in practice. The performance of EEG decoding based on DNN is directly related to the amount of training data. Data augmentation is a promising way to address this issue. As discussed in References [29,30], artificially generated data can be used to training classification model and the augmentation method has been proved efficient in EEG decoding. The addition of generated dataset improves the complexity and robustness of models. Traditional augmentation methods contain geometric transformation and model generation, which requires a long time to prepare and select suitable generated data. It takes up a lot of computing resources in the BCI system. Therefore, data augmentation from an available database may provide a probable method. As proposed in this study, instance transfer learning can easily obtain data from other subjects and adaptively assign weights to the transfer data, which achieve the utilization maximization of data across subjects. Although the training process of this method is similar with subject-dependent strategy, i.e., it requires recomputing for a new participant; low-cost calculation would not burden the operation of the BCI system. In later research, we will explore the detail of variability across subjects and achieve more effective transfer learning.

Conclusion
In this paper, we propose a novel instance transfer learning method with a deep neural network applying for the subject-dependent classification of motor imagery in the BCI system. In this work, we firstly transform the raw data to spectrogram image by STFT. Then, instance transfer learning based on the perceptive Hash algorithm is utilized to measure the similarity between the data of source domain and target domain. Next, we convert the similarity into a transfer weight coefficient to realize the data transfer of a single trial between different subjects. Finally, a convolutional neural network is built to verify the performance of proposed methods and some other methods are adopted to evaluate the results. Experiment verifies that instance transfer learning by the perceptive Hash algorithm can effectively provide data augmentation based on subject-dependent training strategy and improve the performance of the classifier, which demonstrates the superior performance and promising potential of proposed novel training strategy. Meanwhile, the proposed method provides a solution for the weakness of small samples in the deep neural network.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.