A Novel Time-Incremental End-to-End Shared Neural Network with Attention-Based Feature Fusion for Multiclass Motor Imagery Recognition

In research on motor imagery brain-computer interfaces (MI-BCI), traditional electroencephalogram (EEG) signal recognition algorithms are inefficient at extracting EEG signal features and at improving classification accuracy. In this paper, we address this problem with a novel step-by-step method of feature extraction and pattern classification for multiclass MI-EEG signals. First, the training data from all subjects is merged and enlarged through an autoencoder to meet the need for massive amounts of data while reducing the adverse effect of the randomness, instability, and individual variability of EEG data on signal recognition. Second, an end-to-end sharing structure with an attention-based time-incremental shallow convolution neural network is proposed. A shallow convolution neural network (SCNN) and a bidirectional long short-term memory (BiLSTM) network are used to extract the frequency-spatial domain features and the time-series features of EEG signals, respectively. Then, the attention model is introduced into the feature fusion layer to dynamically weight the extracted temporal-frequency-spatial domain features, which reduces feature redundancy and improves classification accuracy. Finally, validation tests on the BCI Competition IV 2a data sets show that the classification accuracy and kappa coefficient reach 82.7 ± 5.57% and 0.78 ± 0.074, respectively, which demonstrates the advantages of the method in improving classification accuracy and reducing individual differences among subjects that share the same network.


Introduction
The brain-computer interface (BCI) is a communication and control system established between the brain and external devices through the signals generated by brain activity. By creating direct communication between the brain and the external device, the system relies on the central nervous system rather than on muscles or peripheral nerves [1]. Motor imagery (MI) is a psychological process in which an individual mentally simulates body movements. While different MI tasks are performed, a certain area of the cerebral cortex is activated, and the metabolism and blood flow of this area increase. Meanwhile, the simultaneous information processing leads to an amplitude decrease, or even a block, of the EEG mu and beta oscillations. This electrophysiological phenomenon is called event-related desynchronization (ERD). In contrast, a manifest amplitude increase of the mu and beta oscillations, which appears in resting or inert states, is called event-related synchronization (ERS) [2]. The purpose of MI-BCI is to identify the imagined movements by classifying the electroencephalogram (EEG) characteristics of the brain in order to control external devices such as robots [3,4]. On the one hand, MI-BCI can help patients with severe dysfunction establish communication channels with the outside world. On the other hand, to some extent, it can activate the brain region to promote the remodeling of the patient's central nervous system [5]. In contrast to traditional rehabilitation training, it can improve the patient's subjective initiative in achieving the rehabilitation effect, which overcomes the passive, single-mode nature of traditional rehabilitation [6]. Therefore, MI-BCI has growing potential value in the fields of motor function assistance and motor neurorehabilitation. However, the high complexity and instability of EEG signals make feature extraction and pattern classification very challenging.
A very important part of the MI-BCI system is how to classify the EEG characteristics of the MI task correctly and convert them into external control instructions [4]. At present, traditional MI-EEG feature extraction is mainly based on ERD/ERS in the µ band (8-12 Hz) and the β band (16-31 Hz), including signal bandpass filtering [7], autoregressive models [8], frequency domain statistics [9], phase-locking value (PLV) [10], wavelet transformation and wavelet-packet transformation [11,12], information entropy [13], and common spatial pattern (CSP) [14]. Based on the above methods, Li et al. [15] used the wavelet-packet transform (WPT) to analyze and rebuild the MI-EEG signals and extract the energy characteristics of the µ band and the β band. Zhang et al. [16] analyzed MI-EEG signals and extracted temporal and spatial features by using a one-versus-rest filter. However, traditional feature extraction relies on manual selection of specific frequency bands, and the resulting features are very limited, so part of the EEG information may be lost. In addition, the pattern classification methods for MI-EEG signals include linear discriminant analysis (LDA) [17], Bayesian linear discriminant analysis (BLDA) [18], logistic regression (LR) [19], support vector machine (SVM) [20], and neural network (NN) [21]. The classification performance of these methods depends on the quality of feature extraction.
In recent years, deep learning has made excellent achievements in the fields of speech recognition, image recognition, and natural language processing [22,23]; it has been widely adopted in these fields because of its ability to learn features by itself [24-26]. Therefore, deep learning is also gradually being used in the feature extraction and pattern classification of EEG signals, where in some cases it not only improves the accuracy but also provides a new way to learn features from EEG data [27,28]. For example, Tang et al. [29] investigated how a convolution neural network (CNN) displayed spectral features of a series of MI-EEG samples. Yang et al. [30] used augmented CSP to extract the spatial features and a CNN to learn deep structural features for MI-EEG classification on the BCI Competition IV data sets, which showed that feature extraction no longer had to rely on manual design. Tabar and Halici [31] proposed a deep learning method to classify MI-EEG signal patterns, in which a CNN was used to extract features and a stacked autoencoder (SAE) was then used to classify the extracted features. However, MI-EEG signals are time series with strong time-varying characteristics, and a CNN is not completely suitable for learning time-series features. Therefore, Lee and Choi [32] used the continuous wavelet transform to extract the temporal-frequency features of MI-EEG signals and classified them with a CNN. Zhou et al. [33] adopted an approach based on wavelet packets and a long short-term memory (LSTM) neural network, which divided MI-EEG signals into several categories through amplitude features and time-series information. An et al. [34] studied a deep belief network (DBN) based on the restricted Boltzmann machine (RBM) combined with the fast Fourier transform (FFT) for MI-BCI pattern recognition, and the results were significantly better than those of the traditional SVM-based algorithm.
However, these methods simply extracted the temporal domain, frequency domain, or temporal-frequency domain features and did not fully extract the EEG signal features. Many other methods of deep learning have also been used in the recognition of MI-EEG signals, but the network structure is overly complex.
To sum up, the above deep learning methods for MI-EEG recognition do not take full advantage of the self-learning of features: they still manually select features of specific frequency bands before pattern classification. Because manually selected features are very limited and the objective function of feature extraction differs from that of pattern classification, information is easily lost. What is more, multiclass MI-BCI classification mainly adopts a splitting strategy, so the whole process is extremely cumbersome and the classification accuracy is not high. Besides, the signal-to-noise ratio of MI-EEG signals is relatively low, and the data of the same person in the same task exhibits randomness, instability, and individual variability, which limits networks trained with small-sample data sets. To reduce these limitations, a multimodal neural network is designed to form a novel end-to-end shared neural network in this paper. The main contributions are as follows:
(1) Because the sample size of the BCI Competition IV 2a data sets is small, a 1 s time window is used to intercept the training data, and then an autoencoder (AE) network is used to enlarge the sample size of all subjects' training data, which is intercepted and merged in advance. This meets the requirement of a large amount of training data for the neural network and effectively reduces the adverse effect of the randomness, instability, and individual variability of EEG data on signal recognition.
(2) To ensure the classification results of MI-EEG signals, a novel convolution neural network structure named the shallow convolution neural network (SCNN) is proposed to extract the different-dimension frequency-spatial domain features. Because of its simple structure and few parameters, the training model is not prone to overfitting. Furthermore, the EEG signals processed in the frequency-spatial domain are input into a bidirectional long short-term memory (BiLSTM) network to extract the time-series features, so that the features in the MI-EEG signals are fully extracted. Finally, to reduce the redundancy of the fusion features and improve the classification accuracy, the attention model is introduced into the feature fusion layer to dynamically weight the extracted temporal-frequency-spatial domain features.
(3) Through the proposed multimodal neural network, the training data of all subjects is used to train the end-to-end shared neural network, which is then tested on the test data of each subject and compared with the state-of-the-art methods in the MI-EEG recognition field to demonstrate its higher classification accuracy and smaller individual differences.
The structure of this paper is as follows: Section 1 is the introduction. Section 2 describes the data sets and the details of the proposed neural network method. The experiment and its results are presented in Section 3. The discussion is presented in Section 4. Finally, Section 5 is the conclusion.
Computational Intelligence and Neuroscience

Materials and Methods
Different from images and videos, MI-EEG signals are time series with strong time-varying characteristics, which carry a large amount of information even though the amount of data is not large. In addition, their signal-to-noise ratio is low, and randomness, instability, and individual variability are present in the process of signal acquisition. What is more, as the number of MI-EEG classes increased, both the strategy of splitting the multiclass task and the approach of classifying patterns after a separate feature extraction step were introduced in the past, but it remains difficult to improve the classification accuracy. To solve these problems, we put forward an attention-based time-incremental end-to-end shared neural network, as shown in Figure 1. By combining the SCNN network and the BiLSTM network and introducing an attention model into the feature fusion process, the method performs feature extraction and pattern classification step by step on the temporal-frequency-spatial domain features of multiclass MI-EEG signals. This method is simply called the SCNN-BiLSTM network based on attention.
Before feature extraction and pattern classification are carried out step by step, the training data of all subjects is expanded using the AE network. Then, the abstract different-dimensional frequency-spatial domain features are extracted by the different convolutional kernels of the SCNN network, and the time-series features are extracted through the BiLSTM network with time increments; after that, all the temporal-frequency-spatial domain features are combined with the attention mechanism. Finally, the fusion features are input to the output layer of the network for classification. During the training of this attention-based time-incremental end-to-end shared neural network, the convolution layers and the recurrent layers receive the back-propagated error of the output layer at the same time, and the gradient descent driven by the error gradually spreads to the front of the network. Thus, after many iterations, the network parameters are gradually updated and the error becomes smaller and smaller.

Data Description and Processing.
In this paper, the data sets are taken from the four-class MI-EEG data (left hand, right hand, foot, and tongue) of BCI Competition IV 2a in 2008 [35]. In the data sets, the EEG data of 9 subjects, labeled A01-A09, was recorded with 22 Ag/AgCl electrodes. The sampling frequency is 250 Hz, band-pass filtering is set between 0.5 Hz and 100 Hz, and line noise is suppressed by a 50 Hz notch filter. Each experiment consists of two sessions: the first session is used for training and the second for testing. Each session comprises 6 runs, and one run contains 48 trials (12 for each of the four possible classes), resulting in 288 trials per session. The timing scheme of experimental data acquisition is shown in Figure 2.
Firstly, we select 2 s-6 s data from the training data sets T and intercept it with a time window of 1 s. After processing, the training data sets T of 9 subjects are merged and then enlarged with AE network. Secondly, to accelerate the convergence speed of the network, prevent interference caused by abnormal EEG data, and avoid unnecessary numerical problems, the segmented training data is standardized. What is more, to increase the stability of the network, the training data sets T with training labels are reordered randomly.
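The windowing step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `segment_trial`, the nested-list layout (one list of samples per channel), and the use of non-overlapping windows are assumptions; the text only specifies the 2 s to 6 s range, the 1 s window, and the 250 Hz sampling rate.

```python
FS = 250  # sampling rate in Hz, as stated for BCI Competition IV 2a


def segment_trial(trial, start_s=2.0, stop_s=6.0, win_s=1.0, fs=FS):
    """Cut one trial into fixed-length windows.

    `trial` is a list of channels, each a list of samples.
    Non-overlapping windows are an assumption of this sketch.
    """
    win = int(win_s * fs)
    first = int(start_s * fs)
    last = int(stop_s * fs)
    windows = []
    for begin in range(first, last - win + 1, win):
        # keep the same sample range for every channel
        windows.append([channel[begin:begin + win] for channel in trial])
    return windows


# toy trial: 22 channels, 7.5 s of samples whose value equals the sample index
trial = [[float(i) for i in range(int(7.5 * FS))] for _ in range(22)]
windows = segment_trial(trial)
```

With these defaults, one trial yields four windows of 22 channels by 250 samples each, so the number of training segments is four times the number of trials before any autoencoder-based enlargement.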
Particularly, in the process of data standardization, we standardize the EEG data based on the mean and standard deviation of the raw data, so as to reduce the influence of outliers and extreme values in the data through centralization. The processed EEG data conforms to the standard normal distribution with mean 0 and standard deviation 1. The EEG signals are input to the convolution layers, which generate preprocessed time series. These time series are then input to the BiLSTM cells for the exchange of information among different time points. The attention mechanism module receives the output of the BiLSTM cells, calculates weights for the different time points, and outputs the final result. The training sets and testing sets are both standardized before being input into the network as follows:
$(s_{in} - \mathrm{mean}(s_{in})) / \mathrm{std}(s_{in})$, $(p_{in} - \mathrm{mean}(p_{in})) / \mathrm{std}(p_{in})$,
where $s_{in}$ is the segmented training data and $p_{in}$ is the segmented testing data.
During a single trial of the BCI Competition IV 2a data, the prompt arrow appeared at t = 2 s and lasted for 1.25 s, while the subjects observed it and imagined the corresponding action. From t = 3.25 s to 6 s, the subjects imagined the corresponding action. Because EEG signals are instantaneous and susceptible to interference, the ERD/ERS characteristics of the subjects' motor imagery EEG are uncertain during the transition from the preparation stage to the imagination stage, which easily produces invalid edge data. Therefore, the method proposed in this paper is finally verified on the motor imagery data from 4 s to 5 s, which starts 0.75 s after t = 3.25 s.
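The standardization described above amounts to a z-score transform. The sketch below is only illustrative: the function name `standardize` is hypothetical, the population standard deviation is assumed, and the text does not state whether the statistics are computed per window, per channel, or over a whole data set, so the code simply standardizes whatever flat list of samples it is given.

```python
import math


def standardize(samples):
    """Return the z-scored samples: zero mean and unit standard deviation."""
    mean = sum(samples) / len(samples)
    variance = sum((v - mean) ** 2 for v in samples) / len(samples)
    std = math.sqrt(variance)  # population standard deviation (assumption)
    if std == 0.0:
        raise ValueError("constant signal cannot be standardized")
    return [(v - mean) / std for v in samples]


z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```

After the transform, the samples have mean 0 and standard deviation 1 up to floating-point error, which matches the property the text requires of the network input.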

Data Expansion Based on Autoencoder.
Deep learning methods need a large amount of data to train the network models. However, the number of data samples in BCI Competition IV 2a is small. Meanwhile, the different recording periods and the size of the electrode caps during data acquisition make each subject's EEG signals random and unstable. Therefore, for the training sets, we select the data of 2 s to 6 s, intercept it with a 1 s time window, and then use the autoencoder (AE) network to enlarge the data by generating reconstructed data from the real training data. The AE is a three-layer neural network composed of an input layer, a hidden layer, and an output layer [36], as shown in Figure 3. After this processing, the data satisfies the need of neural networks for a large amount of training data and improves the robustness of the network model, while effectively reducing the adverse effect of the randomness, instability, and individual variability of EEG data on signal recognition.
In the AE network, the output layer y has the same size as the input layer x, so y can be considered an approximation of x. f and g represent the encoding and decoding functions, respectively. The encoding and decoding procedures are as follows:
$h = f(x) = s_f(W_1 x + a)$,
$y = g(h) = s_g(W_2 h + b)$,
where h denotes the hidden layer information, $W_1$ denotes the weight from the input layer to the hidden layer, and $W_2$ denotes the weight from the hidden layer to the output layer; a and b are the biases of the hidden layer and the output layer, respectively; $s_f$ and $s_g$ are the activation functions of the encoding and decoding procedures, respectively. Here, both $s_f$ and $s_g$ adopt the sigmoid function. To simplify the calculation, let $W_2 = W_1^T = W$.
In this paper, the AE network first encodes the real training data to reduce the data dimension, and the important features of the data are retained through unsupervised learning. Then, the encoded data is decoded to obtain the reconstructed data. Finally, the average value of the reconstruction error function between the reconstructed data and the real training data, namely, the loss cost function, is calculated to measure the similarity between them. The smaller the loss cost function is, the more similar the reconstructed data is to the real training data. However, in the process of network learning to obtain the parameter $\theta = \{W, a, b\}$, the value of the loss cost function becomes smaller and smaller, which may result in overfitting. Therefore, we adopt cross-entropy as the reconstruction error function to suppress overfitting and obtain an AE network with strong generalization ability. The reconstruction error function R(x, y) is defined as follows:
$R(x, y) = -\sum_{k} [x^{(k)} \log y^{(k)} + (1 - x^{(k)}) \log(1 - y^{(k)})]$,
where the sum runs over the components $x^{(k)}$ and $y^{(k)}$ of x and y. For the entire training set $S = \{x_1, x_2, \ldots, x_m\}$, the overall loss cost function J(θ) is
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} R(x_i, y_i)$, with $y_i = g(f(x_i))$.
The function J(θ) is minimized by the gradient descent method to obtain the parameter θ.
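The encoding, decoding, and cross-entropy reconstruction error described above can be sketched in a few lines. This is a forward-pass illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, inputs are assumed to be scaled to [0, 1] so that the cross-entropy is well defined, the weights are tied (the decoder uses the transpose of the encoder matrix), and a small `eps` is added inside the logarithms for numerical safety.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def encode(x, W1, a):
    """h = s_f(W1 x + a); W1 is stored as one row per hidden unit."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W1, a)]


def decode(h, W1, b):
    """y = s_g(W1^T h + b); the decoder reuses the transposed encoder weights."""
    return [sigmoid(sum(W1[j][i] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]


def reconstruction_error(x, y, eps=1e-12):
    """Cross-entropy between an input x in [0, 1] and its reconstruction y."""
    return -sum(xi * math.log(yi + eps) + (1.0 - xi) * math.log(1.0 - yi + eps)
                for xi, yi in zip(x, y))


def mean_loss(batch, W1, a, b):
    """J(theta): reconstruction error averaged over all samples in the batch."""
    errors = [reconstruction_error(x, decode(encode(x, W1, a), W1, b))
              for x in batch]
    return sum(errors) / len(errors)


# toy example: 3 inputs, 2 hidden units, all parameters set to zero
W1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
a = [0.0, 0.0]
b = [0.0, 0.0, 0.0]
x = [0.2, 0.8, 0.5]
y = decode(encode(x, W1, a), W1, b)  # every output is sigmoid(0) = 0.5
```

Training would minimize `mean_loss` with respect to `W1`, `a`, and `b` by gradient descent; the reconstructions of real segments could then be added to the training set as extra samples, which is how the text describes the data enlargement.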

Shallow Convolution Neural Network.
The structure of the convolution neural network (CNN) differs from that of traditional layer-wise fully connected networks. The neurons in a CNN are not fully connected; what is more, weight sharing in the convolution kernels reduces the complexity of the network model and the number of weight parameters to train, making a CNN easier to train than earlier neural networks [37].
Nevertheless, compared with images and videos, EEG signals carry a very small volume of information. Besides, the EEG is a nonstationary, random, and very weak signal with a low signal-to-noise ratio and an unstable waveform. When classifying EEG signals, we found that too many convolutional layers in a CNN can easily lead to overfitting of the training model. Therefore, it is crucial to build a suitable CNN model. In this paper, the abstract different-dimensional frequency-spatial domain features are extracted by the SCNN [38]. It is a special CNN with a simple structure and few parameters. The training model is not prone to overfitting and can directly extract the frequency-spatial domain features from the EEG data. The structure of the SCNN is shown in Figure 4. The details of the structure used in this paper are shown in Table 1 in Section 3. The input is a one-dimensional feature vector $x = [x_1, x_2, \ldots, x_N]$ with a length of N corresponding to the EEG signals of N channels; the convolution layer is composed of K convolution kernels, the size of each convolution kernel is 1 × S, the coefficient of the kth convolution kernel is $w_k \in \mathbb{R}^S$, k = 1, 2, ..., K, and the output is
$h_k = R(w_k * x + b_k)$,
where * denotes the convolution operation, $b_k$ denotes the bias of the convolution kernel, and R denotes the nonlinear activation function, for which the Leaky ReLU function is adopted [39,40].
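A single convolution kernel followed by Leaky ReLU, as described above, can be sketched as follows. The function names, the negative slope of 0.01, and the zero "same" padding are assumptions made for illustration; as in most deep learning libraries, the kernel is applied without flipping (cross-correlation).

```python
def leaky_relu(v, slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return v if v > 0 else slope * v


def conv1d_same(x, kernel, bias=0.0, slope=0.01):
    """Apply one 1 x S kernel along x with zero padding, then Leaky ReLU.

    The output has the same length as the input ('same' padding).
    """
    size = len(kernel)
    left = (size - 1) // 2
    padded = [0.0] * left + list(x) + [0.0] * (size - 1 - left)
    return [
        leaky_relu(sum(kernel[j] * padded[i + j] for j in range(size)) + bias, slope)
        for i in range(len(x))
    ]


# toy example: a difference kernel over four samples
out = conv1d_same([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])
```

A layer with K kernels would call `conv1d_same` once per kernel and stack the K outputs as feature maps.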
In a single SCNN network structure, the network connects one or more full connection layers and a Softmax output layer after multiple convolution, pooling, and dropout layers. Suppose that the network has a total of L layers, where layer $L_M$ is the full connection layer, layer $L_L$ is the final output layer, and the number of output cells equals the number of classification categories n. In the entire calculation process, each hidden layer applies its learning parameters and activation function to the output of the previous layer, and the last layer converts its pre-activation value into class probabilities with the Softmax function. Here $h_l$ denotes the hidden layer output of the convolution network, $w_l$ and $b_l$ are the learning parameters of the network, $a_L$ is the value of the last output layer before activation, and $P(t \mid x)$ is the posterior probability that the input x belongs to category t. The label set for each input EEG signal category is T = [1, 2, ..., n]. For all samples in the training sets, the cross-entropy $\mathrm{Loss} = -(1/n) \sum_{i=1}^{n} \log p(t = i \mid X)$ is taken as the objective function to minimize.
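The Softmax output and the cross-entropy objective mentioned above can be illustrated with a short sketch. The function names are hypothetical and labels are taken as zero-based indices here, whereas the text numbers the categories from 1 to n.

```python
import math


def softmax(logits):
    """Turn pre-activation values a_L into class probabilities P(t | x)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class (zero-based label)."""
    return -math.log(softmax(logits)[label])


# four-class toy example with uninformative (all-equal) pre-activations
probs = softmax([0.0, 0.0, 0.0, 0.0])
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
```

With equal pre-activations, each of the four classes receives probability 0.25 and the loss equals log 4, which is the value a four-class network should start near before it has learned anything.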

Bidirectional Long Short-Term Memory Network.
EEG signals are not images in the traditional sense but time series with a strong correlation in time. The SCNN network is not fully suitable for learning the time-series features of EEG signals, whereas recurrent neural networks have certain advantages in this respect [41,42]. Therefore, in this paper, a BiLSTM network with time increments, which is a kind of recurrent neural network, is connected in series after the multiple convolution, pooling, and dropout layers of the SCNN network and before the full connection layer. Different from the traditional unidirectional LSTM network, the BiLSTM network improves on the network structure so as to better handle gradient disappearance and to extract the information of each time point more fully, which suits EEG processing in the temporal domain. At each moment, the BiLSTM network receives the information transmitted by the hidden layers in the forward and backward directions, and it combines the outputs of the forward and backward hidden layers to obtain its final output for that moment.
In this paper, to reduce the local convergence caused by too few layers and the gradient disappearance caused by too many layers, a two-layer BiLSTM network is designed, which converges more quickly and effectively reduces the gradient disappearance caused by overly deep propagation between layers. The network's structure and principle are shown in Figure 5.
BiLSTM network is a unidirectional LSTM network when it is performing forward calculation, and the forward calculation requires the input data before the current time.
The forward calculation of the network computes the forward hidden state at time t from the current input and the forward hidden state at the previous time point. When the network performs backward calculation, it is associated with the future input data after the current time.
The backward calculation of the network computes the backward hidden state at time t from the current input and the backward hidden state at the next time point. The LSTM networks in the forward and backward directions each maintain their own state information, there is no connection between them, and the unfolded diagram of the network is not a circular structure. After the state information coming from both directions is superimposed, the output layer can be calculated: the output $y_t$ is obtained from the forward and backward hidden states at time t, where $x_t$ and $y_t$ denote the input and the output layers, respectively, and W and b represent the network's weights and biases, respectively.
Figure 3: The structure of autoencoder neural network.
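The two independent passes of a bidirectional recurrent layer can be sketched as below. To keep the example short, a toy tanh cell stands in for the LSTM cell, so this shows only the bidirectional wiring and not the gating inside an LSTM; all function names and the weights 0.5 are hypothetical.

```python
import math


def run_direction(inputs, step, h0=0.0):
    """Apply step(x, h_prev) along the sequence and collect every state."""
    states = []
    h = h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(inputs, step_forward, step_backward, h0=0.0):
    """Pair the forward and backward states for every time point."""
    forward = run_direction(inputs, step_forward, h0)
    # the backward pass reads the sequence from the last time point to the first
    backward = run_direction(list(reversed(inputs)), step_backward, h0)
    backward.reverse()  # re-align so that index t refers to the same time point
    return list(zip(forward, backward))


def tanh_cell(x, h_prev, w_x=0.5, w_h=0.5):
    """Toy recurrent cell used here in place of an LSTM cell."""
    return math.tanh(w_x * x + w_h * h_prev)


states = bidirectional([1.0, 2.0, 3.0], tanh_cell, tanh_cell)
```

Each element of `states` holds the forward state, which depends only on inputs up to that time point, and the backward state, which depends only on inputs from that time point onward; an output layer would then combine the two.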
When the BiLSTM network is combined in series with the SCNN network, the whole network is calculated layer by layer, where $h^l_{SCNN}$ is the hidden layer information output by the whole SCNN network. After a linear transformation, $a^L_{SCNN}$ is a set of effective EEG features for the different categories, extracted by the network from the input EEG data, and $a_{BiLSTM}$ carries the hidden features of the EEG signals in the temporal dimension. Then, the frequency-spatial domain features extracted by the SCNN network and the time-series features extracted by the BiLSTM network are synthesized in the feature fusion layer to obtain the fusion features of the temporal-frequency-spatial domain. Finally, the highly abstract features that have undergone multiple convolutions and recurrent cycles are fused after a linear transformation. The relative proportions of the "good" and "bad" features are adjusted by learning weights from the training data; the reweighted features are then sent into the output layer for the probability calculation of each category.
The above feature fusion only concatenates features along a one-dimensional vector. The frequency-spatial domain features extracted by the SCNN network and the time-series features extracted by the BiLSTM network have some redundancy, so the mechanically synthesized fusion features are redundant as well, which slows down network training and degrades the final classification effect. Therefore, in this paper, an attention mechanism is added to process the fusion features.

Attention Mechanism.
In cognitive science, to make reasonable use of the finite resources of visual information processing, humans usually ignore part of the information and pay attention to the more critical part; that is to say, the brain's attention is focused on a specific visual area. This mechanism is called the attention mechanism [43]. In this paper, the feature fusion process is optimized through the attention mechanism. The frequency-spatial domain features extracted by the SCNN network and the time-series features extracted by the BiLSTM network are fused, and the importance of each fusion feature is calculated to obtain the effective attention, so that MI-EEG signals are classified automatically by fusing the temporal-frequency-spatial domain features more effectively.
In traditional sequence-to-sequence learning, the encoder-decoder structure is often used for learning, as shown in Figure 6. The encoder encodes the input sequence to get the intermediate state information c, and the decoder then uses this intermediate vector as its input to get the output of each sequence element at the decoding end. The overall process is as follows:
$P(y_t \mid y_1, y_2, \ldots, y_{t-1}, c) = g(y_{t-1}, s_t, c)$. (21)
The output at each moment uses the same context semantic vector c. However, in the process of sequence encoding and decoding, we hope that the context semantic vector for each moment's output is an appropriate one, so the attention mechanism is introduced to select the appropriate context semantic vector according to the output of different moments. The attention model is shown in Figure 7. The decoding process for the attention model is as follows:
$P(y_i \mid y_1, y_2, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$,
where $c_i$ is the added attention; its role is to associate the output with the relevant input and to calculate the correlation $a_{ij}$ between the current output and all inputs; then,
$c_i = \sum_{j} a_{ij} h_j$,
where $h_j$ is the hidden layer information at position j of the encoder's input. In this paper, the SCNN outputs the EEG signal with frequency-spatial domain features as the input of the encoder BiLSTM; that is,
$h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$,
where the forward and backward hidden layer information is synthesized. The weight $a_{ij}$ identifies the relevancy of the input sequence to the current output sequence; it is a normalized probability value, meaning the probability of the relationship between item j of the input and the output at the current moment. The weight $a_{ij}$ is defined as
$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$.
The definition of $a_{ij}$ introduces the symbol $e_{ij}$, which is computed by a feedforward neural network and is jointly determined by the state information $s_{i-1}$ of the hidden layer at the decoding end and $h_j$ of the hidden layer at the encoding end. In the attention-based SCNN-BiLSTM network designed above, the attention module is an additional neural network that gives different weights to each part of the fusion features and is more sensitive to the classification target, which enhances the performance of the whole neural network.
Figure 5: The structure and principle of bidirectional long short-term memory network with time increments. "S" represents the sigmoid operator; "tanh" signifies the hyperbolic tangent operator. "C t" represents the state of the BiLSTM cell at moment t.
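The normalized attention weights and the resulting context vector can be illustrated as follows. This is a sketch under assumptions: the text describes the score $e_{ij}$ as the output of a feedforward network, whereas the hypothetical `dot_score` below uses a plain dot product as a stand-in so that the example stays self-contained.

```python
import math


def attention_weights(scores):
    """Normalize the scores e_ij into weights a_ij that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    return [v / total for v in exps]


def context_vector(weights, hidden_states):
    """c_i: the weighted sum of the encoder hidden states h_j."""
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]


def dot_score(state, h):
    """Stand-in for the feedforward scoring network (assumption)."""
    return sum(s * v for s, v in zip(state, h))


hidden = [[1.0, 0.0], [0.0, 1.0]]   # two encoder positions
state = [2.0, 0.0]                  # previous decoder state s_{i-1}
weights = attention_weights([dot_score(state, h) for h in hidden])
context = context_vector(weights, hidden)
```

Because the decoder state aligns with the first hidden state, that position receives the larger weight and dominates the context vector, which is the reweighting effect the attention module is meant to have on the fused features.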

Experiment and Results
We refine the attention-based SCNN-BiLSTM network model designed in Section 2 and then train and test it to verify its superiority in multiclass MI-EEG recognition. The model is trained and tested on a machine with an Intel 3.6 GHz Core i7-10700F CPU, 16 GB of RAM, and an NVIDIA GeForce RTX 2060 GPU. The details of the network model are shown in Table 1. For the above deep neural network model, the minibatch gradient descent method is used for network training. To accelerate the convergence of the network, the Adam optimizer is used, so that the model converges to the optimal value [44]. In the training of the model, the setting of the learning rate and the selection of the minibatch size affect the model's final accuracy and training speed, so in this paper we fix the other parameters of the [...]
Figure 6: Encoder-decoder process.
While training the neural network, random dropout and padding strategies are used. The random dropout strategy for the SCNN network prevents the network model from overfitting the training data, while the padding strategy makes the output size of the convolution layer equal to the input size to prevent the loss of feature size [45]. In this paper, the random dropout parameter P is 0.2. Figure 8 shows the training loss and accuracy curves of the neural network model over 500 training iterations. It can be seen that, after 240 iterations, the training accuracy curve converges to 0.9 and the training loss is about 0.1.
We test the trained network model on the test data sets E. Each point of the high-dimensional four-class MI-EEG feature data is mapped onto a low-dimensional map in a way that avoids crowding the points in the center of the map, forming a scatter plot of t-distributed Stochastic Neighbor Embedding (t-SNE) [46], as shown in Figure 9. In the t-SNE scatter plot, the classification categories are represented by different colors. It can be seen that the categories are clearly separated, although some data remains hard to identify, which may be caused by interference during data acquisition.
Further, the trained network model is evaluated by indicators such as accuracy, precision, sensitivity, and specificity, which are calculated as follows:
$\mathrm{ACC} = \frac{TP_n + TN_n}{TP_n + TN_n + FP_n + FN_n}$, (27)
$\mathrm{PPV} = \frac{TP_n}{TP_n + FP_n}$, (28)
$\mathrm{TPR} = \frac{TP_n}{TP_n + FN_n}$, (29)
$\mathrm{TNR} = \frac{TN_n}{TN_n + FP_n}$, (30)
where TP represents the number of testing samples whose real value and model predicted value of the classification category are both positive, TN represents the number of testing samples whose real value and model predicted value are both negative, FP represents the number of testing samples whose real value is negative but whose model predicted value is positive, and FN represents the number of testing samples whose real value is positive but whose model predicted value is negative. Accuracy (ACC) is the proportion of correct judgments among all model predictions for the testing samples. Precision (PPV) is the proportion of correct judgments among the testing samples predicted as positive. Sensitivity (TPR) is the proportion of correct judgments among the testing samples whose true category is positive. Specificity (TNR) is the proportion of correct judgments among the testing samples whose true category is negative. n denotes the classification category. The trained model classifies the test data of the subjects in BCI Competition IV 2a; the classification accuracy of each subject and the average classification accuracy of all subjects are shown in Figure 10.
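The four indicators can be computed per category from a confusion matrix by treating that category as positive and all others as negative. The sketch below is illustrative: the function name, the row-is-true and column-is-predicted layout, and the toy numbers are assumptions and are not results from the paper.

```python
def one_vs_rest_metrics(confusion, cls):
    """ACC, PPV, TPR, and TNR for one category treated as the positive class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                  # true cls, predicted other
    fp = sum(row[cls] for row in confusion) - tp   # true other, predicted cls
    tn = total - tp - fn - fp
    return {
        "ACC": (tp + tn) / total,
        "PPV": tp / (tp + fp),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
    }


# toy two-class confusion matrix (made-up counts)
confusion = [[8, 2],
             [1, 9]]
metrics = one_vs_rest_metrics(confusion, 0)
```

The function raises `ZeroDivisionError` when a category is never predicted or never occurs in the test data, so a real evaluation script would need to guard against those cases.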

Discussion
Our attention-based SCNN-BiLSTM method is compared with the methods in the literature [16, 31, 47-52], and the classification accuracy of each method is measured by the kappa coefficient [53]. In a classification problem, a higher kappa coefficient indicates higher classification accuracy. The kappa coefficient is calculated as follows:

$$\kappa = \frac{\mathrm{ACC} - 1/C}{1 - 1/C},$$

where C denotes the number of known categories, so that 1/C is the chance-level accuracy, and ACC is the average classification accuracy. The literature methods are summarized in Table 2. Reference [47] proposed the Filter Bank Common Spatial Pattern (FBCSP) method to extract features of MI-EEG signals and adopted the "one-versus-rest" multiclassification mechanism, which won the 2008 International Brain-Computer Interface Competition. Reference [48] proposed an automatic method for the classification of general artifactual source components, a form of Independent Component Analysis (ICA) based artifact removal for MI-EEG signals, with a classification accuracy of 69.7 ± 14.2%. Reference [49] proposed a method based on spectral regression kernel discriminant analysis (SRKDA), with a classification accuracy of 78.4 ± 14.0%. Reference [50] proposed a method that combined CSP and Local Characteristic-scale Decomposition (LCD) to extract features of MI-EEG signals, with a classification accuracy of 80.2 ± 8.10%. Reference [51] proposed an adaptive Stacked Regularized Linear Discriminant Analysis (SRLDA) method to analyze the temporal, spatial, and spectral information of MI-EEG signals.
The results showed that the adaptive SRLDA method was superior to Data Space Adaptation (DSA) based on Kullback-Leibler divergence. However, the above-mentioned methods rely entirely on the current human understanding of EEG signals and require relevant professional knowledge in the process of feature extraction, which makes feature extraction overly complicated and limits classification performance. Reference [52] proposed a method combining wavelet transformation with a 2-layer CNN, with a classification accuracy of 81.2 ± 28.5%. Reference [16] first used the "one-versus-rest" Filter Bank Common Spatial Pattern (OVR-FBCSP) method to extract primary features of MI-EEG signals; CNN and LSTM networks were then applied to re-extract and classify those features, giving a classification accuracy of 83.0 ± 8.34%. Although these methods have achieved some success, they did not fully exploit the self-learning capability of deep learning and still followed the idea of manually extracting features first and then classifying patterns. Tabar and Halici [31] and Amin et al. [37] proposed new deep learning methods of feature extraction and pattern classification for MI-EEG signals, but the classification accuracy was not high: 66.2 ± 11.2% and 74.5 ± 10.1%, respectively.
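The kappa coefficient used in this comparison depends only on ACC and C. The following is a minimal sketch, assuming the common chance-corrected form kappa = (ACC - 1/C) / (1 - 1/C) for balanced categories; the function name `kappa_from_accuracy` is hypothetical. For the mean accuracy of 82.7% and C = 4 it evaluates to about 0.77, and an accuracy spread of 5.57% maps to about 0.074, close to the reported 0.78 ± 0.074.

```python
def kappa_from_accuracy(acc, num_classes):
    """Chance-corrected kappa for a balanced task with num_classes categories.

    acc is the classification accuracy as a fraction in [0, 1];
    1 / num_classes is the accuracy expected from random guessing.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    chance = 1.0 / num_classes
    return (acc - chance) / (1.0 - chance)


# Mean accuracy reported for the proposed method, four MI categories.
print(round(kappa_from_accuracy(0.827, 4), 3))
# The mapping is linear, so a standard deviation scales by 1 / (1 - 1/C).
print(round(0.0557 / (1.0 - 1.0 / 4), 3))
```

Because the mapping is linear in ACC, ranking methods by kappa gives the same order as ranking them by accuracy when C is fixed.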
In this paper, we propose an attention-based time-incremental end-to-end shared neural network. The SCNN extracts the frequency-spatial domain features and the BiLSTM network with time increments extracts the time-series features, so that the method effectively learns the temporal-frequency-spatial domain features of MI-EEG signals. Finally, an attention mechanism is added to the feature fusion layer, and the extracted temporal-frequency-spatial domain features are dynamically weighted to reduce the redundancy of the fused features, which raises the classification accuracy to 82.7 ± 5.57%. The results of the comparison between our method and the literature methods are shown in Figure 11. Panels (a) and (b) show that, compared with the nondeep learning methods and the deep learning methods, our method has the minimum individual difference among different subjects, and panels (c) and (d) show that the classification accuracy of each subject is greater than 73.3%. In other words, our method has improved the overall accuracy of all 9 subjects. Therefore, our method is more suitable for the multiclass recognition of MI-EEG signals, which are short time series, and is effective in improving the overall classification and recognition.

Figure 10: [...] The average classification accuracy rate of all subjects (average). Take the confusion matrix of A01 as an example. The first column indicates that, of the 72 testing samples from left-hand trials, 64 are predicted as left hand and 8 are predicted as right hand, feet, or tongue; that is, TP = 64 and FN = 8, so TPR = 64/72 = 88.9%. The second to fourth columns are read in the same way. The first row indicates that the 72 testing samples predicted as left hand come from 64 left-hand, 2 right-hand, 5 feet, and 1 tongue trials; that is, TP = 64 and FP = 8, so PPV = 64/72 = 88.9%. The second to fourth rows are read in the same way. The main diagonal indicates that the model makes 64 + 62 + 64 + 64 = 254 correct judgments out of 288 testing samples, so ACC = 254/288 = 88.2%.

Conclusions
In this paper, we propose an attention-based time-incremental end-to-end shared neural network, which is essentially an end-to-end trainable model that unifies the SCNN, the BiLSTM network, and an attention mechanism. With this end-to-end shared neural network, the feature extraction and pattern classification of MI-EEG signals are performed step by step, which effectively improves the accuracy and robustness of EEG recognition.
In much of the research on deep neural network methods for EEG signals, the amount of training data is insufficient for network training, and EEG signals are usually treated as images, which may lead to the loss of temporal information. Moreover, simple network stacking can cause feature redundancy. To address these issues, the method in this paper proceeds in three steps. First, the training data of all subjects are merged and then enlarged with an autoencoder, which meets the demand of deep learning for a large amount of training data while reducing the adverse effect of the randomness, instability, and individual variability of EEG data on signal recognition. Second, the BiLSTM network with time increments is connected in series with the SCNN, so that the SCNN extracts the frequency-spatial domain features of MI-EEG and the BiLSTM network then extracts its temporal features; in this way the temporal-frequency-spatial domain features are fully learned, which supports the classification of MI-EEG signals. Third, the attention mechanism introduced into the feature fusion layer dynamically weights the fused MI-EEG features, which reduces their redundancy and improves the classification accuracy.
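The dynamic weighting in the third step can be illustrated with a generic softmax attention over a set of feature vectors. This is a minimal sketch and not the network's actual fusion layer: the dot-product scoring, the `query` vector (which would be learned in a trained network), the function names, and the toy features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse equal-length feature vectors into one vector.

    Each vector is scored by its dot product with `query` (hypothetical
    scoring rule), the scores are normalized with softmax, and the
    vectors are summed with the resulting weights.
    Returns (weights, fused_vector).
    """
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(w * vec[d] for w, vec in zip(weights, features))
        for d in range(dim)
    ]
    return weights, fused


# Toy features: three 2-dimensional vectors, e.g. one per time step.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = attention_fuse(toy_features, query=[0.5, 0.5])
print([round(w, 3) for w in weights], [round(x, 3) for x in fused])
```

Vectors that score higher against the query receive larger weights, so weakly relevant features contribute less to the fused representation than they would under plain averaging or concatenation.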
The results of the comparison with the traditional nondeep learning methods and the deep learning methods show the effectiveness of the proposed end-to-end shared neural network. The method is more suitable for the multiclass recognition of MI-EEG signals, which are short time series. It has the minimum individual difference among different subjects and is effective in improving the overall classification and recognition across subjects.
In the near future, we will continue to focus on the information in raw MI-EEG data. Through analysis of the irregularities in both the distribution structure of EEG channels and the EEG data, we will try to learn more about the algorithms of feature extraction, feature fusion, and pattern classification, in order to recognize more than four classes of motor imagery tasks and to improve the classification accuracy while reducing the individual difference of the same network model across subjects. In addition, we will try to deploy the combination of a brain-computer interface and a limb rehabilitation robot in an online system.

Figure 11: Comparison results of the accuracy and kappa coefficient. (a, c) The comparison results between the traditional nondeep learning methods and our method. (b, d) The comparison results between the deep learning methods and our method. Panels (a, b) show that our method has the minimum individual difference among different subjects, and panels (c, d) show that our method has improved the overall accuracy of all 9 subjects.

Data Availability
The BCI Competition IV data set 2a is available at http://www.bbci.de/competition/iv/.

Conflicts of Interest
The authors declare no conflicts of interest.