Channel and Feature Selection for a Motor Imagery-Based BCI System Using Multilevel Particle Swarm Optimization

A brain-computer interface (BCI) is a communication and control system linking the human brain and computers or other electronic devices. However, irrelevant channels and misleading features unrelated to the task limit classification performance. To address these problems, we propose an efficient signal processing framework based on particle swarm optimization (PSO) that supports three schemes: channel and feature selection, channel selection, and feature selection. Modified Stockwell transforms were used for feature extraction, and multilevel hybrid PSO-Bayesian linear discriminant analysis was applied for optimization and classification. The BCI Competition III dataset I was used to evaluate the proposed scheme. Compared with a method without optimization (89% accuracy), the PSO-based scheme reached a best classification accuracy of 99% while using less than 10.5% of the original features and reducing the test time by more than 90%. It achieved a Kappa value of 0.98 and an F-score of 98.99%, as well as a better signal-to-noise ratio, thereby outperforming existing algorithms. The results show that the channel and feature selection scheme can accelerate convergence to the global optimum and reduce the training time. Because the proposed framework significantly improves classification performance, effectively reduces the number of features, and greatly shortens the test time, it can serve as a reference for research on related real-time BCI application systems.


Introduction
The brain-computer interface (BCI) is a communication and control system established between the human brain and a computer or external device without the involvement of peripheral nerves or muscles. Common brain activity patterns used in BCIs include P300 potentials, steady-state visual evoked potentials (SSVEP), and motor imagery (MI) [1]. Among these, the motor imagery-based BCI system (MI-BCI) involves imagining the movement of a body part without any actual body movement; this provides patients with motor disabilities a new approach to effective communication. Similar methods have been widely used in rehabilitation applications [2].
Pattern recognition is an important aspect of the MI-BCI system. However, brain signals contain a large amount of physiological and pathological information, and recorded electroencephalogram (EEG) signals are mixed with other brain activity signals, which can overlap in both time and space [3]. As a result, the extracted features contain a lot of redundant and misleading information, which limits classification accuracy. Furthermore, to enhance the spatial resolution of EEG recording, the number of signal acquisition channels has been increased to 16, 32, and 64 leads or more [4]. An increase in the number of channels improves the spatial resolution but also increases the number of features, and therefore the running time of classification. Hence, task-related features require proper selection mechanisms. Although genetic algorithms are used most commonly for task recognition [5,6], other MI optimization methods have been suggested, such as differential evolution [7], particle swarm optimization [8], the concave-convex procedure [9], principal component analysis [10], and correlation-based channel and time window selection [11,12]. Notably, the PSO algorithm is a promising technique with simple computation and rapid convergence, and it has been successfully applied to mechanical engineering optimization, business optimization, and clustering problems [13].
Analysis of the existing motor imagery recognition schemes revealed several drawbacks. Firstly, high variance among the optimized feature components may result in low classification performance in some schemes, due to the inability to identify and filter out all misleading features. Secondly, the number of optimized features is still large, which requires a lot of testing time and limits practical application of these schemes. Finally, the features and channels have been analyzed separately, not taking into account the fact that the selection of the optimal features depends on the channel used. Even if the best feature set can be identified, more training time is required.
To solve these technical problems, this study first introduces an efficient optimization framework based on multilevel PSO (MLPSO). In this scheme, the PSO algorithm performs a global search over the whole search space, and local search is performed by running the algorithm repeatedly. This improves the ability of the procedure to move from local optima toward the global optimum. The scheme reduces the number of features and enhances the classification accuracy by selecting the best feature subsets that match the expected cortical activity patterns during the MI task, and it is widely applicable and easy to implement. The reason for using MLPSO in feature optimization is that when a particle's current position coincides with the global best position and the particle velocity is not zero, all particles move to that position rapidly, leading to rapid convergence of the PSO algorithm. However, convergence of the algorithm to even a local optimum is not guaranteed. All particles simply move to the best position found so far; this phenomenon is known as stagnation [14]. This problem can be addressed by running the optimizer several times on the same cost function. Hence, MLPSO was used to optimize the task related to motor imagery in this study.
On the other hand, to address the last drawback, three optimization schemes were designed within the proposed signal processing framework: channel and feature selection, channel selection, and feature selection. In the channel and feature selection scheme, irrelevant channels are first eliminated through channel selection, and features matching the task are then selected through feature selection. These steps accelerate the screening of irrelevant features.
The current study investigates a signal recognition framework for MI-BCI. An MLPSO algorithm was used for optimization in combination with Bayesian linear discriminant analysis (BLDA) classification, and modified Stockwell transforms (MST) were applied for feature extraction. Figure 1 shows the block diagram of the proposed framework. Signal processing was implemented in MATLAB, and the simulation was run on a Linux server with 64 GB of memory, a 512 GB SSD, an NVIDIA GeForce TITAN X, and a six-core Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz.
The remainder of this paper is organized as follows. Section 2 describes the experimental dataset and preprocessing, feature extraction, classification, multilevel PSO-based channel and feature selection, and classification performance. Sections 3 and 4 present and discuss the classification results of the proposed optimization scheme. Finally, conclusions are summarized at the end of this paper.

Experimental Dataset and Preprocessing.
A high-quality signal is an important prerequisite for improving the classification accuracy and evaluating an algorithm's performance. Since an electrocorticogram (ECoG) is recorded on the surface of the cortex and provides higher temporal and spatial resolution, a better signal-to-noise ratio, and broader bandwidth than EEG signals, the BCI Competition III dataset I [15] was used in this study. During the BCI experiment, a subject had to perform imagined movements of either the left small finger or the tongue. The time series of the electrical brain activity was recorded during these trials using an 8 × 8 ECoG platinum electrode grid placed on the contralateral (right) motor cortex. All recordings were performed at a sampling rate of 1000 Hz. Every trial consisted of either an imagined tongue or an imagined finger movement and was recorded for a duration of 3 seconds. The dataset consists of 278 trials of training data and 100 trials of test data, which are stored in a 3D matrix named X in the following format: trials × electrode channels × samples of time series. The labels of the dataset are stored as a vector of -1/1 values named Y. To reduce the amount of data needed for signal processing, the data were downsampled to 100 Hz without causing distortion.
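As a minimal sketch of the downsampling step, the snippet below reduces a trials × channels × samples array from 1000 Hz to 100 Hz by averaging non-overlapping blocks of 10 samples. The text does not state which decimation method was used, so the block averaging (a crude anti-aliasing filter), the function name `downsample`, and the random stand-in data are assumptions made for illustration only.

```python
import numpy as np

def downsample(X, factor=10):
    """Average non-overlapping blocks of `factor` samples along the last axis."""
    X = np.asarray(X, dtype=float)
    n_keep = (X.shape[-1] // factor) * factor      # drop any incomplete trailing block
    blocks = X[..., :n_keep].reshape(*X.shape[:-1], n_keep // factor, factor)
    return blocks.mean(axis=-1)

# Stand-in for the data: 4 trials x 64 channels x 3000 samples (3 s at 1000 Hz).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64, 3000))
X_ds = downsample(X, factor=10)                    # 3 s at 100 Hz
print(X_ds.shape)                                  # (4, 64, 300)
```

A filter-based decimator would suppress aliasing more carefully; block averaging is used here only because it keeps the example short.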

Feature Extraction.
An efficient feature extraction method can isolate event characteristics from recorded brain signals, thus improving classification performance. The Stockwell transform (ST) is an extension of the wavelet transform based on a moving and scalable localizing Gaussian window; it provides frequency-dependent resolution while maintaining a direct connection to the Fourier spectrum [16]. The ST of the time series x(t) can be obtained as follows:
$$S(\tau, f) = \int_{-\infty}^{\infty} x(t)\, g(\tau - t, f)\, e^{-i 2\pi f t}\, dt \tag{1}$$
The Gaussian window g(τ − t, f) is defined by
$$g(\tau - t, f) = \frac{1}{\sigma(f)\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2}}{2\sigma(f)^{2}}} \tag{2}$$
and the standard deviation σ(f) is a function of the frequency f, which is equal to
$$\sigma(f) = \frac{1}{|f|} \tag{3}$$
By adjusting the standard deviation of the Gaussian window, and with it the time-frequency resolution, the MST can provide better energy concentration than the ST, obtaining higher frequency resolution at lower frequencies and better time localization at higher frequencies [17]. Accordingly, it has been used to detect dynamic brain signals. In the MST, the standard deviation of the window is controlled by the scaling factors p and q, which determine the width and height of the Gaussian window, respectively.
According to the frequency range of MI, the frequency range of the MST is set to 1-35 Hz with an interval of 1 Hz. The power spectral density (PSD) is then calculated. Therefore, 35 features were extracted for each channel. Since there were 64 channels, 35 × 64 = 2240 features were extracted for each trial. The feature matrices of the training set and test set therefore have dimensions 278 × 2240 and 100 × 2240, respectively. The power spectrum after feature extraction for two training trials with different labels is shown in Figure 2. The frequency distribution of energy in the two panels is visibly different.
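To make the feature dimensions concrete, the following sketch computes one power value per frequency from 1 to 35 Hz for a single channel. It uses the standard Stockwell window with σ(f) = 1/|f|; the MST rescaling by p and q is not reproduced here. The function name `st_power_features`, the direct Riemann-sum evaluation, and the use of time-averaged squared magnitude as the power feature are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def st_power_features(x, fs=100.0, freqs=range(1, 36)):
    """Time-averaged power of the standard Stockwell transform at each frequency."""
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size) / fs                     # sample times in seconds
    diff = t[:, None] - t[None, :]                 # tau - t for every pair of samples
    feats = []
    for f in freqs:
        sigma = 1.0 / abs(f)                       # standard ST width; the MST rescales this
        window = np.exp(-diff**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
        carrier = x * np.exp(-2j * np.pi * f * t)  # x(t) * exp(-i 2 pi f t)
        S = window @ carrier / fs                  # Riemann sum over t for each tau
        feats.append(np.mean(np.abs(S) ** 2))      # average power over time
    return np.array(feats)

# One 3 s channel sampled at 100 Hz containing a 10 Hz oscillation.
t = np.arange(300) / 100.0
features = st_power_features(np.sin(2 * np.pi * 10 * t))
print(features.shape)                              # (35,)
print(64 * features.size)                          # 2240
```

Applying the function to each of the 64 channels and concatenating the results yields the 2240-dimensional feature vector per trial described above.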

Classification.
As an extension of Fisher's linear discriminant analysis, BLDA applies regularization in the training process; it has the advantages of automatically adjusting parameters and avoiding overfitting in classification [18]. These characteristics make it suitable for real-time BCI systems.
In the dataset, the labels of the samples are denoted by "1" and "−1," but the output of the classifier is a continuous value rather than one of these two labels. Therefore, the predicted output is binarized with a threshold of 0; that is, predicted outputs that are ≥0 are marked as "1," and those that are <0 are marked as "−1."
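The thresholding rule can be written in a few lines. This is only a sketch: the function name `to_labels` and the example scores are hypothetical, and the scores stand in for the continuous BLDA outputs.

```python
def to_labels(scores, threshold=0.0):
    """Map continuous classifier outputs to the dataset's -1/1 labels."""
    return [1 if s >= threshold else -1 for s in scores]

print(to_labels([0.73, -0.12, 0.0, -1.4]))   # [1, -1, 1, -1]
```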

Multilevel PSO for Channel and Feature Selection.
PSO is a population-based optimization algorithm inspired by the social behavior of bird flocking [19]. The algorithm first initializes a group of particles randomly in the given solution space and then updates the velocity and position of each particle by tracking two "best values." One is the best position found so far by a single particle, called the personal best position (pbest). The other is the global best position (gbest) found so far by all of the particles. The particles are updated according to the following equations:
$$v_{ij}^{k+1} = w\, v_{ij}^{k} + c_1 r_1 \left(pbest_{ij}^{k} - x_{ij}^{k}\right) + c_2 r_2 \left(gbest_{j}^{k} - x_{ij}^{k}\right), \qquad x_{ij}^{k+1} = x_{ij}^{k} + v_{ij}^{k+1} \tag{5}$$
where $x_{ij}^{k}$ and $v_{ij}^{k}$ represent the position and velocity of the i-th particle in the j-th dimension at iteration k, $r_1$ and $r_2$ are random values between 0 and 1, $c_1$ and $c_2$ are the acceleration coefficients, and w is the inertia weight. To prevent a blind search of particles and the expansion of the population, the position and velocity are limited to a certain interval; for the velocity,
$$-v_{\max} \le v_{ij}^{k+1} \le v_{\max} \tag{6}$$
When values exceed this range, a boundary absorption strategy sets them to the adjacent boundary values. To better balance the global and local search ability of the algorithm, a linearly decreasing inertia weight is used in this paper; that is,
$$w = w_{\max} - \left(w_{\max} - w_{\min}\right)\frac{t}{T} \tag{7}$$
where $w_{\max}$ and $w_{\min}$ represent the maximum and minimum values of the inertia weight, respectively, t is the current iteration, and T is the maximum number of iterations. Channel and feature selection take place in a discrete search space, so each particle coordinate can only take the value "0" or "1." The velocity update rule of the PSO algorithm is retained, but the position of the particle is determined by the following equation:
$$x_{ij}^{k+1} = \begin{cases} 1, & r < s\left(v_{ij}^{k+1}\right) \\ 0, & \text{otherwise} \end{cases}, \qquad s(v) = \frac{1}{1 + e^{-v}} \tag{8}$$
where r is a random number in [0, 1] and $s(v_{ij}^{k+1})$ is the sigmoid function. The equation therefore indicates that a particle coordinate takes the value 1 with probability $s(v_{ij}^{k+1})$. The PSO optimization process is described as follows:
(a) Initialize the population. Randomly initialize the position of the population $C_{N \times D}$ using binary coding, initialize the velocity, and set the maximum number of iterations. Here, N is the number of particles, and D is the dimension of each particle, which is determined by the number of features to be optimized. Each index represents one feature: a binary "1" means that the feature at that index will be used for classification, and "0" means that the feature will be ignored. Figure 3 shows a schematic of the creation of the initial population positions.
(b) Calculate the fitness of each particle in the population. The fitness value is an indicator of how good an individual in the population is. In the experiment, the inverse of the mean square error of the testing data is taken as the fitness function:
$$fitness = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}} \tag{9}$$
where $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\}$ is the predicted value of the testing data, $Y = \{y_1, y_2, \ldots, y_n\}$ is the true value, and n is the number of samples.
(c) Update pbest and gbest. For each particle, its fitness value is compared with the fitness value of the best position it has visited, and pbest is updated if the new value is better. Its fitness value is then compared with the fitness value of the global best position, and gbest is updated if the new value is better.
(d) Update velocity and position. First, the updated velocity v is calculated according to equations (5) and (6), where the inertia weight w is obtained from (7). Next, the new position is calculated from equation (8).
(e) Repeat steps (b)-(d) until the maximum number of iterations is reached, and record the fitness value and gbest for each iteration to select the best combination of features.
Figure 4 shows the flow of multilevel PSO for channel and feature selection. The loop is terminated when the maximum execution level is reached or the number of selected features no longer changes.
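Steps (a)-(e) and the multilevel loop can be sketched as follows. This is an illustrative toy, not the authors' MATLAB implementation: the fitness here is a stand-in that rewards three made-up "relevant" indices instead of the inverse mean square error of a BLDA classifier, the names `binary_pso`, `multilevel_pso`, and `toy_fitness` are hypothetical, and the swarm size and iteration count are reduced to keep the run short. The sketch also assumes that each level reruns PSO on the features kept by the previous level, which is one reading of the termination rule above. The coefficients (c1 = c2 = 1.5, w from 0.8 to 0.4, v_max = 20) follow the Parameter Settings section.

```python
import math
import random

def binary_pso(fitness, dim, n_particles=20, n_iter=30, c1=1.5, c2=1.5,
               w_max=0.8, w_min=0.4, v_max=20.0, rng=None):
    """One PSO level: return the best 0/1 mask found and its fitness."""
    rng = rng or random.Random(0)
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for t in range(n_iter):
        w = w_max - (w_max - w_min) * t / n_iter       # linearly decreasing inertia
        for i in range(n_particles):
            for j in range(dim):
                v = (w * vel[i][j]
                     + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                     + c2 * rng.random() * (gbest[j] - pos[i][j]))
                v = max(-v_max, min(v_max, v))         # boundary absorption
                vel[i][j] = v
                # Sigmoid rule: the bit becomes 1 with probability s(v).
                pos[i][j] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

def multilevel_pso(fitness, dim, max_levels=4, **kwargs):
    """Rerun PSO on the surviving indices until the selection stops shrinking."""
    selected = list(range(dim))
    for _ in range(max_levels):
        def sub_fitness(mask, idx=selected):
            return fitness([i for i, bit in zip(idx, mask) if bit])
        mask, _ = binary_pso(sub_fitness, len(selected), **kwargs)
        kept = [i for i, bit in zip(selected, mask) if bit]
        if not kept or len(kept) == len(selected):     # nothing left, or no change
            break
        selected = kept
    return selected

RELEVANT = {0, 3, 7}                                   # hypothetical informative features

def toy_fitness(indices):
    """Reward relevant indices and lightly penalize every irrelevant one."""
    hits = sum(1 for i in indices if i in RELEVANT)
    return hits - 0.1 * (len(indices) - hits)

selected = multilevel_pso(toy_fitness, dim=12)
print(selected)
```

In a real pipeline, `toy_fitness` would be replaced by a function that trains the classifier on the selected feature columns and returns the fitness defined in step (b).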

Classification Performance.
To evaluate the performance of the algorithm, Kappa values, F-score, and time-based statistics such as sensitivity (recall), specificity, and precision were used.
The Kappa value is an indicator used to measure the accuracy of the classification and is given by
$$\kappa = \frac{acc - rand}{1 - rand} \tag{10}$$
Here, acc is the classification accuracy, and rand is the accuracy of random classification. For two-class classification, the value of rand is 0.5 [20].
Sensitivity (the proportion of positives that are correctly identified), specificity (the proportion of negatives that are correctly identified), precision (the proportion of correctly predicted positives among all predicted positives), and F-score (the harmonic mean of the precision and recall) are defined as follows:
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$
where true positive (TP) and true negative (TN) represent the numbers of left little-finger movements and tongue movements, respectively, that are correctly classified by the algorithm. If these data are erroneously detected as the opposite movement, they are counted as false positive (FP) and false negative (FN).
Figure 2: (a, b) MST-based power spectrum features of channel 12 of the first and second trials of the training set, respectively. The 12th channel was chosen because, as described in the channel selection section of this article, it belonged to the best channel combination.
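The metric definitions can be checked with a short worked example. The confusion-matrix counts below (49 true positives, 50 true negatives, 0 false positives, 1 false negative) are hypothetical: they are chosen so that 100 test trials give 99% accuracy and 100% specificity, and the per-class split is an assumption rather than a value taken from the results tables. The function names are likewise illustrative.

```python
def kappa(acc, rand=0.5):
    """Kappa for a classifier with accuracy `acc` and chance level `rand`."""
    return (acc - rand) / (1.0 - rand)

def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, F-score, and accuracy from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f_score": f_score, "accuracy": accuracy}

m = binary_metrics(tp=49, tn=50, fp=0, fn=1)     # hypothetical counts
print(round(kappa(m["accuracy"]), 2))            # 0.98
print(round(100 * m["f_score"], 2))              # 98.99
print(m["specificity"])                          # 1.0
print(round(kappa(0.89), 2))                     # 0.78
```

Under these assumed counts, the Kappa value and F-score match the 0.98 and 98.99% reported for the best scheme, and an accuracy of 89% gives the Kappa value of 0.78 reported without optimization.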

Parameter Settings.
The scale factors p and q are used to adjust the width and height of the Gaussian window of the MST. Their values were varied systematically, and the results are shown in Table 1; the most effective features were obtained when p = 0.85 and q = 1.
The two-level PSO feature selection scheme is representative: it achieves higher classification accuracy and also saves considerable time compared with optimization at more levels. Therefore, the two-level PSO is taken as an example to illustrate the parameter adjustment process, as shown in Table 2; the optimal parameters are N = 100, c_1 = c_2 = 1.5, w_max = 0.8, w_min = 0.4, and v_max = 20. In the experiment, we found that MLPSO usually reached the optimal value within 100 iterations; therefore, T = 100 was used. All experiments were started with initial particle swarms that were independent of each other.

Channel Selection.
To confirm the effectiveness and feasibility of the proposed MLPSO optimization framework, we fixed the feature sets representing each channel and used the PSO algorithm to find the best channel combination. The results are shown in Table 3. The classification accuracy increased from 89% to 97%. Figure 5 shows the distribution of selected channels for different levels of PSO channel selection. Eleven channels were selected consistently in every experiment. Since all experiments were started with independent initial particle swarms, these 11 channels were considered to be the optimal channel combination, far fewer than the initial 64 channels. Figure 6 illustrates the positions of the corresponding electrodes. The selected channels are relatively concentrated in the upper half of the grid, while almost no channels were selected from the lower left part. The electrodes of the dataset used in this study were placed under the dura of the cerebral cortex, covering the main motor area and premotor area, as well as the frontotemporal area of the left and right hemispheres [15]. Therefore, the selected channels were located close to the motor cortex region of the brain. The remaining channels may have gone unselected because of poor signal quality caused by the one-week interval between collection of the training set and test set, electrode shedding, or decreased conductivity.

Feature Selection.
The results of feature selection are listed in Table 4. After feature selection, the classification accuracy of all experiments improved by more than 4%, and the best accuracy reached 99%. In addition, only 40 features were used for classification when the optimal accuracy was first reached, which is just 1.8% of the original 2240 features. This coincided with a 95% reduction in testing time. The number of selected features remained unchanged over the last several experiments of the scheme, suggesting that the optimal task-related feature set had been identified. Figure 7 shows a scatter diagram of the optimal feature distribution.

Channel and Feature Selection.
Since the optimal channel combination was identified via four-level PSO, we conducted four groups of feature selection experiments using different levels of PSO for channel selection; the results are shown in Table 5. The best classification accuracy achieved in each group of experiments was 99%, and the more adequate the channel selection, the shorter the feature selection time needed to reach the highest accuracy. In addition, the specificity of each experiment reached 100%.
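One way to see why channel selection shortens the subsequent feature search is to expand a channel mask into a feature mask. The sketch below assumes a channel-major feature layout (the 35 frequency features of each channel stored contiguously), which the text does not specify; the channel indices and the helper name `channel_mask_to_feature_mask` are made up for illustration.

```python
N_CHANNELS, N_FREQS = 64, 35

def channel_mask_to_feature_mask(channel_mask, n_freqs=N_FREQS):
    """Repeat each channel bit once per frequency bin (channel-major layout assumed)."""
    return [bit for bit in channel_mask for _ in range(n_freqs)]

# Hypothetical channel selection: 11 of the 64 channels are kept.
kept_channels = {3, 11, 12, 20, 21, 22, 29, 30, 37, 38, 46}
channel_mask = [1 if c in kept_channels else 0 for c in range(N_CHANNELS)]
feature_mask = channel_mask_to_feature_mask(channel_mask)
print(len(feature_mask), sum(feature_mask))   # 2240 385
```

With 11 channels retained, feature selection starts from 385 candidate features instead of 2240, so each particle has far fewer dimensions to search.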

Comparison of Channel and Feature Selection.
Tables 3-5 show that when MLPSO is used only for channel selection, the best classification accuracy is 97%, while the other two schemes achieve 99% accuracy. This is a significant improvement over the 89% accuracy obtained before optimization. At the same time, the number of features used to achieve 99% classification accuracy was less than 10.5% of the original number, and the test time was reduced by more than 90%. These data suggest that the MLPSO-based optimization framework not only significantly improves classification accuracy but also effectively reduces the number of features, thus greatly reducing the test time. These characteristics indicate that MLPSO may be useful as a reference for research on related real-time BCI application systems.
Figure 8 shows the change in classification accuracy of each experiment when different levels of PSO are used. Compared with the scheme using feature selection only, channel and feature selection requires fewer rounds of feature selection to achieve 99% accuracy. Moreover, the total training time of the channel and feature selection scheme is less than that of the feature selection scheme. This demonstrates that channel selection can filter out channels that are not related to the task and simplify the optimal feature selection process. These data suggest that channel and feature selection can accelerate the convergence of the algorithm to the global optimum, reduce computational complexity, and shorten the training time.

Comparison of the Classification Performance.
Tables 3-5 also report the classification performance of each experiment in terms of Kappa, F-score, sensitivity, specificity, and precision. The optimal Kappa value of the PSO-based feature selection scheme was 0.98, an improvement of 0.20 over the value of 0.78 obtained without optimization. Moreover, the accuracy and sensitivity of the PSO-based method were greatly improved, and the specificity reached 100%. These improved evaluation indices indicate the effectiveness of the proposed scheme.

Comparisons with Other Methods.
Table 6 presents a comparison of the proposed method with current state-of-the-art schemes using the same dataset. The classification accuracy of the proposed framework is higher than that of the previously used algorithms. Chang et al. [6] proposed a feature selection scheme based on a genetic algorithm; the classification accuracy of the algorithm is 96%, and the number of selected features is 48.6% of the number of original features. By contrast, our scheme achieves 99% accuracy with less than 10.5% of the features, which supports the effectiveness of the scheme proposed in the present study. Xu et al. [21] proposed using gradient boosting to classify brain signals by extracting the combined features of fractal measures and LBP operators; the 41 channels with the highest precision were selected, yielding 95% accuracy. Zhao et al. [22] used band power for channel selection and feature extraction. Eleven channels with distinctive features were selected from the initial channels, principal component analysis was used to reduce the dimensions of the features, and FLDA was used for classification, achieving 94% accuracy, but the algorithm has high complexity. Ince et al. [23] proposed an adaptive classification scheme, including the generation of a structured redundant feature dictionary based on the dual-tree undecimated wavelet packet transform (UDWT) and linear discriminant analysis (LDA) classifiers. By using only three features, 93% accuracy can be achieved, but the construction of the feature subset increases algorithm complexity. Wei et al. [24] selected the optimal channels through a genetic algorithm, extracted power features with the common spatial pattern (CSP), and classified them with FLDA, achieving 90% classification accuracy with seven channels. Compared with these methods, our algorithm has high classification accuracy and specificity.

Conclusions
This study describes three optimization schemes for motor imagery-based BCI. MLPSO is used to optimize channel and feature selection, channel selection, and feature selection, respectively, and MST-based PSD and BLDA are used for feature extraction and classification. The schemes using MLPSO for feature selection and for hybrid channel-feature selection achieved 99% classification accuracy, shortened the test time by more than 90%, increased the Kappa value from 0.78 to 0.98, and reached a specificity of 100%, achieving the best reported level.
The results show that the channel and feature selection scheme can accelerate the search for the global optimum and reduce the training time. Owing to its excellent performance, the proposed optimization scheme can provide a reference for research on related real-time BCI application systems.

Data Availability
The BCI Competition III dataset I is available at http://bbci.de/competition/iii/.

Conflicts of Interest
The authors declare no conflicts of interest.