Simultaneous Channel and Feature Selection of Fused EEG Features Based on Sparse Group Lasso

Feature extraction and classification of EEG signals are core parts of brain-computer interfaces (BCIs). Due to the high dimension of the EEG feature vector, an effective feature selection algorithm has become an integral part of research studies. In this paper, we present a new method based on a wrapped Sparse Group Lasso for channel and feature selection of fused EEG signals. The high-dimensional fused features are first obtained, which include the power spectrum, time-domain statistics, AR model, and the wavelet coefficient features extracted from the preprocessed EEG signals. The wrapped channel and feature selection method is then applied, which uses the logistic regression model with a Sparse Group Lasso penalty function. The model is fitted on the training data, and parameter estimation is obtained by the modified blockwise coordinate descent and coordinate gradient descent methods. The best parameters and feature subset are selected by using 10-fold cross-validation. Finally, the test data is classified using the trained model. Compared with existing channel and feature selection methods, results show that the proposed method is more suitable, more stable, and faster for high-dimensional feature fusion. It can simultaneously achieve channel and feature selection with a lower error rate. The test accuracy on the data used from international BCI Competition IV reached 84.72%.


Introduction
Brain-computer interfaces (BCIs), which are communication systems designed to transmit information between the brain and computers or other electronic devices, are currently the most popular technique used in neurological rehabilitation [1]. The system does not depend on the brain's normal pathways of peripheral nerves and muscles but relies on signal acquisition technology to capture the signal generated from brain activity, which is used to control external equipment after analysis and processing. The electroencephalogram (EEG) signal is the brain signal that is obtained by noninvasive electrode acquisition. EEG signal feature extraction and classification have become a hot topic in BCI research.
The biggest problem of BCIs based on EEG signals is the high dimensions of the EEG feature space and the limited number of samples. This has prompted research into EEG channel selection and BCI feature selection. Research into feature selection and channel selection of the EEG signal can be roughly divided into two types. The first type is feature selection methods. Coelho introduced a new artificial immune network algorithm to realize automatic feature selection using the EEG signal power spectral density feature, which used an extreme learning machine as a classifier [2]. Rejer used blind source separation, a genetic algorithm and a forward feature selection method [3,4]. Bhattacharyya proposed a differential evolution and mimetic algorithm for high-dimensional EEG signal power spectrum density feature selection [5]. Noshadi proposed an algorithm which combines Lempel-Ziv with EMD for feature extraction on the EEG signal, using t-test and a forward or backward feature selection method [6]. The second type is EEG signal channel selection methods. Arvaneh proposed a sparse common spatial pattern algorithm and a robust sparse common spatial pattern algorithm for channel selection. The classification results are better than the feature selection method based on Fisher criterion, mutual information, support vector machine, and common spatial pattern or a regularized common spatial pattern [7,8]. He proposed a genetic algorithm for feature selection based on the maximized Rayleigh coefficient feature [9]. Yang proposed a method for subject-specific channel selection based on Fisher discriminant analysis scoring criteria. This method can effectively reduce the number of channels from 118 channels to no more than 11 without significantly reducing the classification accuracy, while shortening the training time [10].
Gonzalez proposed a combination of Fisher discriminant analysis and a multiobjective real/binary hybrid particle swarm algorithm, which can maximize the classification accuracy and minimize the number of channels while searching for EEG channels and the classifier parameters [11]. As can be seen, most studies undertake research on EEG signals by either feature selection or channel selection unilaterally. Lasso (least absolute shrinkage and selection operator) is a regularization method which can be used to select high-dimensional features [12]. Group Lasso is an extended Lasso method [13], while Sparse Group Lasso is a regularization method which combines Lasso and Group Lasso [14,15]. Germán et al. proposed a Lasso feature selection method based on least angle regression using fused features of the EEG, such as the power spectrum, Hjorth parameters, AR model coefficients, and wavelet transform parameters. This method used linear discriminant analysis as the classifier [16]. Experimental results show that this method is superior to traditional methods. Yeh studied the classification problem of audio and video data, using the fusion of the Mel-frequency cepstral coefficients (MFCC) feature, scale invariant feature transform (SIFT) descriptor subfeatures, histogram of oriented gradients (HOG) descriptor subfeatures, Gabor texture features, and edge direction histogram (EDH) features, and then proposed a multiple kernel learning framework which is based on Group Lasso for feature selection [17]. Xie studied the problem with selection of uncertainty characteristics based on Sparse Group Lasso for data mining and has done experiments on nine types of UCI machine learning datasets [18].
Based on the literature [16], this paper proposes the Sparse Group Lasso method for channel selection and feature selection of the EEG fused feature and estimates model parameters using a combination of the blockwise coordinate descent method and the coordinate gradient descent method. The method can select features both between channels and within each channel, so it achieves high-dimensional EEG signal channel selection and feature selection simultaneously, while obtaining better sparse performance and classification accuracy. We conduct experimental verification on dataset 1 of the international BCI Competition IV. The EEG data is first preprocessed and then fused features are established from each channel of the multichannel signal; that is, the power spectrum, time-domain statistics, autoregression (AR) model coefficient, and wavelet features are extracted. The wrapped channel and feature selection method is then used. The logistic regression model is penalized with the Sparse Group Lasso to fit the training data, and parameter estimation is obtained using the blockwise coordinate descent method and coordinate gradient descent method. Finally, the test sample is classified using the trained model. The method proposed in this paper includes feature fusion, channel selection, and feature selection, as shown in Figure 1.

Feature Extraction
In the study of EEG signal classification problems, an important factor in improving the recognition rate is to extract representative features that represent the EEG signal properly. In this paper, in order to extract the EEG signal features comprehensively and establish a high-dimensional feature fusion, we jointly apply four types of feature extraction methods: frequency-domain analysis, time-domain analysis, analysis of time and space, and time-frequency analysis.
Power spectrum estimation can analyze the distribution and change in EEG signal rhythm [19] and capture the event-related desynchronization (ERD) and event-related synchronization (ERS) phenomena.
For the time sequence of each channel of the EEG signal, we extract four commonly used statistical features: the mean value and standard deviation of the time sequence and the mean value of the first difference absolute value and the second difference absolute value of the time sequence.
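The four time-domain statistics can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the standard deviation is taken as the population value, and the "second difference" is assumed to be the second-order difference computed by `np.diff(x, n=2)`.

```python
import numpy as np

def time_domain_features(x):
    """Return [mean, std, mean |first difference|, mean |second difference|]."""
    x = np.asarray(x, dtype=float)
    d1 = np.diff(x, n=1)  # x[t+1] - x[t]
    d2 = np.diff(x, n=2)  # second-order difference (assumed definition)
    return np.array([x.mean(), x.std(), np.abs(d1).mean(), np.abs(d2).mean()])

# Toy sequence standing in for one channel of EEG samples.
features = time_domain_features([1.0, 2.0, 4.0, 7.0])
```

Applying the function to every channel yields the 4-dimensional time-domain block of that channel's fused feature vector.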
The AR model is an effective tool for time sequence modeling and it has been widely used in BCI systems [20]. In our experiment, we establish a sixth-order AR model for the time sequence of each channel and take the coefficients of the model as features of the EEG signal.
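The text does not state how the AR coefficients are estimated. As one possible sketch, the Yule-Walker equations can be solved with NumPy; the estimator choice, the function name, and the synthetic test signal below are all assumptions made for illustration.

```python
import numpy as np

def ar_coefficients(x, order=6):
    """Estimate a_1..a_order of x[t] = sum_k a_k x[t-k] + e[t] (Yule-Walker)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased sample autocovariance at lags 0..order.
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz system R a = r[1:], with R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1:])

# Synthetic AR(1) sequence with coefficient 0.6, standing in for one channel.
rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)
signal = np.zeros_like(noise)
for t in range(1, len(signal)):
    signal[t] = 0.6 * signal[t - 1] + noise[t]

coeffs = ar_coefficients(signal, order=6)  # six features for this channel
```

On this synthetic sequence the first estimated coefficient should be close to 0.6 and the remaining five close to zero, which is a quick sanity check before applying the function to real data.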
Wavelet transform is a type of variable resolution time-frequency analysis method; it has good localization in the time-domain and frequency-domain and is frequently used for EEG signal feature extraction. We use the Db4 wavelet as the mother wavelet in the experiment, make six decompositions of the time sequence of each channel, take the energy of the approximate coefficients and detail coefficients (seven-dimensional) as features, and extract four features for each of them: the Shannon entropy, logarithmic energy entropy, and the mean value and variance of the Teager-Kaiser energy operator. This constitutes 55-dimensional wavelet features overall.
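Of the per-sub-band features listed above, the Teager-Kaiser energy operator is the least familiar, so a short sketch may help. It uses the standard discrete definition, psi[n] = x[n]^2 - x[n-1] x[n+1]; the function names are hypothetical, and the wavelet decomposition that would supply the coefficient sequence is not shown.

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]**2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def tkeo_features(coeffs):
    """Mean and variance of the Teager-Kaiser energy of one coefficient sequence."""
    psi = teager_kaiser(coeffs)
    return psi.mean(), psi.var()

# For a pure sinusoid A*cos(w*n + phase) the operator is constant: A**2 * sin(w)**2.
n = np.arange(200)
tk_mean, tk_var = tkeo_features(2.0 * np.cos(0.3 * n + 0.1))
```

The sinusoid serves only as a check of the operator; in the feature extraction step the input would be the approximation or detail coefficients of one decomposition level.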

Channel Selection and Feature Selection
The feature extraction process described above is carried out for each channel in the time series. While tasks to imagine different movements activate different brain areas, not all regions of the brain's electrical activity are associated with each task, so the fused features established using every channel of the EEG signal have some redundancy. Hence, we need to complete channel selection and feature selection. Channel selection removes channels which are not related to the category of imagined movement. In addition, some of the features have nothing to do with classes other than the category of imagined movement, so feature selection is required. Feature selection considers whether each dimension's characteristic is associated with each category of imagined movement, and selections are made based on the features rather than the channel.
It is well known that the Lasso method can obtain a sparse solution from high-dimensional data. For feature fusion, the method extracts characteristics from each different channel without distinction, adopting the same selection standards, and can realize the process of feature selection, as shown in Figure 2(a). However, the method does not significantly reduce the number of channels. The Group Lasso method regards the fused features extracted from each individual channel as a feature set, and selection is made on a channel basis; that is, all characteristics of the channel are retained or discarded, as shown in Figure 2(b). However, with feature fusion, not all features extracted from a channel are necessarily associated with imagined movement categories, and therefore feature selection within the channel is needed. Therefore, a method is required to increase the sparsity of the feature set among channels and within each channel. The Sparse Group Lasso method is a combination of Lasso and Group Lasso, which can achieve sparsity between groups and within the group. Therefore, in this paper we propose the Sparse Group Lasso method to solve the problem of channel selection and feature selection for EEG signal feature fusion, as shown in Figure 2(c). Additionally, we propose a method that combines the blockwise coordinate descent and coordinate gradient descent to estimate the parameters of the Sparse Group Lasso model, where nonzero model parameters signify that the corresponding feature or feature group is selected and vice versa.
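The rule that nonzero parameters mark selected features and channels can be sketched as below. The layout is an assumption for illustration: a flat coefficient vector for the two-class case, stored channel by channel, with a hypothetical helper name.

```python
import numpy as np

def selected_channels_and_features(beta, n_channels, n_features):
    """Return (indices of selected channels, boolean mask of selected features)."""
    mask = np.asarray(beta).reshape(n_channels, n_features) != 0
    # A channel is selected as soon as one of its coefficients is nonzero.
    channels = np.flatnonzero(mask.any(axis=1))
    return channels, mask

# Toy example: 3 channels with 4 features each; only channel 1 keeps features.
beta = np.zeros(3 * 4)
beta[[4, 6]] = [0.8, -0.3]
channels, mask = selected_channels_and_features(beta, 3, 4)
```

In this toy case one channel survives (sparsity between groups) and it keeps only two of its four features (sparsity within the group), which is the pattern the Sparse Group Lasso is meant to produce.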

BioMed Research International
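First, we provide the proposed logistic regression multiclassification model penalized with the Sparse Group Lasso of the EEG signal. The method can be described as follows: we assume that the training sample set is $(x_i, y_i)$, $i = 1, \ldots, N$, where $x_i \in \mathbb{R}^{m \times p}$ is the observation vector, $m$ is the number of channels, and $p$ is the dimension of each channel. We let $y_i$ denote the multiclass response, $y_i \in \{1, 2, \ldots, K\}$. The EEG data used in this paper is two-class data, but in order to make the algorithm more general in our description here, we give an example using a multiclassification model. The logistic regression model is used to represent the conditional probability; then the probability of sample $x_i$ belonging to class $l$ is described as

$$P(y_i = l \mid x_i) = \frac{\exp\left(x_i^{T} \beta_{\cdot l}\right)}{\sum_{k=1}^{K} \exp\left(x_i^{T} \beta_{\cdot k}\right)}. \tag{1}$$

Here, $\beta$ is the coefficient matrix, which represents the model parameters which need to be solved, and $\beta_{\cdot l}$ is the $l$th column of $\beta$. We let $l = K$ as a reference, and then we can obtain $K - 1$ logistic models:

$$\log \frac{P(y_i = l \mid x_i)}{P(y_i = K \mid x_i)} = x_i^{T} \beta_{\cdot l}, \quad l = 1, \ldots, K - 1. \tag{2}$$

Here, $\beta_{\cdot K} = 0$. Using maximum likelihood estimation to fit the model, we define matrix $Y$ with elements as follows:

$$y_{il} = \begin{cases} 1, & y_i = l, \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$

The training dataset can be considered as independent observations to simplify calculations. We take the logarithm likelihood function as follows:

$$\ell(\beta) = \sum_{i=1}^{N} \sum_{l=1}^{K} y_{il} \log P(y_i = l \mid x_i). \tag{4}$$

After adding a Sparse Group Lasso penalty function to (4), the objective function becomes

$$\hat{\beta} = \arg\min_{\beta} \left\{ -\ell(\beta) + \lambda \Phi(\beta) \right\}, \qquad \Phi(\beta) = (1 - \alpha) \sum_{j=1}^{m} \left\| \beta^{(j)} \right\|_2 + \alpha \sum_{i} \left| \beta_i \right|. \tag{5}$$

Here, $\lambda > 0$ and when $\lambda$ is sufficiently large, $\hat{\beta}$ is zero, $\alpha \in [0, 1]$.
$\beta^{(j)}$ is the $j$th group of $\beta$, which represents the coefficient vector of the $j$th channel fused feature of each class, with dimension $n_j$, $j = 1, \ldots, m$.
$\beta_i^{(j)}$ is the $i$th feature coefficient of the $j$th group of $\beta$, and $\beta_i$ is the $i$th feature coefficient of $\beta$, $i = 1, \ldots, m \times p \times K$.
We can see that the Sparse Group Lasso penalty is a combination of the Group Lasso penalty and the Lasso penalty, and when $\alpha = 0$ or $\alpha = 1$, it converts to Group Lasso or Lasso estimation, respectively.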
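The way the penalty interpolates between Group Lasso and Lasso can be checked numerically. The sketch below assumes an unweighted penalty (every group and every coefficient has weight 1) and a flat coefficient vector stored channel by channel; the function name is hypothetical.

```python
import numpy as np

def sparse_group_lasso_penalty(beta, n_channels, alpha):
    """(1 - alpha) * sum of group L2 norms + alpha * L1 norm of all coefficients."""
    groups = np.asarray(beta, dtype=float).reshape(n_channels, -1)
    group_term = np.linalg.norm(groups, axis=1).sum()
    lasso_term = np.abs(groups).sum()
    return (1.0 - alpha) * group_term + alpha * lasso_term

# Two channels with two coefficients each; the second channel is entirely zero.
beta = np.array([3.0, 4.0, 0.0, 0.0])
group_only = sparse_group_lasso_penalty(beta, 2, alpha=0.0)  # Group Lasso limit
lasso_only = sparse_group_lasso_penalty(beta, 2, alpha=1.0)  # Lasso limit
mixed = sparse_group_lasso_penalty(beta, 2, alpha=0.5)
```

With the mixing parameter at 0 only the group norms contribute, at 1 only the absolute values contribute, and intermediate values blend the two terms linearly.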
As described in a previous study [21], the model parameter estimation algorithm proposed in this paper is composed of three main loops: an outer coordinate gradient descent loop (Algorithm 1), a middle blockwise coordinate descent loop (Algorithm 2), and an inner modified coordinate descent loop (Algorithm 3).
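The purpose of Algorithm 2 is to solve the quadratic optimization problem in (7). Writing the quadratic approximation of the loss as $Q(\beta) = q^{T} \beta + \frac{1}{2} \beta^{T} H \beta$ (up to an additive constant), and since the penalty $\Phi$ is separable, (7) can be written as

$$\min_{\beta} \; Q(\beta) + \lambda \sum_{j=1}^{m} \Phi^{(j)}\left(\beta^{(j)}\right). \tag{8}$$

Here, $\Phi^{(j)}(x) = (1 - \alpha) \|x\|_2 + \alpha \|x\|_1$ is the penalty of the $j$th group, and we can use the blockwise coordinate descent algorithm because $\Phi^{(j)}$ is convex. Taking the $j$th group (the fused feature coefficients of the $j$th channel) and keeping the other groups fixed, the problem can be simplified to

$$\hat{\beta}^{(j)} = \arg\min_{x \in \mathbb{R}^{n_j}} \; Q^{(j)}(x) + \lambda \Phi^{(j)}(x). \tag{9}$$

Here, $\hat{\beta}^{(j)}$ represents the estimation of the $j$th group, $n_j$ is the dimension of the $j$th group, and $Q^{(j)}$ denotes $Q$ regarded as a function of the $j$th group only.
Since $H$ is a symmetric matrix, it can be broken down into block matrices $H_{jk}$ of $n_j \times n_k$ size. By symmetry of $H$, we obtain, up to an additive constant,

$$Q^{(j)}(x) = x^{T} \Big( q^{(j)} + \sum_{k \neq j} H_{jk} \beta^{(k)} \Big) + \frac{1}{2} x^{T} H_{jj} x. \tag{10}$$

Equation (10) can be rewritten as

$$Q^{(j)}(x) = x^{T} g^{(j)} + \frac{1}{2} x^{T} H_{jj} x. \tag{11}$$

Here, $g^{(j)}$ is the group gradient and $g^{(j)} = q^{(j)} + [H \beta]^{(j)} - H_{jj} \beta^{(j)}$. For Algorithm 3, we rewrite (9) as

$$\min_{x \in \mathbb{R}^{n_j}} \; Q^{(j)}(x) + \gamma \|x\|_2 + \xi \sum_{i=1}^{n_j} |x_i|, \qquad \gamma = \lambda (1 - \alpha), \quad \xi = \lambda \alpha. \tag{12}$$

The first two terms of (12) are considered to be the loss function and the last term is the penalty. The loss is not differentiable at zero due to the L2-norm, and thus we cannot completely separate out the nondifferentiable parts, so the coordinate descent method has been modified for this case. For the $i$th coordinate iteration of the $j$th group, we need to find the minimum of the function

$$\omega(\hat{x}_i) = c + g \hat{x}_i + \frac{1}{2} h \hat{x}_i^{2} + \gamma \sqrt{\hat{x}_i^{2} + r} + \xi |\hat{x}_i|. \tag{13}$$

Here, $c$ is a constant, $g = g_i^{(j)} + \sum_{k \neq i} (H_{jj})_{ik} x_k$, $r = \sum_{k \neq i} x_k^{2}$, and $h$ is the $i$th diagonal of the Hessian block $H_{jj}$. Due to the convexity of $Q$, we conclude that $h \geq 0$. Since the quadratic approximation $Q$ is bounded below, we obtain $\hat{x}_i = 0$ when $h = 0$. When $h > 0$, $\hat{x}_i$ can be obtained as follows.
If $r = 0$ or $\gamma = 0$, then

$$\hat{x}_i = \begin{cases} 0, & |g| \leq \xi + \gamma, \\ -\operatorname{sign}(g) \, \dfrac{|g| - \xi - \gamma}{h}, & \text{otherwise}. \end{cases} \tag{14}$$

If $r > 0$, $\gamma > 0$, and $|g| \leq \xi$, then $\hat{x}_i = 0$; otherwise $\hat{x}_i$ is the solution of

$$g - \xi \operatorname{sign}(g) + h \hat{x}_i + \frac{\gamma \hat{x}_i}{\sqrt{\hat{x}_i^{2} + r}} = 0. \tag{15}$$

We solve (15) by applying a standard root finding method. We define $\Delta \in \mathbb{R}^{m \times p \times K}$ and can then rewrite the descent direction at zero for the function in (12).
Algorithm 3 (inner loop used in the model parameter estimation algorithm).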
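The single-coordinate update of the inner loop can be sketched as follows. This is an illustration under stated assumptions rather than the authors' Algorithm 3: the scalar objective is taken to be g*x + 0.5*h*x^2 + gamma*sqrt(x^2 + r) + xi*|x| (additive constant dropped), the names `omega` and `coordinate_minimizer` are hypothetical, and plain bisection stands in for the unspecified "standard root finding method".

```python
import numpy as np

def omega(x, g, h, gamma, xi, r):
    """Scalar objective of one coordinate update (additive constant omitted)."""
    return g * x + 0.5 * h * x**2 + gamma * np.sqrt(x**2 + r) + xi * np.abs(x)

def coordinate_minimizer(g, h, gamma, xi, r):
    """Return the minimizer of omega over x for h >= 0, gamma >= 0, xi >= 0, r >= 0."""
    if h == 0:
        return 0.0  # assumes the quadratic approximation is bounded below
    if r == 0 or gamma == 0:
        # The penalty reduces to (xi + gamma) * |x|: closed-form soft thresholding.
        return float(-np.sign(g) * max(abs(g) - xi - gamma, 0.0) / h)
    if abs(g) <= xi:
        return 0.0
    # Otherwise the minimizer has the opposite sign of g and solves
    # g + h*x + gamma*x/sqrt(x**2 + r) + xi*sign(x) = 0.
    s = -np.sign(g)

    def deriv(x):
        return g + h * x + gamma * x / np.sqrt(x * x + r) + xi * s

    # The derivative is increasing, negative at lo and positive at hi.
    lo, hi = sorted((0.0, float(s * (abs(g) - xi) / h)))
    for _ in range(200):  # bisection
        mid = 0.5 * (lo + hi)
        if deriv(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

x_root = coordinate_minimizer(g=2.0, h=1.5, gamma=0.7, xi=0.5, r=0.3)
x_closed = coordinate_minimizer(g=2.0, h=1.0, gamma=0.7, xi=0.5, r=0.0)
x_zero = coordinate_minimizer(g=0.4, h=1.0, gamma=0.7, xi=0.5, r=0.3)
```

Comparing the returned value with a brute-force minimum of `omega` on a fine grid is a simple way to validate the case analysis before embedding the update in the blockwise loop.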

Experimental Process and Results Analysis
The first step was to extract the features from the $m = 59$ channels of the EEG signal. For each channel signal, a 5-dimensional power spectral feature, 4-dimensional time-domain statistical feature, 6-dimensional AR model coefficient feature, and 55-dimensional wavelet decomposition coefficient feature were extracted. Thus, the fused features of each channel were 70-dimensional.
In this paper, the Sparse Group Lasso method was first used for EEG signal processing. The features of each channel were extracted as a group, that is, $X^{(j)}$, where $j = 1, \ldots, m$, with $m = 59$ groups in total. Here, $X = (X^{(1)}, \ldots, X^{(j)}, \ldots, X^{(m)})$, where $X^{(j)} \in \mathbb{R}^{N \times p}$. In experiments, we used the wrapped Sparse Group Lasso method for channel and feature selection. At first, the feature set consisted of features extracted from each channel of the EEG signal. The combined coordinate gradient descent method and blockwise coordinate descent method were then used to solve the objective function with the corresponding penalty term and obtain the parameter estimates of the logistic regression model on the training data. A 10-fold cross-validation method was applied to select the parameter estimation with the highest training accuracy rate as the result of channel and feature selection. Finally, the test data restricted to the selected channel and feature subset was passed to the trained model to calculate the test error rate.
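The 10-fold cross-validation step can be sketched generically. The helper names are hypothetical, and the nearest-centroid classifier below is only a stand-in so the example runs on its own; it is not the Sparse Group Lasso logistic model, which would be plugged in through the same `fit` and `predict` callables for each candidate parameter setting.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean validation accuracy over k folds for a model given by fit/predict."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        scores.append(np.mean(predict(model, X[val]) == y[val]))
    return float(np.mean(scores))

# Stand-in classifier (nearest class centroid), NOT the Sparse Group Lasso model.
def fit_centroids(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    classes, centers = model
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

# Synthetic two-class data standing in for fused EEG features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.repeat([0, 1], 50)
accuracy = cross_val_accuracy(fit_centroids, predict_centroids, X, y, k=10)
```

Running `cross_val_accuracy` once per candidate parameter setting and keeping the setting with the highest mean accuracy mirrors the selection rule described above.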
The first experiments use datasets A and E as follows. For dataset A, the proposed method is applied to each single type of feature extraction method and to the feature fusion method, respectively, with results shown in Table 2.
From Table 2 it can be observed that, compared with the AR coefficient and wavelet coefficient features, the feature fusion obtains a lower error rate for simultaneous channel and feature selection. For the power spectrum characteristic and the time-domain statistics characteristic, although the feature fusion error rate is slightly higher, the feature fusion method has obvious advantages for channel selection.
Therefore, it can be concluded that when considering comprehensive performance of the test error rate and channel selection number, a fused feature extraction method is better than a single feature extraction method. Figures 4 and 5 compare single feature extraction and feature fusion of the channel/feature selection for dataset A, respectively, and Figure 6 shows the fused feature channel selection result analysis. Figure 4 shows that the ratio of the number of channels and features selected is lowest when using feature fusion and it better reduces the redundancy of channel and feature. Figure 5 shows that, out of the 18 channels selected by the feature fusion method, 15 channels are included in the selection results from three or more extraction methods, a percentage of 83.33%. In addition, there are 10 channels, F2, F5, FCz, C4, CP1, CP3, P5, P6, O1, and O2, selected by all four feature extraction methods, and these channels are important for classification of dataset A. Finally, the proposed method includes all of these channels, which indicates that the feature fusion is superior at removing redundant channels and choosing the most relevant channels for signal classification.
As an example, we observe that the 12th (FC1) channel in Figure 6 only contains the power spectrum feature and the wavelet feature; that is, only these features contribute to the classification problem from the four types of heterogeneous feature of this channel. From analysis of all 18 channels, it can be observed that selection frequency of the power spectrum feature and the wavelet feature is 100%, while the time-domain statistic and AR coefficient feature have a selection frequency of 88.89%. Therefore, compared with the time-domain statistic and AR coefficient feature, the power spectrum feature and the wavelet feature are more important in the classification of dataset A.
The second experiment followed the same experimental procedure and analysis for dataset E. The results are shown in Table 3 and Figures 7-9.
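The selection frequency of each feature type can be computed from a boolean mask of kept features. The sketch below assumes the per-channel order power spectrum, time-domain, AR, wavelet, and uses a toy mask (not the experimental result) in which 16 of 18 selected channels keep a time-domain and an AR feature, which gives the 88.89% frequency quoted above.

```python
import numpy as np

# Assumed per-channel layout of the 70 fused features.
BLOCKS = {"power_spectrum": 5, "time_domain": 4, "ar": 6, "wavelet": 55}

def type_selection_frequency(mask):
    """mask: (n_selected_channels, 70) boolean array of kept features."""
    freq, start = {}, 0
    for name, size in BLOCKS.items():
        # A feature type counts as used by a channel if any of its dimensions is kept.
        used = mask[:, start:start + size].any(axis=1)
        freq[name] = float(used.mean())
        start += size
    return freq

# Toy mask for 18 selected channels (illustrative values only).
mask = np.zeros((18, 70), dtype=bool)
mask[:, 0] = True    # every channel keeps a power spectrum feature
mask[:, 15] = True   # every channel keeps a wavelet feature
mask[:16, 5] = True  # 16 channels keep a time-domain feature
mask[:16, 9] = True  # 16 channels keep an AR feature
freq = type_selection_frequency(mask)
```

The same function applied to the real selection mask would reproduce the per-type frequencies reported for each dataset.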
We can draw similar conclusions from analysis of Table 3; that is, when there is a lower or equivalent test error rate, the feature fusion method can achieve better channel and feature selection. Figures 7 and 8 compare single feature extraction and feature fusion, respectively, for the channel and feature selection of dataset E. Figure 9 is the fused feature channel selection result analysis.
From Figure 7, we can directly observe that the fused feature extraction method achieves better dimensional reduction on the selected number of channels and features. In Figure 8, 23 channels are selected by feature fusion, with 16 channels contained in the selection results from three or more extraction methods, a percentage of 69.6%. Seven of the channels (F6, FC6, CFC8, C5, C3, C4, and CP6) are selected by all four feature extraction methods, and of these, six channels (all except CFC8) are selected by the feature fusion method, a percentage of 85.7%. This shows that the feature fusion method can more accurately choose channels which are relevant to the classification. Figure 9 shows that 23 channels (which all include the power spectrum characteristic and wavelet feature) are selected by feature fusion, 13 channels include the time-domain statistic, and 11 channels contain the AR coefficient features. From this, we conclude that the power spectrum characteristics and wavelet feature play a more important role for classification.
The above experiments have shown that the feature fusion extraction method can provide alternative features for Sparse Group Lasso. It is suitable for handling data with high dimensions and can select the most effective features from the data.
The third experiment is as follows. In the following experiments, the feature fusion extraction method is adopted. The comparative results of Lasso feature selection, Group Lasso channel selection, and Sparse Group Lasso channel and feature selection for dataset A are shown in Table 4.
From Table 4 we can see that, compared with the Lasso and Group Lasso, Sparse Group Lasso can guarantee a lower error rate. Sparse Group Lasso selects more characteristics than the Lasso method but chooses a lower number of channels. Since the four datasets were collected from 59 electrodes, and each electrode corresponds to an individual channel for experiments, the channel selection represents the selection of an electrode. As each channel contains 70 characteristics, removing channel redundancy has more significance than removing redundant features. So, Sparse Group Lasso can be used for channel selection and feature selection at the same time with lower error rates. Figure 10 compares different channel and feature selection methods for dataset A. Figure 11 shows the channels and characteristics when parameter $\alpha = 0.5$ on dataset A. Each channel is composed of 70-dimensional features: the power spectrum characteristics are 5-dimensional, the time-domain statistical features are 4-dimensional, the AR model coefficient characteristics are 6-dimensional, and the characteristics of the wavelet decomposition coefficient are 55-dimensional. As can be seen in Figure 11, 18 channels are selected from the full range of channels, and not all features are selected. For example, on the 12th channel, no features are selected between the 775th dimension and the 785th dimension. AR coefficients and time-domain statistics characteristics are stored within this interval; therefore, we can determine that the 12th channel does not choose the time-domain statistics and AR coefficient characteristics (this can also be concluded from Figure 6), and similar findings can be observed through further channel analysis. Therefore, it is more intuitive to discover the sparsity between channels and within each channel and then obtain the important features. This further shows that the Sparse Group Lasso method can realize channel and feature selection at the same time.
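The index interval quoted for the 12th channel follows from the 70-dimensional per-channel layout. A small sketch of the arithmetic is given below; the function name is hypothetical, indices are 1-based as in the text, and the order of the time-domain and AR blocks within the interval is an assumption.

```python
def channel_block_ranges(channel, sizes=(5, 4, 6, 55)):
    """1-based inclusive index range of each feature block for a 1-based channel."""
    names = ("power_spectrum", "time_domain", "ar", "wavelet")
    start = (channel - 1) * sum(sizes) + 1
    ranges = {}
    for name, size in zip(names, sizes):
        ranges[name] = (start, start + size - 1)
        start += size
    return ranges

ranges = channel_block_ranges(12)
# power_spectrum: (771, 775), time_domain: (776, 779),
# ar: (780, 785), wavelet: (786, 840)
```

The time-domain and AR blocks of channel 12 together occupy dimensions 776 to 785, which is the empty interval between the 775th and 785th dimensions described above.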
For the fourth experiment, we compared the performance of Lasso, Elastic Net, Group Lasso, and Sparse Group Lasso for feature selection and the classification problem. The results are shown in Table 5. We present the number of selected channels and features based on Sparse Group Lasso with different values of parameter $\alpha$ (0, 0.25, 0.5, 0.75, 1). Sparse Group Lasso is equivalent to Group Lasso when $\alpha$ equals 0 and is equivalent to Lasso when $\alpha$ equals 1. Group Lasso shares the same grouping method as Sparse Group Lasso, which takes the fused features of each channel as a group and trades off on the group level in order to make the channel selection. Lasso and the Elastic Net method treat the features extracted from all channels equally and compromise on the feature level in order to make the feature selection. We use the wrapped method with fused features in the experiment on the four datasets based on Lasso feature selection, Elastic Net feature selection, Group Lasso channel selection, and Sparse Group Lasso channel and feature selection separately. As shown in Table 5, for different datasets, the larger $\alpha$ is, the lower the test error rate becomes. We can conclude that the Sparse Group Lasso method can obtain the lowest error rate when making channel and feature selection when the parameter setting is close to Lasso ($\alpha = 1$). Table 5 shows the results of different channel/feature selection methods (Lasso, Elastic Net, Group Lasso, and Sparse Group Lasso) for different datasets. We can observe that, compared with other methods, Sparse Group Lasso obtains the lowest error rate, with the lowest number of selected channels, below 38.98% of the total number of channels for all datasets. The lowest number of selected channels is only 23.73% of the total, which reduces the redundancy of channels significantly. Since channel selection is equivalent to the choice of electrode, it has greater significance than feature selection.
The number of features selected by Sparse Group Lasso is below 17.85% for all datasets, with the lowest only 7.97% of the total number of features. We conclude that the comparison shows that Sparse Group Lasso can achieve channel selection and feature selection simultaneously, while ensuring sparsity among channels and features when maintaining an error rate equal to or lower than other methods.
In comparison to other studies such as [22], we have only analyzed training sets, rather than using a testing set where the training set needs to be divided into 100 as 80% and 20% randomly. In the study in [22], 11 channels were chosen artificially: FC3, FC4, Cz, C3, C4, C5, C6, T7, T8, CCP3, and CCP4. This is different from the channels selected by our proposed method, since the previous study [22] used spatial pattern characteristics, while we use frequency-domain characteristics. The test set used, BCI Competition IV dataset 1, is continuous data, while we have piecewise processed the continuous data in order to increase the test samples. Therefore, it is not possible to compare our methods with other previous studies.

Conclusion
Classification of EEG signals is a core part of BCIs, so an effective feature extraction and selection method is the key to improving identification accuracy. For EEG signal processing, we present a new method: a wrapped Sparse Group Lasso method for channel and feature selection. A variety of feature extraction methods, covering the power spectrum, time-domain statistics, AR model, and the wavelet coefficients, are first applied jointly to establish a high-dimensional feature fusion of the preprocessed EEG signals. Then, channels and features are selected in a wrapped way. The logistic regression model penalized with Sparse Group Lasso is fitted on the training data, and parameter estimation is obtained by a blockwise coordinate descent method and coordinate gradient descent method. The best feature subset is selected by using 10-fold cross-validation. Finally, the test sample is classified using the trained model. Fusing multiple features to establish a collection from which a selection is made is a beneficial research area to explore for EEG signal classification. Experiments have shown that this method can extract the characteristics of the EEG signal more completely, so it is an effective way to improve the signal recognition accuracy. Compared with existing channel and feature selection methods, the results show that the proposed method is more suitable for selecting a subset of the fused features of the EEG signal, as well as being more stable and faster. It can also select a subset which is more relevant to the classification, and the test accuracy obtained on the data used from international BCI Competition IV reached 84.72%. This method is a good choice for future research in pattern recognition topics, such as speech recognition, face recognition, gene classification, remote sensing image recognition, and medical image recognition.