Feature Selection Applying Statistical and Neurofuzzy Methods to EEG-Based BCI

This paper presents an investigation aimed at drastically reducing the processing burden required by motor imagery brain-computer interface (BCI) systems based on electroencephalography (EEG). In this research, the focus has moved from the channel to the feature paradigm, and a 96% reduction of the number of features required in the process has been achieved maintaining and even improving the classification success rate. This way, it is possible to build cheaper, quicker, and more portable BCI systems. The data set used was provided within the framework of BCI Competition III, which allows it to compare the presented results with the classification accuracy achieved in the contest. Furthermore, a new three-step methodology has been developed which includes a feature discriminant character calculation stage; a score, order, and selection phase; and a final feature selection step. For the first stage, both statistics method and fuzzy criteria are used. The fuzzy criteria are based on the S-dFasArt classification algorithm which has shown excellent performance in previous papers undertaking the BCI multiclass motor imagery problem. The score, order, and selection stage is used to sort the features according to their discriminant nature. Finally, both order selection and Group Method Data Handling (GMDH) approaches are used to choose the most discriminant ones.


Introduction
Brain-computer interface (BCI) systems capture brain signals and decode them with the purpose of interacting with external devices without any muscular or physical intervention. Well-known examples are motor imagery tasks due to their importance in applications for severely motor impaired people. Likewise, other patterns can also be recognized within the brain signals, including word generation or object rotation. These patterns can be transformed to distinguishable signals and then to external commands or actions [1].
Technologically, most of the BCI mechanisms are based on electroencephalogram (EEG) techniques, where the sensors detecting the electric potentials originated by the neurons are placed on the scalp of the user [2]. Among the noninvasive technologies, where examples like magnetoencepahlography (MEG), position emission tomography (PET), or functional Magnetic Resonance Imaging (fMRI) systems can be considered, the main benefits of the EEG approach are the cost and the portability, making its use feasible in environments out of the laboratory. These systems show major benefits when being compared with invasive methods like electrocorticopgraphy (ECoG) [3] due to the fact that no brain surgery is being required to set up the montage.
According to how the brain signals get activated, two different paradigms can be distinguished [4,5]. On the one hand, they can be produced spontaneously by human specific thoughts without any sensory stimulus. Examples of this comprise the detection EEG rhythms ( : 0-4 Hz, : 4-8 Hz, : 8-12 Hz, : 8-13, and : 13-30 Hz) [6], slow cortical potentials (SCP), or event-related desynchronization (ERD)/event-related synchronization (ERS). On the other hand, the brain signals can be evoked by external stimulation, without prior training. Examples of the use of this method are the applications based on P300 [7], Steady-State Visual 2 Computational Intelligence and Neuroscience Evoked Potential (SSVEP), or hybrid BCI systems combining both of them [8][9][10].
Because the recorded brain signals are so small in amplitude, EEG devices in particular present a very low signal to noise ratio (SNR). For this reason, any interference coming from sources such as eye movement, eye-blink, muscular movements, teeth clash, or the heart rhythm deeply affects the quality of the measured signal, which can prevent the decoding system from properly recognizing the intention of the user. As a consequence, an effort to improve the spatial filtering methods [11], the feature extraction techniques [12,13], and the classification algorithms [14][15][16] has been undertaken by the scientific community.
In recent years, there has been increasing interest in minimizing the number of channels and features used by the classification algorithms. Yang et al. [17] identify three major drawbacks when using data from all channels by applying conventional ANNs, which can be extended to any EEG classifier: irrelevant features adding noise to the data and an increase in the complexity of the model and more computational burden. Other limitations can be added when considering the functional side and the cost of an EEG system. Tam et al. [18] measured the time to set up a 32channel montage, achieving a total of 10-15 minutes when being done by an experienced operator (between 20 and 30 seconds per sensor). Regarding the cost, a public pricing list is available in [19] where doubling the number of electrodes seems to increase the overall cost of the system by around 25%.
There are a large number of published studies describing different approaches to feature and channel selection. These approaches comprise both wrapper and filter methods of feature selection. The most popular methods are Genetic Algorithms (GA) [17,20,21], Distinction Sensitive Learning Vector Quantizer (DSLVQ) [22], Mutual Information algorithms (MI) [23], Fisher Criterion (FC) [1,18,24] methods, and Common Spatial Pattern (CSP) techniques [25,26]. In addition, other approaches based on wavelet packet decomposition (WPD) [1] and combinations or evolutions of the previous methods like Rayleigh Coefficient and Genetic Algorithms [27], Sparse CSP (SCSP), Robust Sparse CSP (RSCSP) [28], or Mutual Information improvements as shown in [29,30] have also been presented to the research community. Common to all of the studies, a direct relationship between the selected sensors and the expected cortical areas is shown, although different level of success has been attained.
In [18], a work is presented where the intention of movement detection is studied in stroke patients. The selection of a minimum number of electrodes allowing it to maintain a high success rate is suggested. For that purpose, two channel selection methods are proposed: Fisher Criterion and Support Vector Machine-Recursive Feature Elimination. From an initial number of 50 channels, it demonstrated that it is possible to select 12 electrodes while maintaining the performance. The Common Spatial Pattern algorithm has also been used to define methods of channel selection [25,26]. In both works, data from the BCI Competition is used and it is shown that it is possible to maintain and even improve the classification performance considerably reducing the number of used channels. In both scenarios, the channel selection is done by using the data in raw format before the feature extraction stage. An even more recent approach has been developed by Aler et al. [31], who present a new method for classification and feature selection, thus improving the preprocessing stage for the same data set and problem used in this paper.
Although extensive research has been carried out on feature selection, most of the available research has focused on reducing the number of channels required instead of the number of individual features. Also, no single study exists which adequately covers the result of implementing Statistical and Fuzzy approaches.
In Cano-Izquierdo et al. [14], the dFasArt is proposed as a neurofuzzy model for the self-organised learning whose defined clusters are determined by the weights of the units, which can be interpreted as rules on fuzzy sets. The connections between the units of the model and the value of the weights define a Fuzzy Logic System (FLS). Among the characteristics of the dFasArt, it is worth highlighting the way the clustering works according to the incoming values and their arrival sequence to the system. Also, the system can work with ambiguous or noisy data.
Later work [32] presents a methodology to undertake the motor imagery problem. A supervised version of the dFasArt (S-dFasArt) is added including the creation of different models from the learning sessions, a rule prune stage (which allows the reduction of the number of units of the models learning from the classification error on the learning sessions), and a later voting phase among the different models. This approach was successfully applied to the Data Set V of the BCI Competition.
The data processing on the BCI Competition data sets is always off-line. If the methods included on the literature were to be applied on live applications, the time constraints to produce a prediction would be a major issue to address. For instance, for the Data Set V, it is necessary to calculate the PSD function for 8 sensors and 12 frequency bands (96 features) and then apply the recognition logic 16 times per second. Moreover, there is a requirement of producing a prediction every 0.5 seconds. This computational burden requirement is not easily accommodated even on today's PCs. For on-line applications, reduction of the number of features to process is necessary. This paper introduces a new methodology to choose the most relevant features using different approaches, being the statistic properties of the data or the relationship between the fuzzy categories which are generated on a S-dFasArt model. These methods have been applied to the Data Set V available for BCI Competition III [33] showing a reduction from 96 to 4 (96%) in the number of features required to maintain the output accuracy of the system when using a Fuzzy and GMDH (Group Method Data Handing) methodology.
The remainder of this paper is organized as follows. Section 2 describes the data set format and structure. The methods applied are explained in Section 3. Section 4 details the results obtained. The validation of the results and  Figure 1: Image of the montage applying the 10-20 system convention.
a comparison with other literature results are presented in Section 5 and finally Section 6 concludes this paper.

Data Sets Description
The work presented in this paper is based on the Data Set V available for the BCI Competition III [33] organized in 2004 by the Berlin brain-computer interface area of Berlin Institute of Technology. It is aimed to use this contest as both benchmark source and data source. For this reason, the same rules defined by the BCI Competition organizers have been followed, allowing us to compare the results attained by the research community with those presented on this paper. This implies using the designated sensors and maintaining the algorithms used at the preprocessing stage. The data set was provided by the IDIAP Research Institute of Switzerland and undertakes the multiclass motor imagery problem. This set was recorded by a Biosemi system using a cap with 32 integrated electrodes located at standard positions of the International 10-20 system as depicted in Figure 1. The sampling rate was 512 Hz, the signals were acquired at full DC, and no artifact rejection or correction was employed.
This data set focuses on a benchmark to classify three mental tasks [34]: left hand movement, right hand movement, and generation of words beginning with the same random letter. All sessions were obtained from healthy users with no previous EEG or mental training. The recordings were completed during the same day, each lasting 4 minutes, with 5-10 minutes breaks between them. The users were required to think about one of the three defined tasks with intervals of 15 s. Processed data from 3 of them, who recorded 4 sessions each, is used.
The precomputed sets provided only include the sensors C3, Cz, C4, CP1, CP2, P3, Pz, and P4 out of the available 32 and they are the result of several transformations of the raw data. In the first stage, the potentials recorded were spatially filtered by means of a surface Laplacian. After that, a Power Spectral Density (PSD) calculation for the frequency band between 8 and 30 Hz with a resolution of 2 Hz was performed. Being the sampling frequency 512 Hz and the records divided in windows of 1 s with an additional rate of 32 samples, an overlapping of 93.75% between windows is defined.
The computational burden of this processing can be calculated as the product of 12 different features (or different frequencies bands) per sensor by 8 channels, involving a total of 96 features per sample, yielding 49,152 features per minute.
To facilitate the understanding of the results presented in this paper, Table 1 shows the exact equivalence between the component number selected from the feature vector and the channel and frequency associated with it.

Methods
For the purpose of reducing the size of the features vector, a new methodology has been developed. Initially, the size of the features vector is the result of multiplying the number of channels used in the analysis by the number of frequencies considered in the PSD calculation. The classification method used is based on the S-dFasArt architecture proposed by Cano-Izquierdo et al. [32], which shows superior performance to other proposals for the multiclass motor imagery problem. It is intended that the feature selection method and the classification algorithm complement each other to maintain the overall system performance. This way, the global classification success rate can be used as a baseline, which needs to be maintained while significantly reducing the input vector. Figure 2 presents the main stages of the selection process, which obtains a reduced set from all the initial features available in the input vector (96 in this case).
(1) Feature Discriminant Character. At this step, the discriminant capacity of every feature is determined. Two methods are proposed.
(i) Statistics Method. It is based on statistical results normally used in pattern recognition problems. This criterion only depends on the data.
(ii) Fuzzy Criteria. It is supported by the S-dFasArt architecture as a Fuzzy Logic System, which includes a set of rules to link fuzzy sets. Therefore, this criterion is affected by both the input data and the neurofuzzy model, which is defined by the rules calculated from the data.
(2) Score, Order, and Selection. For this study, a feature preselection method based on the obtained discriminant character of the data is introduced. First, the discriminant  value of every feature is assessed and the feature itself is scored from 1 to 10, with 10 indicating the most and 1 the least discriminant feature. After that, all of them are sorted in descending order according to the scores given.
Then, all the scores are added according to each feature, allowing the creation of a feature classification from most to least discriminant nature. Using this ranking, a first selection of the candidate features to form the reduced vector is obtained.
(3) Feature Selection. In this stage, those features yielding the best performance when using the neurofuzzy classifier are selected from the candidate features set. In order to obtain the best performing subset, two different methods are proposed.
The accuracy of every individual option is calculated by applying a -fold method with the three available learning sessions and the S-dFasArt classifier. After that, the best performing features vector will be chosen.

(ii) Group Method Data Handling (GMDH).
This selection method evaluates the features to be added to the subset according to a Regularity Criterion (RC).

Feature Discriminant Character.
Two methodologies, based on the training data sets, are evaluated to analyze the discriminant nature of each of the components of the feature vector: the first one is supported by applying classic statistics methods, while the second is based on the fuzzy logic interpretation of the classifier which gets created from the training data set.  of -dimensions in C classes. According to this premise, a set of vectors which are assumed to be "properly" classified is used and is denoted as the learning set. By using the learning set, the relative contribution for each of the features on the sampling vector to the class separability is studied. As a consequence, the properties of the statistic results from the learning vector set are calculated [35]. is denoted as the variance for the th feature in the th class, the a priori probability of the th class, and the total value of the variance of the th feature. The normalized variance can be defined as (1) When establishing the criteria to determine the discrimination capacity contribution of each of the features, the statistical entropy can be estimated as Alternative criteria to show the discriminant information of each feature can be defined as This expression has a maximum value of (1/ ) when all the values of̃are the same for a certain feature . In this scenario, it can be concluded that the feature does not add discriminant information and it can be dismissed.

Fuzzy Criteria for Feature
Selection. An architecture to classify EEG data applying the same benchmark as proposed in the BCI Competition Data Set V is proposed by Cano-Izquierdo et al. [32], whose output accuracy has demonstrated the ability to improve any other results published so far. The recognition system is based on the use of a neurofuzzy S-dFasArt model [14] and on a three-stage methodology, which intends to increase the utility of the three available learning sessions ( Figure 3).
(1) First, a learning session is used to generate a rule set defining the model.
(2) After that, a different learning session is devoted to adjust the model parameters to be applied at the test stage. Then, a rule prune is performed where the rules contributing to a higher error than success rate are discarded.
(3) Finally, once all the possible combinations of the three learning sessions are used for stages 1 and 2, there are six models available. For each one, 16 vectors per second are processed. Then, due to the fact that a prediction is produced every half a second only, 6 Computational Intelligence and Neuroscience every model contributes to 8 possible alternatives. To choose among the 48 = 6 × 8 possible predictions, a voting strategy is used where the most frequent prediction is selected.
For the purpose of feature selection, the third stage of the model is replaced by an "intermediate" model, which is defined with only three rules (each one associated with one single class). To do this, the weights defining every rule are calculated as the mean of the weights predicting the same category. The S-dFasArt model allows each class to be interpreted as a rule whose transference function is determined by the weights associated with fuzzy sets. Moreover, the rule associated with the class of each feature is represented by a fuzzy set as follows: Also, it is assumed that the discriminant character of each feature will be linked to the relationship between its associated fuzzy sets for two classes. If these fuzzy sets are very similar, the feature will not be very discriminant. If the fuzzy sets are clearly different, the discriminant character of the feature will increase.
For each feature, the discriminant character is obtained by comparing the corresponding fuzzy sets for two rules y , by using the expression: A value of ( ) near to zero denotes a very discriminant feature while a value approaching one denotes a very low discriminant feature.

Score, Order, and Selection.
To determine the minimum number of features that can be part of the system while maintaining the output accuracy, the criteria based on the accumulated scores with regard to the total punctuation are presented. The scores are calculated by using both statistics method and fuzzy criteria. After that, the features are sorted in a descending order and the number of candidate features to be part of the model is calculated as follows: = min{ } which fulfills The design parameter is adjusted to discard any feature whose SCORE value is the minimal.

Feature Selection
(1) Order Selection. The different models are being determined by selecting an increasing number of features according to the established relevance order. (RC) [36], which is calculated for different candidate models, starting from single feature models. RC is considered to be the average success rate of the models for the 6 possible combinations of ( , , ) as shown in Figure 4. Using single variable models as a starting point, the highest RC value is chosen. After that, a new feature is added and the model with the highest RC value is selected again. When the RC of the extended model is higher than the previous one, this one is selected as a baseline for a new iteration. When the maximum value of RC for the different models is less than the previous one, the model cannot expand and the method stops.

Results
This section summarizes the outcome of the application of the previous methodology and architecture to the BCI Competition III Data Set V database, addressing a three-class classification problem. First, the application of the statistics method is presented and the results for both Order and GMDH Selection are shown in different figures and tables. Then, the analogue information is shown for the methods based on fuzzy criteria. Section 5 joins the results of both approaches and compares them. Figure 5 provides the results obtained for the three users of the BCI Competition III Data Set V database. The value of ( ) has been calculated in a separate way for each one of the three learning sessions within the data. Given that the lower values on the figures are related to high discriminant features, the existence of a reduced number of features with a high discriminant character can be stated.

Statistics Method for Feature Selection.
To determine the most discriminant features, they have been ordered from higher to lower value of ( ). Only the first 10 are considered and a score from 10 to 1 is assigned according to the achieved position. Once the marks from the three learning sessions are added up, the final results are gathered in Figure 6. As can be seen, the discriminant nature seems to be confined within a small number of features. Table 2 shows the channel information and frequencies related to the ten most relevant features for each user.
The numbers of candidate features obtained after applying the 85% criteria for each of the three studied users results are = 9 for User 1, = 10 for User 2, and = 15 for User 3.

Order Selection.
The results are presented in Table 3. For User 1, the best value of the classification success rate is achieved when using the two highest scored features. These are 38 and 2 of the input vectors, which relate to CP1-10 Hz   and C3-10 Hz as shown in Table 2. Therefore, the results calculated for User 1 can be presented as Following the same criteria, User 2 selected features would be as follows:  whereas for User 3 the features would be represented as follows: Table 4 shows the selected models and their RC values. Figure 7 compares the discriminant character of the features for the three users by using session 1 for learning and session 2 for adjustment and rule prune. Similar results are attained when the other five combinations between the learning and the adjustment sets are calculated. If the features are sorted from the highest to lowest value of ( ) and only the ten most important ones are selected, assigning them scores from 10 to 1 and adding them up for the six possible scenarios, the results displayed by Figure 8 are obtained.
The best ten channels and the frequency value attached to them for every user are provided in Table 5. Table 6 presents the different results when considering this model with an increasing number of features.

Order Selection.
From them, the input vector for User 1 can be presented as while for User 2 it would be And for User 3 it is as follows:

GMDH Selection.
Analogously to the process followed for the statistic criteria, the GMDH method will be used with the purpose of selecting a model from a candidate feature set. In Table 7, the selection process and the final selected features are shown.

Final Validation and Discussion
It is fundamental to outline that the test set of the BCI Competition is first used in the calculations required to obtain Computational Intelligence and Neuroscience 9 the results presented in this section. In previous sections, only the learning session data sets are applied. In order to check the efficiency of the proposed methodology, a final stage has been performed following the method developed in [32] (Learning-Prune-Voting) with no additional parameter adjustment. The results obtained from the previous stage are shown in Table 8. The most striking result to emerge from the data is that a reduction from a total of 96 to a range between 3 and 9 features is achieved. Interestingly, the classification success rate is maintained or even slightly improved while reducing the number of features.
Aler et al. [31] also present a feature selection process over this same data set. However, their focus is based on selecting frequency bands across all channels, so the numbers shown should be multiplied by 8 in order to be comparable with the ones above yielding 4 × 8 = 32 features for User 1, 2 × 8 = 16 for User 2, and 5 × 8 = 40 for User 3. As can be seen, they  Figure 7: ( ) values for the three users using session 1 for learning and session 2 for parameter adjustment and rule prune. Class "2" identifies the "LEFT" task, "3" represents "RIGHT" and "7" corresponds to "WORD". are much higher than the ones presented here. Another point to consider is the fact that the classification success rate presented in this paper is about 10 points higher for Users 1 and 2.
Similarly, another approach for feature selection is presented in [37]. In this occasion, EEG maps are created as a geometrical representation of the activity of the precomputed data of the Data Set V and only 1 frequency is selected for each user (10 Hz for User 1, 10 Hz for Subject 2, and 12 Hz for Subject 3). Given that data was collected by using 8 sensors, each map includes information from 8 features. Also, the amount of data used to create the map is 5 seconds, compared to the 1 second window allowed by the BCI Competition rules. Even in that advantageous situation, the classification success rate achieved is still 1.60 points lower than the Statistical and GMDH approaches.
The correlation between the selected features and the users has been tested too. However, a set of common features cannot be generalized. The results show how 2 appears in all methods for Users 1 and 2, but it is not a part of the selected features for User 3 by the Fuzzy selection methods. Also, it is certainly difficult to find features adopted for all users within the same selection method. As an example, for the Fuzzy + GMDH selection method, 2 and 27 are selected for Users 1 and 2, but they do not seem to have the same relevance for User 3.
Turning now to the channel position associated with the selected features (Figure 1), it can be clearly noted that important channels not only locate on the lateral area of the motor cortex, but also in the centre zone between them. Table 6: Success rate (in %) as a function of the input vector dimension ( ) applying the fuzzy criteria and order selection.
: learning session, rule prune session, and test session.  Table 10, which includes all the relevant features for all users when applying a Fuzzy + GMDH feature reduction method, clearly shows that all the selected features belong to the and rhythms. Also, the importance of the C3 is common to all users while C4 only appears to be useful for Users 1 and 2. Besides, other channels and frequencies appear to be relevant too. For instance, the frequencies of the CP1 and CP2 sensors seem to be significant for Users 1 and 3, while frequencies of P3 are important for Users 1 and 2 as well. This sensor selection matches neurophysiological literature as in [38], but it adds certain features which are new to this. In fact, strong evidence of the importance of the sensor Computational Intelligence and Neuroscience 13 positions C3 and C4 on the selection process has been found, but very little has been said about CP1, CP2, P3, or the adjacent channels. The difference on this research can be clearly motivated by the different way of constructing the data set as established in [39]. Also, the data set comprises a status which is not related to motor imagery, like it is imagining words beginning with the same random letter. This one could activate other areas of the brain and cause features not included in the previous research to appear as highly discriminant in our model.

Processing Time Improvement.
The processing cost per feature added to the model has also been calculated for each subject.
(1) At the preprocessing stage, and due to the calculations performed by the Welch periodogram PSD function, the time consumption is linear with the number of features and everyone's preprocessing cost is 1.04% of the total.
(2) The neurofuzzy algorithm explained in this paper requires an increase of 9.21% of the processing time per feature during the model generation (learning and rule prune), which is very significant considering that six models are generated for each user.
(3) A final 7.53% increase at the test stage for every feature added to the model is also required.  Table 11.

Unified Model for the Three Users.
As can be found, the accuracy is slightly lower than that in the user specific models, but the reduction is only a 3.43% and the results are only improved by those shown in EEG Mapping [37], which are calculated with a 5-second window (different from the 1 s window used in the rest of the methods).
A further investigation on this field should be carried out across a larger population to determine if a reduced set of common features across the users can be found as performed by Fazli et al. [40].

Conclusion
The most obvious finding to emerge from this study is a way of drastically reducing the number of features required on the processing of the BCI systems while maintaining and even improving their classification success rate. This approach, being a three-status paradigm where only two of them are motor imagery related, has not been commonly undertaken by the literature.
The results of this investigation show that a 96% reduction of the required number of features (from 96 to 4) for a selection method based on Fuzzy and GMDH algorithms can be achieved. This translates into important time saving in computational burden when the analysis of the time consumption is performed over this simplified model. Moreover, the methodology proposed presents a native support to multiclass problems. Most of the research papers focus on reducing channels in two tasks motor imagery paradigms. Therefore, two-class classification algorithms are an excellent tool to address the problem yielding good results in terms of the calculation time and accuracy. However, when increasing the number of classes within the problem, feature selection methods based on algorithms such as CSF, FDA, SVM, and FC require a review of the entire system and the inclusions of decision trees. In addition, the calculations need to be repeated several times in two-class space combinations, increasing the processing time and power consumption before reaching an outcome.
In contrast, the use of S-dFasArt does not require any further tuning when increasing the number of classes and the processing time remains the same due to the fact that no new calculations are being required.
It has also been shown how the user and the features selected present an important correlation. As previous studies have reported, it has been found that the and rhythms of the C3 and C4 channels present a big discriminant nature on the motor imagery tasks for all the studied users. Also, other and rhythms appear to be relevant in this scenario, which includes a nonmotor imagery task. However, the generalization capability has shown to be low, as the subset of selected features appears to be very dependent on the subject performing the task.
Further experimental investigations are needed to estimate the smallest number of common features required for  the exercise presented in this paper across a larger population. An important practical implication of this would be the manufacturing of low-cost headsets with a small number of sensors. Also, the processing should be quicker as the preprocessing stage and the classification algorithm would only perform calculations on a very small set of the sampled data. Therefore, the design of devices including a reduced number of sensors could be possible. This would allow the EEG systems to be more user friendly by drastically reducing the setup time. Also, more appealing headsets compared with the current cap system could be manufactured. In summary, it has been demonstrated that the analysis of only a few frequency bands is required. This allows an important saving in computation time and power consumption as well, which is beneficial when integrating the system, due to the fact that less processing power and memory resources are being required. The aforementioned benefits can be critical when designing applications where the available times to provide them with an output or the hardware platform are limited, for example, in applications for mobile devices.
As a consequence of the reduction in the hardware, the creation of an affordable mass market mobile system based on EEG would be possible.