Common Spatio-Time-Frequency Patterns for Motor Imagery-Based Brain Machine Interfaces

For efficient decoding of brain activity, with an application to brain machine interfacing (BMI), we address the problem of how to determine spatial weights (spatial patterns), bandpass filters (frequency patterns), and time windows (time patterns) from electroencephalogram (EEG) recordings. To find these parameters, we develop a data-driven criterion that is a natural extension of the so-called common spatial patterns (CSP), which are known to be effective features in BMI. We show that the proposed criterion can be optimized by an alternating procedure to achieve fast convergence. Experiments demonstrate that the proposed method can effectively extract discriminative features for a motor imagery-based BMI.


Introduction
Brain machine/computer interfacing (BMI/BCI) is a challenging technology at the intersection of signal processing, machine learning, and neuroscience [1]. BMIs capture brain activities associated with mental tasks and external stimuli and realize a nonmuscular communication and control channel for conveying messages and commands to the external world [1][2][3]. Noninvasively measured data such as the electroencephalogram (EEG), the magnetoencephalogram (MEG), and functional magnetic resonance imaging (fMRI) are widely used to observe brain activities. Among them, because of its simplicity and low cost, EEG is practical for use in engineering applications [4,5].
Efficient decoding of activity around the motor cortex is a crucial technique for the realization of BMIs based on motor imagery (MI-BMI) [6,7], with applications to controlling external devices [7], prostheses [4], rehabilitation [8], and so forth. It is known that real and imagined movements of the hands and feet evoke changes of the so-called mu rhythm in different brain regions [2,3]. Therefore, accurately extracting these changes from the measured EEG signals, in the presence of measurement noise and spontaneous components related to other brain activities, enables us to classify the EEG signals associated with different real or imagined motor actions such as movement of the right hand, left hand, or feet.
In the classification of EEG signals in MI-BMI and the analysis of brain activity during motor imagery, signal processing techniques such as bandpass filtering and spatial weighting are used [1]. For this processing, determining parameters such as the filter coefficients and the weights that extract the related components is a crucial issue. Moreover, the parameters that are optimal for classification are highly dependent on users and measurement environments [9].
In order to determine the parameters, data-driven techniques that exploit observed data are widely used [1,2]. The observed data essentially include class labels corresponding to the tasks. The techniques should find parameters that extract features that are as discriminative as possible. For example, the well-known common spatial pattern (CSP) method finds spatial weights from the observed signals [1,9,10] in such a way that the variances of the signals extracted by the linear combination of a multichannel signal with the spatial weights differ as much as possible between two classes. The standard CSP method has been extended to methods that estimate other parameters, such as the frequency bands [11][12][13][14][15][16], and to methods that select among CSP features extracted with various parameters [17,18].
Besides these, one of the parameters to be decided by data-driven techniques is the time window, for the following reasons. One kind of BMI is implemented based on cues that the user follows: the user begins to perform a task when the cue is given. Therefore, the time when the user begins to perform the task is known. However, the time when the brain activity associated with the task occurs is unknown. A time window intended to remove samples that do not contain the brain activity will, in general, not coincide with the period during which the cues are shown. For instance, the samples for a few hundred milliseconds after the cue were excluded from feature extraction in previous works [13,17,19,20], which determined the time window heuristically. In contrast to these works, this paper hypothesizes that the observation period that is optimal for classification depends on the user. For example, reaction time, defined as the elapsed time between the presentation of a sensory stimulus and the subsequent behavioral response, is strongly associated with age [21], and the reaction time can be related to the time of response to the cues. Therefore, the time window should also be designed from the observed signals, that is, in a data-driven manner.
In this paper, we propose a method for finding the time windows as well as the aforementioned parameters (CSP and temporal filters) by extending a framework proposed in [15]. The proposed method enables us to find parameters that separately extract several components observed in different spatial patterns, frequency bands, and periods of time. We call these simultaneously designed parameters common spatio-time-frequency patterns (CSTFP). Compared to methods in which the parameters are selected from predefined candidates fixed in advance [18], the CSTFP method gives the parameters a higher degree of freedom while reducing the computational cost. In CSTFP, the coefficients of the temporal filters and the spatial weights are searched over the set of real numbers. Moreover, although the time windows are selected from a set of candidates, the computational cost of the CSTFP method does not grow rapidly even with a large number of candidates.
The rest of this paper is organized as follows. Section 2.1 reviews the CSP method. Next, we illustrate the CSTFP method from Section 2.2 to Section 2.4. We experimentally analyze the CSTFP method by using artificial data in Section 3. Section 4 presents experimental results of classification of EEG signals during motor imagery to show the performance of CSTFP. Finally, the conclusions of this paper are presented in Section 5.

Common Spatio-Time-Frequency Patterns (CSTFP)
We describe a novel method called the CSTFP method to simultaneously find the parameters for the spatial weights, the temporal filters, and the time windows. In this section, before introducing CSTFP, CSP is reviewed. Next, we define a signal extraction model using spatial weighting, filtering, and time windowing. Then, we propose a criterion for designing the parameters, motivated by that of the CSP method. The proposed criterion evaluates not only the spatial weights but also the coefficients of the time windows and the temporal filters. Next, the optimization method for the proposed criterion with alternate updating is presented. Finally, we define the feature vector used with the feature extraction model for classification of unlabeled EEG signals.

Common Spatial Pattern (CSP): A Review.
Let X ∈ R^(M×N) be an observed signal, where M is the number of channels and N is the number of samples. In BMI applications, we do not use X directly; instead, we use the filtered signal X̂ = H(X) to find the CSP, where H is a bandpass filter that passes the frequency components related to the brain activity of motor imagery. Denote the columns of X̂ by X̂ = [x̂_1, . . . , x̂_N], where x̂_n ∈ R^M and n is the time index. We assume two sets of observed signals, C_1 and C_2, where C_y contains the signals belonging to class y, y ∈ {1, 2} is a class label, C_1 ∩ C_2 = ∅, and ∅ is the set having no elements. The CSP is defined as the weight vector that maximizes the intraclass variance over C_y under a normalization over all samples. More specifically, for y fixed, the weight vector is found by solving the following optimization problem [9,10]:

max_w E_{X∈C_y}[ v(w; X) ]  subject to  E_{X∈C_1}[ v(w; X) ] + E_{X∈C_2}[ v(w; X) ] = 1,  (1)

where v(w; X) = (1/N) Σ_{n=1}^{N} | w^⊤ (x̂_n − x̄) |², E_{X∈C_y}[·] denotes the expectation over C_y, x̄ is the time average of X̂ given by x̄ = (1/N) Σ_{n=1}^{N} x̂_n, (·)^⊤ is the transpose of a vector or a matrix, and | · | is the absolute value of a scalar. The solution of (1) is given by the generalized eigenvector corresponding to the largest generalized eigenvalue of the generalized eigenvalue problem described as

Σ_y w = λ (Σ_1 + Σ_2) w,  (2)

where the matrices Σ_y, for y = 1, 2, are defined as

Σ_y = E_{X∈C_y}[ (1/N) Σ_{n=1}^{N} (x̂_n − x̄)(x̂_n − x̄)^⊤ ].  (3)
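As a concrete illustration, the CSP solution via the generalized eigenvalue problem (2) can be sketched in a few lines of numpy/scipy; the toy trial data, shapes, and function names below are illustrative and not part of the original experiments:

```python
import numpy as np
from scipy.linalg import eigh

def csp_weights(trials_1, trials_2):
    """Sketch of the standard CSP solution: the spatial weights are the
    generalized eigenvectors of one class covariance against the sum of
    both class covariances.  Names and normalization are illustrative."""
    def class_cov(trials):
        covs = []
        for X in trials:
            Xc = X - X.mean(axis=1, keepdims=True)   # subtract the time average
            C = Xc @ Xc.T
            covs.append(C / np.trace(C))             # per-trial normalization
        return np.mean(covs, axis=0)
    S1, S2 = class_cov(trials_1), class_cov(trials_2)
    # Generalized eigenproblem  S1 w = lambda (S1 + S2) w ; the eigenvalues
    # lie in [0, 1] and are returned in ascending order, so the first/last
    # eigenvectors maximize the class-2/class-1 variance ratio.
    eigvals, eigvecs = eigh(S1, S1 + S2)
    return eigvals, eigvecs

# toy demonstration: class 2 has inflated variance on channel 0
rng = np.random.default_rng(0)
trials_1 = [rng.standard_normal((4, 200)) for _ in range(20)]
trials_2 = [np.diag([3.0, 1.0, 1.0, 1.0]) @ rng.standard_normal((4, 200))
            for _ in range(20)]
eigvals, eigvecs = csp_weights(trials_1, trials_2)
```

In practice, the eigenvectors at both ends of the spectrum are kept as spatial weights, as described for the CSP baseline in Section 4.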

Signal Extraction Model.
The target signal and the signal extraction procedure are formulated in this section. The extracted signal obtained from a target signal X, denoted as ŷ = [ŷ_1, . . . , ŷ_L]^⊤, is defined as

ŷ_n = b_n Σ_{m=1}^{M} w_m Σ_{d=1}^{D} h_d [x_{n+d−1}]_m,

for n = 1, . . . , L, L = N − D + 1, where [·]_m is the mth entry of a vector, D is the order of an FIR filter whose coefficients are denoted by h_1, . . . , h_D, w_m is the spatial weight for the mth channel, and b_n is the time window coefficient for the nth sample, which takes a binary value of either 0 or 1. The variance of the extracted signal is defined as

v_X(w, h, b) = Σ_{n=1}^{L} ŷ_n².  (4)

The structure of the temporal filtering, spatial weighting, and windowing applied to X is illustrated in Figure 1. In this model, w is regarded as the spatial pattern, h as the frequency pattern, and b as the time pattern of the extracted signal.
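The extraction model above can be sketched as follows; the function and variable names mirror the text (w, h, b), while the shapes and the demo data are illustrative:

```python
import numpy as np

def extract_signal(X, w, h, b):
    """Sketch of the extraction model: each channel is passed through the
    FIR filter h (order D), the channels are mixed by the spatial weights
    w, and the binary window b keeps or discards each remaining sample."""
    M, N = X.shape
    D = len(h)                      # FIR filter order
    L = N - D + 1                   # length of the filtered signal
    # y_hat[n] = b[n] * sum_m w[m] * sum_d h[d] * X[m, n + d - 1]
    filtered = np.array([np.correlate(X[m], h, mode='valid')
                         for m in range(M)])      # shape (M, L)
    return b * (w @ filtered)

# demo: identity filter and a window of all ones recover channel 0
X_demo = np.arange(12.0).reshape(2, 6)
y_hat = extract_signal(X_demo, np.array([1.0, 0.0]), np.array([1.0]),
                       np.ones(6))
```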
The variance defined in (4) can be transformed into matrix-vector form as follows. We define matrices A_n ∈ R^(M×D), n = 1, . . . , L, whose elements are taken from X as

[A_n]_{m,d} = [x_{n+d−1}]_m,

for m = 1, . . . , M and d = 1, . . . , D, where [·]_{i,j} is the element at the ith row and the jth column of a matrix. Then, since ŷ_n = b_n w^⊤ A_n h and b_n² = b_n for binary b_n, (4) can be rewritten as

v_X(w, h, b) = Σ_{n=1}^{L} b_n (w^⊤ A_n h)².

Optimization for Sets of Parameters.
We consider the problem of designing K sets of parameters, each consisting of an FIR filter, spatial weights, and a time window, represented by

P_k = {w_k, h_k, b_k}, k = 1, . . . , K.

These sets of parameters are designed in such a way that w_k, h_k, and b_k maximize the expectation of v_X(w_k, h_k, b_k) over the samples X ∈ C_y under the normalization of the expectation of v_X(w_k, h_k, b_k) over all of the observations. Additionally, we impose orthonormality on h_k, k = 1, . . . , K, to avoid the trivial solution. Moreover, the time windows are chosen from given candidates for efficient optimization. Therefore, we formulate the maximization problem (8) of the sum of the costs Ĵ(P_k), where P_k represents the set {w_k, h_k, b_k}, Ĵ(P_k) is the cost evaluating the ratio of the expected feature value in class C_y to that over all samples, S is any subspace of R^D, B is the candidate set for the time windows, defined as B = {b̄_j}_{j=1}^{J}, y is a class label chosen from 1 and 2, γ is a regularization parameter, and δ_{ij} is the Kronecker delta, defined as 1 for i = j and 0 otherwise.
Since it is difficult to find all of the parameters simultaneously, we consider a sequential optimization that finds the parameters for each filter index k. That is, we first find P_1 and then find P_2 under a constraint involving h_1. This sequential optimization is represented, for each k, as problem (9), where Ŝ_k is a subspace satisfying Ŝ_k ⊕ Span(h_1, . . . , h_{k−1}) = S, Span(· · ·) represents the subspace spanned by the given vectors, and the operator ⊕ gives the direct sum of two subspaces. Methods for choosing Ŝ_k have been discussed in [15]. To optimize the parameters indexed by k in (9), we adopt an alternating optimization procedure based on alternating least squares (ALS). In this optimization, we separate problem (9) into three subproblems, for w_k, h_k, and b_k, respectively. Then, we update the parameters by alternately solving the subproblems. The three subproblems and their solutions are as follows.
The first subproblem is to optimize w_k. With h_k and b_k fixed, the w_k maximizing (9) is found as the generalized eigenvector corresponding to the largest generalized eigenvalue of the generalized eigenvalue problem (11) [15], where the matrices on its two sides are defined for y = 1, 2, and λ is an eigenvalue.

for k = 1, . . . , K do
  repeat
    Update w_k by solving (11).
    Update h_k by solving (13).
    Update b_k by solving (19).
    Calculate the cost ρ_t from the cost function Ĵ(P_k).
  until ρ_t − ρ_{t−1} is sufficiently small.
end for

Algorithm 1: Design of the FIR filters, the spatial weights, and the time windows.
The second subproblem is to optimize h_k. With w_k and b_k fixed, the h_k maximizing (9) is found as the generalized eigenvector corresponding to the largest generalized eigenvalue of the generalized eigenvalue problem (13) [15], where the corresponding matrices are defined for y = 1, 2, v_1, . . . , v_r are vectors spanning Ŝ_k, I_r is the r × r identity matrix, and λ is an eigenvalue. The third subproblem is to optimize b_k with w_k and h_k fixed. Because b_k is restricted to the finite candidate set B, the cost of (9) reduces to a function of the candidate windows, and the maximizing candidate is selected as in (19). The procedure to design the spatial weights, the temporal filters, and the time windows is summarized in Algorithm 1 as pseudocode.
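The alternating procedure of Algorithm 1 can be sketched generically: the three subproblem solvers (the two generalized eigenproblems and the discrete search over the window candidates) are passed in as callables, and the driver iterates until the cost change falls below a tolerance. All names, as well as the toy closed-form updates in the demo, are illustrative:

```python
import numpy as np

def alternating_maximize(update_w, update_h, update_b, cost,
                         w0, h0, b0, tol=1e-5, max_iter=100):
    """Generic alternating-optimization driver mirroring Algorithm 1:
    cycle through the three subproblem solvers until the change of the
    cost between successive iterations is below tol."""
    w, h, b = w0, h0, b0
    prev = -np.inf
    for _ in range(max_iter):
        w = update_w(h, b)          # e.g. a generalized eigenproblem in w
        h = update_h(w, b)          # e.g. a generalized eigenproblem in h
        b = update_b(w, h)          # e.g. a search over window candidates
        cur = cost(w, h, b)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return w, h, b, cur

# toy demo: each coordinate update directly maximizes a separable cost
w_opt, h_opt, b_opt, final_cost = alternating_maximize(
    lambda h, b: 1.0,
    lambda w, b: 2.0,
    lambda w, h: 3.0,
    lambda w, h, b: -((w - 1.0) ** 2 + (h - 2.0) ** 2 + (b - 3.0) ** 2),
    0.0, 0.0, 0.0)
```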

Experimental Analysis of Artificial Signal
We give an analysis of CSTFP by a toy experiment with an artificial signal in this section. We assume a 2-class BMI in which the observed EEG signals are modeled by a mixture of narrow-band signals (see Figure 2). In this model, a trial signal belonging to class y is given by a sum of windowed narrow-band components, where a_i[f] ∈ R represents a discrete spectrum, b_i[n] represents the time window that decides the period when the ith source signal is generated, φ_i ∈ R is a stochastic phase of the source signals, and the operator R{·} takes the real part of a complex number.
In particular, the 200 artificial signals used in the experiment were generated under the conditions shown in Table 1, where N(μ, σ²) is a Gaussian distribution with mean μ and variance σ², and U(a, b) is a uniform distribution whose minimum and maximum values are denoted by a and b, respectively. We applied the CSTFP method to the artificial signals as follows. The class label y in (8) was set to 1. The number of sets of parameters, K, was set to 4. The order of the temporal filters was set to 41, so that D = 41; the length of a filtered signal, L, is 60. For the candidates for the time windows, we define the following set. First, we define ten L-dimensional vectors as
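A minimal generator in the spirit of this artificial-signal model (a single narrow-band component with a stochastic phase, switched on only inside a binary time window) might look like the following; the frequency, window position, and all names are illustrative and are not the values of Table 1:

```python
import numpy as np

def narrowband_trial(n_samples=60, freq=0.2, on=(10, 40), seed=None):
    """Toy trial in the spirit of the model above: a sinusoid at a given
    normalized frequency with a uniformly random phase, generated only
    inside the window given by `on` (all values are illustrative)."""
    rng = np.random.default_rng(seed)
    n = np.arange(n_samples)
    window = ((n >= on[0]) & (n < on[1])).astype(float)   # binary time window
    phase = rng.uniform(0.0, 2.0 * np.pi)                 # stochastic phase
    # real part of a complex exponential, as in the model definition
    return window * np.real(np.exp(1j * (2.0 * np.pi * freq * n + phase)))

s = narrowband_trial(seed=0)
```

A class-dependent mixture would sum several such components with different spectra and windows and then mix them across channels.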

Experiment of EEG Signal Classification
A comprehensive comparative study was performed to illustrate the ability of the CSTFP method to produce more accurate classification of EEG signals during motor imagery than several conventional methods (CSP [10], common sparse spectral spatial patterns (CSSSP) [12], filter bank CSP (FBCSP) [17], and discriminative filter bank CSP (DFBCSP) [15]).

Data Description.
We used dataset IVa from BCI competition III [22], which was provided by Fraunhofer FIRST [24]. The conditions for each dataset are shown in Table 2. The datasets contain two classes of motor imagery. The signals in the provided datasets were recorded at a sampling rate of 1000 Hz. We furthermore applied to these data a Butterworth low-pass filter with a cutoff frequency of 50 Hz and a filter order of 4, and downsampled the signals to 100 Hz.
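This preprocessing step can be sketched with scipy as follows; zero-phase filtering (filtfilt) and plain sample dropping after the low-pass are simplifying assumptions, since the text does not state the exact routines:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(raw, fs=1000, cutoff=50.0, order=4, fs_out=100):
    """4th-order Butterworth low-pass at 50 Hz, then downsampling from
    1000 Hz to 100 Hz, as described above (a sketch, not the paper's
    exact code)."""
    b, a = butter(order, cutoff / (fs / 2.0))   # normalized cutoff frequency
    smoothed = filtfilt(b, a, raw, axis=-1)     # low-pass every channel
    return smoothed[..., :: fs // fs_out]       # keep every 10th sample

raw = np.random.default_rng(0).standard_normal((3, 1000))
processed = preprocess(raw)
```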

Result.
For the experiments, as a sample for each trial, we used the signal observed in the period from t_1 to t_2 [sec] after the cue that directs the subject to perform the task. In the experiments, t_1 was tuned by a method we describe later. t_2 was set to 3.5 and 4 seconds for BCI competition III dataset IVa and BCI competition IV dataset 1, respectively.
In order to compare the classification abilities of the methods, we obtained the classification accuracy rates by 5 × 5 cross-validation (CV). In each classification in the CV, we separated the data into learning samples, used for selecting the parameters of the feature extraction and for training a linear discriminant analysis (LDA) classifier, and test samples, used for obtaining the classification accuracy rates.
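The 5 × 5 CV split (five independent random partitions into five folds each, giving 25 learning/test pairs) can be sketched in plain numpy; names are illustrative:

```python
import numpy as np

def repeated_kfold_indices(n_samples, n_folds=5, n_repeats=5, seed=0):
    """Index generator for 5 x 5 cross-validation: n_repeats independent
    random partitions into n_folds folds, yielding (learning, test)
    index pairs (a sketch of the splitting scheme described above)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        folds = np.array_split(perm, n_folds)
        for k in range(n_folds):
            test = folds[k]
            learn = np.concatenate([folds[j] for j in range(n_folds)
                                    if j != k])
            yield learn, test

# enumerate the 25 (learning, test) splits for 20 trials
pairs = list(repeated_kfold_indices(20))
```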
For the methods to be compared (CSP, CSP-Exh, CSSSP, FBCSP, DFBCSP, and CSTFP), the parameters for the feature extraction were obtained as follows.
(i) CSP: the parameters determined by this method are the spatial weights. Before obtaining the spatial weights by the CSP method, we applied a Butterworth bandpass filter with a passband of 7-30 Hz. In the CSP method, we minimized the variance cost of the right-hand class in (1). The eigenvectors corresponding to the largest and smallest eigenvalues of the eigenvalue problem (2) were taken as the spatial weights.
(ii) CSP-Exh: the parameters determined by this method are the spatial weights and the passband of a Butterworth filter. The passband of the Butterworth filter was tuned by an exhaustive search using the CSP method and the learning samples. After filtering with the tuned passband, the spatial weights were obtained in the same manner as in the CSP method.
(iii) CSSSP: the parameters determined by this method are the spatial weights and a bandpass filter. A bandpass filter between 7 and 30 Hz was applied as preprocessing [12]. CSSSP was applied with its regularization parameter and the parameter specifying the number of spatial weights. The order of the filter was fixed to 16 [12].
(iv) FBCSP: the parameters determined by this method are bandpass filters chosen from a filter bank and the associated spatial weights. FBCSP was applied with the mutual-information-based best individual feature selection and a naïve Bayesian Parzen window (NBPW) classifier [17]. A filter bank comprising 9 bandpass filters covering 4-40 Hz was used. All filters were Chebyshev type II filters with a bandwidth of 4 Hz each. In FBCSP, the number of spatial weights in each band was set to 8. These parameters were decided by referring to [17].
(v) DFBCSP: the parameters determined by this method are FIR filters and the spatial weights associated with each FIR filter. DFBCSP was applied with an FIR filter order of 41, as in [15]. In the optimization, we stopped the iteration when the change of the cost function between successive iterations fell below 10^−5.
(vi) CSTFP: the parameters determined by this method are FIR filters, the corresponding time windows, and the spatial weights associated with each FIR filter. We fixed t_1 = 0 to observe the behavior of the resulting time window. CSTFP was applied with an FIR filter order of 41 [15] and a candidate set of rectangular windows whose onset s and length l are chosen from the set {0, 5, 10, . . .} such that l > 50 and s + l ≤ L. The regularization parameter γ was set to 0.1. In the alternating optimization, we initialized h_k as a random vector orthonormalized within the span of v_1, . . . , v_r, and b_k as a vector all of whose elements are one. We stopped the iteration when the change of the cost function between successive iterations fell below 10^−5.
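Under one plausible reading of the candidate set for CSTFP, every rectangular window whose onset and length lie on a grid of 5 samples, with length above 50 and the window fitting inside the trial, is enumerated. The following sketch uses illustrative names:

```python
import numpy as np

def window_candidates(L, step=5, min_len=50):
    """Enumerate binary rectangular time-window candidates: onset and
    offset on a grid of `step` samples, with length strictly greater
    than `min_len` and the window contained in [0, L).  A sketch of one
    reading of the candidate set described above."""
    cands = []
    for start in range(0, L, step):
        for stop in range(start + step, L + 1, step):
            if stop - start > min_len:
                b = np.zeros(L)
                b[start:stop] = 1.0    # samples kept by this candidate
                cands.append(b)
    return cands

candidates = window_candidates(100)
```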
For the remaining free parameters in the above list, we furthermore tuned them by 5 × 5 CV using the learning samples, as done in [25]. We conducted the nested CV [25] with all of the combinations of these parameters and obtained the classification accuracy rates. We adopted the combination that achieved the highest rates as the parameters. The parameters tuned by the nested CV in the learning data and their candidates are summarized in Table 3.
After we obtained the feature vectors extracted by the filters, spatial weights, and time windows designed by each listed method, we calculated the logarithm of the feature vectors. Then, LDA on the learning samples was used to obtain a projector onto a 1-dimensional space. The threshold for classification was determined as the middle point of the two class averages over the projected learning samples. The feature vectors from the test samples were classified by the projection and the threshold, and we obtained the classification accuracy rate in each CV. In Tables 4 and 5, CSTFP gives the highest accuracy rate averaged over all subjects. Moreover, we conducted classification experiments in which the parameters shown in Table 3 were fixed, to show the effects of changes of the parameters on the classification accuracy rates. Figure 5 shows the accuracy rates of CSP and the rates of each method averaged over all subjects for each t_1. For each subject, the parameters shown in Table 3, except for t_1, were fixed to the combinations that achieved the highest accuracy rates for each t_1. In Figure 5(a), the values of t_1 that achieve the highest accuracy rates differ among the subjects. Moreover, Figure 5(b) shows that the accuracy rates highly depend on t_1 in the conventional methods. Figure 6 shows the variation of the accuracy rates for various regularization parameters γ in CSTFP. For each subject, the remaining parameters were fixed to the combinations that achieved the highest accuracy rates for each γ. CSTFP achieves higher accuracy as γ becomes larger than 0.1.
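The classification step described above (an LDA projector fitted on the learning samples plus the midpoint threshold between the projected class averages) can be sketched as follows; all names and the randomly generated feature vectors are illustrative:

```python
import numpy as np

def lda_train(F1, F2):
    """Fisher LDA projector and midpoint threshold: F1, F2 hold the
    (log-power) feature vectors of the two classes, one trial per row.
    A sketch of the classification step described in the text."""
    m1, m2 = F1.mean(axis=0), F2.mean(axis=0)
    Sw = np.cov(F1, rowvar=False) + np.cov(F2, rowvar=False)  # pooled scatter
    a = np.linalg.solve(Sw, m1 - m2)             # projector onto 1-D space
    # threshold: middle point of the two projected class averages
    thr = 0.5 * (F1 @ a).mean() + 0.5 * (F2 @ a).mean()
    return a, thr

def lda_classify(F, a, thr):
    return np.where(F @ a > thr, 1, 2)           # class 1 above the threshold

# demo on two well-separated synthetic classes
rng = np.random.default_rng(1)
F1 = rng.standard_normal((50, 3)) + 5.0
F2 = rng.standard_normal((50, 3)) - 5.0
proj, thr = lda_train(F1, F2)
```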
We show examples of the spatial patterns, the frequency patterns (the amplitude characteristics of the FIR filters), and the time patterns designed by CSTFP in Figures 7 and 8. As we can observe in Figures 6, 7, and 8, the classification accuracy rates do not strongly depend on the regularization parameter γ, varying by no more than 0.12. Moreover, the designed FIR filters do not differ greatly across the various values of γ. However, the spatial weights with the small windows (Figures 7(a) and 8(a)) differ from the other weights with the longer windows. Furthermore, there are slight differences in the time patterns and the frequency patterns between subject aa and subject av.

Conclusion
We have proposed a novel method called CSTFP for the classification of EEG signals during motor imagery. Our objective was to design the time windows that are adopted in signal processing for a cue-based BMI. Incorporating this idea into DFBCSP has allowed us to simultaneously design the parameters for the time windows, the spatial weights, and the FIR filters. These parameters are optimized under a single criterion based on the CSP method. We have presented an optimization procedure for the CSTFP problem that sequentially and alternately solves the subproblems into which the original problem is divided. Through the experiments with artificial signals and actual EEG signals, we have shown the performance of CSTFP. In the experiments, we have demonstrated that CSTFP achieves a high classification accuracy rate. Our experimental results also suggest that the CSTFP method can find the frequency bands and the time periods in which brain activities associated with a mental task can be observed.
Methods for finding time intervals in which the feature components associated with brain activities are observed are also needed for accurate classification in an asynchronous BMI.