Vowel Imagery Decoding toward Silent Speech BCI Using Extreme Learning Machine with Electroencephalogram

The purpose of this study is to classify EEG data on imagined speech in a single trial. We recorded EEG data while five subjects imagined different vowels, /a/, /e/, /i/, /o/, and /u/. We divided each single-trial dataset into thirty segments and extracted features (mean, variance, standard deviation, and skewness) from all segments. To reduce the dimension of the feature vector, we applied a feature selection algorithm based on the sparse regression model. These features were classified using a support vector machine with a radial basis function kernel, an extreme learning machine, and two variants of an extreme learning machine with different kernels. Because each single trial consisted of thirty segments, our algorithm decided the label of the single trial by selecting the most frequent output among the outputs of the thirty segments. As a result, we observed that the extreme learning machine and its variants achieved better classification rates than the support vector machine with a radial basis function kernel and linear discriminant analysis. Thus, our results suggest that EEG responses to imagined speech can be classified in a single trial using an extreme learning machine with a radial basis function or linear kernel. This classification of imagined speech might contribute to the development of silent speech BCI systems.


Introduction
People communicate with each other by exchanging verbal and visual expressions. However, paralyzed patients with various neurological diseases such as amyotrophic lateral sclerosis and cerebral ischemia have difficulties in daily communications because they cannot control their body voluntarily. In this context, brain-computer interface (BCI) has been studied as a tool of communication for these types of patients. BCI is a computer-aided control technology based on brain activity data such as EEG, which is appropriate for BCI systems because of its noninvasive nature and convenience of recording [1,2].
The classification of EEG signals recorded during the motor imagery paradigm has been widely studied as a BCI controller [3][4][5]. According to these studies, different imagined tasks induce different EEG patterns on the contralateral hemisphere, mainly in the mu (7.5-12.5 Hz) and beta (13-30 Hz) frequency bands. Many researchers have successfully constructed BCI systems based on the limb movement imagination paradigm, such as right hand, left hand, and foot movement [5][6][7]. However, EEG signals recorded during imagination of speech without any movement of either mouth or tongue are still difficult to classify. Nevertheless, this topic has attracted researchers' interest because speech imagination closely resembles real voice communication. For example, Deng et al. proposed a method to classify imagined syllables, /ba/ and /ku/, in three different rhythms using Hilbert spectrum methods, and the classification results were significantly greater than the chance level [8]. In addition, DaSalla et al. classified /a/ and /u/ as vowel speech imagery for EEG-based BCI [9]. Furthermore, a study to discriminate syllables embedded in spoken and imagined words using an electrocorticogram (ECoG) was conducted [10].
Obviously, for the BCI system, the use of optimized classification algorithms that categorize a set of data into different classes is essential, and these algorithms are usually divided into five groups: linear classifiers, neural networks, nonlinear Bayesian classifiers, nearest neighbor classifiers, and combinations of classifiers [11]. For instance, various algorithms for speech classification have been used, such as k-nearest neighbor classifier (KNN) [12], support vector machine (SVM) [9,13], and linear discriminant analysis (LDA) [8].
Figure 1: Schematic sequence of the experimental paradigm. Vowels /a/, /e/, /i/, /o/, /u/, and mute were randomly presented 1 s after the beginning of each trial. After the third beep sound, the subject imagines the same vowel heard at the beginning of the trial. The EEG data acquired during the speech imagination period were used for signal processing and classification in this study. Beep: beep sound signaling preparation for listening to the sound or for covert vowel articulation. Interval labels in the figure: 3 sec, 300 msec, 1 sec, 1 sec.
The extreme learning machine (ELM) is a type of feedforward neural network for classification, proposed by Huang et al. [14]. ELM has high speed and good generalization performance compared to the classic gradient-based learning algorithms. There is growing interest in the application of ELM and its variants in the biomedical field, such as epileptic EEG pattern recognition [15,16], MRI study [17], and BCI [18].
In this study, we measured the EEG activities of speech imagination and attempted to classify those signals using the ELM algorithm and its variants with kernels. In addition, we compared the results to the support vector machine with a radial basis function (SVM-R) kernel and linear discriminant analysis (LDA). As far as we know, applications of ELM as a classifier for EEG data of imagined speech have been rarely studied. In the present study, we will examine the validity of using ELM and its variants in the classification of imagined speech and the possibility of our method for applications in BCI systems based on silent speech.

Participants.
Five healthy human participants (5 males; mean age: 28.25 ± 2.71, range: 26-32) participated in this study. All participants were native Koreans with normal hearing and right-handedness. None of the participants had any known neurological disorders or other significant health problems. All participants gave written informed consent, and the experimental protocol was approved by the Institutional Review Board (IRB) of the Gwangju Institute of Science and Technology (GIST). The approval process of the IRB complies with the declaration of Helsinki.

Experimental Paradigm.
Participants were seated in a comfortable armchair and wore earphones (er-4p, Etymotic research, Inc., IL 60007, United States of America) providing auditory stimuli. Five types of Korean syllables-/a/, /e/, /i/, /o/, and /u/, as well as a mute (zero volume) sound-were utilized in the experiment. Figure 1 describes the overall experimental paradigm. At the beginning of each trial, a beep sound was presented to prepare the participants for perception of the target syllable. These six auditory cues (including the mute sound) were recorded using Goldwave software (GoldWave, Inc., St. John's, Newfoundland, Canada), and the source audio was from Oddcast's online (http://www.oddcast.com/home/demos/tts/ tts example.php?sitepa). The five vowels and mute sound were randomly presented. Another 1 s after the onset of the target syllable, two beep sounds were given sequentially, with a 300 ms interval between them. After the two beep sounds, participants were instructed to imagine the same syllable heard at the beginning of the trial. The time for imagination was 3 s for each trial. Participants performed 5 sessions, with each session consisting of 10 trials for each syllable. Resting times were given between sessions for 1 min. Therefore, 50 trials were recorded for each syllable and the mute sound, and the total time for the experiment was approximately 10 min. All sessions were carried out in a day.
The experimental procedure was designed with e-Prime 2.0 software (Psychology Software Tools, Inc., Sharpsburg, PA, USA). A HydroCel Geodesic Sensor Net with 64 channels and Net Amps 300 amplifiers (Electrical Geodesics, Inc., Eugene, OR, USA) were used to record the EEG signals, using a 1000 Hz sampling rate (Net Station version 4.5.6).
Figure 2: Mean, variance, standard deviation, and skewness were extracted from all blocks and channels. Subsequently, sparse-regression-model-based feature selection was employed to reduce the dimension of the features. All features were used as the input of the trained classifier. Because each trial includes thirty blocks, thirty classifier outputs were acquired; therefore, the label of each trial was determined by selecting the most frequent output of the thirty classifier outputs.
In general, EEG classification suffers from poor generalization performance and overfitting because the number of samples is much smaller than the dimension of the features. Therefore, to obtain enough samples for learning and testing the classifier, we divided each 3 s imagination trial into 30 time segments with a 0.2 s length and 0.1 s overlap. We thus obtained a total of 9000 segments (6 conditions × 50 trials per condition × 30 segments) to learn and test the classifier. We calculated the mean, variance, standard deviation, and skewness from each segment to acquire the feature vector for the classifier. The dimension of the feature vector is 240 (4 types of features × 60 channels). Additionally, to reduce the dimension of the feature vector, we applied a feature selection algorithm based on the sparse regression model. The selected set of features extracted from all segments was employed to learn and test the classifier. Because a trial consists of thirty segments, a trial has thirty outputs of the classifier.
Therefore, the label of the test trial was determined by selecting the most frequent output among the outputs of the thirty segments. The training and testing of the classifier model are conducted using the segments extracted only from training data and testing data, respectively. Finally, to accurately estimate the classification performance, we applied 10-fold cross-validation. The classification accuracies of ELM, extreme learning machine with linear function (ELM-L), extreme learning machine with radial basis function (ELM-R), and SVM-R for all five subjects were compared to select the optimal classifier to discriminate the vowel imagination. The overall signal processing procedures are briefly described in Figure 2.
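The segmentation, feature extraction, and voting steps described above can be sketched as follows. This is a minimal illustration on synthetic random data, not the authors' implementation: the helper names (`segment_trial`, `segment_features`, `majority_vote`) are hypothetical, skewness is assumed to be the biased sample skewness, and ties in the vote are broken toward the smallest label. Note that a plain sliding window of 0.2 s with a 0.1 s step yields 29 full windows over 3 s, so the exact boundary handling that produces thirty segments is not specified here and is left as an assumption.

```python
import numpy as np

def segment_trial(trial, fs=1000, win_s=0.2, hop_s=0.1):
    """Split one trial (channels x samples) into overlapping windows."""
    win = int(round(win_s * fs))
    hop = int(round(hop_s * fs))
    starts = range(0, trial.shape[1] - win + 1, hop)
    # Result shape: (n_segments, n_channels, win)
    return np.stack([trial[:, s:s + win] for s in starts])

def segment_features(segments):
    """Mean, variance, standard deviation, and skewness per channel."""
    mean = segments.mean(axis=2)
    var = segments.var(axis=2)
    std = np.sqrt(var)
    centered = segments - mean[:, :, None]
    # Biased sample skewness: third central moment over std cubed.
    skew = (centered ** 3).mean(axis=2) / np.maximum(std, 1e-12) ** 3
    # Concatenate the four statistics: (n_segments, 4 * n_channels)
    return np.concatenate([mean, var, std, skew], axis=1)

def majority_vote(segment_labels):
    """Most frequent label; ties go to the smallest label value."""
    values, counts = np.unique(segment_labels, return_counts=True)
    return values[np.argmax(counts)]

rng = np.random.default_rng(0)
# Synthetic stand-in for one trial: 60 channels, 3 s at 1000 Hz.
trial = rng.standard_normal((60, 3000))
segments = segment_trial(trial)
features = segment_features(segments)
print(segments.shape, features.shape)

# Hypothetical per-segment classifier outputs for one trial.
print(majority_vote(np.array([1, 0, 1, 1, 0])))
```

Each row of `features` would be one training or test sample for the classifier, and `majority_vote` would be applied to the per-segment predictions of a single test trial.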

Sparse-Regression-Model-Based Feature Selection.
Tibshirani developed a sparse regression model known as the Lasso estimate [19]. In this study, we employed the sparse regression model to select the discriminative set of features to classify the EEG responses to covert articulation. The formula for selecting discriminative features based on the sparse regression model can be described as follows:
$$\mathbf{z}^{*} = \arg\min_{\mathbf{z}} \frac{1}{2}\|\mathbf{t} - \mathbf{F}\mathbf{z}\|_{2}^{2} + \lambda\|\mathbf{z}\|_{1}, \quad (1)$$
where $\|\cdot\|_{p}$ denotes the $l_{p}$-norm, $\mathbf{z}$ is a sparse vector to be learned, and $\mathbf{z}^{*}$ indicates an optimal sparse vector. $\mathbf{t} \in \mathbb{R}^{N \times 1}$ is a vector of the true class labels for the $N$ training samples, and $\lambda$ is a positive regularization parameter that controls the sparsity of $\mathbf{z}$. $\mathbf{F}$ is the matrix that consists of the mean, variance, standard deviation, and skewness for each channel, where $\mathbf{f}_{i} \in \mathbb{R}^{N \times 1}$ is the $i$th column vector of $\mathbf{F}$. The coordinate descent algorithm is adopted to solve the optimization problem in (1) [20]. The column vectors in $\mathbf{F}$ corresponding to the zero entries in $\mathbf{z}^{*}$ are excluded to form an optimized feature set, $\hat{\mathbf{F}}$, that is of lower dimensionality than $\mathbf{F}$.
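As an illustration of this selection step, the sketch below solves a Lasso problem of the form 0.5·‖t − Fz‖² + λ‖z‖₁ by cyclic coordinate descent and keeps the columns with nonzero coefficients. It is a toy example under stated assumptions: the function names are hypothetical, the data are synthetic (only features 0 and 3 carry signal), the scaling of the objective is assumed, and `lam=50.0` is an arbitrary value chosen for this toy scale rather than a value from the study.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold`."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(F, t, lam, n_iter=200):
    """Minimize 0.5 * ||t - F z||_2^2 + lam * ||z||_1 over z."""
    n_features = F.shape[1]
    z = np.zeros(n_features)
    col_sq_norm = (F ** 2).sum(axis=0)
    residual = t - F @ z
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq_norm[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = F[:, j] @ residual + col_sq_norm[j] * z[j]
            z_new = soft_threshold(rho, lam) / col_sq_norm[j]
            # Keep the residual consistent with the updated coefficient.
            residual += F[:, j] * (z[j] - z_new)
            z[j] = z_new
    return z

rng = np.random.default_rng(0)
F = rng.standard_normal((200, 10))              # synthetic feature matrix
t = 3.0 * F[:, 0] - 2.0 * F[:, 3] + 0.1 * rng.standard_normal(200)

z = lasso_coordinate_descent(F, t, lam=50.0)
selected = np.flatnonzero(z)                    # indices of retained features
F_hat = F[:, selected]                          # reduced feature matrix
print(selected, F_hat.shape)
```

Columns whose coefficient is driven exactly to zero by the soft-thresholding step are dropped, which is the mechanism by which the regularization parameter controls the dimension of the retained feature set.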

Extreme Learning Machine.
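Conventional feedforward neural networks require weights and biases for all layers to be adjusted by the gradient-based learning algorithms. However, the procedure for tuning the parameters of all layers is very slow because it is repeated many times, and its solutions easily fall into local optima. For this reason, Huang et al. proposed ELM, which randomly assigns the input weights and analytically calculates only the output weights. Therefore, the learning speed of ELM is much faster than conventional learning algorithms and has outstanding generalization performance [21][22][23]. If we assume $N$ training samples $\{(\mathbf{x}_{j}, \mathbf{l}_{j})\}_{j=1}^{N}$, where $\mathbf{x}_{j} \in \mathbb{R}^{d}$ is an input feature vector and $\mathbf{l}_{j}$ is the true label vector, which consists of $C$ classes, $\mathbf{l}_{j} = [l_{j,1}, l_{j,2}, \ldots, l_{j,C}]^{T}$, a standard single-hidden-layer feedforward network (SLFN) with $N_{h}$ hidden neurons and activation function $g(\cdot)$ can be formulated as follows:
$$\sum_{i=1}^{N_{h}} \mathbf{w}_{i}^{h}\, g(\mathbf{w}_{i} \cdot \mathbf{x}_{j} + b_{i}) = \mathbf{o}_{j}, \quad j = 1, \ldots, N,$$
where $\mathbf{w}_{i} = [w_{i,1}, w_{i,2}, \ldots, w_{i,d}]^{T}$ is the weight vector for the input layer between the $i$th hidden neuron and the input neurons, $\mathbf{w}_{i}^{h} = [w_{i,1}^{h}, w_{i,2}^{h}, \ldots, w_{i,C}^{h}]^{T}$ is the weight vector for the hidden layer between the $i$th hidden neuron and the output neurons, $\mathbf{o}_{j} = [o_{j,1}, o_{j,2}, \ldots, o_{j,C}]^{T}$ is the output vector of the network, and $b_{i}$ is the bias of the $i$th hidden neuron. The operator $\cdot$ indicates the inner product. We can now reformulate the equation into matrix form as follows:
$$\mathbf{A}\mathbf{W}^{h} = \mathbf{O},$$
where
$$\mathbf{A} = \begin{bmatrix} g(\mathbf{w}_{1} \cdot \mathbf{x}_{1} + b_{1}) & \cdots & g(\mathbf{w}_{N_{h}} \cdot \mathbf{x}_{1} + b_{N_{h}}) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_{1} \cdot \mathbf{x}_{N} + b_{1}) & \cdots & g(\mathbf{w}_{N_{h}} \cdot \mathbf{x}_{N} + b_{N_{h}}) \end{bmatrix}, \quad \mathbf{W}^{h} = [\mathbf{w}_{1}^{h}, \ldots, \mathbf{w}_{N_{h}}^{h}]^{T}, \quad \mathbf{O} = [\mathbf{o}_{1}, \ldots, \mathbf{o}_{N}]^{T},$$
where matrix $\mathbf{A}$ is the output matrix of the hidden layer and the operator $T$ indicates the transpose of the matrix. Because the ELM algorithm randomly selects the input weights $\mathbf{w}_{i}$ and biases $b_{i}$, we can find weights for the hidden layer, $\mathbf{W}^{h}$, by solving the following optimization problem:
$$\min_{\mathbf{W}^{h}} \|\mathbf{A}\mathbf{W}^{h} - \mathbf{L}\|^{2},$$
where $\mathbf{L} = [\mathbf{l}_{1}, \ldots, \mathbf{l}_{N}]^{T}$ is the matrix of true labels for training samples. The above problem is known as a linear system optimization problem, and its unique least-squares solution with a minimum norm is as follows:
$$\hat{\mathbf{W}}^{h} = \mathbf{A}^{\dagger}\mathbf{L},$$
where $\mathbf{A}^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $\mathbf{A}$. According to the analysis of Bartlett and Huang, the ELM algorithms achieve not only the minimum square training error but also the best generalization performance on novel test samples [14,24].
In this paper, the activation function $g(\cdot)$ was determined to be a sigmoidal function, and the probability density function for assigning the input weights and biases was set to be a uniform distribution function.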
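The training rule above (random input layer, sigmoid hidden layer, output weights from the Moore-Penrose inverse) can be sketched in a few lines. This is a toy illustration on synthetic two-dimensional data rather than EEG features. The function names, the uniform range [-1, 1] for the random weights, the one-hot label encoding, and the choice of 10 hidden neurons are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(X, labels, n_hidden, n_classes, rng):
    """Fit an ELM: random input layer, least-squares output layer."""
    # Input weights and biases are drawn once and never updated.
    # The range [-1, 1] is an assumption; the text only says "uniform".
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    A = sigmoid(X @ W + b)            # hidden-layer output matrix, N x N_h
    L = np.eye(n_classes)[labels]     # one-hot label matrix, N x C (assumed encoding)
    W_h = np.linalg.pinv(A) @ L       # minimum-norm least-squares solution
    return W, b, W_h

def elm_predict(X, W, b, W_h):
    """Predict the class with the largest network output."""
    return np.argmax(sigmoid(X @ W + b) @ W_h, axis=1)

def make_blobs(rng, n_per_class):
    """Two well-separated synthetic 2-D classes (toy data, not EEG)."""
    X = np.vstack([rng.normal(-2.0, 0.5, size=(n_per_class, 2)),
                   rng.normal(2.0, 0.5, size=(n_per_class, 2))])
    y = np.repeat([0, 1], n_per_class)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_blobs(rng, 100)
X_test, y_test = make_blobs(rng, 100)

W, b, W_h = elm_train(X_train, y_train, n_hidden=10, n_classes=2, rng=rng)
accuracy = np.mean(elm_predict(X_test, W, b, W_h) == y_test)
print(f"toy test accuracy: {accuracy:.2f}")
```

Because only `W_h` is learned, and in closed form, training reduces to one matrix pseudoinverse, which is the source of the speed advantage over iterative gradient-based tuning.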

Time-Frequency Analysis for Imagined Speech EEG Data.
We computed the time-frequency representation (TFR) of imagined speech EEG data for every subject to identify speech-related brain activities. The TFR of each trial was calculated using a Morlet wavelet and averaged over all trials. Among the five subjects, we plotted TFRs of subjects 2 and 5, which showed notable patterns in the gamma frequency band. As shown in Figure 3, much of the gamma band (30-70 Hz) power of the five vowel conditions (/a/, /e/, /i/, /o/, and /u/) in the left temporal area is clearly distinct from and much higher than that of the control condition (mute sound). In addition, a topographical head plot of subject 5 is presented in Figure 4. Increased gamma activities were observed in both temporal regions when the subject imagined vowels.
Figure 5 shows the classification accuracies averaged over all pairwise classifications for five subjects using ELM, ELM-L, ELM-R, SVM-R, and LDA. We also tested SVM and SVM with a linear kernel, but their results are excluded because these classifiers did not converge within 100,000 iterations. All classification accuracies are estimated by 10 × 10-fold cross-validation. In the cases of subjects 1, 3, and 4, ELM-L shows the best classification performance compared to the other four classifiers. However, ELM-R shows the best classification accuracies in subjects 2 and 5. For all subjects, the classification accuracies of ELM, ELM-L, and ELM-R are much better than those of SVM-R, which are approximately at the chance level of 50%.
To identify the best classifier to discriminate the vowel imagination, we conducted paired t-tests between the classification accuracies of ELM-R and those of the other four classifiers. As a result, the classification performance of ELM-R is significantly better than those of ELM (p < 0.01), LDA (p < 0.01), and SVM-R (p < 0.01). However, there is no significant difference between the classification accuracies of ELM-R and ELM-L (p = 0.46).
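For reference, the paired t statistic used in such comparisons can be computed directly from matched accuracy values. The sketch below is illustrative only: the function name is hypothetical, the input numbers are made-up placeholders rather than results from this study, and how the accuracies were paired (for example, per subject or per pairwise task) is not specified in the text. Obtaining the p value additionally requires the t distribution with n − 1 degrees of freedom, for which a library routine such as SciPy's `scipy.stats.ttest_rel` could be used.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    t_value = mean_diff / math.sqrt(var_diff / n)
    return t_value, n - 1

# Placeholder values for two classifiers on three matched conditions.
classifier_a = [3.0, 5.0, 7.0]
classifier_b = [2.0, 3.0, 4.0]
t_value, dof = paired_t_statistic(classifier_a, classifier_b)
print(round(t_value, 4), dof)
```

The test operates on the per-condition differences, so it is sensitive to a consistent advantage of one classifier even when the absolute accuracies vary widely between conditions.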
Table 1 describes the classification accuracies of subject 2, who shows the highest overall accuracies among all subjects, after 10 × 10-fold cross-validation, for all pairwise combinations. In almost all pairwise combinations, ELM-R has better classification performance than the other four classifiers for subject 2. The most discriminative pairwise combination for subject 2 is vowels /a/ and /i/, which shows 100% classification accuracy using ELM-R. Table 2 contains the results of ELM-R for the pairwise combinations and shows the top five classification performances for each subject. No pairwise combination is selected for all subjects; however, /a/ versus mute and /i/ versus mute are selected from four subjects, and /a/ versus /i/ is selected from three subjects. Table 3 indicates the confusion matrix for all pairwise combinations and subjects using ELM, ELM-L, ELM-R, SVM-R, and LDA. In terms of sensitivity and specificity, ELM-L is the best classifier for our EEG data. Although SVM-R shows higher specificity than the other classifiers in this table, SVM-R classified almost all conditions as positive and resulted in poor sensitivity; therefore, the high specificity of the SVM-R is possibly invalid. Thus, SVM-R might be an unsuitable classifier for our study.
Table 1: Classification accuracies in % employing SVM-R, ELM, ELM-L, ELM-R, and LDA for subject 2. The highest classification accuracy among the five classifiers is marked in bold for each pairwise combination. Classification accuracies are expressed as mean and associated standard deviation. SVM-R, ELM, ELM-L, ELM-R, and LDA denote the support vector machine with radial basis function, extreme learning machine, extreme learning machine with a linear kernel, extreme learning machine with a radial basis function, and linear discriminant analysis, respectively.
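Sensitivity and specificity for a single pairwise task follow directly from the confusion counts. The sketch below uses made-up labels purely for illustration; the function name is hypothetical, and which condition of each pair is treated as the positive class is an assumption, since the text does not state it.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive=1):
    """Compute sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that never occurs in the true labels.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Placeholder labels: four positive and four negative trials.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(truth, pred)
print(sens, spec)  # 0.75 0.5
```

Reporting both values matters because a classifier that assigns nearly every trial to one class can score well on one measure while failing on the other.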

Discussion.
Overall, ELM, ELM-L, and ELM-R showed better performance than the SVM-R and LDA algorithms in this study. In several previous studies using EEG data, ELM achieved similar or better classification accuracy rates with much less training time compared to other algorithms [16,25,26,27]. However, we could not find studies on classification of imagined speech using ELM algorithms. Deng et al. reported classification rates using LDA for imagined speech with a highest accuracy of 72.67%, but the average results were not much better than the chance level [8]. DaSalla et al., using SVM, reported a best accuracy of approximately 82% and an overall average of 73% [9]. Huang et al. reported that ELM tends to have a much higher learning speed and comparable generalization performance in binary classification [21]. In another study, Huang argued that ELM has fewer optimization constraints owing to its special separability feature and results in simpler implementation, faster learning, and better generalization performance [23]. Thus, our results are consistent with previous research using ELM and show similar or better classification results for imagined speech compared to other research using different algorithms. Recently, ELM algorithms have been extensively applied in many other medical and biomedical studies [28][29][30][31]. More detailed information about ELM can be found in a recent review [32]. In this study, each trial was divided into thirty time segments of 0.2 s in length with a 0.1 s overlap. Each time segment was considered as a sample for training the classifier, and the final label of the test sample was determined by selecting the most frequent output (see Figure 2). We also compared the classification accuracy of our method with that of a conventional method that does not divide the trials into multiple time segments.
As a result, our method showed superior classification accuracy to the conventional method. In our opinion, dividing the trials may effectively increase the number of samples available for classifier training, and each 0.2 s time segment is likely to retain enough information to discriminate EEG vowel imagination. Generally, EEG classification suffers from poor generalization performance and overfitting because too few samples are available for the classifier. Therefore, increasing the number of samples by dividing trials could mitigate these problems. However, further analyses are required to confirm these assumptions in subsequent studies.
To reduce the dimension of the feature vector, we employed a feature selection algorithm based on the sparse regression model. In the sparse-regression-model-based feature selection algorithm, the regularization parameter, $\lambda$, of equation (1) must be carefully selected because $\lambda$ determines the dimension of the optimized feature set. For example, when the selected $\lambda$ is too large, the algorithm excludes discriminative features from the optimal feature set, $\hat{\mathbf{F}}$. However, when $\lambda$ is set too small, redundant features are not excluded from $\hat{\mathbf{F}}$. Therefore, the optimal value for $\lambda$ was selected by cross-validation on the training session in our study. For example, the change of classification accuracy caused by varying $\lambda$ for subject 1 is illustrated in Figure 6. In the case of /a/ and /i/ using ELM-R, the classification accuracy reached a plateau at $\lambda$ = 0.08 and declined after 0.14. However, the optimal values of $\lambda$ differ widely among the pairwise combinations and subjects. Furthermore, our optimized results were achieved in the gamma frequency band (30-70 Hz). We also tested the other frequency ranges, such as beta (13-30 Hz), alpha (8-13 Hz), and theta (4-8 Hz); however, the classification rates of those bands were not much better than the chance level in every subject and pairwise combination of syllables. In addition, the results of our TFR and topographical analysis (Figures 3 and 4) could support some relationship between gamma activities and imagined speech processing. As far as we know, in the EEG classification of imagined speech, there have been only a few studies that examined the differences between multiple frequency bands including gamma frequency [33,34]. Therefore, our study might be the first report that the gamma frequency band could play an important role as features for the EEG classification of imagined speech.
Moreover, several studies using ECoG reported quite good results in the gamma frequency for imagined speech classification [35,36], and these findings are consistent with our results. In addition, several studies have suggested a role of the gamma frequency band in speech processing from a neurophysiological perspective [37][38][39]. However, those studies usually used intracranial recordings and focused on the high gamma (70-150 Hz) frequency band. Thus, relating those results to our classification study is not straightforward. The relation between information in low gamma frequencies as a classification feature and its neurophysiological implications will be examined in future studies.
Currently, communication systems with various BCI technologies have been developed for disabled people [40]. For instance, the P300 speller is one of the most widely researched BCI technologies to decode verbal thoughts from EEG [41]. Despite many efforts toward better and faster performance, the P300 speller is still insufficient for use in normal conversation [42,43]. Independent of the P300 component, efforts toward extraction and analysis of EEG or ECoG induced by imagined speech have also been made [44,45]. In this context, our results of high performance from the application of ELM and its variants have potential to advance BCI research using silent speech communication. However, the pairwise combinations with the highest accuracies (see Table 2) differed in each subject. After the experiment, each participant reported different patterns of vowel discrimination. For example, one subject reported that he could not discriminate /e/ from /i/, and another subject reported that a different pair was not easy to distinguish. Although those reports did not exactly match the classification results, these differences in subjective sensory perception might be related to the process of imagining speech and to the classification results. In addition, we have not tried multiclass classification in this study, although others have attempted multiclass classification of imagined speech [8,46,47]. These issues related to intersubject variability and multiclass systems should be considered in our future studies to develop more practical and generalized BCI systems using silent speech.

Conclusions
In the present study, we used classification algorithms for EEG data of imagined speech. Particularly, we compared ELM and its variants to SVM-R and LDA algorithms and observed that ELM and its variants showed better performance than other algorithms with our data. These results might lead to the development of silent speech BCI systems.