Multiclass Classification of Imagined Speech Vowels and Words of Electroencephalography Signals Using Deep Learning

This paper focuses on decoding imagined speech from electroencephalography (EEG) neural signals, in line with the expansion of brain-computer interfaces to individuals with speech problems who face communication challenges. Decoding an individual’s imagined speech from nonstationary and nonlinear EEG neural signals is a complex task. Related research in the field of imagined speech has shown that decoding performance and accuracy need further improvement. The evolution of deep learning technology increases the likelihood of decoding imagined speech from EEG signals with enhanced performance. We propose a novel supervised deep learning model that combines temporal convolutional networks and convolutional neural networks to retrieve information from the EEG signals. The experiment was carried out using an open-access dataset of fifteen subjects’ imagined speech multichannel signals of vowels and words. The raw multichannel EEG signals of multiple subjects were processed using the discrete wavelet transformation technique. The model was trained and evaluated using the preprocessed signals, and the model hyperparameters were adjusted to achieve higher accuracy in the classification of imagined speech. The experimental results demonstrated that the proposed model achieved an overall multiclass imagined speech classification accuracy of 0.9649 and a classification error rate of 0.0350. The results of the study indicate that individuals with speech difficulties might be able to leverage a noninvasive EEG-based imagined speech brain-computer interface system as a long-term alternative medium for artificial verbal communication.


Introduction
Speech is an essential communication channel for people to connect with society.
There are difficulties with external speech stimulation due to the medical conditions of some individuals, such as speech delay, autism, brain stroke, old age, Down syndrome, and other neurological diseases. Advanced human-computer interface (HCI) technology based on neural signals, also known as brain-computer (machine) interface (BCI/BMI), attempts to connect individuals with speech difficulties to society by decoding messages from the brain's neural activity through mental imagery rather than depending on natural speech mechanisms [1,2]. Speech imagery, or imagined speech, is defined as the neural representation of speech in the absence of natural speech, which occurs when a person imagines or thinks about syllables or words but does not produce natural sounds [3][4][5]. The availability of noninvasive EEG devices for measuring speech neural activity in the brain and advanced deep learning techniques has contributed to the development of the imagined speech-based BCI, which is expected to be the imminent verbal communication alternative for speech-disordered individuals. The imagined speech EEG-based BCI system decodes or translates the subject's imaginary speech signals from the brain into messages for communication with others or machine recognition instructions for machine control [6]. Decoding imagined speech from brain signals to benefit humanity is one of the most appealing research areas. In the absence of any traceable auditory output that is synced to the imagined speech of the subjects' brain activity, decoding imagined speech is challenging.
We reviewed previous scientific work in the discipline of imagined speech decoding from neural EEG signals. Dasalla et al. [3] classified English vowels using common spatial patterns (CSP) for feature extraction and a support vector machine (SVM) as the machine learning classifier. Wang et al. [7] experimented with two Chinese characters, extracting signal feature information with CSP and classifying them with SVM. Kim et al. [8] demonstrated effective vowel classification by utilizing a linear discriminant analysis (LDA) classifier and feature extraction approaches such as CSP and empirical mode decomposition (EMD). Min et al. [9] used the extreme learning machine (ELM) and an SVM classifier to decode imagined speech. Yoshimura et al. [10] performed two-class decoding of Japanese vowels using sparse logistic regression (SLR). Sun et al. [11] classified ten phonemes using feature extraction techniques such as the restricted Boltzmann machine (RBM) and neural network (NN) based classifiers. Nguyen et al. [12] improved the accuracy of multiclass imagined speech decoding by combining the Riemannian manifold feature extraction approach with relevance vector machine (RVM) classifiers. Saha et al. [13] demonstrated the classification of eleven speech sounds using temporal and spatial CNN with a deep autoencoder (DAE). Cooney et al. [14] investigated imagined speech vowel classification using a deep CNN transfer learning (TL) model and observed that TL produced relatively better accuracy. Panachakel et al. [15] showed comparatively higher accuracy than previous research when decoding the imagined speech of two words using CSP and the discrete wavelet transformation (DWT) for feature extraction and a deep neural network (DNN) classifier model. Cooney et al. [16] evaluated the impact of hyperparameter optimization on CNN-based deep learning models for imagined speech EEG signals. Tamm et al. [17] used CNN and TL classifiers to decode imagined vowels. Pawar et al. [18] demonstrated the covert speech multiclass classification of four distinct words using a kernel-based extreme learning machine (kernel ELM). Li et al. [19] classified the imagined speech of eight words using a hybrid convolution network. Sarmiento et al. [20] used a CNN-based model to classify five English vowels. Panachakel and Ganesan [21] used sliding window data augmentation and TL on the underlying ResNet50 model to classify imagined spoken words and vowels.
Even though multiple studies on this topic have been conducted over the last decade, our examination of the relevant publications found that the accuracy of decoding imagined speech needs more attention and investigation. The primary purpose of this work is to employ advanced deep learning (DL) approaches to decode or classify multiclass imagined speech from multichannel EEG neural signals of multiple subjects with better accuracy.
Mathematical Representation. The hypothesis is defined as a supervised multiclass classifier model. The model's input consists of labelled EEG signals, where each labelled EEG signal is represented as $(E_i, s_i)$ for the record $i$, the true imagined speech is $s_i \in \text{SpeechLabels}$, and the input $E_i \in \mathbb{R}^{n \times m}$ is a preprocessed signal in the form of a two-dimensional array, where $n$ is the number of EEG channels and $m$ is the number of signal sample points for each channel. The model output, the classified or decoded imagined speech, is $\hat{s}_i \in \text{SpeechLabels}$ with $\hat{s}_i = s_i + \text{Error}_i$, where $\text{Error}_i$ is the classification error. The classifier model is mathematically defined as in equation (1).
In this paper, we present a novel supervised deep learning model that integrates temporal convolutional networks (TCN) with CNN for the multiclass imagined speech decoding of EEG signals. The DWT preprocessing method was used for feature extraction and artifact removal. The input network layer and multiple hidden network layers were used to extract and transform the underlying signal features. The output network layer classified the imagined speech vowels and words. To evaluate the classification performance of the model, validation indicators such as recall, precision, f1-score, Cohen's kappa score, and the confusion matrix were used. The contribution of the paper is as follows: the wavelet-based analysis, together with the deep learning model that consolidates TCN and CNN, demonstrated effectiveness in decoding multichannel EEG signals of eleven imagined speech classes into vowels and words.

Materials and Methods
The architecture of the EEG-based imagined speech BCI is depicted in Figure 1. The system architecture consists of modules for acquiring neural signals, signal preprocessing, and the classification algorithm for converting the signals into decoded imagined speech. The DL technique was used to eliminate manual feature extraction, since traditional machine learning algorithms have a challenging learning process that requires features to be extracted manually from the signals.
Multichannel imagined speech neural signals are collected using EEG electrodes placed on the scalp. Signal processing techniques such as resampling, band-pass filtering, notch filtering, artifact removal, and feature extraction are used to preprocess the raw signals. The deep learning model was trained and evaluated for classification performance on the preprocessed signals.

EEG Data.
The data used for the experiment in this study was an open-access dataset of EEG signals from fifteen subjects [22]. The signals were collected from six channels, C3, F3, P3, C4, F4, and P4, positioned according to the international ten-twenty system of scalp electrode locations [23]. The data were collected from multiple subjects, and each subject recorded multiple trials. The signals measured during the subjects' imagined speech covered all five English vowels and six Spanish words. The signal sampling rate was 1024 Hz. Figure 2 illustrates the EEG signal acquisition with speech classes.
Step 1: contained a two-second preparedness phase to notify the subject that the trial was about to begin
Step 2: involved presenting the stimulus word or vowel for two seconds
Step 3: entailed capturing the subject's brain wave EEG for a four-second period of the subject's imagined or spoken utterance of the word or vowel
Step 4: involved the subject relaxing for four seconds and preparing for the next stimulus

EEG Signal Preprocessing.
The raw EEG signals were preprocessed before being used in the deep learning model. The EEG signals were recorded at a sampling rate of 1024 Hz for four seconds. Each signal record was represented as a two-dimensional array of shape $(n, m)$, where $n = 6$ was the number of EEG channels and $m = 4096$ was the number of sampling points of each channel. The signal was resampled from 1024 Hz to 512 Hz, giving a two-dimensional array with $n = 6$ and $m = 2048$. Frequencies below 0.01 Hz and above 100 Hz were filtered out so that all the EEG frequency bands relevant for data interpretation were kept. The signals were notch filtered to reduce the 50 Hz noise created by the electrical environment in which the EEG data were collected.
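The preprocessing chain described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes simple FFT masking for the band-pass and notch steps (the paper does not name its filter design), a hypothetical notch half-width of 1 Hz, and a synthetic six-channel record. It also filters before halving the sampling rate, so that keeping every second sample cannot alias.

```python
import numpy as np

FS_RAW = 1024     # recorded sampling rate in Hz (from the text)
FS_NEW = 512      # target sampling rate in Hz (from the text)
DURATION_S = 4    # length of one record in seconds (from the text)

def fft_bandpass_notch(x, fs, low=0.01, high=100.0, notch=50.0, notch_width=1.0):
    """Zero FFT bins outside [low, high] Hz and within notch_width Hz of notch.

    FFT masking is an assumption made for this sketch; notch_width is hypothetical.
    """
    spectrum = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    keep = (freqs >= low) & (freqs <= high)
    keep &= np.abs(freqs - notch) > notch_width
    return np.fft.irfft(spectrum * keep, n=x.shape[-1], axis=-1)

def preprocess(record):
    """record: array of shape (channels, samples) sampled at FS_RAW."""
    filtered = fft_bandpass_notch(record, FS_RAW)
    # Everything above 100 Hz is already removed, so keeping every second
    # sample halves the rate to 512 Hz without aliasing.
    step = FS_RAW // FS_NEW
    return filtered[:, ::step]

# Synthetic record: a 10 Hz rhythm plus 50 Hz mains interference plus noise.
rng = np.random.default_rng(0)
t = np.arange(FS_RAW * DURATION_S) / FS_RAW
base = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
record = base[None, :] + 0.1 * rng.standard_normal((6, t.size))

clean = preprocess(record)
print(record.shape, "->", clean.shape)
```

The shapes follow the arithmetic in the text: 1024 Hz for 4 s gives 4096 points per channel, and halving the rate gives 2048.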

Independent Component Analysis (ICA).
The recorded EEG signals contained artifacts generated by biological sources other than the brain, such as eye movements, heart rhythms, and muscle activity. The ICA blind source separation approach separates statistically independent signal components and can automatically eliminate artifacts from EEG signals [24]. The signals were processed with the FastICA algorithm to remove artifacts [25].

Discrete Wavelet Transformation (DWT).
The EEG signal is nonstationary and nonlinear in composition [26,27]. Wavelet time-frequency analysis gives strong results on nonstationary input signals [28]. The DWT was used to denoise the signal and extract features such as temporal information and local spectral information from the signals. A wavelet is defined as a time-restricted wavelike oscillation. The orthogonal property of the mother wavelet db10 allows for smooth signal reconstruction. The mother wavelet db10 is notably effective for wavelet-based EEG feature extraction in several applications [29][30][31].

Figure 2: The diagram shows the procedure of acquiring an EEG signal.

Advances in Human-Computer Interaction

Although different mother wavelets were tried in the study to increase classification performance, wavelet db10 ultimately offered the highest classification performance. With the mother wavelet db10, the raw EEG signal is decomposed into 4 levels. The frequencies of the resulting subbands are in the ranges of 0-12.5 Hz, 12.5-25 Hz, 25-50 Hz, and 50-100 Hz. The signal's energy is almost entirely contained in these decomposed subbands. Because the mother wavelet closely resembles the signal, higher coefficients corresponding to eye movements and lower coefficients relating to noise are generated. We also noticed that decomposing beyond 4 levels gave no improvement in classification performance. As a result, the signals were preprocessed using the Daubechies mother wavelet (db10) with level 4. The DWT computes the wavelet coefficients of the signal at each scale and position on a discrete grid. The breakdown of the signal into several time series of wavelet coefficients depicts the signal's temporal evolution in the associated frequency band. To denoise the signals, a threshold was applied to the decomposed wavelet coefficients to produce estimated wavelet coefficients. The threshold value is computed using the universal threshold formula defined in equation (2). The inverse wavelet transform reconstructs the signals from the estimated wavelet coefficient values.
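The decompose, threshold, and reconstruct cycle described above can be illustrated with a short sketch. Several things here are assumptions rather than the authors' method: the Haar wavelet stands in for db10 so that the example needs only NumPy, the universal threshold is taken in its common form sigma * sqrt(2 ln N) with sigma estimated from the median absolute finest-level detail coefficient, and soft thresholding is used because the text does not say whether thresholding was hard or soft.

```python
import numpy as np

def haar_dwt(x):
    """One Haar analysis level: returns (approximation, detail)."""
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    out = np.empty(approx.size * 2)
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

def universal_threshold(finest_detail, n_samples):
    """Assumed form: sigma * sqrt(2 ln N), sigma from the median absolute value."""
    sigma = np.median(np.abs(finest_detail)) / 0.6745
    return sigma * np.sqrt(2 * np.log(n_samples))

def soft_threshold(coeffs, thr):
    """Shrink coefficients toward zero by thr; smaller ones become zero."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

def denoise(x, levels=4):
    """Decompose to `levels`, threshold the detail bands, and reconstruct.

    The signal length must be divisible by 2**levels.
    """
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    thr = universal_threshold(details[0], len(x))
    details = [soft_threshold(d, thr) for d in details]
    for detail in reversed(details):
        approx = haar_idwt(approx, detail)
    return approx

# Synthetic single-channel example: a slow 2 Hz wave buried in noise.
rng = np.random.default_rng(0)
t = np.arange(2048) / 512.0
clean_signal = np.sin(2 * np.pi * 2.0 * t)
noisy = clean_signal + 0.3 * rng.standard_normal(t.size)
denoised = denoise(noisy, levels=4)
print(np.mean((noisy - clean_signal) ** 2), np.mean((denoised - clean_signal) ** 2))
```

With a wavelet library, the same structure would apply with db10 in place of the Haar filters; only the analysis and synthesis steps change.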

Proposed Deep Learning Model.
The model architecture was designed to learn feature representations from the multichannel (multidimensional) EEG signals of the subjects' imagined speech. Figure 3 details the architecture of the proposed model. The TCN stream was responsible for learning temporal features, whereas the CNN stream was responsible for learning spatial features from the preprocessed EEG signals. As a result, the model was able to learn both the temporal and spatial characteristics of the signals. The concatenation of the learned information streams was fed into the fully connected layers for feature transformation and multiclass imagined speech classification.
In Figure 3, the input layer, TCN layer, CNN layer, and output layer are presented. The TCN residual block is illustrated with a dilated causal convolution operation with a dilation factor and a kernel of size three. The classification layer and the decoded speech label are also displayed.

TCN Branch.
The TCN consists of a stack of residual blocks of one-dimensional dilated causal convolutions, in which the input and output sequences of each network layer always have the same length [32]. The TCN was used to extract the temporally recurrent, unique electrical signatures of each speech class from the imagined speech signal. The residual block consists of a one-dimensional dilated causal convolution, batch normalization, a nonlinear activation function, one-dimensional spatial dropout, and a residual connection. The one-dimensional dilated causal convolution combines two ideas: causal convolution and one-dimensional dilated convolution.
Causal Convolution. Causal convolutions are a type of convolution used for temporal signals that preserves the order of the information. For a given channel with input signal sampling points $e_0, e_1, \ldots, e_{m-1}$ and outputs $o_0, o_1, \ldots, o_{m-1}$, the causal convolution output $o_t$ at time step $t$ is computed only from the previously observed sampling points $e_0, e_1, \ldots, e_t$, not from the future sampling points $e_{t+1}, \ldots, e_{m-1}$ [33]. The convolution operation can be represented as a prediction function $\Pr(o_t \mid e_0, e_1, \ldots, e_t)$ for time steps $t = 0, 1, \ldots, m-1$.

One-Dimensional Dilated Convolution.
This is a convolutional variation in which the kernel is expanded by inserting gaps between the kernel elements. A one-dimensional (1D) convolution creates a 1D output signal by applying a 1D filter to a 1D input signal. In the dilated convolution, a convolution filter of size $k$ is applied to the sampling points of an input signal whose length is equal to or greater than $k$, skipping sampling points according to the dilation step, so that the receptive field expands as the depth increases [34]. As the depth of the layer increases, the dilation step $d$ is doubled, resulting in $d = 2^{h-1}$ at layer $h = 1, 2, 3, \ldots$ of the block. Figure 4 shows examples of one-dimensional dilated convolution. A dilated convolution with a dilation step of one is the conventional convolution.
In this case, the input signal is two-dimensional (number of channels by number of sampling points). The one-dimensional kernel has a size of 3, with 1 filter and 2 dilation steps.
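To make the growth of the receptive field concrete, the short sketch below evaluates the dilation schedule $d = 2^{h-1}$ from the text. It assumes one dilated convolution per layer and a kernel of size three; the helper names are hypothetical, and the number of layers in the authors' model is not stated here.

```python
def dilation_steps(num_layers):
    """Dilation d = 2**(h - 1) for layers h = 1..num_layers, as in the text."""
    return [2 ** (h - 1) for h in range(1, num_layers + 1)]

def receptive_field(kernel_size, dilations):
    """Input samples seen by one output sample of a stack of dilated convolutions.

    Assumes one convolution per layer; each layer adds (kernel_size - 1) * d.
    """
    return 1 + (kernel_size - 1) * sum(dilations)

for layers in (1, 2, 4, 8):
    steps = dilation_steps(layers)
    print(layers, steps, receptive_field(3, steps))
```

Under these assumptions, four layers with dilations 1, 2, 4, and 8 cover 31 input samples, while a single undilated layer covers only 3.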
For the input signal sampling point sequence $e \in E$, the dilation step $d$, and the filter size $k$, the one-dimensional dilated causal convolution function dilcausalconv1d at the sampling time $t$ is defined as in equation (3), where $t - d \cdot j$ is the index of the past sampling time and filter($j$) is the convolutional filter function:

$$\text{dilcausalconv1d}(t) = \sum_{j=0}^{k-1} \text{filter}(j) \cdot e_{t - d \cdot j} \quad (3)$$

Residual Connection. The residual block structure comprises multiple layers for an increased receptive field. To allow multiple layers in the model while avoiding exploding or vanishing gradients, a residual connection adds the block input to the output of the residual transformation function, which applies the one-dimensional convolution to the input, allowing information to flow across the layers. The residual block output for the input $e$ is defined by equation (4).

Because the EEG data were collected from multiple subjects with independent medical conditions, the distribution of the signal features varies across the dataset. The model experienced an internal covariate shift from layer to layer and was unable to learn the features. Batch normalization was added before the activation layer to prevent the internal covariate shift in the model [35]. To avoid overfitting in the residual block, regularization in the form of one-dimensional spatial dropout was used. One-dimensional spatial dropout removes, with a given probability, the complete feature maps learned by the convolution filter channels rather than just a few neurons from each channel [36]. The gated activation unit was used as the nonlinear activation function in the residual block [37]; it is defined in equation (5), where $z_i$ is the input neuron and $\odot$ is the elementwise product.
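The dilated causal convolution of equation (3) can be sketched for a single channel as follows. This is an illustration under stated assumptions: samples before the start of the signal are treated as zero (left zero padding), which keeps the output the same length as the input, and the filter weights are arbitrary example values rather than learned ones.

```python
import numpy as np

def dilated_causal_conv1d(e, filt, d):
    """out[t] = sum_j filt[j] * e[t - d*j], with e treated as zero before t = 0."""
    e = np.asarray(e, dtype=float)
    out = np.zeros_like(e)
    for j, weight in enumerate(filt):
        shift = d * j
        if shift < e.size:
            # Output positions t >= shift receive the input from t - shift.
            out[shift:] += weight * e[: e.size - shift]
    return out

e = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
filt = [1.0, 1.0, 1.0]           # example kernel of size 3 (not learned weights)
out = dilated_causal_conv1d(e, filt, d=2)
print(out)

# Causality check: changing the last input sample leaves earlier outputs untouched.
e_changed = e.copy()
e_changed[-1] = 100.0
out_changed = dilated_causal_conv1d(e_changed, filt, d=2)
print(np.allclose(out_changed[:-1], out[:-1]))
```

Each output depends only on the current sample and the samples 2 and 4 steps earlier, which is the property that lets the TCN preserve temporal order.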

CNN Branch.
The CNN stream consists of a series of convolution blocks. Every convolution block has a sequence of components: one-dimensional convolution, activation unit, batch normalization, regularization dropout, and one-dimensional max-pooling. The rectified linear unit (ReLU) is used as the nonlinear activation function. ReLU is piecewise linear and is defined as $\text{ReLU}(z_i) = \max(0, z_i)$, where $z_i$ is the neuron input [38]. The regularization dropout technique was used to improve model generalization and avoid overfitting by randomly dropping units during model training [39]. The signals were downsampled using one-dimensional max-pooling, which takes the maximum value over a sliding window of the pool size, retaining only the high-resolution feature information.
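Two of the components above, ReLU and one-dimensional max-pooling, are simple enough to show directly. The sketch assumes non-overlapping pooling windows (stride equal to the pool size), which is a common default but is not confirmed by the text, and it uses made-up input values.

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def max_pool1d(x, pool_size):
    """Non-overlapping max-pooling along the last axis.

    Trailing samples that do not fill a whole window are dropped.
    """
    n = (x.shape[-1] // pool_size) * pool_size
    windows = x[..., :n].reshape(*x.shape[:-1], -1, pool_size)
    return windows.max(axis=-1)

# One channel with six samples (illustrative values only).
x = np.array([[-1.0, 2.0, 0.5, -3.0, 4.0, 1.0]])
activated = relu(x)
pooled = max_pool1d(activated, pool_size=2)
print(activated)
print(pooled)
```

Pooling with a window of 2 halves the number of sampling points per channel while keeping the largest activation in each window.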

Transformation and Classification Layer.
The model performed the feature transformation and multiclass classification on the CNN and TCN stream outputs. The CNN stream output and the TCN stream output were each flattened into a one-dimensional feature learning vector. The one-dimensional feature learning vectors of TCN and CNN were then concatenated. The exponential linear unit (ELU) activation function defined in equation (6) was applied to the combined feature vector [40]. The combined feature vector was passed to the dense (fully connected) layers for feature information transformation or extraction, and the extracted information was used in the multiclass speech classification.

$$\text{ELU}(z_i) = \begin{cases} z_i, & z_i \geq 0 \\ c\,(e^{z_i} - 1), & z_i < 0 \end{cases} \quad (6)$$

where the hyperparameter $c > 0$ controls the value to which the ELU saturates for negative neuron inputs $z_i < 0$. The softmax activation function was applied in the model output layer to achieve multiclass speech classification. The softmax formula, which calculates the probability value of each speech class, is defined as in equation (7). The model classification result was the decoded speech class with the highest softmax value.

$$\text{softmax}(s_i^j) = \frac{e^{s_i^j}}{\sum_{k=1}^{l} e^{s_i^k}} \quad (7)$$

where $s_i^j \in \{s_i^1, s_i^2, \ldots, s_i^l\}$ is the input vector element for the expected speech label of record $i$, and $l$ represents the number of imagined speech classes. The softmax values of all elements of the input vector sum to 1: $\sum_{j=1}^{l} \text{softmax}(s_i^j) = 1$.
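The ELU and softmax activations referenced in equations (6) and (7) can be sketched as below, using their standard definitions. The value c = 1.0, the random scores, and the label names are assumptions for illustration: the text names five vowels and six Spanish words but does not list the words, so placeholders are used.

```python
import numpy as np

def elu(z, c=1.0):
    """ELU: z for z >= 0, c * (exp(z) - 1) for z < 0 (c = 1.0 is an assumed value)."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(z >= 0, z, c * (np.exp(np.minimum(z, 0.0)) - 1.0))

def softmax(scores):
    """Softmax over one score vector; subtracting the max improves stability."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Placeholder labels: five vowels plus six unnamed words (hypothetical names).
labels = ["a", "e", "i", "o", "u"] + [f"word_{n}" for n in range(1, 7)]

rng = np.random.default_rng(0)
scores = rng.standard_normal(len(labels))   # stand-in for output-layer scores
probs = softmax(scores)
decoded = labels[int(probs.argmax())]
print(decoded, float(probs.sum()))
print(elu([-1000.0, 0.0, 2.0]))
```

The decoded class is the one with the highest softmax value, and the eleven probabilities sum to 1, as stated in the text.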

Model Training and Experiments.
The proposed supervised multiclass deep learning model was developed, trained, and validated using the scikit-learn, Keras, and TensorFlow frameworks [41][42][43]. The Adam optimization algorithm was used in the model training [44]. The categorical cross-entropy cost function, which computes the loss between the true speech labels and the model's decoded imagined speech outputs, is defined in equation (8) and was used in model training. Once the model attained a certain level of performance, the model architecture parameters were frozen and training continued with hyperparameter optimization. Table 1 shows the detailed summary of model architectural parameters.
In equation (8), $s_i = [s_i^1, s_i^2, s_i^3, \ldots, s_i^l]^{\top}$ is the categorical representation of the actual speech label of record $i$, $\hat{s}_i = [\hat{s}_i^1, \hat{s}_i^2, \hat{s}_i^3, \ldots, \hat{s}_i^l]^{\top}$ is the decoded categorical representation (softmax) of imagined speech for record $i$, $l$ represents the number of imagined speech classes, and $T$ stands for the total number of records.
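A small numerical sketch of the categorical cross-entropy loss may help. It assumes the loss is averaged over the records, which is a common convention but is not confirmed by the text, and it uses a made-up three-class example rather than the eleven-class data.

```python
import numpy as np

def categorical_cross_entropy(true_one_hot, predicted, eps=1e-12):
    """Mean over records of -sum_j s_i^j * log(s_hat_i^j).

    Averaging over records is an assumption of this sketch.
    """
    true_one_hot = np.asarray(true_one_hot, dtype=float)
    # Clip to avoid log(0) when a predicted probability is exactly zero.
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1.0)
    per_record = -(true_one_hot * np.log(predicted)).sum(axis=1)
    return per_record.mean()

# Two records, three classes (illustrative values only).
true_labels = [[1, 0, 0],
               [0, 1, 0]]
softmax_outputs = [[0.90, 0.05, 0.05],
                   [0.20, 0.70, 0.10]]
loss = categorical_cross_entropy(true_labels, softmax_outputs)
print(loss)
```

Only the probability assigned to the true class contributes to each record's loss, so the loss falls toward zero as the softmax output approaches the one-hot label.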
All the subjects' EEG signal data were aggregated once the signal preprocessing was completed. In the preprocessed signals, the distribution of speech classes was almost balanced. The fivefold cross-validation technique was applied for the proposed model training and evaluation.
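The fivefold procedure can be pictured as an index split over the aggregated records. The sketch below is a plain shuffled split with a hypothetical helper name and an arbitrary record count; whether the authors stratified by class or grouped by subject is not stated, so neither is assumed here.

```python
import numpy as np

def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_records)
    folds = np.array_split(order, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 22 records is an arbitrary count chosen for the illustration.
splits = list(k_fold_indices(22, k=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold, len(train_idx), len(test_idx))
```

Every record is used for evaluation exactly once, and the reported accuracy is then the mean over the five folds.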

Classification Results of Different Models.
We carried out several experiments that varied the signal processing (DWT; ICA) and the DL model architecture. The ICA (FastICA algorithm) blind source separation approach separates statistically independent signal components and can automatically eliminate artifacts from EEG signals. Table 2 displays the different experiments, their mean cross-validation classification accuracy, and comparisons. According to the outcomes, combining wavelet-based signal processing with the DL model that unites TCN and CNN improved the performance.
In experiment 5, both ICA and DWT were used in the preprocessing of the EEG signals, so most artifacts were eliminated. In experiment 6, only DWT was used in the preprocessing of the EEG signals; the DWT removed some artifacts by thresholding the decomposed wavelet coefficients, but the resultant signals still contained some artifacts [45]. Because the signals in experiment 6 retain some artifacts, they add variation that increases the model's generalization capability, and the proposed model performs better than in experiment 5, which combined ICA and DWT signal preprocessing.

Outcomes of Proposed Methods.
The proposed model (DWT; TCN + CNN) achieved an overall accuracy of 0.9649 for multiclass decoding of imagined speech. The overall multiclass decoding error rate was 0.0350. The model's precision, which measures the ability to decode each speech label correctly, was in the range of 0.92-0.99. The model's recall, which measures the ability to decode all relevant speech labels, was in the range of 0.95-0.99. The f1-score is the harmonic mean of precision and recall, and the model f1-score ranged from 0.94 to 0.98. Table 3 summarizes the precision, recall, and f1-score of each imagined speech vowel and word. The model's confusion matrix shows how often the model correctly decodes each imagined speech class and how often it confuses one class with another. Figure 5 depicts the confusion matrix of the entire experiment.
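The relationship between the confusion matrix and the reported metrics can be sketched as follows. The matrix is a made-up three-class example, not the paper's data, and the convention that rows are true labels and columns are predicted labels is an assumption of this sketch.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and f1 per class plus overall accuracy.

    Assumes rows are true labels and columns are predicted labels.
    """
    cm = np.asarray(cm, dtype=float)
    true_positive = np.diag(cm)
    precision = true_positive / cm.sum(axis=0)   # correct / all predicted as class
    recall = true_positive / cm.sum(axis=1)      # correct / all truly in class
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = true_positive.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Illustrative 3-class confusion matrix (not taken from the paper).
cm = [[48, 1, 1],
      [2, 47, 1],
      [0, 3, 47]]
precision, recall, f1, accuracy = per_class_metrics(cm)
print(np.round(precision, 4), np.round(recall, 4), np.round(f1, 4))
print(round(accuracy, 4), round(1 - accuracy, 4))
```

The error rate is one minus the overall accuracy, and a matrix dominated by its diagonal yields precision and recall close to 1 for every class.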
Cohen's kappa statistic [46] was used to measure the degree of agreement between model predictions and true labels beyond chance, in order to examine the randomness of the proposed model, and is defined in equation (9). Cohen's kappa value of the proposed model was 0.9614.

$$\kappa = \frac{p_o - p_e}{1 - p_e} \quad (9)$$

where $p_o$ represents the actual overall accuracy of the model and $p_e$ represents the expected accuracy of random prediction.

Figure 6 depicts the overall classification accuracy of imagined speech for all fifteen subjects. Subjects S05, S07, S13, and S15 had the highest accuracy, while subjects S03 and S14 had slightly lower accuracy than the other subjects. One possible reason is the individual differences in multidimensional neural EEG signals [47], which limit the model's generalizability. The confusion matrix in Figure 5 exhibits high values along the main diagonal, from top left to bottom right, and low values off the diagonal, which indicates that the proposed model decodes all imagined speech vowels and words uniformly. The precision, recall, and f1-score reports of the proposed model revealed that the model is effective for multiclass decoding of imagined speech. The Cohen's kappa result shows that the model has a strong level of agreement on the decoding of multiclass imagined speech, and that the model adequately learns the underlying features from the input signals.
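As a worked check of the kappa value, the sketch below applies the standard Cohen's kappa formula to the reported overall accuracy. It assumes the chance agreement is 1/11, which holds only if the eleven classes are perfectly balanced in both the true and the predicted labels; the text says the classes were almost balanced, so this is an approximation.

```python
def cohens_kappa(p_o, p_e):
    """Cohen's kappa from observed agreement p_o and chance agreement p_e."""
    return (p_o - p_e) / (1.0 - p_e)

p_o = 0.9649        # overall accuracy reported in the text
p_e = 1.0 / 11      # assumed chance agreement for eleven balanced classes
kappa = cohens_kappa(p_o, p_e)
print(round(kappa, 4))  # 0.9614
```

Under that assumption, the result agrees with the reported kappa of 0.9614, which suggests the two figures are consistent with each other.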

Comparison of State of the Art with Proposed Model.
The result of the proposed model was compared with the current state of the art for classifying or decoding imagined speech. In terms of multiclass decoding or classification accuracy, the proposed model performs slightly better than the current state of the art in EEG-based imagined speech decoding. Table 4 shows the comparison findings.
The proposed model was also compared with the existing literature on CNN-based models, and it was observed to have a positive effect on performance accuracy. Direct comparison is difficult, however, because EEG signals differ with variables in the collecting environment, such as different participants, instruments, and tasks.

Wavelet, Temporal-Spatial Information, and Convolution Network Significance.
The proposed integrated TCN-CNN model takes time-frequency resolved signals as input, allowing the network to learn more about the temporal and spatial properties of the signals through both convolution paths and thereby improving classification accuracy. Because EEG signals are nonstationary, finding the right mother wavelet for the time-frequency analysis is instrumental in strengthening the decoding model's performance. Wavelet db10 can represent the signal information with a good time-frequency resolution and was therefore employed in the signal processing.
The EEG signal contains high temporal and low spatial information, and both types of information are important for improving the performance of decoding and analysing EEG signals. Additionally, we observed that the model's learning potential is reduced and its accuracy falls when only TCN filters are used to extract the signal's temporal information. Earlier research [19] on imagined speech recognition from EEG signals indicated that both of these features of the input EEG signals could be utilized in the decoding and that a convolution-based network could be effective for information extraction. Earlier research also illustrates the potential of using both of these features with convolution-based deep learning models in other types of BCI activities, such as motor imagery. In the study [48], one-dimensional convolutions were employed to extract both features from the EEG signal, and the accuracy was significantly improved. In another study [49], temporal one-dimensional convolution was used first, followed by spatial one-dimensional convolution, to extract both features, allowing discriminative feature learning to classify the signals. Another study [50] found that both features can help with recognition in a BCI system based on steady-state visual evoked potentials, where temporal filters were combined with spatial filters to improve event detection accuracy in noisy signals. In the proposed method, the mother wavelet db10 is used for the discrete time-frequency analysis, and the deep learning framework simultaneously learns temporal features with one-dimensional dilated causal convolution and spatial features with one-dimensional convolution; consolidating these features for classification improves accuracy.

Limitations.
Although the proposed method enhances multiclass classification accuracy, it has limitations because the model is trained with an offline supervised learning method. To generalize the approach to larger imagined speech decoding tasks, the proposed method requires a large corpus of labelled signals for model learning. However, gathering such a corpus of neural signals is extremely challenging. A reinforcement learning technique, which resembles a biological agent's learning process, is the most likely answer: an artificial model learns by interacting with the environment through feedback-based processes. The artificial model decodes the neural imagined speech signals and interacts with the environment to determine the correctness of the decoding. The model learns from its environment, which provides feedback in the form of rewards for both accurate and erroneous decoding.

Conclusions
In this study, we demonstrated that a deep learning integrated model built with CNN and TCN, which extracts the spatial and temporal activity of EEG data, is an effective way to perform multiclass decoding of imagined speech. Using an open-access EEG dataset, we showed an improvement in model accuracy in the decoding of imagined speech vowels and words. We preprocessed the raw EEG signals with filtering, extracted features, and removed artifacts. The proposed deep learning model was trained and validated to decode the imagined speech using the preprocessed signals. The model achieved an overall accuracy of 96.49% for multiclass decoding of imagined speech. Although the proposed model attained higher overall accuracy in the EEG-based decoding of imagined speech, the overall decoding error rate is still somewhat high. Therefore, the decoding error rate needs to be further improved.

Figure 6: The histogram depicts the proposed model evaluation accuracy for the imagined speech classification of EEG signals across all subjects (axes: subject, S01 to S15; accuracy).

Table 4 entry: Proposed method, TCN + CNN, 96.49. * In the absence of average classification accuracy in the paper, the overall accuracy was estimated as the mean of all subjects' accuracy.

Future Direction for Optimization.
The method used in the experiment was tailored to the specific imagined speech tasks and the specific set of subjects' neural EEG signals. This learning can be generalized to other imagined speech tasks and to neural EEG signals from other sets of subjects. Such further research could take the form of convolutional neural network-based cross-task learning of imagined speech decoding.
Data Availability
The dataset used in this study is open access and available at https://doi.org/10.1117/12.2255697 from the earlier study, and it is cited at the relevant places within the text as a reference.