Fusion of Deep Features from 2D-DOST of fNIRS Signals for Subject-Independent Classification of Motor Execution Tasks

Functional near-infrared spectroscopy (fNIRS) is a low-cost and noninvasive method to measure the hemodynamic responses of cortical brain activities and has received great attention in brain-computer interface (BCI) applications. In this paper, we present a method based on deep learning and the time-frequency map (TFM) of fNIRS signals to classify three motor execution tasks: right-hand tapping, left-hand tapping, and foot tapping. To simultaneously obtain the TFM and consider the correlation among channels, we propose to utilize the two-dimensional discrete orthonormal Stockwell transform (2D-DOST). The TFMs for oxygenated hemoglobin (HbO), reduced hemoglobin (HbR), and two linear combinations of them are obtained, and we then propose three fusion schemes for combining their deep information extracted by the convolutional neural network (CNN). Two CNNs, LeNet and MobileNet, are considered, and their structures are modified to maximize the accuracy. Due to the lack of enough signals for training CNNs, data augmentation based on the Wasserstein generative adversarial network (WGAN) is performed. Several simulations are performed to assess the performance of the proposed method in three-class and binary scenarios. The results demonstrate the efficiency of the proposed method in different scenarios. Also, the proposed method outperforms the recently introduced methods.


Introduction
1.1. Motivation. The human brain is the most complex organ in the body, consisting of billions of neurons and possessing unique computing capabilities such as parallel processing and learning. Therefore, researchers have long been interested in analyzing it. Different areas such as neuroscience, artificial intelligence, cognitive science, and the brain-computer interface (BCI) have been explored to understand the brain better [1]. BCI is a tool that translates thoughts and provides an interface for communicating with the outside world. Recent advances in BCI have led to a better understanding of neural functions and connections in the brain. BCI is an extensive field of study and requires knowledge of computer engineering, neuroscience, psychology, signal processing, and clinical rehabilitation [2].
Functional near-infrared spectroscopy (fNIRS) is a noninvasive imaging technique that measures changes in blood oxygenation levels in the brain. It uses near-infrared light to penetrate the scalp and skull, allowing for the detection of hemodynamic responses associated with brain activity. This noninvasiveness makes fNIRS a safe and comfortable option for users, as it does not require any surgical procedures or direct contact with the brain. fNIRS has good spatial and temporal resolution. It can provide information about the location and timing of brain activity, allowing for the identification of specific brain regions involved in cognitive processes. This spatial and temporal resolution is crucial for BCI applications, as it enables the accurate decoding and interpretation of brain signals for controlling external devices or communicating with the environment. fNIRS can be implemented in various settings, including home environments, clinics, or even during real-world tasks. This flexibility makes fNIRS a practical choice for BCI applications, as it allows for more natural and ecologically valid experiments and interactions. Also, fNIRS is less susceptible to motion artifacts compared to other imaging techniques. It can tolerate small head movements and is less affected by electrical interference or muscle artifacts. This robustness to motion artifacts makes fNIRS suitable for real-time applications, where users may engage in natural movements or activities while using the BCI system [3]. These characteristics make fNIRS a promising tool for developing practical and user-friendly BCI systems.

Related Works.
In general, BCI systems include signal acquisition, signal processing, and output units. The recorded signals are low-power with a poor signal-to-noise ratio (SNR), nonstationary, nonlinear, and time-varying. Therefore, to improve the real-time processing of these systems, feature extraction methods should reflect the time-frequency characteristics and spatial features. Time-frequency analysis is widely used in BCI research. Common methods include the short-time Fourier transform (STFT), wavelet transform, and Hilbert-Huang transform (HHT). Their results can be expressed as power spectral density (PSD), and they are the most effective in processing nonstationary and nonlinear signals.
Some works considered traditional feature extraction schemes based on statistical methods. In [4], the difference between the two mental tasks of computation and rest state was analyzed based on fNIRS signals. The authors extracted six features from each channel of the fNIRS signal. The results showed that multilayer perceptron (MLP) performs better than support vector machine (SVM) and k-nearest neighbor (kNN). In [5], the fNIRS signals with 22 channels were collected during three mental tasks: number subtraction, word generation, and rest. The MLP model based on superficial features determined the task. Subsequently, the authors controlled the robot remotely via fNIRS signals.
In [6], the combination of three-channel fNIRS and 123-channel electroencephalography (EEG) signals was used to classify the left/right brain excitatory signals. Sixteen features were extracted from fNIRS signals, and an MLP with four hidden layers was used for classification. In [7], the concentration changes of oxygenated hemoglobin (HbO) and reduced hemoglobin (HbR) were measured while volunteers repeated each of three types of overt movements: left- and right-hand unilateral complex finger-tapping and foot-tapping. The 20-channel fNIRS signals from 30 volunteers were classified by SVM. In [8], the authors aimed to distinguish four brain activities, namely mental arithmetic (MA), motor imagery of the left hand and right hand, and rest, from fNIRS signals. After preprocessing, six different statistical features are obtained in the time domain and 13 Mel-frequency cepstral coefficient (MFCC) features are obtained in the frequency domain, and then classification is performed by SVM and kNN. The least absolute shrinkage and selection operator (LASSO) homotopy-based sparse representation was employed in [9] for channel selection. Classification relied on statistical spatial features of the blood oxygenation concentration from fNIRS in walking and rest tasks. In the presence of complicated and nonstationary signals, the mentioned methods based on statistical features cannot achieve sufficient accuracy.
Time-frequency analysis of fNIRS signals was considered in several works. In [10], the frontal hemodynamic responses were recorded considering 19-channel fNIRS signals from nine patients during mental tasks. The authors used continuous wavelet transform for multiscale decomposition and a soft-threshold algorithm for denoising. They considered the MLP, linear discriminant analysis (LDA), and SVM and compared their performances. The multilevel mental workload classification was performed in [11] by using bivariate functional brain connectivity features in three time-frequency bands. They utilized the public hybrid dataset consisting of EEG-fNIRS to evaluate their proposed method. The mentioned approaches extract the nondeep features from time-frequency components and, as a result, fail to perform the correct classification in complex scenarios [12].
Methods based on neural networks and deep learning were also introduced for utilizing fNIRS signals in BCI applications. In [13], multistage fusion was performed to classify left- or right-hand motor-imagery tasks considering the EEG and fNIRS signals. The results showed that the Y-shaped neural network with early-stage feature fusion has the best performance compared to the others. In [14], participants were asked to do left- and right-hand motor imagery experiments, and the corresponding fNIRS signals were recorded. The classification is based on a convolutional neural network (CNN). A deep belief network (DBN) based on a restricted Boltzmann machine (RBM) was used in [15] to classify fNIRS signals of flexion and extension imagery involving the left and right arms. The features of HbO concentration were used to train two RBMs. In [16], the authors attempted to classify gender through four-channel fNIRS signals. The authors used a three-layer denoising autoencoder (DAE) to extract distinct features for gender recognition by MLP. The authors in [17] extracted the features from five fNIRS signals by employing the convolutional autoencoder (CAE) and echo state network (ESN) autoencoder for driver cognitive load levels. In [18], a framework consisting of machine and deep learning methods classified the fNIRS signals of motor execution for walking and rest tasks. They demonstrated that deep learning approaches including the CNN, LSTM, and Bi-LSTM, with accuracies of 88.50%, 84.24%, and 85.13%, respectively, reached higher accuracy than kNN, SVM, and LDA. These methods considered neural network and deep learning approaches; however, they did not use time-frequency analysis to account for the nonstationary nature of biological signals.
The rest of this paper is organized as follows. Section 2 explains the dataset and preliminaries used in this paper. Section 3 presents the proposed method in detail. Section 4 contains the results, and finally, Section 5 concludes the paper.

Dataset and Preliminaries
2.1. Dataset. In this paper, we considered the dataset used in [7], in which a total of 30 volunteers participated. Each volunteer performed the following tasks 25 times in random order: right-hand finger-tapping (RHT), left-hand finger-tapping (LHT), and foot-tapping (FT). Therefore, we have a three-class classification scenario along with binary scenarios. For person-specific classification, there are only 25 measurements for each class, which may not be enough for the training of a CNN. On the other hand, if the data of all subjects are merged, there are 750 recordings from each class. The fNIRS data were recorded by a three-wavelength continuous-time multichannel fNIRS system (LIGHTNIRS, Shimadzu, Kyoto, Japan) consisting of eight light sources (Tx) and eight detectors (Rx). Four Tx and Rx were placed around C3 on the left hemisphere, and the rest were placed around C4 on the right hemisphere. Figure 1 depicts the channel locations of the fNIRS. Ch01-10 and Ch11-20 are located around C3 (Ch09) and C4 (Ch18), respectively. The channels are created by a pair of adjacent light sources (Tx) and detectors (Rx) placed 30 mm away from each other.
The experiment diagram is shown in Figure 2. A single trial comprised an introduction period (−2 to 0 s) and a task period (0 to 10 s), followed by an inter-trial break period (10 to 27-29 s). Among RHT (⟶), LHT (⟵), and FT (↓), a random task type was displayed during the introduction period, which the volunteers were required to perform. For RHT/LHT, the volunteers performed unilateral complex finger-tapping. They tapped their thumbs with the other fingers one by one in the direction from the index finger to the little finger and repeated it in the reverse order. The tapping continued at a steady rate of 2 Hz. For FT, the volunteers tapped the foot on the same side as their dominant hand constantly at a rate of 1 Hz. Considering the 20 channels, the measurement of both oxygenated hemoglobin (HbO) and reduced hemoglobin (HbR), the 10 s task duration, and the sampling frequency of 13.33 Hz, each task period contains about 133 samples, and the data of each task form a 40 × 133 matrix.
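The sample count and matrix size quoted above follow directly from the task duration and the sampling rate. The short sketch below only reproduces that arithmetic; the variable names are illustrative and are not taken from the dataset's tooling.

```python
# Illustrative arithmetic only; the variable names are assumptions.
fs_hz = 13.33        # sampling frequency reported in the text
task_s = 10          # length of the task period in seconds
n_channels = 20      # fNIRS channels
n_chromophores = 2   # HbO and HbR

n_samples = round(fs_hz * task_s)                       # about 133 samples
trial_shape = (n_channels * n_chromophores, n_samples)  # rows: HbO and HbR channels
print(n_samples, trial_shape)
```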

2D-DOST.
Stockwell transform (ST) was introduced in [19] and originates from STFT and wavelet transform. It is very efficient in terms of resolution at low frequencies and also has a higher resolution at high frequencies; for this reason, it gives access to the frequency components in the time-frequency domain. However, it is highly redundant and therefore requires a lot of computation time and storage space. Discrete orthonormal ST (DOST), a downsampled version of ST, was proposed to overcome this problem. Because low frequencies have long periods, DOST samples them at a lower rate, and similarly, it samples high frequencies at a higher rate. Suppose $z(t)$ is a continuous-time signal; its ST is calculated as [20]

$$S(\tau, f) = \int_{-\infty}^{\infty} z(t)\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\tau)^2}{2\sigma^2}}\, e^{-j 2\pi f t}\, dt,$$

where $j = \sqrt{-1}$, $t$ and $\tau$ are the time variables, $f$ denotes the frequency, and $\sigma = 1/|f|$ is the scale factor. The output of ST is a complex-valued matrix whose rows and columns are related to time and frequency, respectively. On the other hand, assume that the discrete signal $z[k]$, $k = 0, 1, \ldots, N-1$, is obtained from $z(t)$ by sampling. By replacing $\tau \to k$ and $f \to n/N$, the discrete ST for $z[k]$ is

$$S\left[k, \frac{n}{N}\right] = \sum_{m=0}^{N-1} Z\left[\frac{m+n}{N}\right] e^{-\frac{2\pi^2 m^2}{n^2}}\, e^{\frac{j 2\pi m k}{N}}, \quad n \neq 0,$$

where $Z[\cdot]$ is the discrete Fourier transform of $z[k]$ (normalized by $1/N$). For $n = 0$, $S[k, 0] = \frac{1}{N}\sum_{m=0}^{N-1} z[m]$, which equals the DC value of the Fourier transform. There are $N^2$ ST coefficients for a signal of length $N$. Computing each coefficient has a computational complexity of order $O(N)$, and hence the total computational complexity is of order $O(N^3)$. Let $f(x, y)$ denote a 2D image; its 2D ST is calculated as [21]

$$S(u, v, f_u, f_v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, \frac{|f_u||f_v|}{2\pi}\, e^{-\frac{(u-x)^2 f_u^2 + (v-y)^2 f_v^2}{2}}\, e^{-j 2\pi (f_u x + f_v y)}\, dx\, dy,$$

where $u$ and $v$ are shift parameters used to move the Gaussian window on the $x$ and $y$ axes. Also, the frequency parameters $f_u$ and $f_v$ are the frequencies related to the shift parameters that control the spatial expansion of the window. $S(u, v, f_u, f_v)$ is a 4D complex-valued matrix. The 2D-DOST of an $N \times N$ image, $f(x, y)$, is defined as follows [20]:

$$S[x', y', v_x, v_y] = \frac{1}{\sqrt{2^{p_x + p_y - 2}}} \sum_{m=-2^{p_x-2}}^{2^{p_x-2}-1}\; \sum_{n=-2^{p_y-2}}^{2^{p_y-2}-1} F(m + v_x, n + v_y)\, e^{j 2\pi \left( \frac{m x'}{2^{p_x-1}} + \frac{n y'}{2^{p_y-1}} \right)},$$

where $(x', y')$ indexes the spatial position, $v_x = 2^{p_x-1} + 2^{p_x-2}$ and $v_y = 2^{p_y-1} + 2^{p_y-2}$ are the horizontal and vertical frequencies, respectively, and $p_x, p_y = 0, 1, \ldots, \log(N-1)$. Also, $F(m, n)$ is the 2D Fourier transform of the image $f(x, y)$. It should be noted that the 2D-DOST output has the same dimensions as the input image. By combining all the values of $p_x$ and $p_y$, a local spatial frequency range consisting of positive and negative frequency components can be constructed. 2D-DOST provides information about frequencies in the bandwidth of $2^{p_x-1} \times 2^{p_y-1}$ frequencies [20].
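To make the dyadic band structure of the DOST concrete, the following is a minimal NumPy sketch of a separable 2D-DOST built from FFTs. It assumes the commonly used fast formulation (a unitary FFT, a dyadic partition of the spectrum, and a unitary inverse FFT inside each band); the helper names `dost_bands`, `dost_1d`, and `dost_2d` are hypothetical, and the band ordering and phase conventions may differ from those of [20].

```python
import numpy as np

def dost_bands(n):
    """Dyadic index ranges [lo, hi) over an unshifted length-n FFT."""
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 2")
    k = n.bit_length() - 1                       # n == 2**k
    dyadic = [(2 ** (p - 1), 2 ** p) for p in range(1, k)]
    positive = [(0, 1)] + dyadic                 # DC, then bands of width 1, 2, 4, ...
    negative = [(n - hi, n - lo) for lo, hi in dyadic] + [(n - 1, n)]
    return positive + negative

def dost_1d(x, axis=-1):
    """Band-wise DOST along one axis (assumed fast formulation)."""
    x = np.moveaxis(np.asarray(x, dtype=complex), axis, -1)
    spectrum = np.fft.fft(x, norm="ortho")
    out = np.empty_like(spectrum)
    for lo, hi in dost_bands(x.shape[-1]):
        # A unitary inverse FFT inside each band keeps the transform orthonormal.
        out[..., lo:hi] = np.fft.ifft(spectrum[..., lo:hi], norm="ortho")
    return np.moveaxis(out, -1, axis)

def dost_2d(image):
    """Separable 2D-DOST: transform the columns axis, then the rows axis."""
    return dost_1d(dost_1d(image, axis=1), axis=0)

rng = np.random.default_rng(0)
tfm = dost_2d(rng.standard_normal((16, 128)))    # e.g. 16 channels x 128 samples
print(tfm.shape)
```

The output has the same shape as the input, in line with the remark above, and because every stage is unitary the energy of the input is preserved.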

CNN.
CNN is one of the most efficient machine learning methods for feature extraction and classification of images. Figure 3 presents a typical CNN, and its main layers are explained in the following. Convolution layers scan the pixels using a kernel that passes over the image and create feature maps, which are then used to predict the feature class. Because the convolution layer produces a large amount of information, the pooling layer retains the important information and discards redundant information. The fully connected layer acts similarly to a traditional MLP and predicts the output class using the extracted deep features. In this paper, two CNNs and their modified versions are used for deep feature extraction, as explained in the following.
MobileNet [22] is a class of efficient models used in mobile and embedded vision applications. Because this model uses separable convolutions, its number of parameters is significantly reduced compared to a network of the same depth with regular convolutions. In a standard convolution, the filtering and the combination are done simultaneously in the same stage. In contrast, MobileNet uses depthwise separable convolutions: the filtering operation is performed in one stage and the combination operation is performed in another stage. This separation substantially reduces computational complexity. The structure of MobileNet is given in Table 1, where conv. and conv. dw denote the standard and depthwise convolutions, respectively.
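The parameter saving of depthwise separable convolutions can be checked with simple counting. The sketch below compares the weight counts of a standard and a separable layer; biases are ignored, and the layer sizes in the example (3 × 3 kernel, 32 input and 64 output channels) are illustrative values, not entries of Table 1.

```python
def conv_params(kernel, c_in, c_out):
    """Weights of a standard convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Weights of a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise

standard = conv_params(3, 32, 64)
separable = separable_conv_params(3, 32, 64)
print(standard, separable, round(standard / separable, 2))
```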
The LeNet-5 architecture was introduced in [23]. It is one of the earliest and most basic CNN architectures, consisting of seven layers. The first layer consists of an input image with a size of 32 × 32. It is convolved with six filters of size 5 × 5, resulting in a dimension of 28 × 28 × 6. The second layer is a pooling operation with a filter size of 2 × 2 and a stride of two. Hence, the resulting image dimension will be 14 × 14 × 6. Similarly, the third layer also involves a convolution operation with 16 filters of size 5 × 5, followed by a fourth pooling layer with a similar filter size of 2 × 2 and stride of two. Thus, the resulting image dimension will be reduced to 5 × 5 × 16. Once the image dimension is reduced, the fifth layer is a convolution with 120 filters of size 5 × 5. The sixth layer is a fully connected layer with 84 units. The final layer is a fully connected layer with ten neurons and a softmax activation function.
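The layer sizes listed above can be verified with the usual output-size rule for unpadded convolution and pooling, (size − kernel) / stride + 1. The following sketch walks through the five spatial layers; the layer names are labels chosen here for readability.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1

# (name, kernel, stride, output channels) for the spatial layers of LeNet-5
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, 6),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, 16),
    ("conv3", 5, 1, 120),
]

size = 32  # input image is 32 x 32
shapes = []
for name, kernel, stride, channels in layers:
    size = conv_out(size, kernel, stride)
    shapes.append((name, size, size, channels))

for entry in shapes:
    print(entry)
```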
where $P_s$ and $P_n$ represent the power of filtered data (signal estimation) and unfiltered data (noise estimation), respectively. This procedure was done for all channels, and several channels with the highest SNR were used for feature extraction considering the classification scenario.

TFM Calculation.
For each of the HbO and HbR signals, the data obtained from 20 channels include 133 samples. The number of rows and columns of 2D-DOST must be a power of two, and hence four, eight, and sixteen channels with 128 samples are considered.
To use the information of these TFMs for classification, we considered three fusion schemes, including early fusion, joint fusion, and late fusion [20]. MobileNet and LeNet-5 are considered as base structures, and we modify them based on the fusion scheme and the size of the TFMs. The additive combination of $S_O$ and $S_R$ (the TFMs of HbO and HbR), denoted as $S_O + S_R$, captures the additive information from both HbO and HbR signals. This combination allows for the integration of complementary information from these two sources, potentially enhancing the discriminative power of the features. In a classification scheme, this combined feature can provide a more comprehensive representation of the underlying neural activity by considering both oxygenation and deoxygenation dynamics simultaneously. The differential combination $S_O - S_R$ represents the difference between HbO and HbR signals. This differential information can highlight variations in oxygenation patterns that may be critical for distinguishing between different cognitive or motor tasks. In a classification context, this feature can be particularly useful when changes in the balance between oxygenated and reduced hemoglobin are relevant to the classification task. In summary, the rationale for using $S_O + S_R$ and $S_O - S_R$ lies in their ability to provide a more comprehensive view of the hemodynamic responses by considering both additive and differential aspects of the HbO and HbR signals.
These combined features may capture unique patterns that are relevant to the classification scheme, potentially improving the accuracy and discriminative power of the classification model.

Early Fusion.
In early fusion, the fusion operation is performed at the feature level. The inputs are either raw features or features extracted in different ways [20]; they are joined to form the final feature maps before being fed into a machine learning model. Based on the channels considered for computing 2D-DOST, each TFM has the size of $n_{ch} \times 128$. The considered early fusion merges the four TFMs to construct the CNN input as follows:

$$S_{\text{early}} = \left[ S_O,\; S_R,\; S_O + S_R,\; S_O - S_R \right].$$

It should be noted that the size of the matrix $S_{\text{early}}$ is $4 \times n_{ch} \times 128$. The procedure of classification with early fusion is shown in Figure 11. In this procedure, the CNN is trained using common training algorithms.
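A minimal sketch of how the early-fusion input could be assembled is given below. It assumes the two TFMs are available as real-valued arrays of shape (n_ch, 128), for example coefficient magnitudes, and that the four maps are stacked along a new leading axis; the function name `early_fusion_input` and both assumptions are illustrative rather than taken from the paper.

```python
import numpy as np

def early_fusion_input(s_o, s_r):
    """Stack S_O, S_R, S_O + S_R and S_O - S_R into one (4, n_ch, 128) array."""
    s_o = np.asarray(s_o)
    s_r = np.asarray(s_r)
    if s_o.shape != s_r.shape:
        raise ValueError("HbO and HbR maps must have the same shape")
    return np.stack([s_o, s_r, s_o + s_r, s_o - s_r], axis=0)

rng = np.random.default_rng(0)
s_o = rng.standard_normal((16, 128))   # placeholder HbO map (random data)
s_r = rng.standard_normal((16, 128))   # placeholder HbR map (random data)
s_early = early_fusion_input(s_o, s_r)
print(s_early.shape)
```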

Joint Fusion.
The early fusion scheme concatenates the TFMs first and then extracts the deep features using one CNN. The procedure of joint fusion is shown in Figure 12. In contrast to early fusion, this scheme passes each TFM through a CNN and obtains the deep features for each TFM. Let $x_1$, $x_2$, $x_3$, and $x_4$ denote the deep feature vectors corresponding to the inputs $S_O$, $S_R$, $S_O + S_R$, and $S_O - S_R$, respectively. These vectors are obtained from the flatten layer of the CNNs, and the structure of these CNNs does not contain the fully connected layers. Since the structure of the CNN is the same for all inputs, all feature vectors have the same number of features. These vectors are then concatenated to form the final feature vector $x_f$ (feature fusion block). The vector $x_f$ is given to the classifier to predict the output class. The classifier is a traditional MLP whose structure is the same as the fully connected layers of the considered CNN. In the training process, the parameters of all CNNs and the classifier are tuned simultaneously.
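The feature fusion block amounts to concatenating the four flattened feature vectors. The sketch below shows only this step, with random vectors standing in for the CNN outputs; the function name and the per-branch length of 600 used in the example are illustrative choices.

```python
import numpy as np

def joint_fusion_features(feature_vectors):
    """Concatenate equally sized 1D deep-feature vectors into x_f."""
    vectors = [np.ravel(v) for v in feature_vectors]
    if len({v.size for v in vectors}) != 1:
        raise ValueError("all branches must produce the same number of features")
    return np.concatenate(vectors)

rng = np.random.default_rng(0)
# Placeholders for x_1 ... x_4; real values would come from the four CNN branches.
branches = [rng.standard_normal(600) for _ in range(4)]
x_f = joint_fusion_features(branches)
print(x_f.shape)
```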

Late Fusion.
The late fusion scheme, which is known as a combination at the decision level, utilizes one CNN per TFM to predict the output for each TFM separately, as shown in Figure 13. In this scheme, the size of the feature vector given to the dense layers is smaller than in joint fusion, but this scheme does not consider the possible correlation among the deep features of the TFMs. For the final decision, depending on the situation, different methods such as majority vote, averaging, weighted voting, or meta-classification based on model predictions are used. Let the vectors $p_1$, $p_2$, $p_3$, and $p_4$ contain the prediction scores of the different classes assigned by each of the four CNNs, respectively. It should be noted that all vectors have the size of $n_c \times 1$, where $n_c$ is the number of classes. The aggregation scheme calculates the sum of prediction scores to obtain the final score, $p_{\text{agg}}$, as

$$p_{\text{agg}} = \sum_{i=1}^{4} p_i.$$

Finally, the predicted class $\hat{y}_{\text{pred}}$ is the one that maximizes $p_{\text{agg}}$:

$$\hat{y}_{\text{pred}} = \arg\max \left( p_{\text{agg}} \right).$$
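The sum-and-argmax aggregation described above can be sketched in a few lines. The score vectors in the example are made-up numbers for a three-class case, not results from the paper, and the function name is hypothetical.

```python
import numpy as np

def late_fusion_predict(scores):
    """Sum per-model class scores and return (predicted class index, p_agg)."""
    p_agg = np.sum(np.asarray(scores, dtype=float), axis=0)
    return int(np.argmax(p_agg)), p_agg

# Made-up prediction scores p_1 ... p_4 for three classes
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
    [0.3, 0.3, 0.4],
]
y_pred, p_agg = late_fusion_predict(scores)
print(y_pred, p_agg)
```

Summing the scores lets a confident model outweigh several uncertain ones, which a plain majority vote would not do: in this example only two of the four models rank class 0 first, yet class 0 has the largest aggregated score.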
International Journal of Intelligent Systems

Simulation Setup.
Three two-class scenarios and one three-class scenario are considered for classification: (RHT, LHT), (RHT, FT), (LHT, FT), and (RHT, LHT, FT). The three-class scenario enables us to capture intricate nuances in our data, allowing differentiation between multiple states or activities. Simultaneously, binary scenarios address specific research questions with simpler distinctions. This dual approach provides versatility, accommodating a wide range of research objectives and allowing for comparative analysis of classifier performance. Overall, it enriches our research by offering a comprehensive exploration of our dataset, catering to both complex and focused research questions.
The performance of LeNet and MobileNet is obtained for each scenario for different numbers of channels and fusion schemes. Also, the structure of the modified CNNs yielding the highest accuracy is presented. In this paper, subject-independent classification is performed. Hence, train and test data were determined by the cross-subject validation protocol. This protocol trains the model with data from 29 subjects, and data from the remaining subject are used to evaluate the test accuracy of the model. This procedure is repeated with each subject as test data, and average results are reported. Considering 25 trials for each task per subject, there are 750 signals from each task; hence, there are 725 and 25 signals from each task for training and testing, respectively. It should be mentioned that the data augmentation proposed in [24] is utilized to increase the number of training signals. Table 2 contains the parameters used for training the CNN. The learning rate balances convergence to the optimal solution against stability. The regularization parameter controls overfitting and encourages generalization. The maximum number of epochs is chosen to update the model effectively without overfitting. The batch size balances the training speed gained by parallel processing against computational complexity. The momentum enhances convergence and helps escape from local minima. The learning rate drop factor and drop period are used for fine-tuning the learning rate for effective convergence. The SGDM optimizer combines the benefits of stochastic gradient descent with momentum. The cross-entropy loss function is suitable for classification tasks, measuring dissimilarity between predicted and actual class distributions.
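The cross-subject protocol can be sketched as a leave-one-subject-out split. The generator below is a plain NumPy illustration with a hypothetical name; it assumes a vector of subject labels for the trials of one task (30 subjects with 25 trials each) and reproduces the 725/25 split stated above.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held-out subject, train indices, test indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Trials of a single task: 30 subjects x 25 trials
subject_ids = np.repeat(np.arange(30), 25)
folds = list(leave_one_subject_out(subject_ids))
print(len(folds), len(folds[0][1]), len(folds[0][2]))
```

Because the test subject never contributes training trials, any augmentation step would have to be fitted on the training indices of each fold only.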
The performance of the proposed method is presented in terms of confusion matrix, accuracy (Acc.), sensitivity (Sens.), precision (Prec.), Kappa score ($K_p$), and $F_1$-score, which are calculated as follows [25]:

$$\text{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Sens.} = \frac{TP}{TP + FN}, \quad \text{Prec.} = \frac{TP}{TP + FP},$$

$$K_p = \frac{\text{Acc.} - A_r}{1 - A_r}, \quad F_1 = \frac{2 \times \text{Prec.} \times \text{Sens.}}{\text{Prec.} + \text{Sens.}},$$

where the true positive (TP) and true negative (TN), respectively, denote the numbers of correctly classified and correctly rejected fNIRS signals. Also, the false positive (FP) and false negative (FN), respectively, denote the numbers of incorrectly identified and incorrectly rejected fNIRS signals. Also, $A_r = 1/n_c$ is the random accuracy, where $n_c$ is the number of classes.

Channel Selection.
As mentioned, since the numbers of rows and columns of the input of 2D-DOST should be powers of two, we consider the four, eight, and 16 top channels based on the SNR value for ternary and binary classification. To this end, the SNRs of the channels were sorted in descending order, and the channels that appear most often among all subjects were selected. Figure 14 demonstrates the repetition of high-SNR channels among subjects. The selected channels are also given in Table 3.
As given in [7], the motor cortex regions in the contralateral hemispheres were well activated when the subjects performed finger-tapping, and distinct HbR values were observed at channels 5, 6, 15, and 16 located in the anterior areas of C3 and C4. According to Table 3, these channels are among the high-SNR ones.
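One possible reading of the selection rule (rank the channels by SNR for every subject, then keep the channels that appear most often among the per-subject top ranks) is sketched below. The tie-breaking rule, the function name, and the toy SNR table are assumptions made for illustration; the text does not specify these details.

```python
from collections import Counter

import numpy as np

def most_repeated_channels(snr, k):
    """Return k channel indices that occur most often in the per-subject top-k.

    snr is an array of shape (n_subjects, n_channels).
    """
    counts = Counter()
    for subject_snr in np.asarray(snr, dtype=float):
        top_k = np.argsort(subject_snr)[::-1][:k]   # descending SNR order
        counts.update(top_k.tolist())
    # Assumed tie-break: higher count first, then lower channel index.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return ranked[:k]

# Toy SNR table: 3 subjects x 4 channels (made-up values)
snr = [
    [5.0, 1.0, 4.0, 0.0],
    [4.0, 0.0, 5.0, 1.0],
    [0.0, 5.0, 4.0, 1.0],
]
selected = most_repeated_channels(snr, k=2)
print(selected)
```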

Accuracy of LeNet.
Table 4 presents the accuracy of LeNet and its modified version (accuracy of the modified version in parentheses) in different scenarios for different numbers of channels and fusion schemes. As observed, the modified version reaches a higher accuracy than the original structure. Also, the binary scenarios have higher accuracy than the three-class scenario. It is observed that, in general, increasing the number of channels enhances the classification accuracy, because more channels provide more information about brain activity. On the other hand, the computational complexity increases. Among the fusion schemes, the joint fusion, which extracts deep features from each TFM separately, yields the highest accuracy, and the early fusion outperforms the late fusion. Since the joint fusion scheme concatenates the four vectors of deep features, it has a higher complexity compared to the other fusion schemes. The three-class scenario reaches the highest accuracy of 90.71%. Also, the scenarios (RHT, LHT), (RHT, FT), and (LHT, FT) have the highest accuracies of 95.72%, 94.88%, and 93.19%, respectively. Also, the standard deviation of the classification accuracies obtained in cross-validation is given for the modified network. The small standard deviations indicate that the proposed method generalizes well across subjects.
Table 5 presents the structure of the modified LeNet used for feature extraction and classification in joint fusion. The input layer passes the input TFM with the size of 128 × 16 × 1 to the first convolution layer. This structure consists of two convolution layers, two average pooling layers, and one flatten layer. Each CNN generates a deep feature vector with the size of 600 × 1, and considering four CNNs for feature extraction, according to Figure 4, the input of the first fully connected layer is (4 × 600) × 1. The last fully connected layer acts as the output layer, and its output has the size of $n_c \times 1$.

Accuracy of MobileNet.
The accuracy of MobileNet for different structures is given in Table 6. It is observed that the modified structure yields a higher accuracy than the original structure. The accuracy of the proposed method for the three-class scenario is 93.02%. Also, the accuracy for the scenarios (RHT, LHT), (RHT, FT), and (LHT, FT) is 98.73%, 96.67%, and 95.65%, respectively. These accuracies are obtained in joint fusion with the top 16 channels. As observed, similar to LeNet, the proposed method has a higher accuracy for two-class scenarios compared to the three-class scenario. Comparing Tables 4 and 6, the standard deviations of the modified MobileNet are lower than those of the modified LeNet.
The structure of the modified MobileNet yielding the highest accuracy in the joint fusion scenario is given in Table 7. The size of the input layer is 128 × 16. Each Conv. layer is a standard convolutional layer with batch normalization and rectified linear unit (ReLU). Also, Conv. dw denotes the depthwise separable convolutions with depthwise and pointwise layers followed by normalization and ReLU.

Confusion Matrix.
The confusion matrices of the proposed method with joint fusion and 16 channels for different classification scenarios are given in Tables 8-11. It is observed that in the three-class scenario, the RHT and LHT signals have higher sensitivity than the FT, while in the binary scenarios (RHT, FT) and (LHT, FT), the FT has higher sensitivity than the RHT and LHT. Also, in both the three-class and binary (RHT, LHT) scenarios, the RHT has higher sensitivity than the LHT. The sensitivity values for all signals in all classes are higher than 94%, except the FT in the three-class scenario.
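Per-class sensitivity and the other reported metrics can be read off a confusion matrix. The sketch below assumes that rows hold the true classes and columns the predicted classes, and it takes the Kappa score as (Acc. − A_r) / (1 − A_r) with A_r = 1/n_c, following the definition of the random accuracy; the example matrix is made up and does not reproduce Tables 8-11.

```python
import numpy as np

def summarize(confusion):
    """Compute accuracy, per-class sensitivity/precision/F1 and Kappa."""
    cm = np.asarray(confusion, dtype=float)
    n_classes = cm.shape[0]
    tp = np.diag(cm)
    sensitivity = tp / cm.sum(axis=1)   # rows assumed to be true classes
    precision = tp / cm.sum(axis=0)     # columns assumed to be predictions
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = tp.sum() / cm.sum()
    random_accuracy = 1.0 / n_classes
    kappa = (accuracy - random_accuracy) / (1.0 - random_accuracy)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f1": f1, "kappa": kappa}

# Made-up three-class confusion matrix (25 test trials per class)
example = [
    [24, 1, 0],
    [2, 22, 1],
    [1, 2, 22],
]
metrics = summarize(example)
print(metrics["accuracy"], metrics["kappa"], metrics["sensitivity"])
```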

Effect of Data Augmentation.
CNNs require more training data than traditional artificial neural networks to avoid issues such as overfitting and underfitting and to increase the training accuracy and generalization. As mentioned, the method based on WGANs proposed in [24] is employed for data augmentation. This network consists of two parts: the critic and the generator. The former learns the structure of the data and the latter generates the artificial data. Both were configured as fully connected feedforward neural networks with three layers [24], as given in Table 12, where $N_{time}$ and $N_{ch}$ represent the number of time samples (= 133) and the number of channels (which depends on the number of high-SNR channels used), respectively. A bias term was also added to the input and hidden layers. The input to the generator, $z$, was a random vector of dimension $N_z$ (= 100) [24].
As mentioned, there are 725 training samples at each cross-validation iteration, which are used to train the critic network, and the accuracy was reported for different numbers of generated samples. The effect of the number of augmented signals per training signal on the accuracy of different scenarios is shown in Figure 15. It is observed that the accuracy is low for a small number of augmented samples, and that increasing the number of augmented samples increases the accuracy for all scenarios and for all numbers of channels considered. Hence, data augmentation is necessary to train the model for the deep-learning-based classification of motor fNIRS signals.

Performance Comparison. The average changes of HbO and HbR signals with a length of five seconds were calculated as features in [26] and then classified by Bayesian neural networks. The scenarios of (RHT, FT) and (LHT, FT) were considered, and a maximum accuracy of 86.44% was obtained. The difference of HbO and HbR changes as well as the vector size and angle were considered as features of fNIRS signals in [27] and then classified by LDA. Maximum accuracies of 98.7% and 85.4% were obtained for the two- and three-class scenarios, respectively. In [7], the features were the average changes of HbO and HbR concentrations, and maximum accuracies of 84.4% and 70.4% were obtained for the two- and three-class scenarios, respectively. A method based on the transformer self-attention mechanism was introduced in [28]. To enhance data utilization and network representation, this method leverages spatial and channel representations of fNIRS signals. The results show that the method yields a maximum accuracy of 75.49% for the three-class scenario. The authors in [29] designed fNIRSnet considering the inherent delayed hemodynamic responses of fNIRS signals [30]. The local interpretable model-agnostic explanation (LIME) algorithm was proposed for feature selection for fNIRS datasets in [31]. The Gramian angular difference field (GADF) was used to encode multichannel fNIRS signals into multichannel images.

Also, the results of the five-fold cross-validation protocol are given in Table 13. As observed, this protocol outperforms the cross-subject cross-validation; however, for the following reasons, the cross-subject protocol is more popular than the k-fold one in BCI applications [32][33][34].
(1) Generalization to new users: Cross-subject cross-validation ensures that the model is tested on data from individuals who were not part of the training set. This helps assess the system's ability to generalize to new users, which is crucial for biomedical applications where the BCI needs to be applicable to a wide range of individuals.
(2) Real-world variability: Biomedical applications often involve real-world scenarios where users may exhibit individual differences, such as variations in brain anatomy, physiology, or cognitive processes. Cross-subject cross-validation allows for the evaluation of the BCI system's performance in capturing and adapting to these inter-subject variabilities. It provides a more realistic assessment of how the system will perform when deployed in diverse user populations.
(3) Avoiding data leakage: In some cases, k-fold cross-validation may lead to data leakage, where information from the test set inadvertently influences the training process. This can result in overly optimistic performance estimates. Cross-subject cross-validation helps mitigate this issue by ensuring that the training and testing data come from different individuals, reducing the risk of data leakage and providing more reliable performance estimates.
(4) Clinical relevance: Biomedical applications often require BCI systems to be evaluated in a clinical context, where the performance and reliability of the system are critical. Cross-subject cross-validation allows for a more rigorous evaluation of the BCI system's performance across different individuals, which is important for establishing its clinical relevance and potential utility in real-world healthcare settings.
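The difference between the two protocols discussed above can be illustrated with a minimal sketch. It is not the evaluation code used in this paper: the function names, the toy subject labels, and the interleaved fold assignment are assumptions made only for illustration. It shows that a leave-one-subject-out split never places the held-out subject in the training set, whereas a plain k-fold split over pooled trials can place trials of the same subject in both sets.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one split per subject."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for held_out in sorted(by_subject):
        test_idx = by_subject[held_out]
        train_idx = [i for s, idx in by_subject.items() if s != held_out for i in idx]
        yield held_out, sorted(train_idx), test_idx


def k_fold(n_samples, k):
    """Plain k-fold over sample indices (interleaved folds), ignoring subject identity."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for test_idx in folds:
        held = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held]
        yield train_idx, test_idx


# Toy data (hypothetical): 3 subjects with 4 trials each.
subject_ids = [s for s in ("S01", "S02", "S03") for _ in range(4)]

for held_out, train_idx, test_idx in leave_one_subject_out(subject_ids):
    train_subjects = {subject_ids[i] for i in train_idx}
    assert held_out not in train_subjects  # the test subject is unseen in training

for train_idx, test_idx in k_fold(len(subject_ids), k=4):
    shared = {subject_ids[i] for i in train_idx} & {subject_ids[i] for i in test_idx}
    print("k-fold subjects in both train and test:", sorted(shared))
```

In the toy k-fold split every fold shares all three subjects between training and testing, which is one way the subject-level overlap described in item (3) can arise.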

Conclusion
In this paper, a new method for the classification of motor execution fNIRS signals was presented. The presented method is based on the joint fusion of the TFMs of HbO, HbR, HbO + HbR, and HbO − HbR. The TFMs were obtained by 2D-DOST to simultaneously consider the correlation among samples of different channels as well as among the samples of each channel. Joint fusion was considered to merge the deep features extracted from the four TFMs using a CNN. The open-access dataset with 20-channel fNIRS signals of three motor executions collected from 30 subjects was used for performance evaluation. The performance of LeNet, MobileNet, and their modified versions was obtained for different numbers of top channels and different scenarios. The results showed that increasing the number of channels increases the accuracy, and the proposed method reached maximum accuracies of 98.73% and 93.04% for the two-class and three-class scenarios, respectively, when the modified MobileNet is used for deep feature extraction and classification. Also, the performance comparison showed that the proposed method outperforms the recently introduced methods.
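The data flow summarized above (four signals derived from HbO and HbR, one feature branch per signal, and a merged feature vector) can be sketched as follows. This is only an illustration under stated assumptions: the 2D-DOST step is omitted, `placeholder_features` is a hypothetical stand-in for a CNN branch and not the LeNet or MobileNet used in this paper, and joint fusion is assumed here to mean concatenating the branch features before a shared classifier.

```python
def combine_signals(hbo, hbr):
    """Return the four channel-wise signals: HbO, HbR, HbO+HbR, HbO-HbR."""
    total = [[o + r for o, r in zip(ch_o, ch_r)] for ch_o, ch_r in zip(hbo, hbr)]
    diff = [[o - r for o, r in zip(ch_o, ch_r)] for ch_o, ch_r in zip(hbo, hbr)]
    return {"HbO": hbo, "HbR": hbr, "HbO+HbR": total, "HbO-HbR": diff}


def placeholder_features(signal):
    """Hypothetical stand-in for a CNN branch: per-channel mean (NOT the paper's network)."""
    return [sum(channel) / len(channel) for channel in signal]


def joint_fusion(signals):
    """Concatenate the per-branch feature vectors into a single feature vector."""
    fused = []
    for name in ("HbO", "HbR", "HbO+HbR", "HbO-HbR"):
        fused.extend(placeholder_features(signals[name]))
    return fused


# Toy input (hypothetical): 2 channels x 4 samples per signal.
hbo = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
hbr = [[0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 1.0, 0.0]]

fused = joint_fusion(combine_signals(hbo, hbr))
print(len(fused))  # 4 branches x 2 channels = 8 features
```

In a full pipeline each branch would receive the TFM of its signal rather than the raw samples, and the concatenated vector would feed the classification layers.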

Figure 1: The placement of sources and detectors to record fNIRS signals [7].

Figure 3: Structure of the typical CNN.

Figure 14: The number of repetitions of high-SNR channels among all subjects. (a) HbO signal, top four channels. (b) HbR signal, top four channels. (c) HbO signal, top eight channels. (d) HbR signal, top eight channels. (e) HbO signal, top 16 channels. (f) HbR signal, top 16 channels.

Table 2: Parameters used for training.
Table 13 compares the performance of the proposed method with recently introduced ones on the considered dataset to demonstrate the efficiency

Table 3: The selected channels based on SNR values.

Table 4: The accuracy of LeNet in different scenarios.

Table 5: The structure of modified LeNet used in joint fusion.

Table 6: The accuracy of MobileNet in different scenarios. The accuracy of the modified MobileNet is given in parentheses.

Table 7: The structure of modified MobileNet used in joint fusion.

Table 8: Confusion matrix for three-class scenario.

Table 12: Structure of critic and generator networks used in this paper [24].