A Multitask Sign Language Recognition System Using Commodity Wi-Fi



Introduction
Sign language is an indispensable part of daily life in the deaf community [1,2]. Communication barriers often arise between deaf people and hearing people who are not familiar with sign language [3,4]. Wearable sign language interpreters aim to solve these communication difficulties, but they are often expensive, low in versatility, and inconvenient to carry [5][6][7][8]. Vision-based sign language translators overcome these shortcomings because they need only a camera for natural interaction [1,2,9,10]. However, users must stay within the field of view (FOV) of the camera, which may disclose private personal information. In addition, vision-based gesture recognition is susceptible to lighting conditions and obstacles.
Compared with wearable-based and vision-based sensing solutions, wireless sensing can cover a wider detection range with fewer privacy concerns [11]. Because existing wireless communication infrastructure is cheap to reuse and easy to deploy, Wi-Fi-based wireless sensing solutions are developing rapidly [11,12]. Currently, Wi-Fi sensing solutions mainly rely on two indicators: the received signal strength indicator (RSSI) and channel state information (CSI) [13,14]. Compared with RSSI, CSI measures more fine-grained information and is suitable for capturing small movements such as heartbeats and gestures [3,11,12]. In 2018, Ma et al. [3] released a CSI-based sign language gesture dataset called SignFi, which collected 276 sign language words from the daily life of deaf people through Wi-Fi signals. They used a nine-layer convolutional neural network (CNN) model to recognize these gestures. However, real-world gestures may correspond to different objects, actions, and scenes. The main contributions of this work are summarized as follows:
(1) We propose a multitask framework called Wi-SignFi that not only recognizes gestures but also identifies users and environments. The Wi-SignFi model is a lightweight, end-to-end architecture consisting of an eight-layer CNN and a KNN module. Unlike existing approaches, the CSI data fed to Wi-SignFi does not require preprocessing such as denoising and unwrapping. Experimental results show that our proposed method achieves an average gesture recognition accuracy of 98.91%, which significantly outperforms previous works. Therefore, our proposed method is simple and effective.
(2) The experiments demonstrate that the gesture recognition accuracy of the model is affected by the resolution of the input data. In previous reports, the CSI data collected from the three antennas on the Wi-Fi transmitter is normally converted into RGB CSI color images, which are then fed into CNNs. This approach does not increase the resolution of the input data, resulting in poor gesture recognition performance. On the other hand, the training time of the model grows in proportion to the resolution of the input data. Training time and gesture recognition accuracy can be balanced by extending the CSI data from the different antennas into single-channel grayscale images as the input data. This finding facilitates the deployment of Wi-SignFi on edge embedded devices.
(3) Wi-SignFi can be deployed on an Nvidia Jetson Nano device with 4 GB of memory. To the best of our knowledge, this is the first Wi-Fi-based gesture recognition system to be applied to embedded devices. Wi-SignFi on the Jetson Nano device achieves an inference speed of 27 CSI instances per second.
The rest of this article is organized as follows: We review the existing literature in Section 2. Section 3 introduces the SignFi dataset and proposes the Wi-SignFi framework. Experiments and results are explained in Section 4. Wi-SignFi running on Jetson Nano devices is illustrated in Section 5. Section 6 concludes this work and gives future directions.

Related Works
At present, device-free sign language recognition systems are mainly divided into two categories: computer vision-based methods and wireless technology-based methods. Nath and Arun used the convex hull and template matching algorithms in the OpenCV software package for sign language recognition [1]. They implemented real-time sign language translation on an ARM processor board. The sign language recognition systems in [2,9,10] were all built on Microsoft's Kinect device. Aly et al. [2] combined a principal component analysis network (PCA-Net) and a support vector machine (SVM) to recognize sign language gestures of different users. Huang et al. [9] found that hand-crafted features are difficult to adapt reliably to the variety of sign language gestures, so they proposed a 3D convolutional neural network (CNN) to automatically extract significant spatiotemporal features. A Brazilian sign language dataset named LIBRAS-UFOP was recognized by a two-stream convolutional network with a recognition accuracy of 74.25% [10]. In addition, Pu et al. [15] proposed a weakly supervised continuous sign language recognition system consisting of two modules: a 3D convolutional residual network (3D-ResNet) and an encoder-decoder sequence network. The system was verified on two large datasets, RWTH-PHOENIX-Weather and CSL [15]. Cui et al. [16] used a CNN and a bi-directional recurrent neural network (RNN) to extract spatiotemporal information from raw sign language video datasets.
With the rise of the Internet of Things (IoT) and autonomous driving technology, there is growing interest in wireless sensing technology [17]. Wireless sensing solutions based on Wi-Fi have been extensively investigated due to their low cost and ease of deployment [3,11-13,18].
Wi-Fi sensing solutions have two indicators: RSSI and CSI [13,14]. Since RSSI is easily accessible, many researchers extracted human motion features from RSSI in the early days of Wi-Fi wireless sensing. Sigg et al. [19] analyzed the static and dynamic properties of the collected RSSI to recognize human gestures. Abdelnasser et al. [20] proposed an RSSI-based gesture recognition system, WiGest. The system focuses on changes in Wi-Fi signal strength to recognize the user's in-air gestures. In [21], a general one-dimensional convolutional neural network (1D-CNN) framework for RSSI-based dynamic gesture detection and recognition was built. Experimental results showed that the recognition accuracy for seven complex dynamic gestures was 93.03%.
RSSI is the superposition of multipath signals, so it cannot effectively distinguish the individual multipath components during Wi-Fi signal propagation [22]. Hence, RSSI-based applications need to deploy multiple wireless links to reduce the impact of multipath effects [21]. In complex environments, RSSI fluctuates greatly in stability and reliability, and it cannot capture the real signal changes caused by human movements [13]. In contrast, CSI can distinguish multipath signals through orthogonal frequency division multiplexing (OFDM) technology [23]. Compared with RSSI, CSI is more stable under static conditions and more sensitive to dynamic signals [13].
In 2011, Halperin et al. released the CSI tool, which greatly facilitated the acquisition of CSI on commercial Wi-Fi devices [24]. The CSI tool attracted a large number of researchers to use CSI for Wi-Fi activity sensing research [12,25-27]. WiFinger is designed to recognize the finger gestures for nine digits in American Sign Language (ASL) [12]. WiSign is an indoor sign language recognition system that can recognize five gestures with an accuracy of 93.8% [25]. DF-WiSLR [26] can recognize 19 dynamic and 30 static sign gestures. Experimental results showed that gesture direction and environment had a great influence on recognition performance. In [27], a dual-stream convolutional network was used to extract spatiotemporal information from six CSI action datasets.
Reference [28] points out that the mapping between gestures and CSI data is not unique, which differs from traditional gesture image data. The CSI data generated by the same gesture can vary greatly with person, location, orientation, and scenario. Gao et al. [28] used the error of dynamic phase index (EDP-index) to remove the influence of different positions and orientations on gestures and thereby improve the quality of CSI-based wireless sensing. In [29], spatiotemporal information was extracted from CSI gesture data via a parallel long short-term memory fully convolutional network (LSTM-FCN) to accommodate user differences and gesture diversity. The gesture recognition system identified 50 common gestures from 5 users with 98.9% accuracy. WiGRUNT [30] learned domain-independent features from CSI gestures through a spatiotemporal dual attention mechanism and was validated on the Widar3 dataset.
SignFi [3] achieved average recognition accuracies of 98.01%, 98.91%, 94.81%, and 86.66% on Lab276, Home276, Lab + Home276, and Lab150, respectively. In 2020, reference [31] compared three deep learning models: long short-term memory (LSTM), CNN, and attentive bi-directional LSTM (ABLSTM). The experimental results showed that the CNN model had the best recognition performance on the SignFi dataset. The average recognition accuracy of the proposed CNN model for Lab276, Home276, Lab + Home276, and Lab150 was 99.855%, 99.674%, 99.734%, and 93.84%, respectively. In the same year, Lee and Gao [4] applied a dual-output two-stream network to the SignFi dataset and obtained good recognition results. In 2021, Ahmed et al. [32] used an LSTM framework with 150 hidden units to identify sign language in the SignFi dataset. In addition to deep learning methods, Farhana Tariq Ahmed et al. [33] also adopted machine learning methods, manually extracting high-order statistical (HOS) features from the SignFi dataset and classifying gestures with support vector machines (SVMs).

Channel State Information.
In wireless communications, CSI describes how a signal propagates from the sender to the receiver and represents the combined effect of reflection, scattering, fading, and power attenuation over distance [34]. Let X(f, t) and Y(f, t) be the frequency-domain representations of the transmitted and received signals with carrier frequency f at time t. Then, the relationship between the transmitted signal and the received signal can be expressed as [14]:

$$Y(f, t) = H(f, t) \times X(f, t), \tag{1}$$

where H(f, t) is the channel frequency response (CFR) of the carrier with frequency f at time t. The CSI is composed of the CFRs corresponding to the different frequency subcarriers for each antenna. Each CSI includes the amplitude and phase of each subcarrier in the orthogonal frequency division multiplexing (OFDM) link. Each CSI can be represented as follows [14]:

$$H(k) = \|H(k)\| \, e^{j \angle H(k)}, \tag{2}$$

where H(k) is the CSI of the k-th subcarrier, and ‖H(k)‖ and ∠H(k) are the amplitude and the phase of the k-th subcarrier, respectively. They represent the important characteristics of CSI.
Using the CSI tool released by Halperin et al. [24], the raw CSI data can be obtained from each received data packet of a commercial Wi-Fi network interface card (NIC). The amplitude and phase of the CSI on subcarrier k sampled at time i can be obtained from the following equations:

$$\|H_i(k)\| = \sqrt{\mathrm{Re}^2 + \mathrm{Im}^2}, \tag{3}$$

$$\angle H_i(k) = \arctan\left(\frac{\mathrm{Im}}{\mathrm{Re}}\right), \tag{4}$$

where Re and Im are the real and imaginary parts of the CSI on subcarrier k sampled at time i. Thus, each subcarrier of the CSI provides amplitude and phase information that can be calculated at any time.
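As a minimal sketch of the amplitude and phase computation described above, the following NumPy snippet derives both quantities from complex CSI values. The function name `amplitude_and_phase` and the sample values are hypothetical, and the use of `np.arctan2` (which resolves the quadrant and tolerates a zero real part) in place of a plain arctangent is an assumption rather than the authors' stated implementation.

```python
import numpy as np

def amplitude_and_phase(csi):
    """Return the amplitude and phase (radians) of complex CSI samples."""
    csi = np.asarray(csi, dtype=np.complex128)
    # Amplitude: sqrt(Re^2 + Im^2), identical to np.abs(csi)
    amplitude = np.sqrt(csi.real ** 2 + csi.imag ** 2)
    # Phase: arctan(Im / Re), computed with arctan2 so the quadrant is kept
    # and Re = 0 does not cause a division by zero (assumption, see lead-in)
    phase = np.arctan2(csi.imag, csi.real)
    return amplitude, phase

# Hypothetical CSI values for two subcarriers at one sampling instant
sample = np.array([3 + 4j, -1 + 1j])
amp, ph = amplitude_and_phase(sample)
print(amp)  # amplitudes 5 and sqrt(2)
print(ph)   # phases in radians
```

The phase returned here is wrapped to (-π, π]; no unwrapping is applied, which matches the statement elsewhere in this article that the CSI fed to Wi-SignFi is not unwrapped.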

SignFi Dataset.
The SignFi dataset contains CSI traces of 276 sign gestures that are commonly used in daily life. The dataset was gathered with a receiver with one internal antenna and a transmitter with three external antennas. Figure 1 shows a schematic diagram of the laboratory and home environments. As shown in Figure 1, the user does not stand in the line-of-sight (LOS) path between the Wi-Fi transmitter (AP) and the receiver (STA). Compared with LOS signals, the non-line-of-sight (NLOS) signals reflected by human movement are much weaker, which makes sign language gestures more difficult to recognize. For the home and lab environments, the distance between STA and AP is 1.30 m and 2.30 m, respectively. In the home environment, the transmitting antenna array is orthogonal to the main transmission and receiving directions. In the laboratory environment, however, the angle between the transmit antenna array and the direct path differs by about 40 degrees. It can be seen from Figure 1 that the layout of the laboratory environment is more complex than that of the home environment. There is a substantial difference between these two environments, which results in completely different CSI signals being received for the same gesture.
The SignFi dataset contains two parts. The first part contains 8,280 instances divided into 276 gesture categories. Among them, 5,520 instances are from the laboratory environment and 2,760 instances are from the home environment, that is, 20 and 10 instances of each gesture in the lab and home environments, respectively. The second part includes 7,500 instances in 150 gesture categories, which means there are 50 instances of each gesture, with 10 instances of each gesture from each user. The first part of the dataset was collected from one user, and the second part was collected from five users. The statistics of the SignFi dataset are shown in Table 1.
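The instance counts above follow directly from the number of gestures, users, and repetitions. The short check below reproduces them; the variable names are illustrative only.

```python
# Instance counts implied by the dataset description
lab276 = 276 * 20        # one user, laboratory: 20 instances per gesture
home276 = 276 * 10       # one user, home: 10 instances per gesture
lab150 = 150 * 5 * 10    # five users, 10 instances per gesture per user

print(lab276, home276, lab276 + home276)
print(lab150)
```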

System Overview.
The flow chart of the proposed multitask sign language recognition method is shown in Figure 2. As shown in Figure 2, we first obtain the raw CSI data from the SignFi dataset. The magnitude and phase of each raw CSI sample are extracted, normalized, and transformed into a CSI image of size 200 × 60 × 3, as described in [3,31]. Unlike the abovementioned literature, we do not denoise or unwrap the amplitude and phase information. Figure 2(a) shows that we resize each 200 × 60 × 3 CSI image to 224 × 224 × 3 and apply RandomVerticalFlip data augmentation. RandomVerticalFlip only increases the diversity of samples during the training phase; it does not increase the number of data samples. Figure 2(b) shows that flattening the third dimension of a 200 × 60 × 3 color CSI image yields a 200 × 180 grayscale CSI image. The number of input channels in the first layer of the Wi-SignFi framework depends on the depth of the input CSI image.
After these steps, the CSI image can be fed into the Wi-SignFi framework for tasks such as sign word, user, and environment recognition. Our experiments used nonrepetitive 5-fold cross-validation, which is consistent with SignFi [3]. As can be seen from Table 1, the Lab150 dataset was collected from five users and the Lab + Home276 dataset from two environments. In the Wi-SignFi framework, the CNN module recognizes sign language gestures on the whole SignFi dataset, while the K-nearest neighbor (KNN) module performs user recognition on the Lab150 dataset and environment recognition on the Lab + Home276 dataset.
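A 5-fold split of the kind mentioned above can be sketched as follows. This is an illustration, not the authors' code: it assumes that "nonrepetitive" means each instance appears in exactly one test fold, and the function name `five_fold_indices`, the shuffling seed, and the use of the Lab150 size (7,500 instances) are choices made for the example.

```python
import numpy as np

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # shuffle once
    folds = np.array_split(order, n_folds)    # disjoint folds
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k]
        )
        yield train_idx, test_idx

# Example with the Lab150 size: each fold trains on 6000 and tests on 1500
splits = list(five_fold_indices(7500))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```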

Wi-SignFi Framework.
The proposed Wi-SignFi framework consists of an eight-layer CNN and a KNN module, as shown in Figure 3. When the input is a 224 × 224 × 3 CSI color image, the convolution kernels in the first layer of Wi-SignFi require 3 input channels. When the input is a 200 × 180 CSI image, the first convolutional layer of Wi-SignFi needs only one input channel. To prevent the loss of features as the network deepens, the Wi-SignFi network adopts a shortcut structure. The shortcut branch includes 1 × 1 convolution kernels and batch normalization (BN). A concatenation fusion is applied to the input of the last convolutional layer, which fuses multilevel image features and reduces the loss of important information during convolution. The concatenation branch adopts 3 × 3 max-pooling, which reduces the data dimension, enlarges the local receptive field, and improves the translation invariance of features.
The CNN module in the Wi-SignFi framework includes seven convolutional layers and one fully connected layer. The CNN module recognizes sign language gestures, and the KNN module identifies different users or environments. The CNN module covers all datasets, while the KNN module is limited to the Lab150 and Lab + Home276 datasets. The KNN module reuses the feature maps extracted by the CNN module, so no features are extracted manually. Therefore, Wi-SignFi is a lightweight and end-to-end model.
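To illustrate how a KNN module can classify users from features shared by a CNN, the sketch below runs a plain NumPy K-nearest-neighbor vote on feature vectors. Everything concrete here is an assumption made for the example: the features are synthetic stand-ins for CNN feature maps, and the feature dimension (64), the value k = 3, and the Euclidean distance are not taken from the article.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Classify each query by majority vote among its k nearest training features."""
    # Squared Euclidean distances, shape (n_query, n_train)
    diffs = query_feats[:, None, :] - train_feats[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest
    votes = train_labels[nearest]                # labels of those neighbors
    return np.array([np.bincount(row).argmax() for row in votes])

# Synthetic stand-in for flattened CNN features of five users (hypothetical)
rng = np.random.default_rng(0)
n_users, feat_dim = 5, 64
centers = 5.0 * rng.normal(size=(n_users, feat_dim))
train_labels = np.repeat(np.arange(n_users), 40)
test_labels = np.repeat(np.arange(n_users), 10)
train_feats = centers[train_labels] + rng.normal(size=(train_labels.size, feat_dim))
test_feats = centers[test_labels] + rng.normal(size=(test_labels.size, feat_dim))

pred = knn_predict(train_feats, train_labels, test_feats, k=3)
accuracy = float(np.mean(pred == test_labels))
print(accuracy)
```

Because the classifier only consumes feature vectors, the same routine would serve for environment recognition by swapping the user labels for environment labels.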

Network Training and Test Settings.
We performed all sign language recognition experiments on a PC equipped with an Intel Xeon E5-2650 v3 CPU @ 2.30 GHz and a GeForce RTX 2080 GPU with 8 GB of memory. We used the adaptive moment estimation (Adam) optimization algorithm with an initial learning rate of 0.0001 to train the network and update the weights and biases. The batch size is set to 16, and the number of training epochs is 250. We choose the rectified linear unit (ReLU) as the network activation. The experiments adopt nonrepetitive 5-fold cross-validation and follow the SignFi training and evaluation scheme. In other words, the ratio of training samples to test

Evaluation of Different Input Sizes.
The raw CSI data are stored in .dat files (xxx.dat). As described in [24], CSI samples are extracted from the raw CSI data using the Linux CSI Tool. Each CSI sample is a set of complex numbers.
Equations (3) and (4) are applied to each CSI sample to obtain its magnitude and phase. The SignFi dataset, however, is distributed as .mat files (xxx.mat) that already contain the extracted data. Generally, there are three input data sizes for the amplitude and phase information: (1) 200 × 60 × 3, (2) 224 × 224 × 3, and (3) 200 × 180. Figure 4 shows the combined matrices of the three input sizes for the sign "CONTINUE" in the laboratory environment.
The SignFi dataset provides CSI for only 30 subcarriers. The receiver, which has one internal antenna, simultaneously receives three sets of CSI data from the transmitter, which has three external antennas. There are 200 CSI instances for each sign gesture. Therefore, the size of each CSI matrix in SignFi is 200 × 30 × 3. Amplitude and phase information is obtained from the raw CSI measurements of the SignFi dataset, then combined and reshaped into combined matrices of size 200 × 60 × 3, as shown in Figure 4(a). The Y-axis of Figure 4(a) represents the 200 CSI instances of each gesture. The first half (0-29) of the X-axis of Figure 4(a) is the amplitude information, and the second half (30-59) is the phase information.
The color channels of Figure 4(a) correspond to the three antenna signals. We resized the height and width of the 200 × 60 × 3 combined matrix to 224 to obtain the 224 × 224 × 3 combined matrix, as shown in Figure 4. Next, we evaluate the impact of different input data sizes on model recognition performance. The evaluations for different input sizes are shown in Table 2.
It can be seen from Table 2 that the recognition accuracy of the Wi-SignFi model is lowest when the input resolution is 200 × 60 × 3. The recognition accuracy increases with the input resolution. Compared with the results for 224 × 224 × 3 input data, the results for 200 × 180 input data are only slightly lower, except on the Lab150 dataset. Input data of 200 × 180 resolution is suitable for multiuser datasets such as Lab150. We conclude that the recognition accuracy of the model is easily affected by the resolution of the input data. When the resolution of the input data increases, the training time of the model increases accordingly, as shown in Figure 5.
To visualize the impact of increased resolution on training time, we express the training time at each resolution as a percentage of the Wi-SignFi (224 × 224 × 3) benchmark. Based on Figure 5 and Table 2, we conclude that Wi-SignFi (200 × 180) balances training time and recognition accuracy.
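The construction of the three input sizes can be sketched in NumPy as follows. This is a sketch under stated assumptions: the min-max normalization, the side-by-side ordering of antennas in the 200 × 180 matrix, and the nearest-neighbor resize are illustrative choices, since the article does not specify them; the function names and the random CSI array are hypothetical.

```python
import numpy as np

def min_max(x):
    """Scale an array to [0, 1]; a constant array maps to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def build_inputs(csi):
    """csi: complex array of shape (200, 30, 3) = (time, subcarrier, antenna)."""
    amplitude = min_max(np.abs(csi))
    phase = min_max(np.angle(csi))          # wrapped phase, no unwrapping
    # Amplitude in columns 0-29, phase in columns 30-59; antennas as channels
    color = np.concatenate([amplitude, phase], axis=1)            # (200, 60, 3)
    # Place the three antenna channels side by side (assumed ordering)
    gray = np.concatenate(
        [color[:, :, a] for a in range(color.shape[2])], axis=1
    )                                                              # (200, 180)
    return color, gray

def resize_nearest(img, height=224, width=224):
    """Nearest-neighbor resize of the first two axes (assumed interpolation)."""
    rows = np.arange(height) * img.shape[0] // height
    cols = np.arange(width) * img.shape[1] // width
    return img[rows][:, cols]

# Hypothetical CSI matrix standing in for one SignFi gesture instance
rng = np.random.default_rng(0)
csi = rng.normal(size=(200, 30, 3)) + 1j * rng.normal(size=(200, 30, 3))
color, gray = build_inputs(csi)
resized = resize_nearest(color)
print(color.shape, gray.shape, resized.shape)
```

Under this layout the 200 × 60 × 3 and 200 × 180 forms hold the same 36,000 values arranged differently, whereas the 224 × 224 × 3 form holds 150,528 values.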

Impact of Data Augmentation.
The data augmentation used in this article applies RandomVerticalFlip to the CSI image input data without increasing the number of data samples. Figure 6 shows the impact of the RandomVerticalFlip operation on the gesture recognition performance of Wi-SignFi.
It can be seen from Figure 6 that RandomVerticalFlip has little effect on the gesture recognition accuracy of Wi-SignFi on the Home276, Lab276, and Lab + Home276 datasets. However, it improves the gesture recognition accuracy of Wi-SignFi by more than 3% on the Lab150 dataset. The Lab150 dataset was collected from five different users. Therefore, we believe that RandomVerticalFlip helps to improve the sign language recognition accuracy of the Wi-SignFi model for multiple users but has little effect on single-user sign language recognition accuracy.
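The flip operation discussed in this section can be sketched in NumPy as shown below. The name RandomVerticalFlip matches a torchvision transform whose default flip probability is 0.5; whether the authors used that library or that probability is an assumption, and the helper `random_vertical_flip` is hypothetical.

```python
import numpy as np

def random_vertical_flip(image, rng, p=0.5):
    """Flip the image top-to-bottom with probability p, else return it unchanged."""
    if rng.random() < p:
        # Reverse the row axis; for a CSI image this reverses the time order
        return image[::-1].copy()
    return image

rng = np.random.default_rng(0)
image = np.arange(200 * 180, dtype=np.float32).reshape(200, 180)

# Applied on the fly during training: the dataset size stays the same,
# but the same instance may be seen flipped in some epochs
augmented = [random_vertical_flip(image, rng) for _ in range(4)]
print(len(augmented), augmented[0].shape)
```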

Comparison of Existing Sign Language Recognition Models on the SignFi Dataset.
There are five existing sign language recognition models for the SignFi dataset. We summarize the comparison of the different sign language recognition techniques on the SignFi dataset in Table 3, which covers results published from 2018 to 2021. The models in this literature can be divided into three categories: CNN, LSTM, and SVM. LSTM [32] needs only amplitude values to achieve good recognition performance, except on the Lab150 dataset. In contrast, HOS-Re [33] achieves good performance on the Lab150 dataset by manually extracting sign language gesture features from CSI traces and then using an SVM as the classifier.
The other methods are all CNN methods. The input data modality of the Wi-SignFi model is the same as that of CNN [31] and SignFi [3]: a concatenation of amplitude and phase information. The input data of the dual-output two-stream network with ResNet50 [4] contains not only amplitude and phase information but also difference information that captures the gesture motion. The input data for all of the abovementioned CNN models are preprocessed, except for the Wi-SignFi model. The input data resolution of CNN [31] and SignFi [3] is 200 × 60 × 3, while that of the dual-output two-stream network with ResNet50 [4] is 224 × 224 × 3. CNN [31] achieved the best recognition results on Lab276 and Lab + Home276, 99.855% and 99.73%, respectively. However, Wi-SignFi outperforms the other methods on the Home276 and Lab150 datasets, with 99.75% and 96.41%, respectively.
Meanwhile, Wi-SignFi ranks first in average accuracy over the four datasets, at 98.91%. The Lab + Home276 dataset with mixed multienvironment data and the Lab150 dataset with multiuser data caused a significant drop in the recognition accuracy of the SignFi model. In contrast, our model maintains good performance, which indicates that it has a certain generalization ability in complex environments.

Comparison with Existing Neural Networks.
Wi-SignFi is a CNN with only eight layers, so we also chose some lightweight neural networks for comparison. The input data resolution for these lightweight models is fixed at 224 × 224 × 3. Since the CSI data is very different from the ImageNet data, the existing neural networks are trained from scratch. The evaluation results of existing neural networks on sign language recognition are shown in Table 4.
According to Tables 3 and 4, we conclude that Wi-SignFi is well suited to the SignFi dataset. For small-sample data like SignFi, the sign language gesture recognition model does not need to be very deep. Lightweight networks designed for mobile terminals, such as ShuffleNet [35], MnasNet [36], and MobileNet [37], do not recognize the SignFi dataset well.

Classification Results for Users and Environments.
In the multitask Wi-SignFi framework, we use SVM, random forest (RF), and KNN methods for user identification on the Lab150 dataset and environment identification on the Lab + Home276 dataset. It can be seen from Table 5 that there is little difference in the classification accuracy of users and environments between Wi-SignFi (200 × 180) and Wi-SignFi (200 × 60 × 3). The environment recognition accuracy of SVM, RF, and KNN on the Lab + Home276 dataset is very high. A likely reason is that the CSI data collected in the home environment and the laboratory environment are significantly different and easy to distinguish. This assumption is consistent with that described in [4]. For user recognition on the Lab150 dataset, KNN achieves the best result, with an accuracy of 86.68%. We believe that the same gestures made by different people have a certain similarity.

Cross-Domain Sensing Based on Different Scenarios.
Reference [28] points out that the mapping between gestures and CSI data is not unique, which differs from traditional gesture image data. The CSI data generated by the same gesture can vary greatly with person, location, orientation, and scenario. To evaluate the general performance of the Wi-SignFi model in different scenarios, the target domain contains data from only one of the home or laboratory environments. Figure 7(a) shows that when only Lab276 samples are used as the training dataset, the recognition accuracy on Home276 is only 0.36%. According to references [4,26,28], the poor results may be attributed to the significant difference between the CSI data of the two environments. Inspired by the few-shot learning method [30], a few samples from the target domain are added to the source domain. Adding one Home276 sample per gesture to the training dataset (5,520 + 276 instances) raises the test accuracy to 85.39%. With more Home276 data samples in the training set, the recognition accuracy of the test dataset
The bold values given in Table 3 represent the best action recognition results for each of the four datasets.
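The few-shot mixing described in this section, where a small number of target-domain samples per gesture are added to the source-domain training set, can be sketched as follows. The helper `pick_target_shots`, the random selection, and the choice to exclude the moved samples from the test set are assumptions made for illustration; the article does not state how the samples were chosen or evaluated.

```python
import numpy as np

def pick_target_shots(target_labels, shots_per_class, seed=0):
    """Select `shots_per_class` target-domain indices for every gesture class.

    Returns (picked, held_out): indices moved into training and indices kept
    for testing.
    """
    rng = np.random.default_rng(seed)
    picked = []
    for gesture in np.unique(target_labels):
        candidates = np.flatnonzero(target_labels == gesture)
        chosen = rng.choice(candidates, size=shots_per_class, replace=False)
        picked.extend(chosen.tolist())
    picked = np.array(sorted(picked))
    held_out = np.setdiff1d(np.arange(target_labels.size), picked)
    return picked, held_out

# Label layouts matching the dataset sizes: Lab276 as source, Home276 as target
lab_labels = np.repeat(np.arange(276), 20)    # 5520 instances
home_labels = np.repeat(np.arange(276), 10)   # 2760 instances

picked, held_out = pick_target_shots(home_labels, shots_per_class=1)
train_size = lab_labels.size + picked.size    # 5520 + 276
print(train_size, held_out.size)
```

Raising `shots_per_class` moves more Home276 instances into the training set, which is the knob varied in the experiment described above.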

Wi-SignFi Framework Deployed on a Jetson Nano Device
Jetson Nano is a single-board computing platform from Nvidia [38]. Figure 8 shows Wi-SignFi (200 × 180) deployed on a Jetson Nano device for gesture and user recognition on the Lab150 dataset. After saving the Wi-SignFi (200 × 180) model parameters, we randomly select 1,500 CSI instances from the Lab150 dataset for testing. The results obtained by averaging five tests of our model on the Jetson Nano are shown in Figure 8.
As shown in Figure 8, although our model uses almost the full 4 GB of RAM on the Jetson Nano device, it achieves an average inference speed of 27 CSI instances per second. Gesture recognition accuracy for the Wi-SignFi framework running on the Jetson Nano dropped slightly, while user recognition accuracy improved. Overall, the Wi-SignFi framework performed well when deployed on an embedded device.
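An inference speed in instances per second can be measured with a simple timing loop such as the one below. This is a generic sketch: `instances_per_second` is a hypothetical helper, and the built-in `sum` stands in for the real model's prediction call, which is not available here.

```python
import time

def instances_per_second(predict, samples):
    """Run predict on every sample and return the measured throughput."""
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed

# Stand-in workload: 1500 dummy instances and a trivial "model"
dummy_samples = [[0.0] * 1000 for _ in range(1500)]
rate = instances_per_second(sum, dummy_samples)
print(rate)

# At the reported 27 instances per second, 1500 instances take about 55.6 s
seconds_at_27 = 1500 / 27
print(round(seconds_at_27, 1))
```

Averaging the returned rate over several runs, as done with five tests in the text, smooths out timing noise from other processes on the device.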

Conclusions
In this article, we implement a multitask sign language recognition system based on the Wi-SignFi framework. The system recognizes not only gestures but also the user and the environment to which each gesture corresponds. Experimental results on the SignFi datasets are as follows:
(1) The CSI data fed to Wi-SignFi does not require tedious preprocessing such as denoising and unwrapping, yet our model still outperforms previous works on gesture recognition.
(2) For the Lab + Home276 dataset with mixed multienvironment data and the Lab150 dataset with multiuser data, our model keeps acceptable accuracy, which indicates that it has a certain generalization ability in complex environments. Meanwhile, we used KNN to perform environment and user recognition on these two datasets, respectively.
(3) According to the experimental results, the gesture recognition accuracy of the model is greatly affected by the resolution of the input data. Although Wi-SignFi (224 × 224 × 3) achieves the best results, its training time is greatly increased. The experimental results show that Wi-SignFi (200 × 180) balances training time and recognition accuracy.
(4) Wi-SignFi is a lightweight and end-to-end model. Wi-SignFi on the Jetson Nano device achieves an inference speed of 27 CSI instances per second. Our proposed sign language recognition system is expected to integrate with the IoT to improve the lives of the deaf community.
Our proposed Wi-SignFi has some limitations, which will guide our future research. Firstly, Wi-SignFi is a lightweight CNN model that is suitable for small-scale datasets and cannot learn the complex temporal dynamics of gesture actions. Combining Wi-SignFi with an LSTM or a transformer would be an effective way to predict gesture activity.
Secondly, cross-domain wireless sensing is a difficult topic and has long been a research hotspot. In our experiments, we evaluate the cross-domain recognition performance of Wi-SignFi via the few-shot learning method. In the future, we expect to remove all target-domain samples from training to advance domain adaptation.
Finally, existing works tend to map the CSI data from the three antennas on the Wi-Fi transmitter to three RGB channels and then convert them to CSI color images. In contrast, we extend the CSI data from the different antennas into a single-channel grayscale image as the model input and obtain satisfactory gesture recognition accuracy. We will investigate the correlation between transmitting or receiving antennas in future work. Additionally, we will actively explore more efficient ways to acquire CSI data and real-time applications of deep learning on embedded devices.

Data Availability
Previously reported SignFi data were used to support this study and are available at https://yongsen.github.io/SignFi/. These prior studies (and datasets) are cited at relevant places within the text as reference [3].