Cross-Individual Gesture Recognition Based on Long Short-Term Memory Networks

Laboratory for Embedded System and Intelligent Robot, Wuhan University of Science and Technology, Wuhan 430000, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; School of Engineering and Technology, China University of Geoscience (Beijing), Beijing 100083, China; Key Lab of Industrial Computer Control Engineering of Hebei Province, Yanshan University, Qinhuangdao 066000, China; Anhui Province Key Laboratory of Special Heavy Load Robot, Anhui University of Technology, Ma’anshan 243000, China; Faculty of Information and Technology, Beijing University of Technology, Beijing 100124, China


Introduction
The core task of sEMG-based prosthetic hand control and human-computer interaction applications is to decode human motion intentions from sEMG signals [1]. Human motions can be regarded as a combination of discrete and continuous motions [2]. Therefore, research on sEMG-based motion recognition is generally divided into two types of questions. One is identifying discrete motion states of limbs from sEMG signals, such as flexing and extending the fingers, making a fist, and turning the wrist [3][4][5]. The other is using sEMG signals to estimate continuous changes in joint motion, such as joint torque and joint angle [6].
This paper addresses only the recognition of discrete limb movements from sEMG signals. Discrete action pattern classification is currently the most mature and fruitful method in the field of sEMG-based human action recognition. Representative research work includes the following. Englehart et al. [7] compared the effects of different sEMG features on the accuracy of gesture classification and were the first to use linear discriminant analysis (LDA) for action recognition on the time-frequency domain features of sEMG, accurately identifying 6 types of hand/wrist movements. Huang et al. [8] proposed a cascaded neural network with a feature self-organizing mapping mechanism to recognize 8 grasping movements, which improved the classification accuracy and performed better than models such as K nearest neighbours and the back propagation neural network (BPNN). Chan and Englehart [9] proposed using the hidden Markov model to recognize 6 upper limb movements, with a recognition accuracy of 94.6%. Liu et al. [10] combined generalized discriminant analysis with a support vector machine (SVM) to construct a new classifier for fast and high-precision classification of sEMG, with a recognition accuracy of up to 92%. Chu et al. [11] explored efficient feature projection methods and used LDA combined with a multilayer perceptron to recognize 9 kinds of hand/wrist movements and realize the online control of prosthetic hands.
Although research on discrete action classification has been fruitful, it does not distinguish between the personal habits and physiological differences of the experimental subjects, which leads to greater randomness in the experimental results. The abovementioned studies achieved relatively ideal accuracy, but they all used data from the same individuals for both the training set and the test set, and their main focus was on designing a suitable classification model and selecting a feature set for classification. None of the gesture recognition studies surveyed in this paper considered that individual information such as behavior habits and exercise level may have a negative impact on gesture recognition. The composition of human hand muscles is basically the same, but the degree of development of muscle tissue differs between subjects [12,13]. Therefore, the collected sEMG signals also differ, which naturally has a negative impact on gesture recognition. Based on this analysis, this paper trains the classifier on data from several individuals performing the same gestures and then tests its performance on data from an individual not included in the training dataset. This is the cross-individual gesture recognition problem studied in this paper.
Niu et al. [14] proposed a dual-network structure model that reduces the negative impact of individual shape information on the detection of facial action units by simultaneously learning the local relationship information of the face and the individual-specific shape information and orthogonalizing them. Inspired by [14], we consider the key to cross-individual gesture recognition to be that the gesture recognition model learns gesture features and individual information features at the same time and orthogonalizes them. Based on this idea, this paper proposes the cross-individual long short-term memory (CI-LSTM) model based on long short-term memory (LSTM). To minimize the negative impact of different individual information on the accuracy of gesture recognition, CI-LSTM orthogonalizes gesture features and individual information features by learning gesture labels and individual labels at the same time. Learning both kinds of labels at the same time is impossible for the traditional single-network structure model, because such a model can only learn one type of feature at a time. Therefore, to complete the task of cross-individual gesture recognition, CI-LSTM is designed as a dual-network structure model based on LSTM. Furthermore, this paper realizes the online control of prosthetic hands based on CI-LSTM.

CI-LSTM.
2.1.1. Model Structure. Gesture recognition based on sEMG can be naturally defined as a pattern classification problem, in which the classifier is usually trained through supervised learning. It is generally believed that the instantaneous value of the EMG signal is useless for gesture recognition [15][16][17].
This view is based on an empirical assumption because the original EMG signal of each channel is nonstationary, nonlinear, random, and unpredictable [18][19][20]. LSTM can remember past information through its gating mechanism and is recognized as good at processing time series signals [21]. Therefore, the CI-LSTM model we propose is designed based on the LSTM model. In a neural network, long-term memory can be regarded as the network parameters, which implicitly encode characteristic information in the data, and its update cycle is much longer than that of short-term memory. In LSTM, the memory unit can capture key information at a certain moment and store it for a certain time interval. The period for which the memory unit stores information is longer than that of short-term memory but much shorter than that of long-term memory.
Therefore, it is called long short-term memory [22]. The LSTM network structure diagram is shown in Figure 1(a). The sEMG signal is essentially time series data [23], so CI-LSTM is based on the LSTM model, which is good at processing time series signals.
The CI-LSTM model has a dual-network structure consisting of a gesture recognition module and an individual information recognition module. Both modules contain an LSTM network, a hidden layer, and a fully connected (FC) layer. The overall structure diagram of CI-LSTM is shown in Figure 1. In the offline training phase, the signals input to the two modules are the same, but the gesture recognition module learns only the gesture labels, and the individual recognition module learns only the individual information labels. The hidden layer of the gesture recognition module is responsible only for extracting the features related to gesture recognition, regardless of the classification of individual information; the extracted features are then input to the fully connected layer, which outputs the probability model of gesture categories $H = \{h_1, h_2, h_3, h_4, \ldots, h_m\}$. The hidden layer of the individual recognition module is responsible only for extracting the features of individual information, regardless of the classification of gestures; the extracted features are then input to the fully connected layer, which outputs the probability model of individual categories $P = \{p_1, p_2, p_3, \ldots, p_n\}$. Among them, only the probability model $H$ is output in the online testing phase, whereas the probability model $P$ only assists the training of the gesture recognition module in the offline training phase. The main function of the individual probability model is to enable the individual recognition module to extract the characteristics of individual information better; it is not an output of the model.

Realization of Cross-Individual Function.
To minimize the negative impact of individual information differences on the accuracy of gesture recognition, this paper implements the cross-individual function of the CI-LSTM model through the design of its loss function; that is, based on the principle of cosine similarity, the orthogonality of gesture features and individual information features is achieved through inner product calculations. The principle of cosine similarity is to evaluate the similarity of two vectors by calculating the cosine of the angle between them [24]. The angle can therefore be used to judge the similarity of the vectors: the smaller the angle between two vectors, the more similar they are. The cosine similarity formula is shown as follows:

$$\cos\theta = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\,\sqrt{\sum_{i} v_i^2}}$$

where $u_i$ and $v_i$ are the components of the two vectors whose cosine similarity is required.
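As a minimal sketch of the cosine similarity principle described above, the following Python function computes the cosine of the angle between two vectors. The function name `cosine_similarity` and the example vectors are illustrative assumptions and do not come from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Illustrative vectors (not data from the paper).
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular
```

Vectors pointing in the same direction give a value close to 1, while orthogonal vectors give 0, which is the condition the loss design aims to approach for the two feature vectors.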
In the actual calculation process, the inner product between two vectors can be approximately regarded as a simplified calculation of the cosine similarity between them. This paper uses this idea to realize the cross-individual function of the CI-LSTM model. The absolute value of the inner product between the hidden layer state vectors of the gesture recognition module and the individual recognition module is defined as the generalization loss $L_{gen}$, which is a part of the loss function of CI-LSTM. The calculation formula is shown as follows:

$$L_{gen} = \left| f_{gesture} \odot f_{people} \right|$$

where $f_{gesture}$ is the feature extracted by the gesture recognition module, $f_{people}$ is the feature extracted by the individual recognition module, and $\odot$ represents the inner product of two vectors. The loss function of CI-LSTM consists of three parts: the gesture classification loss $L_{hand}$ obtained by the gesture recognition module, the generalization loss $L_{gen}$ defined above, and the individual information classification loss $L_{peo}$ obtained by the individual recognition module. The loss function formula is shown as follows:

$$L = \mu L_{hand} + \lambda L_{gen} + \omega L_{peo}$$

where $\mu$, $\lambda$, and $\omega$ are the balance hyperparameters.
In the process of model iteration, the total loss continues to decrease, which means that $L_{gen}$ also decreases. According to the principle of cosine similarity, this indicates that the similarity of the feature vectors extracted by the two modules is reduced; that is, the feature vector extracted by the gesture recognition module becomes more strongly correlated with the gesture itself, and the feature vector extracted by the individual recognition module becomes more related to individual information. The goal of this design is to realize the cross-individual gesture recognition function of CI-LSTM. The gesture characteristics learned by the gesture recognition module and the individual information characteristics learned by the individual recognition module tend to be orthogonal, which reduces the negative impact of individual information differences on the accuracy of gesture recognition as much as possible.
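The three-part loss described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say which classification loss is used, so cross-entropy is assumed here; the pairing of λ with the generalization loss and ω with the individual classification loss follows the order in which the terms are listed in the text; and the function names and example values are hypothetical. The default weights are the values the paper reports choosing by experience.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability of the true class (assumed classification loss)."""
    return -math.log(probs[label])

def generalization_loss(f_gesture, f_people):
    """Absolute inner product of the two hidden-state feature vectors (L_gen)."""
    return abs(sum(g * p for g, p in zip(f_gesture, f_people)))

def ci_lstm_loss(h, gesture_label, p, person_label, f_gesture, f_people,
                 mu=1.0, lam=0.8, omega=0.2):
    """Weighted sum of gesture loss, generalization loss, and individual loss.

    The weight-to-term pairing is an assumption based on the listing order.
    """
    l_hand = cross_entropy(h, gesture_label)
    l_peo = cross_entropy(p, person_label)
    l_gen = generalization_loss(f_gesture, f_people)
    return mu * l_hand + lam * l_gen + omega * l_peo

# Illustrative values only (not taken from the paper's experiments).
h = [0.7, 0.1, 0.1, 0.1]   # gesture probabilities, m = 4
p = [0.5, 0.25, 0.25]      # individual probabilities, n = 3
loss_orthogonal = ci_lstm_loss(h, 0, p, 0, [1.0, 0.0], [0.0, 1.0])
loss_overlapping = ci_lstm_loss(h, 0, p, 0, [1.0, 0.0], [0.5, 1.0])
```

With identical classification outputs, the overlapping feature vectors are penalized more than the orthogonal ones, so minimizing the total loss pushes the two feature vectors toward orthogonality.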

Experimental Design.
We selected four gestures: fist, spread, thumb and index finger side pinch (TISP), and pinch with three fingers (PWTF). Among them, fist and spread are the most basic gestures, TISP is used to hold small flat objects such as keys and cards in daily life, and PWTF is used to hold most other small objects (see Figure 2). The sEMG sensors were placed at four positions on the forearm of the participant's preferred hand: flexor carpi radialis (FCR), flexor digitorum superficialis (FDS), extensor carpi radialis (ECR), and extensor digitorum communis (EDC), as shown in Figure 3. All individuals participating in the collection experiment adopted the same fixed arm posture, alternately performing the designated actions and maintaining a resting state for 2 seconds between actions. We adopted this protocol to reduce the influence of other factors on the experiment.

Active Segment Extraction.
To improve the recognition accuracy of gestures, active segment extraction is necessary after the input signal is preprocessed. The active segment of sEMG directly describes the signal changes during a single muscle movement; that is, the collected signal is divided into forceful motion segments and relaxed resting segments to determine the start and the end of each gesture. Research in recent years has produced a large number of mature active segment extraction algorithms, such as the moving average method, the short-time Fourier method, and the entropy-based method. Considering that real-time response and high efficiency are critical to the user experience of prosthetic hands, this paper chooses the simple and effective moving window method based on threshold judgment to detect the active segment and label it with the corresponding gesture. First, a threshold is set for the preprocessed sEMG signal. The threshold is generally set to 30% of the maximum amplitude in the set of experimental data and is used to find the starting point and ending point of all active segments. Because the muscles may contract continuously between a motion segment and a resting segment, it is also necessary to calculate the distance between each pair of starting point and ending point. After conversion to time, pairs of starting and ending points with an interval greater than 1.5 s are regarded as valid, and all the data between the valid points are marked with the corresponding gesture label. The extraction experiments on FIST show that the selected method based on threshold judgment works well (see Figure 4).
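The threshold-based procedure above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the moving-average envelope, the window length, the function names, and the synthetic test signal are all assumptions; only the 30% threshold ratio and the 1.5 s minimum duration come from the text.

```python
def moving_average_envelope(signal, window):
    """Mean absolute amplitude over a trailing window (window length is assumed)."""
    envelope = []
    running = 0.0
    for i, x in enumerate(signal):
        running += abs(x)
        if i >= window:
            running -= abs(signal[i - window])
        envelope.append(running / min(i + 1, window))
    return envelope

def extract_active_segments(signal, fs, window=50, ratio=0.3, min_duration=1.5):
    """Return (start, end) sample indices of active segments longer than min_duration."""
    envelope = moving_average_envelope(signal, window)
    threshold = ratio * max(envelope)  # 30% of the maximum amplitude
    segments, start = [], None
    for i, value in enumerate(envelope):
        if value > threshold and start is None:
            start = i                  # candidate starting point
        elif value <= threshold and start is not None:
            if (i - start) / fs > min_duration:
                segments.append((start, i))
            start = None               # shorter bursts are discarded
    if start is not None and (len(envelope) - start) / fs > min_duration:
        segments.append((start, len(envelope)))
    return segments

# Synthetic example: 1 s rest, 2 s contraction, 1 s rest, 0.5 s burst, 1 s rest.
fs = 100  # sampling rate in Hz, chosen only for illustration
signal = ([0.0] * 100 + [1.0, -1.0] * 100 + [0.0] * 100
          + [1.0, -1.0] * 25 + [0.0] * 100)
segments = extract_active_segments(signal, fs, window=10)
```

In this synthetic signal only the 2 s contraction survives the duration check; the 0.5 s burst crosses the threshold but is rejected as too short.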

Assessment Indicators.
To demonstrate the advantages of the proposed classification model, we selected a series of evaluation indicators as standards for comparison with other algorithms, including Accuracy, Recall, and F1 Score. The expressions are as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, and $\text{Precision} = TP / (TP + FP)$.

We selected four individuals with different exercise levels and behavior habits as the participants in this paper, hereinafter referred to as subjects A, B, C, and D. To verify the reliability of the algorithm, the training set is composed of the data of 3 participants, and the test set is the data of the remaining participant, giving the four combinations ABC_D, ABD_C, ACD_B, and BCD_A (see Table 1). First, the original sEMG signal collected from the 4 subjects is preprocessed by a 10 Hz-300 Hz bandpass filter and a 50 Hz notch filter, and then the active segment of the signal is extracted using 30% of the maximum amplitude of each recording as the threshold. The obtained data are divided into an 80% training set, a 10% validation set, and a 10% test set. The corresponding data are combined according to the grouping method to form the cross-individual experimental dataset. Due to different daily habits, there is inevitably an imbalance in the amount of data obtained when different individuals perform the same gesture, resulting in an imbalance in the data of the training set. This is one of the most common problems in training sets [25]. Because the collected sEMG signal is a time series signal, we choose the undersampling method to solve the category imbalance in the training set. The sample distribution of each group is shown in Figure 5.
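The grouping scheme and the undersampling step described above can be illustrated with the following sketch. The helper names, the sample representation as `(features, gesture_label)` pairs, the choice to balance on the gesture label, and the random seed are assumptions made for illustration; the paper does not specify these details.

```python
import random
from collections import defaultdict

SUBJECTS = ["A", "B", "C", "D"]

def leave_one_subject_out(subjects):
    """Yield (group_name, train_subjects, test_subject) for each held-out subject."""
    for held_out in reversed(subjects):
        train = [s for s in subjects if s != held_out]
        yield "".join(train) + "_" + held_out, train, held_out

def undersample(samples, seed=0):
    """Randomly drop samples so that every class matches the smallest class.

    `samples` is a list of (features, gesture_label) pairs (assumed layout).
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced

groups = [name for name, _, _ in leave_one_subject_out(SUBJECTS)]

# Toy imbalanced training set: 5 FIST samples and 3 SPREAD samples.
toy = [(i, "FIST") for i in range(5)] + [(i, "SPREAD") for i in range(3)]
balanced = undersample(toy)
```

Here `groups` reproduces the four combinations named in the text, and `balanced` keeps three samples per gesture, the size of the smallest class.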

Cross-Individual Gesture Recognition Experiment.
This paper focuses on verifying the cross-individual gesture recognition ability of CI-LSTM, so the recognition performance for gestures is the key, and the classification loss of individual information $L_{peo}$ is used as a weight regularization term of the proposed model. Accordingly, the hyperparameter $\mu$ is set to 1. The hyperparameters $\lambda$ and $\omega$ are used to balance the overall loss function value of CI-LSTM. Based on experience, this paper chooses $\mu = 1$, $\lambda = 0.8$, and $\omega = 0.2$.
To demonstrate the effectiveness of the proposed model, we compared the classification results of SVM, standard LSTM, CNN-LSTM, and CI-LSTM on the four datasets (see Table 2). Among them, the model structure of CNN-LSTM is similar to that of CI-LSTM, but the individual recognition module of CI-LSTM is replaced with a CNN. Table 3 shows the indicators of the best-performing experimental group for each of the four models on the individual generalization experimental data. In the individual-generalized hand gesture recognition experiments with different data combinations, the proposed model shows better recognition results than the other models. Compared with the single-network structures, the accuracy is improved by 11.88% on average and by up to 32.71%; compared with the CNN-LSTM model, which has the same dual-network structure, the maximum improvement is 6.87%.
In the experiments on the cross-individual dataset, CI-LSTM shows a better recognition effect than the other models. CI-LSTM achieves an average accuracy of 70.58% when faced with test data from an individual not included in the training set. These experiments verify that the CI-LSTM model can orthogonalize the gesture characteristics and the individual information characteristics through the design of the loss function and reduce the negative impact of different individual information on gesture recognition accuracy as much as possible. Compared with the other algorithms in this paper, CI-LSTM effectively improves the accuracy of cross-individual gesture recognition.
According to the experimental results of group BCD_A (see Table 3) and the confusion matrix of each model (see Figure 6), CI-LSTM not only has the highest average recognition accuracy but also improves on CNN-LSTM by 6.87%. Compared with CNN-LSTM and LSTM, CI-LSTM is more sensitive to the two similar gestures FIST and TISP, and the recognition accuracy of FIST is increased by 16%-24%. Although the recall of FIST is not as good as that of SVM, the average accuracy is far better than that of SVM. This may also benefit from the design of the dual-network structure and the loss function. By extracting features that are more related to the gesture itself, the recognition accuracy of gestures that are easily misjudged can be improved.
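To make the relationship between a confusion matrix and the evaluation indicators concrete, the sketch below derives overall accuracy and per-gesture recall, precision, and F1 score from a matrix of counts. The function name and the example matrix are hypothetical; the counts are invented for illustration and are not the results reported in Figure 6 or Table 3.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    metrics = {"accuracy": correct / total}  # fraction of correctly classified samples
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # missed samples of this class
        fp = sum(row[i] for row in confusion) - tp  # other classes predicted as this one
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"recall": recall, "precision": precision, "f1": f1}
    return metrics

# Invented counts: FIST and TISP are confused with each other, as similar gestures may be.
labels = ["FIST", "SPREAD", "TISP", "PWTF"]
confusion = [
    [8, 0, 2, 0],
    [0, 10, 0, 0],
    [2, 0, 8, 0],
    [0, 0, 0, 10],
]
metrics = per_class_metrics(confusion, labels)
```

A per-class view like this shows how a model can have a high overall accuracy while still having a lower recall on one easily confused gesture.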

Online Control Experiment of Prosthetic Hand.
The complete process of prosthetic hand online control includes an offline training module and an online recognition module (see Figure 7). The offline training module is for classifier training. Its first step is to collect the sEMG of the selected subjects. The experiments in this section are based on the four movements FIST, SPREAD, TISP, and PWTF. Each gesture is collected 90 times from each subject. After preprocessing [26,27] and active segment extraction by the method in Section 2.3, the obtained data are divided into 80% as the training set, 10% as the validation set, and 10% as the test set to create a dataset for training the model. The online recognition module is for reproducing gestures. After preprocessing and active segment extraction, the sEMG collected in real time is input to the model trained by the offline training module to obtain the classification result and output the label of the corresponding gesture. According to the obtained label, the corresponding control instruction is sent to the prosthetic hand control system, and the recognized gesture is reproduced by the prosthetic hand. The prosthetic hand selected in this paper can make all the selected movements (see Figure 8).
We take the online control effect of fist recognition as an example for analysis (see Figure 9). First, the subject makes a fist, the sensors placed on the forearm collect the sEMG signal at this time, and the signal acquisition [...] the corresponding motor rotation angle value is transmitted to the prosthetic hand control system, and the prosthetic hand reproduces the fist gesture. According to the experimental results in Section 3.2, the average classification accuracy is 77%. To improve the performance in the experiment, we adopt a voting decision-making method in the prosthetic hand online control system to reduce the influence of misclassification by the network model.
To recognize the gesture at the current moment more accurately, each classification result is not encoded directly into a control instruction; instead, the outputs of the classification model at the preceding or following moments are also taken into account. We then select the gesture with the most occurrences among the specified number of classification results as the output (see Table 4). The average recognition accuracy after the introduction of the decision method is 86.5%, which is 9.5 percentage points higher than the average recognition accuracy of the classification model. In general, as long as the control period from the intention of the prosthetic hand user to the corresponding gesture of the prosthetic hand is kept within 250 ms [28], the user will not feel the operation delay. Therefore, the online recognition speed of the prosthetic hand control system must meet certain requirements. According to the average time statistics, the prosthetic hand control system built in this paper does not yet fully meet the conditions for real-time response, and further improvement is still needed.
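The voting decision idea can be sketched as a sliding-window majority vote over recent classifier outputs. This is an assumed, simplified variant: it looks only at past outputs (so it stays causal for online use), and the class name `MajorityVoter`, the window size, and the example label stream are hypothetical rather than values taken from the paper.

```python
from collections import Counter, deque

class MajorityVoter:
    """Smooth a stream of predicted labels with a sliding-window majority vote."""

    def __init__(self, window=3):
        # Keeps only the most recent `window` predictions (window size is assumed).
        self.history = deque(maxlen=window)

    def update(self, label):
        """Add the newest prediction and return the most frequent recent label."""
        self.history.append(label)
        return Counter(self.history).most_common(1)[0][0]

# Illustrative stream with one isolated misclassification (TISP instead of FIST).
voter = MajorityVoter(window=3)
raw = ["FIST", "FIST", "TISP", "FIST", "FIST"]
smoothed = [voter.update(label) for label in raw]
```

In this example the single TISP output is outvoted by its neighbours, so the command sent to the prosthetic hand would remain FIST throughout. A larger window suppresses more errors but adds delay, which matters given the 250 ms response target mentioned above.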

Conclusion
To overcome the negative impact of individual information differences such as behavior habits and exercise levels on gesture recognition, a cross-individual gesture recognition model based on LSTM, CI-LSTM, is proposed. CI-LSTM has a dual-network structure. Through the design of the loss function, the individual information recognition module assists the training of the gesture recognition module, which drives the gesture features and individual features toward orthogonality and minimizes the impact of individual information differences on gesture recognition. The experimental results verify that CI-LSTM can effectively reduce the influence of individual characteristics and complete the task of cross-individual gesture recognition with better accuracy than the compared models. Although the accuracy of cross-individual gesture recognition does not yet reach that of existing same-individual gesture recognition research, we show that differences in individual characteristics have a measurable impact on gesture recognition. We also affirm the significance and importance of cross-individual gesture recognition.

Data Availability
The data used to support the findings of this study are supplied by Ziming Chen under license and so cannot be made freely available. Requests for access to these data should be made to Ziming Chen (ziming_chen1209@163.com).

Conflicts of Interest
The authors declare that they have no conflicts of interest.