As users' expectations for interactive experience continue to rise, gesture recognition is widely used as a basic form of human-computer interaction. However, the diversity and complexity of gestures, together with environmental factors such as lighting and occlusion, strongly affect recognition performance. To enhance gesture features, this paper first filters hand skin color in the YCbCr color space to separate the gesture region to be recognized and applies a Gaussian filter to suppress noise at the gesture edges; second, it processes the gesture data with a morphological grayscale opening operation, segments the gesture contour with a marker-based watershed algorithm, and enhances the gesture features with an eight-connected filling algorithm; finally, a convolutional neural network with fast convergence is used to recognize the gesture data set. The experimental results show that the proposed method can recognize a variety of gestures quickly and accurately, with an average recognition success rate of 96.46%, without significantly increasing the recognition time.
As technology develops, human-computer interaction methods are gradually becoming richer. Typical interaction modes in everyday use include face recognition, gesture recognition, microexpression recognition, and human behavior recognition [
Gesture recognition can be studied from different fields, mainly including pattern recognition, signal processing, computer vision, and human-computer interaction, and typical methods include gesture recognition based on deep learning, gesture recognition based on Hidden Markov model (HMM) [
In the early stage, gesture recognition relied mainly on gloves fitted with sensors. Grime [
Deep learning has unique capabilities in the field of computer vision: it imitates the human brain's abstract memory of data to give machines an abstract representation of voice, image, and text data. Hubel and Wiesel [
Lee et al. [
To address the problems of cluttered gesture data and weak gesture features, the preprocessing scheme proposed in this paper effectively extracts gesture features. Our contributions are as follows. First, skin color filtering in the YCbCr color space is used to segment the gesture area and enhance its features. Second, Gaussian filtering removes the noise around the segmented skin-color area; the morphological opening operation is then applied to the gesture image data, the marker-based watershed algorithm segments the gesture area, and the eight-connected filling algorithm fills it to enhance the gesture feature information. Finally, the AlexNet convolutional neural network model is trained on the processed gesture data set. The experimental results show an average recognition success rate of 96.46% without a significant increase in recognition time.
Gesture data preprocessing mainly eliminates interfering features in the data images, highlights gesture features, reduces the data scale, improves training efficiency, and increases recognition accuracy. The preprocessing methods in this paper include skin color detection, the marker-based watershed algorithm, the eight-connected seed filling algorithm, and scale normalization; the preprocessing pipeline is shown in Figure
Preprocessing processes of gesture recognition.
Gesture data is captured at different times, under different lighting, and in different environments, so the gesture features are disturbed by interfering features, which increases the difficulty of training. To reduce the interference of brightness, the data images are converted from the RGB color space to the YCbCr space, where chroma and luminance can be separated. The conversion relationship is
Y = 0.299R + 0.587G + 0.114B, Cb = 0.564(B − Y) + 128, Cr = 0.713(R − Y) + 128.
In the formula, R, G, and B are the red, green, and blue components of a pixel, Y is the luminance component, and Cb and Cr are the blue-difference and red-difference chroma components.
Skin color detection.
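As an illustration, the skin color filtering step can be sketched in a few lines of NumPy. The BT.601 conversion formulas are standard; the Cb/Cr threshold ranges below are common literature values and are assumptions for this sketch, not necessarily the exact thresholds used in this work:

```python
import numpy as np

def skin_mask(rgb):
    """Return a boolean mask of skin-colored pixels via YCbCr thresholds.

    rgb: uint8 array of shape (H, W, 3).
    The Cb/Cr ranges below are commonly cited skin-tone bounds
    (assumed here for illustration).
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance (BT.601)
    cb = 0.564 * (b - y) + 128              # blue-difference chroma
    cr = 0.713 * (r - y) + 128              # red-difference chroma
    return (77 <= cb) & (cb <= 127) & (133 <= cr) & (cr <= 173)

# A skin-toned pixel passes the filter; a pure blue pixel does not.
img = np.array([[[200, 140, 120], [0, 0, 255]]], dtype=np.uint8)
mask = skin_mask(img)
```

Because the thresholds act only on Cb and Cr, the luminance component Y (and hence brightness changes) largely drops out of the decision, which is the motivation for leaving RGB space.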
For the data samples processed by skin color detection, the images are segmented with the marker-based watershed algorithm. The watershed algorithm connects pixels of adjacent gray values into contours to segment the image, but it easily over-segments images with noise and irregular gradients. The marker-based variant solves this problem: it requires a marker image, which designates connected components of the image to be segmented, and the elevation around a marked connected component can be raised like a dam, preventing locally lower edges from being submerged and becoming inseparable. Guided by the marker image, the marker-based watershed algorithm can effectively segment the gesture features; its processing flow is shown in Figure
Data image segmentation process.
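The flooding idea behind the marker-based watershed can be sketched with a small priority-flood implementation. This is a minimal illustrative version (the experiments in this paper rely on standard image processing libraries); the toy elevation map and seed positions are invented for the example:

```python
import heapq
import numpy as np

def marker_watershed(gray, markers):
    """Minimal marker-based watershed via priority flooding.

    gray:    2-D array treated as elevation (e.g. a gradient image).
    markers: 2-D int array; 0 = unlabeled, >0 = seed label.
    Each unlabeled pixel receives the label of the marker whose flood
    reaches it first when flooding from low to high elevation.
    """
    labels = markers.copy()
    h, w = gray.shape
    heap = []
    for i in range(h):
        for j in range(w):
            if markers[i, j] > 0:
                heapq.heappush(heap, (gray[i, j], i, j))
    while heap:
        _, i, j = heapq.heappop(heap)       # lowest elevation floods first
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and labels[ni, nj] == 0:
                labels[ni, nj] = labels[i, j]   # inherit the basin label
                heapq.heappush(heap, (gray[ni, nj], ni, nj))
    return labels

# Toy elevation: a high ridge down the middle separates two basins.
gray = np.array([[0, 9, 0],
                 [0, 9, 0],
                 [0, 9, 0]])
markers = np.zeros_like(gray)
markers[1, 0] = 1   # seed in the left basin
markers[1, 2] = 2   # seed in the right basin
labels = marker_watershed(gray, markers)
```

Every pixel ends up assigned to one of the two marked basins, which is exactly the behavior the marker image is meant to enforce: only regions seeded by a marker are grown, so noise cannot spawn spurious regions.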
After the gesture features are accurately segmented, the eight-connected seed filling algorithm fills the interior pixels of the segmented irregular gesture area. It is an upgrade of the four-connected filling algorithm, which starts from an injection point inside the region and extends in four directions until all pixels in the region are covered; the eight-connected version speeds up filling the whole region by extending in eight directions. The eight-connected seed filling algorithm yields the gesture feature data and is illustrated in Figure
Contour filling process.
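The eight-connected filling described above can be sketched as an iterative seed fill; the small contour grid and seed position are invented for illustration:

```python
def seed_fill_8(grid, seed, fill):
    """Iterative eight-connected seed fill.

    grid: list of lists of ints; seed: (row, col) inside the region.
    Every pixel reachable from the seed through the eight neighbor
    directions and sharing the seed's original value is set to `fill`.
    """
    h, w = len(grid), len(grid[0])
    target = grid[seed[0]][seed[1]]
    if target == fill:
        return grid
    stack = [seed]
    while stack:
        i, j = stack.pop()
        if 0 <= i < h and 0 <= j < w and grid[i][j] == target:
            grid[i][j] = fill
            # Eight directions: four axis neighbors plus four diagonals.
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di or dj:
                        stack.append((i + di, j + dj))
    return grid

# Fill the interior of a small contour (1 = contour, 0 = background).
contour = [[1, 1, 1, 1],
           [1, 0, 0, 1],
           [1, 0, 0, 1],
           [1, 1, 1, 1]]
filled = seed_fill_8(contour, (1, 1), 1)
```

After filling, the gesture region becomes a solid blob, so the downstream network sees a shape rather than a thin contour.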
After the image data is segmented and filled, scale normalization ensures the consistency of feature extraction, and the gesture data with obvious features is then labeled for neural network training. In this article, the image data obtained by segmentation and filling is normalized to 227 × 227 pixels.
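The scale normalization step can be sketched with a simple nearest-neighbor resize; in practice a library resize (e.g. from OpenCV or PIL, both used in this work) would be preferred, so this is only an illustrative stand-in:

```python
import numpy as np

def normalize_scale(img, size=227):
    """Nearest-neighbor resize of an (H, W) or (H, W, C) image to
    size x size, the input shape expected by the AlexNet-style model."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source col for each output col
    return img[rows][:, cols]

img = np.zeros((100, 160), dtype=np.uint8)
out = normalize_scale(img)
```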
Deep learning refers to machine learning with deep neural network models: through reasonable modeling, the construction of a suitable loss function, forward propagation, backpropagation, and a large amount of training data, the machine is trained to approximate the function from input data to output data, and the trained function is then used to process new inputs, including but not limited to prediction and classification. Generally, a neural network is divided into three parts: the input layer, the hidden layers, and the output layer. The input layer receives the input as a vector; even when the input data is not a vector, for example an image, it must be transformed into a vector for input. The hidden layers are the middle layers, whose function is feature transformation. The output layer produces the results. In essence, a neural network is a function that completes the transformation from input data to output.
In the forward propagation of a deep neural network, the input data flows layer by layer through the computations until it reaches the output layer. The function between any two layers is z = Wx + b. In the formula, x is the input vector of the layer, W is the weight matrix, and b is the bias vector. The output of the layer is a = f(z). In the formula, f is the activation function, and a serves as the input of the next layer.
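A minimal sketch of forward propagation with these per-layer functions, using toy layer sizes chosen purely for illustration:

```python
import numpy as np

def relu(z):
    """Activation function f(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def forward(x, layers):
    """Forward propagation: each layer computes z = W a + b, a = f(z)."""
    a = x
    for w, b in layers:
        a = relu(w @ a + b)
    return a

rng = np.random.default_rng(0)
# A toy network with layer widths 4 -> 5 -> 2 (input, hidden, output).
layers = [(rng.normal(size=(5, 4)), np.zeros(5)),
          (rng.normal(size=(2, 5)), np.zeros(2))]
y = forward(np.ones(4), layers)
```

Each layer's output becomes the next layer's input, which is exactly the data flow from the input layer through the hidden layers to the output layer described above.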
This article uses the well-known convolutional neural network AlexNet, which was proposed for the 2012 ImageNet image recognition competition and won the championship. The ReLU activation function and the dropout overfitting-prevention technique introduced with the network perform well in neural network training. The AlexNet model has eight layers: the first five are convolutional (pooling) layers, and the last three are fully connected layers. The softmax layer outputs a 1000-dimensional vector, that is, a distribution over 1000 class labels; the detailed structure of the model is shown in Table
Model structure.
Layer | Required operations |
---|---|
1. Convolutional pooling layer | Convolution, activation, pooling, local response normalization |
2. Convolutional pooling layer | Convolution, activation, pooling, local response normalization |
3. Convolution layer | Convolution, activation |
4. Convolution layer | Convolution, activation |
5. Convolutional pooling layer | Convolution, activation, pooling |
6. Fully connected layer | Activation, dropout |
7. Fully connected layer | Activation, dropout |
8. Output layer | Softmax classification function |
The input layer is 227 × 227 × 3 image data.
The third and fourth layers have only convolutional layers, which use 3 × 3 convolution kernels.
The sixth layer is the first fully connected layer; dropout is used in it to prevent overfitting, and it has a total of 1024 convolution kernels, each of size 6 × 6.
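The spatial sizes quoted above (a 227-pixel input side and 6 × 6 feature maps entering the first fully connected layer) follow from the standard convolution/pooling output-size arithmetic. The kernel, stride, and padding values below are the classic AlexNet settings, assumed here for illustration rather than taken from this paper:

```python
def out_size(n, k, s, p=0):
    """Output side length of a conv/pool layer: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

# Spatial trace through the convolutional part of a classic AlexNet.
n = 227
n = out_size(n, 11, 4)      # conv1, 11x11 stride 4 -> 55
n = out_size(n, 3, 2)       # pool1, 3x3 stride 2   -> 27
n = out_size(n, 5, 1, 2)    # conv2, 5x5 pad 2      -> 27
n = out_size(n, 3, 2)       # pool2                 -> 13
n = out_size(n, 3, 1, 1)    # conv3, 3x3 pad 1      -> 13
n = out_size(n, 3, 1, 1)    # conv4                 -> 13
n = out_size(n, 3, 1, 1)    # conv5                 -> 13
n = out_size(n, 3, 2)       # pool5                 -> 6
```

The final value of 6 matches the 6 × 6 feature maps that feed the first fully connected layer.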
The AlexNet model uses the ReLU activation function f(x) = max(0, x).
Dropout is used in the fully connected layers of the AlexNet model to effectively prevent overfitting. During training, neurons are randomly deactivated with a probability of 0.5, which doubles the convergence speed.
The preprocessed gesture data set is used as input to build the AlexNet network model, and training the model is an iterative process of forward propagation and backpropagation. As the hidden-layer parameters are updated, the loss value decreases and the recognition accuracy of the model improves. In this paper, the number of iterations is 3000 and the batch size is set to 32; the network model obtained after 3000 training iterations achieves a recognition accuracy of more than 95% on the test data set. The change curve of the recognition success rate during training is shown in Figure
Change curve of success rate.
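The iterate-forward/backpropagate-with-mini-batches loop can be illustrated on a toy problem. This NumPy logistic-regression example keeps the batch size of 32 but uses invented data and a stand-in model, so it only demonstrates how the loss falls as parameters are updated, not the actual TensorFlow training used here:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy binary classification data standing in for the gesture set.
X = rng.normal(size=(640, 8))
true_w = rng.normal(size=8)
Y = (X @ true_w > 0).astype(np.float64)

w = np.zeros(8)
batch, lr = 32, 0.5

def loss(w):
    """Mean cross-entropy loss over the whole data set."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(Y * np.log(p + 1e-12) + (1 - Y) * np.log(1 - p + 1e-12))

first = loss(w)
for step in range(300):                         # training iterations
    idx = rng.integers(0, len(X), size=batch)   # draw a mini-batch
    xb, yb = X[idx], Y[idx]
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))         # forward propagation
    grad = xb.T @ (p - yb) / batch              # backpropagation (loss gradient)
    w -= lr * grad                              # parameter update
final = loss(w)
```

Each iteration performs exactly the forward pass, gradient computation, and parameter update described above; over the iterations the loss value drops, mirroring the rising success-rate curve.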
Our experiment was run on a PC equipped with an Nvidia GeForce GT 635M GPU, an Intel Core i5-7500 processor, 8 GB of memory, and 64-bit Windows 10. The identification and classification processes were implemented in PyCharm Community Edition 2020.3.2, with TensorFlow configured under Python 3.5 for model training, the OpenCV and PIL packages installed for image processing, and the PyQt5 package used to design an interactive interface.
The model trained in this experiment can detect ten types of gestures, 0–9. The data set was collected by computer, with 200 images per gesture. First, 20 gesture images per class were collected from each of 10 subjects in different time periods, for a total of 2000 data samples. The training effect for gestures 3, 4, and 5 was not good, and adding sample data effectively improves the training effect; therefore, 100 more images were added for each gesture category, after which the recognition success rate improved significantly. The gesture representation method is shown in Figure
Gesture representation.
In order to improve the training effect, the data set must be rich. For the same gesture, the position in the picture differs under different light sources, at different angles, and across different individuals, and Figure
Different data for the same gesture.
As shown in Figure
Partial gesture data sets.
Compared with the Histogram of Oriented Gradients (HOG) and local binary patterns (LBP), the preprocessing method used in this paper achieves a higher recognition rate; the comparison results are shown in Table
Comparison of recognition rate.
Method | Recognition rate (%) |
---|---|
Proposed algorithm | 96.463 |
Local binary patterns (LBP) | 86.734 |
Histogram of oriented gradients (HOG) | 94.205 |
In the recognition rate test, 100 test samples were measured in different time periods: 92 were successfully identified, 7 were recognized incorrectly, and in one case no gesture was detected. The results in an extremely dark environment are not ideal, with gestures 4 and 5 misrecognized 4 times; this is because the gesture features cannot be separated in a dark environment, and the similarity between gesture 4 and gesture 5 is too high. In a normal environment, the model obtained with the preprocessing method, which enhances the gesture features, achieves a very good recognition rate. To test the robustness and stability of the model, four groups of samples were randomly selected from the test data set, each containing 30 test images, and each group was tested separately. The tests show that the model has good robustness and stability; the comparison results are shown in Table
Recognition results.
Test data sample group | Recognition success rate (%) |
---|---|
The first group | 97.63 |
The second group | 96.46 |
The third group | 97.13 |
The fourth group | 96.68 |
Because dropout is used to prevent overfitting, the convergence speed of the model is improved. During training, after many attempts, we found that increasing the batch size reduces the number of iterations and speeds up training; however, with fewer iterations the network may not extract the data features well, which reduces the recognition rate. After testing, a batch size of 32 gives a good balance between training time and model stability. Figure
Comparison of loss curves for different batch sizes.
To verify the recognition effect of the proposed gesture recognition method on different individuals, three people were randomly selected for testing in the laboratory. Each tester tested each gesture 50 times, divided into two groups of 25: one group during the day with sufficient light and one at night with insufficient light, so each tester performed 500 tests over the 10 gestures. The results of the three testers are shown in Figure
Recognition rate of different testers.
In this paper, gesture recognition based on deep learning is studied. By preprocessing the gesture data with skin color detection, the marker-based watershed algorithm, and the seed filling algorithm, gesture data with obvious gesture characteristics is obtained. Then, by training the AlexNet convolutional neural network on the ten kinds of preprocessed gesture data, the recognition success rate on the test set reaches 96.46% under natural light. The preprocessing method adopted in this paper effectively reduces the interference of the environment on gesture detection and recognition, and it does not lead to longer training and detection times.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
This work was supported in part by the Science and Technology Key Project of Henan Province under Grant no. 202102210370.