Proposing a Recognition System of Gestures Using MobilenetV2 Combining Single Shot Detector Network for Smart-Home Applications

This paper proposes a system for recognizing gestures and actions in smart homes. The proposed method combines MobilenetV2 feature extraction with a single shot detector (SSD) network. We use eleven gestures: walking, sitting down, falling back, wearing shoes, waving hands, falling down, smoking, baby crawling, standing up, reading, and typing. In this system, data are captured from the camera of a mobile device and used to detect the object. Each detected object is marked on the frame by a bounding box. The results show that the system meets the requirements with an accuracy of over 90%, which is suitable for real applications.


Introduction
Identifying gestures and actions from still images and video sequences is challenging due to issues such as image background and lighting. Interactive applications between humans and computers, between humans and robots, and, more recently, for controlling electronic devices are widely studied. They allow computer systems to assist users in improving their lives and healthcare [1][2][3]. The two main methods for deploying such systems are surveillance and wearable devices [4]. Surveillance equipment is usually a fixed camera for user interaction, whereas wearable devices such as smart watches use voice or touch to control automatic systems. In this paper, we focus on the method that uses fixed monitoring equipment.
The ImageNet database inherits the combination of a two-threaded ConvNet and a recurrent neural network (RNN). In [5], the authors take not only temporal but also spatial information as input to the RNN. However, a network with a large number of parameters and high computational complexity will not achieve high performance in terms of processing time or memory when used for feature extraction. In [6], the authors showed that Mobilenet has far fewer parameters than other feature-extraction networks, while the accuracy is almost the same. We therefore propose a method using Mobilenet networks, which significantly reduces the number of parameters and is easy to use on equipment with a weak configuration.
To recognize human gestures and actions, we need to identify characteristics that computers can detect effectively. Gestures and actions such as walking, sitting down, waving, and tying shoelaces are very natural in human life and are given priority. However, machine learning, and especially deep learning, requires a large amount of computation and therefore a computer with a strong configuration. In this paper, we reduce the MobilenetV2 network parameters by removing the fully connected layer and use the network to extract image features. We then use the MobilenetV2 network output as the input to the SSD network to identify the action.
This paper makes three main contributions. First, we propose a gesture recognition system combining MobilenetV2 and SSD. Second, we build our own set of gestures suitable for smart-home applications. Third, we build an application of the algorithm that runs on mobile devices with real data, with an accuracy of over 90%.
Image recognition is comparable to human visual perception. It has come into everyday life and serves various demands. Facebook and other media platforms use the technology to enhance image search and assist visually impaired users. Businesses use image recognition to scan large databases in order to satisfy customer demands and improve the customer experience in their stores and in online shopping. In the healthcare system, medical image recognition and processing systems help professionals predict health risks and detect disease early, which provides more services to patients. The goal of action identification is to create a system that can be used to control smart-home devices. It could be applied to control digital devices in the future. This is an advanced technology in smart-home applications that allows controlling the screen without touching the device by using AI technology. The rest of the paper is organized as follows. In Section 2, we present related work. In Sections 3 and 4, we present the proposed model and evaluate its effectiveness, respectively. Finally, we give the conclusion in Section 5.

Related Work
Action identification is one of the applications for controlling digital devices in the future. This is an advanced technology that is being widely used in smart homes. Currently, many companies and research centers are actively testing high-tech models that allow screen control without touching the device by means of artificial intelligence (AI) technology. This is the area most concerned with action identification.
There are many studies on identifying actions [2][3][4]. In [2], the authors perform 3D skeleton identification based on the NTU-RGB + D and Kinetic datasets. The authors of [3] perform neuron-based identification with joint trajectory maps (JTM). Khowaja and Lee [4] propose a sequential combination of Inception-ResNetv2 and a long short-term memory (LSTM) network that takes advantage of time variance to improve recognition performance. In that work, the identification accuracy is 95.9% and 73.5% on the UCF101 and HMDB51 datasets, respectively.
Thanks to their ability to learn, neural networks do not need to be set up manually when simulating the human learning process; they can be trained on gesture and action patterns to create a classification network. The deep learning model is inspired by the communication and information processing models of biological nervous systems and includes neural networks with more than one hidden layer. They can acquire the characteristics of the learned subjects easily and accurately.
For complex subjects, deep learning exhibits superior performance in computer vision and natural language processing (NLP) [8,9]. Modern object detection systems are variants of the Backpropagation Neural Network (BPNN) and Faster RCNN [10,14]. In [14], the authors compared AI networks and concluded that BPNN achieved the highest efficiency. In [11], the authors present the SSD, which optimizes object detection. Compared to Faster RCNN, the SSD is simpler and more efficient since it completely eliminates proposal generation and the subsequent pixel or feature resampling stages. It also encapsulates all computation in a single network, which makes the SSD easy to train and easy to integrate into systems. Besides, it works in conjunction with the MobilenetV2 network to operate quickly and efficiently on embedded and mobile devices.
However, there are several challenges in identifying actions:
(i) Developing training sample sets: identification using machine learning requires an appropriate set of sample data, so it takes time to collect data to create standard samples.
(ii) Processing time: we need to process large amounts of data. If a network has to handle too many parameters on a weakly configured machine, it will slow down, which affects the results in real time.
(iii) Accuracy evaluation methods: for conventional cameras (webcams), accuracy is affected by other conditions such as light, background, and hand movement speed, so we have to make some assumptions for the application.
Based on the analysis above, we propose an action identification system that combines the MobilenetV2 network with the SSD network for easy use on embedded devices with weaker hardware configurations.

Proposal System
3.1. Overview of Proposal System. We base the proposed system on [6,15]. In [15], the authors use the Resnet-101 model for object detection. Although its accuracy is high, the network is large. Mobilenet, published later than Resnet-101, was proposed by authors from Google in 2017. In this network, the authors use a convolution method called depthwise separable convolution to reduce model size and computational complexity. As a result, the model is useful on mobile and embedded devices, so we propose to use Mobilenet and SSD in our system. Metrics of the convolution networks are shown in Table 1.
The proposed system is based on [12,20] for application in smart-home models, as shown in Figures 1 and 2. With the proposed network, we first expand the number of channels, then apply depthwise convolution with a kernel size of 3 × 3 over the expanded space, and finally project back to a smaller number of channels through the bottleneck filter, combined with a residual connection. The residual connections are used in gradient calculations to improve performance. Besides, we also reduce the MobilenetV2 network parameters by removing the fully connected layer and use the network to extract image features, as shown in Figure 2. The goal of this system is to build and process datasets from simple to complex actions. The proposed gestures include eleven actions, namely, walking, sitting down, falling back, putting on shoes, waving hands, falling down, smoking, baby crawling, standing up, reading, and typing.

Journal of Electrical and Computer Engineering
First, the system extracts features of the input data using the MobilenetV2 network and then feeds them into the SSD network to predict the results. The results obtained after the training process are converted to Tensorflow Lite (.tflite) format to run on mobile devices. After training, the Tensorflow model yields GraphDef and checkpoint files. These are converted to Tensorflow Lite (.tflite) format and then loaded into its interpreter. The interpreter executes the model using a set of operators. Details of the steps are presented below.

Processing Steps.
Tensorflow is used for creating models, training, manipulating data, and making predictions, based on [12,20,21]. However, machine learning, and especially deep learning, needs great computational power. Although training on mobile and embedded devices is possible, it takes a lot of time. To solve this problem, we use Tensorflow for the training phase and Tensorflow Lite for the inference phase, as shown in Figure 1.
The proposed method includes the following steps:
(i) Step 1: preparing data
(ii) Step 2: assigning labels to data
(iii) Step 3: using the MobilenetV2 network to extract features
(iv) Step 4: using the output of the MobilenetV2 network as input of the SSD network to detect the object

Preparing Data.
First, we prepare the data, which include a self-built data source, online sources found via Google, and parts of UCF101 [22] and BU203 [23]. These cover eight actions, namely, walking, sitting down, falling back, putting on shoes, waving hand, falling down, smoking, and baby crawling, as shown in Figure 3 [22,23], plus three actions (standing up, reading, and typing) that we designed ourselves. The numbers of labels and images are shown in Table 2.

Labeling Data.
In this step, we determine the region of interest (ROI) of each action by manual labeling. In this paper, we use an existing labeling tool. This process basically draws boxes around objects in the image. Figure 4 shows an example using the LabelImg tool, which automatically creates an XML file describing the location of the object in the image.
The values obtained are shown in Figure 5, based on [24]. After labeling the data, we divide them into train and test sets. Next, we convert the XML files into CSV files and then create TFRecords from these files. The training TFRecords file is used for model training. Finally, the test values are fed to the model for evaluation.
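The XML-to-CSV step can be illustrated with a minimal sketch. It assumes the Pascal VOC style annotation layout that LabelImg commonly writes (`filename`, `size`, and one `object`/`bndbox` per box); the column order, the helper names `voc_xml_to_rows` and `rows_to_csv`, and the sample annotation are hypothetical and are not taken from the authors' pipeline.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed CSV layout; the paper does not specify the column order.
CSV_HEADER = ["filename", "width", "height", "class",
              "xmin", "ymin", "xmax", "ymax"]

def voc_xml_to_rows(xml_text):
    """Return one CSV row per labeled box in a Pascal VOC style annotation."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ])
    return rows

def rows_to_csv(rows):
    """Serialize the rows, preceded by the header, to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(CSV_HEADER)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical annotation for one image with one labeled action.
SAMPLE_XML = """
<annotation>
  <filename>walk_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>walking</name>
    <bndbox><xmin>120</xmin><ymin>60</ymin><xmax>310</xmax><ymax>450</ymax></bndbox>
  </object>
</annotation>
"""

if __name__ == "__main__":
    print(rows_to_csv(voc_xml_to_rows(SAMPLE_XML)))
```

In practice the rows from all annotation files would be concatenated into one CSV per split (train and test) before the TFRecords are generated.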

Extracting Feature.
After labeling, the input images are saved in CSV format and converted into the TFRecord format in Tensorflow. We use the combination of the two networks, MobilenetV2 + SSD, in Tensorflow to perform action identification and increase system accuracy.
First, the MobilenetV2 network uses a 1 × 1 pointwise convolution to expand the input channels. It then uses depthwise convolution to extract features and a linear 1 × 1 convolution to combine the output features while reducing the network size. After the reduction, ReLU6 is replaced with a linear activation so that the output channel size matches the input, as shown in Figure 7.
The MobilenetV2 network also uses the inverted residual block, which combines features carried over shortcut connections with features obtained by traversing the convolutions, to obtain richer output features, as follows. Depthwise convolution splits the input and the filters into separate channels, and its outputs are then combined using a 1 × 1 convolution. The network input is $D_F \times D_F \times M$, the kernel size is $D_K$, and the output has $N$ channels. Depthwise convolution maps only onto each individual input channel. Therefore, the number of output channels equals the number of input channels. Its computational cost is

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F,$$

as shown in Figure 8 [7]. The final step is a pointwise convolution with a 1 × 1 kernel that combines the features created by the depthwise convolution. Its computational cost is $M \cdot N \cdot D_F \cdot D_F$, as shown in Figure 9, based on [7]. The total cost of the depthwise separable convolution is therefore

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F.$$

The calculation is performed per filter, with each filter applied to one input channel. The output feature map is then calculated by

$$G_{k,l,m} = \sum_{i,j} K_{i,j,m} \cdot F_{k+i-1,\, l+j-1,\, m},$$

where $K$ is a kernel of size $D_K \times D_K \times M$, $F$ is the input, and $G$ is the output feature map.
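To make the cost argument concrete, the following sketch counts multiply-accumulate operations for the depthwise and pointwise steps using the symbols from the text ($D_K$, $D_F$, $M$, $N$). The standard-convolution cost $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$ used for comparison comes from the original Mobilenet formulation rather than from this section, and the layer sizes in the example are illustrative assumptions only.

```python
def depthwise_cost(d_k, m, d_f):
    """Cost of applying one d_k x d_k filter to each of the m input channels."""
    return d_k * d_k * m * d_f * d_f

def pointwise_cost(m, n, d_f):
    """Cost of the 1 x 1 convolution that mixes m channels into n channels."""
    return m * n * d_f * d_f

def depthwise_separable_cost(d_k, m, n, d_f):
    """Total cost: depthwise step followed by pointwise step."""
    return depthwise_cost(d_k, m, d_f) + pointwise_cost(m, n, d_f)

def standard_conv_cost(d_k, m, n, d_f):
    """Cost of an ordinary convolution with the same input and output shape."""
    return d_k * d_k * m * n * d_f * d_f

if __name__ == "__main__":
    # Illustrative layer: 3 x 3 kernel, 32 -> 64 channels, 112 x 112 feature map.
    d_k, m, n, d_f = 3, 32, 64, 112
    separable = depthwise_separable_cost(d_k, m, n, d_f)
    standard = standard_conv_cost(d_k, m, n, d_f)
    print("separable:", separable)
    print("standard: ", standard)
    print("ratio:    ", separable / standard)
```

The ratio of the two costs simplifies to $1/N + 1/D_K^2$, so with a 3 × 3 kernel the separable form needs roughly eight to nine times fewer operations, which is the property that makes the network attractive on weak hardware.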

Using SSD to Detect Objects.
SSD [8,21] is a good choice for object detection because it is more accurate than YOLO [9] and faster than Fast-RCNN [10]. SSD uses the VGG-16 base network with several additional layers for extracting feature maps. However, the purpose of our paper is to run on weaker devices such as mobile devices in order to reduce server-side bandwidth, reduce latency, and improve speed. As a result, the system reduces mobile traffic costs for users because large amounts of raw data do not have to be transferred to a computer. Therefore, we propose to use the MobilenetV2 network instead of the VGG-16 base network to extract feature maps. The SSD adds auxiliary layers after MobilenetV2 to predict the object.
The SSD model produces a probability vector over c + 1 classes, where c is the number of object classes and the additional background class indicates that there is no object. A vector with four elements (x, y, h, w) represents the position of the object in the frame.
After each training step, we calculate the loss function and adjust the predicted boxes so that they move closer to the real object and the loss decreases. The model converges when the difference between ground truth and predictions is close to zero, as shown in Figure 10, based on [8,21]. The loss function is calculated as follows [8]:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

The loss function consists of two terms, $L_{conf}$ and $L_{loc}$, where $N$ is the number of matched default boxes, $l$ is the predicted box, and $g$ is the ground truth box. $L_{loc}$ is the localization loss, and $L_{conf}$ is the confidence loss, which is the softmax loss over multiple class confidences ($c$); $\alpha$ is set to 1 by cross validation. $x_{ij}^{p} \in \{1, 0\}$ is an indicator for matching the $i$th default box to the $j$th ground truth box of category $p$.
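The two-term loss described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' training code: the localization term uses the smooth L1 form from the original SSD formulation (this section does not name it), boxes are compared directly rather than as encoded offsets, hard negative mining is omitted, and the function names are hypothetical.

```python
import math

def softmax_cross_entropy(logits, target):
    """Softmax loss for one box: -log of the softmax probability of `target`."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def smooth_l1(pred, truth):
    """Smooth L1 distance summed over the four box coordinates."""
    total = 0.0
    for p, t in zip(pred, truth):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def ssd_loss(conf_logits, labels, loc_preds, loc_truths, alpha=1.0):
    """Compute (L_conf + alpha * L_loc) / N over a set of default boxes.

    labels[i] == 0 marks a background (unmatched) default box; N is the
    number of matched boxes. The loss is defined as 0 when N == 0.
    """
    n_matched = sum(1 for y in labels if y != 0)
    if n_matched == 0:
        return 0.0
    l_conf = sum(softmax_cross_entropy(z, y)
                 for z, y in zip(conf_logits, labels))
    l_loc = sum(smooth_l1(p, t)
                for p, t, y in zip(loc_preds, loc_truths, labels) if y != 0)
    return (l_conf + alpha * l_loc) / n_matched

if __name__ == "__main__":
    # Two default boxes: one matched to class 1, one background.
    logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.1]]
    labels = [1, 0]
    preds = [[0.5, 0.5, 0.2, 0.3], [0.0, 0.0, 0.0, 0.0]]
    truths = [[0.4, 0.5, 0.2, 0.4], [0.0, 0.0, 0.0, 0.0]]
    print(ssd_loss(logits, labels, preds, truths))
```

Only matched boxes contribute to the localization term, while every box contributes to the confidence term, which is why the background class must be part of the softmax.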
From the training process, we obtain an algorithm diagram in which the input images pass through the MobilenetV2 network to obtain their features. The data are then fed into the SSD network to determine the coordinates and probability of the object's appearance as well as the loss function value, as shown in Figure 11.

Converting objects to Tensorflow Lite (TSL) Format.
TSL is a lightweight Tensorflow solution for mobile and embedded devices. It allows machine learning models to run on mobile devices. The process for this model is shown in Figure 12, based on [7]. The main components of Tensorflow Lite are the model file format, the interpreter for graph processing, a set of kernels to work with, and finally the interface to the hardware acceleration layer.

(iii) Ops/kernel: this is a smaller set of operators. However, not all models are supported by them. Tensorflow Lite provides a set of built-in core ops that are optimized for CPUs using NEON. They operate in both float and quantized modes.
(iv) Hardware acceleration: this targets custom hardware through the neural network API (API NN). Tensorflow Lite comes preloaded with links to the API NN. If your device supports API NN, these operators are delegated to the API. Otherwise, they are executed directly on the CPU.
Tensorflow Lite differs from Tensorflow Mobile in that it is optimized to support system transition and deployment. Tensorflow Lite is tuned at every level, from model compilation to hardware utilization, to increase the viability of on-device inference while maintaining model integrity. To deploy the Tensorflow Lite model file in our application, we build a system of three main components, as shown in Figure 13.

Simulation and Result
4.1. Setup. During implementation, to increase accuracy, the input data are passed through a preprocessing step to improve quality. Through this step, the data are transferred to Mobilenet, and parameters such as the batch size, learning rate, and multibox detection are changed to match the input data as well as the computer configuration, in order to improve accuracy and speed up the training process.

[Figure 13 labels: input data; data prediction; TSL]
In our simulation, we set up the parameters as follows [21]. The batch size is varied from 1 to 8. The learning rate decay policy is slightly different for each dataset and object. The initial learning rate is set to $10^{-4}$. Among the three parameters, multibox detection is the most important. For each location, we have k bounding boxes with different sizes and aspect ratios. In our paper, we have 8732 bounding boxes with aspect ratios 1, 2, 3, 1/2, and 1/3.
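The figure of 8732 bounding boxes matches the default-box count of the standard SSD configuration with a 300 × 300 input. The sketch below reproduces that count under the assumption that the same six feature-map sizes and boxes-per-location values are used; these two lists come from the original SSD design and are not stated in this paper, so the MobilenetV2 variant may differ in detail.

```python
# Assumed SSD300 layout: square feature maps and default boxes per location.
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]
BOXES_PER_LOCATION = [4, 6, 6, 6, 4, 4]

def count_default_boxes(sizes, boxes_per_location):
    """Sum size * size * k over all feature maps used for prediction."""
    return sum(s * s * k for s, k in zip(sizes, boxes_per_location))

if __name__ == "__main__":
    total = count_default_boxes(FEATURE_MAP_SIZES, BOXES_PER_LOCATION)
    print("default boxes:", total)
```

Layers with four boxes per location use aspect ratios 1, 2, and 1/2 plus one extra scale for ratio 1, while layers with six boxes also add ratios 3 and 1/3, which is consistent with the ratios listed in the text.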
Each training image is randomly sampled by one of the following options: using the entire original input image, or sampling a patch so that the minimum overlap with the object is 0.1, 0.3, 0.5, 0.7, or 0.9. The size of each sampled patch is in [0.1, 1] of the original image size, and the aspect ratio is between 1/2 and 2.
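The thresholds 0.1 to 0.9 are read here as minimum Jaccard overlaps (intersection over union) between a sampled patch and an object box, which is how the SSD sampling scheme that this setup follows defines them; that reading is an assumption, since the text does not name the measure. A minimal sketch of the overlap test, with boxes given as hypothetical `(xmin, ymin, xmax, ymax)` tuples:

```python
def box_area(box):
    """Area of an (xmin, ymin, xmax, ymax) box; 0 for degenerate boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def jaccard_overlap(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def patch_is_valid(patch, object_box, min_overlap):
    """Accept a sampled patch only if it overlaps the object enough."""
    return jaccard_overlap(patch, object_box) >= min_overlap

if __name__ == "__main__":
    patch = (0.0, 0.0, 2.0, 2.0)
    obj = (1.0, 1.0, 3.0, 3.0)
    print(jaccard_overlap(patch, obj))
    print(patch_is_valid(patch, obj, 0.1), patch_is_valid(patch, obj, 0.3))
```

A sampler would draw candidate patches repeatedly and keep the first one that passes `patch_is_valid` for the randomly chosen threshold.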

Result.
Training with the action sets mentioned above gives the results shown in Figures 14 and 15, which show training on a Core i5 CPU with 8 GB of RAM. Training ran for six days, reducing the loss function from 29 to 2.
We evaluate four models, namely, Tensorflow (RCNN + InceptionV2), Tensorflow (RFCN + Resnet101), Tensorflow Lite, and the proposed model (SSD + MobilenetV2). The results are shown in Figures 16-19. We first identify the above set of actions with Tensorflow. The results are shown in Figures 16 and 20.
We next identify the set of actions with Tensorflow (RFCN + Resnet101). The results are shown in Figure 18.
We then identify the set of actions with Tensorflow Lite. The results are shown in Figure 17.
Finally, we identify the set of actions with the proposed model. The results are shown in Figure 19.
We also check the actions against different image backgrounds. The results are shown in Tables 3 and 4. From the above results, we see that the system meets the stated requirements with an accuracy of over 90%. In particular, with Tensorflow and Tensorflow Lite, the system achieves an accuracy of up to 99% with an execution time of 14 seconds.
This is an acceptable time for an intelligent control system. The gesture and action recognition system is built with the SSD + MobilenetV2 algorithm and trained on over 2500 images. We then test each gesture and action with 10 different images. The results show that the system is feasible with an accuracy of over 98%. After freezing the graph, the precision is 82% with Tensorflow and 82% with Tensorflow Lite.
We tested video with the Tensorflow and Tensorflow Lite models on a computer with a Core i5 CPU and 8 GB of RAM. The memory and CPU usage of each model with the proposed dataset are shown in Table 5.
The results show that although the proposed method (SSD + MobilenetV2) has lower accuracy, its processing speed is 23.6 times faster than RCNN + InceptionV2 and 37.8 times faster than RFCN + Resnet101.
After converting to Tensorflow Lite format, we created an application to evaluate the system in real time. To evaluate the proposed algorithm on real video, we use input at 30 frames/second with a resolution of 1920 × 1080 and a bit rate of 82 kbps, which is suitable for real-time applications. The results are shown in Figures 21 and 22. They show that the proposed model is suitable for real devices, with an accuracy of up to 99%.
We compare our model with [30][31][32]. The results show that the accuracy of the proposed method is better than that of [30][31][32], as shown in Figure 23. The accuracy of the proposed system reaches 98% with the Tensorflow model. The training process is difficult for the computer when the amount of calculation is huge. A simple 2D convolutional network for classifying 101 classes has about 5 million parameters, while the same architecture when structured in 3D leads to 33 days to train 3DConvNet on UCF101 and about two months on Sports-1M [33]. This makes searching for an extended architecture difficult and time-consuming on an i5 CPU configuration with an inadequate 8 GB of RAM. The results comparing the computational efficiency of the model with other networks are shown in Table 6.
In Table 6, the accuracy of our proposal is not high (about 82%). However, the execution time and size of the model are determined by its architecture. By improving the architecture of the model, we will reduce its computational complexity and execution time.

No.   Action             Accuracy (%)
      Standing up        100
10    Reading            100
11    Typing             90

No.   Action             Accuracy (%)
      Putting on shoes   80
5     Waving hand        70
6     Baby crawling      60
7     Smoking            60
8     Falling down       80
9     Standing up        90
10    Reading            90
11    Typing             70

Conclusion
This paper focuses on the use of neural networks for identifying human actions. We have identified actions with an accuracy of over 90%. However, the system still has disadvantages: the action recognition accuracy is not yet high, and the frame rate is still low. Therefore, we will take steps to increase the frame rate, to improve accuracy by increasing the resolution of the input image or by using the preprocessing methods of previous papers [34,35], and to combine neural networks with other networks to increase computational efficiency and performance for any object.

Data Availability
The data used to support the findings of the study include the self-built data source, online sources found via Google, and parts of UCF101 [14], BU203 [15], and HMDB51, with eight actions, namely, walking, sitting down, falling back, putting on shoes, waving hand, falling down, smoking, and baby crawling.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

[Figure 23 legend: firefly-based back propagation [31]; Baum-Welch and Viterbi path counting [32]; proposal system (SSD + MobilenetV2)]