Wearable Sensor-Based Human Activity Recognition Using Hybrid Deep Learning Techniques

Human activity recognition (HAR) can be exploited to great benefit in many applications, including elder care, health care, rehabilitation, entertainment, and monitoring. Many existing techniques, such as deep learning, have been developed for recognizing specific activities, but few address the recognition of the transitions between activities. This work proposes a deep learning based scheme for health care applications that can recognize both specific activities and the transitions between two different activities, which have short duration and low frequency. In this work, we first build a deep convolutional neural network (CNN) for extracting features from the data collected by sensors. Then, a long short-term memory (LSTM) network is used to capture long-term dependencies between two actions to further improve the HAR identification rate. By combining CNN and LSTM, a wearable sensor based model is proposed that can accurately recognize activities and their transitions. The experimental results show that the proposed approach achieves an overall recognition rate of up to 95.87% and a recognition rate for transitions higher than 80%, which are better than those of most existing similar models on the open HAPT dataset.


Introduction
Human behavior recognition (HAR) is the detection, interpretation, and recognition of human behaviors, which smart health care can use to actively assist users according to their needs. Human behavior recognition has wide application prospects, such as monitoring in smart homes, sports, game controls, health care, elderly patient care, bad habit detection, and identification. It plays a significant role in in-depth research [1] and can make our daily life smarter, safer, and more convenient.
Currently, human behavior data can be acquired in two ways: one is based on computer vision and the other is based on sensors [2]. Behavior recognition based on computer vision has been studied for a long time and has a mature theoretical basis. However, vision-based approaches have many limitations in practice. For example, the use of a camera is limited by various factors, such as light, position, angle, potential obstacles, and privacy invasion issues, which restrict its practical application. Although sensor-based behavior recognition has been studied for a relatively short time, the development and maturity of microelectronics and sensor technology have produced various types of sensors, such as accelerometers, gyroscopes, magnetometers, and barometers. These sensors can be integrated into mobile phones and wearable devices such as watches, bracelets, and clothes. Furthermore, state-of-the-art wearable sensors, such as [3], have solved the issue of magnetic field interference and can accurately estimate the current acceleration and angular velocity of motion sensors in real time in the presence of magnetic field interference. These wearable sensors are usually small in size, high in sensitivity, and strong in anti-interference ability, so the sensor-based identification method is more suitable for practical situations. Moreover, sensor-based behavior recognition is not limited by scene or time and can better reflect the nature of human activities. Therefore, the research and application of sensor-based human behavior recognition are increasingly valuable and significant.
Besides, HAR includes two types of actions: basic actions and transition actions. Due to the low incidence and short duration of transition movements, there are relatively few studies on transitions such as standing to sitting or walking to standing in human behavior recognition research [4]. However, the study of transitional movement is a very important part of human behavior recognition, and transition action recognition cannot be neglected if the behavior recognition rate is to be improved. A transition action marks the boundary between basic actions that alternate frequently. Accurately delimiting the transition actions helps to segment the streaming data to a certain extent and ultimately improves the recognition rate. In addition, behavior recognition methods based on traditional pattern recognition have shortcomings such as manual feature extraction. With the application and development of deep learning in different fields, deep learning models also show great advantages in the field of behavior recognition. The main contributions of this work are summarized as follows:
(1) We present a deep learning model composed of convolutional and Long Short-Term Memory recurrent layers, which can automatically learn local features and model the time dependence between features.
(2) We discuss the influence of key parameters of the deep learning model on performance and determine the best parameters for the model.
(3) We analyze and compare the experimental results with other models that adopt the same common data set. The results show that the proposed method is superior to the other advanced methods.
In this work, we use both the acceleration sensor and the gyroscope sensor of smart phones to acquire data and propose a CNN-LSTM hybrid model to recognize transition motions. The convolutional neural network (CNN) [5] is a type of deep neural network used as a feature extractor. It is characterized by local dependence, so it performs well in extracting local features. However, human activity data are long sequences composed of complex movements that change with time, so the CNN model alone does not work well in extracting the relationship between time and features. The Long Short-Term Memory (LSTM) [6] neural network is a kind of recurrent network that contains a memory for modeling time-dependent sequence problems. Therefore, the combination of CNN and LSTM can accurately identify the basic and transitional features of activities. The remainder of the paper is organized as follows: Section 2 reviews the literature on human activity identification based on deep learning and the existing problems; Section 3 presents the hybrid deep learning framework proposed in this paper to address these problems; Section 4 discusses and analyzes the experimental results based on experimental data. Finally, Section 5 concludes this paper.

Related Works
Due to the extensive application of human-computer interaction, behavior detection, and other technologies, human behavior recognition has become a hot field [7]. Human behavior recognition can be regarded as a representative pattern recognition problem. Traditional behavior recognition research using decision trees, support vector machines (SVMs), and other machine learning algorithms can obtain satisfactory results under controlled experimental environments and with a small amount of labeled data. However, the accuracy of these methods depends on the effectiveness and comprehensiveness of manual feature extraction. In addition, these methods can only extract shallow features. Because of these limitations, behavior recognition methods based on traditional pattern recognition are limited in classification accuracy and model generalization.
In recent years, deep learning has developed rapidly and attracted many research efforts, especially in image processing, time series, natural language, logical reasoning, and other complex data processing tasks, where it has achieved remarkable results [8]. Unlike traditional behavior recognition methods, deep learning can reduce the workload of feature design. In addition, higher-level and more meaningful features can be learned via an end-to-end neural network. Furthermore, the deep network structure is more suitable for unsupervised incremental learning. Moreover, deep networks created by stacking several layers of features can model data with complex structures. In short, deep learning is a well-suited method for HAR.
Since deep learning has made outstanding achievements in image feature extraction, many researchers first tried to apply it to video-based behavior recognition. In early work, Taylor et al. [9] used a convolutional gated Boltzmann machine to identify video behavior data and extract sensitive features. Ji et al. [10] proposed a three-dimensional CNN model to capture more action information from space and time. Liu et al. [11] proposed combining CNN and conditional random fields (CRFs) for action segmentation and recognition. The CNN can automatically learn spatiotemporal features, while the CRF is able to capture the dependency between outputs. Other common deep learning methods are also widely used, such as the recurrent neural network [12] and the long short-term memory network. Deep learning has thus been applied successfully to video behavior recognition, and it is also widely used in sensor-based human behavior recognition.
Zeng et al. [13] proposed treating single-axis sensor data as one-dimensional image data and then sending them to a CNN for identification. Jiang and Yin [14] combined the signal sequences of the accelerometer and gyroscope into an activity image, enabling a deep convolutional neural network (DCNN) to automatically learn the optimal features from the activity image. Chen and Xue [15] modified the CNN convolution kernel to adapt to the characteristics of triaxial acceleration signals. Ronao and Cho [16] proposed a ConvNet that realized efficient and data-adaptive human behavior recognition with smart phone sensors. ConvNets not only utilize the inherent temporal local dependence of sensor signal sequences but also provide an adaptive method for extracting robust features. Experimental results show that this method can recognize similar actions that are difficult for traditional machine learning to handle. Murad and Pyun [17] and Zhou et al. [18] proposed three deep recurrent neural network structures based on LSTM to build recognition models that capture temporal relations in input sequences and achieve more accurate recognition. Due to the superior performance of LSTM in behavior recognition applications, Guan and Plötz [19] and Qi et al. [20] improved the LSTM and proposed an ensemble model that integrates different LSTM learners into one classifier.
Through experimental evaluation on standard data sets, it has been shown that an ensemble composed of LSTM learners is superior to a single LSTM network. Ignatov [21] combined manually extracted statistical features with features automatically extracted by a neural network and realized a human behavior recognition method based on user-independent deep learning. In that method, the CNN extracted local features, while the statistical features preserved information about the global form of the time series. Experiments on open data sets show that the model has the advantages of low computational cost, short running time, and good performance. Nweke et al. [22] and Wang et al. [23], respectively, summarized the application of deep learning methods in sensor-based behavior recognition and not only put forward detailed views on the existing work but also pointed out the challenges and improvement directions of future research.
These works demonstrated the potential of deep neural networks to learn latent features and time series features. Nevertheless, existing works on action recognition mainly focus on basic behavior recognition, while the transition between actions is usually ignored because a transition action has a short duration. However, it is necessary to study transition actions in depth in order to improve the robustness of the model. The precise delimitation of transition actions helps to segment the streaming data to a certain extent and ultimately improves the recognition rate. In this paper, a hybrid model combining CNN and LSTM is adopted to extract deep and advanced features and to describe basic and transition actions in detail, so as to realize accurate identification.

Proposed Method
The overall architecture of the method proposed in this paper is shown in Figure 1 and contains three parts. The first part is the preprocessing and transformation of the original data, which combines the original acceleration and gyroscope data into an image-like two-dimensional array. The second part inputs the composite image into a three-layer CNN that automatically extracts and abstracts the motion features from the activity image and then maps them into feature maps. The third part inputs the feature vector into the LSTM model, establishes a relationship between time and action sequence, and finally introduces the fully connected layer to fuse multiple features. In addition, Batch Normalization (BN) is introduced [24] to normalize the data in each layer, and the result is finally sent to the Softmax layer for action classification.

Data Preprocessing.
Due to the large amount of behavioral data collected by the sensors, it is impossible to input all the data into the deep model at one time. Therefore, sliding window segmentation should be carried out before the data are input into the model. The behavior recognition method proposed in this paper can recognize both basic actions and transition actions at the same time. Because a transition action lasts for a short time, it is necessary to choose an appropriate window size. If the window is too large, important information will be lost; if it is too small, the computational cost will increase. After segmentation, the behavioral data collected by the sensors are one-dimensional time series, unlike image data. Therefore, before applying the deep learning model, the input data must be adapted. A dimension transformation is carried out on the data after window segmentation: the sensor data of all axes are spliced into a two-dimensional matrix. The advantage of this approach is that it preserves the correlation between the sensors' axes. Finally, image-like samples are formed and input into the deep learning model. Figure 2 shows the structure of the data preprocessing.
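The segmentation and dimension transformation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows`, the window length of 4, the step of 2, and the toy 6-axis recording are all assumptions chosen for readability, since the text does not state the window parameters.

```python
from typing import List, Sequence


def sliding_windows(signal: Sequence[Sequence[float]],
                    window: int, step: int) -> List[List[List[float]]]:
    """Split a multi-axis recording into fixed-length, possibly overlapping windows.

    `signal` is a list of samples; each sample holds one value per sensor axis
    (for example 3 accelerometer axes followed by 3 gyroscope axes).
    Each returned window is a 2D matrix of shape (axes, window): the axes are
    stacked as rows so that all axes of one time span stay together.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    windows = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        # Transpose (time, axes) -> (axes, time) to get an image-like matrix.
        windows.append([list(axis) for axis in zip(*chunk)])
    return windows


# Toy recording (hypothetical values): 10 samples of a 6-axis signal.
recording = [[float(t + 10 * a) for a in range(6)] for t in range(10)]
segments = sliding_windows(recording, window=4, step=2)
print(len(segments), "windows of shape",
      (len(segments[0]), len(segments[0][0])))
```

A step smaller than the window produces overlapping windows, which is one common way to avoid cutting a short transition in half; whether the paper uses overlap is not stated.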

Feature Learning Based 1D-CNN.
After dimensional transformation, the original uniaxial acceleration and gyroscope data are equivalent to a two-dimensional image array. The feature image is input into the convolutional neural network, whose structure is generally composed of convolution layers and pooling layers. The convolution layer applies convolution kernels to the input image to obtain feature maps. The pooling layer extracts local features from the feature maps of the convolution layer through a sampling operation, reducing the number of neurons and the number of parameters. The convolution and pooling layers are stacked to form a deep structure, which can automatically extract action feature information from the original action data [5].
The CNN model structure designed in this paper is shown in Figure 3. The CNN consists of three convolution layers and three pooling layers (each convolution layer is followed by one pooling layer) and finally outputs a number of feature maps containing action features. Table 1 illustrates the settings of different parameters for each convolution and pooling layer. Convolution is achieved by convolving a two-dimensional convolution kernel with images superimposed from multiple adjacent frames. The numbers of convolution kernels in the three convolution layers are 18, 36, and 72, respectively. The convolution kernel sizes are 2 × 8, 2 × 18, and 2 × 36, and the step size is 1. Since the filter may not be able to process the data at the edges in a certain direction during convolution, the padding parameter is set to "SAME" to avoid losing data at the image edges, and zeros are added around the edge of the input image matrix. After the convolution operation, the output usually passes through a nonlinear activation function to form the output of the convolution layer. Popular activation functions include the Sigmoid, ReLU, and Tanh functions. Among them, the ReLU function sets the negative values extracted by the CNN to 0 and leaves the positive values unchanged. After this nonlinear processing, the positive values are expressed more clearly in the extracted features. Therefore, the ReLU activation function is used in the convolution layers of the CNN:

$$f(x) = \max(0, x) = \begin{cases} x, & x > 0, \\ 0, & x \le 0. \end{cases}$$

The pooling layer reduces the size of the feature maps and the number of parameters. Popular pooling techniques include maximum pooling and average pooling. In recent years, theoretical analysis and performance evaluation have shown the superior performance of the maximum pooling strategy, which is widely used in deep learning [25,26].
Moreover, some studies show that maximum pooling is very suitable for sensor-based human behavior recognition [27]. Therefore, all pooling layers of the CNN in this paper use maximum pooling. The specific convolution and pooling parameters are shown in Table 2.
The LSTM has gating units, namely, the input gate, forgetting gate, and output gate, which are combined with learned weights to solve the vanishing gradient problem that arises during backpropagation in an ordinary recurrent neural network. Meanwhile, the LSTM can model time-dependent actions and fully capture global features, so as to improve the recognition accuracy [28]. The LSTM cell controls the information flowing into the neuron through the forgetting gate, input gate, and output gate. Furthermore, the predicted value of the LSTM cell is obtained using the Tanh function.
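To make the ReLU activation, "SAME" padding, and maximum pooling steps concrete, the following one-dimensional sketch may help. It is an illustration only: the helper names, the pool size of 2, and the toy input are assumptions, and the paper's actual pooling parameters (given in its tables) are not reproduced here.

```python
import math
from typing import List


def relu(values: List[float]) -> List[float]:
    """Set negative values to 0 and leave positive values unchanged."""
    return [max(0.0, v) for v in values]


def same_padding_output_length(input_length: int, stride: int) -> int:
    """Output length of a convolution with TensorFlow-style "SAME" padding.

    With "SAME" padding the input is zero-padded so that the output length
    depends only on the input length and the stride, not on the kernel size.
    """
    return math.ceil(input_length / stride)


def max_pool_1d(values: List[float], pool_size: int, stride: int) -> List[float]:
    """Keep the maximum of each full pooling window (no padding)."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, stride)]


# Hypothetical feature values after a convolution.
feature_map = [-1.5, 3.0, 2.0, -0.5, 4.0, 0.0]
activated = relu(feature_map)
pooled = max_pool_1d(activated, pool_size=2, stride=2)
print(activated, pooled, same_padding_output_length(128, 1))
```

With a stride of 1, as used for the convolution layers in this paper, "SAME" padding keeps the feature map the same length as the input, so only the pooling layers shrink it.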

Feature Fusion and Action
Firstly, the forgetting gate determines how much information from the previous moment can be accumulated in the current cell. As shown in equation (3), a probability value is calculated to determine the amount of information that can pass through the gate:

$$\Gamma_f = \sigma\left(w_f\left[a^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b\right), \quad (3)$$

where $\Gamma_f$ is the output of the forgetting gate, $\sigma$ is the sigmoid function, $w_f$ represents the weight corresponding to the input vector, $b$ represents the bias, $a^{\langle t-1\rangle}$ represents the output of the neuron at the last moment, and $x^{\langle t\rangle}$ represents the current input of the neuron. Secondly, the input gate consists of the update gate and a Tanh layer, which together control how much information can flow into the current cell. The calculation process is shown in equations (4)-(6). The output of the input gate and the output of the forgetting gate update the cell at the same time, discarding unwanted information. Then, the predicted value of the current unit is determined by the output gate, and the output of the model is obtained, as shown in equations (7) and (8).
After the processing of the LSTM layer, the final output is a set of vectors containing time and action sequence correlation, which are input into the fully connected layer for the fusion of global action features. The training process of a neural network becomes complicated because the statistical distribution of the input of each layer changes with the parameters of the previous layer. To keep the distribution of the output data from changing too much, a lower learning rate must be used, which reduces the training speed. To solve this issue, this paper introduces BN to standardize the values of each layer in the LSTM (the output of neurons at the last moment and the input at the current moment), so that the mean and variance of their sum do not change with the distribution of the underlying parameters, which effectively decouples the parameters of each layer from those of other layers. In this way, gradient vanishing or explosion can be prevented and the training of the network can be accelerated. The BN algorithm is shown in Algorithm 1.
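For reference, the input-gate, cell-update, and output-gate steps described in this section are commonly written as follows in the standard LSTM formulation. This is a sketch in the same $a^{\langle t\rangle}$, $x^{\langle t\rangle}$ notation as the text. The symbols $\Gamma_u$, $\Gamma_o$, $\tilde{c}^{\langle t\rangle}$, $c^{\langle t\rangle}$, the per-gate weights and biases, and the use of $\Gamma_f$ for the forgetting-gate output are assumptions, so the equations may not match the paper's equations (4)-(8) term for term.

```latex
\begin{align}
\Gamma_u &= \sigma\left(w_u\left[a^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_u\right)
  && \text{update (input) gate} \\
\tilde{c}^{\langle t\rangle} &= \tanh\left(w_c\left[a^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_c\right)
  && \text{candidate cell value (Tanh layer)} \\
c^{\langle t\rangle} &= \Gamma_f \odot c^{\langle t-1\rangle} + \Gamma_u \odot \tilde{c}^{\langle t\rangle}
  && \text{cell update} \\
\Gamma_o &= \sigma\left(w_o\left[a^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_o\right)
  && \text{output gate} \\
a^{\langle t\rangle} &= \Gamma_o \odot \tanh\left(c^{\langle t\rangle}\right)
  && \text{cell output}
\end{align}
```

Here $\odot$ denotes element-wise multiplication. The cell update line shows how the forgetting gate scales the previous cell state while the update gate scales the new candidate, which is the simultaneous update described in the text.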
In Algorithm 1, $\mu_x$ and $\sigma_x^2$ are the mean and variance of $x_i$ computed over the minibatch. The mean and variance are used to normalize $x_i$ so that the samples follow a standard normal distribution. However, the normalized distribution alone cannot reflect the characteristic distribution of the training samples, and thus it is necessary to introduce the scaling factor $\gamma$ and the shift factor $\beta$. As training progresses, $\gamma$ and $\beta$ are also learned by backpropagation to improve accuracy.
After the BN operation, the features are more distinct, so they are input to the Softmax layer to classify the actions in the time series. In this model, the output layer uses the Softmax normalized exponential function to calculate the posterior probabilities of the different actions. It maps the output values of the neurons into the interval (0, 1), so that they can be regarded as the predicted probabilities of the actions, and the largest one is the classification result. The Softmax output layer then outputs a category vector such as [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], indicating that the classification result is the action numbered 5.
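The normalization, scaling, and shifting steps described above can be sketched for a single feature as follows. This is a minimal illustration under stated assumptions: the function name `batch_norm` is hypothetical, the small constant `eps` (added for numerical stability) follows the usual formulation of batch normalization rather than anything stated in the text, and `gamma` and `beta` are fixed here although in training they are learned.

```python
import math
from typing import List


def batch_norm(batch: List[float], gamma: float = 1.0,
               beta: float = 0.0, eps: float = 1e-5) -> List[float]:
    """Normalize one feature over a minibatch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n                                  # minibatch mean
    variance = sum((x - mean) ** 2 for x in batch) / n     # minibatch variance
    # Zero mean, (approximately) unit variance; eps is an assumed stabilizer.
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    # Scale by gamma and shift by beta.
    return [gamma * x_hat + beta for x_hat in normalized]


outputs = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(outputs)
```

After the transform, the outputs have mean `beta` and standard deviation close to `gamma`, regardless of the scale of the inputs.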
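As a small worked example of this classification step, the sketch below applies Softmax to 12 hypothetical output values and converts the result into the category vector mentioned in the text. The function names and the input scores are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Map raw outputs to probabilities in (0, 1) that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_prediction(probabilities: List[float]) -> List[int]:
    """Return a category vector with a 1 at the most probable class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return [1 if i == best else 0 for i in range(len(probabilities))]


# Hypothetical scores for the 12 action classes; the fifth is the largest.
scores = [0.1, 0.3, -1.0, 0.0, 2.5, 0.2, -0.4, 0.0, 0.1, -2.0, 0.3, 0.0]
probabilities = softmax(scores)
category = one_hot_prediction(probabilities)
action_number = category.index(1) + 1  # actions are numbered from 1
print(category, action_number)
```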

Model Implementation and Training.
The neural network described here is implemented in TensorFlow [29], an open-source library for building and training neural networks. Model training and classification run on a conventional computer with a 2.4 GHz CPU and 16 GB of memory. The model is trained in a fully supervised manner, backpropagating the gradient from the Softmax layer to the convolution layers. Network parameters are optimized by minimizing the cross-entropy loss function [13] using minibatch gradient descent with the Adam optimizer. Adam is widely used due to its simple implementation, efficient calculation, and low memory demand, and it has clear advantages over other stochastic optimization algorithms. In this paper, after the training data are input into the network, the Adam optimizer and the backpropagation algorithm are used to learn and optimize the network parameters. Meanwhile, the cross-entropy loss function is used to calculate the total error between the true label y and the predicted value a.
To improve efficiency, the data segments are grouped into small batches during training and testing. With this configuration, the cumulative gradient of the parameters is calculated after each small batch. The weights are randomly and orthogonally initialized. As a form of regularization, we apply a dropout operator to the input of each dense layer.
This operator sets the activations of randomly selected units to zero during training. The dropout technique proposed by Hinton et al. [30] is based on the principle of randomly deleting some nodes in the network while keeping the input and output neurons intact, which is equivalent to training many different networks. Different networks may overfit in different ways, but averaging their results can effectively reduce overfitting. In addition, dropout allows neurons to learn stronger features by not relying on other specific neurons. The number of parameters to be optimized in a deep neural network varies depending on the types of layers it contains, and it has a great impact on the time and computing power required to train the network. The specific model training parameters reflect the best choices found in the experiments.
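As an illustration of the loss computed during training, the sketch below evaluates the cross-entropy between a true label y and a predicted vector a. It assumes the categorical form for one-hot labels, L = -Σ y_i log a_i, which is the usual choice with a Softmax output; the paper's exact equation is not reproduced in the text, so this form, the function names, and the clipping constant `eps` are assumptions.

```python
import math
from typing import List


def cross_entropy(y_true: List[float], a_pred: List[float],
                  eps: float = 1e-12) -> float:
    """Categorical cross-entropy for one sample with a one-hot label."""
    # Clip predictions away from 0 so that log() is always defined.
    return -sum(y * math.log(max(a, eps)) for y, a in zip(y_true, a_pred))


def mean_cross_entropy(batch_true: List[List[float]],
                       batch_pred: List[List[float]]) -> float:
    """Average the per-sample loss over a minibatch."""
    losses = [cross_entropy(y, a) for y, a in zip(batch_true, batch_pred)]
    return sum(losses) / len(losses)


# Hypothetical three-class example: the true class is the second one.
label = [0.0, 1.0, 0.0]
confident = [0.1, 0.8, 0.1]
unsure = [0.4, 0.3, 0.3]
print(cross_entropy(label, confident), cross_entropy(label, unsure))
```

The loss is small when the predicted probability of the true class is close to 1 and grows as that probability falls, which is the error signal that backpropagation and the optimizer reduce.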
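The dropout operator can be sketched as follows. This is a hedged illustration rather than the authors' code: it uses the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing needs to change at test time. The rescaling, the function name, and the rate of 0.5 are assumptions not stated in the text.

```python
import random
from typing import List


def dropout(activations: List[float], rate: float,
            rng: random.Random, training: bool = True) -> List[float]:
    """Zero each activation with probability `rate` during training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at test time
    keep = 1.0 - rate
    # Keep a unit with probability `keep` and rescale it by 1 / keep.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
layer_input = [0.5, 1.0, 1.5, 2.0]
train_output = dropout(layer_input, rate=0.5, rng=rng)
test_output = dropout(layer_input, rate=0.5, rng=rng, training=False)
print(train_output, test_output)
```

Because a different random subset of units is silenced on every training step, each step effectively trains a different thinned network, which matches the "many different networks" view given above.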

Experiment Data.
In addition to common basic actions, this paper also studies transition actions. In fact, few existing public data sets contain transition actions. Therefore, this paper adopts the international standard data set Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set [31,32], abbreviated as the HAPT data set, to conduct the experiments. The data set is an updated version of the UCI Human Activity Recognition Using Smartphones Data Set [8]. It provides raw data from smart phone sensors rather than preprocessed data. In addition, the action categories have been expanded to include transition actions. The HAPT data set contains twelve types of actions. Firstly, it has six basic actions: three static postures (standing, sitting, and lying) and three walking activities (walking, going downstairs, and going upstairs). Secondly, it has six possible transitions between any two static postures: standing to sitting, sitting to standing, standing to lying, lying to sitting, sitting to lying, and lying to standing. The HAPT data collection process is shown in Figure 4. The experiment involved 30 volunteers, aged 19 to 48, each wearing a smart phone on the waist. Data were collected with the built-in acceleration sensor and gyroscope at a sampling frequency of 50 Hz. Meanwhile, the experimental process was recorded on video for the convenience of subsequent data labeling. The collected data are saved as .txt files, and the acceleration and gyroscope data are stored independently, with 60 groups each. Table 1 shows the label information corresponding to the original data of the experiment. The first column is the experiment ID, the second column is the experimenter number, the third column is the action label, and the fourth and fifth columns are the start and end row numbers of the corresponding sensor data.
The label ranges from 1 to 12, representing the 12 types of actions. It can be seen from the figure that the collected data contain invalid data: the first 250 pieces of data are unlabeled and therefore invalid. After preliminary processing of the original data, all the data without labels were deleted. Finally, 815,614 valid pieces of data were obtained. Due to the low frequency and short duration of transition actions, as well as the high frequency and long duration of basic actions, there is a considerable difference in data volume between the two. The data volume of the six transition actions is much lower than that of the basic actions, accounting for only about 8% of the total data. Table 3 lists the amount of data for the different actions. The original data are divided into three parts: a training set, a verification set, and a test set. The training set is used for model training, the verification set is used to adjust parameters, and the test set is used to measure the quality of the final model.

Algorithm 1: Algorithm of batch normalization.
Input: data set $\chi = \{x_1, \ldots, x_n\}$
Output: $y_i = \mathrm{BN}_{\gamma,\beta}(x_i)$
(1) Calculate the mean of the data set: $\mu_x \leftarrow \frac{1}{n}\sum_{i=1}^{n} x_i$
(2) Calculate the variance of the data set: $\sigma_x^2 \leftarrow \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_x)^2$
(3) Normalize: $\hat{x}_i \leftarrow \frac{x_i - \mu_x}{\sqrt{\sigma_x^2 + \epsilon}}$
(4) Scale change and deviation: $y_i \leftarrow \gamma \hat{x}_i + \beta = \mathrm{BN}_{\gamma,\beta}(x_i)$
(5) Return the learned parameters $\gamma$ and $\beta$
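The five-column label layout just described can be read with a few lines of code. The sketch below is illustrative only: it assumes whitespace-separated integer columns and treats the start and end rows as zero-based, inclusive indices, neither of which is stated in the text, and the class, function names, and toy values are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class LabelRow(NamedTuple):
    experiment_id: int
    user_id: int
    activity_id: int   # 1 to 12
    start_row: int     # first labeled sensor row (assumed zero-based)
    end_row: int       # last labeled sensor row (assumed inclusive)


def parse_label_line(line: str) -> LabelRow:
    """Parse one label line with five whitespace-separated integers."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return LabelRow(*(int(field) for field in fields))


def labeled_samples(samples: Sequence[float],
                    labels: List[LabelRow]) -> List[Tuple[float, int]]:
    """Keep only sensor rows covered by a label; unlabeled rows are dropped."""
    kept = []
    for label in labels:
        for row in range(label.start_row, label.end_row + 1):
            kept.append((samples[row], label.activity_id))
    return kept


# Toy data (hypothetical): six sensor rows, rows 2..4 labeled as activity 5.
row = parse_label_line("1 1 5 2 4")
pairs = labeled_samples([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], [row])
print(row, pairs)
```

Rows that fall outside every labeled segment never reach `pairs`, which corresponds to deleting the unlabeled data before training.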

Parameters Setting.
In a deep learning network, the model parameters greatly affect the recognition rate. Therefore, the following sections present an experimental analysis of the number of neurons in the LSTM layer, the learning rate, BN, the batch size, and other parameters.

Number of Neurons in LSTM Layer.
To verify the influence of the number of neurons in the LSTM layer on the recognition results, the experiments shown in Figure 5 were carried out. The recognition rate is lowest when each LSTM layer contains only 8 neurons. This is because, with fewer neurons, the network lacks the necessary learning and information processing ability, resulting in a low recognition rate. As the number of neurons increases, the recognition rate tends to increase. When the number of neurons is 64, the recognition rate reaches 95.87%. If the number of neurons is too large, the complexity of the network structure increases and the learning speed of the network slows down. Therefore, considering the training time of the network, the number of LSTM layer neurons in this paper is set to 64.

The Learning Rates.
Experiments are carried out at different learning rates in this paper. As shown in Table 4, the recognition rate of the model reaches a maximum of 95.87% when the learning rate is 0.002. Therefore, a learning rate of 0.002 is adopted.

BN Operation.
To verify the improvement that the BN operation brings to the network model, a comparative experiment is carried out with and without the BN layer. The number of epochs is set to 400, and the other parameters remain unchanged. The recognition rates of both variants on the test set are shown in Table 5. The recognition rate on the test set improves by about 4.24% after the BN layer is added.

Batch Size.
Batch size refers to the number of samples in a batch, whose maximum value is the total number of samples in the training set. When the amount of data is small, the batch can be the whole data set, so that the gradient approaches the direction of the extremum more accurately. However, in practical applications, the amount of data used by deep learning is relatively large, and small batch processing is generally adopted. Small batch processing requires relatively little memory and trains faster. Within an appropriate range, increasing the batch size determines the direction of gradient descent more accurately and causes less oscillation during training. However, when the batch size increases beyond a certain value, the determined descent direction no longer changes and the correction of parameters slows down significantly. The identification results for different batch sizes are shown in Table 6. When the batch size is 150, the identification rate reaches its maximum of 95.87%. Therefore, 150 is selected as the best batch size in this paper. The parameters of the CNN-LSTM model proposed in this paper are shown in Table 7.
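For convenience, the settings reported in the text can be collected in one configuration mapping. The values below are taken from the preceding sections; the dictionary name, the key names, and the helper `batches_per_epoch` are hypothetical, and the sample count in the usage line is a made-up number because the size of the training split is not given here.

```python
import math

# Settings stated in the text; key names are illustrative only.
HYPERPARAMETERS = {
    "conv_filters": (18, 36, 72),
    "conv_kernel_sizes": ((2, 8), (2, 18), (2, 36)),
    "conv_stride": 1,
    "conv_padding": "SAME",
    "activation": "relu",
    "pooling": "max",
    "lstm_units": 64,
    "learning_rate": 0.002,
    "batch_size": 150,
    "optimizer": "adam",
    "loss": "cross_entropy",
    "num_classes": 12,
}


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of minibatches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


# Hypothetical training-set size, used only to show the arithmetic.
print(batches_per_epoch(1000, HYPERPARAMETERS["batch_size"]))
```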

Experimental Results and Analysis
For human movement recognition, Wang and Liu [33] proposed using the F-measure to verify the performance of the deep-rooted LSTM network model in human activity recognition. Lu et al. [34] demonstrated the superiority of their model in behavior recognition by using accuracy, prediction rate, and recall rate in the experiment. Therefore, to evaluate the performance of the motion recognition method proposed in this paper, we also use accuracy, recall rate, loss rate, and F-measure in the experiment.
With the above parameters, the recognition confusion matrix for the 12 different actions is shown in Table 8. The accuracy curve of the CNN-LSTM model is shown in Figure 6. It can be seen from Table 9 that the overall recognition rate of CNN-LSTM is high and that CNN-LSTM also recognizes the transition actions well.
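The evaluation measures can be derived directly from a confusion matrix. The sketch below shows one way to do this; it assumes that rows hold the true classes and columns the predicted classes, uses a small two-class matrix with made-up counts rather than the paper's results, and uses hypothetical function names.

```python
from typing import List, Tuple


def accuracy(confusion: List[List[int]]) -> float:
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def per_class_metrics(confusion: List[List[int]]) -> List[Tuple[float, float, float]]:
    """Return (precision, recall, F-measure) for every class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # missed samples of class k
        fp = sum(confusion[i][k] for i in range(n)) - tp  # wrongly assigned to k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f_measure))
    return metrics


# Made-up counts for two classes, for illustration only.
toy_confusion = [[8, 2],
                 [1, 9]]
print(accuracy(toy_confusion), per_class_metrics(toy_confusion))
```

Reporting recall and F-measure per class matters here because the transition actions make up only a small share of the data, so overall accuracy alone can hide poor performance on them.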

Case Study
Among non-deep-learning methods, the random forest (RF) and K-nearest neighbor (KNN) classifiers perform well in action classification. Therefore, the proposed CNN-LSTM model is compared with the RF and KNN methods. First, the HAPT data set is input into RF and KNN. Then, segment the original ... Table 10. It can be seen that the recognition rate of the CNN-LSTM model is higher than that of the RF and KNN methods for both basic actions and transition actions.
In addition to the comparison with the RF and KNN classifiers, our proposed model is also compared with a single CNN, a single LSTM, CNN-GRU, and CNN-BLSTM deep learning models. Table 11 shows the average accuracy for the various actions in the five different deep models. As can be seen from Table 11, CNN-LSTM not only recognizes basic movements slightly better than the other four models but also recognizes transition movements significantly better, especially standing to sitting, sitting to lying, and standing to lying. Table 12 shows the recognition rates of the different models on the test set. It can be seen from the table that the average recognition rate of the three models is higher than 90%, but the CNN-LSTM model performs slightly better than CNN, LSTM, CNN-GRU, and CNN-BLSTM.
To prove the effectiveness of the CNN-LSTM deep learning model, it is also compared with other deep learning methods using the same dataset. Kuang [35] applied BLSTM to construct a behavior recognition model. Hassan et al. [36] used a deep belief network (DBN) for human behavior recognition. We compared the performance of our model with the approaches in [35,36], with the results shown in Table 13. The proposed CNN-LSTM achieves the highest average recognition rate.

Conclusion
This paper explored recognition methods based on deep learning and designed a behavior recognition model based on CNN-LSTM. The CNN learns local features from the original sensor data, and the LSTM extracts time-dependent relationships from the local features, fusing local and global features to describe basic and transition movements in detail and to identify both motion patterns accurately. The actions identified in this paper only include common basic actions and individual transition actions. In the next step, more kinds of actions can be collected and more complex actions can be added, such as eating and driving. Individual recognition can also be realized by considering the behavior differences between users. Meanwhile, the deep learning model still needs to be optimized and improved. Studies show that the combination of deep and shallow models can achieve better performance. Deep learning models have strong learning ability, while shallow learning models have higher learning efficiency. Collaboration between the two can achieve more accurate and lightweight recognition.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare no conflicts of interest.

Table 13: Average accuracy of different methods on test set in the paper [35,36].