An Intelligent Optimization Strategy Based on Deep Reinforcement Learning for Step Counting

With the popularity of Internet of things technology and intelligent devices, the application prospect of accurate step counting has gained more and more attention. To solve the problems that existing algorithms use a threshold to filter noise and that their parameters cannot be updated in time, an intelligent optimization strategy based on deep reinforcement learning is proposed. In this study, the counting problem is transformed into a serialization decision optimization. This study integrates noise recognition and user feedback to update the parameters. The end-to-end processing is direct, which alleviates the inaccurate step counting caused in the follow-up step counting module by inaccurate noise filtering in two-stage processing, and keeps the model parameters continuously updated. Finally, the experimental results show that the proposed model achieves superior performance to existing approaches.


Introduction
Accurate and efficient step counting plays an important role in exercise [1]. With the development of Internet of things (IoT) and artificial intelligence technology, a variety of motion monitoring devices have sprung up, such as pedometers and sports bracelets.
As a method of motion statistics, step counting is achieved by analyzing gait characteristics, which mainly refer to the individual behavior characteristics of the human body in normal walking, including speed, amplitude, and posture; these characteristics differ strongly between individuals [2].
There are mainly two kinds of step counting devices. One is the customized pedometer, which has a single function and a stable effect but requires dedicated hardware [3]. The other is the common application based on a gait recognition system, which is relatively simple, feasible, and easy to implement, obtaining motion data through acceleration sensors.
Step counting has been widely used in personal health assistance, medical health monitoring, positioning and navigation, etc. With the development of IoT, step counting equipment can achieve good accuracy: sensors have become more powerful, data acquisition has become more convenient and accurate, and the anti-interference ability is stronger. However, two shortcomings need to be taken seriously into consideration: first, the user's various behaviors easily cause noise, such as arm shaking, turning, squatting, and falling; second, the parameters cannot be continuously optimized. The traditional pedometer mostly uses signal processing to analyze the sampled data and count the steps. However, different users exhibit different uncertain behaviors in the actual environment, so noise data are generated randomly. The general approach is to add a threshold to filter the noise data, but how to optimize the threshold value brings new problems and, in practice, high error.
How can the abovementioned problems be solved? In this study, an intelligent optimization algorithm based on deep reinforcement learning [4] is proposed, which overcomes the shortcomings in parameter updating, optimization, and self-learning when filtering noise data. The original problem is therefore transformed into a series of decision optimizations. In deep reinforcement learning, with the representation learning ability of deep learning, the agent can automatically and autonomously learn effective features from the original data. Experiments on real data sets verify that the proposed model can optimize the ability of noise recognition and improve the final accuracy. The remainder of the study is organized as follows: the related work is presented in Section 2; the problem and the details of the proposed approach are shown in Section 3; the experimental results are shown in Section 4; and finally, we draw the conclusion in Section 5.

Related Work
Recently, a variety of step counting applications have been developed, and new approaches are constantly being proposed. Many studies have addressed this topic. The traditional approaches include threshold methods and wave peak detection, which mainly run on embedded modules with weak computing ability. With the development of microelectronics technology and the emergence of intelligent devices, computing power has improved, and more and more new types of approaches have been proposed.

Traditional Approaches.
Traditional approaches first collect the data of the acceleration sensor, then smooth the waveform, and finally count the steps by a signal processing approach. The threshold method sets up various conditions on the curve change of the acceleration waveform; if the change in the signal waveform meets these conditions, it is recorded as one step. Zhang et al. [5] used a dynamic threshold method to count steps: the mean value of the data waveform is calculated as the threshold, and the threshold is constantly updated as sampling progresses. When a sampling point falls below the current threshold for the first time, it is recorded as a step. They also pointed out that this method is highly sensitive: when the pedometer vibrates frequently, or vibrates slowly while the user is not walking, it still counts steps. In Ref. [6], the sensor is tied to the ankle, and steps are counted according to whether the acceleration exceeds a threshold. In the process of reckoning, the authors of [7] combine vertical-angle data with the autocorrelation coefficient method to calculate steps, which shows good detection accuracy but has a high calculation cost. Chen et al. [8] used an ensemble network architecture for deep reinforcement learning based on value function approximation. Jayalath and Abhayasinghe [9] proposed using a gyroscope as the acquisition sensor and calculating the number of steps from the number of zero crossings.
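The dynamic-threshold idea can be sketched as follows; the window size, the running-mean threshold, and the crossing rule below are illustrative assumptions, not the exact procedure of [5]:

```python
def count_steps_dynamic_threshold(signal, window=4):
    """Count steps with a dynamically updated threshold.

    The threshold is the mean of the most recent samples; a step is
    recorded the first time the signal drops below the current threshold
    after having been above it.
    """
    steps = 0
    above = False
    for i, x in enumerate(signal):
        start = max(0, i - window)
        threshold = sum(signal[start:i + 1]) / (i + 1 - start)
        if x > threshold:
            above = True                  # waveform rose above the running mean
        elif above and x < threshold:
            steps += 1                    # first drop below the threshold: one step
            above = False
    return steps

# A toy periodic waveform with four strides.
wave = [0, 2, 4, 2, 0, -2, -4, -2] * 4
```

On this toy waveform the counter returns one step per oscillation; as the paragraph notes, this same sensitivity means non-walking vibrations would also be counted.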
The peak detection approach counts the number of wave peaks in the sensor data to achieve step counting; its disadvantage is that it is easily affected by pseudo-peaks. Lu and Velipasalar [10] used the peak detection method to detect troughs and then calculated the trough value. Lou et al. [11] combined an intermediate threshold method to determine the wave peak value. Lou et al. [12] proposed an adaptive peak detection method: according to a set threshold, the normal state and the abnormal state are determined, different neighborhood windows are set for the different states, and the wave peaks are counted within the window. Liu and Yang [13] designed an adaptive time window to determine the wave crest and trough simultaneously through an adaptive double threshold. The peak detection approach needs little calculation and is easy to implement; it can effectively complete step counting in simple scenarios, but it has a weak ability to identify noise, especially when the acquisition equipment vibrates randomly, such as during shaking.
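A minimal peak-detection counter in the spirit of this paragraph; the `min_height` and `min_gap` guards against pseudo-peaks are assumed parameters, not values from the cited works:

```python
def count_peaks(signal, min_height=1.0, min_gap=3):
    """Count wave peaks as steps.

    A sample is a peak if it exceeds both neighbours and `min_height`;
    `min_gap` suppresses pseudo-peaks that follow a real peak too closely.
    """
    steps = 0
    last_peak = -min_gap                  # allow a peak right at the start
    for i in range(1, len(signal) - 1):
        is_peak = signal[i] > signal[i - 1] and signal[i] > signal[i + 1]
        if is_peak and signal[i] >= min_height and i - last_peak >= min_gap:
            steps += 1
            last_peak = i
    return steps

walk = [0, 1, 3, 1, 0, 1, 3, 1, 0, 1, 3, 1, 0]   # three clean strides
shaky = [0, 1, 3, 2.9, 3, 1, 0]                  # jitter creates a pseudo-peak
```

Without the `min_gap` guard, the jittery example would be counted as two steps instead of one, which is exactly the pseudo-peak weakness described above.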
In addition, waveform analysis is a kind of similarity calculation approach, which counts steps by analyzing the ascending and descending trends of the acceleration data. Wang et al. [14] abstracted the change in the waveform into three states: stationary state, peak state, and trough state. Tang et al. [15] calculated the horizontal acceleration, vertical acceleration, and angle from the linear acceleration and gravitational acceleration, and computed the distance between the collected data and the wave data in a sampling set. Rai et al. [16] proposed an autocorrelation analysis approach, which calculates the similarity between the data collected in the previous period and the current period.
The similarity calculation method can analyze periodic data well, but it needs to analyze the cycle of the data in advance, and for different walking speeds, the data cycle differs.
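The autocorrelation idea of [16] can be sketched as follows; taking the step period to be the lag with the strongest self-similarity, and the lag bounds, are illustrative assumptions:

```python
def autocorr(signal, lag):
    """Normalized autocorrelation of `signal` at the given lag."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal)
    if var == 0:
        return 0.0
    cov = sum((signal[i] - mean) * (signal[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def estimate_period(signal, min_lag=2, max_lag=20):
    """Estimated step period: the lag where the data best matches itself."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorr(signal, lag))

wave = [0, 2, 4, 2, 0, -2, -4, -2] * 5   # strictly periodic, period 8
period = estimate_period(wave)
```

Dividing the record length by the estimated period then yields a step count; as the paragraph notes, this presumes the walking cycle is stable within the analysis window.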

State-of-the-Art Methods.
Due to the defects of traditional approaches, various improved ones have been proposed. Xiao [17] found that the interference signal at the waist is not completely random and designed an approach based on Freeman coding of the signal; in Refs. [18, 19], the authors designed an approach that tests the signal waveform, which has three obvious phases, namely the downward swing phase, the upward swing phase, and the standing phase. The above two approaches use local features to improve accuracy, at the cost of high time and implementation complexity. However, their accuracy in practical application is not as high as that of the threshold method, especially in the case of long-time walking.
With the rapid development of intelligent devices, more and more approaches have been proposed. In Ref. [20], the authors divided movement into rhythmic and nonrhythmic activities by collecting the triaxial acceleration data of the wrist; when an oscillation meets certain requirements on length, intensity, and shape, it is considered step counting data. In Ref. [21], the authors proposed using a gyroscope as the acquisition sensor and calculating the number of steps from the number of zero crossings. Yang et al. [22] extracted the amplitude feature, amplitude difference feature, triaxial cross-correlation feature, and energy ratio from the data in the acquisition time window and used an OCSVM to distinguish normal data from abnormal data. Siddanahalli Ninge Gowda et al. [23] used an acceleration sensor, a gyroscope, and a magnetic field sensor to collect data, identified the position of the acquisition equipment by setting a threshold, and then carried out step counting. The magnetic field sensor is combined to filter the noise generated by random movement, and the approach uses only a single sensor for step calculation after identifying the location.
Generally, the step counting process is divided into two stages: filtering and then step counting. The two-stage processing method further improves the algorithm's ability to deal with noise. However, it requires highly accurate noise filtering: when the noise filtering is inaccurate, the step counting easily becomes inaccurate. In addition, the parameter updating problem is not taken into consideration in the above approaches.
Because of the shortcomings of the current approaches, this study proposes an intelligent optimization strategy based on deep reinforcement learning. Effective features are extracted by deep learning, and the self-learning ability of reinforcement learning is combined to deal with the noise problem and the continuous parameter updating problem. Finally, accurate step counting is realized in the presence of noise, and the parameters are continuously optimized.

The Proposed Framework
In this section, the problem is transformed into a series of decision optimizations, and a deep reinforcement learning framework is introduced to solve it (Section 3.1). The procedure of the proposed framework is illustrated (Section 3.2). Then, Section 3.3 introduces how to optimize the parameters in the proposed framework. Finally, Section 3.4 presents the approach in detail.

The Deep Reinforcement Learning Framework.
Deep learning is a technology with powerful function approximation and representation ability, obtained mainly by combining multiple simple but nonlinear modules. Each layer processes the output of the previous module and continuously transforms low-level data into a higher-level representation. Deep reinforcement learning has been applied to various fields of life and has achieved excellent results; for example, the deep convolutional generative adversarial network in Ref. [24] is used to reconstruct the missing information between tomb mural blocks. Reinforcement learning is a method for learning optimal sequential decisions. The correct decision-making action is never known directly; only good-or-bad feedback can be obtained, and the agent needs to learn from that feedback. Therefore, the agent must interact with the environment constantly: by making decisions on the input at each moment, the corresponding reward signal is obtained, and with the help of the reward signal, the corresponding decision-making mechanism is learned until optimal decision-making is finally realized. It is worth noting that the decision at the current moment may affect subsequent results.
Deep reinforcement learning, which combines the advantages of deep learning and reinforcement learning, can learn directly from the original data. Figure 1 shows the framework of deep reinforcement learning. It includes the environment (E), the agent (A), and the sum unit (S). The environment (E) mainly provides input for the agent (A), evaluates the agent's actions, and receives the result of the sum unit (S); moreover, the step counting result is evaluated, and the evaluation result is returned to the agent (A) as the final reward information r to guide the training of the agent's strategy. The agent (A) makes a decision and selects the corresponding action, which is sent to the environment (E) and the sum unit (S); the sum unit (S) accumulates the actions from the agent to obtain the total number of steps.
In application, the environment (E) corresponds to the user and the user's data acquisition device, the agent (A) corresponds to the algorithm used for step counting, and the sum unit (S) corresponds to the step number accumulation and storage part. The reward information may come from the user's manual input or from other input, such as a built-in simple step counting function.

The Procedure of the Proposed Framework.
The main procedure of the proposed step counting framework is as follows: first, the sensor data are collected; then, the agent obtains the data from the environment E and a reward is obtained; and finally, the sum unit accumulates the steps.
In the step counting procedure, the agent makes decisions according to the observation obtained from the current environment. We transform the whole step counting procedure into a decision-making procedure and establish a reinforcement learning model for it. The model mainly includes five parts: (a) State: the observation constructed from the sensor data collected in the current period. (b) Action: the decision made for each observation, namely whether the observation is counted as a step. (c) State transition: the transition between states is not related to the action in the environment, and the transition probability is 1. (d) Discount factor: when the data have completed the decisions of a certain period and the total number of steps obtained by the sum unit is returned to the environment, the current agent obtains an effective reward. Therefore, the discount factor is chosen so that more attention is paid to the reward that can be obtained at the end of step counting, which is also the evaluation of the final step counting result. (e) Reward R: the sum unit sums the actions selected by the agent at each observation to obtain the final total number of steps, which the environment evaluates. Only when the environment obtains the final total steps can an effective reward be returned; otherwise, the reward is zero. Specifically, the closer the total number of steps is to the real number of steps, the higher the reward will be, and the further away, the lower the reward will be.
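The terminal reward described in (e) might be realized as in this sketch; the exact shape (1 minus the relative step-count error, clipped at zero) is our assumption for illustration:

```python
def reward(predicted_total, true_total, done):
    """Zero until the episode ends; then higher the closer the predicted
    total step count is to the real one (an illustrative reward shape)."""
    if not done:
        return 0.0
    error = abs(predicted_total - true_total) / max(true_total, 1)
    return max(0.0, 1.0 - error)
```

Any monotone function of the step-count error would fit the description in the text; the linear clipped form is simply the most direct choice.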
Furthermore, two aspects need to be explained. First, although the transition between the former and latter states is certain, the decisions in different states affect the result; that is, the action corresponding to each state affects the total number of steps, and in this study, the contribution of each state's selected action to the final reward is consistent. Second, different users, such as young people, middle-aged people, and the elderly, may not have the same pace frequency when walking; at the same time, the data differ under different usage scenarios, such as running and going up and down stairs. How to extract effective features from the data is therefore the first problem to be solved. To implement the mapping function of the strategy network, this study uses a convolutional neural network (CNN) for end-to-end processing and exploits the powerful representation ability of neural networks to extract useful features from the observations.
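As a rough illustration of such a convolutional strategy network, the toy policy below maps an observation window to action probabilities; its size, pooling, and initialization are assumptions, far smaller than a practical CNN:

```python
import math
import random

def conv1d(x, kernel):
    """Valid 1-D convolution producing one feature map."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class TinyPolicy:
    """One conv kernel over the observation, global max pooling, then a
    linear layer mapped to probabilities over actions {0: noise, 1: step}."""

    def __init__(self, kernel_size=3, n_actions=2, seed=0):
        rng = random.Random(seed)
        self.kernel = [rng.uniform(-0.5, 0.5) for _ in range(kernel_size)]
        self.w = [rng.uniform(-0.5, 0.5) for _ in range(n_actions)]
        self.b = [0.0] * n_actions

    def probs(self, obs):
        feature = max(conv1d(obs, self.kernel))       # global max pooling
        logits = [w * feature + b for w, b in zip(self.w, self.b)]
        return softmax(logits)

policy = TinyPolicy()
p = policy.probs([0.1, 0.8, 1.2, 0.7, 0.0])
```

The convolution makes the policy tolerant of where the stride pattern falls inside the observation window, which is the property the paragraph attributes to the CNN.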

Parameter Optimization.
In deep reinforcement learning, the parameters of the strategy network are trained by maximizing the expectation of the cumulative reward. In this study, the reward function is tied to the accuracy of step counting: the more accurate the step counting, the greater the reward, and the larger the corresponding cumulative reward. When training the parameters of the strategy network, standard deep reinforcement learning optimization techniques are applied to achieve accurate step counting.
To solve the objective function quickly, the gradient of the expected cumulative reward with respect to the strategy network parameters is needed, which returns the reward value according to the current strategy. Expanding with the Bellman equation, the policy gradient is

$\nabla J(\theta) \propto \sum_{s} \sum_{k=0}^{\infty} P(s_0 \rightarrow s, k, \pi) \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s)\, q_{\pi}(s, a)$,   (1)

where $P(s_0 \rightarrow s, k, \pi)$ represents the probability of transitioning from state $s_0$ to state $s$ in $k$ steps under strategy $\pi$. Setting $d_{\pi}(s) = \sum_{k=0}^{\infty} P(s_0 \rightarrow s, k, \pi)$, equation (1) can be expressed as

$\nabla J(\theta) = \sum_{s} d_{\pi}(s) \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s)\, q_{\pi}(s, a)$.   (2)

Equation (2) is independent of the current state distribution. Replacing $q_{\pi}(s_t, a_t) = \mathbb{E}_{\pi}[G_t \mid s_t, a_t]$ with the sampled return $G_t^i = \sum_{k=t}^{T} r_k^i$, where $N$ is the number of sampled rounds, gradient descent can be used to solve this optimization, and the loss function is

$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} G_t^i \log \pi_{\theta}(a_t^i \mid s_t^i)$,   (3)

whose gradient, used in the parameter update $\theta \leftarrow \theta - \alpha \cdot \nabla L(\theta)$, is

$\nabla L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} G_t^i \nabla_{\theta} \log \pi_{\theta}(a_t^i \mid s_t^i)$.   (4)
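The sampled return and a REINFORCE-style loss of this form can be computed directly; the helper names below are ours:

```python
import math

def returns_to_go(rewards):
    """G_t = sum_{k=t}^{T} r_k, computed backwards in one pass."""
    total, out = 0.0, []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return list(reversed(out))

def reinforce_loss(action_probs, returns):
    """L = -sum_t G_t * log pi(a_t | s_t); minimizing it ascends the
    policy-gradient objective for a single sampled trajectory (N = 1)."""
    return -sum(g * math.log(p) for p, g in zip(action_probs, returns))

rewards = [0.0, 0.0, 1.0]              # only the terminal step is rewarded
G = returns_to_go(rewards)             # every step shares the final reward
loss = reinforce_loss([0.5, 0.5, 0.5], G)
```

Because the reward is terminal, every decision in the period receives the same return, matching the statement that each state's action contributes equally to the final reward.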

The Deep Reinforcement Learning Algorithm.
The proposed algorithm is shown in Algorithm 1.
In Algorithm 1, lines 1 to 6 are the interaction between the policy network and the environment. In lines 7 and 8, the data processing of the current period is completed, and the sum unit returns the total step count to the environment; the environment inputs the total number of steps and the real total number y into the reward function to obtain the total feedback of the current round. Lines 9 to 13 update the strategy network parameters: line 10 calculates the cumulative reward G_t at the current time t, and lines 11 and 12 use the cumulative reward G_t to update the strategy network by gradient descent.
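To make the control flow of Algorithm 1 concrete, here is a toy run of the same loop; the environment, the terminal accuracy reward, and the one-parameter logistic policy are stand-ins for illustration, not the paper's actual components:

```python
import math
import random

rng = random.Random(0)
truth = [1, 0, 1, 1, 0, 1, 1, 1]        # hypothetical per-window labels (1 = step)
theta, alpha = 0.0, 0.5                  # policy parameter and learning rate

for episode in range(200):               # line 1: for each round
    p1 = 1.0 / (1.0 + math.exp(-theta))  # pi(a=1 | s), state-independent here
    actions, grads = [], []
    for t in range(len(truth)):          # lines 3-6: roll out one period
        a = 1 if rng.random() < p1 else 0
        actions.append(a)
        grads.append(a - p1)             # d log pi(a) / d theta, Bernoulli policy
    correct = sum(1 for a, y in zip(actions, truth) if a == y)
    R = correct / len(truth)             # lines 7-8: terminal reward from the sum
    for g in grads:                      # lines 9-13: G_t = R for every t
        theta += alpha * R * g           # ascend the policy-gradient objective

p1 = 1.0 / (1.0 + math.exp(-theta))
```

Because most windows in `truth` are steps, the trained policy ends up strongly preferring action 1; with a real CNN policy, the gradient in line 12 of Algorithm 1 plays the role of the per-step update here.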

Experimental Analysis
In this section, we conduct experiments on a real data set to validate the effectiveness of our approach.

Data Set.
The data set used in this section is provided by Cambridge University [25]; it not only contains walking data but also collects noise data, and the duration of the two is roughly the same. The main features of the data set are shown in Table 1. The data in Table 1 were collected by Cambridge University from test subjects walking the same distance while carrying the mobile phone in various states, such as holding it in the hand, in a pocket, in a backpack, or in a handbag. The route is divided into three stages, and the walking states are steady pace, acceleration, and deceleration. The data acquisition frequency was 100 Hz. There were 27 young participants, 18 men and 9 women, and a total of 130 walking routes. The height of all participants was in the normal range. The data came from sensors at six positions: in the hand, in the hand while typing, the front and rear trouser pockets, the handbag, and the backpack. The collected samples record the start time and end time of walking; data outside this time period are regarded as noise data. The data set comprehensively covers the ways in which young people currently carry mobile phones in various scenarios.

Data Preprocessing.
In the process of collecting the original data, the higher the acquisition frequency, the more sensitive the sensor is. Therefore, all kinds of tiny noises can be captured, so it is necessary to preprocess the original data. On the other hand, deep learning is used to learn the data representation in deep reinforcement learning. To make the network learn features better, it is necessary to preprocess the data, for example by unifying the dimensions of the data, to facilitate the matrix operations in the neural network. In this study, the data preprocessing operations are as follows:

(1) Calculate the resultant acceleration and remove gravity: in step counting, the combined acceleration value is calculated from the three components. Considering that gravity is a fixed value in a given environment, and to better reflect the positive and negative characteristics of the data, the gravitational acceleration of 9.80665 is subtracted: $a_{\mathrm{norm}} = \sqrt{a_x^2 + a_y^2 + a_z^2} - 9.80665$.

(2) Down sampling: in deep learning, the model parameters are related not only to the network structure but also to the input data. Considering the operation efficiency, and to remove small noise, the data obtained in this study are sampled with non-overlapping windows: $acc'_i = \frac{1}{l}\sum_{k=0}^{l-1} acc_{il+k}$, where $acc'_i$ is the data after down sampling and $l$ is the window size. In Figure 2, the y-coordinate is the value after subtracting the gravitational acceleration 9.80665, so it floats around 0; the x-coordinate is the sampling point, and the down-sampling window size is 4. Figure 2 compares the data before and after down sampling, and the red part indicates that the original small noise points disappear after down sampling. As can be seen, down sampling not only removes small noise but also reduces the amount of data and the input dimension of the model.

(3) Data segmentation and filling: the size and length of the input data must be set in advance when using a convolutional neural network, so the dimension of the input data needs to be fixed. In application, the length of the data collected by the sensors grows as the acquisition equipment runs, so the data must be segmented to fit the preset input size of the neural network. To unify segments of different lengths, the segments are filled or cropped. In the filling process, the destruction of the original data should be reduced as much as possible: the left and right edge values are copied to fill, so that the filled data have a uniform length without destroying the original waveform. In the cropping process, the original waveform should likewise not be damaged, so the data on the left and right sides are trimmed at the same time. Figures 3 and 4 show how the acceleration collected by the mobile phone sensor changes when the phone is held in the hand. In Figure 3, the acceleration first increases and then decreases, and peaks and troughs appear alternately. Since a peak and a trough each appear only once per step, taking them as the dividing points between the current step and the subsequent step segments the data quickly without damaging the front and rear waveforms.

Algorithm 1: the proposed deep reinforcement learning step counting algorithm.
  Randomly initialize the parameters θ of the strategy network; set the learning rate α and the total number of rounds N.
  (1) for i = 1, ..., N do
  (2)   Get the initialization state s_0
  (3)   for t = 1, ..., T do
  (4)     Get a_t according to π(a_t | s_t)
  (5)     Accumulate the reward according to a_t
  (6)   end for
  (7)   Get the step total y′ from the sum unit
  (8)   Get the reward according to y′
  (9)   for t = 1, ..., T do
  (10)    G_t = Σ_{k=t}^{T} r_k
  (11)    Compute the parameter gradient according to formula (4)
  (12)    Update the parameters: θ ← θ − α · ∇L(θ)
  (13)  end for
  (14) end for
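Preprocessing steps (1) and (2) can be sketched as follows; the averaging form of the non-overlapping down-sampling window is our assumption:

```python
import math

GRAVITY = 9.80665  # subtracted so the signal floats around zero

def resultant_minus_gravity(ax, ay, az):
    """Step (1): magnitude of the triaxial acceleration minus gravity."""
    return math.sqrt(ax * ax + ay * ay + az * az) - GRAVITY

def downsample(signal, window=4):
    """Step (2): average non-overlapping windows, suppressing tiny noise
    and shrinking the input dimension of the model."""
    return [sum(signal[i:i + window]) / window
            for i in range(0, len(signal) - window + 1, window)]

rest = resultant_minus_gravity(0.0, 0.0, 9.80665)     # device at rest: near 0
coarse = downsample([0, 0, 4, 4, 8, 8, 0, 0], window=2)
```

Using the magnitude of the three axes makes the result independent of the phone's orientation, which matters because the data set covers hand, pocket, and bag positions.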
After the data are segmented, the lengths of the segments are inconsistent owing to different walking speeds, different positions of the acquisition equipment, and different road conditions. Taking Figure 3 as an example, after the data are divided, segment lengths of 11, 13, and 15 appear. To unify segments of different lengths, they must be filled and cropped. The comparison before and after segmentation and fixing the length to 21 is shown in Figure 4.
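The fill-and-crop step might look like the sketch below; splitting the padding or trimming evenly between the two sides is an assumed detail, consistent with trimming "the data on the left and right sides at the same time":

```python
def fix_length(segment, target):
    """Pad by replicating the edge values, or trim both sides evenly, so
    every step segment has the same length without distorting its shape."""
    n = len(segment)
    if n < target:                        # fill: copy the left/right edge values
        left = (target - n) // 2
        right = target - n - left
        return [segment[0]] * left + segment + [segment[-1]] * right
    if n > target:                        # crop both sides at the same time
        left = (n - target) // 2
        return segment[left:left + target]
    return segment

padded = fix_length([1, 2, 3], 7)
cropped = fix_length([9, 1, 2, 3, 4, 5, 9], 5)
```

Edge replication keeps the start and end of the waveform flat instead of introducing artificial jumps, which is the "do not destroy the original waveform" requirement stated above.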

Comparisons.
To show the performance improvement of the proposed approach, we compare it with several common approaches and commercial products: peak detection, autocorrelation coefficient, a frequency-based method using acceleration, Spring Run, and Ledongli. In addition, we use accuracy as the evaluation metric, defined as

$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \left(1 - \frac{|y'_i - y_i|}{y_i}\right)$,

where $y'_i$ is the estimated number of steps and $y_i$ is the actual number of steps of the i-th walk. In this study, the training data and the test data are first randomly split 1:1, several groups of experiments are carried out, and the average value is taken as the result of the proposed framework.
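Assuming the metric averages 1 minus the relative step-count error over walks, a direct implementation is:

```python
def step_accuracy(estimated, actual):
    """Mean of 1 - |y'_i - y_i| / y_i over all walks."""
    total = 0.0
    for y_hat, y in zip(estimated, actual):
        total += 1.0 - abs(y_hat - y) / y
    return total / len(actual)

acc = step_accuracy([98, 105, 100], [100, 100, 100])
```

Note that the metric penalizes over-counting and under-counting symmetrically, so an algorithm that counts noise as steps is punished just as one that misses real steps.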

Experimental Results.
This experiment is based on the deep reinforcement learning framework. In the strategy network, the parameter values affect the experimental results. Firstly, the acceleration data are processed and used as the input of the model. For convenience, the convolution kernel length of the convolution layer is set to 21, and the convolution kernel width is 1, because the resultant acceleration input is one-dimensional. The influence of the learning rate on the step counting error, without adding the three fully connected layers, is shown in Figure 5, where the x-coordinate is the learning rate and the y-coordinate is the error.
From the overall trend, the step counting errors of the four models first decrease and then increase with the increase in learning rate and reach the trough between the learning rates of 0.0001 and 0.01.
At this time, the effects of the two methods are similar, especially when the learning rate is between 0.0005 and 0.01: the step counting error barely changes with the learning rate, showing good stability. When the number of convolution kernels is 30, the effect is better than the models with 10 and 20 convolution kernels, and increasing the number of kernels further does not greatly improve the performance of the model. Therefore, the model with 30 convolution kernels is taken as the basic model for subsequent experiments. The average accuracies of different approaches and products are shown in Figure 6. Deep reinforcement learning (DRL) is the framework proposed in this study; AC [6] represents the autocorrelation coefficient algorithm; PD [10] represents the peak detection algorithm; and FA [15] represents the frequency-based method using acceleration. The accuracy of the proposed approach (DRL) is close to optimal, and its average accuracy is as high as 97.28%, which is higher than both the compared traditional approaches and the commercial products by at least 5.01%. In complex cases, the accuracies of Spring Run and the PD-based algorithm are relatively low, only about 80%. In addition, the blue color represents the average accuracy, and red and grey represent the best and worst accuracy in complex situations.

Conclusions and Future Work
In the step counting process of traditional algorithms, a two-stage processing method is used; that is, the noise is filtered first, and then the steps are counted. To alleviate the inaccurate step counting caused by inaccurate noise filtering and to continuously optimize the model parameters, a step counting framework based on deep reinforcement learning is proposed. Our main contributions are as follows: (1) Step counting is transformed into a serialization decision optimization problem, and an intelligent optimization strategy based on deep reinforcement learning is designed; it is an end-to-end direct solution. (2) Down sampling, segmentation, and filling techniques are used to preprocess the data, which reduces the dimension of the input data; Figures 2-4 show the comparisons before and after down sampling, segmentation, and filling. In addition, the interaction with users is utilized to update the parameters through user feedback while identifying noise. (3) Experiments show that the proposed framework obtains more effective and accurate results. Figure 5 shows the trend of the step counting error with the change in learning rate, so that the learning rate can be determined in the deep reinforcement learning framework to improve the step counting accuracy. Figure 6 shows the accuracy comparison between the proposed model and existing pedometers or step counting algorithms on the same data set; as an end-to-end direct solution, deep reinforcement learning can effectively improve the accuracy and reduce the error.
However, the update of the strategy depends on the current payoff, which may lead to slow learning and large variance. In value-based reinforcement learning, the value function and the state-action function are used to estimate the expected return, which causes information loss. In future work, we can model the value distribution to reduce the instability of the value function estimate, reduce the decision-making risk, and provide more choices for the predicted values. Future work may also consider introducing parallel computing methods, more feedback information, more features, etc.

Data Availability
The data set used in this study is provided by Cambridge University; please visit https://dl.acm.org/doi/10.1145/2493432.2493449.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.

Acknowledgments
Many thanks are expressed to Xiaodong Zhang for his kind help during the preparation of the manuscript and to Pengfei Chen for assistance with the experiments.