Multipolicy Robot-Following Model Based on Reinforcement Learning

We propose in this paper a new approach to solve the decision problem of robot-following. Unlike the existing single-policy models, the proposed multipolicy model can switch its following policy in time according to the scene. The contribution of this paper is a multipolicy robot-following model obtained by a self-learning method, which improves the safety, efficiency, and stability of robot-following in complex environments. Empirical investigation on a number of datasets reveals that, overall, the proposed approach tends to have superior out-of-sample performance compared to alternative robot-following decision methods. The performance of the model is improved by about 2 times in situations with few obstacles and by about 6 times in situations with many obstacles.


Introduction
Robot following [1] is a topic of general interest in the automotive industry. Attaining a suitable following decision is a key problem for the agent. At present, there are two classes of methods for this problem: rule-based algorithms and machine-learning-based algorithms [2]. A rule-based algorithm generally makes decisions by establishing an appropriate kinematic model, so the agent performs well in predictable environments but may not adapt to new ones. Gao et al. [3] proposed a vehicle cruise system based on simulating driver behavior, which can follow the leading vehicle safely on most occasions but is hard-pressed to ensure safety and stability at the same time in complex scenes, because it is governed by the driving experience of the designer. Li et al. [4] proposed a feedback control method with excellent following efficiency and safety performance; however, it does not consider stability under frequent speed changes.
Among the machine-learning-based algorithms, one branch is based on deep learning [5] and the other on reinforcement learning [6]. Wang et al. [7] proposed a robot-following system based on a deep neural network [8]. Compared with traditional following models, it adapts to complex situations better. This method, however, works by fitting existing data rather than generating an optimized strategy, so it is limited by the size of the training dataset and sometimes fails to reach the optimal decision in a new environment. In reinforcement learning, the agent continuously explores the environment and obtains the optimal decision by trial and error [9]. Gao et al. [10] proposed a following system based on Q-learning, which does not need a specific training set and has the ability of continuous learning. However, this method cannot respond immediately when the leading car changes its speed rapidly, due to the discreteness of the algorithm. Deng et al. [11] proposed an improved DDPG algorithm, which transforms a multiobjective problem into a single-objective problem by fixed weights found experimentally.
This method has better stability performance than the above methods and faster training speed than the traditional DDPG algorithm. However, it finds the truly optimal weight only through a large number of simulation experiments, which is difficult to carry out in real scenes. Gao et al. presented a complete follow-up system [12] and proposed a method based on inverse reinforcement learning [13], which solves the problem that the reward function is difficult to set in the following process. They also conducted experiments in real scenes and verified the effectiveness of the system. Even so, they did not consider the diverse needs of robot-following decisions in different environments.
At present, the following problems still exist in robot-following decision-making: (1) it is difficult to obtain the optimal decision in different environments with a single decision-making method; (2) it is difficult to properly weigh the relationship between the various targets; (3) most methods lack consideration of soft targets such as robot-following stability. In order to solve these problems, this paper proposes a multipolicy model based on reinforcement learning. The method addresses the difficulty of setting the reward function, the variety of users' needs, and unstable operation. It surpasses many current advanced methods in the simulation environment, basically achieves the target of safe, efficient, and stable following, and can quickly adjust its policy to deal with different environments. Empirical investigation on a number of datasets reveals that, overall, compared with alternative robot-following decision methods, this method reduces the safety risk by 66% and improves the stability performance by 6 times without affecting the robot-following efficiency when the speed of the target changes frequently. Meanwhile, the safety risk is completely avoided and the stability performance is improved by 2 times when the target moves smoothly.

Markov Model.
The Markov model, commonly used in reinforcement learning modeling, is a model in which the probability of the next action depends only on the current state, not on previous actions and states. The model can be represented by a five-tuple (S, A, P, R, γ) [14]. S is the set of states, A is the set of actions available at the next moment, P is the transition probability, R is the reward function, and γ, ranging from 0 to 1, is the discount factor used to compute the cumulative reward. A policy π maps each state in S to a probability over the actions in A: π: S → P(A). Given a policy π_θ(a, s), the cumulative reward is J(θ) = E_{s∼p^π, a∼π_θ}[Σ_{t=0}^{∞} γ^t r_t]. Reinforcement learning attains the optimal policy by maximizing J(θ).
Firstly, the robot-following process is discretized with sampling interval τ. At time t, the distance between the end-effector and the target is recorded as d_t, the relative velocity as Δv_t, and the acceleration of the end-effector as a_t^(2); the state at time t is s_t = (d_t, Δv_t, a_t^(2)) ∈ S. After observing the state s_t, the end-effector outputs the action μ_t and receives the reward R_t = f(μ_t). Meanwhile, the state transitions from s_t to s_{t+1}. μ_t is determined by the policy π.
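The discretized state and the discounted return above can be sketched as follows; the `State` container and its field names are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class State:
    """Following state at one sampling instant (interval tau)."""
    d: float    # distance between end-effector and target, d_t
    dv: float   # relative velocity, delta-v_t
    a2: float   # acceleration of the end-effector, a_t^(2)

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward J = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```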

Algorithm.
The algorithm is a reinforcement learning algorithm based on the Actor-Critic structure [15]. The true Q function [16] is replaced by the approximation Q_{θ^Q}: S × A → R, and the target is y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′}) [17]. The pseudocode of the algorithm is shown in Algorithm 1.
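A minimal sketch of the critic's TD target y = r + γQ′(s′, μ′(s′)) used in Algorithm 1; `target_actor` and `target_critic` are hypothetical callables standing in for the slowly updated target networks.

```python
def td_target(r, s_next, gamma, target_actor, target_critic):
    """TD target y = r + gamma * Q'(s', mu'(s')) for the critic update.

    target_actor and target_critic are placeholders for the target
    networks mu' and Q' (any callables with these signatures work).
    """
    a_next = target_actor(s_next)          # mu'(s')
    return r + gamma * target_critic(s_next, a_next)
```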

Pilot Experience Structure.
Pilot experience structure is an important structure to reflect users' preferences and is also an important factor that affects the reward function.
The pilot experience structure consists of an experience network and an auxiliary network. The experience network is determined by the driving experience of the designer, and the auxiliary network is determined by users' preferences. The initial policy is determined by the experience network and the auxiliary network together, and the agent generates multiple policies by adjusting the influence of the auxiliary network.
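One plausible reading of "adjusting the influence of the auxiliary network" is a weighted blend of the two networks' outputs; the additive form and the weight `alpha` below are assumptions, with different `alpha` values yielding different initial policies.

```python
def pilot_action(s, experience_net, auxiliary_net, alpha=0.5):
    """Illustrative pilot experience structure: blend the designer's
    experience network with a user-preference auxiliary network.

    alpha scales the auxiliary network's influence (an assumption);
    both nets are any callables mapping a state to an action value.
    """
    return experience_net(s) + alpha * auxiliary_net(s)
```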

Reward Function.
The reward function plays an important role in reinforcement learning. A good reward function can accelerate the convergence of the algorithm and improve the performance of the resulting policy. However, a good reward function may be hard to find for some problems, especially multiobjective optimization problems, because balancing the relationship between the various targets is difficult. This paper proposes a self-learning method for obtaining the reward function. The following subsections introduce the reward function of each single target, the structure of the self-learning method, and the multipolicy model.

Safety.
Safety is the primary target. In general, keeping a reasonable safety distance ensures safety performance, so the reward function of safety is defined in terms of the safety distance D.
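The safety equation itself does not survive in this copy, so the following is only a hedged sketch of one plausible form: no penalty at or beyond the safety distance D, and a growing penalty inside it.

```python
def safety_reward(d, D=2.0):
    """Illustrative safety term (the paper's exact form is an
    assumption): zero at or beyond the safety distance D, with a
    penalty growing linearly as the robot closes inside it."""
    return 0.0 if d >= D else -(D - d)
```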

Stability.
Stability generally means that the robot should run smoothly at all times and that its speed should not change frequently. Stability can therefore be described by the acceleration and the acceleration change rate, so the reward function of stability is defined in terms of the acceleration a_t and the acceleration change rate Δa_t/Δt.
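Likewise, a hedged sketch of a stability term penalizing acceleration magnitude and its change rate; the linear form and the weights `w1`, `w2` are assumptions.

```python
def stability_reward(a, da_dt, w1=1.0, w2=1.0):
    """Illustrative stability term (form and weights are assumptions):
    penalize acceleration magnitude and acceleration change rate."""
    return -(w1 * abs(a) + w2 * abs(da_dt))
```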

Efficiency.
Efficiency mainly refers to good cooperation between the robot and the target at all times. At the same time, where possible, the robot should maintain a high speed so that the task is completed in less time. The reward function for efficiency is therefore defined in terms of the current velocity v_t, the relative distance d, the change of the relative distance Δd, and the safety distance D.
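A hedged sketch of an efficiency term rewarding speed while penalizing deviation of the gap from D and changes of the gap; the exact form and coefficients are assumptions (units follow the experiment section: cm and cm/s).

```python
def efficiency_reward(v, d, delta_d, D=2.0, v_max=55.0):
    """Illustrative efficiency term (form is an assumption): reward
    normalized speed, penalize a gap away from the safety distance D
    and any change of the gap."""
    return v / v_max - abs(d - D) - abs(delta_d)
```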

Self-Learning Method and Multipolicy Model.
There is clearly a conflict between these targets. When the relative distance is small, efficiency is high, but there is a risk that the distance falls below the safety distance. When speed changes are small, stability is high, but the relative distance changes greatly, so efficiency and safety are difficult to guarantee. How to determine the weights between the targets is the key to integrating them into a single reward function. Thus, we propose the self-learning method and the multipolicy model.
Algorithm 1: Training procedure.
  Randomly initialize critic network Q(s, a|θ^Q) and actor μ(s|θ^μ) with weights θ^Q and θ^μ
  Initialize target networks Q′ and μ′ with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ
  Initialize replay buffer R
  for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1, T do
      Select action a_t = μ(s_t|θ^μ) + N_t according to the current policy and exploration noise
      Execute action a_t and observe reward r_t and new state s_{t+1}
      Store transition (s_t, a_t, r_t, s_{t+1}) in R
      Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
      Set y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′})
      Update the critic by minimizing the loss
      Update the actor policy using the sampled policy gradient
      Update the target networks
    end for
  end for

Firstly, several policies are given by the pilot experience structure. The agent first uses these policies for training and at the same time samples state-action transitions. After many rounds of training in the training environment, the user gives a score for every single target as well as a comprehensive score. The agent converts the scores into a total reward for each training run, and the reward function can then be estimated from the total rewards and the distribution of state-action transitions. The maximum likelihood estimation method is used to estimate the parameter μ of the reward function: the likelihood function L(μ) is written down, logarithms are taken on both sides to obtain ln L(μ), and the derivative of ln L(μ) with respect to μ is set equal to 0, from which the estimated value of the parameter is solved. From the estimated value of μ, the probability distribution of the optimal action can be obtained, and the reward function can then be derived from the optimal action sequence. The agent trains a new policy from the new reward function, and the user scores the new policy again.
If the comprehensive score is lower than before, the new policy will be added to the training data to estimate a better policy until the score of the new policy exceeds the scores of any policy before.
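Under a Gaussian noise model, the maximum-likelihood estimate of a linear reward parameter μ reduces to least squares on the user-derived scores. The sketch below fits r(s, a) = μ · φ(s, a) by plain gradient descent; the linear form, the feature map φ, and the optimizer are assumptions, not the paper's exact estimator.

```python
def estimate_reward_weights(features, scores, lr=0.01, steps=2000):
    """Fit a linear reward r = mu . phi to user-assigned scores by
    least squares (the Gaussian maximum-likelihood solution).

    features: list of feature vectors phi(s, a); scores: list of
    user-derived total rewards. Pure-Python gradient descent on the
    mean squared error.
    """
    k = len(features[0])
    mu = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for phi, y in zip(features, scores):
            err = sum(m * f for m, f in zip(mu, phi)) - y
            for j in range(k):
                grad[j] += err * phi[j]
        for j in range(k):
            mu[j] -= lr * grad[j] / len(features)
    return mu
```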
Since the policy function that meets the requirements is not unique, several different value functions can be obtained. In order to meet the varying needs of different users for each target in the multiobjective problem, the agent trains, by the above method, several reward functions and policies that surpass the policies given by the pilot experience structure. Meanwhile, the agent classifies the policies according to the reward values of the subobjectives, which are calculated by the subevaluation functions; this classification helps the agent choose the suitable policy in time. The subevaluation function can be attained from the subobjective reward function.

Experiment
To verify the validity of the model, two sets of trained policies (stability-oriented and efficiency-oriented) reflecting particular users' needs are compared with other existing models in three different robot-following simulation scenarios.
In the following experiments, because the maximum measuring distance of the distance sensor is 100 cm, we set d_max to 100 cm. Similarly, v_max is set to 55 cm/s, a_max^(1) to 5 cm/s², a_min^(1) to −5 cm/s², a_max^(2) to 5 cm/s², and a_min^(2) to −5 cm/s². Because we do not consider reversing in our experiment, v_min is set to 0 cm/s. In order to better observe the following state, we set τ to 0.1 ms, and to ensure safety, we set D to 2 cm. In general, when the distance between robot and target is less than the safety distance, we consider the state dangerous; when the relative speed equals 0, the efficiency performance is considered good; and good stability performance requires the acceleration of the following robot to be 0. Therefore, the average of d − D and the frequency of d < D are used to evaluate safety, while the StDev of d − D is used to evaluate efficiency. The average and StDev of |Δv| and the frequency of |Δv| ≤ 0.1 are also used to evaluate efficiency. The average and StDev of |a| and the frequency of |a| = 0 are used to evaluate stability, as are the average and StDev of |Δa/Δt|.
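The evaluation statistics above (average, StDev, and in-threshold frequency of a measured series such as d − D, |Δv|, or |a|) can be computed as below; the function name and the threshold default are illustrative.

```python
def follow_metrics(values, threshold=0.1):
    """Average, population standard deviation, and the fraction of
    samples whose magnitude is within `threshold`, for one measured
    series (e.g. d - D, |delta-v|, |a|)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    freq = sum(1 for x in values if abs(x) <= threshold) / n
    return mean, var ** 0.5, freq
```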

Training.
The model needs to be trained first. The initial speed of the target is 0. During 0 to 30 seconds, it moves at a constant acceleration of 1 m/s². During 30 to 90 seconds, it moves at a constant speed. During 90 to 120 seconds, its speed changes every 0.01 seconds by Δv ∼ U(−0.05, 0.05). After 210 seconds, the target slows down at a constant acceleration of −0.5 m/s² until the velocity reaches 0. The initial relative distance is 0, and the robot starts to follow the target 3 seconds after the target starts. The change of the velocity of the target in the training scene is indicated by a solid line in Figure 1.
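The training-scene speed trace can be reproduced approximately as follows; the behavior between 120 s and 210 s is assumed to be constant speed, since the text leaves it implicit.

```python
import random

def target_speed_profile(dt=0.01, seed=0):
    """Training-scene target speed trace: accelerate at 1 m/s^2 for
    30 s, hold for 60 s, jitter the speed by U(-0.05, 0.05) each step
    from 90 s to 120 s, hold again (assumed), then brake at -0.5 m/s^2
    from 210 s until standstill."""
    rng = random.Random(seed)
    v, t, trace = 0.0, 0.0, []
    while True:
        if t < 30.0:
            v += 1.0 * dt            # constant acceleration
        elif t < 90.0:
            pass                     # constant speed
        elif t < 120.0:
            v += rng.uniform(-0.05, 0.05)
        elif t < 210.0:
            pass                     # constant speed (assumed)
        else:
            v -= 0.5 * dt            # constant deceleration
            if v <= 0.0:
                break
        trace.append(v)
        t += dt
    return trace
```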
After training, the policy of the model is basically stable.

Testing.
The method of this paper was compared with the driver-simulation method of [3], the Q-learning method of [10], and the DDPG algorithm of [11]. The performance of the models was first tested in the training scene, and the data analysis of distance, velocity, and acceleration is shown in Table 1.
In order to distinguish the various algorithms, the policy of the algorithm proposed in [3] is called P1, the policy of the algorithm proposed in [10] is called P2, and the policy of the algorithm proposed in [11] is called P3. P4 and P5 are the policies proposed in this paper: P4 tends toward higher efficiency, while P5 prefers higher stability. P4 and P5 are trained by the self-learning method but do not apply the multipolicy model. P6 adopts the multipolicy model, and its policy set includes P4 and P5; it prefers ensuring safety and efficiency when the target changes velocity frequently and stability when the target moves at a constant speed.
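A minimal sketch of how P6 might switch between its member policies: pick the efficiency-oriented P4 when the target's recent speed changes vary a lot, and the stability-oriented P5 otherwise. The variance test and the threshold are assumptions; the paper states only that policies are classified by subobjective reward values.

```python
def choose_policy(recent_dv, var_threshold=0.01):
    """Illustrative multipolicy switch: return 'P4' (efficiency) when
    recent target speed changes are highly variable, else 'P5'
    (stability). The variance criterion is an assumption."""
    n = len(recent_dv)
    mean = sum(recent_dv) / n
    var = sum((x - mean) ** 2 for x in recent_dv) / n
    return "P4" if var > var_threshold else "P5"
```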
We processed the original data to obtain the average value, standard deviation, and frequency of the corresponding quantities. As Table 1 shows, P4 to P6 perform better than P1 to P3 in terms of safety. Meanwhile, P4 and P6 also ensure high efficiency, and P5 and P6 have excellent stability performance.
To verify the performance of the model in different environments, its behavior is tested in two further scenarios.
Scene 1. The initial speed of the target is 0. During 0 to 15 seconds, it moves with an acceleration of 1.5 m/s² + Δa₁, where Δa₁ ∼ N(0, 0.5). During 15 to 60 seconds, its speed changes every 0.01 seconds by Δv ∼ U(−0.1, 0.1). During 60 to 150 seconds, its speed changes every 0.01 seconds by Δv ∼ U(−0.2, 0.2). After 150 seconds, it moves with an acceleration of −0.5 m/s² until the speed is 0. The initial relative distance is 0, and the robot starts to follow the target 3 seconds after the target starts. The change of the velocity of the target in scene 1 is indicated by a dashed line in Figure 1.
Scene 2. The initial speed of the target is 20 m/s. During 0 to 15 seconds, it moves at a constant speed. During 30 to 80 seconds, it moves with an acceleration of 0.1 m/s². During 80 to 180 seconds, it moves at a constant speed. During 180 to 230 seconds, it moves with an acceleration of −0.2 m/s². During 230 to 250 seconds, it moves at a constant speed. The initial relative distance is 40, and the target and the robot begin to move at the same time. The speed variation of the target is indicated by a long-dashed line in Figure 1.
From Figure 1, we can see the difference between the scenes. The moving speed of the target is changeable and irregular in scene 1, whereas the movement of the target in scene 2 is smooth and does not change greatly. Obviously, the robot needs to ensure safety and improve efficiency as much as possible in scene 1. In contrast, the robot in scene 2 need not pay much attention to efficiency but should improve stability performance as much as possible.
From the results of the test scenarios in Tables 2 and 3, it is not difficult to find that P1 hardly keeps stable and sometimes does not leave enough distance from the target when its velocity changes fast. P2 has poor stability performance and often cannot keep a safe distance from the target even when the target moves at a constant speed. P3 has excellent efficiency and acceptable stability performance, but its poor safety performance is worrying. P4 pursues safety and efficiency excessively, without satisfactory stability performance; similarly, P5 has excellent stability performance but poor efficiency. In terms of safety, efficiency, and stability together, P6 performed best in most experiments.
From Figure 2, it is not difficult to find that these methods perform better in the training scene than in the test scenes, which shows that a new environment affects the performance of robot-following. Compared with the other methods, the performance of P4, P5, and P6 is less affected; the self-learning method is therefore shown to adapt better to the environment. At the same time, the efficiency performance of P6 is similar to that of P4, and the stability performance of P6 is similar to that of P5, which shows that the multipolicy model can effectively improve the comprehensive performance of robot-following and better complete the following task. (In the figures and tables, P1 represents the policy of [3], P2 the policy of [10], P3 the policy of [11], P4 the efficiency-oriented policy of this paper, P5 the stability-oriented policy of this paper, and P6 the integrated model of P4 and P5.)
To sum up, the multipolicy model overcomes the shortcoming that the traditional single-policy model cannot adapt to different environments. By using different policies in different environments, the multipolicy model can meet people's differing requirements for each environment. Therefore, it has high practicability and research value.

Conclusion
In this paper, a Markov decision model of the robot-following process is first built; then a multiobjective reinforcement learning algorithm is proposed by improving an existing reinforcement learning algorithm, from which the agent obtains a multipolicy model. This method improves the stability performance by 6 times without affecting the robot-following efficiency when the speed of the target changes frequently, and the stability performance is improved by 2 times when the target moves smoothly.
In summary, we adopt a multipolicy model that can take different policy preferences for different environments, ensuring that the optimal decision can be made in most cases. Meanwhile, we adopt the self-learning method to avoid manual weighting, so that the robot determines the weight ratio by itself, realizing a greater degree of autonomy. In addition, we introduce more objectives into the algorithm, so that the final model is a multiobjective, multipolicy agent. The contributions of this paper mainly include the following: (1) A self-learning mode is proposed to solve the problem that the reward function is difficult to determine in the robot-following problem. (2) The adoption of the pilot experience structure avoids a large number of unnecessary explorations in the early stage of reinforcement learning training.
(3) The multipolicy model can effectively improve the adaptability of the algorithm to the environment so that the decision can meet the needs of more users.
However, because the algorithm has so far been tested only in simulation and real-world experiments have not yet been performed, many unknown factors in the actual environment remain. Further experiments in real environments are needed.

Data Availability
The data used to support the findings of this study are included within the article.