Research on Dynamic Path Planning of Mobile Robot Based on Improved DDPG Algorithm

Aiming at the problems of the low success rate and slow learning speed of the DDPG algorithm in path planning for a mobile robot in a dynamic environment, an improved DDPG algorithm is designed. In this article, the RAdam algorithm replaces the neural network optimizer in DDPG and is combined with a curiosity algorithm to improve the success rate and convergence speed. On this basis, prioritized experience replay is added and transfer learning is introduced to improve the training effect. A dynamic simulation environment is established with the ROS robot operating system and the Gazebo simulation software, and the improved DDPG algorithm is compared with the original DDPG algorithm. For the dynamic path planning task of the mobile robot, the simulation results show that the convergence speed of the improved DDPG algorithm is increased by 21% and the success rate is increased to 90% compared with the original DDPG algorithm. The improved algorithm performs well in dynamic path planning for mobile robots with a continuous action space.


Introduction
Path planning is a very important part of the autonomous navigation of robots. The robot path planning problem can be described as finding an optimal path from the current point to a specified target point in the robot's working environment, according to one or more optimization objectives, under the condition that the robot's position is known [1, 2]. At present, the commonly used algorithms include the artificial potential field method [3], the genetic algorithm [4], the fuzzy logic method [5], and reinforcement learning methods [6]. In recent years, many scholars have proposed path planning methods for dynamic environments. In 2018, Qian [7] proposed an improved artificial potential field method based on connectivity analysis for the path planning problem of dynamic targets in the LBS system. In 2019, in order to solve the long-distance path planning problem of outdoor robots, Huang [8] proposed an improved D* algorithm combined with Gaode mapping based on a vector model. In 2020, Nair and Supriya [9] applied the LSTM neural network to path planning in a dynamic environment. Reinforcement learning (RL) is a learning paradigm that does not require the agent to know the environment in advance. The mobile robot takes actions while perceiving the current environment; according to the current state and the action taken, it transitions from the current state to the next state. The Q-learning algorithm [10] is a classical reinforcement learning algorithm that is simple and convergent and has been widely used. However, when the environment is complex, as the dimension of the state space increases, reinforcement learning is prone to the "curse of dimensionality." Deep learning (DL) is well suited to processing high-dimensional information. Deep reinforcement learning (DRL), which combines DL with reinforcement learning [11, 12], can not only process high-dimensional environmental information but also carry out the corresponding
planning tasks by learning an end-to-end model. Hence the DQN algorithm [13] came into being; it typically solves problems with discrete, low-dimensional action spaces. The Deep Deterministic Policy Gradient (DDPG) algorithm, proposed by the DeepMind team in 2016, uses the actor-critic framework and draws on the ideas of DQN to solve problems with continuous action spaces [14]. However, when the DDPG algorithm is applied to path planning in a dynamic environment, it suffers from a low success rate and slow convergence, and most related research stays at the theoretical level, lacking solutions to practical problems.
In this article, an improved DDPG algorithm is proposed in which the RAdam algorithm replaces the neural network optimizer of the original algorithm, a curiosity algorithm is combined with it to improve the success rate and convergence speed, and prioritized experience replay and transfer learning are introduced. Raw range data are obtained through the lidar carried by the mobile robot, dynamic obstacle information is extracted, and the improved algorithm is applied to path planning in the dynamic environment so that the robot can move safely from the starting point to the end point in a short time along the shortest path, verifying the effectiveness of the improved algorithm.
The organizational structure of this article is as follows: the first section is the introduction; the second section introduces the principle of the DDPG algorithm and the network parameter settings; the third section presents the path planning design of the improved DDPG algorithm; the fourth section shows the simulation experiment and the analysis of the results; and the summary is given in the last section.

DDPG Algorithm Principle and Network Parameter Setting
Principle of the DDPG Algorithm. The DDPG used in this article is a policy learning method that outputs continuous actions. It builds on the DPG algorithm and draws on the advantages of Actor-Critic's single-step policy-gradient updates and DQN's experience replay and target network techniques, improving the convergence of Actor-Critic. The DDPG algorithm consists of a policy network and a Q network. DDPG uses a deterministic policy to select the action a_t = μ(s_t | θ^μ), so the output is not a probability distribution over behaviors but a specific behavior, where θ^μ denotes the parameters of the policy network, a_t the action, and s_t the state. The DDPG algorithm framework is shown in Figure 1.
Actor uses policy gradient to learn strategies and select robot actions in the current given environment.In contrast, Critic uses policy evaluation to evaluate the value function and generate signals to evaluate Actor's actions.During path planning, the environmental data obtained by the robot sensor is input into the Actor network, and the actions that the robot needs to make are output.
The Critic network takes the environmental state of the robot and the path planning action as input and outputs the corresponding Q value for evaluation. In the DDPG algorithm, both the Actor and the Critic are represented by deep neural networks (DNNs): the Actor network approximates the policy μ (with parameters θ^μ) and the Critic network approximates the Q function. When the algorithm performs an iterative update, sample data are first accumulated in the experience pool until the minibatch size is reached; then the Critic network is updated using the sampled data, the parameters θ^Q of the Q network are updated through the loss function, and the gradient of the objective function with respect to the action is obtained [15]. Then θ^μ is updated with the Adam optimizer. The robot in this article obtains the distances to surrounding obstacles through lidar. The detection range of the lidar is (0.12, 3.5) m, and its angular detection range [16] is (−90°, 90°), that is, 0° directly in front of the robot, 90° to the left, and 90° to the right. The lidar data have 20 dimensions, with an angle of 9° between adjacent beams. The criterion for judging whether the robot hits an obstacle while moving is as follows: if the distance to an obstacle is less than 0.2 m, a collision is declared. In the simulation, the 20-dimensional lidar distance information is obtained directly.

DDPG Network Parameter Setting
According to the distance between the robot and the obstacles, the robot's state is divided into a navigation state N and an obstacle collision state C, where d_i(t) is the i-th dimension of the robot's lidar distance data at time t. When d_i(t) ≤ 0.2 m for some beam, the robot is in the collision state C; when d_i(t) > 0.2 m for all beams, the robot is in the normal navigation state N [17, 18].
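As a minimal sketch (the helper name is hypothetical, not from the paper), the N/C state test over the 20-dimensional scan can be written as:

```python
COLLISION_DIST = 0.2  # metres; the paper's collision threshold

def classify_state(lidar_scan):
    """Return 'C' (collision) if any beam reads <= 0.2 m, else 'N' (navigation).

    lidar_scan: iterable of the 20 beam distances d_i(t) in metres.
    """
    return "C" if any(d <= COLLISION_DIST for d in lidar_scan) else "N"
```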

Action Space Setting

The final output of DDPG's decision network is a continuous angular velocity value within a certain interval. A continuous angular velocity output is more consistent with the kinematic characteristics of the robot, so the trajectory of the robot is smoother and its actions more continuous. In the simulation, the angular velocity must be limited, so the maximum angular velocity is set to 0.5 rad/s. Hence, the final output angular velocity interval of DDPG is (−0.5, 0.5) rad/s and the linear velocity is fixed at 0.25 m/s: the forward speed (linear velocity v, angular velocity ω) is (0.25, 0), the left-turn speed is (0.25, −0.5), and the right-turn speed is (0.25, 0.5).
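A minimal sketch of this action mapping (hypothetical helper name), clamping the network output to the stated interval and pairing it with the fixed linear velocity:

```python
def to_velocity_command(raw_omega, v=0.25, omega_max=0.5):
    """Clamp the network's angular-velocity output to [-0.5, 0.5] rad/s
    and pair it with the fixed 0.25 m/s linear velocity as (v, omega)."""
    omega = max(-omega_max, min(omega_max, raw_omega))
    return (v, omega)
```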

Reward Function Settings.
In the reward function, reward denotes the return value and d_{i-o}(t) is the distance between the robot and the obstacle. Reconstructed from the values given in the text, the reward can be written as

reward = −200, if d_{i-o}(t) < 0.2 (collision with an obstacle);
reward = 100, if the robot reaches the target point;
reward = 300 · (d_{i-t}(t−1) − d_{i-t}(t)), otherwise,

where d_{i-t}(t) is the distance between the robot and the target point. In all other cases, the difference between the distance to the target at the previous moment and the distance at the current moment, scaled by 300, is taken as the return value. This design makes the robot move continuously toward the target point, so that every action receives timely feedback, ensuring the continuity of the reward function and speeding up the convergence of the algorithm. Among neural network optimizers, SGD converges reliably but takes a long time, whereas Adam converges quickly but easily falls into local optima. RAdam uses a warm-up strategy to address Adam's tendency to converge to local optima, selecting the relatively stable SGD + momentum for training in the early stage to reduce the variance. RAdam therefore compares favorably with other neural network optimizers. In addition, the RAdam algorithm [19] is a recently proposed algorithm with fast convergence and high precision that can effectively address the variance problem of adaptive learning-rate methods. Therefore, the RAdam algorithm is introduced into the DDPG algorithm to address the low success rate and slow convergence of mobile robot path planning in dynamic environments caused by this variance problem [20]. The RAdam update can be expressed as follows:
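A minimal sketch of the reward described above (the function name and the arrival threshold `goal_radius` are assumptions; the paper does not state the arrival threshold):

```python
def reward(d_obstacle, d_goal_prev, d_goal, goal_radius=0.2):
    """Return value described in the text: -200 on collision, +100 on arrival,
    otherwise 300 * (progress toward the goal since the last step)."""
    if d_obstacle < 0.2:       # collision with an obstacle
        return -200.0
    if d_goal <= goal_radius:  # target point reached (threshold assumed)
        return 100.0
    return 300.0 * (d_goal_prev - d_goal)  # dense progress reward
```

Because the progress term is signed, moving away from the target yields a negative return, which is what keeps the reward signal continuous between the sparse collision and arrival events.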

Path Planning Design of the Improved DDPG Algorithm

where θ_t denotes the parameters to be trained, t the training step, α_t the step size, r_t the rectification term, v̂_t the second moment after bias correction, m̂_t the moving average after bias correction, {β_1, β_2} the attenuation rates and {β_1^t, β_2^t} the attenuation rates at time t, m_t the first moment (momentum), v_t the second moment (adaptive learning rate), g_t the gradient, ρ_∞ the maximum length of the simple moving average, ρ_t the length of the simple moving average at time t, J(θ) the objective, and ∇_θ the gradient operator.
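A scalar sketch of the RAdam update built from the terms defined above, following the published RAdam rule; the hyperparameter values are illustrative defaults, not taken from the paper:

```python
import math

def radam_step(theta, grad, t, m, v,
               alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam step for a scalar parameter theta at iteration t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment m_t
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment v_t
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected momentum
    rho_inf = 2.0 / (1.0 - beta2) - 1.0           # max SMA length rho_inf
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t > 4.0:                               # variance tractable: rectified, adaptive step
        v_hat = math.sqrt(v / (1 - beta2 ** t))
        r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                        / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - alpha * r_t * m_hat / (v_hat + eps)
    else:                                         # warm-up: SGD-with-momentum-like step
        theta = theta - alpha * m_hat
    return theta, m, v
```

The `rho_t > 4` branch is what the text calls the warm-up: in the first few iterations the rectification is skipped and the update behaves like SGD with momentum, avoiding the high early variance of the adaptive learning rate.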

Prioritized Experience Replay.
In the path planning of a mobile robot in a dynamic environment, because of the uncertainty of the environment, a large number of invalid experiences are produced by collisions in the early stage of training. The original DDPG algorithm uses these invalid experiences for training, which leads to a low success rate of path planning after training and wastes a lot of time. To address the low success rate caused by ineffective experience, this article adds prioritized experience replay. When extracting experiences, prioritized experience replay preferentially samples the most valuable experiences, but not exclusively; otherwise, overfitting would result. The higher the value of an experience, the greater its probability of being sampled, and even the lowest-value experience retains some probability of being sampled.
Prioritized experience replay uses the magnitude of the TD (temporal difference) error to measure which experiences contribute most to the learning process. In the DDPG algorithm, the core update is

Q_w(s_t, a_t) ← Q_w(s_t, a_t) + α [r_{t+1} + γ max_a Q_w(s_{t+1}, a_{t+1}) − Q_w(s_t, a_t)],

where the TD-error is

δ_t = r_{t+1} + γ max_a Q_w(s_{t+1}, a_{t+1}) − Q_w(s_t, a_t).

Here max_a Q_w(s_{t+1}, a_{t+1}) means that the action a_{t+1} is selected from the action space in state s_{t+1} so that Q_w(s_{t+1}, a_{t+1}) is the maximum of the Q values over all actions, and t is the training step. The discount factor γ takes a value in (0, 1), so that the mobile robot neither pays excessive attention to future reward values nor becomes short-sighted, attending only to immediate returns. r_{t+1} is the return obtained by executing action a_t and transitioning from state s_t to s_{t+1}. The goal of prioritized experience replay is to make the TD-error as small as possible: if the TD-error is large, the current Q function is still far from the target Q function and should be updated more. Therefore, the TD-error is used to measure the value of an experience. Finally, a binary (sum) tree is used to sample experiences according to their priorities efficiently.
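A minimal sketch of proportional prioritized sampling (hypothetical class; the paper's efficient binary-tree structure is replaced here by plain weighted sampling for clarity):

```python
import random

class PrioritizedReplay:
    """Sample transitions with probability proportional to |TD-error|.

    eps > 0 keeps even the lowest-value experience sampleable, matching the
    text's requirement that low-value experiences retain some probability.
    """

    def __init__(self, eps=1e-2):
        self.buffer, self.priorities = [], []
        self.eps = eps

    def add(self, transition, td_error):
        self.buffer.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        # Each transition's probability is proportional to its priority.
        return random.choices(self.buffer, weights=self.priorities, k=batch_size)
```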

Curiosity Algorithm.
The core of the interaction between a deep reinforcement learning algorithm and its environment is the design of the reward mechanism. A reasonable reward mechanism can speed up the learning process of the agent and achieve good results. However, in the path planning of mobile robots in a dynamic environment, as the working environment becomes more and more complex, external rewards alone cannot quickly produce good results. Therefore, the curiosity algorithm [21] is introduced in this article: an internal curiosity module (ICM) provides internal rewards from the agent's prediction errors about its own actions and states, so that the mobile robot trains under the combined action of internal and external rewards and achieves a good path planning effect. Combined with the DDPG algorithm, the total reward to be maximized is r_t = r_t^i + r_t^e, where r_t is the total reward, r_t^i is the internal reward of the curiosity module, and r_t^e is the external reward of the DDPG algorithm. In a complete training step, the current and next state values and the action are processed by the internal curiosity module, as shown in Figure 2.
Specifically, the curiosity algorithm uses two submodules: the first encodes s_t into φ(s_t), and the second uses two consecutive encoded states, φ(s_t) and φ(s_{t+1}), to predict the action a_t. That is, the inverse dynamics model â_t = g(s_t, s_{t+1}; θ_I) produces â_t, the predicted estimate of the action, where s_t and s_{t+1} are the current and next states of the agent and θ_I are neural network parameters. The error between the forward model's state prediction and the encoding of the actual next state yields the internal reward. The forward model is

φ̂(s_{t+1}) = f(φ(s_t), a_t; θ_F),

where φ̂(s_{t+1}) is the predicted feature vector of the next state, φ(s_t) is the feature vector encoded from the current state s_t, a_t is the action, θ_F are neural network parameters, and the learned function f is called the forward dynamics model. The parameters θ_F are optimized by minimizing the loss function L_F. The intrinsic reward is

r_t^i = (η/2) ‖φ̂(s_{t+1}) − φ(s_{t+1})‖²,

where η is the scale factor, satisfying η > 0. The encodings of the current and next states are also passed to the inverse dynamics model. The overall optimization objective of the curiosity algorithm is summarized as

min_{θ_P, θ_I, θ_F} [ −λ E_{π(s_t; θ_P)}[Σ_t r_t] + (1 − β) L_I + β L_F ],

where β and λ are scalars, θ_I and θ_P are neural network parameters, β (0 ≤ β ≤ 1) weighs the loss of the inverse model against that of the forward model, λ (λ > 0) measures the importance of the policy-gradient loss relative to the reward signal, L_I is the loss function measuring the difference between the predicted and actual action, r_t is the internal reward at time t, and π(s_t; θ_P) is the parameterized policy. In the simulation experiment, β is 0.2 and λ is 0.1.
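The intrinsic reward and its combination with the external DDPG reward can be sketched directly from the formulas above (function names are illustrative; the feature vectors stand in for the outputs of the encoder and forward model):

```python
def intrinsic_reward(phi_pred_next, phi_next, eta=0.1):
    """r_t^i = (eta/2) * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2.

    phi_pred_next: forward-model prediction of the next state's features.
    phi_next: actual encoding of the next state. eta > 0 is the scale factor.
    """
    sq_err = sum((p - q) ** 2 for p, q in zip(phi_pred_next, phi_next))
    return 0.5 * eta * sq_err

def total_reward(r_int, r_ext):
    """Total training reward r_t = r_t^i + r_t^e combined with DDPG."""
    return r_int + r_ext
```

When the forward model predicts the next state perfectly, the intrinsic reward is zero; poorly predicted (novel) transitions earn a larger bonus, which is what drives exploration.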

Establishment of Simulation Experiment Environment.
Hardware configuration of the simulation experiment: Intel i5-3320M CPU and 4 GB of memory.
The operating system is Ubuntu 18.04 with the ROS Melodic robot operating system installed, and the simulation environment is built with Gazebo 9 under ROS. The generated experimental environment is shown in Figure 3.
In the simulation environment of Figure 3(a), a square arena 8 m long and wide is established with no obstacles; the starting position of the mobile robot is set to (−2, 2.5), and the colored circle marking the target point is set to (2, −2). This environment is mainly used to train the mobile robot's ability to reach the target in a limited space, for transfer learning. In the simulation environment of Figure 3(b), eight dynamic obstacles are added to the above environment: the four middle (0.3 × 0.3 × 0.3) m³ obstacles rotate counterclockwise at 0.5 m/s, the upper and lower (1 × 1 × 1) m³ obstacles move horizontally at 0.3 m/s, and the two middle (1 × 1 × 1) m³ obstacles move vertically at 0.3 m/s. The starting point and target point of the mobile robot are the same as in the first environment, and this second environment is used to train the robot to plan its path in a dynamic environment.

Simulation Experiment and Result Analysis.
To verify the algorithm, the original DDPG and the improved DDPG mobile robot path planning algorithms are each trained for 1500 rounds in the same simulation environment with dynamic obstacles, and the total return of the mobile robot in each round of simulation training is recorded. A graph is drawn with the number of training rounds on the abscissa and the return per round on the ordinate, as follows.
The results in Figure 4 show that when the DDPG algorithm is trained in the dynamic obstacle simulation environment, the total return curve trends up and down as the number of training rounds increases, indicating that training in the dynamic obstacle environment does not ultimately converge. The total return fluctuates greatly from round 0 to round 200, and the return is mostly negative, showing that the robot is still "learning" to reach the target point. Observing the training process, the mobile robot collides with dynamic obstacles as it gradually approaches the target point; from round 200 to round 400 the total return can reach about 2400, and the robot reaches the target point in a few cases. After round 400, the return rises and falls, mostly stabilizing around 500, and most rounds end in collisions with dynamic obstacles.
Figure 5 shows the return curve of DDPG trained in the dynamic environment with transfer learning. Because of the addition of transfer learning, the robot moves toward the target point from the beginning, so the negative portion of the return curve is much smaller than in Figure 4; however, the overall curve remains uneven and does not converge.
Figure 6 shows the return curve of DDPG trained in the dynamic environment with only prioritized experience replay. Because prioritized experience replay eliminates a large amount of invalid data in the early stage of training, the return curve in Figure 6 rises gradually from negative values and tends to stabilize, but the convergence speed is slow and the success rate is low. Figure 7 shows the return curve of DDPG trained in the dynamic environment with both prioritized experience replay and transfer learning. The return curve in Figure 7 is improved compared with Figure 4, but convergence is still not achieved.
Figure 8 shows that when the DDPG algorithm with the RAdam optimizer, prioritized experience replay, and transfer learning is trained in the dynamic obstacle simulation environment, the return curve gradually increases and stabilizes as the number of training rounds increases. The results show that the DDPG algorithm with RAdam is approximately convergent in the dynamic obstacle environment. Compared with Figure 7, the convergence speed and success rate are clearly improved, which shows that the RAdam optimizer benefits the path planning of mobile robots in a dynamic environment.
Figure 9 shows the training of the improved DDPG algorithm (the curiosity module introduced on top of the RAdam optimizer) with prioritized experience replay and transfer learning in the dynamic obstacle simulation environment; through the combination of internal and external rewards, the return curve rises faster and stabilizes at a higher level. During the experiment, Rviz subscribes to the odometry message Odom published by the mobile robot, visualizes the pose of the mobile robot at each moment as coordinate axes, and shows the planned paths to the target point before and after the improvement, as shown in Figures 11-13.
The time and path results of the mobile robot reaching the target point in the dynamic environment before and after the improvement of the DDPG algorithm are shown in Table 2. To ensure the validity of the results, each test result is averaged over 10 runs.
According to the data in Table 2, the original DDPG algorithm takes 86 seconds and 280 steps to reach the target point in the dynamic environment of this article. With the RAdam neural network optimizer added, the time is shortened to 76 s and the number of steps is reduced to 250. With the curiosity module introduced on top of that, the time reaches 60 s and the number of steps is shortened to 210. The experimental comparison proves that the improved DDPG algorithm reduces both the time and the number of steps for mobile robot path planning in a dynamic environment, verifying the algorithm's effectiveness.
RAdam Optimization Algorithm Design. In deep learning, most neural networks adopt adaptive learning-rate optimization methods, which suffer from excessive variance. Reducing this variance problem can improve training efficiency and recognition accuracy.

Figure 6: Experimental results with prioritized experience replay added.

Figure 12: Dynamic obstacle path planning with the RAdam algorithm.

Figure 11: Dynamic obstacle path planning based on the original DDPG algorithm.
To verify the success rate of path planning of the trained models in the dynamic environment, each trained model was tested 50 times in the training dynamic environment, and the test was repeated three times to take the average; the experiment is done three times to ensure the validity of the data. The test results and the time taken to train each model are recorded in Table 1. The experimental results in Table 1 show that the success rate of the original DDPG algorithm reaches 50%, with a model training time of 14 hours. Adding only transfer learning reduces the training time but also reduces the success rate. Adding only prioritized experience replay increases the success rate to 70%, but the training time does not decrease. Adding prioritized experience replay and introducing transfer learning together increases the success rate to 74% and shortens the model training time to 13 hours. After adding the RAdam algorithm, the success rate increases to 86% and the training time shortens to 12 hours. Finally, with the curiosity module, the success rate increases to 90% and the training time shortens to 11 hours. Compared with the original DDPG algorithm, the convergence speed is increased by 21% and the success rate is increased to 90%.

Table 2: Path effect comparison.

Figure 13: Improved algorithm for dynamic obstacle path planning.