Path-Integral-Based Reinforcement Learning Algorithm for Goal-Directed Locomotion of Snake-Shaped Robot

This paper proposes a goal-directed locomotion method for a snake-shaped robot in a complex 3D environment based on path-integral reinforcement learning. The method uses a model-free online Q-learning algorithm to evaluate action strategies and optimizes decision-making through repeated “exploration-learning-utilization” cycles so that the snake-shaped robot completes goal-directed locomotion in the complex 3D environment. Proper locomotion control parameters such as joint angles and screw-drive velocities are learned by path-integral reinforcement learning, and the learned parameters are successfully transferred to the snake-shaped robot. Simulation results show that the planned path avoids all obstacles and reaches the destination smoothly and swiftly.


Introduction
Research in the field of snake-shaped robots has been carried out for several decades [1][2][3]. This is because snake-shaped robots can not only be used as experimental platforms [4,5] but also be employed for search-and-rescue operations, such as mine disaster rescue and the search for and detection of people buried after an earthquake [6]. Snake-shaped robots are multimodule, serially connected robots with no fixed base that imitate the snake's body structure and movement mode.
Generally speaking, the snake-shaped robot consists of several segments connected in a serial manner, as shown in Figure 1, and its main form of movement is undulation, which imitates real snake movements [7,8]. In this paper, the snake-shaped robot is composed of screw-drive units with passive wheels, and the screw-drive units are serially connected through active joints. With this structure, the snake-shaped robot can move in any direction by appropriately adjusting the angular velocities of the screw-drive units.
The problem of robot path planning in an unknown environment has attracted more and more researchers' attention.
The main research methods include the genetic algorithm, the gradient algorithm, the nonlinear mapping algorithm, the linear mapping algorithm, and so on [9,10]. These traditional approaches have been improved to solve the path planning problem. Liljebäck obtains the deviation of the centroid of the snake-shaped robot from the expected path through real-time monitoring. This deviation is used as a joint control parameter to adjust the head movement direction so that the snake-shaped robot follows the desired straight path [11][12][13][14][15][16]. Rezapour uses the desired angle of the head joint to adjust the swing of the head link so that the centroid of the snake-shaped robot follows the desired straight path [17]. Hasanabadi uses sensors to measure the distance between the robot and the desired path to calculate the desired angle, and this angle is used as the joint control parameter so that the robot follows a straight path and a circular path without side slip [18]. Matsuno introduces a functional model of the desired path to establish the controller, so that the robot head follows the desired circular path [19]. Mohammadi controls the velocity components in all directions to make the snake robot's center of mass follow a circular path, but this requires that the path function be known [20]. Ariizumi established a cost function based on the expected path and head trajectory and used it as the control parameter to make the snake-shaped robot follow a sinusoidal path [21].
In order to make the snake-shaped robot reach the destination safely and smoothly, the first task is to find proper control parameters for generating goal-directed locomotion. This is a complex and challenging task, and we apply a special type of reinforcement learning, called path-integral reinforcement learning, combined with time series analysis, to the task. Path-integral reinforcement learning is selected because this method can stably handle the "curse of dimensionality" [22] and has proved successful for different robot learning tasks [23].
This paper proposes a collision-free goal-directed locomotion method based on path-integral reinforcement learning in a three-dimensional environment. The novelty of this paper and the main contributions of this work are as follows: a path-integral reinforcement learning model is established, which describes in detail how the robot can accumulate experience from historical data to optimize decision-making. The Q table is updated based on prior information and provided to the robot as historical experience, which greatly improves the learning efficiency of the robot. To describe the influence of collision avoidance, a perturbation matrix is introduced so that the planned path avoids obstacles safely and eventually reaches the destination.
The simulation results show that the planned paths can avoid all obstacles smoothly and swiftly and eventually reach the destination. The rest of the paper is organized as follows: Section 2 describes the kinematic model of the snake-shaped robot. Section 3 focuses on the path-integral-based reinforcement learning algorithm. Section 4 explains the goal-directed locomotion of the snake-shaped robot. The simulation results are given in Section 5. Section 6 concludes the paper.

The Description of the Kinematic Model of the Snake-Shaped Robot

The kinematic model of the snake-shaped robot [7] can be described as in equation (1), where T is the control input vector, (x, y) is the position of the head of the snake-shaped robot, and ψ is the orientation of the first screw-drive unit with respect to (x, y), which is automatically adjusted with the snake-shaped robot kinematics. ϕ_1, ϕ_2, ϕ_3, …, ϕ_{n−1} are the yaw joint angles, i.e., the relative orientations of one screw unit with respect to the previous one. θ̇_1, θ̇_2, …, θ̇_n are the angular velocities of the corresponding screw-drive units counted from the head. ϕ̇_1, ϕ̇_2, ϕ̇_3, …, ϕ̇_{n−1} are the angular velocities of the joints, with ϕ̇_1 the angular velocity of the first joint from the head. A and B are the system matrices and depend on system configurations such as the screw-drive unit radius, the inclination of the blades, and the length of one screw-drive unit [7]. The learning of locomotion control parameters such as screw velocities and joint angles can be carried out as follows. Suppose that the initial position of the head of the snake-shaped robot is (x_0, y_0) and the target position is (x_g, y_g). The two parameter vectors used to represent the control policy are given in equation (2). The snake-shaped robot's locomotion control policy following the kinematic model in equation (1) can be represented by the vector U of equation (2). After learning the final values of U, the snake-shaped robot will start from the initial position (x_0, y_0) and then move toward the desired goal (x_g, y_g).
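The policy parameterization above can be sketched in code. This is a hypothetical illustration only: `n_units`, `make_policy_vector`, and `rollout_head` are invented names, and the toy rollout below is a crude stand-in for integrating equation (1), whose A and B matrices from [7] are not reproduced in this paper.

```python
import numpy as np

n_units = 4  # number of screw-drive units (assumed for illustration)

def make_policy_vector(screw_vels, joint_angles):
    """Stack the n screw velocities and n-1 yaw joint angles into one vector U."""
    assert len(joint_angles) == len(screw_vels) - 1
    return np.concatenate([screw_vels, joint_angles])

def rollout_head(U, x0=0.0, y0=0.0, duration=10.0):
    """Toy stand-in for a rollout of equation (1): the head advances at the
    mean screw velocity along a heading set by the summed yaw angles."""
    v = np.mean(U[:n_units])
    heading = np.sum(U[n_units:])
    return (x0 + v * duration * np.cos(heading),
            y0 + v * duration * np.sin(heading))
```

In the actual method, the final position (x, y) of such a rollout is what the cost function below evaluates.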
We know that the cost function or objective function is the key component that affects the convergence of learning. In this article, we use the Euclidean distance as the cost function: C(x, y) = √((x − x_g)² + (y − y_g)²). The cost function gives the distance between the reached head position (x, y) and the target position (x_g, y_g) of the snake-shaped robot. It can be seen that when the cost function is almost zero, the learned parameter vector has reached its final value; that is to say, the snake-shaped robot has reached the goal and completed the task. Therefore, with the final parameter vector, the robot will move toward the given target point.
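The Euclidean cost described above is a one-liner; a minimal sketch:

```python
import numpy as np

def cost(x, y, x_g, y_g):
    """Euclidean distance between the reached head position (x, y)
    and the target position (x_g, y_g)."""
    return np.hypot(x - x_g, y - y_g)
```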
In addition, the influence of external noise on path-integral reinforcement learning needs to be considered. As an open parameter in the learning process, the external noise needs to be designed in a task-dependent manner. For the path-integral reinforcement learning in this paper, a random value ε is drawn from the normal distribution N(0, 1), and ε can be dynamically adjusted according to the noise-free cost C_{t−1} obtained at the end of the previous update.
Then, the external noise interference can be expressed as in equation (4), where m is the update index, k is the index of the noisy rollout, and e_{m,k} is the external noise interference during the k-th noisy trajectory (rollout) of the m-th update. Suppose that the noise applied to the screw velocities is e_{θ̇_1}(m,k), e_{θ̇_2}(m,k), …, e_{θ̇_n}(m,k), respectively. Similarly, assume that the noise applied to the joint angles is e_{ϕ_1}(m,k), e_{ϕ_2}(m,k), …, e_{ϕ_{n−1}}(m,k), respectively. All these noise terms follow the above description of the external noise interference in every noisy trajectory τ_{m,k}, but their patterns differ within a trajectory because ε is a random value. The main steps of path-integral reinforcement learning applied to the goal-directed locomotion of the snake-shaped robot are as follows. Firstly, the parameter vector U is selected according to the learning task. Secondly, the number of rollouts K per update is fixed; K = 45 in this paper. According to the parameter vector U to be learned, in every rollout k the trajectory with noisy parameters is simulated starting from the robot start position (x_0, y_0): θ̇_i + e_{θ̇_i}(m,k), (i = 1, 2, …, n) represent the noisy screw-velocity parameters, and ϕ_j + e_{ϕ_j}(m,k), (j = 1, 2, …, n − 1) denote the noisy joint angles. The trajectory of the snake-shaped robot can be generated by using the robot kinematic model of equation (1) with these locomotion control parameters. It can be seen that both the noisy screw velocities and the noisy joint angles required for learning U are given in equation (5). From the reached position (x_{m,k}, y_{m,k}) of the snake-shaped robot's head at the end of this rollout, the final cost of the rollout is then calculated by evaluating the cost function: I_{m,k} = C(x_{m,k}, y_{m,k}).
With this method, all K noisy rollouts from the start position of the snake-shaped robot within one update m can be completed and the corresponding costs I_{m,k} stored. Then, the exponential value is calculated for each rollout as in equation (6), and the probability weighting P_{m,k} for each rollout is calculated as in equation (7). Then, corrections for the parameter vector are calculated based on P_{m,k}, and the complete update vector ΔU is constructed according to equation (8). Finally, the control vector U_m at the end of update m is calculated as U_m = U + ΔU.
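The update steps above can be sketched as a single function. This is a sketch under assumptions: `simulate_head` stands in for a rollout of the kinematic model (1), the temperature `h` and noise scale `sigma` are illustrative parameters not specified in the text, and the exponential weighting uses the common min-max normalization for numerical stability.

```python
import numpy as np

def pi2_update(U, simulate_head, cost, K=45, h=10.0, sigma=0.1, rng=None):
    """One path-integral update: K noisy rollouts, exponential cost weighting,
    and a probability-weighted average of the noise as the correction dU."""
    rng = rng or np.random.default_rng()
    eps = sigma * rng.standard_normal((K, U.size))     # e_{m,k} for each rollout
    costs = np.empty(K)
    for k in range(K):
        x, y = simulate_head(U + eps[k])               # noisy rollout k
        costs[k] = cost(x, y)                          # I_{m,k}
    # Exponential value per rollout (equation (6)), shifted for stability.
    s = np.exp(-h * (costs - costs.min()) / (costs.max() - costs.min() + 1e-12))
    P = s / s.sum()                                    # probability weights (7)
    dU = P @ eps                                       # correction (8)
    return U + dU                                      # U_m = U + dU
```

Iterating `pi2_update` drives the rollout cost toward zero, at which point U has reached its final value.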

Path-Integral-Based Reinforcement Learning Algorithm

The path-integral algorithm is used to solve the stochastic optimal control problem, and its purpose is to minimize the performance index of the stochastic dynamic system. The analysis and derivation of stochastic optimal control are based on Bellman's principle of optimality and the Hamilton-Jacobi-Bellman (HJB) equation. In this paper, we first analyze the stochastic optimal control of the mobile robot and then use the generalized path-integral control method to obtain a numerical solution of the stochastic dynamic system.

Stochastic Dynamical System.
In general, the stochastic dynamic system can be expressed as Ẋ_t = f_t + G_t(u_t + ε_t) (10), where X_t ∈ R^{n×1} denotes the system state, G_t ∈ R^{n×p} is the control matrix, f_t ∈ R^{n×1} represents the passive dynamics, u_t ∈ R^{p×1} stands for the control vector, and ε_t ∈ R^{p×1} is Gaussian noise with variance Σ_ε ∈ R^{p×p} and zero mean. For mobile robots, the general stochastic dynamic system can be expressed as ẋ_t = v cos φ_t, ẏ_t = v sin φ_t, φ̇_t = ω_t + ε_t, where v and ω_t are the linear and angular velocities, (x_t, y_t) denotes the position of the robot on the 2D plane, and φ_t represents its orientation. This paper studies the path planning of the snake robot based on the path-integral RL algorithm. Therefore, the linear velocity v remains constant, and only the angular velocity ω_t is used as the control, which reduces the number of parameters in the RL and thereby accelerates the convergence of training.
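The unicycle dynamics above can be integrated with a simple Euler step. A minimal sketch, assuming the disturbance enters only through the angular velocity as described; the step size `dt` is an illustrative choice.

```python
import numpy as np

def step_unicycle(state, omega, v=1.0, dt=0.05, eps=0.0):
    """One Euler step of x' = v cos(phi), y' = v sin(phi), phi' = omega + eps,
    where eps is a sampled Gaussian disturbance on the angular velocity."""
    x, y, phi = state
    return (x + v * np.cos(phi) * dt,
            y + v * np.sin(phi) * dt,
            phi + (omega + eps) * dt)
```

With `eps=0` this reduces to the noiseless model; sampling `eps` from N(0, Σ_ε) at each step simulates the stochastic system.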

Stochastic Optimal Control.
For a trajectory τ_i with start time t_i and end time t_f, its performance function is defined as in equation (12), the integral of the immediate cost over [t_i, t_f] plus a terminal cost. The immediate cost can be defined as r_t = q_t + (1/2)u_t^T R u_t, where q_t = q(X_t, t) is the cost function related to the state and R is the positive semidefinite weight matrix of the quadratic control cost. The goal of stochastic optimal control is to find the control u_t that minimizes the performance criterion V_{t_i} = min_u E_{τ_i}[J(τ_i)], the expectation over all trajectories starting from X_{t_i}. To solve the noiseless optimal control problem, there are two main methods: one uses the Pontryagin Minimum Principle (PMP), which requires solving a pair of ordinary differential equations, and the other uses the partial differential equation (PDE) form of the HJB equation. For the stochastic dynamic system with noise shown in equation (10), the extension of the HJB method to stochastic optimal control is given in equation (15), where ∇_x V_t and ∇_xx V_t are the corresponding Jacobian and Hessian matrices. The corresponding optimal control can be calculated as u_t = −R^{−1} G_t^T ∇_x V_t [10]. Then, a nonlinear second-order PDE is obtained as in equation (16). V_t can be transformed into the exponential form V_t = −λ log Ψ_t, where λ is a positive constant; using λR^{−1} = Σ_ε [24,25], PDE (16) can be simplified to the linear form (17) with the boundary condition Ψ_{t_f} = exp(−ϕ_{t_f}/λ).
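The log transformation that linearizes the HJB equation can be written out explicitly; this is the standard derivation from path-integral control [24,25], sketched here under the assumption λR^{−1} = Σ_ε and with ϕ_{t_f} the terminal cost:

```latex
% Exponential transformation of the value function and the resulting
% linear PDE (a sketch of how (16) becomes (17)).
V_t = -\lambda \log \Psi_t, \qquad \lambda R^{-1} = \Sigma_\varepsilon
\;\Longrightarrow\;
-\partial_t \Psi_t \;=\; -\frac{1}{\lambda}\, q_t\, \Psi_t
 \;+\; f_t^{\top} \nabla_x \Psi_t
 \;+\; \frac{1}{2}\,\operatorname{tr}\!\big( (\nabla_{xx}\Psi_t)\, G_t \Sigma_\varepsilon G_t^{\top} \big),
\qquad
\Psi_{t_f} = \exp\!\Big( -\tfrac{1}{\lambda}\, \phi_{t_f} \Big).
```

The nonlinear quadratic term of the HJB equation cancels against the noise term under this substitution, which is what makes the Feynman-Kac solution of the next subsection applicable.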

Simplification of Path-Integral-Based RL.
It is very difficult to obtain an analytical solution of PDE (17), so we replace it with commonly used numerical solutions, one of which is based on the Feynman-Kac theorem [26]: Ψ_{t_i} can be written as an expectation over sample trajectories of the uncontrolled noisy dynamics. After discretization, the stochastic optimal control problem can be approximately transformed into the path-integration problem of equation (19), where τ_i = (x_{t_i}, …, x_{t_f}) is a sample trajectory with probability p(τ_i | x_{t_i}). After some analysis [27], the numerical solution of the optimal control given in equation (20) can be obtained, where u(τ_i) = (u_{t_i}, …, u_{t_f}) is used to generate the trajectory τ_i and the probability is defined as in equation (21), where S(τ_i) is the general version of the path cost defined in [25]. However, evaluating the analytical form of the path cost is very time-consuming because it involves many matrix inversions. Considering this fact, in order to simplify the calculation and speed up training, we choose a linear path cost function S(τ_i). It can be clearly seen that the generalized path-integral control method, using the numerical solution of the stochastic dynamic system, avoids the calculation of gradients and kinematic inverses, ensuring fast and reliable convergence.
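The Feynman-Kac estimate of Ψ can be sketched as a Monte Carlo average over uncontrolled rollouts. All names here (`psi_estimate`, the sampling parameters) are illustrative, and the sketch uses identity-shaped noise through G with u_t = 0, as the theorem prescribes for equation (18):

```python
import numpy as np

def psi_estimate(x0, f, G, q, phi_T, lam=1.0, dt=0.01, T=1.0, n_traj=200, rng=None):
    """Monte Carlo (Feynman-Kac) estimate of Psi(x0): the average of
    exp(-(phi(x_T) + integral of q dt) / lam) over uncontrolled noisy
    trajectories of equation (10) with u_t = 0."""
    rng = rng or np.random.default_rng()
    steps = int(round(T / dt))
    total = 0.0
    for _ in range(n_traj):
        x = np.array(x0, dtype=float)
        S = 0.0
        for _ in range(steps):
            S += q(x) * dt                      # accumulate state cost
            # Euler-Maruyama step of the uncontrolled dynamics
            x = x + f(x) * dt + G(x) @ (np.sqrt(dt) * rng.standard_normal(x.size))
        total += np.exp(-(phi_T(x) + S) / lam)
    return total / n_traj
```

With zero state and terminal cost every trajectory contributes exp(0) = 1, so the estimate is exactly 1, which is a useful sanity check.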

Goal-Directed Locomotion Algorithm
In the process of the snake-shaped robot's path-integral reinforcement learning, the reward that the snake-shaped robot gets depends on its own actions. Therefore, this paper extends the Markov decision process (MDP) model of path-integral reinforcement learning to multiple Markov decision processes (MDPs). Suppose that the snake-shaped robot can choose among m actions (a_i, i = 1, 2, …, m) and that the number of states of the snake-shaped robot is k (s_j, j = 1, 2, …, k). Based on the basic model of path-integral reinforcement learning, combined with the task goal, this paper defines a snake-shaped robot goal-directed locomotion learning framework as shown in Figure 2.
It can be seen from Figure 2 that, in order to improve the learning speed of the robot, environmental factors are first preprocessed to eliminate some irrelevant ones, and then a three-dimensional environmental model is established. Then, the prior information is added to the knowledge base, which improves learning efficiency. In this model, based on the current state S_t, the snake-shaped robot takes an action A_t from the action set according to a certain policy rule based on the historical experience in the knowledge base. Next, when the action is executed, the environment transitions to a new state S_{t+1}, and a new reward value R_{t+1} is obtained. Finally, the historical experience is updated through continuous learning, and the knowledge base is further improved. The snake-shaped robot periodically interacts with the environment, repeating the process of "exploration-learning-decision-utilization," learning from historical actions, and updating its knowledge base as historical experience to guide the next action selection.

Guidelines for Setting the Reward Function.
The reward function defines the snake-shaped robot's learning goal and determines its perceived state based on the environment; that is, it defines the value of each action. Since the robot tries to maximize the reward, the reward function essentially guides the robot toward its goal. Therefore, the setting of the reward function determines the convergence speed and degree of the reinforcement learning algorithm. In this article, the reward function is defined as a sparse reward, as shown in equation (23), where R_1, R_2, R_3 are constants with R_1 < 0, R_2 < 0, and R_3 > 0.

Guidelines for Setting the Update Value Function.

The goal-directed locomotion of the snake-shaped robot uses the Q-learning algorithm; that is to say, after all the environment states S and actions A are determined, a matrix Q is generated, and the element Q(S, A) of the matrix Q represents the value of the robot's choice of action A_t in the environment state S_t. The process of updating the value function is as follows: in the environment state S_t, when the snake-shaped robot selects an action A_t according to the action selection strategy, the snake-shaped robot reaches a new environment state S_{t+1} after executing the action. At this time, we update the value of Q(S, A) in the matrix Q. The update formula of Q(S, A) is the standard Q-learning rule Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_A Q(S_{t+1}, A) − Q(S_t, A_t)], where γ is the discount (attenuation) coefficient and α is the learning rate.
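The tabular update can be sketched in a few lines; this is the textbook Q-learning rule the section describes, with `s`/`a` as integer state and action indices:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard Q-learning update:
    Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q
```

Seeding Q with prior information, as the paper proposes, amounts to initializing this table from historical experience rather than from zeros.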

Guidelines for Setting the Action Selection.
In the path-integral reinforcement learning problem, exploration and utilization are a pair of contradictions: exploration means that the robot tries different behaviors to collect more information, whereas utilization means that the robot makes the best decision under the currently available information. Exploration may sacrifice some short-term benefit to obtain more accurate long-term benefit estimates by collecting more information; utilization focuses on maximizing short-term benefit based on the information already available. Exploration cannot be carried out endlessly; otherwise, too much short-term benefit is sacrificed and the overall benefit is damaged. At the same time, short-term benefit should not be overvalued while unexplored behaviors that may bring large benefits are ignored. In this paper, ε-greedy exploration is used, where ε (0 < ε < 1) is the probability with which the snake-shaped robot selects randomly. With probability 1 − ε, the snake-shaped robot selects an action corresponding to the maximum Q value; if multiple actions share the maximum Q value, one of them is selected at random. With probability ε, the snake-shaped robot selects an action at random from the action set.
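The ε-greedy rule, including random tie-breaking among maximal Q values, can be sketched as:

```python
import numpy as np

def epsilon_greedy(Q_row, eps, rng=None):
    """With probability eps, explore a random action; otherwise exploit,
    breaking ties among maximal Q values at random."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))       # explore
    best = np.flatnonzero(Q_row == Q_row.max())    # all maximizers
    return int(rng.choice(best))                   # exploit, ties broken randomly
```

Annealing `eps` downward over training, a common practice, shifts the balance from exploration toward utilization as the knowledge base matures.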

Guidelines for Setting the Collision Avoidance.
Collision avoidance must be considered in the process of goal-directed locomotion. To simplify the calculation, the ideal velocity of the robot is assumed to be v_T, and the actual velocity v_d is calculated as v_d = P(u, v) v_T, where P(u, v) is the improved perturbation matrix [28]; that is, the actual velocity v_d is obtained by modifying the target velocity v_T, and the planned path is then obtained by recursive integration of v_d. To describe the influence of collision avoidance, the perturbation matrix P(u, v) is defined as in equation (27), where I is a 2 × 2 identity matrix, n_q is a column vector (the normal at the point Q(u, v)), and f_u is a tangential vector (the partial derivative of f(u, v) with respect to u) at the point Q(u, v). λ(u, v) is a saturation function defining the orientation of the tangential velocity, and ‖·‖ is the 2-norm of a vector or matrix. Besides, ρ(u, v) and σ(u, v) are defined as the weights of the normal and tangential terms, respectively, where ρ_0 is the repulsive parameter, σ_0 is the tangential parameter, D_0 is the minimum permitted distance to the boundary of the robot, and λ_0 is a small positive threshold of the saturation function λ(u, v). Then, the actual velocity v_d can be expressed as in equation (28). It can be seen from equation (28) that v_d consists of three parts: v_T is the target velocity; (n_q^T v_T / (D*(u, v)^{ρ(u,v)} ‖n_q‖²)) n_q is the repulsive velocity; and (λ(u, v) n_q^T v_T / (D*(u, v)^{σ(u,v)} ‖f_u‖ ‖n_q‖)) f_u is the tangential velocity. Similarly, the perturbation matrix P(u, v) can be divided into three parts: the attractive matrix I, the repulsive matrix n_q n_q^T / (D*(u, v)^{ρ(u,v)} ‖n_q‖²), and the tangential matrix λ(u, v) f_u n_q^T / (D*(u, v)^{σ(u,v)} ‖f_u‖ ‖n_q‖). Therefore, we can readjust the shape of the path by changing the parameters ρ(u, v) and σ(u, v).
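The decomposition above can be sketched numerically. This is a sketch under assumptions: the signs of the repulsive and tangential terms (repulsion subtracted, tangential flow added) are inferred from the decomposition after equation (28) and are not stated explicitly in the text, and `rho`, `sigma`, `lam` are taken as fixed scalars rather than the full state-dependent weights.

```python
import numpy as np

def perturbed_velocity(v_T, n_q, f_u, D_star, rho=1.0, sigma=1.0, lam=0.5):
    """Sketch of v_d = P(u, v) v_T with the attractive / repulsive /
    tangential decomposition described in the text (2D case)."""
    v_T = np.asarray(v_T, float)
    n_q = np.asarray(n_q, float)
    f_u = np.asarray(f_u, float)
    # repulsive matrix: removes the normal component near the obstacle
    rep = np.outer(n_q, n_q) / (D_star**rho * (n_q @ n_q))
    # tangential matrix: redirects the blocked component along the surface
    tan = lam * np.outer(f_u, n_q) / (
        D_star**sigma * np.linalg.norm(f_u) * np.linalg.norm(n_q))
    P = np.eye(2) - rep + tan
    return P @ v_T
```

Far from the obstacle (large D*), P tends to the identity and v_d ≈ v_T; at the minimum permitted distance (D* → 1 in this normalization), the normal component of v_d vanishes, which is exactly what Theorem 1 asserts.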
Moreover, the perturbation matrix in this method fully considers the shape of the obstacle and the position of the goal, and can thus describe the influence of the obstacle on the planned path more objectively.

Theorem 1.
The planned path can avoid obstacles safely.
Proof. The vectors n_q and f_u are exactly perpendicular, i.e., n_q^T f_u = 0. Suppose that the distance between the robot and the boundary is close to the minimum permitted distance D_0; that is, suppose that D*(u, v) = D_0 + δ(u, v), where δ(u, v) is a monotonically decreasing function. As δ(u, v) → 0+, we have D_0^{ρ_0 e^{−1/δ(u,v)}} → 1, (1 + δ(u, v)/D_0)^{D_0/δ(u,v)} → e, and (δ(u, v)/D_0) ρ_0 e^{−1/δ(u,v)} → 0; therefore, D*(u, v)^{ρ(u,v)} → 1, and P(u, v) can be simplified as in equation (30). Therefore, since n_q^T f_u = 0 implies n_q^T v_d = 0, the path stays outside the minimum permitted distance and there is no collision.

Theorem 2.
The planned path can reach the destination eventually.
Proof. Because the goal of path planning is to make the robot reach the destination safely, the velocity of the robot should have a component in the direction of the target velocity; that is, v_T and v_d should satisfy v_T^T v_d ≥ 0, so that the planned path converges to the target point. Besides, v_d ≈ v_T should hold when the robot is near the destination ξ_T = (x_T, y_T, z_T). Here 〈v_T, n_q〉 denotes the angle between v_T and n_q. When the robot approaches the destination, D*(u, v)^{ρ(u,v)} → +∞; since cos²〈v_T, n_q〉 ≤ 1, it can be inferred that 1 − cos²〈v_T, n_q〉/D*(u, v)^{ρ(u,v)} ≥ 0, and hence from equation (30) that v_T^T v_d ≥ 0. When the robot approaches the destination, that is, D*(u, v)^{ρ(u,v)} → +∞ and D*(u, v)^{σ(u,v)} → +∞, it follows from equation (29) that P(u, v) → I; therefore, v_d ≈ v_T holds. In the process of goal-directed locomotion, the snake-shaped robot continuously interacts with the environment according to the current environment state and updates and revises its learning results to guide action selection. Finally, it finds a set of action sequences that maximizes the reward and completes the goal-directed locomotion. The pseudocode of this method is shown in Algorithm 1.

Build the 3D Environment Model.
To guarantee real-time performance, the parameters are first learned offline according to different initial states and then adjusted online based on the robot's states. To verify the effectiveness of the method proposed in this paper, a 40 × 40 virtual environment is set up, which contains several obstacles of different shapes. Suppose that ξ = (x, y, z) is the robot position and the obstacle center is ξ_0 = (x_0, y_0, z_0), with axis lengths a, b, c and index parameters d, e, f. The parameters of the path-integral-based reinforcement learning algorithm are given in Table 1.
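An obstacle with center ξ_0, axis lengths a, b, c, and index parameters d, e, f is commonly modeled as a superellipsoid envelope; the original equation is not reproduced here, so the implicit form below is an assumed reconstruction consistent with those parameters:

```python
import numpy as np

def obstacle_surface(xi, xi0, abc=(1.0, 1.0, 1.0), defs=(1, 1, 1)):
    """Assumed superellipsoid obstacle envelope:
    F(xi) = ((x-x0)/a)^(2d) + ((y-y0)/b)^(2e) + ((z-z0)/c)^(2f),
    so F < 1 inside the obstacle, F = 1 on its surface, F > 1 outside."""
    (x, y, z), (x0, y0, z0) = xi, xi0
    a, b, c = abc
    d, e, f = defs
    return (((x - x0) / a) ** (2 * d)
            + ((y - y0) / b) ** (2 * e)
            + ((z - z0) / c) ** (2 * f))
```

Varying d, e, f morphs the shape between ellipsoids and near-boxes, which is how "obstacles of different shapes" can be generated from one formula.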

Collision-Free Goal-Directed Locomotion in Which the Robot Does Not Need to Climb Obstacles.

The method proposed in this paper adds prior information and uses historical experience as decision support, so collisions are avoided during path planning. After 660 rounds of training, the robots can successfully reach the target point. When there is no trap area, the robot does not need to climb obstacles and can avoid all obstacles smoothly and swiftly and eventually reach the destination. Suppose that the starting position of the robots is [5, −3, 0] and the target position is [39, 39, 0]. The planned path is shown in Figure 3. It can be seen from Figure 3 that when the snake-shaped robot starts from different starting points, the algorithm proposed in this paper can avoid all obstacles smoothly and swiftly and eventually reach the destination.

Compared with the Ant Colony Algorithm.

By using Taguchi's experimental design method and other intelligent optimization methods [29,30], the ant colony algorithm is compared with the algorithm proposed in this paper. Suppose that the starting position of the on-water automatic rescue intelligent robot is [0, 20, 0] and the target position is [19, 2, 0]. In the ant colony algorithm of this paper, the number of ants is 100; after each iteration, the pheromone attenuation ratio is 0.5, and each time an ant traverses a path, the pheromone increase ratio is 2. It can be seen from Figure 4 that when the 3D ant colony algorithm is used for fast path planning, the planned path changes with the number of iterations, the number of ants, etc. Moreover, the 3D ant colony algorithm generally only seeks the shortest path, so the planned path is not very smooth, as shown in Figure 5.
It can be seen from Figures 4 and 5 that, for the same number of iterations, the best individual fitness curve varies with the number of ants. In other words, the ant colony algorithm relies on prior knowledge to a large extent, so it has shortcomings in fast path planning. The method proposed in this paper considers prior information, uses historical experience as decision support, and introduces a climbing coefficient, so that obstacles can be successfully overcome. After more than 900 rounds of training, the two robots reached the target point safely without collision. In the process of goal-directed locomotion, the saturation λ represents the climbing ability of the robots. Furthermore, different λ will lead to different planned paths and thus to different climbing difficulties and climbing heights. Therefore, we need to consider how to choose an appropriate coefficient λ to plan the most effective path. Suppose that the starting position of the robot is [−10, −10, 0] and the target position is [39, 20, 0]. The planned path is shown in Figure 6.
Figure 6 shows the paths planned with different values of the saturation λ; here the saturation λ of the robot is 1/6. Both paths can bypass the obstacles, escape the trap area, and reach the destination.

The Situation in Which the Robot Encounters the Trap Area.
During goal-directed locomotion, the robot may encounter a trap area. It is necessary to consider the distribution of obstacles and the climbing ability of the robot; otherwise, path planning cannot be carried out in a timely and effective manner. Suppose that there are six obstacles in the environment, as shown in Table 3, with ρ_0(ξ) = 1 and σ_0(ξ) = 1, and that the starting position is [−10, −10, 0] and the target position is [39, 20, 0].
It can be seen from Figure 7 that the trap area is formed by two different obstacles and that the snake-shaped robot may fall into it. Therefore, we need to consider the climbing ability of the robot during path planning in the 3D environment.

Conclusions
This paper proposes a collision-free goal-directed locomotion method based on path-integral reinforcement learning in a three-dimensional environment. First, a path-integral reinforcement learning model is established, which describes in detail how the robot can accumulate experience from historical data to optimize decision-making. In order to improve the learning efficiency of the method, some irrelevant environmental conditions are removed. The Q table is updated based on prior information and provided to the robot as historical experience, which greatly improves the learning efficiency of the robot. The simulation results show that the planned paths can avoid all obstacles smoothly and swiftly and eventually reach the destination.

Data Availability
The data used to support the findings of this study are included within the article.