Path Planning Algorithm for Unmanned Surface Vessel Based on Multiobjective Reinforcement Learning

It is challenging to perform path planning tasks in complex marine environments as the unmanned surface vessel approaches the goal while avoiding obstacles. However, the conflict between the two subtarget tasks of obstacle avoidance and goal approaching makes the path planning difficult. Thus, a path planning method for unmanned surface vessel based on multiobjective reinforcement learning is proposed under the complex environment with high randomness and multiple dynamic obstacles. Firstly, the path planning scene is set as the main scene, and the two subtarget scenes including obstacle avoidance and goal approaching are divided from it. The action selection strategy in each subtarget scene is trained through the double deep Q-network with prioritized experience replay. A multiobjective reinforcement learning framework based on ensemble learning is further designed for policy integration in the main scene. Finally, by selecting the strategy from subtarget scenes in the designed framework, an optimized action selection strategy is trained and used for the action decision of the agent in the main scene. Compared with traditional value-based reinforcement learning methods, the proposed method achieves a 93% success rate in path planning in simulation scenes. Furthermore, the average length of the paths planned by the proposed method is 3.28% and 1.97% shorter than that of PER-DDQN and dueling DQN, respectively.


Introduction
In ocean exploration, the competition among countries to protect marine territorial sovereignty and develop marine resources has become increasingly ferce. Te unmanned surface vessel (USV), as a kind of vessel with high autonomy, has broad application prospects in the feld of ocean exploration. As one of the current research hotspots, the path planning of USV faces many challenges, including unknown environment, perceptual uncertainty, and dynamic obstacles [1][2][3]. Te USV path planning is aimed to obtain a collisionfree path under specifc circumstances. It can be divided into two subtarget tasks, such as goal approaching and obstacle avoidance. Te goal approaching method helps the USV reach the destination, focusing on reducing path length and travel time. Te obstacle avoidance method makes the USV conduct real-time collision avoidance through a series of decisions [4].
Traditional path planning methods perform well in simple known static environments and reach a destination while avoiding obstacles [5][6][7][8][9]. But there are still major defciencies in the exploration and decision-making capabilities of algorithms in complex environments, failing to guarantee the success rate and environmental adaptability. Currently, the deep reinforcement learning (DRL) methods have advantages in unknown environment exploration and real-time action decision making in path planning problems [10,11]. Terefore, the use of DRL methods to solve the path planning problem has become one of the new research directions [12]. For example, Tai et al. used radar observations and target positions as inputs and applied DRL methods to path planning tasks for the frst time [13]. Te agent uses the discrete control commands generated by the algorithm to avoid obstacles in the indoor mobile environment. Chen et al. proposed an intelligent collision avoidance algorithm with DRL improving the path quality compared with optimal reciprocal collision avoidance (ORCA) [14]. Chen et al. constructed the interaction model between the agent and the obstacle, providing the basis for the reinforcement learning strategy of the agent's path planning in complex dynamic environment [15]. Tus, it is efective to use DRL algorithms for goal approaching and dynamic obstacle avoidance.
However, the path length inevitably increases in the obstacle avoidance process, which conficts with the requirement of destination reaching for the goal approaching subtarget task. Terefore, it is difcult for a single optimization strategy to simultaneously achieve these subtarget tasks. Recently, intelligence computing algorithms have been widely used in related felds [16][17][18][19]. A more comprehensive model can be obtained in ensemble learning by combining multiple weak learners [20]. Inspired by the idea of integrated learning, a multiobjective reinforcement learning architecture is designed to trade of these subtarget tasks.
Tere is a need to investigate the USV path planning based on multiobjective reinforcement learning.
Main contributions in this paper can be summarized as follows: (1) Based on the main scene of path planning considering random goals and multiple dynamic obstacles, the dynamic obstacle avoidance subtarget scene and the goal approaching subtarget scene are constructed. Te double deep Q-network with prioritized experience replay (PER-DDQN) is applied to the action decision of USV in two scenes, respectively. (2) A multiobjective reinforcement learning architecture based on ensemble learning is designed, optimizing the multiobjective policy integration method in the USV path planning task. (3) A USV path planning algorithm based on multiobjective reinforcement learning is proposed, improving the success rate of USV path planning tasks and shortening the planned path length in the complex environment.
Te rest of this paper is organized as follows. Te theoretical background of the PER-DDQN and the multiobjective reinforcement learning method is introduced in Section 2. Te proposed algorithm is introduced in Section 3. Simulation experiments and results are presented in Section 4. Discussion is given in Section 5, and Section 6 concludes this paper.

Q-Learning.
Te Q-learning algorithm is a value-based reinforcement learning algorithm [21]. A Q-value table is built and updated in the Q-learning algorithm. Each action is selected with the greatest beneft based on the Qvalue. Te maximum Q-value of the next state is used to estimate the Q-value of the current state. Te update formula is as follows: where Q(s, a) denote agent's expectation of reward for performing action a in state s. α represents the learning rate and c represents the discount factor. Te reward obtained by the agent after performing action a is r, and the state is changed to s ′ . Q(s, a) denote agent's expectation of reward for performing action a ′ in state s ′ .

Deep Q-Network.
To address the curse of dimensionality in high-dimensional state spaces, Mnih et al. used a neural network with θ to approximate the Q-value: Q(s, a; θ) ≈ Q(s, a) [22]. DQNs are optimized by reducing and minimizing i are the parameters of a target network that is frozen for a number of iterations while updating the online network Q(s, a; θ i ) by gradient descent. Te action a is chosen from Q(s, a; θ i ) by an action selector, which typically implements an ε-greedy policy that selects the action that maximizes the Q-value with a probability of 1-ε and chooses randomly with a probability of ε.

Experience Replay.
Online reinforcement learning (RL) agents incrementally update their parameters (of the policy, value function or model) while they observe a stream of experience [23]. Because the agent discards experience after one update in simple reinforcement learning, rare valid experience is underutilized. At the same time, there is a substantial correlation between neighbouring experiences, which is not favourable to model training. By storing experiences in replay memory, experience replay can efectively solve the above problems. It becomes possible to break the temporal correlations by mixing more and less recent experience for the updates [24].

Related Literature.
Value function-based DRL algorithm uses deep neural network to approximate value function or action value function and uses temporal difference or Q-learning, respectively, to update the value function or action value function. Many scholars use DRL methods based on value functions, including DQN algorithm and some improved variant algorithms, to motivate robots or other agents to obtain optimal paths [25][26][27]. Additionally, with the introduction of the strategy gradient method, DRL based on strategy gradient is used in robot path planning, such as A3C [28], DDPG [29], TRPO [30], and PPO [31]. When it comes to agent data control and management, blockchain hyperledger fabric is one of the practical technologies [32,33]. We have briefy summarized some of the recent literature, as shown in Table 1.

Methodology
When the USV performs a mission in a complicated marine environment with various dynamic impediments, it needs to arrive at its destination without colliding with the obstacles.

Computational Intelligence and Neuroscience
It is necessary to create a model that can select appropriate actions in diferent states in order to achieve dynamic obstacle avoidance and goal approaching.

PER-DDQN.
Te PER-DDQN improves the learning efect and the learning speed by introducing the DDQN and priority experience replay. Two Q-networks are used in DDQN to eliminate the bias caused by the greedy policy [34]. Te current Q-network is used to calculate the action corresponding to the maximum Q-value, and the target Qnetwork is used to calculate the target Q-value corresponding to the maximum action. Prioritized experience replay is a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling [35]. Te probability of being sampled is monotonic in a transition's priority, while guaranteeing a nonzero probability even for the lowest priority transition. Te probability of sampling transition i is defned as where i is the priority of transition. Te exponent α determines how much prioritization is used, with α � 0 corresponding to the uniform case. In the actual process, all samples can be divided into n intervals, and uniform sampling is performed in each interval. Te PER-DDQN is used for the action decision of the agent in the constructed scene. Te fowchart of the algorithm is shown in Figure 1.

Framework for Multiobjective Reinforcement Learning.
Te path planning task of the USV includes two subtarget tasks, such as dynamic obstacle avoidance and goal approaching. Te traditional reinforcement learning architecture for a single task is no longer appropriate. A multiobjective reinforcement learning architecture is built for policy learning and ensemble in the main scene of path planning, inspired by ensemble learning. Te fundamental principle of ensemble learning is to integrate the learning results of numerous weak models to produce better overall results, which can be classifed into bagging, boosting, and stacking. Te sample training set is sampled with replacement in the bagging method, yielding T independent sample sampling sets. T weak learners are trained from T sample sets. Weighted average, voting, and other strategy integration approaches are employed to provide fnal decision results [36]. Figure 2 depicts the fowchart.
Corresponding to weak learners in ensemble learning, the designed multiobjective reinforcement learning architecture leverages subagents for training in subtarget scenes. Diferent from traditional integration methods, the proposed method uses a main agent based on the reinforcement learning method for policy integration. According to the Computational Intelligence and Neuroscience environmental state of the main scene, the main agent selects the strategy of the subagent in the corresponding state of the subscene and makes a decision. Te designed multiobjective reinforcement learning architecture is shown in Figure 3.

Te Proposed Approach.
Te PER-DDQN algorithm is combined with the designed architecture for the constructed path planning scene, and a USV path planning algorithm based on multiobjective reinforcement learning is proposed. Figure 4 depicts the overall process of the proposed method.
Step 1. Te subagents in each subtarget scene are trained using the PER-DDQN algorithm, and the strategies of each subagent are saved.
Step 2. In the constructed path planning main scene, the main agent is trained by the PER-DDQN method. Te main agent selects subagent according to the current environment state and gives the actions according to the strategy of the selected subagent in this state.
Step 3. Te main agent executes actions of the selected subagent, generating and storing experience for the main agent to learn from.

Simulation Experiments
Te main scene, dynamic obstacle avoidance subtarget scene, and goal approaching subtarget scene are built in Unity3D to verify the efectiveness of the proposed method. Te settings for scenario conditions and reinforcement learning parameters are provided separately. Algorithms were written by Python 3.8 and processed by a server with a RAM (64G) and a CPU (Intel Core i9-11900K).

Scene Building.
Te main scene of path planning considering random goal and multiple dynamic obstacles is generated on a two-dimensional plane, as illustrated in Figure 5, to represent the complicated marine environment. Te dynamic obstacle avoidance subtarget scene is built on the basis of the main scene, as shown in Figure 6, to focus on the dynamic obstacle avoidance subtarget. Te agent does not need to consider the problem of goal approaching and instead attempts to travel through the obstacle region without colliding. It is deemed efective obstacle avoidance when the agent's ordinate is larger than the ordinate of all  obstacles. When a collision occurs, it is regarded as obstacle avoidance failure.
Te goal approaching subtarget scene is built on the basis of the main scene, as shown in Figure 7, to focus on the goal approaching subtarget. Dynamic obstacles are removed, and the only learning objective is to approach the goal.

Reinforcement Learning Parameter.
Te reinforcement learning settings for the agent in each scene, such as the action space, state, and rewards, are set as follows.
(1) Action Space Setting. Te agent's action space is set to 5 directions divided evenly into in the main scene and each subscene, as shown in Figure 8, to reduce the training cost.
(2) State Settings. Te states of the agent (s 1,t and s 2,t ) are set as equations (3) and (4) in two subtarget scenes (dynamic obstacle avoidance and goal approaching): where S self1 and S self2 are the states of the agent in two subtarget scenes, represented by agent's positions, respectively, at t − 2, t − 1, t. S obs1 , S obs2 , and S obs3 are the states of the obstacles in dynamic obstacle avoidance subtarget scene, represented by obstacles' positions, respectively, at t − 2, t − 1, t. S tgt is the state of the goal in goal approaching subtarget scene, represented by goal's position, respectively, at t. Te state of agent s t is set as equation (5) in the main scene: Goal Agent (USV) Dynamic Obstacle  Computational Intelligence and Neuroscience where s obs1 ′ , s obs2 ′ , s obs3 ′ , and s self ′ are the states of the obstacles and the agent in the main scene, represented by their positions, respectively, at t − 2, t − 1, t. s tgt ′ is the state of the goal in the main scene, represented by goal's position, respectively, at t. In dynamic obstacle avoidance subtarget scene, the dimensions of action space and observation space are 5 and 8, respectively. In goal approaching subtarget scene, the dimensions of action space and observation space are 5 and 3, respectively. In dynamic obstacle avoidance subtarget scene, the dimensions of action space and observation space are 5 and 9, respectively.
(3) Reward Setting. Te rewards of the agent (r 1,t , r 2,t , and r t ) are set as equations (6)-(8) in two subtarget scenes and the main scene: where obs_dist represents the distance between the agent and the dynamic obstacle at time t. y t and y t−1 are the ordinates of the agent at time t and time t − 1, respectively. target_collided indicates whether the agent is in contact with the goal, and dist and pre_dist are the distances between the agent and the goal at time t and time t − 1, respectively. Te training parameters of the reinforcement learning algorithm in each scene are shown in Table 2.

Result Analysis.
After 800 times of training in the dynamic obstacle avoidance subtarget scene, the two samples are shown in Figure 9. Te agent's obstacle avoidance strategy is slightly diferent in diferent scenes. Te dynamic obstacles are evenly dispersed in front of the agent, as indicated in Figure 9(a), and the collision risk is substantial.
Te agent chooses to move to the right, avoiding the range where obstacles might congregate. Te dynamic obstacles are concentrated at the agent's front right, as shown in Figure 9(b), and the agent chooses to go straight at the start. When there is a risk of collision, the agent turns left to avoid obstacles urgently. After 800 times of training in the goal approaching subtarget scene, the two samples are shown in Figure 10. When the goal is in front of the agent, as shown in Figure 10(a), the agent continues to adjust at the beginning and end of the path while moving forward in the middle. When the goal is far from the front of the agent, as shown in Figure 10(b), the agent remains adjusted throughout. Te results show that the agent can rapidly approach the goal under various initial conditions without the interference of dynamic obstacles.
After 800 times of training in the main scene, the two samples are shown in Figure 11. As shown in Figure 11(a), the obstacles are distributed in front of the agent. At the same time, the target is far from the front of the agent. Te agent chooses to move sideways quickly after going straight through the obstacle area in the initial stage to approach the goal. As shown in Figure 11(b), the obstacles are evenly distributed in front of the agent. At the same time, the target is near the front of the agent. Te agent chooses to go straight and dynamically avoid collision in the obstacle area.
Te experimental results show that by dynamically selecting the strategy of subagents, the main agent can avoid obstacles and approach the goal in various scenes to accomplish the path planning task well. Terefore, the efectiveness of the proposed method has been verifed.

Discussion
To verify the efectiveness of the proposed framework on strategy integration, a comparison is made between reinforcement learning methods that use integration methods such as linear voting method and rank voting method and our method. At the same time, the proposed method is compared with A * + ORCA and the path planning algorithm based on single-objective reinforcement learning to demonstrate the advantages of the proposed method in path planning tasks.

Comparison with Other Ensemble Learning Algorithms.
In the linear voting method, the Q-value of each action in the main scene is the normalized sum of the Q-values in the corresponding states of each subscene. In the rank voting method, the rank of each action in the main scene is the sum of the ranks of the corresponding states of each subscene. In these methods, the subagents and their strategies are consistent with the proposed method. Te performance indicators of these methods in the results of 100 random experiments are shown in Table 3.
Te rank voting method has the worst integration efect and the lowest success rate. Compared with the rank voting method, the linear voting method considerably enhances the success rate by keeping the path length from increasing. Te path length of the proposed method is slightly longer than Computational Intelligence and Neuroscience that of the other two methods, but the success rate of goal approaching and dynamic obstacles avoidance are higher. Te proposed method has the best overall performance.
Tree random samples of the path planning results of the three methods in the same environment are shown in Figure 12. "CA" represents the main agent to choose the strategy in the dynamic obstacle avoidance subtarget scene. "TA" represents the main agent to choose the strategy in the goal approaching subtarget scene. Each agent makes better decision in lowcomplexity environments to avoid obstacles and approach the goal. In a more complicated context, however, the agents using the traditional ensemble method face the issue of disordered decision making. Te decision strategy of the obstacle avoidance agent is selected in the initial stage of the path to maximize the success rate of obstacle avoidance. When approaching the goal, the decision strategy of the goal approaching agent is selected to maximize the success rate of goal approaching. Te proposed method has a greater success rate of path planning in varied situations than the other two ensemble methods, demonstrating the superiority of the proposed ensemble learning architecture over the traditional ensemble methods.

Comparison with Other Path Planning Algorithms.
Te training times of PER-DDQN are 2400 times, and other hyperparameters are consistent with the method's parameter setting in the main scene. Te performance indicators of these methods in the results of 100 random experiments are shown in Table 4.
Te assumptions of the ORCA in the decision-making process are inconsistent with the requirements of dynamic obstacle avoidance in practical applications. Terefore, the success rate of agent using the A * + ORCA is low. DDQN algorithm solved the problem of overestimation of action value function in Q-learning. On this basis, PER-DDQN uses priority sampling to accelerate the convergence speed of the algorithm, and dueling DQN uses the competitive architecture to estimate the value function more precisely. Tey perform well in the constructed scenes. Our approach combines the strengths of reinforcement learning with ensemble learning. Te experimental results show that the method proposed in this paper has the best overall performance when considering path length and success rate.
Four random samples of the path planning results of the four methods in the same environment are shown in Figure 13. Te policies provided by the A * + ORCA method are not sufcient for the agent to always avoid obstacles. Te policies provided by dueling DQN are conservative, and there may be detours. Te policies provided by the PER-DDQN are not mature enough in dealing with conficts between subtarget tasks. Many problems still exist such as long planning path, failure to avoid obstacles, and          approaching the goal. Te experimental results show that the method proposed in this paper is generally safe and performs well in various environments.

Conclusion
In this paper, a path planning algorithm for USVs in complex marine environments based on multiobjective reinforcement learning is proposed. To simulate complex ocean environment, a complex scene including dynamic obstacles and random goal is built. On this basis, two subtarget scenes with goal approaching and dynamic obstacle avoidance are established, respectively. Te PER-DDQN algorithm is used to train the action decision of the agent in the two subtarget scenes. A multiobjective reinforcement learning architecture is designed to optimize the agent's policy integration method in path planning. Te simulation results show that the proposed method achieves a higher path planning success rate and a shorter path length than the traditional path planning methods. Although the proposed method realizes the decision making of the agents in the constructed scenes, the complexity of the scene is still insufcient. Te computational efciency and path planning success rate of the algorithm will be reduced in complex environments. Modelling more actual scenes and building more realistic training scenes can efectively improve the adaptability of the algorithm. In addition, the action space in the established model is discrete, which is somewhat diferent from the real world. Agents cannot output continuous action decisions in the scenario of discrete action strategy only. Te assumption that the next time step after the action can reach the target position is also idealized, and inertial factors need to be taken into account to optimize the model. In future work, hostile ships with tracking capabilities will be added to the scene to train the model better. Te dimension of the action space will be increased to enhance the USV's mobility. In addition, changing the scene from 2-dimensional space to 3dimensional space is our follow-up research direction.

Data Availability
Te data used to support the fndings of this study are available from the corresponding author upon request.

Conflicts of Interest
Te authors declare that they have no conficts of interest.