An Optimized Path Planning Method for Coastal Ships Based on Improved DDPG and DP

Deep Reinforcement Learning (DRL) is widely used in path planning because of its powerful neural network fitting and learning abilities. However, existing DRL-based methods use a discrete action space and do not consider the impact of historical state information, so the algorithm cannot learn the optimal strategy for planning a path, and the planned path contains arcs or too many corners, which does not meet the actual sailing requirements of a ship. In this paper, an optimized path planning method for coastal ships based on an improved Deep Deterministic Policy Gradient (DDPG) and the Douglas–Peucker (DP) algorithm is proposed. First, Long Short-Term Memory (LSTM) is used to improve the network structure of DDPG: historical state information is used to approximate the current environmental state information, so that the predicted action is more accurate. In addition, the traditional reward function of DDPG may lead to low learning efficiency and slow convergence of the model. Hence, this paper improves the reward principle of traditional DDPG through a mainline reward function and an auxiliary reward function, which not only helps to plan a better path for the ship but also improves the convergence speed of the model. Second, since too many turning points in the planned path may increase the navigation risk, an improved DP algorithm is proposed to further optimize the planned path and make the final path safer and more economical. Finally, simulation experiments are carried out to verify the proposed method in terms of path planning effect and convergence trend. Results show that the proposed method can plan safe and economical navigation paths and has good stability and convergence.


Introduction
With the development of economic globalization, trade between countries is getting closer. Ships have become an important means of transportation in international trade and national transportation due to their large capacity, low energy consumption, low cost, and environmental friendliness, and they hold a pivotal position in economic development [1,2]. Economic development has created demand for maritime intelligent transportation, and ship automation is the most basic and urgent part of the solution. In the study of ship automation, path planning is one of the most important parts [3][4][5]. Coastal waters differ from narrow waters and open waters. Narrow waters, mainly straits and rivers, are constrained by coastlines and are relatively confined. Ships sailing in open waters are not restricted by coastlines, but dynamic obstacles such as icebergs may appear. Coastal waters contain proven obstacles; there are no dynamic obstacles such as icebergs and no uncharted reefs, but shore-based information obstacles (temporary obstacle areas) such as ship-wreck areas, restricted navigation areas, and military exercise areas may appear. Path planning for coastal ships mainly aims to avoid proven obstacles and shore-based information obstacles and to plan a safe and effective path for the ship [6,7]. Avoiding other ships belongs to the field of collision avoidance, for which special rules exist [8], so this paper mainly addresses proven obstacles and shore-based information obstacles. For the coastal ship path planning problem, many methods have been proposed by domestic and foreign scholars, mainly including traditional path planning methods, bionic intelligence algorithms, and machine learning-related algorithms. Traditional path planning methods require complete environmental information as prior knowledge, but it is quite difficult to obtain such information in an unknown marine environment. Bionic intelligence algorithms treat path planning as an optimal path problem, using path distance as a constraint and collision hazard and range loss as objective functions [9]. However, this kind of method is particularly prone to local optima, and solving it is computationally intensive and demands very high system performance.
In recent years, DRL has performed well in the field of path planning. DRL obtains states from an unknown environment by interacting with it, providing training samples for the neural network; at the same time, it can use the strong fitting ability of the neural network to complete the task better [10,11]. At present, most DRL-based path planning methods use algorithms with a discrete action space, such as Q-learning and Deep Q-learning (DQN) [12][13][14][15]. In these algorithms, the optional actions of ships are limited, which may make it impossible to learn an optimal path planning strategy. Some scholars use the DDPG or A3C algorithm to establish path planning models in a continuous action space [16,17]. However, this kind of research partly depends on the grid environment, and the grid partition strategy directly affects the planned path. Moreover, the reward function is typically defined as follows: when the ship performs an action, it receives a fixed positive value if its next state is closer to the target point; otherwise, it receives a fixed negative value. This leads to slow convergence of the algorithm, and the planned path does not conform to ship navigation specifications. In addition, the influence of the historical state on the current state is not considered, which also lowers the learning efficiency and convergence speed of the model. This paper focuses on autonomous ship path planning in a continuous action space and continuous environment and adopts the DDPG algorithm for path planning. In view of the poor ability of the fully connected layers of DDPG to process time-series data, LSTM, which handles such data better, is added to improve the approximation accuracy and the data utilization rate by approximating the current environment state information with historical state information. Meanwhile, a mainline reward function and an auxiliary reward function are established to optimize the strategy of DDPG and guide the ship to the target point while avoiding obstacles. Because of the size of the ship, the planned path is required to be as straight as possible, with few corners, and to avoid passing through complex obstacles, which is the biggest difference from path planning for unmanned vehicles and robots. Therefore, this paper proposes an improved Douglas-Peucker (DP) algorithm to optimize the planned path so as to make it more in line with the actual navigation requirements of ships. In summary, the main contributions of this paper are as follows: (1) The network structure of DDPG is improved. In the traditional DDPG network structure, each layer is a fully connected layer, and only the current status data is obtained each time, ignoring the historical status data. For this reason, the method in this paper obtains not only the current state data of the ship but also the historical observation states, so that the final training input is a collection of the current data and the historical state data. To better learn the relationship between the historical state and the current state, this paper changes the first layer of the DDPG network to an LSTM, which is better at processing time-series data, allowing DDPG to predict better actions. (2) A ship path planning method based on the above-improved DDPG and reward function optimization is proposed.
Aiming at the problems of low data utilization and slow convergence in most DRL-based unmanned ship path planning, this paper optimizes the traditional reward function and designs mainline and auxiliary reward functions. The mainline reward function is used to guide the ship to reach the target point and complete the path planning task, while the auxiliary reward function gives reasonable punishment in the process of path planning so as to avoid obstacles. Ship path planning is realized by the improved DDPG and the optimized reward function.
(3) A path optimization method based on the improved Douglas-Peucker (DP) algorithm is proposed. Because excessive turning points in the planned path increase the risk to the ship during navigation and are uneconomical, this paper proposes an improved DP algorithm to compress and optimize the path and remove redundant turning points, making the planned path safer, more economical, and more in line with the actual sailing requirements of the ship.
The rest of this paper is organized as follows. Section 2 reviews the related works. Section 3 presents the ship path planning model based on optimized DDPG. Section 4 describes path optimization based on an improved DP algorithm. Section 5 introduces the simulation experiments and result analysis. Conclusion and future work are presented in Section 6.

Related Work

Existing ship path planning methods mainly include traditional algorithms, bionic intelligence algorithms, and machine learning-related algorithms. The traditional algorithms mainly include the speed barrier method, A*, Artificial Potential Field (APF), and Rapidly Exploring Random Tree (RRT). The A* algorithm divides the area to be searched into square lattices, checks the squares adjacent to the starting point, then expands outward to find the target, and finally finds the path with the least movement cost through the feasible lattices. For example, Gao et al. [22] proposed a global path planning method for surface unmanned ships based on an improved A* algorithm. The method finds the global optimal solution over a larger range by expanding the search region of the original A* to 24 and 48 neighborhoods in the established raster map. The disadvantage of this method is that it depends on the design of the raster map: the raster spacing and the number of cells directly affect the computational speed and accuracy of the algorithm. Chen et al. [23] proposed a hybrid method for global path planning of autonomous surface ships at sea, which considers the collision risk, the proximity of the path to obstacles, and the speed the ship may reach while generating a navigation path. Experimental results show that the method can find the optimal path considering collision risk and obstacle distance. Fan et al. [24] added a distance correction factor and positive hexagonal guidance to the repulsive potential field function to address the problems of unreachable targets and local minima in the traditional APF method, and proposed a relative velocity method for moving target detection and obstacle avoidance in dynamic environments. Experiments show that the method can be used in both static and dynamic environments. However, when the repulsive and attractive forces are equal or when the repulsive force at the target point is large, the ship stalls and falls into a local optimum. Wang et al. [25] combined the ship domain model with the artificial potential field method to plan the path by judging the motion characteristics of the obstacle, taking into account the speed and heading of the ship. Experiments show that the method can plan a path, but it is not practical. Xiang et al. [26] proposed an improved two-way RRT algorithm for local path planning of unmanned ships. The algorithm addresses the many course twists and turns of the original RRT planning by adding corner constraints to the randomly set nodes of the original RRT [27] and setting a step length strategy. The method can plan a path quickly, but it may not be the optimal one.
Bionic intelligent algorithms mainly include the genetic algorithm, particle swarm algorithm, and ant colony algorithm. Wei et al. [28] proposed an improved genetic algorithm to establish the motion model of an underwater robot and designed a reasonable adaptive probability model for crossover and mutation. Experiments proved that the algorithm effectively improves the convergence of the genetic algorithm. Jiang et al. [29] proposed an improved adaptive genetic simulated annealing algorithm, which improves the initial population generation strategy of the traditional genetic algorithm and introduces an improved adaptive operator with a simulated annealing strategy in the mutation operation. The experimental results show that the algorithm can avoid falling into local optima and converges fast. Ding et al. [30] modeled the navigation environment information extracted from the electronic chart as the required experimental data and adopted a particle swarm algorithm for unmanned ship path planning with path distance as the constraint. The simulation results show that the algorithm is effective. Lazarowska et al. [31] converted the ship path planning problem into an optimization problem and used the ant colony algorithm to solve for the optimal path with collision hazard and range loss as the objective function. Experiments show that the method can plan a ship navigation path, but it is computationally intensive and requires very high system performance. Xie et al. [32] proposed an improved Beetle Antennae Search (BAS) based algorithm for the underactuated surface ship path planning problem. Simulation experiments show that the planned path is quite good, but the algorithm is mainly used for offline decision-making.
Machine learning algorithms mainly refer to deep learning algorithms, reinforcement learning algorithms, and DRL algorithms. DRL generates training samples by interacting with the environment and guides itself to learn a strategy for completing the task based on these samples [33]. The purpose of deep reinforcement learning is to maximize the cumulative reward that an agent receives during training and to learn the optimal strategy [34].
Many scholars have applied DRL to ship path planning. For example, Zhou et al. [12] proposed DQN-based collaborative path planning for unmanned ships, defining thirteen actions and letting the ship choose the optimal one in the current state. The action space of this method is discrete and cannot be applied to a continuous action environment. Chen et al. [13] used the Q-learning algorithm for unmanned ship path planning and maneuvering; using the trained model, ships could find the correct path and navigation strategy by themselves. A comparison with existing methods shows that the method is more effective in self-learning and continuous optimization and is closer to human operation. However, the method is prone to the curse of dimensionality under complex navigation conditions. Bhopale et al. [14] used an improved Q-learning algorithm for underwater vehicle obstacle avoidance, which forces the vehicle to leave an unsafe area instead of attempting new or random actions when it detects an obstacle, reducing the number of collisions. Experiments show that the method allows the underwater vehicle to avoid detected obstacles, but it relies on Q-tables and has poor fitting ability. Zhang et al. [35] proposed a behavior-based hazard avoidance decision model for unmanned ships based on the Sarsa algorithm and argued that the Sarsa algorithm is feasible for enhancing the hazard avoidance of unmanned ships. However, the method was only demonstrated at the theoretical level; it was not tested in an experimental environment, and no experimental results were given. Zhang et al. [15] proposed an autonomous navigation decision model based on hierarchical deep reinforcement learning. The model consists of two main layers: a scene segmentation layer and an autonomous navigation decision layer. The method uses the environment model, ship motion space, reward function, and search strategy to learn the environment state in quantized subscenes and train the navigation strategy. Experimental results show that the improved DRL algorithm can effectively enhance navigation safety and collision avoidance ability, but the planned path still has some unnecessary waypoints, which increase the ship's navigation overhead. Guo et al. [16] performed ship path planning by combining DDPG and the artificial potential field, but the work focused mainly on local collision avoidance and did not consider the influence of the historical state on the current state. Cao et al. [17] proposed an A3C-based target search algorithm for underwater robots, which enables them, through a designed asynchronous advantage evaluation network structure, to learn from experience and generate a search strategy, and uses DRL and dual-stream Q-learning algorithms for underwater robot navigation to further optimize the search path. It is shown that the agent can avoid obstacles and reach the search target point, but this method depends on the grid environment, and its action space is discrete, so it cannot be applied to a continuous action environment. The features, advantages, and disadvantages of the abovementioned ship path planning methods are compared in Table 1.
To solve these problems, this paper proposes a ship path planning model based on LSTM and DDPG. The model uses historical state information to approximate the current environmental state information and constructs a mainline reward function and an auxiliary reward function to optimize the action selection strategy of DDPG, guiding the ship to avoid obstacles and reach the target point. An improved DP algorithm is designed to compress and optimize the planned path, making it safer and more economical.

Ship Path Planning Based on Improved DDPG
A ship path planning model is trained on the ship's actual navigation environment information, so this information must be processed first; then the structure, state space, action space, and reward function of the model are designed.

Coastal Ship Path Planning Framework.
To improve the safety and practicability of coastal ship path planning, this paper proposes the following research framework. Marine environment modeling is completed by processing environmental information, and the coastal ship path planning model is constructed according to safety and economic indicators. The coastal ship path planning framework is shown in Figure 1.
As can be seen from Figure 1, the framework includes two parts: marine environment modeling and path planning of coastal ships. First, the marine environmental information is processed, and the experimental environment is constructed using the quantified environmental data; at the same time, the ship state space is designed by analyzing actual navigation characteristics. Second, the path planning model based on the improved DDPG is obtained by combining the DDPG algorithm with the path planning method, improving the network structure of the DDPG algorithm, and optimizing the reward function. To meet the actual navigation requirements of the ship, this paper then proposes an improved DP algorithm to optimize the planned path, so that the final path is safer and more economical. This section mainly introduces the modeling of the marine environment and the path planning model based on the improved DDPG; path optimization based on the improved DP algorithm is introduced in Section 4.

Ship Actual Navigation Information Processing.
When processing the actual navigation environment information of the ship, the Mercator transformation is first used to convert the longitude and latitude of each obstacle and of the ship's starting and target points into coordinates in the Cartesian coordinate system. Then the smallest area enclosing the starting point and target point is regarded as the target environment, and the coordinates of obstacles are scaled to the target environment. Finally, each obstacle is replaced with an expanded circumscribed circle to prevent the planned path from passing too close to the obstacle and increasing the navigation risk of the ship.
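As an illustration of the first preprocessing step, the sketch below applies a spherical Mercator projection to a longitude/latitude pair. The paper does not state which Mercator variant or Earth radius it uses, so both are assumptions here; an ellipsoidal projection would differ slightly.

```python
import math

R_EARTH_NM = 3440.065  # mean Earth radius in nautical miles (assumed constant)

def mercator_xy(lon_deg, lat_deg):
    """Project longitude/latitude (degrees) to planar Mercator coordinates.

    A minimal spherical-Mercator sketch: x grows with longitude, y with the
    log-tangent of latitude, both scaled by the assumed Earth radius.
    """
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    x = R_EARTH_NM * lon
    y = R_EARTH_NM * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y
```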
A Visibility Graph approach is adopted to obtain the circumcircle of an obstacle. Assume that the number of vertices of an obstacle is n. A polygon is obtained by drawing a line between every two adjacent vertices, and the circumscribed circle of the polygon is then created. Since obstacles are not regular polygons in the actual environment, if the geometric center of the polygon were defined as the center of the circle, the obstacle might not be enclosed within it. In this paper, the center of gravity of the polygon is regarded as the center of the circle, and the longest of the n · (n − 1)/2 line segments generated by connecting any two vertices of the obstacle is regarded as the diameter of the circle.
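A minimal sketch of this enclosing-circle construction follows. The vertex centroid stands in for the polygon's center of gravity, and the inflation factor for the "expanded" circle is an assumed parameter not given in the text.

```python
import itertools
import math

def enclosing_circle(vertices, inflation=1.1):
    """Replace an obstacle polygon with an expanded circumscribed circle.

    Center: centroid of the vertices (approximating the center of gravity).
    Radius: half the longest of the n*(n-1)/2 vertex-pair segments, scaled
    by an assumed safety-margin factor `inflation`.
    """
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    longest = max(math.dist(p, q) for p, q in itertools.combinations(vertices, 2))
    return (cx, cy), inflation * longest / 2  # (center, radius)
```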

Structure Design of the Path Planning Model Based on Improved DDPG.
DDPG can randomly select actions in the continuous action space according to the learned strategy. It is a deterministic policy algorithm that outputs a single action for a given state. Compared with DQN, DDPG samples fewer data and executes more efficiently. Therefore, this paper uses the DDPG algorithm for ship path planning. Path planning first obtains the current ship state information, then computes the hidden layer network data by formula (1), and finally the hidden layer network predicts an action according to the learned strategy.
The ship performs this action, and the reward function evaluates the resulting state and gives a reward or punishment. In turn, rewards and penalties affect the update of the network parameters. Through such a cycle of learning and updating, the model learns a good strategy for path planning.

Improvement of DDPG Algorithm
(1) Improvement of the DDPG Structure. DDPG is a DRL algorithm based on the Actor-Critic (AC) framework, which includes a policy network, called the Actor network, and an evaluation network, called the Critic network. The Actor network maps states to a specific action, and the Critic network estimates the value of that action. The network of DDPG is structured as a main network and a target network. The main network, including the main Actor and Critic networks, is used to yield and evaluate actions and to update network parameters. The target network, including the target Actor and Critic networks, is used to update the value function and select the optimal action according to the next state; it does not carry out online training or updating of its parameters. The target and main networks have the same neural network structure and initialization parameters. A soft update method is used to update the target network parameters, greatly improving the stability of learning.
In the process of path planning, the ship preferentially avoids the nearest obstacle in the current state, because the ship can only observe part of the environment at any moment, which increases the difficulty for the algorithm in predicting the action [36,37].
In the traditional DDPG network structure, each layer is a fully connected layer, and only the current status data is obtained each time, ignoring the historical status data. The DRL algorithm learns the strategy for completing the task by interacting with the environment; if the data obtained in this way is not fully utilized, the learned strategy may not be optimal, and the most effective action may not be predicted. For this reason, this paper chooses Long Short-Term Memory (LSTM) as the first layer of the Actor and Critic networks in DDPG. By integrating current state information and historical state information, the integrated data is used to calculate the input to the next layer of the network, so that the action predicted by the algorithm is more consistent with the current state [38,39]. The improved DDPG network structure is shown in Figure 2.
At each timestep t, the LSTM layer in the Actor network receives the ship state information s_t at the current moment from the simulation environment and integrates it with the historical state information h_{t−1}; the integrated information is recorded as h_t. At the same time, h_t is used to calculate the input o_t of the next layer of the network through the forget gate and input gate of the LSTM, and finally the integrated data h_t is saved through the output gate for the next calculation. The calculation of o_t is shown in formula (1):

o_t = f(s_t, h_{t−1}; ω), (1)

where f is the transition function determined by the LSTM, ω is the network parameter of the LSTM, and h_{t−1} summarizes the historical observations s_{t−T}, s_{t−T+1}, ..., s_{t−1} from t − T to t − 1. Inputting o_t into the hidden layer, composed of a 3-layer fully connected network, helps DDPG learn the optimal strategy faster. The number of neurons in each hidden layer is 256, and the ReLU activation function performs nonlinear processing on the output node of each hidden layer. In a neural network, when the number of neurons is too small, the network lacks the necessary learning and information processing ability; on the contrary, too many neurons not only greatly increase the complexity of the network structure but also make the network more likely to fall into a local minimum during learning and slow down training. This paper refers to the selection of the number of neurons in literature [15] and literature [16] to determine the numbers used here. The DDPG algorithm predicts the best action in the current state according to formula (2):

a_t = π(o_t | μ), (2)

where μ denotes the parameters of the hidden layer network and π is the learned strategy. In the last layer of the network, the Tanh activation function limits the single output action value to [−1, 1], and the network output is then converted into an action in the continuous action range. The Critic network has the same network structure as the Actor network and is used to evaluate the output action of the Actor network. Its output, called the Q value, is not processed by an activation function, so that the network outputs a definite action value.
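The following PyTorch sketch illustrates one plausible reading of this Actor structure: an LSTM first layer followed by three 256-neuron fully connected layers with ReLU and a Tanh output. The LSTM hidden size, window length T, and all names are assumptions; the text specifies only the fully connected layer sizes and activations.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Sketch of the improved Actor: an LSTM fuses the current state with
    historical states (formula (1)); three 256-unit ReLU layers and a Tanh
    output map o_t to an action in [-1, 1] (formula (2))."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, states):
        # states: (batch, T, state_dim), the observations s_{t-T} .. s_t
        o, _ = self.lstm(states)   # o integrates current and historical states
        return self.mlp(o[:, -1])  # action predicted from the latest step
```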
(2) Optimized Design of the Reward Function. In deep reinforcement learning, the reward function plays an important role in evaluating the effectiveness of behavior decisions and the safety of obstacle avoidance. The goal of deep reinforcement learning is to obtain the most rewarding search strategy in the process of ship navigation; when designing the reward function, the ship should avoid obstacles safely and reach the target point quickly. At present, most reward functions used in DRL-based unmanned ship path planning give a fixed positive reward value when the ship's next state is closer to the target point after performing an action and a fixed negative reward value otherwise. Using this kind of reward function leads to slow convergence of the deep reinforcement learning algorithm, and the planned path does not meet ship navigation specifications [40].
Based on the traditional reward function, this paper designs a mainline reward function and an auxiliary reward function. The mainline reward function is used to guide the ship to reach the target point and complete the path planning task. At the same time, the path planning task is further decomposed into subobjectives, for which auxiliary reward functions are designed, so as to guide the agent to seek advantages and avoid disadvantages and to improve the probability of the mainline events. In order to ensure the core status and attractiveness of the mainline reward function, the absolute value of the auxiliary reward function is set relatively small to avoid weakening the guiding role of the mainline reward function. In this paper, the auxiliary reward function is divided into two parts: a penalty near obstacles and a reward near the target point. The penalty near obstacles mainly helps the ship learn an obstacle avoidance strategy, and the reward near the target point helps the ship move towards the target point quickly: (1) Mainline reward function: the mainline reward function is divided into two parts. One part is used to guide the ship to move towards the target point during navigation to complete the path planning task. The other part gives the ship a larger final reward to encourage it to reach the target point. For the first part, in order to make the ship move towards the target point as much as possible, the reward function set in this paper is shown in formula (3), where d_goal is the distance between the ship and the target point, min(d_obs) is the distance between the ship and the nearest obstacle, κ is the adjustment factor used to adjust the impact of the nearest obstacle on the reward, and σ is the index coefficient, which has the same effect as κ. The values of σ and κ lie in [1, 10]. For the second part, in order to guide the ship to reach the target point and ensure the core status and attractiveness of the mainline reward function, this paper selects reward = 10 as the final reward for reaching the target point.
(2) Auxiliary reward function: the auxiliary reward function is divided into a penalty near obstacles and a reward near the target point. Its main function is to assist the mainline reward function so that ships learn the strategy of avoiding obstacles and reaching the target point quickly.
The penalty near the obstacle applies when the ship is close to an obstacle but has not yet collided with it (this region is hereinafter referred to as the dangerous area); to help the ship leave this area quickly, the penalty increases as the ship approaches the obstacle, i.e., the penalty value is inversely proportional to the distance between the ship and the obstacle. At the same time, in order to avoid falling into a local optimum, the punishment in the dangerous area should not be too dense, and there should be a certain gap between the punishment in the dangerous area and that for entering the obstacle area. The specific punishment value is calculated by formula (4):

reward = −3, if α < min(d_obs) ≤ β; reward = −1.5, if β < min(d_obs) ≤ δ, (4)

where min(d_obs) is the minimum distance between the ship and the obstacle, and α, β, and δ are thresholds standing for different distance ranges to the obstacle, with different penalties given in each range. The reward near the target point applies when the ship is close to the target point but has not yet reached it (this region is hereinafter referred to as the reward domain); to help the ship reach the target point quickly, different rewards are given according to the distance between the ship and the target point, which speeds up the convergence of the model. At the same time, in order to prevent the ship from falling into a local optimum, the rewards in the reward domain should not be too dense, and there should be a gap between them and the reward for reaching the target point. The specific reward value is calculated by formula (5), where z, l, and ζ are thresholds representing different distance ranges to the target point, d_goal is the distance between the ship and the target point, and different rewards are given in the ranges of z, l, and ζ, respectively. A hedged sketch of this auxiliary reward is given below.
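For illustration, the following Python sketch implements the auxiliary reward bands described above. The penalty values −3 and −1.5 come from formula (4); the mainline term of formula (3) and the concrete reward values of formula (5) are not reproduced in the text, so the positive band values below are placeholders kept small relative to the final +10, not the authors' numbers.

```python
def auxiliary_reward(d_goal, d_obs_min, alpha, beta, delta, z, l, zeta):
    """Sketch of the auxiliary reward of formulas (4) and (5)."""
    reward = 0.0
    # Penalty in the dangerous area around the nearest obstacle (formula (4)).
    if alpha < d_obs_min <= beta:
        reward -= 3.0
    elif beta < d_obs_min <= delta:
        reward -= 1.5
    # Reward in the reward domain around the target point (formula (5));
    # the band values 1.0 / 0.5 / 0.25 are assumptions, chosen to stay well
    # below the final reward of 10 for reaching the target point.
    if d_goal <= z:
        reward += 1.0
    elif d_goal <= l:
        reward += 0.5
    elif d_goal <= zeta:
        reward += 0.25
    return reward
```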
Figure 3 shows the ship path planning model based on the improved DDPG. The model mainly includes the improved DDPG algorithm and the environment model (the ship's actual environment information, the action controller, and path optimization). During path planning, the model first processes the ship's actual navigation environment information into state information, denoted s_t, according to the description in Section 3.1, and feeds it into the Actor network and the Critic network. Then the optimal ship action strategy, which maximizes the ship's cumulative return during learning, is output by randomly extracting data from the experience replay buffer for repeated training. The action controller module in the environment executes the generated action, calculates the reward value of the action according to the reward function, and stores the current ship state, the executed action, the return value of the executed action, and the ship state at the next moment in the replay buffer. The improved DDPG uses the states and return values to estimate the value of current actions and constantly adjusts its value function so that its output action better matches the ship's actual sailing status. Finally, the planned path is further optimized through the path optimization module, which makes the optimized path safer and more economical. During training, the Actor network uses the Actor Optimizer and deterministic policy gradients to update network parameters and continuously correct the generated action strategies; the Critic network uses the Critic Optimizer to train network parameters by minimizing the loss function and evaluates the action strategy in terms of action value.

Structure of Path Planning.
In the process of updating the network parameters, first, a batch of experience is sampled from the replay buffer D; then the target return value y_i is obtained through the target network, as shown in formula (6):

y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′}). (6)

Then the main Critic network is updated according to the target return value y_i: s_i and a_i are input into the main Critic network to obtain the actual value Q and the policy gradient ∇_{θ^μ}J. The error of the main Critic network is calculated according to the error equation, and the network is updated by minimizing this error. The error equation is shown in formula (7):

L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))². (7)

At the same time, the main Actor network is updated according to the policy gradient ∇_{θ^μ}J, which is calculated as shown in formula (8):

∇_{θ^μ}J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}. (8)

Finally, the target network parameters are updated. The target network does not directly copy the parameters of the main network but updates them in a soft update mode; that is, the parameters are only updated a little each time, as shown in formula (9):

θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1 − τ)θ^{μ′}, (9)

where θ^Q and θ^μ are the parameters of the main Critic network and the main Actor network, θ^{Q′} and θ^{μ′} are the parameters of the target Critic network and the target Actor network, and τ ≪ 1 is the update coefficient.
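For concreteness, a generic DDPG update step implementing formulas (6)-(9) might look as follows in PyTorch. The values of γ and τ and all names are assumptions; this is a sketch of the standard DDPG update, not the authors' code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One parameter update following formulas (6)-(9)."""
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))   # formula (6)

    critic_loss = F.mse_loss(critic(s, a), y)               # formula (7)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()                # formula (8)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of both target networks (formula (9)).
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```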

State Space.
This paper mainly studies the path planning of coastal ships. In the planning process, the ship first needs to read and quantify the environmental information in the nautical chart. At the same time, the ship will receive shore-based information (mainly ship-wreck areas, restricted navigation areas, and military exercise areas) sent by special departments; this shore-based information is also read, quantified, and converted into obstacle environment data in the experimental environment. When planning a path, the strategy of avoiding the obstacle closest to the ship at the current moment is adopted. At each moment, the ship navigation status information provided by the experimental environment mainly includes the distance between the ship and the nearest obstacle, the distance between the ship and the target point, and the speed and azimuth of the ship. Figure 4 shows the ship state information diagram in the experimental environment at time t. A diamond represents the ship S_ship, whose current position is denoted (x_ship, y_ship); a pentagram represents the position of the next waypoint, namely, the target point S_goal, whose position is recorded as (x_goal, y_goal); a circle represents an obstacle S_obs = (x_obs, y_obs). In the environment, north is the direction of axis Y and east is the direction of axis X; v_ship denotes the speed of the ship; (v_x, v_y) represents the components of v_ship on the X and Y coordinate axes, which can be calculated by formula (10); φ_v is the speed azimuth of the ship; ϕ is the angle between the target point and the ship speed; φ_obs is the relative azimuth between the ship speed and the obstacle; d_goal is the distance between the ship and the target point; and d_obs is the distance between the ship and the obstacle:

v_x = v_ship · sin(φ_v), v_y = v_ship · cos(φ_v). (10)

Based on the positions of the obstacles and the ship, the ship navigation information fusion module of the experimental environment calculates the distance between each obstacle and the ship at the current time, selects the nearest obstacle (x_obs, y_obs), and calculates (dx_obs, dy_obs) according to formula (11), which represents the projections of the distance between the ship and the obstacle at moment t in the X-axis and Y-axis directions:

dx_obs = x_obs − x_ship, dy_obs = y_obs − y_ship. (11)

The state information at time t is defined as s_t = [v_x, v_y, p_x, p_y, dx_obs, dy_obs, φ_v, ϕ], where p_x and p_y are the projections of the distance between the ship's position and the target point at moment t in the X-axis and Y-axis directions, calculated as shown in formula (12):

p_x = x_goal − x_ship, p_y = y_goal − y_ship. (12)
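A sketch of how the state vector s_t could be assembled from formulas (10)-(12) is shown below. The azimuth convention (measured clockwise from north, i.e., the Y-axis) follows the description of Figure 4; the function and variable names are illustrative assumptions.

```python
import math

def build_state(ship, goal, obstacles, v_ship, phi_v):
    """Assemble s_t = [v_x, v_y, p_x, p_y, dx_obs, dy_obs, phi_v, phi]."""
    x_s, y_s = ship
    x_g, y_g = goal
    v_x = v_ship * math.sin(phi_v)                   # formula (10)
    v_y = v_ship * math.cos(phi_v)
    x_o, y_o = min(obstacles, key=lambda o: math.dist(ship, o))  # nearest
    dx_obs, dy_obs = x_o - x_s, y_o - y_s            # formula (11)
    p_x, p_y = x_g - x_s, y_g - y_s                  # formula (12)
    phi = math.atan2(p_x, p_y) - phi_v               # angle goal vs. velocity
    return [v_x, v_y, p_x, p_y, dx_obs, dy_obs, phi_v, phi]
```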

Action Space Design.
During the ship's navigation, the pilot ensures the safety of the ship in complex navigation areas by changing course and speed. In ship path planning based on the improved DDPG, the ship's motion is controlled by the action predicted by the algorithm according to the ship state. At the same time, in order to let the agent try more new actions, explore better action strategies, and avoid falling into a local optimum, random noise is introduced during training, changing the action decision process from deterministic to stochastic. The action value a_t is sampled from this random process. The action selection process is shown in Figure 5.
In the learning process, there is a trade-off between exploration and exploitation. On one hand, the agent needs to choose as many different behaviors as possible to find the optimal strategy, which is called exploration. On the other hand, the agent will prefer the behavior with the largest Q value to obtain large returns, which is called exploitation. Exploration is very important for learning: only through exploration can the optimal strategy be determined. However, too much exploration reduces the performance of ship path planning and affects the learning efficiency. Because the ε-greedy strategy can prevent the system from falling into a local optimum, the algorithm adopts it to complete the action selection of ship path planning. In the ε-greedy strategy, a certain probability of random choice is added to the behavior selection process: in the current state, the agent randomly selects an action with probability ε to ensure that the whole state space can be explored, and chooses the action a_max with the largest current Q value with probability 1 − ε to make the best use of the knowledge learned.
In this paper, the ship's action space mainly includes action control strategies and action exploration strategies. The action control strategy mainly adopts the Actor network to predict the action of the ship. The action exploration strategy adds random noise to the output action when designing the neural network structure to encourage the ship to try more new actions.
(1) Action Control Strategy. In our path planning model, the actions yielded by the Actor network include the ship's speed increment d_v, which controls the change in the magnitude of the speed, and the heading increment d_α, which controls the change in the speed direction; the ship's movement is controlled by speed and heading together.
The Tanh activation function ensures that the output value of the neural network is between −1 and 1. The update formulas for heading, speed, and ship position are shown in formulas (13)-(15):

α_t = α_{t−1} + d_α · M_α, (13)
v_t = v_{t−1} + d_v · M_v, (14)
x_ship,t = x_ship,t−1 + v_x · dt, y_ship,t = y_ship,t−1 + v_y · dt, (15)

where α_t is the current heading, α_{t−1} is the previous heading, M_α is the maximum selectable heading increment, M_v is the maximum speed increment, (x_ship,t, y_ship,t) is the current ship position, (x_ship,t−1, y_ship,t−1) is the ship position at the previous time, and dt is the update time step. At the same time, in order to prevent the ship's speed from increasing endlessly, the maximum axis speed V_max is set. The maximum axis speed refers to the maximum speed value in a single direction along the X-axis or the Y-axis. When the component of the ship speed v_ship on the X-axis or the Y-axis reaches this maximum value, the speed is not increased further.
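A hedged sketch of this action mapping follows. For brevity it caps the speed magnitude rather than each axis component, which simplifies the V_max rule described above; the discretization itself follows formulas (13)-(15).

```python
import math

def apply_action(d_alpha, d_v, alpha_prev, v_prev, pos,
                 M_alpha, M_v, V_max, dt):
    """Map network outputs d_alpha, d_v in [-1, 1] to ship motion updates."""
    alpha = alpha_prev + d_alpha * M_alpha      # heading update, formula (13)
    v = min(v_prev + d_v * M_v, V_max)          # speed update, formula (14)
    x, y = pos                                  # position update, formula (15)
    x += v * math.sin(alpha) * dt               # azimuth measured from north
    y += v * math.cos(alpha) * dt
    return alpha, v, (x, y)
```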
(2) Action Exploration Strategy. In terms of action exploration strategies, the action value in algorithms such as DQN is discrete, while in DDPG it is continuous. Action exploration in the continuous control space enables the unmanned ship to explore well and find better actions. DDPG constructs an exploration policy μ′ by adding random noise from a noise process N to the actor policy.
This paper adopts the Ornstein-Uhlenbeck (OU) noise mentioned in literature [27], which is generated by the OU process and is suitable for continuous spaces. The OU process is time-correlated, so its exploration of the environment is more efficient. Therefore, adding OU noise to the action policy in the DDPG algorithm can accelerate training and improve exploration efficiency.
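A standard OU-process sketch of this exploration noise is shown below; the parameters θ, σ, and dt are conventional defaults, not values given in the paper.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise: mean-reverting, time-correlated
    increments, so successive noise samples are smooth rather than white."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x
```

At action time, the noise sample is added to the Actor output and the result is clipped back into the valid action range.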

Path Optimization Based on Improved DP Algorithm
Since the path planning model based on the improved DDPG algorithm ultimately retains the random exploratory behavior of the ship so that it fully explores the environmental information, the path planned by this algorithm has many unnecessary turning points. In order to reduce the operational risk during actual navigation, these inappropriate turning points must be compressed, so an improved DP algorithm is proposed to optimize the planned path. Trajectory data compression algorithms fall into two main categories. One is nonlinear trajectory fitting, which smooths the trajectory [41]; a trajectory optimized with this type of method is more consistent with actual robot motion trajectories but is not suitable for ships. The other is segmented linearization of the motion trajectory [42], which optimizes the trajectory into segmented straight lines, yielding a polyline path. Since a ship's path in actual navigation consists of several waypoints and is itself a polyline, the second type of algorithm is more suitable for ship path optimization. Among the many linear compression algorithms, the DP algorithm [43] is the most representative and widely used.
The DP algorithm approximates the curve as a series of discrete points, fictitiously connects a straight line between the first and last points, and optimizes the curve into a polyline according to the distance between each point and the line. The basic idea is to initialize a trade-off threshold, connect the first and last points of the curve with a straight line, and compute the distance between every intermediate point and the straight line. Then find the maximum distance d_max and compare d_max with the threshold: if d_max < threshold, all intermediate points on the curve are discarded; if d_max ≥ threshold, the curve is divided into two parts at this point, and the above process is repeated for the two parts until all points are processed.
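The classic DP compression just described can be sketched as follows; the helper names are illustrative.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    return num / math.dist(a, b)

def douglas_peucker(points, threshold):
    """Keep the farthest intermediate point if it exceeds the threshold and
    recurse on both halves; otherwise drop all intermediate points."""
    if len(points) < 3:
        return list(points)
    dists = [point_line_distance(p, points[0], points[-1])
             for p in points[1:-1]]
    i_max = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i_max - 1] < threshold:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:i_max + 1], threshold)
    right = douglas_peucker(points[i_max:], threshold)
    return left[:-1] + right  # avoid duplicating the split point
```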
In nautical practice, more turning points along the path cause additional resource consumption, so the optimized path should have as few turning points as possible. Although the traditional DP algorithm is simple and relatively efficient, its threshold directly determines the number of turning points in the optimized path, which causes the following problems: if the threshold is too large, the optimized path may pass through an obstacle; if the threshold is too small, some removable turning points may remain. Therefore, the DP algorithm needs to be improved.

Improvement of DP Algorithm.
The basic idea of the improvement is to remove superfluous turning points in the path to the maximum extent while ensuring that the path avoids all obstacles. The specific approach is to reoptimize the path obtained by the traditional DP algorithm to remove unnecessary turning points.
First, the last track point of the path is set as the current point. For each point from the first track point to the point immediately before the current point, a segment is drawn between it and the current point in turn; if there is no obstacle on this segment, the path is updated by removing all track points between these two points. Then the second-to-last point on the updated path is set as the current point, and the same operation is repeated. This continues until the second point has served as the current point. Finally, the optimized path is obtained by successively connecting the remaining track points into a polyline. A sketch of this pruning procedure is given after the step-by-step description below.
For example, consider a path optimized by the traditional DP algorithm, as shown in Figure 6(a), with five track points named A, B, C, D, and E. When the last track point E is set as the current point, as shown in Figure 6(b), line segment AE is drawn first; there is an obstacle on it, so points B, C, and D remain. Then line segment BE is drawn; there is an obstacle on it too, so points C and D remain. But when line segment CE is drawn, there is no obstacle on it, so point D is removed from the path, as shown in Figure 6(c). Point B is removed when C is set as the current point.
Compared with the path optimized by the traditional DP algorithm (Figure 6(a)), the path optimized by the improved DP algorithm (Figure 6(c)) has fewer turning points, only one, which greatly improves the economic benefits of the ship during navigation.

Path Optimization Algorithm Based on Improved DP.
Using the improved DDPG algorithm, a trajectory curve containing a set of points is obtained. These points are the input of the improved DP algorithm used to optimize the planned path. The steps are as follows: Step 1. Set the value D as the threshold for whether to delete a track point. The line segment between the starting point P_start and the ending point P_end of the trajectory curve is taken as the chord l of the curve. Traverse all other points on the curve, calculate the distance from each point to l, and find the point P_max farthest from l and its distance d_max to l.
Step 2. Compare d_max with the threshold D; if d_max < D, take the line segment as the approximation of the trajectory curve and go to Step 4.
Step 3. If d_max ≥ D, divide the curve into segments P_start P_max and P_max P_end, reset the start and end points for each section, and go to Step 1.
Step 4. Store all segmentation points, together with P_start and P_end, in a bidirectional circular linked list denoted as P. Define left and right pointers pointing to P_start and P_end, respectively.
Step 5. If the left and right pointers point to two adjacent points, go to Step 7.
Step 6. If there is no obstacle on the line segment between the points indicated by left and right, remove all dividing points between these two points and go to Step 7; otherwise, let left point to the next dividing point and go to Step 5.
Step 7. Move the right pointer one step toward P_start and reset the left pointer to P_start. If the right pointer reaches P_start, the optimization is complete; go to Step 8. Otherwise, go to Step 5.
Step 8. Connect the remaining points in the bidirectional circular linked list P in turn to form a polyline, which is the optimized path of the original trajectory.
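Under the assumption of a blocked(p, q) predicate that tests whether the segment p-q intersects any inflated obstacle, Steps 4-8 reduce to the following pruning sketch (equivalent to the procedure described above, using a plain list instead of a linked list). On the Figure 6 example it returns [A, C, E].

```python
def prune_waypoints(points, blocked):
    """Shortcut between waypoints whenever the connecting segment is clear.

    The current point moves from the last waypoint backwards (Step 7); for
    each current point, candidates are scanned from the start of the path
    (Steps 5-6), and every waypoint between a visible pair is dropped.
    """
    path = list(points)
    i = len(path) - 1                  # index of the current point
    while i > 1:
        for j in range(i - 1):         # earliest visible predecessor wins
            if not blocked(path[j], path[i]):
                del path[j + 1:i]      # remove all points between j and i
                i = j + 1              # current point's new index
                break
        i -= 1                         # move current point one step back
    return path
```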

Experimental Verification and Result Analysis
In this section, simulation experiments are conducted to verify the reliability and effectiveness of the proposed method, mainly including verification in the simulation environment and comparison with other algorithms. It is assumed that the sea area in which the ship sails is an open sea area with no coastline, containing only buoys and other obstacles. In the actual environment, the number of proven obstacles and shore-based information obstacles in coastal waters is very small, so this paper sets the corresponding number of obstacles according to the actual situation for experimental comparison.

Environment Construction and Parameter Setting.
Python, Gym, and Matlab are used to construct the environment for algorithm verification. Gym is a Python package used to develop and compare reinforcement learning algorithms. It allows researchers to customize training scenarios, complete tasks, and visualize the process of completing them; it also provides many existing scenarios for verifying the effectiveness of algorithms. Matlab is a scientific computing application that provides researchers with an environment for scientific calculation, visualization, and interactive programming, and it is widely used in data analysis, deep learning, robotics, and control systems.
Map data information is obtained through an electronic chart platform. This paper selects a real sea environment as the environment space for model training. First, the obstacles, the ship's starting point and target point, and other data are read from the nautical chart; then the method in Section 3.1 is used for processing, and Python and Gym are used to display the processed environment. In processing the ship's actual voyage information, 1 pixel (px) corresponds to 0.1 nautical miles. The size of the experimental environment is set to 600 px × 600 px, equivalent to 60 nm × 60 nm in the actual chart. Figure 7 shows the experimental environment after the actual environment has been processed. The green point is the ship's starting point with coordinates [540, 540], the red point is the ship's target point with coordinates [60, 60], and the yellow regions are obstacles. The color change around each obstacle represents the change in its height, and the dark blue part is the navigable area of the ship.
During training, the environment interacts with the improved DDPG algorithm to plan the ship's sailing path. When there is no obstacle in the environment or the ship is far from any obstacle, the algorithm chooses the action that preferentially approaches the target point; when the ship is close to an obstacle, the algorithm prioritizes actions that avoid the obstacle while still moving towards the target point. Path planning does not end until the ship reaches the target point. In this process, the algorithm continuously interacts with the environment to improve its action decision-making ability. The parameter settings of the proposed model in the training process are shown in Table 2. Because the DRL algorithm takes a long time to train, the model parameters are generally chosen by referring to the original algorithm and to parameter settings in strong papers in the same research field. Among the model parameters in Table 2, the action space is set by this paper; the model maps the predicted action to a specific action through the formulas in Section 3.3.4. ReLU is the activation function used in this paper. The update factor τ and the exploration decay rate follow the choices in literature [9] and literature [29].

Model Validation.
In order to verify the effectiveness of the proposed method, this section covers two parts: model validation and experimental comparison. Model validation compares the improved DDPG algorithm with the traditional DDPG algorithm, and the improved DP algorithm with the traditional DP algorithm. The experimental comparison compares the path planned by the proposed method with those of the traditional DDPG, A*, RRT*, RRT, APF, and BUG2 algorithms.

Improved DDPG Algorithm Validation.
In path planning, when the ship reaches the target point or collides with an obstacle, the current episode ends and the next episode starts. For safety, if the ship comes within 1 nautical mile of an obstacle, a collision is considered to have occurred. Figure 8 shows the paths planned by the model under different numbers of iterations. Since the paths planned in the initial exploration stage contain many turns, the 3D environment is not easy to read, so this section uses the 2D environment to show the planned paths. The black polygons represent obstacles, and the circle around each obstacle is its enclosing circle. The red circle in the upper left corner represents the target point, the green circle in the lower right corner is the starting point, and the blank area in the middle is the navigable area.
As shown in Figure 8(a), in the initial 200 iterations, the algorithm is in the initial exploration stage and the learned strategy is not optimal. Although the task of avoiding obstacles and reaching the target point is completed, there are many exploratory actions at the target point and near the obstacles, leading to many turns in the planned path. Figure 8(b) shows the planned path at 600 iterations. There are fewer broken lines near the obstacles, and the broken lines near the target point are also reduced, which indicates that the algorithm has learned an obstacle avoidance strategy, although the strategy is not yet stable because of the small number of iterations. The planned path at 800 iterations is shown in Figure 8(c). Through continuous exploration, collisions gradually decrease, ensuring the safety of the planned path. Although the path is still highly volatile, redundant path points are reduced considerably. The final planned path is shown in Figure 8(d). The ship successfully avoids all obstacles and reaches the target point. The planned paths gradually stabilize, but some fluctuation remains because the random exploration rate of the action space is retained. The reward score of the algorithm during training is shown in Figure 9.
It can be seen from Figure 9 that, since there is no strategy at the beginning of the iterations and the model is in the exploratory stage, the episode reward is very low. From around 200 episodes, the reward value begins to rise, although the model is still exploring learning strategies; with the continuous optimization of the learned strategy, the episode reward keeps increasing. From around 700 episodes, the reward becomes stable and the model has basically converged, although some episodes still receive poor rewards.
This is because the algorithm remains exploratory during training, so there is still a probability of choosing some bad actions. In the comparison of the number of steps per episode, the average number of steps of the LSTM + DDPG is about 150, fewer than that of the DDPG, which indicates that the path planned by the DDPG algorithm has more redundancy.
In the episodes between 700 and 1000, the number of steps of the LSTM + DDPG fluctuates very little, while that of the DDPG fluctuates greatly, which shows that the LSTM + DDPG is more stable than the DDPG.

Improved DP Algorithm Verification.
Since the turning points of the ship's navigation path during actual sailing should be as few as possible, the turning points in the planned path that can be merged should be handled further. The DP algorithm is a relatively efficient path optimization algorithm, but its threshold is not easy to determine, and the optimized curve may still contain too many turning points, so this paper improves the DP algorithm. Figure 11 shows the path optimized by the traditional DP algorithm and the path optimized by the improved DP algorithm. The green circle in the lower right corner is the starting point, the red circle in the upper left corner is the target point, the yellow parts are the obstacles, the color change around each obstacle represents its height change, the dark blue part is the navigable area of the ship, and the white line is the planned path.

Figure 11(a) shows the path optimized by the traditional DP algorithm. Compared with the unoptimized path (see Figure 8(d)), it is much smoother overall, and the number of turning points is reduced to 3; however, it can be seen from the figure that some turning points can still be optimized. Figure 11(b) shows the final path optimized by the improved DP algorithm. The path is very smooth overall, with no redundant turning points. Compared with the unoptimized path, the number of turning points is reduced by 7; compared with the path optimized by the traditional DP algorithm, the number of turning points is reduced to the minimum, with only one turning point.

In terms of path length, the unoptimized path is 101.597 n miles long; the path optimized by the traditional DP algorithm is 80.863 n miles, which is 20.734 n miles shorter than the unoptimized path; and the path optimized by the improved DP algorithm is 73.117 n miles, which is 28.48 n miles shorter than the unoptimized path and 7.746 n miles shorter than the path optimized by the traditional DP algorithm. In terms of both the number of turning points and the length, the path optimized by the improved DP algorithm is more economical and safer.
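For reference, the classic DP simplification that the improved algorithm builds on can be sketched as follows, assuming 2D waypoints. This is a minimal illustration of the baseline, not of the paper's improvement, and the fixed threshold `epsilon` is exactly the quantity noted above as hard to choose:

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from waypoint p to the chord through a and b."""
    if a == b:
        return math.dist(p, a)
    (x, y), (x1, y1), (x2, y2) = p, a, b
    return abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / math.dist(a, b)

def douglas_peucker(points, epsilon):
    """Drop waypoints that lie within epsilon of the chord joining the endpoints."""
    if len(points) < 3:
        return list(points)
    # Find the waypoint farthest from the chord
    d_max, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], points[0], points[-1])
        if d > d_max:
            d_max, idx = d, i
    if d_max > epsilon:
        # Keep the farthest point and simplify both halves recursively
        left = douglas_peucker(points[: idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right
    # All intermediate waypoints are within tolerance: replace with one segment
    return [points[0], points[-1]]
```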

Experimental Comparison.
Figure 12 shows the paths planned by the proposed LSTM + DDPG algorithm, the DDPG algorithm, the RRT algorithm, the RRT* algorithm, the APF method, the A* algorithm, and the BUG2 algorithm in the same marine environment. The parameter settings of the LSTM + DDPG algorithm are shown in Table 3, and those of the comparison algorithms are as follows:

RRT*: step size: 10, sampling rate: 0.1, search radius: 20, number of iterations: 10000.
RRT: step size: 5, sampling rate: 0.05, number of iterations: 10000.
APF: attraction coefficient: 1.0, repulsion coefficient: 1000.0, step length: 2, number of iterations: 5000, obstacle influence radius: 3.
BUG2: the distance D from the ship's current position to the target point and the distance F from the ship to the first visible obstacle.

The path planned by the model proposed in this paper (Figure 12(a)) has no redundant turning points; it meets the actual sailing requirements of the ship and is highly maneuverable. The path planned by the DDPG algorithm (Figure 12(b)) has fewer turning points and a shorter distance; however, it passes through multiple obstacles, which increases the navigation risk. The path planned by the RRT algorithm (Figure 12(c)) adapts to the obstacles, but it has more turning points; at the same time, it passes between two obstacles that are relatively close to each other, which increases the risk to the ship and does not suit actual navigation regulations. The path planned by the RRT* algorithm (Figure 12(d)) is more in line with an actual sailing path, but compared with the path planned by the proposed model, it has one more turning point and is longer overall. The path planned by the APF algorithm (Figure 12(e)) contains arcs, which does not conform to an actual ship navigation path as a whole, and it also passes between two relatively close obstacles, increasing the risk to the ship. Figure 12(f) shows the path planned by the A* algorithm, which moves in a straight line when there is no obstacle and moves along the obstacle when it encounters one; the planned path has no arcs, but it passes between two relatively close obstacles and is therefore not practically operable. The BUG2 algorithm goes around obstacles when planning the path (Figure 12(g)), and the path sticks to the obstacles; the whole path has no arcs and relatively few turning points, but it also passes between two close obstacles, which is impractical.

Compared with the DDPG algorithm, although the path planned by the method in this paper is longer, it does not pass through the obstacles, which ensures the safety of the ship during navigation. Compared with the RRT algorithm, the path planned by the method in this paper has far fewer turning points; turning points increase the risk to the ship during navigation and reduce its economic benefits. In the same way, the path planned by the APF method contains arcs, and during the actual navigation of the ship, too many arcs greatly reduce the economic benefits of the ship.
Compared with the path planned by the method in this paper, the paths planned by the BUG2 and A* algorithms pass closer to the obstacles, which increases the risk during the actual navigation of the ship. The path planned by the RRT* algorithm has more turning points than the path planned in this paper, so its economic benefits are lower.
In order to further compare the paths planned by the different algorithms, this paper compares the length of each planned path and its number of turning points. Figures 13 and 14 show the comparison of the path lengths and of the numbers of turning points of the paths planned by the different algorithms.
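The two metrics can be computed directly from a planned waypoint list, as in the following sketch; the function names and the heading-change tolerance are illustrative assumptions, not the paper's implementation:

```python
import math

def path_length(waypoints):
    """Total length of the polyline through the waypoints (same units as the inputs)."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))

def count_turning_points(waypoints, angle_tol_deg=1.0):
    """Count interior waypoints where the heading changes by more than the tolerance."""
    count = 0
    for a, b, c in zip(waypoints, waypoints[1:], waypoints[2:]):
        h1 = math.atan2(b[1] - a[1], b[0] - a[0])
        h2 = math.atan2(c[1] - b[1], c[0] - b[0])
        turn = abs(math.degrees(h2 - h1)) % 360.0
        turn = min(turn, 360.0 - turn)   # fold the heading change into [0, 180] degrees
        if turn > angle_tol_deg:
            count += 1
    return count
```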
From Figures 13 and 14, the paths planned by the DDPG algorithm, the A* algorithm, the APF algorithm, the algorithm proposed in this paper, and the BUG2 algorithm, sorted by length from small to large, are 70.515 n miles, 71.2989 n miles, 72.8292 n miles, 73.117 n miles, and 75.1231 n miles, respectively. It can be seen that the path lengths planned by these algorithms are relatively close. The longest path, 81.5172 n miles, is planned by the RRT algorithm, followed by the RRT* algorithm at 80.066 n miles; compared with the previous algorithms, the paths planned by these two are relatively long. In the comparison of the number of turning points, the path planned by the algorithm proposed in this paper has the fewest turning points, only one, followed by the RRT*, BUG2, DDPG, and A* algorithms. Furthermore, the paths planned by the APF and RRT algorithms contain arcs, so their numbers of turning points cannot be counted and are marked as n in Figure 14.
Through these comparisons, it can be seen that, compared with the above algorithms, the path planned by the proposed method is more practical, more economical, and safer.
It can also be seen from Figures 13 and 14 that the path planned by the method in this paper is 73.117 n miles long with one turning point. Although this path is longer than those of the DDPG, A*, and APF algorithms, the safety of the paths planned by these three algorithms is insufficient and their actual ship operability is poor. In terms of the number of turning points, the path planned by this method has fewer turning points than the paths planned by the other six methods. On the whole, the path planned by the method in this paper offers stronger safety, higher economic benefits, and stronger actual ship operability.
At the same time, this paper further compares the path planning time of the above algorithms, as shown in Figure 15. It can be seen from the figure that the RRT algorithm plans a path in the shortest time, 0.6432 s. This is because the RRT algorithm randomly selects points for path planning and ends as soon as the target point is reached, without considering issues such as the path length and the number of turning points. The algorithm in this paper takes 0.7845 s, the DDPG algorithm takes 0.9553 s, the A* algorithm takes 1.324 s, the RRT* algorithm takes 1.0234 s, and the APF algorithm takes 2.341 s. The BUG2 method takes the longest time, 4.3256 s; this is because, when the BUG2 algorithm encounters an obstacle, it walks around the obstacle before determining in which direction to continue planning, so it takes a long time.
To further illustrate the generality of the proposed algorithm, it is verified in different environments. Figure 16 shows two environments with different levels of complexity: the obstacles in environment 1 are all the same size, while the obstacles in environment 2 differ in size. The proposed algorithm is used for path planning and path optimization in both environments, and the results are shown in Figure 17.
As seen from Figure 17, the path planned by the proposed algorithm in environment 1 has only one waypoint; environment 2 is relatively complicated, so the planned path has two waypoints. In neither environment does the planned path pass close to or through the obstacles, which is in line with the actual navigation specifications of ships.

Conclusions
In order to improve the safety and economy of coastal ship path planning, this paper proposes a coastal ship path planning method based on improved DDPG and DP. The method realizes ship path planning through the improved DDPG and the optimized reward function; compared with the traditional DDPG, it improves the convergence speed of the algorithm and the utilization of data. In addition, the improved DP algorithm is used to further optimize the planned path, which solves the problem that the planned path may contain too many inflection points, makes the ship's navigation safer and more economical, and brings the planned path more in line with the actual sailing requirements of the ship. Experimental comparison with other path planning algorithms and verification in different environments show that the path planned by the proposed method has obvious advantages in terms of path length and number of inflection points. However, the method still cannot solve the following problems, which are the key parts to be studied in future work: (1) it cannot yet deal with dynamic obstacles at sea, which is the next research task; (2) it cannot yet handle the situation in which the ship encounters another ship during the voyage, that is, the collision avoidance operation that needs to be combined with the collision avoidance rules, which is another task for future research.

Data Availability
All the data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.