Autonomous underwater vehicles (AUVs) are widely used to accomplish various missions in the complex marine environment. Designing a control system for AUVs is particularly difficult because of high nonlinearity, variations in hydrodynamic coefficients, and external forces from ocean currents. In this paper, we propose a controller based on deep reinforcement learning (DRL) and study the control performance of a vectored thruster AUV in a simulation environment. RL is an important method of artificial intelligence that learns behavior through trial-and-error interactions with the environment, so it does not require an accurate AUV model, which is very hard to establish. The proposed RL algorithm uses only information that can be measured by sensors on board the AUV as input, and the outputs of the designed controller are continuous control actions, namely the commands sent to the vectored thruster. Moreover, a reward function is developed for the deep RL controller that considers the factors that actually affect the accuracy of AUV navigation control. To confirm the algorithm’s effectiveness, a series of simulations is carried out in the designed simulation environment, which saves time and improves efficiency. The simulation results demonstrate the feasibility of applying the deep RL algorithm to the AUV control system. Furthermore, our work provides an alternative method for robot control problems that face rising technical requirements and complicated application environments.

The oceans are a major source of marine life, scarce minerals, marine chemicals, ocean energy, and transportation routes, and human societies are increasingly dependent on them. Exploring, developing, exploiting, and protecting the ocean have therefore become a focus of global development and of technical equipment. As a result, an enormous amount of research effort goes into developing instruments and equipment of all kinds, including large numbers of underwater robots. Unmanned underwater vehicles are an ideal platform to carry out ocean surveying and monitoring [

Over the past few decades, various control methods have been proposed for AUVs to solve vehicle control issues while considering the aforementioned difficulties. Representative methods such as proportional-integral-derivative (PID) control have been developed for low-level AUV control. In early work, Jalving [

In addition, other techniques have been used to control AUVs, such as sliding mode control, backstepping, and model predictive control. Sliding mode control (SMC) is one of the most efficient and robust methods for dealing with nonlinear uncertainties and external disturbances [

However, traditional nonlinear controllers still depend heavily on the model, and the performance of a model-based controller degrades seriously when precise knowledge of the nonlinearities, uncertainties, and unknown disturbances is lacking. Because an accurate dynamic model is difficult to obtain, conventional control methods can hardly ensure accurate and automatic control of the AUV [

In the current study, Zhang et al. proposed an approach-angle-based three-dimensional path-following control scheme for underactuated AUV which experiences unknown actuator saturation and environmental disturbance [

Reinforcement learning (RL) is another important method of artificial intelligence for designing control systems [

Based on this literature review, we propose a deep RL controller based on the deep deterministic policy gradient algorithm for low-level velocity control of the vectored thruster AUV. In the proposed control scheme, the inputs are data that can be measured directly by the on-board sensors, and the outputs of the designed controller are the actions of the vectored thruster. Moreover, a reward function is developed for the deep RL controller that considers the factors that actually affect the accuracy of AUV navigation control. To confirm the algorithm's effectiveness, a series of simulations is carried out in the designed simulation environment, which saves time and improves efficiency. The simulation results demonstrate the feasibility of applying the proposed deep RL method to AUV navigation control and show improved control performance. Our work provides an alternative method for AUV control problems that face rising technical requirements and complicated application environments. Furthermore, the simulation results open up prospects for applying deep RL methods to complex engineering systems.

The organization of this paper is as follows. In Section

The tilt angles of the ducted propeller in the AUVs’ yaw and pitch plane are limited in

The study of modeling and control of AUVs involves many theories and methods of statics and dynamics. Generally, the motion study of AUVs can be divided into two major parts: the kinematic model and the dynamic model. The kinematic model describes the position and orientation of the vehicle, while the dynamic model deals with the motion of the vehicle caused by forces and moments.

Generally, the motion of an AUV in the underwater environment involves six degrees of freedom (6 DOF). To analyze the motion of the vectored thruster AUV in 6 DOF concisely and efficiently, it is convenient to define two commonly used frames, namely, the earth-fixed frame and the body-fixed frame, as shown in Figure

The earth-fixed and body-fixed frame of an AUV.

In this study, the earth-fixed frame is a global coordinate system that can be considered inertial, with a fixed origin. The body-fixed frame is a moving frame fixed to the AUV, whose origin coincides with the center of mass of the AUV. For convenience in investigating the vectored thruster AUV, the standard notation used to describe the motion of the AUV is defined in Table

The notation of SNAME for AUVs.

DOF | Forces and moments | Positions and Euler angles | Linear and angular velocities |
---|---|---|---|
Surge | $X$ | $x$ | $u$ |
Sway | $Y$ | $y$ | $v$ |
Heave | $Z$ | $z$ | $w$ |
Roll | $K$ | $\phi$ | $p$ |
Pitch | $M$ | $\theta$ | $q$ |
Yaw | $N$ | $\psi$ | $r$ |

The general kinematics transformation of AUV between the two independent coordinate systems can be represented as follows:

The kinematics description of the nonlinear equations of the AUV above can be described separately for linear and angular parameters as follows:
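The transformation itself did not survive in the text above, so the following is only a minimal sketch of the body-to-earth velocity mapping in its widely used textbook form (ZYX Euler angles), written separately for the linear and angular parts. The function name `body_to_earth_rates` and the argument layout are illustrative assumptions, not the authors' implementation.

```python
import math

def body_to_earth_rates(euler, nu):
    """Map body-fixed velocities to earth-fixed position and Euler-angle rates.

    euler = (phi, theta, psi) in radians; nu = (u, v, w, p, q, r).
    Assumes the standard ZYX Euler-angle convention (hypothetical helper).
    """
    phi, theta, psi = euler
    u, v, w, p, q, r = nu
    cphi, sphi = math.cos(phi), math.sin(phi)
    cth, sth = math.cos(theta), math.sin(theta)
    cpsi, spsi = math.cos(psi), math.sin(psi)

    # Linear part: rotate (u, v, w) from the body frame into the earth frame.
    x_dot = cpsi * cth * u + (cpsi * sth * sphi - spsi * cphi) * v \
        + (cpsi * sth * cphi + spsi * sphi) * w
    y_dot = spsi * cth * u + (spsi * sth * sphi + cpsi * cphi) * v \
        + (spsi * sth * cphi - cpsi * sphi) * w
    z_dot = -sth * u + cth * sphi * v + cth * cphi * w

    # Angular part: map (p, q, r) to Euler-angle rates (singular at theta = ±90°).
    tth = math.tan(theta)
    phi_dot = p + sphi * tth * q + cphi * tth * r
    theta_dot = cphi * q - sphi * r
    psi_dot = (sphi / cth) * q + (cphi / cth) * r

    return (x_dot, y_dot, z_dot, phi_dot, theta_dot, psi_dot)
```

With zero Euler angles the two frames are aligned, so the function returns the body velocities unchanged; with a yaw of 90° a pure surge velocity appears as motion along the earth-fixed y axis.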

The dynamic equations of motion for the underwater vehicles are derived from Newton–Euler equation using the principle of virtual work and D’Alembert’s principle. The equations of motion of underwater vehicle are established based on the traditional six-DOF model in the earth-fixed frame, and it can be expressed in the following form:

In general, the thrust

On the other hand, the direction of thrust

By comparison with the conventional AUV, there is difference in controlling because the direction of thrust is controlled by the thrust-vectoring mechanism. The deflection angle of the duct is a combination of the rudder angle

Thrust decomposition of the vectored thruster AUV.

The vector of thrust applied on the AUV along with axis in the body-fixed frame is defined as

Besides, since the thrust

In order to study the relations between the vector of thrust

Figure

Scaling factors varying with tilt angle

Because an AUV moves in six DOFs and its dynamics are highly nonlinear, designing a controller for it is a big challenge. The control system plays an important role in the design of an AUV, since it is what allows the vehicle to realize its intended functions [

Based on the control process in Figure

A block diagram of the AUV’s control process.

Simulation results of the reference velocity

The simulation results in Figure

Simulation results of the reference velocity

As shown in Figure

When a new requirement is introduced, such as the reference velocities

Simulation results of the reference velocity

According to the above, the simulation results and analysis show the inadequacies of the designed AUV controller based on PID algorithm. In order to improve the performance and reduce energy consumption, it is essential to find a new method to design the AUV controller.
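For reference, a PID baseline of the kind discussed in this section is commonly implemented in discrete form. The sketch below is a generic textbook version with output saturation; the class name `PID`, the gains, and the limits are hypothetical and are not the values used in the simulations above.

```python
class PID:
    """Discrete PID controller with output clamping (illustrative sketch)."""

    def __init__(self, kp, ki, kd, dt, out_min, out_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dt = dt
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measurement):
        # Error between the reference velocity and the measured velocity.
        error = reference - measurement
        # Rectangular integration and backward-difference derivative.
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the actuator limits (no anti-windup in this sketch).
        return min(max(output, self.out_min), self.out_max)


# Hypothetical usage: one surge-velocity loop sampled every 0.1 s.
surge_pid = PID(kp=2.0, ki=0.0, kd=0.0, dt=0.1, out_min=-10.0, out_max=10.0)
command = surge_pid.step(reference=1.0, measurement=0.5)
```

A controller of this form needs its gains retuned whenever the reference or operating condition changes, which is the limitation that motivates the learning-based controller.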

Reinforcement learning (RL) is a branch of machine learning that studies how an agent optimizes its behavior for a task by interacting with the environment. At each step, the agent executes an action in some state, and the environment responds by producing a new state. At the same time, the agent receives a reward from the environment, which serves as an index for evaluating how good or bad the action was. A series of data is generated by the agent and the environment through continued loop iterations. The basic principle of reinforcement learning is presented in Figure

The overview of reinforcement learning process.

The environment for agent training in RL can be described as a Markov Decision Process (MDP), where the environment is assumed as fully observable. An MDP can be defined as a 5-tuple

The goal of the agent is to maximize the total amount of reward it receives [

The purpose of the reinforcement learning method is to find the optimal policy

The state-value function is defined as the expected value of cumulative discounted rewards from the state
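As a small illustration of the cumulative discounted reward that the value functions take the expectation of, the sketch below computes the return of a single finite reward sequence. The function name `discounted_return` is an assumption made for this example.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for one episode."""
    g = 0.0
    # Accumulate backwards so each reward is discounted by the right power.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

The state-value of a state is the average of this quantity over the trajectories that start from that state under the policy.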

Similar to the state-value function, the action-value function, also known as the

The state-value function

When the agent utilizes strategy optimal policy

The purpose of RL problems is to learn an optimal policy

Existing RL algorithms mainly consist of value-based and policy-based methods. A classic value-based method is Q-learning, which has become one of the most popular and widely used RL algorithms. Q-learning calculates the Q-value of each state-action pair and stores it in a table. Precisely because the table is looked up in every iteration, this value-based algorithm suits applications where the state and action spaces are discrete and of modest dimension. To handle state and action spaces that are too large for a table, function approximation is used to estimate the value function. As research deepened, deep neural networks were used to develop a novel artificial agent, named the deep Q-network (DQN), which can learn successful policies from high-dimensional state [
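The table lookup that makes Q-learning suitable only for small discrete problems can be sketched as follows. This is the standard tabular update; the function name `q_learning_update`, the learning rate `alpha`, and the dictionary-based table are illustrative assumptions.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # The max over actions is a table lookup, feasible only for discrete actions.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage: the Q-table is a dictionary keyed by (state, action).
q_table = defaultdict(float)
new_value = q_learning_update(q_table, state=0, action=1, reward=1.0,
                              next_state=1, actions=[0, 1],
                              alpha=0.5, gamma=0.9)
```

The `max` over a finite action list is exactly the step that has no direct counterpart when the action is a continuous thrust or duct angle.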

However, while value-based methods can resolve problems with high-dimensional state spaces, they can only handle discrete actions and fail in continuous action spaces. This kind of RL algorithm cannot be applied to the continuous domain directly because it relies on finding the action that maximizes the action-value function, which requires an iterative optimization at every step. Besides, if the state-action space is discretized too coarsely, the results become unacceptable; if it is discretized too finely, the problem becomes difficult to solve. Hence, applying a value-based method to a continuous control domain, such as our control of the vectored thruster AUV, may be impracticable. In contrast, another important class of RL algorithms, the Policy Gradient (PG) methods, has a wide range of applications in continuous control. PG methods perform gradient ascent on the policy objective function

Then, the policy-based methods update the parameter

The equation above shows that the gradient is an expectation over possible states and actions. Rather than approximating a value function, PG methods approximate a stochastic policy using an independent function approximator with its own parameters, which are adjusted to maximize the expected future reward. The main advantage of PG methods over value-based methods is that an approximator represents the policy directly. During PG learning, the probability distributions of states and actions must be considered simultaneously; hence, the PG method integrates over both the state and action spaces during training. This consumes a large amount of computing resources for high-dimensional state and action spaces [

The gradient for deterministic policy is

To explore the environment fully, a stochastic policy is often necessary. To ensure adequate exploration for the deterministic policy gradient algorithm, an off-policy actor-critic learning algorithm was subsequently proposed. Actor-critic algorithms consist of two components, an actor and a critic, which are two different networks with different roles. The critic estimates the value function, which can be the action value (the Q-value) or the state value (the V-value). The actor is a policy network that produces actions and updates the policy in the direction suggested by the critic (for example, with policy gradients), while the critic evaluates the actions made by the actor [

Given the policy gradient direction, the update process of actor-critic off-policy DPG can be presented as

The advantage of the actor-critic algorithm is its ability to perform single-step updates, which makes it more efficient. The performance of the actor-critic algorithm depends on the critic’s value estimates; nevertheless, convergence is very difficult to achieve, particularly because the actor must update its parameters at the same time. To overcome these problems, Deep Deterministic Policy Gradient (DDPG) has been presented.

Deep reinforcement learning combines deep neural networks with reinforcement learning. This structure lets it learn control policies directly from high-dimensional state-action spaces. A deep neural network (DNN) is an artificial neural network (ANN) with several layers between the input and output layers; owing to its excellent performance over a wide range of applications, it has recently become a very popular research topic in machine learning. Following their success in fields such as medical image analysis, deep neural networks have attracted great interest and are now used in many areas, including engineering control problems.

The basic unit of a neural network is the neuron, a mathematical function that models the functioning of a biological neuron. Because it consists of many layers with many neurons in each layer, a DNN can approximate the mapping from inputs to outputs, whether the relationship is linear or nonlinear. A layer is called fully connected when each of its neurons is connected to all neurons in the next layer. In the network form, each connection contains parameters weight

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy actor-critic algorithm that uses deep function approximators and can solve problems with high-dimensional, continuous action spaces. Because it builds on the ideas of DQN, DDPG also uses deep neural networks as function approximators, which makes it feasible in applications with complex action spaces [ With parameters $\theta^{\mu}$, the actor network represents the deterministic policy, and the critic network with parameters $\theta^{Q}$ is used to estimate the action-value function

In order to achieve stable learning, DDPG deploys experience replay and target networks like DQN. Experience replays are a key technology behind many of the latest advancements in deep reinforcement learning [

To optimize the critic (action-value function) network, a loss function based on the mean squared error is used for backpropagation. In DDPG, the parameters of the critic network are updated by minimizing this loss function, where $y_i$ is the target value generated by the target neural network

Then, the gradient of the loss function
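The critic update can be illustrated with a small sketch of the target value and the mean-squared-error loss in the form used by the standard DDPG formulation, $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}))$. The networks are passed in as plain callables; the function names `critic_targets` and `critic_loss` and the scalar toy state are assumptions for illustration, and terminal-state handling is omitted.

```python
def critic_targets(batch, target_actor, target_critic, gamma=0.99):
    """Compute y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for each transition."""
    return [
        reward + gamma * target_critic(next_state, target_actor(next_state))
        for (_state, _action, reward, next_state) in batch
    ]


def critic_loss(batch, critic, targets):
    """Mean squared error between the targets y_i and Q(s_i, a_i)."""
    squared_errors = [
        (y - critic(state, action)) ** 2
        for (state, action, _reward, _next_state), y in zip(batch, targets)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy "networks" on scalar states and actions.
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a
critic = lambda s, a: s * a

batch = [(1.0, 2.0, 1.0, 2.0), (0.0, 1.0, 0.0, 4.0)]  # (s_i, a_i, r_i, s_{i+1})
targets = critic_targets(batch, target_actor, target_critic, gamma=0.5)
loss = critic_loss(batch, critic, targets)
```

In a real implementation the loss would be minimized with respect to the critic parameters by an automatic-differentiation framework; only the arithmetic of the target and the loss is shown here.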

The actor policy function is presented with the network parameter ^{μ}, which is updated using the critic network parameter ^{Q} to optimize the expected function. The objective function _{t} under the policy

A policy gradient method is generally obtained by the deterministic policy gradient with respect to network parameter

Hence, the parameters of the online actor network in DDPG can be updated by using the sampled policy

In order to avoid the divergence of the algorithm, separate target networks are created as copies of the original actor network and critic network. In the DDPG algorithm, two target networks

To improve the stability of the learning, we use the “soft” update method to update the parameters as illustrated by Mnih et al. The weights of the target network are constrained to update slowly by tracking the main networks:
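A minimal sketch of the soft update described here, in its usual form $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, is given below. Parameters are represented as flat lists of floats for simplicity; the function name `soft_update` is an assumption, and the default rate of 0.0001 assumes that this is the soft updating rate listed among the hyperparameters.

```python
def soft_update(target_params, online_params, tau=0.0001):
    """Return target parameters moved a fraction tau toward the online ones."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_online, w_target in zip(online_params, target_params)
    ]


# Hypothetical usage with two weights and an exaggerated tau for readability.
new_target = soft_update(target_params=[0.0, 10.0],
                         online_params=[1.0, 0.0], tau=0.5)
```

With a small rate the target networks change slowly, which keeps the critic's training targets nearly stationary between updates.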

As described above, the vectored thruster has one thrust and two deflection angles that need to be controlled. Hence, the control system needs to produce three continuous control outputs to achieve the desired reference dynamic state, such as the velocities of the AUV. Given the dynamics of the AUV, the designed control system must be able to perform nonlinear continuous control of the vectored thruster AUV in a complicated and changeable underwater environment. To solve this continuous control problem, an adaptive control system for the vectored thruster AUV based on the DDPG algorithm is proposed in this work. The aim of this study is to develop a new control algorithm that can control the vectored thruster AUV under different operating conditions. In our study of the AUV, the architecture of the control system based on the DDPG algorithm can be illustrated as in Figure

The control architecture based on DDPG algorithm.

As shown in Figure _{t}. This item presented the difference between the setting parameters and the practical measurement results that provide instantaneous information to the merged information _{t}. The reward function unit is the main indicator of evaluating the advantage and disadvantages of this algorithm. The input parameter of the reward function model is the instantaneous error vector e_{t}. In this way, the immediate reward r_{t} is defined by the reference state _{t} to evaluate the operation situation of present activity and error in feedback. The AUV controller receives the information summarized in the system state _{t} of AUV and the immediate reward _{t} of the current states. According to the input parameters s_{t} and r_{t}, the AUV controller produces the action a_{t} to the AUV simulation environment based on a lot of studies and iterative computation. In practical application, the state information can be measured by the sensor system in AUV, such as DVL and IMU. In our work, the AUV simulation environment is established based on the study mentioned in Section

Based on the presented DDPG architecture, the AUV controller for low-level control of the vectored thruster AUV is developed. To present the control system more clearly, the procedure is written in algorithmic form. The algorithm for the vectored thruster AUV control is summarized in the pseudocode shown in Algorithm

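Input parameters $M$, $T$, $R_{\min}$, $R_{\max}$

Randomly initialize critic network $Q(s, a \mid \theta^{Q})$ and actor network $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$

Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$

Initialize replay buffer $R$

For episode = 1 to $M$

Initialize a random process $\mathcal{N}$ for action exploration

Initialize the AUV simulation environment

Receive initial observation state $s_{1}$ from the AUV simulation environment

For step = 0 to $T$

Select action $a_{t} = \mu(s_{t} \mid \theta^{\mu}) + \mathcal{N}_{t}$

Execute action $a_{t}$ in the AUV simulation environment

If the number of transitions stored in $R$ exceeds $R_{\min}$

Sample a random minibatch of $N$ transitions $(s_{i}, a_{i}, r_{i}, s_{i+1})$ from $R$

Set $y_{i} = r_{i} + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$

Update critic by minimizing the loss: $L = \frac{1}{N} \sum_{i} \left( y_{i} - Q(s_{i}, a_{i} \mid \theta^{Q}) \right)^{2}$

Update the actor policy using the sampled policy gradient: $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_{i},\, a = \mu(s_{i})} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_{i}}$

Update the target critic networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$

Update the target policy networks: $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$

End if

If the number of transitions stored in $R$ exceeds $R_{\max}$

Remove the oldest stored data from the replay buffer

End if

Obtain the new state $s_{t+1}$

Obtain reward $r_{t}$

Store transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in $R$

End for

End for

Output parameters $\theta^{Q}$ and $\theta^{\mu}$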

The algorithm flow based on DDPG for vectored thruster AUV.
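The replay buffer logic in the algorithm (train only once enough transitions are stored, discard the oldest data once the buffer is full) can be sketched as follows. The class name `ReplayBuffer`, the `ready` method, and the exact comparison against the minimum size are assumptions for illustration; the default sizes and batch size mirror the values listed among the hyperparameters, assuming that 64 is the minibatch size.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded experience replay buffer (illustrative sketch)."""

    def __init__(self, max_size=500000, min_size=1000, seed=None):
        # A deque with maxlen drops the oldest transition automatically.
        self.buffer = deque(maxlen=max_size)
        self.min_size = min_size
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def ready(self):
        # Training starts only once enough transitions have been collected.
        return len(self.buffer) >= self.min_size

    def sample(self, batch_size=64):
        # Uniform sampling without replacement breaks temporal correlation.
        indices = self.rng.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in indices]

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage with tiny sizes so the behavior is easy to follow.
buf = ReplayBuffer(max_size=3, min_size=2, seed=0)
for t in range(4):
    buf.store(state=t, action=0.0, reward=-1.0, next_state=t + 1)
minibatch = buf.sample(batch_size=2) if buf.ready() else []
```

Sampling by index keeps the stored order intact, so the oldest transition is always the one removed when the capacity is reached.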

In line 1 of Algorithm _{1} can be obtained directly from the AUV simulation environment. In the inner loop from line 9 to line 27, the core part of our algorithm is performed to control the AUV. In the process of this algorithm, a fixed sample time of each step in the inner loop is set to _{t} is obtained when the state _{t} is given because the actor policy is determined (line 10). The action _{t} is immediately sent to the AUV simulation environment to complete the corresponding motion control (line 11).

In order to improve the efficiency and reliability of the training process, the experience replay buffer _{min} stored to train the networks (line 12). When these conditions are fulfilled, a random minibatch _{i} can be calculated, where ^{μ} is obtained. According to the network parameters ^{Q}and ^{μ} calculated in line 14 to 16, the critic target _{t}, and then the new state _{t+1} can be obtained directly from the sensor system on the AUV in a real application, while the state information can be calculated by the simulation environment (line 23). Then, the reward function can calculate the immediate reward _{t} to evaluate the effects of the action (line 24). Subsequently, with the combination of above the data obtained, the transition

In our designed control system, the environment simulation is used to simulate the real underwater physics of AUV. During the simulations, the parameter sampling time used in Algorithm _{p}, rudder angle _{p}, rudder angle

In this control algorithm, the state _{t} in the Markov process represents the current state of the vectored thruster in the underwater environment. In our AUV simulation environment, the state parameters, which can be expressed as _{t} is the velocity error obtained by the real measured velocity _{t} for the system performance. In order to more fully evaluate the advantages and disadvantages of reward functions, a kind of reward function is proposed with different considerations in our study. Therefore, this immediate reward function _{t} is defined as follows:_{t}. The average of past executed actions

In order to verify the feasibility of the proposed Algorithm

The maximum and minimum sizes of the experience replay buffer _{max} and _{min}. The learning rate for actor and critic networks is _{R−A} and _{R−C}. The discount rate and the soft updating rate for the target networks are

The hyperparameters used for training the DDPG controller.

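Hyperparameters | Value |
---|---|
M | 3000 |
T | 1000 |
dt | 0.1 s |
Maximum size of the replay buffer | 500000 |
Minimum size of the replay buffer | 1000 |
Learning rate of the actor network, L_{R−A} | 0.0001 |
Learning rate of the critic network, L_{R−C} | 0.0001 |
Discount rate | 0.99 |
Soft updating rate | 0.0001 |
Minibatch size | 64 |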

According to aforementioned proposed Algorithm

Simulation results with

As we can see in Figures _{p} and deflection angles of the duct

Simulation results with

As we can see in Figures

However, the great change extent of the thrust _{p} and the duct angles

Simulation results with

Comparing the simulations of Figures

In order to improve the performance of the control system greatly, the first three terms are adopted in the reward function to further take advantage of the algorithm for AUV in the real application. Hence, in the next simulation, the factor is set to

Simulation results with

As we can see in Figure _{p} and the duct angles _{p} lead to the large deviation in the position and orientation of the AUV. In order to further improve the performance of the algorithm, this bias about the thrust and duct angles is needed to be considered in the reward function. Based on the above comparison and consideration, the last term of the reward function, which is inspired by the integral item of error of PID algorithm, is added to reduce the effect of error propagation. The new simulation is carried out with the reward function with all the four terms considered, and the results can be obtained as shown in Figure

Simulation results with

The results of simulations are obtained and shown in Figure

The simulation results comparing RL and PID.

As can be seen, the simulation results indicate that the controller based on DDPG performs well in controlling the vectored thruster AUV. Compared with the PID controller, the designed DDPG-based controller shows better dynamic performance. To further study the performance of the designed controller under greater uncertainty, simulations are carried out to study its anti-interference performance under Gaussian white noise excitation. The simulation results are shown in Figure

The simulation results comparing RL and PID with white noise.

Under the Gaussian white noise disturbances introduced in the simulation environment, both the DDPG-based controller and the PID controller can still fulfill their function. The results show that the designed control scheme based on DDPG has good dynamic and static response and strong anti-interference ability. Simulation results from Figures

In order to test the capability of this algorithm, the other simulations with changed references are carried out. The new reference is set to

Simulation results with the reference velocities

As we can see in Figure _{y} achieves the setting velocity requirements with good reliability. From Figures _{p} is very stable within the limit of ultimate thrust, and the duct angle

We applied DDPG method on the proposed controller for the vectored thruster AUVs, and the training reward and the time consumption can be obtained as shown in Figures

Episode reward while training under ideal conditions.

The cost time of episode.

As we can see in Figure

In this paper, an AUV controller based on the Deep Deterministic Policy Gradient (DDPG) was proposed to improve the control performance of the vectored thruster AUV. The proposed algorithm uses the information measured by the internal sensors of the AUV to provide the control commands that the AUV needs to fulfill its task. Unlike classical control theory, the designed controller does not require a model of the large, complex, nonlinear system of the vectored thruster AUV. It needs only some input parameters of the AUV, from which the proposed algorithm learns a control strategy that meets the implementation requirements. In the learning process, the reward function is fundamental for the DDPG controller to realize the system goal and the related functions of the AUV. In this algorithm, a reward function is proposed that considers a series of control precision requirements and the influence of operational constraints. The reward function designed in this paper can effectively improve reliability and stability, reduce energy consumption, and restrain sudden changes of the vectored thruster. It should be noted that the proposed control system based on the DDPG algorithm was developed for the lower-layer motion control of the vectored thruster AUV, although the method could also address a wider range of applications and more complex dynamic control systems. Therefore, the controller based on the DDPG algorithm has broad application and development prospects.

Furthermore, our proposed algorithm framework for the AUV uses only system states that can be measured directly by sensors as inputs, which differs from earlier methods that use images as inputs. This paper shows that the motion of the AUV can be controlled directly by sending low-level control commands to the vectored thruster. To confirm the algorithm’s effectiveness, a series of simulations was carried out in a simulation environment established from the kinematic and dynamic analysis of the vectored thruster AUV. Using a simulation environment in place of the real underwater environment proved to be cost saving and efficient. In this sense, we think that our work expands the application range of AUV control studies that use deep reinforcement learning. Furthermore, our proposed control algorithm provides an alternative approach for controlling underwater vehicles and other kinds of robots.

Our present study also has limitations. The simulations were carried out under ideal conditions, so realistic experiments are needed to verify the correctness and feasibility of the proposed method. Moreover, more influencing factors should be taken into account, such as uncertain time delays among the sensors, actuators, and controllers. In addition, how to improve the performance and ensure the stability of the proposed controller is an important task for further research. Finally, control algorithms based on deep reinforcement learning have broad applications and are significant in both theory and engineering practice; therefore, the related research will become more important.

The data of hydrodynamic and thrust coefficients of the AUV used to support the findings of this study are included within the supplementary information file (Appendix B). The other data used to support the findings of this study are included within the article.

The authors declare that they have no conflicts of interest.

The supplementary file includes two parts: the element terms of the dynamic equations of motion and the hydrodynamic and thrust coefficients of the AUV. Both parts are important additions for modeling the complex dynamics of AUVs.