Deep Reinforcement Learning for Vectored Thruster Autonomous Underwater Vehicle Control

Introduction
Since the oceans are the most important source of marine life, scarce minerals, marine chemicals, ocean energy, and transportation, human societies are increasingly dependent on them. Therefore, exploring, developing, exploiting, and protecting the ocean have become central issues of global development and technical equipment. Thus, an enormous amount of effort goes into the research and development of all kinds of instruments and equipment, such as underwater robots. Unmanned underwater vehicles are an ideal platform for ocean surveying and monitoring [1]. Because the natural environment of the ocean is harsh for humans to investigate, autonomous underwater vehicles (AUVs), whose performance has improved significantly, are widely used for exploring and utilizing resources by carrying different detection and operating instruments [2,3]. Although AUV performance has developed substantially, many challenging problems in this field still strongly appeal to scientists and engineers. For example, conventional AUVs are unable to perform detailed inspection missions at zero and low forward speed, because the control surfaces become ineffective in this condition: the control force depends on the forward speed [4][5][6][7]. These disadvantages greatly limit the application of AUVs. An important and effective way to overcome this restriction is to use a vectored thruster, whose vectored thrust provides the control force for the AUV [8][9][10]. To perform underwater tasks, it is necessary to design a control system for the vectored thruster AUV that achieves precise trajectory tracking.
However, AUVs are highly complex and coupled nonlinear systems with all kinds of unknown, structured, and unstructured uncertainties arising from the underwater environment [11,12]; it is therefore difficult to establish a precise control model for the designed AUV. Consequently, the control of AUVs has attracted considerable attention in recent years, as it must satisfy the demand for accurate trajectory tracking despite variations in hydrodynamic coefficients and external forces from ocean currents [13,14].
Over the past few decades, various control methods have been proposed for AUVs to solve vehicle control issues while considering the aforementioned difficulties. Representative methods for AUV control, such as the proportional-integral-derivative controller, have been developed for low-level AUV control. In early work, Jalving [15] designed a PID controller for an AUV with steering, diving, and speed subsystems. A PID-based controller was proposed for the position and attitude tracking of AUVs, and the authors also proved the global convergence of the proposed algorithm [16]. Herman [17] proposed a decoupled PD set-point controller for underwater vehicles on the basis of previous studies [18,19]. To solve the problem of windup due to uncertain dynamics together with actuator saturation, many researchers have devoted themselves to this aspect and have achieved many results in theory and application [20][21][22][23]. Furthermore, considering that the hydrodynamics of an underwater vehicle is highly nonlinear and the model is uncertain, adaptive controllers have been proposed for controlling underwater vehicles to track desired trajectories [24][25][26]. Besides, many researchers have combined AUV controllers with other algorithms and achieved notable progress [27][28][29][30][31][32][33].
In addition, other techniques have also been used for controlling AUVs to accomplish tasks, such as sliding mode control, backstepping, and model predictive control. Sliding mode control (SMC) is one of the most efficient and robust methods for dealing with nonlinear uncertainties and external disturbances [34,35]. In the earlier literature, Young et al. proposed an SMC-based controller for robust trajectory tracking control of AUVs [36] and carried out related experiments using adaptive SMC [37]. Cristi et al. designed a decoupled controller using adaptive SMC for AUV diving control [38]. Healey and Lienard proposed multivariable sliding mode control for autonomous diving and steering of unmanned underwater vehicles [39]. Furthermore, in order to improve the performance of SMC, researchers designed a controller based on higher-order sliding mode control for diving control [40]. An adaptive robust control system was proposed by employing fuzzy logic, backstepping, and sliding mode control theory [41]. Zain and Harun proposed a nonlinear control method for stabilizing all attitudes and positions of an underactuated X4-AUV with four thrusters and six degrees-of-freedom (DOFs) according to Lyapunov stability theory and using the backstepping control method [42]. Steenson et al. developed a depth and pitch controller using the model predictive control method to manoeuvre the vehicle within the constraints of the AUV actuators [43]. Shen et al. proposed a nonlinear model predictive control scheme to control the depth of an AUV and to interact smoothly with the dynamic path planning method [44]. These studies evidence a growing need for better controllers for underwater vehicles that can complete a variety of tasks in different complex, unknown environmental conditions.
However, the traditional nonlinear controller is still significantly dependent on the model, and the performance of a model-based controller degrades seriously without precise knowledge of the nonlinearities, uncertainties, and unknown disturbances. Since it is obviously difficult to obtain an accurate dynamic model, conventional control methods can hardly ensure accurate and automatic control of the AUV [1]. In order to develop truly autonomous systems, researchers have turned their attention to artificial intelligence methods, such as using artificial neural networks in AUV control formulations [45]. Fujii developed a self-organizing neural-net-controller system as an adaptive motion control system, which can autonomously generate an appropriate controller according to evaluations of the vehicle's motion [46]. A neural network adaptive controller for diving control of an AUV was presented in [47]. Based on neural networks, Shojaei addressed a control formation for underactuated AUVs with limited torque input under environmental disturbances [48]. Many other researchers have also carried out extensive research and achieved fruitful results from different perspectives [49][50][51].
In a recent study, Zhang et al. proposed an approach-angle-based three-dimensional path-following control scheme for an underactuated AUV that experiences unknown actuator saturation and environmental disturbances [52].
The work in [53] investigates the three-dimensional target tracking control problem of underactuated AUVs by using coordinate transformation and multilayer neural networks. The authors of [54] address the problem of reachable set estimation for continuous-time Takagi-Sugeno (T-S) fuzzy systems subject to unknown output delays, and a new controller design method for AUVs based on the reachable set concept is also discussed. In [55], a neural network-(NN-) based adaptive trajectory tracking control scheme is designed for underactuated AUVs subjected to unknown asymmetrical actuator saturation and unknown dynamics. The study in [56] investigates a neural network estimator-based fault-tolerant tracking control problem for AUVs with rudder faults and ocean current disturbance. In [57], a robust neural network approximation-based output-feedback tracking controller is proposed for autonomous underwater vehicles (AUVs) in six degrees-of-freedom.
Reinforcement learning (RL) is another important method of artificial intelligence for designing control systems [58]. RL algorithms are able to learn behavior through trial-and-error interactions with a dynamic environment [59]. RL can learn a control policy directly without requiring a model [60]. Gaskett and Wettergreen developed an autonomous underwater vehicle for exploration and inspection with on-board intelligent control, which can learn to control its thrusters in response to commands and sensor inputs [61]. A hybrid coordination method was proposed for behavior-based control architectures, where the behaviors are learned online by reinforcement learning [62]. Carreras et al. presented a hybrid behavior-based scheme using reinforcement learning for high-level control of autonomous underwater vehicles [63]. In the paper of El-Fakdi, a high-level RL control system using a Direct Policy Search method is proposed for solving the action selection problem of an autonomous robot in a cable tracking task [64]. Fjerdingen analyzed the application of several reinforcement learning techniques for continuous state and action spaces to pipeline following for an AUV [65]. Wu proposed an RL algorithm that learns a state-feedback controller from sampled trajectories of the AUV for tracking desired depth trajectories. In the work of Frost et al., a behavior-based architecture using a natural actor-critic RL algorithm forms the foundation of the system, with an extra layer that uses experience to learn a policy for modulating the behaviors' weights [66]. El-Fakdi and Carreras proposed a control system based on an actor-critic algorithm for solving the action selection problem of an autonomous robot in a cable tracking task [67]. Meanwhile, the performance and application scope of RL algorithms are expanding rapidly owing to the development of deep learning [68].
Based on deep reinforcement learning (DRL), much research has been carried out with fruitful accomplishments, such as autonomous vehicle control [69][70][71]. Yu et al. proposed an underwater motion control system based on a modified deep deterministic policy gradient and showed that this algorithm is more accurate than traditional PID control in solving the trajectory tracking of an AUV [72]. Two reinforcement learning schemes, the deep deterministic policy gradient and the deep Q-network, were investigated to control the docking of an AUV onto a fixed platform in a simulation environment [73]. In the work of Carlucho et al., a deep RL framework based on an actor-critic, goal-oriented deep RL architecture is developed for controlling the AUV's thrusters directly using sensory information as input, and experiments on a real AUV demonstrate the applicability of the proposed deep RL approach [74].
Based on this research and literature review, we propose a deep RL method based on the deep deterministic policy gradient algorithm for low-level velocity control of the vectored thruster AUV. In the proposed control scheme, the input parameters are data that can be measured directly by the on-board sensors, and the outputs of the designed controller are the actions of the vectored thruster. Moreover, a reward function is developed for the deep RL controller that considers the different factors which actually affect the accuracy of AUV navigation control. To confirm the algorithm's effectiveness, a series of simulations is carried out in the designed simulation environment, which saves time and improves efficiency. The simulation results demonstrate the feasibility of the proposed deep RL method applied to AUV navigation control. Our work based on the reinforcement learning algorithm provides an alternative for AUV control problems in the face of rising technology requirements and complicated application environments. This method based on reinforcement learning significantly improves the control performance of AUVs. Furthermore, the simulation results also open up a vast range of prospects for the application of the deep RL method in complex engineering systems. The organization of this paper is as follows. In Section 2, we briefly introduce the configuration of a vectored thruster AUV, investigate the kinematics and dynamics of the AUV, and design a control system based on the PID algorithm. In Section 3, we introduce the related background of deep reinforcement learning. In Section 4, we develop our proposed controller based on deep RL. In Section 5, we carry out a series of simulations to confirm the algorithm's effectiveness. In Section 6, we conclude this paper and look toward future work.

The Vectored Thruster AUV Model and Control Problems

The tilt angles of the ducted propeller in the AUV's yaw and pitch planes are limited to ±15°. The vectored thruster AUV is able to perform missions at zero or low forward speed because the control force is provided by the vectored thruster. To achieve reliable and accurate control of the AUV, there are high demands on the design of the autonomous control system, and the kinematics and dynamics of the AUV are fundamental to designing a control system. The study of AUV modeling and control problems involves many theories and methods of statics and dynamics. Generally, the motion study of AUVs can be divided into two major parts: the kinematics analysis model and the dynamics analysis model. The kinematics analysis model of an AUV is used to study the position and orientation of its motion, while the dynamics model deals with the motion of the vehicle caused by forces and moments.
Generally, the motions of AUVs in the underwater environment involve six degrees of freedom (6 DOFs). To analyze the motion of the vectored thruster AUV in 6 DOFs concisely and efficiently, it is convenient to define two commonly used frames, namely, the earth-fixed frame and the body-fixed frame, as shown in Figure 1. These DOFs refer to the translations along and rotations about the three coordinate axes of the AUV, namely, surge, sway, heave, roll, pitch, and yaw. These motions determine the position and orientation of the AUV in the ocean corresponding to the six DOFs.
In this study, the earth-fixed frame is a global coordinate system that can be considered inertial and fixed at its origin. The body-fixed frame is a moving frame fixed to the AUV, whose origin coincides with the center of mass of the vehicle. To simplify the investigation of the vectored thruster AUV, the standard notation used to describe the motion of the AUV is defined in Table 1.
The general kinematic transformation of the AUV between the two coordinate systems can be represented as

$$\dot{\eta} = J(\eta)v,$$

where $\eta = [x\ y\ z\ \phi\ \theta\ \psi]^T$ denotes the vector of position and orientation, $J \in \mathbb{R}^{6\times6}$ represents the transformation matrix from the body-fixed frame to the earth-fixed frame, and $v = [u\ v\ w\ p\ q\ r]^T$ represents the corresponding vector of linear and angular velocities. The kinematic equations of the AUV above can be written separately for the linear and angular parameters as

$$\dot{\eta}_1 = J_1(\eta_2)v_1, \qquad \dot{\eta}_2 = J_2(\eta_2)v_2,$$

where $\eta = [\eta_1\ \eta_2]^T$, $\eta_1 = [x\ y\ z]^T$, and $\eta_2 = [\phi\ \theta\ \psi]^T$ denote the vectors of position and orientation; $v = [v_1\ v_2]^T$, $v_1 = [u\ v\ w]^T$, and $v_2 = [p\ q\ r]^T$ represent the vectors of linear and angular velocity; and $J_1(\eta_2)$, $J_2(\eta_2)$ denote the linear and angular velocity transformation matrices between the body-fixed frame and the earth-fixed frame, respectively. $J_1(\eta_2)$ and $J_2(\eta_2)$ are defined as

$$J_1(\eta_2) = \begin{bmatrix} c\psi\, c\theta & c\psi\, s\theta\, s\phi - s\psi\, c\phi & c\psi\, s\theta\, c\phi + s\psi\, s\phi \\ s\psi\, c\theta & s\psi\, s\theta\, s\phi + c\psi\, c\phi & s\psi\, s\theta\, c\phi - c\psi\, s\phi \\ -s\theta & c\theta\, s\phi & c\theta\, c\phi \end{bmatrix},$$

$$J_2(\eta_2) = \begin{bmatrix} 1 & s\phi\, t\theta & c\phi\, t\theta \\ 0 & c\phi & -s\phi \\ 0 & s\phi/c\theta & c\phi/c\theta \end{bmatrix},$$

where $s(\cdot)$, $c(\cdot)$, and $t(\cdot)$ denote $\sin(\cdot)$, $\cos(\cdot)$, and $\tan(\cdot)$, respectively. The dynamic equations of motion for the underwater vehicle are derived from the Newton-Euler equation using the principle of virtual work and D'Alembert's principle. The equations of motion of the underwater vehicle are established based on the traditional six-DOF model and can be expressed in the following form:

$$M\dot{v} + C(v)v + D(v)v + g(\eta) = \tau + \Delta\tau,$$

where $v \in \mathbb{R}^{6\times1}$ and $\dot{v} \in \mathbb{R}^{6\times1}$ denote the vectors of velocity and acceleration in the body-fixed frame. $M \in \mathbb{R}^{6\times6}$ is the inertia matrix of the vehicle, consisting of the rigid-body inertia matrix $M_{RB}$ and the added mass $M_A$. The terms $M_{RB}$ and $M_A$ are listed in (A.1) and (A.2) in the supplementary file, and the coefficients of the vehicle used in this paper are listed in Appendix B of the supplementary file. $C(v) \in \mathbb{R}^{6\times6}$ represents the Coriolis-centripetal matrix related to the Coriolis forces and the centripetal effects.
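To make the kinematic transformation concrete, the following sketch (an illustrative Python/NumPy implementation; the function names are our own and not part of the original controller) evaluates $J_1(\eta_2)$ and $J_2(\eta_2)$ and the resulting $\dot{\eta} = J(\eta)v$:

```python
import numpy as np

def j1(phi, theta, psi):
    """Linear-velocity rotation matrix J1(eta2), body frame -> earth frame."""
    s, c = np.sin, np.cos
    return np.array([
        [c(psi)*c(theta), c(psi)*s(theta)*s(phi) - s(psi)*c(phi), c(psi)*s(theta)*c(phi) + s(psi)*s(phi)],
        [s(psi)*c(theta), s(psi)*s(theta)*s(phi) + c(psi)*c(phi), s(psi)*s(theta)*c(phi) - c(psi)*s(phi)],
        [-s(theta),       c(theta)*s(phi),                        c(theta)*c(phi)],
    ])

def j2(phi, theta, psi):
    """Angular-velocity transformation J2(eta2); singular at theta = +/- pi/2."""
    s, c, t = np.sin, np.cos, np.tan
    return np.array([
        [1.0, s(phi)*t(theta),  c(phi)*t(theta)],
        [0.0, c(phi),          -s(phi)],
        [0.0, s(phi)/c(theta),  c(phi)/c(theta)],
    ])

def eta_dot(eta, v):
    """Kinematics eta_dot = J(eta) v, with eta = [x,y,z,phi,theta,psi], v = [u,v,w,p,q,r]."""
    phi, theta, psi = eta[3], eta[4], eta[5]
    d = np.zeros(6)
    d[:3] = j1(phi, theta, psi) @ np.asarray(v)[:3]
    d[3:] = j2(phi, theta, psi) @ np.asarray(v)[3:]
    return d
```

Note the well-known singularity of $J_2$ at $\theta = \pm\pi/2$, which is tolerable here because the duct deflection limits keep the pitch angle small.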
The Coriolis-centripetal matrix $C(v)$ includes the rigid-body term $C_{RB}(v)$ and the added mass term $C_A(v)$, as defined in the following equation:

$$C(v) = C_{RB}(v) + C_A(v).$$

$D(v) \in \mathbb{R}^{6\times6}$ refers to the hydrodynamic damping matrix of the vehicle, which is mainly composed of the linear damping matrix $D$ and the nonlinear damping matrix $D_n(v)$. Hence, the hydrodynamic damping matrix of the underwater vehicle can be described by

$$D(v) = D + D_n(v),$$

where $D$ denotes the damping matrix due to linear skin friction and $D_n(v)$ represents the nonlinear damping matrix generated mainly from potential damping, wave drift damping, damping due to vortex shedding, and lift forces. $g(\eta)$ is the vector of restoring forces and moments produced by the gravity and buoyancy of the vehicle.
$\tau \in \mathbb{R}^{6\times1}$ defines the resultant vector of forces and moments applied to the vehicle in the body-fixed frame. $\Delta\tau \in \mathbb{R}^{6\times1}$ represents the vector of forces and moments produced by environmental disturbances, including ocean currents and waves.
In general, the thrust $T_p$ is generated by a propeller mounted at the stern of the AUV, and the direction of thrust is collinear with the cylindrical axis of the vehicle's hull. Hence, for a conventional AUV the applied forces and moments $\tau$ on the vehicle can be expressed as

$$\tau = \begin{bmatrix} T_p & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^T.$$

On the other hand, the direction of the thrust $T_p$ provided by the designed vectored thruster can be adjusted according to the control needs of the AUV. The resultant vector $\tau$ of the applied forces and moments acting on the AUV can then be expressed as

$$\tau = \begin{bmatrix} F \\ M \end{bmatrix} = \begin{bmatrix} F_x & F_y & F_z & M_x & M_y & M_z \end{bmatrix}^T.$$

Compared with a conventional AUV, the control differs because the direction of thrust is governed by the thrust-vectoring mechanism. The deflection angle of the duct is a combination of the rudder angle α and the elevator angle β in the body-fixed frame, as presented in Figure 2.
The vector of thrust applied to the AUV along the body-fixed axes is defined as

$$F = \begin{bmatrix} F_x & F_y & F_z \end{bmatrix}^T.$$

Besides, since the thrust $T_p$ of this vectored thruster AUV is provided by the propeller, according to standard propeller theory the thrust can be described as

$$T_p = K_T \rho n_p^2 D^4,$$

where ρ represents the density of water and $K_T$, $n_p$, and $D$ denote the thrust coefficient, rotation speed, and diameter of the propeller, respectively. Referring to the definition of the duct deflection angles α and β in Figure 2, the vector of thrust $F$ can be calculated by

$$F = T_p \begin{bmatrix} \cos\alpha\cos\beta & \sin\alpha\cos\beta & \sin\beta \end{bmatrix}^T.$$

In order to study the relation between the thrust vector $F$ and the duct deflection angles α and β, the unit vector δ is defined as

$$\delta = \frac{F}{T_p}.$$

Figure 3 shows the 3D graph of the factor δ over the tilt angles $-\pi/12 \le \alpha \le \pi/12$ and $-\pi/12 \le \beta \le \pi/12$. The linear motion of the vehicle is controlled by the thrust vector $F$ through adjustment of the thrust $T_p$ and the duct deflection angles α and β. Due to the particularity of the vectored thruster AUV, the vehicle's pitch and yaw motions are controlled by moments produced by the components of the thrust vector $F$. The moments $M$ acting on the AUV are generated when the thrust $T_p$ is not coincident with the axis of the vehicle's hull. Referring to Figure 2, the moment $M$ due to the thrust $F$ acting about the center of mass can be expressed as

$$M = r_p \times F,$$

where $r_p = [x_p\ y_p\ z_p]^T$ denotes the position vector from the point of action of the thrust $F$ to the vehicle's center of gravity. The pitch and yaw motions of this AUV are controlled by the moment vector $M$. Because the moment $M$ is determined by the thrust $T_p$ and the duct deflection angles α and β, and the value of $T_p$ depends only on the rotation speed of the propeller $n_p$, the value of the moment $M$ is independent of the forward speed and attitude of the AUV.
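The thrust model above can be sketched numerically as follows. This is an illustrative Python/NumPy fragment: the axis and sign conventions chosen for the duct angles are assumptions for the example, not taken from the original design:

```python
import numpy as np

def thrust_magnitude(k_t, rho, n_p, d):
    """Standard propeller model: T_p = K_T * rho * n_p**2 * D**4."""
    return k_t * rho * n_p**2 * d**4

def thrust_vector(t_p, alpha, beta):
    """Body-frame thrust for rudder angle alpha and elevator angle beta.
    The axis/sign convention here is an assumption for illustration."""
    return t_p * np.array([np.cos(alpha) * np.cos(beta),
                           np.sin(alpha) * np.cos(beta),
                           np.sin(beta)])

def thrust_moment(f, r_p):
    """Moment about the center of mass: M = r_p x F."""
    return np.cross(r_p, f)
```

At zero deflection ($\alpha = \beta = 0$) the thrust reduces to the conventional axial case and the moment vanishes for a thruster mounted on the hull axis.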
Due to its 6 motional DOFs, the highly nonlinear dynamics model of an AUV poses a big challenge for controller design. In order to fully realize the designed functions of an underwater vehicle, the control system plays an important role in the AUV design process [75]. In general, the overall control process of the AUV can be represented as shown in Figure 1. In the control process, the controller design is essential for manipulating the AUV. In our vectored thruster AUV, the control system consists of three control loops that govern its surge, pitch, and yaw motions. The inputs of the AUV controller are velocity errors, and the outputs from the controller are the control actions provided by the vectored thruster. In the surge control loop, the input to the controller is the linear velocity u and the output is the thrust $T_p$, referring to Figure 1. Similarly, the input and output of the pitch controller are the angular velocity q and the elevator angle β; the input and output of the yaw controller are the angular velocity r and the rudder angle α.

Figure 2: Thrust decomposition of the vectored thruster AUV.

In order to meet the demands of real applications, controllers for AUVs are
usually designed based on the proportional-integral-derivative (PID) algorithm. The PID control law can be expressed as

$$u(t) = k_p e(t) + k_i \int_0^t e(\tau)\,\mathrm{d}\tau + k_d \frac{\mathrm{d}e(t)}{\mathrm{d}t},$$

where $k_p$, $k_i$, and $k_d$ are the proportional, integral, and differential gains, respectively, and $e(t)$ is the error between the desired set point and the measured process variable. Based on the control process in Figure 4 and the PID law above, a controller is designed for the vectored thruster AUV. To verify the control performance of this system, a series of simulations is carried out according to the analysis of the AUV presented above. Before the simulation analysis, the reference velocity is defined as $S_r(t) = (v_x^r, \omega_y^r, \omega_z^r)$, the output thrust of the AUV is limited to $0 \le T_p \le 10\,\mathrm{N}$, and $\alpha, \beta \in [-\pi/12, \pi/12]$ ($[-15^\circ, 15^\circ]$). When the reference velocities are $v_x^r = 1\,\mathrm{m/s}$, $\omega_y^r = 0$, and $\omega_z^r = 0$, the simulation result is shown in Figure 5. The results in Figure 5 show that the proposed PID-based controller is practicable and effective for this reference velocity. When the reference velocities are $v_x^r = 1\,\mathrm{m/s}$, $\omega_y^r = 0.1\,\mathrm{rad/s}$, and $\omega_z^r = 0$, the simulation result is shown in Figure 6.
As shown in Figure 6, the linear velocity reaches the reference, $v_x = v_x^r$, in a short time, but the angular velocity $\omega_y$ differs from the reference velocity $\omega_y^r$. Analysis shows that the thrust $T_p$ is inadequate to achieve the reference angular velocity. As described earlier, the thrust $T_p$ is determined by the surge controller, and hence $T_p$ does not increase even though the angular velocity $\omega_y$ has not reached the reference $\omega_y^r$. When a new requirement is introduced, such as the reference velocities $\omega_y^r = 0.1\,\mathrm{rad/s}$ and $\omega_z^r = 0$ with the setting of $v_x^r$ neglected, this AUV controller becomes difficult to implement. In order to solve this problem, the thrust $T_p$ of the AUV is set to a high value $T_{p\text{-}set}$ in advance. When $T_{p\text{-}set} = 6\,\mathrm{N}$, the simulation results are shown in Figure 7.
The simulation results and analysis above reveal the inadequacies of the designed PID-based AUV controller. In order to improve performance and reduce energy consumption, it is essential to find a new method of designing the AUV controller.
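For reference, the PID law used in the controller above can be sketched in discrete time as follows; the class name, gains, and saturation limits (e.g. $0 \le T_p \le 10$ N for the surge loop) are illustrative:

```python
class PID:
    """Discrete PID: u = kp*e + ki*integral(e) + kd*de/dt, with optional saturation."""
    def __init__(self, kp, ki, kd, u_min=None, u_max=None):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        if self.u_max is not None:
            u = min(u, self.u_max)   # e.g. thrust cap T_p <= 10 N
        if self.u_min is not None:
            u = max(u, self.u_min)   # e.g. T_p >= 0
        return u
```

One such controller per loop (surge, pitch, yaw) reproduces the three-loop structure described above; the saturation clamp is also the mechanism through which the windup and thrust-allocation problems discussed in the simulations arise.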

Reinforcement Learning Statement.
Reinforcement learning (RL) is a branch of machine learning that studies how an agent optimizes its behavior for a task by interacting with the environment. The environment produces a new state in response to the action executed in some state. At the same time, the agent receives a reward value from the environment, which can be seen as an index for evaluating the quality of the action. A series of data is generated by the agent and the environment through continuing loop iterations. The basic principle of reinforcement learning is presented in Figure 8. The environment for agent training in RL can be described as a Markov Decision Process (MDP), where the environment is assumed to be fully observable. An MDP can be defined as a 5-tuple $(S, A, P_{sa}, \gamma, R)$, where $S \in \mathbb{R}^d$ is the d-dimensional state space, $A$ defines the action space, $P_{sa}$ is the probability of transition to state $s'$ by taking an action $a$ in state $s$, $\gamma \in (0, 1]$ denotes the discount factor for future rewards, and $R: S \times A \rightarrow \mathbb{R}$ is the function expressing the reward for taking an action in a particular state. The policy function π represents a mapping from states to actions and denotes a mechanism for choosing action $a$ in the current state $s$. The goal of the agent is to maximize the total amount of reward it receives [76,77]. When a policy π is given, the discounted sum of immediate rewards is defined as the return $R_t$:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.$$

The purpose of the reinforcement learning method is to find the optimal policy $\pi^*$, which maximizes the return $R_t$ obtained by following the policy. The optimal policy $\pi^*$ satisfies

$$\pi^* = \arg\max_{\pi} J^{\pi},$$

where the performance objective $J^{\pi} = \mathbb{E}[R_t \mid \pi]$ denotes the expected total reward under the policy π and γ is the discount factor. The state-value function is defined as the expected value of the cumulative discounted rewards from the state $s$ under the policy:

$$V^{\pi}(s) = \mathbb{E}\left[R_t \mid s_t = s;\ \pi\right].$$
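The return $R_t$ defined above can be computed from a finite reward sequence with a single backward pass; a minimal Python sketch:

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_k gamma**k * r_{t+k}, computed by one backward pass."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # Bellman-style recursion: R_t = r_t + gamma * R_{t+1}
    return g
```

The backward recursion is the same identity that underlies the Bellman equations used below.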
Similar to the state-value function, the action-value function, also known as the Q function, can be defined as

$$Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s,\ a_t = a;\ \pi\right].$$

The state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s, a)$ satisfy the Bellman equation.
When the agent follows the optimal policy $\pi^*$, the optimal state-value function $V^*(s)$ and action-value function $Q^*(s, a)$ achieve the highest return. The optimal functions satisfy the following Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}\left[r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\Big|\, s_t = s,\ a_t = a\right].$$

The purpose of RL problems is to learn an optimal policy $\pi^*$. A greedy policy π is derived from $Q(s, a)$ by choosing, in each state, the action with the highest return. Once $Q^*$ is obtained through interactions, the optimal policy $\pi^*$ can be obtained directly by

$$\pi^*(s) = \arg\max_{a} Q^*(s, a).$$
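As an illustration of how a greedy policy is extracted from $Q(s, a)$, the following sketch implements tabular Q-learning with an ε-greedy behavior policy on a generic discrete environment; the `env_reset`/`env_step` interface is an assumption made for this example:

```python
import random
from collections import defaultdict

def q_learning(env_step, env_reset, actions, episodes=200, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    env_reset() -> s0; env_step(s, a) -> (s', reward, done). States must be hashable."""
    q = defaultdict(float)
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q[(s, a_)])
            s2, r, done = env_step(s, a)
            target = r if done else r + gamma * max(q[(s2, a_)] for a_ in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    # greedy policy pi*(s) = argmax_a Q(s, a), derived from the learned table
    states = {s_ for (s_, _) in q}
    policy = {s_: max(actions, key=lambda a_: q[(s_, a_)]) for s_ in states}
    return q, policy
```

The final dictionary comprehension is exactly the greedy extraction $\pi^*(s) = \arg\max_a Q^*(s, a)$ described above; it is also the step that fails in a continuous action space, motivating the policy gradient methods of the next subsection.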

Reinforcement Learning in Continuous Domain.
Existing RL algorithms mainly consist of value-based and policy-based methods. The first proposed value-based method is Q-learning, which has become one of the most popular and widely used RL algorithms. In use, Q-learning needs to calculate the Q-value of each state-action pair and store it in a table. Precisely because it looks up this table in each iterative calculation, this value-based algorithm is suitable for applications where the state and action spaces are discrete and the dimension is not too high. In order to resolve the problem of the state and action spaces being too large, function approximation of the value function was proposed. As research deepened, deep neural networks were used to develop a novel artificial agent, named the deep Q-network (DQN), which can learn successful policies from high-dimensional states [78]. Due to the use of deep neural
networks, this kind of value-based algorithm has been successfully applied to all sorts of games and has achieved good results. With deeper research into RL theory and the extensive application of DQN, various variants have naturally emerged, such as Double DQN, Dueling DQN, and Rainbow [79]. However, while they can resolve problems with high-dimensional state spaces, value-based methods can only tackle applications with discrete actions and fail in continuous action spaces. This kind of RL algorithm cannot be applied to the continuous domain directly because it depends on looking up the action that maximizes the action-value function, which requires completing a process of iterative optimization at every step. Besides, if the discretization of the state and action spaces is too coarse, the results become unacceptable; if it is too fine, the problem becomes intractable. Hence, it may be impracticable to apply this value-based method to a continuous control domain, such as our control of the vectored thruster AUV. Another important RL algorithm, the Policy Gradient (PG) method, has a wide range of applications in continuous behavior control. PG methods perform gradient ascent on the policy objective function $J(\theta)$ with respect to the parameters θ of the policy $\pi_\theta$. The policy objective function $J(\theta)$ can be defined as

$$J(\theta) = \int_S \rho^{\pi}(s) \int_A \pi_{\theta}(a \mid s)\, R(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\left[R(s, a)\right],$$

where $\pi_\theta$ is a stochastic policy and $\rho^{\pi}(s)$ is the state distribution. The basic idea behind PG methods is to adjust the parameters θ of the policy in the direction of the performance gradient $\nabla_\theta J(\theta)$. The corresponding policy gradient theorem for the objective $J(\theta)$ is

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right].$$

The policy-based methods then update the parameters θ as follows:

$$\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta),$$

where α is the learning rate and $\nabla_\theta J(\theta)$ denotes the stochastic expectation approximation of the gradient of the objective performance.
The equation above shows that the gradient is an expectation over possible states and actions. Rather than approximating a value function, PG methods approximate a stochastic policy using an independent function approximator with its own parameters that maximize the future expected reward. The main advantage of the PG method over value-based methods is the use of an approximator to represent the policy directly. In the process of PG learning, the probability distributions over states and actions must be considered simultaneously. Hence, the PG method integrates over both the state and action spaces during the training process.
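A minimal numerical illustration of the stochastic update $\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a)\, R$ is the following REINFORCE-style sketch on a one-step (bandit) problem with a softmax policy; the problem setup and reward values are purely illustrative:

```python
import numpy as np

def reinforce_bandit(reward_fn, n_actions, iters=2000, alpha=0.05, seed=0):
    """REINFORCE on a one-step task: stochastic ascent of J(theta) using
    grad_theta log pi_theta(a) * R, with a softmax policy over discrete actions."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_actions)                  # policy parameters
    for _ in range(iters):
        p = np.exp(theta - theta.max())
        p /= p.sum()                             # softmax pi_theta
        a = rng.choice(n_actions, p=p)
        r = reward_fn(a)
        grad_log = -p.copy()
        grad_log[a] += 1.0                       # grad of log pi_theta(a) for softmax
        theta = theta + alpha * r * grad_log     # gradient ascent step
    return theta

# illustrative reward table favoring action 2
theta = reinforce_bandit(lambda a: [0.0, 0.2, 1.0][a], n_actions=3)
```

After training, the probability mass of the softmax policy concentrates on the highest-reward action, which is the behavior the policy gradient theorem predicts.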
There can be no doubt that this consumes a large amount of computing resources for high-dimensional state and action spaces [80]. To address this drawback, deterministic policy gradient algorithms for reinforcement learning have been presented [81]. Because the map from state space to action space is fixed in the deterministic policy gradient, there is no need to integrate over the action space. Consequently, the deterministic policy gradient needs far fewer samples to train than the stochastic policy gradient. This means that the deterministic policy gradient can be estimated much more efficiently than the stochastic version. With a deterministic policy $\mu_\theta: S \rightarrow A$ with parameters θ and a discounted state distribution $\rho^{\mu}(s)$, the performance objective can be defined as an expectation:

$$J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\mu}}\left[R\left(s, \mu_{\theta}(s)\right)\right].$$

The gradient for the deterministic policy is

$$\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}\right].$$

In order to explore the environment fully, a stochastic policy is often necessary. To ensure adequate exploration for the deterministic policy gradient algorithm, an off-policy actor-critic learning algorithm was subsequently proposed. Actor-critic algorithms consist of two components, an actor and a critic. The actor and critic are two different networks with different roles. The critic estimates the value function, which could be the action value (the Q-value) or the state value (the V-value). The actor updates the policy distribution in the direction suggested by the critic (such as with policy gradients). The actor is a policy network that produces actions through exploration of the space, while the critic is a value function that evaluates the actions made by the actor [82]. The critic network is updated by temporal-difference learning, and the actor network is updated by the policy gradient. The performance objective is defined over the state distribution of the behavior policy:

$$J_{\beta}(\mu_{\theta}) = \int_S \rho^{\beta}(s)\, Q^{\mu}\left(s, \mu_{\theta}(s)\right)\, \mathrm{d}s,$$

where $\rho^{\beta}(s)$ is the stationary distribution of the behavior policy β and $Q^{\mu}$ is the action-value function.
Then, the off-policy deterministic policy gradient can be presented as

$$\nabla_{\theta} J_{\beta}(\mu_{\theta}) \approx \mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}\right].$$

Given the policy gradient direction, the update process of the off-policy actor-critic DPG can be presented as

$$\delta_t = r_t + \gamma Q^{w}\left(s_{t+1}, \mu_{\theta}(s_{t+1})\right) - Q^{w}(s_t, a_t),$$
$$w_{t+1} = w_t + \alpha_w \delta_t \nabla_{w} Q^{w}(s_t, a_t),$$
$$\theta_{t+1} = \theta_t + \alpha_{\theta} \nabla_{\theta} \mu_{\theta}(s_t)\, \nabla_{a} Q^{w}(s_t, a_t)\big|_{a = \mu_{\theta}(s_t)}.$$

The advantage of the actor-critic algorithm is its ability to perform single-step updates, which makes it more efficient. However, the performance of the actor-critic algorithm is decided by the critic's value judgment, and convergence is very difficult to achieve, particularly when the actor also needs to update its parameters. To overcome the problems mentioned above, the Deep Deterministic Policy Gradient (DDPG) has been presented.
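One step of the off-policy actor-critic updates above can be sketched with linear function approximators; the feature map, linear forms, and learning rates below are illustrative assumptions for the example, not the networks used later in the paper:

```python
import numpy as np

def dpg_update(s, a, r, s2, w, theta, gamma=0.99, a_w=0.01, a_th=0.001):
    """One off-policy deterministic actor-critic step with linear approximators:
    critic Q_w(s, a) = w . phi(s, a), actor mu_theta(s) = theta . s (both toy choices)."""
    phi = lambda s_, a_: np.append(s_, a_)          # toy state-action features
    mu = lambda s_: float(theta @ s_)               # deterministic policy
    q = lambda w_, s_, a_: float(w_ @ phi(s_, a_))  # linear critic
    # TD error: delta = r + gamma * Q(s', mu(s')) - Q(s, a)
    delta = r + gamma * q(w, s2, mu(s2)) - q(w, s, a)
    w = w + a_w * delta * phi(s, a)                 # critic update (TD learning)
    # actor update: theta += a_th * grad_theta mu(s) * dQ/da |_{a = mu(s)}
    grad_a_q = w[-1]                                # dQ/da for this linear critic
    theta = theta + a_th * grad_a_q * s             # grad_theta mu(s) = s
    return w, theta
```

The structure mirrors the three equations above directly: the TD error drives the critic weights `w`, and the chain rule through the critic drives the actor parameters `theta`. DDPG replaces both linear approximators with deep networks and adds target networks and a replay buffer for stability.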

DDPG in Continuous Domain.
Deep reinforcement learning combines deep neural networks with reinforcement learning. This structure makes it possible to learn control policies directly from high-dimensional state-action spaces. Due to its excellent performance over a wide range of applications, the deep neural network (DNN), an artificial neural network (ANN) with several layers between the input and output layers, has become a very popular research topic in machine learning. Thanks to huge successes in a variety of fields, such as medical imaging analysis, artificial neural networks have attracted great interest in deep learning. With the development of these network structures, they have been applied in different areas, such as solving engineering control problems. The basic unit of a neural network is the neuron, a mathematical function that models the behavior of a biological neuron within an artificial neural network. Because it consists of a large number of layers with many neurons in each layer, a DNN can approximate the mathematical operation that converts the inputs to the outputs, whether the relationship is linear or nonlinear. Layers in which each neuron is connected to all neurons in the next layer are called fully connected layers. In this network form, each connection has a weight w and a bias b, and the neuron applies an activation function σ to the weighted sum z = σ(b + Σ_i w_i x_i), where x is the input vector. The learning process of a neural network continuously adjusts the relevant parameters, mainly the weights and biases of the network, to reduce the error between the real and predicted results. Combining deep neural networks with reinforcement learning algorithms, deep reinforcement learning (DRL) can resolve previously unsolved problems [83]. In DRL, artificial neural networks serve as universal function approximators to estimate value functions or policies.
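The weighted-sum formula z = σ(b + Σ_i w_i x_i) for a single neuron can be sketched directly; the weights, bias, and ReLU activation below are illustrative values only:

```python
import numpy as np

# A single artificial neuron: z = sigma(b + sum_i w_i * x_i),
# here with a ReLU activation (all values are illustrative).
def neuron(x, w, b, sigma=lambda z: np.maximum(z, 0.0)):
    return sigma(b + np.dot(w, x))

x = np.array([1.0, -2.0, 0.5])   # input vector
w = np.array([0.4, 0.3, -0.2])   # connection weights
out = neuron(x, w, b=0.1)        # weighted sum is -0.2, so ReLU gives 0.0
```

Stacking many such neurons per layer and many layers gives the fully connected DNN described above.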

Control Based on DDPG for Vectored Thruster AUV
Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy actor-critic algorithm using deep function approximators that can solve problems with high-dimensional, continuous action spaces. Because it builds on the concept of DQN, DDPG also uses deep neural networks as function approximators, which makes it feasible in complex action-space applications [84].
DDPG contains two independent networks, an actor network and a critic network. With the parameter θ^μ, the actor network represents the deterministic policy a = μ(s|θ^μ), which is used to update the policy corresponding to the actor in the actor-critic framework. The critic network with parameter θ^Q is used to estimate the action-value function Q(s, a|θ^Q) of the state-action pair and to calculate the gradient parameters. To achieve stable and robust learning, DDPG adopts experience replay and target networks, like DQN. Experience replay is a key technology behind many of the latest advances in deep reinforcement learning [85]. It is applied to avoid situations where the training samples are not independent and identically distributed: during training, samples are generated by sequential exploration of the environment. To break this data correlation, experience replay stores transitions (s_t, a_t, r_t, s_{t+1}), which are historical data samples from the environment, in a replay buffer. The replay buffer is continually updated by replacing old samples with new ones when the buffer is full. During training, the actor and critic are trained with minibatches sampled randomly from the replay buffer. The main effect of experience replay is to overcome the problems of correlated data and nonstationary empirical data distributions.
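The capacity-bounded replay buffer described above can be sketched in a few lines; the class and method names here are illustrative, not the paper's code:

```python
import collections
import random

# Minimal replay buffer sketch: bounded storage plus uniform random
# minibatch sampling to break temporal correlation between samples.
class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition automatically
        self.buffer = collections.deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random minibatch over stored transitions
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):                 # store 5 transitions into capacity 3
    buf.store(t, t, 0.0, t + 1)    # the two oldest are evicted
batch = buf.sample(2)
```

With capacity 3 and 5 stored transitions, only the latest 3 remain, matching the replace-old-with-new behavior described in the text.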
This random sampling method greatly increases the utilization of samples and improves the stability of the algorithm [85].
In order to optimize the critic's value-function network, a loss function based on the mean squared error is used for backpropagation. In DDPG, the parameters of the critic network are updated by minimizing the loss function L defined as
L(θ^Q) = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²,
where y_i is the target value generated by the target networks θ^{Q′} and θ^{μ′} and can be defined as
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′}).
Then, the gradient of the loss function L is
∇_{θ^Q} L = (2/N) Σ_i (Q(s_i, a_i | θ^Q) − y_i) ∇_{θ^Q} Q(s_i, a_i | θ^Q).
The actor policy function with network parameter θ^μ is updated using the critic network parameter θ^Q to optimize the expected return. The objective function J(θ^μ) is the expected return under the policy μ(s | θ^μ), which can be defined as
J(θ^μ) = E_{s∼ρ}[Q(s, μ(s | θ^μ) | θ^Q)].
The policy gradient is obtained as the deterministic policy gradient with respect to the network parameter θ^μ, with the deterministic strategy a = μ(s | θ^μ). The gradient of the objective with respect to θ^μ can be expressed as
∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}.
Hence, the parameters of the actor's online policy network in DDPG can be updated using this sampled policy gradient. In order to avoid divergence of the algorithm, separate target networks are created as copies of the original actor and critic networks. In the DDPG algorithm, two target networks Q′(s, a | θ^{Q′}) and μ′(s | θ^{μ′}) are created for the main critic and actor networks, respectively. The two target networks have the same architecture as the main networks but different parameters θ′.
To improve the stability of learning, the "soft" update method illustrated by Mnih et al. is used to update the parameters. The weights of the target networks are constrained to change slowly by tracking the main networks: θ′ = τθ + (1 − τ)θ′, where τ ≪ 1. With this update, the algorithm improves the stability of the network significantly. In addition, because the actor policy is deterministic, noise sampled from a noise process N needs to be added to the actions to improve the efficiency of exploration.
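The soft update θ′ = τθ + (1 − τ)θ′ is a one-line interpolation per parameter tensor; the toy weight vectors below are illustrative:

```python
import numpy as np

# "Soft" target-network update: theta' <- tau*theta + (1 - tau)*theta'.
# With tau << 1, the target weights track the main weights slowly.
def soft_update(theta_main, theta_target, tau=0.001):
    return tau * theta_main + (1.0 - tau) * theta_target

theta = np.array([1.0, 2.0])          # main network weights
theta_t = np.array([0.0, 0.0])        # target network weights
theta_t = soft_update(theta, theta_t) # moves 0.1% of the way toward theta
```

In a full implementation the same interpolation is applied to every weight and bias tensor of both target networks after each training step.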
According to the above, the vectored thruster has one thruster and two deflection angles to be controlled. Hence, the control system needs to produce continuous control outputs for these three quantities to achieve the desired reference dynamic state, such as the velocities of the AUV. According to the dynamics of the AUV, the designed control system must be able to accomplish nonlinear continuous control of the vectored thruster AUV in a complicated and changeable underwater environment. To solve this continuous control problem, an adaptive control system for the vectored thruster AUV based on the DDPG algorithm and the study of the AUV is proposed in this work. The aim of this study is to develop a new control algorithm with the ability to handle the vectored thruster AUV under different operating conditions. In our study of the AUV, the architecture of the control system based on the DDPG algorithm is illustrated in Figure 9. As can be seen, the designed control architecture can be divided into four components: the AUV reference generator unit, the reward function unit, the DDPG controller unit, and the AUV environment unit.
As shown in Figure 9, the control architecture has an important part named the AUV reference speed generator, which generates a set of control data points for training the AUV more effectively. With the reference generator, the designed controller obtains the operating instructions s_t^r needed to manipulate the AUV, so the controller can deal with different setting conditions. Another important piece of dynamic information is the measurement data s_t^m generated by the sensor system of the AUV. The measured data s_t^m are merged with s_t^r to produce an instantaneous error vector e_t, which represents the difference between the set parameters and the actual measurements and provides instantaneous information to the merged state s_t. The reward function unit is the main indicator for evaluating the advantages and disadvantages of the algorithm. The input to the reward function model is the instantaneous error vector e_t. In this way, the immediate reward r_t is defined by the reference state s_t^r and the error vector e_t to evaluate the current action and the feedback error. The AUV controller receives the information summarized in the system state s_t of the AUV and the immediate reward r_t of the current state. From the inputs s_t and r_t, the AUV controller produces the action a_t for the AUV simulation environment through extensive learning and iterative computation. In practical applications, the state information can be measured by the sensor system of the AUV, such as the DVL and IMU. In our work, the AUV simulation environment is established based on the study described in Section 2; therefore, the state information can be obtained directly from the kinematic and dynamic analysis of the AUV.
This simulation method ensures development accuracy, increases productivity, and shortens development cycles. In addition, state information based on the simulation environment is especially useful at the early stage of developing the DDPG-based AUV controller. This method avoids the extraordinary effort and great cost in experiments and computation otherwise needed to complete the AUV controller training process.
Based on the presented DDPG algorithm architecture, the AUV controller for low-level control of the vectored thruster AUV is developed. To represent the control system more clearly, an algorithmic representation is developed to make the code more readable. Therefore, the algorithm for vectored thruster AUV control is summarized in the pseudocode of Algorithm 1, and the algorithm workflow is shown in Figure 10.
In line 1 of Algorithm 1, the input parameters are the maximum number of training episodes M, the number of iterations per episode T, the minibatch size N, the soft update factor for the target networks τ, the minimum and maximum sizes of the replay buffer R, and the discount factor γ. Since Algorithm 1 needs to learn a continuous control problem, the actor and critic networks are initialized randomly, and the target networks are initialized with the same parameters, θ^{Q′} = θ^Q and θ^{μ′} = θ^μ, as shown in lines 2 and 3. In addition, the replay buffer R is initialized at the beginning (line 4). As a result of the analysis, the control problem of the vectored thruster AUV can be seen as a continuous control task. In Algorithm 1, each episode of the learning process contains a loop with a fixed number of steps T. In training, the algorithm process, shown in lines 5 to 28 of Algorithm 1, is carried out as a loop with a maximum number of cycles M. In line 6, the random process N, which refers to the Ornstein-Uhlenbeck stochastic process, is initialized for action exploration. In lines 7 and 8, the AUV simulation environment is initialized with the setting parameters at time 0 s, and the observation state s_1 is obtained directly from the AUV simulation environment. In the inner loop from line 9 to line 27, the core part of our algorithm is performed to control the AUV. In this algorithm, a fixed sample time dt is set for each step of the inner loop, taking the practical application of the real AUV system and efficiency into consideration. Thus, the training processes develop over time to meet the control system's needs as the number of cycles increases. When setting this parameter, the time interval dt must be chosen effectively and reasonably. The control action a_t is obtained when the state s_t is given, because the actor policy is deterministic (line 10).
The action a_t is immediately sent to the AUV simulation environment to complete the corresponding motion control (line 11).
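The Ornstein-Uhlenbeck process used for action exploration in line 6 can be sketched as follows; the parameters θ, σ, and dt below are common illustrative choices, not necessarily the paper's values:

```python
import numpy as np

# Ornstein-Uhlenbeck exploration noise: temporally correlated noise
# added to the deterministic actor output (parameters are illustrative).
class OUNoise:
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.1, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = theta*(mu - x)*dt + sigma*sqrt(dt)*dW  (Euler step)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt)
              * self.rng.standard_normal(self.x.size))
        self.x = self.x + dx
        return self.x

noise = OUNoise(dim=3)   # one noise channel per control output
n = noise.sample()
```

The mean-reverting term θ(μ − x) pulls the noise back toward zero, so successive samples are correlated, which suits exploration of inertial systems such as an AUV.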
In order to improve the efficiency and reliability of the training process, the experience replay buffer R is used to update the actor and critic networks during training. To ensure the normal operation of experience replay, the buffer R must hold a sufficient number of transitions m_min before training the networks (line 12). When this condition is fulfilled, a random minibatch of N transitions is sampled from the buffer R (lines 13 and 14), and then the target state-action value y_i can be calculated, where Q′(· | θ^{Q′}) and μ′(· | θ^{μ′}) are given by the target Q network and the target policy network, respectively. The critic network is updated by minimizing the loss function L, yielding the critic network parameter θ^Q (line 15). In line 16, the actor network is updated using the sampled policy gradient ∇_{θ^μ}J, yielding the actor network parameter θ^μ. Using the network parameters θ^Q and θ^μ calculated in lines 14 to 16, the critic target Q network and the actor target policy network are updated in lines 17 and 18. In addition, when the size of the replay buffer R reaches its maximum, the earliest stored experience is removed to improve efficiency and reduce costs (lines 20-22). The AUV receives the action a_t, and then the new state s_{t+1} is obtained directly from the sensor system of the AUV in a real application, while in simulation the state information is calculated by the simulation environment (line 23).
Then, the reward function calculates the immediate reward r_t to evaluate the effect of the action (line 24). Subsequently, combining the data obtained above, the transition (s_t, a_t, r_t, s_{t+1}) is stored in R for training the networks. At the end of the algorithm, the critic network represented by Q(s, a | θ^Q) and the actor policy network represented by μ(s | θ^μ) are output.
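The inner-loop control flow just described (act, store the transition, then train from random minibatches once the buffer holds at least m_min samples) can be sketched as a runnable skeleton. The environment dynamics, reward, and network updates below are stubs for illustration only, not the paper's AUV simulator:

```python
import collections
import random
import numpy as np

# Runnable skeleton of Algorithm 1's inner loop with stubbed dynamics.
random.seed(0)
rng = np.random.default_rng(0)

def run_episode(T=50, N=8, m_min=10, capacity=1000):
    buffer = collections.deque(maxlen=capacity)   # replay buffer R
    s, updates = 0.0, 0
    for t in range(T):
        a = np.tanh(0.3 * s) + 0.1 * rng.standard_normal()  # policy + noise
        s_next = 0.5 * s + a                                # stub dynamics
        r = -abs(s_next - 1.0)                              # stub reward
        buffer.append((s, a, r, s_next))                    # store transition
        if len(buffer) >= m_min:           # enough stored samples (line 12)
            batch = random.sample(list(buffer), N)  # minibatch (lines 13-14)
            updates += 1     # the critic/actor/target updates would go here
        s = s_next
    return len(buffer), updates

stored, updates = run_episode()
```

With T = 50 and m_min = 10, the first update happens at step 10, so 41 network-update passes occur per episode in this sketch.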

Simulation Results
In our designed control system, the simulation environment is used to reproduce the real underwater physics of the AUV. During the simulations, the sampling time used in Algorithm 1 is set to dt = 0.1 s in consideration of the actual application. In this vectored thruster AUV, the control commands are applied to three quantities: the propulsive force T_p, the rudder angle α, and the elevator angle β, representing the deflection angles of the duct. The commands received by the AUV can be defined as a vector a_t = (a_t^1, a_t^2, a_t^3), where a_t^1, a_t^2, a_t^3 are the force T_p, the rudder angle α, and the elevator angle β, respectively. Those commands are generated by the actor policy network of the designed controller.
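Since the actor's Tanh output layer produces values in [−1, 1], each channel must be scaled to its actuator range. The thrust limit below is a hypothetical placeholder; the 15° duct-angle limit is the only bound stated in the text:

```python
import numpy as np

# Mapping normalized actor outputs in [-1, 1] to the three commands
# a_t = (T_p, alpha, beta). A_MAX values are illustrative assumptions
# except the 15-degree duct deflection limit mentioned in the text.
A_MAX = np.array([100.0, np.deg2rad(15), np.deg2rad(15)])  # T_p (N), alpha, beta (rad)

def to_commands(actor_output):
    u = np.clip(actor_output, -1.0, 1.0)   # guard against out-of-range values
    return u * A_MAX                       # scale each channel to its limit

a_t = to_commands(np.array([0.5, -1.0, 0.2]))  # e.g. half thrust, full left rudder
```

This keeps the learned policy's action space normalized while the physical limits live in one place.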
In this control algorithm, the state s_t in the Markov process represents the current state of the vectored thruster AUV in the underwater environment. In our AUV simulation environment, the state parameters, which can be expressed as s_t = (v_t, ω_t, v̇_t, ω̇_t, e_t), are defined by the instantaneous measurements from the sensors of the AUV. The terms v_t = (v_x, v_y, v_z) and ω_t = (ω_x, ω_y, ω_z) are the linear and angular velocities, which can be measured by the DVL and IMU.
The terms v̇_t and ω̇_t are the corresponding linear and angular accelerations. e_t is the velocity error between the real measured velocity and the set reference velocity at time t. The ultimate goal of this controller is to minimize the deviations of the measured variables from the reference settings while minimizing the use of the vectored thruster to reduce energy consumption. In addition, large fluctuations of the controlled dynamic variables of the AUV are undesirable, as they would make the controller difficult to use in practice. To accomplish this purpose, the reward function used in Algorithm 1 is essential for evaluating the effect of the executed action a_t on system performance. In order to evaluate the advantages and disadvantages of reward functions more fully, a reward function with several considerations is proposed in our study. This immediate reward function r_t is defined as
r_t = −ζ(s_t^v − s_t^r)ᵀΛ(s_t^v − s_t^r) − κ‖a_t‖² − ξ‖a_t − ā_{t−1}‖² − σ‖Σ_{i≤t} e_i‖²,   (36)
where the first term evaluates the squared error between the real measured values s_t^v and the references s_t^r. Due to the motion characteristics of the AUV, a scale factor Λ, which can be defined as Λ = diag(λ_1², λ_2², λ_3²), is added to represent the error more effectively. In the process of training, the parameters λ_i in the factor Λ are changed according to the motion characteristics of this AUV. The second term describes the actual usage of the vectored thruster.
The third term is added to prevent the vectored thruster from producing sudden changes in propulsive force and duct deflection angles. It is obtained by calculating the norm between the average of the past executed actions ā_{t−1} and the current action a_t, where ā_{t−1} is computed as the mean of the actions over a certain time period before t. The last term represents the accumulated error between the real measured dynamic variables s_t^v and the references s_t^r; it is inspired by the PID algorithm and reduces the steady-state error. The parameters ζ, κ, ξ, and σ are scale factors in (0, 1].
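The four-term reward described above can be sketched directly. The weights and the sample values below are illustrative (the λ and ζ, κ, ξ defaults echo the values used later in the simulations), and the exact functional form in the paper may differ:

```python
import numpy as np

# Sketch of the four-term reward of equation (36): tracking error (zeta),
# actuator usage (kappa), action fluctuation (xi), accumulated error (sigma).
def reward(s_v, s_r, a, a_prev_mean, e_sum,
           lam=(1.0, 50.0, 50.0), zeta=1.0, kappa=0.001, xi=0.02, sigma=0.0):
    Lam = np.asarray(lam)
    e = s_v - s_r
    return -(zeta * np.sum(Lam * e**2)            # scaled squared tracking error
             + kappa * np.sum(a**2)               # penalize thruster usage
             + xi * np.sum((a - a_prev_mean)**2)  # penalize sudden action changes
             + sigma * np.sum(e_sum**2))          # accumulated (integral-like) error

r = reward(s_v=np.array([0.9, 0.0, 0.0]), s_r=np.array([1.0, 0.0, 0.0]),
           a=np.array([50.0, 0.1, 0.1]), a_prev_mean=np.zeros(3),
           e_sum=np.zeros(3))
```

All four terms are penalties, so the reward is always nonpositive and the optimum is exact, smooth, low-effort tracking.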
In order to verify the feasibility of the proposed Algorithm 1, a numerical simulation is implemented in Python with TensorFlow. According to the related content in Section 4, the policy network is a deep fully connected neural network with five layers, including three hidden layers, one input layer, and one output layer. The size of the input layer is 18, the sizes of the hidden layers are 600 and 400, and the size of the output layer is 3. As for the activation functions, the hidden layers use ReLU and the output layer uses Tanh.
The state-action value networks use a similar architecture, apart from the size of the output layer. In addition, all parameters are set before carrying out the series of numerical simulations. The maximum numbers of episodes and steps are fixed as M and T. During training, the sampling time is set as dt, with full consideration of the calculation speed and accuracy of the designed simulation environment and the practical application of the AUV. The maximum and minimum sizes of the experience replay buffer R are set as m_max and m_min. The learning rates for the actor and critic networks are L_R-A and L_R-C. The discount rate and the soft update rate for the target networks are γ and τ, respectively. The size of the minibatch of state transitions is defined as N. The parameter settings of the DDPG controller are shown in Table 2.
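A forward pass through the described actor architecture can be sketched with numpy. The weights here are random placeholders, not trained parameters, and the sketch uses the two listed hidden sizes (600 and 400):

```python
import numpy as np

# Forward pass of the described actor network: input size 18, hidden
# layers with ReLU, and a size-3 Tanh output (random placeholder weights).
rng = np.random.default_rng(0)
sizes = [18, 600, 400, 3]
params = [(rng.normal(0, 0.05, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def actor_forward(s):
    h = s
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # ReLU hidden activations
    W, b = params[-1]
    return np.tanh(h @ W + b)            # Tanh keeps outputs in [-1, 1]

a = actor_forward(rng.normal(size=18))   # three normalized control outputs
```

The Tanh output layer is what makes the normalized-to-actuator scaling described earlier necessary, and the critic would replace the Tanh head with a single linear Q-value output taking (s, a) as input.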
According to the aforementioned Algorithm 1 and related parameters, a series of simulations is carried out to study the effect of each term in the reward function, equation (36). For a fair comparison, the following simulations are all performed with the same reference state, defined by the velocities of the vectored thruster AUV, s_t^r = (v_x^r, ω_y^r, ω_z^r). When the reference velocities are s_t^r = (v_x^r, ω_y^r, ω_z^r) = (1 (m/s), 0, 0), the parameters in the scale factor Λ are λ_1 = 1 and λ_2 = λ_3 = 50. Then, the performance of the reward function with ζ = 1, κ = 0, ξ = 0, and σ = 0 can be simulated, and the results are shown in Figure 11.
As we can see in Figure 11, the linear and angular velocities are larger than the references, which results in unnecessary loss of power. Considering that the reference velocities are set to zero except for the velocity in the x-direction, the position and orientation of the AUV should be zero except for the displacement in the x-direction. These deviations also need to be considered in the reward function to improve the performance of Algorithm 1. In order to reduce energy consumption, the factor κ in the reward function is set to κ = 0.001, with the other parameters unchanged as described above. The corresponding simulations are carried out, and the results are shown in Figure 12.
As we can see in Figures 12(a)-12(d), the simulation results are obtained with the new factor in the reward function. Comparing the results in Figures 11 and 12, the usage of the vectored thruster has declined significantly. The simulation results shown in these figures also illustrate the reasonability and validity of the second term of the reward function, which smooths out the velocity fluctuations and effectively enhances the reference-tracking performance of the algorithm.
Through the comparison of the amplitude variations of the thrust and duct angles with the same parameters, it is shown that the reward function considering energy consumption penalizes the usage of the vectored thruster while reducing its fluctuation range. Meanwhile, according to the comparison of the two simulation results, this method also reduces the deviations of position and orientation, providing more accurate control of the AUV.

However, the large variation ranges of the thrust T_p and the duct angles α, β would make it difficult for the vectored thruster control to take advantage of the algorithm in real AUV applications, even though the results prove that this reward function achieves good results. Hence, the third term with the factor ξ = 0.02, which represents the punishment term for fluctuations of the action outputs, is added to the reward function to evaluate the performance of actions. In addition, to further evaluate the justification of the second term of the reward function, another simulation is carried out with the factor κ = 0. The simulation results are shown in Figure 13.
Comparing the simulations of Figures 11-13, the result considering the third term of the reward function proves that it is very useful for reducing the variation ranges of the action outputs. According to Figures 12 and 13, it should be noted that the second term of the reward function plays an important part in reducing energy consumption and improving the performance of this AUV.
In order to greatly improve the performance of the control system, the first three terms are adopted in the reward function to further take advantage of the algorithm in real AUV applications. Hence, in the next simulation, the factors are set to ζ = 1, κ = 0.001, and ξ = 0.02, and the results are shown in Figure 14.
As we can see in Figure 14, the variation ranges of the thrust T_p and the duct angles α and β are smaller than before, which makes this algorithm easier to use in real AUV applications. This performance provides strong evidence that the second term of the reward function stabilizes and smooths the action outputs, even though this term was originally designed to reduce energy consumption. Although the simulations above indicate that this algorithm can obtain good results for the vectored thruster AUV, the bias between the controlled deflection angles of the duct and the goal is still large. Meanwhile, the biases of the duct angles α, β and the thrust T_p lead to large deviations in the position and orientation of the AUV. To further improve the performance of the algorithm, these biases of the thrust and duct angles need to be considered in the reward function. Based on the above comparison and considerations, the last term of the reward function, inspired by the integral error term of the PID algorithm, is added to reduce the effect of error accumulation. A new simulation is carried out with all four terms of the reward function considered, and the results are shown in Figure 15.
As can be seen in Figure 15, the improved performance indicates the effect of adding the punishment term for error accumulation to the reward function. As shown in Figures 15(e) and 15(f), the position and orientation of the AUV can also be obtained. In particular, the biases of the duct angles, position, and orientation of this AUV decrease effectively. The comparison between the current results in Figure 15 and the earlier results proves that the reward function considering all aspects yields good and stable results. Comparing the simulation results of Figures 5 and 15, a high degree of agreement is found between the designed controller based on Algorithm 1 and the traditional PID method. The results of the simulation comparing RL and PID are shown in Figure 16.
As can be seen, the simulation results indicate that the controller based on DDPG performs well in controlling the vectored thruster AUV. Contrasting the simulation results, the designed controller based on DDPG outperforms the PID controller in dynamic performance. In order to further study the performance of the designed controller under conditions with greater uncertainty, simulations are carried out to study the anti-jamming performance under Gaussian white noise excitations.
The simulation results are shown in Figure 17. Under the Gaussian white noise disturbances introduced into the simulation environment, both the controller based on DDPG and the PID controller can realize their functions. Based on the above results, the designed control scheme based on DDPG has good dynamic and static responses and strong anti-interference ability. The simulation results in Figures 16 and 17 show that the proposed controller based on DDPG has better stability, a fast convergence rate, and good tracking ability.
In order to test the capability of this algorithm, further simulations with changed references are carried out. The new reference is set to ω_x^r = 0, ω_y^r = 0.1 (rad/s), ω_z^r = 0, and the simulation results are obtained after training the algorithm. The obtained results are shown in Figure 18.
As we can see in Figure 18, the angular velocity ω_y achieves the set velocity requirements with good reliability. From Figures 18(c) and 18(d), the thrust output T_p is very stable within the limit of the ultimate thrust, and the duct angle β is 15°, which is the limiting deflection angle of the duct. Comparing the results of Figures 18 and 7, the designed controller and the reward function accomplish the angular velocity control of this vectored thruster AUV, and the control performance is better than that of the conventional PID controller.
We applied the DDPG method to the proposed controller for the vectored thruster AUV; the training reward and the time consumption are shown in Figures 19 and 20.
As we can see in Figure 19, the accumulated reward tends to increase monotonically until about episode 1500, after which it stabilizes. From this learning curve, we can observe the development tendency of the proposed DDPG-based controller as training proceeds. As we can see in Figure 20, the mean time per episode is 9.24 seconds, and our method takes about 7.7 hours for the whole 3000 episodes of simulated time, which corresponds to 3.5 days of computation in real time.

Conclusion and Future Work
In this paper, an AUV controller based on the Deep Deterministic Policy Gradient (DDPG) was proposed to improve the control performance of the vectored thruster AUV. The proposed algorithm uses the information measured by the internal sensors of the AUV to provide the control commands needed to fulfill the task. There is no requirement to provide the designed controller with a model of the large, complex nonlinear system of the vectored thruster AUV, which is essential in classic control theory. It only needs some input parameters of the AUV, and the proposed algorithm is able to learn a control strategy that meets exact implementation requirements. In the learning process, the reward function is fundamental for the DDPG controller to realize the system goals and related functions of the AUV. In this algorithm, a reward function is proposed that considers a series of control precision requirements and the influence of operational constraints. The designed reward function can effectively improve reliability and stability, reduce energy consumption, and restrain sudden changes of the vectored thruster. It should be particularly noted that the proposed control system based on the DDPG algorithm was developed to realize the lower-layer motion control of the vectored thruster AUV, although a greater range of applications and more complex dynamic control systems can also be addressed by this method. Therefore, the controller based on the DDPG algorithm has broad application and development prospects.
Furthermore, our proposed algorithm framework for the AUV uses only system states that can be measured directly by sensors as inputs, which differs from former methods that use images as input parameters. In this paper, it is shown that the motions of the AUV can be directly controlled by sending low-level control commands to the vectored thruster. To confirm the algorithm's effectiveness, a series of simulations is carried out in a simulation environment established by the kinematic and dynamic analysis of the vectored thruster AUV. In this sense, using a simulation environment in place of the real underwater application environment is proved to be a cost-saving and efficient improvement, and we believe our work has achieved certain progress in expanding the application range of AUV control studies using deep reinforcement learning. Furthermore, our proposed control algorithm provides an optional approach for controlling underwater vehicles and other kinds of robots.
Certainly, our present study has limitations despite its achievements. In our proposed algorithm, the simulations are carried out under ideal conditions, so realistic experiments need to be completed to verify the correctness and feasibility of the proposed method. Moreover, more influencing factors should be taken into account, such as time-delay uncertainty among the sensors, actuators, and controllers. In addition, how to improve the performance and stability of the proposed controller is an important task for further research. Finally, control algorithms based on deep reinforcement learning have broad application backgrounds and important meanings in theory and practical engineering; therefore, the related research will become increasingly important.
Data Availability
The data of hydrodynamic and thrust coefficients of the AUV used to support the findings of this study are included within the supplementary information file (Appendix B).
The other data used to support the findings of this study are included within the article.