Network Architecture for Optimizing Deep Deterministic Policy Gradient Algorithms

The traditional Deep Deterministic Policy Gradient (DDPG) algorithm has been widely used in continuous action spaces, but it still suffers from two problems: it easily falls into local optima, and its errors fluctuate widely. To address these deficiencies, this paper proposes a dual-actor, dual-critic DDPG algorithm (DN-DDPG). First, on the basis of the original actor-critic architecture, a second critic network is added to assist training, and the smaller of the two critics' Q values is taken as the estimated action value in each update, which reduces the probability of falling into local optima. Second, a dual-actor network is introduced to alleviate the value underestimation produced by the dual-critic network: the higher-valued of the two actors' actions is selected for the update, which stabilizes the training process. Finally, the improved method is validated on four continuous action tasks provided by MuJoCo, and the results show that, compared with the classical algorithm, it reduces the range of error fluctuation and improves the cumulative return.


Introduction
As artificial intelligence continues to thrive, reinforcement learning (RL), a learning process that combines exploration and action, has been well developed for decision control in discrete action spaces. By letting agents learn continuously through trial and error, RL pursues the maximum overall return while seeking the optimal action policy [1,2]. However, when high-dimensional inputs or continuous action tasks are involved, traditional RL that relies on maximizing expected returns through trial and error may not work well. To tackle these problems, the concept of deep reinforcement learning (DRL) has been presented. In 2013, DeepMind proposed a method of using deep neural networks to play Atari games. It was the first successful and versatile DRL algorithm, although its scope of application was still limited to low-dimensional discrete action spaces. Topics dealing with continuous action tasks have since become a new set of research interests [3,4].
The basic idea of deep reinforcement learning is to fit the value function and policy function of reinforcement learning with neural networks. Typical algorithms include Deep Q-Network (DQN) [5] for discrete action tasks and Deep Deterministic Policy Gradient (DDPG) [6] for continuous action tasks. DDPG and DQN are highly similar; the main difference is that DDPG introduces a policy network to output continuous action values, so DDPG can be understood as an extension of DQN to continuous actions. The DDPG algorithm has been studied extensively, and a series of results has been obtained. Mnih et al. [7] proposed a two-layer BP neural network and thereby improved the DDPG algorithm. The search efficiency of the BP network was improved by using an Armijo-Goldstein-based criterion and the BFGS method [8]. Nikishin et al. [9] reduced the influence of noise on the gradient by averaging under the premise of random weights. Parallel actor networks and prioritized experience replay have been used and tested in the continuous action space of bipedal robots [10]; the experimental results show that the revised algorithm can effectively improve the training speed. In addition, the storage structure of experience in DDPG has been optimized with a binary tree, which improves the convergence speed of the DDPG algorithm [11][12][13].
To sum up, the methods above propose improvements that address shortcomings of DDPG, and all have achieved good results. Although the performance of the improved algorithms is significantly better, the flaws of local optimal solutions and large error fluctuations still need to be addressed.
The main content of this paper is as follows. Firstly, the basic principle of DDPG is introduced and, combined with a description of the network structure and its associated parameters, its existing shortcomings are analyzed. Secondly, an improved algorithm is proposed to tackle these shortcomings. The improvement has two aspects. First, to reduce the probability of local optimal solutions, a second critic network is added to assist training, and the smaller Q value of the two critic networks is taken as the estimated value of the action. Second, because the dual-critic network selects the suboptimal Q value in each update round, and that suboptimal Q value corresponds to a suboptimal action, the agent's action value is continuously underestimated. In response to this problem, this work introduces a dual-actor network on top of the dual-critic architecture; that is, the more valuable of the two actors' actions under the minimum Q value is selected for training, which improves the robustness of the network structure. Finally, the effectiveness of the improved method is verified in eight simulated experimental environments.
The rest of this paper is organized as follows. The basics of DDPG are introduced in Section 2. In Section 3, the idea behind the improved algorithm is elaborated. Section 4 presents the experimental results and analysis. Section 5 summarizes the work and points to future work.

Deep Deterministic Policy Gradients
The problem that reinforcement learning needs to solve is how to let the agent learn which actions to take in an environment so as to obtain the maximum sum of reward values [12][13][14]. The reward value is generally associated with the task goal defined for the agent. The DDPG algorithm is used to solve the reinforcement learning problem in continuous action spaces [6][15][16][17]. The main process is as follows. Firstly, the experience data generated by the interaction between the agent and the environment are stored in the experience replay mechanism. Secondly, the sampled data are learned and updated through the actor-critic architecture, and finally the optimal policy is obtained. The structure of the DDPG algorithm is shown in Figure 1 [15].
Based on the deterministic policy gradient, the DDPG algorithm uses neural networks to approximate the policy function and the Q function and combines them with deep learning methods to complete task training [16]. DDPG inherits the organizational structure of the DQN algorithm and uses actor-critic as its basic architecture [17]. Combining the online-network and target-network concepts of DQN with the actor-critic method gives both the actor and the critic modules of DDPG an online network and a target network [6,18,19].
During the training process, the agent in the current state $S$ decides the action $A$ to perform through the current actor (online) network, and the current critic (online) network calculates the Q value of that action and the expected return value $y_i = R + \gamma Q'$. Then the actor target network selects the optimal action $A'$ among the actions that can be performed, according to previous learning experience, and the value $Q'$ of the future action is calculated by the critic target network. The parameters of each target network are periodically updated from the online network parameters of the corresponding module.
DDPG adopts a "soft" method to update the target network parameters; that is, the magnitude of each update of the target parameters is very small, which improves the stability of the training process [20][21][22]. Denoting the update coefficient by $\tau$, the "soft" update can be expressed as

$$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}. \tag{1}$$

DDPG chooses the action $a_t$ by the deterministic policy $\pi$. It approximates the state-action function via a value network, and the target function is defined as the accumulated reward with a discount factor $\gamma$ [23,24], as shown in the following equation:

$$J(\mu) = \mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t} r(s_t, a_t)\right]. \tag{2}$$

In the critic online network, the parameters are updated by minimizing the mean square error of the loss function [10], which can be expressed as

$$L(\theta^{Q}) = \frac{1}{N}\sum_{i}\bigl(y_i - Q(s_i, a_i \mid \theta^{Q})\bigr)^2, \qquad y_i = r_i + \gamma Q'\bigl(s_{i+1}, \mu'(s_{i+1}\mid\theta^{\mu'})\mid\theta^{Q'}\bigr). \tag{3}$$

For the actor online network, the parameters are updated according to the policy gradient [10], as shown in the following equation:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i}\nabla_{a} Q(s, a\mid\theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\,\nabla_{\theta^{\mu}}\mu(s\mid\theta^{\mu})\big|_{s_i}. \tag{4}$$
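To make Equations (1)-(4) concrete, the following is a minimal PyTorch-style sketch of one DDPG update step. It is an illustration only, not the implementation used in this work: the network modules, optimizers, and the replay batch are assumed to exist with the names shown.

```python
# Minimal sketch of one DDPG update step (Equations (1)-(4)), assuming
# `actor`, `critic`, `actor_target`, `critic_target` are torch.nn.Module
# networks and `batch` is a tuple of tensors sampled from the replay buffer.
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.01):
    s, a, r, s_next = batch          # states, actions, rewards, next states

    # Critic update: minimize the MSE to the bootstrapped target (Eq. (3)).
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the deterministic policy gradient (Eq. (4)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # "Soft" target updates with coefficient tau (Eq. (1)).
    for online, target in ((critic, critic_target), (actor, actor_target)):
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```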

Error Analysis. It is an inevitable problem for Q-Learning to overestimate action values [25][26][27][28]. In Q-Learning, the learning algorithm updates the estimated value of an action with the greedy target $y_t = r + \gamma\max_{a}Q(s_{t+1}, a_{t+1})$; hence the actual maximal value of an action is usually smaller than its estimated maximal value, as shown in the following equation:

$$\mathbb{E}_{\epsilon}\Bigl[\max_{a}\bigl(Q(s_{t+1}, a) + \epsilon\bigr)\Bigr] \geq \max_{a} Q(s_{t+1}, a). \tag{5}$$

The validity of Equation (5) has already been proved [29,30]. Even a zero-mean error in the initial state will lead to an overestimation of the action value as the value function is updated, and the adverse effect of this error is gradually enlarged by the repeated application of the Bellman equation.
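The bias in Equation (5) can be seen with a tiny numerical experiment: if several actions have the same true value and zero-mean noise is added to each estimate, the maximum of the noisy estimates is, on average, above the true maximum. The snippet below only illustrates this effect; it is not part of the proposed algorithm.

```python
# Illustration of the overestimation bias in Eq. (5): the max of noisy,
# zero-mean-error estimates exceeds the true maximum on average.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(5)                              # five actions, all with true value 0
noise = rng.normal(0.0, 1.0, size=(100000, 5))    # zero-mean estimation error
estimated_max = (true_q + noise).max(axis=1).mean()

print(f"true max Q:      {true_q.max():.3f}")     # 0.000
print(f"mean of max est: {estimated_max:.3f}")    # roughly 1.16 > 0, i.e., overestimation
```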
In the actor-critic structure, the update of the actor policy depends on the critic value function [31][32][33]. Given the online actor parameters $\phi$, let $\phi_{\text{approx}}$ denote the actor parameters updated using the estimated value function $Q_{\theta}(s, a)$, and let $\phi_{\text{true}}$ denote the parameters obtained using the actual value function $Q^{\pi}(s, a)$, which is unknown during training and represents the value function in an ideal state. Then $\phi_{\text{approx}}$ and $\phi_{\text{true}}$ can be expressed as in the following equation:

$$\phi_{\text{approx}} = \phi + \frac{\alpha}{Z_{1}}\,\mathbb{E}\!\left[\nabla_{\phi}\pi_{\phi}(s)\,\nabla_{a}Q_{\theta}(s,a)\big|_{a=\pi_{\phi}(s)}\right],\qquad
\phi_{\text{true}} = \phi + \frac{\alpha}{Z_{2}}\,\mathbb{E}\!\left[\nabla_{\phi}\pi_{\phi}(s)\,\nabla_{a}Q^{\pi}(s,a)\big|_{a=\pi_{\phi}(s)}\right]. \tag{6}$$

In Equation (6), $Z_{1}$ and $Z_{2}$ normalize the gradients, that is, $Z^{-1}\lVert\mathbb{E}[\cdot]\rVert = 1$. Without gradient normalization, overestimation can still be shown to occur under a slightly stricter condition [34,35].
Since the gradient is updated in the direction of a local maximum, there is a sufficiently small number $k_1$ such that, when the learning rate of the neural network is less than $k_1$, the policy $\pi_{\text{approx}}$ based on $\phi_{\text{approx}}$ and the policy $\pi_{\text{true}}$ based on $\phi_{\text{true}}$ each converge towards the local optimum of the corresponding Q function, and the estimated value of $\pi_{\text{true}}$ is bounded above by that of $\pi_{\text{approx}}$, as shown in the following equation:

$$\mathbb{E}\bigl[Q_{\theta}\bigl(s, \pi_{\text{approx}}(s)\bigr)\bigr] \geq \mathbb{E}\bigl[Q_{\theta}\bigl(s, \pi_{\text{true}}(s)\bigr)\bigr]. \tag{7}$$

Conversely, there is an extremely small number $k_2$ such that, when the learning rate of the neural network is less than $k_2$, both policies again converge to the local optimum of the corresponding Q function, and the true value of $\pi_{\text{approx}}$ is bounded above by that of $\pi_{\text{true}}$:

$$\mathbb{E}\bigl[Q^{\pi}\bigl(s, \pi_{\text{true}}(s)\bigr)\bigr] \geq \mathbb{E}\bigl[Q^{\pi}\bigl(s, \pi_{\text{approx}}(s)\bigr)\bigr]. \tag{8}$$
If the critic network is trained satisfactorily, the estimated policy value will be at least as large as the actual value with respect to $\phi_{\text{true}}$, as shown in the following equation:

$$\mathbb{E}\bigl[Q_{\theta}\bigl(s, \pi_{\text{true}}(s)\bigr)\bigr] \geq \mathbb{E}\bigl[Q^{\pi}\bigl(s, \pi_{\text{true}}(s)\bigr)\bigr]. \tag{9}$$

In this case, if the learning rate of the network is smaller than both $k_1$ and $k_2$, then combining Equations (7), (8), and (9) shows that the action value will be overestimated, as shown in the following equation:

$$\mathbb{E}\bigl[Q_{\theta}\bigl(s, \pi_{\text{approx}}(s)\bigr)\bigr] \geq \mathbb{E}\bigl[Q^{\pi}\bigl(s, \pi_{\text{approx}}(s)\bigr)\bigr]. \tag{10}$$

The existence of these errors leads to inaccurate estimation of the action value, so that a suboptimal policy is taken as the optimal policy output by the online network, thereby degrading the performance of the algorithm.

Dual-Actors and Dual-Critics Network
Structure. Due to the existence of overestimation error, the estimate of the value function acts as an approximate upper bound on the estimated value of the future state. If every Q value update carries some error, the accumulation of errors will result in a suboptimal policy. To address this problem, an additional critic network is used in this work. The smaller Q value of the two critic networks is taken as the estimated value of the action in each update, so as to reduce the adverse effect of the overestimation error.
The process of obtaining the minimum Q value via the dual-critic network is shown in the following equation:

$$y = r + \gamma \min_{i=1,2} Q_i'\bigl(s_{t+1}, \mu'(s_{t+1}\mid\theta^{\mu'})\bigr). \tag{11}$$

Although the dual-critic network can reduce the overestimation error of the algorithm and thus the probability of producing a locally optimal policy, in actual training it is rare for the learning rate of the neural network to be less than the minimum of $k_1$ and $k_2$; combined with the analysis in Section 3.1, this means the probability of overestimation is in fact very low. The dual-critic network selects the suboptimal Q value in each update round, and that suboptimal Q value corresponds to a suboptimal action, which leads to continuous underestimation of the agent's action value and in turn affects the convergence rate of the critic network [36][37][38].
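A minimal sketch of the dual-critic target in Equation (11) follows. It is PyTorch-style pseudocode under stated assumptions (the two target critics, the target actor, and the sampled batch tensors are assumed to exist with the names shown), not the authors' implementation.

```python
# Sketch of the dual-critic (minimum-Q) target of Eq. (11), assuming
# `critic1_target`, `critic2_target`, `actor_target` are torch.nn.Modules.
import torch

def min_q_target(r, s_next, actor_target, critic1_target, critic2_target,
                 gamma=0.99):
    with torch.no_grad():
        a_next = actor_target(s_next)
        q1 = critic1_target(s_next, a_next)
        q2 = critic2_target(s_next, a_next)
        # Take the smaller of the two estimates to damp overestimation.
        y = r + gamma * torch.min(q1, q2)
    return y
```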
Aiming at the underestimation problem of the dual-critic network, this work adds a dual-actor network for training on top of the dual-critic architecture. The network selects the action with the higher value of the two candidate actions under the minimum Q value, which reduces the influence of Q value underestimation and improves the robustness of the network structure.
The network structure of the dual-actors and dual-critics is shown in Figure 2.
For a two-actor network, training would suffer from the same issues if the same sample data and processing methods were used for both actors. To ameliorate this problem, the parameters of the two actor networks are updated with different policy gradients, which reduces the coupling between the two actors and further improves the convergence rate of the algorithm [39,40].
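The action-selection rule described above, formalized in the equation that follows, can be sketched as below. This is one reading of "the action with the highest value under the minimum Q value" (each candidate action is scored by the smaller of the two critics' estimates); the module names are assumptions, and the snippet is illustrative rather than the authors' code.

```python
# Sketch of dual-actor action selection: each actor proposes an action and
# the one with the larger minimum-critic value is kept (batch-wise).
import torch

def select_action(s, actor1, actor2, critic1, critic2):
    a1, a2 = actor1(s), actor2(s)                     # candidate actions
    v1 = torch.min(critic1(s, a1), critic2(s, a1))    # conservative value of a1
    v2 = torch.min(critic1(s, a2), critic2(s, a2))    # conservative value of a2
    keep_a1 = (v1 >= v2)                              # shape (batch, 1), broadcasts over action dims
    return torch.where(keep_a1, a1, a2)
```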
If the policies of the two actors are defined as $\pi_1$ and $\pi_2$, and the parameters of the dual-critic network are $\theta_1$ and $\theta_2$, then the two candidate actions are $a_1 = \mu(s\mid\pi_1)$ and $a_2 = \mu(s\mid\pi_2)$, and the action with the maximal value is selected by the dual-actor network using the following equation:

$$a = \arg\max_{a\in\{a_1, a_2\}} \min\bigl(Q_{\theta_1}(s, a),\, Q_{\theta_2}(s, a)\bigr). \tag{12}$$

Experimental Results and Analysis

The two Arm environments used in this work are described below.

(1) Arm_easy. A 400 × 400 two-dimensional space is constructed in the Arm environment. One end of a robot arm is fixed in the middle of the environment. The goal of training is to make the other end of the arm find the blue target point, as shown in Figure 3.
(2) Arm_hard. This is similar to the Arm_easy environment; the only difference is that the target point is randomly generated in each round.
The two classical continuous control tasks used in this work are shown below.
(1) Pendulum. The pendulum starts in a random position; the aim is to swing it upwards and keep it upright.
(2) Mountain Car Continuous. The task is to drive a car to the top of a hill; however, the power of the car is not sufficient to drive it directly to the top, so it must move back and forth between the left and right slopes to accumulate enough momentum to reach the top, as shown in Figure 4.

The four MuJoCo continuous control tasks include:
(1) Half Cheetah. Train a bipedal agent to learn running as shown in Figure 5.
(2) Humanoid. Train a three-dimensional humanoid agent to walk forward without falling over.
(3) Hopper. Train a one-legged agent to hop forward as fast as possible.
(4) Walker2d. Train a 3-dimensional bipedal agent to walk forward as fast as possible.

The training procedure of DN-DDPG is summarized in the following steps (a code-level sketch of the core update is given after the listing):

(1) Randomly initialize the online actor and critic networks with parameters $\theta^{Q}_1$, $\theta^{\mu}_1$ and $\theta^{Q}_2$, $\theta^{\mu}_2$
(2) Initialize the target networks $Q'$ and $\mu'$, and copy the online network parameters to the target networks
(3) Initialize the experience replay buffer D, the noise coefficient $N_t$, and the discount rate $\gamma$
(4) For each episode, episode number = 1, M
(5) Initialize state $S$ as the current state, and obtain the start state $s_1$
(6) For each step, step number = 1, T
(7) Select action $a_t$: $a_t = \arg\max_a \bigl[Q_1(s_t, a_t, \theta^{\mu}_1), Q_2(s_t, a_t, \theta^{\mu}_2)\bigr] + N_t$
(8) Execute action $a_t$, and obtain the reward $r_t$ and the new state $s_{t+1}$
(9) Save the experience tuple $(s_t, a_t, r_t, s_{t+1})$ in the experience pool
(10) Randomly sample a batch of transitions $(s_i, a_i, r_i, s_{i+1})$ from the experience pool
(11) Calculate the target value Q: $y_1 = r_{t+1} + \gamma Q_1'(s_{t+1}, \mu'(s_{t+1}\mid\theta^{\mu'}))$, $y_2 = r_{t+1} + \gamma Q_2'(s_{t+1}, \mu'(s_{t+1}\mid\theta^{\mu'}))$, $y = \min(y_1, y_2)$
(12) Calculate the mean square error of the loss function and update the critic networks: $J(w) = \frac{1}{m}\sum_{j=1}^{m}\bigl(y_j - Q(\phi(S_j), A_j, w)\bigr)^2$
(13) Update the actor networks via the policy gradients of the sample data
(14) Periodically perform the soft update of the target network parameters
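The following is a minimal PyTorch-style sketch of steps (10)-(14) under one reading of the listing above. The module and optimizer names are assumptions, and pairing each actor with one critic for its policy gradient is an interpretation of the "different policy gradients" design rather than a detail stated in the listing.

```python
# Minimal sketch of one DN-DDPG update (steps (10)-(14)), assuming two online
# critics/actors and their target copies as torch.nn.Modules, plus optimizers.
import torch
import torch.nn.functional as F

def dn_ddpg_update(batch, actors, critics, actor_targets, critic_targets,
                   actor_opts, critic_opts, gamma=0.99, tau=0.01):
    s, a, r, s_next = batch

    # Step (11): dual-critic target using the minimum of the two estimates;
    # the next action is chosen between the two target actors by min-critic value.
    with torch.no_grad():
        a1, a2 = actor_targets[0](s_next), actor_targets[1](s_next)
        v1 = torch.min(critic_targets[0](s_next, a1), critic_targets[1](s_next, a1))
        v2 = torch.min(critic_targets[0](s_next, a2), critic_targets[1](s_next, a2))
        a_next = torch.where(v1 >= v2, a1, a2)
        y1 = r + gamma * critic_targets[0](s_next, a_next)
        y2 = r + gamma * critic_targets[1](s_next, a_next)
        y = torch.min(y1, y2)

    # Step (12): minimize the mean square error of both online critics.
    for critic, opt in zip(critics, critic_opts):
        loss = F.mse_loss(critic(s, a), y)
        opt.zero_grad(); loss.backward(); opt.step()

    # Step (13): update each actor with its own policy gradient (one critic per
    # actor here) to reduce the coupling between the two actors.
    for actor, critic, opt in zip(actors, critics, actor_opts):
        loss = -critic(s, actor(s)).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Step (14): soft update of all target networks with coefficient tau.
    pairs = list(zip(actors, actor_targets)) + list(zip(critics, critic_targets))
    for online, target in pairs:
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```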
This work compares the performance of DN-DDPG with the original DDPG algorithm. In order to study the individual effects of the dual-critic network and the dual-actor network, the DCN-DDPG algorithm, a single-actor, dual-critic variant, is also included in the comparison. The outcomes of the comparison are shown intuitively through the experiments.

Parameter Setting.
To ensure the accuracy and fairness of the experimental results, the common parameter values of the different algorithms are kept the same. The training rounds for both the Arm environment and the two Gym classic control tasks are set to 2000, and the maximum number of training steps per round is 300. The training rounds of the four MuJoCo continuous control tasks are set to 5000, and the maximum number of steps per round is the default episode step limit of the Gym environment. The agent continuously learns and explores in the environment. If the preset task is completed successfully or the number of steps in a round exceeds the maximum, the scene is reset and a new round is started. Some parameters of the MuJoCo tasks are shown in Table 1, including the following:

Neuron number in 2nd layer: 300
Experience pool volume: 100000
Batch data size: 256
Soft update coefficient: 0.01
Action reward discount rate: 0.99
Critic net output distribution lower limit: -20
Target net parameters update round number:

The DCN-DDPG algorithm adds a second critic network to the original DDPG, and DN-DDPG further adds an extra actor network to optimize training. The comparison of these three algorithms gives a more intuitive display of the two improvements proposed in this article: dual critics and dual actors. The experimental results are shown in Figure 6. The shaded part in the figure represents the standard deviation during training; that is, with the same hyperparameters and network model, different random number seeds are used for random exploration. The upper limit of the shaded area is the optimal result. The x-axis represents the number of training rounds of the agent, the y-axis represents the cumulative reward obtained per round, and the experiment records the average reward over every 100 rounds.
In the Arm_easy and Arm_hard environments, the average rewards of the three algorithms stay around the same value. In some cases, the rewards of both DCN-DDPG and DDPG are superior to those of DN-DDPG. However, in terms of the overall training effect, DN-DDPG performs better than the other two algorithms, while DCN-DDPG is slightly better than DDPG. In the Pendulum experiment, the overall performance of DN-DDPG is the best, which is due largely to the fact that the dual-critic network reduces the error while the dual-actor network selects the action of higher value. In Mountain Car Continuous, the average rewards of the three algorithms tend to be the same; however, within the first 200 time steps, DN-DDPG converges faster than the other two algorithms. In addition, in Half Cheetah, Humanoid, Hopper, and Walker2d, DN-DDPG starts off worse than DCN-DDPG and DDPG, which could be because the relatively simpler network structures of DCN-DDPG and DDPG cope with complex environments more easily at the start. DN-DDPG needs an initial training period, after which its average reward becomes clearly better than that of the other two algorithms. Again, the overall performance of DCN-DDPG is better than that of DDPG. Finally, comparing the shaded areas of the different algorithms, the area of DN-DDPG is smaller than those of DCN-DDPG and DDPG, which reflects that the training of DN-DDPG is more stable.

From the experimental results in Figure 6, the dual-critic method is able to increase the performance of the DDPG algorithm, but only to a limited extent. By introducing the dual-actor method on top of DCN-DDPG, the DN-DDPG network further increases the overall performance and training stability of the algorithm. Hence, compared with the original DDPG, DN-DDPG, which is based on dual actors and dual critics, shows the largest performance improvement.

Conclusion
A deep deterministic policy gradient algorithm based on a dual-actor, dual-critic network is proposed. In order to reduce the overestimation error of the original actor-critic network, a dual-critic target network is introduced into the algorithm, and the minimum action-value estimate produced by the two networks is selected to update the policy network. In order to alleviate the underestimation caused by the dual-critic network, a dual-actor network is added on top of the original network, and the action with the higher value of the two actions generated by the dual-actor network is selected. The experimental results show that, compared with the original DDPG algorithm and the DDPG algorithm based on a single-actor, dual-critic network, the novel DN-DDPG algorithm based on the dual-actor, dual-critic network achieves a higher cumulative reward and a smaller standard deviation during training.
There is more to be explored in future work. First, in order to improve the optimization ability of the algorithm, more suitable deep learning methods can be explored and applied to the neural networks. Second, for the experience replay mechanism in the DDPG algorithm, it is worth exploring whether there is a better method of determining sample priority to improve the convergence speed during training.

Data Availability
The dataset can be accessed upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the present study.