The deep deterministic policy gradient (DDPG) algorithm, which operates over continuous action spaces, has attracted great attention in reinforcement learning. However, its exploration strategy through dynamic programming within the Bayesian belief state space is rather inefficient even for simple systems. Another problem is that training data gathered sequentially and iteratively by autonomous vehicles are subject to the law of causality, which violates the i.i.d. (independent and identically distributed) assumption on training samples. This usually results in failure of the standard bootstrap when learning an optimal policy. In this paper, we propose a framework of m-out-of-n bootstrapped and aggregated multiple deep deterministic policy gradient (BAMDDPG) to accelerate the training process and increase the performance. Experiment results on the 2D robot arm game show that the reward gained by the aggregated policy is 10%–50% better than those gained by subpolicies. Experiment results on The Open Racing Car Simulator (TORCS) demonstrate that the new algorithm can learn successful control policies with 56.7% less training time. An analysis of convergence is also given from the perspective of probability and statistics. These results verify that the proposed method outperforms the existing algorithms in both efficiency and performance.
Reinforcement learning is an active branch of machine learning, where an agent tries to maximize the accumulated reward when interacting with a complex and uncertain environment [
However, DQN only deals with tasks that have small, discrete state and action spaces, while many reinforcement learning tasks have large, continuous, real-valued state and action spaces. Although such tasks could be solved with DQN by discretizing the continuous spaces, doing so may increase the instability of the control system. To overcome this difficulty, the deterministic policy gradient (DPG) algorithm [
Additionally, researchers have recently attempted to overcome the unstable training of DDPG and to speed up its convergence with the bootstrap technique [
In consideration of the above shortcomings of the previous work, this paper introduces a simple DRL algorithm with the m-out-of-n bootstrap technique [
The remainder of this paper is organized as follows. Section
In a classical scenario of reinforcement learning, an agent aims at learning an optimal policy according to the reward function by interacting with the environment
Policy gradient (PG) algorithms optimize a policy directly by maximizing the performance function with the policy gradient. The deterministic policy gradient algorithm, which originates from the deterministic policy gradient theorem [
DDPG applies the DNN technique to the deterministic policy gradient algorithm [
Diagram of deep deterministic policy gradient.
There are two sets of weights in DDPG.
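The second set (the target networks) tracks the first (the online networks) through a slow soft update, which is what stabilizes DDPG's training. A minimal sketch of this standard Polyak-averaging rule (parameter names are illustrative):

```python
def soft_update(target_weights, online_weights, tau=0.001):
    # Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target.
    # A small tau makes the target networks change slowly, stabilizing the
    # bootstrapped Q-targets used by the critic.
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]
```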
DDPG utilizes the experience replay technique [
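A replay buffer of this kind can be sketched as follows (a minimal illustration; the class and method names are not the paper's notation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive
        # transitions, moving each minibatch closer to the i.i.d. assumption.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```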
Compared with DQN, DDPG is more appropriate for reinforcement learning tasks with continuous action spaces. However, it takes a long time for DDPG to converge to the optimal policy. We propose a multi-DDPG structure and the bootstrap technique to train several subpolicies in parallel so as to cut down the training time.
We randomly initialize
The structure of multi-DDPG with the centralized experience replay buffer is shown in Figure
Structure of BAMDDPG.
Aggregation of subpolicies.
Randomly initialize
Initialize
Initialize centralized experience replay buffer
Initialize an Ornstein–Uhlenbeck process
Alternately select
Select all
Receive state
Execute action
Store experience
Update
Get final policy by aggregating subpolicies:
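The listing above can be sketched as the following outer loop, assuming each sub-agent exposes `act`, `update` and the shared buffer exposes `store`, `sample` (all names illustrative, not the paper's notation):

```python
def train_bamddpg(agents, env, buffer, episodes, batch_size):
    """Hypothetical BAMDDPG outer loop: each episode is driven by one
    sub-agent, but every sub-agent learns from the shared buffer."""
    for episode in range(episodes):
        agent = agents[episode % len(agents)]   # alternately select a sub-agent
        state, done = env.reset(), False
        while not done:
            action = agent.act(state)           # actor output plus exploration noise
            next_state, reward, done = env.step(action)
            buffer.store(state, action, reward, next_state, done)  # centralized buffer
            state = next_state
            if len(buffer) >= batch_size:
                for a in agents:                # all subpolicies share experience
                    a.update(buffer.sample(batch_size))
```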
In Algorithm
In the interaction procedure, the main actor network which represents an agent interacts with the environment. It receives the current environment state
In the update procedure, a random minibatch of transitions used for updating weights is sampled from the centralized experience replay buffer. The main critic network is updated by minimizing a loss function based on the Q-learning method [
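For reference, with target networks $Q'$ and $\mu'$, this critic loss takes the standard DDPG form over a minibatch of $N$ sampled transitions:

```latex
L(\theta^{Q}) = \frac{1}{N}\sum_{i}\Big(\,\underbrace{r_i + \gamma\, Q'\!\big(s_{i+1},\, \mu'(s_{i+1}\mid\theta^{\mu'})\mid\theta^{Q'}\big)}_{\text{target } y_i} \;-\; Q(s_i, a_i\mid\theta^{Q})\Big)^{2}
```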
Figure
In practice, we train multiple subpolicies by setting a maximum number of episodes. Since episodes in BAMDDPG terminate earlier, with fewer steps, than those of the original DDPG algorithm, the subpolicies are trained in less time than the optimal policy. It can be expected that the performance of less-trained subpolicies will be somewhat worse than the optimal policy, but we can aggregate the trained subpolicies to increase the performance and obtain the optimal policy. Furthermore, we use the average method as the aggregation strategy, in consideration of the equal status and real-valued outputs of all subpolicies. Specifically, the outputs of all subpolicies are averaged to produce the final output.
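The averaging step can be sketched as follows, assuming each subpolicy is a callable mapping a state to a real-valued action vector of a common shape:

```python
import numpy as np

def aggregate_action(subpolicies, state):
    # Query every subpolicy for its proposed action and average element-wise;
    # all subpolicies have equal status, so no weighting is applied.
    actions = np.stack([pi(state) for pi in subpolicies])
    return actions.mean(axis=0)
```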
As Figure
Additionally, on an intuitive level, the centralized experience replay technique exploited in BAMDDPG enables each agent to use the experiences encountered by the other agents. This makes the training of the subpolicies of BAMDDPG more efficient, since each agent has a wider view and more information about the environment.
For ease of description, we suppose BAMDDPG trains
Equation (
Further, we analyze the convergence from the perspective of probability and statistics [
Equation (
Bootstrap [
However, the standard bootstrap fails when the training data are subject to a long-tailed distribution rather than the normal distribution that the i.i.d. assumption implies. A valid alternative is the m-out-of-n bootstrap method [
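A minimal sketch of m-out-of-n resampling (function and parameter names are illustrative): instead of drawing resamples of the full size n, each resample draws only m < n points with replacement, which keeps the resampled statistic well behaved for heavy-tailed data.

```python
import numpy as np

def m_out_of_n_bootstrap(data, m, n_resamples, statistic, seed=0):
    # Draw n_resamples subsamples of size m (with replacement) and evaluate
    # the statistic on each; for long-tailed data this remains consistent
    # where the standard n-out-of-n bootstrap can fail.
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    return np.array([statistic(rng.choice(data, size=m, replace=True))
                     for _ in range(n_resamples)])
```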
In order to illustrate the effectiveness of aggregation, we use BAMDDPG to learn a control policy for a 2D robot arm task.
As Figure
2D robot arm benchmark.
During the training process of BAMDDPG, each agent interacts with its corresponding environment, producing multiple learning curves. Figure
Comparison between sub-DDPGs of BAMDDPG and DDPG.
The key to BAMDDPG is the aggregation of subpolicies. In this section, we compare the performance of the aggregated policy with that of the subpolicies to illustrate the effectiveness of aggregation. Suppose the action given by the
Table
Performance comparison of subpolicies and the aggregated policy.
Policy  Episodes  Total reward  Average reward 

Subpolicy1  20  720.69  36.03 
Subpolicy2  20  538.28  26.91 
Subpolicy3  20  463.98  23.20 
Aggregated policy  20  829.17  41.46 
The Open Racing Car Simulator (TORCS) is a highly portable car-driving simulator that adopts a client-server architecture [
Diagram of the client-server architecture of TORCS.
Designing a suitable reward function is key to using TORCS as the platform to test BAMDDPG, as it helps to learn a good policy for controlling the simulated car. We describe the details of designing the reward function in this section. Since the driving environment state of TORCS can be perceived by various sensors of the simulated car, we can create the reward function using these sensor data, which are shown in Table
Information of sensor data for creating the reward function.
Name  Range (unit)  Description

angle  [−π, +π] (rad)  Angle between track direction and car's forward direction
speedX  (−∞, +∞) (km/h)  Speed of the car along the direction of the car's longitudinal axis
track  [0, 200] (m)  Distance between the track edge ahead and the car
trackPos  (−∞, +∞)  Distance between the track axis and the car
Equation (
Equation (
Graph of the speed-constraint function.
Equation (
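As an illustrative sketch only (the exact coefficients and terms of the paper's equations are not reproduced here), a reward of this general shape rewards forward progress along the track axis while penalizing lateral drift and deviation from the track centre, using the sensor readings described above:

```python
import math

def torcs_reward(speed_x, angle, track_pos):
    # Hypothetical shaping: speed_x is the longitudinal speed, angle the
    # heading error w.r.t. the track direction, track_pos the (normalized)
    # distance from the track axis. Coefficients in the paper may differ.
    progress = speed_x * math.cos(angle)        # velocity along the track axis
    drift = speed_x * abs(math.sin(angle))      # velocity across the track
    off_centre = speed_x * abs(track_pos)       # penalty for leaving the centre
    return progress - drift - off_centre
```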
We successfully obtain the optimal self-driving policy with BAMDDPG by aggregating multiple subpolicies in TORCS. During one episode of the training process, one subpolicy is selected. The corresponding agent perceives the driving environment state through various sensors and executes actions by following the selected subpolicy. Table
Description of action commands.
Commands  Range  Description 

Steering  [−1, +1]  −1 is full right while +1 is full left 
Brake  [0, 1]  Brake pedal (1 is full brake while 0 is no brake) 
Throttle  [0, 1]  Gas pedal (1 is full gas while 0 is no gas) 
After the interaction, all subpolicies are updated using minibatches from the centralized experience replay buffer. We have argued that BAMDDPG demands less training time than DDPG. Figure
(a) Learning curve and (b) training time comparison of BAMDDPG and DDPG.
In our experiments on TORCS, the simulated car was trained for 6000 episodes on the Aalborg track using BAMDDPG and DDPG, respectively. Figure
The ability of the BAMDDPG algorithm to reduce training time is based on policy aggregation. Section
To keep the presentation concise and the comparison clear, only three subpolicies are trained by the BAMDDPG algorithm in this experiment. The trained subpolicies and the aggregated policy control the same simulated car on the same track, the Aalborg track, for one lap. We then observe the total reward and whether the car can finish one lap of the track. Table
Performance comparison of the aggregated policy and subpolicies.
Policy  Steps  Total reward (points)  Complete one lap 

Subpolicy1  246  16690.60  No 
Subpolicy2  246  15413.12  No 
Subpolicy3  102  −1252.46  No 
Aggregated policy  457  31603.37  Yes 
Figure
Performance comparison of the aggregated policy and subpolicies.
The final policy obtained by BAMDDPG is based on the aggregation of subpolicies, but the algorithm does not prescribe a specific number of subpolicies. In theory, with a sufficiently large number of subpolicies, the aggregated policy approximates the optimal policy. In practice, however, aggregating a large number of subpolicies is inefficient in terms of computing and storage resource consumption.
To balance efficiency and performance, this section explores the appropriate range for the number of subpolicies through experiment. We consider numbers of subpolicies up to 30 and determine the appropriate number by comparing the performance of aggregated policies with different numbers of subpolicies. These aggregated policies are tested on the Aalborg track, and we compare their training time and total reward within 5000 steps. Furthermore, we compare the generalization performance of the aggregated policies by testing them on the CG1 and CG2 tracks. Experimental results are shown in Figure
Reward comparison of aggregated policies with different numbers of subpolicies on Aalborg.
Comparison of aggregated policies with different numbers of subpolicies.
Number of subpolicies  Training time (hours)  Total steps  Total reward  Pass Aalborg  Pass CG1  Pass CG2 

3  22.84  5000  331086.10  Yes  Yes  Yes 
5  24.40  5000  360804.43  Yes  Yes  Yes 
10  24.16  5000  303678.65  Yes  Yes  Yes 
15  22.09  771  47121.87  Yes  No  Yes 
20  20.49  567  34343.05  Yes  No  Yes 
30  21.74  1541  97146.37  Yes  No  Yes 
Figure
Table
Generally speaking, when the number of subpolicies is 3–10, the corresponding aggregated policies perform well and generalize better than aggregated policies with over 10 subpolicies, which suggests that 3–10 is the appropriate number of subpolicies for BAMDDPG in practical applications.
However, the aggregated policies with over 10 subpolicies cannot reach the maximum number of steps on the Aalborg track and are not able to finish the CG1 track. The reason these policies performed worse lies mainly in the limit of the centralized experience replay buffer. During training, we fixed the size of the centralized experience replay buffer at 100,000 transition tuples
Generalization performance is a research hotspot in the field of machine learning and a key index for evaluating the performance of algorithms. An overtrained model often performs well on the training set but poorly on the test set. In our experiments, self-driving policies are learned successfully on the Aalborg track using BAMDDPG. The car controlled by these policies performs well on the training track; however, the generalization performance of the learned policies is unknown. Hence, we test the performance of the aggregated policy learned with BAMDDPG on both the training and test tracks, including Aalborg, CG1, and CG2, whose maps are illustrated in Figure
Maps of training and test tracks. (a) Aalborg; (b) CG1; (c) CG2.
The total reward of the aggregated policy shown in Table
Generalization performance of the aggregated policy.
Track name  Total reward (points)  Complete 

Aalborg  30007.61  Yes 
CG1  23755.62  Yes 
CG2  35602.09  Yes 
Table
This paper proposed a deep reinforcement learning algorithm that aggregates multiple deep deterministic policy gradient learners with an m-out-of-n bootstrap sampling method. This method is effective for sequential and iterative training data, where the data exhibit a long-tailed distribution rather than the normal distribution implied by the i.i.d. data assumption. The method can learn optimal policies with much less training time for tasks with continuous action and state spaces.
Experiment results on the 2D robot arm game show that the reward gained by the aggregated policy is 10%–50% better than those gained by the nonaggregated subpolicies. Experiment results on TORCS demonstrate that the proposed method can learn successful control policies with 56.7% less training time, compared to the normal sampling method and nonaggregated subpolicies.
The program and data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author. The simulation platform (The Open Racing Car Simulator, TORCS) used to support the findings of this study is open-source and is available at
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by NSFC (61672512 and 51707191), the CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, and the Shenzhen Engineering Laboratory for Autonomous Driving Technology.