Learn to Make Decision with Small Data for Autonomous Driving: Deep Gaussian Process and Feedback Control



Introduction
Autonomous driving is one of the most promising fields of artificial intelligence [1,2]. To drive safely on real roads, an ego-vehicle needs to recognize and track objects with its perceptual equipment [3] and to act properly under the current road conditions through its decision-making module. The decision-making module is the most important part of a self-driving system, yet also the most challenging one to realize.
The core mission mainly includes obstacle avoidance, trajectory planning, and action prediction [4,5]. Decision-making models are built with rule-based [6] or statistical [7] methods, which are the two popular schemes. Rule-based methods can implement functionality quickly, but they are limited by incomplete state sets and by their inability to capture uncertainty.
These shortcomings can be overcome by combining rule-based methods with statistical methods. In addition, with the advent of simulation engines such as Torcs [8] and Carla [9], various methods based on reinforcement learning [10] have been proposed in decision-making research and have achieved satisfactory performance. In typical reinforcement learning models, the agent starts without prior knowledge about the world, knowing only which actions are possible, and is expected to learn a skill solely by interacting with the environment and receiving rewards after taking actions. Because of this, it requires huge amounts of time and data to learn specific tasks, such as board games [11] or self-driving [10]. However, real-world datasets are often unsatisfactory because of insufficient diversity. Moreover, building a precisely labeled dataset from the real world is labor-intensive and time-consuming, especially in large-scale complex transportation systems [12], not to mention building a dataset with specific features that meet our needs. To address this problem, increasing the diversity of virtual datasets is a practical way to improve the performance of trained models when training data are insufficient [12,13]. In this paper, we propose another feasible way to solve the decision-making problem in autonomous driving, which directly utilizes small datasets without the aid of virtual samples.
In recent years, the Gaussian process (GP) has become one of the prevailing regression techniques [14]. To be precise, a GP is a distribution over functions such that any finite set of function values has a joint Gaussian distribution. The predicted mean function and covariance function are used for regression and uncertainty estimation, respectively. The strength of GP regression lies in avoiding overfitting while still being able to find functions complex enough to describe the observed behavior, even in unstructured or noisy data. GPs are commonly used when observations are expensive or rare and methods such as deep neural networks perform poorly.
GPs have been applied in a wide range of fields, from engineering [15], optimization [16], robotics [17], and physics [18] to biology [19]. Nevertheless, the kinds of phenomena that can be easily expressed by using a GP directly are limited. For example, in a sparse-data scenario, the constructed probability distribution is often far from the true posterior distribution. Recognizing this problem, many research efforts have attempted to represent new properties via the hierarchical cascading of GPs. Inspired by the widespread success of deep neural network architectures, Damianou and Lawrence proposed composing a GP directly with another GP; applying the idea recursively leads to the so-called deep Gaussian process (deep GP) [20]. A deep GP consists of a cascade of hidden layers of latent variables where each node acts as output for the layer above and as input for the layer below. GPs, each with its own kernel, govern the mappings between the layers. Therefore, the deep GP retains valuable properties of the GP, such as well-calibrated predictive uncertainty estimation and nonparametric modeling power. In addition, its hierarchical structure of GP mappings makes it more flexible, gives it a greater capacity to generalize, and provides better predictive performance.
This model is appealing because it can potentially discover layers of increasingly abstract data representations while handling and propagating uncertainty through the hierarchy at the same time [21].
By the nature of Bayesian statistics, the deep GP model makes predictions based on a statistical average. However, with such small training data, model uncertainty means that we cannot ensure that the statistical average is reasonable. Indeed, our simulations show that the deep GP alone is not enough to solve the decision-making problem in our setting. We therefore introduce a feedback control method to compensate for this shortcoming.
Based on the analysis above, we propose a decision-making framework that combines a deep GP with a feedback control method. The deep GP model predicts actions, and the feedback control method corrects the errors caused by action uncertainty. When a state is fed into the framework, the action is obtained immediately, so it is a kind of end-to-end learning method [22]. The method is trained with small data and tested in Torcs. According to our results, in terms of training time and data volume, our method is superior to deep reinforcement learning trained with the deterministic policy gradient (DPG) method [23].

Related Work
2.1. Deep Reinforcement Learning. As mentioned above, a self-driving vehicle is a decision-making system that processes information from various sources, such as cameras, radars, LiDARs, GPS units, and inertial sensors. This information is used by the vehicle's system to make driving decisions. The architecture can be implemented either as a sequential perception-planning-action pipeline or as an end-to-end system. Recent works mainly focus on the deep reinforcement learning paradigm to achieve self-driving. Existing reinforcement learning algorithms are mainly composed of value-based and policy-based methods. Vanilla Q-learning was the first proposed method and has become one of the popular value-based methods. Karavolos applied the vanilla Q-learning algorithm to the Torcs simulator and evaluated the effectiveness of using heuristics during exploration [24]. Recently, since the resurgence of deep neural networks, many variants of the Q-learning algorithm, such as DQN [25], Double DQN, and Dueling DQN [26], have been successfully applied to a variety of games and have outperformed humans.
Different from value-based methods, policy-based methods learn the policy directly. In other words, policy-based methods output an action given the current state. Silver et al. [27] proposed the DPG algorithm to handle continuous action spaces efficiently without losing adequate exploration. By combining ideas from DQN and actor-critic, Lillicrap et al. [23] then proposed deep DPG (DDPG), a model-free approach that achieves end-to-end policy learning. In 2016, a technique that combines policy gradient and off-policy Q-learning (PGQL) was proposed; it achieved performance exceeding that of both asynchronous advantage actor-critic and Q-learning on the full suite of Atari games [28]. All these policy-gradient methods can naturally handle continuous action spaces. Despite the validity and practicability of reinforcement learning, its training time is long, and the volume of training data it requires is its weak point when enough data cannot be obtained.

Feedback Control Method.
In addition, traditional control methods also play an important role in solving the self-driving problem. Automatic control is almost the last stage in the autonomous vehicle pipeline and one of the most critical tasks, since it is responsible for the vehicle's movement.
The well-known controllers mainly include the PID (proportional-integral-derivative) controller and the MPC (model predictive control) controller [29]. A PID controller is widely used in industrial control applications to regulate pressure, speed, temperature, and other process variables [30]. It uses a control-loop feedback mechanism to control process variables and is regarded as an accurate and stable controller. It is so named because its output is the sum of three terms (proportional, integral, and derivative). Each of these terms depends on the error between the input (the desired value) and the output. In contrast, the MPC controller relies on dynamic models of the process, most commonly linear empirical models obtained through system identification. The main advantage of MPC is that it can optimize the current time step while also taking future time steps into account. This is achieved by optimizing over a finite time horizon, implementing only the current time slot, and then optimizing again, repeatedly.
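As a minimal sketch of the PID law described above, the following Python snippet sums the proportional, integral, and derivative terms of the error. The gains, the time step, and the toy first-order plant are illustrative assumptions and are not taken from this paper.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (gains are illustrative assumptions)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        self.integral += error * dt                    # integral term accumulates error
        derivative = (error - self.prev_error) / dt    # finite-difference derivative
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 400, dt: float = 0.05) -> float:
    """Drive a hypothetical first-order plant x' = u - x toward the setpoint 1.0."""
    pid = PID(kp=2.0, ki=1.0, kd=0.1)
    x = 0.0
    for _ in range(steps):
        u = pid.step(1.0, x, dt)
        x += (u - x) * dt                              # explicit Euler update of the plant
    return x


final_value = simulate()
```

The integral term is what removes the steady-state offset here; with `ki=0.0` the toy plant would settle below the setpoint.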

Gaussian Process.
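GP is a Bayesian nonparametric machine learning framework for regression, classification, and unsupervised learning [14]. A GP is a collection of random variables $f$, any finite combination of which follows a multivariate normal distribution. Suppose that a set of noisy observed outputs $y \triangleq \{y_n\}_{n=1}^{N}$ is available for a set of training inputs $X \triangleq \{x_n\}_{n=1}^{N}$, with latent function values $f$ at these inputs and prior $p(f) = \mathcal{N}(f \mid 0, K_{xx})$. Since the data likelihood can be written as

$$p(y \mid f) = \mathcal{N}(y \mid f, \nu I),$$

where $\nu$ is the noise variance, the marginal likelihood $p(y) = \mathcal{N}(y \mid 0, K_{xx} + \nu I)$ is available in closed form. Due to the inversion of the covariance matrix, training the GP model needs $O(N^3)$ operations, which prevents it from scaling well to massive datasets. To improve its scalability, sparse GP (SGP) models exploit a set $u \triangleq \{u_m\}_{m=1}^{M}$ of inducing output variables for some small set $z \triangleq \{z_m\}_{m=1}^{M}$ of inducing inputs (i.e., $M \ll N$). Then, the joint probability of $y$, $f$, and $u$ is as follows:

$$p(y, f, u) = p(y \mid f)\, p(f \mid u)\, p(u),$$

where $p(f \mid u) = \mathcal{N}(f \mid K_{xz} K_{zz}^{-1} u,\; K_{xx} - K_{xz} K_{zz}^{-1} K_{zx})$, $p(u) = \mathcal{N}(u \mid 0, K_{zz})$, and $u$ is treated as a column vector here.

$K_{xz}$ and $K_{zz}$ represent covariance matrices with components $k(x_n, z_m)$ for $n = 1, \ldots, N$, $m = 1, \ldots, M$ and $k(z_m, z_{m'})$ for $m, m' = 1, \ldots, M$, respectively. The SGP predictive belief at a test input $x_*$ can also be computed in closed form by marginalizing $u$ out:

$$p(f_* \mid y) = \int p(f_* \mid u)\, p(u \mid y)\, \mathrm{d}u.$$

Further details of the SGP model can be found in [31,32].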
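To make the cubic cost concrete, the following sketch implements exact GP regression with NumPy. The squared-exponential kernel, its hyperparameters, and the synthetic sine data are assumptions chosen for illustration; the Cholesky factorization of the $N \times N$ covariance matrix is the $O(N^3)$ step that sparse approximations aim to avoid.

```python
import numpy as np


def rbf_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Exact GP regression: predictive mean and latent variance at x_test."""
    k_xx = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_sx = rbf_kernel(x_test, x_train)
    chol = np.linalg.cholesky(k_xx)                     # O(N^3) factorization
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_sx.T)
    mean = k_sx @ alpha
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, var


# Synthetic data (assumption): noisy samples of sin(x) on [0, 5].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)[:, None]
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(30)

# One test point inside the data range and one far outside it.
x_new = np.array([[2.5], [10.0]])
mean, var = gp_predict(x, y, x_new)
```

Far from the training inputs the predictive variance returns to the prior variance, which is the uncertainty signal that motivates using GPs with small data.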
Inference for the GP model is analytically tractable when the likelihood is Gaussian. For non-Gaussian likelihoods, an approximation is required. Titsias [33] proposed a seminal variational inference (VI) framework that approximates the joint posterior distribution $p(f, u \mid y)$ with a variational posterior $q(f, u) \triangleq p(f \mid u)\, q(u)$ by minimizing the Kullback-Leibler (KL) divergence between them, $\mathrm{KL}[q(f, u) \,\|\, p(f, u \mid y)]$. This procedure is equivalent to maximizing the evidence lower bound (ELBO) of the log-marginal likelihood [32,34]:

$$\mathrm{ELBO} = \mathbb{E}_{q(f)}[\log p(y \mid f)] - \mathrm{KL}[q(u) \,\|\, p(u)],$$

where $q(f) = \int p(f \mid u)\, q(u)\, \mathrm{d}u$. A gradient-based algorithm can be employed to maximize the ELBO with respect to the inducing points and the hyperparameters of the chosen kernel function. Several commonly used kernel functions are listed in Table 1 and discussed in [35].

Problem Statement.
To mathematically formulate the autonomous driving task, we refer to the basic theory of deep reinforcement learning. Let $S$, $A$, and $R$ be the state space, the action space, and the reward function, respectively. In the standard reinforcement learning setting, an agent interacts with the environment at discrete time steps. At each time step $t$, the agent observes the state $s_t \in S$ and takes an action $a_t \in A$ according to its policy $\pi$, which maps a state to a deterministic action or to a probability distribution over the actions ($a_t = \pi(s_t)$). Then, it receives an immediate reward $r(s_t, a_t) \in R$ from the environment. The goal of a reinforcement learning task is to learn an optimal policy $\pi^*$ by maximizing the expected accumulated reward from the beginning. The DDPG setup adopts deep neural networks to approximate the deterministic policy and the action-value function. However, training deep neural networks costs too much time and needs a lot of data.
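The interaction loop of this setting can be sketched in a few lines of Python. The one-dimensional toy environment and the linear policy below are hypothetical stand-ins (they are not Torcs and not the policy used in this paper); the sketch only shows how a deterministic policy $a_t = \pi(s_t)$ accumulates reward over discrete time steps.

```python
from typing import Callable, Tuple


def accumulated_reward(
    policy: Callable[[float], float],
    step: Callable[[float, float], Tuple[float, float]],
    s0: float,
    horizon: int,
) -> float:
    """Roll out a deterministic policy and sum the immediate rewards."""
    state, total = s0, 0.0
    for _ in range(horizon):
        action = policy(state)                  # a_t = pi(s_t)
        state, reward = step(state, action)     # environment returns s_{t+1} and r_t
        total += reward
    return total


def toy_step(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical environment: the state is a lateral offset and the action shifts it."""
    next_state = state + action
    return next_state, -abs(next_state)         # reward penalizes the remaining offset


# A policy that halves the offset at every step (illustrative assumption).
total = accumulated_reward(lambda s: -0.5 * s, toy_step, s0=1.0, horizon=10)
```

As in the reward discussed in this section, the toy reward depends on the action only through the next state it produces.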
In our setting, we treat the deep GP model $M$ as the policy. To obtain the optimal deep GP model, $N$ training data, collected from the interaction between a well-trained neural network and the Torcs engine, are used to train the model $M$; they consist of a state set built from the sensor states and an action set from the controller in Torcs. Each state $s_t$ and action $a_t$, like their counterparts in the training data, is represented by the variables presented in Tables 2 and 3, respectively. The reward function we define is as follows:

The reward function could be constructed more effectively by including other related variables [36]. Although the reward function does not contain the action variables $a_t$ explicitly, each state $s_{t+1}$ is observed at time step $t + 1$ after taking the action at time step $t$, so the action influences the reward value indirectly.
To get the best policy, the evidence lower bound of the deep GP model $M$, denoted as $\mathrm{ELBO}_M$, which is more complex than that of a standard GP, should also be maximized using the training data $X$ and $Y$. According to Bayesian theory, this amounts to obtaining the statistical mean of the action $a_t$ mapped from the state $s_t$, denoted as $\mathbb{E}_M[a_t \mid s_t, X, Y]$. After optimization, the model $M$ is settled and denoted as $\tilde{M}$, which yields $a_t = \tilde{M}(s_t)$. Although the trained deep GP can fit the training data well, this does not guarantee the optimal reward value in the testing period. For this reason, we introduce the feedback control method $F$ to refine the output of the deep GP model, i.e., $\tilde{a}_t = F(a_t, s_t, X)$. In this method, we consider the data $\psi_{\mathrm{ref}}$ and $d_{2\mathrm{ref}}$, which are the collections of $\psi$ and $d_2$ over the states in the training data $X$. To achieve a better reward, $F$ is designed to adjust the action $a_t$ according to the difference between the current state $s_t$ and the corresponding training state. Our solution can be summarized as follows: the deep GP model $M$ is trained by maximizing $\mathrm{ELBO}_M$, which gives $a_t = \tilde{M}(s_t)$, and the feedback control model $F$ optimizes the reward by refining the action, $\tilde{a}_t = F(a_t, s_t, X)$. All the details are presented in the next section. To compare our method with deep reinforcement learning, we performed an autonomous driving simulation of the lane keeping task in the Torcs engine.

Proposed Solution.
In this section, the details of our autonomous driving decision-making method for the lane keeping task are given. The whole framework is presented in Figure 1. After training, we obtain a deep GP model $\tilde{M}$ that fits the training data fairly well. The trained deep GP model is used to predict the action $a_t$ from the state $s_t$ fed back from Torcs at each step. All the predicted actions are further refined by the feedback control method $F$ for feasibility and safety. The final actions $\tilde{a}_t$ are then sent to Torcs to demonstrate visually the performance on running a successful lap. The overall algorithm flow is shown in Algorithm 1. In the following, we discuss the core methods of our framework in detail.
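The predict-refine-act cycle described above can be sketched as a single episode loop. Everything in this snippet is a hypothetical stand-in: `ToyTrack` is a two-variable toy and not the Torcs interface, the linear `policy` replaces the trained deep GP, and `refine` is a simple clipping rule rather than the feedback control method of this paper.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float]    # hypothetical (psi, d2): heading angle and lateral offset
Action = float                 # hypothetical steering command


class ToyTrack:
    """Stand-in simulator (NOT the Torcs API): a lap is `length` steps on track."""

    def __init__(self, length: int = 50, half_width: float = 1.0) -> None:
        self.length, self.half_width = length, half_width

    def reset(self) -> State:
        self.t, self.state = 0, (0.3, 0.5)
        return self.state

    def step(self, steer: Action) -> Tuple[State, bool, bool]:
        psi, d2 = self.state
        psi = psi + 0.5 * steer            # steering changes the heading
        d2 = d2 + 0.2 * psi                # heading changes the lateral offset
        self.t += 1
        self.state = (psi, d2)
        off_track = abs(d2) > self.half_width
        finished = self.t >= self.length and not off_track
        return self.state, off_track, finished


def run_episode(
    env: ToyTrack,
    policy: Callable[[State], Action],
    refine: Callable[[Action, State], Action],
    max_steps: int = 1000,
) -> Tuple[bool, List[Tuple[State, Action]]]:
    """Predict an action, refine it, act, and stop on failure or on a finished lap."""
    state = env.reset()
    experience: List[Tuple[State, Action]] = []
    for _ in range(max_steps):
        action = refine(policy(state), state)
        experience.append((state, action))
        state, failed, finished = env.step(action)
        if failed:
            return False, experience
        if finished:
            return True, experience
    return False, experience


clip = lambda a, s: max(-1.0, min(1.0, a))
ok, exp = run_episode(ToyTrack(), lambda s: -0.8 * s[0] - 0.5 * s[1], clip)
ok_idle, exp_idle = run_episode(ToyTrack(), lambda s: 0.0, clip)
```

With the stabilizing toy policy the episode completes the lap, while the idle policy drifts off the track after a few steps, which mirrors the two exit branches of the loop.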

Deep GP Model $\tilde{M}$.
For this multidimensional input and output problem, we use the deep GP method to fit the training data because of its advantages over a standard GP [20]. A multilayer GP model is a hierarchical composition of GPs. Considering a deep GP with $L$ hidden layers, each GP layer is associated with a set $F_{l-1}$ of inputs and a set $F_l$ of outputs for $l = 1, \ldots, L + 1$, with $F_0 = X$ and $F_{L+1} = Y$. An example of a deep GP is as follows:

$$Y = f_{L+1}(f_L(\cdots f_1(X) \cdots)),$$

where $f_l \sim \mathcal{GP}(0, K^{l}_{F_{l-1} F_{l-1}})$ for each layer. Each layer has its own kernel. In a deep GP, each layer is governed by a GP; however, the overall prior of the composition is no longer a GP, which makes training a deep GP model intractable. It is reasonable to introduce Gaussian noise in each layer. In this case, we get the following recursive definition:

$$F_l = f_l(F_{l-1}) + \epsilon_l, \quad \epsilon_l \sim \mathcal{N}(0, \nu_l I), \quad l = 1, \ldots, L + 1.$$

A graphical model for a deep Gaussian process with one hidden node is illustrated in Figure 2.
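As an illustration of the layered composition, the sketch below draws one sample from a two-layer deep GP prior with NumPy by feeding the noisy output of the first GP layer into the second. The squared-exponential kernel, the noise variance, and the jitter term are assumptions made for this example only; it samples from the prior and does not perform the variational training discussed in this section.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def sample_layer(inputs, rng, noise_var=1e-4, jitter=1e-6):
    """Draw one noisy layer output: a zero-mean GP sample at `inputs` plus Gaussian noise."""
    n = len(inputs)
    # The jitter keeps the kernel matrix numerically positive definite.
    chol = np.linalg.cholesky(rbf(inputs, inputs) + jitter * np.eye(n))
    f = chol @ rng.standard_normal(n)                        # GP prior sample
    return f + np.sqrt(noise_var) * rng.standard_normal(n)   # add layer noise


rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 50)
f1 = sample_layer(x, rng)      # hidden layer evaluated at the inputs
y = sample_layer(f1, rng)      # output layer evaluated at the hidden values
```

Because the second layer takes the hidden values as its inputs, `y` viewed as a function of `x` is in general not a sample from any single GP.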
Let $F \triangleq \{F_l\}_{l=1}^{L}$; for the supervised learning case, the distribution of a deep GP model with $L$ hidden layers can be written as follows:

$$p(Y, F \mid X) = p(Y \mid F_L) \prod_{l=1}^{L} p(F_l \mid F_{l-1}).$$

As for the conditional probabilities, they can be expanded as follows:

$$p(F_l \mid F_{l-1}) = \int p(F_l \mid f_l)\, p(f_l \mid F_{l-1})\, \mathrm{d}f_l.$$

The nonlinearities introduced by the GP covariance functions make the Bayesian treatment of this model intractable. Each layer is therefore augmented with inducing variables $U_l$, as in the sparse GP. In this way, we can obtain the logarithm of the augmented joint distribution, in which each term $\log p(F_l \mid U_l, F_{l-1})$ is replaced by its lower bound $L_l$ [21]. We can see that the latent variables $f_l$ are integrated out within each layer. Our aim is to approximate the logarithm of the marginal likelihood:

$$\log p(Y \mid X) = \log \int p(Y, F, U \mid X)\, \mathrm{d}F\, \mathrm{d}U,$$

where $U$ collects the inducing variables of all layers. To get the bound $\mathrm{ELBO}_M$ for the marginal likelihood, with Jensen's inequality, we get

$$\log p(Y \mid X) \geq \int Q \log \frac{p(Y, F, U \mid X)}{Q}\, \mathrm{d}F\, \mathrm{d}U \triangleq \mathrm{ELBO}_M,$$

where $Q = q(F, U)$ is the introduced approximate variational distribution.

Algorithm 1: Overall algorithm flow.
(1) collect data from interaction between the well-trained network and the Torcs engine
(2) set elements for deep GP: layer number, kernel, inducing points, etc.
(3) train a deep GP model and save
(4) launch Torcs
(5) for i = 1, N do
(6)   reset Torcs
(7)   get the initial state
(8)   for j = 1, M do
(9)     predict action with state using deep GP model $\tilde{M}$
(10)    amend action by feedback control method $F$
(11)    get the next state
(12)    if unsuccessful loop trip then
(13)      break
(14)    end if
(15)    if successful loop trip then
(16)      save the experience data
(17)      break
(18)    end if
(19)  end for
(20) end for
(21) shut down Torcs
Generally, $\mathrm{ELBO}_M$ can be further simplified by the mean-field approximation $Q = \prod_{l=1}^{L} q(U_{l+1})\, q(F_l)$ (i.e., $q(U_{l+1}) = \mathcal{N}(m_{l+1}, S_{l+1})$ and $q(F_l) = \mathcal{N}(\mu_l, \Sigma_l)$ in each layer), and the final form of $\mathrm{ELBO}_M$ is tractable because of these conjugate distributions when the covariance functions selected in each layer can be feasibly convolved with the Gaussian density [20,21]. A gradient-based algorithm, such as the L-BFGS-B algorithm [37], can be employed to maximize the variational lower bound $\mathrm{ELBO}_M$ with respect to the model parameters (i.e., the kernel hyperparameters $\theta_l$ and the noise variance $\nu_l$ in each layer) and the variational parameters $m_{l+1}$, $S_{l+1}$, $\mu_l$, and $\Sigma_l$. The trained deep GP model $\tilde{M}$ can fit the training data well. In recent years, several other approximation methods have been put forward to train deep GPs, such as importance-weighted variational inference [38], stochastic gradient Hamiltonian Monte Carlo [39], and approximate expectation propagation [40].

Feedback Control Model $F$.
After training the deep GP model, all the parameters in the model $\tilde{M}$ are settled. We tried this model in Torcs and found that it could finish only a little more than half a loop trip on the CG road. After analysing the failed runs and the input data, we found that the essential cause is that the data from the well-trained DDPG network only contain states $s_t$ with small $\psi$ and $d_2$, which are close to the center line of the lane. Given an input state $s_t$ with a highly deviated $\psi$ or $d_2$ value, the trained deep GP $\tilde{M}$ may generate an action $a_t$ with an improper $\xi$, $\phi$, or $\varphi$ value, which indirectly affects the value of the reward function. We assume that if the values of $\psi$ and $d_2$ stay in a reasonable range, a successful loop trip can be achieved regardless of the values of the other state variables. With this in mind, we design an extra feedback control method $F$, for reward optimization, to amend the action $a_t$ predicted by the deep GP model.
In this method, we borrow the idea of the PID controller. For simplicity, unlike the PID controller, we only add a proportional error term, composed of two different positive items, to the predicted steer value $\xi$ in the action $a_t$. In addition, instead of using integral error terms, we take the past into consideration by adding the error between the current state and a reference state from the past. Firstly, several critical values $c_n$, $\eta_n$ ($n = 1, 2, 3, 4$) need to be set. The reason is that feedback control is only needed when the feedback state variables $\psi$ and $d_2$ take improper values. Secondly, the error $\lambda_n$ is calculated as the difference between the reference values $\psi_{\mathrm{ref}}$ and $d_{2\mathrm{ref}}$ in the training states and the current feedback state variables $\psi$ and $d_2$; the $j$th reference entry is used when $j$ is smaller than $N$, and the $(j - N)$th entry when $j$ is larger than $N$. Finally, the parameters $b_n$, $c_n$ of the linear error term need to be tuned to achieve a loop trip.
The detailed algorithm flow is presented in Algorithm 2.
As we can see, there are many adjustable parameters in our feedback control method. In practice, all of them can be determined easily:
(i) In our experience, $c_n$ and $\eta_n$ can be set immediately after analysing the range of $\psi_{\mathrm{ref}}$ and $d_{2\mathrm{ref}}$, because the logical judgment in $F$ is only needed when the vehicle deviates too much from the center line or the steering angle is too large.
(ii) The values of $b_{3,4}$ and $c_{3,4}$ can be set the same as $b_{1,2}$ and $c_{1,2}$. Consequently, only four parameters in the feedback control method need to be considered seriously.
(iii) There are two logical judgment statements after the iteration number check in our method. They correspond to the cases in which the virtual vehicle is on the left or the right side of the center line, and these two situations should be considered separately. Moreover, we use the absolute value of the error $\lambda_n$ so that the sign of the steer angle $\xi$ is always in the correct direction, with its value larger (left side of the center line) or smaller (right side of the center line) than the value predicted by the deep GP model $\tilde{M}$.
(iv) In other words, the absolute value of the output action from the model $\tilde{M}$ is not large enough to drag the vehicle back onto the safe part of the road in some extremely dangerous situations. From this perspective, the signs of the remaining four parameters are obvious.
Unlike the usual optimization routine, the optimization of the reward value in each step is not carried out by a gradient-based algorithm. For the lane keeping task, if the vehicle can complete the lap successfully, the obtained reward value may not be the best, but it can be regarded as one of the locally optimal values. After enough trial and error, we obtain relatively good values for the parameters introduced in $F$. Once the parameters in $F$ are determined, the vehicle can respond immediately to the Torcs engine through the refined action $\tilde{a}_t$.
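The following Python sketch illustrates the general shape of such a threshold-gated proportional correction. It is a simplified, hypothetical version and not the paper's Algorithm 2: the thresholds `psi_limit` and `d2_limit`, the gains `b` and `c`, the steering range, and the sign convention (a negative increment is applied when `d2 > 0`) are all assumptions made for illustration and may differ from the simulator's conventions.

```python
from typing import Sequence


def reference_index(j: int, n: int) -> int:
    """Pick the reference entry: j while j < n, otherwise j - n (assumes j < 2n)."""
    return j if j < n else j - n


def refine_steer(
    steer: float,
    psi: float,
    d2: float,
    j: int,
    psi_ref: Sequence[float],
    d2_ref: Sequence[float],
    psi_limit: float = 0.2,
    d2_limit: float = 0.5,
    b: float = 0.5,
    c: float = 0.5,
) -> float:
    """Add proportional corrections to the predicted steer only when psi or d2 is out of range."""
    i = reference_index(j, len(psi_ref))
    correction = 0.0
    if abs(psi) > psi_limit:                       # heading deviates too much
        correction += b * abs(psi_ref[i] - psi)
    if abs(d2) > d2_limit:                         # lateral offset deviates too much
        correction += c * abs(d2_ref[i] - d2)
    # Assumed convention: the side of the center line fixes the sign of the correction.
    direction = -1.0 if d2 > 0.0 else 1.0
    return max(-1.0, min(1.0, steer + direction * correction))


# Inside the safe range the prediction passes through unchanged.
unchanged = refine_steer(0.1, 0.0, 0.1, 0, [0.0], [0.0])
# Outside the range the steer is pushed back toward the reference.
pushed = refine_steer(0.1, 0.0, 0.9, 0, [0.0], [0.0])
```

Using the absolute value of the error and a separate direction flag mirrors the point made in item (iii): the magnitude comes from the deviation, while the sign is set by which side of the center line the vehicle is on.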

Results and Discussion
In this section, we conduct extensive simulations to validate our method and compare it with the reinforcement learning approaches that are typically used in a similar setting. We start with the experiment setup and data preparation, then show how well our model fits the training data, and finally provide a comparison by examining the lane keeping performance in a simulation environment. To train the deep GP network, the state set $X$ (Table 2) is the input and the action set $Y$ (Table 3) is the output. The raw data are fed into the deep GP network without any additional processing before training.

Experimental Results.
In our case, we use GPy [41], an open-source framework developed by the Sheffield machine learning group, to conduct the simulation. We use a two-layer GP to fit the training data. The kernels we used per layer are as follows:

All the function expressions corresponding to these function names can be found in Table 1 or on the GPy documentation web page. Their automatic relevance determination (ARD) [35] versions follow as straightforward extensions. The number of inducing points used in each layer is 200. After optimization, the output action values are shown in Figures 3-5. In these figures, the legends "true data" and "predicted mean value" denote the action values in the training data and the values predicted by the trained deep GP, respectively. The green zone, whose margin is depicted by the green dashed line, represents the 95% credible interval of the predicted value. The x-axis variable Times represents the time steps of self-driving. We can see that the model $\tilde{M}$ captures most of the main features of the training data except for a few zones of strong oscillation.
Recall that we stated at the beginning that the deep GP model $\tilde{M}$ alone is not enough to solve the lane keeping task in our setting. We therefore compare the cumulative reward of the deep GP method with and without the feedback control method. Figure 6 shows that the deep GP model alone can finish only about half a loop trip on the CG road, whereas after combining it with the feedback control method, the accumulated reward increases to about 2.7 times that obtained using the deep GP model only. This demonstrates the effectiveness of the feedback control method. In Section 2.2, we already explained the main reasons why the lane keeping task cannot be completed using the deep GP model only. In addition, although the deep GP model captures uncertainty very well, it has no ability to correct wrongly predicted actions. In such a rapidly interactive environment, these unreasonable actions can be fatal, making the vehicle much more likely to rush off the track.
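For a Gaussian predictive distribution, a 95% credible interval such as the one drawn in these figures can be computed from the predicted mean and variance. The short sketch below uses only the Python standard library; the function name and the example numbers are hypothetical, and it assumes the predictive distribution at each time step is Gaussian.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def credible_interval(mean: float, variance: float, level: float = 0.95) -> Tuple[float, float]:
    """Central credible interval of a Gaussian with the given mean and variance."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)    # about 1.96 for level = 0.95
    half_width = z * sqrt(variance)
    return mean - half_width, mean + half_width


# Example with made-up values: predicted steer mean 0.05 and variance 0.0004.
lower, upper = credible_interval(0.05, 0.0004)
```

A wider interval at a given time step indicates that the model is less certain about the predicted action there.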

Experimental Comparison.
In Section 4.2, we showed how our model fits the training data and why the feedback control method is necessary in our case. We now compare their performance with the DDPG method. Compared with DDPG, the other two methods take more steps to complete a loop trip, and their total rewards are slightly lower. Accordingly, their curves of accumulated reward per step have flatter slopes than that of DDPG. Table 4 lists several properties of the three methods, such as Total Rewards, Training Time, and Training Data. Despite its advantages in iteration count and total reward, the well-trained DDPG network, which is used to generate the training data, takes about 16 hours to train, whereas training the deep GP model takes only about 1.5 hours. This is also much less than the training time of the network trained with the AMDDPG method reported in [36]. DDPG and AMDDPG need to interact with the environment in each episode to update the training data, and this procedure is repeated many times for exploration and exploitation. Thus, these data-hungry approaches need tens of thousands of data items, whereas our training data contain only about 340 items, far fewer than they require. All the simulations are conducted on the CG track in Torcs (see the overview map in the upper-right corner of Figure 1). Other, more complex tracks, shown in Figure 7, can be found in the Torcs engine or generated by an online tool named TrackGen [42].
With these benefits, we believe that the proposed framework is a promising way to make decisions in simulation environments and under actual road conditions. However, there are many technical problems to tackle before a real road test can be achieved, and the shortcomings of the proposed method should also be acknowledged. In this paper, we only test the method for the lane keeping task on a relatively simple road. On a complex road or for a complex task, more data would need to be fed into the deep GP model, and the feedback control method would also need to be carefully designed and validated. For example, for a simulation on the Curuzu track in Figure 7, more training data would have to be recorded from a well-performing reinforcement learning model. And for this road with many irregular turns, the feedback control method must be tested with extensive trial and error. To check the effectiveness of the proposed framework in real road tests similar to the simulation setting, we can record the training data with the aid of perceptual equipment during manual driving. After training the deep GP model off-line, we can test its validity in both autonomous driving and manual driving modes. In addition, the parameters in the feedback control method should also be tuned to complete the self-driving task. For more complex tasks, such as car-following or overtaking, since the feedback control method in our framework does not take the motion characteristics of the vehicle into consideration, we plan to combine our framework with other motion control methods, such as pure pursuit [43] or Stanley [44], in future work.

Conclusions
In conclusion, we presented an end-to-end learning method that combines a deep GP and a feedback control method to solve the decision-making problem of the lane keeping task in a self-driving simulation. Compared with deep reinforcement learning, the proposed method achieved almost the same performance with only 0.34% of the training data and about 10% of the training time. We believe this method is promising for complex self-driving tasks when only small training data are available.
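The two percentages can be checked against the figures quoted in the text with a few lines of arithmetic. The training times (about 1.5 and 16 hours) and the record count (338) are taken from the paper; the DDPG data volume computed below is only what the stated 0.34% would imply and is an inference, not a number reported here.

```python
# Figures quoted in the paper.
deep_gp_hours = 1.5       # approximate deep GP training time
ddpg_hours = 16.0         # approximate DDPG training time
records = 338             # deep GP training records
data_fraction = 0.0034    # the 0.34% stated in the conclusion

# Ratio of training times, to compare with the stated "about 10%".
time_ratio = deep_gp_hours / ddpg_hours

# Data volume implied for DDPG by the stated fraction (inferred, not reported).
implied_ddpg_records = records / data_fraction

print(f"training time ratio: {time_ratio:.1%}")
print(f"implied DDPG data volume: about {implied_ddpg_records:,.0f} items")
```

The time ratio comes out slightly below one tenth, consistent with the rounded 10% in the conclusion.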

Figure 2: A deep Gaussian process with one hidden node.

Figure 6: Cumulative reward in each step by deep GP with or without combining feedback control.

Table 1: The mathematical expressions of some kernel functions.

Table 2: The information of state $s_t$.

Table 3: The information of action $a_t$.
Figure 1: The framework of our solution: in each time step, the Torcs engine generates the vehicle state $s_t$. The trained deep GP model $\tilde{M}$ maps the state $s_t$ into the action $a_t = \tilde{M}(s_t)$. After that, the action $a_t$ is refined by the feedback control method $F$, i.e., $\tilde{a}_t = F(a_t, s_t, X)$. Finally, the virtual vehicle takes the action $\tilde{a}_t$. These procedures cycle until the vehicle finishes a single lap.
We collect the training data with the well-trained DDPG network on the CG road in Torcs. A total of 338 records ($N = 338$) are collected during the loop trip simulation. They contain the state set and the action set of the virtual vehicle. The details of the state $s_t$ and the action $a_t$ are shown in Tables 2 and 3, respectively.

Table 4: Comparison of three methods.