Learning the Car-following Behavior of Drivers Using Maximum Entropy Deep Inverse Reinforcement Learning

The present study proposes a framework for learning the car-following behavior of drivers based on maximum entropy deep inverse reinforcement learning. The proposed framework enables learning the reward function, which is represented by a fully connected neural network, from driving data, including the speed of the driver's vehicle, the distance to the leading vehicle, and the relative speed. Data from two field tests with 42 drivers were used. After clustering the participants into aggressive and conservative groups, the car-following data were used to train the proposed model, a fully connected neural network model, and a recurrent neural network model. Under fivefold cross-validation, the proposed model was shown to have the lowest root mean squared percentage error and modified Hausdorff distance among the different models, exhibiting a superior ability to reproduce drivers' car-following behaviors. Moreover, the proposed model captured the characteristics of different driving styles during car-following scenarios. The learned rewards and strategies were consistent with the demonstrations of the two groups. Inverse reinforcement learning can serve as a new tool to explain and model driving behavior, providing references for the development of human-like autonomous driving models.


Introduction
Recent studies have suggested that the development of autonomous driving may benefit from imitating human drivers [1][2][3]. There are two reasons. First, the comfort of autonomous vehicles (AVs) may be improved if the driving styles match the preferences of the passengers. Second, the transition period during which AVs will share the road with human-driven cars is expected to last for decades. Road safety may be enhanced if AVs are designed to understand how human drivers will react in different situations.
Car-following is one of the most common situations encountered by drivers. The modeling of car-following behavior has been a common research focus in the fields of traffic simulation [4], advanced driver-assistance system (ADAS) design [5], and connected driving and autonomous driving [6][7][8][9]. Various car-following models have been proposed since 1953 [10]. In general, there are two major approaches. The classical methods use several parameters to characterize the car-following behavior of drivers [11,12]. With the rapid development of data science, data-driven methods with a focus on learning the behavior of drivers based on field data [13,14] have emerged. Of the two approaches, data-driven car-following models were found to provide the highest accuracy and best generalization ability for replicating the drivers' trajectories.
Among data-driven methods, supervised learning and expressive models, such as neural networks (NNs), have been commonly used to learn the relationships between states and drivers' controls [15][16][17]. These modeling techniques are often referred to as behavior cloning (BC). Even though BC approaches have been successfully applied, they are prone to cascading errors [18], which is a well-known problem in the sequential decision-making literature. The reason is that inaccuracies occur in model predictions when there are insufficient data for training the model. Small inaccuracies accumulate during the simulation, which eventually leads the model to states not included in the training data and brings about even poorer predictions.
Inverse reinforcement learning (IRL) was introduced to overcome these drawbacks. IRL, which was proposed by Ng and Russell [19], is the inverse problem of reinforcement learning (RL). Although RL has been applied with great success in recent years, such as in the well-known program AlphaGo [20], the use of RL in other domains remains limited because it is challenging to determine the reward, which is the core component in RL. Manual tweaking of the reward functions can be tedious, and inappropriate reward assignments may lead to unexpected behaviors [21]. IRL, however, provides a framework to learn the rewards automatically. The advantages of IRL are twofold: the learned rewards can be used to improve the interpretability of the models, and the goals of the tasks can be understood, which may prevent cascading errors [22]. Therefore, the present study proposes a car-following model based on IRL. In contrast to a recent work, which applied IRL to model car-following using a linear reward representation [23], in this study, a nonlinear function, that is, an NN, is used to approximate the reward function, as the preferences of human drivers may be highly nonlinear. The proposed model is trained and tested using data collected under actual driving conditions, and its performance is compared with that of other car-following models.
The rest of the paper is organized as follows: Section 2 briefly reviews the literature on car-following modeling, RL, and IRL. Section 3 presents the input feature vectors of the reward network in the IRL and the proposed algorithm. Section 4 describes the experiments and data used in this study. Section 5 elaborates on the training process of the proposed model and presents the investigated car-following models. Section 6 compares the performance of the different methods and presents the characteristics of the models trained using data from drivers with different driving styles. The final section presents the discussion and conclusion.

Background
The car-following process is essentially a sequential decision-making problem where drivers continually adjust the longitudinal control a based on the states s they encounter, which include the speed of the driver's car, the spacing between the driver's car and the leading car, and the relative speed between the two vehicles. Car-following models are designed to model the policy π(a|s) of drivers.

Classical Car-following Models.
The early General Motors models proposed by Chandler [24] modeled the drivers' longitudinal controls to minimize the relative speed because this is one of the primary objectives of car-following. These models exhibited poor performance in predicting the distance between cars. Later models addressed this problem by considering another objective of car-following, that is, maintaining the desired distance; these models included the Gipps model [25] and the intelligent driver model (IDM) [12].

Behavior Cloning Car-following Models.
As high-fidelity driving data have become increasingly available, data-driven models such as NNs have been used to model car-following behavior. NNs have been demonstrated to exhibit excellent performance for estimating nonlinear and complex relationships. In 2003, Jia et al. [16] proposed an NN-based car-following model with two hidden layers that takes speed, relative speed, spacing, and desired speed as inputs. Chong et al. [15] simplified the architecture proposed by Jia to one hidden layer and obtained similar results. Instead of using only a single time step of relevant information as input, as in the conventional NN-based models, Zhou et al. [17] proposed a recurrent neural network (RNN)-based model that used a sequence of past driving information as input. The RNN approach adapted better to changes in traffic conditions than the NN approaches. The present study also uses the RNN-based model to compare its performance with that of the proposed method.

Reinforcement Learning.
In RL, a sequential decision-making problem is modeled as a Markov decision process (MDP), which is defined as a tuple M = ⟨S, A, T, r, γ⟩. S and A denote the state and action spaces, respectively, and T denotes the transition matrix, which is defined in equation (1). r and γ denote the reward function and the discount factor, respectively.
In equation (1), v(t), Δv(t), and h(t) denote the speed of the ego vehicle, the relative speed with respect to the lead vehicle, and the spacing between the ego vehicle and the leader at time step t, respectively. Δt is the simulation time interval, which is 0.1 s in this study, and v_lead denotes the speed of the lead vehicle, which was obtained from the collected data. RL assumes that drivers follow a policy that maximizes long-term rewards. Once the rewards are known, the policy can be determined using algorithms such as Q-learning [26]. In recent years, RL has been applied by researchers to solve real-world problems such as the balance control of a robot and the energy management of hybrid electric vehicles [27][28][29].
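The state transition described above can be illustrated with a short sketch. The update rule below is an assumption (a forward-Euler step on speed, with the spacing integrated using the current relative speed and the relative speed defined as lead speed minus ego speed); it is not taken from the source, and the function name `step` and its arguments are hypothetical.

```python
DT = 0.1  # simulation time interval in seconds, as stated in the text


def step(v, h, dv, a, v_lead_next, dt=DT):
    """Advance the car-following state by one time step.

    Assumed kinematics (illustrative, not confirmed by the source):
      v(t+1)  = v(t) + a(t) * dt
      h(t+1)  = h(t) + dv(t) * dt
      dv(t+1) = v_lead(t+1) - v(t+1)
    where dv is the lead speed minus the ego speed.
    """
    v_next = max(0.0, v + a * dt)    # speed cannot become negative
    h_next = h + dv * dt             # spacing shrinks when the ego car is faster
    dv_next = v_lead_next - v_next   # leader speed comes from recorded data
    return v_next, h_next, dv_next
```

Because the leader's speed is read from the data at every step, the transition depends on the trajectory being replayed rather than on the state and action alone.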

Inverse Reinforcement Learning.
In IRL, the reward of a state can be represented by a linear combination of the relevant features (equation (2)). The goal of IRL is to determine the weights θ from expert demonstrations.
Abbeel and Ng [30] proposed a feature matching strategy to solve the problem (equation (3)). As long as the feature expectation of the simulated trajectories equals the features calculated from the expert data, the learned behavior has the same performance as the demonstrator. However, many different policies can satisfy the feature matching conditions, so the ambiguity regarding the correct reward and policy remains unsolved.
$$\sum_{\zeta} P(\zeta \mid \pi)\, f_{\zeta} = \tilde{f}, \qquad P(\zeta \mid \pi) = \prod_{t} \pi(a_t \mid s_t)\, T(s_{t+1} \mid s_t, a_t). \tag{3}$$
Here ζ denotes a trajectory, f_ζ denotes the features accumulated along it, and f̃ denotes the features calculated from the expert data.
The maximum entropy IRL (Max-Ent IRL) proposed by Ziebart [31] addressed the ambiguity problem by incorporating the principle of maximum entropy into IRL. In the Max-Ent IRL framework, the probability of a trajectory is proportional to the exponential of the sum of the rewards accumulated along the trajectory (equation (4)). This form of distribution guarantees that no preferences are introduced other than the feature matching requirement. When the probability of a trajectory is known, the weights of the reward can be determined by maximizing the log-likelihood of the expert data using the objective function in equation (5).
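As a small numerical illustration of the Max-Ent trajectory distribution and the log-likelihood objective, the sketch below normalizes over a finite set of candidate trajectories. Restricting the partition function to an enumerated candidate set is a simplifying assumption for illustration, and the function names are hypothetical.

```python
import math


def maxent_trajectory_probs(trajectory_rewards):
    """Return P(zeta) proportional to exp(sum of rewards) for each trajectory."""
    returns = [sum(r) for r in trajectory_rewards]
    m = max(returns)                           # subtract the max for stability
    weights = [math.exp(g - m) for g in returns]
    z = sum(weights)                           # partition function over candidates
    return [w / z for w in weights]


def log_likelihood(expert_indices, probs):
    """Log-likelihood of the expert trajectories under the distribution."""
    return sum(math.log(probs[i]) for i in expert_indices)
```

Trajectories with equal accumulated reward receive equal probability, which is the sense in which the distribution adds no preference beyond the reward itself.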

Maximum Entropy Deep Inverse Reinforcement Learning.
Since the linear representation of the rewards might limit the accuracy of reward approximation, Wulfmeier [32] extended the method to nonlinear models using deep NNs. Deep architectures have been shown to capture the nonlinear reward structure in several benchmark tasks with high accuracy. The present study uses deep architectures to represent the rewards of drivers in car-following. The fully connected NNs used in this study map the input features in the car-following model to the estimated rewards, as shown in Figure 1. It can be derived that the gradient of the Max-Ent deep IRL (DIRL) is as follows:
$$\frac{\partial L}{\partial \theta} = \left(\mu_D - \mathbb{E}[\mu]\right) \frac{\partial}{\partial \theta}\, g(f, \theta),$$
where μ_D and E[μ] refer to the state visitation frequencies calculated from the expert demonstrations and the expected state visitation frequencies obtained from the learned policy, respectively, and g(f, θ) refers to the network architecture. Once the gradient is calculated, the parameters of the NN are updated using backpropagation [33].
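The chain-rule structure of this gradient can be sketched for a tiny reward network. The example below assumes a single tanh hidden layer without biases, which is a simplification chosen for brevity and not the architecture used in the study; all names are hypothetical.

```python
import math


def reward_and_grads(f, W1, w2):
    """Forward pass r = w2 . tanh(W1 f) and gradients of r w.r.t. W1 and w2."""
    z = [sum(wij * fj for wij, fj in zip(row, f)) for row in W1]
    hdn = [math.tanh(zi) for zi in z]
    r = sum(w * hi for w, hi in zip(w2, hdn))
    d_w2 = hdn[:]                                            # dr/dw2_k = h_k
    d_W1 = [[w2[k] * (1.0 - hdn[k] ** 2) * fj for fj in f]   # dr/dW1_kj
            for k in range(len(W1))]
    return r, d_W1, d_w2


def dirl_gradient(features, mu_expert, mu_policy, W1, w2):
    """Accumulate (mu_D(s) - E[mu](s)) * dr(s)/dtheta over all states."""
    g_W1 = [[0.0] * len(W1[0]) for _ in W1]
    g_w2 = [0.0] * len(w2)
    for f, mu_d, mu_e in zip(features, mu_expert, mu_policy):
        err = mu_d - mu_e            # visitation mismatch for this state
        _, d_W1, d_w2 = reward_and_grads(f, W1, w2)
        for k in range(len(w2)):
            g_w2[k] += err * d_w2[k]
            for j in range(len(f)):
                g_W1[k][j] += err * d_W1[k][j]
    return g_W1, g_w2
```

When the policy's visitation frequencies match those of the demonstrations, every term vanishes and the parameters stop changing.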

The Proposed Car-following Model
In this section, the details of the proposed model (DIRL) are explained, including the design of the input features for the reward network and the full algorithm. The DIRL model uses as input the driver data on car-following trajectories, consisting of speed during car-following, spacing to the leading car, and relative speed. After training, the DIRL model outputs the policy and the rewards of drivers. A discrete state and action space was defined in the present study.
According to the rules for determining car-following events that will be described in Section 4.2 and the distribution of the empirical data used in this study, the spacing h is limited to the range from 0 to 120 m with an interval of 0.5 m, the speed v is limited to the range from 0 to 33 m/s with an interval of 0.5 m/s, and the relative speed Δv is limited to the range from −5 to 5 m/s with an interval of 0.5 m/s. The action a is limited to the range from −3 to 2 m/s² with an interval of 0.2 m/s².
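The discretization above can be written down directly. The sketch below builds the grids from the stated ranges and intervals; the helper names `grid` and `nearest_index` are hypothetical, and clipping out-of-range values to the nearest grid edge is an assumption.

```python
def grid(lo, hi, step):
    """Inclusive, evenly spaced grid built from an integer count to avoid float drift."""
    n = int(round((hi - lo) / step)) + 1
    return [lo + i * step for i in range(n)]


SPACING = grid(0.0, 120.0, 0.5)    # m
SPEED = grid(0.0, 33.0, 0.5)       # m/s
REL_SPEED = grid(-5.0, 5.0, 0.5)   # m/s
ACTIONS = grid(-3.0, 2.0, 0.2)     # m/s^2


def nearest_index(values, x):
    """Index of the grid point closest to x, clipping values outside the range."""
    lo, step = values[0], values[1] - values[0]
    i = int(round((x - lo) / step))
    return min(max(i, 0), len(values) - 1)
```

With these ranges the grid has 241 × 67 × 21 states and 26 actions, which is small enough for tabular value iteration.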

Feature Selection for the Rewards in Car-following.
As introduced in the last section, the input features of the network are determined first to create an NN and obtain the rewards in car-following. The rewards in RL encode the objectives or the purpose of the agent [26]. Therefore, the selected features should represent the objectives of drivers in the car-following task.
In the study of Gao [23], speed and spacing were chosen as features for representing the rewards. In [34], the reward function represented the speed discrepancies between the simulated trajectories and the test data. In contrast to these studies, we base the reward function on the following features.

Time-Headway.
Time-headway (TH) has been widely used as an indicator for drivers to evaluate risk during car-following [35]; TH is defined as the time between two vehicles passing the same point on the road. It has been suggested that a driver's safety margin in car-following can be characterized by the TH, which plays a role in the driver's decision-making [36]. Drivers may have different desired safety margins for the TH. For example, aggressive drivers may prefer a shorter TH than conservative drivers because they like to track vehicles at a closer distance. It has been suggested that one of drivers' objectives in car-following is to control TH to their expectations [37]. Therefore, TH is selected as an input of the reward network in this study.

Relative Speed.
Research has shown that the drivers' speed control in car-following is proportional to the relative speed [38]. As mentioned earlier, an objective in car-following is to keep the relative speed close to zero [37]. In this study, we relax this objective so that drivers will keep the relative speed within an appropriate range because people's driving behavior is imperfect and is not always optimal.
Following the method presented in [23], these two features were mapped into a high-dimensional space using the Gaussian radial kernel, where s_i = (TH_i, ΔV_i) denotes the kernel vectors, which represent the conjectural values of the preferred TH and relative speed, and σ is a parameter that controls the width of the kernel function. Specifically, TH_i has a range of 0.5 s to 3 s, with an interval of 0.5 s, and ΔV_i has a range of −4 m/s to 4 m/s, with an interval of 0.5 m/s in this study.
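A minimal sketch of the kernel mapping follows. The kernel form exp(−‖s − s_i‖² / (2σ²)), the placeholder value σ = 1.0, and the approximation of TH as spacing divided by speed are assumptions made for illustration; only the ranges of the kernel centers come from the text.

```python
import math

TH_CENTERS = [0.5 + 0.5 * i for i in range(6)]     # 0.5 s ... 3.0 s
DV_CENTERS = [-4.0 + 0.5 * i for i in range(17)]   # -4 m/s ... 4 m/s


def rbf_features(th, dv, sigma=1.0):
    """Gaussian radial kernel responses of (th, dv) to every (TH_i, dV_i) center.

    Assumed form: exp(-||s - s_i||^2 / (2 * sigma^2)); sigma=1.0 is a placeholder.
    """
    feats = []
    for th_c in TH_CENTERS:
        for dv_c in DV_CENTERS:
            d2 = (th - th_c) ** 2 + (dv - dv_c) ** 2
            feats.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return feats


def time_headway(spacing, speed, eps=1e-6):
    """Approximate TH as spacing / speed, guarded against division by zero."""
    return spacing / max(speed, eps)
```

Each state is thus described by 6 × 17 = 102 kernel responses, and the response is largest for the center closest to the driver's current TH and relative speed.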

Maximum Speed.
The maximum desired speed is commonly used in many classical car-following models [12,16]. Drivers may have a preferred maximum speed, and they may not continue to follow the leader if their speed is already above this value. It is assumed that the objective of the driver is to keep the speed below the maximum speed v_max^i, the conjectural acceptable maximum speed, which ranges from 90 km/h to 120 km/h with an interval of 5 km/h. The reward function is represented by an NN that is parameterized by θ.

The Full Algorithm.
The proposed DIRL algorithm consists of three parts, which are marked in bold in Algorithm 1. In the first part, the reward r_i(s) given by the current parameters of the NN is used to calculate the policy π_i(a|s). Value iteration with a softmax function is used to solve for the policy based on the reward. The result of the softmax version of value iteration is a stochastic policy in which the probabilities of every predefined action are listed in tabular form. V(s) and Q(s, a) in this part denote the expected long-term return of states and state-action pairs, respectively.
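The softmax version of value iteration can be sketched on a small tabular problem. The example below assumes deterministic, time-invariant transitions and a placeholder discount factor of 0.9; neither is taken from the source, and the function names are hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(rewards, next_state, gamma=0.9, n_iter=200):
    """Softmax value iteration on a small deterministic tabular MDP.

    rewards[s]       : reward of state s (e.g. the output of the reward network)
    next_state[s][a] : index of the state reached from s under action a
    Returns (V, policy) with policy[s][a] = exp(Q(s, a) - V(s)).
    """
    n_s, n_a = len(rewards), len(next_state[0])
    V = [0.0] * n_s
    for _ in range(n_iter):
        Q = [[rewards[s] + gamma * V[next_state[s][a]] for a in range(n_a)]
             for s in range(n_s)]
        V = [logsumexp(q) for q in Q]   # soft maximum over actions
    policy = [[math.exp(q - V[s]) for q in Q[s]] for s in range(n_s)]
    return V, policy
```

Replacing the hard maximum with a log-sum-exp is what turns the result into a stochastic policy: actions with higher Q-values are more likely, but none has zero probability.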
In the second part, the policy π_i(a|s) is applied to estimate the expected state visitation frequencies μ_i(s). The original version for estimating μ_i(s), as reported in [31], is not suitable for car-following tasks because the speed of the lead vehicle is always changing. Simply applying policy propagation [32] for every trajectory in the data can be time-consuming. Therefore, in this study, we perform sampling by running the policy in the simulation of drivers' car-following trajectories N_2 times to approximate μ_i(s). During the simulation, the action at every time step was randomly sampled from the policy based on the probability of every action.
In the third part, the gradients are calculated by subtracting the estimated μ_i(s) from the state visitation frequencies μ_D obtained from the data. Subsequently, the parameters of the NN are updated by backpropagation.
These steps are repeated until convergence. The training of the algorithm can be stopped when the rewards accumulated in the trajectories stop increasing.
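The sampling step can be sketched as a Monte Carlo estimate. For brevity, the example assumes a fixed deterministic transition table, whereas in the study the transition depends on the recorded leader speed of each trajectory; the function names and the seeding scheme are hypothetical.

```python
import random


def sample_action(probs, rng):
    """Draw an action index from a discrete probability vector."""
    u, acc = rng.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if u < acc:
            return a
    return len(probs) - 1   # guard against rounding in the cumulative sum


def estimate_visitation(policy, next_state, start_state, horizon, n_rollouts, seed=0):
    """Monte Carlo estimate of state visitation frequencies under a stochastic policy."""
    rng = random.Random(seed)
    counts = [0] * len(policy)
    for _ in range(n_rollouts):
        s = start_state
        for _ in range(horizon):
            counts[s] += 1
            s = next_state[s][sample_action(policy[s], rng)]
    total = n_rollouts * horizon
    return [c / total for c in counts]
```

Here `n_rollouts` plays the role of N_2 in the text: more rollouts reduce the variance of the estimate at a proportional cost in simulation time.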

Data Description.
Data from two field tests that were conducted in Huzhou city in Zhejiang province and Xi'an city in Shaanxi province were used in this study. Forty-two drivers participated in the tests. Their driving experience ranged from 2 to 30 years, with an average of 15.2 years. During the tests, the participants were only informed of the starting location and destination, and they were asked to follow their normal driving styles. The test data were collected by a Volkswagen Touran equipped with instruments and sensors, as illustrated in Figure 2. The test route consisted of diverse driving scenarios such as urban roads and highways, as shown in Figure 3. The other details of the field tests are described in [39,40].

Extraction of Car-following Events and Data Filtering.
We applied the rules described in [41] to extract the car-following events from the obtained data: (1) the test vehicle was following the same lead car; (2) the distance to the lead car was less than 120 m to eliminate free-flow traffic conditions; (3) the follower and the leader were in the same lane; (4) the duration of the car-following event was longer than 15 s. The extracted events were then manually reviewed by checking the videos recorded by the front camera on the instrumented vehicle to guarantee good data quality. Eventually, nearly one thousand car-following events were extracted. A moving average filter with a 1 s window was applied to remove noise from the extracted car-following data.
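Two of the steps above, the smoothing filter and the numeric screening rules (2) and (4), can be sketched as follows. The centered window with shrinking edges and the function names are assumptions; the sampling rate of the recorded data is not stated in the text, so the number of samples that corresponds to 1 s is left to the caller.

```python
def moving_average(x, half_width):
    """Centered moving average; the window shrinks near the ends of the series."""
    out = []
    for i in range(len(x)):
        seg = x[max(0, i - half_width): i + half_width + 1]
        out.append(sum(seg) / len(seg))
    return out


def is_valid_event(spacing, dt, max_spacing=120.0, min_duration=15.0):
    """Check rules (2) and (4): spacing below 120 m and duration above 15 s."""
    duration = len(spacing) * dt
    return duration > min_duration and all(h < max_spacing for h in spacing)
```

Rules (1) and (3), same leader and same lane, depend on sensor tracking and video review and are not represented here.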

Driving Style Clustering.
The participants displayed diverse driving styles, which were evident in the driving data. The k-means algorithm was used to cluster the drivers into different driving styles. Previous studies have adopted kinematic features such as spacing, speed, and relative speed or time-based features such as TH and time to collision (TTC) for driving style clustering [34,39]. In this study, multiple combinations of these features were tested as inputs for the k-means algorithm, and the quality of the clustering results was then evaluated by the silhouette coefficient, where a larger silhouette coefficient indicates a better result. Finally, the mean TH and the mean TH when braking were chosen because this combination achieved the highest silhouette coefficient [42]. The number of clusters was also set to two based on the silhouette coefficient. Figures 4 and 5 present the boxplots of the mean TH and the mean TH when braking, respectively, for the conservative group, which consisted of 16 drivers, and the aggressive group, which consisted of 26 drivers. The aggressive group had significantly lower mean TH (t = 6.748, p < 0.001) and mean TH when braking (t = 7.655, p < 0.001) than the conservative group. The descriptive statistics (Table 1) of the two groups confirmed the clustering results. The aggressive drivers had shorter mean spacing and higher mean speed and mean acceleration than the conservative drivers.
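The clustering and its evaluation can be sketched with a plain implementation of Lloyd's k-means and the mean silhouette coefficient. The deterministic initialization and the toy feature values in the usage note are assumptions for illustration, not the study's data or settings.

```python
import math


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with a simple deterministic initialization."""
    step = max(1, len(points) // k)
    centers = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:   # keep the old center if a cluster empties
                centers[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                              # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())     # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

With each driver represented by a pair such as (mean TH, mean TH when braking), running `kmeans` for several values of `k` and keeping the one with the highest `silhouette` mirrors the selection procedure described above.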

Evaluation Metrics.
Two metrics, the root mean square percentage error (RMSPE) (equation (10)) and the modified Hausdorff distance (MHD), were used to evaluate the accuracy of the car-following models for reproducing drivers' car-following trajectories. As suggested by Punzo and Montanino [43], the cumulative sum of the errors is an appropriate option to evaluate the performance of car-following models.
In equation (10), RMSPE(speed) denotes the RMSPE of speed, RMSPE(spacing) denotes the RMSPE of spacing, v_n^obs(t) and h_n^obs(t) are the speed and spacing at time t in the observed nth trajectory, and v_n^simu(t) and h_n^simu(t) are the simulated speed and spacing at time t for the nth trajectory. The MHD is an extension of the Hausdorff distance, which represents the distance between two sets of points C = {c_1, c_2, ..., c_{N_c}} and B = {b_1, b_2, ..., b_{N_b}}, as defined in equation (11). The median of the MHD (MHD_50) has been used to evaluate the similarity of simulated and actual trajectories in modeling defensive driving strategies [44] and urban route planning [45].
Since the proposed DIRL model outputs a stochastic policy, the two metrics were calculated by averaging the results of 10 simulations for every trajectory in the data.
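Both metrics can be sketched in a few lines. The RMSPE form below, the square root of the summed squared errors divided by the summed squared observations, is an assumption in the spirit of the cumulative-sum formulation mentioned above; the MHD follows the usual definition as the larger of the two directed mean nearest-neighbor distances.

```python
import math


def rmspe(sim, obs):
    """Root mean square percentage error using cumulative sums over all samples.

    Assumed form: sqrt(sum((sim - obs)^2) / sum(obs^2)).
    """
    num = sum((s - o) ** 2 for s, o in zip(sim, obs))
    den = sum(o ** 2 for o in obs)
    return math.sqrt(num / den)


def modified_hausdorff(C, B):
    """Modified Hausdorff distance between two point sets (tuples of coordinates)."""
    def directed(X, Y):
        # average, over points in X, of the distance to the closest point in Y
        return sum(min(math.dist(x, y) for y in Y) for x in X) / len(X)
    return max(directed(C, B), directed(B, C))
```

For a stochastic policy, these functions would be evaluated on each of the repeated simulations of a trajectory and the results averaged.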

Model Training.
The k-fold cross-validation method was applied to evaluate the performance of the car-following models. Specifically, the car-following datasets of the two groups of drivers were randomly divided into 5 groups with an equal number of trajectories. One group was taken as the test set and the remaining four groups were taken as the training set. The training and test experiments were repeated five times so that every group was used as the test set once. Finally, the performance of the car-following models was evaluated by the average value of the two metrics. The Adam optimizer [46] with learning rate decay was applied to train the DIRL model. The hyperparameters used for training are listed in Table 2. L2 regularization was used to prevent overfitting of the reward network. Figures 6 and 7 present the change in the RMSPE of spacing and the change in the cumulative normalized rewards per trajectory in one of the cross-validation experiments, respectively. After about 5 iterations, the RMSPE of spacing for the training set and the test set starts to converge. The rewards collected in the trajectory remain stable after about the same number of iterations.
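The fivefold split can be sketched as below. The shuffling seed and the interleaved assignment of indices to folds are illustrative choices, and `k_fold_splits` is a hypothetical helper name.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # near-equal fold sizes
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

Each trajectory index appears in exactly one test fold, so averaging the metrics over the five runs uses every trajectory once for testing.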

The Investigated Models.
The accuracy and generalization ability of the proposed model were compared with those of two other data-driven car-following models, that is, the NN-based model and the RNN-based model.

NN-Based Car-following Model.
A fully connected neural network with one hidden layer was built following the study conducted by Chong et al. [15]. The hidden layer consisted of 60 neurons in this study. The NN-based model takes speed, spacing, and relative speed as inputs and outputs the acceleration for the current time step. The objective of minimizing the difference between the empirical acceleration and the model's predictions was adopted to train the model (equation (12)).
In equation (12), w and b denote the weights and biases in the NN-based model, a_n^simu(t) denotes the predicted acceleration at time step t for the nth trajectory, and a_n^obs(t) denotes the empirical acceleration at time step t for the nth trajectory.
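The behavior cloning baseline can be sketched as a forward pass plus a squared-error loss on acceleration. The tanh activation, the initialization range, and the use of a mean squared error are assumptions, since the exact form of the loss is not reproduced in the text; only the three inputs and the 60 hidden neurons come from the description above.

```python
import math
import random


def init_params(n_in=3, n_hidden=60, seed=0):
    """Small random weights for a one-hidden-layer network."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return W1, b1, w2, 0.0


def predict_acceleration(params, v, h, dv):
    """Map (speed, spacing, relative speed) to an acceleration."""
    W1, b1, w2, b2 = params
    x = (v, h, dv)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * hi for w, hi in zip(w2, hidden)) + b2


def acceleration_mse(params, samples):
    """Mean squared error between predicted and observed accelerations."""
    errs = [(predict_acceleration(params, v, h, dv) - a_obs) ** 2
            for v, h, dv, a_obs in samples]
    return sum(errs) / len(errs)
```

Because this loss scores each time step in isolation, it does not penalize the drift in spacing that builds up when the predictions are fed back through the simulation.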

RNN-Based Car-following Model.
The architecture of the RNN-based model built in this study is in line with the study conducted by Zhou et al. [17]. The number of hidden neurons in the RNN model was set to 60. The RNN model takes as input a sequence of historical information that lasts for 1 s and outputs the acceleration for the current time step. The speed and spacing for the next time step were then estimated based on the state transition matrix described in equation (1). The training of the RNN model adopted the loss function shown in equation (13), which minimizes the RMSPE of speed and spacing.
In equation (13), w and b denote the weights and biases in the RNN model, v_n^obs(t) and h_n^obs(t) are the speed and spacing at time t in the observed nth trajectory, and v_n^simu(t) and h_n^simu(t) are the simulated speed and spacing at time t for the nth trajectory.
Algorithm 1 (opening steps).
Randomly initialize the parameters of the neural network as θ_1
For i = 1 to N_1 do
  Determine the reward for every state by applying forward propagation in the neural network
  Use the softmax version of value iteration to obtain the policy

Performance Comparison.
The average performances of the three models in the fivefold cross-validation tests using the data from the aggressive and conservative groups are compared in this section. Tables 3 and 4 present the results on the training sets and the test sets, respectively. The DIRL had the lowest RMSPE of spacing and MHD_50 in both the training sets and the test sets. Although the NN and the RNN models had lower RMSPE of speed in the test sets, the overall error of the DIRL in reproducing drivers' trajectories was lower than that of the other two models. Of the two BC models, the RNN outperformed the NN model as it achieved lower RMSPE and MHD_50. Figure 8 presents the simulation results of speed and spacing for two car-following periods randomly selected from the datasets. As can be seen, the DIRL model tracks the empirical speed and spacing more closely than the other two models. The simulated speed profiles of the NN and RNN models are smoother than those of the DIRL model because the former models output a continuous action, while the latter model outputs a discrete action.

The Learned Characteristics of the Model.
Since the proposed model was trained with data from two groups of drivers with different driving styles, we expected the learned models to exhibit the features of the respective groups. Therefore, the learned values of the two driving styles, which represent the expected long-term return, are compared in this section. As depicted in Figure 9, the states with a higher value represent the preferable states, which drivers try to achieve during car-following. For the same distance to the lead vehicle, the aggressive drivers preferred a higher speed than the conservative drivers. The high-value area (V ≥ 0.8, in red) for the aggressive drivers has a steeper slope, as indicated by the angle θ between the black dashed line and the x-axis. Since the cotangent of the angle θ is proportional to the value of TH, a larger angle means a shorter TH. Hence, the comparison of the angle θ in the two figures shows that the aggressive drivers favor a shorter TH. In addition, the high-value area is wider for the aggressive drivers than for the conservative drivers, which indicates that the aggressive drivers' preferred TH has a larger variance than that of the conservative drivers. This result is in good agreement with the details shown in the boxplot of TH for the two groups of drivers in Figure 4.
It is also found that the high-value region of the speed becomes wider with an increase in the spacing to the lead vehicle in the two figures. The interpretation is that when the spacing is small, drivers must control the speed more precisely to avoid a collision. As the distance increases, drivers have more flexibility for speed control. The learned policies of the two groups were compared by assuming that both groups were following the same leader. The initial states of this car-following event and the speed of the leader were taken from the collected data. The learned stochastic policy was run 20 times for both groups. As shown in Figure 10, the aggressive group (in blue) maintained a smaller distance than the conservative group (in red) during the simulation. Both the aggressive and conservative drivers accelerated to follow the leader. However, the aggressive drivers increased their speed more quickly in the first 4 s, resulting in a smaller distance to the leader than that of the conservative drivers.

Discussion and Conclusion
In this study, we propose a car-following model based on Max-Ent DIRL. The proposed model learns the rewards of drivers during car-following, which are approximated by an NN. The policy of drivers was solved by an RL algorithm, the softmax version of value iteration. Tested on actual driving data, the proposed model outperformed the BC models (NN and RNN) by providing the lowest RMSPE of spacing and MHD_50 in replicating drivers' car-following trajectories. The better performance of the proposed model can be explained by its more general objective compared with the BC models. The DIRL model reproduces drivers' policy by first learning drivers' decision-making mechanisms (i.e., the rewards), whereas the BC approaches only learn the state-action relationships. Since the policy was solved by an RL algorithm that is based on the assumption of maximizing long-term rewards, the obtained policy has the ability of long-term planning. In contrast, the BC methods do not include long-term planning in their model training objectives. The simulation results for the two car-following trajectories confirmed the superior long-term planning ability of the DIRL model. The deviation between the simulated spacing and the empirical data for the BC models becomes larger as the simulation continues. In contrast, the simulation error does not accumulate during the simulation for the DIRL model. Moreover, the better performance of the RNN model relative to the NN model found in this study is in line with previous studies [17,34]. Compared with the NN model, which relies only on information in the current time step for prediction, the use of historical information makes the RNN model more suitable for time series prediction. The present study also demonstrates that the proposed model could capture the characteristics of different driving styles of human drivers. The learned value and policy matched those of the drivers with distinct driving styles.
The fully connected NN applied in this study was trained to capture the relevant features that represented the drivers' preferences or objectives in car-following scenarios. The IRL method used in this study provides a new perspective to explain driver behavior and to model different driving strategies. However, solving the IRL problem is computationally expensive, which makes it challenging to apply to high-dimensional systems. Recent studies that have applied adversarial learning to IRL have shown an ability to scale the method to solve complex problems [22,47]. Future studies should consider these new approaches. The present study had some important limitations. First, the participants in the present study were all male, so a broader sample is needed in future research. Second, the proposed model does not consider drivers' reaction delay and memory effect for speed control during car-following. Future studies should take these factors into account.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.