Sim-to-Real Reinforcement Learning for Autonomous Driving Using Pseudosegmentation Labeling and Dynamic Calibration

Applying reinforcement learning algorithms to autonomous driving is difficult because of mismatches between the simulation in which the algorithm was trained and the real world. To address this problem, data from global navigation satellite systems and inertial navigation systems (GNSS/INS) were used to generate pseudolabels for semantic segmentation. A very simple dynamics model was used as a simulator, and its dynamic parameters were obtained by linear regression on manual driving records. Segmentation and the dynamic calibration method were found to be effective in easing the transition from simulation to the real world, and segmentation models trained on the pseudosegmentation labels were found to be more suitable for reinforcement learning models than models trained on human-made labels. We tested the efficacy of the proposed method, and a vehicle using the proposed system successfully drove on an unpaved track for approximately 1.8 km at an average speed of 26.57 km/h without incident.


Introduction
Owing to improvements in deep convolutional neural network architectures and graphics processing units, recent research has aimed at applying deep learning algorithms to autonomous driving tasks. Previously, traditional computer vision algorithms, including edge detection and template matching, were used to infer how the vehicle should drive. More recently, researchers have used deep learning methods, including convolutional neural networks (CNNs), to leverage their complex features and enable self-driving algorithms to behave more intelligently.
Imitation learning has been a common approach for autonomous driving [1]. During imitation learning, a CNN is trained to learn human-like control from a given image and features. However, there are some drawbacks associated with imitation learning. First, imitation learning cannot encompass the very wide range of cases that can occur while driving. Additionally, imitation learning requires large amounts of labeled data for training, which must be collected from actual driving environments and is therefore cost-ineffective and labor-intensive.
Off-road driving environments are quite different from road driving environments, wherein the path is kept much more standard and uniform. In off-road environments, there exists much more variability, such as grass growing in the middle of the road, road curvature changes due to seasonal factors like rain and snow, and even color changes in the road itself before and after rain. Autonomous driving based on images in off-road environments is inevitably affected by these various disturbances. To address this problem, we require the ability to generalize across off-road driving environments, and we also need a robust algorithm that can make the best choice in a wide variety of driving situations.
In contrast, reinforcement learning (RL) has several advantages over imitation learning. Agents can learn how to drive over many trials in a simulation, and they can be trained on a near-infinite number of possible cases without the need for labeled data. Moreover, reinforcement learning has the potential to outperform human drivers because the driving performance of systems trained through reinforcement learning is not limited by the training dataset.
Nevertheless, deploying reinforcement learning in a real vehicle remains challenging, in part because of distribution shifts between the simulations in which models are trained and the real world [2]. Distribution shift is one of the main reasons why a trained model might perform poorly in a real-world test environment. Covariate shift, specifically, refers to the difference between training input data and testing input data [3]. Because a simulation cannot perfectly reconstruct the real world, there are often mismatches between real and simulated scenes, which negatively affect a model's driving performance [4]. Concept shift, in contrast, refers to a difference in the relationship between inputs and their labels [2]. For example, the correct control command for a given image might differ between a simulation and the real world because of the differences in the dynamics of the two environments. The covariate shift between a simulator and the real world can be relieved using intermediate representations of the input [5]. For example, two-class semantic segmentation narrows the gap between the simulator and the real world: simulators can easily produce binary images, and real-world images can be converted into binary images by semantic segmentation networks. Using these binary images instead of the raw images can help reduce covariate shift.
In this article, dynamic calibration was used to enable a trained model to behave similarly in both simulations and the real world. A simple vehicle dynamics model was used, and only four parameters were fine-tuned using linear regression to mitigate problems that may arise from distribution shift, covariate shift, and dynamics mismatch. The experiments conducted in this article demonstrated that this simple method was effective in reducing the concept shift between the simulation and the real environment. Whereas modeling the complicated car dynamics in a way that considers the many relevant parameters requires significant computational effort and often cannot be generalized to other types of vehicles, our simple, data-driven approach can be used on different devices and vehicles with little modification.
The overall architecture of the proposed method, and in particular its training phase, is illustrated in Figure 1. There are two parts to the training phase. The first is training the semantic segmentation network with labels synthesized from the global navigation satellite system (GNSS) and inertial navigation system (INS) data. The second is training the RL model using our calibrated simulator. Our simulator and the synthesized labels share the same road width and camera parameters, including focal length, camera height, and tilt angle. Figure 2 shows the testing phase of our method, wherein semantic segmentation and RL model inference are conducted sequentially. The steering and throttle values produced by the RL model are then passed to the control system of the test vehicle.
In this article, our main contributions are as follows:

Related Works
Autonomous driving has become a key research area in the field of artificial intelligence. Pomerleau [6] introduced ALVINN, a military vehicle driven by an algorithm, and demonstrated that it could successfully drive on paved roads. In 2016, Bojarski et al. [1] applied and achieved end-to-end learning of a self-driving task using a convolutional network. Their model had five convolutional layers and two fully connected layers to output throttle and steering control. The authors demonstrated that each of the convolutional filters was able to successfully recognize the edges of roads without explicitly being provided that information.
Semantic segmentation is a traditional computer vision task that classifies every pixel in a given image. Long et al. [7] used fully convolutional networks (FCNs) and skip connections to leverage both coarse and fine-grained features from the image for semantic segmentation. They achieved state-of-the-art performance on several segmentation challenges, including PASCAL-VOC [8] and SIFT Flow. Chen et al. [9] proposed DeepLab, which uses atrous convolution and conditional random fields to increase the performance of semantic segmentation, and achieved state-of-the-art performance on PASCAL-VOC-2012 [8].
Reinforcement learning is one of the most important branches of artificial intelligence. Mnih et al. [10] applied deep convolutional networks to reinforcement learning to allow an agent to play Atari games. Their trained model outperformed humans in most of the games it played, and the authors showed that only a few consecutive images were necessary to train the reinforcement learning algorithm. Deep reinforcement learning has also been applied to self-driving to handle various scenarios that cannot be solved with traditional rule-based algorithms [11][12][13].
Lastly, sim-to-real transfer concerns transferring a model that was trained in a virtual environment to the real world. Domain randomization is one of the representative sim-to-real techniques. It randomizes various properties of the inputs, including brightness, contrast, and dynamics, so that a trained model treats the real-world input as just another sample of the randomized simulation data. Researchers have applied domain randomization techniques to make vehicles trained in simulations perform well in the real world [14][15][16][17].

Semantic Segmentation Using GNSS/INS.
For the semantic segmentation of the road area, a fully convolutional network (FCN) [7] was trained with images and labels. As semantic segmentation is trained using supervised learning, segmentation labels are necessary for each corresponding raw image. To gather the tremendous number of necessary segmentation labels, GNSS/INS data were utilized to synthesize pseudosegmentation labels, rather than rely on human labor.
GNSS/INS data can accurately measure the current location of a vehicle, localizing its position within an error of 0.40 meters at 20 Hz. As a vehicle drives around a track, its location data, including longitude and latitude, are recorded for every frame. Each location component is estimated in meters. Figure 3 shows how the segmentation labels are produced from GNSS/INS data. The left image shows the location points projected onto a camera input, which is taken at one of the recorded locations. The image in the middle shows the lateral points of each location point. They lie a predefined distance away from their corresponding location point. Lastly, the road segmentation label is produced from polygons composed of the lateral points of each location point.
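The labeling geometry described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: positions are assumed to be already expressed in a local metric frame, the ground is assumed flat, the camera tilt is taken as zero for brevity, and all function and parameter names (`lateral_points`, `project_to_image`, `road_polygon`, `focal_px`, and so on) are hypothetical.

```python
import math


def lateral_points(track, half_width):
    """Return (left, right) boundary points for each trajectory point.

    track: list of (x, y) positions in meters in a local metric frame.
    half_width: assumed lateral offset in meters (half the label road width).
    """
    left, right = [], []
    for i, (x, y) in enumerate(track):
        # Estimate the driving direction from the neighbouring points.
        x0, y0 = track[max(i - 1, 0)]
        x1, y1 = track[min(i + 1, len(track) - 1)]
        heading = math.atan2(y1 - y0, x1 - x0)
        # Unit normal pointing to the left of the driving direction.
        nx, ny = -math.sin(heading), math.cos(heading)
        left.append((x + half_width * nx, y + half_width * ny))
        right.append((x - half_width * nx, y - half_width * ny))
    return left, right


def project_to_image(point, cam_pose, focal_px, cam_height, cx, cy):
    """Project a ground point with a flat-ground, zero-tilt pinhole model.

    cam_pose: (x, y, heading) of the camera in the same metric frame.
    Returns (u, v) in pixels, or None if the point is behind the camera.
    """
    px, py = point
    cam_x, cam_y, heading = cam_pose
    dx, dy = px - cam_x, py - cam_y
    # Express the point in the camera frame: forward and leftward distances.
    forward = dx * math.cos(heading) + dy * math.sin(heading)
    leftward = -dx * math.sin(heading) + dy * math.cos(heading)
    if forward <= 0.0:
        return None
    u = cx - focal_px * leftward / forward    # image x grows to the right
    v = cy + focal_px * cam_height / forward  # ground lies below the horizon
    return u, v


def road_polygon(track, cam_pose, half_width, focal_px, cam_height, cx, cy):
    """Pixel polygon of the road: left boundary outward, right boundary back."""
    left, right = lateral_points(track, half_width)
    args = (cam_pose, focal_px, cam_height, cx, cy)
    left_px = [project_to_image(p, *args) for p in left]
    right_px = [project_to_image(p, *args) for p in right]
    ring = left_px + right_px[::-1]
    return [p for p in ring if p is not None]
```

Filling the returned polygon into a binary mask with any polygon rasterizer would then give a two-class road label for the frame taken at `cam_pose`.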
An FCN [7] with a ResNet50 [18] backbone was used as the semantic segmentation network. The network consists of 57 layers; it takes a 224 × 224 RGB image (3 channels) as input and produces an output tensor of shape 21 × 224 × 224. We used the default number of classes, which is 21, so that the pretrained weights of the FCN could be loaded. The training dataset consisted of two types of labeled datasets. The first was the dataset with segmentation labels synthesized from the GNSS/INS records. The other contained labels produced by humans. The numbers of labels in the two datasets were 6,492 and 969, respectively. The model was trained for 100 epochs with a learning rate of 0.001. Adam was used as the optimizer during training.

Simulator.
The simulator was designed to simulate real-world situations. The global map in the simulator was reconstructed as a 10,000 × 10,000 array filled with zeros. The global map was derived by accumulating the trajectory positions driven by an expert driver. Figure 4 shows the global map of the simulator and one of the rendered images. The car dynamics in the simulator were designed to be as simple as possible. In the simulator, the car was considered a
The outputs from RL models often show unrealistic actions. For example, the steering values from an RL model may change rapidly between the maximum and minimum values, since such trembling of the steering can help an agent maximize rewards within the learning paradigm. However, these unrealistic movements can cause a catastrophic breakdown of the wheel motors in the real world. For this reason, the maximum steering change was limited to $M_s$, which is obtained from actual driving data.
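A simulator step of this kind can be sketched as follows. This is only an illustration under stated assumptions: the per-frame heading change and forward displacement are taken to be linear in steering and throttle, using the four calibrated parameters $w_s$, $b_s$, $w_t$, and $b_t$ reported in the Results section, and the position update along the heading is an assumed kinematic form. The function names and the `max_change` argument (standing in for $M_s$) are hypothetical.

```python
import math


def limit_steering(requested, previous, max_change):
    """Clamp the steering command so it changes by at most max_change per step."""
    delta = max(-max_change, min(max_change, requested - previous))
    return previous + delta


def step(state, steering, throttle, params, max_change):
    """Advance the simulated vehicle by one frame.

    state: (x, y, theta, previous_steering)
    params: (w_s, b_s, w_t, b_t), the four calibrated parameters.
    """
    x, y, theta, prev_steering = state
    w_s, b_s, w_t, b_t = params
    # Rate-limit the command before it reaches the dynamics.
    steering = limit_steering(steering, prev_steering, max_change)
    d_theta = w_s * steering + b_s  # assumed linear heading change per frame
    d_n = w_t * throttle + b_t      # assumed linear forward displacement per frame
    theta += d_theta
    x += d_n * math.cos(theta)
    y += d_n * math.sin(theta)
    return (x, y, theta, steering)
```

Applying the limit inside the simulator, rather than only on the vehicle, lets the agent learn under the same constraint it will face at deployment.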

Reinforcement Learning.
Proximal Policy Optimization (PPO2) [19] with a Multilayer Perceptron Policy (MlpPolicy) was used as the reinforcement learning model. MlpPolicy consists of two layers with 64 features each. The policy network is shallow, and the number of features is significantly lower than in many modern CNNs. However, MlpPolicy is still sufficient for tasks where the inputs are simple and the state transitions are very consistent, as is the case here: our simulator provides binary images of roads and uses simple dynamics. Moreover, using complicated and deep networks can cause overfitting, which is critical for sim-to-real transfer.
Observation refers to the data that are input to an RL model. The simulator provides binary images of roads from the camera's point of view. To compress the input image to a feature vector, the lengths of ten evenly spaced vertical lines were taken as observations. Figure 5 visualizes the observation from the perspective of the simulation. In addition, to provide the agent with temporal information, the previous (i.e., historical) steering and throttle values were added to the observation. Therefore, the total number of observation features was 12.
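The 12-feature observation can be illustrated with the short sketch below. The text does not specify how each line length is measured, so this example assumes each length is the run of road pixels counted upward from the bottom of the image and normalized by the image height; the function names are hypothetical.

```python
def column_lengths(mask, num_lines=10):
    """Road run length, measured upward from the bottom, in evenly spaced columns.

    mask: binary image as a list of rows (top row first); 1 = road, 0 = background.
    """
    height, width = len(mask), len(mask[0])
    lengths = []
    for k in range(num_lines):
        # Centre column of the k-th of num_lines equal-width vertical strips.
        col = int((k + 0.5) * width / num_lines)
        run = 0
        for row in range(height - 1, -1, -1):
            if mask[row][col] == 0:
                break
            run += 1
        lengths.append(run / height)  # normalised to [0, 1] (an assumption)
    return lengths


def build_observation(mask, prev_steering, prev_throttle):
    """Ten line lengths followed by the previous steering and throttle values."""
    return column_lengths(mask) + [prev_steering, prev_throttle]
```

Because the same function can be applied to simulator renderings and to segmentation outputs from the real camera, the policy sees an identically structured input in both domains.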
The objective of reinforcement learning is to maximize the rewards. The reward function is composed of four terms: a throttle reward, an imbalance penalty, an exploration reward, and a crash penalty. We describe each term in detail.
$R_t$ is the throttle reward, which induces the vehicle to move forward. $T$ refers to the throttle value, and $\lambda_t$ is the weight of the throttle reward, which can be determined empirically.
$P_i$ is the imbalance penalty, which measures how far the vehicle is from the center of the road. $l$ is the distance from the car to the left road boundary, and $r$ is the distance to the right boundary. If the vehicle gets closer to one of the road boundaries, the penalty increases. The imbalance penalty was found to be useful in preventing the vehicle from driving in zigzags and in encouraging a straighter path.
$R_e$ is the exploration reward, which induces the agent to visit unseen areas. It outputs $1000/N$ when the car reaches a new track tile [20], where $N$ is the total number of location points used to build the global map. Our simulator determines that the agent has arrived at an unvisited point only when the current closest location point $c$ is not included in the visited point set $V$. This reward prevents the vehicle from continuously driving along a small circle.
Lastly, $P_c$ is the crash penalty that the agent receives whenever the vehicle touches the edges of the road. $P_c$ is proportional to the throttle value.
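To make the structure of the reward concrete, the following sketch combines the four terms in one function. Only the exploration reward ($1000/N$ on reaching an unvisited point) and the proportionality of the crash penalty to the throttle are taken from the text. The linear throttle reward, the normalized form `abs(l - r) / (l + r)` of the imbalance penalty, the way the terms are summed, and the weights `lam_t`, `lam_i`, and `lam_c` are assumptions made for illustration.

```python
def step_reward(throttle, left, right, closest_idx, visited, num_points,
                crashed, lam_t=1.0, lam_i=1.0, lam_c=1.0):
    """Reward for one simulator step; the set `visited` is updated in place.

    left, right: distances from the car to the left and right road boundaries.
    closest_idx: index of the closest location point on the global map.
    num_points: total number of location points N used to build the map.
    """
    # Throttle reward: assumed to be linear in the throttle value.
    r_throttle = lam_t * throttle

    # Imbalance penalty: assumed normalised difference of boundary distances.
    total = left + right
    p_imbalance = lam_i * abs(left - right) / total if total > 0 else 0.0

    # Exploration reward: 1000 / N when an unvisited location point is reached.
    r_explore = 0.0
    if closest_idx not in visited:
        visited.add(closest_idx)
        r_explore = 1000.0 / num_points

    # Crash penalty: proportional to the throttle when a road edge is touched.
    p_crash = lam_c * throttle if crashed else 0.0

    return r_throttle - p_imbalance + r_explore - p_crash
```

Tracking visited points in a set keeps the exploration check cheap and makes it easy to reset at the start of each episode.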

Experimental Settings.
The total distance of the test road was 1.8 km. The road was unpaved and covered with gravel and dirt. The average road width was approximately 8 m. The boundaries of the road were not clear, and there were grasses and trees outside the road. The height of the camera attached to the vehicle was approximately 1.4 m from the ground. The test vehicle had six wheels and was a skid-type vehicle, which can reach speeds of up to 50 km/h. Table 1 depicts the details of the experimental setup.

Results and Analysis
To gather driving data, an expert driver drove along the whole course of the test road. At each frame, the vehicle information, including steering, throttle, heading angle, and position coordinates, was recorded. We obtained $\Delta\theta$ by calculating the difference between the heading angles of two consecutive recorded frames. Similarly, $\Delta n$ was computed by projecting the difference between two consecutive GNSS/INS positions onto the heading vector of the vehicle. The blue points in Figure 6 represent the throttle and $\Delta n$ pairs recorded at each frame. Likewise, the blue points in Figure 7 denote the pairs of steering and $\Delta\theta$ recorded at each frame. The red lines in Figures 6 and 7 visualize the results of linear regression by the least-squares method. Table 2 shows the calibrated values of the model parameters $w_s$, $b_s$, $w_t$, and $b_t$, which were obtained from the linear regression. Our simulator used this model and these parameters to mimic the real-world dynamics.
Several pseudosegmentation labeling methods were implemented and compared in the present study. SLIC [21] and Watershed [22] are methods based on superpixel algorithms. The threshold method filters out pixels whose values are below a threshold. The specific threshold was determined using Otsu's method [23].
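The calibration procedure can be sketched as below. This is a minimal illustration rather than the authors' code: it assumes positions in a local metric frame and headings in radians, pairs each control value with the motion observed until the next frame, and wraps heading differences into $[-\pi, \pi)$, which is an added safeguard not mentioned in the text. All function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = w * x + b; returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, mean_y - w * mean_x


def frame_deltas(headings, positions):
    """Per-frame heading change and forward displacement from logged data."""
    d_theta, d_n = [], []
    for i in range(len(headings) - 1):
        # Heading difference, wrapped into [-pi, pi) (an added safeguard).
        diff = headings[i + 1] - headings[i]
        d_theta.append((diff + math.pi) % (2.0 * math.pi) - math.pi)
        # Displacement projected onto the current heading vector.
        dx = positions[i + 1][0] - positions[i][0]
        dy = positions[i + 1][1] - positions[i][1]
        d_n.append(dx * math.cos(headings[i]) + dy * math.sin(headings[i]))
    return d_theta, d_n


def calibrate(steering, throttle, headings, positions):
    """Return (w_s, b_s, w_t, b_t) fitted from a manual driving record."""
    d_theta, d_n = frame_deltas(headings, positions)
    w_s, b_s = fit_line(steering[:-1], d_theta)
    w_t, b_t = fit_line(throttle[:-1], d_n)
    return w_s, b_s, w_t, b_t
```

Because only four scalars are fitted, repeating the calibration for another vehicle requires nothing more than a new driving log.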
According to Figure 8, the SLIC algorithm appears to be the most promising method, relative to our own. However, the SLIC method is vulnerable to discrepancies resulting from shade. Similarly, the threshold method produces noisy labels and misclassifies the sky as part of the road. Lastly, Watershed barely provided any useful segmentation labels.
Two methods have recently been published for pseudosemantic segmentation labeling [24,25]. Those methods use class activation maps from GradCAM [26]. Unfortunately, the activation maps of our road images were not suitable for obtaining the road areas, because these pretrained classification models classify roads as part of the background.
The first row of Figure 9 shows the input images. The last two rows are the segmented images that were inferred by the two different models. The first model was trained using the ground-truth labels of the test road, whereas the second model was trained using the synthesized pseudolabels.
Both Figures 9(b) and 9(c) show reasonable segmentation performance. The intersection over union (IoU) was higher in (b), except for the three rightmost columns. The model in (c) often predicted a road area that was narrower than the ground truth because the road width of the synthesized labels was fixed at 6 m. The road widths of the synthesized labels and the simulator were the same. Thus, our model can produce segmentation images that are more similar to the simulation scenes. The input to our RL model was in the form of the lengths of 10 lines in a segmentation image.
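For reference, the IoU between a predicted road mask and a label mask can be computed as in the following sketch. The masks are assumed to be binary images of equal size stored as lists of rows, and the function name `iou` is illustrative.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as lists of rows."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0
```

A prediction that is systematically narrower than the label, as with the fixed 6 m road width, lowers the IoU even when the road is located correctly, which is why IoU alone does not show how well a segmentation model suits the simulator-trained policy.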
The Kullback-Leibler (KL) divergence was calculated to compare the similarity between the distribution of simulator observations and the distribution of observations from the segmentation models. To calculate the KL divergence, histograms of each line length were generated. The formula is as follows:
$$D_{KL}(P \parallel Q) = \sum_{i} p_i(x) \log \frac{p_i(x)}{q_i(x)}$$
where $p_i(x)$ is the value of the $i$-th bin in the histogram of a segmentation model's outputs and $q_i(x)$ represents that of the simulator outputs. The KL divergence of each line is shown in Figure 10. According to Figure 10, the KL divergences were lower for our segmentation model on every line, which implies that our model produces outputs much more similar to the simulator scenes than the model trained with ground-truth labels does. The average KL divergences for the model trained with ground truths and our model were 1.5450 and 0.42644, respectively. Therefore, our pseudosegmentation labeling algorithm significantly reduced the covariate shift between the simulator and the real world.
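The histogram-based KL divergence for one observation line could be computed as in the sketch below. The number of bins, the value range, and the small smoothing constant `eps` (used to avoid taking the logarithm of zero for empty bins) are assumptions, since the text does not state how empty bins were handled; the function names are hypothetical.

```python
import math


def histogram(values, bins, lo, hi):
    """Count values into equal-width bins over [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """D_KL(P || Q) between two histograms with the same bins.

    p_counts: histogram of segmentation-model line lengths.
    q_counts: histogram of simulator line lengths.
    eps: additive smoothing so that empty bins do not produce log(0).
    """
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl
```

Averaging `kl_divergence` over the ten lines gives a single score per segmentation model, in the spirit of the averages reported above.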
To compare the suitability of the segmentation outputs from these models, a dataset collected through actual human driving was used. The dataset contains images and corresponding values of the throttle and steering. The images were processed by both segmentation models, and the RL model produced values of steering and throttle from the segmentation images. The steering values were compared with the values from the dataset, which are representative of actual human decisions.
In Figure 11, the blue lines show the steering values from the dataset collected during manual driving. The upper orange line shows the steering outputs obtained with the segmentation model trained on the ground-truth labels. The lower orange line shows the steering values obtained with our model, which was trained with pseudolabels. The figure shows that our segmentation model is more suitable for use with the RL model than the model trained on the ground truth. It is remarkable that the RL model behaved similarly to human driving without requiring any steering or throttle data.
In the real environment experiment, our model was deployed in the test vehicle. Figure 12 shows the trajectory of our model and the trajectory from human driving. Our model drove around the entire track without crashing, at an average speed of 26.57 km/h. The minimum speed, the maximum speed, and the speed during the 270-degree hairpin curve were 23.2 km/h, 28.7 km/h, and 23.4 km/h, respectively.
Figure 13 shows the velocities recorded at each point on the track. The left image shows the velocities when the human was driving, and the right image shows those of our RL model. The human driver was instructed to drive the track clockwise along the center of the road at about 30 km/h. According to the figure, the driver slowed the vehicle at each turning point and accelerated on the straight parts of the road. In contrast, our model drove at an almost constant speed. Human drivers consider the safety of the human and the vehicle when driving, whereas the RL model was trained to drive as fast as possible without considering safety.
Table 3 shows the results of applying various deep RL models to our simulator. For the performance comparison, we used representative deep reinforcement learning algorithms, including Proximal Policy Optimization (PPO2), Soft Actor-Critic (SAC), Advantage Actor-Critic (A2C), and Twin Delayed Deep Deterministic Policy Gradient (TD3). These algorithms were chosen because they show state-of-the-art performance with appropriate hyperparameters and are recommended for continuous action environments. According to the reward comparison, PPO2 provided the highest reward among the methods. To validate the statistical superiority of PPO2, we conducted t-tests against the other RL methods; the results are shown in Table 4. Therefore, PPO2 was chosen as the main RL algorithm to test on our vehicle.
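A comparison of episode rewards between two algorithms can be sketched with a two-sample t statistic. The text does not state which t-test variant was used, so Welch's unequal-variance form is assumed here, and the function name `welch_t` is hypothetical. The inputs would be lists of per-episode rewards for the two algorithms being compared.

```python
import math


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    mean_a = sum(a) / na
    mean_b = sum(b) / nb
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    se2 = var_a / na + var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = se2 ** 2 / ((var_a / na) ** 2 / (na - 1) + (var_b / nb) ** 2 / (nb - 1))
    return t, dof
```

The resulting statistic and degrees of freedom can be converted to a p-value with a Student's t distribution table or a statistics library.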
To validate the effectiveness of pseudolabeling and dynamic calibration, we evaluated the mean squared error (MSE) between the steering values from manual driving and the values from four types of testing models. The testing models were trained with and without pseudolabeling and dynamic calibration. The steering values from the testing models were halved because the testing models typically output full throttle values. Table 5 shows that using both pseudolabeling and dynamic calibration resulted in steering values that were most similar to manual driving.
The evenly distributed velocity heatmap in Figure 13 represents the optimal steering and throttle values that can be provided to the vehicle to drive the course. According to the results, the speed was maintained through the curved sections except for the 270° hairpin curve. This may be considered an un-human-like driving style, but it can be useful for strategic defense purposes. Strategic-purpose vehicles, such as self-propelled artillery and armored vehicles, are required to move swiftly through curves without decelerating, because decelerating would leave them vulnerable to enemy fire.
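The steering comparison can be written as a short function. This sketch assumes the manual and model steering sequences are aligned frame by frame and applies the halving described above through a `scale` argument; the function name and the default value are illustrative.

```python
def steering_mse(manual, predicted, scale=0.5):
    """Mean squared error between manual steering and scaled model steering."""
    if len(manual) != len(predicted):
        raise ValueError("sequences must have the same length")
    errors = [(m - scale * p) ** 2 for m, p in zip(manual, predicted)]
    return sum(errors) / len(errors)
```

Evaluating this function once for each of the four model variants would give one MSE value per combination of pseudolabeling and dynamic calibration.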

Conclusion
Applying reinforcement learning to autonomous driving has been a significant challenge for researchers because of the severe mismatches between simulations and the real world. Our simulator used dynamic calibration to predict the vehicle's next location from the given control commands. Moreover, two-class semantic segmentation, which distinguishes the road from the background, was found to be effective in reducing the gap between simulation scenes and real images. These methods demonstrated a positive effect on the sim-to-real performance of self-driving RL models. As a result, our model successfully drove on an unpaved road track without leaving the road.

Discussion
When a driving algorithm passes the simulation stage on the computer and is tested in a real driving environment, there are many restrictions besides the core algorithm that must be considered, because the test is no longer a simulation but real driving in off-road conditions. When a large vehicle weighing nearly 6.5 tons drives off-road over large altitude differences at an average of 28 km/h, the restrictions are more severe, because the vehicle is driven while errors and problems remain in the overall system integration. Problems arise continuously between tests and cause delays, and testing can continue only once they are resolved. Given the project schedule and these practical conditions, it was difficult to implement various reinforcement learning-based driving methods and prove the superiority of one over another through autonomous driving results in a real environment. Instead, the most promising and realistic algorithm was selected through simulation, and the goal was then to implement it in actual driving.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.