RS-Lane: A Robust Lane Detection Method Based on ResNeSt and Self-Attention Distillation for Challenging Traffic Situations

Lane detection plays an essential part in advanced driver-assistance systems and autonomous driving systems. However, its accuracy degrades in challenging traffic situations, and many methods can only detect a fixed number of lanes. To address these problems, we propose a lane detection method based on instance segmentation, named RS-Lane. The method builds on LaneNet and uses the Split Attention proposed by ResNeSt to improve the feature representation of slender and sparse annotations such as lane markings. We also use Self-Attention Distillation to enhance the feature representation capability of the network without adding inference time. RS-Lane can detect lanes without a limit on their number. Tests on the TuSimple and CULane datasets show that RS-Lane achieves results comparable with the state of the art (SOTA) and improves in challenging traffic situations such as no line, dazzle light, and shadow. This research provides a reference for the application of lane detection in autonomous driving and advanced driver-assistance systems.


Introduction
Lane detection plays a vital role in autonomous driving. Reliable lane detection helps autonomous driving systems make the right decisions. Lane detection is a challenging task because of the wide variety of lane markings, the complex and changeable road conditions, and the inherently slender shape of lane markings. In this paper, we propose a lane detection method based on LaneNet [1] that uses the Split Attention proposed by ResNeSt [2] and Self-Attention Distillation (SAD) [3] to improve the feature representation of slender and sparse annotations such as lane markings.
The current lane detection methods can be roughly divided into two kinds: those based on traditional computer vision and those based on deep learning. Most traditional methods rely on extracting a particular feature to detect lanes, such as color features [4][5][6], edge features [7,8], or geometric features [9][10][11], possibly combined with the Hough Transform [12] and Random Sample Consensus (RANSAC) [13,14]. These methods are simple and efficient, but their parameters must be adjusted manually. They perform well in normal situations but cannot adapt to varying conditions such as lighting changes and occlusion.
Most deep learning methods are based on Convolutional Neural Networks (CNNs) [1,15,16]. As CNNs have developed, more and more theories and network structures have been proposed.
Although these methods have good performance in most normal situations, there are still some limitations. First of all, most methods can only detect a fixed number of lanes. Secondly, due to the slenderness of the lane, the number of background pixels is far greater than the number of lane pixels. Learning such features could be very difficult. Also, some situations have few visual clues or even no clue at all, such as no line, shadow occlusion, and complex lighting conditions. Under these situations, detecting the lanes from the picture could be very tricky.
To solve the above problems, we adopt the framework proposed by LaneNet [1] and use pixel embedding [23] to achieve instance segmentation, so that our method can detect lanes without a limit on their number. We use the Self-Attention Distillation (SAD) [3] mechanism and the Split Attention proposed by ResNeSt [2] to improve the feature representation of slender and sparse annotations such as lane markings. SAD does not increase the inference time of the model. The results on the TuSimple and CULane datasets show that our method is comparable with existing methods in normal situations. Moreover, our method improves in situations with few visual clues, such as no line, shadow, and dazzle light.
Our contributions are summarized as follows:
(i) We use Split Attention to improve the feature representation of the network on slender and sparse annotations such as lane markings.
(ii) We use the distillation and attention mechanism of SAD to further improve the ability of the network without needing more annotation information. At the same time, it does not increase the inference time during deployment.
(iii) By using pixel embedding to obtain lane instances, our method can detect lanes without a limit on their number, which is a limitation of most lane detection methods. Our method also shows great improvement in challenging traffic situations, such as no line, shadow, and dazzle light.
The rest of this paper is organized as follows. Section 2 reviews related research on lane detection. Section 3 introduces the proposed lane detection method. Section 4 discusses our experimental results. Section 5 concludes the paper.

Related Research
Vision-based lane detection methods can be divided into traditional vision methods and deep learning methods. Both kinds, however, follow three steps: image preprocessing, feature detection, and lane model fitting. Image preprocessing removes part of the noise. Feature detection uses the features of lanes to extract the areas that are lanes. The lane model is generally fitted with the least-squares method, spline fitting, etc.
Among them, feature detection is the most important part of a lane detection algorithm and plays a decisive role in its performance. There are many kinds of lane markings, including yellow lines, white lines, solid lines, and dashed lines. Moreover, lane markings occupy a very small proportion of the image, and they may be worn or occluded, which makes detection more difficult.
Since AlexNet [24], Convolutional Neural Networks (CNNs) have been widely used in the field of computer vision because of their outstanding feature extraction capabilities. More and more excellent neural networks have been proposed, such as ZFNet [25], GoogLeNet [26], VGGNet [27], and ResNet [28]. Since ResNet was proposed, it has been widely used as a backbone network due to its simple and modular structure. In recent years, many variants of ResNet have been proposed, such as ResNeSt [2] and ResNeXt [29]. These networks are also widely used in the field of lane detection.
Semantic segmentation methods [30][31][32] are used to distinguish background and lane pixels. Instance segmentation methods [23] are used to directly get lane location. Object detection methods [33] are used to remove noise caused by cars and pedestrians.
Neven et al. [1] cast the lane detection problem as an instance segmentation problem and designed a two-branch, multitask network for lane instance segmentation. In [3], a novel knowledge distillation approach, Self-Attention Distillation (SAD), is proposed, which allows a model to learn from itself without any additional supervision or labels. Pan et al. [20] proposed a message passing mechanism between adjacent pixels to use visual information more efficiently, which significantly improves the performance of deep segmentation methods. In [21], a formulation with a structural loss is proposed to address the problems of speed and missing visual clues. The proposed formulation regards lane detection as a row-based selection problem using global features and achieves remarkable speed and accuracy. In [22], a hybrid network combining a CNN and an RNN is proposed for robust lane detection in continuous driving scenes. In this framework, the features of each input frame are first abstracted by a CNN encoder. Then, the sequence of encoded features of all input frames is processed by a ConvLSTM. In [34], YOLO [33] and CPN [35] are used to remove noise, and a lane marking model inference method is proposed that can detect lanes when lane markings are missing. However, it can only detect two lanes. In fact, the methods in [3, 20, 21, 22, 34] are all limited to a fixed number of lanes.
In this paper, we design a network that simultaneously performs semantic segmentation and instance segmentation and places no limit on the number of lanes. Our method also performs better in challenging traffic situations.

Proposed Method
Our method can be divided into several modules as follows. In the preprocessing module, the driving image and its annotation are converted into a standard format that can be input into the model, which makes features easier to extract in the later stages. In the model training stage, the annotated data are used to train the network so that it can segment the lanes. In the postprocessing stage, the final results are obtained from the output of the model through denoising and fitting. Figure 1 shows the flowchart of our method.
Journal of Advanced Transportation

Preprocessing.
This module processes the input image to make feature extraction easier in the later stages and to improve speed and accuracy. In our method, the preprocessing module mainly downsamples the input image.
The original size of the images is 1280 × 720 in the TuSimple dataset and 1640 × 590 in the CULane dataset. Feeding them directly into the network would require a very large amount of computation. Therefore, the raw images need to be downsampled. In this paper, we use bilinear interpolation to downsample the images to 512 × 288 (for the TuSimple dataset) and 800 × 288 (for the CULane dataset).
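The downsampling step can be illustrated with a small, dependency-free sketch of bilinear interpolation on a single-channel image stored as nested lists. The function name `bilinear_resize` and the half-pixel-centre sampling convention are assumptions made for illustration; in practice a library routine such as OpenCV's `cv2.resize` with `cv2.INTER_LINEAR` would normally be used instead.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a 2D single-channel image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    # Scale factors map output pixel centres back into input coordinates.
    scale_y, scale_x = in_h / out_h, in_w / out_w
    out = []
    for oy in range(out_h):
        # Half-pixel-centre convention (an assumption), clamped to the image.
        sy = min(max((oy + 0.5) * scale_y - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for ox in range(out_w):
            sx = min(max((ox + 0.5) * scale_x - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Usage: halve a tiny 4 x 4 horizontal ramp.
small = bilinear_resize([[0, 1, 2, 3]] * 4, 2, 2)
```

For a 1280 × 720 TuSimple frame the same call would use `out_h=288, out_w=512`.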
Our network has two branches and therefore two outputs, so we make two labels for each input based on the annotations. One label is for the binary branch and denotes whether a pixel belongs to a lane or to the background. The other is for the embedding branch and denotes which lane the pixel belongs to. Moreover, since the original image is downsampled, the same operation is applied to the labels; the results are shown in Figure 2.

Model Training.
Our network uses the structure proposed by LaneNet [1] to simultaneously perform semantic segmentation and instance segmentation within the encoder-decoder [36] framework. Semantic segmentation processes the input image at the pixel level to obtain the pixels that belong to lanes. Based on the semantic segmentation, instance segmentation is then performed using the pixel embedding [23] method proposed by De Brabandere et al.
LaneNet uses one encoder and two decoders, which means each branch has its own decoder. In contrast, to reduce the parameters and complexity of the network, we use only one decoder and place two different 1 × 1 convolution layers at its end to obtain the two branches. The binary branch is for semantic segmentation, and the embedding branch is for instance segmentation. The structure of the network is displayed in Figure 3.

3.2.1. The Encoder. Our encoder uses ResNeSt as the backbone. ResNeSt proposed a Split-Attention mechanism, which computes attention over different groups and different channels. As a variant of ResNet, ResNeSt retains the complete ResNet structure and has a conv1 and 4 layers. Each layer consists of several blocks. The structure of the encoder block is shown in Figure 4. First, the encoder block divides the input into several groups (cardinal groups) along the channel dimension. Then, each group is divided into several splits (subgroups). After the splits pass through different convolutions, the feature map of each group is computed as a weighted combination of its splits, with the weights selected based on contextual information. The output of the block is the concatenation of the feature maps of the groups. This structure enables the network to utilize multidimensional information and enables cross-group and cross-channel attention.
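The weighted combination of splits inside one cardinal group can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the function name `split_attention` is hypothetical, and the gating scores `logits` are passed in as an argument, whereas in ResNeSt they are produced by small learned layers from globally pooled features.

```python
import math


def split_attention(splits, logits):
    """Fuse the R splits of one cardinal group into a single feature map.

    splits: R feature maps, each a list of C channels, each channel a flat
            list of spatial values.
    logits: R x C gating scores (hypothetical input standing in for the
            learned, context-dependent gating layers).
    """
    num_splits, num_channels = len(splits), len(splits[0])
    fused = []
    for c in range(num_channels):
        # Softmax across the splits, computed separately for each channel.
        peak = max(logits[r][c] for r in range(num_splits))
        exps = [math.exp(logits[r][c] - peak) for r in range(num_splits)]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted combination of the same channel taken from every split.
        size = len(splits[0][c])
        fused.append([
            sum(weights[r] * splits[r][c][i] for r in range(num_splits))
            for i in range(size)
        ])
    return fused


# Usage: two splits, one channel, equal gating scores give their average.
fused = split_attention([[[1.0, 1.0]], [[3.0, 3.0]]], [[0.0], [0.0]])
```

The fused maps of all cardinal groups would then be concatenated along the channel dimension to form the block output.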
We add two SAD paths to further enhance the feature extraction capability of the network. SAD enables a network to learn from itself without needing any extra information. By making the attention maps of the lower layers mimic those of the higher layers, the lower layers can learn the higher-level feature representation. Since the feature representation ability of the lower layers is enhanced, the higher layers and the whole network are enhanced as well. The steps of applying SAD can be summarized as follows. The first step is the generation of the attention map, which is equivalent to finding a mapping function $G: \mathbb{R}^{C_m \times H_m \times W_m} \rightarrow \mathbb{R}^{H_m \times W_m}$, where $A_m \in \mathbb{R}^{C_m \times H_m \times W_m}$ denotes the output of the $m$th layer and $C_m$, $H_m$, and $W_m$ denote its channel, height, and width, respectively. This mapping function can be constructed by computing statistics of these values across the channel dimension.
In our method, we set the exponent $p = 2$, following Hou et al. [3]. Additionally, Hou et al. [3] applied a spatial softmax to the attention map. However, we find that after the spatial softmax the map contains a large number of values close to 0 and only a few values close to 1, which is not conducive to computing the gradient. So, we apply softmax on the same dimension for the maps.
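As a rough illustration of the attention-map generation, the sketch below sums the absolute activations raised to the power p over the channel dimension, which is the form used by Hou et al. [3] with p = 2, and then applies the spatial softmax of Hou et al. [3]. The function names are hypothetical, and the modified softmax used in this paper is not reproduced because the text does not specify it in enough detail.

```python
import math


def attention_map(features, p=2):
    """Collapse a C x H x W activation (nested lists) into an H x W map by
    summing |activation| ** p over the channel dimension."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(abs(channel[y][x]) ** p for channel in features)
         for x in range(width)]
        for y in range(height)
    ]


def spatial_softmax(amap):
    """Softmax over all spatial positions of an H x W map."""
    flat = [v for row in amap for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    width = len(amap[0])
    return [[exps[y * width + x] / total for x in range(width)]
            for y in range(len(amap))]


# Usage: a 2-channel, 2 x 2 activation.
features = [[[1, -2], [0, 3]], [[2, 0], [1, -1]]]
amap = attention_map(features)   # [[5, 4], [1, 10]]
normalised = spatial_softmax(amap)
```

With these values the largest entry dominates the softmax output, which mirrors the observation above that the spatial softmax tends to produce many near-zero entries.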

3.2.2. The Decoder. The decoder performs deconvolution to decode the feature maps output by the encoder, realizing upsampling and classification. Our decoder has 5 layers, in one-to-one correspondence with the layers of the encoder. To make full use of the global context information, we use the skip connections proposed by U-Net [32] to concatenate the outputs of the encoder and the decoder. The structure of the decoder block is shown in Figure 1. In the last layer of the decoder, we have two branches, namely, the binary branch and the embedding branch. We use two convolutional layers with a 1 × 1 kernel to generate the outputs of the binary branch and the embedding branch. The binary branch outputs the semantic segmentation. The embedding branch outputs a three-channel embedding map, that is, a 3D embedding vector for each pixel.
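The two output branches can be pictured as two independent per-pixel linear maps applied to the same decoder features. The sketch below is a minimal pure-Python illustration under that reading; `conv1x1` and `two_branch_heads` are hypothetical names, the weights are placeholders rather than trained values, and the number of channels in the binary output is left to the caller because the text does not state it.

```python
def conv1x1(features, weights, bias):
    """Apply a 1 x 1 convolution: an independent linear map at every pixel.

    features: C_in x H x W nested lists.
    weights:  C_out x C_in matrix.
    bias:     list of C_out offsets.
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                bias[o] + sum(weights[o][c] * features[c][y][x]
                              for c in range(len(features)))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weights))
    ]


def two_branch_heads(decoder_features, binary_params, embedding_params):
    """Produce both outputs from the same decoder features.

    Each *_params argument is a (weights, bias) pair.
    """
    binary_logits = conv1x1(decoder_features, *binary_params)
    embeddings = conv1x1(decoder_features, *embedding_params)  # 3 channels
    return binary_logits, embeddings


# Usage: 2 input channels on a 1 x 2 image, placeholder weights.
feats = [[[1, 2]], [[3, 4]]]
binary, emb = two_branch_heads(
    feats,
    ([[1, 0]], [0]),
    ([[1, 1], [0, 1], [1, -1]], [0, 0, 0]),
)
```

Sharing the decoder and splitting only at these two light heads is what keeps the parameter count below that of a two-decoder design.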

Loss Function.
The loss function plays an important role in the optimization of the network. Our network has two outputs, and an appropriate loss function needs to be selected for each.
Since lane pixels make up a very small proportion of the image, the lane segmentation task suffers from a serious data-imbalance problem. To solve this problem, we use the dice loss as our semantic segmentation loss.
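A minimal sketch of a soft dice loss for the binary branch is given below. The exact dice variant used in the paper is not stated, so the formulation here (sum of probabilities in the denominator, plus a smoothing constant `smooth`) is an assumption chosen for illustration.

```python
def dice_loss(probabilities, targets, smooth=1.0):
    """Soft dice loss for binary segmentation.

    probabilities: predicted lane probabilities in [0, 1], flattened.
    targets: ground-truth labels (1 = lane, 0 = background), flattened.
    smooth: small constant (an assumption) that avoids division by zero.
    """
    intersection = sum(p * t for p, t in zip(probabilities, targets))
    denominator = sum(probabilities) + sum(targets)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice


# Usage: one lane pixel among four.
perfect = dice_loss([1.0, 0.0, 0.0, 0.0], [1, 0, 0, 0])
missed = dice_loss([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0])
```

Because the loss depends on the overlap with the lane pixels rather than on per-pixel accuracy, predicting background everywhere is penalized even though most pixels would be classified correctly.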
After training, the embedding branch outputs a 3D embedding vector for each pixel. The distance between the embeddings of pixels belonging to the same lane is small, and the distance between the embeddings of pixels belonging to different lanes is large. De Brabandere et al. [23] introduced three terms to realize the loss function: the variance loss ($L_{var}$), the distance loss ($L_{dist}$), and the regularization loss ($L_{reg}$). The variance loss ($L_{var}$) pulls the embedding of each pixel toward the mean embedding of its lane, that is, it reduces the embedding distance between pixels belonging to the same lane. The distance loss ($L_{dist}$) pushes the cluster centers apart from each other, that is, it increases the embedding distance between pixels belonging to different lanes. The regularization loss ($L_{reg}$) attracts all clusters toward the origin.
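The pull and push behaviour of the variance and distance terms can be sketched as follows, using the hinged form from Neven et al. [1] and leaving out the regularization term. The function name `discriminative_loss` and the default margins `delta_v=0.5` and `delta_d=3.0` are placeholders for illustration, not values reported in this paper.

```python
import math


def discriminative_loss(embeddings, labels, delta_v=0.5, delta_d=3.0):
    """Hinged variance and distance terms of the embedding loss.

    embeddings: list of embedding vectors, one per lane pixel.
    labels: lane id of each pixel.
    delta_v, delta_d: hinge margins (placeholder values).
    Returns (variance_term, distance_term).
    """
    clusters = {}
    for vec, lane in zip(embeddings, labels):
        clusters.setdefault(lane, []).append(vec)

    # Mean embedding of each lane.
    means = {
        lane: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for lane, vecs in clusters.items()
    }

    # Variance term: pull a pixel toward its lane mean once it lies
    # farther away than delta_v.
    l_var = 0.0
    for lane, vecs in clusters.items():
        hinged = [max(0.0, math.dist(means[lane], v) - delta_v) ** 2
                  for v in vecs]
        l_var += sum(hinged) / len(vecs)
    l_var /= len(clusters)

    # Distance term: push lane means apart while they are closer than delta_d.
    l_dist = 0.0
    lanes = list(means)
    if len(lanes) > 1:
        for a in lanes:
            for b in lanes:
                if a != b:
                    gap = math.dist(means[a], means[b])
                    l_dist += max(0.0, delta_d - gap) ** 2
        l_dist /= len(lanes) * (len(lanes) - 1)

    return l_var, l_dist


# Usage: two single-pixel lanes whose means are 2 apart (closer than delta_d).
l_var, l_dist = discriminative_loss([[0, 0, 0], [2, 0, 0]], ["a", "b"])
```

Both terms fall to zero once every pixel lies within `delta_v` of its lane mean and all lane means are at least `delta_d` apart, which is the configuration that makes the later clustering step easy.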
Neven et al. [1] modified the formula of this loss function and omitted $L_{reg}$. We use their modified loss function, as shown in equations (4) and (5), where $\delta_v$ and $\delta_d$ are hyperparameters, $C$ denotes the number of clusters, $N_c$ is the number of pixels belonging to cluster $c$, $x_i$ is the embedding vector of the $i$th pixel, $\mu_c$ is the mean embedding of cluster $c$, and $\|\cdot\|$ denotes the L2 distance.
We add two SAD paths to our network, which connect layer2, layer3, and layer4. After extracting the attention maps, since the target maps are smaller than the origin maps, we upsample the target maps to the size of the origin maps and apply softmax to each map. We then calculate the Mean Square Error (MSE) between the mimic maps and the target maps. The SAD loss is formulated from these MSE terms, where $\Psi$ denotes the extraction, interpolation, and softmax operations.
The total loss includes these three terms, where $L_{bin}$ is the loss of the semantic segmentation task calculated with the dice loss function. The parameters $\alpha$, $\beta$, and $\gamma$ balance the influence of each loss. In our experiments, we found the best performance with $\alpha = 1$, $\beta = 0.3$, and $\gamma = 0.1$.
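The SAD term can be sketched as below, taking already extracted attention maps as input. Several details are assumptions made for illustration: nearest-neighbour upsampling stands in for the unspecified interpolation, a softmax over all spatial positions stands in for the softmax used by the authors, and the per-path losses are simply summed. All function names are hypothetical.

```python
import math


def upsample_nearest(amap, out_h, out_w):
    """Nearest-neighbour upsampling (a stand-in for the interpolation step)."""
    in_h, in_w = len(amap), len(amap[0])
    return [
        [amap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def softmax_flat(amap):
    """Softmax over all positions of an H x W map, returned as a flat list."""
    flat = [v for row in amap for v in row]
    peak = max(flat)
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    return [e / total for e in exps]


def sad_path_loss(origin_map, target_map):
    """MSE between a lower-layer map and the upsampled higher-layer map."""
    upsampled = upsample_nearest(target_map, len(origin_map), len(origin_map[0]))
    mimic, target = softmax_flat(origin_map), softmax_flat(upsampled)
    return sum((m - t) ** 2 for m, t in zip(mimic, target)) / len(mimic)


def sad_loss(attention_maps):
    """Sum the path losses over consecutive layers (assumed aggregation),
    for example layer2 -> layer3 and layer3 -> layer4."""
    return sum(
        sad_path_loss(lower, higher)
        for lower, higher in zip(attention_maps, attention_maps[1:])
    )


# Usage: three maps of decreasing size give two SAD paths.
maps = [[[1.0] * 4] * 4, [[1.0] * 2] * 2, [[1.0]]]
loss = sad_loss(maps)
```

In a real training loop the target map would be treated as a constant so that gradients flow only into the lower layer; that detail is outside the scope of this sketch.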

Postprocessing.
As mentioned above, the network has two outputs: the semantic segmentation map output by the binary branch and the embedding map output by the embedding branch. We use the segmentation map as a mask and apply it to the embedding map, so that we obtain the embeddings of lane pixels only. Then we perform mean shift clustering on them to get one cluster per lane and obtain the final result of instance segmentation. Most of the time, a lane is very straight near the vehicle and starts to bend toward the far end of the field of view. Least-squares fitting cannot fit this kind of curve well. So, for each lane, we take the center points every 10 rows and get the final output through cubic spline interpolation.
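The masking and center-point sampling steps can be sketched as follows. The function name `lane_center_points` is hypothetical, and the lane ids are assumed to come from a clustering step that is not shown here (for instance `sklearn.cluster.MeanShift` applied to the masked embeddings).

```python
def lane_center_points(mask, instance_ids, row_step=10):
    """Sample one centre point per lane on every `row_step`-th image row.

    mask: H x W binary segmentation (1 = lane pixel).
    instance_ids: H x W lane ids, e.g. cluster labels from mean shift
                  (the clustering itself is not shown here).
    Returns {lane_id: [(row, mean_column), ...]}.
    """
    points = {}
    for row in range(0, len(mask), row_step):
        columns = {}
        for col, is_lane in enumerate(mask[row]):
            if is_lane:  # the mask discards background pixels
                columns.setdefault(instance_ids[row][col], []).append(col)
        for lane, cols in columns.items():
            points.setdefault(lane, []).append((row, sum(cols) / len(cols)))
    return points


# Usage on a tiny 3 x 6 example, sampling every 2nd row.
mask = [[0, 1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1]]
ids = [[0, 1, 1, 0, 2, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 2]]
centers = lane_center_points(mask, ids, row_step=2)
```

The sampled (row, column) pairs of each lane would then be passed to a cubic spline routine, such as `scipy.interpolate.CubicSpline` with the row as the independent variable, to obtain the final curve.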

Development and Test Environment.
We used Python as the main development language. The training and testing of the deep learning model are based on the PyTorch framework. The image processing part uses OpenCV. The scientific computing part uses numpy and sklearn. The overall environment configuration parameters are shown in Table 1.

Dataset.
We used two datasets to evaluate our method: the TuSimple dataset [37] and the CULane dataset [38]. The TuSimple dataset contains road images collected at different times of day, covering roads with two, three, four, or more lanes and different traffic conditions, from clear lane lines to severely worn ones. Some samples of the dataset are shown in Figure 5.
The CULane dataset is a much larger and more challenging dataset, including normal situations and 8 challenging situations, such as crowded, night, and no line. The proportion of the different situations is shown in Figure 6. Some examples are shown in Figure 7.
Compared with CULane, the TuSimple dataset is rather simple. However, the CULane dataset focuses on the detection of four lane markings, while the TuSimple dataset includes four lanes or more. The basic information about these two datasets is shown in Table 2.

Evaluation Metrics and Test Results.
Usually, we regard lane detection as a binary classification problem, so the performance can be presented with a confusion matrix. In addition, in order to compare with other methods, we also use the official evaluation metrics provided by the datasets.
The CULane dataset [38] uses precision, recall, and F1 to evaluate the detection. The expression of precision is given in equation (8), and the expression of recall is given in equation (9). TP (true positive) denotes the number of lane pixels that are predicted as positive; FP (false positive) denotes the number of non-lane pixels that are predicted as positive; FN (false negative) denotes the number of lane pixels that are predicted as negative. To measure precision and recall together, F1 is defined as their harmonic mean, as shown in equation (10).
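These three metrics follow directly from the TP, FP, and FN counts defined above. The sketch below uses the standard definitions of precision, recall, and their harmonic mean; the function name and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Usage with made-up counts.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

The counts in the usage line are invented for the example and are not results from the paper.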
TuSimple [37] uses accuracy, which is computed as $\text{accuracy} = N_{pred} / N_{gt}$, where $N_{pred}$ denotes the number of correctly predicted lane points and $N_{gt}$ is the number of ground-truth lane points. FN and FP are also used as a reference.
The comparison of our method with SCNN, LaneNet, ENet-SAD, and EL-GAN on the TuSimple dataset is given in Table 3, and the comparison on the CULane dataset is given in Table 4. On TuSimple, the accuracy of all methods is already extremely high, and the gap between our method and the best one is very small (0.0027). It is fair to say that our performance is comparable with the state of the art on TuSimple. Table 4 shows that our method performs better in challenging situations, especially in the no line, shadow, and dazzle light situations. Overall, our method performs well in normal situations and improves in challenging traffic situations.
The output of the trained network is shown in Figure 8. It can be seen that the trained network recognizes lanes well under various kinds of interference, such as the road shown in Figure 8(b), which is occluded by a large vehicle. The semantic segmentation stage fails to identify the occluded lane line pixels, but the instance segmentation stage can still assign the two separated lane segments to the same lane.
The testing results on the TuSimple dataset are shown in Figure 9, which shows part of the detection results of the lane detection algorithm designed in this paper. It shows that our method performs well in normal highway situations and can detect more than 4 lanes. When there is a large area of shadow on the road, our method still works well. The testing results on the CULane dataset are shown in Figure 10. Since CULane contains more difficult situations, Figure 10 shows the performance of our method in crowded, dazzle light, and night situations, as well as when there are no line markings on the road or there are other markings.
From the above testing results, it can be seen that our method has good robustness. It can deal with most normal conditions in the daytime and performs well in challenging conditions as well. The comparison also shows that our method has achieved state-of-the-art performance.

Table 3 (row for our method): RS-Lane (ours) 96.37% 0.0532 0.0279. Accuracy is computed as in equation (11). FN denotes the proportion of false negative points. FP denotes the proportion of false positive points.
Figure caption fragment: Lane markings are severely worn and blocked by vehicles.
The real-time performance of the method on the current hardware (shown in Table 1) falls slightly short. The running time of our method is currently about 54 ms per frame, whereas 50 ms per frame normally satisfies the real-time requirements of most situations. Most of the time is spent in the clustering algorithm, which takes up nearly half of the running time, because the clustering algorithm cannot be accelerated on the GPU. This can be remedied by simple measures in engineering practice: one is upgrading the CPU and GPU, and the other is reducing the image size during preprocessing. With these measures, our method can be applied in engineering.

Conclusion
In this paper, we present a new network for lane detection, which uses Split Attention and Self-Attention Distillation to enhance the performance in challenging traffic situations. By using pixel embedding to obtain lane instances, our method also places no limit on the number of lanes. The results on TuSimple show that RS-Lane has comparable performance with the state of the art in most normal situations, and the results on CULane show that RS-Lane improves in challenging traffic situations such as no line and shadow. In general, our method achieves state-of-the-art performance and provides a reference for the application of lane detection. Though the real-time performance falls slightly short, our method can be applied in engineering with minor changes.
In the future, we will further explore how to improve speed and accuracy at the same time. We will also continue working on situations with various weather conditions, such as rain and fog. Besides, our work is a part of the Cooperative Vehicle Infrastructure System, especially for future intelligent vehicle design and control with 5G technology [42][43][44].

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.