Efficient Lane Detection Technique Based on Lightweight Attention Deep Neural Network

For self-driving vehicles, detecting lane lines in changing scenarios is a fundamental yet challenging task. The rise of deep learning in recent years has contributed to the progress of autonomous driving. However, existing deep learning-based lane detection methods place high demands on the computing environment, which restricts their applicability. This paper proposes an improved attention deep neural network (DNN), a lightweight semantic segmentation architecture designed for efficient computation under low memory, which contains two branches working at different resolutions. The proposed network integrates fine details captured by local pixel interactions at high resolution into global contexts at low resolution, computing dense feature maps for the prediction task. Based on the distinct characteristics of features at different resolutions, different attention mechanisms are adopted to guide the network to exploit its parameters effectively. The proposed network achieves results comparable with state-of-the-art methods on two popular lane detection benchmarks (TuSimple and CULane), runs at 259 frames per second (FPS) on the CULane dataset, and requires only 1.57M model parameters. This study provides a practical reference for applying lane detection on memory-constrained devices.


Introduction
Lane detection is a research hotspot in autonomous driving and a key technology of Advanced Driver Assistance Systems (ADAS) [1]. A forward-looking camera mounted behind the front windshield captures the surrounding driving environment. A vision-based processing algorithm embedded in the vehicle detects lane lines from the captured video clips. The lane detection results are then applied to subsequent tasks including lane keeping [2], trajectory planning, and behavior prediction. Therefore, lane detection is an integral part of automatic driving and ADAS.
To extract lane markings from video clips, current studies have mainly investigated two routes: traditional vision methods and deep learning methods. Traditional vision-based lane detection algorithms mainly include two steps: extraction of hand-crafted features and fitting of geometrical curves. Researchers distinguish lanes using prior characteristics including lane shape, color, and edge texture information [3]. Then they use polynomial curves or splines to approximate the lane boundary model from these distinctive features. This traditional vision-based approach can obtain satisfactory performance when lane markings are clear and the field of view is broad. However, lane detection scenarios are subject to changes in illumination, weather conditions, traffic congestion, and other factors, which vary during driving and pose great challenges to lane estimation. These challenges are inevitable and can hamper detection accuracy, especially when using traditional methods. Thus, a more robust method is needed to meet the high precision requirements of lane detection in challenging environments.
Deep learning, especially the convolutional neural network (CNN) [4], has rich feature representation capabilities, which greatly boost the performance of computer vision (CV) systems [5]. Recent studies on lane detection have shown that methods based on deep learning can deliver strikingly better results than traditional methods that depend on hand-crafted cues. Therefore, deep learning methods have become the preferred option in this field. Compared with traditional algorithms that rely heavily on vision cues in specific environments, deep learning-based methods improve the network's scene perception ability by continuously optimizing neural network parameters, and thus attain higher robustness and applicability. Consequently, lane detection methods based on semantic segmentation and instance segmentation [6][7][8][9] have attracted extensive attention in recent years. However, some of these recent approaches directly adopt a classic segmentation network or its variants to segment lane markings.
The results, although very encouraging, appear coarse in detail. There are two primary reasons: (1) In terms of accuracy, the segmentation of the boundary is less precise, especially when the lines are distant or occluded. (2) In terms of computing resources, excessive computation and parameter overheads lead to large memory usage and poor real-time performance, which limits the practicability of the algorithm. In this paper, our motivation is to design an improved neural network architecture that addresses these two problems and achieves a trade-off between high accuracy and low memory consumption. To achieve this goal, we design a network framework with a two-branch structure. In this framework, we first use a lightweight downsampling module to squeeze the spatial dimension of the input feature map and forward the result into two branches. One branch, named global context embedding (GCE), focuses on capturing global information that can be used to infer heavily occluded and blurred lane markings. The other branch, explicit boundary regression (EBR), exploits a spatial attention mechanism (SAM) to aggregate boundary information at different locations, and a supervisory operation based on the label's edge information is attached at the end of EBR. In particular, SAM is integrated into the EBR module for better boundary regression, and a channel attention mechanism (CAM) is placed after the output of the GCE encoder, so that channels containing target objects are assigned higher weights. Inspired by the effectiveness of the MobileNet [10] series, we replace regular convolution layers with bottleneck units when deepening the network, which reduces the number of network parameters.
The main contributions of this study can be summarized as follows.
(1) This paper proposes a lightweight DNN framework that simultaneously addresses precision and memory overhead, making it better suited for the lane detection task.
(2) This paper designs an EBR module with SAM and auxiliary edge supervision to reinforce the consistency of semantic boundaries. The later experimental results indicate that these modules significantly improve the precision of lane boundaries.
(3) The experimental part elaborates on the details of the ablation study and compares the segmentation results on the TuSimple and CULane datasets with other methods. Results show that the proposed model attains faster inference with performance competitive with the state of the art. In addition, the algorithm is robust enough to handle lane detection in dark, dazzling, and blurred environments.
The remainder of this paper is organized as follows. Section 2 reviews previous research on lane detection. Section 3 introduces the proposed method. Section 4 demonstrates the experimental performance of the proposed method. Section 5 summarizes this article.

Related Works
The present work relies heavily on prior efforts in the areas of lane detection and attention mechanisms.

Lane Detection.
Before the rise of deep learning, lane detection methods generally focused on feature extraction, model fitting, and lane tracking. Researchers used the color, boundary, and texture features of the road to realize lane detection and tracking. We summarize a general block diagram of the traditional lane detection methods in Figure 1. For more algorithm types and implementation details of image preprocessing, region of interest (ROI) selection, lane modeling, lane detection, and lane tracking in traditional methods, please refer to [11,12].
Methods based on deep learning have gradually become the dominant algorithms in vision tasks owing to their powerful representation capabilities. To handle the complex situations in lane detection, [6] designed a hybrid deep architecture combining an RNN with a CNN: consecutive frames are fed into the CNN for feature extraction, and the RNN then further learns from the extracted features and makes the lane prediction. In [7], the authors design a network named RS-Lane, which adds split attention and self-attention distillation on top of LaneNet to increase the reasoning ability and robustness of the method.
This work can detect an arbitrary number of lane lines. In [13], the authors improve the you only look once (YOLO) object detector and detect yellow lane lines by means of object detection. Pan et al. [8] proposed a spatial CNN (SCNN) that aggregates the features of each pixel through slice-by-slice convolution within a layer, achieving top-1 performance in the CVPR'17 TuSimple benchmark. However, to take full advantage of spatial information, this method gathers both horizontal and vertical information by shifting the sliced features recurrently, which is time-consuming. Tu et al. [9] employ the same spatial information strategy as SCNN but simplify the algorithm by changing the information shifting strides. Qin et al. [14] realize fast lane detection by gridding the image and searching the lane grids row by row and column by column. Although this method benefits from the speed delivered by grid downsampling, it cannot avoid the reduced accuracy caused by low-resolution sparse feature maps. In [15], the authors proposed PolyLaneNet, which converts the lane estimation task into predicting the polynomials that represent each lane in the input image.
This approach obtains higher real-time efficiency; however, its accuracy relies on the position of the lane line starting points, so the performance drops significantly when the lanes are severely occluded. A typical framework of deep learning-based lane detection is shown in Figure 2.

Attention Mechanism.
The attention mechanism originated in natural language processing (NLP) research and has recently been applied to the field of CV. This mechanism can selectively focus on important features and suppress irrelevant ones, thus improving the performance of DNNs. SENet [16] proposes a "Squeeze-and-Excitation" block to model interdependencies between channels. By recalibrating the channelwise feature responses, the network produces significant performance improvements at negligible overhead. Different from SENet, the approaches of [5,17] both exploit semantic interdependencies in the spatial and channel dimensions. To make the best use of the two complementary attention outputs, [5] propagates the channelwise output to the spatialwise attention submodule in a sequential arrangement, while [17] arranges the two modules in parallel and sums the outputs to further improve feature representation. In this work, we use channel attention and spatial attention in different branches to enhance the representation power of the network. The channel attention and spatial attention are similar to those of DANet [17] and CBAM [5], respectively. In [18], a self-attention distillation (SAD) approach is proposed to improve the representation learning of CNN-based lane detection models; it is a flexible plug-and-play module. For more work on deep learning-based lane line detection, please refer to [19].
Closer to our work, [20] introduces a channel attention module and a self-attention module in parallel to obtain the global contexts and channel dependencies of feature maps. However, the successive dilated convolutions introduced in the subsampling process and the large matrix multiplications required by the self-attention mechanism incur expensive computation and memory burdens. In our network, we use a lightweight downsampling module to gather low-stage feature maps and exploit spatial and channel attention within a simple yet efficient architecture.

Methods
In this section, we first present the general architecture of our model and then elaborate on the inner blocks used to capture the global context information and spatial edge information. Finally, we demonstrate how to aggregate this information to further enhance the feature representation. Figure 3 gives a pictorial description of the proposed network architecture. The architecture consists of four components and splits into two branches. The upper branch encodes increasingly abstract feature representations with long-range context in the deeper encoder output, while the lower branch preserves spatial details in low-level high-resolution feature maps. To combine the advantages of the two branches, elementwise summation is employed to fuse their features. The following subsections describe the details of each module.

Lightweight Downsampling.
In the lightweight downsampling module, we consider different strategies to subsample the original input. A large kernel (kernel size = 7) is adopted in the first subsampling process to squeeze the spatial dimension. Furthermore, we adopt the bottleneck with a residual shortcut as a substitute for conventional convolution to reduce network overhead, as shown in Figure 4. These two feature extraction strategies are combined in the lightweight downsampling module to aggregate deep abstract features while reducing the number of parameters.
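To illustrate why a bottleneck can reduce parameters relative to a conventional convolution, the following sketch counts weights for both. It assumes a generic 1×1 reduce, 3×3, 1×1 expand bottleneck with a reduction factor of 4 and ignores biases and normalization layers; the channel count, the reduction factor, and the helper names `conv_params` and `bottleneck_params` are hypothetical, and the exact unit in Figure 4 may differ.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def bottleneck_params(channels, reduction=4):
    """Weights in an assumed 1x1 reduce -> 3x3 -> 1x1 expand bottleneck.

    The residual shortcut adds no parameters because it is an identity.
    """
    mid = channels // reduction
    return (
        conv_params(channels, mid, 1)    # 1x1 channel reduction
        + conv_params(mid, mid, 3)       # 3x3 convolution on the reduced width
        + conv_params(mid, channels, 1)  # 1x1 channel expansion
    )


channels = 64  # illustrative width, not taken from the paper
plain = conv_params(channels, channels, 3)
slim = bottleneck_params(channels)
print(f"plain 3x3: {plain}, bottleneck: {slim}, ratio: {plain / slim:.2f}")
```

Under these assumptions the saving comes from running the expensive 3×3 convolution on a reduced number of channels.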

Global Context Embedding.
The GCE module involves two parts laid out sequentially to encode long-range contexts. The first part consists of a strided convolution and dilated convolutions, which enable high-stage features to obtain richer global information.
The second part is the CAM. Since a convolution extracts informative features by blending cross-channel information with a simple summation, the relative importance of different channels is ignored. However, we expect the network to selectively emphasize channels that contain lane line semantic features and suppress irrelevant ones. Channel attention provides a means of recalibrating channelwise feature responses according to interchannel correlations, which increases the network's sensitivity to crucial and informative features. CAM calculates a specific weight for each channel of the feature map, so that the channels covering the target feature receive more attention. Therefore, the precision of the network can be significantly improved with only a slight increase in computation, which is consistent with the theme of our proposed method.
Concretely, the second strategy in Section 3.2 is first used to subsample the input; dilated convolutions are then stacked to expand the receptive field of the filters. Next, CAM is appended to aggregate intraclass consistency and enhance robustness to local disturbance; the structure is illustrated in Figure 5(a). We describe the operation of CAM in detail below. Suppose that $X \in \mathbb{R}^{C \times H \times W}$ is the output of the previous layer. First, we reshape it to $X \in \mathbb{R}^{C \times N}$, where $N = H \times W$ is the total number of pixels. Then, we perform a matrix multiplication between $X \in \mathbb{R}^{C \times N}$ and its transpose $X^{T} \in \mathbb{R}^{N \times C}$.
Next, the resulting matrix is forwarded into a softmax layer to obtain the normalized interchannel dependencies $Y \in \mathbb{R}^{C \times C}$ between any two channel maps. Finally, we multiply each channel attention weight $y_{ji}$ with the corresponding channel $X_i$ and sum the result with the original channel $X_j$ to obtain the final feature maps $O$ as follows:

$$O_j = \varphi \sum_{i=1}^{C} \left( y_{ji} X_i \right) + X_j,$$

where $y_{ji}$ denotes the influence exerted by the $i$-th channel on the $j$-th channel, and $\varphi$ is a learnable parameter that is initialized as 0 and gradually learns to assign more weight.
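The channel attention steps described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `channel_attention` is hypothetical, the softmax is assumed to normalize over the source channel index $i$ (as in DANet [17]), and $\varphi$ is passed as a plain scalar instead of a learned parameter.

```python
import numpy as np


def channel_attention(x, phi=0.0):
    """Apply channel attention to a feature map x of shape (C, H, W).

    Assumption: the softmax normalizes each row of the C x C energy
    matrix, so that the weights y[j, :] sum to 1 for every target channel j.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                # X in R^{C x N}
    energy = flat @ flat.T                    # X X^T in R^{C x C}
    energy = energy - energy.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(energy)
    y = weights / weights.sum(axis=1, keepdims=True)     # y[j, i] = y_ji
    out = phi * (y @ flat) + flat             # O_j = phi * sum_i y_ji X_i + X_j
    return out.reshape(c, h, w)


features = np.arange(24, dtype=float).reshape(2, 3, 4) / 10.0
refined = channel_attention(features, phi=0.5)
print(refined.shape)  # (2, 3, 4)
```

With `phi = 0` the module reduces to the identity, which matches the stated initialization: the network starts from the unmodified features and only gradually mixes in the attention-weighted channels.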

Explicit Boundary Regression.
Lane lines in the distance are inconspicuous and incomplete due to the influence of light and occlusion. If we merely apply ordinary convolutions to the high-resolution input to generate a local feature map, these indistinct lane targets will be overshadowed by other salient objects, resulting in misclassification and missed detections in semantic segmentation. To remedy this issue, we first apply SAM to enhance the discriminative ability of the features; SAM redistributes weights based on the interspatial relationship of features in a global view, as shown in Figure 5(b). We improve the SAM proposed in [5], which reallocates the weight of each pixel in the feature map and assigns higher weights to pixels containing target cues, thus improving the representation of indistinct boundary features. Given the spatialwise refined features, we introduce an edge supervision to guide the network to learn the boundary characteristics. This supervision acts as an auxiliary boundary segmentation task, enabling the network to achieve EBR. Next, we describe the process of SAM and edge supervision.
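As illustrated in Figure 5(b), given a local feature $I \in \mathbb{R}^{C \times H \times W}$, we first feed it into point convolution, max-pooling, and average-pooling operations along the channel axis to generate the feature descriptors $I^{s}_{conv}$, $I^{s}_{max}$, and $I^{s}_{avg}$, respectively, where $I^{s}_{conv}, I^{s}_{max}, I^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$. Then we concatenate the three descriptors and forward them into a convolution layer with a large kernel size of 7. After that, we apply a sigmoid layer to calculate the 2D spatial feature map $F \in \mathbb{R}^{H \times W}$. Finally, we perform an elementwise multiplication between $I$ and the feature map $F$ to obtain the final output $E$ as follows:

$$F = \sigma\left( W_h \left( \left[ I^{s}_{conv};\ I^{s}_{max};\ I^{s}_{avg} \right] \right) \right), \quad I^{s}_{conv} = W_f(I),$$

$$E = I \otimes F,$$

where $\sigma$ is the sigmoid function, $W_f \in \mathbb{R}^{1 \times 1 \times C \times 1}$ and $W_h \in \mathbb{R}^{7 \times 7 \times 3 \times 1}$ represent convolutional operations with kernel sizes of 1 and 7, respectively, and $\otimes$ denotes elementwise multiplication.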
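The spatial attention steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `conv2d_same` and `spatial_attention` are hypothetical, the convolution is written as an explicit zero-padded cross-correlation (as in common deep learning frameworks) without bias, and the weights are random placeholders rather than trained values.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Cross-correlate x (C, H, W) with kernel (C, k, k) into an (H, W) map.

    Zero padding keeps the spatial size unchanged (stride 1, odd k).
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return out


def spatial_attention(feat, w_f, w_h):
    """feat: (C, H, W); w_f: (C,) point-conv weights; w_h: (3, 7, 7)."""
    d_conv = np.tensordot(w_f, feat, axes=1)    # 1x1 conv across channels
    d_max = feat.max(axis=0)                    # max-pooling along channels
    d_avg = feat.mean(axis=0)                   # average-pooling along channels
    stacked = np.stack([d_conv, d_max, d_avg])  # concatenated descriptors
    attn = 1.0 / (1.0 + np.exp(-conv2d_same(stacked, w_h)))  # sigmoid -> F
    return feat * attn                          # E = I (x) F, broadcast over C


rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))
out = spatial_attention(feat, rng.normal(size=4), rng.normal(size=(3, 7, 7)) * 0.1)
print(out.shape)  # (4, 8, 8)
```

Because the attention map lies in (0, 1), every pixel is rescaled rather than replaced, so positions with strong lane cues keep most of their response while others are attenuated.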
To enhance the continuity and discriminability of the lane boundary, we apply the Sobel edge extraction operator to the semantic labels and use the result as a supervision signal. It is worth noting that the lane lines occupy only a small proportion of the labels, and the imbalance between background and foreground is detrimental to the segmentation performance. Thus, we employ focal loss [21] as an auxiliary loss function to supervise the output of SAM:

$$L_{FL} = -\sum_{n=1}^{N} (1 - p_n)^{\gamma} \log(p_n),$$

where $p_n$ is the probability of class $n$, $n \in \{1, 2, 3, \ldots, N\}$, $N$ is the maximal number of labels, and $\gamma$ is a modulating factor, set to 2 here.
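A minimal NumPy sketch of the two ingredients above, edge labels from a Sobel filter and a focal loss with a modulating factor of 2, is given below. The function names, the edge-replicating padding, the "any nonzero gradient" threshold, and the use of the standard per-pixel focal term on the true-class probability from [21] are assumptions made for illustration.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def sobel_edges(label):
    """Turn a binary lane mask (H, W) into a binary edge map (H, W)."""
    h, w = label.shape
    # Replicate border pixels so the image border itself is not an edge.
    padded = np.pad(label.astype(float), 1, mode="edge")
    magnitude = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            gx = np.sum(patch * SOBEL_X)
            gy = np.sum(patch * SOBEL_Y)
            magnitude[i, j] = np.hypot(gx, gy)
    return (magnitude > 0).astype(np.uint8)


def focal_loss(prob_true, gamma=2.0, eps=1e-7):
    """Mean focal loss given the predicted probability of the true class."""
    p = np.clip(prob_true, eps, 1.0)
    return float(np.mean(-((1.0 - p) ** gamma) * np.log(p)))


mask = np.zeros((7, 7), dtype=np.uint8)
mask[:, 3] = 1                      # a one-pixel-wide vertical lane
print(sobel_edges(mask))
print(focal_loss(np.array([0.9, 0.6])))
```

The factor $(1 - p)^{\gamma}$ shrinks the contribution of confidently classified pixels, which are mostly background, so the sparse edge pixels dominate the gradient.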

Integration and Classification.
To embed rich semantic information into low-level features, we apply a transposed convolution with stride 2 to upsample the outputs of GCE and integrate them with the outputs of EBR by summation. Furthermore, from the perspective of feature fusion, the use of EBR can bridge the gap between low-level and high-level features, as emphasized in [22]. Two bottleneck layers with residual connections then follow to tightly aggregate the features from the two branches. Finally, we convolve the fused results to generate the final prediction maps.

Experiment
We evaluate our network on two widely used lane detection benchmark datasets: TuSimple benchmark dataset [23] and CULane dataset [8]. We first introduce the datasets and report implementation details, then perform a series of ablation experiments, and present comparison results with other state-of-the-art approaches.

Datasets and Implementation Details
TuSimple: This is a well-known traffic lane detection benchmark that contains five annotated lane markings, with 3626 images for training and 2782 images for testing. Because no validation set is provided, we randomly split 368 images from the training set as a validation set, which is used to prevent overfitting and to validate the model during training.
CULane: The CULane dataset is a large-scale, challenging lane detection dataset that contains more than 55 hours of video, from which 133235 frames are extracted: 88880 frames for training, 9675 frames for validation, and 34680 frames for testing. The dataset covers urban, rural, and highway conditions and includes both normal scenes and challenging ones such as night driving.

Implementation Details
Training: Our experiments are run on the PyTorch deep learning framework. Because high-resolution inputs are computationally expensive during training, we first resize the original images to 360 × 640 for TuSimple and 288 × 800 for CULane. Then we train our network using stochastic gradient descent (SGD) with momentum 0.9 and weight decay 0.0001. The batch size is set to 8. Following the success of DeepLab [24] in semantic segmentation, we employ a similar "poly" learning rate policy, where the initial learning rate $4 \times 10^{-3}$ is multiplied by $(1 - \mathrm{iter}/\mathrm{max\_iter})^{\mathrm{power}}$. We adopt cross-entropy loss to measure the similarity between the prediction mask and the ground truth. In addition, a weighted focal loss serves as an auxiliary loss function for precise boundary regression, giving the total loss

$$L = L_{CE} + \omega L_{FL},$$

where $L_{CE}$ represents the segmentation loss calculated by cross-entropy, $L_{FL}$ represents the boundary regression loss calculated by focal loss, and $\omega$ is a weight factor. Training is terminated when the epoch number exceeds 100, and the model with the lowest validation loss is selected as the final testing model.
Metrics: The evaluation metrics used to quantify the segmentation performance consist of mean intersection-over-union (mIoU), accuracy (Acc), false positive (FP), false negative (FN), true positive (TP), and F1-measure (F1). For speed evaluation, we adopt FPS. The Acc metric defined in the TuSimple benchmark is

$$\mathrm{Acc} = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}},$$

where $C_{clip}$ denotes the number of points detected correctly and $S_{clip}$ represents the total number of ground truth points in each clip. The F1 metric defined in the CULane dataset is

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
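The schedule, combined loss, and metrics above reduce to a few lines of arithmetic, sketched below in plain Python. The function names are hypothetical, and `power = 0.9` in the usage lines is an assumption borrowed from the common DeepLab setting, since the value is not stated in the text; the sample counts are made up for illustration.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """'Poly' policy: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power


def total_loss(ce_loss, focal_loss_value, omega):
    """Cross-entropy segmentation loss plus weighted focal boundary loss."""
    return ce_loss + omega * focal_loss_value


def tusimple_accuracy(correct_per_clip, truth_per_clip):
    """Correctly detected points over ground truth points, summed over clips."""
    return sum(correct_per_clip) / sum(truth_per_clip)


def f1_score(tp, fp, fn):
    """F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only; power=0.9 is an assumed setting.
for it in (0, 500, 1000):
    print(it, poly_lr(4e-3, it, max_iter=1000, power=0.9))
print(total_loss(0.30, 0.10, omega=0.2))
print(tusimple_accuracy([8, 9], [10, 10]))
print(f1_score(tp=8, fp=2, fn=2))
```

The learning rate starts at its initial value and decays smoothly to zero at `max_iter`, while `omega` controls how strongly the edge supervision contributes to the total loss.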

Ablation Study.
In this section, we decompose our network step by step to investigate the effect of each component of the proposed method. In the following experiments, we compare variants of the inner structure of the proposed architecture on the TuSimple dataset and use the framework without the attention modules and edge supervision as the baseline.

Attention Mechanism.
We use CAM in the GCE module to capture long-range dependencies for better lane marking inference. SAM is used to break the situation in which all positions are weighted identically, which is complementary to the channel attention. To verify the performance of these components, we explore four different combinations in Table 1. The table shows that integrating SAM and CAM significantly improves the segmentation accuracy of the network, with CAM contributing the most. SAM is placed on the high-resolution feature map branch to optimize local position details by increasing the weights of pixels containing lane features, thus improving the detection accuracy within a small range. CAM is integrated into the high-semantic feature map branch to optimize the global classification accuracy of the network by enhancing the weights of channels containing lane cues, resulting in an obvious boost in detection accuracy.
These two submodules are arranged in parallel on different branches, so that the dependencies modeled by the two attention modules do not affect each other. Accordingly, the last line of Table 1 shows that the precision of the combined model is further improved.

Boundary Regression.
To better identify and locate boundary features, we employ label-based edge detection results as the supervision signal to strengthen the representation ability of the network. We attach the edge supervision to the output of SAM. This improves the continuity of the lane boundary and contributes explicitly to the regression of lane lines. We first quantitatively test the balance between the two losses to analyze how the weight given to boundary regression affects the final segmentation results; the values {0.05, 0.1, 0.2, 0.5, 0.75} are used to weight the focal loss. Figure 6 shows that, with otherwise identical settings, mIoU peaks at 70.60% with $\omega = 0.2$. The initial climbing trend of the line chart shows that moderate edge supervision helps the network extract lane lines efficiently, while excessive reliance on it leads to a rapid decline in network performance. Therefore, we set the turning point as the final weight factor.
We visualize the effect of the auxiliary edge supervision in Figure 7. The first and second columns show the original image and the ground truth, respectively. The last two columns show the detection results with and without edge supervision. The fourth column clearly outperforms the third column, with better refined edge details.
The quantitative results are reported in Table 2. The experimental results show that the proposed method outperforms LaneNet and PolyLaneNet by 1.73% and 1.75%, respectively, and achieves performance comparable with UFast. PolyLaneNet obtains lane line detection results by predicting the polynomial parameters of the lane lines, so a slight deviation in the parameters leads to a significant decrease in the accuracy of the lane detection result. RESA achieves excellent performance by recurrently convolving sliced feature maps vertically and horizontally to gather global information for each pixel.
This approach introduces a large number of convolution operations, which significantly increases the number of parameters and the inference time of the network. Figure 8 displays the qualitative comparison results. Because TuSimple's evaluation metric allows the predicted points to deviate from the true points within a certain threshold, our quantitative accuracy is only 1.75% higher than that of PolyLaneNet. However, the qualitative comparison makes it apparent that our results are more consistent with the real lines and deviate less from them.
To verify the lightweight structure and real-time performance of our model, further comparative experiments were carried out on inference time and parameters. We loop 100 times to calculate the run time of a single frame. The last two columns of Table 2 provide the results. The experimental results show that our network is 2.7×, 5.5×, and 24.4× faster than FastDraw, ENet-SAD, and RESA, respectively. In addition, the model parameters are only 1/16 and 1/39 of those of RESA and Res18-UFast, respectively. Although RESA and ENet-SAD achieve better accuracy, our method provides a more practical and comprehensive choice given the high real-time requirements of practical application scenarios and the limited memory of embedded devices.
In addition to the above ablation and comparative experiments, we conduct further tests on the robustness of the algorithm. During driving, the illumination of the road surface often changes when the road is covered by shadows or when the vehicle passes under a viaduct. Furthermore, spurious marks on the road surface and significant changes in road color also interfere with the lane line detection results. Therefore, the lane detection algorithm needs to be robust enough to deal with these situations. In Figure 9 we first display multiple road scenes with obvious challenges in road lighting, surface color, and spurious dirt traces.
Then, the lane line clustering and polynomial fitting results are superimposed on the original image to give a more intuitive result, displayed in the following rows.
The experimental results show that the proposed algorithm can accurately identify the lanes, including scenarios with curves, road slope, and online.

Performance Evaluation on CULane.
To verify the effectiveness and generality of the proposed method, we conduct comparative experiments on the CULane dataset. Several recent lane detection methods, including ENet-SAD [18], SCNN [8], FastDraw [25], Res18-Ultra [14], and Res18-VP [27], are used for comparison. Table 3 presents the results. Compared with the TuSimple dataset, the CULane dataset has a larger and richer training set and can better train the model to generalize across different scenarios. Table 3 shows that the proposed method is comparable with Res18-VP and Res18-Ultra in the F1 metric and performs particularly well in terms of parameter count, which indicates that our method strikes a balance between accuracy, speed, and computational burden. Compared with ENet-SAD, our model has 0.59 M more parameters. However, both models are lightweight network structures with fewer than 2 M parameters in total, so the 0.59 M parameter advantage of ENet-SAD has little influence on practical performance. In terms of speed, the proposed network is more than 5 times faster than ENet-SAD; this gap in real-time performance matters more in practical applications and allows our method to meet the detection speed requirements of various situations. We then select challenging scenarios from the test set to intuitively verify the effectiveness of the proposed method. Figure 10 presents the visualizations.

Conclusion
In this paper, we propose an efficient lane detection method based on a lightweight attention DNN, which is tailored for the real-time lane detection task. Our method can effectively capture global context information to segment occluded lane lines and classify each pixel. Meanwhile, it retains high-resolution dense features to better infer inconspicuous lane boundaries. To generate semantically precise prediction maps and refine segmentation results along lane boundaries, we further incorporate attention and edge supervision mechanisms into the network. We evaluate the effectiveness and generality of our proposed method on the TuSimple and CULane datasets.
Although the results show that state-of-the-art methods can obtain slightly higher accuracy, their model parameters and computational overheads far exceed those of our network. Thus, our proposed method strikes a balance between accuracy and computational cost. Extensive experiments demonstrate that our network attains robust and effective performance under challenging scenarios, which provides a reference for implementing lane detection on embedded devices. In the future, we will continue to explore how to improve the accuracy of the model while seeking a balance between accuracy and efficiency. In addition, we will deploy our algorithm on an embedded vehicle platform to guide subsequent obstacle avoidance and planning tasks.

Conflicts of Interest
The authors declare that they have no conflicts of interest.