Vision-Based Lane Departure Detection Using a Stacked Sparse Autoencoder

This paper presents a lane departure detection approach that utilizes a stacked sparse autoencoder (SSAE) for vehicles driving on motorways or similar roads. Image preprocessing techniques are executed in the initialization procedure to obtain a robust region-of-interest extraction. Lane detection based on the Hough transform with a polar angle constraint and a matching algorithm is then implemented for two-lane boundary extraction. The slopes and intercepts of the lines are obtained by converting the two lanes from polar to Cartesian space. Lateral offsets are also computed, as an important step of feature extraction, in the image pixel coordinates without any intrinsic or extrinsic camera parameters. Subsequently, a softmax classifier is designed with the proposed SSAE; the slopes, intercepts, and lateral offsets are the feature inputs. A greedy, layer-wise method is employed to pretrain the weights of the entire deep network on these inputs. Fine-tuning is then conducted to determine the globally optimal parameters by simultaneously adjusting the parameters of all layers. The outputs are three detection labels. Experimental results indicate that the proposed approach detects lane departure robustly with a high detection rate. The efficiency of the proposed method is demonstrated on several real images.


Introduction
Road safety is a major social issue. The 2015 Global Status Report on Road Safety provided by the World Health Organization shows that the total number of road traffic deaths worldwide has plateaued at 1.25 million per year, with over 3,400 people dying on roads all over the world every day [1]. A considerable fraction of these accidents is due to the unintentional deviation of vehicles from their traveling lane. Unexpected lane departure usually occurs because of the temporary and involuntary fading of a driver's vision caused by fatigue, use of a mobile phone, operation of devices on the instrument panels of vehicles, or chatting. Lane departure is also the secondary cause of road traffic accidents [2]. Therefore, identifying lane departure occurrence is important.
Vision-based systems, which involve installing video cameras (and/or other sensors) in the interior of vehicles to sense the environment and provide useful information to the driver, can be used to improve road safety. Equipping drivers with an effective vision-based lane departure warning system (LDWS) can rapidly and effectively prevent or reduce lane departure accidents. Lane departure detection has elicited much attention, and several vision-based LDWSs have been successfully developed in the past 20 years [3-18]. These vision-based LDWSs rely on different computer vision techniques. With preparatory work that includes image preprocessing, lane detection, and matching, lane departure detection was conducted in this study by using stacked sparse autoencoders (SSAEs) to create a softmax classifier on gray-level images obtained by a camera with a charge-coupled device (CCD); the camera was mounted on a test vehicle. Lane detection relies on the Hough transform with a polar angle constraint (hereinafter referred to as PACHT). The lateral offsets D_l and D_r, the distances of the vehicle to the left and right lanes, were computed without internal and external camera parameters [19]. The SSAE neural network is a highly effective classification method and has been widely employed for classification and pattern recognition problems [20] since its proposal. An SSAE consists of multiple layers of SAEs, in which the output of the previous layer is the input of the next layer. Greedy layer-wise training of an SSAE can pretrain the weights of the entire deep network (DN) by training each layer in turn. Numerous optimization algorithms have been proposed to optimize the parameters of neural networks; the steepest descent algorithm [21] was selected in the current study because of its practicality. A pictorial description of the lane departure of a vehicle is shown in Figures 1(a) and 1(d).
The lane boundary description changes when the driving direction of the vehicle deviates from the center of its moving lane in the left or right direction, as shown in Figures 1(b) and 1(c).
To implement the proposed method, we assumed the following: (1) the optical axis of the CCD camera, the lane center, and the centerline of the car body nearly coincide, as shown in Figures 1(e) and 1(f); (2) the processed video sequences are collected on highways and similar roads where the lane curvatures are small; (3) the lane marks are denoted with a color that is brighter than that of the other parts; and (4) the left and right lane marks are parallel to the lane center. With these assumptions, we obtained the following advantages. First, the proposed approach reduces the noise-related effects of a dynamic road scene. Second, the approach allows for lane departure detection without camera-related parameters (i.e., camera calibration is unnecessary). Third, the six input features of our SSAE system can be obtained in the image pixel coordinates without coordinate transformation from world coordinates.
The proposed algorithm involves four steps. Figure 2 illustrates the basic procedure. The first step is image preprocessing, which includes graying, filtering, binarization, and extraction of the region of interest (ROI); these processes are presented in Section 3. The second step is lane detection and matching (presented in Section 4), in which the lane slopes k_l and k_r and intercepts b_l and b_r are obtained. The third step is the calculation of the lateral offsets (LO) D_l and D_r and the design of a softmax classifier based on the SSAE. The last step is lane departure detection with three labels using the SSAE. The third and last steps are presented in Section 5. The experimental results and comparisons with other methods are provided in Section 6. The conclusions are provided in Section 7.

Related Work
An important problem with captured images is how to effectively obtain robust ROI extraction parts while using image processing techniques to eliminate noise factors. Various feature extraction techniques have been utilized in the literature. These techniques include filtering and denoising methods (i.e., mean, median, adaptive [22], and finite impulse response (FIR) filters [23]), gradient operators (Sobel [24], Canny [25], Roberts [26], and Prewitt [27]), binarization methods [28], and vanishing point detection methods [29]. The high-performance approach proposed in the current study was verified through an experimental comparison.
The next step is robust detection and tracking of lane marks. Ruyi et al. [30, 31] used inverse perspective mappings to obtain top-view images for lane boundary detection. Various shape models, including piecewise linear segments [32], parabolas [19], hyperbolas [33], splines [34, 35], snakes [36], clothoids [37], and their combinations [38], have been applied to determine the mathematical parameters that fit lane borders. These models focus on obtaining such parameters but are complex and time consuming. Sehestedt et al. [39] proposed a lane tracking system based on a weak model and particle filtering. Borkar et al. [40] applied the Kalman filter to track lane marks. A group intelligence algorithm was also utilized in lane boundary tracking by Cheng [41]. Several lane detectors and trackers, such as GOLD [3], RALPH [4], SCARF [5], MANIAC [6], and LANA [7], have been implemented and cited in the literature.
Other researchers worked on LDWSs by using several types of lane boundary estimation and lane tracking techniques. LeBlanc et al. [8] proposed an LDWS that predicts a vehicle's path and compares this path with the sensed road geometry to estimate the time-to-lane-crossing (TLC). Kwon et al. [9] developed a vision-based LDWS that considers two warning criteria: LO and TLC. Lee [10] proposed an LDWS that estimates lane orientation through an edge distribution function and identifies changes in the driving direction of a vehicle; a modification of this technique [11] includes a boundary pixel extractor to improve robustness. Jung and Kelber [12] also provided an LDWS using LO with an uncalibrated camera, and the change rate was considered. Hsu [13] applied radial basis probability networks as a pattern recognition mechanism that measures and records a vehicle's lateral displacement and its change rate. The trajectory is then compared with the training patterns to determine the best-fitting classification and to check whether the vehicle is about to perform lane departure. Fardi et al. [14] proposed an LDWS based on the ratio of the lane angles and distances of two-lane boundaries. Hsu et al. [15] proposed an LDWS based on angle variation. Kibbel et al. [16] also used lateral position and lateral velocity based on road marking detection for lane departure detection. Kim and Oh [17] proposed an LDWS based on fuzzy techniques by combining LO information with TLC. Wang et al. [18] also applied the fuzzy method to vision-based lane detection and LDWS by using the angle relations of the boundaries.
In sum, the approaches for lane departure detection can generally be classified into two main classes according to whether camera calibration is required. The two classes are (1) lane departure detection with camera calibration, which obtains the internal and external parameters of the camera and links the 3D world coordinates and image pixel coordinates via coordinate transformation [8, 9, 13-17], and (2) lane departure detection that relies solely on image pixel coordinates and does not provide accurate estimates of the vehicle's position in the world coordinates [10-12]. Lane departure detection methods may also be classified into four main classes according to the lane departure discriminant parameters: (1) the TLC method [8], (2) the LO method [12, 13, 16], (3) the lane angle variation method [14, 15, 18], and (4) combinations of these three methods [9-11, 17]. With the four assumptions established in this study, we explored the combination of LO, lane slope, and intercept solely in the image pixel coordinates as the feature inputs of our softmax classifier. This scenario was considered because of the important observation that the left and right lane slopes and the intercepts change in the image pixel coordinates when lane departures to the left or right side occur. PACHT was utilized to estimate the lane slope and intercept. The proposed approach is unaffected by the lens optics parameters of the camera, vehicle-related data (e.g., vehicle width and weight), and the width of the driving lane. Furthermore, coordinate transformation, a complex road model, curvature, and TLC are unnecessary in the proposed approach.

Image Preprocessing
The input vision sequences include not only lane information but also nonlane information, such as road obstructions and the sky, which affect lane detection. Therefore, roadway image preprocessing is necessary to highlight the lane lines and detect lanes in real-time with accurate and low-error rates.
3.1. Graying. The collected sequences are RGB images. All RGB images were converted into gray images according to [42]

Gray = 0.2989 R + 0.5870 G + 0.1140 B. (1)

This conversion reduces the computational burden by three times and contributes to real-time image processing.
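As a concreteness check, the conversion above can be sketched directly; the function name is ours, and the weights are the standard ITU-R BT.601 luminance coefficients used in [42].

```python
import numpy as np

def rgb_to_gray(rgb):
    """Weighted RGB-to-gray conversion (ITU-R BT.601 coefficients)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]
```

Processing one gray channel instead of three roughly triples throughput, which is the "three times" reduction mentioned above.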

3.2. Filtering. The gray input images contained a large amount of noise, so noise removal was necessary. A 2D FIR filter was used in our test [23]. This 2D filter consumes far less time than the other filters compared, and its filtering result is also better. The FIR filtering result is shown in Figure 3.

3.3. Binarization. Binarization is required to highlight the lane in the filtered images. The core problem of binarization is determining the optimal threshold: if the threshold is too large, lane edge points will be missed; if it is too small, redundant information will be detected. This study employed adaptive Otsu's method to perform binarization [28]. The method assumes the presence of two pixel distributions (one for the lane and another for the background) and selects the threshold value t that separates them best by maximizing the between-class variance:

σ²(t) = ω₁(t)(u₁ − u)² + ω₂(t)(u₂ − u)², (2)

where t is the threshold value separating the two pixel distributions, ω₁ denotes the proportion of background pixels in the total image, ω₂ denotes the proportion of lane pixels in the entire image, u₁ is the average grayscale of the background pixels, u₂ is the average grayscale of the lane pixels, and u is the average grayscale of the entire image. Adaptive Otsu's method performed better than the other thresholding methods compared, as shown in Figure 4.

3.4. ROI Extraction. The two lanes detected in the previous frame were obtained as (k_l, b_l) and (k_r, b_r). These slopes and intercepts were introduced into (3) to calculate the intersection ordinate of the two straight lines (i.e., the vanishing point ordinate in the current frame):

y_v = (k_l b_r − k_r b_l) / (k_l − k_r), (3)
where k_l is the slope of the left lane, k_r is the slope of the right lane, b_l is the intercept of the left lane, and b_r is the intercept of the right lane. The part below this ordinate is retained as the dynamic ROI. The schematic in Figure 5(a) shows that ROI-II is the dynamic ROI, whereas ROI-I is the truncated part. Figure 5(b) shows the experimental result.
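A minimal sketch of Otsu's threshold search follows; it exhaustively scores each candidate threshold by the between-class variance (maximizing it is equivalent to minimizing the within-class variance). A real pipeline would typically call an optimized library routine instead.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance (Otsu's method)."""
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    p = hist / hist.sum()                      # gray-level probabilities
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w1, w2 = p[:t].sum(), p[t:].sum()      # class probabilities
        if w1 == 0 or w2 == 0:
            continue
        u1 = (np.arange(t) * p[:t]).sum() / w1          # background mean
        u2 = (np.arange(t, 256) * p[t:]).sum() / w2     # lane mean
        var = w1 * w2 * (u1 - u2) ** 2         # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```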

Lane Detection and Matching
4.1. Linear Model. Selecting a good model is necessary in the lane recognition process, which begins with the hypothesis of a road model. A linear model is better than parabolic, hyperbola-pair, and spline models in terms of algorithm simplicity and computational burden. A linear model was therefore selected in this study and adopted for the subsequent processing. Below, we verify that the selected model meets the accuracy requirements.
The highway engineering technical standard states that the minimum turning radius on a freeway is 650 m; as in the method of [43], the lane curvature radius is therefore assigned as R = 650 m. The lane line is replaced with a straight line in a region of length l (l = 5 m). The resulting error is the sagitta of the corresponding arc:

e = R − √(R² − (l/2)²) ≈ l²/(8R) ≈ 4.8 mm. (4)

The result, about 4.8 mm, is less than 5 mm, which is less than one-thousandth of the segment length. Therefore, within a short sight distance, a highway curve can be approximated as a straight line. The linear model can thus meet lane detection accuracy requirements for highways.
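The error figure quoted above can be checked numerically: the deviation between a 5 m chord and the 650 m-radius arc it subtends is the sagitta, a sketch of which is:

```python
import math

R = 650.0  # minimum freeway turning radius (m)
l = 5.0    # length of the straight-line approximation region (m)

# Sagitta of a chord of length l on a circle of radius R,
# together with its small-angle approximation l^2 / (8R).
e_exact = R - math.sqrt(R**2 - (l / 2.0)**2)
e_approx = l**2 / (8.0 * R)

print(round(e_exact * 1000, 2), "mm")  # ~4.81 mm, under the 5 mm bound
```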

4.2. PACHT.
The classic Hough transform utilizes point-line duality to convert the straight-line detection problem in the image space into a cumulative peak problem in the Hough domain by using

ρ = x cos θ + y sin θ, (5)

where ρ is the polar radius denoting the normal distance of the line from the image space origin, θ is the polar angle between the x-axis and the normal line, (x, y) is the coordinate in the image space, ρ-θ denotes the Hough domain, and x-y denotes the image space. Many non-target lane edge points (e.g., trees, road signs, and other interference points) commonly remain after binarization. This study improved the traditional Hough transform by restraining the ρ and θ values to limit the scope of the voting space and minimize the interference.
We applied these constraints as shown in Figure 6(a). Figure 6(b) shows that many interference points in the areas involved in the follow-up processing were effectively removed.
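The constrained voting can be sketched as follows; the accumulator resolution, the helper name, and the specific θ window are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def pacht_peak(points, theta_range, rho_max, n_theta=60, n_rho=200):
    """Hough voting restricted to a polar-angle window (the PACHT idea):
    constraining theta (and rho) shrinks the voting space and suppresses
    edge points that cannot belong to a lane."""
    thetas = np.linspace(theta_range[0], theta_range[1], n_theta)
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)  # rho = x cos t + y sin t
        idx = np.round(rho / rho_max * (n_rho - 1)).astype(int)
        ok = (idx >= 0) & (idx < n_rho)
        acc[idx[ok], np.arange(n_theta)[ok]] += 1
    r, t = np.unravel_index(np.argmax(acc), acc.shape)  # strongest cell
    return rho_max * r / (n_rho - 1), thetas[t]
```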

4.3. Lane Detection and Matching. PACHT is applied to detect the lanes.
Although the aforementioned processing has restricted the region, a frame contains at least two lane lines; on a high-grade road, it can even contain four, six, or more lanes in real lane detection. We assumed that a maximum of N lines are detected per frame and that all lines constitute a straight-line repository {(ρ_i, θ_i)}, i = 1, 2, ..., N, on which the following matching operation is performed.
(iii) The distances between the lines identified in the current frame and those in the repository at time k − 1 are calculated one by one according to the following formula:

d_ij = |ρ_ik − ρ_j(k−1)| / W + |θ_ik − θ_j(k−1)|, (6)

where ρ_ik is the polar radius of the lines detected in the current frame, θ_ik is the polar angle of the lines detected in the current frame, ρ_j(k−1) is the polar radius of the lines in the repository at time k − 1, θ_j(k−1) is the polar angle of the lines in the repository at time k − 1, and W is the width of the road image; i, j = 1, 2, ..., N.
(iv) The best matches between the current lines and those in the repository are then identified. If d_ij < d_min, the matching is successful, the input line replaces the repository line that corresponds to it, and the count value is increased by one. Otherwise, the count number decreases by one until it saturates. d_min is a predetermined matching distance threshold.
(v) The count number c of each line is then checked. If c < 0, let c = 0; if c > c_max, let c = c_max. This process continues to the next cycle until the image sequences end.
The counting process of the counter is shown in Figure 7, which shows that two active lanes are detected in the system. We set c_max to 25 in this process. Two horizontal lines were observed at a count value of 25, and the two lines indicate two detected lanes. The vehicle changed its driving lane at frames 250 and 750, where the curve-crossing phenomenon occurred. Figure 8 shows the lane detection and matching flow chart.
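Steps (iii)-(v) can be sketched as follows. The distance weighting, the threshold value D_MIN, and the decrement policy for unmatched lines are assumptions for illustration; only C_MAX = 25 and the general replace/increment/decrement behavior follow the text.

```python
C_MAX = 25    # counter saturation value, as in the text
D_MIN = 0.05  # matching distance threshold d_min (assumed value)
W = 320       # road image width in pixels

def match_distance(line, stored):
    """Normalized polar-radius difference plus polar-angle difference."""
    rho, theta = line
    rho_p, theta_p = stored
    return abs(rho - rho_p) / W + abs(theta - theta_p)

def update_repository(detected, repository, counts):
    """Match detected lines against the repository and update vote counters."""
    for line in detected:
        d = [match_distance(line, s) for s in repository]
        j = min(range(len(d)), key=d.__getitem__)  # nearest stored line
        if d[j] < D_MIN:
            repository[j] = line                   # refresh with new detection
            counts[j] = min(counts[j] + 1, C_MAX)  # saturate at C_MAX
        else:
            counts[j] = max(counts[j] - 1, 0)      # decay toward removal
    return repository, counts
```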
The two lanes are then converted from polar to Cartesian space, and the left and right lane slopes k_l and k_r and lane intercepts b_l and b_r are obtained. The LO is the distance between the corner of the rigid vehicle body and the two lanes. Strict calculation of this distance has a hysteresis effect when the onset time, which is the driver's initial action time, is considered. Therefore, we presumed a virtual lane that is narrower than the actual lane; our LO is the distance between the center and the virtual lane. Figure 9(a) presents the description of our LO in the world coordinates, where D_l is the left LO, D_r is the right LO, and R is the width of the reserved area.

Lane Departure Detection Based on SSAE
Given that the images captured by the CCD camera are projected through the perspective, D_l and D_r can be directly calculated in the image pixel coordinates without any intrinsic or extrinsic camera parameters. Figure 9(c) shows the perspective projection image, where D_l is the left LO, D_r is the right LO, W is the width of the image, R is the width of the reserved area, V is the width of the vehicle, x_l is the abscissa of the intersection of the left lane and the image bottom boundary, and x_r is the abscissa of the intersection of the right lane and the image bottom boundary. Figure 9(a) shows that the width of the vehicle V is considered approximately. Figures 9(b), 9(c), and 9(d) show that R + V/2, which is a constant, can be regarded as a whole through mathematical transformation. Throughout its computation, the LO is not influenced by the parameters of the lens optics, the vehicle type, the width of the traveling lane, or the localization of the lane marks.
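Under the geometry described above, a hedged sketch of the LO computation in image pixel coordinates is shown below. The sign convention and the numeric value of the constant C = R + V/2 (already projected into pixels) are our assumptions for illustration, not the paper's calibrated values.

```python
W = 320  # image width in pixels
C = 40   # R + V/2 projected into pixels, treated as one constant (assumed)

def lateral_offsets(x_l, x_r):
    """D_l, D_r: distances from the virtual lane to the left/right lane marks,
    computed from the abscissas where the lanes cross the image bottom edge.
    Assumes the image center column W/2 is the vehicle centerline."""
    d_l = (W / 2 - C) - x_l  # left lateral offset
    d_r = x_r - (W / 2 + C)  # right lateral offset
    return d_l, d_r
```

In this sketch, a shrinking (eventually negative) offset on one side would indicate drift toward that lane boundary.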

SSAE.
Neural networks have been applied to many classification problems and have obtained favorable results. This study applies an SSAE model to classify lane departure. The SSAE initializes its parameters and then utilizes feed-forward and back-propagation with a batch gradient descent algorithm to minimize the cost function and obtain globally optimal parameters. This progression, which involves unsupervised learning, is called pretraining. Afterward, fine-tuning is employed to obtain improved results. The SSAE possesses powerful expressiveness and enjoys all the benefits of DNs. Furthermore, the model can accomplish hierarchical grouping or part-whole decomposition of the input.
Unlike other neural networks, an SAE neural network is an unsupervised learning algorithm that does not require labeled training examples; back-propagation is applied with the target values set equal to the inputs. A schematic of an SAE with m inputs and n units in the hidden layer is shown in Figure 10. The overall cost function of an AE is

J(W, b) = (1/m) Σ_{i=1}^{m} (1/2) ‖h_{W,b}(x^{(i)}) − y^{(i)}‖² + (λ/2) Σ_l Σ_i Σ_j (W_{ji}^{(l)})², (7)

where m is the number of inputs, h_{W,b}(x^{(i)}) is the output of the activation function when the raw input is x^{(i)}, y^{(i)} is the raw output, λ controls the relative importance of the second (weight decay) term, and W_{ji}^{(l)} is the weight associated with the connection between unit i in layer l and unit j in layer l + 1. Several correlations of the input features in an SAE can be determined by imposing constraints on the network. Sparsity is imposed to constrain the hidden units as follows:

ρ̂_j = (1/m) Σ_{i=1}^{m} a_j^{(2)}(x^{(i)}), (8)

where a_j^{(2)} denotes the activation of hidden unit j in the SAE when the network is given a specific input x^{(i)}, and ρ̂_j is the average activation of hidden unit j. Moreover, we set the two parameters equal as follows:

ρ̂_j = ρ. (9)

The parameter ρ commonly has a small value (e.g., 0.05); thus, the activation of each hidden unit must be close to 0. A penalty term is added to penalize ρ̂_j deviating significantly from ρ in order to optimize the objective:

Σ_{j=1}^{n} KL(ρ ‖ ρ̂_j) = Σ_{j=1}^{n} [ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j))]. (10)

The overall cost function is then

J_sparse(W, b) = J(W, b) + β Σ_{j=1}^{n} KL(ρ ‖ ρ̂_j), (11)

where β is the weight of the sparsity penalty. The SSAE consists of multiple SAE layers; the outputs of each layer are wired to the inputs of the succeeding layer. The detailed establishment of the SSAE, pretraining, and fine-tuning were conducted in the following steps.
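For concreteness, the KL-divergence sparsity penalty described above can be computed as follows (a sketch; the function name is ours, and in this paper ρ = 0.3 with weight β = 3):

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """KL(rho || rho_hat_j) summed over hidden units: the sparsity penalty
    added to the autoencoder reconstruction cost."""
    rho_hat = np.asarray(rho_hat, dtype=np.float64)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
```

The penalty is zero when every hidden unit's average activation equals ρ and grows as activations drift away, pushing the hidden code toward sparsity.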
Step 1. An SAE was trained on the raw input x^(i) to learn the primary features h^(1)(i). The structure of the first SAE is [m, n_1, m], which corresponds to m inputs, n_1 units in the hidden layer, and m outputs, as shown in Figure 11.
Step 5. The secondary features were regarded as "raw inputs" to a softmax classifier, which was trained to map the secondary features to the detection labels.
Step 6. All three layers were combined to form an SSAE with two hidden layers and a classifier layer.
Step 7. Back-propagation was conducted to improve the results by adjusting the parameters of all layers simultaneously through a process called fine-tuning.
Step 7 was performed repeatedly until the set training times were attained.
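The greedy layer-wise procedure above can be sketched with a toy numpy implementation. Layer sizes, learning rate, epoch count, and data are illustrative assumptions, and the sparsity penalty and the final fine-tuning stage (Step 7) are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, alpha=0.5, epochs=2000):
    """Train one autoencoder by batch gradient descent on squared
    reconstruction error and return (encode_fn, hidden features)."""
    m, n = X.shape
    W1 = rng.normal(0, 0.1, (n, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n)); b2 = np.zeros(n)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)         # hidden activations
        Y = sigmoid(H @ W2 + b2)         # reconstruction (target is X itself)
        dY = (Y - X) * Y * (1 - Y) / m   # output-layer delta
        dH = (dY @ W2.T) * H * (1 - H)   # hidden-layer delta
        W2 -= alpha * H.T @ dY; b2 -= alpha * dY.sum(0)
        W1 -= alpha * X.T @ dH; b1 -= alpha * dH.sum(0)
    encode = lambda Z: sigmoid(Z @ W1 + b1)
    return encode, encode(X)

# Greedy layer-wise pretraining: features of layer k feed layer k + 1.
X = rng.random((50, 6))              # six input features, as in the paper
enc1, H1 = train_autoencoder(X, 5)   # primary features
enc2, H2 = train_autoencoder(H1, 4)  # secondary features for the classifier
```

Fine-tuning would then back-propagate through the stacked encoders and the classifier jointly, starting from these pretrained weights.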

Basic Parameters.
One of the most important parameters during the reduction of the cost function by batch gradient descent is the learning rate α. If this rate is excessively large, the step size becomes too large, and gradient descent can overshoot the minimum and move increasingly farther from it. If the rate is exceedingly small, the computation required to reduce the cost function slows down; a further drawback is potential trapping in local optima and the resulting inability to reach the globally optimal solution. We set α to 5 × 10⁻⁶ in the proposed model. The other parameter values, such as the batch-training sample number B, the sparsity parameter ρ, and the sparse penalty weight β, were obtained by changing one parameter while fixing the others. We set B = 300, ρ = 0.3, β = 3, and the number of SAEs to 2. When the vehicle approaches the left lane boundary, three of the six features increase simultaneously while the other three decrease; conversely, when the vehicle approaches the right boundary, the three that decreased now increase and vice versa. Table 1 shows the changes of the six parameters over several frames.
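The learning-rate trade-off described above can be illustrated on a toy quadratic cost (not the paper's cost function):

```python
def descend(alpha, steps=20, w=1.0):
    """Gradient descent on J(w) = w**2, whose gradient is 2w."""
    for _ in range(steps):
        w -= alpha * 2 * w
    return abs(w)

small = descend(0.01)  # too small: converges slowly, still far from 0
good = descend(0.5)    # well chosen: reaches the minimum immediately
big = descend(1.1)     # too large: every step overshoots and |w| grows
```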

Experiment and Result
The proposed system was evaluated with images captured by a CCD camera mounted on a vehicle. Road tests were conducted while driving on structured roads paved with asphalt or cement and marked with lane lines. The test image sequences comprised 5,309 frames with an image size of 320 × 240 pixels.

Lane Detection and Matching Experiment.
The experimental lane detection and matching results of the proposed system are presented in Figure 12. Figure 12(a) presents the lane detection and matching result of a sequence from frames 1259 to 1278. Lane detection and matching failed in 176 of the frames because of a large amount of white noise on the road, so the recognition rate of the proposed method is 96.69%. The processing time is approximately 15 ms per frame on a computer with 3.88 GB of RAM.

Comparison with Different Hidden Layer Structures.
We determined the number of first and second AE hidden units in this experimental procedure. The number of first-AE hidden units was varied from 5 to 1,000 at intervals of 25, and the number of second-AE hidden units from 5 to 200 at intervals of 5. A total of 1,600 different structures were assessed to determine the most suitable structure for the proposed model. The recognition accuracy results are shown in Figure 13. The maximum accuracy of 90.74% is reached when the number of first-SAE hidden units is 205 and the number of second-SAE hidden units is 160. Hence, we finalized our lane departure detection model as [6, 205, 160, 3], followed by the structures [6, 155, 150, 3] (90.67%) and [6, 180, 195, 3] (90.61%).
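This structure search reduces to a grid search over (n1, n2) pairs; a sketch is below, where score() stands in for training and evaluating one [6, n1, n2, 3] model on the labeled departure data.

```python
def grid_search(score, n1_grid, n2_grid):
    """Return (best_score, n1, n2) over all candidate structures."""
    return max((score(n1, n2), n1, n2)
               for n1 in n1_grid for n2 in n2_grid)

n1_grid = list(range(5, 1001, 25))  # first hidden layer: 5, 30, ..., 980
n2_grid = list(range(5, 201, 5))    # second hidden layer: 5, 10, ..., 200
assert len(n1_grid) * len(n2_grid) == 1600  # the 1,600 candidate structures
```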

Comparison with and without Pretraining.
Pretraining plays an important role in the recognition performance of neural networks. The experimental results showed that the accuracy rate with pretraining is 90.74%, whereas the accuracy rate without pretraining is 88.69%; pretraining thus improves accuracy by approximately 2.05 percentage points. This result is relatively easy to understand. Without pretraining, the network can only fine-tune parameters starting from random initial points, which explains why it may become confined to local optima. With pretraining, the network fine-tunes the pretrained parameters and can approach the global optimum. Thus, pretraining the SSAE is necessary.

Comparison with Different Classifiers.
We utilized the best model [6, 205, 160, 3] to detect lane departure given the basic parameters, as shown in Figure 14. The feature values in Figure 14 also decrease gradually before frame 4655; a step change then occurs because the vehicle offsets to a certain extent and changes lanes, as shown in the last two images of Figure 14(d).
Then, we compared our method with other classifiers. The first comparison experiment covered only the LO, with no other parameters (i.e., k_l, k_r, b_l, and b_r). This method failed to detect 926 frames, and its detection accuracy rate is 81.59%, as shown in Table 2. The SSAE is better than LO alone by 9.15% because regular changes in k_l, k_r, b_l, and b_r emerge when lane departures occur; therefore, k_l, k_r, b_l, and b_r are required as SSAE inputs. The second comparison experiment changed the classifier function from softmax to sigmoid in the SSAE. The sigmoid performed worse than the softmax and only attained 71.63% accuracy, as shown in Table 2. The third experiment compared our algorithm with SAE-DN, whose weights were pretrained by only one SAE. SAE-DN attained 86.26% accuracy, which is 4.48% lower than that of SSAE, as shown in Table 2. The fourth experiment compared SSAE with a linearly separable support vector machine (SVM-LS). The accuracy rate of SVM-LS is 81.78%, meaning that it failed to detect 968 frames, as shown in Table 2. The last experiment compared SSAE with a nonlinear support vector machine (SVM-NL), which failed to detect 559 frames and attained 89.47% accuracy, 1.27% lower than that of SSAE, as shown in Table 2. The parameter selection result of SVM-NL is shown in Figure 15.
To visually present that SSAE is superior to other classifiers, a bar chart is drawn in Figure 16. The SSAE method, represented by the red column, obtained the highest recognition rate of 90.74%. It is followed by SVM-NL (89.47%), represented by the black column, and the SSAE with no pretraining (88.69%), represented by the yellow column. The other methods have a lower accuracy rate.

Conclusion
The fundamental issue in a lane departure detection system is to robustly identify lane boundaries and design a robust lane departure detection algorithm. The proposed method of vision-based lane departure detection based on SSAE is relatively successful from these perspectives. Our lane detection and matching method achieved a high recognition rate of 96.69%, and its real-time performance was good. The parameter setting and performance of the SSAE algorithm were proven by the experimental results under different conditions. The proposed method obtained a high accuracy rate of 90.74%. Furthermore, we compared the performance of the proposed approach with that of five other algorithms and proved the superiority of the SSAE model over the competing models.
This paper presents SSAE DNs for the lane departure detection problem. We selected the right and left LOs, lane slopes, and lane intercepts as the feature inputs of the SSAE based on our observations and experiments, and a satisfactory experimental result was obtained. More importantly, this study established approaches for obtaining these feature inputs. In the future, additional feature inputs can be obtained to improve pattern recognition accuracy. Collecting more video sequences is also necessary to achieve more convincing experimental results.
The proposed algorithm was processed at a speed of 61 frames per second on a Pentium PC (3.00 GHz, 3.88 GB of RAM).

Conflicts of Interest
The authors declare that they have no conflicts of interest.