Stacking-Based Ensemble Learning Method for the Recognition of the Pedestrian Crossing Intention



Introduction
A zebra crossing is an area where pedestrians cross the road, and it is also a potential conflict area between vehicles and pedestrians [1,2]. According to the accident statistics report issued by the road traffic management department, the number of pedestrian deaths rose from 14,923 in 2015 to 17,473 in 2019. The proportion of pedestrian deaths in the total number of traffic accident deaths increased from 25.72% to 27.84%, and the number of injured pedestrians rose from 34,379 to 45,495. Further calculations found that, from 2015 to 2019, an average of 1.2 pedestrians were injured or killed in each pedestrian-related accident [3][4][5][6][7]. These data show that the situation of vehicle-pedestrian accidents in China has been deteriorating year by year, with both the absolute number and the proportion of fatalities rising. They also show that pedestrians are a vulnerable group in the road traffic system: once a pedestrian-vehicle accident occurs, even a slight scrape may cause serious injury or death [8,9].
With the rapid development of technology, autonomous vehicles are getting closer to reality. Autonomous vehicles have significant potential for reducing collision-related casualties, improving traffic conditions, and reducing traffic jams and vehicle emissions. The U.S. Department of Transportation released the Autopilot System Safety Vision 2.0 in 2017, which aims to improve the safety and reliability of the autopilot system in order to reduce the accident rate [10]. In 2016, the China Association of Automotive Engineers released a roadmap for autonomous vehicle technology. The roadmap stated that, between 2026 and 2030, every vehicle will have a fully automated driving system or an assisted driving system to improve road traffic safety [11].
Driving safely on urban roads is an important challenge for autonomous vehicles, in particular because urban roads carry a large number of pedestrians. Pedestrians are relatively complex individuals whose movement behavior is affected by factors such as their own emotions, the traffic environment, and the weather. Through vision, sound, gestures, and actions, a human driver can understand pedestrians' intentions and then accurately complete the interaction with them. For autonomous vehicles, however, it is difficult to understand the intentions of pedestrians and then accurately complete the pedestrian-vehicle interaction [12,13]. The zebra crossing is the main interaction area between pedestrians and vehicles. Therefore, this paper studies a pedestrian crossing intention recognition model.
The main contributions of this paper are as follows:
(1) Current pedestrian crossing intention recognition models are mainly built on traditional machine learning algorithms or deep learning algorithms, and their recognition accuracy is relatively low. This paper proposes a framework for combining machine learning algorithms that can improve recognition accuracy, namely, the stacking ensemble learning framework, which integrates four classical algorithms.
(2) Current pedestrian crossing intention recognition models usually cannot achieve both high recognition accuracy and a long recognition advance time. Unlike these models, the proposed model greatly increases the recognition advance time while maintaining recognition accuracy.

Related Works
Scholars at home and abroad have carried out a large amount of research on pedestrian crossing intention recognition and have achieved fruitful results. Mingus et al. [14] considered the trajectory and posture of pedestrians and established a pedestrian crossing intention recognition model based on the Gaussian dynamic model. The model recognition accuracy is 80%. Quintero et al. [15,16] collected the posture data of pedestrians crossing the zebra crossing and described the pedestrian movement posture with 11 key points of the human body. A pedestrian crossing intention recognition model was established based on the hidden Markov model. When the intention is recognized 0.125 s in advance, the accuracy of the model is 80%. Fang and Lopez [17] collected a large amount of posture data of pedestrians crossing the zebra crossing. The direction parameters between different points were calculated from the located human body key points, and a pedestrian crossing intention recognition model was established using the support vector machine (SVM) algorithm. The model has high recognition accuracy, reaching 93%. Brehar et al. [18] proposed a method to identify pedestrian crossing behavior using monocular far-infrared imagery. The method can still effectively identify pedestrian street-crossing actions in low-visibility environments such as nighttime, fog, heavy rain, or smoke, with an accuracy of 93.28%. Cȃilean et al. [19] proposed a novel architecture for improving pedestrian safety at crosswalks. The architecture can effectively detect pedestrians and predict their street-crossing actions.
Völz et al. [20] established a pedestrian crossing intention recognition model based on a data-driven method. The main input parameters of the model include the distance between pedestrians and the zebra crossing and the distance between vehicles and the zebra crossing. The recognition accuracy is 84.74%. Camara et al. [21] collected a large amount of pedestrian crossing data and established a pedestrian crossing intention recognition model by analyzing the relative position between pedestrians and vehicles. The recognition accuracy of the model can reach up to 96%. Zhao et al. [22] used lidar to collect a large amount of pedestrian crossing data and established a pedestrian crossing intention recognition model based on an artificial neural network (ANN) by analyzing the motion parameters of pedestrians and vehicles before crossing the zebra crossing. When the intention is recognized 0.5 s in advance, the model recognition accuracy is 92.6%. Zhang et al. [23] proposed a bidirectional long short-term memory network with an attention mechanism (AT-Bi-LSTM) to establish a pedestrian crossing intention recognition model. The recognition accuracy is 90.68% when the intention is recognized 0.6 s in advance.
Ghori et al. [24] proposed a new pedestrian crossing intention recognition framework that combines convolutional neural networks (CNN) and LSTM networks. When the intention is recognized 1 s in advance, the recognition accuracy of the model is relatively low, at only 72%. Schulz and Stiefelhagen [25] and Brouwer et al. [26] established pedestrian crossing intention recognition models by estimating the head movement posture of pedestrians crossing the zebra crossing. Hashimoto et al. [27] collected intersection information and established a pedestrian crossing intention recognition model based on the dynamic Bayesian network (DBN). Schneemann and Heinemann [28] combined the image data and motion parameters of pedestrians crossing the zebra crossing and established a pedestrian crossing intention recognition model based on SVM.
The literature review shows that research on pedestrian crossing intentions is relatively mature. The recognition accuracy of existing intention models is already good, with the highest values exceeding 90%. However, the recognition advance time of these models is relatively short. Overall, existing models do not seem able to maintain high recognition accuracy together with a long recognition advance time.
In general, pedestrian crossing intention recognition can be regarded as a time-series modeling and forecasting problem. Therefore, this paper first collects the continuous data stream covering the 2.1 s before pedestrians cross the zebra crossing. The data are collected with a laser radar and a high-definition (HD) monitor. Secondly, the characteristic parameters related to the crossing intention are extracted. These mainly include pedestrian speed, the distance between the pedestrian and the zebra crossing, age, gender, vehicle speed, the distance between the vehicle and the zebra crossing, and time to collision (TTC). Finally, a pedestrian crossing intention recognition model is established based on stacking ensemble learning, which integrates the SVM, random forest (RF), LSTM, and AT-Bi-LSTM algorithms. Figure 1 shows the research framework of this paper.
This paper is divided into five parts, namely, introduction, related works, proposed solution, experimental results, and conclusions. The first and second parts analyze the conflict between pedestrians and vehicles and introduce the significance of research on pedestrian crossing intention recognition. The third part introduces the crossing intention recognition algorithm, which is based on stacking ensemble learning and integrates the SVM, random forest (RF), LSTM, and AT-Bi-LSTM algorithms. It also introduces the data acquisition equipment and acquisition methods; the main data acquisition equipment is the laser radar and an HD monitor. The fourth part analyzes the characteristic parameters of pedestrian crossing intention and obtains the characteristic parameter set. It also analyzes the results of the pedestrian crossing intention recognition model based on stacking ensemble learning and compares them with those of traditional intention recognition algorithms. The fifth part presents the conclusions of this paper.

Methodology
3.1.1. Ensemble Learning. Ensemble learning improves the performance of machine learning by combining multiple models. Compared with a single model, this method allows for better prediction performance. It is widely used in well-known international machine learning competitions (Netflix, KDD2009, and Kaggle) and has achieved good rankings. The ensemble learning method can be used to solve classification and regression tasks [29].
For ensemble learning, two main problems arise in the process of model integration, namely, (1) how to change the distribution or weight of the data and (2) how to combine multiple weak classifiers into a strong classifier. For these two problems, there are three main solutions: (1) the bagging method for reducing variance, (2) the boosting method for reducing bias, and (3) the stacking method for improving prediction results [30][31][32]. Stacking ensemble learning is more effective at improving recognition accuracy. Therefore, this paper chose stacking ensemble learning.
Stacking is a typical representative of ensemble learning methods. The individual weak classifiers are called base classifiers, and the classifier used to combine them is called the meta-classifier. The base classifiers are usually heterogeneous.

Base Classifier and Meta-Classifier
(1) SVM-Base Classifier. SVM [33] is a commonly used supervised machine learning algorithm. It is a typical linear binary classifier, and training it can be regarded as the process of solving for the optimal classification hyperplane. For the SVM, the key is the choice of the kernel function, the penalty parameter C, and the kernel function parameter g. The kernel function selected is the radial basis kernel function. The values of the penalty parameter C and the kernel function parameter g are determined by the grid search method. In this paper, when the pedestrian intention is identified at 0 s before crossing the zebra crossing, the values of C and g are 36 and 2.73, respectively. When the pedestrian intention is identified at 0.5 s before crossing the zebra crossing, the values of C and g are 48 and 2.32, respectively. When the pedestrian intention is identified at 1 s before crossing the zebra crossing, the values of C and g are 45 and 2.08, respectively. Since SVM is a common and mature algorithm, it is not described in more detail in this paper.
(2) RF-Base Classifier. RF [34] is a classifier composed of a large number of decision trees and is itself regarded as an ensemble learning method. Multiple decision tree classifiers are trained by sampling with replacement (bootstrap). Each decision tree classifier is independent of the others. The decision tree classifiers are integrated into an RF classifier, and the final classification result is obtained through voting among them. To achieve a good recognition result, the adjustment of hyperparameters is essential. The hyperparameters here are the number of decision trees and the maximum number of features. In this paper, the grid search method is also used to determine these two parameter values. When the pedestrian intention is identified at 0 s before crossing the zebra crossing, the number of decision trees and the maximum number of features are 80 and 5, respectively. When the pedestrian intention is identified at 0.5 s before crossing the zebra crossing, the number of decision trees and the maximum number of features are 115 and 5, respectively. When the pedestrian intention is identified at 1 s before crossing the zebra crossing, the number of decision trees and the maximum number of features are 125 and 5, respectively.
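The grid search used for the SVM and RF hyperparameters can be illustrated with a minimal sketch. The function `grid_search`, the candidate grid, and the scoring function `toy_score` are hypothetical; in practice the score would be the cross-validated accuracy of the classifier trained with the given parameters, which is not reproduced here.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for cross-validated accuracy.

    It is constructed to peak at C = 36, g = 2.73 purely for illustration.
    """
    return -((params["C"] - 36) ** 2) / 100.0 - (params["g"] - 2.73) ** 2


# Assumed candidate values; the paper does not list the grid it searched.
grid = {"C": [12, 24, 36, 48], "g": [2.08, 2.32, 2.73]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

The same loop applies to the RF (number of trees, maximum number of features) and to the recurrent models (learning rate, hidden units, dropout) by changing the grid and the scoring function.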
(3) LSTM-Base Classifier. At the end of the last century, Hochreiter and Schmidhuber proposed LSTM on the basis of the RNN [35], which to some extent overcomes the problem of vanishing and exploding gradients during backpropagation. The LSTM network introduces the concept of "gates," namely, the input gate, forget gate, and output gate. These three gates are also called the memory unit of the network. Their main purpose is to selectively delete and retain the associated information in the data so that the cell state is continuously updated and the model recognition accuracy increases. The grid search method was used to determine the hyperparameter values. When the pedestrian intention is identified at 0 s before crossing the zebra crossing, the learning rate, number of hidden units, and dropout values are 0.01, 128, and 0.4, respectively. When the pedestrian intention is recognized at 0.5 s before crossing the zebra crossing, the values are 0.05, 100, and 0.4, respectively. When the pedestrian intention is recognized at 1 s before crossing the zebra crossing, the values are 0.001, 100, and 0.5, respectively. Adam was used as the optimizer. In addition, the LSTM network also captures the dependence between earlier and later input data, so that the cell unit has a longer memory capacity. The specific working steps of the LSTM network are as follows:
Forget gate: its main function is to delete useless information in the cell unit, and the information to be deleted is determined by the sigmoid function:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1}$$
where $\sigma$ is the forget gate sigmoid function, $W_f$ is the weight matrix, $b_f$ is the bias term, $h_{t-1}$ is the output at the previous moment, and $x_t$ is the current input. The output range of $f_t$ is [0, 1], and its value is inversely related to the degree of forgetting.
Input gate: it updates the information in the cell unit. The sigmoid layer and the tanh layer determine the updated information in the cell:
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2}$$
$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{3}$$
where $\sigma$ is the input gate sigmoid function, $\tanh$ is the input gate function, $W_i$ and $W_c$ are weight matrices, $b_i$ and $b_c$ are bias terms, $i_t$ is the input gate cell state update value, and $\tilde{C}_t$ is the tanh function state update value. Combining the forget gate and input gate outputs, the final updated state value of the cell unit is obtained as
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$
where $C_{t-1}$ is the cell state value at the previous moment.
Output gate: its main function is to transfer the associated information to the cell unit at the next moment:
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{5}$$
where $o_t$ is the output value of the output gate, $W_o$ is the weight matrix, and $b_o$ is the bias term. The final output $h_t$ of the cell unit at the current moment can be expressed as follows:
$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
(4) AT-Bi-LSTM-Meta Classifier. Pedestrian crossing intention recognition can be regarded as a sequence recognition problem. The movement state of pedestrians before crossing the zebra crossing can reflect the pedestrians' crossing decision. The data at a certain moment before crossing the zebra crossing are strongly correlated with the data at the next moment. To better capture the characteristic information of pedestrian crossing intentions and fully exploit the correlation of the sequence data in the period before crossing the zebra crossing, this paper adopts Bi-LSTM [36]. The input of the Bi-LSTM model at time $t$ is $x_t$. During information processing, the state of the forward layer of the Bi-LSTM is updated as follows:
$$\overrightarrow{h}_t = H\left(W_{fw}\, x_t + W_{fw1}\, \overrightarrow{h}_{t-1} + b_{fw}\right) \tag{7}$$
where $\overrightarrow{h}_t$ is the state of the forward layer, $H$ is the output function, $W_{fw}$ is the weight matrix from the input layer to the forward layer, $W_{fw1}$ is the weight matrix between forward layers, and $b_{fw}$ is the bias term.
The state of the backward layer of the Bi-LSTM is then updated as follows:
$$\overleftarrow{h}_t = H'\left(W_{bw}\, x_t + W_{bw1}\, \overleftarrow{h}_{t+1} + b_{bw}\right) \tag{8}$$
where $\overleftarrow{h}_t$ is the state of the backward layer, $H'$ is the corresponding output function, $W_{bw}$ is the weight matrix from the input layer to the backward layer, $W_{bw1}$ is the weight matrix between backward layers, and $b_{bw}$ is the bias term. Equation (9) describes the final output of the Bi-LSTM model after the forward and backward layers are superimposed:
$$y_t = H\left(W_{fw2}\, \overrightarrow{h}_t + W_{bw2}\, \overleftarrow{h}_t\right) \tag{9}$$
where $H$ is the output function of the output layer, $W_{fw2}$ is the weight matrix from the forward layer to the output layer, and $W_{bw2}$ is the weight matrix from the backward layer to the output layer.
The parameters of the pedestrian crossing intention are not equally important. To capture the most important information and shorten the flow distance of information, the Bi-LSTM-based attention mechanism was introduced [37]. The grid search method was used to determine the hyperparameter values. When the pedestrian intention is identified at 0 s before crossing the zebra crossing, the learning rate, number of hidden units, and dropout values are 0.005, 120, and 0.4, respectively. When the pedestrian intention is recognized at 0.5 s before crossing the zebra crossing, the values are 0.001, 120, and 0.4, respectively. When the pedestrian intention is recognized at 1 s before crossing the zebra crossing, the values are 0.001, 100, and 0.2, respectively. Adam was used as the optimizer. Figure 2 presents the four components of the AT-Bi-LSTM framework, namely, (1) the input layer, which inputs the feature parameter sequence of the crossing intention, (2) the LSTM layer, (3) the attention layer, and (4) the output layer.
In the correlation function of the attention layer, $P$ is a vector composed of $h_1, h_2, h_3, \ldots, h_T$, $T$ is the data length, $c$ is a trained parameter vector, and $h^*$ is the final value used for classification.
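The role of the attention layer can be illustrated with a short sketch. It assumes a commonly used formulation for attention-based Bi-LSTM classifiers (a tanh transform of the hidden states, softmax weights computed with the parameter vector c, and a tanh of the weighted sum); this formulation is an assumption and may differ in detail from the one used in this paper. The names `attention_pool` and `alpha` are hypothetical, while the hidden states, c, and h* follow the notation in the text.

```python
import math


def attention_pool(hidden_states, c):
    """Weight the hidden states h_1..h_T by attention and return (alpha, h_star).

    hidden_states: list of T vectors (the entries of P in the text).
    c: parameter vector of the same dimension as each hidden state.
    """
    # Score each time step: c^T tanh(h_t)
    scores = [sum(ci * math.tanh(hi) for ci, hi in zip(c, h)) for h in hidden_states]
    # Numerically stable softmax over the T time steps
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted sum of hidden states, followed by tanh
    dim = len(hidden_states[0])
    r = [sum(a * h[d] for a, h in zip(alpha, hidden_states)) for d in range(dim)]
    h_star = [math.tanh(v) for v in r]
    return alpha, h_star


# Made-up hidden states for T = 3 time steps with 2 units each
P = [[0.1, -0.2], [0.4, 0.3], [0.9, -0.5]]
c = [1.0, 0.5]
alpha, h_star = attention_pool(P, c)
print(alpha, h_star)
```

Time steps with larger scores receive larger weights, so the representation passed to the output layer is dominated by the most informative moments before crossing.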

Stacking-Based Ensemble Learning Algorithm Description
The training set in stacking ensemble learning includes a primary training set and a secondary training set. In the training phase, the secondary training set is generated using the base classifiers. If the training set of the primary classifiers is used directly to generate the secondary training set, the risk of overfitting is relatively high. Therefore, cross-validation is generally used to generate training samples for the meta-classifier. The method used in this paper is 5-fold cross-validation. Firstly, the base classifiers (SVM, RF, LSTM, and AT-Bi-LSTM) are trained on the primary training set, which is divided into 5 subsets. Secondly, the training set is reconstructed through 5-fold cross-validation to obtain the secondary training set, which is used to train the meta-classifier. Finally, the meta-classifier (Bi-LSTM) is obtained by training on the secondary training set. Figure 3 presents the framework of stacking-based ensemble learning. Table 1 is the pseudocode of the stacking algorithm, and the main steps of model training are described as follows:
Step 1: divide the pedestrians' intention sample dataset $S$ into the training set $S_{\text{train}}$ and the test set $S_{\text{test}}$ according to the ratio of 3 : 1. According to the 5-fold cross-validation method, we randomly and equally divide $S_{\text{train}}$ into 5 subsets, namely, $S_1$, $S_2$, $S_3$, $S_4$, and $S_5$, and select one of the subsets $S_i$ ($i = 1, 2, \ldots, 5$) as the verification subset in turn. We use the remaining $S_{+i} = S_{\text{train}} - S_i$ as the training subset.
Step 2: we use $S_{+i}$ as the training set of the base classifiers RF, SVM, LSTM, and AT-Bi-LSTM, use $S_i$ as the verification subset, and output the test result $x_i$. Simultaneously, we predict the test set $S_{\text{test}}$ and output the prediction result $y_i$.
Step 3: we iterate step 2 five times to obtain $\{x_1, x_2, x_3, x_4, x_5\}$, and we merge the results by column to get the column vector $X_1$ of the same length as the original training set $S_{\text{train}}$. We combine the test-set predictions $\{y_1, y_2, y_3, y_4, y_5\}$ and take their average to obtain a column vector $Y_1$ of the same length as the original test set $S_{\text{test}}$.
Figure 2: AT-Bi-LSTM structure. The input layer is used to input data; the data flow into the forward and backward layers of the Bi-LSTM to obtain important clues in the data. The attention layer is used to remove useless information from the data and extract key features. The softmax layer is responsible for outputting pedestrian intentions.
Step 4: by sequentially performing step 3 on the base classifiers SVM, LSTM, and AT-Bi-LSTM, we obtain $X_2$, $X_3$, and $X_4$ from the original training set and $Y_2$, $Y_3$, and $Y_4$ from the original test set.
Step 5: we combine $X_1$, $X_2$, $X_3$, and $X_4$ with the label $L$ of the original training set $S_{\text{train}}$ to obtain a new sample dataset $N = \{X_1, X_2, X_3, X_4, L\}$, and we use it as the training dataset of the meta-classifier Bi-LSTM. We obtain the accuracy of the meta-classifier via the test dataset $M = \{Y_1, Y_2, Y_3, Y_4, P\}$.
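Steps 1-5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the base classifiers SVM, RF, LSTM, and AT-Bi-LSTM and the Bi-LSTM meta-classifier are replaced by a toy nearest-centroid classifier on different feature subsets, the data are synthetic two-class points, and all function names are hypothetical. Only the data flow (out-of-fold predictions forming the meta-training columns, averaged test predictions forming the meta-test columns) follows the description above.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroid(X, y):
    """Toy learner: store the mean feature vector of each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_centroid(model, x):
    """Return the label of the closest class centroid."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))


def make_learner(cols):
    """Base learner that only sees the feature columns in `cols`."""
    def fit(X, y):
        return fit_centroid([[x[c] for c in cols] for x in X], y)

    def predict(model, x):
        return predict_centroid(model, [x[c] for c in cols])

    return fit, predict


def out_of_fold(fit, predict, X_train, y_train, X_test, k=5):
    """Steps 2-3: out-of-fold column for S_train, averaged column for S_test."""
    x_col = [None] * len(X_train)
    test_preds = []
    for val_idx in k_fold_indices(len(X_train), k):
        val = set(val_idx)
        tr = [i for i in range(len(X_train)) if i not in val]
        model = fit([X_train[i] for i in tr], [y_train[i] for i in tr])
        for i in val_idx:
            x_col[i] = predict(model, X_train[i])
        test_preds.append([predict(model, x) for x in X_test])
    y_col = [sum(p) / k for p in zip(*test_preds)]
    return x_col, y_col


# Step 1: synthetic two-class data, split 3 : 1 into training and test sets
rng = random.Random(1)
X = [[rng.gauss(2.0 * c, 1.0), rng.gauss(-1.0 * c, 1.0)] for c in (0, 1) for _ in range(40)]
y = [c for c in (0, 1) for _ in range(40)]
order = list(range(len(X)))
rng.shuffle(order)
X_train, y_train = [X[i] for i in order[:60]], [y[i] for i in order[:60]]
X_test, y_test = [X[i] for i in order[60:]], [y[i] for i in order[60:]]

# Steps 2-4: one meta-feature column per base learner
learners = [make_learner([0, 1]), make_learner([0]), make_learner([1])]
columns = [out_of_fold(f, p, X_train, y_train, X_test) for f, p in learners]
N = [list(row) for row in zip(*(c[0] for c in columns))]  # meta-training features
M = [list(row) for row in zip(*(c[1] for c in columns))]  # meta-test features

# Step 5: train the (toy) meta-classifier on N and evaluate it on M
meta = fit_centroid(N, y_train)
accuracy = sum(predict_centroid(meta, m) == t for m, t in zip(M, y_test)) / len(y_test)
print(f"meta-classifier accuracy on the toy test set: {accuracy:.2f}")
```

Because every entry of a meta-training column comes from a model that never saw that sample during fitting, the meta-classifier is trained on predictions that resemble those it will receive at test time, which is the reason cross-validation is used here instead of refitting on the full training set.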

Experimental
3.2.1. Study Site. Figures 4 and 5 show the study site and the equipment placement location, respectively. The zebra crossing section has no signal light control and no monitoring equipment. The zebra crossing is 12 m wide and spans a two-way, four-lane road. The road gradient is small and negligible, and the two directions are separated by a double yellow line. There is no green belt or buffer waiting area. The selected road is a common urban road. The traffic flow in this section is mainly composed of small passenger vehicles.

Experimental Equipment.
The laser radar selected in this experiment, model LUX4L-4, is produced by the German company IBEO, as shown in Figure 6. It is a four-line radar, and the scanning frequency is set to 12.5 Hz. The detection range of the lidar is 0.3-200 m, the vertical field of view (FOV) is 3.2°, and the horizontal field of view can reach 110°. The radar can scan all objects within the detection field in real time, including moving and stationary objects. The data collected by the radar are read through the associated software ILV-Premium, as shown in Figure 6. Through this software, the type, speed, and position of the targets detected by the radar can be displayed in real time. The display interface of the software is shown in Figure 6.
The selected HD monitor is small in size, and the video resolution is 1920 × 1080. Figure 6 shows the physical image. Both the LUX radar and the driving recorder are powered by small batteries. The data collection location is 15 m away from the zebra crossing. Using radar alone would miss a large amount of data and make the selection work more complicated; in addition, the gender and age of pedestrians could not be judged. To overcome this problem, the radar and the HD monitor are used together. After the two devices are synchronized in time, the HD monitor is used to determine whether the pedestrian wants to cross the zebra crossing, and the data of the pedestrian before or while crossing the zebra crossing are collected by the laser radar. The radar point cloud image recorded by ILV-Premium is the primary source, and the video recorded by the HD monitor is auxiliary, which together allow precise selection of the data.

Data Collection and Analysis.
To overcome the influence of time heterogeneity, all observation experiments were conducted on sunny days. Pedestrian crossing intention recognition is a continuous time-series classification problem. The pedestrians' crossing intention is determined from the change in their speed within a period of time before they cross the zebra crossing, or from the time-series change of the surrounding environment (vehicle speed, the distance between the vehicle and the zebra crossing, etc.). Generally speaking, when pedestrians are about to cross the zebra crossing, they decide whether to cross by observing the surrounding environment (such as the distance between the vehicle and themselves), and this decision is reflected in their speed. If the pedestrian does not slow down, the behavior is likely a direct crossing. Figure 7 shows a schematic diagram of the pedestrian crossing. In this paper, pedestrian crossing intentions are divided into three categories, namely, "walking-walking intention (WWI)," "walking-stopping intention (WSI)," and "stopping-walking intention (SWI)." WWI refers to a pedestrian crossing the zebra crossing without stopping after reaching the curb. WSI means that, after considering the road traffic environment, the pedestrian did not cross directly after reaching the curb but waited. SWI means that the pedestrian starts to cross the zebra crossing after waiting at the curb.
In this paper, the main process of selecting the characteristic parameters of pedestrian intention before crossing the zebra crossing is as follows.
First, the HD monitor video is checked to determine whether the pedestrian intends to cross the zebra crossing. If the video shows that the pedestrian's intention is WWI, we go back a certain period of time and collect the pedestrian-related data and vehicle-related data for this period from the laser radar. If the video shows that the pedestrian's intention is WSI or SWI, we use the same method to trace back through the laser radar recording and extract the data.
The intention characterization parameters selected in this paper are mainly pedestrian speed, the distance between the pedestrian and the zebra crossing, vehicle speed, the distance between the vehicle and the zebra crossing, and TTC. In addition, the paper also introduces the influence of pedestrian age, gender, and group on pedestrians' intention to cross the zebra crossing. The specific definitions are as follows:
Pedestrian speed is the mean speed of the pedestrian during a period of time before crossing the zebra crossing, obtained by laser radar. The speed measured by the radar is first passed through a Kalman filter to estimate the true speed value, and the per-frame speed values are then averaged to obtain the mean speed of the pedestrian before crossing the zebra crossing.
The distance between the pedestrian and the zebra crossing (DPZC) is the square root of the sum of the squares of two components: the perpendicular distance between the pedestrian and the curb and the perpendicular distance between the pedestrian and the zebra crossing.

Figure 5: Photograph of the study site. The lidar detection angle is 110°, which completely covers the whole road.

Figure 6: Laser radar and HD monitor. The upper part of the picture is the radar map (ILV-Premium), showing pedestrians and vehicles with their speeds, and the lower part is the camera view (video player). The two devices are time-synchronized.

Journal of Advanced Transportation
The distance between the vehicle and the zebra crossing (DVZC) refers to the perpendicular distance between the vehicle and the zebra crossing.
TTC refers to the distance between the vehicle and the zebra crossing divided by the current speed of the vehicle.
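The DPZC and TTC definitions above amount to two short formulas, sketched here in Python. The function and argument names are hypothetical, the DPZC formula assumes the "square and root" of the two components means their Euclidean combination, and returning infinity for a stationary vehicle is an assumption of this sketch, not something stated in the paper.

```python
import math


def dpzc(dist_to_curb, dist_to_crossing):
    """Distance between pedestrian and zebra crossing (m).

    Assumed to be the Euclidean combination of the two perpendicular distances.
    """
    return math.hypot(dist_to_curb, dist_to_crossing)


def ttc(dvzc, vehicle_speed):
    """Time to collision (s): vehicle-to-crossing distance over vehicle speed.

    A stationary vehicle never reaches the crossing, so infinity is returned
    (a convention chosen for this sketch).
    """
    if vehicle_speed <= 0:
        return math.inf
    return dvzc / vehicle_speed


# Made-up example: pedestrian 3 m from the curb and 4 m from the crossing,
# vehicle 30 m from the crossing travelling at 10 m/s
print(dpzc(3.0, 4.0), ttc(30.0, 10.0))
```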

Data Preprocessing.
Te data obtained from the radar will bring a lot of noise and interference signals. In order to make the collected data closer to the real value, this paper uses a Kalman flter to flter the data directly collected by the radar. It should be pointed out that the distance value between the vehicle and the zebra crossing and the vehicle speed value is larger than the pedestrian speed value and the value between the pedestrian and the zebra crossing, in order to more accurately capture the key information in the data, reduce the training time of the model, and improve model recognition accuracy. Tis paper uses the min-max function to normalize the characteristic parameters. Figure 8(a) shows the TTC line chart under diferent crossing intentions within 2.1 s before crossing the zebra crossing. It can be seen that when the intention is WWI, the selected TTC value when pedestrians cross the zebra crossing is the largest, which is at the top of the three curves. When the intention is SWI, the TTC value selected by pedestrians crossing the zebra crossing is second, in the middle of the three curves. When the intention is WSI, the TTC value selected by pedestrians crossing the zebra crossing is the smallest, which is at the bottom of the three curves. As time goes by, the TTC value under diferent intentions shows a steady downward trend. found that there were signifcant diferences in TTC values under diferent intentions (F (2, 1977) � 1719.60, p < 0.001), and the post-hoc test found that there were signifcant diferences in TTC values after pairings with diferent intentions (p < 0.001). Figure 9(a) shows the vehicle speed line chart under diferent crossing intentions within 2.1 s before crossing the zebra crossing. It can be seen that when the intention is SWI, the vehicle speed value when pedestrians' cross the zebra crossing is the largest, which is at the top of the three curves. 
When the intention is WWI, the vehicle speed value is the second, in the middle of the three curves. When the intention is WSI, the vehicle speed value is the smallest, which is at the bottom of the three curves. Generally speaking, with the change of time, the value of vehicle speed does not change much, and the value is relatively stable. Tere is a signifcant diference in the vehicle speed value between WWI and WSI (p < 0.001). Tere is a signifcant diference in the vehicle speed value between SWI and WSI (p < 0.001). Figure 10(a) shows the DPZC changes under diferent crossing intentions within 2.1 s before crossing the zebra crossing. It can be seen that when the intention is WWI, the DPZC value when pedestrians cross the zebra crossing is the largest, which is at the top of the three curves. When the intention is WSI, the DPZC value is the second, in the middle of the three curves. When the intention is SWI, the DPZC value is the smallest, which is at the bottom of the three curves. Generally speaking, as time goes by, the DPZC value with the intention of WWI and WSI shows a steady downward trend. Te DPZC value with the intention of SWI did not change signifcantly. Figure 10 Figure 11(a) shows the pedestrian speed changes under diferent crossing intentions within 2.1 s before crossing the zebra crossing. It can be seen that when the intention is WWI, the pedestrian speed value when  pedestrians cross the zebra crossing is the largest, which is at the top of the three curves. When the crossing intention is WSI, the pedestrian speed value is the second, in the middle of the three curves. When the intention is SWI, the pedestrian speed value is the smallest, which is at the bottom of the three curves. Generally speaking, as time goes by, there is no signifcant change in the value of pedestrian speed with WWI. Te value of pedestrian speed whose intention is WSI drops rapidly. Te pedestrian speed value with the intention of SWI shows a slow upward trend.    
Figure 12(a) shows the DVZC changes under different crossing intentions within 2.1 s before crossing the zebra crossing. When the intention is WWI, the DVZC value when pedestrians cross the zebra crossing is the largest, at the top of the three curves. When the intention is SWI, the DVZC value is the second largest, in the middle of the three curves. When the intention is WSI, the DVZC value is the smallest, at the bottom of the three curves. As time goes by, the DVZC value under each intention shows a steady downward trend. The post-hoc test found significant differences in DVZC values between each pair of intentions (p < 0.001).
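The F statistic quoted earlier for TTC is consistent with a one-way analysis of variance across the three intention groups. The sketch below computes such a statistic in plain Python. The function name and the TTC samples are hypothetical, and the post-hoc pairwise comparison is not shown.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical TTC samples in seconds for the three intentions.
ttc_wwi = [5.0, 6.0, 7.0]
ttc_swi = [2.0, 3.0, 4.0]
ttc_wsi = [1.0, 2.0, 3.0]
f_stat, df_between, df_within = one_way_anova_f([ttc_wwi, ttc_swi, ttc_wsi])
```

With three groups and 1980 samples, the same degrees-of-freedom formulas give (2, 1977), which matches the values reported with the TTC statistic.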

Age and Gender.
Numerous studies have shown that the age and gender of pedestrians strongly influence their decisions when crossing the zebra crossing. Generally speaking, men are relatively aggressive when choosing to cross, and women are relatively cautious [38,39]. Pedestrians are usually divided by age into young, middle-aged, and old. When crossing the zebra crossing, elderly pedestrians choose relatively cautiously, while middle-aged pedestrians choose more aggressively. Generally, the ranges 18-30, 30-59, and >59 are classed as young, middle-aged, and old, respectively [40][41][42].
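A small sketch of how these categories might be turned into model inputs is shown below. The stated ranges overlap at age 30, so the sketch assumes that 30 counts as young. The integer codes, the gender labels, and the function names are illustrative assumptions as well.

```python
AGE_CODES = {"young": 0, "middle-aged": 1, "old": 2}  # assumed integer coding
GENDER_CODES = {"male": 0, "female": 1}               # assumed integer coding


def age_group(age):
    """Map an age in years to the young / middle-aged / old categories."""
    if age < 18:
        # The text defines no category below 18, so reject rather than guess.
        raise ValueError("no age category is defined below 18")
    if age <= 30:  # assumption: the shared boundary value 30 is treated as young
        return "young"
    if age <= 59:
        return "middle-aged"
    return "old"


def encode_pedestrian(age, gender):
    """Return the (age code, gender code) pair used as two input features."""
    return AGE_CODES[age_group(age)], GENDER_CODES[gender]


features = encode_pedestrian(45, "female")
```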

Model Results.
Through the analysis in the previous section, the input parameter set of the model is finally determined; it includes TTC, DPZC, DVZC, vehicle speed, pedestrian speed, age, and gender. A total of 1980 sets of valid data are selected, of which 75% are used as the training set and the remaining 25% as the test set. Five-fold cross-validation is applied to the training set. Table 2 shows the number of training samples and test samples under different intentions. Pedestrian crossing intention recognition models are established at 0 s, 0.5 s, and 1 s before crossing the zebra crossing. Model performance is evaluated by precision, recall, F1 score, confusion matrix, and receiver operating characteristic (ROC) curve.
Table 3 shows the model evaluation results at 0 s before crossing the zebra crossing. Compared with several traditional machine learning algorithms, the pedestrian crossing intention model based on stacking ensemble learning has the highest recognition accuracy, reaching 98.79%. The precision, recall, and F1 score of this model for identifying WWI are 98.78%, 98.78%, and 98.78%, respectively. Likewise, the precision, recall, and F1 score of the model for identifying SWI are 99.38%, 98.76%, and 99.07%, respectively. The precision, recall, and F1 score of the model for identifying WSI are 99.24%, 98.82%, and 98.53%, respectively. Taken together, the pedestrian crossing intention model based on stacking ensemble learning introduced in this paper has the best recognition performance. The running time of the stacking model is 0.0083 s, and the running times of the AT-Bi-LSTM, LSTM, RF, and SVM models are 0.0032 s, 0.0054 s, 0.0065 s, and 0.0046 s, respectively. The running times of all of these models are on the order of milliseconds, which meets practical requirements.
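The training protocol described above (a 75/25 split, five folds on the training set, and a stacked model) can be illustrated with a dependency-free sketch. The base learners used here (nearest centroid and 1-nearest-neighbour), the vote-table meta-learner, and the two synthetic features are stand-ins chosen to keep the example short. They are assumptions for illustration, not the learners or data used in this paper.

```python
import random
from collections import Counter, defaultdict


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Level-0 learner: predict the class whose mean vector is closest."""
    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                          for c, rows in rows_by_class.items()}
        return self

    def predict(self, X):
        return [min(self.centroids, key=lambda c: dist2(row, self.centroids[c]))
                for row in X]


class OneNearestNeighbor:
    """Level-0 learner: copy the label of the closest training sample."""
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        return [self.y[min(range(len(self.X)), key=lambda i: dist2(row, self.X[i]))]
                for row in X]


class VoteTable:
    """Level-1 learner: map each tuple of base predictions to its most common label."""
    def fit(self, Z, y):
        votes = defaultdict(Counter)
        for z, label in zip(Z, y):
            votes[tuple(z)][label] += 1
        self.table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, Z):
        return [self.table.get(tuple(z), self.default) for z in Z]


def fit_stacking(X, y, base_classes, n_folds=5):
    """Fit base learners plus a meta-learner trained on out-of-fold predictions."""
    n = len(X)
    oof = [[None] * len(base_classes) for _ in range(n)]
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        kept = [i for i in range(n) if i % n_folds != fold]
        for j, cls in enumerate(base_classes):
            model = cls().fit([X[i] for i in kept], [y[i] for i in kept])
            for i, pred in zip(held_out, model.predict([X[i] for i in held_out])):
                oof[i][j] = pred
    meta = VoteTable().fit(oof, y)
    bases = [cls().fit(X, y) for cls in base_classes]  # refit on the full training set
    return bases, meta


def predict_stacking(bases, meta, X):
    """Feed the base learners' predictions to the meta-learner."""
    level0 = zip(*(base.predict(X) for base in bases))
    return meta.predict(level0)


# Synthetic two-feature data for the three intention labels (not the paper's data).
rng = random.Random(0)
centers = {"WWI": (0.0, 0.0), "SWI": (3.0, 0.0), "WSI": (0.0, 3.0)}
samples = [([rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)], label)
           for label, (cx, cy) in centers.items() for _ in range(40)]
rng.shuffle(samples)

split = int(0.75 * len(samples))  # 75% training, 25% test
X_train = [x for x, _ in samples[:split]]
y_train = [label for _, label in samples[:split]]
X_test = [x for x, _ in samples[split:]]
y_test = [label for _, label in samples[split:]]

bases, meta = fit_stacking(X_train, y_train, [NearestCentroid, OneNearestNeighbor])
y_pred = predict_stacking(bases, meta, X_test)
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
```

The meta-learner is trained only on out-of-fold predictions, so it never sees a base learner's output on a sample that base learner was fitted on. This is the usual reason for combining stacking with k-fold cross-validation.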
Figure 13 shows the ROC curve of each model. When the false positive rate is 5%, the pedestrian crossing intention recognition model based on stacking ensemble learning has the highest true positive rate, followed by AT-Bi-LSTM, LSTM, RF, and SVM. In addition, the area under the ROC curve of the stacking ensemble learning method is the largest, higher than that of the other four algorithms. The ROC curves of all five algorithms lie relatively far from the straight line y = x, which shows that all five models recognize well. A comprehensive comparison shows that the pedestrian crossing intention recognition model based on stacking ensemble learning introduced in this paper performs best.
Figure 14 shows the confusion matrices of the five algorithms. The SVM-based intention recognition model has the most misrecognitions: WWI is recognized as SWI and WSI 6 and 11 times, respectively; SWI is recognized as WWI and WSI 5 and 7 times, respectively; and WSI is recognized as WWI and SWI 12 and 8 times, respectively. In contrast, the pedestrian crossing intention recognition model based on stacking ensemble learning has the fewest misrecognitions and the best performance: WWI is recognized as SWI and WSI 1 and 1 times, respectively; SWI is recognized as WWI and WSI 1 and 1 times, respectively; and WSI is recognized as WWI and SWI 2 and 1 times, respectively.
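The accuracy, precision, recall, and F1 scores used in this section all follow from a confusion matrix. The sketch below shows the calculation in plain Python. The counts are hypothetical and are not taken from Figure 14. The function name is illustrative, and the code assumes every class occurs and is predicted at least once.

```python
def per_class_metrics(confusion, labels):
    """Compute accuracy and per-class (precision, recall, F1).

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(labels)))
    report = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                    # row sum
        precision = true_positive / predicted_k
        recall = true_positive / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = (precision, recall, f1)
    return correct / total, report


labels = ["WWI", "SWI", "WSI"]
# Hypothetical counts: rows are true intentions, columns are predicted intentions.
confusion = [
    [48, 1, 1],
    [2, 36, 2],
    [10, 3, 47],
]
accuracy, report = per_class_metrics(confusion, labels)
```

As a check on the F1 formula, the SWI precision and recall reported for the stacking model at 0 s (99.38% and 98.76%) give an F1 score of about 99.07%, which matches the reported value.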

Model Recognition Results at 0.5 s before Crossing the Zebra Crossing.
Table 4 shows the model evaluation results at 0.5 s before crossing the zebra crossing. Compared with several traditional algorithms, the intention recognition model based on stacking ensemble learning has the highest accuracy, 95.36%; the recognition accuracies of the models based on AT-Bi-LSTM, LSTM, and RF are 92.12%, 89.30%, and 87.07%, respectively. The SVM-based model has the lowest recognition accuracy, 85.26%. Table 4 also shows that the precision, recall, and F1 score of the pedestrian crossing intention model based on stacking ensemble learning are significantly higher than those of the other four algorithms. The stacking ensemble learning method introduced in this paper therefore has the best recognition performance at 0.5 s before crossing the zebra crossing. Compared with Table 3, the accuracy at 0.5 s before crossing the zebra crossing has decreased to a certain extent. The main reason is that some key features contained in the sequence data have been removed. In general, however, the accuracy of the model can still meet practical needs. The running time of the stacking model is 0.0076 s, and the running times of the AT-Bi-LSTM, LSTM, RF, and SVM models are 0.0027 s, 0.0060 s, 0.0061 s, and 0.0052 s, respectively.
Figure 15 shows the ROC curve of each model. When the false positive rate is 5%, the pedestrian crossing intention recognition model based on stacking ensemble learning has the highest true positive rate, followed by AT-Bi-LSTM, LSTM, RF, and SVM. In addition, the area under the ROC curve of the stacking ensemble learning method is the largest, higher than that of the other four algorithms.
Compared with Figure 13, the area under the ROC curve of each algorithm has been reduced, and the performance of the models has begun to decline.
Figure 16 shows the confusion matrices of the five algorithms. The SVM-based intention recognition model still has the most misrecognitions: WWI is recognized as SWI and WSI 10 and 15 times, respectively; SWI is recognized as WWI and WSI 7 and 11 times, respectively; and WSI is recognized as WWI and SWI 19 and 11 times, respectively. In contrast, the pedestrian crossing intention recognition model based on stacking ensemble learning has the fewest misrecognitions and the best performance: WWI is recognized as SWI and WSI 3 and 5 times, respectively; SWI is recognized as WWI and WSI 2 and 5 times, respectively; and WSI is recognized as WWI and SWI 7 and 3 times, respectively. Compared with Figure 14, the number of misrecognitions has increased.
[...] accuracy, which is 76.33%. Table 5 shows that the precision, recall, and F1 score of the pedestrian crossing intention model based on stacking ensemble learning are significantly higher than those of the other four algorithms. The stacking ensemble learning method introduced in this paper therefore has the best recognition performance at 1 s before crossing the zebra crossing. Compared with Tables 3 and 4, the accuracy at 1 s before crossing the zebra crossing has decreased. The main reason is that most of the key features contained in the sequence data have been removed. However, the method introduced in this paper still has high [...]
Figure 17 shows the ROC curve of each model.
When the false positive rate is 5%, the pedestrian crossing intention model based on stacking ensemble learning has the highest true positive rate, over 80%. The recognition performance of the remaining four algorithms has dropped significantly, with corresponding values below 80%. In addition, the area under the ROC curve of the stacking ensemble learning method is the largest, higher than that of the other four algorithms. Compared with Figures 13 and 15, the area under the ROC curve of each algorithm has been reduced. Figure 18 shows the confusion matrices of the five algorithms. The SVM-based intention recognition model has the most misrecognitions. In contrast, the pedestrian crossing intention recognition model based on stacking ensemble learning has the fewest misrecognitions and the best performance. Compared with Figures 14 and 16, the number of misrecognitions has significantly increased.
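The two ROC quantities compared throughout these results, the true positive rate at a 5% false positive rate and the area under the curve, can be computed as in the following sketch. It treats one intention as the positive class against the other two (one-vs-rest). The scores, the labels, and the function names are hypothetical.

```python
def roc_points(scores, is_positive):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision threshold."""
    pairs = sorted(zip(scores, is_positive), key=lambda p: -p[0])
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(is_positive) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr):
    """Highest TPR reachable without exceeding max_fpr (no interpolation)."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Hypothetical scores for "WWI versus the rest"; True marks a real WWI sample.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
is_wwi = [True, True, False, True, True, False, False, False]
points = roc_points(scores, is_wwi)
auc = area_under_curve(points)
tpr_5 = tpr_at_fpr(points, 0.05)
```

A curve that follows the diagonal y = x has an area of 0.5, so larger areas and higher true positive rates at a fixed false positive rate both indicate better separation of the intentions.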

Conclusions
This paper first collected the motion parameters of pedestrians and vehicles with laser radar and an HD monitor and selected 1980 effective samples. Second, a statistical method was used to obtain the characteristic parameter set that reflects the pedestrians' crossing intention. Finally, using the characteristic parameter set as the input of the stacking ensemble learning method, a pedestrian crossing intention model with high recognition accuracy was trained and compared with traditional machine learning algorithms. The results show that the accuracy of the pedestrian crossing intention recognition model based on stacking ensemble learning is 98.79% at 0 s, 95.36% at 0.5 s, and 89.27% at 1 s before crossing the zebra crossing. Compared with traditional machine learning algorithms, the method introduced in this paper has the best recognition performance. Its high intention recognition accuracy is of practical significance for future fully autonomous vehicles, helping them to avoid human-vehicle conflicts effectively and to improve the efficiency of urban road driving.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.