Lane-Level Road Map Construction considering Vehicle Lane-Changing Behavior



Introduction
Lane-level road maps are at the core of driverless systems and intelligent assisted-driving systems. They can be used for autonomous vehicle navigation and online driver guidance. Such maps are also important for lane-based traffic analysis: traffic flow can be analyzed more accurately when the analysis is based on a lane-level map that is fully compatible with the input trajectory data. Currently, lane-level road maps are usually acquired from high-resolution remote sensing images, vehicle-mounted laser point clouds, or differential GPS trajectories with an accuracy of about 0.5-4 m [1]. These manual and semimanual approaches are time-consuming and labor-intensive.
Crowdsourcing is a low-cost and efficient way to extract useful road information from data acquired by crowd participants or volunteers. The crowdsourced method, which collects road data from large numbers of ordinary vehicles, has been successfully applied in road map construction [2][3][4]. These advantages have led to a relatively limited number of scholarly papers on lane-level map construction based on crowdsourced trajectories. However, these efforts mainly model the trajectory distribution on road cross sections and do not model the trajectory distribution in the longitudinal direction of the road. As a result, the extracted lane count may vary within a road segment, whereas road designers usually keep the road width constant throughout a segment.
This paper proposes a strategy to model the longitudinal trajectory distribution along the road by considering vehicle lane-changing behavior. The method is based on the observation that the probability of a vehicle changing lanes increases on road sections where the number of lanes changes. Thus, the lane-changing behaviors of vehicles are first identified from their trajectories. Then, we use the weighted constrained Gaussian mixture model (WCGMM) and a hidden Markov model (HMM) to extract lane counts and lane centerline nodes on consecutive road cross sections.

The rest of this article is structured as follows: Section 2 gives an overview of works related to this article. Section 3 describes the lane-changing behavior recognition method and the lane centerline extraction approach based on lane-changing behavior. Experiments and results are discussed in Section 4, followed by conclusions and future work in Section 5.

Related Work
Crowdsourced vehicle trajectory data is a low-cost, real-time data source that potentially contains rich road information. Thus, many scholars have studied building navigation maps from crowdsourced vehicle trajectories [5][6][7][8]. With the rise of autonomous driving technology, high-definition navigation maps have become a critical component of autonomous driving, providing vehicles with more reliable environmental perception. Compared with traditional navigation maps, high-definition maps for autonomous driving require more timely and accurate lane information. Constructing lane-level road maps from crowdsourced vehicle trajectories has therefore become a new research hotspot. Lane-level road map construction based on crowdsourced vehicle trajectories usually includes three steps. First, the noise in raw trajectories is filtered. For example, previous studies have used a Kalman filter and a particle filter algorithm [9] or kernel density methods [10] for trajectory data preprocessing. Second, the lane counts of the segment are estimated. Edelkamp and Schrödl [11] proposed a K-means clustering method to extract lane counts from massive differential GNSS trajectories to construct and update urban road digital maps. Uduwaragoda et al. [12] used nonparametric kernel density estimation (KDE) to estimate the number and location of the lane centerlines. Chen and Krumm [13] used vehicle trajectories recorded by standard GPS devices to construct lane-based, routable digital maps. They used lines perpendicular to the road's centerline, placed at certain distances along it, for one-dimensional classification of the trajectories, and then used prior knowledge and the constrained Gaussian mixture model (CGMM) to obtain better classification results. Tang et al. [14] proposed a naive Bayesian classifier to extract the number and the rules of traffic lanes.
In addition, an optimized CGMM was proposed to mine the number and locations of traffic lanes [15]. Furthermore, a fuzzy-set-based algorithm was proposed to construct the lane geometry near road intersections using traffic rules [16]. Roeth et al. [17] proposed a method based on elementary building blocks that guarantees applicable lane models and used a reversible jump Markov chain Monte Carlo method to explore the model parameters. Third, the lane models of road segments are connected into lanes. Gong et al. [18] proposed a region-growing clustering algorithm with distance and orientation constraints to construct lane-level roads. Arman and Tampère [19,20] first identify road nodes and divide the network into segments; they then construct lanes for each segment and connect these lanes. Zheng et al. [21] summarized lane-level road geometry extraction methods and the mathematical modeling of lane-level road networks. They analyzed the methodologies, advantages, and limitations of these two parts and discussed the classic logic formats of a lane-level road network.
The above studies show that lane count extraction is a critical step in lane construction. Because the GMM uses prior knowledge of lanes, GMM-based methods are more robust to low-precision trajectory data. However, the GMM method does not consider the influence of vehicle lane-changing behavior, which leads to differences between the actual trajectory distribution and the Gaussian mixture distribution. In addition, researchers usually use a one-dimensional classification method to extract the number of lanes, which can lead to variable lane width over the whole road segment, whereas road designers typically keep the number and width of lanes constant over a section of road. Although the crowdsourced trajectory method is economical, previous results show that there is still room to improve the accuracy and precision of the lane geometry [22].

Methodology
We propose a two-step algorithm that converts a set of trajectory data into a lane-level road map. Figure 1 gives an overview of the method, which is detailed hereafter. The first step of the algorithm is vehicle lane-changing behavior recognition. At the end of this step, the lane-changing behaviors have been identified from the vehicle trajectories. We select the trajectory segments with lane-changing behaviors to model the change of lane counts between adjacent road cross sections. In the second step, the lane counts and lane centerline nodes are extracted, and we connect these nodes to construct lane centerlines for each unidirectional road segment.

Recognition of Lane-Changing Behavior. The recognition of lane-changing behavior is explained in the following sections.
3.1.1. Lane-Changing Feature Extraction. Driving behavior can be divided into car-following and lane-changing behavior [23]. Car-following behavior refers to following the preceding vehicle in the same lane. Lane-changing behavior refers to the vehicle entering an adjacent lane to satisfy its driving purpose. In this paper, a trajectory exhibiting the former behavior is called a car-following trajectory segment, and one exhibiting the latter is called a lane-changing trajectory segment. The points on a car-following trajectory segment are located in the same lane, whereas the points on a lane-changing trajectory segment cross into an adjacent lane.
We divide the vehicle trajectories into car-following and lane-changing trajectory segments in the Frenet coordinate system. Compared with other coordinate systems, the Frenet coordinate system can separate the vehicle's motion state on the road from the geometric shape of the road itself. It converts the two-dimensional motion of a vehicle in plane space into two one-dimensional motions in reference curve space. One-dimensional motion problems are easier to model and solve than two-dimensional ones.
Unlike the conventional Cartesian coordinate system, the Frenet coordinate system is defined with respect to a given reference line and converts the absolute position of an object into a position relative to that line. The road centerline is selected as the reference line, and the vehicle trajectories are converted to positions relative to the road centerline. The new trajectory points directly reflect the state of vehicle motion on the road without being affected by road geometry.
In the Frenet coordinate system, lateral and longitudinal displacements are used to represent the position of a vehicle.
Assume that the coordinates of a vehicle in a global coordinate system (Cartesian or geographic) are $P_t(x, y)$. Point $P_t$ is projected onto the road centerline $T_{ref}$, and $P_s$ is the projection point. The distance between $P_s$ and $P_t$ is $d$, and the curve distance from the starting point of $T_{ref}$ to $P_s$ is $s$. Here $d$ is the lateral displacement between the vehicle and the road centerline, and $s$ is the vehicle's longitudinal displacement along the road centerline's extension direction, as shown in Figure 2. Therefore, the vehicle's coordinates in the Frenet coordinate system are $F_r(s, d)$. Figure 3(a) shows the vehicle's trajectory in Figure 2 using the Frenet coordinate system. Compared with the trajectory in the global coordinate system (Figure 3(b)), the shape of the vehicle trajectory curve in the Frenet coordinate system is more straightforward and intuitive. The ordinate of the curve directly reflects the position of the lane in which the vehicle is located. Thus, it is easy to detect the lane-changing behavior of a vehicle in the Frenet coordinate system.
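The projection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `to_frenet` is hypothetical, the road centerline is assumed to be a planar polyline with distinct consecutive vertices, and the sign convention for $d$ (positive to the left of the direction of travel) is an assumption added so that lanes on either side of the centerline can be told apart.

```python
import numpy as np

def to_frenet(point, ref_line):
    """Project a planar point onto a polyline reference line.

    Returns (s, d): arc length along the reference line up to the
    projection point, and the signed lateral offset (assumed positive
    to the left of the direction of travel).
    """
    p = np.asarray(point, dtype=float)
    ref = np.asarray(ref_line, dtype=float)
    starts, ends = ref[:-1], ref[1:]
    seg = ends - starts                        # segment vectors
    seg_len = np.linalg.norm(seg, axis=1)      # segment lengths (assumed > 0)
    # Position of the closest point on each segment, clamped to [0, 1].
    t = np.einsum("ij,ij->i", p - starts, seg) / seg_len**2
    t = np.clip(t, 0.0, 1.0)
    proj = starts + t[:, None] * seg
    dist = np.linalg.norm(p - proj, axis=1)
    k = int(np.argmin(dist))                   # index of the nearest segment
    s = seg_len[:k].sum() + t[k] * seg_len[k]
    # Sign of d from the 2D cross product of segment direction and offset.
    offset = p - proj[k]
    cross = seg[k, 0] * offset[1] - seg[k, 1] * offset[0]
    return float(s), float(np.sign(cross) * dist[k])
```

For geographic coordinates, the points would first need to be projected into a planar coordinate system; that step is outside the scope of this sketch.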
In the Frenet coordinate system, a car-following trajectory segment is a horizontal straight-line segment, whereas a lane-changing trajectory segment is an "S"-shaped curve that crosses into an adjacent lane. On a car-following trajectory segment, the lateral displacements of the points are similar, and the slope at each point is close to zero. Therefore, the two kinds of segments can be distinguished according to the lateral displacement and slope of the trajectory points. We select two indices to describe the lane-changing features of trajectory points. One is the difference between the lateral displacements of neighborhood points. The other is the slope at the target trajectory point.
Because a consumer-grade GPS receiver on a floating car usually has a significant positioning error, lane-changing features extracted from raw trajectory points are usually affected by that error. Therefore, we use the moving least squares method [24] to fit the actual trajectory of a vehicle before extracting the lane-changing features. Compared with the original trajectory, the fitted trajectory is smoother and less affected by positioning errors, so lane-changing features extracted from the fitted trajectory are more stable.
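As an illustration of the two features, the sketch below smooths the lateral displacement $d$ as a function of $s$ with a simple moving least squares fit and then derives both indices. The Gaussian kernel, the `bandwidth` value, the local linear basis, and the neighborhood size `window` are all assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def lane_change_features(s, d, bandwidth=30.0, window=1):
    """Smooth d(s) by moving least squares, then derive two features.

    Returns (fitted, diff, slope):
      fitted - smoothed lateral displacement at every point
      diff   - |fitted[i + window] - fitted[i - window]| (clamped at the ends)
      slope  - absolute local slope |dd/ds| of the fitted curve
    """
    s = np.asarray(s, dtype=float)
    d = np.asarray(d, dtype=float)
    fitted = np.empty_like(d)
    slope = np.empty_like(d)
    for i, s0 in enumerate(s):
        # Gaussian weights centred on the current point (assumed kernel).
        w = np.exp(-0.5 * ((s - s0) / bandwidth) ** 2)
        # np.polyfit multiplies residuals by w before squaring, so pass sqrt(w).
        a, b = np.polyfit(s - s0, d, deg=1, w=np.sqrt(w))
        slope[i] = a      # local slope at s0
        fitted[i] = b     # fitted value at s0 (intercept of the shifted fit)
    idx = np.arange(len(s))
    left = np.clip(idx - window, 0, len(s) - 1)
    right = np.clip(idx + window, 0, len(s) - 1)
    diff = np.abs(fitted[right] - fitted[left])
    return fitted, diff, np.abs(slope)
```

A larger `bandwidth` suppresses more positioning noise but also flattens genuine lane changes, so the value would need to be tuned to the sampling interval of the data.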

Recognition of Lane-Changing Trajectory Points.
After the lane-changing features of each trajectory point have been extracted, lane-changing trajectory points can be classified based on these features. We use the K-means algorithm to classify the points and recognize lane-changing behavior for each trajectory. K-means is a common clustering algorithm, widely used because of its simplicity and high efficiency. We chose it because the number of points on each trajectory is not very large and the number of clusters is known in advance. It should be noted that this paper focuses on the accuracy of lane information extraction and does not consider the speed of algorithm execution. Where efficiency matters, a distributed clustering algorithm can replace the traditional K-means algorithm. The lane-changing point recognition method based on K-means includes the following three steps:

Journal of Advanced Transportation
Step 1: Normalization of lane-changing eigenvalues.
Since the two extracted lane-changing features differ in dimension and in numerical range, it is necessary to normalize the lane-changing feature values. We use the min-max normalization method to map the values to the range from 0 to 1. If $x$ is a set of feature values, the normalized $x'$ can be calculated as follows:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}. \qquad (1)$$

Step 2: K-means clustering of trajectory points. The sample set $S_p = \{S_{p1}, S_{p2}, \ldots, S_{pn}\}$ is established according to the lane-changing features of the trajectory points, where $n$ is the number of trajectory points. The sample point $S_{pi}$ corresponds to the two normalized lane-changing eigenvalues $v_1$ and $v_2$ of the $i$-th trajectory point, namely $S_{pi} = (v_1, v_2)$. First, the sample closest to the origin and the sample farthest from it in $S_p$ are selected as the two initial cluster centers. Then, the distance between each sample and the two cluster centers is calculated, and each sample is assigned to the closest cluster. Each cluster center and the samples assigned to it represent a cluster. Once all samples have been allocated, the cluster centers are recalculated from the samples contained in each cluster. The sample allocation and cluster center calculation are repeated until no samples are reassigned to a different cluster. The category to which each sample then belongs is the final clustering result, and the trajectory points corresponding to the samples are divided into two categories accordingly.
Step 3: Determination of cluster category. After the trajectory points have been classified, the meaning of each point cluster is determined according to the location of its cluster center. Compared with non-lane-changing trajectory points, lane-changing trajectory points have larger eigenvalues. Therefore, the trajectory points whose cluster center is farther from the coordinate origin are regarded as lane-changing, and the other points are regarded as non-lane-changing.
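The three steps above can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions, not the authors' code: the function names are hypothetical, the iteration cap `max_iter` is added for safety, and the handling of degenerate inputs (constant features, coinciding centers) is a choice made for this example.

```python
import numpy as np

def min_max(x):
    """Min-max normalization to [0, 1]; constant input maps to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def classify_lane_change(diff, slope, max_iter=100):
    """Return a boolean mask that is True for lane-changing points."""
    # Step 1: normalize both features.
    samples = np.column_stack([min_max(diff), min_max(slope)])
    # Step 2: two-cluster K-means, seeded with the samples closest to
    # and farthest from the origin.
    norm = np.linalg.norm(samples, axis=1)
    centers = samples[[np.argmin(norm), np.argmax(norm)]].copy()
    labels = np.full(len(samples), -1)
    for _ in range(max_iter):
        dist = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                               # no sample was reassigned
        labels = new_labels
        for c in range(2):
            if np.any(labels == c):
                centers[c] = samples[labels == c].mean(axis=0)
    # Degenerate case: both centers coincide, so nothing stands out.
    if np.allclose(centers[0], centers[1]):
        return np.zeros(len(samples), dtype=bool)
    # Step 3: the cluster whose center is farther from the origin
    # is taken to be the lane-changing cluster.
    lane_change_cluster = np.argmax(np.linalg.norm(centers, axis=1))
    return labels == lane_change_cluster
```

Because K-means with two clusters always produces a split when the features vary at all, a trajectory without any lane change may still have a few points flagged; a minimum-feature threshold could be added if that matters for an application.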

Construction of Lane-Level Map. The construction of the lane-level map is explained in the following sections.

Trajectory Distribution Modeling on the Road Cross Section. Lane centerlines can be extracted using a constrained Gaussian mixture model (CGMM) [13]. The basic principle is to analyze the density distribution of trajectory data on a road cross section. Constrained by prior knowledge of how lanes are laid out, the trajectory density distribution is fitted under the assumption that the number of lanes is known. A lane's centerline position and width can then be determined from the fitted trajectory density distribution on a road cross section, and the lane centerlines are obtained by applying the CGMM to multiple consecutive road cross sections.

The CGMM method ignores the differences in positioning error among vehicle trajectories. Tang et al. [15] found that, on the same road, some trajectories have high positioning accuracy while others have low accuracy. Therefore, we improve the original CGMM and propose a weighted constrained Gaussian mixture model (WCGMM). The weighted model accounts for the differences in trajectory positioning accuracy. First, it estimates the positioning accuracy of the trajectories. Then, the accuracy is used as a weight in the original constrained Gaussian mixture model.

Vehicles usually travel steadily along the direction of a road. Thus, the trajectory of a vehicle will be a smooth curve similar to the road's centerline. When the positioning accuracy is low, the position deviation of points on the trajectory is large, which results in a large difference between the shape of the trajectory and the shape of the road centerline. Therefore, we compare the similarity between the vehicle trajectory and the road centerline to estimate the positioning accuracy of trajectories. The similarity can be measured by the standard deviation of the distance between the trajectory points and the road centerline. When a trajectory's shape is the same as that of the road centerline, this standard deviation is equal to 0.
If the trajectory shape differs significantly from the road centerline, the standard deviation of the distance between the trajectory points and the road centerline is also large. Figure 4 visualizes the standard deviation of the distance between the points and the road centerline. In this figure, the red trajectories have a larger standard deviation, and the green trajectories have a smaller standard deviation. It can be seen that the positioning error of a trajectory is positively correlated with the standard deviation of the distance between its trajectory points and the corresponding road centerline.
Suppose the distances between the trajectory points and the corresponding road centerline are $d_1, d_2, \ldots, d_n$ ($n$ is the number of trajectory points), and the average of these distances is $\bar{d}$. Then, the positioning accuracy $\omega$ of the trajectory can be calculated by equation (2). After the positioning accuracy of each trajectory has been estimated, a weighted constrained Gaussian mixture model is constructed from the trajectories for each road cross section.
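The dispersion measure described above is straightforward to compute. The sketch below calculates the standard deviation of the distances $d_1, \ldots, d_n$ about their mean. Equation (2) is not reproduced in this text, so the mapping from dispersion to a weight in `trajectory_weight` is purely a hypothetical placeholder (a monotonically decreasing function with an assumed `scale` parameter) and should not be read as the authors' formula.

```python
import numpy as np

def lateral_std(distances):
    """Population standard deviation of point-to-centerline distances."""
    d = np.asarray(distances, dtype=float)
    return float(np.sqrt(np.mean((d - d.mean()) ** 2)))

def trajectory_weight(distances, scale=1.0):
    """Hypothetical weight: 1 for a perfectly parallel trajectory,
    decreasing towards 0 as the dispersion grows. NOT equation (2)."""
    return 1.0 / (1.0 + lateral_std(distances) / scale)
```

Any weight that decreases with the dispersion would serve the same illustrative purpose; the choice only needs to give noisier trajectories less influence in the mixture model.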
Before discussing the WCGMM method, we first introduce the basic concepts of the Gaussian mixture model (GMM). The GMM is formulated by mixing multiple single Gaussian distributions, each of which can be regarded as a component of the GMM. Therefore, the probability density distribution $p(x|\Theta)$ of the GMM can be expressed as

$$p(x|\Theta) = \sum_{j=1}^{K} \varphi_j \, \Psi(x|\mu_j, \sigma_j),$$

where $\Theta = \{\varphi_1, \ldots, \varphi_K, \theta_1, \ldots, \theta_K\}$ are the parameters of the GMM. $\varphi_1, \ldots, \varphi_K$ are the probabilities that the sample value $x$ belongs to each component, and they sum to 1. $\theta_j = \{\mu_j, \sigma_j\}$ are the parameters of the $j$-th Gaussian component, where $\mu_j$ is the mean and $\sigma_j$ is the standard deviation.
$K$ is the number of Gaussian components in the GMM. $\Psi(x|\mu_j, \sigma_j)$ is the probability density function of a single Gaussian distribution and can be expressed as

$$\Psi(x|\mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x-\mu_j)^2}{2\sigma_j^2}\right).$$

Gebru et al. [25] further introduced the concept of sample weights based on the GMM. Their approach combines the sample weights and the GMM by treating the weight $\omega$ of sample $x$ as $\omega$ equivalent observations of $x$, from which the probability density distribution $p(x|\Theta, \omega)$ of the weighted GMM can be derived.

The weighted GMM can model the density distribution of vehicle trajectories on road cross sections. The sample value $x$ is the position of a trajectory on the road cross section, and $\omega$ is the estimated positioning accuracy of each trajectory segment. $K$ is the number of lanes. $\varphi_j$ is equivalent to the ratio of the traffic flow in lane $j$ to the traffic flow on the roadway. $\mu_j$ corresponds to the position of each lane centerline. $\sigma_j$ is the dispersion of trajectories in each lane and is related to the width of the lane. According to urban road construction standards, the lanes on the same roadway should have equal width. Thus, $\sigma_j$ is the same for every Gaussian component in a weighted GMM, that is, $\sigma_1 = \sigma_2 = \cdots = \sigma_K = \sigma$. We call the weighted GMM constrained by a common $\sigma$ the weighted constrained GMM (WCGMM). Lane-level road information can be obtained by solving for the parameters of the model.
The parameters of the WCGMM are calculated with the EM algorithm, which is widely used to solve latent variable models. It iteratively estimates the maximum likelihood or maximum posterior probability of a latent variable model. Here, the maximum posterior probability method is used to estimate the parameters. To express the model parameters during the iterations of the EM algorithm, we write $\Theta$ as $(\varphi_j(m), \mu_j(m), \sigma(m))$, where $m$ denotes the $m$-th iteration. For the initial values in the first iteration, we set $\varphi_1(0) = \varphi_2(0) = \cdots = \varphi_K(0) = 1/K$, and the setting of $\sigma(0)$ can refer to the standard width of lanes. We set $\sigma(0)$ to 1.75 meters and set the initial estimated positions $\mu_j(0)$ from $\sigma(0)$ and the number of lanes $K$.
The EM algorithm is divided into two steps: step E estimates the probability of each sample belonging to each Gaussian component, and step M updates the model parameters. We alternate step E and step M iteratively so that the model gradually approaches the maximum posterior probability. In step E, $x_i$ is the $i$-th sample value, $\omega_i$ is the weight of $x_i$, and $c_{ij}(m)$ represents the probability that $x_i$ belongs to the $j$-th Gaussian component during the $m$-th iteration.
Step M updates the parameters of the WCGMM according to the probabilities, calculated in step E, that each sample belongs to the different Gaussian components; the update uses all $n$ samples, where $n$ is the total number of samples.

When the number of lanes on a road cross section is known, the value of each unknown parameter can be calculated through the EM algorithm. Then the lane centerline and lane width can be estimated. Meanwhile, the likelihood of the model can be calculated from the model parameters. The likelihood can be understood as the probability of the observed trajectory distribution given the number of lanes and the model parameters. It can be expressed as equation (8). In the next section, we will use the log-likelihood of the model to extract the number of lanes.
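To make the iteration concrete, the sketch below fits a weighted Gaussian mixture with a shared $\sigma$ by EM. Several simplifications are assumed and should be read as such: the weighted density follows the idea attributed to Gebru et al. [25] by giving sample $i$ the variance $\sigma^2/\omega_i$; the update is a plain maximum likelihood step, so the priors of the maximum posterior formulation used in the paper are omitted; the initial centers are spaced $2\sigma(0)$ apart around the sample mean, which is only one possible way of deriving $\mu_j(0)$ from $\sigma(0)$ and $K$. The function names are hypothetical, and all weights are assumed to be positive.

```python
import numpy as np

def _component_density(x, w, phi, mu, sigma):
    """phi_j * N(x_i | mu_j, sigma^2 / w_i) for every sample i and lane j."""
    sig_i = (sigma / np.sqrt(w))[:, None]          # per-sample std, shape (n, 1)
    z = (x[:, None] - mu[None, :]) / sig_i
    return phi * np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * sig_i)

def fit_wcgmm(x, w, k, sigma0=1.75, n_iter=200, tol=1e-8):
    """Fit k lanes with a shared sigma; returns (phi, mu, sigma, loglik)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    phi = np.full(k, 1.0 / k)                      # equal initial mixing ratios
    mu = x.mean() + 2 * sigma0 * (np.arange(k) - (k - 1) / 2)  # assumed init
    sigma = sigma0
    prev = -np.inf
    for _ in range(n_iter):
        # E step: responsibility of each lane for each sample.
        dens = _component_density(x, w, phi, mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form maximizers for this weighted, tied-sigma model.
        phi = resp.mean(axis=0)
        wr = w[:, None] * resp
        mu = (wr * x[:, None]).sum(axis=0) / wr.sum(axis=0)
        sigma = np.sqrt((wr * (x[:, None] - mu[None, :]) ** 2).sum() / len(x))
        # Log-likelihood at the updated parameters, used as stopping test.
        loglik = float(np.log(_component_density(x, w, phi, mu, sigma).sum(axis=1)).sum())
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return phi, mu, sigma, loglik
```

Running `fit_wcgmm` once per candidate lane count and comparing the returned log-likelihoods gives the kind of per-cross-section evidence that the next section feeds into the lane count extraction. Samples far from every component can underflow the density to zero; a production implementation would work in log space.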

Extraction of Lane Counts on Continuous Road Cross Sections. The ratio of vehicle lane-changing behavior between adjacent road cross sections is related to the number of lanes. If the number of lanes increases or decreases, many vehicles will change lanes. When the traffic volume of each lane is approximately the same, the change in lane counts can be estimated from the ratio of lane-changing vehicles to all vehicles on the road segment. As shown in Figure 5, when the number of lanes changes from two to three, two of the three lanes on road cross section G are the continuation of the two lanes on road cross section F.

To capture this relationship, we use a hidden Markov model (HMM) [26]. Each state has an observation probability for the possible observations. The transition probability defines the state-to-state transition. The hidden state sequence can be generated by maximizing the overall probability given a series of observations. An HMM mainly includes five elements, namely two state sets (the hidden states $H_1, \ldots, H_M$ and the observation states $O_1, \ldots, O_N$) and three probability matrices [27]:

State transition matrix $B$: $B = (b_{ij})$, $1 \le i, j \le M$, where $M$ is the number of hidden states. $b_{ij}$ represents the probability that the hidden state is $H_j$ at $t + 1$ given that the hidden state is $H_i$ at $t$.

Observation state matrix $C$: this matrix describes the probability of the current observation state under the condition that the hidden state is known. $C = (c_j(k))$, $1 \le j \le M$, $1 \le k \le N$, where $N$ is the number of observation states. $c_j(k)$ represents the probability that the observed state is $O_k$ given that the hidden state is $H_j$ at time $t$.

Initial state matrix $\eta$: $\eta = (\eta_i)$, $\eta_i = P(h_1 = H_i)$, $1 \le i \le M$. It represents the probability matrix of the hidden state at the initial time $t = 1$.
The number of lanes at multiple road cross sections can be considered a Markov process: the number of lanes at each road cross section is related only to that of the adjacent road cross sections. Therefore, the HMM can be established by taking the number of lanes on multiple consecutive road cross sections as the hidden state. The observation probability is derived from $L(O_k|\Theta, H_j)$, the likelihood of the WCGMM under the condition that the number of lanes is $H_j$, which can be calculated by equation (8). The value of this likelihood is usually very small because it is a product of many probabilities. Therefore, we use the log-likelihood to calculate the observation state. The maximum log-likelihood over all observation states is used in a min-max normalization to facilitate computer processing.

The state transition matrix $B$ can be defined according to the ratio of lane-changing vehicles. Each element $b_{ij}$ in $B$ is expressed in terms of $\beta_{t,t+1}$, the ratio of lane-changing vehicles between the $t$-th and $(t+1)$-th road cross sections. The product of $\beta_{t,t+1}$ and the maximum number of lanes on the adjacent road cross sections is the number of changing lanes. The number of changing lanes can also be obtained from the difference between hidden states $H_i$ and $H_j$. Therefore, the probability that the lane counts on the previous and current road cross sections are $H_i$ and $H_j$ can be calculated according to the number of changing lanes. Because some vehicles may change lanes to overtake or turn even if there is no new lane, $\beta_{t,t+1}$ is computed from the difference between $CG_{left}$ and $CG_{right}$ relative to $CR_{total}$, where $CG_{left}$ and $CG_{right}$ represent the number of vehicles changing lanes to the left and right, respectively, and $CR_{total}$ represents the number of all vehicles between two adjacent road cross sections. $\beta_{t,t+1}$ is calculated this way because vehicles are considered equally likely to change to the left and to the right when no additional lanes exist.
Using the difference between the numbers of vehicles changing lanes to the left and to the right avoids the influence of overtaking or turning. The number of lane-changing vehicles can be extracted from the lane-change trajectory points identified in Section 3.1. First, we count all lane-change trajectory points between two adjacent road cross sections. Then these points are classified into left and right categories according to the direction of the change. Finally, the number of vehicles changing lanes to the left or right is counted using the vehicle IDs of the trajectory points.
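The counting procedure can be illustrated with a short sketch. The exact expression for $\beta_{t,t+1}$ is not reproduced in this text, so the formula used here, the absolute difference between the left and right counts divided by the total number of vehicles, is an assumption inferred from the description above. The function name and the input layout (pairs of vehicle ID and direction) are likewise hypothetical.

```python
def lane_change_ratio(lane_change_points, all_vehicle_ids):
    """Assumed form: beta = |CG_left - CG_right| / CR_total.

    lane_change_points: iterable of (vehicle_id, direction) pairs for the
        lane-change trajectory points between two adjacent cross sections,
        with direction equal to "left" or "right".
    all_vehicle_ids: IDs of every vehicle observed between the two sections.
    """
    # Count vehicles, not points: several points may share one vehicle ID.
    left = {vid for vid, direction in lane_change_points if direction == "left"}
    right = {vid for vid, direction in lane_change_points if direction == "right"}
    total = len(set(all_vehicle_ids))
    if total == 0:
        return 0.0
    return abs(len(left) - len(right)) / total
```

With equal numbers of left and right changes the ratio is zero, which matches the stated intent of cancelling out overtaking and turning manoeuvres.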
The observed state probability can directly represent the initial state matrix. We take the maximum number of lanes as a parameter of the algorithm, and the candidate set of hidden state variables is determined by this maximum. Solving for the hidden states in the HMM means finding the number of lanes on each road cross section that gives the highest overall probability. Generally, this can be solved by the Viterbi algorithm [28]. After the Viterbi algorithm has found the number of lanes on each road cross section, the WCGMM corresponding to each road cross section can be determined. Then the lane information on a road cross section can be obtained from the model parameters.
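A generic Viterbi decoder is enough to illustrate how the lane counts are recovered. The sketch below works with log-probabilities and accepts a separate transition matrix for each pair of adjacent cross sections, since $\beta_{t,t+1}$ varies along the road. How the emission and transition values are built from the normalized log-likelihood and from $\beta_{t,t+1}$ is left to the caller; the function name and array layout are assumptions of this example.

```python
import numpy as np

def viterbi_lane_counts(log_emission, log_transition, log_initial):
    """Most probable hidden state index for each road cross section.

    log_emission:   (T, M) log observation probability per section and state
    log_transition: (T - 1, M, M) log transition probability from state i
                    at section t to state j at section t + 1
    log_initial:    (M,) log initial state probability
    """
    T, M = log_emission.shape
    score = log_initial + log_emission[0]
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_transition[t - 1]   # rows: from, cols: to
        back[t] = cand.argmax(axis=0)                   # best predecessor
        score = cand.max(axis=0) + log_emission[t]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

If the candidate hidden states are the lane counts 1 to the maximum number of lanes, the lane count at each cross section is the returned index plus one.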

Extraction of Lane Centerlines.
The HMM and the WCGMM generate lane centerline nodes on road cross sections. We then need to construct lane centerlines by connecting these nodes across multiple consecutive road cross sections. To make the constructed lane centerlines as close to the actual lane centerlines as possible, we need to divide the road into as many cross sections as possible. These road cross sections are set along the centerline of a road at a fixed length interval to extract the lane centerline nodes. We take the fixed-length interval of road cross sections as a parameter of the algorithm. On roads with multiple lanes, several lane centerline nodes are extracted on one road cross section. In addition, the number of centerline nodes on adjacent road cross sections may differ when the number of lanes changes. It is therefore necessary to determine which nodes on adjacent road cross sections should be connected. A method for connecting lane centerline nodes is proposed based on the minimum matching distance, which includes the following steps:

Step 1: Index the lane centerline nodes. Sort the nodes on a cross section according to the distance between each node and the road's centerline. Then number the nodes according to the sort order. As shown in Figure 6, for the nodes on road cross section G, the closest node to the centerline is $G_1$, followed by $G_2$ and $G_3$, until the $n$-th node $G_n$.
Step 2: Build a candidate match set. Since the number of lanes on two adjacent road cross sections may differ, we use the road cross section with fewer lanes as the reference. Then we match the nodes on the cross section with more lanes to those on the reference road cross section. Suppose the two adjacent road cross sections are F and G, with $m$ lanes in F, $n$ lanes in G, and $m < n$. The set of nodes in F is $F_{Node} = \{F_1, F_2, \ldots, F_m\}$, and the set of nodes in G is $G_{Node} = \{G_1, G_2, \ldots, G_n\}$. According to the node index in G, the candidate matching sets for $F_{Node}$ are constructed. Each candidate matching set for $F_{Node}$ is $\{G_i, G_{i+1}, \ldots, G_{i+m-1}\}$, with $i$ from 1 to $n + 1 - m$. From each candidate matching set, the matching pairs are constructed, that is, $F_1$ matches $G_i$, $F_2$ matches $G_{i+1}$, and $F_m$ matches $G_{i+m-1}$.
Step 3: Calculate the matching distance. For each candidate matching set, we calculate the distance of each matching pair in the set. Then we sum the distances over all matching pairs. The sum is the matching distance of the whole candidate matching set.
Step 4: Construct the centerlines of lanes. The candidate matching set with the smallest matching distance is selected. We connect the corresponding nodes according to each matching pair, as shown by the solid red line in Figure 7. For an unmatched node $G_u$ on road cross section G, we find the closest matched node $G_v$ on G. Then we connect $G_u$ to the node on road cross section F that is matched with $G_v$, as shown by the red dashed line in Figure 7.
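Steps 2 to 4 can be sketched as below. To keep the example short, each node is represented only by its lateral offset in the Frenet frame, and the distance of a matching pair is taken as the absolute difference of the offsets; both are simplifying assumptions, as is the function name. The nodes are assumed to be already indexed as in Step 1, with F being the cross section that has fewer lanes.

```python
def connect_nodes(f_nodes, g_nodes):
    """Link every node on cross section G to a node on cross section F.

    f_nodes, g_nodes: lateral offsets of the indexed lane centerline nodes,
        with len(f_nodes) <= len(g_nodes).
    Returns a list of (g_index, f_index) links, using 0-based indices.
    """
    m, n = len(f_nodes), len(g_nodes)
    if m == 0 or m > n:
        raise ValueError("F must be non-empty and have no more nodes than G")

    # Steps 2 and 3: slide a window of m consecutive G nodes over G and
    # keep the window with the smallest summed pair distance.
    def matching_distance(i):
        return sum(abs(f_nodes[k] - g_nodes[i + k]) for k in range(m))

    best_i = min(range(n - m + 1), key=matching_distance)
    matched = {best_i + k: k for k in range(m)}    # g index -> f index

    # Step 4: matched nodes are linked directly; an unmatched G node is
    # linked to the F partner of its closest matched neighbour on G.
    links = []
    for u in range(n):
        if u in matched:
            links.append((u, matched[u]))
        else:
            v = min(matched, key=lambda g: abs(g_nodes[g] - g_nodes[u]))
            links.append((u, matched[v]))
    return links
```

For example, with two lanes on F and three on G, the extra lane on G is attached to the F lane next to it, which corresponds to the dashed connection described in Step 4.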

Experiments
The experiments are described in the following sections.

Experimental Data Collection and Preprocessing.
Mapillary is a street-level image data-sharing platform based on geospatial tags. Users can upload vehicle trajectory data with geographic location tags as individuals or teams. Through the service interface and development toolkit provided by Mapillary, users can download crowdsourced vehicle trajectory data from the platform. We found that the number of trajectories uploaded by users in the San Francisco area is relatively large, and the higher the road grade, the higher the trajectory coverage rate. To ensure that there are enough trajectories to cover the road area, we collect vehicle trajectories on motorway sections, including US 101, CA 1, I280, and I80. Figure 8 shows the collected trajectory data superimposed on Google Satellite Maps. As can be seen from Figure 8, the research area contains complex road scenes such as curves, tunnels, and overpasses. The blue lines in Figure 8 are trajectory data. The dataset contains 3,728 trajectories and 557,924 trajectory points. These trajectories are sampled at intervals ranging from 1 to 10 seconds, with an average sampling interval of 2.36 seconds. The positioning errors of some trajectories exceed 100 meters. We treat vehicle trajectories with such large positioning errors as noise. Therefore, the data needs to be preprocessed to eliminate the noise before lane information extraction.
We download the road centerline data within the research area from the OSM website. The road centerlines are used to estimate the positioning error of each trajectory using the error calculation method proposed in Section 3.2. We use the natural discontinuity method [29] to determine the positioning error threshold for noise. Trajectories with errors higher than the threshold are regarded as noisy. Figure 9 shows the vehicle trajectories after preprocessing on one of the road sections in the research area. Trajectories on the left and right sides of the road are distinguished, and we use the vehicle trajectories on each side separately to extract lane information. Table 1 shows the parameter settings of the proposed method. According to road construction standards, the maximum number of lanes on urban roads is set to 6. Therefore, the candidate hidden states in the HMM are {1, 2, 3, 4, 5, 6}. The length of the interval between road cross sections affects both the efficiency of the algorithm and the smoothness of the extracted lane centerlines: the longer the interval, the faster the calculation, while the shorter the interval, the smoother the extracted lane centerline. Considering these two factors, we set the interval to 20 meters.

Extraction of Lane-Changing Trajectory Points.
The extraction of lane-changing trajectory points is the first step in the proposed method. By observing the trajectory distribution on different roads, we found a correlation between the lane-changing behavior of vehicles and changes in the number of lanes. To verify this assumption, we selected two road sections (road sections A and B in Figure 10) in the study area for experiments. The number of lanes on road section A remains the same, while the number of lanes on road section B changes from 4 to 5. The traffic volumes of the two road sections are similar. The purple and pink points in Figure 10 are the lane-changing trajectory points extracted by the method proposed in this paper. It can be seen that more vehicles changed lanes on road section B, and the detection results are consistent with visual inspection. We further counted the number of detected lane-changing vehicles. As shown in Figure 11, more vehicles change lanes to the right on road section B, because some vehicles need to enter the new lane on the right. On road section A, there are also more vehicles changing lanes to the right than to the left. We think this may be caused by the driving habits of drivers, which can be further studied quantitatively. However, this effect is not significant compared with the difference caused by adding lanes. Thus, we did not take this factor into account in our method.

Construction of Lane Centerlines.
Accurately fitting the trajectory distribution on the road cross section is the key to extracting the lane centerline. Based on the existing CGMM, we propose a weighted constrained Gaussian mixture model (WCGMM) that incorporates the trajectory positioning error, and we verify the method's effectiveness with the trajectory data on road section A in Figure 10. Figure 12 shows the trajectory distribution fitting results of the WCGMM method under different assumed numbers of lanes. Unlike the original Gaussian mixture model, the weighted Gaussian mixture model depends on the error of each sample. Thus, the fitted distribution is not a smooth curve. Figure 13 shows the log-likelihood for each subgraph in Figure 12. The largest log-likelihood is obtained under the assumption of 4 lanes, consistent with the results observed in Figure 10.
An accurate estimation of the number of lanes is the basis of lane centerline extraction. To accurately estimate the number of lanes from the trajectory data, our method makes two improvements. First, a weighted constrained Gaussian mixture model is proposed. Building on the original constrained Gaussian mixture model, it takes trajectory positioning errors into account: the weight of high-precision trajectory data is increased, and the impact of low-precision trajectory data on model fitting accuracy is reduced. Second, a hidden Markov model is used to model the correlation between vehicle lane-changing behavior and changes in the number of lanes, which improves the consistency of lane number estimates on adjacent road cross sections. Figure 14 shows the number of lanes extracted using the trajectory data in Figure 9. The actual lane number of the road is obtained by manually interpreting the Google Satellite Map at each road cross section. Figure 14(a) shows the result of the improved method that uses only the weighted constrained Gaussian mixture model. Figure 14(b) shows the lane numbers obtained by using both the weighted constrained Gaussian mixture model and the hidden Markov model to fit the transverse and longitudinal trajectory distributions of the road. The lane counts along the road are more consistent when both improvements are used together than when only the weighted constrained Gaussian mixture model is used.
To further verify the effectiveness of the method, we used all the trajectory data collected in the study area to construct a lane-level road map, as shown in Figure 15. To show the details of the extracted lane centerlines more clearly, we selected three representative areas, a, b, and c, in Figure 15. Area a is a scene where vehicles pass through a tunnel; area b is a curve; and area c is a viaduct where multiple roads converge. Figure 16 shows the lane centerline extraction results in the three scenarios. The extraction results in area b are relatively more accurate, the lane width extracted in area a is wider than the actual width, and the number of lanes and the lane centerline positions of some roads in area c are extracted incorrectly. Comparing the extraction results of the three areas, we found two factors that may reduce the extraction accuracy of lane information. The first is the trajectory error. Although the WCGMM considers the influence of the trajectory error, it cannot eliminate the trajectory positioning error. When the error is too large and the number of samples is insufficient, there will still be a significant deviation in the position of the fitted centerline, even if the number of lanes is estimated correctly. The second is insufficient trajectory coverage. When the trajectories on a specific lane are sparse or absent, the uncertainty of the extracted lane information increases. In addition, the method in this paper applies only to the extraction of lane information on road segments. For road intersections, the steering behavior of different vehicle trajectories can be used to further extract the routing information of the intersection after the lane information on road segments has been extracted.
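The idea of choosing the lane count with the largest log-likelihood can be sketched as follows. This is a simplified stand-in rather than the WCGMM itself: it assumes a fixed lane width, a shared standard deviation, equal mixing proportions, and per-sample weights that scale each sample's log-density, and it replaces model fitting with a grid search over the first lane center. The constants `LANE_WIDTH` and `SIGMA`, the function names, and the sample data are all hypothetical.

```python
import math

LANE_WIDTH = 3.5  # assumed constant lane width in meters (hypothetical)
SIGMA = 1.0       # assumed shared standard deviation in meters (hypothetical)


def weighted_loglik(offsets, weights, n_lanes, first_center):
    """Weighted log-likelihood of cross-section offsets under equally spaced lanes."""
    norm = SIGMA * math.sqrt(2.0 * math.pi)
    total = 0.0
    for x, w in zip(offsets, weights):
        # Equal-proportion mixture of Gaussians centered on each lane.
        density = sum(
            math.exp(-0.5 * ((x - (first_center + j * LANE_WIDTH)) / SIGMA) ** 2) / norm
            for j in range(n_lanes)
        ) / n_lanes
        # Assumption: the weight scales the sample's log-density contribution.
        total += w * math.log(density)
    return total


def best_loglik(offsets, weights, n_lanes, step=0.25):
    """Grid-search the first lane center; a crude substitute for model fitting."""
    start = min(offsets) - LANE_WIDTH
    n_steps = int((max(offsets) - start) / step) + 1
    return max(
        weighted_loglik(offsets, weights, n_lanes, start + i * step)
        for i in range(n_steps)
    )


# Hypothetical offsets (m) of trajectory points across a four-lane cross section.
offsets = [c + d for c in (0.0, 3.5, 7.0, 10.5) for d in (-0.5, 0.0, 0.5)]
weights = [0.6, 1.0, 0.8] * 4  # higher weight = more accurate trajectory
scores = {k: best_loglik(offsets, weights, k) for k in range(1, 7)}
best_count = max(scores, key=scores.get)
```

Under these assumptions the equal mixing proportions penalize surplus lanes, so adding components beyond the true lane count lowers the score instead of raising it.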
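The smoothing effect of the hidden Markov model can be sketched with a small Viterbi decoder over the candidate lane counts {1, ..., 6}. The emission scores stand in for per-cross-section WCGMM log-likelihoods, and the transition model is an assumption made for illustration: the probability of a lane-count change between adjacent cross sections, `p_change`, is taken as given and would be higher where more lane changes are observed. The names `viterbi_lane_counts`, `favor`, and `p_change` are hypothetical.

```python
import math

LANE_COUNTS = [1, 2, 3, 4, 5, 6]  # candidate hidden states


def viterbi_lane_counts(emission_loglik, p_change):
    """Most likely lane-count sequence along consecutive cross sections.

    emission_loglik: one dict {lane_count: log-likelihood} per cross section.
    p_change: assumed probability (0 < p < 1) that the lane count changes
              between cross sections t and t + 1, one value per adjacent pair.
    """
    k = len(LANE_COUNTS)
    score = dict(emission_loglik[0])  # uniform prior over lane counts
    backpointers = []
    for t in range(1, len(emission_loglik)):
        log_stay = math.log(1.0 - p_change[t - 1])
        log_move = math.log(p_change[t - 1] / (k - 1))
        new_score, pointer = {}, {}
        for s in LANE_COUNTS:
            def via(r):
                # Score of reaching state s from previous state r.
                return score[r] + (log_stay if r == s else log_move)
            best_prev = max(LANE_COUNTS, key=via)
            pointer[s] = best_prev
            new_score[s] = via(best_prev) + emission_loglik[t][s]
        score = new_score
        backpointers.append(pointer)

    # Backtrack from the best final state.
    state = max(LANE_COUNTS, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


def favor(count, margin):
    """Toy emission: `count` scores 0, every other lane count scores -margin."""
    return {c: (0.0 if c == count else -margin) for c in LANE_COUNTS}


# One weakly supported outlier (5 lanes) with few lane changes observed nearby.
noisy = [favor(4, 5.0), favor(4, 5.0), favor(5, 1.0), favor(4, 5.0), favor(4, 5.0)]
smoothed = viterbi_lane_counts(noisy, [0.01, 0.01, 0.01, 0.01])

# A real widening, with many lane changes between the second and third sections.
widening = [favor(4, 5.0)] * 2 + [favor(5, 5.0)] * 3
widened = viterbi_lane_counts(widening, [0.01, 0.5, 0.01, 0.01])
```

In the first toy sequence the isolated estimate of 5 lanes is overruled by its neighbors, while in the second the change from 4 to 5 lanes is retained.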

Quantitative Evaluation.
Existing methods for extracting lane information using vehicle trajectories mainly include KDE (kernel density estimation), proposed by Uduwaragoda et al. [12], and CGMM (constrained Gaussian mixture model), proposed by Chen and Krumm [13]. Among them, the CGMM method uses constraint conditions such as    Table 2. They are obtained by calculating the ratio of correctly extracted road cross sections to total road cross sections. These results show that the accuracy of lane count recognition based on GMM is better than that based on kernel density estimation. Compared with CGMM and KDE, the proposed method uses trajectory information between adjacent road cross sections and lane-changing features, which improves the accuracy of lane count extraction. In addition, compared with using only the WCGMM, combining the HMM and the WCGMM further improves the algorithm's accuracy.
The comparative analysis of lane centerline extraction requires the actual lane centerlines as a reference. The reference lane centerlines are obtained by manually vectorizing the lane centerlines in the research area on the Google satellite map. The extraction error is estimated from the distance between the extracted and reference centerline nodes. The average, minimum, maximum, and standard deviation of the distances between nodes are used as indices of extraction accuracy. Table 3 shows the evaluation results of each method. The kernel density estimation method has lower accuracy than GMM because it requires an appropriate kernel density radius to be selected. If the trajectory positioning accuracy is low, it is not easy to estimate the density distribution of vehicle trajectories on the road accurately. The Gaussian mixture model can achieve more accurate results by adding constraints based on prior knowledge. The proposed method uses trajectory positioning accuracy as the weight in the constrained Gaussian mixture model, which further improves the accuracy of lane centerline extraction. Thus, the average error of the proposed method is small. Using both the WCGMM and the HMM reduces the error variance compared with using only the WCGMM. This is because the HMM uses the vehicles' lane-changing behavior to model the longitudinal trajectory distribution of the road, so that the lane information on different road cross sections is more consistent.
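The accuracy measure described above, the ratio of correctly extracted road cross sections to total road cross sections, can be computed as in the following minimal sketch. The function name and the sample lane counts are hypothetical and do not reproduce the values in Table 2.

```python
def lane_count_accuracy(estimated, reference):
    """Share of cross sections whose estimated lane count matches the reference."""
    if not reference or len(estimated) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for e, r in zip(estimated, reference) if e == r)
    return correct / len(reference)


# Hypothetical lane counts at six consecutive cross sections.
estimated = [4, 4, 5, 4, 5, 5]
reference = [4, 4, 4, 4, 5, 5]
accuracy = lane_count_accuracy(estimated, reference)
```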
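The four accuracy indices can be sketched as below. The sketch assumes that extracted and reference centerline nodes are already paired one to one in a planar coordinate system measured in meters, and that the standard deviation is the population form; the paper does not state either detail here. The function name and coordinates are hypothetical.

```python
import math
import statistics


def centerline_error_stats(extracted_nodes, reference_nodes):
    """Average, minimum, maximum, and standard deviation of node distances."""
    if not extracted_nodes or len(extracted_nodes) != len(reference_nodes):
        raise ValueError("inputs must be non-empty and of equal length")
    # Euclidean distance between each paired (x, y) node.
    distances = [math.dist(p, q) for p, q in zip(extracted_nodes, reference_nodes)]
    return {
        "mean": statistics.mean(distances),
        "min": min(distances),
        "max": max(distances),
        "std": statistics.pstdev(distances),  # population form (assumption)
    }


# Hypothetical planar coordinates in meters, paired by index.
extracted = [(0.0, 0.0), (20.0, 0.5), (40.0, 1.0)]
reference = [(0.0, 0.3), (20.0, 0.1), (40.0, 0.4)]
stats = centerline_error_stats(extracted, reference)
```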

Conclusions
This article studies the problem of using crowdsourced vehicle trajectories to extract lane centerlines. A high-precision lane centerline extraction method is proposed

Data Availability
The data used to support the findings of this study can be obtained from the corresponding author upon request.

Additional Points
Highlights. (i) We identify the lane-changing behavior of vehicles from massive trajectories. (ii) A weighted, constrained Gaussian mixture model is proposed to describe the trajectory distribution on a road cross section. (iii) A hidden Markov model is proposed to estimate the lane counts in different road segments. (iv) A compatible and more accurate estimation of lane centerlines is achieved by considering lane-changing behavior.

Conflicts of Interest
The authors declare that they have no conflicts of interest.