Refining Sparse Cell-ID Trajectory of Public Service Vehicles by Spatiotemporal Modelling

,


Introduction
Cellular network-based data are emerging as a great data source for urban transportation application due to the advantage in the large geographic coverage of cellular networks and the comprehensive penetration in a population [1][2][3]. Generally, cellular network-based data collected by mobile network operators can be reorganized into a mobile phone user's trajectory formed as a sequence of time-stamped cellids (i.e., cell-id trajectory), illustrating motion characteristics of corresponding mobile objects [4]. e location in cell-id trajectory at a particular time is assigned with the coordinate of an occupied base transceiver station (BTS). e spatial resolution of cell-id trajectory data depends on the service radius of each BTS, which varies in different areas, e.g., of hundred meters in metropolitan cities, and several kilometers in rural areas [5]. Meanwhile, the records of a cell-id trajectory are collected in a relatively long-time interval, which inevitably results in the problem of trajectory discontinuity or sparsity [6][7][8][9][10]. erefore, due to the inconformity between the spatiotemporal sparsity of the cell-id trajectory data and the requirement on fine-scale footprints, the refinement of cell-id trajectories becomes a research hotspot due to its extensive application prospects. e trajectory refinement that refers to filling the spatiotemporal gaps in the data is an approach to mitigate the sparsity of time-series trajectories [11]. Existing refinement method based on mobile phone location data mainly includes an interpolation-based method and a mapmatching-based method. e former mainly uses spatialtemporal correlations among records to interpolate missing points. During interpolation, trajectory points are calculated based on the distances and time spans between each missing point and its contextual points by an estimation function (e.g., nearest-neighbor function, linear function, or Gaussian function). Ficek and Kencl [12] proposed an intercall mobility model which combines Gaussian mixtures to refine CDRs. Hoteit et al. [13,14] compared the reconstruction performance of various interpolation methods (linear, cubic, nearest, and spline interpolations) on trajectories with different sampling interval and radius of gyration. Yu et al. [15] simply used a spatially-linear-interpolated method to estimate exposure in air pollution of cell phone user. Csaji et al. [16] interpolated between home and office to infer user's location and analyze the population distribution. ese refinement methods might meet the requirement for pollution exposure or census estimation; however, it is unfeasible for transportation study. Perera et al. [17] provided a method to compute the location of phone user within a cell, but it requires extra speed information which is generally unavailable in cell-id data.
As for map-matching-based methods, the basic assumption is that vehicle movement behaviors always occur along road networks. us, a sequence of trajectory points can be aligned to a sequence of road segments to form a complete path. Hidden Markov model (HMM) is the most popular one among map-matching-based methods. Jagadeesh and Srikanthan [18] used a pretraining route choice model to generate partial map-matched paths and identify the most likely one. Another HMM model proposed by Jagadeesh and Srikanthan is in [19] which considered the tailored transition probabilities for the type of data. Algizawy et al. [20] used HMM to generate a road-level traffic density, at an hourly granularity, for each mobile trajectory. Xiao et al. [21] used contextual relationships between trajectory points as features of the CDR trajectories in a conditional random field model to reconstruct individual trajectories. Chen et al. [22] proposed two approaches for completing CDRs adaptively to reduce the sparsity and mitigate the problems the latter raises. However, the basic assumption that underlies the map-matching method is questionable.
at is, individuals in urban space can travel by public transportation, which limits the performance of such methods.
is paper aims to refine sparse cell-id trajectory of public service vehicles (PSVs) by combining vehicle transport model and mobile cell-id trajectories, in which an LCSS-based SVM classifier took full advantage of the similarity between cell-id trajectories and designed public transportation routes (PTRs) to separate PSVs from those with other transport modes (e.g., walk or private car) in a large-scale unlabeled trajectory dataset. en, each trajectory of PSV was modeled to fit with its mobile behaviors as much as possible at stops, junctions, and roads and be consistent to a spatial cell cover and bus travel speed by a heuristic global optimization. We evaluated our proposed trajectory refinement method by using an encrypted cell-id trajectory dataset and a GNSS-logged bus trajectory collection. e results show that our approach delivers a state-of-the-art achievement in refining cell-id trajectories.

Conceptual Model.
e cell-id trajectory explored in this paper is a sequence of time-stamped cell-ids. Each cell-id is a unique identification of a base transceiver station (BTS) with geographical coordinates and sectoral signal transmission coverage. So, we can imagine that a refined cell-id trajectory must occur within the overlap area among a road network and a BTS sectorial area. Figure 1 gives an overview of the architecture of our approach, including two phases: PSV trajectory detection and trajectory refinement or reconstruction.
PSV cell-id trajectory detection is to identify PSVgenerated trajectories from huge volume of cell-id trajectories with various transportation modes, i.e., walk, cycle, and private car. An LCSS alignment algorithm is used to match a cell-id trajectory to a public transportation route (PTR) according to a number of PTR-nearby BTS sectoral coverages and chronological order of timestamps. LCCS generates a sequence of corresponding points (also called anchor points), in which an anchor point is represented to two different locations on the PTR route and cell-id trajectory, respectively. e similarities calculated based on the LCSS sequence are sufficient conditions to recognize PSV trajectory by a support vector machine classifier. Anchor points are initial inputs to heuristic optimization model for further estimating precise locations of anchors on PTR routes, and consequently, high-quality trajectories are generated by interpolating among optimized anchor points.

PSV Cell-ID Trajectory Detection.
e PSV such as buses, subways, and trams always run following the transportation mode with fixed routes and prescript schedules, providing transportation services for public passengers. To identify cell-id trajectories from mass datasets, it is essential to establish a set of specific measures to quantify the correlation between cell-id trajectories and the PTR. Based on LCSS, the longest common sequence among a cell-id trajectory and PTR is matched and thus a set of similarity measures are proposed to measure the spatiotemporal correlation between them. Taking use of the similarity measures, a SVM binary classifier is deployed to recognize cell-id trajectories with the PSV mode.

BTS Sector.
In a cell-id trajectory, a phone holder's location is roughly represented as the installation position of the cell-id marked BTS. It is not the identical location of the phone holder as the BTS actually covers a large geographical area. It is impossible to catch the exact location of mobile users as they can be anywhere inside the mobile network cell.
ere are two popular mobile network cell models. Most researches represent mobile network cells as Voronoi areas centered on BTSs (see Figure 2(a)). In this work, BTS sectors are used that offer a finer spatial scale due to limiting a mobile phone location to only parts of the cell, as shown in Figure 2(b). Generally, there are three antenna sectors per BTS and the sector's orientation is same as the direction of BTS antenna. We construct sectors with an average diameter of 500 meters in urban areas.

LCSS Matching.
Due to the uncertainness of spatial positioning and irregular time interval logging at a cell-id tracking point, it is difficult to match a PSV cell-id trajectory to a PTR route. An LCSS alignment algorithm that originates in the field of string matching, where two strings are given to find characters that appear left-to-right, not necessarily consecutively, in both strings [23], is applied to find the longest common subsequence of two sequences. e longer an LCCS is, the higher the probabilities that the trajectory was generated by a PSV.
e LCCS is extended to support periodic matching in this work. Following pilot studies [24,25] that processed cell-id trajectories as strings, we discretized a continuous PTR route into a cyclic sequence of discrete points with an where p i is the ith point on trajectory T, t i is the timestamp of p i , and q j is the jth point on route R.
Two points p i and q j may be considered to be matched if q j located within the BTS sector of p i , and it can be represented as where S p is the BTS sector of p i . Let T (n) denote the first n points of trajectory T and R (m) denote the first m points of R, the LCSS between T and R.
Dynamic programming algorithm deployed to discover optimal alignment for T and R is shown in Figure 3. A matrix D is created to save the LCSS for every subsequence pair of T and R. ‖T‖ and ‖R‖ are the point numbers of T and R, respectively. Considering a PSV periodically moves along PTR, R is designed as a periodic cyclic sequence. ext is the search space on R for T. p i to find matched R.q j . e entries of D are gradually filled as the dynamic programming proceeds (lines 4-10), and the last entry stores the LCSS of aligning T and R. Finally, we decode D to find all the aligning cell-id log pairs of the optimal alignment (lines [12][13][14][15][16][17][18][19][20][21][22]. e extended LCSS model can be viewed as a modified version of the models [26,27], which not only finds the longest common subsequence in terms of the accumulate number of matched anchors LCSS(T, R) but also the cycle numbers of the cell-id trajectory.  Journal of Advanced Transportation

Similarity Measures (a) Anchors
Ratio. An often used similarity measure for an LCSS matching is calculated as the ratio between the number of points in LCSS and the number of cell-ids in original cell-id sequences size [26,27], called anchors ratio in this work.
where LCSS(T, R) is the number of the matched anchor point pairs, and T ′ is a subset of T that is the common part between T and R. Using T ′ instead of T is because T contains points beyond on-duty period..
. , m could be regarded as traversed partial on the PTR; correspondingly, traverse completeness TC is defined as follows: where MSR is the union of matched subroutes SR j , j � 1, . . . , m, L(MSR) is the length of union-routes MSR, and L(R) is the total length of route R.
(c) PSV Cycles or Round-Trips. PSV cycles denote to the cycles of periodic matching of cell-id trajectory on the PTR, indicating how many round-trips the PSV run along the route.
where m s and m e are the mileages of the first and last anchor points, respectively, m e − m s represents the total distance during on-duty time, and L(R) is the total length of route R.

Trips Partition.
A PSV cell-id trajectory often includes several round-trips along a PTR route, each alternating with a long-time stay at terminal stations as drivers have access to toilet facilities at rest, fuel, and food establishments. e stop time cannot be directly obtained from original data, so we introduced a spatiotemporal kernel density estimation (STKDE) method to identify these stays and therefore enable to partition trips. Trajectory refinement or reconstruction is thus able to be performed trip by trip because those short trajectories generated at terminal stay time are eliminated.
We first transform the cell-id trajectory into a timedistance relationship by calculating the distance between start station and each BTS. en, the kernel density along the distance axis is estimated as where m is the accumulated mileage of the PSV, K s is a kernel function for the spatial domain, and h s is the spatial bandwidth. Trajectory point is weighted on a univariate kernel density function K s as follows: e kernel density f(d) is estimated based on the travel duration when the PSV pass through the spatial domain. Points on which the PSV has met traffic congestion, junctions, or stops often have high-density values. In particular, the long-time stay at terminal stations causes an extremely high-density peak, which can be used to partition trips [29].

Heuristic Global Optimization.
For a round-trip matching, we deploy a heuristic global optimization approach to estimate the precise locations of anchor points on the route. It tends to match a time-stamped point sequence to a monotonically increased mileage sequence. For each anchor point of the LCSS from a cell-id trajectory and a PTR, it must satisfy (a) locating within the intersection of the PTR and BTS sectors, (b) having a longer mileage than previous points' one, and (c) having new location nearby the initial.
A heuristic optimization model, as a commonly used model on finding approximate global optima problems, is used to search new location of anchor points naturally. Assuming a PSV always tends to move on a PTR in uniform motion to meet a prescript visit to stops, an objective function is defined as equation (8) to minimize the standard deviation of bus speed along a route in which the speeds among two consecutive anchors totally depend on their locations and time interval.
Journal of Advanced Transportation 5 subject to: where m i is the mileage of the ith anchor along the PTR, t i is the timestamp, n is the total number of anchor points, and lb i , ub i are the upper and lower limits of the decision variables m i being optimized. As illustrated in Figure 4, by iteratively adjusting the location of anchor points in a search space, a minimum objective function value is reached by leveraging dwelling times on bus stops and cross-roads besides the abovementioned constraints. e steps of the method are listed as follows.

Study Area.
e study was conducted at Huilongguan town to evaluate the proposed public transit mode detection and cell-id trajectory refining method. Huilongguan town is one of the largest townships in the northern part of Beijing, China. e town has a total population of about 450,000 people and an area of 34.5 square kilometers. Being one of the most populated residential areas in Beijing, choked traffic has been an archenemy to urban transportation system in this area. As shown in Figure 5, the town streets and road maps were sourced from OpenStreetMap and total 20 bus routes and related more than 400 bus stops covering this area were also retrieved from Beijing Public Transport Corporation (BPTC).

Datasets.
Cellular phone network signaling datasets, covering the town extension area on early August, 2016, were collected, about 670,000 mobile phone trajectories, 240 million records with an average phone-station interaction interval of 280 seconds. Each record includes timestamp (TS), International Mobile Equipment Identity (IMEI), tracking area code (TAC), and cell identity (CI), representing an interaction event between a mobile phone (IMEI) and a base station (TAC plus CI) at a dedicated time (TS). All private information in mobile phone datasets has been encrypted to protect privacy.
Ground truth datasets were collected by deploying an android device-based cell signal monitor program. Following GNSS positioning (longitude, latitude, and time), cellular towers information along bus routes was also acquired, including cellular network (GPRS/EDGE/UMTS/ LTE), current cell identity (CID), current area identity (LAC/RNC/TAC), signal strength (RSSI and RSRP for LTE networks), and cells that were used by the mobile device.
is tracking dataset was logged with a sampling interval of 1 second and a GPS positioning error of 5-10 meters. It is mainly used to calibrate and validate our proposed model.

Data Preprocessing.
Ground truth datasets were reorganized as follows. First, the information about bus roundtrips was extracted from GNSS-measured bus trajectories. en, bus cell-id trajectories were prepared by resampling raw cellular tacking datasets with a long-time interval of 280 seconds to be consistent with our big cellular collection in 2016.
As shown in Figure 6, a subset of ground truth datasets at route no. 307, totally 3,8273 GNSS points and 136 cell towers of BTS, were illustrated, among which each BTS might be deployed many times when our android devices passed through its covered area. Figure 6 displays both a GNSS trajectory and a cell-id trajectory from 8 : 00 to 20 : 00 within 4 separate round-trips. GNSS points overlap bus route very well but cell-id points are quite poor.

PSV Cell-ID Trajectory Detection by LCSS.
Applying our revised LCSS algorithm to register cell-id trajectories datasets to 20 bus routes in Huilongguan, the similarities between trajectories and bus routes were calculated. Total 843 candidate cell-id trajectories with 16,732 time-stamped track points (or towers) were identified with an anchor ratio threshold of 0.2. In this work, the cell-id trajectories that are from bus drivers or conductors are called "PSV mode." By human interpretation, 456 cell-id trajectories (phone holders) were labelled as "PSV mode" and the rest were "non-PSV mode." e statistics of similarity measures of PSV mode trajectories is listed in Table 1. Support vector machine (SVM), which may maximize the margin by separating two classes of samples, was used to identify PSV cell-id trajectories from others in this work. K-fold cross-validation was used for SVM parameters tuning such that the model with most optimal value of hyperparameters can be trained. e interpreted candidates are divided into 5 folds, out of which four folds are for training and one for testing. e results of five repeated classifications are shown in Table 2. For PSV cell-id trajectory detection, the precision is about 90% while the recall is from 88.60% to 92.32%. For non-PSV mode detection, the precision is around 88%, and the recall is from 88.37% to 93.54%.

Trajectory Refinement by Heuristic Optimization.
Once the trajectories with a PSV mode were identified, our proposed heuristic optimization method is used to determine the precise location of anchor points on a bus route at a specific timestamp by concerning its corresponding BTS sector and other contextual information. To evaluate the performance of our heuristic globe optimization approaches, the ground truth datasets were deployed. As the result of trajectory matching shown in Figure 7, 81 out of 136 BTS along bus route no. 307 were detected that they have counterparts anchor points existing on the route. Five optimization algorithms including particle swarm optimization (PSO), augmented Lagrangian (AL), compass search (CS), and artificial bee colony (ABC) were tested to solve the optimization problem of spatiotemporal modelling of a cell-id trajectory. For our PSV's location optimization, it is actually a continuous, constrained, and single-objective problem. at is, an anchor point of a BTS needs to be found in optimization space which is the common segment part of the BTS sector and the bus route.
A python library PyGMO was deployed for implementing this iterative process, and all results were evaluated by the indicators of MAE (mean absolute errors) and St. Dev (standard deviation) by referring to ground truth GNSS tracking points. As shown in Figure 8, the compass search algorithm achieved the best performance in comparison of others. e estimated locations of anchor points at their BTS-communication time have the smallest errors. Based on the optimized cell-id trajectories, we further make a spatiotemporal interpolation among anchor points using method in [13] and reconstruct a high-quality PSV trajectory.  Journal of Advanced Transportation Figure 9 presents the iterative process of a trip of PSV cell-id trajectory data using compass search algorithm. A monotonic basin hopping (MBH) meta-algorithm is applied.
To avoid local optimization and accelerate convergence velocity of iteration, a monotonic basin hopping meta-algorithm is applied in the abovementioned optimization processes. e optimization space threshold λ roughly set to 500 m to avoid an explicitly unreasonable location was tested in the iteration. As shown in Figure 10, the objective decreases quickly at the beginning and finally converges at minimization after 120,000 iterations. Generally, an acceptable result could be reached after 8,000 iterations.

Comparison with Other Spatiotemporal Interpolations.
To evaluate the advantages of our proposed method, we selected three most popular trajectory reconstruction methods, nearest sampling (nearest), linear interpolation (L), and cubic hermit interpolation (CH), [13] to make a comparison.
ese popular interpolations have two models, constrained (C) and unconstrained (U), depending on whether ancillary transportation line datasets are used or not. Calculating the difference between the estimated location and the GNSS measured location of a tracking point at a specific timestamp, the performances of the abovementioned methods are shown in Figure 10, among which our method (compass search algorithm) has the smallest error on its refined cell-id trajectory.
e box of CS errors is compact with a small median of 166 m. Among traditional interpolation methods, constrained linear interpolation seems introducing a relatively good result with a median of 207 m, but the range of errors widely varies. Moreover, CS is also robust and the better performance is due to the fact that we accounted for the unique transportation model of PSV and the constraints of public transport line.

Shift in LCSS Matching.
Assuming a bus route crosses a BTS sector that is a wireless-signal coverage area (Figure 11), it leads to a piece of road intersection on which a mobile holder (a bus driver) moves when the phone communicates with the corresponding BTS at a particular time.
e end-points of the interaction may be called, entry point and exit point. All communication events with this BTS must happen on the interactions with big probabilities. e revised LCSS for trajectory matching is a forward matching process. is process is able to find the longest common subsequence, but cannot guarantee an appropriate anchor point that is expected to be the identical location of the bus when moving on roads at a specific time. Among the discretized points on the intersection part, the closer a point is to the entry, the higher the probabilities of the point are to be matched.
is is because the effect of mismatch and process variation results in shifting in the process of LCSS.
Our heuristic optimization with compass search algorithm further adjusts the initial positions of cell-id trajectory points within the intersection part and estimates appropriate positions to match a PSV-mode transportation along the bus route. In comparison of four GNSS trajectories, the average distance between initial points and estimated points is about 197 m with a deviation of 57 m.

How a Time Interval Affects the Spatiotemporal Modeling in Trajectory Refinement?
From the comparison of heuristic optimization and traditional interpolations in the above section, our proposed method did improve the accuracy of trajectory refinement. Moreover, the inherent quality of cellid trajectories, particularly the time intervals of tracking points on a cell-id trajectory, also has big effects on the refinement.
Based on the ground truth datasets collected in four round-trips at bus route 307, the cellular signaling data originally logged with one second interval were resampled into a number of trajectories with intervals ranging from 10 s to 600 s. e BTS (cell-id) number in these resampled   Journal of Advanced Transportation cell-id trajectories varies from six hundred to two dozen, as shown in Figure 12(a). e errors of estimated trajectory to the GNSS trajectory increase from thirty meters to six hundred meters, as shown in Figure 12(b). e interval of 5 minutes that is close to the average interval of our big cell-id trajectory collection generates an error of 180 m. Once the interval is greater than 10 minutes, our proposed method will not support acceptable trajectory refinement any more.

Conclusion
is work proposed a novel approach to reconstruct the spatiotemporal location of vehicles between sparse updates in a cellid trajectory. First, an LCSS-based machine learning method was proposed to detect PSV-mode cell-id trajectories from the huge volume of cellular network signaling datasets. en, a heuristic global optimization method was deployed to estimate the precise locations along bus routes of these detected cell-id trajectories. Evaluated with ground truth datasets, our proposed method has achieved very good performance in both accuracy and robustness in comparison with traditional interpolations.
Our heuristic global optimization with compass search algorithm overcomes the issue of location shifting in LCSS when matching a cell-id trajectory and a bus route. is leads to a set of high-quality anchor points, that is, a spatial position at a road network that is originally corresponding to a cell-id tracking point at a particular time is estimated in the common intersection part of the BTS sector and the road network. e experiment indicates that, by taking advantage of the nature of PSV-mode cell-id trajectories, our approach works well on cellular network signaling datasets with fiveminute sampling interval, but the performance decreases sharply after a ten-minute sampling.
Data Availability e 4G-LTE mobile phone data used to support the findings of this study were supplied by China Mobile Communications Group Co., Ltd., under license and so cannot be made freely available. Requests for access to these data should be made to Kemin, Xianfeng (xfsong@ucas.ac.cn).

Conflicts of Interest
e authors declare that they have no conflicts of interest. Journal of Advanced Transportation 11