Practical Constraint K-Segment Principal Curve Algorithms for Generating Railway GPS Digital Map

In order to obtain a decent trade-off between the low-cost, low-accuracy Global Positioning System (GPS) receivers and the requirements of high-precision digital maps for modern railways, using the concept of constraint K-segment principal curves (CKPCS) and the expert knowledge on railways, we propose three practical CKPCS generation algorithms with reduced computational complexity, and thereafter more suitable for engineering applications. The three algorithms are named ALLopt, MPMopt, and DCopt, in which ALLopt exploits global optimization and MPMopt and DCopt apply local optimization with different initial solutions. We compare the three practical algorithms according to their performance on average projection error, stability, and the fitness for simple and complex simulated trajectories with noise data. It is found that ALLopt only works well for simple curves and small data sets. The other two algorithms can work better for complex curves and large data sets. Moreover, MPMopt runs faster than DCopt, but DCopt can work better for some curves with cross points. The three algorithms are also applied in generating GPS digital maps for two railway GPS data sets measured in Qinghai-Tibet Railway (QTR). Similar results like the ones in synthetic data are obtained. Because the trajectory of a railway is relatively simple and straight, we conclude that MPMopt works best according to the comprehensive considerations on the speed of computation and the quality of generated CKPCS. MPMopt can be used to obtain some key points to represent a large amount of GPS data. Hence, it can greatly reduce the data storage requirements and increase the positioning speed for real-time digital map applications.


Introduction
1.1.Principal Curves.Spearman proposed principal component analysis (PCA) in 1904, and now PCA is one of the most important tools for statistical data analysis [1].But PCA is one linear analysis method which cannot deal with some intrinsic nonlinear data sets.Then Hastie proposed the concept of principal curves and its generation algorithm in 1983 [2].As the nonlinear extension of PCA, principal curve uses a smooth curve instead of a straight line to summarize the given data set of the symmetric variables.Recently, the principal curves have been widely used in different real engineering problems such as control of the electron beam trajectory in Linear Collider [3], sound data model reduction, and data visualization in speech processing [4][5][6] and chemical process monitoring [7,8].
In 1992, Banfield and Raftery proposed BR principal curves, which resolved the problem of original principal curves that the estimation error has significant effects on the actual principal curves due to the much large curvature in closed principal curves [9].But BR principal curves algorithm also brings the numerical instability, which may obtain smooth but false principal curves in practice.In 1994 and 1995, Duchamp and Stuezle studied the entire differential geometric properties of principal curves [10,11] and they found that the principal curves are not unique.
In 1997, Kégl proposed the concept of length constraint principal curves [12][13][14]and proved the existence and uniqueness of the K-segment principal curves (KPCS) defined in some theoretical distributions.He also studied the learning nature of KPCS about data distribution and proved the convergence of KPCS.Then, he proposed the related polygonal line algorithm to obtain KPCS.This is the first article giving proofs of existence of principal curves, but the length of the curve must be prefixed.In 2000, Verbeek et al. proposed soft-K-segment principal curve algorithm [15] which applied local PCA to develop K-segments and then connected them smoothly to form principal curves.However, this method cannot be extended to multidimensional application and it did not consider the curvature penalty in the objective function.In 2011, Ozertem and Erdogmus proposed a new way of generating principal curve by probability density estimation [16].For the data set with non-Gaussian process [17], generating principal curves is more difficult than that with Gaussian process.As for the parameter selection, Biau addressed the parameter selection issue using the empirical risk minimization perspective and model selection via penalization [18].Gerber and Whitaker made a meaningful exploration in regularization-free principal curve estimation by a gradient-descent-based method for controlling the model complexity [19].Zhang et al. applied granular computation on principal curves and better results were obtained [20].These studies greatly enrich the theoretical research and applications of principal curves.

Constraint K-Segment Principal Curves (CKPCS) and GPS Railway Digital Maps.
Motivated by some applications about transportation data with fixed endpoints, we have proposed the concept of constraint K-segment principal curves (CKPCS) and its generation algorithm in 2008 [21].Unfortunately, the proposed algorithm is complicated and computing intensive; thus, its practicality is not very good.To decrease the computational complexity, we later proposed a heuristic optimization algorithm [22] for fusing multiple GPS track data with fixed endpoints, trying to make the CKPCS algorithm more practical and easier to be used.However, it only works for some simple curves.
GPS technology has been intensively used to guide aircraft, ships, vehicles, and trains to accurately reach destinations on time along the selected routes.Nowadays, accurate GPS digital maps play a very important role in railway control systems especially in precise positioning of trains [23].The application of GPS digital maps in train positioning can effectively reduce positioning errors and improve positioning reliability [24].To generate high quality GPS digital maps, accurate GPS data are to be collected by GPS receivers.However, the accuracy of GPS receivers may be deteriorated by many factors including track error, satellite clock difference, ionospheric error, troposphere error, carrier phase cycle slips, and multipath effects.Besides, the extremely complex electromagnetic environment in railway system also negatively affected the accuracy of GPS receivers.Consequently, it is difficult for low-cost GPS receivers to collect high-accuracy GPS data for railway digital mapping.Differential GPS system can be employed to obtain high-accuracy GPS data, but the cost for building such kind of digital mapping system is much higher compared with using normal GPS receivers.
Considering that there are some important static points like switches and signal machines in railways, long-time static measurement can be implemented to get their accurate positions.Or in other words, some endpoints with highaccuracy can be obtained.As a result, we conclude that the CKPCS with fixed endpoints is very suitable for digital mapping of railways.Using CKPCS algorithms to generate high-accuracy data with multiple GPS data collected by lowcost GPS receivers will help us to obtain a decent balance between low-cost devices and high-accuracy digital maps.
In this paper, based on the discussions above, we further our study on CKPCS by using some expert knowledge about railways.We will develop several heuristic CKPCS generation algorithms which are more computational efficient and easier to be used.By using simulation data and the real GPS data collected in Qinghai-Tibet railway, the new algorithms will be validated according to data accuracy, computing time, stability, and application scope.The remainder of this paper is organized as follows.The model of CKPCS based on the principle of nearest neighbor is presented in Section 2. The three practical CKPCS algorithms are developed in Section 3.After analyzing errors and fitness of proposed algorithms by simulation data and validating their practical applicability by railway GPS data sets, the advantages and disadvantages of different proposed algorithms are compared in Section 4. We conclude this paper with some discussion on future work in Section 5.

Mathematical Model for CKPCS
K-segment principal curves (KPCS) are generally the curves containing a series of connected segments that pass through the "middle" of collected data.KPCS are usually obtained by adding points gradually from the initial PCA solution and minimizing the distance from points to segments and vertices according to the principle of nearest neighbor.However, as constrained K-segment principal curves (CKPCS) have two fixed endpoints, the segment connecting the two endpoints composites the initial solution, and the two endpoints are kept unchanged during the optimization process of minimizing the distance from points to line segments and vertices.As illustrated in Figure 1, CKPCS consist of vertices   ( = 0, 1, 2, 3, . . ., ) and segments  ,+1 ( = 0, 1, 2, 3, . . .,  − 1), and the positions of endpoints  0 ,   are fixed.According to the principle of nearest neighbor, all data points   ( = 1, 2, 3, . . ., ) are partitioned to their corresponding nearest segments and vertices.Here,  is the number of vertices of constraints principal curves and  is the number of GPS data points.
The distance from a data point   to one segment  ,+1 is defined as Here ( The smallest one of the distances  , ( = 0, 1, 2, 3, . . ., ) and  , ( = 0, 1, 2, 3, . . .,  − 1) is defined as projection distance of a data point to the K-segment principal curve.The average projection distance of all data points AD is defined as follows and will be minimized during the CKPCS generation algorithm: AD = ∑  =1 min{ , ( = 0, 1, 2, 3, . . ., ) ,  , ( = 0, 1, 2, 3, . . ., )}  . ( To the problem of minimizing AD, both the number and positions of vertices are to be optimized.Because we do not know the number of vertices in advance, getting the optimal solution in one step is not easy and some iterative algorithms become a great need.Moreover, the practical engineering applications prohibit complex algorithms with long running time.

Three Iterative CKPCS Algorithms
According to the model above, this section proposes three iterative algorithms to generate CKPCS from noisy observed data.For all the three algorithms, two endpoints are supposed as the constrained points and kept unchanged in the iteration.But different algorithms imply different ways in finding the initial solution and adding other vertices according to our domain knowledge.Many successful applications have showed that good initial solutions and high computational efficiency could be achieved with the help of expert knowledge [25][26][27].For all the three algorithms, when the vertex number is fixed, adjusting several or all the positions of vertices to minimize the objective function AD is a classical nonlinear optimization problem, which is solved in this paper by the commercial nonlinear optimization routine "fmincon" provided by Matlab.The "fmincon" uses traditional SQP optimization [28] for small-scale problems and trust-region reflective method for large-scale problems [29].
3.1.Splitting Optimizing Algorithm: ALLopt.In this algorithm, CKPCS is obtained by two major parts: dividing some segment and optimizing all vertices as shown in Figure 2. The main steps of this algorithm are listed as follows.
Step 1. Connect two endpoints and take the segment as an initial solution for the CKPCS.
Step 2. If AD is larger than the presetting error, go to the next step; otherwise, current curve is the obtained CKPCS and the algorithm is terminated.
Step 3. Add one new vertex which is the middle point of the line segment where the maximal AD is found among all line segments to generate a new solution.Then, the positions of all vertices are adjusted using a nonlinear optimization method to minimize the AD.Step 4. If AD is bigger than the setting error, go to Step 3; otherwise, the CKPCS is obtained.As this splitting optimization algorithm needs to optimize all vertices after adding a new one, we call it ALLopt for short.

Improved Splitting Optimizing Algorithm: MPMopt.
Because ALLopt optimizes the positions of all vertices in each iteration, it needs long running time if the volume of data is large.To solve this problem, an improved algorithm is proposed by applying a better initial solution and adjusting at most three vertices once a time to accelerate the optimization process.As illustrated in Figure 3, this algorithm includes two loops: one loop for generating a candidate principal curve, and the other loop for finding the final CKPCS.
Loop A. Generate initial principal curve.
Step 1. Connect two endpoints, take the line as initial solution.
Step 2. Calculate the AD.If AD is bigger than the setting error, go to step three; otherwise, the initial solution is the CKPCS obtained.
Step 3. Add one vertex   in a heuristic way and generate a new solution.Then, calculate the AD; if the AD is bigger than  (usually between 2 and 4) times of the setting error, repeat this step until the AD is less than  times of the setting error.
The heuristic way of adding a new vertex is as follows: the area where its AD is maximum among all line segments is selected.Then, in this area, the point with the maximal projection distance among all points is selected as the new vertex.We call this method Max Point Method (MPM).
Loop B. Find the final CKPCS.
Step 4. Based on the initial CKPCS obtained in Loop A, the method MPM introduced above is applied to find the vertex    .
Step 5. Locally optimize the positions of vertex    and the two adjacent vertices   and  +1 , making the AD minimum, but do not change the positions of two endpoints.
Step 6.If AD is larger than the setting error, repeat Steps 4 and 5; otherwise, the final CKPCS is obtained.This algorithm is called an MPMopt for short as its optimization is based on MPM method.

Dividing and Connecting Optimizing Algorithm: DCopt.
Although MPMopt can run faster than the first algorithm ALLopt, we find that it also has some disadvantages in handling the complicated curves with cross points.Therefore, we propose the third algorithm which contains three parts: dividing, connecting, and optimizing as shown in Figure 4.For short, the third algorithm is called DCopt.
Part 1. Dividing:divide all data points to the adjacent region of all vertices including two endpoints.
Step 1.According to the nearest neighbor principle, divide all data points into different vertices with adjacent regions.
Step 2. Add a new vertex at the each junction of two adjacent regions; then divide all data points again.
Step 3. Repeat Step 2 to keep dividing the adjacent ranges, until the maximal radius of all adjacent regions is less than the setting radius, which is  times of the setting error.Two rules are followed in connecting to make the CKPCS smoothly and continuously as follows.
Rule One. the slope change of adjacent line segments should be as small as possible.
Rule Two.connect the nearest points.
In connecting part, the starting point follows rule two; while other points follow rule one.
Part Three.Optimizing.
If AD is bigger than the setting error, add one new vertex using MPM methods that is mentioned above and optimize it and its adjacent vertices except the two endpoints, until AD is less than the setting error.

Verification, Analysis, and Comparison of Proposed Algorithms
In this section, some simulation data are used to analyze the error, stability, and fitness of the three proposed algorithms first.Then, their practicability is verified and analyzed using the measured GPS data on Qinghai-Tibet Railway (QTR).QTR is the highest railway in the world and also the first railway in China to adopt GPS technique for train positioning [30,31].
We have applied some different thresholds for MPMopt and DCopt in our experiments.By trial-and-error, we find out that using  (ranging from 2 to 4) in MPMopt and setting the radius of each circle as the  (ranging from 3 to 7) times of the setting error can work very well in all experiments.We set  as 3 and  as 5 in this paper to keep the balance of accuracy and running time.Small  or small  means long time in finding initial solution but less time in iterative optimization.For the setting error , which is often set manually, the less of its value means higher requirement in accuracy and longer computing time in optimization.

Analysis of Algorithms' Error by Simple Synthetic Data.
There are three steps to finish the analytic process for synthetic data illustrated in Figure 5. First, a reference curve is generated and two fixed endpoints are recorded; secondly, some data points near the curve are randomly generated to simulate the real measured data points with noise; finally, the new CKPCS are obtained by the proposed algorithms using the generated data points and the two endpoints.The obtained CKPCS are compared with the standard curve to compute performance indexes.
In Figure 5, the red curve is the generated reference curve (Grc)  = sin() with two fixed endpoints (0, 0) and (, 0).The blue data points represent the generated random data points simulating the measured data points from the reference curve.The yellow curve, the black curve and the green curve are the three CKPCS generated by ALLopt, MPMopt, and DCopt, respectively.To test the performance of proposed algorithms, the simulation repeats 30 times for each algorithm.For the reference curve, the data points are regenerated each time and with the random error between 0 and 0.06.And we find that the three algorithm can generate CKPCS well and make the AD less than 0.03.Therefore, for this simple curve, the fitness and stability are good for all the three algorithms.

Analysis of the Applicability of Three Algorithms by
Complicate Synthetic Data.As  = sin() is a simple and smooth curve, we do not know whether the three algorithms can work well for more complicated curves.To validate the applicability of three algorithms for the complicated curves, different shapes of reference curves are needed to be tested.
Here, some more complex curves are generated using the above introduced method and the applicability of the three algorithms are analyzed.The results are illustrated in Figures 6, 7, and 8.We can see that ALLopt has a good applicability on simple curves, while it fails to track the complex curves with spiral trajectory or with cross points.MPMopt has a good applicability on simple curves and some complex ones, but it has some problems on the curve with cross points.DCopt has a good applicability on all the study cases including the curves with spiral trajectory and cross points.

Application of the Proposed Algorithms in Railway Digital
Mapping.As previous studies focus on the synthetic curves and data, we need some real GPS data to test which algorithm can solve the real-world problems best.In this subsection, we will use some real GPS data collected on Qinghai-Tibet Railway to test the performance of the three algorithms in generating CKPCS for railway digital mapping.
The railway GPS data has the characteristics of simple trajectory, long length, and large volume.So, we test the practicability of the proposed algorithms on the railway GPS data by analyzing their approximation error, running time, storage space (vertex number) and other performance indexes.Figures 9 and 10.The measurement equipment is Thales max with the accuracy of about 6 meters, and the AD for one-time measurement is about 3 m.We collected data three times, hoping to decrease the AD to 1 m.

Data Collection and
T-04 is about 2 kilometers long and its two endpoints are two signal machines.Because of the importance of the two signal machines positions, we took about 30 minutes to get the accurate coordinates for each signal machine by averaging the measuring coordinates.Then, the dynamic measurement on trajectory was done three times and 1093 data points were obtained.With the combination of longtime static measurement for two endpoints and short-time dynamic measurements along the railway track, we achieved a decent trade-off between the efficiency and accuracy of the measurements.The T-10 is about 40 kilometers and its two endpoints are two switch turning machines.Two endpoints and 15213 data points for T-10 were obtained using the similar measurement methods for T-04.

Results and Analysis.
The setting error for the proposed algorithms is 1 m for T-04 and T-10, which is accurate enough for train positioning.The algorithms will stop if AD is less than 1 m or the decrease of AD is relatively slow; that is, one algorithm will stop if adding a vertex cannot decrease AD by 1 percent.The CKPCS are obtained from railway GPS data sets collected in QTR using the three proposed algorithms, respectively.For T-04, the obtained CKPCS by the three algorithms are shown in Figures 11,12, and 13, respectively.Table 1 lists three numerical performance indexes for the three algorithms, where the error means the AD, time represents the running time of the algorithm, and number is the number of vertices in the final CKPCS.We can find that all the three algorithms can generate CKPCS very well but  the ALLopt takes a longer time than DCopt and MPMopt.Moreover, MPMopt algorithm uses the shortest time in generating CKPCS and obtains the minimal AD.As the GPS data set collected in T-04 is a relative small one, the difference among the three algorithms is not distinctive.Next, we will explore the difference of the three algorithms on the large data set collected on T-10.
For T-10 data, the simulation results and the performance indexes are provided in Figures 14, 15, and 16 and Table 2.We can see that the MPMopt algorithm only takes 103 seconds to generate the CKPCS and its accuracy is less than 1 m.But the DCopt takes 82 min with a bigger error of 10 m and ALLopt takes 24 hours with the biggest error of 27 m.Or in other words, these two algorithms become invalid on the large data set.MPMopt not only saves time but also greatly reduces storage space for digital maps as it just uses 68 points to represent the 15,213 GPS data points.This example also shows the importance and efficiency of good initial solution and local optimization for generating CKPCS from a huge amount of data points.

Conclusions and Future Work
In this paper, the origination, history, and defects of principal curves are revisited first.For some special problems with fixed two endpoints, CKPCS with complicated computational process was proposed before.Aiming to make the complicated CKPCS algorithm more applicable to real engineering applications especially the railway digital mapping, three CKPCS generation algorithms, named ALLopt, MPMopt, and DCopt, are proposed and their implementations are explained in detail.All the three algorithms need to keep the two endpoints fixed during the iterative optimization process.There are also some significant differences among them.ALLopt use the idea of global optimization, which optimizes the location of all vertices to minimize the average projection distance.However, MPMopt and DCopt employ the idea of the partial optimization, which only heuristically optimize partial vertices to speed up the optimization process.The difference between MPMopt and DCopt lies in the generation of initial solutions, where MPMopt always find the point with the maximal distance while DCopt use the idea of dividing and connecting.In comparison analysis, the average error (AD), the running time, and the number of vertices in the final CKPCS are used to validate the proposed algorithms' performance.
First, verification and analysis of the proposed algorithms are performed on some synthetic data.We found that ALLopt works well only on simple curves and small data sets and takes the longest running time among all the three proposed algorithms.MPMopt can work well both on simple curves and complicated curves, but its performance will get worse on the curves with cross points.DCopt can work well on all curves, but its running time is longer than MPMopt.
Then, two measured GPS data sets collected in Qinghai-Tibet railway are used to test the three algorithms to see which can work for real engineering problems.The results show that the MPMopt algorithm has the best comprehensive performance according to the accuracy, the computational efficiency, and the data reduction capability.As the real track of railway is not very complicated, it is recommended to use MPMopt in CKPCS generation for railway digital map.Of course, ALLopt can be used for small data set of simple curve and DCopt is preferable for very complicate curves with cross points.By using these proposed algorithms, we can produce the CKPCS easily and quickly so as to give a simpler description for complex curve or huge amount of GPS data points; thus, storage space is greatly reduced which can make GPS digital maps work better for real-time GPS positioning.
Although the proposed algorithms obtained good results in synthetic data and railway GPS data, there is further work that can be done, for example, considering more performance indexes like the maximal projection error, the smoothness of the CKPCS, and testing the proposed algorithms with more real engineering data with different shapes.Our methods are not used for generating all kinds of PCs, but used for generating PC with fixed endpoints, which is defined as CKPCS.For CKPCS, the existing PC algorithms with solid theoretical foundation cannot be applied directly.We must admit that we propose our methods with focus more on simple implementation and fast applications.This makes our methods look rather ad-hoc, but suitable for real engineering applications.The experiments on real data collected in Qinghai-Tibet railway also demonstrate that the proposed approaches work very well in real applications.Anyway, the research on practical algorithms with more solid mathematical foundations for CKPCS should not be discouraged and we think it is a good direction of future work.

Part 2 .
Connecting: according to the shape of trajectory, connect the generated vertices orderly and obtain the initial solution.all areas < m * E?

Figure 5 :
Figure 5: Error analysis for three algorithms by simulated sin data.
,    and    are  and  coordinates of data point   . V  and  V  are  and  coordinates of vertex   .And the segment  ,+1 connects   and  +1 .The distance from a data point   to the vertex   is defined as

Table 1 :
Comparisons among the three algorithms on T-04 railway.

Table 2 :
Comparison among the three algorithms on T-10 railway.