Study of Railway Track Irregularity Standard Deviation Time Series Based on Data Mining and Linear Model

Good track geometry ensures the safe operation of railway passenger and freight services. Railway transportation plays an important role in Chinese economic and social development. This paper studies track irregularity standard deviation time series data and uses clustering analysis to examine the characteristics and trend changes of track state. A linear recursive model and a linear-ARMA model based on wavelet decomposition reconstruction are proposed; both support the safe management of railway transportation.


Introduction
Temporal data and temporal data mining, which reflect the dynamic nature of data, have been a focus of the academic community in recent years. Time series are an important kind of temporal data. Time series similarity has been widely used in speech recognition. Euclidean distance and dynamic time warping (DTW) are two classic similarity measures. Euclidean distance is the most frequently used measure for time series, but it is a very brittle distance measure [1]. Its obvious defect is that two sequences can be very similar in shape and still be far apart in Euclidean distance. DTW was proposed to address this. DTW is an algorithm for measuring the similarity between two sequences which may vary in time or speed. A well-known application is automatic speech recognition [2][3][4], where it copes with different speaking speeds. Since Agrawal et al. first proposed a whole-sequence matching algorithm for time series similarity search in 1993 [5], more and more scholars have focused on temporal data mining.
There are a variety of methods and models in the forecast area, such as the deterministic function method, statistical regression analysis, time series analysis, the Markov model [12], the state-space model [13], and the Bayesian forecasting model [14], as well as hybrid methods that combine several theories and techniques, such as the hybrid of fuzzy theory and linear regression [15], the hybrid ARIMA and support vector machines model [16], time series forecasting that combines neural networks, fuzzy logic, and fractal theory [17], and the hybrid Markov model and neural networks [18]. By the number of variables studied, forecasting is also divided into univariate and multivariate. In time series analysis and forecasting, the most commonly used methods are based on the time domain and the frequency domain. Linear (including stochastic linear) models and nonlinear models are the two main types of models. A random process cannot be expressed by a definite function. Typically, there are two types of methods to analyze a random process: one is the probabilistic method, and the other is the analytic method; the two are often used together in practical studies. The AR model, MA model, ARMA model, Markov forecast [19], and Kalman filter model [20,21] can each be used to study a stochastic process. The first three are linear forecast models and are relatively simple in terms of the elements taken into consideration.
In the last 20 years, the fuzzy system has been one of the most frequently used methods [22][23][24][25]. Fuzzy systems have been used with a variety of optimization algorithms, and many studies have proposed designs for fuzzy system forecasting.
In the study of time series forecasting, forecast accuracy is the top priority when selecting a forecast method. The neural network is the most representative time series forecasting method and has drawn more and more attention. Over the past ten years, neural network models have been used widely in the study of time series forecasting, and the neural network is considered a competitor to a variety of traditional time series models [26][27][28]. Because of its flexible computing framework and general approximation ability, the artificial neural network [29] is widely used in predictive analytics and can reach high precision.
However, improving the accuracy of time series forecasts remains an important and difficult task for decision makers in many areas. Combining several models into a hybrid model has become a common practice to improve forecast accuracy, and this field of study has expanded significantly [30][31][32][33][34].
Studies have shown that forecasts from certain hybrid models are often better than those obtained from a single model, so a hybrid forecast is considered more accurate. The main problem can be described as follows: suppose there are $m$ kinds of forecasts, $\hat{y}_1(t), \hat{y}_2(t), \ldots, \hat{y}_m(t)$. The general form of the hybrid forecasting model is a weighted sum of the $\hat{y}_i(t)$, that is, $\hat{y}(t) = \sum_{i=1}^{m} w_i \hat{y}_i(t)$, where the weights sum to 1, $\sum_{i=1}^{m} w_i = 1$. The biggest difficulty is to determine the weight of every single forecast.
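As a minimal sketch of this weighted combination, the following Python fragment forms a hybrid forecast from single-model forecasts. The function name `combine_forecasts`, the two sample forecast series, and the weights are illustrative assumptions and are not taken from the paper.

```python
def combine_forecasts(forecasts, weights):
    """Weighted sum of several forecast series; the weights must sum to 1."""
    if len(forecasts) != len(weights):
        raise ValueError("one weight per forecast is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    horizon = len(forecasts[0])
    return [
        sum(w * f[t] for w, f in zip(weights, forecasts))
        for t in range(horizon)
    ]


# Hypothetical forecasts from two single models over three time steps.
model_a = [1.0, 2.0, 3.0]
model_b = [3.0, 2.0, 1.0]
hybrid = combine_forecasts([model_a, model_b], [0.25, 0.75])
```

The sketch only evaluates the combination; choosing the weights, which the text names as the main difficulty, is left open.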
According to the principles recognized by the scientific community, simple theories are more reliable than complex ones under the same circumstances. The best scientific theory should be the simplest. Railway transportation is different from automobile traffic. In automobile traffic, the driver is the major safety factor [35,36], but in railway transportation, the track irregularity state is the decisive factor. This paper looks for a simple, reliable model to analyze the trend of track irregularity change and to explore knowledge.
The paper is organized as follows. Section 2 introduces forecast models of time series. Data mining of track irregularity time series is described in Section 3. Section 4 presents the analysis and forecast of track irregularity time series based on the linear recursive forecast model. Section 5 proposes the linear-ARMA model based on wavelet decomposition reconstruction. Finally, Section 6 concludes the paper with a summary.

Methods and Processes of Time Series Data Study
Quantitative forecast attends to the quantitative description of the degree of development and changes, is based more on historical statistics, and is less affected by subjective factors. Qualitative forecast and quantitative forecast are not mutually exclusive but complementary, and they should be combined correctly in the actual forecast process. The two types of time series forecasting methods and the methods they contain are shown in Figure 1.

Study Processes of Time Series Data.
The study process for time series data includes three steps: data acquisition, correlation analysis, and model identification. The specific process is shown in Figure 2.

Data Mining of Track Irregularity Time Series
3.1. Theoretical Analysis of Data Mining. Data mining is the process of mining information of interest from databases, data warehouses, or other data repositories. It is an iterative process, whose steps are shown in Figure 3.
The main methods, purposes, and contents of time series data mining include: (1) time series segmentation, which studies changes in the underlying mechanism of a time series and its high-level representation; (2) similarity search, which looks for similar sequences; (3) clustering analysis, which covers similarity measures, clustering algorithms, and results, and gathers time sequences with similar variation into one class; (4) classification and sequence analysis of whole time series and of the time points within them; (5) anomaly detection, which finds abnormal sequences, points, and modes; (6) analysis of the laws and trends of time series over time, and forecast of future data and trends based on historical and current data; (7) visualization, which uses graphics technology, virtual reality technology, and data mining technology to display complex time series in an easy, intuitive, graphical way, so that time series data can be studied visually and practically.

Time Series Similarity.
Time series similarity is the basis of time series analysis and time series data mining. The similarity is obtained by calculating the distance between time series. The definition of distance needs to meet the following four properties: (1) $d(x_i, x_j) \ge 0$; (2) $d(x_i, x_j) = 0$ if and only if $x_i = x_j$; (3) $d(x_i, x_j) = d(x_j, x_i)$; (4) $d(x_i, x_j) \le d(x_i, x_k) + d(x_k, x_j)$. Here $d(x_i, x_j)$ is the distance between $x_i$ and $x_j$ in a group of objects $X = \{x_1, x_2, \ldots, x_n\}$ in $p$-dimensional space.
The Minkowski distance is the common formula for calculating distance, and its expression is $d(x_i, x_j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \right)^{1/q}$. Many distances are obtained by changing the parameter $q$ of the Minkowski distance, for example the Manhattan distance ($q = 1$), the Euclidean distance ($q = 2$), and the Chebyshev distance ($q \to \infty$). Commonly used distance formulas include the Manhattan distance, Euclidean distance, Chebyshev distance, custom distance, Mahalanobis distance, Minkowski distance, variance weighted distance, Canberra distance, dynamic time warping distance, cross-correlation distance, KL distance, and so forth.
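To make the difference between a pointwise distance and dynamic time warping concrete, the following sketch computes both for two short sequences with the same shape, one shifted by a single step. The function names and sample sequences are illustrative assumptions; the DTW here is the textbook dynamic-programming form with an absolute-difference local cost and no window constraint.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def chebyshev(x, y):
    """Limit of the Minkowski distance as q grows: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))


def dtw(x, y):
    """Dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = best cumulative cost aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0]  # same shape as a, shifted by one step

manhattan_ab = minkowski(a, b, 1)
euclidean_ab = minkowski(a, b, 2)
dtw_ab = dtw(a, b)
```

For these two sequences the pointwise distances penalize every shifted sample, while DTW absorbs the shift by warping the time axis and is left only with the mismatch at the final point.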

Clustering Analysis of Track Irregularity.
The standard deviation of a set of data reflects its degree of variation. When two sets of data have the same units and similar means, the greater the standard deviation is, the greater the degree of variation between the data will be; the data are distributed more discretely around the mean, and the mean is less representative. Conversely, the smaller the standard deviation is, the smaller the variation between the data, the more concentrated the distribution around the mean, and the more representative the mean. Therefore, the standard deviation of track irregularity data is used to study how the dispersion of track irregularity data changes over time and to evaluate the development trend of track irregularity. In this study, track irregularity data is provided by State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University.
We take the K449-K450 km section of the Beijing-Kowloon line, with data from 44 cross-level inspections, as study data. The section is divided into 40 units, and the standard deviation of each unit is calculated; this gives 40 cross-level standard deviation time series. These standard deviation data are the object of cluster analysis. In this paper, the K-means algorithm is used as the clustering algorithm, the cross-correlation distance is used as the distance between objects in the matrix, and the minimum variance algorithm is used for the connection between the variables. Clustering results are shown in Figures 4 and 5.
Track irregularity standard deviation time series with similar changing trends and characteristics are clustered into one category.
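The data preparation described above can be sketched as follows: each inspection of the section is split into units, the standard deviation of every unit is taken, and the per-unit series are compared by a correlation-based distance. This is only an illustration under stated assumptions: the paper does not give its exact cross-correlation distance, so the lag-zero form `1 - r` (with `r` the Pearson correlation) is assumed here, and the function names and sample numbers are hypothetical. The clustering step itself is not reproduced.

```python
import math
import statistics


def unit_std_series(inspections, n_units):
    """Return one standard deviation time series per unit.

    inspections: list of inspections, each a list of samples along the section.
    The section is cut into n_units equal-length units (assumption).
    """
    size = len(inspections[0]) // n_units
    series = [[] for _ in range(n_units)]
    for samples in inspections:
        for u in range(n_units):
            chunk = samples[u * size:(u + 1) * size]
            series[u].append(statistics.pstdev(chunk))
    return series


def correlation_distance(u, v):
    """Assumed distance: 1 minus the lag-zero Pearson correlation.

    Undefined (division by zero) when either series is constant.
    """
    mu, mv = statistics.mean(u), statistics.mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return 1.0 - num / den


# Hypothetical data: 3 inspections, 4 samples each, split into 2 units.
inspections = [
    [0.0, 2.0, 1.0, 1.0],
    [0.0, 4.0, 1.0, 2.0],
    [0.0, 6.0, 1.0, 3.0],
]
series = unit_std_series(inspections, n_units=2)
distance = correlation_distance(series[0], series[1])
```

Both hypothetical units deteriorate with the same trend, so their distance is close to zero and a clustering algorithm working on this distance would place them in the same category.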

Linear Recursive Model
4.1. Core Ideas. Because of the inertia of track state changes, the track state has a memory effect. The latest track state is similar to the nearest previous state, and the inspection states at adjacent time points follow a similar trend. From the macro perspective, the track state changes nonlinearly over the whole life cycle of the track [37,38], but from the micro perspective, over a short time, track state changes at adjacent times are close to linear.
Based on the above assumptions, this paper proposes the following linear model of track state changes: $y_{i,j} = a_i t_{i,j} + b_i$. In the formula, $y_{i,j}$ is the track state on the $j$th day between the $i$th and the $(i+1)$th inspection; $a_i$ is the slope of the linear track state change between the $i$th and the $(i+1)$th inspection; $b_i$ is the intercept; $t_{i,j}$ is the $j$th day between the $i$th and the $(i+1)$th inspection.
According to the least squares method, the model is written in vector form, and the model parameters are estimated so that the sum of squared deviations between the inspected state values and the fitted line is minimal. The estimate uses the state value recorded at each inspection and the length of time between consecutive inspections, that is, the difference between the times at which consecutive state values were recorded.
The model is then used in recursive form: the estimated slope and intercept give the forecast for days after the latest inspection used in the fit.
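A minimal sketch of the short-interval linear fit and its rolling use is given below. It is not the paper's vector formulation, which is not reproduced here; it uses ordinary least squares on the most recent inspections, with the inspection day as the regressor so that unequal intervals need no special treatment. The function names, the window of three inspections, and the sample data are assumptions.

```python
def fit_line(times, values):
    """Ordinary least squares slope and intercept; times may be unequally spaced."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def rolling_forecast(times, values, window=3):
    """Forecast each inspection from a line fitted to the previous `window` inspections.

    After each forecast the actual value is used in the next fit, so the
    model moves forward as new inspections arrive.
    """
    forecasts = []
    for k in range(window, len(times)):
        slope, intercept = fit_line(times[k - window:k], values[k - window:k])
        forecasts.append(slope * times[k] + intercept)
    return forecasts


# Hypothetical inspection days (unequal intervals) and standard deviations.
days = [0, 12, 31, 40, 58, 70]
std_values = [1.0 + 0.01 * d for d in days]
forecasts = rolling_forecast(days, std_values)
```

Because the hypothetical data lie exactly on a line, the rolling forecasts reproduce the later inspections; with real data the gap between the two is the residual that the next subsection corrects.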

Model Analysis.
The three time intervals spanned by the first four historical inspections, together with the second to fourth inspection data of the cross-level standard deviation time series, are taken as input; the slope and intercept of the linear track state change in the interval between the fourth and fifth inspections are the output. When the forecast finishes, the forecast for the fifth inspection is replaced by the actual fifth standard deviation; the three time intervals between the second and fifth inspections and the third to fifth inspection data are then used to forecast the slope and intercept for the interval between the fifth and sixth inspections. As the data are updated step by step, the model rolls forward. The linear forecast model proposed in this paper does not require equally spaced data samples, unlike general time series models. In track irregularity time series data, especially data from the track inspection car, the inspection intervals are unequal because of maintenance work, changes in passenger and cargo plans, inspection car scheduling, and other reasons; that is, the interval length in the model is not constant. The linear model has the following advantages: it is simple, being based on a linear model and a least squares solution, and it needs no special treatment of unequal intervals, because the length of each interval enters the model directly as a parameter.
Forecast process of linear recursive model is shown in Figure 6.

Residual Correction.
In order to further improve the forecast accuracy of the model, the residual needs to be corrected. The model residual sequence is an oscillating time series with cyclical variation. In this paper, the idea of the Fourier transform is used for residual analysis.
The model residual sequence is the sequence of differences between the inspected values and the linear forecasts. The residuals are approximated by a Fourier expression with a free parameter. By looking for the appropriate value of this parameter, the fitting error between the model residuals and the periodic sequence obtained through the Fourier transform reaches a minimum. After the best value is found, the final forecast model is the linear forecast corrected by the fitted periodic residual. When the slope and intercept parameters are substituted into the short-time linear formula for track state change, the daily track state changes between two inspection time points can be estimated; the forecast curve and the actual curve are shown in Figure 7.
Forecast results after residual correction are shown in Figure 8.
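The residual correction idea can be illustrated with a truncated Fourier series fitted to a residual sequence. This is a sketch under assumptions rather than the paper's expression: the residuals are treated as equally spaced, the free parameter is taken to be the number of harmonics, and the function names and sample residuals are hypothetical.

```python
import math


def fourier_fit(residuals, harmonics):
    """Approximate a residual sequence by its mean plus the first `harmonics`
    terms of its discrete Fourier series (equally spaced samples assumed)."""
    n = len(residuals)
    fitted = [sum(residuals) / n] * n
    for k in range(1, harmonics + 1):
        # The Nyquist term (2k == n) carries half the usual weight.
        scale = 1.0 / n if 2 * k == n else 2.0 / n
        a = scale * sum(e * math.cos(2 * math.pi * k * i / n) for i, e in enumerate(residuals))
        b = scale * sum(e * math.sin(2 * math.pi * k * i / n) for i, e in enumerate(residuals))
        for i in range(n):
            angle = 2 * math.pi * k * i / n
            fitted[i] += a * math.cos(angle) + b * math.sin(angle)
    return fitted


def fit_error(residuals, harmonics):
    """Sum of squared differences between the residuals and their Fourier fit."""
    fitted = fourier_fit(residuals, harmonics)
    return sum((e - f) ** 2 for e, f in zip(residuals, fitted))


# Hypothetical residuals: a constant offset plus two full cycles over 12 samples.
residuals = [0.5 + math.cos(2 * math.pi * 2 * i / 12) for i in range(12)]
error_one = fit_error(residuals, 1)
error_two = fit_error(residuals, 2)
```

In this example one harmonic leaves the oscillation unexplained, while two harmonics capture it, so the fitting error drops to numerical zero. A corrected forecast would add the fitted periodic term to the linear forecast; how the parameter is selected on real data is not specified here.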
Several indicators taken together measure the forecast accuracy of the model more fully. Here, we use the mean square error (MSE) and the mean absolute percentage error (MAPE).
MSE is the expected value of the squared difference between the estimated value and the true value, and it indicates the degree of variation of the data. The smaller the MSE is, the more accurate the forecast model is in describing the experimental data. The expression of MSE is $\mathrm{MSE} = \frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2$, and MAPE is $\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|$, where $y_t$ is the actual value, $\hat{y}_t$ is the forecast value, and $n$ is the number of samples. Forecast accuracy by MAPE is divided into four grades: high-precision forecast (MAPE < 10%), good forecast (10% < MAPE < 20%), feasible forecast (20% < MAPE < 50%), and error forecast (MAPE > 50%).
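The two accuracy indicators and the four MAPE grades can be computed as in the sketch below. The function names and sample values are illustrative; the text leaves the grade boundaries themselves (exactly 10%, 20%, or 50%) unassigned, so placing a boundary value in the lower-accuracy grade is an assumption made here.

```python
def mse(actual, forecast):
    """Mean square error between actual and forecast values."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mape_grade(value):
    """Map a MAPE value (percent) to the four grades used in the text.

    Boundary values fall into the lower-accuracy grade (assumption).
    """
    if value < 10:
        return "high-precision forecast"
    if value < 20:
        return "good forecast"
    if value < 50:
        return "feasible forecast"
    return "error forecast"


# Hypothetical actual and forecast standard deviations.
actual = [2.0, 4.0]
forecast = [1.0, 5.0]
example_mse = mse(actual, forecast)
example_mape = mape(actual, forecast)
example_grade = mape_grade(example_mape)
```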
The forecast accuracy of the model for some unit sections is shown in Table 1.
Comparing the forecast accuracy indicators MSE and MAPE shows that both are significantly reduced after residual correction; the forecast accuracy is generally improved by 30-40%, and the result is a high-precision forecast.

Linear-ARMA Model Based on Wavelet Decomposition Reconstruction
Forecast models with residual correction generally achieve higher forecast accuracy than the original forecast models, but the correction also increases the computation and complexity of the model. Therefore, this paper proposes a linear-ARMA model based on wavelet decomposition reconstruction.

Wavelet Transform.
The wavelet transform [39][40][41][42][43][44][45] is a rapidly developing field in applied mathematics and engineering. It is a branch of mathematics that combines functional analysis, Fourier analysis, sample transfer analysis, and numerical analysis. It is based on certain special functions: it converts a data process or data sequence into a series in order to find similar spectrum characteristics, and thereby achieves data processing. The wavelet transform is a local transformation in space (time) and frequency; it can extract information effectively from a signal and perform multiscale detailed analysis of functions or signals through dilation, translation, and other operations. As the name implies, a "wavelet" is a waveform with a small area, limited length, and zero mean value. "Small" refers to the decay of the wavelet, while "wave" refers to its oscillation: its amplitude alternates between positive and negative values. Compared with the Fourier transform, the wavelet transform is a localized time (space) frequency analysis; it gives fine time resolution at high frequencies and fine frequency resolution at low frequencies, adapts automatically to the requirements of time-frequency signal analysis, and can focus on any detail of the signal, which overcomes the difficulties of the Fourier transform. It is therefore regarded as a major breakthrough in scientific method after the Fourier transform, and the wavelet transform is even called a "mathematical microscope." In summary, decomposing functions into a series of simple basis functions is of theoretical and practical significance. In this paper, the Daubechies wavelet is used to decompose the track irregularity time series data. Daubechies is the general name of a family of binary wavelets proposed by the Belgian mathematician Ingrid Daubechies, which can perform multiscale wavelet decomposition of a signal.

Core Ideas.
According to the idea of wavelet decomposition and reconstruction, the waveform signal of the track irregularity standard deviation sequence is decomposed into detail waveform signals (D1, D2, D3) and an approximate waveform signal (A3). The detail waveform signals are stationary series with zero mean, so a random linear model can be used to study them; the ARMA model is used in this paper. Approximate waveform signals are generally nonstationary, smooth sequence curves. Because of this smoothness, the linear recursive model is used for trend analysis. The modeling idea is shown in Figure 9.
In the figure, $A3_t$ is the actual value, and $\hat{A3}_t$ and $\hat{A3}_{t+1}$ are forecast values, which follow from the geometric relationship shown in the figure. Finally, the two partial results are combined, and all the forecast sequence data are added up with weight 1: $\hat{S} = \hat{D1} + \hat{D2} + \hat{D3} + \hat{A3}$. In the formula, $\hat{D1}$, $\hat{D2}$, and $\hat{D3}$ are the high-frequency detail signal sequences forecast by the ARMA model, $\hat{A3}$ is the low-frequency approximation sequence forecast by the linear model, and $\hat{S}$ is the forecast value of the final state.
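The weight-1 combination rests on the fact that a three-level wavelet decomposition splits a signal into components that add back to the original. The sketch below shows this with the Haar wavelet, the simplest member of the Daubechies family; the paper does not state which Daubechies order it uses, so Haar is an assumption, as are the function names and the sample signal. Only the decomposition and recombination are shown; the ARMA and linear forecasts of the components are not reproduced.

```python
def haar_approximation(signal, level):
    """Haar approximation at `level`: each block of 2**level samples is
    replaced by its mean. The length must be a multiple of 2**level."""
    block = 2 ** level
    if len(signal) % block:
        raise ValueError("signal length must be a multiple of 2**level")
    out = []
    for start in range(0, len(signal), block):
        mean = sum(signal[start:start + block]) / block
        out.extend([mean] * block)
    return out


def haar_components(signal, levels=3):
    """Return the coarsest approximation and the detail components D1..Dn,
    all with the same length as the signal."""
    approximations = [list(signal)]
    for j in range(1, levels + 1):
        approximations.append(haar_approximation(signal, j))
    details = [
        [fine - coarse for fine, coarse in zip(approximations[j - 1], approximations[j])]
        for j in range(1, levels + 1)
    ]
    return approximations[-1], details


# Hypothetical standard deviation series with 8 samples.
signal = [0.8, 0.9, 1.1, 1.0, 1.3, 1.2, 1.5, 1.6]
a3, (d1, d2, d3) = haar_components(signal)
rebuilt = [a + x + y + z for a, x, y, z in zip(a3, d1, d2, d3)]
```

Each detail component has zero mean and the four components sum back to the signal, which is why forecasts of D1, D2, D3, and A3 can be added with weight 1 to give the forecast of the original series.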
After the decomposition-reconstruction process, the final forecast result is obtained; the modeling process is shown in Figure 10.

Reconstruction of Track Irregularity Standard Deviation Time Series.
The K449-K450 km section of the Beijing-Kowloon line, up direction, with data from 44 cross-level inspections, is selected as the study data. The section is divided into 40 units, and the standard deviation of each unit is calculated, which gives 40 cross-level standard deviation time series. After the standard deviation time series are converted to equal time intervals, the linear-ARMA model based on wavelet decomposition and reconstruction is used for analysis. The K449+800-K449+825 unit section, from November 13, 2008, to May 18, 2010, with 38 cross-level standard deviation data, is taken as study data. The linear-ARMA model process based on wavelet decomposition and reconstruction is as follows.
(3) Low-frequency signal linear model forecast: the low-frequency signal is a smooth curve, and its linear forecast result is shown in Figure 18. The forecast error of the model is shown in Figure 20.

Accuracy Analysis.
MSE and MAPE are used to measure the accuracy of the model, and the forecast accuracy of some units is shown in Table 2.
According to the MSE and MAPE values in Table 2, the model has higher forecast accuracy, so there is no need to correct the residual as in the linear recursive model. This shows that the linear-ARMA model based on wavelet decomposition reconstruction is an effective way to forecast the trend of track state changes.

Conclusions
In this paper, data mining and time series theory are used to study track irregularity standard deviation time series data. The main purpose of this study is to forecast the future track irregularity state. Clustering analysis from data mining reveals different patterns and characteristics of track irregularity change. Through a systematic study of time series data classification and time series forecast models, this paper puts forward a linear recursive model and a linear-ARMA model based on wavelet decomposition reconstruction to forecast the changing trends of track irregularity standard deviation time series. Simulation results show that the models have higher accuracy. The change of railway track state is a complex process affected by many factors. Although exploring the laws of its development is an extremely difficult task, the study is of far-reaching significance. Track state is inspected by section rather than at each fixed measuring point, although the inspection data contain mileage information for the fixed measuring points to be studied. A deviation between the actual mileage of a measuring point and the mileage recorded in the inspection data is unavoidable, and this deviation can be a few meters or dozens of meters. The basic idea in the study of track state changes is therefore either to take the data after relative mileage calibration as the subject or to take the section as a whole as the subject. Because of the track maintenance and repair cycle, we only study track state changes within one cycle. However, the trend of track state change across maintenance and repair cycles is also worth studying.

Figure 1: Main forecast methods of time series.

Figure 2: Classification of time series analysis.

Figure 3: Steps of data mining process.

Figure 6: Forecast process of linear recursive model.

(4) Model reconstruction and accuracy analysis: all forecast sequence data are added up with weight 1; the comparison of forecast values and original values is shown in Figure 19.

Figure 15: Level 1 forecast value of detail waveform signal (HF) and original data cross-level standard deviation.

Figure 16: Level 2 forecast value of detail waveform signal (HF) and original data cross-level standard deviation.

Figure 17: Level 3 forecast value of detail waveform signal (HF) and original data cross-level standard deviation.

Figure 19: Comparison of forecast values and original values.

Table 1: Analysis of accuracy of the cross-level standard deviation.