Data Calibration Based on Multisensor Using Classification Analysis: A Random Forests Approach

This paper analyzes the problem of meaningless outliers in traffic detection data sets and examines the characteristics of single-detector and multisensor data, based on real-time highway data. Building on an analysis of the random forests algorithm, a learning algorithm of high accuracy and speed, an optimized random forest for filtering outliers from the sample is proposed, which combines the bagging strategy with the boosting strategy. Random forests with different numbers of trees are applied to classify meaningless outliers in traffic detection data sets, based on the traffic parameters of flow, spot mean speed, and roadway occupancy rate. The results show that the optimized random forest model filters meaningless outliers from traffic detection data collected from road intersections more accurately. With filtered data as input, a transportation information system can reduce the influence of erroneous data and improve highway traffic information services.


Introduction
With the constant development of digital image technology and detection technology, traffic state information can be collected by magnetic frequency, wave frequency, and video technology and by GPS, which has been installed in most vehicles [1]. In addition, RFID technology and mobile signaling technology can provide such information in a supplementary role. These technologies yield large spatiotemporal data sets. Efficient traffic state identification and prediction [2][3][4] require accurate real-time traffic data. The outlier problem [5] arises when traffic information is obtained from traffic awareness data sets; namely, the traffic information contains some data that are obviously inconsistent with the other data. Outliers have many causes: (1) a short collection period; (2) imperfect detection devices; (3) loss of data; (4) errors in transferring the detection data; (5) environmental factors. If the process of discriminating the traffic state ignores the existence of outliers, a mixture of meaningless outlier data and traffic event data will be stored. A basic question in transportation information is therefore how to distinguish outlier data effectively using multidimensional characteristics, so as to improve the accuracy of traffic prediction.
In recent years, increasing attention has been given to outlier research in dynamic traffic data. Nam and Drew [6] pointed out that conservation laws for the traffic flow could be used to recognize and process erroneous data. Vanajakshi and Rilett [7] used loop detector data to analyze cumulative flow together with data from adjacent sections. They established an optimization model based on the law of conservation of flow, with the objective of minimizing the sum of squares of the cumulative flow differences between adjacent detection sections, in order to eliminate the error when several consecutive detection sections showed counting errors. Smith et al. [8] proposed a calibration approach that fixes outlier data using the exponential smoothing method. More recently, data optimization methods based on clustering [9][10][11] and on the genetic algorithm [12] have been presented for outliers in multidimensional characteristic data.

Mathematical Problems in Engineering

Random Forest Optimization Model Based on Traffic Data
(1) Road traffic state detection data deviate greatly from the actual road traffic state.
(2) The obtained road traffic state data are erroneous, because their values are beyond the reasonable range or violate the relevant laws of road traffic.
(3) The road traffic state data are missing.
First, single-parameter road detection data are compared in Figure 1, which shows the scatterplot of spot mean velocity extracted from geomagnetic detection data of freeways in November 2014. It contains 1320 groups of discrete detection data collected from the same section at 165 time points. In addition, Figure 2 lays out the differences among sensor data from the same cross section. It is an integrated scatterplot of multisensor data from the same section, in which the three parameters of flow, spot mean speed, and occupancy rate determine the location of each data point. The two figures show that outliers are present in the data but that their proportion in the samples is small. The category that accounts for the largest number of samples is called the majority class, and the category that accounts for the fewest is called the minority class (nonequilibrium data) [13].

Model Based on Traffic Data
2.2.1. Random Forest. RF is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees [14]. RF forms its sample sets with the bagging resampling strategy and combines the tree predictors by majority voting. Each tree is grown on a new bagging training set.
RF is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier and runs efficiently on large databases. A significant advantage of RF is that it generates an internal unbiased estimate of the generalization error as the forest is built. Furthermore, RF is less prone to overfitting. It is widely applied in many domains, such as computer vision, information retrieval, data mining, and pattern recognition. The mathematical description of the random forest classification model is as follows.
Definition 1. A random forest is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, 2, \ldots, K\}$, where the $\{\Theta_k\}$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input $x$. The equally weighted voting model is

$$H(x) = \arg\max_{Y} \sum_{k=1}^{K} I\bigl(h_k(x) = Y\bigr). \tag{1}$$

Given an ensemble of classifiers $\{h_1(x), h_2(x), \ldots, h_K(x)\}$, each of them yields a classification; $h_k(x)$ is shorthand for $h(x, \Theta_k)$. With the training set drawn at random from the distribution of the random vector $X, Y$, define the margin function as

$$mg(X, Y) = \operatorname{av}_k I\bigl(h_k(X) = Y\bigr) - \max_{j \neq Y} \operatorname{av}_k I\bigl(h_k(X) = j\bigr), \tag{2}$$

where $I(\cdot)$ is the indicator function. The margin measures the extent to which the average number of votes at $X, Y$ for the right class exceeds the average vote for any other class.
The larger the margin is, the more confidence there is in the classification. The generalization error is given by

$$PE^{*} = P_{X,Y}\bigl(mg(X, Y) < 0\bigr). \tag{3}$$

The strength of the set of classifiers $\{h(x, \Theta)\}$ is

$$s = E_{X,Y}\, mg(X, Y). \tag{4}$$

An upper bound for the generalization error is given by

$$PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}}, \tag{5}$$

where $\bar{\rho}$ is the correlation between two members of the forest averaged over the different distributions.
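The voting model and the margin function can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names and the five example votes are hypothetical, and the label convention (1 for normal data, -1 for an outlier) is an assumption, since the text only states that the quality mark takes values in {-1, 1}.

```python
from collections import Counter

def majority_vote(votes):
    """Equally weighted voting: return the class predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

def margin(votes, true_label):
    """Average vote for the true class minus the largest average vote
    received by any other class."""
    counts = Counter(votes)
    n = len(votes)
    right = counts.get(true_label, 0) / n
    wrong = max((c / n for label, c in counts.items() if label != true_label),
                default=0.0)
    return right - wrong

# Hypothetical votes of five trees for one sample
# (assumed convention: 1 = normal, -1 = outlier).
votes = [1, 1, -1, 1, -1]
print(majority_vote(votes))   # class chosen by the forest
print(margin(votes, 1))       # positive: the forest is right about a normal sample
print(margin(votes, -1))      # negative: the forest would misclassify an outlier
```

A positive margin means the forest classifies the sample correctly, and a negative margin means it does not, which is why the generalization error is the probability that the margin is below zero.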
Each decision tree in a random forest is trained on a bagging random sample. The input to a tree is resampled from the training set, so the sample may contain duplicates. Bagging and boosting, another common method, differ in how they sample. Bagging samples uniformly: its training sets are chosen at random and independently of one another. Boosting samples according to the error rate: each training set depends on the learning results of the previous rounds, so the classification accuracy of boosting is often better than that of bagging. The prediction functions of bagging are unweighted and can be generated in parallel; those of boosting are weighted and can only be generated sequentially. Both methods can effectively improve classification accuracy, so this paper combines the advantages of the two into an integrated method for optimizing the classification of traffic field data.

Training Set and Testing Set.
When the sample sets are generated in the random forest model, some samples of the initial training set are not drawn from the collected data. The data that are not sampled are called OOB (out-of-bag) data. The whole data set is divided into two parts: a training set and a testing set. The former is used to build the model; the latter is used to test the capability of the model.
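How out-of-bag data arise from sampling with replacement can be shown with a short sketch. The function name, the sample size of 1000, and the seed are hypothetical choices for illustration only.

```python
import random

def bootstrap_with_oob(n_samples, rng):
    """Draw n_samples indices with replacement and return
    (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    oob = [i for i in range(n_samples) if i not in chosen]
    return in_bag, oob

rng = random.Random(0)              # fixed seed so the sketch is repeatable
in_bag, oob = bootstrap_with_oob(1000, rng)
print(len(in_bag), len(oob))        # the bag has 1000 draws; the rest are OOB
print(len(oob) / 1000)              # OOB fraction of the original samples
```

For large samples the out-of-bag fraction approaches 1/e, roughly 37%, so every tree has a sizeable set of unseen samples on which its error can be estimated.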
In a traffic detection data set, each testing point yields a large amount of sensor data from a variety of detection sources. Suppose there are $J$ sources; each source provides multiple traffic parameters of the detected section, so each acquisition time yields one set of multisensor data. Define a perception data record at a given time, covering the $J$ different types of data sources for the monitored object and their attributes, as $\{d, t_i, DN_i, p_{1,1}, p_{1,2}, p_{1,3}, \ldots, p_{J,1}, p_{J,2}, p_{J,3}, L_i\}$, in which $DN$ indicates the detector number, $d$ indicates the day, $t_i$ indicates the data acquisition time, $p_{j,k}$ indicates the $k$th parameter of the $j$th traffic detector, and $L$ is the quality mark.
For convenience in analyzing detector data collected at a road cross section, three fundamental traffic parameters, namely, flow, spot speed, and occupancy rate, are extracted from the data of three kinds of detection equipment (induction loop data, magnetic data, and monitoring data). Calibrating the traffic parameters of one detector at a given acquisition time requires the spatially correlated data from the other detectors. For instance, a detector at acquisition time $t_i$ obtains the flow $q_i^{\mathrm{loop}}$, spot mean speed $v_i^{\mathrm{loop}}$, and occupancy $o_i^{\mathrm{loop}}$ from the traffic induction loop. If these data need to be calibrated, the following properties should be selected: the traffic data collection time $t_i$; the flow $q_i^{\mathrm{loop}}$, spot mean speed $v_i^{\mathrm{loop}}$, and occupancy $o_i^{\mathrm{loop}}$ from the induction loop data; the volume $q_i^{\mathrm{mag}}$, spot mean speed $v_i^{\mathrm{mag}}$, and occupancy $o_i^{\mathrm{mag}}$ from the magnetic data; and the volume $q_i^{\mathrm{mon}}$, spot mean speed $v_i^{\mathrm{mon}}$, and occupancy $o_i^{\mathrm{mon}}$ from the monitoring data. The traffic data quality mark $L_i$, $i = 1, 2, \ldots, n$, with $L_i \in \{-1, 1\}$, indicates whether the evaluated record in the calibration test set is normal data or an outlier.
The number of $x$-variables is 10. This means that the matrix $X$ used in training the model has a size of 22093 × 10. The test data also form a matrix with a size of 22093 × 10. The formal description of the matrices $X$ and $Y$ can be written as follows:

$$X = (x_1, x_2, \ldots, x_n)^{T}, \tag{6}$$

where $x_i$ is a set of data elements and $n$ is the number of input samples; consider

$$Y = (y_1, y_2, \ldots, y_n)^{T}, \tag{7}$$

where $y_i \in \{-1, 1\}$ represents the result of the data quality assessment.
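The way one acquisition time becomes a row of 10 $x$-variables can be sketched as follows. The class name, function name, and the numeric readings are hypothetical; only the layout (collection time plus flow, spot mean speed, and occupancy from each of the three sources) follows the description above.

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    """Three fundamental traffic parameters reported by one detection source."""
    flow: float
    speed: float
    occupancy: float

def build_row(time_index, loop, magnetic, monitoring):
    """Flatten one acquisition time into 10 x-variables:
    the collection time plus 3 parameters from each of the 3 sources."""
    row = [time_index]
    for reading in (loop, magnetic, monitoring):
        row.extend([reading.flow, reading.speed, reading.occupancy])
    return row

# Hypothetical readings for a single acquisition time.
row = build_row(
    17,
    SensorReading(flow=42, speed=88.5, occupancy=0.12),   # induction loop
    SensorReading(flow=40, speed=90.1, occupancy=0.11),   # magnetic sensor
    SensorReading(flow=43, speed=87.9, occupancy=0.13),   # monitoring device
)
print(len(row))   # number of x-variables in one sample
```

Stacking one such row per acquisition time gives the sample matrix, and the quality marks of the same records give the label vector.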
When a random tree is built during training, not every sample is selected as input to the decision tree, and $m$ features are chosen from the $M$ available features. A decision tree is built by a complete splitting process, such as from   to   . The growth of the decision tree terminates when no leaf node can be split further or when all samples at a node belong to the same class.

Random Forest Optimization Model.
Identifying nonequilibrium data in traffic information is of practical significance, yet the performance of the random forest classification method declines for scarce and extreme values [15]. This paper therefore gives greater weight to the few minority samples in each independent decision tree of the random forest, so that the rules learned by a tree are not dominated by the sheer amount of majority data. This training scheme forces the classifier to pay more attention to the minority class samples and improves the accuracy for the class with less training data. The proposed model can thus address the classification of nonequilibrium data sets and obtain voting results of higher accuracy on nonequilibrium samples.
When bagging selects samples at random, the minority class is small in the original training set and the probability of selecting nonequilibrium samples is very low. This section proposes a method of optimizing the random forest that places particular emphasis on nonequilibrium samples. The basic idea is to carry some characteristics of the trees already built into the construction of each new tree. The core of the algorithm combines the formal bagging strategy with the boosting strategy. First, $N$ instances are resampled at random as in the original algorithm (for the initial training set, the algorithm keeps the original bagging method); then the data are adjusted according to the principle of boosting. This principle preserves the original effective randomization process and the random selection of properties while improving the adaptive accuracy of the random forest. Therefore, except for the first tree, which is built by bagging, an evaluation of the current forest is added to data induction; that is, the prediction error on each instance is estimated and used to weight the random selection of training examples. This improves the classification ability for such samples and benefits the subsequent decision trees. The out-of-bag data are independent of the sampled data used to generate a tree. Following the underlying principle of random forest (using OOB data to estimate the error), an estimation function defined only on the out-of-bag data is given in formula (8), where $I(\cdot)$ is the indicator function, $x$ is the independent variable, $y$ is the actual classification, $h(x, \Theta_k)$ represents the output of the $k$th decision tree, and $h_{oob}$ represents the out-of-bag set. The smaller the value of the estimation function at $(x, y)$, the more trees of the current forest misclassify the instance, and the more attention should be paid to the instance $x$ subsequently. Therefore, the weighting function should increase as the corresponding value of the estimation function decreases. By analyzing a typical example, this section gives a corresponding weight distribution formula, formula (9).
To describe the random forest optimization model (RFOM) clearly, the following notation is required: $N$ is the number of instances $(x, y)$ in the training set, $M$ is the number of classification characteristics, and $K$ is the number of decision trees in the "forest." The optimization method for traffic outlier data is as follows:
(1) The original samples form the training set, with the initial distribution $D_1(x_i, y_i) = 1/N$.
(2) Train the decision tree. Randomly sample with replacement to obtain the first random sample set $S_1$, and then train the decision tree on this sample. The unsampled samples form the first out-of-bag data.
(4) For every tree, randomly sample $m$ characteristics ($m < M$). Calculate the Gini coefficient of each sample set and the Gini coefficient of each division, as in formula (10) and formula (11):

$$Gini(S) = 1 - \sum_{i} p_i^{2}, \tag{10}$$

$$Gini_{split}(S) = \sum_{j} \frac{|S_j|}{|S|}\, Gini(S_j), \tag{11}$$

where $p_i$ is the proportion of class $i$ in the sample set $S$ and the $S_j$ are the subsets produced by the division. Then, based on the principle of minimum Gini index, select a variable to split on. Finally, train the classification rules of the decision tree recursively. Grow each tree to its maximum size without pruning.
(5) Merge the decision trees into a forest.
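The split selection in step (4) can be sketched with the standard Gini index, $1 - \sum_i p_i^2$ for a sample set and the size-weighted sum over the subsets for a division. This is an illustrative sketch under that standard definition; the function names and the toy speed values are hypothetical, and labels follow the assumed convention 1 = normal, -1 = outlier.

```python
from collections import Counter

def gini(labels):
    """Gini index of a label set: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini index of a binary division."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Return (threshold, split Gini) with the minimum split Gini for one feature."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:          # the largest value cannot split
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = gini_split(left, right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical spot mean speeds with two implausibly low readings.
speeds = [10, 12, 80, 85, 90]
labels = [-1, -1, 1, 1, 1]
print(gini(labels))                    # impurity before splitting
print(best_threshold(speeds, labels))  # threshold with minimum split Gini
```

A full tree repeats this search over the randomly chosen characteristics at every node and recurses on the two subsets until the leaves are pure.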
Figure 3 shows a schematic of the optimization model for induced random trees. The first tree is trained in the traditional way, namely, with equal consideration given to each sample; the algorithm then modifies the weights of some samples according to their classification results. A new training set is sampled under these weights for the second tree; the "forecasts" of the first and second trees are then used to calculate the updated weights for the third iteration. By analogy, each new weighted data set is trained. Because the optimization process is based on the efficient random feature selection approach [16, 17], it retains the characteristics of the random forest algorithm and a constant prior probability. According to the needs of the actual data, induction from samples of the minority class is improved in training the random forest.
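The interplay of bagging and OOB-driven reweighting described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the RFOM algorithm itself: the base learner is a one-split stump rather than a fully grown tree, the reweighting rule (weight proportional to 2 minus the OOB accuracy of the sample) is a hypothetical stand-in for the paper's weight formula, and all names, the toy data, and the label convention (1 = normal, -1 = outlier) are assumptions.

```python
import random
from collections import Counter

def train_stump(X, y, idx, feats):
    """Fit a one-split tree on rows idx using only the candidate features."""
    best = None
    for f in feats:
        for t in sorted({X[i][f] for i in idx}):
            for sign in (1, -1):
                # Predict `sign` when the feature value is <= t, else -sign.
                err = sum((sign if X[i][f] <= t else -sign) != y[i] for i in idx)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    _, f, t, sign = best
    return lambda row: sign if row[f] <= t else -sign

def rfom_sketch(X, y, n_trees, m, seed=0):
    """Grow a forest whose sampling distribution is reweighted from OOB results."""
    rng = random.Random(seed)
    n, n_feats = len(X), len(X[0])
    weights = [1.0 / n] * n      # initial distribution 1/N: first tree is plain bagging
    oob_hits = [0] * n           # correct OOB votes received by each sample
    oob_seen = [0] * n           # number of trees for which the sample was OOB
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(n), weights=weights, k=n)  # weighted, with replacement
        feats = rng.sample(range(n_feats), m)              # m random characteristics
        tree = train_stump(X, y, idx, feats)
        forest.append(tree)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:
                oob_seen[i] += 1
                oob_hits[i] += tree(X[i]) == y[i]
        # Assumed rule: the lower the OOB accuracy of a sample, the larger its weight.
        raw = [2.0 - (oob_hits[i] / oob_seen[i] if oob_seen[i] else 1.0)
               for i in range(n)]
        total = sum(raw)
        weights = [r / total for r in raw]
    return forest

def predict(forest, row):
    """Equally weighted majority vote of the trees."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# Hypothetical toy data: columns are flow, spot mean speed, occupancy.
normal = [[10 + i, 70 + i, 5 + i % 10] for i in range(20)]
outliers = [[500 + i, 300 + i, 90 + i] for i in range(3)]
X, y = normal + outliers, [1] * 20 + [-1] * 3
forest = rfom_sketch(X, y, n_trees=11, m=2)
print(predict(forest, [900, 400, 99]), predict(forest, [10, 70, 5]))
```

Because the weights start uniform, the first tree is built exactly as in ordinary bagging; every later tree samples more often from the instances that the current forest gets wrong on its out-of-bag data, which is where minority-class outliers tend to fall.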

Experimental Validations
3.1. Data Collection. The data were collected by Shandong Hi-Speed Group Co., Ltd., on freeways in Shandong province, China. The comparison results are obtained from monitoring station data sets selected on November 13, 2014. The data contain traffic parameters, such as flow, spot speed, and occupancy rate, from induction loops, magnetoresistive sensors, and monitoring devices at the Jibei, Jiaozhou, and Gaotang monitoring stations. The properties of the nonequilibrium data sets are described in Table 1.

Performance Indicator.
The performance indexes of classification accuracy (Acc), detection rate (DR), false positive rate (FPR), and precision rate (PR) are used to evaluate the classification performance of the algorithm. They are defined as follows:

$$Acc = \frac{CN + CG}{CN + EG + CG + EN}, \quad DR = \frac{CN}{CN + EG}, \quad FPR = \frac{EN}{EN + CG}, \quad PR = \frac{CN}{CN + EN}, \tag{12}$$

where CN represents the number of detected outliers; EG represents the number of undetected outliers; CG represents the number of detected normal data; EN represents the number of undetected normal data. As described in the previous section, the nonequilibrium feature of traffic instance data is a significant problem to be solved, because it increases the risk of classification error. The indexes $F$ and $G$ are therefore often used to measure classification in this situation. The $F$ index is defined in formula (13). The parameter $G$ is the geometric average of the two kinds of classification accuracy, as in formula (14):

$$F = \frac{(1 + \beta^{2})\, PR \cdot DR}{\beta^{2}\, PR + DR}, \tag{13}$$

$$G = \sqrt{\frac{CN}{CN + EG} \cdot \frac{CG}{CG + EN}}, \tag{14}$$

where $\beta$ represents the proportionality coefficient of precision rate and detection rate.
As shown in Figure 4, the OOB estimate gradually decreases as the number of trees increases. The classification accuracy of the algorithm increases with the number of trees in the forest and then stabilizes. When the number of trees exceeds a certain level, the OOB error reaches a limiting value; namely, the classification accuracy of the RFOM algorithm tends to be stable. In this experiment, RFOM models are built using training sets of different sizes, and their classification performance is compared. The number of trees ranges from 60 to 100 in steps of 20. We increase the number of trees in order to obtain a greater difference. The optimized random forests with 60, 80, and 100 trees are named RFOM 60, RFOM 80, and RFOM 100, respectively. To compare algorithm performance, CART, RF, and RFOM are each used to classify outliers in the Jibei Station Data Set. Six performance measures, Acc, DR, FPR, PR, $F$, and $G$, are computed for the different situations and are shown in Table 2.
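The six performance measures can be computed directly from the four counts CN, EG, CG, and EN. The sketch below assumes the standard confusion-matrix forms of these indexes (with outliers as the positive class) and the standard F-measure with coefficient beta; the function name and the counts in the example are hypothetical and are not results from the experiments.

```python
from math import sqrt

def indicators(cn, eg, cg, en, beta=1.0):
    """Performance measures from the four counts.

    cn: detected outliers, eg: undetected outliers,
    cg: detected normal data, en: undetected normal data.
    """
    acc = (cn + cg) / (cn + eg + cg + en)   # classification accuracy
    dr = cn / (cn + eg)                     # detection rate
    fpr = en / (en + cg)                    # false positive rate
    pr = cn / (cn + en)                     # precision rate
    # Standard F-measure (assumed form); beta balances precision and detection.
    f = (1 + beta ** 2) * pr * dr / (beta ** 2 * pr + dr)
    # Geometric mean of the accuracies on the two classes.
    g = sqrt(dr * (1 - fpr))
    return {"Acc": acc, "DR": dr, "FPR": fpr, "PR": pr, "F": f, "G": g}

# Hypothetical counts for an imbalanced test set of 1000 records.
for name, value in indicators(cn=40, eg=10, cg=900, en=50).items():
    print(f"{name}: {value:.4f}")
```

On imbalanced data the accuracy alone can look high even when many outliers are missed, whereas F and G drop quickly in that case, which is why they are reported alongside Acc.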
It is observed that different numbers of trees yield similar classification accuracy and that RFOM performs better than CART or RF. The Acc of RFOM 80 is 91.93%, which is the best. The false positive rate of RFOM 80 is 0.86%, which is the best. As far as RFOM is concerned, the detection rate of 0.9635 yielded by RFOM 100 is the best one.   of RFOM 80 is 0.8489, which is the best. RFOM 60 performs worst among the RFOM algorithms. Among the five methods compared, RFOM 100 and RFOM 60 outperform the other methods. On the real Shandong freeway data, a forest of 80 trees obtains some improvement while saving computation time.
In addition, RFOM is superior to the other algorithms, as shown in Figure 6, which uses ROC curves to evaluate the detection methods. Comparison by ROC curve is more intuitive; the curve takes the false positive rate as the horizontal axis and the detection rate as the vertical axis.
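The points of such a ROC curve can be obtained by sweeping a threshold over an outlier score. In the sketch below the score is assumed to be the fraction of trees voting "outlier" for a sample; this scoring choice, the function name, and the example values are hypothetical, and outliers (label -1 by assumption) are treated as the positive class.

```python
def roc_points(scores, labels):
    """Return (false positive rate, detection rate) pairs, one per threshold,
    starting from the origin."""
    pos = sum(1 for y in labels if y == -1)   # outliers
    neg = len(labels) - pos                   # normal data
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Flag every sample whose score reaches the threshold as an outlier.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == -1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != -1)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical outlier scores for five samples.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [-1, -1, 1, -1, 1]
for fpr, dr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  DR={dr:.2f}")
```

A method whose curve lies closer to the upper left corner reaches a high detection rate at a low false positive rate, which is the visual comparison made between the algorithms.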
Performance measures are computed for three data sets, namely, the Jibei Station Data Set, the Jiaozhou Station Data Set, and the Gaotang Station Data Set, and are shown in Table 3. The number of trees is 80 in the comparison with different

Conclusions
This paper proposes an optimized random forest model for calibrating traffic samples, which uses multisource features to separate outliers from real-time data. The model optimizes the training and decision-making process of random forests by using bagging and boosting together, based on the nonequilibrium feature of outliers in traffic data. According to the needs of the actual data, induction from samples of the minority class is improved in training the random forest. The optimized RF model has the following characteristics: (1) the advantage of the randomization process in the original RF algorithm is retained; (2) the boosting method is introduced to strengthen the "induction" of the decision trees. Experimental results on samples from Shandong freeways show that the optimized RF effectively separates outliers from traffic data. The algorithm is verified by comparing classification accuracy, detection rate, false positive rate, precision rate, and ROC curves. Compared with the previous methods, RFOM has advantageous properties such as high generalization performance and high accuracy. However, it can only handle nonequilibrium sample sets of traffic data, so its use is restricted by the nonequilibrium feature of the detection data sets. Further research will focus on overcoming these limitations.

Figure 3: Schematic of random forest optimization model.

Figure 4: Diagram of OOB estimate with number of trees in forests.

Figure 6: Comparison of ROC curves with Jibei Station Data Set.
(9) If the current number $k$ of decision trees is less than $K$, reweight the sampling distribution according to boosting and draw a new random sample with replacement. The unsampled samples form the out-of-bag data; return to (3).
(10) After merging the decision trees into a forest, classify new data with the random forest. The vote of each tree classifier depends on its classification result.

Table 1: Properties description of the nonequilibrium data sets.

Table 2: Comparison of algorithm performance.

In addition, the number of decision trees is related to the accuracy, detection rate, false positive rate, and precision rate of the RFOM algorithm. Figures 5(a)-5(d) show boxplots of error rates. The horizontal lines inside the boxes are the median error rates. The detection indexes in Figures 5(a)-5(d) grow to different degrees, except for FPR. When the number of trees is fewer than 60, Acc, DR, and PR grow relatively fast. Through repeated experiments on forests with different numbers of trees, we take the average standard deviation of each tree in a forest as the OOB estimate of the forest. The experimental results are shown in Figure 5. According to these two aspects, the parameter num is selected in [60, 100].

Table 3: Comparison of algorithm performance with different data sets.