The Method of Dynamic Identification of the Maximum Speed Limit of Expressway Based on Electronic Toll Collection Data

College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, Fujian, China Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, Fujian, China Fujian Provincial Expressway Information Technology Co., Ltd., Fuzhou 350011, Fujian, China Fujian Provincial Big Data Research Institute of Intelligent Transportation, Fujian University of Technology, Fuzhou 350118, Fujian, China


Introduction
In recent years, China's expressway ETC technology has developed rapidly, and more and more vehicles have installed ETC equipment. These vehicles interact with ETC gantries while driving, producing massive amounts of ETC data. At present, the cumulative number of ETC users has exceeded 220 million, and the utilization rate among vehicle owners is 78% [1]. Moreover, the ETC gantry can also interact with users of the Manual Toll Collection (MTC) system. Therefore, the ETC gantry system collects the traffic information of almost all vehicles on the expressway and reflects the overall traffic situation, which can provide strong support for the informatization construction, vehicle infrastructure cooperation, and automatic driving [2] of smart expressways. Obtaining the maximum speed limit of each expressway section is an important part of intelligent expressway management [3]; it can provide drivers with speed limit information [4,5] to avoid traffic accidents caused by speeding and can provide reliable perception and driving speed decision-making for autonomous vehicles. However, the maximum speed limit is dynamic. The relevant management departments adjust the speed limit of a road section according to road traffic flow, road maintenance conditions, and the number of traffic accidents [6][7][8]. At present, speed limit information is mainly collected manually, and the data is then uploaded to the system for updating within a certain period. This method has two disadvantages. First, it requires professionals to travel along the expressway to collect speed limit information, which costs immense manpower and material resources. Second, it has a long update cycle, so drivers cannot obtain the latest speed limit information, which creates safety hazards while driving and correspondingly reduces the traffic efficiency of the road.
Therefore, studying how to collect speed limit information automatically and how to identify the maximum speed limit of a road dynamically and in real time has research significance.
Traffic flow prediction and travel time prediction are research hotspots in the field of transportation. Like speed limit recognition, most of their research methods are supervised learning based on machine learning algorithms. The difference is that speed limit recognition is a classification problem, whereas traffic flow prediction and travel time prediction are regression problems. The recognition of the maximum road speed limit mainly relies on image recognition technology [9][10][11][12] and floating car trajectory data mining technology.
Image recognition technology obtains the speed limit of each road by recognizing the speed limit shown on the traffic signs along the road. Machine learning is widely used in a variety of research fields [13]. Support Vector Machine (SVM) [14], Extreme Learning Machine (ELM) [15], and multitask convolutional neural network (MTCNN) [16] are used to learn the features of speed limit signs and thereby recognize the maximum road speed limit. Although these methods achieve relatively good recognition results, they require surveyors to collect pictures of speed limit signs on the road, which consumes a lot of resources. In addition, the collection period is long, so real-time and dynamic recognition of the maximum speed limit cannot be achieved. In floating car trajectory data mining, the floating car is equipped with a global positioning system that records the time, location, and other information of the vehicle, and mining the trajectory data yields the driving speed features of all floating cars on the road [17]. A machine learning algorithm [18] can then learn the maximum speed limit features contained in the vehicle speed information of the road to recognize the maximum speed limit. However, floating cars account for only a small proportion of all cars and therefore cannot fully reflect the speed of the vehicles on the expressway. Therefore, maximum speed limit recognition based on floating car data still has certain defects.
In view of the high cost of speed limit sign recognition and the shortcomings of trajectory data recognition, this study proposes a method that uses real-time traffic data collected by the ETC gantry system to identify the maximum speed limit of expressways dynamically, which solves the problems of the high cost of manual information collection and incomplete vehicle data. First, the road section speed set construction algorithm and the section driving speed abnormal filtering algorithm are designed to ensure the integrity and reliability of the sample data. Then, the speed feature vector model is constructed to mine the speed limit features contained in the vehicle speed from different aspects. Finally, the maximum speed limit information of 534 expressway sections in Fujian Province is taken as the sample set, and the multivoting ensemble algorithm is used to perform supervised classification training and cross-validation on the road speed features. The test results show that this method can identify the maximum speed limit well and can recognize dynamic changes of the maximum speed limit on the road. The contributions of this paper can be summarized as follows. First, an algorithm is proposed for constructing the speed sets of road sections, which solves the problem that the speed of a road section cannot be calculated when ETC gantry transaction records are missing and obtains the speeds of vehicles on each road section accurately and completely. Second, this paper extracts the features of the road section speed from different aspects to construct the road section speed feature vector model and mine the potential correlation between the speed of the vehicles on the expressway and the road speed limit information.
Third, a dynamic recognition method is proposed to identify the maximum speed limit of the expressway; the validity of the method is verified with the real maximum speed limit information, and its soundness is verified by comparison with a large number of prediction algorithms. This paper is organized as follows. Section 1 introduces the research methods of road speed limit recognition. Section 2 defines the related concepts in this work. Section 3 describes each part of the dynamic method of expressway maximum speed limit. Section 4 shows the experimental results and analysis. Section 5 draws the conclusion and future work.

Relevant Definitions
Definition 1. Each ETC gantry of the expressway is called a Node, and two adjacent Nodes on the road constitute an expressway section, referred to as QD = {Q, Distance}, Q = <Node1, Node2>. Node and Q are shown in Figure 1, where Node1 is the start point of the road section, Node2 is the end point of the road section, and Distance is the actual distance of the road section. The average speed of a vehicle passing through a certain road section is called the road section speed. It is calculated as follows:
$$v = \frac{s}{t_2 - t_1} \quad (1)$$
where $s$ is the actual length of the road section, $t_1$ is the time when the vehicle passes the start point of the road section, and $t_2$ is the time when the vehicle passes the end point of the road section.
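The road section speed in equation (1) can be illustrated with a minimal sketch. The function name `section_speed_kmh` and the sample timestamps are illustrative assumptions, not part of the original method; only the relation v = s / (t2 - t1) comes from the text.

```python
from datetime import datetime

def section_speed_kmh(distance_km: float, t1: datetime, t2: datetime) -> float:
    """Average speed v = s / (t2 - t1) over one road section, in km/h."""
    elapsed_h = (t2 - t1).total_seconds() / 3600.0
    if elapsed_h <= 0:
        raise ValueError("end time must be later than start time")
    return distance_km / elapsed_h

# Hypothetical example: an 8.9 km section covered in 5 minutes 20 seconds
t_start = datetime(2020, 9, 3, 8, 0, 0)
t_end = datetime(2020, 9, 3, 8, 5, 20)
speed = section_speed_kmh(8.9, t_start, t_end)
print(round(speed, 3))
```

The guard against a non-positive travel time is a design choice of this sketch; such records correspond to the erroneous data that the cleaning step is meant to remove.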
Definition 5. The dispersion of the road section speed measures how widely the average speeds of vehicles passing through the road section are spread. The section speeds of vehicles on the expressway within a certain period of time constitute the speed set of the section. After the speed values are sorted, the speed at the 85th percentile is $v_1$, and the speed at the 15th percentile is $v_2$. The speed dispersion index is expressed in terms of $v_1$ and $v_2$, as given in equation (2). The larger the value range is, the higher the dispersion of the speed information.
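As a hedged illustration of Definition 5, the sketch below assumes that the dispersion index is the plain difference between the 85th and 15th percentile speeds, v1 - v2. The text does not spell out the exact form of equation (2), so this form, the function name `speed_dispersion`, and the sample data are all assumptions.

```python
import statistics

def speed_dispersion(speeds):
    """Assumed dispersion index: 85th percentile speed minus 15th percentile speed."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(speeds, n=100, method="inclusive")
    v1, v2 = cuts[84], cuts[14]  # 85th and 15th percentiles
    return v1 - v2

# Hypothetical speed set: one vehicle at each integer speed from 60 to 160 km/h
speeds = list(range(60, 161))
dispersion = speed_dispersion(speeds)
print(dispersion)
```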

Definition 6.
The speed limit includes the minimum speed limit and the maximum speed limit. The speed limit value is generally an integer multiple of 10. In this paper, we only discuss the maximum speed limit.

ETC Data Cleaning.
The ETC gantry system can generate a large amount of transaction data in a short period. System errors, interruptions of information exchange, and severe weather conditions can lead to abnormal data, which can affect the results. To reduce this interference, the data needs to be preprocessed, mainly in the following aspects.
Data Redundancy: Duplication between Multiple Data. The transaction information of each vehicle passing through an ETC gantry should be unique. However, problems in data acquisition, transmission, storage, and other intermediate links can cause data to be uploaded repeatedly, resulting in data redundancy. Therefore, these data need to be cleaned.

Data Error. The data record does not conform to normal driving rules, for example when two ETC gantries that control different driving directions record the same vehicle at the same time, or when different passing records of the same vehicle are recorded at the same time. These data need to be filtered or deleted.
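The two cleaning rules above can be sketched as follows. The record layout `(vehicle_id, gantry_id, timestamp)` and the function name `clean_transactions` are assumptions made for illustration; the real ETC transaction schema is not given in this section.

```python
from collections import defaultdict

def clean_transactions(records):
    """records: iterable of (vehicle_id, gantry_id, timestamp) tuples (assumed layout)."""
    # Data redundancy: keep the first copy of each identical record, preserving order
    unique = list(dict.fromkeys(records))
    # Data error: one vehicle cannot pass two different gantries at the same instant
    gantries_at = defaultdict(set)
    for vehicle, gantry, ts in unique:
        gantries_at[(vehicle, ts)].add(gantry)
    return [r for r in unique if len(gantries_at[(r[0], r[2])]) == 1]

# Hypothetical records: one duplicate for vehicle A, one conflict for vehicle B
raw = [
    ("A", "G1", 100), ("A", "G1", 100), ("A", "G2", 200),
    ("B", "G1", 150), ("B", "G9", 150),
]
cleaned = clean_transactions(raw)
print(cleaned)
```

This sketch drops every record involved in a conflict, since it cannot tell which of the conflicting gantry reads is the correct one.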

Vehicle Speed Recognition Algorithm in Road Section.
To calculate the speed distribution of a road section, it is necessary to obtain the transaction data of all vehicles at each gantry. However, gantry transaction data may be missing. Therefore, all traffic data and road network data need to be checked and supplemented to ensure the integrity of the gantry transaction data. After the transaction data of the ETC gantry system is initially cleaned, the trajectory Traj of each vehicle is constructed in chronological order from the transaction data of each gantry. Each pair of adjacent ETC gantries $Node_i$, $Node_{i+1}$ in Traj is traversed one by one, and the road section $QD_j$ formed by the two gantries is checked against the expressway road network G. If $QD_j$ belongs to G, the speed v of the vehicle passing through $QD_j$ is generated directly. In the expressions for $QD_j$ and the speed v, n represents the number of all vehicles on the road section $QD_j$ within a certain time period T, and $v_i$ represents the average speed of each vehicle on the road section $QD_j$ within that time period. If $QD_j$ does not belong to G, the transaction data of the intermediate gantries are missing, and a path search based on $Node_i$ and $Node_{i+1}$ needs to be performed to fill in the missing gantry transaction data. As shown in Figure 2, if the road section formed by $Node_i$ and $Node_{i+1}$ cannot be found in the road network G, $Node_i$ and $Node_{i+1}$ are used as the base nodes. The feasible path $Node_i$, $Node_a$, $Node_b$, $Node_{i+1}$ can be obtained through path search. $Node_a$ and $Node_b$ are supplementary nodes, and the average speed v between $Node_i$ and $Node_{i+1}$ is taken as the speed for 〈$Node_i$, $Node_a$〉, 〈$Node_a$, $Node_b$〉, and 〈$Node_b$, $Node_{i+1}$〉.
To ensure the reliability of the average speed v, the minimum speed $v_{min}$ for expressway driving is set to 30 km/h and the maximum speed $v_{max}$ to 160 km/h [19]. If the average speed, where v is the average speed over all road sections between $Node_i$ and $Node_{i+1}$, is not in the range $v \in [v_{min}, v_{max}]$, it is deleted as abnormal data. The specific process of the section speed data construction algorithm is shown in Algorithm 1.
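The section speed construction described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: the graph layout `{node: {neighbor: distance_km}}`, the function names, the use of Dijkstra's algorithm for the path search, and the sample network are all hypothetical.

```python
import heapq

V_MIN, V_MAX = 30.0, 160.0  # km/h, plausibility bounds taken from the text

def shortest_path(graph, src, dst):
    """Dijkstra over graph = {node: {neighbor: distance_km}}; returns a node list or None."""
    best, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, d in graph.get(node, {}).items():
            nd = dist + d
            if nd < best.get(nxt, float("inf")):
                best[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    return None

def section_speeds(trajectory, graph):
    """trajectory: time-ordered [(node, time_h)] -> [((node_a, node_b), speed_kmh)]."""
    out = []
    for (a, ta), (b, tb) in zip(trajectory, trajectory[1:]):
        # Adjacent gantries form a known section; otherwise fill the gap by path search
        path = [a, b] if b in graph.get(a, {}) else shortest_path(graph, a, b)
        if path is None or tb <= ta:
            continue
        total_km = sum(graph[u][w] for u, w in zip(path, path[1:]))
        speed = total_km / (tb - ta)
        if V_MIN <= speed <= V_MAX:
            # Every sub-section on the path inherits the same average speed
            out.extend(((u, w), speed) for u, w in zip(path, path[1:]))
    return out

# Hypothetical network N1 -> N2 -> N3 -> N4; the record at N3 is missing
graph = {"N1": {"N2": 10.0}, "N2": {"N3": 8.0}, "N3": {"N4": 12.0}}
speeds = section_speeds([("N1", 0.0), ("N2", 0.125), ("N4", 0.375)], graph)
print(speeds)
```

In the example, the missing gantry N3 is recovered by the path search, and both filled sub-sections receive the 80 km/h average measured between N2 and N4.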

Outlier Information Detection Algorithm for Road Section.
To better analyze the speed distribution features of each road section, a noise data cleaning model is constructed to detect and eliminate outliers in the data. The basic idea of the model is to use the upper and lower limits of the speed boxplot to detect abnormal points and to determine the threshold interval for filtering abnormal speed data. When a large amount of expressway ETC transaction data is collected, according to the central limit theorem, the road section speed data set should approximately follow a normal distribution. Because the upper and lower limits of the speed boxplot roughly match the 3σ interval of the normal distribution, outlier detection and filtering through boxplot analysis is reasonable. As shown in Figure 3, there are 6 elements in the boxplot: q1 is the first quartile; q2 is the median; q3 is the third quartile; IQR = q3 − q1 is the distance between q1 and q3; and there are also the upper limit and the lower limit. Here, q1 is the speed that exceeds the speeds of 25% of the traffic flow, q2 is the speed that exceeds the speeds of 50% of the traffic flow, and q3 is the speed that exceeds the speeds of 75% of the traffic flow. Thus, the upper and lower limits of the noise data cleaning threshold model are obtained as follows:
Lower limit: q1 − 1.5 × IQR,
Upper limit: q3 + 1.5 × IQR.
Then, the threshold range of speed filtering is obtained as follows: $v_T \in$ (Lower limit, Upper limit).
The speed data of the road section within the range $v_T$ is retained, and the outlier data is deleted.
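The boxplot threshold described above can be sketched as follows. The function name `boxplot_filter`, the choice of the "inclusive" quantile method, and the sample speeds are assumptions for illustration; the 1.5 × IQR limits and the open interval come from the text.

```python
import statistics

def boxplot_filter(speeds):
    """Keep speeds strictly inside (q1 - 1.5*IQR, q3 + 1.5*IQR); return kept values and limits."""
    q1, _q2, q3 = statistics.quantiles(speeds, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in speeds if lower < v < upper]
    return kept, (lower, upper)

# Hypothetical section speeds with one very slow and one very fast record
sample = [95, 98, 100, 102, 105, 99, 101, 103, 97, 40, 170]
kept, limits = boxplot_filter(sample)
print(limits, kept)
```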

Feature Vector Model of Expressway Speed.
Vehicles driving on the expressway have different speeds at different times and on different road sections. Through statistical analysis of the traffic speed features of a road section, the potential connection between vehicle speed and the road speed limit can be obtained, after which the road section speed feature vector model is constructed. The feature vector is divided into three categories: the first is the frequency-speed percentile feature, the second is the road section speed evaluation feature, and the third is the road section speed time domain feature.

Road Section Frequency-Speed Percentile Feature.
The road section frequency-speed percentile feature reflects the distribution of the section speed at different times. It includes the speed values of the 50th percentile, the upper and lower 25th percentiles, and the upper and lower 15th percentiles of the speed set of the road section, which are converted into the multidimensional feature vector α. It can be expressed as follows:
$$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_6)$$
where $\alpha_1 \sim \alpha_6$ are, respectively, the 15th, 25th, 50th, 75th, 85th, and 95th percentiles of the total section speed distribution, which describe the overall distribution of the speed in the road section.
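The percentile feature vector α can be computed with a short sketch. The function name `percentile_features`, the "inclusive" interpolation rule, and the sample speed set are assumptions; the six percentile levels come from the text.

```python
import statistics

PERCENTILES = (15, 25, 50, 75, 85, 95)  # levels of alpha_1 .. alpha_6

def percentile_features(speeds):
    """Return [alpha_1, ..., alpha_6] for one road section speed set."""
    # quantiles(n=100) returns the cut points for percentiles 1..99
    cuts = statistics.quantiles(speeds, n=100, method="inclusive")
    return [cuts[p - 1] for p in PERCENTILES]

# Hypothetical speed set: one vehicle at each integer speed from 60 to 160 km/h
alpha = percentile_features(list(range(60, 161)))
print(alpha)
```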

Road Section Speed Evaluation Feature.
The road section speed evaluation feature is described by relevant evaluation indexes in the frequency domain, including the average speed, the speed standard deviation, and the speed dispersion, which are converted into the multidimensional feature vector β. It is expressed as follows:
$$\beta = (\beta_1, \beta_2, \beta_3, \beta_4)$$
where $\beta_1$ is the mode of the section speed, representing the general level of vehicle speed; $\beta_2$ and $\beta_3$ are the overall average section speed μ and the standard deviation σ, respectively; and $\beta_4$ is the speed dispersion index, which reflects the range of variation and dispersion of the speed data.
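A minimal sketch of the evaluation feature vector β follows. Several details are assumptions rather than statements from the text: speeds are rounded to whole km/h before taking the mode, the population standard deviation is used, and the dispersion index is taken as the 85th minus the 15th percentile speed.

```python
import statistics

def evaluation_features(speeds):
    """Return [mode, mean, standard deviation, dispersion] for one section speed set."""
    mode = statistics.mode(round(v) for v in speeds)  # assumed: mode of rounded speeds
    mean = statistics.fmean(speeds)
    std = statistics.pstdev(speeds)                   # assumed: population std
    cuts = statistics.quantiles(speeds, n=100, method="inclusive")
    dispersion = cuts[84] - cuts[14]                  # assumed: v85 - v15
    return [mode, mean, std, dispersion]

# Hypothetical section speed set
beta = evaluation_features([90, 100, 100, 110])
print(beta)
```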

Road Section Speed Time Domain Feature.
The road section speed time domain feature reflects how the speed of the traffic flow evolves on different road sections under different speed limit conditions. If the section speed data were analyzed by day without considering the features of different periods, the result would be easily affected by road congestion and other factors in individual periods and could not reflect the speed evolution of the road. Therefore, it is necessary to fully integrate the speed feature information of roads in different periods. The whole day is divided into 24 time periods, denoted as 0, 1, ..., 23. Then, the speed information of each road section in each period is mined and counted to find the speed change law of each road section. As shown in Figure 3, the multidimensional speed time domain feature vector is constructed. It is expressed as follows:
$$\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n)$$
where $\gamma_1 \sim \gamma_n$ are the average road section speeds of the periods in the data sample; that is, the average road section speeds of the 24 time periods of the day are sorted from largest to smallest, and the first n values are taken. Here, we take the first 6 values to avoid the disturbance caused by the relatively low road section speeds that result from traffic congestion or road maintenance in some periods.
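The time domain feature can be sketched as below. The input layout `(hour_of_day, speed_kmh)`, the function name `time_domain_features`, and the sample data are assumptions; the rule of keeping the 6 largest hourly averages comes from the text.

```python
from collections import defaultdict

def time_domain_features(records, n=6):
    """records: iterable of (hour_of_day, speed_kmh); returns the n largest hourly means."""
    by_hour = defaultdict(list)
    for hour, speed in records:
        by_hour[hour].append(speed)
    hourly_means = [sum(v) / len(v) for v in by_hour.values()]
    # Sort from largest to smallest so congested (slow) periods fall off the end
    return sorted(hourly_means, reverse=True)[:n]

# Hypothetical day: hours 0..6 flow freely, hour 7 is congested
records = [(h, 100 + h) for h in range(7) for _ in range(2)] + [(7, 40), (7, 50)]
gamma = time_domain_features(records)
print(gamma)
```

In the example the congested hour (mean 45 km/h) is excluded, which is the effect the text attributes to keeping only the first 6 values.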

Algorithm 1: Section speed data construction.
Input: trajectory data of a car D, expressway road network data G
Output: speed data of the road section
function Sections(D) // The vehicle trajectory data is divided into the data of each section of the vehicle
…
Distance_k ← G_k.Distance // Get the road section distance from the expressway network, where k = Q_j
t = Sec_j.delta // Extract the time required for the vehicle to pass through the road section
v_j = Distance_k / t // Speed of the vehicle passing through the road section
R_j.V ← v_j // Add the speed attribute
if Q_j not in G then // The road information cannot be found in the expressway network, and there is uncollected node information between the two nodes of the road section
{N_1, N_2, …, N_Z} ← shortest_path(G, N_j) // Search the shortest path between the two nodes and get the path node data set
{…, path_{Z-1}} ← G_k.Distance // Get the road section distances from the expressway network and add them to path
…

Sample Imbalance Processing.
The road speed limit classification values constructed in this paper conform to the specified values of 80 km/h, 100 km/h, 110 km/h, and 120 km/h. Because most of the collected data has a speed limit of 100 km/h, the amount of 100 km/h data is far larger than that of the other three classes, 80 km/h, 110 km/h, and 120 km/h. This creates an imbalance among the sample categories. There are two common ways to handle unbalanced data samples: oversampling and undersampling [20]. Oversampling copies the minority samples multiple times to expand their data volume. Because it duplicates preexisting sample data, it leads to a certain degree of overfitting during model training. Undersampling randomly removes part of the data from the majority samples, or selects a part of the samples in this category according to a certain proportion.
This causes the model to learn only part of the rules of the sample data, so it cannot effectively reflect the complete pattern of the samples in this category. To alleviate these problems, an improved random oversampling method, SMOTE [21], is used. It analyzes the minority samples and uses their similarity in feature space to add simulated new samples to the data set. The number of minority samples in the original data set is expanded, and the disparity between categories is reduced; therefore, the imbalance problem is solved. The SMOTE process can be divided into the following steps:
Step 1. Select the speed feature vector sets of the minority sample categories with speed limit values of 80, 110, and 120 km/h.
Step 2. For each category of sample set, use the Euclidean distance as the metric in the feature space, and iteratively calculate the distance between the samples in the set to determine the k-nearest neighbor sample points.
Step 3. Perform random linear interpolation on the line connecting each sample point and the selected s neighboring sample points to generate new samples.
Step 4. Repeat Step 2 and Step 3 until the categories of the expressway speed feature vector data set reach a balance.
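Steps 2 and 3 above can be sketched in plain Python. This is a simplified illustration of the SMOTE idea, not the reference implementation cited as [21]: the function name `smote`, the parameters `n_new`, `k`, and `seed`, and the toy two-dimensional samples are assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating toward k-nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Step 2: k nearest neighbors of x by Euclidean distance within the same class
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        # Step 3: random point on the line segment between x and the chosen neighbor
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Hypothetical minority class: the four corners of the unit square
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=10, k=2)
print(len(new_samples))
```

Step 4 would call such a routine once per minority class, with `n_new` chosen so that each class reaches the size of the majority class.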

Maximum Speed Limit Recognition Classification Model.
The acquisition of speed limit information on expressways is an important factor that affects driving safety. Different road sections correspond to different speed limits, and the differences in speed limits directly affect the state of the vehicles, which makes the relevant data show a certain pattern. Using a strong learner to learn from the related data in depth can achieve high-precision recognition results. XGBoost is an ensemble learning method based on a boosting algorithm [22]. Its base learner is usually a decision tree model, and each newly generated tree learns the residual between the true value and the current prediction of all existing trees. Then, the results of all trees are accumulated as the final result to obtain a better classification accuracy [23][24][25]. By using the XGBoost algorithm as the classifier for identifying the maximum speed limit on expressways, the maximum speed limit can be determined accurately.
A sample data set is constructed by extracting 16-dimensional speed feature vectors from the expressway section data with known speed limit information. Suppose the data set is $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\}$, where $x_i$ ($i = 1, 2, \ldots, M$) is the speed feature vector of the ith sample and $y_i$ ($i = 1, 2, \ldots, M$) is the output value of the ith sample, that is, the road speed limit class label corresponding to $x_i$. Assuming that the XGBoost ensemble learning model integrates a total of K regression trees, the prediction result of the XGBoost algorithm can be expressed as in the following equation:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where K is the number of trees, $f_k$ corresponds to the kth regression tree with structure $q_k$ and leaf weight $w_k$, F is the ensemble classifier composed of all regression trees, and $f_k(x_i)$ corresponds to the predicted score of the kth regression tree on the sample $x_i$. The objective function of XGBoost consists of a loss function and a regular term, expressed as follows:
$$Obj = \sum_{i=1}^{M} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where l is the error function and $\Omega(f_k)$ is the regularization term. The regular term can be expressed as follows:
$$\Omega(f_k) = \gamma T_k + \frac{1}{2} \lambda \lVert w_k \rVert^2$$
where γ represents the penalty coefficient of the model, with value range [0,1]; $T_k$ represents the number of leaves of the kth tree; and λ is the regular term coefficient. The XGBoost algorithm adopts an additive step-by-step strategy in the training process. First, the first tree is optimized, then the second tree, and so on until the kth tree is optimized, and the loss function is continuously reduced during the optimization process. By adding an incremental function $f_t$ in each iteration to optimize the objective function, the prediction accuracy can be improved, and the calculation can be expressed as in the following equation:
$$Obj^{(t)} = \sum_{i=1}^{M} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + c$$
where c is a constant term and $\hat{y}_i^{(t-1)}$ represents the predicted value in the (t − 1)th iteration on the ith sample.
Then, the objective is expanded with a second-order Taylor expansion, and the constant term is discarded to reduce the running time of the model, expressed as follows:
$$Obj^{(t)} \approx \sum_{i=1}^{M} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) = \sum_{j=1}^{T_t} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T_t$$
where $I_j = \{ i \mid q(x_i) = j \}$ represents the sample set of leaf j, and $g_i$ and $h_i$ are the first derivative and the second derivative of the loss function, respectively. The objective function is thus converted into a quadratic function $Obj^{(t)}$ of $w_j$, whose minimum gives the optimal prediction score of each leaf node and the optimal value of the objective function as follows:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \quad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T_t} \frac{G_j^2}{H_j + \lambda} + \gamma T_t$$
where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. After that, the optimization of the XGBoost parameters mainly includes the following 4 steps:
Step 1. Choose a relatively high learning rate, set reasonable initial values of the booster parameters, and use K-fold cross-validation in each iteration to get the ideal number of decision trees.
Step 2. With the learning rate and the number of decision trees determined in Step 1, use K-fold cross-validation and grid search to optimize the parameters of each booster.
Step 3. In the same way as Step 2, adjust the regularization parameters based on the given data to reduce overfitting.
Step 4. Appropriately reduce the learning rate to determine the final ideal parameter combination of the model.
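The closed-form leaf score in the derivation above can be checked numerically with a small sketch. It assumes a squared-error loss l = (y − ŷ)²/2, for which g = ŷ − y and h = 1; the paper's classifier would use a classification loss, so the loss, the function name `leaf_weight_and_score`, and the sample values are illustrative assumptions.

```python
def leaf_weight_and_score(grads, hessians, lam=1.0, gamma=0.0):
    """Optimal weight and objective contribution of one leaf.

    w* = -G / (H + lam); score = -0.5 * G^2 / (H + lam) + gamma
    """
    G, H = sum(grads), sum(hessians)
    weight = -G / (H + lam)
    score = -0.5 * G * G / (H + lam) + gamma
    return weight, score

# Three samples fall into the leaf; the previous prediction is 0 for all of them
y_true = [1.0, 2.0, 3.0]
y_prev = [0.0, 0.0, 0.0]
grads = [p - y for p, y in zip(y_prev, y_true)]  # g = y_hat - y for squared error
hessians = [1.0] * len(y_true)                   # h = 1 for squared error
weight, score = leaf_weight_and_score(grads, hessians, lam=1.0)
print(weight, score)
```

With λ = 0 the weight would equal the mean residual (2.0); the regular term coefficient shrinks it to 1.5, which shows how λ damps each leaf's contribution.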

Maximum Speed Limit Recognition Model.
The problem of identifying the maximum speed limit on expressways is a classification problem. The framework of the identification model is shown in Figure 4. Dynamic identification of the expressway speed limit is realized in the following steps. First, data cleaning is applied to the ETC gantry transaction data to remove duplicated data and erroneous data. Then, the vehicle speed recognition algorithm is used to find the missing records in the ETC gantry transaction data and to accurately restore the gantry distribution on expressways. The speed of a road section is obtained by calculating the speed of the vehicle between the gantries. However, the section speeds contain some very large or very small outliers, so the boxplot is used to remove them. Next, the speed of each driving section is analyzed, and the models of the frequency-speed percentile feature, the section speed evaluation feature, and the section speed time domain feature are constructed. Since the numbers of samples of the various classes in the data are quite different, the oversampling algorithm is used to expand the minority samples to obtain balanced data. Finally, the data are divided into training data and test data. The training data are input into the XGBoost algorithm for training and learning; the training process is shown as process 1 in Figure 4. At the same time, grid search and cross-validation are used to find the optimal parameters of each booster in XGBoost; the optimization process is shown as process 2 in Figure 4.

Introduction of Experimental Data.
The ETC gantry system is one of the main components of the expressway ETC system and is used for real-time supervision and recording of vehicle driving information, vehicle path identification, toll data fitting, and other functions [14]. The experimental data mainly includes three categories. The first is the ETC transaction data collected by the ETC gantries on various expressway sections in Fujian Province over the 9 days from September 3 to September 11, 2020. It covers 50 expressways, including Fuyin Expressway, Xiazhang Expressway, and Longchang Expressway, with 534 sections and about 100 million records. The average section length is 8.9 km, 85% of the sections are shorter than 16 km, and the maximum length is 30 km; the distribution is shown in Figure 5. These data are sourced from Fujian Provincial Expressway Information Technology Co., Ltd. The main attributes of the data are shown in Table 1. The second category is the road speed limit data, including the name of each road section and its maximum speed limit value, which is derived from the online announcements of the Fujian traffic police. It is used for model learning, training, and testing. The third category is the distance of each expressway section, obtained from Amap, including the gantry node pair of each section and the actual road section distance.

ETC Data Preprocessing.
By matching the initially cleaned ETC data with the road network topology data, the road section speed of each vehicle is calculated, and the expressway road section speed data set is then constructed. … main characteristics of the data. Owing to random factors, the data may contain a certain amount of outliers; these outlier values of each road section can be detected through the noise data filtering model. After the noise data is eliminated, the preprocessed road section speed data is obtained. Figure 6 shows the speed data of one road section from September 3, 2020, to September 11, 2020. The abscissa denotes the date, and the ordinate represents the magnitude of the road section speed. Each box represents the overall distribution of the speed of the road section on that day, and the black dots represent the outliers to be deleted. The original road section speed data comprises around 12.29 million records, of which about 1.19 million (9.68%) are abnormal, leaving approximately 11.1 million preprocessed section speed records.
Figure 4: The flowchart of the expressway speed limit information recognition model.

Road Section Velocity Feature Vector.
After the preprocessed road section speed data set is obtained, the road section speed feature vector model is constructed based on statistical analysis of the expressway road section speed features by day. Thus, the expressway road section data set contains 3 types of features, which form a 16-dimensional feature vector, together with the sample classification label. The attributes are shown in Tables 3-5, for example for the section between ETC gantry 340507 and ETC gantry 351C03. Date represents the date when the traffic condition occurred; $\alpha_1 - \alpha_6$ represent the 15th to 95th percentile driving speeds of each section; $\beta_1 - \beta_4$ represent the mode, average, standard deviation, and dispersion of the vehicle speed; $\gamma_1 - \gamma_6$ represent the first 6 values after sorting the average road section speeds of the 24 time periods of the day; and l represents the maximum speed limit value.

Balance Analysis of Sample Data.
There are 5,081 samples in the road section speed feature vector data set, among which the samples with 80 km/h, 100 km/h, 110 km/h, and 120 km/h speed limits account for 5.31%, 87.24%, 9.39%, and 2.83%, respectively. The categories are seriously unbalanced, which adversely affects the performance of model identification. Therefore, SMOTE is used to oversample the sample data with speed limits of 80, 110, and 120 km/h, which makes it possible to achieve a relative balance among all classes of samples. In the experiment, the new data obtained by the SMOTE algorithm is used as the input of the algorithm model. The sample data consists of training sample data and testing sample data.

The Result of the Model's Performance.
The parameter setting of the XGBoost algorithm is an important factor that affects the performance of the model. To improve the accuracy of the model, a set of sensitivity experiments is conducted to optimize its performance. First, four booster parameters that have a significant impact on the model are identified: n_estimators, learn_rate, max_depth, and min_child_weight. Second, a combination of grid search and K-fold cross-validation (GK) is used to obtain the optimal parameters, with K = 5 for cross-validation. Parameter optimization follows the method of Section 3.4. The search range, step length, and optimized value of each parameter are shown in Table 6. The model is established through the above processing, the test data is used to verify the effectiveness of the model, and the resulting confusion matrix is shown in Table 7.
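The GK procedure can be sketched generically as below. This is not the authors' code and does not call XGBoost: `toy_fold_score` is a hypothetical stand-in for "train on the training fold, score on the validation fold", and the grid values are invented for illustration (the actual search ranges are in Table 6).

```python
import itertools

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def grid_search(param_grid, cv_score):
    """Return (best_params, best_score) over the Cartesian product of param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_fold_score(params, train_idx, val_idx):
    # Hypothetical stand-in for fitting a model on train_idx and scoring on val_idx
    return -abs(params["max_depth"] - 6) - abs(params["n_estimators"] - 300) / 100

def cv_score(params, n_samples=20, k=5):
    scores = [toy_fold_score(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

grid = {"max_depth": [4, 6, 8], "n_estimators": [100, 300, 500]}  # invented values
best_params, best_score = grid_search(grid, cv_score)
print(best_params, best_score)
```

In practice `toy_fold_score` would be replaced by training the classifier on the training fold and returning its validation accuracy.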
Of the 3295 test samples, 3212 were identified correctly, giving an accuracy of 97.5%. The recognition accuracy for the 80 km/h data is 100%, because the data with a speed limit of 80 km/h differs considerably from the other categories and can be distinguished more easily. However, the difference between the 100 km/h and 110 km/h categories is very small, so they are easily confused. Of the 824 samples with a speed limit of 100 km/h, 759 were identified correctly and 47 were identified as 110 km/h, which lowers the accuracy to some extent. For the same reason, the accuracy for the 110 km/h limit is also lower than that of the other three categories.
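The figures quoted above can be reproduced with a short calculation. Only the four counts come from the text; the variable names are illustrative, and the per-class rate is computed here simply as correctly identified samples divided by the samples of that class.

```python
# Counts reported in the text
total_samples, total_correct = 3295, 3212
class_100_samples, class_100_correct = 824, 759

overall_accuracy = total_correct / total_samples
class_100_rate = class_100_correct / class_100_samples

print(f"overall accuracy: {overall_accuracy:.1%}")
print(f"100 km/h class identified correctly: {class_100_rate:.1%}")
```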

Comparison and Analysis
(1) Impact Analysis of Data Equalization. To verify the influence of SMOTE oversampling on the model, the original data set and the data set processed by the SMOTE algorithm are each used for training and learning. The other steps of the model are kept identical, and two model classifiers are obtained. The comparison of classification results is shown in Table 8. The first category is the model result corresponding to the data set processed by the SMOTE algorithm, and the second category is the model result corresponding to the original data set. The following can be seen from Table 8:
(1) After the SMOTE algorithm oversampled the data, the accuracy, recall rate, and F1-score of all categories were greatly improved.
(2) The data with a speed limit of 100 km/h has the most samples. Although this class is not expanded in the oversampling process, its evaluation indexes still improve, indicating that the SMOTE algorithm not only greatly improves the recognition accuracy of minority speed limit information but also effectively improves the recognition accuracy of majority speed limit information.
(3) The SMOTE algorithm improves the prediction accuracy of data with speed limits of 110 km/h and 120 km/h, and the recall rate and F1-score are also greatly improved. It has little effect on the prediction accuracy of data with a speed limit of 80 km/h but has a great influence on its recall rate and F1-score.
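The per-class indexes compared here (precision, recall rate, and F1-score) can all be derived from a confusion matrix. The sketch below shows one way to compute them; the function name `per_class_metrics` and the 2 x 2 toy matrix are assumptions for illustration and do not reproduce the values of Table 7 or Table 8.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp  # predicted i, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up confusion matrix for two speed limit classes (rows: true, columns: predicted).
labels = [100, 110]
toy_confusion = [[90, 10],
                 [5, 45]]
toy_metrics = per_class_metrics(toy_confusion, labels)
for label in labels:
    print(label, toy_metrics[label])
```

Reporting recall and F1-score alongside accuracy matters for unbalanced data, because a classifier that favours the majority class can reach a high overall accuracy while still missing most minority samples.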
(  Figure 7, where A1-A7 represent models $A_{\alpha}$, $A_{\beta}$, $A_{\gamma}$, $A_{\alpha,\beta}$, $A_{\alpha,\gamma}$, $A_{\beta,\gamma}$, and $A_{\alpha,\beta,\gamma}$, respectively. The following can be seen:
(1) When only a single feature is added, the best model prediction effect is obtained by adding the frequency-velocity percentile feature, followed by the interval velocity evaluation feature model and the interval velocity time domain feature model.
(2) When two features are added, the prediction effect is improved compared to a single feature. When all the features are added, the prediction effect is the best.
(3) The contribution of each feature in the speed feature vector model of the expressway section to the prediction model, in descending order, is the road section speed-frequency percentile feature, the road section speed time domain feature, and the road section speed evaluation feature; the contribution of the feature vector in each feature is shown in Figure 8.  Table 9. From the comparison of six different classification methods in Table 7, SVM, AdaBoost, and LR

Conclusion
This paper proposes a method of identifying expressway speed limit information based on ETC data mining analysis. First, the abnormal data of the ETC gantries is processed, and a road section speed data set construction algorithm is proposed. The speed data of each road section is constructed, and the outlier samples in each road section are eliminated by boxplot analysis to ensure the accuracy of the ETC data expression.
Then, the SMOTE algorithm is used to oversample the samples of the minority speed limit categories to achieve balance among the various types of road section speed limit information. Finally, the oversampled training samples are input into the proposed GC-XGBoost (grid search + cross-validation + XGBoost) algorithm for training and learning, and the result is compared with multiple similar algorithms. The experimental results show the following:
(1) The contribution of each feature to the prediction model, in descending order, is the speed-frequency percentile feature, the time domain feature, and the speed evaluation feature. All three categories of features improve the prediction model, and the frequency-speed percentile feature gives the greatest improvement.
(2) In the test sample data, the recognition accuracy for the 80 km/h, 100 km/h, 110 km/h, and 120 km/h speed limit categories is 100%, 92.1%, 97.9%, and 99.9%, respectively; the overall accuracy is 97.5%. The gap between the 100 km/h and 110 km/h categories is very small, so their recognition accuracy is relatively low.
(3) The speed limit recognition accuracy of GC-XGBoost is 97.5%, with a precision of 0.98, a recall of 0.97, and an F1-score of 0.97. These results are significantly better than those of the other five algorithms, showing that the method can accurately identify the maximum speed limit information of expressways.
This paper considers the speed features of hybrid vehicles, which makes the method suitable for identifying the maximum speed limit information of expressways. However, this work still has some limitations: (1) The speed limit recognition of 100 km/h and 110 km/h is less effective. More speed limit features can be considered to explore the differences between the two and improve their recognition. (2) This study does not consider the different speed limit values of different lanes on the same road. In future work, the speed limit information on different lanes of the same road can be analyzed through vehicle classification and lane number, in order to construct a more complete expressway speed limit information recognition model.

Data Availability
The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.