Traffic Flow Anomaly Detection Based on Robust Ridge Regression with Particle Swarm Optimization Algorithm

Traffic flow anomaly detection helps improve the efficiency and reliability of fault detection and the overall effectiveness of traffic operation. The data collected by traffic flow sensors contain a lot of noise due to equipment failure, environmental interference, and other factors. To handle heavily noisy traffic flow data, a traffic flow anomaly detection method based on robust ridge regression with the particle swarm optimization (PSO) algorithm is proposed. Feature sets are constructed that contain historical characteristics with a strong linear correlation and statistical characteristics extracted with the optimal sliding window. The feature sets are then input to the PSO-Huber-Ridge model, which outputs the forecasted traffic flow. The Huber loss function is used to reduce noise interference in the traffic flow. The L2 regular term of the ridge regression is employed to reduce overfitting during model training. A fitness function is constructed that balances the k-fold cross-validation root mean square error against the k-fold cross-validation average absolute error through the control parameter η, in order to improve the optimization efficiency of the optimization algorithm and the generalization ability of the proposed model. The hyperparameters of the robust ridge regression forecast model are optimized by the PSO algorithm. A traffic flow data set is used to train and validate the proposed model. Compared with other optimization methods, the proposed model has the lowest RMSE, MAE, and MAPE. Finally, the traffic flow forecasted by the proposed model is used to perform anomaly detection. Abnormal errors between the forecasted value and the actual value are detected with an abnormal traffic flow threshold based on a sliding window. The experimental results verify the validity of the proposed anomaly detection model.


Introduction
Traffic flow anomaly detection plays an essential role in the traffic field. Traffic jams have become common in big cities and have received considerable critical attention. A traffic flow anomaly detection model can detect abnormal traffic flow and can be built on a traffic flow forecast model, which helps to avoid traffic congestion in time. An accurate forecast of traffic flow not only provides a basis for real-time traffic control but also supports the alleviation of traffic jams and the effective use of traffic networks, and the forecast result directly affects the accuracy of traffic anomaly detection. Useful information can be extracted from massive traffic flow data through the traffic flow forecast model so as to quickly forecast the short-term traffic flow and detect traffic flow abnormalities in time, thus improving traffic operation efficiency.
In recent years, many experts and scholars have studied traffic flow forecasting. The ARIMA model is a classic time series model that is often used in traffic flow forecasts. Kumar and Vanajakshi proposed a SARIMA-based traffic flow forecast scheme, which effectively solved the problem of the massive data required for model training [1]. Shahriari et al. combined the bootstrap with the ARIMA model, which overcame the lack of theoretical support of nonparametric methods and improved the forecast accuracy of the model [2]. Luo et al. combined an improved SARIMA model with the genetic algorithm and tested the model on real traffic flow, with good forecast results [3]. The ARIMA model forecasts the traffic flow based on historical values. If the model training data contain noise, the model's performance is greatly reduced.
The neural network model can fit complex data relationships and can learn the nonlinear relationships implicit in traffic flow. Qu et al. proposed a batch learning method to solve the time-consuming training of the traffic flow neural network prediction model, which effectively reduced the training time of the neural network [4]. Zhang et al. used a spatiotemporal feature extraction algorithm to extract the temporal and spatial features in traffic flow. The features were input into a recurrent neural network for modeling and forecasting, which effectively improved the forecast performance of the model [5]. Zhang et al. proposed a multitask deep learning model to forecast the traffic network flow. Nonlinear Granger causality analysis was used to select features for the model, and the Bayesian optimization algorithm was used to optimize the model parameters. The forecast performance was better than that of the single deep learning model [6]. Do et al. used temporal and spatial attention mechanisms to help neural network models fully explore the temporal and spatial characteristics of the traffic flow, which not only effectively improved the prediction performance of the model but also enhanced its interpretability [7]. Neural network models overfit easily, and their calculation cost is much higher than that of traditional traffic flow forecast models. Because neural networks can fit nonlinear relationships in data, they can easily mistake noise for an implicit nonlinear relationship, which reduces the generalization ability of the model.
The support vector regression machine fits data based on the strategy of structural risk minimization and is a common model in the field of traffic flow forecasts. Wang et al. proposed an adaptive traffic flow forecast framework, which used the Bayesian optimization algorithm to optimize the parameters of the support vector machine model. The forecast performance was better than that of the SARIMA model [8]. Luo et al. used the discrete Fourier transform to extract the trend information in traffic flow and used support vector machines for error compensation, which improved the forecast accuracy of the model [9]. The support vector regression machine solves its optimization problem by quadratic programming, so the model training time increases greatly when the sample size is large. The support vector regression machine is also very sensitive to noise in the data. When it selects noisy samples as support vectors, the forecast performance of the model is poor.
Traffic flow anomaly detection plays an important role in the field of urban traffic control, and many studies have addressed it. Djenouri et al. proposed a framework for detecting temporal and spatial traffic anomalies. The KNN algorithm was applied to the space-time traffic database, and experiments were run on the traffic flows at ten different locations. Experimental results showed that the performance of the proposed framework was better than that of the baseline model [10]. Yujun et al. proposed a hybrid model that contained the Poisson mixture model and the coupled hidden Markov model. The proposed model considered the spatial correlation of traffic flow and the degree of traffic congestion. Semisynthetic and real traffic anomaly data were used to verify the validity of the model [11]. Zhang et al. employed dictionary-based compression theory to identify the spatial and temporal characteristics of traffic flow and used an anomaly index to quantify the degree of traffic anomalies [12]. The proposed method can clearly detect the location of spatial traffic flow anomalies. Noise in traffic data may lead traffic anomaly detection models to false detections, which may affect the normal operation of traffic networks.
Influenced by factors such as mechanical damage, line aging, signal loss, and environmental interference, the data detected by the traffic flow sensor contain a lot of noise. The Huber loss function is a mixture of the $L_1$ and $L_2$ loss functions and is insensitive to noise [13]. The $L_2$ regular term of the ridge regression can effectively avoid overfitting caused by model training [14]. To improve the generalization performance of the model, the sum of $\mathrm{RMSE}_{k_{cv}}$ and $\eta \cdot \mathrm{MAE}_{k_{cv}}$ on the training set, based on k-fold cross-validation, is constructed as the fitness function, and the PSO algorithm is used to optimize the model hyperparameters. The PSO algorithm originated from research on the foraging process of birds [15]. It has a simple structure. Each particle in the particle swarm has three main parameters: position, velocity, and fitness. In recent years, many studies have achieved good results using the particle swarm optimization algorithm [16][17][18][19][20].
To solve the problem of noise in traffic flow data, a Huber-Ridge traffic flow anomaly detection model with the particle swarm optimization (PSO) algorithm is proposed. The Huber-Ridge model is used to reduce the negative impact of noise in the data. The performance of the Huber-Ridge model depends on its hyperparameters. Therefore, it is very important to determine the optimal model hyperparameters. A PSO algorithm based on the proposed fitness function is used to search for the optimal hyperparameters of the model so that the model has the best performance. The remaining part of the paper proceeds as follows: Section 2 introduces the theory of the Huber-Ridge algorithm; Section 3 presents the data preprocessing steps and the steps that use the PSO algorithm to optimize the Huber-Ridge model parameters; Section 4 describes the model evaluation indexes; Section 5 presents the experiments, which contain the comparison of the forecast models and the results of traffic flow anomaly detection; Section 6 gives the conclusions.

Huber Function.
The Huber function, which combines the $L_1$ loss function and the $L_2$ loss function, can effectively limit the interference of noise in the data during data fitting [21]. Its robustness is better than that of the $L_1$ and $L_2$ loss functions. The Huber loss function, in its standard form, is defined as

$$h(u) = \begin{cases} \dfrac{1}{2}u^{2}, & |u| \le M, \\ M|u| - \dfrac{1}{2}M^{2}, & |u| > M. \end{cases} \tag{1}$$

The definitions of the $L_1$ loss function and the $L_2$ loss function are shown in equations (2) and (3):

$$L_1(u) = |u|, \tag{2}$$

$$L_2(u) = u^{2}, \tag{3}$$

where $u$ is the error between the actual value and the estimated value, and $M$ is the threshold. When the threshold $M$ is 1, the comparison of the Huber loss function, the $L_1$ loss function, and the $L_2$ loss function is shown in Figure 1. When $|u|$ is smaller than the threshold $M$, the Huber loss function is quadratic like the $L_2$ loss function, so it penalizes small errors less than the $L_1$ loss function and stays smooth around zero. When $|u|$ is greater than the threshold $M$, the Huber loss function grows linearly like the $L_1$ loss function, so it penalizes large errors less than the $L_2$ loss function, which limits the influence of noisy samples. Therefore, the Huber loss function is quadratic for smaller errors and linear for larger errors.
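As an illustration of the three loss functions compared above, the following minimal sketch evaluates them on a few errors. It assumes the standard Huber form with a factor of 1/2 on the quadratic branch; the function names are hypothetical and only the Python standard library is used.

```python
def l1_loss(u):
    """Absolute-error (L1) loss."""
    return abs(u)


def l2_loss(u):
    """Squared-error (L2) loss."""
    return u * u


def huber_loss(u, M=1.0):
    """Huber loss: quadratic for |u| <= M, linear beyond the threshold M.

    Assumes the standard form 0.5*u^2 / (M*|u| - 0.5*M^2).
    """
    a = abs(u)
    if a <= M:
        return 0.5 * a * a
    return M * a - 0.5 * M * M


# Compare the three losses for small and large errors with M = 1.
for u in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(u, l1_loss(u), l2_loss(u), huber_loss(u, M=1.0))
```

For errors well beyond the threshold, the Huber value grows like the L1 loss rather than the L2 loss, which is the property the robust model relies on.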

Ridge Regression Model.
The ridge regression model was first proposed by Hoerl and Kennard. The ridge regression objective function adds the $L_2$ regular term to the least squares objective function [22]. Its definition is as follows:

$$\min_{w}\ \sum_{i=1}^{n}(y_i - x_i w)^{2} + \lambda\sum_{j=1}^{k}(w_j)^{2}, \tag{4}$$

where $\sum_{j=1}^{k}(w_j)^{2}$ is the $L_2$ regular term and $\lambda$ is the ridge parameter, which is the weight of the $L_2$ regular term.
For the linear regression model $y = xw + \varepsilon$, the least squares estimate of the regression coefficient is defined as follows:

$$\hat{w} = (x^{T}x)^{-1}x^{T}y, \tag{5}$$

where $x$ is the independent variable matrix and $y$ is the dependent variable vector. The mean square error of the least squares estimate is defined as follows:

$$\hat{w}_{\mathrm{mse}} = \sigma^{2}\sum_{i=1}^{k}\frac{1}{k_i}, \tag{6}$$

where $\sigma^{2}$ is the variance of the error term $\varepsilon$ and $k_i$ are the characteristic roots of $x^{T}x$. If there is a linear correlation between the independent variables, the matrix $x^{T}x$ is singular or nearly singular. Some of its characteristic roots $k_i$ are close to zero, resulting in a large $\hat{w}_{\mathrm{mse}}$. This indicates that there is a large error between the least squares estimate and the actual value. Adding the disturbance term $\lambda I$ ($\lambda > 0$) to the matrix $x^{T}x$ weakens the singularity, thereby reducing $\hat{w}_{\mathrm{mse}}$. The least squares estimate with the disturbance term added is the ridge estimate. The ridge estimate is defined as follows:

$$\hat{w}(\lambda) = (x^{T}x + \lambda I)^{-1}x^{T}y, \tag{7}$$

where $\lambda$ is the ridge parameter and $I$ is the identity matrix. $\hat{w}(\lambda)$ indicates the ridge estimate of the regression parameter $w$ when the ridge parameter is $\lambda$. When $\lambda = 0$, the ridge estimate is the least squares estimate. When the independent variables are linearly correlated, the ridge estimate improves the efficiency of parameter estimation: it is biased but has a lower variance than the least squares estimator.
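The effect of the disturbance term λI on nearly collinear variables can be sketched with a two-feature example. The data below are made up for illustration, the helper name `ridge_2d` is hypothetical, and the 2x2 system is solved in closed form so that only plain Python is needed.

```python
def ridge_2d(X, y, lam):
    """Ridge estimate (X^T X + lam*I)^{-1} X^T y for two features, no intercept."""
    a = sum(r[0] * r[0] for r in X) + lam      # (X^T X)[0][0] + lam
    b = sum(r[0] * r[1] for r in X)            # off-diagonal entry of X^T X
    d = sum(r[1] * r[1] for r in X) + lam      # (X^T X)[1][1] + lam
    g0 = sum(r[0] * t for r, t in zip(X, y))   # (X^T y)[0]
    g1 = sum(r[1] * t for r, t in zip(X, y))   # (X^T y)[1]
    det = a * d - b * b                        # close to zero when columns are collinear
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


# Two almost identical columns; y is roughly x1 + x2 plus small noise.
X = [[1.0, 1.001], [2.0, 1.999], [3.0, 3.001], [4.0, 3.999]]
y = [2.1, 3.9, 6.1, 7.9]

print("least squares (lam = 0):", ridge_2d(X, y, 0.0))
print("ridge (lam = 0.1):      ", ridge_2d(X, y, 0.1))
```

With λ = 0 the coefficients blow up in opposite directions because the determinant of the normal matrix is nearly zero, while a small positive λ returns coefficients close to 1 for both features.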

Huber-Ridge Regression.
Owen replaced the least-squares loss function with the Huber loss function and thereby converted the ridge regression into the Huber-Ridge regression [23]. The definition of the Huber-Ridge model is as follows:

$$\hat{w} = \arg\min_{w}\left\{\sum_{i=1}^{n} h(x_i w - y_i) + \frac{\lambda}{2}\sum_{j=1}^{k}(w_j)^{2}\right\}, \tag{8}$$

where $\hat{w}$ is the weight vector of the regression at which the objective function is smallest, $w_j$ represents the estimate of each regression coefficient, $\sum_{j=1}^{k}(w_j)^{2}$ is the $L_2$ regular term, and $\lambda/2$ is the weight of the $L_2$ regular term, which is used to balance the Huber loss function against the $L_2$ regular term. The Huber loss function helps the model avoid the influence of the data noise. The $L_2$ regular term shrinks the regression coefficients and helps the model avoid overfitting. The Huber-Ridge regression combines the robustness of the Huber regression to noise with the regularization of the ridge regression, which not only ensures the robustness of the model but also makes the regression model more stable. $\sum_{j=1}^{k}(w_j)^{2}$ can be written as $\|w\|_2^{2}$, the squared $L_2$ norm of the weight vector $w$. The objective function $f(w)$ is defined as follows:

$$f(w) = \sum_{i=1}^{n} h(u_i) + \frac{\lambda}{2}\|w\|_2^{2}, \tag{9}$$

where $u_i$ is the error of the $i$-th sample. Taking the partial derivative of the objective function $f(w)$ with respect to the weight vector $w$ and setting it to zero gives the condition satisfied by the weight vector at the minimum of the objective function. The solution process of equation (9) is as follows:

$$\frac{\partial f(w)}{\partial w} = \frac{\partial}{\partial w}\sum_{i=1}^{n} h(u_i) + \frac{\partial}{\partial w}\left(\frac{\lambda}{2}\|w\|_2^{2}\right) = 0, \tag{10}$$

where $u = xw - y$, $xw$ is the estimated value, and $y$ is the actual value. By the chain rule, the first term of equation (10) can be simplified as

$$\frac{\partial}{\partial w}\sum_{i=1}^{n} h(u_i) = \sum_{i=1}^{n}\frac{\partial h(u_i)}{\partial u_i}\,\frac{\partial u_i}{\partial w} = x^{T}\frac{\partial h(u)}{\partial u}. \tag{11}$$

Letting $\omega(u) = \partial h(u)/\partial u$, applied elementwise, equation (11) can be simplified as

$$\frac{\partial}{\partial w}\sum_{i=1}^{n} h(u_i) = x^{T}\omega(u). \tag{12}$$

The second term of equation (10) can be simplified as

$$\frac{\partial}{\partial w}\left(\frac{\lambda}{2}\|w\|_2^{2}\right) = \lambda w = \lambda I w. \tag{13}$$

In summary, equation (10) becomes

$$x^{T}\omega(u) + \lambda I w = 0,$$

where $I$ is the identity matrix. Solving this condition for $w$ leads to the expression for the weight vector in equation (16). The optimal threshold $M$ and the ridge parameter $\lambda$ can be found in a fixed interval through the optimization algorithm. The weight vector $w$ can be obtained by substituting the threshold value $M$, the ridge parameter $\lambda$, and the sample data into equation (16).
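One common way to solve the zero-gradient condition of a Huber loss with an L2 penalty numerically is iteratively reweighted least squares (IRLS). The sketch below is an illustration under that assumption, not necessarily the solution procedure of the paper: it assumes the standard Huber form with a factor of 1/2, rewrites the derivative of the loss as a per-sample weight times the error, and repeatedly solves the weighted ridge system. All function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def huber_ridge_fit(X, y, M=1.35, lam=1e-4, max_iter=200, tol=1e-10):
    """Minimise sum_i h(x_i w - y_i) + (lam/2)*||w||^2 by IRLS (no intercept)."""
    n, k = len(X), len(X[0])
    d = [1.0] * n   # per-sample weights; all ones reproduces plain ridge
    w = [0.0] * k
    for _ in range(max_iter):
        # Weighted normal equations: (X^T D X + lam*I) w = X^T D y
        A = [[sum(d[i] * X[i][p] * X[i][q] for i in range(n))
              + (lam if p == q else 0.0) for q in range(k)] for p in range(k)]
        b = [sum(d[i] * X[i][p] * y[i] for i in range(n)) for p in range(k)]
        w_new = solve_linear(A, b)
        u = [sum(X[i][p] * w_new[p] for p in range(k)) - y[i] for i in range(n)]
        # Huber weights: 1 inside the threshold, M/|u| outside it
        d = [1.0 if abs(ui) <= M else M / abs(ui) for ui in u]
        converged = max(abs(a - c) for a, c in zip(w_new, w)) < tol
        w = w_new
        if converged:
            break
    return w


# One feature, true slope 2, with a single corrupted reading at the end.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 6.0, 8.0, 50.0]
print("Huber-Ridge slope:", huber_ridge_fit(X, y))
print("ridge slope:      ", huber_ridge_fit(X, y, M=1e9))
```

Setting a very large threshold makes every sample fall on the quadratic branch, so the same routine returns the ordinary ridge estimate, which is pulled much further toward the corrupted reading.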

PSO Algorithm.
The core idea of the PSO algorithm comes from the foraging process of birds. In the PSO algorithm, a candidate solution of the optimization problem is a particle in the hyperparameter space. Each particle has its corresponding fitness value, speed, and position. The speed of the particle determines the direction and the displacement with which the particle searches for a candidate solution. The PSO algorithm finds the optimal solution by iterating a group of randomly initialized particles.
In the PSO algorithm, there are $m$ particles in the $D$-dimensional space. The speed of each particle can be expressed as $\vec{v}_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, and the position of each particle can be expressed as $\vec{s}_i = (s_{i1}, s_{i2}, \ldots, s_{iD})$. In the loop iteration, each particle represents a candidate solution. The corresponding fitness value is obtained through the fitness function. The individual optimal particle and the global optimal particle are selected based on the fitness value. The individual optimal particle (pbest) is expressed as $\vec{p}_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$, and the global optimal particle (gbest) is expressed as $\vec{p}_g = (p_{g1}, p_{g2}, \ldots, p_{gD})$. Before the next iteration, each particle updates its speed and position through equations (17)-(19):

$$\vec{v}_i^{(t+1)} = \omega\,\vec{v}_i^{(t)} + c_1 r_1\left(\vec{p}_i - \vec{s}_i^{(t)}\right) + c_2 r_2\left(\vec{p}_g - \vec{s}_i^{(t)}\right), \tag{17}$$

$$\vec{v}_i^{(t+1)} = \begin{cases} v_{\mathrm{Max}}, & \vec{v}_i^{(t+1)} > v_{\mathrm{Max}}, \\ -v_{\mathrm{Max}}, & \vec{v}_i^{(t+1)} < -v_{\mathrm{Max}}, \end{cases} \tag{18}$$

$$\vec{s}_i^{(t+1)} = \vec{s}_i^{(t)} + \vec{v}_i^{(t+1)}, \tag{19}$$

where $\omega$ is the inertia factor ($\omega > 0$), $c_1$ is the local learning factor, and $c_2$ is the global learning factor ($c_1, c_2 > 0$). $r_1$ and $r_2$ are random numbers uniformly distributed on $[0, 1]$. $t$ and $t + 1$ represent the number of iterations. $v_{\mathrm{Max}}$ represents the maximum speed of the particle. In equation (17), $\omega\,\vec{v}_i^{(t)}$ is called the memory term, which reflects the influence of the current speed on the particle when it is updated; $c_1 r_1(\vec{p}_i - \vec{s}_i^{(t)})$ is called the self-cognition term, which means that the updated particle is biased toward its individual optimal particle; $c_2 r_2(\vec{p}_g - \vec{s}_i^{(t)})$ is called the group-cognition term, which means that the updated particles are biased toward the group optimal particle. It represents the result of collaboration among multiple particles.
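The speed and position updates described above can be sketched as a small minimizer. This is a generic illustration rather than the authors' implementation: the default values of m, ω, c1, c2, the speed limit of 20% of each range, the clipping of positions to the search range, and the fixed random seed are all assumptions, and the function name is hypothetical.

```python
import random


def pso_minimize(fitness, bounds, m=20, omega=0.7, c1=1.5, c2=1.5,
                 i_max=100, seed=0):
    """Minimise `fitness` over the box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    v_max = [0.2 * (hi - lo) for lo, hi in bounds]   # assumed speed limit
    s = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(m)]
    v = [[rng.uniform(-vm, vm) for vm in v_max] for _ in range(m)]
    p_best = [pos[:] for pos in s]                   # individual best positions
    p_fit = [fitness(pos) for pos in s]
    g = min(range(m), key=lambda i: p_fit[i])
    g_best, g_fit = p_best[g][:], p_fit[g]           # global best position
    history = [g_fit]
    for _ in range(i_max):
        for i in range(m):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel = (omega * v[i][d]                          # memory term
                       + c1 * r1 * (p_best[i][d] - s[i][d])     # self-cognition
                       + c2 * r2 * (g_best[d] - s[i][d]))       # group-cognition
                vel = max(-v_max[d], min(v_max[d], vel))        # clamp the speed
                v[i][d] = vel
                s[i][d] = max(lo, min(hi, s[i][d] + vel))       # stay in range
            f = fitness(s[i])
            if f < p_fit[i]:
                p_best[i], p_fit[i] = s[i][:], f
                if f < g_fit:   # global best is refreshed as soon as it improves
                    g_best, g_fit = s[i][:], f
        history.append(g_fit)
    return g_best, g_fit, history


# Toy problem: minimise the sphere function x^2 + y^2 on [-5, 5]^2.
best, best_fit, history = pso_minimize(lambda p: sum(x * x for x in p),
                                       [(-5.0, 5.0), (-5.0, 5.0)])
print(best, best_fit)
```

The returned history holds the best fitness after each iteration, which is the kind of curve used to compare the convergence of optimizers.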

Fitness Function.
The PSO algorithm finds the optimal hyperparameters for the model based on the fitness function. The smaller the particle fitness value, the lower the forecast error obtained with the corresponding hyperparameters. To improve the generalization ability of the model, k-fold cross-validation [24] is built into the fitness function.
The fitness function is defined as the weighted sum of the RMSE and the MAE of k-fold cross-validation on the model training set. The expression for the fitness function is as follows:

$$\mathrm{fitness} = \mathrm{RMSE}_{k_{cv}} + \eta\,\mathrm{MAE}_{k_{cv}}.$$

$\mathrm{RMSE}_{k_{cv}}$ is the root mean square error based on k-fold cross-validation, and its expression is as follows:

$$\mathrm{RMSE}_{k_{cv}} = \frac{1}{k_{cv}}\sum_{i=1}^{k_{cv}}\sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(\hat{y}_{ij} - y_{ij}\right)^{2}}.$$

$\mathrm{MAE}_{k_{cv}}$ is the average absolute error based on k-fold cross-validation, and its expression is as follows:

$$\mathrm{MAE}_{k_{cv}} = \frac{1}{k_{cv}}\sum_{i=1}^{k_{cv}}\frac{1}{n}\sum_{j=1}^{n}\left|\hat{y}_{ij} - y_{ij}\right|,$$

where $n$ is the number of training samples in each validation subset and $k_{cv}$ is the number of cross-validation subsets. $\hat{y}_{ij}$ and $y_{ij}$ are the model estimated value and the true value, respectively. The smaller the fitness function value, the better the corresponding particle. The weight of $\mathrm{MAE}_{k_{cv}}$ is $\eta$ ($\eta > 0$), which is also the control parameter used to balance the size of $\mathrm{RMSE}_{k_{cv}}$ and $\mathrm{MAE}_{k_{cv}}$. When $0 < \eta < 1$, $\mathrm{MAE}_{k_{cv}}$ has less weight than $\mathrm{RMSE}_{k_{cv}}$; when $1 < \eta < +\infty$, $\mathrm{MAE}_{k_{cv}}$ has more weight than $\mathrm{RMSE}_{k_{cv}}$; when $\eta = 1$, $\mathrm{MAE}_{k_{cv}}$ has the same weight as $\mathrm{RMSE}_{k_{cv}}$. $\mathrm{RMSE}_{k_{cv}}$ penalizes small errors only lightly. $\mathrm{MAE}_{k_{cv}}$ penalizes all errors in proportion to their size, so it does not punish large errors as strongly as $\mathrm{RMSE}_{k_{cv}}$. The degree to which the fitness function penalizes errors is controlled by adjusting the control parameter $\eta$. As $\eta$ increases, the penalty of the fitness function for small errors increases. Combining $\mathrm{MAE}_{k_{cv}}$ and $\mathrm{RMSE}_{k_{cv}}$ mitigates the insufficient penalty of $\mathrm{RMSE}_{k_{cv}}$ for small errors and the insufficient penalty of $\mathrm{MAE}_{k_{cv}}$ for large errors, which not only increases the penalty for model prediction errors but also improves the generalization ability of the model.
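The fitness value can be computed from per-fold predictions as in the sketch below. It assumes that the cross-validated RMSE and MAE are plain averages of the per-fold values, which is one reading of the definitions; the function names and the toy folds are hypothetical.

```python
import math


def rmse(y_hat, y):
    """Root mean square error of one validation fold."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y))


def mae(y_hat, y):
    """Average absolute error of one validation fold."""
    return sum(abs(a - b) for a, b in zip(y_hat, y)) / len(y)


def cv_fitness(fold_predictions, eta=1.0):
    """RMSE_kcv + eta * MAE_kcv.

    fold_predictions is a list of (estimated values, true values) pairs,
    one pair per validation fold.
    """
    k_cv = len(fold_predictions)
    rmse_cv = sum(rmse(p, t) for p, t in fold_predictions) / k_cv
    mae_cv = sum(mae(p, t) for p, t in fold_predictions) / k_cv
    return rmse_cv + eta * mae_cv


# Two toy folds: the second one has larger errors.
folds = [([1.0, 2.0], [1.0, 4.0]), ([0.0, 0.0], [3.0, 4.0])]
for eta in (0.0, 1.0, 2.0):
    print(eta, cv_fitness(folds, eta))
```

Raising η shifts the fitness toward the MAE term, so candidates are ranked more by their typical error and less by their largest errors.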

Data Preprocessing.
Good data quality can improve the performance of the model. Missing values and dimensional differences in the data reduce the forecast performance of the model. Therefore, it is important to preprocess the data. The data preprocessing can be divided into the following steps:
(1) Data cleaning. Each missing value is filled in with the value that precedes it.
(2) Data standardization. There are dimensional differences between different features. To prevent these differences from reducing the model performance, each feature is transformed to have a mean of 0 and a variance of 1 through the standardization equation. The standardization equation is as follows:

$$x_{ki} = \frac{X_{ki} - \bar{X}_i}{\sigma_i}, \qquad \bar{X}_i = \frac{1}{n}\sum_{k=1}^{n} X_{ki}, \qquad \sigma_i = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(X_{ki} - \bar{X}_i\right)^{2}}.$$

For the feature matrix, $X_{ki}$ is the raw value and $x_{ki}$ is the standardized value in the $k$-th row and the $i$-th column, $\bar{X}_i$ is the mean value of the $i$-th column, $\sigma_i$ is the standard deviation of the $i$-th column, and $n$ is the number of samples.
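The two preprocessing steps can be sketched as follows. Missing readings are represented by `None`, which is an assumption about the data encoding, and the standard deviation divides by n; the function names are hypothetical.

```python
import math


def forward_fill(series):
    """Replace each missing value (None) with the previous observed value."""
    filled, last = [], None
    for value in series:
        if value is None:
            value = last   # stays None if the series starts with a gap
        filled.append(value)
        last = value
    return filled


def standardize(column):
    """Scale one feature column to mean 0 and variance 1.

    A constant column has zero standard deviation and is not handled here.
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]


flow = forward_fill([120, None, None, 135, 150])
print(flow)
print(standardize([float(x) for x in flow]))
```

In a train/test setting, the mean and standard deviation would normally be estimated on the training set and reused for the test set.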
Mathematical Problems in Engineering 5

PSO-Huber-Ridge Model Optimization Process.
The optimization steps of the PSO-Huber-Ridge model are as follows:
Step 1. Start the optimization.
Step 2. Determine the model inputs and outputs. The feature set is used as the model input, and the model outputs the traffic flow.
Step 3. PSO-Huber-Ridge model parameter settings. The number of particles $m$, the inertia factor $\omega$, the local learning factor $c_1$, and the global learning factor $c_2$ are input into the PSO algorithm. Initialize the speed $\vec{v}$ and the position $\vec{s}$ of each particle. Set the maximum number of iterations of the PSO algorithm $i_{\mathrm{Max}}$ and the value range of the model hyperparameters.
Step 5. Particles update. Use equations (17)-(19) to update the speed $\vec{v}$ and position $\vec{s}$ of each particle.
Step 6. Fitness evaluation. Use equation (21) to calculate the fitness value of the particle based on the threshold value $M$ and the ridge parameter $\lambda$ of each particle.
Step 7. Optimal particle selection. Select the individual optimal particle and the global optimal particle according to the fitness value of the particles.
Step 8. Terminate training judgment. If the number of iterations $i$ does not meet the termination condition ($i > i_{\mathrm{Max}}$), return to Step 4. Otherwise, continue to the next step.
Step 9. Output optimization results. Output the threshold $M$ and the ridge parameter $\lambda$ in the global optimal particle.
Step 10. End the optimization.
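The optimization steps can be sketched end to end on a toy problem. Everything specific in this sketch is an assumption made for illustration: a single input feature, an IRLS solver for the Huber-Ridge fit, interleaved folds, the hyperparameter ranges, the swarm settings, and the synthetic data. The speed limit is omitted for brevity, and the function names are hypothetical.

```python
import math
import random


def fit_huber_ridge_1d(xs, ys, M, lam, iters=50):
    """IRLS fit of y ~ w*x with Huber loss and an L2 penalty (single feature)."""
    d = [1.0] * len(xs)
    w = 0.0
    for _ in range(iters):
        w = (sum(di * x * y for di, x, y in zip(d, xs, ys))
             / (sum(di * x * x for di, x in zip(d, xs)) + lam))
        d = [1.0 if abs(w * x - y) <= M else M / abs(w * x - y)
             for x, y in zip(xs, ys)]
    return w


def cv_fitness(xs, ys, M, lam, k_cv=5, eta=1.0):
    """RMSE_kcv + eta * MAE_kcv over k_cv interleaved folds."""
    rmse_sum = mae_sum = 0.0
    for f in range(k_cv):
        train = [i for i in range(len(xs)) if i % k_cv != f]
        valid = [i for i in range(len(xs)) if i % k_cv == f]
        w = fit_huber_ridge_1d([xs[i] for i in train],
                               [ys[i] for i in train], M, lam)
        errs = [w * xs[i] - ys[i] for i in valid]
        rmse_sum += math.sqrt(sum(e * e for e in errs) / len(errs))
        mae_sum += sum(abs(e) for e in errs) / len(errs)
    return rmse_sum / k_cv + eta * mae_sum / k_cv


def pso_huber_ridge(xs, ys, bounds, m=10, omega=0.7, c1=1.5, c2=1.5,
                    i_max=20, seed=0):
    """Search (M, lambda) inside `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    fit = lambda p: cv_fitness(xs, ys, M=p[0], lam=p[1])
    # Step 3: initialise positions and speeds inside the hyperparameter ranges
    s = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(m)]
    v = [[0.0] * len(bounds) for _ in range(m)]
    p_best, p_fit = [p[:] for p in s], [fit(p) for p in s]
    g = min(range(m), key=p_fit.__getitem__)
    g_best, g_fit = p_best[g][:], p_fit[g]
    history = [g_fit]
    for _ in range(i_max):                            # Step 8: stop after i_max
        for i in range(m):
            for d, (lo, hi) in enumerate(bounds):     # Step 5: particle update
                v[i][d] = (omega * v[i][d]
                           + c1 * rng.random() * (p_best[i][d] - s[i][d])
                           + c2 * rng.random() * (g_best[d] - s[i][d]))
                s[i][d] = min(hi, max(lo, s[i][d] + v[i][d]))
            f = fit(s[i])                             # Step 6: fitness evaluation
            if f < p_fit[i]:                          # Step 7: best selection
                p_best[i], p_fit[i] = s[i][:], f
                if f < g_fit:
                    g_best, g_fit = s[i][:], f
        history.append(g_fit)
    return g_best, g_fit, history                     # Step 9: output M, lambda


# Synthetic data: slope 2, small deterministic noise, two corrupted readings.
xs = [float(i) for i in range(1, 31)]
ys = [2.0 * x + 0.1 * ((7 * i) % 5 - 2) for i, x in enumerate(xs)]
ys[9] += 15.0
ys[19] -= 15.0
(best_M, best_lam), best_fit, history = pso_huber_ridge(
    xs, ys, bounds=[(0.5, 3.0), (1e-4, 1.0)])
print(best_M, best_lam, best_fit)
```

The lower bound on M is kept strictly positive so that the Huber weights are always well defined.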

Evaluation Indexes
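The average absolute error (MAE), root mean square error (RMSE), and average absolute percentage error (MAPE) were used to evaluate the forecast performance of the model. The definitions of MAE, RMSE, and MAPE are as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}},$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right| \times 100\%,$$

where $n$ is the number of samples in the test set, $\hat{y}_i$ is the model forecast value, and $y_i$ is the true value. MAE and RMSE reflect the forecast error of the model. The value range of MAPE is $[0, +\infty)$. The closer its value is to 0, the better the model performance.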
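The three evaluation indexes can be computed directly from forecast and true values, as in this short sketch. Expressing MAPE in percent is an assumption, and the sample values are made up.

```python
import math


def mae(y_hat, y):
    """Average absolute error."""
    return sum(abs(a - b) for a, b in zip(y_hat, y)) / len(y)


def rmse(y_hat, y):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y))


def mape(y_hat, y):
    """Average absolute percentage error in percent.

    Undefined when a true value is zero.
    """
    return 100.0 * sum(abs((a - b) / b) for a, b in zip(y_hat, y)) / len(y)


y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
print(mae(y_pred, y_true), rmse(y_pred, y_true), mape(y_pred, y_true))
```

Because MAPE divides by the true value, intervals with zero traffic flow need to be excluded or handled separately before it is computed.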

Data Description.
The traffic flow data set used in the experiment came from a highway intersection in Changsha City and was collected by a single detector with a data interval of 5 minutes. There were a small number of missing values in the traffic flow data set, and each missing point was filled in with the value preceding it. The data set, which contained 5 days of traffic flow, was divided into a training set and a test set. The traffic flow from Saturday to Tuesday was used as the training set to train the model. The traffic flow on Wednesday was used as the test set to verify the performance of the trained model.

Feature Extraction and Selection.
Historical characteristics based on linear correlation were selected from the traffic flow data, and statistical characteristics based on the optimal sliding window were extracted. The historical characteristics were selected first. The Pearson correlation coefficient was used to judge the strength of the linear correlation between the data. The range of the correlation coefficient $r$ was $[-1, 1]$. The closer $r$ is to 1, the stronger the positive correlation between the data; the closer to $-1$, the stronger the negative correlation; the closer to 0, the weaker the linear correlation. Historical values whose correlation coefficient $r$ was greater than 0.9 were selected as historical characteristics. See Table 1 for the correlation coefficients of traffic flow with delays of 1-9.
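The selection of historical characteristics by correlation can be sketched as below. The series is correlated with lagged copies of itself, and lags whose Pearson coefficient exceeds 0.9 are kept. The function names are hypothetical, and the sinusoidal series merely stands in for a smooth daily traffic profile sampled every 5 minutes.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def select_lags(series, max_lag=9, threshold=0.9):
    """Return {delay: r} for delays whose correlation exceeds the threshold."""
    selected = {}
    for lag in range(1, max_lag + 1):
        # Pair each value with the value observed `lag` steps earlier.
        r = pearson(series[lag:], series[:-lag])
        if r > threshold:
            selected[lag] = r
    return selected


# Smooth stand-in for two days of 5-minute counts (288 points per day).
flow = [100.0 + 50.0 * math.sin(2.0 * math.pi * t / 288.0) for t in range(576)]
print(select_lags(flow))
```

On real detector data the coefficients decay faster with the delay than on this smooth stand-in, which is why only some of the candidate delays pass the threshold.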
According to Table 1, the historical values with delays of 1-6 were selected as historical characteristics. To fully consider the periodicity of the traffic flow, the historical values at the same time point last week were also selected. The set of historical characteristics therefore included the historical values with delays of 1-6 and the historical values at the same time point last week. The statistical characteristics of the optimal sliding window were then extracted. The maximum, minimum, median, mean, standard deviation, skewness, and kurtosis of the data within the sliding window were taken as the statistical characteristics. The sliding window length $L$ had a value range of $[6, 150]$. The Huber-Ridge model with default hyperparameters ($\lambda = 0.0001$, $M = 1.35$) was evaluated exhaustively over this range on the traffic flow training set. The optimal window length was selected with the MAPE evaluation index as the standard. It can be seen from Figure 2 that the MAPE value was smallest at a sliding window length of 34, which was taken as the optimal sliding window length.
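The seven statistical characteristics of a sliding window can be computed as in the sketch below. It assumes population (divide-by-n) moments and excess kurtosis, and it uses the L values that precede each time point as the window; these conventions and the function names are assumptions.

```python
import math


def window_stats(window):
    """Maximum, minimum, median, mean, std, skewness, kurtosis of one window."""
    n = len(window)
    ordered = sorted(window)
    mid = n // 2
    median = ordered[mid] if n % 2 else 0.5 * (ordered[mid - 1] + ordered[mid])
    mean = sum(window) / n
    m2 = sum((x - mean) ** 2 for x in window) / n   # population variance
    m3 = sum((x - mean) ** 3 for x in window) / n
    m4 = sum((x - mean) ** 4 for x in window) / n
    return {
        "max": ordered[-1],
        "min": ordered[0],
        "median": median,
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,  # excess kurtosis
    }


def sliding_features(series, L):
    """Statistics of the L values preceding each time point t, for t >= L."""
    return [window_stats(series[t - L:t]) for t in range(L, len(series))]


features = sliding_features([float(v) for v in range(10)], L=5)
print(features[0])
```

An exhaustive search over the window length would rebuild these features for every candidate L and keep the length with the lowest validation MAPE.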

Experimental Results.
The state transition algorithm (STA) [25], grey wolf optimizer (GWO) [26], genetic algorithm (GA) [27], and PSO algorithm were used to optimize the hyperparameters of the Huber-Ridge model. The range of the model parameters is shown in Table 2. The parameter values of the optimization algorithms are shown in Table 3; see also Table 4. The iterative comparison of their fitness values is shown in Figure 3.
In Table 3, $m$ represents the number of seeds of each optimization algorithm, and maxital represents the maximum number of iterations of the optimization algorithms. For the STA: the value range of the rotation factor $\alpha$ is $[\alpha_{\min}, \alpha_{\max}]$, and $\alpha$ decreases as an exponential function with base $1/fc$ as the number of iterations increases; $\beta$ indicates the translation factor; $\gamma$ indicates the expansion factor; $\delta$ indicates the axesion factor. For the GWO: $a$ is called the convergence factor and decreases linearly from 2 to 0 as the iterations increase; $r_1$ and $r_2$ are random numbers uniformly distributed over the interval $[0, 1]$. For the GA: prob_mut represents the mutation probability, and the partial-mapped crossover is used as the crossover operator. For the PSO algorithm: $\omega$ indicates the inertia factor; $c_1$ indicates the local learning factor; $c_2$ indicates the global learning factor; $v_{\mathrm{Max}}$ represents the maximum speed of the particle.
It can be seen from Table 1 and Figure 3 that the fitness value of the STA algorithm dropped rapidly in the early stage of the iteration and then fell into the search for the local optimum [28]. The fitness value of the GWO algorithm decreased slowly in the iterative process, and its global optimization efficiency was not high. The GWO algorithm may easily fall into the local optimum and fail to find the global best [29]. The control parameters of the GWO algorithm decrease linearly with the iterative process, which cannot satisfy the complex search process [30]. The fitness value of the GA stagnated in the early stage of the iteration and fell into the search for the local optimum. This is because the genetic algorithm suffers from premature convergence [31], which makes it difficult to jump out of the local optimum. Compared with the GWO, GA, and STA optimization algorithms, the PSO algorithm has a better iterative update strategy. It updates the particle position based on the individual experience of the particles and the global experience of the particle swarm, so it does not fall into the search for the local optimum easily. The forecast evaluation results of the four models are shown in Table 5. The forecast result of the PSO-Huber-Ridge model is shown in Figure 4.
It can be seen from Table 5 that the PSO-Huber-Ridge model had the lowest MAE, RMSE, and MAPE; that is, the forecast performance of the PSO-Huber-Ridge model was the best. It can be seen from Figure 4 that the PSO-Huber-Ridge model forecasts the trend of the traffic flow well at most time points.
Based on the error between the predicted value of the PSO-Huber-Ridge model and the actual value, anomaly detection was performed on the traffic flow using the threshold (mean ± 2σ), where the mean value (mean) and the standard deviation (σ) of the error data were calculated in a sliding window with a length of 10. If the forecast error at the time point following the sliding window fell outside the anomaly detection threshold, the traffic flow at this time point was defined as abnormal. Abnormal warnings would be reported to the relevant traffic departments to avoid possible traffic jams. The label for abnormal traffic flow was defined as 1, and the label for normal traffic flow was defined as 0. The traffic flow anomaly detection based on the PSO-Huber-Ridge model is shown in Figure 5. It can be seen from Figure 5 that the proposed model detects the abnormal traffic flow well in each time period.
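The sliding-window threshold rule can be sketched as follows. This is one interpretation of the rule: an error is flagged when it lies outside mean ± 2σ of the preceding 10 errors, σ is taken as the population standard deviation of the window, and the first 10 points are left unlabeled as normal. The function name and the toy error sequence are hypothetical.

```python
import math


def detect_anomalies(errors, window=10, n_sigma=2.0):
    """Label each forecast error: 1 = abnormal, 0 = normal.

    The threshold for time t is built from the `window` errors before t.
    """
    labels = [0] * len(errors)
    for t in range(window, len(errors)):
        recent = errors[t - window:t]
        mean = sum(recent) / window
        sigma = math.sqrt(sum((e - mean) ** 2 for e in recent) / window)
        if abs(errors[t] - mean) > n_sigma * sigma:
            labels[t] = 1
    return labels


# Small alternating errors followed by one large forecast error.
errors = [1.0, -1.0] * 5 + [0.5, 10.0, -0.5]
print(detect_anomalies(errors))
```

A flagged error also enters the following windows and widens their thresholds for a while, which is a side effect of this simple rule worth keeping in mind when anomalies occur in bursts.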

Conclusions
To solve the problem of heavy noise in traffic flow data, traffic flow anomaly detection based on the PSO-Huber-Ridge model is proposed. The strong robustness of the Huber function enables it to effectively reduce the influence of noise in traffic flow data on model training. The addition of the $L_2$ regular term of the ridge regression to the objective function reduces the risk of model overfitting. The sum of $\mathrm{RMSE}_{k_{cv}}$ and the η-weighted $\mathrm{MAE}_{k_{cv}}$ based on 10-fold cross-validation is constructed as the fitness function to improve the generalization ability of the model. The optimal model parameters are obtained through the particle swarm optimization algorithm so as to improve the model performance. Compared with the STA-Huber-Ridge, GA-Huber-Ridge, and GWO-Huber-Ridge models, the experimental results show that the PSO-Huber-Ridge model has the best forecast performance. The traffic flow anomaly detection is performed using the traffic flow forecasted by the PSO-Huber-Ridge model. A large error between the forecasted and actual traffic flow at a certain time point indicates that the pattern of traffic flow at that time point differs from the historical pattern and may cause traffic congestion. The anomaly detection is performed on the traffic flow using the threshold (mean ± 2σ). The experimental results verify the validity of the proposed traffic flow anomaly detection model.
The information contained in the traffic flow is complex. The PSO-Huber-Ridge model is limited to exploring the linear information in the traffic flow, and the nonlinear information needs further analysis and exploration. When extracting statistical features in feature engineering, the optimal sliding window is determined by exhaustive search. Its disadvantage is that it takes a long time and is not easy to apply. Using an adaptive method to extract features would greatly reduce the time of feature engineering. The Huber loss function reduces the negative impact of the data noise on model training by reducing the penalty for large errors. Combining the Huber function with an outlier detection method in data preprocessing can further improve the robustness of the model. Using adaptive feature extraction to mine linear and nonlinear information while improving model robustness is the next step.

Data Availability
The data used to support the findings of this study are currently under embargo, while the research findings are commercialized. Requests for data, 6/12 months after publication of this article, will be considered by the corresponding author.