Traffic Condition Classification Model Based on Traffic-Net

The classification and detection of traffic status play a vital role in urban smart transportation systems. Classifying and understanding the traffic status of different time periods and road sections helps traffic management departments optimize road management and implement rescue in real time, while travelers can follow the traffic conditions and choose the best route, effectively improving travel efficiency and safety. However, due to factors such as weather, time of day, lighting, and sample labeling costs, existing classification methods fall short of application requirements in both real-time performance and detection accuracy. To solve this problem, this article aims to effectively transfer a pretrained model learned on large-scale image datasets to a small-sample road traffic dataset. By sharing common visual features, migrating model weight parameters, and fine-tuning, a road traffic condition classification model based on Traffic-Net is finally obtained. Experiments show that the method in this article not only achieves a prediction accuracy of more than 96% but also effectively reduces model training time, meeting the needs of practical applications.


Introduction
The problem of traffic congestion has become a worldwide problem. There are many factors that cause traffic congestion, including the rapid increase in the number of vehicles, insufficiently rationalized road planning, irregular driving behavior, and traffic lights. The resulting traffic delays, increased fuel consumption, and traffic accidents seriously affect people's travel safety and hinder urban development. Peak commuting, bad weather, and holiday travel are usually periods of heavy road congestion and secondary accidents. If we can collect enough data to accurately describe and classify urban traffic conditions and analyze the main causes of traffic congestion and secondary accidents, especially when traffic emergencies occur, rapid access to information is a key factor in organizing an optimal response; therefore, effective monitoring and detection of traffic status is crucial for road traffic management. In recent years, a great deal of work has been done on traffic classification modeling, path planning, and the broader area of transportation, where path planning algorithms [1] and energy-efficient information collection [2] are emerging key supporting technologies in the field of intelligent transportation. Many road path planning modeling and information collection methods [1][2][3][4] have been developed to understand the causes of road traffic congestion and to prevent and manage road congestion.
Our focus in this article is on road traffic condition classification prediction using transfer learning approaches. At present, solutions to the traffic condition classification problem can be summarized into two major categories: methods that rely on traditional manual feature representation and methods that rely on automatic feature extraction with deep neural network models.
Traditional image classification models mainly rely on manual feature representations. Such methods perform behavioral classification and recognition through interframe difference, HOG (histogram of oriented gradients), background subtraction, Gaussian mixture modeling, optical flow, and other [5,6] feature representations and then train SVM (support vector machine) classifiers [7,8] on these representations. The support vector machine classifier approach, based on statistical learning theory, suffers from low prediction accuracy, strong dependence on the data samples, and the strict requirement that training and testing samples be identically distributed. Subject to the shortcomings of manual feature representation and SVM classification models, such methods cannot meet application requirements in terms of accuracy and real-time performance.
Deep convolutional neural network (DCNN) models, which combine automatic feature extraction with classification and recognition, build complex neural network models in a data-driven manner with end-to-end learning mechanisms. They mainly consist of RCNN (regions with convolutional neural network features) [9], Fast/Faster RCNN [10,11], and R-FCN (region-based fully convolutional networks) [12] as two-stage approaches based on region candidate proposals; SSD (single shot multibox detector) [13] and YOLO (you only look once) [14][15][16][17] as regression-based one-stage methods; and some other deep learning [18][19][20][21][22] methods. SSD takes VGG16 [20] as the base convolutional network architecture and adds auxiliary network layers whose multiscale convolutional maps are fused for prediction, combined with default boxes similar to the anchor box structure in Faster RCNN, solving the problem of input image targets of different sizes. YOLOv3 uses the Darknet53 network, which introduces a residual structure, as its base network; unlike the single-level base network input of SSD, it achieves multilevel input and higher accuracy in small target detection than SSD. Although the abovementioned methods have achieved excellent results, they are limited by the lack of data samples, sample labeling, and computational resources. This type of method mainly suffers from overly complex network models that easily lead to overfitting and high false detection rates on small-sample [23][24][25] datasets.
To address the abovementioned problems, this article proposes a sample-augmented traffic condition classification model. The contributions of this article can be summarized as follows: (1) the creation of the traffic condition classification dataset Traffic-Net Dataset V_HF, built from the Traffic-Net dataset provided by OlafenwaMoses on GitHub, web images, and an autonomously collected image dataset, containing four traffic categories: congested traffic, sparse traffic, accidents, and fires; (2) migration learning based on the ResNet50 (residual neural network), VGG16 (Oxford Visual Geometry Group), and GoogLeNet [26] pretrained networks, each fine-tuned for the four traffic condition classes; (3) expansion of the original Traffic-Net Dataset V_HF by random geometric image transformation preprocessing and CutMix sample enhancement, with comparison and analysis of the experimental results before and after enhancement.

Pretraining Network Framework
The pretraining network framework is described in the following sections.
2.1. VGG Network Structure. In VGG, stacked small convolutional kernels are used instead of larger convolutional kernels, and each convolutional layer uses a rectified linear unit (ReLU) as the activation function. Convolution uses 3 × 3 kernels and maximum pooling uses 2 × 2 windows, so that the number of channels can be doubled while the feature map is continuously reduced. At the same time, the additional nonlinear transformations in the VGGNet convolutional structure decrease the computational effort and increase the efficiency of the CNN for image feature extraction. The VGG model in [27] has a total of six configurations with different weight-layer structures, as shown in Figure 1.
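To see why stacking small kernels is attractive, one can compare the weight count of a single 5 × 5 convolution with two stacked 3 × 3 convolutions covering the same receptive field. The following Python sketch does the arithmetic; the channel count of 256 is an illustrative choice, not a value from this article:

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in one convolution layer (biases omitted)."""
    return kernel * kernel * c_in * c_out

channels = 256  # illustrative channel count, not from the paper
one_5x5 = conv_weights(5, channels, channels)      # 5*5*256*256
two_3x3 = 2 * conv_weights(3, channels, channels)  # 2*3*3*256*256
# Two 3x3 layers span the same 5x5 receptive field with roughly 28%
# fewer weights and an extra ReLU nonlinearity between them.
```

This is the trade-off the VGG design exploits: equal receptive field, fewer parameters, more nonlinearity.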

2.2. GoogLeNet Network Structure.
Feature fusion is prominent in the GoogLeNet network. Its core lies in the introduction of the Inception module, which has undergone several iterative versions and assembles multiple convolution and pooling operations into one module. The module provides several kinds of convolutional kernels so that features can be extracted with different receptive fields and finally concatenated. Another feature of GoogLeNet is the introduction of auxiliary classifiers, which allow intermediate results to be used as outputs and to contribute, with some weighting, to the final classification prediction, achieving a form of model fusion; they operate only during training and are removed during prediction. The original input image has a size of 224 × 224 × 3 and is zero-centered. The network structure of GoogLeNet Inception V1 is shown in Figure 2.
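The depth-concatenation idea can be sketched in a few lines of NumPy (a toy stand-in, not the paper's MATLAB implementation: each branch is reduced to a random 1 × 1 projection plus ReLU, whereas a real Inception module uses 1 × 1, 3 × 3, and 5 × 5 convolutions and pooling):

```python
import numpy as np

def inception_module(x, b1, b3, b5, bp):
    """Toy Inception: four parallel branches on the same input, outputs
    depth-concatenated. Each branch here is a random 1x1 projection with
    ReLU; branch output channel counts are b1, b3, b5, bp."""
    h, w, c_in = x.shape
    rng = np.random.default_rng(0)
    branches = []
    for c_out in (b1, b3, b5, bp):
        w_proj = rng.standard_normal((c_in, c_out))
        branches.append(np.maximum(x @ w_proj, 0.0))  # 1x1 conv + ReLU
    # Spatial size is preserved; channels add up across branches.
    return np.concatenate(branches, axis=-1)

x = np.ones((28, 28, 192))
y = inception_module(x, 64, 128, 32, 32)  # 64+128+32+32 = 256 channels
```

The key structural point survives the simplification: every branch keeps the spatial resolution, so the outputs can be stitched along the channel axis.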

2.3. ResNet Network Structure.
The residual structure of ResNet50 is shown in Figure 3. The output is H(X) = F(X) + X. When F(X) is 0, H(X) = X denotes the identity mapping, which is represented as the curved shortcut line in Figure 3.
The ResNet50 model has 50 layers, and its structure is shown in Figure 4. ResNet50 is divided into 5 stages, each consisting of a combination of residual structures; excluding the first stage, Conv2_x, Conv3_x, Conv4_x, and Conv5_x represent the remaining stages, with 3, 4, 6, and 3 residual units, respectively. In Figure 4, c64 means the number of channels is 64, s1 means the stride is 1, and p0 means the padding is 0. It can be seen that the first stage of the ResNet50 network uses maximum pooling with a stride of 2 to halve the size of the feature map, while the other stages achieve the same effect with convolution operations of stride 2.
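The residual computation H(X) = F(X) + X can be written out directly (a NumPy toy, not the paper's MATLAB implementation); the shortcut means a zero residual branch passes the input through unchanged:

```python
import numpy as np

def residual_block(x, residual_fn):
    """H(x) = F(x) + x: the shortcut adds the input to the branch output."""
    return residual_fn(x) + x

x = np.array([1.0, -2.0, 3.0])
# F(x) = 0 reduces the block to the identity mapping H(x) = x.
identity_out = residual_block(x, lambda t: np.zeros_like(t))
```

This identity path is what lets very deep stacks of such blocks be trained without the degradation seen in plain deep networks.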

Traffic Condition Classification Based on Pretrained Network Model Migration
Although there are many reasons for different feature distributions in different datasets [17], it is not difficult to find that shallow networks extract low-level features such as edges and contours; as the network deepens, many local features are formed and then combined into the whole. Given similarities in data, tasks, and models, it becomes possible to take a model trained in an old domain and apply it to a new domain [18]; this process is called migration learning. Because deep neural networks are large and complex in structure, designing and testing models is expensive and time-consuming, and migration learning is a convenient and effective way to improve the efficiency of model training, especially when the number of samples is small. Generally, migration learning methods using pretrained models are divided into feature extraction and fine-tuning, which can be chosen according to the sample size and characteristics of a particular application area. In this article, we choose a migration learning method based on parameter fine-tuning to improve the training speed and recognition rate of pretrained models on a domain-specific sample dataset, freezing the shallow weights and retraining the deeper layers. The features learned in the convolutional layers generalize across different samples, especially in the shallow layers, because the shallow convolutional layers learn local subtle features, while the deep layers are more biased toward local or global object contour features. The parameter fine-tuning mechanism also effectively avoids the overfitting that arises when model parameters are overly complex relative to a small number of samples.
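The freeze-and-fine-tune idea can be illustrated with a plain-Python sketch (the layer names are hypothetical, and the paper performs this step in MATLAB, not Python):

```python
# Hypothetical layer list, ordered shallow -> deep.
layers = [{"name": n, "trainable": True}
          for n in ("conv1", "conv2", "conv3", "conv4", "fc")]

def freeze_shallow(layers, n_frozen):
    """Freeze the first n_frozen (shallow) layers; during fine-tuning the
    optimizer then updates only the remaining deep layers."""
    for layer in layers[:n_frozen]:
        layer["trainable"] = False
    return [l["name"] for l in layers if l["trainable"]]

trainable = freeze_shallow(layers, 3)  # only conv4 and fc are retrained
```

Freezing the shallow layers preserves the generic edge and contour detectors learned on the large source dataset, while the deeper, more task-specific layers adapt to the traffic categories.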
In this article, three pretrained models are selected for multistrategy fine-tuning experiments on the weight parameters in the MATLAB2021a environment. The parameter fine-tuning strategies mainly include freezing the weights of the shallow layers, relearning the weights, and freezing all weights of the fully connected layers. Algorithm 1 outlines the flowchart of the proposed method. The training samples also undergo random geometric preprocessing operations, which produce different data in each round due to the random nature of the operations; for example, the same image is flipped in some rounds and not in others, so the data used for training differ in each round, and the purpose of sample enhancement is achieved.

CutMix Sample Enhancement.
CutMix generates a new training sample (x̃, ỹ) by combining two training samples (x_a, y_a) and (x_b, y_b); the new sample is used to train the network model with the original loss function. The sample combination operation is defined as shown in the following equation:

x̃ = M ⊙ x_a + (1 − M) ⊙ x_b,
ỹ = λ y_a + (1 − λ) y_b,

where M ∈ {0, 1}^(W×H) denotes the binary mask marking the cropped and retained regions of the image (the filled region is 1 and the rest is 0), ⊙ denotes pixel-by-pixel multiplication, and λ is drawn from Beta(α, α), with α usually set to 1 in experiments, i.e., λ obeys a uniform distribution on (0, 1).
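A minimal NumPy sketch of this combination, assuming one-hot labels and a centrally clipped random box (the paper's experiments run in MATLAB; box placement details here follow the common CutMix formulation rather than anything stated in the article):

```python
import numpy as np

def cutmix(xa, ya, xb, yb, alpha=1.0, rng=None):
    """Paste a random box from xb into xa and mix the labels by the
    actual kept-area ratio. alpha=1 makes lambda uniform on (0, 1)."""
    rng = rng or np.random.default_rng(0)
    h, w = xa.shape[:2]
    lam = rng.beta(alpha, alpha)
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    x_new = xa.copy()
    x_new[y1:y2, x1:x2] = xb[y1:y2, x1:x2]          # M ⊙ xa + (1-M) ⊙ xb
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (h * w)   # fraction of xa kept
    y_new = lam_adj * ya + (1 - lam_adj) * yb       # λ ya + (1-λ) yb
    return x_new, y_new
```

Because the box may be clipped at the image border, λ is recomputed from the actual pasted area so that the label mix matches the pixel mix.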

Experimental Results
The experimental results are described in the following sections. The results of the pretrained model comparison experiments are listed in Table 1, and the training process is shown in Figures 6(a)-6(c).
Comparing the training time and validation accuracy of each model, it is easy to find that ResNet50 takes the longest to train, while the three models achieve similar accuracy.

Sample Augmentation Control Experiment Results.
Sample augmentation mainly includes the following. (1) The data augmentation operation built into deep learning training in MATLAB randomly rotates, pans, and resizes the training samples in each training iteration, so that the sample data differ in each round of training, achieving sample augmentation.
(2) The number of samples is increased by collecting the same type of image data from the web. However, the downloaded photos cannot be converted to the same dimension directly because of their different formats, so they are converted to grayscale images when added; the "ColorPreprocessing" option in MATLAB is used to ensure that all enhanced images have the same number of channels. (3) CutMix selects images from the samples, crops local areas, and superimposes them onto other sample images, generating new training samples to enhance the dataset. As shown in Figure 7, the final validation accuracy reached 96.14% using the GoogLeNet pretrained model, which shows that the larger the dataset, the higher the validation accuracy.
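The per-round randomness described in item (1) can be sketched as follows (a NumPy stand-in for MATLAB's built-in augmentation options; only flip and 90-degree rotation are shown, and the image content is illustrative):

```python
import numpy as np

def random_augment(img, rng):
    """Apply a random horizontal flip and a random 90-degree rotation.
    Because the draw is repeated every epoch, the same source image
    yields different training samples across rounds."""
    if rng.random() < 0.5:
        img = np.fliplr(img)
    k = int(rng.integers(0, 4))  # rotate by k * 90 degrees
    return np.rot90(img, k)

rng = np.random.default_rng(1)
img = np.arange(16.0).reshape(4, 4)
epoch_views = [random_augment(img, rng) for _ in range(3)]
```

Each epoch sees a geometrically different view of the same image, which is exactly how the fixed dataset is stretched into effectively more samples.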

Experimental Results of Hyperparameter Settings.
The experimental results of the hyperparameter settings are explained in the following sections.

Effect of Learning Rate Setting on Model Accuracy.
With MaxEpochs = 5, MiniBatchSize = 8, and SGDM as the optimizer, setting the learning rate to 0.001, 0.0003, 0.0001, and 0.00001, respectively, the accuracy of the validation set obtained using the GoogLeNet pretrained model is 93.70%, 94.35%, 94.77%, and 87.78%. It can be seen in Table 2 that the model validation accuracy is strongly influenced by the initial learning rate when using the SGDM optimizer.

Optimizer.
Keeping the other parameters unchanged and changing only the model optimizer, with Adam's squared gradient decay factor set to 0.99, the accuracy of the validation set using the GoogLeNet pretrained model is 89.91%, whereas the accuracy using SGDM is 94.35%.

Number of Training Rounds.
Keeping the other parameters unchanged and increasing only the number of training rounds (MaxEpochs) from 6 to 10, the accuracy of the validation set using the GoogLeNet pretraining model changes from 95.53% to 96.21%. From the training process in Figure 8, we can see that once the model has converged, increasing the number of training rounds does not change the model performance but only increases the training time.

Analysis and Discussion
The analysis and discussion are described in the following sections.

Model Evaluation.
The trained model is tested on the test set, and the confusion matrix is drawn as shown in Figure 9, where the correct predictions are distributed on the diagonal, and the rows and columns also show the recall and precision for each class, respectively. For recall, shown along the rows of Figure 9, the denominator is the sum of the row and the numerator is the number of correct predictions for the class. For precision, shown along the columns, the denominator is the sum of the column and the numerator is again the number of correct predictions for the class. Their numerators are the same, but the denominators are different.
From precision and recall, we calculate the integrative index F1-measure, as shown in the following equation:

F1 = 2 × Precision × Recall / (Precision + Recall).

The value of the F1-measure lies in (0, 1], and the closer to 1, the better. Its value, calculated from Figure 9, is 94.67%, which shows that the performance of the model is good. The model can also be inspected by visualizing the intermediate network layers. Selecting the fully connected layer, a detailed image with strong activation was generated for each class, as shown in Figure 10. The image generated for the "fire" category contains obvious fire color features.
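The computation of precision, recall, and F1 from a confusion matrix can be sketched as follows (macro-averaged over classes; the 2 × 2 example matrix is illustrative and is not the data of Figure 9):

```python
import numpy as np

def scores_from_confusion(cm):
    """Per-class precision and recall from a confusion matrix whose rows
    are true classes and columns are predicted classes, macro-averaged
    together with F1 = 2PR / (P + R)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    recall = tp / cm.sum(axis=1)     # row sums in the denominator
    precision = tp / cm.sum(axis=0)  # column sums in the denominator
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()

p, r, f1 = scores_from_confusion([[9, 1], [3, 7]])
```

Note that the diagonal entries (the shared numerators) appear in both ratios, while the row sums and column sums supply the two different denominators, exactly as described for Figure 9.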

Error Analysis. An example of a prediction error is shown in the first panel on the left of Figure 11, where the model incorrectly predicts congested traffic as sparse traffic.
The error arises because the model has no evaluation criterion for the number of vehicles in sparse traffic and predicts sparse traffic when the extracted feature is an empty road. (The panel labels of Figure 11 read: actual dense traffic predicted as sparse traffic, 100%; actual sparse traffic predicted as sparse traffic, 100% and 83.8%; actual fire predicted as fire, 100%.) This also reflects the open problem of how to correct the model when its region of interest is not focused on the correct category. Another cause of error is that the image is not captured with a clear view, leaving no texture; the convolutional layers then extract no features, and congested traffic is discriminated as sparse traffic. The visualization of the strongest activation channel of the inception_5a-5 × 5 convolutional layer for the erroneous prediction is shown in Figure 12(a); comparing the off-white pixels with the original image in Figure 12(b), it can be seen that only the vehicle parts are activated. For better application deployment, the algorithm's classification test results are displayed in combination with a UI interface, as shown in Figures 13(a)-13(d).

Conclusion
In this article, we designed and implemented a traffic classification model based on migration learning using the Traffic-Net V1 dataset and conducted a multidimensional comparative analysis of several deep learning frameworks under multiple strategies, such as sample data enhancement and parameter fine-tuning, to improve the model. The experimental comparison results show that the migrated model has good generalization ability when classification and recognition are applied to the target-domain dataset. The accuracy rate reaches more than 95%, which is well suited to the classification and recognition task in the target domain.

Data Availability
The data used to support the findings of the study can be obtained from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.