Realtime Vehicle Tracking Method Based on YOLOv5 + DeepSORT

In actual traffic scenarios, the environment is complex and constantly changing, with many vehicles that have substantial similarities, posing significant challenges to vehicle tracking research based on deep learning. To address these challenges, this article investigates the application of the DeepSORT (simple online and realtime tracking with a deep association metric) multitarget tracking algorithm in vehicle tracking. Because the DeepSORT algorithm depends strongly on target detection, a YOLOv5s_DSC vehicle detection algorithm based on the YOLOv5s algorithm is proposed, which provides accurate and fast vehicle detection data to the DeepSORT algorithm. Compared to YOLOv5s, YOLOv5s_DSC differs by no more than 1% in optimal mAP0.5 (mean average precision), precision, and recall, while reducing the number of parameters by 23.5%, the amount of computation by 32.3%, and the size of the weight file by 20%, and increasing the average processing speed of each image by 18.8%. After integrating the DeepSORT algorithm, the processing speed of YOLOv5s_DSC + DeepSORT reaches up to 25 FPS, and the system exhibits better robustness to occlusion.


Introduction
The increasing number of vehicles has caused great difficulties in traffic management. Vehicle tracking is an application of target tracking in the field of transportation, which can serve to alleviate the pressure of traffic management [1][2][3]. At present, the mainstream target tracking method is the discriminative tracking method, which adds a target detection step and makes tracking more accurate. Discriminative tracking methods mainly include tracking methods based on sparse representation [4][5][6], tracking methods based on correlation filtering [7][8][9], and tracking methods based on deep learning. Li and Huang [10] proposed the TOD (tracking object based on detector) algorithm, which used YOLOv3 for target detection and tracked the target according to LBP (local binary pattern) features and a color histogram. Bertinetto et al. [11] proposed the SiamFC (Siamese fully convolutional) algorithm, which took the target object in the first frame as one input of the Siamese network and the search area in subsequent frames as the other input and then found the area closest to the target object to realize target tracking. However, target loss can easily occur when the target size changes. Zhu et al. [12] adopted a distractor recognition model to update the tracking template online, which could deal well with severe occlusion and appearance changes of the target. Li et al. [13] introduced deep networks into the Siamese network framework and exploited their capacity through multilayer aggregation.
Multitarget tracking is harder than single-target tracking. Problems such as appearance similarity among targets, occlusion, and the start and end of single-target tracking tasks pose significant challenges in the field of multitarget tracking. Bewley et al. [14] proposed the SORT (simple online and realtime tracking) algorithm, which used the Kalman filter to predict the tracking frame information of the tracked object in the next frame and performed data association with the detection frame information in the next frame to achieve multitarget tracking. The algorithm had a small memory footprint and high speed, but its accuracy was very low when the target was occluded. Wojke et al. [15] proposed the DeepSORT algorithm based on the idea of SORT. The algorithm considered both motion information and appearance information in the tracking process and resolved the problem of target occlusion. At present, detection-based tracking algorithms still have many problems, such as a lack of datasets, inaccurate target detection, and insufficient realtime performance.
Traditional target detection algorithms rely on image features and classifiers such as SVM (support vector machine) [16], AdaBoost [17], and random forest [18], together with artificially designed color features [19], gradient features [20], and pattern features [21]. Target detection algorithms based on deep learning have stronger adaptability to complex scenes and include candidate-region-based methods and regression-based methods. The representative candidate-region-based algorithms are the R-CNN (Region-CNN) series [22][23][24]. Owing to the need to process a large number of candidate regions, such methods suffer from low efficiency and are not capable of realtime detection. Regression-based target detection removes the step of generating candidate regions and improves speed significantly; it has been widely used for developing realtime target detection systems. The YOLO (you only look once) algorithm [25], proposed in 2016, divided an image with a grid and generated a series of initial anchor boxes in each grid cell. By learning to fine-tune the initial boxes, predicted boxes were generated closer to the actual boxes. The YOLOv2 algorithm introduced batch normalization and used DarkNet-19 as the backbone network, which could dynamically adjust the input and achieve better precision for small targets [26]. On this basis, the YOLOv3 algorithm used DarkNet-53 as the backbone network, introduced the FPN (feature pyramid network) structure to obtain feature maps at different scales, and used a logistic classifier to predict target categories [27]. The YOLOv4 algorithm added data augmentation and self-adversarial training at the input end [28]; its backbone used CSPDarkNet53 and improved the loss function of the output layer, which greatly improved speed and accuracy. YOLOv5 has comparable accuracy to YOLOv4 but is faster, with a detection speed of 140 FPS on a Tesla P100. Sasagawa and Nagahara [29] used YOLO to locate and identify objects and proposed a method for detecting objects under low illumination by utilizing the power of transfer learning. Krišto et al. [30] used thermal images with YOLO to improve target detection performance in challenging conditions such as adverse weather, nighttime, and dense areas. Xiao et al. [31] fused context information in the YOLO backbone network to avoid the loss of low-level context features, retain lower spatial features, and address the difficulty of detecting targets under dim light. Guo et al. [32] designed an improved SSD (single shot multibox detector), which used single-sample data deformation to transform the color gamut and apply affine transformations to the original data and could detect targets close to each other. To improve feature fusion for small tassel detection, Liu et al. [33] proposed a novel algorithm referred to as YOLOv5-tassel. To enrich feature information and improve feature extraction ability, Bie et al. [34] proposed an improved YOLOv5 algorithm based on a bidirectional feature pyramid network for multiscale feature fusion. Wang et al. [35] proposed a novel vehicle detection and tracking method for small target vehicles based on the attention mechanism to achieve high detection and tracking accuracy.
In summary, research based on the improved YOLOv5 algorithm mainly focuses on the accuracy of small object detection, while detection speed and occlusion robustness in vehicle tracking still offer great research value. The main contributions of this article are as follows: (1) To address the large number of vehicles, fast vehicle motion, substantial similarity of vehicle appearance, and vehicle occlusion in actual urban traffic scenes, the DeepSORT algorithm is used for vehicle tracking, which has better realtime performance and tracking robustness than traditional vision-based vehicle tracking methods. (2) To reduce the computation of YOLOv5s, shorten its inference time, and improve its running speed, a YOLOv5s_DSC algorithm with faster inference is proposed. (3) Combining YOLOv5s_DSC with the DeepSORT algorithm, the occlusion robustness of the proposed algorithm is verified and its realtime performance is tested in cases where vehicles are occluded by foreign objects or by each other.

The DeepSORT Algorithm

Overall Framework.
The DeepSORT algorithm adopts a two-stage idea of detection and tracking, using the Kalman filter and the Hungarian algorithm to track the target and introducing a deep convolutional neural network to extract the appearance information of the tracked target for data association, which alleviates the difficulty of accurately tracking occluded targets. Stable and accurate vehicle detection results are an important guarantee for the DeepSORT algorithm in the vehicle tracking task. Considering the realtime requirements of realistic application scenarios, the YOLOv5 target detection algorithm is studied in this article. To further reduce the memory and computing resources occupied by the algorithm, a DSC structure with a residual connection is introduced into YOLOv5s, and the YOLOv5s_DSC algorithm, with a smaller model and faster speed, is proposed. YOLOv5s_DSC is used as the detector of the DeepSORT algorithm; its detection accuracy makes tracking more accurate while providing better realtime performance.

Figure 1 is a framework diagram of the DeepSORT algorithm. First, the Kalman filter predicts the tracking frame information of the tracked target in the next frame, and all the detection frame information in that frame is obtained by the target detection algorithm. Then, the Hungarian algorithm finds a minimum-cost assignment between all detection frames and tracking prediction frames. The cost matrix used in this step contains not only the Mahalanobis distance but also the cosine distance of the appearance features extracted by the deep convolutional neural network. Solving with the Hungarian algorithm yields the optimal pairing of prediction frames and detection frames. The DeepSORT algorithm uses cascade matching: the fewer frames since a track's last successful match, the higher its priority in the current matching. After a successful match, the tracking frame information is updated according to the detection frame information, and the tracking frame of the target in the next frame continues to be predicted from it. For the samples that fail to match, a cost matrix is constructed again from the IOU between the remaining tracking frames and detection frames and passed to the Hungarian algorithm for a second solution; after a successful match, the tracking frame is likewise updated and propagated. Whether a track is confirmed is recorded by marking it "true" or "false." A detection frame that fails to match spawns a new track flagged "false," which is examined over the subsequent rounds (age counts the rounds, with max_age = 3); if all three rounds of matching succeed, the flag is changed to "true." For tracking frames that fail to match, those marked "false" have their tracking task stopped, while those marked "true" are given a lifespan within which the same three rounds of examination are conducted.
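The two-round association described above can be condensed into a short sketch. This is a simplified illustration, not the exact DeepSORT implementation: it assumes precomputed cost matrices `fused_cost` (Mahalanobis plus appearance cosine) and `iou_cost`, track objects carrying a `time_since_update` counter, and illustrative gate thresholds.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def solve_assignment(cost, gate):
    """One Hungarian round; pairs costlier than `gate` count as unmatched."""
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_t = [t for t in range(cost.shape[0]) if t not in matched_t]
    unmatched_d = [d for d in range(cost.shape[1]) if d not in matched_d]
    return matches, unmatched_t, unmatched_d


def match_frame(tracks, detections, fused_cost, iou_cost,
                fused_gate=0.7, iou_gate=0.7, max_age=3):
    """Two-stage DeepSORT-style association for a single frame.

    Stage 1 (cascade matching): tracks are visited in order of
    time_since_update, so recently seen tracks get priority.
    Stage 2: leftovers are matched again on IOU cost alone.
    Tracks unseen for more than max_age rounds are assumed deleted.
    """
    matches, unmatched_d = [], list(range(len(detections)))
    unmatched_t = []
    for age in range(1, max_age + 1):  # cascade: freshest tracks first
        level = [t for t in range(len(tracks))
                 if tracks[t].time_since_update == age]
        if not level or not unmatched_d:
            unmatched_t.extend(level)
            continue
        m, ut, ud = solve_assignment(fused_cost[np.ix_(level, unmatched_d)],
                                     fused_gate)
        matches += [(level[r], unmatched_d[c]) for r, c in m]
        unmatched_t.extend(level[r] for r in ut)
        unmatched_d = [unmatched_d[c] for c in ud]
    # Second round: IOU-only matching for the remaining pairs.
    if unmatched_t and unmatched_d:
        m, ut, ud = solve_assignment(iou_cost[np.ix_(unmatched_t, unmatched_d)],
                                     iou_gate)
        matches += [(unmatched_t[r], unmatched_d[c]) for r, c in m]
        unmatched_t = [unmatched_t[r] for r in ut]
        unmatched_d = [unmatched_d[c] for c in ud]
    return matches, unmatched_t, unmatched_d
```

Here SciPy's linear_sum_assignment plays the role of the Hungarian solver; the single scalar gate per stage is a simplification of DeepSORT's separate Mahalanobis and cosine gating thresholds.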
State estimation methods mainly include state observers and various linear and nonlinear discrete estimators based on the Kalman filter. Liu et al. [36] proposed a novel vehicle sideslip angle estimation algorithm with the fusion of a dynamic model and vision for vehicle dynamic control. A vehicle attitude angle observer based on the square-root cubature Kalman filter (SCKF) is designed in [37] to estimate the roll and pitch and reject the gravity components induced by the vehicle roll and pitch. For simplicity, this article uses the Kalman filter for state estimation. The prediction equations of the Kalman filter are as follows:

$$\hat{x}_k^- = F_k \hat{x}_{k-1}^+,$$
$$P_k^- = F_k P_{k-1}^+ F_k^T + Q_k,$$

where $\hat{x}_k^-$ is the state estimate at time k, $\hat{x}_{k-1}^+$ is the state estimate at time k − 1, $P_k^-$ is the covariance matrix of the state estimate, $F_k$ is the state transition matrix, and $Q_k$ is the covariance matrix of the process noise. The measurement update equations of the Kalman filter are as follows:

$$K_k = P_k^- H_k^T \left( H_k P_k^- H_k^T + R_k \right)^{-1},$$
$$\hat{x}_k^+ = \hat{x}_k^- + K_k \left( y_k - H_k \hat{x}_k^- \right),$$
$$P_k^+ = \left( I - K_k H_k \right) P_k^-,$$

where $y_k$ is the measurement vector at time k, $H_k$ is the measurement matrix, $R_k$ is the covariance matrix of the measurement noise, $K_k$ is the Kalman gain used to correct the state estimate, and $I$ is the identity matrix. The state vector of the DeepSORT algorithm can be described as

$$x = [u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h}]^T,$$

where u, v, r, and h represent the target box center coordinates x and y, the aspect ratio, and the height, respectively, and $\dot{u}$, $\dot{v}$, $\dot{r}$, and $\dot{h}$ represent the corresponding rates of change predicted with Kalman filtering. The DeepSORT algorithm uses a cost matrix constructed from the Mahalanobis distance and the cosine distance of appearance features in the first data association. The Mahalanobis distance correlation metric is calculated as

$$d^{(1)}(i, j) = \left( d_j - y_i \right)^T S_i^{-1} \left( d_j - y_i \right),$$

where $d_j = [u_j, v_j, r_j, h_j]^T$ represents the jth detection state, $y_i$ represents the state of the ith tracked target predicted for the current frame from the previous frame, and $S_i$ is the covariance between the detected state and the predicted state. The cosine distance measurement of appearance features is

$$d^{(2)}(i, j) = \min \left\{ 1 - r_j^T r_k^{(i)} \;\middle|\; r_k^{(i)} \in R_i \right\},$$

where $r_j$ is the feature vector of the jth detection frame, $r_k^{(i)}$ is the kth feature vector of the ith tracking frame, and $R_i$ is the gallery of features from the most recent successful associations of track i. The DeepSORT algorithm constructs a deep convolutional neural network to extract the appearance features of the tracking target and uses L2 normalization to project the features. The network structure is shown in Table 1.

Table 1: Structure of the DeepSORT appearance feature extraction network.

Layer                        In    Out   Kernel size   Stride
Conv                         3     32    3             1
Conv                         32    32    3             1
MaxPool                      32    32    3             2
Residual                     32    32    3             1
Residual                     32    32    3             1
Residual                     32    64    3             2
Residual                     64    64    3             1
Residual                     64    128   3             2
Residual                     128   128   3             1
Dense                        -     128   -             -
Batch and L2 normalization   -     128   -             -
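To make the two association metrics concrete, the following sketch evaluates them for one track-detection pair; y_i, S_i, R_i, d_j, and r_j follow the definitions above, and the λ-weighted fusion is an assumption borrowed from the original DeepSORT formulation rather than something specified here.

```python
import numpy as np

def mahalanobis_cost(y_i, S_i, d_j):
    """Squared Mahalanobis distance between the predicted measurement
    (y_i, S_i) and a detection d_j = [u, v, r, h]."""
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)

def appearance_cost(R_i, r_j):
    """Smallest cosine distance between detection feature r_j and the
    gallery R_i of past features of track i (all L2-normalized)."""
    return float(min(1.0 - r_k @ r_j for r_k in R_i))

def fused_cost(y_i, S_i, R_i, d_j, r_j, lam=0.0):
    """DeepSORT-style fusion; lam weights motion against appearance
    (lam = 0 relies on appearance alone, as in the original paper)."""
    return lam * mahalanobis_cost(y_i, S_i, d_j) \
        + (1.0 - lam) * appearance_cost(R_i, r_j)
```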
Due to the nonconstant update frequency of image frames, we use the time difference between two consecutive frames as the time step of the Kalman filter during the discretization process. This approach allows us to dynamically adjust the state update rate of the Kalman filter based on the actual situation, which helps to better track the target.
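A minimal NumPy sketch of this variable-step filter follows, with the constant-velocity transition matrix rebuilt from the measured inter-frame interval before each predict step (a simplified stand-in, not the paper's exact implementation):

```python
import numpy as np

class DtKalmanFilter:
    """Constant-velocity Kalman filter over x = [u, v, r, h, du, dv, dr, dh]
    whose transition matrix is rebuilt from the actual inter-frame dt."""

    def __init__(self, x0, P0, Q, R, H):
        self.x, self.P = x0, P0           # state estimate and covariance
        self.Q, self.R, self.H = Q, R, H  # noise covariances; H maps state to [u, v, r, h]

    def predict(self, dt):
        F = np.eye(8)
        F[:4, 4:] = dt * np.eye(4)        # position += velocity * dt
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q

    def update(self, y):
        S = self.H @ self.P @ self.H.T + self.R   # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ (y - self.H @ self.x)
        self.P = (np.eye(8) - K @ self.H) @ self.P
```

Calling `predict(t_now - t_prev)` before each `update` reproduces the equations above while letting the state update rate follow the actual frame timing.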

Improved YOLOv5 Vehicle Detection Method
To achieve high-precision vehicle tracking, the vehicle detection algorithm is studied in this section. To further improve the realtime performance of vehicle detection, the DSC structure with a residual connection is introduced, and the YOLOv5s_DSC vehicle detection algorithm is proposed, which has fewer parameters, less computation, and a faster detection speed.

Depthwise Separable Convolution.
With the help of grouped convolution, DSC uses point-by-point convolution to fuse the feature information of different channels, which achieves a lightweight deep learning network while preserving feature extraction. It consists of the following two steps: (1) Channel-by-channel (depthwise) convolution: the input image is H_in × W_in × C_in. Each channel is convolved independently with a K × K × 1 kernel, yielding C_in feature maps of size H_out × W_out. The number of parameters of these kernels is K × K × C_in. As shown in Figure 2, if a 3-channel image is the input to the channel-by-channel convolution, 3 single-channel feature maps are obtained. (2) Pointwise convolution: a 1 × 1 × C_in × C_out convolution kernel is applied to the output of step (1) to obtain a feature map of size H_out × W_out × C_out. As shown in Figure 3, the number of parameters of this step is 1 × 1 × C_in × C_out.
The number of parameters of the whole DSC is therefore

$$K \times K \times C_{in} + 1 \times 1 \times C_{in} \times C_{out}.$$

This is similar to grouped convolution with C_in groups. The difference is that grouped convolution concatenates the result of each group, while DSC forms a weighted combination of the per-channel results through point-by-point convolution, which makes full use of the feature information of each channel at the same position.
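The factorization and its parameter count can be checked with a few lines of PyTorch; the 64-channel sizes below are chosen to mirror the example in the next subsection:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """K x K depthwise (channel-by-channel) conv followed by a
    1 x 1 pointwise conv that fuses information across channels."""

    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2,
                                   groups=c_in, bias=False)    # K*K*C_in parameters
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)  # C_in*C_out parameters

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

dsc = DepthwiseSeparableConv(64, 64, k=3)
print(sum(p.numel() for p in dsc.parameters()))  # 3*3*64 + 64*64 = 4672
```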

YOLOv5s Improvement Strategy.
The YOLOv5s model has 283 layers in total, 7,071,633 parameters, and a computational cost of 16.4 GFLOPs. To further simplify the network structure, reduce the amount of computation, and shorten the inference time of the model, the DSC structure is introduced to replace the C3 structures in the backbone of YOLOv5s. As shown in Figure 4, the first C3 structure in the YOLOv5s network contains five convolutions, whose parameters are shown in Table 2.

Table 2: Convolutions in the first C3 structure of YOLOv5s.

Layer   In channels   Out channels   Kernel size
Conv1   64            32             1
Conv2   64            32             1
Conv3   64            64             1
Conv4   32            32             1
Conv5   32            32             3
From Table 2, it can be calculated that the numbers of parameters for Conv1 and Conv2 are 2,048 each, the number of parameters for Conv3 is 4,096, the number of parameters for Conv4 is 1,024, and the number of parameters for Conv5 is 9,216. Therefore, the number of parameters of the first C3 structure in the YOLOv5s network amounts to 18,432. DSC performs two convolutions: the first obtains the features of each channel, and the second fuses the position information across channels. Matching the first C3 structure in the YOLOv5s network, the input and output channels of the DSC are also set to 64. The channel-by-channel convolution is 3 × 3, giving 576 parameters, and the point-by-point convolution is 1 × 1, giving 4,096 parameters. Thus, the number of parameters of the DSC structure amounts to 4,672, which is 13,760 fewer than that of the first C3 structure in the YOLOv5s network. The backbone of the YOLOv5s network contains four C3 structures, which are replaced by DSC structures in turn. To avoid network degradation caused by the replacement, a residual structure is introduced, as shown in Figure 5.
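A hedged PyTorch sketch of such a residual DSC block follows; since the paper specifies the block only via Figure 5, the BatchNorm and SiLU choices here are assumptions in keeping with YOLOv5 conventions:

```python
import torch.nn as nn

class ResidualDSC(nn.Module):
    """DSC block with an identity shortcut, intended as a drop-in
    replacement for a C3 block with equal input/output channels."""

    def __init__(self, c, k=3):
        super().__init__()
        self.dw = nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False)
        self.pw = nn.Conv2d(c, c, 1, bias=False)
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()

    def forward(self, x):
        # The shortcut keeps gradients flowing and counters degradation.
        return self.act(self.bn(self.pw(self.dw(x))) + x)
```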
The introduction of DSC effectively reduces the number of parameters and makes the network model smaller. The comparison of the number of parameters after replacement is shown in Table 3. The number of parameters of each structure includes the parameters of convolution, bias, and batch normalization. The improved network framework is presented in Table 4.

Dataset Preparation.
The VeRi dataset [38] is a large vehicle re-identification dataset, which contains vehicle images from multiple angles and under different light intensities. It is suitable for research on vehicle re-identification. As shown in Figure 6, each folder contains pictures of the same vehicle taken from different angles, with a total of 776 folders. The training set and the test set are split in a proportion of 8 : 1.
UA-DETRAC [39] is a vehicle dataset collected from real traffic environments in Beijing and Tianjin, labeled with the four vehicle categories "Bus," "Car," "Van," and "Others," including vehicle images from different angles and periods and covering most traffic conditions. The UA-DETRAC dataset contains a total of 60 image folders collected from different road sections and periods, and each folder corresponds to an XML label file. We strip the label corresponding to each image from the XML file and convert all the XML label files into TXT format, as sketched below. The training set and the test set are split in a proportion of 9 : 1, giving 73,876 training images and 8,209 test images in total. The data structure of images and labels is shown in Figure 7.
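A sketch of this label conversion is shown below; the tag and attribute names follow the UA-DETRAC XML schema, while the file paths are hypothetical:

```python
import xml.etree.ElementTree as ET

CLASSES = ["Bus", "Car", "Van", "Others"]

def box_to_yolo(left, top, w, h, img_w, img_h):
    """Convert a pixel-space box to YOLO's normalized cx, cy, w, h."""
    return ((left + w / 2) / img_w, (top + h / 2) / img_h,
            w / img_w, h / img_h)

def convert_frame(targets, img_w, img_h, txt_path):
    """Write one frame's annotations as 'class cx cy w h' lines."""
    lines = []
    for t in targets:  # each t is one <target> element (one vehicle)
        box, attr = t.find("box"), t.find("attribute")
        cls = CLASSES.index(attr.get("vehicle_type"))
        cx, cy, w, h = box_to_yolo(float(box.get("left")), float(box.get("top")),
                                   float(box.get("width")), float(box.get("height")),
                                   img_w, img_h)
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))

# Hypothetical usage for one sequence file (UA-DETRAC frames are 960 x 540):
# root = ET.parse("MVI_20011_v3.xml").getroot()
# for frame in root.iter("frame"):
#     targets = frame.find("target_list").findall("target")
#     convert_frame(targets, 960, 540, f"labels/{int(frame.get('num')):05d}.txt")
```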

Training of the DeepSORT Deep Convolutional Neural Network.
The vehicle re-identification dataset is used to train the DeepSORT deep convolutional neural network so that it can correctly extract the appearance features of vehicles for the calculation of the appearance cosine distance. Since the task is vehicle tracking, the input of the network is set to 128 (h) × 64 (w), according to the aspect ratio of vehicle images. The network model is built under the PyTorch framework. The initial learning rate is set at 0.1 and is reduced to 0.1 times its value every 40 epochs. The training loss curve is shown in Figure 8. After 100 epochs, the loss tends to be stable and the accuracy on the test set reaches 88%.
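For reference, the step decay described above corresponds to the following PyTorch configuration; the small sequential model is only a stand-in for the Table 1 network:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

# Stand-in for the Table 1 feature network; input is 3 x 128(h) x 64(w).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 128),  # 128-D appearance embedding
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = StepLR(optimizer, step_size=40, gamma=0.1)  # lr x0.1 every 40 epochs

for epoch in range(100):
    # ... one pass over the vehicle re-identification training set ...
    scheduler.step()
print(optimizer.param_groups[0]["lr"])  # 0.001 after the decays at epochs 40 and 80
```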

For detector training, the batch size is 128, and 50 epochs are trained. The training loss is shown in Figure 9. The YOLOv5s_DSC network decreases as fast as YOLOv5s in the regression loss, the classification loss, and the target loss. The lowest values of the three losses for YOLOv5s are 0.01722, 0.0011741, and 0.02758, respectively, while the lowest values for YOLOv5s_DSC are 0.01835, 0.0013954, and 0.02933, respectively, which indicates that the introduction of the DSC structure with residual connections does not add much to the training difficulty of the network.
The performance of YOLOv5s and YOLOv5s_DSC is compared in terms of mAP, precision, and recall. In Figure 10, the curves of the two networks almost coincide, which indicates that the introduction of the DSC structure with residuals reduces the number of network parameters without reducing the accuracy of the network. The YOLOv5s_DSC curve with KF (Kalman filter) is smoother than that of YOLOv5s_DSC alone, and likewise for YOLOv5s; this indicates that the KF can dynamically adjust its update rate, which helps to track targets better. In Table 5, except for the optimal mAP0.5:0.95, the differences between YOLOv5s_DSC and YOLOv5s in optimal mAP0.5 (mean average precision), precision, and recall are no more than 1%, while the number of parameters is reduced by 23.5%, the amount of computation is reduced by 32.3%, and the size of the weight file is decreased by 20%. In a hardware environment with an NVIDIA GeForce RTX 3080 graphics card and an Intel(R) Xeon(R) CPU E5-2670 v3, the average processing speed per image is improved by 18.8%, which proves that the proposed algorithm is faster while maintaining accuracy.

Verification Experiment.
A video of traffic flow captured from the front of the intersection is selected as the input. As shown in Figure 11, the YOLOv5s_DSC vehicle detection algorithm can effectively detect vehicles and correctly classify them in this view. Each detection frame contains two pieces of information: the vehicle category name and the category confidence. In the hardware environment shown in Table 6, the detection speed of the algorithm reaches 77 FPS. A video of traffic flow captured from the oblique side of the intersection is then selected as the input, and the YOLOv5s_DSC vehicle detection algorithm can also effectively detect and correctly classify the vehicles in this view, as shown in Figure 12. YOLOv5s_DSC can accurately detect vehicles from different angles. As shown in Figure 12(b), local mutual occlusion between vehicles does not affect the detection performance of the algorithm. Therefore, the YOLOv5s_DSC algorithm can provide realtime and accurate vehicle detection information for vehicle tracking.
To test the tracking performance and occlusion robustness of the algorithm, YOLOv5s_DSC is used as the detector and connected to DeepSORT. As shown in Figure 13, the tracking boxes of different types of vehicles have different colors, and each tracking box includes a tracking ID in addition to the category and category confidence information of the vehicle. In the hardware environment shown in Table 6, the YOLOv5s_DSC + DeepSORT algorithm achieves a processing speed of 25 FPS.
Next, the robustness of the proposed algorithm is verified in occlusion scenes. Two occlusion situations are considered: (1) the target is occluded by foreign objects and (2) the targets occlude each other. First, the robustness of the proposed algorithm is verified when the target is occluded by external objects, testing re-identification and re-tracking after the target disappears. A traffic video in which vehicles are blocked by a pillar is selected to verify the algorithm. Figure 14 shows four consecutive images. The dark car with tracking ID 3 reappears after being blocked by a pillar and can still be tracked by the algorithm. This result shows that the algorithm exhibits strong robustness and accuracy in occluded scenes, providing strong support for target tracking in practical applications.
We also evaluate the algorithm's performance in scenarios where targets are occluded by other targets. Specifically, we test the algorithm's ability to track targets that are partially occluded by other targets. A video sequence in which a bus partially occludes a car that is being tracked is selected. As shown in Figure 15, the vehicle with tracking ID 4 is partially blocked by the bus with tracking ID 1. Despite the occlusion, the tracking ID of the car remains unchanged, showing that the proposed algorithm is capable of handling partial occlusions between targets. These results further demonstrate the robustness and effectiveness of the algorithm in occluded scenes, which is crucial for the practical application of target tracking.

Conclusions
This article investigates the application of the DeepSORT algorithm in vehicle tracking, using vehicle flow videos from different scenarios to verify the effectiveness and robustness of the YOLOv5s_DSC vehicle detection algorithm. The YOLOv5s_DSC + DeepSORT algorithm is validated on traffic flow videos in which vehicles disappear behind obstructions and occlude one another. It is shown that the algorithm has good re-identification and re-tracking ability and robustness against partial occlusion of targets. However, the algorithm in this article does not take into account the detection performance in different weather environments such as rainy days, foggy weather, and blurred vehicle video. In future work, model compression methods will be studied to further compress the network model while maintaining accuracy and improving inference speed, and algorithms such as environment optimization will be combined to achieve more scene applications.

Data Availability
The data used to support the findings of this study are available upon request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.