Research on Intelligent Video Detection of Small Targets Based on Deep Learning Intelligent Algorithm

Compared with the traditional object detection algorithm, the object detection algorithm based on deep learning has stronger robustness to complex scenarios, which is the hot direction of current research. According to the process characteristics of the object detection algorithm based on deep learning, it is divided into two-stage object detection algorithm and single-stage object detection algorithm, focusing on the problems solved by some classical algorithms and their advantages and disadvantages. In view of the problem of object detection, especially small object detection, the commonly used data sets and performance evaluation indicators are summarized; the characteristics, advantages, and detection difficulties of various common data sets are compared; the challenges faced by commonly used object detection methods and small object detection are systematically summarized; the latest work of small object detection methods based on deep learning is sorted out; and the small object detection methods based on multiscale and small object detection methods based on super-resolution are introduced. At the same time, the lightweight strategy for target detection methods and the performance of some lightweight models are introduced; the characteristics, advantages, and limitations of various methods are summarized; and the future development direction of small object detection methods based on deep learning is prospected.


Introduction
e main purpose of object detection technology is to find the location of the target in the input image while determining the category attributes of the target, with the development of computer hardware and software and deep learning theory, based on the goal of deep learning.
Detection has been widely used in intelligent security, smart medical, unmanned driving, and other fields. Before the advent of deep learning, traditional object detection algorithms were mainly based on manual design features, such as Sobel edge detection features, Haar features, and Hog features, which have weak generalization ability and poor performance in complex scenarios.
Object detection is a basic computer vision task that combines the two tasks of target localization and recognition, the purpose of which is to find several targets in the complex background of the image, give each target a precise target enveloping box, and determine the category to which the target in the enveloping box belongs [1]. e popularity of deep learning has benefited from object detection technology, and at present, deep learning has been widely used in the entire field of object detection, including general object detection and domain-specific object detection. Among them, small object detection is a hot and difficult problem in the field of computer vision. Due to the limited resolution and amount of information of small targets, the task of detecting small objects has become a huge challenge in the field of computer vision at this stage. Small object detection tasks also play a very important role in civil, military, security, and other fields, such as UAV target detection of ground vehicles and pedestrians, ground target detection of remote-sensing satellite images, identification of unmanned pedestrian vehicles and traffic signs in the distance, detection of some early lesions and lumps in medical imaging, and automatic industrial inspection of small defects on positioning materials. With the gradual complexity and intelligence of computer vision systems in real life, the detection task of small targets also needs more attention.
In view of the problem of object detection, especially small object detection, this paper first summarizes the commonly used data sets, systematically summarizes the commonly used object detection methods, as well as the challenges faced by small object detection, sorts out the latest work of small object detection methods based on deep learning, briefly reviews, and finally summarizes its advantages and disadvantages and discusses its possible future development directions. e control variable method compares the detection accuracy of the same target at different learning rates to verify the limitations of the target detection method and provide the basis for the future development direction.

Small Object Detection
2.1.1. Definition of Relative Scale. Define small targets from the relative ratio of targets to images. Chen et al. proposed a data set for small targets, defining small targets: in the same category, the median of the relative area of all target instances, that is, the ratio of bounding box area to image area, is between 0.08% and 0.58%. For example, in 640 × 480 pixel resolution images, targets of 16 × 16 to 42 × 42 should be considered small targets.

Absolute Scale Definition.
Small targets are defined from the absolute pixel size of the target. e most common definition comes from the MS COCO data set, which defines a small target as a target with a resolution of less than 32 × 32 pixels.
Although great progress has been made in object detection in recent years, the above methods only have a good effect on conventional target detection problems, and the extracted features have poor representation ability for small targets and less than ideal detection effects for small targets. e study of Huang et al. [2] shows that the average accuracy of small targets in detectors at this stage is about 10 times lower than that of large targets, which is not to say that convolutional neural networks are proposing. e characteristics of the performance ability are not enough, but the resolution of the small target is too low and can provide less information to the model, which is also one of the bottlenecks that currently limit the development of target detection; more and more experts and scholars will also turn their attention to the field of small target detection and carry out research, and a series of effective improvement methods have been proposed.

Small Object Detection Method Based on Multiscale.
Multiscale prediction refers to the prediction of the class and coordinates of objects on multiple feature maps of different scales. In the early stage of the development of object detection models, representative algorithms such as YOLO and Faster RCNN only use the last layer of features of the backbone network for target detection, resulting in insufficient detection performance of small targets. For the first time in SSDs, multiscale prediction has been used to improve the detection performance of small targets. At present, the use of multiscale prediction has become a basic operation to improve the performance of small target detection.
Most of the existing convolutional models of general object detection use the topmost features of the convolutional model for prediction, and the amount of information about small targets is less, so it is necessary to make better use of the details of the image. In a convolutional neural network, low-level features can often well represent the texture, edges, and other details of the image, while highlevel features can often represent the semantic information of the image very well, but the corresponding convolutional pooling will also ignore some details. In view of this factor, Ye et al. [3] took the lead in introducing the idea of multiscale, proposed the SSD algorithm, and predicted on the feature map extracted at each scale, and the detection of small targets was better improved than that of the YOLO algorithm. Although shallow features can better represent detailed information, due to the lack of semantic information, coupled with the small target corresponding to less anchor, it cannot be fully trained, and the effect of SSD in practical applications is still not satisfactory. Fu et al. [4] made improvements to SSDs for small targets because they did not adequately correspond to anchor training as shown in Figure 1, replacing SSDs with ResNet. e VGG model, using deconvolution layers, divides the picture into smaller grids, thereby reducing the rate of missed detections. However, due to the introduction of the ResNet model by DSSD, which has more complex residual connections and lateral connections, and the addition of additional layers to the prediction module and the deconvolution module in the model, which introduces additional overhead, the DSSD algorithm is not as fast as the SSD in prediction speed algorithm. Figure 1 is the specific algorithmic process framework.
Qu et al. [5] think from the perspective of the data set, believing that the scale of the target object in the current data set is quite different, and the small target is too small compared to the picture to be detected and proposes a multiscale training method: the scale normalization of the image pyramid (SNIP), training on each scale of the pyramid, and efficiently using all the training data. Although the detection effect of small targets has been significantly improved, it is slower.
Lin et al. [6] proposed that the feature pyramid network (FPN) uses the method of collecting samples to integrate the low-level features with more detailed information and the high-level features with more semantic information. Although the efficiency of this method is low, it enhances the ability of extracting depth features of small targets, and the effect is better than general detection methods. Cao et al. combined the idea of FPN into SSD, thereby improving the detection effect of SSD algorithm on small targets, and because of its emphasis on the lightweight of the model, the parameters are slightly less, and the noise in the background information cannot be better screened out, compared with the use of feature fusion ideas. e DSSD algorithm is slightly less accurate. On the basis of FPN, Liu et al. Shrivastava et al. [7] proposed a structure similar to FPN, which realized the feature fusion of top-down in another way to improve the detection effect of small targets, and the fusion of the algorithm did not use simple weighted superposition like FPN but used convolution for fusion. e core of the algorithm lies in its top-down modulation module, the core structure of the module can be selected by itself, but because it is convoluted for feature fusion, resulting in each new module network to be gradually trained once, the training process is more cumbersome, not necessarily applicable to the actual scene. Subsequently, Ghiasi et al. and Xu et al. proposed that NAS-FPN and auto-FPN optimize the FPN algorithm, which is different. In the previous artificially designed network structure, auto-ML technology was applied to object detection so that the neural network automatically searched for the design and thus improved the efficiency of the FPN algorithm. In order to make better use of multiscale features, Guo et al. introduced a new feature pyramid structure, AugFPN, which used consistency supervision to narrow the semantic gap before feature fusion, and used residual features to reduce the convolutional pooling process. Finally, a Soft-ROI selection method is proposed to better learn the characteristics, averaged on the ResNet50 network [8]. e accuracy is improved by 2.3 percentage points, but the complexity of the model resulted in training times and frame rates using the AugFPN algorithm being inferior to that of the FPN algorithm under the same conditions, such as training each on the ResNet50 network Epoch, Faster RCNN with AugFPN takes 1.1 h, while FPN takes only 0.9 h, the frame rate is 11.1 frames and 13.4 frames, respectively. Rashwan et al. believe that the previous multiscale approach did not take into account the factors of the length and width scale, proposed the MatrixNet model, and reached 47.8% on the MS COCO data set. e average accuracy is higher than any other existing state-of-the-art single-stage object detection method, but although the method uses a matrixbased hierarchical prediction mechanism, it does not consider combining semantic information at different levels, such as high-level low-resolution and low-level high-resolution, and combining the proposed long-width scale ideas, it may be possible to go further in accuracy [9].
It can also be seen that in order to obtain better results and obtain more effective small target feature information, the multiscale detection model also changes from the initial single-layer feature to multilayer feature fusion, and the multilayer feature fusion also develops from a simple weighted superposition at the beginning to convolutional fusion and adding certain residual characteristic blocks to the model, etc. e gradual redundancy of the model can gradually improve the detection effect, but this redundancy also makes it more difficult to apply to the actual scene. erefore, some scholars have also begun to do research on the lightweight of the model, and proposed some excellent models, such as MobileNet and ShuffleNet [10], and applied them to some existing excellent methods, such as Table 1 as shown. How to better maintain the detection accuracy while Computational Intelligence and Neuroscience reducing the complexity of the model is also a difficult problem, while applying the lightweight model, the researchers also proposed some target detection methods that combine the lightweight strategy as shown in Table 2.

Small Object Detection Method Based on Super-
Resolution. Because small targets occupy few pixels and have low resolution in the image, another direct way to detect small targets is to generate a high-resolution image as input to the detection model. Hu et al. used bilinear interpolation to obtain two upsampled input images to train a convolutional model, Fookes et al. use traditional super-resolution techniques to better recognize faces [11]. Although this improves the resolution of the input image which is beneficial to the detection of small targets, but also brings other problems, the super-resolution model and the detection model are trained independently of each other, the high-resolution input image generated by the super-resolution model also includes no need to detect unnecessary objects and factors, and the increase in the resolution of the input image makes the overall architecture too heavy; the training and prediction time of the model will be greatly increased, reducing the possibility of practical application. Haris et al. also proposed an architecture for end-to-end joint training of super-resolution models and detection models for this problem, but there are still a large number of images unrelated to the detection task that perform super-resolution and thus reduce the overall efficiency [12].

Other Methods.
In addition to small object detection methods based on multiscale and super-resolution, there are some excellent methods. Takeki et al. proposed a small target detection method that combines the semantic segmentation method for the identification of small target birds in the context of the sky and takes advantage of the weak semantics of small targets to combine variants of the full convolutional network and the convolutional network and integrate the support vector machine, but it is difficult to expand only for this specific environment [13]. In the field of small object detection in remote-sensing satellite images, Ren et al. studied the RPN module of Faster RCNN and proposed the anchor corresponding to the conventional RPN module. e box scale is too large to cover small targets in the remote-sensing data set, so the RPN module corresponding to the small target scale is specially designed and combined with contextual information to improve model performance and in their self-made SORSI remotesensing data set (containing 5 216 ship images and 706 aircraft images) achieved an average accuracy of 78.9%, but due to training samples, it still did not work well for remote-sensing targets in complex scenes and dense small optical remote-sensing targets [14]. In terms of small face detection, Zhang and others have used the difficult negative examples in OHEM to dig up ideas and dynamically at the image level and feature level. Assign the difficulty scores to the training images to determine whether they have been well detected. In this way, we can make full use of those images that have not been perfectly detected to better monitor the subsequent learning process, and achieve excellent performance on a wider range of face data sets. e subset also achieves 89.7% accuracy. Luo et al. proposed a four-branch face detection architecture that treats large, medium, and small faces separately and uses feature fusion technology to add more anchors to match small faces, which further improves the detection ability of small faces [15]. Replace the 3 × 3 convolution with 1 × 1 convolution, reduce the number of convolution channels, and postpone the sampling operation to significantly reduce the number of parameters and the amount of computation so as to maximize the speed increase in exchange for the reduction in accuracy 57.5

Computational Intelligence and Neuroscience
Zoph et al. said that the ability of future lightweight models to extract features is bound to be limited, starting from the data enhancement aspect is a weapon to enhance the detection effect, and assumes that when the proposed features are good enough, the use of data enhancement can get rid of the serious data-driven dependence of current algorithms. Kisantal et al. believe that there are two main reasons for the low accuracy of small target detection, one is that there are fewer pictures containing small targets in the existing public data set, and the other is that even if the pictures contain small targets, but the number of occurrences is not sufficient, the model training is not sufficient, and a means of oversampling and copying and pasting small targets are proposed for this point to enhance the data [16]. By copying the small targets in the image and pasting them to different locations in the image, the number and location diversity of the small targets in the image are increased, and the corresponding number of matching anchors will also be enhanced, thereby reducing the missed detection rate, using the Mask RCNN algorithm as a benchmark to compare the method of no data enhancement on the MS COCO data set. e detection accuracy of small targets increased by 7.1 percentage points. In addition, some scholars have also proposed some small object detection algorithms based on deep learning as shown in Table 3.

Object Detection Algorithm.
With the proposal of the two-stage object detection algorithm RCNN in 2014, the object detection algorithm has officially entered the era of deep learning, but the algorithm is time-consuming and cannot be practically applied.
rough the in-depth understanding and improvement of the algorithm, algorithms such as SPP-Net, Fast RCNN, and Faster RCNN have been proposed one after another, and the algorithm speed has increased by hundreds of times, basically meeting the application requirements. With the increase in the application of algorithms, the shortcomings of the algorithm are also exposed, and the detection effect of small targets is not ideal is one of them; researchers through the study of feature fusion methods proposed FPN, CascadeR-CNN, M2Det, and other algorithms, which greatly improved the detection effect of small targets and the accuracy of the algorithm. e methods of object detection are mainly divided into two categories: object detection methods based on traditional artificial features and object detection methods based on deep learning. e concept of deep learning has its roots in artificial neural networks. e motivation for studying deep learning is to build a neural network that simulates the learning process of the human brain, and its theoretical basis is the general approximation theorem [17]. e theorem states that if enough hidden neurons are included, even if there is only one implicit layer, the expressed input-output map can adequately approximate any one continuous function defined on a unit cube. e mathematical description of the general approximation theorem is as follows: eorem 2.1 states that the activation function must be a nonlogarithmic, bounded, and monotonically increasing continuous function I m � [0, 1] m , representing the unit hypercube of m-dimensional space and C(I m ) representing a continuous function space on I m . en, given any continuous function f ∈ C(I m ) and ε > 0, there is a positive integer n > 0 and the real constants a i , w ij , and b i (1 ≤ i ≤ n, 1 ≤ j ≤ m), which make the equation valid.
For all X 1 , X 2 , . . . in the input space, X m , satisfies relational: Starting from the perspective of network architecture, we use cross-stage feature fusion to optimize the repetitive gradient information in the network to achieve lightweighting In the same environment, the computation is reduced by nearly 30% and the accuracy is improved by 2% compared to yolov3 YOLO Nano 2019 Design PEP macro-architecture by human-machine collaborative design strategy, combined with fully connected attention module for embedded environment to significantly reduce the amount of computation, but only in embedded environment Achieve an average accuracy of 69.1% on the VOC2007 data set and a model size of only 4 MB underNet 2019 Based on ShuffieNet v2, compressing RPN module, proposing context enhancement module, using 1 × 1 convolutional compression channel to achieve feature fusion effect while reducing computational cost, and introducing spatial attention mechanism to optimize feature distribution to reduce computational effort Obtain 19.1% AP on the COCO data set, which is similar to the accuracy of SSD using MobileNet, but nearly 5 times faster and significantly cheaper to compute

RefineDeLite 2020
A new backbone network Res2NetLite is proposed for the detection task, ensuring the same number of input and output channels, focusing on optimizing the loss function and training strategy e AP value of 26.8% is achieved on the COCO data set, and it can reach 29.6% with its proposed training strategy, which is the best lightweight network at present Computational Intelligence and Neuroscience

Gradient Descent Method.
In the process of function f constantly approximating the F function, it is necessary to use the gradient descent method to determine the new direction of each iteration so that the objective function to be optimized is gradually reduced. e gradient descent method is the basic method for calculating the minimum values of continuously differentiable functions without constraints. Let f (x) be continuously distinct around x k , such that x � x k + αd, where d is the unit direction (‖d‖ � 1). If g k � ∇f(x k ) ≠ 0 so, you get the Taylor unfolding formula.
e angle between d and −g k Θ satisfies the relation When θ � 0, g T k d is the minimum, and in this case, d � −g k . us, the negative gradient direction −g k is the direction in which the function f (x> decreases fastest around x k . If f(x) � 0.5x T Gx, where G is a symmetric positive definite matrix of n × n, then the convergence rate of the gradient descent algorithm is linear, and the resulting point column {x k } is all k which satisfies the relational. Small object detection using context and attention e multiscale features are concatenated and the additional features of different spreading are used as contexts, and a target detection algorithm combining the note checking force mechanism is also proposed, and the detection effect of both methods on small targets is higher than SSD Computational Intelligence and Neuroscience where the minimum values x n are f(x) and the maximum λ 1 and λ 2 are minimum eigenvalues of G, respectively. All the training data are required during each iteration of the gradient descent method, and each iteration takes a long time when the data volume is large, resulting in the whole process converging very slowly, so there are various deformations, including SGD (Stochastic Gradient Descent), Momentum SGD AdaGrad (Adaptive Gradient), RMSProp (Root Mean Square prop), Adam (Adaptive Moment estimation), and AdaDelta (Adaptive delta) [18].
SGD updates the current weight (zL/zW) * αw using a gradient, and its iteration relationship is as follows: where L represents the error between the actual output of the sample x l and the desired output. Momentum SGD updates the weights using the exponential moving average V of the historical and current gradients, with its iterative relationship: where V is initialized to a zero matrix and β � 0.9. AdaGrad updates the weights using the learning rate a divided by the square root of S, the gradient part being the same as SGD, but constructing an iterative relationship.
where S is the accumulation of historical and current gradient squares, usually initialized to a zero matrix, and ε is the parameter that ensures that the denominator is not 0 a � 0.01, ε � 10 − 7 .
RMSProp is an improvement over AdaGrad, using an exponential moving average of gradients instead of cumulative square gradients and updating the weights, iterative relational: where S is initialized to a zero matrix a � 0.01, ε � 10 − 7 , β � 0.9. Adam's iterative relationship is a combination of Momentum and RMSProp. e gradient section, like the Momentum SGD, uses the exponential moving average of the gradient instead of the current gradient update weights. e learning section is learned using the RMSProp learning rate divided by the exponential moving average of the gradient.

Traditional Object Detection Algorithms.
Early object detection algorithms were mostly built on manual features. First, the areas that may have a target are found on the original image of the input, then the feature extraction is performed on each region, and the classifier model is sent to the classifier model for judgment, and finally, the area that the classifier model considers to be the target is filtered and other postprocessing operations are used to obtain the results. Due to the lack of efficient image representations at the time, people had no choice but to design complex feature representations and use various acceleration techniques to exhaust limited computing resources [19].

VJ Detector.
Nineteen years ago, Viola and Jones designed an efficient face detector that was dozens of times faster than other detectors at the time, a milestone in the development of face detection and even computer vision, which was named Viola-Jones (VJ) in honor of this work. VJ Detector adopts the detection method of sliding window, uses Haar features to describe each window, and introduces an integral graph to accelerate the extraction of Haar features so that the computational complexity of each window is independent of window size combined. e AdaBoost algorithm selects features and introduces cascading ideas to reduce the amount of background window computation, increase the target computation of faces, improve accuracy, and reduce the scale of computation.
Computational Intelligence and Neuroscience

HOG Features.
e directional gradient histogram (HOG) feature was first proposed as a local feature by Daal et al. for pedestrian detection problems. As the name suggests, HOG features accumulate to form a histogram by calculating gradient values in different directions in an area of the picture as a feature of this area [20]. HOG features can better extract the local detail information of the image, in the case of geometric deformation of the image, optical distortion, etc., and have a good feature invariance; for many years, HOG features have been the basis of many object detectors and various types of computer vision systems.

Variable Part Model (DPM).
Deformable part model is a deformable part model or DPM model for short. is model is very intuitive, modeling the target object into a combination of several parts. For example, it sees humans as a combination of head/body/hands/legs. Although DPM seems to be very simple, how to translate this intuition into a mathematical model is very challenging (SVM, convex optimization, coordinate descent, gradient descent). It was not until 2005 that this model was proposed because it involved a lot of optimization/training methods. After being proposed, DPM became a big hit, but after the deep learning technology caught fire, DPM seemed to lose its light. If you visualize the features of each layer of deep learning, you can see that deep learning also seems to be learning the characteristics of these components (human head/hand/foot), but deep learning is relatively simple to train. DPM, on the other hand, adds more heuristic rules and becomes more complicated. In view of the poor performance of the HOG feature processing occlusion problem,

Accuracy and Recall.
Prior to evaluation, each predicted value will be compared to a confidence value, which is a positive sample if it is greater than that value, or an inverse sample otherwise. As a binary classification problem, there are four possibilities for the algorithm to detect the results, as shown in Table 4, the prediction is a positive sample, and the actual positive sample is called a real example; e prediction is a positive sample, and the actual is an inverse sample, called a false positive case. Predicted to be an inverse sample, the actual is an inverse sample, called a true counterexample. e prediction is an inverse sample, and the actual positive sample is a false counterexample.
During the inspection process, the usage formula (20) calculates the accuracy of the inspection, and the application formula (21) calculates the recall rate of the test.

PR Curve and ROC
Curve. ROC curves were first used to analyze radar signals in World War II to detect enemy forces. e cause is the Pearl Harbor incident; because it is more useful, it is slowly used in some applications such as psychology and medicine, and in machine learning, data mining, and other fields to judge the quality of classification and test results. PR curves and ROC curves are commonly used evaluation methods in categorical imbalance problems. For the same model, both PR and ROC curves can illustrate certain problems, and the two have a certain correlation. e PR curve is used to describe the relationship between accuracy and recall. According to the PR curve to judge the detection effect of an algorithm, there are three judgment methods: (1) according to the size of the closed area of the curve and the coordinate axis to judge the detection effect of the algorithm, the larger the closed area, the better the detection effect of the algorithm. (2) Use the balance point to judge the detection effect of the algorithm, that is, the recall rate is equal to the accuracy rate, and the larger the value of the balance point, the better the detection effect of the algorithm.
(3) e detection effect of the algorithm is judged by using the F1 value; as shown in the formula, the larger the F1 value, the better the detection effect of the algorithm. e graph of the PR curve for detecting the target is shown in Figure 2. When running the target detection algorithm, the ROC curve mainly calculates the TPR and FPR. e FPR is actually the recall, which is consistent with the recall in the PR curve. e graph of the ROC curve for detecting the target is shown in Figure 3.
where P is the accuracy and R is the recall.
Since the setting of network training hyperparameters is also a key factor that directly affects the detection accuracy, the influence of different learning rates, batch sizes, and gradient descent methods on the detection accuracy is analyzed experimentally to obtain better hyperparameters.  Figure 4 shows the accuracy of the corresponding target detection at different learning rates. By comparing the detection accuracy of the target at different learning rates, it can be obtained that the detection accuracy is the highest when the learning rate is 0.01. Table 5 shows the accuracy of target detection corresponding to different batches.

e Size of the Data Batch.
By comparing the detection accuracy of the target under different batches, it can be obtained that the impact of the batch size on the detection results is not obvious, but when the batch is less than 10, the training is difficult to converge.

Gradient Descent Mode.
Here are the six gradient descent methods for SGD, Momentum SGD, AdaGrad, RMSProp, Adam, and AdaDelta. e convergence effect of 1000 times was compared, and the convergence result of the iteration is shown in Figure 5.
rough the above comparative experiments, it can be found that among the six iteration methods, the convergence is faster and more stable under the AdaDelta gradient descent method, while the gradient descent method affects the convergence speed of the algorithm, thereby affecting the number of iterations of the algorithm.
Based on the above analysis, during the experiment, the learning rate was set to 0.001, the batch was set to 25, and AdaDelta was used as the gradient descent method for this article.
Experiments and analysis show that the above improvement strategy can maintain the accuracy of ordinary object detection while greatly improving the accuracy of small object detection of SSD algorithm so as to meet the requirements of conventional target detection tasks. With the improvement of shooting technology and the change in the shooting environment, the requirements for small object     Computational Intelligence and Neuroscience 9 detection have gradually increased.
e further research directions of this paper are reflected in: (1) the method of augmenting the small target data, exploring a better way of augmenting the small target data, and determining its relationship with the detection accuracy; (2) aiming at the proposed contextual feature fusion structure, the detection speed of the network is improved by simplifying the number of convolutional layers of the underlying network; and (3) continue to optimize the improved NMS method so that the network can adaptively select a better reward parameter.

Conclusion
is paper systematically elaborates the research progress in the field of object detection in recent years, including traditional object detection methods and deep learning-based object detection methods, and reviews and analyzes the popular related data sets. Focusing on the more difficult small object detection problems in the field of object detection, he analyzed some improved algorithms at home and abroad in the small object detection problem in recent years, hoping to bring new research ideas to researchers in related fields. Although the existing small object detection algorithm has achieved some results, but the accuracy is still very low, with the gradual complexity of the computer vision system deployed in real life, the accuracy requirements of small object detection will gradually improve, through the summary and analysis of the above technology, and the following points are proposed.
(1) Small target detection combined with traditional methods (2) Introduce attention mechanisms (3) Build a more complete small object detection data set (4) Lightweight of the mold type to improve the real time, accuracy, and robustness of the detection system (5) In the process of model training, focus on training for small goals (6) Research on small object detection method based on anchor-free In view of the above problems, this paper makes some prospects for the future development of deep learning-based object detection algorithms. For the problem of low detection accuracy of small targets, the anchor-based algorithm is mainly due to the fact that the anchors set up cannot match the small targets well, resulting in the inability to extract the characteristics of the small targets well. e D2Det algorithm improves the matching effect of small targets and anchor by introducing dense local regression, thereby improving the detection effect of small targets, and in the future, it can also improve the detection accuracy of small targets on this basis. For the problem of unbalanced positive and negative samples during training, OHEM, Focal Loss, GHM, and other algorithms balance the loss feedback brought by positive and negative samples to the network by manually suppressing negative sample losses, and in the future, they can more reasonably balance positive and negative sample losses through online adaptive methods to further improve the algorithm effect. Some application areas due to confidentiality are difficult to obtain, resulting in less training sample size; you can simulate the scene to increase the sample size, you can also do some enhancement and expansion of samples through the GAN series algorithm, and you can also further study the relevant methods of small sample learning and improve the training effect of small sample set. Data Availability e labeled data set used to support the findings of this study is available from the author upon request.

Conflicts of Interest
e author declares that there are no conflicts of interest.