A Small Target Pedestrian Detection Model Based on Autonomous Driving

Because small-target pedestrians occupy only a small proportion of the pixels in an image and lack texture features, their feature information is often ignored during feature extraction, which reduces accuracy and robustness. To improve the accuracy of small-target pedestrian detection and the anti-interference ability of the model, a small-target pedestrian detection model that fuses residual networks and feature pyramids is proposed. First, a residual block with a discard (dropout) layer is constructed to replace the standard residual block in the residual network structure, which reduces the computational complexity of the model and mitigates vanishing and exploding gradients in the deep network. Then, feature selection and feature alignment modules are added to the lateral connections of the feature pyramid to enhance important pedestrian features in the input image and to strengthen multiscale feature fusion for small-target pedestrians, thereby improving detection accuracy and addressing feature misalignment and ignored multiscale features in the feature pyramid network. Finally, a cascaded autofocus query module is proposed to increase the inference speed of the feature pyramid network through focusing and querying, thus improving the performance and efficiency of small-target pedestrian detection. The experimental results show that the proposed model achieves better detection results than previous models.


Introduction
With the development of deep learning and computing, the fields of autonomous driving (AD) and intelligent transportation systems (ITS) have advanced rapidly. Although AD and ITS have achieved great results in some scenarios, collisions between AD vehicles and pedestrians [1] and sensitive ethical and moral issues [2] present serious challenges to pedestrian detection technology, which is crucial for the development of AD and ITS. Pedestrian detection is a technique for determining whether a pedestrian is present in an image or video and for providing the pedestrian's precise location and size. Small-target pedestrian detection is a difficult aspect of pedestrian detection because small-target pedestrians provide little information to sensors and little feature information to deep learning models. Accurate pedestrian detection is the basis for AD vehicles: it provides operation and guidance strategies that allow AD vehicles to avoid collisions with pedestrians, reduce traffic accidents, and improve safety.
One of the most essential issues in intelligent transportation systems is small-target pedestrian detection, mainly on urban roads and in places with high pedestrian flow. However, actual traffic environments are large, complex, and contain multiple variables, and many challenges need to be addressed to achieve accurate and robust pedestrian detection using radar and digital image processing techniques. For example, in environments with partial pedestrian occlusion, radar techniques fail to detect occluded pedestrian targets [3], and small-target pedestrians are even more difficult to detect. Deep-learning-based detection of small-target pedestrians in these environments requires deeper networks and larger models, which require considerable computational power. In addition, detection speed remains a challenge.
To address these issues, an increasing number of scholars in this field have turned to deep learning because the development of LIDAR systems is time-consuming and expensive. The most critical issue is that such sensors cannot process images in the way the human eye sees pedestrians. Recently, Tesla proposed building self-driving cars using visual methods. The most widely used small-target pedestrian detection models are based on deep learning. Deep learning was proposed in 2006 and is widely used in computer vision, natural language processing, bioinformatics, and other fields because of its human-like analytical learning capabilities. In pedestrian detection, the goal is to learn the relationship between target pedestrians in different images. Representative networks include the deep residual network (ResNet), the feature pyramid network (FPN), and the you only look once (YOLO) networks. ResNet addresses network degradation well, the FPN has an improved feature fusion capability, and YOLO has a higher pedestrian detection speed. However, due to many factors, the patterns of small-target pedestrians in images are complex and variable. Thus, accurate small-target pedestrian detection is difficult to achieve with only a single shallow network: small-target pedestrian features require both deep networks and feature fusion networks. Considering that deep networks and feature fusion networks can both improve the detection of small-target pedestrians [4], many scholars have studied small-target pedestrian detection using deep and feature fusion networks. For example, Noh et al. [5] proposed a feature super-resolution approach for small-target pedestrian detection, and Nie et al. [6] proposed a small-target pedestrian detection network with enriched features. These methods have achieved positive detection results. However, the pedestrian detection speed is not ideal as the models are enlarged, and detecting small-target pedestrians both accurately and quickly remains an open problem.
To solve the above problems, this article proposes a small-target pedestrian detection model based on autonomous driving. The main contributions of this study can be summarized as follows:
(1) An improved residual network is proposed. By adding a dropout layer to the residual network, the number of model parameters is reduced, and the model generalizability is improved. The model training effect is evaluated through ablation experiments, and the best model parameters are selected.
(2) A feature fusion and alignment network is proposed. By adding feature selection and feature alignment modules to the feature pyramid network, the most important features in the feature map are enhanced, and the offset features in the feature extraction and feature fusion processes are corrected and aligned.
(3) A cascaded autofocus query (AFQ) module is proposed to increase pedestrian detection speed. This module accelerates small-target pedestrian detection through automatic focusing and querying. Different AFQ modules are constructed according to feature maps of different scales, thus allowing the modules to automatically adapt to features of different scales. In addition, the cascade method is used to share data and increase the detection speed of the model.

Literature Review
Pedestrian detection is a technology that judges whether there are pedestrians in an image or video and provides their precise position and size. Small-target pedestrian detection is a difficult aspect of pedestrian detection. In AD scenarios, high-precision small-target pedestrian detection can give the car control system sufficient time for early warning and processing [7], which is important in ensuring driving safety [8][9][10]. According to an overview of domestic and international research, pedestrian detection methods can be roughly divided into two categories: shallow machine learning detection models and deep learning detection models. Deep learning models can be further divided into one-stage pedestrian detection algorithms [11] and two-stage pedestrian detection algorithms [12]. These two types of algorithms have distinct advantages but face similar challenges, including occluded pedestrian targets [13,14] and traffic signs [15,16], image resolution [17], light intensity interference [18], scale transformation issues [16], and many others.
Machine learning implements pedestrian detection by constructing feature models and using these features to train classifiers. Common feature extraction methods include Haar wavelet features, histogram of oriented gradients (HOG) features, grayscale and rotation invariant features, and denatured local binary pattern (LBP) features. Common classifiers include the support vector machine (SVM), AdaBoost, and random forests. Machine learning algorithms can achieve accurate pedestrian detection. However, due to the nonrigid nature of pedestrians, the constructed feature model often has difficulty adapting to pedestrians with different perspectives, mutual occlusion, and different postures. In particular, small detection targets are easy to miss. Moreover, false detections reduce the practicality of these algorithms.
Deep learning can address the above problems. The one-stage pedestrian detection algorithm mainly adopts the core idea of an end-to-end network [19]: a single neural network directly predicts the positions of objects in the image with only one evaluation. Representative one-stage pedestrian detection algorithms include the SSD algorithm [20] proposed by Liu et al. and the YOLO algorithm [21] proposed by Redmon et al. Since these algorithms do not consider feature and semantic information when extracting image features, their detection performance on small-target pedestrians is not ideal. Therefore, to improve the detection of small-target pedestrians, Yin et al. proposed the FD-SSD algorithm [22]. This algorithm improves the semantic information of shallow feature maps through a multilayer feature fusion module. Through the multibranch residual dilated (hole) convolution module, the original resolution of the feature map is maintained, and the context information of the feature map is improved. In addition, deformable convolutions are introduced to fit the shapes of small objects. Fu et al. proposed the DSSD algorithm [23], which imitates the feature pyramid, adds a ResNet-101 network in the deconvolution layer, uses deconvolutions to upsample high-level features and combine them with shallow features, and increases the semantic information of the shallow layers to improve the accuracy of small object detection. Although these multifeature fusion methods improve the detection accuracy of small-target pedestrians to a certain extent, they still do not meet actual needs [24].
The two-stage pedestrian detection algorithm first generates pedestrian candidate regions and then classifies the candidate regions using a convolutional neural network. The conventional two-stage pedestrian detection algorithms are the fast region convolutional neural network (Fast R-CNN) proposed by Li et al. [25], the faster region convolutional neural network (Faster R-CNN) proposed by Ren et al. [26], and the mask region-based convolutional neural network (Mask R-CNN) proposed by He et al. [27]. Similar to the one-stage algorithms, these two-stage algorithms are often not very effective in detecting small-target pedestrians. To address this problem, Zhang et al. [28] analyzed the Faster R-CNN algorithm and found the reason for the unsatisfactory small-target pedestrian detection results: the feature map resolution of the neural network is not sufficient when dealing with small-target pedestrians. As a result, the neural network easily ignores these pedestrian features during the learning process. They also showed that using a region proposal network (RPN) and decision forests (DFs) on the shared high-resolution convolutional feature map can effectively improve small-target pedestrian detection. To address the same problem, Liu and Stathaki [29] proposed a pedestrian detection algorithm that combines Faster R-CNN with a semantic segmentation network and a region-based convolutional neural network. This network uses semantic cues to better detect pedestrians by computing complementary high-level semantic features and integrating these features with convolutional features using multiresolution feature maps extracted from different network layers, thus ensuring good detection accuracy for pedestrians of different scales. These algorithms can effectively achieve small-target pedestrian detection; however, feature alignment issues occur during detection due to inaccurate spatial sampling [30].
To address the above problems, this paper proposes a fusion residual network and feature pyramid (FRFP) model for autofocus query small-target pedestrian detection. The model uses the two-stage Faster R-CNN model as the framework and ResNet fused with an FPN as the backbone. The model uses a bottom-up path, built on the improved residual network, to generate feature maps of different scales and a top-down path, built on the feature pyramid incorporated into the residual network, to fuse feature maps of different scales and achieve multiscale feature fusion. Finally, a cascaded AFQ module is added behind the feature pyramid. The cascaded AFQ module shares data, reduces the computational cost of determining the spatial information of small-target pedestrians during inference, and passes this information to the next AFQ module to increase the detection speed for small-target pedestrians.

Our Approach
To address the problems that small-target pedestrians account for a relatively small amount of image information, that neural networks ignore small-target pedestrian features in the feature fusion process [31], and that features are misaligned [32], this paper uses Faster R-CNN as the overall framework of the model and incorporates an FPN into the output layer of the residual block of ResNet. This allows the model to mitigate network degradation and increase the accuracy of small-target pedestrian detection. Finally, the AFQ module is proposed to reduce the inference time of the model and increase its detection speed.
Journal of Advanced Transportation
This paper uses residual blocks with discard (dropout) layers to address the problems of network degradation and gradient disappearance. The structure diagram of the improved residual block is shown in Figure 1.
In Figure 1, $x$ denotes the input features of the residual block, $F(x)$ is the nonlinear mapping in the residual block, and $F(x) + x$ is the output value of the residual block. If the underlying mapping function is set to $H(x)$, the output of the residual block is
$$H(x) = F(x) + x. \tag{1}$$
When $F(x) = 0$, $H(x) = x$, and the neural network layer in the residual block becomes an identity mapping layer.
According to equation (1), the nonlinear mapping formula of the residual block can be defined as
$$F(x) = H(x) - x. \tag{2}$$
Equation (2) indicates that the network reaches the optimal solution as $F(x)$ approaches 0, so network degradation is greatly reduced even as the number of network layers increases. In the residual block, the weight layer contains a convolutional layer and a batch normalization layer. The convolutional layer extracts image features, and a pooling layer is added after the convolutional layer to reduce the size of the features and the number of network parameters through downsampling. The feature extraction and pooling processes are described in the following equations.
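A plausible form of these two equations, written as a sketch from the symbol definitions that follow. The convolution is taken to be a standard filter response, and the pooling is assumed to be max pooling over non-overlapping windows of width $W$; the pooling type and the omission of any intermediate activation are assumptions rather than details stated in the text.

```latex
\begin{align}
y_i^{l+1}(j) &= K_i^{l} * x^{l}(j) + b_i^{l}, \tag{3}\\
P_i^{l+2}(j') &= \max_{(j'-1)W+1 \,\le\, j \,\le\, j'W} \; y_i^{l+1}(j), \tag{4}
\end{align}
```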
where $K_i^l$ denotes the weight of the i-th filter in layer $l$; $b_i^l$ denotes the bias of the i-th filter in layer $l$; $x^l(j)$ denotes the value of the j-th convolutional region in layer $l$; $y_i^{l+1}(j)$ denotes the input to the j-th neuron in the i-th frame in layer $l+1$; $P_i^{l+2}(j')$ denotes the value corresponding to the neuron in layer $l+1$ after the pooling operation, where $j \in [(j'-1)W + 1, j'W]$; and $W$ denotes the width of the pooled region. The activation function in the residual block is the rectified linear unit (ReLU), which is formulated as follows:
$$f(x) = \max(0, x). \tag{5}$$
Dropout is a simple method proposed by Srivastava et al. [33] to address overfitting in neural networks with a large number of parameters. The dropout layer discards the values of neural units in the network according to a certain probability, i.e., the output of a discarded unit is set to zero, and its weights are not updated. A schematic of the dropout process is shown in Figure 2.
The formula of the neural network in the residual block changes due to the introduction of the discard layer and is calculated as follows:
$$r_j^{(l)} \sim \mathrm{Bernoulli}(p), \tag{6}$$
$$\tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}, \tag{7}$$
$$Z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \tag{8}$$
$$y_i^{(l+1)} = f\left(Z_i^{(l+1)}\right), \tag{9}$$
where $r_j^{(l)}$ is a random coefficient obeying the Bernoulli distribution with retention probability $p$; $y^{(l)}$ is the neuron in the hidden layer; $\tilde{y}^{(l)}$ is the neuron after the discard layer; $Z_i^{(l+1)}$ is the neuron in layer $l+1$ that is waiting for activation; $w_i^{(l+1)}$ and $b_i^{(l+1)}$ are the weight and bias in layer $l+1$, respectively; $y_i^{(l+1)}$ is the output neuron in layer $l+1$ after the activation function; and $f(\cdot)$ is the activation function in the residual block. By adding a discard layer to the residual block, the number of active neurons in the hidden layer can be reduced. Thus, the number of features in the intermediate layer is reduced, which weakens the complex co-adaptation among the neural nodes in the network, enhances the generalizability and robustness of the network, and effectively reduces network degradation.
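The residual mapping and the discard layer described above can be combined in a few lines. The following is a minimal pure-Python sketch, not the authors' implementation: the two-layer fully connected form, the placement of the discard layer after the first activation, and the names `residual_block_with_dropout` and `keep_prob` are all assumptions made for illustration. As in the description above, dropped units are simply set to zero without rescaling.

```python
import random


def relu(values):
    # Rectified linear unit applied elementwise.
    return [max(0.0, v) for v in values]


def linear(weights, bias, values):
    # One dense layer: each row of `weights` produces one output unit.
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def dropout(values, keep_prob, rng):
    # Draw a Bernoulli(keep_prob) mask; dropped units output zero.
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in values]
    return [m * v for m, v in zip(mask, values)]


def residual_block_with_dropout(x, w1, b1, w2, b2, keep_prob, rng):
    # Hidden activations, then the discard layer (placement is an assumption).
    hidden = relu(linear(w1, b1, x))
    hidden_dropped = dropout(hidden, keep_prob, rng)
    # F(x): the residual mapping learned by the block.
    f_x = linear(w2, b2, hidden_dropped)
    # H(x) = F(x) + x: the skip connection adds the input back.
    return [f + xi for f, xi in zip(f_x, x)]
```

With all weights set to zero, `F(x)` vanishes and the block returns its input unchanged, which is the identity-mapping behaviour that lets very deep stacks avoid degradation.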

Feature-Aligned Pyramid Network.
To solve the problems of feature misalignment and feature fusion in the feature extraction process of small-target pedestrian detection [34], this paper adopts a feature pyramid network and proposes improvements to it. The feature pyramid network improves small-target pedestrian detection accuracy through multiscale feature map fusion. In this paper, we introduce a feature alignment module (FAM) and a feature selection module (FSM) into the lateral connections of the feature pyramid to learn and align important pedestrian features and to enhance the multiscale feature fusion ability of the network, thereby improving small-target pedestrian detection performance. The network structure is shown in Figure 3.
In Figure 3, the image in the lower left corner is the image input to be trained, the multiscale feature map output by the residual block is shown above the image, the feature map after multiscale fusion in the pyramid network is shown on the right, and the part in the dashed box is the lateral connection part of the pyramid network, which contains the 2x up-sampling module, feature selection module, and feature alignment module.
In the conventional FPN, feature selection consists only of 1 × 1 convolutions that keep the number of channels of the high-dimensional and low-dimensional features consistent. However, without judging the saliency of the respective channel features, it is difficult to preserve the important features of spatial details when compressing channels. To address this problem, this paper introduces the feature selection module, which models the significant features in the feature mapping process while suppressing and recalibrating redundant feature mappings. Figure 4 shows the structure of the feature selection module.
As illustrated in Figure 4, the global information $Z_i$ of the input feature map $C_i$ is first extracted by a global average pooling (GAP) operation. The global information $Z_i$ is sent to the significant feature construction layer $f_m(\cdot)$, which learns the weight of each channel in the input feature map. These weights are expressed as a feature importance vector $u_i$ that indicates the salience of the respective feature maps. The original input feature maps are scaled using the importance vector. The scaled feature maps are added to the original feature maps to generate rescaled feature maps, which are passed to the feature selection layer $f_s(\cdot)$. This process retains the important feature maps while reducing the number of channels by removing redundant feature maps. The workflow of the feature selection module is shown in the following equations:
$$Z_i = \mathrm{GAP}(C_i), \tag{10}$$
$$u_i = f_m(Z_i), \tag{11}$$
$$\hat{C}_i = f_s(C_i + u_i \odot C_i), \tag{12}$$
where $\hat{C}_i$ is the output of the feature selection module.
The feature alignment module addresses the contextual misalignment of the predicted features. The conventional FPN performs feature fusion in a manner that affects the prediction of the target boundary, thus causing misclassification in the prediction process. The feature alignment module aligns the upsampled feature mappings to a set of reference mappings by adjusting the respective sampling positions in the convolutional kernel according to the learned offset. Figure 5 illustrates the workflow of the feature alignment module, which aligns the upsampled feature map $P_i^{up}$ with its reference feature map $\hat{C}_{i-1}$ before proceeding to feature fusion, i.e., the upsampled feature $P_i^{up}$ is normalized based on the spatial location information provided by $\hat{C}_{i-1}$. In Figure 5, $N$ denotes the number of sample locations of the convolution kernel, $C$ denotes the number of feature channels, and $\Delta_i$ denotes the offset of the convolution kernel to be learned.
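The feature selection workflow described above (global average pooling, channel importance weighting, rescaling with a residual addition, and channel selection) can be illustrated without any deep learning library. This is a sketch under stated assumptions: the importance layer is modelled as a linear map followed by a sigmoid, the selection layer as a 1 × 1 convolution (a weighted sum over input channels), and the names `fm_weights` and `fs_weights` are hypothetical.

```python
import math


def global_average_pool(feature_map):
    # feature_map: list of channels, each an H x W list of lists.
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def importance_vector(pooled, fm_weights):
    # Assumed form of the importance layer: linear map plus sigmoid,
    # giving one weight in (0, 1) per channel.
    return [1.0 / (1.0 + math.exp(-sum(w * z for w, z in zip(row, pooled))))
            for row in fm_weights]


def rescale(feature_map, importance):
    # C + u * C: scaled maps added back to the original maps.
    return [[[(1.0 + u) * v for v in row] for row in ch]
            for ch, u in zip(feature_map, importance)]


def select_channels(feature_map, fs_weights):
    # Assumed form of the selection layer: a 1 x 1 convolution, so each
    # output channel is a weighted sum of the input channels per pixel.
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(w * feature_map[k][i][j] for k, w in enumerate(row))
              for j in range(width)] for i in range(height)]
            for row in fs_weights]


def feature_selection_module(feature_map, fm_weights, fs_weights):
    pooled = global_average_pool(feature_map)
    importance = importance_vector(pooled, fm_weights)
    return select_channels(rescale(feature_map, importance), fs_weights)
```

Passing fewer rows in `fs_weights` than there are input channels reduces the channel count, which mirrors the channel compression role of the selection layer.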

Autofocus Query Module.
Although the combination of the FPN and ResNet increases the detection accuracy of small-target pedestrians, the detection speed and accuracy of small-target pedestrians are still not ideal, especially the detection speed. The inference and computation processes of the feature pyramid for small-target pedestrian features are highly redundant because the information about small-target pedestrians is very sparse in the image space, which reduces both the computational performance and the detection speed [35]. In addition, background noise in the image interferes with the features of small-target pedestrians, leading to poor accuracy. To address the above problems, this paper proposes the autofocus query (AFQ) module, which performs AFQ operations on feature maps of different scales. Its operation process is shown in Figure 6.
Figure 6 shows a schematic diagram of the AFQ module, which automatically focuses on the low-resolution feature map $P_l$ input from the pyramid network and predicts the perceptual region. Then, the key locations of small-target pedestrians are calculated by means of a query, and the key location coordinates are passed as key information through the AFQ module to the next higher-resolution feature map. We denote the output map of the AFQ module as $V_l$, where $V_l^{i,j}$ denotes the probability that the position in the i-th row and j-th column of the feature map contains a small-target pedestrian. Then, we define small-target pedestrians in each feature map as objects with scales smaller than a predefined threshold $s_l$ and set the bounding box of the small-target pedestrian $o$ in each feature map $P_l$ as $b_l^o = (x_l^o, y_l^o, w_l^o, h_l^o)$, where $(x_l^o, y_l^o)$ is the center point of the small-target pedestrian and $(w_l^o, h_l^o)$ are its width and height. Next, a binary encoded feature map [36] is generated by calculating the distance from each feature pixel $(x, y)$ to the center point $(x_l^o, y_l^o)$ of the small-target pedestrian according to the distance calculation in equation (13) and the judgment in equation (14).
To predict the approximate location of the small-target pedestrian, a parallel query, classification, and regression module is added to the AFQ module and applied to the feature map received by each layer of the AFQ module. The regression and prediction values are passed as location information to the next module. Let the key location be $k_{l-1}^o$, which is defined by equation (15). For each layer $P_l$, the loss function is defined as
$$L(P_l) = L_{FL}(U_l, U_l^*) + L_r(R_l, R_l^*) + L_{FL}(V_l, V_l^*), \tag{16}$$
where $U_l$ is the classification output, $R_l$ is the regression output, $V_l$ is the query score output, $U_l^*$ is the true mapping of the classification output, $R_l^*$ is the true mapping of the regression output, $V_l^*$ is the true mapping of the query score, $L_{FL}$ is the focal loss, and $L_r$ is the bounding box regression loss [37].
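The binary encoded feature map described above can be sketched as follows. Two details are assumptions made for this illustration rather than statements from the text: each pixel is compared with its nearest small-pedestrian center, and the distance threshold is passed in as a plain parameter (one natural choice would be the scale threshold of the layer). The function name `encode_query_map` is hypothetical.

```python
import math


def encode_query_map(height, width, centers, threshold):
    """Return a height x width binary map.

    A pixel is set to 1 when its Euclidean distance to the nearest
    small-pedestrian center in `centers` (a list of (x, y) pairs) is
    below `threshold`, and to 0 otherwise.
    """
    query_map = [[0] * width for _ in range(height)]
    if not centers:
        return query_map  # no small targets on this level
    for y in range(height):
        for x in range(width):
            # Distance step: nearest center to the pixel (x, y).
            distance = min(math.hypot(x - cx, y - cy) for cx, cy in centers)
            # Judgment step: binary encoding against the threshold.
            query_map[y][x] = 1 if distance < threshold else 0
    return query_map
```

A map produced this way can serve as the training target for the query score output of the corresponding layer.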
To increase the inference speed, we use a cascade connection between the AFQ modules [38]. The advantage is that the key locations are not generated from a single feature map, which yields increasingly more key locations $k_l$ as $l$ decreases in the query mapping.
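One way to picture the cascade is that every position selected on a coarse level expands into a small neighbourhood on the next finer level. The sketch below assumes a 2× resolution step between adjacent pyramid levels and a fixed score threshold, so each selected position maps to a 2 × 2 block; both choices, and the name `propagate_key_positions`, are assumptions for illustration only.

```python
def propagate_key_positions(query_scores, score_threshold):
    """Map confident positions on one level to the next finer level.

    query_scores: 2D list of query scores for the coarse level.
    Returns the sorted (x, y) key positions on the finer level, assuming
    the finer map has twice the resolution of the coarse one.
    """
    keys = set()
    for y, row in enumerate(query_scores):
        for x, score in enumerate(row):
            if score > score_threshold:
                # Each coarse cell covers a 2 x 2 block on the finer map.
                for dy in (0, 1):
                    for dx in (0, 1):
                        keys.add((2 * x + dx, 2 * y + dy))
    return sorted(keys)
```

Because only the returned positions need to be processed on the finer level, the amount of computation follows the number of confident queries rather than the full map size.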

FRFP-AFQ.
First, to ensure that the model can address network degradation, the residual network is used as the backbone network of the model in this paper. Second, to enhance the model's ability to detect small-target pedestrians, the feature pyramid and ResNet are combined. Finally, an AFQ module is proposed to optimize the small-target pedestrian detection performance of the model. Therefore, this paper proposes an autofocus query small-target pedestrian detection model that combines a residual network and a feature pyramid. The proposed model is termed the FRFP-AFQ model, and the model structure is illustrated in Figure 7.
In Figure 7, the leftmost image is the original target detection input, which is a 640 × 480 pixel RGB image. The dashed box immediately following the arrow contains the residual network and shows the feature map output by each residual block, where the lowest dimensional feature map has 160 × 120 pixels and 256 channels and the highest dimensional feature map has 20 × 15 pixels and 2048 channels. The dashed box below the residual network shows the structure of the feature pyramid network. The feature pyramid network fuses deep, semantically rich features and shallow, detail-rich features through lateral connections, which are shown in the lower right corner of Figure 7. The lateral connections are used to construct the fused shallow and deep feature map, which has 160 × 120 pixels and 256 channels. The deepest feature in each layer contains not only the detailed features of the current dimension but also the high semantic information of the deep layer. The deep feature maps have high semantic information and are suitable for detecting large targets, while the shallow feature maps have rich detail features and are suitable for detecting small targets. Finally, the AFQ operation is used in each layer of the FPN to automatically focus the query operation, and the AFQ operations are cascaded to form the AFQ module. Each AFQ operation in the AFQ module includes classification, regression, and query functions to quickly determine the location of small-target pedestrians. Collectively, the FRFP-AFQ model can address network degradation and achieve superior multiscale feature fusion performance as well as excellent inference and detection performance.
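The feature map sizes quoted above follow from simple stride arithmetic. The sketch below assumes four backbone stages with strides 4, 8, 16, and 32 and a channel count that doubles at each stage, which is the usual ResNet-50-style layout; only the first and last sizes are stated in the text, so the two intermediate stages are an inference, and the function name is hypothetical.

```python
def backbone_feature_shapes(width, height, strides=(4, 8, 16, 32), base_channels=256):
    """Return (width, height, channels) for each backbone stage."""
    shapes = []
    for level, stride in enumerate(strides):
        # Spatial size shrinks by the stride; channels double per stage.
        shapes.append((width // stride, height // stride, base_channels * 2 ** level))
    return shapes


for w, h, c in backbone_feature_shapes(640, 480):
    print(f"{w} x {h} pixels, {c} channels")
```

For a 640 × 480 input this reproduces the 160 × 120, 256-channel map at the shallow end and the 20 × 15, 2048-channel map at the deep end.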

Implementation Steps.
The main steps of the FRFP-AFQ-based small-target pedestrian detection model are implemented as follows:
Step 1. The experimental environment is a cloud server with two Tesla V100 graphics cards with computing powers of 15.7 TFLOPS (FP32) and 125 TFLOPS (FP16); a Xeon Gold 6139 CPU; an Ubuntu 18.04 system with 172 GB of memory and 16 × 2 GB of video memory; PyTorch version 1.9.0; CUDA version 11.4; and Python version 3.6.9.
Step 2. The model proposed in this paper was constructed by setting the structures of the convolutional layer, pooling layer, batch normalization layer, and other explicit and implicit layers, and stochastic gradient descent (SGD) with momentum was chosen as the optimizer during model training. The network parameters were set by model parameter comparison experiments (see Section 3.3). The final number of epochs was set to 200.
Step 3. The dataset used in this paper was divided into three folders. The first folder was named Annotations and stored all the annotation files in XML format. The second folder was named JPEGImages and stored the image files corresponding to the annotation files in jpg format. The last folder was named ImageSets and contained a main folder with txt files listing the names of the images in the training, test, and validation sets.
Step 4. The pretrained weights were downloaded and unzipped into the pretrained_weights folder. Then, the uploaded dataset was unzipped, and the paths of the training set, test set, and validation set were configured. Then, we returned to the model folder in the terminal command line. Next, we input python train.py to train the model, python test.py to evaluate the trained model, python eval.py to evaluate the training level of the model, and python predict.py to assess the test images.
Step 5. We used equations (3)-(9) to obtain the IResNet model and generate the feature maps, and equations (10)-(12) to complete the FSM function in the IFPN.
Step 6. We calculated the pixel-to-pedestrian center distances of the small targets in the feature vector map using equation (13). Then, we determined the value of the pixel encoding in the new feature vector map using equation (14).
Step 7. We determined the key position information of the small-target pedestrians by using equation (15). Then, the pixel encoding value and position information generated in Step 6 were combined as one key value and sent to the next AFQ module.
Step 8. We evaluated the trained model according to the loss function shown in equation (16). If the loss value was too large, the AFQ module parameters were fine-tuned, and Steps 6 and 7 were repeated until the loss function value was less than a predefined threshold.
Step 9. We evaluated the data generated during the test to determine whether the detection accuracy reached the expected value. If so, we output the obtained model. Otherwise, we returned to Step 2, fine-tuned the parameters according to the evaluation indices, and repeated Steps 4-8.
Step 10. We calculated the frames per second (FPS) of the model generated in Step 9 and obtained the detection results.
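Step 2 relies on SGD with momentum as the optimizer. The following sketch shows a single parameter update in plain Python. The update convention (weight decay added to the gradient, then a momentum buffer, then a step scaled by the learning rate) follows the common formulation without dampening or Nesterov momentum and is an assumption about the training setup; the default values are the learning rate, momentum, and weight decay reported for the selected model.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """Apply one SGD-with-momentum update and return (new_weights, new_velocity)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # L2 regularization folded into the gradient
        v = momentum * v + g       # momentum buffer accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity
```

The velocity list should start at zero and be carried from one call to the next, so that consecutive gradients pointing the same way produce progressively larger steps.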

Dataset and Data Processing.
The Caltech Pedestrian Dataset is a dataset dedicated to pedestrian detection that was released by Caltech in 2009. The dataset was mainly captured from cars driving on urban streets and contains 10 h of 640 × 480 30 Hz videos with a total of 250000 frames, 350000 bounding boxes, and 2300 annotated unique pedestrians. The dataset includes image data (in seq format) and pedestrian label data (in vbb format), which mainly contain the pedestrian bounding boxes in the dataset.
The experimental data processing is implemented in the Python programming language. First, the seq and vbb files are converted to jpg and XML files. The jpg and XML files are placed on the same level as the images and annotations folders and renamed. The unnamed files are deleted. After this processing, we obtain 18348 images and 18348 corresponding annotation files. The training set, test set, and validation set are generated randomly according to the ratio 6 : 2 : 2.
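A random 6 : 2 : 2 split of this kind can be made reproducible with a fixed seed. The sketch below is illustrative only: the seed value, the rounding rule (leftover samples go to the validation set), and the function name `split_dataset` are assumptions, not details taken from the experiment.

```python
import random


def split_dataset(names, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle image names and split them into (train, test, validation)."""
    names = sorted(names)                  # fixed starting order
    random.Random(seed).shuffle(names)     # reproducible shuffle
    n_train = int(len(names) * ratios[0])
    n_test = int(len(names) * ratios[1])
    train = names[:n_train]
    test = names[n_train:n_train + n_test]
    validation = names[n_train + n_test:]  # remainder, absorbs rounding
    return train, test, validation
```

With 18348 images, this rounding rule yields 11008 training, 3669 test, and 3671 validation samples.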
Evaluation Indicators.
This paper adopts the evaluation metrics used in the COCO competition [39], including the average precision (AP), $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, and $AP_L$. $AP_m$ denotes the AP obtained when the threshold of the intersection over union (IOU) is m%, and the calculation of the AP is shown in equation (17).
The IOU precision formula sums the detection accuracy over the different IOU thresholds, which range from 0.5 to 0.95 in steps of 0.05, and $AP_{50}$ and $AP_{75}$ are the AP values when the IOU thresholds are 0.5 and 0.75, respectively. The precision indicates the number of correctly identified pedestrians under the IOU threshold as a percentage of the total number of pedestrians. In the formula for this percentage, true positive (TP) indicates that the prediction is correct when the sample is predicted as positive, false positive (FP) indicates that the prediction is incorrect when the sample is predicted as positive, true negative (TN) indicates that the prediction is correct when the sample is predicted as negative, and false negative (FN) indicates that the prediction is incorrect when the sample is predicted as negative.
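The quantities behind these metrics can be written down compactly. The sketch below computes the IOU of two boxes, lists the ten IOU thresholds from 0.5 to 0.95, and evaluates precision using the standard definition TP / (TP + FP); that definition and the corner-coordinate box format are assumptions here and may differ from the exact formula the text refers to.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IOU thresholds 0.5 : 0.05 : 0.95 used by the COCO-style AP.
IOU_THRESHOLDS = [0.5 + 0.05 * k for k in range(10)]


def precision(true_positives, false_positives):
    """Standard precision: correct detections over all detections."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```

A detection counts as a true positive at a given threshold when its IOU with a ground-truth box is at least that threshold, so the same set of detections generally yields lower precision as the threshold rises.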
To distinguish large, medium, and small targets, definitions are given according to the COCO evaluation index. $AP_S$, $AP_M$, and $AP_L$ are the AP values for small, medium, and large targets, which are defined as follows:
$$\text{target size} = \begin{cases} \text{small}, & area < 32^2, \\ \text{medium}, & 32^2 < area < 96^2, \\ \text{large}, & area > 96^2. \end{cases}$$
In the above equation, area is the size of the detected object, measured as the number of pixels that it occupies. The criterion for determining a small target is area < 32² = 1024, the criterion for determining a medium target is 1024 < area < 9216 (where 9216 = 96²), and the criterion for determining a large target is area > 9216.
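These size criteria translate directly into a small helper. The text does not say which class an area of exactly 1024 or 9216 pixels belongs to, so assigning those boundary values to the larger class is an assumption of this sketch, as is the function name.

```python
def size_category(width, height):
    """Classify a bounding box as 'small', 'medium', or 'large' by pixel area."""
    area = width * height
    if area < 32 ** 2:      # fewer than 1024 pixels
        return "small"
    if area < 96 ** 2:      # fewer than 9216 pixels
        return "medium"
    return "large"


print(size_category(20, 40))    # area 800
print(size_category(50, 100))   # area 5000
print(size_category(100, 100))  # area 10000
```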
To judge the detection speed of the model, the number of frames per second (FPS) is used as the evaluation index in this paper [40]. It is calculated as FPS = FrameNum / ElapsedTime, where FrameNum is the total number of detected images and ElapsedTime is the total time from the start to the end of the detection period.
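The FPS measurement can be sketched in Python as follows. The `detect` callable is a hypothetical stand-in for the detection model, and timing with `time.perf_counter` is an assumption about how the elapsed time might be measured.

```python
import time
from typing import Callable, Iterable


def fps(frame_num: int, elapsed_time: float) -> float:
    """Frames per second: FrameNum divided by ElapsedTime (in seconds)."""
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    return frame_num / elapsed_time


def measure_fps(detect: Callable[[object], object], frames: Iterable[object]) -> float:
    """Run detect on every frame and report the resulting FPS."""
    frame_num = 0
    start = time.perf_counter()
    for frame in frames:
        detect(frame)  # hypothetical detector call
        frame_num += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast, tiny workloads.
    return frame_num / elapsed if elapsed > 0 else float("inf")
```

For example, 100 images processed in 5 seconds give `fps(100, 5.0)`, that is, 20 FPS.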

Comparative Experiments and Analysis of Model Parameters. To obtain a network model suitable for small-target pedestrian detection, this paper sets different network structure parameters based on the Faster R-CNN framework. The five parameters are the learning rate, discard rate, momentum decay, weight decay, and batch size, and the specific modified network parameters and comparison results are shown in Table 1 and Figure 8.
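One way to organise such a parameter comparison is to define a baseline configuration and derive each variant by changing a single field. The sketch below uses the values reported in Table 1 for models 0 to 5; models 6 and 7 are left out because their full settings are not restated in the text. The class and field names are hypothetical and do not come from the paper's code.

```python
from dataclasses import asdict, dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class TrainConfig:
    """The five parameters varied in the comparison (model 0 defaults)."""
    learning_rate: float = 0.01
    discard_rate: float = 0.5
    momentum_decay: float = 0.9
    weight_decay: float = 0.0005
    batch_size: int = 64


MODEL_0 = TrainConfig()

# Each variant changes exactly one parameter relative to model 0.
VARIANTS: Dict[str, TrainConfig] = {
    "model 0": MODEL_0,
    "model 1": replace(MODEL_0, learning_rate=0.001),
    "model 2": replace(MODEL_0, discard_rate=0.3),
    "model 3": replace(MODEL_0, momentum_decay=0.8),
    "model 4": replace(MODEL_0, weight_decay=0.05),
    "model 5": replace(MODEL_0, batch_size=32),
}


def diff_from_base(cfg: TrainConfig, base: TrainConfig = MODEL_0) -> Dict[str, float]:
    """Return only the fields in which cfg differs from the baseline."""
    base_fields = asdict(base)
    return {k: v for k, v in asdict(cfg).items() if base_fields[k] != v}
```

Changing one parameter at a time keeps the loss differences between variants attributable to that parameter alone.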
In Table 1, model 0 has a learning rate of 0.01, a discard rate of 0.5, a momentum decay of 0.9, a weight decay of 0.0005, and a batch size of 64; model 1 sets the learning rate to 0.001 on the basis of model 0; model 2 sets the discard rate to 0.3 on the basis of model 0; model 3 sets the momentum decay to 0.8 on the basis of model 0; model 4 sets the weight decay to 0.05 on the basis of model 0; model 5 sets the batch size to 32 on the basis of model 0; and model 6 changes the discard rate together with another parameter, as listed in Table 1 along with the settings of model 7.
Figure 8 shows a comparison of the loss values of the models with different parameters, where M0 through M7 correspond to model 0 through model 7 in Table 1. Figure 8 shows that the lowest loss value, 0.0543, is obtained by M0, while the highest loss value, 0.0657, is obtained by M4. The loss values of M1, M2, M3, M5, M6, and M7 are 0.0587, 0.0613, 0.0606, 0.0641, 0.0625, and 0.0642, respectively. The results indicate that the model performs best under the M0 parameters. Therefore, the model in this paper uses an initial learning rate of 0.01, a batch size of 64, a momentum decay of 0.9, a weight decay of 0.0005, and a dropout rate of 0.5.

Ablation Experiments and Analysis. Ablation experiments were conducted to verify the enhancement effect of the dropout layer in the residual network and the FAM module in the feature pyramid network for small-target pedestrian detection. To fairly compare the performance of the models, the ablation experimental frameworks all use Faster R-CNN, and the backbone neural networks are ResNet-50-FPN, ResNet-101-FPN, IResNet-50-FPN, IResNet-101-FPN, ResNet-50-IFPN, ResNet-101-IFPN, IResNet-50-IFPN, and IResNet-101-IFPN. AFQ ablation experiments were also conducted for each backbone network. The above models were trained on the Caltech Pedestrian Dataset to verify the validity of the models according to the COCO evaluation metrics.
The hyperparameters of this ablation experiment were selected as follows: the learning rate was set to 0.01, the batch size was set to 64, the momentum decay was set to 0.9, the weight decay was set to 0.0005, and the dropout rate was set to 0.5. The corresponding model was trained offline, the model was saved in the xxx.pth file format, and the corresponding detection code and detection image were configured. The command python predict.py was input to obtain the detection results of the video, dataset, or camera images. The final detection results are shown in Figure 9, and the evaluation results are shown in Table 2.
As shown in Figure 9 and Table 2, for both small-target and large-target pedestrian detection, the IResNet-IFPN evaluation results are better than those of the ResNet-FPN, IResNet-FPN, and ResNet-IFPN models. When the model proposed in this paper was compared with the ResNet-FPN model, the large-target pedestrian detection accuracy improved by 21.6% and 32.3%, the small-target pedestrian detection accuracy improved by 21.6% and 32.3%, the AP value improved by 17.2% and 24.5%, and the AP_50 value improved by 7.8% and 8.2%. When the proposed model was compared with the model that modifies only the residual network (IResNet-FPN), the large-target pedestrian detection accuracy improved by 8.7% and 14.4%, the small-target pedestrian detection accuracy improved by 16% and 10.1%, the AP value improved by 12.6% and 19.4%, and the AP_50 value improved by 5.5% and 4.6%. When the proposed model was compared with the model that modifies only the feature pyramid network (ResNet-IFPN), the large-target pedestrian detection accuracy improved by 8.4% and 14.3%, the small-target pedestrian detection accuracy improved by 14.1% and 22.7%, the AP value improved by 11.0% and 17.9%, and the AP_50 value improved by 3.4% and 3.6%. The AFQ module ablation experiments show that, under the same backbone network, the detection speed of the model increases from 6.9 FPS to 9.8 FPS at the lowest speed (a 42.0% improvement) and from 18.5 FPS to 20.1 FPS at the highest speed (an 8.4% improvement).
The speed clearly improves. Although the model detection accuracy decreases slightly, the overall accuracy is not affected. The above data comparison suggests that the FRFP-AFQ model greatly improves on the original algorithm for all pedestrian targets; however, compared with the models that modify only the residual network or only the feature pyramid network, the improvement in large-target pedestrian detection and integrated pedestrian detection is modest. The small-target pedestrian detection accuracy is greatly improved, which indicates that the model proposed in this paper can improve the comprehensive pedestrian detection capability. Finally, the AFQ module improves the detection speed of the model by 8.4% to 42%. The results show that the FRFP-AFQ model is feasible and effective.

Comparison with Other Pedestrian Detection Algorithms. The FRFP-AFQ model proposed in this paper is compared with other conventional pedestrian detection algorithms, including MEL [41], SIRA [42], YOLOV3-Promote [43], YOLOV5 [44], and DMSFLN [13], and the detection results are evaluated using COCO evaluation metrics. All algorithms use the same module, as described in Section 2.3. The same hyperparameters and datasets are used; the final detection results are shown in Figure 10, and the evaluation results are shown in Table 3.
As seen in Figure 10 and Table 3, compared with the DMSFLN pedestrian detection algorithm with a VGG-16 backbone network, the proposed model improves the AP_50 accuracy by 41.3%, and its detection speed is approximately two times faster than that of the DMSFLN pedestrian detection algorithm. In the case of the same 101-layer residual network, compared with the MEL and SIRA algorithms, the FRFP model improves the large-target pedestrian detection accuracy by 16% and 14.5%, the small-target pedestrian detection accuracy by 26.8% and 20.6%, the AP value by 20.8% and 17.7%, and the AP_50 value by 5.5% and 3.6%, respectively. Compared with the conventional YOLO detection algorithm, the detection speed of the model proposed in this paper is slightly reduced, but the AP_50 and AP_75 detection accuracies of the model with the IResNet-50-IFPN backbone network are, respectively, 21.4% and 13.0% better than the G-Module model with the YOLOV3-Promote and YOLOV5 backbone networks. When the model uses IResNet-101-IFPN as the backbone network, the AP_50 and AP_75 values are improved by 22.0% and 13.6%, but the slowest and fastest detection speeds are only 0.1 FPS and 8.4 FPS. These results show that the FRFP-AFQ model outperforms the conventional prediction algorithms in terms of detection capability and evaluation results for large, small, and integrated targets. In particular, the detection and evaluation results are better for small targets, which shows that the FRFP-AFQ model enhances the multiscale feature fusion and feature alignment abilities for small pedestrian targets, so the detection accuracy is higher than that of the MEL and SIRA models. In practical applications, considering multiscale feature fusion and feature alignment is beneficial for improving the detection performance of the model. The FRFP-AFQ model also outperforms the conventional pedestrian detection models in detecting medium and large targets, indicating that the proposed model has a better comprehensive detection capability and more advantages for small targets than the conventional models.

Summary
To improve the detection accuracy and robustness of small-target pedestrian detection, an FRFP-AFQ model is proposed that constructs bottom-up multiscale feature maps via ResNet and performs feature fusion and feature alignment on the multiscale feature maps by using an FPN. The multiscale feature fusion combines a deep feature map with high semantic features and a shallow feature map with rich detail features, so the fused feature map contains both the deep, high-semantic features and the shallow, detailed features. Finally, a cascaded AFQ module is introduced to reduce the inference time and increase the detection speed. Experiments are conducted on the Caltech Pedestrian Dataset. The experimental results show that the model designed in this paper outperforms the conventional YOLOV3-Promote, SIRA, YOLOV5, MEL, and other detection models and has good application prospects.
The detection accuracy of the proposed model is still affected by extreme weather and multitarget pedestrian occlusion, and the small-target pedestrian detection ability is reduced in bad weather such as heavy rain and fog, as well as in the case of high crowd flow. In future studies, we will focus on the effects of bad weather and multitarget pedestrian occlusion on detection, enhance the learning ability and generalizability of the model in these conditions, and improve the robustness of the model for small-target pedestrian detection in extreme situations such as snowstorms and pedestrian occlusion.

Figure 2: Schematic diagram of the dropout process.

Figure 4: Structure of the feature selection module.

Figure 8: Comparison of loss values with different parameters.

Figure 10: Comparison results with conventional pedestrian detection algorithms.

Table 2: Ablation experiment results on the Caltech Pedestrian Dataset.

Table 3: Comparison results with conventional pedestrian detection algorithms.