Subway Platform Passenger Flow Counting Algorithm Based on Feature-Enhanced Pyramid and Mixed Attention



Introduction
Crowd counting aims to estimate the number and density distribution of people in images or videos and is used in fields such as crowd behavior analysis and public safety management. The surge in metro passenger flow has posed a huge challenge to traffic organization and safe operation, such as the difficulty of transportation organization during peak periods and the lack of operability in emergency management. Real-time access to station passenger flow through crowd counting algorithms can provide scientific data support for organizational management and safety alerts. For example, the departure interval can be optimized according to the passenger flow of the subway platform obtained in real time, and the turn-back station can be accurately determined [1]. The distribution of passenger density on the platform can be displayed in combination with the Passenger Information System (PIS) and the Public Address System (PA), so as to guide passenger travel behavior [2] and reduce operational pressure during peak hours. At the same time, control strategies [3] such as closing stations and overtaking can be implemented according to the platform passenger flow, so as to reduce the potential safety hazards caused by congestion.
Traditional crowd-counting algorithms fall into three categories. Detection-based methods take the whole human body or body parts as the object of detection and calculate the number of people [4]; regression-based methods treat the crowd as a whole and complete the counting by establishing a mapping relationship between the extracted features and the number of people, such as ridge regression [5] and Bayesian regression [6]; and density estimation-based methods count by learning linear mapping [7] or nonlinear mapping [8] relationships between features and density maps. Traditional methods rely on manual feature extraction, which is less accurate and only applicable to sparse scenes.

At present, convolutional neural networks are widely used in crowd counting due to their excellent feature extraction and learning capabilities. According to the structure of the neural network model, these methods are generally divided into two categories: single-branch structures and multibranch structures. The early crowd-counting algorithms are all single-branch structures. Wang et al. [9] applied CNN to crowd counting for the first time, and the model uses the regression method to count. Due to the limitation of network width and depth, the counting accuracy in dense scenes needs to be improved and cannot meet the requirements of cross-scene counting. To solve the cross-scene problem, Zhang et al. [10] proposed the cross-scene counting model (Crowd CNN), in which the algorithm fine-tunes the counting model according to the characteristics of the input scene so that it can accomplish cross-scene counting. The different distances between the crowd in the image and the camera lead to different crowd scales. To solve the multiscale problem, various multibranch networks have been proposed. The multicolumn convolutional neural network (MCNN) proposed by Zhang et al. [11] has three branches, which employ convolutional kernels of different sizes for feature extraction of targets at different scales to solve the scale problem. Sam et al. [12] proposed a multicolumn selection network (Switch-CNN), where the input images are first cut, and then parts of the images with different density levels are fed into the corresponding branches separately, and counting is done separately using different regression networks. The quality of the density map determines the counting accuracy. To obtain high-quality density maps, Sindagi and Patel [13] proposed the contextual pyramid model (CP-CNN), which applies the global and local contextual information extracted from different branches to density map generation. Although multibranch networks achieve better counting results, they are accompanied by the problems of a large number of parameters, training difficulties, and model redundancy. To solve these problems, dilated convolutions [14], deformable convolutions [15], and generative adversarial networks [16] have been introduced in the field of crowd counting to reduce the complexity of the models and improve the counting accuracy.

For passenger flow counting in the subway scene, Sheng et al. [17] proposed a counting method with the head and shoulders of passengers as the detection object. This method performs well when the passengers are sparse, but the counting accuracy decreases due to severe occlusion during peak hours. Zhang et al. [18] used a multiscale feature extraction module and transposed convolutional upsampling to enhance multiscale features but did not consider the effect of background interference on the counting task. Xiao et al. [19] conducted crowd counting in the target area of the subway based on the background difference method, but the background difference method is mostly aimed at moving objects and is not suitable for platform scenes where passengers are mostly stationary or moving slowly. Hu et al. [20] used a hybrid Gaussian background modeling method to compensate for the deficiencies in background differencing, but the regression-based approach makes the correlation between the features learned by the network and the number of people weak, and the accuracy needs to be improved. The double-region learning algorithm proposed by He et al. [21] divides the subway surveillance image into a near region and a far region and adopts different strategies for counting the two subregions to mitigate the impact of perspective distortion. However, the method can only divide the image into two fixed regions without considering the variability of the scene. The MPCNet proposed by Zhang et al. [22] uses multicolumn dilated convolution to aggregate multiscale context information in crowded scenes, but the inference speed of the multicolumn structure is slow and cannot meet the requirements of real-time detection. Tiny MetroNet proposed by Guo et al. [23] adopts a micro-passenger feature extraction network as the backbone network to achieve a balance between counting accuracy and detection speed. In the MDP algorithm proposed by Liu et al. [24], the MetroNext based on the multiscale convolutional attention module can quickly obtain the location information of the train and passengers, and the optical flow algorithm is used to predict the direction of passenger movement. The combination of the two completes the detection of passengers getting on and off the train. Yang et al. [25] introduced CBAM into YOLOv4 to solve the problem of inhomogeneous illumination in the station and improve the accuracy and robustness of the network. The MPDNet proposed by Yang et al. [26] uses the pyramid vision transformer to extract features and then uses an adaptive spatial feature fusion algorithm to compensate for the loss of spatial information in feature extraction, achieving higher accuracy while meeting real-time requirements.
Most of the current research is aimed at outdoor open scenes, which are quite different from the subway platform scene, and the existing passenger flow counting algorithms for the platform scene still need to be improved. On a subway platform, the narrow and long layout leads to pronounced differences in passenger scale across different areas of the monitoring image, and small-scale heads far from the camera may be missed. The variety of building facilities in the station leads to a complex background and makes crowd feature extraction difficult. In addition, most of the existing public datasets consist of images of open scenes, and there is no public dataset suitable for subway platform scenes. Based on the above analysis, this paper first constructs a metro platform dataset by capturing images from Lanzhou metro platform surveillance video and then proposes a subway platform passenger flow counting algorithm based on a feature-enhanced pyramid and mixed attention. The pyramid structure effectively fuses the semantic information and spatial information of deep and shallow features to solve the problem of different crowd scales. A mixed attention module is constructed to aggregate global context information, and the problem of complex background is addressed by paying more attention to the target area.

Literature Review
The main difficulties of crowd counting in the platform scene are the large difference in head scale and the complex background of the platform. In this section, two types of networks related to the algorithm in this paper, i.e., multiscale feature fusion network and attention network, are reviewed. The above research studies use different methods to solve the scaling problem in image processing, which have achieved certain results but still have some problems, such as higher complexity of the model and feature loss. The feature-enhanced pyramid structure proposed in this paper uses a channel conversion module to highly preserve the channel features and a semantic consistency learning module to simplify the model while solving the aliasing effect.

Attention Network.
The main idea of the attention mechanism is to allocate limited information processing resources to the parts of the input that are useful for task execution; the forms most widely used in crowd counting algorithms are channel attention, spatial attention, and pixel attention. The FANet [33] proposed by Niu et al. sets the weight of the background area to zero and weights the target area according to the area where the crowd is located and its density to exclude background interference. The MS-SPCANet [34] proposed by Wang et al. assigns different channel weights to different spatial positions of the channel feature map, in order to highlight useful information and suppress useless information to the greatest extent. MGANet [35] proposed by Li et al. uses spatial attention to focus on the human head region to solve the problem of foreground and background confusion and uses channel attention to enhance the dependence between features and improve semantic expression. In the coordinated attention module CA [36] proposed by Hou et al., the channel attention is decomposed into two one-dimensional feature coding processes, and the features are aggregated along two spatial directions. In this way, long-range dependencies can be captured in one spatial direction, while precise position information can be preserved in the other spatial direction. In CAFNet [37] proposed by Wang et al., pixel attention and channel attention are used to integrate low-level features into high-level features, and then density maps are generated by combining each layer of features that adaptively aggregate local context.
The existing research on attention mechanisms is relatively rich, but there are still some limitations. Some studies only consider channel attention or spatial attention, which is not comprehensive enough, while the studies that consider both ignore the global relationship of feature maps. The mixed attention mechanism proposed in this paper uses the idea of nonlocal operation to capture the long-distance dependencies of spatial and channel feature maps, making full use of context information to obtain high-quality density maps.

Algorithm
The main difficulty in counting passenger flow in the subway platform scene comes from the high density of crowds during peak hours. The camera angle on the platform is low, so the head scale increases from far to near and the scale difference is large, which needs to be taken into consideration in the algorithm design. In addition, since there are many escalators and other building facilities on the platform, the complex background makes crowd feature extraction difficult, and the interference it introduces needs to be minimized when designing the algorithm.
Figure 1 shows the network framework of the algorithm in this paper, consisting of a VGG-16 network with the fully connected layers removed, a feature-enhanced pyramid structure, and a mixed attention module. Taking the platform monitoring image as input, the first 13 layers of VGG-16 are used to extract the image features. The original features are sent into the feature-enhanced pyramid structure, and the problems of different crowd scales and missed detection of small targets are addressed by aggregating features of different scales. Then, the fused features are sent to the mixed attention mechanism, which can effectively focus on global information by capturing the long-distance dependencies between any two spatial positions or any two channels, which helps to address background interference and occlusion. Finally, the attention feature map is upsampled to the size of the input image, and the predicted density map is obtained. Summing (integrating) the density map gives the number of passengers in the image.
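The last step of the pipeline, turning a predicted density map into a head count, is a plain summation over all pixels. The following is a minimal sketch in standard-library Python; the function name `count_from_density` and the toy density values are hypothetical and serve only to illustrate the integral sum.

```python
def count_from_density(density_map):
    """Return the estimated head count as the sum of all density values."""
    return sum(sum(row) for row in density_map)


# Toy 3 x 4 density map (hypothetical values). Each head contributes a
# total mass of 1 spread over neighbouring pixels.
toy_density = [
    [0.0, 0.25, 0.25, 0.0],
    [0.0, 0.25, 0.25, 0.5],
    [0.0, 0.0, 0.0, 0.5],
]
estimated_count = count_from_density(toy_density)
```

Because the count is the total mass of the map, localization errors that keep mass inside the image do not change the count, while background responses add to it directly.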
3.1. Feature-Enhanced Pyramid Structure.
Targets of different scales in subway platform surveillance images develop a semantic gap after the same proportion of downsampling, which manifests as the loss of small targets after multilayer convolution. The feature pyramid captures targets of different scales by fusing deep and shallow feature maps and mitigates the missed detection of small targets. However, the traditional feature pyramid has the following disadvantages [38]. Firstly, the lateral link uses 1 × 1 convolution to reduce the number of channels of deep features so that the deep and shallow features can be fused, but this operation causes a large loss of channel information of deep features. Secondly, 3 × 3 convolution is used to eliminate the aliasing effect after feature fusion, which introduces redundant calculation. Therefore, this paper proposes an improved feature-enhanced pyramid structure, using a channel conversion module (CCM) and an externally introduced semantic consistency learning module (SCLM) to solve the above two problems. The specific feature-enhanced pyramid structure is shown in Figure 2.
The backbone network extracts features from the bottom up and takes the feature maps output by the four convolution layers Conv2_2, Conv3_3, Conv4_3, and Conv5_3 as input, recorded as C2-C5. The input feature map is sent to the channel conversion module, which converts the channel information that would otherwise be discarded into pixel information; that is, the channel information is retained by expanding the width and height of the feature map. As shown in Figure 3, the channel conversion operation first reshapes the low-resolution feature map $H \times W \times \alpha^2 C$ into the high-resolution feature map $\alpha H \times \alpha W \times C$ by upsampling. Since the backbone network uses 2× downsampling, $\alpha$ is taken as 2 in the algorithm for the subsequent fusion of adjacent feature maps. At this point, the width and height of the feature map are doubled, and the number of channels decreases to 1/4. Because the number of channels in each layer needs to be consistent with the feature map C2, 1 × 1 convolution is used to enrich the channel information. Finally, 3 × 3 convolution is used to downsample the feature map to the original size, which aggregates the original channel information at the pixel level. The deep feature map after CCM processing retains rich channel information for subsequent fusion stages.
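The reshaping step of the channel conversion module can be illustrated with a small standard-library sketch. It is not the authors' implementation: the function name `channel_to_space` is hypothetical, the feature map is stored channels-first as nested lists, and the channel-to-offset ordering is an assumption borrowed from sub-pixel (pixel-shuffle) upsampling, since the text does not specify it. The 1 × 1 and 3 × 3 convolutions that follow are omitted.

```python
def channel_to_space(feature, alpha=2):
    """Rearrange a [alpha^2 * C][H][W] nested list into [C][alpha*H][alpha*W].

    Assumption: channel (c * alpha^2 + i * alpha + j) supplies the pixel at
    offset (i, j) inside each alpha x alpha output cell, as in sub-pixel
    (pixel-shuffle) upsampling. The paper does not state the ordering.
    """
    in_channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    if in_channels % (alpha * alpha) != 0:
        raise ValueError("channel count must be divisible by alpha^2")
    out_channels = in_channels // (alpha * alpha)
    out = [[[0.0] * (alpha * width) for _ in range(alpha * height)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for i in range(alpha):
            for j in range(alpha):
                src = feature[c * alpha * alpha + i * alpha + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * alpha + i][w * alpha + j] = src[h][w]
    return out


# Four 1 x 2 channels become one 2 x 4 channel: no value is discarded.
x = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
y = channel_to_space(x, alpha=2)
```

With α = 2, the eight input values reappear as the eight pixels of a single channel, which is the sense in which channel information is converted into pixel information rather than dropped.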
Due to the inconsistent distribution of features, directly fusing deep feature maps with shallow feature maps after sampling leads to aliasing effects, and the continuity of features cannot be guaranteed. Therefore, after the CCM and upsampling operations and before fusion, the semantic consistency learning module is used to standardize the distribution of features. As shown in Figure 2, the SCLM module consists of a 3 × 3 convolution and two 1 × 1 convolutions, and the consistency features are then output through the activation layer. After CCM and SCLM, the channel information of the original feature map is preserved and the aliasing effect brought by the fusion process is eliminated, and thus the features are enhanced. The fused feature maps P3-P5 are upsampled to the size of P2 and then concatenated along the channel dimension to obtain the feature map F, which preserves more feature information.
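The final aggregation step, upsampling P3-P5 to the size of P2 and concatenating along the channel dimension, can be sketched as follows. The helper names `upsample_nearest` and `fuse_pyramid` are hypothetical, and nearest-neighbour interpolation is an assumption made for simplicity; the text does not say which interpolation is used.

```python
def upsample_nearest(channel, factor):
    """Nearest-neighbour upsampling of one [H][W] channel by an integer factor."""
    return [[channel[h // factor][w // factor]
             for w in range(len(channel[0]) * factor)]
            for h in range(len(channel) * factor)]


def fuse_pyramid(levels):
    """Upsample every level to the first level's size and concatenate channels.

    `levels` is a list of [C][H][W] nested lists ordered from the finest
    level (P2) to the coarsest; each level's height is assumed to divide
    the height of the finest level.
    """
    target_h = len(levels[0][0])
    fused = []
    for level in levels:
        factor = target_h // len(level[0])
        fused.extend(upsample_nearest(ch, factor) for ch in level)
    return fused


# One 4 x 4 channel (finest level) and one 2 x 2 channel (next level).
p2 = [[[0] * 4 for _ in range(4)]]
p3 = [[[1, 2], [3, 4]]]
fused = fuse_pyramid([p2, p3])
```

The fused map has as many channels as all levels combined, all at the resolution of the finest level.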

Mixed Attention Mechanism.
In the convolution process, the receptive field is limited to a certain range, which leads to differences in the feature representation between pixels of the same category [39] and, in turn, to a decrease in counting accuracy. The idea of the nonlocal operation [40] is that the response at a given position is computed as a weighted sum over all other positions, so that the global contextual information can be fully utilized. Inspired by this, a mixed attention module is built to address the complex background of station monitoring images along two dimensions. The spatial attention mechanism captures global dependencies and suppresses background interference by focusing on target regions with high similarity. The channel attention weights each channel to highlight the channels useful for the counting task and suppress the useless channels.
Figure 4 shows the specific structure of the mixed attention mechanism, with the left-hand branch being the spatial attention mechanism and the right-hand branch being the channel attention mechanism. The ideas of the spatial and channel attention mechanisms are similar, except that the spatial attention mechanism performs a 1 × 1 convolution operation to reduce the dimensionality before reshaping and transposing the feature map. The input feature map of the mixed attention mechanism is $F \in \mathbb{R}^{C \times H \times W}$, where C, H, and W represent the channel, height, and width, respectively. After convolution, reshaping, and transposition, feature maps in $\mathbb{R}^{C \times HW}$ are obtained; then the matrix multiplication operation is performed and normalized by Softmax to obtain the spatial and channel attention maps s and c. The formulas are

$$s_{ij} = \frac{\exp\left(s_1^i \cdot s_2^j\right)}{\sum_{k=1}^{HW} \exp\left(s_1^i \cdot s_2^k\right)}, \qquad c_{ij} = \frac{\exp\left(c_1^i \cdot c_2^j\right)}{\sum_{k=1}^{C} \exp\left(c_1^i \cdot c_2^k\right)},$$

where $s_{ij}$ represents the spatial weight of the i-th spatial position weighted by all positions j, $c_{ij}$ represents the channel weight of the i-th channel weighted by all channels j, $s_1^i$ and $s_2^j$ represent the i-th and j-th positions of spatial feature maps $s_1$ and $s_2$, respectively, and $c_1^i$ and $c_2^j$ represent the i-th and j-th channels of channel feature maps $c_1$ and $c_2$, respectively. The output of the spatial and channel attention module is represented as

$$F_S^i = \lambda_1 \sum_{j=1}^{HW} s_{ij}\, s_3^j + F^i, \qquad F_C^i = \lambda_2 \sum_{j=1}^{C} c_{ij}\, c_3^j + F^i,$$

where $F_S$ and $F_C$ denote the spatial and channel attention feature maps, respectively, $s_3^j$ and $c_3^j$ denote the j-th position or channel of the spatial feature map $s_3$ and channel feature map $c_3$, respectively, and the result of the matrix multiplication is reshaped into $\mathbb{R}^{C \times H \times W}$. The coefficients $\lambda_1$ and $\lambda_2$ are learnable parameters that are initially set to zero and are adaptively adjusted through network training. $F^i$ is the i-th position or channel of the input feature map. $F_S$ and $F_C$ are fused to obtain a mixed attention feature map $F_a$ with the same dimensions as F.
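The shared structure of the two attention branches (similarity, Softmax normalization, weighted aggregation, and a scaled residual connection) can be sketched in standard-library Python. This is an illustration under simplifying assumptions, not the authors' code: the function name `nonlocal_attention` is hypothetical, the learned convolutions that produce the query, key, and value maps are omitted, and the input features are used directly in their place.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nonlocal_attention(query, key, value, residual, lam):
    """Return (attention_map, output) for one non-local attention branch.

    Each argument is a list of N vectors. For the spatial branch N = H*W
    positions; for the channel branch N = C channels.
    output[i] = lam * sum_j attention[i][j] * value[j] + residual[i]
    """
    attention = [softmax([dot(q, k) for k in key]) for q in query]
    output = []
    for i, weights in enumerate(attention):
        mixed = [sum(w * v[d] for w, v in zip(weights, value))
                 for d in range(len(value[0]))]
        output.append([lam * m + r for m, r in zip(mixed, residual[i])])
    return attention, output


# Three positions with two channels each (hypothetical values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn, out_zero = nonlocal_attention(feats, feats, feats, feats, lam=0.0)
_, out_half = nonlocal_attention(feats, feats, feats, feats, lam=0.5)
```

With the coefficient at its initial value of zero the branch returns its input unchanged, so the network starts from the plain fused features and only gradually mixes in global context as the coefficient is learned.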

Loss Function.
The loss function is made up of two parts. The Euclidean distance loss function $L_E$ measures the pixel-level difference between the predicted density map and the true density map. The formula is

$$L_E = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i; \theta) - F_i^{GT} \right\|_2^2,$$

where N is the number of images, $X_i$ is the i-th input image, and $\theta$ is the learnable network parameter. $F(X_i; \theta)$ and $F_i^{GT}$ are the predicted and true density maps for the i-th image. The Euclidean distance loss function is based on the premise that pixels are independent of each other, ignoring the correlation between them. Averaging all pixels without attention to structured information leads to blurred density maps and unclear details. To compensate for the shortcomings of the Euclidean distance loss function, the model introduces a structural similarity loss function $L_S$, which uses three local statistics (mean, variance, and covariance) to calculate the similarity between the predicted density map and the true density map. The formula is

$$L_S = 1 - \frac{1}{M} \sum_{p} \mathrm{SSIM}(P),$$

where M is the number of pixels in the density map and P is the image block corresponding to the same pixel p in the predicted and true density maps. SSIM is the structural similarity index and is calculated as

$$\mathrm{SSIM} = \frac{\left(2\mu_F \mu_{F^{GT}} + C_1\right)\left(2\sigma_{F F^{GT}} + C_2\right)}{\left(\mu_F^2 + \mu_{F^{GT}}^2 + C_1\right)\left(\sigma_F^2 + \sigma_{F^{GT}}^2 + C_2\right)},$$

where $\mu_F$, $\mu_{F^{GT}}$, $\sigma_F^2$, and $\sigma_{F^{GT}}^2$ denote the mean and variance of the predicted and true density maps, respectively, and $\sigma_{F F^{GT}}$ denotes the covariance between the predicted and true density maps. $C_1$ and $C_2$ are small constants set to prevent zeros in the denominator. SSIM ∈ [−1, 1], and a larger SSIM value indicates higher similarity between the two maps.
The final loss function is obtained by weighting $L_E$ and $L_S$:

$$L = L_E + \alpha L_S,$$

where $\alpha$ is the weighting coefficient used to balance the pixel-level loss against the structural loss; $\alpha$ is set to 0.001 through experiments.
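A simplified sketch of the combined loss is given below in standard-library Python. Several details are assumptions rather than the authors' settings: the function names are hypothetical, each density map is flattened and treated as a single SSIM block instead of sliding a local window over every pixel, the Euclidean term is normalized by 1/(2N), and the constants `c1` and `c2` are placeholder values.

```python
def euclidean_loss(pred_maps, true_maps):
    """Sum of squared pixel differences, scaled by 1 / (2N) over N images."""
    total = 0.0
    for pred, true in zip(pred_maps, true_maps):
        total += sum((p - t) ** 2 for p, t in zip(pred, true))
    return total / (2 * len(pred_maps))


def ssim(patch_a, patch_b, c1=1e-4, c2=9e-4):
    """SSIM of two flattened patches from their mean, variance, covariance."""
    n = len(patch_a)
    mu_a, mu_b = sum(patch_a) / n, sum(patch_b) / n
    var_a = sum((a - mu_a) ** 2 for a in patch_a) / n
    var_b = sum((b - mu_b) ** 2 for b in patch_b) / n
    cov = sum((a - mu_a) * (b - mu_b) for a, b in zip(patch_a, patch_b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def total_loss(pred_maps, true_maps, alpha=0.001):
    """L = L_E + alpha * L_S, with L_S = 1 - mean SSIM (one block per map)."""
    l_e = euclidean_loss(pred_maps, true_maps)
    mean_ssim = sum(ssim(p, t) for p, t in zip(pred_maps, true_maps))
    mean_ssim /= len(pred_maps)
    return l_e + alpha * (1.0 - mean_ssim)


# Flattened toy density maps (hypothetical values).
truth = [0.0, 0.5, 0.5, 0.0]
blurry = [0.25, 0.25, 0.25, 0.25]
loss_same = total_loss([truth], [truth])
loss_blur = total_loss([blurry], [truth])
```

The blurred prediction has the same total count as the ground truth but no structure, so the structural term penalizes it in addition to the pixel-level term.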

Experimental Results and Analysis
The experiment was divided into two stages: training and testing. In the training stage, the training set images were taken as input, the predicted value obtained by forward propagation was compared with the true value to obtain the loss value, and the parameters were updated during backpropagation so that the loss decreased until it reached the desired level, completing the network training. The test set images were then fed into the trained network to obtain the predicted values, and the accuracy and robustness of the network were evaluated by the MAE and MSE. The learning rate decay parameter was set to 0.995. The training batch size was set to 16 and the number of iterations was set to 200. To allow a fair comparison of the algorithms, the experimental parameters of all the compared methods were set in the same way as those of the method in this paper.
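The effect of the decay parameter can be made concrete with a short sketch. It assumes that the decay is applied multiplicatively once per iteration, starting from the initial learning rate of 1 × 10⁻⁵ reported in the environment settings; the text does not state the decay schedule explicitly, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, initial_lr=1e-5, decay=0.995):
    """Exponentially decayed learning rate after `step` decay steps."""
    return initial_lr * decay ** step


# Learning rate used at each of the 200 iterations under this assumption.
schedule = [learning_rate(step) for step in range(200)]
```

Under this assumption the learning rate shrinks smoothly and stays within the same order of magnitude over the whole run.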

Evaluation Indicators.
In this paper, mean absolute error (MAE) and mean square error (MSE) are used to evaluate the performance of the algorithm. MAE represents the error between the predicted and true values, reflecting accuracy, while MSE represents the degree of difference between the predicted and true values, reflecting robustness.

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i^{P} - C_i^{GT} \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i^{P} - C_i^{GT} \right)^2},$$

where N is the number of images and $C_i^{P}$ and $C_i^{GT}$ are the predicted and true number of people for the i-th image, respectively.
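The two indicators can be computed from per-image counts as in the sketch below. It assumes that MSE denotes the square root of the mean squared count error, as is conventional in the crowd counting literature; the function names and the example counts are hypothetical.

```python
import math


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true per-image counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)


def mse(pred_counts, true_counts):
    """Square root of the mean squared count error (crowd-counting 'MSE')."""
    squared = [(p - t) ** 2 for p, t in zip(pred_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predicted and true counts for three images.
predicted = [102, 48, 251]
actual = [100, 50, 255]
example_mae = mae(predicted, actual)
example_mse = mse(predicted, actual)
```

Because the squared term weights large errors more heavily, MSE is never smaller than MAE, and a large gap between the two indicates that a few images are counted much worse than the rest.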

Dataset Description.
To verify the performance of the proposed algorithm, experiments were conducted on the ShanghaiTech and UCF_CC_50 public datasets and the self-built station dataset, respectively.
The ShanghaiTech dataset contains 1198 images, with a total of 330,165 individuals tagged. The dataset is divided into two parts. The images in Part_A are randomly obtained from the Internet while the images in Part_B are obtained from street surveillance in Shanghai. Part_A is characterized by a high density of crowds and variable scenes, while Part_B is characterized by a low density of crowds but suffers from the problem of large differences in crowd scales. This dataset is a challenging dataset across different scene types and densities.
The UCF_CC_50 dataset images cover a wide range of scenes such as marathons, stadiums, and concerts. The average number of people per image is as high as 1280, while the number of people in a single image ranges from 94 to 5453, with a large gap in density levels between images, making the dataset challenging. The disadvantage of this dataset is its small size of only 50 images, and thus five-fold cross-validation was used to conduct experiments in this paper. The 50 images were randomly and equally divided into five groups, each of which was used in turn as the test set while the other four were combined as the training set, and the results of the five experiments were averaged as the final result.
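The five-fold protocol described above can be sketched as follows. The function name `five_fold_splits` and the fixed random seed are hypothetical choices for illustration; image indices stand in for the actual image files.

```python
import random


def five_fold_splits(num_images=50, folds=5, seed=0):
    """Return a list of (train_indices, test_indices), one pair per fold."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    fold_size = num_images // folds
    parts = [indices[k * fold_size:(k + 1) * fold_size] for k in range(folds)]
    splits = []
    for k in range(folds):
        test = parts[k]
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        splits.append((train, test))
    return splits


splits = five_fold_splits()
```

Each image appears in exactly one test fold, so averaging the five results uses every image for evaluation once while never testing on an image seen during the corresponding training run.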
For deep learning crowd counting, the quality of the dataset affects the counting effectiveness of the model to a certain extent. The existing public datasets are mostly images of open scenes, while the long and narrow subway platforms and numerous building facilities pose the problem of cluttered backgrounds. Due to the height limitation of the platform, the height of its surveillance cameras also differs from that in the public datasets. In order to better evaluate the performance of the model in this paper, platform images were collected from the Lanzhou Metro to build the dataset. Five stations on Lanzhou Metro Line 1 with high passenger flow, including Xizhanshizi, Xiguan, Dongfanghong Square, Wulipu, and Lanzhou University, were selected to capture images from the surveillance video at one end of the platform waiting area during the morning peak (e.g., 7:00-9:00), evening peak (e.g., 17:30-20:00), and flat (off-peak) periods (e.g., 10:00-16:00) of weekdays and weekends. The dataset contains a total of 2000 labelled images, of which 1500 are used as the training set and 500 as the test set. The size of each image is 1200 × 1024.
Typical images for each dataset are shown in Figure 5.

Experimental Result Analysis.
Table 1 shows the experimental results of the proposed algorithm and five other classical or advanced comparison algorithms on the ShanghaiTech dataset. The comparison between the experimental results of the two parts of the dataset shows that the counting results of sparse scenes are better than those of dense scenes, indicating that dense scenes are still the key direction for future research on crowd counting. The proposed algorithm achieves the best results on this dataset among the compared algorithms. Compared with MIA [43], the best of the comparison models, the MAE and MSE of Part_A improved by 2.3% and 1.4%, respectively, and the MAE and MSE of Part_B improved by 0.9% and 1.6%, respectively, indicating the effectiveness of the feature-enhanced pyramid structure and the mixed attention mechanism, which can perform the counting task well in the case of higher crowd density and different scales.
Table 2 shows the experimental results on the UCF_CC_50 dataset. It can be seen that, among the comparison algorithms, only the context-aware model (CAN) [41] is superior to the proposed algorithm, and the accuracy and robustness of the other algorithms are lower than those of the proposed algorithm. The CAN network, which uses spatial pyramid pooling to compute scale-aware features, is a multicolumn network that adaptively encodes contextual information. The multiscale enhanced network (MSEN) [42] and the multivariate information aggregation (MIA) [43], which also employ multicolumn structures, have also achieved good results, indicating that multicolumn models work better on this dataset. The algorithm in this paper is a single-column structure, which has fewer parameters and simpler calculation while achieving competitive results, and can also meet the counting requirements of various dense scenes. The last two columns give the number of parameters and the inference time of each algorithm; because the model in this paper is a single-column structure, it has fewer parameters and a shorter inference time.
The experimental results on the self-built platform dataset are shown in Table 3. The algorithm in this paper achieves the best results because it improves on the traditional pyramid. The application of CCM and SCLM retains the channel information of the original feature map, eliminates the aliasing effect caused by the fusion process, enhances the feature representation, and helps to solve the scale problem. In addition, the mixed attention mechanism in the algorithm utilizes the idea of nonlocal image processing. By focusing on the relationship between local features, the global context information is fully aggregated to generate a high-quality predicted density map.
To further verify the effectiveness of the algorithm in this paper, the platform of Xizhanshizi Station on Lanzhou Metro Line 1 on April 26, 2023 (Wednesday), was selected, and the passenger flow on the platform was counted every 10 minutes during the period of 6:30-9:00. A total of 16 groups of predicted passenger flow, real passenger flow, and relative error were obtained, as shown in Figure 6. It can be seen from the figure that the number of passengers on the platform increases gradually with time, increases significantly after 7:30, and remains at a high level, which is consistent with the trend of passenger flow in the morning peak on weekdays. The relative errors of the 16 groups of data are all within 4.5%, and the mean absolute percentage error is 2.71%, which supports the effectiveness and accuracy of the passenger counting algorithm in this paper.
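The relative error of each group and the mean absolute percentage error over all groups can be computed as in the sketch below. The function names and the two example counts are hypothetical and are not the measurements reported in Figure 6.

```python
def relative_errors(pred_counts, true_counts):
    """Per-sample relative error |predicted - true| / true."""
    return [abs(p - t) / t for p, t in zip(pred_counts, true_counts)]


def mape(pred_counts, true_counts):
    """Mean absolute percentage error, expressed in percent."""
    errors = relative_errors(pred_counts, true_counts)
    return 100.0 * sum(errors) / len(errors)


# Two hypothetical 10-minute samples: predicted versus manually counted.
predicted = [98, 205]
actual = [100, 200]
example_mape = mape(predicted, actual)
```

Unlike MAE, the percentage error normalizes by the true count, so sparse early-morning samples and crowded peak samples contribute on the same scale.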
Figure 7 shows partial density maps obtained from the proposed model on different datasets, with every two rows of experimental result maps coming from the same dataset, arranged in the order of the ShanghaiTech, UCF_CC_50, and self-built station datasets. The experimental results in the first four rows, from the public datasets, show that the counting error is greater for dense scenes than for sparser scenes, but in general, the enhanced feature fusion and the suppression of background interference by the attention mechanism allow the algorithm to achieve good counting results. The predicted values in the last two rows are greater than the true values, and observation of the density distribution shows that reflections of passengers in the platform screen doors cause repeated counts and thus slightly larger predicted values. The experimental results show that the model performs well on both the public and self-built station datasets and can make accurate predictions in scenes with very high crowd density, large variations in crowd scale, and severe background interference.
Journal of Advanced Transportation

Ablation Experiments.
To verify the effectiveness of the modules in the network, ablation experiments were conducted on Part_A of the ShanghaiTech dataset. The backbone network is denoted as Backbone, the traditional feature pyramid structure is denoted as FPN, the feature-enhanced pyramid structure is denoted as FEPN, and the mixed attention mechanism is denoted as MA. The experimental results are shown in Table 4. The comparison between the first two rows and the last two rows shows that embedding mixed attention improves the counting accuracy and robustness of the network, indicating that fully utilizing global contextual information works well in crowd counting. The comparison between the first and third rows illustrates that the feature-enhanced pyramid structure with channel conversion and semantic consistency learning improves network performance compared to the traditional feature pyramid structure. The loss and accuracy convergence curves of the ablation experiment are shown in Figure 8; to keep the figure simple and readable, the training loss curve and the test loss curve are presented in two separate panels, as are the training accuracy curve and the test accuracy curve. The feature-enhanced pyramid structure proposed in this paper is an improvement on the traditional feature pyramid structure. While the model achieves excellent performance, it is also necessary to check whether this improvement brings redundant calculation. The number of model parameters reflects the calculation amount and running time of the model to a certain extent. Therefore, this paper analyzes the improvement of the feature pyramid structure based on the number of model parameters. As shown in the last column of Table 4, the comparison of the first and third rows shows that the improvement of the feature pyramid brings an increase of less than 1 MB in parameters, which indicates that the feature-enhanced pyramid network does not bring redundant calculation while improving the network counting accuracy.
Figure 8 shows the loss convergence curve and accuracy convergence curve of the model. In the early stage, the fluctuation of training loss is large, mainly because the parameter learning of the network is not yet completed and the model is disturbed by useless information. As the learning proceeds, the training loss curve tends to be stable and converges, indicating that the model has effectively completed the learning. The accuracy convergence curve indicates that the parameters of the model are well set and learned, and the counting performance of the model is good.

Conclusion
To address the problems of large changes in crowd scale and strong background interference in subway platform passenger flow counting, the algorithm proposed in this paper uses a feature-enhanced pyramid structure to retain channel information and eliminate aliasing effects. The enhanced feature representation is more conducive to solving the scale problem. By embedding a mixed attention module in the algorithm, the idea of nonlocal image processing is used to capture the global context information and obtain a high-quality predicted density map. The algorithm achieves good results on the two public datasets and the self-built station dataset, which demonstrates the effectiveness of the algorithm in this paper. However, there are still some shortcomings in the study. For example, reflections of passengers from platform screen doors may lead to repeated counts and thus overestimated predictions, and preprocessing of the images to cover or crop sections of screen doors with severe reflections will be considered in the future. For the problem that passengers are completely occluded by pillars or other passengers on the platform, resulting in missed detections and underestimated predictions, ideas from target detection algorithms can be borrowed in the future to reduce the impact of occlusion on crowd detection through the loss function.

Figure 4: Structure diagram of the mixed attention mechanism.

Figure 6: Platform real passenger flow and predicted passenger flow.

Figure 8: Loss and accuracy convergence curve. (a) Training loss convergence curve. (b) Test loss convergence curve. (c) Training accuracy convergence curve. (d) Test accuracy convergence curve.

4.1. Environment and Parameter Settings.
All comparison experiments in this paper were completed on a Windows 11 system equipped with an NVIDIA GeForce RTX 3050 graphics card. The environment configuration was CUDA 11.6 + Anaconda 4.13 + Python 3.7 + PyTorch 1.10. The Gaussian distribution was used to initialize the convolutional layer parameters randomly, and the Adam algorithm was used to optimize the parameters. To balance the training speed and the loss, the initial learning rate was set to $1 \times 10^{-5}$.

Table 1: Experimental results of the ShanghaiTech dataset.

Table 3: Experimental results of the self-built station dataset.