Fire-PPYOLOE: An Efficient Forest Fire Detector for Real-Time Wild Forest Fire Monitoring



Introduction
In the past few decades, the frequency and scale of wildfires worldwide have increased dramatically [1]. Taking China alone as an example, 546 forest fires and 17 grassland fires occurred there from January to November 2021. Global forest fires are characterized by prolonged duration, expanded scope, and heavy release of harmful gases, which affect social order and threaten heritage security [2].
Forest fires are highly destructive, and both fire control and rescue are difficult. Therefore, forest fire monitoring, as an effective means of fire prevention and spread control, has become a major global research topic. Traditional forest fire monitoring relies mainly on observation towers, patrol aircraft, or satellite remote sensing images. However, these traditional approaches are constrained by weather, climate, technology level, and operating costs, and they cannot provide forest fire information in real time. With the rapid development of science and technology, inspection by manned aircraft and unmanned aerial vehicles (UAVs) has become a more effective means of forest fire monitoring. It offers high efficiency, low cost, and strong real-time performance [3].
Traditional smoke and fire detection methods mainly focus on feature extraction and classification of static pictures or dynamic videos. The typical features of smoke include color, texture, motion orientation, etc. [4]. Wang [5] designed a forest fire monitoring system that uses principal component analysis for dimensionality reduction to analyze the specificity of each channel of the three color spaces.
With the widespread use of deep learning in target recognition and image classification in recent years, more and more researchers have started to combine this method with forest fire forecast tasks. Convolutional neural networks (CNNs) were first used in smoke and fire image classification [4, 6-8]. In general, the CNN or R-CNN outperforms other machine learning methods, such as support vector machine, stacked autoencoder, and deep belief network, in terms of classification accuracy, receiver operating characteristic curve, recall rate, and F1-score [6]. CNN has good detection accuracy for small objects, but the flame target may be large depending on the shooting distance. Therefore, the YOLO method, a one-stage target detection algorithm, was proposed to improve the global detection accuracy and reduce the error detection rate. Wu et al. [9] proposed to combine the CNN DeepLab V3+ model with classical image processing algorithms to finely segment the beams and calculate the number of beams. The whole cluster of banana fruit was identified based on deep learning. An edge detection algorithm was used to extract the centroid of the fruit finger shape, and a clustering algorithm was used to determine the optimal number of fruit bundles on the visual detection plane. The accuracy of beam detection in the debudding stage was 86%. During the harvest, beam detection is very challenging, with a detection accuracy of 76%. Chen et al. [10] proposed an optimized YOLO-v4 detection method for bayberry trees based on drone images. The method speeds up feature extraction by using the Leaky ReLU activation function and uses DIoU-NMS to retain the high-accuracy prediction boxes. The optimized YOLO-v4 model had a detection accuracy of up to 97.78% and a recall rate of up to 98.16% on the dataset. Li et al. [11] proposed a remote sensing image detection (RSI-YOLO) method based on the YOLOv5 object detection algorithm. Channel attention and spatial attention mechanisms are used to enhance the features fused by the neural network. The multiscale feature fusion structure based on PANet is improved to a weighted bidirectional feature pyramid structure. In addition, the loss function is modified to optimize the network model. Jiao et al. [12] proposed a deep learning fire detection algorithm that aims to improve the accuracy and efficiency of fire detection using drones. Extensive experiments with the large-scale YOLOv3 and tiny-YOLOv3 networks showed that they can learn representative features, reaching a detection accuracy of about 91% at frame rates of up to 30 frames per second (FPS). Zhao et al. [13] proposed an improved Fire-YOLO deep learning algorithm. By extending the feature extraction network in three dimensions, the feature propagation ability for small fire target identification is enhanced, the network performance is improved, and the model parameters are reduced. Furthermore, through the enhancement of the feature pyramid, the best-performing prediction box is obtained. The average detection time of the real-time model is 0.04 s per frame.
To address the low accuracy of early fire image recognition with single-stage target detection models, this paper makes the following improvements: (1) The feature extraction capability of the backbone is improved by using large kernel convolution instead of ordinary convolution kernels, which improves the accuracy of early fire image recognition. (2) The CSPNet network is introduced to reduce the model parameters and thus the resources consumed by model inference. (3) The network structure is changed so that the inference speed is greatly increased, which solves the problem of slow inference caused by large kernel convolution.
The main parts of this paper are structured as follows. Section 2 briefly introduces PP-YOLOE and then elaborates on the improvements of Fire-PPYOLOE. Section 3 gives the experimental details and, to test the performance of the proposed model, compares and analyzes the results of three models on labeled and unlabeled datasets. The conclusion is provided in Section 4.

Materials and Methods
Practical applications place high requirements on forest fire detection models, such as fast detection speed, high recall, low computing cost, and deployment on multiple types of devices. This paper develops an efficient real-time forest fire detection model based on the state-of-the-art object detection model PP-YOLOE [14] and names it Fire-PPYOLOE. In this section, we introduce PP-YOLOE briefly and then elaborate on the improvements of Fire-PPYOLOE.
To meet the high requirements of forest fire monitoring, we design a new backbone and neck structure based on large kernel convolutions.The proposed Fire-PPYOLOE can further improve the detection recall and decrease the computing cost without sacrificing the detection speed.
2.2. The Proposed Fire-PPYOLOE. The images captured by monitoring devices must be preprocessed to facilitate deep neural network calculations [18]. First, we normalize the image and map the value of each color channel from [0, 255] to [0, 1]. Then, we resize the image to a uniform scale (e.g., 640 × 640). Next, the preprocessed image is passed directly to our fire detector Fire-PPYOLOE. There is no need to locate candidate object regions based on predefined anchors because our model is anchor-free, which improves the detection speed to some extent. As shown in Figure 1, the proposed Fire-PPYOLOE is able to detect multiple flames at once. Once flames are detected, the result is transmitted to a terminal device such as a UAV.
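The two preprocessing steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the image is a nested list of 0-255 RGB values, the resize uses nearest-neighbour sampling (the paper does not state the interpolation method), and the function names `normalize`, `resize_nearest`, and `preprocess` are hypothetical rather than taken from the authors' code.

```python
def normalize(image):
    """Map every channel value from [0, 255] to [0, 1]."""
    return [[[c / 255.0 for c in pixel] for pixel in row] for row in image]


def resize_nearest(image, size):
    """Resize an H x W x 3 image to size x size by nearest-neighbour sampling.

    Assumption: the paper does not say which interpolation is used.
    """
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // size][x * src_w // size] for x in range(size)]
        for y in range(size)
    ]


def preprocess(image, size=640):
    """Normalize first, then resize to a uniform scale (640 x 640 by default)."""
    return resize_nearest(normalize(image), size)


# Tiny 2 x 2 example image, upscaled to 4 x 4 for illustration.
tiny = [[[0, 128, 255], [255, 255, 255]],
        [[10, 20, 30], [40, 50, 60]]]
result = preprocess(tiny, size=4)
print(len(result), len(result[0]), result[0][0])
```

In a real pipeline the same two steps would normally be done with an image library on arrays; the pure-Python version is only meant to make the order of operations explicit.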
The image is sent to the target detection network for prediction in the following steps. First, the image is fed into the backbone for feature extraction at different scales. Then, the feature maps are fed into the head (detection layer) to predict the location and category of the target. We select multiscale feature maps for target prediction and add a neck to fuse feature maps of different scales, which improves the recognition accuracy of small objects. The feature map with higher scale has a strong feature extraction ability for image details, so it can recognize small targets better.
For practical use, the fire detector should be deployable on different endpoint devices. Therefore, we train tiny, small, and large models to cover different scenarios in practice. In this paper, we take the large version as a running example. The network structure of our Fire-PPYOLOE is shown in Figure 2. It consists of three parts, namely, the newly designed backbone with ConvNeXt [19] and CSPNet [20] structure, the PANet [21] neck with CSPConvStage, and the efficient ET-head used in the original PP-YOLOE. We introduce them in detail in the following subsections.

CSPConvNeXt Backbone.
In the original PP-YOLOE, CSPResNet [14] is used as the backbone to extract multiple dimensional features. It leverages many 3 × 3 convolution layers for feature extraction. The receptive field calculation formula is shown in Equation (1) [22]:

$$L_k = L_{k-1} + (f_k - 1) \prod_{i=1}^{k-1} S_i \tag{1}$$

where $L_k$ represents the size of the receptive field of layer $k$, $f_k$ represents the kernel size, and $S_i$ represents the stride of layer $i$. For the first layer, the receptive field size is equal to its kernel size. We can see that the kernel size and the network depth are positively correlated with the size of the receptive field. Theoretically, CSPResNet can fully extract the features at every position. However, this is not the case. The effective receptive field (ERF) [22] is proposed to show the effective area of $L_k$. Figure 3 (reproduced from Scaling Up Your Kernels to 31 × 31: Revisiting Large Kernel Design in CNNs [23]) shows the effective areas of different networks with various kernel sizes. Equation (2) shows its computation formula:

$$\mathrm{ERF} \propto O\left(K\sqrt{L}\right) \tag{2}$$

where $K$ is the kernel size and $L$ here denotes the network depth. It has been proven that deep neural networks using small kernel convolutions pay more attention to the central part of the image and ignore the feature extraction for the edge part [23]. From Equation (2), we can see that ERF is positively related to kernel size, and increasing the kernel size K enlarges the ERF more effectively than increasing the network depth L, because the ERF grows linearly with K but only with the square root of L. Inspired by this, we propose to leverage large kernel convolutions to obtain a better ERF, so as to improve the detection recall. Although a large kernel convolution network is faster than a network built from attention mechanisms, it is still considerably slower than a traditional convolution network.
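A small numeric sketch may help to separate the theoretical receptive field from the effective one. It assumes the standard recursion L_k = L_{k-1} + (f_k - 1) · ∏ S_i for the theoretical receptive field and the ERF ∝ K√L scaling reported in [23]; the helper names `receptive_field` and `erf_scale` are illustrative, and `erf_scale` returns only a proportionality factor, not a pixel count.

```python
import math


def receptive_field(layers):
    """Theoretical receptive field of a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = product of the strides of all previous layers
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def erf_scale(kernel, depth):
    """Proportionality factor K * sqrt(L) for the effective receptive field."""
    return kernel * math.sqrt(depth)


three_small = [(3, 1)] * 3   # three stacked 3x3 convolutions, stride 1
one_large = [(7, 1)]         # one 7x7 convolution, stride 1

print(receptive_field(three_small), receptive_field(one_large))
print(erf_scale(3, 3), erf_scale(7, 1))
```

Both stacks have the same theoretical receptive field, but the single large kernel has the larger K√L factor, which is the argument for large kernels made in the text.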
In this paper, we choose to use ConvNeXt [24] to replace CSPResNet and propose to introduce CSPNet to improve the recall without sacrificing speed. As shown in Figure 2, the overall structure of the network starts with a stem with a kernel size of four and a stride of four. Then, there are four structures, which consist of CSPConvStage and convolutions with a kernel size of three and a stride of two. For the large version, the width and depth of the CSPConvStage layers are [96, 192, 384, 768] and [3, 3, 9, 3], respectively. The feature maps of the last three structures are output. We also make several improvements to the backbone.
Figure 4 shows the details of the network structure improvement. We change layer normalization [25] in the original ConvNeXt block to batch normalization [26] and further remove gamma to improve speed. In addition, the activation function GELU [27] is changed to the powerful SiLU function [28]. Most importantly, we leverage CSPNet to optimize the network structure. Compared with CSPResNet, the network parameters are greatly reduced, and the recall is also improved. Some specific parameters are shown in the experimental section.
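For reference, the two activation functions mentioned above can be written out directly. This is a minimal sketch of their standard definitions, GELU(x) = x·Φ(x) with Φ the standard normal CDF and SiLU(x) = x·σ(x) with σ the logistic sigmoid; the function names are illustrative and are not taken from the authors' implementation.

```python
import math


def gelu(x):
    """GELU(x) = x * Phi(x), using the exact erf-based form."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# Compare the two activations at a few sample inputs.
for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(value, round(gelu(value), 4), round(silu(value), 4))
```

SiLU needs only an exponential, whereas the exact GELU needs the error function, so the swap keeps a smooth activation while simplifying the per-element computation.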

PANet Neck with CSPConvStage.
The neck is a network structure that fuses the features extracted by the backbone. The original PP-YOLOE uses the Path Aggregation Network (PANet) [21] as the neck. Our Fire-PPYOLOE updates PANet using the same CSPConvStage as its backbone. PANet with CSPConvStage obtains a large receptive field by replacing its small kernel convolutions with large kernel convolutions.
2.4. ET-Head. The role of the head is to predict the location and class of each object. The ET-head used in PP-YOLOE has proved to be very efficient, so we use the ET-head directly in Fire-PPYOLOE.
Varifocal loss (VFL) [29] and distribution focal loss (DFL) [30] are used to improve the recall and speed. Specifically, VFL uses the target score to weight the loss of positive samples, so that positive samples with high intersection over union (IoU) contribute relatively more to the loss. It also makes the model pay more attention to high-quality samples than to low-quality samples during the training process. In this way, the model can effectively learn a joint representation of classification score and localization quality estimation, such that there is a high degree of consistency between training and inference. Therefore, VFL can compensate for the imbalance of positive and negative samples in forest fire detection. Equation (3) shows its computation formula:

$$\mathrm{VFL}(p, q) = \begin{cases} -q\left(q \log p + (1 - q)\log(1 - p)\right), & q > 0 \\ -\alpha p^{\gamma} \log(1 - p), & q = 0 \end{cases} \tag{3}$$

where p is the predicted IoU-aware classification score and q is the target score; in Equation (3) only, α and γ denote the focusing hyperparameters of VFL [29]. DFL addresses the problem of inflexible bounding box representation by predicting bounding boxes with a general distribution:

$$\mathrm{DFL}(S_i, S_{i+1}) = -\left((y_{i+1} - y)\log S_i + (y - y_i)\log S_{i+1}\right)$$

where y represents the regressed label, $y_i$ and $y_{i+1}$ are the two discrete values nearest to y, and S represents the softmax function. Based on the above computation, Fire-PPYOLOE is supervised by the following loss function:

$$\mathrm{Loss} = \frac{\alpha \cdot \mathrm{loss}_{\mathrm{VFL}} + \beta \cdot \mathrm{loss}_{\mathrm{GIoU}} + \gamma \cdot \mathrm{loss}_{\mathrm{DFL}}}{\sum_{i}^{N_{\mathrm{pos}}} \hat{t}}$$

In the above formulas, $\hat{t}$ represents the normalized target score. Here, α, β, and γ represent the weight coefficient of the classification loss, the weight coefficient of the regression loss, and the weight coefficient of the DFL loss, respectively. The term loss_VFL denotes the varifocal loss, loss_GIoU denotes the GIoU loss, and loss_DFL denotes the distribution focal loss.
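To make the two losses concrete, the following sketch evaluates them for a single prediction. It assumes the forms given in the cited works [29, 30]: VFL is q-weighted binary cross-entropy for positives (q > 0) and a down-weighted focal term for negatives (q = 0), and DFL is a cross-entropy over the two integer bins that bracket the continuous target y. The defaults `alpha=0.75` and `gamma=2.0` are the values commonly used with VFL and are an assumption here, as are the function names.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss for one predicted score p and target score q."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    if q > 0:
        # Positive sample: binary cross-entropy weighted by the target score q.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: focal-style down-weighting of easy negatives.
    return -alpha * (p ** gamma) * math.log(1.0 - p)


def distribution_focal_loss(probs, y):
    """DFL for softmax outputs over bins 0..n-1 and a target y in [0, n-1]."""
    left = min(int(math.floor(y)), len(probs) - 2)
    right = left + 1
    return -((right - y) * math.log(probs[left])
             + (y - left) * math.log(probs[right]))


print(varifocal_loss(0.9, 0.9), varifocal_loss(0.1, 0.9))
print(distribution_focal_loss([0.1, 0.4, 0.4, 0.1], 1.5))
```

A positive sample whose predicted score is far from its high target score is penalized much more than one whose score is close, which matches the emphasis on high-quality samples described in the text.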

Results
In this section, we present the experiment details.
3.1. Experiment Setup. We trained Fire-PPYOLOE on a server with a Tesla V100 GPU, two E5-266v2 CPUs, and 128 GB of RAM. The operating system is Ubuntu 20.04. We use a number of Python libraries, such as Paddle, numpy, pycocotools, Cython, pyclipper, PyYAML, and scipy. The model is trained for 150 epochs, AdamW [31] is used as the optimizer, and the weight decay is set to 0.0005. The optimization strategies involved are cosine annealing [32] and warm-up. The initial learning rate is 0; it reaches 1e−4 at epoch 20 and finally decays to 1e−6 at epoch 150. We trained PP-YOLOE with the same settings.
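The learning-rate schedule described above can be sketched as a function of the epoch. The three anchor values (0 at the start, 1e−4 at epoch 20, 1e−6 at epoch 150) come from the text; the linear shape of the warm-up, the per-epoch granularity, and the function name `learning_rate` are assumptions made for illustration.

```python
import math


def learning_rate(epoch, warmup_epochs=20, total_epochs=150,
                  peak_lr=1e-4, final_lr=1e-6):
    """Warm-up followed by cosine annealing.

    Assumption: the warm-up is linear; the paper only gives the end points.
    """
    if epoch <= warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


for epoch in (0, 10, 20, 85, 150):
    print(epoch, learning_rate(epoch))
```

Frameworks usually apply such schedules per iteration rather than per epoch; the per-epoch form is used here only to keep the example short.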
3.2. Forest Fire Dataset. We use a public labeled dataset for model training and testing. The dataset (https://aistudio.baidu.com/aistudio/datasetdetail/107770) contains 6,675 fire and smoke images collected from public websites. It is randomly divided into a training set and a test set with a ratio of 80% to 20%. Figure 5 shows two examples of the labeled images. We can see that there are large and small flames in the data, and multiple flames may exist in one image. This demonstrates that the labeled data are close to real application data.
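A random 80%/20% split of this kind can be sketched as follows. The seed, the function name `split_dataset`, and the use of integer indices in place of image files are assumptions for illustration; the paper does not describe how the split was implemented.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items reproducibly and split them into train and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed (assumed) for repeatability
    cut = round(len(items) * train_ratio)
    return items[:cut], items[cut:]


# 6,675 placeholder indices standing in for the labeled images.
train_ids, test_ids = split_dataset(range(6675))
print(len(train_ids), len(test_ids))
```

With 6,675 images this yields 5,340 training and 1,335 test images.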
In the real world, the images transmitted from various monitoring devices are unlabeled and come in various styles. To evaluate the effect of different models in real scenarios, we also conduct an experiment based on a public unlabeled dataset (https://www.kaggle.com/datasets/phylake1337/fire-dataset). It contains 999 images, including 755 images with flames and 244 images without flames. Figure 6 shows two examples of unlabeled images. Images with flames are used to measure the recall of the model in the presence of fire, whereas images without fire and smoke are used to measure the false detection rate of the model. From Figure 6, we can see that there are large flames and small flames in the data, and one image may contain a single target or multiple targets. The data also cover various real scenes, such as images captured by day and by night.

Baseline Models.
To verify the effectiveness of our Fire-PPYOLOE, we take Faster R-CNN [33] as a baseline. It is a classical two-stage object detection model with high recall. In this paper, we retrain Faster R-CNN (https://github.com/rbgirshick/fast-rcnn) on the same dataset for a fair comparison. We take PP-YOLOE (https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/ppyoloe) as another baseline because it is a state-of-the-art single-stage detection model, and our Fire-PPYOLOE is developed based on it. We train the three models with their optimal training strategies and compare their advantages and disadvantages in this subsection.

Evaluation Metrics.
We compare Fire-PPYOLOE, PP-YOLOE [14], and Faster R-CNN [33] on the labeled dataset and the unlabeled dataset. For tests on labeled data, we use several commonly used metrics, namely, the number of model parameters (Params), giga floating-point operations (GFlops), infer time, FPS, and mean average precision (mAP).
Params means the total number of parameters to be trained in a model. Generally speaking, the fewer the parameters, the less computation and memory are required. GFlops means 1 billion floating-point operations; it counts the computation required for one forward pass and can be used to measure the complexity of the model. In general, the lower the GFlops, the lower the complexity of the model. Infer time describes the time required to process an image. FPS reflects the number of images that can be processed in 1 s.
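The mAP reflects the tradeoff between precision and recall and is a commonly used metric for most detection models; it is the mean of the average precision (AP) over all classes. Equations (7) and (8) show the computation formula:

$$\mathrm{AP} = \sum_{n} (r_{n+1} - r_n)\, P_{\mathrm{interp}}(r_{n+1}) \tag{7}$$

with

$$P_{\mathrm{interp}}(r_{n+1}) = \max_{\tilde{r}\,:\,\tilde{r} \ge r_{n+1}} P(\tilde{r}) \tag{8}$$

where $P(\tilde{r})$ is the measured precision at recall $\tilde{r}$, and $P_{\mathrm{interp}}(r_{n+1})$ takes the maximum precision over all recall values greater than or equal to $r_{n+1}$.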
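The interpolated AP can be computed from a handful of precision-recall points, as the sketch below shows. It assumes the usual all-point interpolation, AP = Σ (r_{n+1} − r_n) · max over recalls ≥ r_{n+1} of the precision, with recall values sorted in ascending order; the function names and the sample points are illustrative and do not come from the paper's experiments.

```python
def average_precision(recalls, precisions):
    """Interpolated AP from matching lists of recall (ascending) and precision."""
    r = [0.0] + list(recalls)
    p = [0.0] + list(precisions)
    # Interpolation: replace each precision by the maximum precision
    # observed at that recall level or any higher one.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[n + 1] - r[n]) * p[n + 1] for n in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical precision-recall points for a single class.
ap = average_precision([0.2, 0.4, 0.6], [0.5, 1.0, 0.6])
print(ap)
```

In the example, the low precision at recall 0.2 is replaced by the higher precision reached at recall 0.4, which is exactly the effect of the max operation in the interpolation.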
To evaluate the performance of different models on the unlabeled dataset, we use two further metrics, recall and misdetection rate. Specifically, recall is the percentage of all positive samples that are correctly predicted as positive. The higher the recall, the better the effectiveness, as shown in the following equation:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP represents true positive, which means the number of positive samples that are correctly detected, and FN represents false negative, which means the number of positive samples that are missed. Misdetection rate is the percentage of all negative samples that are incorrectly predicted as positive. The lower the misdetection rate, the better the effectiveness, as shown in the following equation:

$$\mathrm{Misdetection} = \frac{FP}{FP + TN}$$

where FP represents false positive, which means the number of negative samples that are incorrectly detected as positive, and TN represents true negative, which means the number of negative samples that are correctly identified as negative.
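Both metrics follow directly from confusion-matrix counts, using the standard definitions recall = TP / (TP + FN) and misdetection rate = FP / (FP + TN). The sketch below uses hypothetical counts chosen only for illustration; they are not the paper's results.

```python
def recall(tp, fn):
    """Share of images that contain fire and were flagged by the model."""
    return tp / (tp + fn)


def misdetection_rate(fp, tn):
    """Share of fire-free images that were wrongly flagged by the model."""
    return fp / (fp + tn)


# Hypothetical counts: 100 images with fire, 100 images without fire.
tp, fn = 90, 10
fp, tn = 5, 95
print(recall(tp, fn), misdetection_rate(fp, tn))
```

Because the two metrics are computed on disjoint sets of images (with and without fire), a model can improve one without affecting the other.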

Evaluation of Models.
The Params and GFlops of Fire-PPYOLOE are much lower than those of PP-YOLOE, but the model inference time increases. This is because ConvNeXt uses depthwise convolution with an increased kernel size to replace the small kernel convolution in the original network. Depthwise convolution can reduce parameters effectively, but its memory access cost is higher than that of ordinary convolution for the same number of network parameters. For example, replacing ordinary convolutions with depthwise separable convolutions can reduce the size of the network parameters to 10%, but the running speed of the network may only increase 4-5 times. If the depthwise convolution is enlarged to the same size as the ordinary convolution, its running speed will be much slower than that of the ordinary convolution. This is why ConvNeXt runs much slower than ResNet with a similar parameter size. The purpose of our work is to preserve the high accuracy of ConvNeXt while improving its running speed as much as possible. Therefore, we add the CSPNet network to the model, which greatly reduces the network parameters and restores the speed to the level of PP-YOLOE.
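The parameter saving of depthwise separable convolution can be checked with a short calculation. The sketch counts weights only (no bias terms) and compares an ordinary k × k convolution, with k·k·C_in·C_out weights, against a depthwise k × k convolution followed by a 1 × 1 pointwise convolution, with k·k·C_in + C_in·C_out weights; the channel width of 96 is borrowed from the first stage width quoted earlier, and the function names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out


standard_3x3 = conv_params(96, 96, 3)
separable_3x3 = depthwise_separable_params(96, 96, 3)
separable_7x7 = depthwise_separable_params(96, 96, 7)
print(standard_3x3, separable_3x3, separable_7x7)
print(separable_3x3 / standard_3x3)
```

Even with a 7 × 7 kernel, the depthwise separable layer has far fewer weights than an ordinary 3 × 3 layer of the same width, which is why large kernels become affordable in parameter count while memory access, rather than arithmetic, limits the speed.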
3.5.1. Results on Labeled Dataset. Table 1 shows the results of different models on labeled data. We compare three models in terms of parameters, GFlops, mAP, infer time, and FPS.
From Table 1, we draw the following observations: (1) In terms of model parameters, Fire-PPYOLOE has only 37 million parameters. This is only 22% of Faster R-CNN and also much smaller than PP-YOLOE. Fire-PPYOLOE therefore has a relatively low number of parameters and is more suitable for small devices such as drones. This shows the effectiveness of CSPNet, which can greatly decrease the number of parameters. (2) GFlops show the same trend. The value of Fire-PPYOLOE is 28.5, which is about 50% of PP-YOLOE and only 14% of Faster R-CNN. This shows that Fire-PPYOLOE requires far less computation. During the experiment, we found that introducing the ConvNeXt structure can greatly reduce GFlops. (3) For mAP, we set the IoU threshold to 0.5 and compute the values of the different models. We can see that Fire-PPYOLOE is clearly superior to the other models. This shows the effectiveness of large kernel convolution, which can perceive a larger range of features and thereby improve the detection recall. (4) As for infer time and FPS, PP-YOLOE performs the best. Our model is slightly slower than PP-YOLOE and much faster than Faster R-CNN. We made a tradeoff between the recall and the infer time. Specifically, we combine ConvNeXt with PANet, which further improves the detection precision at the cost of a slightly longer infer time.
To sum up, the overall performance of our Fire-PPYOLOE is very good in terms of detection recall and speed. This makes it suitable for practical application in forest fire detection.
Because there is no ground truth for the unlabeled data, we hired volunteers to judge the detection results. For recall, an unlabeled image that contains fire or smoke is counted as a successful detection if the model detects it, no matter where the generated box is. For misdetection, an image that contains no fire or smoke is counted as a misdetection if the model nevertheless reports a detection. Figure 7 shows an example of successful detection for a given unlabeled forest fire image. Figure 8 shows an example of a misdetected picture. In this case, a long yellow road is mistaken for a flame by Fire-PPYOLOE.
Table 2 shows the performances of different models on unlabeled data. We can see that the recall of Fire-PPYOLOE is 81.85%, which is higher than that of PP-YOLOE owing to the optimization with large kernel convolution. However, it is relatively low compared with Faster R-CNN. This is because the two-stage detection model has a strong advantage in terms of detection accuracy but does not perform well in terms of detection speed, as shown in Table 1.
We can also see that the misdetection rate of Fire-PPYOLOE is only 9.93%, much lower than that of the other models. By leveraging large kernel convolution, the proposed model can capture a relatively large receptive field, so as to extract more features around the area to be detected. This makes it particularly robust in the detection of some confusing forest fire images. We make a further qualitative analysis in the following subsection.
To sum up, the proposed Fire-PPYOLOE is more reliable in terms of detection accuracy and speed from the perspective of practical applications. Faster R-CNN has a big disadvantage compared with the other two in terms of both the number of parameters and the detection speed. The original PP-YOLOE is a fast object detector. We improve it using a newly designed backbone and neck based on large kernel convolution.
3.6. Qualitative Analysis Results. This subsection gives a qualitative analysis to further evaluate the performances of different models. In practice, there are often smoke- and flame-like objects in forest images, such as the sunset, sunrise, and morning mist. These strongly affect the performance of a forest fire detector. Most research focuses on model optimization such as the improvement of detection recall [34], but few studies focus on smoke- and flame-like scenes [13]. Through the experiment, we observe that our Fire-PPYOLOE is able to tell smoke- and fire-like objects apart from real smoke and fire by leveraging large kernel convolution.
YOLO and R-CNN both use a backbone to extract features and a neck to fuse multisize feature maps. R-CNN performs classification and regression in two stages, which gives high accuracy but slow speed. YOLO performs classification and regression in a single stage, which gives fast speed but lower accuracy for small targets. R-CNN is suitable for high-precision detection such as faces and medical images. YOLO is suitable for rapid detection such as autonomous driving and surveillance. Forest fire detection belongs to monitoring systems; therefore, YOLO is preferred. However, it is necessary to improve the detection accuracy for flames at an early stage. Fire-PPYOLOE improves its ability to distinguish between flames and backgrounds by introducing large kernel convolution to enhance the feature maps.
We show three examples of different scenarios, namely, small fire targets, fog in the forest, and fire-like tree trunks. Figure 9 shows the results of PP-YOLOE on the three scenarios. We can see that PP-YOLOE fails to detect small flames. This suggests that feature extraction for fire is not sufficient when small kernel convolution is used.
The results of the Faster R-CNN model are shown in Figure 10. It can be seen that Faster R-CNN successfully detects the small fire but incorrectly detects the fog as smoke and the red tree trunk as fire. This is because Faster R-CNN is a two-stage detection model, which focuses excessively on one feature but neglects the extraction of surrounding features, so the role of surrounding features as an aid to the central region is weakened.
As shown in Figure 11, the proposed Fire-PPYOLOE performs well on the three scenarios. It does not misdetect fog as smoke, nor does it mistake the red tree trunk for fire. It can also detect small flames in the forest. However, it does not detect all the flames in the first image. This suggests that Fire-PPYOLOE is not perfect. There is still room for improvement in terms of recall, as shown in Table 2. We expect to address the low recall rate in subsequent studies.
By comparing the detection performance of the models on abundant fire- and smoke-like images, it can be found that Fire-PPYOLOE has better detection efficiency on fire- and smoke-like targets. We use large kernel convolution to obtain a larger receptive field, so as to extract features over a larger range. Compared with PP-YOLOE and Faster R-CNN, a larger range of features can better assist judgment, thereby reducing the misdetection rate. Such advantages of the proposed Fire-PPYOLOE improve the accuracy, reduce the misdetection rate, and make it suitable for practical application as a forest fire detector.

Conclusions
In this paper, we propose a new model for the practical application of forest fire detection and discuss its performance compared with state-of-the-art technologies in different scenarios. Based on PP-YOLOE, our model is improved using large kernel convolution to capture surrounding features. It can improve the network precision and reduce the network parameters without too much influence on the inference speed. Given a forest image, the proposed model can detect multiple fire and smoke targets at once, accurately and quickly. The method is not limited to early fire detection; it can also be applied to other target detection or image segmentation models by replacing their basic building blocks with CSPConvStage. The paper also points to some interesting research directions. For example, it is interesting and necessary to carry out in-depth research on the recognition of small targets.

FIGURE 5: Examples of fire images in labeled dataset.

3.5.2. Results on Unlabeled Dataset. For the evaluation on unlabeled dataset, we use recall and misdetection rate.

TABLE 1: The comparison results of different models on labeled dataset in terms of Params, GFlops, infer time, FPS, and mAP.
FIGURE 7: Picture of successful detection.

TABLE 2: The comparison of different models on unlabeled data in terms of recall and misdetection rate.