FFA-GAN: A Generative Adversarial Network Based on Feature Fusion Attention for Intelligent Safety Monitoring



Introduction
The purpose of power grid construction management is to prevent and reduce power accidents, as well as to prevent serious impacts on society. This serves as the guarantee for power enterprises to fulfill social responsibilities and improve economic benefits. Throughout the process, monitoring and early warning to ensure the safety of construction workers have always been crucial but challenging [1]. Currently, there are two main approaches to power grid construction management: traditional methods and deep learning-based methods. The former relies on manual inspection and screening by security guards, while the latter employs deep learning algorithms to achieve automatic detection [2,3]. Specifically, the traditional approach to managing construction safety is to establish safety policies, safety objectives, and a safety culture based on safety theory, in order to enhance the safety awareness of construction workers. This approach mainly relies on manual monitoring, whereby workers are reminded to pay attention to safety and comply with safety regulations through broadcast alarms and remote calls to the person in charge when violations are detected during patrols. This method aims to create a safety-conscious working environment and reduce the risk of accidents. Power grid safety supervision and management encompasses various aspects, such as historical event management, current situation management, and task assignment. Most of these methods are post-management measures; although they can effectively reduce the occurrence of accidents, what matters more is enhancing real-time supervision and even early-warning capabilities. Therefore, in recent years, developing a power grid security supervision and management platform and promoting the informatization level of power grid security management have become a research hotspot.
Thanks to the breakthroughs of deep learning in computer vision tasks, it is now possible to utilize deep learning-based object detection technologies to enable intelligent monitoring of power grid construction sites [4-6], safety vest detection [7], unsafe behavior detection [8], and unauthorized intrusion detection [9]. These detections can be further analyzed to trigger security alarms. The abovementioned deep learning-inspired detection algorithms are required to recognize, interpret, and comprehend images in surveillance video sequences. It is necessary to recognize complex scene systems based on a semantic model representing the monitored scene. This enables complex events to be identified from the surveillance video obtained from the work environment, which is critical for creating a safe work environment and tracking employees. To overcome these challenges, image fusion techniques have been proposed, which can combine complementary information from multiple images of the same scene. Their purpose is to enhance the information of a single generated image [10]. Image fusion can provide detailed and reliable images for high-level visual tasks. It plays a crucial role in computer vision and has been applied in many areas, including object detection [11,12], pedestrian recognition [13], face recognition [14,15], and semantic segmentation [16].
In power grid construction, safety management is of the utmost importance, but the existing methods have several limitations. Traditional management strategies rely on manual inspection and screening by security personnel, which may not be able to prevent accidents immediately. In addition, these methods are labor-intensive and inefficient. This is partly due to the unique characteristics of power grids, including their extensive construction areas, complex site backgrounds, and large workforces. As a result, implementing traditional methods on power grid construction sites can be challenging [17].
With the wide application of surveillance video technology, power grid enterprises have begun to use video recording to check for security risks. Although this strategy can alleviate some of the problems associated with traditional methods, long-term manual monitoring is prone to fatigue and can result in missed detections. Studies have shown that when an individual observes two monitoring screens simultaneously, they can miss 45% of useful information within 10 minutes and 95% within 22 minutes [18]. Therefore, manual naked-eye monitoring has limitations in accuracy and real-time performance.
Intelligent automatic image detection based on deep learning can greatly enhance detection accuracy and efficiency. However, the automatic detection algorithms used in power grid construction are all based on visible light. Visible light-based on-site operation video safety monitoring is often affected by environmental illumination, posture, expression, and ornaments, which leads to difficulties in accurately identifying and tracking specific targets, especially in complex working environments and harsh climate conditions. Yunnan Province is located in the southwest of China. Due to the complexity of its terrain and environment, the Yunnan power grid is among the most difficult to monitor and manage in China and even in the world [19]. For instance, in foggy mountain areas the quality of the acquired visible images is unsatisfactory, which affects subsequent high-level visual tasks. Applying an infrared-visible image fusion algorithm before detection can effectively improve the detection accuracy, but the application of infrared-visible image fusion in power grids has rarely been studied [20]. This paper presents a novel and effective fusion method based on infrared and visible image fusion for power grid construction management. We summarize our major contributions as follows:
(i) This paper presents the first application of the infrared-visible image fusion method to power grid construction management and explores a multifeature infrared-visible multisource image enhancement technology. The proposed approach aims to improve the video monitoring effect for remote monitoring personnel.
(ii) We design a shared convolution group consisting of channel attention and pixel attention in a two-branch generator network, which is conducive to capturing the modality-common features of infrared and visible images and generating stable and reliable fused images. To address the foggy environments caused by the mountainous terrain in Yunnan, we design a two-stage discriminator for the proposed GAN-based method. This design improves the perceptual and interpretive quality of visible images captured in foggy conditions and enhances the effectiveness of subsequent high-level visual tasks.
(iii) Most power management algorithms currently in use are based on post-management, which means they can only detect and respond to security risks after they have occurred. In reality, early warning and real-time supervision are essential for truly preventing accidents. The model proposed in this paper will be integrated into power grid artificial intelligence platforms such as intelligent video surveillance systems, allowing more proactive and effective safety management in power grid construction.
The remainder of this paper is organized as follows: Section 2 briefly describes the related works on existing deep learning-based power grid operation safety management technology and multimodal image fusion algorithms. In Section 3, we introduce our method in detail, including the network architecture and function modules. Section 4 presents experiments to show the impressive performance of our method in comparison with state-of-the-art methods, followed by some concluding remarks in Section 5.

Related Works

Deep Learning-Based Management Technology.
On the basis of artificial intelligence technology, the power grid is developing towards intelligentization, automation, digitization, and informatization. Existing research on power grid management has mainly focused on detecting unsafe factors in construction using deep learning algorithms. Faster R-CNN and deep ResNet were used to quickly and accurately detect workers against complex backgrounds in [21]. An IFaster R-CNN approach was used to automatically detect workers and heavy equipment in real time in [22]. The YOLOv3 algorithm was used to detect whether helmets are worn in accordance with the standard in [18]. An improved lightweight YOLOv4 model was used to detect transmission line insulator defects in [23]. More recently, Tang et al. [7] provided a method based on YOLOv5 to resolve the problems of low detection efficiency and poor accuracy caused by complex backgrounds and numerous personnel. Tan et al. [24] improved YOLOv5 by increasing the detection scales to detect smaller targets. To address the lack of datasets in power grid scenarios, Peng et al. [3] proposed a contrastive Res-YOLOv5 network for intelligent safety monitoring on power grid construction sites.

Multimodal Image Fusion Algorithm.
Image fusion (IF) is an emerging field for generating a robust and informative image through the integration of images obtained by different sensors, e.g., visible, infrared, computed tomography, and magnetic resonance imaging [25,26]. Among them, infrared-visible image fusion has attracted much attention [14,27]; it is more informative than a single-modality signal because the visible image captures reflected light, while the infrared image captures thermal radiation [28]. The applications of these fusion algorithms can be mainly divided into the following categories:

Face Recognition.
Li et al. [29] proposed a GA-based infrared-visible image fusion method to solve the problem of low face recognition sensitivity caused by glasses occlusion. Heo et al. [10] provided two types of visual and thermal infrared image fusion methods to enhance the robustness of face recognition.

Object Detection.
Han and Bhanu [30] proposed a search scheme based on a hierarchical genetic algorithm to achieve automatic registration of color image and thermal image sequences, and then used multiple fusion strategies to fuse the registered visible and infrared images for human contour detection. Ulusoy and Yuruk [31] conducted background modeling and foreground detection in the infrared, visual intensity, and visual color domains, respectively; the complementary regions from these domains were combined, the infrared foreground was covered by this fused information, and the covered visual foreground was fused with the infrared foreground. Finally, active contours were applied to each connected component in the infrared domain to detect object boundaries. Gao et al. [32] proposed a flexible framework for moving target detection through visible and infrared video fusion based on low-rank sparse decomposition. Ma et al. [33] proposed an end-to-end STDFusionNet to realize salient target detection. Zuo et al. [34] designed an attention fusion feature pyramid network for infrared small target detection. The network focuses on the important spatial positions and channel information of small targets by acquiring and utilizing the global context information of images, and it enhances the feature representation of small targets, thus improving detection performance.

Pedestrian Recognition.
Shopovska et al. [35] presented a learning-based fusion method to enhance pedestrian visibility in variable conditions (day and night).

Semantic Segmentation.
A cascade of ResNet and improved CRFs is used to construct the semantic segmentation module for aluminum electrolyte images in [36]. Hou et al. [37] proposed a semantic segmentation strategy using an infrared and visible image fusion method based on GANs. Xu et al. [38] presented AFNet, a deep learning-based network that effectively improves the accuracy of multispectral image semantic segmentation. Recently, Zhao et al. [39] proposed a correlation-driven feature decomposition fusion network, which utilizes various modules to extract the high-frequency and low-frequency features of an image. In summary, in terms of current fusion performance, image fusion methods based on deep learning generally outperform traditional methods. In practical applications, different model architectures should be designed in combination with specific image fusion task drivers to improve the advanced visual application of fused images in real scenes.

Proposed Method
In this section, we propose a fusion algorithm to enrich image information and make detection more accurate for intelligent safety monitoring on power grid construction sites. First, we give an overview of the proposed fusion model, a feature fusion attention network based on a generative adversarial network (FFA-GAN). Then, we introduce the shared convolution group (SCG) module, the channel attention (CA) module, and the pixel attention (PA) module in order, which are designed in sequence for the generator to deal with multimodal features flexibly. Finally, we describe the discriminator dehaze (DDE) and discriminator fusion (DFU) designed as the two-stage discriminator, which jointly guarantee that the FFA-GAN achieves good performance on the infrared and visible image fusion task.

Overview.
In the field of image fusion, generative adversarial network (GAN)-based methods are usually used as representative baselines, especially for infrared and visible image fusion tasks. The characteristic of GAN-based fusion methods is that the fused image contains abundant information while retaining structural similarity between the fused image and the source images. Thus, the GAN has great potential to achieve success in the area of image fusion. Inspired by the framework of existing GAN-based image fusion algorithms, we have designed a feature fusion attention network based on the generative adversarial network, as shown in Figure 1, abbreviated as FFA-GAN hereinafter.
As illustrated in Figure 1, the FFA-GAN consists of a generator network and two discriminator networks. In this work, the generator is designed to constantly explore the feature fusion mapping function between infrared and visible images to obtain fused images. Moreover, the two-stage discriminator is designed in the FFA-GAN to provide two constraints for the generator to obtain clean fused images with both infrared and visible information. Specifically, the DDE is used to recognize whether images are clean, while the DFU is used to identify the proportion of infrared and visible information in fused images. During the testing stage, since the generator has learned to fuse images, the discriminators are not required to provide constraints and only the generator is needed to obtain the fused image. Below, we describe the architecture of the generator and discriminators in detail.
The fused images generated by the designed generator network are used to fool the discriminators. Each discriminator network consists of three convolutional layers and three fully connected layers to classify the input image. In this paper, two discriminators are designed in the FFA-GAN to identify haze images and normal images, respectively. The reason is to avoid the information loss caused by a single adversarial architecture when dealing with foggy images and normal images. At the same time, this forces the generated image to retain more meaningful information from the source images.
3.1.1. Generator. First, the infrared and visible features are extracted through two convolution layers. Second, these features are fed into three SCG modules to capture modality-common features, and the outputs of the groups are integrated by channel connection (CC). In order to select and reweight significant infrared and visible features, we adopt two CA modules in the dual branches of the generator. After that, a PA module is used to achieve fine-grained modality fusion, which aims to blend the dual-branch features. Finally, a 3 × 3 convolution layer and a 1 × 1 convolution layer are used to map the fused features to the two-dimensional plane, and the fusion results are obtained (a consolidated code sketch of these components is given at the end of Section 3.2). Note that merging the CA modules and the PA module constitutes a special structure, which is helpful for handling source images with complex distributions, such as an uneven haze distribution.

Discriminator.
The DDE designed in the FFA-GAN is used to identify whether the input image is a hazy image or a clean image, in order to avoid the information loss caused by a single game architecture when processing foggy and clean images. At the same time, the DFU forces the generated results to retain more meaningful information from the source images by balancing the proportion of information between the infrared and visible images. The structures of the DDE and DFU are similar; both consist of three convolution layers and three fully connected layers. The DDE aims to obtain a one-dimensional class vector, while the DFU tries to obtain a two-dimensional proportion vector.

Designs in the Generator
3.2.1. SCG Module. The detailed structure of the SCG module is shown in Figure 2. There are N contiguous convolutional blocks (CBs), represented by grey squares, which help to increase the depth and expressiveness of the FFA-GAN. The detailed structure of the CB is shown at the bottom of Figure 2. The CB consists of skip residual connections and cascaded CA and PA modules. The skip residual connections are designed to reduce information loss and circumvent training difficulty, and the cascaded CA and PA modules are used to select the more significant features. The key for the generator to obtain effective fused images is to select and reweight multimodal features via the CA and PA modules introduced next.

CA Module.
Most existing deep learning-based fusion strategies for combining features simply integrate them equally through channel connections, without considering the varying importance of different feature channels. However, as the fusion network becomes deeper, it is likely that only a small subset of features will respond meaningfully. To address this issue, we propose using CA modules to assign appropriate weights to different features, based on their similarity relationships across channels. The structure of the CA module is illustrated in Figure 3. To obtain the channel-level global information of the input feature map, denoted as X, we first apply global average pooling, which calculates the average value of each channel feature. Specifically, it can be expressed as follows:
$$g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j),$$
where $X_c$ denotes the $c$-th channel feature, and $i$ and $j$ represent the coordinate information of the feature value. Then, the pooled features pass through two convolution layers with a ReLU activation in between, and a sigmoid activation layer is used to obtain the channel weight, which is then applied to reweight the input source features X. This enables the FFA-GAN to focus on the most meaningful and relevant features, which helps to improve its fusion performance. We can express this whole process as follows:
$$X^{\ast} = X \otimes \mathrm{Sigmoid}\left(\mathrm{Conv}\left(\mathrm{ReLU}\left(\mathrm{Conv}\left(g\right)\right)\right)\right),$$
where $\mathrm{Conv}(\cdot)$ denotes a 3 × 3 convolution operation, $\mathrm{ReLU}(\cdot)$ and $\mathrm{Sigmoid}(\cdot)$ denote the ReLU and sigmoid activation functions, and $\otimes$ denotes element-wise multiplication.
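As a concrete illustration, the following is a minimal PyTorch sketch of a CA module consistent with the description above; the channel reduction ratio is our assumption, and the 3 × 3 kernel follows the definition of Conv(·) in the text (on the 1 × 1 pooled map it acts like a fully connected layer).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention (CA): reweight feature channels by global importance."""
    def __init__(self, channels: int, reduction: int = 8):  # reduction ratio is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling -> (B, C, 1, 1)
        self.weight = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 3, padding=1),
            nn.Sigmoid(),                            # channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.weight(self.pool(x))         # reweight the input features


if __name__ == "__main__":
    x = torch.randn(2, 64, 128, 128)                 # (batch, channels, height, width)
    print(ChannelAttention(64)(x).shape)             # torch.Size([2, 64, 128, 128])
```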

PA Module.
Due to the uneven distribution of haze over different pixels in hazy images, it is necessary to use pixel attention to focus on the features of each individual pixel. To refine pixel-level feature fusion and reduce the interference caused by haze, we employ the PA module illustrated in Figure 4. Unlike the CA module, the PA module includes a 3 × 3 convolution layer, a ReLU activation layer, and another 3 × 3 convolution layer that work together to refine and enhance the features of each pixel rather than each channel. Then, a sigmoid activation layer is used to weight each pixel based on its importance. These weights are applied to the input features to obtain the final output. We can express the complete PA module as follows:
$$X^{\ast\ast} = X^{\ast} \otimes \mathrm{Sigmoid}\left(\mathrm{Conv}\left(\mathrm{ReLU}\left(\mathrm{Conv}\left(X^{\ast}\right)\right)\right)\right),$$
where $X^{\ast}$ denotes the input feature of the PA module and $\otimes$ denotes element-wise multiplication.
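To tie the generator components together, here is a minimal PyTorch sketch of the PA module, the CB, the SCG, and the overall dual-branch generator, reusing the ChannelAttention class from the sketch above. The class names, the internal reduction in PixelAttention, the leading convolution in the CB, the block count N = 4, the channel widths, and the Tanh output are our assumptions; the sketch only mirrors the structure described in Sections 3.1.1 and 3.2.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Pixel attention (PA): a (B, 1, H, W) weight map applied to every pixel."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Sequential(
            nn.Conv2d(channels, channels // 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 8, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.weight(x)

class CB(nn.Module):
    """Convolutional block: skip connection around cascaded CA and PA modules."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.ca, self.pa = ChannelAttention(channels), PixelAttention(channels)

    def forward(self, x):
        return x + self.pa(self.ca(self.conv(x)))    # residual connection reduces information loss

class SCG(nn.Module):
    """Shared convolution group: N contiguous CBs (N = 4 is an assumption)."""
    def __init__(self, channels: int, n_blocks: int = 4):
        super().__init__()
        self.blocks = nn.Sequential(*[CB(channels) for _ in range(n_blocks)])

    def forward(self, x):
        return self.blocks(x)

class FFAGenerator(nn.Module):
    """Dual-branch generator; the SCG weights are shared between the two modalities."""
    def __init__(self, channels: int = 64, n_scg: int = 3):
        super().__init__()
        def head():  # two convolution layers for shallow feature extraction
            return nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.head_ir, self.head_vi = head(), head()
        self.scgs = nn.ModuleList([SCG(channels) for _ in range(n_scg)])   # shared weights
        self.cc_ir = nn.Conv2d(channels * n_scg, channels, 1)  # integrate channel-connected outputs
        self.cc_vi = nn.Conv2d(channels * n_scg, channels, 1)
        self.ca_ir, self.ca_vi = ChannelAttention(channels), ChannelAttention(channels)
        self.pa = PixelAttention(2 * channels)                 # fine-grained dual-branch blending
        self.tail = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                  nn.ReLU(inplace=True), nn.Conv2d(channels, 1, 1), nn.Tanh())

    def forward(self, ir, vi):
        f_ir, f_vi = self.head_ir(ir), self.head_vi(vi)
        outs_ir, outs_vi = [], []
        for scg in self.scgs:                                  # modality-common features
            f_ir, f_vi = scg(f_ir), scg(f_vi)
            outs_ir.append(f_ir)
            outs_vi.append(f_vi)
        f_ir = self.ca_ir(self.cc_ir(torch.cat(outs_ir, 1)))   # channel connection (CC) + CA
        f_vi = self.ca_vi(self.cc_vi(torch.cat(outs_vi, 1)))
        return self.tail(self.pa(torch.cat([f_ir, f_vi], 1)))  # PA blending + 3x3 and 1x1 mapping
```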

Two-Stage Discriminator.
The proposed FFA-GAN incorporates a two-stage discriminator composed of a discriminator dehaze (DDE) network and a discriminator fusion (DFU) network, as illustrated in Figure 1.

DDE Network.
The DDE network serves as a simple image classifier distinguishing whether input images are hazy or clean. The DDE network can take the fusion result of the proposed generator or a source image as input. As shown in Figure 1, the input of the DDE network is subjected to three 3 × 3 convolution operations and three ReLU activations before being processed through three fully connected layers. Finally, a sigmoid activation layer is used to obtain the probability that the image is a clean image, which produces a one-dimensional class vector.

DFU Network.
The structure of the DFU network is similar to that of the DDE network: there are also three 3 × 3 convolution layers, three ReLU activations, and three fully connected layers. However, unlike the DDE network, the DFU network uses a softmax activation layer to obtain the proportions of infrared and visible image features, producing a two-dimensional class vector.
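The following PyTorch sketch shows one plausible realization of the two discriminators. The channel widths, strides, fully connected layer sizes, and the 64 × 64 input patch implied by the flattened feature size are assumptions; the text only fixes the layer counts and output activations.

```python
import torch
import torch.nn as nn

class _DiscriminatorBase(nn.Module):
    """Shared backbone: three 3x3 conv + ReLU stages, then three fully connected layers."""
    def __init__(self, out_dim: int, in_ch: int = 1, patch: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        flat = 128 * (patch // 8) ** 2                     # three stride-2 stages halve H and W
        self.classifier = nn.Sequential(
            nn.Linear(flat, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True),
            nn.Linear(64, out_dim))

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

class DDE(_DiscriminatorBase):
    """Outputs the probability that the input image is clean (one-dimensional)."""
    def __init__(self):
        super().__init__(out_dim=1)

    def forward(self, x):
        return torch.sigmoid(super().forward(x))

class DFU(_DiscriminatorBase):
    """Outputs the proportions of infrared and visible information (two-dimensional)."""
    def __init__(self):
        super().__init__(out_dim=2)

    def forward(self, x):
        return torch.softmax(super().forward(x), dim=1)
```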

Loss Function.
In order to obtain the desired fusion results for our proposed FFA-GAN, we describe the loss function in detail in two parts: the generator loss and the discriminator loss.

Generator Loss.
The generator loss is defined as the distance between the fused results and the desired results. It is measured using the image pixel loss, the image gradient loss, the perceptual loss, and the adversarial loss. The formula is as follows:
$$\mathcal{L}_{G} = \mathcal{L}_{pixel} + \lambda_{1} \mathcal{L}_{grad} + \lambda_{2} \mathcal{L}_{percep} + \lambda_{3} \mathcal{L}_{adv},$$
where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are balancing weights. The image pixel loss is designed to maintain suitable pixels from the infrared and visible images through the Euclidean distance. It can be represented as follows:
$$\mathcal{L}_{pixel} = \frac{1}{HW} \left( \left\| I_{fused} - I_{vi} \right\|_{2}^{2} + \left\| I_{fused} - I_{ir} \right\|_{2}^{2} \right),$$
where $I_{fused}$, $I_{vi}$, and $I_{ir}$ are the fused, visible, and infrared images; $H$ and $W$ are the height and width of the images; and $\|\cdot\|_{2}$ indicates the adoption of the L2 norm.
The image gradient loss is proposed to preserve texture information by penalizing gradient differences. It can be expressed as follows:
$$\mathcal{L}_{grad} = \frac{1}{HW} \left\| \nabla I_{fused} - \nabla I_{vi} \right\|_{2}^{2},$$
where $\nabla$ denotes the image gradient computed with the Laplace operator. The image perceptual loss is proposed to calculate the distance between image features extracted by VGG. It can be expressed as follows:
$$\mathcal{L}_{percep} = \left\| \mathrm{VGG}\left(I_{fused}\right) - \mathrm{VGG}\left(\bar{I}_{vi}\right) \right\|_{2}^{2},$$
where $\mathrm{VGG}(\cdot)$ denotes the function that extracts image feature maps through VGG and $\bar{I}_{vi}$ are clean visible images. The adversarial loss is the key for GAN-based fusion methods, and it can be denoted as follows:
$$\mathcal{L}_{adv} = \left| D_{DE}\left(I_{fused}\right) - 1 \right|^{2} + \left| D_{FU}\left(I_{fused}\right) - \left[ \tfrac{1}{2}, \tfrac{1}{2} \right]^{T} \right|^{2},$$
where $|\cdot|$ denotes the function to calculate the vector length. The clean label for the DDE and the balanced proportion target for the DFU drive the generator toward clean fused images that carry infrared and visible information in equal measure.
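A hedged PyTorch sketch of these loss terms follows; the VGG feature layer (relu3_3), the Laplacian kernel, and the loss weights are our assumptions, and the adversarial targets mirror the reconstruction above.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

_LAPLACE = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
_VGG_FEAT = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # relu3_3 features (assumed)
for p in _VGG_FEAT.parameters():
    p.requires_grad_(False)                          # freeze VGG but keep gradients to the input

def _grad(img):
    """Image gradient via the Laplace operator."""
    return F.conv2d(img, _LAPLACE.to(img.device), padding=1)

def _vgg(img):
    """VGG feature maps; single-channel inputs are repeated to three channels."""
    return _VGG_FEAT(img.repeat(1, 3, 1, 1))

def generator_loss(fused, vi, ir, vi_clean, d_de, d_fu, w=(1.0, 10.0, 1.0)):
    n, hw = fused.shape[0], fused.shape[-2] * fused.shape[-1]
    l_pixel = (F.mse_loss(fused, vi, reduction="sum") +
               F.mse_loss(fused, ir, reduction="sum")) / (hw * n)
    l_grad = F.mse_loss(_grad(fused), _grad(vi), reduction="sum") / (hw * n)
    l_percep = F.mse_loss(_vgg(fused), _vgg(vi_clean))
    target = torch.full((n, 2), 0.5, device=fused.device)   # balanced proportion target
    l_adv = ((d_de(fused) - 1.0) ** 2).mean() + ((d_fu(fused) - target) ** 2).sum(1).mean()
    return l_pixel + w[0] * l_grad + w[1] * l_percep + w[2] * l_adv
```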

Discriminator Loss.
The discriminator loss consists of the DDE loss and the DFU loss, which can be represented as follows:
$$\mathcal{L}_{D} = \mathcal{L}_{DDE} + \mathcal{L}_{DFU}.$$
The DDE loss is designed to guide the DDE network in identifying whether input images are clean or hazy; hence, a label $y$ of 0 or 1 is used to represent a hazy image or a clean image, respectively. It can be expressed as follows:
$$\mathcal{L}_{DDE} = \left| D_{DE}\left(I\right) - y \right|^{2}, \quad y = \begin{cases} 1, & I \ \text{is clean}, \\ 0, & I \ \text{is hazy}. \end{cases}$$
The DFU loss is introduced to measure the proportion of infrared and visible information. It can be expressed as follows:
$$\mathcal{L}_{DFU} = \left| D_{FU}\left(I_{ir}\right) - \left[1, 0\right]^{T} \right|^{2} + \left| D_{FU}\left(I_{vi}\right) - \left[0, 1\right]^{T} \right|^{2},$$
where the label vectors encode the pure infrared and pure visible proportions of the corresponding source images.
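A matching sketch for the discriminator losses, under the same assumptions about labels and proportion targets:

```python
import torch

def dde_loss(d_de, clean_imgs, hazy_imgs):
    """DDE: clean images are labeled 1; hazy (or generated) images are labeled 0."""
    return (((d_de(clean_imgs) - 1.0) ** 2).mean() +
            ((d_de(hazy_imgs) - 0.0) ** 2).mean())

def dfu_loss(d_fu, ir_imgs, vi_imgs):
    """DFU: source images carry pure-infrared [1, 0] or pure-visible [0, 1] proportions."""
    t_ir = torch.tensor([1.0, 0.0], device=ir_imgs.device).expand(ir_imgs.shape[0], 2)
    t_vi = torch.tensor([0.0, 1.0], device=vi_imgs.device).expand(vi_imgs.shape[0], 2)
    return (((d_fu(ir_imgs) - t_ir) ** 2).sum(1).mean() +
            ((d_fu(vi_imgs) - t_vi) ** 2).sum(1).mean())
```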

Experiment
In this section, we first provide an overview of the data used in our training and testing process. Then, we briefly introduce the fusion metrics used in our experiments and compare our method with 11 state-of-the-art fusion methods. Next, we present extensive experiments to demonstrate the rationality and superiority of our method, and we analyze the results from both qualitative and quantitative perspectives. It is worth noting that only partial results are given due to page limits. The M3FD benchmark provides six tagged target classes, including people, cars, buses, motorcycles, lights, and trucks (these data can be found at https://github.com/JinyuanLiu-CV/TarDAL).

Experimental Settings.
Our work exploits the mapping function between source infrared and visible images through the FFA-GAN. During the training phase, we use the Adam algorithm to minimize the generator loss and the discriminator loss. The learning rate is set to 0.0001. As we employ data augmentation by cutting source images into patches, we select 30 strictly aligned infrared and visible image pairs from the M3FD dataset. While more image pairs could be used, these provide enough infrared and visible patch pairs to make the proposed algorithm effective. All experiments are conducted on a machine with a 3.60 GHz 11th-generation Intel Core i7-11700K CPU and a GeForce RTX 3090 GPU. The code is implemented in Python and MATLAB. We compare the proposed method with 11 state-of-the-art fusion methods, including DenseFuse [41], PMGI, RFN [42], SeAFusion [43], SwinFuse, SwinFusion [44], TarDAL [40], FusionGAN [20], GANMcC [45], U2Fusion [46], and Res2Fusion. These are implemented based on the available codes.
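For concreteness, a minimal training-step sketch under the stated settings (Adam, learning rate 0.0001) follows; the alternating update order is our assumption, and it reuses the generator, discriminator, and loss sketches from Section 3.

```python
import torch

# Assumed components from the earlier sketches
gen, d_de, d_fu = FFAGenerator(), DDE(), DFU()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(list(d_de.parameters()) + list(d_fu.parameters()), lr=1e-4)

def train_step(ir, vi, vi_clean):
    """One alternating update on a batch of aligned infrared/visible patches."""
    # 1) Update the two-stage discriminator on real sources and current fusions.
    fused = gen(ir, vi).detach()                       # no generator gradients here
    loss_d = dde_loss(d_de, vi_clean, fused) + dfu_loss(d_fu, ir, vi)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # 2) Update the generator against the refreshed discriminators.
    fused = gen(ir, vi)
    loss_g = generator_loss(fused, vi, ir, vi_clean, d_de, d_fu)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```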

Fusion Metrics.
We choose average gradient (AG), cross entropy (CE), edge intensity (EI), entropy (EN), mutual information (MI), peak signal-to-noise ratio (PSNR), QAB/F, Qcb, Qcv, root mean squared error (RMSE), spatial frequency (SF), structural similarity index measure (SSIM), and standard deviation (SD) as fusion metrics, following [47-49]. In all experiments, an up or down arrow indicates whether a higher or lower value of the metric corresponds to better fusion.
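As an illustration of two of these metrics, the sketch below computes entropy (EN) and average gradient (AG) for a grayscale image with NumPy, using common definitions from the fusion literature; the exact formulas used in [47-49] may differ in normalization.

```python
import numpy as np

def entropy(img: np.ndarray) -> float:
    """EN: Shannon entropy of the 8-bit gray-level histogram (higher is better)."""
    hist, _ = np.histogram(img, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_gradient(img: np.ndarray) -> float:
    """AG: mean magnitude of horizontal/vertical intensity differences (higher is better)."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]     # crop both gradients to a common (H-1, W-1) grid
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2)))

fused = (np.random.rand(256, 256) * 255).astype(np.uint8)  # stand-in for a fused image
print(f"EN = {entropy(fused):.3f}, AG = {average_gradient(fused):.3f}")
```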

Comparison with State-of-the-Art Methods.
In order to convincingly verify the performance of our fusion method, we compare the FFA-GAN with 11 other fusion methods on the public INO and M3FD benchmarks. The qualitative and quantitative analyses of the fusion results of the different methods are presented below.

Qualitative Comparisons.
In Figure 5, we first present the qualitative visualization of our FFA-GAN on the INO dataset. We provide 3 sets of representative infrared and visible images, and their experimental results are shown in Figure 5. In each set, the first two subfigures present the infrared-visible image pair, the third to thirteenth subfigures show the results of the advanced fusion models mentioned above, and the last subfigure presents the fused image of our method. The meaningful region is enlarged and marked with a red box in each fused result. We can see that the results of our method are clearer and contain more texture and contour information from the source images, which is advantageous for advanced visual tasks such as object detection in power grids.
Another qualitative analysis is carried out on 3 groups from the M3FD dataset in Figure 6. Similar to Figure 5, the first two subfigures, the third to thirteenth subfigures, and the last subfigure in each set present the infrared-visible image pair, the results of the comparison models, and the fused image of our method, respectively. It can be clearly observed in Figure 6 that our method preserves rich texture information, scene information, and unique contrast information. In contrast, in the results of the comparison methods, the target lacks clarity and the background is blurred, indicating that the target region in the infrared image and the typical features in the visible image, such as license plate information and human information, are not well preserved. It is worth emphasizing that our method is highly robust in the presence of strong light interference at night.

Quantitative Analysis.
Afterward, the 13 metrics mentioned above are employed on the INO and M3FD datasets (100 image pairs) to quantitatively compare the abovementioned results, which are displayed in Figures 7 and 8. Due to space constraints, we do not provide a detailed introduction to each evaluation indicator in this paper; interested readers can refer to [47,50,51] for more information. To quantify the comparison results, we introduce a ranking rule based on the average rank over the 13 evaluation indexes, as shown in Tables 1 and 2. From a comprehensive perspective, our method obtains satisfactory performance in AG, CE, EI, EN, SF, and SD on the INO dataset (as shown in Table 1). Besides, we obtain the best performance in AG, CE, EI, EN, SF, and SD on the M3FD dataset (as shown in Table 2). These are indicators based on the features of the image, indicating that our fused images are informative and more consistent with the human visual system. In addition, we have also found that the DenseFuse, U2Fusion, and Res2Fusion models achieve good results on the PSNR, RMSE, and SSIM indicators. The likely reason is that the authors of these algorithms focused more on specific pieces of information during their design. This phenomenon further suggests that an image fusion algorithm should be evaluated using a variety of indicators for comprehensive comparison, which demonstrates the benefits of our FFA-GAN. Admittedly, our method did not perform well on the abovementioned PSNR, SSIM, and RMSE indices. The primary reason for this outcome is the lack of a ground truth for fused infrared and visible images: due to noise and other factors, the fusion of infrared and visible light may result in inaccuracies in the values of these three indicators, as they are computed against reference images. Overall, our FFA-GAN stably retains rich useful information from the source images and can describe the scene information of the whole image, especially when visible images are contaminated (e.g., by strong light or haze).

Conclusions
In this work, we have proposed a new generative adversarial network called the FFA-GAN, which is based on feature fusion attention, and we have applied it to power grid security management. The key design in our approach is the shared convolution group (SCG) in the dual-branch generator network, designed to extract modality-common features from the source images. To handle multimodal information flexibly, each SCG contains both channel attention and pixel attention modules. We have also incorporated infrared and visible image features into our network, using a combined CA and PA structure to fuse these features. Our two-stage discriminator includes both the DDE and the DFU to ensure that the proposed FFA-GAN achieves good performance on infrared and visible image fusion tasks. Experimental results demonstrate that our fusion network performs well. It can be embedded in a grid AI platform to provide services for related applications and offers a strong guarantee for power grid safety.

However, there are some limitations to our approach due to the lack of aligned infrared and visible data with haze. In our experiments on the M3FD dataset, we used the dark channel prior to remove image haze. Although this approach can effectively remove haze, the image may still be affected by the distribution of the haze. Therefore, further improving the performance of our proposed method is subject to overcoming these limitations.
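For reference, a minimal sketch of dark channel prior dehazing as used in our preprocessing is given below; the patch size, omega, and the lower transmission bound follow common defaults from He et al., and the transmission map is used without guided filter refinement.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dehaze_dcp(img, patch=15, omega=0.95, t0=0.1):
    """Minimal dark channel prior dehazing (He et al.); img is float RGB in [0, 1]."""
    dark = minimum_filter(img.min(axis=2), size=patch)           # dark channel
    flat = dark.ravel()
    idx = flat.argsort()[-max(1, flat.size // 1000):]            # brightest 0.1% of dark channel
    A = img.reshape(-1, 3)[idx].max(axis=0)                      # atmospheric light (simplified)
    norm = img / np.maximum(A, 1e-6)
    t = 1.0 - omega * minimum_filter(norm.min(axis=2), size=patch)  # transmission map
    t = np.clip(t, t0, 1.0)[..., None]
    return np.clip((img - A) / t + A, 0.0, 1.0)                  # scene radiance recovery
```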

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.