MSIS: Multispectral Instance Segmentation Method for Power Equipment

Infrared images of power equipment are widely used in power equipment fault detection, and segmentation of infrared images is an important step in detecting thermal faults. Nevertheless, because of overlapping equipment, complex backgrounds, and the low contrast of infrared images, current methods still cannot detect and segment power equipment well. To better segment the power equipment in infrared images, this paper designs a multispectral instance segmentation (MSIS) network based on SOLOv2, which is an end-to-end, single-stage network. First, we provide a novel multispectral feature extraction structure that simultaneously obtains rich features from visible and infrared images. Second, a feature fusion module (MARFN) is constructed to obtain fused features. Finally, the combination of multispectral feature extraction, the feature fusion module (MARFN), and instance segmentation (SOLOv2) realizes multispectral instance segmentation of power equipment. The experimental results show that the proposed MSIS model performs well in the instance segmentation of power equipment: MSIS based on ResNet-50 achieves 40.06% AP.


Introduction
In the fault detection of power systems, infrared imaging technology is simple to operate, responds quickly, and supports accurate judgment; it has become an important tool for failure detection systems [1]. By processing the collected images, the fault status of the power equipment can be diagnosed and the fault area of the equipment can be determined. To better process infrared images, many scholars have studied image segmentation techniques, which can be divided into traditional methods, machine learning methods, and deep learning methods, as shown in Table 1.
Among the traditional segmentation methods, Zhou et al. extracted potential fault regions with a superpixel segmentation method and then used a residual network to screen the real fault positions [2]. Fan et al. segmented the image with the Otsu algorithm; to accurately segment the overheated area, an active contour model was used to refine the edge and the fuzzy C-means (FCM) clustering algorithm was used to suppress oversegmentation, so that the overheated area was accurately divided [3]. Among the machine learning methods, Xu et al. proposed a fault region extraction method based on a pulse-coupled neural network (PCNN). This method reduces the internal parameters of the PCNN and combines local features of the fault and nonfault regions to achieve adaptive iteration, which can effectively extract the faulty area [5]. Shanmugam and Chandira Sekaran used the FCM clustering algorithm to segment infrared images and used Modified Ant Lion Optimization (MALO) and the Region Pros function to optimize the segmentation area [4].
The instance segmentation of power equipment uses the color and texture information of the equipment to segment the equipment as a whole, which provides a base image for the subsequent diagnosis of equipment failures. Qi et al. proposed a new infrared image segmentation method based on multi-information fused fuzzy clustering. This method segments the complete power equipment by constructing a joint domain of a fuzzy clustering field (FCF) and a Markov random field (MRF) [7]. Guo et al. proposed a diagnosis system based on the comprehensive analysis of infrared images. This system uses the Sobel and Canny operators for preprocessing, the SIFT algorithm to extract prefeature points, and K-means clustering to identify power equipment [6].
With the development of deep learning, it has been applied to more and more tasks. Image classification [12,13], semantic segmentation, object detection, and instance segmentation [14,15] have become recent academic hotspots. Many scholars have also proposed infrared image segmentation based on deep learning. Wang et al. used Mask R-CNN to extract the insulator instances in the infrared image and obtained the temperature distribution of each insulator by function fitting. This method realizes the automatic diagnosis of infrared faults of power equipment [8]. Jiang et al. used the Mask R-CNN framework to build a target detection system that can accurately extract the bushing frame. The segmentation performance of the faulty area is improved by combining it with a pulse-coupled neural network based on linear iterative clustering [10]. Yan et al. established a multispectral instance segmentation network model based on Mask R-CNN and compared the fusion abilities of different fusion methods in detail [9]. Khalid et al. used a two-stage fusion-segmentation method for multispectral instance segmentation. The network first uses an encoder-decoder architecture to obtain the fused image and then uses Mask R-CNN for instance segmentation [11].
Although many models have been proposed for infrared image segmentation, current segmentation methods still need to be improved. On the one hand, most current segmentation methods use an infrared image dataset with distinct equipment and a clear background. When the equipment overlaps and the background is complex, these methods struggle. On the other hand, the methods based on machine learning use only visible or only infrared images for segmentation, although the two modalities carry complementary information. In the deep learning methods, the visible image and the infrared image are fused by a multispectral image fusion algorithm, but these algorithms contain many redundant structures, and when they are combined with an instance segmentation model, it is difficult to improve network performance. The multispectral instance segmentation based on Mask R-CNN in [9] reduces the redundant structure, but it is still slower than single-stage instance segmentation. This leads to practical deployment difficulties and higher costs.
To solve the above problems, this research collected and built a power equipment image dataset with the aim of segmenting complete power equipment, and designed a multispectral instance segmentation network that directly completes the classification, localization, and pixel-level segmentation of power equipment. The main contributions of this work are as follows:
(1) We propose a multispectral single-stage instance segmentation (MSIS) network based on SOLOv2. The method integrates image fusion and instance segmentation into a single network. The network can maintain the real-time performance of segmentation while reducing the structural redundancy caused by multitasking. It can segment infrared images with complex backgrounds and poor quality, facilitating subsequent power equipment inspections.
(2) To preserve more details of the original images, a dual-input feature extraction module is proposed, which can better extract the features of infrared images and visible images. It provides richer information for subsequent feature fusion and instance segmentation.
(3) A multifeature attention RFN (MARFN) is proposed based on the residual fusion network (RFN), which fuses infrared images and visible images to obtain richer fusion features. A novel fusion layer is used to solve the network degradation caused by increasing the depth of the RFN.

Instance Segmentation.
Instance segmentation is an instance-level object segmentation task. Instance segmentation methods are mainly divided into two-stage and single-stage methods, as shown in Table 2. The popular approach [14, 16-19] first finds the area where each instance is located by object detection and then performs semantic segmentation inside the detection box; each segmentation result is output as a different instance. Methods such as SGN [20] and SSAP [21] first perform pixel-level semantic segmentation and then distinguish different instances by means such as clustering and metric learning. Most single-stage instance segmentation methods [15, 22-24] are inspired by one-stage, anchor-based detection models such as [27] and RetinaNet [28]. PolarMask [25] and AdaptIS [26] are inspired by anchor-free detection models such as FCOS [29]. Compared with the two-stage model, the single-stage model has a natural advantage in speed [15].

Image Fusion.
There are four categories of image fusion algorithms based on deep learning: CNN methods, GAN methods, autoencoder methods, and other methods, as shown in Table 3.
The image fusion methods based on CNN mainly use an existing CNN for image fusion. Li et al. proposed an image fusion network based on VGG-19 [32], which decomposes the source image into a base part and a detail part, uses VGG-19 to extract multilayer features, and obtains the fused image through an appropriate fusion strategy. Li et al. used a residual neural network (ResNet) and zero-phase component analysis (ZCA) to construct a fusion framework. The residual neural network was used for feature extraction, and the image was reconstructed by zero-phase component analysis [33]. Inspired by transform-domain image fusion algorithms, Zhang et al. used two convolutional layers to extract the salient features of multiple images and selected appropriate fusion rules to fuse these features and generate images [31]. The shortcomings of this network are also obvious: the structure and fusion strategy are too simple, so the fusion performance of the network is not optimal. In [30], an unsupervised and unified densely connected network (FusionDN) is proposed. Its main contribution is a weight block that generates the weights of the different source images in order to fuse them. Zhang et al. proposed a fast unified image fusion network based on proportional maintenance of gradient and intensity (PMGI), which can fuse multisource images [35]. The fusion result is achieved by adjusting the texture and intensity ratio of the image. In the network, the information is extracted through a gradient path and an intensity path. To handle fusion tasks with different sources, the authors also define two loss functions for the extracted information. Xu et al. provided a fusion network model that adapts to different source images because the model can retain the adaptive similarity between the fusion result and the source images [36]. Chen et al. designed a multilayer fused convolution neural network (MLF-CNN) for pedestrian detection; they combined image fusion and object detection into a single network [34].
The autoencoder methods use an existing autoencoder neural network to extract features, fuse features, and generate features. Prabhakar et al. proposed a fusion network (DeepFuse) from the perspective of optimizing the loss function. The network is composed of an encoder, a fusion layer, and a decoder [37]. Even if the network input changes and the parameters are not adjusted, better results can be obtained. Inspired by DeepFuse, Li and Wu proposed a fusion network based on an autoencoder neural network (DenseFuse) [38]. The network is likewise composed of an encoder, a fusion layer, and a decoder. The dense block [45] is mainly used for feature extraction from the original image. NestFuse [39] uses the same structure and is inspired by DenseFuse and U-Net++ [46]. Its authors also designed a multiscale fusion strategy based on the attention mechanism. In 2021, Li et al. proposed an end-to-end residual fusion architecture (RFN-Nest). Its main contribution was a residual fusion network (RFN) based on the residual architecture [40].
In the GAN-based approach, a generative adversarial network is used to train a generator that can generate fused images. Ma et al. proposed an image fusion framework based on generative adversarial networks, FusionGAN [41]. The generator generates the fused image, and the discriminator judges the result of the generator. However, the network still cannot retain rich detail. To preserve the rich details in the visible image, the authors improved the generator, discriminator, and loss function of FusionGAN [42]. These changes give the fused image more details. Because such a network is dedicated to the fusion task, combining it with a separate instance segmentation network produces a multi-network pipeline whose structural redundancy leads to problems such as poor real-time performance.
Other methods differ from the above. In [43], the input infrared image and visible image are each decomposed into three high-frequency feature images and low-frequency feature images; a specific fusion strategy is then used to fuse the two sets of feature images, and the fused image is obtained through image reconstruction. In [44], an infrared and visible image fusion method based on multiscale transformation and norm optimization is proposed. The fusion ability of the network as a whole is improved by combining prefusion and postfusion.
Image fusion methods based on CNN, GAN, and the other types are independent structures, which makes them relatively difficult to combine with instance segmentation networks and also produces structural redundancy. The autoencoder method can be combined with an existing instance segmentation method in modular form, which avoids these problems. Therefore, this paper builds its multispectral feature fusion module on the RFN of the RFN-Nest method.

MSIS Network Architecture.
The architecture of the MSIS model is shown in Figure 1. It consists of three parts: the feature extraction module, the feature fusion module, and the multiscale instance segmentation module. Firstly, the feature extraction module generates infrared image features $FM_{ir,i}$ [...] In the multiscale instance segmentation module, MSIS uses an FPN (Feature Pyramid Network) to improve multispectral instance segmentation and to deal with the multiscale problem of power equipment in SOLOv2.
The FPN can fuse deep semantic features and shallow detail features. These new features are input into the multispectral instance prediction head. Here, we use the prediction head of SOLOv2, which includes an instance category branch and an instance mask branch. The specific operation is as follows: the feature output by the FPN is divided into an $S \times S$ grid. The instance category branch outputs $S \times S \times C$ semantic category probabilities, where $C$ is the number of instance categories. The instance mask branch outputs $H \times W \times S^2$ prediction masks, where $H \times W$ is the size of the output image and $S^2$ is the maximum number of predicted instances. When the center of a target object falls into a grid cell, the corresponding category branch and mask branch output the instance category and the pixel segmentation of the object, respectively. In this way, MSIS realizes end-to-end feature fusion and automatic segmentation of complete power equipment.
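The grid assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `grid_cell_for_center`, the image size, the grid size `S = 12`, and `C = 2` are assumptions chosen for the example, and the channel index `row * S + col` follows the convention we understand the original SOLO formulation to use. SOLO-style heads also assign a scaled center region rather than a single point, which is omitted here.

```python
import numpy as np

def grid_cell_for_center(cx, cy, img_w, img_h, S):
    """Map an object center (cx, cy) in pixels to its S x S grid cell.

    Returns (row, col, k), where k = row * S + col is the index of the
    mask channel assumed to be responsible for that cell.
    """
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col, row * S + col

# Hypothetical sizes, for illustration only.
S, C, H, W = 12, 2, 96, 160
cate_pred = np.zeros((S, S, C))      # category branch output: S x S x C
mask_pred = np.zeros((H, W, S * S))  # mask branch output: H x W x S^2

row, col, k = grid_cell_for_center(400.0, 120.0, img_w=640, img_h=480, S=S)
category_scores = cate_pred[row, col]  # C class probabilities for this cell
instance_mask = mask_pred[:, :, k]     # H x W mask for this cell
```

With these sizes, a center at (400, 120) in a 640 × 480 image falls in row 3, column 7, so its mask is read from channel 43.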

MSIS Feature Extraction Module.
Before the features of the infrared image $I_{ir}$ and the visible image $I_{vi}$ are fused, feature extraction is an indispensable step. However, the limited amount of data in the power equipment dataset makes the network model difficult to train, and when a ResNet-50 pretrained on the MS COCO dataset is used directly for feature extraction, the segmentation results are not satisfactory. To this end, we propose the MSIS feature extraction module, which includes the feature extraction branch of the infrared image, the feature extraction branch of the visible image, and the feature prefusion branch. Specifically, as shown in Figure 2, the visible and infrared feature extraction branches both use the pretrained ResNet-50. The feature prefusion branch is composed of two 3 × 3 convolutions, an attention mechanism, and residual structures (Stage 1-Stage 3). The 3 × 3 convolutions ensure that the input feature information is fully retained, and their number of output channels is 512. We add a 1 × 1 convolution before each residual structure so that the output is consistent with the ResNet-50 features. At the same time, this dimensionality reduction lowers the number of training parameters and thus the computational burden caused by the increase in channels. The residual structure is the same as in ResNet-50. An attention module is added after each residual structure; it combines the channel attention module (CA) and the spatial attention module (SA) in parallel. The feature $FM_{am}$ generated by the attention module is given by expression (1) [equation missing].
The original RFN structure is shown in Figure 3; its convolutions are all 3 × 3. The fusion convolutional layers of the RFN (Conv3∼Conv6) are intended to fuse features from different sources, but their fusion ability is limited; see the ablation experiment of the fusion layer for details. We therefore increase the number of convolutional layers and modify the convolutional layer to improve the fusion ability of the fusion layer. To enhance feature fusion while keeping the number of module parameters small, we construct a novel multifeature attention RFN (MARFN), as shown in Figure 4. The features of the infrared image $FM_{ir}$ and the features of the visible image $FM_{vi}$ pass through Conv1 and Conv2, each followed by spatial attention, and are concatenated along the channel dimension. They are then input into the fusion convolutional layers (Conv3∼Conv6). Increasing the number of convolutional layers causes degradation. To solve this problem, we design a new convolutional layer, shown in Figure 5, which resolves the degradation caused by the increase in the number of layers in the module.
Finally, in MARFN-A the prefusion features $FM_{pf}$ are input into Conv7, which combines them with the output of Conv3∼Conv6 and passes the resulting fusion feature $FM_m$ to the next layer. Channel attention (CA) is placed after Conv3∼Conv7. According to Figure 4(a), the feature fusion formula of MARFN-A is defined as follows: [equation missing]. MARFN-B differs from MARFN-A in that $FM_{vi}$ and $FM_{ir}$ are input into Conv7 at the same time. According to Figure 4(b), its feature fusion formula is defined as follows: [equation missing].
In the loss of the fusion network, $L_{fusion}$ [equation missing], $w_l$ is the loss coefficient of the different feature layers, and $w_{vi}$ and $w_{ir}$ are used to balance the loss of each scale in the multiscale features. $FM_{vi}$ and $FM_{ir}$ control the relative influence of the visible and infrared features in the fusion feature map $FM_m$. The MARFN fuses the features of the infrared image $FM_{ir}$ and the visible image $FM_{vi}$ to generate the features $FM_m$ required by the FPN, from which the multiscale instance segmentation module obtains the final instance segmentation results. We use the SOLOv2 single-stage prediction head, so the loss of the multiscale instance segmentation module is defined as in SOLOv2: [equation missing], where $L_{cate}$ is the focal loss and $L_{mask}$ is the dice loss; more details about the loss function can be found in SOLOv2 [15]. Therefore, our total loss is defined as follows: [equation missing].
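To make the two segmentation loss terms concrete, the following is a minimal NumPy sketch of a focal loss for the category branch and a dice loss for the mask branch, combined in the form $L_{cate} + \lambda L_{mask}$ that we understand SOLOv2 to use. The function names, the defaults `alpha=0.25`, `gamma=2.0`, and `lam=3.0`, and the single-mask simplification (no averaging over positive grid cells) are assumptions for illustration; the paper's fusion loss $L_{fusion}$ and total loss are not reproduced here.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """1 - D(p, q), with D = 2 * sum(p * q) / (sum(p^2) + sum(q^2))."""
    inter = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2)
    return float(1.0 - 2.0 * inter / (denom + eps))

def focal_loss(prob, label, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss averaged over all elements.

    prob holds predicted probabilities in (0, 1); label holds 0/1 targets.
    """
    p_t = np.where(label == 1, prob, 1.0 - prob)     # probability of the true class
    a_t = np.where(label == 1, alpha, 1.0 - alpha)   # class-balancing weight
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)))

def segmentation_loss(cate_prob, cate_label, mask_pred, mask_gt, lam=3.0):
    """L_cate + lam * L_mask for a single instance (simplified)."""
    return focal_loss(cate_prob, cate_label) + lam * dice_loss(mask_pred, mask_gt)

gt = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = dice_loss(gt, gt)          # close to 0 for a perfect mask
disjoint = dice_loss(1.0 - gt, gt)   # exactly 1 when masks do not overlap
```

The focal term down-weights easy examples through the factor $(1 - p_t)^\gamma$, while the dice term is insensitive to the large number of background pixels, which is why the pair suits small foreground objects.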

Image Dataset of Power Equipment.
The image dataset of power equipment comes from a medium-sized converter station in Huanggang City, Hubei Province, China. We constructed this dataset for the experiments, and all infrared and visible images were captured with an infrared thermal camera (Fluke Ti480 PRO). The images were taken between 8:00 am and 5:00 pm, mainly in cloudy and sunny weather. They mainly contain common power equipment such as transformers and lightning arresters. The power equipment dataset is shown in Figure 6. The dataset is mainly used for image processing tasks such as object detection and instance segmentation. In the multispectral instance segmentation experiment, we used the method of [47] to obtain the final registered images. The multispectral data consist of 2940 image pairs of arresters and 2998 image pairs of transformers. The training, validation, and test sets are divided in a 6:2:2 ratio, and the resulting distribution of the power equipment dataset is shown in Table 4. The dataset was manually labeled with LabelMe and organized as an instance segmentation dataset in the MS COCO style.
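A 6:2:2 split of the 5938 image pairs can be sketched as below. This is only an illustration under stated assumptions: the identifiers such as `arrester_0000` are hypothetical, and the paper does not say whether the split was random or stratified by class, so the counts produced here need not match Table 4.

```python
import random

def split_pairs(pair_ids, ratios=(6, 2, 2), seed=0):
    """Shuffle image-pair ids and split them into train/val/test by ratio."""
    ids = list(pair_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test

# Hypothetical ids; each id stands for one registered infrared/visible pair.
pairs = [f"arrester_{i:04d}" for i in range(2940)]
pairs += [f"transformer_{i:04d}" for i in range(2998)]
train, val, test = split_pairs(pairs)
```

Splitting by pair id rather than by individual image keeps the infrared and visible images of the same scene in the same subset.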

Experiment Setup.
The experiment was completed on a deep learning server configured with an NVIDIA Tesla V100 GPU and an Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40 GHz; the OS was 64-bit Ubuntu 18.04, and the network was implemented in PyTorch 1.3.0. In model training, the multispectral fusion network uses the loss function $L_{fusion}$, and the multispectral instance segmentation network uses the loss functions $L_{cate}$ and $L_{mask}$. We use stochastic gradient descent (SGD) as the optimizer during network training, with a learning rate (lr) of 0.01, a momentum parameter (momentum) of 0.9, and a decay value (decay) of 0.0001 for each update. The evaluation metrics are the COCO detection evaluation metrics [48], including $AP$, $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, and $AP_L$.
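The optimizer settings can be illustrated with a single hand-written update step. This sketch assumes that the reported decay of 0.0001 acts as an L2 weight decay added to the gradient, which is one common reading but is not confirmed by the text, and it follows the usual momentum formulation without dampening or Nesterov acceleration. The function name `sgd_momentum_step` is hypothetical.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and (assumed) L2 weight decay.

    g = grad + weight_decay * w
    v = momentum * v + g
    w = w - lr * v
    """
    g = grad + weight_decay * w
    velocity = momentum * velocity + g
    return w - lr * velocity, velocity

w0 = np.array([1.0])
g0 = np.array([0.5])
w1, v1 = sgd_momentum_step(w0, g0, np.zeros_like(w0))
w2, v2 = sgd_momentum_step(w1, g0, v1)  # momentum makes the second step larger
```

Starting from zero velocity, the first step moves the weight by lr × (grad + decay × w); later steps accumulate earlier gradients through the velocity term.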

Our Results.
To validate the proposed MSIS model, we compare it quantitatively and qualitatively with existing state-of-the-art methods on the multispectral dataset of power equipment. The compared methods cover two-stage, single-stage, and multispectral instance segmentation. The two-stage instance segmentation methods are Mask R-CNN [14], MS R-CNN [17], TensorMask [18], and PANet [16]. The single-stage instance segmentation methods are PolarMask [25], YOLACT++ [24], and SOLOv2 [15], and the multispectral methods are Mask R-CNN (RFN) [11], SOLOv2 (RFN), and Mask R-CNN (*) [43]. Among these networks, the two-stage and single-stage instance segmentation methods use only infrared images. Multispectral instance segmentation includes instance segmentation based on image fusion (Mask R-CNN (RFN) and SOLOv2 (RFN)) and instance segmentation based on feature fusion (Mask R-CNN (*) and MSIS). In instance segmentation based on image fusion, the RFN-Nest method is used for image fusion, and the fused image is then input to the instance segmentation network. In instance segmentation based on feature fusion, different fusion strategies are used to fuse features, and instance segmentation is then performed on the fused features. The quantitative evaluation results of these networks are shown in Table 5.
In Table 5, the AP of the MSIS model based on ResNet-101 reaches 42.20%, which is better than the other methods above for the segmentation of power equipment. Compared with SOLOv2, which uses only infrared images, the result is significantly improved, and the AP increases by 7.5%. The reason is that MSIS obtains information from infrared images and visible images at the same time, and the complementarity of this information improves the semantic information processing capability of the network. Compared with SOLOv2 (RFN), the AP of MSIS increases by 3.4%. This shows that the proposed prefusion network and MARFN module obtain richer fusion features than the RFN module. We also evaluated the FPS of MSIS on the NVIDIA Tesla V100 GPU, as shown in Table 6. The MSIS based on Res-50-PFN reaches 12 FPS, and the lightweight model based on SOLOv2 reaches 23 FPS.
For further explanation, Figure 7 shows the segmentation results of the above methods on the power equipment multispectral dataset. (c) and (d) show instance segmentation using only infrared images, and they exhibit incorrect segmentation of overlapping equipment.

This article also provides generalization experiments to demonstrate the effectiveness of the proposed method. The MSIS method is tested on the FLIR thermal imaging dataset, which was provided by FLIR for ADAS and driverless technology and mainly includes thermal images and RGB images. Since the FLIR thermal imaging dataset provides annotations for object detection, an object detection prediction head is used to complete the generalization experiment. In Table 7, Faster R-CNN represents the original network. Faster R-CNN (MSIS) uses the proposed MSIS method with its prediction head replaced by the prediction head of Faster R-CNN. As shown in Table 7, the mAP of Faster R-CNN (MSIS) is 58.56, which is 5.22% higher than that of Faster R-CNN. This result is basically consistent with the result on the MS COCO dataset.

Ablation Experiment.
In this section, to verify the superiority of the proposed MSIS method, we provide four sets of ablation experiments: the ablation experiment of the feature fusion module, the ablation experiment of the fusion layer, the ablation experiment of the backbone, and the ablation experiment of the prefusion network. The experimental process is as follows. First, the ablation experiments of the fusion layer and of the feature fusion module are executed. Next, the best fusion layer and feature fusion module are used for the ablation verification of the prefusion network. Once these ablation experiments are completed, the ablation experiment of the backbone is carried out last.

Ablation Experiment of Fusion Layer.
In the fusion layer ablation experiment, we consider the fusion convolutional layer from two perspectives: the number of convolutional layers and the structure of the convolutional layer. RFN (Conv × 3) represents the original fusion convolution layer, in which only 3 convolutional layers are provided. RFN (Conv × 4) to RFN (Conv × 6) indicate that 4, 5, and 6 convolutional layers are provided, respectively. Fusion Convolutional Layer (Conv × 6) represents the proposed fusion convolutional layer (FCL). The comparison results are shown in Table 8.
In Table 8, once the fusion convolutional layer is increased to 5 layers, the fusion ability of the network decreases, which also causes the AP to drop further. The main reason is that the network degrades. During forward propagation, as the number of convolutional layers increases, the image information contained in the feature map decreases layer by layer, so the deeper network may train worse than the shallower one. Based on this analysis, we propose a new fusion layer structure, as shown in Figure 5. In Table 8, we compare the performance of the different fusion layers. When the number of proposed fusion convolutional layers (FCL) increases to 6, the network still maintains good fusion and segmentation performance.

Ablation Experiment of Feature Fusion Module.
This section compares the MSIS feature fusion module with the existing fusion methods (Add, Max, $l_1$-norm, $l_*$-norm, SCA, and RFN). Among the existing fusion methods, Add directly adds the different features. Max selects the element-wise maximum as the fusion feature. The $l_1$-norm method computes the fusion weights from the $l_1$-norm of the features. The $l_*$-norm (nuclear norm) method obtains the fusion weights by computing the sum of the singular values of a matrix in the global pooling operation on deep features. SCA is the spatial/channel attention fusion strategy used in NestFuse [39]. RFN is the residual fusion strategy used in RFN-Nest. Their expressions are given in Table 9.
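The three simplest strategies above can be sketched on feature maps of shape (channels, height, width). This is an illustrative simplification rather than the definitions of Table 9, which are not reproduced here: in particular, the $l_1$-norm variant below derives per-pixel weights directly from the channel-wise $l_1$-norm and omits any local averaging of the activity map that published versions may apply. The function names are hypothetical.

```python
import numpy as np

def fuse_add(f_ir, f_vi):
    """Add: element-wise sum of the two feature maps."""
    return f_ir + f_vi

def fuse_max(f_ir, f_vi):
    """Max: element-wise maximum of the two feature maps."""
    return np.maximum(f_ir, f_vi)

def fuse_l1(f_ir, f_vi, eps=1e-8):
    """l1-norm: weight each source per pixel by its channel-wise l1 activity."""
    a_ir = np.abs(f_ir).sum(axis=0)      # (H, W) activity map, infrared
    a_vi = np.abs(f_vi).sum(axis=0)      # (H, W) activity map, visible
    w_ir = a_ir / (a_ir + a_vi + eps)    # weights sum to ~1 at each pixel
    w_vi = a_vi / (a_ir + a_vi + eps)
    return w_ir * f_ir + w_vi * f_vi     # weights broadcast over channels

f_ir = np.full((2, 1, 1), 3.0)  # toy infrared features: 2 channels, 1 x 1
f_vi = np.full((2, 1, 1), 1.0)  # toy visible features
fused_add = fuse_add(f_ir, f_vi)
fused_max = fuse_max(f_ir, f_vi)
fused_l1 = fuse_l1(f_ir, f_vi)
```

In the toy example the infrared features are three times as active as the visible ones, so the $l_1$-norm strategy weights them 0.75 to 0.25. None of these strategies has learnable parameters, in contrast to the convolution-based SCA, RFN, and MARFN strategies.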
We use six evaluation metrics: Entropy (En) [49], Standard Deviation (SD) [50], Mutual Information (MI) [51], Improved Fusion Artifact Measurement ($N_{abf}$) [52], Sum of Difference Correlation (SCD) [53], and Multiscale Structural Similarity (MS-SSIM) [54]. To evaluate these metrics, RFN-Nest is used as the base fusion network, and the different fused images are obtained by replacing its fusion strategy. The resulting quality metrics of the fusion results are shown in Table 10.
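Two of these metrics, En and SD, are simple enough to sketch directly. The code below assumes an 8-bit grayscale fused image and uses the standard Shannon entropy of the gray-level histogram and the standard deviation of pixel intensities; the function names are hypothetical, and the exact variants used in [49] and [50] may differ in normalization.

```python
import numpy as np

def entropy(img, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of a uint8 image."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                          # skip empty bins: 0 * log(0) := 0
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """Standard deviation of pixel intensities, a proxy for contrast."""
    return float(np.std(img.astype(float)))

# Toy image: half the pixels black, half white.
img = np.array([[0, 0], [255, 255]], dtype=np.uint8)
en = entropy(img)              # two equally likely gray levels give 1 bit
sd = standard_deviation(img)   # intensities 127.5 away from the mean
```

Higher En indicates that the fused image carries more information, and higher SD indicates stronger contrast, which is why both are read as information retention measures.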
In Table 10, the fusion methods based on convolution (SCA, RFN, MARFN-A, and MARFN-B) achieve better fusion results than the classic fusion methods. From the perspective of information retention (En, SD), the convolution-based fusion methods extract rich image features through convolution, and the fusion convolution structure uses these features to generate fused features, which yields a better result than the classic fusion methods. Although both MARFN-B and RFN are convolution-based fusion methods, MARFN-B performs better than RFN. The main reason is that the FCL further improves the fusion of features and retains richer information. In addition, MARFN-A shows a significant improvement in the evaluation metrics. From the perspective of feature preservation (MS-SSIM, MI), the prefusion network and MARFN-A provide deeper feature extraction and fusion, thereby enhancing the fusion capability.
The ablation results for the prefusion network are shown in Table 11, where "✓" means that the prefusion network is enabled. The MARFN-A module with the prefusion network improves significantly, and its AP increases by 5%. The prefusion network provides richer features and enhances the fusion capability of the MARFN module, which improves the overall segmentation performance of the network.

Ablation Experiment of Backbone.
To explore the feature extraction module of MSIS, the backbone ablation experiment considers two kinds of backbone: a dual-input backbone based on a traditional backbone and a dual-input backbone based on the MSIS feature extraction module. The dual-input backbone based on a traditional backbone uses a classic backbone (ResNet-101 or ResNeXt-101), and its structure is shown in Figure 8. The dual-input backbone based on the MSIS feature extraction module combines the MSIS feature extraction network with a classic backbone (MSIS ResNet-101 or MSIS ResNeXt-101), as shown in Figure 2. Table 12 shows the segmentation performance of the network for the different backbones. Compared with ResNet-101, the AP increases by 4.54%. The AP of MSIS ResNeXt-101 reaches 43.61%. From the [...] and MARFN extracts more complex features and provides deeper feature fusion, thereby enhancing the fusion capability of the fusion network.

Conclusions
In this work, we designed an end-to-end multispectral instance segmentation model that can segment complete power equipment, including nonfaulty equipment, and thus meets the requirements of the preliminary work for power fault detection. Compared with ordinary instance segmentation, the proposed network adds a multispectral feature fusion network to fuse the features of infrared images and visible images. For the MSIS network model, we carried out extensive experiments and adopted the best-performing configuration to improve the accuracy of segmentation. To better process infrared images and visible images, we propose a dual-input method that exploits the advantages of infrared images and visible images at the same time.
Finally, the AP of the MSIS model reached 40.06%, and the segmentation results can be seen in Figure 7. The multispectral instance segmentation can segment complete power equipment and supports power equipment fault detection. However, it does not segment the faults themselves, and the model is large and needs further optimization. Therefore, in future research, the model will be further improved for fault detection.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.