Weakly Supervised Fatigue Crack Detection in Steel Bridge Girders Using a Proposed Two-Stage Network Training with a Segmentation Refinement Module



Introduction
Steel box girders have been widely applied in long-span cable-stayed and suspension bridges in light of their advantages of low weight, high torsional stiffness, and rapid construction. Such structures suffer from fatigue cracks owing to initial defects and residual stresses related to fabrication and construction processes. Under repeated vehicle loads, the development of fatigue cracks continues to reduce the stiffness and integrity of local welded connections, thereby decreasing the reliability and durability of bridges. To support accurate decision-making on bridge maintenance, it is essential to detect and monitor fatigue cracks periodically or even in real time.
Nondestructive testing (NDT) techniques, such as traditional ultrasonic testing or advanced phased-array ultrasonic testing [1], are usually used to approach the problem of local damage detection and can capture both internal and surface damage characteristics. Nevertheless, the accuracy of these NDT-based methods is limited by measurement noise and depends heavily on skilled inspectors or expensive instruments. Compared with NDT, vision-based methods are relatively inexpensive to implement and well suited for surface cracks [2]. Currently, human visual inspection still plays a crucial role in the routine fatigue maintenance of steel bridges. However, its consistency in quantitative evaluation and its accessibility cannot be guaranteed considering environmental and human factors [3,4]. Furthermore, inspection by human inspectors is often labor-intensive and time-consuming.
With the rapid development of computer technology, methods for crack inspection based on computer vision have emerged. Most of these studies have focused on image processing techniques (IPTs). A significant advantage of IPTs is that almost all surface defects can be identified [3]. However, such methods rely on subjectively chosen parameters and filters, which indicates a lack of generality across different scenarios [5]. Therefore, some researchers have tried to improve the robustness of IPT-based methods in real-world situations using machine learning (ML) [6][7][8][9]. However, these improvements are strictly limited by the feature extraction capacity of IPTs, and the optimized methods still require time-consuming pre- and postprocessing techniques [10,11].
Owing to the development of convolutional neural networks (CNNs), deep learning (DL)-based methods have been proposed for image-based crack detection in computer vision. Recent studies on DL-based crack detection have mainly involved image classification, object detection, and semantic segmentation. Image classification methods obtain image-level class information, while object detection methods provide class and coarse location information. Neither can provide sufficiently accurate information on crack width, length, and direction, which play fundamental roles in fatigue maintenance design. Consequently, semantic segmentation methods based on supervised learning, which provide pixel-level semantic and localization information, have been used for fatigue crack detection.
However, supervised semantic segmentation demands a large number of annotated pixel-level labels [12], which calls for enormous human labor and time during dataset preparation [13,14]. To address this difficulty, weakly supervised learning (WSL) with image-level labels, which only indicate the existence of the object of interest, has been applied to the training of segmentation networks [14,15]. Compared with pixel-level annotation, image-level labeling costs can be significantly reduced. Nevertheless, the performance of WSL-trained networks is lower than that achieved by fully supervised learning (FSL), and there remains room for improvement.
This paper proposes an improved WSL-based method for high-performance fatigue crack detection with low labeling cost. The main contributions of the proposed method are as follows: (i) A pixel-level detection method was proposed for the segmentation of fatigue cracks, which uses only image-level classification labels but achieves state-of-the-art performance among WSL-based methods. (ii) To realize customized optimization for segmentation, the activation modulation and recalibration (AMR) scheme was adopted to generate refined pseudolabels, addressing the problem that only the most discriminative regions are highlighted by state-of-the-art WSL-based methods. (iii) To the best of the authors' knowledge, this paper is the first to propose a two-stage training method for weakly supervised fatigue crack segmentation. After learning the semantic features in the refined pseudolabels, the segmentation network is trained recursively with a segmentation refinement module. This exploits both the strengths of deep learning and the morphological knowledge of cracks.
The remainder of this paper is organized as follows. Section 2 introduces the current literature on crack detection. Section 3 outlines the methodology of the proposed method. Section 4 presents the experimental results. Lastly, conclusions are given in Section 5.

Conventional IPT-Based Crack Detection Methods.
Conventional IPT-based methods have been widely utilized for crack detection over the past two decades. Early studies relied on intensity-thresholding methods due to their simplicity and efficiency, assuming that crack pixels exhibit lower intensity than the background [16][17][18]. However, these methods face challenges in unevenly illuminated images, as a single threshold is applied to the entire image.
To overcome challenges related to real-world image variations, IPT-based methods have been enhanced by integrating ML classifiers such as support vector machine (SVM) and k-nearest neighbor (KNN) algorithms [6,7,32]. Nevertheless, these improvements are constrained by the feature extraction capacity of IPTs. Notably, prior IPT-based studies predominantly focused on cleaner surfaces, such as pavement or concrete, and may not be robust enough for crack detection in steel bridge girders, where obstacles such as marker curves and weld line edges exist [2,33]. Patch-level methods typically employ sliding window techniques or crop small patches for crack detection. A CNN-based workflow utilizing sliding windows allows the detection of cracks in images larger than those used for training [3]. Faster region-based CNN (faster R-CNN) has been proposed for real-time detection of cracks [10], and transfer learning from a benchmark CNN enables robust crack classification with limited crack images [11]. While these methods provide satisfactory crack region detection, they lack morphology information related to cracks. Postprocessing algorithms, such as edge detection and dilation operations, are employed to segment detected patches at the pixel level [34,35]. However, postprocessing accuracy depends on the optimal patch size, which can be challenging to determine.
While these approaches have advanced automated crack detection, patch-level methods are fast but limited in extracting crack information, and pixel-level methods provide accurate segmentation but require labor- and time-intensive pixel-level label preparation. To address this trade-off, an efficient crack detection method with a reduced annotation burden is needed.

Image-Level Weakly Supervised Semantic Segmentation (WSSS).
Since the generation of fully annotated datasets is laborious, alternative learning methods based on unlabeled or weakly labeled visual data have become prevalent in recent years. Various forms of weak labels have been proposed in previous studies, such as bounding boxes, points, scribbles, and image-level supervision. Among these, image-level weak supervision is favorable for its simplicity and reliability [49]. Therefore, this study focuses on image-level weakly supervised crack segmentation.
Recently, image-level WSSS works have mostly employed class activation maps (CAMs) [50] as initial pseudolabels for the training of segmentation networks. CAMs provide a heatmap representation that highlights the object regions of interest. However, CAMs generated by classification networks tend to highlight only the most discriminative regions; hence, the obtained pseudolabels may cover only part of the target objects. These coarse discriminative object regions may not meet the requirements of pixel-level semantic segmentation and thus harm network performance. Efforts to alleviate this issue can be classified into two categories: refining the pseudolabels based on CAMs and modifying the segmentation training procedure.
To obtain finer initial pseudolabels, most studies have focused on refining the seeds or response regions of initial CAMs. Wei et al. [51] used dilated convolutions with different dilation rates to enlarge the receptive fields. Kolesnikov and Lampert [52] proposed three principles, seed, expand, and constrain (SEC), to refine the seeds. Ahn and Kwak [53] predicted semantic affinity between pixels with AffinityNet and propagated local activations using a random walk algorithm. Chang et al. [54] forced the network to learn better response regions by exploiting subcategory information. Lee et al. [55] randomly selected hidden units in the feature map so that the activated regions better characterize the object. However, these methods were developed in an iterative and random manner, which may lose essential information. To address this issue, Qin et al. [56] recently proposed a novel activation modulation and recalibration (AMR) scheme, which leverages a spotlight branch and a compensation branch to provide complementary and task-oriented CAMs for WSSS.
In addition to refining the initial pseudolabels, several studies have improved WSSS performance by modifying the segmentation training procedure. Most of them trained segmentation networks recursively along with a refinement module that exploits prior information. Khoreva et al. [57] proposed recursive training enhanced by denoising techniques that improved the labels between rounds using object priors. Li et al. [58] designed a new superpixel conditional random field (superpixel-CRF) model to refine generated masks, based on which the segmentation model was trained iteratively. Recently, for WSSS of power lines, Choi et al. [59] introduced a broken line connection algorithm to provide refined segmentation labels for recursive segmentation training.
To the authors' knowledge, no studies have been conducted on WSSS of fatigue cracks against a noisy background, let alone corresponding methods for improving WSSS performance. This paper is also based on the concept of CAMs and utilizes a recursive training procedure. A critical difference from previous studies is the proposed two-stage training procedure, which first trains the segmentation network on refined pseudolabels generated by AMR and then performs recursive training with a designed refinement module to denoise the crack segmentation. Therefore, the proposed method benefits from both the refinement of the initial pseudolabels and the modification of the segmentation training procedure.

Overview of the Proposed Method.
Single classification networks can localize only the most discriminative object regions, which is far from the requirement of pixel-level segmentation. To provide refined pseudolabels, the AMR method is recommended to integrate less-discriminative regions. In addition, to further enhance WSSS performance, a two-stage training procedure is proposed, aiming to continuously refine the training labels by leveraging both deep learning features and fatigue crack morphology.
As shown in Figure 1, our framework consists of two parts: the first part deals with the training of the AMR branches and the generation of initial pseudolabels; the second part involves the proposed weakly supervised two-stage training of the segmentation network. Specifically, the first part involves a systematic four-step process. First, the input images (X_in) undergo preprocessing, wherein they are cropped into small patches, and each patch is annotated with one of four image-level labels: background (L_b), crack (L_c), marker (L_m), and the combination of crack and marker (L_cm). Subsequently, AMR is trained using these annotated patches, enabling it to effectively highlight crack and marker regions of interest. The trained AMR is then used to generate CAMs of patches from new input images. Importantly, the images used for AMR training are distinct from those used for generating the CAMs, which avoids overestimating the trained AMR performance. Finally, the dense conditional random field (DenseCRF) [60] is employed to process the probability maps derived from the CAMs, producing fine segmentation masks as initial pseudolabels for the subsequent training of the segmentation network.
In the second part, the generated pseudolabels are used to train the segmentation network for a certain number of epochs to obtain a sufficiently good basic segmentation performance (stage I). After that, the pretrained network is further trained recursively (stage II): in each iteration, a designed refinement module refines the predicted masks from the previous iteration and produces more complete and precise labels for training the segmentation network in the current iteration.

Activation Modulation and Recalibration Method.
A conventional CAM of a specific category highlights the discriminative regions used by multilabel classification networks to determine that category. Given an input image I ∈ R^(3×H×W), global average pooling (GAP) is used to identify the importance of the feature maps F(I) ∈ R^(C×H×W) (C is the number of channels) extracted from the last convolution layer. Then, the conventional CAM is obtained by computing a weighted sum of the extracted feature maps:

M(I) = w_N^T F(I), (1)

where M(I) ∈ R^(N×H×W) is the obtained CAMs and w_N is the weight matrix of the fully connected layer for N classes.
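As an illustration of equation (1), the CAM computation can be sketched in a few lines of NumPy; the function name `compute_cams` and the per-class min-max normalization are illustrative choices of this sketch, not part of the original formulation:

```python
import numpy as np

def compute_cams(features, fc_weights):
    """Compute class activation maps as a weighted sum of feature maps.

    features:   (C, H, W) feature maps from the last convolution layer.
    fc_weights: (N, C) fully connected layer weights for N classes.
    Returns:    (N, H, W) class activation maps, min-max normalized per class.
    """
    # M[n, h, w] = sum_c w[n, c] * F[c, h, w], i.e., equation (1).
    cams = np.einsum("nc,chw->nhw", fc_weights, features)
    # Normalize each map to [0, 1] so maps are comparable across classes.
    cams -= cams.min(axis=(1, 2), keepdims=True)
    denom = cams.max(axis=(1, 2), keepdims=True)
    return cams / np.where(denom > 0, denom, 1.0)
```

In practice the features and weights come from the trained classification network; here they are plain arrays so the mapping from equation (1) to code is explicit.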
However, conventional CAMs are classification-oriented and lack some minor but essential features for segmentation tasks. To solve this problem, Qin et al. [56] proposed a novel AMR scheme for WSSS, which outperformed state-of-the-art WSL-based methods on the PASCAL VOC 2012 dataset. However, its effectiveness for highlighting crack regions under the interference of edge-like features has not been validated. As shown in Figure 2, similar to previous studies, a spotlight branch based on common CNNs is utilized to highlight the most discriminative object regions and generate the corresponding spotlight CAMs M_s. Beyond that, the main contribution of the AMR method is a parallel compensation branch, which leverages a spatial-channel attention module to focus on essential but easily overlooked regions. The obtained compensation CAMs M_c are used to recalibrate the spotlight CAMs M_s and generate the final weighted CAMs M_w [56]:

M_w = M_s + ξ M_c, (2)

where ξ is the recalibration coefficient.
During the training process, the AMR method is optimized with a two-part combination loss L_all, which can be expressed as

L_all = L_cls + L_cps. (3)

The first loss part, L_cls, is the averaged classification loss of the two branches:

L_cls = (L_cls^s + L_cls^c) / 2, (4)

where L_cls^s and L_cls^c are the multilabel soft margin losses supervising the spotlight branch and the compensation branch, respectively.
The second loss part, L_cps, provides cross pseudosupervision between the spotlight branch and the compensation branch. It can be regarded as a semantic similarity regularization of each branch:

L_cps = ||M_s − M_c||_1. (5)

This paper aims to use the AMR method to generate high-quality pseudolabels from image-level annotations. The whole generation process can be summarized as follows. ResNet50 is used as the backbone of the multilabel classification branches of AMR, and the spatial-channel attention module with the Gaussian function is plugged into the compensation branch. The image-level annotations are then used to train the AMR under the supervision of equation (3). After training, the weighted CAMs are obtained with the discriminative localization technique described in equation (2). Finally, DenseCRF is used to process the CAM probability maps to obtain the synthetic labels used to train the segmentation network.
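A minimal sketch of the two-part loss is given below. It assumes the multilabel soft margin loss follows PyTorch's standard definition and takes L_cps as a plain L1 consistency between the two branches' CAMs with an optional weight `lam`; both the L1 form and `lam` are assumptions of this sketch, and the exact regularizer is defined in [56]:

```python
import numpy as np

def soft_margin_loss(logits, labels):
    """Multi-label soft margin loss (mean over classes), following the
    common definition: -[y*log(sigmoid(x)) + (1-y)*log(sigmoid(-x))]."""
    sig = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # numerical guard against log(0)
    return float(np.mean(-(labels * np.log(sig + eps)
                           + (1 - labels) * np.log(1 - sig + eps))))

def amr_loss(logits_s, logits_c, labels, cam_s, cam_c, lam=1.0):
    """Sketch of the two-part AMR training loss:
    L_all = L_cls + lam * L_cps, where L_cls averages the classification
    losses of the spotlight and compensation branches, and L_cps is taken
    here as an L1 consistency between the two branches' CAMs."""
    l_cls = 0.5 * (soft_margin_loss(logits_s, labels)
                   + soft_margin_loss(logits_c, labels))
    l_cps = float(np.mean(np.abs(cam_s - cam_c)))
    return l_cls + lam * l_cps
```

When the two branches agree perfectly, L_cps vanishes and only the averaged classification term remains.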

Two-Stage U-Net Training.
U-net [61] was originally designed for semantic segmentation of biomedical images with edge-like features, which makes it well suited for crack detection. The adopted skip connections promote the aggregation of spatial and semantic information, which makes U-net outperform the conventional FCN. Furthermore, U-net also performs well when little training data are available. All these advantages make U-net a good fit for the detection of fatigue cracks in this study. To accelerate network convergence and further improve WSSS performance, the segmentation network is trained in a two-stage manner, summarized as follows. In the first stage, U-net is pretrained for a certain number of epochs to learn the essential information in the initial pseudolabels. This training stage provides a basic segmentation performance and facilitates network convergence in the subsequent training. Although the AMR method is used, the initial pseudolabels are still incomplete since they are generated using only image-level labels. To improve inference quality, the pretrained U-net is further trained recursively in the second stage. The expectation is that segmentation performance trained with noisy labels can improve itself via recursive training with a segmentation refinement module.
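The two-stage schedule can be expressed as a small control-flow sketch; `train_step`, `predict`, and `refine` are hypothetical callables standing in for the U-net training step, inference, and the refinement module, respectively:

```python
def two_stage_training(train_step, predict, refine, initial_labels,
                       images, pretrain_epochs=50, recursive_iters=50):
    """Structural sketch of the two-stage scheme (not a full U-net setup).

    train_step(images, labels): one pass of segmentation training.
    predict(images): current network predictions (masks).
    refine(masks): the segmentation refinement module (A1 + A2).
    """
    # Stage I: pretrain on the initial pseudolabels from AMR + DenseCRF.
    for _ in range(pretrain_epochs):
        train_step(images, initial_labels)
    # Stage II: recursive training; in each round, the previous
    # predictions are denoised by the refinement module and reused
    # as training labels for the current round.
    labels = initial_labels
    for _ in range(recursive_iters):
        labels = refine(predict(images))
        train_step(images, labels)
    return labels
```

The key design choice this exposes is that refinement sits between inference and the next training round, so label quality can keep improving instead of saturating after a single cleanup.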

Segmentation Refinement Module: Assimilation and Connection.
In the proposed method, a segmentation refinement module is designed to provide continuously optimized labels during the recursive training of U-net. The refinement module exploits the available morphology information on fatigue cracks and the surrounding markers. This information is captured in the following two cues:
C1. Cracks and markers are generally separated. Therefore, there are no discrete marker segments on a crack path and, likewise, no discrete crack segments on a marker path.
C2. Fatigue cracks mostly initiate near the substrate surface, and in the propagation phase, the crack penetrates the substrate surface, forming a continuous damage path. Therefore, surface fatigue cracks are usually continuous.
The recursive training is enhanced by denoising the network outputs using this morphology information. Following the two cues, the labels are improved by two postprocessing algorithms between iterations.
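The core assimilation idea of cue C1 (algorithm A1, detailed below) can be sketched as a majority-vote reassignment within each connected domain; this simplified version omits the stripping of large marker domains that the full algorithm performs first:

```python
from collections import deque

def assimilate(mask):
    """Simplified sketch of the assimilation step (cue C1): within each
    4-connected domain of non-background pixels, reassign every pixel to
    the domain's majority class. mask: 2D list of ints
    (0 = background, 1 = crack, 2 = marker). Returns a new mask."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 0 or seen[i][j]:
                continue
            # Flood-fill one connected domain of non-background pixels.
            domain, queue = [], deque([(i, j)])
            seen[i][j] = True
            while queue:
                y, x = queue.popleft()
                domain.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and mask[ny][nx] != 0):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Assimilate minority pixels into the domain's majority class.
            counts = {}
            for y, x in domain:
                counts[mask[y][x]] = counts.get(mask[y][x], 0) + 1
            major = max(counts, key=counts.get)
            for y, x in domain:
                out[y][x] = major
    return out
```

This directly encodes the observation below that, within a connected domain, falsely detected pixels of the other category are outnumbered by correctly detected ones.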
A1. An assimilation algorithm follows cue C1 to assimilate false-detected discrete segments into their surrounding categories. Algorithm A1 supports the proposed assimilation process, as illustrated in Figure 3. The input img is a synthetic mask containing segmented crack and marker domains. When U-net is used for pixel classification, the similar features of cracks and markers can cause them to be misclassified as each other, resulting in intermingled crack and marker predictions. Statistical analysis of the predicted results reveals that, within a connected domain, the number of pixels falsely detected as another category is generally smaller than the number correctly detected. Based on this, crack assimilation and marker assimilation are designed to correct misidentified crack and marker pixels, respectively. Each assimilation part consists of three steps. For crack assimilation, first, marker pixels img_mark are stripped from the synthetic label copy img_whole, and the connected domains of markers CoDom_mark are obtained. Second, the marker pixels in each connected domain are counted and removed from img_whole if the pixel count exceeds the threshold N_th_mask; larger connected domains are less likely to be misidentified, and their removal facilitates the subsequent assimilation. Third, connected domains of the updated synthetic label copy CoDom_whole are obtained, and each domain is checked to assimilate misidentified crack pixels into its category by comparing pixel counts. Similar steps are implemented for marker assimilation. A2. A connection algorithm follows cue C2 to connect discrete crack segments into a whole. Given the influence of uneven illumination inside dim bridge girders, some background noise can be wrongly identified as small crack points during segmentation. Before crack connection, these misidentified noise points are filtered according to the
highlighted regions by CAMs. Algorithm A2 supports the connection process, as shown in Figure 4. The proposed crack connection part consists of three steps. First, crack segments are extracted from the assimilated mask img_whole, and the corresponding connected domains CoDom_crack are found. Subsequently, the extreme points ExtrmPts_i are found for each domain, namely a list containing the top-most, bottom-most, right-most, and left-most points (Figure 4). After that, the Euclidean distance between every extreme point of a connected domain and those of every other domain is compared to obtain the two endpoints (Pt_1, Pt_2) with the least distance. Finally, the two endpoints are connected using an assumed crack line in img_whole. During the recursive training process, the discrete crack segments are gradually connected into a whole.

The experimental dataset was provided by the IPC-SHM 2020 project [62]. Specifically, a total of 200 images with a size of 4,928 × 3,264 or 5,152 × 3,864 pixels were provided; these images were collected from steel bridge girders under different camera parameters and environmental conditions during routine inspection. The dataset acquisition details can be found at https://www.schm.org.cn/#/IPC-SHM,2020/project1.

Experiment
Based on the original dataset, two subdatasets were generated to train and evaluate the proposed method. The first, called the AMR-dataset, is used to train the AMR branches for generating high-quality pseudolabels; it is thus a multilabel image classification dataset. 80 high-resolution images were selected from the original dataset and resized to multiples of 512 pixels. These resized images were then cropped into small patches of 512 × 512 pixels. Cropping improves training and testing efficiency and has been widely adopted in previous studies. Considering the category-imbalance problem, the final AMR-dataset contains 800 images with cracks, 800 images with markers, and 800 background images, all with manually annotated image-level labels.
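The patch generation can be sketched as below, assuming the image sides have already been resized to multiples of the patch size (as described above):

```python
import numpy as np

def crop_patches(image, patch=512):
    """Crop an (H, W, 3) image, whose sides are assumed to be multiples
    of `patch`, into non-overlapping patch-sized tiles."""
    h, w = image.shape[:2]
    return [image[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, patch)
            for x in range(0, w - patch + 1, patch)]
```
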

Structural Control and Health Monitoring
The remaining 120 images in the original dataset were randomly divided into a training set, a validation set, and a testing set using a split ratio of 6 : 2 : 2. From these, the second subdataset, called the U-net-dataset, was constructed for training and evaluating the segmentation network. During training, the original high-resolution images were resized and cropped into small patches of 512 × 512 pixels, and the trained AMR was used to produce their corresponding synthetic labels. Note that there is no overlap between the AMR-dataset and the U-net-dataset, which avoids overestimating the AMR performance.

Evaluation Metrics.
Three key metrics are employed to evaluate the segmentation performance of the proposed method. Annotation time measures the labeling efficiency before model training, offering insight into the pretraining annotation workload and cost-effectiveness. After model training, prediction accuracy is assessed using the mean Intersection-over-Union (mIoU), a standard measure of segmentation performance. Finally, an efficiency metric (images/s) quantifies the prediction speed of the trained model, i.e., the number of images processed per second. Together, these metrics address annotation cost, segmentation accuracy, and computational efficiency, offering a well-rounded evaluation of the proposed approach.
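The mIoU metric can be computed as follows; skipping classes absent from both the prediction and the ground truth is an implementation choice of this sketch:

```python
import numpy as np

def miou(pred, target, num_classes):
    """Mean Intersection-over-Union over all classes.

    pred, target: integer class maps of the same shape."""
    ious = []
    for k in range(num_classes):
        inter = np.logical_and(pred == k, target == k).sum()
        union = np.logical_or(pred == k, target == k).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```
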

Training Confguration.
For producing attention maps, the AMR classification branches were trained for 20 epochs with a batch size of 16 images and an initial learning rate of 0.001. A stochastic gradient descent algorithm was used for network optimization with a weight decay of 0.0001. Data augmentation was also applied to the training samples to improve training efficiency. After obtaining the pseudolabels, the U-net segmentation network was trained in a two-stage manner: it was first pretrained for 50 epochs using the initial pseudolabels, and its segmentation performance was then further developed via recursive training for 50 iterations. The initial learning rate was 0.0001, and a weighted cross-entropy loss was used to handle the unbalanced data. All tasks were performed on a workstation (CPU: dual Intel® Xeon® E5-2680 v4 @ 2.40 GHz; RAM: 64 GB; GPU: ASUS GeForce RTX 2060 D6 12G).

Study of the Annotation Workload.
An experiment was designed to contrast the annotation workload of the proposed weakly supervised method with that of traditional fully supervised methods. In this experiment, our method employed image-level annotation, which simply involves placing images into different category folders. For the fully supervised method, three representative pixel-level annotation tools were selected to scrutinize the annotation workload: Adobe® Photoshop (PS), LabelMe [63], and the online annotation tool EasyDL [64].
As illustrated in Figure 5, PS uses the magic wand and lasso tools, which are especially effective for objects in sharp contrast with the background. LabelMe, an open-source Python tool, extracts annotations from control points placed along object boundaries. EasyDL, an algorithm-assisted tool, requires users to add or remove anchor points and automatically identifies object and background regions to produce pixel-level labels.
Twenty images from the dataset were selected for the annotation workload experiment. Each experiment was repeated three times under consistent conditions by individuals with varying proficiency levels. The recorded annotation time characterized the overall workload, and the results for the different annotation methods are summarized in Figure 6. Figure 6 reveals that LabelMe exhibits the highest annotation time, attributed to the complexity of placing boundary control points. In contrast, PS, with its magic wand tool, proves quicker than LabelMe. Among the pixel-level tools, EasyDL records the lowest annotation time and deviation. However, EasyDL's pixel-level annotation time is still approximately seven times longer than annotating image-level labels. These results demonstrate the significantly reduced annotation workload of the proposed weakly supervised method compared with conventional fully supervised methods.

Comparison between AMR and Grad-CAM for CAM Generation.
Section 3.2 proposes using AMR to generate effective CAMs that provide semantic and localization cues for segmentation. To assess AMR's effectiveness in activating complete object regions, a comparison was conducted with the state-of-the-art gradient-weighted class activation mapping (grad-CAM) method [65]. In this study, grad-CAM used gradient information to weight the feature maps extracted from the last convolutional layer of ResNet50, which allows CAMs to be created without retraining, preserving the existing model structure and parameters. A set of thresholds th = {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8} was defined to convert CAMs into synthetic labels. For semantic class k, if the CAM value A_k(x, y) ≥ th_n, the pixel at spatial location (x, y) was annotated as class k, and otherwise as background.
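The thresholding rule can be sketched as follows; resolving overlaps between foreground classes by taking the strongest activation is an assumption of this sketch:

```python
import numpy as np

def cams_to_labels(cams, th):
    """Convert CAMs to a synthetic label map: a pixel gets foreground
    class k (1-based here, 0 = background) when its strongest CAM
    response is for class k and exceeds the threshold th.

    cams: (K, H, W) normalized activation maps for K foreground classes."""
    best = cams.argmax(axis=0)          # strongest foreground class
    confident = cams.max(axis=0) >= th  # passes the threshold
    return np.where(confident, best + 1, 0)
```
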
Table 2 compares the IoU metrics on the training subset of the built U-net-dataset for different th values. "Crack," "Marker," and "Background" denote the IoUs of the three classes, and "All" denotes their mean, namely, the mIoU. The grad-CAM results were reproduced using its publicly available implementation, and all results in Table 2 were obtained without DenseCRF postprocessing. The results indicate that AMR produces more accurate CAMs than grad-CAM at every threshold. The performance of the proposed CAMs is generally improved by about 2%, and when th = {0.5, 0.6}, "All" achieves the best results. Figure 7 shows typical examples of the CAM results obtained using AMR and grad-CAM; the former provides more complete activation regions than the latter. These results demonstrate that the implementation of AMR in WSSS of fatigue cracks is promising, but given the limited training data, the authors recommend further validation when more data are available.

Ablation Study.
The configuration of the proposed method contains three main components: the AMR-based CAM generation, the segmentation refinement module, and the proposed two-stage training. To investigate the effectiveness of each component, ablation studies were conducted, and the corresponding results are listed in Table 3. The mIoU metric during training was compared for each experiment configuration, as shown in Figure 8. Configurations C1, C3, and C4 share the same mIoU trend in the first 50 epochs and overlap with each other before epoch 50 in Figure 8. Some example images from the two-stage training phases are illustrated in Figure 9, where ground truth refers to the true labels of the input images.
In the ablation studies, direct segmentation training on the initial pseudolabels (configuration C1) was adopted as the baseline. As shown in Table 3 and Figure 9, the initial pseudolabels are very coarse, and the baseline performance was 71.9%. By applying the two-stage training with the segmentation refinement module, the segmentation performance improves gradually during the recursive training (iterations 50–100 in Figure 8) and finally reaches 76.5%. This demonstrates the effectiveness of the proposed method.
Experiments were also conducted to verify the effectiveness of the proposed two-stage training by comparing configurations C2 and C4. The direct training method (C2) starts by improving the initial pseudolabels with the segmentation refinement module and then uses these refined labels to train the segmentation network for 100 epochs. Figure 8 shows the comparison. With direct training, the performance increases only at the beginning and stabilizes at a low mIoU, whereas with the proposed two-stage training, the performance continues to increase in the second stage and reaches a much higher mIoU. This demonstrates that the two-stage training method is effective: the scheme progressively mines common object features from previous masks and then expands more reliable object regions with the assistance of the segmentation refinement module, so the performance increases rapidly to a satisfactory level.
In some cases, the initial pseudolabels produced by AMR are still incomplete. To mine the whole object regions, a segmentation refinement module is incorporated into the recursive training. To evaluate its effectiveness, an experiment was conducted on the training framework without refinement (configuration C3), and the performance was compared with that of configuration C4. As shown in Figure 8, without the refinement module, some misidentified object regions grow gradually, and the performance decreases continuously during recursive training. Figure 9 shows how the predictions improve over iterations thanks to the refinement module. By exploiting the available morphology information, false-detected segments are assimilated into their correct categories (Figures 9(b) and 9(d)), and broken cracks are gradually connected into a whole (Figures 9(a), 9(c), and 9(e)). These results demonstrate the effectiveness of the proposed segmentation refinement module.
The performance differences among configurations C1 to C4 are intricately linked to the recursive training process and the role of the segmentation refinement module. Throughout recursive training, the model's predictions from the previous iteration serve as training labels for the next iteration. In the absence of segmentation refinement, as observed in C3 (Figure 8), errors accumulated during iterative training can amplify, causing a gradual decline in performance. On the contrary, configuration C4, leveraging segmentation refinement, undergoes an iterative correction process. This module guides the model through training cycles, progressively rectifying errors and enhancing performance. This iterative refinement proves pivotal in steering the model toward improved predictions, countering the cumulative degradation seen in C3. In configuration C2, however, where the segmentation refinement module optimizes the initial pseudolabels only once at the beginning of training, its impact is more restrained. This one-off optimization, compared with the continuous refinement in C4, restricts the model's exposure to refined and contextually rich labels, resulting in a more modest improvement, as shown in Table 3.
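The control flow of the two-stage scheme can be sketched schematically. In the toy below, a simple intensity threshold stands in for the segmentation network and a 1-D morphological closing stands in for the refinement module; everything (the signal, the threshold rule, the iteration count) is an illustrative assumption that only demonstrates stage 1 (train once on initial pseudolabels) and stage 2 (recursively predict, refine, and retrain), not the authors' actual network or data.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Toy 1-D "image": a dark crack (low intensity) on a bright background.
image = np.full(100, 0.9) + rng.normal(0, 0.01, 100)
image[40:60] = 0.1  # true crack region

# Coarse initial pseudolabel covering only part of the crack,
# mimicking CAM-style seeds that highlight discriminative regions.
pseudolabel = np.zeros(100, dtype=bool)
pseudolabel[45:52] = True

def train_and_predict(image, labels):
    """Stand-in for segmentation training: learn a threshold from
    the labelled pixels and predict all pixels darker than it."""
    threshold = image[labels].max() + 0.05
    return image < threshold

def refine(mask):
    """Stand-in refinement module: close small gaps."""
    return ndimage.binary_closing(mask, structure=np.ones(3, bool))

# Stage 1: learn from the initial pseudolabels once.
pred = train_and_predict(image, pseudolabel)

# Stage 2: recursively refine predictions and retrain on them.
for _ in range(3):
    labels = refine(pred)
    pred = train_and_predict(image, labels)

print(pred[40:60].all())  # the full crack is eventually recovered
```

The point is the loop structure: each iteration's refined prediction becomes the next iteration's label, which is where the refinement module's error correction (or, without it, error accumulation) enters.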

Comparison of the Trained Segmentation Performance.
The proposed method was further compared with fully supervised methods and several other weakly supervised methods. The fully supervised learning was conducted directly on the manually labeled pixel-level ground-truth data. The other weakly supervised segmentation methods followed the workflow of first generating pseudolabels and then using these synthetic labels to directly train the segmentation network. In addition to the U-net adopted in the proposed method, several popular architectures were chosen as alternative segmentation networks to provide a more comprehensive comparison. All of the models were trained to convergence using the same training parameters, and the trained models were evaluated on the test set. The evaluation metrics are listed in Table 4.
As shown in Table 4, the weakly supervised methods yield obviously lower mIoU than the fully supervised methods, which is attributed to the incompleteness of the initial pseudolabels. Compared with the FCN-based methods, U-net achieves better prediction performance. This is because U-net has a finer upsampling process with more channels, and the same-level encoder and decoder features are concatenated in U-net rather than simply added as in FCN. Although soft attention is implemented at the skip connections in attention U-net (AttU-net), the mIoU metrics of AttU-net and U-net are very close under both the fully supervised and weakly supervised learning configurations. For the current test samples, our method with U-net achieves a higher mIoU value than the other weakly supervised methods and is only 1.6% lower than the U-net-based fully supervised method. Besides, the higher efficiency value obtained by our method indicates its capability to process more images in a given time frame.
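The mIoU values compared above follow the standard definition: per-class intersection over union computed from a confusion matrix, averaged over the classes. A minimal NumPy version (not the authors' evaluation code; the three classes background/crack/marker match the task described in this paper) is:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes, computed
    from a flattened confusion matrix."""
    pred, target = pred.ravel(), target.ravel()
    conf = np.bincount(num_classes * target + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)  # guard against empty classes
    return iou.mean()

# Toy 3-class example (0 = background, 1 = crack, 2 = marker).
gt   = np.array([[0, 0, 1], [0, 2, 2], [0, 0, 2]])
pred = np.array([[0, 0, 1], [0, 2, 0], [0, 0, 2]])
print(round(mean_iou(pred, gt, 3), 3))  # → 0.833
```

Because every pixel contributes to exactly one confusion-matrix cell, the metric penalizes both missed crack pixels (lower intersection) and false positives (larger union), which is why incomplete pseudolabels depress the weakly supervised scores.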
A comparison of segmentation results for typical damage images is shown in Figure 10. In Figures 10(a) and 10(b), image patches with different background colors under normal conditions are used to test the trained models. All models provide good predictions for them. However, the FCN-based prediction is relatively rough, which may be owing to the lack of sufficient localization information during the upsampling process. Figures 10(c) and 10(d) show prediction results for very thin cracks, and our method makes better predictions than the other weakly supervised methods, whose inferences of fatigue cracks are incomplete. To evaluate the model's robustness to surface interference, an image patch with cracks on a contaminated surface is fed into the trained models, as shown in Figure 10(e). Our proposed model is able to discriminate stains with minor errors, but its prediction of fatigue cracks is not as accurate as that of the fully supervised methods. As shown in Figure 10(f), the crack-like weld edges are correctly identified as background by all models, and among the weakly supervised methods, the proposed method provides more satisfactory results. However, the models fail in some cases with confusing construction lines and tiny markers, as illustrated in Figure 10(g).
Overall, the higher mIoU results compared with other weakly supervised methods, together with the accurate segmentation outcomes achieved for typical damage image patches, demonstrate the promise of the proposed method for effective crack detection.

Assessment of Model Performance under Complex Real-Bridge Conditions.

Section 4.5 shows the promising performance of our proposed method. However, only cropped image patches with sizes of 512 × 512 pixels are visually illustrated in Figure 10. In this section, the trained model performance is further visualized using original images with sizes of 4,928 × 3,264 or 5,152 × 3,864 pixels. These larger images better capture and reflect the complexity inherent in real-bridge environments, offering a comprehensive assessment of the trained model's strengths and limitations. Four typical real-bridge conditions are considered as follows.
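Before turning to the individual conditions, note that applying a model trained on 512 × 512 patches to such full-resolution images is typically done patch-wise. The sketch below shows one common tiling scheme with overlap averaging; the tile size, stride, and the thresholding "model" stand-in are illustrative assumptions, not details taken from the paper (the code assumes the image is at least one tile in each dimension).

```python
import numpy as np

def predict_tiled(image, model, tile=512, stride=384):
    """Run a patch-wise model over a large image with overlapping
    tiles and average the predictions in overlapping regions."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=float)
    weight = np.zeros((h, w), dtype=float)
    ys = list(range(0, h - tile + 1, stride))
    xs = list(range(0, w - tile + 1, stride))
    # Make sure the bottom/right borders are covered.
    if ys[-1] + tile < h:
        ys.append(h - tile)
    if xs[-1] + tile < w:
        xs.append(w - tile)
    for y in ys:
        for x in xs:
            patch = image[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] += model(patch)
            weight[y:y + tile, x:x + tile] += 1
    return out / weight

# Dummy "model": thresholds the patch (placeholder for a trained
# segmentation network returning per-pixel crack probabilities).
model = lambda p: (p < 0.5).astype(float)

image = np.random.default_rng(0).random((900, 1300))
mask = predict_tiled(image, model)
print(mask.shape)  # → (900, 1300)
```

Averaging overlapping tiles smooths the seams that would otherwise appear at patch borders; the averaged map would then be thresholded to obtain the final crack mask.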
4.6.1. Ideal Inspection Conditions. The model's performance is first evaluated under ideal inspection conditions, where the crack background is clean and free of distractions and the crack markers are neatly applied. In Figure 11, the segmentation results demonstrate accurate detection of cracks with varying widths. Our method also successfully extracts most of the crack markers; only very few pen strokes are identified as cracks, as indicated by the dashed-line frames. This can be attributed to the resemblance of pen strokes to cracks in terms of color and shape. Overall, these findings provide a baseline understanding of the model's accuracy and segmentation quality under optimal conditions.

4.6.2. Varying Lighting Conditions. The results presented in Figure 12 reveal that our proposed method successfully detects and extracts the position and morphological information of cracks under varying lighting conditions. However, the dynamic nature of lighting introduces slight errors in the model's performance, as indicated by the dashed-line frames. For instance, in Figure 12(a), the smooth top plate-fillet weld appears brighter than the surrounding base material due to reflections. Along the boundaries where the weld intersects with areas of varying brightness, crack-like features are occasionally formed, leading to misclassifications by the algorithm. Similarly, in Figure 12(c), the presence of shadows creates a contrast with the bright background, resulting in several pixels along the boundaries being incorrectly identified as cracks. However, in Figure 12(d), where the lighting is dim, the model successfully avoids misclassifications near shadow boundaries, owing to the lack of strong contrasts and intensity-gradient changes. Moreover, in Figure 12(b), the proposed method effectively identifies most of the crack and marker regions even under dim lighting conditions.

4.6.3. Cluttered Backgrounds. The prediction performance of our method is further assessed under the
challenging condition of cluttered backgrounds. In Figure 13, genuine cracks and markers are accurately identified. Figure 13(a) shows the detection of some background pixels as cracks at primer color transition areas, attributed to the visual complexity caused by gradients and borders between different primer colors. In Figure 13(b), occasional false detections of cracks occur due to complex thin and light markings that visually resemble cracks. Figure 13(c) demonstrates instances where dot-like stains are sometimes misclassified as markers due to their visual similarity. Finally, in Figure 13(d), the needle-like stains are identified as cracks, as expected, given their elongated and thin crack-like characteristics. To mitigate these minor errors, the authors recommend calculating the area of each connected component in the predicted mask and removing the false positives corresponding to small connected components, such as dot-like stains. Despite these challenges, the overall model performance remains satisfactory.
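The recommended postprocessing step can be sketched directly with connected-component analysis. The helper below is a hypothetical implementation (the `min_area` threshold of 50 pixels is an illustrative assumption to be tuned per dataset): it labels the predicted mask and keeps only components of at least `min_area` pixels.

```python
import numpy as np
from scipy import ndimage

def drop_small_components(mask, min_area=50):
    """Remove connected components smaller than `min_area` pixels,
    e.g. dot-like stains falsely predicted as cracks or markers."""
    labeled, n = ndimage.label(mask)
    if n == 0:
        return mask.astype(bool)
    areas = ndimage.sum(mask.astype(bool), labeled, index=np.arange(1, n + 1))
    keep_labels = np.flatnonzero(areas >= min_area) + 1
    return np.isin(labeled, keep_labels)

# Toy mask: one elongated region (a crack) plus two small blobs (stains).
mask = np.zeros((50, 50), dtype=bool)
mask[10:12, 5:45] = True  # 80-pixel "crack"
mask[30, 30] = True       # 1-pixel dot-like stain
mask[40:42, 5:7] = True   # 4-pixel blob

filtered = drop_small_components(mask, min_area=50)
print(int(filtered.sum()))  # → 80 (only the crack survives)
```

An area threshold is a blunt filter: it cannot reject the elongated needle-like stains discussed above, which share the crack's large-area, thin geometry, so it complements rather than replaces the learned segmentation.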

4.6.4. Obstacles or Confusions Caused by Irrelevant Objects.
Figure 14 showcases the model's performance when faced with obstacles or confusions caused by irrelevant objects. While genuine cracks and markers are accurately detected, some prediction errors remain, as illustrated in Figure 14. Given that this study aims to enhance the accuracy of traditional WSL-based crack segmentation and reduce the annotation burden of FSL-based methods, the proposed method successfully achieves this goal, as evidenced by comparable mIoU and visualized segmentation results, along with reduced annotation time compared with previous studies [13, 33, 42, 48, 66].

Conclusions
This paper introduced an improved WSL-based semantic segmentation method for accurate fatigue crack detection in steel bridge girders. The proposed method utilized the annotation map refinement (AMR) technique to generate high-quality initial pseudolabels, overcoming the limitation of conventional WSL-based methods that highlight only the most discriminative regions. These pseudolabels were then used to train the segmentation model in a two-stage approach. First, the model learned essential semantic and localization information from the initial labels. Then, the model was further refined iteratively using a segmentation refinement module equipped with postprocessing algorithms. Experimental evaluations compared the proposed method with different labeling tools and state-of-the-art techniques, demonstrating faster image-level annotation and the superiority of AMR in generating more accurate and complete object regions, which improved the Intersection over Union (IoU) accuracy of the pseudolabels by approximately 2%. Ablation studies confirmed the effectiveness of the main components, and comparisons with traditional WSL-based and FSL-based methods revealed the superior performance of the proposed method. The visualizations under real-bridge conditions showcased the model's ability to accurately detect genuine cracks and markers. However, further optimization, including data augmentation, is needed to enhance performance under challenging conditions. Overall, our method achieves inference results comparable to those of FSL-based approaches while significantly reducing the annotation workload. Further validation is recommended to assess its effectiveness in more diverse scenarios, and future research should focus on studying the effect of segmentation network structures and integrating the proposed method with more advanced networks to enhance its performance.

4.1. Dataset and Experimental Setup. 4.1.1. Dataset. The original dataset employed in this paper was granted by the organizing committee of the 1st International Project Competition for Structural Health Monitoring (IPC-SHM 2020).

Figure 3: Example images of the proposed assimilation process. For illustration purposes, only the crack assimilation process is shown here.

Figure 4: Example images of the proposed connection process.

Figure 6: Comparison of the annotation time using different methods.

Figure 7: Examples of CAMs generated by the AMR and Grad-CAM methods on the training subset of the built U-net dataset.

Figure 8: Change of mIoU on the validation set during the training process. Configurations C1, C3, and C4 have the same mIoU trend in the first 50 epochs and thus overlap each other before epoch 50.

Figure 10: Example segmentation results generated by different networks under full or weak supervision: (a, b) normal condition, (c, d) tiny crack, (e) contaminated surface, (f) weld line edges, and (g) misidentified example. F and W denote fully supervised and weakly supervised, respectively.
DL-Based Crack Detection Methods. With the development of high-performance graphics processing units (GPUs) and parallel computing, DL-based techniques are gaining prominence in computer vision-based surface damage detection. CNNs do not require manual construction of features or prior knowledge of crack shape, texture, or contextual information. DL-based crack detection methods have been widely used in civil engineering and are categorized into two main types: patch-level and pixel-level methods.

Table 1: Detailed operations for each layer in U-net.

Table 2: Comparison of the accuracy in terms of IoU (%) for CAMs generated by the AMR and Grad-CAM methods on the training subset of the built U-net dataset.

Table 3: Results of the ablation study.

Table 4: Comparison of the evaluation metrics on the test set.