Two-Stream Boundary-Aware Neural Network for Concrete Crack Segmentation and Quantification



Introduction
Cracks and associated crack patterns can be indicators of the stress state, loss of durability and reliability, and thus of the service life of concrete structures. Therefore, certain cracks and crack patterns also provide important information about the deterioration of the concrete structure. Existing cracks can significantly accelerate corrosion and expansion of the reinforcement by the corrosion products and finally spalling of the concrete surface, which leads in the end to a reduction of the load-bearing capacity and the service life of the reinforced concrete structure. Frequent structural monitoring with reliable reporting of the condition of the inspected infrastructures is a necessary procedure to maintain their long-term service capabilities [1][2][3][4]. Manual inspections, which have been used for decades, are the most widely used methods for crack detection. However, owing to the subjective judgment of inspectors and dangerous working conditions, they are often criticized for being time-consuming and not sufficiently accurate [5,6].
To overcome these shortcomings, several automated vision-based techniques for crack detection have been developed in the past few years. Classical computer vision approaches, such as image processing techniques (IPTs), have been introduced to the field of crack detection [7][8][9][10][11][12]. IPTs extract features from images through elaborately designed extractors and detect cracks using thresholds or trained classifiers. Still, a major concern is that prediction performance relies on the quality of the hand-crafted features, which is inevitably limited by subjectivity and domain expertise [13]. Further, a great number of preprocessed images are required; this makes the detection process inflexible, tedious, and inefficient. Moreover, these hand-crafted features, built on low-level image cues, cannot distinguish between cracks and complex backgrounds [14]; thus, they are less applicable to images with large variations.
Currently, deep learning techniques are driving advances in computer vision to tackle the drawbacks of classical IPTs; they can automatically identify intricate structures in large-scale data using models with multiple processing layers [15][16][17][18][19][20]. Convolutional neural networks (CNNs) are the most widely used models for automated feature learning and supervised detection [21][22][23][24][25][26]. Multiple efforts have been made to implement CNN-related methods in pixel-level crack detection. Li et al. [27] employed a fully convolutional network (FCN) [28]-based model for multiple damage detection, including cracks and spalling. Mei et al. [29] and Pan et al. [13] adopted DenseNet (as the backbone) together with loss functions and attention modules, respectively, to segment concrete cracks. The UNet [30] architecture has been used for concrete surface crack segmentation [31,32]. Kang et al. [33] utilized an integrated model based on Faster R-CNN [34] and a modified IPT for crack detection and quantification. In addition, a two-level technique consisting of Faster R-CNN and Mask R-CNN [35] was devised for detecting and measuring the damage on historic glazed tiles [36]. Moreover, the idea of a digital twin of concrete structures has been proposed, for which images of damage are considered important data.
Nevertheless, cracks in engineering practice occur under various scenarios, and the current crack segmentation methods often obtain suboptimal detection results, for the following reasons. First, most of them focus on crack detection across monotonous backgrounds, such as pure concrete surfaces and pavement surfaces. However, finding a suitable network architecture to segment cracks against complex backgrounds and under varying illumination, which is the more realistic and practical problem, is difficult. Second, the size of cracks varies dramatically, with an order of magnitude difference between small and large cracks. Furthermore, for discontinuous cracks, the size of certain discontinuous details and the major section of the crack differ significantly; these differences are crucial for determining the current stable state of the fracture and whether it will continue to spread. Third, in practice, many cracks have complex topological shapes and very large differences in terms of width, as illustrated in Figure 1. The results of most existing methods tend to show blurry boundaries and inadequate segmentation, while accurate edge segmentation is the premise of obtaining the crack width.
The boundaries of cracks are crucial in crack segmentation, especially for width calculation. After being coupled with boundary detection, crack segmentation can be treated as a multi-task learning (MTL) problem [37]. MTL can improve segmentation performance if the associated tasks share complementary information. Evidence has been provided in the existing literature for certain pairs of tasks, i.e., detection and segmentation [34,38], segmentation and depth estimation [39,40], and segmentation and edge detection [41,42]. Considering these observations, researchers started designing architectures capable of learning shared representations from multi-task supervisory signals, such as cross-stitch networks [43], multi-task attention networks [44], pattern-affinitive propagation networks [40], and multi-scale task interaction networks [45]. There have been several studies on the joint learning of semantic segmentation and boundary detection. Ding et al. [46] and Liew et al. [41] proposed to learn the boundary as an additional semantic branch to boost the segmentation performance for scene segmentation. Marmanis et al. [47] and Liu et al. [48] combined semantic segmentation with edge detection to reduce the semantic ambiguity in remote sensing tasks. These works demonstrate the validity of joint learning of segmentation and boundary detection. However, the joint learning of boundary detection and segmentation is seldom investigated in the crack segmentation task. Yamaguchi and Hashimoto [49] introduced a crack detection method for a concrete surface image based on a percolation model, yet handcrafted features are needed with this approach. FCN and structured forests with wavelet transform (SFW) were combined to detect tiny cracks in steel beams [50]; in that work, edge detection was performed using multi-scale structured forests and wavelet maximum modulus edge. Although edge detection was utilized to improve crack segmentation performance, SFW is time-consuming, and the proposed method is not an end-to-end deep learning approach.
Crack segmentation is a binary pixel-level classification task. Whether a pixel belongs to a crack largely depends on the high-resolution representation, which contains low-level information of the image. Here, high resolution refers to the high-resolution representations in segmentation neural networks such as SegNet [51], U-Net [30], DeconvNet [52], and HRNet [53]. Unlike other segmentation tasks, such as medical and satellite images, high-resolution representation plays a key role in crack segmentation. However, most current CNN models tend to lose high-resolution details in complex scenes, which leads to blurry object boundaries [34,39].
To overcome these limitations and construct a novel crack assessment framework, this contribution introduces a two-stream neural network architecture for crack image segmentation under various scenarios, namely, the boundary-aware crack segmentation (BACS) network. A high-resolution network (HRNet) [53,54] is utilized in the segmentation branch, providing strong high-resolution representations for cracks in complex backgrounds and variable illumination. A dynamic feature fusion (DFF) network [55], which adaptively assigns different fusion weights for different input images and locations, is utilized in the edge branch to produce more accurate and sharper edge predictions. As a result, high-resolution features can be further improved with the aid of the edge branch to enhance the quality of feature representation, contributing to a more accurate and efficient crack detection process for distinguishing crack and non-crack at the pixel level.
This paper is organized as follows. In Section 2, the proposed BACS method composed of two branches is presented in detail. In Section 3, BACS is tested on a concrete crack dataset, and a comparison with state-of-the-art methods and a quantification study are presented. Finally, Section 4 summarizes the concluding remarks.

Methodology
Throughout this section, the proposed BACS for crack segmentation is presented. As depicted in Figure 2, it consists of two streams of networks. The first branch of the network, namely, the segmentation branch, is HRNet. The segmentation branch is utilized to extract the overall semantic features of images. The second branch, namely, the edge branch, processes edge information in the form of semantic boundaries. The edge branch is constrained to process only boundary-related information using DFF. Semantic features from the segmentation branch are then fused with boundary features from the edge branch to produce a refined segmentation result, especially around boundaries. Next, each of the modules in our framework is described in detail. Codes and the dataset will be made available at https://github.com/GaoyangLiu/BACS.

Segmentation Branch.
HRNet is adopted as the backbone of the segmentation branch because it has two advantages in comparison to existing networks for segmentation tasks. First, it connects high-to-low resolution subnetworks in parallel, rather than in series, as is done in most existing solutions. Thus, it is possible to maintain the high resolution instead of recovering the resolution through a low-to-high process, and accordingly, the predicted result is potentially more precise spatially. Second, most existing fusion schemes aggregate low-level and high-level representations. Instead, HRNet performs repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution ones of the same depth and similar level, and vice versa; this results in high-resolution representations that are also rich for crack segmentation. Consequently, the predicted crack segmentation result is potentially more accurate, with the largest gains on thin and small objects in complex backgrounds. The architecture is illustrated in Figure 3.
HRNet starts from a high-resolution subnetwork as the first stage and gradually adds high-to-low resolution subnetworks one by one; this strategy forms new stages and connects the multi-resolution subnetworks in parallel. As a result, the resolutions for the parallel subnetworks of a later stage consist of the resolutions from the previous stage and an extra lower one. Exchange units across parallel subnetworks are introduced so that each subnetwork repeatedly receives information from the other parallel subnetworks. The following example shows the scheme for exchanging information. The third stage is divided into three exchange blocks, and each block is composed of three parallel convolution units with an exchange unit across the parallel units, as shown in Figure 4.
In Figure 4, $C_{sr}^{b}$ represents the convolution unit in the $r$-th resolution of the $b$-th block in the $s$-th stage, and $\varepsilon_{s}^{b}$ is the corresponding exchange unit. Semantic information is exchanged among the different branches in the exchange unit. The aggregation of information by the exchange unit is illustrated in Figure 5.
The exchange units consist of upsampling and downsampling operations across various resolutions. In contrast to most existing fusion schemes that aggregate low-level and high-level representations, HRNet repeatedly performs multi-scale fusions to boost the high-resolution representations. This strategy is suitable for crack segmentation, where the high-resolution feature plays an important role with regard to boundary accuracy.
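The fusion performed by an exchange unit can be sketched for two resolutions as follows. This is only an illustration under simplifying assumptions: nearest-neighbor upsampling and 2 × 2 average pooling stand in for the learned resampling layers of HRNet, and all function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(fmap):
    """2x2 average pooling, a stand-in for a learned strided convolution."""
    h, w = len(fmap), len(fmap[0])
    return [[(fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def add_maps(a, b):
    """Element-wise sum of two maps with identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def exchange(high, low):
    """Each resolution sums its own map with the resampled map of the other."""
    new_high = add_maps(high, upsample2x(low))
    new_low = add_maps(low, downsample2x(high))
    return new_high, new_low


# Toy feature maps (values are arbitrary).
high = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
low = [[10.0, 20.0], [30.0, 40.0]]
new_high, new_low = exchange(high, low)
```

Both outputs keep their own resolution, so the high-resolution stream is never reconstructed from a coarser one.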

Edge Branch.
The goal of the edge branch is to extract object boundaries for guiding the segmentation of boundaries and thin structures of cracks. To prevent a large loss of image details due to input downsampling, particularly for elongated thin parts, the edge stream processes the original images directly without resizing. To facilitate edge learning, the image gradient, which can be easily computed using the Sobel filter [41,56,57], is appended to the input tensors. The commonly used 3 × 3 convolution kernels are adopted to compute the horizontal and vertical gradients $G_x$ and $G_y$, as illustrated in the following equations:

$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * I, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * I,$$

where $I$ is the input image. The magnitude of a single channel is obtained by

$$G = \sqrt{G_x^2 + G_y^2}.$$

This procedure is applied to each of the RGB channels. The image gradient is the square root of the gradient summation of each channel.
Finally, the gradient is normalized to the range [0, 1]. Some typical crack images and corresponding gradients for the Sobel filter are shown in Figure 6. It can be seen that the gradients of cracks on clean backgrounds contain less noise. However, the Sobel filter does not consider the context of a pixel, so cracks with complex backgrounds tend to produce more noise in the gradient output. The gradient channel is appended to the RGB image as the fourth channel. The concatenated image, together with four feature maps from different stages of HRNet, is then fed into basic blocks to include information from the different stages. Then, norm blocks are utilized to reduce the channels to the number of predicted categories, which is 1 in this study, denoting whether a pixel is a crack or not.
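The gradient channel described above can be sketched in plain Python. Several details are assumptions rather than the authors' implementation: border pixels are replicated, the per-channel results are combined as the square root of the summed squared magnitudes (one reading of the text), and normalization divides by the maximum value.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(channel, kernel):
    """Correlate a 2D channel with a 3x3 kernel, replicating border pixels."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    acc += kernel[di + 1][dj + 1] * channel[ii][jj]
            out[i][j] = acc
    return out


def gradient_channel(rgb):
    """rgb: list of three 2D channels. Returns a gradient map scaled to [0, 1]."""
    h, w = len(rgb[0]), len(rgb[0][0])
    total = [[0.0] * w for _ in range(h)]
    for channel in rgb:
        gx = filter3x3(channel, SOBEL_X)
        gy = filter3x3(channel, SOBEL_Y)
        for i in range(h):
            for j in range(w):
                total[i][j] += gx[i][j] ** 2 + gy[i][j] ** 2
    grad = [[math.sqrt(v) for v in row] for row in total]
    peak = max(max(row) for row in grad)
    if peak > 0:
        grad = [[v / peak for v in row] for row in grad]
    return grad


# Toy image: a dark vertical stripe (column 2) on a bright background.
channel = [[1.0, 1.0, 0.0, 1.0] for _ in range(4)]
rgb = [channel, channel, channel]
grad = gradient_channel(rgb)
rgbg = rgb + [grad]  # gradient appended as the fourth input channel
```

On this toy image the response is concentrated on the two columns flanking the stripe, which is the boundary cue the edge branch receives as extra input.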
The features from multiple scales can greatly benefit the semantic edge detection task if they are well fused. However, the prevalent semantic edge detection methods apply a fixed-weight fusion strategy in which images with different semantics are forced to share the same weights, resulting in universal fusion weights for all images and locations regardless of their different semantics or local context. The DFF strategy [55], which adaptively assigns different fusion weights for different input images and locations, is adopted in this study. This is achieved using a weight learner that infers proper fusion weights over multi-level features for each location of the feature map, conditioned on the specific input. In this way, the heterogeneity in contributions made by different locations of feature maps and input images can be better considered, which helps produce more accurate and sharper edge predictions. The detailed architecture of the different blocks and the edge detection procedure by DFF in the edge branch are illustrated in Figure 7. The feature maps from different stages are fed into the edge branch. After being processed by basic blocks and norm blocks, four single-channel edge feature maps ($F_{e1}$, $F_{e2}$, $F_{e3}$, $F_{e4}$) are generated with information from the early to the later stages of the segmentation branch. To better consider the heterogeneous contributions of the feature maps, $F_{e4}$ is fed into the adaptive weight learner to infer proper fusion weights over the multi-stage features. $F_{e1}$ to $F_{e4}$ are concatenated into a four-channel feature map. Then, element-wise multiplication and category-wise summation are applied to produce the final edge prediction.
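The final fusion step (element-wise multiplication followed by summation over the four maps) can be illustrated with a short sketch. The adaptive weight learner itself is not modeled here; its output is replaced by hand-picked toy weights, and the function name is hypothetical.

```python
def fuse_edges(edge_maps, weights):
    """Location-wise weighted sum of K edge maps.

    edge_maps, weights: lists of K 2D maps (lists of lists) of identical size.
    weights[k][i][j] is the fusion weight of map k at location (i, j).
    """
    h, w = len(edge_maps[0]), len(edge_maps[0][0])
    return [[sum(wk[i][j] * fk[i][j] for fk, wk in zip(edge_maps, weights))
             for j in range(w)]
            for i in range(h)]


# Four toy 1x2 edge maps standing in for F_e1 ... F_e4.
edge_maps = [[[0.2, 0.8]], [[0.4, 0.6]], [[0.6, 0.4]], [[1.0, 0.0]]]
# Toy weights: location 0 trusts the last stage only,
# location 1 averages the first two stages.
weights = [[[0.0, 0.5]], [[0.0, 0.5]], [[0.0, 0.0]], [[1.0, 0.0]]]
fused = fuse_edges(edge_maps, weights)
```

A fixed-weight scheme would correspond to passing the same weight at every location and for every image; here the weights differ per location, which is the property the DFF strategy relies on.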

Training Loss.
Both the segmentation and edge branches are trained with a binary cross entropy (BCE) loss, since there are only two categories, crack and non-crack, in the dataset. The similarity between predicted probabilities and ground truth is measured by

$$L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \hat{p}_i \log p_i + (1 - \hat{p}_i) \log (1 - p_i) \right],$$

where $p_i$ is the predicted probability, $\hat{p}_i$ is the ground truth label, and $N$ is the total number of pixels. The ground truth of the segmentation branch is the label of the entire image. The edge branch is supervised by the label of the edge only. The final loss is the summation of the losses from the segmentation branch and the edge branch. That is to say, the edge branch serves as an auxiliary task [58] for crack semantic segmentation. As a kind of additional regularization, the edge branch is expected to boost the performance of the ultimately desired main task.
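As a minimal sketch of this loss, assuming the BCE is averaged over pixels and the two branch losses are summed with equal weight, the computation can be written as follows. The probabilities and labels are toy values, and the clamping constant `eps` is an implementation detail added here for numerical safety.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


# Toy predictions for four pixels in each branch.
seg_loss = bce_loss([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])   # mask supervision
edge_loss = bce_loss([0.6, 0.1, 0.3, 0.1], [1, 0, 0, 0])  # edge supervision
total_loss = seg_loss + edge_loss
```

An uninformative prediction of 0.5 for every pixel gives a loss of log 2, which is a convenient sanity check when monitoring training.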

Task Affinity between Crack Segmentation and Edge Detection.
In contrast to single-task methods, joint-task learning methods offer a promising direction to improve predictions by utilizing task correlation information so that the tasks boost each other. However, the joint learning of multiple tasks can lead to negative transfer, degrading the performance of a single task if information sharing happens between unrelated tasks. The key point is the degree to which tasks share common structures. A statistical analysis [59] of the second-order patterns across boundary detection and crack segmentation is performed to quantify pixel affinities. Semantic pixels are considered similar when they belong to the same category. The number of matching similar pairs is accumulated at the same spatial positions across the two types of corresponding images.
As shown in Figure 8, affinity pairs (green points) at common positions may exist in different tasks. Meanwhile, some common dissimilar pairs (red points) exist across tasks. Take the affinity pairs in Figure 8 for example; pixels ($p_1$, $p_2$) in the background and pixels ($p_3$, $p_4$) in the crack labels are affinity pairs in both segmentation and edge labels. Pixels ($p_5$, $p_6$), with $p_5$ in the edge and $p_6$ in the crack label, are an affinity pair in the segmentation labels but a dissimilar pair in the edge labels. The rate of matched affinity pairs can be calculated by counting the matched pairs across segmentation and edge labels and then dividing by the number of all pixel pairs. The rate of matched affinity pairs is 89.4% in this study. This statistical result shows that the large majority of pairs across the two tasks are of high affinity, which indicates that crack segmentation and edge detection share common structures in images. Therefore, the edge branch has the potential to boost the performance of the segmentation branch.
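The counting procedure can be sketched as below. The sketch assumes that a pair is "matched" when its two pixels share a category in both the segmentation labels and the edge labels, and that all unordered pixel pairs are enumerated; the analysis in [59] may sample pairs differently. The labels are toy values.

```python
from itertools import combinations


def matched_affinity_rate(seg, edge):
    """Fraction of pixel pairs that are similar in both label maps.

    seg, edge: flat lists of 0/1 labels for the same pixels.
    """
    matched = 0
    total = 0
    for a, b in combinations(range(len(seg)), 2):
        total += 1
        similar_in_seg = seg[a] == seg[b]
        similar_in_edge = edge[a] == edge[b]
        if similar_in_seg and similar_in_edge:
            matched += 1
    return matched / total


# Five toy pixels: two background, three crack (the middle crack pixel is interior).
seg_labels = [0, 0, 1, 1, 1]
edge_labels_toy = [0, 0, 1, 0, 1]
rate = matched_affinity_rate(seg_labels, edge_labels_toy)
```

Enumerating all pairs is quadratic in the number of pixels, so on real 256 × 256 images a sampled or neighborhood-restricted version would be needed.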

Dataset.
To improve generalization and demonstrate the superiority of the proposed method in various scenarios, a crack segmentation dataset consisting of three different scenarios is built for this study. In total, 1,892 crack images were collected from the existing literature and the Internet or taken by our team. The dataset is divided into three scenarios: pure cracks, complex background, and variable width, containing 1090, 432, and 370 images, respectively. The cracks in the pure crack scenario are relatively clear with a relatively large width, without background noise and illumination interference. Most of the images in the complex background scenario have complex backgrounds, such as spraying, water stains, honeycomb pitted surfaces, and other objects. Most of the cracks in the variable-width scenario have complex topological shapes that are difficult to segment, such as extremely thin cracks, cracks with large width differences, and other cracks with complex shapes. The images were divided into two main subsets: a training set with 1514 images and a testing set with 378 images. Each image is paired with a pixel-wise segmentation map, which operates as a mask covering the crack regions. All of the images have a fixed size of 256 × 256 pixels. Some examples of typical images corresponding to the three scenarios are illustrated in Figure 9.
To add segmentation masks to the crack images, the annotation tool "LabelMe" [60] is utilized for manual annotation. Users have the option to zoom in, zoom out, and annotate a crack by clicking along the boundary to get precise boundary labels. Figure 10 demonstrates several examples of images used in the concrete crack dataset, where the first, second, and third columns stand for original images, images with manual labels, and the corresponding ground truth, respectively.
After pixel-level annotation for the ground truth of the cracks, edge labels are extracted by applying the Euclidean distance transformation. The Euclidean distance transformation returns, for each pixel, the distance to the closest background pixel. With labeled crack masks as input, the smaller distances correspond to pixels closer to the edge of the binary object. To avoid discontinuity, the width of edges is set to 2 pixels. The ground truth of a crack and the corresponding edges are illustrated in Figure 11.
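A brute-force sketch of this edge extraction is shown below. It assumes that a crack pixel is labeled as edge when its Euclidean distance to the nearest background pixel does not exceed the edge width; the exact thresholding rule is not stated in the text, and an optimized distance transform would be used in practice.

```python
import math


def edge_labels(mask, edge_width=2):
    """Mark crack pixels whose distance to the nearest background pixel is small.

    mask: 2D list of 0 (background) / 1 (crack). Returns a 2D list of 0/1.
    """
    h, w = len(mask), len(mask[0])
    background = [(i, j) for i in range(h) for j in range(w) if mask[i][j] == 0]
    edges = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 1 and background:
                d = min(math.hypot(i - bi, j - bj) for bi, bj in background)
                if d <= edge_width:
                    edges[i][j] = 1
    return edges


# Toy mask: a single row crossing a 5-pixel-wide crack.
mask = [[0, 1, 1, 1, 1, 1, 0]]
edges = edge_labels(mask)
```

Only the central pixel, which lies three pixels from the background, is excluded from the edge label in this toy case.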

Training Process.
The developed model is implemented using PyTorch and trained on an Nvidia GeForce 1080TI GPU with 11 GB of memory. Transfer learning is utilized in the segmentation branch, with HRNet weights pretrained on ImageNet, to boost the performance and accelerate the training procedure. Data augmentation includes image flip, rotation, and translation. There are several hyperparameters in network training, among which the learning rate is regarded as the most important one to tune [61]. For training deep neural networks, selecting a good learning rate is essential for both better performance and faster convergence. Even optimizers that automatically adjust the learning rate can benefit from a better initial choice. To reduce the amount of guesswork regarding the choice of a good initial learning rate, a learning rate finder [62] is utilized in our experiments. Only one epoch is carried out, starting with a very low learning rate (10e-8 in this study) and ending with a very high learning rate (10e-1). The learning rate is increased after each processed batch, and the corresponding loss is logged as shown in Figure 12. The loss decreases at the beginning, then stops decreasing, and finally increases extremely quickly. It should be noted that the learning rate that corresponds to the minimum loss value is a bit too high, since we are at the edge between improving performance and gradient [62], which is around the middle of the steepest descending loss curve. In this study, the best learning rate is 5.2e-4 for BACS.
With the optimal initial learning rate, Adam [63] is employed as the optimizer. The learning rate is multiplied by 0.8 every 50 epochs. To avoid running out of memory, the batch size is set to eight during the training and validation processes. Figure 13 depicts the change in loss value during the training and validation processes, which shows that the training process gradually converges after about 60 epochs.
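The range test can be sketched as follows. Exponential spacing of the learning rates and the "largest single drop" selection rule are assumptions made for illustration; the loss values are made-up numbers, not the curve in Figure 12.

```python
def lr_sweep(lr_min, lr_max, num_batches):
    """Exponentially spaced learning rates, one per batch."""
    ratio = (lr_max / lr_min) ** (1.0 / (num_batches - 1))
    return [lr_min * ratio ** k for k in range(num_batches)]


def steepest_descent_lr(lrs, losses):
    """Learning rate at which the logged loss drops fastest between batches."""
    best = min(range(1, len(losses)), key=lambda k: losses[k] - losses[k - 1])
    return lrs[best]


# Hypothetical sweep over 8 batches and an invented loss log:
# flat at first, then a steep drop, then divergence.
lrs = lr_sweep(1e-8, 1e-1, 8)
toy_losses = [1.0, 1.0, 0.95, 0.7, 0.4, 0.35, 0.9, 3.0]
best_lr = steepest_descent_lr(lrs, toy_losses)
```

In practice the logged loss is usually smoothed before the slope is evaluated, because single-batch losses are noisy.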
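The step decay described above corresponds to the following closed form, using the initial learning rate of 5.2e-4 and the factor of 0.8 every 50 epochs given in the text; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=5.2e-4, gamma=0.8, step=50):
    """Learning rate after `epoch` completed epochs under step decay."""
    return base_lr * gamma ** (epoch // step)


schedule = [learning_rate(e) for e in (0, 49, 50, 100)]
```

In PyTorch, a schedule of this form would correspond to `torch.optim.lr_scheduler.StepLR` with `step_size=50` and `gamma=0.8`, although the text does not state which scheduler class was used.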
In comparison with the other four methods, BACS shows superior performance with a similar number of parameters. Under the pure crack scenario, all methods reach fine results. Yet, in the complex background and variable-width scenarios, BACS obtains much better mIoU than the other methods. Specifically, BACS is 7.13% and 12.03% higher than the lowest-scoring model, UNet, under the complex background and variable-width scenarios, respectively. Latency in the right column denotes the seconds per image. It should be noted that latency is highly dependent on the network architecture and hardware. Networks with complex architectures and skip connections, such as DeepLabV3++ and UNet++, are more likely to have large latencies. DeepCrack obtains the lowest latency (79 ms) among all the models. The latency of BACS, 172 ms, is larger than that of the backbone HRNet-w32 due to the edge branch. Overall, the latency is acceptable on a single 1080TI GPU. It is shown that, benefiting from HRNet in the segmentation branch, the fused high-resolution features can be treated as neutralization which aggregates the multiple-level features from coarse to fine. Additionally, the edge branch with the DFF can maintain the boundary information, which is critical for the crack segmentation task. Therefore, BACS achieves a significant performance improvement, especially in the complex background and variable-width scenarios.
To further investigate the performance of the proposed BACS compared with DeepCrack, both are trained and validated on the original DeepCrack dataset. The result shows that BACS achieves an mIoU of 82.31%, 6.89% higher than that of DeepCrack.
Some sample images from the three different scenarios and the results for all methods are shown in Figures 14-16. Under the complex background scenario, BACS performs better than the other methods. Owing to the rich high-resolution features of HRNet, BACS produces fewer false-positive predictions. Moreover, the edge of the predicted crack is sharper and crisper, benefiting from the information fused from the edge branch. With the aid of the edge branch and DFF, BACS produces crisper and more precise segmentation results, especially along thin cracks. In Figure 17, a typical image under the variable-width scenario is illustrated; for better presentation, only the results of BACS and DeepCrack are presented. The width of cracks ranges from 1 to 32 pixels, while crack widths of 2 and 15 pixels are shown in Figure 17(a). It can be seen that BACS is able to detect very narrow cracks.

Ablation Study.
In this section, the edge branch, DFF block, and edge supervision are thoroughly analyzed to further understand their operation.

Structural Control and Health Monitoring
In the segmentation branch, HRNet maintains high-resolution representations by connecting high-to-low resolution convolutions in parallel and repeatedly conducting multi-scale fusions across parallel convolutions. The resulting high-resolution representations are robust and spatially precise. Thus, the baseline mIoUs are relatively high in all three scenarios, as shown in Table 2.
When comparing models with and without the edge branch, the edge information plays a crucial role in the variable-width scenario. The edge stream helps guide the segmentation of thin cracks. Nevertheless, one may argue that the performance gain from adding the edge branch is partially due to the increased number of parameters. Therefore, the two sources of performance boost are disentangled by comparing with a baseline whose network architecture is the same as BACS, while the edge supervision is replaced with mask segmentation supervision. In this baseline, the ground truth of the edge branch is the same mask label as that of the segmentation branch. As illustrated in Section 2.3, the final loss is the summation of the BCE losses of the segmentation branch and the edge branch. In the edge supervision experiment, the loss of the edge branch is calculated as the BCE loss between its predictions and the mask labels. This procedure aims to validate the performance improvement of the edge branch against a network with the same number of parameters. A comparison study to investigate the effect of the Sobel filter is also conducted. Without the Sobel filter, the model performs worse than with it, particularly in the pure crack scenario. This is due to the fact that the Sobel filter tends to yield image gradients with less noise in the pure crack scenario, which provides effective additional information to the edge branch. A slight performance drop is noticed in all three scenarios in the dataset when removing edge supervision. This verifies our finding that the edge branch information is essential to addressing the crack segmentation task.

Feature Maps among the Two Branches.
To further illustrate BACS in detail and validate its effectiveness qualitatively, some intermediate feature maps are presented in Figure 18. Crack images are fed into the segmentation branch, yielding four output feature maps from the high-resolution branch of HRNet. The four feature maps, one per stage, are shown in the lower part of the segmentation branch. The different stages of the convolutional layers obtain different features with various levels of information. Low layers keep more low-level information; thus, the boundaries of the crack and other dots are clear in the first two feature maps from the segmentation branch. However, low layers focus more on the sharp contrast of images, thereby containing a large amount of noise. The top layers obtain more abstract and global features, which are composed of low-level features. These global features contain much more semantic and context information, which helps determine whether a pixel belongs to the crack or the background. Since cracks are often long and thin in noisy backgrounds, the segmentation performance is more sensitive to low-level information compared with some other segmentation tasks. Hence, by maintaining high resolution and repeatedly performing multi-scale fusions, the HRNet in the segmentation branch is well suited to the crack segmentation task.
Subsequently, the four feature maps are fed into the edge branch together with the concatenation of crack images and gradient. The norm blocks of the edge branch produce raw edge outputs, which are further fine-tuned by the DFF block, as illustrated in Figure 18. The edge prediction of the DFF block is much better than the raw edge branch output. The final loss is a summation of the segmentation and edge losses. As an MTL problem, combining the segmentation and edge losses can boost the performance of crack segmentation, which is validated in Section 3.4.1. Considering the importance of low-level information in crack segmentation, the segmentation and edge branch outputs are concatenated, and the resulting five-channel tensor is fed into a basic block to produce the final one-channel prediction with the same resolution. With the aid of the edge branch, BACS yields more accurate crack segmentation results and precise crack boundaries.

Crack Width Quantification.
To validate the accuracy of the proposed method in engineering practice, crack widths on various concrete surfaces are calculated based on BACS and compared with the widths obtained by a crack width observer. These crack images are obtained using a smartphone in different locations. The crack images and the measured widths are shown in Figure 19.
There are mainly two steps for quantifying the chosen cracks, i.e., obtaining the pixel widths and mapping them to actual widths in the measurement unit. In the first step, the crack images are fed into BACS to get the predicted cracks, as shown in the second column in Figure 20(b). Then, each crack instance is skeletonized using the medial axis thinning algorithm to extract a one-pixel-wide centerline.
Since the width of a crack often varies along the crack, the crack widths are evaluated at specific pixels on the centerline. For a query pixel on the crack centerline, the crack width is computed as shown in Figure 20: (1) the crack orientation is estimated from the query pixel and its neighboring pixels on the centerline; (2) a line normal to the crack orientation is then created; (3) at both sides of the crack centerline, the crack boundary pixel that is closest to the line is extracted; and (4) the distance between the two pixels is calculated as the crack width in pixels. The second step is to map pixel widths to actual widths in the measurement unit, for which the pixel ratio $R$ (pixel/mm) between the number of crack pixels and the actual width of the crack is necessary. Considering that the ratio changes with the distance of the smartphone camera from the surface of the inspected concrete, the relation between the ratio and the distance is calibrated under laboratory conditions. As shown in Figure 21, to calibrate the relation between pixel width and distance from the smartphone to the surface of the target, experiments are performed in a quasi-static process. The fitted curve of the relation between the pixel ratio ($R$) and distance is illustrated in Figure 22, from which the pixel ratio at any working distance can be obtained.
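The four measurement steps can be sketched for a single query pixel as follows. This is a simplified variant, not the authors' procedure: the orientation is taken from the two adjacent centerline pixels, and the boundary pixels are found by walking along the normal in unit steps until the mask is left, rather than by searching for the boundary pixel closest to the normal line. All names are hypothetical.

```python
import math


def crack_width_at(mask, centerline, k, max_steps=100):
    """Width in pixels at centerline[k], measured along the local normal.

    mask: 2D list of 0/1; centerline: ordered list of (row, col) pixels;
    k must have a neighbor on both sides (1 <= k <= len(centerline) - 2).
    """
    # Step 1: orientation from the neighboring centerline pixels.
    (r0, c0), (r1, c1) = centerline[k - 1], centerline[k + 1]
    dr, dc = r1 - r0, c1 - c0
    norm = math.hypot(dr, dc)
    # Step 2: unit normal to the crack orientation.
    nr, nc = -dc / norm, dr / norm
    qr, qc = centerline[k]
    h, w = len(mask), len(mask[0])

    def last_crack_pixel(sign):
        # Step 3: walk along the normal until the crack mask is left.
        last = (qr, qc)
        for step in range(1, max_steps):
            r = int(round(qr + sign * step * nr))
            c = int(round(qc + sign * step * nc))
            if not (0 <= r < h and 0 <= c < w) or mask[r][c] == 0:
                break
            last = (r, c)
        return last

    a = last_crack_pixel(+1)
    b = last_crack_pixel(-1)
    # Step 4: distance between the two boundary pixels.
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Toy mask: a horizontal crack occupying rows 2 to 4 of a 7x5 image.
mask = [[1 if 2 <= i <= 4 else 0 for _ in range(5)] for i in range(7)]
centerline = [(3, j) for j in range(5)]
width_px = crack_width_at(mask, centerline, 2)
```

Because the width is the distance between the two boundary pixel centers, a crack that is three pixels thick yields a value of 2 here; counting the boundary pixels inclusively would add one.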
Finally, the actual crack widths are calculated using the following formula:

$$w = \frac{p}{R}, \tag{6}$$

where $w$ is the actual width of the crack and $p$ is the pixel width.
In the experiment, crack widths at ten points of three different cracks are calculated. With a working distance of 150 mm, the pixel ratio is 25.35 pixels/mm as estimated by the curve shown in Figure 22. The pixel widths predicted by BACS and the actual widths obtained by equation (6) are shown in Table 3. Table 3 shows that BACS achieves high accuracy with an error rate of 9.29%. The average absolute error of BACS is 0.0992 mm, which is approximately two pixels in the images.
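The conversion between pixel and physical widths is a single division by the pixel ratio. The sketch below uses the ratio of 25.35 pixels/mm reported for the 150 mm working distance; the 10-pixel width is a hypothetical input, not a value from Table 3.

```python
def pixel_to_mm(pixel_width, pixel_ratio):
    """Actual width in mm from a width in pixels and a ratio in pixels/mm."""
    return pixel_width / pixel_ratio


def mm_to_pixel(width_mm, pixel_ratio):
    """Inverse conversion, useful for expressing errors in pixels."""
    return width_mm * pixel_ratio


R_150MM = 25.35  # pixels/mm at a working distance of 150 mm (from the text)

width_mm = pixel_to_mm(10, R_150MM)        # hypothetical 10-pixel crack
error_px = mm_to_pixel(0.0992, R_150MM)    # reported mean absolute error
```

At this ratio one pixel covers roughly 0.04 mm, so the reported average absolute error of 0.0992 mm evaluates to about 2.5 pixels.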

Conclusions
A novel two-stream boundary-aware crack segmentation (BACS) network is proposed in this study, which explicitly combines semantic segmentation with semantically informed edge detection. The segmentation branch using HRNet aims to acquire strong high-resolution representations for cracks in complex backgrounds in engineering practice. Additionally, a modified dynamic feature fusion (DFF) network is adopted as the edge branch to boost the performance on elongated thin cracks. The mIoU on a crack dataset consisting of different scenarios indicates that the edge branch significantly improves semantic segmentation. The conclusions are summarized as follows [49]:
(1) With the aid of HRNet in the segmentation branch, which maintains high resolution instead of recovering the resolution through a low-to-high process, BACS reaches high performance in crack segmentation with both clean and complex backgrounds.
(2) The edge branch in BACS, which integrates DFF, preserves fine-grained details, especially for elongated thin cracks. Based on the evaluation metric mIoU, BACS yields the best value of 70.67% under the variable-width scenario.
(3) BACS is a feasible and precise deep learning model for crack quantification at arbitrary working distances. With the crack segmentation results, the widths obtained by our approach are close to the actual values, with an error rate of 9.29%. The average absolute error of BACS is 0.0992 mm, which is approximately two pixels in the images.
(4) The proposed method shows superior performance for the crack segmentation task under difficult conditions, especially in the variable-width scenario. The findings show a new way of structural inspection and safety assessment of concrete structures, providing an accurate data foundation for the digital twin of concrete structures.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.