Intelligent Crack Detection and Quantification in the Concrete Bridge: A Deep Learning-Assisted Image Processing Approach



Introduction
Most of the bridges built in the world are concrete bridges. During service, bridges must be inspected regularly so that appropriate maintenance measures can be developed [1]. Cracks are one of the main defects of concrete bridges and have become the most important item in concrete bridge inspection and maintenance. Traditional manual inspection methods are inaccurate and inefficient. Therefore, using image processing or deep learning (DL) techniques to detect bridge cracks has become a research hotspot [2].
Some bridge crack detectors were realized based on traditional digital image processing methods. For instance, the gray threshold segmentation method was adopted to extract the crack according to the gray difference between the crack area and the background [3,4]. Besides, the Canny iterative method was used to detect the edge feature of the crack according to the linear feature of the crack [5]. In addition, the Otsu method and multiple filtering in image processing were employed to detect cracks in concrete structures [6]. Moreover, an improved image segmentation algorithm based on the Chan-Vese (C-V) model was provided for crack extraction [7]. All the above-mentioned algorithms produced good experimental results, but the background of the crack images was simple, and no obstacles were considered. These algorithms may not be applicable to bridge crack detection with a complex background.
In recent years, many researchers have adopted DL to locate and classify bridge defects and achieved good performance. For example, Lei Z. et al. used a convolutional neural network (CNN) to detect road cracks. They divided the complete images into several small image blocks and then classified the blocks to extract the cracks [8]. Cha et al. used the CNN and the target detection network of a faster region-based convolutional neural network (Faster R-CNN) to detect building damage, including cracks [9,10]. Li et al. adopted YOLO to detect concrete cracks [11]. Zhang et al. further used the YOLOv3 target detection network and transfer learning to detect crack damage in concrete bridges [12]. Hoang and Nguyen constructed an automatic model to detect and classify asphalt pavement cracks. It could detect cracks and was able to recognize the types of pavement cracks, including longitudinal, transverse, and alligator cracks [13]. Meng studied an image segmentation algorithm for concrete cracks based on CNN. Compared with other methods, the algorithm achieved higher detection accuracy and generalization ability [14]. In the work of Ding, the results showed that the average prediction accuracy of concrete crack identification was better than that of the YOLOv4 method [15]. Hu et al. studied a pavement crack detection method based on YOLOv5. The experimental results showed that the detection accuracy of the YOLOv5 series models was above 85% [16]. Note that the above works can detect the location of cracks but cannot quantify them. In practical engineering applications, the quantitative result of a crack is very important: it can be directly used to analyze the stress status of the structure and is thus an important basis for bridge maintenance decisions. Moreover, the entire learning process of present DL methods is a dimension reduction process.
The crack width of a concrete bridge is usually narrow compared to the image. Hence, the crack information may be lost during the learning process. The fully convolutional network (FCN) has been widely studied to realize pixel-level crack detection and measurement. FCN is an end-to-end, pixel-to-pixel convolutional network for semantic segmentation. An FCN is composed of downsampling and upsampling parts. The downsampling part mainly includes convolutional layers and pooling layers. The upsampling part includes deconvolution layers [17]. For example, Yang et al. employed the FCN to realize pixel-level crack classification and measurement.
The results showed that the accuracy of segmentation was 97.96%, and the relative quantification error was within 24.01% for crack width [18]. Li et al. utilized the U-Net to locate concrete cracks in a tunnel. The segmentation accuracy was 92.25%, and the relative quantification error was within 18.57% for crack width [19]. They further adopted the segmentation network of FCN and Naïve Bayes data fusion (NB-FCN) to locate concrete cracks. The cracks were quantified by introducing postprocessing, with an accuracy of 93.2% [20].
Concrete bridge cracks in practical engineering are usually long and narrow, and the crack width may be less than 0.5 mm. Thus, the crack detector is more sensitive to the quantification error in the width direction. With the present mainstream camera resolution, a practical microcrack may span only a few pixels. Even though the FCN can recover the original image size from the feature map with the deconvolution layers, the original scale information of the narrow crack width cannot be recovered according to information theory. Note that it is hard to use high-resolution images as inputs for a neural network due to limited computation power. If digital image processing is used for detection and quantification, the image plane needs to be smooth enough with little noise, which may not be practical in the engineering environment. Therefore, neither the digital image processing method nor the DL method alone can solve the microcrack detection task well.
In this paper, in order to solve the microcrack detection task in practical engineering, we propose a modified concrete bridge crack detector based on a DL-assisted image processing approach. The computational novelty is threefold. First, a modular split design is adopted. More specifically, the latest YOLOv5 DL method is utilized for crack classification and location, and the digital image processing method is employed for crack quantification. Second, we propose a novel digital image quantification and crack recovery method by using a region connected component search algorithm based on the crack trend.
Third, the narrowest crack width that can be detected reaches 0.15 mm, and the absolute error can be controlled within 0.05 mm.
The rest of the paper is organized as follows. In Section 2, the network architecture of the modified bridge crack detector based on the DL-assisted image processing approach is presented, and the detection, quantification, and visualization algorithms are described. In Section 3, the establishment of the bridge damage dataset, offline data augmentation, the training process, and the crack classification and location results obtained by the YOLOv5 algorithm are introduced in detail. We analyze the crack connected component optimization methods, including the mask filter used to remove handwritten marks, the ratio filter adopted to eliminate speckle and linear noises, the fusion of bounding boxes detected for the same crack, the regional adaptive threshold segmentation, and the connected component search approach based on the crack trend. Besides, we compare the crack quantification accuracy with and without these optimization methods. A comparison between the crack width calculated from the connected components and the results measured by a bridge engineer using a crack width gauge is given. Finally, conclusions are drawn.

Methodology
For practical bridge damage detection, it is necessary to know whether there is damage, where the damage is, and how large it is. Here, we propose a modified bridge crack detector based on the DL-assisted digital image processing technique. The focus is on the microcrack task in practical engineering.

Architecture of the Crack Detection Network Based on the DL-Assisted Image Processing Approach.
The overall architecture of the damage detection network based on the DL-assisted image processing approach is shown in Figure 1. It includes two modular functional parts. First, the region containing a crack is identified through the DL-based detection network. Here, YOLOv5 is employed to obtain the predicted bounding boxes. After that, according to the predicted bounding boxes, digital image processing techniques are adopted to quantify the length and width of the cracks, and the crack details are further visualized in the image.

Detection Algorithm.
We use the target detection method to determine whether there is a crack in the image and where the crack is.
There are many excellent neural networks in the field of target detection [21], such as Faster R-CNN [22], SSD [23], and EfficientDet [24]. We used the latest network of the YOLO series [25][26][27][28], that is, the YOLOv5 network. Note that there is no corresponding paper on this network at present, but the code has been released as open source. It is worth mentioning that, owing to the modular design, the detection method here can be any network. When a subsequent network that is more suitable for our detection task appears, it can be adopted.
In the following, we introduce the YOLOv5 network. The overall structure is shown in Figure 2. It can be divided into three parts: backbone layer, neck layer, and head layer. In the input part, the long edge of the original image is first scaled to 608, and the short edge is scaled proportionally and padded to 608 with 0 pixels. Thus, the input size is 608 × 608 × 3 for the backbone. The backbone is responsible for feature extraction, and the neck layer performs the multiscale feature fusion. In the head layer, the bounding box is obtained by predicting the coordinate offset, and the loss is calculated by means of the generalized intersection over union (GIOU) between the predicted box and the ground truth box. Bounding box classification and regression are completed directly in the head layer.
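The input scaling described above can be illustrated with a short sketch. It only computes the geometry (resized size and zero padding), not the pixel resampling itself. The function name `letterbox_geometry` is hypothetical, and splitting the padding evenly between the two sides is an assumption, since the text does not say where the zero pixels are placed.

```python
def letterbox_geometry(width, height, target=608):
    """Scale the long edge to `target`, scale the short edge proportionally,
    and report the zero padding needed to reach target x target."""
    scale = target / max(width, height)
    new_w = min(target, round(width * scale))
    new_h = min(target, round(height * scale))
    pad_w = target - new_w
    pad_h = target - new_h
    # ASSUMPTION: padding is split evenly between the two sides.
    left, top = pad_w // 2, pad_h // 2
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        "pad": (left, top, pad_w - left, pad_h - top),  # left, top, right, bottom
    }


# Image size used in the paper's dataset.
geom = letterbox_geometry(8688, 4888)
print(geom["resized"], geom["pad"])
```

For an 8688 × 4888 image the long edge maps to 608 pixels and the short edge to 342 pixels, so 266 rows of zeros are needed in total.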
More precisely, the backbone layer is responsible for feature extraction from the input image data and is divided into the Focus structure and the CSPNet structure. The cross-stage partial network (CSPNet) is used in the backbone layer and can extract abundant information features from the input image [29]. CSPNet can be combined with other networks such as ResNet [30] and ResNeXt [31]. Two kinds of convolution kernels are employed. One is 3 × 3 and is used for feature extraction. The other has a size of 1 × 1 and is used to flexibly control the depth of feature maps. The output feature map of the backbone layer is 19 × 19 × 512. Here, the role of the Res unit is to increase the depth of the network while avoiding the gradient vanishing or gradient explosion caused by the increased depth of the detection network. The neck layer uses a path aggregation network (PANet) [32] and is mainly used to generate a feature pyramid. PANet is based on the Mask R-CNN and feature pyramid network (FPN) frameworks [33,34]. The convolution kernel of the neck layer is the same as that of the backbone layer. In the spatial pyramid pooling (SPP) structure, multiple pooling layers are used. The sizes of the pooling kernels are 5 × 5, 9 × 9, and 13 × 13, and the padding is 2, 4, and 6, respectively. After the operation of the neck layer, the output feature map scales are 19 × 19 × 255, 38 × 38 × 255, and 76 × 76 × 255. The head layer is a universal detection layer, which is the same as that of YOLOv3 and YOLOv4 [27,28]. In the head layer, bounding boxes are generated and classified. There are three output heads, whose strides are 8, 16, and 32, respectively. The receptive field of a small-scale feature map is larger, which is used to detect large targets.
The receptive field of a large-scale feature map is smaller, which is used to detect small targets. The three-scale feature maps are jointly used to improve the accuracy of the network. For these three output heads, each output feature map has 19 × 19, 38 × 38, or 76 × 76 cells, respectively. Each cell generates 3 bounding boxes; each bounding box contains 4 positional parameters, 1 confidence value, and 80 classes (classes = 80 of the COCO dataset is used by default, and crack is one of the classes). Therefore, the depth of the output feature map is (4 + 1 + 80) × 3 = 255. The default confidence threshold in YOLOv5 is 0.45. The bounding box is judged as a positive sample, namely, crack, when its confidence is greater than 0.45; otherwise, it is judged as noncrack.
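The head dimensions quoted above follow from simple arithmetic, which the sketch below reproduces. The function name `head_output_shapes` is hypothetical; the input size, strides, boxes per cell, and class count are the values given in the text.

```python
def head_output_shapes(input_size=608, strides=(8, 16, 32),
                       boxes_per_cell=3, num_classes=80):
    """Return (rows, cols, depth) for each detection head."""
    # Each box carries 4 positional parameters, 1 confidence value,
    # and one score per class.
    depth = (4 + 1 + num_classes) * boxes_per_cell
    return [(input_size // s, input_size // s, depth) for s in strides]


shapes = head_output_shapes()
print(shapes)  # [(76, 76, 255), (38, 38, 255), (19, 19, 255)]
```

The finest grid (stride 8) has the most cells and the smallest receptive field, which matches the statement that large-scale feature maps are used for small targets.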
In the process of training, YOLOv5 uses GIOU [35] to calculate the loss, which can be calculated as
$$\mathrm{GIOU} = \frac{|A \cap B|}{|U|} - \frac{|A_c| - |U|}{|A_c|}, \tag{1}$$
where $A$ and $B$ represent the predicted box and the ground truth box, respectively, $A_c$ represents the smallest enclosing rectangle of the two rectangular boxes, and $U$ represents the union of the two rectangular boxes; that is, $U = A \cup B$.
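As a worked illustration of GIOU for axis-aligned boxes, the following minimal sketch uses the standard definition (IoU minus the fraction of the enclosing rectangle not covered by the union). Boxes are assumed to be given as `(x1, y1, x2, y2)` with positive area, and taking the loss as `1 - GIOU` is the usual convention rather than something stated in the text.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection A ∩ B (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter          # |U| = |A ∪ B|
    # Smallest enclosing rectangle A_c.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def giou_loss(box_a, box_b):
    # ASSUMPTION: loss = 1 - GIOU, the common convention.
    return 1.0 - giou(box_a, box_b)


overlap = giou((0, 0, 2, 2), (1, 1, 3, 3))    # partially overlapping boxes
disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))   # boxes that do not touch
print(overlap, disjoint)
```

Unlike plain IoU, the value stays informative for non-overlapping boxes: it becomes negative and decreases as the boxes move apart.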

Quantification and Visualization Algorithm.
We propose a method to realize crack quantification and visualization. The processing steps are presented in Figure 3. Through the target detection network, we get the bounding box of the crack in the original image, as shown in Figure 3(a). Because the resolution of the original image is high, in order to reduce the amount of calculation, we first crop the original image according to the bounding box so that only the area with cracks is retained, as shown in Figure 3(b). Note that the crack area contains many kinds of noise, such as salt and pepper noise from actual shooting, handwritten marks on the concrete bridge surface, dirt, weeds, and shadow occlusion. In order to extract more pixel-level information about cracks, we extract the binary image of cracks through image processing techniques. The main steps include median filtering and graying, CLAHE and local segmentation, and image binary conversion [36][37][38]. The image obtained after median filtering and graying is shown in Figure 3(c). We carry out CLAHE processing on the image, then divide the image into multiple square regions from top to bottom with the short side of the image as the side length, and process the remaining nonsquare part separately. For each region, local segmentation is performed as shown in Figure 3(d). Then, the gray distribution of all pixels in each square area is calculated.
Binarization is performed in each square region by adaptive threshold segmentation. In our dataset, the resolution of the image taken by the camera is 8688 × 4888. The camera takes images from a distance of 50 cm. The focal length of the camera is 35 mm. We randomly selected 50 images from the dataset. For these 50 images, the number of pixels occupied by the crack width was recorded manually based on bridge engineering experience. The results show that the number of pixels occupied by the crack width is less than 12 pixels. Therefore, in each square area, the maximum number of pixels belonging to the crack is $12\sqrt{2}\,a$, where $a$ is the side length (in pixels) of the square area. In the image histogram of each square area, the gray value corresponding to the pixel number of $12\sqrt{2}\,a$ is used as the threshold of image binarization [39]. If the gray value of a given pixel is larger than the threshold, it is judged as background; otherwise, it is judged as crack. The binary image of each square area is shown in Figure 3(e). Here, the black pixel 0 is the background, and the white pixel 255 is the potential crack. After threshold segmentation for each square region, these small regions are combined to get a complete binary image of the crack region, as shown in Figure 3(f). It can be seen that there are still some handwritten marks, spots, and linear noises in the image. In order to better deal with these noncrack noises, we removed handwritten marks with a mask filter and filtered speckle and linear noises with a ratio filter. The filtered results are shown in Figure 3(g). Through the connected component labeling algorithm [40,41], the binary image obtained after removing the handwritten marks is transformed into the corresponding connected component graph, as shown in Figure 3(h).
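The regional threshold rule can be sketched as follows, under one reading of the text: the threshold is the largest gray level whose cumulative histogram count does not exceed the crack pixel budget of 12√2·a. The function names and the handling of the edge case (no level fits the budget) are assumptions, and a flat list of gray values stands in for a square image region.

```python
import math


def crack_threshold(gray_values, side, max_width_px=12):
    """Pick the binarization threshold for one square region of side `side`.

    gray_values: flat list of integer gray levels (0..255) in the region.
    """
    budget = max_width_px * math.sqrt(2) * side   # at most 12*sqrt(2)*a crack pixels
    hist = [0] * 256
    for g in gray_values:
        hist[g] += 1
    threshold = -1          # ASSUMPTION: -1 means no gray level fits the budget
    cumulative = 0
    for level in range(256):
        cumulative += hist[level]
        if cumulative > budget:
            break
        threshold = level
    return threshold


def binarize(gray_values, threshold):
    # Darker-or-equal pixels become potential crack (255), the rest background (0).
    return [255 if g <= threshold else 0 for g in gray_values]


# Synthetic 100 x 100 region: 1000 dark "crack" pixels on a bright background.
side = 100
region = [30] * 1000 + [200] * 9000
t = crack_threshold(region, side)
binary = binarize(region, t)
print(t, binary.count(255))
```

Because the budget scales with the side length rather than the area, the rule keeps only the darkest thin structures in each region, regardless of the overall brightness of that region.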
It can be seen from the connected component diagram that the connected component belonging to the crack is approximately linear, and its length is generally much larger than its width. Therefore, line-like noise is further filtered out by the aspect ratio feature of the connected component, and speckled noise is filtered out by the area feature, as shown in Figure 3(i). It can be observed that there are some intermittent connected component fragments, and some noise still remains. In order to make the visible cracks more complete and clear and in line with the actual trend of cracks, we propose a region connected component search algorithm based on the connected components of cracks. The binary image of cracks after the connection is shown in Figure 3(j).
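Connected component labeling and the area-based speckle filter can be illustrated with a small pure-Python sketch. It is not the labeling algorithm of [40,41]; it uses a plain breadth-first search with 8-connectivity, which is an assumption, and the function name and minimum-area value are hypothetical.

```python
from collections import deque


def label_components(binary):
    """Group nonzero pixels of a binary image (list of rows) into components.

    Returns a list of dicts with the pixel count ("area") and the bounding
    box as (x_min, y_min, x_max, y_max).
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for row in range(height):
        for col in range(width):
            if binary[row][col] == 0 or seen[row][col]:
                continue
            queue = deque([(row, col)])
            seen[row][col] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # ASSUMPTION: 8-connectivity (diagonal neighbours are connected).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] != 0 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            components.append({"area": len(pixels),
                               "bbox": (min(xs), min(ys), max(xs), max(ys))})
    return components


image = [
    [255, 0, 0, 0, 0],
    [255, 0, 0, 0, 255],
    [255, 0, 0, 0, 0],
    [0, 255, 0, 0, 0],
]
components = label_components(image)
# Speckle filter: drop components below a (hypothetical) minimum area.
kept = [c for c in components if c["area"] >= 2]
print(components, kept)
```

In the example, the isolated pixel on the right is removed as speckle, while the elongated component on the left survives.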
After obtaining the contour coordinates of the crack binary image, the length of the crack is calculated according to the obtained coordinates, and the average width of the crack is calculated by dividing the area of the connected component by its length.

Experiments and Results
In the present work, the experiments were performed on a computer with the following configuration: two Intel Xeon(R) E5-2620 v4 @ 2.1 GHz CPUs, 64 GB of random access memory, and an NVIDIA Tesla P40 24 GB GPU. The method was implemented based on the DL framework PyTorch [42]. The experiments were conducted by modifying the open-source library of YOLOv5 and OpenCV [43].

Dataset.
We manually collected and labeled the images to establish a dataset for training and testing the network. Three types of bridge damage, that is, crack, spall, and rebar, are included in the dataset. A total of 1000 digital photos with 8688 × 4888 pixels were taken by a Canon EOS 5DS R camera. All the images used in this study were acquired from the periodic inspection of bridges by the CCCC First Highway Consultants Co., Ltd. Note that the shooting distance and focal length were fixed during the image acquisition process, which was helpful for quantifying the damage. Here, we focus only on the concrete cracks. We took cracks as targets, marked the ground truth, and produced a dataset. The number of images with cracks in the dataset is 487, and the dataset is divided into three parts: training set, validation set, and test set. The validation set and the test set each account for 10% of all the images, and the training set accounts for 80%.
In order to speed up the convergence of the network, we computed the mean and the standard deviation of the pixel values of each of the three channels (R, G, B) over the images in the dataset. After an image is fed into the network, these two sets of values are used for normalization by the Z-score standardization method.
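Z-score standardization with per-channel statistics can be sketched as below. The pixel values in the example are made up for illustration and are not the dataset statistics; the function names are hypothetical, and the population standard deviation is assumed.

```python
import math


def channel_stats(pixels):
    """Per-channel mean and (population) standard deviation.

    pixels: list of (r, g, b) tuples gathered from the dataset images.
    """
    n = len(pixels)
    means = [sum(p[k] for p in pixels) / n for k in range(3)]
    stds = [math.sqrt(sum((p[k] - means[k]) ** 2 for p in pixels) / n)
            for k in range(3)]
    return means, stds


def z_score(pixel, means, stds):
    # Subtract the channel mean and divide by the channel standard deviation.
    return tuple((pixel[k] - means[k]) / stds[k] for k in range(3))


# Illustrative values only, not the statistics of the crack dataset.
sample = [(0, 10, 50), (2, 30, 150)]
means, stds = channel_stats(sample)
normalized = z_score((2, 30, 150), means, stds)
print(means, stds, normalized)
```

After this transformation each channel has zero mean and unit variance over the dataset, which is what speeds up convergence.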

Offline Data Augmentation.
As the amount of data collected in the actual engineering was limited, we adopted data augmentation to prevent the problem of overfitting in the training process. Here, mosaic, random rotation, random cropping, Gaussian noise, and manual exposure are used [28,44]. Data-augmented samples of the images in the crack dataset are shown in Figure 4.

Training.
During the network training, batch processing was used to increase the training speed, the learning rate was set to 0.01, and the batch size of both network training and validation was 16. The network was trained for a total of 200 epochs. Figure 5 shows the performance of the network in the training process. The loss curves of the training set and the validation set, as well as the precision, recall, and mean average precision (mAP) curves on the validation set, are presented [45][46][47]. It can be seen that the training algorithm converged rapidly and that a high mAP was obtained.

Crack Detection Results.
In order to quantify the performance of the network in the target detection phase, data from the validation set were used to verify the network performance, and the results are presented in Table 1. Here, 88 images with 162 targets are detected. When the IoU threshold is 0.5, the calculated mAP is mAP@0.5 = 0.987. When the IoU threshold ranges from 0.5 to 0.95 with a step of 0.05, the mean value of mAP is mAP@0.5:0.95 = 0.987. Figure 6 presents the network detection results of cracks in the experiment. We can find that, with the detection network, the cracks can be successfully detected. In some cases, however, a single crack is detected as multiple bounding boxes.
Here, we remove the handwritten marks to reduce the connection error caused by connecting to the handwritten marks. After the crack binary image is extracted, the color (blue) information of the handwritten mark is extracted in the HSV (Hue, Saturation, Value) color space to obtain the mask of the blue marker. Note that the extracted mask also contains other speckle noises, which are much smaller than the area of the handwritten marks. Therefore, morphological operations, namely erosion and dilation, can be carried out on the mask to remove these speckle noises, leaving only the handwritten marks. As shown in Figure 7, the binary image is multiplied pixel by pixel by the inverted mask to obtain the crack binary image with handwritten marks removed. The visual results of cracks before and after the removal of handwritten marks are also presented. It can be seen that, with the mask filter, the connection to the handwritten mark is avoided in Figure 7(d).
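The mask filter can be sketched in pure Python with the standard-library `colorsys` module. The hue, saturation, and value bounds used to define "blue" are assumptions chosen for illustration, the function names are hypothetical, and the morphological cleaning of the mask is omitted for brevity.

```python
import colorsys


def blue_mask(rgb_image, hue_range=(0.55, 0.75), min_sat=0.4, min_val=0.2):
    """Return a 0/1 mask marking pixels whose HSV colour falls in a blue range.

    rgb_image: list of rows of (r, g, b) tuples with values in 0..255.
    ASSUMPTION: the hue/saturation/value bounds are illustrative only.
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            is_blue = (hue_range[0] <= h <= hue_range[1]
                       and s >= min_sat and v >= min_val)
            mask_row.append(1 if is_blue else 0)
        mask.append(mask_row)
    return mask


def remove_marks(binary, mask):
    # Multiply the binary image by the inverted mask, pixel by pixel.
    return [[b * (1 - m) for b, m in zip(b_row, m_row)]
            for b_row, m_row in zip(binary, mask)]


rgb = [[(0, 0, 255), (120, 120, 120), (30, 30, 30)]]   # blue mark, concrete, crack
binary = [[255, 0, 255]]                               # thresholded image
mask = blue_mask(rgb)
cleaned = remove_marks(binary, mask)
print(mask, cleaned)
```

The blue pixel was dark enough to pass the threshold, but it is zeroed by the inverted mask, while the true crack pixel is preserved.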
In order to remove the noncrack noise through the crack morphological features while retaining as much crack pixel information as possible, we collected statistics on the width and length scale information in the crack connected component diagram. The corresponding width-length ratio distribution of all connected components shows that the width-length ratio of the connected components in the extracted crack binary image is dispersive. Figure 8(b) is the ideal crack binary image, which is the expected binary diagram of cracks after removing noise. However, it is impossible to separate the real crack components completely using the available connected component information. Therefore, it is desirable to retain more connected components that are likely to belong to cracks and remove those that are less likely to belong to cracks. According to the usual crack shape, the length of a connected component belonging to a crack is generally much larger than its width. Thus, a connected component with a small width-length ratio is more likely to belong to a crack. Figure 8(e) shows the width-length ratio of the connected components corresponding to the ideal crack binary image. It can be seen that the ratio of the majority of connected components belonging to cracks is less than 0.6. According to Figure 8(e), we sorted the connected components in the extracted crack binary image by the width-length ratio in ascending order and selected the first 60% of the connected components. Figure 8(c) is the actual crack binary image obtained after filtering according to the width-length ratio characteristics. Figure 8(f) displays the width-length ratio corresponding to Figure 8(c). Most of the components that maintain crack morphology are retained after filtering.
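The ratio filter described above amounts to a sort followed by a cut. In the sketch below, each connected component is represented by a hypothetical dict with `width` and `length` entries (for instance the short and long sides of its bounding box, which is an assumption), and the 60% cut is expressed with integer arithmetic to avoid rounding surprises.

```python
def ratio_filter(components, keep_percent=60):
    """Keep the `keep_percent`% of components with the smallest width/length ratio."""
    ranked = sorted(components, key=lambda c: c["width"] / c["length"])
    keep = len(ranked) * keep_percent // 100
    return ranked[:keep]


# Hypothetical components: two slender crack fragments and three blob-like ones.
candidates = [
    {"name": "blob", "width": 9, "length": 9},
    {"name": "crack_a", "width": 2, "length": 40},
    {"name": "patch", "width": 8, "length": 10},
    {"name": "crack_b", "width": 3, "length": 30},
    {"name": "stub", "width": 5, "length": 10},
]
selected = ratio_filter(candidates)
print([c["name"] for c in selected])  # ['crack_a', 'crack_b', 'stub']
```

A fixed percentage keeps the filter independent of the absolute ratio scale in a given image, at the cost of always discarding some components even in a clean image.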

(2) Fusion of Multiple Bounding Boxes for the Same Crack.
When the detection network is used to identify the cracks, a crack in the image may have multiple bounding boxes because of the disconnection of the cracks or the inaccurate network identification. In that case, the visualized cracks are not continuous, and the number of cracks cannot be counted correctly. Therefore, we further perform a fusion operation for multiple bounding boxes.
We sort the bounding boxes detected in an image in ascending order according to the y coordinate of the upper left point and save them in a list, as shown in Figure 9. Starting from the first bounding box, for a bounding box with coordinates $(x_i, y_i)$, we examine whether condition (2) is satisfied, where $\Delta x$ is the width of the first detection box, $\Delta x = |x_1' - x_1|$. If a bounding box meets the condition, it overlaps the first bounding box, and the two need to be fused. For example, the bounding box with corners C and C′ overlaps the bounding box with corners A and A′. If no bounding box satisfies (2) for the first bounding box, we repeat the process from the second bounding box until all the bounding boxes are checked. The aim of bounding box fusion is to find the minimum enclosing rectangle of the two bounding boxes. Specifically, we take the coordinates of the upper left point and the lower right point of the two bounding boxes and return the vertex coordinates of the fused bounding box. When two bounding boxes are fused, we delete the original two bounding boxes from the list and add the fused bounding box to the top of the list. A visual comparison of multiple bounding boxes corresponding to the same crack before and after fusion is shown in Figures 9(b) and 9(c). In Figure 9(b), two bounding boxes are detected for a single crack, while only a single fused bounding box exists in Figure 9(c).
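The list-based fusion procedure can be sketched as follows. The exact form of condition (2) is not available here, so the predicate `should_fuse` is only a stand-in and is an assumption: it treats two boxes as belonging to the same crack when their upper-left x coordinates differ by less than Δx, the width of the first box. Boxes are `(x1, y1, x2, y2)` tuples, and all function names are hypothetical.

```python
def fuse(box_a, box_b):
    """Minimum enclosing rectangle of two boxes given as (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))


def should_fuse(first, other):
    # ASSUMPTION: placeholder for condition (2); compares the horizontal
    # offset of the two boxes with the width of the first box.
    delta_x = abs(first[2] - first[0])
    return abs(other[0] - first[0]) < delta_x


def fuse_boxes(boxes, predicate=should_fuse):
    # Sort by the y coordinate of the upper-left point, as in the text.
    pending = sorted(boxes, key=lambda b: b[1])
    i = 0
    while i < len(pending):
        first = pending[i]
        for j in range(i + 1, len(pending)):
            if predicate(first, pending[j]):
                merged = fuse(first, pending[j])
                del pending[j]           # remove the two original boxes
                del pending[i]
                pending.insert(0, merged)  # fused box goes to the top of the list
                i = 0                    # restart from the fused box
                break
        else:
            i += 1                       # nothing to fuse with this box
    return pending


detections = [(100, 0, 200, 300), (120, 310, 210, 600), (900, 50, 1000, 400)]
fused = fuse_boxes(detections)
print(fused)
```

Each fusion shortens the list by one box, so the loop always terminates. Swapping in the real condition only requires passing a different `predicate`.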
(3) Crack Connection Strategy. As shown in Figure 3(i), the filtered connected component image contains some intermittent connected component fragments and some remaining noise. In order to make the visible cracks more complete and clearer, we need to connect these connected components in line with the actual trend of the cracks. Meanwhile, the connection method must avoid connecting to noise.
We propose a region connected component search algorithm based on the crack trend. The flowchart of the proposed algorithm is shown in Figure 10(a). The first step is to initialize the index and vertex coordinates of each connected component. The second step is to update the connected component and select the connected component with the largest aspect ratio in the image as the current connected component. In the third step, we consider a downward connection from the current connected component and build a square search box according to the lower vertex of the current connected component. The width of the search box is equal to the width of the image. The schematic diagram of the connection is shown in Figure 10(b). The fourth step is to find the target connected component to be connected in the search box. In the search box, the distance $d$ and the angle $\theta$ between the lower vertex of the current connected component and the upper vertex of the next candidate connected component are calculated. Then, the confidence degree $c$ of the candidate is calculated from $d$ and $\theta$.
For quantification of the crack width, the width of the $i$th connected component is calculated as $w_i = s_i / l_i$. Thus, the average width of the $N$ filtered connected components can be calculated as follows:
$$\bar{w} = \frac{1}{N} \sum_{i=1}^{N} w_i .$$
Therefore, the average width of the crack $w_c$ is equal to the average width of the entire connected component times the actual distance $\delta_x$ of each pixel in the horizontal direction of the image, namely,
$$w_c = \bar{w}\, \delta_x .$$
Here, the resolution of the image taken by the camera is 8688 × 4888. The camera takes images from a distance of 50 cm. The focal length of the camera is 35 mm. The comparison between the calculated $w_c$ and the crack width measured by a crack width gauge in actual bridge inspection is shown in Table 2. It can be seen that the widths of cracks in actual concrete bridges are generally less than 0.5 mm.
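The width quantification can be sketched in a few lines. Each connected component is a hypothetical dict holding its pixel area `s` and its length `l` in pixels; the components are averaged with equal weight, which is an assumption, and the physical pixel size `mm_per_pixel` (δx) is passed in as a parameter because its value is not stated here. The numbers in the example are made up.

```python
def crack_width_mm(components, mm_per_pixel):
    """Average crack width in millimetres.

    components: list of dicts with "area" (s_i, pixels) and "length" (l_i, pixels).
    mm_per_pixel: physical size of one pixel in the horizontal direction (delta_x).
    """
    widths_px = [c["area"] / c["length"] for c in components]   # w_i = s_i / l_i
    # ASSUMPTION: plain arithmetic mean over the N filtered components.
    mean_width_px = sum(widths_px) / len(widths_px)
    return mean_width_px * mm_per_pixel                         # w_c = mean * delta_x


# Illustrative values only: two fragments and a hypothetical 0.05 mm pixel size.
fragments = [{"area": 300, "length": 100}, {"area": 250, "length": 50}]
width = crack_width_mm(fragments, mm_per_pixel=0.05)
print(round(width, 3))  # 0.2
```

Because the shooting distance and focal length are fixed during acquisition, a single value of δx can be reused for every image in the dataset.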
The absolute error of the crack width obtained by our proposed crack quantification method ranges from −0.03 mm to 0.04 mm, and the relative error ranges from −7.5% to 11.5%. For cracks wider than 0.15 mm, the absolute error of the width quantification results is less than 0.05 mm. The crack quantification results and visualization results are shown in Figure 11. It can be observed that good measurement accuracy is obtained.

Conclusion
In conclusion, we proposed a method of concrete bridge crack detection and quantification based on a DL-assisted image processing approach. The detection and quantification phases are designed separately, so the target detection network can easily be replaced with a more advanced DL algorithm. The target detection network based on DL is used to determine whether there is a crack in the image and then to extract the crack area. In addition, we proposed a new digital image quantification and crack visualization method by introducing the region connected component search algorithm based on the crack trend.
In our dataset, we collected 487 original images. Among them, 399 images were processed by offline data augmentation to obtain 3365 images as a training set, and the remaining 88 images were used for testing. The proposed DL-assisted digital image crack detection and quantification approach achieved 92.0% precision, 97.5% recall, and 98.7% mAP@0.5 in the detection stage for the testing set. With the subsequent noise processing and connected component connection, the narrowest crack that can be detected, quantified, and visualized is about 0.15 mm, and the absolute error is within 0.05 mm. These results show that the proposed crack detection and quantification method can improve the detection efficiency and reduce the detection cost, which is helpful and valuable for the intelligent development of bridge inspection.
Note that there are still several directions that need to be improved. First, the crack merging, which is desired for crack statistics and evaluation, is not carried out in our solution. Second, the current dataset is still small, and a larger dataset may further improve the accuracy of the proposed algorithm in practical application.

Data Availability
All the data are available from the corresponding author (727705858@qq.com) upon request.

Conflicts of Interest
The authors declare no conflicts of interest.