Recognizing the Damaged Surface Parts of Cars in the Real Scene Using a Deep Learning Framework

Automatically recognizing the damaged surface parts of cars can noticeably reduce the cost of processing insurance claims, which in turn improves satisfaction for vehicle users. This recognition task can be conducted using machine learning (ML) strategies. Deep learning (DL) models, as a subset of ML, have shown remarkable potential in object detection and recognition tasks. In this study, an automated method for recognizing the damaged surface parts of cars in real scenes is suggested that is based on a two-path convolutional neural network (CNN). Our strategy utilizes a ResNet-50 at the beginning of each route to explore low-level features efficiently. Moreover, we propose new mReLU and inception blocks in each route that are responsible for extracting high-level visual features. The experimental results showed that the suggested model obtained high performance in comparison to some state-of-the-art frameworks.


Introduction
The evaluation of car parts is a challenging task that mainly originates from the insurance industry. The automated assessment of damaged car parts represents the foremost challenge in the damage assessment and auto repair industry [1,2]. This field of study has a number of application areas, ranging from accidental damage evaluation for car insurance companies to car evaluation companies such as body shops and car rentals [3]. In evaluating a vehicle, the damage can take any form, including missing parts, minor and major dents, and scratches. Generally, the evaluation region contains a significant level of noise such as oil, grease, or dirt, which makes accurate recognition challenging [1,4]. Also, the recognition of specific parts is the first stage in the repair industry for obtaining an accurate labor and parts evaluation, where the presence of dissimilar car sizes, shapes, and models makes the task even more difficult for a model based on a machine learning strategy to perform well [5].
Nowadays, machine learning algorithms are broadly used in many industries to reduce the cost of manual work, including object recognition [6,7], exterior car body damage detection [8], image encoding [9], and healthcare (organ, skeletal, body pose, tumor/cancer segmentation) [10,11].
Detecting the damaged parts of the outer surface in various kinds of cars has received great attention in the field of machine vision in recent years, and many frameworks have been proposed for reducing the claim leakage issue [2,12,13].
However, applying frameworks based on ML strategies in this field is very challenging because of issues including light reflection, the presence of unidentified objects surrounding vehicles, scene illumination, and background detection [14,15].
There are two main ML strategies for recognizing damaged outer parts of a car: hand-crafted feature extraction methods and deep learning models [8,16].
Amirfakhrian et al. [3] proposed a clustering approach based on fuzzy similarity criteria and a change of color space for recognizing the damaged parts of a car. They used a similarity score between two images that is computed using the color spectrum. Parhizkar et al. [8] suggested a cascade convolutional neural network to recognize the damaged parts even in the presence of high illumination variation. Moreover, the Kirsch compass kernels are used to produce edge maps for creating an encoded image. They used two textural descriptor approaches, namely local binary pattern (LBP) and local directional number pattern (LDN), to obtain more informative features from the original image. Shirode et al. [17] suggested a deep learning model to recognize the damaged parts of a car using two separate CNN models. The first model (VGG16) is utilized to identify the damaged parts, their locations, and their severities. The second model is able to mask out the precise damaged regions.
Our strategy to recognize the damaged car parts is based on training a convolutional neural network (CNN) architecture on the collected dataset, which consists of 3,000 images with different dimensions captured by different cameras.
The proposed CNN classifies every pixel in the image as background, a damaged part, or a normal part. Each car part has separate classes for its normal and damaged states. For instance, there are two classes for the windshield: normal windshield and damaged windshield. Overall, there are 20 classes for car parts and one class for the background. The car parts fall into 10 categories: windshield, hood, front bumper, rear bumper, fenders, trunk lid, front doors, back doors, roof, and quarters.
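The labeling scheme described above can be sketched as a simple lookup table. The part names come from the text; the integer assignment (background as 0, then a normal/damaged pair per part) is an assumption made only for illustration, and `build_class_map` is a hypothetical helper.

```python
# Part names are taken from the text; the index layout below is an assumption.
PARTS = [
    "windshield", "hood", "front bumper", "rear bumper", "fenders",
    "trunk lid", "front doors", "back doors", "roof", "quarters",
]


def build_class_map(parts):
    """Map each class name to an integer label.

    Label 0 is the background; every part then gets two consecutive
    labels, one for its normal state and one for its damaged state.
    """
    class_map = {"background": 0}
    for i, part in enumerate(parts):
        class_map[f"normal {part}"] = 2 * i + 1
        class_map[f"damaged {part}"] = 2 * i + 2
    return class_map


CLASS_MAP = build_class_map(PARTS)
print(len(CLASS_MAP))  # 20 part classes + 1 background class
```

With 10 parts and two states each, the table holds 21 entries, matching the class count stated above.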
To solve the problems of two-stage models (car detection followed by recognition of the damaged parts), we suggest a two-route CNN model for exploring both global and local features. The main contributions of this study are as follows: (1) a new two-route CNN model that automatically finds and localizes damaged parts of the car inside an input image; (2) a transfer learning approach for finding more informative details in a real-scene image; (3) the application of local and global patches to the CNN model to increase its final performance.

Materials and Methods
In this part, our datasets and a detailed description of the model architecture are described. The suggested model is shown in detail in Figure 1.

Proposed Convolutional Neural Network.
Convolutional neural network architectures are broadly utilized in the field of computer vision, in tasks such as medical image analysis, object detection, and action detection. In these networks, features and patterns inside the image can be explored by the convolution operation [18][19][20]. The lower convolutional layers (Conv layers) can extract features such as curves, lines, and edges. The deeper convolutional layers are able to learn more complex hidden patterns inside the image [21,22]. The convolution operation in a Conv layer is implemented using a convolution filter (kernel), and its parameters are learned during the learning process. During the convolving procedure, each filter is convolved with the input image to compute an informative feature map [23]. It should be mentioned that the dimension of the convolution filter is always smaller than the dimension of the input image. In other words, a convolution filter slides over an input image and calculates the dot product between the convolution filter and the input at each spatial position [1,13,24].
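The sliding dot product described above can be illustrated with a minimal NumPy sketch. It assumes a single-channel image, stride 1, and no padding, and it does not flip the kernel (the usual convention in CNN libraries); `conv2d_valid` is a hypothetical helper, not part of the proposed model.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and take a dot product at each position.

    Assumes a 2D single-channel image, stride 1, and no padding, so the
    output is smaller than the input by (kernel size - 1) in each dimension.
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)  # dot product of patch and filter
    return out


image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])  # responds to diagonal intensity change
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3): the 2 x 2 filter fits 3 x 3 positions
```

In a trained network the kernel values are learned rather than fixed by hand as they are here.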
In this study, we employed two ResNet-50 models to explore low-level features, and three new 3 × 3 mReLU blocks in each route are responsible for extracting high-level visual features. A residual neural network (ResNet) is an artificial neural network (ANN) that stacks residual blocks sequentially. ResNet-50 is 50 layers deep, including 48 Conv layers, one maximum pooling layer, and one average pooling layer [25]. In order to recognize each part of the car more precisely, we need a network that is able to detect more informative features. Therefore, we employed a pretrained ResNet-50 architecture that was trained on ImageNet at the beginning of the framework [26,27]. The features extracted by the ResNet-50 model are used as the input of the 3 × 3 mReLU block. In order to achieve a better segmentation result, three 3 × 3 mReLU blocks are utilized sequentially.
As the performance of ResNet-50 has been proved in the field of image classification (extracting high-quality image features on ImageNet), we employed this model to explore informative details about the damaged car and the background. Moreover, by applying deeper layers (more feature extraction layers), a better result in detecting damaged parts of a car can be obtained. The employed 3 × 3 mReLU architecture is shown in detail in Figure 2. This block was inspired by the concept of learning multiple patterns using intermediate layers in a CNN model [25,28,29]. In other words, stacking mReLU blocks is more efficient than a simple linear chain of convolution layers for classifying objects of varying scale. In the suggested mReLU block, four 3 × 3 filters (small receptive fields) are used to address the problem of overfitting and to allow the framework to use a deeper architecture [28]. Also, a scale/shift layer is applied after the second concatenation layer to apply trainable weights and biases [30,31].
A negation layer is employed to multiply the output of the previous layer by −1 to enhance the exploration of informative features. We applied three and two Conv layers before the negation layers to improve the performance of the feature extraction process. Moreover, some bottleneck layers (1 × 1 Conv layers) are used to reduce the computational cost [31,32].
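The role of the negation, concatenation, and scale/shift layers can be sketched as follows. This is only an illustration of the negate-and-concatenate idea under the assumption that a ReLU follows the scale/shift step; the exact wiring of the mReLU block is the one given in Figure 2, and `negation_concat_relu` is a hypothetical name.

```python
import numpy as np


def negation_concat_relu(features, scale=None, shift=None):
    """Negate, concatenate, scale/shift, then apply ReLU.

    `features` has shape (channels, height, width). The output has twice
    as many channels: the original responses and their negated copies.
    `scale` and `shift` stand in for the trainable per-channel weights
    and biases; they default to 1 and 0 here (an assumption).
    """
    negated = -1.0 * features                      # negation layer
    stacked = np.concatenate([features, negated])  # concatenate on channel axis
    channels = stacked.shape[0]
    if scale is None:
        scale = np.ones(channels)
    if shift is None:
        shift = np.zeros(channels)
    adjusted = stacked * scale[:, None, None] + shift[:, None, None]
    return np.maximum(adjusted, 0.0)               # ReLU


x = np.array([[[1.0, -2.0],
               [0.5, -0.5]]])  # one channel, 2 x 2
y = negation_concat_relu(x)
print(y.shape)  # (2, 2, 2)
```

The first output channel keeps the positive responses of `x` and the second keeps the magnitudes of its negative responses, so information that a plain ReLU would discard is preserved.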
Moreover, we proposed a new inception block that captures the multiscale nature of car segmentation tasks and enhances the performance of the model when encountering a complex background [20,30].
The proposed inception block is shown in detail in Figure 3.
This idea is inspired by earlier works [31,32] that decrease the number of feature kernels in each layer.
This reduction in the number of parameters helps maintain the sparsity of the architecture and improves the computational performance.
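The effect of a 1 × 1 bottleneck layer on the number of parameters can be seen with a short calculation. The channel counts below (256 input channels, 256 output channels, a bottleneck width of 64) are assumptions chosen for illustration and are not taken from the proposed model; biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


# Hypothetical channel counts, for illustration only.
in_ch, out_ch, bottleneck_ch = 256, 256, 64

# A 3 x 3 convolution applied directly to the input.
direct = conv_weights(3, in_ch, out_ch)

# A 1 x 1 bottleneck that first reduces the channels, then the 3 x 3 convolution.
with_bottleneck = (conv_weights(1, in_ch, bottleneck_ch)
                   + conv_weights(3, bottleneck_ch, out_ch))

print(direct)           # 589824
print(with_bottleneck)  # 163840
```

Under these assumed sizes, the bottleneck cuts the weight count to roughly 28% of the direct 3 × 3 layer.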
The inception block has three routes for extracting features. At the beginning of each route, a 1 × 1 convolutional layer is used, and its output is fed into two convolution layers. Then, the outputs of the first two routes are concatenated and fed into another bottleneck layer. Next, the combined output of routes one and two is concatenated with the third route. Due to the use of the inception and 3 × 3 mReLU blocks, the overfitting problems have been satisfactorily addressed.

Data Augmentation.
For effective learning and implementation, a CNN model needs to be trained with a large amount of training data [18,33]. Deep learning-based approaches need to be trained on large training datasets to avoid overfitting and to maximize learning. Besides, the performance and learning accuracy of DL models improve with ample and high-quality training data. Data augmentation (DA) techniques are used for changing or enhancing a dataset [34,35].
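As a minimal sketch of how an augmentation step can enlarge a segmentation dataset, the example below mirrors an image together with its label mask. The specific transformations used in this study are not listed in this section, so the horizontal flip is an assumption chosen purely for illustration, and `flip_pair` is a hypothetical helper.

```python
import numpy as np


def flip_pair(image, mask):
    """Horizontally flip an image and its per-pixel label mask together.

    `image` has shape (height, width, channels) and `mask` has shape
    (height, width). Both must be flipped the same way so that every
    pixel keeps its label.
    """
    return image[:, ::-1].copy(), mask[:, ::-1].copy()


image = np.arange(2 * 3 * 3).reshape(2, 3, 3)  # tiny 2 x 3 RGB image
mask = np.array([[0, 1, 2],
                 [0, 0, 2]])                   # tiny 2 x 3 label mask

flipped_image, flipped_mask = flip_pair(image, mask)
augmented_dataset = [(image, mask), (flipped_image, flipped_mask)]
print(len(augmented_dataset))  # the single sample has become two
```

The key point for a pixel-wise task is that any geometric transformation applied to the image must be applied identically to the mask.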

Results and Discussion
The dataset employed includes 3,000 images of different sizes obtained using various cameras. All images were resized to 620 × 620 before being fed to the CNN model. The experiments were carried out using Python on an NVIDIA Tesla K80 GPU with 8 GB RAM under Windows 10. Our technique is compared with several studies on car damage detection. We assess the accuracy of the suggested strategy with different criteria that are defined as follows [39][40][41].
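Assuming the standard pixel-wise definitions of the three criteria reported in this section (precision, recall, and IoU), they can be written as:

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
\text{IoU}       &= \frac{TP}{TP + FP + FN},
\end{align}
```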
where true positive (TP) is the number of pixels that are correctly predicted by the suggested CNN framework as belonging to a car part, and false positive (FP) is the number of pixels that do not belong to the car part but are wrongly predicted as belonging to it. A false negative (FN) indicates pixels that belong to the car part but are wrongly predicted as not belonging to it.
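The three counts can be computed directly from a predicted label map and a ground-truth label map. The sketch below assumes the usual definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and IoU = TP / (TP + FP + FN), evaluated for one class at a time; `pixel_metrics` is a hypothetical helper and the tiny arrays are made-up inputs.

```python
import numpy as np


def pixel_metrics(pred, target, cls):
    """Return (precision, recall, IoU) for one class label.

    `pred` and `target` are integer label maps of the same shape.
    Assumes the standard pixel-wise definitions of the three criteria.
    """
    pred_c = pred == cls
    target_c = target == cls
    tp = int(np.sum(pred_c & target_c))    # predicted cls, truly cls
    fp = int(np.sum(pred_c & ~target_c))   # predicted cls, truly something else
    fn = int(np.sum(~pred_c & target_c))   # truly cls, predicted something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou


# Made-up 2 x 3 label maps: 1 marks the class of interest, 0 everything else.
target = np.array([[1, 1, 0],
                   [0, 1, 0]])
pred = np.array([[1, 0, 0],
                 [1, 1, 0]])

precision, recall, iou = pixel_metrics(pred, target, cls=1)
print(precision, recall, iou)
```

In this toy case TP = 2, FP = 1, and FN = 1, so precision and recall are both 2/3 and the IoU is 0.5.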
In Table 1, the results of detecting car parts are listed for the 10 parts of the car. The best precision is obtained for the fenders, front doors, and windshield, and the worst for the quarters, roof, rear bumper, and back doors. For recall, the fenders, hood, and windshield gained the best results, whereas the roof, front bumper, back doors, and quarters gained the worst. In terms of the IoU, the best outcome was achieved for the front doors.
In Table 2, the comparison results for detecting car parts between our model and eight other approaches are listed. The proposed architecture gained better results than all eight other models. Moreover, the FCOMB [12] and VGG [1] strategies have the worst outcomes in terms of all evaluation criteria, while the texture descriptor [8] pipeline is the second-best model.
On closer inspection, the PANet [42] model shows a very similar performance to the texture descriptor [8] model, and the VGG model [1] shows an almost identical result to the combined feature (YOLOv3) [43] model. The gaps in recall and precision between the FCOMB [12] model and the suggested model are large: 34% and 33%, respectively.
Table 3 exhibits the results of detecting damaged car parts using the proposed model. The best precision is achieved for the trunk lid and front doors, and the worst for the rear bumper. For recall, the roof, trunk lid, windshield, and hood obtained the best results, whereas the back doors and fenders achieved the worst. In terms of the IoU, the best outcome was achieved for the roof.
Table 4 compares the results for detecting damaged parts of a car using our approach and some recently published papers in terms of recall, precision, and IoU. Comparing the results of all models in Table 4, it is clear that the combined feature (YOLOv3) [43] and CNN [47] models gained the worst outcomes in terms of all criteria. In contrast, the HTC [46] and texture descriptor [8] pipelines obtained the next-best values for all criteria. Nevertheless, the suggested two-route model gained the best values among all compared models, which indicates its effectiveness in achieving the desired objectives.
Figure 4 shows the results of detecting a damaged part of a car using different models. The damaged region of the car could not be recognized correctly by YOLOv3, Mask RCNN, FFNN, and CNN. In particular, the center region of the fender could not be appropriately recognized by the Mask RCNN, YOLOv3, and CNN techniques. This difficulty was addressed by the texture descriptor and VGG architectures. The segmentation outcomes obtained using FFNN and the texture descriptor demonstrated an improvement in recognizing damaged regions in areas with high illumination variation and could reduce false positives.

Conclusion
In this paper, we suggested a deep learning-based model for recognizing damaged parts of a car. We employed a two-route CNN model that is able to extract both global and local features from the input image. In order to maximize the efficiency of the model in segmenting target parts, a pretrained ResNet-50 model was used at the beginning of the pipeline. Moreover, an inception block and a 3 × 3 mReLU block were suggested to address overfitting. After analyzing the suggested framework, we found that applying a pretrained CNN model to each route leads to a high segmentation performance compared with other models such as FFNN, YOLOv3, HTC, and VGG. To validate the suggested model, it was evaluated on a private car collection database, and its outcomes on three measurement criteria were compared with some state-of-the-art techniques, including PANet [42], combined feature (YOLOv3) [43], CNN [47], FFNN [48], Mask RCNN [45], VGG models [1], HTC [46], and texture descriptor [8].
The results indicated that our framework outperformed the compared approaches on the evaluated criteria.

Disclosure
The funding sources had no involvement in the study design, collection, analysis or interpretation of data, writing of the manuscript, or in the decision to submit the manuscript for publication.

Data Availability
Data will be available upon request to the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.