Yolov4 High-Speed Train Wheelset Tread Defect Detection System Based on Multiscale Feature Fusion

The Yolov4 detection algorithm does not sufficiently extract local semantic and location information. This study aims to solve this problem by proposing a Yolov4-based multiscale feature fusion detection system for high-speed train wheel tread defects. First, multiscale feature maps are obtained from a feature extraction backbone network. The proposed multiscale feature fusion network then fuses the underlying features of the original three scales. These fused features contain more defect semantic information and location details. Based on the fused features, a path aggregation network is used to fuse feature maps at different resolutions, with an improved loss function that speeds up the convergence of the network. Experimental results show that the proposed method is effective at detecting defects in the wheel treads of high-speed trains.


Introduction
High-speed train fault diagnosis is crucial for the safe operation and maintenance of high-speed trains. As an important support and running component of high-speed trains, wheelsets incur tread wear, scratches, and other damage caused by rolling contact between the wheel and rail. The deterioration can easily cause serious damage, such as wheelset tread fractures. Therefore, it is essential to diagnose wheelset tread damage to ensure the safety of high-speed trains.
Current defect detection methods include magnetic particle detection, ultrasonic detection, and machine vision detection [1]. In recent years, many scholars have studied machine vision methods [2][3][4][5], owing to their wide application range and high precision. Such methods are efficient at inspecting the damage and unaffected by the contour of the inspection part. Moreover, machine vision inspection can proceed automatically. Traditional defect detection methods based on machine vision for railway components begin from the perspective of image processing. On the one hand, they use low-level grayscale features [6], textures, colors, frequencies, and other features to detect defects, yet such artificial features must be selected manually and require rich expert knowledge. On the other hand, the damage is automatically located after image enhancement [7][8][9], although the images are susceptible to noise. Given the complex operating scenarios of high-speed rail, changing working conditions, and the nonlinearity of the sensor itself, the signals collected by the sensors often contain foreground interference, noisy backgrounds, and nonlinear characteristics such as corrosion, stains, uneven reflection, a low signal-to-noise ratio, excessive illumination, and uneven illumination. It is difficult for traditional machine vision defect detection methods to effectively extract small fault features under foreground interference and noise. Furthermore, the differences between wheelset tread defects are not obvious under these two conditions.

Defect Detection Based on a Deep Convolutional Neural Network (DCNN).
Considering that deep learning has exhibited superiority in feature extraction and pattern recognition, an increasing number of scholars have attempted to apply deep learning methods to defect detection [10]. Deep learning based on convolutional neural networks is widely used to detect railway component damage [11][12][13]. Faghih-Roohi et al. [14] proposed a DCNN with multiple structures and activation properties for rail damage detection. High-speed train wheelset tread defect detection requires scene analysis at the regional level. Current regional-level target detection methods based on DCNNs are generally divided into two types. One type is two-stage detection based on regions, such as Regions with CNN features (R-CNN) [15] and Faster R-CNN [16]. Liu et al. [12] integrated the feature extraction module isoelectric line network (ILNET) in the Faster R-CNN and used the segmentation method intersecting the cortical model/maximization of the posterior marginal based on a Markov random field to locate and divide loose strands of isoelectric lines. He et al. [17] proposed an end-to-end two-step defect detection method for steel rolling defects and explored the trade-off between detection speed and accuracy for different numbers of regions. However, two-step algorithms are still slower than single-step detection algorithms, which makes it difficult for them to fit within the short maintenance operation time of China's high-speed railways. The other type comprises single-step detection algorithms, including the you only look once (YOLO) series [18], the single-shot multibox detector (SSD) [19], and RetinaNet [20]. Kou et al. [21] introduced the DenseNet module into Yolov3 and proposed a strip defect detection method based on Yolov3. Based on RetinaNet, Cheng et al. [22] proposed DEA RetinaNet, a defect detection model based on differential channel attention and adaptive spatial feature fusion. Cui et al. [23] proposed an SSDNet for defect detection, which addressed the problems of large texture variation and small defect size by introducing feature retention blocks and skip dense connection modules. Considering that the YOLO series of target detection algorithms use fast detection frameworks, this study focuses on the application of Yolov4 [24], a member of the YOLO series, in wheelset tread defect detection. Yolov1 locates targets based on the last convolution map. Yolov2-Yolov4 [25,26] only locate large, medium, and small targets from three-scale high-level feature maps. Generally, a CNN acts as a filter in deep neural networks. The semantic information and location details of the wheelset tread image change layer by layer with the filter. Shallow features are rich in location information, but their discrimination is inadequate. Deep features contain ample semantic information but at the cost of location details.

Multiscale Feature Fusion (MFF).
Some scholars have proposed learning higher-quality features by combining features of different scales. Yang et al. [27] designed a multiscale channel compression deep surface defect detection algorithm that generates multiscale features through convolutional layers with kernels of different sizes to address messy backgrounds and defects of various scales. The added convolutional layer is compressed to increase the speed of the network. Hu et al. [28] proposed a spatiotemporal segmentation model with a hybrid multidimensional feature fusion structure for automatic thermal imaging defect detection. An attention module was designed that encourages local interaction between adjacent pixels and calibrates the feature map self-adaptively to lighten the model. Gao et al. [29] used feature acquisition and a compression network for multiscale feature fusion of IBD defect detection and used Gaussian weighted pooling instead of ROI pooling. Their method provides more accurate defect location information.

Transfer Learning.
Deep neural networks require large datasets to drive training. For wheelset treads, building a large target detection dataset is very difficult. If the dataset is too small, the performance of the deep neural network will be limited. Therefore, pretraining the network or transfer learning is commonly used for small samples. Yang et al. [30], Zhang et al. [31], Badmos et al. [32], and Sun et al. [33] used transfer learning to detect Mura defects in LCD panels, PCB defects, electrode defects in lithium batteries, and surface defects in metal parts, respectively. Kim et al. [34] compared the effects of fine-tuning-based transfer learning and training the network from scratch on the DAGM defect dataset. They demonstrated that transfer learning outperforms training the network from scratch.
The contributions of this study are as follows:
(1) This study proposes an MFF-Yolov4 high-speed train wheelset tread defect detection algorithm. Yolov4 uses only three high-level features to perform detection tasks, resulting in low detection accuracy. To address this limitation, we propose a multiscale feature fusion module. The proposed module provides rich semantic information and location details after the feature extraction stage. Low-level features are integrated into high-level features, and the fused features effectively improve the classification and localization capabilities of the detection network.
(2) A dataset that contains 277 high-speed train wheelset images is built by collecting data to fine-tune our pretrained model. In the case of small samples, the proposed MFF-Yolov4 achieves competitive performance on this dataset.
(3) Because detection here is a two-class problem, and because noisy background candidate frames that are unrelated to wheelset tread defects contribute most of the loss, the classification loss function is optimized by adding an adjustment factor α, which gives the improved network better anti-interference performance.

Construction of Defect Detection Model for Wheelset Treads
The proposed MFF-Yolov4 algorithm is described in this section (see Figure 1). A single wheelset tread image of any size is processed by CSPDarknet53 and SPP for feature extraction, and a convolutional feature map is generated at each stage. Four feature maps are extracted and then merged into three fused feature outputs through MFF. In this way, the MFF output contains both the low-level location information of the wheelset tread image and the high-level classification semantics. Subsequently, a path aggregation network (PANet) [35] is used to perform a secondary fusion of the three fused features with underlying location information. Next, the three fused features at different scales are divided into grids, a bounding box is predicted for each grid cell, and the box is regressed toward the ground-truth box. Each scale feature map stores the defect categories ("classes"), regression parameters of prior boxes ("location"), and confidence scores ("confidence") that correspond to the three prior boxes of each grid cell.

Backbone Network.
Pretraining the model on the VOC2007 dataset can improve the performance of the deep neural network. The pretrained model can then be fine-tuned on a smaller wheelset tread defect dataset. Given that MFF-Yolov4 is based on the Yolov4 target detection algorithm, the use of CSPDarknet53 and SPP as the backbone has the following characteristics: (1) According to the authors of Yolov4, CSPDarknet53 has the following advantages over the CSPResNeXt50 network: a higher input network size, which is conducive to the detection of small objects; a larger receptive field that covers the larger input; and more parameters, which give the model greater capacity to detect multiple objects of different sizes in a single image. (2) SPP can significantly increase the receptive field and isolate important context features. Furthermore, it causes almost no reduction in the operating speed of the network.
In this study, CSPDarknet53 and SPP are selected as the backbone. The detailed structure of the network is shown in Table 1, and the output feature of the last layer of each CSP module is expressed as {F1, F2, F3, F4, F5, F6}.

Multiscale Feature Fusion.
Yolov4 uses only three high-level scale features for detection. The evolution of wheelset tread damage is a coupled development process. As the network proceeds from shallow to deep, features are mapped to a high-dimensional space, and the overall semantic information becomes progressively more abstract. However, the hidden positioning information and local semantic features are weakened layer by layer. To strengthen the ability of Yolov4 to detect wheel tread damage, we extend each single-scale feature to a dual-scale fusion feature. The method we use is to fuse the high-resolution underlying features with high-level features through convolution transformation. Two basic conditions must be met: the fused features must be nonadjacent, because adjacent features are highly similar [36], and the features of each scale generated by the backbone network should be taken into account.
From Table 1, we know that when the image flows through the feature extraction backbone network, the multiscale features {F1, F2, F3, F4, F5, F6} are generated. The original Yolov4 network uses only three high-level feature maps, {F3, F4, F6}. To integrate multiple scale features and obtain more comprehensive semantic information and location details, the proposed MFF module applies a fusion strategy to the underlying feature F2. In particular, after L2 normalization, F2 is concatenated with each of F3, F4, and F6. By modifying the number of filters in the 1 × 1 convolution, MFF reduces the number of required parameters. This operation may affect the accuracy but helps prevent overfitting when training data are insufficient. Finally, by stitching the features together, the multiscale fusion features {FF1, FF2, FF3} are generated, as shown in Algorithm 1. The structure of the module is shown in Figure 2.
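The fusion strategy described above can be illustrated at the level of tensor shapes. The following is only a minimal sketch, not the authors' Algorithm 1: the spatial sizes and channel counts in FEATURES are assumed values for a 416 × 416 input (Table 1 is not reproduced here), the helper names are hypothetical, and the sketch assumes that the downsampling convolutions keep the channel count of F2 unchanged.

```python
import math

# ASSUMPTION: illustrative (spatial size, channels) pairs for a 416 x 416 input.
# The real values are those of Table 1 in the paper.
FEATURES = {
    "F2": (104, 128),   # low-level feature that is fused into the others
    "F3": (52, 256),
    "F4": (26, 512),
    "F6": (13, 1024),
}


def l2_normalize(vector, eps=1e-12):
    """Scale one per-pixel channel vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def downsample_steps(src_size, dst_size):
    """Count the stride-2 steps needed to shrink src_size to dst_size."""
    steps = 0
    while src_size > dst_size:
        src_size //= 2
        steps += 1
    if src_size != dst_size:
        raise ValueError("sizes are not related by a power of two")
    return steps


def fusion_plan(low="F2", highs=("F3", "F4", "F6")):
    """Describe each fused feature FF_i = concat(downsampled F2, F_high)."""
    low_size, low_channels = FEATURES[low]
    plan = []
    for i, name in enumerate(highs, start=1):
        size, channels = FEATURES[name]
        plan.append({
            "name": f"FF{i}",
            "stride2_convs": downsample_steps(low_size, size),
            "size": size,
            # Channel count right after concatenation, before any 1 x 1 conv.
            "channels_after_concat": low_channels + channels,
        })
    return plan


if __name__ == "__main__":
    for row in fusion_plan():
        print(row)
```

Under these assumed sizes, F2 needs one, two, and three stride-2 steps to match F3, F4, and F6, which is why the channel count after concatenation grows and a 1 × 1 convolution is a natural way to bring it back down.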

Path Aggregation Network (PANet).
The fused features FF_i carry more comprehensive feature information. The three fused feature layers are fed into PANet and stacked through its step-by-step upsampling and downsampling. This second fusion yields three effective feature layers, namely, P1, P2, and P3, as shown in Figure 3.

Yolo Head.
In particular, the Yolo head first integrates features through a 3 × 3 convolution, and then a 1 × 1 convolution is used to obtain an S × S × 3(4 + 1 + K) tensor, where "4" stores the four coordinates of the bounding box, namely, b_x, b_y, b_w, and b_h; "1" stores the confidence of the detected object; and "K" stores the scores of the K detected object categories. The calculation formula is presented as follows: The coordinate system is established with the upper-left corner of the sample as the origin; c_x and c_y represent the coordinates from the upper-left corner of the bounding box to the origin; p_w and p_h are the width and height of the anchor box, respectively; t_x and t_y are the distance from the center point of the bounding box to the origin; t_w and t_h are the width and height of the bounding box, respectively; b_x and b_y are the coordinates of the center point of the prediction box; b_w and b_h are the width and height of the prediction box, respectively; and σ(·) is the sigmoid activation function. In equation (2), P(object) represents the probability that the prediction frame contains the target object, and IOU_pred^truth is the intersection-over-union ratio of the truth frame and the prediction frame. Given that the Yolo head can produce thousands of prediction boxes, greedy nonmaximum suppression (NMS) is often used to eliminate areas with high overlap. The threshold of NMS is set to 0.5, and bounding boxes below 0.5 are discarded. After NMS, the remaining prediction boxes are used to fine-tune our MFF-Yolov4 network.
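The symbols defined above match the box decoding commonly used in the YOLO family: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w · exp(t_w), b_h = p_h · exp(t_h), with confidence = P(object) × IOU_pred^truth. The sketch below assumes that this standard form is what the paper's equations express; the function names are hypothetical, and the NMS routine is a plain greedy variant in which the 0.5 threshold is applied to box overlap.

```python
import math


def head_channels(num_classes, anchors_per_cell=3):
    """Depth of the S x S x 3(4 + 1 + K) output tensor."""
    return anchors_per_cell * (4 + 1 + num_classes)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor):
    """Decode raw outputs t = (t_x, t_y, t_w, t_h) into (b_x, b_y, b_w, b_h).

    ASSUMPTION: standard YOLO decoding, with `cell` = (c_x, c_y) the grid
    offset and `anchor` = (p_w, p_h) the prior box size.
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = anchor
    return (sigmoid(t_x) + c_x,
            sigmoid(t_y) + c_y,
            p_w * math.exp(t_w),
            p_h * math.exp(t_h))


def iou(a, b):
    """IoU of two corner-format boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes; drop boxes that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    print(head_channels(num_classes=1))
    print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(2.0, 5.0)))
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    print(greedy_nms(boxes, scores=[0.9, 0.8, 0.7]))
```

With a single defect class, each grid cell would carry 3 × (4 + 1 + 1) = 18 values under this layout.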

Improved Loss Function.
The MFF-Yolov4 loss function can be divided into three parts, as shown in Figure 4. Hence, the loss function is calculated as follows: where E_ciou is the location error, E_coord is the confidence error, and E_cls is the classification error. The proposed algorithm is designed to detect defects in the tread surface of high-speed train wheels. In the actual operating environment of high-speed trains, however, the wheelsets are exposed to natural light, stains, rust, and other interference in long-term operation under various working conditions, and the background unrelated to the detected target contributes more of the classification loss E_cls in the total loss. As such, we draw on the idea of focal loss [20] and add an exponential adjustment factor α to the classification loss to improve its ability to distinguish the foreground from the background. The improved classification loss is expressed as follows: where I_ij^defect indicates whether the j-th bounding box of the i-th grid is responsible for predicting the target: I_ij^defect = 1 when the wheelset tread damage target falls into the j-th bounding box generated by the i-th grid; otherwise, I_ij^defect = 0. Therefore, only the grid responsible for predicting the target is penalized for the classification error. p̂_i(c) represents the true probability of category c, and p_i(c) represents the predicted probability of category c.
Journal of Advanced Transportation
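To make the role of the adjustment factor concrete, the sketch below shows one possible focal-style reading of the description above. It is not the paper's exact formula, which is not reproduced in this text: the total loss is assumed to be the plain sum of the three errors, and the modulating term (1 - p_t)^α follows the focal loss idea rather than the authors' definition. All function names are hypothetical.

```python
import math


def total_loss(e_ciou, e_coord, e_cls):
    """ASSUMPTION: the three error terms are simply added."""
    return e_ciou + e_coord + e_cls


def modulated_cls_loss(p_true, p_pred, responsible, alpha=1.1, eps=1e-7):
    """Focal-style binary cross-entropy over responsible boxes only.

    p_true      : list of 0/1 labels (true probability of the class)
    p_pred      : list of predicted probabilities
    responsible : list of 0/1 flags playing the role of I_ij^defect
    alpha       : exponent of the modulating term (hypothetical form)
    """
    loss = 0.0
    for y, p, flag in zip(p_true, p_pred, responsible):
        if not flag:
            # Boxes not responsible for a target add no classification loss.
            continue
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if y == 1 else 1.0 - p
        # Well-classified samples (p_t near 1) are down-weighted.
        loss += -((1.0 - p_t) ** alpha) * math.log(p_t)
    return loss


if __name__ == "__main__":
    easy = modulated_cls_loss([1], [0.9], [1])
    hard = modulated_cls_loss([1], [0.3], [1])
    print(easy, hard)
```

In this reading, an easy sample contributes far less than a hard one, which is the intended effect when background candidates dominate the loss.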

Experiments
MFF-Yolov4 was evaluated on our self-built dataset, called WT-DET. The experimental results show that the model we designed is feasible and effective.

Dataset Collection.
The image samples needed in the experiment were collected from the wheel axle workshop of CRRC Group Co., Ltd., Zhuzhou City, Hunan Province, China. The collection equipment included a wheel delivery track, a CCD area camera, a magnetic steel sensor, and a computer for image storage (Figure 5). When the wheelset was sent into the acquisition area by the wheel delivery track, the camera waited for the hardware control signal; once the locomotive wheelset triggered the magnetic steel sensor, the camera began collecting images. The camera collected one or more wheelset images each time. The computer was used to save the collected images and generate the dataset.

Dataset Production.
We collected wheelset tread images according to actual needs, including both defect samples and normal samples. An example of a defect is shown in Figure 6. Note that a single image may have multiple defects. Our dataset contains 278 samples, of which 204 are wheelset tread defect samples and 74 are normal samples; the total number of wheelset tread defects is 218, as shown in Figure 7.

Defect Detection on WT-DET.
For the pretrained Yolov4, the MFF module is new. Thus, we trained Yolov4 and MFF to share the same convolutional features. The backbone is essentially a feature extraction network that generates the multiscale features F_i. The multiscale features generated by MFF can be fed into the Yolo head after the second fusion by PANet. Therefore, the pretrained backbone network was trained jointly with MFF and PANet in an end-to-end manner. In particular, training was divided into two stages. In the first stage, the shared convolutional layers (backbone) were frozen, and the nonshared layers (all others) were trained. In the second stage, the shared convolutional layers were unfrozen and the whole network was trained.
A defect detection experiment on the WT-DET dataset was conducted. The GPU used in the experiment was an RTX 2080Ti, the Python version was 3.6, and the experiment was carried out in a Keras 2.1.5 and TensorFlow 1.13.2 environment. The improved Yolov4 model was used in this study. In the formal training, the dataset was divided into training and test sets at a ratio of 7:3, giving 193 training images and 84 test images. To ensure the reliability of model training, ten-fold cross-validation was used: in each epoch, the 193 training samples were randomly divided into ten parts, nine of which served as the training set and one as the validation set, to avoid the overfitting caused by an unreasonable data division when samples are scarce. For MFF-Yolov4, the input images were resized to a uniform size of 416 × 416 × 3. We used the Adam optimizer and adopted the freezing-based training procedure described above. The total number of epochs was 100. In the first 50 epochs, the backbone network was frozen and the fusion network part (MFF and PANet) was trained. At this stage, the batch size was set to eight, and the learning rate was set to 0.001. In the last 50 epochs, the entire network was trained. At this stage, the batch size was set to two, and the learning rate was set to 0.0001.
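The training settings above can be summarized in a few lines of framework-independent code. This is a sketch of the schedule and the data division only, not the authors' training script: the function names are hypothetical, the epoch index is assumed to start at zero, and the shuffling seed is arbitrary.

```python
import random


def stage_for_epoch(epoch, total_epochs=100, freeze_epochs=50):
    """Return the settings for a 0-based epoch under the two-stage schedule."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < freeze_epochs:
        # Stage 1: backbone frozen, only MFF, PANet, and the head are trained.
        return {"backbone_frozen": True, "batch_size": 8, "learning_rate": 1e-3}
    # Stage 2: the whole network is trained with a smaller learning rate.
    return {"backbone_frozen": False, "batch_size": 2, "learning_rate": 1e-4}


def ten_fold_splits(n_samples, folds=10, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Interleaved slicing gives parts whose sizes differ by at most one.
    parts = [indices[k::folds] for k in range(folds)]
    for k in range(folds):
        val = parts[k]
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        yield train, val


if __name__ == "__main__":
    print(stage_for_epoch(0))
    print(stage_for_epoch(50))
    for train, val in ten_fold_splits(193):
        print(len(train), len(val))
```

With 193 training samples, each validation part holds 19 or 20 images, so every fold trains on roughly 173 images.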
We used the dataset divided as described above to fine-tune MFF-Yolov4. The MFF-Yolov4 model obtained by improving Yolov4 was trained and tested, and an ablation experiment was carried out for each step of the improvement. Details are given in the following sections. The training results of the model are shown in Figure 8.
As the number of training epochs increases, the loss value gradually decreases and finally converges, indicating that the model fits the data effectively. The loss is large at the beginning of model training. Thus, the first iteration is ignored when drawing the loss curve. Figure 8 shows that the loss value drops significantly during the first 20 epochs of training, and when the training reaches a certain stage, the curve tends to stabilize. After 50 epochs, the loss value is maintained at about 1.3. The learning behavior of the neural network is satisfactory, and the hyperparameter settings in the training phase are reasonable.
A comparative experiment was carried out on the self-built dataset with the current mainstream single-step and two-step target detection models (Table 1). In addition, unlike in defect classification, the F1-score alone is not a suitable performance metric for defect detection. Therefore, the precision, recall rate, and average precision (AP) were also used to evaluate the results of the detection experiment. These indicators are defined as follows: where TP, FP, and FN represent the number of true positives, false positives, and false negatives, respectively. The mean average precision (mAP) was also calculated to evaluate the overall performance. Table 2 shows the experimental results of defect detection. Under the same conditions, the results of the Yolov4 model differ from those of the other models on our self-built dataset in every metric. However, Yolov4 with the embedded MFF module has a higher recall, mAP, and F1-score than the other models. The results show that the wheelset tread features extracted from the multiscale features have more comprehensive semantic features and location details. Yolov4 itself is a multiscale feature detector, but the multiscale features fused by our method have a more comprehensive feature representation, as discussed in detail below. A detection example on WT-DET is shown in Figure 9.
The above ablation experiment and detection method comparison experiment show that MFF can effectively improve the defect detection mAP of Yolov4. However, it remains to be shown that the mAP improvement of MFF-Yolov4 benefits from the location information contained in the multiscale fusion features extracted by MFF. The positioning accuracy of MFF is evaluated in the next section.
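The indicators named in this section have standard definitions in terms of TP, FP, and FN: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). The sketch below implements these, together with one common way to compute AP as the area under the precision-recall curve. The all-point interpolation used here is an assumption, since the text does not state which AP variant was used, and the function names are hypothetical.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    return 2.0 * p * r / (p + r) if p + r else 0.0


def average_precision(scores, is_true_positive, n_ground_truth):
    """Area under the precision-recall curve (all-point interpolation).

    scores           : confidence of each detection
    is_true_positive : whether each detection matches a ground-truth box
    n_ground_truth   : number of ground-truth boxes of this class
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    p, r = precision(8, 2), recall(8, 2)
    print(p, r, f1_score(p, r))
    print(average_precision([0.9, 0.8, 0.7], [True, False, True], 2))
```

When only one defect class is evaluated, mAP and AP coincide.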

MFF Semantic Analysis.
To evaluate the impact of the MFF module on classification, the results of defect classification are first reported to show that our method offers improved accuracy compared with the competing methods. Table 3 shows the results of the ablation experiments before and after the optimization of Yolov4. From the F1-scores in Table 3, we can draw the following conclusion: compared with the original single-scale algorithm, MFF has better classification capability. Thus, multiscale fusion features still have strong semantic capabilities. When the adjustment factor of the improved loss function is set to α = 1.1, the performance of the proposed method improves further over MFF-Yolov4. In Table 4, we report additional comparative experiments in which the MFF module in MFF-Yolov4 was replaced with two other fusion modules, FPN [40] and ASFF [41], to demonstrate the performance of the proposed method.

MFF Positioning Analysis.
To verify that MFF improves the positioning accuracy, Yolov4 and MFF-Yolov4 were compared. If multiscale fusion features have more location details, MFF-Yolov4 should have a higher recall rate under the same IoU. Based on this, different IoU thresholds were used to evaluate the recall rate on the self-built dataset, WT-DET. IoU represents the ratio of the intersection to the union of the prediction box and the ground-truth box. Figure 10 shows the defect recall rate of Yolov4 at different IoU thresholds with and without the MFF module. The higher the IoU threshold, the higher the required quality of the prediction box. As expected, Yolov4 with MFF is better than the original Yolov4. When 0 < IoU < 0.87, the recall rate of Yolov4 is significantly lower than that of MFF-Yolov4. The original Yolov4 only regressed three high-level features of different scales, and the position information of wheelset tread defects was filtered out by the earlier layers of the network, which reduced the quality of the features. Our MFF selectively combines the previous layers with less interference information and rich location information, as discussed in Section 4. This gives MFF-Yolov4 stronger positioning capabilities.
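The recall-versus-IoU evaluation described above can be sketched as follows. This is a simplified illustration rather than the evaluation code behind Figure 10: boxes are assumed to be in corner format (x_min, y_min, x_max, y_max), each ground-truth box is greedily matched to its best unused prediction, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(predictions, ground_truths, threshold):
    """Fraction of ground-truth boxes matched with IoU >= threshold."""
    if not ground_truths:
        return 0.0
    used = set()
    matched = 0
    for gt in ground_truths:
        best_iou, best_idx = 0.0, None
        for idx, pred in enumerate(predictions):
            if idx in used:
                continue  # each prediction may match only one ground truth
            value = iou(pred, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= threshold:
            matched += 1
            used.add(best_idx)
    return matched / len(ground_truths)


def recall_curve(predictions, ground_truths, thresholds):
    """Recall at each IoU threshold, as plotted in a recall-vs-IoU figure."""
    return [recall_at_iou(predictions, ground_truths, t) for t in thresholds]


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [(0, 0, 10, 8), (21, 21, 30, 30)]
    print(recall_curve(preds, gts, [0.5, 0.805, 0.9]))
```

A detector with better localization keeps its recall high as the threshold rises, which is the behavior the comparison between Yolov4 and MFF-Yolov4 is meant to expose.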

Discussion
In this part, to prove the effectiveness of MFF, several hidden factors that affect the proposed wheelset tread defect detection module are discussed.

Determining the Connection Layer of MFF.
MFF combines features from different levels into multiscale features, and this effectively improves detection. In Section 3.2, the kinds of layers that should be combined are briefly discussed. In MFF-Yolov4, two layers belong to the bottom-layer features, namely, the last layers of F1 and F2. We discuss how to integrate these two features into the three high-level feature layers, as illustrated in Figure 11, where ⊕ denotes a splicing operation. As shown in Table 5, integrating the second layer into the other three high-level feature layers is significantly better than other methods, thereby showing that multiscale fusion features are effective at improving detection accuracy.
Figure 9: A visual example of the test results on WT-DET. The red box is the bounding box indicating the defect location, and the black text is its category score.
We also tried integrating the lower-level feature F1. In principle, MFF should then have stronger detection performance, but the quantitative indicators show that fusing the low-level features of F1 leads to a decline in the model mAP. In Figure 12, we visualize the feature maps F1 (Figure 12(a)) and F2 (Figure 12(b)) flowing through the networks in Table 5 and analyze the reasons for the degradation of the model's performance. Although the F1 feature has more locational details than the F2 feature, it also has more noise and interference because our images come from an actual industrial environment, which makes it difficult for the network to learn defect details.
MFF unifies the resolution and the number of channels of features from different levels through 3 × 3 convolution and 1 × 1 convolution, respectively. To keep the number of channels consistent, a simple method is to use 1 × 1 convolution to increase or decrease the number of channels. This 1 × 1 convolution can be placed in two ways, and the placement strategy is selected by comparing them. Front mode refers to placing the 1 × 1 convolution before the multiscale features are connected, and rear mode refers to placing the 1 × 1 convolution after multiscale feature fusion. We adopted the rear mode. That is, after each pair of features at two scales is spliced, the number of channels is adjusted through a 1 × 1 convolution. Although the rear mode increases the number of parameters, it results in only a slight drop in network detection speed (2 FPS), whereas the rear-mode mAP is higher, as shown in Table 6. Cascaded 3 × 3 convolutions are used for step-by-step downsampling to handle the fusion between the underlying F2 feature and the three features of different scales, namely, F3, F4, and F6. Compared with direct downsampling, step-by-step downsampling can retain more image details.
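A rough parameter count shows why the rear mode is the more expensive placement. The sketch below is illustrative only: the channel numbers are assumed, biases are ignored, and the front mode is assumed to reduce each branch to half of the target channel count before concatenation, which is one plausible configuration and not necessarily the one used in Table 6. The function names are hypothetical.

```python
def conv_params(c_in, c_out, kernel=1):
    """Weight count of a convolution layer, ignoring biases."""
    return kernel * kernel * c_in * c_out


def front_mode_params(c_low, c_high, c_out):
    """1 x 1 conv on each branch first, then concatenation.

    ASSUMPTION: each branch is reduced to c_out // 2 channels so that the
    concatenated result has c_out channels.
    """
    half = c_out // 2
    return conv_params(c_low, half) + conv_params(c_high, half)


def rear_mode_params(c_low, c_high, c_out):
    """Concatenation first, then one 1 x 1 conv over all channels."""
    return conv_params(c_low + c_high, c_out)


if __name__ == "__main__":
    # ASSUMPTION: 128 low-level channels fused with 256 high-level channels.
    print(front_mode_params(128, 256, 256))
    print(rear_mode_params(128, 256, 256))
```

Under these assumptions the rear mode uses twice as many 1 × 1 weights as the front mode, because its single convolution sees every concatenated channel and can mix the low-level and high-level features freely.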

Case Analysis of Missed Inspections.
Although the improved model is generally better than other methods on the self-built dataset, there are a few cases of missed detections. As shown in Figure 13, we analyze some failure cases and the reasons for the failures. On the one hand, it is difficult for MFF-Yolov4 to correctly identify damage in the initial stage of wheelset tread abrasion, mainly because of the lack of sample data and insufficient examples of wheelset tread defects. Thus, the network cannot fully learn the characteristics of wheelset tread damage. As shown in Figure 13(a), wheelset tread damage is easily confused with the complex background environment, and even experienced people cannot accurately distinguish it from the background. On the other hand, in a complex environment, the wheelset tread accumulates oil, sand, and rust during operation. These contaminants adhere to the surface of the wheelset tread in patches and are incorrectly labeled in the dataset as wheelset tread defects, as shown in Figure 13(b). Although the neural network, having learned the characteristics of wheelset tread defects, does not predict such a region as a defect, in the quantitative statistics these missed detections are counted as classification errors for each model, which affects the detection accuracy. Here, we compare the false detection rate of our model with that of the other models. The log-average miss rate of the proposed model is 19%, which is lower than that of the six other models, as shown in Figure 14.

Conclusions
It is difficult for wheelset tread defect detection algorithms based on Yolov4 to capture local semantics and location details while ensuring real-time performance. Therefore, this study proposed an improved Yolov4 high-speed train wheelset tread detection algorithm based on multiscale feature fusion. To obtain richer defect category semantics and location details, we embedded a multiscale feature fusion module in the Yolov4 model, which improves its detection accuracy. In addition, a valuable wheelset tread defect detection dataset, WT-DET, was constructed. The wheelset tread detection performance of the proposed algorithm was compared with that of current single-step and two-step detection algorithms on the self-built dataset. The results showed that the F1-score of this algorithm for wheelset tread defect classification reached 86%, and the mAP of wheelset tread defect detection reached 86.25%. Moreover, the model reached a speed of 37.05 FPS. However, regional-level detection can only obtain the approximate area of a wheelset tread defect and cannot reflect the contour of the damage. In future research, we will study wheelset tread defect segmentation technology based on deep learning to obtain finer contour boundaries of wheelset tread defects.

Data Availability
The data used to support the findings of this study are not currently available because the data interface cannot provide external access for the time being.