An Improved Object Detection Method for Underwater Sonar Image Based on PP-YOLOv2



Introduction
In recent years, with the rapid development of autonomous underwater vehicles (AUVs), underwater seaway safety has become an important research hotspot. Obstacles, large rocks, and piers in the water greatly affect the path planning and task execution of an AUV and, more seriously, may cause safety accidents. As a high-resolution, multipurpose marine detection device, forward-looking sonar, installed at the front of an AUV, is an easily accessible and economical way to obtain images of underwater obstacles and objects. It is widely applied in fields such as automatic obstacle avoidance, seabed mapping, ecological monitoring, and pipeline inspection.
The forward-looking sonar records the back-scattered echo intensity of an object and generates sonar images whose gray levels depend on that intensity, which is called reflection intensity imaging of the object. Compared with other acoustic detection systems, object detection using forward-looking sonar has the following advantages:
(i) High data density and high resolution
(ii) Large coverage and strong recognition ability for underwater objects with special shapes
(iii) Easy installation and low cost
However, a traditional sonar system cannot automatically obtain accurate positioning information for underwater objects. It requires manual identification or off-line postprocessing ashore, which seriously limits the real-time performance and autonomy of the AUV when executing underwater tasks. In addition, recognition and classification accuracy suffers from unclear image edges and heavy image noise, which result from the complexity of sound propagation in water and the characteristics of sound waves.
Many researchers have studied automatic object detection in sonar images, using traditional manually designed feature methods [1], machine learning (ML) methods [2], and deep learning (DL) methods [3]. Because they rely on manually designed features, traditional sonar image object detection methods cannot make full use of the deep features of sonar images and lack robustness and generalization ability. Methods based on deep learning [3][4][5] have gradually become the mainstream for sonar image object recognition because of their powerful automatic feature extraction capability.
According to whether region proposals are generated and used, DL-based object detection methods can be divided into two-stage models and one-stage models. Following the idea of traditional object detection, two-stage models first generate a large number of region proposals and then generate fixed-size feature maps to perform the localization and classification tasks, respectively. The region-based convolutional neural network (R-CNN) [3] is the first two-stage model; it innovatively used a convolutional neural network (CNN) to extract image features. Other typical two-stage models include Faster R-CNN [6] and Cascade R-CNN [7], which were proposed successively to improve detection efficiency.
One-stage models do not generate region proposals; therefore, the computation is relatively small and the detection speed is high, which suits real-time use, but some detection accuracy is sacrificed. The YOLO (you only look once) model [5], commonly known as YOLOv1, is a typical one-stage model. YOLOv3 [8] and YOLOv5 are well-known successors of YOLOv1. PP-YOLOv2 [9] adopts a set of optimization strategies to improve the accuracy of the detector and achieves a very high cost-performance ratio: its mean average precision (mAP) is 45.9, and its speed reaches 72.9 frames per second (FPS).
At present, sonar object detection is still a very challenging task because of the problems of multiple scales, dual priorities, speed, limited data, and class imbalance. These problems strongly affect real-time detection accuracy. To implement real-time object recognition in sonar images efficiently, this paper makes the following contributions:
(i) PP-YOLOv2 is introduced, for the first time, to the detection of underwater obstacles and objects for real-time sonar image object detection
(ii) Some useful preprocessing methods are presented for sonar images, including noise reduction, image resizing, and CutMix [10]
(iii) Some updates to PP-YOLOv2 are proposed, including a backbone network with an attention mechanism and a decoupled head, which together form a high-performance sonar image multiobject detector called PPYOLO-T
The paper is structured as follows: section Related Work gives an overview of the related work; section Methodology describes our method for constructing a sonar multiobject detection network based on PP-YOLOv2, followed by extensive experiments for evaluating the proposed method in section Experiments. We conclude this paper in section Conclusions.

Related Work
Sonar image-oriented object detection is of great significance to underwater detection. It has been studied for many years. Traditional sonar image object detection methods are mainly based on artificial design features. Myers and Fawcett [1] proposed a sonar image object detection algorithm based on template matching (TM), where objects were located and classified by using the features of the template designed manually. Much work was devoted to artificial design features. Some useful features include physical characteristics of foreground and background [11], context information [12], and statistics about the environment [2].
However, traditional sonar image object detection methods cannot make full use of the deep features of sonar images for decision making. They also usually lack robustness and generalization ability. These shortcomings limit the application of traditional methods. In recent years, researchers [13][14][15] have introduced deep learning-based object detection into sonar image object detection and achieved good results. At present, the mainstream deep learning detection algorithms can be divided into two series: R-CNN [3] based and YOLO [5] based.

R-CNN Based Methods.
In recent years, convolutional neural networks (CNNs) have been widely used in classification tasks. The region-based convolutional neural network (R-CNN) proposed by Girshick et al. [3] was the first to introduce the convolutional neural network into the object detection task. It largely overcomes the high complexity and heavy computation of traditional object detection algorithms such as the deformable part model (DPM) [16]. Then, spatial pyramid pooling (SPP) [17] and Fast R-CNN [6] further improved the accuracy of object detection on the natural image data sets Pascal VOC [18] and MS COCO [19]. Faster R-CNN [6] saved the computation cost of region proposals by introducing the region proposal network (RPN), enabled end-to-end training of the whole model, and improved detection efficiency. Cascade R-CNN [7] used cascade regression as a resampling mechanism to improve the intersection-over-union (IoU) value of the proposals stage by stage, so that the resampled proposals of the previous stage can adapt to the next stage with a higher threshold.
The above R-CNN-based methods are two-stage models. In the detection process, a large number of region proposals are generated or referenced, and fixed-size feature maps are generated from them to perform the localization and classification tasks, respectively. As a result, R-CNN-based models usually have a large number of parameters and a slow detection speed, which makes them difficult to use in real applications.

YOLO-Based Methods.
YOLO [5] is a network that differs fundamentally from the region-based convolutional neural networks. It transforms object detection into a regression problem: the classification probability and location of an object are given by a single neural network in a single detection pass. This gives YOLO a huge advantage in terms of inference time and detection accuracy, making real-time application possible. The models [8,20,21] from YOLOv2 [22] to YOLOX [23] have steadily improved performance and speed. The improvements include new backbones such as Darknet-19 [22] and Darknet-53 [8], an added SPP layer, new training strategies such as multiscale training and exponential moving average (EMA), and a decoupled head. Unlike YOLOv4 and YOLOv5, which explore various complex backbones and data augmentation methods, PP-YOLO [4] is based on YOLOv3, relies only on Mixup, and improves model performance through a reasonable combination of tricks. PP-YOLOv2 [9] adopts a set of optimization strategies to improve the accuracy of the detector and achieves a very high cost-performance ratio (mAP 45.9 and 72.9 FPS) while hardly increasing the number of model parameters or the computation (FLOPs).
This paper focuses on multiple target detection of underwater sonar images. Different from existing methods, we explore some preprocessing methods to improve the model robustness and design a new detection model based on PP-YOLOv2 to improve the detection accuracy for underwater sonar images.

Methodology
This section first briefly reviews PP-YOLOv2 and then elaborates the proposed method for sonar object detection, which is called PPYOLO-T in this paper.
PP-YOLOv2 is an optimized model based on PP-YOLO [4] and YOLOv3 [8]. It uses the same backbone network as PP-YOLO, called ResNet50-vd [24], and adds more tricks that improve model accuracy while introducing as little extra computation as possible. Specifically, it uses the path aggregation network [25] (PANet) to aggregate the top-down information in the detection neck, applies the mish activation function [26] in the detection neck instead of the backbone, increases the input size, and applies a soft label format for the IoU aware loss.
The challenges of sonar object detection lie in many factors, such as low SNR, complex backgrounds, and small objects. It is difficult to achieve ideal results by directly applying an existing model to the sonar image detection task. In this paper, we first preprocess the sonar images and then, based on PP-YOLOv2, propose PPYOLO-T for better sonar multiobject detection. Figure 1 shows the overall flow of our method. There are three main parts: sonar image preprocessing, training of PPYOLO-T, and target detection for a new sonar image.

Preprocessing.
Because of the variety of sonar equipment on the market, sonar images usually differ in resolution and contain considerable noise. To train a robust model for sonar image detection, preprocessing is necessary. As shown in Figure 1, the preprocessing for model training includes noise reduction, image resizing, anchor resizing, and data augmentation. For sonar image detection in real applications, only noise reduction is used.

Noise Reduction.
There is a lot of acoustic noise in sonar image. In this paper, we leverage Gaussian filtering, median filtering, and bilateral filtering to reduce noises in the original sonar images. Figure 2 shows the noise reduction process.
The Gaussian filter is a signal filter used for smoothing. We use it to improve the SNR of sonar images. Equation (1) shows its calculation formula, in which $(x, y)$ is the current coordinate point, $(x_0, y_0)$ is the central coordinate point, and $\sigma$ is the standard deviation that controls the smoothness of the Gaussian curve.

$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{(x - x_0)^{2} + (y - y_0)^{2}}{2\sigma^{2}}\right) \tag{1}$$

Median filtering is a nonlinear signal processing technique, based on order statistics, that suppresses noise effectively. It filters impulse noise well and, in particular, protects the edges of the signal from being blurred while the noise is removed. Equation (2) shows its calculation formula, in which $f(x, y)$ and $g(x, y)$ are the original image and the processed image, respectively, and $W$ is a two-dimensional template, which is a 3 × 3 region in this paper.

$$g(x, y) = \operatorname{median}\{\, f(x - k, y - l) \mid (k, l) \in W \,\} \tag{2}$$

Bilateral filtering is a popular noise filtering method. It extends Gaussian filtering by additionally taking the pixel values into account, so it preserves edges better. Therefore, it benefits the edge detection of stereo objects in underwater sonar images. Equation (3) shows its calculation formula in its commonly used form, in which $1/W_p$ is a normalization factor, $G_{\sigma_s}$ is the space weight, $G_{\sigma_r}$ is the range weight, $I_p$ and $I_q$ are the gray values at pixel $p$ and at a neighboring pixel $q$, and $S$ is the filter window.

$$BF[I]_p = \frac{1}{W_p} \sum_{q \in S} G_{\sigma_s}(\lVert p - q \rVert)\, G_{\sigma_r}(\lvert I_p - I_q \rvert)\, I_q \tag{3}$$
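As an illustration of the first two filters, the following minimal sketch builds a normalized Gaussian kernel and applies a 3 × 3 median filter to a small gray image stored as nested lists. The function names, the kernel size, and the border handling (border pixels use only the neighbors that exist) are assumptions made for this example and are not taken from the PPYOLO-T implementation.

```python
import math
from statistics import median


def gaussian_kernel(size=3, sigma=1.0):
    """Return a normalized size x size Gaussian kernel centered on the middle cell."""
    c = size // 2  # index of the central coordinate (x0, y0)
    raw = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
            for x in range(size)] for y in range(size)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def median_filter_3x3(img):
    """Apply a 3 x 3 median filter; border pixels use the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out


noisy = [[10, 10, 10],
         [10, 255, 10],   # a single impulse pixel
         [10, 10, 10]]
kernel = gaussian_kernel(3, 1.0)
print(median_filter_3x3(noisy)[1][1])  # prints 10: the impulse is removed
```

The median filter removes the isolated impulse without averaging it into its neighbors, which is the edge-preserving behavior described above.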

Image Resizing.
In many cases, sonar images collected by different devices vary in both resolution and image size. However, DL-based methods need a unified image size for model training. This paper proposes a simple but effective method to resize sonar images to the same size without loss of resolution. Figure 3 shows the image resizing process. We first obtain a list of length-width ratios from statistics on the data, such as 2048 × 768 and 2048 × 512. Then, for each sample in a training iteration, we randomly segment the original image according to the length-width ratio list and fill the missing area of the segmented image with gray bars up to a predefined image size, such as 640 × 640 or 768 × 768. Finally, we use the reconstructed images of unified size for model training.
Instead of stretching a sonar image to a uniform size, we normalize the image size by cutting and filling. This does not deform the image and preserves its original resolution. Therefore, the generalization ability and recognition performance of the model can be improved. We will illustrate this through an ablation experiment in section Experiments.
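The cut-and-fill idea can be sketched as follows. This is a simplified illustration under stated assumptions: the image is a nested list of gray values, a single random window no larger than the target size is cut (the length-width ratio list is not modeled), and the gray value 128 and the name `crop_and_pad` are hypothetical. In a real pipeline the labeled boxes would also have to be shifted and clipped to the window.

```python
import random


def crop_and_pad(img, target=640, gray=128, rng=random):
    """Cut a window of at most target x target pixels and pad it with gray bars."""
    h, w = len(img), len(img[0])
    ch, cw = min(h, target), min(w, target)   # size of the cut window
    top = rng.randint(0, h - ch)              # random window position
    left = rng.randint(0, w - cw)
    out = [[gray] * target for _ in range(target)]
    for y in range(ch):
        # Pixels are copied unchanged, so no interpolation or stretching occurs.
        out[y][:cw] = img[top + y][left:left + cw]
    return out


# A 4 x 10 "wide" image is normalized to 6 x 6.
wide = [[x + 10 * y for x in range(10)] for y in range(4)]
square = crop_and_pad(wide, target=6, gray=128, rng=random.Random(0))
```

Because the pixels are copied rather than resampled, the aspect ratio and resolution of the objects are unchanged; only the canvas size is unified.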

Anchor Resizing.
Because it is built on PP-YOLOv2, the proposed PPYOLO-T is also an anchor-based detection method. Anchors are a set of preset bounding boxes of different scales and sizes. During network training, the real bounding box position is expressed as an offset from the preset box position. In PP-YOLOv2, the anchor boxes are preset based on the COCO data set. In this paper, we resize the anchor boxes based on real sonar images to account for small object detection. Specifically, we use the K-means [27] algorithm to cluster all the labeled boxes. The parameter K is set to 9, following PP-YOLOv2. The mean bounding box of each cluster is selected as a preset anchor box. Anchor resizing for sonar images proves to be effective; details are shown in section Experiments.
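A minimal sketch of anchor clustering is given below. It runs plain K-means on the (width, height) pairs of labeled boxes and returns the cluster means. The Euclidean distance, the deterministic initialization, and the toy boxes are assumptions of this example; the source only states that K-means with K = 9 is used, and YOLO-style pipelines often use an IoU-based distance instead.

```python
def kmeans_anchors(boxes, k, iters=50):
    """Cluster (width, height) pairs with plain K-means; return the cluster means."""
    boxes = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic initialization (an assumption): k boxes evenly spaced by area.
    centers = [boxes[(2 * i + 1) * len(boxes) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            nearest = min(range(k),
                          key=lambda i: (w - centers[i][0]) ** 2 + (h - centers[i][1]) ** 2)
            clusters[nearest].append((w, h))
        new_centers = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy labeled boxes: three small objects and three large ones.
labeled = [(10, 12), (12, 10), (11, 11), (50, 60), (60, 50), (55, 55)]
anchors = kmeans_anchors(labeled, k=2)
print(anchors)  # prints [(11.0, 11.0), (55.0, 55.0)]
```

With the real data set, the same call with `k=9` would yield nine anchors that follow the size distribution of sonar targets rather than that of COCO objects.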

Data Augmentation.
Data augmentation is an effective technique for improving the accuracy of image-related tasks such as classification and object detection. In this paper, flipping, random expansion, CutMix, and Mosaic [28] are used for sonar image preprocessing. Mixup [29] was used for data augmentation in PP-YOLOv2 and achieved good results on the COCO data set. However, when we applied Mixup to sonar image preprocessing, it reduced accuracy. This may be because the SNR of sonar images is much lower than that of ordinary RGB images. Mixup overlays two images. If the objects in the sonar images overlap, the overlapped object features differ too much from the original object features. For the same reason, other preprocessing that may change the appearance of the original image, such as brightness adjustment, is also excluded from the data augmentation of sonar images.
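The difference between the two augmentations can be seen in a small sketch. Mixup blends every pixel of two images, whereas CutMix pastes a rectangular patch and leaves all other pixels untouched. The function names, the fixed patch position, and the toy images are assumptions for illustration; the corresponding mixing of labels and boxes is omitted.

```python
def mixup(a, b, lam=0.5):
    """Blend two equally sized gray images pixel by pixel."""
    return [[lam * pa + (1 - lam) * pb for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def cutmix(a, b, top, left, height, width):
    """Paste a rectangular patch of b into a copy of a; other pixels stay untouched."""
    out = [row[:] for row in a]
    for y in range(top, top + height):
        out[y][left:left + width] = b[y][left:left + width]
    return out


dark = [[0] * 4 for _ in range(4)]       # background-like image
bright = [[200] * 4 for _ in range(4)]   # strong-echo image
blended = mixup(dark, bright)
patched = cutmix(dark, bright, top=0, left=0, height=2, width=2)
print(blended[0][0])                # prints 100.0: every pixel is altered
print(patched[0][0], patched[3][3])  # prints 200 0: only the patch changes
```

In the blended image no pixel keeps its original echo intensity, which is consistent with the explanation above of why Mixup harms low-SNR sonar images, while CutMix keeps the original intensities inside and outside the patch.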

The Proposed PPYOLO-T.
After data preprocessing, sonar images are passed to PPYOLO-T for model training, and the trained model is then used for sonar image object detection. As shown in Figure 1, the overall architecture of PPYOLO-T consists of three parts, namely, the backbone BotNet-dcn, the neck PAN, and the decoupled head. Of these, BotNet-dcn and the decoupled head are elaborated in this subsection. We omit the details of PAN since it is the same as that used in PP-YOLOv2.

BotNet-dcn.
In the original PP-YOLO and PP-YOLOv2 [9], ResNet50-vd-dcn is used to extract feature maps at different scales. ResNet [30] has been widely used in a variety of feature extraction applications. Recently, however, the attention mechanism [31] has been gradually applied in machine vision, and attention-based backbones for image feature extraction already exist, such as BotNet [32] and Swin transformer [33]. In this work, BotNet is chosen as the backbone for reasons of efficiency. Swin transformer, which is built by stacking attention modules, would lead to a significant decline in detection efficiency, whereas BotNet only replaces some 3 × 3 convolution layers with multihead self-attention (MHSA). Its inference efficiency is much higher than that of Swin transformer, and its number of parameters is even lower than that of the original ResNet. Figure 4 shows the backbone network designed for sonar object detection.
Following PP-YOLO [4], we replace some layers in BotNet with deformable convolution networks (DCNs), considering that directly replacing ResNet with BotNet would hurt the performance of the PP-YOLO detector. To balance efficiency and effectiveness, we only replace the MHSA in the last stage with DCNs, as shown in Figure 4. We denote this modified backbone as BotNet-dcn.
As shown in Figure 5, there are several possible positions for replacing MHSA with DCNs. We will compare them through ablation experiments in section Experiments. Our experimental results show that replacing the MHSA in the last stage with DCNs performs best.
To sum up, the proposed BotNet-dcn first uses three CNN layers to extract the local features of the image, then uses two multihead attention module layers to integrate the global features, then applies a deformable convolutional network for further adjustment, and finally outputs the feature map. We will demonstrate its effectiveness in the experiments.

Decoupled Head.
The head is the part of the model structure that predicts the object category and position (bounding box). Decoupled heads are widely used in one-stage and two-stage detectors [34,35]. However, while the backbones and feature pyramids of the YOLO series (e.g., feature pyramid network [36] and path aggregation network [25]) have evolved continuously, their detection heads have remained coupled, as shown in Figure 6. YOLOX, proposed by Ge et al. [23], shows that replacing the coupled head with a decoupled head can greatly improve model performance. Based on PP-YOLOv2, this paper proposes a decoupled head for sonar object detection. As shown in Figure 6, the coupled head used in PP-YOLOv2 generates 3 × (1 + 4 + 1 + Classes) channels through a 3 × 3 convolution, and the extra channel is used to calculate the IoU aware loss, which smooths the prediction information. In contrast, we use a decoupled head following YOLOX. Unlike the decoupled head used in YOLOX, however, we put category prediction and object prediction in the same branch and add another branch to calculate the IoU aware loss.
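The channel bookkeeping of the two head designs can be made concrete with a small sketch. The coupled count follows the 3 × (1 + 4 + 1 + Classes) expression above. The three-branch split of the decoupled head (category plus objectness, box regression, IoU aware) is an assumed reading of the description, and the value of 8 classes is hypothetical.

```python
def coupled_head_channels(num_classes, anchors_per_cell=3):
    """Channels of the single coupled output: 3 x (1 + 4 + 1 + Classes)."""
    return anchors_per_cell * (1 + 4 + 1 + num_classes)


def decoupled_head_channels(num_classes, anchors_per_cell=3):
    """Per-branch channels when the same predictions are split into branches."""
    return {
        "cls_obj": anchors_per_cell * (num_classes + 1),  # category + objectness
        "box": anchors_per_cell * 4,                      # x, y, w, h
        "iou_aware": anchors_per_cell * 1,                # IoU aware prediction
    }


coupled = coupled_head_channels(num_classes=8)
branches = decoupled_head_channels(num_classes=8)
print(coupled)                  # prints 42
print(sum(branches.values()))   # prints 42
```

Under these assumptions the decoupled head predicts the same quantities as the coupled head; the difference is that each group of outputs is computed by its own branch instead of one shared convolution.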
The loss is given by Equation (4), which includes three parts: confidence loss, classification loss, and location loss.
To be specific, the confidence loss is calculated with binary cross-entropy, as shown in Equation (5). The confidence indicates whether there is a center point at this grid cell, that is, whether there is an object. $o_i \in [0, 1]$ represents the IoU of the predicted object bounding box and the real object bounding box, $c_i$ is the predicted value, and $\hat{c}_i$ is the confidence score computed from it by the Sigmoid function. $N$ is the number of positive and negative samples.
The classification loss is also calculated with binary cross-entropy. In Equation (6), $O_{ij} \in \{0, 1\}$ indicates whether an object of category $j$ is present in predicted bounding box $i$, $C_{ij}$ stands for the predicted value, and $N_{pos}$ is the number of positive samples.
During training, the squared error loss is used for the location loss, as shown in Equation (7), where $\hat{g}$ represents the center coordinates $x$ and $y$, the width, and the height of the labeled box in the training data, and $t_x$, $t_y$, $t_w$, and $t_h$ are the regression parameters predicted by the network.
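The equations themselves are not reproduced in this text. One plausible form that is consistent with the symbol definitions above is sketched below. It assumes an unweighted sum of the three parts, standard binary cross-entropy, and a Sigmoid applied to the raw class predictions; any weighting coefficients or further terms used by the authors are unknown.

```latex
\begin{align}
L &= L_{conf} + L_{cls} + L_{loc} \tag{4} \\
L_{conf} &= -\frac{1}{N} \sum_{i} \left[ o_i \ln \hat{c}_i + (1 - o_i) \ln (1 - \hat{c}_i) \right],
  \qquad \hat{c}_i = \mathrm{Sigmoid}(c_i) \tag{5} \\
L_{cls} &= -\frac{1}{N_{pos}} \sum_{i \in pos} \sum_{j} \left[ O_{ij} \ln \hat{C}_{ij}
  + (1 - O_{ij}) \ln (1 - \hat{C}_{ij}) \right],
  \qquad \hat{C}_{ij} = \mathrm{Sigmoid}(C_{ij}) \tag{6} \\
L_{loc} &= \frac{1}{N_{pos}} \sum_{i \in pos} \sum_{m \in \{x, y, w, h\}}
  \left( t_m^{i} - \hat{g}_m^{i} \right)^2 \tag{7}
\end{align}
```

Here the sums over $i \in pos$ run over the positive samples only, which matches the use of $N_{pos}$ as the normalizer for the classification and location terms.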
The effectiveness of the proposed decoupled head will be illustrated in the following experiments.

Detection for Underwater Sonar Image.
In real applications, the detector is required to have good real-time performance. Our PPYOLO-T keeps only one preprocessing step, noise reduction, because underwater sonar images have a low signal-to-noise ratio. After noise reduction, the trained model detects the objects in the image and gives the results for the different categories. It is able to detect multiple objects at a time and also handles small objects well. We will show its effectiveness in the following section.
Figure 6: Illustration of the difference between the PP-YOLOv2 head and the proposed PPYOLO-T head.

Experiments
In this section, we conduct extensive experiments to assess the effectiveness of our proposed PPYOLO-T on multiobject recognition for sonar images. We first introduce the data set and then elaborate the experimental settings, followed by results and discussions.

Data Set.
In this experiment, we use the forward sonar data from the Ocean Space Environment Awareness (Orca) open-source project (https://code.ihub.org.cn/companies/vgz4xa2q, 2022-08-12). The data can be accessed on GitHub (https://github.com/violetweir/PPYOLO-T/tree/main/dataset, 2022-08-12). There are 5,000 images in total, of which 4,000 are used for training and 1,000 for testing. Table 1 shows details about the object categories and the number of sonar images for each category. The number of objects is nearly the same as the number of images, indicating that most sonar images contain only one object. Figure 7 shows some examples from each object category. The underwater sonar images vary in size, and the resolution is so low that it is hard to recognize an object at a glance.

Baselines.
We compare the proposed PPYOLO-T with the following state-of-the-art multiobject detection methods for images:
(i) Faster R-CNN [6] is one of the representative algorithms of the classic R-CNN series. It improves on its predecessor, Fast R-CNN, mainly by integrating feature extraction, proposal extraction, bounding box regression, and classification into one network, which improves overall performance, especially speed

Evaluation Metrics.
In this paper, the performance of the proposed model is evaluated mainly by mean average precision (mAP), a quantitative indicator of the effectiveness of multicategory object detection. The average precision (AP) of a category is calculated as

$$AP = \sum_{n} (r_{n+1} - r_n)\, P_{interp}(r_{n+1})$$

with

$$P_{interp}(r_{n+1}) = \max_{\tilde{r} : \tilde{r} \ge r_{n+1}} P(\tilde{r}),$$

where $P(\tilde{r})$ is the measured precision at recall $\tilde{r}$; that is, the interpolated precision at $r_{n+1}$ is the maximum precision whose recall value is greater than or equal to $r_{n+1}$. The mAP is the mean of the AP values over all categories. Intersection-over-union (IoU) is an indicator based on the Jaccard similarity coefficient that evaluates the overlap between two bounding boxes. It measures the regression precision of object detection. The formula of IoU is as follows:

$$IoU = \frac{S_{overlap}}{S_{union}},$$

in which $S_{overlap}$ is the overlap area between the predicted box and the ground truth box, and $S_{union}$ is the area of the union of the predicted box and the ground truth box.
For each detection, if the result matches some ground truth box with IoU > 0.5, we mark it as positive; otherwise, we mark it as negative. The mean average precision calculated on this basis is denoted as $AP_{0.5}$. In the same way, we obtain $AP_{0.75}$ and $AP_{0.95}$. In addition, the average of $AP_{0.5}$ to $AP_{0.95}$, in which the IoU threshold increases by 0.05 each time, is denoted as $AP_{(0.5:0.95)}$ in this paper.
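The two quantities defined above can be sketched in a few lines. The example computes the IoU of two axis-aligned boxes and an all-point interpolated AP from a precision-recall curve. The box format `(x1, y1, x2, y2)`, the function names, and the toy curve are assumptions for illustration, and the threshold list assumes the common convention of stepping the IoU threshold from 0.5 to 0.95 in increments of 0.05.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    overlap = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - overlap
    return overlap / union if union > 0 else 0.0


def average_precision(recalls, precisions):
    """All-point interpolated AP from a P-R curve sorted by increasing recall."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        # Interpolation: precision at r[i] is the maximum precision at any recall >= r[i].
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# IoU thresholds averaged in AP(0.5:0.95): 0.5, 0.55, ..., 0.95.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))            # overlap 1, union 7
print(average_precision([0.5, 1.0], [1.0, 0.5]))  # prints 0.75
```

The mAP at one threshold is then the mean of `average_precision` over the categories, and $AP_{(0.5:0.95)}$ is the mean of those values over `thresholds`.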
Apart from the mAP, we also test inference time and FPS for each model, to analyze the real-time performance of different models. The inference time is the time required by the algorithm to process each image, and FPS means the number of frames per second processed by some algorithm.

Ablation Experiments for PPYOLO-T.
To test the effect of the proposed model, we conducted ablation experiments on forward sonar object detection, as shown in Table 2. We present the effectiveness of each module in an incremental manner. Note that our inference time and FPS are measured differently from those in YOLOv4 [20] and PP-YOLO [4]: those works do not count decoding and NMS, whereas our experiments include all inference steps.
(i) A. First of all, we follow the original design of PP-YOLOv2 to build our baseline. Because sonar images differ greatly from the images in the COCO data set, Mixup and other preprocessing methods designed for COCO, such as brightness adjustment, do not improve the accuracy of sonar object detection and increase the CPU preprocessing time. Therefore, Mixup and brightness-related preprocessing are removed in this test. The performance of PP-YOLOv2 is shown in the first line of Table 2
[...]
(iii) B ⟶ C. The second refinement that we found to have a positive effect on PP-YOLOv2 was BotNet-dcn. We attempted to add attention mechanisms to the PP-YOLOv2 backbone network. In consideration of efficiency and accuracy, we chose BotNet as the backbone and replaced its MHSA in the last stage with DCNs, as shown in Figure 5. The reason why we replace the MHSA in the last stage will be shown through the following experiment. As a result, BotNet-dcn boosts the AP_(0.5:0.95) performance from 50.54% to 51.43%
(iv) C ⟶ D. The decoupled head is the third refinement with a positive effect. Compared with the coupled head used in PP-YOLOv2, the proposed decoupled head takes into account that classification and positioning focus on different content. Therefore, using different branches for these computations helps improve the results. With the decoupled head, the accuracy in terms of AP_(0.5:0.95) was improved by a further 1.2%
(v) D ⟶ E. Underwater sonar detectors usually detect objects in a large area, so there are many small objects in the generated sonar images.
[...]
From Table 4, we draw the following observations:
(i) Comparison between YOLO and R-CNN. On the task of sonar object detection, models from the YOLO series are superior to R-CNN-based models in terms of both accuracy and efficiency
(ii) Backbone comparison. ResNet achieved the fastest inference speed but the lowest accuracy. By partially adding attention mechanisms, the proposed BotNet-dcn boosted the performance significantly at the expense of a slight reduction in inference efficiency. Swin-tiny, which is built purely from stacked attention mechanisms, dramatically reduces inference efficiency, while its accuracy is not superior to that of the proposed BotNet-dcn
(iii) Image size comparison. By expanding the image size of PPYOLO-T from 640 × 640 to 768 × 768, the detection accuracy is further improved. We find that most of the objects in underwater sonar images are small. The larger the image size, the more prediction boxes there are. We think that is why enlarging the image size works. It can also be seen that a large image size reduces the speed. However, compared with 800 × 1333, which is the best size used in models of the R-CNN series, the proposed PPYOLO-T used a smaller image size and achieved better mAP performance

Performance on Different Objects.
In this subsection, we further evaluate the performance of the proposed PPYOLO-T on different objects. Detection precision, recall, and F1-score are used as the evaluation metrics. In this evaluation, we set the IoU threshold to 0.5, which is commonly used in real applications. Table 5 shows the detailed values for each category. Figure 8 further shows the P-R curve.
We can see that the overall performance of our PPYOLO-T is very good on the task of sonar image multiobject detection. The average precision is 89.6%, and the average recall is over 95%, so the F1-score reaches 92%.
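As a quick consistency check of these figures, the F1-score is the harmonic mean of precision and recall. The sketch below uses 0.95 for the recall, which is only the lower bound stated above, so the result is approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.896, 0.95)
print(round(f1, 3))  # prints 0.922
```

The value of about 0.92 agrees with the reported F1-score of 92%.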
The P-R curve in Figure 8 shows similar results. Specifically, the model performs well on the categories of cube, ball, metal bucket, square cage, and circle cage but not as well on the cylinder and human body categories. This may be because the shapes of the human body and cylinder in underwater sonar images are relatively irregular. There are also many areas without objects in the GT boxes. This may cause the IoU of the prediction box and the real box to fall below the threshold. In this case, the prediction is regarded as a negative sample and thus has a negative effect on model learning.

Conclusions
This paper presents some useful preprocessing methods for sonar images and some updates to PP-YOLOv2, which together form a high-performance sonar image multiobject detector called PPYOLO-T. By introducing an attention mechanism and a decoupled head, PPYOLO-T achieves a significant improvement in detection accuracy with a slight reduction in speed. Compared with state-of-the-art models of the R-CNN series, it achieves the best speed and accuracy. However, some interesting future work remains. For example, we can further optimize some structures of the attention mechanism to improve detection speed, following the most recent work [38].

Conflicts of Interest
The authors declare that they have no conflicts of interest.