Research Article Ship Target Detection Based on Improved YOLO Network



Introduction
With the vigorous development of the shipping industry, water traffic is becoming increasingly busy. Because collisions and other accidents between ships occur frequently, it is necessary to detect ship types effectively to ensure the safety of water traffic. Ship detection technology based on computer vision is of great significance for improving port management and maritime inspection. Traditional methods of ship detection are based on the automatic identification system (AIS) and ship features [1,2]. Li et al. proposed an improved dimensional space clustering algorithm to identify abnormal behavior of ships [3]. Zhang et al. used AIS data to identify ships with attempted collision [4,5]. Zhou et al. proposed a detection method for classification and identification of bow and hull [6]. Zang et al. carried out ship target detection from a nonfixed platform [7,8]. Although these studies have achieved good results, they generally suffer from low recognition accuracy and a need for human intervention. As a result, traditional ship detection methods struggle to achieve the ideal detection effect.
In recent years, with the rapid development of artificial intelligence technology, deep learning has become the mainstream approach to detection. At present, there are two ways to solve the target detection problem through deep learning: the two-stage detection method and the one-stage detection method. The two-stage approach relies on region proposals and builds on networks such as AlexNet [9], VGG [10], and ResNet [11], with detectors including Fast-RCNN [12] and Faster-RCNN [13]. Although its detection accuracy is better than that of the traditional methods, its detection speed is insufficient: the feature extraction process takes a long time, which makes real-time detection difficult. To maintain accuracy while improving detection speed, the one-stage approach was proposed. One-stage detection does not combine coarse detection with fine detection but produces the results directly in a single stage. The whole process needs no region proposals and realizes end-to-end detection on the input image, so the detection speed is greatly improved. One-stage algorithms mainly include SSD [14], YOLO [15], YOLOv2 [16], and YOLOv3 [17]. The YOLOv3 algorithm maintains accuracy while improving speed and is therefore favored by many researchers. Zhang et al. proposed multiscale ship target detection based on deep learning [18,19]. Feng et al. improved ship classification and recognition by using spatial transform segmentation [20]. Li et al. used one-dimensional target detection on SAR images and a classifier to determine ship type [21,22]. Chen et al. proposed video-based automatic ship identification and behavior analysis, which can accurately detect a ship and successfully identify its historical behavior [23]. Huang et al. proposed an improved YOLOv3 network for intelligent detection and classification of ship images and videos [24,25].
Although the YOLOv3 algorithm has achieved good detection results, it has low recognition accuracy in complex scenes such as fog or night and tends to miss small target ships. Therefore, this paper proposes a Ship-YOLOv3 algorithm to solve these problems. We optimize the anchor boxes of the YOLOv3 algorithm and improve its network structure. The experiments show that this method accelerates the convergence of the network while preserving real-time performance. Compared with the YOLOv3 algorithm, the precision of ship identification is improved by 12.5%, and the recall rate is increased by 11.5%.
Compared to other works on ship detection, our algorithm possesses the following advantages:
(1) Target box dimension clustering: the K-means++ algorithm is used to cluster the target boxes of the self-made ship dataset. The optimal width and height values are calculated, and the predefined anchors in YOLOv3 are modified accordingly.
(2) Improvement of network structure: to address the false recognition rate in small ship target detection, the basic network Darknet53 in the YOLOv3 algorithm is improved. Some redundant network layers of Darknet53 are removed, and the skip connection mechanism of the residual network is added to enhance the detection of small ship targets.

Preliminaries
In this section, some essential concepts relating to YOLOv3 are briefly reviewed; these will be used in the rest of this work. The YOLOv3 target detection algorithm builds on YOLOv1 and YOLOv2. To achieve a better classification effect, a residual network is used to realize skip connections. At the same time, convolution with a stride of 2 is used for downsampling to reduce the negative gradient effect caused by pooling. In addition, batch normalization and an activation function are added to each convolution layer to avoid overfitting in training. Based on the upsampling and fusion mechanism of FPN, three output scales are designed to improve the accuracy of small target detection. The network structure of YOLOv3 is shown in Figure 1.
Firstly, the input image is divided into $S \times S$ grids. When the center of an object falls into a grid, that grid is responsible for predicting the bounding box of the object. The corresponding score of this grid is 1, and that of the other grids is 0. Each grid predicts 3 bounding boxes, and each box is represented by $(5 + N)$ values. The 5 values contain the location information of the bounding box: the center coordinates $(x, y)$, the width and height $(w, h)$, and the confidence of the bounding box. The value $N$ is the number of categories in the dataset. The NMS algorithm filters the candidate boxes by the confidence of the target in the grid and keeps the bounding box with the highest score as the object detection frame. Since a target may belong to multiple categories, the Softmax layer is replaced by a $1 \times 1$ convolution layer followed by a logistic regression activation function. The score of each category is predicted by logistic regression and compared with a threshold; the categories that score higher than the threshold are taken as the real categories of the bounding box. The error between the predicted value and the real value is usually calculated with a cross-entropy loss function over all grids and bounding boxes, in which $I^{obj}_{ij}$ is 1 when the $j$th bounding box of the $i$th grid is responsible for predicting the target and 0 otherwise, $S^2$ is the number of grids, and $B$ is the number of bounding boxes in each grid. The terms $\sigma(t_x)$ and $\sigma(t_y)$ are the offsets of the center position, $t_w$ and $t_h$ are the offsets for width and height, $C$ is the confidence score, and $p$ is the probability of the ship class.
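The per-grid output size and the multi-label class decision described above can be illustrated with a minimal sketch. The function names and the threshold of 0.5 are assumptions made for illustration; the paper does not state the class threshold it uses.

```python
import numpy as np

def sigmoid(x):
    """Logistic function used in place of Softmax for class scores."""
    return 1.0 / (1.0 + np.exp(-x))

def values_per_cell(num_classes, boxes_per_cell=3):
    """Each box carries (5 + N) values: x, y, w, h, confidence, N class scores."""
    return boxes_per_cell * (5 + num_classes)

def predict_classes(class_logits, threshold=0.5):
    """Score every class independently and keep all classes above the threshold.

    The threshold value is an assumption, not a value taken from the paper.
    """
    scores = sigmoid(np.asarray(class_logits, dtype=float))
    kept = [int(i) for i in np.flatnonzero(scores > threshold)]
    return kept, scores
```

Because each class is scored independently, a box can be assigned more than one category, which is not possible with a Softmax output.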

Materials and Methods
Although the YOLOv3 algorithm has a good detection effect on public datasets, the ship dataset used in this paper is obtained from surveillance video. The ship image is blurred at night or in foggy weather, and the gray level is uneven. These conditions disturb the classification and detection of ships. Therefore, it is necessary to improve the YOLOv3 algorithm to meet the requirements of ship classification and detection in complex scenes. The overall structure of the algorithm is shown in Figure 2, which mainly includes three modules: target box dimension clustering, YOLOv3 network structure optimization, and data processing. Next, the implementation process of each module is introduced in detail.

Target Box Dimension Clustering.
In the selection of the prior frame, clustering can accelerate the convergence of the network and effectively improve the gradient descent in the training process. The cluster evaluation criterion is as follows:
$$d(\text{bbox}, \text{center}) = 1 - \mathrm{IOU}(\text{bbox}, \text{center}),$$
where $d(\text{bbox}, \text{center})$ is the distance between the bounding box and the cluster center box and $\mathrm{IOU}(\text{bbox}, \text{center})$ is the intersection over union between the two boxes. This clustering method produces a larger intersection ratio and a smaller distance between the bounding boxes in the same cluster. Each grid in the ship feature map of each scale in Figure 3 predicts three prior frames. Each prediction is a $(4 + 1 + N)$-dimensional vector, which represents the center coordinates $(x, y)$, the width and height $(w, h)$, and the confidence of the bounding box, where $N$ is the number of categories in the dataset. The bounding box is decoded as follows:
$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h},$$
$$\sigma(t_o) = \Pr(\text{ship}) \cdot \mathrm{IOU}, \quad \mathrm{IOU} = \frac{\mathrm{area}(b) \cap \mathrm{area}(g)}{\mathrm{area}(b) \cup \mathrm{area}(g)},$$
where $b_x$ and $b_y$ are the center coordinates of the detected box, $t_x$ and $t_y$ are the predicted center coordinates of each ship's bounding box prior, and $c_x$ and $c_y$ are the coordinates of the grid that contains the center. After the activation function in the network, the offsets $\sigma(t_x)$ and $\sigma(t_y)$ corresponding to the center coordinates of the ship's bounding box are obtained. The terms $b_w$ and $b_h$ are the width and height of the detected box, respectively, $p_w$ and $p_h$ are the width and height of the prior box, and $e^{t_w}$ and $e^{t_h}$ are the scaling ratios of the width and height of the bounding box. The term $\sigma(t_o)$ is the ship prediction score, obtained as the product of the ship prediction probability and the IOU. Here $\mathrm{area}(b)$ is the area of the ship's bounding box and $\mathrm{area}(g)$ is the area of the ground truth. The ratio of their intersection to their union is the IOU, as shown in Figure 4.
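The anchor clustering described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the iteration limit, the random seed, and the choice to update each center with the cluster mean are assumptions. Boxes are given as (width, height) pairs and compared as if they shared a common corner, which is the usual convention for anchor clustering.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) pairs, with all boxes aligned at a common corner."""
    boxes = np.asarray(boxes, dtype=float)
    centers = np.asarray(centers, dtype=float)
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeanspp_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with K-means++ seeding and the distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    boxes = np.asarray(boxes, dtype=float)
    # K-means++ seeding: each new center is drawn with probability
    # proportional to the squared distance to the nearest existing center.
    centers = boxes[rng.integers(len(boxes))][None, :]
    while len(centers) < k:
        d2 = (1.0 - iou_wh(boxes, centers)).min(axis=1) ** 2
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(len(boxes), 1.0 / len(boxes))
        centers = np.vstack([centers, boxes[rng.choice(len(boxes), p=probs)]])
    # Lloyd iterations: assign each box to its nearest center, then update.
    for _ in range(iters):
        assign = (1.0 - iou_wh(boxes, centers)).argmin(axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    # Return anchors sorted by area, smallest first.
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```

Using $1 - \mathrm{IOU}$ rather than Euclidean distance keeps large boxes from dominating the clustering error, since the distance depends on overlap and not on absolute size.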

YOLOv3 Network Structure Optimization.
For ships in different scenes, feature extraction is very important for ship classification. The convolution layers of the ship feature extraction network can effectively analyze the features of ships, as shown in Figures 5 and 6. The residual network is added to the YOLOv3 network, and the residual skip connection is used many times. The residual network alleviates the problem that the gradient is difficult to descend in the process of feature extraction, accelerates convergence, and reduces the error. The input image size is increased to 448 × 448. Convolution with a stride of 2 is used in the network to downsample the image by factors of 32, 16, and 8.
The output of the last layer is made more lightweight. The 3 × 3 and 1 × 1 convolutions of the original output prediction layer are pruned, and only the 3 × 3 convolution is retained for ship position and category prediction.
This reduces the network computation, avoids overfitting, and gives the model better generalization ability. Finally, the feature maps of three scales are obtained, and the feature information of the ship is expressed as follows:
$$f^{b}_{a} = f\left(\sum_{t} S^{b-1}_{t} * P^{b}_{ta} + W^{b}_{a}\right),$$
where $f^{b}_{a}$ is the correspondence between the $b$th network layer and the $a$th output, $f$ is the activation function of the $b$th network layer, $S^{b-1}_{t}$ is the mapping of the $t$th ship feature by the $(b-1)$th convolutional network layer, $P^{b}_{ta}$ is the weight matrix between the $t$th and the $a$th ship feature layer, and $W^{b}_{a}$ is the bias of the $a$th output ship feature at the $b$th convolutional network layer.
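The sizes of the three output feature maps follow directly from the 448 × 448 input and the downsampling factors of 32, 16, and 8. The short sketch below works them out; the function name is hypothetical, and the depth assumes three prior frames per grid and the three ship categories of the dataset used in this paper.

```python
def output_shapes(input_size=448, strides=(32, 16, 8),
                  num_classes=3, boxes_per_cell=3):
    """Return (rows, cols, depth) of each output scale.

    depth = boxes_per_cell * (4 + 1 + N): box coordinates, confidence,
    and N class scores for every prior frame.
    """
    depth = boxes_per_cell * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]

shapes = output_shapes()
# shapes == [(14, 14, 24), (28, 28, 24), (56, 56, 24)]
```

The finest scale (56 × 56) has the smallest receptive field per grid and is the one most relevant to small ship targets.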

Guided Filtering.
In this paper, a guided filter is constructed to filter the images in night or fog scenes. It can be seen from Figure 7 that the guided filter protects the contour features of the ship while smoothing the ship image. It effectively addresses the blur and uneven gray level of night or foggy images and performs well in denoising. The ship image is filtered through the guide image to obtain the output image, and the weighted average is computed as follows:
$$q_i = \sum_{j} W_{ij}(I)\, p_j,$$
$$q_i = a_k I_i + b_k, \quad \forall i \in w_k,$$
where $q_i$ is the pixel of the filtered output image at position $i$, $W_{ij}$ is the filter kernel of the guide image $I$, $p_j$ is the pixel of the input ship image at position $j$, $w_k$ is the window, and $a_k$ and $b_k$ are the constant coefficients within the window. When the guide image $I$ has an edge, the output ship image keeps the edge unchanged. The unsmoothed area of the input image is regarded as noise, and the noise is reduced to the minimum. The loss function generated by filtering and its solution are as follows:
$$E(a_k, b_k) = \sum_{i \in w_k} \left( (a_k I_i + b_k - p_i)^2 + \epsilon a_k^2 \right),$$
$$a_k = \frac{\frac{1}{|w|} \sum_{i \in w_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon}, \quad b_k = \bar{p}_k - a_k \mu_k,$$
where $|w|$ is the number of pixels in the window, $\mu_k$ is the average value of the guide image in the window, $\sigma_k^2$ is its variance, $\bar{p}_k$ is the average value of the input image in the window, and $\epsilon$ is a regularization parameter. The values of $q_i$ from the different windows covering a pixel are averaged, which finally establishes the mapping of pixels from $I$ to $q$.
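The guided filtering step can be sketched in NumPy as below, assuming the standard guided-filter formulation with a square window. The window radius `r`, the regularization value `eps`, the edge padding at the image borders, and the function names are assumptions for illustration; the paper does not report its filter parameters.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, using edge padding at the borders."""
    padded = np.pad(img, r, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.pad(padded, ((1, 0), (1, 0)), mode="constant").cumsum(0).cumsum(1)
    k = 2 * r + 1
    h, w = img.shape
    total = ii[k:k + h, k:k + w] - ii[:h, k:k + w] - ii[k:k + h, :w] + ii[:h, :w]
    return total / (k * k)

def guided_filter(guide, src, r=4, eps=1e-2):
    """Filter `src` using `guide`; both are 2D float arrays of the same shape."""
    I = np.asarray(guide, dtype=float)
    p = np.asarray(src, dtype=float)
    mean_I = box_mean(I, r)
    mean_p = box_mean(p, r)
    var_I = box_mean(I * I, r) - mean_I ** 2
    cov_Ip = box_mean(I * p, r) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)   # per-window linear coefficient a_k
    b = mean_p - a * mean_I      # per-window offset b_k
    # Average the coefficients of all windows that cover each pixel.
    return box_mean(a, r) * I + box_mean(b, r)
```

Passing the same image as both `guide` and `src` gives edge-preserving smoothing: in flat regions the window variance is small relative to `eps`, so the output tends to the local mean, while near strong edges the coefficient `a` stays close to 1 and the edge is kept.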

Gray Scale Enhancement.
The features of the ship are weakened by the influence of illumination and environment. In this paper, gray enhancement is used to improve the situation. The histogram counts the number of pixels at each gray level of the image and reflects the contrast of the image. Contrast Limited Adaptive Histogram Equalization (CLAHE) clips the histogram at a set threshold. The clipped part is redistributed over the other histogram bins, so the contrast of each region is limited. At the same time, interpolation is used to improve the operation speed. The gray histogram is shown in Figure 8. After the gray histogram is processed, the image contrast is enhanced while the noise is suppressed. The gray range of the image becomes more uniform, which is conducive to the extraction of ship features.
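The clipping and redistribution at the core of CLAHE can be illustrated on a single region. The sketch below is a simplification under stated assumptions: it equalizes one tile only, redistributes the clipped excess uniformly over all bins, and uses a hypothetical `clip_limit` expressed as a multiple of the mean bin height. A full CLAHE implementation applies this per tile and blends neighboring tiles by interpolation, as tiled implementations in image libraries such as OpenCV do.

```python
import numpy as np

def clipped_equalize(gray, clip_limit=2.0, bins=256):
    """Histogram equalization with a clipped histogram (one CLAHE tile)."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=bins).astype(float)
    # Clip each bin at clip_limit times the mean bin height.
    limit = clip_limit * gray.size / bins
    excess = np.maximum(hist - limit, 0.0).sum()
    # Redistribute the clipped excess uniformly over all bins.
    hist = np.minimum(hist, limit) + excess / bins
    cdf = hist.cumsum() / hist.sum()
    lut = np.round(cdf * (bins - 1)).astype(np.uint8)
    return lut[gray]
```

Clipping bounds the slope of the mapping, which is why the contrast gain is limited and noise in nearly uniform regions is not amplified as strongly as with plain histogram equalization.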

Experimental Environment and Dataset.
Our framework was developed on Windows 10 with 128 GB of RAM and a 3 GHz CPU. The GPU is an NVIDIA GeForce GTX 1080 Ti with 11 GB of memory. The deep learning framework is TensorFlow-GPU (version 1.9). The ship dataset is obtained by extracting frames from surveillance video recorded on the banks of Datong bridge and Hengmen waterway. The scenes are complex and changeable, and the recording times also differ. We use labeling software to label the images. There are three types of ships: heavy bulk ships, empty bulk ships, and passenger and management ships, labeled boat, boat2, and boat3, respectively. The parameters are shown in Table 1. There are 4915 ship images in total: 4421 training images, 441 validation images, and 53 test images.

Training Parameter Selection.
During the training process, weights pretrained on the PASCAL VOC dataset are loaded for transfer learning. In the first stage, the weights of Darknet53 are restored, the learning rate is set to 0.0001, and only the last regression layer is trained until the loss is reduced to a lower level. In the second stage, the weights are restored from the first stage, and the learning rate is set to 0.00001 to train all network layers. The learning rate decay factor is 0.96, applied every 5 epochs. The total number of training epochs is 100, the batch size is 6, and the optimizer is momentum. After 73700 iterations, the loss of Ship-YOLOv3 is reduced to about 0.09794, whereas the loss of the YOLOv3 algorithm is about 0.1120; the loss curve is shown in Figure 9. The Ship-YOLOv3 algorithm converges better and faster and reaches a lower loss value.
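The reported iteration count and the learning rate decay can be checked with a short calculation. The sketch assumes that the decay is applied as a staircase (one multiplication by 0.96 at the end of every 5 epochs) and that the last partial batch of each epoch counts as one iteration; the paper states neither detail, and the function names are hypothetical.

```python
import math

def iterations_per_epoch(num_images=4421, batch_size=6):
    """Number of batches needed to see every training image once."""
    return math.ceil(num_images / batch_size)

def learning_rate(epoch, base_lr=1e-5, decay=0.96, decay_epochs=5):
    """Staircase exponential decay (assumed): multiply by `decay`
    once every `decay_epochs` epochs."""
    return base_lr * decay ** (epoch // decay_epochs)

total_iterations = iterations_per_epoch() * 100  # 737 * 100 = 73700
```

With 4421 training images and a batch size of 6, one epoch takes 737 iterations, so 100 epochs give the 73700 iterations reported above.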

Algorithm Evaluation.
To evaluate the effect of ship detection, this paper uses precision and recall to test the Ship-YOLOv3 algorithm:
$$P = \frac{N_{tp}}{N_{tp} + N_{fp}}, \quad R = \frac{N_{tp}}{N_{tp} + N_{fn}},$$
where $P$ is the precision rate of the ship, $N_{tp}$ is the number of ship samples that are detected correctly, $N_{fp}$ is the number of samples with detection errors, $R$ is the recall rate of the ship, and $N_{fn}$ is the number of ship samples that are missed. In addition, the NMS threshold is set to 0.45. When the intersection ratio between the predicted bounding box and the actual position of the ship is greater than 0.45, the ship is considered correctly detected.
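The evaluation rule above can be written out as a minimal sketch. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), and the function names are hypothetical; the 0.45 threshold is the value stated in the text.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_correct(pred, truth, threshold=0.45):
    """A detection counts as correct when its IoU with the truth exceeds 0.45."""
    return box_iou(pred, truth) > threshold

def precision_recall(n_tp, n_fp, n_fn):
    """P = N_tp / (N_tp + N_fp), R = N_tp / (N_tp + N_fn)."""
    return n_tp / (n_tp + n_fp), n_tp / (n_tp + n_fn)
```

For example, 90 correct detections with 10 false detections and 30 missed ships give a precision of 0.9 and a recall of 0.75.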

Contrast Experiment on Foggy Day or Night.
The YOLOv3 algorithm and our Ship-YOLOv3 algorithm are used to detect ships in video in real time, and key frames of the video are extracted for each algorithm, as shown in Figure 10. At frames 106, 272, and 433, both the YOLOv3 algorithm and the Ship-YOLOv3 algorithm detect the target. For large target ships, the detection effect of the Ship-YOLOv3 algorithm is better: the positioning is more accurate and the bounding box is tighter. For small target ships, the Ship-YOLOv3 algorithm has a better recognition rate and robustness. In this scene, ship feature extraction is greatly disturbed. In the training process, guided filtering and gray enhancement are applied to improve the anti-interference ability of the improved algorithm, which can then effectively detect the ship and overcome the adverse factors of light and fog.

Contrast Experiment under Normal Conditions.
It can be seen from Figure 11 that, in frame 117, the YOLOv3 algorithm recognizes only the large-scale ships and misses the small-scale ships, whereas the Ship-YOLOv3 algorithm also accurately locates and identifies the small-scale ships. In frames 167 and 232, the detection effect of the Ship-YOLOv3 algorithm is better, while the prediction frame of the YOLOv3 algorithm deviates from the target ship.
Compared with the YOLOv3 algorithm, the detection time of the Ship-YOLOv3 algorithm in nighttime or foggy scenes is reduced by 8.6 ms, as shown in Table 2. In the conventional scene, the detection time is reduced by 3.53 ms, for an average reduction of 6.06 ms over the two kinds of scenes. The Ship-YOLOv3 algorithm thus maintains a high recognition rate while improving the detection speed.

Classification and Recognition Experiment.
It can be seen from Figure 12 that Ship-YOLOv3 can extract the features of different types of ships. It detects and locates the three kinds of ships with good recognition results. The recognition of small target ships is particularly improved, which effectively reduces the probability of missed detection. The comparison in Table 3 shows that the recall, precision, and mAP of Ship-YOLOv3 are all higher than those of the YOLOv3 algorithm, while its total loss is lower, so our algorithm performs better overall.

Conclusion
To solve the problem of ship detection in complex scenes such as fog or night, a Ship-YOLOv3 algorithm is proposed to detect ship targets. It improves on YOLOv3 mainly in two aspects: target box dimension clustering and network structure optimization. Firstly, the ship data is processed: the collected ship video is divided into frames to establish the ship dataset, and guided filtering and gray enhancement are applied to the night or foggy images to preserve the features of the ship. Then, target box dimension clustering is performed on the self-made ship dataset, and the sizes of the prior boxes are improved to predict the position of ships. Lastly, the structure of the YOLOv3 network is optimized by increasing the input size of the image and reducing the convolution layers in the downsampling process. The multiscale output is used to adapt to different ship types and improve the detection accuracy. The effectiveness of the improved algorithm in real-time detection and classification on the ship dataset is verified by comparative experiments in different scenarios. After 73700 iterations, the loss of Ship-YOLOv3 is 0.01406 lower than that of YOLOv3, and the convergence is better. Compared with the YOLOv3 algorithm, the detection time is reduced by 6.06 ms on average, which improves the detection speed while maintaining a high recognition rate. The algorithm accelerates the convergence of the network and improves the precision of ship detection by 12.5% and the recall rate by 11.5% while preserving real-time performance.
Although the algorithm in this paper has achieved satisfactory results in real-time ship detection and classification, further work can improve its performance. Firstly, the self-made ship dataset has few categories, so increasing the number of ship categories for detection is a focus of future research. Secondly, we will continue to optimize the network structure of Ship-YOLOv3 and simplify the parameters of the network to adapt to ship classification and detection in different scenarios. Finally, ship tracking and identification will be added in the future, and the behavior of ships will be analyzed to ensure the safety of waterway transportation and improve the efficiency of port management and maritime inspection.

Data Availability
The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.