Ship detection is one of the most important research topics in intelligent ship navigation and monitoring. As a supplement to classical navigational equipment such as radar and the Automatic Identification System (AIS), target detection based on computer vision and deep learning has become an important new approach. The YOLOv3 detector offers good detection speed and accuracy and meets the real-time requirements of ship detection. However, YOLOv3 has a large number of backbone network parameters and demands high hardware performance, which hinders its widespread application. On the basis of YOLOv3, this paper proposes a lightweight ship detection model (LSDM) in which the backbone network is improved with dense connections inspired by DenseNet, and the feature pyramid network is improved by replacing normal convolutions with spatial separation convolutions. These two improvements greatly reduce the parameters and optimize the network structure. The experimental results show that, with only one-third of the parameters of YOLOv3, the LSDM achieves higher accuracy and speed for ship detection. In addition, the LSDM is further simplified by reducing the number of densely connected units, forming a model called LSDM-tiny. The experimental results show that LSDM-tiny has a detection speed similar to that of YOLOv3-tiny but much higher accuracy.
In recent years, object detection technologies based on deep learning have received increasing attention in the areas of intelligent ship navigation and ship monitoring [
Normally, there are two main types of object detection algorithms based on deep learning. One is a two-stage detection algorithm, such as Fast R-CNN [
Unlike ordinary detection models, lightweight detection models with fewer parameters aim to run on mobile devices or computers with weak computing capabilities. Iandola et al. proposed a lightweight model called SqueezeNet, in which the parameters are compressed to one-fifth of those of AlexNet by using small convolution kernels and reducing the number of input and output channels of the convolution layers [
For ship detection, in addition to accuracy, improving the detection speed is also important to adapt the model to existing hardware conditions. As YOLOv3 has a relatively balanced performance in detection accuracy and speed [ The main contributions of this paper are as follows: (1) We propose a lightweight ship detection model called the LSDM, with one-third of the parameters of the YOLOv3 network and a higher average accuracy of 94% for ship detection. (2) We propose a simpler version of the LSDM called LSDM-tiny, with one-eighth of the parameters of the YOLOv3 network, double the detection speed, and an average accuracy of 93.5% for ship detection.
The rest of the paper is organized as follows. Section
For image detection, many studies have made improvements on basic detection models. Fang et al. used DCGAN to generate training samples for a CNN-based image recognition model to improve the accuracy of image recognition [
For ship detection, two types of images are utilized: radar images and visible images. Generally, radar images cover a wider range, while visible images provide more detailed information. Dong et al. improved the R-CNN and proposed a multiangle box-based rotation-insensitive object detection structure for detecting VHR (Very-High-Resolution) ship images [
There are also many ship detection methods based on one-stage detection algorithms. An et al. proposed an improved RBox-based target detection framework to improve detection accuracy and recall [
To implement real-time ship detection, Qi et al. proposed an improved Faster R-CNN algorithm by scene reduction technology to reduce the target scale during searching [
Based on this review and analysis of related work on ship detection, this paper studies lightweight ship detection models based on a one-stage algorithm, aiming to preserve accuracy as much as possible while reducing the number of parameters and increasing the detection speed.
YOLOv3 is an end-to-end object detection model, and its network structure includes a backbone network and a detection network [
The structure of YOLOv3.
The original backbone network of YOLOv3 is Darknet-53. Darknet-53 includes 52 fully convolutional layers, of which 46 are organized into 23 residual units at 5 different scales [
YOLOv3-tiny is a simplified version of YOLOv3: its backbone network includes only 7 convolutional layers and 6 max-pooling layers, and its feature pyramid network is simplified by removing the largest-scale prediction branch and reducing the number of convolutional layers in the other two branches. YOLOv3-tiny therefore has a faster detection speed than YOLOv3 due to its shallow, simple network structure; however, its detection accuracy is obviously lower than that of YOLOv3.
Therefore, for fast ship detection, it is important to maintain the depth of the network, which captures enough features to ensure detection accuracy, while reducing network parameters to speed up detection. In addition, ship objects are relatively small in the images: when they are detected by Darknet-53, their shallow features are clear, but their deep features are easily lost after multiple rounds of downsampling. How to utilize the shallow features as much as possible to improve detection accuracy therefore becomes the key issue to be solved. This paper proposes a feature-reuse method inspired by DenseNet to achieve this goal.
Different from ResNet, DenseNet alleviates the vanishing-gradient problem by connecting each layer to every other layer in a feed-forward fashion. As shown in Figure
The structure of dense connection.
From Figure
As shown in Figure
The structure of the densely connected unit.
For developing a lightweight ship detection model (LSDM), a new backbone network is constructed by combining Darknet-53 and DenseNet. Table
The structure of the LSDM backbone network.
| | Type | Filters | Size/stride | Output size |
|---|---|---|---|---|
| | Conv | 32 | 3×3 | 416×416 |
| | Avg_pool | | 2×2/2 | 208×208 |
| 1× | Conv | 128 | 1×1 | 208×208 |
| | Conv | 32 | 3×3 | 208×208 |
| | Avg_pool | | 2×2/2 | 104×104 |
| 2× | Conv | 128 | 1×1 | 104×104 |
| | Conv | 32 | 3×3 | 104×104 |
| | Avg_pool | | 2×2/2 | 52×52 |
| 4× | Conv | 128 | 1×1 | 52×52 |
| | Conv | 32 | 3×3 | 52×52 |
| | Avg_pool | | 2×2/2 | 26×26 |
| 8× | Conv | 128 | 1×1 | 26×26 |
| | Conv | 32 | 3×3 | 26×26 |
| | Avg_pool | | 2×2/2 | 13×13 |
| 16× | Conv | 128 | 1×1 | 13×13 |
| | Conv | 32 | 3×3 | 13×13 |
In the backbone network, the densely connected unit of DenseNet replaces the residual units of Darknet-53. Within a densely connected unit, the number of convolution kernels in the bottleneck layer is set to 128, and the growth rate of the feature extraction layer is set to 32. That is, for each densely connected unit, the input feature maps are first compressed to 128, and then 32 new feature maps are added to the global feature. The size of the feature maps becomes smaller as the network goes deeper, and more feature maps are needed to keep the semantic information abundant. That is, as the size of the feature maps decreases, more densely connected units are needed to increase their number. Therefore, in the whole backbone network, 5 levels of densely connected units are adopted with unit counts of 1, 2, 4, 8, and 16, and average pooling is used to downsample from one level to the next. As a result, the backbone network contains 63 convolution layers, and the final number of global feature maps is 1024.
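The layer and channel counts stated above can be sanity-checked with simple bookkeeping (a sketch in plain Python; the constants come directly from the description above):

```python
# Channel bookkeeping for the LSDM backbone: one 3x3 stem convolution
# (32 filters) followed by 5 levels of densely connected units
# (1, 2, 4, 8, and 16 units) with growth rate 32.

GROWTH_RATE = 32      # new feature maps added by each densely connected unit
LEVELS = [1, 2, 4, 8, 16]

channels = 32         # feature maps after the stem convolution
conv_layers = 1       # the stem convolution itself
for units in LEVELS:
    for _ in range(units):
        conv_layers += 2          # 1x1 bottleneck + 3x3 feature-extraction layer
        channels += GROWTH_RATE   # outputs concatenated onto the global feature
    # average pooling halves the spatial size between levels (not a conv layer)

print(conv_layers, channels)  # 63 1024
```

The 31 units contribute 62 convolution layers, giving 63 in total, and 32 + 31 × 32 = 1024 global feature maps, matching the text.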
Although it has 11 more layers than Darknet-53, the proposed backbone has fewer parameters, owing to the bottleneck layers in the densely connected units and the feature reuse mechanism. The parameter number of a convolution layer (bias omitted) can be calculated by the following equation: Params = K_h × K_w × C_in × C_out, where K_h and K_w are the kernel height and width and C_in and C_out are the numbers of input and output channels.
The parameters of the last residual unit in Darknet-53 and the last densely connected unit in the LSDM backbone network are shown in Table
Parameters comparison between the residual unit and densely connected unit.
| Structure | Layer | Kernel size | Input channel | Output channel | Parameters |
|---|---|---|---|---|---|
| Residual unit | Conv | 1×1 | 1024 | 512 | 524288 |
| | Conv | 3×3 | 512 | 1024 | 4718592 |
| Densely connected unit | Conv | 1×1 | 992 | 128 | 126976 |
| | Conv | 3×3 | 128 | 32 | 36864 |
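These parameter counts can be reproduced with the kernel-size × channel product rule (a quick check; bias terms are omitted, as the table's numbers imply):

```python
def conv_params(kernel_h, kernel_w, in_ch, out_ch):
    """Parameter count of a convolution layer (bias omitted):
    kernel_h * kernel_w * in_ch * out_ch."""
    return kernel_h * kernel_w * in_ch * out_ch

# Last residual unit in Darknet-53:
print(conv_params(1, 1, 1024, 512))   # 524288
print(conv_params(3, 3, 512, 1024))   # 4718592
# Last densely connected unit in the LSDM backbone:
print(conv_params(1, 1, 992, 128))    # 126976
print(conv_params(3, 3, 128, 32))     # 36864
```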
A further compressed backbone network for LSDM-tiny is also investigated. Since the number of global feature maps is related to the number of densely connected units, reducing the number of densely connected units further decreases the parameters; however, it also decreases the detection accuracy. In order to preserve accuracy as much as possible while reducing densely connected units, a compromise is applied in the backbone network of LSDM-tiny: at most two densely connected units are used at each level, while a convolution layer with a 1×1 kernel is added at some levels to expand the number of global feature maps.
The structure of the LSDM-tiny backbone network.
| | Type | Filters | Size/stride | Output size |
|---|---|---|---|---|
| | Conv | 32 | 3×3 | 416×416 |
| 1× | Conv | 128 | 1×1 | 416×416 |
| | Conv | 32 | 3×3 | 416×416 |
| | Avg_pool | | 2×2/2 | 208×208 |
| 2× | Conv | 128 | 1×1 | 208×208 |
| | Conv | 32 | 3×3 | 208×208 |
| | Conv | 256 | 1×1 | 208×208 |
| | Avg_pool | | 2×2/2 | 104×104 |
| 2× | Conv | 128 | 1×1 | 104×104 |
| | Conv | 32 | 3×3 | 104×104 |
| | Conv | 640 | 1×1 | 104×104 |
| | Avg_pool | | 2×2/2 | 52×52 |
| 2× | Conv | 128 | 1×1 | 52×52 |
| | Conv | 32 | 3×3 | 52×52 |
| | Avg_pool | | 2×2/2 | 26×26 |
| 2× | Conv | 128 | 1×1 | 26×26 |
| | Conv | 32 | 3×3 | 26×26 |
| | Avg_pool | | 2×2/2 | 13×13 |
| 2× | Conv | 128 | 1×1 | 13×13 |
| | Conv | 32 | 3×3 | 13×13 |
The abovementioned backbone networks are then used to replace Darknet-53 in YOLOv3 to form the complete ship detection networks LSDM and LSDM-tiny. The overall structure of the LSDM is shown in Figure
The overall structure of the LSDM network.
In addition to the backbone network, the feature pyramid network is also improved. The 3×3 standard convolutions are replaced with spatial separation convolutions, in which a 3×3 convolution is decomposed into a 3×1 convolution followed by a 1×3 convolution, halving the parameters of each replaced layer, as shown in the following table.
Parameters comparison of standard convolution and spatial separation convolution.
| Structure | Layer | Kernel size | Input channel | Output channel | Parameters |
|---|---|---|---|---|---|
| Conv2d | Conv | 3×3 | 512 | 1024 | 4718592 |
| Separable conv | Conv | 3×1 | 512 | 512 | 786432 |
| | Conv | 1×3 | 512 | 1024 | 1572864 |
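The halving effect can be verified numerically. This sketch assumes the decomposition of a 3×3 convolution into a 3×1 and a 1×3 convolution, which is consistent with the parameter counts reported for the separable convolution above:

```python
def conv_params(kh, kw, cin, cout):
    # parameter count of a convolution layer, bias omitted
    return kh * kw * cin * cout

# standard 3x3 convolution in the feature pyramid network
standard = conv_params(3, 3, 512, 1024)
# spatial separation: a 3x1 convolution followed by a 1x3 convolution
separable = conv_params(3, 1, 512, 512) + conv_params(1, 3, 512, 1024)

print(standard, separable)  # 4718592 2359296, i.e., half the parameters
```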
In summary, by using the new backbone network and spatial separation convolution, the total parameter count of the LSDM is 20022112, about 32% of that of YOLOv3.
LSDM-tiny can be obtained from the LSDM by replacing the backbone network with the one for LSDM-tiny (as shown in Table
In order to improve the detection accuracy of the LSDM and LSDM-tiny and speed up their training, several tricks are also added. Firstly, the LeakyReLU activation function is used to replace ReLU (Rectified Linear Units), with the negative half-axis slope set to 0.1; its formula is f(x) = x for x ≥ 0 and f(x) = 0.1x for x < 0.
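A minimal scalar sketch of this activation (in practice a vectorized implementation such as PyTorch's built-in LeakyReLU would be used):

```python
def leaky_relu(x, negative_slope=0.1):
    # LeakyReLU with negative half-axis slope 0.1, as described above:
    # the identity for non-negative inputs, a small slope for negative ones
    return x if x >= 0 else negative_slope * x

print(leaky_relu(2.0), leaky_relu(-2.0))  # 2.0 -0.2
```

Unlike ReLU, the small negative slope keeps a nonzero gradient for negative inputs, which helps avoid "dead" units during training.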
Secondly, to make the models converge faster, momentum is added to the SGD optimizer; the improved update is v(t+1) = μ·v(t) + g(t) and θ(t+1) = θ(t) − η·v(t+1), where μ is the momentum coefficient, g(t) is the gradient, θ(t) is the parameter, and η is the learning rate.
Thirdly, in order to reduce the risk of overfitting, weight decay for the parameters of the convolution layers is used, and its attenuation coefficient is set to 0.000489.
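The optimizer settings described in these two paragraphs can be sketched as a single scalar update step. The momentum coefficient (0.9) and learning rate (0.01) below are illustrative assumptions; the paper states only the weight decay coefficient:

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9,
                      weight_decay=0.000489):
    """One SGD step with momentum and weight decay on a scalar parameter."""
    # weight decay adds an L2 penalty gradient: grad + weight_decay * theta
    g = grad + weight_decay * theta
    # momentum accumulates an exponentially weighted sum of past gradients
    velocity = momentum * velocity + g
    theta = theta - lr * velocity
    return theta, velocity

theta, v = sgd_momentum_step(theta=1.0, grad=0.5, velocity=0.0)
print(theta, v)
```

In PyTorch these settings correspond to `torch.optim.SGD(params, lr=..., momentum=..., weight_decay=0.000489)`.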
The ship image dataset contains 2270 pictures collected by web crawlers and processed by data augmentation methods such as random image flipping, noise addition, and color enhancement. Each annotation file contains the ship category information (only ships here, labeled 0) and normalized bounding box coordinates. 80% of the dataset is used to train the models, and 20% is used for testing. Figure
Samples of the training dataset.
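A minimal sketch of reading one annotation line; the exact file layout (class id followed by normalized center coordinates and box size, one object per line) is an assumption based on the description above:

```python
def parse_annotation_line(line, img_w, img_h):
    """Convert one normalized annotation line to pixel-space coordinates."""
    cls, cx, cy, w, h = line.split()
    # scale normalized center/size values by the image dimensions
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    # convert center/size to corner coordinates (x1, y1, x2, y2)
    return int(cls), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

cls, box = parse_annotation_line("0 0.5 0.5 0.25 0.25", 416, 416)
print(cls, box)  # 0 (156.0, 156.0, 260.0, 260.0)
```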
The LSDM and LSDM-tiny are implemented in PyTorch, and their performance is investigated and compared with YOLOv3 and YOLOv3-tiny on the abovementioned dataset under an NVIDIA GTX1060 (3 GB) environment. The evaluation indicators include recall, precision, AP (Average Precision), F1 score, number of parameters, and FPS (frames per second).
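As a quick consistency check on these indicators, F1 is the harmonic mean of precision and recall; YOLOv3's reported precision of 0.830 and recall of 0.946 reproduce its reported F1 of 0.884:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.830, 0.946), 3))  # 0.884
```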
The experiment results are shown in Table
Comparison of model training results.
| Model | Recall | Precision | AP | F1 | Parameters (M) | FPS |
|---|---|---|---|---|---|---|
| LSDM | 0.954 | | | | 20.02 | 16.97 |
| YOLOv3 | 0.946 | 0.830 | 0.930 | 0.884 | 61.47 | |
| LSDM-tiny | | 0.804 | 0.935 | 0.874 | 7.05 | 27.09 |
| YOLOv3-tiny | | | | | 8.66 | |
Figure
Detection effect comparison: (a) single ship detection; (b) multiple ships detection.
The abovementioned results show that the LSDM has a higher FPS than YOLOv3. As expected, with only one-third of the parameters, the LSDM is faster than YOLOv3 when detecting ships. More importantly, the LSDM also outperforms YOLOv3 in recall, precision, AP, and F1. That is, the LSDM has higher accuracy than YOLOv3.
It can be clearly observed in the detection effect image for single ship detection in the Figure
The abovementioned results also show that LSDM-tiny is much faster than the LSDM and YOLOv3 as expected. The “FPS” of LSDM-tiny is about double that of YOLOv3, but a bit less than that of YOLOv3-tiny. However, LSDM-tiny has a higher accuracy than YOLOv3-tiny. It can also be clearly observed in the detection effect image in Figure
This paper proposes a lightweight ship detection model (LSDM) based on YOLOv3 and DenseNet. In the LSDM, the features of the shallow layers are retained and reused in subsequent layers. This mechanism greatly reduces the parameters and optimizes the structure of the backbone network, and spatially separated convolution further reduces the parameters of the feature pyramid network. The two improvements reduce the parameters of the LSDM to only one-third of those of YOLOv3. The experimental results show that the LSDM is not only faster than YOLOv3 but also more accurate.
Furthermore, a model called LSDM-tiny is constructed as a simplified version of the LSDM. By reducing the number of densely connected units, the parameters of LSDM-tiny are only one-eighth of those of YOLOv3. The experimental results show that the detection speed of LSDM-tiny is about double that of YOLOv3, with only a small loss of accuracy. Compared with YOLOv3-tiny, LSDM-tiny has a similar detection speed but higher accuracy, owing to the feature map reuse mechanism.
The LSDM and LSDM-tiny are proposed for fast ship detection on ordinary or even low-end hardware. In the future, two aspects will be studied further. First, for the imbalance between positive and negative samples in YOLOv3, how to add a stricter penalty mechanism to reduce the impact of negative samples will be studied. Second, to detect small ship objects in camera images, how to increase multiscale detection channels while maintaining a small number of parameters will be studied.
The data used to support the findings of this study are included within the article.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported in part by “the Fundamental Research Funds for the Central Universities” under Grant 3132019400.