An Intelligent Ship Image/Video Detection and Classification Method with Improved Regressive Deep Convolutional Neural Network

The shipping industry is developing rapidly towards intelligence. An accurate and fast method for ship image/video detection and classification is of great significance not only for port management but also for the safe operation of the Unmanned Surface Vehicle (USV). Thus, this paper builds a dataset for ship image/video detection and classification and presents a method based on an improved regressive deep convolutional neural network. This method improves the regressive convolutional neural network in four aspects. First, the feature extraction layer is made lightweight by referring to YOLOv2. Second, a new feature pyramid network layer is designed by improving its structure in YOLOv3. Third, anchor frames and scales suitable for ships are designed with a clustering algorithm, which reduces the number of anchors by 60%. Last, the activation function is verified and optimized. The detection experiment on 7 types of ships shows that the proposed method has an advantage over the YOLO series networks and other intelligent methods. This method can solve the problem of low recognition rate and poor real-time performance in ship image/video detection and classification with a small dataset. On the testing-set, the final mAP is 0.9209, the Recall is 0.9818, the AIOU is 0.7991, and the FPS is 78–80 in video detection. Thus, this method provides a highly accurate and real-time ship detection method for intelligent port management and visual processing of the USV. In addition, the proposed regressive deep convolutional network also has a better comprehensive performance than YOLOv2/v3.


Introduction
In the age of artificial intelligence, the shipping industry is developing rapidly towards intelligence. Ship image/video detection and classification with the help of computer vision has been applied in port supervision services and Unmanned Surface Vehicle (USV) technology. An accurate and rapid detection method is of great significance not only for port management but also for the safe operation of the USV. The traditional methods of ship detection and classification fall into the following two groups. (1) Methods based on the structure and shape characteristics of ships. In 2012, Fefilatyev et al. presented a novel algorithm for ship detection in the open sea; a detection precision of 88% was achieved on a large dataset collected from a prototype system [1]. In 2013, Chen et al. improved an RCS density-coding method for acquiring ship features and completed the ship identification task with a high-resolution Synthetic Aperture Radar (SAR) dataset [2]. The accuracy of this method reached 91.54%. In 2016, Yüksel et al. extracted ship features from the contour images of a 3D ship model and from optical images for ship recognition [3]. Also in 2016, Li et al. proposed a novel method for inshore ship detection via ship head classification and body boundary determination [4]. In 2017, Zhang et al. developed a new ship target detection algorithm for visual maritime surveillance. Its three main steps, horizon detection, background modeling, and background subtraction, are all based on the discrete cosine transform [5]. (2) Methods based on thresholds. It is usually very practical to detect ships directly with a threshold method. In 1996, Eldhuset proposed a method based on a local threshold, which separates the ship from the background and uses a filtering-window method in detection [6]. In 1999, Zhou et al. designed a global threshold algorithm that completes the adaptive calculation and ship detection using the statistical characteristics of the dataset images, that is, the adaptive threshold method [7]. In 2013, Rey used statistical data to derive features when calculating the overall threshold value of ship images, a method based on the probability density function for detecting ships on water [8]. In 2018, Li and Li proposed a method based on high and low thresholds to detect ship edge features and achieved a high accuracy of ship edge detection [9].
Although the above studies have achieved good results, the traditional methods mostly rely on manual feature design based on ship structure and shape. Even if the best nonlinear classifier is used to classify these manually designed features, the accuracy of ship detection cannot meet practical needs. Therefore, these methods cannot achieve good results with complex backgrounds and small hull differences in a real environment, and their recognition rate for multiple-ship classification is also not ideal.
Fortunately, after more than ten years of development, target detection based on the deep Convolutional Neural Network (CNN) has made great progress in applications to human faces, pedestrians, and other scenes.
The CNN was first proposed by Professor LeCun from the University of Toronto in Canada. The depth and width of the CNN have been continuously increased, and its accuracy in image recognition has increased accordingly. The commonly used CNNs include LeNet-5 [10], AlexNet [11], VGG [12], GoogLeNet [13], ResNet [14], and DenseNet [15]. At the same time, there has been some research on applying the deep CNN to ship recognition and detection. The deep convolutional networks for target detection can be divided into two categories: (1) the region-based methods, such as R-CNN [16], Fast R-CNN [17], and Faster R-CNN [18]; (2) the regression-based methods, such as SSD [19], YOLO [20], YOLOv2 [21], and YOLOv3 [22]. The regression-based deep convolutional network uses the CNN as a regressor: it returns the position information of the target in the image through end-to-end training and outputs the final bounding box and classification results.
In 2017, Kang et al. presented a contextual region-based CNN with multilayer fusion for SAR ship detection [23]. In 2018, Wang et al. proposed a ship detection algorithm combining the CFAR and CNN. This algorithm is more accurate and faster on remote-sensing ocean satellite images with complex distribution [24]. In 2018, Li et al. developed the HSF-Net, which finds multiscale deep feature embeddings for ship detection in optical remote-sensing imagery [25]. Also in 2018, Yang et al. proposed automatic ship detection in remote-sensing images from Google Earth based on multiscale rotation dense feature pyramid networks [26]. In 2019, Gao et al. applied the Faster R-CNN to detect ships without the need for land masking by incorporating a large number of images containing only terrestrial regions as negative samples without any manual marking [27]. Also in 2019, Lin et al. proposed a squeeze and excitation rank Faster R-CNN for ship detection in SAR images, which shows a much better detection effect and speed than the traditional state-of-the-art methods [28]. The above detection methods, which are mainly based on remote-sensing or radar images, can hardly meet real-time requirements because of the latency of image acquisition. By contrast, in 2016, Zhao et al. proposed a real-time algorithm based on the deep CNN combined with the HOG and HSV algorithms to achieve a good ship identification effect [29]. In 2017, Yang et al. used the Faster R-CNN to achieve video detection of river vessels [30]. In 2018, Shao et al. built a new large-scale dataset of ships, designed for training and evaluating ship object detection algorithms. The dataset currently consists of 31455 images and covers six common ship types [31]. In 2019, Shao et al. proposed using visual images captured by an on-land surveillance camera network to achieve real-time detection based on a saliency-aware CNN framework [32].
However, as the accuracy and real-time requirements of ship detection and classification in practical applications keep rising, a ship image/video detection and classification method based on an improved regressive deep convolutional network is needed. Thus, this paper builds a dataset of 7 kinds of ships for image/video detection and classification and presents a method based on an improved regressive deep CNN.
This method improves the regressive CNN in four aspects. First, the feature extraction layer is made lightweight by referring to YOLOv2. Second, a new Feature Pyramid Network (FPN) layer is designed by improving its network structure in YOLOv3. Third, anchor frames and scales suitable for ships are designed with a clustering algorithm, which reduces the number of anchors by 60%. Last, the activation function is verified and optimized. Through end-to-end training, this method can solve the problem of low recognition rate and poor real-time performance in ship image/video detection and classification. The experiment on 7 types of ships shows that the proposed method performs better in ship image/video detection and classification than the YOLO series networks and other intelligent methods. On the testing-set, the final mAP is 0.9209, the Recall is 0.9818, the AIOU is 0.7991, and the FPS is 78–80 in video detection, which balances accuracy and real-time performance for ship detection. Thus, this method provides a highly accurate and real-time ship detection method for intelligent port management and visual processing of the USV. In addition, this paper also proposes a regressive deep convolutional network with a better comprehensive performance than YOLOv2 and YOLOv3.

The Regressive Deep Convolutional Neural Network (RDCNN)
The basic structure of the regressive deep CNN mainly consists of the input layer, convolution layer, pooling layer, fully connected layer, and output layer.

The Input Layer.
The function of the input layer is to receive the input image and store it in matrix form. Assuming that the regressive deep CNN has $L$ layers, $x^l$ represents the feature of the $l$-th layer, $l = 1, 2, \ldots, L$. Here, $x^l$ is composed of multiple feature maps and can be represented as $x^l = \{x^l_1, \ldots, x^l_j\}$, where $j$ is the number of feature maps in the $l$-th layer. Thus, the corresponding feature of a color input image can be represented as $x^1 = \{x^1_1, x^1_2, x^1_3\}$, where $x^1_1$, $x^1_2$, and $x^1_3$ represent the data of the red, green, and blue channels, respectively.

The Convolutional Layer.
The function of the convolution layer is to extract features through the convolution operation. With a proper design, the feature expression ability of the regressive deep CNN is strengthened as the number of convolution layers increases. The feature map of the $l$-th convolution layer can be calculated as
$$x^l_j = f\left(\sum_i G^l_{i,j}\left(k^l_{i,j} \otimes x^{l-1}_i\right) + b^l_j\right) \tag{1}$$
where $k^l_{i,j}$ and $b^l_j$ are the weights of the convolution kernel and the biases of the convolution layer, respectively; $G^l_{i,j}$ is the connection matrix between the $l$-th convolution layer and the feature maps of the previous $(l-1)$-th layer; the symbol $\otimes$ represents the convolution operation; and $f(x)$ is the activation function. When $G^l_{i,j}$ is 1, $x^{l-1}_i$ is connected to $x^l_j$; when $G^l_{i,j}$ is 0, they are not connected.
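As an illustration of how one output feature map is assembled from the connected input maps, the following is a minimal NumPy sketch. It assumes "valid" cross-correlation (the form most CNN frameworks implement under the name convolution) and a Leaky-ReLU activation with slope 0.1; the function names `correlate2d_valid` and `conv_feature_map` are illustrative and do not come from the paper.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Assumed activation f; alpha=0.1 is an illustrative slope.
    return np.where(x > 0, x, alpha * x)

def correlate2d_valid(image, kernel):
    # Slide the kernel over the image without padding ("valid" mode).
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def conv_feature_map(prev_maps, kernels, bias, connect, activation=leaky_relu):
    """Compute one output map x_j^l.

    prev_maps: list of 2D arrays x_i^{l-1}
    kernels:   list of 2D kernels k_{i,j}^l, one per input map
    bias:      scalar b_j^l
    connect:   list of 0/1 entries G_{i,j}^l selecting the connected inputs
    """
    total = None
    for x_i, k_ij, g_ij in zip(prev_maps, kernels, connect):
        if g_ij == 0:
            continue  # this input map is not connected to output map j
        term = correlate2d_valid(x_i, k_ij)
        total = term if total is None else total + term
    if total is None:
        raise ValueError("at least one input map must be connected")
    return activation(total + bias)
```

With two 4 × 4 input maps and 3 × 3 kernels, the result is a 2 × 2 map; setting an entry of `connect` to 0 simply drops that input from the sum.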

The Pooling Layer.
The function of the pooling layer is to reduce the feature dimension. The pooling layer is generally located after the convolutional layer, and the pooling operation maintains a certain spatial invariance. The feature map $x^l_j$ produced by the pooling operation in the $l$-th layer can be calculated as
$$x^l_j = p\left(x^{l-1}_j\right) \tag{2}$$
where $p(x)$ represents the pooling operation.
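A minimal sketch of one common choice for the pooling operation, 2 × 2 max pooling with stride 2, is given below. The pooling type, window size, and stride are assumptions made for illustration; the function name `max_pool2d` is hypothetical.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    # Keep the maximum of each size x size window, moving by `stride`.
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.empty((oh, ow), dtype=feature_map.dtype)
    for r in range(oh):
        for c in range(ow):
            window = feature_map[r * stride:r * stride + size,
                                 c * stride:c * stride + size]
            out[r, c] = window.max()
    return out
```

Under these settings each pooling step halves the width and height, so a 416 × 416 map becomes 208 × 208.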

The Fully Connected Layer.
The function of the fully connected layer is to transform the deep features obtained in the preceding layers into a feature vector. Thus, this layer is usually placed after the feature extraction layer. The feature vector $x^l$ in the fully connected layer can be calculated as
$$x^l = f\left(w^l x^{l-1} + b^l\right) \tag{3}$$
where $w^l$ is the connecting weight between two adjacent network layers, $b^l$ is the offset, and $f(x)$ is the activation function.

The Loss Function.
The regressive deep CNN obtains the predicted value through forward propagation. Then, the error between the predicted value and the real value is usually calculated with the following cross-entropy loss function:
$$\mathrm{Loss} = -\frac{1}{n}\sum_{x}\left[y \ln \hat{y} + (1 - y)\ln(1 - \hat{y})\right] \tag{4}$$
where $x$ are the input samples, $\hat{y}$ is the predicted output, $y$ is the actual output, and $n$ represents the total number of input samples in one batch.
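The following sketch evaluates a cross-entropy loss over one batch. It assumes the binary form of the loss and clips predictions away from 0 and 1 for numerical safety; both choices, and the name `cross_entropy`, are assumptions made for illustration.

```python
import math

def cross_entropy(actual, predicted, eps=1e-12):
    # actual: list of target values y in {0, 1}
    # predicted: list of predicted probabilities in [0, 1]
    n = len(actual)
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat)
    return -total / n
```

A confident correct prediction gives a small loss, while a confident wrong prediction gives a large one, which is what drives the weight updates during training.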

The Network Performance Index.
For the regressive deep CNN, the IOU represents the overlap rate between the detection window ($B_{gt}$) generated by the network model and the actually marked window ($B_{dt}$), that is, the ratio of their intersection and union areas. With $\mathrm{area}(\cdot)$ denoting the area, the IOU can be calculated as
$$\mathrm{IOU} = \frac{\mathrm{area}(B_{gt} \cap B_{dt})}{\mathrm{area}(B_{gt} \cup B_{dt})} \tag{5}$$
For the experiment in this paper, a detection result with IOU ≥ 0.5 is counted as a true positive sample, and a detection result with IOU < 0.5 is counted as a false-negative sample.
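The IOU of two axis-aligned boxes can be sketched as follows. Boxes are assumed to be given as `(x_min, y_min, x_max, y_max)` tuples; this coordinate convention and the function name `iou` are illustrative choices.

```python
def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, giving an IOU of 1/7, which falls below the 0.5 threshold.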
As there are many kinds of targets detected in this paper, the AIOU (the average value of IOU) is used, that is, the average ratio between the intersection and union areas of the predicted and actual bounding boxes on the testing-set, which is denoted as
$$\mathrm{AIOU} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{IOU}_i \tag{6}$$
where $n$ represents the number of detected targets. The Recall ($R$) represents the percentage of actual positive samples that are correctly predicted:
$$R = \frac{t_p}{t_p + f_n} \tag{7}$$
where $t_p$ represents a true positive sample and $f_n$ represents a false-negative sample. The Precision ($P$) indicates how many of the samples predicted as positive are truly positive:
$$P = \frac{t_p}{t_p + f_p} \tag{8}$$
where $f_p$ represents a false positive sample. The AP is an index used to measure the network identification accuracy, which is generally represented by the area enclosed by the Recall and Precision curve. Assuming that the curve of precision against recall is $PR$, then
$$\mathrm{AP} = \int_0^1 PR(r)\,\mathrm{d}r \tag{9}$$
where $r$ is the Recall. As there are 7 targets detected in this paper, the mAP is used to represent the network identification accuracy, that is, the average value of AP:
$$\mathrm{mAP} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{AP}_i \tag{10}$$
where $n$ represents the number of predicted categories, that is, 7.
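The indexes above can be sketched in a few lines of Python. The AP here is approximated by trapezoidal integration over sampled (recall, precision) points, which is an assumption: benchmark toolkits often use interpolated variants instead. All function names are illustrative.

```python
def recall(tp, fn):
    # Share of actual positives that were detected.
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    # Share of positive predictions that were correct.
    return tp / (tp + fp) if (tp + fp) else 0.0

def average_iou(ious):
    # AIOU: mean IOU over all detected targets.
    return sum(ious) / len(ious)

def average_precision(recalls, precisions):
    # Area under the precision-recall curve by the trapezoidal rule.
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

def mean_average_precision(aps):
    # mAP: mean of the per-category AP values (7 categories in this paper).
    return sum(aps) / len(aps)
```

A usage example: `average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])` integrates a curve that stays at precision 1.0 up to recall 0.5 and then falls linearly to 0.5.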
In addition, in order to measure the network speed for video detection, the frames per second (FPS) is also used as a performance index.

The Improved RDCNN Based on YOLOv2/v3
This research presents an improved RDCNN mainly based on the YOLO series, which also draws on the advantages of the currently popular regression deep convolution networks. By improving the feature extraction layer of YOLOv2 and the FPN of YOLOv3, the improved network overcomes the detection shortcomings of YOLOv2 and the training and recognition speed shortcomings of YOLOv3. The improved network also redesigns the anchors with a clustering algorithm and optimizes the activation function, both with respect to ship image/video detection and classification. Finally, this algorithm achieves good accuracy and real-time performance in ship image/video detection and classification.
The improved network structure built in this research is shown in Figure 1. This network structure mainly consists of three parts: the feature extraction layer, FPN layer, and prediction layer, which are described below.

The Lightweighted Feature Extraction Layer.
The feature extraction layer is very important in building the network structure. If the feature extraction layer is too large, it may obtain better deep features, but it will also slow down the whole network. For example, YOLOv3 uses Darknet-53 as the feature extraction layer. This extraction layer is relatively slow in training and detection because of its large number of layers. In order to give the presented network a lightweight feature extraction layer, this network adopts the Darknet-19 feature extraction layer of YOLOv2, whose structure is shown on the left of Figure 1.
This feature extraction layer has relatively few network layers and a faster calculation speed and can also extract deep features well from an input color ship image or video frame of 416 × 416 × 3 size.
In addition, as the number of feature extraction layers increases, the network can generally obtain deeper features with more expressive power. However, simply increasing the number of network layers will result in gradient vanishing or explosion. In order to solve this problem, in the later experiment, a batch normalization strategy is added between the convolution (Conv2d) and the activation (Leaky-Relu) of each convolution operation in the Darknet-19 feature extraction layer, as shown in Figure 2. This strategy can effectively control the gradient problem caused by deepening the network.

The New FPN Layer with a Clustering Algorithm.
In feature extraction, the feature information in the shallow layers is relatively sparse, but its location is accurate, which is an advantage for predicting small objects. On the contrary, the feature information in the deep layers is rich, but its location is relatively rough, which is suitable for predicting large objects.
Thus, in order to obtain a better detection result, the improved network builds on the multiscale prediction idea of YOLOv3 to design a new FPN layer, which is shown on the right of Figure 1. After a deep feature map of 13 × 13 size is predicted from the feature extraction layer, this method upsamples the deep feature map to 26 × 26 size and then merges the upsampled 26 × 26 feature map with the shallow 26 × 26 feature map. Finally, the network can detect and predict on the input image at two scales.
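The upsample-and-merge step can be sketched with NumPy as below. The sketch assumes nearest-neighbour 2× upsampling and merging by concatenation along the channel axis, as YOLOv3 does; the channel counts used in the usage note and the function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def upsample_nearest_2x(feature_map):
    # feature_map has shape (H, W, C); each cell is repeated 2x along H and W.
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

def merge_scales(deep, shallow):
    # Upsample the deep map and concatenate it with the shallow map by channel.
    up = upsample_nearest_2x(deep)
    if up.shape[:2] != shallow.shape[:2]:
        raise ValueError("spatial sizes must match after upsampling")
    return np.concatenate([up, shallow], axis=-1)
```

For instance, a 13 × 13 × 4 deep map merged with a 26 × 26 × 8 shallow map yields a 26 × 26 × 12 map, which combines the rich deep features with the accurate shallow locations.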
In addition, to get a better network structure, the anchors are obtained with a clustering algorithm on the collected data and then fine-tuned for ship image/video detection and classification. The obtained anchor values are shown in Table 1. They are used to predict on the feature maps of 13 × 13 and 26 × 26 scales, with 5 different anchor frames on each scale. Therefore, for a 416 × 416 image, the improved network predicts a total of 4225 fixed prediction frames, whereas YOLOv3, which uses 9 anchor frames across 3 scales (3 per scale), has 10647 fixed prediction frames in total. The number of prediction frames in the improved network is thus reduced by 6422, that is, about 60%.
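The frame counts quoted above follow from simple arithmetic: each grid cell on each scale predicts one frame per anchor. A short check, with the hypothetical helper `count_predictions`:

```python
def count_predictions(grid_sizes, anchors_per_scale):
    # One prediction frame per anchor in every cell of every grid.
    return sum(g * g * anchors_per_scale for g in grid_sizes)

improved = count_predictions([13, 26], anchors_per_scale=5)     # 845 + 3380
yolov3 = count_predictions([13, 26, 52], anchors_per_scale=3)   # 507 + 2028 + 8112
reduction = yolov3 - improved
reduction_ratio = reduction / yolov3

print(improved, yolov3, reduction, round(reduction_ratio, 3))
# prints: 4225 10647 6422 0.603
```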

The Prediction Layer.
Through prediction on the convolution layer, the spatial information is well preserved. For the improved network, the prediction method of YOLOv2 is adopted in the prediction layer. Each prediction frame predicts 7 ship categories and 5 frame values ($t_x$, $t_y$, $t_w$, $t_h$, $t_o$), of which the first four are the coordinates of the detected object and $t_o$ is the prediction confidence. In this paper, the loss function of YOLOv2 is also used in the prediction layer.
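Since the prediction method of YOLOv2 is adopted, the five raw values can be decoded into a box as in the sketch below. It follows the YOLOv2 convention (sigmoid offsets within the grid cell, exponential scaling of the anchor size) and assumes that anchors are expressed in grid-cell units; the function names and argument layout are illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(t, cell, anchor, grid_size):
    # t: raw outputs (t_x, t_y, t_w, t_h, t_o) for one prediction frame
    # cell: (column, row) index of the grid cell
    # anchor: (width, height) of the anchor, assumed to be in grid-cell units
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = anchor
    bx = (sigmoid(tx) + cx) / grid_size   # box centre x, relative to image width
    by = (sigmoid(ty) + cy) / grid_size   # box centre y, relative to image height
    bw = pw * math.exp(tw) / grid_size    # box width, relative to image width
    bh = ph * math.exp(th) / grid_size    # box height, relative to image height
    return bx, by, bw, bh, sigmoid(to)    # last value is the confidence

def output_depth(num_anchors=5, num_classes=7, box_values=5):
    # Channels per grid cell: every anchor carries 5 box values plus class scores.
    return num_anchors * (box_values + num_classes)
```

With 5 anchors and 7 ship categories, `output_depth()` gives 60 channels per grid cell on each scale.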

The Optimization of the Activation Function for the Improved RDCNN.
In order to optimize the choice of the activation function for the network structure proposed in this paper, the ELU and Leaky-Relu activation functions of equations (11) and (12) are also tested in addition to the commonly used Relu:
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha\left(e^{x} - 1\right), & x \le 0 \end{cases} \tag{11}$$
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{12}$$
where $\alpha$ is a positive constant.
Through the experimental comparison, the activation function with the best ship image/video detection and classification effect can be selected. The results on the testing-set are shown in Table 2.
In the experiment, the Leaky-Relu activation function has the best comprehensive detection effect and is less operable than the Relu and ELU activation functions. Thus, the Leaky-Relu is selected as the optimized activation function.
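The three candidate activation functions can be written as scalar functions as follows. The default values `alpha=0.1` for Leaky-Relu (the slope used by Darknet) and `alpha=1.0` for ELU are assumptions; the values used in the experiment are not stated in the text.

```python
import math

def relu(x):
    # Passes positive inputs, blocks negative ones.
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.1):
    # Keeps a small slope for negative inputs, so gradients do not vanish there.
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # Saturates smoothly towards -alpha for large negative inputs.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

All three agree for positive inputs; they differ only in how they treat negative inputs, which is where the comparison in Table 2 comes from.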

The Making of Ship Dataset.
At present, the popular target-detection datasets are VOC and COCO, but these datasets treat ships as a single class. Specific applications often need to classify ships more precisely. Therefore, in this research, the dataset of ship images is collected and labeled by ourselves.
The ship images are mainly collected from the Internet. As a result, their resolutions and sizes differ, for example, 500 × 400 × 3 and 500 × 318 × 3. The images containing ships are cropped roughly to a length-to-width ratio of 1 : 1. The proportion of the image occupied by the ship also differs from image to image, sometimes greatly, as can be seen from the dataset images and detected images in Figures 3-5. These naturally produced images of different specification and quality are more conducive to the training effect and generalization ability. Before training, they are all resized to 416 × 416 × 3.
After the dataset is collected, it needs to be labeled before being used as the network input. The labeling tool used in this paper is LabelImg. In LabelImg, the target object can be selected in the image with a rectangular box and saved with a label. Then, a file with the suffix xml is obtained. This file contains the path, name, and resolution of the original image, as well as the coordinates and name of the target object in the image.
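Reading such an annotation file can be sketched with the Python standard library. The sketch assumes the Pascal VOC layout that LabelImg writes by default (`filename`, `size`, and one `object` element per box); the sample file name, label, and coordinates in `SAMPLE_XML` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in Pascal VOC layout; all values are illustrative.
SAMPLE_XML = """
<annotation>
  <filename>ferry_0001.jpg</filename>
  <size><width>500</width><height>400</height><depth>3</depth></size>
  <object>
    <name>ferry</name>
    <bndbox><xmin>48</xmin><ymin>120</ymin><xmax>430</xmax><ymax>310</ymax></bndbox>
  </object>
</annotation>
"""

def parse_annotation(xml_text):
    # Collect the image name, its resolution, and every labeled box.
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    record = {
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": [],
    }
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        record["objects"].append({"name": obj.findtext("name"), "box": coords})
    return record
```

Because the images are later resized to 416 × 416, the stored width and height are needed to rescale the box coordinates consistently.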
There are many types of ships in real applications. In order to facilitate research and save costs, this paper collects only 7 representative types of ships: the sailing ship, container ship, yacht, cruise ship, ferry, coast guard ship, and fishing boat. After filtering and classification, the final dataset consists of 4200 manually selected images, with 600 images in each category. In each category, 480 images are randomly selected as the training-set, and the remaining 120 images form the testing-set. In this way, the training-set contains 3360 images and the testing-set contains 840 images in total. Typical images of each category in the dataset are shown in Figure 3.
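The per-category random split can be sketched as follows. The fixed random seed, the file-name pattern, and the function name `split_per_class` are assumptions for illustration; only the class list and the 480/120 split come from the text.

```python
import random

SHIP_CLASSES = ["sailing ship", "container ship", "yacht", "cruise ship",
                "ferry", "coast guard ship", "fishing boat"]

def split_per_class(images_by_class, train_per_class=480, seed=0):
    # Shuffle each category separately so every class keeps the same ratio.
    rng = random.Random(seed)
    train, test = [], []
    for label, paths in sorted(images_by_class.items()):
        shuffled = list(paths)
        rng.shuffle(shuffled)
        train += [(path, label) for path in shuffled[:train_per_class]]
        test += [(path, label) for path in shuffled[train_per_class:]]
    return train, test

# Hypothetical file names: 600 images per category.
images = {name: [f"{name}_{i:04d}.jpg" for i in range(600)]
          for name in SHIP_CLASSES}
train_set, test_set = split_per_class(images)
print(len(train_set), len(test_set))  # prints: 3360 840
```

Splitting within each category, rather than over the pooled 4200 images, guarantees exactly 120 test images per ship type.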

The Experimental Environment Configuration.
The experimental environment of this research is configured as follows.
The CPU is an Intel i7-7700 with a 4.2 GHz main frequency; the memory is 16 GB; the GPUs are two Nvidia GTX 1080 Ti cards; and the operating system is Ubuntu 16.04. In order to make full use of the GPUs to accelerate network training, CUDA 9.0 and its matching cuDNN are installed in the system. In addition, OpenCV 3.4 is also installed in the environment to display the results of network detection and classification.
During the experiment, the setting of the experimental parameters is very important. There are many parameters to be set in our improved RDCNN and YOLOv2/v3, such as the batch size, downsampling size, momentum parameter, and learning rate. The setting of these parameters affects not only the normal operation of the network but also the training effect. For example, when the batch size is set too large, the network will not run if the memory of the workstation is not big enough.
Considering the conditions of our experimental environment, and for convenience of comparison, the same parameters are set for the improved RDCNN and YOLOv2/v3. The network parameters are set as follows: the mini-batch size is 64, divided into 8 sub-batches; the number of iterations is 8000; the momentum parameter is 0.9; the weight decay is 0.0005; and the learning rate is 0.001. These settings are listed in Table 3.
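The settings above can be collected in a small configuration mapping, as sketched below. The key names mirror the fields of a Darknet-style configuration file, which is an assumption about the training framework; the values are those stated in the text.

```python
# Training settings shared by the improved RDCNN and YOLOv2/v3.
# Key names follow Darknet-style cfg fields (an assumption).
TRAIN_CONFIG = {
    "batch": 64,             # images per weight update
    "subdivisions": 8,       # the batch is processed in 8 sub-batches
    "max_batches": 8000,     # number of training iterations
    "momentum": 0.9,
    "decay": 0.0005,         # weight decay
    "learning_rate": 0.001,
    "width": 416,            # input image width
    "height": 416,           # input image height
    "channels": 3,
}

# Images held in GPU memory per forward pass: 64 / 8 = 8.
images_per_sub_batch = TRAIN_CONFIG["batch"] // TRAIN_CONFIG["subdivisions"]
```

Dividing the batch into sub-batches keeps the memory footprint of each forward pass small while the weight update still averages over all 64 images.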

The Training and Detection Based on YOLOv2/v3

The Iterative Convergence Training.
Generally, whether a network meets the training requirements is judged by the convergence of the loss function. In this experiment, because of the small size of the dataset and sufficient computing ability, YOLOv2 converges within only 8000 iterations, which takes only about 1 hour and 40 minutes. The Loss and AIOU curves of the training process are shown in Figures 6 and 7, respectively. It can be seen from Figures 6 and 7 that the training has converged steadily by the time the number of training iterations reaches 8000. The training time of YOLOv3 is relatively long: it takes about 3 hours and 40 minutes for the 8000-iteration convergence process. The Loss and AIOU curves of its training process are shown in Figures 8 and 9, respectively. It can also be seen from Figures 8 and 9 that after 8000 training iterations, the Loss and AIOU of the network have converged steadily.
Finally, the weight parameters obtained after 8000 training iterations are saved in the experiment.

The Detection Performance Testing.
After the network training is stable, it is necessary to verify its detection effect on the testing-set, especially to rule out a decline in the detection effect caused by overfitting. First, the network indexes obtained on the testing-set with the weights of the 8000th iteration are taken as the evaluation criteria. The specific values are shown in Table 4.

As the network cannot be evaluated on the testing-set in real time while it is being trained, the network parameters generated at training iterations 400, 600, 800, 1000, 2000, 3000, 4000, 5000, 6000, 7000, and 8000 are also loaded into the network for later testing and verification. In order to better analyze the detection effect of YOLOv2/v3 in this task, the AIOU and mAP of the networks on the testing-set are compared at these iterations, as shown in Figures 10 and 11.
From the AIOU and mAP curves on the testing-set, it can be seen that the performance indexes of the networks on the testing-set have stabilized. There is also no overfitting caused by too many training iterations. The comparison shows that YOLOv3, as an improved version of YOLOv2, has advantages in the AIOU and mAP performance indexes. That is, its AIOU is 0.0057 higher and its mAP is 0.0115 higher than those of YOLOv2. However, as the advantages of YOLOv3 are obtained by deepening and improving its network structure, its detection speed is 49 FPS lower than that of YOLOv2.

The Experiment and Analysis of the Improved RDCNN
The Network Performance Experiment.
The improved RDCNN takes 20 minutes longer than YOLOv2 to complete the 8000-iteration convergence process. However, its training time is much shorter than that of YOLOv3. The Loss and AIOU curves of the training process are shown in Figures 12 and 13, respectively. It can also be seen from Figures 12 and 13 that after 8000 training iterations, the Loss and AIOU of the improved network have converged steadily.
In order to verify the detection effect of the RDCNN on the testing-set, the network weight parameters generated at training iterations 400, 600, 800, 1000, 2000, 3000, 4000, 5000, 6000, 7000, and 8000 are loaded into the improved network for later testing and verification. Then, the AIOU and mAP curves on the testing-set are obtained at these iterations, as shown in Figures 14 and 15. This paper applies the two versions of the YOLO network, as well as the presented improved RDCNN based on YOLO, to ship image/video detection and classification.
Thus, the comparing […] Figure 16.
Figure 6: The iterative convergence process of the Loss curve for YOLOv2 network training.
According to the comparisons, the improved RDCNN network surpasses YOLOv2 and YOLOv3 in AIOU, the index of positioning accuracy. That is, its AIOU is 0.0153 higher than that of YOLOv2 and 0.0096 higher than that of YOLOv3. In addition, the improved network is 0.0044 higher than YOLOv2 in the mAP index. Because of the simplified network structure, the mAP index of the improved network is 0.0071 lower than that of YOLOv3, but its detection FPS is 33 higher than that of YOLOv3. Therefore, it can be concluded that the overall effect of the improved network is better than that of YOLOv2/v3 on the dataset collected for this experiment. The experimental results thus show that the improved RDCNN network structure designed in this paper surpasses the two YOLO networks in three evaluation indexes.
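The reported differences can be cross-checked against the YOLOv3-versus-YOLOv2 gaps stated earlier (0.0057 in AIOU and 0.0115 in mAP). The short calculation below uses only the differences quoted in the text; the variable names are illustrative, and `Decimal` is used to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Differences of the improved RDCNN relative to each YOLO network.
aiou_gain_over_v2 = Decimal("0.0153")
aiou_gain_over_v3 = Decimal("0.0096")
map_gain_over_v2 = Decimal("0.0044")
map_deficit_vs_v3 = Decimal("0.0071")

# Implied gap between YOLOv3 and YOLOv2.
v3_minus_v2_aiou = aiou_gain_over_v2 - aiou_gain_over_v3
v3_minus_v2_map = map_gain_over_v2 + map_deficit_vs_v3

print(v3_minus_v2_aiou, v3_minus_v2_map)  # prints: 0.0057 0.0115
```

Both implied gaps match the values given in the YOLOv2/v3 comparison, so the three sets of figures are mutually consistent.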

The Effect Demonstration of the Improved Network.
For the testing-set, the representative detection results of the improved RDCNN network are shown in Figure 15. In order to achieve a better network effect, the weight parameters of the feature extraction layer obtained from ImageNet [33] pretraining are loaded to train the improved RDCNN of this […] Table 5.
It can be seen that the mAP index of the improved RDCNN is slightly lower than that of YOLOv3 when using the pretraining weights. However, the other indicators are all better than those of YOLOv3, especially the video detection speed in FPS.
In order to better display the comparison of the network effects, YOLOv2/v3 and the improved RDCNN are used to detect an image with multiple fishing boats. The representative detection results of the three networks are shown in Figure 16. The improved network accurately detects more ships than the other two networks, which demonstrates the effectiveness of the improved RDCNN network.

Comparison with Other Intelligent Detection and Classification Methods.
The proposed method is also compared with other intelligent methods, such as Fast R-CNN, Faster R-CNN, and SSD, and with YOLOv2 under different dataset images and hardware configurations. The work in an earlier IEEE Transactions paper [32] is very similar to this paper, so its experimental results can be used for the comparison.
The comparison results are shown in Table 6. The proposed method has an advantage over the other intelligent methods in precision and speed, that is, in mAP and FPS, and it can also satisfy the detection and classification requirements in video scenes. However, our dataset is smaller than that of Shao's work, and our hardware configuration is also weaker than that of Shao's work.

Discussion and Conclusions
In this paper, the improved RDCNN network is presented to achieve the ship image/video detection and classification task. This network does not need manual feature extraction, and it improves the regressive CNN in four aspects based on the advantages of the currently popular regression deep convolution networks, especially YOLOv2/v3. Thus, this network only needs a dataset of ship images and a successful training run.
This paper builds a dataset for ship image/video detection and classification and studies a method based on an improved regressive deep CNN. The feature extraction layer is made lightweight. A new FPN layer is designed. Anchor frames and sizes suitable for ships are redesigned, which reduces the number of anchors by 60% compared with YOLOv3. The activation function is also optimized by selecting the Leaky-Relu. After a successful training run, the method can complete the ship image detection task and can also be applied to video detection. After 8000 training iterations, the Loss and AIOU of the improved RDCNN network have converged steadily. The experiment on 7 types of ships shows that the proposed method performs better in ship image/video detection and classification than the YOLO series networks. The improved RDCNN network surpasses YOLOv2/v3 in AIOU, the index of positioning accuracy. That is, its AIOU is 0.0153 higher than that of YOLOv2 and 0.0096 higher than that of YOLOv3. In addition, the improved network is 0.0044 higher than YOLOv2 in the mAP index. Because of the simplified network structure, the mAP index of the improved network is 0.0071 lower than that of YOLOv3, but its detection FPS is 33 higher than that of YOLOv3. Therefore, it can be concluded that the overall effect of the improved network is better than that of YOLOv2/v3 on the dataset collected for this experiment. This method can thus solve the problem of low recognition rate and poor real-time performance in ship image/video detection and classification, and it provides a highly accurate and real-time ship detection method for intelligent port management and visual processing of the USV. In addition, the proposed regressive deep convolutional network also has a better comprehensive performance than YOLOv2/v3. The proposed method is also compared with Fast R-CNN, Faster R-CNN, SSD, and YOLOv2 under different datasets and hardware configurations. The results show that the method has an advantage in precision and speed and can also satisfy the requirements of video scenes. However, our dataset is smaller. Thus, detection on a much larger dataset is left for future work.

Data Availability
The [SELF-BUILT SHIP DATASET and SIMULATION] data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.