DD-Net: A Dual Detector Network for Multilevel Object Detection in Remote-Sensing Images

With the recent development of deep convolutional neural networks (CNNs), remote-sensing ship detection methods have achieved enormous progress. However, current methods focus on the whole ship and fail to detect its components. To detect ships from remote-sensing images in a more refined way, we employ the inherent relationship between ships and their critical parts to establish a multilevel structure and propose a novel framework to improve performance in identifying multilevel objects. Our framework, named the dual detector network (DD-Net), consists of two carefully designed detectors, one for ships (the ship detector) and the other for their critical parts (the critical part detector), which detect the critical parts in a coarse-to-fine manner. The ship detector produces ship detection results, based on which the critical part detector detects small critical parts inside each ship region. The framework is trained end-to-end by optimizing a multitask loss. Due to the lack of publicly available datasets for critical part detection, we build a new dataset named RS-Ship with 1015 remote-sensing images and 2856 annotations. Experiments on the HRSC2016 dataset and the RS-Ship dataset show that our method performs well in the detection of ships and critical parts.


Introduction
As a fundamental task in computer vision, remote-sensing ship detection has been widely used in both military and civilian fields [1,2]. With the development of convolutional neural networks (CNNs), the effectiveness of ship detection has improved dramatically. However, it is difficult for existing methods to meet the needs of some special tasks, such as detecting the components of a ship, so a more refined detection of a ship and its components is needed.
Ships can be considered multilevel objects, where the cockpit, powerhouse, radar antenna, etc. are subobjects of the ship. There are also many other multilevel objects in daily life, such as a person and their face, a car and its wheels, or a streetlight and its lamp, as shown in Figure 1. After a comprehensive analysis of the prevalence and importance of each part of a ship, we regard the cockpit as the critical part and conduct related research on detecting a ship and its cockpit.
Traditional ship target detection algorithms [3][4][5] rely on manually designed features, which leads to time-consuming computation, poor robustness, and a high probability of missed or false detections. In recent years, with the rapid development of deep learning technology and its wide application in computer vision, deep learning-based ship target detection algorithms have become mainstream in the ship detection field. In [6], the ship detection process was divided into two steps: first, sea-land segmentation was performed to reduce the influence of artificial objects on the coastline; then, ships were detected from the sea pixels. In the ship detection network proposed in [7], angle information was added to the border regression to make the anchor boxes fit arbitrarily oriented ship targets more closely. This in turn enhanced the feature extraction capability of the network and thereby improved its performance in the detection of small-scale targets. In [8], an improved Mask R-CNN [9] framework was proposed to segment and locate targets using two key points of a ship, the bow and the stern; additionally, the bow key point was used in combination with the minimum bounding box of the mask to determine the direction of the target. Although the above algorithms have exhibited good detection performance in a variety of applications, they all detect the ship as a whole rather than as a multilevel target and cannot detect ships in a more refined way. In optical remote-sensing ship images, the critical parts of a ship occupy only a few pixels, which makes it challenging to extract their features. Therefore, it is difficult for existing algorithms to accurately detect the critical parts of a ship directly from such images.
If the critical parts are considered the detection target without factoring in their relationship with the ship, the interference of the artificial objects on the coastline will increase, which will in turn lead to a higher false detection rate.
Top-down pose estimation algorithms locate a person in an image and then detect the person's pose, which is similar to the detection of a ship and its critical parts. To solve the abovementioned problems, inspired by top-down pose estimation methods [10][11][12], we propose a new network structure, named the dual detector network (DD-Net), which first finds ships in an image and then finds the tiny critical parts within these ships. Our network contains two detectors: the ship detector and the critical part detector. The ship detector adopts a single-stage network to predict a set of ship bounding boxes. The feature maps of the detected ships are then cropped and sent to the ship region-based critical part detector, which predicts the boxes of the critical parts. In the critical part detector, most useless information is removed, and only the pixels inside the ship boxes are retained. Thus, there is less interference inside the ship proposals, which facilitates the detection of small parts. The whole network can be trained in an end-to-end way. To verify the proposed method, we create a new remote-sensing ship dataset, RS-Ship, which contains 1015 well-labeled images of ships and their critical parts. To the best of our knowledge, this is the first dataset containing both ships and their critical parts, which paves the way for future research on the detection of critical parts of ships. Finally, we conduct experiments on the HRSC2016 dataset [13] and the RS-Ship dataset. The results show that our method achieves state-of-the-art performance on both datasets for ship and critical part detection in complex scenes.

Summary Review of Previous Studies
2.1. Object Detection Methods. CNN-based object detection algorithms fall into two categories: two-stage networks [9], [14][15][16][17] and single-stage networks [18][19][20][21]. Two-stage networks first generate region proposals with a Region Proposal Network and then extract features from each proposal for classification and bounding box regression. Single-stage networks directly estimate object candidates without relying on region proposals, a design that brings high computational efficiency. However, single-stage networks generally cannot achieve detection accuracy comparable to that of their two-stage counterparts. To maintain a balance between accuracy and speed, we build on a single-stage detection framework and improve its detection accuracy by restricting the target region and reducing the influence of the background.

Ship Detection Methods.
With the rapid development of deep learning in recent years, deep learning-based algorithms have emerged as a much more accurate and faster alternative to traditional algorithms for detecting ship targets in optical remote-sensing images. In [22], the Inception structure [23] was used to improve the YOLOv3 [21] network. This improvement enhanced the feature extraction capability of the backbone network without losing features of small-scale ships during propagation through the network, thereby making the network better at detecting small-scale targets. In [24], Mask R-CNN was used to separate ships from the background, and soft-NMS was used in the screening process to further improve the robustness of ship detection. In [25], angle information was used together with orientation parameters to make anchors fit ship targets better, enabling a significant improvement in detection accuracy. Subsequently, various improved algorithms were proposed to address problems introduced by rotated boxes, including insufficient positive samples, feature misalignment, and inconsistency between classification and regression [26][27][28][29]. With these continuous improvements, today's deep learning-based ship detection algorithms can meet the accuracy and efficiency requirements of civilian applications. However, they all detect the ship as a whole rather than as a multilevel target. To address this problem, the proposed network is specially designed to detect a ship and its critical parts.

Pose Estimation Methods.
Many top-down pose estimation algorithms achieve excellent performance on the COCO dataset [30]. A number of improvements have been made to make these algorithms perform better on dense scenes and video data. In [31], a pose correction network named PoseFix was proposed to correct the results of a pose estimation network, thereby improving the accuracy of human joint localization.
In [32], a list of candidate joint locations and a global maximum correlation algorithm were constructed to solve the pose estimation problem in crowds, a pioneering study of pose estimation for dense crowds. In [33], the temporal and spatial information of the current frame and its two adjacent frames was extracted to improve the performance of human pose estimation in videos. In this paper, we propose a dual detector network (DD-Net) as an alternative to traditional stage-by-stage detection methods, refining predictions in a stepwise manner. Inspired by top-down pose estimation algorithms, this network detects the critical parts inside each ship proposal. More details are discussed in Section 3.

Proposed Method
The architecture of the proposed DD-Net network is illustrated in Figure 2. It consists of three parts: (1) the backbone CSPDarknet53 network [34], which is used to extract target features; (2) the ship detector, which is designed to detect the ship as a whole; and (3) the critical part detector, which is designed to detect the critical parts inside the selected ship bounding boxes predicted by the ship detector.
3.1. Backbone Network. The backbone network is CSPDarknet53, which is pretrained on ImageNet [35]. The input images of the backbone network are of size 640 × 640, and the output is four convolutional feature maps (C2, C3, C4, and C5) with downsampling strides of 4, 8, 16, and 32, respectively. Taking C2 as an example, a target is represented in feature map C2 only when its scale is larger than 4 × 4. The same principle applies to the other output layers for scales of 8 × 8, 16 × 16, and 32 × 32. Using the network input scale of 640 × 640 as the benchmark, we count the pixels occupied by ships and critical parts in the RS-Ship dataset, as summarized in Figure 3. The results clearly show that the ship targets are larger than 8 × 8 and the critical parts are larger than 4 × 4. To avoid missed detections caused by the loss of target features from an excessively large downsampling stride, we use C3, C4, and C5 as the input of the ship detector and C2 as the input of the critical part detector.
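As a quick sanity check of the scales discussed above, the relation between input size, stride, and feature map resolution can be sketched as follows (for the stated 640 × 640 input, the stride-32 map is 20 × 20; the 19 × 19 grid quoted for P5 in Figure 4 would correspond to a 608 × 608 input):

```python
# Sketch of the multi-scale outputs described above: for a 640x640 input,
# the backbone's feature maps C2-C5 are downsampled by strides 4, 8, 16, 32.
# A target must span at least `stride` pixels to occupy one cell of a map.

def feature_map_sizes(input_size: int, strides=(4, 8, 16, 32)) -> dict:
    """Map each feature level C2..C5 to the spatial size of its feature map."""
    return {f"C{i + 2}": input_size // s for i, s in enumerate(strides)}

sizes = feature_map_sizes(640)
# C2: 160x160 (used by the critical part detector),
# C3-C5: 80x80, 40x40, 20x20 (used by the ship detector).
```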

Ship Detector.
To effectively utilize the semantic information of high-level feature maps and the detail information of low-level ones, we use C3, C4, and C5 to construct a feature pyramid network (FPN) [36] in a top-down manner. The FPN is then reconstructed in a bottom-up manner to reduce the span between high-level and low-level feature maps, which enriches the detail information of high-level feature maps and avoids the loss of semantic information caused by channel reduction. In addition, a channel attention module (CAM) [37] is introduced into the reconstructed FPN to connect adjacent feature layers while generating salient features. The CAM uses pooling operations (MaxPool and AvgPool) to generate channel context descriptors and then outputs channel attention maps through a shared network consisting of a multilayer perceptron (MLP). The computation of the CAM can be expressed as

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c))),

where F_avg^c and F_max^c denote the global average pooling feature and the global max pooling feature, respectively, W_0 and W_1 represent the two layers of the shared MLP (with a ReLU activation after W_0), and σ is the sigmoid function.
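The CAM computation can be illustrated with a minimal NumPy sketch following the CBAM formulation; here `W0` and `W1` are plain matrices standing in for the 1 × 1 convolutions of the shared MLP, and the reduction ratio `r` is an assumed hyperparameter:

```python
# Minimal NumPy sketch of the channel attention module (CAM) above:
# AvgPool and MaxPool over the spatial dims give two channel descriptors,
# a shared two-layer MLP processes both, and a sigmoid of their sum
# yields per-channel attention weights that rescale the feature map.
import numpy as np

def channel_attention(F, W0, W1):
    """F: feature map (C, H, W); W0: (C/r, C); W1: (C, C/r)."""
    f_avg = F.mean(axis=(1, 2))            # global average pooling -> (C,)
    f_max = F.max(axis=(1, 2))             # global max pooling -> (C,)
    relu = lambda x: np.maximum(x, 0)
    mlp = lambda v: W1 @ relu(W0 @ v)      # shared MLP (ReLU after W0)
    att = 1.0 / (1.0 + np.exp(-(mlp(f_avg) + mlp(f_max))))  # sigmoid
    return F * att[:, None, None]          # reweight channels

rng = np.random.default_rng(0)
C, r = 8, 2                                # r is an assumed reduction ratio
F = rng.standard_normal((C, 16, 16))
W0 = rng.standard_normal((C // r, C))
W1 = rng.standard_normal((C, C // r))
out = channel_attention(F, W0, W1)         # same shape as F, channels rescaled
```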

Journal of Sensors
After the above processes, feature maps P3, P4, and P5 are obtained and propagated to the prediction layer for target prediction. Figure 4 shows the process of ship detection on feature map P5 with a resolution of 19 × 19. The detection process on P3 and P4 is the same as that on P5. The input image is divided into 19 × 19 cells. Each cell corresponds to a pixel of P5, which serves as the center for generating three anchor boxes of different scales. By computing features within the anchor boxes, each anchor box produces six prediction parameters (t_x, t_y, t_w, t_h, P_obj, P_k). Specifically, (t_x, t_y, t_w, t_h) represents the border offsets, P_obj is the target confidence, and P_k is the classification probability. Each anchor box regresses one target bounding box, whose coordinates are calculated as

b_x = σ(t_x) + c_x,
b_y = σ(t_y) + c_y,
b_w = p_w e^{t_w},
b_h = p_h e^{t_h},

where (c_x, c_y) are the coordinates of the top-left corner of the cell, (p_w, p_h) are the width and height of the anchor box, (b_x, b_y) are the coordinates of the center of the bounding box, and b_w and b_h are its width and height.

Critical Part Detector.
From the statistical results of the size of critical parts in the RS-Ship dataset (as shown in Figure 3), it is clear that critical parts are small-scale targets, and their features can be lost easily during the feature extraction process. C2 is a low-level feature map, which contains rich detail information, but lacks semantic information.
With few convolutional layers passed through, feature map C2 cannot fully represent the target features. To address the above problems, we design the critical part detector as follows. First, masking is employed to restrict the target region. The predicted ship bounding boxes with high classification scores from the ship detector are selected, and nonmaximum suppression (NMS) is performed on them to keep high-quality prediction boxes. By locating ships with horizontal boxes, the proposal region is expanded, and the loss of critical parts due to inaccurate ship localization is avoided. After box regression, the coordinates of the prediction boxes are mapped onto F2, and a limited binarization operation is performed on F2 (the pixel values of regions within the prediction boxes are set to 1 and the rest to 0) to generate a mask map. The masking process is shown in Figure 5. F2 is then filtered by the mask map, and only the ship regions covered by the ship boxes are preserved. Second, a feature extraction network is constructed to extract deep features and enhance the representation of the feature maps. The network is composed of four residual modules and four deconvolution modules, which are connected via skip connections to prevent the loss of spatial information. The settings of parameters such as the number of channels, stride, and convolution kernel size of each layer, together with their notation, are shown in Figure 2. Finally, F2 is fed to the prediction layer for critical part detection. Feature map C2 is transformed into P2 after mask filtering and feature extraction; P2 contains rich detail and semantic information and is used for the detection of critical parts.
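The limited binarization and mask-filtering step can be sketched as follows; this is a minimal illustration, assuming box coordinates have already been mapped to the feature map's resolution:

```python
# Minimal sketch of the masking step described above: pixels of the feature
# map inside the predicted ship boxes are kept (mask = 1), the rest zeroed.
import numpy as np

def mask_filter(F2, boxes):
    """F2: (C, H, W) feature map; boxes: list of (x1, y1, x2, y2) in F2 coords."""
    _, H, W = F2.shape
    mask = np.zeros((H, W), dtype=F2.dtype)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0       # limited binarization: inside boxes -> 1
    return F2 * mask                   # broadcast over channels

F2 = np.ones((4, 8, 8))
out = mask_filter(F2, [(2, 2, 6, 6)])
# Only the 4x4 ship region survives in every channel; the rest is zero.
```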

Experiments and Analysis
The experimental environment is Ubuntu 18.04 with an E5-2630 v4 CPU and 64 GB RAM; the network is built on the PyTorch 1.6 deep learning framework and accelerated with an Nvidia GeForce GTX 1080Ti (11 GB memory) graphics card. To verify the effectiveness of the proposed method, several sets of experiments are conducted on the HRSC2016 dataset [13] and the self-built RS-Ship dataset. The detection speed of the network is evaluated by frames per second (FPS), and its detection performance by average precision (AP). The FPS and AP values are calculated as

FPS = N_Figure / Time,
AP = ∫_0^1 P(R) dR,

where N_Figure is the number of images tested, Time is the total time taken by the test, P is the precision, and R is the recall.
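The two metrics can be sketched as follows; the AP computation here uses a simple all-point numerical integration of the PR curve, while detection benchmarks differ in interpolation details:

```python
# Sketch of the evaluation metrics above: FPS is images per second, and AP
# is the area under the precision-recall curve (rectangle rule per recall step).

def fps(n_images, elapsed_seconds):
    """Frames per second: images processed divided by total test time."""
    return n_images / elapsed_seconds

def average_precision(recalls, precisions):
    """recalls: increasing values in [0, 1]; precisions: same length."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p     # area of the rectangle over this recall step
        prev_r = r
    return ap

speed = fps(100, 5.0)                                  # 20 images per second
ap = average_precision([0.5, 1.0], [1.0, 0.5])         # 0.5*1.0 + 0.5*0.5 = 0.75
```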

Datasets.
The HRSC2016 dataset is currently the only publicly available dataset containing solely ship targets. Its data were collected from six well-known ports. The image resolution ranges from 0.4 m to 2 m, and the image size ranges from 300 × 300 to 1500 × 900. The dataset contains 1061 remote-sensing images with 2976 ship targets that vary significantly in scale. Some example images from the HRSC2016 dataset are shown in Figure 6. The main scenes covered by the dataset are open-sea and near-shore areas, with complex backgrounds and diverse ship types. This dataset has been frequently used to test the performance of ship target detection algorithms. We created a new dataset, RS-Ship, to verify our method on more samples. The RS-Ship dataset is mainly collected from well-known military harbors on Google Maps and expanded with ship images crawled from the Internet; all images are formatted to a uniform size of 800 × 600. The dataset contains 1015 ship images and 2856 ship targets, and each ship target has a corresponding critical part. Ships and critical parts are labeled in the PASCAL VOC format. The ships in the dataset vary widely in scale. The covered scenes are mainly near-shore areas with complex backgrounds, where artificial objects on the coastline interfere with the detection of ships and critical parts. Some example images from the RS-Ship dataset are shown in Figure 6.
In the experiments, the HRSC2016 dataset is used to evaluate the performance of the proposed method in detecting ship targets, and the RS-Ship dataset is used to evaluate its performance in detecting both ships and critical parts. Both datasets are divided into training and test sets, as shown in Table 1.

Experiment Implementation.
Transfer learning [38] includes various methods such as instance-based transfer and parameter-based transfer. In this paper, parameter transfer is introduced into the training process: the CSPDarknet53 model trained on the ImageNet dataset is used to initialize the network parameters. The whole training process is divided into two steps: first, the backbone network CSPDarknet53 is frozen, and the remaining network parameters are trained for 50 epochs. Then, all convolutional layers are unfrozen and trained for 100 epochs. The whole training process is optimized with the Adam optimizer.
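The two-step schedule can be sketched framework-agnostically as follows; in PyTorch, the `backbone_trainable` flag would correspond to setting `requires_grad` on the backbone's parameters:

```python
# Sketch of the two-step training schedule described above: 50 epochs with
# the backbone frozen, then 100 epochs with all layers open for fine-tuning.

def training_schedule(frozen_epochs=50, finetune_epochs=100):
    """Yield (epoch, backbone_trainable) pairs for the full training run."""
    for epoch in range(frozen_epochs):
        yield epoch, False                      # step 1: backbone frozen
    for epoch in range(frozen_epochs, frozen_epochs + finetune_epochs):
        yield epoch, True                       # step 2: all layers open

schedule = list(training_schedule())
# 150 epochs total: the first 50 with CSPDarknet53 frozen, the last 100 open.
```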
To make the most of prior knowledge about ship target shapes, the k-means clustering algorithm is applied to generate nine anchor boxes on the HRSC2016 and RS-Ship training sets, respectively. The clustering results are shown in Figure 7, and the sizes of the anchor boxes are listed in Table 2. From Figure 7, it is clear that the normalized widths and heights of most ship targets in both training sets are concentrated below 0.4, indicating that both datasets contain a large number of small-scale targets. Compared with those in HRSC2016, the ship targets in RS-Ship exhibit wider distributions of width and height and larger aspect ratios, which indicates that the RS-Ship dataset built in this paper is suitable for verifying the performance of ship detection algorithms.
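Anchor clustering of this kind is commonly implemented as k-means over ground-truth box widths and heights with 1 - IoU as the distance; the sketch below illustrates that common recipe (the paper's exact distance metric is not stated, so this is an assumption):

```python
# Sketch of k-means anchor clustering over (width, height) pairs, assigning
# each box to the anchor with the highest IoU (boxes aligned at the origin).
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes (N, 2) and anchors (k, 2), both centered at origin."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]  # random init
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)     # nearest anchor
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)   # update centroid
    return anchors

# Synthetic normalized (w, h) pairs standing in for ground-truth ship boxes.
boxes = np.abs(np.random.default_rng(1).normal(0.2, 0.1, size=(200, 2))) + 0.05
anchors = kmeans_anchors(boxes, k=9)
```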

Ablation Studies.
To evaluate the performance of the two detectors (ship detector and critical part detector) and the overall rationality of the network, we set up six sets of experiments on the HRSC2016 dataset and the RS-Ship dataset and evaluate the experimental results mainly using AP values. The results of each set of experiments are shown in Table 3.

Figure 6: Example images from the HRSC2016 and RS-Ship datasets.

Compared with Model1, Model2 has an additional attention module. This helps enhance the salient features of the ship targets and reduces the number of false detections, but increases the number of missed detections. Overall, the second set of experiments achieved APs of 82.98% and 91.23% on the HRSC2016 dataset and the RS-Ship dataset, respectively, 3.57% and 2.54% higher than those of the first set of experiments.
The C2 layer feature maps of CSPDarknet53 contain rich detail information but little semantic information, which makes it difficult for the detector to distinguish between targets and interfering signals. Unlike Model3, Model4 deeply extracts the features of the C2 layer to construct P2 feature maps that contain rich detail and semantic information, resulting in a significantly reduced number of missed detections and a 25.69% higher AP value.
Model5 integrates the ship detector and the critical part detector in one framework, which allows the network to concentrate on the ship region during feature extraction, reducing the influence of the background and facilitating target detection. Compared with Model2, Model5 delivers 0.17% and 0.5% higher AP values for ship detection on the two datasets, respectively, and Model5 has a 0.69% higher AP value for critical part detection than Model4.

Figure 7: Clustering results generated by k-means with the number of cluster centers set to 9.
Our proposed model enhances the correlation between the two detectors based on Model5. The experimental results show that our proposed model delivers an AP value of 79.65%, suggesting that the enhanced correlation allows for effective detection of critical parts while having no effect on ship detection.
The experimental results show that combining the two detectors improves the feature extraction ability of the backbone network. By mapping the prediction results of the ship detector to the critical part detector, the relationship between ships and their critical parts is fully utilized, and the target detection region is filtered to reduce the interference of the background on critical part detection. The effect of the mask on feature extraction is visualized in Figure 8. Specifically, columns 1-4 show the input images, F2 layer feature maps, mask-filtered maps, and P2 layer feature maps in the sixth set of experiments, respectively, and column 5 shows the P2 layer feature maps in the fifth set of experiments without mask filtering, denoted P2′. Comparing P2 and P2′, it is clear that the mask minimizes the interference of coastline artifacts on the detection of critical parts. By restricting the target region, the feature extraction network can better characterize target features and make the salient features of targets more representative.

Comparison with Other State-of-the-Art Methods.
In this section, the effectiveness of the proposed method is verified through comparisons with Faster R-CNN [17] (with an additional FPN module), SSD [18], RetinaNet [39], YOLOv3 [21], YOLOv4 [34], YOLOF [40], and TOOD [41]. The quantitative results of ship and critical part detection by each network model on the HRSC2016 and RS-Ship datasets are shown in Table 4. From Table 4 and the PR curves (shown in Figure 9), it can be seen that every method detects ships significantly better than critical parts, which indicates that critical part detection is more difficult. The main reason is that critical parts are small in scale and their features are similar to those of some man-made objects on the coastline. YOLOv3 has a stronger feature extraction capability with an FPN structure added on Darknet-53, but its generalization capability is insufficient. In addition, YOLOv3's ship detection performance varies greatly between the two datasets, and its capability for critical part detection is average. YOLOv4 outperforms YOLOv3 by combining the advantages of various detection algorithms. It delivers ship detection performance on par with Faster R-CNN on both datasets, and its critical part detection performance is better than those of the first four algorithms. YOLOF substantially improves detection speed by simplifying the FPN, but it is not effective at detecting multiscale targets, resulting in low detection accuracy on both datasets. TOOD enhances the interaction between classification and localization to improve the consistency between the two tasks; this strategy achieves good results for ship detection but performs only moderately for critical part detection, with an AP value of only 64.34%. From the FPS statistics of each network model (shown in Table 4), the following observations can be drawn. SSD is significantly faster than the other algorithms, since it is a single-stage network with a simple structure.
Since Faster R-CNN is a two-stage model, its complex network structure and processing procedure significantly reduce its detection speed. Our proposed method uses the prediction results of the ship detector to restrict the target region, which improves the detection performance of the critical part detector. Although this design slightly compromises the network's detection speed, the FPS remains above 20 on both datasets, which meets the real-time detection requirement. To sum up, our proposed model nicely balances detection speed and accuracy and delivers optimal AP values for critical part detection, proving that it is suitable for the detection of ships and critical parts in complicated scenes.
To compare the detection performance of the algorithms more intuitively, the detection results of the different methods are visualized in Figure 10. The first two columns show results on the HRSC2016 dataset, and the last two show results on the RS-Ship dataset. From Figure 10, it can be observed that SSD, RetinaNet, and YOLOv3 perform poorly in detecting small-scale targets and miss quite a number of them. Faster R-CNN and YOLOv4 are significantly better than the first three algorithms for ship detection on both datasets, but both have an increased number of false detections of critical parts because of the interference of the large number of artifacts on the coastline. YOLOF has more false detections on both datasets, especially when detecting closely aligned targets. On both datasets, TOOD detects ships well but is still affected by the background and produces some false detections. DD-Net has the lowest numbers of missed and false detections on both datasets and performs best in the detection of ships and critical parts.

Conclusion
In this paper, we propose a dual detector network, DD-Net, for the detection of ships treated as multilevel objects. We take the ship as the main object and its critical part as a subobject, and we design DD-Net specifically to detect both accurately. DD-Net consists of two specially designed detectors that recognize ships and the critical parts of ships, respectively. For the ship detector, we use pyramid structures in two different directions to enrich ship features and introduce attention modules to enhance target saliency. For the critical part detector, we design an additional feature extraction module to increase the semantic information contained in the low-level feature maps. To make full use of the relationship between a ship and its critical parts, we introduce an additional association between the two detectors so that critical parts are detected inside each ship region with minimal influence of the background, thus improving the accuracy of critical part detection. The experimental results show that the proposed algorithm can accurately detect the critical parts of ships while accomplishing ship target detection.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.