Dual-Band Maritime Imagery Ship Classification Based on Multilayer Convolutional Feature Fusion

To address the problems of few annotated samples and low-quality fused features in visible and infrared dual-band maritime ship classification, this paper leverages the hierarchical features of a deep convolutional neural network and proposes a dual-band maritime ship classification method based on multilayer convolutional feature fusion. Firstly, the VGGNet model pretrained on the ImageNet dataset is fine-tuned to capture semantic information of the specific dual-band ship dataset. Secondly, the pretrained and fine-tuned VGGNet models are used to extract low-level, middle-level, and high-level convolutional features of each band image, and a number of improved recursive neural networks with random weights are exploited to reduce feature dimension and learn feature representation. Thirdly, to improve the quality of feature fusion, multilevel and multilayer convolutional features of dual-band images are concatenated to fuse hierarchical information and spectral information. Finally, the fused feature vector is fed into a linear support vector machine for dual-band maritime ship category recognition. Experimental results on the public dual-band maritime ship dataset show that multilayer convolutional feature fusion outperforms single-layer convolutional features by about 2% in mean per-class classification accuracy for single-band images, that dual-band images perform better than single-band images by about 2.3%, and that the proposed method achieves the best accuracy of 89.4%, which is higher than the state-of-the-art method by 1.2%.


Introduction
Object classification is a fundamental problem with numerous applications in computer vision and has been extensively studied for visible (VIS) images in the past decades. Because an infrared (IR) image provides additional information about the same scene, it helps address various challenges in VIS images, such as illumination variation and occluded appearances. Thus, dual-band data consisting of VIS and IR images has been successfully applied to face recognition [1][2][3]. Many recent works in object classification [4], person reidentification [5], and pedestrian detection [6] show that dual-band data can improve performance and offer competitive advantages over a single band.
After the breakthrough research in image classification by Krizhevsky et al. [7], deep convolutional neural networks (CNNs) have achieved remarkable success on the ImageNet challenge [8] and produced a number of excellent CNN models such as AlexNet [7], VGGNet [9], GoogLeNet [10], and ResNet [11]. Researchers found that the features learned by a CNN are hierarchical across the whole network [12]; that is, the low-level layer features are similar to Gabor filters and color blobs, the middle-level layer features include fine visual details and semantic information, and the high-level layer features are distinctive semantic features. Furthermore, they also demonstrated the generality and specificity of convolutional features [13]; that is, first-layer features are general to many datasets and tasks, and last-layer features are specific to a particular dataset or task. However, in practical maritime applications, large-scale datasets like ImageNet are expensive or difficult to collect and time-consuming to train on. Thus, in order to improve performance for various practical tasks, such as ship classification, well-known pretrained CNN models like AlexNet and VGGNet have been widely fine-tuned on ship images [14][15][16] and used to extract meaningful ship features [17,18].
Shi et al. [19] combined low-level features obtained by Gabor filters and multiscale completed local binary patterns (MS-CLBP) with high-level features extracted from a pretrained CNN model with fine-tuning and classified ship categories on VIS images. Shi et al. [20] also proposed a classification framework consisting of a multifeature ensemble based on convolutional neural network (ME-CNN) and improved the classification accuracy on VIS images. Zhang et al. [21] combined the pretrained VGG-16 model with gnostic fields to improve dual-band maritime ship classification performance. Santos and Bhanu [22] extracted features from the 5th convolutional layer of the pretrained VGG-19 model [9] for both VIS and IR images and proposed a decision-level fusion of convolutional networks using a probabilistic model. Limited by the high dimension of each layer, most of these methods extracted features from only one convolutional layer or one fully connected layer. Zhang et al. [4] exploited linear discriminant analysis (LDA) to reduce the feature dimension of a convolutional layer, then presented a multifeature fusion method that combines structure fusion with spectral regression discriminant analysis (SF-SRDA) to learn the structure information of convolutional features, and achieved a promising result. However, the features of a single layer cannot provide sufficient information. Besides, LDA is a supervised dimensionality reduction technique and thus requires additional class labels. Although the combination of multilayer features provides richer information, it produces higher-dimensional data and requires more computation. To address the above problems, recursive neural networks (RNNs) [23] provide one possible solution through a systematic feature learning strategy.
RNNs comprise a class of architectures in which the same set of weights is recursively applied within a structural setting and, in particular, on directed acyclic graphs [24]. The main idea of an RNN is to learn a distributed feature representation by applying the same neural network recursively in a tree structure, and it is suitable for processing structured data, as in natural language processing [25]. In order to process features extracted from a CNN, a fixed-tree RNN with blocks was presented for multiclass object classification tasks in [23]. This RNN uses nonoverlapping receptive fields instead of the overlapping receptive fields of a CNN. Besides, it not only reduces the dimension of the convolutional features but also learns a feature representation that improves classification performance. Thus, the RNN allows us to transfer information from multiple layers effectively [23]. This characteristic is particularly helpful in the feature fusion of multiple layers. Recently, it has also been extended to object classification [25,26] and image super-resolution [27,28].
In this paper, we present a multilayer convolutional feature fusion method for dual-band maritime ship classification by taking advantage of CNN and RNN. The pretrained and fine-tuned VGGNet models are used to extract the convolutional features of each band image. A number of RNNs with random weights are applied to reduce feature dimension and learn feature representation. The concatenation of low-dimensional hierarchical convolutional features provides abundant information; thus, the proposed method has the potential to significantly improve classification performance while speeding up the network adaptation process. The main contributions can be summarized as follows:
(1) A multilayer convolutional feature fusion method is proposed for dual-band maritime ship classification, and three combinations of two feature extractors are investigated.
(2) A number of improved RNNs with random weights are exploited to reduce convolutional feature dimension and learn feature representation.
(3) Multilayer convolutional features of the pretrained and fine-tuned VGGNet models are fused to improve classification performance. The proposed dual-band feature fusion method achieves the best classification accuracy of 89.4% and outperforms the state-of-the-art method by 1.2%.
The remainder of the paper is organized as follows: the next section introduces the proposed method and the improved RNN in detail, Section 3 shows and analyzes the experimental results, and Section 4 draws the conclusions.

Proposed Method
In our work, we explore the effectiveness of using a CNN together with RNNs to recognize maritime ship categories in dual-band data. Specifically, the pretrained VGG-f model [29] is applied to extract raw convolutional features, and multiple improved RNNs are used to learn feature representation. The proposed framework is illustrated in Figure 1. Because of over-fitting, directly fine-tuning a pretrained CNN model on a small-scale dataset may not achieve good classification performance [21]. However, fine-tuning the CNN model on a specific dataset can learn the specific semantic information of the middle and high layers [12]. Therefore, we also take the pretrained VGG-f model with fine-tuning as a feature extractor. The classification architecture proceeds through five steps, as shown in Figure 1. Firstly, dual-band data including VIS and IR images is taken as the input. Secondly, multilevel features of each band image are extracted from the pretrained VGG-f models. Thirdly, a number of improved RNNs without training are employed to learn feature representations, which are hierarchically concatenated for each band image. Fourthly, the final feature representations of the VIS and IR images are fused by concatenation or summation. In the last step, the fused feature is fed into a linear support vector machine (SVM) classifier.

Convolutional Feature Extraction.
The pretrained VGG-f model is used to extract image features in our work. The VGG-f network consists of 8 layers, 5 of which are convolutional layers (namely, C1, C2, C3, C4, and C5 in Figure 1), and the last 3 are fully connected layers (namely, F6, F7, and F8 in Figure 1). The network was trained on three-channel VIS images of size 224 × 224 from the ImageNet dataset. The first and second layers learn general features similar to Gabor filters and color blobs, which are suitable for many datasets and tasks. Then, as the depth of the CNN architecture increases, features transition from general to specific [12]. At the last fully connected layer, features are finally specific to a particular dataset and task, such as the 1000 classes of ImageNet. Unlike methods using a single middle-level feature [25] or combined middle-level features [26], we exploit the general and specific features extracted from the low-level, middle-level, and high-level layers of the VGG-f network for each band image, such as the C2, C5, and F6 layers. Meanwhile, in order to capture the semantic features of ships, the pretrained VGG-f model fine-tuned on the training images of each band of the VAIS dataset [21] is also used as a feature extractor.

Dimensionality Reduction and Feature Learning.
Convolutional features have high dimensions, especially in the low-level and middle-level convolutional layers. To exploit the features of different levels, we adopt the improved RNN to reduce the dimension of the feature space and learn feature representation. Figure 2 shows an example of two improved RNN architectures.
2.2.1. Multilayer Block RNN. The RNN was first introduced to learn distributed representations of structured data such as logical terms [24] and was then extended to construct a binary tree in a bottom-up fashion for natural language processing [29]. Although the binary-tree RNN allows more flexible inputs, the search over optimal trees slows down the architecture. Besides, such a search is not necessary to obtain high performance for tasks based on convolutional features. Therefore, a fixed-tree RNN architecture named Multilayer Block RNN (MB-RNN) was proposed for object classification based on CNN [23]. MB-RNN learns feature representation from convolutional features and generalizes the binary-tree architecture by allowing each layer to merge blocks of adjacent vectors instead of only pairs of vectors, which improves classification performance. An example of MB-RNN is shown in Figure 2(a), with details as follows. Assume a given convolutional feature is a 3D matrix $X = [x_1; \cdots; x_{r^2}]$ ($X \in \mathbb{R}^{K \times r \times r}$), in which $K$ is the filter bank size and $r \times r$ is the size of the feature maps. A square block of size $K \times b \times b$ is defined as a list of adjacent column vectors, which are merged into a parent vector $p_i \in \mathbb{R}^{K}$. Thus, there are $(r/b)^2$ parent vectors, each computed as

$$p_i = f\left(W_b \begin{bmatrix} x_1 \\ \vdots \\ x_{b^2} \end{bmatrix}\right), \qquad (1)$$

where $x_1, \ldots, x_{b^2}$ are the column vectors of the $i$-th block, $f = \tanh$, the weight parameter matrix $W_b \in \mathbb{R}^{K \times b^2 K}$, and $i = 1, \cdots, (r/b)^2$, where $r$ should be a multiple of $b$. Equation (1) is applied to all blocks of vectors in $X$ with the same weights $W_b$. In general, because the RNN uses nonoverlapping receptive fields, the $(r/b)^2$ parent vectors form a new matrix $P_1 = [p_1, \cdots, p_{(r/b)^2}]$. The vectors in $P_1$ are again merged in blocks, just as those in matrix $X$, using Equation (1) with the same tied weights, resulting in matrix $P_2$. This recursive procedure continues until only one parent vector $p$ remains.

Figure 1 (labels recovered from the figure). Step 1: dual-band image inputs. Step 2: convolutional feature extraction (convolutional neural networks). Step 3: feature learning (recursive neural networks), followed by concatenation. Step 4: feature fusion (concatenation or summation). Step 5: classification (SVM classifier; example label: sailing).

2.2.2. One-Layer Block RNN. The One-Layer Block RNN (OB-RNN) applies Equation (1) once to the whole feature map, that is, with block size $b = r$, and directly merges $X$ into a parent vector $p$ through one layer. Then, the parent vector $p$ is passed through a nonlinear squash function. An example of OB-RNN is shown in Figure 2(b). Therefore, a feature $X \in \mathbb{R}^{K \times r \times r}$ fed into an OB-RNN results in a $K$-dimensional vector, and the feature dimension is reduced from $K \times r \times r$ to $K$. Besides, an OB-RNN with weight $W_r$ learns one kind of feature representation, and $N$ OB-RNNs with different weights learn $N$ kinds of feature representation and produce an $NK$-dimensional feature vector. The larger $N$ is, the higher the feature dimension. Therefore, the number of OB-RNNs $N$ is critical and will be discussed in Section 3.3.2.

Additionally, because the features extracted from convolutional layers have the form K × r × r whereas those of the fully connected layers do not, we reshape the features of F6 and F7 by fixing the filter bank size to 256. Thus, the outputs of the F6 and F7 layers are reshaped to 256 × 4 × 4.
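The block merge of Equation (1), its recursive use in MB-RNN, and the one-layer variant can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' MATLAB implementation: the helper names (`random_weights`, `merge_block`, `block_layer`, `mb_rnn`, `ob_rnn`), the uniform weight initialisation, the row-major ordering of vectors inside a block, and the toy sizes are all assumptions made for the example. The input is stored as an r × r grid of K-dimensional vectors.

```python
import math
import random


def random_weights(rows, cols, rng):
    """Random weight matrix as a list of rows (initialisation is an assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def merge_block(W, vectors):
    """Equation (1) for one block: p = tanh(W [x_1; ...; x_{b^2}])."""
    stacked = [value for x in vectors for value in x]  # length b^2 * K
    return [math.tanh(sum(w * s for w, s in zip(row, stacked))) for row in W]


def block_layer(X, W, b):
    """Merge every nonoverlapping b x b block of an r x r grid of K-vectors."""
    r = len(X)
    assert r % b == 0, "r must be a multiple of b"
    merged = []
    for i in range(0, r, b):
        row = []
        for j in range(0, r, b):
            block = [X[i + di][j + dj] for di in range(b) for dj in range(b)]
            row.append(merge_block(W, block))
        merged.append(row)
    return merged


def mb_rnn(X, W, b):
    """Apply the same tied weights layer by layer until one parent vector remains."""
    while len(X) > 1:
        X = block_layer(X, W, b)
    return X[0][0]


def ob_rnn(X, W):
    """One-layer variant: a single merge with block size b = r."""
    return block_layer(X, W, len(X))[0][0]


# Toy sizes (hypothetical): K = 3 filters, 4 x 4 feature maps, 2 x 2 blocks.
rng = random.Random(0)
K, r, b = 3, 4, 2
X = [[[rng.random() for _ in range(K)] for _ in range(r)] for _ in range(r)]
p_multi = mb_rnn(X, random_weights(K, b * b * K, rng), b)  # two layers: 4x4 -> 2x2 -> 1x1
p_one = ob_rnn(X, random_weights(K, r * r * K, rng))       # one layer:  4x4 -> 1x1
```

Both calls return a K-dimensional vector. The multilayer version needs a weight matrix of size K × b²K and requires r to be a power of b, while the one-layer version needs a larger K × r²K matrix but places no such restriction on r.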

2.2.3. One-Layer Block RNN with Random Weights. Generally, training an RNN and learning its weights require backpropagation through the structure [24]. However, even with random weights, RNN architectures can be inherently frequency selective and translation invariant [30]. In addition, RNNs with random weights can also produce high-quality feature vectors for multiclass object classification tasks [23]. Therefore, the OB-RNN with randomly initialized weights is used to produce feature representation in our work. We forward propagate through all N OB-RNNs and concatenate their outputs to produce an NK-dimensional feature vector, which is then passed to the following feature fusion. The above procedure is applied to each layer of the VGG-f model for both VIS and IR images. The role of OB-RNN in the process is twofold. First, like the MB-RNN architecture, it transforms features into a lower-dimensional space and improves classification performance, and it is a random-weight-based architecture that does not require backpropagation. Second, because it uses one layer instead of multiple layers, it allows more flexibility for the features extracted from the pretrained CNN model and runs faster than MB-RNN. Meanwhile, OB-RNN does not degrade performance relative to MB-RNN.
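The forward pass through N random-weight OB-RNNs can be sketched as follows. This is an illustrative plain-Python sketch under stated assumptions, not the authors' code: the function names `ob_rnn_bank` and `reshape_fc`, the uniform weight scaling, the seed handling, and the row-major order used to reshape a fully connected output are all hypothetical choices, since the text does not specify them. The feature is stored channel-first as K × r × r.

```python
import math
import random


def reshape_fc(vec, K, r):
    """Reshape a fully connected output of length K*r*r into K x r x r.

    Row-major order is an assumption; the source only gives the target shape.
    """
    assert len(vec) == K * r * r
    return [[[vec[(k * r + i) * r + j] for j in range(r)] for i in range(r)]
            for k in range(K)]


def ob_rnn_bank(X, n_rnns, seed=0):
    """Forward X (K x r x r) through n_rnns OB-RNNs with random weights.

    Each OB-RNN owns a random K x (r*r*K) weight matrix and maps X to a
    K-vector via tanh; the outputs are concatenated into n_rnns * K values.
    No training or backpropagation is involved.
    """
    K, r = len(X), len(X[0])
    # Stack the r*r column vectors x_1, ..., x_{r^2}, each of length K.
    flat = [X[k][i][j] for i in range(r) for j in range(r) for k in range(K)]
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(len(flat))  # assumed scaling, not from the paper
    features = []
    for _ in range(n_rnns):
        for _ in range(K):
            row = [rng.uniform(-scale, scale) for _ in flat]
            features.append(math.tanh(sum(w * x for w, x in zip(row, flat))))
    return features


# Toy stand-in for a fully connected output (hypothetical sizes K = 8, r = 4).
fc = [math.sin(t) for t in range(8 * 4 * 4)]
X = reshape_fc(fc, K=8, r=4)
feat = ob_rnn_bank(X, n_rnns=4, seed=1)  # 4 * 8 = 32 values
```

With the sizes used in the paper, the same call pattern would reshape a 4096-dimensional F6 output to 256 × 4 × 4 and return N × 256 values. Fixing the seed keeps the random weights identical across images, which is needed for the features of different images to be comparable.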

Feature Fusion and Classification.
The learned feature representations at different layers are fused by concatenation for each band image, and the final concatenated feature vectors of the VIS and IR images are fused by concatenation or summation. Concatenation and summation are the two most common vector fusion methods and are often used to fuse the features of multimodal or dual-band data [31]. The fusion goal is to integrate two feature vectors $F^{VIS}$ and $F^{IR}$ into a fused feature vector $F^{F}$, where $F^{VIS}, F^{IR} \in \mathbb{R}^{D}$ denote the feature vectors of the VIS and IR images, respectively, and $D$ is the dimension of a feature vector.
Concatenation directly concatenates two feature vectors, which can be defined as

$$F^{F}_{d} = F^{VIS}_{d}, \qquad F^{F}_{D+d} = F^{IR}_{d}, \qquad (2)$$

where $F^{VIS}_{d}$, $F^{IR}_{d}$, and $F^{F}_{d}$ represent the $d$-th value of $F^{VIS}$, $F^{IR}$, and $F^{F}$, respectively, $F^{F}_{D+d}$ is the $(D+d)$-th value of $F^{F}$, $1 \le d \le D$, and $F^{F} \in \mathbb{R}^{2D}$. This fusion method stacks the dimensions of the two input feature vectors.
Summation is a simple addition of the corresponding dimensions of two feature vectors, which can be defined as

$$F^{F}_{d} = F^{VIS}_{d} + F^{IR}_{d}, \qquad (3)$$

where $F^{VIS}_{d}$, $F^{IR}_{d}$, and $F^{F}_{d}$ represent the $d$-th value of $F^{VIS}$, $F^{IR}$, and $F^{F}$, respectively, $1 \le d \le D$, and $F^{VIS}, F^{IR}, F^{F} \in \mathbb{R}^{D}$. The dimension of the fused feature vector is the same as that of the input feature vectors.
Concatenation can combine two feature vectors of any dimensions, but when the two feature vectors have the same dimension, it generates a feature vector with twice the dimension of that produced by summation. In our work, the input feature vectors of concatenation have the same dimension for both single-band and dual-band images. After feature fusion, the final feature representation of the original input dual-band ship data is given to a linear SVM classifier to perform the ship classification task.
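The two fusion rules described above can be written out directly. The following is a small plain-Python sketch; the function names `fuse_concat` and `fuse_sum` and the toy three-dimensional vectors are illustrative assumptions, not part of the original method description.

```python
def fuse_concat(f_vis, f_ir):
    """Concatenation: the first D values come from VIS, the next D from IR."""
    return list(f_vis) + list(f_ir)


def fuse_sum(f_vis, f_ir):
    """Summation: add corresponding dimensions; both inputs must have length D."""
    if len(f_vis) != len(f_ir):
        raise ValueError("summation needs feature vectors of equal dimension")
    return [v + i for v, i in zip(f_vis, f_ir)]


# Toy feature vectors with D = 3 (hypothetical values).
f_vis = [0.5, -1.0, 2.0]
f_ir = [1.5, 1.0, -2.0]
fused_con = fuse_concat(f_vis, f_ir)  # length 2D = 6
fused_sum = fuse_sum(f_vis, f_ir)     # length D = 3
```

The example also makes the trade-off visible: summation keeps the classifier input at D values but lets opposite-signed responses cancel, whereas concatenation preserves both vectors at the cost of a 2D-dimensional input.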

Results
3.1. Dataset. We investigate nine fusion models in the proposed fusion architecture on the publicly available VAIS dataset [21], which is the only existing public database of paired VIS and IR ship imagery. The dataset contains 2865 images (1623 VIS images and 1242 IR images), of which 1088 are unregistered "VIS-IR" image pairs, and includes 6 categories: cargo ships, medium-other ships, passenger ships, sailing ships, small boats, and tug boats. The images were captured at different distances and at various times of day, including dusk and dawn. Therefore, some images are high-resolution, while others appear dim and are hard to recognize even with manual inspection. In the dataset, the paired VIS/IR image set is partitioned into 539 image pairs for training and 549 image pairs for testing. A sample pair from VAIS is illustrated in Figure 1. Following the baseline method [21], the same training data and test data are used.

Implementation Platform and Details.
Our processing platform is a standard personal computer running Ubuntu 16.04, with an Intel Core i7-7700K CPU (4.20 GHz), 16 GB of random access memory, and an NVIDIA GTX 1080 Ti GPU. The computation environment is MATLAB R2017a with the MatConvNet [32] toolbox for CNN computation and the Liblinear [33] toolbox for classification. Because the pretrained VGG-f model expects a 224 × 224 three-channel VIS image as input, we simply duplicate each IR image into three channels. Both VIS and IR images are resized to 224 × 224 using nearest-neighbor interpolation. The pretrained VGG-f model is fine-tuned on the training images of the VAIS dataset with stochastic gradient descent. The number of epochs is set to 50 for VIS images and 100 for IR images, the learning rate is set to 0.001, and the batch size is set to 32. To avoid over-fitting, a dropout layer with a rate of 0.5 is applied after the second fully connected layer. In addition, because the OB-RNNs use random weights, the classification accuracy fluctuates slightly between runs of the same procedure. Therefore, we use the mean per-class classification accuracy as the evaluation metric for each run, repeat the same procedure 50 times, and report the average accuracy together with the standard deviation over the 50 runs as the final evaluation.
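The evaluation protocol described above can be illustrated with a short sketch. It is written in plain Python rather than the MATLAB environment used in the experiments, and the function names, the toy labels, and the choice of the sample standard deviation (the source does not say whether the sample or population form is used) are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def mean_per_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies so that every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return mean(correct[c] / total[c] for c in total)


def summarize_runs(accuracies):
    """Mean and sample standard deviation over repeated runs (assumed form)."""
    return mean(accuracies), stdev(accuracies)


# Toy labels (hypothetical): cargo 1/2, tug 2/3, and sailing 1/1 correct.
y_true = ["cargo", "cargo", "tug", "tug", "tug", "sailing"]
y_pred = ["cargo", "tug", "tug", "tug", "cargo", "sailing"]
mpca = mean_per_class_accuracy(y_true, y_pred)

# Toy accuracies from three runs; the paper repeats the procedure 50 times.
avg, std = summarize_runs([0.88, 0.90, 0.89])
```

In this toy case the mean per-class accuracy is (1/2 + 2/3 + 1)/3, which differs from the overall accuracy of 4/6 because the classes have different numbers of samples. This is why the metric suits an imbalanced dataset.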

Experimental Results and Analysis
3.3.1. Performance Analysis of Feature Extractors. Firstly, we fine-tune the pretrained VGG-f model on the VIS and IR training images of VAIS, respectively. In our earlier experiments, data augmentation and dropout regularization were used to avoid over-fitting while fine-tuning the VGG-f model. However, data augmentation did not achieve good performance, even together with dropout. Fortunately, using only dropout to fine-tune the model gives satisfactory results on VIS images, but not always good performance on IR images. Figure 3 shows the accuracy and loss curves of fine-tuning the VGG-f model. The training accuracy and loss curves behave well for both VIS and IR images, but the test accuracy and loss curves behave differently. As shown in Figures 3(a) and 3(b), the test accuracy and loss curves are stable after 20 epochs on VIS images. However, for the results on IR images shown in Figure 3(d), the test loss decreases before epoch 20, increases between epochs 20 and 60, and then gradually stabilizes. The test accuracy increases until it stabilizes after epoch 60. Comparing the training and test loss curves, we find that fine-tuning on IR images suffers from over-fitting. The main reason may be that IR images have low resolution, and some of them are too blurry. Meanwhile, the VGG-f model is trained on the ImageNet dataset, in which all images are VIS images. After fine-tuning the model several times, we observed that the test accuracy on VIS and IR images is about 85.0% and 63.0%, respectively. Secondly, we take the pretrained and fine-tuned VGG-f models as the feature extractors for dual-band images and investigate the influence of the original features produced by the feature extractors for each band on classification performance. Because of the over-fitting problem when fine-tuning the VGG-f model on IR training images, the model fine-tuned on IR images is not used as a feature extractor.
For convenience, the pretrained VGG-f models without fine-tuning and with fine-tuning on VIS training images are abbreviated as NO-FT and FT-VIS, respectively. As shown in Figure 4(a), the results of FT-VIS are better than those of NO-FT for VIS testing images. From Figure 4(b), we find a great change by comparing the classification accuracy of the two feature extractors. Fine-tuning does not improve the classification performance on the low-level layers, which capture features such as color, corners, and line segments, but obviously improves the accuracy of the last three layers. According to the above analysis, fine-tuning the pretrained VGG-f model on VIS training images can learn ship semantic information. Therefore, we investigate three combinations of the two feature extractors in our feature fusion architecture, as shown in Table 1.
Firstly, we evaluate the influence of the number of RNNs. As shown in Figure 5, increasing the number of RNNs improves the classification accuracy, which levels off at around 32. However, the larger the number of RNNs, the more time it takes to learn. Therefore, to keep the feature size the same for every layer, the number of RNNs is set to 32 for each layer, except for the C1 layer, where it is set to 128. In addition, Figures 5(a) and 5(b) show the influence of RNNs on the feature extractors, especially for IR test images. The results of Combination 2 and Combination 3 on IR images [see the lines with circles and triangles in Figure 5(b)], which use the FT-VIS feature extractor, are better than that of Combination 1, which uses the NO-FT feature extractor. Furthermore, comparing the four panels of Figure 5 using the red and blue dotted lines as reference, we observe that feature fusion with RNNs on dual-band images improves on the classification performance of each single band, regardless of the number of RNNs. Meanwhile, the classification accuracies of Combination 2 and Combination 3 on dual-band images are higher than that of Combination 1.
Table 2 (recovered row). Feature size with RNNs: 8,192 for each of the seven layers.
Secondly, we evaluate the classification performance of the three combinations without RNNs and with RNNs on a single layer. Table 2 gives the feature size of each layer without and with RNNs. Feature size affects classification accuracy and efficiency. The smaller the feature, the faster the SVM classifier runs. Figure 6 shows the classification accuracy of the three combinations without and with RNNs on each layer. Comparing Figures 6(a) and 6(b), it can be found that the classification accuracies of the three combinations with RNNs are better than those without RNNs on the last three layers.
Table 3 shows the classification accuracy of the three combinations on two and three layers of the VGG-f model, and the results of single F6 layer feature fusion are shown for comparison. From Table 3, we make three observations. Firstly, multilayer feature fusion improves the classification performance on VIS images, IR images, and dual-band images, and in particular outperforms a single layer by 1.1% to 2.3% for VIS images and by 0.8% to 2.0% for dual-band images. Secondly, the accuracy on dual-band images is higher than that on VIS images by about 2.3%, and three-layer feature fusion performs better than two-layer feature fusion by about 0.3%. Thirdly, the results of the concatenation feature fusion method are almost always higher than those of summation, by 0.2% to 0.3%. However, the feature size of concatenation is twice that of summation. Therefore, as the number of combined layers increases, the summation fusion method runs faster than concatenation.
Table 3 (notes). Accuracy is evaluated using the average accuracy together with the standard deviation over 50 runs. CON and SUM represent the concatenation and summation feature fusion methods, respectively. The abbreviation C2F6 indicates that the C2 layer and F6 layer features of each band image are concatenated, and likewise for the others. Bold denotes the best average accuracy in the corresponding column of the table.
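The feature sizes behind the speed comparison can be worked out explicitly. The following is a sketch under stated assumptions rather than figures reported in the paper: it takes the per-layer feature size of 8,192 with RNNs, uses $N = 32$ RNNs with $K = 256$ (the filter bank size fixed for F6 and F7), infers $K = 64$ for the C1 layer from $8192 / 128$, and assumes that layer features are concatenated per band before the dual-band fusion. The symbol $L$ (number of fused layers) is introduced here for illustration only.

```latex
\begin{align*}
\text{per layer:}\quad & N K = 32 \times 256 = 8192, \qquad \text{C1: } 128 \times 64 = 8192,\\
\text{per band, } L \text{ layers:}\quad & D = 8192\,L \quad (L = 2:\ 16384, \quad L = 3:\ 24576),\\
\text{dual-band CON:}\quad & 2D \quad (L = 2:\ 32768, \quad L = 3:\ 49152),\\
\text{dual-band SUM:}\quad & D.
\end{align*}
```

Under these assumptions, each added layer widens the concatenated dual-band vector by 16,384 values but the summed vector by only 8,192, which is consistent with summation running faster as more layers are combined.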

Comparison with Other State-of-the-Art Methods.
We compare the best of our fusion methods with seven methods on the VAIS dataset: (1) the baseline method (CNN + gnostic field) [21], (2) Multimodal CNN [15], (3) DyFusion [22], (4) SF-SRDA [4], (5) MFL (feature-level) + ELM [34], (6) CNN + Gabor + MS-CLBP [19], and (7) ME-CNN [20]. The first four methods use paired images, and the last three methods use the VIS images of the paired images. Table 4 shows the experimental results. As shown, CNN + Gabor + MS-CLBP obtains the best classification performance on VIS images, and SF-SRDA achieves the highest classification accuracy on IR images. The proposed method performs better than the other methods on dual-band images and achieves the best classification accuracy of 89.4%, outperforming the current best method (DyFusion) by 1.2%. This also shows that the proposed method is more suitable for dual-band ship classification than for single-band classification. Figure 7 shows the confusion matrices of the classification results of Combination 3 for one run. In the experiments with Combination 2 and Combination 3, all categories except medium-other and tug are above 90% accuracy, and the sailing ship category sometimes reaches 100% accuracy. However, the classification accuracies of medium-other ships and tug boats are always less than 80%. We found that medium-other ships and tug boats are often confused with small boats.

Discussions
The proposed method exploits a pretrained or fine-tuned VGG-f model to extract image features, and it is suitable for small-scale datasets with few data samples. The OB-RNN can flexibly handle the convolutional features of any layer produced by most well-known pretrained CNN models. The OB-RNNs reduce the dimension of convolutional features to avoid the "curse of dimensionality" caused by the fusion of low-level, middle-level, and high-level convolutional features. The fused multilayer convolutional feature includes richer information and has stronger feature representation ability than any single-layer convolutional feature. Moreover, there is great potential for further improvement of the proposed method. One potential factor is that the VGG-f model we used can be replaced by other well-known pretrained CNN models such as VGG-16, ResNet, and GoogLeNet. Besides, training the OB-RNNs can further improve the feature representation ability and classification accuracy. In our method, the simple fusion strategies of concatenation and summation are used to fuse the features of dual-band images. Therefore, projecting the features of dual-band images into a feature space to learn a common feature representation is also a future direction.

Conclusions
To cope with the small number of annotated dual-band samples, we propose a multilayer convolutional feature fusion method to recognize maritime ship categories. Fine-tuning the pretrained VGG-f model on VIS images captures specific ship information and improves classification accuracy. The improved RNN with random weights reduces the convolutional feature dimension and learns more feature representations as the number of RNNs increases. The low-level, middle-level, and high-level convolutional features are concatenated to provide complementary information and improve classification performance. Experimental results on the public VAIS dataset demonstrate that the best multilayer feature fusion performs better than other existing methods and confirm that our method is more suitable for dual-band ship classification than for single-band classification. We will focus on decision-level fusion in the future.

Data Availability
The Excel data used to support the findings of this study are included within the supplementary information file (available here).
VIS and IR images. "Figure 6(a)" in our manuscript is formed by these "Mean" values for "CON" in the worksheet "Fig.6(a)."
(4) The worksheet "Fig.6(b)" shows the classification accuracy of CNN feature fusion with RNNs (that is, With RNNs) on a single layer of the three combinations in VIS and IR images. "Figure 6(b)" in our manuscript is formed by these "Mean" values for "CON" in the worksheet "Fig.6(b)."
(5) The worksheet "Table3" shows the classification accuracy of two/three-layer feature fusion with RNNs of the three combinations in VIS and IR images. "Table 3" in our manuscript is based on these "Mean + Std" values for "Combination 2" and "Combination 3" in the worksheet "Table3."
(6) The worksheet "FeatureFusion-all" shows the classification accuracy of single-, two-, and three-layer feature fusion with RNNs of the three combinations in VIS and IR images.
(Supplementary Materials).