A Two-Stream Deep Fusion Framework for High-Resolution Aerial Scene Classification

One of the challenging problems in understanding high-resolution remote sensing images is aerial scene classification. A well-designed feature representation method and classifier can improve classification accuracy. In this paper, we construct a new two-stream deep architecture for aerial scene classification. First, we use two pretrained convolutional neural networks (CNNs) as feature extractors to learn deep features from the original aerial image and from the aerial image processed through saliency detection, respectively. Second, two feature fusion strategies are adopted to fuse the two different types of deep convolutional features extracted by the original RGB stream and the saliency stream. Finally, we use the extreme learning machine (ELM) classifier for final classification with the fused features. The effectiveness of the proposed architecture is tested on four challenging datasets: the UC-Merced dataset with 21 scene categories, the WHU-RS dataset with 19 scene categories, the AID dataset with 30 scene categories, and the NWPU-RESISC45 dataset with 45 scene categories. The experimental results demonstrate that our architecture achieves a significant improvement in classification accuracy over all state-of-the-art reference methods.


Introduction
Aerial scene classification is a key problem in aerial image understanding; it aims to automatically assign a semantic label to each aerial image in order to know which category it belongs to [1,2]. Aerial scene classification has important application value in military and civil areas such as disaster monitoring, weapon guidance, and traffic supervision [3,4]. Aerial images not only have rich spatial and texture features but also contain a large amount of scene semantic information. However, since the composition of the scene is complicated, it is difficult to obtain the scene information of interest directly from the massive image data [5,6].
In order to understand and identify the scene information in aerial images, many scene classification methods have been proposed; they can generally be divided into two categories: methods with low-level scene features and methods with midlevel scene features. The commonly used low-level methods include Scale Invariant Feature Transform (SIFT) [7], Local Binary Pattern (LBP) [8], Color Histogram (CH) [9], and GIST [10]. The midlevel methods represent a scene by coding the low-level local feature descriptors. The midlevel coding methods include Bag of Visual Words (BoVW) [11], Spatial Pyramid Matching (SPM) [12], Locality-Constrained Linear Coding (LLC) [13], Probabilistic Latent Semantic Analysis (PLSA) [14], Latent Dirichlet Allocation (LDA) [15], Improved Fisher Kernel (IFK) [16], and Vector of Locally Aggregated Descriptors (VLAD) [17].
In recent years, deep learning methods have achieved breakthroughs in computer vision tasks such as image classification, object recognition, and face recognition [18][19][20]. The convolutional neural network (CNN) is one of the most successful deep learning algorithms. Recently, CNN models such as CaffeNet [21] and GoogLeNet [22] have achieved better performance on aerial scene classification than low-level and midlevel methods.
A typical CNN architecture usually contains many layers that automatically extract useful features and uses logistic regression for classification. However, this classifier cannot reach a satisfactory prediction performance. To solve this problem, CNN-SVM [23] was proposed. This architecture is a combination of CNN and support vector machine (SVM), which uses a pretrained CNN as feature extractor and an SVM as classifier. Inspired by its success, some new combination architectures were proposed, such as CNN-BPR [24].
Extreme learning machine (ELM) is a learning algorithm based on the single-hidden-layer feedforward neural network (SLFN) [25]. According to its creators, this model is able to produce good generalization performance and learn thousands of times faster than networks trained using backpropagation. It is also shown in [26] that the ELM can outperform SVM. In [27], the authors confirmed that CNN-ELM outperforms CNN-SVM in high-resolution aerial scene classification. Therefore, ELM with CNN-learned features can be expected to perform well.
In this paper, we propose a new aerial scene classification framework that combines the fused deep convolutional features learned by CNNs with the ELM classifier. First, two pretrained CNNs are used as feature extractors to learn deep features from the original aerial image and from the aerial image processed through saliency detection, respectively. Second, the two sets of features extracted by the original RGB stream and the saliency stream are fused into one set of features. Finally, the ELM classifier is used for final classification with the fused features. Experimental results on four datasets illustrate that the proposed architecture outperforms the state-of-the-art methods.
The contributions of this paper are summarized as follows.
(1) We employ a two-stream deep architecture to extract features from the original aerial image and the processed aerial image through saliency detection, respectively. Thus, we can get two different types of deep convolutional features which contain the appearance information and prominent information.
(2) To the best of our knowledge, this is the first work to fuse these two different types of deep convolutional features extracted by the original RGB stream and the saliency stream, which yields a good representation of the aerial images.
(3) We use the extreme learning machine as a classifier for final classification with the fused features.
The rest of this paper is organized as follows. Section 2 introduces the related works including convolutional neural networks and extreme learning machine. Section 3 describes the proposed two-stream deep fusion architecture in detail. Section 4 evaluates the performance of the proposed architecture on four different benchmark datasets and makes comparisons with several state-of-the-art methods. The conclusions are drawn in Section 5.

Related Works
2.1. Convolutional Neural Networks. As a branch of machine learning, deep learning is a computational model consisting of multiple processing layers. Much attention has been paid to deep learning for its great breakthroughs in fields including image classification, voice understanding, and video analysis.
The deep convolutional neural network is an important algorithm in the field of deep learning. It is based on the classical convolutional neural network devised by LeCun [28].
In general, DCNN (deep convolutional neural network) consists of two major parts (see Figure 1). The first part is feature extraction, which contains alternating convolutional and pooling layers. A convolutional layer consists of two sublayers: convolutional filter layer and feature mapping layer. Descriptions of the layers are given as follows.
(1) Convolutional Filter Layer. Convolution is a linear operation. Using this layer for feature extraction achieves noise reduction and feature enhancement. Local features are extracted by connecting the input of each neuron to a local receptive field of the previous layer. Assume the input image is a two-dimensional image of size $N \times N$; convolving it with a trainable filter set of size $m \times m$ yields an output of size $((N - m)/s + 1) \times ((N - m)/s + 1)$:

$$y_j = \sum_i k_{ij} * x_i + b_j, \quad (1)$$

where $*$ denotes the convolution operation, $x_i$ denotes the input of the convolutional layer, $k_{ij}$ is the parameter of the convolutional kernel, $b_j$ is the bias, and $s$ represents the step length (stride); each filter is related to a certain feature.
(2) Feature Mapping Layer. A nonlinear activation function is applied to the results obtained from the filter layer, thus generating the feature map $a_j$:

$$a_j = f(y_j), \quad (2)$$

where $f$ is a nonlinear activation function. Traditional activation functions include tanh, sigmoid, and softplus. ReLU (Rectified Linear Units) is the closest one to the activation model of a stimulated biological neuron and has therefore gradually been adopted as the activation function of neural networks.
(3) Pooling Layer. This layer is used to eliminate redundant data. After dividing the feature map into $p \times p$ nonintersecting areas, pooled features of size $\{((N - m)/s + 1)/p\} \times \{((N - m)/s + 1)/p\}$ are obtained from the statistical mean value (or maximum value) of the separate regions. The dimension of the features is greatly reduced by the pooling procedure, which helps avoid overfitting and makes the model more robust.
Acting as a combined effort to extract features of the input image, convolutional filter layer, feature mapping layer, and pooling layer are considered as one layer in the DCNN. After several layers of convolution and pooling, the input image is represented by some learned features.
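The size bookkeeping for one convolution, activation, and pooling stage can be checked with a short NumPy sketch. It is an illustration under stated assumptions rather than the implementation used in this paper: the input size (28), filter size (5), stride (1), and pooling size (2) are arbitrary example values, a single filter and a single input channel are used, and, as in most CNN libraries, the filter is applied without flipping.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0, stride=1):
    """Slide one square filter over a square image without padding.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    n, m = image.shape[0], kernel.shape[0]
    out = (n - m) // stride + 1  # output side length: (N - m)/s + 1
    result = np.empty((out, out))
    for r in range(out):
        for c in range(out):
            patch = image[r * stride:r * stride + m, c * stride:c * stride + m]
            result[r, c] = np.sum(patch * kernel) + bias
    return result

def relu(x):
    """Rectified linear activation applied elementwise."""
    return np.maximum(x, 0.0)

def max_pool(feature_map, p):
    """Maximum over nonintersecting p x p regions (leftover rows/columns are dropped)."""
    out = feature_map.shape[0] // p
    trimmed = feature_map[:out * p, :out * p]
    return trimmed.reshape(out, p, out, p).max(axis=(1, 3))

# Example values only: 28 x 28 input, 5 x 5 filter, stride 1, 2 x 2 pooling.
rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))
kernel = rng.standard_normal((5, 5))

conv_out = conv2d_valid(image, kernel, bias=0.1, stride=1)
pooled = max_pool(relu(conv_out), p=2)
print(conv_out.shape, pooled.shape)  # (24, 24) (12, 12)
```

With these values the convolution output has side (28 - 5)/1 + 1 = 24 and the pooled map has side 24/2 = 12, matching the size expressions given for the filter and pooling layers.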
The second part is the classifier. The learned features are fed into a logistic regression classifier, which uses softmax as its output-layer activation function.
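As a minimal sketch of this classifier stage, the softmax output layer can be written in a few lines of NumPy. The feature dimension (8), number of classes (3), and random weights below are placeholders chosen for illustration; in a trained network the weights would be learned.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict_proba(features, weights, bias):
    """Multiclass logistic regression: linear scores followed by softmax."""
    return softmax(features @ weights + bias)

# Placeholder shapes: 4 samples, 8 learned features, 3 classes.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8))
weights = rng.standard_normal((8, 3))
bias = np.zeros(3)

probs = predict_proba(features, weights, bias)
labels = probs.argmax(axis=1)  # predicted class = highest probability
print(probs.sum(axis=1))       # each row sums to 1
```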
The network parameters are trained by BP (backpropagation) algorithm [29] with SGD (Stochastic Gradient Descent). Dropout strategy [30] is applied to avoid overfitting and enhance the generalization ability of the networks. The dropout strategy is usually used in fully connected layers.

Extreme Learning Machine.
Extreme learning machine consists of three layers: input layer, hidden layer, and output layer. The structure of the ELM is shown in Figure 2. Consider $N$ different samples $(x_i, t_i)$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T$ denotes the $i$th sample and $t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T$ denotes the actual label of the $i$th sample. The number of input nodes is the dimension of each sample; the number of output nodes is the total number of categories. Given $L$ hidden nodes and activation function $g(\cdot)$, there must exist a set of parameters $w_j$, $\beta_j$, and $b_j$ with which the network approximates these $N$ samples:

$$\sum_{j=1}^{L} \beta_j \, g(w_j \cdot x_i + b_j) = t_i, \quad i = 1, 2, \ldots, N, \quad (3)$$

where $w_j = [w_{j1}, w_{j2}, \ldots, w_{jn}]^T$ is the weight vector that connects the $j$th hidden node with the input nodes, $\beta_j = [\beta_{j1}, \beta_{j2}, \ldots, \beta_{jm}]^T$ is the weight vector that connects the $j$th hidden node with the output nodes, and $b_j$ is the bias of the $j$th hidden node. Equation (3) can be simplified as the matrix form

$$H\beta = T, \quad (4)$$

where $H$ is the $N \times L$ output matrix of the hidden layer, with entries $H_{ij} = g(w_j \cdot x_i + b_j)$, so that the $j$th column of $H$ is the output of the $j$th hidden node with respect to the input samples $x_1, x_2, \ldots, x_N$.
In the ELM algorithm, the input weights and the hidden layer biases of the SLFN need not be adjusted at all and can be assigned arbitrarily. With the input weights and the hidden layer biases fixed, we just need to find a least-squares solution $\hat{\beta}$ of the linear system $H\beta = T$:

$$\| H\hat{\beta} - T \| = \min_{\beta} \| H\beta - T \|. \quad (5)$$

The minimum norm least-squares solution of the linear system is

$$\hat{\beta} = H^{\dagger} T, \quad (6)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $H$.
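The training procedure just described (random, fixed input weights and biases, followed by a pseudoinverse solve for the output weights) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the class name `ELM`, the sigmoid activation, the uniform initialization range, the 20 hidden nodes, and the synthetic two-class data are all assumptions made for the example.

```python
import numpy as np

class ELM:
    """Single-hidden-layer network with random hidden parameters (illustrative)."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Hidden-layer output matrix: one row per sample, one column per hidden node.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        # Input weights and biases are drawn once and never adjusted.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._hidden(X)
        # Minimum-norm least-squares output weights via the Moore-Penrose pseudoinverse.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta).argmax(axis=1)

# Synthetic toy data (assumption): two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
               rng.normal(2.0, 0.5, size=(100, 2))])
y = np.repeat([0, 1], 100)
T = np.eye(2)[y]  # one-hot targets, one column per category

elm = ELM(n_hidden=20).fit(X, T)
train_accuracy = np.mean(elm.predict(X) == y)
print(train_accuracy)
```

Because only a linear system is solved, training involves no iterative gradient updates, which is the source of the speed advantage claimed for ELM over backpropagation-trained networks.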

Proposed Architecture
In this section, we propose an effective and efficient two-stream deep fusion architecture for aerial scene classification. The first stream is called the original RGB stream, which captures appearance information by using the original RGB images as input to the network. The second stream is called the saliency stream, which captures prominent information by using the images processed through saliency detection as input to the network. This two-stream framework uses two identical deep convolutional neural networks as feature extractors to describe the original aerial image and the aerial image processed through saliency detection, respectively. Then, we use two well-known strategies to fuse the two extracted sets of features. Finally, the fused features are fed into the ELM classifier for aerial scene classification. The overall framework of our proposed method is shown in Figure 3. As described in Figure 3, our proposed architecture includes the following four parts.
(1) Preprocessing the aerial images based on unsupervised saliency detection.
(2) Using the original RGB stream and the saliency stream to extract features from the two kinds of aerial image. These two streams use deep convolutional neural networks to extract features.
(3) Fusing the extracted two sets of features.
(4) Using the ELM classifier for aerial scene classification.

Saliency Detection.
When facing a visual scene, the human visual system is capable of quickly focusing on distinctive visual regions and ignoring plain ones. This selective visual attention mechanism helps human beings observe, think, and make decisions quickly and efficiently. The saliency detection model [47], which emulates human visual attention, can make our architecture more intelligent. By use of saliency detection, we can get more informative features which could dominate the category of the image. However, saliency detection is not suitable for all aerial images. Thus, we adopt the fusion model, which can make good use of the strengths of both streams. The saliency detection method includes two parts. One part is the global perspective, which captures a global distribution of visual properties. In this part, a visual vocabulary for the aerial scene is built. Each visual word serves as a single element in depicting the aerial scene. The representation form is the histogram of visual word occurrence, in which $\mathrm{frq}(w_k^f)$ indicates the frequency of occurrence of the visual word $w_k^f$, with $f \in F$ and $F = \{\text{color}, \text{texture}\}$.
Then, a weighting factor for each visual word $w_k^f$ is introduced according to the "repetition suppression principle." The other part of this method is the local perspective. The representation for patch $p_m$ ($p_m \in P$) is obtained using the histogram of visual word occurrence. Finally, the saliency value of patch $p_m$ is computed as in [47], using $\mathrm{frq}_m(w_k^f)$, the frequency of occurrence of the visual word $w_k^f$ in patch $p_m$, and $N_f$, the number of color and texture feature words.

Feature Extraction.
In recent years, CNN models have achieved higher classification accuracy than low-level and midlevel methods on aerial scene classification. These results indicate that the features extracted by CNNs are more typical and representative. Therefore, we select some of the most popular CNN models as feature extractors in our original RGB stream and saliency stream. The three selected CNN architectures are presented in Figure 4. We describe the characteristics of each model in the following part and specify the layer from which the features are taken for each model.

CaffeNet.
Caffe (Convolutional Architecture for Fast Feature Embedding) [21] is one of the most popular libraries for deep learning and is developed by the Berkeley Vision and Learning Center. CaffeNet, the reference network provided with Caffe, whose architecture can be seen in Figure 4(a), is almost a replication of AlexNet [48]. However, it is trained without data augmentation, and the order of its normalization and pooling layers is switched. The architecture of CaffeNet includes five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a softmax. In our architecture, we use CaffeNet as a feature extractor by extracting features from the second fully connected layer, which gives features of 4096 dimensions.

VGG-Net-16.
VGG-Net [49] achieves state-of-the-art accuracy on the ILSVRC classification and localization tasks. Due to the use of very small (3 × 3) convolution filters in all layers, the depth of the network can be increased easily by adding more convolutional layers. The authors give five configurations of VGG-Net, whose depth ranges from 11 to 19 weight layers. In our work, we use the VGG-Net-16 model, whose architecture can be seen in Figure 4(b). This network includes thirteen convolutional layers, five pooling layers, and three fully connected layers with a softmax. In our architecture, we use VGG-Net-16 as a feature extractor by extracting features from the second fully connected layer, which gives features of 4096 dimensions.

GoogLeNet.
GoogLeNet [22], proposed by Szegedy et al., is the 22-layer CNN architecture that won the ILSVRC14 competition. The architecture of this network can be seen in Figure 4(c). Its main characteristic is the use of inception modules, which are derived from the idea of "network in network." The inception modules give GoogLeNet two main advantages: (1) within an inception module, filters of different sizes are applied at the same layer, which captures more accurate multiscale spatial information; (2) the design of the module reduces the number of parameters of the network, which makes the network less prone to overfitting and allows it to be deeper. In fact, the 22-layer GoogLeNet, with more than 50 convolutional layers distributed inside the inception modules, has approximately five million parameters, about 12 times fewer than CaffeNet. In our architecture, we use GoogLeNet as a feature extractor by extracting features from the last pooling layer, which gives features of 1024 dimensions.

Features Fusion.
For the original aerial image and the aerial image processed through saliency detection, we use the CNN model pretrained on ImageNet to extract features from the specified layers in the original RGB stream and the saliency stream. The fused features, which contain rich information about the image scene, can contribute to the classification process. How to fuse the two different sets of features is therefore an important issue. Some methods have been proposed for feature fusion [50][51][52]. We select two classical methods for fusing the two different types of features, aiming to get more informative and significant features to represent the input image.
(1) Serial feature fusion strategy is just to concatenate the two sets of features. The dimension of the fused features is equal to the summation of the dimensions of the two sets of features.
(2) Parallel feature fusion strategy combines the two sets of features into one complex vector. Each input image generates two sets of features, denoted $f_1$ and $f_2$. The final fused feature representation is formulated as

$$F = f_1 + i f_2,$$

where $i$ is the imaginary unit.
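The two fusion strategies can be sketched in NumPy as below. This is a minimal illustration, not the authors' code: the function names `serial_fusion` and `parallel_fusion` are hypothetical, and the random arrays merely stand in for 4096-dimensional features from the two streams (the dimension reported for the second fully connected layer of CaffeNet and VGG-Net-16).

```python
import numpy as np

def serial_fusion(f1, f2):
    """Concatenate the two feature sets; output dimension is the sum of both."""
    return np.concatenate([f1, f2], axis=-1)

def parallel_fusion(f1, f2):
    """Combine the two feature sets as real and imaginary parts of one complex vector."""
    if f1.shape != f2.shape:
        raise ValueError("parallel fusion in this sketch needs equal-sized feature sets")
    return f1 + 1j * f2

# Random stand-ins for features of 5 images from each stream (not real CNN outputs).
rng = np.random.default_rng(0)
rgb_features = rng.standard_normal((5, 4096))
saliency_features = rng.standard_normal((5, 4096))

serial = serial_fusion(rgb_features, saliency_features)
parallel = parallel_fusion(rgb_features, saliency_features)
print(serial.shape, parallel.shape, parallel.dtype)  # (5, 8192) (5, 4096) complex128
```

Serial fusion doubles the feature dimension, whereas parallel fusion keeps the dimension of a single stream and stores the second stream in the imaginary part.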

Computational Intelligence and Neuroscience

Experiments and Analysis
We use an NVIDIA Titan X Pascal GPU (with 12 GB of memory) and a 2.0 GHz Intel Xeon E5-2683 v3 CPU in this experiment. The proposed architecture is tested on four different datasets. First, we describe the four datasets. Second, we give the experimental setup. Finally, the classification performance of the proposed architecture is compared with the state-of-the-art methods in the literature.

Datasets.
The first dataset is the well-known UC-Merced Land Use dataset [31], which consists of 2100 high-resolution remote sensing images of 21 classes. The size of each image scene is 256 × 256 pixels. The class samples are shown in Figure 5. There are some highly overlapped classes, such as "dense residential," "medium residential," and "sparse residential," which make this dataset difficult for classification. This dataset has been widely used to evaluate different aerial scene classification methods. For more information, visit http://vision.ucmerced.edu/datasets.
The second dataset is the WHU-RS dataset [53], which is collected from Google Earth imagery. It contains 950 high-spatial-resolution images of 600 × 600 pixels divided into 19 classes. The class samples are shown in Figure 6. The images in this dataset are collected from different regions all over the world, which creates more challenges because of the high diversity. This dataset has also been widely used to evaluate different aerial scene classification methods. For more information, visit http://dsp.whu.edu.cn/cn/staff/yw/HRSscene.html.
The third dataset is AID (a new large-scale aerial image dataset), which is collected from Google Earth imagery [41]. It contains 10000 images of 600 × 600 pixels in 30 classes. Compared with other remote sensing image datasets, the AID dataset has high intraclass variation, small interclass dissimilarity, and a relatively large scale. Figure 7 shows a sample image for each class included in this dataset.
The fourth dataset is NWPU-RESISC45 dataset, which contains 31500 images and covers 45 scene classes with 700 images in each class [46]. Figure 8 shows a sample image for each class included in this dataset. For more information, visit http://www.escience.cn/people/JunweiHan/NWPU-RE-SISC45.html. The AID dataset and the NWPU-RESISC45 dataset are more challenging datasets, which have been used for testing some high performance aerial scene classification methods.

Experimental Setup.
For feature extractor selection, we use CaffeNet, VGG-Net-16, and GoogLeNet in turn as the feature extractor. These three networks are all pretrained on ImageNet [54]. After that, we use the two fusion strategies to combine the extracted features. For classification, we use the extreme learning machine.
With regard to training set generation, we adopt two different settings for each dataset. For the UC-Merced dataset, the training ratios are set to 50% and 80%, respectively, and the remaining images are used for testing. For the WHU-RS dataset, the ratios are set to 40% and 60%, respectively. For the AID dataset, the ratios are set to 20% and 50%, respectively. For the NWPU-RESISC45 dataset, the ratios are fixed at 10% and 20%, respectively. Considering that a CNN requires a predefined size for the input image, all images are resized according to the size of the receptive field of the selected CNN model.
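A training/testing split at a given ratio can be sketched as follows. The text does not say how the training images are drawn, so sampling the same fraction at random from every class is an assumption of this sketch, as is the helper name `stratified_split`. The label vector mimics the UC-Merced layout of 2100 images in 21 classes, that is, 100 images per class.

```python
import numpy as np

def stratified_split(labels, train_ratio, seed=0):
    """Randomly pick train_ratio of each class for training; the rest is for testing."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_ratio * idx.size))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# UC-Merced-like labels: 21 classes with 100 images each.
labels = np.repeat(np.arange(21), 100)

train_idx, test_idx = stratified_split(labels, train_ratio=0.8)
print(len(train_idx), len(test_idx))  # 1680 420
```

Changing `seed` between repetitions gives a different random split for each evaluation run.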
In this paper, we use the overall accuracy to evaluate the methods. The evaluation procedure is repeated ten times for a reliable performance comparison. The final results are reported as the mean and standard deviation over the ten runs. For the sake of fair comparison, we do not compare against the results of fine-tuned networks, because our architectures only use the pretrained networks.
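The evaluation protocol can be sketched as below. Overall accuracy is taken here as the fraction of correctly classified test images; whether the reported standard deviation is the population or the sample form is not stated, so the use of the sample form (`ddof=1`) is an assumption. The labels and predictions are simulated placeholders with roughly 5% errors, not results from this paper.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies, in percent."""
    acc = 100.0 * np.asarray(accuracies, dtype=float)
    return acc.mean(), acc.std(ddof=1)

# Simulated runs (placeholder data): 420 test images, 21 classes, about 5% errors.
rng = np.random.default_rng(0)
run_accuracies = []
for run in range(10):
    y_true = rng.integers(0, 21, size=420)
    wrong = rng.random(420) < 0.05
    y_pred = np.where(wrong, (y_true + 1) % 21, y_true)
    run_accuracies.append(overall_accuracy(y_true, y_pred))

mean_oa, std_oa = summarize(run_accuracies)
print(f"{mean_oa:.2f} +/- {std_oa:.2f}")
```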
Figure 6: Class representatives of the WHU-RS dataset.

UC-Merced Dataset.
With regard to the UC-Merced dataset, we first analyze the influence of different feature extractors and fusion strategies on the classification performance. The experimental results are shown in Table 1. In Table 1, we can see that the two-stream architectures provide superior performance compared to the single CNNs without fusion, which illustrates that feature fusion helps the system increase its accuracy. With the same CNN feature extractor, the architectures based on the serial feature fusion strategy perform worse than those based on the parallel feature fusion strategy. At the same time, we also see that the features extracted by VGG-Net-16 are more representative and discriminative. On this dataset, our best classification accuracy rates are 96.97% and 98.02%, using 50% and 80% training ratios, respectively. These best results are achieved by the architecture that uses the VGG-Net-16 network and the parallel feature fusion strategy.
We also compare the proposed architecture against several state-of-the-art aerial scene classification methods on this dataset, as shown in Table 2. As we can see from Table 2, our architecture outperforms all other aerial scene classification methods, with an increase in overall accuracy of 1.08% and 0.60% over the second best model using 50% and 80% training ratios, respectively. The good performance of our method mainly benefits from the fusion of two different types of deep convolutional features and from the extreme learning machine.

WHU-RS Dataset.
On the WHU-RS dataset, to evaluate the influence of different feature extractors and fusion strategies on the classification performance, we run the same experiments as for the UC-Merced dataset. The results are shown in Table 3. The classification results in Table 3 again indicate that the parallel feature fusion strategy is better than the serial feature fusion strategy. With the 40% training ratio, VGG-Net-16 is the best feature extractor, while CaffeNet is the best one with the 60% training ratio. Table 4 shows the comparison of the classification accuracies between our proposed architecture and the other state-of-the-art methods. As we can see from Table 4, TEX-Net-LF and DCA by addition are the most competitive approaches. TEX-Net-LF is the method described in [43], which fuses the features obtained from the texture coded mapped image and the standard RGB image. DCA by addition is also a fusion method, which uses the outputs of the first and second fully connected layers of the network and employs DCA to fuse the two sets of features [44]. The final experimental results demonstrate that our architecture achieves a higher classification accuracy than the other state-of-the-art methods.

AID Dataset.
On the AID dataset, Table 5 shows the influence of different feature extractors and fusion strategies on the classification performance. As we can see from Table 5, the parallel feature fusion strategy is the best fusion method in our architecture. Moreover, using CaffeNet and VGG-Net-16 as feature extractors achieves competitive performance compared to GoogLeNet. Table 6 compares the classification performance of our architecture with the state-of-the-art methods. Our best architecture outperforms all other methods, with an increase in overall accuracy of 1.45% and 1.62% over the second best model using 20% and 50% training ratios, respectively.

NWPU-RESISC45 Dataset.
On the NWPU-RESISC45 dataset, Table 7 shows the influence of different feature extractors and fusion strategies on the classification performance. Table 8 compares the classification performance of our architecture with the state-of-the-art methods. Our best architecture uses CaffeNet as its feature extractor and employs the parallel feature fusion strategy, which achieves remarkable classification results.
From the classification results on all datasets, we can note that VGG-Net-16 and CaffeNet have similar performance, while GoogLeNet performs slightly worse. CaffeNet has only 8 layers, which is much simpler than VGG-Net-16 and GoogLeNet with 16 and 22 layers, respectively. This suggests that, in our setting, a simpler network can perform as well as or better than a deeper one. However, we should note that all networks we used are trained on ImageNet, whose images are all natural images. Thus, the deeper network (GoogLeNet) may be more closely adapted to natural images and therefore less well suited to processing aerial scenes.