Research Article Semantic Understandings for Aerial Images via Multigrained Feature Grouping

Aerial images play a key role in remote sensing as they can provide high-quality surface object information for continuous communication services. With advances in UAV-aided data collection technologies, the volume of aerial images has grown greatly. Semantic understanding of these images can therefore significantly improve the quality of service for smart devices. Recently, the multilabel aerial image classification (MAIC) task has been widely studied in academia and applied in industry. However, existing MAIC methods suffer from suboptimal performance because objects appear at different locations, sizes, and scales. To address these issues, we propose a novel multigrained semantic grouping model for aerial image learning, named MSGM. First, image features produced by the backbone are sent to spatial pyramid convolutional layers, which extract the instances in a parallel manner. Then, three grouping mechanisms are designed to integrate the instances from the pyramid framework. In addition, MSGM builds a concept graph to represent the label relationships and resorts to the graph convolutional network to learn the concept graph directly. We extensively evaluate MSGM on two benchmark aerial image datasets, the commonly used UCM dataset and the high-resolution DFC15 dataset. Quantitative and qualitative results support the effectiveness of the proposed MSGM.


Introduction
Great improvements in computer vision tasks have been achieved in the last few years, especially in image classification [1], object detection [2,3], semantic segmentation [4,5], and so on. In the field of remote sensing, vision-based sensors can integrate aerial images for continuous communication services. Thus, these devices are widely deployed in applications. With the advances in data collection technologies, huge amounts of aerial panoramic images are collected and available for academic research. How to automatically acquire semantic understandings of these aerial images is therefore of great significance. Recently, aerial panoramic image classification has provided an automatic solution to this kind of problem.
Traditional models map the given aerial image to a single semantic label, which is known as the single-label aerial image classification (SAIC) task [6]. However, aerial panoramic images with a higher spectral resolution can provide a much wider field of view and are thus associated with abundant content [7,8]. For example, in Figure 1, ship, water, and tree coexist in the given aerial image. For these aerial images, complex variations in viewpoint, scale, illumination, and occlusion make existing coarse-grained SAIC methods fail to learn semantic information sufficiently. As a result, multilabel aerial image classification (MAIC) methods have been proposed to deal with aerial panoramic image understanding tasks. For a given aerial image, MAIC methods aim to build a function that produces the multiple labels (objects) of interest inside it [9]. As an emerging research field, MAIC has attracted considerable attention from researchers and has been applied in many areas, such as air quality monitoring [10], urban management [11], and social community detection [12,13]. An increasing number of multilabel classification frameworks have been designed for aerial images from various angles.
However, existing MAIC methods still suffer from suboptimal performance, as it is hard to extract objects located in different areas and at different scales of aerial panoramic images. It is well recognized that aerial image features provide the most fundamental information in the label prediction process. Early models commonly employ pretrained convolutional neural networks (CNNs) as the backbone to learn image features. However, the direct use of these rough features limits effectiveness on aerial images, because the backbones are designed to extract a single core feature for the given image and cannot handle complicated multilabel image feature learning. Then, some MAIC methods extract semantic features from aerial images with object localization models. However, object localization models generate a large number of irrelevant, redundant proposals, accompanied by high computing costs [14]. Some other methods introduce recurrent neural networks (RNNs) to further build label correlations to guide the final prediction [15]. They learn dependencies sequentially and cannot exploit the content correlations, and thus would miss key information when predicting multiple labels.
To address these issues, our motivation is to design novel semantic understanding methods for the multilabel classification task on aerial panoramic images. Inspired by the recently proposed spatial pyramid convolutional (SPC) technologies [16], which learn multiscale representations of the given image in a parallel way, we design a multigrained semantic grouping framework for the MAIC task. To do so, we first leverage specifically designed SPC layers to generate multiple feature maps for the raw image. These feature maps are at various scales and contain some semantic instances. Then, we integrate these multigrained maps with three designed semantic grouping mechanisms. Besides, to capture the semantic features from label correlations, we build a concept graph to represent the relationships of labels and resort to the graph convolutional network (GCN) to extract the features of the label graph directly. The main contributions of this article are as follows:
(1) We propose a novel multigrained semantic grouping model for multilabel aerial image understanding, named MSGM. MSGM aims to provide comprehensive semantic classifications for aerial panoramic images by learning multigrained semantic features of images and the concept graph of labels.
(2) To extract more fine-grained features from the given aerial image, we design SPC layers with multiscale feature encoders in a parallel manner. Then, these multidimensional representations are organized into final aerial image features by three designed multigrained semantic grouping mechanisms. To capture self-adapted information of aerial label correlations, we build the label relationships into a concept graph and learn the concept graph structure directly by a designed GCN module with an attention mechanism.
(3) Extensive experiments from various angles are carried out to verify the performance of the proposed MSGM model on both the UCM and DFC15 aerial image datasets. The results demonstrate the effectiveness of MSGM not only in research but also in real-world applications.

Related Work.
In the existing remote sensing ecosystem, a large amount of data generated by sensors and devices is transferred [17,18]. Recently, with the advances in real-time data collection technologies, an increasing number of aerial panoramic images have been acquired for scientific research [19]. The MAIC task has become a fundamental problem in the aerial image learning field. Benefiting from machine learning technologies, aerial image classification has broad prospects for follow-up applications [20]. A plethora of work has been carried out from a wide range of angles to achieve higher accuracy in aerial image label prediction. Among the proposed methods, one strategy, named problem transformation, converts the MAIC problem into existing, well-established learning scenarios, such as binary relevance classifiers [21] and the K-medoids approach [22]. Another strategy, named algorithm adaptation, leverages popular learning techniques to deal with multilabel aerial images directly, such as decision trees [23] and neural networks [24]. State-of-the-art aerial image classifiers can be integrated into MAIC problems directly by a multibinary classification loss. However, they cannot achieve satisfying performance as they ignore two kinds of crucial information in the MAIC pipeline: heuristic label correlation features and representative image features.

Image Feature Extraction.
Image feature extraction is the fundamental step in aerial image processing. In the past few years, deep learning-based models have shown a powerful ability to extract representative features, and MAIC methods based on deep learning have shown promising performance. Usually pretrained on large datasets, these CNN-based models can be applied directly to MAIC tasks in an end-to-end way. Many studies first extract features with the CNN encoder and then graft label predictors such as an active learning framework [25] and GCN [26]. However, the CNN encoder is trained on single-label aerial images, each of which has only one object of interest. As a result, features from CNN encoders describe the whole image and may lead to a suboptimal prediction for aerial images with multiple objects, because these objects appear in the image at various locations, sizes, and shapes. To overcome the abovementioned drawbacks, a series of models were proposed to learn more fine-grained features from the raw aerial images, such as object detection methods that localize the task-specific regions [14], handling partial occlusions by part detection [27], detecting local information via SPC layers [16], and so on.

Label Correlations Learning.
Extensive methods for image classification tasks in the literature focus on exploring the correlations among labels. Probabilistic graph models were widely utilized to formulate the coexistence of labels in early research [28]. The CGL model builds a conditional label structure learning method within a unified Bayesian framework [29]. Also, label correlation was represented by a low-rank mapping matrix in [30]. A method based on graph Laplacian regularization was designed to exploit the label correlation in the local neighborhood [31]. Because of their computational cost, these methods are difficult to apply in practice.
Recently, with the inference ability of RNN, both semantic and spatial label relations can be extracted with only image-level supervision in a sequential way [14,32]. Furthermore, methods based on the attention mechanism were also proposed for automatically assigning the weight of different label dependencies [33].
Lately, the newly proposed graph convolutional networks, designed for modeling nongrid structured data, have been introduced in classification tasks [26]. Different from traditional Euclidean-structured CNN models, GCN can learn a non-Euclidean graph structure directly and thus holds a strong ability for correlation inference [9].

Motivation.
The aforementioned methods cope with the two crucial challenges of the MAIC task through different kinds of well-designed frameworks. However, most of them either ignore the label correlations while extracting more representative image features or just utilize rough coarse-grained image features while building label dependencies. In addition, the deep learning-based models utilize pretrained backbones to extract image features. However, these backbones are designed to extract a single core feature for the given image, which is suboptimal for multilabel image feature learning. Based on this observation, in this paper, we aim to propose a comprehensive model for the MAIC task that integrates multigrained semantic information from both images and labels. Inspired by the success of the SPC framework, the proposed MSGM model leverages SPC layers in a parallel manner to extract fine-grained image representations. Then, we design different grouping mechanisms to combine these representations, each containing several instances. Furthermore, the GCN framework is introduced to learn the semantic network features of the label correlation.

Problem Definition.
Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of $n$ aerial images and $L = \{l_1, l_2, \ldots, l_c\}$ the set of $c$ labels of the dataset. Each image $x_i \in X$ is annotated with a label vector $y_i \in \{0, 1\}^c$, where $y_i^k = 1$ if $x_i$ carries the label $l_k$, $k \in [1, 2, \ldots, c]$, and otherwise $y_i^k = 0$. The MAIC problem is to model a function that takes the given input image $x_i$ as the independent variable and outputs its predicted label vector $\hat{y}_i$, i.e., $f: x_i \to \hat{y}_i$.

The Proposed MSGM Model.
Figure 2 illustrates the framework of the proposed model, including the image feature extractor, the label correlation extractor, and the multilabel classifier. The image feature extractor is constructed from multiscale SPC layers. The label correlation extractor constructs the concept graph and utilizes the attention GCN framework to directly extract the concept graph representations. Finally, the multilabel classifier integrates the bilateral information and outputs the predicted labels.

Image Feature Extractor.
To extract more task-specific image features, we generate multiple instances of each image. For the given input image $x_i$, the output feature map from the backbone has a dimension of 2048 × 14 × 14. Then, the feature map is filtered by kernels of different sizes on each SPC layer to get a group of feature maps in a parallel manner, which is described in Subsection C. Finally, the feature space is integrated by the grouping mechanisms to get the image-level feature $f_i \in \mathbb{R}^m$ ($m = 2048$).

Label Correlations Extractor.
To represent label correlations, we first build the concept graph based on the co-occurrence of labels in the label set $L$. Then, the attention GCN (described in Subsection D) extracts the concept graph topologies and generates the label-level features $G \in \mathbb{R}^{c \times m}$.

Multilabel Classifier.
The image-level feature $f_i \in \mathbb{R}^m$ and the label-level features $G \in \mathbb{R}^{c \times m}$ are integrated by the multilabel classifier following [9]. The predicted score $\hat{y}_i^k \in \hat{y}_i$ is the probability of the corresponding label $l_k$, $\hat{y}_i^k \in [0, 1]$, with its ground truth value $y_i^k \in \{0, 1\}$. In this way, if $\hat{y}_i^k$ is above 0.5, we set $l_k$ as positive for the image $x_i$.
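The fusion and thresholding step can be sketched as follows. This is only an illustrative sketch: it assumes that the dot-product fusion is followed by a sigmoid so that the scores fall in [0, 1], and the function name `predict_labels` and the toy dimensions are hypothetical rather than taken from the model.

```python
import numpy as np

def predict_labels(G, f, threshold=0.5):
    """Fuse label-level features G (c x m) with an image feature f (m,).

    Assumption: scores are sigmoid(G @ f). The text only states that the
    scores are probabilities in [0, 1] and that 0.5 is the decision threshold.
    """
    logits = G @ f                          # one dot product per label, shape (c,)
    scores = 1.0 / (1.0 + np.exp(-logits))  # squash the logits to [0, 1]
    positive = scores > threshold           # label l_k is positive above 0.5
    return scores, positive

# Toy example with c = 3 labels and m = 4 feature dimensions (hypothetical values).
G = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
f = np.array([2.0, 3.0, 0.5, 0.1])
scores, positive = predict_labels(G, f)
```

In this toy case only the first label is marked positive; a score of exactly 0.5 is not above the threshold and therefore stays negative.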

Pyramid Convolution for the Image Feature Extraction.
To get more fine-grained features and generate multiple instances for each image, we treat the output from the backbone as a bag. Each of the 14 × 14 spatial positions holds a 2,048-dimensional feature vector, which represents an instance of the input image. The bag thus contains instances that represent different patches of the raw input image. However, filters with receptive fields of the same size are unlikely to capture all objects of different sizes and scales. To address this problem, we introduce the SPC component to learn multiscale task-specific features. As illustrated in Figure 2, the SPC component consists of two parts: the SPC filtering layers and the SPC grouping mechanism.

SPC Filtering Layers.
SPC filtering layers are a set of parallel convolution operations whose filters range in size from 1 × 1 to $w \times w$ ($w = 14$). With input $x_i$ from the backbone, the output of the $j$-th convolutional operation is as follows:
$$f^{j}_{\mathrm{filter}} = \sigma\left(\mathrm{Conv}_{j \times j}(x_i; \theta_j)\right),$$
where $\theta_j$ indicates the model parameters, and $\sigma$ is the activation function. The outputs of the convolution operations form a group of feature maps, denoted as $F_{\mathrm{filter}} = \{f^{j}_{\mathrm{filter}}\}_{j=1}^{w}$, each with 2,048 channels, and each of them contains the corresponding instances. A filter of size $j \times j$ generates a feature map of size $(w-j+1) \times (w-j+1)$, i.e., $(w-j+1)^2$ instances. For example, a 2 × 2 filter generates a feature map of size 13 × 13 with 169 instances, and a 1 × 1 filter yields 196 instances. The stride is a hyperparameter, which is empirically set to 1 in the experiments. Since the filters are of various sizes, their receptive field sizes differ. Each layer contains filters of different sizes and depths, which can capture different levels of features from the scenes. In this way, the SPC filtering layers can learn the features of objects with different sizes, positions, and scales.
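The instance counts quoted above follow from the usual output-size rule of an unpadded convolution. The short sketch below tabulates them for every filter size; the function name `spc_output_sizes` is hypothetical, and the absence of padding is an assumption that is consistent with the 13 × 13 and 196-instance examples in the text.

```python
def spc_output_sizes(w=14, stride=1):
    """Return {filter size j: (side length, instance count)} for j = 1..w.

    Uses the standard output-size formula of a convolution without padding.
    """
    sizes = {}
    for j in range(1, w + 1):
        side = (w - j) // stride + 1     # output side length for a j x j filter
        sizes[j] = (side, side * side)   # one instance per output position
    return sizes

sizes = spc_output_sizes()
total_instances = sum(count for _, count in sizes.values())
```

With $w = 14$ and stride 1, the 14 parallel filters produce $1^2 + 2^2 + \cdots + 14^2 = 1{,}015$ instances in total.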

SPC Grouping Mechanism.
The feature maps $F_{\mathrm{filter}}$ generated by the SPC filtering layers are at different scales, and each of them contains some instances. The instances of a bag therefore need to be aggregated into the final aerial image features for multilabel prediction. To this end, we designed three grouping mechanisms from different angles: feature alignment grouping (FAG), feature stacking grouping (FSG), and aligned feature stacking grouping (AFSG), as shown in Figure 3.
(a) FAG applies FC layers to the feature maps to obtain unified-scale feature vectors $F_{\mathrm{FAG}}$, where $W_{\mathrm{FAG}}$ is the weight matrix in the FC layers, $w = 14$ is the number of SPC filtering layers, and $\varphi(\cdot)$ is the activation function. In this way, FAG provides a solution where instances on each feature map can be learned independently.
(b) Compared with FAG, the idea of FSG is to concatenate the feature vectors with different dimensions directly. As shown in Figure 3(b), the group of feature maps $F_{\mathrm{filter}}$ from the SPC filtering layers is first integrated into a unified feature map $F_{\mathrm{FSG}} \in \mathbb{R}^{1 \times 2{,}048 \times (1 + 2^2 + \cdots + w^2)}$, where $w = 14$ is the number of SPC filtering layers. Then, an FC layer and a max-pooling layer are utilized to transform the scale of $F_{\mathrm{FSG}}$, where $W_{\mathrm{FSG}}$ is the weight matrix in the FC layer, and $\varphi(\cdot)$ is the activation function. As a result, FSG provides a direct approach by absorbing information from every feature representation and treating them as a whole map.
(c) Based on the abovementioned two approaches, AFSG deals with the feature maps in a fine-grained way and can be regarded as the combination of FAG and FSG. As in Figure 3(c), AFSG first utilizes FC layers to get the unified-scale feature vectors $F_{\mathrm{FAG}}$, the same as in FAG. Then, these feature vectors are stacked into a unified feature space $F_{\mathrm{FSG}} \in \mathbb{R}^{14 \times 2{,}048}$, followed by the FC layer to generate the image feature $f_i$, where $W_{\mathrm{AFSG1}}$ and $W_{\mathrm{AFSG2}}$ are the two weight matrices in the FC layers, and $\varphi(\cdot)$ is the activation function.
Based on the grouping mechanisms, we get the image-level feature $f_i \in \mathbb{R}^m$ from various perspectives. In the experimental part, we evaluate the three grouping mechanisms separately. In the rest of this paper, the AFSG method in Figure 3(c) is the default choice unless otherwise stated.
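To make the stacking idea behind FSG more concrete, the following sketch flattens the feature maps from all filter sizes, concatenates their instances, and reduces them with an FC layer and max-pooling. It is a rough illustration under stated assumptions: the FC layer is applied per instance across channels, ReLU stands in for the activation $\varphi(\cdot)$, only 8 channels are used instead of 2,048, and the name `feature_stacking_grouping` is hypothetical.

```python
import numpy as np

def feature_stacking_grouping(feature_maps, W):
    """Sketch of FSG: stack all instances, apply an FC layer, then max-pool.

    feature_maps: list of arrays shaped (channels, side, side).
    W: (channels, channels) FC weight matrix (hypothetical shape).
    """
    # Flatten each map to (channels, side * side) and concatenate the instances.
    stacked = np.concatenate(
        [fm.reshape(fm.shape[0], -1) for fm in feature_maps], axis=1)
    hidden = np.maximum(W @ stacked, 0.0)   # FC layer + ReLU (assumed activation)
    return stacked, hidden.max(axis=1)      # max-pool over all instances

rng = np.random.default_rng(0)
channels, w = 8, 14                         # 8 channels instead of 2,048 for brevity
maps = [rng.standard_normal((channels, s, s)) for s in range(w, 0, -1)]
W = rng.standard_normal((channels, channels))
stacked, image_feature = feature_stacking_grouping(maps, W)
```

The stacked map holds 1,015 instances per channel, which matches the $1 + 2^2 + \cdots + w^2$ term in the dimension of the unified FSG feature map.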

Construction of Concept Graph.
To build the concept graph, we first represent all label nodes in the label space $L$ as $d$-dimensional feature vectors, where each label node $l_i \in L$ is denoted as $\mathbf{l}_i \in \mathbb{R}^d$. Inspired by ML_GCN [34], for the common label space $L$, we model the label correlation dependency in the form of conditional probability, i.e., $P(l_j \mid l_i)$ is the probability of occurrence of label $l_j$ when label $l_i$ appears:
$$P(l_j \mid l_i) = \frac{N_{i,j}}{N_i},$$
where $N_{i,j}$ denotes the number of images that contain both labels $l_i$ and $l_j$, and $N_i$ denotes the number of images that contain label $l_i$. Note that the correlation matrix is asymmetric, since $P(l_j \mid l_i)$ is in general not equal to $P(l_i \mid l_j)$; the concept graph is therefore a directed graph. Based on the coexistence of labels in the correlation matrix, the concept graph is built to represent the label correlations.
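The conditional-probability matrix can be computed directly from a binary annotation matrix. The sketch below is a minimal illustration; the function name `conditional_probability_matrix` and the toy annotations are hypothetical, and the guard against unused labels is an added safety measure rather than part of the method.

```python
import numpy as np

def conditional_probability_matrix(Y):
    """Y: (n_images, c) binary label matrix. Returns P with P[i, j] = P(l_j | l_i)."""
    Y = np.asarray(Y, dtype=float)
    cooccurrence = Y.T @ Y                 # N[i, j]: images containing both l_i and l_j
    counts = np.diag(cooccurrence).copy()  # N[i]: images containing l_i
    counts[counts == 0] = 1.0              # avoid division by zero for unused labels
    return cooccurrence / counts[:, None]

# Toy annotations for 4 images and 3 labels (hypothetical): ship, water, tree.
Y = [[1, 1, 0],
     [0, 1, 1],
     [0, 1, 0],
     [1, 1, 1]]
P = conditional_probability_matrix(Y)
```

Here every image with a ship also contains water, so P(water | ship) = 1.0, while only half of the water images contain a ship, so P(ship | water) = 0.5. This illustrates why the matrix is asymmetric.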

Graph Convolutional Network Recapitulation.
For a given graph $G$, the processing idea is to integrate the knowledge from neighbor nodes to update the central node features. Different from traditional CNN-based models, GCN aims to learn a function on the given graph $G$ that takes the node features $H^{i} \in \mathbb{R}^{c \times d}$ and the correlation matrix $A \in \mathbb{R}^{c \times c}$ as inputs and outputs the updated node features $H^{i+1} \in \mathbb{R}^{c \times d'}$. Specifically, for the $i$-th GCN layer with input feature representation $H^{i} \in \mathbb{R}^{c \times d}$, the output $H^{i+1} \in \mathbb{R}^{c \times d'}$ can be written as a nonlinear operation [34]:
$$H^{i+1} = g\left(A H^{i} W^{i}\right), \quad (8)$$
where $W^{i} \in \mathbb{R}^{d \times d'}$ is the weight transformation matrix, $A \in \mathbb{R}^{c \times c}$ denotes the correlation matrix, and the nonlinear function $g(\cdot)$ is usually LeakyReLU [35].

The Attention-Based GCN.
From (8), GCN works by propagating information between nodes based on the correlation matrix $A$. However, in most previous works, $A$ is predefined by the conditional probability and kept fixed during node feature learning. This kind of fixed matrix is not sufficient for the complicated correlations of objects. We design the attention-based GCN (attGCN) layer to address this problem in the MAIC task. The attGCN layer learns the label node features by integrating the topology information from their 1-step neighbors. First, a dot-product attention mechanism is performed on the label embeddings and the embeddings of their 1-step neighbors [36]. Then, the label embeddings are updated by the combination of their attention-weighted neighbor node embeddings. Specifically, we first represent all label nodes as $d$-dimensional feature embeddings, so that the label node $l_i \in L$ is embedded as $\mathbf{l}_i \in \mathbb{R}^d$. The neighbor nodes of $l_i$ are denoted as $h_i^k \in H$, $k \in [1, 2, \ldots, K]$, where $K$ is the number of neighbors. Then, the attention score between the label node $l_i$ and one of its neighbor nodes $h_i^p$ is calculated in reference to [36], where $w \in \mathbb{R}^d$ is the weight vector to be learned, $\mathbf{l}_i \in \mathbb{R}^d$ is the feature embedding of the label node $l_i$, and $\mathbf{h}_i^p \in \mathbb{R}^d$ is the feature embedding of the neighbor node $h_i^p$. The operation $(\cdot)$ represents a dot product, and $g$ is a nonlinear function (LeakyReLU). The softmax function is utilized to normalize the attention scores among different label nodes. Based on the attention scores above, the feature vector of the label node $l_i$ is then calculated as the attention-weighted combination of its $K$ neighbor nodes [9], which yields the updated feature vector of the label node $l_i$.
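The two building blocks above can be sketched in a few lines. The GCN layer follows the propagation rule in (8). The attention part is only one plausible reading: it assumes the score between a label and a neighbor is LeakyReLU applied to the dot product of $w$ with their element-wise product, followed by a softmax over the $K$ neighbors. The function names, the LeakyReLU slope of 0.2, and the toy sizes are all hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU with an assumed negative slope of 0.2."""
    return np.where(x > 0, x, slope * x)

def gcn_layer(A, H, W):
    """One GCN layer: H_next = LeakyReLU(A @ H @ W)."""
    return leaky_relu(A @ H @ W)

def attention_update(label, neighbors, w):
    """Attention-weighted neighbor aggregation for one label node.

    Assumed scoring form: score_p = LeakyReLU(w . (label * neighbor_p)),
    normalized with a softmax over the K neighbors.
    """
    scores = leaky_relu((neighbors * label) @ w)   # shape (K,)
    weights = np.exp(scores - scores.max())        # numerically stable softmax
    weights /= weights.sum()
    return weights, weights @ neighbors            # updated embedding, shape (d,)

rng = np.random.default_rng(0)
c, d, d_out = 4, 6, 5                              # toy sizes (hypothetical)
A = rng.random((c, c))                             # stand-in correlation matrix
H = rng.standard_normal((c, d))                    # label node embeddings
H_next = gcn_layer(A, H, rng.standard_normal((d, d_out)))
weights, updated = attention_update(H[0], H[1:], rng.standard_normal(d))
```

The attention weights sum to one, so the updated embedding is a convex combination of the neighbor embeddings; this is what lets the effective correlation adapt during training instead of staying fixed.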

Learning Algorithm.
As introduced above, there are three components in the proposed MSGM framework, namely, an image feature extractor (with the backbone, SPC filtering layers, and the SPC grouping mechanism), a label correlation extractor (with the concept graph and attention-based GCN), and a multilabel classifier. The whole model is trained end-to-end. We utilize the cross-entropy loss function to train the model with the annotation $y$ and the prediction $\hat{y}$. The loss function follows [34] and is computed over $y^k$ and $\hat{y}^k$, the $k$-th dimensions of $y$ and $\hat{y}$, respectively, where $k \in [1, 2, \ldots, c]$ and $c$ is the size of the label set $L$. In addition, the backpropagation algorithm with stochastic gradient descent is utilized to optimize the parameters.
We emphasize the preciseness of this learning procedure. During the training phase, the multilabel classifier generates the final predictions with the image feature f i and the label correlation feature G.
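For reference, a common form of the multilabel cross-entropy loss sums a binary cross-entropy term over the $c$ labels. The sketch below assumes this standard form and assumes that the predictions are already probabilities in [0, 1]; the function name `multilabel_cross_entropy` and the clamping constant are illustrative choices, not details given in the text.

```python
import math

def multilabel_cross_entropy(y_true, y_pred, eps=1e-12):
    """Sum of binary cross-entropy terms over the c labels of one image."""
    loss = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)    # clamp to keep log() finite
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

# Toy annotation and prediction for c = 3 labels (hypothetical values).
loss = multilabel_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
```

The loss is close to zero when every predicted probability agrees with its annotation and grows as confident predictions disagree with the ground truth.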

Experiments and Results
We conducted two kinds of experiments, quantitative and qualitative, to evaluate the proposed MSGM method. Quantitative results are the numerical scores of the metrics studied in this manuscript, i.e., EP, ER, EF1, EF2, LP, LR, LF1, and LF2, and show the performance comparison with state-of-the-art methods on the UCM and DFC15 multilabel datasets. Qualitative results comprise case studies, feature visualization, and histograms, and illustrate the effectiveness of the proposed MSGM method.
This section presents the experimental results in comparison with the state-of-the-art on the UCM and DFC15 multilabel aerial image datasets, respectively. Then, ablation studies are conducted to evaluate the key aspects of the proposed approach.

Evaluation Metrics.
Three kinds of metrics are widely used for classification models: precision, recall, and F-score. Specifically, in the multilabel classification task, the performance of a method can be evaluated from both example-based and label-based angles [9]. Example-based metrics evaluate the performance from the perspective of individual aerial images, whereas label-based metrics evaluate the performance from the perspective of labels [34]. In this way, we calculate the example-based precision (EP), recall (ER), and F-scores (EF1 and EF2), and the label-based precision (LP), recall (LR), and F-scores (LF1 and LF2) as metrics in this paper. The example-based metrics are averaged over images and the label-based metrics are averaged over labels, where $M$ is the size of the label set $L$.
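As a hedged illustration of these metrics, the sketch below computes mean precision and recall per example or per label and then derives the F-scores from the averaged values with the standard $F_\beta$ formula. The function names and the toy matrices are hypothetical, and computing F from the averaged precision and recall is an assumption.

```python
import numpy as np

def precision_recall(Y_true, Y_pred, axis):
    """Mean precision and recall; axis=1 averages over examples, axis=0 over labels."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    hits = (Y_true * Y_pred).sum(axis=axis)            # correctly predicted positives
    predicted = np.maximum(Y_pred.sum(axis=axis), 1)   # guard against empty sets
    actual = np.maximum(Y_true.sum(axis=axis), 1)
    return (hits / predicted).mean(), (hits / actual).mean()

def f_beta(precision, recall, beta):
    """F-score from precision and recall; beta=1 and beta=2 give F1 and F2."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

Y_true = [[1, 1, 0], [0, 1, 1]]      # toy ground truth (hypothetical)
Y_pred = [[1, 0, 0], [0, 1, 1]]      # toy predictions (hypothetical)
ep, er = precision_recall(Y_true, Y_pred, axis=1)   # example-based
lp, lr = precision_recall(Y_true, Y_pred, axis=0)   # label-based
ef1 = f_beta(ep, er, beta=1)
```

Applying `f_beta` to the EP and ER reported for MSGM on UCM (83.61 and 85.48) gives roughly 84.53 and 85.10, which agrees with the reported EF1 and EF2 up to rounding and suggests that the F-scores are derived from the averaged precision and recall.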

Implementation Details.
The details of the model components are as follows: the image encoder module utilizes ResNet-101 (pretrained on ImageNet) as the backbone. Then, a set of pyramid convolutional layers is stacked in parallel for feature extraction. The label encoder module is composed of two stacked GCN layers (with output sizes of 1,024 and 2,048, respectively). The fusion layer is a dot product operation layer. In the experimental part of this paper, the AFSG method in Figure 3(c) is set as the default choice.
We select the hyperparameters of our model via grid search according to the metrics on the validation set. Specifically, we select the learning rate from {0.0005, 0.001, 0.003, 0.005} and the batch size from {8, 16, 32}. Finally, the learning rate is set to 0.001, and the batch size is 16. We utilize SGD as the optimizer of the network and LeakyReLU as the nonlinear activation function. The network is trained for 20 epochs in total. All experiments are performed on an NVIDIA GeForce RTX GPU and implemented in Python using the PyTorch framework.
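The grid search over the two hyperparameters can be sketched as follows. The search space is taken from the text; the function `grid_search` and its `validation_score` argument are hypothetical placeholders for a routine that trains the model with the given settings and returns a validation metric.

```python
from itertools import product

LEARNING_RATES = [0.0005, 0.001, 0.003, 0.005]
BATCH_SIZES = [8, 16, 32]

def grid_search(validation_score):
    """Return the best (learning_rate, batch_size) pair and the full grid.

    validation_score is a user-supplied callable (hypothetical) that trains a
    model with the given settings and returns a metric on the validation set.
    """
    grid = list(product(LEARNING_RATES, BATCH_SIZES))   # 4 x 3 = 12 settings
    best = max(grid, key=lambda cfg: validation_score(*cfg))
    return best, grid
```

A call such as `grid_search(train_and_validate)` would evaluate all 12 settings once each, where `train_and_validate` is whatever training routine is available.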

Datasets.
We employ two multilabel aerial image datasets, the UCM [37] and DFC15 [15] multilabel datasets. The numbers of aerial images and classes in each dataset are shown in Table 1. Rebuilt from the single-labeled UC Merced Land Use Dataset [38], the UCM multilabel dataset is annotated with multiple tags based on visual inspection. There are 2,100 samples in UCM, and each sample has 256 × 256 pixels with a spatial resolution of one foot. The label space consists of airplane, sand, pavement, buildings, cars, chaparral, court, trees, dock, tank, water, grass, mobilehome, ship, bare-soil, sea, and field. In this work, we randomly sampled 80% of the images evenly from every category for training and used the remaining 20% for testing. The DFC15 dataset is a newly proposed multilabel dataset. It is rebuilt from the single-labeled dataset published in the 2015 IEEE GRSS Data Fusion Contest [39]. Compared to the UCM dataset, the DFC15 dataset is more challenging, with a much higher spatial resolution of 5 cm. There are eight labels in total in the label set: impervious, water, clutter, vegetation, building, tree, boat, and cars. The number of images is 3,342; 80% of them are randomly selected as the training set and the remaining 20% are used for testing.
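A per-category 80/20 split of the kind described for UCM can be sketched as below. The function name `stratified_split`, the seed, and the use of integer category ids are illustrative assumptions; the 21 categories with 100 images each correspond to the original UC Merced Land Use Dataset.

```python
import random

def stratified_split(categories, train_fraction=0.8, seed=0):
    """Split sample indices so each category contributes the same train fraction.

    categories: list with the category id of each sample.
    """
    rng = random.Random(seed)
    by_category = {}
    for index, category in enumerate(categories):
        by_category.setdefault(category, []).append(index)
    train, test = [], []
    for indices in by_category.values():
        rng.shuffle(indices)                              # random order within a category
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# 21 scene categories with 100 images each, as in the UC Merced Land Use Dataset.
categories = [c for c in range(21) for _ in range(100)]
train, test = stratified_split(categories)
```

With these settings the split yields 1,680 training and 420 test images, and every category contributes exactly 80 training images.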

Experimental Results.
In this subsection, the experimental results of MSGM on two datasets are illustrated. To clarify, we compare MSGM with other candidates with the backbone of ResNet. In addition, some benchmark comparisons with GCN-based multilabel image classification models are conducted. Besides, we list the annotation case study results to show the effectiveness of MSGM. In addition, results of MSGM with three grouping mechanisms are compared and analyzed.

Results on the UCM Multilabel Dataset.
We compare MSGM with current multilabel aerial image classification methods, namely ResNet-50 [40], ResNet-RBFNN [41], CA-ResNet-LSTM [15], CA-ResNet-BiLSTM [15], Image-GCN [42], and ML_GCN [34], because they are all built on a pretrained ResNet. Table 2 lists the scores of the different models on each metric that we analyze in this paper. For reading convenience, we mark the highest scores in bold. In general, MSGM achieves superior performance on both example-based and label-based metrics.
For example-based metrics, the scores of MSGM on EP and ER are 83.61% and 85.48%. MSGM surpasses the state-of-the-art CA-ResNet-BiLSTM by 5.67% on EP. In terms of EF1 and EF2, MSGM achieves 84.54% and 85.10%, respectively. Although slightly lower on EF2 than CA-ResNet-BiLSTM, our model achieves a corresponding improvement on EF1, showing that our model can obtain high precision while maintaining the recall. Furthermore, MSGM remarkably outperforms the GCN-based model from [42] on every metric. In comparison with ML_GCN [34], MSGM shows a stronger ability on example-based metrics, with increases of 3.58% on EF1 and 3.46% on EF2.
For label-based metrics, the proposed MSGM achieves 89.98% and 85.07% on LP and LR, which are 3.86% and 0.81% higher than CA-ResNet-BiLSTM. In addition, the scores of MSGM on LF1 and LF2 are 87.46% and 86.01%, respectively, which are much higher.
The experimental results on the UCM dataset verify the effectiveness of MSGM with the SPC layers and the concept graph. In the image feature learning phase, the SPC-based extractor can learn more fine-grained representations to help the model understand the image. During label prediction, the concept graph provides significant semantic information from label correlations.
On the DFC15 dataset, the proposed MSGM method obtains 94.61% and 92.71% on the example-based metrics EP and ER, and 93.65% and 93.08% on EF1 and EF2, respectively. Compared with CA-ResNet-BiLSTM, which is the state-of-the-art on this dataset, MSGM improves EP by 2.68% and ER by 13.59%. Furthermore, MSGM increases the EF1 and EF2 scores by 8.6% and 11.69% over CA-ResNet-BiLSTM. In addition, the proposed MSGM outperforms state-of-the-art methods not only on example-based metrics but also on label-based scores. MSGM reaches 91.42% on LP and 90.70% on LR. In comparison with CA-ResNet-BiLSTM, MSGM is 30.65% higher on LR. These improvements further demonstrate the robustness of the proposed model on the more challenging DFC15 dataset. By extracting fine-grained image features and learning self-adapted semantic information, MSGM can provide a more effective solution for the current MAIC task.

Annotation Case Study.
To further evaluate the effectiveness of MSGM, we conducted a case study with several images. The results are listed in Table 4. We note that the proposed MSGM generally works well for the images in Table 4. For instance, MSGM predicts all candidate labels for image (a): bare-soil, cars, buildings, court, grass, pavement, and trees. In addition, MSGM can classify images accurately, even those with sparse labels. For image (c), the annotation includes three labels (grass, trees, and water), while in the prediction results MSGM not only predicts all ground truths but also marks sand as positive. This provides more fine-grained semantic information about the images for follow-up computer vision tasks. Table 5 shows the predicted labels and the corresponding semantic feature maps of MSGM on the UCM dataset. For images (a), (b), and (c), the positive labels (ground truth) are in black and the negative labels are in red. Moreover, the activation areas of each label are concentrated on semantic-aware areas, i.e., the label-correlated areas are activated. For image (a), the image patch corresponding to the label ship is highlighted in red, while the whole image is in blue for the label buildings. This reveals that MSGM can learn task-specific features and explore label-region interaction.

Results on Three Grouping Mechanisms.
As introduced previously, we designed three grouping mechanisms, FAG, FSG, and AFSG, for aerial image feature extraction. In this part, we discuss the results of the different mechanisms. For reading convenience, we name the proposed MSGM based on the three modules MSGM-FAG, MSGM-FSG, and MSGM-AFSG, respectively. The qualitative (histogram) and quantitative (scores on EP, ER, EF1, LP, LR, and LF1) results are illustrated in Figure 4. MSGM-AFSG surpasses the other two candidates on almost all metrics. Particularly on F-scores, MSGM-AFSG achieves 84.54% on EF1 and 87.46% on LF1, improving on MSGM-FAG (the second place) by 0.34% and 0.83%. In addition, MSGM-FSG achieves the best precision on both EP and LP, indicating the effectiveness of directly concatenating the feature vectors with different dimensions. Another interesting observation is that all three modules outperform the existing MAIC methods, verifying the robustness and feasibility of our MSGM model.

Conclusions
This paper provides a new solution for fine-grained semantic understanding of aerial panoramic images. Focusing on the crucial challenges of this research task, we designed a comprehensive multilabel aerial image classification model, named MSGM. To tackle the problem of how to learn more task-specific features from aerial panoramic images, MSGM designs pyramid convolutional layers to extract multiple instances by multiscale feature encoders. Then, three grouping mechanisms are designed to integrate the instances into the final aerial panoramic image features. Furthermore, MSGM learns semantic features from label dependencies during the multilabel prediction phase. Inspired by the recently proposed GCN-based models, which can deal with graph structure directly, MSGM builds a concept graph to represent the label correlations and then feeds the graph into a designed GCN based on the attention mechanism. In this way, with the multigrained semantic features, a novel end-to-end multilabel aerial image classification model considering label correlations is built. Three components constitute the whole framework of the proposed MSGM: the image feature extractor, the label correlation extractor, and the multilabel classifier. Experimental results verify the effectiveness of the proposed method both quantitatively and qualitatively on two benchmark aerial panoramic image datasets, UCM and DFC15. In the future, we will further explore the dimensions of SPC layers to provide more adaptive approaches in applications.
Data Availability
The data are available from the following: https://bigearth.eu/datasets; "Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification" by Hua

Conflicts of Interest
The authors declare that there are no conflicts of interest.