Multiscale Feature Filtering Network for Image Recognition System in Unmanned Aerial Vehicle

For unmanned aerial vehicles (UAVs), detecting objects at different scales is an important component of visual recognition. Recent advances in convolutional neural networks (CNNs) have demonstrated that attention mechanisms remarkably enhance the multiscale representation of CNNs. However, most existing multiscale feature representation methods simply employ several attention blocks to adaptively recalibrate the feature response, which overlooks context information at the multiscale level. To solve this problem, this paper proposes a multiscale feature filtering network (MFFNet) for image recognition systems in UAVs. A novel building block, the multiscale feature filtering (MFF) module, is proposed for ResNet-like backbones; it allows feature-selective learning of multiscale context information across multiple parallel branches. These branches employ atrous convolutions at different scales and further adaptively generate channel-wise feature responses by emphasizing channel-wise dependencies. Experimental results on the CIFAR100 and Tiny ImageNet datasets show that MFFNet achieves very competitive results in comparison with previous baseline models. Further ablation experiments verify that MFFNet achieves consistent performance gains in image classification and object detection tasks.


Introduction
To understand the environment, unmanned aerial vehicles (UAVs) need to integrate information from various sensors such as cameras, lidar, radar, and GPS. The information from the camera provides a straightforward means of visual perception, which supports further advanced reasoning for the UAV. As one of the important tasks in UAV visual perception, image recognition [1,2] has always been a research hotspot. Convolutional neural networks (CNNs) have been widely used in solving visual cognition tasks, such as image classification [3,4], object detection [5], and salient object detection [6]. Unlike traditional hand-crafted features (e.g., HOG [7]), features learned by CNNs from data require minimal human involvement during training. Thus, most recent research on visual recognition is based on network engineering, and it is becoming increasingly important to design better CNN architectures for visual recognition tasks.
Generally, three factors are central to the design of convolutional networks: depth, width, and cardinality. In 2015, Simonyan and Zisserman [8] designed an effective and very deep network by stacking blocks of the same shape, which achieved state-of-the-art performance. However, as CNNs become deeper, gradient propagation becomes more difficult. To alleviate the vanishing gradient problem caused by increasing network depth, He et al. [9] proposed a deep residual learning approach, in which each layer learns a residual function with reference to its input. Experiments showed that residual networks are easy to optimize and gain accuracy from increased depth. Szegedy et al. [10] showed that width is another important factor for improving the performance of CNNs. Compared with shallower and narrower networks, the main advantage of this method is that it significantly improves accuracy with only a moderate increase in computing demand. ResNeXt [11] exploited grouped convolutions and empirically showed that increasing cardinality is more effective than going deeper or wider when capacity is increased. In 2016, Zagoruyko and Komodakis [12] demonstrated that using more channels and wider convolutions can improve detection accuracy. Then, Huang et al. [13] proposed a dense convolutional network, which uses direct connections between any two layers with the same feature map size to strengthen feature propagation. Ding et al. [14] designed a novel convolutional network that uses asymmetric convolutions to strengthen the square convolution filters.
Other network studies [15][16][17][18] exploited the potential of attention mechanisms. For example, Hu et al. [15] designed a squeeze-and-excitation (SE) block that adaptively recalibrates channel-wise feature responses by modeling the interdependencies between channels. After that, Woo et al. [16] introduced a simple attention module called CBAM, which exploits both spatial and channel-wise attention to emphasize meaningful features along the channel and spatial axes. Li et al. [17] proposed selective kernel networks (SKNets), in which neurons adapt their receptive field sizes in a nonlinear manner. Furthermore, previous work [18] captured multiscale features through the additive effects of feature-selective and spatial attention. Targets appear at different scales in the image frame and are often occluded by clutter, which is a major challenge for image recognition algorithms in UAV applications. Therefore, multiscale feature representation is particularly critical for image recognition systems in UAVs. However, most existing multiscale feature representation methods that use attention mechanisms simply employ several attention blocks to adaptively recalibrate the feature response, which overlooks context information at the multiscale level.
Based on this analysis, this paper proposes a multiscale feature filtering network (MFFNet) for image recognition systems in UAVs. In MFFNet, we propose a novel building block called the multiscale feature filtering (MFF) module. Our key idea is to retain important information about smaller and less salient objects by allowing feature-selective learning of multiscale context information across multiple parallel branches. These branches employ atrous convolutions at different scales and further adaptively generate channel-wise feature responses by emphasizing channel-wise dependencies.
An MFF network (MFFNet) can be constructed by simply replacing the standard 3 × 3 filters in ResNet-like backbones with MFF modules. While the template for the MFF module is generic, the role it performs varies at different depths throughout the MFFNet. To compare the MFF module with the standard 3 × 3 filter, we visualize the class activation mapping using Grad-CAM [19] and observe that the MFFNet-based CAM results tend to focus on the whole object more than other baseline networks do. Experimental results on the CIFAR [20], Tiny ImageNet [21], PASCAL VOC 2007 [22], MS COCO [23], and UAV123 [24] datasets show that our proposed method achieves consistent performance gains in image recognition tasks. The rest of the paper is organized as follows: Section 2 introduces our proposed MFFNet and presents the details of the multiscale feature filtering (MFF) module. Section 3 describes the experimental settings and analyzes the experimental results. Section 4 concludes this study and describes future work.

Method
In this section, MFFNet, a novel backbone network for image recognition systems in UAVs, is introduced. An overview of MFFNet is depicted in Figure 1. An MFFNet contains four stages, and each stage contains multiple MFF units. Each MFF unit consists of a 1 × 1 convolution, an MFF module, and a second 1 × 1 convolution in sequence, together with a skip connection. Figure 2 shows the schema of an MFF unit.
Furthermore, we present the details of the multiscale feature filtering (MFF) module. The MFF module consists of three submodules: the split module (SM), the multiscale branch module (MBM), and the fusion module (FM).

MFFNet Architecture.
MFF modules can be integrated into a standard architecture, such as ResNet [9], by replacing every 3 × 3 layer with an MFF module. Here, MFF modules are used within MFF units; making this change in every unit yields an MFFNet. Further variants that integrate MFF modules with ResNeXt [11], DenseNet [13], ShuffleNetV2 [25], and MobileNetV2 [26] can be constructed by following similar schemes. Like ResNet-50 and ResNeXt-50, MFFNet-50 and MFFNeXt-50 can be constructed by simply stacking a set of MFF units. MFFNet-50 can be obtained by changing the number of MFF units per stage. MFFNeXt-50 can be obtained from MFFNet-50 by changing the bottleneck width [12] and cardinality [11] of the MFF units. The cardinality, c, is the number of groups within a filter, whereas the bottleneck width, d, is the number of channels in a layer. Table 1 shows the MFFNet-50 and MFFNeXt-50 architectures with four stages, using {3, 4, 6, 3} MFF units. The filter sizes and feature dimensionalities of a residual block are shown inside the brackets. The number of stacked blocks for each stage is shown outside the brackets. "B = 3" denotes an MFF module with three branches, and "c = 32" denotes grouped convolutions with 32 groups.
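To make the role of the cardinality c more concrete, the following sketch counts the weights of a 3 × 3 convolution with and without grouping. The helper name `conv_params` and the 128-channel layer (32 groups of 4 channels each, in the spirit of the 32c × 4d naming) are assumptions chosen for illustration; they are not taken from Table 1.

```python
def conv_params(in_channels, out_channels, kernel_size, groups=1):
    """Weight count of a 2D convolution layer (bias omitted).

    Each group connects in_channels / groups inputs to out_channels / groups
    outputs, so grouping divides the weight count by `groups`.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return kernel_size * kernel_size * (in_channels // groups) * out_channels


# Hypothetical layer: 128 channels in and out, 3 x 3 kernel.
dense = conv_params(128, 128, 3)               # cardinality c = 1
grouped = conv_params(128, 128, 3, groups=32)  # cardinality c = 32
print(dense, grouped, dense // grouped)
```

Under these assumptions the grouped layer needs 32 times fewer weights than the dense one, which is why cardinality can be raised without a matching growth in model size.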

Multiscale Feature Filtering Module.
The structure of the MFF module is illustrated in Figure 3. First, given an input feature map, the SM divides it into multiple feature map subsets to obtain fine-grained multiscale information. Second, to capture objects at different scales, the MBM employs multiple atrous convolutions with different rates. These branches use atrous convolutions instead of standard convolutions to reduce the model's parameters. The MBM further selectively generates channel-wise feature responses by emphasizing channel-wise dependencies. Once channel-wise feature responses at different scales are captured, the transformed features are connected by skip structures to enhance feature propagation. Third, a channel concatenation operator is applied to fuse the information captured by the different branches.

Split Module.
As shown in Figure 1, in the split module (SM), for any given input feature map $X \in \mathbb{R}^{H \times W \times C'}$, where $X = [X_1, X_2, \ldots, X_{C'}]$, the SM first equally splits $X$ into $n$ feature map subsets to obtain fine-grained multiscale information. For the three feature map subsets shown in Figure 1, these are $X_1 \in \mathbb{R}^{H \times W \times C}$, $X_2 \in \mathbb{R}^{H \times W \times C}$, and $X_3 \in \mathbb{R}^{H \times W \times C}$, where $C = C'/3$. $H$, $W$, and $C$ denote the height, width, and number of channels of the feature map, respectively.
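The equal channel split performed by the SM can be sketched in a few lines of framework-free Python. The function name `split_channels` and the channel-first nested-list layout are assumptions made for illustration only; a real implementation would operate on tensors.

```python
def split_channels(feature_map, n_branches=3):
    """Split a channel-first feature map (a list of C' channels) into
    n_branches subsets of C = C' / n_branches channels each."""
    total = len(feature_map)
    if total % n_branches != 0:
        raise ValueError("channel count must be divisible by the branch count")
    width = total // n_branches
    return [feature_map[i * width:(i + 1) * width] for i in range(n_branches)]


# Toy input: C' = 6 channels, each a 2 x 2 map filled with its channel index.
x = [[[float(c)] * 2 for _ in range(2)] for c in range(6)]
x1, x2, x3 = split_channels(x, n_branches=3)
print(len(x1), len(x2), len(x3))
```

Each subset keeps the full spatial size H × W and receives a contiguous third of the channels.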

Multiscale Branch Module.
The multiscale branch module (MBM) consists of three branches, namely, the A-branch, B-branch, and C-branch. Each branch contains a feature filtering module (FFM). The structure of the FFM is depicted in Figure 4.
In the FFM, we selectively generate channel-wise feature responses by emphasizing channel-wise dependencies. Specifically, for the preprocessed feature map $U \in \mathbb{R}^{H \times W \times C}$, the FFM first uses global average pooling and global max pooling to generate two different channel-wise statistics, $n \in \mathbb{R}^{C}$ and $m \in \mathbb{R}^{C}$. The global average pooling and global max pooling operations are denoted as $F_{ga}$ and $F_{gm}$. The c-th elements of $n$ and $m$ are calculated as
$$n_c = F_{ga}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j), \quad (1)$$
$$m_c = F_{gm}(U_c) = \max_{i, j} U_c(i, j), \quad (2)$$
where $U_c$ denotes the c-th feature map channel in the feature map $U$ and $U_c(i, j)$ refers to the $(i, j)$-th pixel in $U_c$. Then, to fuse the information from global average pooling and global max pooling, an element-wise summation is used to obtain finer global channel-wise statistics $z \in \mathbb{R}^{C}$:
$$z = F_{sum}(n, m) = n + m, \quad (3)$$
where $F_{sum}$ indicates the element-wise summation operation between the channel-wise statistics $n$ and $m$. Furthermore, to make use of the fused feature information, the global channel-wise statistics $z$ are forwarded to a function $F_{ex}$, which is composed of a dimensionality-reduction layer with parameters $M_1$ and reduction ratio $l$, a ReLU activation function, a dimensionality-increasing layer with parameters $M_2$, and a sigmoid activation function. The final output of the FFM is computed as
$$s = F_{ex}(z) = \sigma(M_2 \, \delta(M_1 z)), \quad (4)$$
where $\sigma$ and $\delta$ are the sigmoid and ReLU activation functions, respectively.
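The FFM computation described above (two pooled statistics, their sum, then a reduce/ReLU/expand/sigmoid stage) can be sketched without any deep learning framework. The function names, the toy 2-channel input, and the hand-picked weight matrices `m1` and `m2` are illustrative assumptions; biases are omitted, and trained weights would of course differ.

```python
import math


def global_avg_pool(u):
    """Mean of every channel of a channel-first feature map."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def global_max_pool(u):
    """Maximum of every channel of a channel-first feature map."""
    return [max(max(row) for row in ch) for ch in u]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def ffm(u, m1, m2):
    """Channel weights: sigmoid(M2 * relu(M1 * (avg + max)))."""
    n = global_avg_pool(u)
    m = global_max_pool(u)
    z = [a + b for a, b in zip(n, m)]                # element-wise summation
    hidden = [max(0.0, h) for h in matvec(m1, z)]    # reduction layer + ReLU
    return [1.0 / (1.0 + math.exp(-t)) for t in matvec(m2, hidden)]  # sigmoid


# Toy example: C = 2 channels, reduction ratio l = 2 (one hidden unit).
u = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]
m1 = [[0.1, 0.1]]        # hypothetical weights, shape (C / l) x C
m2 = [[1.0], [-1.0]]     # hypothetical weights, shape C x (C / l)
s = ffm(u, m1, m2)
print(s)
```

Each entry of `s` lies strictly between 0 and 1 and acts as a per-channel gate on the feature map.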
Employing a large atrous rate enlarges the model's receptive field, so that objects can be encoded at multiple scales. As shown in Figure 5, the A-branch, B-branch, and C-branch employ three atrous convolutions with different atrous rates $r \in \{1, 2, 3\}$. In addition, these branches use atrous convolutions with different rates instead of standard convolutions to reduce the model's parameters.
For any atrous convolution layer, the learned set of filter kernels contains one kernel per output channel, and the c-th kernel refers to the parameters of the corresponding c-th convolution filter. Let $I \in \mathbb{R}^{H \times W \times C}$ denote the input of the atrous convolution layer, and let $U = [U_1, U_2, \ldots, U_C]$ be its output. For the c-th filter at such a layer, the corresponding output feature map channel is
$$U_c = F_{(K;r)}(I), \quad (5)$$
where $F_{(K;r)}$ denotes an atrous convolution layer with filter size $K \times K$ and atrous rate $r$, evaluated with the parameters of the c-th filter.
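The effect of the atrous rate can be illustrated with a small single-channel sketch. A K × K kernel with rate r spans K + (K − 1)(r − 1) pixels per side while still holding only K × K weights. The function names and the 7 × 7 impulse image are assumptions for illustration; the loop computes a cross-correlation with zero padding, which is the convention commonly used by deep learning frameworks.

```python
def atrous_conv2d(image, kernel, rate):
    """Single-channel 2D atrous convolution with zero padding (same size out)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    # Kernel taps are spaced `rate` pixels apart.
                    y = i + (a - half) * rate
                    x = j + (b - half) * rate
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * image[y][x]
            out[i][j] = acc
    return out


def effective_kernel_size(k, rate):
    """Side length covered by a k x k kernel at the given atrous rate."""
    return k + (k - 1) * (rate - 1)


# A single bright pixel in the centre of a 7 x 7 image.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
ones = [[1.0] * 3 for _ in range(3)]

for rate in (1, 2, 3):
    out = atrous_conv2d(image, ones, rate)
    touched = [(i, j) for i in range(7) for j in range(7) if out[i][j] != 0.0]
    print(rate, effective_kernel_size(3, rate), len(touched), touched[0], touched[-1])
```

For every rate exactly nine output positions respond to the impulse, but they spread over a 3 × 3, 5 × 5, and 7 × 7 window respectively, which is how the three branches observe different scales with the same number of weights.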
In the A-branch, for the input feature map subset $X_1$ obtained from the split module, an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 1$ is applied to generate the output feature map $U_1$ of a specific scale. For the c-th filter at such a layer, $K = 3$ and $r = 1$ are put into equation (5) to obtain the c-th output feature map channel $U_{1,c}$ of $U_1$:
$$U_{1,c} = F_{(3;1)}(X_1). \quad (6)$$
Further, to take advantage of the information aggregated in the feature filtering module (FFM), the feature map $U_1$ is sent to the FFM. The output of the FFM in the A-branch is denoted as $s_1$.
The final output $V_1$ of the A-branch is obtained by rescaling the feature map $U_1$ with an element-wise multiplication operation:
$$V_1 = F_{mul}(U_1, s_1), \quad (7)$$
where $F_{mul}$ indicates the element-wise multiplication operation.
In the B-branch, to enhance feature propagation, we first fuse the output $V_1$ of the A-branch and the feature map subset $X_2$ obtained from the split module by using an element-wise summation operation. Thus, the fused feature map $W_1$ of $V_1$ and $X_2$ is computed as
$$W_1 = F_{sum}(V_1, X_2) = V_1 + X_2. \quad (8)$$
Then, an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 2$ is applied to generate the output feature map $U_2$. For the c-th filter at such a layer, $K = 3$ and $r = 2$ are put into equation (5) to obtain the c-th output feature map channel $U_{2,c}$ of $U_2$:
$$U_{2,c} = F_{(3;2)}(W_1). \quad (9)$$
Similar to the A-branch, to take advantage of the information aggregated in the feature filtering module (FFM), the feature map $U_2$ is sent to the FFM. The output of the FFM in the B-branch is denoted as $s_2$. The final output $V_2$ of the B-branch is obtained by rescaling the feature map $U_2$ with an element-wise multiplication operation:
$$V_2 = F_{mul}(U_2, s_2). \quad (10)$$
For the C-branch, similar to the B-branch, we first fuse the output $V_2$ of the B-branch and the feature map subset $X_3$ obtained from the split module by using an element-wise summation operation. Thus, the fused feature map $W_2$ of $V_2$ and $X_3$ is computed as
$$W_2 = F_{sum}(V_2, X_3) = V_2 + X_3. \quad (11)$$
Then, an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 3$ is applied to generate the output feature map $U_3$. For the c-th filter at such a layer, $K = 3$ and $r = 3$ are put into equation (5) to obtain the c-th output feature map channel $U_{3,c}$ of $U_3$:
$$U_{3,c} = F_{(3;3)}(W_2). \quad (12)$$
The final output $V_3$ of the C-branch is obtained by rescaling the feature map $U_3$ with an element-wise multiplication operation:
$$V_3 = F_{mul}(U_3, s_3), \quad (13)$$
where $s_3$ is the output of the FFM in the C-branch.

Fusion Module.
As shown in Figure 1, to take advantage of the feature information aggregated in the multiscale branch module (MBM), the outputs of the A-branch, B-branch, and C-branch are forwarded to the fusion module (FM), which is implemented by a concatenation function. The output of the FM is $V$, which is calculated by
$$V = F_{cat}(V_1, V_2, V_3), \quad (14)$$
where $F_{cat}$ denotes the concatenation operation between feature maps.
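The data flow through the split, branch, and fusion stages can be summarized in a short framework-free sketch. Everything here is an assumption for illustration: the function names are hypothetical, the per-branch convolution and FFM are passed in as callables, and the example uses trivial stand-ins (an identity "convolution" and a constant gate of 0.5) so that the wiring, not the learned operators, is what the code demonstrates.

```python
def add_maps(a, b):
    """Element-wise sum of two channel-first feature maps of equal shape."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def scale_channels(u, s):
    """Multiply every channel of u by its scalar weight in s."""
    return [[[v * w for v in row] for row in ch] for ch, w in zip(u, s)]


def mff_forward(x, branch_convs, branch_ffms):
    """Split x into equal channel subsets, run the cascaded branches,
    and concatenate the branch outputs along the channel axis."""
    n = len(branch_convs)
    width = len(x) // n
    subsets = [x[i * width:(i + 1) * width] for i in range(n)]
    outputs, previous = [], None
    for subset, conv, ffm in zip(subsets, branch_convs, branch_ffms):
        # Later branches add the previous branch output to their own subset.
        inp = subset if previous is None else add_maps(previous, subset)
        u = conv(inp)                       # atrous convolution of this branch
        s = ffm(u)                          # channel weights from the FFM
        previous = scale_channels(u, s)     # rescaled branch output
        outputs.append(previous)
    return [ch for out in outputs for ch in out]   # concatenation


# Stand-ins (NOT the real operators): identity conv, constant 0.5 gate.
identity = lambda fmap: fmap
half_gate = lambda fmap: [0.5] * len(fmap)

x = [[[1.0]], [[2.0]], [[4.0]]]            # 3 channels, each 1 x 1
v = mff_forward(x, [identity] * 3, [half_gate] * 3)
print(v)
```

With these stand-ins the first output channel is 0.5 × 1, the second is 0.5 × (0.5 + 2), and the third is 0.5 × (1.25 + 4), which makes the cascade between neighbouring branches visible in the numbers.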

Experimental Results and Analysis
In this section, we describe experiments that study the effectiveness of MFF modules for a range of tasks, datasets, and model architectures. All models are implemented using the PyTorch framework. For image classification tasks, we evaluate all models on the CIFAR-100 and Tiny ImageNet datasets. The objects in the CIFAR-100 and Tiny ImageNet datasets have features at different scales, which makes these datasets suitable for verifying the effectiveness of our proposed MFFNet for UAVs. For benchmarking, we evaluate the single-crop top-1 error rate and adopt the same data augmentation scheme used in [9,27]. We train the network using stochastic gradient descent with momentum 0.9, weight decay 0.0001, and a mini-batch size of 32 on one RTX 2080Ti GPU. For the CIFAR-100 and Tiny ImageNet datasets, every model is trained for 200 epochs. We start with a learning rate of 0.1, which is divided by 10 at epochs 60, 120, and 160.
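The step schedule described above can be written as a small function. The function name and the convention that epochs are counted from zero, with each drop taking effect from the milestone epoch onward, are assumptions; the text does not specify the indexing.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(60, 120, 160), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate at a few representative epochs of the 200-epoch run.
for epoch in (0, 59, 60, 120, 160, 199):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same behaviour is typically obtained with a multi-step scheduler attached to the SGD optimizer rather than a hand-written function.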
For object detection tasks, all models are trained on the PASCAL VOC 2007 and MS COCO datasets with one RTX 2080Ti GPU and a mini-batch size of 2 images. We use a weight decay of 0.0001 and a momentum of 0.9. All models are trained for 80k iterations with a learning rate of 0.002 and then for 30k iterations with a learning rate of 0.0001. Other implementation details are as in [28]. In addition, to verify the effectiveness of our proposed method, we further test MFFNet on the UAV123 dataset, which is captured from a low-altitude aerial perspective.

Experiments on Tiny ImageNet.
We evaluate our method on the Tiny ImageNet dataset, which contains 100k training images, 10k validation images, and 10k test images in 200 classes. Each class has 500 training images, 50 validation images, and 50 test images. An input image is 224 × 224 pixels, randomly cropped from a resized image. We use ResNet-50, ResNet-101, and ResNeXt-50 as representatives of the residual model architecture. In addition, we compare the results with those from the SENet and SKNet model architectures, which are based on attention mechanisms. We compare the single-crop top-1 error rate of each baseline and its MFFNet counterpart on the Tiny ImageNet dataset. As shown in Table 2, MFFNeXt-50 achieves significant performance gains over ResNeXt-50, with a reduction of 3.82% in the error rate. Compared with ResNet-50, MFFNet-50 is better by 1.04%. Meanwhile, SENet-29 (16c × 32d) achieves 33.67% error and MFFNeXt-50 (32c × 4d) achieves 32.76% error. MFFNeXt-50 is better than SKNet-29 (16c × 32d) by 2.27%. In addition, MFFNeXt-50 (32c × 4d) achieves a top-1 error rate of 32.59%, whereas SENet-29 (16c × 32d) needs 26.88% more parameters. The top-1 testing error rate versus the number of epochs for the different architectures is shown in Figure 6. SKNet-29 (16c × 32d) needs 27.45 M parameters, whereas MFFNeXt-50 (32c × 4d) needs only 25.43 M trainable parameters and achieves a higher accuracy. The results show that MFF modules consistently improve the performance of state-of-the-art CNNs.
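For readers unfamiliar with the metric, the top-1 error rate used in these comparisons is the percentage of images whose highest-scoring class differs from the ground-truth label. The sketch below uses a hypothetical function name and made-up scores for four images and three classes; it is not the evaluation code used for Table 2.

```python
def top1_error(scores, labels):
    """Percentage of samples whose arg-max class differs from the label."""
    wrong = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted != label:
            wrong += 1
    return 100.0 * wrong / len(labels)


# Made-up class scores for four images and three classes.
scores = [
    [0.1, 0.7, 0.2],   # predicted class 1
    [0.8, 0.1, 0.1],   # predicted class 0
    [0.3, 0.3, 0.4],   # predicted class 2
    [0.2, 0.5, 0.3],   # predicted class 1
]
labels = [1, 0, 2, 2]  # the last image is misclassified
print(top1_error(scores, labels))
```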

Ablation Studies on CIFAR-100.
To further validate the effectiveness of MFFNet, we undertake ablation studies on the CIFAR-100 dataset. We first evaluate the trade-off between cardinality c and bottleneck width d. Next, we investigate how changes in the complexity of the MFF module affect performance by combining different atrous rates r.

Cardinality versus Width.
To study the effects of the cardinality c and the bottleneck width d, we start from the three-branch case and fix the atrous rates to r = {1, 2, 3}. Table 4 shows the results. MFFNeXt-29 (2c × 64d) has a top-1 error of 19.78%, which is 2.78% lower than that for MFFNeXt-29 (1c × 64d). We can see that as the cardinality c increases from 1 to 4 for constant bottleneck width, the error rate falls. In addition, as the bottleneck width d increases from 24 to 64 for constant cardinality c, the error rate again decreases.

Combinations of Different Atrous Rates.
Next, we investigate combinations of different atrous rates. The atrous rate r is used to control the receptive field size. MFFNet uses 3 × 3 filters with different atrous rates r. To limit the search space, we use only four different atrous rates, r = 1, 2, 3, and 4. To study their effects, we keep the 3 × 3 filter with r = 1 in the first filter branch of the MFF modules and vary the atrous rates of the other branches. Tables 5 and 6 show the top-1 error rate for MFFNeXt-29 (2c × 64d) and MobileNetV2 1× with MFF. We can make three major observations as follows: (

Class Activation Mapping.
To intuitively understand the multiscale representation ability of MFFNet, we visualize the class activation mapping (CAM) using Grad-CAM for different networks. Grad-CAM uses gradients to calculate the importance of the spatial locations in convolutional layers. Figure 8 compares the CAM for representative backbone networks. The areas that have a larger impact on the classification are covered with lighter colors. We can clearly see that the MFFNet-based CAM results tend to focus on the whole object more than those of ResNet.
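The core Grad-CAM computation can be sketched as follows: each feature map channel is weighted by the spatial average of the class-score gradient with respect to that channel, the weighted channels are summed, and negative values are clipped by a ReLU. The function name and the toy activations and gradients are assumptions for illustration; in practice both tensors come from a forward and backward pass through the network, and the map is upsampled to the image size.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from channel-first activations and the gradients
    of the class score with respect to those activations."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: global average of the gradient over spatial positions.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = []
    for i in range(height):
        cam.append([
            max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)
        ])
    return cam


# Toy example: two 2 x 2 channels; the second one votes against the class.
activations = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.0, 2.0], [2.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

Only the locations supported by the positively weighted channel remain in the heat map; locations dominated by the negatively weighted channel are clipped to zero.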

Object Detection.
The PASCAL VOC 2007 and MS COCO datasets contain 20 and 80 object categories, respectively. The PASCAL VOC 2007 dataset has about 5k trainval images and 5k test images. We use the 5k trainval images for training and the 5k test images for validation. The MS COCO dataset has 80k images for training, 40k for validation, and 20k for testing. We use the 80k training set plus a 35k validation subset for training and a 5k validation subset for validation. We adopt Faster-RCNN [28] as our detection method and evaluate the average precision (AP). On the PASCAL VOC 2007 dataset, MFFNet-101 outperforms ResNet-101 by 1.4% on AP. On the MS COCO dataset, we improve ResNet-101 by 1.3%. Table 7 shows that MFFNet-101 has a slightly longer inference latency than ResNet-101 but is more accurate. For instance, on the PASCAL VOC 2007 dataset, we improve the ResNet-101 baseline by 1.4% in AP for only 1.6 ms of additional inference latency. On the MS COCO dataset, MFFNet-101 has an AP of 26.9%, which is 1.3% higher than the ResNet-101 baseline of 25.6%, for only 3.2 ms of additional inference latency.
These results demonstrate the general performance improvement obtained by using MFF modules in object detection. Figure 9 shows detection examples generated with our proposed MFFNet-101 as the backbone network on the UAV123 dataset. It can be seen that our method is able to detect target objects successfully regardless of their shapes, sizes, orientations, and appearances.

Conclusions
To address the multiscale recognition problem in UAV visual perception, this paper establishes a new convolutional network architecture (MFFNet). In MFFNet, the MFF module is designed by employing multiple atrous convolutions at different rates with feature-selective learning ability. The MFF module is implemented via three submodules: the split module (SM), the multiscale branch module (MBM), and the fusion module (FM). In addition, the MFF module can selectively generate channel-wise feature responses by emphasizing channel-wise dependencies. We further explore the effect of the atrous rate on the multiscale representation ability of CNNs. Image classification results on the CIFAR-100 and Tiny ImageNet datasets demonstrate that our proposed method achieves very competitive results. Grad-CAM visualization results demonstrate that the MFFNet-based CAM results tend to focus on the whole object more than other baseline networks do. That is, MFFNet has a stronger multiscale representation ability, which can yield better recognition accuracy for UAVs. Experimental results on the PASCAL VOC 2007, MS COCO, and UAV123 datasets show that our proposed method achieves consistent performance gains in object detection, which is beneficial to expanding the applications of UAVs. We will further explore the effect of multiscale representation on image recognition results in future work.

Data Availability
The detailed mechanism model and model parameters of MFFNet are given in the article. The results are computed on the PyCharm software with the model and given parameters, while the relevant results are also given in the article.

Conflicts of Interest
The authors declare no conflicts of interest.