A UAV Detection and Tracking Algorithm Based on Image Feature Super-Resolution

UAVs are difficult to detect visually at long range, so a UAV detection and tracking algorithm based on image super-resolution is proposed. First, a saliency transformation algorithm is built to focus on the suspected area. Then, a generative adversarial network (GAN) is established on the region of interest (ROI) to super-resolve weak targets and restore the high-resolution details of target features. Finally, a cooperative attention module is built to recognize and track the UAV. Our experiments show that the proposed algorithm has strong robustness.


Introduction
Small UAVs are portable and highly mobile and have broad application prospects in unmanned investigation. However, a long-distance UAV occupies only a limited number of pixels in the image, so discriminating it accurately is of great significance [1]. Existing research on UAVs mainly focuses on target tracking. Ibrahim et al. [2] use a UAV to track moving targets. Zhou et al. [3] construct a Kalman filter to realize UAV tracking. Nodland et al. [4] propose a track optimization strategy. Liu [5] introduces prediction points to assist UAV track detection. Ragi and Chong [6] propose an algorithm to dynamically realize multi-UAV tracking. Yoo and Hong [7] realize UAV detection and tracking by visual means. Kadouf and Mustafah [8] analyze the color characteristics of UAVs to realize UAV tracking. Yu et al. [9] propose a new coordinate system to analyze the attitude of a UAV. Teuliere et al. [10] build a 3D model to track a UAV flying indoors. Choi and Kim [11] use monocular vision to analyze the UAV track. Quintero et al. [12] use an output-feedback model to predict the UAV flight trajectory. Santos et al. [13] build a ground visual tracking system to detect UAVs. Zhou et al. [14] construct a Hough transform to detect and track UAVs. Vetrella et al. [15] realize dynamic navigation through a multi-UAV network. Elloumi et al. [16] propose a low-power tracking algorithm from the perspective of UAV energy. Greatwood et al. [17] use parallel processing to realize rapid detection and tracking of UAVs. Santos et al. [18] propose a 3D model for UAV positioning. Zhang et al. [19] use deep learning to build a coarse-to-fine detection algorithm to realize UAV detection and tracking. Huang et al. [20] build correlation filters based on deep learning to realize UAV tracking. Rabah et al. [21] build a model based on fuzzy set theory to realize target tracking. Kokunko and Krasnova [22] propose a variable constraint mechanism to realize UAV tracking. Li et al. [23] propose augmented memory for correlation filters to realize UAV tracking. Li et al. [24] use a deep learning network to extract features and realize UAV object tracking. Moon et al. [25] realize multi-UAV tracking based on a deep learning network.
Through the above analysis, the main problems of existing research on UAV detection and tracking are as follows. (1) Traditional target detection algorithms cannot be effectively applied to UAVs because of their small size. (2) The number of pixels a UAV occupies in the image is limited, and its features are not obvious. (3) The flight uncertainty of UAVs makes it difficult to construct a unified model. Therefore, on the basis of images, (1) a complete UAV detection and tracking process is proposed, (2) an enhancement algorithm is built to improve the spatial resolution of the target, which offers a new idea for UAV detection, and (3) according to the principle of visual perception, a depth-based attention model is built to focus on the area where the UAV is located to realize tracking.

Algorithm
According to the characteristics of UAV images, a UAV detection and tracking algorithm is proposed based on image super-resolution, as shown in Figure 1. First, the image composition is analyzed and the suspected area extraction module is constructed. Then, a super-resolution model is constructed to highlight local information and increase signal strength. Finally, a deep learning module is constructed to realize UAV tracking based on the attention mechanism.

ROI Extraction.
The UAV is usually far away, so it occupies only a limited number of pixels in the image, which makes detection in the spatial domain difficult. Therefore, the salient area is introduced to extract the ROI. First, the features are obtained by linear difference, where $p_1$ is the feature, $O_b$ is the third-order gradient matrix, $f_r$ is the activation function, and $R(f_\theta)$ is the convolution performed with parameters $\theta$. The salient target area model is constructed by supervised learning, and the image is described as $I_f(y_j = 1 \mid M_i^t)$, where $y_j$ represents the reliability of pixel $j$. In the corresponding cross-entropy loss function, $X_+$ and $X_-$ represent the edge and background pixels of the salient target area, respectively. Image segmentation is realized through the edge of the salient target area, but the resolution, affected by the multilayer transmission architecture, gradually decreases from layer to layer and needs to be strengthened, where $f_h(C_{i+1})$ represents the enhancement processing of the features of the previous layer. A cumulative loss function is computed between the corresponding estimation result and the real image. The idea of edge enhancement is used to suppress the weakening caused by multiple layers in the process of feature fusion, in order to achieve a more accurate estimate of the salient target area. During the convolution operation, a feedback mechanism is introduced that continuously feeds the edge features and salient target area features of the image back into the convolution to obtain a new estimation map function, where $U(\cdot)$ is deconvolution, $I_p(x, y)$ is the prior map, and $q$ is the convolution parameter.
Divergence can occur during feedback correction. To address this, any two adjacent pixels can be used as a path in $I(x, y)$. Let all paths corresponding to $n$ pixels be expressed as $q = \{q_0, \ldots, q_n\}$. Then, after the salient target edge is obtained, the loss function $L(x)$ is calculated. Let the width and height of $L(x)$ be $w$ and $h$; the mapping matrix $M = [m_{ij}]_{w \times h}$ can then be constructed. According to the mapping matrix, the position of the salient target can be determined. On the basis of feedback correction with prior information, the position of the edge of the salient target can be distinguished, which prevents the performance degradation of the feedback correction and realizes the suspected area extraction.
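As a rough illustration of the last step, the sketch below locates a salient target from a mapping matrix by taking the tightest box around its high-valued entries. The function name `roi_from_mapping`, the row-major list-of-lists layout, and the 0.5 threshold are assumptions made for this example; the text does not specify how the position is read off $M$.

```python
def roi_from_mapping(mapping, threshold=0.5):
    """Return (top, left, bottom, right) of the tightest box around
    entries >= threshold, or None when no entry qualifies.

    `mapping` is a hypothetical row-major list of rows holding the m_ij values.
    """
    rows = [i for i, row in enumerate(mapping) if any(v >= threshold for v in row)]
    cols = [
        j
        for j in range(len(mapping[0]))
        if any(row[j] >= threshold for row in mapping)
    ]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


# Toy mapping matrix with one salient blob.
example_mapping = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.8, 0.0],
    [0.0, 0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
example_roi = roi_from_mapping(example_mapping)
```

A single box is enough for one target; several separated targets would need connected-component labelling first, which is outside this sketch.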

ROI Super-Resolution Reconstruction.
The motion area has been extracted in the previous section. On this basis, ROI super-resolution reconstruction is carried out in this section to further determine whether a UAV is included. Because the target area is smaller than the background area, enlarging the target area also enlarges the background, resulting in coarse-grained information after over-division. Therefore, we design a feature super-resolution GAN that transforms the weak target features into super-resolution features through super-resolution processing in the feature space and enhances the feature representation of the weak target. The structure is shown in Figure 2.
Let the original input image be $X_I$. The image $X_{0.5I}$ is obtained by 2× downsampling, which yields a pair of low-resolution and high-resolution target features. The output features are obtained through the feature network. The weak target features $F^q_{0.5I}$ are iteratively transformed into super-resolution features $S^q_{0.5I}$ so that they are as similar as possible to the features $T^q_I$ output by the supervisor. A feature loss function and a generator loss function are defined for this purpose. The feature super-resolution discriminator adopts a three-layer perceptron trained to distinguish $S^q_{0.5I}$ from $T^q_I$, with a corresponding discriminator loss function. The feature super-resolution supervisor extracts the high-resolution target feature $T_I$ similar to the low-resolution input feature as the supervision signal for super-resolution model training, in order to enhance the stability of training and improve the quality of super-resolution. To avoid inconsistency between the receptive fields of low-resolution and high-resolution features, a feature extraction network that shares parameters with the backbone is designed to extract the $q$th feature $T^q_I$, which is more suitable for training the super-resolution model without adding parameters.
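The pairing of a full-resolution image with its 2× downsampled copy, and the idea of pulling the super-resolved features toward the supervisor's features, can be sketched as follows. Both choices here are assumptions for illustration: 2×2 block averaging stands in for the unspecified downsampling, and a mean squared difference stands in for the feature loss, whose exact form is not given in the text. The names `downsample_2x` and `feature_l2_loss` are hypothetical.

```python
def downsample_2x(image):
    """Average each non-overlapping 2x2 block of a 2D list.

    Assumes the height and width are even; block averaging is one of
    several possible 2x downsampling schemes.
    """
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def feature_l2_loss(sr_features, target_features):
    """Mean squared difference between two equally long feature vectors.

    Used here as a stand-in for the distance between S^q_{0.5I} and T^q_I.
    """
    if len(sr_features) != len(target_features):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(sr_features, target_features)]
    return sum(squared) / len(squared)


x_full = [[1.0, 3.0, 5.0, 7.0], [1.0, 3.0, 5.0, 7.0]]
x_half = downsample_2x(x_full)            # the low-resolution counterpart
example_loss = feature_l2_loss([1.0, 2.0], [1.0, 4.0])
```

In a real GAN this distance would be combined with the adversarial terms; only the feature-matching part is shown.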

UAV Tracking.
Through the online recognition network, we can enhance the discriminative power of the classifier to distinguish the target from other interfering objects in the background, minimize the false detection rate, and complete the rough positioning of the target. We use a depth regression structure to construct the network, as shown in Figure 3, which uses a two-layer convolutional neural network, where $x$ is the feature map generated by the feature super-resolution adversarial network, $w_1$ and $w_2$ are the weights of the convolution layers, $*$ is the convolution operation, and $\varphi$ is the activation function. In the loss function, $m$ is the total number of feature map samples, $c_j$ is the learning weight, $y_j$ is the regression classification confidence of each feature sample $x_j$, and $\lambda_k$ is a regularization term. The Gauss-Newton algorithm is used to solve the problem. We transform the target tracking task into a similarity measurement problem and take the first frame $z$ and the candidate region $x$ of subsequent frames as the input images of the template branch and detection branch, respectively. A weight-sharing feature extraction network $\varphi(\cdot)$ maps both inputs to the feature space, and the metric function $f(z, x)$ is learned to compare the similarity between the template image and the candidate search image. Finally, the response map is returned. To highlight the importance of different spatial positions, a spatial collaborative attention module is designed.
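The similarity measurement between the template and the candidate region can be illustrated with a sliding cross-correlation, which is the metric commonly used in Siamese trackers. The text does not state the exact form of $f(z, x)$, so this choice is an assumption, and the single-channel features and the helper names `cross_correlate` and `argmax_2d` are hypothetical simplifications.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2D lists) and return the
    response map of inner products, one value per valid offset."""
    search_h, search_w = len(search), len(search[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    response = []
    for r in range(search_h - tmpl_h + 1):
        row = []
        for c in range(search_w - tmpl_w + 1):
            score = sum(
                search[r + i][c + j] * template[i][j]
                for i in range(tmpl_h)
                for j in range(tmpl_w)
            )
            row.append(score)
        response.append(row)
    return response


def argmax_2d(response):
    """Return the (row, col) offset of the largest response value."""
    _, row, col = max(
        (value, r, c)
        for r, line in enumerate(response)
        for c, value in enumerate(line)
    )
    return row, col


# Toy features: the target pattern sits at offset (1, 2) in the search region.
template_feat = [[1.0, 1.0], [1.0, 1.0]]
search_feat = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
response_map = cross_correlate(search_feat, template_feat)
peak = argmax_2d(response_map)
```

The peak of the response map gives the rough target position; multi-channel features would sum the same product over channels.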
Channel attention models the dependencies between channels, learns the association between features at the semantic level, optimizes features, activates the feature channels more related to the target, and removes redundant features. Suppose that the feature maps extracted by the template branch and detection branch of the MobileNetV2 network are $\varphi(z)$ and $\varphi(x)$. A typical MobileNetV2 network template is shown in Figure 4. The global information of each channel is obtained by global average pooling, which condenses the spatial dimensions to provide salient target features. The results are passed through an input layer, a hidden layer, and an output layer. We reduce the number of channels in the hidden layer to 1/16 of the input layer. The outputs are the channel attention weights $A_c\{\varphi(z)\}$ and $A_c\{\varphi(x)\}$, from which the channel attention feature map is obtained. In the collaborative attention module, the encoding of each branch is integrated into the other branch to make full use of the background information. To facilitate matrix multiplication with the features, the module outputs the collaborative attention weights $A\{\varphi(x)\}$ and $A\{\varphi(z)\}$. After passing through the channel attention and collaborative attention modules, the weights of the two branches are fused to obtain $\varphi'(z)$ and $\varphi'(x)$.
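The channel attention path (global average pooling, a hidden layer with 1/16 of the channels, then one weight per channel) resembles a squeeze-and-excitation block, and the sketch below follows that pattern. The ReLU and sigmoid nonlinearities, the fixed toy weight matrices, and the name `channel_attention` are assumptions for illustration; in the network these weights would be learned.

```python
import math


def channel_attention(features, w_hidden, w_out):
    """Reweight channels of `features` (a list of C 2D lists).

    w_hidden has shape (C // 16) x C and w_out has shape C x (C // 16).
    """
    # Global average pooling condenses each channel to a single number.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Hidden layer with fewer units than channels (ReLU is an assumption).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_hidden]
    # Output layer: one weight in (0, 1) per channel (sigmoid is an assumption).
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_out
    ]
    reweighted = [
        [[value * a for value in line] for line in ch]
        for ch, a in zip(features, weights)
    ]
    return weights, reweighted


channels = 16
toy_features = [[[float(k)]] for k in range(channels)]   # 16 channels of size 1x1
toy_w_hidden = [[1.0 / channels] * channels]              # 16 -> 1 hidden unit
toy_w_out = [[0.1 * k] for k in range(channels)]          # 1 -> 16 outputs
attn_weights, attn_features = channel_attention(toy_features, toy_w_hidden, toy_w_out)
```

With 16 input channels the 1/16 reduction leaves a single hidden unit, which keeps the example small while matching the stated ratio.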
Spatial attention describes position: it models the relationship between different positions in the feature map and supplements channel attention through position-weighted fusion. The feature map $\varphi'(\cdot)$ is compressed along the channel dimension to obtain the spatial attention weight $A_s(\cdot)$, from which the final attention feature map is obtained. The feature maps of each layer output by the template branch and detection branch are normalized by adjusting the convolution operations of the layers so that they have uniform resolution and the same number of channels.
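A minimal sketch of the spatial attention step is given below, assuming that the compression along the channel dimension is a plain channel mean followed by a sigmoid. Practical modules often add a small convolution before the sigmoid; that step is omitted, and the name `spatial_attention` is hypothetical.

```python
import math


def spatial_attention(features):
    """Compute one weight per spatial position from a C x H x W nested list
    and return (weights, reweighted features)."""
    num_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    # Compress along the channel dimension, then squash to (0, 1).
    weights = [
        [
            1.0
            / (
                1.0
                + math.exp(
                    -sum(features[k][i][j] for k in range(num_channels)) / num_channels
                )
            )
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Every channel is scaled by the same position-wise weight.
    attended = [
        [[features[k][i][j] * weights[i][j] for j in range(width)] for i in range(height)]
        for k in range(num_channels)
    ]
    return weights, attended


toy_map = [[[0.0, 2.0]], [[0.0, 4.0]]]   # 2 channels, 1 x 2 positions
pos_weights, attended_map = spatial_attention(toy_map)
```

Positions with stronger average activation receive larger weights, so they dominate the fused feature map.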
Let the adjusted feature maps of the $q$th layer input to the classification branch be $\varphi^q_{cls}(z)$ and $\varphi^q_{cls}(x)$, and let those input to the regression branch be $\varphi^q_{reg}(z)$ and $\varphi^q_{reg}(x)$. Finally, the outputs are weighted and fused. When each candidate area is classified as foreground or background, the same target may appear in multiple overlapping rectangular boxes at the same time, so non-maximum suppression (NMS) is used to eliminate the duplicates and track the target accurately.
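The elimination of overlapping boxes can be sketched with the standard greedy NMS procedure. The box format `(x1, y1, x2, y2)` and the 0.5 overlap threshold are assumptions for this example, since the text does not give the settings used.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap a
    kept box by more than iou_threshold, and repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


candidate_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
candidate_scores = [0.9, 0.8, 0.7]
kept_indices = nms(candidate_boxes, candidate_scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box survives.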

Experiment and Result Analysis
The experimental data include 20 groups of UAV visible-light sequences recorded from far to near, as shown in Figure 5, with an image resolution of 1024 × 1024.
Based on the network structure, the image is normalized to 512 × 512 to ensure that small targets are not lost. On a Windows 10 system with an Intel Core i5-6500 CPU at 3.20 GHz, the proposed program runs at 8 frames/s, which does not meet practical requirements. However, because the target moves continuously, the images can be processed at an interval of 1 frame, which achieves near real-time operation speed.
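The effect of frame skipping on throughput is simple arithmetic, sketched below. It assumes that "an interval of 1 frame" means one frame is skipped after each processed frame; the function name `max_stream_fps` is hypothetical, and the camera frame rate is not stated in the text.

```python
def max_stream_fps(processing_fps, frames_skipped):
    """Highest input frame rate the tracker can keep up with when it
    processes one frame and then skips `frames_skipped` frames."""
    return processing_fps * (frames_skipped + 1)


# 8 processed frames/s with one skipped frame in between.
supported_fps = max_stream_fps(8, 1)
```

Under this reading the 8 frames/s pipeline keeps pace with a 16 frames/s stream, at the cost of updating the track only on every second frame.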

ROI Extraction Algorithm Performance.
We introduce the following indicators to measure the algorithm performance [26], as shown in Table 1, where SEN reflects the detection performance of the algorithm for real targets, SPE reflects the detection performance of the algorithm for false targets, ACC reflects the ratio of correct test results to all samples, and FPF reflects the proportion of false targets detected as true targets. The experimental results are shown in Table 2. The color model built in [8] detects UAVs well at close range and with obvious color characteristics. However, for long-distance UAVs, color information is difficult to obtain, and the results are poor. Based on the target composition structure, Zhou et al. [14] construct the model through texture features, which is more stable than the color model and significantly improves the results. However, for specific UAVs, the detection rate is limited. Zhang et al. [19] construct a deep learning network based on the deep reinforcement learning (DRL) model to realize ROI detection. It is the current mainstream approach to target detection, and the performance is further improved. However, this process needs to traverse the whole image, so the computational cost is high and targets of different scales are not handled well. Our proposed algorithm introduces the saliency region (SR) and focuses on the region of interest step by step. It conforms to the principle of visual perception and uses prior knowledge to extract the ROI. ACC reaches 95%. For very small areas, there is still a risk of missed detection. On the basis of SR, a super-resolution module is added to establish the relationship between low-resolution and high-resolution target features, which further improves the detection of small targets.
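The four indicators can be computed from confusion-matrix counts as sketched below. The formulas are the standard definitions of sensitivity, specificity, accuracy, and false positive fraction and are assumed to match Table 1, which is not reproduced in this text. The counts in the example are invented for illustration and are not results from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return SEN, SPE, ACC, and FPF from true/false positive/negative counts."""
    return {
        "SEN": tp / (tp + fn),                   # real targets that were detected
        "SPE": tn / (tn + fp),                   # false targets correctly rejected
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # correct results over all samples
        "FPF": fp / (fp + tn),                   # false targets reported as true
    }


# Illustrative counts only.
example_metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Note that FPF equals 1 minus SPE under these definitions, so the two columns carry the same information.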

UAV Tracking Algorithm Performance.
We introduce the tracking success rate curve to show the algorithm performance intuitively, as shown in Figure 6. For short-range UAV tracking, all algorithms achieve good results because the target appears at high resolution in the image. As the distance increases, the performance of every algorithm decreases. The decline is most obvious for the Kalman filter algorithm: because the target size changes greatly in the image, tracking is easily affected by the surrounding environment. The fuzzy set method constructs a segmentation algorithm from the difference between UAV and background characteristics to realize target tracking and has a certain robustness to target size. When the UAV flies at a low, uniform speed, model updating is stable, and all algorithms track well. However, during turning flight or sudden acceleration or deceleration, the Kalman filter and fuzzy set algorithms lose the target because of the limited model updating speed. Augmented memory for correlation filters achieves good results in analyzing short-time flight states. The deep network builds its model from UAV information learned from a large number of training samples, but it does not use time-dimension information, which limits the algorithm. In contrast, the proposed algorithm introduces the spatial collaborative attention module to focus on the target hierarchically; its success rate decreases slowly with distance, and its performance is better than that of the other algorithms.
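One way to build such a success rate curve is to group frames by target distance and count the fraction of frames in each group whose predicted box overlaps the ground truth sufficiently. This is a sketch under assumptions: the text does not define the success criterion or the horizontal axis of Figure 6, so the 0.5 overlap threshold, the distance bins, and the name `success_rate_by_distance` are hypothetical.

```python
def success_rate_by_distance(records, bin_edges, overlap_threshold=0.5):
    """records: iterable of (distance, overlap) pairs, one per frame.

    Returns one success rate per [lo, hi) distance bin, or None for a bin
    that contains no frames.
    """
    rates = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        overlaps = [ov for dist, ov in records if lo <= dist < hi]
        if overlaps:
            hits = sum(1 for ov in overlaps if ov > overlap_threshold)
            rates.append(hits / len(overlaps))
        else:
            rates.append(None)
    return rates


# Illustrative frames: (distance in metres, overlap with ground truth).
toy_records = [(10, 0.9), (20, 0.8), (60, 0.7), (70, 0.2), (120, 0.1)]
toy_curve = success_rate_by_distance(toy_records, [0, 50, 100, 150])
```

Plotting one such curve per tracker makes the rate at which each method degrades with distance directly comparable.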

Conclusion
To address the difficulty of visually detecting and tracking long-distance UAVs, a complete detection process for weak and small UAVs is proposed from the perspective of visual cognition. The ROI is focused step by step, and a GAN is established according to the idea of image super-resolution. The target details are restored to highlight the characteristics of weak targets, and a collaborative attention module is built to identify and track UAVs. The algorithm can be applied to fixed cameras, where the UAV region can be further determined by the difference between frames. It can also be applied to mobile cameras, where the UAV area is focused according to the saliency area. It can provide a new idea for the detection and recognition of weak and small targets.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.