Research on Target Tracking Algorithm of Micro-UAV Based on Monocular Vision

Aiming at the problem of limited payload and endurance of the micro-UAV, a target tracking algorithm based on monocular vision is proposed. Since monocular vision cannot directly measure the distance between the UAV and the target, triangulation and triangle similarity are used to calculate the distance information. Then, a target tracking method based on the Kalman filter and KCF is designed. The tracking result of KCF is corrected by the Kalman filter to solve the problem of target occlusion. Finally, the position of the target in the world coordinate system is calculated through the coordinate transformation matrix and is used to control the UAV to track the moving target. To verify the feasibility of the algorithm, target size estimation and target tracking experiments are carried out. The experimental results show that the proposed algorithm can track the moving target effectively under short-term occlusion.


Introduction
In recent years, with the rapid development of vision technology, communication technology, and flight control technology, unmanned aerial vehicles (UAVs) have been widely used in real-time monitoring, investigation, traffic control, and civil photography [1][2][3][4][5]. According to their dimensions, UAVs can be classified into micro-UAVs, small UAVs, and large UAVs. Due to its small size, light weight, good mobility, and strong concealment, the micro unmanned aerial vehicle (MAV) has unique advantages in target tracking [6]. However, the MAV is limited by payload and endurance, which makes it impossible to carry a large computer or bulky detection sensors. How to design accurate and robust target tracking algorithms for the MAV platform is therefore an active research problem.
Compared with other detection sensors, cameras with optical sensors as the core component provide more intuitive feedback on environmental information and key points. Moreover, the camera is low in cost and light in weight, so it has great potential in the field of target tracking. Such cameras can be classified into monocular cameras, binocular cameras, and depth cameras according to the sensors they carry, and all of these types have been used in target tracking. Aiming at the problem of inaccurate depth images caused by UAV jitter, Naseer et al. at the Technical University of Munich proposed carrying a depth camera, a monocular camera, and other sensors on the UAV system at the same time. The team used the monocular camera and label positioning methods to help the depth camera obtain accurate depth image information for human motion tracking [7]. However, the system is currently only suitable for indoor environments and small-scale movements. Liu et al. proposed using a UAV equipped with a three-axis pan-tilt to track the target, which could filter the noise caused by UAV jitter and expand the field of view [8]. However, due to its large size, the three-axis pan-tilt cannot be carried on a MAV.
Target tracking algorithms can be divided into generative methods and discriminative methods. The generative methods focus only on the target features, ignore the background information, and match the detected images against an established target model. Discriminative methods find the optimal region in the next frame by training a classifier. The generative methods assume that the target features remain constant for a period, so they cannot track the target in complex situations. Discriminative methods based on correlation filters and deep learning can adapt to complex application scenarios.
In [9], the researchers used correlation filtering to track the target and presented the minimum output sum of squared error (MOSSE) algorithm. The tracking speed of this algorithm can exceed 600 frames per second, and it is robust to changes in illumination and target shape. Henriques et al. presented the Kernelized Correlation Filter (KCF), which replaces the gray features of the original filtering method with histogram of oriented gradients (HOG) features [10]. Furthermore, the nonlinear classification problem is mapped to a high-dimensional space to make it linearly separable, and the computational complexity is reduced by applying kernel functions and the diagonalization property of circulant matrices. To address the boundary effect, Danelljan et al. presented the spatially regularized discriminative correlation filters (SRDCF) algorithm [11]. In [12], the researchers used real shifts to generate negative samples, used real samples to train the filters, and expanded the search area to improve the tracking effect. However, the algorithm easily loses the target when the appearance of the target changes greatly. To further improve the performance of correlation filter tracking, many algorithms extract deep features to represent the target [13,14]. Although the tracking effect is improved, correlation filter algorithms based on deep features are slow and not suited to the computing resources of the UAV platform. Aiming at the background noise generated by the UAV in flight, Huang et al. presented the aberrance repressed correlation filter (ARCF) algorithm, and the experimental results show that ARCF performs well on most UAV data sets [15]. However, it is difficult for ARCF to deal effectively with tracking failure caused by target occlusion and size change.
With the rise of deep neural networks, deep learning has received extensive attention in the field of target tracking. Convolutional neural networks have strong target representation ability because of the deep features obtained by learning, which are gradually replacing traditional handcrafted features. They have been introduced into the target tracking task and have made great progress [16][17][18]. The Siamese instance search tracker (SINT) uses a Siamese neural network to measure the similarity between template images and search images, which provides a new idea for target tracking [19]. To solve the problem of poor real-time performance of deep learning in target tracking, Bertinetto et al. proposed the fully-convolutional Siamese network (SiamFC) algorithm [20]. Due to the complex network structure of deep learning tracking algorithms, it is difficult to achieve both speed and accuracy. In [21], the researchers presented the Siamese region proposal network (Siam-RPN) tracking algorithm. Due to the limited data set, the training quality of the Siam-RPN network is not high. Aiming at the tracking accuracy problem of Siam-RPN, Yu et al. presented the distractor-aware Siamese region proposal network (DaSiamRPN) tracking algorithm based on Siam-RPN, which improved the anti-interference and discrimination ability of tracking and achieved a tracking speed of 160 frames per second [22]. Although deep learning tracking algorithms have made great progress, the lack of training samples makes it difficult to train high-quality neural networks for different tracking scenarios. In addition, deep neural networks place very high demands on computer hardware resources, which also limits their application on the MAV platform.
In summary, MAV target tracking mainly faces the following challenges: (1) Limited by the structural characteristics of the MAV, ensuring target tracking accuracy while reducing the complexity of the algorithm is a key problem that needs to be resolved. (2) During the flight of the UAV, airframe jitter may cause camera shake, target blur, and other problems. In addition, there may be short-term obstacles between the UAV and the target, which lead to target drift and loss in tracking. Therefore, it is difficult for the UAV to achieve stable and robust tracking.
This paper proposes a target tracking algorithm for the MAV based on monocular vision to solve the abovementioned problems. Firstly, because a monocular camera cannot measure the depth between the UAV and the tracked target, an initialization method based on triangulation is proposed to measure the target size. Then, the triangle similarity method is applied to estimate the depth between the target and the camera, which overcomes the two-dimensional limitation of the monocular camera. Secondly, aiming at the deficiencies of the KCF algorithm, a target tracking algorithm based on the fusion of the Kalman filter and KCF is proposed. The tracking results of KCF are corrected by the Kalman filter to improve the tracking accuracy and robustness. Finally, the position of the target in the world coordinate system is calculated by the coordinate transformation matrix and is used as the expected position input to control the UAV to track the moving target.

System Architecture
To perform the tracking task, the UAV carries a monocular camera for image acquisition. Because an optical flow sensor can measure the horizontal velocity of the UAV, it is used to achieve fixed-point flight indoors, and it can also be used in conjunction with GPS in outdoor environments. In addition, the Nvidia Jetson Nano is used as the onboard computer; its quad-core ARM Cortex-A57 CPU and 4 GB RAM fully meet the experimental requirements, and its compact size of 100 mm × 80 mm × 29 mm fits the UAV well. For flight control, the UAV uses the Holybro Pixhawk 4 as the attitude control unit. Its PX4 firmware can run in Offboard mode and execute control instructions from the upper-level computer. The UAV target tracking system is shown in Figure 1.
Concerning software, the Robot Operating System (ROS) is installed on the onboard computer to establish communication between multiple nodes, tasks, and processes. The software mainly includes the following modules:

State Estimation of the Target
The prerequisite for target tracking is to estimate the position and motion information of the target. A discriminative target tracker is used to generate the 2D motion information of the target in the image, and a Kalman filter is then established to fuse this 2D motion information and obtain the final target tracking result.

The KCF Target Tracking Algorithm.
The KCF (kernelized correlation filters) algorithm is a discriminative target tracking algorithm based on online learning. The initial frame is used to generate training sample sequences through cyclic shifts of the circulant matrix. A classifier is trained by ridge regression and used to detect the target, and the area with the largest response is taken as the target area.
The KCF algorithm needs to generate multiple virtual samples through the circulant matrix during tracking, and training the classifier involves many matrix inversions. To reduce this cost, the algorithm exploits the property that a circulant matrix can be diagonalized and applies the discrete Fourier matrix to diagonalize the sample set. Because operations on a diagonal matrix only involve the nonzero elements on the diagonal, the CPU and memory usage is greatly reduced. In addition, the KCF algorithm introduces the Gaussian kernel function to map the nonlinear problem to a high-dimensional space where it becomes linear, which greatly improves the calculation speed and meets the demands of the MAV for fast response and lightweight computation during tracking. The algorithm procedure is shown in Figure 3.
To obtain more training samples, a training sample set is generated by the circulant matrix. The $n \times 1$ vector $x = [x_1, x_2, \ldots, x_n]^T$ is used as the base sample, and the sample vector $x$ is cyclically shifted by the permutation matrix $L$ to produce $n$ samples. The training sample set of the current frame is formulated as follows:
$$X = C(x) = \left[x,\ Lx,\ L^2 x,\ \ldots,\ L^{n-1} x\right]^T \tag{1}$$
The permutation matrix $L$ is defined as follows:
$$L = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} \tag{2}$$
In order to improve the calculation speed, the discrete Fourier matrix is used to diagonalize the sample set as follows:
$$X = F\,\mathrm{diag}(\hat{x})\,F^H \tag{3}$$
where $\hat{x}$ is the discrete Fourier transform of the base sample $x$, $\mathrm{diag}(\hat{x})$ is the diagonal matrix formed from $\hat{x}$, $F$ is the Fourier matrix, and $F^H = (F^*)^T$ is its complex conjugate transpose.
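As a quick numerical check of the diagonalization property, the following NumPy sketch builds the circulant sample matrix from the cyclic shifts of a small base sample and rebuilds it from the DFT of that sample. The function names `circulant_from_shifts` and `dft_matrix` and the example vector are illustrative assumptions; the sketch also assumes a unitary Fourier matrix and NumPy's FFT sign convention.

```python
import numpy as np


def circulant_from_shifts(x):
    """Stack the n cyclic shifts of x as rows (row u is x shifted right by u)."""
    return np.stack([np.roll(x, u) for u in range(len(x))])


def dft_matrix(n):
    """Unitary DFT matrix F (assumed normalization, so F @ F^H is the identity)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)


x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical base sample
X = circulant_from_shifts(x)         # n x n training sample set
F = dft_matrix(len(x))
x_hat = np.fft.fft(x)                # DFT of the base sample

# Rebuild the sample set from its diagonal form: X = F diag(x_hat) F^H
X_rebuilt = F @ np.diag(x_hat) @ F.conj().T
print(np.allclose(X_rebuilt, X))
```

Because only the diagonal entries of `np.diag(x_hat)` are nonzero, products and inverses involving the sample set reduce to element-wise operations on `x_hat`, which is where the savings in computation and memory come from.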
The classifier $f(z) = \omega^T z$ is built with the ridge regression model, where $z$ is the candidate sample. The goal is to minimize the squared error over the training samples $x_i$ and regression targets $y_i$, which can be written as follows:
$$\min_{\omega} \sum_{i} \left(f(x_i) - y_i\right)^2 + \lambda \|\omega\|^2 \tag{4}$$
where $\omega$ is the weight coefficient of the classifier and $\lambda$ is the regularization coefficient. The regularization term $\lambda\|\omega\|^2$ improves the generalization ability of the classifier and controls overfitting. By setting the partial derivative with respect to $\omega$ to zero, the expression of $\omega$ is as follows:
$$\omega = \left(X^T X + \lambda E\right)^{-1} X^T y \tag{5}$$
where $E$ is the identity matrix and $y$ is the column vector composed of the regression label $y_i$ of each sample. Equation (5) is extended to the complex field, which can be written as follows:
$$\omega = \left(X^H X + \lambda E\right)^{-1} X^H y \tag{6}$$
Using the diagonalization property of the circulant matrix, equation (6) can be represented in the frequency domain as follows:
$$\hat{\omega} = \frac{\hat{x}^* \odot \hat{y}}{\hat{x}^* \odot \hat{x} + \lambda} \tag{7}$$
where $\hat{\omega}$ and $\hat{y}$ represent the Fourier transforms of $\omega$ and $y$, respectively, $\hat{x}^*$ represents the complex conjugate of $\hat{x}$, and $\odot$ and the fraction denote element-wise operations.
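The step from the ridge regression objective to its closed-form solution can be spelled out as a short sketch. It assumes the standard matrix form of the objective, in which the rows of $X$ are the training samples $x_i^T$ and $J(\omega)$ is a name introduced here for the cost being minimized.

```latex
\begin{aligned}
J(\omega) &= \|X\omega - y\|^2 + \lambda\|\omega\|^2 \\
\frac{\partial J}{\partial \omega} &= 2X^T(X\omega - y) + 2\lambda\omega = 0 \\
\left(X^T X + \lambda E\right)\omega &= X^T y \\
\omega &= \left(X^T X + \lambda E\right)^{-1} X^T y
\end{aligned}
```

Since $X^T X$ is positive semidefinite, any $\lambda > 0$ makes $X^T X + \lambda E$ invertible, so the solution always exists.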
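As target tracking is a nonlinear problem, the sample $x$ can be mapped to a high-dimensional space through the mapping function $\varphi(x)$ to make the nonlinear problem linearly separable. The weight coefficient $\omega$ of the classifier can be expressed as follows:
$$\omega = \sum_{i} \alpha_i\, \varphi(x_i) \tag{8}$$
where $\alpha_i$ is the linear combination coefficient, and the kernel function $k$ is defined as follows:
$$k(x, x') = \varphi^T(x)\, \varphi(x') \tag{9}$$
The $n \times n$ kernel matrix $K$ composed of the kernel functions between the samples is expressed as follows:
$$K_{ij} = k(x_i, x_j) \tag{10}$$
Then, the ridge regression function can be expressed as follows:
$$f(z) = \omega^T \varphi(z) = \sum_{i=1}^{n} \alpha_i\, k(z, x_i) \tag{11}$$
The expression of $\alpha$ can be derived as follows:
$$\alpha = (K + \lambda E)^{-1} y \tag{12}$$
where $\alpha$ is the coefficient vector composed of $\alpha_i$. The Fourier transform of equation (12) can be expressed as follows:
$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda} \tag{13}$$
where $\hat{\alpha}$ is the Fourier transform of $\alpha$ and $\hat{k}^{xx}$ is the Fourier transform of the first row of the matrix $K$.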
After training the classifier with the samples obtained from the circulant matrix, the target can be detected and located. First, the kernel matrix $K^z$ between the sample $x$ and the candidate sample $z$ is calculated to match the position results:
$$K^z = C(k^{zx}) \tag{14}$$
where $C(k^{zx})$ represents the circulant matrix of the vector $k^{zx}$. The regression function of the candidate sample is as follows:
$$f(z) = (K^z)^T \alpha \tag{15}$$
Equation (15) is converted into the frequency domain, which can be expressed as follows:
$$\hat{f}(z) = \hat{k}^{zx} \odot \hat{\alpha} \tag{16}$$
In particular, the Gaussian kernel $k(x, x') = \exp\left(-\frac{1}{\sigma^2}\|x - x'\|^2\right)$ is selected as the kernel function, and the Gaussian kernel correlation can be obtained as follows:
$$k^{xx'} = \exp\left(-\frac{1}{\sigma^2}\left(\|x\|^2 + \|x'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{x}^* \odot \hat{x}'\right)\right)\right) \tag{17}$$
By working in the Fourier domain, the matrix inversion process is avoided. The time complexity of the algorithm is reduced from O(n^2) to O(n log n), which realizes fast detection and reduces the dependence on computer performance.
Figure 3: KCF target tracking algorithm flow chart.
Journal of Robotics
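The training and detection steps can be illustrated with a minimal one-dimensional sketch. It is not the implementation used in the experiments: it works on a raw 1-D signal instead of HOG features, and the names `gaussian_correlation`, `train`, and `detect`, the label width, and the values of `sigma` and `lam` are all assumptions chosen for illustration. The order of the complex conjugate in the kernel correlation is also an assumption, picked so that the response peak lands on the shift of the candidate.

```python
import numpy as np


def gaussian_correlation(x1, x2, sigma):
    """k[u] = Gaussian kernel between x1 and x2 cyclically shifted by u."""
    cross = np.real(np.fft.ifft(np.fft.fft(x1) * np.conj(np.fft.fft(x2))))
    dist2 = np.dot(x1, x1) + np.dot(x2, x2) - 2.0 * cross
    return np.exp(-np.maximum(dist2, 0.0) / sigma**2)


def train(x, y, sigma, lam):
    """Dual coefficients in the frequency domain: y_hat / (k_hat_xx + lambda)."""
    k_xx = gaussian_correlation(x, x, sigma)
    return np.fft.fft(y) / (np.fft.fft(k_xx) + lam)


def detect(alpha_hat, x, z, sigma):
    """Response over all cyclic shifts of the candidate z."""
    k_zx = gaussian_correlation(z, x, sigma)
    return np.real(np.fft.ifft(np.fft.fft(k_zx) * alpha_hat))


n = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(n)                     # hypothetical 1-D template
dist = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (dist / 2.0) ** 2)           # labels peaked at zero shift

alpha_hat = train(x, y, sigma=4.0, lam=1e-4)
shift = 5
z = np.roll(x, shift)                          # candidate: template moved by 5
response = detect(alpha_hat, x, z, sigma=4.0)
print(int(np.argmax(response)))                # index of the strongest response
```

Every step is an FFT or an element-wise operation, so no kernel matrix is ever formed or inverted. In a tracker the same steps run on 2-D feature patches, and the location of the response peak gives the displacement of the target between frames.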

Design of Target Tracking Algorithm Based on Kalman Filter.
In the previous section, a good balance between speed and accuracy is achieved by using the KCF filter to track the target and obtain its motion state while the camera is stationary. However, tracking a target from a UAV is a dynamic process, and position estimation based on KCF alone is not robust enough for it. During tracking, the target is not guaranteed to remain within the field of view of the camera, and it may occasionally be partially or fully occluded, leading to target loss. Although the complete loss of the target caused by long-term occlusion may not be solved, the proposed method can deal with small-scale, short-term occlusion. Therefore, this section applies the Kalman filter to establish a linear motion model of the target and fuses it with the KCF tracking results, treating camera jitter as Gaussian noise. From the input and output of the model, the optimal estimate of the motion state of the target is obtained and used to predict the target position at the next moment, which improves the tracking accuracy and robustness.
The Kalman filter is widely applied in the state estimation of target motion [23][24][25]. Because the measurement of target motion is noisy, the Kalman filter uses the motion information of the target to suppress the noise effectively and obtain the optimal estimate of the target position.
Firstly, because the sampling frequency of the camera is high, the time interval between adjacent frames is very short, so the motion of the target between two frames can be regarded as uniform motion, and the acceleration of the target is assumed to obey a Gaussian distribution. The state vector and control vector of the system can be expressed as follows:
$$x_k = \left[x_{ik},\ y_{ik},\ \dot{x}_{ik},\ \dot{y}_{ik}\right]^T, \qquad u_k = \left[\ddot{x}_{ik},\ \ddot{y}_{ik}\right]^T$$
where $x_k$ and $u_k$ are the state vector and control vector of the system at time $k$, respectively; $x_{ik}$ and $y_{ik}$ represent the position of the target at time $k$ in $I$; $\dot{x}_{ik}$ and $\dot{y}_{ik}$ represent the velocity of the target at time $k$ in $I$; and $\ddot{x}_{ik}$ and $\ddot{y}_{ik}$ represent the acceleration of the target at time $k$ in $I$.
The motion state equation of the system is as follows:
$$x_k = A_k x_{k-1} + B_k u_k + w_k$$
where $A_k$ is the state transition matrix of the system at time $k$, $x_{k-1}$ is the state vector of the system at time $k-1$, $B_k$ is the control input matrix of the system at time $k$, $u_k$ is the control vector of the system at time $k$, and $w_k$ is the process noise of the system at time $k$.
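Assuming that the motion of the target tracked by the UAV is uniform, the specific forms of $A$ and $B$ are as follows:
$$A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} \Delta t^2/2 & 0 \\ 0 & \Delta t^2/2 \\ \Delta t & 0 \\ 0 & \Delta t \end{bmatrix}$$
where $\Delta t$ is the time interval between adjacent frames. The KCF tracking result can be used as the observation of the Kalman filter. The observation equation can be written as follows:
$$z_k = H_k x_k + v_k$$
where $z_k$ is the target tracking result at time $k$, $H_k$ is the state observation matrix, and $v_k$ is the measurement noise at time $k$. The specific form of $H$ is as follows:
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$
During estimation, the Kalman filter can be divided into two stages: the prediction stage and the iterative update stage. The specific processes are as follows:
(1) Prediction stage
From the motion state equation,
$$\hat{x}_k^- = A\hat{x}_{k-1} + B u_k, \qquad P_k^- = A P_{k-1} A^T + Q$$
where $\hat{x}_k^-$ is the prior state estimate of the target at time $k$, $\hat{x}_{k-1}$ is the posterior state estimate of the target at time $k-1$, $P_k^-$ is the prior estimation covariance matrix, $P_{k-1}$ is the optimal estimation covariance matrix, and $Q$ is the process noise covariance matrix.
(2) Iterative update stage
$$K_k = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1}$$
$$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H\hat{x}_k^-\right)$$
$$P_k = \left(E - K_k H\right) P_k^-$$
where $K_k$ is the Kalman gain matrix, $R$ is the measurement noise covariance matrix, and $E$ is the identity matrix.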
In summary, the tracking process based on KCF and the Kalman filter is shown in Figure 4. Firstly, the KCF target tracking algorithm and the Kalman filter are initialized, and the predicted target state at the current moment is calculated from the optimal estimate of the target state at the previous moment. Then, the predicted covariance at the current time is calculated from the optimal estimated covariance matrix at the previous time and the process noise. In the update stage, the KCF algorithm is applied to track the selected target. After the target tracking result $z_k$ is obtained, the predicted state $\hat{x}_k^-$ is corrected with the Kalman gain. Finally, the optimal estimate $\hat{x}_k$ of the current target state is obtained.
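The predict and correct cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the state is taken as [x, y, vx, vy] in the image, the acceleration input is set to zero, the covariances `Q` and `R` are hypothetical values, and an occluded frame is modelled simply as a missing measurement (`None`), in which case only the prediction is used.

```python
import numpy as np


def constant_velocity_model(dt):
    """Assumed model: state [x, y, vx, vy], control [ax, ay], measurement [x, y]."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    B = np.array([[0.5 * dt**2, 0],
                  [0, 0.5 * dt**2],
                  [dt, 0],
                  [0, dt]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    return A, B, H


def predict(x, P, A, B, u, Q):
    """Prior state and covariance from the previous posterior."""
    return A @ x + B @ u, A @ P @ A.T + Q


def update(x_prior, P_prior, z, H, R):
    """Correct the prior with the tracker measurement z."""
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
    return x_post, P_post


def track(measurements, dt=1.0):
    """Run the filter; a measurement of None stands for an occluded frame."""
    A, B, H = constant_velocity_model(dt)
    Q, R = 1e-4 * np.eye(4), np.eye(2)     # hypothetical noise covariances
    x, P = np.zeros(4), 1e3 * np.eye(4)    # vague initial state
    u = np.zeros(2)                        # acceleration input assumed zero
    positions = []
    for z in measurements:
        x, P = predict(x, P, A, B, u, Q)
        if z is not None:                  # skip the correction while occluded
            x, P = update(x, P, z, H, R)
        positions.append(x[:2].copy())
    return np.array(positions)


# Target moving at (2, 1) px/frame: visible for 20 frames, then occluded for 5.
zs = [np.array([2.0 * k, 1.0 * k]) for k in range(1, 21)] + [None] * 5
est = track(zs)
print(est[-1])
```

During the five occluded frames the estimate keeps moving with the velocity learned while the target was visible, which is the behaviour that lets the tracker bridge a short occlusion. How occlusion is detected in practice, for example from the KCF response peak, is outside the scope of this sketch.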

Three-Dimensional Position Solution
After obtaining the target's planar motion coordinates in the two-dimensional image from Section 3, the coordinates are converted into three-dimensional space using the following method, so that the UAV can track the target dynamically.
As shown in Figure 5, the world coordinate system, body coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system are defined, and the relative motion relationship between the UAV and the target is described. Among them, $W = \{o_w, x_w, y_w, z_w\}$ is the world coordinate system, $B = \{o_b, x_b, y_b, z_b\}$ is the body coordinate system, $C = \{o_c, x_c, y_c, z_c\}$ is the camera coordinate system, $I = \{o_i, x_i, y_i\}$ is the image coordinate system, and $G = \{o_g, u, v\}$ is the pixel coordinate system, whose unit is the pixel. The pixel coordinate system takes the top-left vertex of the image as the origin, with the $u$ axis pointing right and the $v$ axis pointing down.
Suppose the coordinate of the target point $M$ in $W$ is $(x_w, y_w, z_w)$, the coordinate of its projection $m$ in $I$ is $(x_i, y_i)$, and the coordinate of the origin $o_i$ of $I$ in $G$ is $(u_0, v_0)$. Then, the relationship between $G$ and $I$ can be expressed as follows:
$$u = \frac{x_i}{dx} + u_0, \qquad v = \frac{y_i}{dy} + v_0$$
where $dx$ and $dy$ are the physical dimensions of a unit pixel along the $x_i$ axis and the $y_i$ axis, respectively.
Let the coordinate of the target point $M$ in $C$ be $(x_c, y_c, z_c)$. According to the projection transformation, the relationship between $I$ and $C$ can be expressed as follows:
$$x_i = f\,\frac{x_c}{z_c}, \qquad y_i = f\,\frac{y_c}{z_c}$$
where $f$ is the focal length of the camera, which is determined by the internal parameters of the camera. Combining the relationship between $G$ and $I$ with the relationship between $I$ and $C$, the relationship between $G$ and $C$ can be written as follows:
$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = S \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}$$
where $f_x = f/dx$ and $f_y = f/dy$ represent the horizontal and vertical pixel focal lengths, respectively, and $S$ denotes the matrix of internal parameters. Then, the coordinate of $M$ in $W$ can be expressed as follows:
$$\begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} = R_B^W R_C^B \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}$$
where $R_C^B$ is the transformation matrix from $C$ to $B$, $R_B^W = (r_{ij})_{3 \times 3}$ is the rotation matrix from $B$ to $W$, and each element $r_{ij}$ is determined by the attitude angles of the UAV. Combining this transformation with the relationship between $G$ and $C$, the relationship between $G$ and $W$ can be written as follows:
$$\begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} = z_c\, R_B^W R_C^B\, S^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
After the coordinates of the target point $M$ in the image sequence are determined, its position coordinate in $W$ can be calculated. However, as the monocular camera cannot obtain the depth information $z_c$, the similar triangle estimation method is used to estimate the depth of the target. The premise of this estimation is that the actual height of the target is known, so the height of the target is first measured by the triangulation method. The triangulation method is shown in Figure 6.
For images $I_1$ and $I_2$, with the left image as a reference, the camera optical centre moves horizontally from $o_{c1}$ to $o_{c2}$. During the movement, it is assumed that the camera does not rotate and that the displacements along the $z_c$ axis and $y_c$ axis are negligible. Suppose $I_1$ has the feature point $m_1$, whose coordinate in $C$ is $(x_{c1}, y_{c1}, z_{c1})$. The corresponding feature point in $I_2$ is $m_2$, whose coordinate in $C$ is $(x_{c2}, y_{c2}, z_{c2})$.
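The chain from pixel coordinates to world coordinates can be sketched in a few lines, assuming the depth $z_c$ is already known. Several details here are assumptions for illustration only: the intrinsic values, the forward-looking camera mounting encoded in `R_BC`, the Z-Y-X Euler angle convention in `body_to_world_rotation`, and the omission of any translation between the frames.

```python
import numpy as np


def pixel_to_camera(u, v, z_c, fx, fy, u0, v0):
    """Invert the pinhole projection for a pixel with known depth z_c."""
    return np.array([(u - u0) * z_c / fx, (v - v0) * z_c / fy, z_c])


def body_to_world_rotation(roll, pitch, yaw):
    """Body-to-world rotation for Z-Y-X (yaw, pitch, roll) Euler angles."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx


# Assumed mounting: forward-looking camera, so the optical axis z_c maps to
# body x (forward), x_c to body y (right), and y_c to body z (down).
R_BC = np.array([[0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])


def pixel_to_world(u, v, z_c, intrinsics, attitude):
    """Rotate the back-projected camera-frame point into the world frame."""
    fx, fy, u0, v0 = intrinsics
    p_c = pixel_to_camera(u, v, z_c, fx, fy, u0, v0)
    return body_to_world_rotation(*attitude) @ R_BC @ p_c


intrinsics = (600.0, 600.0, 320.0, 240.0)   # hypothetical fx, fy, u0, v0
attitude = (0.0, 0.0, np.pi / 2)            # roll, pitch, yaw in radians
p_w = pixel_to_world(320.0, 240.0, 5.0, intrinsics, attitude)
print(np.round(p_w, 3))
```

In this example the target sits at the principal point 5 m in front of the camera, and the UAV is yawed by 90 degrees, so the point ends up 5 m along the world y axis. On the real platform the attitude would come from the flight controller and the intrinsics from camera calibration.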
According to the definition of epipolar geometry [26], the coordinate relationship can be expressed as follows:
$$z_{c2} P_{c2} = z_{c1} P_{c1} + t_{12}$$
where $P_{c1} = [x_{c1}/z_{c1},\ y_{c1}/z_{c1},\ 1]^T$ and $P_{c2} = [x_{c2}/z_{c2},\ y_{c2}/z_{c2},\ 1]^T$ are the normalized coordinates of $m_1$ and $m_2$ in $C$, respectively, and $t_{12}$ is the translation vector from $o_{c1}$ to $o_{c2}$, whose value is known. Left-multiplying both sides of the equation by $P_{c2}^{\wedge}$, where $^{\wedge}$ represents the outer (cross) product operation, gives the following relationship:
$$z_{c2} P_{c2}^{\wedge} P_{c2} = 0 = z_{c1} P_{c2}^{\wedge} P_{c1} + P_{c2}^{\wedge} t_{12}$$
From the right-hand side of the equation, $z_{c1}$ can be calculated, which gives the depth value of the target in $I_1$. The actual height of the target is then calculated according to similar triangles, as shown in Figure 7.
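The depth recovery from two views can be sketched as below for the rotation-free case described above. The helper names, the synthetic point, and the sign convention for the translation (a point with camera-frame coordinates `X1` in the first view has coordinates `X1 + t12` in the second) are assumptions of this sketch.

```python
import numpy as np


def normalized_coords(u, v, fx, fy, u0, v0):
    """Normalized image coordinates [x/z, y/z, 1] of a pixel (u, v)."""
    return np.array([(u - u0) / fx, (v - v0) / fy, 1.0])


def triangulate_depth(p1, p2, t12):
    """Solve z1 * (p2 x p1) + (p2 x t12) = 0 for z1 in the least-squares sense."""
    a = np.cross(p2, p1)
    b = np.cross(p2, t12)
    return -float(a @ b) / float(a @ a)


# Synthetic check: a point 4 m in front of the first camera position.
X1 = np.array([0.5, 0.2, 4.0])
t12 = np.array([-0.3, 0.0, 0.0])    # assumed convention: X2 = X1 + t12
X2 = X1 + t12
z1 = triangulate_depth(X1 / X1[2], X2 / X2[2], t12)
print(round(z1, 6))
```

The cross product removes the unknown depth in the second view, leaving three scalar equations in the single unknown `z1`, which are solved here by projection. If the camera does not move sideways enough, `p2 x p1` approaches zero and the estimate becomes unreliable, so a sufficient baseline is needed during initialization.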
Assuming that $H_m$ is the actual height of the target and $h_m$ is the height of the target in the image, $H_m$ can be expressed as follows:
$$H_m = \frac{z_{c1}\, h_m}{f}$$
After $H_m$ is estimated from the first two frames, the depth value $z_c$ of the target in the subsequent frames is obtained from the similarity relationship as follows:
$$z_c = \frac{f\, H_m}{h_m}$$
where $h_m$ is now the height of the target in the current frame.
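The two similar-triangle relations can be checked with a small worked example. The function names and the numbers are hypothetical; the image height and the focal length are assumed to be expressed in the same unit (pixels here).

```python
def target_height(z_c1, h_img, f):
    """Actual target height from a known depth and the height in the image."""
    return z_c1 * h_img / f


def depth_from_height(H_m, h_img, f):
    """Depth in a later frame from the stored target height."""
    return f * H_m / h_img


f_px = 600.0                                   # hypothetical focal length in pixels
H_m = target_height(4.0, 255.0, f_px)          # initialization: depth 4 m, 255 px tall
z_c = depth_from_height(H_m, 170.0, f_px)      # later frame: target appears 170 px tall
print(round(H_m, 3), round(z_c, 3))
```

The target appears smaller in the later frame, so the estimated depth grows in proportion. Because the depth scales with the stored height, any error in the size estimate from initialization carries over directly to every later depth estimate.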

Experiment and Analysis
The flight experiment was carried out in an open outdoor environment. As both the UAV and the target are in motion during the experiment, the difficulty of pose estimation is increased. In addition, during the occlusion experiment, the flight parameters of the UAV are limited to prevent large-scale manoeuvres. The flight parameters are shown in Table 1. First, the ground station is used to check the sensor data of the UAV after power-on. Then, the UAV is switched to fixed-point mode with a 2.4 GHz remote controller, unlocked, and controlled to hover at a fixed point after taking off to a certain height. After the tracking target is selected, the size of the target is first estimated to provide a reference for the subsequent depth estimation, so that the three-dimensional position of the target can be computed. In this paper, the sizes of three different types of targets are estimated. The matching results are shown in Figure 8, and the estimation results are shown in Table 2.
It can be seen from Table 2 that the proposed estimation method can effectively estimate the size of different types of targets. The estimation errors are within 100 mm, which is acceptable for depth estimation. To verify the depth estimation algorithm proposed in this paper, targets at different distances are selected for depth estimation. Table 3 shows the estimated distances of Person, Car, and UAV at different distances. It can be seen that the estimation errors of the algorithm are within 0.2 and do not change greatly as the distance increases. After that, the target tracking experiment can be carried out.
As the tracking is processed in real time on the onboard computer, the tracking system sends its outputs as control instructions to the flight control system through serial communication. Limited by the processing speed of the onboard computer, the remote controller is used to switch the flight control system into Offboard mode. The tracking algorithm starts automatically once the target is selected. The first-person tracking view of the UAV is shown in Figure 9, where the green box is the KCF tracking result and the yellow box is the Kalman prediction result.
The tracking results under target occlusion are shown in Figure 10. It can be seen that when the tracked target is partially or completely occluded, the KCF tracking result drifts, but the algorithm proposed in this paper can still track the target effectively.
When the tracked target is occluded, using only the KCF algorithm results in significant position estimation errors. However, when the KCF algorithm is fused with the Kalman filter, the errors stay within the allowable range. The experimental results are shown in Figure 11.
In Figure 11, the tracked target is occluded at 120 s and 220 s. It can be clearly seen that the proposed algorithm improves the tracking effect during occlusion and effectively reduces the position estimation errors of the target. The position estimation error along the x-axis and y-axis is reduced from about 0.8 m to about 0.3 m, and the position estimation error along the z-axis is reduced from about 0.2 m to 0.1 m.
To further evaluate the system, the dynamic position of the target and the estimated results are compared, as shown in Figure 12. The system can effectively estimate the position of the target in three-dimensional space for most of the time. Despite jitter and occasional drift, the proposed algorithm can still relocate the target in a short time.
The errors between the target position and the estimated position along the x-, y-, and z-axes are shown in Figure 13. For most of the time, the errors on the x-axis are kept within 0.6 m, and the errors on the y-axis and z-axis are kept within 0.2 m. The RMSE (root mean square error) and MAE (mean absolute error) are further calculated, and the results are shown in Table 4. The experimental results show that the proposed algorithm can track the target effectively.
Compared with the 3D target pose estimation system in [27], the proposed system is robust enough for real-time dynamic position estimation. In addition, in order to analyze the effect of the distance between the UAV and the target object on the accuracy of the target position estimation, several target trajectory estimation experiments were performed. As shown in Table 5, the performance of the proposed method does not deteriorate significantly when the distance between the UAV and the tracked object increases.
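For reference, the two error metrics can be computed per axis as in the following sketch. The function names and the sample values are made up for illustration and are not the experimental data.

```python
import numpy as np


def rmse(estimated, actual):
    """Root mean square error per axis (columns are x, y, z)."""
    err = np.asarray(estimated, dtype=float) - np.asarray(actual, dtype=float)
    return np.sqrt(np.mean(err**2, axis=0))


def mae(estimated, actual):
    """Mean absolute error per axis (columns are x, y, z)."""
    err = np.asarray(estimated, dtype=float) - np.asarray(actual, dtype=float)
    return np.mean(np.abs(err), axis=0)


# Hypothetical positions in metres: two time steps, three axes.
actual = np.zeros((2, 3))
estimated = np.array([[0.3, 0.0, -0.1],
                      [0.4, 0.2, 0.1]])
print(rmse(estimated, actual), mae(estimated, actual))
```

RMSE weights large deviations more heavily than MAE, so a gap between the two on one axis points to occasional large errors, such as brief drift during occlusion, rather than a uniform offset.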

Conclusion
The payload and endurance of the MAV are limited, so it cannot carry a large onboard computer to run complex visual tracking algorithms. Aiming at these problems, this paper proposes a MAV target tracking algorithm based on monocular vision. The main contributions are as follows: (1) For the problem of measuring the distance between the MAV and the target, a triangulation algorithm is designed for a monocular camera to estimate the size of the object. Based on this, triangle similarity is used to measure the distance between the MAV and the target; (2) To address the problem of target occlusion, a target tracking algorithm based on KCF and the Kalman filter is proposed. The algorithm combines the KCF tracking results with the Kalman filter, which solves the short-term occlusion problem and improves the anti-interference ability of the tracking process; (3) The proposed target tracking algorithm is evaluated through numerous experiments in a real environment. The experimental results demonstrate the feasibility and robustness of the proposed algorithm.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.