A UAV Visual Relocalization Method Using Semantic Object Features Based on Internet of Things

Unmanned air vehicles (UAVs) offer high autonomy and strong dynamic deployment capabilities. At the same time, with the rapid development of Internet of Things (IoT) technology, building the IoT on UAVs can break away from the traditional single-line communication mode between UAVs and control terminals, which makes UAVs more intelligent and flexible when performing tasks. When UAVs perform IoT tasks, their position and pose must be tracked at all times. When position and pose tracking fails, relocalization is required to restore the current position and pose. Therefore, how to perform UAV relocalization accurately using visual information has attracted much attention. However, complex changes in lighting conditions in the real world pose huge challenges to the visual relocalization of UAVs. Traditional visual relocalization algorithms mostly rely on hand-designed low-level geometric features, which are sensitive to lighting conditions. In this paper, oriented to the UAV-based IoT, a UAV visual relocalization method using semantic object features is proposed. Specifically, the method uses YOLOv3 as the object detection framework to extract the semantic information in the picture and uses the semantic information to construct a topological map as a sparse description of the environment. With prior knowledge of the map, the random walk algorithm is used on the association graphs to match the semantic features and the scenes. Finally, the EPnP algorithm is used to solve the position and pose of the UAV, which are returned to the IoT platform. Simulation results show that the proposed method can achieve robust real-time UAV relocalization when the scene lighting conditions change dynamically and provides a guarantee for UAVs to perform IoT tasks.


Introduction
Unmanned air vehicles (UAVs) offer high autonomy and strong dynamic deployment capabilities [1] and can collect various types of observation data and perform operational tasks accurately. Currently, UAVs mainly conduct one-to-one communication with the control terminal, which satisfies the basic needs of humans for manipulating UAVs to perform some simple tasks [2,3]. However, in actual applications, data monitoring or operations are often not carried out on a single line, and single-line communication restricts the application performance and benefits of UAVs to a certain extent. The Internet of Things (IoT) is developing rapidly and has become a mainstream technology. An IoT in which UAVs act as device nodes can solve the above problems and make UAVs more intelligent and flexible when performing tasks [4]. Localization is a critical issue in the field of UAVs. A UAV must show accurate self-localization capability and robustness to common interference factors when it performs IoT tasks such as industrial inspections and data collection [5,6]. UAVs are generally localized through Global Navigation Satellite Systems and Inertial Navigation Systems (INS). When the UAV is working in inclement weather or an indoor environment, the Global Positioning System (GPS) signal may be weak or interrupted [7]. In that case, the UAV can rely on the INS to perform localization for a short period of time. However, the localization error of the INS accumulates over time; therefore, how to use other supplementary information to help UAVs achieve relocalization and ensure the correct execution of IoT tasks is a very practical problem. With the development of computer vision, visual scene matching technology based on camera sensors is becoming increasingly mature and can provide high-quality services for autonomous UAV flight during IoT tasks [8].
The camera sensor can provide abundant online environmental information and has the advantages of low cost, small size, light weight, noncontact operation, etc. Additionally, a suitable visual localization system can be combined with GPS and INS as a supplementary localization system for UAVs [9,10].
The UAV needs to track its position and pose at all times during the working process. When the position and pose tracking fails, relocalization algorithms should be used to restore the current position and pose [11]. Traditional visual relocalization algorithms mostly rely on the similarity among environmental appearances, using hand-designed low-level geometric visual features as the basis for calculating image similarities [12] and then using feature matching to complete scene matching, which can achieve good performance in an environment with constant lighting conditions. The bag-of-words model was proposed in [13]; it uses the K-Means algorithm to construct a "dictionary" and merges similar features through clustering, utilizing the bag-of-words model library DBoW3 to extract ORB features [14]. Each image is converted into a visual bag-of-words representation vector; the frequency of each component in the vector is then counted, and the similarity between images is calculated through the distance between the representation vectors to complete visual relocalization, where the Hamming distance or the cosine distance may be chosen according to the descriptor. In [14], the authors proposed FAB-MAP, a probabilistic appearance-based method that extends the bag-of-words model. It uses the Chow-Liu tree to fit the discrete probability distribution, which can well solve the perceptual deviation problem. These two relocalization methods utilize features obtained from the appearance of the images for scene matching. However, in the real world, there are often complex environmental changes such as lighting condition changes, weather changes, and seasonal changes, which strengthen or weaken some of the key features and decrease the accuracy of feature matching.
In such cases, the relocalization performance of these two methods deteriorates greatly, and maintaining their performance requires expensive map maintenance [15].
In the field of visual localization, SIFT and SURF [16][17][18][19] are the two most important local feature descriptors; they are invariant to image rotation and scaling and tolerate illumination changes and small changes of viewing angle well. Feature matching with SIFT and SURF can achieve high accuracy, but their computing speed is far inferior to that of the ORB feature. Because of the low-performance processor of the UAV, these two feature description methods are not suitable for real-time UAV visual localization [20]. Therefore, how to complete accurate, fast, and robust visual relocalization when UAVs perform IoT tasks is a very challenging problem.
In recent years, deep learning has promoted huge breakthroughs in the field of object detection, and a large number of high-performance object-detection schemes have been successively proposed, such as Mask R-CNN [21], Faster R-CNN [22], and YOLO [23]. Object detection can segment the foreground and background and obtain the semantic information of the objects (such as their categories and attributes) in the scene picture. Compared with geometric visual features, this kind of object-level semantics is a high-level feature, which is sparse and highly invariant to lighting changes [24,25]. There is always a semantic gap between describing images with low-level features and the way humans perceive images. Introducing semantic features into maps [26][27][28][29] brings the description of images closer to the level of human understanding, which can alleviate this problem to a certain extent and improve the robustness of UAV visual relocalization. In [28], the authors indicate that a defined hierarchical structure of semantic information enables improved map reproduction. In [29], the authors introduce semantic information into the map to enable UAVs to complete terrain classification and navigation.
Oriented to the UAV-based IoT, this paper proposes a UAV visual relocalization method using semantic object features. Firstly, we use YOLOv3 as the object-detection framework to obtain higher-level semantic features that are stable under changes in lighting conditions. To describe images, the semantic information is used to abstract the images into topological graphs, which simplifies the storage and comparison of the environmental information of images. With prior knowledge of the map, the random walk algorithm [30,31] is used on the association graphs to match the semantic features and the scenes. Finally, the EPnP algorithm [32] is used to solve the UAV's position and pose, achieving robust UAV relocalization and providing a guarantee for UAVs to perform IoT tasks.

The UAV-Based IoT Model
As the device nodes of the IoT, UAVs can effectively improve the inflexible topological structure of the IoT and are very suitable for data collection and monitoring. Location information is very important during the work of the UAVs. The main components of the UAV-based IoT [4] include the UAV part, the IoT platform, and the communication links (A2G/G2A links). The control commands issued by the IoT platform are uploaded to UAVs via the G2A links, and UAVs perform corresponding tasks according to these commands. When the tracking of the position and pose of a UAV fails, the relocalization algorithm is executed. After the UAV completes its relocalization, the position and pose information is returned to the IoT platform via the A2G link, and the IoT platform makes corresponding adjustments to the tasks of the UAVs based on this information [33,34]. The UAV-based IoT model is shown in Figure 1.

Graph Matching Problem Formulation
How to complete scene matching is the key issue in UAV visual relocalization. We abstract the scene images into topological graphs and then transform the scene matching problem into a graph matching problem. The purpose of graph matching is to determine the mapping relationship between the nodes of two graphs. Suppose that the two graphs to be matched are $G^M = (V^M, E^M)$ and $G^N = (V^N, E^N)$, respectively, where $V$ represents the set of nodes in the graph, $E$ represents the set of edges, and $c_M$ and $c_N$ are the numbers of nodes in $G^M$ and $G^N$, respectively. The weight matrix $W$ measures the degree of matching between candidate correspondences: its diagonal elements are the matching weights between nodes, and its off-diagonal elements are the matching weights between edges. A high weight means that the corresponding pair of nodes or pair of edges has a high matching degree. $X' \in \{0, 1\}^{c_M \times c_N}$ is a $c_M \times c_N$ assignment matrix with one and only one element equal to 1 in each row and each column. Its elements encode the matching relationship between the nodes of $G^M$ and $G^N$; namely, $X'_{ia} = 1$ means that node $i$ in $G^M$ corresponds to node $a$ in $G^N$; otherwise, $X'_{ia} = 0$. $X = \mathrm{vec}(X')$ is the column vectorization of $X'$; the graph matching problem can then be expressed as an integer quadratic programming problem, which can be written as

$$X^{*} = \arg\max_{X} X^{T} W X, \quad \text{s.t. } X \in \{0, 1\}^{c_M c_N}, \tag{1}$$

subject to the one-to-one constraints on $X'$ stated above.

Graph matching based on random walk [30] is a graph matching algorithm that is robust to outliers and deformations. The algorithm achieves noise-resistant graph matching by iteratively updating and mining the confidence of candidate matching pairs. The random walk algorithm assigns weights to the nodes in a graph, and the association graph expresses the matching relationship between two graphs. By transforming the two graphs to be matched into an association graph, the weight of each matching pair in the original two graphs can be obtained and the best matching pairs can be selected.
To select reliable nodes on the association graph, the graph matching problem is modeled as a random walk: the process of walking randomly on the association graph is regarded as a Markov random process, and the weight matrix is used to construct the transition matrix. Let the weight matrix be $W$; then the probability transition matrix is $P = D^{-1} W$, where $D$ is a diagonal matrix and the weight matrix $W$ is symmetric. The weight matrix $W$ is thus normalized by multiplying it on the left by the inverse of $D$. We denote by $X^{(t)T}$ the probability distribution over all nodes that the random walker may reach on the association graph at time $t$; then the Markov chain of the random walk process can be expressed as

$$X^{(t+1)T} = X^{(t)T} P. \tag{2}$$

The association graph is a weighted undirected graph. By walking randomly on the association graph until the probability distribution of its nodes converges, the assignment matrix $X'$ can be obtained, and the matching relationship between $G^M$ and $G^N$ can be determined.
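The iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the diagonal matrix D is assumed to hold the row sums of W, the walker starts from a uniform distribution, and the 3-node weight matrix as well as the function names are hypothetical.

```python
def transition_matrix(W):
    """Row-normalize W, i.e. P = D^{-1} W with D[i][i] assumed to be the row sum."""
    P = []
    for row in W:
        degree = sum(row)
        P.append([w / degree if degree > 0 else 0.0 for w in row])
    return P


def random_walk(P, tol=1e-10, max_iter=1000):
    """Iterate x^T <- x^T P from a uniform start until the distribution settles."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical symmetric weight matrix of a 3-node association graph.
W = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
P = transition_matrix(W)
x = random_walk(P)
```

For a symmetric W normalized this way, the converged distribution is proportional to the row sums of W, so the two strongly connected candidate nodes end up with the largest probabilities.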

Relocalization Model
Traditional UAV visual relocalization methods often achieve scene matching by directly calculating the similarity between the input image of the camera and the images in the map library, whereas the method proposed in this paper abstracts images as semantic topology graphs and calculates the image similarity indirectly by comparing the structures of these graphs.
The UAV visual relocalization model (see Figure 2) mainly includes three modules: the image processing and representation module, the scene matching module, and the relocalization module. The relocalization process is divided into three steps:
(i) Firstly, the current scene is captured by the UAV camera, and the YOLOv3 object detection framework in the image processing module is used to obtain the semantic information of the objects; we then regard the semantic labels of the objects as the nodes of semantic topology graphs. A scene map library is established as an internal representation of the environment for comparison with the real-time image captured by the UAV camera.
(ii) Secondly, the association graphs are generated based on the common semantic information of the images to be matched. We comprehensively consider the semantic information difference, node position difference, and topological structure difference of the images to be matched. The random walk algorithm is used to complete the mapping of semantic feature pairs; afterwards, we calculate the similarity between the input image and the images in the map library and complete the scene matching of the UAV.
(iii) Finally, according to the scene matching result and the information stored in the map library, the 3D positions of the semantic feature points in the world coordinate system can be obtained. Combining these with the known 2D positions of the semantic feature points in the camera coordinate system, we apply the EPnP algorithm to solve the camera position and pose from the 3D-to-2D point correspondences and thus realize the visual relocalization of the UAV.
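Step (i) can be illustrated with a small sketch that turns detector output into a semantic topology graph. The `Detection` type, the confidence cut-off `min_score`, the use of bounding-box centers as node positions, and the use of Euclidean distances as edge attributes are all assumptions made for illustration; the paper does not specify these details.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one object-detector output."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels
    score: float


def to_topology_graph(detections, min_score=0.5):
    """Build labeled nodes at box centers and edges holding pairwise distances."""
    nodes = [
        (d.label, ((d.box[0] + d.box[2]) / 2, (d.box[1] + d.box[3]) / 2))
        for d in detections
        if d.score >= min_score  # assumed confidence filter
    ]
    edges = {
        (i, j): math.dist(nodes[i][1], nodes[j][1])
        for i, j in combinations(range(len(nodes)), 2)
    }
    return nodes, edges


detections = [
    Detection("chair", (80, 150, 120, 250), 0.9),
    Detection("monitor", (260, 140, 340, 220), 0.8),
    Detection("cup", (0, 0, 10, 10), 0.2),  # dropped by the score filter
]
nodes, edges = to_topology_graph(detections)
```

The resulting graph keeps only labels and geometry, which is what makes it a sparse description compared with storing pixel-level features.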

Image Processing and Representation.
YOLOv3 is a one-stage object detection algorithm. It is the third version of the YOLO series, which skips the step of generating candidate regions and directly extracts features in the network to predict the class and position of objects. YOLOv3 adopts the Darknet-53 network structure, which contains 53 convolutional layers of size 1 × 1 or 3 × 3, as shown in Figure 3.
Compared with the two-stage object detection algorithms of the R-CNN series, YOLOv3 has an obvious speed advantage, which serves real-time detection well and suits the real-time scene matching task of UAVs. The effect of object detection using a UAV with YOLOv3 is shown in Figure 4.

Scene Matching.
The scene matching module is the core of the UAV visual relocalization. Suppose that the two images to be matched are $G^M$ and $G^N$, respectively, and $G^{M,N}_{ass}$ is their association graph. $v_i^M$ and $v_j^M$ denote nodes of $G^M$, and $e_{ij}^M$ denotes the edge between them. $v_a^N$ and $v_b^N$ denote nodes of $G^N$, and $e_{ab}^N$ denotes the edge between them. The UAV visual scene matching method based on semantic object features proposed in this paper comprehensively considers the semantic information difference, node position difference, and topological structure difference of the graphs to be matched in the following ways:
(1) Semantic information difference: a new node $v_{ia}^{M,N}$ in $G^{M,N}_{ass}$ is generated by a pair of nodes $v_i^M$ and $v_a^N$ with the same semantic label, while a pair of nodes with different semantic labels is defined as a conflict matching pair and does not appear in the association graph. This simplifies the structure of the association graph (see Figure 5) and also allows the weights of conflicting matching pairs in the weight matrix $W$ to be set directly to zero during initialization, reducing the number of walks required of the random walker. In addition, we consider the movement frequency of objects in the scene: objects with low movement frequency are regarded as more representative of an image and are given greater weights when calculating image similarities.
(2) Node position difference: let $dis_{pq}$ be the Euclidean distance between node $p$ and node $q$. The smaller $dis_{ia}$ is, the higher the matching degree between $v_i^M$ and $v_a^N$; the smaller the difference between $dis_{ij}$ and $dis_{ab}$ is, the higher the matching degree between $e_{ij}^M$ and $e_{ab}^N$. According to [30], the weight matrix and probability transition matrix of nonconflicting matching pairs are initialized using the following two equations, which are also listed in Algorithm 1:

$$W_{ia;jb} = e^{-\left(dis_{ij} - dis_{ab}\right)/\sigma_s^2}, \tag{3}$$

$$P = \frac{W}{\max_{ia} \sum_{jb} W_{ia;jb}}. \tag{4}$$

In equation (3), $\sigma_s^2$ is an adjustment factor.
(3) Topological structure difference: the random walk algorithm is executed on the association graph according to the preset path length and the number of times each node in the association graph serves as the initial node. In this paper, the image similarity $s$ is computed from the following quantities: $m_i$ denotes the normalized weight of the $i$-th matching pair in graph $G_{ass}$, and $n_i$, in the range (0, 1), represents the weight of the $i$-th unmatched object in the original images. The value of $m_i$ is determined by the random walk algorithm, while $n_i$ is determined by the movement frequency of the object in the scene. $V_i$ denotes the pixel position deviation of the object corresponding to the $i$-th matching pair in the original images, $q$ is the number of matching pairs, and $d$ is an adjustment factor determined by the number of objects in the images and the acquisition frequency of the camera when constructing the map library; we take $d = 0.001$ in this paper. The algorithm for calculating the image similarity is given in Algorithm 1.
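The construction of the association graph and the initialization of the weight and transition matrices can be sketched as follows. This is a hedged illustration, not the authors' code: nodes are assumed to be `(label, (x, y))` tuples, candidate pairs that share a node are given zero weight, the absolute value of the distance difference is used so that closer edge lengths always score higher, and the function names are hypothetical. Only the value 2500 for the adjustment factor comes from Algorithm 1.

```python
import math


def build_association_graph(nodes_m, nodes_n):
    """Keep only candidate pairs (i, a) whose semantic labels agree."""
    return [
        (i, a)
        for i, (label_m, _) in enumerate(nodes_m)
        for a, (label_n, _) in enumerate(nodes_n)
        if label_m == label_n
    ]


def weight_matrix(pairs, nodes_m, nodes_n, sigma_s2=2500.0):
    """Edge-compatibility weights between candidate pairs (i, a) and (j, b)."""
    size = len(pairs)
    W = [[0.0] * size for _ in range(size)]
    for r, (i, a) in enumerate(pairs):
        for c, (j, b) in enumerate(pairs):
            if i == j or a == b:
                continue  # assumption: pairs sharing a node keep zero weight
            dis_ij = math.dist(nodes_m[i][1], nodes_m[j][1])
            dis_ab = math.dist(nodes_n[a][1], nodes_n[b][1])
            # assumption: absolute difference of the two edge lengths
            W[r][c] = math.exp(-abs(dis_ij - dis_ab) / sigma_s2)
    return W


def max_degree_normalize(W):
    """P = W / d_max, where d_max is the largest row sum of W."""
    d_max = max(sum(row) for row in W)
    if d_max == 0:
        return [row[:] for row in W]
    return [[w / d_max for w in row] for row in W]


# Hypothetical scene graphs: (semantic label, pixel position).
nodes_m = [("chair", (100.0, 200.0)), ("monitor", (300.0, 180.0)), ("chair", (500.0, 220.0))]
nodes_n = [("chair", (110.0, 205.0)), ("monitor", (310.0, 185.0))]

pairs = build_association_graph(nodes_m, nodes_n)
W = weight_matrix(pairs, nodes_m, nodes_n)
P = max_degree_normalize(W)
```

Because pairs with different labels never enter `pairs`, the matrix stays small, which is the pruning effect the semantic labels are meant to provide.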

Relocalization.
The visual relocalization of the UAV recovers its position and pose by obtaining the rotation matrix and translation vector of the UAV camera. Solving the position and pose of the camera from $n$ 3D-to-2D point correspondences is the PnP (Perspective-n-Point) problem. EPnP is a noniterative solution to the PnP problem with a computational complexity of $O(n)$. Its key idea is to represent the $n$ 3D points as weighted sums of 4 noncoplanar virtual control points and to calculate the coordinates of the 4 control points in the camera coordinate system; the position and pose of the camera can then be obtained.
We assume that the UAV camera follows the pinhole model and that its internal parameters are known. The image currently captured by the UAV is $G_1$, which corresponds to image $G_2$ in the map library after scene matching. There are a total of $n$ ($n \geq 4$) semantic feature matching pairs, where the 2D positions of the semantic feature points of $G_1$ in the camera coordinate system are known, and the 3D positions of the semantic feature points of $G_2$ in the world coordinate system are known.
We let the 3D coordinates of the semantic feature points and of the 4 virtual control points in the world coordinate system be given as

$$p_i^w,\ i = 1, 2, \ldots, n; \qquad c_j^w,\ j = 1, 2, 3, 4. \tag{6}$$

The coordinates of the semantic feature points in the world coordinate system can be expressed as the weighted sum of the coordinates of the 4 control points (see the following equation):

$$p_i^w = \sum_{j=1}^{4} \alpha_{ij} c_j^w, \qquad \sum_{j=1}^{4} \alpha_{ij} = 1. \tag{7}$$

Let $c_j^c$ represent the coordinates of the $j$-th virtual control point in the camera coordinate system, let $R$ and $t$ denote the camera external parameters, and let the coordinates of the $i$-th semantic feature point in the camera coordinate system be $p_i^c$; then the relationships between them are as follows:

$$p_i^c = \sum_{j=1}^{4} \alpha_{ij} c_j^c, \quad i = 1, 2, \ldots, n. \tag{8}$$
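The reason the same weights can be used in both coordinate systems is that a weighted sum whose weights add up to 1 is preserved by a rigid transform. The sketch below checks this numerically under simplifying assumptions: the control points are chosen as an origin plus the three unit axes (a hypothetical choice that makes the weights trivial to compute), and the rotation, translation, and point are made-up values.

```python
def barycentric_weights(p, origin):
    """Weights of p for control points origin, origin+e_x, origin+e_y, origin+e_z."""
    dx, dy, dz = (p[k] - origin[k] for k in range(3))
    return [1.0 - dx - dy - dz, dx, dy, dz]


def weighted_sum(alphas, control_points):
    """Sum of alpha_j * c_j, evaluated coordinate by coordinate."""
    return tuple(
        sum(a * c[k] for a, c in zip(alphas, control_points)) for k in range(3)
    )


def rigid_transform(R, t, p):
    """Map a world point into the camera frame: R p + t."""
    return tuple(sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3))


origin = (1.0, 2.0, 3.0)
controls_w = [origin, (2.0, 2.0, 3.0), (1.0, 3.0, 3.0), (1.0, 2.0, 4.0)]
p_w = (1.5, 4.0, 2.0)
alphas = barycentric_weights(p_w, origin)

# Hypothetical camera pose: 90 degree rotation about Z plus a translation.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.5, -1.0, 2.0)

controls_c = [rigid_transform(R, t, c) for c in controls_w]
p_c_direct = rigid_transform(R, t, p_w)
p_c_from_controls = weighted_sum(alphas, controls_c)
```

Both routes give the same camera-frame point, which is why EPnP only needs to estimate the four control points in the camera frame instead of all $n$ feature points.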
After obtaining the 3D coordinates in the camera coordinate system, the centers of gravity (centroids) of the world coordinates and of the camera coordinates of the semantic feature points are calculated to obtain matrix $A$, matrix $B$, and matrix $H$.

Inputs: semantic topological graphs $G^M$ and $G^N$ to be matched; the path length of the random walk: $p$; the number of times each node is used as a starting node: count
Output: image similarity: $s$
(1) Construct the association graph: $G_{ass}$
(2) for $i$ in $c_M$ do
(3)     for $a$ in $c_N$ do
(4)         if $v_i^M$ and $v_a^N$ have the same semantic label, then
(5)             add $v_{ia}$ to $G_{ass}$
(6)         end if
(7)     end for
(8) end for
(9) Initialize the weight matrix $W$: set the weights of conflicting matching pairs to zero; the weights of nonconflicting matching pairs are $W_{ia;jb} = e^{-\left(dis_{ij} - dis_{ab}\right)/\sigma_s^2}$, with $\sigma_s^2 = 2500$
(10) Initialize the probability transition matrix: $P = W / \left(\max_{ia} \sum_{jb} W_{ia;jb}\right)$
(11) for $k$ in nodes of $G_{ass}$ do
(12)     Random Walk ($G_{ass}$, $k$, $p$, count)
(13)     $X^{(t+1)T} = X^{(t)T} P$
(14) end for
(15) Get the best matching pairs
ALGORITHM 1: Image similarity calculation.

Wireless Communications and Mobile Computing

By performing singular value decomposition on $H$, the pose $R$ and position $t$ of the UAV can be recovered, where $R$ is the rotation matrix containing the pose information, which can be interchanged with the three-axis rotation angles of the camera coordinate system, and $t$ is the translation vector containing the position information:

$$R = \begin{pmatrix} \cos\theta\cos\phi & \sin\psi\sin\theta\cos\phi - \cos\psi\sin\phi & \cos\psi\sin\theta\cos\phi + \sin\psi\sin\phi \\ \cos\theta\sin\phi & \sin\psi\sin\theta\sin\phi + \cos\psi\cos\phi & \cos\psi\sin\theta\sin\phi - \sin\psi\cos\phi \\ -\sin\theta & \sin\psi\cos\theta & \cos\psi\cos\theta \end{pmatrix},$$

where $\psi$, $\theta$, and $\phi$ are the rotation angles about the X-axis, Y-axis, and Z-axis, respectively.
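The interchange between the rotation matrix and the three rotation angles can be sketched as below. The sketch assumes the Z-Y-X composition $R = R_z(\phi) R_y(\theta) R_x(\psi)$, which is the standard form whose last row is $(-\sin\theta,\ \sin\psi\cos\theta,\ \cos\psi\cos\theta)$, and it assumes $|\theta| < \pi/2$ so that the angles are uniquely recoverable. The function names are hypothetical.

```python
import math


def rotation_from_euler(psi, theta, phi):
    """Compose R = Rz(phi) * Ry(theta) * Rx(psi) from the three angles."""
    cps, sps = math.cos(psi), math.sin(psi)
    cth, sth = math.cos(theta), math.sin(theta)
    cph, sph = math.cos(phi), math.sin(phi)
    return [
        [cth * cph, sps * sth * cph - cps * sph, cps * sth * cph + sps * sph],
        [cth * sph, sps * sth * sph + cps * cph, cps * sth * sph - sps * cph],
        [-sth, sps * cth, cps * cth],
    ]


def euler_from_rotation(R):
    """Recover (psi, theta, phi); valid while cos(theta) is positive."""
    theta = -math.asin(R[2][0])
    psi = math.atan2(R[2][1], R[2][2])
    phi = math.atan2(R[1][0], R[0][0])
    return psi, theta, phi


# Hypothetical angles in radians.
R = rotation_from_euler(0.1, -0.2, 0.3)
psi, theta, phi = euler_from_rotation(R)
```

A quick sanity check on any matrix produced this way is that its rows are orthonormal; a matrix that fails this check cannot be a valid camera rotation.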

Simulation
The experiments in this paper are all carried out on an Ubuntu 18.04 system, using PyTorch to implement the YOLOv3 object detector. We use a UAV to shoot 800 pictures in an indoor scene at a rate of 10 fps to construct a scene map library, select 400 of them as test set 1, and then shoot 50 pictures under lighting conditions different from those of the map library as test set 2. Precision and recall, commonly used in machine learning, serve as the evaluation indicators of our algorithm's performance. True positive (TP) means that the result of scene matching is correct, false positive (FP) means that the result of scene matching is wrong, and false negative (FN) means that the relocalization of the UAV fails; we have

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$

The precision and recall of a perfect relocalization system are both 100%; however, there is always a trade-off between them in a real system. In a visual relocalization system, priority should be given to avoiding false positives and ensuring a sufficiently high precision, because introducing incorrect matching results during relocalization leads to catastrophic failure. With respect to the precision, recall, positioning error, and running time of the UAV relocalization algorithm, this paper compares the proposed method with the visual relocalization method based on the bag-of-words model on test set 1 and test set 2.

Precision and Recall.
To avoid the FP situation as much as possible, an image similarity threshold $c$ for judging the success of relocalization needs to be set: relocalization is judged successful only when the image similarity $s$ is not lower than $c$. For test set 1, the precision and recall of the two visual relocalization methods with different similarity thresholds are shown in Figure 6. With priority given to precision, the precision and recall are considered together to determine the best image similarity threshold $c$ for judging the success of relocalization. At this threshold, the precision and recall of the proposed method are 99.23% and 97.23%, respectively, and the precision and recall of the method based on the bag-of-words model are 100% and 99.25%, respectively.
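The threshold selection described above can be sketched as follows. This is an illustrative procedure under stated assumptions, not the authors' exact protocol: a sample is accepted when its similarity is at least the threshold, accepted samples count as TP or FP depending on whether the match is correct, rejected samples count as FN, and ties in precision are broken by recall. The sample scores and candidate thresholds are hypothetical.

```python
def evaluate(samples, c):
    """Return (precision, recall) when matches with similarity >= c are accepted."""
    tp = sum(1 for s, correct in samples if s >= c and correct)
    fp = sum(1 for s, correct in samples if s >= c and not correct)
    fn = sum(1 for s, _ in samples if s < c)  # relocalization declared failed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(samples, candidates):
    """Pick the threshold with the highest precision, then the highest recall."""
    return max(candidates, key=lambda c: evaluate(samples, c))


# Hypothetical (similarity, match_is_correct) results for six test images.
samples = [
    (0.9, True), (0.8, True), (0.7, True),
    (0.6, False), (0.5, True), (0.2, False),
]
candidates = [0.1, 0.3, 0.55, 0.65]
chosen = best_threshold(samples, candidates)
```

In this toy data the strictest candidate removes every false positive at the cost of recall, which mirrors the precision-first trade-off discussed in the text.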
Based on the best similarity thresholds $c_1 = 0.01$ and $c_2 = 0.425$, the two relocalization methods are compared on test set 1 and test set 2, and the results are shown in Table 1. When the lighting conditions of the UAV input images are the same as those of the map library, the visual relocalization based on the bag-of-words model performs slightly better than the method proposed in this paper: its precision and recall are 0.77 and 2.02 percentage points higher, respectively. Given the need to prioritize precision in visual relocalization, the proposed method also shows good performance, and the performance difference between the two methods is not obvious in this case. However, when the lighting conditions of the input images differ from those of the map library, the precision and recall achieved by the visual relocalization method based on the bag-of-words model drop sharply to 40.74% and 32.35%, respectively, which is far from satisfactory for UAVs.
This is because the changes in the gray-level distribution of the images caused by the changes in illumination strengthen or weaken some of the key features, so that the detected ORB feature points and the BoW descriptors differ greatly from those of the original images, and the probability of the FP situation increases accordingly. On the other hand, because of the strong invariance of the semantic information extracted by object detection, the visual relocalization method proposed in this paper copes with changes in illumination far better than the visual relocalization method based on the bag-of-words model. Its precision and recall are still 93.18% and 87.23%, respectively, which are 52.44 and 54.88 percentage points higher than those of the visual relocalization method based on the bag-of-words model.
Figures 7(a) and 7(b) correspond to the FP situation of scene matching. In addition, the X-axis, Y-axis, and Z-axis errors obtained by the two relocalization methods in the UAV camera coordinate system are controlled within ±10 cm. When the lighting conditions change, to intuitively reflect the distribution of the relocalization errors, we use scatter charts to record the X-axis error, Y-axis error, and Z-axis error of the two relocalization methods and use a frequency distribution table (see Table 2) to record the distance error frequency distribution of the two methods.
According to Figures 7(d)-7(f), due to the changes in image appearance caused by the changes in illumination, a large number of FP situations appear in the relocalization results based on the bag-of-words model, and the data points of the bag-of-words model lie overall above those of the proposed object semantic model, which means that the overall X-axis, Y-axis, and Z-axis errors of the relocalization method based on the bag-of-words model are larger than those of the proposed method.
It can be seen from Table 2 that, with the method proposed in this paper, samples with a distance error of less than 10 cm still account for 42%, and samples with a distance error of less than 30 cm account for 86%, even though the lighting conditions have changed. This indicates that the proposed method is robust against changes in lighting conditions in the environment. By contrast, when the bag-of-words model is used for visual relocalization, the proportions of samples with a distance error of less than 10 cm and less than 30 cm are only 12% and 40%, respectively.

Running Time.
We record the average running time required to complete a single scene matching for each of the two relocalization algorithms. The running time of the algorithm proposed in this paper is affected by the number of semantic features in the map library. The average number of semantic features contained in each image in the map library is set to 7. As shown in Table 3, the proposed algorithm completes a single scene matching on average 0.027 s faster than the bag-of-words algorithm, which is known for its fast speed. The process of removing conflict matching pairs with the help of semantic features is equivalent to pruning the association graphs, which reduces the number of random walks.

Conclusion
UAVs need accurate self-localization capabilities when performing IoT tasks; however, in the real world, complex lighting changes pose huge challenges to the visual relocalization of UAVs. Oriented to the UAV-based IoT, this paper proposes a UAV visual relocalization method using semantic object features. This method uses YOLOv3 as the object detection framework, extracts the semantic information in the images, and uses the semantic information to construct topological graphs as sparse descriptions of the environment. Then, with prior knowledge of the map, a random walk algorithm is used to perform semantic feature matching as well as scene matching. Finally, the EPnP algorithm is used to solve the UAV's position and pose, which are returned to the IoT platform. In the simulation experiments, the precision and recall of the proposed method are only 0.77 and 2.02 percentage points lower than those of the visual relocalization method based on the bag-of-words model when the scene lighting conditions remain unchanged. Meanwhile, the precision and recall are 52.44 and 54.88 percentage points higher than those of the visual relocalization method based on the bag-of-words model when the lighting conditions of the scene change dynamically, which demonstrates the effectiveness and robustness of the proposed method. The average running time of this method to complete a single scene matching is 0.027 s shorter than that of the bag-of-words algorithm using ORB features, which can meet the requirement of real-time UAV visual relocalization and provide a guarantee for UAVs to perform IoT tasks.

Data Availability
The image data used to support the findings of this study have not been made available because they involve privacy.

Conflicts of Interest
The authors declare that they have no conflicts of interest.