Autonomous Navigation of the UAV through Deep Reinforcement Learning with Sensor Perception Enhancement


Introduction
With the advantages of safety, low cost, and high mobility, unmanned aerial vehicles (UAVs) are now widely used in military and civilian fields, such as reconnaissance, patrol, rescue, and so on [1][2][3][4]. A typical application that reflects the autonomous decision-making capability of UAVs is autonomous navigation and obstacle avoidance. How to carry out real-time path planning and accomplish autonomous navigation and obstacle avoidance tasks of the UAV in the complex and uncertain environment has become a popular research problem [5][6][7].
Conventional mathematical modeling approaches to the path planning problem require an accurate model of the environment, which is evaluated before the optimal path is generated [8][9][10]. However, under complex dynamic environmental conditions, subtle changes in the external environment can invalidate a previously planned path. For this reason, UAV autonomous navigation and obstacle avoidance in complex environments puts the real-time performance of traditional path planning algorithms to a severe test. It is therefore valuable to equip the UAV system with autonomous learning and self-adaptive capabilities, so that it can detect changes in the environment and generate new navigation paths in real time.
Compared with traditional navigation and path planning methods, autonomous control strategies for agents based on deep neural networks have powerful learning and representation capabilities, and their generalization ability far exceeds that of traditional methods, so they cope better with scenarios that change in real time [11,12]. Currently, with the support of hardware and network technology [13], deep reinforcement learning (DRL) enables agents to handle continuous action spaces and high-dimensional state features, which allows the agent's behavior to be optimized more effectively [14,15]. At the same time, the generality of DRL for modeling sequential decision problems and the ease of end-to-end training have made it very popular among researchers, who have achieved many landmark results with DRL algorithms. For example, DeepMind used DRL to train an agent that plays Atari games at a level far exceeding that of humans [16], and its DRL-based AlphaGo [17] and AlphaZero [18] algorithms have long dominated the game of Go. A growing number of fields are being combined with DRL in an attempt to solve decision-making tasks that conventional methods cannot solve well.
The autonomous decision control of UAVs based on reinforcement learning has also made remarkable progress. Wei et al. [19] used the deep Q-network algorithm for the target-hunting problem by jointly optimizing the UAV's position, the UUV's trajectory, and their interconnectivity. Zhang et al. [20] proposed a reinforcement learning algorithm with a designed reward matrix for UAV path planning. Yijing et al. [21] used a Q-learning-based random exploration approach for the UAV navigation and obstacle avoidance task. Yan and Xiang [22] improved the Q-learning method for the UAV path planning task by using a policy with a Q-function initialization. The abovementioned works are based on Q-learning and related improved algorithms, which can only be applied to a discrete UAV action space and cannot be adapted to more complex UAV dynamics models and more complex environments. For this reason, a growing body of research uses DRL to overcome the shortcomings of table-based reinforcement learning algorithms. DRL frameworks based on actor-critic [23] can deal with UAV autonomous decision-making problems in continuous action spaces. For instance, Li et al. [24] performed task decomposition and pretraining for UAV path planning and combined deep deterministic policy gradient (DDPG) [25] with transfer learning to improve the training efficiency of the DDPG algorithm for UAVs. Rodriguez-Ramos et al. [26] built an automatic UAV landing model with the DDPG algorithm and achieved autonomous UAV landing missions in dynamic environments. Wang et al. [27] proposed a two-stage DDPG-based UAV obstacle avoidance method, which improves obstacle avoidance performance by combining supervised learning and reinforcement learning. Koch et al. [28] used the proximal policy optimization (PPO) algorithm [14] to implement an intelligent UAV flight control system.
From the previous research, it can be concluded that actor-critic-based DRL algorithms enable UAVs to adapt to more complex tasks, because they can process both the continuous actions of the agent and high-dimensional sensory information. However, in UAV missions, stable and reliable decision-making in partially observable environments remains the main challenge for reinforcement learning in the autonomous navigation and obstacle avoidance task [29]. In this paper, we improve the traditional actor-critic framework of DRL and design the UAV autonomous navigation and obstacle avoidance algorithm (ANOAU) to make it more adaptable to autonomous UAV decision-making in partially observable environments. First, in the design of the reward function, we use a reward-shaping approach that gives different rewards for the different situations faced by the UAV, which helps the policy converge in complex environments. Second, to better cope with partially observable environments, we introduce a gated recurrent unit (GRU) [30] in the sensor perception enhancement network of the UAV to produce an enhanced representation of the current perceptual information, enabling the critic network to derive more accurate values and thus enhancing the decision-making capability of the UAV. Finally, the backpropagated gradient of the critic network is used to update the perceptual representation network of the UAV, which improves the policy stability of the UAV during training.
The contributions of this paper are as follows: (i) An actor-critic DRL-based ANOAU algorithm is proposed, which generates precise behavioral actions for the UAV and avoids collisions through real-time sensing of the surrounding environment by multiple sensors, and finally accomplishes path planning tasks in complex environments. (ii) We introduce a GRU network to extend the sensing capability of the UAV over time, to alleviate the bias in the decision-making process of the UAV, and to enhance its autonomous decision-making capability in complex environments. (iii) To verify the algorithm performance, we designed an experimental scenario for UAV autonomous navigation and obstacle avoidance based on Unity3D, which is more realistic than the 2D scenarios used in most other works.
The structure of this paper is as follows: in Section 2, the background of this paper is presented, including the PPO algorithm, the environment, the UAV's action, and the sensor perception space. In Section 3, the ANOAU algorithm is proposed, which incorporates observation enhancement, reward function design, and UAV's actor-critic learning framework. In Section 4, we give the network structure and parameters of the ANOAU algorithm and verify the effectiveness of the algorithm in a virtual environment. Section 5 provides a summary and outlook of the whole paper.

Background
In this section, we present the background of this paper, including the PPO algorithm used for DRL and the environment setting, observation space, and action space for UAV navigation and obstacle avoidance.

2.1. PPO.
In DRL, the agent interacts with the environment and generates the following discrete-time trajectory:

$$\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T\}. \tag{1}$$

In the previous equation, $s_t$, $a_t$, and $r_t$ represent the state perceived, the action performed, and the reward obtained by the agent at time step $t$, respectively. If, along the whole trajectory, the current state and the reward depend only on the previous state and the action taken in it, then the interaction of the agent with the environment is called a Markov decision process (MDP) [31]; the MDP interaction of the UAV is shown in Figure 1.
Mathematical Problems in Engineering

The discounted sum of all rewards that the agent can obtain in the future, evaluated at the current time, is called the return $R(\tau)$, which is defined below:

$$R(\tau) = \sum_{t=1}^{T} \gamma^{t-1} r_t, \tag{2}$$

where $\gamma \in [0, 1]$ is called the discount rate, which determines the current value of future rewards, and $T$ is the terminal time step of the agent in the trajectory. The objective of RL is for the agent to learn a policy $\pi_\theta$ whose guidance allows the agent to obtain the maximum reward. Following this policy, the agent explores the environment several times and receives the total reward given in Equation (3). To improve the agent's decision-making, a gradient ascent method can be used so that the updated policy obtains a greater reward, which is called the REINFORCE [31] algorithm.
$$\nabla_\theta \mathbb{E}[R(\tau)] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n). \tag{4}$$

In the previous equation, $N$ is the total number of trajectories collected as the agent interacts with the environment, and $T_n$ is the number of steps in the $n$-th trajectory. PPO is implemented with the actor-critic framework, in which the critic network evaluates the rewards that can be obtained in the future from the current state $s_t$, and the actor network generates the action $a_t$ of the agent. The critic network parameters are denoted by ϕ; the critic's evaluation of the state $s_t$ guides the optimization of the actor network, as expressed in Equation (5).
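As a small illustration of the discounted return that weights the policy gradient above, the following sketch accumulates rewards backward through one trajectory. The function name `discounted_return` is hypothetical, and the indexing (the first reward is undiscounted) is an assumption, since the text does not fix it.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of the rewards of one trajectory.

    Assumption: the first reward is weighted by gamma**0, the second by
    gamma**1, and so on. gamma=0.9 matches the discount factor reported
    in the experiments.
    """
    total = 0.0
    # Walking backward lets each step reuse the already-discounted tail.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma=0.5` and three unit rewards, the return is 1 + 0.5 + 0.25.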
The total reward of the trajectory from the current state $s_t$ can be estimated using the value function $V_\phi(s_t)$, which can replace $R(\tau)$. Equation (6) is the loss function of the critic network; the neural network parameters are fitted mainly on the historical data collected by the agent during exploration.
To better train the actor network, the PPO algorithm introduces the advantage estimate $\hat{A}_t$, as shown in Equation (7). $\hat{A}_t$ measures the advantage of the agent's action $a_t$ relative to the average action in state $s_t$.
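One common way to compute such an advantage estimate is generalized advantage estimation (GAE). The sketch below assumes that GAE is what is meant here, which is plausible because the experimental parameters list a GAE factor; the function name `gae_advantages` and the single-trajectory interface are hypothetical.

```python
import numpy as np


def gae_advantages(rewards, values, last_value, gamma=0.9, lam=0.97):
    """Generalized advantage estimation for one uninterrupted trajectory.

    rewards[t] and values[t] are the reward and the critic's value estimate
    at step t; last_value is the value of the state after the final step
    (0.0 if the episode terminated there).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error of the critic.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of the current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

A positive entry means the action taken at that step did better than the critic expected for that state, and a negative entry means it did worse.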
In addition, the PPO algorithm introduces importance sampling [32] to improve the efficiency with which the agent reuses experience: data collected under the old policy are reweighted by the ratio of the action probability under the current policy to that under the old policy, as shown in Equation (8).
Equation (9) shows the objective function of the actor network of the PPO algorithm. The PPO algorithm introduces a parameter ε to constrain the objective function value: if the probability ratio in the objective leaves the range allowed by ε, a truncation (clipping) operation is performed. The advantage of this approach is that it constrains the size of the backpropagated gradient, thus ensuring more stable training of the actor network.
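The clipped objective can be sketched in a few lines of NumPy. This follows the standard PPO clipped surrogate rather than code from this paper; the function name `ppo_clip_objective` and its log-probability inputs are hypothetical, and `eps=0.2` matches the truncation factor reported in the experiments.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    # Importance-sampling ratio between the current and the old policy.
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    # Truncate the ratio to [1 - eps, 1 + eps] before weighting the advantage.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The element-wise minimum gives a pessimistic bound on the improvement.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

Taking the minimum means a sample cannot raise the objective further once its ratio has moved past the clipping bound in the direction favored by its advantage, which is what limits the size of each policy update.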
2.2. Environment Setting. We built a realistic UAV autonomous navigation and obstacle avoidance scene based on Unity3D, and the whole scene is shown in Figure 2.
In our environment, the altitude of the UAV is fixed. At the initial moment, the UAV is at a certain distance from the target position, and it needs to avoid obstacles to reach the target position. The UAV uses its onboard ray sensor to sense obstacles, and the maximum sensor perception range is d. The UAV is considered to be approaching an obstacle when the distance to the obstacle measured by the ray sensor is less than a certain threshold. The UAV navigation mission is successful when the distance between the UAV and the target is less than a certain threshold. The entire scene is constrained to a fixed range of three-dimensional space, in which we randomly placed a number of stationary rectangular obstacles; in addition, we set a stationary ship as the target position. When the UAV is vertically above the ship, it has reached the target position.
Figure 1: The whole MDP framework for the UAV interacting with the environment.

2.3. Observation and Action Space. The core of the UAV's navigation and obstacle avoidance task is to search for an optimal path without colliding. The UAV needs to observe environmental information through its sensors for autonomous decision-making when performing navigation and obstacle avoidance tasks. In the task we set, the main sensing information of the UAV includes the target position and the information about the obstacles sensed by the ray sensor. At each time step, the UAV's ray sensor observation of the obstacles is defined in Equation (10), where $d_1, \ldots, d_n$ denote the readings of the corresponding ray sensors: $d_i = 0$ when no obstacle is detected; otherwise, $d_i$ is a scalar between 0 and $L$ that indicates the distance between the UAV and the obstacle. The UAV's own status information includes its speed, its position, and its pitch, roll, and yaw angles, as presented in Equation (11). To simplify, we fixed the flight altitude of the UAV, thereby removing the pitch and roll angle information and retaining the yaw angle information. The simplified UAV status information is given in Equation (12). Considering the scalability and generalization of the algorithm, we further simplify the perception information of the UAV by incorporating the relative distance and direction between the UAV and the target and removing the absolute positions of the UAV and the target. The complete perception information of the UAV after simplification is shown below:

$$o_t = [\psi(t), dis(t), v(t), d_1, \ldots, d_n]. \tag{13}$$

In the previous equation, $\psi(t)$ is the yaw angle between the UAV projection and the target, $dis(t)$ is the distance between the UAV and the target, $v(t)$ is the velocity of the UAV, and $d_1, \ldots, d_n$ are the UAV's sensor observations of obstacles in the environment.
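The simplified perception vector described above can be assembled as in the following sketch. The function name `build_observation`, the planar coordinate convention, and the ordering of the entries are assumptions made for illustration; only the ingredients (relative yaw, distance to target, speed, ray readings) come from the text.

```python
import math


def build_observation(uav_xy, uav_heading_deg, uav_speed, target_xy, ray_distances):
    """Build [relative yaw, distance to target, speed, d_1, ..., d_n].

    Positions are (x, y) in the horizontal plane because the altitude is
    fixed. A ray reading of 0 means no obstacle was detected on that ray.
    """
    dx = target_xy[0] - uav_xy[0]
    dy = target_xy[1] - uav_xy[1]
    distance = math.hypot(dx, dy)
    # Bearing of the target, then the signed angle relative to the heading.
    bearing_deg = math.degrees(math.atan2(dy, dx))
    relative_yaw = (bearing_deg - uav_heading_deg + 180.0) % 360.0 - 180.0
    return [relative_yaw, distance, uav_speed, *ray_distances]
```

Because the vector contains only relative quantities, the same policy input is produced wherever the UAV and target are placed in the scene, which is the generalization argument made in the paragraph above.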
The UAV needs a reasonable speed as well as a stable heading to accomplish the navigation task. For this purpose, we designed a UAV action controller so that the executable actions of the UAV are defined as below:

$$a_t = [v(t), \psi(t)]. \tag{14}$$

At time $t$, the action output of the UAV is the velocity $v(t)$ and the yaw angle $\psi(t)$, and the attitude of the UAV is manipulated by means of the joint Equation (12). After the position information of the UAV is obtained, the information about the UAV's own state at the current time is calculated. In the actual training, we fix the UAV's speed so that the optimal path can be planned quickly, and the UAV relies on the steering action $\psi(t)$ to avoid obstacles.

Proposed Method
In this section, the sensor perception enhancement, reward function design, and learning framework are introduced, and based on the previous three components, the ANOAU algorithm is finally formed.
3.1. Sensor Perception Enhancement. In the UAV navigation task, the UAV explores obstacles in the environment through sensors that do not allow a complete observation of the environment, which poses certain challenges to the UAV's decision-making. To address this problem, we introduce a recurrent neural network that improves the UAV's decision-making ability by enhancing its memory of the environment. Specifically, a single trajectory of the UAV under partially observable conditions is shown below:

$$\tau = \{o_1, a_1, r_1, o_2, a_2, r_2, \ldots, o_T, a_T, r_T\}. \tag{15}$$

We introduce a GRU-based sensor perception enhancement module, which processes the current observation $o_t$ of the UAV to form a representation that better reflects the current state of the environment. The GRU uses two types of gates. The first is a reset gate, which resets the memory when it is closed (close to zero), allowing the GRU unit to process the next input as if it were the first input in the sequence. The reset gate $r$ is defined as follows:

$$r = \sigma(\mathrm{Linear}([o_t, h_{t-1}])). \tag{16}$$

The second is an update gate, which controls how much of the hidden state is updated with the current candidate hidden state. The update gate $z$ is defined as follows:

$$z = \sigma(\mathrm{Linear}([o_t, h_{t-1}])). \tag{17}$$

In the previous equations, $o_t$ is the UAV's observation at the current time, $h_{t-1}$ is the hidden state at the previous time, Linear is a linear transformation, and $\sigma$ is the sigmoid activation function. After the GRU gating signals are obtained, the hidden state of the previous time is reset by the reset gate; the current input and the reset hidden state are then linearly transformed, and the result is squashed into the range (−1, 1) by the tanh activation function to obtain the candidate hidden state $h'_t$, as shown in Equation (18).

$$h'_t = \tanh(\mathrm{Linear}([o_t, r \odot h_{t-1}])). \tag{18}$$

Based on $h'_t$ and the GRU update gate signal, we can obtain the hidden state information $h_t$ at the current time.
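The gating steps described above can be sketched as a small NumPy GRU cell. This is a minimal illustration under stated assumptions, not the authors' network: the class name `GRUCell`, the random initialization, and the final interpolation convention (the update gate weights the candidate state) are choices made here, and some libraries use the opposite convention for the last step.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Single GRU cell with one linear layer per gate (hypothetical sketch)."""

    def __init__(self, obs_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + hidden_dim

        def layer():
            # Weight matrix and bias of one linear transformation.
            return rng.normal(0.0, 0.1, (hidden_dim, in_dim)), np.zeros(hidden_dim)

        self.w_r, self.b_r = layer()  # reset gate
        self.w_z, self.b_z = layer()  # update gate
        self.w_h, self.b_h = layer()  # candidate hidden state

    def step(self, o_t, h_prev):
        x = np.concatenate([o_t, h_prev])
        r = sigmoid(self.w_r @ x + self.b_r)
        z = sigmoid(self.w_z @ x + self.b_z)
        # The reset gate scales the previous hidden state before it is mixed
        # with the current observation.
        x_reset = np.concatenate([o_t, r * h_prev])
        h_cand = np.tanh(self.w_h @ x_reset + self.b_h)
        # Assumed convention: z is the share taken from the candidate state.
        return (1.0 - z) * h_prev + z * h_cand
```

Calling `step` once per time step and feeding the returned hidden state back in gives the enhanced representation a memory of earlier observations, which is what compensates for the partial view of the ray sensors.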
After sensor perception enhancement, the single-trajectory experience of the UAV's exploration of the environment is given in Equation (20). The sensor perception enhancement enriches the temporal sensing information of the UAV, which supports subsequent decision-making in the UAV autonomous navigation task.

3.2. Reward Function Design.
In a DRL-based UAV navigation task, the UAV continuously optimizes its behavioral strategy through rewards, and the design of the reward function can directly affect the model convergence. A simple way to design the reward function is to set an episode reward, and when the UAV completes an episode, it receives a reward. This design approach will make the reward distribution too sparse, resulting in slow network update and convergence, making it difficult for the UAV to learn the optimal strategy.
To improve training efficiency and practicality, we designed a nonsparse reward based on reward shaping [33] to guide the UAV's autonomous navigation in partially observable environments. It consists of a distance reward, an obstacle avoidance reward, and a navigation (tracking) reward.
The UAV needs to travel to the target position. The difference $dis_{t-1} - dis_t$ between the UAV's distances from the target position at adjacent moments reflects whether the UAV is traveling toward the target position or not. When the difference is positive, the UAV is getting closer to the target, and when the difference is negative, the UAV is moving away from the target. We design the distance reward as in Equation (21). In the equation, $dis_t$ is the distance between the UAV and the target at time $t$, and $dis_{t-1}$ is the distance between the UAV and the target at time $t-1$.
In the navigation task, the UAV needs to avoid obstacles, so we designed the obstacle avoidance reward function to encourage better obstacle avoidance by the UAV. In Equation (22), we set a safety length $L$. When the distance to the nearest obstacle is less than $L$, a negative reward is given; otherwise, the reward is 0. Here, $\min(d_1, \ldots, d_n)$ represents the distance to the nearest obstacle perceived by the UAV.
Equation (23) is the navigation reward for the UAV. When the UAV successfully navigates to the target, a large positive reward $k$ is given, and if the UAV does not track the target, a small penalty $-\lambda_{penalty}$ is given.
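The three reward terms described above can be combined as in the following sketch. The exact functional forms are assumptions: the distance term is taken as linear in the distance difference, the obstacle term as a constant penalty, and the per-step penalty is applied whenever the target has not been reached. The function name `shaped_reward` and the value `k=1.0` are hypothetical; the other defaults are the hyperparameters reported in the experiments.

```python
def shaped_reward(dis_prev, dis_now, ray_distances, reached,
                  lam1=0.1, lam2=0.1, safe_len=3.0, k=1.0, lam_penalty=0.01):
    """Sum of distance, obstacle avoidance, and navigation rewards (sketch)."""
    # Distance term: positive when the UAV moved closer to the target.
    r_dis = lam1 * (dis_prev - dis_now)
    # Obstacle term: readings of 0 mean "nothing detected", so they are
    # ignored here (assumption) before taking the nearest obstacle.
    detected = [d for d in ray_distances if d > 0]
    too_close = bool(detected) and min(detected) < safe_len
    r_obs = -lam2 if too_close else 0.0
    # Navigation term: large bonus on success, small penalty otherwise.
    r_nav = k if reached else -lam_penalty
    return r_dis + r_obs + r_nav
```

Because every step yields a nonzero signal, the agent receives feedback long before it first reaches the target, which is the motivation for avoiding a purely episodic reward.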
In summary, the reward $r$ received by the UAV at each time step in the navigation task is given in Equation (24).
3.3. Learning Framework. The whole learning framework for UAV autonomous navigation and obstacle avoidance consists of three modules as follows: (1) UAV dynamics module: this module receives the motion command output by the ANOAU algorithm and directly controls the flight attitude of the UAV. (2) Environment module: this module provides interactive mission scenarios for the UAV and returns environmental information based on the UAV's sensors to assist in generating Markov decision process quintets. (3) Autonomous control module: this module is the core of the ANOAU algorithm. In this module, the UAV uses the observation and reward information to optimize its control strategy and ultimately achieve convergence of the ANOAU algorithm.
Unlike conventional reinforcement learning parameter settings, the ANOAU algorithm consists of three main blocks of parameters, namely, the observation enhancement parameters ζ, the actor network parameters θ, and the critic network parameters ϕ. In addition, to improve the stability of training, a gradient truncation operation is performed in the ANOAU algorithm. Specifically, during training, the observation enhancement information is passed forward, the critic network evaluates the enhanced information, and the actor network generates actions. When the network is optimized, the gradients are passed back through the critic network to the representation enhancement network, while the gradients generated by the actor network are not passed back. ANOAU is implemented on top of the PPO algorithm: Equation (25) shows how the actor network parameters are updated, and Equation (26) is the objective optimized for the critic network ϕ together with the observation enhancement network ζ. The aim of this approach is to allow only a single gradient flow to be backpropagated to the observation enhancement network, which reduces the update perturbations associated with a dual gradient flow and improves training stability. Figure 3 illustrates the entire learning framework for the UAV autonomous navigation task based on the sensor perception enhancement proposed in this paper, and Algorithm 1 presents the PPO-based ANOAU algorithm.
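The single gradient flow can be illustrated with a toy linear model in which all gradients are written out by hand. Everything here is a stand-in chosen for illustration: a matrix `zeta` replaces the GRU-based enhancement network, vectors `phi` and `theta` replace the critic and actor, and the actor objective is a simplified surrogate rather than the clipped PPO objective. In an automatic differentiation framework, the same effect would typically be obtained by applying a stop-gradient (detach) to the enhanced observation before it enters the actor.

```python
import numpy as np


def toy_gradients(o, target_value, advantage, zeta, theta, phi):
    """Hand-derived gradients for a linear encoder, critic, and actor."""
    h = zeta @ o                     # enhanced observation
    v = phi @ h                      # critic value estimate
    # Critic loss 0.5 * (v - target)^2: its gradient reaches both phi and zeta.
    err = v - target_value
    grad_phi = err * h
    grad_zeta = err * np.outer(phi, o)
    # Actor surrogate advantage * mu with mu = theta @ h. Here h is treated
    # as a constant (gradient truncation), so the actor adds nothing to
    # grad_zeta.
    h_const = h.copy()
    mu = theta @ h_const
    grad_theta = advantage * h_const
    return {"zeta": grad_zeta, "theta": grad_theta, "phi": grad_phi, "mu": mu}
```

Changing `advantage` alters only the `theta` gradient, which shows that the encoder parameters are driven by the critic loss alone in this toy setting.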

Experiment
In this section, we focus on verifying the performance of the ANOAU algorithm to execute navigation tasks in complex environments and compare the capabilities with the PPO algorithm.

4.1. Experimental Scenarios and Algorithm Parameters.
We described the experimental environment for the UAV autonomous navigation task in Section 2. In general, our experimental environment, sensors, and UAV dynamics model are built on Unity3D. In addition, we used ML-Agents [34], a machine learning toolkit, to adapt the whole environment for training. Figure 4 illustrates the overall experimental scenario of the UAV autonomous navigation task. In the scene, the initial location of the UAV is in the lower left corner, and the target position is the ship at sea level in the upper right corner. There are obstacles of a certain density between the UAV and the ship. Compared with other experimental scenarios that use 2D environments, our scenario is more realistic, and the algorithm is more straightforward to verify and easier to deploy.
In Section 3, we completed the design of the ANOAU algorithm, which consists of the sensor perception enhancement and the reward function design, and in Section 2 we introduced the perception and action spaces of the UAV. Three conditions end a training episode in this experiment: first, the UAV fails to reach the target position within a fixed number of time steps; second, the UAV collides with an obstacle; third, the UAV reaches the target position. Here, we detail the network structure and parameters used in the ANOAU algorithm.
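The three episode-ending conditions can be expressed as a small check that runs after every environment step. The names `EpisodeEnd` and `check_episode_end`, the threshold arguments, and the order in which the conditions are tested are assumptions for illustration.

```python
from enum import Enum


class EpisodeEnd(Enum):
    RUNNING = "running"
    TIMEOUT = "timeout"
    COLLISION = "collision"
    SUCCESS = "success"


def check_episode_end(step, max_steps, collided, dist_to_target, reach_threshold):
    """Return why the episode ends, or RUNNING if it continues."""
    # Assumed priority: success first, then collision, then timeout.
    if dist_to_target < reach_threshold:
        return EpisodeEnd.SUCCESS
    if collided:
        return EpisodeEnd.COLLISION
    if step >= max_steps:
        return EpisodeEnd.TIMEOUT
    return EpisodeEnd.RUNNING
```

Recording the returned value for each finished episode is enough to derive the collision, timeout, and success rates that are analyzed in the experimental results.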
The neural network and parameter design of the ANOAU algorithm are critical to the convergence of the algorithm. The ANOAU algorithm is based on the PPO algorithm, and its overall neural network structure is divided into two major parts. The first part forms the observation enhancement of the UAV, and the second part is the actor-critic network, where the actor network outputs the actions of the UAV and the critic network scores the current state. Table 1 shows the network structure of the ANOAU algorithm. The sensor perception enhancement network is itself divided into two parts: the first is the representation network, which consists of a two-layer linear network with 128 neurons per layer, and the second is the enhancement network, which consists of the GRU network. In the GRU network, the reset gate $r$, the update gate $z$, and the current candidate hidden state $h'$ each employ a linear layer of 128 neurons to enhance the observation. In the actor-critic network, both the actor and critic networks are composed of a linear network with two layers of 64 neurons.
Table 2 shows the parameters of the ANOAU algorithm. The hyperparameters of the reward function are set as follows: $\lambda_1$ and $\lambda_2$ are set to 0.1, $L$ is set to 3, and $\lambda_{penalty}$ is set to 0.01. The learning rate of the ANOAU algorithm is set to $1.0 \times 10^{-4}$, the batch size is set to 32, the neural network optimizer is Adam, and the network activation function is tanh. Because the ANOAU algorithm is implemented on top of the PPO algorithm, we set the reward discount factor to 0.9 and the N-step to 3, which is a trade-off between accurate and rapid evaluation of state values. The GAE factor is set to 0.97, and the truncation factor of ANOAU is 0.2.
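For reference, the reported settings can be gathered into a single configuration object, as in the sketch below. The class name `AnoauConfig` and the field names are hypothetical; the values are those stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnoauConfig:
    # Network structure
    repr_layers: int = 2
    repr_units: int = 128
    gru_units: int = 128
    actor_critic_layers: int = 2
    actor_critic_units: int = 64
    activation: str = "tanh"
    # Optimization
    optimizer: str = "adam"
    learning_rate: float = 1.0e-4
    batch_size: int = 32
    gamma: float = 0.9          # reward discount factor
    n_step: int = 3
    gae_lambda: float = 0.97    # GAE factor
    clip_epsilon: float = 0.2   # truncation factor
    # Reward shaping
    lambda_1: float = 0.1
    lambda_2: float = 0.1
    safety_length: float = 3.0
    lambda_penalty: float = 0.01


if __name__ == "__main__":
    print(AnoauConfig())
```

Freezing the dataclass keeps the settings immutable during a run, so every logged result can be traced back to one fixed configuration.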
To verify the ANOAU algorithm quickly and effectively, we fixed the speed of the UAV at 2.8 m/s in our experiment. In addition, the horizontal angle of the UAV is limited to between −60° and 60° and mapped to the interval from −1 to 1.
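The mapping between the normalized action and the horizontal angle can be sketched as a pair of helper functions. The names `action_to_yaw_deg` and `yaw_deg_to_action` are hypothetical, and the clamping of out-of-range values is an added safety assumption.

```python
MAX_YAW_DEG = 60.0   # angle limit stated in the text
UAV_SPEED_MPS = 2.8  # fixed speed stated in the text


def action_to_yaw_deg(action):
    """Map a policy output in [-1, 1] to an angle in [-60, 60] degrees."""
    clamped = max(-1.0, min(1.0, action))
    return clamped * MAX_YAW_DEG


def yaw_deg_to_action(yaw_deg):
    """Inverse mapping from an angle in degrees to the normalized action."""
    clamped = max(-MAX_YAW_DEG, min(MAX_YAW_DEG, yaw_deg))
    return clamped / MAX_YAW_DEG
```

Keeping the policy output in a symmetric unit interval matches the range of a tanh output layer, so no extra scaling is needed inside the network.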

4.2. Experimental Results. In this section, the experimental results of the ANOAU algorithm are analyzed. Here, we focus on analyzing the performance of the ANOAU algorithm in terms of cumulative rewards, UAV navigation distance, collision rate, timeout rate, and success rate.
In reinforcement learning tasks, the cumulative reward obtained by the agent is one of the key indicators of the effectiveness of an algorithm. Figure 5 shows the trend of the cumulative reward during training for the ANOAU algorithm, the PPO algorithm, and the random method. From the figure, we can see that, except for the random method, both the ANOAU algorithm and the PPO algorithm converge within a thousand episodes of training. At the beginning of training, the ANOAU algorithm shows a reward trend similar to that of the PPO algorithm. From 300 episodes onward, the ANOAU algorithm shows the advantage of sensor representation enhancement: it obtains higher rewards in each round and reaches convergence at 700 episodes. The PPO algorithm, on the other hand, has larger reward fluctuations during training and only converges at 900 episodes. Because the ANOAU algorithm integrates sensor perception enhancement, it can better process the partial observations of the UAV and therefore optimizes the behavior strategy of the UAV in the complex scenario better than the other methods.
Figure 6 shows the variation of the distance between the UAV and the target position during the training of the ANOAU algorithm, the PPO algorithm, and the random method. From the figure, the UAV behavior strategy based on the random method has an expected distance of about 90 m from the target position, while the ANOAU algorithm and the PPO algorithm continue to approach the target position as training progresses. After 800 episodes of training, the UAV can implement path planning based on the optimized strategy and reach the target position. From the overall trend, the ANOAU algorithm has a stronger strategy optimization ability than the PPO algorithm and realizes UAV path planning in fewer training episodes.
The main reason for the previous trend is that the ANOAU algorithm integrates sensor perception enhancement, which improves the memory of the surrounding environment and thus reduces collisions and increases the effective travel distance, whereas the PPO algorithm only performs a simple vector stacking operation on the sensor information and lacks the ability to memorize the environment.
Figure 7 shows the trend of the number of action steps in one episode of the UAV during training. From the figure, we can see that the random method stays at about 700 steps per episode, mainly because the weak decision-making ability of the UAV causes the episode to end by timeout or collision. With the ANOAU and PPO algorithms, however, the number of action steps in one episode decreases continuously as training proceeds. During the initial training phase, the UAV has not yet mastered the navigation strategy and therefore spends a large number of steps per episode. As training progresses, the UAV begins to collide with obstacles, which ends episodes early, and the number of steps per episode is therefore gradually reduced. Finally, after 1,000 episodes of training, the number of action steps in one episode is reduced to fewer than 200. The ANOAU algorithm converged at 700 episodes, when the UAV was able to reach the target location with a minimum number of collision-free steps, while the PPO algorithm required additional steps of trial and error in the environment and eventually reached similar performance at around 900 episodes.
Figure 8 shows the collision rate of the UAV during autonomous navigation and obstacle avoidance at different training stages. From the figure, the collision rate of the random method remained above 50% throughout the training period, mainly because it did not master any obstacle avoidance strategy.
Meanwhile, the UAV has not mastered how to avoid obstacles at the initial stage of training, so both the ANOAU algorithm and the PPO algorithm have a high probability of collision at the early stage of training. After 1,000 episodes of training, the collision rate of the UAV is close to 0, and the ANOAU and PPO algorithms are close to full convergence. The figure also shows that ANOAU outperforms the PPO algorithm in terms of collision rate during the training phase, which reflects the advantage of sensor perception enhancement in the UAV obstacle avoidance process.
A timeout in the UAV autonomous navigation task refers to the inability to reach the target position within a finite number of steps. Figure 9 shows the timeout rate of the UAV at different training stages. From the figure, we can see that under the random method, the UAV's timeout rate remains above 40%, which means that the UAV does not have the ability to make decisions: if no collision occurs, it is stuck in erratic behavior and cannot get close to the target location. In the early stages of training, the ANOAU and PPO algorithms also have a large timeout rate; however, after 400 episodes, both algorithms have acquired some initial decision knowledge, and their timeout rates are significantly lower than that of the random method. By the time training reaches 600 episodes, the timeout rates of the ANOAU and PPO algorithms are less than 20%, and by the time training reaches 1,000 episodes, both algorithms have learned how to avoid obstacles and how to plan an optimal path to the target location, and the policy has converged, so the timeout rate is close to 0.
Meanwhile, the ANOAU algorithm outperforms the PPO algorithm in terms of timeout rate, mainly because the ANOAU algorithm is able to form a better policy for the surrounding environment and can optimize a reasonable navigation path faster.
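The collision, timeout, and success rates discussed above can be computed from a list of per-episode outcomes, for example over a window of recent training episodes. The function name `outcome_rates` and the outcome labels are hypothetical.

```python
from collections import Counter


def outcome_rates(outcomes):
    """Fraction of episodes ending in success, collision, or timeout."""
    if not outcomes:
        raise ValueError("at least one episode outcome is required")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {name: counts.get(name, 0) / total
            for name in ("success", "collision", "timeout")}


if __name__ == "__main__":
    window = ["success", "collision", "success", "timeout"]
    print(outcome_rates(window))
```

Since every episode ends in exactly one of the three ways, the three rates sum to 1, so any two of the reported curves determine the third.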
To demonstrate more intuitively the specific performance of the ANOAU algorithm, we capture a selection of decision cases of the UAV navigation task in the complex environment. Figure 10 shows eight frames of UAV autonomous decision-making images based on the ANOAU algorithm. In Figure 10(a), the UAV is in the departure phase, when the UAV senses that there are no obstacles around it and therefore travels toward the target location. In Figure 10(b), the UAV has detected an obstacle and started to execute an obstacle avoidance strategy to avoid a collision. In Figure 10(c)-10(g), the UAV has reached a denser area of obstacles and successfully implements obstacle avoidance with the aid of the ANOAU algorithm and gradually approaches the target position. In Figure 10(h), the UAV has reached the target position.

4.3. Discussions.
In this section, we validated the ANOAU algorithm by constructing a complex environment and analyzed its performance in terms of the reward, the distance traveled by the UAV, the collision rate, and the timeout rate. The experimental results show that the ANOAU algorithm is able to accomplish the UAV navigation task in a complex environment and performs better than the PPO algorithm. The reason is that the ANOAU algorithm incorporates a sensor perception enhancement module, which provides the UAV with more accurate perceptual representation information for sequential decision-making and improves the accuracy of decision-making under partially observable conditions.
In addition, we set the UAV to travel at a fixed speed during training, which reduces the difficulty of training and accelerates the convergence of the algorithm. However, this also has a drawback: if dynamic obstacles appear in the environment, the UAV may not be able to avoid them quickly enough, resulting in mission failure.

Conclusions
In this paper, an ANOAU algorithm based on DRL is proposed to deal with the problem of autonomous navigation and obstacle avoidance of the UAV in a complex environment. Compared with some previous studies, our approach introduces a sensor perception enhancement module to cope with the complex environment and designs a set of reward functions for UAV path planning. In addition, we use Unity3D for experimental validation to make the scenario more realistic. Through the experiments, we found that the ANOAU algorithm converges well and is effective for the path planning task of the UAV in a complex environment.
Autonomous navigation and obstacle avoidance for the UAV remain challenging tasks. The following points are objectives for future improvement: (i) This paper adopts ray sensors as the main means of sensing the surrounding environment, and future work can integrate visual sensors to further enhance the perception capability of UAVs. (ii) The algorithm designed in this paper mainly considers the enhancement of the sensory representation and gives little consideration to training speed and training stability; future work needs to improve the algorithm further so that it can quickly generate robust and stable behavioral decisions. (iii) The experiments in this paper are based on a virtual environment, and future work needs to migrate from virtual to real scenes.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.