Homing Guidance Law Design against Maneuvering Targets Based on DDPG

A novel homing guidance law against maneuvering targets based on the deep deterministic policy gradient (DDPG) is proposed. The proposed guidance law directly maps the engagement state information to the acceleration of the interceptor, making it an end-to-end guidance policy. First, the kinematic model of the interception process is described as a Markov decision process (MDP) so that it can be handled by a deep reinforcement learning (DRL) algorithm. Then, the training environment, state, action, and network structure are designed. Only the measurements of line-of-sight (LOS) angles and LOS rotational rates are used as state inputs, which greatly simplifies the problem of state estimation. Next, considering the LOS rotational rate and the zero-effort miss (ZEM), a Gaussian shaping reward and a terminal reward are designed to build a complete training and testing simulation environment. DDPG is then used to solve the resulting RL problem and obtain a guidance law. Finally, the proposed RL guidance law's performance is validated through numerical simulation examples. It demonstrates improved performance compared with the classical true proportional navigation (TPN) method and an RL guidance policy based on the deep Q-network (DQN).


Introduction
Intercepting maneuvering targets is a particular challenge due to the complexity of the engagement [1,2]. Traditional guidance and control systems for interception show their weakness when facing highly maneuvering targets, whereas intelligent methods can address the problem [3]. In the field of guidance, proportional navigation (PN) has found widespread application because of its simplicity and robustness [4]. PN is mainly divided into true proportional navigation (TPN) [5] and pure proportional navigation (PPN) [6]. For maneuvering targets, Ref. [7] investigated the capture region of realistic true proportional navigation (RTPN) in three-dimensional (3D) space, taking into account the nonlinearity of the interceptor-target relative kinematics and obtaining more general findings. However, when targets perform large maneuvers, the performance of PN can decline significantly, mainly because the commanded acceleration of PN often exceeds the capability of the interceptor, resulting in large miss distances [8]. An optimal guidance law (OGL) can intercept or strike a target while optimizing a specific performance index [9]. However, the time-to-go must be accurately estimated in OGL; otherwise, performance may degrade. Many newer guidance methods based on differential geometry [10], sliding mode control [11], and other dynamics and control theories have also been proposed. However, these guidance laws are often too complex in form, usually require too much measurement information, involve many guidance parameters, and are hence difficult to apply in practice.
Reinforcement learning (RL) [12] provides a new approach to the homing guidance law design problem. For example, Q-learning is used in [13] and [14] to adaptively determine controller parameters through training. In [15], a guidance framework based on an RL-designed guidance law is proposed, and plenty of numerical simulation results show that RL-based guidance laws are much better than PN guidance. However, these traditional RL-based algorithms improve guidance performance only by selecting suitable controller coefficients [16] and cannot achieve precise guidance under realistic disturbed conditions. Moreover, the state and action sets of traditional RL methods are discrete and low-dimensional, while the actual interception engagement is continuous and high-dimensional [17].
As deep learning (DL) continued to advance, a new class of algorithms known as DRL, combining DL and RL techniques, emerged [18]. DRL methods can effectively overcome the difficulties of complex, high-dimensional spaces [19,20], so they may have advantages in homing guidance. Ref. [21] proposed the deep Q-network (DQN), which solves the problem of high-dimensional input. Aiming at the problem of exoatmospheric homing guidance, a novel guidance method using DQN is proposed in [22]. However, DQN is better suited to discrete control problems, while the actual interceptor's acceleration is usually continuous; discretized action commands may lead to large deviations and a large miss distance.
The DDPG algorithm, introduced in [18], is an actor-critic (AC) [23] algorithm that is well suited to the homing guidance problem in continuous state and action spaces. Ref. [24] explored the possibility of applying DDPG to the design of homing guidance laws. By comparing the two learning modes of learning from zero and learning with prior knowledge, it is shown that the latter helps to improve learning efficiency. In [25] and [26], terminal guidance laws for missiles are also developed based on DDPG; the results show that the proposed policies have stronger robustness and smaller miss distances than PN. However, most DRL-based guidance laws need to measure and estimate the relative velocity and position between the target and interceptor as well as the target's acceleration [27,28]. The required measurements are numerous and usually suffer from lags and large errors. An RL-based guidance law was proposed in [29] and [30] to solve this problem, using only the LOS angle measurements and their rates of change as observations. The problem of state estimation is simplified, and the harmful effects of position and velocity estimation biases may be eliminated. Ref. [29] introduces proximal policy optimization (PPO), combined with meta-learning [31,32], to propose a homing guidance law for intercepting exoatmospheric maneuvering targets. Experimental results show that this guidance method outperforms the augmented ZEM guidance method [30]. Ref. [33] proposed a model-based DRL method, which uses deep neural networks and meta-learning to approximate a predictive model of the guidance dynamics and incorporates it into a path-integral control framework. It introduces a general framework for guidance, but it is complex in form, and the problem of estimation errors remains unsolved.
A novel homing guidance law against maneuvering targets using the DDPG algorithm is proposed in this paper. It directly maps the engagement state information to the commanded acceleration, making it an end-to-end, model-free guidance policy. The proposed homing guidance law takes only the LOS angles and LOS rates between the target and interceptor as observation and state inputs and does not require prior estimation of the target's acceleration. The DDPG algorithm can effectively handle a continuous, high-dimensional dynamic environment. A continuous action space is designed based on the interceptor's acceleration overload, the LOS rate and ZEM are the main considerations in the reward design, and the agent is trained in a 3D environment. Comparisons with TPN and a DQN-based RL guidance law show that the proposed guidance method has strong environmental adaptability and better guidance performance.
The paper is structured as follows: Section 2 presents the problem formulation, including the engagement scenario and the model of motion and measurement. Section 3 mainly introduces the DDPG algorithm, and the details of RL guidance law are described. The results are given in Section 4, and Section 5 presents the conclusion.

Problem Formulation
2.1. Engagement Scenario. A simplified engagement scenario of the interception process is used. Referring to Figure 1, the target's and interceptor's position vectors are r_t and r_m, and r is the relative position in the launch inertial coordinate system. The velocity vectors are v_t and v_m, and the relative velocity vector is v. The accelerations are a_t and a_m, and the relative acceleration vector is a = a_t − a_m.
For the process of interception, the closing velocity is usually large. If the target and interceptor maneuver along the LOS direction, it can be challenging to alter the miss distance outcome. Therefore, we assume that the interceptor maneuvers only in a plane perpendicular to the direction of LOS in the LOS coordinate system, without considering its maneuver along the LOS direction.

2.2. Motion Model of Interception. The intersection plane is formed by r and v, as shown in Figure 1, with details illustrated in Figure 2. The plane between the target and interceptor rotates as a function of their relative motion [34]. The unit vectors perpendicular and parallel to r are denoted by e_θ and e_r, respectively, and q represents the LOS angle within the plane. The relative velocity is decomposed into two components: v_r, the closing velocity, and v_θ, the relative velocity perpendicular to the LOS; v_θ causes the rotation of the LOS. Additionally, ω_s denotes the angular velocity of the LOS in 3D space, ω_s = ω_s e_ω, where e_ω is perpendicular to e_r and e_θ, forming the LOS coordinate system. According to [6], ω_s represents this angular velocity. By differentiating equation (1), the LOS direction can be represented using q_β and q_ε within the launch inertial system [35]. Referring to [6], equation (4) can be obtained.
where x_S, y_S, and z_S are the coordinate-axis unit vectors of the LOS coordinate system. According to [36] and [22], when q_ε and q_β are measured and their rates of change are obtained by filtering, the equation of motion and the intersection plane can be determined from equations (5)-(7). The ZEM is the final miss distance that would result if neither the target nor the missile maneuvered from the current instant onward [7,34]. The ZEM and time-to-go are calculated as follows.
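With the relative position r and relative velocity v defined above, a standard formulation of the time-to-go and the ZEM, which we assume corresponds to the paper's definition, is:

```latex
t_{\mathrm{go}} = -\frac{\mathbf{r}\cdot\mathbf{v}}{\lVert\mathbf{v}\rVert^{2}},
\qquad
\mathrm{ZEM} = \left\lVert \mathbf{r} + \mathbf{v}\,t_{\mathrm{go}} \right\rVert
             = \frac{\lVert\mathbf{r}\times\mathbf{v}\rVert}{\lVert\mathbf{v}\rVert}.
```

Geometrically, the ZEM is the component of r perpendicular to v, i.e., the closest-approach distance if both vehicles stop maneuvering.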
2.3. Measurement Model of Interception. The measurement model processes the information measured by the interceptor and is used to calculate the LOS angles and their rates of change from the current missile-target state [37]. Referring to Section 2.1, the relative position and velocity vectors are given by equations (9) and (10). The simulation in this paper neglects measurement errors in the relative distance, closing velocity, and LOS angles; only errors in the LOS angular rates are introduced, modeled as Gaussian noise with zero mean and a standard deviation of 1 × 10⁻⁴ rad/s.
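A minimal sketch of this measurement model, in which only the LOS angular rates are corrupted by zero-mean Gaussian noise with σ = 1 × 10⁻⁴ rad/s (function and variable names are illustrative, not the authors'):

```python
import numpy as np

def measure_los_rates(q_eps_rate, q_beta_rate, sigma=1e-4, rng=None):
    """Corrupt the true LOS angular rates (rad/s) with zero-mean
    Gaussian noise of standard deviation sigma, per the measurement
    model; angles and ranges are taken as error-free."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=2)
    return q_eps_rate + noise[0], q_beta_rate + noise[1]
```

In a full simulation these noisy rates would then be filtered before being fed to the guidance law as state inputs.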

RL Homing Guidance Law
Establishing a Markov decision model [38] of the problem is a prerequisite for designing the homing guidance law using the DRL algorithm [12]. Then, the interception problem needs to be transformed into the RL framework.
3.1. The Overview of RL. Reinforcement learning is an iterative process [39] in which an agent interacts with the environment: at each step of a training episode, it observes a state S_t, takes an action A_t, and receives an instantaneous reward R_t. The action feeds back into the environment, and through this loop the agent learns a better policy.
Reinforcement learning algorithms are broadly categorized into two families: value-function methods and policy-gradient methods [40]. The former, such as Q-learning and DQN, estimate the value of state-action pairs; the latter, such as policy gradient and AC algorithms, directly learn the policy that maps states to actions. DRL algorithms, such as DDPG and A3C, combine deep learning with these methods. However, value-function methods are not suitable for problems with high dimensions and continuous action spaces, where policy-gradient methods based on the AC architecture have more advantages. In this paper, DDPG is used to solve the problem and is compared against TPN and the DQN algorithm.

3.2. DDPG Algorithm.
DDPG is based on the AC architecture and solves RL problems with continuous state and action spaces. The algorithm uses neural networks to approximate the relevant functions: a value network (the critic) and a policy network (the actor). The value network estimates the action value of a given state-action pair, while the policy network outputs the action for a given state. The DDPG framework is shown in Figure 3.
A dual-network structure is also used in the DDPG algorithm, namely a current network and a target network. An AC-type algorithm generally includes a policy network and a value network, so DDPG has four networks in total after adopting the dual-network structure [18].
DDPG also uses a replay buffer to reduce the correlation between training data. During training, the agent randomly selects small batches of data from the experience replay pool to compute the network loss and gradient, and then updates the current policy and value networks through gradient backpropagation. DDPG differs from DQN in that it uses a soft update to refresh the target network instead of periodically copying parameters from the current network. The soft update shifts the parameters slowly each time, expressed as θ′ ← τθ + (1 − τ)θ′, where τ is the update coefficient. To avoid local optima while exploring the state space, random noise μ is added to the action, a = π_θ(s) + μ, where π_θ(s) is the output of the actor network. The loss is obtained by temporal-difference (TD) training. The pseudocode is shown in Algorithm 1.
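The soft update and noisy exploration described above can be sketched as follows; the value of τ and the Gaussian form of the exploration noise are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

TAU = 0.005  # illustrative soft-update coefficient (assumption)

def soft_update(target_params, current_params, tau=TAU):
    """theta_target <- tau * theta_current + (1 - tau) * theta_target,
    applied element-wise to each parameter array."""
    return [tau * c + (1.0 - tau) * t
            for t, c in zip(target_params, current_params)]

def explore(action, noise_std, a_max, rng=None):
    """Add zero-mean Gaussian exploration noise to the deterministic
    actor output and clip to the admissible acceleration range."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = action + rng.normal(0.0, noise_std, size=action.shape)
    return np.clip(noisy, -a_max, a_max)
```

Because τ ≪ 1, the target networks track the current networks slowly, which stabilizes the TD targets used in training.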

3.3. RL Model of Interception.
To solve the problem of interception using DDPG, the original problem needs to be transformed into the framework of RL. First, the corresponding MDP is established, and the elements of reinforcement learning are designed according to the motion model in Section 2.2.
3.3.1. State. The interception process can be described by an MDP whose environment is the 3D motion model established in Section 2. The designed state space mainly includes the LOS angles and their rates of change [29], where Δq_ε and Δq_β are the LOS angle differences. This state input therefore requires only measurements of the LOS angles and their rates. It is assumed that the interceptor has the corresponding detection capability, so the process is observable, and the variables used in this paper are defined accordingly.

3.3.2. Action. The DDPG algorithm is particularly appropriate for problems with continuous actions. Since the interceptor maneuvers continuously and only in the plane perpendicular to the LOS (maneuvering along the LOS is not considered), the continuous action space consists of the interceptor's accelerations u_1 and u_2 along the y_S and z_S directions, and the total acceleration acting on the interceptor is their vector sum. We assume that the maximal overloads of the maneuvering target and the interceptor in a single direction are 3 g and 6 g, respectively, so the maximum total overloads of the target and the interceptor are 3√2 g and 6√2 g.
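A minimal sketch of the state and action interfaces described above; the exact composition of the state vector and the 6 g per-axis saturation are assumptions based on this section:

```python
import numpy as np

G = 9.81                 # gravitational acceleration, m/s^2
U_MAX = 6.0 * G          # per-axis interceptor overload limit (assumed 6 g)

def make_state(q_eps, q_beta, q_eps_rate, q_beta_rate):
    """State built only from LOS angles and their rates (Section 3.3.1);
    the exact ordering/composition is an assumption."""
    return np.array([q_eps, q_beta, q_eps_rate, q_beta_rate])

def clip_action(u):
    """Continuous action [u1, u2]: accelerations along y_S and z_S,
    saturated at the per-axis limit so the total never exceeds 6*sqrt(2) g."""
    return np.clip(np.asarray(u, dtype=float), -U_MAX, U_MAX)
```

With each axis bounded by ±6 g, the Euclidean norm of the action is bounded by 6√2 g, matching the total overload stated in the text.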

3.3.3. Reward. The reward design is the key to RL problems.
To ensure that training converges to the optimum, reward shaping [41] is used to avoid reward sparsity and learn the optimal policy. The LOS rate and ZEM are considered in the reward function. During the interception, the LOS rate is positively correlated with the component of the relative velocity perpendicular to the LOS: the smaller its absolute value, the smaller the ZEM. The Gaussian reward [30] is a shaping reward that depends on the velocity-leading angle θ between the LOS direction and the relative velocity, with a reward coefficient σ used to adjust the reward value. Figure 4 illustrates the effect of different σ values: smaller values of σ result in smaller rewards under the same conditions, and a smaller θ corresponds to a larger reward.
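As a concrete sketch, the shaping and terminal rewards might be implemented as below; the exponential form exp(−θ²/(2σ²)), the default σ, and the pairing of the +10 bonus with the 0.2 m allowable miss distance used in training are assumptions consistent with this section and Section 4:

```python
import numpy as np

def gaussian_reward(theta, sigma=0.1):
    """Shaping reward on the velocity-leading angle theta (rad):
    maximal when the relative velocity points along the LOS; smaller
    sigma gives smaller rewards for the same theta (cf. Figure 4)."""
    return float(np.exp(-theta**2 / (2.0 * sigma**2)))

def terminal_reward(zem, miss_allowed=0.2):
    """+10 if the ZEM meets the allowable miss distance, else 0."""
    return 10.0 if zem <= miss_allowed else 0.0

def total_reward(theta, zem, done, sigma=0.1):
    """Shaping reward at every step, plus the terminal bonus at the
    end of the episode."""
    r = gaussian_reward(theta, sigma)
    return r + terminal_reward(zem) if done else r
```

Driving θ toward zero aligns the relative velocity with the LOS, which simultaneously suppresses the LOS rate and shrinks the ZEM.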
To ensure effective interception of the target, a terminal reward constraint is required, so a terminal reward function is designed: if the ZEM is within the allowable miss distance, a positive reward (+10) is given; otherwise, the reward is 0. The total reward is the sum of the shaping reward and the terminal reward.

3.4. Create the Agent. Based on the established interception framework, the networks are designed, the algorithm hyperparameters are tuned, and the DDPG agent is trained.
3.4.1. The Neural Network. The TensorFlow framework is used to build the neural networks of DDPG, which comprise two parts: the value network and the policy network. The output of the value network is the action value corresponding to the state-action pair, which differs from the Q network in DQN. The value network uses a three-layer backpropagation (BP) neural network [42], shown in Table 1; ReLU and tanh activation functions are used [43]. The policy network structure is described in Table 2. In Algorithm 1, the current target Q value is computed as y_i = r_j + (1 − d) γ Q′(s_j′, π_θ′(s_j′), w′).
Algorithm 1: DDPG for homing guidance law.

Table 3 shows the hyperparameters ultimately chosen for this study.

Simulations and Analysis
During training, state measurement errors and the response time constant are not considered. The equations of motion in each episode are integrated with the fourth-order Runge-Kutta method with a simulation step of 1 ms. Table 4 shows the initial conditions. A terminal reward with an allowable miss distance of 0.2 m was used during training. The results presented include the training results, a comparison with TPN, and a comparison with a DQN-based homing guidance law [22].
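The classical fourth-order Runge-Kutta step used to propagate the motion equations can be sketched as follows (a generic integrator, not the authors' code); with h = 1 ms it advances dx/dt = f(t, x) one step:

```python
import numpy as np

def rk4_step(f, t, x, h):
    """One classical fourth-order Runge-Kutta step of size h for
    dx/dt = f(t, x), as used to integrate the relative kinematics."""
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

The global error of this scheme is O(h⁴), so a 1 ms step keeps integration error far below the 0.2 m miss-distance tolerance over a typical engagement.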

4.1. Results of Training. The DDPG environment is built in TensorFlow, and the agent generates a large volume of data that is used to optimize its policy. The agent is trained on a computer with an NVIDIA GeForce RTX 2080 Ti GPU and an Intel Xeon Gold 6226R 2.90 GHz CPU, using Python 3.7.6 and TensorFlow 1.15.0.
TensorBoard is used to visualize the training process; training 2000 episodes took approximately 9401.27 seconds, about 2.6 hours in total. Figures 5 and 6 depict the change in loss for the policy and value networks, respectively, with the horizontal axis representing training iterations. A decrease in the policy network loss corresponds to an increase in the Q-value output of the value network, as demonstrated in Figure 5.
This indicates that the parameters of the policy network are continuously optimized toward maximum action-value output. The loss of the value network shows that the TD error is relatively large in the early stages of training; as training progresses, the network becomes increasingly optimized and the TD error decreases, which benefits algorithm training. Figure 7 illustrates the change in rewards, with the horizontal axis representing episodes and the vertical axis representing the cumulative reward (blue) and average reward (orange) of each round after smoothing. The maximum cumulative reward is achieved after about 250 episodes. Since DDPG contains two sets of networks, the stability of the algorithm is affected and fluctuations may occur, so the cumulative reward varies within a certain range after convergence. Nevertheless, the agent's policy is optimized during training, and convergence is fast.

4.2. Comparison with TPN.
During training in Section 4.1, the effects of measurement errors and time delays were not considered. Here, the agent is compared with TPN with different guidance coefficients under two target maneuver modes: constant maneuvering and sinusoidal maneuvering. The simulation takes the measurement error into account, and the control system's response delay is assumed to be two sampling periods (20 ms) after the guidance command.
The simulation is conducted under the following conditions: the launch location's latitude is 60°, longitude is 140°, launch azimuth is 90°, and altitude is 100 m. The target's and interceptor's initial states are presented in Table 5, giving an initial relative distance of 100 km with q_ε and q_β both 30°. The guidance acceleration's sampling period is 10 ms, and the simulation time step in the test is 1 ms. When the relative velocity is greater than 0, the terminal miss distance approximates the ZEM.

4.2.1. Constant-Maneuvering Target. In the case of constant-maneuvering targets, the maneuvering is considered only in the plane perpendicular to the LOS. To verify the generalization ability of the DDPG guidance law, we assume the target's acceleration is a_t = [0, 4g, 4g]^T while the interceptor's total overload limit remains 6√2 g, which differs from the target setting used during training. The TPN guidance coefficient takes the values 3 and 5. Figure 8 shows the results, and Table 6 gives the terminal miss distance. Under TPN, v_q increases continuously and the closing velocity also changes greatly, while Figure 8(d) shows that the DDPG-based guidance law effectively suppresses the LOS rate. Consequently, for N = 3 and 5, TPN eventually produces a large miss distance, whereas the DDPG guidance law produces a small one.
4.2.2. Sinusoidal-Maneuvering Target. For the sinusoidally maneuvering target, the guidance coefficient N again takes the values 3 and 5. Figure 10 gives the simulation results, and Table 6 lists the terminal miss distances of the DDPG-based guidance law and TPN. Figure 10(a) shows that the guidance law based on DDPG reduces the vertical velocity more fully than TPN, and Figure 10(d) shows that DDPG also reduces the LOS rate more effectively during the guidance process.
The above results show that the proposed RL method is more effective than TPN in intercepting targets with a certain maneuvering capability. The DDPG guidance law effectively reduces the vertical relative velocity, ensures a very small final miss distance, and mitigates the divergence of the LOS rate.

4.3. Comparison with DQN.
During the training process in Section 4.1, the target's maximum overload is 3√2 g. In this section's test, the initial conditions are kept unchanged. The target moves in a constant-maneuvering mode with a maximum overload of 3√2 g and acceleration a_t = [0, 3g, 3g]^T. The interceptor is guided by the proposed DDPG guidance law and by the DQN-based guidance law of [22]. The initial conditions are presented in Table 7, and Figures 11 and 12 show the results. Figures 11 and 12(a) show that the two methods produce discrete (DQN) and continuous (DDPG) accelerations, respectively. As shown in Figure 12(d), the terminal miss distances of the DQN and DDPG guidance laws are both below the allowable miss distance, each less than 0.01 m. When the target's overload saturation is 3√2 g, both guidance laws can fully suppress the divergence of the LOS rate. In Figure 12(b), the LOS rate of the DQN guidance law fluctuates at the end, while that of the DDPG guidance law decreases to 0; therefore, the DDPG guidance law performs better than DQN. In Figures 12(c) and 12(d), the velocity-leading angles of both guidance laws decrease continuously during the interception, that is, v_q decreases continuously. At the same time, the ZEM generated by the interceptor also decreases, and the interceptor hits the target with a small miss distance at the end time.
In addition, simulations were also conducted with the target's total overload saturation below 3√2 g. The results show that both RL-based guidance laws can effectively intercept the target, with the final miss distance within the allowable value. However, the DDPG guidance law handles a continuous action space, which is more realistic; moreover, it effectively suppresses the LOS rate (see Figure 12(b)), while the LOS rate of the DQN guidance law diverges near the end of the flight, indicating that the DDPG method performs better. Based on the simulations above, when the target possesses some maneuvering capability, both RL-based guidance laws ensure effective interception, with the DDPG guidance law outperforming the DQN guidance law.

Conclusion
We propose a DDPG-based guidance law for the guidance and control of interceptors with continuous maneuvering capability in this paper. The DDPG agent is developed using TensorFlow and optimized in the interception engagement scenario. Taking into account measurement errors and time delays in guidance control, the effectiveness of the proposed guidance law is compared with TPN and a DQN-based RL guidance law through simulations of typical examples. The findings suggest that the DDPG-based guidance law outperforms the other two in guidance performance. Future research could consider more complex interception scenarios and explore more suitable intelligent guidance methods, with potential implementation in real interception processes.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.