Improving Maneuver Strategy in Air Combat by Alternate Freeze Games with a Deep Reinforcement Learning Algorithm

In a one-on-one air combat game, the opponent's maneuver strategy is usually not deterministic, which leads us to consider a variety of opponent strategies when designing our maneuver strategy. In this paper, an alternate freeze game framework based on deep reinforcement learning is proposed to generate the maneuver strategy in an air combat pursuit. The maneuver strategy agents for aircraft guidance of both sides are designed for a one-on-one air combat scenario at a flight level with fixed velocity. Middleware which connects the agents and air combat simulation software is developed to provide a reinforcement learning environment for agent training. A reward shaping approach is used, by which the training speed is increased and the performance of the generated trajectory is improved. Agents are trained by alternate freeze games with a deep reinforcement learning algorithm to deal with nonstationarity. A league system is adopted to avoid the red queen effect in the game where both sides implement adaptive strategies. Simulation results show that the proposed approach can be applied to maneuver guidance in air combat, and typical angle fight tactics can be learnt by the deep reinforcement learning agents. When training against an opponent with an adaptive strategy, the winning rate can reach more than 50%, and the losing rate can be reduced to less than 15%. In a competition with all opponents, the winning rate of the strategic agent selected by the league system is more than 44%, and the probability of not losing is about 75%.


Introduction
Despite improvements in long-range radar and missile technology, there is still a scenario in which two fighter aircraft may not detect each other until they are within visual range. Therefore, modern fighters are designed for close combat, and military pilots are trained in air combat basic fighter maneuvering (BFM). Pursuit is a kind of BFM, which aims to control an aircraft to reach a position of advantage when it is fighting against another aircraft [1].
In order to reduce the workload of pilots and remove the need to provide them with complex spatial orientation information, many research studies focus on autonomous air combat maneuver decision-making. Toubman et al. [2][3][4][5] used rule-based dynamic scripting in one-on-one, two-on-one, and two-on-two air combat, which requires hard coding the air combat tactics into a maneuver selection algorithm. A virtual pursuit point-based combat maneuver guidance law for an unmanned combat aerial vehicle (UCAV) is presented and used in X-Plane-based nonlinear six-degrees-of-freedom combat simulation [6,7]. Eklund et al. [8,9] presented a nonlinear, online model predictive controller for pursuit and evasion of two fixed-wing autonomous aircraft, which relies on previous knowledge of the maneuvers.
Game-theoretic approaches are widely used in the automation of air combat pursuit maneuvers. Austin et al. [10,11] proposed a matrix game approach to generate intelligent maneuver decisions for one-on-one air combat. A limited search method is adopted over discrete maneuver choices to maximize a scoring function, and the feasibility of real-time autonomous combat is demonstrated in simulation. Ma et al. [12] formulated the cooperative occupancy decision-making problem in air combat as a zero-sum matrix game and designed a double-oracle algorithm combined with neighborhood search to solve the model. In [13,14], the air combat game is regarded as a Markov game, and the pursuit maneuver strategy is solved by computing its Nash equilibrium. This approach solves the problem that the matrix game cannot deal with continuous multiple states. However, it is only suitable for rational opponents. An optimal pursuit-evasion fighter maneuver is formulated as a differential game and then solved by nonlinear programming, which is complex and requires an enormous amount of calculation [15].
Other approaches for pursuit maneuver strategy generation include the influence diagram, the genetic algorithm, and approximate dynamic programming. The influence diagram is used to model sequential maneuver decisions in air combat, and high-performance simulation results are obtained [16][17][18]. However, this approach is difficult to apply in practice since the influence diagram is converted into a nonlinear programming problem, which cannot meet the demand for fast computation during air combat. In [19], a genetics-based machine learning algorithm is implemented to generate high angle-of-attack air combat maneuver tactics for the X-31 fighter aircraft in a one-on-one air combat scenario. Approximate dynamic programming (ADP) can be employed to solve the air combat pursuit maneuver decision problem quickly and effectively [20,21]. By controlling the roll rate, it can provide fast maneuver response in a changing situation.
Most of the previous studies have used various algorithms to solve the pursuit maneuver problem and have obtained satisfactory results, but two problems remain. One is that previous studies have assumed the maneuver strategy of the opponent is deterministic or generated by a fixed algorithm; in realistic situations, it is difficult for these approaches to deal with the flexible strategies adopted by different opponents. The other is that traditional algorithms rely heavily on prior knowledge and have high computational complexity, so they cannot adapt to the rapidly changing situation in air combat. Since UCAVs have received growing interest worldwide and the flight control systems of aircraft are developing rapidly towards intellectualization [22], it is necessary to study intelligent maneuver strategies for combat between UCAVs and manned aircraft.
In this paper, a deep reinforcement learning- (DRL-) [23] based alternate freeze game approach is proposed to train guidance agents which can provide maneuver instructions in air combat. Using alternate freeze games, in each training period, one agent is learning while its opponent is frozen. DRL is a type of artificial intelligence which combines reinforcement learning (RL) [24] and deep learning (DL) [25]. RL allows an agent to learn directly from the environment through trial and error without perfect knowledge of the environment in advance. A well-trained DRL agent can automatically determine an adequate behavior within a specific context, trying to maximize its performance with little computation time. The theory of DRL is well suited to solving sequential decision-making problems such as maneuver guidance in air combat.
DRL has been utilized in many decision-making fields, such as video games [26,27], board games [28,29], and robot control [30], achieving human-level or superhuman performance. For aircraft guidance research, Waldock et al. [31] proposed a DQN method to generate a trajectory to perform a perched landing on the ground. Alejandro et al. [32] proposed a DRL strategy for autonomous landing of a UAV on a moving platform. Lou and Guo [33] presented a uniform framework of adaptive control using policy-search algorithms for a quadrotor. An RL agent is developed for guiding a powered UAV from one thermal location to another by controlling the bank angle [34]. For air combat research, You et al. [35] developed an innovative framework for cognitive electronic warfare tasks using a DRL algorithm without prior information. Luo et al. [36] proposed a Q-learning-based air combat target assignment algorithm which avoids relying on prior knowledge and performs well.
Reward shaping is a method of incorporating domain knowledge into RL so that the algorithms are guided faster towards more promising solutions [37]. Reward shaping is widely adopted in the RL community, and it is also used in aircraft planning and control. In [4], a reward function is proposed to remedy the false rewards and punishments for firing air combat missiles, which allows computer-generated forces (CGFs) to generate more intelligent behavior. Tumer and Agogino [38] proposed difference reward functions in a multiagent air traffic system and showed that agents can manage effective route selection and significantly reduce congestion. In [39], two types of reward functions are developed to solve ground holding and air holding problems, which assist air traffic controllers in maintaining a high standard of safety and fairness between airlines.
Previous research studies have shown the benefits of using DRL to solve the air maneuver guidance problem. However, two problems remain in applying DRL to air combat. One is that previous studies did not have a specific environment for air combat, instead either building on universal environments [32] or using a discrete grid world, neither of which provides air combat simulation functionality.
The other problem is that classic DRL algorithms are almost all one-sided optimization algorithms, which can only guide an aircraft to a fixed location or a regularly moving destination. However, the opponent in air combat has diverse and variable maneuver strategies, which are nonstationary, and classic DRL algorithms cannot deal with them. Hernandez-Leal et al. reviewed how nonstationarity is modelled and addressed by state-of-the-art multiagent learning algorithms [40]. Some researchers combine MDPs with game theory and study reinforcement learning in stochastic games to solve the nonstationarity problem [41][42][43], but two limitations remain. One is the assumption that the opponent's strategy is rational: in each training step, the opponent chooses the optimal action, so a trained agent can only deal with a rational opponent and cannot fight against a nonrational one. The other is the use of linear programming or quadratic programming to calculate the Nash equilibrium, which leads to a huge amount of calculation.
The Minimax-DQN algorithm [44], which combines DQN and Minimax-Q learning [41] for the two-player zero-sum Markov game, was proposed in a recent paper. Although it can be applied to complex games, it still can only deal with rational opponents and needs linear programming to calculate the Q value.
The main contributions of this paper can be summarized as follows: (i) Middleware is designed and developed, which makes specific air combat simulation software usable as an RL environment. An agent can be trained in this environment and then used to guide an aircraft to reach the position of advantage in one-on-one air combat. (ii) Guidance agents for the air combat pursuit maneuvers of both sides are designed. The reward shaping method is adopted to improve the convergence speed of training and the performance of the maneuver guidance strategies. An agent is trained in an environment where its opponent is also an adaptive agent, so that the well-trained agent has the ability to fight against an opponent with an intelligent strategy. (iii) An alternate freeze game framework is proposed to deal with nonstationarity, which can be used to solve the problem of variable opponent strategies in RL. The league system is adopted to select the agent with the highest performance and avoid the red queen effect [45].
This paper is organized as follows. Section 2 introduces the maneuver guidance strategy problem of one-on-one air combat under consideration, as well as the DRL-based model and training environment used to solve it. The training optimization, including reward shaping and alternate freeze games, is presented in Section 3. The simulation and results of the proposed approach are given in Section 4. The final section concludes the paper.

Problem Formulation
In this section, the problem of maneuver guidance in air combat is introduced. The training environment is designed, and the DRL-based model is set up to solve it.

Problem Statement.
The problem solved in this paper is a one-on-one air combat pursuit-evasion maneuver problem. The red aircraft is regarded as our side and the blue one as an enemy.
The objective of the guidance agent is to learn a maneuver strategy (policy) to guide the aircraft from its current position to a position of advantage and maintain the advantage. At the same time, it should guide the aircraft not to enter the advantage position of its opponent. The dynamics of the aircraft is described by a point-mass model [17] and is given by the following differential equations:

$$\dot{x} = v\cos\theta\cos\psi,\qquad \dot{y} = v\cos\theta\sin\psi,\qquad \dot{z} = v\sin\theta,$$
$$\dot{v} = \frac{T\cos\beta - D}{m} - g\sin\theta,\qquad \dot{\theta} = \frac{(L + T\sin\beta)\cos\phi - mg\cos\theta}{mv},\qquad \dot{\psi} = \frac{(L + T\sin\beta)\sin\phi}{mv\cos\theta},$$

where (x, y, z) are the 3-dimensional coordinates of the aircraft. The terms v, θ, and ψ are the speed, flight path angle, and heading angle of the aircraft, respectively. The mass of the aircraft and the acceleration due to gravity are denoted by m and g, respectively. The three forces are the lift force L, drag force D, and thrust force T. The remaining two variables are the angle of attack β and the bank angle ϕ, as shown in Figure 1.
In this paper, the aircraft is assumed to fly at a fixed velocity in the horizontal plane, and the assumption can be written as follows:

$$\dot{v}_t = 0,\qquad \theta_t = 0,\qquad \dot{\theta}_t = 0.$$

The equations of motion for the aircraft are simplified as follows:

$$\phi_{t+\delta t} = \phi_t + \dot{\phi}_t\,\delta t,\qquad \dot{\psi}_t = \frac{g\tan\phi_t}{v},\qquad \psi_{t+\delta t} = \psi_t + \dot{\psi}_t\,\delta t,$$
$$x_{t+\delta t} = x_t + v\,\delta t\cos\psi_{t+\delta t},\qquad y_{t+\delta t} = y_t + v\,\delta t\sin\psi_{t+\delta t},$$

where v, \(\dot{\phi}_t\), \(\phi_t\), \(\dot{\psi}_t\), and \(\psi_t\) are the speed, roll rate, bank angle, turn rate, and heading angle of the aircraft at time t, respectively.
The term A_t is the action generated by the guidance agent, and the control actions available to the aircraft are a ∈ {roll-left, maintain-bank-angle, roll-right}. The simulation time step is denoted by δt.
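As a concrete illustration, the simplified update can be sketched in Python. The parameter values, the bank-angle limit, and the coordinated-turn relation \(\dot{\psi} = g\tan\phi/v\) are illustrative assumptions, not settings taken from the paper:

```python
import math

def step(state, action, dt, v=200.0, roll_rate=40.0, g=9.8):
    """One simulation step of the fixed-velocity, level-flight model.

    state  = (x, y, phi, psi): position [m], bank angle and heading [deg].
    action in {-1, 0, 1}: roll-left, maintain-bank-angle, roll-right.
    The coordinated-turn relation psi_dot = g*tan(phi)/v and the +/-80 deg
    bank limit are assumptions consistent with the point-mass model.
    """
    x, y, phi, psi = state
    # bank angle integrates the commanded roll rate, clipped to a limit
    phi = max(-80.0, min(80.0, phi + action * roll_rate * dt))
    # turn rate follows from the bank angle at constant speed [deg/s]
    psi_dot = math.degrees(g * math.tan(math.radians(phi)) / v)
    psi = (psi + psi_dot * dt) % 360.0
    # position advances along the new heading
    x += v * dt * math.cos(math.radians(psi))
    y += v * dt * math.sin(math.radians(psi))
    return (x, y, phi, psi)
```

Holding `action = 0` from wings-level flight yields a straight track, while a sustained `action = 1` rolls the aircraft right and curves the trajectory.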

Mathematical Problems in Engineering
In aircraft maneuver strategy design, maneuvers in a fixed plane are usually used to measure performance. Figure 2 shows a one-on-one air combat scenario [20], in which each aircraft flies at a fixed velocity in the X-Y plane under a maneuver strategy at time t. The position of advantage is one in which an aircraft gains the opportunity to fire at its opponent. It is defined as

$$d_{\min} < d_t < d_{\max},\qquad |\mu_t^r| < \mu_{\max},\qquad |\eta_t^b| < \eta_{\max},$$

where superscripts r and b refer to red and blue, respectively. The term d_t is the distance between the aircraft and the target at time t, μ_t^r is the deviation angle, defined as the angle between the line-of-sight vector and the heading of our aircraft, and η_t^b is the aspect angle, measured between the longitudinal symmetry axis of the target aircraft (toward the tail) and the line from the target aircraft's tail to the attacking aircraft's nose. The terms d_min, d_max, μ_max, and η_max are thresholds, where the subscripts max and min represent the upper and lower bounds of the corresponding variables, respectively, determined by the combat mission and the performance of the aircraft. The expression d_min < d_t < d_max ensures that the opponent is within the attacking range of the aircraft's air-to-air weapon. The expression |μ_t^r| < μ_max refers to an area from which the blue aircraft can hardly escape once the sensor locks on. The expression |η_t^b| < η_max defines an area where the killing probability is high when attacking from the rear of the blue aircraft.
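The position-of-advantage condition can be expressed as a simple predicate. The threshold values below are illustrative placeholders, not the paper's settings:

```python
def in_advantage(d, mu_r, eta_b,
                 d_min=100.0, d_max=500.0, mu_max=30.0, eta_max=60.0):
    """Check the position-of-advantage condition.

    d     : distance to the opponent [m]
    mu_r  : deviation angle of our (red) aircraft [deg]
    eta_b : aspect angle of the opponent (blue) [deg]
    All threshold defaults are illustrative assumptions.
    """
    return (d_min < d < d_max) and abs(mu_r) < mu_max and abs(eta_b) < eta_max
```

An aircraft 300 m behind its opponent with small deviation and aspect angles satisfies the condition; being too far away or poorly aimed does not.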

DRL-Based Model.

An RL agent interacts with an environment over time and aims to maximize the long-term reward [24]. At each training time t, the agent receives a state S_t in a state space S and generates an action A_t from an action space A following a policy π: S × A ⟶ R. Then, the agent receives a scalar reward R_t and transitions to the next state S_{t+1} according to the environment dynamics, as shown in Figure 3.
A value function V(s) is defined to evaluate the air combat advantage of each state, which is the expectation of the discounted cumulative reward over all states following time t:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s\right],$$

where γ ∈ [0, 1] is the discount factor and the policy π(s, a) is a mapping from the state space to the action space. The agent guides an aircraft to maneuver by changing its bank angle ϕ using the roll rate \(\dot{\phi}\). The action space of DRL is {−1, 0, 1}, and A_t can take one of these three values at time t, meaning roll-left, maintain-bank-angle, and roll-right, respectively. The position of the aircraft is updated according to (3). The action-value function Q(s, a) can be used, defined as the value of taking action a in state s under a policy π:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s,\, A_t = a\right].$$

The Bellman optimality equation is

$$Q^{*}(s, a) = \mathbb{E}\!\left[R_t + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \,\middle|\, S_t = s,\, A_t = a\right].$$

Any policy that is greedy with respect to the optimal evaluation function Q*(s, a) is an optimal policy. In practice, Q*(s, a) can be obtained through iterations using temporal-difference learning, and its update formula is defined as

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_t + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right],$$

where Q(S_t, A_t) is the estimated action-value function and α is the learning rate.
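The temporal-difference update can be sketched in tabular form. The dict-based Q store is illustrative; the paper uses neural network function approximation instead:

```python
def q_update(Q, s, a, r, s_next, actions,
             alpha=0.1, gamma=0.99, terminal=False):
    """One temporal-difference update of the action-value estimate Q(s, a).

    Q is a dict mapping (state, action) -> value; missing entries read 0.
    Implements Q(S,A) <- Q(S,A) + alpha*[R + gamma*max_a' Q(S',a') - Q(S,A)],
    with the bootstrap term dropped at terminal states.
    """
    target = r if terminal else (
        r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions))
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q[(s, a)]
```

Repeated updates on the same transition move the estimate toward the bootstrapped target, which is the behavior the DQN loss reproduces with gradient steps.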
In air combat maneuvering applications, the reward is sparse: only at the end of a game (the terminal state) is a reward value given, according to the result of winning, losing, or drawing. This forces learning algorithms to use delayed feedback to determine the long-term consequences of their actions in all nonterminal states, which is time consuming. Reward shaping was proposed by Ng et al. [46] to introduce additional rewards into the learning process, which guide the algorithm towards learning a good policy faster. The shaping reward function F(S_t, A_t, S_{t+1}) has the form

$$F(S_t, A_t, S_{t+1}) = \gamma\Phi(S_{t+1}) - \Phi(S_t),$$

where Φ(S_t) is a real-valued function over states. It can be proven that the final policy after using reward shaping is equivalent to the final policy without it [46]. R_t in the previous equations can be replaced by

$$R'_t = R_t + F(S_t, A_t, S_{t+1}).$$

Taking the red agent as an example, the state space of one guidance agent can be described with a vector. The state that an agent receives in one training step includes the position, heading angle, and bank angle of itself and the position and heading angle of its opponent, which is a 7-dimensional infinite vector. Therefore, a function approximation method should be used to combine features of the state with learned weights. DRL is a solution to this problem. Great achievements have been made using advanced DRL algorithms such as DQN [26], DDPG [47], A3C [48], and PPO [49]. Since the action of aircraft guidance is discrete, the DQN algorithm can be used. The action-value function can be estimated with function approximation such as Q(s, a, w) ≈ Q^π(s, a), where w are the weights of a neural network.
The DQN algorithm is executed according to the training time step. However, there is not only a training time step but also a simulation time step. Because of this inconsistency, the DQN algorithm needs to be modified to train a guidance agent.
The pseudo-code of DQN for aircraft maneuver guidance training in air combat is shown in Algorithm 1. At each training step, the agent receives information about its aircraft and the opponent, then generates an action and sends it to the environment. After receiving the action, the aircraft in the environment maneuvers according to this action in each simulation step.
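The coordination of the two time scales might look like the following sketch, where one chosen action is held for a whole training step of simulation steps. The `env` and `agent` interfaces are assumed stand-ins, not the paper's actual middleware API:

```python
import random

def run_training_episode(env, agent, dt_train, dt_sim, epsilon=0.1):
    """Sketch of one training episode with nested time steps.

    Each training step (length dt_train) holds the chosen action fixed for
    dt_train/dt_sim simulation steps. `env.step` returns (state, reward,
    done); `agent` exposes actions, best_action, store, learn. All of these
    interfaces are illustrative assumptions.
    """
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy selection over the discrete roll commands
        if random.random() < epsilon:
            a = random.choice(agent.actions)
        else:
            a = agent.best_action(s)
        # hold the action for one training step worth of simulation steps
        for _ in range(int(dt_train / dt_sim)):
            s_next, r, done = env.step(a, dt_sim)
            if done:
                break
        agent.store(s, a, r, s_next, done)   # replay memory
        agent.learn()                        # minibatch gradient step
        s = s_next
```

With `dt_train = 1.0` and `dt_sim = 0.5`, for example, every stored transition spans two simulation steps, matching the inner loop of Algorithm 1.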

Middleware Design.

An RL agent improves its ability through continuous interaction with the environment. However, there is no maneuver guidance environment for air combat, and some combat simulation software can only execute agents, not train them. In this paper, middleware is designed and developed based on commercial air combat simulation software [50] coded in C++. The functions of the middleware include the interface between the software and RL agents, coordination of the simulation time step and the RL training time step, and reward calculation. With it, an RL environment for aircraft maneuver guidance training in air combat is obtained. The guidance agents of both aircraft have the same structure. Figure 4 shows the structure and the training process of the red side.
In actual or simulated air combat, an aircraft perceives the situation through multiple sensors. By analyzing the situation after information fusion, maneuver decisions are made by the pilot or an auxiliary decision-making system. In this system, situation perception and decision-making are the same as in actual air combat. It is assumed that a sensor with full situation perception ability is used, through which an aircraft can obtain the position and heading angle of its opponent.
Maneuver guidance in air combat using RL is an episodic task: there is a notion of episodes of some length, where the goal is to take the agent from a starting state to a goal state [51]. First, a one-on-one air combat scenario at a flight level with fixed velocity is set up in the simulation software. Then, the situation information of both aircraft is randomly initialized, including their positions, heading angles, bank angles, and roll rates, which constitute the starting state of each agent. A goal state is reached when one aircraft reaches the position of advantage and maintains it, when one aircraft flies out of the sector, or when the simulation time runs out.
The middleware can be used in both the training and application phases of agents. The training phase is the process that takes the agent from zero to having maneuver guidance ability, and it includes training episodes and testing episodes. In a training episode, the agent is trained by the DRL algorithm, and its guidance strategy is continuously improved. After a number of training episodes, some testing episodes are performed to verify the ability of the agent at the current stage, during which the strategy does not change. The application phase is the process of using the well-trained agent for maneuver guidance in air combat simulation.
A training episode is composed of training steps. In each training step, first, the information recognized by the airborne sensor is sent to the guidance agent through the middleware as a tuple of state coded in Python; see Steps 1, 2, and 3 in Figure 4. Then, the agent generates an action using its neural networks according to the exploration strategy, as shown in Steps 4 and 5. The simulation software receives a guidance instruction transformed by the middleware and sends the next situation information to the middleware; see Steps 6 and 7. Next, the middleware transforms the situation information into state information, calculates the reward, and sends them to the agent, as shown in Steps 2, 3, and 8. Because agent training is an iterative process, all the steps except Step 1 are executed repeatedly in an episode. Last, the tuple of the current state, action, reward, and next state in each step is stored in the replay memory for the training of the guidance agent; see Steps 9 and 10. In each simulation step, the aircraft maneuvers according to the instruction and recognizes the situation according to its sensor configuration.
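The middleware's role in these steps can be sketched as follows. The `sim` interface is a hypothetical stand-in for the C++ simulation software, not its actual API:

```python
class Middleware:
    """Sketch of middleware between combat simulation software and an RL
    agent: state translation, reward calculation, and time-step
    coordination. Every method name on `sim` is an assumption."""

    def __init__(self, sim, reward_fn, dt_train, dt_sim):
        self.sim, self.reward_fn = sim, reward_fn
        self.steps_per_action = int(dt_train / dt_sim)
        self.dt_sim = dt_sim

    def observe(self):
        # Steps 2-3: situation information -> state tuple for the agent
        return self.sim.own_state() + self.sim.opponent_state()

    def execute(self, action):
        # Steps 6-8: forward the guidance instruction, advance the
        # simulation, and compute the reward for the agent
        for _ in range(self.steps_per_action):
            self.sim.command(action)
            self.sim.advance(self.dt_sim)
            if self.sim.terminal():
                break
        s_next = self.observe()
        return s_next, self.reward_fn(s_next), self.sim.terminal()
```

The agent then only ever sees `(state, reward, done)` tuples, which is what makes an ordinary DQN loop usable on top of the simulation software.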

Training Optimization
3.1. Reward Shaping. Two problems need to be solved when using the DRL method to train a maneuver guidance agent in air combat. One is that the only criterion for evaluating the guidance is successful arrival at the position of advantage, which is a sparse reward problem and leads to slow convergence of training. The other is that, in realistic situations, the time to arrive at the position of advantage and the quality of the trajectory need to be considered. In this paper, a reward shaping method is proposed to improve the training speed and the performance of the maneuver guidance strategy.
Reward shaping [46] is usually used to modify the reward function to facilitate learning while maintaining the optimal policy, which is a manual endeavour. In this study, there are four rules to follow in reward shaping, the fourth of which is:

(iv) The aircraft should be guided to its goal zone in as short a time as possible.

According to the above rules, the reward function is defined as

$$R'_t = R_t + F(S_t, A_t, S_{t+1}),$$

where R_t is the original scalar reward and F(S_t, A_t, S_{t+1}) is the shaping reward function. R_t is given by

$$R_t = w_1 T(S_{t+1}) - w_2,$$

where T(S_{t+1}) is the termination reward function. The term w_1 is a coefficient, and w_2 is a penalty for time consumption, which is a constant. There are four kinds of termination states: the aircraft arrives at the position of advantage; the opponent arrives at its position of advantage; the aircraft moves out of the combat area; or the maximum number of control times has been reached while each aircraft is still in the combat area and has not reached its advantage situation. The termination reward is the reward obtained when the next state is a termination state, which is often used in standard RL training.

(9) With probability ϵ select a random action A_t
(10) Otherwise select A_t = argmax_a Q(S_t, a, w)
(11) for i = 0, Δt/δt do
(12)     Execute action A_t in air combat simulation software
(13)     Obtain the positions of aircraft and target
(14)     if episode terminates then
(15)         break
(16)     end if
(17) end for
(18) Observe reward R_t and state S_{t+1}
(19) Store transition [S_t, A_t, R_t, S_{t+1}] in replay memory D
(20) if episode terminates then
(21)     break
(22) end if
(23) Sample random minibatch of transitions [S_j, A_j, R_j, S_{j+1}] from D
(24) if episode terminates at step j + 1 then
(25)     set Y_j = R_j
(26) else
(27)     set Y_j = R_j + γ max_{a'} Q⁻(S_{j+1}, a', w⁻)
(28) end if
(29) Perform a gradient descent step on (Y_j − Q(S_j, A_j, w))² with respect to the network parameters w
(30) Every C steps reset Q⁻ = Q
(31) end for
(32) end for

Algorithm 1: DQN [26] for maneuver guidance agent training.
It is defined as

$$T(S_{t+1}) = \begin{cases} c_1, & \text{the aircraft arrives at the position of advantage}, \\ c_2, & \text{the opponent arrives at its position of advantage}, \\ c_3, & \text{the aircraft moves out of the combat area}, \\ c_4, & \text{the maximum number of control times is reached}. \end{cases}$$

Usually, c_1 is a positive value; c_2 is a negative value; and c_3 and c_4 are nonpositive.
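With the values reported later in the paper (c1 = 2, c2 = −2, c3 = −1, c4 = −1), the termination reward reduces to a simple lookup over the four terminal outcomes:

```python
def termination_reward(outcome, c1=2.0, c2=-2.0, c3=-1.0, c4=-1.0):
    """Termination reward T(S_{t+1}) over the four terminal outcomes.

    The outcome labels are assumed names for the four termination states
    described in the text; the default c-values match the paper's settings.
    """
    return {"win": c1,          # aircraft reaches the position of advantage
            "lose": c2,         # opponent reaches its position of advantage
            "out": c3,          # aircraft leaves the combat area
            "draw": c4}[outcome]  # maximum number of control times reached
```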
The shaping reward function F(S_t, A_t, S_{t+1}) has the form described in (9). The real-valued function Φ(S_t) is an advantage function of the state: the larger the value of this function, the more advantageous the current situation is.
This function can provide additional information to help the agent select actions, which is better than using only the termination reward. It is defined in terms of a distance reward function D(S_t) and an orientation reward function O(S_t), where k, which has units of meters/degree, is used to adjust the relative effect of range and angle; a value of 10 is effective.
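A possible shape for such a potential is sketched below. The product form and the desired-range term are assumptions for illustration; only the role of k as a meters-per-degree trade-off follows the text. The shaping term itself is the standard potential-based form of Ng et al.:

```python
import math

def potential(d, mu_r, eta_b, k=10.0, d_des=300.0):
    """Illustrative advantage potential Phi(S): largest near the desired
    range with small deviation and aspect angles. The exact D(S) and O(S)
    forms and d_des are assumptions, not the paper's equations."""
    orientation = 1.0 - (abs(mu_r) + abs(eta_b)) / 360.0   # O(S) in [0, 1]
    distance = math.exp(-abs(d - d_des) / (k * 180.0))     # D(S) in (0, 1]
    return orientation * distance

def shaping_reward(phi_s, phi_s_next, gamma=0.99):
    """Potential-based shaping F(S, A, S') = gamma*Phi(S') - Phi(S), which
    leaves the optimal policy unchanged (Ng et al.)."""
    return gamma * phi_s_next - phi_s
```

Moving toward a higher-potential state yields a positive shaping term, giving the agent dense feedback between terminal states.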

Alternate Freeze Game DQN for Maneuver Guidance Agent Training.

Unlike board and RTS games, data from human players in air combat games are very rare, so pretraining methods using supervised learning algorithms cannot be used. We can only assume a strategy used by the opponent and then propose a strategy to defeat it. We then consider what strategies the opponent will use against our strategy and optimize our strategy accordingly. In this paper, alternate freeze learning is used in games to solve the nonstationarity problem. Both aircraft in a one-on-one air combat scenario use DRL-based agents to adapt their maneuver strategies. In each training period, one agent learns from scratch, while the strategy of its opponent, obtained from the previous training period, is frozen.
Through these games, the maneuver guidance performance of each side rises alternately, and different agents with a high level of maneuver decision-making ability are generated.
The pseudo-code of alternate freeze game DQN is shown in Algorithm 2.
An important effect that must be considered in this approach is the red queen effect. When one aircraft uses a DRL agent and its opponent uses a static strategy, the performance of the aircraft is absolute with respect to its opponent. However, when both aircraft use DRL agents, the performance of each agent is only relative to its current opponent. As a result, the trained agent is only suited to its latest opponent and cannot deal with previous ones.
Through K training periods of alternate freeze learning, K red agents and K + 1 blue agents are saved during the training process. The league system is adopted, in which those agents are regarded as players. By using a combination of strategies from multiple opponents, the opponent's mixed strategy becomes smoother. In different initial scenarios, each agent guides the aircraft to confront the aircraft guided by each opponent. The goal is to select one of our strategies through the league such that the probability of winning is high and the probability of losing is low. According to the results of the competition, each agent obtains points. The top agent in the league table is selected as the optimum one.
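A league round-robin with a simple points scheme can be sketched as follows. The 3/1/0 scoring is an assumption; the paper does not specify its point system:

```python
from itertools import product

def league_table(red_agents, blue_agents, play, scenarios,
                 win_pts=3, draw_pts=1):
    """Round-robin league sketch: every red agent plays every blue agent
    in every initial scenario, and points rank the red agents.

    `play(r, b, scenario)` returns 'win', 'draw', or 'lose' from red's
    point of view; its signature is an assumed interface.
    """
    points = {r: 0 for r in red_agents}
    for r, b, sc in product(red_agents, blue_agents, scenarios):
        result = play(r, b, sc)
        if result == "win":
            points[r] += win_pts
        elif result == "draw":
            points[r] += draw_pts
    # the top agent in the league table is selected as the optimum one
    return max(points, key=points.get), points
```

Ranking against *all* saved opponents, rather than only the latest one, is what counteracts the red queen effect described above.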

Simulation and Results
In this section, first, random air combat scenarios are initialized in the simulation software for maneuver guidance agent training and testing. Second, agents of both sides with reward shaping are created and trained using the alternate freeze game DQN algorithm. Third, taking the well-trained agents as players, the league system is adopted, and the top agent in the league table is selected. Last, the selected agent is evaluated, and the maneuver behavior of the aircraft is analyzed.

Simulation Setup.
The air combat simulation parameters are shown in Table 1. The Adam optimizer [52] is used for learning the neural network parameters with a learning rate of 5 × 10⁻⁴ and a discount factor γ = 0.99. The replay memory size is 1 × 10⁵. The initial value of ϵ in ϵ-greedy exploration is 1, and the final value is 0.1. For each simulation, the reward shaping parameters c_1, c_2, c_3, and c_4 are set to 2, −2, −1, and −1, respectively. The terms w_1 and w_2 are set to 0.98 and 0.01, respectively. The number of training periods K is 10.

We evaluated three neural networks and three minibatch sizes. Each neural network has 3 hidden layers; the numbers of units are 64, 64, and 128 for the small neural network (NN), 128, 128, and 256 for the medium one, and 256, 256, and 512 for the large one. The minibatch (MB) sizes are 256, 512, and 1024, respectively. These neural networks and minibatch sizes are used in pairs to train red agents, which play against the initial blue agent. The simulation result is shown in Figure 5. First, a large minibatch cannot make the training succeed for the small and medium neural networks, and its training speed is slow for the large network. Second, with the small neural network, the average reward obtained is lower than with the other two networks. Third, for the same neural network, training with the medium minibatch is faster than with the small one. Last, considering computational efficiency, the selected network has 3 hidden layers with 128, 128, and 256 units, respectively, and the selected minibatch size is 512. In order to improve the generality of a single agent, initial positions are randomized in the air combat simulation. In the first iteration of the training process, the initial strategy of the red agent is formed by a randomly initialized neural network, and the conservative strategy of the maximum-overload turn is adopted by the blue one. In subsequent training, the training agent is trained from scratch, and its opponent adopts the strategy obtained from the latest training.
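The three candidate architectures can be compared by parameter count. The sketch below assumes fully connected layers, the 7-dimensional state vector, and the 3 discrete actions described earlier; the size names are assumptions:

```python
def hidden_layers(size):
    """Hidden-layer widths evaluated in the experiment (names assumed)."""
    return {"small": (64, 64, 128),
            "medium": (128, 128, 256),
            "large": (256, 256, 512)}[size]

def parameter_count(state_dim, layers, n_actions):
    """Weights + biases of a fully connected network, useful for comparing
    the three candidate architectures at a glance."""
    dims = (state_dim,) + tuple(layers) + (n_actions,)
    # each layer contributes in_dim*out_dim weights plus out_dim biases
    return sum(i * o + o for i, o in zip(dims[:-1], dims[1:]))
```

For instance, the selected medium network (128, 128, 256) with a 7-dimensional input and 3 outputs has on the order of fifty thousand parameters, roughly four times the small network.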
Air combat geometry can be divided into four categories (from the red aircraft's perspective): offensive, defensive, neutral, and head-on [7], as shown in Figure 6(a). Defining the deviation angle and aspect angle from −180° to 180°, Figure 6(b) shows an advantage diagram where the distance is 300 m. A smaller |μ^r| means that the heading or gun of the red aircraft is better aimed at its opponent, and a smaller |η^b| implies a higher possibility of the blue aircraft being shot or facing a fatal situation. For example, in the offensive scenario, the initial state value of the red agent is large because of small |μ^r| and |η^b|, according to equations (14) and (16). Therefore, in the offensive initial scenario, the win probability of our side is larger than that of the opponent, while the defensive initial scenario is the reverse. In the neutral and head-on initial scenarios, the initial state values of both sides are almost equal.

(1) Set parameters of both aircraft
(2) Set simulation parameters
(3) Set the number of training periods K and the condition for ending each training period W_threshold
(4) Set DRL parameters
(5) Set the opponent initialization policy π^blue_0
(6) for period = 1, K do
(7)     for aircraft ∈ [red, blue] do
(8)         if aircraft = red then
(9)             Set the opponent policy π = π^blue_{period−1}
(10)            Initialize neural networks of the red agent
(11)            while winning rate < W_threshold do
(12)                Train agent using Algorithm 1
(13)            end while
(14)            Save the well-trained agent, whose maneuver guidance policy is π^red_period
(15)        else
(16)            if period = K + 1 then
(17)                break
(18)            else
(19)                Set the opponent policy π = π^red_period
(20)                Initialize neural networks of the blue agent
(21)                while winning rate < W_threshold do
(22)                    Train agent using Algorithm 1
(23)                end while
(24)                Save the well-trained agent, whose maneuver guidance policy is π^blue_period
(25)            end if
(26)        end if
(27)    end for
(28) end for

Algorithm 2: Alternate freeze game DQN for maneuver guidance agent training in air combat.
The performance of the well-trained agents is verified according to the classification of the four initial situations in the league system.
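The four-way geometry classification described above can be sketched as a simple rule on the deviation angle μ_r and aspect angle η_b. The 90° thresholds are an illustrative assumption for the sketch, not the paper's exact boundaries.

```python
# Hedged sketch: classify the initial geometry (from the red aircraft's
# perspective) from mu_r and eta_b in degrees, both in (-180, 180].
# Small |mu_r|: red's nose points roughly at blue.
# Small |eta_b|: red sees blue's tail aspect (blue exposed to being shot).

def classify_geometry(mu_r, eta_b):
    red_nose_on = abs(mu_r) < 90.0    # assumed threshold
    blue_tail_on = abs(eta_b) < 90.0  # assumed threshold
    if red_nose_on and blue_tail_on:
        return "offensive"            # red behind blue, pointing at it
    if not red_nose_on and not blue_tail_on:
        return "defensive"            # roles reversed
    if red_nose_on and not blue_tail_on:
        return "head-on"              # both roughly nose-to-nose
    return "neutral"                  # neither aircraft points at the other

print(classify_geometry(10, 10))      # offensive
print(classify_geometry(5, 175))      # head-on
```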

Simulation Results.
The policy naming convention is π^b_i or π^r_i, denoting a blue or red policy produced after i periods, respectively. After every 100 training episodes, 100 test episodes are run, and the learning curves are shown in Figure 7. For the first period (π^r_1 vs. π^b_0), due to the simple maneuver policy of the blue aircraft, the red aircraft achieves an almost 100% success rate after about 4,000 episodes of training, as shown in Figure 7(a). At the beginning, most of the games are draws because of the random initialization of the red agent and the evasion strategy of its opponent. Another reason is that the aircraft often flies out of the airspace in the early stages of training; although the agent is penalized for this, the episode still ends in a draw. During the training process, the red aircraft searches for winning strategies in the airspace, and its winning rate increases steadily. However, because it pursues offense and neglects defense, the winning rate of its opponent also rises. In the later stage of training, the red aircraft gradually learns the opponent's strategy and achieves a very high winning rate.
With the iterations of the game, both agents learn intelligent maneuver strategies that can guide the aircraft in the pursuit-evasion game. Figure 7(b) shows a typical example in which π^b_5 is the agent being trained and π^r_5 is its opponent. After about 5,000 episodes of training, the blue aircraft reduces its losing rate to a low level. Since the red aircraft adopts an intelligent maneuver guidance strategy, the blue agent cannot learn a winning strategy in a short time. After about 20,000 episodes, the agent gradually learns the opponent's maneuver strategy, and its winning rate keeps increasing. The final winning rate stabilizes between 50% and 60%, and the losing rate falls below 10%. The training process of the iteration π^r_10 vs. π^b_9 is shown in Figure 7(c). Because the maneuver strategy of the blue aircraft is an intelligent strategy that has been trained iteratively, training the red agent is much more difficult than in the first iteration. The well-trained agent wins more than 60% and loses less than 10% of games against π^b_9. It is hard to obtain a higher winning rate in any training except against π^b_0, because both aircrafts are homogeneous, the initial situations are randomized, and the opponent in the training environment is intelligent. In some situations, if the opponent's strategy is intelligent enough, the trained agent can hardly win at all. Interestingly, in scenarios where the opponent is so strong that the aircraft cannot defeat it no matter how the agent acts, the agent guides the aircraft out of the airspace to obtain a draw. This offers an insight: in a scenario where you cannot win, leaving the combat area may be the best option.
For the four initial air combat situations, the two aircrafts are guided by the agents trained at each iteration stage. Typical flight trajectories are shown in Figure 8. Since π^b_0 is a static strategy, π^r_1 easily wins in the offensive, head-on, and neutral situations, as shown in Figures 8(a), 8(c), and 8(d). In the defensive situation, the red aircraft first turns away from its opponent, who has the advantage, and then looks for opportunities to establish an advantageous situation; there are no fierce rivalries in the process of combat, as shown in Figure 8(b). Although π^r_5 is an intelligent strategy, the well-trained π^b_5 can still gain an advantage in combat, as shown in Figures 8(e)-8(h). In the offensive situation (from the red aircraft's perspective), the blue aircraft cannot easily shake off the red one's pursuit, but after fierce confrontation, it establishes an advantageous situation, as shown in Figure 8(e). Figure 8(f) shows the defensive situation, in which the blue aircraft keeps its advantage until it wins. In the other two scenarios, the blue aircraft performs well, especially in the neutral initial situation, where it adopts delayed-turning tactics and successfully wins the air combat. π^b_9 is an intelligent strategy, yet the well-trained π^r_10 achieves more than half victories and less than 10% failures. However, unlike π^r_1 vs. π^b_0, π^r_10 cannot win easily, especially in head-on situations, as shown in Figures 8(i)-8(l).

Mathematical Problems in Engineering
In the offensive situation, the blue aircraft tries to shake off the red one by constantly adjusting its bank angle, but it is finally captured. In the defensive situation, the red aircraft cleverly adopts the circling-back tactic and quickly captures its opponent. In the head-on situation, the red and blue aircrafts alternately occupy the advantageous position, and after fierce maneuver confrontation, the red aircraft finally wins. In the neutral situation, the red aircraft constantly adjusts its bank angle according to the opponent's position and wins.
For the above three pairs of strategies, the head-on initial situations are taken as examples to analyze the advantages of both sides in the game, as shown in Figure 9. In the π^r_1 vs. π^b_0 scenario, because π^b_0 has no intelligent maneuver guidance ability, although both sides are equally matched in the first half, the red aircraft continuously expands its advantage until it wins, as shown in Figure 9(a). When the blue agent learns an intelligent strategy, it can defeat the red one, as shown in Figure 9(b). In the π^r_10 vs. π^b_9 scenario, the two sides alternately establish advantageous positions, and the red aircraft does not win easily, as shown in Figure 9(c). Combined with the trajectories shown in Figures 8(c), 8(g), and 8(k), these results show that the maneuver strategies of both sides are improved by the alternate freeze game DQN algorithm. The approach also has the benefit of discovering new maneuver tactics.

League Results.
The objective of the league system is to select a red agent that generates maneuver guidance instructions according to its strategy so as to achieve the best results without knowing the strategy of its opponent. There are ten agents on each side, saved during the iteration of games. Each red agent fights each blue agent 400 times, including 100 times for each of the four initial scenarios. The detailed league results are shown in Table 2, where results with a win/loss ratio greater than 1 are highlighted in bold. π^r_i, where i ∈ [1, 10], is trained in the environment with π^b_{i−1} as its opponent, and π^b_j is trained against π^r_j, j ∈ [1, 10]. The result of π^r_1 fighting π^b_0 is ideal. However, since π^b_0 is a static strategy rather than an intelligent one, π^r_1 does not perform well in games against the other, intelligent blue agents. For π^r_2 to π^r_10, although there is no special training against π^b_0, the performance is still very good because π^b_0 is a static strategy. In the iterative game process, the trained agents achieve good results in confrontations with the frozen agents in the environment; for example, π^r_2 has an overwhelming advantage over π^b_1, as does π^b_2 over π^r_2. Because of the red queen effect, although π^r_10 has an advantage against π^b_9, it is in a weak position against the earlier blue agents, such as π^b_2, π^b_3, and π^b_4. The league table is shown in Table 3, in which 3 points are awarded for a win, 1 point for a draw, and 0 points for a loss. In the early stage of training, the performance of the trained agents improves gradually: the later trained agents are better than the earlier ones, from π^r_1 to π^r_6.
As the alternate freeze training goes on, however, the performance does not keep improving, as seen from π^r_7 to π^r_10. The top of the score table is π^r_6, which not only has the highest score but also has an advantage when playing against all blue agents except π^b_6, as shown in Table 2. In the competition with all opponents, it wins 44% of games and remains unbeaten in 75%.
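The league scoring can be sketched as follows (3 points per win, 1 per draw, 0 per loss, ranked by total points). The outcome counts below are illustrative placeholders chosen to mimic π^r_6's roughly 44% win and 75% unbeaten rates, not the paper's actual Table 3 data.

```python
# Sketch of the league table computation used to rank red agents.

def league_points(results):
    """results: dict mapping agent name -> list of 'win'/'draw'/'loss'."""
    scoring = {"win": 3, "draw": 1, "loss": 0}
    table = {agent: sum(scoring[r] for r in outcomes)
             for agent, outcomes in results.items()}
    # Rank descending by points, football-league style.
    return sorted(table.items(), key=lambda kv: kv[1], reverse=True)

demo = {
    "pi_r6": ["win"] * 44 + ["draw"] * 31 + ["loss"] * 25,  # ~44% win, 75% unbeaten
    "pi_r1": ["win"] * 30 + ["draw"] * 20 + ["loss"] * 50,
}
table = league_points(demo)
print(table[0])  # ('pi_r6', 163): 44*3 + 31*1 points
```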
To verify the performance of this agent, it is confronted with opponent agents trained by ADP [20] and Minimax-DQN [44]. Each of the four typical initial situations is run 100 times against both opponents, and the results are shown in Table 4. The time cost to generate an action with each algorithm is shown in Table 5. The performance of the agent presented in this paper is comparable to that of ADP, with slightly reduced computational time. However, ADP is a model-based method, which is assumed to know all the information, such as the roll rate of the opponent. Compared with Minimax-DQN, the agent has an overall advantage. This is because Minimax-DQN assumes that the opponent is rational, while the opponent in this paper is not, which is closer to the real world. In addition, the computational efficiency of the proposed algorithm is significantly better than that of Minimax-DQN, which uses time-consuming linear programming to select the optimal action at each step.

Agent Evaluation and Behavior Analysis.
Evaluating agent performance in air combat is not simply a matter of winning or losing. In addition to the winning and losing rates, two criteria are added to represent the success level: average time to win (ATW) and average disadvantage time (ADT). ATW is the average elapsed time required to maneuver to the advantage position in scenarios where our aircraft wins; a smaller ATW is better. ADT is the average accumulated time during which the advantage function value of our aircraft is less than that of the opponent, used as a criterion to evaluate the risk exposure to the adversary's weapons.
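The two criteria can be sketched directly from their definitions: ATW averages episode durations over winning episodes only, and ADT averages the accumulated time our advantage value is below the opponent's over all episodes. The episode record structure (`result`, `duration`, per-step advantage lists) is an assumed format for illustration.

```python
# Hedged sketch of the ATW and ADT evaluation criteria.

def average_time_to_win(episodes):
    """Average elapsed time of winning episodes only (smaller is better)."""
    wins = [ep["duration"] for ep in episodes if ep["result"] == "win"]
    return sum(wins) / len(wins) if wins else float("inf")

def average_disadvantage_time(episodes, dt=1.0):
    """Average accumulated time (step size dt) during which our advantage
    function value is below the opponent's."""
    totals = [sum(dt for r, b in zip(ep["adv_red"], ep["adv_blue"]) if r < b)
              for ep in episodes]
    return sum(totals) / len(totals)

eps = [
    {"result": "win", "duration": 100.0,
     "adv_red": [1, 2, 3], "adv_blue": [2, 1, 1]},   # 1 disadvantaged step
    {"result": "draw", "duration": 300.0,
     "adv_red": [0, 0], "adv_blue": [1, 1]},         # 2 disadvantaged steps
]
print(average_time_to_win(eps))        # 100.0 (only the winning episode counts)
print(average_disadvantage_time(eps))  # 1.5 (mean of 1.0 and 2.0)
```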
A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent performs these confrontations, and the results are shown in Figure 10. Agent number i denotes the policy trained after i periods. From π^r_1 to π^r_5, the performance of the agents gradually improves. After π^r_5, performance stabilizes, and each agent has different characteristics. Comparing π^r_5 to π^r_10, the winning and losing rates are almost the same as those in Table 3. The winning rates of the six agents are similar, with the highest, π^r_5, winning 6% more than the lowest, π^r_10, as shown in Figure 10(a). However, π^r_6 is the agent with the lowest losing rate, more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, π^r_7 is under 100 s, π^r_6 is 108.7 s, and the other agents are above 110 s, as shown in Figure 10(c). For ADT, π^r_5, π^r_6, and π^r_8 are under 120 s, π^r_7 is above 130 s, and π^r_9 and π^r_10 are between 120 and 130 s, as shown in Figure 10(d).
In summary, as the number of iterations increases, the red queen effect appears clearly. The performance of π^r_8, π^r_9, and π^r_10 is no better than that of π^r_6 and π^r_7. Although the winning rate of π^r_7 is not high and it accumulates more disadvantage time, it always wins quickly; it is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of π^r_6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is second only to π^r_7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. π^r_6 is thus an agent with good comprehensive performance, which shows that the selection method of the league system is effective.
For the four initial scenarios, air combat simulations using π^r_6 are selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective: at the 1st second, our aircraft did not turn to follow the opponent but chose to fly straight; at the 9th second, it gradually turned toward the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).
In the defensive scenario, our aircraft was at a disadvantageous position, and the opponent was trying to lock onto us with a lag pursuit tactic, as shown in Figure 11(b). At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked on by the opponent, as shown in Figure 11(c). After that, it continued to adjust the roll angle, and by the time the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent uses a safe maneuver strategy to obtain a draw. In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, the maximum-overload turn while waiting for opportunities, as shown in Figure 11(d).
The crucial decision was made at the 50th second: the opponent was still hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turning rate to fly towards the opponent. At the 90th second, the aircraft reached an advantageous situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage and won, as shown in Figure 11(f).
In the neutral scenario, our initial strategy was to turn away from the opponent to look for opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing the turning radius. At the 60th second, the disadvantageous situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle while the opponent was in a state of maximum-overload right turn, which made it unable to shake off our aircraft, and it finally lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering and can use them in air combat.

Conclusion
In this paper, middleware connecting air combat simulation software and the reinforcement learning agent is developed. It offers researchers an approach for designing middleware that turns existing software into reinforcement learning environments, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in the environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantageous situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL; it can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent with the best performance after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents, and the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.
In future work, the problem will be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism will be improved to deal with them. Beyond the extension to 3D maneuvering, a 2-vs-2 air combat scenario will be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem that obtains the joint action of both aircrafts to execute at each time step. The other option is to adopt a decentralized system, in which each agent takes its own decision and may cooperate to achieve a common goal.
In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, to improve the diversity of potential opponents in more complex air combat scenarios, the league system will be refined.
Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
ψ̇: Turn rate
ϕ̇: Roll rate
d_t: Distance between the aircraft and the target at time t
d_min: Minimum distance of the advantage position
d_max: Maximum distance of the advantage position
μ_t: Deviation angle at time t
μ_max: Maximum deviation angle of the advantage position
η_t: Aspect angle at time t
η_max: Maximum aspect angle of the advantage position
S: State space
S_t: State vector at time t
s: All states in the state space
A: Action space
A_t: Action at time t
a: All actions in the action space
π: S → A: Policy
R(S_t, A_t, S_{t+1}): Reward function
R_t: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(S_t, A_t): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(S_t, A_t, S_{t+1}): Shaping reward function
Φ(S_t): A real-valued function on states
T(S_t): Termination reward function
D(S_t): Distance reward function
O(S_t): Orientation reward function
K: Number of training periods
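The shaping function F and potential Φ listed above are consistent with standard potential-based reward shaping, which preserves the optimal policy; as a sketch under that assumption, with discount factor γ, the relationship would be:

```latex
% Potential-based shaping: F is the discounted change in the state potential,
% and the shaped reward adds F to the environment reward.
F(S_t, A_t, S_{t+1}) = \gamma\,\Phi(S_{t+1}) - \Phi(S_t), \qquad
R'_t = R_t + F(S_t, A_t, S_{t+1})
```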

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.