A Case Study on Air Combat Decision Using Approximated Dynamic Programming

© 2014 Yaofei Ma et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

As a continuous state space problem, air combat is difficult to solve with traditional dynamic programming (DP) over a discretized state space. This paper studies the approximated dynamic programming (ADP) approach to build a high-performance decision model for air combat in a 1 versus 1 scenario. In this approach, the iterative process of policy improvement is replaced by mass sampling from historical trajectories and utility function approximation, which makes policy improvement highly efficient. A continuous reward function is also constructed to better guide the plane toward the “winner” state from any initial situation. According to our experiments, the plane is more offensive when following the policy derived from the ADP approach rather than the baseline Min-Max policy: the “time to win” is reduced greatly, but the cumulated probability of being killed by the enemy is higher. The reason is analyzed in this paper.


Introduction
Unmanned aerial vehicles (UAVs) play an important role on the modern battlefield. Over the past decades, UAVs have advanced significantly in both hardware and software and have achieved mission capabilities including "simple" ones such as intelligence, surveillance, and reconnaissance (ISR), and "complex" ones such as electronic attack, ground target strike, suppression or destruction of enemy air defense (SEAD/DEAD), and others. According to the development roadmap [1] proposed by the leading countries in UAV technology, even the air combat mission, which has been regarded as the domain of human pilots because of the dynamics and complexity of tactical decisions, may be carried out by autonomous UAVs in the near future.
However, the decision technologies supporting automatic air combat are far from mature. They still need better robustness, intelligence, autonomy, team cooperation, and adaptation to complex environments. The methods applied in this domain include game theory [2][3][4], knowledge-based decision [5][6][7][8][9], graph-based methods such as influence diagrams [10], and others. Dynamic programming (DP) [11] is one of the most powerful methods because of its adaptation to dynamic environments and its capability to improve the policy constantly by learning [12]. However, the traditional DP approach is not suitable for continuous state space problems such as air combat, in which the computational complexity becomes intractable because of the curse of dimensionality.
In this paper, a tactical decision framework employing the approximated dynamic programming (ADP) method [13][14][15] is proposed for the air combat mission. The distinguishing trait of the ADP method is that the utility function is learned from a large number of sampled states in the problem space rather than from scratch, which leads to high efficiency in policy convergence. As a result, ADP can be used as a second-stage tool to improve the policy derived from another decision system (referred to here as the "first-stage tool"), for example, a knowledge-based system. If we treat the combat traces produced by the first-stage tool as the sampled states, then the ADP algorithms can learn the utility function and policy from these states directly. Given the optimizing capability that ADP inherits from the DP approach, the learned policy can be improved constantly to achieve better decision performance. Thus, the merits of different decision systems are combined.
The rest of this paper is organized as follows. In Section 2, the 1 versus 1 air combat problem is formulated in DP form. In Section 3, the ADP method is briefly reviewed. Section 4 discusses the reward function for air combat, which is designed to guide the UAVs into goal states smoothly. Some features are also specified to gain insight into the engagement situation. In Section 5, the key algorithms of the ADP decision framework are proposed. The comparative experiments that follow (Section 6) validate the effectiveness of the proposed framework.

Problem Formulation
A 1 versus 1 air combat scenario involves two opposing planes (denoted by red and blue, where red is taken to be "my" side). Omitting vertical movement, the kinematic equations of the plane are given by (1), where V is the scalar value of the velocity, which is assumed to be constant during the combat, and ψ ∈ [−π, π] is the yaw angle, defined as the deviation of the velocity from north (the y axis).
The yaw angle ψ is controlled by the plane's normal overload, which always points to the right from the center of gravity of the plane and is orthogonal to the velocity. In our control scheme, the overload takes one of three values at a time: {−3, 0, 3}. With these values, the plane will turn counterclockwise, keep its current velocity direction, and turn clockwise, respectively. The goal of each plane is to occupy an advantageous position through tactical decisions and gain the opportunity to fire at its rival. The state space of air combat can be described with the vector X in (2), where subscripts r and b refer to red and blue, respectively. Any state s is an instance of X. With (1), the state transition in combat space can be represented as the function in (3), which means that the current state s transfers into a new state s′ after the red and blue planes perform their actions. The goal state is reached when one plane gains the opportunity to fire at its opponent. The firing position is defined by three geometrical measures:
(a) |Aspect angle| < π/3. The aspect angle (AA) is the relative angle between the longitudinal symmetry axis (toward the tail) of the target plane and the line connecting the target plane's tail to the attacking plane's nose. |AA| < π/3 refers to the area where the kill probability is high when attacking from the rear, since most close-combat air-to-air missiles use infrared guidance.
(b) |Antenna train angle| < π/6. The antenna train angle (ATA) is the angle between the attacking plane's longitudinal symmetry axis and its radar's line of sight (LOS), as Figure 1 shows. This criterion defines an area from which the target plane can hardly escape once it is locked by radar.
(c) Relative range between the two planes. This criterion ensures that the target plane is within the attacking range of the air-to-air weapon.
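The planar model and the firing-position test described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the turn-rate relation (yaw rate equal to overload times g divided by V), the explicit Euler step, the overload being expressed in units of g, and the range bounds `r_min` and `r_max` are all assumptions, and every function name is hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed unit system)


def wrap(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def step(x, y, psi, n, v, dt):
    """Advance one plane by dt seconds with an explicit Euler step.

    Assumed model: x' = v sin(psi), y' = v cos(psi), psi' = n * G / v,
    with psi measured clockwise from north (the y axis).
    """
    x_new = x + v * math.sin(psi) * dt
    y_new = y + v * math.cos(psi) * dt
    psi_new = wrap(psi + n * G / v * dt)
    return x_new, y_new, psi_new


def geometry(attacker, target):
    """Return (AA, ATA, range) for two plane states given as (x, y, psi)."""
    xa, ya, psi_a = attacker
    xt, yt, psi_t = target
    dx, dy = xt - xa, yt - ya
    los = math.atan2(dx, dy)   # bearing of the line of sight, from north
    ata = wrap(los - psi_a)    # attacker's nose relative to the LOS
    aa = wrap(los - psi_t)     # zero when the attacker sits on the target's tail
    return aa, ata, math.hypot(dx, dy)


def in_firing_position(attacker, target, r_min, r_max):
    """Check the three criteria; r_min and r_max are hypothetical bounds."""
    aa, ata, rng = geometry(attacker, target)
    return (abs(aa) < math.pi / 3
            and abs(ata) < math.pi / 6
            and r_min <= rng <= r_max)
```

For example, an attacker flying north 500 m directly behind a north-bound target has AA = 0 and ATA = 0, so the test passes whenever 500 m lies inside the assumed range bounds.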

ADP Method Review
DP defines an adaptive learning process whose mathematical model is the Markov decision process (MDP). In the DP formulation, air combat can be described as a discrete-time decision problem with the five-tuple {S, A, P, R, U}:
(1) S = {X} is the problem space defined with state variable X; s is an instance of X;
(2) A(s) is the finite action set available in state s, from which the plane selects one action to execute at each decision interval. In our problem, A(s) is the same for every state s and thus can be simply denoted as A;
(3) P(s′ | s, a) is the probability of transition from state s to s′ under action a;
(4) R(s) is the reward of state s. If s is visited multiple times during the combat, the rewards are discounted and accumulated to form the utility value of that state;
(5) U(s) is the utility of state s. Its value is the cumulative reward over multiple visits. If every state is visited sufficiently often, the utility distribution converges to the optimal one, from which the optimal policy is derived.
The decision process starts from an initial state s_0 and then selects an action to perform. The action interacts with the environment and leads to a new state, and so on. The utility of the starting state is then the expectation of the discounted cumulative rewards over all states following the starting one, as given in (4), where γ ∈ (0, 1) is the discount coefficient, which ensures that U(s) eventually converges. A policy π(s) → a is a mapping from state space to action space. For a fixed policy π, the utility satisfies the Bellman equation (5). The optimal utility U* is the value function that simultaneously maximizes the expected cumulative reward in all states s ∈ S. Bellman proved that U* is the unique solution of (5); see (6). In practice, U* can be obtained through iterations on the Bellman equation, as in (7), where T denotes the Bellman operator, representing the iterative improvement of U by eventually traversing the states throughout the space. During this process, P(s, a, s′) also converges to its "true" distribution. The optimal policy can then be derived as in (8).
As (4)-(8) show, the traditional DP method needs to traverse discrete states iteratively, resulting in a tabular utility function. This approach is not suitable for continuous state space problems. Discretizing the state space leads to two defects: (i) the unreasonable assumption that the utility function is constant within each discrete state cell and (ii) the curse of dimensionality.
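For reference, the relations (4)-(8) described above can be written in standard textbook MDP notation as below. These are the usual forms for a state-based reward R(s); that the paper's equations take exactly this shape is an assumption, not a transcription.

```latex
\begin{align}
U(s_0) &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\Big]
  && \text{utility of the starting state} \\
U^{\pi}(s) &= R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')
  && \text{Bellman equation for a fixed policy} \\
U^{*}(s) &= R(s) + \gamma \max_{a \in A} \sum_{s'} P(s' \mid s, a)\, U^{*}(s')
  && \text{optimal utility} \\
U_{k+1} &= T\, U_{k}
  && \text{iteration with the Bellman operator } T \\
\pi^{*}(s) &= \arg\max_{a \in A} \sum_{s'} P(s' \mid s, a)\, U^{*}(s')
  && \text{optimal policy}
\end{align}
```

Because γ < 1, the operator T is a contraction in the maximum norm, which is why repeated application converges to the unique fixed point U*.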
The ADP method mitigates these problems with two operations: (i) sampling a large number of states effectively from the problem space, thus reducing the time consumed by space exploration; (ii) approximating the state utility Ũ using the sampled states, with which a near-optimal policy, rather than the optimal one, is employed to determine actions. Denoting the sampled states as a set X = {s_1, s_2, ..., s_n}, the one-step Bellman improvement on the samples is given by (9), where Û_k is the current approximation of the utility and Ũ_{k+1} is one Bellman iteration from Û_k. Then, Û_{k+1} can be approximated based on Ũ_{k+1}. There are multiple options for the approximation operation [16,17]; least squares approximation is used here, as in (10), where β is the vector of approximation coefficients and X is the set of sampled states. Normally, a set of features needs to be defined to gain insight into the characteristics of the studied problem. The approximated utility function will converge more quickly and be more precise, since these features come from pilots' real-world combat experience. The feature-based approximation is given by (11), where Φ is the feature vector.
In summary, the steps of the ADP method are as follows:
(a) sample the state set X in the problem space;
(b) compute the one-iteration improved utility Ũ_{k+1} from the current utility Û_k; the initial value can be set as the reward of the initial states, that is, Û_0 = R(X_0), where X_0 is the initial state set;
(c) update the approximated utility Û_{k+1} following (10) and (11);
(d) if the policy still needs to be improved, go back to (a).
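Steps (a)-(d) can be illustrated on a toy problem. The sketch below is not the air combat model: the one-dimensional state in [0, 1], the reward R(s) = s, the three actions, the quadratic feature vector, and all function names are assumptions chosen only to show one Bellman backup on sampled states followed by a least squares fit. It also has a single decision maker, whereas the paper's learning loop involves an opponent.

```python
ACTIONS = (-0.1, 0.0, 0.1)   # hypothetical action set
GAMMA = 0.8                  # discount coefficient, chosen for illustration


def reward(s):
    """Toy continuous reward that grows toward the goal state s = 1."""
    return s


def transition(s, a):
    """Deterministic toy dynamics, clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, s + a))


def features(s):
    """Assumed feature vector Phi(s) = [1, s, s^2]."""
    return [1.0, s, s * s]


def utility(beta, s):
    """Approximated utility: inner product of beta and Phi(s)."""
    return sum(b * f for b, f in zip(beta, features(s)))


def solve(a, b):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(phi, targets):
    """Return beta minimizing ||phi beta - targets||^2 (normal equations)."""
    k = len(phi[0])
    ata = [[sum(r[i] * r[j] for r in phi) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(phi, targets)) for i in range(k)]
    return solve(ata, atb)


def adp_learn(samples, rounds):
    """Fitted value iteration on a fixed set of sampled states."""
    phi = [features(s) for s in samples]
    beta = least_squares(phi, [reward(s) for s in samples])  # U_0 = R(X_0)
    for _ in range(rounds):
        # One Bellman backup on every sample, then refit the coefficients.
        targets = [reward(s) + GAMMA * max(utility(beta, transition(s, a))
                                           for a in ACTIONS)
                   for s in samples]
        beta = least_squares(phi, targets)
    return beta


def greedy_action(beta, s):
    """Action whose successor state has the highest approximated utility."""
    return max(ACTIONS, key=lambda a: utility(beta, transition(s, a)))


samples = [i / 20 for i in range(21)]
beta = adp_learn(samples, rounds=20)
```

The normal equations are solved by hand only to keep the sketch free of dependencies; a numerical library routine would normally be preferred for larger feature sets.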

Algorithm 1 (ADP Learn()), local variables: (2) action vector of the blue plane, derived from the Min-Max policy; (3) action vector of the red plane, derived from the current utility; (4) Ũ: one-step improved utility function obtained by the Bellman operator.

Reward Function and Combat Features
Before the ADP algorithm is given, the reward function R(s) needs to be discussed, since it is a necessary part of the ADP steps. As (4) shows, the utility is the discounted cumulative R(s) along the state trace. Thus, a properly defined R(s) can better guide the plane toward the goal state from any starting state.
The computation of R(s) is domain related. In our scenario, the attacking plane in its goal state gets a reward of +1, and the target plane in the same state gets −1 as punishment. The reward in all other states is 0. Intuitively, with these discrete rewards, the planes will spend more time on space exploration to find a trace from the starting state to the goal state. To better guide the plane, a continuous reward function is defined in (12). Its parameters are the expected attacking range of the weapons, the relative distance between the planes, and a coefficient that adjusts the influence of the distance on the total reward. With (12), the plane occupying the firing position (AA = 0, ATA = 0, and relative distance equal to the expected attacking range) gets a reward of 1; the plane under attack (AA = π, ATA = π, and relative distance equal to the expected attacking range) gets a reward of 0. In other states, the reward increases continuously and monotonically from the worst state to the most advantageous state. To emphasize the punishment in bad states (the punishment guides planes to avoid these states), a simple linear transformation is applied to this reward to obtain the final reward function, given in (13).
To construct the utility function, some geometric features [18] are specially defined to describe the combat situation, as Table 1 shows. These features are optional for utility approximation in the ADP steps because the sampled states can be used instead, as (10) shows. However, they capture the "true" utility of states more directly, which is why human pilots also use them to judge their situation in real-world combat. In other words, these well-defined features are more representative for approximating the utility function.
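One functional form that reproduces the endpoint values stated above (1 in the firing position, 0 for the plane under attack) is sketched below. The exact expressions (12) and (13) are not reproduced here, so the averaged angle term, the exponential range term with divisor πk, and the rescaling 2S − 1 are assumptions, and the names `score`, `reward`, `rng_expected`, and `k` are hypothetical.

```python
import math


def score(aa, ata, rng, rng_expected, k):
    """Continuous situation score in [0, 1] (assumed form).

    aa, ata      : aspect angle and antenna train angle in radians
    rng          : relative distance between the planes
    rng_expected : expected attacking range of the weapon
    k            : coefficient weighting the influence of the range error
    """
    angle_term = ((1.0 - abs(aa) / math.pi) + (1.0 - abs(ata) / math.pi)) / 2.0
    range_term = math.exp(-abs(rng - rng_expected) / (math.pi * k))
    return angle_term * range_term


def reward(aa, ata, rng, rng_expected, k):
    """Assumed linear rescaling of the score to [-1, 1]."""
    return 2.0 * score(aa, ata, rng, rng_expected, k) - 1.0
```

Under these assumptions the worst state maps to −1 instead of 0, which is one way to make bad states explicitly punished rather than merely unrewarded.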

Method
In the combat scenario, the red plane is marked as "my" side, and the blue one is marked as the enemy. To describe the ADP approach, a reference decision algorithm, the Min-Max search algorithm [19], is employed here. The Min-Max algorithm looks ahead a fixed number of steps, using domain knowledge to evaluate the consequences of actions before giving its final decision. The ADP approach involves two algorithms: (i) the learning algorithm (ADP Learn()), in which the utility function is approximated, and (ii) the decision algorithm (Rollout ADP Policy()), in which the final action is determined based on the ADP policy derived from the learned utility. The ADP Learn() algorithm is displayed in Algorithm 1.

Rows of Table 1:
ȦTA: the rate of change of ATA.
HCA: the difference between the yaw angles of the two planes.
In ADP Learn(), the utility function is approximated with sampled states, which should be drawn from the frequently visited part of the space so as to fully capture the changes in utility values in these areas. One option is to use trajectories produced in real-world combat or by other authoritative decision tools for air combat, since such trajectories themselves indicate states with a high probability of being visited in combat. In this paper, a scenario is built to obtain combat trajectories, in which both rival planes follow Min-Max policies, and their trajectories are recorded as X_0.
The initial value of Û is assigned as the reward of X_0 (line 1 in the code section) and is then improved for N rounds. In each round, first, the blue plane's action is determined by the Min-Max policy (line 3). Second, the red plane's action is selected by applying a one-step Bellman operator. The changed utility Ũ is also recorded (lines 4-5) for further use. Third, the feature vector is updated (line 6), with which the least squares approximation is performed to approximate Û according to Ũ (lines 7-8).
The approximated utility function returned by ADP Learn() can already be used to make decisions, as (8) shows. However, a rollout procedure is employed here to further improve the quality of the final decisions, as Algorithm 2 shows.
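The general idea of a rollout, trying each candidate action and then following a base policy for a few steps before comparing outcomes, can be sketched on a toy problem. This is a generic illustration under stated assumptions, not Algorithm 2: the integer line world, the stand-in utility, the horizon, the absence of an opponent, and all names are hypothetical.

```python
ACTIONS = (-1, 0, 1)  # hypothetical action set on an integer line
GOAL = 5              # hypothetical goal position


def transition(s, a):
    """Toy system equation: move along the line by the chosen action."""
    return s + a


def approx_utility(s):
    """Stand-in for a learned utility: higher when closer to GOAL."""
    return -abs(GOAL - s)


def base_policy(s):
    """One-step greedy policy with respect to approx_utility."""
    return max(ACTIONS, key=lambda a: approx_utility(transition(s, a)))


def rollout_policy(s, horizon):
    """Pick the first action whose simulated future state scores best."""
    best_action, best_value = None, float("-inf")
    for a in ACTIONS:
        nxt = transition(s, a)                # try the candidate action
        for _ in range(horizon):              # then follow the base policy
            nxt = transition(nxt, base_policy(nxt))
        value = approx_utility(nxt)
        if value > best_value:                # cache the best action so far
            best_action, best_value = a, value
    return best_action
```

Starting from state 0 with a horizon of 2, the candidate actions end at states 1, 2, and 3, so the rollout selects the action +1.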
Assume the red plane is making a decision in Rollout ADP Policy(). It does not follow the ADP policy directly. Instead, it tries each possible action (line 1 in the code section). For each possible action, the red plane's future state

Simulation and Analysis
The initial state of 1 versus 1 air combat can be classified into four basic situations (from the red plane's perspective): offensive, neutral, defensive, and confronting, as Figure 2 shows.
In our experiments, all four initial situations are configured (Table 2) to compare the performance of the ADP policy and the Min-Max policy.
The experiments are arranged as follows. First, a baseline experiment is conducted in which both the red and blue planes follow the Min-Max policy (denoted as π_mm). The decision performance of π_mm is treated as the baseline against which the ADP policy is compared.
Second, the learned ADP policy is applied to the same initial situations. The ADP policy is denoted as π_{N=n}, where n is the number of learning rounds. For example, π_{N=40} means that this policy is approximated after 40 rounds. The decision performance is measured with two metrics: (a) the average time to win (TTW); the winning states have been defined in Section 2, and the attacking plane needs to hold such a state for at least 10 seconds to win; (b) the accumulated probability of being killed (APK); this indicates the total risk to a plane during the combat, obtained by accumulating the probability of being killed by the enemy. A good policy results in both a small TTW and a small APK. To speed up the experiments, a larger normal overload is assigned to the red plane, which means it can change direction more quickly. This measure avoids a prolonged standoff when both planes follow the same policy. This performance advantage does not influence the policy comparison, since both sets of experiments use identically configured planes.
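The two metrics can be computed from a logged engagement roughly as follows. This is a sketch under stated assumptions: TTW is taken as the time at which the firing position is first entered and then held for at least 10 seconds, APK is taken as a plain sum of per-step kill probabilities, and the function and parameter names are hypothetical.

```python
def time_to_win(in_position, dt, hold_time=10.0):
    """Return the entry time of the first firing position that is then held
    for at least hold_time seconds, or None if that never happens.

    in_position : sequence of booleans, one per simulation step
    dt          : step length in seconds
    """
    need = int(round(hold_time / dt))   # number of consecutive steps required
    run_start, run_len = None, 0
    for i, flag in enumerate(in_position):
        if flag:
            if run_len == 0:
                run_start = i           # a new run of firing-position steps
            run_len += 1
            if run_len >= need:
                return run_start * dt
        else:
            run_len = 0                 # the hold was broken; start over
    return None


def accumulated_kill_probability(step_probs):
    """Sum the per-step probabilities of being killed (assumed definition)."""
    return sum(step_probs)
```

Measuring TTW up to the entry time rather than the end of the hold is an interpretation; it is suggested by the reported TTW values in the figure captions being shorter than the 10 second hold requirement.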
The baseline experiments are conducted first. Because of space limitations, only the result of Setup 4 (confronting) is displayed here, as Figure 3 shows. Figures 4, 5, 6, and 7 show the result of each initial setup when the red plane follows the ADP policy and the blue plane follows the Min-Max policy. Comparing Figures 7 and 3, we can see that the performance of the red plane is improved greatly by following the ADP policy: the TTW is reduced from 23 s to 10.5 s.
The comparison of decision performance is displayed in Table 3. As we can see, the TTW of π_{N=40} is reduced in all setups compared to π_mm, especially in Setup 3. This means that π_{N=40} can guide the red plane to shake off its pursuer quickly and find its way to the firing position. On the other hand, the APK is slightly higher with the ADP policy.
These results show that a plane is more offensive when following the ADP policy. The plane tends to occupy the firing position at the risk of being killed. This phenomenon can be explained from the working mechanism of two decision

Figure 2: Four basic initial situations in air combat.

Figure 4: Engagement process starting from Setup 1, with the red plane following ADP policy π_{N=40} and the blue plane following π_mm. TTW = 4.5 s.

Figure 5: Engagement process starting from Setup 2, with the red plane following ADP policy π_{N=40} and the blue plane following π_mm. TTW = 3.25 s.
Algorithm 2 (Rollout ADP Policy()), variables: π_appx(s): the red plane's policy derived from the learned utility Û, see (8); a_best: the best action with respect to the current state s; U_best: cache for the maximum utility over the candidate actions; (2) cache for the next state computed by the system equation.

Table 1: Features evaluating combat situation.

Table 2: Four initial setups of air combat.