Hierarchical Q-Learning Based UAV Secure Communication against Multiple UAV Adaptive Eavesdroppers

In this paper, we investigate secure unmanned aerial vehicle (UAV) communication in the presence of multiple UAV adaptive eavesdroppers (AEs), where each AE can adaptively eavesdrop or jam, learning the others' actions to degrade the secrecy rate more severely. A one-leader, multi-follower Stackelberg game is adopted to analyze the mutual interference among multiple AEs, and the optimal transmit powers are proven to exist under the given conditions. Following that, a mixed-strategy Stackelberg Equilibrium based on a finite, discretized power set is derived, and a hierarchical Q-learning based power allocation algorithm (HQLA) is proposed to obtain the optimal power allocation strategy of the transmitter. Numerical results show that the secrecy performance can be degraded severely by multiple AEs and verify the effectiveness of the optimal power allocation strategy. Finally, the effect of the eavesdropping cost on the AEs' attack mode strategies is also revealed.


Introduction
With the inherent advantages in mobility, flexibility, and adaptive altitude, unmanned aerial vehicle (UAV) wireless communication has experienced an upsurge of interest in both military and civilian applications [1][2][3][4][5][6]. However, both the broadcast nature of the wireless medium and the malicious attackers make the electromagnetic environment of UAV communication hostile. Hence, the security issue of UAV communications is of paramount importance yet a significant challenge [7].
However, most approaches in prior work focus mainly on single-mode attackers and are not fully suitable against a novel class of attackers known as "adaptive eavesdroppers (AEs)," "active eavesdroppers," or "smart eavesdroppers." These attackers use programmable radio devices to flexibly choose their attack method, such as eavesdropping, jamming, or spoofing, according to the ongoing transmission status and the radio channel states. For example, an AE sends spoofing signals if its channel state is similar to Alice's, or sends jamming signals if it is very close to Bob. Compared with traditional attackers that each perform a single-mode attack, an AE can be more harmful to the UAV transmission by reducing the secrecy capacity. Therefore, it is urgent to investigate effective countermeasures against this type of eavesdropper.
In recent years, some literature has begun to investigate AEs. One form of AE is realized through multiantenna full-duplex (FD) technology [24][25][26][27], assigning one part of the antennas to wiretap while the others jam simultaneously. Another type of AE emerges during the channel estimation phase of time-division duplex systems and causes pilot contamination by sending the same pilot sequence as the legitimate node [28][29][30][31]; in the data transmission phase, this AE reverts to a passive eavesdropper. Nevertheless, it should be noted that the attack modes of these two forms of AE are predefined, which means they cannot change the attack mode adaptively. The third form of AE, studied in [32][33][34][35][36], can adjust its attack strategy adaptively, but several problems remain unsolved. Firstly, current work rarely considered the case of multiple AEs and the mutual interference among them. Secondly, existing studies neglected the adaptivity granted by each AE's learning ability when searching for the optimal strategies of the transmitter, and did not reveal the impact of that learning ability on the secrecy performance of the considered system. How to find the transmitter's optimal power strategies in the face of multiple AEs with learning ability, and how to handle the mutual interference among AEs to improve the secrecy capacity of the UAV communication system, therefore need to be considered.
In our work, we concentrate on a secure UAV communication scenario in the presence of multiple UAV AEs, each of which can eavesdrop or jam adaptively by learning the others' strategies as well as the dynamic environment. In this scenario, each AE's attack activity may affect the signal to interference plus noise ratio (SINR) of the others, which implies that each AE's decision-making is coupled not only with the transmitter's actions but also with the other AEs'. Considering these hierarchical interactions between the transmitter side and the AE side, the Stackelberg game [37][38][39][40][41] is a suitable framework to capture the sequential interactions between the transmitter and the AEs. The Stackelberg Equilibrium (SE) points of the formulated game then become feasible solutions to the transmit power allocation problem. However, the SE points solely provide theoretical solutions, and obtaining them is challenging. In particular, each AE with learning ability makes decisions spontaneously and independently, which results in unpredictable attack modes across the whole AE set. In this context, a centralized solution is not feasible because the number of AEs in each attack mode and their locations are unknown, which motivates applying reinforcement learning (RL). We therefore incorporate RL into the proposed game and propose a hierarchical Q-learning based power allocation algorithm to obtain the mixed-strategy equilibrium solution. The main contributions of this paper are summarized as follows:
(i) We propose a secure UAV communication model which consists of one transmitter-receiver pair and multiple UAV AEs. Each AE decides to eavesdrop or jam adaptively by learning the other nodes' strategies as well as the dynamic environment to maximize its damage. The interference among AEs is also investigated.
(ii) We formulate the UAV secure transmission problem as a one-leader, multi-follower Stackelberg game in which the transmitter acts as the leader and all AEs are followers. The optimal transmit power of the leader is obtained by analyzing the pure-strategy SEs under the given conditions. Besides, the mixed-strategy SE is derived for the finite, discretized power set. We then apply a hierarchical RL framework in which each player chooses its strategy according to a probability distribution, and propose a hierarchical Q-learning based power allocation algorithm to discover the mixed-strategy equilibrium of the formulated game. We also provide rigorous theoretical proof of the convergence of the proposed algorithm.
(iii) Numerical results show the effectiveness of the legitimate transmitter's optimal power allocation strategy in this more hostile situation and reveal the impact of the AEs' learning ability on the secrecy rate. Meanwhile, we show that the proposed algorithm has a significant convergence advantage over the single-agent RL algorithm. Finally, the effect of the eavesdropping cost on the AEs' attack mode strategies is also revealed.
The rest of this paper is organized as follows. Section 2 presents the related work, and Section 3 the system model. In Section 4, we formulate the UAV secure transmission game, and in Section 5, we investigate the power allocation policy. Section 6 provides the simulation results, and Section 7 concludes the work.

Related Work
In UAV communication, abundant approaches have addressed the single attack mode, such as 3D beamforming [12][13][14], trajectory optimization [5,[15][16][17][18][19], multi-UAV cooperation [17,20], and resource management techniques [21][22][23]. However, it is inappropriate to apply them directly against a novel attacker with the combined abilities of eavesdropping, jamming, spoofing, and so on. As such an attacker, the AE can eavesdrop and jam simultaneously thanks to the FD capability [24][25][26][27]. Specifically, Tang et al. investigated the physical layer security issue in the presence of an FD AE within a hierarchical game framework in [24]. In [25], Mukherjee and Swindlehurst examined the design of an FD active eavesdropper in the 3-user MIMOME wiretap channel, where the adversary optimizes its transmit and receive sub-arrays and jamming signal parameters to minimize the MIMO secrecy rate of the main channel. In [26], the potential benefits of an FD receiver node in the presence of an active FD eavesdropper were studied. The optimal receive/transmit antenna allocation at the receiver against an active eavesdropper in an FD pattern is provided in [27]. The second AE scenario adopts time-division duplex technology: the adaptive eavesdropper sends the same pilot sequence as the legitimate user node in the training phase, leading to pilot contamination [28][29][30][31]. Zhou et al. discussed how an AE attacks the training phase of wireless communication to improve its eavesdropping performance in [28]. A simple protocol to determine whether an AE is present using the channel properties of massive MIMO is proposed in [29]. A novel random-training-assisted (RTA) pilot spoofing detection algorithm and a zero-forcing based secure transmission scheme are proposed to protect confidential information from the active eavesdropper in [30]. Unfortunately, none of the AEs in the above scenarios can adjust its attack mode adaptively.
More recently, AEs that can determine the attack mode autonomously have been studied in [32][33][34][35][36]. To be specific, Li et al. studied the secure communication game against a UAV-borne AE under imperfect channel estimation, but ignored the mobility of the UAV, in [32]. Li et al. formulated MIMO transmission in the presence of an AE as a noncooperative game and obtained a power control strategy based on Q-learning in [33]. Zhu et al. proposed a noncooperative strategic game for the complex decisions between users performing uplink transmission via a relay and an active malicious node in [34]. In [35], Xiao et al. formulated a subjective smart attack game for UAV transmission and proposed deep Q-learning based UAV power allocation strategies. However, the above studies did not address the multiple-AE scenario, and the mutual interference among AEs was hardly considered. Moreover, these AEs cannot learn from the others' strategies and the dynamic environment. A summary of the related literature on AEs is given in Table 1.
Our work differs from the above studies in that we focus on AEs with learning ability that choose their attack modes independently, and we investigate the secure transmission problem of UAV communication in the presence of multiple AEs. Note that the approach of defending against multiple AEs using the Stackelberg game in UAV communication networks was presented in our previous work [37]; the main differences and new contributions are (i) targeting practical UAV communication, we introduce mixed strategies over a discretized transmit power set, and (ii) we assume that each AE has learning ability and reveal the impact of that ability on the secrecy rate. Besides, the most related work [32] shares with ours the investigation of a Stackelberg game based power allocation problem for secure UAV transmission. The main differences are that (i) we consider the multiple-AE case, which is more realistic in UAV communication, while the work in [32] ignores it, and (ii) we account for the mutual interference among the AEs.

System Model
As shown in Figure 1, we consider the downlink of a UAV communication system consisting of a transmitter (Alice), a receiver (Bob), and $M$ UAV AEs randomly distributed around the transmitter-receiver pair, where all nodes are single-antenna and all UAVs are hovering. We adopt a 3D Cartesian coordinate system with Alice, Bob, and AE$_m$ located at $(x_a, y_a, h_a)$, $(x_b, y_b, h_b)$, and $(x_m, y_m, h_m)$, respectively. Alice communicates with Bob using transmit power $P_s \in [0, P_{\max}]$, where $P_{\max}$ is the maximum transmit power. Without loss of generality, being programmable radio devices, when Alice is transmitting a signal to Bob, some AEs act as passive eavesdroppers to overhear Alice's signals if they can derive enough information, while the rest of the AEs send jamming signals if they can effectively block Alice's signal to Bob. Each AE can either eavesdrop on Alice or jam Bob, under a half-duplex constraint. Here $q_m \in \{e, j\}$, $m \in [1, M]$, corresponding to eavesdropping and jamming, denotes the attack mode of AE$_m$. Hence, the sets of passive eavesdroppers and active jammers can be denoted by $\Phi_E = \{m \mid q_m = e\}$ and $\Phi_J = \{m \mid q_m = j\}$, respectively. Considering the low mobility of low-altitude UAVs, all channels are assumed to be quasi-static fading, i.e., the channel gains are constant within each transmission block. Besides, the channel gains between the UAVs follow the free-space path loss model, determined by the distance between the UAVs, i.e., $g_{i,j} = \beta_0 d_{i,j}^{-\eta} = \beta_0 \|\zeta_i - \zeta_j\|^{-\eta}$, where $\beta_0$ is the channel power gain at the reference distance $d_0 = 1$ m, $d_{i,j}$ is the distance from node $i$ to node $j$, $\zeta_i$ is the coordinate of node $i$, and $\eta$ is the path loss exponent. At each time slot, Alice first sends a normalized signal $x_a$ with transmit power $P_s$. Then, all AEs conduct their attack modes by learning the others' strategies. The legitimate link and all passive eavesdroppers suffer interference from all active jammers.
The interference at the legitimate link and at the $k$th passive eavesdropper ($k \in \Phi_E$) is given by $\sum_{j \in \Phi_J} P_j g_{j,b}$ and $\sum_{j \in \Phi_J} P_j g_{j,k}$, respectively, where $P_j$ is the jamming power.
The received signal at Bob can be expressed as $y_b = \sqrt{P_s g_{a,b}}\, x_a + \sum_{j \in \Phi_J} \sqrt{P_j g_{j,b}}\, x_j + n_b$, where $n_b \sim \mathcal{CN}(0, \sigma_n^2)$ is the additive white Gaussian noise (AWGN) at Bob. The received SINR at Bob is $r_b = P_s \omega_0 d_{a,b}^{-\eta} / (1 + I_{J,B})$, where $\omega_0 = \beta_0 / \sigma_n^2$ and $I_{J,B} = \sum_{j \in \Phi_J} P_j \omega_0 d_{j,b}^{-\eta}$ denotes the interference from all AEs who choose to jam. The data rate of the Alice-Bob link is then $R_a = \log_2(1 + r_b)$. Due to Remark 1, each AE can obtain the other AEs' actions. The signal received at the $k$th passive eavesdropper can be expressed as $y_k = \sqrt{P_s g_{a,k}}\, x_a + \sum_{j \in \Phi_J} \sqrt{P_j g_{j,k}}\, x_j + n_e$, where $n_e \sim \mathcal{CN}(0, \sigma_n^2)$ is the AWGN at the $k$th passive eavesdropper. Similarly, the received SINR at the $k$th passive eavesdropper is $r_k = P_s \omega_0 d_{a,k}^{-\eta} / (1 + I_{J,k})$, where $I_{J,k} = \sum_{j \in \Phi_J} P_j \omega_0 d_{j,k}^{-\eta}$. Assuming the maximal eavesdropped information is determined by the maximal SINR among all passive eavesdroppers, i.e., $r_E = \max_{k \in \Phi_E} r_k$, the maximal data rate of the Alice-AE links is $R_{ae} = \log_2(1 + r_E)$. From (4) and (7), the secrecy rate of Alice can be written as $R_a^{\mathrm{sec}} = [R_a - R_{ae}]^+$, where $[X]^+$ returns $X$ if $X$ is positive and 0 otherwise.
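To make the rate model concrete, the following sketch computes the secrecy rate for a given split of AEs into eavesdroppers and jammers. The channel constants, powers, and node coordinates here are illustrative choices, not the paper's simulation values.

```python
import math

# Illustrative parameters (chosen for demonstration, not the paper's exact values)
beta0 = 1e-4   # channel power gain at reference distance d0 = 1 m
sigma2 = 1e-9  # noise power sigma_n^2 at each receiver
eta = 2        # path loss exponent

def gain(a, b):
    """Free-space channel power gain g_{i,j} = beta0 * d_{i,j}^{-eta}."""
    return beta0 * math.dist(a, b) ** (-eta)

def secrecy_rate(Ps, alice, bob, eaves, jammers, Pj):
    """[R_a - R_ae]^+ for a given split of AEs into eavesdroppers and jammers."""
    I_jb = sum(Pj * gain(j, bob) for j in jammers)          # jamming power at Bob
    R_a = math.log2(1 + Ps * gain(alice, bob) / (sigma2 + I_jb))
    sinr_e = 0.0
    for e in eaves:                                         # strongest eavesdropper dominates
        I_je = sum(Pj * gain(j, e) for j in jammers)
        sinr_e = max(sinr_e, Ps * gain(alice, e) / (sigma2 + I_je))
    R_ae = math.log2(1 + sinr_e)
    return max(R_a - R_ae, 0.0)

alice, bob = (0, 0, 100), (100, 0, 100)
eve, jammer = (150, 150, 100), (80, 10, 100)
rs_quiet = secrecy_rate(1.0, alice, bob, [eve], [], 0.5)         # eavesdropping only
rs_jammed = secrecy_rate(1.0, alice, bob, [eve], [jammer], 0.5)  # one AE jams Bob
```

Placing the jammer near Bob degrades $R_a$ much more than it degrades the eavesdropper's SINR, so in this layout the secrecy rate under jamming is lower than under pure eavesdropping.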

Secure Transmission Game
In this section, we investigate the secure transmission problem with multiple UAV AEs. The interactions between the transmitter and multiple UAV AEs are formulated under the Stackelberg game framework. The optimal power allocations and secrecy rate of Alice and the best attack modes of all AEs are derived by analyzing the equilibrium of the game.

Secure Transmission Game Formulation.
The secure transmission problem of the proposed system can be formulated as a two-stage Stackelberg game in which Alice is the leader and all AEs are followers. Alice decides its transmit power first, and all AEs then take their actions adaptively based on the observation of the leader's action. The secure transmission game is formulated as $\mathcal{G} = \{\mathcal{N}, \{\mathcal{P}, \mathcal{Q}\}, \{U_a, U_m\}\}$, where $\mathcal{N} = \{\text{Alice}, \mathrm{AE}_1, \dots, \mathrm{AE}_m, \dots, \mathrm{AE}_M\}$ is the set of players, $\mathcal{P} = [0, P_{\max}]$ and $\mathcal{Q} = \{e, j\}$ are the strategy spaces of Alice and each AE, respectively, and $U_a$ and $U_m$ are the utilities of Alice and AE$_m$, respectively.
In this system, Alice wants to send a confidential message and thus naturally intends to maximize its secrecy rate. Meanwhile, a transmission cost is inevitable. Therefore, the utility of the leader is the trade-off between the secrecy rate and the transmission cost, which can be formulated as $U_a(P_s, q_m, q_{-m}) = R_a^{\mathrm{sec}} \ln 2 - C_a P_s$,

Wireless Communications and Mobile Computing
where C a denotes the cost of the unit transmit power of Alice. For computational convenience, we multiply the data rate by a coefficient ln 2.
The objective of the leader is to solve the following problem to obtain the optimal power allocation: $P_s^* = \arg\max_{P_s \in [0, P_{\max}]} U_a(P_s, q_1^*, \dots, q_M^*)$, where $q_1^*, \dots, q_M^*$ denote the optimal actions of all AEs. On the other hand, each AE attempts to minimize the secrecy rate of Alice by changing its attack mode adaptively according to Alice's transmit power. Therefore, we formulate the utility of AE$_m$ as the trade-off between the secrecy rate and its attack cost, $U_m(P_s, q_m, q_{-m}) = -R_a^{\mathrm{sec}} \ln 2 - \theta_{q_m}$, where $\theta_e$ and $\theta_j$ denote the cost for an AE of acting as a passive eavesdropper and as an active jammer, respectively. We assume that $\theta_e$ is related to $R_{ae}$, i.e., $\theta_e = C_e R_{ae}$, where $C_e$ denotes the cost per unit rate of $R_{ae}$, and $\theta_j = C_j P_j$, where $C_j$ denotes the cost of the unit transmit power of the jammer.
To calculate the utility of a single AE accurately, at each time slot when Alice is transmitting a signal to Bob, we divide all AEs into three parts, denoted as $\{\mathrm{AE}_m\}$, $\Phi_E^{-m}$, and $\Phi_J^{-m}$, where $\Phi_E^{-m}$ is the set of passive eavesdroppers except AE$_m$ and $\Phi_J^{-m}$ is the set of active jammers except AE$_m$. If AE$_m$ decides to act as a passive eavesdropper, $R_a$ can be expressed as $R_a = \log_2\bigl(1 + P_s \omega_0 d_{a,b}^{-\eta} / (1 + I_{J,B}^{-m})\bigr)$, where $I_{J,B}^{-m}$ is the interference received at Bob from $\Phi_J^{-m}$, and $r_m$ is the SINR of AE$_m$. Similarly, if AE$_m$ elects to jam, $R_a = \log_2\bigl(1 + P_s \omega_0 d_{a,b}^{-\eta} / (1 + I_{J,B}^{-m} + I_m)\bigr)$, where $I_m$ is the jamming interference of AE$_m$ at Bob, and $r_E^{-m}$ is the maximal SINR among all passive eavesdroppers in $\Phi_E^{-m}$. Combining the two cases yields $U_m$. The objective of AE$_m$ is thus to solve $q_m^* = \arg\max_{q_m \in \mathcal{Q}} U_m(P_s, q_m, q_{-m}^*)$, where $q_{-m}^*$ denotes the optimal actions of all AEs except AE$_m$.

Analysis of Strategy Equilibrium

Now, we analyze the proposed Stackelberg game model and solve the optimization subproblems (11) and (16). As a follower, each AE adjusts its attack mode after sensing Alice's strategy. Therefore, the followers' subgame is analyzed first.

Proposition 1.
Given the strategy of Alice, the optimal attack mode strategy of AE$_m$ is given by (17), provided that conditions (17(a)) and (17(b)) hold.

Proof. If (17(a)) holds, then from (12) we have $U_m(P_s, e, q_{-m}) \ge U_m(P_s, j, q_{-m})$, so eavesdropping is the best response; if (17(b)) holds, then from (12) we have $U_m(P_s, j, q_{-m}) \ge U_m(P_s, e, q_{-m})$, so jamming is the best response. Thus (17) holds. As Proposition 1 shows, if passive eavesdropping degrades the secrecy rate more while costing less than active jamming, the AE chooses to eavesdrop, and vice versa.
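As a numerical illustration of this best-response behavior, the sketch below evaluates both modes for a single AE and picks the one with the higher utility. The utility form $U_m = -(R_a - R_{ae})^+ - \theta_{q_m}$ and all parameter values are our reading of the trade-off in (12), chosen for demonstration only.

```python
import math

beta0, sigma2, eta = 1e-4, 1e-9, 2   # illustrative channel constants

def gain(a, b):
    """Free-space channel power gain beta0 * d^{-eta}."""
    return beta0 * math.dist(a, b) ** (-eta)

def rates(Ps, alice, bob, eaves, jammers, Pj):
    """(R_a, R_ae) for a given eavesdrop/jam split of the AEs."""
    I_jb = sum(Pj * gain(j, bob) for j in jammers)
    R_a = math.log2(1 + Ps * gain(alice, bob) / (sigma2 + I_jb))
    sinr_e = max((Ps * gain(alice, e) /
                  (sigma2 + sum(Pj * gain(j, e) for j in jammers))
                  for e in eaves), default=0.0)
    return R_a, math.log2(1 + sinr_e)

def best_mode(me, others_e, others_j, Ps, Pj, alice, bob, Ce=0.5, Cj=0.1):
    """Best response of AE_m: compare U_m for q_m = e and q_m = j, where
    U_m = -(secrecy rate) - attack cost (our reading of Eq. (12))."""
    util = {}
    for mode in ("e", "j"):
        eaves = others_e + ([me] if mode == "e" else [])
        jams = others_j + ([me] if mode == "j" else [])
        R_a, R_ae = rates(Ps, alice, bob, eaves, jams, Pj)
        cost = Ce * R_ae if mode == "e" else Cj * Pj
        util[mode] = -max(R_a - R_ae, 0.0) - cost
    return max(util, key=util.get), util

alice, bob, ae = (0, 0, 100), (100, 0, 100), (10, 5, 100)
mode, util = best_mode(ae, [], [], 1.0, 0.5, alice, bob)
```

Sweeping $C_e$ in this toy setup reproduces the qualitative behavior of Proposition 1: with a negligible eavesdropping cost, an AE close to Alice eavesdrops; when the cost becomes prohibitive, it switches to jamming.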
As the leader of the game, Alice chooses its transmit power first. The optimal power strategy of Alice can be derived by solving (11), as revealed in Proposition 2.

Proposition 2.
The optimal power allocation $P_s^*$ is the unique solution of the stationarity condition $\partial U_a(P_s, q_m^*, q_{-m}^*)/\partial P_s = 0$ given in (20(a)), provided that (21(a)) and (21(b)) hold, where $I_{J,B}^*$ and $I_{J,E}^*$ denote the interference at Bob and at the $k$th passive eavesdropper, respectively, from the set $\Phi_J^*$ of AEs whose optimal strategy is to jam.
Proof. We first compute the partial derivative of Alice's utility with respect to $P_s$:

If (21(a)) holds, (23) is less than zero, which indicates that $\partial U_a / \partial P_s$ decreases monotonically with $P_s$. Therefore, if (21(b)) also holds, there is a unique solution to $\partial U_a / \partial P_s = 0$, given in (20(a)). From (22)-(24), we find that $U_a(P_s, q_m^*, q_{-m}^*)$ increases with $P_s$ for $P_s < P_s^*$ and decreases otherwise. Thus, (11) holds and $(P_s^*, q_m^*, q_{-m}^*)$ is a Nash Equilibrium (NE) of the game. This completes the proof of Proposition 2.
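Since the leader's utility is unimodal in $P_s$ under the concavity condition, the stationary point can be located numerically. The sketch below uses a dense grid search on an ln-scaled utility of the form in Section 4; the effective gains toward Bob and the strongest eavesdropper, the interference terms, and all other constants are illustrative placeholders.

```python
import math

def leader_utility(Ps, g_b, g_e, I_b, I_e, Ca):
    """U_a = ln(1 + SINR_Bob) - ln(1 + SINR_Eve) - Ca * Ps (ln-scaled form)."""
    return (math.log(1 + Ps * g_b / (1 + I_b))
            - math.log(1 + Ps * g_e / (1 + I_e)) - Ca * Ps)

def optimal_power(g_b, g_e, I_b, I_e, Ca, Pmax, n=20000):
    """Grid search for P_s^*: when U_a is unimodal in P_s, the maximizer
    approximates the unique stationary point or a boundary optimum."""
    grid = [Pmax * k / n for k in range(n + 1)]
    return max(grid, key=lambda p: leader_utility(p, g_b, g_e, I_b, I_e, Ca))

# Illustrative setting: Bob's effective channel is stronger than the eavesdropper's
P_star = optimal_power(g_b=8.0, g_e=1.0, I_b=0.5, I_e=0.5, Ca=0.1, Pmax=10.0)
```

With these values the optimum is interior (the derivative is positive at $P_s = 0$ and negative at $P_{\max}$), matching the interior stationary point of Proposition 2; shrinking the transmission cost enough would instead push the optimum to $P_{\max}$, as in Proposition 3.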
As shown in Proposition 2, Alice stops the transmission when (21(b)) does not hold; in other words, Alice stops transmitting when the radio channel degradation is serious and security cannot be guaranteed.
Another NE, $(P_{\max}, q_m^*, q_{-m}^*)$, is revealed in Proposition 3.

Proposition 3.
The secure game has the NE $(P_{\max}, q_m^*, q_{-m}^*)$ if (21(a)) and (27) hold. Proof. Condition (21(a)) has been discussed above. If (27) also holds, $\partial U_a / \partial P_s > 0$ over the whole power range, which indicates that $U_a$ increases monotonically with $P_s$; hence $(P_{\max}, q_m^*, q_{-m}^*)$ is also an NE of the game. This completes the proof of Proposition 3.
As shown in Proposition 3, low transmission costs in (27) will make Alice select the maximum transmit power to transmit the signals.

Hierarchical Reinforcement Learning Framework for Secure Transmission Game

The proposed UAV secure communication problem with multiple AEs has been formulated as a Stackelberg game, which belongs to the category of two-stage dynamic games and has a clear two-layer structure. Alice and all AEs become intelligent agents with the learning ability to automatically optimize their configurations. Besides, mixed strategies are applied by both sides of the communication to confuse each other. In this section, we apply a hierarchical RL framework to derive the mixed-strategy equilibrium and implement the UAV secure communication.

Analysis of Mixed-Strategy Equilibrium.
Considering a practical wireless communication scenario, we assume that Alice has a finite, discretized power set. Specifically, a policy of Alice at time slot $t$ is defined as a probability vector $\pi^t = (\pi_1^t, \pi_2^t, \dots, \pi_L^t)$, where $\pi_l^t$ is the probability with which Alice chooses action (power level) $P_l$ from a finite discrete set $\mathcal{P}$, and $\sum_{l=1}^{L} \pi_l^t = 1$. Similarly, $\delta_m^t = (\delta_{m,1}^t, \delta_{m,2}^t)$ denotes the policy of AE$_m$ at time slot $t$, where $\delta_{m,i}^t$ is the probability with which AE$_m$ chooses action (attack mode) $Q_i$ from the finite discrete set $\mathcal{Q}$, and $\sum_{i=1}^{2} \delta_{m,i}^t = 1$. Based on the above analysis and Eqs. (10) and (12), Alice's objective is to maximize its expected utility $\mathbb{E}_{\pi, \delta}[U_a]$ over $\pi$, and each AE's objective is to maximize its expected utility $\mathbb{E}_{\pi, \delta}[U_m]$ over $\delta_m$. We then define the SE in the hierarchical reinforcement learning framework. Definition 1. A stationary policy profile $(\pi^*, \delta_m^*, \delta_{-m}^*)$ is an SE for the hierarchical RL framework if $\pi^*$ maximizes Alice's expected utility given the followers' equilibrium policies $(\delta_m^*, \delta_{-m}^*)$, and each $\delta_m^*$ maximizes AE$_m$'s expected utility given $\pi^*$ and $\delta_{-m}^*$.
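The expected utilities above are plain sums over the joint action distribution. The toy sketch below computes the leader's expected utility for two power levels and two AEs; the payoff table entries are illustrative placeholders rather than values derived from the paper's channel model.

```python
import itertools

# Toy payoff table U_a(power level l, joint attack modes); entries are
# illustrative placeholders, not values from the paper's model.
U_a = {
    (0, ("e", "e")): 0.4, (0, ("e", "j")): 0.6, (0, ("j", "e")): 0.6, (0, ("j", "j")): 0.9,
    (1, ("e", "e")): 0.2, (1, ("e", "j")): 0.8, (1, ("j", "e")): 0.8, (1, ("j", "j")): 1.2,
}

def expected_leader_utility(pi, deltas):
    """E[U_a] under Alice's mixed policy pi over power levels and each AE_m's
    independent mixed policy deltas[m] over the modes 'e' and 'j'."""
    total = 0.0
    for l, p_l in enumerate(pi):
        for modes in itertools.product("ej", repeat=len(deltas)):
            prob = p_l
            for m, q in enumerate(modes):
                prob *= deltas[m][q]   # players randomize independently
            total += prob * U_a[(l, modes)]
    return total

val = expected_leader_utility([0.3, 0.7], [{"e": 0.5, "j": 0.5}, {"e": 0.2, "j": 0.8}])
```

An SE check then amounts to verifying that no unilateral deviation in $\pi$ (or in any $\delta_m$, with the sign of the follower's utility) increases the corresponding expectation.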
Proposition 4. For the proposed hierarchical RL framework, there exists Alice's stationary policy and an AEs' NE policy that form an SE.

Proof. If Alice follows a stationary policy $\pi$, the Stackelberg game reduces to an $M$-player hierarchical RL game among the followers. It has been shown in [42] that every finite strategic-form game has a mixed-policy equilibrium. As a result, there always exists an NE $\delta^*(\pi)$ of the followers' game given Alice's policy $\pi$. The rest of the proof follows directly from the definition of an SE and is thus omitted for brevity.

Hierarchical Q-Learning Based Power Allocation Algorithm

In the proposed UAV secure transmission game, since there is no information exchange between Alice and the AEs, both sides can only maximize their expected utilities through repeated interactions. When an action taken by an agent (Alice or an AE) brings positive feedback, the agent reinforces that action; otherwise, it weakens it. Agents constantly adjust their strategies based on this feedback to achieve the optimal long-term return. Thus, a hierarchical Q-learning based power allocation algorithm (HQLA) is adopted, where each agent's policy is parameterized through a Q-function that characterizes the relative expected utility of a particular action.
To be specific, for the follower's learning, let $Q_m^t(q_{m,i})$ denote the Q-function of AE$_m$'s action $q_{m,i}$ under the current policy $\delta_{m,i}^t$ at time slot $t$. After conducting action $q_{m,i}^t$, the corresponding Q-value is updated as $Q_m^{t+1}(q_{m,i}) = (1 - \alpha)\, Q_m^t(q_{m,i}) + \alpha\, U_m(P_l^{t+1}, q_{m,i}^t, q_{-m}^{t+1})$, where $\alpha \in [0, 1)$ is the learning rate and $U_m(P_l^{t+1}, q_{m,i}^t, q_{-m}^{t+1})$ is the utility of AE$_m$ at time slot $t + 1$.
Each AE updates its policy according to the Boltzmann distribution $\delta_{m,i}^{t+1} = \exp\bigl(Q_m^t(q_{m,i})/\tau\bigr) / \sum_{i'} \exp\bigl(Q_m^t(q_{m,i'})/\tau\bigr)$, where the temperature $\tau$ controls the trade-off between exploration and exploitation: for $\tau \to 0$, AE$_m$ greedily chooses the policy corresponding to the maximum Q-value (pure exploitation), whereas for $\tau \to \infty$, AE$_m$'s policy is completely random (pure exploration) [43]. Accordingly, the Q-value of Alice is updated as $Q_a^{t+1}(P_l) = (1 - \alpha)\, Q_a^t(P_l) + \alpha\, U_a(P_l, q_m^{t+1}, q_{-m}^{t+1})$, where $U_a(P_l, q_m^{t+1}, q_{-m}^{t+1})$ is the utility of Alice at time slot $t + 1$. Alice then updates its policy according to the Boltzmann distribution $\pi_l^{t+1} = \exp\bigl(Q_a^t(P_l)/\tau\bigr) / \sum_{l'} \exp\bigl(Q_a^t(P_{l'})/\tau\bigr)$. Now, we present the detailed description of the Q-learning based hierarchical RL algorithm.
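The loop below sketches this hierarchical update scheme end to end: the leader samples a power level, the followers respond, and every agent applies a stateless Q-update followed by a Boltzmann policy update. The utility functions, power levels, and constants are stand-in placeholders (the paper's utilities (10) and (12) depend on the full channel model), so only the structure of the algorithm is meaningful here.

```python
import math
import random

random.seed(0)
ALPHA, TAU = 0.1, 0.5          # learning rate and Boltzmann temperature
POWERS = [0.5, 1.0, 2.0]       # discretized power set of the leader
MODES = ["e", "j"]             # attack modes of each follower
M = 2                          # number of AEs

def boltzmann(q):
    """Mixed policy from Q-values: softmax with temperature TAU."""
    w = [math.exp(v / TAU) for v in q]
    return [x / sum(w) for x in w]

# Stand-in utilities (placeholders for Eqs. (10) and (12)): higher power helps
# Alice even when jammed, and jamming is more damaging but carries a small cost.
def u_alice(p, modes):
    return p * (0.5 if "j" in modes else 1.0) - 0.1 * p

def u_ae(p, mode):
    return (0.4 * p - 0.05) if mode == "j" else 0.2 * p

Qa = [0.0] * len(POWERS)
Qm = [[0.0] * len(MODES) for _ in range(M)]

for t in range(5000):
    # leader samples a power level first; followers respond (two-stage structure)
    l = random.choices(range(len(POWERS)), weights=boltzmann(Qa))[0]
    picks = [random.choices(range(2), weights=boltzmann(Qm[m]))[0] for m in range(M)]
    modes = [MODES[i] for i in picks]
    # stateless Q-updates of the form Q <- (1 - alpha) Q + alpha * U
    Qa[l] = (1 - ALPHA) * Qa[l] + ALPHA * u_alice(POWERS[l], modes)
    for m, i in enumerate(picks):
        Qm[m][i] = (1 - ALPHA) * Qm[m][i] + ALPHA * u_ae(POWERS[l], MODES[i])

pi = boltzmann(Qa)             # Alice's learned mixed strategy over POWERS
```

Under these toy utilities, jamming dominates eavesdropping for every power level and the leader's utility grows with power, so the learned mixed strategies concentrate on jamming and on the highest power level, illustrating the convergence toward a (here pure-strategy) equilibrium.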

Convergence Analysis of Algorithm 1

The learning algorithm results in a stochastic process over the chosen power levels, so we need to investigate the long-term behavior of the learning procedure. Following the discussion in [43], we obtain a differential equation describing the evolution of the Q-values. To express the dynamics in terms of strategies rather than Q-values, we differentiate (35) with respect to time $t$ and use (36); similarly, we differentiate (33) with respect to time $t$ and use (37).

This yields equations (38) and (39). The steady-state strategy profile $z^s = (\pi^s(P_l), \delta_m^s(q_{m,i}))$ can then be obtained [43]. Let $Z^t = (z_1^t, \dots, z_N^t)$ denote the strategy profile of all players at time slot $t$. In the following analysis, we resort to an ordinary differential equation (ODE) whose solution approximates the limiting behavior of $Z^t$. The right-hand sides of (38) and (39) can be represented by a function $f(Z^t)$ as $\alpha \to 0$, and $Z^t$ converges weakly to $Z^* = (\pi^*, \delta^*)$, which is the solution of (42).

Proposition 5. The HQLA can discover a mixed-strategy SE.
Proof. We prove this by contradiction. Suppose that the process generated by (33) and (35) converges to a point that is not an SE. The solutions of (42) are by definition stationary points, so HQLA can only converge to stationary points of the learning dynamics. Convergence to a non-SE point would then require a stationary point that is not an SE to be stable, which contradicts Proposition 4. Hence, HQLA converges to a mixed-strategy SE.

Simulation Results
Simulations are carried out to evaluate the performance of the proposed power allocation strategies against multiple UAV AEs. The scenario has one transmitter-receiver pair and three UAV AEs, denoted as Alice, Bob, AE$_1$, AE$_2$, and AE$_3$, respectively, with all UAVs distributed in a 200 m × 200 m region. The system parameters are chosen for typical scenarios, including the costs of unit transmit power and jamming power, i.e., $C_a = C_j = 0.1$ and $C_e = 0.5$, the path loss exponent $\eta = 2$, and $\omega_0 = 80$. Figure 2 shows the expected utilities of the leader under different algorithms. The expected utility achieved by the proposed HQLA is significantly lower than that under the single-agent Q-learning algorithm (SAQL). This is because in SAQL, only Alice applies the reinforcement learning mechanism to maximize the secrecy rate, while the joint actions of all AEs are treated as the state of the Q-learning algorithm, so each AE cannot adaptively choose the optimal strategy to maximize its utility. In HQLA, by contrast, each AE has reinforcement learning ability and can maximize its damage to the secrecy rate of the considered system through repeated interactions with Alice's and the other AEs' strategies. The comparison with SAQL implies that an agent's learning ability has a significant impact on its utility, so the proposed HQLA provides an optimal power allocation strategy in the more hostile case of adaptive attacks from multiple AEs. On the other hand, the proposed HQLA is superior to the random selection algorithm (RS) because HQLA converges to a desirable solution, whereas RS is merely an instinctive approach. Figure 3 shows the cumulative distribution function (CDF) of the convergence of HQLA and SAQL.
As observed from Figure 3, the proposed algorithm converges in about 500 iterations, while the contrast algorithm converges in about 1000 iterations, which means the convergence rate of HQLA is significantly better than that of SAQL. This is because all AEs in SAQL select actions randomly, without learning ability, whereas in HQLA, taking the interactions between the two sides of the communication into account, all AEs make decisions according to the mixed strategy derived by RL, which can obtain the optimal strategy via trial and error. This also means that the learning ability has a significant positive impact on the convergence rate. Figure 4 presents the evolution of the selection probabilities of the leader's transmit power. At the very beginning, Alice selects its transmit power according to a uniform distribution. As Algorithm 1 iterates, the selection probabilities keep updating until convergence after about 500 iterations. It is worth noting that Algorithm 1 under this scenario converges to pure-strategy NE points, since the probability of selecting one power level approaches 1 while the probabilities of the other power levels decrease to 0 for sufficiently large time slots. The theoretical prediction of Proposition 4 is thus verified under the given conditions. Specifically, the $P_{\max}$ in Figure 4(a) as the optimal transmit power is consistent with Proposition 3, and the $P_s^*$ in Figure 4(b) is consistent with Proposition 2. The comparison of the leader's expected utility under different $C_e$ is shown in Figure 5(a). The steady value of the leader's expected utility increases as $C_e$ grows, because $C_e$ changes the AEs' attack strategies.
Specifically, as rational agents, all AEs finally choose to interfere with Bob in Figure 5(b), because the learning process shows them that the utility of the jammer is higher than that of the eavesdropper when $C_e = 0.8$. Thus, the maximal data rate of the Alice-AE link is zero, and the leader obtains the maximal secrecy rate and expected utility. Similarly, in Figure 5(d), all AEs find the utility of the eavesdropper higher than that of the jammer when $C_e = 0.2$, and every AE chooses to eavesdrop on Alice. As a result, the maximal data rate of the Alice-AE link among all AEs is achieved, and the leader suffers the lowest utility. When $C_e = 0.5$ (Figure 5(c)), according to their utilities, AE$_1$ and AE$_3$ always choose to interfere with Bob while AE$_2$ prefers eavesdropping, which places the leader's expected utility between the $C_e = 0.8$ and $C_e = 0.2$ cases. In addition, it is worth noting that the attack strategies of all AEs form a pure-strategy equilibrium, since the probability of selecting one attack mode equals 1 while the probability of the other attack mode decreases to 0.

Conclusions and Future Work
In this paper, we have investigated the transmit power optimization problem of secure UAV communication in the presence of multiple UAV AEs. A secure transmission game was formulated, and the existence of the NE was proven by analyzing the interactions between the legitimate user and the AEs. Within a hierarchical game framework, we obtained the optimal transmit power solutions for the legitimate transmissions. Numerical results verified the theoretical analysis and showed that the secrecy performance can be degraded severely by the AEs' learning ability. Moreover, the superior convergence of HQLA and the impact of the eavesdropping cost on the AEs' attack mode decisions were also demonstrated. To exploit the UAV's mobility and its potential performance gains, in future work we will devote our efforts to jointly optimizing the UAV's trajectory and resource allocation against multiple AEs.

Data Availability
The data (figures) used to support the findings of this study are included within the article. Further details can be provided upon request.