Defense Strategy Selection Model Based on Multistage Evolutionary Game Theory

Evolutionary game theory is widely applied in network attack and defense. The existing network attack and defense analysis methods based on evolutionary games adopt the bounded rationality hypothesis. However, existing research ignores that both sides of the game obtain more information about each other as the attack and defense game deepens, which may allow the attacker to crack a certain type of defense strategy and render it invalid. The failure of a defense strategy reduces the accuracy and guidance value of existing methods. To solve this problem, we propose a reward value learning mechanism (RLM). By analyzing previous game information, RLM automatically rewards or penalizes the attack and defense reward values for the next stage, which reduces the probability of defense strategy failure. RLM is introduced into the dynamic network attack and defense process under incomplete information, and a multistage evolutionary game model with a learning mechanism is constructed. Based on this model, we design an optimal defense strategy selection algorithm. Experimental results demonstrate that the evolutionary game model with RLM achieves better reward values and defense success rates than the evolutionary game model without RLM.


Introduction
The rapid development of IT infrastructures, such as cyber-physical systems and the Internet of Things, has brought convenience to individuals and enterprises, but it has also introduced unprecedented security problems [1,2]. The data management and communication layers in cyber-physical systems and the Internet of Things are vulnerable to cyberattacks such as DDoS attacks, APTs, and vulnerability exploits, which seriously threaten network security [3]. According to the Crystal Market Research (CMR) report, to resist increasingly severe attacks, investment in the network security market is expected to grow from $58.13 billion in 2012 to $173.57 billion in 2022 [4].
Thus, network attack and defense are increasingly severe, and network security defense has become an important problem to be solved in the field of network information [5]. Unfortunately, much research [6,7] shows that improving information security technology alone cannot provide enough protection against persistent attacks. A new method is needed to guide the implementation of the defense strategy.
Network attack and defense have the characteristics of opposing objectives and noncooperative relationships, which are consistent with the characteristics of a game. Game theory [8], applied to the network attack and defense process to explore, from the defender's perspective, how to obtain the best defense strategy, has become an important research direction in network security defense [9][10][11][12]. Traditional game theory assumes that both sides of the game are in a complete information game scenario and are completely rational [13,14]. Complete information requires the game players to know the information of the entire environment [15]. Complete rationality assumes that game players can choose their own best game strategy after obtaining the other's strategy and its revenue [16]. Evolutionary game theory [17,18] starts from the condition of opaque player information, takes the learning mechanism as its core, and influences the selection behavior of players through factors such as previous experience, learning, and imitating the behavior of others. Evolutionary game theory can better express the process of the mutual game between attackers and defenders, and it is widely used in research on the network attack and defense game [19][20][21].
However, there are still some problems and challenges in applying evolutionary games in network attack and defense.
(1) Existing studies using evolutionary game theory have introduced some relevant parameters to express evolutionary game ideas under incomplete information and bounded rationality [22,23]. The use of these parameters is feasible in certain application scenarios and shows a certain application value. However, the introduced parameters are calculated manually and need to be quantified by experts; there is currently no automatic calculation method. (2) In the multistage network attack and defense game, network attacks develop over time using new methods and targets [24], and a defense strategy may partially or completely fail. The existing evolutionary game model cannot effectively feed the failure information back to the next game stage, which leaves the best defense strategy selection algorithm lacking in timeliness, accuracy, and efficiency. (3) Network attack and defense are a dynamic, multistage process [25]. The confrontation between attack and defense is not limited to one round or a single stage.
To address the above problems, we propose a reward value learning mechanism (RLM), a novel method for updating the reward value based on the game information of the previous stage. Inspired by machine learning, we use RLM to learn the attacker's strategy and the change in its reward to predict the reward value in the next stage. Furthermore, RLM is introduced into the evolutionary game model to select the best defense strategy for each stage. The main research work and contributions are as follows: (1) Aiming at the multistage network attack and defense game scenario, a reward value learning mechanism is designed. By updating the reward value of each stage, the mechanism improves the active defense ability when a defense strategy fails. (2) A new multistage evolutionary game model with a reward value learning mechanism is constructed, and the Nash equilibrium of each stage of the attack and defense game is solved by constructing the replication dynamic equations (RD) of each stage. (3) Based on the proposed multistage evolutionary game model, an optimal defense strategy selection algorithm is proposed. Experiments show that the algorithm can effectively improve the effective strategy selection probability, defense revenue, and defense success rate in both single and multiple defense strategy failure scenarios.

The remainder of this paper is organized as follows. Section 2 briefly reviews evolutionary game theory and the related literature in network attack and defense. Section 3 describes the evolutionary game model based on Q-learning replication dynamic equations (QRD). Section 4 designs RLM and gives a new evolutionary game model based on RLM-QRD. Section 5 proposes the optimal defense strategy selection algorithm. Section 6 presents the experimental results for evaluating our model and compares it to the model without RLM. Finally, Section 7 concludes this paper and discusses future work.

Related Works
The existing network defense works based on evolutionary games mainly include network defense based on the static evolutionary game and network defense based on the dynamic evolutionary game. These two aspects are reviewed below.

Static Network Defense Evolutionary Game.
The static game assumes that the information of both sides of the game remains unchanged, and it is a one-shot game [26].
Ruan et al. [27] established an attack and defense evolutionary game model using a lightweight broadcast authentication protocol to achieve security assurance at minimum resource cost. Abdalzaher et al. [28] exploited the scalability and low complexity of wireless sensor networks to propose a trust model that uses evolutionary game theory to make decisions to resist network attacks. Bouhaddi et al. [29] established a Bayesian game model to analyze the interaction between defenders and potential malicious nodes in the network. To address the problems of users who consume services for free and users who break system rules in peer-to-peer networks, Shareh et al. [30] established an evolutionary game model to resist network attacks from these two types of users. To explore and compute more revenue from existing defense strategies, Jin et al. [31] combined Q-learning with the replication dynamic equation to obtain the Q-learning replication dynamic equation and proposed an evolutionary game model based on QRD. Aiming at the limited learning ability of players in static network attack and defense, Liu et al. [32] established a network attack-defense evolutionary game model and designed an optimal defense strategy selection algorithm. Shi et al. [33] proposed an evolutionary game model based on honeypot technology to improve the security of the honeypot system.

Dynamic Network Defense Evolutionary Game.
Compared with the static network attack and defense evolutionary game, the dynamic network attack and defense evolutionary game divides the game into multiple stages of attack-defense confrontation between the players, which is more in line with real network attack and defense.
To realize optimal defense decisions in network attack and defense, Huang et al. proposed two dynamic evolutionary game models. One is a Markov-based time game model, which selects the best defense strategy by constructing a revenue discount factor and all possible network system states [34]. The other used the best-response dynamic learning mechanism to study the evolutionary law of network defense strategy selection [35].
To resist the invasion of virus code, Hayel and Zhu [36] established an evolutionary Poisson game model by defining the number of players participating in the interaction to follow a Poisson process at a specific rate. Aiming at security issues in radio networks, Fang et al. [37] established an evolutionary game model, using evolutionary stable strategy algorithms to defend dynamically against internal attacks. Mengibaev et al. [22] introduced parameters measuring the dependence of game players on their opponents into the evolutionary game model and applied them to the privacy protection of network users. Wang et al. [38] designed three evolutionary game models for different attack scenarios in dynamic networks and introduced parameters that denote the players' sensitivity to differences in revenue.
Hu et al. proposed different dynamic evolutionary game models for problems in network attack and defense. In [39], a multistage Bayesian attack and defense evolutionary game model is proposed to address the difficulty of selecting the optimal defense strategy in a dynamic confrontation network. At the same time, a selection intensity factor was introduced to improve the replication dynamic equation and enhance the randomness of the evolution process. More recently, to improve the timeliness and predictability of the network attack and defense game, Hu et al. [40] proposed a dynamic evolutionary game model based on Logit Quantal Response Dynamics (LQRD), which introduced parameters into the evolutionary game to describe the rationality of the attack and defense sides.

Preliminaries
This section first introduces Q-learning, then the replication dynamic equations, and finally the definition of the evolutionary game model.

Q-Learning.
Q-learning [41] is a reinforcement learning method, which can be regarded as an asynchronous dynamic programming method. Q-learning is also an adaptive value iteration method, which uses the state-action value Q_t(s', a') to guide the estimation of the state-action value Q_{t+1}(s, a) at time t + 1. Here, the state-action value Q_t(s, a) is the expected revenue after action a is taken in state s at time t, and state s' is the state the learner reaches after using action a in state s. The Q-learning update formula is given as follows:

Q_{t+1}(s, a) = Q_t(s, a) + z [ r + c \max_{a'} Q_t(s', a') - Q_t(s, a) ],     (1)

where z is the usual step-size parameter, r is the immediate reinforcement, and c is the discount factor. The principle of Q-learning is to move in a discrete, finite state space and select one action from a finite set at every step, forming a controlled Markov process. It continuously improves its evaluation of the quality of specific actions in specific states to find the most profitable strategy, which is consistent with the game's goal.
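To make the update in equation (1) concrete, the following Python sketch implements one tabular Q-learning step with the step size z, immediate reinforcement r, and discount factor c defined above. The states, actions, and reward values are hypothetical stand-ins, not the paper's implementation.

```python
from collections import defaultdict

# Tabular Q-learning: Q_{t+1}(s, a) moves toward r + c * max_{a'} Q_t(s', a')
# with step size z, as in equation (1).
Q = defaultdict(float)  # maps (state, action) pairs to estimated values

def q_update(s, a, r, s_next, actions, z=0.1, c=0.9):
    """One Q-learning step; `actions` is the finite action set."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += z * (r + c * best_next - Q[(s, a)])

# Hypothetical example: two actions, reward 5 for taking "a1" in state "s0".
actions = ["a1", "a2"]
q_update("s0", "a1", r=5.0, s_next="s1", actions=actions)
print(Q[("s0", "a1")])  # 0.5 after one update from zero initialization
```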

Replication Dynamic Equation.
The replication dynamic equation is a dynamic differential equation. It describes the frequency or probability with which a certain strategy is used within a specific population [42], that is, the rate at which the probability of a player choosing a strategy changes during the game. Its basic principle is that game players gradually adopt strategies whose revenue is better than the average revenue. In addition, the replication dynamic equation ensures that the evolutionary stable strategy is a Nash equilibrium, thereby yielding the most beneficial strategy. The replication dynamic equations of attackers and defenders in network attack and defense are given as follows:

x_i'(t) = x_i(t) [ Q_{AS_i} - \bar{Q}_{AS} ],     (2)
y_j'(t) = y_j(t) [ Q_{DS_j} - \bar{Q}_{DS} ],     (3)

where x_i'(t) represents the change rate over time of the probability of the attacker selecting attack strategy AS_i, y_j'(t) represents the change rate over time of the probability of the defender selecting defense strategy DS_j, Q_{AS_i} represents the expected revenue of the attacker selecting attack strategy AS_i, Q_{DS_j} represents the expected revenue of the defender selecting defense strategy DS_j, \bar{Q}_{AS} represents the average revenue of the attack strategy set, and \bar{Q}_{DS} represents the average revenue of the defense strategy set.
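As an illustration of equations (2) and (3), the sketch below integrates the two replication dynamic equations with Euler steps. The 2x2 payoff matrices A and D are hypothetical values, not those of the paper's experiments.

```python
import numpy as np

# Replicator dynamics for attacker (x over AS) and defender (y over DS):
# x_i' = x_i (Q_{AS_i} - Qbar_AS) with Q_{AS_i} = (A y)_i and Qbar_AS = x.(A y);
# symmetrically for the defender with payoff matrix D.
A = np.array([[3.0, 1.0], [2.0, 4.0]])   # hypothetical attacker payoffs a_ij
D = np.array([[2.0, 3.0], [4.0, 1.0]])   # hypothetical defender payoffs d_ij

def replicator_step(x, y, dt=0.01):
    q_as = A @ y                   # expected revenue of each attack strategy
    q_ds = D.T @ x                 # expected revenue of each defense strategy
    x = x + dt * x * (q_as - x @ q_as)
    y = y + dt * y * (q_ds - y @ q_ds)
    return x / x.sum(), y / y.sum()   # renormalize against numerical drift

x, y = np.array([0.5, 0.5]), np.array([0.5, 0.5])
for _ in range(5000):
    x, y = replicator_step(x, y)
print(x, y)   # mixed strategies after evolution
```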

Definition of Evolutionary Game Model Based on QRD.
To extend the evolutionary game model to the dynamic network environment, this section introduces the stage definition into the evolutionary game model based on QRD. The model is defined below.
Definition 1. The evolutionary game model based on QRD is represented as a 6-tuple (N, K, S, θ, Q, τ), and its elements are defined as follows: (1) N = (N_A, N_D) is the set of players, where N_A is the attacker and N_D is the defender. (2) K is the stage identifier of the game, K = 1, 2, ..., T. (3) S = (AS, DS) is the strategy space, where AS = (AS_1, AS_2, ..., AS_n) is the attacker's strategy set and DS = (DS_1, DS_2, ..., DS_m) is the defender's strategy set. (4) θ = (P_A, P_D) is the set of game beliefs, where P_A = (x_1, x_2, ..., x_n) is the probability distribution over the attacker's strategy set AS; that is, any x_i ∈ P_A is the probability with which the attacker chooses strategy AS_i to implement a network attack, and P_D = (y_1, y_2, ..., y_m) is the probability distribution over the defender's strategy set DS; that is, any y_j ∈ P_D is the probability with which the defender chooses strategy DS_j to implement a defense. The parameters satisfy \sum_{i=1}^{n} x_i = 1 and \sum_{j=1}^{m} y_j = 1. (5) Q = (Q_A, Q_D) is the revenue function set of the attack-defense game, where Q_A and Q_D represent the revenue functions of the attacker and the defender, respectively, that is, the revenue of the attacker and the defender from the game strategy combination (AS_i, DS_j). (6) τ is the exploration factor of both sides of the game, indicating the degree of their exploration of game information. The larger τ is, the greater the degree of exploration: the attack and defense sides explore more unknown game information and make better decisions. The smaller τ is, the smaller the degree of exploration: the attack and defense sides mainly make the best decision based on the currently known game information.
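A minimal container for Definition 1's 6-tuple might look as follows in Python; the two-strategy defaults are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class QRDGame:
    """Sketch of the 6-tuple (N, K, S, theta, Q, tau) from Definition 1."""
    players: tuple = ("attacker", "defender")                 # N
    stage: int = 1                                            # K
    AS: tuple = ("AS1", "AS2")                                # S: attack strategies
    DS: tuple = ("DS1", "DS2")                                # S: defense strategies
    P_A: np.ndarray = field(default_factory=lambda: np.array([0.5, 0.5]))  # theta
    P_D: np.ndarray = field(default_factory=lambda: np.array([0.5, 0.5]))  # theta
    Q_A: np.ndarray = field(default_factory=lambda: np.zeros((2, 2)))      # Q: a_ij
    Q_D: np.ndarray = field(default_factory=lambda: np.zeros((2, 2)))      # Q: d_ij
    tau: float = 1.0                                          # exploration factor

game = QRDGame(tau=3.0)
print(game.P_A, game.tau)
```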

Game Model Based on RLM-QRD
To solve the problem of the invalidation of specific defense strategies in network attack and defense scenarios, we put forward RLM with incentive and punishment mechanisms. RLM uses the parameter α to calculate the attack and defense reward values in the next stage.

Definition of Game Model.
By combining RLM with the evolutionary game based on QRD [31], this paper designs an evolutionary game model based on RLM-QRD; that is, RG = (N, K, S, θ, Q, τ, α). N, K, S, θ, Q, and τ have been defined above, and the definition of α is given below.
Definition 2. α is the incentive and punishment factor of the reward value, which means that the reward value of the corresponding strategy combination is stimulated or punished when RLM is triggered. The value of α affects the probability of multistage strategy selection.
In the first stage of the game, the incentive and punishment factor α of the reward value is given by equation (4); in stage K of the game, α is given by equation (5). The parameters in formulas (4) and (5) are defined as follows: (1) RV is the reward variable, which represents the largest change of the reward value in a single stage. Its value is determined by the influence of the other player's strategy on oneself; in general, RV is equal to the minimum reward value among the defender's strategy combinations. (2) SN is the maximum number of learning stages of RLM, which denotes the maximum number of previous game stages the defender can learn from. Therefore, the maximum value of SN is the maximum number of game stages T. (3) AN is the number of occurrences of a specific attack strategy in the past SN stages: if the attacker implemented strategy AS_i in the previous stage, then AN is equal to the number of occurrences of AS_i in the previous SN stages, as given by equation (6).

Framework of Game Model.
The proposed framework of the multistage evolutionary game model based on RLM-QRD is shown in Figure 1. As seen in Figure 1, our model mainly includes the evolutionary game based on QRD and RLM. The evolutionary game model based on QRD contains payoff quantification and QRD. Payoff quantification uses the information of the initial stage of the game to calculate the revenue of the attack and defense strategies in the initial stage. Using the strategy revenue at the current stage, QRD computes the optimal defense strategy.
RLM is responsible for connecting all stages of the game. According to the known game information, it automatically rewards or penalizes the attack and defense revenue of the next stage. Based on the above methods, the model can give a better defense strategy in the next stage.

Payoff Quantification of Attack and Defense Strategy.
Network attack and defense strategies and their cost-reward analysis are the basis of achieving optimal network security defense, so reasonable payoff quantification directly affects the selection of the defense strategy and thus the defense effect. Here, we give some related definitions.
The attack payoff matrix AM comprises the attack revenue values a_ij generated by the attacker under the attack-defense strategy combination (AS_i, DS_j). According to the definition of the game model, the formula is as follows:

a_{ij} = AR - AC,     (7)

where AR and AC represent attack revenue and attack cost, respectively. The attack payoff matrix in stage K is as follows:

AM^K = (a_{ij}^K)_{n \times m}.     (8)

The defense payoff matrix DM comprises the defense revenue values d_ij generated by the defender under the attack-defense strategy combination (AS_i, DS_j). According to the definition of the game model, the formula is as follows:

d_{ij} = DR - DC,     (9)

where DR and DC represent defense revenue and defense cost, respectively. The defense payoff matrix in stage K is as follows:

DM^K = (d_{ij}^K)_{n \times m}.     (10)
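For concreteness, a sketch of the payoff quantification in equations (7)-(10) follows; the reward and cost tables (AR, AC, DR, DC) are hypothetical values, not those of Tables 4 and 5.

```python
import numpy as np

# Payoff quantification sketch: a_ij = AR - AC and d_ij = DR - DC,
# with hypothetical per-combination reward and cost tables.
AR = np.array([[40.0, 10.0], [20.0, 50.0]])  # attack reward for (AS_i, DS_j)
AC = np.array([[15.0, 15.0], [25.0, 25.0]])  # attack cost for (AS_i, DS_j)
DR = np.array([[30.0, 45.0], [50.0, 20.0]])  # defense reward for (AS_i, DS_j)
DC = np.array([[10.0, 20.0], [10.0, 20.0]])  # defense cost for (AS_i, DS_j)

AM = AR - AC   # attack payoff matrix, entries a_ij (equation (7))
DM = DR - DC   # defense payoff matrix, entries d_ij (equation (9))
print(AM)
print(DM)
```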

Q-Learning Replication Dynamic Equation.
According to the Nash equilibrium theorem [43], in a game with finitely many players and finite strategy sets, a mixed-strategy Nash equilibrium must exist; the strategy profile (x*, y*) is called a mixed-strategy Nash equilibrium. When the game reaches the Nash equilibrium, no player benefits from changing his strategy unilaterally. In this case, if the attacker chooses strategy x* and the defender chooses strategy y*, the attack and defense revenues are Q_A(x*, y*) and Q_D(x*, y*), respectively, satisfying the following conditions:

Q_A(x*, y*) ≥ Q_A(x, y*) for all x,   Q_D(x*, y*) ≥ Q_D(x*, y) for all y.     (11)

The expected revenue Q_{AS_i} of attack strategy AS_i, the expected revenue Q_{DS_j} of defense strategy DS_j, the average revenue \bar{Q}_{AS} of the attack strategy set, and the average revenue \bar{Q}_{DS} of the defense strategy set are calculated as

Q_{AS_i} = \sum_{j=1}^{m} y_j a_{ij},   Q_{DS_j} = \sum_{i=1}^{n} x_i d_{ij},   \bar{Q}_{AS} = \sum_{i=1}^{n} x_i Q_{AS_i},   \bar{Q}_{DS} = \sum_{j=1}^{m} y_j Q_{DS_j}.     (12)

The Boltzmann probability distribution is used to represent the attack and defense strategy selection, and the Q-learning algorithm is introduced into the replication dynamic equation to obtain the QRD equation. The strategy selection probabilities for QRD are given as follows:

x_i(k) = e^{\tau Q_{AS_i}(k)} / \sum_{l=1}^{n} e^{\tau Q_{AS_l}(k)},     (13)
y_j(k) = e^{\tau Q_{DS_j}(k)} / \sum_{l=1}^{m} e^{\tau Q_{DS_l}(k)}.     (14)

Here, x_i(k) and y_j(k) obey the Boltzmann probability distribution. x_i(k) denotes the probability that the attacker selects attack strategy AS_i in the k-th attack and defense confrontation of the same game stage, and y_j(k) denotes the probability that the defender selects defense strategy DS_j in the k-th confrontation of the same stage. Q_{AS_i}(k) and Q_{DS_j}(k) denote the expected revenues obtained by the attacker choosing AS_i and the defender choosing DS_j, respectively, in the k-th confrontation of the same stage. The Q-learning replication dynamic equations (15) and (16) are derived from formulas (2), (3), (13), and (14):

x_i'(t) = x_i(t) [ \tau (Q_{AS_i} - \bar{Q}_{AS}) + \sum_{l=1}^{n} x_l(t) \ln(x_l(t)/x_i(t)) ],     (15)
y_j'(t) = y_j(t) [ \tau (Q_{DS_j} - \bar{Q}_{DS}) + \sum_{l=1}^{m} y_l(t) \ln(y_l(t)/y_j(t)) ].     (16)

QRD consists of the replication dynamic equation (RD) and the mutation equation (ME). RD selects the most profitable strategy under current information. ME tries different new strategies in unknown network attack and defense scenarios, constantly performing trial and error and learning to adjust the strategies, which better reflects the diversity and uncertainty of network attack and defense.
From the definition of evolutionary equilibrium, when the strategies of the players reach evolutionary equilibrium, x'(t) = 0 and y'(t) = 0. The solution (x*, y*) of these equations is an evolutionarily stable equilibrium point. At this point, the value of τ in the above equations needs to be large enough to make the selection probability of each strategy stable.
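The sketch below numerically iterates the QRD equations (15) and (16) as reconstructed above (placing τ on the RD term is our reading of the source). The payoff matrices and τ are hypothetical, and the probabilities are clipped to [0.01, 0.99], matching the assumption made in Section 6 so that the logarithm in the ME term stays defined.

```python
import numpy as np

A = np.array([[25.0, -5.0], [-5.0, 25.0]])   # hypothetical attacker payoffs a_ij
D = np.array([[20.0, 30.0], [35.0, 15.0]])   # hypothetical defender payoffs d_ij

def qrd_step(x, q, tau, dt=1e-3):
    """Euler step of the QRD sketch:
    x_i' = x_i [ tau*(Q_i - Qbar) + sum_l x_l*ln(x_l/x_i) ]."""
    rd = tau * (q - x @ q)                    # replication dynamic (RD) term
    me = (x * np.log(x)).sum() - np.log(x)    # mutation (ME): sum_l x_l ln(x_l/x_i)
    x = x + dt * x * (rd + me)
    x = np.clip(x, 0.01, 0.99)                # keep the logarithm well-defined
    return x / x.sum()

x = np.array([0.5, 0.5])                      # attacker mixed strategy
y = np.array([0.5, 0.5])                      # defender mixed strategy
for _ in range(20000):
    x = qrd_step(x, A @ y, tau=3.0)
    y = qrd_step(y, D.T @ x, tau=3.0)
print("x* =", x, "y* =", y)                   # approximate stable point
```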

Reward Value Learning Mechanism.
As shown in Algorithm 1, RLM calculates the incentive and punishment factor α according to the reward variable RV and the proportion AN/SN of a given attack strategy in the past SN stages. According to α and the defense result R of the last stage, RLM changes the reward value of the corresponding attack and defense strategy combination to adjust the attack and defense reward values of the next stage.
If the defense successfully resisted an attack in the last stage (R = 1), RLM increases the defense reward value d_{ij}^K of the specific strategy combination and decreases its attack reward value a_{ij}^K. If the defense failed in the last stage (R = 0), RLM decreases the defense reward value d_{ij}^K of the specific strategy combination and increases its attack reward value a_{ij}^K.
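A Python sketch of Algorithm 1's update logic is given below. Since equations (4)-(6) are not reproduced above, the formula alpha = RV * AN / SN is an assumed stand-in that merely scales the reward variable by the observed frequency of the attack strategy; only the incentive/punishment branching follows the text directly.

```python
import numpy as np

def rlm_update(AM, DM, history, i, j, R, RV=10.0, SN=500):
    """Sketch of RLM (Algorithm 1).
    history: attack-strategy indices observed in previous stages.
    (i, j): strategy combination (AS_i, DS_j) played in the last stage.
    R: defense result of the last stage (1 = success, 0 = failure)."""
    recent = history[-SN:]
    AN = recent.count(i)                      # occurrences of AS_i in past stages
    alpha = RV * AN / max(len(recent), 1)     # ASSUMED stand-in for eqs. (4)-(5)
    if R == 1:                                # defense succeeded: reward defender,
        DM[i, j] += alpha                     # punish the attacker
        AM[i, j] -= alpha
    else:                                     # defense failed: the reverse
        DM[i, j] -= alpha
        AM[i, j] += alpha
    return AM, DM

# Hypothetical usage: attack strategy AS_1 (index 0) beat defense DS_2 (index 1).
AM = np.array([[20.0, 35.0], [30.0, 15.0]])
DM = np.array([[25.0, 5.0], [10.0, 30.0]])
AM, DM = rlm_update(AM, DM, history=[0, 0, 1, 0], i=0, j=1, R=0)
print(AM[0, 1], DM[0, 1])   # attack reward raised, defense reward lowered
```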

Optimal Defense Strategy Selection Algorithm
In this paper, the Nash equilibrium solution of the multistage evolutionary game is regarded as a set of equilibrium solutions, one per stage. Each stage learns from the known game information through the reward value learning mechanism to change the reward value of the defense strategy in the current stage. From the optimal defense strategy of each stage, the multistage optimal defense strategy set is constructed. The multistage network attack and defense game tree based on our game model is shown in Figure 2. The black dots in Figure 2 represent the attacker at stage K, who selects according to the probability distribution P_A = (x_1, x_2, ..., x_n) and executes an attack strategy from AS = (AS_1, AS_2, ..., AS_n). The blue dots denote the defender at stage K, who selects according to the probability distribution P_D = (y_1, y_2, ..., y_m) and executes a defense strategy from DS = (DS_1, DS_2, ..., DS_m).
As shown in Figure 2, the process of the multistage network attack and defense game is as follows: (1) In the initial stage of the game, the corresponding model parameters and reward values are calculated based on the current information of the players, and each side's optimal decision in the current stage is reached by solving the Nash equilibrium. The game result R of the current stage, the strategy combination (AS_i^K, DS_j^K), and the attack and defense revenues (AM^K, DM^K) are preserved for use in calculating the attack and defense revenues of the next stage.
(2) When the game enters stage K (K > 1), as K grows and the attacker and defender gradually gain more information, the players tend toward complete rationality, and the revenues of the various strategy combinations are highly likely to change. At this time, the incentive and punishment factor α of this stage is calculated according to Algorithm 1, and the attack and defense revenues (AM^K, DM^K) of this stage are obtained from α and the information saved in the previous stage. The optimal defense strategy of the current stage is then solved via the Nash equilibrium, and the defense result R, the attack and defense strategy combination (AS_i^K, DS_j^K), and the attack and defense revenues (AM^K, DM^K) of this stage are kept.
Next, we propose an optimal defense strategy selection algorithm based on the evolutionary game model based on RLM-QRD.
As shown in Algorithm 2, K is the number of stages of the game, and n and m represent the numbers of strategies of the attacker and the defender, respectively; in general, K > 0, n > 0, and m > 0. The storage cost of the algorithm is dominated by the storage of the payoff matrices, which contain nm memory cells in total; therefore, the spatial complexity is O(nm). Table 1 shows a comprehensive comparison between our model and other models in the literature. The following are some discussions:

(1) Payoff quantification: references [31,36] introduce no extra parameters for payoff quantification. References [22,23,40] introduce specific parameters into the game model and set those parameters to quantify the attack and defense payoff in their respective application scenarios. We also introduce parameters, but calculate them from past game information, to better quantify the attack and defense payoff under the defense strategy failure scenario.
(2) Equilibrium solution: the equilibrium solution denotes the method used to solve the Nash equilibrium in the game model. References [23,36] use RD, and reference [22] uses the Fermi function. All of these methods have had some success in their applications. References [31,40] improve on RD by proposing QRD and LQRD, respectively, and have been successful in optimal defense strategy selection. For the defense strategy failure scenario, we propose RLM-QRD, which aims to realize automatic calculation of revenue, maximize defense revenue, and select the optimal defense strategy.
(3) Game type and algorithm complexity: reference [31] is a static game, which has the advantage of low algorithm complexity. Although the complexity of a dynamic game algorithm is higher, the dynamic game is more suitable for network attack and defense.
Based on the above discussion, our model is more suitable for defense strategy failure scenarios in dynamic network attack and defense.

Experiment and Analysis
In this section, we verify the effectiveness of our model in the strategy failure scenario. Firstly, we give the attack strategy set and the defense strategy set and introduce the strategy failure scenarios. Secondly, we calculate the exploration factor τ. Thirdly, we analyze the defense strategy selection probability of our model under the single strategy failure scenario and the multistrategy failure scenario. Fourthly, we compare the revenue of our model with that of the model without RLM. Finally, we compare the defense success rate of our model with that of the model without RLM.

Experimental Setup.
In our experiment, the MIT attack and defense behavior database [44] and the China National Vulnerability Database of Information Security (CNNVD) [45] are used to derive the atomic attack and defense strategies, as shown in Tables 2 and 3. The attacker exploits vulnerabilities in the network information system to choose some atomic attack strategies, and the defender selects several atomic defense strategies to defend against network attacks [46]. The attack and defense strategies in this experiment are composed of several atomic strategies. For both sides of network attack and defense, we set attack strategies AS_1 = {a_1, a_2, a_5} and AS_2 = {a_3, a_4} and defense strategies DS_1 and DS_2, each composed of several atomic defense strategies.

Algorithm 2: optimal defense strategy selection.
Input: evolutionary game model based on RLM-QRD.
Output: probability set of the optimal defense strategy in the K-th stage, P_D^K.
(1) Initialize RG = (N, K, S, θ, Q, τ, α)
(2) for i ← 1 to n do
(3)  for j ← 1 to m do
(4)   Calculate AM^K, DM^K from equations (8) and (10)
(5)  end for
(6) end for
(7) for K ← 1 to T do
(8)  for j ← 1 to m do
(9)   Construct y'(t) from equation (16)
(10) end for
(11) for i ← 1 to n do
(12)  Construct x'(t) from equation (15)
(13) end for
…

According to [31,34] and the definitions of attack and defense reward and cost in Section 4.3, the attack and defense revenue matrices of the first stage are given in Tables 4 and 5. The larger the number in the table, the greater the revenue from attack or defense. We set the probabilities that attack strategies AS = (AS_1, AS_2) and defense strategies DS = (DS_1, DS_2) are chosen to P_A = (x, 1 - x) and P_D = (y, 1 - y), respectively, and set the reward variable RV = 10. Since there is a logarithm in the Q-learning replication dynamic equation, we assume that the probability of strategy selection ranges over [0.01, 0.99]. To better show the results of the experiment, the maximum number of learning stages SN = 500 and the maximum number of stages T = 1000 are set for RLM in this experiment.
This experiment verifies the validity of the evolutionary game model based on RLM-QRD in the specific defense strategy failure scenario, from the perspectives of single defense strategy failure and multiple defense strategy failure.
This experiment assumes that the attacker has learned a specific defense strategy at a certain stage and proposes and implements a new attack strategy and its selection probability, which results in the invalidation of the defense strategy. As shown in Table 6, status I and status II are single defense strategy failure scenarios, and status III is a multiple defense strategy failure scenario.

The Calculation of the Exploration Factor.
It can be seen from equations (15) and (16) that when the exploration factor τ is small, both sides of the game have not fully grasped each other's relevant information within a single stage.
The ME in the Q-learning replication dynamic equation then has a larger impact, and the selection probabilities of the attack and defense strategies (AS_1, AS_2, DS_1, DS_2) are unstable. Figure 3 shows the influence of the exploration factor τ on the evolution of the attack and defense strategies. As shown in Figure 3, with the acquisition and analysis of each other's information within a single stage, the exploration factor τ gradually increases. The effect of the mutation equation decreases, the replication dynamic equation gradually begins to play a larger role, and the probability of attack and defense strategy selection gradually stabilizes; QRD gradually degenerates into the replication dynamic equation. The simulation results show that the selection probabilities of the attack and defense strategies remain constant when τ ≥ 3.
To sum up, the larger the exploration factor τ is, the more information the offensive and defensive sides can finally obtain in this game stage.

[Tables 2 and 3: the atomic attack strategies include remote attack, obtaining user privileges (a_3), buffer overflow attack (a_4), and homepage attack (a_5); the atomic defense strategies include modifying account passwords, restarting the database server (d_3), limiting packets from ports (d_4), reinstalling the listener program (d_5), and correcting the homepage (d_6).]

Usually, both sides of the game will change their strategies only when they know enough game information, and the game then enters a new stage. Therefore, this experiment sets the exploration factor τ = 100.
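Putting the pieces together, a sketch of the multistage experiment loop with the stated settings (τ = 100, RV = 10, SN = 500, T = 1000) might look as follows. The payoff matrices, the number of intra-stage QRD iterations, and the defense-success criterion are all hypothetical, and qrd_step and rlm_update refer to the sketches given earlier.

```python
import numpy as np
# qrd_step and rlm_update are the sketch functions defined in earlier sections.

TAU, RV, SN, T = 100.0, 10.0, 500, 1000      # settings from the experimental setup
AM = np.array([[25.0, -5.0], [-5.0, 25.0]])  # hypothetical stage-1 attack payoffs
DM = np.array([[20.0, 30.0], [35.0, 15.0]])  # hypothetical stage-1 defense payoffs
x, y = np.array([0.5, 0.5]), np.array([0.5, 0.5])
history, rng = [], np.random.default_rng(0)

for K in range(T):
    for _ in range(500):                     # intra-stage QRD confrontations
        x = qrd_step(x, AM @ y, TAU)
        y = qrd_step(y, DM.T @ x, TAU)
    i = int(rng.choice(2, p=x))              # realized attack strategy AS_i
    j = int(rng.choice(2, p=y))              # realized defense strategy DS_j
    R = int(DM[i, j] >= AM[i, j])            # toy stand-in for the defense result
    history.append(i)
    AM, DM = rlm_update(AM, DM, history, i, j, R, RV, SN)

print("final defense mixed strategy:", y)
```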

Probability of Defense Strategy Selection Experiment.
To verify that the model can reduce the selection probability of the failed strategy and increase the selection probability of the effective strategy in the single strategy failure and multistrategy failure scenarios, this section studies the selection probability of the optimal defense strategy under statuses I, II, and III; the results are shown in Figures 4 and 5. Each point represents a stage. The red and orange points represent defense failure, meaning that the defender chose the failed strategy. The green and blue points represent defense success, meaning that the defender chose the effective strategy. The x-axis and y-axis are the game stage K and the selection probability of defense strategy DS_1, respectively. Figure 4 shows the selection probability of defense strategy DS_1 over the first K = 30 stages for statuses I and II. As shown in Figure 4, the selection probability of defense strategy DS_1 converges to around 0.01 after only about K = 10 stages in status I because defense strategy DS_1 has failed; the model implements defense strategy DS_2 with high probability to resist the new attack strategy. The selection probability of defense strategy DS_1 converges after stage K = 15 in status II. Compared with status I, the selection probability in status II converges more slowly because any strategy of the attacker can make defense strategy DS_1 invalid in status II.

Single Strategy Failure Scene.
As can be seen from Figure 4, the probability of the failed defense strategy DS_1 shows a short upward trend before convergence, and defense failures occur in the early stages of statuses I and II, because RLM is still learning the attack strategies and their benefits at the beginning of the game. Owing to the change of attack strategy, the strategy given by the defender is not optimal at this time. After several short learning stages, RLM converges the probability of the failed strategy to 0.01. The defender thus obtains the optimal defense strategy through several stages of learning, and the defense succeeds.

Multistrategies Failure Scene.
In status III, the attacker can choose any attack strategy to make any defense strategy invalid. In this scenario, the failed defense strategy may be DS_1 in stage K and may become DS_2 in stage K + 1, so the game becomes more complex. Figure 5 shows the defense strategy selection probability in status III. As shown in Figure 5, the change of the defense strategy selection probability is unstable, which is caused by the high complexity of the network state in this scenario. Here, the defender tends to give a higher selection probability to defense strategy DS_1.
In status III, the failed strategies of each stage are not necessarily the same, so the model cannot be judged from the defense strategy selection probability experiment alone. However, the model can be verified in status III by the attack and defense revenue experiment and the defense success rate experiment.

Attack and Defense Revenue Experiment.

Single Strategy Failure Scene.
It can be seen from Figures 6(a) and 6(b) that the difference between the attack and defense revenues of the two models is not obvious at the beginning of the game, which indicates that RLM has not yet fully learned the attacker's information. At this time, the attack and defense revenues of the two models are unstable; because of the change of attack strategy, the attacker's revenue may even increase while the defender's revenue decreases.
After several stages, the defense revenues of the two models are positively correlated with the game stage, and the attack revenues are negatively correlated with it. Compared with the game model without RLM, our model in Figure 6(a) has higher defense revenue and lower attack revenue after stage K = 10, and in Figure 6(b) it has lower attack revenue after stage K = 15 and higher defense revenue after stage K = 20. Combined with the previous analysis, we can conclude that the revenue values in the game model with RLM change sharply around the convergence of the defense strategy probability. This also means that after the short learning stages of RLM, the defense revenue begins to rise and the attack revenue begins to decline. Compared with Figure 6(a), the change of attack and defense revenue in Figure 6(b) is relatively slow because any attack strategy may make defense strategy DS_1 invalid in status II. To sum up, this model can more effectively resist attackers in the scenarios of statuses I and II. Figure 7 shows the comparison of attack and defense revenue between our game model and the game model without RLM in status III. From the previous description, we can conclude that this scenario is highly complex, and the attacker can adjust its strategy at each stage to invalidate a specific defense strategy. Therefore, with the continuous change of attack strategy, the attack and defense revenues fluctuate. Next, from the perspective of attack and defense revenues, we verify that our model has better defense ability than the model without RLM in status III.

Multistrategies Failure Scene.
As can be seen from Figure 7, the defense revenue of our model and the attack revenue of the game model without RLM rise with fluctuations, while the attack revenue of our model and the defense revenue of the game model without RLM decline with fluctuations. From stage K = 400, the defense revenue of our model is always higher than that of the game model without RLM, and from stage K = 300, the attack revenue of our model is always lower than that of the game model without RLM.
To sum up, after learning over multiple stages with RLM, the model can make the defense revenue greater and the attack revenue lower, to better resist network attacks.

Defense Success Rate Experiment.
Figure 8 shows the comparison of the defense success rate between our model and the game model without RLM over K = 1000 stages. The green column denotes the number of successful defense stages when the models choose defense strategy DS_1, and the blue column the number when they choose DS_2. The red column denotes the number of failed defense stages when the models choose defense strategy DS_1, and the yellow column the number when they choose DS_2.
It can be seen from Figure 8 that our model has a higher defense success rate than the game model without RLM under statuses I, II, and III. Compared with the game model without RLM, the defense success rate of our model is about 22.5% higher under status I, about 23.5% higher under status II, and about 7.4% higher under status III. In conclusion, compared with the evolutionary game model without RLM, our model better selects the best defense strategy in the above three scenarios.

Conclusion
Considering the existing problems in the application of the evolutionary game model to network attack and defense, this paper proposes a reward value learning mechanism. This mechanism overcomes the problem of quantifying incentives and punishments under the bounded rationality of attackers and defenders, which reduces manual involvement. An evolutionary game model with a multistage learning mechanism is constructed by combining the learning mechanism with a multistage game model. Furthermore, the optimal strategy selection algorithm of the game model is designed.
Our future work will study how to dynamically add new feasible defense strategies and reasonably expand the model when any defense strategy fails. In addition, we will consider how to apply more intelligent methods, such as deep learning and machine learning, to the automatic calculation of the reward and punishment factor α at every stage so that the model can better select the optimal defense strategy.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.