Cognitive Electronic Jamming Decision-Making Method Based on Improved Q-Learning Algorithm



Introduction
Cognitive electronic jamming decision-making is a critical link in cognitive electronic warfare [1]. Its task is mainly divided into three steps. First, the jammer recognizes the target's working state based on the reconnoitered target signal. Then, the effect of the current jamming action is evaluated, and the best correspondence between the various states of the countered target and the available jamming actions is established. Finally, the optimal jamming strategy is generated intelligently for the different states of the target and used to guide subsequent jamming resource scheduling. In the increasingly complex and changeable electromagnetic environment, jamming methods and antijamming technologies emerge one after another. At present, it is difficult to establish a one-to-one correspondence between a specific radar working state and a particular jamming action, so selecting an accurate jamming method is essential to preventing the radar system from giving full play to its power. At the same time, with the rapid development of new weapons and equipment and the emergence of many new-system and multifunctional radars, the existing jamming decision-making methods cannot effectively deal with the battlefield environment. Therefore, research on jamming decision-making methods is urgent.
At present, the traditional research methods of jamming decision-making mainly include the following: methods based on game theory, methods based on decision support systems, and methods based on swarm intelligence optimization algorithms. David et al. [2] proposed a framework that utilizes game-theoretic principles to provide an autonomous determination of the appropriate electronic attack action to be taken for a given scenario. Gao et al. [3] established a profit matrix based on the principles of minimizing loss and maximizing jamming benefits and used the Nash equilibrium strategy to solve the optimal jamming strategy. However, this method relies on the profit matrix's establishment and is only suitable for radar systems with constant parameter characteristics. Sun et al. [4] proposed a method of electronic jamming mode selection based on D-S theory.
Li and Wu [5] proposed a design method for an intelligent decision support system (IDSS) based on a knowledge base and a problem-solving unit. The method's applicability is more extensive, but it is too dependent on the posterior probability and lacks real-time performance. Ye et al. [6,7] proposed a cognitive collaborative jamming decision-making method based on the bee colony algorithm, which finds the globally optimal solution through the process of a bee colony searching for high-quality resources. Similarly, there are swarm intelligence algorithms such as the genetic algorithm [8,9] and ant colony algorithm [10], and heuristic algorithms such as the differential evolution algorithm [11-13] and the water wave optimization algorithm [14]. However, these algorithms cannot consider all possible jamming factors when solving the jamming decision model, and their decision accuracy needs to be improved. In summary, how to improve the autonomy, timeliness, and accuracy of jamming decision-making in cognitive electronic warfare remains to be studied.
The above traditional jamming decision-making methods rely on sufficient prior knowledge for "matching" and are mainly suitable for radars with constant characteristic parameters. They lack real-time performance and cannot deal with the increasingly complex battlefield environment. Reinforcement learning [15,16] is a machine learning method specifically used to solve behavioral decision-making problems. Through reinforcement learning, the jammer establishes the connection between the jamming resources and the target state, continuously optimizes the jamming strategy, and realizes cognitive electronic jamming decision-making. The Q-learning algorithm is a typical temporal-difference reinforcement learning algorithm based on model-free learning. It allows the system to learn independently and make correct decisions in real time without requiring an environment model or sufficient prior information. Thus, it is fully applicable to complex and changeable radar systems. Therefore, compared with traditional jamming decision-making methods, the jamming decision-making method based on the Q-learning algorithm can realize learning while fighting, which will be the future development trend and main research direction. Currently, the Q-learning algorithm is widely used in robot path planning [17,18], nonlinear control [19,20], resource allocation scheduling [21,22], and other fields, and it has also achieved specific results in electronic jamming decision-making. Aiming at unknown radar working modes, Xing et al. [23,24] applied the Q-learning algorithm to radar countermeasures to realize intelligent jamming decision-making. Li et al. [25] proposed using Q-learning to train against the behavior of radar systems, which can effectively complete the jamming and adapt to different combat missions. However, the Q-learning algorithm still has two problems in practical application: (1) the exploration strategy is fixed and cannot adapt during learning.
In traditional Q-learning, the exploration probability is a single constant value. When the exploration value is large, the algorithm explores fully in the early stage, but the result tends to oscillate near the optimal solution in the later stage, and it is difficult to converge. When the exploration value is small, exploration is scarce in the early stage, and it is easy to converge to a local optimum in the later stage. Therefore, a fixed exploration strategy cannot balance the sufficiency of exploration with the stability of convergence. (2) There is no uniform standard for selecting the learning rate. The learning rate of traditional Q-learning is usually fixed according to experience. When the learning rate is large, there is a risk of overlearning in the early stage of the algorithm. When the learning rate is small, the convergence of the algorithm becomes slow in the later stage. Therefore, to improve the accuracy and efficiency of the Q-learning algorithm applied to radar jamming decision-making, it is still necessary to improve the Q-learning algorithm itself.
Considering the problems of the traditional Q-learning algorithm, this paper proposes a cognitive electronic jamming decision-making method based on improved Q-learning. The improvements include the following: (1) the Metropolis criterion of the SA algorithm is introduced to improve the action selection strategy and balance the exploration and utilization of the algorithm; (2) the learning rate decay strategy of SGDR is used to speed up convergence and keep the algorithm from oscillating into a local optimum; (3) the Q-value convergence rule is used as the primary termination condition of the algorithm, with the iteration-number rule as the secondary condition, so that when the Q value cannot converge, the algorithm is forced to stop the learning process and output a suboptimal jamming strategy, which saves jamming resources. This paper takes the multifunctional radar provided in [26] as the research object, constructs a cognitive electronic jamming decision model based on improved Q-learning, and compares the simulation results with traditional methods. The results show that the improved method can independently learn the optimal jamming strategy by analyzing the jamming effect as the radar working state changes, improve learning efficiency, and give full play to the adaptability and timeliness of the cognitive electronic countermeasure system. The rest of this paper is arranged as follows. A detailed introduction to the Q-learning algorithm and a description of the improvement methods are presented in Section 2. The cognitive electronic jamming decision-making model and the improved Q-learning algorithm are put forward in Section 3. The simulation experiment and result analysis are given in Section 4. Finally, some conclusions drawn from this study are discussed in Section 5.
2. Improved Q-Learning Algorithm
2.1. Q-Learning Algorithm. The principle of the Q-learning algorithm is shown in Figure 1. Its main idea is based on Markov decision processes. By perceiving the current environment state, the agent determines the action to take according to a specific strategy and obtains the immediate reward and the next state from the environment's feedback. Essentially, it learns a mapping from environment states to actions that maximizes the overall reward value.
The algorithm can be summarized as the following steps:
Step 1. Define the state set S = {s_1, s_2, …, s_n} and the action set A = {a_1, a_2, …, a_n}, initialize the "state-action" function Q(s, a) and the reward matrix R, and set parameters such as the maximum number of iterations K.
Step 2. Randomly select an initial state from S. When the state is the target state, the iteration ends, and the initial state is reselected. Otherwise, continue to execute step 3.
Step 3. According to the ε-greedy strategy, select an action among all possible actions in the current state and reach the next state.
Step 4. Equation (1) is used to update the matrix Q:

Q(s_t, a_t) ← Q(s_t, a_t) + α[R(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t)], (1)
where s_t is the state of the environment at moment t, a_t is the action taken by the agent at moment t, Q(s_t, a_t) is the "state-action" function at moment t, s_{t+1} is the state of the environment at moment t + 1, R(s_t, a_t) is the immediate reward fed back by the environment from t to t + 1, a′ is the action that maximizes the Q value when the agent arrives at s_{t+1}, γ ∈ [0, 1] is the discount factor, and α ∈ (0, 1) is the learning rate.
Step 5. Set the next moment's state to the current state, that is, s_t = s_{t+1}. If s_t is not the target state, return to step 3.
Step 6. When the maximum number of iterations is reached, the training is completed, the convergent matrix Q is obtained, and the optimal action strategy is output according to equation (2):

π(s) = arg max_{a∈A} Q(s, a). (2)

Otherwise, return to step 2 to enter the next iteration.
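As a concrete illustration of steps 1-6, the following is a minimal tabular Q-learning sketch. The toy environment (where jamming action a_j simply drives the radar to state j) and all parameter values are illustrative assumptions, not the paper's radar model:

```python
import random

random.seed(0)  # deterministic run for reproducibility

# Minimal tabular Q-learning sketch of steps 1-6; the environment and
# parameter values are illustrative assumptions.
N_STATES, N_ACTIONS = 6, 6
GAMMA, ALPHA, EPSILON = 0.8, 0.8, 0.8
TARGET = 5  # index of the lowest-threat (target) state

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Hypothetical environment: action j moves the radar to state j."""
    next_state = action
    reward = 100.0 if next_state == TARGET else -1.0
    return next_state, reward

for episode in range(500):                        # Steps 1-2: iterate episodes
    s = random.randrange(N_STATES - 1)            # random non-target start
    while s != TARGET:
        if random.random() < EPSILON:             # Step 3: epsilon-greedy choice
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
        s_next, r = step(s, a)
        # Step 4: temporal-difference update of equation (1)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next                                # Step 5

# Step 6: greedy policy read off the learned Q matrix, equation (2)
policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

In this toy setup the learned policy selects the action leading straight to the target state from every non-target state.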
2.2. Improvement of Exploration Strategy Based on the SA Algorithm. When selecting actions, the traditional Q-learning algorithm follows the ε-greedy strategy, which randomly explores an action with probability ε and utilizes existing information to select the optimal action with probability 1 − ε. The larger the ε, the stronger the exploration ability. However, as the agent continuously interacts with the environment, if it still uses a large ε to explore after acquiring empirical knowledge, the algorithm's results will likely oscillate near the optimal solution. The smaller the ε, the stronger the utilization ability, but due to the lack of early exploration, it is easy to converge to a local optimum later. Therefore, a fixed exploration strategy cannot balance exploration and utilization, making the algorithm difficult to converge and easy to trap in a local minimum. Aiming at these problems, earlier heuristic algorithms such as particle swarm optimization, ant colony optimization, and the artificial bee colony algorithm can be used to resolve the contradiction between exploration and utilization. The SA algorithm is an optimization algorithm based on neighborhood search that borrows the idea of the physical annealing process: it maintains a continuously decaying acceptance probability during the search to jump out of locally optimal values and converge to the global optimum. Compared with other heuristic algorithms, the SA algorithm accepts a solution worse than the current one with a certain probability criterion and a better solution with probability 1. It has a simple principle, high iterative search efficiency, strong robustness, asymptotic convergence, and strong global search ability. Therefore, it has been widely used in various fields to solve combinatorial optimization problems, and the algorithm is relatively mature.
In this paper, the Metropolis criterion [27] in the SA algorithm is introduced to improve the exploration strategy in the Q-learning algorithm. The exploration probability is adjusted by the cooling strategy so that ε stably maintains a large value in early iterations for full exploration and quickly falls to a small value in later iterations to speed up convergence while still obtaining the globally optimal solution. The probability of taking the randomly selected action is expressed as follows:

P(a_r) = 1, if Q(s, a_r) ≥ Q(s, a_p); P(a_r) = exp(−(Q(s, a_p) − Q(s, a_r))/T_k), otherwise, (3)

where a_r is the randomly selected action, a_p is the action selected according to the current ε-greedy strategy, T_k is the temperature control parameter, and k is the number of iterations. When Q(s, a_r) ≥ Q(s, a_p), the random action a_r is always explored; otherwise, a_r is explored with probability exp(−(Q(s, a_p) − Q(s, a_r))/T_k).
The temperature cooling strategy in the SA algorithm determines the change of ε. In early iterations, T is large, and the probability of accepting randomly selected actions is high, which is conducive to exploration. In later iterations, the temperature drops and T becomes small, so the probability of taking the optimal action is larger, which is conducive to utilization. Common annealing strategies include the geometric cooling strategy, the logarithmic descent strategy, and the linear descent strategy. This paper uses the most common geometric cooling strategy to keep a higher temperature at the beginning of the iteration and cool down quickly at the end. The specific description is expressed as follows:

T_k = T_0 · λ^k, (4)

where T_0 is the initial temperature, k is the number of iterations, and λ ∈ (0, 1) is the cooling parameter, generally taken as a constant close to 1.
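The Metropolis-based selection and geometric cooling described above can be sketched as follows. A random candidate action is always accepted when no worse than the strategy-selected action, and otherwise accepted with probability exp(−ΔQ/T_k); the temperature constants and helper names are illustrative assumptions:

```python
import math
import random

# Illustrative temperature settings; not values from the paper.
T0, LAMBDA = 100.0, 0.95

def cooled_temperature(k):
    """Geometric cooling strategy: T_k = T_0 * lambda**k."""
    return T0 * LAMBDA ** k

def select_action(q_row, k):
    """Metropolis-criterion action selection for one state's Q row.
    The greedy action stands in for the strategy-selected action a_p."""
    a_p = max(range(len(q_row)), key=lambda a: q_row[a])  # stand-in for a_p
    a_r = random.randrange(len(q_row))                    # random candidate a_r
    if q_row[a_r] >= q_row[a_p]:
        return a_r                                        # no worse: always accept
    t_k = cooled_temperature(k)
    if random.random() < math.exp(-(q_row[a_p] - q_row[a_r]) / t_k):
        return a_r                                        # worse: accept sometimes
    return a_p
```

As k grows, the acceptance probability for worse random actions shrinks toward zero, so the agent shifts smoothly from exploration to utilization.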

2.3. Improvement of Learning Rate Based on SGDR.
The learning rate α determines the learning ability of the agent; the larger α is, the stronger the learning ability. Equation (1) can be rearranged as follows:

Q(s_t, a_t) ← (1 − α)Q(s_t, a_t) + α[R(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)]. (5)

In the traditional Q-learning algorithm, α is a fixed value. As shown in equation (5), the larger the α, the less the previous training is retained and the shorter the system decision time, but there is a risk of overlearning, and it is easy to fall into a local optimum. The smaller the α, the smaller the oscillation, but the slower the convergence. Traditional learning rate adjustment methods include the equal-interval adjustment method, the exponential attenuation method, and the adaptive adjustment method. These methods keep α on a decaying trend throughout the adjustment process, which slows convergence late in the iteration. In 2017, Loshchilov and Hutter [28] proposed SGDR to improve α: a warm-restart mechanism is set in the α-decrementing process, and after every interval period, α is reinitialized to a preset value and then gradually decays. Compared with traditional adjustment methods, SGDR keeps α large in early iterations to speed up convergence and small in later iterations to prevent falling into a local optimum. At the same time, α's reciprocating rise and fall prevents a too-small α from limiting the convergence speed. Therefore, this paper uses the SGDR method to improve the learning rate in the Q-learning algorithm, which considers the convergence speed and stability of the algorithm at the same time.
Assuming that the total number of restarts is M, cosine decay is used to reduce the learning rate before the mth restart. In the kth iteration, the improved learning rate is calculated as follows:

α_k = α_min^m + (1/2)(α_max^m − α_min^m)(1 + cos(βπ/τ_m)), (6)

α_min^m = f(m), (7)

where α_max^m and α_min^m are, respectively, the maximum and minimum values of the learning rate in the mth restart period, which gradually decrease as m increases; τ_m is the restart period, which gradually increases with the number of restarts; τ_0 is the initial restart period; κ is the amplification factor; and β is the number of iterations since the last restart, 0 ≤ β ≤ τ_m. When β = τ_m, set α_k = α_min^m. After restarting, set β = 0 and α_k = α_max^{m+1}.

In this paper, the improved Q-learning algorithm is applied to radar jamming decision-making to realize adaptive cognitive electronic jamming decision-making. The jammer first judges the change of threat level by detecting the working state of the enemy's radar before and after jamming to quantitatively evaluate the jamming effect. The jamming decision system learns the best jamming strategy through the jamming effect. It adjusts Q-learning's exploration strategy and learning rate through the improved algorithms to achieve fast and accurate cognitive jamming. Through the Q-learning process, the connection between the jamming resources and the radar working state can be established, the jamming strategy is continuously optimized, and the learning conclusions can be used to support the construction of the jamming rule base and the dynamic threat base. The cognitive electronic jamming decision model based on improved Q-learning is shown in Figure 2.
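The SGDR warm-restart schedule of equations (6)-(7) can be sketched as follows; the per-restart envelope decay factor and the period growth factor κ used here are illustrative assumptions, not values from the paper:

```python
import math

# Sketch of an SGDR-style warm-restart learning-rate schedule; the decay
# factor and period growth factor kappa are illustrative assumptions.
def sgdr_schedule(total_iters, alpha_max=0.9, alpha_min=0.1,
                  tau0=10, kappa=2, decay=0.8):
    alphas, tau_m, beta = [], tau0, 0
    for _ in range(total_iters):
        # cosine decay inside the current restart period, equation (6)
        alpha_k = alpha_min + 0.5 * (alpha_max - alpha_min) * (
            1 + math.cos(math.pi * beta / tau_m))
        alphas.append(alpha_k)
        beta += 1
        if beta > tau_m:          # warm restart: reset beta, lengthen the
            beta = 0              # period, and shrink the rate envelope
            tau_m *= kappa
            alpha_max *= decay
            alpha_min *= decay
    return alphas

schedule = sgdr_schedule(50)
```

The schedule starts at α_max, decays along the cosine to α_min at the end of each period, then jumps back up to the (reduced) α_max of the next, longer period.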

3. Cognitive Electronic Jamming Decision-Making Model Based on Improved Q-Learning
The Q-learning algorithm principle shown in Figure 1 has the following mapping relationship with the cognitive electronic jamming decision-making model shown in Figure 2: (1) agent-jammer.

With the development of radar working systems, modern radar working states can be divided into searching, tracking, ranging, imaging, monitoring, guidance, and other states according to combat tasks. Different working states can be switched flexibly, and their corresponding threat levels change accordingly. Among them, the threat level of the search state is the lowest, which is the target state of the jammer. Each working state has corresponding jamming actions [26], and the corresponding relationship is shown in Figure 3. Effective jamming can gradually reduce the threat level of the radar.

3.1.2. Definition of Reward Function. The reward function determines the decision-making ability of the jammer. Since the ultimate goal of the jammer is to improve jamming performance against the radar, the radar jamming effect evaluation can be used as the reward function. The evaluation of the radar jamming effect is closely related to the change of threat level. This paper divides the threat level change into four situations: (1) if the threat level of the radar working state is reduced to the lowest, the jamming effect is the best, and the reward value is +100; (2) if the threat level decreases but not to the lowest level, the jamming effect is good, and the reward value is +1; (3) if the threat level remains unchanged or increases, the jamming effect is poor, and the reward value is −1; (4) if there is no transition between working states, the reward value is 0. The specific calculation equation of the reward function is expressed as follows:

R = +100, if TL′ = min; R = +1, if min < TL′ < TL; R = −1, if TL′ ≥ TL; R = 0, if no state transition occurs, (10)

where min is the lowest threat level, TL is the threat level before jamming, and TL′ is the threat level after jamming.
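The four-case reward above can be sketched directly in code; the integer threat-level encoding, with MIN_LEVEL as the lowest (target-state) level, is an illustrative assumption:

```python
# Sketch of the four-case reward function of equation (10). Threat levels
# are encoded as integers; MIN_LEVEL is the lowest (target-state) level.
MIN_LEVEL = 1

def jamming_reward(tl_before, tl_after, transitioned=True):
    if not transitioned:            # case (4): no transition between states
        return 0
    if tl_after == MIN_LEVEL:       # case (1): threat reduced to the lowest
        return 100
    if tl_after < tl_before:        # case (2): reduced, but not to the lowest
        return 1
    return -1                       # case (3): threat unchanged or increased
```

The large terminal reward (+100) dominates the small shaping rewards (±1), which is what drives the learned strategy toward the lowest-threat state.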

3.2. Cognitive Electronic Jamming Decision-Making Algorithm Based on Improved Q-Learning. To prevent the Q-learning algorithm from falling into a local minimum and to improve the convergence speed and decision accuracy, this paper uses the SA algorithm and the SGDR method to improve, respectively, the exploration strategy and the learning rate of the Q-learning algorithm so that the improved algorithm has better decision-making performance. Combined with the above analysis of the cognitive electronic jamming decision-making model, a cognitive electronic jamming decision-making algorithm based on improved Q-learning is proposed. The algorithm flow is shown in Figure 4, from which the specific steps of the improved Q-learning algorithm are as follows:

Step 1. Initialize the function Q, the discount factor, and the initial temperature, and set the maximum number of iterations.
Assuming that the number of radar working states is M and the number of jamming actions is N, the function Q is initialized to an M × N zero matrix; the rows of the matrix represent the radar working states, and the columns represent the possible jamming actions.
Step 2. When the iteration-number rule is met, that is, the maximum number of iterations is reached, terminate the learning process and output the optimal or suboptimal jamming strategy. Otherwise, go to step 3.
Step 3. According to the results of cognitive electronic reconnaissance, identify the current radar working state and analyze the threat level in this state.
Step 4. Determine whether the current state is the target state. If it is, use equation (4) to reduce the temperature control parameter and return to step 2. Otherwise, go to step 5.
Step 5. According to the Metropolis criterion in the SA algorithm, use equation (3) to select jamming actions and change the current state to the next state.
Step 6. Analyze the changes in the radar threat level before and after the jamming and calculate the reward value of the jamming action by equation (10).
Step 7. Iteratively update the function Q according to equations (1) and (6)-(9) and set the converted state after jamming to the current state.
Step 8. Define Δ(Γ_Q) as the difference between the sum Γ_Q of all Q values before and after one iteration. When Δ(Γ_Q) is less than the convergence threshold, terminate the learning process and output the optimal jamming strategy according to equation (2). Otherwise, return to step 4.
Combined with the implementation steps of the improved Q-learning algorithm, the specific pseudocode is summarized as follows:
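As a compact stand-in, steps 1-8 can be sketched in Python, combining the SA-based exploration with an SGDR-style learning rate. The 6-state toy environment, temperatures, and schedule constants are illustrative assumptions rather than the paper's settings:

```python
import math
import random

random.seed(1)  # deterministic run for reproducibility

# Compact sketch of steps 1-8; environment and constants are illustrative.
N, TARGET, GAMMA = 6, 5, 0.8
Q = [[0.0] * N for _ in range(N)]

def env_step(s, a):
    """Hypothetical radar model: jamming action a_j moves the radar to
    state j; a higher index means a lower threat level."""
    s2 = a
    return s2, (100.0 if s2 == TARGET else (1.0 if s2 > s else -1.0))

T, LAM = 100.0, 0.9                  # SA temperature and cooling parameter
alpha_max, tau, beta = 0.9, 10, 0    # SGDR envelope, period, restart offset
prev_sum = None
for k in range(1, 400):                       # Step 2: iteration-number rule
    s = random.randrange(N - 1)               # Step 3: random non-target state
    while s != TARGET:
        a_p = max(range(N), key=lambda a: Q[s][a])
        a_r = random.randrange(N)
        # Step 5: Metropolis criterion for exploration
        if Q[s][a_r] >= Q[s][a_p] or random.random() < math.exp(
                -(Q[s][a_p] - Q[s][a_r]) / T):
            a = a_r
        else:
            a = a_p
        s2, r = env_step(s, a)                # Step 6: reward of threat change
        # Step 7: Q update with a cosine-decayed learning rate
        alpha_k = 0.1 + 0.5 * (alpha_max - 0.1) * (1 + math.cos(math.pi * beta / tau))
        Q[s][a] += alpha_k * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2
    T *= LAM                                  # Step 4: cool after each episode
    beta += 1
    if beta > tau:                            # SGDR warm restart
        beta, tau = 0, tau * 2
    q_sum = sum(map(sum, Q))                  # Step 8: Q-value convergence rule
    if prev_sum is not None and abs(q_sum - prev_sum) < 1e-3:
        break
    prev_sum = q_sum

policy = [max(range(N), key=lambda a: Q[s][a]) for s in range(N)]
```

The loop stops early once the sum of Q values stabilizes (the Q-value convergence rule) and otherwise falls back to the iteration-number rule.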

4. Simulation Experiment and Analysis of Results
To verify the improved Q-learning algorithm and process the experimental results, the simulation platform used in this paper is as follows: (i) The operating system is Windows 10

4.1. Experimental Simulation Settings.
From the analysis in Section 3.1, it can be seen that the multifunctional radar has multiple working states such as search, tracking, and ranging. The jamming actions adopted by the jammer include suppressive jamming and deceptive jamming. This experiment assumed that the multifunctional radar had 6 working states, with threat levels from high to low S_1 > S_2, S_3 > S_4, S_5 > S_6; S_6 was the target state. The transition diagram between the working states is shown in Figure 5, where a_ij represents the jamming action required to transfer from state i to state j.
Combining the radar working state transition diagram with the definition of the reward function, the reward matrix R is obtained. The relevant parameters of the improved Q-learning algorithm were set as listed in Table 1, with the learning rate range in the mth restart cycle calculated by equations (7) and (8). When the difference of Γ_Q between two adjacent iterations is less than the convergence threshold, the algorithm has converged. To quantitatively evaluate the algorithm's performance, this paper took the number of iterations at convergence as the evaluation index of convergence speed; the fewer the convergence steps, the better the effect.

4.3. Analysis of Parameters' Impact on the Jamming Decision-Making Performance
4.3.1. Comparison of Exploration Strategies. The exploration strategy affects the accuracy and timeliness of selecting the best jamming action. It is generally set to a constant value: a smaller value is likely to lead to premature convergence, while a larger value ensures that the algorithm explores fully in the early stage but causes oscillations in the later stages, making it difficult to converge quickly. This paper adopted the Metropolis criterion of the SA algorithm to improve the exploration strategy and solve these problems. To compare the impact of the exploration strategy on jamming decision-making performance, this experiment tested the two exploration strategies, the SA-based strategy and the ε-greedy strategy, with the learning rate initialized to α = 0.8 and ε = 0.8. The other parameter settings were the same as those in Table 1. The relationship between the convergence value Γ_Q of the different exploration strategies and the number of iterations is shown in Figure 6.
As presented in Figure 6, both methods finally converge to 1154.9. The ε-greedy strategy adopted a fixed action selection probability, which made the Q-learning algorithm oscillate at the end of the iteration; it started to converge at the 76th generation. The SA-based strategy adopted the Metropolis criterion, keeping ε large in early iterations for full exploration; due to the cooling strategy, ε gradually became smaller during the iteration, making the Q-learning algorithm converge quickly, and it tended to converge at the 45th generation. Therefore, compared with the ε-greedy strategy, the dynamically adaptive exploration probability of the SA algorithm meets the requirement of full exploration to avoid falling into the local optimum and converges faster.

4.3.2. Comparison of Learning Rate. The learning rate represents the learning ability of the decision-making system at each increment. In the traditional Q-learning algorithm, the learning rate is constant in most cases and is generally set to α = 0.8. However, a fixed α makes it difficult to jump out of the locally optimal solution in later iterations. The study [29] proposed an adaptive learning rate to replace the fixed one, in which the learning rate gradually decreases as the number of iterations increases; the calculation equations are shown in (13),
where k is the number of iterations and n(s, a) is the number of times Q(s, a) has been traversed by the kth iteration. To verify the learning rate's impact on jamming decision-making performance, this experiment adopted the traditional ε-greedy exploration strategy with ε = 0.8. Combining the learning rate improvement equations in Section 2.3 and the related parameter settings of the improved Q-learning algorithm, the proposed SGDR learning rate, the adaptive declining learning rate [29], and the fixed learning rate are shown in Figure 7.
As shown in Figure 7, the fixed α remained constant throughout the iteration process and easily fell into the local optimum in later iterations. The adaptive α kept a small value in later iterations for global optimization, but a too-small α affected the convergence speed. Due to the warm-restart mechanism, the SGDR α was maintained at a large value in early iterations, improving learning efficiency, and was increased repeatedly in later iterations, which avoided oscillations and sped up convergence, fully satisfying the timeliness and accuracy requirements of cognitive electronic warfare. Figure 8 compares the relationship between the sum Γ_Q of all Q elements and the number of iterations of the Q-learning algorithm under the different learning rate settings. The convergence values of the three methods all eventually converge to 1154.9. Still, the constant α tends to converge at the 80th generation, the adaptive α at the 64th generation, and the SGDR α at the 42nd generation. The SGDR α kept a larger value in early iterations so that the number of iterations to convergence of Γ_Q was significantly shortened. As iterations increased, the SGDR curve did not oscillate and avoided falling into the local optimum, whereas the adaptive α and constant α curves oscillated at the end of the iteration. Thus, the learning rate improved by SGDR is the best, followed by the adaptive learning rate, with the constant learning rate showing the worst convergence effect. This further illustrates that the improved method overcomes the shortcomings of slow convergence in later iterations and falling into the local optimum.
When the number of iterations reached 42, the algorithm improved by SGDR converged and had learned the optimal jamming strategy; continuing to train would only increase the learning time cost.

4.3.3. Comparison of Discount Factor. According to equation (1), the discount factor γ also has a certain impact on the decision-making performance of the system, with a value range of (0, 1]. γ represents the importance attached to future rewards. The larger the γ, the more the agent tends to consider all possible future states, which makes training more difficult. As γ decreases, the rewards of possible future states have less and less impact on the Q value, meaning that the agent only pays attention to the few states immediately ahead. A too-large discount factor easily makes the algorithm hard to converge, while a too-small one easily falls into local optimization.
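Numerically, this weighting can be seen by computing the contribution γ^n of a reward received n steps ahead; the reward value and horizon here are toy illustrations:

```python
# Illustrative computation: a reward received n steps ahead contributes
# gamma**n of its value, so small gamma makes the agent myopic while
# gamma = 1 weights all future rewards equally.
def discounted(reward, steps_ahead, gamma):
    return gamma ** steps_ahead * reward

for g in (0.4, 0.8, 1.0):
    print(g, [discounted(100, n, g) for n in range(4)])
```

With γ = 0.4, a reward three steps away retains only 6.4% of its value, whereas with γ = 0.8 it retains 51.2%, which matches the observed trade-off between convergence stability and reaching the optimal solution.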
To test the influence of the discount factor on the jamming decision-making performance of the Q-learning algorithm and ensure the universality and reference value of the results, this experiment initialized ε = 0.8 and α = 0.8 according to the general empirical values of Q-learning. The simulation results are shown in Figure 9, which presents the influence of the discount factor γ on the number of iterations of the Q-learning algorithm. When γ is set to 0.4, the convergence of the Q-learning algorithm is stable, but it cannot converge to the optimal solution within the specified number of iterations. As γ increases, the convergent value Γ_Q increases, meaning that future rewards are discounted less and less. When γ is set to 1, the Q-learning algorithm oscillates although the early iteration speed is fast, resulting in a too-large initial solution and a higher probability of converging to a suboptimal solution. When γ is set to 0.8, the initial solution is small, the later convergence is stable, and the convergence speed is faster. Therefore, the best discount factor in this paper is 0.8.

4.4. Decision Simulation and Result Analysis.
To verify the effectiveness of the abovementioned improved Q-learning algorithm, the traditional Q-learning algorithm and the improved Q-learning algorithm were compared and tested in the same simulation environment. The curve of Γ Q with the number of iterations is shown in Figure 10.
As presented in Figure 10, both algorithms could eventually converge, but the improved Q-learning algorithm stabilized after iterating to the 35th generation due to the introduction of the SA algorithm and the SGDR method, whereas the traditional Q-learning algorithm oscillated at the end of the iteration and only tended to converge at the 60th generation. This shows that the traditional algorithm's large early search space leads to a long time to first converge to the optimal solution, and the algorithm is unstable during iteration. The improved Q-learning algorithm converged significantly faster than the traditional Q-learning algorithm. This is because the SA algorithm reduces the exploration probability so that the Q matrix does not change in later iterations, which reduces the possibility of Q-learning deviating from the optimal solution and avoids oscillation. In addition, the SGDR method adjusts the learning rate through the warm-restart mechanism, repeatedly increasing the learning rate in later iterations to improve the convergence speed. The improved Q-learning algorithm effectively solves the balance between exploration and utilization and the selection of the learning rate. It speeds up training and improves the efficiency of jamming decision-making, which further proves the effectiveness and feasibility of applying the algorithm to cognitive electronic jamming decision-making.
The improved Q-learning algorithm was used to simulate the decision-making process. After many iterations, the final convergence results are shown in Table 2.
It can be seen from Table 2 that the convergence value Q differs after applying jamming actions to different states. For example, when a21, a23, and a24 are applied to S2, the convergence values Q are 51.4, 63.8, and 81.0, respectively. Therefore, the jammer prefers jamming action a24, which transitions the radar state from S2 to S4. Since S6 is the target state, the convergence value Q for the transitions from S4 and S5 to S6 after applying the corresponding jamming action is the maximum, namely 100. Summing all the values in Table 2 gives the final ΓQ of 1146.
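The preference described above is simply a greedy argmax over the converged Q row for the current state. A sketch using the S2 values quoted from Table 2 (only these three entries are given in the text; the dictionary layout is an assumption of this sketch):

```python
# Converged Q values for state S2, as quoted from Table 2.
q_s2 = {"a21": 51.4, "a23": 63.8, "a24": 81.0}

# Greedy rule: the jammer takes the action with the largest converged Q value.
best_action = max(q_s2, key=q_s2.get)
```

With these values, `best_action` is `a24`, matching the transition S2 to S4 chosen in the text.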
According to Table 2, the jamming path obtained by applying the best jamming action in any radar working state is shown in Figure 11.
Figure 11: State action value.
The value on each arrow in Figure 11 represents the Q value after the best jamming action is adopted. According to the strategy of selecting the action with the maximum Q value, whatever state the radar is in, the final convergence value Q guides the jammer to choose the optimal jamming action so that the radar working state gradually shifts to the target state S6. For example, when the detected state is S1, the best jamming strategy is S1 →a13 S3 →a35 S5 →a56 S6, and the sum of the converged Q values is 246.8. As another example, when the initial state of the radar is S2, the threat level of the radar is minimized by implementing the optimal jamming strategy S2 →a24 S4 →a46 S6.
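Tracing such a path is just repeated greedy selection until the target state S6 is reached. In the sketch below, the S2, S4, and S5 entries use the Q values quoted in the text, while the resulting states are inferred from the a_ij naming convention; this is an illustrative fragment, not the full Table 2:

```python
def best_jamming_path(Q, start, target="S6"):
    """Follow the max-Q jamming action from each state until the target state."""
    path, state = [start], start
    while state != target:
        # Pick the action whose converged Q value is largest in this state.
        _, (_, next_state) = max(Q[state].items(), key=lambda kv: kv[1][0])
        path.append(next_state)
        state = next_state
    return path

# Q[state][action] = (converged Q value, resulting radar state).
Q = {
    "S2": {"a21": (51.4, "S1"), "a23": (63.8, "S3"), "a24": (81.0, "S4")},
    "S4": {"a46": (100.0, "S6")},
    "S5": {"a56": (100.0, "S6")},
}
```

Starting from S2, the greedy trace yields S2 → S4 → S6, the same optimal strategy the text derives from Table 2.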

Conclusions
Aiming at radar jamming decision-making, a cognitive electronic jamming decision-making model based on improved Q-learning is proposed. The conclusions are as follows:
(1) Through the Metropolis criterion of the SA algorithm and the SGDR method, the improved Q-learning algorithm improved learning efficiency, sped up convergence, and avoided falling into local optima.
(2) By applying the improved Q-learning algorithm to jamming decision-making, the correspondence between radar working states and jamming strategies was established, a cognitive electronic jamming decision-making model based on improved Q-learning was constructed, and the specific Q-learning algorithm was proposed. The Q-value convergence rule and the iteration-number rule are used as learning termination rules to avoid wasting jamming resources.
(3) The model overcomes the shortcomings of the traditional Q-learning algorithm, such as slow convergence and convergence to local optima. By interacting with the radar in an environment without any prior information, the jammer can learn continuously and autonomously and finally find the optimal jamming strategy.
This paper also has some limitations. For example, it takes the limited working states of a single radar as the research object, whereas in an actual combat environment radars are networked and their working states are diversified; in that case, the decision-making performance of this method will decrease. Therefore, our next step will be to study jamming decision-making methods that combine deep learning and reinforcement learning. In addition, besides the SA algorithm used in this paper, other recently proposed metaheuristic algorithms such as monarch butterfly optimization (MBO) [30], the moth search (MS) algorithm [31], the slime mould algorithm (SMA) [32], and Harris hawks optimization (HHO) [33] can also be used to address the exploration strategy.
Therefore, in the next step we will use the above metaheuristic algorithms to optimize the Q-learning algorithm and make a comparative analysis.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.