A DRL-Based Intelligent Jamming Approach for Joint Channel and Power Optimization



Introduction
Spectrum competition has become a research hotspot in recent years [1,2]. On the one hand, communication jamming, the attacker's main means of attack, still relies on traditional jamming methods. These methods depend heavily on prior information, which makes it difficult for the jammer to adapt its strategy to the spectrum environment. On the other hand, the growing intelligence and diversity of antijamming methods pose major challenges to jamming [3][4][5][6][7][8][9][10][11][12]. Therefore, to cope with increasingly powerful antijamming technology and to overcome the single, inflexible patterns of traditional jamming, many scholars have studied communication jamming technology [13][14][15][16][17][18][19]. However, these jamming methods mainly focus on a single domain and cannot be extended to other domains. As a result, there is a need to strengthen intelligent jamming capability in multidomain confrontation scenarios. In this paper, we investigate the problem of joint decision-making for jamming channel and power in a dynamic spectrum environment.
Several problems remain unsolved in existing research on intelligent jamming. First, in a practical communication environment, the total power of a jammer is limited, and pursuing only the jamming effect tends to waste jamming resources. Second, single-domain jamming methods struggle when the opponent changes its communication behavior in other domains. Finally, some studies propose jamming methods in theory but do not verify them on a practical system. It is therefore necessary to investigate a jamming method that saves jamming power while ensuring jamming performance and that can be applied in a practical communication environment.
To tackle the abovementioned problems, the following challenges need to be considered: (1) The system model should be reasonable and effective. A reasonable system model is the key to analyzing the jamming decision problem. (2) The jamming method must be real-time and efficient. Multidimensional decision-making inevitably brings a large decision space, so it is essential to accelerate the learning and updating process of the algorithm. (3) The proposed jamming method should be deployable in a practical communication system. Practical application is necessary to test the jamming performance and to promote the deployment of theoretical algorithms in realistic communication confrontation environments.
In response to the above challenges, we model the multidomain joint jamming decision problem in the communication confrontation scenario as a Markov decision process (MDP). To maximize the utilization of jamming resources, the jamming power is also taken into account, and a joint channel and power decision jamming algorithm is proposed. The specific contributions of this paper are summarized as follows:
(i) To make full use of the spectrum features, the Markov decision process is used to model the interaction between the jammer and the communication user. In particular, the spectrum waterfall (SW) is defined as the state and describes the detailed characteristics of the communication user's frequency usage.
(ii) To address the large decision space caused by joint decision-making, we propose a joint channel and power decision algorithm based on deep reinforcement learning (DRL) that follows the principle of "parallel learning and joint decision-making". By modifying the learning network structure, we divide the policy network into two parallel subnetworks for channel decision and power decision, respectively, thus reducing the difficulty of exploration.
(iii) To evaluate the performance of the proposed algorithm in a practical communication environment, a software-defined radio (SDR) based testbed is designed and built, which contains two subsystems: a wireless communication subsystem and an intelligent jamming subsystem. The proposed jamming algorithm is verified on the testbed, and the test results are consistent with the simulations.
It should be noted that we have carried out related studies on intelligent jamming [20,21]. This paper differs from [20,21] in two respects: first, [20,21] investigated only the channel-based jamming decision problem without considering power optimization; second, this paper gives solutions to the decision problem in a large state space.
The rest of this paper is organized as follows. In Section 2, the related work is presented. The system model and problem formulation are given in Section 3. The details of the proposed joint jamming decision algorithm are introduced in Section 4. In Section 5, simulation results and discussions are given. In Section 6, the jamming testbed platform and practical test results are introduced. Finally, the conclusion is drawn in Section 7.

Related Work
With the development of software-defined radio, artificial intelligence, and communication countermeasure technology, wireless communication confrontation has become a major research topic [2]. In recent years, intelligent antijamming methods have been widely studied, represented by game theory [3][4][5][6][7] and machine learning [8][9][10][11][12]. The authors in [3][4][5] modeled the interaction between jammers and users as a Stackelberg game and solved it to find the optimal antijamming strategies. The authors in [6,7] used the Markov game to solve the "coordination" and "competition" problem in multiuser scenarios. In [8], the authors proposed a radio frequency fingerprint identification (RF-FI) framework based on incremental learning to solve the problem of blind individual signal identification in distributed systems; the framework overcomes the low accuracy of RF-FI and its poor adaptability in complex electromagnetic environments. Liu et al. [9] used the SW graph to represent the time-frequency information and proposed an antijamming algorithm based on DRL. On the basis of [9], Wang et al. [10] proposed a new antijamming scheme called dynamic spectrum antijamming (DSA), which obtains the optimal strategy with the help of cognitive radio and machine learning. Liu et al. [11] used the spectrum waterfall to recognize the characteristics of different jamming modes and achieved antijamming communication under each of them. The author in [12] introduced a novel jamming recognition method based on distributed few-shot learning, which uses the time-frequency diagram of the communication jamming signal and achieves global optimization of multiple subnetworks through federated learning. Therefore, to cope with increasingly powerful antijamming technologies, it is important to investigate intelligent jamming methods.
Building on the abovementioned studies, various methods have made breakthroughs in the field of intelligent jamming [13][14][15][16][17][18][19]. In [13], the authors proposed an intelligent jamming algorithm based on Q-learning and evaluated its jamming performance against different antijamming strategies. On this basis, Zhang et al. [14] proposed a jamming framework of "offline learning and virtual decision-making" based on Q-learning and verified the effectiveness of the algorithm in a practical communication environment. Furthermore, Rao et al. [15] used the value variance of effective jamming actions to set a confidence interval and eliminated low-confidence jamming actions from the action space, which accelerated the learning of the optimal strategy. In terms of power domain jamming, the authors in [16,17] modeled the interaction between the user and the jammer as a Stackelberg game, analysed the optimal strategies of the user and the jammer, respectively, and obtained the Stackelberg equilibrium of the game. A power domain jamming algorithm was proposed in [18], in which the jammer adaptively adjusts its power strategy according to the working state of the user, and the algorithm is guaranteed to converge to the optimal jamming strategy. RaoHua et al. [19] proposed a jamming resource allocation method based on maximum policy entropy DRL, which enhances strategy exploration to determine the optimal jamming power allocation scheme.
However, all the above jamming methods focused on a single domain, such as the frequency or power domain, making it difficult to address other domains in which the communication user can make adjustments or changes.For example, once the user adjusts its strategy in the power domain, the jamming effect cannot be guaranteed by the simple channel jamming, and vice versa.The jamming effect is poor if the power is low, and the jamming resources will be wasted if the jamming power is too high.Therefore, it is necessary to consider the multidomain jamming method which can adjust both power and channel to improve the jamming effect.
The joint decision of channel and power is an important problem, and several studies in antijamming communication have focused on joint channel and power decision-making [22][23][24]. The studies in [22,23] investigated joint channel and power decision-making based on reinforcement learning (RL) and DRL, respectively, and the proposed algorithms achieved high throughput while reducing the signal transmission power. Zhang et al. [24] proposed a novel communication/deception dual-mode mechanism and an antijamming communication method based on joint channel and power optimization, in which a deception user attracts the jamming so that the communication user can communicate normally and obtain the maximum communication rate. However, little research has been done on joint channel and power jamming decisions that ensure the jamming effect while reducing the consumption of jamming power. For the jammer, the jamming power is limited, and it is necessary to study a jamming method that saves jamming resources and guarantees the jamming performance.
The above research on methods of joint channel and power decision-making has led to a series of results and developments; however, the proposed algorithms have not been tested and verified in the practical system.Therefore, it is impossible to evaluate the gap between theoretical simulation and practical application of the algorithm.In this paper, we investigate the jamming problem of joint channel and power decision-making based on DRL and evaluate the algorithm performance in the built testbed platform.

System Model and Problem Formulation
3.1. System Model. In this paper, a spectrum confrontation scenario is considered, in which a jamming system and a communication system coexist. As shown in Figure 1, the communication system consists of a transmitter-receiver pair for the transmission and reception of information. The transmission signal is sent at a constant power during information transmission. The jamming system comprises a spectrum sensing device, a jammer, and an intelligent terminal (agent). The spectrum sensing device and the jammer are connected via the agent. The sensing device senses the current spectrum state and sends the sensed data to the agent, which makes intelligent decisions and learns online by analysing the current spectrum information. The jammer receives the decision information from the agent and emits a jamming signal to disturb the user's communication link. The jamming system has multiple power emission levels, denoted as $\mathcal{P} = \{p_1, \dots, p_S\}$, and is able to select a jamming power level in the dynamic spectrum confrontation environment.
Both the jamming system and the communication system operate in the same frequency range, which is uniformly divided into $K$ available channels. The set of channels can be denoted as $\mathcal{K} = \{f_1, \dots, f_K\}$, where $f_k \in \mathcal{K}$ indicates the centre frequency of the $k$th channel. The jamming system and the communication system work according to a time-slotted structure. In each time slot, the communication system selects a channel to complete a data transmission, and the jamming system selects a channel on which to emit the jamming signal. $f_u(t)$ and $f_j(t)$ represent the channels selected by the communication system and the jamming system, respectively. Furthermore, the jamming system can also select a jamming power level in each time slot to optimize the use of jamming resources.
3.1.1. Time Slot Structure. Figure 2 shows the time slot structure of the user and the jammer. $T_u$ represents the length of a single communication time slot and $T_{JW}$ represents the length of a single jamming time slot, which contains three subslots: the spectrum sensing subslot $T_{wbss}$, the jamming emitting subslot $T_j$, and the online learning subslot $T_l$. The specific description is as follows:
(i) Spectrum sensing subslot $T_{wbss}$: the sensing device senses the spectrum environment of the whole frequency range to get the spectrum data; i.e., it continuously senses the spectrum energy intensity and sends it to the agent.
(ii) Jamming emitting subslot $T_j$: the agent receives the sensed information from the sensing device and decides on the jamming channel $f_j(t) \in \mathcal{K}$ and the jamming power level $p_j(t) \in \mathcal{P}$ according to the state of the spectrum environment. Then, the agent sends the decision information to the jammer, which emits the jamming signal.
(iii) Online learning subslot $T_l$: the jamming system learns the frequency usage pattern of the communication system based on spectrum information from the sensing device and continuously updates the joint strategy.
Wireless Communications and Mobile Computing
3.1.2. Communication-Jamming Model. We assume that the bandwidth of each available channel and the frequency range of $f_k$ can be represented as
The power spectral density (PSD) of the user's signal is assumed as $U(f)$, so the user's signal transmitting power can be expressed as
where $s$ denotes the jammer's jamming power level. $N(f)$ represents the PSD function of the environment noise. According to the literature [8], the signal-to-interference-plus-noise ratio (SINR) can be represented as
where $g_u$ and $g_j$ denote the channel gain from the transmitter to the receiver and the channel gain from the jammer to the receiver, respectively. Referring to [25], we assume that the gain $g_l$ of each channel in the same frequency range is the same at any time and takes its value from a discrete set, i.e., $g_l^t \in \{g_1, \dots, g_n\}$. Since channel quality is often modeled as a finite-state Markov chain (FSMC) [26], we model the channel gain in the licensed bandwidth as an FSMC in this paper. According to the literature [27,28], the user adopts fast frequency hopping, with channel switching based on a specific frequency-hopping sequence. We assume that $\beta_{th}$ is the threshold for determining whether a user can successfully transmit; therefore, the communication rate of the user can be expressed as
According to [9], we assume that all of the above signals are present at moment $t$, so the spectrum sensed by the sensing device at moment $t$ can be expressed as
The sampled values are expressed as
where $\Delta f$ is the sample rate of the spectrum. Thus, the sensed spectrum characteristics can be expressed as
where $o_{t,L}$ represents the energy intensity at time $t$ when the frequency is $f_L$.
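As an illustration of the FSMC assumption, the following minimal Python sketch samples a channel-gain trajectory from a finite-state Markov chain. The three gain values and the transition matrix are hypothetical placeholders chosen only for the example; they are not taken from the paper.

```python
import random


def simulate_fsmc(gains, transition, steps, start=0, seed=None):
    """Sample a channel-gain trajectory from a finite-state Markov chain.

    gains      -- list of discrete gain values, one per state
    transition -- row-stochastic matrix; transition[i][j] = P(next=j | now=i)
    """
    rng = random.Random(seed)
    state = start
    trajectory = []
    for _ in range(steps):
        trajectory.append(gains[state])
        # Draw the next state index according to the current row.
        state = rng.choices(range(len(gains)), weights=transition[state])[0]
    return trajectory


# Hypothetical three-state chain (illustrative values only).
GAINS = [0.2, 0.5, 1.0]
TRANSITION = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]

trajectory = simulate_fsmc(GAINS, TRANSITION, steps=10, seed=0)
```

Because the first and last rows put zero probability on the far state, the sampled gain can only move between neighbouring states from one slot to the next, which is the usual way an FSMC captures slowly varying channel quality.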

3.2. Problem Formulation.
The MDP is used for modelling sequential decision problems in a dynamic environment [29], where the jamming system and the communication system interact to make decisions. The MDP can be expressed as $\langle S, A, P, R \rangle$, where $S$ is the environment state space, $A$ is the action space of the jammer, $P$ is the transition probability of the environment state, and $R$ is the reward function obtained by the jammer after emitting jamming. The specific meaning of each element in the MDP is as follows:
State. Referring to [9,21], to avoid loss of spectrum features and to make better use of historical spectrum information, the spectrum waterfall, a two-dimensional time-frequency matrix, is used as the state:
where $M$ indicates the length of the recorded spectrum history, and the specific value of $M$ can be determined according to the characteristics of the environment.
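The spectrum waterfall state can be pictured as a rolling buffer of the most recent sensing sweeps. The sketch below is one possible way to maintain such a buffer in plain Python; the class name, the zero-filled initial history, and the newest-first row order are assumptions made for illustration and are not specified in the text.

```python
from collections import deque


class SpectrumWaterfall:
    """Rolling time-frequency matrix built from the latest sensing sweeps."""

    def __init__(self, history_len, num_bins):
        self.num_bins = num_bins
        # Assumption: start from an all-zero history so the state always
        # has exactly history_len rows.
        self.rows = deque(
            ([0.0] * num_bins for _ in range(history_len)),
            maxlen=history_len,
        )

    def push(self, sweep):
        """Append one sweep of energy values; the oldest row is dropped."""
        if len(sweep) != self.num_bins:
            raise ValueError("sweep length must equal num_bins")
        self.rows.append(list(sweep))

    def state(self):
        """Return the matrix as a list of rows, newest sweep first."""
        return [list(row) for row in reversed(self.rows)]


sw = SpectrumWaterfall(history_len=3, num_bins=4)
sw.push([1.0, 0.0, 0.0, 0.0])
sw.push([0.0, 1.0, 0.0, 0.0])
current_state = sw.state()
# current_state == [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

A hopping user shows up in such a matrix as a diagonal or stepped trace, which is the kind of pattern a convolutional network can pick up.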
Action. The action space of the jammer can be expressed as
where $f_j \in \mathcal{K}$ and $p_j \in \mathcal{P}$ denote the jamming channel and the jamming power level selected by the jammer, respectively.
Reward. In a practical spectrum confrontation scenario, the jammer aims to suppress the normal communication of the user; therefore, we take the throughput of the communication system into account. Meanwhile, the jamming power is taken into account in order to minimize wasted jamming power. The reward function is designed as follows:
where $C_t$ denotes the user throughput and can be obtained from Equation (2). $\delta(\cdot)$ denotes the indicator function, whose specific expression is
A user throughput $C_t$ of zero indicates that the jammer successfully performs jamming, and a nonzero value indicates that the jammer fails to disturb the communication. The value of $\delta(\cdot)$ is 1 and -1 in the two cases, respectively.
The goal of the intelligent jamming system is to find the best jamming strategy to maximize the cumulative reward value through continuous learning and training; hence, the optimization objective of this paper is as follows:
where $s_{t+\tau}$ and $a_{t+\tau}$ denote the state and action in the time slot $t + \tau$, $\mathbb{E}_a[\cdot]$ is the mathematical expectation, and $0 < \gamma < 1$ is the long-term discount factor.
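To make the role of the discount factor concrete, the short sketch below evaluates a discounted cumulative reward for a finite sequence of per-slot rewards. The reward values are made up for the example, and the function only illustrates the standard discounted sum $\sum_{\tau} \gamma^{\tau} r_{t+\tau}$, not the paper's exact objective.

```python
def discounted_return(rewards, gamma):
    """Return sum over tau of gamma**tau * rewards[tau]."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative rewards: success, failure, success.
example_return = discounted_return([1.0, -1.0, 1.0], gamma=0.5)
# 1.0 + 0.5 * (-1.0) + 0.25 * 1.0 = 0.75
```

A value of $\gamma$ close to 1 makes the jammer weigh future slots almost as much as the current one, while a small $\gamma$ makes it short-sighted.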

Joint Power and Channel Jamming Algorithm Based on Deep Reinforcement Learning
A traditional RL algorithm is Q-learning, which sets up a look-up table to record the state-action value $Q(s, a)$. However, in the MDP proposed above, multiple signals exist at the same time, so the number of possible SWs is huge, and the Q-learning algorithm converges with difficulty or fails entirely. The state-action value function can instead be fitted with a deep neural network (DNN), and the deep Q-network (DQN) was proposed to handle high-dimensional states [30]. The DQN method uses a deep convolutional neural network (CNN) to fit the optimal action value, denoted as $Q(s_t, a_t; \theta)$, where $\theta$ is the network parameter. The corresponding action can then be decided based on the estimated Q function. In this paper, the essential reason for using DQN is that CNNs perform well in image recognition [31,32] and can therefore identify the communication behavior of the user from the spectrum waterfall.
4.1. Algorithm Description. As shown in Figure 3, a "parallel learning and joint decision-making" mechanism is designed to accelerate policy learning and updating. The single policy network of the DQN is divided into two parallel decision subnetworks: a channel policy subnetwork for the channel decision and a power policy subnetwork for the power decision. Each policy subnetwork learns independently and updates its network parameters according to the common reward function. The main benefit is a smaller network decision space. Assume that there are $M$ candidate decisions for the channel and $N$ for the power. A single policy network, in which the channel and power are decided jointly, requires $M \times N$ output neurons. The parallel learning network requires only $M + N$ output neurons, which reduces the space for exploratory learning.
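The saving in output neurons can be checked with a few lines of Python. The example uses 10 channels and 3 power levels, the values given for the simulations in Section 5; the helper names and the flat-index layout of the single-network case are illustrative assumptions.

```python
def output_neurons(num_channels, num_powers):
    """Return (joint, parallel) output-layer sizes.

    joint    -- one neuron per (channel, power) pair in a single network
    parallel -- one neuron per channel plus one per power level
    """
    return num_channels * num_powers, num_channels + num_powers


def decode_joint_index(index, num_powers):
    """Map a flat joint-action index back to (channel, power).

    Assumes the joint network orders its outputs channel-major.
    """
    return divmod(index, num_powers)


joint, parallel = output_neurons(num_channels=10, num_powers=3)
# joint == 30, parallel == 13
```

The gap widens quickly as either dimension grows: 50 channels and 8 power levels would need 400 joint outputs but only 58 parallel ones.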
A target network update mechanism is also introduced. Two deep Q-networks are defined: the evaluation Q-network and the target Q-network. To reduce the correlation between the two, a separate network is used to fit the target Q values during neural network training, and the network parameters of the evaluation Q-network are periodically copied to the target Q-network.
For the jamming system, the sensed SW is used as the input to the joint channel-power decision network. Based on this input, the jamming system makes a joint channel-power decision and receives a reward from the spectrum environment. The Q value of the channel network $Q_f(s_t, a_t)$ and that of the power network $Q_p(s_t, a_t)$ are both updated by the following Bellman formula:
where $\alpha$ and $\gamma$ indicate the learning rate and discount factor. Small batches of data are selected from the replay memory as training samples for experience replay. Based on this, the loss function is calculated as follows:
where $\theta_t$ denotes the network parameters of the evaluation network of the jammer and $y_t$ denotes the target Q value, which is calculated by the following equation:
in which $\theta_t^-$ denotes the network parameters of the target network model. The network uses stochastic gradient descent (SGD) to update the network parameters with the following update formula:
where $\beta$ denotes the learning rate and $\theta_i$ denotes the channel network parameter $\theta_f$ or the power decision network parameter $\theta_p$ in the evaluation network. Thus, the channel policy subnetwork and the power policy subnetwork are updated by the above equations, completing the policy learning and updating.
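For orientation, the textbook DQN forms of the four quantities described in this passage are sketched below, written with the symbols defined in the text. They are the standard expressions from the DQN literature and are given here as an assumption; the paper's own equations may differ in indexing or in how the two subnetworks are distinguished.

```latex
% Tabular Bellman update with learning rate \alpha and discount \gamma
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

% Mean-squared loss of the evaluation network
L(\theta_t) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta_t) \right)^2 \right]

% Target value computed with the target-network parameters \theta_t^-
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_t^-)

% Gradient step with learning rate \beta, for \theta_i \in \{\theta_f, \theta_p\}
\theta_i \leftarrow \theta_i - \beta \, \nabla_{\theta_i} L(\theta_i)
```

In the parallel structure, the same reward $r_t$ would enter both targets, with $Q_f$ maximized over channels and $Q_p$ maximized over power levels.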

4.2. Prioritized Experience Replay.
A critical component of DQN-based algorithms is experience replay. The agent collects transitions $(s_t, a_t, r_t, s_{t+1})$ during training and stores them in the replay memory, a fixed-size buffer that holds the most recent transitions. Data can thus be reused multiple times for training rather than discarded immediately after collection. In [33], the author adopts random sampling for replay to reduce the correlation of training samples, but random sampling limits the efficiency of replay to some extent. In communication confrontation scenarios in particular, jamming rewards are relatively sparse, and a large number of useless samples may be drawn during experience replay, which reduces the learning efficiency. The authors in [34][35] proposed a prioritized experience replay (PER) framework, which replays important transitions more frequently and learns more efficiently. Motivated by this work, PER is introduced to improve the efficiency of the algorithm.
The core of prioritized replay is a measure of the importance of each transition, and the temporal-difference error (TD-error) is used as the metric for sampling priority. For each transition, we calculate its TD-error value as follows:
where $\delta_t^j$ and $r_t^j$ denote the TD-error value and the reward for the $j$th sample generated by the jammer in the $t$th iteration. A larger $\delta_t^j$ value means that the transition has greater replay significance. Replaying these samples can improve the efficiency of network training and promote the convergence of the algorithm. The probability value of the $j$th sample can be defined as
where $p_j > 0$ is the priority of the transition and $\alpha$ controls how strongly the priority is applied; $\alpha = 0$ corresponds to uniform sampling. The form of $p_j$ is given as follows:
where $\sigma$ is a small positive constant that prevents transitions from never being revisited once their TD-error is almost zero. However, PER introduces a new bias, because reduced training sample diversity can cause the network to overfit. We can correct this bias by using importance-sampling (IS) weights:
where $M_e$ indicates the capacity of the replay memory and $u$ indicates the degree of correction. These weights are used in the network update by replacing $\delta_j$ with $w_j \delta_j$. The corrected loss function can therefore be obtained from the following equation:
where $y_j$ is the target value of the $j$th sample and can be obtained by
The $\varepsilon$-greedy policy is widely used in previous works to balance "exploration" and "exploitation." To enhance the "exploration" ability, a dynamic $\varepsilon$-greedy policy is introduced, and $\varepsilon$ is updated in each time slot according to the following rule:
where $\varepsilon_0$ is the initial exploration value, $\varepsilon_f$ is the final exploration value, and $\lambda$ is the decay rate. The jammer's action is then selected according to the following rules:
where $f_t$ and $p_t$ denote the channel decision and power decision of the jammer, respectively, and $Q_f(s_{t+1}, f_t; \theta_f)$ and $Q_p(s_{t+1}, p_t; \theta_p)$ denote the channel Q value and power Q value. Therefore, a channel-power joint decision jamming algorithm based on parallel DQN (CPJ-PDQN) is proposed, whose details are shown in Algorithm 1.
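The prioritized sampling step described above can be sketched in a few lines of Python. The sketch assumes the proportional variant of PER, in which the priority is $|\delta_j| + \sigma$, the sampling probability is the priority raised to $\alpha$ and normalized, and the IS weight is $(M_e \cdot P(j))^{-u}$ divided by the largest weight in the batch. The normalization by the maximum weight and all function names are assumptions made for illustration.

```python
import random


def sampling_probabilities(td_errors, alpha, sigma=1e-6):
    """Proportional prioritization: P(j) = p_j**alpha / sum_k p_k**alpha."""
    priorities = [(abs(err) + sigma) ** alpha for err in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, capacity, u):
    """IS weights (capacity * P(j))**(-u), scaled so the largest equals 1."""
    raw = [(capacity * p) ** (-u) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


def sample_batch(probs, batch_size, rng):
    """Draw transition indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Two stored transitions with TD-errors 1 and 3 (illustrative values).
probs = sampling_probabilities([1.0, 3.0], alpha=1.0, sigma=0.0)
weights = importance_weights(probs, capacity=2, u=1.0)
batch = sample_batch(probs, batch_size=4, rng=random.Random(0))
# probs == [0.25, 0.75]; weights[0] == 1.0
```

The transition with the larger TD-error is drawn three times as often, and its IS weight is correspondingly smaller, so that its total influence on the gradient stays comparable to uniform sampling.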

4.3. Network Structure.
Deep reinforcement learning combines the strong feature extraction ability of deep learning in complex environments with the decision-making advantage of reinforcement learning. The network structure proposed in this paper comprises convolution layers (CL) and fully connected layers (FCL), which are used for feature extraction and policy decision, respectively.
The neural network structure is shown in Figure 4. The first CL uses 32 filters of size 4 × 4 with stride 4, and the second CL uses 64 filters of size 3 × 3 with stride 2. FCL1 and FCL2 of the channel network have 256 neurons and $|\mathcal{K}|$ neurons, respectively, and FCL3 and FCL4 of the power network have 256 neurons and $|\mathcal{P}|$ neurons, respectively. Each neuron in FCL2 represents one available channel action $f_t$, and each neuron in FCL4 represents one available power action $p_t$. The final jamming action $a_t$ combines $f_t$ and $p_t$; the network parameters are listed in Table 1.
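How the two output layers turn into one joint action can be sketched as follows. The code takes the Q values of the channel head and the power head as plain lists and applies an $\varepsilon$-greedy rule with a decaying $\varepsilon$. The exponential form of the decay and the choice to explore both heads together are assumptions; the text only states that $\varepsilon$ moves from an initial value to a final value at some decay rate.

```python
import math
import random


def epsilon_at(t, eps_start, eps_final, decay):
    """Assumed exponential schedule from eps_start towards eps_final."""
    return eps_final + (eps_start - eps_final) * math.exp(-decay * t)


def select_joint_action(q_channel, q_power, epsilon, rng):
    """Return (channel_index, power_index) for one time slot.

    With probability epsilon both indices are drawn uniformly at random;
    otherwise each head contributes the index of its largest Q value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_channel)), rng.randrange(len(q_power))
    channel = max(range(len(q_channel)), key=q_channel.__getitem__)
    power = max(range(len(q_power)), key=q_power.__getitem__)
    return channel, power


rng = random.Random(0)
# Purely greedy choice (epsilon = 0) on illustrative Q values.
greedy_action = select_joint_action([0.1, 0.9, 0.3], [0.5, 0.2], 0.0, rng)
# greedy_action == (1, 0)
```

Because each head is maximized separately, the greedy step costs $|\mathcal{K}| + |\mathcal{P}|$ comparisons rather than a search over all channel-power pairs.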

4.4. Algorithm Complexity Analysis.
Inspired by the literature [36], we use the number of floating-point operations (FLOPs) to measure the algorithm complexity, which is calculated as
where $O_c$, $O_f$, and $O_p$ denote the complexity of the convolution layers, the channel subnetwork (FCL1, FCL2), and the power subnetwork (FCL3, FCL4), respectively. $l_c$, $l_f$, and $l_p$ denote the number of CL layers, FCL layers in the channel subnetwork, and FCL layers in the power subnetwork, respectively. In the CL layers, $H$ and $B$ denote the length and width of the output feature map, $F$ is the filter size, and $C_{l_c}$ is the number of filters in layer $l_c$. In the FCL layers, $I$ and $E$ denote the number of input neurons and output neurons, respectively. According to the network parameters in Table 1, the algorithm proposed in this paper requires $5 \times 10^7$ FLOPs every second, and the computation requirements can be satisfied by a common multicore central processing unit (CPU).
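A rough per-decision operation count can be reproduced from the layer sizes quoted in Section 4.3 and the 200 × 200 spectrum waterfall used in the simulations. The sketch below counts one multiply-accumulate (MAC) per weight use and assumes a single input channel, no padding, and 10 channels with 3 power levels; since Table 1 is not reproduced here, these settings are assumptions and the result is only an order-of-magnitude check.

```python
def conv_output(size, kernel, stride):
    """Output side length of a square convolution without padding."""
    return (size - kernel) // stride + 1


def conv_macs(out_h, out_w, kernel, in_ch, out_ch):
    """MACs of one convolution layer: H * B * F^2 * C_in * C_out."""
    return out_h * out_w * kernel * kernel * in_ch * out_ch


def fc_macs(n_in, n_out):
    """MACs of one fully connected layer: I * E."""
    return n_in * n_out


side1 = conv_output(200, kernel=4, stride=4)    # 50
side2 = conv_output(side1, kernel=3, stride=2)  # 24
flat = side2 * side2 * 64                       # 36864 features after CL2

conv_total = conv_macs(side1, side1, 4, 1, 32) + conv_macs(side2, side2, 3, 32, 64)
channel_total = fc_macs(flat, 256) + fc_macs(256, 10)
power_total = fc_macs(flat, 256) + fc_macs(256, 3)

total_macs = conv_total + channel_total + power_total
# total_macs == 30774528, i.e. about 3.1e7 MACs per forward pass
```

Counting each MAC as two floating-point operations gives roughly $6 \times 10^7$ operations per forward pass, the same order of magnitude as the figure quoted in the text.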

Simulation Results and Analysis
In this section, simulations are performed to evaluate the effectiveness of the proposed algorithm. We assume that the bandwidth of the available frequency band for the user and jammer is 20 MHz, which is divided into 10 channels with a bandwidth of 2 MHz each. Spectrum sensing is performed once per millisecond with spectrum resolution $\Delta f$ = 100 kHz, resulting in 200 spectrum samples per sweep. The constructed SW saves a historical length $\Phi$ = 200 ms, so the size of the SW is 200 × 200. The transmission power of the transmitter is constant at 20 dBm, and the power of the background noise is -90 dBm. The jamming power is divided into high, medium, and low levels with power values of 30 dBm, 20 dBm, and 10 dBm, respectively, which can be denoted as $\mathcal{P} = \{30, 20, 10\}$. According to [26,37], the channel gain
where $g_p$ and $|h_r|^2$ represent the path loss and the Rayleigh fading. The threshold for correct demodulation by the user is 10 dB. The user adopts the fast frequency-hopping (FFH) communication pattern according to [27,28]. Other simulation parameters are shown in Table 2. The ADAM optimizer is used to train the deep Q-network, and the minibatch size is 64.

Initialization:
The parameters of the channel network $\theta_f$ and the power network $\theta_p$ are initialized to random values. Initialize the replay memory $D_f = \emptyset$, $D_p = \emptyset$ and the iteration time $t = 0$.
... Equations (15)-(18).
Calculate the loss value according to Equation (19).
Update the network parameters $\theta_f$ and $\theta_p$ by SGD.
Update the target network $\theta^- = \theta$ every $N_t$ iterations.
End If
End For
Algorithm 1: Channel and power joint decision jamming algorithm based on parallel DQN (CPJ-PDQN).

In the simulations, the user's throughput rate (UTR) and the jammer's jamming success efficiency (JSE) are introduced to evaluate the performance of the proposed algorithm. The UTR can be defined as
where $S_{cur}$ indicates the number of packets correctly demodulated by the receiver and $S_{all}$ indicates the number of all packets sent by the transmitter. The JSE can be defined as
where $\delta(C_t)$ can be obtained from Equations (2) and (9) and indicates whether the jamming is successful, and $p_j$ indicates the power policy adopted for jamming. The JSE reflects the jammer's utility per unit power; a higher JSE value indicates higher jamming efficiency and higher resource utilization.
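The two metrics can be illustrated with a small Python sketch. The UTR follows directly from its description as the ratio of correctly demodulated packets to sent packets. The JSE function is only one plausible reading of "utility per unit power": it sums the indicator $\delta(C_t)$ (1 for a jammed slot, -1 otherwise) and divides by the total power spent. That form, the function names, and the sample numbers are assumptions, not the paper's definition.

```python
def user_throughput_rate(num_decoded, num_sent):
    """UTR: fraction of transmitted packets that were correctly demodulated."""
    if num_sent <= 0:
        raise ValueError("num_sent must be positive")
    return num_decoded / num_sent


def jamming_indicator(throughput):
    """delta(C_t): 1 if the slot was jammed (zero throughput), else -1."""
    return 1 if throughput == 0 else -1


def jamming_success_efficiency(throughputs, powers):
    """Hypothetical JSE: summed indicator divided by total jamming power."""
    total_power = sum(powers)
    if total_power <= 0:
        raise ValueError("total jamming power must be positive")
    utility = sum(jamming_indicator(c) for c in throughputs)
    return utility / total_power


utr = user_throughput_rate(num_decoded=14, num_sent=100)
# Three slots: two jammed, one not; powers in arbitrary linear units.
jse = jamming_success_efficiency([0.0, 0.0, 1.5], [10, 10, 20])
# utr == 0.14; jse == 1 / 40 == 0.025
```

Under this reading, two jammers with the same UTR are separated by how much power they spent, which is the comparison the JSE is meant to capture.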
In the simulation, several jamming algorithms are introduced for comparison with the proposed jamming algorithm:
(i) Collaborative power and channel jamming algorithm (CPCJA): the algorithm in [23] is introduced as the benchmark, and the channel and power decisions are made by a single policy network
In Figure 5, we compare the UTR performance of the different jamming algorithms. From the graph, it can be seen that the DQNCD-LP algorithm performs worst, followed by DQNCD-MP. A low jamming power level inevitably reduces jamming effectiveness, so not all transmitted packets can be jammed. Conversely, when the jammer works at the highest jamming power level, i.e., DQNCD-HP, almost all the packets are successfully jammed, and the UTR is close to 10%. The proposed CPJ-PDQN algorithm and the CPCJA algorithm come close to the effect of the highest power level, with UTR values of about 14% and 16%, respectively. The proposed algorithm also converges faster than CPCJA, which indicates the advantages of the parallel network structure and the prioritized experience replay technique. Therefore, we can conclude that there is little difference in the actual jamming effect between the proposed CPJ-PDQN algorithm, CPCJA, and the DQNCD-HP algorithm, and all of them eventually reduce the user's throughput rate to less than 20%.
Figure 6 shows the JSE comparison curves of the CPJ-PDQN algorithm and the comparison algorithms. The CPJ-PDQN algorithm has the highest JSE value, followed by the CPCJA algorithm and then the three constant power level jamming algorithms (DQNCD-HP/MP/LP). Combined with the UTR performance in Figure 5, we focus on the JSE values of the DQNCD-HP algorithm, the CPCJA algorithm, and the proposed algorithm. Although the DQNCD-HP algorithm has the lowest UTR value, its overall JSE is low because the whole jamming process runs continuously at the highest power level, which may waste jamming resources. The JSE value of the CPCJA algorithm is higher than that of the DQNCD-HP algorithm. This is because, during the learning process, the jammer learns the appropriate power level according to the transmission power and achieves the same performance as DQNCD-HP with less power. The proposed CPJ-PDQN algorithm has the highest JSE value owing to the parallel network structure and the introduction of PER sampling. The PER technique replays samples with higher priority more often, which leads to the highest JSE value. Therefore, we conclude that the JSE of the proposed algorithm is much greater than that of the comparison algorithms while the UTR values are basically the same, and the proposed algorithm has a greater advantage in terms of resource utilization.
The simulation results and analysis show that the proposed algorithm outperforms the comparison algorithms. It combines the advantages of parallel networks and the PER technique, which makes it better suited to practical communication environments: it reduces user throughput while saving jamming resources. To verify the proposed jamming algorithm beyond simulation, we build a practical jamming system, and the details are presented in the next section.

Intelligent Jamming System Implementation and Testing
In this section, an intelligent communication confrontation system based on SDR is designed and built to evaluate the performance of the proposed algorithm in practical wireless communication. SDR systems have several advantages over conventional radio systems: their functions are defined in software, the system structure is general-purpose, and the operating performance is good, so a powerful and independent wireless communication system can be designed and built quickly. In addition, SDR equipment is flexible in function, and its hardware can be updated or extended as devices and technology develop. Therefore, SDR technology is adopted to verify the proposed algorithm. For the hardware, USRP B210 devices are chosen as the RF front end and an X-86 industrial personal computer (IPC) as the signal processor. The B210 transmits and receives signals, and the IPC analyzes and processes the data.
6.1. System Scheme Design. Figure 7 shows the hardware composition of the practical communication confrontation system. Functionally, the system can be divided into two separate subsystems: a jamming subsystem and a wireless communication subsystem. The operation of the two subsystems is displayed on a common visual platform. The visual platform is designed to evaluate the jamming effect and can display the real-time spectrum environment, the spectrum waterfall, and the user's throughput. The jamming subsystem is divided into two submodules: a spectrum sensing submodule and an intelligent decision submodule. The spectrum sensing submodule sends the sensed spectrum information to the intelligent decision submodule, which then makes a joint channel and power decision and emits the jamming signal.
The communication system consists of a transmitting submodule and a receiving submodule, and the two modules coordinate to complete the data transmission and reception. The modulation and demodulation of the signals are performed in the IPC device. The relevant parameters and strategies can be set in the PC terminal.
6.2. System Implementation. For the communication confrontation system built in this paper, we focus on the design and implementation of the spectrum sensing submodule and the intelligent decision submodule. Wide-band spectrum sensing technology is used in the spectrum sensing submodule to acquire and process spectrum information [38].
The intelligent decision submodule is implemented in the IPC, and its main function is to determine the jamming channel and jamming power by running the joint channel and power decision algorithm proposed in this paper. Figure 8 shows the working process of the decision submodule. Firstly, the sensing device transmits the sensed spectrum information to the intelligent jammer. Secondly, the intelligent decision submodule makes a decision according to the decision-making algorithm in the IPC and updates the network parameters according to the reward value obtained. Finally, the USRP device parameters (jamming signal bandwidth, jamming duration, and jamming gain) are configured and jamming signals are emitted. This process is repeated in a loop to drive the continuous operation of the intelligent jamming system.
6.3. System Test and Result Analysis. Based on the above system scheme design and system implementation, the practical platform of the confrontation system is shown in Figure 9. In the practical system test, the communication frequency range is 840-860 MHz, which is divided into 10 communication channels with a bandwidth of 2 MHz each. The jamming time slot $T_{JW}$ of the jammer is set to 550 ms, and the length of a user's communication time slot $T_u$ is set to [...]. It should be pointed out that the value of $C_t$ in Equation (26) can be obtained from Equation (2) in the simulations. However, in the practical system test, the SINR value at the receiver is difficult to obtain. Therefore, when verifying the proposed algorithm on the SDR platform, we estimate the user's throughput by counting the number of ACKs and NCKs at the communication user.
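As a small illustration of the test configuration in this subsection, the channel plan can be written out explicitly. The sketch below assumes that the 10 channels are contiguous, equally wide, and indexed from the lower band edge; the mapping from channel index to center frequency and the function name are assumptions made for illustration, not details stated in the paper.

```python
BAND_START_MHZ = 840.0   # lower edge of the test band
BAND_STOP_MHZ = 860.0    # upper edge of the test band
NUM_CHANNELS = 10        # number of communication channels in the test


def channel_centers(start_mhz, stop_mhz, num_channels):
    """Return (center frequencies in MHz, channel width in MHz).

    Assumes contiguous, equal-width channels indexed from the lower band edge.
    """
    width = (stop_mhz - start_mhz) / num_channels
    centers = [start_mhz + (k + 0.5) * width for k in range(num_channels)]
    return centers, width


centers, width = channel_centers(BAND_START_MHZ, BAND_STOP_MHZ, NUM_CHANNELS)
```

Under these assumptions the channel width evaluates to 2 MHz, matching the stated bandwidth, and a channel index chosen by the decision submodule can be translated into a tuning frequency by indexing `centers`.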
In the system test, the transmitter sends $K$ packets per second to the receiver, and for each packet the receiver returns an ACK frame or an NCK frame to the transmitter, indicating whether the packet was received correctly. The user throughput rate per unit time is computed from $N_{ack}$ and $N_{nck}$, the numbers of ACK frames and NCK frames received per second, respectively. The practical test value $C_t$ is then defined by comparing this rate with $\beta$, the threshold value used to determine whether the user's transmission is successful. The value of $\beta$ in our system verification is set to 0.7.
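The ACK/NCK-based measurement can be sketched as follows. Because the defining equations are not reproduced in this section, the sketch makes two explicit assumptions: the throughput rate is taken as the fraction of ACK frames among all feedback frames received in one second, and $C_t$ is taken as a binary indicator that equals 1 when this rate reaches the threshold $\beta$. The function names are hypothetical.

```python
def throughput_rate(n_ack, n_nck):
    """Assumed form of the per-second throughput rate: N_ack / (N_ack + N_nck)."""
    total = n_ack + n_nck
    if total == 0:
        return 0.0  # no feedback frames received in this second
    return n_ack / total


def transmission_success(n_ack, n_nck, beta=0.7):
    """Assumed form of the test value C_t: 1 if the rate reaches beta, else 0.

    beta = 0.7 is the threshold used in the system verification; the
    comparison operator (>=) is an assumption.
    """
    return 1 if throughput_rate(n_ack, n_nck) >= beta else 0


# Hypothetical one-second counts: 70 ACK frames and 30 NCK frames.
rate = throughput_rate(70, 30)
c_t = transmission_success(70, 30)
```

Counting feedback frames in this way needs only information that is already available at the transmitter, which is why it can stand in for the SINR-based quantity used in the simulations.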
To compare the jamming performance of the proposed and comparison algorithms clearly, we collect the test data over the whole online process and plot the results of all algorithms in the same graph. To display the trend of the proposed algorithm accurately, we aggregate the data every 20 user time slots. The practical test results are shown in Figures 10 and 11.
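The aggregation over 20 user time slots can be sketched as below. The paper does not state which statistic is computed per group, so the use of a mean over non-overlapping windows, the handling of a trailing partial window, and the function name are assumptions.

```python
def windowed_means(samples, window=20):
    """Average consecutive, non-overlapping windows of `window` samples.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    means = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        means.append(sum(chunk) / window)
    return means


# Hypothetical per-slot success indicators: 20 successful slots followed by
# 20 jammed slots.
per_slot = [1] * 20 + [0] * 20
curve = windowed_means(per_slot)
```

Each element of `curve` would then correspond to one plotted point of a real-time curve.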
Figure 10 shows the real-time curve of UTR in the practical system test. The green curve indicates the DQNCD-LP algorithm with low power. Consistent with the simulation results, the performance of this algorithm is poor because the jamming power is low and the jamming signal cannot completely suppress the communication signal. Therefore, the UTR is maintained at 78%. The pink curve indicates the DQNCD-MP algorithm with the medium power level, which performs better than the low-power method, with a UTR value of about 60%. The blue curve represents the DQNCD-HP algorithm using the highest jamming power, and its performance is the best, with a UTR value of around 34.1%. The red curve indicates the performance of the proposed CPJ-PDQN algorithm, whose UTR value is about 36.7%. The jamming effect of the proposed algorithm is therefore basically the same as the suppression of user throughput achieved with the highest power level (DQNCD-HP). However, the practical test results shown in Figure 10 are inferior to the simulation results in Figure 5. The reason is that there is a certain delay in signal processing and channel switching in the practical system. The influence of the external environment also introduces uncertainty, which leads to poorer test results compared with the theoretical simulations.
Figure 11 shows the comparison curves of the JSE in the system test. The DQNCD-LP algorithm has the lowest JSE value. The main reason is that the low-power algorithm has poor jamming performance, so the number of successful jamming attempts is small. The difference between DQNCD-MP and DQNCD-HP in JSE value is not significant. For the DQNCD-MP algorithm, although the UTR performance is poor, the power level used is not high. For the DQNCD-HP algorithm, although the UTR performance is the best, the power level used is also the highest. Therefore, there is little difference between the two algorithms in JSE values, which are about 0.017 and 0.019, respectively. The JSE value of the proposed CPJ-PDQN algorithm is the highest, about 0.025, because its UTR performance is equivalent to that of DQNCD-HP while its power level in the convergence stage is lower than that of DQNCD-HP. Thus the proposed CPJ-PDQN algorithm not only successfully jams the transmitted signal but also significantly saves jamming resources. Although the practical test results are not as intuitive as the simulation results, the overall trend is consistent with the simulation results, which also shows the effectiveness of the algorithm.
6.4. Application Scenarios. The analysis above shows that the proposed CPJ-PDQN algorithm can optimize the jamming channel and the jamming power at the same time and is suitable for confrontation scenarios where the jammer's energy is limited. In addition, SDR technology is currently developing in the direction of miniaturisation and intelligence and has great prospects in the field of spectrum confrontation [39,40]. Therefore, the proposed scheme can be deployed in a UAV electronic jamming system, because such a system requires multiple jammers to work together in a coordinated manner to complete the jamming task. The proposed algorithm can guarantee the jamming effect and save the jamming power at the same time, so as to ensure the endurance of the jammer and its continuous operation.

Conclusion
In this paper, we investigated the problem of joint jamming channel and power selection in the dynamic spectrum environment. Firstly, we introduced the MDP framework to model and analyze the problem of joint jamming channel and power selection. Secondly, a joint channel and power jamming algorithm was proposed, which works on a parallel learning and joint decision mechanism. In particular, the PER technique replaced random sampling to accelerate the convergence of the algorithm. Furthermore, the proposed algorithm was verified on the SDR testbed. The simulation and testbed results indicate that the proposed algorithm can be applied to a practical system and can guarantee the jamming performance and save jamming resources simultaneously.

(ii) DQN-based channel decision algorithm with high power level (DQNCD-HP): in this algorithm, the jammer makes only the channel decision, based on a DQN network, and keeps the jamming power constant at the highest level.
(iii) DQN-based channel decision algorithm with medium power level (DQNCD-MP): in this algorithm, the jammer makes only the channel decision, based on a DQN network, and keeps the jamming power constant at the medium level.
(iv) DQN-based channel decision algorithm with low power level (DQNCD-LP): in this algorithm, the jammer makes only the channel decision, based on a DQN network, and keeps the jamming power constant at the lowest level.

Figure 7: Hardware composition of the intelligent jamming system. (The jamming system is composed of a spectrum sensing submodule for obtaining the spectrum state and an intelligent decision submodule for intelligent decision-making. The communication system serves as the companion system.)

Figure 10: User's throughput rate comparison in testbed test.

Figure 11: Jamming success efficiency comparison in testbed test.
For $t = 1, 2, 3, \cdots, \infty$ do
  Sense to construct the state $s_t = [o_t, o_{t-1}, \cdots, o_{t-M+1}]^T$.
  Calculate the Q values $Q_f$ and $Q_p$, and select the jamming channel $f_t$ and the jamming power $p_t$ according to the $\varepsilon$-greedy strategy.
  Execute the joint action $a_t = (f_t, p_t)$ and emit jamming signals.
  Calculate the reward $r(s_t, a_t)$ in Equation (8).
  Obtain $o_{t+1}$ by spectrum sensing.
  Update the next state $s_{t+1} = [o_{t+1}, o_t, \cdots, o_{t-M+2}]^T$, add the experience $(s_t, f_t, r_t, s_{t+1})$ to replay memory $D_f$, and add the experience $(s_t, p_t, r_t, s_{t+1})$ to replay memory $D_p$.
  If $|D_f| \ge M_e$ and $|D_p| \ge M_e$ do
    Sample minibatches of $(s_t, f_t, r_t, s_{t+1})$ from $D_f$ and $(s_t, p_t, r_t, s_{t+1})$ from $D_p$ according to the PER method by Equations (…)
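The interaction loop in the steps above can be sketched in plain Python. This is a toy under stated assumptions, not the authors' implementation: the Q values are random placeholders standing in for the outputs of the channel and power networks, `toy_reward` is a placeholder and is not Equation (8), spectrum sensing is replaced by a random channel index, and the values of `M`, `NUM_POWERS`, and `M_E` are hypothetical. The sketch shows only the sliding-window state, the two independent ε-greedy selections, and the two replay memories that share the same state and reward.

```python
import random
from collections import deque

M = 4              # state history length (hypothetical value)
NUM_CHANNELS = 10  # number of jamming channels (matches the 10-channel test band)
NUM_POWERS = 3     # number of jamming power levels (hypothetical value)
M_E = 8            # minimum replay size before learning starts (hypothetical value)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def toy_reward(f_t, p_t, user_channel):
    """Placeholder reward (NOT Equation (8)): hit bonus minus a power cost."""
    hit = 1.0 if f_t == user_channel else 0.0
    return hit - 0.1 * p_t


def run(steps, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    sense = lambda: rng.randrange(NUM_CHANNELS)  # stand-in for spectrum sensing
    history = deque((sense() for _ in range(M)), maxlen=M)  # newest first
    d_f, d_p = [], []   # replay memories D_f (channel) and D_p (power)
    learn_steps = 0     # number of steps at which both memories were large enough
    for _ in range(steps):
        s_t = tuple(history)
        # Random placeholders for the channel-network and power-network outputs.
        q_f = [rng.random() for _ in range(NUM_CHANNELS)]
        q_p = [rng.random() for _ in range(NUM_POWERS)]
        f_t = epsilon_greedy(q_f, epsilon, rng)
        p_t = epsilon_greedy(q_p, epsilon, rng)
        o_next = sense()                    # observation o_{t+1}
        r_t = toy_reward(f_t, p_t, o_next)  # toy reward uses the new observation
        history.appendleft(o_next)          # the oldest observation is dropped
        s_next = tuple(history)
        d_f.append((s_t, f_t, r_t, s_next))
        d_p.append((s_t, p_t, r_t, s_next))
        if len(d_f) >= M_E and len(d_p) >= M_E:
            learn_steps += 1  # minibatch sampling and network updates would go here
    return d_f, d_p, learn_steps


d_f, d_p, learn_steps = run(20)
```

Keeping two memories with a shared state and reward lets each network be trained on its own action dimension, which is one way to read the parallel learning and joint decision mechanism described in the paper.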

Table 1: Structural parameters of the policy network.

Table 2: Parameter values used in the simulation.