Double Deep Recurrent Reinforcement Learning for Centralized Dynamic Multichannel Access

We consider the problem of dynamic multichannel access for transmission maximization in multiuser wireless communication networks. The objective is to find a multiuser strategy that maximizes global channel utilization with a low collision rate, in a centralized manner and without any prior knowledge. Obtaining an optimal solution for centralized dynamic multichannel access is extremely difficult because of the large state and action spaces. To tackle this problem, we develop a centralized dynamic multichannel access framework based on a double deep recurrent Q-network. First, the centralized node maps the current state directly to channel assignment actions, which avoids the prohibitive computation of conventional reinforcement learning. Then, the centralized node can easily select multiple channels by maximizing the sum of value functions given by a trained neural network. Finally, the proposed method avoids collisions between secondary users through a centralized allocation policy.


Introduction
With the rapid development of next-generation network technologies such as the mobile Internet and the Internet of Things (IoT), spectrum scarcity has become severe. Furthermore, most spectrum resources are currently allocated exclusively and without limit to primary users (PUs), a mechanism that leaves no transmission opportunities for secondary users and leads to low spectrum efficiency [1]. According to the Federal Communications Commission (FCC), the spectrum is severely underutilized, with the utilization rate of some bands as low as 15% [2]. Low spectrum utilization and spectrum scarcity have triggered the development of dynamic spectrum access (DSA).
In this work, we consider an overlay DSA environment with multiple PUs, multiple secondary users (SUs), and a centralized node that can detect the state of all channels in the current time slot and allocate a channel to each SU for data transmission in the next time slot. This is a coordinated multichannel access problem over independent channels in a fully observable scenario. Each channel has two possible states (i.e., occupied by PUs or not) at any given time. We assume that the PUs have no ability to sense channel states but can avoid collisions with other PUs. In this overlay DSA model, a transmission succeeds only if an SU accesses an idle channel. Once two or more users, i.e., PUs or SUs, transmit data on the same channel in the same time slot, a collision occurs and the SU transmission fails. PUs can occupy different channels according to their spectrum behavior. This model completely avoids collisions between SUs through the overall channel allocation performed by the centralized node.
We assume that the centralized node has cognitive ability, so that it can exploit time-domain holes in the channels and improve spectrum utilization efficiency in an unknown environment. For this purpose, reinforcement learning (RL), typically formulated as a Markov decision process (MDP), is one potential solution because of its good decision performance [3]. The advantage of RL is that it can learn how to map the state space to the action space when the characteristics of the environment are unknown to the agent but can be learned through trial and error. However, traditional reinforcement learning is not suitable for scenarios with a large state space, where many state-action pairs are rarely visited, which leads to the curse of dimensionality and a lack of generalization. Motivated by the success of deep learning (DL) in other domains, Google DeepMind combined RL with a deep neural network, i.e., the Deep Q-Network (DQN), which outputs an approximate value function from which the policy is derived [4]. However, the DQN algorithm overestimates the value function because it uses the same value function both to select an action and to evaluate it. A DQN-derived algorithm, Double DQN, alleviates this overestimation. In addition, the Deep Recurrent Q-Network (DRQN, another DQN-derived algorithm) deals better with time-sequence problems. Hence, this work extends the DQN approach by using Double DQN to stabilize the approximate value function and DRQN to extract the relevant temporal information.
This paper greatly expands on the preliminary work of [5], which applies double deep recurrent reinforcement learning to a centralized DSA scenario and gives its simulated performance in two scenarios. The main contributions of this paper are as follows:
(1) We design a centralized multiuser dynamic spectrum access model, which can effectively avoid conflicts between secondary users. Specifically, for each environment state, the probability of each channel being idle is evaluated, and multiple channels with high idle probability are selected for the transmissions of secondary users.
(2) This paper explains why reinforcement learning is widely used in the field of communication and what its limitations are, which have led to the rapid development and wide application of deep reinforcement learning (DRL). Furthermore, the theoretical basis of the proposed algorithm (i.e., Q-learning and deep reinforcement learning) is also presented.
(3) A multiuser dynamic spectrum access algorithm based on double deep recurrent reinforcement learning is proposed. The proposed algorithm uses fully connected layers to directly estimate the channel allocation, which avoids a lookup table for the value function, and uses a Long Short-Term Memory network to improve the performance of sequential channel assignment decisions.
(4) The learning ability of the proposed algorithm is verified through simulation experiments in the single SU round-robin switching scenario. Simulation experiments in the arbitrary switching scenario with a single SU demonstrate the robustness of the proposed algorithm. The simulation results for the multiple SU scenarios show that this strategy can effectively avoid collisions between SUs and reduce the interference of SUs to PUs.
The rest of this paper is organized as follows. Section 2 presents related algorithm applications in dynamic spectrum access. Then, the system model and the problem statement are described in Section 3, and Section 4 introduces the implementation of related algorithms. Section 5 describes the details of the proposed double deep recurrent reinforcement learning, while Section 6 tests the performance of the proposed algorithm in the single SU scenarios and the multiple SU scenarios. Finally, Section 7 concludes the paper.

Related Work
In recent years, machine learning approaches have become a potential solution in wireless communication, as they offer excellent decision-making performance when the system dynamics are unknown [6][7][8][9][10][11]. The authors in [6,7] model the radar-communications coexistence environment as a Markov decision process and then apply policy iteration to solve the optimal channel allocation problem. Nevertheless, policy iteration is a model-based RL algorithm that requires the transition probability function and the reward function, which are obtained during the training phase. Q-learning is also one of the most popular algorithms, since it is a model-free RL method that does not depend on environment modeling and can learn the optimal policy via trial and error in an online manner [8][9][10][11]. These works mainly use Q-learning and related tabular algorithms (e.g., SARSA, which is an on-policy method) to construct and tabulate the value function for each state-action pair. Compared to traditional algorithms, these methods decrease the computational complexity and deal better with the lack of prior knowledge. However, the complexity of tabulating the value function for each state-action pair is still high, and the performance is poor when faced with high-dimensional problems with a large state space. Thus, in order to overcome the limitations of Q-learning, the DQN has been proposed, which approximates the value function well using a neural network.
Most recent works on DSA have adopted deep reinforcement learning to solve spectrum allocation. For example, Wang et al. proposed a DQN approach to overcome the large state space challenge and developed a channel allocation policy for a single SU in multiple correlated channel scenarios [12]. However, the SU has no complete state information and directly learns a mapping from partial observations to the action space through the Deep Q-Network, which limits network utility in DSA. Recurrent neural networks (RNNs) have been proposed to solve such partially observable Markov decision processes (POMDPs) in computer games [13,14]. Building on this advantage of RNNs, a Long Short-Term Memory (LSTM) layer (a type of RNN layer) has been added to the neural network for solving such DSA problems, since it can maintain an internal state and integrate sequence information [15][16][17]. Naparstek and Cohen consider a distributed DSA environment where each SU uses a Deep Recurrent Q-Network (DRQN) to learn good policies [15]. However, that work assumes that there are no PUs in the scenario, which represents only a specific situation and is difficult to apply to the current radio environment, where most spectrum resources have already been allocated. Different from [15], the works in [16][17] consider a more complex scenario where multiple PUs and multiple SUs coexist and there is no information exchange between the SUs. The above papers therefore model the wireless communication environment as a POMDP, i.e., each node can observe only part of the channel state at each time. Technically, however, each node is capable of sensing multiple channels in every time slot, as studied in [18,19]. Besides, the above approaches still have high computational complexity, since they require each SU to search for spectrum holes in an online and distributed manner.
In the distributed DSA model, there is no central node to coordinate spectrum allocation among the SUs. This means that each SU can only update its policy and select actions to maximize its own transmission rate, which may lead to collisions with other SUs. To decrease collisions in cognitive radio networks, we consider a centralized DSA scenario where a centralized node performs multiuser channel allocation and optimizes its policy through deep reinforcement learning.
Other works on DSA have mainly focused on model-dependent settings (i.e., the myopic policy and the Whittle index) and on the optimization of DRL. The myopic policy treats spectrum allocation as a multiarmed bandit (MAB) problem, in which the user estimates the immediate reward for each channel and selects the channel that maximizes the expected immediate reward [20]. However, the myopic policy ignores changes in the communication environment and is near optimal only when the channel state transitions are positively correlated or slightly negatively correlated. The Whittle index achieves optimal performance only in environments where the transition matrix of the two-state Markov chain is known and all channels are independent [21]. The authors of [22] focus on applying an actor-critic reinforcement learning framework (an improved DRL algorithm) to dynamic multichannel allocation, which combines the advantages of value-based and policy-based reinforcement learning algorithms.

System Model and Problem Statement
We consider a cognitive radio network (CRN) with a centralized node, several primary users (PUs), N secondary users (SUs), and K authorized nonoverlapping channels. We assume that each SU always has transmission demand, i.e., each SU has packets to transmit in each time slot. Each channel has two possible states: good (1) or bad (0), i.e., not occupied or occupied by PUs. At the beginning of each time slot, the centralized node senses the state of all K channels using a specific observation pattern, chooses N of the K channels, and randomly allocates one of these N channels to each SU for transmission. To avoid collisions between SUs, N is restricted to 1 ≤ N ≤ K. The transmission of the $n$th SU is successful when only the $n$th SU transmits on its assigned channel and that channel is in the good state in the given time slot. Otherwise, the transmission of the $n$th SU fails. After each time slot $t$, the centralized node receives a binary observation $o_n(t)$ for each SU $n$, indicating whether its packet was successfully delivered or not (i.e., the ACK signal). If the packet has been successfully delivered, then $o_n(t) = 1$. Otherwise, if the transmission has failed (i.e., a collision occurred), then $o_n(t) = 0$. Thus, the binary observation $o_n(t)$ can be represented as

$$o_n(t) = \begin{cases} 1, & \text{if the channel selected by the } n\text{th SU is in good state}, \\ 0, & \text{otherwise}. \end{cases}$$

In this paper, we model the cognitive radio network as the following MDP, with the goal of transmitting the maximum number of packets in a given time. The MDP model includes the state space, the action space, the discount factor, the reward, and unknown state transition functions.
Given that there are only two possible states for each channel, the state of the channels at time slot $t$ can be defined as

$$s_t = [s_{t1}, s_{t2}, \dots, s_{tK}],$$

where $s_{tk} \in \{0, 1\}$ represents the state of the $k$th channel at time slot $t$ ($1 \le k \le K$). For instance, $s_t = [0, 0, 1, 0]$ means that $K$ is equal to 4 and only the third channel is good at time slot $t$. Since the scenario we study is completely observable, the state of all channels is known to the centralized node. We define the action at time slot $t$ as

$$a(t) = [a_{t1}, a_{t2}, \dots, a_{tK}],$$

where $a_{tk} \in \{0, 1\}$ represents whether the $k$th channel is chosen for transmission or not. As noted before, the centralized node chooses only $N$ channels for access by the SUs in each time slot, where $1 \le N \le K$. So, there are only $N$ nonzero elements in the action vector $a(t)$, and the total number of valid actions is $C(K, N) = K!/(N!(K - N)!)$ for a specific value of $N$. It is easy to see that the number of valid actions tends to be much larger than $K$ in many cases, which can limit the network utility when we evaluate and make decisions for each valid action. In order to decrease the size of the action space, we consider a setting in which the total number of valid actions (i.e., the action space) is equal to $K$ and the action vector has only one nonzero element. Each valid action means occupying one of the $K$ channels. The centralized node selects $N$ actions based on the algorithm proposed in this paper and randomly allocates one of the corresponding $N$ channels to each SU in each time slot. In the multiple SU scenarios, this design avoids collisions effectively. When there are not enough channels in the good state, the SUs obtain transmission opportunities with equal probability. Otherwise, if enough good channels occur simultaneously, each SU can access a good channel for transmitting data. Let $r_n(t)$ be the reward that the $n$th SU obtains after time slot $t$.
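The size argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and K = 16 is borrowed from the simulation section only as an illustrative value.

```python
from math import comb


def joint_action_count(num_channels: int, num_sus: int) -> int:
    """Number of valid joint actions C(K, N): choose N distinct channels out of K."""
    return comb(num_channels, num_sus)


def reduced_action_count(num_channels: int) -> int:
    """Size of the reduced action space: one output per channel, i.e., K."""
    return num_channels


if __name__ == "__main__":
    K = 16  # illustrative value
    for N in (1, 2, 4, 8):
        print(f"N={N}: joint={joint_action_count(K, N)}, reduced={reduced_action_count(K)}")
```

For N = 1 the two counts coincide, while for larger N the joint count grows quickly and the reduced count stays at K, which is the motivation given in the text for scoring each channel separately.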
The total reward $r(t)$ can be viewed as the number of channels successfully used for transmission, i.e., the accumulated binary observation,

$$r(t) = \sum_{n=1}^{N} r_n(t) = \sum_{n=1}^{N} o_n(t).$$

The objective of the centralized node is to find an optimal decision policy $\pi^*$ that maximizes the expected discounted accumulated reward $R$, which can be expressed as

$$R = \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} r(t)\right],$$

where $0 \le \gamma \le 1$ is a discount factor and $T$ is the time horizon of the game. We often set $\gamma = 1$ or $0 < \gamma < 1$ when $T$ is bounded or unbounded, respectively. However, in order to measure the policy more easily over a finite time duration $T$, we use the following average reward instead of the above reward:

$$R = \frac{1}{T} \sum_{t=1}^{T} r(t).$$

Hence, the problem can be formulated as

$$\pi^* = \arg\max_{\pi} R.$$

We mainly focus on overlay DSA models. Meanwhile, we assume that there are multiple PUs and multiple SUs in the wireless network. A channel is in the good state only when its noise power is low and it is not occupied by primary users. We are interested in developing a model-free centralized learning algorithm that adapts to the communication environment and solves the dynamic programming problem. In the following section, we introduce existing algorithms related to our proposed algorithm, i.e., Q-learning and deep reinforcement learning algorithms.
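As a small worked example of the reward definition, the sketch below computes the per-slot reward as the number of SUs whose assigned channel is good, and then the average reward over a trace. The function names are hypothetical, and the check for distinct channels simply encodes the assumption that the centralized node never assigns one channel to two SUs.

```python
def slot_reward(channel_state, assigned_channels):
    """Return r(t): the number of SUs whose assigned channel is in good state (1)."""
    # Centralized allocation is assumed to give each SU a distinct channel.
    if len(set(assigned_channels)) != len(assigned_channels):
        raise ValueError("each SU must be assigned a distinct channel")
    return sum(channel_state[k] for k in assigned_channels)


def average_reward(rewards):
    """Return the average reward over a finite horizon T = len(rewards)."""
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    s_t = [0, 0, 1, 0]            # K = 4, only the third channel is good
    print(slot_reward(s_t, [2]))  # the single SU is on the good channel
    print(average_reward([1, 0, 1, 1]))
```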

Implementation of Q-Learning and Deep Reinforcement Learning
Q-learning is a reinforcement learning algorithm whose goal is to find a strategy that maximizes the expected discounted accumulated reward for dynamic programming problems. Q-learning is also a model-free algorithm: it can assess the consequences of each action when the system model is unknown and can adapt to environmental changes. Q-learning evaluates the value function $Q(s, a)$ of each state-action pair, where the state $s$ is the environment state of the agent and the action $a$ is performed by the agent given the state $s$. The policy $\pi$ is derived from the value function, i.e., $\pi(s) = \arg\max_a Q(s, a), \forall s$. Updating the value function is a cumulative process, whose equation at time slot $t$ is given as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where the discount factor $\gamma$ satisfies $0 \le \gamma \le 1$ and the learning rate $\alpha$ satisfies $0 \le \alpha \le 1$. The reward $r_t$ is obtained after performing action $a_t$ in state $s_t$ and transitioning to the next state $s_{t+1}$.
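The tabular update described above can be sketched in a few lines. This is an illustrative implementation of standard Q-learning, not the authors' code: the table is a dictionary keyed by (state, action), and the default values of `alpha` and `gamma` are assumptions chosen only for the example.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, num_actions,
                      alpha=0.1, gamma=0.9):
    """Apply one Q-learning step and return the updated Q(state, action)."""
    # Value of the best action in the next state (0.0 for unseen pairs).
    best_next = max(q[(next_state, a)] for a in range(num_actions))
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state, num_actions):
    """Policy derived from the table: pi(s) = argmax_a Q(s, a)."""
    return max(range(num_actions), key=lambda a: q[(state, a)])


if __name__ == "__main__":
    q = defaultdict(float)          # unseen state-action pairs start at 0
    s, s_next = (0, 1), (1, 0)      # channel-state tuples used as table keys
    q_learning_update(q, s, 1, 1.0, s_next, num_actions=2, alpha=0.5)
    print(greedy_action(q, s, num_actions=2))
```

Because every (state, action) pair needs its own table entry, the table grows with the number of channel-state combinations, which is the storage problem the next paragraph discusses.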
While Q-learning performs well in some simple situations, it becomes impractical when the state and action spaces are enormous: the storage complexity becomes intolerable, and performance decreases significantly because many states are rarely visited. The Deep Q-Network (DQN) can largely overcome these problems thanks to the excellent fitting performance of neural networks [4]. However, both of the above algorithms use the same value function to select and to evaluate an action based on the state of the agent. This is likely to result in overestimation, which degrades performance. The Double DQN (DDQN) was proposed to mitigate overestimation by decoupling evaluation from selection. Specifically, unlike DQN, it uses two neural networks of the same structure, with parameters $\theta_t$ and $\theta_t^-$, to select an action and to evaluate it. The target value in DDQN at each time slot is given as follows:

$$y_t^{\mathrm{DDQN}} = r_t + \gamma\, Q\!\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t); \theta_t^-\right),$$

where $\theta_t$ denotes the parameters of the online network, which selects the action, and $\theta_t^-$ denotes the parameters of the target network, which evaluates the selected action when the value function is updated. The parameters $\theta_t$ of the online network are copied to the target network parameters $\theta_t^-$ periodically. This approach has been shown to significantly mitigate the overestimation caused by estimation errors in DQN and has achieved high scores in many games [23].
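To make the decoupling concrete, the sketch below compares the DQN target with the Double DQN target for one transition, given the Q-value vectors that the online and target networks output for the next state. It is a minimal illustration with hypothetical function names and made-up Q-values; no neural network is involved.

```python
def dqn_target(reward, next_q_target, gamma=0.9, terminal=False):
    """DQN target: the target network both selects and evaluates the next action."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.9, terminal=False):
    """Double DQN target: the online network selects, the target network evaluates."""
    if terminal:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    q_online = [0.2, 0.9, 0.1]   # made-up online-network outputs for the next state
    q_target = [0.5, 0.3, 0.8]   # made-up target-network outputs for the next state
    print(dqn_target(1.0, q_target))
    print(double_dqn_target(1.0, q_online, q_target))
```

In this example the online network prefers action 1, so the Double DQN target uses the target network's value for action 1 rather than the target network's own maximum, which gives a smaller and less optimistic target.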
A traditional feedforward neural network does not take the sequential order of its inputs (and outputs) into account. It makes decisions based on the current state information and existing experience, and it is trained on samples selected randomly from the experience replay. To exploit sequential information, the recurrent neural network (RNN) was proposed to store representations of recent input events so that it can process and predict sequential data [24]. RNNs have made breakthroughs in speech recognition, language modeling, machine translation, and other time-series analysis domains. In this paper, we add a Long Short-Term Memory layer (LSTM, a special kind of RNN that can capture long-term dependencies) between the input layer and the following fully connected hidden layer in order to facilitate the processing of sequential decision problems. A recurrent neural network can be viewed as multiple copies of the same network, each passing its output to a successor as input. Each copy of the network in an RNN is termed a cell. Unlike in the standard RNN, the LSTM cell is not a simple structure such as a single tanh layer; its structure is described in the remainder of this section.
At each time slot $t$, we acquire the hidden state $h_{t-1}$ and the cell state $c_{t-1}$ from the recurrent loop. Meanwhile, the input observation is $s_t$. The cell structure of the LSTM consists of three gate layers, namely, the forget gate layer, the input gate layer, and the output gate layer. The forget gate layer determines what information from the previous cell state is kept or discarded, and its output is represented as

$$f_t = \mathrm{sigmoid}\left(W_f \cdot [h_{t-1}, s_t] + b_f\right).$$

Wireless Communications and Mobile Computing
Here, sigmoid is the activation function, $\mathrm{sigmoid}(s) = 1/(1 + e^{-s})$, and $W_f$ and $b_f$ are the parameters of the forget gate layer.
The input gate layer decides which values will be updated. It consists of a sigmoid layer that outputs the value $i_t$. Next, a tanh layer creates an input value $z_t$ at time $t$:

$$i_t = \mathrm{sigmoid}\left(W_i \cdot [h_{t-1}, s_t] + b_i\right), \qquad z_t = \tanh\left(W_z \cdot [h_{t-1}, s_t] + b_z\right).$$

Here, tanh is also an activation function, $\tanh(s) = (e^{s} - e^{-s})/(e^{s} + e^{-s})$, $W_i$ and $b_i$ are the parameters of the input gate layer, and $W_z$ and $b_z$ are the parameters of the tanh layer.
We then update the old cell state $c_{t-1}$ into the new cell state $c_t$. Specifically, we multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then, we add $i_t * z_t$, which determines how much new information is learned. Thus, the new cell state can be calculated from the following formula:

$$c_t = f_t * c_{t-1} + i_t * z_t.$$

Finally, we introduce the output gate layer, which decides what is going to be output. This layer is a sigmoid layer whose output $o_t$ is multiplied by the new cell state $c_t$, passed through a tanh layer, to get the new hidden state $h_t$ at time $t$:

$$o_t = \mathrm{sigmoid}\left(W_o \cdot [h_{t-1}, s_t] + b_o\right), \qquad h_t = o_t * \tanh(c_t).$$

Here, $W_o$ and $b_o$ are the parameters of the output gate layer. It is worth noting that all parameters in the LSTM are updated during the learning process.
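The gate computations described in this section can be put together into a single LSTM step. The sketch below follows the standard LSTM formulation in plain Python, with the gates acting on the concatenation of the previous hidden state and the current input. The parameter dictionary layout and the helper `zero_params` are assumptions made for illustration, not part of the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vec):
    """Compute W @ vec + b, where W is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def zero_params(hidden_size, input_size):
    """Hypothetical helper: all-zero parameters for the four layers of the cell."""
    width = hidden_size + input_size
    params = {}
    for gate in ("f", "i", "z", "o"):
        params["W_" + gate] = [[0.0] * width for _ in range(hidden_size)]
        params["b_" + gate] = [0.0] * hidden_size
    return params


def lstm_step(s_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state h_t and cell state c_t."""
    concat = list(h_prev) + list(s_t)                       # [h_{t-1}, s_t]
    f = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], concat)]    # forget gate
    i = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], concat)]    # input gate
    z = [math.tanh(v) for v in affine(params["W_z"], params["b_z"], concat)]  # candidate input
    o = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], concat)]    # output gate
    c = [ft * cp + it * zt for ft, cp, it, zt in zip(f, c_prev, i, z)]        # c_t
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]                          # h_t
    return h, c


if __name__ == "__main__":
    params = zero_params(hidden_size=2, input_size=3)
    h, c = lstm_step([0, 1, 0], [0.0, 0.0], [1.0, -2.0], params)
    print(h, c)
```

With all-zero parameters every sigmoid gate outputs 0.5 and the candidate input is 0, so the cell state is simply halved, which makes the step easy to verify by hand.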

The Proposed Double Deep Recurrent Q-Network (DDRQN) for Dynamic Spectrum Access Algorithm
In the multichannel spectrum access problem, the number of possible channel states and of possible channel allocation policies grows exponentially with the number of channels. This leads to a huge computational complexity for optimal channel allocation and transmission probabilities, which becomes mathematically intractable as the network size increases. In this section, we develop a deep multiuser reinforcement learning algorithm based on a double deep recurrent Q-network (DDRQN) to solve multichannel allocation problems. DDRQN solves multiuser channel allocation in dynamic multichannel access without any prior experience.

Architecture of the Proposed Multiuser DDRQN Used in DDRQN Algorithm.
In this section, we describe the proposed architecture for the multiuser DSA used in DDRQN algorithm to solve the DSA problem. An illustration of the DDRQN is presented in Figure 1.
(1) Input Layer. The input $x(t)$ to the DDRQN is a combination of the channel states over the previous two time slots, i.e., $x(t) = [s_{(t-1)1}, \dots, s_{(t-1)K}, s_{t1}, \dots, s_{tK}]$, where $s_{tk} \in \{0, 1\}$ represents the state of the $k$th channel at time slot $t$. This design enriches the sequential input information (and increases the input dimensionality), which helps the estimated value function fit the actual value through more detailed information.
(2) LSTM Layer. We add an LSTM layer to our approach so that it can make full use of sequence information over time. This gives the network the ability to estimate the true state transition using the history of the process. This layer is responsible for learning how to aggregate experiences over time.

The centralized node has no prior knowledge about the characteristics of the environment and makes autonomous decisions in an online and centralized manner using the proposed neural network, learning efficient spectrum access policies from the ACK signals. At the beginning of time slot $t$, the agent collects the latest channel states over the past two time slots, $x(t)$. Then, the centralized node uses the ε-greedy method to choose $N$ actions (i.e., $N$ channels) according to the output of the online network, i.e., the top $N$ actions with the highest scores are selected, which can be expressed as

$$\text{actions} = \begin{cases} \text{randomly select } N \text{ of the } K \text{ valid actions without repeating}, & \mathrm{rand}() < \varepsilon, \\ \text{the top } N \text{ actions with the highest scores } Q(a), & \text{otherwise}, \end{cases} \qquad (12)$$

where $0 \le \varepsilon \le 1$ balances exploration and exploitation. The larger the value of $\varepsilon$, the more the algorithm tends to explore, and vice versa.
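The ε-greedy selection of N channels followed by random allocation to the SUs can be sketched as follows. This is a minimal illustration, assuming the network has already produced one score per channel; the function names and the `rng` parameter (any object with `random`, `sample`, and `shuffle`, such as the `random` module) are hypothetical.

```python
import random


def select_channels(q_values, num_sus, epsilon, rng=random):
    """Pick N distinct channels: random with probability epsilon, else top-N by score."""
    num_channels = len(q_values)
    if rng.random() < epsilon:
        # Exploration: N of the K valid actions, without repetition.
        return rng.sample(range(num_channels), num_sus)
    # Exploitation: indices of the N highest scores.
    ranked = sorted(range(num_channels), key=lambda a: q_values[a], reverse=True)
    return ranked[:num_sus]


def allocate_to_sus(channels, rng=random):
    """Randomly assign the selected channels; entry n is the channel of SU n."""
    shuffled = list(channels)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.7]          # made-up per-channel scores
    chosen = select_channels(scores, num_sus=2, epsilon=0.0)
    print(chosen, allocate_to_sus(chosen))
```

Because the selected channels are always distinct, two SUs never share a channel, which is how the centralized design avoids collisions between SUs.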
The centralized node randomly allocates one of the N channels to each SU. Next, the reward for each channel is obtained from the corresponding ACK signal. A tuple that contains the current state $x(t)$, the next state $x(t + 1)$, the action with the highest score, and the corresponding reward is stored in the replay memory. Finally, sample tuples selected randomly from the replay memory are used to update the neural network based on the mean squared error. The full framework is provided in Algorithm 1.
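The replay memory used in this procedure can be sketched as a bounded buffer with uniform random sampling. The class below is a generic illustration rather than the authors' implementation; the default capacity of 1,000 and minibatch size of 32 mirror the values reported in the simulation settings, and the class and field names are hypothetical.

```python
import random
from collections import deque, namedtuple

# One stored experience: (x(t), a_h(t), r_h(t), x(t+1)).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Fixed-size buffer that drops the oldest transition when full."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size=32, rng=random):
        """Return a uniformly random minibatch (smaller if the buffer is not yet full)."""
        size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3)
    for step in range(5):
        memory.push([0, 1], step, 1.0, [1, 0])
    print(len(memory), [t.action for t in memory.sample(2)])
```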

Simulation Results
In this section, we first introduce the simulation setting. Then, we test the learning capability of the proposed double deep recurrent Q-network (DDRQN) in the single SU scenarios and provide comparisons with deep Q-network (DQN), Q learning, random policy, and the optimal policy. Finally, we evaluate the performance of the proposed algorithm in the multiple SU scenarios and compare it with other algorithms.
6.1. Simulation Setting. In the proposed DDRQN algorithm, the centralized node is considered as an agent that consists of two neural networks: an online network and a target network. For each neural network, the first layer is an LSTM layer whose size is equal to the state size, and the last two layers are fully connected layers. The second layer has 100 neurons with ReLU as the activation function, and the last layer has K neurons, where K is the total number of channels. The update frequency of the target network is set to 100, which means that the parameters of the target network are updated from the online network every 100 time slots. We set the memory size to 1,000 so that the proposed algorithm has enough samples to optimize the neural network. In each time slot, a minibatch of 32 samples is randomly selected from the memory to train the neural network. We adopt the mean squared error as the loss function and use the Adam algorithm [25] to optimize the parameters of the neural network by minimizing the loss function. We set the learning rate of the two neural networks to 0.0001 and the discount factor γ to 0.9. In order to avoid falling into a local optimum, we employ the ε-greedy method to encourage the agent to explore the environment, and the exploration rate ε decreases linearly from 0.1 to 0.01 as the iterations proceed. We set N to 1 in the single SU scenarios, which means that the centralized node selects only one of the K channels for the SU in each time slot. We set the number of training episodes to 10 for DDRQN and the other comparison algorithms. In the following, we introduce the parameter settings of the other comparison algorithms.
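Two of the schedules mentioned above, the linear decay of ε and the periodic update of the target network, can be sketched as follows. The text does not state over how many iterations ε decays, so `total_steps` is an assumption, and the function names are hypothetical.

```python
def linear_epsilon(step, total_steps, start=0.1, end=0.01):
    """Exploration rate decaying linearly from start (step 0) to end (last step)."""
    if total_steps <= 1:
        return end
    fraction = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + fraction * (end - start)


def should_sync_target(step, period=100):
    """True on the time slots where the online parameters are copied to the target network."""
    return step > 0 and step % period == 0


if __name__ == "__main__":
    total = 1000  # assumed number of iterations, for illustration only
    for step in (0, 100, 500, 999):
        print(step, round(linear_epsilon(step, total), 4), should_sync_target(step))
```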
The Deep Q-Network (DQN) consists of two hidden layers and maintains a memory with a size of 1,000. In each time slot, a minibatch of 32 samples is randomly drawn from the memory to update the parameters of the neural network. For comparison with the DDRQN, the DQN has the same fully connected hidden layer structure, i.e., in our implementation, the DQN has two fully connected hidden layers of the same size.
In Q-learning, the centralized node evaluates the Q-value for each state-action pair and updates the value function according to the feedback reward. The relevant parameters in DQN and Q-learning, such as the discount factor and the learning rate, are equal to those of our proposed algorithm.
In the random policy, there is no learning: the centralized node randomly selects one channel for each SU at the beginning of each time slot, and all channels are accessed with the same probability.
In the optimal policy, we assume that the system dynamics are known to the centralized node, and the interrelationship between channels is ignored. At the beginning of each iteration, the centralized node selects one channel for the SUs. Then, for instance, when the channel state switching probability is 0.5 or greater, if the previously chosen channel is good, the centralized node chooses a channel in the next activated subset. On the other hand, if the previously chosen channel is bad, the centralized node chooses the previously chosen channel again. The centralized node adopts the reverse strategy if the channel switching probability is less than 0.5.

6.2. Average Reward in the Single SU Scenarios. In this section, we consider some single SU scenarios to verify the performance of the proposed learning algorithm. Our framework is compared with DQN, Q-learning, the random policy, and the optimal policy.
In this experiment, we consider a system of 16 channels in which only one channel is in the good state in each time slot. To evaluate the performance, we consider the average reward R for different Markov chain scenarios. To define a Markov chain for the channel distribution, we need to specify the order of the channel states and the state switching probabilities. We assume that, for each state, the probability that the current state transfers to the following state is p, and the probability that the current state is kept is 1 − p. Our experiment is divided into two cases.
(1) Round-Robin Switching Scenario. In this experiment, each channel from 1 to K becomes the good channel in turn according to round-robin scheduling. We consider five different switching probabilities p = {0.75, 0.80, 0.85, 0.90, 0.95}. A higher p makes the state switch more frequently, so that the agent can learn more complex environmental conditions. The round-robin switching scenario with switching probability p = 0.75 is depicted in Figure 2, where the channel in the good state is indicated by a white square at the corresponding channel index in a given time slot. The average reward and the probability of packet collision of the DDRQN algorithm and the other algorithms in this scenario are compared in Figure 3. In Figure 3, we observe that the optimal policy achieves the highest average reward and the lowest probability of packet collision; its average reward is close to the corresponding switching probability because the channel pattern is assumed to be known to the optimal policy. In addition, the average rewards of the random policy remain low, meaning that the average number of good channels per time slot is low and that this policy lacks adaptability. More interesting and competitive performance is displayed by the proposed DDRQN policy, the DQN policy, and Q-learning. We notice that as the channel switching probability increases, the average reward of these three algorithms increases gradually, while the probability of packet collision decreases. This is because the change of channel state becomes more deterministic as the switching probability increases. Thanks to the excellent fitting performance of the neural network, DQN obtains a higher average reward than Q-learning. Meanwhile, the performance of DDRQN is better than that of DQN because of the added LSTM layer.
(2) Arbitrary Switching Scenario. In the round-robin switching scenario, the channel states switch according to a specific switching order. This switching discipline is unknown to the proposed algorithm and is of course not used in the process of finding good channels. To verify the adaptability of the proposed algorithm to general switching modes, we consider many different switching sequences and fix the switching probability p at 0.9. One such switching pattern is depicted in Figure 4. The performance in 10 randomly generated arbitrary switching scenarios is shown in Figure 5. As can be seen from Figure 5, there is little difference in the average reward across the different arbitrary switching scenarios, demonstrating that our proposed algorithm has a certain robustness.

6.3. Frequency Hopping in the Multiple SU Scenario. In this section, we consider a frequency hopping communication scenario with m SUs, where m is equal to 2. We assume that there are 16 channels with 2 good channels in each time slot. The channels are evenly divided into 8 subsets. In each time slot, only the channels in one subset are good, and the channels in the remaining subsets are all in the bad state; the good channels switch sequentially to the next subset in the following time slot with probability p = {0.75, 0.80, 0.85, 0.90, 0.95}. For example, at time slot t, the channels in one subset k are good, and the channels in the remaining subsets are all bad. At time slot t + 1, with probability p, the channels in the following subset are good and the remaining channels are in the bad state; with probability 1 − p, the channels in the current subset k are still good and the remaining channels are in the bad state.
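The switching dynamics described in these scenarios can be reproduced with a small simulator. The sketch below is a hedged illustration, not the authors' simulation code: it assumes the number of channels is divisible by the subset size, that the first subset is active in the first slot, and that the order wraps around after the last subset. With a subset size of 1 it reduces to the single SU round-robin scenario.

```python
import random


def simulate_channels(num_channels, subset_size, p, num_slots, rng=random):
    """Return a list of channel-state vectors (1 = good, 0 = bad), one per time slot."""
    num_subsets = num_channels // subset_size   # assumes an even division
    active = 0                                  # assumed starting subset
    trace = []
    for _ in range(num_slots):
        state = [0] * num_channels
        for k in range(active * subset_size, (active + 1) * subset_size):
            state[k] = 1
        trace.append(state)
        # With probability p move to the following subset, otherwise stay.
        if rng.random() < p:
            active = (active + 1) % num_subsets
    return trace


if __name__ == "__main__":
    # Frequency hopping setting from the text: 16 channels, 8 subsets of 2 channels.
    for state in simulate_channels(16, 2, 0.9, 5, random.Random(0)):
        print(state)
```

Feeding two consecutive rows of such a trace to the agent corresponds to the two-slot input described for the DDRQN input layer.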
The performance of the proposed algorithm and the comparison algorithms in this case is shown in Figure 6. The optimal policy, which is based on known system dynamics, consistently performs well in terms of reward. The random policy has a low average reward because it cannot learn the channel dynamics. DQN and Q learning acquire similar rewards. However, DDRQN performs better than both DQN and Q learning.

The main steps of the training procedure are as follows:
1) For time slot t = 1, …, T do
2) Observe an input x(t) and feed it into the online network
3) Generate an estimate of the Q-value Q(a) for all available actions a ∈ {0, 1, …, K − 1} with the online network
4) Take N actions a_n(t) ∈ {0, 1, …, K − 1}, n ∈ {0, 1, …, N − 1}, with the ε-greedy method (according to (12)) and obtain the instantaneous reward r_n(t) for each SU
5) Observe an input x(t + 1)
6) Mark a_h(t), r_h(t) as the action and the reward with the highest score Q(a)
7) Store the tuple (x(t), a_h(t), r_h(t), x(t + 1)) in the replay memory
8) Sample a random minibatch of tuples (x_j, a_j, r_j, x_{j+1}) from the replay memory
9) Set
$$y_j^{\mathrm{DDQN}} = \begin{cases} r_j, & \text{for terminal } x_{j+1}, \\ r_j + \gamma\, Q\big(x_{j+1}, \arg\max_{a_{t+1}} Q(x_{j+1}, a_{t+1}, \theta_t), \theta_t^{-}\big), & \text{for non-terminal } x_{j+1}. \end{cases}$$
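Steps 4 and 9 of the procedure can be illustrated with a short sketch. It is not the authors' implementation. The exploration branch follows the ε-greedy description (N random actions without repetition when rand() < ε); the exploitation branch, which takes the N actions with the largest Q-values, is an assumption consistent with maximizing the sum of value functions. The target follows the standard double DQN form, in which the online network selects the next action and the target network evaluates it. The Q-values are passed in as plain lists, and the function names are illustrative.

```python
import random


def select_actions(q_values, n, epsilon, rng=random):
    """Pick n distinct actions out of K = len(q_values).

    Exploration (rand() < epsilon): n random actions without repetition.
    Exploitation (assumed here): the n actions with the largest Q-values.
    """
    k = len(q_values)
    if rng.random() < epsilon:
        return rng.sample(range(k), n)
    ranked = sorted(range(k), key=lambda a: q_values[a], reverse=True)
    return ranked[:n]


def double_dqn_target(reward, next_q_online, next_q_target, gamma, terminal):
    """Double DQN target for a single transition.

    next_q_online: Q-values of the next state from the online network.
    next_q_target: Q-values of the next state from the target network.
    """
    if terminal:
        return reward
    # The online network selects the action, the target network scores it.
    best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best]
```

For example, `select_actions([0.1, 0.9, 0.5, 0.7], 2, 0.0)` returns `[1, 3]`, the two highest-valued actions. Decoupling action selection from action evaluation in `double_dqn_target` is what distinguishes the double DQN target from the plain DQN target, which would take the maximum of `next_q_target` directly.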

Real Data Trace
This section uses the data provided in [12] to carry out simulation experiments on a real data trace. The data were collected from the Tutornet platform of the University of Southern California, which consists of TelosB, MicaZ, and OpenMote nodes that comply with the IEEE 802.15.4 MAC protocol. There are 8 Wi-Fi access nodes in the Tutornet platform, which simulates the dynamic multichannel access scenario well. This section uses the data to test the proposed DDRQN algorithm and the other algorithms with 2 SUs. The simulation results on the real data trace are shown in Figure 7.
As can be seen from Figure 7, although the average reward of the random policy remains constant as the number of episodes increases, it is significantly larger than in the two scenarios above. This indicates that there are more good channels per time slot in this scenario. As the number of training episodes increases, the performance of all algorithms except the random policy improves, which means that these algorithms can adapt to the real data trace. The performance of DQN and Q learning in the first episode is lower than that of the random policy, indicating that the real data trace is more complex than the above scenarios.
Meanwhile, for the same number of training episodes, the proposed DDRQN algorithm performs better than DQN, and Q learning performs worst. In this communication scenario, the variation of the standard deviation of the average reward of DDRQN with the number of training episodes is shown in Figure 8. As can be seen from Figure 8, the standard deviation of the proposed algorithm is relatively low. As the number of training episodes increases, the standard deviation of the average reward of the proposed algorithm gradually decreases, indicating improved stability and convergence.
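One way to obtain such a curve is to repeat training several times and compute, for each episode, the standard deviation of the average reward across the runs. The sketch below assumes this across-run definition, which the text does not specify, and uses made-up numbers purely for illustration; they are not results from the experiments.

```python
from statistics import mean, pstdev


def reward_std_per_episode(runs):
    """Per-episode mean and standard deviation across independent runs.

    runs[i][e] is the average reward of run i in episode e.
    Returns (means, stds), each with one value per episode.
    """
    episodes = list(zip(*runs))
    return [mean(e) for e in episodes], [pstdev(e) for e in episodes]


# Hypothetical average rewards of three runs over three episodes.
runs = [
    [0.50, 0.70, 0.80],
    [0.30, 0.66, 0.80],
    [0.40, 0.74, 0.80],
]
means, stds = reward_std_per_episode(runs)
```

In this toy input the runs agree more closely in later episodes, so the entries of `stds` shrink from one episode to the next, which is the qualitative behavior described for Figure 8.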

Conclusion
In this work, a double deep recurrent reinforcement learning algorithm is proposed and implemented to improve spectrum utilization in dynamic spectrum access. The proposed algorithm has an LSTM layer to exploit sequential information and two fully connected hidden layers to evaluate the value function. We tested the single SU scenarios in the round-robin and arbitrary switching cases and compared the average reward with that of DQN, Q learning, the random policy, and the optimal policy. Then, we tested the multiple SU scenarios with frequency hopping and the real data trace. Experimental results show that the proposed algorithm enables secondary users to obtain more spectrum access opportunities and to adapt quickly to the dynamic spectrum access environment. In addition, the proposed algorithm avoids collisions between SUs in the multiple SU scenarios.

Data Availability
The data for the real communication scenario in Section 6 were collected from the Tutornet platform of the University of Southern California. More information about the testbed is available at http://anrg.usc.edu/www/tutornet/. The data can be found in [6] of this paper, i.e., E. Selvi, R. M. Buehrer, A. Martone, and K. Sherbondy, "Reinforcement Learning for Adaptable Bandwidth Tracking Radars," in IEEE Transactions on Aerospace and Electronic Systems, vol. 56, no. 5, pp. 3904-3921, Oct. 2020, doi:10.1109/TAES.2020.2987443.

Conflicts of Interest
The authors declare that they have no conflicts of interest.