Joint Channel Allocation and Power Control Based on Long Short-Term Memory Deep Q Network in Cognitive Radio Networks

Efficient spectrum resource management in cognitive radio networks (CRNs) is a promising approach to improving the utilization of spectrum resources. In particular, power control and channel allocation are top priorities in spectrum resource management. Nevertheless, the joint design of power control and channel allocation is an NP-hard problem, and research on it is still at a preliminary stage. In this paper, we propose a novel joint approach based on a long short-term memory deep Q network (LSTM-DQN). Our objective is to obtain the channel allocation schemes of the access points (APs) and the power control strategies of the secondary users (SUs). Specifically, the received signal strength information (RSSI) collected by the micro base stations is used as the input of the LSTM-DQN, so that the collected RSSI can be shared among users. After training is completed, the APs are capable of selecting channels with little interference, while the SUs may access the authorized channels in an underlay operation mode without any prior knowledge about the primary users (PUs). Experimental results show that channels are allocated to the APs with a lower probability of collision. Moreover, the SUs can adjust their power control strategies quickly to avoid harmful interference to the PUs when the environment parameters change randomly. Consequently, the overall performance of CRNs and the utilization of spectrum resources are improved significantly compared to existing popular solutions.


Introduction
Cognitive radio networks (CRNs), also known as cognitive wireless networks (CWNs), are formed when cognitive radio devices are organically connected through cognitive base stations. Spectrum resource management is one of the basic tasks of CRNs; it aims to achieve high utilization of the spectrum resource by dividing it into a group of channels or resource blocks and designing proper management strategies. Faced with the increasing demand for mobile data capacity, channel allocation and power control play a key role in spectrum resource management [1,2].
Spectrum resource management determines the most suitable channels for secondary users (SUs), based on an analysis of the available channels, without affecting the communication of primary users (PUs). Currently, optimization and game theory have been widely used in spectrum management. In [3], spectrum sharing was performed according to the interference temperature and the radio frequency (RF) power per unit of bandwidth measured at the receiving antenna. The optimal solution can be obtained by the particle swarm optimization (PSO) algorithm if the objective function is convex. In addition, simulated annealing (SA) was applied to prevent falling into suboptimal solutions. Three improved PSO algorithms, namely, binary PSO, sociocognitive PSO, and the derivation zero algorithm, were proposed in [4], and the throughput of SU links was compared under interference constraints. The spectrum access algorithm proposed in [5] improved the throughput and spectrum sensing ability of the network system by formulating a Lagrange dual optimization problem and derived the optimal power allocation strategy and target detection probability. In research on spectrum resource management based on game theory, the core idea is to obtain an equilibrium in the optimal distribution of spectrum resources among SUs. In [6], the double auction model from microeconomic theory was used in TV band transactions between TV broadcasting companies and wireless regional area network (WRAN) service providers. For WRAN service providers, the spectrum bidding and pricing problems were formulated as a noncooperative game model, and the Nash equilibrium was obtained. Tehrani and Uysal [7] proposed a sealed-bid first-price auction model, aiming to maximize the revenue of the service provider and the satisfaction of SUs under incomplete spectrum sensing conditions. Tan et al. [8] considered cooperative and noncooperative spectrum access schemes based on a threshold policy.
Experimental results showed that, in noncooperative cases, the optimal scheme met the Nash equilibrium.
Existing work using optimal control or game theory often assumes that users in the wireless network have obtained complete environmental state information. However, such information is difficult, if not impossible, to obtain in complex and dynamic scenarios, so in many cases a solution has to be derived from partial environmental information. Inspired by the emerging field of artificial intelligence, reinforcement learning and neural networks provide a new tool to tackle challenges in CRNs [9][10][11][12]. Deep reinforcement learning (DRL) combines the model-free feature of reinforcement learning (RL) with the ability of deep learning (DL) to process data in spectrum resource management. The potential advantages of applying DRL to spectrum resource management are threefold. First, the optimal solution for decision-making problems can be obtained through trial and error, and the cycle of manual spectrum planning is greatly reduced, so CRNs can learn efficient spectrum resource management solutions. Second, it is possible to simulate complex real-world scenarios that are difficult to model mathematically and to constantly accumulate new experience to adapt to various extreme situations. Third, real-time monitoring of dynamic environments, mining of potentially important data and information, and improvement of the performance of CRNs can be achieved. These advantages have boosted a number of research works [13][14][15][16][17]. For instance, Wan and Cohen [14] proposed a distributed dynamic spectrum access algorithm based on deep multiuser reinforcement learning, aiming at maximizing network utility in multichannel wireless networks. At each time slot, each SU mapped its current state into a spectrum access action by using the trained deep Q network (DQN). Experimental results showed that, in some observable environments, SUs were able to learn good control strategies that ensure network performance without using online acknowledgement (ACK) signals. Liu et al.
[16] adopted a multiagent DQN technology, which further optimized the learning process by combining the DQN algorithm with transfer learning so that SUs of the new access network could obtain more experience and knowledge.
In spite of the aforementioned research, spectrum resource management based on DRL is still in its infancy. Existing results reveal that the state information of the channels has a high degree of self-correlation [18,19]; however, the correlated states may be separated from the current state by a considerable time interval. There is still a large gap in the study of this problem. Considering the distinctive network structure of long short-term memory, it is possible to exploit such self-correlation and make a better estimate of the state of the channels. Motivated by the limitations of the current state of the art and by the joint design problem of channel allocation and power control for spectrum resource management, this paper proposes a long short-term memory deep Q network- (LSTM-DQN-) based joint channel allocation and power control algorithm, which helps to achieve spectrum utilization flexibility by sharing the received signal strength information (RSSI) among users. Additionally, we consider that PUs may have multiple alternative power control strategies rather than a single one and may choose the appropriate strategy dynamically according to the changing environment. The evaluations show that adjacent access points (APs) access the available channels without conflict, whereas the SUs optimize their power control strategies to avoid harmful interference to the PUs. The remainder of this paper is organized as follows. Section 2 introduces the system model and formulates the problem to be solved. The implementation of the proposed algorithm is discussed in Section 3. Section 4 describes the simulation experiments and result analysis, and finally, the conclusion and future work are presented in Section 5.

System Model.
The channel allocation problem arises because a huge number of wireless devices access a limited spectrum space. In such a problem, there is no one-to-one mapping between channels and APs. The main challenges are adjacent channel interference (ACI) and co-channel interference (CCI). For the joint optimization of channel allocation and power control, it is necessary to consider not only the transmit power of the primary and secondary users but also the selection of channels at different access points and their possible conflicts with each other.
The system model we focus on in this paper is shown in Figure 1. There are 5 APs deployed in the scenario, and each AP serves several primary and secondary users distributed randomly within its communication range. We allow overlap between APs. For instance, the service ranges of AP1 and AP2 overlap with each other, and so do those of AP3 and AP4. In contrast, AP5 is independent of the others. Within the service range of each AP, the PUs always transmit data on their authorized channels, whereas the SUs are only allowed to access channels without affecting the communication of the PUs. The base station in the middle is mainly responsible for the communication of the PUs. Meanwhile, microcells assist the SUs in controlling their transmit power. These microcells collect the RSSI of the primary and secondary users, package the collected information into packets occupying a few bytes, and then send them to the SUs through a dedicated control channel. It is assumed that each PU adjusts its transmit power according to its own control strategy and always transmits data on its authorized channel. Both PUs and SUs are ignorant of the others' power control strategies. To be more specific, the PUs are never concerned about the existence of the SUs. Therefore, the SUs need to learn appropriate transmit power strategies by utilizing the RSSI, so as to accomplish their own transmission tasks.

Problem Formulation.
In the joint optimization of channel allocation and power control, the first thing to determine is whether the same channel may be selected by different APs. In this paper, this is not allowed; i.e., we consider the case of no channel conflicts. Based on this assumption, the transmit powers and control strategies of the primary and secondary users are then determined. Table 1 specifies the symbols used in this paper. The set of APs is denoted as P, and the set of available channels as C. Each AP can only use one channel. The channel allocation is described by a binary matrix ρ, whose element ρ_{c,p} ∈ {0, 1} indicates whether AP p occupies channel c, where c ∈ {1, 2, . . . , |C|} and p ∈ {1, 2, . . . , |P|}. Accordingly, we define the interference matrix Ω_{|P|×|P|}, each element of which is given by Ω_{p,q} = 1 if adjacent APs p and q occupy the same channel, and Ω_{p,q} = 0 otherwise.
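To make this bookkeeping concrete, the interference matrix can be computed from the channel assignments and the AP adjacency. The sketch below uses the paper's 5-AP example topology (AP1–AP2 and AP3–AP4 overlap); the variable names and channel count are our own illustrative choices:

```python
import numpy as np

# Sketch of the channel/interference bookkeeping from the problem
# formulation. The adjacency structure follows the paper's example
# (5 APs; AP1-AP2 and AP3-AP4 overlap); names and values are ours.

NUM_APS = 5

# adjacency[p][q] = True if the service ranges of APs p and q overlap
adjacency = np.zeros((NUM_APS, NUM_APS), dtype=bool)
for p, q in [(0, 1), (2, 3)]:
    adjacency[p, q] = adjacency[q, p] = True

def interference_matrix(channel_of_ap, adjacency):
    """Omega[p, q] = 1 iff adjacent APs p and q occupy the same channel."""
    n = len(channel_of_ap)
    omega = np.zeros((n, n), dtype=int)
    for p in range(n):
        for q in range(n):
            if p != q and adjacency[p, q] and channel_of_ap[p] == channel_of_ap[q]:
                omega[p, q] = 1
    return omega

# AP1 and AP2 collide on channel 0; AP3 and AP4 use distinct channels
channels = [0, 0, 1, 2, 1]
omega = interference_matrix(channels, adjacency)
```

A conflict-free allocation is then simply one for which every entry of Ω is zero.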
In order to measure the service quality, the SINRs of the primary and secondary users need to be defined. We assume that the users are able to communicate only if the relevant adjacent APs access their channels successfully. The SINR of PU i in AP p at time t, and similarly the SINR of SU j in AP p at time t, are defined from the received signal power and the interference plus noise at the corresponding receiver. In multichannel scenarios, both the available channels and the channel gains change with time. Therefore, the problem becomes dynamic and thus more complicated. The throughput of a single SU j in AP p at time t follows from its SINR, and the objective is to maximize the total throughput of all SUs.
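The SINR and throughput equations themselves were lost in extraction. A standard underlay formulation consistent with the definitions above would read as follows, where the channel gains g, the noise power σ², and the bandwidth B are our notational assumptions rather than the paper's exact symbols:

```latex
\mathrm{SINR}_{i}^{p}(t) =
  \frac{P_{i,p}(t)\, g_{i,p}(t)}
       {\sum_{j} P_{j,p}(t)\, g_{j,p}(t) + \sigma^{2}},
\qquad
\mathrm{SINR}_{j}^{p}(t) =
  \frac{P_{j,p}(t)\, g_{j,p}(t)}
       {\sum_{k \neq j} P_{k,p}(t)\, g_{k,p}(t) + P_{i,p}(t)\, g_{i,p}(t) + \sigma^{2}},
```

with the per-SU throughput and the objective given by

```latex
T_{j}^{p}(t) = B \log_{2}\!\bigl(1 + \mathrm{SINR}_{j}^{p}(t)\bigr),
\qquad
\max \; \sum_{p=1}^{|P|} \sum_{j} T_{j}^{p}(t).
```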

Deep Reinforcement Learning-Based Framework
Due to the widespread application of CRNs, the network structure is becoming more and more complex, and it is difficult to establish a corresponding mathematical model to simulate a highly complex network environment. Model-free RL can effectively address this problem. In recent years, DRL has shown an excellent ability to deal with complex problems and data operations. Therefore, this paper focuses on the application of DRL in spectrum resource management, especially the joint optimization of power control and channel allocation, to improve the robustness and adaptability of CRNs.

Description of RL.
Model-free learning is a class of RL methods based on continuous interaction with the (possibly simulated) environment. In general, RL formulates the problem as a Markov decision process (MDP). At every moment t, the agent observes the current state of the environment s ∈ S and then selects an action a ∈ A. After the action is executed, the environment transitions with a certain probability P_ss′(a) to a new state s′ ∈ S. Meanwhile, the environment feeds back a reward value r ∈ R to the agent. The schematic diagram is shown in Figure 2. In short, RL aims to find the best strategy by maximizing the cumulative reward value through a limited number of steps [9].
To use RL to solve the joint design problem in CRNs, a tuple (S, A, R) should be defined in advance, where S represents the set of environmental states, A is the set of SU actions, and R: S × A ⟶ R denotes the reward obtained when taking an action in the current state.
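The interaction loop described above can be illustrated with a minimal tabular Q-learning sketch: observe a state, pick an action ε-greedily, receive a reward and next state, and update Q. The toy two-state environment below is ours, for illustration only; it is not the paper's CRN model.

```python
import random

def step(state, action):
    """Toy environment: action 1 in state 0 pays off and moves to state 1;
    everything else yields zero reward and returns to state 0."""
    if state == 0 and action == 1:
        return 1, 1.0   # (next state, reward)
    return 0, 0.0

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
alpha, gamma, epsilon = 0.5, 0.9, 0.1
random.seed(0)

state = 0
for _ in range(500):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # one-step Q-learning update toward r + gamma * max_a' Q(s', a')
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state
```

After training, the learned Q-values favour the rewarding action in state 0, which is exactly the "best strategy through cumulative reward" behaviour the MDP formulation targets.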

State Space.
There are 5 APs deployed in the network environment, with several primary and secondary users around each AP. The SUs can only obtain incomplete environmental information at the APs to implement their transmission tasks. Assuming that L microcells are responsible for collecting the RSSI of the primary and secondary users in the service area of each AP, a total of 5L microcells are distributed in the whole network environment. We adopt a discretized-time model. According to the nonfree space propagation model [20], the RSSI collected by the microcells in the area served by AP p at time slot t is the vector of measurements s_{l,p}(t), l ∈ {1, . . . , L}. Therefore, the RSSI of these 5 APs is integrated and used as the input layer of the LSTM-DQN.

Action Space.
We add the set of SU transmit powers into the action space; the action of all SUs in AP p at time t is the vector of their transmit powers, where P_{j,p}(t) represents the transmit power of SU j in AP p. The action value of all APs in the whole network environment is the concatenation of these per-AP actions.
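The state assembly described above can be sketched as follows. The shapes (5 APs, L microcells per AP) follow the paper; the flattening order, the dBm value range, and the helper names are our assumptions:

```python
import numpy as np

# Sketch of how the network state could be assembled from per-microcell
# RSSI readings before being fed to the LSTM-DQN input layer.

NUM_APS, L = 5, 4
rng = np.random.default_rng(0)

def collect_rssi(num_aps, num_microcells):
    """Stand-in for the RSSI packets reported over the control channel
    (random dBm values here; real values come from measurements)."""
    return rng.uniform(-90.0, -30.0, size=(num_aps, num_microcells))

def build_state(rssi):
    """Concatenate the per-AP RSSI vectors into one flat input vector."""
    return rssi.reshape(-1)

state = build_state(collect_rssi(NUM_APS, L))  # shape (5 * L,)
```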

Reward Function.
For the problem of channel allocation and power control, the constraints must first be considered. Two events are distinguished: I_1, in which AP p accesses an available channel, every PU satisfies SINR_i ≥ μ_i, at least one SU satisfies SINR_j ≥ μ_j, and the PU power dominates the SU powers (P_i ≥ Σ_j P_j); and I_2, in which AP p accesses an available channel but some PU is harmed, i.e., SINR_i ≤ μ_i for some i ∈ {1, 2, . . . , N}. The reward function of the whole network system is the mean value of the rewards obtained by all APs.

Power Control Strategy of PUs.
We consider that the PUs can adjust their transmit power according to a specified control strategy and always transmit data on their authorized channels. The typical power control strategy proposed in [21] updates the transmit power through a discretization operator D(x), where the value of D(x) is no less than the minimum value of x according to the predefined range of the discretization threshold.
We also adopt the more intelligent strategy proposed in [22], in which the power update depends on the SINR of PU i predicted for time t + 1.
When a PU applies the intelligent control strategy of equation (15), it uses the current SINR at time t and the predicted SINR at time t + 1 and needs to adjust its own transmit power only once. The advantage of this intelligent strategy is therefore that it reduces the extra energy consumption caused by frequent power switching. At the same time, it takes the estimated trend into account to decide whether the PU should adjust its transmit power, giving it a spectrum prediction capability.
In order to cope with the complexity of the network environment, the PUs may have multiple alternative power control strategies rather than a single one and choose the appropriate strategy according to the actual situation. Equation (14) is denoted as power control strategy 1 of the PU, and equation (15) as strategy 2. We discuss and analyse these strategies in detail in the experiments in Section 4.
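The two behaviours can be sketched as follows. The exact update rules of [21] and [22] were lost in extraction, so the threshold value, the discretized power levels, and the simple linear SINR predictor below are illustrative assumptions, not the cited papers' formulas:

```python
POWER_LEVELS = [0.1, 0.5, 1.0, 2.0]  # watts, discretized (assumed)
SINR_TARGET = 10.0                   # dB, assumed PU threshold

def strategy_1(power_idx, sinr_now):
    """Reactive strategy: step the discretized power up when the current
    SINR is below target and down when above, one level per slot."""
    if sinr_now < SINR_TARGET and power_idx < len(POWER_LEVELS) - 1:
        return power_idx + 1
    if sinr_now > SINR_TARGET and power_idx > 0:
        return power_idx - 1
    return power_idx

def strategy_2(power_idx, sinr_now, sinr_prev):
    """Predictive strategy: extrapolate the SINR trend one slot ahead and
    adjust only if the *predicted* SINR violates the target, so the PU
    switches power at most once instead of reacting every slot."""
    sinr_next = sinr_now + (sinr_now - sinr_prev)  # linear trend estimate
    if sinr_next < SINR_TARGET and power_idx < len(POWER_LEVELS) - 1:
        return power_idx + 1
    if sinr_next > SINR_TARGET and power_idx > 0:
        return power_idx - 1
    return power_idx
```

The predictive variant captures the energy-saving idea described above: a PU whose SINR is still adequate but trending downward raises its power once, pre-emptively, rather than oscillating around the threshold.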

LSTM-DQN-Based Joint Channel Allocation and Power Control Algorithm.
LSTM is a special recurrent neural network (RNN) [23]. As shown in Figure 3, an LSTM unit mainly includes a forget stage, a selective memory stage, and an output stage, realized through the forget gate, input gate, and output gate, respectively. The core of LSTM is to control the cell state through these three interacting gates. It can retain important but implicit knowledge for a long time and discard unnecessary information. Therefore, it shows excellent performance in mitigating the vanishing and exploding gradient problems that arise when training on long sequences.
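The three gates described above follow the standard LSTM formulation [23]; for reference, the per-step updates are (standard notation, not reproduced from the paper):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```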
On the one hand, it has been verified that the state information of the channels has a high degree of self-correlation, which may span a considerably long time interval from the current state [24]. On the other hand, there is great potential to improve the probability of successfully accessing the channels owing to the unique network structure of LSTM, because LSTM can effectively capture valuable knowledge that is not obvious. To track this implicit correlation over a long period of time, we combine LSTM with DQN (as shown in Figure 4) to integrate the collected partial information and obtain better control strategies through offline learning. Once the training phase is completed, the users only need to communicate with the central unit to slightly adjust the weights of the neural network. At each moment, the APs select the available channels and the SUs choose the optimal transmit power according to the trained DQN. The specific procedure is shown in Algorithm 1.
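The experience-replay mechanics underlying Algorithm 1 can be sketched as follows: store (state, action, reward, next state) tuples in a bounded memory and start sampling minibatches once the memory is half full, matching the training setup described in the experiments. The Q-network itself is omitted here; `q_values` is a stand-in for the LSTM-DQN forward pass, and the helper names are ours.

```python
import random
from collections import deque

MEMORY_CAPACITY = 1000   # memory bank size from the experimental setup
BATCH_SIZE = 128         # training batch size from the experimental setup
memory = deque(maxlen=MEMORY_CAPACITY)

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def store(state, action, reward, next_state):
    """Save one experience tuple d(t) to memory D."""
    memory.append((state, action, reward, next_state))

def sample_batch():
    """Return a random minibatch, but only once the memory holds at least
    half its capacity (training is deferred until then)."""
    if len(memory) < MEMORY_CAPACITY // 2:
        return None
    return random.sample(memory, min(BATCH_SIZE, len(memory)))
```

In the full algorithm, each sampled tuple is used to regress Q(Input(l), Action(l); θ) toward the target R(l) + γ max_a′ Q(Input(l + 1), a′; θ⁻).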

Performance Evaluation
In this section, we evaluate the performance of our proposed algorithm through simulation-based experiments.
The channel gain is computed according to the nonfree space propagation model of [20], where the path loss index τ = 4, G_t and G_r are the gains of the transmit and receive antennas, respectively, and h_t and h_r are the heights of the transmit and receive antennas, respectively. In order to simulate complex changes of the environment, the number of iterations is set to 40,000. Furthermore, the positions of the primary and secondary users in the environment as well as the channel gains are randomly re-initialized every 10,000 iterations. The LSTM-DQN is constructed with 5 hidden layers. The first hidden layer is the LSTM layer, and the middle 4 hidden layers are fully connected layers with 256, 128, 128, and 256 neurons, respectively. The activation function of the second, third, and fourth hidden layers is the ReLU function, and the activation function of the fifth hidden layer is the tanh function. The Adam algorithm is used to update the weights of the neural network. The size of the training batch is set to 128. The initial exploration probability of the ε-greedy algorithm is 0.8 and decreases linearly to 0 with the number of iterations. Moreover, the memory bank has a capacity of 1,000, and training does not start until it holds 500 samples or more.
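The path-loss equation itself was lost in extraction, but the parameters listed (τ = 4, antenna gains G_t, G_r, antenna heights h_t, h_r) match the classic two-ray ground-reflection model, sketched here as a plausible reconstruction rather than the paper's exact formula:

```python
def two_ray_received_power(p_t, g_t, g_r, h_t, h_r, d, tau=4):
    """Two-ray ground model: P_r = P_t * G_t * G_r * h_t^2 * h_r^2 / d^tau.
    With tau = 4, doubling the distance cuts received power by 2^4 = 16."""
    return p_t * g_t * g_r * (h_t ** 2) * (h_r ** 2) / (d ** tau)

# Illustrative values (ours): unit gains, 1.5 m antennas, 100 m vs 200 m
p_near = two_ray_received_power(1.0, 1.0, 1.0, 1.5, 1.5, d=100.0)
p_far = two_ray_received_power(1.0, 1.0, 1.0, 1.5, 1.5, d=200.0)
```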
Given the dynamics and complexity of the application environment, we consider PUs taking different power control strategies. In one case, the PUs adopt control strategy 2 alone; in the other, each time the environmental parameters are updated, each PU randomly chooses power control strategy 1 or 2. The proposed joint algorithm based on LSTM-DQN is compared with two benchmark algorithms: the original DQN-based algorithm and the priority memory combined with DQN- (PM-DQN-) based algorithm. Figure 5 shows the loss function of the different algorithms when the PUs adopt control strategy 2, and Figure 6 plots the loss function when the PUs employ mixed control strategies. It can be seen that all algorithms converge after iterative learning. Our LSTM-DQN algorithm exhibits large instantaneous fluctuations when the environmental parameters change, though it is slightly better than the benchmark in this respect; the PM-DQN-based algorithm fluctuates even less. This is because the PM greatly accelerates the convergence of the loss function by cutting off correlations, whereas the LSTM needs to correlate past experience, so its loss function does not converge to the minimum value as quickly. Nevertheless, this behaviour is meaningful for the joint problem of channel allocation and power control, which lacks the Markov property; we explain this from other aspects below. Figures 7 and 8 compare the cumulative rewards when the PUs adopt the single and the mixed control strategies, respectively. The results show that the reward of the benchmark algorithm keeps decreasing, whereas the cumulative rewards of our LSTM-DQN and of the PM-DQN-based algorithm are relatively stable; moreover, the reward of LSTM-DQN is higher.
It is worth noting that the cumulative reward of LSTM-DQN is close to, or slightly higher than, the horizontal line of 0, which indicates that the channel allocation and power control scheme still has room for further improvement in future work.

Algorithm 1: LSTM-DQN-based joint channel allocation and power control.
(1) Initialization: the capacity O of memory D; the transmit powers of the PUs and SUs, P_i,p(t) and P_j,p(t), respectively; the channel interference matrix Ω_|P|×|P|; the estimate LSTM-DQN weights θ = θ_0; the target LSTM-DQN weights θ⁻ = θ_0
(2) For episode = 1 to E do
(3) According to the initial state Input(0), the SUs select random actions Action(0) with probability ε; otherwise they choose actions Action(0) = max_a Q(Input(0), a; θ) with probability 1 − ε
(4) For t = 1 to T do
(5) The PUs update their transmit power according to their own power control strategies
(6) The SUs select random actions Action(t) with probability ε_t; otherwise they select the action Action(t) = max_a Q(Input(t), a; θ)
(7) Obtain the rewards R(t) and the next state Input(t + 1)
(8) Save the empirical data d(t) ≡ (Input(t), Action(t), R(t), Input(t + 1)) to memory D
(9) If t > O/2 then
(10) Select training samples d(l) randomly from D
(11) Calculate Q(l) = R(l) + γ max_a′ Q(Input(l + 1), a′; θ⁻)
(12) Use the gradient descent method to minimize the loss [Q(l) − Q(Input(l), Action(l); θ)]² and update the parameters θ

Figures 9 and 10 evaluate the switching success rate. Once a user is able to access the channel and successfully complete the transmission task within 20 switches, it is deemed a successful experience. It can be concluded from the simulation results that our LSTM-DQN ensures the highest success rate and adjusts its strategy rapidly when the environment parameters are updated randomly. Moreover, when the PUs adopt the mixed strategy, the proposed algorithm still shows excellent robustness and desirable generalization ability. Figures 11 and 12 depict the comparison of handover steps.
We observe that, regardless of the control strategies adopted by the PUs, the proposed algorithm guarantees that the optimal strategy can be found after an average of one handover. This helps reduce energy consumption and greatly improves the sensitivity of the users, who can react more quickly to changes in the real-time environment. Moreover, when the environmental parameters are updated, the proposed algorithm shows strong anti-interference performance and generalization ability. We then analyse the cumulative channel conflicts shown in Figures 13 and 14. When the PUs take the single control strategy, the proposed algorithm and the PM-DQN-based algorithm perform similarly. When the PUs employ the mixed strategy, the LSTM-DQN-based algorithm further reduces channel conflicts. This shows that the proposed algorithm has good potential for dealing with complex conditions.

Conclusion and Future Work
Aiming at the joint design problem of channel allocation and power control in CRNs, this paper proposed a novel algorithm based on LSTM-DQN. We analysed the feasibility and implementation process of the proposed algorithm. Through simulation-based experiments, the advantages of the LSTM-DQN-based algorithm were discussed and illustrated in terms of the loss function, reward function, success rate, handover steps, and cumulative channel conflicts. In particular, our proposed method outperformed the two other DQN-based competitors.
Our future work will involve using real data to verify the feasibility of the algorithm. Moreover, various factors of the environment, e.g., the mobility of users, can be taken into account so as to further study large-scale spectrum resource management problems.

Data Availability
The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest
The authors declare that there are no conflicts of interest.