Reinforcement Learning for Interference Coordination Stackelberg Games in Heterogeneous Cellular Networks

In future heterogeneous cellular networks with small cells, such as D2D and relay, interference coordination between macro cells and small cells should be addressed through effective resource allocation and power control. The two-step Stackelberg game is a widely used and feasible model for formulating the resource allocation and power control problem. In both the follower games for small cells and the leader game for the macro cell, the cost parameter is critical to the performance of the Stackelberg game. Previous studies have not adequately addressed the optimization of cost parameters. This paper presents a reinforcement learning approach that trains the cost parameters for better system performance. Furthermore, a two-stage pretraining plus ε-greedy algorithm is proposed to accelerate the convergence of reinforcement learning. The simulation results demonstrate that, compared with the three benchmark algorithms, the proposed algorithm can enhance the average throughput of all users and of cellular users by up to 7% and 9.7%, respectively.


Introduction
In existing and future cellular networks, i.e., 5G and beyond, small cells, such as device-to-device (D2D) and relay, are expected to form heterogeneous networks that increase system capacity and/or expand coverage [1]. D2D technology improves spectrum efficiency and reduces the power consumption of User Equipment (UE) by allowing two terminals to communicate with each other directly without the Base Station (BS). With in-band underlay operation, D2D pairs can reuse the same spectrum as cellular UEs (CUEs) at the same time, which makes more resources available but also increases interference between CUEs and D2D pairs. Additionally, relay technology is applied to relay UEs (RUEs) that are far away from the BS. The RUEs can connect to the relay node (RN) by reusing the spectrum resources of the CUEs, and finally, the RN sends the information to the BS.
In recent literature on interference coordination in heterogeneous cellular networks (e.g., [2][3][4][5][6][7][8][9][10][11][12][13][14][15]), several approaches are discussed, namely heuristic algorithms, convex optimization, intelligent optimization, reinforcement learning, and game theory. The authors in [2][3][4] proposed heuristic algorithms. In [2], by limiting the minimum Signal-to-Interference-plus-Noise-Ratio (SINR), each BS controls the distance between links reusing the same frequency by calculating the minimum restricted area where frequency reuse is not allowed. The authors in [3] divided the available spectrum into the inner region frequency and the outer region frequency. By limiting the reusable region, a cell sectorization method was proposed to solve the resource allocation and power control problems. In [4], user association of D2D communication is formulated based on maximizing received power. Then, a sequential max search-based algorithm was developed to solve the resource allocation problem.
However, heuristic algorithms with low complexity can hardly reach the optimal solution.
In addition, some researchers proposed convex optimization algorithms to deal with interference coordination problems. An optimization problem of network throughput for D2D underlaying cellular networks was formulated in [5] and solved by a convex function while ensuring the quality-of-service (QoS) constraints. The authors in [6] studied the power control problem in D2D heterogeneous cellular networks based on partial frequency reuse and proposed a dynamic power control scheme built on partial power control. With the goal of maximizing system capacity, the objective function of power control is established, the nonconvex function is transformed into a convex function, and an improved Lagrangian dual decomposition method is introduced to reduce the algorithm complexity. The interference coordination algorithms based on convex optimization can approximate optimal solutions by establishing optimization models and solving them via convex optimization algorithms. However, the idealized optimization model and the NP-hard problem in the solving process make these algorithms impracticable.
Some researchers also tried to use intelligent optimization algorithms to solve the NP-hard problem in optimization. Authors in [7] proposed a joint resource allocation and user matching scheme based on genetic algorithm to minimize interference and maximize spectrum efficiency, which used a limited number of resource blocks to serve a large number of users. In [8], a simple particle swarm optimization algorithm for resource allocation was proposed to improve the system capacity performance. The simulation shows that with 10 particles, the proposed scheme can obtain suboptimum performance with quick convergence.
The intelligent optimization algorithms still model the optimization target as an idealized mathematical model, which cannot perfectly describe real scenarios. Consequently, reinforcement learning (RL) has attracted more attention as a new way to solve the interference coordination problem. RL is a machine learning paradigm in which agents measure the quality of their actions through the reward received in an episode and choose actions according to their states to maximize long-term returns. For example, a scheme using Q learning was proposed in [9] to allow a small cell to learn the appropriate transmitting power so as to reduce interference between the small cell and the BS. The authors in [10] investigated multiagent Q learning for D2D users (DUEs) selecting frequency resources in multilayer D2D heterogeneous networks. The open issues in these two methods are shortening the learning process and avoiding locally optimal solutions.
Game theory is also expected to solve the interference coordination problem in heterogeneous cellular networks. The authors in [11] formulated the cochannel interference coordination problem between the D2D link, the micro cell link, and the macro cell link as a potential game problem. The players' strategies are updated iteratively by message passing to achieve the Nash equilibrium. The relay selection problem was modeled as a noncooperative game in [12], which was used to improve spectral and energy efficiency. The authors in [13] utilized the Nash bargaining model to deal with the frequency reuse problem between D2D and the macro cell. A bargaining factor is introduced for performance optimization, which is solved by a maximum weight maximum stream algorithm and the Lagrange multiplier method. A cooperative game (CG) theory-based resource allocation in a cluster-based D2D communication network was discussed in [14]. In CG-D2D, a utility function for allocating the resources between the D2D pairs and the cluster was proposed.
Stackelberg game with pricing was introduced in [15]. BS evaluates the QoS of the CUEs and decides a price to be paid by the D2D pairs for reusing the resource of a CUE. The purpose is to allocate channel and power levels to D2D pairs and to optimize their transmission rates.
In a Stackelberg game-based interference coordination model of heterogeneous networks, the cost parameter is critical for leaders and followers, because it affects the results of the two-stage game. Therefore, many studies proposed cost parameter setting methods. The authors in [16] presented a manual adjustment method for the cost parameter that keeps both sides of the reaction function at the same order of magnitude and the value of the reaction function within a reasonable range. The cost parameters can also be determined by the channel state of the follower [17] or by the channel state of the leader [18]. In [19], an iterative strategy was proposed to improve the cost parameter, updating the global cost parameter in a fixed number of steps according to the result of the game. The authors in [20] argue that the cost parameters need to be set for each D2D link in each channel. In current research on interference coordination algorithms based on the Stackelberg game, most of the cost parameters are fixed or self-iteratively updated. At present, there are few explorations of more advanced ways to set the cost parameters, and thus, the effectiveness of the Stackelberg game is difficult to guarantee.
Centralized RL-based interference coordination methods, such as that in [9], require accurate channel state information from small cells, which imposes a heavy burden on networks. Meanwhile, RL-based distributed interference coordination methods, such as that in [10], are executed by each UE, which can consume excessive computing resources. Stackelberg game-based interference coordination methods [15][16][17][18][19][20] allow distributed follower games in each UE of the small cells; however, the important cost parameters for follower games have not been effectively optimized so far. Therefore, this paper focuses on applying reinforcement learning to the Stackelberg game model to address the interference coordination problem in D2D and relay heterogeneous cellular networks. The main contributions of this paper can be summarized as follows.
(1) An interference coordination architecture containing a reinforcement learning model and a Stackelberg game model is first introduced to model the interference coordination problem in D2D and relay heterogeneous cellular networks. This architecture allows distributed interference coordination in D2D pairs and RUEs based on local channel state information and centralized reinforcement learning in the BS to improve the performance of interference coordination.
(2) A reinforcement learning model is proposed to optimize cost parameters for the Stackelberg game model. The proposed Q-learning model defines the current resource reuse situation with CUEs as the state space, the cost parameters as the action space, and the utility changes of all links as the reward.
(3) A two-stage pretraining plus ε-greedy algorithm is proposed for a better update of the Q table. In the pretraining stage, the agent randomly picks the actions for dozens of episodes, and in the second stage, the agent switches between random action choice and best action choice according to the probability ε. This algorithm is aimed at faster convergence and better optimization of the Q table at the same time.
(4) The proposed interference coordination algorithm using two-stage Q learning for the Stackelberg game model, called RL game for short, shows its effectiveness in network throughput in comparison with benchmark algorithms in simulation. Better settings of the probability ε and the number of pretraining episodes are also found through simulation results.
The rest of this paper is organized as follows. Section 2 describes the system model and problem formulation of a heterogeneous cell. Section 3 explains the Stackelberg game for interference coordination. A reinforcement learning algorithm for cost parameters in the Stackelberg game is proposed in Section 4. In Section 5, the performance evaluation results are presented and discussed. Section 6 concludes the paper.

System Model
We consider a single heterogeneous cell, which is shown in Figure 1 [21]. A number of CUEs and DUEs are randomly distributed within the coverage area of the BS. In the 3GPP standards, resource reuse between underlying D2D links and CUE uplinks is given priority [21]. In the uplink, a set of CUEs $\mathcal{M} = \{1, \cdots, M\}$ and a set of RUEs $\mathcal{Q} = \{1, \cdots, Q\}$ in a macro cell communicate with the BS and RN, respectively. There exists a set of N DUE pairs, each comprising a D2D transmitter (DTx) and a D2D receiver (DRx). A D2D pair is able to communicate with each other by reusing a unit of CUE uplink resource. In-band RUE-RN links and some CUE-BS links also use the same RBs. The RN-BS link shares the RBs with the CUE-BS links orthogonally in the backhaul subframes to avoid self-interference in the RN. Thus, the CUE-BS link may suffer from interference from the RUE and the DTx. Similarly, the CUEs may also be interference sources to the DRx of the D2D links and the RNs of the RUE-RN links.
Therefore, the SINR of a CUE on a physical resource block (PRB) in the uplink is defined as follows:
where $P_{m,k}$ denotes the transmitting power of CUE m using PRB k, $P_{n,k}$ represents the DTx transmitting power of D2D pair n using PRB k, and $P_{q,k}$ indicates the transmitting power of RUE q using PRB k. Moreover, $\mathrm{PL}_m$ represents the path loss of the link between CUE m and the BS, $\mathrm{PL}_n$ denotes the path loss of the link between DTx n and the BS, and $\mathrm{PL}_q$ indicates the path loss of the link between RUE q and the BS. In addition, $N_{0,k}$ is the Gaussian white noise; $\alpha_{m,k}$, $\beta_{q,k}$, and $\gamma_{n,k}$ are binary variables, where 0 shows that PRB k is not used and 1 means that PRB k is used.
When D2D performs uplink communication, the SINR of a certain DRx on a certain PRB can be expressed as follows:
where $\mathrm{PL}_{m,n}$ represents the path loss of the link between CUE m and DUE n and $\mathrm{PL}_{q,n}$ denotes the path loss between DRx n and RUE q.
Similarly, when the RUE performs uplink communication on the access link, the SINR of a certain RUE on a certain PRB can be written as follows:
where $\mathrm{PL}_{m,q}$ represents the path loss of the link between CUE m and RUE q and $\mathrm{PL}_{n,q}$ denotes the path loss between DTx n and RUE q.
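As an illustration of how the quantities defined above combine, the following minimal Python sketch computes a CUE uplink SINR. It assumes the usual signal over noise-plus-interference form and treats each PL term as a linear-scale channel gain that multiplies the transmitting power; the function name `cue_sinr` and the tuple layout of `interferers` are hypothetical and chosen only for this example.

```python
def cue_sinr(p_m, pl_m, noise, interferers):
    """SINR of CUE m at the BS on one PRB (illustrative form, not the paper's exact equation).

    p_m, pl_m   : CUE transmitting power and CUE-to-BS gain (linear scale, assumed)
    noise       : noise power N_0 on the PRB
    interferers : iterable of (reuse_flag, power, gain_to_bs) for DTx/RUE links,
                  where reuse_flag plays the role of gamma_{n,k} or beta_{q,k}
    """
    # Only links whose binary reuse flag is 1 contribute interference.
    interference = sum(flag * power * gain for flag, power, gain in interferers)
    return p_m * pl_m / (noise + interference)


# One active DTx and one inactive RUE on the same PRB.
example_sinr = cue_sinr(2.0, 0.5, 0.1, [(1, 1.0, 0.4), (0, 5.0, 0.9)])
```

The DRx and RUE SINRs would follow the same pattern with the cross-link gains $\mathrm{PL}_{m,n}$, $\mathrm{PL}_{q,n}$, $\mathrm{PL}_{m,q}$, and $\mathrm{PL}_{n,q}$ in the interference term.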
In summary, the total data transmission rate of the cellular communication system on PRB k with a bandwidth B can be expressed as follows: The purpose of interference coordination in this paper is to maximize the system throughput for all links on each PRB, and thus, its objective function is defined as follows:

Wireless Communications and Mobile Computing
Subject to: $P_{q,\min} \le P_{q,k} \le P_{q,\max}$ and $P_{n,\min} \le P_{n,k} \le P_{n,\max}$, where $P_{q,\min}$ and $P_{q,\max}$ represent the minimum and maximum transmitting power, respectively, allowed by RUE q; $P_{n,\min}$ and $P_{n,\max}$ represent the minimum and maximum transmitting power, respectively, allowed by DTx n.
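The objective and its constraints can be sketched numerically as follows. This is a minimal illustration that assumes the per-PRB rate is the Shannon sum $B \sum \log_2(1 + \mathrm{SINR})$ over the links sharing the PRB; the helper names and the 180 kHz bandwidth in the example are assumptions, not values taken from the paper.

```python
import math


def prb_rate(bandwidth_hz, sinrs):
    """Total rate (bit/s) of all links sharing one PRB, given their linear SINRs."""
    return bandwidth_hz * sum(math.log2(1.0 + s) for s in sinrs)


def within_power_limits(power, p_min, p_max):
    """Check the box constraint P_min <= P <= P_max for a DTx or RUE."""
    return p_min <= power <= p_max


# Example: a CUE with SINR 1 and a D2D link with SINR 3 on an assumed 180 kHz PRB.
example_rate = prb_rate(180e3, [1.0, 3.0])
```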

Stackelberg Game-Based Interference Coordination
In order to execute distributed interference coordination, this study uses the two-stage Stackelberg game model to allow DUEs and RUEs to control their transmission power and to determine the resource allocation for both small cell UEs and CUEs. DUEs and RUEs play follower games with the information from the leaders, and CUEs compete with each other in a leader game in the BS. Our goal is to minimize interference to macro cells while ensuring the basic performance of small cells. Therefore, using the two-stage Stackelberg game to model interference coordination is in line with realistic needs.

Leader Utility Function.
In the leader game, the utility function involves CUE m, DTx n, and RUE q on PRB k and can be expressed as follows:
$$U_{m,n,q,k} = B \log_2\left(1 + \frac{P_{m,k}\,\mathrm{PL}_m}{N_{0,k} + P_{n,k}\,\mathrm{PL}_n}\right) + \lambda_m\left(\gamma_{n,k} P_{n,k}\,\mathrm{PL}_n + \beta_{q,k} P_{q,k}\,\mathrm{PL}_q\right),$$
where $\lambda_m$ denotes the cost parameter provided by each CUE m for any other underlay links. The reuse parameters $\gamma_{n,k}$ and $\beta_{q,k}$ cannot be 1 at the same time. In other words, DTx n and RUE q cannot reuse the PRB of the same CUE simultaneously.

Follower Utility Function.
In the follower game, the payment utility functions of DTx n and RUE q in PRB k can be expressed as follows: In the model proposed in this paper, neither resource reuse between DTx and RUE nor resource reuse between different D2D pairs is considered.

Power Control in Small Cells.
In small cells, the transmitting power is decided by setting the partial derivative of the follower utility function to 0. Taking the transmitting power of DTx n on PRB k as an example, take the partial derivative of Equation (8) with respect to $P_{n,k}$ and set it to 0. Thus, the DTx transmitting power $P^{m*}_{n,k}$ that maximizes Equation (8) can be derived for different cost parameters $\lambda_m$.
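The power-control step can be sketched in a few lines. Solving the first-order condition of Equation (8), $\frac{B}{\ln 2}\cdot\frac{\mathrm{PL}_n}{N_{0,k} + P_{m,k}\mathrm{PL}_m + P_{n,k}\mathrm{PL}_n} = \lambda_m \mathrm{PL}_n$, for $P_{n,k}$ gives $P_{n,k} = \left(\frac{B}{\lambda_m \ln 2} - N_{0,k} - P_{m,k}\mathrm{PL}_m\right)/\mathrm{PL}_n$, which is then limited to the allowed power range. The function name and argument layout below are hypothetical, and all quantities are assumed to be in linear scale.

```python
import math


def follower_power(bandwidth, cost, noise, p_m, pl_m, pl_n, p_min, p_max):
    """Best-response transmitting power of a DTx for a given cost parameter.

    Solves (B / ln 2) * PL_n / (N0 + P_m*PL_m + P_n*PL_n) - cost * PL_n = 0 for P_n,
    then clips the result to [p_min, p_max].
    """
    unconstrained = (bandwidth / (cost * math.log(2)) - noise - p_m * pl_m) / pl_n
    return max(min(unconstrained, p_max), p_min)


def follower_gradient(bandwidth, cost, noise, p_m, pl_m, pl_n, p_n):
    """Left-hand side of the first-order condition, useful as a sanity check."""
    return (bandwidth / math.log(2)) * pl_n / (noise + p_m * pl_m + p_n * pl_n) - cost * pl_n
```

A larger cost parameter lowers the unconstrained optimum, so a sufficiently high price drives the follower down to its minimum power; the RUE case is analogous with $\mathrm{PL}_q$ in place of $\mathrm{PL}_n$.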
$$\frac{\partial V_{m,n,k}}{\partial P_{n,k}} = \frac{B}{\ln 2}\cdot\frac{\mathrm{PL}_n}{N_{0,k} + P_{m,k}\,\mathrm{PL}_m + P_{n,k}\,\mathrm{PL}_n} - \lambda_m\,\mathrm{PL}_n = 0.$$
The partial derivative of Equation (9) with respect to $P_{q,k}$ is obtained in the same way and set to 0, from which the RUE transmitting power that maximizes Equation (9) can be calculated. After obtaining $P^{m*}_{n,k}$ and $P^{m*}_{q,k}$, they need to be limited to the maximum and minimum transmitting power:
$$P^{m*}_{n,k} = \max\left(\min\left(P^{m*}_{n,k}, P_{n,\max}\right), P_{n,\min}\right), \tag{13}$$
$$P^{m*}_{q,k} = \max\left(\min\left(P^{m*}_{q,k}, P_{q,\max}\right), P_{q,\min}\right).$$

Resource Allocation in the Macro Cell BS.
The macro cell BS substitutes the $P^{m*}_{n,k}$ and $P^{m*}_{q,k}$ provided by each DTx and RUE into Equation (7), which yields a utility matrix of size $M \times (N + Q)$ covering the cases in which different D2D pairs, RUEs, and CUEs reuse the same PRB. To achieve the optimization of interference coordination in formula (5) and to meet the optimization requirements in formula (15), a resource allocation algorithm based on the Hungarian algorithm is proposed. The specific steps of the algorithm are given below:
Step 1. Traverse all columns in the $U_{m,n,q,k}$ matrix and find the maximum value of $U_{m,n,q,k}$ in each of the N + Q columns together with the corresponding row m.
Step 2. Judge whether the rows m selected in different columns are all different. If so, jump to Step 4.
Step 3. Find the corresponding columns n or q for all nonrepeated rows m, and remove them from the $U_{m,n,q,k}$ matrix. Jump to Step 1 (note: since N + Q should be less than M, the matrix must not be empty).
Step 4. Output all (m, n) and (m, q) correspondences, and use the round-robin algorithm to fairly allocate all resources to CUE m, D2D pair n, and RUE q according to the selected (m, n) and (m, q).
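To make the goal of the matching concrete, the sketch below finds the assignment of followers (columns) to distinct CUEs (rows) with the largest total utility. It deliberately uses an exhaustive search over permutations rather than the Hungarian-based steps above, so it is only a small-scale reference for the result such a matching aims at, not the paper's procedure; the function name and the example matrix are hypothetical.

```python
from itertools import permutations


def best_reuse_assignment(utility):
    """Return (rows, total): rows[j] is the CUE index whose PRB follower j reuses.

    utility[m][j] is the leader utility when follower j (a D2D pair or RUE)
    reuses the PRB of CUE m. Each CUE is reused by at most one follower.
    Exhaustive search: only practical for small matrices.
    """
    n_rows, n_cols = len(utility), len(utility[0])
    if n_cols > n_rows:
        raise ValueError("need at least as many CUEs as followers (N + Q <= M)")
    best_total, best_rows = float("-inf"), None
    for rows in permutations(range(n_rows), n_cols):
        total = sum(utility[m][j] for j, m in enumerate(rows))
        if total > best_total:
            best_total, best_rows = total, rows
    return list(best_rows), best_total


# Both followers prefer CUE 0, so one of them has to settle for another CUE.
example_rows, example_total = best_reuse_assignment([[9, 8], [7, 1], [2, 3]])
```

In this example the conflict over CUE 0 is resolved by giving it to follower 1 and moving follower 0 to CUE 1, which is the kind of repeated-row situation that Steps 2 and 3 are meant to handle.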

Reinforcement Learning of Cost Parameter in Stackelberg Game Model
The cost parameter is the key factor in the Stackelberg game model, because it determines the transmitting power of the DTx and RUE and thereby affects the resource reuse between D2D/RUE and CUE. However, it is difficult to set a suitable parameter for each CUE m. An appropriate cost parameter of CUE m in the follower games should keep the transmitting power of D2D/RUE within a reasonable range to realize power control and should improve overall system performance in the leader game. Hence, this section proposes a Q-learning method that determines an appropriate cost parameter for each CUE m. An interference coordination architecture combining the reinforcement learning model and the Stackelberg game model is proposed as shown in Figure 2.

Reinforcement Learning Model.
The BS performs a learning process for all D2D pairs and RUEs in each slot t to update the triplet variables, which are state s, action a, and reward r. Three basic elements necessary for the reinforcement learning model are defined as follows.
State. It is the current situation of resource reuse with CUEs, and if a D2D/RUE is currently reusing the resource of CUE m, the state denotes m.
Action. A set of cost parameters is defined as the action space. Note that the value range of λ m is assumed from 138 dB to 197 dB with the value interval of 1 dB in this study.
Reward. The reward function reflects the learning goal, expressed as the total throughput of D2D/RUE and CUE m on PRB k with cochannel interference minus the throughput of CUE m without cochannel interference, where $R_{m,k}$ denotes the throughput of CUE m using PRB k, $R_{n,k}$ represents the throughput of D2D pair n reusing PRB k, and $R_{q,k}$ denotes the throughput of RUE q when it reuses PRB k. Finally, $I_{m,k}$ indicates the throughput of CUE m without cochannel interference from a DUE or RUE.

Algorithm 1: Two-stage Q-learning algorithm.
Initialize S = S0 and Q(s, a) = 0. Set α = 0.1.
Loop  % start an update episode t
    If t < 50  % in the pretraining stage
        The agent selects an action randomly from the action set;
    Else  % in the ε-greedy stage
        Generate a random number num;
        If num < ε  % "exploration" is selected
            The agent selects an action randomly from the action set;
        Else  % "exploitation" is selected
            The agent selects the action with the largest Q value from the action set;
    Update the Q value according to the Q-learning update rule;

A Two-Stage Q-Learning Algorithm.
In a Q-learning method, Q values indicate the expected rewards in all states and actions, which are saved and updated in a Q table. Based on link information and throughput feedback from D2D pairs and RUEs, the Q values are updated by the BS in each episode t, which can be expressed as follows.
where α is the learning rate, which represents the update rate of the Q values and is set to 0.1 in this study; γ is the discount rate, which denotes the impact of the final reward on the intermediate state and is set to 0 in this study.
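The role of α and γ can be illustrated with the standard tabular Q-learning update, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, which is assumed here to be the form the update takes. The sketch below uses a list-of-lists Q table; the function name and table layout are hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.0):
    """Apply one tabular Q-learning update in place and return the new Q value.

    q_table[s][a] holds Q(s, a). With gamma = 0, as in this study, the target
    is just the immediate reward, so Q(s, a) moves a fraction alpha toward it.
    """
    best_next = max(q_table[next_state])
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Two states, two actions, all Q values initialised to zero.
q = [[0.0, 0.0], [0.0, 0.0]]
first = q_update(q, state=0, action=1, reward=2.0, next_state=1)
second = q_update(q, state=0, action=1, reward=2.0, next_state=1)
```

With α = 0.1 and γ = 0, repeated updates with the same reward move the entry geometrically toward that reward, so each Q value becomes a running estimate of the immediate reward of its state-action pair.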
To update the Q values, an execution strategy based on the Q table is required. As a feasible strategy, an ε-greedy algorithm [22] in an update episode t chooses an action at random with a probability ε, called "exploration," and otherwise chooses the action with the highest Q value according to the current state, called "exploitation." A larger exploration probability ε means more update episodes t before convergence, but also better optimization.
For faster convergence and better optimization, a two-stage "pretraining plus ε-greedy" algorithm is proposed, which divides the update episodes into two stages, the pretraining stage and the ε-greedy stage. In the pretraining stage, which contains several episodes, the agent always chooses a random action regardless of the state. In the following ε-greedy stage, the traditional ε-greedy algorithm with a large ε is carried out based on the pretrained Q table, which is expected to accelerate the convergence and achieve better optimization.
The proposed two-stage Q-learning algorithm is summarized in Algorithm 1.
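A minimal Python sketch of the two-stage procedure is given below, assuming the tabular update with γ = 0 and a caller-supplied environment. The function `train_two_stage`, its parameters, and the toy environment `toy_env` are hypothetical stand-ins: in the paper's setting the environment step would run the Stackelberg game for the chosen cost parameter and return the resulting reward and reuse state.

```python
import random


def train_two_stage(env_step, n_states, n_actions, episodes,
                    pretrain_episodes=50, epsilon=0.99, alpha=0.1, seed=0):
    """Two-stage Q-learning: random pretraining, then epsilon-greedy.

    env_step(state, action) must return (reward, next_state).
    epsilon is the probability of a random (exploration) action in stage two.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    state = 0
    for t in range(episodes):
        if t < pretrain_episodes or rng.random() < epsilon:
            # Pretraining stage, or exploration in the epsilon-greedy stage.
            action = rng.randrange(n_actions)
        else:
            # Exploitation: action with the largest Q value in this state.
            action = max(range(n_actions), key=lambda a: q[state][a])
        reward, next_state = env_step(state, action)
        # Q-learning update with discount rate gamma = 0.
        q[state][action] += alpha * (reward - q[state][action])
        state = next_state
    return q


def toy_env(state, action):
    """Toy single-state problem: action 3 has the highest reward."""
    return 10.0 - (action - 3) ** 2, 0


q_table = train_two_stage(toy_env, n_states=1, n_actions=5, episodes=2000)
best_action = max(range(5), key=lambda a: q_table[0][a])
```

Merging the pretraining test and the exploration test into one condition keeps the loop short; the only difference between the two stages is that the random draw against ε is skipped while t is below the pretraining length.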

Simulation
In this section, a single-sector system-level simulation to compare the performance of the proposed algorithm and three benchmarks is described. In this simulation, UE of different communication modes are distributed in a sector randomly, including 30 CUEs or RUEs and several DUEs. The simulation parameters are listed in Table 1.
Three benchmark algorithms are considered for comparison. The first is the round-robin-based resource allocation algorithm, which is abbreviated "RR." In the RR algorithm, RUEs and D2D pairs reuse CUE link resources randomly, regardless of their channel information and transmitting power. The second is labeled "greedy." In the greedy optimization algorithm without power control, the sum of the throughput of the CUE and RUE or D2D pair on each PRB is optimized without considering the interference between them. Moreover, the maximum transmission power of DTx and RUEs is assumed. The last is the interference coordination algorithm based on the Stackelberg game with fixed cost parameters, which is abbreviated "FC game." In the FC game, the CUEs select two fixed cost parameters, which are at the 33rd and 67th percentiles of the cost parameter value range.
The RL game algorithm proposed in this paper is compared with the above three benchmark algorithms in terms of performance indicators such as DUE average throughput, CUE average throughput, and average throughput of all users.
Figures 3-5 depict the cumulative distribution functions (CDF) of the CUE, DUE, and RUE throughput, respectively, with different interference coordination algorithms. Compared to the RR and greedy algorithms, the RL game algorithm proposed in this study achieves better CUE performance, which means more CUEs have throughput of over 1100 Kbps, as shown in Figure 3. It can be observed that the DUE performance using the proposed RL game algorithm is intermediate between that using the greedy algorithm and the RR algorithm, while the RUE performance using the RL game algorithm and the greedy algorithm are almost the same and both significantly better than that using the RR algorithm.
Figure 6 shows that the average throughput of all users grows with the number of D2D pairs. Since more D2D pairs reuse more RBs, the average throughput of all users using the proposed RL game algorithm increases from 1.65 to 2.41 Mbps as the number of D2D pairs goes from 2 to 8. Figure 6 also shows that the proposed RL game algorithm ranks first in the average throughput of all users among the compared algorithms.
From Figure 7, it can be seen that the average throughput of D2D pairs also has a strong positive correlation with the number of D2D pairs. When the number of D2D pairs increases from 2 to 8, the DUE throughput increases significantly from 3.22 to 4.2 Mbps using the RL game algorithm. We speculate that higher diversity of D2D pairs is exploited when more D2D pairs reuse the resources. However, the greedy algorithm has better DUE performance because its D2D pairs do not control their power to reduce their interference to CUEs.
Figure 8 shows that when the distance between the BS and RN increases, the average throughput of all users varies greatly using all algorithms. Note that "isd" is the abbreviation of inter-site distance. As seen in Figure 9, when the RN moves away from the BS, the CUE throughputs using different algorithms increase. The reason is that the CUEs with poor signal quality in the cell edge area can be improved as the RN approaches the edge. However, when the RN is close to the BS, larger interference between CUEs and RUEs results in a decreasing average throughput of all users as the RN distance from the BS goes from 0.3 × isd to 0.35 × isd.
Figure 10 shows the convergence of the normalized system capacity over episodes with different random exploration probabilities ε and numbers of pretraining episodes. The system capacity obtained with random actions is used as the normalization baseline. It can be observed that after about 600 episodes, the system capacity with 99% random exploration probability and 1% exploitation probability reaches convergence, which is slower than that with 90% random exploration probability and 10% exploitation probability. After convergence, the system capacity with 99% random exploration probability is larger than that with 90% random exploration probability. This implies that a larger random exploration probability can reach a better optimized solution at the cost of slower convergence.
Pretraining the agents with random exploration actions is expected to accelerate the convergence, which can be validated by the system capacity with 99% random actions after 50 and 1000 pretraining episodes.
With 1000 pretraining episodes, the system capacity reaches convergence after about 200 episodes, while about 350 episodes are needed with 50 pretraining episodes. However, the configuration with 50 pretraining episodes converges to the maximum in the fewest episodes overall among these settings, and it is therefore the setting suggested for the proposed algorithm.

Conclusions
This paper investigates an interference coordination architecture in D2D and relay heterogeneous cellular networks that combines reinforcement learning and the Stackelberg game. A reinforcement learning model for cost parameters in Stackelberg games is proposed along with a two-stage Q-learning algorithm. The simulation results demonstrate the network throughput advantage of the proposed algorithm over the benchmark algorithms. Better settings of the number of episodes in the pretraining stage and of the exploration probability ε are also investigated through the simulation.

Data Availability
The simulation data used to support the findings of this study are available from the corresponding author upon request.