Reinforcement Learning Optimization for Energy-Efficient Cellular Networks with Coordinated Multipoint Communications

Recently, there has been an emerging trend of addressing the "energy efficiency" aspect of wireless communications, and coordinated multipoint (CoMP) communication is a promising method for improving energy efficiency. However, since downlink performance also matters to users, energy efficiency should be improved while preserving downlink performance. This paper presents a control-theoretic approach to studying the energy efficiency and downlink performance of cooperative wireless cellular networks with CoMP communications. Specifically, to make optimal base station grouping decisions for energy-efficient CoMP transmissions, we develop a Reinforcement Learning (RL) Algorithm. We apply Q-learning to obtain the optimal base station grouping policy, introducing variations at the beginning of the Q-learning to prevent Q from falling into local maxima. Simulation results are provided to show the process and effectiveness of the proposed scheme.


Introduction
The continuously growing demand for ubiquitous wireless access has led to the rapid development of wireless cellular networks over the last decade [1][2][3]. This tremendous growth has made the wireless industry one of the leading sources of world energy consumption, and that consumption is expected to grow dramatically in the future. Rapidly rising energy costs and increasingly strict environmental standards have led to an emerging trend of addressing the "energy efficiency" aspect of wireless communication technologies.
Given worsening air pollution and the greenhouse effect, optimizing energy utilization for sustainable development has become a hot topic in academia. Base stations consume a great deal of energy in wireless cellular networks, which causes substantial waste when the number of users drops to a low level and works against energy optimization and environmental protection. It is therefore necessary to reduce the energy consumption of base stations in order to optimize the energy efficiency of wireless cellular networks.
In a typical wireless cellular network, base stations account for about 70% of the total energy consumption [4]. In addition, a base station consumes more than 90% of its peak energy even when there is little or no traffic [5]. To optimize energy efficiency, the base station sleeping strategy has been proposed, in which a base station is turned off when its traffic is low and the surrounding base stations cooperate to serve its users. This paper combines coordinated multipoint communication with the base station sleeping strategy in order to implement an optimal CoMP grouping to serve the sleeping cell. Coordinated multipoint communication is a new method that enables dynamic base station cooperation, where signals transmitted or received by spatially separated antenna sites are jointly processed [6].

Mathematical Problems in Engineering
The Reinforcement Learning Algorithm is one of the important methods in Machine Learning. In this paper, it is used to optimize base station cooperation in CoMP communication. Reinforcement Learning (RL) learns through direct experimentation: it does not assume the existence of a teacher that provides examples from which a task is learned. Instead, in RL, experience is the only teacher. With historical roots in the study of conditioned reflexes, RL has found advantages in fields as diverse as Operations Research and Robotics because of its theoretical relevance and potential applications. In this paper, the RL Algorithm is applied to the operational problem of base station grouping in CoMP communication.
The rest of this paper is organized as follows. Section 2 presents the system models. Section 3 presents the RL Algorithm in CoMP communication. Simulation results are presented and discussed in Section 4. Finally, we conclude this study in Section 5.

Coordinated Multipoint Communications.
Coordinated Multipoint Communication is a key feature of LTE-Advanced technologies. For a better application of CoMP in practice, we build small-scale model units to study the CoMP scheme, which supplies higher energy efficiency for base stations and better communication performance for users. Only in this way can we "green" the field of communications.
In the wireless cellular network, one small-scale model unit consists of 19 cellular cells, in which the most lightly loaded cell is chosen as the center one. The base stations in the second tier are too far away to serve the users in the center cell, so they are treated as noise sources. The first-tier hexagonal cells are numbered from 0 to 6 with base stations located at the cell centers, and the distance between two adjacent base stations is 500 m, as shown in Figure 1. The center base station is switched off, and its users are served by the dynamic CoMP cooperation of base stations 1-6. The number and locations of the users in cell 0 are both optional and changeable, so the CoMP scheme must be adjusted accordingly. As CoMP is defined in 3GPP 36.814 [7] as dynamic coordination among multiple geographically separated points, referred to as the CoMP cooperating set, for downlink transmission and uplink reception, the goal of the scheme is to find the number and locations of the base stations participating in CoMP so as to provide the best communications as the environment changes. The CoMP scheme is considered in the situation where the downlink payload and data are available at each point in the CoMP cooperating set, and the downlink payload is transmitted on the Physical Downlink Shared Channel (PDSCH) from multiple points in the set [8]. In this paper, however, we mainly focus on downlink transmission under perfect channel state conditions.
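The first-tier layout described above can be generated programmatically. The sketch below places the six cooperating stations on a ring of radius 500 m around the switched-off center station; the spacing is from the text, while the coordinate convention and function name are our own illustrative choices:

```python
import math

def first_tier_positions(d=500.0):
    """Return (x, y) coordinates in metres for base station 0 (the
    sleeping center cell) and its six first-tier neighbours, each a
    distance d from the center."""
    positions = [(0.0, 0.0)]  # base station 0, switched off
    for k in range(6):        # base stations 1..6, 60 degrees apart
        angle = math.pi / 3 * k
        positions.append((d * math.cos(angle), d * math.sin(angle)))
    return positions

positions = first_tier_positions()
```

Second-tier stations, treated purely as noise sources in the model, would sit on an outer ring and need not be enumerated here.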

Large Scale Propagation and Pathloss Models.
Pathloss in wireless communication is defined as the difference in dB between the transmitted and received signal powers due to attenuation during propagation [9]. The traditional log-normal shadowing model for large-scale pathloss is formulated as
$$P_{RX} = P_{TX} - PL(d),$$
where $P_{RX}$ represents the received signal power at the user equipment (UE), $P_{TX}$ represents the transmitted signal power at the serving base station, and the pathloss $PL(d)$ is formulated as
$$PL(d) = PL(d_0) + 10\,n\,\log_{10}\!\left(\frac{d}{d_0}\right) + X_\sigma,$$
where $PL(d_0)$ denotes the pathloss at the reference distance $d_0$, $d$ represents the propagation distance, $n$ is the pathloss exponent, and $X_\sigma$ is a Gaussian random variable with zero mean and standard deviation $\sigma$ modeling the shadowing effect of the medium [8]. The Urban Macro (UMa) pathloss model is used according to the parameters specified in Table 1 with respect to ITU for line-of-sight (LoS) and non-line-of-sight (NLoS) scenarios. The LoS probability is modeled as a Bernoulli random variable [10].
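The log-normal shadowing model above can be sketched as follows; note that the numeric defaults ($PL(d_0)$, $n$, $\sigma$, the transmit power) are illustrative placeholders, not the paper's Table 1 UMa parameters:

```python
import math
import random

def lognormal_pathloss_db(d, d0=1.0, pl_d0=34.5, n=3.5, sigma=8.0, rng=random):
    """Large-scale pathloss in dB:
    PL(d) = PL(d0) + 10 * n * log10(d / d0) + X_sigma,
    with X_sigma a zero-mean Gaussian shadowing term."""
    shadowing = rng.gauss(0.0, sigma)
    return pl_d0 + 10.0 * n * math.log10(d / d0) + shadowing

def received_power_dbm(p_tx_dbm, d, **kwargs):
    """P_RX = P_TX - PL(d), all quantities in dB scale."""
    return p_tx_dbm - lognormal_pathloss_db(d, **kwargs)
```

Setting `sigma=0.0` removes the shadowing term, which is convenient for checking the deterministic part of the model.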

Downlink Performance.
When making the CoMP scheme decision, energy efficiency is not the only index we take into consideration; we should also pay attention to downlink performance to guarantee the quality of communications.
The initial motivation for CoMP was to increase cell-edge user throughput and spectral efficiency by making use of intercell orthogonal resource assignments [11]. The received SINR for user $u$ is calculated as
$$\mathrm{SINR}_u = \frac{P_{RX,s}}{\sum_{b \neq s} P_{RX,b} + P_{\text{noise}}},$$
where $P_{RX,s}$ is the received signal power from the serving base station $s$ using the UMa large-scale pathloss model, the remaining $P_{RX,b}$ act as interferers, and the receiver noise power $P_{\text{noise}}$ is formulated as
$$P_{\text{noise}} = N_0\,B(u),$$
where $N_0 = -174$ dBm/Hz is the noise spectral density and $B(u)$ is the frequency bandwidth assigned to user $u$. In the CoMP cellular system, however, the members of the CoMP scheme perform joint scheduling on the PDSCH to transfer user-plane data using TM-9, so the received downlink SINR in joint transmission is formulated according to [8] as
$$\mathrm{SINR}_u = \frac{\sum_{b \in S_{JT}} P_{RX,b}}{\sum_{b \in S \setminus S_{JT}} P_{RX,b} + P_{\text{noise}}},$$
where $S_{JT}$ represents the CoMP set and $S \setminus S_{JT}$ the stations outside the CoMP set, whose power is received as interference. In this way, both the SINR and the communication performance of cell-edge users are increased. The downlink capacity received by each user $u$ is formulated as
$$C_u = B(u)\,\log_2\!\left(1 + \mathrm{SINR}_u\right).$$
Depending on user location and mobility, each user $u$ has a distinct CoMP transmission set $S_{JT}(u)$. Since downlink capacity is also an important index in CoMP communication, it is necessary to coordinate all the $S_{JT}(u)$ to find the best CoMP scheme.
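The joint-transmission SINR and capacity formulas can be sketched in linear (mW) scale as follows; the bandwidth and the received-power dictionary are illustrative assumptions:

```python
import math

def jt_sinr(p_rx_mw, comp_set, n0_dbm_hz=-174.0, bandwidth_hz=10e6):
    """Joint-transmission SINR: received powers (linear mW, keyed by base
    station id) from stations inside the CoMP set add up as useful signal;
    the rest act as interference on top of thermal noise N0 * B."""
    noise_mw = 10 ** (n0_dbm_hz / 10.0) * bandwidth_hz
    signal = sum(p for b, p in p_rx_mw.items() if b in comp_set)
    interference = sum(p for b, p in p_rx_mw.items() if b not in comp_set)
    return signal / (interference + noise_mw)

def downlink_capacity(bandwidth_hz, sinr):
    """Shannon capacity C = B * log2(1 + SINR), in bits/s."""
    return bandwidth_hz * math.log2(1.0 + sinr)
```

With the CoMP set covering all strong stations, the interference term shrinks and the SINR of cell-edge users rises, matching the behaviour described above.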

Bits/Joule Energy Efficiency.
One common method to measure energy efficiency is bits per Joule. For a CoMP scheme $S_{JT}$, energy efficiency can be defined as the ratio of transmission capacity to transmission power:
$$\eta_{EE}(S_{JT}) = \frac{C(S_{JT})}{P_{\text{CoMP}}(S_{JT})},$$
where $C(S_{JT})$ is the total downlink capacity and $P_{\text{CoMP}}$ is the total transmission power of CoMP scheme $S_{JT}$, expressed as
$$P_{\text{CoMP}}(S_{JT}) = \sum_{b \in S_{JT}} P(b),$$
where $P(b)$ is the total power consumption of base station $b$ in CoMP scheme $S_{JT}$, calculated using the assumptions of [12] and [13] as
$$P(b) = \left(\frac{P_{Tx}(b)}{\eta_{\text{eff}}} + P_{SP}\right)\left(1 + C_C\right)\left(1 + C_{PSBB}\right) + P_{BH},$$
where $P_{Tx}(b)$ is the radiated power per base station, $P_{SP}$ is the signal processing power per base station, and $P_{BH}$ is the power due to backhauling. $\eta_{\text{eff}}$ is the power amplifier efficiency, and $C_C$ and $C_{PSBB}$ are the cooling and battery backup losses in the system.
The signal processing power $P_{SP}$ per sector scales with the cooperation size as a function of $N_{BS}$, the number of base stations participating in the CoMP scheme $S_{JT}$. The backhauling power consumption $P_{BH}$ for base stations using CoMP is modeled in [12] as proportional to the backhaul load,
$$P_{BH} = P_{BH,0}\,\frac{R_{BH}}{R_{BH,\max}},$$
where $R_{BH}$ is a given average backhaul requirement per base station. $R_{BH}$ is expressed in terms of $\delta$, the additional pilot density, $q$, the CSI signaling under the CoMP network, and the sample period $T_s = 66.7\,\mu\mathrm{s}$, which is the reciprocal of the assumed OFDM subcarrier spacing of 15 kHz [8].
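The bits/Joule metric and the structure of the power model can be sketched as follows; every numeric default (amplifier efficiency, signal-processing baseline, loss factors, backhaul power, and the linear scaling of $P_{SP}$ with set size) is an illustrative placeholder rather than a value from the paper:

```python
def bs_power_w(p_radiated_w, n_comp, eta_eff=0.38, p_sp0_w=58.0,
               c_cool=0.29, c_psbb=0.11, p_bh_w=50.0):
    """Per-base-station power under CoMP, mirroring the model's structure:
    amplifier power, signal processing (assumed to grow linearly with the
    cooperation size n_comp), cooling/battery-backup loss factors, and a
    backhaul term."""
    p_sp = p_sp0_w * (1.0 + 0.1 * n_comp)     # assumed scaling with set size
    p_amp = p_radiated_w / eta_eff            # PA input power
    return (p_amp + p_sp) * (1.0 + c_cool) * (1.0 + c_psbb) + p_bh_w

def energy_efficiency(total_capacity_bps, comp_set, p_radiated_w=20.0):
    """Bits/Joule: total downlink capacity over total CoMP power."""
    p_total = sum(bs_power_w(p_radiated_w, len(comp_set)) for _ in comp_set)
    return total_capacity_bps / p_total
```

Because each added station contributes both amplifier and overhead power, efficiency falls with set size unless capacity grows faster, which is exactly the trade-off the scheme must balance.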
Given the models above, we seek the best CoMP scheme $S_{JT}$, which both ensures the quality of communication and maximizes the energy efficiency.

Reinforcement Learning
In this section, we present the application of the reinforcement learning (RL) algorithm to obtain the best CoMP scheme $S_{JT}$. First, we introduce the RL Algorithm and its branch, Q-learning. Then we build an appropriate value function and apply RL to obtain the optimal CoMP scheme $S_{JT}$.

RL Algorithm.
Generally speaking, the RL Algorithm tries to find an optimal action policy for a given task in an unknown environment. During RL, a learner observes the state of the environment and chooses its action accordingly. At first, the policy is arbitrary, so it is typical of the learning process that the learner makes wrong actions. The learner then receives a reward signal from the environment and, based on the reward, improves its policy.
The learner chooses its action $a$ depending on the environment state $s$ and the reward $r$, and the policy $\pi$ is adjusted along with the action $a$, as shown in Figure 2.
During RL, the learner looks for an optimal policy $\pi^*$ under which it always receives the best rewards from the environment. For such a policy, the value function $V^{\pi^*}(s)$ is greater than or equal to the value function $V^{\pi}(s)$ of any policy $\pi$.
At present, RL Algorithms fall mainly into two categories: value function estimation and strategy space search methods. Concretely, RL includes the Monte Carlo algorithm, the Temporal Difference (TD) algorithm, the Dynamic Programming algorithm, the Q-learning algorithm, and the Sarsa learning algorithm.

Q-Learning Algorithm.
The Q-learning algorithm was introduced by Watkins in 1989 [14] and is a milestone in the development of reinforcement learning. Currently, Q-learning is also one of the most widely applied reinforcement learning algorithms.
Compared to other RL Algorithms, the Q-learning algorithm maintains a value for each state-action pair, expressed as $Q(s_t, a_t)$. The value of $Q(s_t, a_t)$ is the accumulated reward obtained by repeatedly choosing action $a_t$ in state $s_t$ according to the current policy. The optimal policy $\pi^*$ of Q-learning maximizes $Q(s_t, a_t)$ and can thus be expressed as
$$\pi^* = \arg\max_{a} Q(s, a).$$
As a result, the learner only needs to consider the current state and the available actions and then choose the action that maximizes $Q(s_t, a_t)$ according to the policy. When choosing the action, we only have to select the maximum value from the $Q$ table, which greatly simplifies decision-making. The values in the $Q$ table are the results of iterative learning, step by step; the learner needs constant interaction with the environment to enrich the $Q$ table so that it covers all possible situations. The Q-learning iterative update of the value function is
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right],$$
where $\alpha$ is the learning coefficient and $\gamma$ is the discount factor.
Steps of the Q-learning algorithm are as follows.
(1) Observe the current state $s_t$.
(2) Receive the reward $r_t$ from the environment.
(3) According to the current state, select the action following the policy, and observe the next state.
(4) Update the estimate of $Q(s_t, a_t)$ according to the value function of the new state-action pair.
(5) Check whether the learning ends; if not, set $t = t + 1$ and go back to (1).
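These steps can be sketched as a generic tabular Q-learning loop. The environment interface (`reset`, `actions`, `step`), the epsilon-greedy action selection, and all parameter values are our own illustrative assumptions, not the paper's implementation:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, rng=random):
    """Tabular Q-learning. `env` must provide reset() -> state,
    actions(state) -> list of actions, and
    step(state, action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)], initialised to zero
    for _ in range(episodes):
        s = env.reset()                                  # observe the state
        done = False
        while not done:
            acts = env.actions(s)
            if rng.random() < epsilon:                   # epsilon-greedy choice
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(s, a)                 # act, observe next state
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            # update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max Q(s',.) - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2                                       # continue until done
    return Q
```

On even a trivial two-action environment, the table quickly separates the rewarding action from the unrewarding one, which is all the CoMP controller needs in order to read off the greedy policy.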

RL Algorithm in CoMP Communication.
In CoMP communication, our purpose is to find the best CoMP scheme $S_{JT}$, which is the optimal policy $\pi^*$ in the RL Algorithm. In the models built in Section 2, there are 6 base stations participating in the cooperation, numbered from 1 to 6 as shown in Figure 1. The cooperation policy $\pi$ can then be a $1 \times 6$ binary matrix. For example, if the optimal policy $\pi^*$ is the cooperation of base stations 1, 3, and 6, then the policy is $\pi^* = [1, 0, 1, 0, 0, 1]$. The environment state is the number and locations of the users. The value function $Q(s_t, a_t)$ is the energy efficiency, and the rewards are related to the downlink capacity and the received SINR. We set the downlink capacity standard at 1 Mbit/s and the received SINR standard at 21 dB. The rates of users who reach these standards are denoted $r_{\text{cap}}$ and $r_{\text{SINR}}$. After different trials, Table 2 shows the relations between $r_{\text{cap}}$, $r_{\text{SINR}}$, and the rewards.
Because the policy $\pi$ is a $1 \times 6$ matrix, the actions interacting with the policy take the form of $1 \times 6$ unit vectors:
$$a_1 = [1,0,0,0,0,0],\quad a_2 = [0,1,0,0,0,0],\quad \ldots,\quad a_6 = [0,0,0,0,0,1].$$
Choosing one of actions 1-6 adds the corresponding base station to the CoMP set, while choosing one of actions 7-12 removes it; not all actions are available at each step. In practice, the number and locations of users change much more slowly than the process of finding the optimal policy $\pi^*$, so for a period the optimal policy stays the same. In order to prevent $Q$ from falling into a local optimum, we introduce variations to $\pi$ every six steps. After sixty steps, we remove the variations; the system then operates under the optimal policy $\pi^*$ with the maximal $Q$ while the state changes only faintly. When the state changes, the Q-learning repeats the above process automatically. In this way, it leads to green and efficient communications.
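The action semantics and the variation (mutation) step described above can be sketched as follows; the function names, the tuple representation of the policy, and the "unavailable action returns None" convention are our own illustrative assumptions:

```python
import random

def apply_action(policy, action):
    """Policy is a binary 6-tuple (1 = base station in the CoMP set).
    Actions 1..6 add station i to the set; actions 7..12 remove station
    i - 6. Returns the new policy, or None if the action is unavailable
    (adding a station already in the set, or removing one not in it)."""
    p = list(policy)
    if 1 <= action <= 6:
        if p[action - 1] == 1:
            return None
        p[action - 1] = 1
    elif 7 <= action <= 12:
        if p[action - 7] == 0:
            return None
        p[action - 7] = 0
    return tuple(p)

def mutate(policy, rng=random):
    """Variation step applied every six steps: flip one random base
    station in or out of the set to escape local maxima of Q."""
    i = rng.randrange(6)
    p = list(policy)
    p[i] ^= 1
    return tuple(p)
```

A controller loop would alternate `apply_action` (greedy, by Q value) with an occasional `mutate`, then freeze the best policy found once the sixty-step exploration window closes.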

Simulation and Discussions
In this section, computer simulations are carried out to evaluate the performance of the proposed CoMP scheme $S_{JT}$ under the RL Algorithm. The simulation parameters are shown in Table 1. We assume perfect channel state information throughout this paper. The CoMP scheme $S_{JT}$ changes along with the communication standards: under different standards, the number of base stations participating in CoMP communication differs.
Figure 3(a) shows how the value of $Q$ changes with the SINR standard. When the SINR standard is low, most of the users in the center cell can reach it, so the SINR reward of each set is positive and the energy efficiency mainly affects the value of $Q$. As the SINR standard increases, fewer users in the center cell can reach it, the reward turns negative, and the $Q$ value descends. Figure 3(b) shows the best CoMP set under different SINR standards. As the SINR standard increases, the best CoMP set grows to meet the standard. When the standard reaches 30 dB, even CoMP set 6 cannot meet it, so the choice drops back to CoMP set 1 under the influence of energy efficiency.
Figure 4(a) shows how the value of $Q$ changes with the Capacity standard. When the Capacity standard is low, most of the users in the center cell can reach it, so the Capacity reward of each set is positive and the energy efficiency mainly affects the value of $Q$. As the Capacity standard increases, fewer users in the center cell can reach it, the reward turns negative, and the $Q$ value descends. Figure 4(b) shows the best CoMP set under different Capacity standards. As the Capacity standard increases, the best CoMP set grows to meet the standard. When the standard reaches 11 Mbits/s, even CoMP set 6 cannot meet it, so the choice drops back to CoMP set 2 under the influence of energy efficiency.
Figures 5-9 are simulations under the standard in which the Capacity standard is 1 Mbit/s and the SINR standard is 21 dB. Figure 5 shows that energy efficiency descends as the CoMP set size increases. Figure 6 shows that both the downlink capacity and the SINR increase with the CoMP set size. Figure 7 shows that, in this situation, CoMP set 4 usually obtains the largest $Q$ value.
In Figures 8(a) and 9(a), the red circles represent the base stations in the CoMP set found by Q-learning, while the yellow ones represent the best base station grouping found by enumeration search. Figure 8 shows the result and process of Q-learning without variations: it easily becomes trapped in a local optimum, leading to a base station grouping that the enumeration search shows to be wrong. To overcome this, a variation is introduced into the Q-learning every six steps, as Figure 9 shows. In this way, the local optimum is escaped and the best base station grouping is obtained by the Q-learning method.
Enumeration search is the auxiliary method used to check the Q-learning result. Compared with enumeration search, the Q-learning method optimizes the base station grouping automatically in a short time, as Figure 10 shows. Figure 11 shows the process of Q-learning with a variation every six steps. For each state, we allow sixty steps to adjust the strategy, so within those 60 steps a variation is added every six steps, which makes the $Q$ value oscillate. After sixty steps, the variations are removed. Figure 12 shows that we find the maximal $Q$ in the first sixty steps and then act according to the optimal policy $\pi^*$. As Figures 11 and 12 show, the state changes little. At steps 121 and 272, $s_{t+1} \neq s_t$, so the learning returns to the variation phase to find the new optimal policy $\pi^*$. The relevant energy efficiency and rewards are shown in Figures 8 and 12, and the relevant states and optimal policies $\pi^*$ are shown in Table 3.

Conclusion
In wireless cellular networks, it is very important to increase the energy efficiency of radio access networks to meet these challenges, and equally important to improve the downlink performance of the wireless network to meet user requirements. In this paper, we proposed an RL Algorithm to derive optimal base station grouping decisions for energy-efficient transmissions in CoMP. In addition, we introduced variations to prevent $Q$ from being trapped in local maxima. Simulation results have been presented to show the process of the Q-learning and the optimal policy $\pi^*$ it finds.
Further research is in progress to make the communication model more realistic by considering imperfect channel state information. In addition, two or more base stations in the small-scale model unit could be dynamically switched off to further "green" wireless communications.

Figure 3: (a) $Q$ value changes with the SINR standard. (b) Best CoMP set under different SINR standards.
Figure 4: (a) $Q$ value changes with the Capacity standard. (b) Best CoMP set under different Capacity standards.

Figure 5: Energy efficiency change under different user counts.


Figure 6: (a) Capacity change under different user counts. (b) SINR change under different user counts.

Figure 10: Comparison between enumeration search and Q-learning.

Table 2: Rewards for downlink capacity and SINR.