A Reinforcement Learning Approach to Access Management in Wireless Cellular Networks

In smart city applications, huge numbers of devices need to be connected in an autonomous manner. 3rd Generation Partnership Project (3GPP) specifies that Machine Type Communication (MTC) should be used to handle data transmission among a large number of devices. However, the data transmission rates are highly variable, and this brings about a congestion problem. To tackle this problem, the use of Access Class Barring (ACB) is recommended to restrict the number of access attempts allowed in data transmission by utilizing strategic parameters. In this paper, we model the problem of determining the strategic parameters with a reinforcement learning algorithm. In our model, the system evolves to minimize both the collision rate and the access delay. The experimental results show that our scheme improves system performance in terms of the access success rate, the failure rate, the collision rate, and the access delay.


Introduction
In smart city applications, many smart and mobile devices connect with one another and operate in an adaptive manner. The devices generate relevant city data in bursts and in unexpected manners on a massive scale. Because Long Term Evolution-Advanced (LTE-A) cellular networks provide wide coverage and low latency, LTE-A is considered to be one of the most promising communication infrastructures for smart city applications. However, radio resources in this communication infrastructure are too limited to serve a large amount of data. The 3rd Generation Partnership Project (3GPP) specifies that Machine Type Communication (MTC) should be used to handle the congestion problem caused by small amounts of data being transmitted from a large number of devices within a short period of time [1][2][3]. MTC traffic includes periodic-update traffic and event-driven traffic. Event-driven triggering brings about bursty and unpredictable traffic flows, and these add to the congestion problem [4].
When a device has data to send, it needs to synchronize with a base station, that is, Evolved Node B (eNB), and to reserve a Random Access Channel (RACH). The RACH is a sequence of physical radio resources (RA slots). To do this, a device follows a four-step random access procedure [5,6]. First, a device with data to send randomly selects a random access preamble, which serves as a digital signature, from a predefined set of preambles and sends the preamble to eNB. Then, eNB responds with a Random Access Response (RAR) message to synchronize subsequent uplink transmission. However, if multiple devices send the same preamble in the same RA slot at the first step, a collision occurs, and they will receive the same RAR message. After receiving the RAR message, a device sends a connection request message along with a scheduling request. Finally, eNB acknowledges the connection request message. If a device receives the acknowledgement message successfully, it proceeds to data transmission. If a device encounters a preamble collision, it does not receive the message from eNB and will initiate a new RA procedure after a fixed backoff time. When the number of unsuccessful attempts of a device reaches the predefined maximum value (Maximum Number of Preamble Transmission [1]), the device finally fails the RA process.
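The effect of preamble collisions in the first step can be illustrated with a small simulation. The following is a minimal sketch, not the simulator used in this paper: the function name `simulate_ra_slot` and the default of 54 contention preambles are assumptions made for illustration, and the later steps of the procedure (RAR, connection request, acknowledgement) are not modelled.

```python
import random
from collections import Counter


def simulate_ra_slot(num_devices, num_preambles=54, rng=None):
    """Simulate preamble selection in one RA slot.

    Each contending device picks one preamble uniformly at random.
    A preamble picked by exactly one device succeeds; a preamble picked
    by two or more devices collides for all of those devices.
    num_preambles=54 is an assumed configuration, not taken from the paper.
    Returns (number of successful devices, number of collided devices).
    """
    rng = rng or random.Random()
    picks = Counter(rng.randrange(num_preambles) for _ in range(num_devices))
    successes = sum(1 for count in picks.values() if count == 1)
    collided = sum(count for count in picks.values() if count > 1)
    return successes, collided


if __name__ == "__main__":
    for n in (10, 50, 200):
        ok, bad = simulate_ra_slot(n, rng=random.Random(0))
        print(f"{n} contending devices: {ok} succeeded, {bad} collided")
```

Under these assumptions, the share of collided devices grows quickly once the number of contending devices approaches the number of preambles, which is the congestion effect discussed in the following paragraphs.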

Wireless Communications and Mobile Computing
As the number of devices attempting random access in the same RA slot increases, the numbers of preamble collisions and access delays increase as well. The access delay is the time between the generation of access request and the completion of the random access procedure. As a result of such a delay, the congestion becomes heavy. 3GPP specifies the use of the Access Class Barring (ACB) scheme to tackle the congestion problem [7]. ACB is a well-known scheme that restricts RA attempts. ACB operates on two strategic parameters: the barring factor and the barring duration. Based on the current congestion status, eNB regulates the RA attempts of MTC devices using these two parameters. Thus, the control of the parameters is vital in protecting the system from the excessive connectivity of a large number of devices. However, 3GPP does not specify how to control the parameters dynamically.
In this paper, we model the problem of determining ACB parameter values by using a reinforcement learning algorithm. This algorithm is able to follow unexpected changes and traffic bursts rapidly. Through the learning algorithm, we propose a scheme for dynamically and autonomously controlling the parameters. The experimental results show that the use of our scheme is sufficient to resolve the congestion problem in terms of access success rate, failure rate, collision rate, and access delay.
The rest of the paper is organized as follows. In Section 2, we discuss related studies. In Section 3, we propose an access management scheme that uses a reinforcement learning algorithm. In Sections 4 and 5, we evaluate the performance of our scheme and conclude the paper with our plans for future research.

Related Work
Several proposals for tackling the RA congestion problem are discussed. In ACB [7], eNB broadcasts the barring factor $p$ ($0 \le p \le 1$) and barring duration to its cell based on the current congestion level. A device with data to send generates a random number $q$ ($0 \le q \le 1$). If $q \le p$, the device gets permission to access RACH. Otherwise, the RA attempt is barred for the barring duration. However, there is a tradeoff with respect to the barring factor $p$. If severe congestion occurs in a cell, eNB sets $p$ to an extremely low value and most devices are barred. This results in an unacceptable access delay. On the other hand, if eNB sets $p$ to an extremely high value, most of the preambles encounter collisions. This results in unacceptable data transmission failure. Thus, the barring factor is an important factor in determining system performance.
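The ACB check performed by each device can be sketched as follows. This is an illustrative sketch only: the function name `acb_attempt` and its return convention (a waiting time in seconds) are assumptions, and the barring duration is treated as a fixed value rather than being derived from the broadcast timer.

```python
import random


def acb_attempt(barring_factor, barring_duration, rng=None):
    """Apply the ACB check for one device.

    The device draws a random number in [0, 1) and may access RACH only if
    the draw does not exceed the broadcast barring factor.
    Returns 0.0 if the attempt is permitted now, otherwise the time the
    device must wait before trying again (hypothetical convention).
    """
    rng = rng or random.Random()
    draw = rng.random()
    if draw <= barring_factor:
        return 0.0
    return barring_duration


if __name__ == "__main__":
    rng = random.Random(42)
    trials = 10_000
    for factor in (0.1, 0.5, 0.9):
        passed = sum(acb_attempt(factor, 4.0, rng) == 0.0 for _ in range(trials))
        print(f"barring factor {factor}: {passed / trials:.2%} of attempts permitted")
```

The fraction of permitted attempts tracks the barring factor, which makes the tradeoff concrete: a low factor defers most devices (longer delay), while a high factor lets most devices contend at once (more collisions).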
3GPP specifies the use of Extended ACB (EAB) as well as ACB. In EAB [8], devices are grouped into a set of ACs (Access Classes). eNB broadcasts a barring bitmap for the ACs periodically. A device with data to send compares its AC with the bitmap. If the bit that corresponds to the AC of the device is set, the device is barred from transmitting data until the bit changes. In this case, the scheduling policy among the ACs is a factor determining system performance.
In current cellular networks, eNB alone determines the ACB barring factor to stabilize each cell. In [9], a cooperative mechanism is proposed to control congestion globally over multiple cells. The barring factor of each eNB is decided cooperatively among all eNBs. This is done for global stabilization and for access load sharing.
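For comparison with the probabilistic ACB check, the EAB bitmap check described above is deterministic. The sketch below assumes ten access classes indexed 0 to 9 and a bitmap represented as a list of 0/1 values; both the representation and the name `eab_is_barred` are illustrative assumptions.

```python
def eab_is_barred(access_class, barring_bitmap):
    """Return True if the bit for the device's access class is set.

    barring_bitmap is a sequence of 0/1 values, one per access class
    (an assumed in-memory representation of the broadcast bitmap).
    """
    if not 0 <= access_class < len(barring_bitmap):
        raise ValueError("access class is outside the broadcast bitmap")
    return bool(barring_bitmap[access_class])


if __name__ == "__main__":
    # Hypothetical broadcast: classes 0-4 barred, classes 5-9 allowed.
    bitmap = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    for ac in (2, 7):
        state = "barred" if eab_is_barred(ac, bitmap) else "allowed"
        print(f"access class {ac}: {state}")
```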
In order to maintain a high service quality for HTC (Human Type Communication), 3GPP specifies the use of two different schemes: the MTC specific backoff scheme and the separate RA resources scheme [10]. In the MTC specific backoff scheme, a dedicated backoff parameter is set for the MTC devices. The backoff scheme discourages the devices from attempting random access for a certain duration of time. The backoff value for HTC devices is shorter than it is for MTC devices. In the separate RA resources scheme, RA slots are allocated to HTC and MTC devices separately. Both of the schemes focus on reducing the impact of RACH congestion on HTC devices. Thus, MTC devices may experience serious congestion because the amount of resources available is reduced.
In addition to the solutions specified by 3GPP, various other congestion solutions have been proposed. In [11], a congestion-aware admission control scheme is proposed. It rejects RA requests from MTC devices selectively according to the congestion level, which is derived directly from the incoming packet processing delay at the application layer. In [12], RACH resources are preallocated to different MTC classes using class-dependent backoff procedures to prevent a large number of simultaneous RACH access attempts. Dynamic access barring according to the traffic load level is proposed for collision avoidance. Under this barring, the access attempts of devices transmitting for the first time are delayed.

Proposed Scheme
To tackle the congestion problem, we adopt the Q-learning (QL) algorithm. The algorithm utilizes a form of reinforcement learning to solve Markovian decision problems without possessing complete information [13,14]. Because QL finds solutions through the experience of interacting with an environment, we use it to model the ACB barring factor [15]. In other words, we control the ACB barring factor adaptively with QL.
Let $S$ denote a finite set of possible environment states and let $A$ denote a finite set of admissible actions to be taken. At RA slot $t$, eNB perceives the current state $s_t = s \in S$ of the environment and takes an action $a_t = a \in A$ based on both the perceived state and its past experience. The action $a_t$ changes the environmental state from $s_t$ to $s_{t+1} = s' \in S$. When that happens, the system receives the reward $r_t$.
The goal of the QL algorithm is to find an optimal policy for state $s$ that optimizes the rewards over the long run. The algorithm estimates the $Q$-value $Q(s, a)$ as the cumulative discounted reward. Using the $Q$-values, the algorithm finds the optimal $Q$-value $Q^*(s, a)$ in a greedy manner. The $Q$-value is updated as

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\, \Delta Q(s, a), \quad (1)$$

where $\alpha$ ($0 \le \alpha \le 1$) is the learning rate. When $\alpha$ is 0, the value is not updated. When $\alpha$ is a high value, learning occurs quickly. The update term is

$$\Delta Q(s, a) = r_t + \gamma \max_{a' \in A} Q(s', a'), \quad (2)$$

where $\gamma$ ($0 \le \gamma \le 1$) is the discount factor that weighs immediate rewards more heavily than future rewards.
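As an illustration, the standard tabular Q-learning update can be sketched in a few lines. This is a generic sketch of the textbook rule, not the authors' implementation: the dictionary-based table, the function name `q_update`, and the string action labels are assumptions. The default learning rate and discount factor mirror values reported in the evaluation section.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.9, gamma=0.5):
    """Apply one tabular Q-learning update and return the new Q-value.

    q_table maps (state, action) pairs to Q-values; missing entries are 0.
    alpha is the learning rate and gamma is the discount factor.
    """
    # Best value obtainable from the next state under the current estimates.
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Blend the old estimate with the new target according to alpha.
    q_table[(state, action)] = (1 - alpha) * q_table[(state, action)] + alpha * target
    return q_table[(state, action)]


if __name__ == "__main__":
    actions = ("increase", "decrease", "hold")  # hypothetical labels
    q = defaultdict(float)
    print(q_update(q, state=1, action="increase", reward=1.0,
                   next_state=2, actions=actions))
```

With `alpha=0` the stored value is left unchanged, and with a value close to 1 the estimate moves almost entirely to the new target, which matches the role of the learning rate described above.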
In this paper, we model a QL algorithm to control the barring factor in order to minimize both the number of RACH collisions and the access delay. A collision occurs when two or more devices transmit the same preamble in the same time slot. We define a set of possible states, a set of admissible actions, and the rewards. First, we define the set of states $S$ from the access success rate $P^{succ}_t$ ($0 \le P^{succ}_t \le 1$). The access success rate is defined as the number of devices that successfully access RACH divided by the number of devices contending in a given RA slot. The range of $P^{succ}_t$ is divided evenly into $|S|$ states. Each state $s$ has three possible actions: increasing or decreasing the value of the barring factor $p$ by $\delta_i \in \Delta$ or maintaining the current value of $p$. Here $\Delta$ is a finite set of unit values for $p$. To balance the exploration and exploitation of learning, an $\epsilon$-greedy method [16] is applied to our QL algorithm. In other words, we select a random action with probability $\epsilon$, or with probability $1 - \epsilon$ we select the action that gives the optimal $Q(s, a)$ in the state $s$. We define the reward in order to minimize the collision rate ($P^{col}_t$, $0 < P^{col}_t < 1$) and the access delay ($delay_t$, $0 < delay_t < delay_{max}$). $P^{col}_t$ represents the number of colliding devices divided by the number of contending devices. The reward $r_t$ given when action $a$ is taken at state $s$ in RA slot $t$ is given by (3); it combines the collision rate and the access delay, where $delay_{max}$ is the maximum access delay that the system allows and $\omega$ is a smoothing factor ($0 \le \omega \le 1$) that sets the relative weight of the two terms.
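The action selection and barring-factor adjustment described above can be sketched as follows. Several parts are assumptions and are marked as such: actions are encoded as signed step sizes, the barring factor is clamped to [0, 1], and, in particular, the `reward` function uses an assumed weighted form that rewards low collision rate and low normalized delay. The reward expression actually used in the paper is not reproduced here and may differ.

```python
import random

# Three admissible actions, encoded as signed steps of 0.2 (assumed encoding):
# increase, decrease, or keep the barring factor.
ACTIONS = (0.2, -0.2, 0.0)


def select_action(q_table, state, actions=ACTIONS, epsilon=0.01, rng=None):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


def apply_action(barring_factor, action):
    """Shift the barring factor by the chosen step, clamped to [0, 1]."""
    return min(1.0, max(0.0, barring_factor + action))


def reward(col_rate, delay, delay_max, weight=0.5):
    """ASSUMED reward form, for illustration only.

    A larger weight emphasizes the collision rate; a smaller weight
    emphasizes the access delay normalized by delay_max.
    """
    return weight * (1.0 - col_rate) + (1.0 - weight) * (1.0 - delay / delay_max)


if __name__ == "__main__":
    q = {(1, 0.2): 1.0, (1, -0.2): 0.5, (1, 0.0): 0.2}
    step = select_action(q, state=1, rng=random.Random(0))
    print("chosen step:", step, "new barring factor:", apply_action(0.5, step))
```

Clamping is a design choice made here so that repeated increases or decreases cannot push the barring factor outside its valid range.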

Performance Analysis
In this paper, we extend the work of our previous study [15] and evaluate the performance of our access management scheme in terms of access success rate, collision rate, failure rate, and access delay. We adopted the traffic model for a smart metering application as an experimental scenario in which a large number of devices access RACHs in a highly synchronized manner [1]. Smart metering is one of the smart city applications.
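In the model, the housing density of an urban area of London located within a single cell was used as the density of meters. We set the number of meters $N$ to 35,670. Each meter requested one data transmission per reading frequency $T$. We set the reading frequency to 5 min. 3GPP defines two different traffic models for smart metering applications. We adopted a Beta distribution based model for our experiments. The number of meters that start the RA procedure in the $i$th RA slot is defined as

$$N_i = N \int_{t_{i-1}}^{t_i} f(t)\, dt,$$

where $t_i$ is the end time of the $i$th RA slot and $f(t)$ follows the Beta distribution. The $f(t)$ is defined as

$$f(t) = \frac{t^{\alpha - 1} (T - t)^{\beta - 1}}{T^{\alpha + \beta - 1}\, \mathrm{Beta}(\alpha, \beta)},$$

where $\mathrm{Beta}(\alpha, \beta)$ is the Beta function with $\alpha = 3$ and $\beta = 4$ (these shape parameters are unrelated to the learning rate of the QL model).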
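To give a feel for the shape of this arrival process, the expected number of new RA requests per slot can be computed numerically. The sketch below assumes the Beta-based activation density commonly used for synchronized MTC traffic (shape parameters 3 and 4 over the reading period) and an RA slot every 5 ms; the slot length and the function names are assumptions for illustration, and the per-slot integral is approximated by the midpoint rule.

```python
import math


def beta_pdf(t, period, a=3.0, b=4.0):
    """Beta-shaped activation density on [0, period] with shapes a and b."""
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (period - t) ** (b - 1) / (period ** (a + b - 1) * beta_fn)


def expected_arrivals(num_meters, period, slot_len):
    """Expected number of meters starting the RA procedure in each RA slot.

    The integral of the density over each slot is approximated by the
    density at the slot midpoint multiplied by the slot length.
    """
    num_slots = int(round(period / slot_len))
    return [num_meters * beta_pdf((i + 0.5) * slot_len, period) * slot_len
            for i in range(num_slots)]


if __name__ == "__main__":
    # 35,670 meters and a 5 min period come from the text; 5 ms slots are assumed.
    arrivals = expected_arrivals(35670, period=300.0, slot_len=0.005)
    peak = max(range(len(arrivals)), key=arrivals.__getitem__)
    print(f"total expected arrivals: {sum(arrivals):.1f}")
    print(f"peak slot at about {(peak + 0.5) * 0.005:.1f} s "
          f"with {arrivals[peak]:.3f} expected new requests")
```

The arrivals are concentrated around the mode of the distribution rather than spread evenly over the period, which is what makes this scenario a stress test for the barring-factor control.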
In our QL model, we divided $P^{succ}_t$ evenly into 4 states: $s_1$ for $0 \le P^{succ}_t < 0.25$, $s_2$ for $0.25 \le P^{succ}_t < 0.5$, $s_3$ for $0.5 \le P^{succ}_t < 0.75$, and $s_4$ for $0.75 \le P^{succ}_t \le 1$. $delay_{max}$ was set to the size of an RA slot multiplied by the Maximum Number of Preamble Transmissions. We set the learning rate $\alpha$ in (1) to 0.9. For the $\epsilon$-greedy method, we set $\epsilon$ to 0.01. The basic RACH capacity parameters for LTE FDD networks followed [1].
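The mapping from a measured access success rate to one of the four states can be written compactly. This is a small sketch of the even partition described above; the function name and the 1-based state index are illustrative choices.

```python
def success_rate_to_state(success_rate, num_states=4):
    """Map an access success rate in [0, 1] to a state index 1..num_states.

    The interval [0, 1] is split into num_states equal-width bins; the top
    bin is closed on the right so that a success rate of 1 maps to the
    last state.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success rate must lie in [0, 1]")
    return min(int(success_rate * num_states), num_states - 1) + 1


if __name__ == "__main__":
    for rate in (0.0, 0.24, 0.25, 0.6, 1.0):
        print(rate, "->", success_rate_to_state(rate))
```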
Our scheme was trained using five different datasets. Each dataset included four training sets and one test set. The training sets contained the RA requests generated by the meters in a series of four reading frequencies. After the training, we measured the performance metrics of the test set. In the figures, the plotted values indicate the averages of values measured from the five datasets.
Figure 1 shows the scheme's performance with respect to various numbers of admissible actions. We defined the set elements $\Delta = \{\delta_1 = 0.2, \delta_2 = 0.1\}$ as operators to be used in actions. We used only $\delta_1$ when there were three admissible actions: increasing the barring factor $p$ by $\delta_1$, decreasing $p$ by $\delta_1$, and maintaining the current value of $p$. We used both $\delta_1$ and $\delta_2$ when there were five admissible actions: increasing $p$ by $\delta_1$ or $\delta_2$, decreasing $p$ by $\delta_1$ or $\delta_2$, and maintaining the current value of $p$. For the reward in (3), we set the smoothing factor $\omega$ to 0.5 and we set the discount factor $\gamma$ in (2) to 0.5. For the access success rate, the failure rate, and the access delay, the scheme with three admissible actions showed about 78%, 25%, and 26% better performances, respectively, than the scheme with five admissible actions did. The failure rate was calculated by taking the number of devices that ultimately failed RA attempts because the preamble transmission counter had reached the Maximum Number of Preamble Transmission and dividing it by the number of contending devices. The scheme with three admissible actions showed a collision rate of about 4 times that of the scheme with five admissible actions. The performance with respect to various numbers of admissible actions is mainly influenced by the granularity of $\delta$. When the granularity is properly coarse (e.g., $\delta = 0.2$), the barring factor swiftly copes with the variance of the number of meters trying to access RACH. However, when the granularity is too fine (e.g., $\delta = 0.1$), the barring factor does not promptly respond to the variance of RA requests. How to determine the level of granularity remains an open issue.
Figure 2 shows the scheme's performance with respect to various values of the smoothing factor $\omega$, which influenced the rewards, as shown in (3). We considered three admissible actions involving $\delta_1$ with $\gamma = 0.5$. For the access success and failure rates, the scheme with $\omega = 0.5$ showed about 30% and 16% better performances than those of the scheme with $\omega = 0.3$. Moreover, the scheme with $\omega = 0.5$ showed about 3 times and 35% better performance than the scheme with $\omega = 0.7$ did. For collision rate, the scheme with $\omega = 0.7$ showed about 11 times and 24 times better performances than those with $\omega = 0.3$ and $\omega = 0.5$, respectively. This is because the rewards added more weight to the collision rate when $\omega = 0.7$. For access delay, the scheme with $\omega = 0.5$ showed about 25% and 73% better performances than those of the schemes with $\omega = 0.3$ and $\omega = 0.7$, respectively. In these cases, the rewards added weight to access delay, and the scheme with $\omega = 0.3$ showed about 38% better performance than that of the scheme with $\omega = 0.7$. The performance is significantly influenced by $\omega$. When more weight is added to the collision rate, the collision rate is improved. When more weight is added to access delay, the access delay is improved. A mechanism for dynamically controlling the smoothing factor is needed to improve the performance, and we leave it for our future research.
Figure 3 shows the scheme's performance with respect to various values for the discount factor $\gamma$ in (2). We used three admissible actions involving $\delta_1$ with $\omega = 0.5$. When $\gamma$ was low, weight was added to quicker rewards. For the success rate, failure rate, and access delay, the scheme with $\gamma = 0.5$ showed about 37%, 22%, and 25% better performances than the scheme with $\gamma = 0.3$ did. Compared to the scheme with $\gamma = 0.7$, the scheme with $\gamma = 0.5$ showed about 26%, 17%, and 22% better performances, respectively. For collision rate, the scheme with $\gamma = 0.3$ showed about 88% and 68% better performances than those of the schemes with $\gamma = 0.5$ and $\gamma = 0.7$, respectively.
To evaluate the scheme's performance, we compared our scheme to the original ACB [7]. In the original ACB, when congestion is detected, eNB regulates meters' RA attempts by setting the barring factor to 0.1. For our scheme, we used three admissible actions involving $\delta_1$ with $\omega = 0.5$ and $\gamma = 0.5$. In Figure 4, our scheme shows a success rate about 5 times that of the original ACB. In terms of the failure rate and access delay, our scheme showed about 70% and 50% better performances, respectively. For collision rate, the original ACB showed a very low value compared to that of our scheme. In the original ACB, because most meters have restricted access to RACH and the competition for channel resources is reduced, the success and collision rates decrease, but both the failure rate and the access delay increase.

Conclusion
To tackle the RA congestion problem of MTC in LTE-A networks, we modelled the ACB barring factor decision problem using a Q-learning algorithm. The algorithm was able to follow unexpected and bursty traffic changes rapidly. The goals of our model were minimizing both the RACH collision rate and the access delay. To meet these goals, we defined sets of possible states and of admissible actions by using both the access success rate and the unit values to change the barring factor. To evaluate the performance of our scheme, we adopted the traffic model for smart metering applications. The results show that our scheme improves system performance in terms of the access success rate, the failure rate, the collision rate, and the access delay.

Figure 1: A performance comparison for various numbers of admissible actions.

Figure 2: A performance comparison for various values of the smoothing factor $\omega$.

Figure 3: A performance comparison for various values of the discount factor $\gamma$.

Figure 4: A performance comparison with the original ACB.