Task Offloading with Power Control for Mobile Edge Computing Using Reinforcement Learning-Based Markov Decision Process

This paper proposes an efficient computation task offloading mechanism for mobile edge computing (MEC) systems. The studied MEC system consists of multiple user equipment (UEs) and multiple radio interfaces. To maximize the number of UEs benefiting from the MEC, the task offloading and power control strategy for a UE is optimized jointly. However, finding the optimal solution is NP-hard. We therefore reformulate the problem as a Markov decision process (MDP) and develop a reinforcement learning (RL)-based algorithm to solve the MDP. Simulation results show that the proposed RL-based algorithm achieves near-optimal performance compared to the exhaustive search algorithm. It also outperforms the received signal strength (RSS)-based method, both from the standpoint of the system (as it leads to a larger number of beneficial UEs) and from that of an individual UE (as it generates a lower computation overhead for the UE).


Introduction
With the rapidly increasing popularity of mobile UEs, such as smart phones, tablet computers, and Internet of Things (IoT) devices, new mobile applications such as navigation, face recognition, and interactive online gaming are emerging constantly [1]. Nevertheless, the limited computing resources of UEs cannot meet the demand of computation-intensive applications and, hence, become the bottleneck for providing satisfactory QoS. Conventional cloud computing systems enable UEs to utilize the powerful computing capability of remote public clouds; however, long latency may be incurred due to the data exchange in wide area networks (WANs). To reduce the latency, MEC systems have been proposed to deploy computing resources closer to UEs, within the radio access network (RAN) [2]. A typical MEC system is shown in Figure 1 [3]. The edge cloud consists of a number of baseband units (BBUs) and a MEC server. Multiple remote radio heads (RRHs) working as the radio transceivers of the RAN are connected to the edge cloud via optical fiber. Each mobile clone in the MEC server is associated with a specific UE. It works as the proxy virtual machine for the UE: it collects the task input data generated by the UE, produces the analytical results on behalf of the UE, and sends the results back to the UE. By offloading computation tasks from UEs to proximate edge clouds, MEC has the potential to reduce computation latency, avoid network congestion, and prolong the battery lifetimes of UEs [4]. In a MEC system, a UE is likely to have many candidate RRHs via which it can offload the task to the edge cloud, so the problem of associating a UE with an appropriate RRH is becoming more important. A conventional approach (which is also suggested by LTE-A) is to select the RRH that offers the highest received signal strength (RSS) to offload the task, but this approach does not consider the interference caused by the UEs associated with the same RRH.
There have been many efforts in the literature toward the computation task offloading problem in MEC systems. Liu et al. [5] proposed a resource allocation scheme for the multiuser task offloading scenario. The target is to minimize the overall energy consumption of the UEs under a latency constraint. Since only one RRH is considered in [5], Zhang et al. [6] extended the work to the multi-RRH scenario. However, the UE-RRH association is predetermined in [6]. A more flexible association scheme is required to balance the signal-to-interference-plus-noise ratios (SINRs) at different RRHs. To address this issue, Chen et al. [7] studied the task offloading problem in a multichannel interference environment. They devised a game-theoretic strategy for a UE to determine the channel via which it can offload the task. Unfortunately, optimized power control was not considered in [7]. As reported in [8-10], efficient power control can greatly mitigate the SINR degradation on a shared channel and thus lead to a substantial performance improvement for all users.
Motivated by the previous works, this paper designs an efficient task offloading mechanism for MEC systems with multi-UE and multi-RRH settings. To maximize the number of UEs benefiting from the MEC, the task offloading and power control strategy is optimized jointly. Although the formulated mathematical problem is NP-hard, we can obtain a near-optimal solution by recasting it as an MDP and solving the MDP with an RL-based algorithm.

System Model
The studied MEC system working with the multi-UE and multi-RRH settings is shown in Figure 1. The set of the RRHs is denoted by $\mathcal{K} = \{1, 2, \ldots, K\}$ and the set of the UEs is denoted by $\mathcal{N} = \{1, 2, \ldots, N\}$. The UEs are distributed uniformly in the radio coverage area of the RRHs. It is assumed that the nth UE ($\forall n \in \mathcal{N}$) has a computation task $T_n$ to be executed, which is characterized by a two-tuple of parameters $(b_n, c_n)$, where $b_n$ (in bits) denotes the amount of the task input data and $c_n$ (in CPU cycles/bit) denotes the number of CPU cycles required for computing 1 bit of the data. The values of $b_n$ and $c_n$ depend on the nature of the task and can be obtained through offline measurements [8].
We assume that a UE has $Y$ power levels (corresponding to $Y$ modulating constellations) for data transmission. Let $P_1$ and $P_Y$ denote, respectively, the minimum and maximum transmit powers at a UE. Let $p_n$ denote the transmit power applied by the nth UE for uploading the task input data to the RRH. With $y \in \mathcal{Y} = \{1, \ldots, Y\}$ indexing the power levels, we have $p_n \in \{P_1, \ldots, P_y, \ldots, P_Y\}$ for $\forall n \in \mathcal{N}$. MEC enables a UE to perform task offloading by sending the task input data to the edge cloud via an RRH. Let $Z_n = (d_n, p_n)$ denote the task offloading decision for the nth UE, where $d_n \in \{0, 1, \ldots, K\}$. Here $d_n = 0$ means that the UE chooses to execute the task locally, and $d_n = k$ ($\forall k \in \mathcal{K}$) means that the UE chooses to offload the task to the edge cloud via the kth RRH by using transmission power $p_n$.

Local Computing Model
If a UE chooses to execute the task locally, the latency for computing the task $T_n$ can be expressed as
$$T_n^{L} = \frac{b_n c_n}{f_n^{L}}, \tag{1}$$
where $f_n^{L}$ denotes the computation capacity of the nth UE, measured in CPU cycles per second.
Let $V_n$ denote the energy consumption per second for computing at the nth UE. The total energy consumption for computing the task $T_n$ locally is given by
$$E_n^{L} = V_n T_n^{L} = \frac{V_n b_n c_n}{f_n^{L}}, \tag{2}$$
where $T_n^{L} = b_n c_n / f_n^{L}$ is the local execution latency. In this paper, we consider that the UEs may have different QoS demands. That is, some delay-sensitive UEs (e.g., mobile phones and surveillance UEs) need lower latency but can bear higher energy consumption, while some energy-sensitive UEs (e.g., sensor nodes and IoT devices) require lower energy consumption but are delay insensitive. So, we adopt a composite index, termed the computation cost in [5], to reflect the QoS satisfaction of a UE when executing a computation task.
In detail, the computation cost for the nth UE to execute task $T_n$ locally is defined as
$$U_n^{L} = u_n T_n^{L} + (1 - u_n) E_n^{L}, \tag{3}$$
where $T_n^{L}$ and $E_n^{L}$ are the latency and the energy consumption of local execution, and $u_n$ ($0 \le u_n \le 1$) is the weighting factor used for adjusting the tradeoff between the execution latency and the energy consumption. When a UE is in a low battery state and cares more about the energy consumption, it can set $u_n = 0$. In contrast, when a UE has sufficient energy and runs delay-sensitive applications, it cares more about the execution latency and can set $u_n = 1$.
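As an illustration of the local computing model, the following Python sketch evaluates the latency, energy, and weighted cost of local execution. It assumes the latency is the required CPU cycles ($b_n c_n$) divided by the CPU frequency, the energy is the computing power multiplied by the latency, and the cost is the $u_n$-weighted sum of the two. The function name `local_cost` and its argument names are hypothetical, and the sample values (a 2 kB task taken as 16,000 bits, 20 cycles/bit, a 0.1 MHz CPU, 0.1 W) are borrowed from the simulation settings.

```python
def local_cost(b_bits, c_cycles_per_bit, f_hz, v_watts, u):
    """Return (latency_s, energy_j, cost) for executing a task locally.

    Hypothetical helper: latency = b*c/f, energy = V*latency,
    cost = u*latency + (1-u)*energy.
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("weighting factor u must lie in [0, 1]")
    latency = b_bits * c_cycles_per_bit / f_hz   # seconds
    energy = v_watts * latency                   # joules
    cost = u * latency + (1.0 - u) * energy      # weighted sum (mixed units)
    return latency, energy, cost


# Example: 2 kB task (assumed to be 16,000 bits), 20 cycles/bit,
# 0.1 MHz CPU, 0.1 W computing power, balanced weighting u = 0.5.
latency, energy, cost = local_cost(16_000, 20, 1e5, 0.1, 0.5)
```

Setting `u=1.0` makes the cost equal to the latency alone, and `u=0.0` makes it equal to the energy alone, which mirrors the two extreme QoS preferences described above.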

Mobile Edge Computing Model
When a UE does not have enough computation or energy resources to process the computation task locally, it can offload the task to the edge cloud. In this case, the UE selects one of the RRHs and then transmits the task input data to the edge cloud via that RRH, consuming communication resources. For ease of analysis, we consider a quasi-static scenario where the set of the active UEs and their wireless channel conditions remain unchanged during a task offloading decision period $T$ (e.g., several hundred milliseconds), while they can change across different periods. We also assume that each RRH holds just one physical channel and that the channels of the RRHs do not overlap. Each UE can thus select a specific RRH to offload the computation task to the edge cloud. Let $w$ denote the channel bandwidth available for each RRH. Given the decision profile $Z = (Z_1, \ldots, Z_N)$ of the active UEs, the transmission rate of the nth UE that selects the kth RRH to offload the task can be computed as
$$R_{n,k}(Z) = w \log_2\left(1 + \frac{p_n g_{n,k}}{\sigma^2 + \sum_{i \in \mathcal{N} \setminus \{n\},\ d_i = d_n} p_i g_{i,k}}\right), \tag{4}$$
where $\sigma^2$ is the noise variance at the kth RRH, $g_{n,k}$ is the power gain of the channel from the nth UE to the kth RRH, and the summation condition ($i \in \mathcal{N} \setminus \{n\}$ and $d_i = d_n$) ranges over every ith UE other than the nth UE that also selects the kth RRH to offload its task to the edge cloud.
Due to the powerful computing capability provided by the edge cloud (as many telecom operators are capable of large-scale infrastructure investment), we ignore the latency and energy consumption at the edge cloud for executing the tasks offloaded by the UEs. Additionally, as the computation results are small, the feedback delay can also be ignored. Hence, the latency for executing the task $T_n$ remotely at the edge cloud via the kth RRH can be expressed as
$$T_{n,k}^{M} = \frac{b_n}{R_{n,k}(Z)}, \tag{5}$$
where $R_{n,k}(Z)$ is the transmission rate of the nth UE via the kth RRH. The energy consumption of the UE is mainly generated by the task input data transmission and is given by
$$E_{n,k}^{M} = p_n T_{n,k}^{M} = \frac{p_n b_n}{R_{n,k}(Z)}. \tag{6}$$
When the nth UE selects the kth RRH to offload the task to the edge cloud, we define the computation cost for the nth UE as the weighted sum of the execution latency and the energy consumption:
$$U_{n,k}^{M} = u_n T_{n,k}^{M} + (1 - u_n) E_{n,k}^{M}. \tag{7}$$
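The offloading cost can be illustrated with a short Python sketch. It assumes a Shannon-type rate in which co-channel UEs (those selecting the same RRH) add to the noise, a latency equal to the task size divided by the rate, and an energy equal to the transmit power multiplied by the latency. The function names `uplink_rate` and `offload_cost`, the list-based data layout, and the sample gain of $10^{-12}$ (roughly a 1 km link with path-loss factor 4) are assumptions made for illustration only.

```python
import math


def uplink_rate(n, decisions, powers, gains, w_hz=1e6, noise_w=1e-14):
    """Rate (bit/s) of UE n on its selected RRH under co-channel interference.

    decisions[i] is d_i in {0, 1, ..., K}; gains[i][k - 1] is g_{i,k}.
    """
    k = decisions[n]
    if k == 0:
        raise ValueError("UE n computes locally and has no uplink rate")
    signal = powers[n] * gains[n][k - 1]
    interference = sum(
        powers[i] * gains[i][k - 1]
        for i in range(len(decisions))
        if i != n and decisions[i] == k
    )
    return w_hz * math.log2(1.0 + signal / (noise_w + interference))


def offload_cost(n, decisions, powers, gains, task_bits, u, **channel):
    """Weighted latency/energy cost of offloading for UE n."""
    latency = task_bits / uplink_rate(n, decisions, powers, gains, **channel)
    energy = powers[n] * latency  # transmission energy only
    return u * latency + (1.0 - u) * energy


# Two UEs, one RRH: compare UE 0 alone on the channel with UE 0 sharing it.
gains = [[1e-12], [1e-12]]
alone = uplink_rate(0, [1, 0], [0.1, 0.1], gains)
shared = uplink_rate(0, [1, 1], [0.1, 0.1], gains)
```

In this toy setting the rate of UE 0 drops once the second UE joins the same RRH, which is the co-channel effect that the offloading decision has to account for.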

Problem Formulation
In general, the number of the UEs that attempt to access the edge cloud is much larger than the number of the RRHs (i.e., $K \ll N$). The UEs are required to make their task offloading decisions simultaneously in each decision period $T$. Since the wireless channel held by each RRH is a shared medium, if too many UEs select the same RRH to offload their tasks, severe co-channel interference and a high computation cost for the UEs would result. In such a case, it would be more beneficial for a UE to select another RRH to offload the task or to execute the task locally. In addition, it is also shown in [7] that if efficient power control were applied, a UE could achieve a high data rate while expending only a small amount of energy. Hence, it is necessary to coordinate the transmission powers of the UEs that select the same RRH to offload their tasks to the edge cloud.
For the nth UE, the optimal task offloading decision $Z_n^* = (d_n^*, p_n^*)$ should yield the lowest cost of executing task $T_n$. In particular, we refer to the nth UE as an MEC-benefited UE if it chooses to offload the task $T_n$ to the edge cloud rather than executing the task locally.
That means $U_{n,k}^{M} < U_n^{L}$ ($\exists k \in \mathcal{K}$) and $d_n > 0$ for the nth UE. From the system designer's point of view, however, the optimal task offloading decision for the UEs, denoted by $Z^* = (Z_1^*, \ldots, Z_N^*)$, should maximize the number of the MEC-benefited UEs. This leads to a higher utilization ratio of the MEC infrastructure and brings a higher revenue for providing the MEC service. Mathematically, we can formulate the optimal task offloading problem as
$$\max_{Z} \sum_{n \in \mathcal{N}} I_{\{U_{n,d_n}^{M} < U_n^{L},\ d_n > 0\}} \tag{8}$$
subject to $d_n \in \{0, 1, \ldots, K\}$ and $p_n \in \{P_1, \ldots, P_y, \ldots, P_Y\}$, $\forall n \in \mathcal{N}$, where $I_{\{\cdot\}}$ is an indicator function that equals 1 if the condition in braces holds and 0 otherwise. However, it can be proved that problem (8) of finding the optimal decision profile $Z^*$ is NP-hard, as it is an instance of the Mixed Integer Nonlinear Programming (MINLP) problem (which is known to be NP-hard [11]). The proof is omitted here due to limited space. To ease the heavy computational burden at the MEC server, we next model the task offloading decision process as a Markov decision process (MDP). A reinforcement learning (RL)-based algorithm is then developed to find the solution to the MDP.
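To make the objective concrete, the following Python sketch counts the MEC-benefited UEs of a decision profile and finds the best profile by brute force on a tiny instance. It is only an illustration under stated assumptions: a Shannon-type rate with co-channel interference, offloading cost equal to the weighted sum of transmission latency and transmission energy, and hypothetical names (`count_benefited`, `exhaustive_search`) and data layout. The search visits $(1 + KY)^N$ profiles, which is why it is usable only for very small $N$.

```python
import itertools
import math


def count_benefited(profile, gains, local_costs, task_bits, weights,
                    w_hz=1e6, noise_w=1e-14):
    """Count UEs that offload (d_n > 0) at a cost below their local cost.

    profile[n] = (d_n, p_n); gains[n][k - 1] is the gain from UE n to RRH k.
    """
    benefited = 0
    for n, (d_n, p_n) in enumerate(profile):
        if d_n == 0:
            continue  # local execution never counts as benefited
        interference = sum(
            p_i * gains[i][d_n - 1]
            for i, (d_i, p_i) in enumerate(profile)
            if i != n and d_i == d_n
        )
        sinr = p_n * gains[n][d_n - 1] / (noise_w + interference)
        latency = task_bits[n] / (w_hz * math.log2(1.0 + sinr))
        cost = weights[n] * latency + (1.0 - weights[n]) * p_n * latency
        if cost < local_costs[n]:
            benefited += 1
    return benefited


def exhaustive_search(num_rrhs, power_levels, gains, local_costs,
                      task_bits, weights):
    """Try every decision profile and keep the first one with the best count."""
    options = [(0, 0.0)] + [
        (k, p) for k in range(1, num_rrhs + 1) for p in power_levels
    ]
    best_profile, best_count = None, -1
    for profile in itertools.product(options, repeat=len(gains)):
        count = count_benefited(profile, gains, local_costs, task_bits, weights)
        if count > best_count:
            best_profile, best_count = profile, count
    return best_profile, best_count


# Toy instance: 2 UEs, 1 RRH, 2 power levels (watts), identical tasks.
gains = [[1e-12], [1e-12]]
best_profile, best_count = exhaustive_search(
    1, [0.05, 0.25], gains, [1.76, 1.76], [16_000, 16_000], [0.5, 0.5]
)
```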

Markov Decision Process (MDP)
In the MEC system, the number of the UEs that attempt to access the edge cloud is much larger than the number of the RRHs, and the UEs are required to make their task offloading decisions simultaneously in each decision period. This involves an interaction between a UE (as a decision maker) and the environment (the interference incurred by the co-channel UEs), within which the UE seeks to minimize its computation cost despite uncertainty about the environment. The UEs' actions affect the future states of the environment (the interference levels at the RRHs), thereby affecting the options and opportunities available to the UEs at later time steps. In such a situation, where outcomes are partly random and partly under the control of a decision maker, MDPs [12] provide a mathematical framework to model and analyze the decision-making process. More precisely, the task offloading decision process is modelled as an MDP, which is essentially a discrete-time stochastic control process. An agent-environment interaction of the MDP is termed an episode, which corresponds to a task offloading period $T$. An episode is further broken into several discrete time steps. At each time step $t$ ($t = 1, 2, \ldots$), a UE is in some state $s$. An episode of the MDP starts from a random initial state and ends in a terminal state. A UE acting as a decision maker must choose an action $a$ that is available in state $s$; the MDP then responds at the next time step by moving the UE into a new state $s'$ and giving the UE a corresponding reward $r$. In the proposed MDP, future states depend only on the current state and not on earlier ones; thus, the memoryless Markov property is guaranteed. The states, actions, and reward functions of the proposed MDP are formally defined below.
States: at any time step $t$, if a UE offloads the computation task via the kth RRH by using transmission power $P_y$, we say that the UE is in state $\varphi_{k,y}$ for $\forall k \in \mathcal{K}$ and $\forall y \in \mathcal{Y}$. Otherwise, if it executes the computation task locally, we say that the UE is in state $\varphi_{0,0}$. The set of states of the MDP can thus be given by $\mathcal{S} = \{\varphi_{0,0}, \varphi_{1,1}, \ldots, \varphi_{K,Y}\}$.
Actions: at each time step $t$, a UE must take an action $a$ according to the current state $s$ for $\forall s \in \mathcal{S}$, which also implies a transition from the current state $s$ to the next state $s'$ ($\forall s' \in \mathcal{S}$). We define $\mathcal{A} = \{\phi_{0,0}, \phi_{1,1}, \ldots, \phi_{K,Y}\}$ as the action set of the MDP, where $a = \phi_{0,0}$ implies that a UE selects local computation, and $a = \phi_{k,y}$ ($\forall k \in \mathcal{K}$ and $\forall y \in \mathcal{Y}$) implies that the UE selects the kth RRH to offload the computation task by using transmission power $P_y$.
Reward functions: after the agent-environment interaction in each time step $t$, a UE obtains a reward which represents the optimization objective. The reward function maps a state-action pair to a stochastic reward. Since the objective is to minimize the computation cost of a UE, the reward function of the nth UE is defined in terms of its computation cost, where $\lambda_1$ and $\lambda_2$ are variables for normalization.
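The state and action sets described above can be enumerated directly. The Python sketch below represents each state (and, equivalently, each action) as a pair `(k, y)`, with `(0, 0)` standing for local computation; this tuple encoding and the function name `build_state_space` are assumptions made for illustration. The resulting set has $1 + KY$ elements.

```python
from itertools import product


def build_state_space(num_rrhs, num_power_levels):
    """Enumerate (k, y) pairs: (0, 0) is local computing, (k, y) is
    offloading via RRH k at power level y."""
    offload = product(range(1, num_rrhs + 1), range(1, num_power_levels + 1))
    return [(0, 0)] + list(offload)


# With K = 5 RRHs and Y = 5 power levels there are 1 + 5 * 5 entries.
states = build_state_space(5, 5)
actions = list(states)  # taking action (k, y) moves the UE to state (k, y)
```

Because every action names the state it leads to, a Q-table for one UE needs $(1 + KY)^2$ entries, which stays small even when the number of UEs grows.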

RL-Based Solution Method
MDPs cover a wide range of optimization problems, which can be solved via dynamic programming (DP) and RL methods [12]. RL is an area of machine learning concerned with how an agent takes actions in an environment so as to maximize the cumulative reward. The main difference between DP methods and RL methods is that the latter do not assume knowledge of an exact mathematical model of the MDP, and they target large-scale MDPs where DP methods become infeasible. Next, we develop an RL-based algorithm to solve the MDP. First, we assume that the decision of the nth UE for choosing an action $a$ in a given state $s$ is determined by a policy $\pi(s)$. Then, we need to find the state-action value function $Q(s, a)$, also called the Q-value function of the MDP [12], which represents the expected return (cumulative discounted reward) that the nth UE receives when taking action $a$ in state $s$ and behaving according to the policy $\pi(s)$ afterwards. In the RL method, the Q-function is learned through the interaction between the decision makers and their environment, and it can thus approximate the optimal state-action value function directly, independent of the policy being followed. Next, we define the updating rule of the Q-value function as
$$Q(s, a) \leftarrow Q(s, a) + \beta \left[ r(s, a) + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a) \right], \tag{15}$$
where $\gamma$ is the reward decay and $\beta$ is the learning rate. If the optimal $Q(s, a)$ is known, the optimal policy can be found by [12]
$$\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q(s, a). \tag{16}$$
Finally, the number of the MEC-benefited UEs can be obtained. The overall learning procedure is summarized in Algorithm 1.
In Algorithm 1, we use the ε-greedy policy [12] to discover effective actions. In detail, the UEs perform exploration with probability ε ($0 < \varepsilon < 1$) at each time step, and they exploit the stored Q-values with probability $1 - \varepsilon$. Note that the algorithm is performed at the MEC server, i.e., the control center of the system. Since the MEC server has full knowledge of the RRHs and UEs and a powerful computation capacity, the only additional information required is the channel state information (CSI) of the UEs. The task input data and the CSI of the UEs can be conveyed to the MEC server by the RRHs. Subsequently, the scheduling information can be fed back to the UEs, also via the RRHs.
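The learning loop can be sketched in Python as standard tabular Q-learning with an ε-greedy behavior policy. This is a minimal single-agent sketch, not the exact procedure of Algorithm 1: the per-UE loop is omitted, the environment is passed in as a hypothetical `step(state, action)` callback, and the toy reward table below is invented purely to exercise the code (the paper's own reward depends on the computation cost). The parameter names `beta`, `gamma`, and `epsilon` denote the learning rate, the reward decay, and the exploration probability.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))  # first action with the largest Q-value


def q_learning(num_states, num_actions, step, t_max,
               beta=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; step(s, a) must return (reward, next_state)."""
    rng = random.Random(seed)
    q = [[0.0] * num_actions for _ in range(num_states)]
    s = 0  # initial state (index 0, e.g. local computing)
    for _ in range(t_max):
        a = epsilon_greedy(q[s], epsilon, rng)
        r, s_next = step(s, a)
        # Off-policy temporal-difference update toward the greedy target.
        q[s][a] += beta * (r + gamma * max(q[s_next]) - q[s][a])
        s = s_next
    return q


# Toy environment (hypothetical): choosing action a moves the agent to
# state a and pays a fixed reward, so action 1 is the best choice everywhere.
TOY_REWARDS = [0.0, 1.0, 0.2]


def toy_step(state, action):
    return TOY_REWARDS[action], action


q_table = q_learning(3, 3, toy_step, t_max=20_000, epsilon=0.5)
greedy_policy = [row.index(max(row)) for row in q_table]
```

A large exploration probability is used in the toy run so that every state-action pair is visited often; because the update bootstraps from the greedy value, the learned table still reflects the greedy policy rather than the exploratory behavior.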

Simulation Results
In the simulation, we set up a MEC system as shown in Figure 1. The coverage range of the system is 1 km and multiple UEs are scattered randomly over the region. The available channel bandwidth for each RRH is $w = 1$ MHz. The power-law path loss of the wireless channels is modelled as $g_{n,k} = l_{n,k}^{-\alpha}$, where $l_{n,k}$ is the distance between the nth UE and the kth RRH and $\alpha = 4$ is the path-loss factor. The background noise variance is set as $\sigma^2 = 10^{-14}$ W. The set of transmission powers for each UE is $\{50, 100, 150, 200, 250\}$ mW. We take face recognition applications [13] as the computation tasks. For the nth UE, we set the size of the task input data as $b_n = 2$ kB, the number of required CPU cycles per bit as $c_n = 20$ cycles/bit, and the power for local computing as $V_n = 0.1$ W. Due to the heterogeneity of the mobile UEs, we assume that the CPU computational capability $f_n^{L}$ of the nth UE is randomly selected from the set $\{0.1, 0.15, 0.2\}$ MHz, and the QoS weighting factor $u_n$ for the nth UE is randomly selected from the set $\{0, 0.5, 1\}$. First, we verify the effectiveness of the proposed RL-based algorithm. The number of the RRHs in the MEC system is $K = 5$, and each UE transmits at the maximum power $P_Y$. We compare the number of the MEC-benefited UEs obtained by using the RL-based algorithm with that obtained by using the exhaustive search (ES) algorithm. Note that the ES algorithm is globally optimal, but its computational complexity grows exponentially with the number of the UEs. The simulation is repeated 100 times and the averaged results are shown in Figure 2. We see that the RL-based algorithm can find near-optimal solutions to problem (8). Since power control is not applied in this simulation, the performance can be taken as the lower bound of the RL-based algorithm.
Next, we test the ability of the RL-based algorithm to deal with a large-scale network where 120 to 280 UEs can simultaneously issue their task offloading requests. For that purpose, we increase the number of the RRHs to $K = 9$. In Figure 3, we show the ratio of the beneficial UEs in the system obtained by using different task offloading algorithms.
From Figure 3, we see that the performance of the RSS-based algorithm decreases sharply with increasing $N$, while the RL-based algorithms (with and without power control) can maintain the beneficial ratio at a high level of 93%. In addition, we see that the RL-based algorithm with power control outperforms its counterpart without power control in all the network situations.
Finally, we show the effect of power control in reducing the computation overhead of a UE. To this end, we compare in Figure 4 the average computation overheads obtained by a UE before and after applying power control. The overhead of a UE is obtained by using equation (3) when it executes the task locally or by using equation (7) when it offloads the task to the edge cloud.

Initialization:
  t ⟵ 0; s ⟵ 0, ∀s ∈ S; Q(s, a) ⟵ 0, ∀s ∈ S and ∀a ∈ A;
while t ≤ t_max (t_max is the maximum number of iterations)
  for each UE in N
    if exploration
      choose an action a arbitrarily with probability ε, 0 < ε < 1;
    else exploitation
      choose an action a = arg max_{a∈A} Q(s, a);
    end if
    perform a and get a reward r(s, a) and a successor state s′;
    update the Q-value function according to equation (15);
    s ⟵ s′;
  end for
  t ⟵ t + 1;
end while
ALGORITHM 1: Q-learning algorithm with ε-greedy policy.

From Figure 4, we see that the RL-based algorithm with power control yields a lower computation overhead for a UE than its counterpart without power control. This implies that the RL-based algorithm with power control can coordinate the multiuser interference well and, therefore, can greatly reduce the computation overhead of a UE.

Conclusion
This paper proposes an RL-based MDP approach to solve the computation task offloading and power control problem in MEC systems with multi-UE and multi-RRH settings. In comparison to the ES algorithm, the proposed RL-based algorithm can achieve near-optimal system performance. When dealing with a large-scale network, the proposed RL-based algorithm achieves good performance from the standpoint of both the system and an individual UE.

Data Availability
The data used to support the findings of this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.