Deep Reinforcement Learning-Based Dynamic Offloading Management in UAV-Assisted MEC System

Unmanned aerial vehicles (UAVs) have been envisioned as a promising technique to provide relaying and mobile edge computing (MEC) services for ground user equipment (UE). In this paper, we propose a UAV-assisted MEC architecture in a dynamic environment, where a UAV flies along a fixed trajectory and may act as a MEC server to process the tasks offloaded from the UE, or as a relay to help the UE offload their tasks to the ground base station (BS). The objective of this work is to maximize the long-term number of completed tasks of the UE. An optimization problem is formulated to optimize the task offloading decisions of the UE. Considering the random demands of the UE, a deep reinforcement learning- (DRL-) based algorithm is proposed to solve the formulated nonconvex optimization problem. Simulation results verify the effectiveness and correctness of the proposed algorithm.


Introduction
The smart city paradigm brings massive computation-intensive and latency-sensitive applications, such as augmented reality (AR), virtual reality (VR), and video compression [1][2][3]. Due to their limited computing capability, it is hard for user equipment (UE) to complete the above applications in time without extra computing service. Although mobile cloud computing can provide powerful computing service, the computation tasks need to be transmitted to the remote cloud server, which may cause severe transmission delay and makes it difficult to meet the delay demands of the applications. To address this issue, the mobile edge computing (MEC) technique has emerged.
In MEC systems, MEC servers are deployed at the edge of the network to establish direct connections between the UE and the servers and provide real-time computing service for the UE [4]. The UE offloads computation tasks to the MEC servers through wireless channels. In traditional MEC architectures, MEC servers are typically deployed at fixed locations, such as at base stations (BS) or access points (AP). The quality of the communication links between the UE and MEC servers is easily degraded by dense buildings, which may lead to weak transmission signals or even interruption. This problem is especially serious in 5G scenarios due to the transmission characteristics of millimeter waves.
Equipping the MEC server on an unmanned aerial vehicle (UAV) can alleviate the problem stated above, owing to the UAV's easy deployment, high flexibility, and maneuverability. UAVs fly along predetermined trajectories in the air and have a higher probability of establishing line-of-sight (LoS) communication links with the ground UE. Hence, incorporating UAVs into MEC systems is a promising way to enlarge the MEC service coverage and enhance the service capacity at the same time.
To prolong the endurance of the UAV, the authors in [5] studied an energy consumption minimization problem, in which the scheduling, resource allocation, and hovering time of the UAV were jointly designed. The authors in [6] jointly optimized the uplink and downlink bit allocation and communications and the trajectory of the UAV to minimize the total mobile energy consumption. In [7], a UAV-enabled MEC network was considered; the UAV position, time slot allocation, and computation task partition were jointly optimized to minimize the total energy consumption of the ground users. In [8], the computing scheduling, 3D trajectory of the UAV, bandwidth, and transmission power allocation were jointly optimized to maximize the computation efficiency. In [9], the sum of the maximum delay among all the users was minimized by jointly optimizing the trajectory of the UAV, the offloading task partition, and the computing scheduling of the users.
In [10], the computation rate was maximized by jointly optimizing the CPU frequencies, the length of the offloading times, the transmit power, and the trajectory of the UAV under both partial and binary computation offloading modes. A computation efficiency maximization problem was formulated in [11], and the user association, CPU frequency, spectrum resources, and the UAV trajectory were jointly optimized.
Efforts in [5][6][7][8][9][10][11] focused on how to use UAVs to provide computation services to the ground UE. In [12, 13], the UAV was utilized not only as a MEC server but also as a relay to establish high-quality LoS wireless links between the UE and remote ground high-performance MEC servers. The authors in [12] designed the trajectory of the UAV and the transmit power of the UE and the UAV to minimize the outage probability of the considered UAV-assisted MEC network. In [13], the computation resource scheduling parameters and the trajectory of the UAV were jointly optimized to minimize the weighted sum energy consumption subject to the task and information causality, bandwidth allocation, and UAV trajectory constraints. In [14], the bit allocation, time slot scheduling, power allocation, and trajectory of the UAV were jointly optimized to minimize the total energy consumption of the MEC network. The works mentioned above mainly concern resource allocation and trajectory design problems in static environments. However, the task arrivals and the locations of the ground UE may change randomly in practical scenarios [15][16][17][18][19].
Therefore, the dynamic environment should be taken into account when designing the UAV-assisted MEC system.
In [15], the motion of the UE was considered, and a double Q-network based deep reinforcement learning (DRL) algorithm was proposed to maximize the system reward and meet the QoS constraint by optimizing the trajectory of the UAV and the association of the UE. In [16], a multi-UAV-assisted MEC system was studied, where multiple UAVs helped the BS compute the tasks offloaded from the UE. An average mission response time minimization problem was formulated and solved by a multiagent reinforcement learning algorithm. In [17], the task scheduling was optimized by a DRL-based method to balance the workload among multiple UAVs while guaranteeing the coverage constraint and the quality of service (QoS) demands of the UE. In [18], a three-layer online data processing network was proposed: the bottom layer was composed of sensors generating raw data; the middle layer included multiple UAVs collecting and preprocessing the data; and the top layer was responsible for receiving the processed results and conducting further evaluation. Lyapunov optimization and DRL-based algorithms were proposed to schedule edge processing and plan the path online. In [19], the multiagent deep deterministic policy gradient method was adopted to design the trajectories of the UAVs to jointly optimize the fairness among all the UE, the UAV's UE-load, and the total energy consumption of the UE. It is worth noting that, due to their short battery life and limited physical size, UAVs cannot carry very powerful MEC servers. Hence, it is not practical to rely on the UAVs alone to process all the tasks of the UE. The role of the UAVs in MEC systems should be that of a helper of the UE: UAVs and ground MEC servers (BS or AP) with strong computing power should coexist in UAV-assisted MEC systems. The main function of the UAVs should be to establish communication links between the UE and the ground MEC servers, and their secondary function is to provide limited computing service for the UE.
Motivated by the facts analysed above, we investigate the task allocation problem of a UAV-assisted MEC system in a dynamic environment. The main contributions are summarized as follows: (1) We propose a UAV-assisted MEC system in a dynamic environment, where a UAV is deployed to provide relaying and edge computing services to the ground UE, with random computation task arrivals at the beginning of each time slot. (2) We formulate a long-term number of completed tasks maximization problem for the proposed UAV-assisted MEC system, which is nonconvex and difficult to solve, and propose a DRL-based algorithm to solve it. The remainder of this paper is organized as follows. Section 2 presents the system model and formulates the optimization problem. The DRL-based optimization algorithm is described in Section 3. Simulation results are presented and discussed in Section 4. Finally, we conclude this study in Section 5.

System Model
In this section, we describe the system model and formulate a long-term number of completed tasks maximization problem. The key notations used in this paper are summarized in Table 1.

Network Model.
As shown in Figure 1, we consider a UAV-assisted MEC system with one ground BS, one UAV, and N ground UE. A MEC server with strong computing power is equipped at the BS. Let N = {1, 2, . . . , N} denote the set of UE. The system is assumed to operate in a time-slotted manner with time slot length Δ. We are concerned with the long-term utility during T consecutive time slots. The set of time slots is denoted as T = {1, 2, . . . , T}.

In this paper, the three-dimensional Cartesian coordinate system is adopted. The BS and the UE are fixed on the ground with zero altitude. The horizontal locations of the BS and of UE n, n ∈ N, are denoted as l_BS = (x_b, y_b) and l_n^UE = (x_n, y_n), respectively. The altitude of the UAV is set to H, which is appropriate to avoid buildings in the work terrain. The UAV flies along a predetermined trajectory; its horizontal location in time slot t, t ∈ T, is denoted as l_UAV(t) = (x_u(t), y_u(t)). As the length of each time slot is relatively short, the location of the UAV can be considered unchanged during each time slot, similar to [13].

Channel Model.
In this paper, we consider the case where there are no direct links between the UE and the BS, owing to the long distances and blockages between them. Similar to [9, 13, 14, 20, 21], we assume the air-ground wireless channels are dominated by LoS links. The channel power gains from the UAV to the ground BS and from the UAV to UE n, n ∈ N, at time slot t, t ∈ T, are, respectively, given as

h_0(t) = ρ_0 / (H² + ‖l_UAV(t) − l_BS‖²),
h_n(t) = ρ_n / (H² + ‖l_UAV(t) − l_n^UE‖²),

where ρ_n, n = 0, 1, . . . , N, denotes the channel power gain at a reference distance of 1 m.
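As a numerical illustration of the LoS channel model, the gain follows from the squared 3D distance between the UAV and a ground node; a minimal sketch under the free-space assumption (the helper name `los_channel_gain` and the sample values are ours, not the paper's):

```python
import math

def los_channel_gain(rho: float, uav_xy, node_xy, altitude: float) -> float:
    """Free-space LoS power gain: reference gain rho over squared 3D distance.

    rho is the channel power gain at the 1 m reference distance; the ground
    node (BS or UE) is assumed to be at zero altitude.
    """
    dx = uav_xy[0] - node_xy[0]
    dy = uav_xy[1] - node_xy[1]
    d_sq = dx * dx + dy * dy + altitude * altitude  # squared 3D distance
    return rho / d_sq

# Example: UAV at altitude 100 m, 30 m away horizontally from the node.
g = los_channel_gain(rho=1e-3, uav_xy=(30.0, 0.0), node_xy=(0.0, 0.0), altitude=100.0)
```

Because the UAV moves along its trajectory, the gains h_0(t) and h_n(t) vary from slot to slot; the gain decreases monotonically with distance and with altitude.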

Computation Model.
At the beginning of each time slot t, t ∈ T, each user n, n ∈ N, has a new task arrival θ_n^UE(t), which can be represented by the tuple {L[θ_n^UE(t)], C[θ_n^UE(t)], τ[θ_n^UE(t)]}, where L[θ_n^UE(t)] is the input data size, C[θ_n^UE(t)] denotes the computation complexity (i.e., the number of required CPU cycles for computing 1 bit of the task θ_n^UE(t)), and τ[θ_n^UE(t)] is the maximal tolerable latency of the task θ_n^UE(t). We consider a random task arrival scenario: L[θ_n^UE(t)] is randomly generated from a discrete set with M_n elements, Θ_n^L = {L_{n,1}, L_{n,2}, . . . , L_{n,M_n}}. The corresponding task complexity set and maximal tolerable latency set are Θ_n^C = {C_{n,1}, C_{n,2}, . . . , C_{n,M_n}} and Θ_n^τ = {τ_{n,1}, τ_{n,2}, . . . , τ_{n,M_n}}, respectively.

Table 1: Key notations used in this paper.

Notation | Description
θ_n^UE(t) | Computation task arriving at UE n at time slot t, t ∈ T
L(x) | Input data size of computation task x
τ(x) | Maximal tolerable latency of computation task x
C(x) | Number of CPU cycles required for computing 1 bit of task x

At the beginning of each time slot, each UE decides whether to execute the task θ_n^UE(t) locally, offload it to the UAV, or relay it through the UAV to the BS for processing. Let α_{n,t} ∈ {0, −1, 1}, n ∈ N, t ∈ T, denote the task allocation variable. If α_{n,t} = 0, the task θ_n^UE(t) is processed locally by UE n, n ∈ N; if α_{n,t} = −1, the task θ_n^UE(t) is offloaded to the UAV for processing; otherwise, if α_{n,t} = 1, the task θ_n^UE(t) is relayed by the UAV to the BS for processing. Similar to [22], we assume that if the task processing period exceeds its maximal tolerable latency, the task is dropped immediately.
All the UE and the UAV keep two queues for each UE, a computation queue and a transmission queue, both operated in a first-in-first-out (FIFO) manner. If α_{n,t} = 0, the task θ_n^UE(t) is stored in the computation queue at UE n, n ∈ N, and waits for processing. If α_{n,t} = ±1, the task θ_n^UE(t) is stored in the transmission queue at UE n, n ∈ N, and waits to be offloaded to the UAV. If α_{n,t} = −1, after receiving the task θ_n^UE(t), the UAV stores it in UE n's computation queue at the UAV, where it waits for processing. If α_{n,t} = 1, after receiving the task θ_n^UE(t), the UAV stores it in UE n's transmission queue at the UAV, where it waits to be relayed to the BS.

Local Computing at the UE n, n ∈ N.
The number of time slots required for computing the task θ_n^UE(t) locally is expressed as

T_n^Comp[θ_n^UE(t)] = ⌈L[θ_n^UE(t)] C[θ_n^UE(t)] / (f_n^l Δ)⌉,

where ⌈·⌉ is the ceiling function and f_n^l is the CPU frequency of UE n, n ∈ N. In practice, the task θ_n^UE(t) may not be completed within one time slot, so subsequent tasks of UE n, n ∈ N, may have to wait for processing. Let T_{n,Comp}^Wait[θ_n^UE(t)] denote the number of time slots the task θ_n^UE(t) waits before local processing, which is given as

T_{n,Comp}^Wait[θ_n^UE(t)] = max{ max_{t′ < t, α_{n,t′} = 0} T_{n,end}^Comp[θ_n^UE(t′)] − t, 0 },

where T_{n,end}^Comp[θ_n^UE(t′)] is the index of the time slot in which the computation of the task θ_n^UE(t′) is completed, given as

T_{n,end}^Comp[θ_n^UE(t′)] = t′ + T_{n,Comp}^Wait[θ_n^UE(t′)] + T_n^Comp[θ_n^UE(t′)].

A recursive method can be used to obtain the value of T_{n,end}^Comp, starting from T_{n,Comp}^Wait[θ_n^UE(t_{n,b})] = 0, where θ_n^UE(t_{n,b}) is the first task processed at UE n, n ∈ N. Then, the number of total time slots required for completing the task θ_n^UE(t) at UE n, n ∈ N, is given as

T_n^Total[θ_n^UE(t)] = T_{n,Comp}^Wait[θ_n^UE(t)] + T_n^Comp[θ_n^UE(t)].

However, as mentioned above, if the processing period of θ_n^UE(t) exceeds the maximal tolerable latency, θ_n^UE(t) must be dropped immediately. Hence, the number of total time slots consumed for completing the task θ_n^UE(t) at UE n, n ∈ N (including the time for waiting and processing), is given as

min{ T_n^Total[θ_n^UE(t)], ⌈τ[θ_n^UE(t)] / Δ⌉ }.
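The FIFO bookkeeping above (ceiling processing time, waiting time, completion index, and deadline-based dropping) can be sketched as follows; a simplified single-UE illustration under the stated assumptions, with hypothetical names and parameter values:

```python
import math

def local_schedule(tasks, f_local, slot_len):
    """Return the completion slot of each task under FIFO local computing.

    tasks: list of (arrival_slot, size_bits, cycles_per_bit, max_latency_slots).
    A task whose waiting plus processing time exceeds its maximal tolerable
    latency is dropped (completion slot None) and does not occupy the queue.
    """
    finish, prev_end = [], 0
    for t, size, cycles, tau in tasks:
        comp_slots = math.ceil(size * cycles / (f_local * slot_len))
        start = max(prev_end, t)          # wait until earlier tasks finish
        end = start + comp_slots
        if end - t > tau:                 # deadline exceeded: drop the task
            finish.append(None)
        else:
            finish.append(end)
            prev_end = end                # queue advances only for kept tasks
    return finish

# Two tasks at a 10 MHz CPU with 0.1 s slots: each needs 15 slots of computation,
# so the second task misses its 20-slot deadline while queued behind the first.
slots = local_schedule([(0, 1.5e4, 1000, 20), (1, 1.5e4, 1000, 20)],
                       f_local=1e7, slot_len=0.1)  # -> [15, None]
```

The recursion in the text corresponds to carrying `prev_end` (the completion index of the previously queued task) forward from the first processed task, whose waiting time is zero.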

Task Offloaded to the UAV for Computing.
We adopt frequency-division multiple access (FDMA) for the UE's task offloading; the total bandwidth of the channel between the UE and the UAV is equally allocated to each UE. Then, the offloading rate of UE n, n ∈ N, at time slot t, t ∈ T, is given as

r_n(t) = (W_U / N) log₂(1 + P_n^UE h_n(t) / σ²),

where W_U denotes the total available bandwidth between the UE and the UAV, σ² denotes the noise power at the UAV, and P_n^UE is the transmit power of UE n, n ∈ N. Thus, the number of time slots required for offloading the task θ_n^UE(t) to the UAV by UE n, n ∈ N, can be expressed as

T_n^Tr[θ_n^UE(t)] = min{ k ∈ Z_{++} : Σ_{i=μ}^{μ+k−1} r_n(i) Δ ≥ L[θ_n^UE(t)] },

where Z_{++} denotes the set of positive integers and μ = μ_{n,Tr}^UE[θ_n^UE(t)] is the index of the time slot in which the task θ_n^UE(t) begins to be offloaded to the UAV by UE n, n ∈ N. The task θ_n^UE(t) may not be offloaded to the UAV within one time slot, so subsequent tasks of UE n, n ∈ N, may wait for offloading. The number of time slots the task θ_n^UE(t) waits for offloading at UE n, n ∈ N, is given as

T_{n,Tr}^Wait[θ_n^UE(t)] = max{ max_{t′ < t, α_{n,t′} = ±1} T_{n,end}^Tr[θ_n^UE(t′)] − t, 0 },

where T_{n,end}^Tr[θ_n^UE(t′)] denotes the index of the time slot in which the offloading of the task θ_n^UE(t′) is completed at UE n, n ∈ N, given as

T_{n,end}^Tr[θ_n^UE(t′)] = t′ + T_{n,Tr}^Wait[θ_n^UE(t′)] + T_n^Tr[θ_n^UE(t′)].

Similar to (4), a recursive method can be used to obtain the value of T_{n,end}^Tr[θ_n^UE(t′)] with T_{n,Tr}^Wait[θ_n^UE(t_{n,b}^Tr)] = 0, where θ_n^UE(t_{n,b}^Tr) is the first task to be offloaded by UE n, n ∈ N. The number of total time slots required for completing the offloading of the task θ_n^UE(t) by UE n, n ∈ N, is given as

T_{n,Tr}^Total[θ_n^UE(t)] = T_{n,Tr}^Wait[θ_n^UE(t)] + T_n^Tr[θ_n^UE(t)].

After the offloading of the task θ_n^UE(t) is finished, it is stored in UE n's computation queue at the UAV. We adopt a simple resource allocation scheme in which the UAV equally allocates its computing resource to each UE. Then, the number of time slots required for computing the task θ_n^UE(t) at the UAV is given as

T_UAV^Comp[θ_n^UE(t)] = ⌈L[θ_n^UE(t)] C[θ_n^UE(t)] N / (f_max Δ)⌉,

where f_max is the maximal CPU frequency of the UAV.
The number of time slots that the task θ_n^UE(t) waits for processing at the UAV is given as

T_{UAV,n}^Wait[θ_n^UE(t)] = max{ max_{t′ < t, α_{n,t′} = −1} T_{UAV,n,end}^Comp[θ_n^UE(t′)] − (t + T_{n,Tr}^Total[θ_n^UE(t)]), 0 },

where T_{UAV,n,end}^Comp[θ_n^UE(t′)] is the index of the time slot in which the computation of the task θ_n^UE(t′) is completed at the UAV, given as

T_{UAV,n,end}^Comp[θ_n^UE(t′)] = t′ + T_{n,Tr}^Total[θ_n^UE(t′)] + T_{UAV,n}^Wait[θ_n^UE(t′)] + T_UAV^Comp[θ_n^UE(t′)].

We have T_{UAV,n}^Wait[θ_n^UE(t_{UAV,n,b})] = 0, where θ_n^UE(t_{UAV,n,b}) is the first task of UE n, n ∈ N, processed by the UAV. The number of total time slots required for completing the computation of the task θ_n^UE(t) by the UAV (including the time for task offloading from UE n, n ∈ N, to the UAV and the waiting and processing time at the UAV) is given as

T_{UAV,n}^Total[θ_n^UE(t)] = T_{n,Tr}^Total[θ_n^UE(t)] + T_{UAV,n}^Wait[θ_n^UE(t)] + T_UAV^Comp[θ_n^UE(t)].

The number of total time slots consumed for completing the computation of the task θ_n^UE(t) by the UAV is given as

min{ T_{UAV,n}^Total[θ_n^UE(t)], ⌈τ[θ_n^UE(t)] / Δ⌉ }.

Task Relayed to the BS for Computing.
If α_{n,t} = 1, the UAV forwards the task θ_n^UE(t) to the BS. The transmission rate from the UAV to the BS at time slot t, t ∈ T, is given as

r_b(t) = W_B log₂(1 + P_UAV h_0(t) / σ_b²),   (17)

where W_B is the bandwidth between the UAV and the BS, σ_b² is the noise power at the BS, and P_UAV is the transmit power of the UAV. The number of time slots required for offloading the task θ_n^UE(t) from the UAV to the BS is given as

T_UAV^Tr[θ_n^UE(t)] = min{ k ∈ Z_{++} : Σ_{i=μ′}^{μ′+k−1} r_b(i) Δ ≥ L[θ_n^UE(t)] },

where μ′ = μ_UAV^Tr[θ_n^UE(t)] is the index of the time slot in which the task θ_n^UE(t) begins to be offloaded to the BS by the UAV. The number of time slots the task θ_n^UE(t) waits for offloading at the UAV is given as

T_{UAV,n,Tr}^Wait[θ_n^UE(t)] = max{ max_{t′ < t, α_{n,t′} = 1} T_{UAV,n,Tr,end}[θ_n^UE(t′)] − (t + T_{n,Tr}^Total[θ_n^UE(t)]), 0 },

where T_{UAV,n,Tr,end}[θ_n^UE(t′)] is the index of the time slot in which the offloading of the task θ_n^UE(t′) is completed by the UAV, obtained by the same recursion as in the previous cases. We have T_{UAV,n,Tr}^Wait[θ_n^UE(t_{UAV,n,b}^Tr)] = 0, where θ_n^UE(t_{UAV,n,b}^Tr) is the first task of UE n, n ∈ N, offloaded by the UAV.
Similar to [23][24][25], we assume that the computing power of the BS is strong enough that the computing time of the UE's tasks at the BS can be neglected. Then, the number of total time slots required for completing the computation of the task θ_n^UE(t) via the BS (including the time required for offloading the task from UE n, n ∈ N, to the UAV, the waiting time, and the offloading time from the UAV to the BS) is given as

T_{BS,n}^Total[θ_n^UE(t)] = T_{n,Tr}^Total[θ_n^UE(t)] + T_{UAV,n,Tr}^Wait[θ_n^UE(t)] + T_UAV^Tr[θ_n^UE(t)].

Similar to (6), the number of total time slots consumed for completing the computation of the task θ_n^UE(t) via the BS is given as

min{ T_{BS,n}^Total[θ_n^UE(t)], ⌈τ[θ_n^UE(t)] / Δ⌉ }.
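The slot-counting rule used on both offloading hops (the smallest number of consecutive slots whose transmitted bits cover the task size, with a per-slot rate that varies as the UAV moves) can be sketched as follows; the helper names and sample values are illustrative, not the paper's:

```python
import math

def shannon_rate(bandwidth_hz: float, tx_power: float,
                 channel_gain: float, noise_power: float) -> float:
    """Achievable rate in bit/s over an AWGN link."""
    return bandwidth_hz * math.log2(1.0 + tx_power * channel_gain / noise_power)

def offload_slots(size_bits: float, start_slot: int, rate_of_slot, slot_len: float) -> int:
    """Smallest k such that the bits sent in slots start_slot..start_slot+k-1
    reach size_bits; rate_of_slot maps a slot index to that slot's rate."""
    sent, k = 0.0, 0
    while sent < size_bits:
        sent += rate_of_slot(start_slot + k) * slot_len
        k += 1
    return k

# Constant 150 kbit/s link with 0.1 s slots: a 30 kbit task needs 2 slots.
k = offload_slots(3e4, start_slot=0, rate_of_slot=lambda i: 1.5e5, slot_len=0.1)  # -> 2
```

Passing a slot-dependent `rate_of_slot` (e.g., one that evaluates the Shannon rate with the UAV's position in slot i) reproduces the trajectory-dependent offloading times of the model.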

Problem Formulation.
Our objective is to maximize the long-term number of completed tasks of the UE, i.e., to maximize the number of completed tasks of the UE during T consecutive time slots. Towards this end, we formulate the optimization problem as

P_1: max_{α_{n,t}} Σ_{t ∈ T} Σ_{n ∈ N} γ^t I_{n,t}[θ_n^UE(t)]
s.t. α_{n,t} ∈ {0, −1, 1}, ∀n ∈ N, ∀t ∈ T,

where γ ∈ [0, 1] is the discount factor, which captures the difference in importance between future rewards and the present reward [26]. I_{n,t}[θ_n^UE(t)] is an indicator function: I_{n,t}[θ_n^UE(t)] = 0 if the task θ_n^UE(t) is completed within its maximal tolerable latency, and I_{n,t}[θ_n^UE(t)] = −1 if the task θ_n^UE(t) fails. Obviously, the problem P_1 is nonconvex and difficult to solve by traditional optimization methods. In order to solve this problem, we propose a DRL-based algorithm.
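The discounted objective can be evaluated directly from the indicator values; a toy sketch, assuming the 0/−1 indicator convention stated above (the function name is ours):

```python
def discounted_return(indicators, gamma: float) -> float:
    """Sum of gamma^t * I over all slots t = 1..T and all users.

    indicators: list (over slots) of lists (over users) of 0/-1 values.
    Since every failed task contributes a negative term, maximizing this
    objective is equivalent to maximizing the number of completed tasks.
    """
    return sum(gamma ** t * i
               for t, slot in enumerate(indicators, start=1)
               for i in slot)

# Two slots, two users, one failure in slot 2: objective is -0.9**2 = -0.81.
r = discounted_return([[0, 0], [0, -1]], gamma=0.9)  # -> -0.81
```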

A REINFORCE-with-Baseline Optimization Framework
As we adopt FDMA and TDMA and the computing resource of the UAV is equally allocated to all UE, problem P_1 can be decomposed into N subproblems. The subproblem of UE n, n ∈ N, is given as

P_n: max_{α_{n,t}} Σ_{t ∈ T} γ^t I_{n,t}[θ_n^UE(t)]
s.t. α_{n,t} ∈ {0, −1, 1}, ∀t ∈ T.

We adopt the parameterized policy gradient approach with a baseline to obtain the optimal task allocation of UE n, n ∈ N. In order to apply the DRL algorithm, we first give the state, action, reward function, and policy of UE n, n ∈ N, as follows: (1) System state S_n(t): the state observed by UE n, n ∈ N, in time slot t, t ∈ T, which characterizes the newly arrived task and the current status of the computation and transmission queues. (2) Action A_n(t): A_n(t) is the task allocation variable of UE n, n ∈ N, namely, A_n(t) = α_{n,t}. (3) Reward function R_n(t): R_n(t), n ∈ N, t ∈ T, is the immediate reward obtained when selecting action A_n(t) in state S_n(t), which is directly related to the objective of the optimization problem P_n; hence, the reward function is defined based on the task completion indicator I_{n,t}[θ_n^UE(t)]. (4) Policy π_{n,θ_n}: the policy π_{n,θ_n} denotes the mapping from the state S_n(t) to the action A_n(t) of UE n, n ∈ N, i.e., π_{n,θ_n}: S_n(t) ⟶ A_n(t), where θ_n is the parameter of the policy.
Our goal is to find an optimal policy for UE n, n ∈ N; the problem P_n can thus be transformed into P_{n,DRL}: max_{θ_n} U(θ_n). The parameterized policy gradient approach adopts a parameterized policy and optimizes its parameters through a gradient iteration algorithm [26]. A typical scalar performance measure U(θ_n) is v_{π_{n,θ_n}}[S_n(0)], the value function of policy π_{n,θ_n} starting from the initial state S_n(0). In the episodic case, an analytic expression for the gradient of U(θ_n) is provided by the policy gradient theorem [26]:

∇U(θ_n) ∝ Σ_{S_n(t)} μ[S_n(t)] Σ_{A_n(t)} q_{π_{n,θ_n}}[S_n(t), A_n(t)] ∇π_{n,θ_n}[A_n(t)|S_n(t)],

where μ[S_n(t)] is the on-policy distribution over states and q_{π_{n,θ_n}}[S_n(t), A_n(t)] is the value of taking action A_n(t) in state S_n(t) under policy π_{n,θ_n}.
In order to reduce the variance during the learning process, an action-independent baseline b[S_n(t)] can be introduced, and the policy gradient with baseline is given as [26]

∇U(θ_n) ∝ Σ_{S_n(t)} μ[S_n(t)] Σ_{A_n(t)} { q_{π_{n,θ_n}}[S_n(t), A_n(t)] − b[S_n(t)] } ∇π_{n,θ_n}[A_n(t)|S_n(t)].
The estimate of the state value, v[S_n(t); w], is often selected as the baseline, where w is the weight vector of the state value function. With this baseline, θ_n is updated as [26]

θ_{n,t+1} = θ_{n,t} + α γ^t [G − v(S_n(t); w)] ∇ ln π_{n,θ_{n,t}}[A_n(t)|S_n(t)],

where α is the learning rate, G is the return following time slot t, t ∈ T, and θ_{n,t} is the estimate of θ_n at time slot t, t ∈ T. The details of the DRL-based algorithm are given in Algorithm 1.
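The update rule above is the standard REINFORCE-with-baseline iteration of [26]; a compact, self-contained sketch on a toy one-state problem with three actions standing in for the task allocation choice (the softmax policy, the toy rewards, and all names are illustrative, not the paper's network architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy episodic task: 3 actions (local, UAV, relay) with fixed expected rewards.
TRUE_REWARD = np.array([0.0, -0.2, -0.5])

theta = np.zeros(3)   # policy parameters (action preferences)
w = 0.0               # state-value baseline (single state here)
alpha_p, alpha_b, gamma = 0.1, 0.1, 0.99

for episode in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    G = TRUE_REWARD[a] + rng.normal(0.0, 0.1)  # noisy sampled return
    delta = G - w                              # advantage against the baseline
    w += alpha_b * delta                       # move the baseline toward returns
    grad_log = -pi                             # d log pi(a|.) / d theta (softmax)
    grad_log[a] += 1.0
    theta += alpha_p * (gamma ** 0) * delta * grad_log  # single-step episode: t = 0

# After training, the policy should prefer the highest-reward action (index 0).
best = int(np.argmax(softmax(theta)))
```

Subtracting the learned baseline `w` does not bias the gradient but reduces its variance, which is exactly the role of v[S_n(t); w] in the update rule above.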

Simulation Results
In this section, the performance of the proposed algorithm under different parameters is provided. According to the analysis above, solving problem P_1 is equivalent to solving each subproblem P_n. Therefore, we focus on the performance of a single UE n, n ∈ N, instead of all the UE. Unless otherwise specified, the default simulation parameters are set as follows. The computation data size (in bits) of the task θ_n^UE(t) is randomly selected from the set Θ_n^L = {1.5×10⁴, 3×10⁴, 4.5×10⁴, 6×10⁴} bits. The corresponding maximal tolerable latency set of the task θ_n^UE(t) is Θ_n^τ = {3Δ, 5Δ, 7Δ, 9Δ}. Without loss of generality, we set C[θ_n^UE] = β, β = 1000, which can therefore be omitted from the system state. The UAV flies in circles with center coordinates (0, 0) and radius 20 m. The UE is uniformly distributed in a 100 m × 100 m square area centered at (0, 0), which is the coordinate of the BS. The maximal CPU frequencies of the UAV and of UE n, n ∈ N, are set as 200 MHz and 10 MHz, respectively. The bandwidth of the channel between UE n, n ∈ N, and the UAV and between the UAV and the BS is set as 40 kHz and 200 kHz, respectively. T is set as 100. The lengths of the time slot Δ and the sub-time slot δ are set as 100 ms and 20 ms, respectively. We conduct the simulations with Python 3.8 and TensorFlow 2.5.0. Fully connected hidden layers with 17 and 16 neurons are employed in the policy and baseline networks, respectively. The learning rates of the policy and baseline networks vary across figures to guarantee the convergence of the proposed algorithm. Figures 2 and 3 show the convergence performance of the proposed algorithm for different values of the learning rate α_p. The learning rate is important in iterative algorithms. As shown in Figure 2, the larger the learning rate, the faster the convergence of the algorithm. However, if the learning rate α_p is too large, the variance of the learning curve may become high, as can be observed in Figure 3.
Figure 4 shows the convergence performance of the proposed algorithm for different values of the learning rate α_b. As can be observed in Figure 4, the convergence speed of the algorithm is not monotonic in the learning rate α_b: the case α_b = 5×10⁻⁵ converges faster than the cases α_b = 5×10⁻² and α_b = 5×10⁻⁴, and α_b = 5×10⁻² converges faster than α_b = 5×10⁻⁴. Although the convergence speeds of the different α_b cases differ, they converge to the same order of accumulated reward. Figure 5 shows the performance of the proposed algorithm versus the complexity of the tasks of UE n, n ∈ N. As can be seen from Figure 5, the complexity of the task directly affects the performance of the proposed algorithm: the higher the complexity, the lower the accumulated reward, which agrees with intuition. Figure 6 shows the performance of the proposed algorithm versus the computing power of UE n, n ∈ N. As shown in Figure 6, the accumulated reward increases with the computing power of UE n, n ∈ N: the stronger the computing power, the higher the accumulated reward. Figure 7 shows the effect of the length of the time slot Δ. It can be observed that the accumulated reward increases with Δ; when Δ is too short, most of the tasks of UE n, n ∈ N may fail.

Conclusion
We studied the problem of maximizing the long-term number of completed tasks of the UE in a proposed UAV-assisted MEC system. The formulated nonconvex problem was decomposed and solved by our proposed DRL-based algorithm. The validity and convergence of the proposed algorithm were demonstrated through simulation results.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.