GMOM: An Offloading Method of Dependent Tasks Based on Deep Reinforcement Learning

Mobile edge computing (MEC) is considered an effective solution for delay-sensitive services, and computing offloading, the central technology in MEC, can expand the capacity of resource-constrained mobile terminals (MTs). However, because of the interdependency among applications and the dynamically changing, complex nature of the MEC environment, offloading decision making turns out to be an NP-hard problem. In the present work, a graph mapping offloading model (GMOM) based on deep reinforcement learning (DRL) is proposed to address the offloading problem of dependent tasks in MEC. Specifically, the MT application is first modeled as a directed acyclic graph (DAG), which is called a DAG task. Then, the DAG task is transformed into a subtask sequence vector according to a predefined order of priorities to facilitate processing. Finally, the sequence vector is input into an encoding-decoding framework based on the attention mechanism to obtain the offloading strategy vector. The GMOM is trained using the proximal policy optimization (PPO) algorithm to minimize a comprehensive cost function that includes delay and energy consumption. Experiments show that the proposed model has good decision-making performance, with verified effectiveness in convergence, delay, and energy consumption.


Introduction
With the development of the Internet of things (IoT) and mobile computing, mobile terminals and applications, such as virtual reality (VR) and facial recognition applications, are seeing increased diversity and complexity [1,2]. Despite the steadily improving performance of mobile terminals, many computing-intensive applications cannot be processed efficiently, and the resulting delay compromises user experience and service quality. In this case, mobile edge computing (MEC) comes as a solution. An emerging computing mode, MEC is born of traditional cloud computing. Its central goal is to extend the rich resources of the cloud server to the user terminals so that users can employ more abundant resources to process their own computing tasks nearby, thus reducing delay and energy consumption. Computing offloading, a key technology in MEC, refers to the process in which the mobile terminal assigns computing-intensive tasks to edge servers with sufficient computing resources through the wireless channel according to a certain strategy, after which the edge servers return the computation results to the mobile terminal [3]. The computing offloading problem can be transformed into an optimization problem in a specific environment, but due to the complexity of the MEC environment, it is challenging to address this problem with traditional approaches [4].
In practice, many computing tasks are not completely independent from each other but involve multiple subtasks (as in VR/AR and face recognition). The input of some tasks comes from the output of other tasks; that is, some tasks must be processed before the current task can be executed. If the dependency between tasks is not considered, the application may fail to run properly. The offloading of dependent tasks is generally modeled as a DAG to achieve fine-grained task offloading [5]. However, as the number of task nodes increases, it becomes difficult to obtain the optimal offloading strategy for all subtasks [6]. Most existing works in this regard use heuristic or metaheuristic algorithms [7][8][9][10]. Specifically, in order to minimize the application delay, reference [7] designs a heuristic DAG decomposition scheduling algorithm, which considers the load balance between user devices and servers. In reference [8], a heuristic algorithm is proposed to solve the task offloading problem of mobile applications in three steps to minimize the system energy consumption under a delay constraint. Reference [9] considers that some tasks in a DAG can only be executed locally, so as to minimize the waiting delay among locally executed tasks. This problem is modeled as a nonlinear integer programming problem and solved approximately by a metaheuristic algorithm. However, these methods cannot fully adapt to dynamic MEC scenarios because of the contradiction between flexibility and computational cost when designing heuristics [10].
Boasting the strengths of both deep learning and reinforcement learning, DRL has strong perception and decision-making ability, so it has been widely studied and applied [11] to complex decision-making problems with high-dimensional state/action spaces, such as games [12] and robots [13]. Through continuous interaction with the environment, the DRL model learns appropriate strategies (what actions to perform in a given environment) with the goal of maximizing long-term returns. Considering indivisible and delay-sensitive tasks and edge load dynamics, a model-free distributed algorithm based on DRL is proposed in reference [14]. Each device can determine its offloading decision without knowing the task models and offloading decisions of other devices. In order to cope with the challenge of task dependence and adapt to dynamic scenarios, a new DRL-based offloading framework is proposed in reference [15], which can effectively learn the offloading strategy represented by a specially designed sequence-to-sequence neural network. Considering the limited performance of a traditional single-type offloading strategy in a complex environment, reference [16] designs a dynamic regional resource scheduling framework based on DRL, which can effectively consider different indexes. Mobile edge computing with an energy harvesting function is considered in reference [17]. In order to address the challenges of mixed continuous and discrete interaction spaces and coordination between devices, two dynamic computing offloading algorithms based on DRL are proposed. Simulation results show that the proposed algorithms achieve a better balance between delay and energy consumption.
In the present work, a graph mapping offloading model based on deep reinforcement learning is proposed. First, the mobile terminal application is modeled as a directed acyclic graph (DAG). Then, the DAG task is transformed into a subtask sequence vector in order of priority. Finally, the sequence vector is input into a policy network based on a recurrent neural network (RNN) to obtain the offloading strategy. The proposed model employs the proximal policy optimization (PPO) algorithm to minimize a comprehensive cost function including delay and energy consumption. The major contributions of this paper are as follows: (1) Considering the inherent task dependency of mobile applications, we propose a DRL-based task offloading model, which leverages off-policy reinforcement learning with an RNN to capture dependencies. (2) We design a new encoding method that encodes the DAG into a task vector, including task profile and dependency information, which can transform the DAG into RNN input without loss of fidelity. (3) We introduce an attention mechanism to solve the problem of performance degradation caused by the incomplete capture of feature information in long sequences. To effectively train the model, we use an off-policy DRL algorithm with a clipped surrogate objective function, which gives the algorithm strong exploration ability and prevents training from getting stuck in local optima. Figure 1 shows the MEC scenario considered in the present work. The mobile terminal (MT), as the end user, is the service requester and generates computing tasks according to the program (path planning and object identification) called by the user. These applications have internal dependencies, so in the present work, computing offloading is defined as sequential offloading of subtasks. Subtasks are transmitted to the edge base station (BS) for execution. Finally, the processed results are sent back to the MT. The system model of this work will be described in detail below.

Task Model.
Each MT generates a computing-intensive application with N tasks, as shown in Figure 2. We model the application as a DAG G = (V, E); the vertex set V = {v_1, v_2, ..., v_N} represents the computation tasks, and the edge set E represents the dependencies between computation tasks, where an edge (v_i, v_j) means that v_i is a direct predecessor of v_j and v_j is a direct successor of v_i. In particular, a task without a predecessor is called an entry task, and a computing task without a successor is called an exit task [18]. Each computing task in G is represented by a tuple v_i = (c_i, d_i, q_i), whose elements represent the number of CPU cycles required, the size of the input data, and the size of the output data, respectively. The input data generally include the source code and related parameters in the application program, while the output data refer to the data generated by the predecessor task [19]. Each task has two computing modes: the task is processed on the MT, which is termed local computing, or it is transmitted to the base station (BS) through the wireless channel for processing and the result is returned to the MT, i.e., offloading computing. The computation modes of all tasks can be represented as a list A = (a_1, a_2, ..., a_N) with a_i ∈ {0, 1}, in which 0 indicates local computing and 1 indicates offloading computing. When the computation tasks are intensive, the execution of task v_i may need to queue in the task queue; considering the dependency restrictions, the execution of task v_i must meet two conditions: (1) all predecessors of task v_i have been completed; (2) the resources required by task v_i are idle.
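The DAG structure above can be captured in a small data structure. The following is a minimal sketch (class and field names are ours, not the paper's), where each subtask carries the tuple (c_i, d_i, q_i) and the edge sets record direct predecessors and successors:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    """One vertex v_i = (c_i, d_i, q_i) of the DAG."""
    cycles: float    # c_i: CPU cycles required
    data_in: float   # d_i: size of the input data
    data_out: float  # q_i: size of the output data

class DAGTask:
    """Application G = (V, E) with dependency edges."""
    def __init__(self, tasks):
        self.tasks = tasks                      # {task_id: Subtask}
        self.preds = {i: set() for i in tasks}  # direct predecessors
        self.succs = {i: set() for i in tasks}  # direct successors

    def add_edge(self, i, j):
        """Add edge (v_i, v_j): v_i is a direct predecessor of v_j."""
        self.preds[j].add(i)
        self.succs[i].add(j)

    def entry_tasks(self):
        """Tasks without any predecessor."""
        return [i for i in self.tasks if not self.preds[i]]

    def exit_tasks(self):
        """Tasks without any successor."""
        return [i for i in self.tasks if not self.succs[i]]
```

An offloading strategy is then simply a list A of 0/1 decisions indexed like `tasks`, matching the list A = (a_1, ..., a_N) above.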
In order to better present the computing model and communication model, the following definitions are first given. Definition 1. The completion times of task v_i on the wireless uplink and downlink channels are expressed as FT_i^s and FT_i^r, respectively, and the completion times of task v_i on the MT and the BS are expressed as FT_i^l and FT_i^o, respectively.
Definition 2. The times from which task v_i can use the wireless uplink and downlink channels are denoted as AT_i^s and AT_i^r, respectively, and the times from which task v_i can use the computing resources on the MT and the BS are expressed as AT_i^l and AT_i^o, respectively. The available time represents when the resource is idle and depends on the completion time of the immediate predecessor of task v_i on that resource. If the predecessor task does not utilize the resource, we set the completion time on that resource to 0.

Local Computing Model.
If v_i is computed locally, the start time of task v_i depends on the completion times of its predecessor tasks and the available time of the MT. A predecessor task may be computed on the MT or on the BS. In this case, the completion time on the MT is equal to the start time plus the local execution delay:

FT_i^l = max(AT_i^l, max_{v_p ∈ pred(v_i)} max(FT_p^l, FT_p^r)) + T_i^l,
where v_p is a direct predecessor task of v_i. If v_p is computed locally, FT_p^r = 0; if v_p is computed remotely, FT_p^l = 0. While the MT is processing other tasks, v_i must queue and cannot be executed immediately. The local execution delay of v_i is T_i^l = c_i/f_l, where f_l is the computing capacity of the MT, and the corresponding energy consumption is E_i^l = k(f_l)^2 c_i, where k depends on the effective capacitance coefficient of the chip structure used.
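The local completion-time recursion above can be sketched as follows (a simplified illustration under our own naming; queuing is folded into the MT available time AT_i^l, and the execution delay T_i^l = c_i/f_l is passed in precomputed):

```python
def local_finish_time(at_l, pred_ft, t_exec):
    """FT_i^l = max(AT_i^l, max over predecessors of max(FT_p^l, FT_p^r)) + T_i^l.

    at_l    : AT_i^l, time at which the MT becomes idle
    pred_ft : list of (FT_p^l, FT_p^r) pairs for the direct predecessors;
              the unused one of each pair is 0, as in the text
    t_exec  : T_i^l = c_i / f_l, the local execution delay
    """
    ready = max((max(fl, fr) for fl, fr in pred_ft), default=0.0)
    return max(at_l, ready) + t_exec
```

With no predecessors the `default=0.0` makes the task start as soon as the MT is idle, matching the entry-task case.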

Ofoading Computing Model.
If v_i chooses offloading computing, the MT first needs to send v_i to the BS through the wireless uplink channel, but v_i can only be sent when all its predecessor tasks have been executed and the uplink channel is available; then, the completion time on the wireless uplink channel is equal to the start time plus the sending delay.
According to Shannon's theorem [20], the uplink rate from the MT to the BS is r_up = w log2(1 + pg/σ^2), where w, p, g, and σ^2 represent the system bandwidth, the transmission power of the MT, the channel gain, and the noise power, respectively. We assume that the uplink and downlink rates of the wireless channel are equal, so r_down = r_up. The transmission delay of the MT sending data to the BS is T_i^s = d_i/r_up, and the corresponding energy consumption is E_i^s = p_s · T_i^s, where p_s represents the transmission power.
After the BS receives v_i, it can execute v_i once the conditions are met. In this case, similarly, the execution completion time on the BS is equal to the start time plus the execution delay on the BS:

FT_i^o = max(AT_i^o, FT_i^s) + c_i/f_s,

where f_s represents the computing capacity of the BS. Similarly, the completion time of the BS sending the processing results back to the MT through the wireless downlink channel is:

FT_i^r = max(AT_i^r, FT_i^o) + T_i^r.

The return delay is T_i^r = q_i/r_down, and the corresponding energy consumption is E_i^r = p_r · T_i^r.
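The transmission-side quantities of the offloading model can be sketched as below, assuming, as in the text, that the downlink rate equals the uplink rate (function and parameter names are ours):

```python
import math

def uplink_rate(w, p, g, sigma2):
    """Shannon rate r_up = w * log2(1 + p*g / sigma2)."""
    return w * math.log2(1 + p * g / sigma2)

def offload_transmission(d_in, q_out, w, p, g, sigma2, p_s, p_r):
    """Transmission delays and energies for one offloaded subtask.

    d_in, q_out : d_i and q_i, input and output data sizes
    p_s, p_r    : sending and receiving power
    Returns (T_s, E_s, T_r, E_r), with r_down == r_up as assumed above.
    """
    r = uplink_rate(w, p, g, sigma2)
    t_s = d_in / r            # T_i^s = d_i / r_up
    t_r = q_out / r           # T_i^r = q_i / r_down
    return t_s, p_s * t_s, t_r, p_r * t_r
```

The BS execution delay c_i/f_s then completes the offloading cost chain FT_i^s → FT_i^o → FT_i^r described above.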

Problem Description.
According to the abovementioned analysis, the delay of executing a DAG task is

T_A = max_{v_i ∈ K} max(FT_i^l, FT_i^r),

where K is the set of all exit tasks; the total delay equals the time required to process the last exit task. The energy consumed is

E_A = Σ_{i=1}^{N} [E_i^l · I(a_i = 0) + E_i^o · I(a_i = 1)],

where E_i^l is the energy consumed by local computing, E_i^o = E_i^s + E_i^r represents the energy consumption required for offloading computing, and I(Δ) is an indicator function whose value is 1 when condition Δ is satisfied and 0 otherwise.
To measure the quality of an offloading strategy, we define a comprehensive cost function [21,22], and the ultimate goal is to minimize the comprehensive cost value:

C_A = α · T_A + (1 − α) · E_A,

where T_A and E_A represent the required delay and energy consumption, respectively, and α is the balance factor, valued in the interval [0, 1]. This weighted-sum method is effective and easy to implement, so it has been widely used. The balance factor reflects user preferences and can be adjusted dynamically. However, the offloading problem for general DAGs is NP-hard [23], so it is difficult to find an optimal offloading strategy with acceptable time complexity.
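The weighted sum above, C_A = αT_A + (1 − α)E_A, is straightforward to compute; a one-line sketch under our own naming:

```python
def comprehensive_cost(t_a, e_a, alpha):
    """C_A = alpha * T_A + (1 - alpha) * E_A, with alpha in [0, 1].

    alpha -> 1 weights delay only; alpha -> 0 weights energy only.
    """
    assert 0.0 <= alpha <= 1.0, "balance factor must lie in [0, 1]"
    return alpha * t_a + (1 - alpha) * e_a
```

Varying `alpha` at run time is how the user-preference adjustment mentioned above would be realized.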

Graph Mapping Offloading Model Based on DRL
This section introduces in detail the graph mapping offloading model based on DRL. First, to facilitate the processing of DAG tasks, the DAG tasks are converted into subtask sequence vectors by the heterogeneous earliest finish time (HEFT) algorithm. This step acts as data preprocessing, after which the subtasks are processed with reference to the vector, and the computing process is detailed through an example. Then, the offloading problem is described as a Markov decision process (MDP), and the corresponding state space, action space, and reward are analyzed. Finally, the structure of the graph mapping offloading model is introduced, which is based on an encoding-decoding framework with the attention mechanism and trained by the proximal policy optimization (PPO) algorithm.

DAG Instance.
In order to satisfy the dependency constraints between subtasks in the application program, we prioritize tasks in descending order of their rank values, calculated in a similar way to the HEFT algorithm [24]:

rank(v_i) = T_i^o + max_{v_j ∈ succ(v_i)} rank(v_j),

where succ(v_i) represents the direct successor task set of task v_i, and T_i^o is the execution cost of task v_i, i.e., the shortest delay (without queuing) obtained by offloading the task. We treat the sorted order as an execution order vector O = (o_1, o_2, ..., o_N), where o_i represents the i-th task to be executed. Table 1 shows the execution delays of tasks on the MT and BS, and the execution order vector is obtained according to formula (9). The DAG in Figure 2 is used as an example to briefly illustrate the execution of the task. This application program G is composed of 10 subtasks. It is assumed that the offloading strategy vector is A = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0], and Figure 3 shows the task execution process based on O and A. For example, since task v_9 is offloaded to the BS, its available time is AT_9^s = 6 s and its completion time is FT_9^r = 14 s. From Figure 3, we know that the execution latency of the DAG task is T_A = 20 s.
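The HEFT-style upward-rank prioritization can be sketched as a memoized recursion over successors (a simplified illustration; tie-breaking and the cost values are our assumptions, not the paper's):

```python
def heft_order(costs, succs):
    """Rank tasks by rank(v_i) = cost(v_i) + max over successors of rank,
    then return them in descending rank order (exit tasks have rank = cost).

    costs : {task: T_i^o} estimated execution cost per task
    succs : {task: set of direct successor tasks}
    """
    memo = {}

    def rank(i):
        if i not in memo:
            memo[i] = costs[i] + max((rank(j) for j in succs[i]), default=0.0)
        return memo[i]

    order = sorted(costs, key=rank, reverse=True)
    return order, memo
```

Because a task's rank always exceeds its successors' ranks, the resulting order is a topological order of the DAG, so the dependency constraint is respected.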

Structure of the Markov Decision Process.
Each task is connected to a virtual machine that provides private computing, communication, and storage resources, so the parameters of the network environment are considered static. Therefore, we establish the MDP from the perspective of the DAG task.
State space: the state is a sequence of information about the DAG task and the offloading decisions of historical subtasks. The policy set of the historical subtasks is represented by A_{1∼i}; then, the state is s_i = (G, A_{1∼i}).
Action space: since each subtask selects either local computing or offloading computing, the action space is expressed as A = {0, 1}.
Reward: when task v_i is completed according to strategy a_i, the total delay increment is ΔT_i and the required energy consumption is E_i. Therefore, the reward function can be defined as r_i = −(α · ΔT_i + (1 − α) · E_i).
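Under the reward convention described above, a per-step reward can be sketched as the negative weighted cost increment, so that maximizing the return minimizes the comprehensive cost (the exact sign and weighting are our assumption):

```python
def step_reward(delta_t, energy, alpha):
    """r_i = -(alpha * dT_i + (1 - alpha) * E_i).

    delta_t : incremental delay caused by completing task v_i
    energy  : energy consumed by completing task v_i
    alpha   : the balance factor from the cost function
    """
    return -(alpha * delta_t + (1 - alpha) * energy)
```

Summing these rewards over all N subtasks recovers −C_A, so a return-maximizing policy is a cost-minimizing offloading strategy.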

Model Design.
The graph mapping offloading model is based on an encoding-decoding framework that incorporates the attention mechanism [25]. The underlying idea of the model is to use two RNNs, one as an encoder and the other as a decoder. The task vector is input into the encoder, where the feature information of the task is extracted; then, feature decomposition is performed through the decoder and the offloading decision is output. However, this simple encoding-decoding framework is unstable in remembering previous feature information in long input sequences, which may lead to poor model performance. The attention mechanism is hence introduced in our work to solve this problem. The attention mechanism reduces the information loss of the simple encoding-decoding framework by calculating the correlation between the hidden state of each step in the decoder and the hidden state of each step in the encoder and by assigning a corresponding weight value to the extracted features, as shown in Figures 4 and 5 [26]. At time step j, the decoder transforms the corresponding context value c_j, the last hidden state h_de^{j−1}, and the last output result a_{j−1} into the current hidden state h_de^j = f_de(h_de^{j−1}, a_{j−1}, c_j). In the final output, the decoder obtains the strategy π(a_j|s_j; ω) through the softmax layer and selects the action through a_j = argmax π(a_j|s_j; ω).
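The attention step can be illustrated with plain dot-product scoring (the paper does not specify the alignment function, so this scoring choice and all names are our assumptions):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_context(h_dec, enc_states):
    """Compute the context c_j for one decoder step.

    h_dec      : current decoder hidden state (list of floats)
    enc_states : encoder hidden states, one list per encoder step
    Returns (context, weights): the weighted sum of encoder states
    and the attention distribution over encoder steps.
    """
    scores = [sum(h * e for h, e in zip(h_dec, enc)) for enc in enc_states]
    weights = softmax(scores)
    context = [sum(w * enc[k] for w, enc in zip(weights, enc_states))
               for k in range(len(h_dec))]
    return context, weights
```

Encoder steps whose hidden states align with the decoder state receive higher weight, which is how the framework avoids losing early-sequence features in long DAG task vectors.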
The state value V(s_t; θ) can be calculated by the critic network, where θ is the parameter of the critic network. The network is composed of a recurrent neural network and a fully connected layer and is initialized with the final hidden state of the encoder in the actor network. The history offloading policy is input, and its final hidden state is mapped to a state value through the fully connected layer.

Model Training.

PPO is a reinforcement learning algorithm based on the actor-critic (AC) framework proposed by OpenAI, which mainly consists of three networks: one critic network and two actor networks (the actor network and the old actor network). The old actor network is used to generate training data, while the actor network uses the generated data for training. Compared with the earlier trust region policy optimization (TRPO) algorithm, PPO is easier to implement and performs well; it solves the problems that the policy gradient algorithm has difficulty determining the step size and that the update difference can be too large [27]. In order to control the update step of the policy, PPO adopts the clipped surrogate objective function:

L_t^CLIP(θ) = Ê_t[min(pr_t(θ) Â_t, clip(pr_t(θ), 1 − ε, 1 + ε) Â_t)],

where pr_t(θ) = π_θ(a_t|s_t)/π_θ_old(a_t|s_t) is the importance sampling weight and Â_t is the advantage estimate.

The clip function clip(·) limits the value of the importance sampling weight pr_t(θ), where ε is the hyperparameter controlling the clip range. The min function takes the minimum of the original term and the clipped term; that is, the clipped term bounds the objective when the offset of the policy update exceeds the predetermined interval.
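The clipped surrogate term for a single sample can be sketched as follows (a minimal illustration with our own function name; ε = 0.2 is a common default, not a value taken from the paper):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """min(pr_t * A_t, clip(pr_t, 1 - eps, 1 + eps) * A_t).

    ratio     : pr_t(theta), new-policy / old-policy probability ratio
    advantage : A_t, the advantage estimate for this sample
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the ratio drifts outside [1 − ε, 1 + ε], the clipped term caps the objective, so gradient ascent gains nothing from pushing the policy further away from the old one.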
The actor network and the critic network in the PPO algorithm share network parameters, and the training optimization objective function is defined as:

L_t(θ) = Ê_t[L_t^CLIP(θ) − c_1 L_t^VF(θ) + c_2 S(s_t)],

where L_t^CLIP(θ) and L_t^VF(θ) are the loss functions of the actor network and the critic network, respectively; S(s_t) is the entropy loss, which is used to improve the strategy's exploration ability; and c_1 and c_2 are constants. The PPO algorithm is based on an actor-critic (AC) architecture, so these two groups of parameters need to be optimized separately.
For the actor network, generalized advantage estimation (GAE) is used as the estimator of its loss function to balance variance and bias. It borrows the idea of temporal-difference learning and is expressed as:

A_t^{GAE(γ,λ)} = Σ_{k=0}^{∞} (γλ)^k δ_{t+k}^V,

where A_t^{GAE(γ,λ)} is the advantage function and δ_t^V is the temporal-difference error, calculated as:

δ_t^V = r_t + γV(s_{t+1}) − V(s_t).

For the critic network, its loss function is expressed as:

L_t^VF(θ) = (V_θ(s_t) − V_t^targ)²,

where V_t^targ is the target value of the value function, defined as:

V_t^targ = A_t^{GAE(γ,λ)} + V(s_t).

Table 2 shows the training process of the proposed algorithm. The old actor network is used for sampling and the actor network for training. The proposed algorithm alternates between the exploration stage (lines 5-9) and the optimization stage (lines 10-13). In the exploration stage, D episodes are collected in each time step using the old actor network. GAE estimates A_t^{GAE(γ,λ)} for each time step in each episode, and the value V^{π*}(s_t) for each state is also calculated. In the optimization stage, batches of sample data are drawn from the experience pool, and the objective function is optimized for H epochs on the sampled data. Then, the old actor network is updated with the updated actor network to ensure that the old actor network collects data with the new parameters in the next sampling stage.
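The GAE sum above can be computed with a single backward pass over a finite episode; a minimal sketch under our own naming, taking the value beyond the episode end as 0:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one finite episode.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = sum_k (gamma * lam)^k * delta_{t+k}

    rewards : [r_0, ..., r_{T-1}]
    values  : [V(s_0), ..., V(s_{T-1})]; V(s_T) is taken as 0
    Returns the advantage A_t for every time step.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running   # accumulate (gamma*lam)^k terms
        advantages[t] = running
    return advantages
```

Adding each advantage back to V(s_t) yields the critic target V_t^targ used in the value loss above.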

Simulation Results and Discussion
The simulation experiment is conducted on a workstation with an Intel Core i7-9700H 3.6 GHz CPU and 8 GB of memory. The virtual environment is TensorFlow-GPU 1.x. The simulation parameters are shown in Table 3. A popular method is employed here to generate various DAGs to simulate applications with dependencies [28]. The parameters of the DAGs are shown in Table 4, where fat and density determine the width and the dependency density of the DAGs, respectively.

Comprehensive Cost.
We set a weight space Ω = [α_1, α_2, ..., α_{|Ω|}]. Given a weight value α_i, we first consider the optimization goals T_A^{α_i} and E_A^{α_i} for the time delay and energy consumption, respectively. According to the abovementioned description, the comprehensive cost function is C_A^{α_i} = α_i · T_A^{α_i} + (1 − α_i) · E_A^{α_i}. Then, we calculate the average completion time (ACT), the average energy consumption (AEC), and the average comprehensive cost (ACC), respectively.

Network Usage.
The network resource usage (NU) depends on the amount of data transferred between the MT and the BS. Thus, NU is defined as:

NU = Σ_{i=1}^{N} (d_i + q_i) · I(a_i = 1).

Parameter Analysis.
In this section, we compare the performance of the model under varied parameter values, taking the reward values as indicators. Figure 6 shows the influence of the optimizer's learning rate on algorithm performance. It can be seen that a learning rate that is too large or too small yields a low reward. Considering training efficiency and other factors, we set the learning rate to 0.001. The sampling efficiency with a small batch size is relatively low, while a large batch size may lead to frequent selection of old samples from the experience pool. Therefore, the batch size is set to 500 in the present work.

Indicator Analysis.
To gain insights into the proposed GMOM model, the following methods are implemented for comparison.

Greedy.
Greedy decisions are made about subtasks based on estimates of the local and offloading computing completion times for each subtask.

Algorithm: GMOM
(1) Randomly initialize the network parameter θ shared by the actor network and the critic network
(2) Initialize the parameter θ_old of the old actor network with θ
(3) for iteration = 1, 2, ... do
(4)   for t = 1, 2, ..., N do
(5)     for i = 1, 2, ..., D do
(6)       collect a whole episode with the old actor network and store the obtained data in the experience pool D
(7)       calculate the GAE value for each time step according to formula (14), obtain A_t^{GAE(γ,λ)}, and cache it
(8)       calculate the value of each state according to formula (17) and obtain V^{π*}(s_t)
(9)     end
(10)    for j = 1, 2, ..., H do
(11)      sample a batch of data to optimize the objective function and update the actor network
(12)    end
(13)    synchronize the parameters of the two actor networks, i.e., θ_old ← θ
(14)  end
(15) end
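The Greedy baseline can be sketched in a few lines (estimated completion times are taken as given inputs; the function name and 0/1 encoding follow the offloading-vector convention of the paper):

```python
def greedy_offload(local_times, offload_times):
    """Per-subtask greedy baseline: pick whichever mode has the smaller
    estimated completion time (0 = local computing, 1 = offloading).

    local_times   : estimated local completion time per subtask
    offload_times : estimated offloading completion time per subtask
    """
    return [0 if t_local <= t_offload else 1
            for t_local, t_offload in zip(local_times, offload_times)]
```

Because each decision ignores its effect on queueing and later subtasks, this baseline is myopic, which is why a learned policy such as GMOM can outperform it on whole-DAG cost.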

RL-DWS.
To solve the problem of dynamic preference, multiobjective reinforcement learning with dynamic weights (RL-DWS) is proposed, in which the change in the weight is measured by learning the Q values of multiple objectives.

Figure 7 shows the training results of the two networks. As shown in Figure 7, the system reward increases sharply in the initial 200 episodes before the growth rate levels off, and this is also manifested in the loss value. Finally, the losses of the value network and policy network approach 0, indicating that the algorithm in the present work has good convergence.

Figure 8 shows the ACT, AEC, and ACC values of the different algorithms. As shown in Figure 8(a), our proposed algorithm always obtains the minimum ACT, indicating that it achieves a smaller execution delay than the other algorithms. Some algorithms, such as RL-DWS and DQN, perform well in terms of energy consumption but poorly in terms of delay on DAG-4. HEFT and Greedy have similar performance in terms of delay and energy consumption on DAG-1 and DAG-3. The baseline algorithms underperform in balancing delay and energy consumption; our algorithm, however, shows good performance on all three evaluation indicators. In Figure 8(b), GMOM performs best on most DAGs (except DAG-1). Although RL-DWS obtains the minimum AEC on DAG-1, its corresponding ACT is larger. In Figure 8(c), the ACC of GMOM is also the smallest, indicating that GMOM performs best when delay and energy consumption are measured together.

Figure 9 presents the NU values achieved by all algorithms discussed in our work. It shows that our algorithm obtains a smaller NU value, indicating better performance in task offloading between MT and BS than the other baselines, which is conducive to the utilization of network resources. Then, DAG-1 to DAG-4 are tested with weight α = 0.2 to compare the average performance of the different algorithms at different transmission rates.
At a small transmission rate, offloading from the MT to the BS suffers a high transmission delay; at a higher rate, the delay declines. Our goal is to design an efficient algorithm that accommodates different transmission rates. In Figure 10(a), RL-DWS fails to learn an effective strategy, and its performance is worse than that of HEFT when the rate increases from 4.5 Mbps to 8.5 Mbps. On the contrary, our algorithm has better adaptability than all baselines and approaches the optimal solution at all rates. In Figures 10(b) and 10(c), DQN is superior to RL-DWS on AEC, but the results in Figure 10(a) are the opposite. Meanwhile, Figure 10(c) shows that the ACC of all algorithms decreases as the transmission rate increases. This is because the communication cost declines as the transmission rate increases, so offloading tasks to edge servers becomes more beneficial.
The weight reflects the importance of delay under a particular offloading strategy; at a large weight value, an offloading strategy that reduces the delay should be selected. Figure 11 shows the trend of changes in ACT and AEC as the weight of the delay increases. Overall, ACT decreases while AEC increases, because as the weight increases, the number of subtasks selecting local computing decreases, while the number of subtasks opting for offloading computing increases. This is mainly because the MT has a much lower computing capacity than the BS. However, offloading more subtasks to the BS leads to higher energy consumption. Therefore, time delay and energy consumption can be effectively balanced through edge-end collaborative processing of tasks.

Conclusion
In the present work, the offloading problem of dependent tasks in the mobile edge computing environment is studied to balance delay and energy consumption. We propose a graph mapping offloading model based on deep reinforcement learning: the offloading problem is modeled as an MDP, an RNN is used to approximate the policy and value functions of the MDP within an encoding-decoding framework that introduces the attention mechanism, and the model is trained with the popular proximal policy optimization algorithm. Simulation experiments show that the proposed algorithm achieves better performance in terms of convergence, delay, and energy consumption.

To meet the large number of expected demand services, research on the future generation wireless network (6G) has been initiated, which is expected to improve on the enhanced broadband, massive access, and low-latency service capabilities of the 5G wireless network and which is beneficial for the mobile edge network [29]. However, 6G networks tend to be multidimensional, ultradense, and heterogeneous, so artificial intelligence (AI), especially machine learning (ML), is emerging as a solution for intelligent network orchestration and management. For example, intelligent spectrum management can be achieved using deep reinforcement learning, which also shows potential for random state measurement problems.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.