Multiagent Reinforcement Learning for Task Offloading of Space/Aerial-Assisted Edge Computing

Task offloading in the space-air-ground integrated network (SAGIN) has been envisioned as a challenging issue. In this paper, we investigate a space/aerial-assisted edge computing network architecture that decides whether or not to take advantage of the edge servers mounted on unmanned aerial vehicles (UAVs) and the satellite for task offloading. Jointly optimizing the energy consumption and completion delay, we formulate an NP-hard, non-convex optimization problem that minimizes the computation cost subject to computation capacity and energy availability constraints. By formulating the problem as a Markov decision process (MDP), we propose a multiagent deep reinforcement learning (MADRL)-based scheme to obtain the optimal task offloading policies under dynamic computation requests and stochastic time-varying channel conditions, while ensuring the quality-of-service requirements. Finally, simulation results demonstrate that the task offloading scheme learned by our proposed algorithm can substantially reduce the average cost compared with three single-agent deep reinforcement learning schemes.


Introduction
The current in-depth development of fifth generation (5G) and beyond-5G technology is envisioned to build an interconnected world open to everyone. The increasing number of ultradense heterogeneous Internet of things (IoT) devices and the continuous growth of application requirements have put forward higher demands on data transmission rate and network coverage [1,2]. Compared with a fixed terrestrial network, the advantages of versatility, manoeuvrable deployment, and seamless coverage make the space-air-ground integrated network (SAGIN) an emerging hot research topic [3,4]. Meanwhile, mobile edge computing (MEC) is a promising approach to improve the quality of service (QoS) and network performance [5]. Therefore, the MEC technology of the terrestrial network is introduced into SAGIN to provide efficient and flexible computing services by utilizing multilevel, heterogeneous computing resources at the edge of the network. Especially when cellular base stations are damaged by natural disasters, or in special scenarios (e.g., mountainous areas, polar regions, and oceans), UAVs and a low earth orbit (LEO) satellite constellation can act as aerial relays or stations, and ground users (GUs) can offload computation tasks to them for fast processing [6]. In general, cooperative communication by multiple UAVs is a possible way to reduce the offloading delay and extend the UAVs' service lifetime [7]. However, the multi-UAV architecture introduces more challenging issues in minimizing the offloading delay.
Recently, task offloading has been studied extensively; offloading processes are generally modeled as mixed integer programming problems [8] and solved with approaches such as heuristic algorithms [9,10] and convex relaxation [11,12]. However, these optimization methods require a large number of iterations to reach a satisfactory local optimum, which makes them unsuitable for real-time offloading decisions when environmental conditions change rapidly and significantly [13,14]. Meanwhile, deep reinforcement learning (DRL) has been widely used as an effective approach to optimize various problems, including the offloading policy, and can help overcome prohibitive computational requirements [15]. The research on task offloading for space/aerial-assisted edge computing is still at an initial stage. In the SAGIN, various single-agent DRL-based task offloading schemes have been proposed to maximize the network utility or minimize the computation cost [16]. Considering the limited capacity of the MEC server and the channel conditions of UAVs, reference [17] proposed a computation offloading scheme based on the deep Q-learning network (DQN) to solve the dynamic scheduling problem, and reference [18] adopted a risk-aware reinforcement learning algorithm with an actor-critic architecture to minimize the weighted sum of delay and energy consumption. Furthermore, reference [19] proposed a joint resource allocation and task-scheduling method based on a distributed reinforcement learning algorithm to achieve the optimal partial offloading policy. Reference [20] adopted a deep deterministic policy gradient (DDPG)-based computation offloading scheme to handle high-dimensional state spaces and continuous action spaces. Multiagent reinforcement learning (MARL) has been applied to different problems such as path planning [21], dynamic resource allocation [22], and channel access [23]. Compared with single-agent reinforcement learning methods, distributed multiagent systems generally achieve better performance.
However, task offloading that exploits the cooperation of the space, aerial, and ground network layers in a multi-UAV, multiuser environment is still missing from the above studies. None of the above references takes full advantage of a possible collaborative framework; they only use multiple parallel deep neural networks, and decisions are taken independently by each agent of the system. In this paper, a MARL-based method is proposed to solve the cooperative task offloading issue in the space/aerial-assisted edge network. The multiple agents achieve the offloading optimization collaboratively in order to reduce the cost of computation tasks. In particular, the main contributions of this work are as follows: (1) Different from traditional UAV-enabled MEC task offloading schemes, we design a space/aerial-assisted edge network for dynamic task offloading in a cooperative multi-UAV environment. (2) This paper considers the problem of computation offloading under the SAGIN architecture with joint communication and computing (C2) service. We formulate the above-mentioned problem as a Markov decision process to minimize the computational cost. We assume each agent shares information with other agents and makes a decision according to its current strategy and local real-time observations to select which component of the system executes the task.
(3) We propose a multiagent deep deterministic policy gradient (MADDPG)-based task offloading approach. Unlike DRL algorithms such as Q-learning and DQN, which restrict the agent's actions to a low-dimensional finite discrete space, the agents in MADDPG can search for the best action in an independent continuous action space and maximize the long-term reward, reducing the computation cost by finding the optimal strategy. Furthermore, MADDPG can be executed in a decentralized manner once the network has been trained centrally. The remainder of this paper is organized as follows. In Section 2, the space/aerial-assisted edge network architecture and task offloading models are introduced. Section 3 describes the problem formulation and the MADDPG-based solution. The simulation results and analysis are presented in Section 4. Finally, this work is concluded in Section 5.

System Model and Problem Formulation
2.1. Network Architecture. As shown in Figure 1, a remote region without cellular coverage is considered; therefore, we provide network access, edge computing, and caching through the aerial segment. We consider a space/aerial-assisted edge computing framework consisting of N ground users (GUs), I UAVs, and a low earth orbit (LEO) satellite constellation. Figure 1 depicts a multi-UAV, multi-computational-node setting (the satellite with a remote cloud server, and each UAV as an aerial MEC server) providing services to GUs; let N = {1, 2, ..., N} be the set of GUs. The SAGIN components to which tasks can be offloaded are denoted by I = {0, 1, 2, ..., I}, where indexes 1, 2, ..., I denote the UAVs and index 0 denotes the LEO satellite constellation. We consider a discrete time-slotted system with equal slot duration τ. Furthermore, we assume that the overall system has M tasks, denoted by the set M = {1, 2, ..., M}. The main parameters of this paper are shown in Table 1.
The GU can either execute the computation task locally or offload it to an edge server in one of two ways. Each GU n determines whether or not to offload its computing task to edge server i, and x_nmi(t) denotes the task offloading decision for task m of GU n. Specifically, x_nmi(t) = 1 means that GU n offloads task m to edge server i, and x_nmi(t) = 0 means that GU n processes its task locally.
Constraint (1) expresses the binary nature of the offloading decision:

Computation Model.
Without loss of generality, a tuple (ϕ, c) is adopted to model the computing tasks from GU devices, where ϕ (in bits) represents the size of the computation task and c (in CPU cycles per bit) indicates the computing workload, i.e., how many CPU cycles are required to process one bit of input data [17]. The delay and energy consumption of downloading the computing results from the edge server back to the GUs can be ignored, because the key point of the policy in the considered scenario is task uploading [9,24]. In the following, we consider the computation overhead, in terms of completion delay and energy consumption, for edge computing and local execution.

Edge Computing Model.
The computing capability (in CPU cycles per second, i.e., the clock frequency of the CPU chip) of the edge servers mounted on the UAVs and the satellite is denoted by f_i (i ∈ {1, 2, ..., I}) and f_0, respectively. Consequently, the computational cost of all SAGIN components can be calculated as the following equation:
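The equation itself does not survive in this excerpt. As a hedged illustration, the following Python sketch assumes the common form of edge-computing delay, namely upload time ϕ/r_i plus server execution time ϕc/f_i; the function name and parameters are ours, not the paper's:

```python
def edge_delay(phi_bits, c_cycles_per_bit, rate_bps, f_server_hz):
    """Completion delay when a task (phi, c) is offloaded to server i.

    Assumed form (the paper's exact equation is omitted from this
    excerpt): upload time phi / r_i plus execution time phi * c / f_i.
    """
    t_upload = phi_bits / rate_bps                     # transmission delay
    t_exec = phi_bits * c_cycles_per_bit / f_server_hz  # server processing
    return t_upload + t_exec

# Example: a 1 Mbit task, 1000 cycles/bit, a 10 Mbps link,
# and a 10 GHz edge server.
d = edge_delay(1e6, 1000.0, 10e6, 10e9)
```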

Local Computing Model.
Owing to the limited computing capability of GUs, we assume that the remaining tasks wait to be processed in a queue. The delay of local processing is the sum of the computation execution time and the queuing time. The local execution time of GU n is given by the following expression, while the queuing time is calculated as in (4), where ρ_n(t) ∈ [0, M_max] denotes the number of unaccomplished computation tasks at the beginning of the time slot, M_max is the maximum length of the computing queue, ⌊·⌋ denotes the floor function, and f_n is the computing capability of GU n. The local execution energy consumption is given by the following expression, where ξ_n denotes the effective switched capacitance of the chip architecture [25]. Clearly, f_n can be adjusted to achieve the optimal trade-off between computation time and energy consumption by using DVFS [8].
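As an illustration of the local computing model above, the following sketch computes the local delay (execution plus queuing) and the standard DVFS energy ξ_n f_n² ϕ c. The queuing term uses an assumed average-cycles-per-queued-task form, since the exact expression (4) is garbled in this excerpt:

```python
def local_cost(phi_bits, c, f_n, xi, queued_tasks, avg_task_cycles):
    """Local completion delay and energy for GU n (sketch).

    Execution time: phi * c / f_n.
    Queuing time (assumed form; eq. (4) is garbled in this excerpt):
    total cycles of the rho_n(t) unfinished tasks divided by f_n.
    Energy: xi * f_n**2 * phi * c, the standard DVFS model [25].
    """
    t_exec = phi_bits * c / f_n
    t_queue = queued_tasks * avg_task_cycles / f_n
    energy = xi * f_n ** 2 * phi_bits * c
    return t_exec + t_queue, energy

# Example: 1 Mbit task, 1000 cycles/bit, a 1 GHz GU CPU,
# two tasks of ~1e9 cycles each already queued.
delay, energy = local_cost(1e6, 1000.0, 1e9, 1e-27, 2, 1e9)
```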

Communication Model.
Since UAVs and satellites use different frequency bands to communicate, we suppose that there is no interference between the UAVs and the satellite in this work [26]. Meanwhile, we neglect the propagation delay from GU devices to the UAVs because we assume that the UAVs are sufficiently close to the GU devices [27]. The air-to-ground communication channel depends on the altitude, the angle of elevation, and the type of propagation environment [28]. Based on reference [16], the average path loss of the air-to-ground channel can be defined as the following expression, where P_LoS represents the line-of-sight (LoS) connection probability between GU n and UAV i, and h, r, η_LoS, η_NLoS denote the UAV flying altitude, the horizontal distance between the UAV and the GU, and the additive losses incurred on top of the free-space path loss for line-of-sight and non-line-of-sight links [29], respectively. We set the altitude of the UAV to 10 m. f_c denotes the carrier frequency, and c denotes the velocity of light. According to [19], the values of (η_LoS, η_NLoS) are (0.1, 2.1) in a remote area. Adopting the Weibull-based channel model [30], we generate the channel gain when x_nm0(t) ≠ 0, which can be given by the following expression, where G_tx and G_rx are antenna gains, F_rain denotes the rain attenuation, and l_sat denotes the distance between the GU and the satellite. Consequently, the data rate r_i(t) is calculated by the following expression, where P_n,i indicates the transmission power, B_i denotes the channel bandwidth of the aerial-ground link or the ground-satellite link (indexes 1, 2, ..., I denote the UAV swarm, and index 0 denotes the LEO satellite constellation), and σ_S and σ_U represent the noise power. In line with the above, we can define the transmission delay for task offloading over aerial-assisted computing as the following expression, where d_sat denotes the propagation delay between the LEO satellite and the GUs, which cannot be ignored.
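The following minimal sketch illustrates the two channel quantities above: the average air-to-ground path loss (free-space loss plus probability-weighted LoS/NLoS excess losses, with the (0.1, 2.1) values quoted from [19]) and a Shannon-type data rate. The exact equations are omitted from this excerpt, so the functional forms are assumptions:

```python
import math

C_LIGHT = 3e8  # speed of light (m/s)

def avg_a2g_path_loss_db(h, r, p_los, fc, eta_los=0.1, eta_nlos=2.1):
    """Average air-to-ground path loss in dB (assumed form of [16]).

    Free-space path loss over the 3-D GU-UAV distance, plus the
    LoS/NLoS excess losses weighted by the LoS probability p_los.
    """
    d = math.hypot(h, r)  # 3-D distance from altitude h, horizontal r
    fspl = 20 * math.log10(4 * math.pi * fc * d / C_LIGHT)
    return fspl + p_los * eta_los + (1 - p_los) * eta_nlos

def shannon_rate(bandwidth_hz, tx_power_w, channel_gain, noise_w):
    """Achievable data rate r_i(t) = B * log2(1 + P * g / sigma^2)."""
    return bandwidth_hz * math.log2(1 + tx_power_w * channel_gain / noise_w)
```

A pure-LoS link (p_los = 1) yields a lower average loss than a pure-NLoS one, since η_LoS < η_NLoS.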
The communication-related energy consumption can be defined accordingly. Consequently, the computational cost of GU n in the space/aerial-assisted network can be calculated as a weighted sum, where ω1_n and ω2_n denote the weights for the energy consumption and the completion delay, respectively, which can be regarded as the trade-off between delay and energy consumption. We can adjust the weights to meet different user demands through this form of computational cost. Notably, the weights can be further split between the local execution model and the edge computing model to increase diversity among these cases. We formulate the optimization problem in our scenario to minimize the computational cost. Therefore, the optimization problem (14) is given by
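To make the weighted-cost trade-off concrete, a small sketch follows. `best_decision` is a hypothetical helper, not part of the paper's algorithm; it simply compares the weighted cost of local execution against each edge option:

```python
def computation_cost(delay_s, energy_j, w_energy, w_delay):
    """Weighted computation cost of GU n: w1 * E + w2 * T (sketch)."""
    return w_energy * energy_j + w_delay * delay_s

def best_decision(local, edge_options, w_energy, w_delay):
    """Pick the node minimizing the weighted cost (illustrative only).

    `local` is a (delay, energy) pair; `edge_options` maps a server
    index to its (delay, energy) pair. Returns (None, cost) when local
    execution is cheapest, otherwise (server_index, cost).
    """
    best, best_cost = None, computation_cost(*local, w_energy, w_delay)
    for i, (d, e) in edge_options.items():
        cost = computation_cost(d, e, w_energy, w_delay)
        if cost < best_cost:
            best, best_cost = i, cost
    return best, best_cost
```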

MARL Framework.
Since the above optimization problem (14) is non-convex and NP-hard, we adopt a multiagent reinforcement learning approach to find a feasible solution. In this section, we model the formulated optimization problem as a Markov decision process (MDP) [31], where the purpose of action selection is to maximize the reward function. In the space/aerial-assisted network environment, each GU acts as an agent that chooses an action and then receives a reward at time slot t. The state space, action space, and reward function are described as follows.

State Space.
The state s_n(t) ∈ S consists of the channel vectors, the size of the task randomly generated in time slot t, the unaccomplished task queue, and the remaining energy. These quantities change over time because of the impact of the individual and joint actions in this system, so we define the state in our scenario as

Action Space.
Based on the current state s_n(t) and the other agents' experience, each agent selects an action to schedule the computation tasks. Formally, we define the vector a_n(t) = {x_nmi(t), ∀n ∈ N, ∀m ∈ M, ∀i ∈ I} as the binary offloading decision, where x_nmi(t) ∈ {0, 1} indicates whether or not GU n offloads its task m to MEC server i. The constraints of problem (14) are imposed on the binary offloading decision strategy, and each computation task is offloaded to at most one node in time slot t.
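The binary offloading constraint described above (binary entries, at most one target node per task and slot) can be checked as in the following illustrative helper:

```python
def is_valid_decision(x_row):
    """Check the offloading constraint for one task of GU n (sketch).

    x_row[i] = 1 means 'offload this task to node i' (index 0 is the
    satellite, 1..I the UAVs); an all-zero row means local execution.
    Entries must be binary and at most one entry may be 1, matching
    the constraints of problem (14).
    """
    return all(v in (0, 1) for v in x_row) and sum(x_row) <= 1
```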

Reward Function.
In line with reinforcement learning, each agent selects its own action in a decentralized execution to maximize the global reward. The agent's choice is based on the reward function, which specifies the goal of the algorithm. With the objective of minimizing the long-term weighted sum of delay and energy consumption of all tasks, we define the reward function from the computation cost. Let π denote the stationary policy; a value function is defined to determine the value of the reward, which is given by

V_n(s, π) = E[ Σ_t ψ^t R_n(s_n(t), a_n(t)) | s_n(0) = s, π ],

where ψ ∈ [0, 1] denotes the discounting factor, which weights the cumulative utility. The overall reward of all agents at time slot t is the sum of the individual rewards. The main objective is to minimize the computation cost in the space/aerial-assisted network. We denote the group of GUs' optimal strategies as π* = {π*_1, π*_2, ..., π*_N}. We maximize the long-term reward, where π*_n satisfies V_n(s, π*_n, π*_-n) ≥ V_n(s, π_n, π*_-n), and π_n and π*_-n = {π*_1, ..., π*_{n-1}, π*_{n+1}, ..., π*_N} denote any possible strategy of GU n and the other agents' optimal strategies, respectively. Therefore, the MADRL algorithm can obtain the optimal policy through convergence.
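The discounted value defined above can be estimated from one sampled trajectory of per-slot rewards, as in the following sketch:

```python
def discounted_return(rewards, psi):
    """Monte-Carlo estimate of V = E[sum_t psi^t * R_t] (sketch).

    psi in [0, 1] is the discounting factor from the text; `rewards`
    is one sampled trajectory of per-slot rewards R_n(s_n(t), a_n(t)).
    Accumulating backwards avoids computing psi**t explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + psi * g
    return g
```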

MADDPG-Based Task Offloading Scheme.
In this section, the MADDPG [32]-based task offloading scheme is proposed to derive a near-optimal decision by optimizing the continuous variable X. MADDPG not only retains the greatest advantage of DDPG, the ability to handle a continuous action space, but also overcomes the unsuitability of Q-learning and policy gradient algorithms for multiagent environments by extending the DDPG algorithm to the multiagent domain.
As shown in Figure 2, the MADDPG framework adopts centralized training and distributed execution. Each GU, as agent n, has an actor and a critic: actor n chooses the appropriate action a_n according to the local observation s_n, and critic n evaluates the action a_n based on the global observation S_all. During the training procedure, each critic n also collects the policies of the other agents, denoted by A_all. In our MADDPG architecture, the actor is trained to generate a deterministic policy, the critic is trained to evaluate the actor, and an experience replay buffer B is used to break the correlation between consecutive actions; it stores transitions from which minibatches of samples (s, a, R, s') ∼ U(B) are drawn. The random minibatch size is denoted by Z.
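The experience replay buffer B described above can be sketched as follows; this is a minimal illustration, not the paper's implementation:

```python
import collections
import random

class ReplayBuffer:
    """Minimal experience replay buffer B (sketch).

    Stores (s, a, R, s') transitions up to a fixed capacity and samples
    uniform minibatches of size Z, which breaks the temporal correlation
    between consecutive actions during training.
    """
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)  # oldest dropped first

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, z):
        # Uniform sampling (s, a, R, s') ~ U(B)
        return random.sample(list(self.buf), z)

    def __len__(self):
        return len(self.buf)
```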
In the MADDPG algorithm, each agent selects an action a_n^k = μ(s_n^k | θ^μ) as the output of the actor network μ. The actor network is updated in the form of a gradient, which is given by (21), and the critic network Q is updated by minimizing the loss against the target value

y_n = R_n(t) + ψ Q'^π(S'_all, A'_all | θ') |_{a'^{k+1}_n = μ'(s^{k+1}_n)},

where a'^{k+1}_n is the prediction of the next action by the target actor network. To stabilize the policy gradient of agent n, the parameters of both the target actor and target critic networks are soft updated as in (23), where δ is the forgetting factor.
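The soft update of equation (23) can be sketched as follows, with the forgetting factor δ and the network parameters represented as plain lists of floats for illustration (a real implementation would update tensors in place):

```python
def soft_update(target_params, online_params, delta):
    """Soft update of target-network weights, sketch of eq. (23).

    theta' <- delta * theta + (1 - delta) * theta', where delta is the
    forgetting factor: a small delta makes the target network track the
    online network slowly, which stabilizes training.
    """
    return [delta * w + (1 - delta) * wt
            for w, wt in zip(online_params, target_params)]

# Example: with delta = 0.1 the target moves 10% of the way
# toward the online weights each update.
updated = soft_update([0.0], [1.0], 0.1)
```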
Security and Communication Networks

The details of the MADDPG-based task offloading scheme are shown in Algorithm 1. First, we initialize four DNNs for each GU, i.e., the critic network, the actor network, and the two corresponding target networks (lines 2-3). At the beginning of each episode, each GU obtains its observation state (line 7). Without loss of generality, we divide each episode into T time slots. In each time slot, the agents first select an action according to the current policy, with noise added for exploration (lines 8-9). Afterward, all agents execute their actions, and each agent receives the corresponding reward and the next state (lines 10-11). Then, the experience tuple generated by this iteration is stored in the replay buffer for the parameter update (line 12). Finally, given a sampled minibatch of transitions from the replay buffer, each agent updates the parameters of the critic network by minimizing the loss value, updates the parameters of the actor network by gradient ascent, and updates the parameters of the target networks using (23) (lines 15-17).

Analysis of Complexity.
The deep neural network of the actor-critic framework can be represented as matrix multiplications. Let N and H denote the dimension of the output and the number of hidden layers, respectively. The computational complexity of the interaction between agents is O(N), and the complexity of each actor can be expressed as O(HN^2). In our proposed algorithm, the training cost is affected by the agent cost C_n, the number of training episodes K, and the batch size Z. Therefore, the computational cost of the critic networks and of the training procedure can be estimated as O(C_n K N) and O(C_n K Z H N^2), respectively.

Analysis of Convergence.
In the proposed algorithm, the gradient method is adopted to approximate the optimal policy by updating the weights of the target networks. The parameters θ^μ_n and θ^Q_n converge to particular values after a finite number of iterations. Therefore, the convergence of our proposed algorithm can be guaranteed. Furthermore, the convergence can be observed through simulations.

Simulation Settings.
In this section, simulations are carried out to verify the proposed model and algorithm. Specifically, we begin by elaborating on the simulation settings; afterwards, we present an evaluation of the experimental results. The simulation environment is implemented in Python 3.6 with TensorFlow 2.0 on a personal computer with an AMD R7-4800H CPU. The ReLU function is used as the activation function after each fully connected layer, and L2 regularization is used to reduce DNN overfitting. The numbers of neurons in the two hidden layers are 256 and 128, the number of episodes is set to 2000, and the learning rate is 0.001. Other important constant parameters are listed in Table 2.

Convergence Analysis.
To evaluate the performance of our proposed scheme, we compare the convergence of three algorithms and the mean computation cost in the system. We adopt two benchmark schemes: DDPG [31] and DQN [18]. We first conducted a series of experiments to determine the optimal values of the hyperparameters used in the algorithm; the selection is based on the performance of the algorithm under different learning rates and discount factors. The convergence performance of the algorithm under different learning rates is shown in Figure 3. We can observe that the convergence speed is reduced when the learning rate is too small, while, in general, the algorithm does not converge normally when the learning rate is too large.

Figure 2: The MADDPG-based task offloading scheme.
As shown in Figure 4, we show the convergence of our proposed algorithm in terms of the average reward over episodes, where the weight index of the time delay is ω1 = 0.6 and the energy consumption weight index is ω2 = 1 − ω1; the results are averaged over ten numerical simulations, proving the effectiveness of the neural networks. The average reward values of MADDPG, DDPG, and DQN increase continuously until convergence. The MADDPG algorithm becomes stable earlier than the DDPG and DQN algorithms, which become stable only after more than 400 training episodes. Moreover, the average reward at final convergence of the MADDPG algorithm is higher than those of the other two algorithms. Based on the simulation results, we conclude that the MADDPG algorithm outperforms the benchmark schemes in minimizing the long-term cost, reducing the waste of resources by learning a cooperative policy and maximizing the global reward.

Average Cost.
In this section, we discuss the average cost with respect to the number of GU devices and the size of the offloading tasks. Figure 5 demonstrates the performance of the proposed scheme and the three baselines in the space/aerial-assisted network in minimizing the average computation cost. We provide one additional baseline, the greedy algorithm [33], in which each GU first makes its best effort to offload computation tasks and then processes the remaining tasks locally.
Algorithm 1: MADDPG-based task offloading scheme.
(1) Initialization:
(2) Randomly initialize the critic network Q(s, a | θ^Q_n) and the actor μ(s, a | θ^μ_n) with weights θ^Q_n and θ^μ_n
(3) Initialize the target networks Q' and μ' with weights θ^Q'_n ← θ^Q_n and θ^μ'_n ← θ^μ_n
(4) Empty the replay buffer B
(5) for episode k = 1, 2, ..., K do
(6) Initialize a Gaussian noise Δμ with mean 0;
(7) Receive the initial observation state S = {s_1, s_2, ..., s_N};
(8) for time slot t = 1, 2, ..., T do
(9) Select action a_n(t) = μ(s_n(t) | θ^μ_n) + Δμ according to the current policy and exploration noise Δμ
(10) Execute action a_n(t) and observe the reward R_n and the next state s_n(t + 1)
(11) Collect the global states S_all, S'_all and the actions A_all;
(12) Store the transition in the replay buffer B
... Sample a random minibatch of Z transitions; update the critic network by minimizing the loss; update the actor policy by using the sampled policy gradient; softly update the target networks using (23)

From Figure 5, we can observe that the average computation cost increases as the number of GU devices increases. The DQN, DDPG, and MADDPG algorithms all use deep reinforcement learning to generate offloading strategies automatically. According to the simulation results, the MADDPG algorithm reduces the computation cost by 29.066% compared to the DDPG algorithm, by 36.392% compared to the DQN scheme, and by 51.602% compared to the greedy scheme. Therefore, the proposed MADDPG scheme keeps the computation cost lower than the benchmark schemes, proving the validity of the cooperative policy. Figure 6 demonstrates the average cost under different sizes of the offloading tasks. The average cost increases with the size of the offloading data, because each agent needs to offload a larger amount of data, which increases the computational cost of the system. The proposed MADDPG scheme obtains the best reward and a lower computation cost than the benchmark schemes.
The performance of the strategies generated by the DQN algorithm and the DDPG algorithm is mediocre across the various scenarios, mainly because the training results of these two algorithms do not fit the multiagent environment. In contrast, MADDPG can effectively learn stable strategies through decentralized execution and centralized training. Therefore, we conclude that the MADDPG algorithm outperforms the three comparison algorithms across the different scenarios.

Conclusion
In this paper, an efficient task offloading scheme is proposed for the space/aerial-assisted edge computing system in SAGIN. First, we elaborated the SAGIN architecture. Then, we expressed task offloading as a nonlinear optimization problem with the goal of minimizing the weighted sum of energy consumption and delay. On this basis, we proposed an algorithm based on MADDPG to solve this problem. Finally, the simulation results show that the computation cost can be significantly reduced by offloading tasks to the edge servers on the UAVs or the satellite, and the convergence and effectiveness of the proposed scheme in the simplified scenario are also demonstrated by comparison with three benchmark schemes.
In the future, we will further consider the mobility management of satellites and UAVs. In addition, the task offloading scheme of SAGIN in areas with rich computing resources is also worth further study.

Data Availability
The data used to support the findings of this study are included within the article.