Joint Optimization for MEC Computation Offloading and Resource Allocation in IoV Based on Deep Reinforcement Learning

In the Internet of Vehicles (IoV), the limited computing capacity of vehicles can hardly process intensive computation tasks locally. Such tasks can be offloaded to multiaccess edge computing (MEC) servers for processing, where MEC provides the required computing capacity to nearby vehicles. In this paper, we consider a scenario with both cooperation and competition between vehicles, in which the offloading decision of any vehicle affects the decisions of the others and the computing resource allocation strategies of the MEC server change dynamically. Therefore, we propose a joint optimization scheme for computation offloading decisions and computing resource allocation based on decentralized multiagent deep reinforcement learning. The proposed scheme learns the optimal actions to minimize the total weighted cost, which is designed to reflect the vehicles' satisfaction, based on the type of stochastically arriving tasks and the dynamic interaction between the MEC server and vehicles within different RSU coverage areas. The numerical results show that the proposed algorithm, based on decentralized multiagent deep deterministic policy gradient (DDPG) and named De-DDPG, can autonomously learn the optimal computation offloading and resource allocation policy without a priori knowledge and outperforms three baseline algorithms in terms of the obtained rewards.


Introduction
With the development of wireless communication technology and the rapid growth of the number of vehicles, the Internet of Vehicles (IoV) has become one of the most important applications of the Internet of Things (IoT) [1,2]. However, due to the limited computing resources of vehicles, several tasks cannot be executed locally within the required delay [3]. To solve this problem, offloading IoV tasks to a mobile edge computing (MEC) server has been proposed as a feasible solution [4]. MEC is in close proximity to the mobile vehicles, supplying sufficient computation resources for the offloaded tasks [5,6].
In recent years, many studies of MEC computation offloading in IoV have been conducted [7,8]. Some researchers have developed optimization schemes for computation offloading under certain constraints, such as reducing the delay, the computation resource overhead, and the energy consumption [9][10][11]. Moreover, computation congestion, which affects the performance of the MEC server, and the load balance of computation resources among MEC servers have been considered in the computation offloading problem [12]. Although the MEC server provides vehicles with resources far beyond their own, the resources of the MEC server may also become insufficient when massive numbers of vehicles access the MEC server simultaneously. Therefore, rational resource allocation that optimizes various performance objectives is also a significant issue in edge computation offloading [13][14][15]. Because of the high-speed mobility of vehicles and the randomness of tasks, as well as the cooperation and competition between vehicles in IoV, the computation resource allocation policy of the MEC server under different offloading decisions of vehicles has been discussed by many researchers. By jointly optimizing resource allocation and offloading strategies in IoV, the overall cost of computation resources, energy, and delay is minimized in [16][17][18]. However, these methods require a large number of iterations to obtain a satisfactory local optimum, which is not suitable for application scenarios where the environment changes rapidly and decisions need to be made in real time. Meanwhile, this type of optimization problem is usually nonconvex and NP-hard.
Deep reinforcement learning (DRL), which combines deep learning (DL) and reinforcement learning (RL), can tackle such nonconvex optimization problems and has been widely used as an effective approach to optimize different issues including offloading decision-making and resource allocation strategies [14,[19][20][21][22][23]. Previous works have made many efforts to optimize task offloading in IoV. For example, a deep Q-network (DQN) is adopted in a multivehicle offloading system to obtain the optimized offloading decisions that maximize the QoS of a digital twinning-empowered IoV system [23]. Similar work proposes a multiagent DQN-based computation offloading scheme in which the uncertain environment is considered so that the vehicles can make offloading decisions that achieve an optimal long-term reward [24]. A dynamic task offloading scheme based on Q-learning is implemented to minimize the delay, energy consumption, and total overhead in an IoV system [25]. A URLLC-aware task offloading algorithm based on deep Q-learning is studied to maximize the throughput of vehicles under satisfied constraints in [26]. Jointly considering the task priority, the vehicles' service availability, and the computation resource sharing incentive, an optimal offloading policy based on soft actor-critic (SAC) maximizes both the expected reward and the policy entropy of the offloading tasks in the dynamic vehicular environment [27]. Moreover, DQN-based joint computation offloading and task migration optimization is applied to minimize the total system cost in a 5G vehicle-aware MEC network [28]. A two-stage scheme is designed for joint optimization, where DQN is used in the first step to obtain the offloading strategy and deep deterministic policy gradient (DDPG) is utilized to generate the transmit power determination strategy of the vehicles [29]. None of the above studies considers the joint optimization of the offloading strategy and computation resource allocation when multiple agents interact in a dynamic IoV environment.
Different from the existing works, we propose a decentralized multiagent deep reinforcement learning-based method to solve the joint optimization of computation offloading decisions and resource allocation for the MEC server in IoV. The objective of our work is to minimize the weighted cost of the multiagent system. In summary, our main contributions are as follows: (1) We propose an IoV scenario supported by an MEC server for dynamic task offloading decisions and computation resource allocation in an environment where multiple RSUs cover multiple vehicles. In this cooperative scenario, because of the mobility of the vehicles and the stochastically arriving tasks, the computation offloading decisions and the resources allocated to the RSUs and vehicles change across time slots. (2) Based on the proposed model, we consider both offloading decision-making and computation resource allocation to obtain the minimum weighted cost, which is related to the end-to-end delay and the computation resource cost. Moreover, we formulate the problem as a Markov decision process (MDP) and design the state, action, and reward functions. (3) In order to effectively solve the abovementioned problem with continuous variables and meet the requirement of convergence, a joint optimization scheme based on decentralized multiagent DDPG (De-DDPG) is proposed. The simulation results verify the convergence of the proposed algorithm and show that it performs better than three baseline algorithms. The remainder of this paper is organized as follows: In Section 2, an MEC framework with multiple RSUs and vehicles is introduced, and we construct the network model, communication model, and computation model. Section 3 describes the problem statement of the joint optimization. The solution based on decentralized multiagent DDPG (De-DDPG) is proposed in Section 4. In Section 5, the simulation results and analysis are presented. Finally, we conclude this paper in Section 6.

Network Model.
A three-layer Internet of Vehicles (IoV) architecture is considered in this paper (see Figure 1), which consists of an MEC server, M roadside units (RSUs), and N vehicles on a multilane road of length L.
The MEC server is connected to the M RSUs via fiber-optic links for receiving and transmitting the computation tasks. We assume that the total computing resource of the MEC server is denoted as F. The RSUs, denoted by $\mathcal{M} = \{1, 2, \dots, M\}$, are located along the road with the same coverage range l. Therefore, we divide the road into M segments, and all vehicles are randomly and independently distributed in the segments with arrival rate λ. The RSU is responsible for forwarding messages between the MEC server and the vehicles. A set of vehicles, denoted as $\mathcal{N} = \{1, 2, \dots, N\}$, periodically send messages to the RSU within its communication range. Vehicles have the same local computing capacity, which is determined by the onboard unit (OBU) [30]. Each vehicle-i sends not only task messages but also its driving characteristics $(p_i, v_i)$, where $p_i$ and $v_i$ represent its 1-D position and speed, respectively. Here, we assume that the distances between vehicles follow an exponential distribution and the speeds of the vehicles follow a truncated Gaussian distribution, which is more appropriate for the actual situation of the road [31,32]. In addition, we assume that each vehicle only processes one computation task within the current time period. The computation task of each vehicle is denoted as $T_i = \{C_i, D_i^{in}, D_i^{out}, t_i^{max}\}$, where $C_i$ is the required computation capacity to complete the task, $D_i^{in}$ and $D_i^{out}$ are the data sizes of the input and output for computing, respectively, and $t_i^{max}$ is the maximum tolerable delay for the task completion. Each vehicle needs to execute its computation task within the tolerable time period, and the task can be either processed locally or offloaded to the MEC server. We define the binary offloading decision of the vehicles as $x_i \in \{0, 1\}$, where $x_i = 0$ and $x_i = 1$ mean that vehicle-i executes the computation task locally or offloads the task to the MEC server, respectively. Moreover, when the vehicle leaves the coverage of the RSU, the vehicle is disconnected from the RSU and can no longer transmit data to the MEC server through the RSU. The time available to the vehicle before leaving the communication range of RSU-j, $j = \lceil p_i / l \rceil$, $j \in \mathcal{M}$, i.e., the sojourn time, can be given as $t_i^{soj} = (j \cdot l - p_i) / \bar{v}$, where $\bar{v}$ represents the vehicle's equivalent speed.
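To make the mobility assumptions concrete, the following Python sketch drops vehicles on the road with exponentially distributed gaps and truncated-Gaussian speeds and computes the sojourn time within the serving RSU's segment. All numeric values (road length, RSU range, speed statistics) are illustrative placeholders rather than the paper's settings, and the sojourn-time expression assumes RSU-j covers the interval [(j-1)l, jl].

```python
import numpy as np
from scipy.stats import truncnorm

# Illustrative parameter values (hypothetical, not taken from the paper)
L_ROAD = 2000.0      # road length L in meters
RSU_RANGE = 500.0    # coverage range l of each RSU in meters
LAMBDA = 0.01        # arrival rate: mean inter-vehicle gap = 1/LAMBDA meters
V_MEAN, V_STD = 25.0, 5.0    # mean/std of vehicle speed (m/s)
V_MIN, V_MAX = 15.0, 35.0    # truncation bounds of the speed

def drop_vehicles(rng=np.random.default_rng(0)):
    """Place vehicles with exponential gaps and truncated-Gaussian speeds."""
    gaps = rng.exponential(scale=1.0 / LAMBDA, size=200)
    positions = np.cumsum(gaps)
    positions = positions[positions < L_ROAD]          # keep vehicles on the road
    a, b = (V_MIN - V_MEAN) / V_STD, (V_MAX - V_MEAN) / V_STD
    speeds = truncnorm.rvs(a, b, loc=V_MEAN, scale=V_STD,
                           size=positions.size, random_state=0)
    return positions, speeds

def sojourn_time(p_i, v_i):
    """Remaining time before vehicle-i leaves the coverage of its serving RSU-j."""
    j = int(np.ceil(p_i / RSU_RANGE))   # index of the serving RSU, j = ceil(p_i / l)
    return (j * RSU_RANGE - p_i) / v_i  # distance left in RSU-j's segment / speed

positions, speeds = drop_vehicles()
print(sojourn_time(positions[0], speeds[0]))
```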

Communication Model.
When a vehicle decides to offload its task to the MEC server, the vehicle transmits the data to the MEC server through the RSUs. Generally, the propagation time of the fiber-optic transmission between the RSUs and the MEC server can be ignored [12]. We consider that the V2I communication between the vehicle and the RSU is based on IEEE 802.11p in this work [33]. Following [34], the uplink and downlink transmission rates $(r_{ij}^{UL}, r_{ij}^{DL})$ of the wireless communication between vehicle-i and its serving RSU-j are expressed in terms of the following quantities: $N_j$ is the number of vehicles that decide to offload tasks to the MEC server via RSU-j; $\tau_{ij}$ represents the probability that vehicle-i connects to RSU-j in a random time slot; σ is the duration of a time slot; RTS stands for the request-to-send interval, AIFS denotes the arbitration inter-frame spacing interval, and δ expresses the propagation delay. $T_{ij}^{success}$ is defined as the successful transmission period between vehicle-i and RSU-j, which involves the term Φ specific to the MAC protocol, equal to H + SIFS + δ + ACK + AIFS + δ + RTS + SIFS + δ + CTS + SIFS + δ, where H = PHY_head + MAC_head represents the packet header overhead. SIFS, ACK, and CTS stand for the short interframe space interval, the acknowledgment interval, and the CTS interval, respectively. $\omega_j$ denotes the bandwidth of RSU-j, $P_i$ is the transmission power of vehicle-i, and $h_{ij}$ stands for the channel gain between vehicle-i and RSU-j. The uplink/downlink transmission times under this situation are calculated as $t_{ij}^{UL} = D_i^{in} / r_{ij}^{UL}$ and $t_{ij}^{DL} = D_i^{out} / r_{ij}^{DL}$, and the two-way transmission time between the vehicle and the RSU is given by $t_i^{trans} = t_{ij}^{UL} + t_{ij}^{DL}$.
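Since the rate expressions above ultimately feed into transmission delays, a minimal sketch of the uplink, downlink, and two-way transmission times is given below, assuming the rates $r_{ij}^{UL}$ and $r_{ij}^{DL}$ have already been obtained from the IEEE 802.11p model of [34]; the numeric values in the example are hypothetical.

```python
def transmission_times(d_in_mbits, d_out_mbits, r_ul_mbit_per_slot, r_dl_mbit_per_slot):
    """Uplink, downlink, and two-way transmission times in time slots."""
    t_ul = d_in_mbits / r_ul_mbit_per_slot    # upload the task input data
    t_dl = d_out_mbits / r_dl_mbit_per_slot   # download the computation result
    return t_ul, t_dl, t_ul + t_dl            # two-way transmission time

# Example: 3 Mbit input, 0.3 Mbit output, 0.9 Mbit/slot uplink, 1.8 Mbit/slot downlink
print(transmission_times(3.0, 0.3, 0.9, 1.8))
```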

Computation Model.
The processing time is considered under two situations: the task is processed locally, or the task is offloaded to the MEC server for computing.

Local Processing Model.
When vehicle-i processes its computation task locally ($x_i = 0$), the processing time $t_i^{loc}$ depends only on its own computing capacity. The local execution time $t_{exe}^{loc}$ is determined by the required computation capacity $C_i$ and the vehicle's computation capacity $f^{loc}$, which is related to the vehicle's CPU cycle frequency.

MEC Processing Model.
When the task is offloaded to the MEC server ($x_i = 1$), the end-to-end delay of vehicle-i includes the task execution time and the transmission time. The execution time of vehicle-i when offloading the task to the MEC server is determined by the required computation capacity $C_i$ and $f_j^{mec}$, where $f_j^{mec}$ denotes the computation capacity, i.e., the CPU cycle frequency, allocated by the MEC server to RSU-j, to which vehicle-i is connected. The end-to-end delay between vehicle-i and the MEC server is obtained by adding the execution time and the two-way transmission time. The main notations and descriptions are listed in Table 1.
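The following sketch puts the two processing models side by side, assuming the common MEC convention that execution time equals the required CPU cycles divided by the allocated CPU frequency; the paper's exact equations may differ in detail, and all numbers are illustrative.

```python
def end_to_end_delay(x_i, c_i_gcycles, f_loc_gcycles_per_slot, f_mec_gcycles_per_slot,
                     t_trans_slots):
    """End-to-end delay (in time slots) under the binary offloading decision x_i.

    Assumes execution time = required CPU cycles / allocated CPU frequency,
    a common MEC model; this is an illustrative assumption.
    """
    if x_i == 0:                                           # local processing
        return c_i_gcycles / f_loc_gcycles_per_slot
    # MEC processing: execution at the RSU's allocated frequency + two-way transmission
    return c_i_gcycles / f_mec_gcycles_per_slot + t_trans_slots

print(end_to_end_delay(0, 2.0, 0.5, 3.8, 4.0))   # local:   4.0 slots
print(end_to_end_delay(1, 2.0, 0.5, 3.8, 4.0))   # offload: ~4.53 slots
```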

Problem Statement
In this section, the optimization problem is formulated by jointly considering the offloading decision and the resource allocation, with the aim of load balancing and system cost minimization. First of all, we define the cost function as follows.
The cost function is introduced to quantify the satisfaction level of a vehicle's offloading decision; it is inversely related to the satisfaction and is determined by the delay sensitivity and the cost of computation resources. The logarithmic function is known to achieve proportional fairness in many studies [35] and can achieve load balance, so a logarithmic function is used to represent the cost function in this paper. The processing delay of a task is generally considered to be inversely proportional to the satisfaction; that is, the shorter the task processing delay, the higher the satisfaction. In addition, if the task is completed within the maximum tolerable delay, the satisfaction of the vehicle should be non-negative. However, once the completion time of the task exceeds its maximum tolerable delay, the processing result loses its value because tasks in IoV are highly delay-sensitive. Here, a penalty mechanism is brought into consideration. Another metric in the cost function is the computation resource cost. The vehicle has to pay for its own computation resources when it processes the task locally. Furthermore, when the task is offloaded to the MEC server, the vehicle pays the corresponding cost for the computation resources allocated by the MEC server, which also reduces the satisfaction of the vehicle. Therefore, the cost function $U_i^l$ for vehicle-i to process the task locally is given as a weighted combination of the delay term and the computation resource cost, where β ∈ (0, 1) and 1 − β represent the weights of the delay and the computation resource cost, respectively. The weighted function provides a flexible scheme for different applications' specific requirements by adjusting the weight parameters. $(z)^+ = \max(z, 0)$ ensures that $U_i^l$ is non-negative, ρ is the unit cost of the computing resource, and P > 0 represents the penalty for a task that is not completed within its maximum tolerable delay.
Similarly, the cost function $U_i^{mec}$ of vehicle-i offloading the task to the MEC server can be expressed analogously, taking into account that when the vehicle leaves the coverage of the RSU, it is disconnected from the MEC server through the RSU regardless of whether the task has been processed or not. Combining equations (9) and (10), the cost function of vehicle-i can be expressed as $U_i = (1 - x_i) U_i^l + x_i U_i^{mec}$. This work aims to minimize the system cost by jointly determining the offloading decisions of the vehicles and the computation resource allocation of the MEC server. The optimization problem is formulated subject to the following constraints. Constraint C1 ensures that the available local computation resource is non-negative and less than that of the MEC server. C2 is the constraint on the computation resource assigned by the MEC server to each vehicle-i within the coverage of RSU-j. Constraint C3 states that the sum of the computation resources allocated to all the offloading tasks through RSU-j does not exceed the total computation resource of the MEC server. C4 gives the binary offloading decision constraint for each vehicle's task.
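As a rough illustration of how such a cost could be evaluated, the sketch below combines a logarithmic delay term, a resource-cost term weighted by ρ, and a deadline-miss penalty P. The exact functional form, the penalty value, and the meaning of the resource argument are assumptions made for illustration only, not the paper's equations.

```python
import math

RHO = 0.08       # unit cost of computation resource (a value also used later in the paper)
PENALTY = 5.0    # penalty P for a deadline miss (hypothetical value)

def weighted_cost(delay, t_max, f_used_gcycles, beta=0.5):
    """Illustrative cost: weighted sum of a delay term and a resource-cost term.

    The paper states that the cost uses a logarithmic (proportional-fairness) shape
    and a non-negativity operator (z)+ = max(z, 0); the exact expression is assumed.
    """
    if delay > t_max:                               # deadline missed: apply penalty P
        return PENALTY
    delay_term = max(math.log(1.0 + delay), 0.0)    # assumed log-shaped delay cost
    resource_term = RHO * f_used_gcycles            # pay for the cycles consumed
    return beta * delay_term + (1.0 - beta) * resource_term

print(weighted_cost(delay=3.2, t_max=4.0, f_used_gcycles=0.5, beta=0.5))
```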
The cost function in the above problem involves the end-to-end delay, which is related to the indicators of the stochastically arriving tasks $(C_i, D_i^{in}, D_i^{out})$, the computing resource $f_j^{mec}$ allocated to RSU-j, and the relative position of the vehicle to the RSU based on the vehicle's driving characteristics $(p_i, v_i)$. Therefore, the computational complexity grows additively with the tasks and the vehicle characteristics, and it also depends on the number of generated tasks. In this optimization problem, the offloading decisions X and the allocated computation resources F are the two main variables, which turn the problem into a mixed-integer nonlinear programming problem that is generally nonconvex and NP-hard [36]. We adopt a multiagent deep reinforcement learning approach to feasibly solve the problem of jointly optimizing the computation offloading decisions and the computation resource allocation.

DRL for Computation Offloading and Resource Allocation

Scheme Design.
We assume that the state is determined by the arriving tasks and the vehicles' characteristics, which are updated in each step. The state of the next time slot is related to the state of the current time slot. Therefore, the formulated problem can be modeled as a Markov decision process (MDP). An MDP is an iterative process in which agents observe states from the environment, select an action from the action space, obtain an immediate reward, and then transit to another state; it can be represented as a tuple $\langle S, A, P_{s,a}, R, \gamma \rangle$, where S is the state space, A is the action space, $P_{s,a}$ is the transition probability, R is the reward, and γ is the discount factor. The MDP policy depends entirely on the current state. The state space is designed to accommodate the proposed IoV environment, and each vehicle acts as an agent. We first define the state space, the action space, and the reward as follows.

State Space.
The state at time slot t corresponds to the required computation capacity to complete the task $C_i$, the input data size of the task $D_i^{in}$, the output data size of the task $D_i^{out}$, the position of the vehicle $p_i$, the speed of the vehicle $v_i$, and the computing resource $f_j^{mec}$ allocated to RSU-j. Thus, the state $s_i(t) \in S$ can be described as $s_i(t) = \{C_i, D_i^{in}, D_i^{out}, p_i, v_i, f_j^{mec}\}$.
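A per-agent observation vector matching this description could be assembled as below; the ordering of the components and the example values are assumptions for illustration.

```python
import numpy as np

def build_state(c_i, d_in, d_out, p_i, v_i, f_mec_j):
    """Per-agent observation s_i(t): task descriptors, mobility, and the
    computing resource currently allocated to the serving RSU-j."""
    return np.array([c_i, d_in, d_out, p_i, v_i, f_mec_j], dtype=np.float32)

s = build_state(c_i=2.0, d_in=3.0, d_out=0.3, p_i=420.0, v_i=26.0, f_mec_j=3.8)
print(s.shape)   # (6,) -- the input dimension of each agent's actor network
```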

Action Space.
The action is the joint decision-making for the computation offloading and the resource allocation. The vehicle needs to decide whether to process the task locally or offload it to the MEC server, and the action also specifies the computation resource allocated to the offloaded task.
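One possible encoding of this joint action for a DDPG agent with continuous outputs is sketched below: the first output is thresholded into the binary offloading decision $x_i$ and the second is interpreted as the share of RSU-j's resource. This encoding is an assumption for illustration, not the paper's specification.

```python
import numpy as np

def decode_action(a_raw, f_mec_j_gcycles):
    """Map a raw actor output in [0, 1]^2 to an offloading decision and a
    computation-resource share (one possible encoding, assumed for illustration)."""
    x_i = int(a_raw[0] > 0.5)                         # binary offloading decision
    f_i = a_raw[1] * f_mec_j_gcycles if x_i else 0.0  # share of RSU-j's resource
    return x_i, f_i

print(decode_action(np.array([0.8, 0.3]), f_mec_j_gcycles=3.8))   # approx. (1, 1.14)
```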

Reward Space.
We assume that all vehicles with the same functionality share the same reward function. Each agent selects its action based on the reward so as to obtain the maximum global reward. The long-term weighted sum of the cost functions of all tasks is considered as the objective, and we define the following function to maximize the reward (i.e., minimize the cost) over the whole time period T.
The average reward of all agents in time slot t can be calculated accordingly. Minimizing the weighted cost function of the proposed model amounts to maximizing the average cumulative reward. The expectations of future rewards can be used to measure whether the selected action is appropriate or not. The reward is the return of the action selected on the basis of the state in time slot t. Therefore, the cumulative reward, generally expressed as the discounted expectation, is maximized to select the optimal actions, formulated as

$Q^{\pi}(s(t), \pi(t)) = \mathbb{E}\left[ R(s(t), a(t)) + \gamma Q^{\pi}(s(t+1), \pi(t+1)) \right]$, (17)

where γ ∈ [0, 1] is the discount factor, π is the policy, and π* is the optimal policy. $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, \pi)$ corresponds to the agents' optimal policy $\pi^* = \{\pi_1^*, \pi_2^*, \dots, \pi_N^*\}$.
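A small sketch of how the shared per-slot reward and the discounted return could be computed is given below, with the reward taken as the negative mean weighted cost over all agents; the discount factor value is hypothetical.

```python
import numpy as np

GAMMA = 0.95   # discount factor (hypothetical value)

def average_reward(costs_per_vehicle):
    """Reward shared by all agents in slot t: the negative mean weighted cost."""
    return -float(np.mean(costs_per_vehicle))

def discounted_return(rewards):
    """Cumulative discounted reward sum_t gamma^t * r_t over one episode."""
    g, ret = 1.0, 0.0
    for r in rewards:
        ret += g * r
        g *= GAMMA
    return ret

print(discounted_return([average_reward([0.4, 0.6, 0.5])] * 10))
```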

Optimal Scheme Based on Decentralized DDPG.
After formulating the MDP, we propose an optimization strategy based on decentralized multiagent DDPG (De-DDPG) in this subsection, in which each agent is initialized with four deep neural networks (DNNs): the critic network, the actor network, and copies of the actor and critic networks used as target networks (see Figure 2) [37]. Each agent's state, action, and reward are obtained and used to train the DNNs during the training procedure. After training, each agent selects its next action with its own actor network according to its local observation of the environment. As shown in Algorithm 1, the De-DDPG algorithm can be divided into three parts: initialization, interaction, and update. At the beginning of the algorithm, the four networks of each agent and the replay buffer B are initialized, where the critic network is $Q(s, a|\theta_i^Q)$, the actor network is $\mu(s|\theta_i^{\mu})$, the target critic network is $Q'(s, a|\theta_i^{Q'})$, and the target actor network is $\mu'(s|\theta_i^{\mu'})$. In addition, the replay buffer can be large because the proposed De-DDPG is an off-policy algorithm, which allows the algorithm to benefit from learning across a set of uncorrelated transitions [38]. In the interaction procedure, for each episode, noise sampled from a random process $\mathcal{N}_t$ is added to the actor policy to form an exploration policy μ′. The reason for introducing random noise is to overcome the insufficient exploration of the environment by the output actions of deterministic policy algorithms.
The Ornstein-Uhlenbeck process is used to generate temporally correlated exploration for exploration efficiency. Then, the actions interact with the environment and obtain the corresponding rewards and the next states. According to the observation, the transitions $(s_i(t), a_i(t), R_i(t), s_i(t+1))$ are stored in the replay buffer B. When updating, a random mini-batch of Z transitions is sampled from the replay buffer. Then, each agent's critic network, actor network, and two target networks are updated in turn. This loop repeats for each episode until the algorithm ends. In the update process, the critic network $Q(s, a|\theta_i^Q)$ is updated by minimizing the loss $L(\theta^Q)$ in Algorithm 1, which each agent uses as the approximation of the other agents' policies. Here, $y_i$ is the target value computed with the target actor and critic networks according to formula (17). The actor network $\mu(s|\theta_i^{\mu})$ is updated by using the sampled policy gradient $\nabla_{\theta^{\mu}} J$ in Algorithm 1, which is an unbiased estimate of the policy gradient expectation computed over the mini-batch transitions with the Monte Carlo method. After training on the mini-batch transitions and updating the weights of the critic and actor networks ($\theta_i^Q$ and $\theta_i^{\mu}$), the weights of the two target networks of each agent ($\theta_i^{Q'}$ and $\theta_i^{\mu'}$) are soft updated as a running average, as shown in Algorithm 1.
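To make the training loop of Algorithm 1 more tangible, the following PyTorch sketch implements a single decentralized DDPG agent with an actor, a critic, target copies, Ornstein-Uhlenbeck-style exploration noise, the TD target $y_i$, the sampled policy gradient, and soft target updates. Network sizes, learning rates, and other hyperparameters are placeholders rather than the values in Table 2.

```python
import copy, random
import numpy as np
import torch
import torch.nn as nn

# Hypothetical dimensions and hyperparameters; the paper's Table 2 values may differ.
STATE_DIM, ACTION_DIM = 6, 2
GAMMA, TAU = 0.95, 0.005
LR_ACTOR, LR_CRITIC = 1e-4, 1e-3

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(),
              nn.Linear(128, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

class DDPGAgent:
    """One decentralized agent: actor, critic, and their target copies."""
    def __init__(self):
        self.actor = mlp(STATE_DIM, ACTION_DIM, nn.Sigmoid())       # actions in [0, 1]
        self.critic = mlp(STATE_DIM + ACTION_DIM, 1)
        self.actor_t, self.critic_t = copy.deepcopy(self.actor), copy.deepcopy(self.critic)
        self.opt_a = torch.optim.Adam(self.actor.parameters(), lr=LR_ACTOR)
        self.opt_c = torch.optim.Adam(self.critic.parameters(), lr=LR_CRITIC)
        self.ou_state = np.zeros(ACTION_DIM)                        # OU exploration noise

    def act(self, s, theta=0.15, sigma=0.2):
        # Discretized Ornstein-Uhlenbeck noise added to the deterministic policy output
        self.ou_state += theta * (-self.ou_state) + sigma * np.random.randn(ACTION_DIM)
        with torch.no_grad():
            a = self.actor(torch.as_tensor(s, dtype=torch.float32)).numpy()
        return np.clip(a + self.ou_state, 0.0, 1.0)

    def update(self, batch):
        s, a, r, s2 = (torch.as_tensor(np.array(x), dtype=torch.float32) for x in batch)
        with torch.no_grad():                                        # TD target y_i
            y = r.unsqueeze(1) + GAMMA * self.critic_t(torch.cat([s2, self.actor_t(s2)], 1))
        critic_loss = nn.functional.mse_loss(self.critic(torch.cat([s, a], 1)), y)
        self.opt_c.zero_grad(); critic_loss.backward(); self.opt_c.step()

        # Sampled policy gradient: ascend the critic's value of the actor's actions
        actor_loss = -self.critic(torch.cat([s, self.actor(s)], 1)).mean()
        self.opt_a.zero_grad(); actor_loss.backward(); self.opt_a.step()

        for net, tgt in ((self.actor, self.actor_t), (self.critic, self.critic_t)):
            for p, pt in zip(net.parameters(), tgt.parameters()):    # soft update
                pt.data.mul_(1.0 - TAU).add_(TAU * p.data)

# Minimal usage: one agent, a random replay buffer, one mini-batch of Z = 32 transitions.
agent = DDPGAgent()
buffer = [(np.random.rand(STATE_DIM), agent.act(np.random.rand(STATE_DIM)),
           -np.random.rand(), np.random.rand(STATE_DIM)) for _ in range(200)]
batch = list(zip(*random.sample(buffer, 32)))
agent.update(batch)
```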

Numerical Results
This section presents a comprehensive numerical analysis covering the simulation setup, the comparison schemes, and the simulation results. There are 20 vehicles driving on the road, 4 RSUs located at stationary positions on the roadside, and an MEC server directly connected to the RSUs. That is to say, 2 groups of RSUs are set to serve all vehicles. One RSU group includes a main RSU and a secondary RSU, the purpose being to protect against the main RSU becoming abnormal due to accidents (such as power failure or communication blockage). Therefore, N = 20 and M = 4. The required computation resource is set as C_i = 0.54 Gcycles/Mbit. The size of the arriving tasks follows the uniform distribution $D_i^{in} \sim U(1.6, 4.6)$ Mbits. The arrival probability of tasks is 0.45. The maximum tolerable delay of a computation task is set as $t_i^{max} = 4$ time slots. One time slot t ∈ T is set to t = 1 ms. The uplink transmission rate $r_{ij}^{UL}$ of vehicle-i is between 0.85 and 0.95 Mbits per time slot. The bandwidth of each RSU is $\omega_j = 10$ MHz. The transmission power of vehicle-i is $P_i = 1$ W. The maximum computation resource allocated to RSU-j is 3.8 Gcycles.
The computation resource of each vehicle is 0.5 Gcycles.
For each vehicle-i (i ∈ N), the neural networks (two actor networks and two critic networks) for De-DDPG are composed of four layers, i.e., an input layer, two fully connected hidden layers, and an output layer. Table 2 lists the parameters and their values.

Simulation Results.
In this section, we analyze the simulation results in detail from two aspects: the convergence of the proposed De-DDPG and the advantages of De-DDPG over three baseline algorithms.
In terms of De-DDPG's convergence, the average cumulative reward of De-DDPG over one period (all time slots) is computed for each episode. Figure 3 shows the convergence of the proposed De-DDPG with different critic learning rates α. The choice of the learning rate α obviously affects the convergence quality and speed of De-DDPG. From Figure 3, it can be observed that De-DDPG does not converge when α = 0.01. From the blue curve for α = 0.0001, although De-DDPG eventually converges, the convergence speed is too slow, which degrades the performance of De-DDPG. Therefore, we set α = 0.001, for which De-DDPG is more stable. In terms of the actor's learning rate, we set α′ = 0.0001. $f_j^{mec}$ is the total computation resource preallocated to RSU-j by the MEC server. To ensure the fairness of the computation resource allocation and the robustness of the proposed De-DDPG, we periodically update the allocated computation resource of each RSU by applying a fluctuation (volatility) rate. As shown in Figure 4, when the volatility rate is set to 1, 3, and 5, the convergence and performance of De-DDPG differ. When $\lambda_j = 1$, the performance of De-DDPG is better than that of the other two curves; however, the stability of De-DDPG is slightly worse. As the volatility rate increases, the performance of De-DDPG decreases, but we can see that when the volatility rate is set to 3, the stability and convergence of De-DDPG are optimal. Figure 5 shows the convergence of De-DDPG with different control parameters ρ for the computation resource cost. According to formula (10), the greater ρ is, the greater the cost (and the smaller the reward). Although the average cumulative reward of De-DDPG is worst when ρ = 0.08, the stability is much better than that of the curves for ρ = 0.04 and ρ = 0.06. From the curves shown in Figure 5, as the control parameter ρ increases, the performance of De-DDPG declines less and less. Moreover, when ρ = 0.08, the convergence and training speed of De-DDPG are obviously improved. Therefore, we set the control parameter for the cost of computation resource to ρ = 0.08 in this paper.
Furthermore, Figures 6-8 verify the performance and advantages of the proposed De-DDPG compared with the baselines Ce-DDPG, A-RSU, and A-LP. The average cumulative reward over all 1000 episodes is used to show the performance of the four algorithms. Figure 6 illustrates the comparison of the four algorithms with different numbers of arriving vehicles. When the number of vehicles N within the coverage region of each RSU is 6 or 8, the computation resources of each RSU are sufficient to meet the computational demands of all generated tasks, so the performance of De-DDPG, Ce-DDPG, and A-RSU is not significantly different. In terms of A-LP, regardless of the number of vehicles, its local processing ability remains unchanged and the computation resources of the RSUs have no effect on its performance. On the contrary, because the local processing capability of the vehicles is insufficient for the arriving tasks, a large number of penalties P are incurred in each episode due to incomplete tasks, thus lowering the average cumulative reward. The uplink transmission rate $r_{ij}^{UL}$ of the wireless communication between vehicle-i and its serving RSU-j can influence the performance of the four algorithms. Figure 7 shows the performance of the four algorithms with different uplink transmission rates $r_{ij}^{UL}$. We can see that A-LP is not influenced by the transmission rate $r_{ij}^{UL}$ because A-LP executes all tasks with its own local processor. For De-DDPG, Ce-DDPG, and A-RSU, different uplink transmission rates $r_{ij}^{UL}$ mean different transmission delays $t_{UL}^{mec}$, which affect the average cumulative rewards of all three algorithms. As shown in Figure 7, the performance of De-DDPG is obviously better than that of Ce-DDPG and A-RSU due to the number of agents participating in training, the offloading decisions, and the ratio of resource allocation. In Figure 8, we describe the comparison of the four algorithms with different trade-off coefficients β for the latency cost. From formula (10), β is the weight of the delay cost and 1 − β is the weight of the computation resource cost. In terms of A-LP, since the computational resource cost is fixed, the performance of A-LP decreases as the trade-off coefficient β increases. The other three algorithms consider the trade-off between the delay cost and the computation resource cost, so the performance of De-DDPG, Ce-DDPG, and A-RSU varies with β. As shown in Figure 8, in terms of the average cumulative reward, De-DDPG outperforms Ce-DDPG, A-RSU, and A-LP regardless of the value of β.

Conclusions
We propose a computation offloading and resource allocation scheme based on DRL for MEC-assisted multiagent IoV with a stochastic task arrival model. To minimize the total weighted cost of the proposed model, we adopt a decentralized multiagent DDPG-based approach (De-DDPG) to solve the nonconvex joint optimization problem. The simulation results demonstrate that our proposed approach has a stable learning capacity and effectively learns the optimal offloading policy and resource allocation to obtain the maximum reward (minimum cost). Compared with the three baseline algorithms, our proposed algorithm performs better under various parameter configurations. In this paper, a binary offloading decision is used and the task priority is not considered. We will improve these two points in our future work, for example, by considering partial offloading and task prioritization in the joint optimization problem.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.