Multiuser Computing Offload Algorithm Based on Mobile Edge Computing in the Internet of Things Environment

As traditional cloud computing is not efficient enough to support large-scale computational task execution in IoT environments, a task offloading and resource allocation algorithm for mobile edge computing (MEC) is proposed in this paper. First, a multiuser computation offloading model is constructed, including a communication model and a computation offloading model, and the problem is transformed into the minimization of users' time delay and energy consumption (i.e., total system overhead) in the MEC system. Then, the task offloading model is formulated as a Markov decision process, and an offloading strategy based on a deep Q network (DQN) is designed to dynamically fine-tune the offloading proportion of each user so as to realize a low-cost MEC system. The proposed algorithm is analyzed on the constructed simulation platform. The simulation results show that when the number of user terminals is 40, the average delay of the proposed algorithm does not exceed 0.9 s and the average energy consumption tends to 65 J, both of which are better than those of the comparison methods. Therefore, the proposed algorithm has certain application prospects.


Introduction
Nowadays, more and more mobile devices are emerging in people's lives, leading to an explosion in the number of smart network edge devices [1]. As a result, data and computation tasks are also growing exponentially. In the context of massive data and large-scale computation tasks, mobile devices are required to process large amounts of application data quickly, which places high demands on their computing capacity. Due to the mismatch between data volume and transmission channel capacity, traditional cloud computing brings huge pressure and very high delays to the crowded network and cannot efficiently support the execution of large-scale computing tasks [2]. Mobile edge computing (MEC) provides cloud computing capacity for mobile devices at the edge of networks via wireless access, which addresses the problem of limited computation and energy resources on mobile devices. MEC has become a new paradigm for providing powerful computing and storage capabilities for mobile devices [3,4]. In order to further improve the quality of service for users and increase resource utilization efficiency in MEC, complex computation task offloading strategies and the allocation of communication resources need to be addressed [5].
In early work on computation offloading, most studies considered single-user scenarios, such as low-complexity dynamic computation offloading algorithms based on Markov decision processes and Lyapunov optimization, which try to achieve load balancing through offloading strategies [6,7]. Reference [8] proposed a low-complexity heuristic algorithm that achieves load balancing by using fractional programming, with the optimization goal of minimizing the energy consumption of task offloading. However, its multiscenario and multidimensional optimization of computational resource allocation is yet to be improved. Reference [9] realizes task offloading and efficient channel resource allocation based on the differential evolution algorithm. This scheme can significantly reduce energy consumption while ensuring convergence. However, its performance is poor for multiobjective optimization of complex tasks. On the new cloud-edge computing network designed in Reference [10], a joint optimization strategy based on a binary custom fireworks algorithm is proposed, which can ensure reasonable system computing resources and response time. However, its occupancy of computational resources is high, and its resource utilization needs to be improved. Reference [11] proposed a joint distributed algorithm considering transmission power and offloading strategy and established a queue model with a separate capacity between different windows to optimize queue delay. However, with the massive number of devices accessing the network in the 5G era, single-user scenarios are no longer able to meet people's daily needs.
Recently, deep learning techniques have been widely studied with the development of artificial intelligence. Since deep learning can overcome some limitations of reinforcement learning, it has been integrated into reinforcement learning, opening a new era of deep reinforcement learning [12]. Deep reinforcement learning incorporates deep neural networks to optimize the process of reinforcement learning, thus improving the learning speed and performance of reinforcement learning algorithms. Therefore, deep reinforcement learning is widely used in the practice of reinforcement learning [13]. Reference [14] proposed a distributed optimization method based on the alternating direction method of multipliers, which decomposes the optimization problem into N subproblems and maximizes the weighted sum computation rate through the optimal allocation of system resources and task computation time. However, this method adapts poorly to new environments. An offloading strategy based on metareinforcement learning was proposed in Reference [15]. Mobile applications are modeled as directed acyclic graphs and offloading strategies are represented by neural networks, and a combination of first-order approximation and tailoring of agent goals is applied for effective training. Although the adaptability is enhanced, the processing time of the strategy still needs to be improved. Reference [16] constructs a task offloading model based on multiagent deep reinforcement learning and uses the MEC model to better realize computing task offloading and resource allocation. Wang et al. proposed a reinforcement learning-based computing offloading strategy [17]. Although the aforementioned deep learning algorithms can achieve better performance in MEC, their training processes and initial conditions are very complex, which means they need to be further optimized for practical applications.
Based on the above analysis, to alleviate the network load and reduce the risk of network congestion in traditional cloud computing for IoT, MEC is introduced to formulate the multiuser computing offloading problem, and a task offloading and resource allocation algorithm for MEC in an IoT environment is proposed. Since users' tasks and the computation tasks in the edge server may be time-varying, a deep Q network (DQN) based computation offloading strategy is proposed to minimize the operation overhead of the system by dynamically fine-tuning the ratio of time delay and energy consumption, which improves the robustness of the proposed algorithm.

System Model.
In the MEC system, in order to better serve users and improve the task processing capacity of the system, computing tasks can be offloaded to the MEC server for execution via the wireless channel according to practical situations [18]. As shown in Figure 1, the mobile users are indexed by $n \in \{1, 2, \dots, N\}$. The MEC server is deployed in the system and is connected to the base station of this cell. In the process of task offloading, the computation task of each user cannot be split and can only be offloaded as a whole. Meanwhile, the offloading strategy cannot be changed during this process.
Suppose that the wireless transmission channels between the users and the base station are indexed by $m \in \{1, 2, \dots, M\}$; each user can choose one of these channels to offload tasks. The offloading strategy of user $n$ is denoted as $a_n \in \{0, 1, \dots, M\}$. When $a_n = 0$, the user selects local computing; when $a_n > 0$, the user offloads the task to the MEC server for execution.
Assume that the computation-intensive task is $T_n = (d_n, c_n)$, where $d_n$ and $c_n$ denote the input data size of the task in Kb and the number of CPU cycles required to process the input data, respectively.
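The notation above can be mirrored in a few lines of Python. This is only an illustrative sketch: the names `Task`, `is_valid_decision`, and `offloading_users` are hypothetical, and users are indexed from 0 here rather than from 1 as in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """Computation-intensive task T_n = (d_n, c_n)."""
    d_kb: float      # input data size d_n, in Kb
    c_cycles: float  # CPU cycles c_n required to process the input data


def is_valid_decision(a: List[int], num_channels: int) -> bool:
    """Check that every a_n lies in {0, 1, ..., M}."""
    return all(0 <= a_n <= num_channels for a_n in a)


def offloading_users(a: List[int]) -> List[int]:
    """Return the (0-based) indices of users with a_n > 0, i.e. those that offload."""
    return [n for n, a_n in enumerate(a) if a_n > 0]
```

For example, with M = 2 channels the decision vector `[0, 2, 1, 0]` means that the first and last users compute locally while the other two offload over channels 2 and 1.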

Communication Model.
The communication model of the system includes the selection and assignment of the channel. When user $n$ is connected to the server $\Omega$, let $S^a_{n,\Omega} = 1$; user $n$ can then offload task $T_n$ to the server through the high-speed network. At this point, if a task needs to be offloaded, the server is required to allocate a certain amount of network bandwidth to the user, which is denoted as $B_{n,i}$. Since the offloading time for a task is very short, we assume that the bandwidth obtained by the task is not preempted during the offloading. When the task has been offloaded, the occupied bandwidth is released. The total bandwidth provided by the server $\Omega$ is $B_\Omega$. The server can choose among different bandwidth allocation methods $\beta \in C_\Omega$, such as fixed percentage allocation, fixed amount allocation, or allocation based on user payment criteria.
As the channel selection and allocation strategy is not the focus of this work, the previously proposed communication model is adopted, and the bandwidth allocation method is given for the case in which several users share a nonpreemptible channel. Namely, the bandwidth allocated to a user cannot be released until the user's data have been transmitted [19,20]. In addition, unlike the strategy that waits for the completion of a communication cycle before releasing the bandwidth, the occupied bandwidth is released immediately after the transmission is completed in this paper. It is shown in this paper that immediate bandwidth release helps to improve the bandwidth utilization of the system. Based on the above analysis, let the remaining bandwidth of the current server be $B^a_\Omega$, and let the offloading decision in each round allow $n$ users to offload their tasks. The bandwidth obtained by each user in this offloading process is then expressed in terms of the background noise $\delta_0$, the transmission power $p$, and the channel gain $g$ between the user equipment and the base station. The model has the following feature: if many users choose to offload tasks at the same moment, they cause strong mutual interference and reduce the transmission rate, which makes it challenging for the offloading algorithm to choose the combination of offloading user devices. For a task with an uplink data volume of $\overset{\frown}{d}_{n,i}$, the relationship between the data volume and the transmission time is then determined. On this basis, the communication model $C_\Omega$ of the server and the communication model $C_{n,i}$ of the user device are formulated.
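To make the interference effect concrete, the following sketch computes an uplink rate under an assumed Shannon-type model with same-channel interference, $r_n = B \log_2\bigl(1 + p_n g_n / (\delta_0 + \sum_{i \ne n,\, a_i = a_n} p_i g_i)\bigr)$. This form is a common choice in the multiuser offloading literature and is an assumption here, not the expression used in this paper; the function names and the sample values in the usage note are likewise illustrative.

```python
import math
from typing import Sequence


def uplink_rate(n: int, a: Sequence[int], p: Sequence[float], g: Sequence[float],
                bandwidth_hz: float, noise: float) -> float:
    """Assumed uplink rate of user n (bit/s) given the decision vector a.

    Users that picked the same channel as user n act as interference.
    A user with a[n] == 0 computes locally and has no uplink rate.
    """
    if a[n] == 0:
        return 0.0
    interference = sum(p[i] * g[i] for i in range(len(a))
                       if i != n and a[i] == a[n])
    sinr = p[n] * g[n] / (noise + interference)
    return bandwidth_hz * math.log2(1.0 + sinr)


def transmission_time(data_bits: float, rate_bps: float) -> float:
    """Time needed to upload data_bits at a constant rate."""
    return data_bits / rate_bps
```

With two users of equal power and gain, `uplink_rate(0, [1, 1], ...)` is lower than `uplink_rate(0, [1, 0], ...)`, which is the rate reduction described above when several users offload at the same moment.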

Local Execution.
When $a_n = 0$, user $n$ chooses to execute the computation-intensive task $T_n$ locally. Let $f_n$ be the computing capacity of user $n$. The time delay $t_1$ incurred when task $T_n$ is executed locally is given by equation (5), and the energy consumption $e_1$ for local execution is given by equation (6), in which $\chi_1$ is a constant. According to equations (5) and (6), the total overhead of local execution $\Theta_1$ can be expressed as the weighted sum $\Theta_1 = \omega_t t_1 + \omega_e e_1$ (7), where $\omega_t$ is the weight of time delay, $\omega_e$ is the weight of energy consumption, $0 \le \omega_t \le 1$, $0 \le \omega_e \le 1$, and $\omega_t + \omega_e = 1$.
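As a rough numerical illustration of the local overhead, the sketch below assumes the commonly used forms $t_1 = c_n / f_n$ and $e_1 = \chi_1 c_n$ (a constant energy cost per CPU cycle). These two forms are assumptions made for illustration and may differ from the paper's equations (5) and (6); the function name `local_overhead` is hypothetical.

```python
def local_overhead(c_n: float, f_n: float, chi_1: float,
                   w_t: float = 0.5, w_e: float = 0.5) -> float:
    """Weighted local overhead Theta_1 = w_t * t_1 + w_e * e_1.

    Assumed forms (not confirmed by the source):
      t_1 = c_n / f_n     delay: required cycles over local CPU frequency
      e_1 = chi_1 * c_n   energy: constant energy per CPU cycle
    """
    if not (0.0 <= w_t <= 1.0 and 0.0 <= w_e <= 1.0):
        raise ValueError("weights must lie in [0, 1]")
    if abs(w_t + w_e - 1.0) > 1e-9:
        raise ValueError("weights must satisfy w_t + w_e = 1")
    t_1 = c_n / f_n
    e_1 = chi_1 * c_n
    return w_t * t_1 + w_e * e_1
```

Setting `w_t=1.0, w_e=0.0` makes the overhead purely delay-driven, while `w_t=0.0, w_e=1.0` considers energy only.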

MEC Offloading Computation.
When $a_n > 0$, the user chooses to offload the task to the MEC server for execution. During the offloading process, time delay (TD) and energy consumption (EC) are generated in the following three steps: (1) the data of the computation task are transmitted through the wireless channel; (2) the computation task is executed at the MEC server; and (3) the computation result is returned to the user. When data are transmitted to the MEC server, the user selects the wireless channel, and the resulting time delay is determined by the uplink data transmission rate $\overset{\frown}{v}_n(a)$; the energy consumption incurred by the data transmission is calculated accordingly. When the task has been uploaded to the MEC server, it is computed using the computational resources of the server; the resulting time delay depends on the MEC server computing capacity $F_n$, and executing the task at the MEC server also consumes energy. Generally, the size of the result calculated by the MEC server is very small compared with the input data, so the TD and EC when returning the calculation result to the user can be ignored [21]. Thus, the total overhead $\Theta_2$ of offloading the computation task to the MEC server for execution combines the TD and EC of the first two steps (equation (11)). Based on equations (7) and (11), the overhead of each user can be expressed as
$$\Theta_n(a) = \begin{cases} \Theta_1, & a_n = 0, \\ \Theta_2, & a_n > 0. \end{cases}$$
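The offloading overhead and the per-user case distinction can be sketched as follows. The component forms are assumptions chosen for illustration (upload delay $d_n / v$, upload energy $p_n d_n / v$, server delay $c_n / F_n$, server energy $\chi_2 c_n$ with a hypothetical constant $\chi_2$); they are not taken from the paper's equations. The result-return step is ignored, as stated in the text.

```python
def offload_overhead(d_n: float, c_n: float, rate: float, p_n: float,
                     F_n: float, chi_2: float,
                     w_t: float = 0.5, w_e: float = 0.5) -> float:
    """Assumed Theta_2: weighted delay and energy of upload plus server execution."""
    t_up = d_n / rate        # assumed upload delay
    e_up = p_n * t_up        # assumed upload energy: power times upload time
    t_exec = c_n / F_n       # assumed execution delay on the MEC server
    e_exec = chi_2 * c_n     # assumed execution energy (chi_2 is hypothetical)
    return w_t * (t_up + t_exec) + w_e * (e_up + e_exec)


def user_overhead(a_n: int, theta_1: float, theta_2: float) -> float:
    """Theta_n(a): local overhead if a_n == 0, offloading overhead if a_n > 0."""
    return theta_1 if a_n == 0 else theta_2
```

Summing `user_overhead` over all users gives one possible reading of the total system overhead that the optimization objective seeks to minimize.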

Optimization Objectives.
The optimization objective of a multiuser MEC system is to minimize the time delay and energy consumption. Hence, it can be modeled as the minimization of the total system overhead subject to constraints in which $p_{\min}$ and $p_{\max}$ are the minimum and maximum values of the transmission power, respectively. This is a combinatorial optimization problem in a multidimensional discrete space. Reinforcement learning can therefore be applied, making use of the intelligence of mobile users, so that the mobile users reach mutually satisfactory offloading strategies.

Solutions
3.1. Reinforcement Learning. Reinforcement learning (RL) is an autonomous learning framework that implements experience-driven learning through interaction and is used to maximize the reward as an intelligent agent searches for the optimal behavior in a given state. As a branch of machine learning, reinforcement learning differs from supervised learning, in which training is based on the correct answers themselves [22,23]. In a standard RL model, the autonomous-learning agent interacts with the environment. The process of reinforcement learning is shown in Figure 2. At each time step $t$, a state $s(t)$ is first observed from the environment, then an action $\varphi(t)$ is executed based on the current state, after which a reward or punishment $r_t$ is fed back by the environment. Thereafter, the environment moves to another state $s(t+1)$, where the probability of moving to state $s(t+1)$ after performing action $\varphi(t)$ in state $s(t)$ is given by the state transition probability function $\sigma(s(t+1) \mid s(t), \varphi(t))$. This process continues, with the aim of maximizing the expected reward in the long run. Mathematically, reinforcement learning can be described as a Markov decision process in which the next state $s(t+1)$ depends only on $s(t)$ and $\varphi(t)$. Furthermore, the key point of reinforcement learning is to learn without knowledge of the underlying environment model. Reinforcement learning that cannot compute rewards before actions are selected and does not know the state transition probability function is referred to as model-free reinforcement learning [24]. Meanwhile, reinforcement learning uses an $\varepsilon$-greedy approach as the fundamental policy, where $\varepsilon$ is a probability value in $[0, 1]$. Each time an action is selected, with probability $\varepsilon$ the agent exploits the Q-table and chooses the action with the largest expected reward, and with probability $1 - \varepsilon$ it explores by performing a random action.
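The action-selection rule described above can be sketched in a few lines. Note that this follows the convention of the text, in which $\varepsilon$ is the probability of exploiting; many references use the opposite convention, where $\varepsilon$ is the exploration probability. The function name `select_action` and the optional `rng` argument are illustrative.

```python
import random
from typing import Optional, Sequence


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Return an action index: exploit with probability epsilon, else explore."""
    source = rng if rng is not None else random
    if source.random() < epsilon:
        # Exploit: pick the action with the largest estimated value.
        return max(range(len(q_values)), key=lambda i: q_values[i])
    # Explore: pick an action uniformly at random.
    return source.randrange(len(q_values))
```

With `epsilon=1.0` the rule is purely greedy, and with `epsilon=0.0` every action is chosen uniformly at random.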
Figure 3 demonstrates how the system states change over time. Assume that at time $t$, the reinforcement learning algorithm (taking DQN as an example) obtains an observation $o_t$ from the server state and makes an offloading action $\varphi_t$ based on the observation. The offloading action then affects the state of the user who receives the offloading permission, which in turn affects the state of the specific task to be offloaded. Once a task is in the offloading process, it undergoes state transitions such as transmission, arrival at the server cache, execution on the server, and completion of execution, which continuously affect the resources and state of the server throughout the transition process.

Systematic Action Transitions and Delayed Rewards.
When the task has been executed, the reward $r_t$ recorded by the system at this moment is returned to the decision-making algorithm for learning. It is clear from this description that at time $t$, the decision-making algorithm does not have access to the immediate reward for the current action but can only obtain the reward for an action taken at a previous moment.
This is a distinctive feature of the incomplete observation system and is a key point for the offloading model to meet the conditions for asynchronous decision-making [25,26]. Therefore, the cumulative reward of successive decisions constructed by the learning process is the key to determining whether the optimization objective of the system is satisfied.
Assume that the cumulative positive rewards of reinforcement learning (excluding punitive rewards) are equal to the optimization objective $\Psi$ of the system. The optimization objective is then an upper bound on the cumulative reward, which is expressed in terms of the punitive rewards $r_t$. Computation offloading in the edge environment is a complex process of continuous decision-making. A model-free reinforcement learning method, temporal-difference learning, is therefore applied; it combines Monte Carlo sampling with the bootstrapping of dynamic programming and usually leads to better learning performance and efficiency. Here, a loss function is defined on the model output.

Wireless Communications and Mobile Computing
In the loss function, $y$ and $y'$ are the real value and target value of the model output, respectively. The proposed learning method runs on an offloading model on the edge server and mainly proceeds as follows:
(1) At the beginning of learning, the server starts the offloading process and, if the task has been executed, updates the reward value of the latest task for that user to $r_t = T^L_{n,i} / \varepsilon_{n,i}$.
(2) If the system is overloaded, it updates the reward value of the latest task for that user to $r_t = -|r_{t-1}|$.
(3) The server state $s^t_\Omega$ and the list of users $\varphi^t_\Omega$ that can offload in this decision are obtained from the action space. The server then sends offloading notifications to the users in the list.
(4) The server reads the latest reward value of the task and feeds the triplet of state, action, and reward to the reinforcement learning algorithm as training input.
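The reward rules in steps (1) and (2) can be sketched as a small helper. The quantities $T^L_{n,i}$ and $\varepsilon_{n,i}$ are not defined in this section, so the sketch treats them as opaque positive numbers (`t_l` and `eps`). Letting the overload penalty take priority, and keeping the previous reward when neither rule applies, are assumptions made for illustration.

```python
def update_reward(prev_reward: float, overloaded: bool, task_done: bool,
                  t_l: float = 0.0, eps: float = 1.0) -> float:
    """Return the new reward r_t for the latest task of a user.

    overloaded -> r_t = -|r_{t-1}|   (punitive reward, step 2)
    task_done  -> r_t = t_l / eps    (reward of an executed task, step 1)
    otherwise  -> previous reward is kept (assumption)
    """
    if overloaded:
        return -abs(prev_reward)
    if task_done:
        return t_l / eps
    return prev_reward
```

The returned value would then be stored together with the state and action to form the training triplet mentioned in step (4).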

DQN-based Offloading Strategy.
To better evaluate the effect of the policies on action selection, the value function of states and actions is converted into a recursive form, in which $\Phi_t$ is the value of cost and $\zeta$ is the discount coefficient.
In conventional Q-learning algorithms, the number of states in the environment is generally assumed to be relatively small, and therefore a look-up table is used to record the state-action pairs $(s, \varphi)$. However, since the number of states in the constructed MEC network is very large, it would be computationally expensive to keep updating the Q function iteratively with the Q-learning algorithm. DQN is therefore adopted to address the problem. The DQN algorithm is a typical value-based algorithm that takes the state $s_t$ of the current network environment as the input of a deep network called the estimated value network. The output of the estimated value network is $Q(s_t, \varphi)$, $\forall \varphi \in A$, i.e., one Q value for each action. Then, a greedy algorithm, i.e., $\varepsilon$-greedy, is used to select the action $\varphi_t$. Next, the user performs the action $\varphi_t$, the network environment moves to the next state $s_{t+1}$, and the value of cost $\Phi_t$ is generated. Based on this value, the parameters of the estimated value network are updated, and after many update iterations the estimated value network is trained to output the optimal Q function $Q(s, \varphi)$. The loss function of the estimated value network is defined as the mean square error between the network output and $Y_t$, where $\theta$ denotes the weight parameter of the estimated value network and $Y_t$ represents the target value (the value of the optimization objective) for the estimated value network. However, if an identical deep neural network is used to obtain the target value, the target output of the network also changes as the parameters are updated; that is, the label changes during training, which is obviously unreasonable [27]. Therefore, it is necessary to introduce another neural network, named the target value network, whose structure is exactly the same as that of the estimated value network.
The only difference between them is that the parameter $\theta^-$ of the target value network is not updated at each timeslot but is copied from the parameters of the estimated value network after every $K$ steps of training. Namely, the parameter of the target value network is updated $K$ steps more slowly than that of the estimated value network, and the target value $Y_t$ is computed with the target value network. In addition, the samples of training data are independent of each other in supervised learning, whereas the states of the MEC network are consecutive in time, which affects the reliability of training to some extent. Therefore, an experience replay unit (ERU) is introduced in the DQN, and all the samples coming from the interaction between the environment and the agent are stored in the memory of the ERU in the form of four-tuples $(s_t, \varphi_t, \Phi_t, s_{t+1})$, where $s_{t+1}$ is the next state after state $s_t$. In the training phase, one sample packet is randomly drawn from the ERU at a time, and the size of the packet can be set arbitrarily within the maximum number of samples, so that the temporal correlation between samples is broken, making the samples independent and increasing the generalization ability of deep learning. The pseudocode of the DQN-based offloading algorithm is shown in Algorithm 1.
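The two stabilizing mechanisms described above, the experience replay unit and the slowly updated target value network, can be sketched without any deep learning library. The sketch assumes a cost-minimizing target of the form $Y_t = \Phi_t + \zeta \min_{\varphi'} Q(s_{t+1}, \varphi'; \theta^-)$; a reward-maximizing formulation would use a maximum instead, and the source does not state which form the paper uses. All class and function names are hypothetical, and `q_target` stands for any callable that returns one Q value per action.

```python
import random
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

# One stored sample: (s_t, phi_t, Phi_t, s_{t+1})
Transition = Tuple[tuple, int, float, tuple]


class ExperienceReplayUnit:
    """Fixed-capacity memory; the oldest samples are dropped when it is full."""

    def __init__(self, capacity: int) -> None:
        self.memory: Deque[Transition] = deque(maxlen=capacity)

    def store(self, s: tuple, phi: int, cost: float, s_next: tuple) -> None:
        self.memory.append((s, phi, cost, s_next))

    def sample(self, batch_size: int, rng=random) -> List[Transition]:
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.memory), batch_size)


def td_targets(batch: Sequence[Transition],
               q_target: Callable[[tuple], Sequence[float]],
               zeta: float) -> List[float]:
    """Assumed targets Y = Phi + zeta * min_a Q_target(s_next, a)."""
    return [cost + zeta * min(q_target(s_next)) for (_, _, cost, s_next) in batch]


def mse_loss(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean square error between target values and estimated Q values."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def should_sync(step: int, k: int) -> bool:
    """True when the target network should copy the estimated network (every K steps)."""
    return step % k == 0
```

In a full implementation, `mse_loss` would be minimized with respect to the weights of the estimated value network, and `should_sync` would trigger the copy of those weights into the target value network.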

Experiments and Analysis
The simulation platform is MATLAB 2016a. The computer used in the simulation has an i7-7200U CPU and 4 GB of memory. In the experiments, the simulation scenario is assumed as follows: the bandwidth is $B = 5$ MHz, the computing capacity of the MEC server is $F = 12$ GHz/sec, and the computing capacity of each mobile user is $f = 5$ GHz/sec. The transmission power of the mobile users lies between $p_{\min}$ and $p_{\max}$, where $p_{\min} = 150$ mW and $p_{\max} = 300$ mW. The data size of the offloaded computation tasks is assumed to follow a uniform distribution between 3000 and 5000 Kb. Besides, the decision weights are set as $\omega_t = \omega_e = 0.5$. As shown in Figure 4, under the DQN-based offloading strategy, the average system cost of all three curves decreases rapidly until convergence. When $(N, M) = (15, 15)$, the average cost converges to the lowest value after about 11,000 iterations; when $(N, M) = (15, 20)$, the average cost converges to a stable value after about 9,000 iterations, and this value approximates the value in the case of an equal number of users and edge servers. When $(N, M) = (20, 15)$, the average cost converges after about 13,000 iterations to a larger value than that in the other two cases. Hence, it is confirmed that the proposed strategy allows the system cost to gradually decrease and converge to a stable value regardless of the relationship between the number of users and the number of edge servers.
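For readers who want to reproduce a comparable setup outside MATLAB, the stated parameters can be collected in a small configuration object. The sketch below is in Python rather than the MATLAB used by the authors; drawing the transmission power uniformly from $[p_{\min}, p_{\max}]$ is an assumption (the text only gives the range), and the names `SimConfig` and `sample_users` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SimConfig:
    bandwidth_mhz: float = 5.0        # B
    mec_capacity_ghz: float = 12.0    # F
    local_capacity_ghz: float = 5.0   # f
    p_min_mw: float = 150.0
    p_max_mw: float = 300.0
    data_min_kb: float = 3000.0
    data_max_kb: float = 5000.0
    w_t: float = 0.5                  # weight of time delay
    w_e: float = 0.5                  # weight of energy consumption


def sample_users(cfg: SimConfig, n_users: int, seed: int = 0) -> List[Tuple[float, float]]:
    """Draw (data size in Kb, transmission power in mW) for each user.

    Data size is uniform on [data_min_kb, data_max_kb] as stated in the text;
    the uniform draw of the power is an assumption.
    """
    rng = random.Random(seed)
    return [(rng.uniform(cfg.data_min_kb, cfg.data_max_kb),
             rng.uniform(cfg.p_min_mw, cfg.p_max_mw))
            for _ in range(n_users)]
```

Calling `sample_users(SimConfig(), 15)` yields one reproducible draw for the $N = 15$ scenario; changing `seed` gives a different draw.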

Comparison with Other Algorithms.
The proposed algorithm is also compared in the simulation with the algorithms in References [8], [10], and [15].

Comparison of the Number of Terminals and Average Delay.
The relationship between the number of user terminals and the average delay for different algorithms is illustrated in Figure 5.
It can be seen in Figure 5 that since the algorithm proposed in Reference [8] does not consider a collaboration mechanism, the load on the computing server rises as the number of user terminals increases, leading to an overall rise in the user task delay. Therefore, its average user delay is the largest. The algorithm proposed in Reference [10] adopts edge-cloud collaboration, and when the number of user terminals increases, the MEC server can transmit the tasks that cannot be processed in time to the cloud server for execution. Although edge-cloud collaboration can reduce the task delay, the average user delay is still relatively high because the cloud server is far away from the user, which increases the transmission delay. The algorithm proposed in this paper and that in Reference [15] both use reinforcement learning to design the offloading strategy. However, the proposed algorithm constructs a multiuser MEC model to offload computation tasks as quickly as possible or execute them locally, so its delay is the smallest, and the average delay does not exceed 0.9 s when the number of terminals is 40.

Relationship between Computing Capacity of MEC Servers and Maximum User Delay.
The effect of the computing capacity of MEC servers on the maximum user delay under different algorithms is shown in Figure 6.
Figure 6 shows that the user task delay becomes smaller as the MEC computing capacity increases. The proposed algorithm also has a smaller delay than the other three algorithms, and its maximum user delay is about 0.6 s when the computing capacity of MEC servers reaches 16 GHz/sec. This is because when the computing capacity of MEC servers is low, the servers collaborate with each other to balance their load and reduce the task delay. As the computing capacity grows, the advantages of edge-cloud collaboration are no longer obvious, and the performance of the algorithm proposed in Reference [8] gradually approaches the curve of Reference [10].
Figure 7 illustrates the average system overhead for different numbers of mobile users, where the number of channels is set to 12 and the computing capacity of MEC servers is set as $F = 12$ GHz/sec. As shown in Figure 7, the average system overhead of all four algorithms increases as the number of users grows. In Reference [10], some of the computation tasks are offloaded to the cloud computing center for execution, and the cloud computing center is far away from users; as a result, the TD and EC increase significantly, and its computation overhead is the highest. Heuristic algorithms are applied in Reference [8] for system optimization, but their ability to search for the optimal solution is weaker than that of deep learning networks. Meanwhile, as the number of mobile users increases, the resources that the system can provide in the process of task offloading are limited, so the competition for these resources is intense, which can cause an increase in the system delay and energy consumption. In such an environment of intense competition for resources, the DQN strategy of the proposed algorithm has a more obvious advantage over the metareinforcement algorithm proposed in Reference [15]. The proposed algorithm realizes optimal task offloading and resource allocation through the powerful optimization ability of DQN. Its total system overhead is less than 120, and it can dynamically fine-tune the delay and energy consumption according to actual needs.

Relationship between Average System Energy Consumption and Training Rounds.
When $\omega_t = 0$ and $\omega_e = 1$, the optimization objective considers only the energy consumption of the whole system and neglects the system delay. Under such circumstances, the energy consumption performance of the system is shown in Figure 8.
From Figure 8, it can be seen that the average system energy consumption of the four algorithms tends to be stable as the number of training rounds increases, but the proposed algorithm has the smallest average system energy consumption, which tends to 65 J when the number of training rounds exceeds 19. The proposed algorithm uses DQN to construct an offloading strategy in which the system optimization search is accelerated by system action transitions with delayed rewards, and it introduces the MEC system model, leading to smaller overall energy consumption. The meta-reinforcement strategy proposed in Reference [15] is computationally complex and lacks a reasonable system architecture, so its energy consumption increases. In Reference [10], the task offloading is executed based on the cloud-edge collaborative architecture, but its stable energy consumption is more than 100 J due to the long distance between cloud computing centers and the users. Reference [8] has less computational overhead due to its low complexity. However, its time delay is large, so the overall offloading is not effective.

Algorithm 1: DQN-based offloading algorithm.
Begin
(1) Empty the storage area of the ERU
(2) Initialize the weight parameters $\theta$ of the estimated value network, and set the parameters of the target value network $\theta^- = \theta$
(3) Initialize the state $s$
(4) For $t = 1 : 1 : T$
(5) do
(6) Under the $\varepsilon$-greedy algorithm, select an action $\varphi_t$ based on the state $s_t$
(7) Execute the action $\varphi_t$ and observe the system cost $\Phi_t$ and the next state $s_{t+1}$
(8) Collect the sample $(s_t, \varphi_t, \Phi_t, s_{t+1})$ and store it in the ERU
(9) If the number of samples is larger than the size of the sample packet, then draw a sample packet from the ERU

Conclusion
To alleviate the network load and reduce the risk of network congestion in traditional cloud computing for IoT, MEC is introduced to formulate the multiuser computing offloading problem, where the optimization objective is to minimize the total weighted overhead of time delay and energy consumption so as to ensure reasonable system resource allocation. The optimization problem is then solved using a DQN-based offloading strategy to obtain the optimal scheme. The results based on the simulation platform show that: (1) The introduction of MEC, which enables computation tasks to be executed in close proximity to the users, can reduce system time delay and energy consumption. The average delay of the proposed algorithm does not exceed 0.9 s, and the average energy consumption tends to 65 J when the number of user terminals is 40. (2) The proposed algorithm can execute computational tasks as close to the users as possible by adopting the MEC system, enabling it to reduce the total system overhead to a great extent. The total overhead is lower than 120 when the number of users is 50.
As mobile user devices have very limited resources and suffer from battery aging, the energy available to the users is in some cases not enough for them to complete the whole offloading procedure. Therefore, how to replenish the energy of mobile users to ensure that offloading is not interrupted deserves deep study, which will also be the focus of our future research.

Data Availability
The data included in this paper are available from the author upon request without any restriction.