Resource Allocation Strategy Using Deep Reinforcement Learning in Cloud-Edge Collaborative Computing Environment

With the development of technologies such as IoT and 5G, the exponential growth in the amount of new data has placed more stringent requirements on ultrareliable and low-delay communication of services. To better meet these requirements, a resource allocation strategy using deep reinforcement learning in a cloud-edge collaborative computing environment is proposed. First, a collaborative mobile edge computing (MEC) system model, which combines the core cloud center with MEC to improve the network interaction ability, is constructed. The communication model and computation model of the system are considered at the same time. Then, the goal of minimizing system delay is modeled as a Markov decision process, and it is solved by using the deep Q network (DQN) improved by hindsight experience replay (HER), so as to realize resource allocation with the minimum system delay. Finally, the proposed method is analyzed on a simulation platform. The results show that when the number of user terminals is 80, the maximum user delay is 1150 ms, which is better than other comparison strategies, and that the method can effectively reduce the system delay in complex environments.


Introduction
In recent years, mobile devices have become essential tools in our daily life for communication, socializing, and entertainment. The demand for mobile computing continues to escalate, resulting in an explosion in the number of mobile devices [1]. However, the battery capacity and computational resources of mobile devices are not sufficient to meet users' needs, so cloud computing, which allows computation tasks to be transmitted from mobile devices to servers with more computing capacity, has developed rapidly to address the limited resources of mobile devices [2]. Mobile cloud computing (MCC) combines the advantages of both mobile computing and cloud computing to further address this problem [3]. Meanwhile, MCC enables the use of data from the Internet to extend the capabilities of mobile devices that are limited in computing, communication, and caching, and to extend the effective working time within the limit of battery life [4].
Nowadays, the main implementation method is to offload tasks such as computing and storage to a remote public cloud platform [5]. However, the traditional architecture of MCC is also facing new challenges. On the one hand, users have to interact with data centers when using mobile applications, and network delay has a significant impact on some delay-sensitive applications depending on the relative distance between the user and the cloud data center. On the other hand, because all data interactions generated by applications must be carried out through the core network, there will be great pressure on the core network during network peak hours [6]. In order to solve the problem of high delay and alleviate the network pressure, mobile edge computing (MEC) technology came into being. The core idea of MEC is to decentralize part of the computing and storage capacity of the data center in MCC to the edge of the network, i.e., close to the user. Hence, the data-processing requirements generated by mobile applications can be executed by the MEC server located at the edge of the local network, and results can be returned without going through the core network and data center [7].
A resource allocation strategy using deep reinforcement learning in a cloud-edge collaborative computing environment is proposed to address the limitation of hardware resources such as server storage and computing capacity, as well as the nonuniformity of user distribution, both of which lead to problems such as higher network delay. Compared with traditional MEC system strategies, the main contributions of this paper are as follows: (1) In order to make full use of the limited capacity of MEC servers while reducing the delay of user-service interaction, a collaborative MEC system model is proposed, which not only greatly relieves the pressure on the core network but also reduces the network delay of applications through allocation between the cloud computing center and MEC. (2) To address the sparse reward problem caused by complex environments, a special experience replay method, named hindsight experience replay (HER), is introduced to give certain rewards to actions that do not reach the target state as well, so as to improve the learning efficiency of agents and guide them in the correct learning direction in order to achieve a global load-optimal strategy. The remaining sections of this paper are arranged as follows: Section 2 introduces the related research in this field. Section 3 introduces the system model and optimization objectives. Section 4 introduces the resource allocation strategy based on deep reinforcement learning. In Section 5, experiments are designed to verify the performance of the proposed strategy. Section 6 is the conclusion.

Related Works
Computation offloading is an essential part of MEC, and the key point is how offloading decisions are designed. A flexible and effective strategy for allocating network computational resources is an important guarantee for the efficient operation of MEC [8]. In this regard, researchers have already conducted a number of studies. In [9], a migration strategy of cloud collaborative computing was designed. Simulations verified the feasibility of selecting the optimal migration strategy based on task division. However, the overall computational efficiency still needs to be improved. Gao et al. proposed a hierarchical multi-agent optimization algorithm that combines the genetic algorithm (GA) and multi-agent optimization [10]. This algorithm uses an improved GA to find a set of service nodes for deploying the requested tasks, so as to optimize resource utilization and bandwidth cost in cloud computing. The algorithm pays more attention to the resource utilization of cloud computing, and cloud-edge collaboration is not fully considered. In [11], in order to realize effective edge learning on heterogeneous edges with resource constraints, an online learning strategy was proposed based on the budget-constrained multi-armed bandit model, which improves the computational efficiency in cloud computing environments. However, the performance of bandwidth optimization during resource allocation should be enhanced. A resource allocation strategy for simultaneous wireless information and power transfer relay systems under general interference was studied in [12]. A resource allocation coordinate system was established using the h 2 method to divide the energy harvesting area and the DF area. The optimal value was obtained by applying the golden section method, but the overall communication delay still needs to be improved. A stochastic mixed integer nonlinear programming model was formulated in [13] to jointly optimize task offloading decisions, flexible computational resource scheduling, and radio resource allocation. The proposed problem was decomposed into four separate subproblems using Lyapunov optimization theory and solved by a convex decomposition method and a matching game. A low-probability-of-intercept-based collaborative power and bandwidth allocation strategy for multi-target tracking was proposed in [14]. However, the algorithm is biased toward the optimal allocation of bandwidth resources, and further improvement is required for computational resource allocation. By defining a reward function for each user that takes into account the aggregate effects of the other users, Liu et al. developed a multi-agent reinforcement learning framework based on independent learners to solve the resource allocation problem in the cloud computing environment [15]. However, this algorithm has yet to be further studied for collaborative cloud-edge resource optimization in some complex situations.

System Model and Optimization Objective
3.1. System Scenario. Traditional cloud computing, whether or not distributed storage and computing are used, is still centralized at a macro level, and all service requests still need to be transmitted to the cloud for execution. Moreover, the cloud data center decides how to execute tasks according to its own optimization strategy, which makes it difficult for traditional cloud computing to adapt to changes in users' requests over time, space, and other factors. MEC can better deal with this problem. The core idea of MEC is to offload part of the computing and storage capacity of the data center in MCC to the edge of the network, i.e., close to the user. Data-processing requirements generated by mobile applications can be executed by the MEC server, and the results can then be returned directly at the edge of the local network without going through the core network and data centers, which not only greatly relieves the pressure on the core network but also significantly reduces the network delay for applications [16]. In addition, as MEC servers usually serve only the area for which they are responsible, they can be better integrated with local information [17]. A typical collaborative MEC scenario is shown in Figure 1, consisting of a core cloud data center and M sub-MEC servers.
As shown in Figure 1, K services are running in the core cloud data center, while each MEC server is assigned one area of task requests. The backhaul network connects each area to the core cloud data center, and users connect to the core cloud data center to access services via the backhaul network. To better serve the users in each area and to reduce the interaction delay and the pressure on the backhaul network, services in the core cloud data center are offloaded to the MEC servers within their capacity, so that users in that area only need to connect to the MEC servers for services. In other words, if the service requested by a user in a certain area is deployed on its local MEC server, only the switching delay produced by the device when connecting to the local MEC server is incurred [18]. Otherwise, besides communicating with the local MEC server, the user also needs to reach, via the local MEC server, the MEC server on which the service has most recently been deployed in order to obtain it. Therefore, the delay is correspondingly higher [19].

Communication Model.
In the MEC system model, each user terminal can offload its computation tasks to the MEC server or execute them locally. Each user is associated with the nearest base station, and the MEC server which is connected to that base station can be referred to as the user's local MEC server. When a user terminal offloads a task to the MEC server, the user needs to transmit the task to the base station associated with it via a wireless link.
Define $U_m$ as the set of user terminals associated with base station $m$, and define $B_m$ as the bandwidth of base station $m$. When multiple user terminals choose to offload computation tasks at the same time, the wireless bandwidth is evenly allocated among the offloading user terminals to upload data. The uplink transmission rate $v_{n,m}$ of the wireless link of user terminal $n$ is then determined by its bandwidth share together with the following quantities: $p_n$ is the transmission power of user terminal $n$ to upload data through the wireless link, $h_{n,m}$ is the channel gain between user terminal $n$ and base station $m$, and $\delta_0$ is the power spectral density of white noise.
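As a rough illustration of the communication model, the sketch below computes an uplink rate using the common Shannon-capacity form with an equal bandwidth share. This specific form, the function name `uplink_rate`, and all numeric values are assumptions made for illustration; the text only fixes the quantities $B_m$, $p_n$, $h_{n,m}$, and $\delta_0$.

```python
import math


def uplink_rate(bandwidth_hz, n_offloading, tx_power_w, channel_gain, noise_psd_w_per_hz):
    """Shannon-type uplink rate (bit/s) of one offloading terminal.

    Assumptions (not taken from the paper): the base-station bandwidth B_m
    is split evenly among the terminals that offload at the same time, and
    the noise power equals the white-noise power spectral density times the
    terminal's bandwidth share.
    """
    if n_offloading < 1:
        raise ValueError("at least one terminal must be offloading")
    share = bandwidth_hz / n_offloading  # evenly allocated bandwidth
    snr = tx_power_w * channel_gain / (share * noise_psd_w_per_hz)
    return share * math.log2(1.0 + snr)


# Hypothetical values: 10 MHz cell shared by 4 offloading terminals.
rate = uplink_rate(10e6, 4, 0.1, 1e-10, 4e-21)
```

Under this form, adding more simultaneously offloading terminals shrinks each terminal's bandwidth share and therefore its rate, which is why the offloading decisions of different users are coupled.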

Computation Model.
For each user terminal, there are three types of execution of computation tasks, which are local execution at the user terminal, execution at the user's local MEC server, and execution at the user's neighboring MEC server.

Local Execution Model.
When a user terminal $n$ chooses to execute a computation task $\varphi_n$ locally, let $f^{L}_{n}$ denote the local computing capacity of user terminal $n$, i.e., its CPU frequency. The time delay $T^{L}_{n}$ for executing the computation task $\varphi_n$ locally can then be calculated as follows:

$$T^{L}_{n} = \frac{C_n}{f^{L}_{n}}, \tag{2}$$

where $C_n$ represents the number of CPU cycles required to execute the computation task $\varphi_n$.

Local MEC Server Computing Model.
When a user terminal $n$ chooses to execute a computation task $\varphi_n$ at the local MEC server, the user first needs to offload the computation task $\varphi_n$ to its associated base station $m$ via the wireless link, and a corresponding transmission delay is generated when uploading the computation task. According to the communication model, the uplink transmission delay $T^{t}_{n,m}$ of user $n$ can be calculated as follows:

$$T^{t}_{n,m} = \frac{D_n}{v_{n,m}}, \tag{3}$$

where $D_n$ represents the amount of input data for the computation task $\varphi_n$. Next, the MEC server in the base station allocates computational resources to execute the task, which results in the processing delay of the computation task. Let $f_{n,m}$ denote the computational resources allocated by base station $m$ to user terminal $n$; then, the processing delay $T^{c}_{n,m}$ for the MEC server to execute this task can be written as follows:

$$T^{c}_{n,m} = \frac{C_n}{f_{n,m}}. \tag{4}$$

Finally, the results are returned to user terminal $n$ from the MEC server. Generally, the amount of data returned after processing is small and the data rate of the downlink is high, so the delay due to returning the results can be ignored. From the above analysis, the total delay $T_{n,m}$ when the computation task $\varphi_n$ of user terminal $n$ is executed at the local MEC server can be formulated as follows:

$$T_{n,m} = T^{t}_{n,m} + T^{c}_{n,m} = \frac{D_n}{v_{n,m}} + \frac{C_n}{f_{n,m}}. \tag{5}$$

Neighborhood MEC Server Computing Model.
When a user terminal $n$ chooses to execute a computation task $\varphi_n$ at a neighboring MEC server $g$, it first needs to offload the computation task $\varphi_n$ to its associated base station $m$ via a wireless link. Next, base station $m$ transmits the task to base station $g$ via a wired link, and then the MEC server in base station $g$ allocates the corresponding computational resources to execute the task. Finally, the MEC server returns the results to user terminal $n$. Based on the above process, the total delay for executing the computation task at the neighboring MEC server $g$, denoted $T_{n,g}$, mainly consists of the uplink transmission delay $T^{t}_{n,m}$, the transmission delay $T^{t}_{m,g}$ between base stations, and the processing delay $T^{c}_{n,g}$ in the MEC server. As in the local MEC server computing model, the delay incurred by returning the results to the user can be ignored. As the uplink transmission delay $T^{t}_{n,m}$ is calculated in the same way as for the local MEC server, it is not repeated here. Let $f_{n,g}$ denote the computational resources allocated to user terminal $n$ by base station $g$, and let $v_{m,g}$ denote the transmission rate between base station $m$ and base station $g$. The processing delay $T^{c}_{n,g}$ in the MEC server can be formulated as follows:

$$T^{c}_{n,g} = \frac{C_n}{f_{n,g}}. \tag{6}$$

The transmission delay $T^{t}_{m,g}$ between base stations can be calculated as follows:

$$T^{t}_{m,g} = \frac{D_n}{v_{m,g}}. \tag{7}$$

Hence, the total delay $T_{n,g}$ of user terminal $n$ when offloading the computation task to the neighboring MEC server $g$ can be calculated as follows:

$$T_{n,g} = T^{t}_{n,m} + T^{t}_{m,g} + T^{c}_{n,g}. \tag{8}$$

Further, the time delay of user terminal $n$ can be obtained as follows:

$$T_n = \alpha_n T^{L}_{n} + \beta_n T_{n,m} + \sum_{g \in M_n} \gamma_{n,g} T_{n,g}, \tag{9}$$

where $\alpha_n$, $\beta_n$, and $\gamma_{n,g}$ are the weights of the local delay, the total delay when the task is executed at the local MEC server, and the total delay when the task is executed at the neighboring MEC server $g$, respectively, and $M_n$ is the set of neighboring base stations of user terminal $n$.
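The three execution modes of the computation model can be compared numerically with a short sketch. The function names and the example values below are hypothetical, and the per-user delay is assumed to be the weighted sum of the three mode delays, with the weights acting as selectors of the chosen mode.

```python
def local_delay(cycles, f_local):
    """Delay of local execution: required CPU cycles over local CPU frequency."""
    return cycles / f_local


def local_mec_delay(data_bits, cycles, rate_up, f_mec):
    """Uplink transmission delay plus processing delay at the local MEC server."""
    return data_bits / rate_up + cycles / f_mec


def neighbor_mec_delay(data_bits, cycles, rate_up, rate_wired, f_neighbor):
    """Uplink delay, wired base-station-to-base-station delay, and processing delay."""
    return data_bits / rate_up + data_bits / rate_wired + cycles / f_neighbor


def user_delay(alpha, beta, gamma, t_local, t_mec, t_neighbors):
    """Weighted delay of one user.

    gamma and t_neighbors are dicts keyed by neighboring base-station id.
    """
    neighbor_part = sum(gamma[g] * t_neighbors[g] for g in gamma)
    return alpha * t_local + beta * t_mec + neighbor_part


# Hypothetical task: 1 Mbit of input data requiring 1e9 CPU cycles.
t_l = local_delay(1e9, 1e9)                          # 1 GHz terminal
t_m = local_mec_delay(1e6, 1e9, 1e7, 5e9)            # 10 Mbit/s uplink, 5 GHz share
t_g = neighbor_mec_delay(1e6, 1e9, 1e7, 1e8, 1e10)   # 100 Mbit/s wired link
chosen = user_delay(0, 0, {2: 1}, t_l, t_m, {2: t_g})  # offload to neighbor 2
```

In this made-up setting the neighboring server wins despite the extra wired hop, because it can allocate more computational resources to the task.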

Optimization Objective.
The objective of this work is to minimize the maximum task delay among all users of the system within the computational resource constraints of the MEC servers, thus ensuring fairness among users. Therefore, it can be formulated as a joint optimization problem of the offloading strategy $\{\alpha, \beta, \gamma\}$ and the computational resource allocation $f = \{f_{n,m}, n \in N, m \in M\}$ for all user terminals, i.e., a min-max problem over the user delays subject to constraints C1 to C4, where $M_n = M \setminus \{m\}$ represents the set of adjacent base stations of user terminal $n$. Constraints C1 and C2 are restrictions on the offloading strategy of each user terminal. Constraint C3 specifies that the sum of the computational resources allocated to users cannot exceed the computing capacity of the MEC server. Constraint C4 ensures that the allocated computational resources cannot be negative. As there are both discrete and continuous variables in this optimization problem and the objective function is a min-max model, the optimization problem can be regarded as a mixed integer nonlinear programming (MINLP) problem, which is NP-hard and non-convex. The complexity of solving such problems is very high, and no polynomial-time algorithm is known for them in general.
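One way to write this min-max problem explicitly is sketched below. The binary form of C1, the single-choice form of C2, and the symbol $F_m$ for the computing capacity of MEC server $m$ are assumptions inferred from the descriptions of the constraints, not a verbatim restatement; the neighbor-offloading weight is written $\gamma_{n,g}$ and the delay of user terminal $n$ is written $T_n$.

```latex
\begin{aligned}
\min_{\alpha,\,\beta,\,\gamma,\,f}\ \max_{n \in N}\quad & T_n \\
\text{s.t.}\quad
& \text{C1: } \alpha_n,\ \beta_n,\ \gamma_{n,g} \in \{0, 1\},
  && \forall n \in N,\ g \in M_n, \\
& \text{C2: } \alpha_n + \beta_n + \sum_{g \in M_n} \gamma_{n,g} = 1,
  && \forall n \in N, \\
& \text{C3: } \sum_{n \in N} f_{n,m} \le F_m,
  && \forall m \in M, \\
& \text{C4: } f_{n,m} \ge 0,
  && \forall n \in N,\ m \in M.
\end{aligned}
```

Under these assumptions the binary variables select exactly one execution mode per user, while the continuous variables $f_{n,m}$ divide each server's capacity, which is what makes the problem mixed-integer.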

Markov Decision Process. A Markov decision process (MDP) is a mathematical model of sequential decision making that combines a Markov process with deterministic dynamic programming, and it can be used to model the policy and reward that the agent can achieve in a Markovian environment. The Markov property requires that the next state of the system depends only on the current state and not on any earlier states. Therefore, a state is said to have the Markov property when the current state alone, without historical information, determines the next or future state. If each state in a random state sequence has the Markov property, then the random sequence can be regarded as a Markov process. An MDP is a Markov process that takes actions and returns into account [20,21]. The MDP is constructed according to the environment and the agent and can be represented by a four-tuple $(S, A, P, R)$. $S$ is the state space, representing the set of states of the environment. $A$ is the action space, representing the set of actions that can be performed by agents. $P$ is the state transition probability, representing the probability distribution over next states after a given action is selected in the current state. $R$ is the reward function, which gives the reward value obtained by the agent when selecting an action and moving to the next state; the reward value indicates whether it is suitable to select this action in the current state. Additionally, the discount factor $\lambda$ ($\lambda \in [0, 1]$) is introduced, which is one of the parameters for calculating the cumulative reward. When $\lambda$ tends to 0, the future reward is not important and only the current reward needs to be considered. When $\lambda$ tends to 1, the future reward is very important, so both the future reward and the current reward must be taken into consideration.
Mathematically, the interaction between the agent and the environment over a discrete time series $T$ is usually described as a Markov process. At each moment $t$ in $T$, the agent is in a state and needs to select an action; these actions constitute the set $A = \{a_0, a_1, \ldots\}$. A reward value is obtained whenever an action is selected, and these values constitute the set $R = \{r_0, r_1, \ldots\}$.
In an MDP, a policy is a mapping from the current state to the probabilities of selecting each action in the action space, indicating the action that the agent should perform based on the current state.
There are two main types of policies. The first is the deterministic policy, under which the agent can select only one action to perform in a given state. The second is the stochastic policy, under which the agent may perform one of several actions in a given state. Hence, the policy $\pi$ can be defined as the probability that the agent selects action $a_t$ in state $s_t$ at moment $t$, which can be written as follows:

$$\pi(a_t \mid s_t) = P(a_t \mid s_t), \tag{11}$$

where $\pi(a_t \mid s_t) \ge 0$ and $\sum_{a_t \in A} \pi(a_t \mid s_t) = 1$. The state-value function is the expectation of the cumulative reward $R_t$ obtained by following the policy $\pi$ from the current state $s_t$. It is mainly used to evaluate each state under a given policy $\pi$. Therefore, the state-value function is associated with the policy and can be defined as $Q^{\pi}(s_t)$, the expected sum of the future reward values, each weighted by a power of the discount factor $\lambda$, when all actions are selected according to policy $\pi$ starting from the current state $s_t$. $Q^{\pi}(s_t)$ can be calculated as follows:

$$Q^{\pi}(s_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \lambda^{k} r_{t+k} \,\middle|\, s_t\right]. \tag{12}$$

Corresponding to the state-value function, the state-action value function represents the expectation of the cumulative reward $R_t$ when the action $a_t$ is selected in the current state $s_t$ and policy $\pi$ is followed thereafter. $Q^{\pi}(s_t, a_t)$ can be formulated as follows:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \lambda^{k} r_{t+k} \,\middle|\, s_t, a_t\right]. \tag{13}$$

Therefore, an MDP can be used in the scenario of finding the optimal policy. By means of random sampling and dynamic programming, an MDP can be used to solve the problem of maximizing the cumulative reward value. The proposed method applies the deep Q network (DQN) to find the optimal strategy.
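The cumulative discounted reward and the normalization condition on a stochastic policy can be made concrete with a small sketch. The helper names `discounted_return` and `is_valid_policy` are hypothetical, and the reward at the current step is assumed to be undiscounted.

```python
def discounted_return(rewards, discount):
    """Cumulative reward: sum over k of discount**k * rewards[k].

    Computed backwards so that each step is one multiply and one add.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


def is_valid_policy(action_probs, tol=1e-9):
    """Check that the action probabilities for one state are non-negative and sum to 1."""
    non_negative = all(p >= 0 for p in action_probs)
    return non_negative and abs(sum(action_probs) - 1.0) <= tol


rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, 0.0)       # only the current reward counts
balanced = discounted_return(rewards, 0.5)     # 1 + 0.5 + 0.25
farsighted = discounted_return(rewards, 1.0)   # all rewards count equally
```

The three calls illustrate the role of the discount factor: at 0 the return equals the immediate reward, and at 1 it is the plain sum of all rewards.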

DQN-Based Offloading Strategy.
DQN uses experience replay to break the correlation between samples: successive state-action pairs are correlated, whereas the neural network is a nonlinear model whose training assumes independent and identically distributed samples. Thus, by uniform random sampling from the experience replay buffer, the data distribution is averaged and the training process is smoothed. Because an off-policy learning method is adopted, the parameters of the network that generates the data samples differ from those of the network being trained; this is the fixed Q-targets technique. The two networks have the same structure but different parameters. One of the networks is used to obtain the estimated Q value, and the other is used to obtain the target Q value.
The second network periodically copies the parameters of the first network, thus reducing the correlation between the two Q values to a certain extent and improving the network stability. However, a trained deep learning model is adapted only to a specific environment and model; its tuned hyperparameters do not carry over to a new environment, which requires a large number of training episodes to adapt to. Meanwhile, due to the excessive complexity of the problem and the sparse rewards in real-life scenarios, the reward function needs to yield useful rewards even from unsuccessful actions in order to guide the agent toward better decisions [22, 23].
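The replay buffer, the target Q value, and the periodic parameter copy described above can be sketched without any deep learning library. The names `ReplayBuffer`, `td_target`, and `maybe_sync` are hypothetical, and network parameters are represented by plain dictionaries purely for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience replay memory with uniform random sampling."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between successive transitions.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, discount, done):
    """Target Q value: reward plus discounted best target-network Q of the next state."""
    if done:
        return reward  # no future reward after a terminal state
    return reward + discount * max(next_q_values)


def maybe_sync(step, sync_every, main_params, target_params):
    """Copy main-network parameters into the target network every sync_every steps."""
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(main_params)
        return True
    return False
```

Keeping the target parameters frozen between copies means the regression target does not move with every gradient step, which is the stabilizing effect the text attributes to the second network.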
To deal with the problems mentioned above, HER is introduced. It is a good solution to the problem of sparse and binary rewards, as it allows rewards to be obtained even if the final goal is not achieved and thereby accelerates the learning process of the agent. It is assumed that the learning trajectory of the agent starts from the initial state $s_0$ and actually reaches the state $s$, while the final goal is the state $s_\infty$. Then, the real learning process of the agent can be written as follows:

$$\{(s_0, s_\infty, a_0, r_0, s_1),\ (s_1, s_\infty, a_1, r_1, s_2),\ \ldots,\ (s_n, s_\infty, a_n, r_n, s)\}, \tag{14}$$

where $s_0$ represents the state at time 0 and $a_0$ represents the action taken by the agent at time 0. Similarly, $a_1$ is the action at time 1, and $r_0$ represents the reward received at time 0. If the target state $s_\infty$ is replaced by $s$, then even though the agent does not reach the target state $s_\infty$, it still receives feedback when it arrives at $s$. Hence, the learning process can be written as follows:

$$\{(s_0, s, a_0, r_0, s_1),\ (s_1, s, a_1, r_1, s_2),\ \ldots,\ (s_n, s, a_n, r_n, s)\}. \tag{15}$$

The action taken by the agent at time $t$ is determined by the current state $s_t$ jointly with the target state $s_\infty$, which can be written as follows:

$$a_t = \pi(s_t, s_\infty). \tag{16}$$

The corresponding reward is as follows:

$$r_t = R(s_t, a_t, s_\infty). \tag{17}$$

Then, each experience from the learning process is placed into HER, including the current state $s_t$, the action taken $a_t$, the immediate reward obtained $r_t$, the next state $s_t'$, and the final goal $s_\infty$. A hypothetical goal $s$ is generated according to a goal-selection policy, and the reward $r'$ is calculated jointly from the current state and action and deposited into HER. Here, $r'$ can be calculated as follows:

$$r' = R(s_t, a_t, s). \tag{18}$$

The policy used to generate the hypothetical goal is "future." Specifically, $l$ states observed in the randomly selected episode are used as target states to compute new experience values that guide the agent in further learning. HER offers a new direction for solving the problem of sparse rewards; that is, it is not necessary to achieve specific goals in order to learn useful experience [24]. It should be noted that the goal selected or achieved in a so-called "failed experience" should be related to the final goal in some way; the learning performance will be very poor if the similarity between the two goals is very low [25,26]. The pseudocode for the deep reinforcement learning algorithm based on multiple objectives and experience replay is shown in Algorithm 1, and the proposed DQN structure is shown in Figure 2. The structure of the proposed DQN is the same as that of a normal DQN. The agent first reaches the state $s_t$ and inputs it into the neural network to calculate the estimated Q values. It then selects the action with the largest Q value and executes it to obtain the reward and the new state, and the experience is put into the replay pool. When there is enough experience in the replay pool, a minibatch is randomly selected, and the loss function between the Q value of the main network and the Q value calculated from the target network is computed.
The parameters of the network are updated by gradient descent, and the parameters of the target network are updated periodically [27]. However, it should be noted that the replay pool here uses HER, which randomly samples a state in the same episode as a new hypothetical target and then computes a new experience, overcoming the problem of sparse rewards and making the agent learn faster.
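The goal relabeling performed by HER can be sketched as follows. This is a minimal illustration of the "future" strategy under stated assumptions: the function name `her_future_relabel`, the toy one-dimensional environment, and the sparse reward function are all hypothetical and are not taken from the paper's simulation.

```python
import random


def her_future_relabel(episode, reward_fn, goals_per_step=4, rng=random):
    """Relabel an episode with hypothetical goals using the "future" strategy.

    episode: list of (state, action, next_state) transitions in time order.
    reward_fn(state, action, goal): hypothetical reward function.
    Returns a list of (state, goal, action, reward, next_state) experiences.
    """
    relabeled = []
    for t, (state, action, next_state) in enumerate(episode):
        # States reached at or after step t within the same episode.
        future_states = [step[2] for step in episode[t:]]
        for _ in range(goals_per_step):
            goal = rng.choice(future_states)
            reward = reward_fn(state, action, goal)
            relabeled.append((state, goal, action, reward, next_state))
    return relabeled


def sparse_reward(state, action, goal):
    """Toy reward: 0 if the step lands exactly on the goal, otherwise -1."""
    return 0.0 if state + action == goal else -1.0


# Toy 1-D episode: states are integers and each action is a step of +1.
episode = [(0, 1, 1), (1, 1, 2), (2, 1, 3)]
experiences = her_future_relabel(
    episode, sparse_reward, goals_per_step=2, rng=random.Random(0)
)
```

Because every hypothetical goal is a state the agent actually reached later in the episode, some relabeled experiences are guaranteed to carry a non-penalty reward even when the original goal was never achieved.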

Experiments and Analysis
Consider a 520 m × 520 m area with 5 uniformly deployed base stations and $N$ users, where each base station is equipped with an MEC server with 8 CPU cores and the path loss model is $h = 127 + 30\log(d\,[\mathrm{km}])$. The simulations are run with the following configuration: operating system Windows 10; CPU Intel® Core™ at 3.2 GHz; memory 8 GB DDR3 at 1600 MHz. Define the average rate at which users offload tasks to the MEC server up to moment $t$ as $\bar{v}_{n,m} = \frac{1}{t}\sum_{\tau=0}^{t-1} v_{n,m}(\tau)$. Considering that the number of tasks arriving in each time slot differs in magnitude from the power consumed by the users and the MEC servers, the parameters are set as follows. The revenue per task is set to $1 \times 10^{-3}$ units/bit, and the power cost per unit is set to 0.2 units/W. Other simulation parameters are listed in Table 1.
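The path loss model and the time-averaged offloading rate used in this setup can be evaluated with a few lines of code. The sketch assumes that the logarithm is base 10 and that the path loss is expressed in dB, which is the usual convention for models of this form but is not stated explicitly; the function names are hypothetical.

```python
import math


def path_loss_db(distance_km):
    """Path loss 127 + 30*log10(d) with d in km (dB and base-10 log are assumptions)."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    return 127.0 + 30.0 * math.log10(distance_km)


def channel_gain_linear(distance_km):
    """Convert the dB path loss into a linear power gain."""
    return 10.0 ** (-path_loss_db(distance_km) / 10.0)


def average_rate(rates):
    """Average of the per-slot rates v(0), ..., v(t-1)."""
    if not rates:
        raise ValueError("at least one time slot is required")
    return sum(rates) / len(rates)


loss_at_1km = path_loss_db(1.0)
loss_at_100m = path_loss_db(0.1)
mean_rate = average_rate([2.0, 4.0, 6.0])
```

With a path loss exponent of 3, every tenfold reduction in distance lowers the loss by 30 dB, so terminals near a base station enjoy a much larger channel gain than those at the edge of the 520 m area.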

Convergence of Markov Decision Algorithm.
The convergence of the Markov decision algorithm in different environments is investigated first. The convergence performance of the Markov decision algorithm when the number of tasks is 20 is shown in Figure 3.
As can be seen in Figure 3, the Markov decision algorithm converges quickly to the optimal value after 100 iterations, with a total system utility of approximately 0.2. Its computational complexity is much lower than that of an exhaustive search, which must enumerate all candidate strategies.

Comparison of the Average End-to-End Delay of Different Algorithms.
According to Little's law, the average waiting delay is proportional to the average queue length, so the average waiting delay can be defined as the sum of the waiting times in the task queues of the users and the MEC servers. The relationship between the task arrival rate and the end-to-end delay for the four algorithms is shown in Figure 4.
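Little's law states that the average number of items in a queue equals the arrival rate multiplied by the average time an item spends in it, so a waiting delay can be recovered from a queue length. The sketch below applies this to a user queue and an MEC queue; the function names, the use of separate arrival rates for the two queues, and the numbers are assumptions for illustration only.

```python
def littles_law_delay(avg_queue_length, arrival_rate):
    """Average waiting time W = L / lambda from Little's law (L = lambda * W)."""
    if arrival_rate <= 0:
        raise ValueError("arrival rate must be positive")
    return avg_queue_length / arrival_rate


def average_waiting_delay(user_queue, user_rate, mec_queue, mec_rate):
    """Sum of the waiting times in the user task queue and the MEC task queue."""
    return littles_law_delay(user_queue, user_rate) + littles_law_delay(mec_queue, mec_rate)


# Hypothetical backlogs in kbits and arrival rates in kbits/slot.
user_wait = littles_law_delay(8.0, 4.0)
total_wait = average_waiting_delay(8.0, 4.0, 3.0, 2.0)
```

With the backlog measured in kbits and the arrival rate in kbits/slot, the resulting delay is expressed in time slots.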
As illustrated in Figure 4, the algorithm proposed in [15] has the highest average end-to-end delay among the compared algorithms due to its excessive reliance on cloud computing, while the other three algorithms consider local execution, which reduces the delay to some extent. For the algorithm developed in [12], the average end-to-end delay is higher because it does not consider the queuing threshold, which increases the waiting delay in the queue, and because the golden section method causes more transmission delay. In addition, Figure 4 shows that when the task arrival rate is small, the performance of [9] is close to that of the proposed algorithm, and the difference between the two algorithms widens gradually as the arrival rate increases. This is because the proposed algorithm optimizes the bandwidth resources while considering the power resource allocation, thus reducing the transmission delay of the task to a certain extent. When the task arrival rate is 4 kbits/slot, the average end-to-end delay is about 450 ms.

Effect of Different Numbers of MEC Servers on Average Delay.
As the main optimization objective of the proposed algorithm, the average network delay has an intuitive effect on the performance of the algorithm, and it indicates the time required for each service request to access the required service. The average network delay for different algorithms when the number of MEC servers is 10, 55, and 100 is illustrated in Figure 5.
Figure 5 indicates that the average communication delay decreases as the number of MEC servers increases, because more MEC servers speed up communication. Meanwhile, the proposed algorithm has the shortest average communication delay, which is below 300 ms when the number of MEC servers is 100. Owing to its cloud-edge collaboration approach, users do not need to offload tasks to a remote cloud computing center when the number of MEC servers is high, and the DQN strategy can obtain the best resource allocation scheme to reduce the communication delay. In contrast, [15] focuses on the cloud computing environment, so increasing the number of MEC servers does not improve its performance significantly. The communication delay in [9] decreases substantially. Reference [12] has a higher overall delay, exceeding 600 ms, due to the lack of a collaborative cloud-edge allocation model and a high-performance optimization algorithm.

Relationship between the Number of Terminals and the Maximum User Delay.
The proposed algorithm is evaluated by the performance metric of maximum user delay, and the relationship between the number of user terminals and the maximum user delay for different algorithms is shown in Figure 6.
Algorithm 1: Deep reinforcement learning algorithm based on multiple objectives and experience replay.
Initialization:
  Initialize the experience replay memory;
  Initialize the behavior value function $Q$ with random weights $\theta$;
  Initialize the target behavior value function $\hat{Q}$ with weights $\theta^{-} = \theta$.
Begin
(1) For episode $i = 1, 2, \ldots, I$ do
(2)   Receive the initial observation $s_1$ and take the preprocessed $s_1$ as the start state $x_1$;
(3)   For $t = 1, 2, \ldots, T$ do
(4)     With probability $\varepsilon$, select a random behavior $a_t$;
(5)     Otherwise, select the behavior $a_t = \arg\max_{a} Q(x_t, a; \theta)$;
(6)     Execute action $a_t$ in the system to obtain the reward $r_t$ and observe $s_{t+1}$ at the next moment, and update $s_{t+1}$ to $x_{t+1}$;
(7)     Store the experience in the experience replay memory;
(8)     Sample a random minibatch from the replay memory;
(9)     Calculate the target Q value with the target DQN;
(10)    Update the main DQN by minimizing the loss function $L(\theta)$;
(11)    Perform gradient descent on $L(\theta)$ with respect to the network parameters $\theta$;
(12)    Update the target network Q value.

It can be seen from Figure 6 that the maximum user delay increases with the number of user terminals, and the proposed algorithm has a smaller maximum user delay. The proposed strategy maximizes the processing efficiency of computation tasks and reduces system delay by combining the advantages of cloud computing and MEC and by introducing DQN to design an offloading strategy. When the number of user terminals is 80, the maximum user delay is 1150 ms, which is better than other comparison strategies.
On the other hand, although there are abundant resources in the cloud nodes, they are remote from the users and the transmission delay of the tasks is larger. Hence, the maximum user delay in [15] is more than 1240 ms. Therefore, when the number of computational tasks or users is within a certain range, the edge collaboration has better performance than the edge-cloud collaboration.

Relationship between User Terminal Computing Capacity and Maximum User Delay.
The effect of user terminal computing capacity on the maximum user delay under different algorithms is depicted in Figure 7.
As shown in Figure 7, the maximum user delay in [15] decreases much more slowly, while the delay of the other three algorithms decreases as the computing capacity increases. This is because the algorithm proposed in [15] does not consider local execution for users and mainly relies on cloud computing platforms; hence, changes in terminal computing capacity do not have a great effect on user delay. Additionally, Figure 7 shows that the gap between the collaborative schemes ([9] and the proposed algorithm) and the non-collaborative scheme of [12] gradually becomes smaller. The reason is that, as the terminal computing capacity improves, tasks tend to be executed locally at the terminal and fewer tasks are offloaded to the MEC servers, so the advantage of the resource allocation algorithm becomes less significant.
Meanwhile, the proposed algorithm uses the multi-objective and HER-based DQN algorithm to find the optimal resource allocation strategy. The maximum user delay is about 1300 ms when the user terminal computing capacity is ...

Conclusion
To address the high application delay caused by the limitation of hardware resources such as the storage and computing capacity of servers, a resource allocation strategy using deep reinforcement learning in a cloud-edge collaborative computing environment is proposed. Based on the collaborative MEC system model, the optimization objective of minimizing system delay is designed and solved using HER-improved DQN to obtain the optimal resource allocation scheme. The results on the simulation platform show that improving DQN with HER can increase the learning efficiency of the agents. The algorithm converges rapidly to the optimal value after 100 iterations, with a total system utility of about 0.2.

Data Availability
The data used to support the findings of this study are included within the article.