Research Article Joint Radio Map Construction and Dissemination in MEC Networks: A Deep Reinforcement Learning Approach

With the development of 6G, the number of smart devices deployed in the Industrial Internet of Things (IIoT) environment is increasing rapidly. The radio environment is becoming more complex, and spectrum conflicts are becoming increasingly acute. User equipment (UE) can accurately sense and utilize spectrum resources through a radio map (RM). However, the construction and dissemination of RMs incur a heavy computational burden and a large dissemination delay, which limit the real-time sensing of spatial spectrum situations. In this paper, we propose an RM construction and dissemination method based on deep reinforcement learning (DRL) in the context of mobile edge computing (MEC) networks. We formulate the dissemination mode selection and resource allocation problems during RM construction and dissemination as a mixed-integer nonlinear programming problem. Then, we propose an actor-critic-based joint offloading and resource allocation (ACJORA) algorithm for intelligent scheduling of computation offloading and resource allocation. We design a novel weighted loss function for the actor network, which combines the discrete actions for offloading decisions and the continuous actions for resource allocation. Simulation results show that the proposed algorithm can reduce the cost of dissemination by optimizing the offloading strategies and the resource allocation, which makes it more applicable to real-time RM applications in MEC networks.


Introduction
With the development of 6G and the rapid growth of mobile data traffic, new business scenarios are constantly emerging. The Industrial Internet of Things (IIoT) is expected to be a crucial technology for transforming manufacturing [1][2][3].
IIoT comprises a variety of acquisition devices and controllers with sensing and monitoring capabilities. It integrates mobile communication, intelligent analysis, and other technologies into all aspects of the industrial production process, thereby greatly improving manufacturing efficiency and making traditional industries intelligent. However, with the explosion of IIoT applications, the radio environment has become increasingly complex, which brings unprecedented challenges such as scarce spectrum resources, intermittent wireless connections, and high propagation delays. Further research is needed to address these issues.
Radio map (RM) is an important tool for understanding radio environments and analyzing network performance. It incorporates geographic information to describe the radio environment from multiple dimensions such as time, frequency, space, and power [4]. RM can not only effectively acquire the distribution of the radio spectrum resources, but also utilize the multidimensional spectrum data and manage the spectrum resources in a straightforward and flexible way. It has been widely used in cognitive radio [5][6][7], interference management [8], coverage analysis [9][10][11], and active resource allocation [12][13][14]. The spectrum data can be collected by interconnected sensors or smart devices, and the spectrum data needs to be further processed to be constructed as an RM. In the traditional cloud-based network architecture, all spectrum data must be uploaded to a centralized cloud server, constructed as an RM, and disseminated to user equipment (UE). Due to the large size of RM, the traditional dissemination scheme from cloud server to UE consumes more bandwidth and time, which cannot meet the low-latency requirements of IIoT.
In recent years, edge intelligence has integrated edge computing and artificial intelligence (AI) technologies to effectively promote edge-end collaboration [15]. In mobile edge computing (MEC) systems, various services originally deployed on the central cloud server can be deployed on MEC servers, which fundamentally shortens the data transmission delay [16]. At the same time, AI technology represented by reinforcement learning (RL) can schedule the computation offloading and resource allocation in MEC networks [17][18][19], which improves the efficiency of edge computing. The authors in [20] explored the joint optimization of computation offloading and resource allocation in dynamic multiuser MEC systems and proposed a Q-learning-based method and a double deep Q-network (DDQN)-based method to determine joint strategies for computation offloading and resource allocation. The work in [21] considers both multiuser computation offloading and edge server deployment in an unmanned aerial vehicle (UAV)-enabled MEC network. The authors proposed two learning algorithms to minimize the system-wide computation cost under a dynamic environment. To solve the joint optimization of computing offloading and service caching in the edge computing-based smart grid, the authors in [22] proposed a gradient descent allocation algorithm to determine the computing resource allocation strategy, and an algorithm based on game theory to determine the computing strategy. The authors in [23] proposed a method of saving the content service provider (CSP) based on a method of motivating drivers and deep Q-networks (DQN). The work in [24] proposed an auction algorithm and a dynamic task admission algorithm to maximize the system average throughput in a 5G-enabled UAV-to-community offloading system. The work in [25] proposed a deep reinforcement learning (DRL) additional particle swarm optimization algorithm to maximize the long-term utility of all mobile devices in the MEC-based mobile blockchain framework, which takes into account the limited bandwidth and computing power of small base stations. Edge intelligence technology empowers edge users with more powerful information processing and content delivery capabilities by scheduling computing, storage, and other resources for users. It profoundly changes the function of mobile applications and the utilization mode of network resources. What is more, it provides inspiration for constructing and disseminating RM, a task that is computationally complex, bandwidth-intensive, and delay-sensitive.
In this paper, we decompose the RM construction task into two subtasks, which are deployed in the MEC server and UEs according to offloading modes. In MEC networks, RMs are compressed before transmission in order to reduce bandwidth consumption. Then, we propose an actor-critic-based joint offloading and resource allocation (ACJORA) algorithm for intelligent scheduling of computation offloading and resource allocation in MEC networks. Our principal contributions are summarized as follows:
(1) We propose three modes of disseminating RMs in MEC networks and formulate the process as a mixed-integer nonlinear programming (MINLP) problem. Our objective is to minimize the energy consumption and delay of RM construction and dissemination.
(2) To solve the above problem, we propose a DRL algorithm based on actor-critic for joint computation offloading and resource allocation. Considering that the offloading decision actions are discrete and the resource allocation actions are continuous, we design a weighted loss function including the two types of actions in one actor network, which significantly reduces the number of training parameters and improves the convergence efficiency of the algorithm.
(3) Simulation experiments show that the proposed ACJORA algorithm can find offloading and resource allocation strategies for RM dissemination effectively, which makes it more applicable to real-time RM applications.
The remaining sections of this paper are organized as follows. Section 2 reviews the related work on the problem. Then, the network model and problem formulations are introduced in Section 3. Section 4 specifies the implementation details of our ACJORA algorithm. Performance evaluations are provided in Section 5, and Section 6 concludes the paper.

Related Work
2.1. Construction of RM. An accurate RM can provide better services, and the ways to improve the accuracy of RM mainly include optimizing the deployment of sensor devices and improving the accuracy of spatial interpolation [26,27]. However, in some cases, sensor devices are predeployed and the deployment may not be optimal [28]. What is more, increasing the number of sensor devices is uneconomical. Therefore, the interpolation accuracy needs to be improved under a limited number of sensors [29]. However, the construction complexity also increases as the accuracy increases. At present, RM is mainly constructed in the central cloud server and used for network planning or spectrum management and control in advance or over a long period. Some research has been carried out to reduce the complexity of RM construction. The authors in [30] proposed a method based on the Kalman filter. The work in [31] proposed a method based on regression kriging and incremental clustering. Both [30] and [31] reduced the complexity of RM construction. The work in [32] proposed an RM construction method based on a superresolution (SR) algorithm, which greatly shortens the construction time while improving the accuracy. This algorithm contains two phases: offline training and online conversion. In the offline training phase, RM images with different interpolation resolutions are used to train the return forest model with the best parameters. In the online conversion phase, the trained model can directly convert a low-resolution (LR) RM into a high-resolution (HR) RM. In addition, the dissemination of accurate RM requires the scheduling of computing and communication resources, and edge intelligence provides us with solutions.

2.2. Edge Intelligence-Based IIoT. There has been a lot of research on edge intelligence in recent years, and some of it addresses problems in IIoT. The authors in [33] review research results that expound on the development and convergence of IIoT and edge computing. They propose an architecture for edge computing in IIoT and comprehensively explain it from multiple performance metrics. The authors in [34] considered the energy cost optimization problem of computing and caching in the Internet of Vehicles and integrated a deep deterministic policy gradient (DDPG) algorithm to solve this problem. The work in [35] constructed a blockchain-enabled crowdsensing framework in an intelligent transportation system. The authors proposed a DRL-based algorithm and a distributed alternating direction method of multipliers algorithm for distributed traffic management. The work in [36] proposes a novel framework for optimizing the edge collaborative network (ECN) to improve the stability between edge devices and the performance of edge computing tasks. The work in [37] proposed a multiagent imitation learning-enabled UAV research deployment approach, which enables different UAV owners to provide services with differentiated service capabilities in a shared area. In a survey paper [38], the authors explored the emerging opportunities brought by 6G technologies in IoT networks and applications. They shed light on some 6G technologies that are expected to empower future IIoT networks, including edge intelligence, massive ultrareliable and low-latency communications, and blockchain.

2.3. Video Streaming Based on Edge Intelligence. The transmission of video streams is time-sensitive, and many video streaming devices (e.g., smartphones and VR devices) are limited in energy and computational capacity, just as in the case we consider. There is some research on video streaming transmission. The authors in [39] used a DRL algorithm to simultaneously optimize energy consumption and quality of service (QoS) for users during video streaming in edge networks. The work in [40] proposed a peer-to-peer video stream transfer method based on MEC, which can perceive QoS for different users. The authors modeled the transmission, computation, and offloading of video streams as a problem of maximizing QoS for users and then implemented an anti-fuzzy particle swarm optimization algorithm to solve it. Distributed edge computing was used to optimize bandwidth consumption during video streaming [41]. The above methods for video streaming in edge intelligent networks mainly focus on computation offloading and resource allocation. However, RM dissemination needs to consider the construction and dissemination of RM at the same time. It is necessary to optimize the construction algorithms and compression coding of RM to reduce the use of computing and communication resources. Therefore, further research is needed on the construction and dissemination of RM.

System Model and Problem Formulation
We consider a MEC network with a base station (BS), a MEC server, and N UEs, as shown in Figure 1. The set of UEs is denoted by $\mathcal{N} = \{1, 2, \dots, N\}$. UEs have radio receivers that collect spectrum data 5 times/sec through crowdsensing, so no additional radio receivers need to be deployed. They send their location information to the BS together with the newly collected spectrum data.
We adopt the RM construction method based on the SR model in [32], which was trained by RM images with different interpolation resolutions. Due to the small size of the trained SR model, it consumes less bandwidth for transmission. The construction task of RM can be divided into two subtasks, i.e., the kriging interpolation algorithm and the SR model. The kriging interpolation algorithm constructs spectrum data into low-resolution (LR) RMs, and the SR model converts LR RMs into HR RMs. In order to reduce time and energy consumption, the task of RM construction is decomposed and offloaded to the MEC server and the UE. Therefore, the task of disseminating RMs can be transformed into the task of disseminating the model together with a small amount of data. The MEC server determines whether it is necessary to offload the RM construction task to the UE according to the available computation resources and downlink bandwidth resources. The dissemination mode decision variable of UE i is denoted as $m_i \in \{0, 1, 2\}$. The three dissemination modes are described as follows:
(a) All server (mode 0, $m_i = 0$): in this mode, the construction task of RM only occurs on the MEC server. First, the edge server completes the construction of the HR RM. Then, the HR RM is compressed and sent to UEs. This mode is suitable for situations in which the bandwidth of the dissemination link is abundant or the processing power of UE i is limited.
(b) Partial offloading (mode 1, $m_i = 1$): in this mode, UE i needs to execute part of the RM construction task. First, the edge server completes the construction of the LR RM and the training of the SR model. Then, the edge server disseminates the compressed LR RM and the SR model to UE i, and UE i only needs to complete the SR conversion task. This mode is suitable for situations in which the dissemination link bandwidth is abundant and UE i has a certain processing capability.
(c) All local (mode 2, $m_i = 2$): in this mode, the MEC server disseminates the raw spectrum data of all UEs and the trained SR model to UE i. UE i constructs the raw spectrum data into an LR RM, which is then transformed into an HR RM by the SR model. This mode is suitable for situations in which the bandwidth of the dissemination link is limited or UE i has a certain processing capability.
We denote the RM construction computation task of UE i by the tuple $(s_i, c_i, T_i^{\max})$. Here, $s_i$ expresses the size of the computation input data, $c_i$ represents the number of CPU cycles required to accomplish the computation task, and $T_i^{\max}$ is the maximum tolerable delay of the task. In our model, computation tasks are considered to be decomposable. As a result, we decompose the construction task of RM into two subtasks, LR RM construction and SR transformation. In addition, there is also a compression step when the MEC server disseminates the RM to the UE in mode 0 and mode 1. Table 1 shows the computation tasks in MEC networks.
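The three dissemination modes can be summarized as a small lookup structure. The following is a minimal sketch, assuming hypothetical subtask and payload labels (`lr_construction`, `sr_conversion`, `compressed_hr_rm`, and so on) that are introduced here only for illustration; it records where each subtask runs and what the MEC server sends to the UE in each mode.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Dissemination mode decision variable m_i."""
    ALL_SERVER = 0
    PARTIAL_OFFLOADING = 1
    ALL_LOCAL = 2


# Subtask and payload labels are hypothetical names for the steps in the text.
PLAN = {
    Mode.ALL_SERVER: {
        "server": ["lr_construction", "sr_conversion", "compression"],
        "ue": [],
        "payload": ["compressed_hr_rm"],
    },
    Mode.PARTIAL_OFFLOADING: {
        "server": ["lr_construction", "compression"],
        "ue": ["sr_conversion"],
        "payload": ["compressed_lr_rm", "sr_model"],
    },
    Mode.ALL_LOCAL: {
        "server": [],
        "ue": ["lr_construction", "sr_conversion"],
        "payload": ["raw_spectrum_data", "sr_model"],
    },
}


def plan_for(mode):
    """Return the task split and downlink payload for a mode value 0, 1, or 2."""
    return PLAN[Mode(mode)]


for m in Mode:
    print(m.name, plan_for(m))
```

The split makes the trade-off visible: moving from mode 0 to mode 2 shifts CPU cycles from the server to the UE while changing the data that must cross the downlink.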

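3.1. Edge Server Execution Model. We denote the computational capacity (i.e., CPU cycles per second) of the edge server as $F$. The computational capacity allocated by the edge server to UE i is $F_i$. The energy consumption of the edge server is calculated as
$$E_i^{\mathrm{server}} = k F_i^{2} c_i^{\mathrm{server}},$$
where $c_i^{\mathrm{server}}$ is the number of CPU cycles of the computation tasks executed by the edge server for UE i, and $k = 10^{-27}$ is the effective switched capacitance of the CPU, determined by the CPU hardware architecture.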
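3.2. Cognitive User Execution Model. We denote the computational capacity of UE i as $f_i$. The computation energy consumption of UE i is calculated as
$$E_i^{\mathrm{local}} = k f_i^{2} c_i^{\mathrm{local}},$$
where $c_i^{\mathrm{local}}$ is the number of CPU cycles of computation tasks executed by UE i, which is determined by the dissemination mode. The time for the edge server to execute its $c_i^{\mathrm{server}}$ CPU cycles for UE i and the time for UE i to execute its own computation task can be expressed as
$$T_i^{\mathrm{server}} = \frac{c_i^{\mathrm{server}}}{F_i}, \qquad T_i^{\mathrm{local}} = \frac{c_i^{\mathrm{local}}}{f_i}.$$
The downlink transmission rate from the BS to UE i is
$$r_i = W_i \log_2\left(1 + \frac{p h_i}{\sigma^2}\right),$$
where $W_i$ is the bandwidth allocated to UE i, $p$ denotes the transmit power of the base station, which is constant, $h_i$ expresses the channel gain between UE i and the base station, and $\sigma^2$ represents the noise power. The data transmission delay can be represented as
$$T_i^{\mathrm{trans}} = \frac{d_i}{r_i},$$
where $d_i$ denotes the size of downlink transmission data between the BS and UE i, which is determined by the dissemination mode (see Table 2).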
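The transmission energy consumption of the data transmitted by the MEC server to UE i can be calculated as
$$E_i^{\mathrm{trans}} = p\, T_i^{\mathrm{trans}},$$
where $T_i^{\mathrm{trans}}$ is the data transmission delay. The total energy consumption is the weighted sum
$$E_i^{\mathrm{sum}} = w^{e}_{\mathrm{server}} E_i^{\mathrm{server}} + w^{e}_{\mathrm{local}} E_i^{\mathrm{local}} + w^{e}_{\mathrm{trans}} E_i^{\mathrm{trans}},$$
where $w^{e}_{\mathrm{server}}$, $w^{e}_{\mathrm{local}}$, and $w^{e}_{\mathrm{trans}}$ are the energy consumption weights for edge server execution, UE execution, and communication, respectively.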
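The total time spent on computation and communication can be calculated as
$$T_i^{\mathrm{sum}} = w^{t}_{\mathrm{server}} T_i^{\mathrm{server}} + w^{t}_{\mathrm{local}} T_i^{\mathrm{local}} + w^{t}_{\mathrm{trans}} T_i^{\mathrm{trans}},$$
where $w^{t}_{\mathrm{server}}$, $w^{t}_{\mathrm{local}}$, and $w^{t}_{\mathrm{trans}}$ are the execution delay weights for edge server execution, UE execution, and communication, respectively.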
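The cost terms of this section can be evaluated numerically. The sketch below assumes the standard dynamic-power CPU model (energy $k f^2 c$, time $c / f$) and a Shannon-capacity downlink; the cycle count, channel gain, noise power, and per-UE allocations used in the example are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

K = 1e-27  # effective switched capacitance of the CPU, as given in the text


def compute_energy(freq_hz, cycles):
    """Computation energy in joules under the assumed model k * f^2 * c."""
    return K * freq_hz ** 2 * cycles


def compute_time(freq_hz, cycles):
    """Execution time in seconds: CPU cycles divided by CPU frequency."""
    return cycles / freq_hz


def downlink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_w):
    """Assumed Shannon rate W_i * log2(1 + p * h_i / sigma^2) in bit/s."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_w)


def transmission_time(data_bits, rate_bps):
    """Downlink delay in seconds for d_i bits at rate r_i."""
    return data_bits / rate_bps


# Hypothetical example: F_i = 2 GHz allocated by the server, 1e9 CPU cycles,
# and a 100 Kbit LR RM sent over W_i = 2 MHz at p = 0.5 W with linear SNR 15.
e_server = compute_energy(2e9, 1e9)          # about 4 J
t_server = compute_time(2e9, 1e9)            # 0.5 s
rate = downlink_rate(2e6, 0.5, 3e-9, 1e-10)  # about 8 Mbit/s
t_trans = transmission_time(100e3, rate)     # about 12.5 ms
e_trans = 0.5 * t_trans                      # transmit power times delay

print(e_server, t_server, rate, t_trans, e_trans)
```

Because the computation energy grows with the square of the allocated frequency while the execution time falls only linearly, allocating more server capacity to a UE trades energy for delay, which is the tension the weighted cost is meant to balance.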
In order to minimize the sum cost of execution delay and energy consumption for RM construction and dissemination, we formulate the weighted sum of energy and delay as the total consumption of the MEC system. Under the constraints of computation capacity, bandwidth capacity, and maximum tolerable delay, the optimization problem (10) is solved over the offloading decisions, the bandwidth allocation, and the computation resource allocation. In this problem, $\varpi$ is the weight for execution delay, $\mathbf{m} = (m_1, m_2, \dots, m_N)$ is the offloading decision vector, $\mathbf{W} = (W_1, W_2, \dots, W_N)$ is the bandwidth allocation, and $\mathbf{F} = (F_1, F_2, \dots, F_N)$ is the computation resource allocation of the edge server. Besides, C1 restricts the offloading mode of UE i to the three dissemination modes. C2 expresses that the energy consumption of UE i does not exceed its remaining energy. C3 indicates that the sum of computation resources allocated to all UEs cannot exceed the computation capacity of the MEC server. C4 expresses that the sum of the bandwidth allocated to all UEs cannot exceed the available bandwidth of the MEC network. C5 represents that the total time for RM construction and dissemination does not exceed the tolerable delay of the task. $T_i^{\mathrm{sum}}$ denotes the total time of RM construction and dissemination.
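For reference, one formulation consistent with the objective and the constraints C1 to C5 described above can be written as follows. It is a sketch under stated assumptions rather than the exact expression: $T_i^{\mathrm{sum}}$ and $E_i^{\mathrm{sum}}$ stand for the total time and total energy of UE i, the energy term is assumed to carry the complementary weight $1 - \varpi$, $E_i^{\mathrm{local}}$ and $e_i^{\mathrm{local}}$ are assumed symbols for the computation energy and the remaining energy of UE i, and $F$ and $W$ are the total server capacity and bandwidth.

```latex
\begin{aligned}
\min_{\mathbf{m},\,\mathbf{W},\,\mathbf{F}} \quad
  & \sum_{i=1}^{N} \left[ \varpi\, T_i^{\mathrm{sum}}
    + (1 - \varpi)\, E_i^{\mathrm{sum}} \right] \\
\text{s.t.} \quad
  & \text{C1: } m_i \in \{0, 1, 2\}, \quad \forall i \in \mathcal{N}, \\
  & \text{C2: } E_i^{\mathrm{local}} \le e_i^{\mathrm{local}}, \quad \forall i \in \mathcal{N}, \\
  & \text{C3: } \sum_{i=1}^{N} F_i \le F, \\
  & \text{C4: } \sum_{i=1}^{N} W_i \le W, \\
  & \text{C5: } T_i^{\mathrm{sum}} \le T_i^{\max}, \quad \forall i \in \mathcal{N}.
\end{aligned}
```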
Note that the offloading decision variables are integer variables and the resource allocation variables are continuous variables. Therefore, the problem is an MINLP problem, which is NP-hard, and its feasible set is nonconvex. Moreover, the size of the feasible set grows exponentially with the number of UEs. Since traditional model-based methods are incapable of dealing with dynamic scenarios, we adopt a DRL approach, which is model-free.
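The exponential growth is easy to see for the discrete part alone. The sketch below enumerates the offloading decision vectors for a few UE counts; it ignores the continuous bandwidth and computation allocations, which enlarge the search space further. The helper name `offloading_vectors` is introduced here for illustration.

```python
from itertools import product

MODES = (0, 1, 2)  # all server, partial offloading, all local


def offloading_vectors(num_ues):
    """Enumerate every discrete offloading decision vector for num_ues UEs."""
    return list(product(MODES, repeat=num_ues))


# Each additional UE multiplies the number of candidate vectors by three.
for n in (1, 2, 5, 10):
    print(n, len(MODES) ** n)
```

With 5 UEs there are already 243 candidate mode vectors, and with 10 UEs there are 59049, before any resource allocation is considered.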

DRL-Based Joint Offloading and Resource Allocation Algorithm
According to the optimization objectives and constraints of the problem, we solve it with an ACJORA algorithm. This section first defines the state space, action space, and reward function of the model. Then, we introduce the proposed actor-critic algorithm framework in detail.

4.1. State Space, Action Space, and Reward Function. According to the system model, the state space, action space, and reward function are defined as follows.

State Space. The state at step t can be presented by
$$s_t = \left(F_t^{\mathrm{remain}},\, W_t^{\mathrm{remain}},\, e_t^{\mathrm{local}}\right),$$
where $s_t$ denotes the network state at step t. The available resource state at step t consists of $F_t^{\mathrm{remain}}$, which expresses the available computation resources of the MEC server, and $W_t^{\mathrm{remain}}$, which denotes the available communication bandwidth of the MEC network. They are observed to ensure that the constraints on computational capacity and communication channel capacity are met. In addition, we also need to observe the remaining energy of the UEs, $e_t^{\mathrm{local}}$, so that no UE is allocated computation tasks in the next period that its energy cannot complete.

Action Space. The action space can be denoted as
$$a_t = (m_t, F_t, W_t),$$
which consists of three vectors: the offloading decision vector $m_t = (m_1^t, m_2^t, \dots, m_N^t)$, the computation resource allocation vector $F_t = (F_1^t, F_2^t, \dots, F_N^t)$, and the spectrum resource allocation vector $W_t = (W_1^t, W_2^t, \dots, W_N^t)$. In a MEC network, the MEC server disseminates offloading strategies to UEs. Meanwhile, the computation and communication resources allocated to UEs should also be determined.

Reward Function. According to problem (10), our goal is to obtain the smallest sum cost of energy and time consumption, while the goal of reinforcement learning is to obtain the largest reward. Therefore, the value of the reward needs to be negatively related to the value of the sum cost. The sum cost of the system at time t is denoted as $\mathrm{cost}_t$. Thus, the immediate reward $r_t$ obtained by executing action $a_t$ in state $s_t$ is defined so that it is negatively related to $\mathrm{cost}_t$.

4.2. DRL-Based Algorithm. Basically, reinforcement learning algorithms can be classified into three types: actor-only, critic-only, and actor-critic [42]. Actor-only methods employ policy functions (i.e., policy gradient methods) to learn stochastic policies efficiently for models with large action spaces and converge asymptotically to local optima, which makes them more feasible for models with continuous actions. However, they often suffer from high variance in expected reward estimates and slow learning. Critic-only methods with value functions (i.e., action-value methods) typically use temporal difference (TD) iterations and thus have lower variance in expected reward estimates. However, they need to run an optimizer in each state encountered to find the action with the highest expected reward. Therefore, they are not effective for problems with large action spaces, which is the case for the problem that we have stated. Furthermore, they need to discretize continuous actions since they are based on discrete action values. Consequently, we adopt an actor-critic approach, which combines the merits of critic-only and actor-only algorithms.
The actor can produce continuous or discrete actions without requiring an optimizer for the value function. The critic employs an estimate function to evaluate the output of the actor, and the actor updates the policy parameters according to the estimated value, which lowers the variance [43,44]. We propose an actor-critic algorithm to determine continuous actions for resource allocation and discrete actions for task offloading, as shown in Figure 2. The actor-critic model contains a critic network and an actor network. For the actor network, we derive a novel weighted loss function with two different action outputs, the continuous part and the discrete part. The learning rates of discrete and continuous action training are updated according to the weighted loss function in the actor training phase. The actor parameters of continuous actions are iterated by gradient ascent based on DDPG [45], as given in Equation (15), where $Z$ represents the size of the sample batch; $\theta^{Q}$ and $\theta^{\mu}$ express the weight and bias parameters of the critic network and the actor network, respectively; $\mu(\cdot)$ and $Q(\cdot)$ correspond to the outputs of the actor network and the critic network, respectively; and $a_c$ is the continuous action component for resource allocation, i.e., $F_t$ and $W_t$. We compute the gradient of $Q(s, \mu(s|\theta^{\mu})|\theta^{Q})$ with respect to $a_c$ instead of the whole $\mu(\cdot)$, and we treat the discrete action component (i.e., the offloading decision vector $m_t$) as constant. The gradient of $Q(s, \mu(s|\theta^{\mu})|\theta^{Q})$ for the discrete action is calculated as in Equation (16), where $m_i^k \in \{0, 1, 2\}$ is the offloading decision variable in the kth sample from the replay buffer and $(p_1^k, p_2^k, \dots, p_N^k)$ denotes the probabilities of the offloading modes for the UEs, which form the first component of the actor output $\mu(\cdot)$. Different from Equation (15), the continuous action $a_c$ and $Q(s, \mu(s|\theta^{\mu})|\theta^{Q})$ are fixed as a constant action and a constant weight, respectively.
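The two gradient expressions referenced above can be sketched as follows. The first line is the standard DDPG deterministic policy gradient restricted to the continuous component $a_c$. The second line is an assumed likelihood-ratio form for the discrete component, in which the critic value acts as a constant weight as the text describes; the log-probability term $\log p_i^k(m_i^k)$ and the labels $J_c$ and $J_d$ are assumptions introduced here, not a transcription of Equations (15) and (16).

```latex
\begin{aligned}
\nabla_{\theta^{\mu}} J_c &\approx \frac{1}{Z} \sum_{k=1}^{Z}
  \nabla_{a_c} Q(s, a \mid \theta^{Q})
  \Big|_{s = s_k,\; a = \mu(s_k \mid \theta^{\mu})}
  \, \nabla_{\theta^{\mu}} a_c(s \mid \theta^{\mu}) \Big|_{s = s_k}, \\
\nabla_{\theta^{\mu}} J_d &\approx \frac{1}{Z} \sum_{k=1}^{Z}
  Q\big(s_k, \mu(s_k \mid \theta^{\mu}) \mid \theta^{Q}\big)
  \sum_{i=1}^{N} \nabla_{\theta^{\mu}} \log p_i^{k}\big(m_i^{k}\big).
\end{aligned}
```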
The weighted loss function for discrete and continuous actions is expressed in Equation (17), where $w_c$ and $w_d$ correspond to the weights of the continuous and discrete part loss functions, respectively, and $\alpha_d$ and $\alpha_c$ denote the learning rates of the discrete and continuous part training phases. They are updated according to Equation (17) to obtain good convergence performance. For the critic network, we use the mean square error loss function to iterate the parameters, which is defined in Equation (18). Based on the above definitions, the proposed ACJORA algorithm is presented in Algorithm 1.
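To make the structure of these losses concrete, the following is a minimal numerical sketch in plain Python, without automatic differentiation. It assumes that the weighted actor loss takes the form $w_c L_c + w_d L_d$, with a DDPG-style continuous part and a log-likelihood discrete part weighted by the critic value; these forms, the default weights, and the function names are assumptions for illustration, not the paper's exact definitions. The soft target update follows step 14 of Algorithm 1.

```python
import math


def critic_loss(targets, q_values):
    """Mean square error between targets y_k and critic estimates Q(s_k, a_k)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def actor_loss(q_values, mode_probs, w_c=0.5, w_d=0.5):
    """Assumed weighted actor loss w_c * L_c + w_d * L_d.

    q_values:   critic estimates Q(s_k, mu(s_k)) for each sample k
    mode_probs: for each sample, the probabilities the actor assigned to the
                offloading modes that were actually chosen, one per UE
    """
    z = len(q_values)
    # Continuous part: maximizing Q is expressed as minimizing -Q.
    loss_c = -sum(q_values) / z
    # Discrete part: log-likelihood of chosen modes, with Q as a constant weight.
    loss_d = -sum(
        q * sum(math.log(p) for p in probs)
        for q, probs in zip(q_values, mode_probs)
    ) / z
    return w_c * loss_c + w_d * loss_d


def soft_update(target, online, lam):
    """Target update: theta' <- lam * theta + (1 - lam) * theta'."""
    return [lam * o + (1.0 - lam) * t for t, o in zip(target, online)]


print(critic_loss([1.0, 2.0], [0.0, 2.0]))
print(actor_loss([2.0], [[1.0, 1.0]]))
print(soft_update([0.0, 1.0], [1.0, 1.0], 0.1))
```

Keeping both parts in a single scalar loss is what allows one actor network to produce the discrete and continuous outputs together, with $w_c$ and $w_d$ controlling how strongly each part drives the shared parameters.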

Parameter Setting.
In the simulations, we consider a MEC network as shown in Figure 1, which includes a BS and N UEs. A MEC server is connected to the BS. UEs are randomly distributed at distances in the range [0, 200] meters from the BS. The communication bandwidth is W = 10 MHz. The computational capacity of the MEC server is F = 8 GHz, and the CPU frequency of each UE is random in the range [0.5, 1.5] GHz. The LR RM size is set as 100 Kbit (e.g., an LR image is 1280 × 720 with 16 bit/pixel, using a compression ratio of 150 : 1 [46]). The HR RM size is set as 250 Kbit (e.g., an HR image is 1920 × 1080 with 16 bit/pixel, using a compression ratio of 150 : 1). The maximum tolerated delay for RM construction and dissemination is T max = 1 s. The transmit power of the BS is p = 500 mW. Detailed simulation parameters are listed in Table 3.
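The image examples behind the RM sizes can be checked with a short calculation. This sketch assumes 1 Kbit = 1000 bits; the function name `compressed_size_kbit` is introduced here for illustration.

```python
def compressed_size_kbit(width, height, bits_per_pixel=16, ratio=150):
    """Compressed image size in Kbit, taking 1 Kbit as 1000 bits."""
    raw_bits = width * height * bits_per_pixel
    return raw_bits / ratio / 1000


lr_kbit = compressed_size_kbit(1280, 720)   # LR example image
hr_kbit = compressed_size_kbit(1920, 1080)  # HR example image
print(lr_kbit, hr_kbit)
```

The LR example evaluates to about 98 Kbit, close to the 100 Kbit setting, and the HR example to about 221 Kbit, so the 250 Kbit setting is a rounded-up value relative to that example.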

6: Add noise on $\mu(s_t|\theta^{\mu})$ with ε-greedy on $\hat{m}_t$ and a Gaussian distribution with mean $(\hat{F}_t, \hat{W}_t)$.
7: Get action $a_t = (\hat{m}_t, \hat{F}_t, \hat{W}_t)$ with exploration variance $V$.
8: Take action $a_t$, observe reward $r_t$ and next state $s_{t+1}$.
9: Store transition $(s_t, a_t, r_t, s_{t+1})$ in $B$.
10: Sample a random batch of $Z$ transitions $(s_k, \mu(s_k|\theta^{\mu}), r_k, s_{k+1})$ from $B$.
12: Update the critic by minimizing the loss $L_{\mathrm{critic}}$ by Equation (18).
13: According to the loss $L_{\mathrm{actor}}$, update the actor through the continuous part training phase with learning rate $lr_c$ and the discrete part training phase with learning rate $lr_d$ by Equations (15) and (16).
14: Update the target networks: $\theta^{Q'} \leftarrow \lambda\theta^{Q} + (1 - \lambda)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \lambda\theta^{\mu} + (1 - \lambda)\theta^{\mu'}$.
15: end for
16: end for
Algorithm 1: Actor-critic-based joint offloading and resource allocation algorithm.

(3) Random offloading: all UEs randomly select one of the three dissemination modes (all server, partial offloading, or all local), i.e., $m_i \in \{0, 1, 2\}$.
(4) DQN-based: a resource scheduling scheme based on deep Q-network (DQN), which is an action-value method [47].

Table 4 shows the average latency of disseminating RM to UEs in a scenario where the number of UEs is 5, the computational capacity of the MEC server is 8 GHz, and the communication bandwidth is 10 MHz. We conducted 5000 Monte Carlo experiments and took the average value of the experimental results. The average delay of the proposed scheme is 194 ms, which is 77.90% lower than that of the all server construction scheme, 69.83% lower than that of the all local construction scheme, 73.61% lower than that of the random offloading scheme, and 32.17% lower than that of the DQN-based scheme.

Wireless Communications and Mobile Computing
As shown in Figure 3, as the interpolation resolution (i.e., the number of interpolation points used by kriging interpolation) increases, the sum cost of all schemes increases, because both the computational complexity and the data size of RM dissemination grow with the interpolation resolution. The DRL scheme can efficiently allocate computation and communication resources. However, the DQN-based scheme cannot produce continuous actions, so it needs to discretize the continuous variables. Thus, the resource allocation strategies of the DQN-based scheme are worse than those of the proposed scheme. The proposed scheme achieves the minimal sum cost as the RM interpolation resolution increases. Figure 4 illustrates the sum cost of the MEC system as the computational capacity of the edge server increases. The all local curve does not change with the computation resources of the MEC server because the UEs do not use the computation resources of the server. The other curves decrease as the computational capacity of the MEC server increases, because each UE is allocated more server computation resources and the computation time is shortened accordingly. In addition, when F > 8 GHz, the sum cost of the all server scheme and the proposed scheme decreases slowly. The result shows that when the computation resources of the MEC server far exceed those of the UEs, the sum cost of the MEC network is mainly limited by other factors such as communication bandwidth resources. Figure 5 depicts the sum cost of the MEC system as the communication bandwidth increases. Because the all local scheme transmits only a small amount of data, it is less affected by the communication bandwidth, and its sum cost no longer changes once the bandwidth exceeds W = 3 MHz. The sum cost of the other schemes decreases as the communication bandwidth increases, because each UE can be allocated more bandwidth resources and the transmission time is shortened. The proposed scheme has the least sum cost. Figures 4 and 5 show that the proposed scheme has good adaptability in a varying radio environment.
We compare the convergence performance of the DQN-based scheme and the proposed scheme in Figure 6. The sum cost of both RL schemes decreases rapidly with the number of episodes. Eventually, the most effective offloading and resource allocation strategies are learned and the sum cost of the system stabilizes. Compared with the DQN-based scheme, the proposed scheme converges in fewer episodes and reaches a lower sum cost. Figure 6 shows that the proposed scheme can efficiently train offloading and resource allocation strategies.

Conclusions
RM is an important tool for cognitive radio in the 6G era, which can provide data support for IIoT. To solve the problem of RM construction and dissemination in resource-limited and delay-sensitive MEC networks, we have proposed a joint RM construction and dissemination approach based on DRL, namely, the actor-critic-based joint offloading and resource allocation (ACJORA) algorithm. In the algorithm, we designed a novel actor-critic model with a weighted loss function for the actor network, which combines the discrete actions for task offloading and the continuous actions for resource allocation. Compared with baseline schemes, the proposed algorithm can effectively reduce the energy consumption and delay of RM construction and dissemination, which significantly reduces the cost of acquiring RM for UEs.

Data Availability
The data used to support the findings of this study are included within the article.