Dynamically Resource Allocation in Beyond 5G (B5G) Network RAN Slicing Using Deep Deterministic Policy Gradient

Network slicing supports future applications with diverse adaptability and performance requirements by splitting the physical network into several logical networks. The main goal of radio access network (RAN) slicing is to assign physical resource blocks (RBs) to mMTC, eMBB, and uRLLC services while ensuring the quality of service (QoS). However, it is challenging to determine the optimal strategies for 5G radio access network (5G-RAN) slicing because slice demands and environmental conditions change dynamically, and conventional approaches have difficulty addressing such resource allocation problems. In this paper, we present an energy-efficient deep deterministic policy gradient resource allocation (EE-DDPG-RA) method for RAN slicing in 5G networks that chooses the resource allocation policy that increases long-term throughput while satisfying the QoS requirements of B5G systems. The main idea of this method is to remove unnecessary actions in order to reduce the size of the action space. The numerical results demonstrate that the proposed approach outperforms the baselines by improving long-term throughput and managing resources effectively.


Introduction
The fifth generation (5G) of mobile networks is being deployed to satisfy user expectations and the business requirements of network service providers in 2020 and beyond. According to [1], it will be valued at more than $12.3 trillion by 2035. The 5G standard makes it possible to build a state-of-the-art end-to-end network with fully mobile communication. The 5G system is compatible with a wide variety of use cases, each of which has a specific set of service needs. Services are typically grouped into three categories: mMTC, eMBB, and uRLLC [2]. The requirements of eMBB differ substantially from those of uRLLC and mMTC. mMTC applications are characterised by low data transfer volume, low power consumption, and delay tolerance [3]. For run-time interaction, various uRLLC and mMTC platforms support higher throughput and reduced latency. eMBB applications are characterised by higher data rates, larger bandwidth, and mobility support over a large service area. Because of the tremendous growth in users, potential uses, traffic volume, and business practices, we need networks with substantially higher densities, considerably higher bandwidths, network connectivity, full-coverage mobility, strong security, and privacy [4].
Modern network slicing (NS) makes it possible to move from a static to a dynamic network infrastructure. Network slicing is a key advance of 5G technologies; it uses network virtualization, software-defined networking, and fog computing as enablers to provide a range of network capabilities based on user needs [5]. Because each slice can be changed independently, the network can assign the proper amount of resources in line with business needs and improve overall flexibility, robustness, and dependability across traffic models. With network slicing, a physical network can be divided into numerous logical networks. The authors of [6] optimize the distribution of diverse resources and provide suitable support to numerous consumers of various services. Based on the needs of the slice, an end-to-end network can adaptively offer various services. Each network slice can offer resources including transmission power, processing resources, resource blocks (RBs), and bandwidth. Because the slices are isolated, each runs independently of the others; hence, problems in one slice do not affect the operation of the other slices [7].
A network slice comprises the core network (CN) and the radio access network (RAN). 5G core network slicing has garnered much interest, whereas RAN slicing has so far attracted little attention from the research community. Resource allocation is a major problem in RAN slicing. Maintaining quality of service (QoS) requirements remains a key difficulty in RAN slicing because wireless link conditions, user expectations, and user density change over time. Resource selection in the RAN slice is more difficult than in core network slicing when user mobility and radio channel conditions are considered [8]. Resource scheduling and allocation are among the primary functions of network slicing in 5G networks [9]. In previous research, resources were distributed statically, with a fixed number of resources going to each slice. This causes resources to be under- or overutilized, wastes the remaining resources, and makes it difficult for different mobile services to maintain QoS standards.
Without an adaptive resource utilization approach, resources will be underutilized, which can cause problems for consumers using different services. In 5G network slicing, resource allocation can be modelled as a Markov decision process (MDP) that considers both spectrum efficiency (SE) and energy efficiency (EE) in the network. Resource allocation is an NP-hard problem that is practically intractable for enormous volumes of data. Machine learning strategies can address NP-hard resource scheduling problems [10]. Deep reinforcement learning (DRL), a branch of machine learning, has recently grown in popularity and is useful for decision-making. DRL is increasingly applied in robotics, cyber security, and video games [11,12]. This paper introduces an energy-efficient deep deterministic policy gradient resource allocation (EE-DDPG-RA) framework based on the RAN architecture to increase the radio resource allocation effectiveness of MVNOs. Radio resources are distributed across eMBB, uRLLC, and mMTC users using an MDP. A DRL-based resource allocation mechanism is considered for dynamically allocating RBs and power to each user in a 5G network slice. When this system assigns resources to multiple users, it considers the requirements of each user in each slice because the channel conditions change. To the best of the authors' knowledge, this is the first article that addresses RAN resource allocation through a combination of deep learning and reinforcement learning. To help the RAN make accurate decisions, the weights of the online decision components and predictions can be adjusted dynamically.
The primary contributions of this study are, in brief, listed below:
(i) A dual optimization goal of RB allocation and energy minimization is proposed for the resource scheduling problem to reduce energy consumption and satisfy QoS criteria
(ii) Because of its large solution space, the dual optimization problem is a continuous control problem that can be described by an MDP
(iii) The DDPG resource scheduling (DDPG-RS) algorithm is proposed to obtain the optimal resource scheduling scheme, building on the advantages of DDPG in solving continuous control problems and the scalability problem
(iv) The DDPG approach enhances the performance of the entire system by dynamically allocating the abovementioned resources to each slice
(v) Finally, extensive simulations performed in Python confirm the usefulness of the suggested framework
The remainder of the paper is structured as follows. Section 2 reviews the literature. Section 3 discusses the framework and problem definition. In Section 4, we propose the EE-DDPG-RA algorithm to solve the problem. Section 5 presents the simulation results and discusses the applicability and effectiveness of the proposed technique. Section 6 concludes the paper.

Literature Review
Numerous studies have examined RAN slicing. DeepSlice [13] is a deep learning strategy that uses neural networks to tackle network access and load balancing efficiently. Using the supplied KPIs, that study trains the network for inbound traffic monitoring and network slice prediction for any user type. Intelligent resource allocation enables load balancing and efficient resource consumption across the available network slices.
Both the business and academic communities consider slicing the foundational innovation of the 5G network. Network slicing, according to the International Mobile Telecommunications Union (IMT) [14], is a crucial part of the 5G network. Several business sectors and standards organizations, such as the International Telecommunication Union (ITU), have been actively discussing machine learning methods for network slicing. For instance, the ITU is creating machine learning groups to support future networks such as 5G [15]. Network RAN slicing is applied in [16]. RAN slices consist of the standard resources and radio hardware of the wireless communication system; they are less elastic than the core network. To handle diverse requests from various mobile services, each RAN slice has distinct air-interface parameters.
Wireless Communications and Mobile Computing
In this work, RAN slicing is considered because the RAN section of the network interacts closely with the competing SPs, network operators, mobile customers, and the SDN scheduler responsible for all management plane decisions. Network slicing for resource allocation has been the subject of many research studies. Game theory can be used as a conceptual approach to evaluate the efficiency of multitenant resource allocation during network slicing [26,27]. Caballero et al. [28] developed a matching theoretic drive prioritization algorithm to assist the network in becoming unbiased concerning the networking source of the energy challenge. This enables communication between infrastructure and service providers over an over-the-top (OTT) network. In [29], Sun et al. investigated a resource allocation technique called "share-constrained proportionality distribution" in a framework of diverse network games. Xiao and Krunz [30] described the relationship between the mobile users (MUs) in fog RAN slicing, the global spectrum sharing supervisor, and the local cognitive radio controller. None of the preceding articles has sufficiently characterised the efficiency of resource scheduling. Xiao et al. [31] considered complex network slicing for mobile edge computing systems in the context of energy recovery techniques that are improving and becoming more accessible. A Naive Bayes technique was recommended to obtain the ideal resource-slicing architecture between certain edge nodes. Because it relies on the statistics of network dynamics, this method depends on a priori statistical knowledge of the network. With network or spectrum resource slicing as their primary constraints, these initiatives can offer basic mobile services [32].
In contrast, intelligent learning makes network slicing more agile and adaptable to a changing network environment. For extremely large service slices, the authors of [33] developed a priority-based admission mechanism that combines two layers of approaches with a heuristic technique. However, the 5G network is inherently dynamic [34]. Because resource consumption and traffic volumes in a slice vary over time, the network provider must optimize how resources are allocated among the layers in order to satisfy the shifting slice requirements.
Despite significant efforts, the literature on the dynamic and effective control of RAN slicing still has several gaps. We believe that not enough research has been done on dynamic scheduling algorithms for network slicing. In the dynamic resource strategy described in [17,35], Q-learning is used to improve resource allocation to a single node in a VNE. Resource distribution in network slicing differs from virtual network environment (VNE) distribution. Dynamic resource planning in 5G network slicing is becoming increasingly difficult because it involves dependent virtualized network functions (VNFs) with precedence orders and variable resource requirements, as well as separate slices with different QoS criteria. Building a flexible resource scheduling technique for the various QoS requirements of different network slice services is essential to maximize service productivity and resource utilization efficiency [18,19].

The System Model and Problem Formulation
System Description and Assumption
3.1. Business Model. The main characteristics of the system are listed below, and Figure 1 illustrates the most basic wireless network configuration, showing the variety of resources used by tenants:
(i) Each tenant dynamically distributes different resources to many user equipment (UE) units following the service level agreement (SLA). Customers can access multiple resource blocks to find various service slices assigned to other UEs. A UE might be an Internet of Things (IoT) device. Each slice provides services to a set of users in real time according to priority
(ii) Every tenant buys a portion of the network and asks the network operator for physical resource blocks (PRBs) on behalf of its portion. The infrastructure provider then maintains the network
(iii) The major component is the controller, which distributes network PRBs to the slices and to the users of the relevant slices
(iv) Because so many services are available, the controller constantly adjusts the resource allocation approach for each slice to fit its needs
(v) In addition, the controller learns from prior errors and assigns power and other resources to the UEs according to the observed rate or queuing information of the specific slice. This is relevant in the following two situations:
(a) The controller can allocate resources according to any resource allocation strategy to schedule UEs in order to avoid deadlock in the case of a long queue
(b) Users who share a channel with other users may cause interference that increases the likelihood of a service interruption
Consequently, the controller must change the channel allocation strategy for the network slices to ensure the QoS of the slices.

System Model.
We consider a transmission scenario in which a base station serves users across various randomly selected coverage zones. The set of users is denoted by $U = \{1, 2, 3, \ldots\}$. Whenever a data center is connected to a base station, the base station, the DU, and the user equipment exchange CSI (Figure 1). Physical blocks that may be allocated to eMBB, uRLLC, or mMTC exist within each of the s distributed systems that make up the physical network topology. The indices x, y, and z denote the eMBB, URLLC, and mMTC slices, respectively, and the eMBB, URLLC, and mMTC categories contain x, y, and z network slices in total (x + y + z = N). To allocate the underlying network resources to the eMBB, URLLC, and mMTC slices, we use three binary vectors, _eI, _uJ, and _(m)L. Table 1 describes the abbreviations used throughout the paper. Table 2 depicts the RL-based resource allocation algorithms. The fading channel model prevalent in wireless communication is considered. We examine the claim that channel variations may not affect the effectiveness of the transmission channel, because the user-driven learning method employed in the DRL-based heterogeneous network tries to use explicit channel coefficient information. Specifically, the channel coefficient coff(bs, u) between user u and the base station (BS) is determined by a large-scale fading coefficient and a small-scale fading factor $g(bs, u) \sim \mathcal{CN}(0, 1)$. In this arrangement, the following factors influence the achievable rate $Rate_u$ of UE u: $w_{bs,u}$ refers to the noise power, and $\sigma_u^2$ refers to the downlink subcarrier parameter from the BS to user u. We use NOMA to overcome interference from neighbouring subchannels. We assume that the base station uses a range of frequency bands in order to reduce intercell interference.
In this system, once the network has determined what kind of service a user needs, the user is connected to the appropriate slice based on the requests they make. Each slice is designed to serve a specific user type and has specific virtual resource requirements, such as network and power resources. This article develops a distinct strategy for distributing resources across each slice. Each user continuously buffers incoming packets in a traffic queue. Careful queue scheduling is required to maximize system capacity and to reduce the time that data waits in the queue before being transmitted. Assume that the network slices can support a variety of services. Each slice $s \in \{1, 2, 3, \cdots\}$ contains $u_s$ users. The number of users arriving in time slot t is $A(t)$, with $A(t) \le A$, where A is the maximum number of users that can be accommodated at once. The total number of arrivals across all slices equals the number of users entering in time slot t, where $A_s(t)$ denotes the number of users who joined slice s in time slot t. Figure 2 displays the request queue of slice s as $Q^F_s(t)$, with $Q^F_s(t) < \infty$. Users first access their particular network slices in response to the demands of various services at time t. The slice administrator then distributes all slice users among the resource blocks in accordance with the opportunistic scheduling approach [36]. In the NOMA system, several users with different power levels are multiplexed onto the same subchannel, and the receivers employ successive interference cancellation (SIC). With SIC, multiplexed users with higher channel gains can decode and remove the interference from multiplexed users with lower channel gains [37].
Users with strong channel gains typically receive lower power allocations, whereas users with weak channel gains typically receive larger power allocations [22]. A scheduling scheme has been developed to guarantee that users sharing a single subchannel have different channel gains. The queue for accessing resource block (RB) n of slice s is denoted $Q^S_{s,n}(t)$, with $Q^S_{s,n}(t) < \infty$, and p is defined as the probability that a user of slice s will be served on RB n, $n \in N$, where N is the set of RBs and $|N| = N$. $N_s$ denotes the group of RBs assigned to slice s; the slices divide the available bandwidth among one another in this way. Equation (2) indicates that the system accommodates all targeted users. The waiting time of a user of slice s on RB n in time slot t is represented as follows: The user queue length of slice s on RB n in time slot t is $Q_s(t) = Q^S_{s,n}(t)$. An expression for the queue backlog is given after the expression for the queue caching time: Using equation (3), the average queue caching time $\bar{Q}$ is as follows:
3.4. Resource Management Model. The slice operator takes each user's channel conditions into account when deciding which RB to allocate to them. The channels in this study are independent and identically distributed Rayleigh fading channels, and the channel noise is additive white Gaussian noise (AWGN). Resource blocks are a type of network resource used by the RAN. As Figure 2 shows, a resource block (RB) spans both the frequency domain and the time domain. Frequency is divided into subcarriers, and time is measured in TTIs. The proportional fair scheduler assigns RBs to UEs in every TTI. The scheduler assigns more RBs to the UE whose data rate is greatest after normalization by the average data rate.
Therefore, a fair distribution can be achieved, and RBs can be allocated to UEs even when their data rate is very low, as it is for UEs close to the cell edge. The bandwidth B is divided into subcarriers $c \in \{1, 2, \cdots, C\}$, and the $U_{s,c}$ users of slice s are clustered on subcarrier c.
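The proportional fair rule described above can be sketched as follows. This is a minimal illustration, not the scheduler used in the paper: the function names, the per-RB rate matrix `inst_rates`, and the exponential averaging window are assumptions introduced here, and the metric is the commonly used ratio of instantaneous rate to long-term average rate.

```python
def proportional_fair_schedule(inst_rates, avg_rates, num_rbs, eps=1e-9):
    """Assign each RB of one TTI to the UE with the largest PF metric.

    inst_rates[u][n]: achievable rate of UE u on RB n in this TTI (hypothetical input)
    avg_rates[u]:     long-term average rate of UE u
    Returns a list that maps each RB index to the selected UE index.
    """
    allocation = []
    for n in range(num_rbs):
        # PF metric: instantaneous rate normalized by the average rate.
        metrics = [inst_rates[u][n] / (avg_rates[u] + eps)
                   for u in range(len(avg_rates))]
        allocation.append(max(range(len(metrics)), key=metrics.__getitem__))
    return allocation


def update_average(avg_rates, served_rates, window=100.0):
    """Exponential moving average of the served rate (assumed averaging rule)."""
    return [(1.0 - 1.0 / window) * a + (1.0 / window) * s
            for a, s in zip(avg_rates, served_rates)]
```

A cell-edge UE with a low instantaneous rate but an even lower average rate obtains a high metric, which is why such UEs still receive RBs under this rule.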
The total power of subcarrier c is denoted by $P_c$, where $P_c = \sum_{i=1}^{U_{s,c}} pow_{i,c}$. This study examines downlink transmission and orders the users on each subcarrier by channel gain, $|u_{1,c}|^2 < \cdots < |u_{i-1,c}|^2 < |u_{i,c}|^2 < \cdots < |u_{s,c}|^2$, where i indexes the users on the subcarrier. Users can be distinguished by power level and channel gain. The superimposed signal transmitted on subcarrier c under NOMA is denoted $Sig_c$; it is composed of the data signal $Trans_{i,c}$ of each user i on subcarrier c and the power $pow_{i,c}$ allocated to that user on subcarrier c. In RAN slicing, resource block isolation gives each slice access to a maximum number of available RBs. Each slice allocates RBs to its UEs without using more RBs than it is permitted. The signal of user i received on subcarrier c is given in (6), where $h_{i,c}$ denotes the Rayleigh fading channel coefficient between the BS and the i-th user on subcarrier c, and $\omega_{i,c}$ denotes zero-mean complex AWGN with variance $\sigma_c^2$. Using Shannon's capacity formula and SIC at the receivers, the maximum achievable data rate of the i-th user on subcarrier c is given by (7). According to (7), the rate of each user is considerably affected by how much power is allocated to the other users.
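To make the dependence of each user's rate on the power given to the other users concrete, the following sketch computes per-user rates on a single subcarrier under the textbook downlink NOMA model with perfect SIC. It is an illustration under stated assumptions rather than the paper's equation (7): the function name and arguments are hypothetical, and the interference term assumes that a user cancels the signals of all users with weaker channels and treats the signals of stronger users as noise.

```python
import math


def noma_sic_rates(gains, powers, noise_power, bandwidth_hz):
    """Per-user achievable rates on one subcarrier (downlink NOMA, perfect SIC).

    gains[i]:  channel power gain of user i on the subcarrier
    powers[i]: transmit power allocated to user i on the subcarrier
    """
    # Sort user indices from the weakest to the strongest channel.
    order = sorted(range(len(gains)), key=lambda i: gains[i])
    rates = [0.0] * len(gains)
    for rank, i in enumerate(order):
        # Residual interference: signals intended for users with stronger channels.
        interference = gains[i] * sum(powers[j] for j in order[rank + 1:])
        sinr = powers[i] * gains[i] / (interference + noise_power)
        # Shannon capacity for the resulting signal-to-interference-plus-noise ratio.
        rates[i] = bandwidth_hz * math.log2(1.0 + sinr)
    return rates
```

In this model, raising the power of the strong user lowers the rate of the weak user, because the weak user cannot cancel that signal.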
Here, $B_c$ is the bandwidth of subcarrier c, and $\Gamma_{i,c}$ is the CINR of the i-th user. Assuming that slice s has access to C subcarriers, the total slice rate is obtained from the per-subcarrier rates $Rate_{u,c}$. The number of available subcarriers can be allotted so as to satisfy the performance and latency requirements of the slice. The NOMA calculation includes the path loss of fading channel c. The objective is to reduce the overall experienced delay $\alpha_u(y)$ for URLLC users, support many connections for mMTC users, and obtain a greater sum data rate $\alpha_e(x)$ for eMBB users, in order to achieve an effective allocation of resources for MVNOs. We therefore express the maximization and minimization problems of an MVNO $m_i \in M$ as a joint problem, subject to
$0 \le \mathrm{Performance}_i \le \mathrm{Performance}^{\max}$, (18a)
$\mathrm{Delay}_j \le \mathrm{Delay}_j^{\max}$, (18e)
$\mathrm{Massive}_l = \mathrm{Massive}_l^{\max}$. (18f)
While determining these values, the users' requirements and the upper limits on resources must be taken into account. Constraint (18a) ensures that the allocated frequency fraction falls between 0 and the maximum value $\mathrm{Performance}^{\max}$. Constraint (18b) guarantees that the channel capacity allotted to users does not exceed the bandwidth $B_i$ offered by the InP. Constraint (18d) ensures that the data rate of an eMBB user exceeds a predetermined minimum. Constraint (18e) states that the data packet delay of a URLLC user should not exceed a specific limit, and constraint (18f) ensures that the number of devices connected by an mMTC user meets a predetermined threshold. Our network slicing also aims to increase system throughput while guaranteeing that the QoS requirements of the various network slices are satisfied. To achieve this aim, we consider three components of the system throughput: the throughput of eMBB network slices ($Thru_x$), the throughput of URLLC network slices ($Thru_y$), and the throughput of mMTC network slices ($Thru_z$).
3.5.1. Throughput/Efficiency of eMBB Network Slice. $Thru_{x,u}$ denotes the throughput of the eMBB network slice request that a UE makes to the mobile operator, where $f_{i,u}$ designates the resource bandwidth provided to UE u inside the i-th eMBB slice. It should be highlighted that the frequency restriction of the UE should be larger than $Thru_{x,u}$.
The corresponding sum throughput of the eMBB slice is $Thru_x = \sum_{s=1}^{S} Thru_{x,u}$.
3.5.2. Throughput of uRLLC Network Slice. The throughput of UE u in a URLLC slice is $Thru_{y,u}$, where $f_{j,u}$ designates the resource bandwidth given to the UE inside the j-th URLLC slice. We assume that one data packet is sent within a single URLLC frame. The maximal network delay D should not exceed the frame time [14].
Here, $D_{y,k,\max}$ is the maximum packet latency of UE k in the y-th URLLC slice, and $F_{y,k}$ is the packet size of UE k in the y-th URLLC slice. The sum throughput of the URLLC network slice is $Thru_y = \sum_{s=1}^{S} Thru_{y,u}$.

Throughput of mMTC Network Slice.
The throughput of UE k in the mMTC slice is defined similarly to that of the uRLLC and eMBB slices, where $f_{l,k}$ is the resource bandwidth assigned to the UE inside the l-th mMTC slice. Unlike the eMBB and URLLC slices, the mMTC slice is not subject to a rate or latency requirement. The sum throughput of the mMTC slice is $Thru_z = \sum_{s=1}^{S} Thru_{z,u}$. Finally, the total network throughput $T(t)$ at time t is determined from these three components, and the problem of improving system throughput across T time frames is formulated as a long-term optimization problem. Because of the dynamic slice requirements and the long-term nature of the objective, this optimization problem is highly difficult and hard to solve directly with traditional optimization procedures. The problem can instead be formulated as an MDP and solved with reinforcement learning approaches.

Basics of Deep Reinforcement Learning
Reinforcement learning (RL) is a field of artificial intelligence that deals with a learning agent placed in an environment to accomplish a task. In contrast to supervised methods, where the learner receives examples of good and bad behaviour, the RL agent must learn by trial and error how to behave in order to acquire the highest reward [23]. To do this, the agent must perceive the state of the environment at some level and act accordingly, which produces a new state. The agent's action results in a reward, which encourages it to repeat the same behaviour in the future.
Formulating the problem also requires a model of the environment's state transitions as a function of the agent's actions. The result is an MDP defined by the tuple (S, A, R, T), where S denotes the set of environment states, A denotes the set of possible actions in a state, T denotes the transition function that moves between states based on the actions, and R denotes the reward for a specific state-action pair.
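The tuple (S, A, R, T) can be illustrated with a small toy example. The sketch below is not the slicing MDP of this paper: the two states, the two actions, the deterministic transition function, and the discount factor `gamma` are all hypothetical choices made only to show how an agent accumulates reward while interacting with an environment.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence


@dataclass
class MDP:
    states: Sequence[Hashable]      # S
    actions: Sequence[Hashable]     # A
    transition: Callable            # T(state, action) -> next state (deterministic here)
    reward: Callable                # R(state, action) -> float


def rollout(mdp, policy, start, steps, gamma=0.9):
    """Discounted return obtained by following `policy` for `steps` steps."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(steps):
        action = policy(state)
        total += discount * mdp.reward(state, action)
        state = mdp.transition(state, action)
        discount *= gamma
    return total


# Toy slice: adding an RB relieves congestion; reward is positive only when not congested.
toy = MDP(
    states=("congested", "ok"),
    actions=("add_rb", "hold"),
    transition=lambda s, a: "ok" if a == "add_rb" else s,
    reward=lambda s, a: 1.0 if s == "ok" else -1.0,
)
```

Calling `rollout(toy, lambda s: "add_rb", "congested", 3, gamma=0.5)` follows a policy that always adds an RB; a policy that always holds would stay congested and accumulate only negative reward.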

DRL-Based Resource Allocation Model.
We outline the formulation of the MDP in this section by defining the state space, the action space, and the reward function. To allocate bandwidth effectively, the channel gain between each MVNO and its associated users must be estimated. Each MVNO periodically collects the channel gains. In practice, every MVNO sends reference (pilot) signals to all of its users. Each user then estimates the channel state information and transmits it back to its MVNO through the controller.
The observed state of MVNO $m_i$ at time t is denoted $State_i(t)$.
The state comprises the set of user categories of the MVNO, $U_i$, and $Gain_i(t)$, which reflects the channel gain between the MVNO and its users during time slot t. The three values $U_e$, $U_u$, and $U_m$ specify the user groups and represent the priority of each category. URLLC users are typically given higher priority because they have stricter delay requirements.

Action Space.
Each MVNO receives the required bandwidth fraction $B_i$ from the RIC during each time slot and divides $B_i$ into fractions among its users. The action space of each MVNO during time slot t is $Act_i(t)$, and each action $a_i \in Act_i(t)$ is represented as a row vector of the fractions $Performance_{i,j,l}(t)$.
Reward Function. At time step t, an MVNO takes an action $a_i \in Act_i(t)$, in exchange for which it receives a reward from $Reward_i(t)$. Because the goal is to reduce delay, the reward is defined as a function of the latency of uRLLC users, the data rate of eMBB users, and the maximum number of connected devices.
We specify a reward associated with the satisfaction of each end user, and the total reward combines these terms. An action $a_i \in Act_i(t)$ is deemed valid if the sum of its components does not exceed 1 and the assigned fractions result in latencies and data rates that meet the SLA values. If the action is invalid, a large penalty is applied in order to deter the agent from making a similar decision in later steps.
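The validity check and penalty can be sketched as below. The exact reward expression of the paper is not recoverable from the text, so this is only one plausible shaping: the function name, the margin-based reward terms, the penalty value, and the reading of validity as "fractions are non-negative and sum to at most 1, and every SLA bound is met" are assumptions. The mMTC connection term is left out for brevity.

```python
def slice_reward(fractions, delays, delay_limits, rates, rate_minimums, penalty=-10.0):
    """Reward for one MVNO action (hypothetical shaping, not the paper's formula).

    fractions:     bandwidth fractions assigned to the users (the action vector)
    delays:        observed URLLC delays, compared with delay_limits (SLA)
    rates:         observed eMBB data rates, compared with rate_minimums (SLA)
    """
    valid = (
        all(f >= 0.0 for f in fractions)
        and sum(fractions) <= 1.0
        and all(d <= d_max for d, d_max in zip(delays, delay_limits))
        and all(r >= r_min for r, r_min in zip(rates, rate_minimums))
    )
    if not valid:
        # Invalid actions receive a large negative reward to discourage them.
        return penalty
    # Valid actions are rewarded by how comfortably the SLA targets are met.
    delay_margin = sum(1.0 - d / d_max for d, d_max in zip(delays, delay_limits))
    rate_margin = sum(r / r_min - 1.0 for r, r_min in zip(rates, rate_minimums))
    return delay_margin + rate_margin
```

Returning a fixed penalty for every invalid action keeps the sketch simple; a graded penalty that grows with the size of the violation is a common alternative.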

Proposed EE-Deep Reinforcement Learning-Based Resource Allocation Algorithm.
Model-free reinforcement learning methods can be divided into value-based and policy-based categories according to how the policy is updated. Value-based solutions let the agent acquire the best policy by learning the value function. In this work, the action space is continuous. With a value-based approach, naive discretization of the action space leads to the curse of dimensionality and the loss of crucial information about the structure of the action domain. Policy-based approaches use parameterized policies to train stochastic policies that can handle high-dimensional state and action spaces.
The stochastic policy function $\pi_\theta$ at time step t is parameterized by θ: the action at state s follows a probability distribution with parameter θ.
The objective is $Obj(\pi) = E_{s \sim p^{\pi}, a \sim \pi_\theta}\left[\sum_t r(state_t, a_t)\right]$. According to this definition, the objective is the expected return, where $p^{\pi}$ is the discounted state visitation probability under the policy. The gradient technique [38] follows the steepest direction to find the optimum parameter θ. The stochastic policy gradient (SPG) must carry out complex calculations when the action is a high-dimensional vector, because actions must be sampled repeatedly from the stochastic policy. Instead of frequently sampling actions, the deterministic policy gradient (DPG) [39] directly generates a deterministic behaviour policy. DPG-based techniques produce deterministic policies and do not by themselves explore the environment. With off-policy learning, exploitation and exploration can coexist: a stochastic behaviour policy guarantees sufficient exploration of actions, while the target policy is deterministic and exploits the efficiency of deterministic policies. As a result, the actor-critic (AC) technique, which is detailed in the next section, is used in the learning framework of the DPG technique.
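For orientation, the two updates discussed above take the following standard forms in the policy gradient literature. These are the textbook expressions and may differ in notation from the paper's own equations; the step size $\alpha$, the deterministic policy $\mu(\cdot \mid \theta^{\mu})$, and the critic $Q(\cdot,\cdot \mid \theta^{Q})$ are symbols assumed here.

```latex
% Stochastic policy: gradient step on the objective with step size \alpha
\theta \leftarrow \theta + \alpha \, \nabla_{\theta}\, \mathrm{Obj}(\pi_{\theta})

% Deterministic policy gradient: chain rule through the critic at a = \mu(state)
\nabla_{\theta^{\mu}} \mathrm{Obj}(\mu)
  = \mathbb{E}_{state \sim p^{\mu}} \Big[
      \nabla_{a} Q(state, a \mid \theta^{Q}) \big|_{a = \mu(state \mid \theta^{\mu})}
      \, \nabla_{\theta^{\mu}} \mu(state \mid \theta^{\mu})
    \Big]
```

The deterministic form needs no integral over actions, which is the computational advantage over the stochastic policy gradient when the action is a high-dimensional vector.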

Actor-Critic
Method. The actor-critic method combines the advantages of value-based and policy-based techniques. The actor generates an action from a state according to a policy function. The critic learns an action-value function and uses the TD error (as its loss function) to assess the action's effectiveness. The actor then uses the DPG technique to update the policy parameters with the critic's output, and the critic updates the action-value function using gradient descent [40]. The action-value function and the policy function are approximated by functions parameterized by $\theta^{Q}$ and $\theta^{\mu}$, respectively. The critic's parameters $\theta^{Q}$ are updated by minimising the TD error, and the actor uses the DPG mechanism to update the policy parameters $\theta^{\mu}$. Experience replay [41] can break the correlation between successive samples [42]. Building on the advantages of the DQN algorithm and the actor-critic method, the deep deterministic policy gradient (DDPG) algorithm operates successfully over continuous state and action domains. The DDPG architecture is presented in detail in Figure 2 [43,44]. The solid red line and the blue lines with full dots represent the training processes of the actor and critic networks, respectively.
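The critic and actor updates described above are usually written in the DDPG literature as sketched below. The minibatch size $N$, the discount factor $\gamma$, and the sample index $i$ are assumed notation, and $\mu'$ and $Q'$ denote target copies of the actor and critic; this is the commonly used form rather than a reconstruction of the authors' exact equations.

```latex
% TD target computed with the target networks
y_i = r_i + \gamma \, Q'\!\left(\mathrm{state}_{i+1},
        \mu'(\mathrm{state}_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)

% Critic: minimise the mean squared TD error over a minibatch
L(\theta^{Q}) = \frac{1}{N} \sum_{i}
        \left( y_i - Q(\mathrm{state}_i, a_i \mid \theta^{Q}) \right)^{2}

% Actor: sampled deterministic policy gradient
\nabla_{\theta^{\mu}} \mathrm{Obj} \approx \frac{1}{N} \sum_{i}
        \nabla_{a} Q(\mathrm{state}, a \mid \theta^{Q})
          \big|_{\mathrm{state} = \mathrm{state}_i,\; a = \mu(\mathrm{state}_i)}
        \, \nabla_{\theta^{\mu}} \mu(\mathrm{state} \mid \theta^{\mu})
          \big|_{\mathrm{state} = \mathrm{state}_i}
```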
(1) Experience Replay. The agent interacts with the environment to collect data tuples $(\mathrm{state}_t, a_t, r_t, \mathrm{state}_{t+1})$ and stores them in the replay buffer $D$. The critic and the actor randomly sample a minibatch from $D$ to update the action-value function parameters and the policy function parameters.
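A minimal sketch of such a replay buffer is given below, using only the Python standard library. The class name, the fixed capacity, and the uniform sampling rule are assumptions made for illustration and are not taken from the authors' implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer D of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently discards the oldest tuple when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and recent tuples,
        # which weakens the correlation between consecutive samples.
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```

In use, the agent would call `store` after every environment step and call `sample` once the buffer holds at least one minibatch of tuples.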
(2) Target Network. Directly using deep neural networks to implement Q-learning has been shown to be unstable. The network update tends to diverge because the target network and the prediction network share the same set of parameters. Copies of the actor network, $\mu'(\mathrm{state} \mid \theta^{\mu'})$, and of the critic network, $Q'(\mathrm{state}, a \mid \theta^{Q'})$, are therefore created to compute the target value. DDPG applies soft target updates, $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, to the weights of the target networks. Learning stability is improved with $\tau \ll 1$.
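The soft target update can be sketched in a few lines. Here the network weights are represented as flat lists of floats, which is a simplification for illustration; the function name and the default value of `tau` are assumptions.

```python
def soft_update(target_params, online_params, tau=0.01):
    """Return theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [
        tau * w + (1.0 - tau) * w_target
        for w, w_target in zip(online_params, target_params)
    ]


# Hypothetical weights: the target moves only a small step toward the online net.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, tau=0.1)
```

With a small `tau`, the target networks change slowly, so the TD target stays nearly fixed between consecutive updates.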

Experiments and Results
Simulation studies using TensorFlow and Python were conducted to evaluate the performance of the proposed DDPG-based approach for resource allocation in RAN slicing. Each physical node is assigned two resources, mainly power capacity and bandwidth. Five network slices, with a total of 25 VNFs, are randomly distributed within the network at deployment in each episode. The resources required by each VNF within a slice are uniformly distributed between 1 and 20 resource units during each system cycle. The simulation results are shown in the following figures. The slices share the following resources: a total bandwidth of 150 MHz and 175 J of energy. Furthermore, in response to demands from the end users, we varied the resources required for each job. To meet the required quantity, the agent distributes resources as equally as is practical, raising or lowering them to the level that most closely satisfies the needs of each slice. When a slice demands more resources, the agent tries to accommodate the request in full or assigns as many of the resources as is practical. However, once the resources have been allocated, the agent does not reduce them, even if resource utilization is low. In this way, the best agent focuses on reducing SLA violations. The random agent assigns a random amount of the demanded resources to each request.
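The demand model and the random baseline described above can be sketched as follows. The even split of 5 VNFs per slice, the integer demands, and the rule that the random agent grants a uniformly random fraction of each demand are assumptions made for illustration; the text only fixes the totals (5 slices, 25 VNFs) and the range of 1 to 20 resource units.

```python
import random

NUM_SLICES = 5                  # from the text
VNFS_PER_SLICE = 5              # assumption: 25 VNFs split evenly over the slices
MIN_UNITS, MAX_UNITS = 1, 20    # demand range per VNF, from the text


def sample_demands(rng):
    """Draw one integer demand in [1, 20] resource units per VNF, per slice."""
    return [
        [rng.randint(MIN_UNITS, MAX_UNITS) for _ in range(VNFS_PER_SLICE)]
        for _ in range(NUM_SLICES)
    ]


def random_agent(demands, rng):
    """Baseline: grant each request a random amount between 0 and its demand."""
    return [[rng.uniform(0.0, d) for d in slice_demands] for slice_demands in demands]


rng = random.Random(0)
demands = sample_demands(rng)
grants = random_agent(demands, rng)
```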
The DDPG settings are as follows. The reward discount factor in our simulations is fixed at 0.9, the learning rate of the actor is set to 0.001, and the learning rate of the critic is set to 0.001. The simulation findings are discussed in the remainder of this section. The efficient deep deterministic policy gradient algorithm requires two efficient deep deterministic policy gradient-based algorithms that are trained jointly. In each episode, T time steps are used to iteratively train the first efficient deep deterministic policy gradient-based algorithm for slice-level adaptation and the second one for user-level adaptation. The slice-level actor and critic networks are trained once the user-level actor and critic networks have been trained. The second DDPG algorithm takes as input the output of the first DDPG algorithm's actor network.
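To make the role of the discount factor concrete, the sketch below computes a discounted return with the value 0.9 stated above. The reward sequence is hypothetical, and the constant names are chosen here for illustration only.

```python
GAMMA = 0.9          # reward discount factor, from the text
ACTOR_LR = 0.001     # actor learning rate, from the text
CRITIC_LR = 0.001    # critic learning rate, from the text


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t, accumulated backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode with three unit rewards: 1 + 0.9 + 0.81.
example_return = discounted_return([1.0, 1.0, 1.0])
```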
The variation of the reward with the number of episodes is shown in Figure 3. The proposed DDPG resource allocation approach converges after about 200 episodes. During the training phase, we plot the reward against the number of training episodes and observe that the overall reward of DRL-NS increases quickly as the number of episodes grows. As shown in Figure 4, the reward rises with each episode and stabilises after around 200 episodes. Figure 4 also demonstrates that the tenant's overall utility increases, which is the main objective of our proposed scheme, as indicated in the problem formulation section. RB utilisation and MVNO utility trade off against each other. The MVNO rents more RBs from the systems integrator, increasing MVNO consumption, to provide more transmission resources to network operations and generate revenue.
The distribution of bandwidth resources over the number of episodes is shown in Figure 5. Compared with the end user requests in slice 2, the requests from bandwidth-hungry end users in slice 1 require more bandwidth, and the requests of slice 2's end users differ in nature from those of slice 1's end users. Therefore, the proposed DDPG dynamic resource allocation algorithm allocated mostly bandwidth resources to slice 1 and mostly other resources to slice 2. According to Figure 5, the proposed scheme allocated to slice 1 around 70-80% bandwidth resources and 20-30% other resources. The trends for the distribution of other resources to the slices are comparable. Variations in resource capacity affect how much energy the MVNO operator or controller receives. The size of the resource capacity determines how much revenue the MVNO controller obtains from resource allocation.
Figures 6 and 7 depict the system throughput and the allocation of energy resources with respect to the delay requirements. The end user requests of slices 3 and 4 are approximately normally distributed. About half of the requests require a significant amount of energy, whereas the other half demand system throughput. As a result, slices 3 and 4 received roughly equal shares of energy resources and delay requirements using the proposed DDPG dynamic resource allocation method. Figures 6 and 7 show that slices 3 and 4 each get about 50% of the energy and 50% of the delay. It should be emphasised that these resource assignments are made entirely dynamically by the DDPG algorithm. Figure 7 is an exception, where the resource assignment varies around 400 episodes, possibly due to different end user request patterns. Figure 7 also depicts how the utility of the MVNO controller depends on the available resources. The utility of every algorithm gradually increases as additional resources are made available.
As a result, the InP's provision of adequate resources is what increases the utility of the MVNO controllers. The proposed DRL-based algorithm offers the highest utility. Figure 8 shows how resources are distributed to improve system performance and meet the delay requirements. The end user requests of the slices correspond to the usual resource allocation pattern: the first half of the requests are intensive, whereas the second half require the system to sustain a high throughput.

Conclusion
This article examined how multiple MVNOs use resources under RAN slicing. We concentrated on how to apply machine learning to create reliable slicing patterns in various wireless communication environments. We then proposed a DDPG-based deep reinforcement learning scheme for allocating power and bandwidth simultaneously. Slices of the eMBB, URLLC, and mMTC types were taken into account in our scenarios. We formulated the problem as an MDP for a specific mobile virtual network operator in order to allocate radio resources to different user types (eMBB, mMTC, and URLLC). In our proposal, we combined the benefits of policy-based and value-based reinforcement learning techniques into an actor-critic approach. Since the fractional bandwidth values are continuous, we simultaneously train a Q-function and a policy using the deep deterministic policy gradient, which operates on continuous actions. To improve how multiple MVNOs manage radio resource allocation cooperatively, we developed an EE-DDPG-RA DRL-based technique on a RAN architecture. The effectiveness of the proposed EE-DDPG-RA DRL technique is demonstrated under numerous simulated scenarios with non-i.i.d. and uneven distributions of the end users. The experiments show that, in comparison with models created independently by each MVNO, the model trained with EE-DDPG-RA DRL is more robust to environmental changes.
We also point out certain key issues that must be resolved before DDPG can be applied more broadly. In our opinion, network slicing may benefit significantly from the use of DRL in the future. Before implementing DDPG, however, network slicing should be considered carefully because it involves several elements. (a) Limiting the acceptance of new slice requests: for network slicing to succeed, a flexible and dynamic slice management strategy is required. Applying DDPG here is also an interesting challenge, because the state and action spaces must adapt to the changes in the slice space when new slice requests arrive. (b) Policy learning cost: a fast policy-learning method is needed because of user activity and the time-varying nature of wireless channels. However, the speed of policy training today still falls short of the required learning rate. As a result, many intriguing questions remain open.

Data Availability
There is no data included with the manuscript for publication.

Conflicts of Interest
The authors declare that they have no conflicts of interest.