Development of an AI-Enabled Q-Agent for Making Data Offloading Decisions in a Multi-RAT Wireless Network

Introduction
The remarkable surge in mobile data usage resulting from technological advancements in various domains like the Internet-of-Things, online gaming, social networks, augmented/virtual reality, and more has significantly amplified the burden on wireless networks and introduced a wide variety of applications with varying characteristics. In order to meet such intensive and highly heterogeneous demands, 5th generation (5G) wireless networks are expected to be ultradense, autonomous or intelligent, to operate on multiple higher frequency bands, and to support multiple radio access technologies (multi-RATs). The deployment of ultradense networks increases the spatial reuse factor of the resources. The use of multiple RATs and multiple bands enables effective use of the spectral bands and resources of the wireless network. Moreover, it also helps in alleviating congestion on the cellular RAT by offloading data to underloaded RATs like wireless fidelity (Wi-Fi) [1]. The concept of offloading across multiple RATs extends beyond data transfer; it has also been harnessed for the sharing of computational [2] and power resources [3].
The 5G networks are expected to be intelligent so as to meet the quality-of-service (QoS) requirements imposed by users with heterogeneous demands. That is why artificial intelligence (AI) has an important role to play in the development of intelligent and autonomous algorithms for 5G networks. The efficient allocation and utilization of resources among users with different demands in a 5G multi-RAT wireless network is a challenging problem. Network densification further complicates this problem by introducing interference and coverage issues, often manifesting as degraded network performance and user experience. Therefore, it becomes challenging to satisfy the QoS requirements imposed by the user. Various algorithms have been presented in the literature, as discussed in Section 2, for alleviating congestion on wireless networks and for efficient utilization of resources by employing the concept of data offloading and AI techniques. However, some of these have been developed without taking into account the effect of interference caused by the deployment of ultradense wireless networks [1], and others have been developed without using AI techniques.
Therefore, the main objective of this work is to develop an AI-enabled Q-agent for making data offloading decisions in a multi-RAT wireless network, taking into account the effect of interference caused by an ultradense network and making use of a model-free reinforcement learning (RL) technique. Here, the term "model-free" means that an accurate or approximate mathematical representation of the system under consideration is not required; thus, the AI agent can learn through a trial-and-error approach. A high-level diagram of the proposed work is shown in Figure 1. A multi-RAT wireless network scenario has been assumed which includes a cellular and a Wi-Fi RAT. The coverage provided by both RATs has been divided into different zones based on the signal-to-interference ratio (SIR) experienced by the user in a given region. Initially, the user is under the coverage of the cellular RAT only, and it has generated a request for downloading a file. We use the Q-learning algorithm, a model-free RL algorithm, for training the agent; that is why we have named it a Q-agent. It is responsible for taking a sequence of actions by observing states such that the long-term discounted cost of using a network service is minimized.
We formulate the problem of data offloading by assuming that the user equipment (UE) plays the role of the Q-agent, as depicted in Figure 1. The user's request for downloading a file of a certain size in a given duration and at a given location is used for defining states. The Q-agent can take three actions, that is, download data through the cellular RAT, offload data through the Wi-Fi RAT, or remain idle. We define a cost function for using network services which is a function of the actions taken by the Q-agent. We also define a penalty function for missing the defined delay limit. The task of the Q-learning algorithm is to minimize this cost function by learning an optimal data offloading policy that takes the best action in a given state. We have evaluated the performance of the developed Q-agent for making data offloading decisions and also compared the results with an existing model-based analytical offloading approach [4] and some other standard data offloading policies. In the end, we have also discussed the issues and challenges posed by the Q-learning algorithm and how these can be tackled for developing efficient agents that offer near-optimal performance. Advanced AI techniques based on deep learning, such as the deep Q-network (DQN) or double DQN, can also be adopted for tackling the issues posed by the Q-learning algorithm. However, the data offloading problem considered in this work is user-centric. Therefore, its state and action space is small, and a simple algorithm such as Q-learning, with slight modifications, offers satisfactory performance. Moreover, according to the authors in [5], advanced AI techniques have not yet been considered for deployment in wireless networks because of the resource-constrained nature of these networks.
The rest of the paper is organized as follows. The details of the related work are covered in Section 2. The problem formulation and assumptions are discussed in Section 3. The development of the AI-enabled Q-agent is discussed in Section 4. Performance evaluation of the developed Q-agent and comparison with existing data offloading policies is presented in Section 5. Finally, Section 6 concludes the paper.

Related Work
In this section, we provide details of the studies presented in the existing literature that are related to this work in one way or another. A concise summary of these studies and our work is presented in Table 1.
A dynamic offloading algorithm has been presented in [11] for UEs by assuming a multi-RAT wireless network. An autonomous resource allocation policy has also been developed for multiaccess edge servers. The authors used a penalty-based genetic algorithm for learning offloading decisions and a deep Q-network (DQN) for efficient allocation of resources by the edge servers. The authors assumed multiple RATs, including cellular and Wi-Fi RATs, by defining different frequency bands. However, they did not take into account the effect of the channel-accessing scheme employed by the Wi-Fi RAT, which is different from that of the cellular RAT. Moreover, the authors also did not consider the effect of interference due to ultradense deployment of base stations (BSs) while modeling the data rate experienced by a user. In another paper [8], a near-optimal policy for user association in a heterogeneous wireless network has been obtained by employing DQN. Instead of assuming a multi-RAT wireless network scenario, the authors assumed a dual connectivity scenario wherein a user can associate with the BSs of different tiers under the same network, that is, macro- and micro-BSs. The authors selected DQN over SARSA or Q-learning algorithms because they were optimizing a network-centric user association policy which had large state and action spaces. Since in this work we propose a user-centric data offloading approach wherein each UE is responsible for minimizing the cost of using a network service by making automated optimal data offloading decisions, the state and action spaces are small. That is why we selected the Q-learning algorithm instead of DQN.
A multiagent RL-based algorithm for RAT access in a multi-RAT wireless network has been proposed in [9]. The authors assumed one cellular and one Wi-Fi RAT operating in different bands and with different channel-accessing techniques. According to the authors, their proposed approach for RAT access offers better performance compared to traditional data offloading schemes. However, the system model assumed by the authors did not incorporate the effect of dense deployment of the access points (APs) and the resulting interference, which highly impacts the data rate experienced by the users. The authors in [12] presented an incentive-based contract-theoretic approach to motivate third-party operators, like Wi-Fi operators, to share their resources with the overloaded cellular RAT during peak time. However, the main focus of that work was to propose optimal contracts to third-party operators such that they agree to accept the offloaded requests while the profit of the mobile network operators is maximized.
A delay-aware offloading and network selection optimization algorithm has been proposed in [6] by assuming that unlimited cellular RAT coverage and limited Wi-Fi RAT coverage are available to the users. For solving the optimization problem, the authors used the backward induction algorithm, which is computationally expensive. In another study [7], the authors proposed a data offloading approach by dividing the network coverage into various zones where each zone offers a different data rate to a user. They considered the signal-to-noise ratio (SNR) for estimating the data rate offered to a user in a zone. However, for highly dense and heterogeneous wireless networks, SIR is considered a better metric for estimating the coverage and data rate experienced by a user [13]. To better capture the coverage and data rate offered by the Wi-Fi RAT while estimating the data offloading gains, the authors in [10] employed stochastic geometry (SG) modeling techniques. However, that work is limited to the estimation of data offloading gains that can be provided by a Wi-Fi RAT.
The authors in [4] proposed an automated data offloading framework by assuming a multi-RAT wireless network scenario and developed a model-based data offloading policy. Unlike [7], they divided the coverage provided by each RAT into different zones by using SIR and SG modeling techniques. They adopted a model-based RL approach for obtaining an optimal data offloading policy. A model-based RL algorithm requires a transition matrix which depicts the complete stochastic nature of the assumed environment. The authors in [4] utilized SG and a Markov decision process (MDP) for modeling the stochastic nature of the assumed wireless network and for obtaining the corresponding transition matrix so that it can be used by the model-based RL algorithm. However, due to various random factors in the spatial and temporal domains of a wireless network, the traffic characteristics, load, and various other parameters are prone to change. Thus, at any point in time, the practical scenario may deviate widely from the transition matrix derived for a specific scenario. This problem motivates the design of model-free data offloading policies which can learn the network or user behavior in real time and accordingly take the optimal actions. Nevertheless, such approaches pose various challenges and issues when it comes to their convergence and implementation in practical scenarios due to their trial-and-error-based learning approach.

Network Model and Problem Formulation
In this section, we provide details about the wireless network scenario assumed in this work. Moreover, we also formulate the data offloading problem by defining the Markov decision process (MDP), which includes details regarding the Q-agent and its environment, that is, the set of states, the set of actions, and the cost and penalty functions.

Multi-RAT Wireless Network Model.
Similar to [4, 14], we make use of SG modeling techniques for simulating a multi-RAT wireless network which includes a cellular and a Wi-Fi RAT. We assume that each RAT is under the control of the same operator [15]. We adopt homogeneous Poisson point processes (HPPPs) Φ_c and Φ_w, with intensities λ_c and λ_w, for drawing the locations of APs under the cellular and Wi-Fi RATs, respectively. The users are assumed to be distributed as an HPPP with intensity λ_u.

[Figure 1: A data offloading approach for a multi-RAT network using the Q-learning algorithm.]

We assume that all APs belonging to RAT r ∈ {c, w} operate at the same power level P_r over the entire bandwidth B_r. Furthermore, we assume a saturated downlink channel wherein the same resources are shared by all the APs of the cellular RAT, and a single channel is shared by all the APs of the Wi-Fi RAT. As a result, the signal-to-interference ratio experienced by a typical user under RAT r can be approximated as

SIR_r = ς_{y_o} l(‖y_o‖) / Σ_{y ∈ Φ_r \ {y_o}} ê_y ς_y l(‖y‖), (1)

where l(‖d‖) denotes a free-space path loss model, ς_{y_o} and ς_y denote the small-scale fading from the tagged and other BSs, respectively, and ê_y is a medium access indicator function which represents whether an AP of RAT r, located at y, is active or not. For an AP under the cellular RAT (r = c), the indicator function is unity because all the APs are assumed to transmit simultaneously. For an AP under the Wi-Fi RAT (r = w), it can be either zero or unity because not all the APs are allowed to transmit simultaneously due to the contention-based nature of the carrier sense multiple access with collision avoidance (CSMA/CA) channel-accessing scheme [16]. The probability that the network offers a data rate greater than a threshold ρ_r to the user can be defined as

P(R_r > ρ_r), (2)

where the rate offered to a user under RAT r is

R_r = (P̂_r B_r / N_r) log2(1 + SIR_r), (3)

with N_r = λ_u/λ_r the average load per AP and P̂_r the medium access probability (MAP) for an AP. Based on the data rate offered to a user under a RAT, we divide the given region under each RAT into three zones, as depicted in Figure 1. The first zone offers the maximum data rate, the second zone offers
the minimum data rate, and the third zone is like an outage for a user. The probability that a user is located in zone z of RAT r has been defined in [4], and it is given in (4) and (5), where τ^r_z = 2^(ρ^r_z N_r / (P̂_r B_r)) − 1 and ρ^r_z is the data rate threshold of zone z. Here, (5) is obtained after substituting (3) in (2) and rearranging.
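As a minimal illustration of how the zone thresholds behave, the sketch below evaluates τ^r_z = 2^(ρ^r_z N_r/(P̂_r B_r)) − 1 for two hypothetical Wi-Fi zone rate thresholds. All numeric values here are illustrative assumptions, not the parameters from Table 2.

```python
def sir_threshold(rate_thresh_bps, avg_load, map_prob, bandwidth_hz):
    """SIR threshold tau = 2^(rho * N / (P_hat * B)) - 1 for a zone's rate threshold."""
    return 2.0 ** (rate_thresh_bps * avg_load / (map_prob * bandwidth_hz)) - 1.0

# Hypothetical Wi-Fi RAT parameters (illustrative, not from Table 2)
B_w = 20e6          # bandwidth in Hz
N_w = 4.0           # average users per AP (lambda_u / lambda_w)
p_hat = 0.6         # medium access probability
rates = [3e6, 1e6]  # zone-1 and zone-2 rate thresholds in bit/s

taus = [sir_threshold(r, N_w, p_hat, B_w) for r in rates]
# A higher rate threshold requires a higher SIR, so zone 1 is smaller than zone 2
assert taus[0] > taus[1]
```

Note how the threshold grows exponentially with the per-AP load N_r: denser user populations shrink the high-rate zone quickly.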
The users can move in a given region, with the possible locations denoted by the set K, following a widely used Markovian model [6]. The probability that a user moves to location k′ = (c_{i′}, w_{j′}) in the next time slot, given the current location k = (c_i, w_j), is defined in (6), where a scaling factor captures the speed of mobility and P(r_z) is defined in (4). For readability and clarity, we have included the details of the assumed network scenario in this section. For the derivations of these equations, which are obtained by using SG modeling techniques, please see [4, 6, 14].
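Sampling from such a Markovian mobility model amounts to drawing the next location from a categorical distribution. The sketch below is a simplified illustration in which the location set and its probabilities are hypothetical; in the paper, the probabilities follow from the zone probabilities in (4)-(6).

```python
import random

# Hypothetical location set: (cellular zone, Wi-Fi zone) pairs
locations = [("c1", "w1"), ("c1", "w2"), ("c2", "w1"),
             ("c2", "w3"), ("c3", "w3")]

# Illustrative location probabilities (sum to 1); in the paper these
# would be derived from the zone probabilities P(r_z) in (4)
probs = [0.25, 0.20, 0.20, 0.25, 0.10]

def next_location(rng=random):
    """Sample the user's location for the next time slot."""
    return rng.choices(locations, weights=probs, k=1)[0]

loc = next_location()
assert loc in locations
```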

Markov Decision Process Formulation.
An MDP is a discrete stochastic process which is used for sequential decision-making. It is defined by a tuple (S, A, P(s′|s, a), Ω, α), where S is the state space, A is the action space, P(s′|s, a) is the state transition probability, Ω is the cost function, and α is the discount factor. Since we employ a model-free RL algorithm, the transition probability matrix is not required for the problem formulation. The rest of the components are defined in the following subsections.

Q-Learning Agent: States and Actions.
The UE is defined as a Q-agent in this work, responsible for taking a sequence of actions after observing states. Assume that a user generates a request to download ψ bits of data within D units of time. Here, the user's request is expressed in terms of a tuple μ_o(ψ, D). We suppose that the time axis is divided into slots t ∈ T = {1, 2, . . ., D} of fixed length, and the Q-agent is required to take an action at each time epoch. It is assumed that the duration of a time slot is small enough that the state of the system does not change within it. The state of the user, s ∈ S, at a time slot t, is defined as s_t = (k, h, d), where h ∈ [0, ψ] represents the remaining file size in bits, d ∈ [0, D] denotes the remaining time, and k = (c_i, w_j) ∈ K denotes the location of the user specified by the available zones of the cellular and Wi-Fi RATs, respectively. As we assume stationarity, for simplicity, the notation t is omitted from this point onward.
Three possible actions a ∈ A are available for the Q-agent: remain idle (a = 0), download data through the cellular RAT (a = 1), and offload data through the Wi-Fi RAT (a = 2). However, in any state s, with a given location k and for all h, d, the number of permitted actions a ∈ Â(s) is at most two, as defined by the cases of (8). The first case of (8) refers to the decision of offloading through the Wi-Fi RAT when the data rate supported by the cellular RAT is smaller than that of the Wi-Fi RAT. The second case refers to the permitted actions when the data rate supported by the cellular RAT is greater than that of the Wi-Fi RAT. The third case refers to the permitted actions when the Wi-Fi RAT is not available. The fourth case refers to the action when neither RAT is available.
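The four cases of the permitted-action set Â(s) can be sketched as a simple lookup on the rates offered at the user's current location. The exact membership of each case (for example, whether idle accompanies cellular in the second case) is our assumption for illustration.

```python
IDLE, CELLULAR, WIFI = 0, 1, 2

def permitted_actions(rate_cell, rate_wifi):
    """Permitted action set A_hat(s), sketching the four cases of (8).
    rate_cell / rate_wifi: average data rate (bit/s) offered at the user's
    current cellular and Wi-Fi zones (0 if the RAT is unavailable)."""
    if rate_cell == 0 and rate_wifi == 0:
        return {IDLE}                   # case 4: neither RAT available
    if rate_wifi == 0:
        return {IDLE, CELLULAR}         # case 3: Wi-Fi unavailable
    if rate_cell < rate_wifi:
        return {WIFI}                   # case 1: offload, Wi-Fi strictly better
    return {CELLULAR, WIFI}             # case 2: cellular faster, both permitted

assert permitted_actions(0, 0) == {IDLE}
assert permitted_actions(1e6, 3e6) == {WIFI}
```

In every branch, the returned set has at most two elements, matching the constraint stated above.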

Feedback from Environment: Cost and Penalty.
Similar to [6], we assume that the cost of using the cellular RAT for data download is higher than that of the Wi-Fi RAT. This means that, by making an offloading decision to the Wi-Fi RAT, the Q-agent can minimize the cost of data usage for the user. Moreover, while waiting for the availability of the Wi-Fi RAT, the deadline associated with the generated request cannot be ignored, as exceeding it results in a huge penalty in the end. Thus, through the offloading process, the Q-agent is required to minimize the overall cost of downloading the file while maintaining the given QoS requirement.
We adopt a usage-based cost scheme, where a user is charged in proportion to its data usage. Let φ(a) represent the cost of downloading one unit of data by choosing action a, and let ρ(k, a) denote the average supported data rate in bits per second at location k when action a is chosen. The total cost during a time slot, when the Q-agent chooses action a in state s such that the delay timer has not expired, is then given by (9). The penalty for the Q-agent when it is not able to complete the download within D units of time is defined in (10), where Υ(h) is a nondecreasing function of h [6]. Thus, the objective function can be defined as in (11), which implies that the Q-agent is responsible for choosing a sequence of actions such that the accumulated discounted cost of using a network service is minimized.
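A sketch of the slot cost (9) and deadline penalty (10) under a usage-based scheme might look as follows. The per-bit prices, the one-minute slot length, and the linear penalty form Υ(h) = κh are illustrative assumptions, not values from the paper.

```python
def slot_cost(phi, rate, h, slot_len=60.0):
    """Cost of one slot: price per bit times bits actually downloaded.
    phi: cost per bit for the chosen action; rate: supported rate (bit/s);
    h: remaining file size (bits); slot_len: slot duration (s, assumed 1 min)."""
    return phi * min(h, rate * slot_len)

def penalty(h, kappa=1e-6):
    """Deadline penalty, nondecreasing in the remaining bits h (assumed linear)."""
    return kappa * h

# Hypothetical prices: cellular costlier per bit than Wi-Fi
phi_cell, phi_wifi = 2e-8, 5e-9
assert slot_cost(phi_wifi, 3e6, h=500e6) < slot_cost(phi_cell, 3e6, h=500e6)
```

The `min(h, rate * slot_len)` term keeps the user from being charged for more bits than remain in the file during the final slot.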

Development of a Q-Learning Agent for Data Offloading
Q-learning is a model-free RL algorithm in which an agent interacts with its environment and tries to learn the optimal actions for given states through a trial-and-error approach. The quality of an action taken in a given state by the agent is recorded by defining a quality function, denoted by Q_π(s, a). It denotes the expected long-term discounted reward of taking action a in state s by using policy π. In this work, the UE plays the role of the agent, named the Q-agent, and the Q value is defined as the expected long-term discounted cost of taking action a in state s by using policy π. Thus, here the aim of the agent is to find the best policy π*(s) that minimizes this quality function for each (s, a) pair by choosing the optimal action in a given state, i.e., π*(s) = argmin_a Q(s, a). Assume that at the current time epoch, the agent observes state s and takes action a. As a result, it receives the cost Ω(s, a) from the environment for taking action a in state s, and it ends in state s′ at the next time epoch. Thus, the Q value for the (s, a) pair can be updated as

Q(s, a) ← (1 − c) Q(s, a) + c [Ω(s, a) + α min_{a′} Q(s′, a′)], (12)

where α is the discount factor and c is the learning rate, which is defined in [7] as a decreasing function of β(s, a) (see (13)), where β(s, a) denotes the number of times action a has been taken in state s. It has been proved that when β(s, a) is sufficiently large and c is reduced to zero over time, Q(s, a) is guaranteed to converge to Q_{π*}(s, a) [17]. The algorithm for training the Q-agent is given in Algorithm 1, and its details are discussed in the following:
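One step of the cost-minimising update (12) can be sketched as below. The decaying learning rate c = 1/β(s, a) is an assumed form, since the exact definition (13) follows [7]; the states and cost used in the example are hypothetical.

```python
import collections

Q = collections.defaultdict(float)    # Q(s, a) table
beta = collections.defaultdict(int)   # visit counts beta(s, a)
alpha = 0.9                           # discount factor (illustrative value)

def q_update(s, a, cost, s_next, actions_next):
    """One Q-learning step for cost minimisation:
    Q(s,a) <- (1-c) Q(s,a) + c [cost + alpha * min_a' Q(s',a')]."""
    beta[(s, a)] += 1
    c = 1.0 / beta[(s, a)]            # assumed decaying learning rate
    target = cost + alpha * min(Q[(s_next, a2)] for a2 in actions_next)
    Q[(s, a)] += c * (target - Q[(s, a)])

# Hypothetical transition: state (location, bits left, minutes left)
q_update(("k1", 200, 5), 2, 1.0, ("k1", 100, 4), [0, 1, 2])
assert Q[(("k1", 200, 5), 2)] == 1.0  # first visit: c = 1, target = cost + 0
```

Because costs are nonnegative and the target uses a minimum over successor actions, repeatedly visited pairs accumulate larger Q values, which is the property exploited for exploration in the next subsection.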

Q-Learning Algorithm without Employing the ϵ-Greedy Approach.
In the Q-learning algorithm, at decision epochs, the agent decides randomly or based on previously learned Q values which action should be taken in a given state. For minimizing cost, the agent may take low-cost actions it has tried in the past. This is known as the exploitation mode. The agent also needs to try actions it has not taken before, which may play a role in further minimization of the accumulated cost. Therefore, the agent may take one of the actions randomly from the set of available actions to enhance its future decisions. This is known as the exploration mode.
Since Q-learning is a model-free iterative learning algorithm, it is important that exploration and exploitation be performed simultaneously. The agent must observe the effect of taking different actions in a given state and progressively favor those with the minimal cost [17].
In most of the existing literature [7], the ϵ-greedy method is utilized, in which an agent explores with probability ϵ and exploits with probability 1 − ϵ. However, in this work, we did not employ any explicit method for coping with this trade-off, as the Q-learning algorithm by default has a feature which causes it to switch between exploration and exploitation modes during the training of the Q-agent. For example, if the Q(s, a) values for all (s, a) pairs are initialized to the same value, then a random policy can be executed for breaking the ties; here, the use of the random policy is equivalent to the exploration mode. Moreover, if we carefully evaluate (12), when an (s, a) pair has been visited a number of times, its Q value increases. Since, in this work, the agent is required to find the action in a state with the smallest Q value, the less visited (s, a) pairs by default get a chance to be explored. Thus, this insight shows that Q-learning, when defined in terms of a minimization problem, by default has the capability of switching between exploration and exploitation modes.
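This tie-breaking behaviour can be sketched as a greedy selector that picks uniformly at random among the minimal-Q actions; the state names and values below are hypothetical.

```python
import random

def select_action(Q, s, actions, rng=random):
    """Greedy (min-cost) action selection with random tie-breaking.
    With all Q(s, a) initialised equally, ties make the agent explore."""
    vals = [Q.get((s, a), 0.0) for a in actions]
    best = min(vals)
    candidates = [a for a, v in zip(vals and actions, vals) if v == best]
    return rng.choice(candidates)

Q = {("s0", 1): 0.5, ("s0", 2): 0.5}
assert select_action(Q, "s0", [1, 2]) in (1, 2)   # tie: random exploration
Q[("s0", 2)] = 0.1
assert select_action(Q, "s0", [1, 2]) == 2        # smaller cost: exploitation
```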

Results and Discussion
We used Python for creating the simulation setup and implementing Algorithm 1. Unless otherwise specified, the parameters used for generating the results presented in this section are mentioned in Table 2.
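A condensed, illustrative version of such a simulation loop might look like the following. The zone rates, per-bit prices, location set, and uniform mobility are hypothetical stand-ins for the parameters of Table 2; the deadline penalty, the permitted-action restriction, and the Q-value initialisation sweep of Algorithm 1 are omitted for brevity.

```python
import random
import collections

# Hypothetical zone rates (bit/s) and per-bit prices per action
RATES = {"c1": 3e6, "c2": 1e6, "c3": 0.0}           # cellular zones
WRATES = {"w1": 3e6, "w2": 1e6, "w3": 0.0}          # Wi-Fi zones
PHI = {0: 0.0, 1: 2e-8, 2: 5e-9}                    # idle / cellular / Wi-Fi
LOCS = [("c1", "w2"), ("c2", "w1"), ("c2", "w3")]   # illustrative locations
Q = collections.defaultdict(float)
beta = collections.defaultdict(int)

def run_episode(psi, D, alpha=0.9, rng=random):
    """One training episode: returns the remaining file size at the deadline."""
    k, h, d = rng.choice(LOCS), psi, D
    while h > 0 and d > 0:
        s, acts = (k, h, d), [0, 1, 2]
        best = min(Q[(s, a)] for a in acts)
        a = rng.choice([a for a in acts if Q[(s, a)] == best])  # tie-break
        rate = {0: 0.0, 1: RATES[k[0]], 2: WRATES[k[1]]}[a]
        dl = min(h, rate * 60)                      # bits moved in a 1-min slot
        cost = PHI[a] * dl
        k2, h2, d2 = rng.choice(LOCS), h - dl, d - 1
        beta[(s, a)] += 1
        c = 1.0 / beta[(s, a)]
        target = cost + alpha * min(Q[((k2, h2, d2), a2)] for a2 in acts)
        Q[(s, a)] += c * (target - Q[(s, a)])
        k, h, d = k2, h2, d2
    return h

left = run_episode(psi=200e6, D=4)
assert 0 <= left <= 200e6
```

Running many such episodes drives `Q` toward the cost-minimising policy described in Section 4.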
For training the Q-agent, we executed 120 × 10^4 episodes of Algorithm 1. The numbers of times the Q-agent observed certain states, irrespective of the actions taken, are reported in Figure 2. According to [17], the Q-learning algorithm is guaranteed to find an optimal solution if the number of visits to each (s, a) pair is sufficiently large. However, in practical scenarios, it is highly likely that some states are observed more often than others. We have reported the results in Figures 2(a)-2(c) for the states when the remaining file size (h) to download is 200 Mbits, 500 Mbits, and 800 Mbits, respectively, as a function of the remaining delay (d) and the user's location (k). It is evident from the numbers reported in Figure 2 that some states are visited more often than others. One of the main reasons behind such results is user mobility; that is, the locations with higher probabilities are visited more often than the others. Furthermore, the states with higher d and larger h are visited less often, as evident from Figure 2, because a state with, for example, h = 800 Mbits and d = 10 min can only occur at the beginning of an episode.

Algorithm 1: Training of the Q-agent.
(1) Initialization:
(2)   π(a|s) as a random uniform policy
(3)   Q(s, a) ← Ω(s, a) + Σ_{∀a_i} Σ_{∀s′} π(a_i|s) Q(s′, a_i)
(4)   β(s, a) ← 0, ∀s ∈ S, a ∈ A
(5) for each download request μ_o(ψ, D) (episode) do
(6)   define state s(k, h, d): d = D, h = ψ, k randomly generated using (6)
(7)   while download is not complete (h > 0) do
(8)     if Q(s, a) is the same ∀a then
(9)       choose action a at random
(10)    else
(11)      choose a = argmin_a Q(s, a)
(12)    end if
(13)    take action a
(14)    update c(s, a) by using (13)
(15)    if d > 0 then
(16)      obtain Ω(s, a) using (9)
(17)      obtain …

We have reported the data offloading policy learned by the Q-agent in Figure 3, that is, the optimal actions taken by the Q-agent in the same states as mentioned in Figure 2.
We have included data for only the three most important locations wherein the decision-making is challenging, just to give better insights. For example, the decision is simple at the locations where both RATs offer the same data rate, that is, to offload data through the Wi-Fi RAT. However, it is challenging for the Q-agent to choose the correct action at the locations where the two RATs offer different data rates. For example, at location (c_2, w_1), the Wi-Fi RAT offers a data rate greater than that of the cellular RAT; therefore, the Q-agent has learned to offload data through the Wi-Fi RAT only, as evident from Figure 3(a). At location (c_2, w_3), the cellular RAT is available, but the Wi-Fi RAT is not. The Q-agent has learned to download data through the cellular RAT in this case, as evident from Figure 3(a), although it could wait for the availability of the Wi-Fi RAT for higher d. However, due to the stochastic nature of the wireless network, it is possible that the Q-agent observes locations where neither RAT offers any data rate. That is why the Q-agent has learned to download data through the cellular RAT even for higher d, which is evident from Figure 3(a). Moreover, since Q-learning is a model-free learning algorithm, minor fluctuations in the Q-agent's decisions are possible because of the stochastic nature of the environment. At location (c_1, w_2), the data rate supported by the cellular RAT is higher than that of the Wi-Fi RAT. Since h = 200 Mbits, the file can easily be downloaded in zone w_2 of the Wi-Fi RAT for d > 4 min. Therefore, except for d = 6 min, the Q-agent has learned the optimal actions at this location; that is, it has decided to download data through the cellular RAT for lower d and offload data through the Wi-Fi RAT for higher d, as evident from Figure 3(a).
All the actions learned by the Q-agent in Figure 3(b) are optimal. For larger h and d ≤ 7 min, it is important to download data through the cellular RAT when the Wi-Fi RAT offers a lower data rate or is not available. In Figure 3(c), we have reported the results for h = 800 Mbits as a function of d and k. The Q-agent has not learned the optimal actions for most of the states in Figure 3(c) because, due to the larger h, most of these states have not been visited, as evident from Figure 2(c) and as discussed in the previous paragraphs. Thus, if such never-visited states are observed, the Q-agent employs the random policy for taking actions. At location (c_1, w_2), the Q-agent has learned to download data through the cellular RAT only, as evident from Figure 3(c), because the data rate supported by c_1 is greater than that of w_2. Moreover, since h = 800 Mbits, the RAT which offers the higher data rate must be selected to successfully complete the download before d ⟶ 0. Thus, the Q-agent has learned the optimal actions for these states. Similarly, at location (c_2, w_1), the Q-agent has learned to offload data through the Wi-Fi RAT because the data rate supported by zone w_1 is greater than that of c_2. Thus, we can conclude that the Q-agent has learned the optimal actions for almost all the states. Although it appears to have learned a few incorrect decisions as well, such decisions have been learned due to the stochastic nature of the environment and can change over time after sufficient experience.
The remaining file size (h) after hitting the deadline (d) at the end of each learning episode is reported in Figure 4. We have reported the results in Figures 4(a)-4(c) for the episodes in which the users have generated requests for ψ = 200 Mbits, 500 Mbits, and 800 Mbits, respectively. The episodes are denoted in sorted order from left to right as a function of the delay limit D, and the color bar is used to represent it. A file with ψ ≥ 200 Mbits cannot be successfully downloaded within D = 1 min, no matter which RAT the Q-agent chooses in the assumed scenario, because the maximum data rate supported by both RATs is 3 Mbps, so at most 180 Mbits can be downloaded in 1 min, given that the user is located in the zone which supports the maximum data rate. That is why h ≈ 200 Mbits for almost all the episodes with D = 1 min. However, for higher D, the Q-agent tries to minimize the remaining file size, that is, h ⟶ 0 as D increases, which is evident from Figure 4(a). Moreover, with each learning episode, the Q-agent has improved its decision-making capability. As evident from Figure 4, during the initial episodes, the download is incomplete for most of the cases, that is, h > 0. However, with each passing episode, h has been reduced, and it ultimately approaches zero. It is important to note here that for larger ψ, as in Figures 4(b) and 4(c), a successful download is possible only for higher D. That is why, even after learning for quite a large number of episodes, the agent is unable to successfully complete the download in certain cases. Nevertheless, given the network availability and higher D, the agent has learned to successfully download larger files by taking a correct sequence of actions.
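The feasibility argument above is a one-line calculation:

```python
max_rate = 3e6                 # best-zone data rate in bit/s
slot = 60                      # one minute in seconds
max_bits_per_min = max_rate * slot

assert max_bits_per_min == 180e6   # at most 180 Mbits per minute
assert max_bits_per_min < 200e6    # so psi = 200 Mbits needs D > 1 min
```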
The accumulated payment for downloading ψ bits of data in D min is shown in Figure 5 for a few randomly selected episodes at the end of the training period of the Q-agent. The minimum payment for downloading ψ bits of data, by using the Wi-Fi RAT only, is shown by a double-dashed line in Figure 5. The maximum payment for downloading ψ bits of data, by using the cellular RAT only, is shown by a dashed-dotted line in Figure 5. These minimum and maximum payment limits serve as a reference for evaluating the performance of the data offloading policy learned by the Q-agent. The average payment for downloading a file, as a result of the actions taken by the Q-agent, is represented by a solid line. For ψ = 200 Mbits, the file can be successfully downloaded for D ≥ 4 min. That is why, for ψ = 200 Mbits in Figure 5, the Q-agent has taken a sequence of steps, for most of the episodes, which has resulted in the minimal payment. However, the average payment for downloading the file, as a result of the actions taken by the Q-agent, is above the minimum threshold. This is due to the stochastic nature of the wireless network: the Wi-Fi RAT may not be available at certain locations, and the Q-agent must download data through the cellular RAT to complete the download in such situations. As a result, the average payment for downloading the file is slightly higher. However, it is interesting to note that the average payment is much smaller than the maximum threshold and closer to the minimum threshold, which implies that the Q-agent has mostly offloaded data through the Wi-Fi RAT.
It is evident from Figure 5(a) that for ψ ≥ 500 Mbits, the average payment for downloading the file is approximately equal to or smaller than the minimum threshold. This is possible only in those situations in which the file download has not been completed successfully within the defined D. Since D in Figure 5(a) is only 4 min, larger files cannot be successfully downloaded even if the Q-agent always chooses the RAT which supports the maximum data rate. On the other hand, in Figure 5(b), D = 8 min. As a result, the average payment for the file download is above the minimum threshold because, for most of the episodes, the Q-agent has successfully downloaded the files with larger sizes as well.
We have evaluated the performance of the data offloading policy learned by the Q-agent and compared it with the standard policies and the analytical (Ana.) approach presented in [4]. The evaluation results are reported in Figure 6. In the always offload (AO) policy, the data are downloaded by using the Wi-Fi RAT only. In the no offload (NO) policy, the data are downloaded using the cellular RAT only. In the on-the-spot offload (OTSO) policy, the data are offloaded through the Wi-Fi RAT whenever it is available; otherwise, they are downloaded through the cellular RAT. The NO approach has resulted in the maximum payment, which is evident from Figure 6(a), because the cost of using the cellular RAT is larger than that of the Wi-Fi RAT. Moreover, due to user mobility and the unavailability of the cellular RAT at certain locations, the NO approach could not successfully download the file even for larger D > 7 min, as evident from Figure 6(b). Similarly, the AO approach has resulted in the minimum payment, as evident from Figure 6(a), because the cost of using the Wi-Fi RAT is smaller than that of the cellular RAT. However, due to user mobility and the unavailability of the Wi-Fi RAT at certain locations, it suffers from the same issue of incomplete data download even for larger D > 7 min, as evident from Figure 6(b). Since OTSO exploits both RATs, given their availability, and prefers the Wi-Fi RAT over the cellular RAT, the cost of downloading data is smaller than for NO and slightly larger than for AO, which is evident from Figure 6(a). Moreover, since the OTSO approach uses both RATs, it has successfully completed the download request for larger D > 7 min, as evident from Figure 6(b). The Ana. approach presented in [4] uses both RATs for data download; however, for lower delay limits, it prefers the RAT which offers a higher data rate so that the data download can be completed. That is why, in Figure 6(a), the payment for the Ana. approach is slightly larger and the remaining file size in Figure 6(b) is slightly smaller than for the OTSO approach.
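The three baseline policies can be sketched as simple rules over the rates available at the user's current location, using the action encoding from Section 3 (0 idle, 1 cellular, 2 Wi-Fi); the rate arguments are per-slot inputs, with 0 meaning the RAT is unavailable.

```python
def ao_policy(wifi_rate, cell_rate):
    """Always offload (AO): Wi-Fi only; idle while Wi-Fi is unavailable."""
    return 2 if wifi_rate > 0 else 0

def no_policy(wifi_rate, cell_rate):
    """No offload (NO): cellular only; idle while cellular is unavailable."""
    return 1 if cell_rate > 0 else 0

def otso_policy(wifi_rate, cell_rate):
    """On-the-spot offload (OTSO): Wi-Fi when available, else cellular."""
    if wifi_rate > 0:
        return 2
    return 1 if cell_rate > 0 else 0

assert ao_policy(0.0, 3e6) == 0      # AO waits for Wi-Fi
assert no_policy(0.0, 3e6) == 1      # NO ignores Wi-Fi entirely
assert otso_policy(0.0, 3e6) == 1    # OTSO falls back to cellular
assert otso_policy(1e6, 3e6) == 2    # OTSO prefers Wi-Fi when present
```

These rules make the trade-off in Figure 6 intuitive: AO minimises payment but stalls without Wi-Fi, NO maximises payment, and OTSO sits between them.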
As evident from the results reported in Figure 6, for delay limits D < 7 minutes, the performance of the data offloading policy learned by the Q-agent is comparable to that of the model-based analytical approach. Although the payment of the Ana. approach and the Q-agent is slightly higher compared to the OTSO and AO policies, the remaining file size h is much smaller, as evident from Figure 6(b). This implies that these approaches have tried to complete the download without waiting for the availability of Wi-Fi RAT because D is short. On the other hand, for D > 7 minutes, the OTSO, Ana., and Q-agent approaches have all successfully completed the download of the file. However, for larger D, the payment of the Ana. approach is slightly higher because it has downloaded data through cellular RAT without waiting much for the availability of Wi-Fi RAT. Since D is large, the Q-agent has waited for the availability of Wi-Fi RAT in this case and tried to minimize the payment as well. The AO policy has obtained the minimal payment because it always downloads data through Wi-Fi RAT. Thus, it is clear that close-to-best and stable performance has been offered by the Q-agent for different D. For lower D, it tried to complete the download at the cost of a slightly larger payment. On the other hand, for higher D, it tried to minimize the payment while successfully completing the download.

Conclusion and Future Work
In this work, we developed an AI-enabled Q-agent for making data offloading decisions in a multi-RAT wireless network by using a model-free Q-learning algorithm. Although model-free learning algorithms offer quite a good set of features, their successful implementation poses various challenges. Therefore, we also discussed a few of these challenges along with their possible solutions. To speed up the learning process, we initialized the Q(s, a) values of the Q-learning algorithm by employing a single back-up sweep. Moreover, we exploited an inherent feature offered by the Q-learning algorithm, by redefining it in terms of an expectation minimization problem, to balance the trade-off between the exploration and exploitation modes. We evaluated the performance of the trained Q-agent and also compared it against an existing analytical data offloading approach [4] and other offloading policies, namely, always offload, no offload, and on-the-spot offload. The results showed that the performance of the Q-agent developed in this work is near optimal for different data download requests. For lower delay limits, the performance of the Q-agent for making data offloading decisions is close to the model-based approach presented in [4], which tries to complete the download at the cost of a higher payment. For higher delay limits, its performance is close to the on-the-spot offloading policy, which tries to minimize the payment. Thus, the Q-agent has learned to make intelligent and near optimal decisions under different situations. Future work includes the development of such adaptive and optimal agents for 6G wireless networks by using advanced AI techniques such as DQN or double DQN.

Figure 1: An illustration of the multi-RAT wireless network, which includes cellular and Wi-Fi RATs, and a UE playing the role of the Q-agent for taking a sequence of optimal actions under different situations.

Table 1 entries (the per-column comparison marks were lost in extraction): [7]; a user association approach in a heterogeneous wireless network using DQN [8]; RAT access in a multi-RAT network using a multiagent RL algorithm [9]; estimation of offloading gains provided by Wi-Fi RAT by using stochastic geometry [10]; dynamic offloading and resource allocation in multiaccess edge servers using the genetic algorithm and DQN [11]; an incentive-based contract-theoretic approach for third-party operators [12]; a model-based data offloading approach for a multi-RAT wireless network [4]; and the model-free AI-enabled approach, presented in this work, for making data offloading decisions in a multi-RAT wireless network. The users are distributed according to another HPPP Φ_u, with intensity λ_u.

4.1. Initialization of Q Values. Poor initialization of the Q values, like Q(s, a) ⟵ 0, Q(s, a) ⟵ 1, or Q(s, a) ⟵ uniformly distributed ∀(s, a) pairs, can badly affect the overall learning curve of the Q-agent and the convergence speed of the Q-learning algorithm. However, initialization based on the context of the problem can greatly help in speeding up the learning process. Therefore, in this work, as clear from Algorithm 1, we initialize the Q values by exploiting a single back-up sweep under a uniform random policy. Since, at the end of each episode, the Q-agent observes a penalty if the deadline is missed, the single back-up sweep spreads the influence of this penalty throughout the Q values for all (s, a) pairs, which overall improves the learning process.
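The single back-up sweep described above can be sketched as one application of the Bellman back-up, seeding every Q(s, a) with the expected penalty-informed value of the next state. The sketch below uses a toy MDP (random transition model, negative rewards standing in for payments, and a terminal penalty value); the paper's actual state space is the (k, h, d) tuple of Figure 3, so sizes and values here are illustrative assumptions only.

```python
import numpy as np

# Hedged sketch of the single back-up sweep used to initialize Q(s, a).
# The MDP below (state/action counts, transitions P, rewards R, penalty V)
# is a toy stand-in for the paper's (k, h, d) state space.

n_states, n_actions, gamma = 4, 3, 0.9
rng = np.random.default_rng(0)

# Toy transition probabilities P[s, a, s'] (rows sum to 1) and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = -rng.random((n_states, n_actions))   # negative rewards model payments
V_penalty = np.full(n_states, -5.0)      # deadline-miss penalty as a value

# One Bellman back-up: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s').
# This spreads the influence of the penalty across all (s, a) pairs at once.
Q_init = R + gamma * P @ V_penalty

print(Q_init.shape)  # (4, 3)
```

A zero or uniform initialization would start the agent with no notion that missing the deadline is costly; the sweep bakes that signal in before the first episode.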

Figure 2: (caption lost in extraction).
Figure 3: The actions learned by the Q-agent, a = π*(s), as a function of the states s(k = (c_z, w_z), h, d). Here, one marker denotes remain idle, △ denotes download data through cellular RAT, and * denotes download data through Wi-Fi RAT.

Figure 4: Remaining file size (h) at the end of each episode as a function of D.

Figure 5: Accumulated payment at the end of a few randomly selected episodes as a function of D. (a) D = 4 minutes. (b) D = 8 minutes.
Figure 6: (caption lost in extraction).

Table 1: A summarized comparative analysis of the existing approaches with the approach presented in this work.

Table 2: The default parameters used for simulating the multi-RAT wireless network scenario and training the Q-agent. The penalty is bh², where h is in Mbits and b = 0.001. […] at the current decision epoch, then at the next decision epoch, h ≤ 800 Mbits and d < 10 minutes. This implies that the states with h ≪ 800 Mbits and d ≪ 10 minutes are visited more often.
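The quadratic term bh² that appears among the Table 2 parameters can be illustrated numerically. Assuming, as the discussion of Q-value initialization suggests, that it acts as a penalty on the file size h (in Mbits) remaining when the deadline expires, a minimal sketch is:

```python
# Hedged sketch of the quadratic term b*h^2 from Table 2, assumed here to be
# the deadline-miss penalty on the remaining file size h (in Mbits), with the
# stated coefficient b = 0.001.

def deadline_penalty(h_mbits: float, b: float = 0.001) -> float:
    """Quadratic penalty on the file size left when the deadline expires."""
    return b * h_mbits ** 2

# Missing the deadline with 800 Mbits remaining costs 0.001 * 800^2 = 640,
# while a completed download (h = 0) incurs no penalty.
print(deadline_penalty(800))  # 640.0
print(deadline_penalty(0))    # 0.0
```

The quadratic shape punishes large unfinished downloads disproportionately, which is consistent with the Q-agent learning to prioritize completion when the delay limit is short.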