This paper proposes a WiFi offloading algorithm based on Q-learning and MADM (multiattribute decision making) for heterogeneous networks, targeting a mobile user scenario in which cellular and WiFi networks coexist. A Markov model is used to describe the changes of the network environment. Four attributes, user throughput, terminal power consumption, user cost, and communication delay, are considered to define a user satisfaction function reflecting QoS (Quality of Service), and Q-learning is used to optimize it. Through AHP (Analytic Hierarchy Process) and TOPSIS (Technique for Order Preference by Similarity to an Ideal Solution) in MADM, the intrinsic connection between each attribute and the reward function is obtained. The user applies Q-learning to make offloading decisions based on current network conditions and the user's own offloading history, ultimately maximizing satisfaction. Simulation results show that the user satisfaction of the proposed algorithm is better than that of traditional WiFi offloading algorithms.
This work was supported by the National Natural Science Foundation of China (61971239, 61631020).

1. Introduction
With the popularity of smart devices, cellular data traffic is growing at an unprecedented rate. Cisco visual network index [1] predicts that global mobile data traffic will reach 49 exabytes per month in 2021, which is equivalent to six times that of 2016. In order to solve the problem of data traffic explosion, we can add cellular BS (base station) or upgrade the cellular network into networks such as LTE (long-term evolution), LTE-A (LTE-Advanced), and WiMAX release 2 (IEEE 802.16m), but this is usually not economical, which requires expensive CAPEX (capital expenditure) and OPEX (operating expense) [2]. In addition, the limited licensed band is another bottleneck to improve network capacity [3]. As a result, mobile data offloading technology [4] has gradually become a mainstream in 5G, and WiFi offloading is one of the most effective offloading solutions.
WiFi offloading technology transfers part of the cellular network load to the WiFi network through WiFi APs (access points), by which we can relieve congestion in the licensed band, achieve load balancing, and fully utilize unlicensed spectrum resources. Due to its effectiveness, WiFi offloading has been studied extensively. Li et al. [5] considered the coexistence of WiFi and LTE-U on unlicensed bands and offloaded LTE-U services to WiFi networks, establishing multiple targets for maximizing LTE-U user throughput while optimizing WiFi user throughput; the authors used a Pareto optimization algorithm to obtain the optimal value. In [6], a satisfaction function reflecting the user communication rate is defined in a scenario of overlapping WiFi and cellular networks, and a resource block allocation matrix is constructed. Based on exact potential game theory, a best-response algorithm is used to optimize the total system satisfaction. Cai et al. [7] proposed an incentive mechanism to compensate cellular users who are willing to delay their traffic for WiFi offloading. The authors calculated the optimal compensation value according to the available attribute parameters in the scenario and modeled the problem in two stages. In the first stage of the Stackelberg game, the operator announces that it will provide users with uniform compensation for delaying their cellular services. In the second stage, each user decides whether to join the delayed offloading based on the compensation, network congestion, and an estimate of the waiting cost for WiFi connection. From the perspective of operators, Kang et al. [8] formulated the mobile data offloading problem as a utility maximization problem. The authors established an integer programming problem and obtained a mobile data offloading scheme by considering the relaxed condition.
The authors further proved that when the number of users is large, the proposed centralized data offloading scheme is near optimal. Jung et al. [9] proposed a user-centric, network-assisted WiFi offloading model, in which heterogeneous networks are responsible for collecting network information and users make offloading decisions based on this information to maximize their throughput. In a heterogeneous network scenario composed of LTE and WiFi, aiming at maximizing the minimum energy efficiency of users, a closed-form expression is proposed in [10] to calculate the number of users to be offloaded, and the users with the smallest SINR (signal to interference and noise ratio) are offloaded to the WiFi network. According to the above references, the most challenging problem in WiFi offloading is how to make the offloading decision, that is, how to choose the most suitable WiFi AP for communication. Fakhfakh and Hamouda [11] aimed to minimize the residence time in the cellular network and optimized it by Q-learning; their reward function considers SINR, handover delay, and AP load. By offloading cellular services to the best nearby WiFi AP, operators can greatly increase their network capacity, and users' QoS will also improve. However, the above references only make an immediate offloading decision based on the current network conditions, without considering the user's previous access history. In addition, most of them optimize the offloading decision for one particular attribute, such as throughput or energy efficiency, without considering multiple network attributes for comprehensive decision making.
In this paper, for the mobile user scenario where the cellular base station and the WiFi AP coexist, a Q-learning scheme is used to make the offloading decision, considering both the current network conditions and the access history. By taking its own access history into account, the user accumulates offloading experience, which not only avoids offloading to poor networks that were previously accessed but also actively selects the best WiFi AP according to the maximum discounted cumulative reward, which in turn improves the user's QoS. Four attributes, user throughput, terminal power consumption, user cost, and communication delay, are considered, and the reward function in Q-learning is defined by TOPSIS. In addition, if the service type differs, the importance of each network attribute differs as well; we use AHP to define the weight of each network attribute according to the specific service type. The mobile terminal collects the various attributes of the heterogeneous network, and the user continuously updates the discounted cumulative reward by combining the instant reward and the experience reward until convergence. After convergence, the user can make the best offloading decision in each state.
The rest of this paper is arranged as follows. Section 2 gives the system model of WiFi offloading in heterogeneous networks. Section 3 builds the Q-learning model, defines the reward function model based on AHP and TOPSIS, and gives the specific steps of the WiFi offloading algorithm. In Section 4, the simulation results are presented and analysed. Finally, Section 5 concludes the paper.
2. System Model
The system model in this paper is shown in Figure 1. A cellular base station is located in the center of a cell with radius rcell. There are NAP WiFi APs in the cell, represented as APk, k ∈ {1, 2, …, NAP}. The cell is covered by overlapping cellular and WiFi networks. These networks are divided into valid and invalid networks: when the throughput of the user accessing a network is greater than a threshold, we regard it as a valid network; otherwise, it is an invalid network. The mobile multimode terminal is the agent of Q-learning, and it can transmit data through both the cellular network and the WiFi network. The agent moves in a straight line inside the cell, marking each position it passes as Posi, i ∈ {1, 2, …, Np}, where Np is the total number of positions the user has passed. Due to the movement of the agent, the network environment, such as channel quality and available bandwidth, is constantly changing, which causes the network attributes of the user to change. This paper regards the four network attributes of the agent at different locations as the state in Q-learning, including throughput, power consumption, cost, and delay. In addition, we regard the offloading decision as the action choice in Q-learning, and mobile data is offloaded if the agent chooses a WiFi network.
System model of WiFi offloading. The system model consists of a cellular BS, a few WiFi APs, one moving agent, and some other users.
Figure 2 shows the algorithm structure based on Q-learning. The agent first collects the network environment information, filters out invalid networks, and calculates four attributes of user throughput (TP), terminal power consumption (PC), user cost (C), and communication delay (D) of the valid network. The AHP algorithm is used to calculate the weights of the four attributes under different services, and the instant rewards obtained by selecting each network under the current state are calculated by TOPSIS. In combination with the instant reward and the experience reward, the Q-learning iteration is performed and the Q-table is updated. As a result, the offloading decision is made based on the discounted cumulative reward in Q-table.
Algorithm structure based on Q-learning.
This paper reflects the performance of the network from four aspects: throughput, power consumption, cost, and delay. The throughput reflects the rate of wireless transmission. According to the large-scale fading model of the wireless channel in [12], combined with a small-scale fading model, when the distance between the agent and the cellular BS or WiFi AP is d, the path loss is defined as

(1) L = L0 + 10α log10(d/d0) + FRayleigh(θ, β),

where d0 is the reference distance, L0 is the path loss when the distance between the agent and the BS or AP is d0, α is the path loss exponent, and FRayleigh(θ, β) is the Rayleigh fading term with Gaussian distribution of mean θ and variance β. The signal power Pir received at the BS or AP from an agent d away at the i-th position is expressed as

(2) Pir = Pit − L,

where Pit is the transmit power of the terminal, which is not fixed. By the Shannon capacity formula [13], we can get the throughput of the agent accessing a network at the i-th position:

(3) ViTP = W × log2(1 + Pir/(N0 × W)),

where N0 is the additive white Gaussian noise power spectral density and W is the available bandwidth of the agent. Since the available bandwidth of the network is constantly changing, and each AP or BS serves other users besides the agent at the same time, which affects the bandwidth available to the agent, this paper uses a Markov model to describe the change of W and quantizes the continuous W into Nmarkov states. The available bandwidth moves to one of the two adjacent states with probability ptr or remains unchanged with probability 1 − ptr.
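Equations (1)-(3) and the Markov bandwidth model can be sketched as follows. This is a minimal illustration, not the paper's simulator: the default L0 and α are the cellular values from Table 3, the Rayleigh term is passed in as a precomputed dB value, and the transition probability ptr and the handling of the edge states of the quantized bandwidth are our assumptions.

```python
import math
import random

def path_loss_db(d, d0=1.0, L0=5.27, alpha=3.76, rayleigh_db=0.0):
    """Eq. (1): L = L0 + 10*alpha*log10(d/d0) + F_Rayleigh(theta, beta).
    The Rayleigh fading term is supplied as a dB value drawn elsewhere."""
    return L0 + 10 * alpha * math.log10(d / d0) + rayleigh_db

def throughput_bps(p_tx_dbm, d, bandwidth_hz, n0_dbm_hz=-174.0):
    """Eqs. (2)-(3): received power P_r = P_t - L, then Shannon capacity
    V = W * log2(1 + P_r / (N0 * W)), converting dBm quantities to linear SNR."""
    p_rx_dbm = p_tx_dbm - path_loss_db(d)
    noise_dbm = n0_dbm_hz + 10 * math.log10(bandwidth_hz)  # N0 * W in dBm
    snr = 10 ** ((p_rx_dbm - noise_dbm) / 10)
    return bandwidth_hz * math.log2(1 + snr)

def step_bandwidth(state, n_states, p_tr=0.2, rng=random):
    """Markov bandwidth model: move to an adjacent quantized state with total
    probability p_tr (split evenly between the two neighbours), otherwise
    stay; the p_tr value and edge handling are assumptions."""
    u = rng.random()
    if u < p_tr / 2 and state > 0:
        return state - 1
    if u < p_tr and state < n_states - 1:
        return state + 1
    return state
```

As expected from equation (1), the throughput decreases monotonically with distance for a fixed transmit power and bandwidth.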
Power consumption is an important attribute to be considered for the operation of mobile terminals. According to [14], it is assumed that the minimum received power threshold of the BS or AP is Pminr. When the transmit power of the terminal is too small, the BS and AP will not receive the uplink signal of the terminal. To ensure the normal transmission of data, we define the minimum transmit power Pmint of the terminal as

(4) Pmint = Pminr + L.
The actual transmit power Pit of the terminal must be greater than Pmint. In this paper, the power consumption of the agent accessing a network at the i-th location is expressed as

(5) ViPC = P0 + Pit,

where P0 is the fixed operating power consumption of the terminal and Pit is the transmit power of the terminal.
The operator charges the agent whether it accesses the cellular BS or a WiFi AP. In this paper, the price per second after the agent accesses a network at the i-th location is defined as ViC, which represents the relative price of the two networks. It is usually cheaper if the user chooses to offload.
Communication delay is also an important indicator for users to evaluate the network. In this paper, the transmission delay after the agent accesses a network at the i-th location is defined as ViD. Because of CSMA/CA (Carrier Sense Multiple Access with Collision Avoidance), the delay is longer when the user accesses WiFi, which makes ViD larger than when accessing the BS.
This paper considers the above four network attributes to calculate the satisfaction Φjsat of the agent in the whole mobile scenario.
Firstly, we calculate the average of the four network attributes at Np locations; that is, VaveTP=∑iViTP/Np, VavePC=∑iViPC/Np, VaveC=∑iViC/Np, and VaveD=∑iViD/Np.
Then, we normalize the four values using the method in [15]:

(6) u = (U − Umin)/(Umax − Umin) when U is a positive attribute, and u = (Umax − U)/(Umax − Umin) when U is a negative attribute,

where Umax is the maximum possible value of the attribute and Umin is the minimum possible value. For user satisfaction, the greater the throughput, the better the satisfaction of the agent, so throughput is a positive attribute. The other three attributes should be kept as small as possible and are negative attributes. The normalized values of the four network attributes are Vavetp = (VaveTP − VminTP)/(VmaxTP − VminTP), Vavepc = (VmaxPC − VavePC)/(VmaxPC − VminPC), Vavec = (VmaxC − VaveC)/(VmaxC − VminC), and Vaved = (VmaxD − VaveD)/(VmaxD − VminD).
Combining the attribute weights of different services obtained by the AHP algorithm, the satisfaction of the user in the entire mobile scenario is defined as the sum of the weighted normalized attribute values:

(7) Φjsat = wjtp × Vavetp + wjpc × Vavepc + wjc × Vavec + wjd × Vaved, j ∈ {1, 2},

where j is the user service type, j = 1 is the streaming media service, j = 2 is the conversation service, and wjtp, wjpc, wjc, and wjd are the AHP weights of throughput, power consumption, cost, and delay when the service type is j.
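The normalization in equation (6) and the weighted satisfaction in equation (7) can be sketched as follows; the weight and value dictionaries at the bottom are hypothetical illustration values, not results from the paper.

```python
def normalize(u, u_min, u_max, positive=True):
    """Min-max normalization, eq. (6): positive attributes (throughput) map
    larger values toward 1; negative attributes (power consumption, cost,
    delay) map smaller values toward 1."""
    x = (u - u_min) / (u_max - u_min)
    return x if positive else 1.0 - x

def satisfaction(weights, values):
    """Weighted satisfaction, eq. (7): Phi = sum over h in {tp, pc, c, d} of
    w_h times the normalized average attribute value; constraints c1 and c2
    require the weights to lie in (0, 1) and sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6
    return sum(weights[h] * values[h] for h in weights)

# Hypothetical weights and normalized averages, for illustration only.
w = {"tp": 0.5, "pc": 0.2, "c": 0.2, "d": 0.1}
v = {"tp": 0.8, "pc": 0.6, "c": 0.9, "d": 0.4}
phi = satisfaction(w, v)  # 0.5*0.8 + 0.2*0.6 + 0.2*0.9 + 0.1*0.4 = 0.74
```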
The optimization goal of this paper is to find the best offloading decisions of the user to maximize the satisfaction over the entire mobile scenario:

(8) Π∗ = argmaxΠ∈Ω Φjsat
s.t. c1: 0 < wjh < 1, h ∈ {tp, pc, c, d},
c2: Σh wjh = 1, h ∈ {tp, pc, c, d},
c3: Pit > Pmint, i ∈ {1, 2, …, Np},

where Ω = A1 ⊗ A2 ⊗ ⋯ ⊗ ANp is the total action space of the user during the whole movement process, in which Ai is the action set when the agent passes position i; it is the Cartesian product of the action sets of the Np positions, and Π∗ is the optimal offloading strategy for the whole moving process. In equation (8), c1 and c2 indicate that the weight of each network attribute lies between 0 and 1 and that the weights sum to 1; c3 indicates that the user's transmit power is greater than the minimum transmit power at each position. However, because the action space is very large and the network environment, such as the available bandwidth, is constantly changing, this optimization problem is difficult to solve with traditional methods, so we use Q-learning.
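To see why exhaustive search over Ω is infeasible, count the joint strategies: each of the Np positions offers NAP + 1 choices (the cellular BS or one of the NAP APs). A one-line sketch with the Section 4 values:

```python
def action_space_size(n_ap, n_pos):
    """|Omega| = (NAP + 1)^Np: one network choice per position."""
    return (n_ap + 1) ** n_pos

# With NAP = 30 APs and Np = 10 positions (Section 4), the exhaustive
# search space is 31**10, roughly 8.2e14 joint strategies.
size = action_space_size(30, 10)
```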
3. WiFi Offloading Algorithm Based on Q-Learning and MADM
For the mobile user scenario where the cellular BS and the WiFi AP coexist, we propose a WiFi offloading algorithm based on Q-learning and MADM. Considering the current network conditions and the access history, the Q-learning algorithm is used to make the offloading decision, which will not only avoid offloading to the poor network that was previously accessed but also actively select the best WiFi AP according to the maximum discounted cumulative reward. MADM is an effective decision-making method when we need to consider a variety of factors. According to [16], attribute weight and network utility value are of great importance in MADM. We use two MADM algorithms in this paper, called AHP and TOPSIS. AHP is used to define the weight of each network attribute according to the specific service type. TOPSIS is used to obtain the instant reward of Q-learning based on the network utility. The agent collects various attributes of the heterogeneous network and continuously updates his discounted cumulative reward in combination with the instant reward and the experience reward. After the convergence, the user can make the best offloading decision in each state.
3.1. Q-Learning
Q-learning is one of the most widely used reinforcement learning algorithms; it treats learning as a process of trial, evaluation, and feedback. Q-learning consists of three elements: state, action, and reward. The state set is denoted as S and the action set as A, and the purpose of Q-learning is to obtain the optimal action selection strategy Π∗ that maximizes the agent's discounted cumulative reward [11]. In state s ∈ S, the agent selects an action a ∈ A from the action set to act on the environment. After the environment accepts the action, it changes and generates an instant reward Rw(s, a) that is fed back to the agent. Then, the agent selects the next action a′ ∈ A based on the reward and its own experience, which in turn affects the discounted cumulative reward Rc(s) and the state s′ at the next moment. It has been proved that, for any given Markov decision process, Q-learning can obtain an optimal action selection strategy Π∗ for each state s, maximizing the discounted cumulative reward of each state [17].
The discounted cumulative reward Rc(s) for state s is

(9) Rc(s) = Rw(s, a) + δ Σs′∈S P(s′|s, a) Rc(s′),

where Rw(s, a) is the instant reward obtained by the agent selecting action a in state s, δ ∈ (0, 1) is the discount factor, and P(s′|s, a) is the probability that the agent performing action a transitions from state s to s′. According to Bellman's theory [18], when the discounted cumulative reward is maximized, the optimal action selection decision under state s can be obtained:

(10) Rc∗(s) = maxa∈A [Rw(s, a) + δ Σs′∈S P(s′|s, a) Rc(s′)].
The optimal action selection decision is

(11) π∗(s) = argmaxa∈A [Rw(s, a) + δ Σs′∈S P(s′|s, a) Rc(s′)].
Since Rw(s, a) and P(s′|s, a) are initially unknown, the agent learns these values during the Q-learning process of trial, evaluation, and feedback. We use the Q function to represent the discounted cumulative reward when the agent selects a in state s:

(12) Q(s, a) = Rw(s, a) + δ Σs′∈S P(s′|s, a) Rc(s′).
This paper uses Q-learning to solve the problem of WiFi offloading and proposes a WiFi offloading algorithm based on Q-learning and multiattribute decision making. In this paper, the multimode terminal moving inside the cell is regarded as the agent. The state, action, and reward of Q-learning are mapped in the following, respectively:
State set S: the locations that the agent passes and the network environment around each location, that is, S = {si = (Posi, Envii)}, i ∈ {1, 2, …, Np}, where Posi represents the location of the agent and Envii represents the network attributes of location i, including throughput, power consumption, cost, and delay
Action set A: the process of selecting an action is regarded as an offloading decision, that is, A = {ak}, k ∈ {0, 1, 2, …, NAP}, where a0 indicates that the terminal accesses the cellular BS and ak, k ∈ {1, 2, …, NAP}, indicates that the terminal is offloaded to the WiFi AP with the corresponding subscript
Reward function Rws,a: the utility value of the TOPSIS algorithm is used to represent the instant reward that the user obtains after attempting to access a certain network
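The mapping above can be sketched with a hypothetical encoding (the type and field names are ours, not the paper's; the environment part of the state is reduced to a single quantized bandwidth index for brevity):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """s_i = (Pos_i, Envi_i): position index plus a snapshot of the local
    network environment (here reduced to one quantized bandwidth index)."""
    position: int   # i in {1, ..., Np}
    env: int        # stands in for Envi_i

N_AP = 3                          # example number of WiFi APs
actions = list(range(N_AP + 1))   # a0 = cellular BS, a1..aN_AP = WiFi APs

# A frozen dataclass is hashable, so (state, action) pairs can key a Q-table.
s = State(position=1, env=0)
q_table = {(s, a): 0.0 for a in actions}
```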
3.2. AHP Algorithm
This paper uses AHP to calculate the user’s subjective assessment of the importance of each network attribute under different service types. AHP is one of the MADM algorithms using qualitative and quantitative calculations, which is widely used in network evaluation and strategy selection. According to [15], AHP has five steps: (1) establishing a hierarchical model; (2) constructing a paired comparison matrix; (3) calculating attribute weights; (4) checking consistency; and (5) selecting network. However, this paper only needs to use AHP to calculate the weight of different network attributes, so steps (1) and (5) are omitted. The specific steps are as follows:
Step 1: construct the paired comparison matrix according to the user service type j and the attributes to be analysed. Since this paper considers the four attributes of throughput, power consumption, cost, and delay, the paired comparison matrix B can be expressed as

(13) B = (bmn)4×4,
where bmn represents the ratio of the importance degree between m and n network attributes. We assume bmn as an integer from 1 to 9 or a reciprocal of them to evaluate the relative importance between different attributes. Furthermore, we have bmn=1/bnm, and the value on the diagonal is 1.
Step 2: calculate the weight of each network attribute for the given service type. According to [19], B is a positive reciprocal matrix with multiple eigenvalue-eigenvector pairs (λ, V):

(14) B × V = λ × V,

where λ is an eigenvalue of B and V is an eigenvector corresponding to λ. The eigenvector corresponding to the largest eigenvalue λ∗ is selected and normalized into (wjtp, wjpc, wjc, wjd)T, which gives the AHP weights of the four attributes.
Step 3: check the consistency of the paired comparison matrix. Normally, accurate AHP weights cannot be obtained in one pass because the paired comparison matrix may be inconsistent, that is, bmn ≠ bmk × bkn for some entries, so the weights calculated in Step 2 may be inaccurate. It is necessary to check the consistency of the comparison matrix to ensure that the subjective weights are reasonable [15]. This paper uses the consistency ratio CR to measure the rationality of B:
(15) CR = (λ∗ − N)/((N − 1) × RI),
where N is the number of network attributes and also the order of matrix B. RI is the average random consistency index, which is fixed once the comparison matrix order is known [15], as shown in Table 1.
According to the theory of AHP, if the consistency ratio CR > 0.1, then B is unacceptable, and it is necessary to return to Step 1 to adjust B until CR ≤ 0.1. Finally, accurate AHP weights of the four network attributes can be obtained (Table 1).
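The three AHP steps can be sketched as follows, using NumPy's general eigensolver for equation (14) and the RI values of Table 1 for equation (15). The input shown is the stream-service matrix from Table 2; the resulting weights should come out close to the (0.4891, 0.1896, 0.2321, 0.0893) reported in Section 4.

```python
import numpy as np

# Average random consistency index RI from Table 1, keyed by matrix order.
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
      6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(B):
    """Return (weights, CR): principal-eigenvector weights (eq. (14)) and
    the consistency ratio CR = (lambda* - N) / ((N - 1) * RI) (eq. (15))."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    eigvals, eigvecs = np.linalg.eig(B)
    k = int(np.argmax(eigvals.real))
    lam = eigvals.real[k]                 # largest eigenvalue lambda*
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                       # normalize eigenvector to weights
    cr = (lam - n) / (n - 1) / RI[n] if RI[n] > 0 else 0.0
    return w, cr

# Stream-service paired comparison matrix from Table 2 (order: TP, PC, C, D).
B_stream = [[1,   3,   2,   5],
            [1/3, 1,   1,   2],
            [1/2, 1,   1,   3],
            [1/5, 1/2, 1/3, 1]]
w, cr = ahp_weights(B_stream)   # throughput weighted highest, delay lowest
```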
Average random consistency with respect to matrix order.
Matrix order   1     2     3     4     5     6     7     8     9
RI             0.00  0.00  0.58  0.90  1.12  1.24  1.32  1.41  1.45
3.3. TOPSIS Algorithm
This paper uses TOPSIS to calculate the instant reward Rw obtained by the terminal accessing the cellular network or a WiFi network. TOPSIS is also an MADM algorithm; its principle is to calculate and sort the proximity of candidate solutions to the ideal solution. In the Q-learning model, the action set contains all possible network choices; however, this is not the candidate network set, because before running TOPSIS, this paper filters out the invalid networks whose actual throughput is less than the throughput threshold VthTP. So, we use TOPSIS to calculate the rewards corresponding to the candidate networks. Assume that the filtered candidate network set is {Net1, …, Netl, …, NetL}, which corresponds to the L valid actions extracted from the action set A; the reward corresponding to a filtered-out invalid action is 0. The specific steps for calculating the Q-learning reward using the TOPSIS algorithm are as follows:
Step 1: establish a standardized decision matrix H. Construct a candidate network attribute matrix X using the network attribute values calculated in Section 2:

(16) X = (xln)L×N,
where l indexes the candidate networks and n indexes the network attributes. Normalize each column to obtain a standardized decision matrix H = (hln)L×N, where hln is the normalization of xln:
(17) hln = xln / Σl∈{1,2,…,L} xln.
Step 2: establish a weighted decision matrix Y. Each attribute is weighted by the AHP weight (wjtp, wjpc, wjc, wjd)T obtained in Section 3.2, represented here by (w1, w2, w3, w4)T, and the attribute values in each column of H are multiplied by the corresponding AHP weight to obtain Y = (yln)L×N:
(18) yln = wn × hln.
Step 3: calculate the proximity of each candidate solution to the two extreme solutions. First, determine the ideal solution and the least ideal solution. Since throughput is a positive attribute and power consumption, cost, and delay are negative attributes, the ideal solution Solution+ takes the maximum yln of the throughput column and the minimum yln of the other three columns, while the least ideal solution Solution− takes the opposite. Then, the distances EDl+ and EDl− between candidate network l and Solution+ and Solution−, respectively, are calculated.
Step 4: calculate the instant reward after the user selects a candidate network. In this paper, Rwl is expressed as the relative proximity of the candidate network to the ideal solution:

(22) Rwl = EDl− / (EDl+ + EDl−).
The larger EDl− and the smaller EDl+ are, the closer Rwl is to 1, indicating that the candidate solution is closer to the ideal solution and the reward is larger. Conversely, the smaller EDl− and the larger EDl+ are, the closer Rwl is to 0, indicating that the network accessed by the agent is poor.
In summary, the reward function of this paper is as follows:

(23) Rw(s, a) = EDl− / (EDl+ + EDl−) for a valid action, and Rw(s, a) = 0 for an invalid action.
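The TOPSIS steps above can be sketched as follows. Two details are assumptions on our part: the distances EDl± are taken as Euclidean (equations (19)-(21) are not reproduced here), and the column-sum normalization follows equation (17) as written.

```python
import numpy as np

def topsis_rewards(X, w):
    """Instant rewards for L candidate networks (eqs. (17), (18), (22)).
    X: L x 4 attribute matrix with columns ordered (TP, PC, C, D);
    w: AHP weight vector. Column 0 (throughput) is the positive attribute."""
    X = np.asarray(X, dtype=float)
    H = X / X.sum(axis=0)                    # eq. (17): column normalization
    Y = H * np.asarray(w, dtype=float)       # eq. (18): apply AHP weights
    pos = np.array([True, False, False, False])
    best = np.where(pos, Y.max(axis=0), Y.min(axis=0))    # Solution+
    worst = np.where(pos, Y.min(axis=0), Y.max(axis=0))   # Solution-
    ed_plus = np.sqrt(((Y - best) ** 2).sum(axis=1))      # ED_l^+ (assumed Euclidean)
    ed_minus = np.sqrt(((Y - worst) ** 2).sum(axis=1))    # ED_l^-
    return ed_minus / (ed_plus + ed_minus)                # eq. (22)

# Two candidate networks; the first dominates on every attribute
# (higher throughput, lower power, cost, and delay), so it coincides with
# the ideal solution and receives the maximum reward.
r = topsis_rewards([[10, 1, 1, 1],
                    [5,  2, 2, 2]], [0.5, 0.2, 0.2, 0.1])
```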
3.4. Algorithm Steps
In order to maximize the satisfaction of mobile users in the cell, this paper considers the four attributes of throughput, power consumption, cost, and delay, uses AHP to calculate the weight of each attribute, defines the reward function by TOPSIS, and relies on Q-learning to iterate until convergence; the best offloading strategy in each state can finally be obtained. In Q-learning, the Q value is updated as the user learns:

(24) Qt(s, a) = (1 − μ) Qt−1(s, a) + μ [Rwt(s, a) + δ maxa′∈A Qt−1(s′, a′)],

where μ ∈ (0, 1) is the learning rate. The larger μ is, the less of the previously trained Q value is retained and the more weight is given to the instant reward Rwt(s, a) and the experience reward maxa′∈A Qt−1(s′, a′). δ is the discount factor of the experience reward, and s′ is the state that the agent transitions into.
In addition, this paper introduces the ε-greedy algorithm. In each action selection of Q-learning, the agent explores with a small probability ε, that is, it randomly selects a network to offload to. Without the ε-greedy algorithm, the cumulative reward of a suboptimal action may keep growing, causing the user to keep choosing that action and reinforcing it further instead of finding a better one. In other words, the core of ε-greedy is exploration: it continuously preserves the possibility of finding the optimal action. Although exploration may reduce user satisfaction in the short term, it enables better action choices in the future and ultimately yields the highest user satisfaction. Based on the above analysis, Algorithm 1 gives the WiFi offloading algorithm based on Q-learning and MADM.
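A minimal sketch of the ε-greedy selection and the update of equation (24), with the Q-table stored as a dictionary keyed by (state, action); the hyperparameter defaults follow Section 4 (μ = 0.8, δ = 0.1, ε = 0.01), and unvisited entries default to 0 as an assumption.

```python
import random

def select_action(Q, s, actions, eps=0.01, rng=random):
    """Epsilon-greedy: with probability eps explore a random network,
    otherwise exploit the action with the largest Q value in state s."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def q_update(Q, s, a, reward, s_next, actions, mu=0.8, delta=0.1):
    """Eq. (24): Q_t(s,a) = (1 - mu) * Q_{t-1}(s,a)
                             + mu * [Rw_t(s,a) + delta * max_a' Q_{t-1}(s',a')]."""
    best_next = max(Q.get((s_next, ap), 0.0) for ap in actions)
    Q[(s, a)] = (1 - mu) * Q.get((s, a), 0.0) + mu * (reward + delta * best_next)

Q = {}
q_update(Q, s=0, a=1, reward=1.0, s_next=1, actions=[0, 1])
# Q[(0, 1)] = 0.2 * 0 + 0.8 * (1.0 + 0.1 * 0) = 0.8
```

With ε = 0, `select_action` reduces to the pure greedy choice used after convergence to read the best strategy out of the Q-table.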
Algorithm 1: WiFi offloading algorithm based on Q-learning and MADM in heterogeneous networks.
Input: state set S, action set A, paired comparison matrix B, candidate network attribute matrix X, and iteration limit Z
Output: trained Q-table, best action selection strategy Π∗, and user satisfaction Φjsat
Calculate attribute weights based on B
For s∈S, a∈A
Qs,a = 0
End For
Randomly choose sini∈S as the initialization state
While iteration < Z
For each state
If rand < ε
Randomly choose an action
Else
Select the action corresponding to the maximum Q value in this state.
End If
Perform a
Calculate Rwts,a according to equation (23)
Observe the next state s′
Update the Q-table according to equation (24)
End For
End While
Record the action corresponding to the maximum Q value in each state into Π∗
Calculate user satisfaction Φjsat according to equation (7)
4. Numerical and Simulation Results
As shown in Figure 1, the simulation scenario is a circular cell with radius rcell = 500 m. The cellular BS is located in the cell center, and NAP WiFi APs are randomly distributed inside the cell. The additive white Gaussian noise power spectral density N0 is −174 dBm/Hz, and the reference distance d0 is 1 m. In FRayleigh(θ, β), the mean is θ = 0 and the variance is β = 5 dB. Furthermore, the learning rate μ of Q-learning is set to 0.8, the discount factor of the experience reward δ is set to 0.1, and ε in ε-greedy is set to 0.01. In AHP, when the number of network attributes is N = 4, the consistency index RI = 0.90 [15]. The paired comparison matrices B of the different services are shown in Table 2; they are accepted results based on the general needs of each service, given by expert opinion. The remaining parameters are shown in Table 3.
Comparison matrices corresponding to stream service and conversation service.
Network attribute   Stream (TP, PC, C, D)   Conversation (TP, PC, C, D)
TP                  1, 3, 2, 5              1, 2, 1, 1/9
PC                  1/3, 1, 1, 2            1/2, 1, 1/3, 1/9
C                   1/2, 1, 1, 3            1, 3, 1, 1/9
D                   1/5, 1/2, 1/3, 1        9, 9, 9, 1
Simulation parameters of cellular network and WiFi network in this paper.
Simulation parameters                      Cellular network   WiFi network
User cost ViC (/s)                         0.8                0.1
Communication delay ViD (ms)               25 to 50           100 to 150
Bandwidth W (MHz)                          4 to 6             10 to 12
Path loss L0 at d0 (dB)                    5.27               8
Terminal fixed power consumption P0 (mW)   10                 10
Minimum received power Pminr (dBm)         −110               −100
User throughput threshold VthTP (kb/s)     10                 12
Path loss exponent α                       3.76               4
Firstly, we analyse the performance of the algorithm under stream service. According to the AHP algorithm, the weight vector corresponding to throughput, power consumption, cost, and delay is (w1tp, w1pc, w1c, w1d)T = (0.4891, 0.1896, 0.2321, 0.0893). When the user runs a streaming media service such as watching a video, throughput is the most important attribute and delay the least important. Because a video usually has a large size, such as 500 MB, 1 GB, or more, the throughput must be high enough to keep the video buffered. The user equipment then only reads the precached data to perform the service, which is not real-time, so stream service does not require low delay.
Figure 3 shows the convergence comparison between invalid-action filtering and nonfiltering in the WiFi offloading algorithm under stream service. Advance filtering means that the invalid networks whose actual throughput is less than the throughput threshold VthTP are filtered out before Q-learning. Assume NAP = 30, and the total number of positions Np passed by the user is 10. The two cases run Q-learning in the same experimental scenario, and their convergence is observed. Since the action selection in Q-learning is discrete, user satisfaction jumps when the action selection strategy changes. As can be seen from Figure 3, filtering out the invalid networks whose throughput is less than the threshold VthTP in advance greatly accelerates the convergence of Q-learning.
Convergence comparison between the invalid action filtering and nonfiltering, recorded 2000 iterations under stream service.
Figures 4 and 5 show the comparison between this paper’s algorithm, Fakhfakh and Hamouda’s algorithm [11], and RSS (received signal strength) algorithm based on user satisfaction, throughput, power consumption, cost, and delay under stream service. We repeatedly scatter APs 1000 times to eliminate randomness. The number of user-passed positions Np is equal to 10, and the number of WiFi AP is changed from 20 to 60. As can be seen from Figure 4, the WiFi offloading algorithm in this paper is superior to the other two algorithms in user satisfaction. The main difference between this paper and [11] is the reward function of the Q-learning. Fakhfakh and Hamouda’s algorithm [11] aims to minimize the residence time of the cellular network and optimize it by Q-learning, but its reward function only considers SINR, handover delay, and AP load, without considering the attributes directly related to user QoS, such as terminal power consumption, user cost, and communication delay. The RSS algorithm only considers the received signal strength of the terminal, and the terminal automatically accesses network with the largest RSS, so the user satisfaction is lower. The Q-learning algorithm in this paper not only considers the attributes directly related to user QoS but also uses two MADM algorithms to obtain the intrinsic relationship of these attributes. It establishes a more reasonable Q-learning reward function and obtains the best user satisfaction. As can be seen from Figure 5, the algorithm in this paper is similar to [11] in terms of user throughput. This is because Fakhfakh and Hamouda’s algorithm [11] regards SINR as the most important aspect of the reward function, which directly affects throughput. Since the simulation is based on the stream service, the weight of throughput accounts for almost half of all the attributes, so the two algorithms perform similarly in throughput. 
Since the other two algorithms do not consider power consumption and cost, the proposed algorithm performs better on these two attributes. The RSS algorithm accesses the network with the highest received power. In this scenario, as long as the terminal is not too far from the cellular BS, the RSS of the cellular network is the largest, so the number of WiFi offloads decreases. Because the WiFi network uses the unlicensed frequency band, the bandwidth available to the user is usually larger than that of the cellular network; as a result, the throughput of the RSS algorithm is lower. Because the delay of the cellular network is usually lower than that of the WiFi network, the RSS algorithm performs best on the delay attribute. However, the weight of the delay attribute in stream service is very low, since the user pays little attention to the delay of precached data when watching video or listening to music. Consequently, although the algorithm in this paper is not as good as the RSS algorithm in delay, its user satisfaction is much higher.
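The comparisons above hinge on how the instant reward aggregates the four attributes through TOPSIS with AHP-derived weights. A minimal sketch of that scoring step is given below; the attribute values, weights, and function names are illustrative assumptions, not the paper’s actual parameters.

```python
import numpy as np

def topsis_score(attribute_matrix, weights, benefit):
    """TOPSIS closeness scores for candidate networks.

    attribute_matrix: rows = candidate networks,
                      columns = [throughput, power, cost, delay]
    weights:          AHP-derived attribute weights (sum to 1)
    benefit:          True for benefit attributes (throughput),
                      False for cost attributes (power, cost, delay)
    """
    # Vector-normalize each attribute column, then apply the weights.
    norm = attribute_matrix / np.linalg.norm(attribute_matrix, axis=0)
    v = norm * weights
    # The ideal best/worst points depend on benefit vs. cost attributes.
    best = np.where(benefit, v.max(axis=0), v.min(axis=0))
    worst = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_best = np.linalg.norm(v - best, axis=1)
    d_worst = np.linalg.norm(v - worst, axis=1)
    # Relative closeness to the ideal solution, in [0, 1].
    return d_worst / (d_best + d_worst)

# Two illustrative candidates: network A dominates network B in every
# attribute (higher throughput, lower power/cost/delay).
candidates = np.array([[10.0, 1.0, 1.0, 1.0],
                       [5.0, 2.0, 2.0, 2.0]])
weights = np.array([0.4, 0.2, 0.2, 0.2])
benefit = np.array([True, False, False, False])
scores = topsis_score(candidates, weights, benefit)
# The dominating candidate scores 1.0 and the dominated one 0.0.
```

A candidate that is best in every weighted attribute attains closeness 1, and this score serves as the instant reward fed into the Q-learning update.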
User satisfaction comparison under stream service. The error bars represent the standard deviation of user satisfaction over 1000 random scatterings of WiFi APs.
Comparisons of throughput, power consumption, cost, and delay under stream service.
Figure 6 shows user satisfaction against the number of positions passed by the agent, again averaged over 1000 random scatterings of APs to eliminate randomness. The number of WiFi APs NAP = 30, and the terminal passes through 6, 8, 10, 12, and 14 positions, respectively. It can be seen that the more positions there are, the higher the user satisfaction: as the number of positions increases, the number of Q-learning states increases, and the agent has more chances to actively select the optimal network to offload to, so satisfaction becomes higher.
Plot of user satisfaction with respect to the number of positions passed by the agent. The number of WiFi APs is 30, and the service type is stream.
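The position-by-position decision process above can be sketched as a standard tabular Q-learning loop. The table layout, learning rate, discount factor, and action names below are illustrative assumptions rather than the paper’s exact settings.

```python
import random

def q_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning update: blend the instant (TOPSIS-based) reward
    with the best discounted future reward at the next position."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

def choose_action(Q, state, epsilon=0.1):
    """Epsilon-greedy choice between staying on cellular and offloading."""
    if random.random() < epsilon:
        return random.choice(list(Q[state]))
    return max(Q[state], key=Q[state].get)

# Illustrative table: states are positions along the user's path,
# actions are the candidate networks at each position.
Q = {0: {'cellular': 0.0, 'wifi': 0.0},
     1: {'cellular': 0.0, 'wifi': 0.0}}
q_update(Q, state=0, action='wifi', reward=1.0, next_state=1)
# With alpha=0.5, gamma=0.9 and an all-zero table,
# Q[0]['wifi'] becomes 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5.
```

More positions mean more table rows, which is why satisfaction grows with the number of positions: the agent accumulates experience at more states before choosing greedily.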
Figures 7 and 8 compare this paper’s algorithm, Fakhfakh and Hamouda’s algorithm [11], and the RSS algorithm in terms of user satisfaction, throughput, power consumption, cost, and delay under conversation service. The number of user-passed positions Np is 10, and the number of WiFi APs is varied from 20 to 60. According to the AHP algorithm, the weight vector is obtained as w2 = [w2,tp, w2,pc, w2,c, w2,d]^T = [0.0955, 0.0534, 0.1084, 0.7427]^T, which indicates that when the user chooses a conversation service such as a voice call, the most important attribute is communication delay, while the other three attributes matter far less. In a voice call, QoS drops drastically if the waiting time is too long. As can be seen from Figure 7, the WiFi offloading algorithm in this paper outperforms the other two algorithms in user satisfaction. Fakhfakh and Hamouda’s algorithm [11] does not consider communication delay, so its satisfaction is the worst. As mentioned above, the RSS algorithm usually makes the terminal access the cellular BS, which has a larger transmit power and a lower delay, so its satisfaction is better than that of [11]. As can be seen from Figure 8, the WiFi offloading algorithm in this paper is superior to the RSS algorithm in throughput, power consumption, and cost, while its communication delay is close to that of the RSS algorithm. Delay is the most important attribute under conversation service, so the delay performance approaches that of the RSS algorithm; since we also consider the other attributes, a few users still offload to the WiFi network, and the delay of our algorithm is slightly higher than that of the RSS algorithm.
User satisfaction comparison under conversation service. The error bars represent the standard deviation of user satisfaction over 1000 random scatterings of WiFi APs.
Comparisons of throughput, power consumption, cost, and delay under conversation service.
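The conversation-service weight vector quoted above is produced by AHP from a pairwise-comparison matrix. A minimal sketch of the eigenvector method follows; the comparison matrix below is an illustrative judgment set in which delay is strongly preferred, not the paper’s actual matrix, so it does not reproduce the exact weights.

```python
import numpy as np

def ahp_weights(pairwise):
    """AHP attribute weights: the normalized principal right eigenvector
    of a positive reciprocal pairwise-comparison matrix."""
    vals, vecs = np.linalg.eig(pairwise)
    principal = np.argmax(vals.real)  # Perron root of a positive matrix
    w = np.abs(vecs[:, principal].real)
    return w / w.sum()

# Illustrative judgments for [throughput, power, cost, delay] under a
# conversation service: delay dominates every other attribute.
pairwise = np.array([
    [1.0, 2.0, 1.0, 1 / 7],
    [1 / 2, 1.0, 1 / 2, 1 / 9],
    [1.0, 2.0, 1.0, 1 / 7],
    [7.0, 9.0, 7.0, 1.0],
])
w = ahp_weights(pairwise)  # delay receives the largest weight
```

The resulting vector sums to 1, and the delay component is the largest, mirroring the qualitative structure of the weight vector reported for conversation service.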
5. Conclusion
In the heterogeneous network scenario where the cellular network and WiFi networks overlap, this paper establishes a model of mobile-terminal WiFi offloading, with a Markov model describing the change of available bandwidth. Four network attributes (user throughput, terminal power consumption, user cost, and communication delay) are considered to define a user satisfaction function. The AHP algorithm is used to calculate the attribute weights, and the TOPSIS algorithm is used to obtain the instant reward when the user accesses the cellular network or offloads to the WiFi network. Using the Q-learning algorithm, which combines instant rewards and accumulated experience to update the discounted cumulative reward, the user can make the optimal offloading decision and obtain the maximum satisfaction at each position along the path. The simulation results show that the proposed algorithm converges within a limited number of iterations and achieves a substantial improvement in user satisfaction over the comparison algorithms.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61971239 and 61631020).
References
[1] Cisco Systems, Cisco White Paper, Prague, Czech Republic, 2017.
[2] D. Ho, G. S. Park, and H. Song, “Game-theoretic scalable offloading for video streaming services over LTE and WiFi networks,” IEEE Transactions on Mobile Computing, vol. 17, no. 5, pp. 1090–1104, 2018, doi: 10.1109/tmc.2017.2748592.
[3] Q. Chen, G. Yu, H. Shan, A. Maaref, G. Y. Li, and A. Huang, “Cellular meets WiFi: traffic offloading or resource sharing?” IEEE Transactions on Wireless Communications, vol. 15, no. 5, pp. 3354–3367, 2016, doi: 10.1109/twc.2016.2520478.
[4] A. Aijaz, H. Aghvami, and M. Amani, “A survey on mobile data offloading: technical and business perspectives,” IEEE Wireless Communications, vol. 20, no. 2, pp. 104–112, 2013, doi: 10.1109/mwc.2013.6507401.
[5] Z. Li, C. Dong, A. Li, and H. Wang, “Traffic offloading from LTE-U to WiFi: a multi-objective optimization approach,” in Proceedings of the 2016 IEEE International Conference on Communication Systems (ICCS), Shenzhen, China, December 2016, doi: 10.1109/iccs.2016.7833622.
[6] J. Xu, S. Wu, L. Xu, N. Zhang, and Q. Zhang, “Green-oriented user-satisfaction aware WiFi offloading in HetNets,” IET Communications, vol. 12, no. 5, pp. 501–508, 2018, doi: 10.1049/iet-com.2017.0489.
[7] S. Cai, L. Duan, and J. Wang, “Incentive mechanism design for delayed WiFi offloading,” in Proceedings of the ICC 2015—2015 IEEE International Conference on Communications, London, UK, June 2015, doi: 10.1109/icc.2015.7248848.
[8] X. Kang, Y.-K. Chia, S. Sun, and H. F. Chong, “Mobile data offloading through a third-party WiFi access point: an operator’s perspective,” IEEE Transactions on Wireless Communications, vol. 13, no. 10, pp. 5340–5351, 2014, doi: 10.1109/twc.2014.2353057.
[9] B. H. Jung, N. O. Song, and D. K. Sung, “A network-assisted user-centric WiFi-offloading model for maximizing per-user throughput in a heterogeneous network,” IEEE Transactions on Vehicular Technology, vol. 63, no. 4, pp. 1940–1945, 2014, doi: 10.1109/tvt.2013.2286622.
[10] U. Sethakaset, Y.-K. Chia, and S. Sun, “Energy efficient WiFi offloading for cellular uplink transmissions,” in Proceedings of the 2014 IEEE 79th Vehicular Technology Conference (VTC Spring), Seoul, Korea, May 2014, doi: 10.1109/vtcspring.2014.7022909.
[11] E. Fakhfakh and S. Hamouda, “Optimised Q-learning for WiFi offloading in dense cellular networks,” IET Communications, vol. 11, no. 15, pp. 2380–2385, 2017, doi: 10.1049/iet-com.2017.0213.
[12] S. Kunarak and R. Suleesathira, “Predictive RSS with fuzzy logic based vertical handoff algorithm in heterogeneous wireless networks,” in Proceedings of the 2010 10th International Symposium on Communications & Information Technologies, Tokyo, Japan, June 2010, doi: 10.1109/iscit.2010.5665177.
[13] J. I. Pelaez, E. A. Martinez, and L. G. Vargas, “Consistency in positive reciprocal matrices: an improvement in measurement methods,” IEEE Access, vol. 6, pp. 25600–25609, 2018, doi: 10.1109/access.2018.2829024.
[14] L. Zhang and Q. Zhu, “Network selection algorithm based on multi-radio parallel transmission for heterogeneous wireless networks,” vol. 30, no. 10, pp. 1176–1184, 2014.
[15] L. Zhang and Q. Zhu, “Multiple attribute network selection algorithm based on AHP and synergetic theory for heterogeneous wireless networks,” vol. 31, no. 1, pp. 29–40, 2014, doi: 10.1007/s11767-013-3131-1.
[16] H.-W. Yu and B. Zhang, “A hybrid MADM algorithm based on attribute weight and utility value for heterogeneous network selection,” Journal of Network and Systems Management, vol. 27, no. 3, pp. 756–783, 2019, doi: 10.1007/s10922-018-9483-y.
[17] C. J. C. H. Watkins and P. Dayan, “Technical note: Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992, doi: 10.1007/bf00992698.
[18] A. E. Shafie, T. Khattab, H. Saad, and A. Mohamed, “Optimal cooperative cognitive relaying and spectrum access for an energy harvesting cognitive radio: reinforcement learning approach,” in Proceedings of the 2015 International Conference on Computing, Networking and Communications (ICNC), Anaheim, CA, USA, February 2015, doi: 10.1109/iccnc.2015.7069344.
[19] A. Bazzi, B. M. Masini, A. Zanella, and D. Dardari, “Performance evaluation of softer vertical handovers in multiuser heterogeneous wireless networks,” Wireless Networks, vol. 23, no. 1, pp. 159–176, 2017, doi: 10.1007/s11276-015-1140-8.