A Two-Dimensional Multiarmed Bandit Approach to Secondary Users with Network Coding in Cognitive Radio Networks

We study how to utilize network coding to improve the throughput of secondary users (SUs) in cognitive radio networks (CRNs) when the channel quality is unavailable at SUs. We use a two-dimensional multiarmed bandit (MAB) approach to solve the problem of SUs with network coding under unknown channel quality in CRNs. We analytically prove the asymptotic throughput optimality of the proposed two-dimensional MAB algorithm. Simulation results show that the proposed algorithm achieves throughput comparable to both the theoretical upper bound and the scheme that assumes known channel quality information.


Introduction
With the fast development of software-defined radio technology [1], cognitive radio networks (CRNs) have emerged to solve the spectrum underutilization problem [2]. In CRNs, secondary users (SUs) are allowed to share the licensed spectrum with primary users (PUs) when it is not occupied by PUs. However, because SUs have strictly lower priority, the idle duration of the PU channel (which is also the transmission time available to SUs) is not just limited but also uncertain, in the sense that SUs' transmissions may be interrupted by PUs at any time. Therefore, a way to improve transmission quality by increasing spectrum utilization under such uncertainty is urgently needed.
Recently, network coding [3] has emerged as an efficient technology to improve SUs' throughput in CRNs when it is employed in SUs' data transmissions [4][5][6][7][8]. The key idea of network coding is to combine multiple input packets algebraically into one packet before forwarding [3]. The number of packets combined together is called the block size, which is a key coding parameter in the network coding literature. It is worth emphasizing that the block size must be determined before data transmission [9][10][11]. According to previous studies [12][13][14], block size selection usually requires a priori channel information, including the available transmission time and the channel quality.
Nevertheless, in CRNs, the channel quality, like the idle duration, may also be unknown to SUs. For example, when the SUs choose a new channel, the quality of this channel cannot be obtained immediately. Moreover, the channel quality itself may change dynamically due to the complex spectrum environment. Unfortunately, most existing works (e.g., [4, 6, 7, 8]) assume that the channel quality information is perfectly known and available at SUs. Thus, the block size selection algorithms proposed in these studies cannot be applied to the problem of SUs with network coding under unknown channel quality.
In this paper, we propose a two-dimensional multiarmed bandit (MAB) approach to address the problem of SUs with network coding under unknown channel quality in CRNs. Specifically, we develop a two-dimensional MAB algorithm that sequentially chooses the idle duration and block size jointly. With the help of MAB, the exploration and exploitation of the idle duration-block size pair ensure the asymptotic throughput optimality of the proposed algorithm. Unlike previous studies, our proposed algorithm does not need the channel quality in advance. The performance of the proposed algorithm is compared, in terms of achieved throughput, to the theoretical upper bound and to the scheme that assumes known channel quality information.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the system model and problem formulation. We propose a two-dimensional MAB algorithm for SUs with network coding under unknown channel quality in Section 4. The performance evaluation is conducted via extensive simulations in Section 5. Finally, Section 6 concludes the paper.

Related Work
Several previous studies have demonstrated the benefits of network coding in CRNs from different aspects. Jin et al. [4] studied the problem of multicast scheduling with network coding in CRNs and exploited network coding to perform error recovery and reduce overhead. Both Asterjadhi et al. [15] and Baldo et al. [16] utilized network coding to transmit control information reliably, in order to maintain up-to-date information in CRNs. With the help of network coding, Almasaeid and Kamal [6] proposed an algorithm to reduce the negative impact of channel heterogeneity on the performance of multicast in CRNs. Although network coding has been widely exploited in CRNs, the above works did not consider how to utilize network coding in the SUs' data transmissions. Note that the very recent studies [7,8] have explored the throughput benefit of network coding over SUs' multicast in CRNs. However, they all assume that the channel quality of the licensed channel is perfectly available at SUs, and thus their proposed algorithms cannot be applied to our problem.
On the other hand, our work can be classified as a study of learning the primary user environment in CRNs, since we try to utilize network coding for SUs' data transmission without knowing the idle duration and channel quality information. Among the works on channel access in CRNs, several studies employ MAB to solve the learning-based dynamic spectrum access problem from a sequential decision perspective. Shu and Krunz [17] utilized MAB to develop a throughput-optimal decision algorithm for stochastic homogeneous channels. Anandkumar et al. [18] proposed two MAB-based distributed learning and allocation schemes, one for the case where ranks are preallocated to SUs and one for the case where no such information is available. Li et al. [19] formulated the joint channel sensing, probing, and accessing problem as a nonstochastic MAB problem and presented an almost throughput-optimal algorithm for nonstochastic channels. To sum up, existing works on MAB-based dynamic spectrum access in CRNs mainly focus on how to access the best channel and rarely consider how to transmit data efficiently in the uncertain idle durations under unknown channel quality after accessing the channel.

Model and Formulation
In this section, we will first introduce the system and network model used in our work and then present the problem formulation.

Network and System Model.
We consider a CR network where secondary users (SUs) want to achieve better performance by utilizing network coding. Consider that there are one PU channel and multiple SUs. Without loss of generality, we assume that there are  + 1 SUs, where one SU is the sender, and the rest of  SUs are the receivers. Before accessing the channel, the SU sender needs to sense the channel to determine whether it is idle or not. Here we assume that the time for sensing the status of the channel is   on average. If the channel is idle, the SU sender then should predict how long the idle status would last and determine the corresponding coding parameter it would use. We assume that the idle duration of the PU channel is nonstationary, in the sense that the idle duration length  changes with time and does not follow any probability distribution. Therefore, the accessing time of SUs may be unequal, which is more general in CRNs, while most existing studies assume the identical accessing time.
Time is slotted and synchronized among the channel and multiple SUs. We note that, in our work, one time slot consists of both the time of channel sensing and the time of transmitting one data packet. The slot structure is shown in Figure 1. Suppose there exists a maximum idle duration of the PU channel, that is,  max slots. For SUs' application, any data packet that cannot be delivered to all SU receivers is dropped without contributing to the throughput. On the SU side, each SU receiver sends feedback only when it can decode a block or at the end of the current idle duration, in order to reduce the feedback overhead.
In this work, we focus on systematic network coding (SNC) [20] instead of random linear network coding (RLNC) [21] to improve SUs' throughput performance, where SNC is a special version of RLNC. The main difference between SNC and RLNC is that under RLNC all packets sent are coded packets, while under SNC both coded packets and uncoded packets are sent. Since SUs' transmissions may be interrupted by PUs at any time, we believe that SNC is more suitable for SUs than RLNC. The reason is as follows. Under RLNC, an SU receiver can decode the packets of a block only after it has collected enough independent coded packets, at which point all packets of the block are decoded simultaneously. If an SU receiver cannot collect enough coded packets to decode the block before the PUs arrive, the number of delivered data packets is zero. By comparison, under SNC, some SU receivers can still obtain several packets by successfully receiving the uncoded ones, even if they do not receive enough independent packets before the PUs arrive. All in all, SNC can reduce the negative effect of the uncertainty of idle durations while maintaining the throughput benefit of network coding. However, with SNC, the number of time slots for transmitting uncoded packets (the block size), for example, , and the number of time slots for transmitting coded packets, for example,  − , should be allocated reasonably.
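The difference between SNC and RLNC under an interrupted idle duration can be illustrated with a minimal sketch. It considers a single receiver only and assumes that every coded packet that arrives is innovative; the function names and the `received` list are hypothetical and are not part of the paper's algorithm.

```python
def delivered_snc(received, block_size):
    """Packets recovered by one receiver under systematic network coding.

    received[i] is True if the packet sent in slot i arrived before the PU
    returned. The first block_size slots carry uncoded packets; any later
    slots carry coded packets (assumed innovative whenever they arrive).
    """
    if sum(received) >= block_size:
        return block_size  # enough packets: the whole block is decoded
    return sum(received[:block_size])  # only the uncoded packets that arrived


def delivered_rlnc(received, block_size):
    """Packets recovered by one receiver when every slot carries a coded packet."""
    return block_size if sum(received) >= block_size else 0


# The PU returns after four slots and one packet was erased on the way.
interrupted = [True, False, True, True]
print(delivered_snc(interrupted, 4))   # 3
print(delivered_rlnc(interrupted, 4))  # 0
```

In the multi-receiver setting of the paper, only packets delivered to all SU receivers count toward throughput, so the per-receiver results would additionally have to be intersected.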
According to several previous studies [12][13][14], the optimal value of the block size  can be determined based on the estimated idle duration  and channel quality . In these studies, the channel quality is assumed to be perfectly known at SUs. However, in CRNs, the channel quality may also be unknown. Therefore, the aforementioned solutions cannot be applied when the channel quality is unknown. Accordingly, the idle duration  and block size  have to be determined jointly whenever SUs transmit data by SNC on the PU channel. A reference for the major notations used in this paper is provided in A Summary of Key Notations.

Problem Formulation.
In this study, we aim to provide a general sensing/transmitting mechanism for SUs with SNC under unknown channel quality in CRNs. The sequential sensing/transmitting problem is to determine the idle duration of the channel and the block size jointly, without knowing future channel states, so that SUs can improve their throughput. We model the problem as a nonstationary multiarmed bandit (MAB) problem. Since there is no gain if no data is transmitted in a time slot, we take several consecutive time slots spent sensing the channel as our study unit, which is called a round in this paper. After a round, the SU sender may or may not transmit data, according to the sensing result and previous gains. Note that the transmission time slots are not counted in a round. The strategy for sensing/transmitting is composed of many sequential rounds of sensing.
A sensing/transmitting strategy  = ⟨  ,   ,   ⟩ by the SU sender will decide its action. At every round,   ∈ {0, 1} denotes whether the SUs decide to access the channel,   ∈ {1, 2, . . .,  max } is the idle duration selected, and   ∈ {1, 2, . . .,   } denotes the block size used. The value of   is determined according to the sensing result:   = 1 if the channel is busy and   = 0 otherwise. Note that, for any ,   should be strictly no more than   ; otherwise, the block cannot be decoded. At each round , the SU sender should choose a strategy vector ⟨  ,   ⟩ over strategy space  = {1, 2, . . .,  max } × {1, 2, . . .,  max } with the restriction that the second component should be no greater than the first one. The size of  is thus equal to ( max + 1) max /2.
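As a quick check of the size of the strategy space, the pairs can be enumerated directly. This is only an illustrative sketch; the names `strategy_space` and `t_max` are hypothetical stand-ins for the strategy space and the maximum idle duration.

```python
def strategy_space(t_max):
    """All (idle_duration, block_size) pairs with block_size <= idle_duration."""
    return [
        (idle_duration, block_size)
        for idle_duration in range(1, t_max + 1)
        for block_size in range(1, idle_duration + 1)
    ]


space = strategy_space(4)
print(len(space))        # 10
print(4 * (4 + 1) // 2)  # 10, matching the closed-form size
```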
Let   () denote the gain of a strategy  at round . In practice, the SU sender will examine the performance of its strategy based on the ACK from the SU receivers after each round. The actual gain of each round can thus be calculated as the total number of successfully delivered data packets. The accumulative gain up to round  of each strategy  is defined as and the total gain of all chosen strategies accumulating up to round  is where the strategy   is the SUs' strategy at round , which is chosen randomly according to the determined probability distribution over .
In the modeled two-dimensional MAB problem for SUs with SNC, each idle duration is considered as an arm of a gambling machine and the block size is the coin that the gambler bets on that arm. The reward is defined as the number of successfully delivered data packets. Typically, the gambler's performance in the MAB problem is measured in terms of regret. It is defined as the difference between the expected return of the gambler's actions and that of the static optimal strategy. In this study, we aim to design a strategy  that maximizes the expected throughput; that is, under the system lifetime constraint, where the system lifetime   is defined as the total time for both transmitting and sensing at each round. The regret in our application is the difference between the expected total number of delivered packets using our proposed algorithm and that using the static optimal strategy over lifetime   . The static optimal strategy for the SUs is the strategy  that achieves the maximum number of delivered data packets if the SUs keep using that strategy for all  rounds. Therefore, we can define the SUs' regret  after  rounds of an online strategy   as A strategy whose average regret per round /  → 0 with probability 1 when   → ∞ is a zero-regret strategy. Our objective is to design a strategy  with small regret.
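The notion of regret against the static optimal strategy can be made concrete with a small sketch. It computes an empirical (not expected) regret over rounds rather than over the system lifetime, and it assumes a hypothetical table `gains` that lists the gain every strategy would have obtained in every round. Such a table is available only in simulation, since the SU sender observes just the gain of the strategy it actually used.

```python
def static_optimal_gain(gains):
    """Largest total gain achievable by one fixed strategy over all rounds.

    gains[t][s] is the gain strategy s would have obtained in round t.
    """
    strategies = gains[0].keys()
    return max(sum(round_gains[s] for round_gains in gains) for s in strategies)


def empirical_regret(gains, chosen):
    """Static optimal total gain minus the total gain of the chosen strategies."""
    achieved = sum(gains[t][s] for t, s in enumerate(chosen))
    return static_optimal_gain(gains) - achieved


# Two strategies, written as (idle_duration, block_size), over three rounds.
gains = [
    {(2, 1): 1, (2, 2): 0},
    {(2, 1): 1, (2, 2): 2},
    {(2, 1): 1, (2, 2): 0},
]
print(static_optimal_gain(gains))                         # 3
print(empirical_regret(gains, [(2, 2), (2, 1), (2, 2)]))  # 2
```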

Two-Dimensional MAB Algorithm for SUs with Network Coding
In this section, we focus on developing an MAB-based algorithm for SUs with network coding under unknown channel quality in CRNs. Specifically, we propose a two-dimensional MAB algorithm to sequentially choose the idle duration   and block size   jointly. It is worth emphasizing that the proposed two-dimensional MAB algorithm is not intended for SUs without network coding, since there is no block size to determine when network coding is not employed in SUs' transmissions. In the rest of this section, we first provide an overview of the algorithm, then explain its detailed design, and finally analyze its performance theoretically.

Overview.
Inspired by the approach in [19], we propose a two-dimensional MAB algorithm for SUs with SNC under unknown channel quality in CRNs, as shown in Algorithm 1.
As shown in Algorithm 1, three parameters are involved, and their meanings are as follows.  is to control the bias in the estimation of gain   , ().  is used to control the learning speed. And  reflects the tradeoff between exploration and exploitation. The values of , , and  are critical to the performance of the proposed algorithm, which will be discussed in Section 4.3.
Generally speaking, the SUs choose the strategy vector ⟨  ,   ⟩ over the strategy space  according to the corresponding probability distribution { , } (,)∈ . At the very beginning, the strategy probability is uniformly distributed, that is,  , (1) = (1/( max ( max + 1)/2)) (∀⟨, ⟩ ∈ ), since nothing is known initially about the relationship between the gain and the strategy pair. This means that the SUs begin to explore the best strategy pair uniformly over the entire strategy space , as illustrated in the initialization step. In the following rounds,  , is determined by the share of the corresponding strategy weight  , in the overall strategy weights.

Design.
In step 1, the calculation of  , () reflects the tradeoff between exploitation and exploration. Specifically, with probability 1 − , we exploit the best idle duration-block size pair from previous rounds, which is shown in the left part of the calculation. It is worth emphasizing that a better idle duration-block size pair ⟨, ⟩ always implies a greater weight  , . The exploitation guarantees an asymptotically optimal performance if the previously used strategy is asymptotically optimal. On the other hand, with probability , we explore a new idle duration-block size pair with an equal probability 1/( max ( max + 1)/2), which is illustrated in the right part of the calculation. The exploration would eventually improve our strategy to the static optimal solution.
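The mixing of exploitation and exploration described for step 1 can be sketched as follows. This is a generic weight-proportional mixture written from the description above, not a reproduction of Algorithm 1; the names `weights` and `gamma` (the exploration probability) are assumptions introduced for illustration.

```python
def strategy_probabilities(weights, gamma):
    """Mix weight-proportional exploitation with uniform exploration.

    weights maps each (idle_duration, block_size) pair to a positive weight.
    With probability 1 - gamma a pair is drawn in proportion to its weight;
    with probability gamma it is drawn uniformly from the strategy space.
    """
    total_weight = sum(weights.values())
    uniform = 1.0 / len(weights)
    return {
        pair: (1.0 - gamma) * weight / total_weight + gamma * uniform
        for pair, weight in weights.items()
    }


weights = {(1, 1): 1.0, (2, 1): 1.0, (2, 2): 2.0}
probs = strategy_probabilities(weights, gamma=0.1)
print(probs[(2, 2)] > probs[(2, 1)])  # True: a larger weight gives a larger probability
```

With equal weights the result is the uniform distribution, which matches the initialization described in the overview.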
In steps 2 and 3, the SUs execute the transmission according to the selected   and   and then obtain the scaled output, respectively. Then, the SUs calculate the gain of the current strategy pair in step 4 and update all weights based on the strategy gain in step 5. Note that, at th round, the idle duration-block size pair ⟨  ,   ⟩ is chosen according to the corresponding probability distribution  , (). Specifically, every idle duration-block size pair ⟨, ⟩ is assigned a probability  , ( = 1, 2, . . .,  max ,  = 1, 2, . . ., ). To choose the current idle duration-block size pair, we generate a random value  ∈ (0, 1) and select ⟨ * ,  * ⟩ if  falls in the interval corresponding to the probability   * , * . The output    ,  () is calculated based on the feedback from all SU receivers. The feedback information includes the packets that are decoded by each SU receiver. According to the feedback, the SU sender calculates the number of common data packets decoded by all SU receivers and then normalizes it by the block size to get the scaled output    ,  (). For instance, if the block size   = 10 and the number of common data packets decoded by all SU receivers is 6, then    ,  () is 0.6.
(2) Choose the strategy (  ,   ) in th round according to the above probability distribution, transmit data by SNC to the SU receivers based on the idle duration   , and block size   .
(3) Get the scaled output    ,  () ∈ [0, 1] after the round based on the feedback from all SU receivers.
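Steps 2 and 3 can be sketched under simple assumptions: the pair is selected by locating a random value in the cumulative probabilities, and the scaled output is the number of packets decoded by every receiver divided by the block size. The helper names and the representation of feedback as sets of packet indices are hypothetical.

```python
def pick_strategy(probabilities, r):
    """Return the pair whose cumulative-probability interval contains r in (0, 1)."""
    cumulative = 0.0
    pair = None
    for pair, probability in probabilities.items():
        cumulative += probability
        if r < cumulative:
            return pair
    return pair  # guards against rounding when r is very close to 1


def scaled_output(decoded_by_receiver, block_size):
    """Fraction of the block that was decoded by all receivers."""
    common = set.intersection(*decoded_by_receiver)
    return len(common) / block_size


probabilities = {(1, 1): 0.2, (2, 1): 0.3, (2, 2): 0.5}
print(pick_strategy(probabilities, 0.25))  # (2, 1)

# Block size 10, two receivers, six packets decoded by both of them.
feedback = [{0, 1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, 4, 5, 9}]
print(scaled_output(feedback, 10))  # 0.6
```

In a running system the value `r` would come from a random number generator; it is passed in explicitly here so that the selection is easy to check by hand.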

Theoretic Analysis.
In this part, we analyze the performance of our two-dimensional MAB algorithm in terms of regret. In our application, the regret is the difference between the number of successfully delivered data packets using the static optimal idle duration-block size pair and that using our two-dimensional MAB algorithm.
Proof. The detailed proof is provided in the Appendix.
Based on the above regret analysis, we prove that Algorithm 1 is asymptotically throughput optimal as shown in the following theorem.
Theorem 3. Algorithm 1 is asymptotically throughput optimal when  is sufficiently large.
In Theorem 3, the value of  is critical to the asymptotical optimality of Algorithm 1. To guarantee the asymptotical optimality, we may give  a small value, for example,  = 0.01, to derive the approximate value of . Specifically, according to the proof of Theorem 3, we only need to solve the inequality 1/ √  ⋅ 6 √ ln/(  + 1) ≤ . The solution of the aforementioned inequality is easily obtained as follows:  ≥ 36ln/ 2 (  + 1) 2 .

Results and Discussion
In this section, we evaluate the performance of our proposed two-dimensional MAB algorithm via simulations. Note that, when channel quality is known in advance, the optimal coding block size of SNC can be determined according to the current idle duration [7]. We call this block size selection scheme "OSNC." As a reference, we take the algorithm that assumes both perfect idle duration information and known channel quality and employs OSNC, labeled "Full Information + OSNC" in the figures; its performance is the theoretical upper bound of all schemes. Furthermore, to show the efficiency of the two-dimensional MAB algorithm under uncertain channel quality, we compare it with the one-dimensional MAB algorithm with OSNC proposed in [7], labeled "MAB + OSNC" in the figures, which assumes known channel quality information and chooses the idle duration by MAB.
Besides, we define a new metric named "utilization," which is the ratio of the performance of our two-dimensional MAB algorithm to that of "Full Information + OSNC" or "MAB + OSNC." This metric not only characterizes the spectrum utilization of SUs in CRNs but also shows the throughput performance of our proposed algorithm under unknown channel quality. Lastly, we present the practical regrets of our two-dimensional MAB algorithm to validate the regret bound analysis.

Simulation Results.
In the simulation, we fix the number of SU receivers  at 10 and vary the identical channel erasure rate  from 0.05 to 0.20. The number of total rounds  is fixed at 1000. We define the total number of packets delivered over the system lifetime as the main performance metric, since we mainly focus on the throughput performance in our study.
To simulate a trace of nonstationary idle durations, we mix several different probability distributions. In this work, we only present two representative cases due to space limitation. One consists of a Poisson distribution (), a uniform distribution [, ], and a geometric distribution (). The other is composed of a hypergeometric distribution (, , ), a normal distribution (,  2 ), and a binomial distribution (, ). To be specific, we generate three sequences from the distributions, and then each value of the trace is chosen with equal probability from these distributions. Accordingly, the generated sequence does not follow any single probability distribution. We set the distribution parameters  = 50,  = 30,  = 50,  = 0.025,  = 60,  = 30,  = 40,  = 25,  = 10, and  = 40, as an illustration in this study. Moreover, we also generate a trace of nonstationary idle durations by applying a Poisson distribution with a round-dependent parameter; that is,  consecutive idle durations follow a Poisson distribution with a specific parameter   . Specifically, the  idle durations are divided into / groups, where the idle durations in th group follow the Poisson distribution (20|sin(2)|) ( = 1, 2, . . ., /). In the simulations, we set the traces with  = 50 (fast changing) and  = 200 (slowly changing) as nonstationary idle durations III and IV, respectively.
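A trace of the first kind can be sketched with the standard library alone. This is only an illustration of mixing a Poisson, a uniform, and a geometric distribution with equal probability; the mapping of the listed values to distribution parameters (Poisson mean 50, uniform range 30 to 50, geometric success probability 0.025) is an assumption, the helper names are hypothetical, and capping at the maximum idle duration is omitted.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def geometric(rng, success_probability):
    """Number of trials up to and including the first success (inverse transform)."""
    u = rng.random()
    return int(math.log(1.0 - u) / math.log(1.0 - success_probability)) + 1


def mixed_trace(rounds, seed=0):
    """Idle durations where each value comes from a randomly chosen distribution."""
    rng = random.Random(seed)
    samplers = [
        lambda: poisson(rng, 50),          # assumed Poisson mean
        lambda: rng.randint(30, 50),       # assumed uniform range (inclusive)
        lambda: geometric(rng, 0.025),     # assumed success probability
    ]
    return [rng.choice(samplers)() for _ in range(rounds)]


trace = mixed_trace(1000, seed=1)
print(len(trace))  # 1000
```

Drawing each value from a randomly selected sampler is equivalent in distribution to generating three full sequences and picking one of them per position, which is how the procedure is described above.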
Figure 2 shows the performance of different algorithms under the first nonstationary idle durations when channel quality deteriorates. From Figure 2(a), we can see that the performance of all algorithms degrades as channel quality deteriorates, while the proposed two-dimensional MAB algorithm achieves performance fairly comparable to the theoretic upper bound and to "MAB + OSNC," which assumes the channel quality is known in advance. Specifically, the relative utilization of the two-dimensional MAB algorithm over "Full Information + OSNC" and "MAB + OSNC" is up to 80.2% and 87%, respectively. This is because, although we do not have enough channel information to select the optimal block size, we are able to approximate the optimal transmission scheme through the tradeoff between exploiting the best past idle duration-block size pair and exploring new, possibly better idle duration-block size pairs. This shows the advantage of our proposed two-dimensional MAB algorithm, which makes it applicable in a wide range of settings with additional channel uncertainty. For example, when SUs enter a new CRN, the channel quality information cannot be learned at once. In this situation, our proposed two-dimensional MAB algorithm can be used immediately and achieves comparable performance. Besides, the two-dimensional MAB algorithm is more flexible than "MAB + OSNC," since different channels may have different quality and the quality of a single channel may change with time.
Moreover, from Figure 2(b), we observe that the utilization of the proposed algorithm over "MAB + OSNC" is relatively stable when channel quality varies. This is because, although the performance of both the proposed algorithm and "MAB + OSNC" degrades as the channel erasure rate increases, our proposed algorithm jointly selects the idle duration and block size to maximize the throughput rather than only estimating the idle duration. Also, as illustrated in Figure 2(b), the utilization of our proposed algorithm over "Full Information + OSNC" decreases when the channel erasure rate increases. The reason is as follows. When the channel erasure rate increases, the channel becomes more uncertain in addition to the uncertain idle duration, which makes learning harder for the two-dimensional MAB algorithm.
Figure 3 describes the performance of different algorithms under the second nonstationary idle durations when channel quality deteriorates. The performance of the two-dimensional MAB algorithm under the second nonstationary idle durations is similar to that under the first nonstationary idle durations. The utilization of the two-dimensional MAB algorithm over "Full Information + OSNC" and "MAB + OSNC" is 81% and 87% at most, respectively. Also, as shown in Figure 3(b), the utilization of our proposed algorithm over "MAB + OSNC" under the second idle durations changes from 87% to 81% when the channel erasure rate varies from 0.05 to 0.2. Although the utilization of the two-dimensional MAB algorithm over "MAB + OSNC" under the second idle durations is not as stable as that under the first idle durations, the utilization is at least 81%. This shows the effectiveness of our proposed algorithm in choosing the idle duration and block size jointly when faced with uncertain channel quality.
Figure 4 presents the performance results of our proposed two-dimensional MAB algorithm under nonstationary idle durations III and IV, compared with "Full Information + OSNC" and "MAB + OSNC." From Figures 4(a) and 4(b), we can see that the two-dimensional MAB algorithm achieves similar relative throughput performance under nonstationary idle durations III and IV as channel quality deteriorates. Furthermore, according to Figure 4(c), the relative utilizations of the two-dimensional MAB algorithm over "Full Information + OSNC" and "MAB + OSNC" are up to 81% and 88%, respectively. The relative utilization of the proposed algorithm over "Full Information + OSNC" is also stable, in the sense that the two relative utilization curves almost coincide when  varies from 50 to 200. This shows that the stability of the two-dimensional MAB algorithm is relatively independent of the type of nonstationary idle duration. This is because, with the help of MAB, the two-dimensional MAB algorithm can efficiently handle the uncertainty of both idle duration and channel quality, no matter whether the idle duration is slowly changing or fast changing.

Analysis of the Practical Regrets.
To validate the regret bound analysis, we conduct simulations on the practical regrets achieved by the proposed two-dimensional MAB algorithm and compare them with the corresponding theoretic bound. Specifically, we fix the number of SU receivers  and the channel erasure rate   ( ∈ {1, 2, . . ., }) at 10 and 0.1, respectively, and increase the number of rounds  from 1 to 100. To better investigate the regret, we let the parameter  in Theorem 2 change from 0.001 to 0.1, which represents different confidence levels on the achieved regret bound.
Figures 5(a) and 5(b) illustrate the practical regrets of the two-dimensional MAB algorithm compared with the theoretic bound derived by Theorem 2, under nonstationary idle durations I and II, respectively. As shown in the figures, the practical regrets of the proposed two-dimensional MAB algorithm are strictly bounded by the corresponding theoretic bound, no matter what value of  is chosen. These results validate our theoretic analysis on the regret bounds.

Conclusion
In this paper, we investigate how to utilize network coding to improve the throughput performance of SUs in CRNs when the PU channel quality information is not available. We formulate the problem of SUs with network coding under unknown channel quality as a two-dimensional MAB problem. We propose a two-dimensional MAB algorithm that chooses the idle duration and block size jointly. The performance of the proposed algorithm is analyzed in terms of regret and analytically proven to be asymptotically throughput optimal, compared to the static optimal scheme. Our extensive simulation results show that the throughput performance of our proposed algorithm is close to the theoretic bound and to the network coding algorithm that assumes known channel quality information, achieving up to 87% utilization.
For the lower bound, according to the definitions, we have

Figure 1: SU's slot structure for channel sensing and data transmission.
[Figure: (a) Performance of different schemes under different channel qualities; (b) Relative utilization of the two-dimensional MAB algorithm. Horizontal axis: channel erasure rate. Legend includes: two-dimensional MAB.]