Application of Reinforcement Learning in Cognitive Radio Networks: Models and Algorithms

Cognitive radio (CR) enables unlicensed users to exploit underutilized portions of the licensed spectrum whilst minimizing interference to licensed users. Reinforcement learning (RL), which is an artificial intelligence approach, has been applied to enable each unlicensed user to observe and carry out optimal actions for performance enhancement in a wide range of schemes in CR, such as dynamic channel selection and channel sensing. This paper presents new discussions of RL in the context of CR networks. It provides an extensive review of how most schemes have been approached using the traditional and enhanced RL algorithms through state, action, and reward representations. Examples of the enhancements on RL, which do not appear in the traditional RL approach, are rules and cooperative learning. This paper also reviews performance enhancements brought about by the RL algorithms and open issues. This paper aims to establish a foundation in order to spark new research interests in this area. Our discussion is presented in a tutorial manner so that it is comprehensible to readers outside the specialty of RL and CR.


Introduction
Cognitive radio (CR) [1] is the next generation wireless communication system that enables unlicensed or Secondary Users (SUs) to explore and use underutilized licensed spectrum (or white spaces) owned by the licensed or Primary Users (PUs) in order to improve the overall spectrum utilization. The CR technology improves the availability of bandwidth at each SU, and so it enhances the SU network performance. Reinforcement learning (RL) has been applied in CR so that the SUs can observe, learn, and take optimal actions in their respective local operating environments. For example, a SU observes its spectrum to identify white spaces, learns the best possible channels for data transmissions, and takes actions such as transmitting data in the best possible channel. Examples of schemes in which RL has been applied are dynamic channel selection [2], channel sensing [3], and routing [4]. To the best of our knowledge, the discussion on the application of RL in CR networks is new, despite the importance of RL in achieving the fundamental concept of CR, namely, the cognition cycle (see Section 2.2.1). This paper provides an extensive review on various aspects of the application of RL in CR networks, particularly, the components, features, and enhancements of RL. Most importantly, we present how the traditional and enhanced RL algorithms have been applied to approach most schemes in CR networks. Specifically, for each new RL model and algorithm, which is our focus, we present the purpose(s) of a CR scheme, followed by an in-depth discussion on its associated RL model (i.e., state, action, and reward representations), which characterizes the purposes, and finally the RL algorithm, which aims to achieve the purpose. Hence, this paper serves as a solid foundation for further research in this area, particularly, for the enhancement of RL in various schemes in the context of CR, which can be achieved using new extensions in existing schemes, and for the application of RL in new schemes.
The rest of this paper is organized as follows. Section 2 presents RL and CR networks. Section 3 presents various components, features, and enhancements of RL in the context of CR networks. Section 4 presents various RL algorithms in the context of CR networks. Section 5 presents performance enhancements brought about by the RL algorithms in various schemes in CR networks. Section 6 presents open issues. Section 7 presents conclusions.

Reinforcement Learning and Cognitive Radio Networks
This section presents an overview of RL and CR networks.

Reinforcement Learning.
Reinforcement learning is an unsupervised and online artificial intelligence technique that improves system performance using simple modeling [5]. Through unsupervised learning, there is no external teacher or critic to oversee the learning process, and so, an agent learns knowledge about the operating environment by itself. Through online learning, an agent learns knowledge on the fly while carrying out its normal operation, rather than using empirical data or experimental results from the laboratory. Figure 1 shows a simplified version of a RL model. At a particular time instant, a learning agent or a decision maker observes state and reward from its operating environment, learns, decides, and carries out its action. The important representations in the RL model for an agent are as follows.
(i) State represents the decision-making factors, which affect the reward (or network performance), observed by an agent from the operating environment. Examples of states are the channel utilization level by PUs and channel quality.
(ii) Action represents an agent's action, which may change or affect the state (or operating environment) and reward (or network performance), and so the agent learns to take optimal actions most of the time.
(iii) Reward represents the positive or negative effects of an agent's action on its operating environment in the previous time instant. In other words, it is the consequence of the previous action on the operating environment in the form of network performance (e.g., throughput).
At any time instant, an agent observes its state and carries out a proper action so that the state and reward, which are the consequences of the action, improve in the next time instant. Generally speaking, RL estimates the reward of each state-action pair, and this constitutes knowledge. The most important component in Figure 1 is the learning engine that provides knowledge to the agent. We briefly describe how an agent learns. At any time instant, an agent's action may affect the state and reward for better or for worse or maintain the status quo; and this in turn affects the agent's next choice of action. As time progresses, the agent learns to carry out a proper action given a particular state. As an example of the application of the RL model in CR networks, the learning mechanism is used to learn channel conditions in a dynamic channel selection scheme. The state represents the channel utilization level by PUs and channel quality. The action represents a channel selection. Based on an application, the reward represents distinctive performance metrics such as throughput and successful data packet transmission rate. Lower channel utilization level by PUs and higher channel quality indicate better communication link, and hence the agent may achieve better throughput performance (reward). Therefore, maximizing reward provides network performance enhancement.
Q-learning [5] is a popular technique in RL, and it has been applied in CR networks. Denote decision epochs by $t \in T = \{1, 2, \ldots\}$; the knowledge possessed by agent $i$ for a particular state-action pair at time $t$ is represented by the Q-function as follows:

$$Q^i_{t+1}(s^i_t, a^i_t) \leftarrow (1 - \alpha)\, Q^i_t(s^i_t, a^i_t) + \alpha \left[ r^i_{t+1}(s^i_{t+1}) + \gamma \max_{a \in A} Q^i_t(s^i_{t+1}, a) \right], \tag{1}$$

where
(i) $s^i_t \in S$ represents state,
(ii) $a^i_t \in A$ represents action,
(iii) $r^i_{t+1}(s^i_{t+1}) \in R$ represents delayed reward, which is received at time $t + 1$ for an action taken at time $t$,
(iv) $0 \le \gamma \le 1$ represents discount factor. The higher the value of $\gamma$, the more the agent relies on the discounted future reward $\gamma \max_{a \in A} Q^i_t(s^i_{t+1}, a)$ compared to the delayed reward $r^i_{t+1}(s^i_{t+1})$,
(v) $0 \le \alpha \le 1$ represents learning rate. The higher the value of $\alpha$, the more the agent relies on the delayed reward $r^i_{t+1}(s^i_{t+1})$ and the discounted future reward $\gamma \max_{a \in A} Q^i_t(s^i_{t+1}, a)$, compared to the Q-value $Q^i_t(s^i_t, a^i_t)$ at time $t$.
The Scientific World Journal
At decision epoch $t$, agent $i$ observes its operating environment to determine its current state $s^i_t$. Based on $s^i_t$, the agent chooses an action $a^i_t$. Next, at decision epoch $t + 1$, the state changes to $s^i_{t+1}$ as a consequence of the action $a^i_t$, and the agent receives delayed reward $r^i_{t+1}(s^i_{t+1})$. Subsequently, the Q-value $Q^i_{t+1}(s^i_t, a^i_t)$ is updated using (1). Note that, in the remaining decision epochs at time $t, t + 1, \ldots$, the agent is expected to take optimal actions with regard to the states; hence, the Q-value is updated using a maximized discounted future reward $\gamma \max_{a \in A} Q^i_t(s^i_{t+1}, a)$. As this procedure evolves through time, agent $i$ receives a sequence of rewards and the Q-value converges. Q-learning searches for an optimal policy at all time instants through maximizing the value function $V^{\pi}(s^i_t)$ as shown below:

$$V^{\pi}(s^i_t) = \max_{a \in A} Q^i_t(s^i_t, a). \tag{2}$$

Hence, the policy (or action selection) for agent $i$ is as follows:

$$\pi^i(s^i_t) = \arg\max_{a \in A} Q^i_t(s^i_t, a). \tag{3}$$

The update of the Q-value in (1) does not cater for the actions that are never chosen. Exploitation chooses the best-known action, or the greedy action, at all time instants for performance enhancement. Exploration chooses the other nonoptimal actions once in a while to improve the estimates of all Q-values in order to discover better actions. While Figure 1 shows a single agent, the presence of multiple agents is feasible. In the context of CR networks, a rigorous proof of the convergence of the Q-value in the presence of multiple SUs has been shown in [6].
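The update in (1) and the greedy policy in (3) can be illustrated with a minimal tabular sketch. This is an assumed illustration rather than code from any of the reviewed schemes: the class name `QLearningAgent`, its parameters, and the use of an epsilon-greedy rule for exploration are choices made here for brevity.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Tabular Q-learning sketch (hypothetical helper, not from the paper)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate, 0 <= alpha <= 1
        self.gamma = gamma      # discount factor, 0 <= gamma <= 1
        self.epsilon = epsilon  # exploration probability (assumed epsilon-greedy)
        self.q = defaultdict(float)  # Q-values keyed by (state, action), default 0
        self.rng = random.Random(seed)

    def greedy_action(self, state):
        # Policy (3): choose the action with the highest Q-value in this state.
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def choose_action(self, state):
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.greedy_action(state)

    def update(self, state, action, reward, next_state):
        # Update (1): blend the old Q-value with the delayed reward plus
        # the discounted best Q-value of the next state.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        old = self.q[(state, action)]
        self.q[(state, action)] = (1 - self.alpha) * old + self.alpha * target
        return self.q[(state, action)]
```

In a dynamic channel selection setting, the actions could be channel indices and the reward could be 1 or -1 for a successful or unsuccessful transmission; those mappings are left to the caller.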
The advantages of RL are as follows: (i) instead of tackling every single factor that affects the system performance, RL models the system performance (e.g., throughput) itself, which covers a wide range of factors affecting the throughput performance, including the channel utilization level by PUs and channel quality; hence, its modeling approach is simple; (ii) prior knowledge of the operating environment is not necessary, and so a SU can learn the operating environment (e.g., channel quality) as time goes by.

Cognitive Radio Networks.
Traditionally, spectrum allocation policy has been partitioning radio spectrum into smaller ranges of licensed and unlicensed frequency bands (also called channels). The licensed channels provide exclusive channel access to licensed users or PUs. Unlicensed users or SUs, such as the popular wireless communication systems IEEE 802.11, access unlicensed channels without incurring any monetary cost, and they are forbidden to access any of the licensed channels. Examples of unlicensed channels are Industrial, Scientific, and Medical (ISM) and Unlicensed National Information Infrastructure (UNII) bands. While the licensed channels have been underutilized, the opposite phenomenon has been observed among the unlicensed channels. Cognitive radio enables SUs to explore radio spectrum and use white spaces whilst minimizing interference to PUs. The purpose is to improve the availability of bandwidth at each SU, hence improving the overall utilization of radio spectrum. CR helps the SUs to establish a "friendly" environment, in which the PUs and SUs coexist without causing interference with each other as shown in Figure 2. In Figure 2, a SU switches its operating channel across various channels from time to time in order to utilize white spaces in the licensed channels. Note that each SU may observe different white spaces, which are location dependent. The SUs must sense the channels and detect the PUs' activities whenever they reappear in white spaces. Subsequently, the SUs must vacate and switch their respective operating channel immediately in order to minimize interference to PUs. For a successful communication, a particular white space must be available at both SUs in a communication node pair.
The rest of this subsection is organized as follows. Section 2.2.1 presents the cognition cycle, which is an essential component in CR. Section 2.2.2 presents various application schemes in which RL has been applied to provide performance enhancement.

Cognition Cycle.
Cognition cycle [7], which is a well-known concept in CR, is embedded in each SU to achieve context awareness and intelligence in CR networks. Context awareness enables a SU to sense and be aware of its operating environment, while intelligence enables the SU to observe, learn, and use the white spaces opportunistically so that a static predefined policy is not required while providing network performance enhancement.
The cognition cycle can be represented by a RL model as shown in Figure 1. The RL model can be tailored to fit well with a wide range of applications in CR networks. A SU can be modeled as a learning agent. At a particular time instant, the SU agent observes state and reward from its operating environment, learns, decides, and carries out action on the operating environment in order to maximize network performance. Further description on RL-based cognition cycle is presented in Section 2.1.

Application Schemes.
Reinforcement learning has been applied in a wide range of schemes in CR networks for SU performance enhancements, whilst minimizing interference to PUs. The schemes are listed as follows, and the nomenclatures (e.g., (A1) and (A2)) are used to represent the respective application schemes throughout the paper.
(A1) Dynamic Channel Selection (DCS). The DCS scheme selects operating channel(s) with white spaces for data transmission whilst minimizing interference to PUs. Yau et al. [8,9] propose a DCS scheme that enables SUs to learn and select channels with low packet error rate and low level of channel utilization by PUs in order to enhance QoS, particularly throughput and delay performances.
(A2) Channel Sensing. Channel sensing senses for white spaces and detects the presence of PU activities. In [10], the SU reduces the number of sensing channels and may even turn off the channel sensing function if its operating channel has achieved the required successful transmission rate in order to enhance throughput performance. In [11], the SU determines the durations of channel sensing, time of channel switching, and data transmission, respectively, in order to enhance QoS, particularly throughput, delay, and packet delivery rate performances. Both [10] and [11] incorporate DCS (A1) into channel sensing in order to select operating channels. Due to the environmental factors that can deteriorate transmissions (e.g., multipath fading and shadowing), Lo and Akyildiz [3] propose a cooperative channel sensing scheme, which combines sensing outcomes from cooperating one-hop SUs, to improve the accuracy of PU detection.
(A3) Security Enhancement. Security enhancement scheme [12] aims to ameliorate the effects of attacks from malicious SUs. Vucevic et al. [13] propose a security enhancement scheme to minimize the inaccurate sensing outcomes received from neighboring SUs in channel sensing (A2). A SU becomes malicious whenever it sends inaccurate sensing outcomes, intentionally (e.g., Byzantine attacks) or unintentionally (e.g., unreliable devices). Wang et al. [14] propose an antijamming scheme to minimize the effects of jamming attacks from malicious SUs, which constantly transmit packets to keep the channels busy at all times so that SUs are deprived of any opportunities to transmit.
(A4) Energy Efficiency Enhancement. Energy efficiency enhancement scheme aims to minimize energy consumption. Zheng and Li [15] propose an energy-efficient channel sensing scheme to minimize energy consumption in channel sensing. Energy consumption varies with activities, and it increases from sleep, to idle, to channel sensing. The scheme takes into account the PU and SU traffic patterns and determines whether a SU should enter sleep, idle, or channel sensing modes. Switching between modes should be minimized because each transition between modes incurs time delays.
(A5) Channel Auction. Channel auction provides a bidding platform for SUs to compete for white spaces. Chen and Qiu [16] propose a channel auction scheme that enables the SUs to learn the policy (or action selection) of their respective SU competitors and place bids for white spaces. This helps to allocate white spaces among the SUs efficiently and fairly.
(A6) Medium Access Control (MAC). MAC protocol aims to minimize packet collision and maximize channel utilization in CR networks. Li et al. [17] propose a collision reduction scheme that reduces the probability of packet collision among PUs and SUs, and it has been shown to increase throughput and to decrease packet loss rate among the SUs. Li et al. [18] propose a retransmission policy that enables a SU to determine how long it should wait before transmission in order to minimize channel contention.
(A7) Routing. Routing enables each SU source or intermediate node to select its next hop for transmission in order to search for the best route(s), which normally incurs the least cost or provides the highest amount of rewards, to the SU destination node. Each link within a route has different types and levels of costs, such as queuing delay, available bandwidth or congestion level, packet loss rate, energy consumption level, and link reliability, as well as changes in network topology as a result of irregular node's movement speed and direction.
(A8) Power Control. Yao and Feng [19] propose a power selection scheme that selects an available channel and a power level for data transmission. The purpose is to improve its Signal-to-Noise Ratio (SNR) in order to improve packet delivery rate.

Reinforcement Learning in the Context of Cognitive Radio Networks: Components, Features, and Enhancements
This section presents the components of RL, namely, state, action, reward, discounted reward, and Q-function, as well as the features of RL, namely, exploration and exploitation, updates of learning rate, rules, and cooperative learning. The components and features of RL (see Section 2.1) are presented in the context of CR. For each component and feature, we show the traditional approach and subsequently the alternative or enhanced approaches with regard to modeling, representing, and applying them in CR networks. This section serves as a foundation for further research in this area, particularly, the application of existing features and enhancements in current schemes in RL models for either existing or new schemes. Note that, for improved readability, the notations used in this paper represent the same meaning throughout the entire paper, although different references in the literature may use different notations for the same purpose.

3.1. State.
Traditionally, each state is comprised of a single type of information. For instance, in [11], each state $s^i_t \in S = \{1, 2, \ldots, K\}$ represents a single channel out of $K$ channels available for data transmission. The state may be omitted in some cases. For instance, in [10], the state and action representations are similar, so the state is not represented. The traditional state representation can be enhanced in the context of CR as described next.
The value of a state may deteriorate as time goes by. For instance, Lundén et al. [20] propose a channel sensing (A2) scheme in which each state $s^i_{k,t}$, with $0 \le s^i_{k,t} \le 1$, represents SU agent $i$'s belief (or probability) that channel $k$ is idle (or the absence of PU activity). Note that the belief value of channel $k$ deteriorates whenever the channel has not been sensed recently, and this indicates the diminishing confidence in the belief that channel $k$ remains idle. Denote a small step size by $\delta$ (e.g., $\delta = 0.01$); the state value of channel $k$ deteriorates if it is not updated at each time instant; specifically, $s^i_{k,t+1} = s^i_{k,t} - \delta$.
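The belief decay described above can be sketched in a few lines. The function name `decay_belief`, the `sensed_now` flag, and the clamping of the belief at zero are assumptions made for this illustration; the text only specifies the subtraction of a small step size when the channel is not updated.

```python
def decay_belief(belief, sensed_now, step=0.01):
    """Return the next idle-belief for a channel (hypothetical helper).

    If the channel was sensed at this time instant, the belief is left for
    the sensing update to set. Otherwise it shrinks by a small step size.
    Clamping at 0.0 is an assumption so the value stays a valid probability.
    """
    if sensed_now:
        return belief
    return max(0.0, belief - step)
```

For example, a channel with belief 0.5 that goes unsensed for ten time instants would drop to roughly 0.4 under the default step size.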

Action.
Traditionally, each action represents a single action out of a set of possible actions $A$. For instance, in [10], each action $a^i_t \in A = \{1, 2, \ldots, K\}$ represents a single channel out of the $K$ channels available for data transmission. The traditional action representation can be enhanced in the context of CR as described next.
Each action $a^i_t \in A$ can be further divided into various levels. As an example, Yao and Feng [19] propose a joint DCS (A1) and power allocation (A8) scheme in which each action $a_k \in A = \{a_1, a_2, \ldots, a_K\}$ represents a channel selection, and each $a_{PA} \in A_{PA} = \{a_{PA,1}, a_{PA,2}, \ldots, a_{PA,N_{PA}}\}$ represents a power level allocation with $N_{PA}$ being the number of power levels. As another example, Zheng and Li [15] propose an energy efficiency enhancement (A4) scheme in which there are four kinds of actions, namely, transmit, idle, sleep, and sense channel. The sleep action $a_{sp} \in A_{sp} = \{a_{sp,1}, a_{sp,2}, \ldots, a_{sp,N_{sp}}\}$ represents a sleep level with $N_{sp}$ being the number of sleep levels. Note that different sleep levels incur different amounts of energy consumption.

Delayed Reward.
Traditionally, each delayed reward represents the amount of performance enhancement achieved by a state-action pair. A single reward computation approach is applicable to all state-action pairs. As an example, in [2], $r^i_{t+1}(s^i_{t+1}) \in R = \{1, -1\}$ represents the reward and cost values of 1 and −1 for each successful and unsuccessful transmission, respectively. As another example, in [8], $r^i_{t+1}(s^i_{t+1})$ represents the amount of throughput achieved within a time window. The traditional reward representation can be enhanced in the context of CR as described next.
The delayed reward can be computed differently for distinctive actions. As an example, in a joint DCS (A1) and channel sensing (A2) scheme, Felice et al. [21] compute the delayed rewards in two different ways based on the types of actions: channel sensing $a_{se}$ and data transmission $a_{tx}$. Firstly, a SU agent $i$ calculates delayed reward $r^i_{t+1}(s^i_t, a^i_{se,t})$ at time instant $t + 1$. The $r^i_{t+1}(s^i_t, a^i_{se,t})$ indicates the likelihood of the existence of PU activities in channel $s^i_t$ whenever action $a^i_{se,t}$ is taken:

$$r^i_{t+1}(s^i_t, a^i_{se,t}) = \frac{\sum_{j=0}^{N_{nbr,i}} d^j_t}{N_{nbr,i}},$$

where $N_{nbr,i}$ indicates the number of neighboring SU agents, while $d^j_t$, which is a binary value, indicates the existence of PU activities as reported by SU neighbor agent $j$. Secondly, a SU agent $i$ calculates delayed reward $r^i_{t+1}(s^i_t, a^i_{tx,t})$ at time instant $t + 1$. The $r^i_{t+1}(s^i_t, a^i_{tx,t})$ indicates the successful transmission rate, which takes into account the aggregated effect of interference from PU activities whenever action $a^i_{tx,t}$ is taken.
where DATA, indicates the number of data packets sent by SU agent , ACK , indicates the number of acknowledgment packets received by SU agent , and DATA , indicates the number of data packets being transmitted by SU agent .
Jouini et al. [22] apply an Upper Confidence Bound (UCB) algorithm to compute delayed rewards in a dynamic and uncertain operating environment (e.g., an operating environment with inaccurate sensing outcomes), and it has been shown to improve throughput performance in DCS (A1). The main objective of this algorithm is to determine the upper confidence bounds for all rewards and subsequently use them to make decisions on action selection. The rewards are uncertain, and the uncertainty is caused by the dynamicity and uncertainty of the operating environment. Let $N_t(a)$ represent the number of times an action $a \in A$ has been taken on the operating environment up to time $t$; an agent calculates the upper confidence bounds of all delayed rewards as follows:

$$B_t(a, N_t(a)) = \bar{R}_t(a, N_t(a)) + A_t(a, N_t(a)),$$

where $\bar{R}_t(a, N_t(a)) = \sum_{m=0}^{t-1} r_m(a) / N_t(a)$ is the mean reward, and $A_t(a, N_t(a))$ is the upper confidence bias being added to the mean. Note that $r_m(a) = 0$ if $a$ is not chosen at time instant $m$. The $A_t(a, N_t(a))$ is calculated as follows:

$$A_t(a, N_t(a)) = \sqrt{\frac{c \ln t}{N_t(a)}},$$

where the exploration coefficient $c > 1$ is a constant empirical factor. For instance, $c = 1.2$ in [22,23]. The UCB algorithm selects actions with the highest upper confidence bounds, and so (3) is rewritten as follows:

$$\pi^i(s^i_t) = \arg\max_{a \in A} B_t(a, N_t(a)).$$
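The UCB action selection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of [22]: the bias is taken to be the common square-root form with a logarithm of the current time over the action count, the function names `ucb_index` and `select_action` are hypothetical, and actions that have never been tried are given an infinite index so that each is tried once.

```python
import math


def ucb_index(mean_reward, count, t, c=1.2):
    """Upper confidence bound for one action: mean reward plus a bias term.

    Assumes the bias sqrt(c * ln(t) / count); c is the exploration
    coefficient. Untried actions get an infinite index (an assumption).
    """
    if count == 0:
        return math.inf
    return mean_reward + math.sqrt(c * math.log(t) / count)


def select_action(reward_sums, counts, t, c=1.2):
    """Return the index of the action with the highest upper confidence bound.

    reward_sums[k] is the accumulated reward of action k and counts[k] the
    number of times it has been taken up to time t.
    """
    indices = [
        ucb_index(total / n if n else 0.0, n, t, c)
        for total, n in zip(reward_sums, counts)
    ]
    return max(range(len(indices)), key=indices.__getitem__)
```

The bias shrinks as an action is taken more often, so rarely used channels are revisited occasionally while well-explored channels are ranked mostly by their mean reward.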

3.4. Discounted Reward.
Traditionally, the discounted reward has been applied to indicate the dependency of the Q-value on future rewards. Depending on the application, the discounted reward may be omitted with $\gamma = 0$ to show the lack of dependency on future rewards, and this approach is generally called the myopic approach. As an example, Li [6] and Chen et al. [24] apply Q-learning in DCS (A1), and the Q-function in (1) is rewritten as follows:

$$Q^i_{t+1}(s^i_t, a^i_t) \leftarrow (1 - \alpha)\, Q^i_t(s^i_t, a^i_t) + \alpha\, r^i_{t+1}(s^i_{t+1}).$$

3.5. Q-Function.
The traditional Q-function (see (1)) has been widely applied to update the Q-value in CR networks. The traditional Q-function can be enhanced in the context of CR as described next.
Lundén et al. [20] apply a linear function approximation-based approach to reduce the dimensionality of the large state-action spaces (or reduce the number of state-action pairs) in a collaborative channel sensing (A2) scheme. A linear function provides a matching value for a state-action pair. The matching value, which shows the appropriateness of a state-action pair, is subsequently applied in Q-value computation. The linear function is normally fixed (or hard-coded), and various kinds of linear functions are possible to indicate the appropriateness of a state-action pair based on prior knowledge. For instance, the linear function yields a value that represents the level of desirability of a certain number of SU agents sensing a particular channel [20]. A higher value indicates that the number of SU agents sensing a particular channel is closer to a desirable number. Using a fixed linear function, the learning problem is transformed into learning the matching value, whose parameter is updated as the agent learns.

3.6. Exploration and Exploitation.
Traditionally, there are two popular approaches to achieve a balanced trade-off between exploration and exploitation, namely, softmax and $\varepsilon$-greedy [5]. For instance, Yau et al. [8] use the $\varepsilon$-greedy approach in which an agent explores with a small probability $\varepsilon$ (i.e., $\varepsilon = 0.1$) and exploits with probability $1 - \varepsilon$. Essentially, these approaches aim to control the frequency of exploration so that the best-known action is taken most of the time. The traditional exploration and exploitation approach can be enhanced in the context of CR as described next.
In [3,25], using the softmax approach, an agent selects actions based on a Boltzmann distribution; specifically, the probability of selecting an action $a^i_t$ in state $s^i_t$ is as follows:

$$P(s^i_t, a^i_t) = \frac{e^{Q^i_t(s^i_t, a^i_t)/\tau_t}}{\sum_{a \in A} e^{Q^i_t(s^i_t, a)/\tau_t}},$$

where $\tau_t$ is a time-varying parameter called temperature. A higher temperature value indicates more exploration, while a smaller temperature value indicates more exploitation. Denote the time duration during which exploration actions are being chosen by $T_e$; the temperature is decreased as time goes by so that the agent performs more exploitation as follows:

$$\tau_t = -\frac{(\tau_0 - \tau_e)\, t}{T_e} + \tau_0,$$

where $\tau_0$ and $\tau_e$ are the initial and final values of temperature, respectively. Note that, due to the dynamicity of the operating environment, exploration is necessary at all times, and so $\tau_e \ge 0$. In [21], using the $\varepsilon$-greedy approach, an agent uses a simple approach to decrease the exploration probability as time goes by as follows:

$$\varepsilon_{t+1} = \max(\varepsilon_{min},\ \omega\, \varepsilon_t),$$

where $0 \le \omega \le 1$ is a discount factor and $\varepsilon_{min}$ is the minimum exploration probability.
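The two exploration schedules can be sketched together. The sketch below assumes a linear temperature decrease over the exploration duration and a multiplicative decay of the exploration probability with a lower bound; both schedule shapes, and all function and parameter names, are assumptions for illustration rather than the exact rules of [3], [21], or [25].

```python
import math


def boltzmann_probabilities(q_values, temperature):
    """Softmax (Boltzmann) action probabilities for a list of Q-values.

    The maximum Q-value is subtracted before exponentiation for numerical
    stability; this does not change the resulting probabilities.
    """
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def linear_temperature(t, duration, tau_start, tau_end):
    """Temperature decreasing linearly from tau_start to tau_end (assumed)."""
    if t >= duration:
        return tau_end
    return tau_start - (tau_start - tau_end) * t / duration


def decayed_epsilon(epsilon, discount, epsilon_min):
    """Multiplicative decay of the exploration probability with a floor."""
    return max(epsilon_min, discount * epsilon)
```

A high temperature flattens the probabilities toward uniform selection, while a low temperature concentrates them on the action with the highest Q-value; the floor on the exploration probability keeps some exploration alive in a changing environment.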

Other Features and Enhancements.
This section presents other features and enhancements on the traditional RL approach found in various schemes for CR networks, including updates of learning rate, rules, and cooperative learning.

Updates of Learning Rate.
Traditionally, the learning rate $\alpha$ is a constant value [16]. The learning rate may be adjusted as time goes by because a higher value of $\alpha$ may compromise the RL algorithm's ability to converge to a correct action in a finite number of steps [26]. In [27], the learning rate reduces as time goes by using $\alpha(t) = \alpha(t - 1) - \Delta$, where $\Delta$ is a small value to provide smooth transition between steps. In [14], the learning rate is updated using $\alpha(t) = \Delta \cdot \alpha(t - 1)$.

Rules.
Rules determine a feasible set of actions for each state. The traditional RL algorithm does not apply rules although it is an important component in CR networks. For instance, in order to minimize interference with PUs, the SUs must comply with the timing requirements set by the PUs, such as the time interval that a SU must vacate its operating channel after any detection of PU activities.
As an example, Zheng and Li [15] propose an energy efficiency enhancement scheme in which there are four kinds of actions, namely, transmit, idle, sleep, and sense channel. Rules are applied so that the feasible set of actions is comprised of idle and sleep whenever the state indicates that there is no packet in the buffer. As another example, Peng et al. [4] propose a routing scheme, specifically, a next hop selection scheme in which the action represents the selection of a next hop out of a set of SU next hops. Rules are applied so that the feasible set of actions is limited to SU next hops with a certain level of SNR, as well as with shorter distance between next hop and the hop after next. The purposes of the rules are to reduce transmission delays and to ensure high-quality reception. Further description about [4,15] is found in Table 1.
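The effect of rules on action selection can be sketched as a filter applied before the greedy choice. The example below mirrors the buffer rule described for [15]; the function names, the action labels, and the Q-values in the usage are hypothetical and only illustrate the idea of restricting the feasible action set.

```python
ALL_ACTIONS = ["transmit", "idle", "sleep", "sense"]


def feasible_actions(buffer_empty):
    """Return the feasible action set for the current state.

    Rule (from the text): with no packet in the buffer, only idle and
    sleep are feasible. Otherwise all four actions are allowed.
    """
    if buffer_empty:
        return ["idle", "sleep"]
    return list(ALL_ACTIONS)


def greedy_feasible(q_values, feasible):
    """Pick the feasible action with the highest Q-value."""
    return max(feasible, key=lambda action: q_values[action])
```

With Q-values of 5.0 for transmit and 2.0 for sleep, an agent with an empty buffer would still choose sleep, because transmit is excluded by the rule before the Q-values are compared.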

Cooperative Learning.
Cooperative learning enables neighbor agents to share information among themselves in order to expedite the learning process. The exchanged information can be applied in the computation of the Q-function. The traditional RL algorithm does not apply cooperative learning, although it has been investigated in multiagent reinforcement learning (MARL) [28].
Felice et al. [11] propose a cooperative learning approach to reduce exploration. The Q-value is exchanged among the SU agents, and it is used in the Q-function computation to update the Q-value. Each SU agent $i$ keeps track of its own Q-value, which is updated in a similar way to [6] (see Section 3.4). At any time instant, each agent $i$ receives a Q-value from its neighbor agent $j \in J = \{1, 2, \ldots, N_{nbr,i}\}$. The agent keeps a vector of Q-values $\mathbf{Q}^i_t(s^i_t)$, and its own Q-value is updated using the received Q-values, each with a weight assigned to cooperation with neighbor agent $j$. A similar approach has been applied in [25], in which the Q-value is also updated based on such a weight. In [11], the weight depends on how much a neighbor agent $j$ can contribute to the accurate estimation of the value function, such as the physical distance between agents $i$ and $j$. In [25], the weight depends on the accuracy of the exchanged Q-value (or the expert value, as described next) and the physical distance between agents $i$ and $j$.
In [25], an agent exchanges its Q-value with its neighboring agents only if the expert value for that Q-value is greater than a particular threshold. The expert value indicates the accuracy of the Q-value. For instance, in [25], the Q-value indicates the availability of white spaces in a channel, and so greater deviation in the signal strengths reduces the expert value. By reducing the exchanges of Q-values with low accuracy, this approach reduces control overhead, and hence it reduces interference to PUs.
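The threshold-gated exchange, followed by a weighted combination of received Q-values, can be sketched as below. The gating follows the description of [25]; the merge rule, a weighted average in which the agent's own Q-value has weight 1, is an assumption made here purely for illustration and is not the update rule of [11] or [25]. All names are hypothetical.

```python
def values_to_share(q_values, expert_values, threshold):
    """Keep only the Q-values whose expert value exceeds the threshold."""
    return {
        key: q
        for key, q in q_values.items()
        if expert_values.get(key, 0.0) > threshold
    }


def merge_neighbor_values(own_q, neighbor_qs, weights):
    """Combine an agent's own Q-value with neighbors' Q-values.

    Assumed rule: weighted average, own value weighted 1.0, neighbor j
    weighted weights[j] (0.0 if no weight is given).
    """
    total_weight = 1.0
    merged = own_q
    for neighbor, q_neighbor in neighbor_qs.items():
        weight = weights.get(neighbor, 0.0)
        merged += weight * q_neighbor
        total_weight += weight
    return merged / total_weight
```

In this sketch the weights could be derived from physical distance or from the expert values, so that nearby or more accurate neighbors influence the merged Q-value more strongly.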
Application of cooperative learning in the CR context has been very limited. More description on cooperative learning is found in Section 4.8. Further research could be pursued to investigate how to improve network performance using this approach in existing and new schemes.

Reinforcement Learning in the Context of Cognitive Radio Networks: Models and Algorithms
Direct application of the traditional RL approach (see Section 2.1) has been shown to provide performance enhancement in CR networks. Reddy [29] presents a preliminary investigation in the application of RL to detect PU signals in channel sensing (A2). Table 1 presents a summary of the schemes that apply the traditional RL approach. For each scheme, we present the purpose(s) of the CR scheme, followed by its associated RL model. Most importantly, this section presents a number of new additions to the RL algorithms, which have been applied to various schemes in CR networks. A summary of the new algorithms, their purposes, and references, is shown in Table 2. Each new algorithm has been designed to suit and to achieve the objectives of the respective schemes. For instance, the collaborative model (see Table 2) aims to achieve an optimal global reward in the presence of multiple agents, while the traditional RL approach achieves an optimal local reward in the presence of a single agent only. The following subsections (i.e., Sections 4.1-4.9) provide further details to each new algorithm, including the purpose(s) of the CR scheme(s), followed by its associated RL model (i.e., state, action, and reward representations) which characterize the purposes, and finally the enhanced algorithm which aims to achieve the purpose. Hence, these subsections serve as a foundation for further research in this area, particularly, the application of existing RL models and algorithms found in current schemes to either apply them in new schemes or extend the RL models in existing schemes to further enhance network performance.

Model 1: Model with γ = 0 in Q-Function.
This is a myopic RL-based approach (see Section 3.4) that uses $\gamma = 0$ so that there is no dependency on future rewards, and it has been applied in [10,17,18]. Li et al. [10] propose a joint DCS (A1) and channel sensing (A2) scheme, and it has been shown to increase throughput, as well as to decrease the number of sensing channels (see performance metric (P4) in Section 5) and the packet retransmission rate. The purposes of this scheme are to select operating channels with a successful transmission rate greater than a certain threshold into a sensing channel set and subsequently to select a single operating channel for data transmission. Table 3 shows the RL model for the scheme. The action $a_t^i \in A$ is to select whether to remain at the current operating channel or to switch to another operating channel with a higher successful transmission rate. A preferred channel set is composed of actions with Q-value $Q_t^i(a_t^i)$ greater than a fixed threshold $Q_{th}$ (e.g., $Q_{th} = 5$ in [10]). Since the state and action are similar in this model, the state representation is not shown in Table 3, and we represent $r_{t+1}^i(a_t^i) = r_{t+1}^i(s_{t+1}^i)$. Note that $a_{t+1}^i = a_t^i$ if there is no channel switch. The reward $r_{t+1}^i(a_t^i)$ represents different kinds of events, specifically, $r_{t+1}^i(a_t^i) = 1$ in case of successful transmission, and $r_{t+1}^i(a_t^i) = -1$ in case of unsuccessful transmission or when the selected channel is sensed busy. The RL model is embedded in a centralized entity such as a base station. Algorithm 1 presents the RL algorithm for the scheme. The action $a_t^i \in A$ is chosen from a preferred channel set. The update of the Q-value $Q_{t+1}^i(a_t^i)$ is self-explanatory. A similar approach has been applied in DCS (A1) [30,31].
Table 2 (continued): Summary of the new algorithms, their purposes, and references.
Model with a set of Q-functions. References: [11,21].
Dual Q-function model. Purpose: this model updates two Q-functions for the next and previous states, respectively, simultaneously in order to expedite the learning process. References: Xia et al. [33].
Partial observable model. Purpose: this model computes the belief state, which is the probability of the environment operating in a particular state, in a dynamic and uncertain operating environment. References: Bkassiny et al. [34].
Actor-critic model. Purpose: this model adjusts the delayed reward value using reward corrections in order to expedite the learning process. References: Vucevic et al. [13].
Auction model. Purpose: this model allows agents to place bids during auctions conducted by a centralized entity so that the winning agents receive rewards. References: Chen and Qiu [16], Jayaweera et al. [36], Fu and van der Schaar [37], and Xiao et al. [38].
Internal self-learning model. Purpose: this model enables an agent to exchange its virtual actions continuously with rewards generated by a simulated internal environment within the agent itself in order to expedite the learning process. References: Bernardo et al. [27].
Collaborative model. Purpose: this model enables an agent to collaborate with its neighbor agents and subsequently make local decisions independently in distributed networks. A local decision is part of an optimal joint action, which is comprised of the actions taken by all the agents in a network. References: Lundén et al. [20] and Liu et al. [39].
Competitive model. Purpose: this model enables an agent to compete with its neighbor agents and subsequently make local decisions independently in worst-case scenarios in the presence of competitor agents, which attempt to minimize the accumulated rewards of the agent. References: Wang et al. [14].
Li et al. [18] propose a MAC protocol, which includes both DCS (A1) and a retransmission policy (A6), to minimize channel contention. The DCS scheme enables the SU agents to minimize their possibilities of operating in the same channel. This scheme uses the RL algorithm in Algorithm 1, and the reward representation is extended to cover more than a single performance metric. Specifically, the reward represents both the successful transmission rate and the transmission delay. A higher reward indicates a higher successful transmission rate and a lower transmission delay, and vice versa. To accommodate both factors in the Q-function, the reward becomes the sum of a transmission reward and a delay reward, and so the Q-function update adds both components to the current Q-value. The retransmission policy determines the probability that a SU agent transmits at a given time, and so the Q-value indicates this transmission probability. The delay reward is 1, 0, and −1 if the transmission delay is smaller than, equal to, and greater than the average transmission delay, respectively. The transmission reward represents different kinds of events; specifically, it is 2, 0, and −2 in case of successful transmission, idle transmission, and unsuccessful transmission, respectively; note that idle indicates that the channel is sensed busy, and so there is no transmission.
Li et al. [17] propose a MAC protocol (A6) to reduce the probability of packet collision among PUs and SUs, and it has been shown to increase throughput and to decrease packet loss rate. Since both the successful transmission rate and the presence of idle channels are important factors, it keeps track of separate Q-functions for channel sensing, $Q(a_{se})$, and transmission, $Q(a_{tx})$, using the RL algorithm in Algorithm 1. Hence, similar to Algorithm 2 in Section 4.2, there is a set of two Q-functions. The action is to select whether to remain at the current operating channel or to switch to another operating channel. The sensing reward is $r_{t+1}(a_{se}) = 1$ and $-1$ if the channel is sensed idle and busy, respectively.
Algorithm 1: RL algorithm for joint DCS and channel sensing [10].
Repeat
(a) Choose action $a_t^i \in A$
(b) Update the Q-value $Q_{t+1}^i(a_t^i)$ using the received reward $r_{t+1}^i(a_t^i)$
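As a rough illustration of the myopic approach in this subsection, the following Python sketch keeps one Q-value per channel, applies an update that ignores future rewards, and builds the preferred channel set by thresholding. It is a minimal sketch under stated assumptions, not the implementation in [10]: the purely additive update, the function names, and the initial Q-values are hypothetical; only the rewards of +1 and −1 and the example threshold of 5 come from the text.

```python
def reward_for(outcome):
    """Reward from the text: +1 for a successful transmission, -1 otherwise."""
    return 1 if outcome == "success" else -1


def myopic_update(q_values, channel, reward):
    """Update with no dependency on future rewards (discount factor of 0).

    Assumption: the new Q-value is the old Q-value plus the immediate reward.
    """
    q_values[channel] = q_values[channel] + reward
    return q_values[channel]


def preferred_channels(q_values, q_threshold):
    """Channels whose Q-value is greater than the fixed threshold."""
    return sorted(ch for ch, q in q_values.items() if q > q_threshold)


# Hypothetical initial Q-values for three channels.
q = {1: 4, 2: 6, 3: 5}
myopic_update(q, 1, reward_for("success"))
myopic_update(q, 1, reward_for("success"))
myopic_update(q, 2, reward_for("busy"))
print(q)
print(preferred_channels(q, q_threshold=5))
```

In this toy run, channel 1 enters the preferred set after two successful transmissions, while channel 2 drops out after being sensed busy once.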

Model 2: Model with a Set of Q-Functions.
A set of distinct Q-functions can be applied to keep track of the Q-values of different actions, and it has been applied in [11,21]. Di Felice et al. [11] propose a joint DCS (A1) and channel sensing (A2) scheme, and it has been shown to increase goodput and packet delivery rate, as well as to decrease end-to-end delay and interference level to PUs. The purposes of this scheme are threefold: (i) firstly, it selects an operating channel that has the lowest channel utilization level by PUs; (ii) secondly, it achieves a balanced trade-off between the time durations for data transmission and channel sensing; (iii) thirdly, it reduces the exploration probability using a knowledge sharing mechanism. Table 4 shows the RL model for the scheme. The state $s_t^i \in S$ represents a channel for data transmission. The actions $a_t^i \in A$ are to sense a channel, to transmit data, or to switch the operating channel. The reward $r_{t+1}^i(s_{t+1}^i)$ represents the difference between two types of delays, namely, the maximum allowable single-hop transmission delay and a successful single-hop transmission delay. A single-hop transmission delay covers four kinds of delays: backoff, packet transmission, packet retransmission, and propagation delays. A higher reward level indicates a shorter delay incurred by a successful single-hop transmission. The RL model is embedded in a centralized entity such as a base station.
Algorithm 2 presents the RL algorithm for the scheme. Denote the learning rate by $0 \le \alpha \le 1$ and the eligible trace by $e(s_t^i)$. The algorithm also uses the amount of time during which the SU agent is involved in successful transmissions or is idle (i.e., has no packets to transmit), as well as the temporal differences for the sensing and transmission Q-values, $Q(s_t^i, a_{se})$ and $Q(s_t^i, a_{tx})$. A single type of Q-function is chosen to update the Q-value $Q(s_t^i, a_t^i)$ based on the current action $a_t^i \in A = \{a_{se}, a_{tx}, a_{sw}\}$ being taken. The temporal difference indicates the difference between the actual outcome and the estimated Q-value.
Table 4: RL model for joint dynamic channel selection and channel sensing [11].
State: $s_t^i \in S = \{1, 2, \ldots, K\}$; each state represents an available channel, and $K$ is the number of channels
Action: $a_t^i \in A = \{a_{se}, a_{tx}, a_{sw}\}$, where action $a_{se}$ senses a channel for the sensing duration, $a_{tx}$ transmits a data packet, and $a_{sw}$ switches the current operating channel to another one which has the lowest best-known average transmission delay for a single hop
Reward: $r_{t+1}^i(s_{t+1}^i)$ represents the difference between a successful single-hop transmission delay and the maximum allowable single-hop transmission delay
Table 5: RL model for the routing scheme [33].
State: $s_t^i \in S = \{1, 2, \ldots, N - 1\}$; each state represents a SU destination node. $N$ represents the number of SUs in the entire network
Action: $a_t^i \in A = \{1, 2, \ldots, J\}$; each action represents the selection of a next-hop SU node $j$. $J$ represents the number of SU node $i$'s neighbor SUs
Reward: $r_{t+1}^i(s_{t+1}^i, a_{t+1}^i)$ represents the number of available common channels among nodes $i$ and $j$
In step (b), the eligible trace $e(s_t^i)$ represents the temporal validity of state $s_t^i$. Specifically, in [11], the eligible trace represents the existence of PU activities in channel $s_t^i$, and so it is only updated when the channel sensing operation $a_{se}$ is taken. A higher eligible trace indicates greater presence of PU activities, and vice versa. Hence, the term $e(s_t^i)$ appears in the update of the Q-value $Q_{t+1}(s_t^i, a_{se})$, and $(1 - e(s_t^i))$ appears in the update of the Q-value $Q_{t+1}(s_t^i, a_{tx})$ in Algorithm 2. Therefore, a higher eligible trace results in a higher value of $Q_{t+1}(s_t^i, a_{se})$ and a lower value of $Q_{t+1}(s_t^i, a_{tx})$, and this indicates more channel sensing tasks and less data transmission in channels with greater presence of PU activities. The action $a_{sw}$ switches the channel from state $s_t^i$ to state $s_{t+1}^i$. The ε-greedy approach is applied to choose the next channel $s_{t+1}^i$. In [21], the eligible trace, which represents the temporal validity or freshness of the sensing outcome, is only updated when the channel sensing operation $a_{se}$ is taken, as shown in Algorithm 2. The eligible trace is discounted whenever $a_{se}$ is not chosen, as given in (15), using a discount factor for the eligible trace that lies between 0 and 1. Equation (15) shows that the eligible trace of each state is set to the maximum value of 1 whenever action $a_{se}$ is taken; otherwise, it is decreased by the discount factor. In step (c), the $Q(s_t^i, a_{sw})$ value keeps track of the channel that provides the best-known lowest estimated average transmission delay. In other words, the channel must provide the maximum amount of reward that can be achieved considering the cost of a channel switch. Hence, $Q(s_t^i, a_{sw})$ can keep track of a channel $s_{t+1}^i$ that provides the best-known state value the SU agent receives, compared to the average state value, by switching its current operating channel to the operating channel $s_{t+1}^i$. Note that the state value is exchanged among the SU agents to reduce exploration through cooperative learning (see Section 3.7.3).
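The eligible trace rule described for (15) can be sketched in a few lines of Python. This is only an illustration of the verbal description (set the trace to 1 when sensing is performed, otherwise multiply it by a discount factor); the function name, the per-channel dictionary, and the numeric values are assumptions and are not taken from [11] or [21].

```python
def update_eligible_traces(traces, sensed_channel, discount):
    """Return the traces after one time step.

    The channel that was just sensed gets the maximum trace of 1;
    every other channel's trace is multiplied by the discount factor.
    Pass sensed_channel=None when the action taken was not a sensing action.
    """
    return {
        channel: 1.0 if channel == sensed_channel else discount * trace
        for channel, trace in traces.items()
    }


# Hypothetical traces for three channels and an assumed discount factor of 0.5.
traces = {1: 1.0, 2: 0.5, 3: 0.0}
after_sensing = update_eligible_traces(traces, sensed_channel=2, discount=0.5)
after_other_action = update_eligible_traces(after_sensing, None, discount=0.5)
print(after_sensing)
print(after_other_action)
```

The trace therefore acts as a freshness indicator: it is highest right after a sensing operation and fades while the agent transmits or switches channels.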
In step (d), the policy $\pi(s_{t+1}^i, a_{t+1}^i)$ is applied at the next time instant. The policy provides probability distributions over the three possible types of actions $A = \{a_{se}, a_{tx}, a_{sw}\}$ using a modified Boltzmann distribution (see Section 3.6). Next, the policy is applied to select the next action $a_{t+1}^i$ in step (a).

Model 3: Dual Q-Function Model.
The dual Q-function model has been applied to expedite the learning process [32]. The traditional Q-function (see (1)) updates a single Q-value at a time, whereas the dual Q-function updates two Q-values simultaneously. For instance, in [33], the traditional Q-function updates the Q-value for the next state only (e.g., the SU destination node), whereas the dual Q-function updates the Q-values for both the next and previous states (e.g., the SU destination and source nodes, respectively). The dual Q-function model updates a SU agent's Q-value in both directions (i.e., towards the source and destination nodes) and speeds up the learning process in order to make more accurate decisions on action selection; however, this comes at the expense of higher network overhead incurred by more Q-value exchanges among the SU neighbor nodes.
Xia et al. [33] propose a routing (A7) scheme, and it has been shown to reduce SU end-to-end delay. Generally speaking, the availability of channels in CR networks is dynamic, and it is dependent on the channel utilization level by PUs. The purpose of this scheme is to enable a SU node to select a next-hop SU node with a higher number of available channels. A higher number of available channels reduces the time incurred in seeking an available common channel for data transmission among a SU node pair, and hence it reduces the MAC layer delay. Table 5 shows the RL model for the scheme. The state $s_t^i \in S$ represents a SU destination node. The action $a_t^i \in A$ represents the selection of a next-hop SU neighbor node $j$. The reward $r_{t+1}^i(s_{t+1}^i, a_{t+1}^i)$ represents the number of available common channels between node $i$ and its next-hop node $j = a_t^i$. The RL model is embedded in each SU agent.
This scheme applies the traditional Q-function (see (1)) with $\gamma = 1$. In the rewritten Q-function, the maximization is taken over the upstream nodes of SU neighbor node $j$, so node $j$ must estimate its maximum Q-value and send this information to SU node $i$.
The dual Q-function model in this scheme is applied to update the Q-values for the SU source and destination nodes. While the traditional Q-function enables an intermediate SU node to update the Q-value for the SU destination node only (or next state), which is called forward exploration, the dual Q-function model enables an intermediate SU node to achieve backward exploration as well by updating the Q-value for the SU source node (or previous state). Forward exploration is achieved by updating the Q-value at SU node $i$ for the SU destination node whenever it receives an estimate of the maximum Q-value from SU node $j$, while backward exploration is achieved by updating the Q-value at SU node $j$ for the SU source node whenever it receives a data packet from node $i$. Note that, in the backward exploration case, node $i$'s packets are piggybacked with its Q-value so that node $j$ is able to update the Q-value for the respective SU source node. Although the dual Q-function approach increases the network overhead, it expedites the learning process since SU nodes along a route update the Q-value of the route in both directions.
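Forward and backward exploration can be illustrated with a small Python sketch in which two neighboring nodes update their Q-values in opposite directions for a single packet. This is a hedged sketch rather than the algorithm of [33]: it assumes a standard Q-learning update without discounting of the neighbor's estimate, and the class name `DualQNode`, the learning rate of 0.5, the node labels, and all numeric values are hypothetical.

```python
from collections import defaultdict


class DualQNode:
    """One SU node holding Q-values keyed by (endpoint node, neighbor)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha            # learning rate (assumed value)
        self.q = defaultdict(float)   # unseen entries start at 0.0

    def best_estimate(self, endpoint):
        """Largest Q-value this node holds for routes towards `endpoint`."""
        values = [q for (end, _), q in self.q.items() if end == endpoint]
        return max(values, default=0.0)

    def update(self, endpoint, neighbor, reward, neighbor_estimate):
        """Q-learning update in which the neighbor's estimate is not discounted."""
        key = (endpoint, neighbor)
        target = reward + neighbor_estimate
        self.q[key] += self.alpha * (target - self.q[key])
        return self.q[key]


node_i, node_j = DualQNode(), DualQNode()
node_j.q[("dst", "k")] = 4.0   # j already knows a route towards the destination
node_i.q[("src", "h")] = 2.0   # i already knows a route back to the source
common_channels = 3            # reward: available common channels between i and j

# Forward exploration: j reports its best estimate, i updates its entry for dst.
forward = node_i.update("dst", "j", common_channels, node_j.best_estimate("dst"))
# Backward exploration: i's data packet carries its estimate, j updates for src.
backward = node_j.update("src", "i", common_channels, node_i.best_estimate("src"))
print(forward, backward)
```

A single packet exchange thus refreshes two Q-values, one at each node, which is the source of both the faster learning and the extra overhead noted above.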

Model 4: Partial Observable Model.
The partial observable model has been applied in a dynamic and uncertain operating environment. The uniqueness of the partial observable model is that the SU agents are uncertain about their respective states, and so each of them computes a belief state $b(s)$, which is the probability that the environment is operating in state $s$.
Bkassiny et al. [34] propose a joint DCS (A1) and channel sensing (A2) scheme, and it has been shown to improve the overall spectrum utilization. The purpose of this scheme is to enable the SU agents to select their respective operating channels for sensing and data transmission in which the collisions among the SUs and PUs must be minimized. Table 6 shows the RL model for the scheme. The state $\mathbf{s}_t^i \in S_1 \times S_2 \times \cdots \times S_K$ represents the availability of a set of channels for data transmission. The action $a_t^i \in A$ represents a single channel out of the $K$ channels available for data transmission. The reward represents fixed positive (negative) values to be rewarded (punished) for successful (unsuccessful) transmissions. The RL model is embedded in each SU agent so that it can make decisions in a distributed manner.
Algorithm 3 presents the RL algorithm for the scheme. The action $a_t^i \in A$ is chosen from a preferred channel set. The chosen action has the maximum belief-state Q-value, which is calculated using the belief vector $\mathbf{b}(\mathbf{s}_t^i) = (b(s_{1,t}), b(s_{2,t}), \ldots, b(s_{K,t}))$ as a weighting factor. The belief vector $\mathbf{b}(\mathbf{s}_t^i)$ is the probability of a possible set of states $\mathbf{s}_t^i = (s_{1,t}, s_{2,t}, \ldots, s_{K,t})$ being idle at time $t + 1$. Upon receiving reward $r_{t+1}^i(\mathbf{s}_{t+1}^i, a_t^i)$, the SU agent updates the entire set of belief vectors $\mathbf{b}(\mathbf{s}_t^i)$ using Bayes' formula [34]. Next, the SU agent updates the Q-value $Q_{t+1}^i(\mathbf{s}_t^i, a_t^i)$. Note that the maximum Q-value over the actions is computed as $\max_{a \in A} \sum_{s \in \mathbf{s}_t^i} b(s) Q(s, a)$. It shall be noted that Bkassiny et al. [34] apply the belief vector $\mathbf{b}(\mathbf{s}_t^i)$ as a weighting vector in the computation of the Q-value $Q_{t+1}^i(\mathbf{s}_t^i, a_t^i)$, while most of the other approaches, such as [20], use the belief vector as the actual state, specifically, $Q_{t+1}^i(\mathbf{b}(\mathbf{s}_t^i), a_t^i)$. This approach has been shown to achieve a near-optimal solution with very low complexity in [35].
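The use of the belief vector as a weighting factor can be sketched as follows. The example is a minimal illustration of choosing the action with the maximum belief-weighted Q-value; the Bayes update of the belief is omitted, and the two-channel states, the belief values, the Q-table entries, and the function names are assumptions made for illustration rather than details from [34].

```python
def belief_weighted_q(belief, q_table, action):
    """Sum of Q(state, action) over the possible states, weighted by the belief."""
    return sum(b * q_table[(state, action)] for state, b in belief.items())


def choose_action(belief, q_table, actions):
    """Action with the maximum belief-weighted Q-value."""
    return max(actions, key=lambda a: belief_weighted_q(belief, q_table, a))


# Hypothetical two-channel example: a state is a tuple of channel availabilities,
# e.g. (1, 0) means channel 0 is idle and channel 1 is busy.
belief = {(1, 0): 0.75, (0, 1): 0.25}
q_table = {
    ((1, 0), 0): 1.0, ((1, 0), 1): -1.0,
    ((0, 1), 0): -1.0, ((0, 1), 1): 1.0,
}
actions = [0, 1]   # index of the channel selected for transmission

print(belief_weighted_q(belief, q_table, 0))
print(choose_action(belief, q_table, actions))
```

Because the agent is more confident that channel 0 is idle, the weighted Q-value of selecting channel 0 is higher, even though the agent never observes the true state directly.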

Model 5: Actor-Critic Model.
Traditionally, the delayed reward has been applied directly to update the Q-value. The actor-critic model adjusts the delayed reward value using reward corrections, and this approach has been shown to expedite the learning process. In this model, an actor selects actions using a suitability value, while a critic keeps track of the temporal difference, which takes into account reward corrections in delayed rewards.
Vucevic et al. [13] propose a collaborative channel sensing (A2) scheme, and it has been shown to minimize error detection probability in the presence of inaccurate sensing outcomes. The purpose of this scheme is to select neighboring SU agents that provide accurate channel sensing outcomes for security enhancement purposes (A3). Table 7 shows the RL model for the scheme. The state is not represented. An action $a_t^i \in A$ represents a neighboring SU chosen by SU agent $i$ for channel sensing purposes. The reward $r_{t+1}^i(a_t^i)$ represents fixed positive (negative) values to be rewarded (punished) for correct (incorrect) sensing outcomes compared to the final decision, which is the fusion of the sensing outcomes. The RL model is embedded in each SU agent.
Algorithm 3: RL algorithm for joint dynamic channel selection and channel sensing [34].
[16], or it may learn using RL to maximize its utility [36]. The RL model may be embedded in each SU host in a centralized network [16,36–38], or in the centralized entity only [36]. Chen and Qiu [16] propose a channel auction scheme (A5), and it has been shown to allocate white spaces among SU hosts (or agents) efficiently and fairly. The purpose of this scheme is to enable the SU agents to select the amount of bids during an auction, which is conducted by a centralized entity, for white spaces. The SU agents place the right amount of bids in order to secure white spaces for data transmission, while saving their credits, respectively. The RL model is embedded in each SU host. Table 8 shows the RL model for the scheme. The state $s_t^i \in S$ indicates a SU agent's information, specifically, the amount of data for transmission in its buffer and the amount of credits (or "wealth") it owns. The action $a_t^i \in A$ is the amount of a bid for white spaces. The reward $r_{t+1}^i(s_{t+1}^i)$ indicates the amount of data sent. This scheme applies the traditional Q-learning approach (see (1)) to update Q-values.
Jayaweera et al. [36] propose another channel auction scheme (A5) that allocates white spaces among SUs, and it has been shown to increase transmission rates of the SUs and to reduce energy consumption of the PUs. In [36], the PUs adjust the amount of white spaces and allocate them to the SUs with winning bids. The winning SUs transmit their own packets and also relay the PUs' packets using the white spaces so that the PUs can reduce their energy consumption. In other words, the SUs use their power as currency to buy the bandwidth. Two different kinds of RL models are embedded in PUs and SUs, respectively, so that the PUs can learn to adjust the amount of white spaces to be allocated to the SUs, and the SUs can learn to select the amount of bids during an auction for white spaces.
The state is not represented, and we show the action and reward representations of the scheme. Table 9 shows the reward representation of the RL model. The reward indicates a constant positive reward in case of a successful bid and a constant negative reward in case of an unsuccessful bid. The reward representation is embedded in both PUs and SUs. The actions of the PUs and SUs are different. Each SU selects the amount of its bid during an auction for white spaces in a channel, while each PU adjusts the amount of white spaces to be offered for auction in its own channel. A higher amount of white spaces encourages the SUs to participate in auctions.
This scheme applies a Q-function with $\gamma = 0$ (see Section 4.1), in which the received reward is added to the current Q-value, at both PUs and SUs. The SUs' Q-function indicates the appropriate amount of bids for white spaces, while the PUs' Q-function indicates the appropriate amount of white spaces to be offered for auction.
Table 10: RL model for the channel auction scheme [37].
State: each state is a two-tuple composed of the fullness of the buffer and the channel states $\mathbf{p}$, where each entry of $\mathbf{p}$ represents the state of one channel in terms of SNR
Action: each action represents the amount of a bid for white spaces in a channel, with one entry for each of the available channels
Reward: $r_{t+1}(\mathbf{s}_{t+1}, \mathbf{w}_{t+1}, \mathbf{a}_{t+1})$ represents the sum of the number of lost packets and the channel cost that SU $i$ must pay for using the channel. Note that the packet loss and channel cost depend on the global state $\mathbf{s}_{t+1}$, available channels $\mathbf{w}_{t+1}$, and bidding actions $\mathbf{a}_{t+1}$ of all competing SUs
Table 11: RL model for a power control scheme [38].
Action: $a_t^i \in A = \{a_{sh}, a_{mh}\}$, with $a_{sh}$ and $a_{mh}$ being transmitting SU $i$'s packets to the SU destination node using single-hop transmission and multiple-hop relaying, respectively
Reward: represents the revenue obtained from the other SUs for relaying their packets. Higher rewards indicate higher transmission rate and transmission power of SU node $i$
Fu and van der Schaar [37] propose a channel auction scheme (A5) that improves the bidding policy of SUs, and it has been shown to reduce SUs' packet loss rate. The purpose of this scheme is to enable SU agents to learn and adapt the amount of bids during an auction for time-varying white spaces in dynamic wireless networks with environmental disturbance and SU-SU disturbance. Examples of environmental disturbance are the dynamic level of channel utilization by PUs, channel condition (i.e., SNR), and SU traffic rate, while an example of SU-SU disturbance is the effect from other competing SUs, who are noncollaborative and autonomous in nature. Compared to traditional centralized auction schemes, SUs compute their bids based on their knowledge and observation of the operating environment with limited information received from other SUs and the centralized base station. Note that the joint bidding actions of SUs affect the allocation of white spaces and bidding policies of the other SUs, and so the proposed learning algorithm improves the bidding policy of SUs based on the observed white space allocations and rewards. Table 10 shows the RL model for the scheme. The state $s_t^i \in S$ indicates a SU agent's information, specifically, its buffer state, as well as the states of the available channels in terms of SNR. The action $a_t^i \in A$ is the amount of bids for white spaces. The reward $r_{t+1}(\mathbf{s}_{t+1}, \mathbf{w}_{t+1}, \mathbf{a}_{t+1})$ represents the sum of the number of lost packets and the channel cost that SU $i$ must pay for using the channel. Note that the channel cost represents network congestion, and hence a higher cost indicates a higher congestion level. The RL model is embedded in each SU host.
Algorithm 4 presents the RL algorithm for the scheme. In step (a), SU agent $i$ observes its current state and the available channels (or white spaces) $\mathbf{w}$ advertised by the centralized base station. In step (b), it decides and submits its bids to the base station, and the bids are estimated based on SU $i$'s state and the other SUs' representative (or estimated) state. Note that, since SU $i$ would otherwise need to know all the states and transition probabilities of the other SUs, which may not be feasible, it estimates the representative state based on its previous knowledge of channel allocation and channel cost (or network congestion). In step (c), SU $i$ receives its channel allocation decision and the required channel cost from the base station. In step (d), the representative state and transition probabilities of the other SUs are updated based on the newly received channel allocation decision and the required channel cost information. In step (e), SU $i$ computes its estimated Q-value, which is inspired by the traditional Q-function approach, and this approach explicitly takes into account the effects of the bidding actions of the other SUs based on their estimated representative state and transition probabilities. Note that $\mathbf{a}$ also denotes the Markov-based policy profile that represents the bidding policies of all the other SUs. In step (f), the Q-table is updated if there are changes in the SU states and channel availability.
Xiao et al. [38] propose a power control scheme (A8), and it has been shown to increase the transmission rates and payoffs of SUs. There are two main differences compared to the traditional auction schemes, which have been applied to centralized networks. Firstly, the interactions among all nodes, including PUs and SUs, are coordinated in a distributed manner. A SU source node transmits its packets to the SU destination node using either single-hop transmission or multihop relaying. In multihop relaying, a SU source node must pay the upstream node, which helps to relay the packets. Secondly, the PUs treat each SU equally, and so there is lack of competitiveness in auctions. Each SU may accumulate credits through relaying. Game theory is applied to model the network in which SUs pay credits to PUs for using licensed channels and to other SUs for relaying their packets. The purpose of this scheme is to enable a SU node to choose efficient actions in order to improve its payoff, as well as to collect credits through relaying, and to minimize the credits paid to PUs and other SU relays. A RL model is embedded in each SU.
Algorithm 4: RL algorithm for the channel auction scheme [37].
Repeat
(a) Observe the current state and available channels $\mathbf{w}$
(b) Choose an action and submit it to the base station
(c) Receive the channel allocation decision and the required channel cost
(d) Estimate the representative state and update the state transition probabilities of the other SUs
(e) Compute the estimated Q-value
(f) Update the Q-table using the learning rate
The state is not represented, and we show the action and reward representations of the scheme. Table 11 shows the RL model for the scheme. The action $a_t^i \in A$ represents transmission of SU $i$'s packets by using either single-hop transmission or multihop relaying. The reward indicates the revenue (or profit) received by SU node $i$ for providing relaying services to other SUs, and so a higher reward indicates a higher transmission rate and increased transmission power of SU node $i$. Denote the payoff of SU $i$ by $p$, as shown in (17). The payoff indicates the difference between SU $i$'s revenue and costs. There are two types of costs: the cost charged by the upstream SU node for relaying SU node $i$'s packets, and the cost charged by all PUs for using the white spaces in licensed channels. The latter increases with SU $i$'s interference power in the respective channel. This scheme applies a Q-function that indicates the average payoff: the Q-value of an action is incremented by a constant step size multiplied by the product of the payoff $p$ and the probability of SU $i$ choosing that action, which is computed according to the Boltzmann distribution (see Section 3.6).
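The payoff-based update described above can be sketched in Python as follows. This is an illustrative reading of the verbal description, not the algorithm of [38]: the temperature parameter of the Boltzmann distribution, the step size of 0.1, the payoff value, the action labels, and the function names are all assumptions.

```python
import math


def boltzmann_probabilities(q_values, temperature):
    """Boltzmann (softmax) action probabilities; higher Q-values are more likely."""
    highest = max(q_values.values())   # subtracted for numerical stability
    weights = {a: math.exp((q - highest) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def payoff_update(q_values, action, payoff, step_size, temperature):
    """Increment Q(action) by step_size * payoff * P(action)."""
    probabilities = boltzmann_probabilities(q_values, temperature)
    q_values[action] += step_size * payoff * probabilities[action]
    return q_values[action]


# Hypothetical starting point: both transmission modes look equally good.
q = {"single_hop": 0.0, "multihop": 0.0}
new_q = payoff_update(q, "multihop", payoff=2.0, step_size=0.1, temperature=1.0)
probs = boltzmann_probabilities(q, temperature=1.0)
print(new_q)
print(probs)
```

After a positive payoff from multihop relaying, its Q-value rises and the Boltzmann distribution makes that action slightly more likely to be chosen next time.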

Model 7: Internal Self-Learning Model.
The internal self-learning model has been applied to expedite the learning process. The uniqueness of the internal self-learning model lies in the learning approach in which the learning mechanism continuously interacts with a simulated internal environment within the SU agent itself. The learning mechanism continuously exchanges its actions with rewards generated by the simulated internal environment so that the SU agent learns the optimal actions for various settings of the operating environment, and this helps the Q-value and the optimal action to converge.
Bernardo et al. [27] propose a DCS (A1) scheme, and it has been shown to improve the overall spectrum utilization and throughput performances. Note that, unlike the previous schemes in which the RL models are embedded in the SU agents, the RL model is embedded in each PU base station (or agent) in this scheme, and it is applied to make medium-term decisions (i.e., from tens of seconds to tens of minutes). The purpose of this scheme is to enable a PU agent to select its operating channels for transmission in its own cell. In order to improve the overall spectrum utilization, the PU agent preserves its own QoS while generating white spaces and sells them off to SU agents. Table 12 shows the RL model for the scheme. The action $\mathbf{a}_t^i \in A_1 \times A_2 \times \cdots \times A_K$ is a set of chosen available channels for the entire cell. The reward $r_{t+1}(\mathbf{a}_t^i)$ has a zero value if the estimated throughput of an action selection $\mathbf{a}_t^i$ is less than a throughput threshold; otherwise, the reward is based on the spectrum efficiency and the amount of white spaces, which may be sold off to SU agents, each weighted by a constant weight factor. Figure 3 shows the internal self-learning model. The learning mechanism, namely, RL-DCS, continuously interacts with a simulated internal environment, namely, the Environment Characterization Entity (ECE). Based on the information observed from the real operating environment (i.e., the number of PU hosts and the average throughput per PU host), which is provided by the status observer, the ECE implements a model of the real operating environment (i.e., the spectrum efficiency and the amount of white spaces) and computes the reward $r_{t+1}(\mathbf{a}_t^i)$. Hence, the ECE evaluates the suitability of action $\mathbf{a}_t^i$ in its simulated internal model of the operating environment.
By exchanging action $\mathbf{a}_t^i$ and reward $r_{t+1}(\mathbf{a}_t^i)$ between RL-DCS and ECE, the RL-DCS learns an optimal action $\mathbf{a}_t^i$ at a faster rate compared to the conventional learning approach, and this process stops when the optimal action converges.
Algorithm 6: RL algorithm for the channel sensing scheme [20].
Repeat
(a) Take action $a_t^i \in A$
(b) Exchange the first collaboration message with SU neighbor agents // First round of collaboration
(c) Determine delayed reward $r_{t+1}^i(\mathbf{s}_{t+1}^i, a_t^i)$
(d) Exchange the second collaboration message with SU neighbor agents // Second round of collaboration
(e) Choose action $a_{t+1}^i \in A$
(f) Update the quantity defined in (9)
(g) Update Q-value $Q_{t+1}^i(\mathbf{s}_t^i, a_t^i)$
Algorithm 5 presents the RL algorithm for the scheme. The action $\mathbf{a}_t^i = (a_{1,t}, a_{2,t}, \ldots, a_{K,t}) \in A_1 \times A_2 \times \cdots \times A_K$ is chosen using a Bernoulli random variable [27]. The PU agent receives the reward $r_{t+1}(\mathbf{a}_t^i)$ computed by the ECE and computes the average reward for each subaction using the exponential moving average [27]. The algorithm also uses the probability of taking each subaction and the current overall unused spectrum, which is the ratio of the unused bandwidth to the total bandwidth of a cell. Upon receiving the reward $r_{t+1}(\mathbf{a}_t^i)$, the PU agent updates the Q-value for each subaction in $\mathbf{a}_t^i$. Finally, the probability of taking each subaction is updated. Note that an exploration probability is also applied.

Model 8: Collaborative Model.
The collaborative model enables a SU agent to collaborate with its SU neighbor agents and subsequently make local decisions independently in distributed CR networks. It enables the agents to learn and achieve an optimal joint action. A joint action is defined as the set of actions taken by all the agents throughout the entire network. An optimal joint action is the joint action that provides an ideal and optimal network-wide performance. Hence, the collaborative model reduces the selfishness of each agent by taking other agents' actions or strategies into account. The collaboration may take the form of exchanging local information, including knowledge (Q-value), observations, and decisions, among the SU agents.
Lundén et al. [20] propose a collaborative channel sensing (A2) scheme, and it has been shown to maximize the amount of white spaces found. The purposes of this scheme are twofold: (i) firstly, it selects channels with more white spaces for channel sensing purposes; (ii) secondly, it selects channels so that the SU agents diversify their sensing channels. In other words, the SU agents perform channel sensing in various channels. Table 13 shows the RL model for the scheme. The state $\mathbf{s}_t^i \in S_1 \times S_2 \times \cdots \times S_K$ represents the belief on the availability of a set of channels for data transmission. An action $a_t^i \in A$, which is part of the joint action $\mathbf{a}_t$ representing all the actions taken by SU agent $i$ and its SU neighbor agents, represents a single channel chosen by SU agent $i$ for channel sensing purposes. The reward $r_{t+1}^i(\mathbf{s}_{t+1}^i, a_t^i)$ represents the number of channels identified as being idle (or free) at time $t + 1$ by SU agent $i$. The RL model is embedded in each SU agent.
Algorithm 6 presents the RL algorithm for the scheme, which comprises two rounds of collaboration message exchanges. After taking action $a_t^i$, SU agent $i$ exchanges first-round collaboration messages with its SU neighbor agents. Each first-round message is a two-tuple comprising SU agent $i$'s action $a_t^i$ and its sensing outcomes. SU agent $i$ determines the delayed reward based on the first-round messages. Next, SU agent $i$ exchanges second-round collaboration messages, which carry its next action $a_{t+1}^i$, with its SU neighbor agents. During the second round, a SU agent chooses its action $a_{t+1}^i$ for the next time instance upon receiving the second-round message from a SU neighbor agent. Note that the SU agent transmission order affects the action selection, because a SU agent may receive and use information obtained from its preceding agents, and so it can make decisions using more up-to-date information in the second round. Since one of the main purposes is to enable the SU agents to diversify their sensing channels, each SU agent chooses action $a_{t+1}^i$ from a preferred channel set, which comprises the sensing channels yet to be chosen by the preceding SU agents. The SU agent chooses the channel with the maximum $Q$-value from the preferred channel set. Finally, the SU agent updates its $Q$-values, including $Q_{t+1}^i(\mathbf{s}_t^i, a_t^i)$ (see Section 3.5).
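The second-round selection from the preferred channel set can be sketched as follows. This is an illustrative simplification rather than the algorithm of [20]: the function names, the dictionary representation of Q-values, and the fallback used when every channel is already taken are assumptions made here.

```python
def choose_sensing_channel(q_values, taken_by_predecessors):
    """Pick the highest-valued channel not yet chosen by preceding agents.

    q_values: dict mapping channel id -> Q-value held by this agent.
    taken_by_predecessors: set of channel ids announced earlier in this round.
    """
    preferred = [ch for ch in q_values if ch not in taken_by_predecessors]
    # Assumption: if every channel is already taken, fall back to all channels.
    candidates = preferred if preferred else list(q_values)
    return max(candidates, key=lambda ch: q_values[ch])


def second_round(agent_q_tables):
    """Agents announce their next channel one after another, in list order."""
    taken, choices = set(), []
    for q_values in agent_q_tables:
        channel = choose_sensing_channel(q_values, taken)
        choices.append(channel)
        taken.add(channel)
    return choices


# Usage: three agents, three channels; later agents avoid earlier choices.
tables = [
    {1: 0.9, 2: 0.5, 3: 0.1},
    {1: 0.8, 2: 0.7, 3: 0.2},
    {1: 0.6, 2: 0.3, 3: 0.4},
]
print(second_round(tables))
```

Because each agent only sees the choices announced before its turn, changing the order of `tables` can change the resulting assignment, which mirrors the dependence on transmission order noted above.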
Liu et al. [39] propose a collaborative DCS (A1) scheme that applies a collaborative model, and it has been shown to achieve a near-optimal throughput performance. The purpose of this scheme is to enable each SU link to maximize its individual delayed rewards, specifically, the SNR level. Note that this collaboration approach assumes that an agent has full observation of the actions and policies adopted by all the other SU links at any time instance. Hence, (1) is rewritten as (19), in which $a_t^i$ represents the action taken by agent $i$ and $\mathbf{a}_t^{-i}$ represents the joint action taken by all the SU agents throughout the entire CR network except agent $i$. Note that $a_t^i \cap \mathbf{a}_t^{-i}$ belongs to the set of joint actions by all the SU agents throughout the entire CR network. Therefore, (19) is similar to the traditional RL approach except that an action becomes a joint action $a_t^i \cap \mathbf{a}_t^{-i}$ (or a set of actions). To take into account the actions taken by the other agents $\mathbf{a}_t^{-i}$, agent $i$ maintains an average $Q$-value, which is the average $Q$-value of agent $i$ in state $s_t^i$ if it takes action $a_t^i$ while the other agents take action $\mathbf{a}_t^{-i}$. The update of this average $Q$-value takes the number of agents into account. Next, the average $Q$-value is applied in action selection using the Boltzmann equation (see Section 3.6). Further research can be pursued to reduce communication overheads and to enable indirect coordination among the agents.
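Boltzmann action selection over a list of (average) Q-values can be sketched as follows. The sketch uses the standard softmax form, in which the probability of an action is proportional to exp(Q / temperature); the function names and the `temperature` default are assumptions made here, and the exact expression used in [39] may differ.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature=1.0):
    """Return softmax probabilities over actions for the given Q-values."""
    highest = max(q_values)
    # Subtracting the maximum keeps exp() numerically stable without
    # changing the resulting probabilities.
    weights = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature=1.0, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probabilities = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]


# Usage: a lower temperature concentrates probability on the best action.
average_q = [1.0, 2.0]
print(boltzmann_probabilities(average_q, temperature=1.0))
print(boltzmann_probabilities(average_q, temperature=0.1))
```

A high temperature yields near-uniform exploration, whereas a low temperature makes the selection almost greedy.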

Model 9: Competitive Model.
Competitive model enables a SU agent to compete with its SU neighbor agents and subsequently make local decisions independently in CR networks. The competitive model enables an agent to make optimal actions in worst-case scenarios in the presence of competitor agents, which attempt to minimize the accumulated rewards of the agent. Note that the competitor agent may also possess the capability to observe, learn, and carry out the optimal actions in order to deteriorate the agents' accumulated rewards.
Wang et al. [14] propose an antijamming (A3) scheme called channel hopping, which applies minimax-$Q$ learning to implement the competitive model. This approach has been shown to maximize the accumulated rewards (e.g., throughput) in the presence of jamming attacks. Equipped with a limited number of transceivers, the malicious SUs aim to minimize the accumulated rewards of SU agents through constant packet transmission in a number of channels in order to prevent spectrum utilization by SU agents. The purposes of the channel hopping scheme are twofold: (i) it introduces randomness in channel selection so that the malicious SUs do not jam its selected channels for data transmission, and (ii) it selects a proper number of control and data channels in a single frequency band for control and data packet transmissions. Note that each frequency band consists of a number of channels. Due to the criticality of the control channel, duplicate control packets may be transmitted in multiple channels to minimize the effects of jamming, and so a proper number of control channels is necessary.
Algorithm 7: RL algorithm for the channel hopping scheme [14].
Note that, as competitors, the malicious SUs aim to minimize the accumulated rewards of SU agents. Table 14 shows the RL model for the scheme. Each state is a four-tuple $\mathbf{s}_{k,t}^i \in S_1 \times S_2 \times S_3 \times S_4$. With respect to frequency band $k$, the first substate, in $S_1 = \{0, 1\}$, represents the presence of PU activities; the second substate, in $S_2$, represents the gain; and the third and fourth substates, in $S_3$ and $S_4$, represent the numbers of control and data channels that get jammed, respectively. An action $\mathbf{a}_t^i$ represents the channel selections within a single frequency band for control and data packet transmissions, and the selected channels may or may not have been jammed in the previous time slot. The reward $r_{t+1}^i(\mathbf{s}_{k,t+1}^i, \mathbf{a}_t^i, \mathbf{a}_t^m)$ represents the gain (e.g., throughput) of using channels that are not jammed. Note that the reward $r_{t+1}^i(\mathbf{s}_{k,t+1}^i, \mathbf{a}_t^i, \mathbf{a}_t^m)$ is dependent on the malicious SU's (or competitor's) action $\mathbf{a}_t^m$. The RL model is embedded in each SU agent.
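Because the reward depends on both the agent's action and the competitor's action, a minimax approach keeps a Q-value for every pair of actions and evaluates a (possibly randomized) policy by its worst case over the competitor's actions. The sketch below illustrates this idea under strong simplifying assumptions: the agent has only two actions, and the policy is found by a coarse grid search, whereas minimax-Q learning in general solves a linear program. All function and parameter names are hypothetical and are not taken from [14].

```python
def worst_case_value(policy, q):
    """Expected Q-value of a mixed policy against the most harmful competitor action.

    q[a][o] is the Q-value when the agent plays a and the competitor plays o.
    """
    n_own, n_opp = len(q), len(q[0])
    return min(
        sum(policy[a] * q[a][o] for a in range(n_own))
        for o in range(n_opp)
    )


def maximin_policy_two_actions(q, steps=100):
    """Grid-search the mixed policy over two own actions that maximizes the worst case."""
    best_policy, best_value = None, float("-inf")
    for i in range(steps + 1):
        p = i / steps
        policy = [p, 1.0 - p]
        value = worst_case_value(policy, q)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value


def minimax_q_update(q_old, reward, next_state_value, alpha=0.1, gamma=0.9):
    """Blend the old Q-value with the reward plus the discounted next-state value."""
    return (1.0 - alpha) * q_old + alpha * (reward + gamma * next_state_value)


# Usage: a hopping game in which the agent gains only if it avoids the jammed channel.
q_table = [[1.0, -1.0], [-1.0, 1.0]]
policy, value = maximin_policy_two_actions(q_table)
```

In this symmetric example the worst-case optimal policy randomizes evenly between the two channels, which matches the intuition that randomness in channel selection makes jamming harder.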
Algorithm 7 presents the RL algorithm for the scheme. In step (b), the $Q$-function is dependent on the competitor's action $\mathbf{a}_t^m$, which is the set of channels chosen by the malicious SUs for jamming. In step (c), the agent determines its optimal policy for state $\mathbf{s}_{k,t}^i$, in which the competitor is assumed to take its optimal action that minimizes the $Q$-value, hence the $\min$ term. Nevertheless, in this worst-case scenario, the agent chooses an optimal action, hence the $\arg\max$ term. In step (d), the agent updates its value function for state $\mathbf{s}_{k,t}^i$, which is applied to update the $Q$-value in step (b) at the next time instant. Using the optimal policy obtained in step (c), the agent calculates its value function, which is an approximation of the discounted future reward. Again, the competitor is assumed to take its optimal action that minimizes the agent's $Q$-value, hence the $\min$ term.

Performance Enhancements
Lo and Akyildiz [3] reduce false alarm, which occurs when a PU is mistakenly considered present in an available channel, in channel sensing (A2).
(P9) Higher Probability of PU Detection. Lo and Akyildiz [3] increase the probability of PU detection in order to reduce miss detection in channel sensing (A2). Miss detection occurs whenever a PU is mistakenly considered absent in a channel with PU activities.
(P10) Higher Number of Channels Being Sensed Idle. Lundén et al. [20] increase the number of channels being sensed idle, which contain more white spaces.
(P11) Higher Accumulated Rewards. Wang et al. [14] increase the accumulated rewards, which represent gains, such as throughput performance. Xiao et al. [38] improve SU's total payoff, which is the difference between gained rewards (or revenue) and total cost incurred.

Open Issues
This section discusses open issues that can be pursued in this research area.

Enhanced Exploration Approaches.
While a larger exploration probability may be necessary if the dynamicity of the operating environment is high, the opposite holds whenever the operating environment is rather stable. Generally speaking, exploration helps to increase the convergence rate of an RL scheme. Nevertheless, a higher exploration rate may cause fluctuation in performance (e.g., end-to-end delay and packet loss) due to the selection of nonoptimal actions. For instance, in a dynamic channel selection scheme (A1), the performance may fluctuate due to the frequent exploration of nonoptimal channels. Similarly, in a routing scheme (A7), the performance may fluctuate due to the frequent exploration of nonoptimal routes. Further research could be pursued to investigate the possibility of achieving exploration without compromising the application performance. Additionally, further research could be pursued to investigate how to achieve an optimal trade-off between exploration and exploitation in a diverse range of operating environments. For instance, through simulation, Li [6] found that the $Q$-value converges faster with a higher learning rate and a lower temperature.
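The trade-off discussed above can be made concrete with an exploration probability that is large at first and shrinks as the environment is learned. The sketch below is one common way to do this and is not drawn from any of the reviewed schemes: the geometric decay schedule, the `floor` parameter, and the function names are assumptions introduced for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore a uniformly random action with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(epsilon0, decay, step, floor=0.01):
    """Geometrically decay the exploration probability, but never below floor."""
    return max(floor, epsilon0 * (decay ** step))


# Usage: exploration becomes rarer as learning proceeds.
channel_q = [0.1, 0.9, 0.3]
for step in range(5):
    eps = decayed_epsilon(epsilon0=1.0, decay=0.5, step=step)
    action = epsilon_greedy(channel_q, eps)
```

Keeping a nonzero `floor` preserves some exploration, which matters when the operating environment is dynamic; in a stable environment the floor can be set lower to limit performance fluctuation.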

Fully Decentralized Channel Auction Models.
To the best of our knowledge, most of the existing RL-based channel auction models (see Section 4.6) have been applied in centralized CR networks, in which a centralized entity (e.g., base station) allocates white spaces to SU hosts with winning bids. The centralized entity may perform simple tasks, such as allocating white spaces to SU hosts with winning bids [16], or it may learn using RL to maximize its utility [36]. The main advantage of the centralized entity is that it simplifies the management of the auction process and the interaction among nodes. Nevertheless, it complicates implementation, because a centralized entity incurs additional cost and may not be feasible in all scenarios. While there have been increasing efforts to enhance the performance of the RL-based auction models, further research is necessary to investigate fully decentralized RL-based auction models, which do not rely on a centralized entity, along with their requirements and challenges. For instance, by incorporating the cooperative learning feature (see Section 3.7.3) into the RL auction model, SUs can exchange auction information with PUs and other SUs in a decentralized manner, which may enable them to make bidding decisions without the need for a centralized entity. However, this may introduce other concerns such as security and nodes' selfishness, which can be interesting directions for further research.

Enhancement on the Efficiency of RL Algorithm.
The application of RL in various application schemes in CR networks may introduce complexities, and so the efficiency of the RL algorithm should be further improved. As an example, the collaborative model (see Section 4.8) requires explicit coordination in which the neighboring agents exchange information among themselves in order to expedite convergence to optimal joint action. This enhances the network performance at the expense of higher amount of control overhead. Hence, further research is necessary to investigate the possibility of indirect coordination. Moreover, the network performance may further improve with reduced overhead incurred by RL. As another example, while RL has been applied to address security issues in CR networks (see application (A3)), the introduction of RL into CR schemes may introduce more vulnerabilities into the system. This is because the malicious SUs or attackers may affect the operating environment or manipulate the information so that the honest SUs' knowledge is adversely affected.

Application of RL in New Application Schemes.
The wide range of enhanced RL algorithms, including the dual $Q$-function, partial observable, actor-critic, auction, internal self-learning, collaborative, and competitive models (see Sections 4.3-4.9), can be extended to other applications in CR networks, including emerging networks such as cognitive maritime wireless ad hoc networks and cognitive radio sensor networks [40], in order to achieve context awareness and intelligence, which are the important characteristics of cognition cycle (see Section 2.2.1). For instance, the collaborative model (see Section 4.8) enables an agent to collaborate with its neighbor agents in order to make decisions on action selection, which is part of an optimal joint action. This model is suitable for most application schemes that require collaborative efforts, such as trust and reputation systems [41] and cooperative communications, although the application of RL in those schemes is yet to be explored. In trust and reputation management, SUs make a collaborative effort to detect malicious SUs, such that malicious SUs are assigned low trust and reputation values. Additionally, Section 3 presents new features of each component of RL, which can be applied to enhance the performance of existing RL-based application schemes in CR networks. Further research could also be pursued to (i) apply new RL approaches, such as the two-layered multiagent RL model [42], to CR network applications, (ii) investigate RL models and algorithms applied to other kinds of networks such as cellular radio access networks [43] and sensor networks [44], which may be leveraged to provide performance enhancement in CR networks, and (iii) apply or integrate the RL features and enhancements (e.g., state, action, and reward representations) to other learning-based approaches, such as the neural network-based approach [45].

Lack of Real Implementation of RL in CR Testbed.
Most of the existing RL-based schemes have been evaluated using simulations, in which they have been shown to achieve performance enhancements. Nevertheless, to the best of our knowledge, there is a lack of implementation of RL-based schemes on CR platforms. Real implementation of the RL algorithms is important to validate their correctness and performance in a real CR environment, which may also allow further refinements of these algorithms. To this end, further research is necessary to investigate the implementation of RL-based schemes on CR platforms and the associated challenges.

Conclusions
Reinforcement learning (RL) has been applied in cognitive radio (CR) networks to achieve context awareness and intelligence. Examples of schemes are dynamic channel selection, channel sensing, security enhancement mechanism, energy efficiency enhancement mechanism, channel auction mechanism, medium access control, routing, and power control mechanism. To apply the RL approach, several representations may be necessary including state and action, as well as delayed and discounted rewards. Based on the CR context, this paper presents an extensive review on the enhancements of these representations, as well as other features including $Q$-function, trade-off between exploration and exploitation, updates of learning rate, rules, and cooperative learning. Most importantly, this paper presents an extensive review on a wide range of enhanced RL algorithms in the CR context. Examples of the enhanced RL models are the dual $Q$-function, partial observable, actor-critic, auction, internal self-learning, collaborative, and competitive models. The enhanced algorithms provide insights on how various schemes in CR networks can be approached using RL. Performance enhancements achieved by the traditional and enhanced RL algorithms in CR networks are presented. Certainly, there is a great deal of future work in the use of RL, and we have raised open issues in this paper.