Optimal Channel Selection Based on Online Decision and Offline Learning in Multichannel Wireless Sensor Networks

We propose a channel selection strategy with a hybrid architecture, which combines centralized and distributed methods to alleviate the overhead of the access point while providing more flexibility in network deployment. Within this architecture, we use game theory and reinforcement learning to achieve optimal channel selection under different communication scenarios. In particular, when the network can satisfy the requirements of energy and computational costs, the online decision algorithm based on a noncooperative game helps each individual sensor node immediately select the optimal channel. Alternatively, when the network cannot satisfy these requirements, the offline learning algorithm based on reinforcement learning helps each individual sensor node learn from its experience and iteratively adjust its behavior toward the expected target. Extensive simulation results validate the effectiveness of our proposal and show that our channel selection strategy achieves higher system throughput than conventional off-policy channel selection approaches.


Introduction
Multichannel communication enables terminals to transmit on different channels simultaneously without mutual interference. It has been widely used in Wireless Sensor Networks (WSNs) and the Internet of Things (IoT) to support large and dense networks [1]. Since different channels may result in different transmission qualities, channel selection plays a crucial role in multichannel WSNs.
Owing to the constraints on the energy budget and memory size of WSN nodes, centralized approaches are usually considered for channel selection. In these approaches, a central node, for example, an access point (AP) or sink node, performs all the necessary computations and informs the other sensor nodes of a reasonable channel selection decision. Wu et al. [2] adopt a static tree-based channel selection approach in which the sink node can direct its attributed sensor nodes to switch to a channel with minimum interference. Li et al. [3] extend the typical two-level architecture by using aggregators that coordinate their associated sensor nodes to avoid transmitting the huge volume of collected data. However, centralized approaches have limited performance in large-scale networks. Therefore, distributed approaches have attracted more interest, since they allow better flexibility and scalability in node deployment. Tang et al. [4] design a counter-based approach in which nodes select channels based on the channel quality. Nevertheless, information exchange and negotiation among nodes in this approach require tight synchronization.
In order to implement self-decision and self-learning, approaches based on game theory and reinforcement learning have been introduced to improve channel selection and other resource allocation problems, for example, [5-11]. Game theory is a powerful tool for modeling decentralized networks and obtaining an equilibrium state. Its common drawback lies in the huge instantaneous computational cost. Félegyházi et al. [5] define a two-tier noncooperative medium access game composed of a channel allocation subgame and a multiple access subgame. Han and Kawanishi [6] provide two types of game strategies to adapt to different collision probabilities. Canzian et al. [7] design an equilibrium game between pricing and intervention to achieve the maximum efficiency in the perfect monitoring scenario. On the other hand, reinforcement learning can help each individual node learn from its own feedback history and gradually adjust its behavior toward the expected state. Nie and Haykin [8] provide a classical reinforcement learning framework to solve the channel assignment problem. Naddafzadeh-Shirazi et al. [9] and Zhou et al. [10] investigate reinforcement learning schemes that help secondary users capture the state of the primary user and learn from the feedback to improve their own utility. Zame et al. [11] design a statistical count learning scheme in which secondary stations learn from and coordinate on their own histories while simultaneously teaching other stations about these counter histories. Unfortunately, reinforcement learning approaches share a common disadvantage: they usually require many learning iterations to converge to an acceptable solution. Furthermore, most existing work relies on information exchange and negotiation among users, which increases computational complexity and communication overhead.
In this paper, we propose an intelligent channel selection strategy with a hybrid architecture that benefits from the combination of centralized and distributed methods. Our work requires neither central control nor any exchange or negotiation messages among sensors. Most importantly, we use intelligent techniques, namely, game theory and reinforcement learning, to address the limited applicability of optimal channel selection under different communication overheads. To achieve this goal, we formulate two algorithms from the perspective of the sensors, named the online decision algorithm and the offline learning algorithm, respectively. The online and offline strategies have their own merits and are targeted at different application scenarios.
The online decision algorithm based on a noncooperative game helps each individual sensor immediately select the optimal channel when the network can satisfy the requirements of energy and computational costs. In terms of computational complexity, the online strategy is based on the noncooperative game and is therefore less complex than a cooperative game. Here, "online" means real-time computation from which sensors obtain immediate results. This approach focuses on how to find the optimal equilibrium state through local computation by each individual sensor.
The offline learning algorithm based on reinforcement learning addresses iterative channel selection in order to decrease energy consumption. Each sensor can learn from its individual feedback history and adjust its behavior toward the expected target. This approach emphasizes the learning ability with which a sensor learns its behavior and picks optimal choices while converging to an acceptable and stable solution. Unlike the online decision algorithm, the offline learning algorithm does not affect a node's selection behavior immediately, but in an iterative way.
The main contributions of this paper are twofold:
(i) We propose a hybrid architecture in which centralized processing and distributed processing are jointly considered in order to alleviate the overhead of the AP node and allow more flexible deployment.
(ii) In this architecture, we present two types of optimal channel selection algorithms based on intelligent decision and learning. They can be used to adapt to different requirements on the communication overhead. They require no central control, information exchange, or negotiation among individual nodes, which keeps the computational complexity, communication overhead, and storage requirement low.
The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces the system model. In Sections 4 and 5, the online decision and offline learning methods for channel selection are, respectively, explained in detail. Section 6 validates our proposal via simulation. Section 7 concludes this research.

Related Work
For the game-based category, a host of game strategies have been presented to fit the optimal channel selection process, including cooperative and noncooperative game solutions. The cooperative game can improve the performance of the resource allocation protocol, but it needs more information exchange and negotiation among nodes, which incurs high communication overhead and computational complexity. Nuggehalli et al. [12] exploit the AP node to manage the priority of other nodes, which guarantees the fairness of the bargaining process. This thread is enhanced by penalizing and pricing mechanisms [13-16]. Specifically, Shrestha et al. [13] and Chatterjee and Wong [14] investigate punishment mechanisms to promote cooperation. Wang et al. [15] and Cui et al. [16] propose a pricing model based on the Stackelberg game, in which the leader node formulates a price list for follower nodes to access a desirable channel. Recently, a new incentive scheme, called intervention, has been adopted in [17-19]. These approaches aim at formulating incentive mechanisms to achieve higher utility. However, they need designers or coordinators to set prices and formulate reasonable rules during the initial phase. Nodes must strictly follow the rules during the execution phase, which creates an inevitable problem in dynamic network scenarios: designers need to monitor and adjust the rules constantly. All of the aforementioned works depend on a centralized server to solve the resource allocation issue and inform each individual node of the decision. However, in many cases, the synchronization information may not be available to all nodes, and some nodes may deviate from cooperation.
The second thread in this category uses noncooperative game-based algorithms and policies for the distributed scenario, in which each node makes real-time decisions that only maximize its individual utility. Cho and Tobagi [20] indicate that the noncooperative game has lower computational complexity than the cooperative game. The work suggests that a noncooperative system in which individuals pursue their own benefit with an appropriate selfish strategy can lead to a globally optimal network result. In [21], each node maintains local counters to collect the states of packet transmission, based on which the conditional collision probability and the transmission probability of the opponents can be calculated without negotiation. Zheng et al. [23] extend the work in [22] and investigate the problem of channel selection where no information exchange is available among users and there is no centralized controller. Each user adaptively updates its channel selection strategy relying on its individually experienced action-reward. The prior solutions in [22,23] are similar to our design in that they do not need to exchange information in dynamic and distributed networks. Nevertheless, the prior work needs a mechanism to distinguish active users from inactive ones, which is not required in our design.
For the learning-based category, reinforcement learning approaches were first proposed to achieve low energy consumption and low computational complexity in WSNs [24-26]. Subsequently, more and more investigations have combined reinforcement learning with other mechanisms [29-33] in cognitive networks. Teng et al. [27] discuss a scheme that adopts a Q-learning-based auction game to help nodes compete for channel access opportunities. Kakalou et al. [28] and Saleem et al. [29] use cluster-based architectures instead of a central entity, in which the cluster head observes the traffic of the primary user (PU) to avoid collisions while keeping the other member nodes synchronized. In [30], Lin et al. investigate a novel dynamic spectrum access framework with control information exchange through beacons. In [31], a novel distributed Q-learning algorithm with a heuristically accelerated scheme is shown to be a powerful approach to the dynamic spectrum access problem. The main insight of these contributions is to employ interactive information among nodes; however, an excess of information exchange works against the alleviation of communication overhead. Motivated by these observations, our solution does not require information exchange.

System Model
There are two types of nodes in the considered system: the AP and the sensor nodes. Assume that there are $K$ orthogonal channels and $N$ selfish sensors in the network ($K \leq N$). Orthogonal Frequency Division Multiple Access (OFDMA) is applied so that each sensor can access different channels by utilizing the feedback information (e.g., the channel gain) from the AP [33,34]. The sensors in the network experience diverse channel gains. We further assume that the interference comes only from the sensors that contend for the same channel. Figure 1 shows the considered network model. There are two crucial problems in this system, that is, how the sensors select proper channels and how they compete for access if two or more sensors select the same channel simultaneously. We discuss these problems in detail in the following subsections.

Channel Selection.
We consider the noncooperative channel selection scenario in Figure 1, where $\mathcal{N} = \{1, 2, \ldots, N\}$ is the set of fully distributed sensors, $\mathcal{K} = \{1, 2, \ldots, K\}$ is the set of available channels, and $h = \{h_1, h_2, \ldots\}$ denotes the set of channel gains. We assume that each sensor is equipped with a single radio transceiver and can dynamically access any channel. It is worth noting that there is a unique policy in the channel selection process: the sensors individually select the channels with the maximum channel gain. Due to the selfish behavior of the sensors, no negotiation messages are exchanged in this process. However, the request and acknowledgement messages between the sensors and the AP still exist. According to the different strategies, we divide the on-policy channel selection method into two categories.
The first category is the real-time strategy, which we name the online decision method based on a noncooperative game. During the online decision process, the sensor first transmits a random access request to the AP, and the AP then sends a feedback acknowledgement message. After that, the sensor can calculate its own maximum utility from the feedback of the AP, which carries the channel gains of the different channels. The optimal channel, which is usually the one with the highest channel gain, is selected.
The second category depends on the history of sensor states, and we name it the offline learning method based on reinforcement learning. The offline learning process is an iterative exploration and exploitation process, in which each sensor evaluates its current behavior and then improves it greedily. In this way, the random selection gradually converges to the optimal one. It is noteworthy that reinforcement learning is generally known as a real-time machine learning approach because of the immediate reward returned by the environment. However, the immediate reward requires a number of learning iterations and may bring unaffordable overhead to the network. Therefore, we revise it to obtain an "offline" learning approach that is affordable for systems with limited computational capability and power supply. The detailed procedures are explained in Sections 4 and 5, and it is validated that this "offline" algorithm can prolong the network lifetime significantly.

Multiple Access Contention.
The multiple access model determines the opportunity of channel access when several sensors select the same channel. We consider a CSMA/CA scheme (e.g., the IEEE 802.11 DCF protocol) to resolve the channel contention in this distributed application. In the carrier sensing phase, each sensor detects whether the channel is idle; in the collision avoidance phase, the sensor applies the binary exponential backoff (BEB) algorithm to access the channel. According to the well-known research [35], in saturation conditions, the conditional transmission probability of a sensor is given by (1), where $\tau$, $p$, $\mathrm{CW}_{\min}$, and $m$ denote the transmission probability, collision probability, minimum contention window, and backoff stage, respectively.
Based on (1), the probability that at least one sensor transmits packets is expressed in (2), and the transmission success probability of each sensor is expressed in (3). Accordingly, the achievable rate $R_i$ of sensor $i$ is given by (4), where $B$ is the bandwidth, SINR is the signal-to-interference-plus-noise ratio, $\beta$ denotes the ratio of the current data rate to the Shannon capacity, $P_{\mathrm{tx}}$ is the transmission power, $h_i$ denotes the channel gain of sensor node $i$, $\rho$ is the correlation coefficient between the sensor node and the AP, and $\sigma^2$ is the noise power.
Next, we calculate the time consumed by the different events in the transmission process:
(i) The time consumed by a collision, which occurs when two or more sensors contend for one transmission opportunity. It is determined by the propagation delay $\delta$, the length of the physical-layer header (PHY header), the length of the MAC-layer header (MAC header), the length of one data packet (packet length), the achievable rate $R$, and the length of the DCF Interframe Space (difs).
(ii) The time consumed by a successful transmission, which additionally involves the length of the Short Interframe Space (sifs) and the length of the acknowledgement message (ack).
(iii) The average time consumption, which is calculated for successful and failed transmissions, respectively, and involves the length of one slot (slot). The average throughput is then formulated in terms of $T_{\mathrm{success}}$, the average time consumption of a successful transmission, and $T_{\mathrm{fail}}$, the average time consumption of a failed transmission.
From the throughput expression, we observe that the throughput of a sensor is influenced by the achievable data rate $R$, which is essentially affected by the channel gain according to (4). For this reason, the channel with the maximum gain is selected through the channel selection procedure.
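The contention model of this subsection can be explored numerically. The sketch below is not taken from the paper: it assumes that (1) and the throughput expression follow the standard saturation analysis of the 802.11 DCF, in which the transmission probability $\tau$ and the collision probability $p$ are coupled through $p = 1 - (1 - \tau)^{n-1}$ for $n$ contending sensors. The function names, the basic-access timing of a successful and a failed transmission, and the 1 Mbit/s data rate are illustrative assumptions; the frame lengths and timing values are those listed in the simulation setup.

```python
def tau_from_p(p, cw_min, m):
    """Per-slot transmission probability for a given collision probability p.

    Uses the sum form of the saturation formula, which avoids the 0/0
    singularity of the closed form at p = 0.5.
    """
    backoff_sum = sum((2.0 * p) ** j for j in range(m))
    return 2.0 / (1.0 + cw_min + p * cw_min * backoff_sum)


def solve_fixed_point(n, cw_min=32, m=5, iters=200):
    """Bisection on p for tau(p) coupled with p = 1 - (1 - tau)^(n - 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        p = 0.5 * (lo + hi)
        tau = tau_from_p(p, cw_min, m)
        if p < 1.0 - (1.0 - tau) ** (n - 1):
            lo = p  # the root lies above the current guess
        else:
            hi = p
    p = 0.5 * (lo + hi)
    return tau_from_p(p, cw_min, m), p


def saturation_throughput(n, rate_bits_per_us=1.0, payload=512, ack=304,
                          phy_header=192, mac_header=272, slot=20.0,
                          sifs=10.0, difs=50.0, delay=1.0):
    """Fraction of channel time spent on payload bits (times in microseconds).

    The data rate (1 bit per microsecond) is a hypothetical value chosen
    only to make the example concrete.
    """
    tau, _ = solve_fixed_point(n)
    p_tr = 1.0 - (1.0 - tau) ** n                   # at least one transmission
    p_s = n * tau * (1.0 - tau) ** (n - 1) / p_tr   # success given a transmission
    frame = (phy_header + mac_header + payload) / rate_bits_per_us
    t_success = frame + sifs + delay + ack / rate_bits_per_us + difs + delay
    t_fail = frame + difs + delay
    mean_slot = ((1.0 - p_tr) * slot
                 + p_tr * p_s * t_success
                 + p_tr * (1.0 - p_s) * t_fail)
    return p_tr * p_s * (payload / rate_bits_per_us) / mean_slot


if __name__ == "__main__":
    for sensors in (2, 10, 50):
        print(sensors, round(saturation_throughput(sensors), 3))
```

Under these assumptions the normalized throughput per channel drops as more sensors contend for it, which is consistent with the saturation effect discussed in the evaluation.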
The frequently used notations in this paper are summarized in Notations.

Online Decision Algorithm for Channel Selection
Taking into account the selfish behavior of each sensor in the network, we formulate the channel selection problem in multichannel WSNs as a noncooperative game. The benefit of the noncooperative game is that it requires no coordination control or information exchange among nodes. Based on this model, we propose an online decision algorithm for channel selection.

Noncooperative Game Model.
Let us denote the game as $\Gamma = \{\mathcal{N}, \{S_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}}\}$, where $\mathcal{N}$ is the player set (i.e., the set of sensor nodes), $\{S_i\}_{i \in \mathcal{N}}$ is the channel selection strategy set for player $i$, and $\{u_i\}_{i \in \mathcal{N}}$ is the utility of player $i$.
The utility function $u_{i,k}$ reflects the throughput of sensor $i$ in the selected channel $k$ and is defined in (9). A strategy profile $(s_i^*, s_{-i}^*)$ is a Nash equilibrium (NE) if no sensor can increase its utility by unilaterally deviating, that is, $u_{i,k}(s_i^*, s_{-i}^*) \geq u_{i,k}(s_i, s_{-i}^*)$ for every sensor $i$ and every strategy $s_i$.

Algorithm 1: Channel selection based on the noncooperative game.
// Initialization
(1) AP evaluates the channel gains $h$
// Each sensor obtains the channel gain
(2) Sensor $i$ transmits a request to the AP
(3) Sensor $i$ obtains the channel gains $\{h_i^k\}$, $i \in \mathcal{N}$, $k \in \mathcal{K}$, from the feedback
// Each sensor selects the optimal channel through the noncooperative game
(4) while ($u_{i,k}(s_i^*, s_{-i}^*) < u_{i,k}(s_i, s_{-i}^*)$) do
(5)   for $i = 1$ to $N$
(6)     for $k = 1$ to $K$
(7)       Sensor $i$ calculates the utility function $u_{i,k}$ on each channel according to (9)
(8)     end for
(9)   end for
(10) end while

Note that NE can generally be classified into pure strategy NE and mixed strategy NE. The mixed strategy NE usually seeks a stable equilibrium state in which sensors select channels with negotiation. In this paper, we employ the pure strategy NE, in which each sensor selects its channel in an on-policy manner. If the system is at such an NE, the utility of player $i$ degrades whenever it unilaterally deviates, which makes this property particularly desirable. However, the sum-utility optimal channel selection problem is NP-hard. Thus, conventional optimization techniques cannot be applied directly, and even centralized algorithms cannot guarantee the globally optimal solution. We propose Theorem 2 to characterize the game.

Theorem 2. With the maximization of an individual node utility the global benefit of the system is also maximized.
Proof. Monderer and Shapley [32] have proven that the individual or global NE is a maximizer of the potential function. According to the concept of NE and (9), the utility $u_i$ is the best response for node $i$ with strategy $s^*$, either individually or globally.
In terms of global optimization, the sum-utility optimal channel selection problem can be formalized as the maximization of $U = \sum_{i \in \mathcal{N}} u_i$, where $U$ denotes the sum of the individual node utilities. According to Theorem 2, we should develop an effective algorithm to obtain the globally optimal NE.

Algorithm Description.
Each sensor is regarded as an online decision automaton agent, which selects the channel according to the greedy strategy $s^*$ (the strategy that selects the channel with the highest channel gain). In other words, each sensor maximizes its utility function $u_{i,k}$ in a greedy way. The algorithm is described below.
In the initialization phase, the AP evaluates the channel gain of each channel. Each sensor first transmits a request message to the AP. Next, sensor $i$ obtains the channel gains of the different channels from the feedback acknowledgement message of the AP, based on which the utility function of sensor $i$ is calculated locally. These operations are performed iteratively until the expected utility function $u_{i,k}$ converges to the unique NE. Finally, sensor $i$ takes the greedy strategy $s_i^*$ to select the channel with the optimal channel gain. Algorithm 1 describes the channel selection process based on the noncooperative game.
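The iteration described above can be sketched as a best-response loop. The code below is only an illustration under stated assumptions: the utility of (9) is replaced by a hypothetical stand-in in which the gain of a sensor on a channel is shared equally among the sensors contending on it, and the number of contenders per channel is assumed to be observable (for instance, through AP feedback). The names `utility`, `best_response_selection`, and `is_nash` are introduced here for illustration only.

```python
import random


def utility(gains, choice, i, k):
    """Illustrative utility of sensor i on channel k (an assumption, not (9)):
    its channel gain divided by the number of sensors contending on k."""
    others = sum(1 for j, c in enumerate(choice) if j != i and c == k)
    return gains[i][k] / (others + 1)


def best_response_selection(gains, max_rounds=10000):
    """Let every sensor repeatedly switch to its best channel until nobody moves."""
    n_sensors, n_channels = len(gains), len(gains[0])
    # Start from the purely greedy choice: each sensor takes its highest gain.
    choice = [max(range(n_channels), key=lambda k: gains[i][k])
              for i in range(n_sensors)]
    for _ in range(max_rounds):
        changed = False
        for i in range(n_sensors):
            current = utility(gains, choice, i, choice[i])
            best = max(range(n_channels),
                       key=lambda k: utility(gains, choice, i, k))
            if utility(gains, choice, i, best) > current + 1e-12:
                choice[i] = best
                changed = True
        if not changed:  # no sensor can improve: a pure-strategy NE
            break
    return choice


def is_nash(gains, choice):
    """Check that no sensor gains by unilaterally changing its channel."""
    n_channels = len(gains[0])
    return all(
        utility(gains, choice, i, choice[i]) + 1e-12
        >= max(utility(gains, choice, i, k) for k in range(n_channels))
        for i in range(len(gains))
    )


if __name__ == "__main__":
    random.seed(0)
    # Gains drawn uniformly from [0.6, 0.9], as in the simulation setup.
    gains = [[0.6 + 0.3 * random.random() for _ in range(3)] for _ in range(8)]
    selection = best_response_selection(gains)
    print(selection, is_nash(gains, selection))
```

With this particular stand-in utility, every strict improvement by one sensor increases a common potential function, so the loop terminates at a pure-strategy NE; other utility definitions may need a different stopping rule.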

Convergence and Complexity Analysis.
The convergence of Algorithm 1 is guaranteed since the expected utility function $u_{i,k}$ converges to the unique NE. Within each iteration, the utility is evaluated for every sensor on every channel, so the computational complexity of Algorithm 1 grows with both the number of sensors $N$ and the number of channels $K$. In terms of storage requirement, each sensor needs to cache the channel gains of the different channels carried in the feedback acknowledgement message; thus $K$ memory units are required to store the immediate feedback in the case of the online decision algorithm based on the noncooperative game. Clearly, the computational complexity and storage requirement of this immediate algorithm increase with the number of sensors and channels. In the following section, we present an alternative channel selection algorithm with reduced computational complexity.

Offline Learning Algorithm for Channel Selection
In this section, we present a decentralized channel selection algorithm based on the reinforcement learning framework. This framework selects the channel "offline" in a simpler way than the online algorithm.

Decentralized Reinforcement Learning Framework.
Reinforcement learning (RL) is usually adopted to solve problems in which a learning agent interacts with its environment to achieve goals related to the state of the environment [36]. Such an agent should be able to observe the state of the environment and take actions, according to the feedback of the observation, that affect the state at the next time step. As illustrated in Figure 2, at time $t$, sensor $i$ observes the channel and obtains the current channel state $s_t$; it then takes action $a_t$ and obtains the reward $r_t$.
In a general reinforcement learning framework, there are two interacting objects: the agent and the environment. Three types of information are exchanged in the learning process: state $s_t$, action $a_t$, and reward $r_t$. For each sensor, the learning process is an exploration-exploitation tradeoff with two important characteristics: trial-and-error search and delayed reward. The current reward may not affect the next state immediately, and the expected learning target is reached only after a period of time. Therefore, we call this learning algorithm "offline" to distinguish it from the online decision algorithm in Section 4.
We propose a Q-learning algorithm for sensors to optimally select the channels according to the histories of observed states and the rewards accumulated into the current choice of action. Some definitions used by the algorithm are given as follows.
State. State $s_t$ is the state observed by node $i$ at time $t$. We use $S = \{s_1, \ldots, s_{|S|}\}$ to denote the finite state space, $s_t \in S$. The state transition from $s_t$ to $s_{t+1}$ depends on the action, and accordingly the next state $s_{t+1}$ can be observed when the next action occurs.
Action-Value Function. The action-value function $Q(s, a)$ is associated with action $a$ and state $s$ at time $t$. It plays the role of the utility function in Section 4.

Channel Selection with Reinforcement Learning.
In the above reinforcement learning framework, each sensor interacts with the channel environment. At each discrete time step $t$, the sensor observes the current state $s_t$, takes action $a_t$, and obtains the feedback reward $r_t$. As we assume in the system model that there are $K$ orthogonal channels allocated by the AP, each sensor selects the optimal channel with probability $1/K$ and one of the other channels with probability $(K - 1)/K$. The channel selection process is a memoryless random process that produces a sequence of random states with the Markov property. Therefore, it can be modeled as a Markov Decision Process (MDP) with a sequence of state information. The "state information" includes actions, states, and rewards. According to policy $\pi(a \mid s)$, the sensor observes state $s$, which is affected by action $a$. The task of a sensor is to learn the policy $\pi(a \mid s)$ that maximizes the expectation of the action-value function $Q_{\pi}(s, a)$.
Since $s_t$ is a Markov state, that is, the information related to the past states $[s_1, \ldots, s_{t-1}]$ is captured by state $s_t$, we only need to store the current state $s_t$, which allows a considerable reduction of the storage requirement.
Definition 3. A history $H_t$ is a sequence of actions, states, and rewards.
Definition 4. A channel selection process is a tuple $\langle S, A, R, P, \gamma \rangle$, where $S$, $A$, $R$, and $P$, respectively, denote a finite set of states, a finite set of actions, the rewards, and the transition probability matrix. $\gamma$ ($\gamma \in [0, 1]$) is the discount factor, which avoids infinite returns in cyclic Markov processes.
As illustrated in Figure 3, sensor $i$ completes the channel selection process through two discrete time steps. The white point is the initial state, a red point indicates that sensor $i$ selected the optimal channel with probability $1/K$, and a black point indicates that sensor $i$ selected another channel with probability $(K - 1)/K$.
Let us consider the operation at time $t + 1$ as an example: the transition probability to the optimal state is $1/K$, and the transition probability to any other state is $(K - 1)/K$. The optimal action-value function $Q^*(s, a)$ can be recursively computed by adopting the Bellman optimality equation:
$$Q^*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} Q^*(s', a'),$$
where $R_s^a$ is the expected reward of action $a$ in state $s$ and $P_{ss'}^a$ is the probability of moving from state $s$ to state $s'$ under action $a$. $Q^*(s, a)$ always specifies the best possible performance in the MDP. For the MDP of channel selection, this property holds according to the following theorems.
Theorem 6. There exists an optimal policy $\pi^*$ that is better than or equal to all other policies, $\pi^* \geq \pi$, $\forall \pi$.
Theorem 7. All optimal policies achieve the optimal action-value function, $Q_{\pi^*}(s, a) = Q^*(s, a)$.
Now, we use the Q-learning algorithm to solve the MDP problem of channel selection. Each sensor decides its next action $a_{t+1}$ based on the trend of a sequence of actions, states, and rewards, and then it observes the current reward $r_t$ and state $s_t$ to update the Q-value of the action-value function $Q(s, a)$. The updated Q-value affects the next round of channel selection. The Q-value is updated as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],$$
where $\alpha$ is the learning rate, which specifies the updating speed of the Q-value, and $\gamma$ is the discount factor, which determines the present value of future rewards. $Q(s_t, a_t)$ denotes the current Q-value, and $Q(s_{t+1}, a_{t+1})$ denotes the expected Q-value.
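As a quick numerical illustration of the Q-value update, consider the common temporal-difference form $Q \leftarrow Q + \alpha [r + \gamma Q' - Q]$ with the learning rate $\alpha = 0.6$ and the discount factor $\gamma = 0.9$ used in the simulations. The example assumes, purely for illustration, that all Q-values start at zero, that the sensor selects the optimal channel twice in a row (reward $+1$), and that it revisits the same state-action pair, so the expected Q-value equals the current one.

```latex
\begin{align*}
Q_1 &= 0 + 0.6\,\bigl[\,1 + 0.9 \cdot 0 - 0\,\bigr] = 0.6, \\
Q_2 &= 0.6 + 0.6\,\bigl[\,1 + 0.9 \cdot 0.6 - 0.6\,\bigr] = 0.6 + 0.6 \cdot 0.94 = 1.164 .
\end{align*}
```

Under these assumptions each step maps $Q$ to $0.94\,Q + 0.6$, so repeated optimal selections drive the Q-value toward the fixed point $1 / (1 - \gamma) = 10$.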
As shown in Figure 4, we improve the learning policy $\pi$ by acting greedily with respect to the optimal policy $\pi^*_{\mathrm{greedy}}$ in each iteration. Figure 4(a) shows that the learning policy $\pi$ converges to the optimal policy $\pi^*_{\mathrm{greedy}}$ after $n$ iterations. In Figure 4(b), we use the contraction mapping theorem to represent this channel selection process. The red solid line denotes the optimal policy $\pi^*_{\mathrm{greedy}}$, and the blue solid line denotes the learning policy $\pi$. The green dashed arrow denotes a stochastic action, which is improved gradually by a series of feedback information according to the optimal policy $\pi^*_{\mathrm{greedy}}$. The rose dashed arrow and the purple dashed arrow, respectively, denote reward information and state information. They are used to evaluate the action according to the learning policy $\pi$ after $n$-step returns ($n = 1, 2, \ldots, \infty$).
Using the offline algorithm to select channel is an exploration-exploitation process that requires no prior knowledge for each sensor.Through the learning process, the stochastic behavior of sensors can gradually converge to the optimal channel selection.

Algorithm Description.
Each sensor is regarded as an offline learning automaton agent, and its task is to learn the policy $\pi$ that maximizes the expectation of the Q-value $Q(s_t, a_t)$. The algorithm is described as follows.
In the initialization phase, each sensor initializes its action space and Q-value. Then, sensor $i$ takes a random channel selection action $a_t$, and the AP feeds back the reward $r_t$ to sensor $i$. Next, sensor $i$ observes the current reward value and the next state. If sensor $i$ selects the optimal channel at time $t$, the corresponding reward value equals "1"; otherwise the reward value equals "−1." Subsequently, sensor $i$ updates the Q-value. This process is repeated until the learning policy $\pi$ converges to the optimal policy $\pi^*_{\mathrm{greedy}}$. Finally, sensor $i$ chooses the channel selection action $a_{t+1}$ from $s_{t+1}$ by using the policy $\pi$ derived from $Q(s_t, a_t)$. Algorithm 2 describes the channel selection process based on Q-learning.
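The learning loop described above can be sketched in a few lines. This is a simplified illustration rather than Algorithm 2 itself, and it rests on several assumptions: the state space is collapsed to a single state so that the Q-table has one entry per channel, exploration is epsilon-greedy, the bootstrap target is the maximum Q-value as in textbook Q-learning, and the index of the optimal channel is known to the reward generator (standing in for the AP feedback). The function name and the `epsilon`, `steps`, and `seed` parameters are hypothetical; the rewards of +1 and −1, the learning rate 0.6, and the discount factor 0.9 are those stated in the text.

```python
import random


def q_learning_channel_selection(best_channel, n_channels, steps=2000,
                                 alpha=0.6, gamma=0.9, epsilon=0.2, seed=0):
    """Single-state Q-learning sketch: returns one Q-value per channel."""
    rng = random.Random(seed)
    q = [0.0] * n_channels  # Q-values initialised to zero (assumption)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: pick a channel uniformly at random.
            action = rng.randrange(n_channels)
        else:
            # Exploitation: pick the channel with the highest Q-value.
            action = max(range(n_channels), key=lambda a: q[a])
        # Reward signalled by the AP: +1 for the optimal channel, -1 otherwise.
        reward = 1.0 if action == best_channel else -1.0
        # Temporal-difference update toward reward plus discounted best value.
        q[action] += alpha * (reward + gamma * max(q) - q[action])
    return q


if __name__ == "__main__":
    q_values = q_learning_channel_selection(best_channel=2, n_channels=4)
    learned = max(range(4), key=lambda a: q_values[a])
    print("learned channel:", learned)
```

In this single-state setting the sensor stores only the Q-table and needs no messages from other sensors; a multi-state variant would index the table by the observed channel state as well.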

Complexity and Convergence Analysis.
Within each iteration of Algorithm 2, the maximum computation of a sensor is $O(1)$. Therefore, the total computational complexity of Algorithm 2 is linear in the number of iterations. In terms of storage requirement, each sensor only needs one memory unit to store the current state information. In this way, the computational complexity and storage requirement are significantly reduced compared to Algorithm 1.
The convergence of this channel selection algorithm is characterized by the following theorem.
Theorem 8. The learning policy $\pi$ converges to the optimal policy $\pi^*_{\mathrm{greedy}}$ if the following conditions are met:
(i) The optimal policy $\pi^*_{\mathrm{greedy}}$ has a unique fixed point.
(ii) The Robbins-Monro sequence [37] of step sizes $\alpha_t$ satisfies $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$.
(iii) The state space and the action space are finite.
From Theorem 8, we know that the learning policy $\pi$ converges to the optimal policy $\pi^*_{\mathrm{greedy}}$. Since a finite channel [...] Consider the $n$-step ($n = 1, 2, \ldots, \infty$) return values. In (18), the Q-value at time $t$ is a determinate value. The return value $Q_t^{(n)}$ of the $n$-step Q-value is finite and depends on the number of evaluations, which suggests that the convergence property of $Q_t^{(n)}$ is independent of the number of steps.
Therefore, the Q-value in (18) must be a finite value. Theorem 8 is proved.
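The finiteness argument can be made concrete with the textbook form of the $n$-step return. The expression below is the standard definition and is assumed, not confirmed, to match the form used in the proof. It uses rewards bounded by $|r| \leq 1$, as with the rewards of $+1$ and $-1$ in the algorithm description, and a discount factor $0 \leq \gamma < 1$.

```latex
\begin{align*}
Q_t^{(n)} &= r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n}
             + \gamma^{n} Q(s_{t+n}, a_{t+n}), \\
\bigl| Q_t^{(n)} \bigr| &\leq \sum_{j=0}^{n-1} \gamma^{j}
             + \gamma^{n} \bigl| Q(s_{t+n}, a_{t+n}) \bigr|
             = \frac{1 - \gamma^{n}}{1 - \gamma}
             + \gamma^{n} \bigl| Q(s_{t+n}, a_{t+n}) \bigr| .
\end{align*}
```

For $\gamma = 0.9$, the reward part of the bound is at most $10$ for every $n$, so the return stays bounded whenever the bootstrapped Q-value is bounded.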

Evaluation
Simulation Setup.
When multiple sensors select the same channel, we assume that all sensor nodes resolve contention based on the IEEE 802.11 standard, that is, the BEB backoff algorithm. The parameters are set as follows: minimum contention window $\mathrm{CW}_{\min} = 32$, backoff stage $m = 5$, transmission power $P_{\mathrm{tx}} = 12$ dBm, correlation coefficient $\rho = 0.1$, noise power $\sigma^2 = -80$ dBm, bandwidth $B = 1$ Hz, and rate ratio $\beta = 0.5$. The AP randomly assigns the channel gain in each iteration according to $h = \mathrm{rand}(N, K) \times 0.3 + 0.6$, that is, uniformly in $[0.6, 0.9]$. The OFDMA technique is used in the physical layer, where the packet length is 512 bits, the acknowledgement message length is 304 bits, and the headers of the physical layer and the MAC layer are, respectively, 192 bits and 272 bits. The slot time is 20 µs, the Short Interframe Space (SIFS) and the DCF Interframe Space (DIFS) are, respectively, 10 µs and 50 µs, and the propagation delay is 1 µs. The discount factor $\gamma$ is set to 0.9 and the learning rate $\alpha$ is set to 0.6. To simulate a dynamic network environment, we set the initial number of sensors to 2, and then 10 sensors are added in each iteration until the total number of sensors reaches 92. In the simulation, we vary the number of channels from 6 to 12. The presented results are obtained from 5000 independent Monte Carlo simulations.

6.2. Evaluation of Online Decision Based on Game Theory.
Figure 5 shows the performance comparison in terms of the global throughput, which is the sum of the individual throughputs achieved by the sensors. The stochastic algorithm serves as the competing scheme, as it is a typical off-policy channel selection algorithm. As shown in Figure 5, the noncooperative game algorithm achieves higher throughput for every number of channels considered. In particular, its throughput is higher and more stable than that of the stochastic algorithm. We can also observe that the curves grow more slowly as the number of sensors becomes larger. It can be concluded that the throughput saturates in each channel, because more sensors bring more severe contention.
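The comparison can be illustrated with a deliberately simplified toy model. The sketch below assumes that sensors on the same channel share its gain equally, which is far cruder than the contention model used in the simulations, and it uses plain best-response dynamics as a stand-in for the online decision algorithm. All function names are hypothetical.

```python
import random

def channel_loads(choice, num_channels):
    """Count how many sensors currently occupy each channel."""
    counts = [0] * num_channels
    for k in choice:
        counts[k] += 1
    return counts

def global_throughput(choice, gains):
    """Sum of individual throughputs under the assumed equal-share model."""
    counts = channel_loads(choice, len(gains))
    return sum(gains[k] / counts[k] for k in choice)

def stochastic_selection(num_sensors, num_channels, rng):
    """Baseline: every sensor picks a channel uniformly at random."""
    return [rng.randrange(num_channels) for _ in range(num_sensors)]

def best_response_selection(num_sensors, gains, rng, max_rounds=1000):
    """Let sensors switch channels one at a time until no one can improve."""
    num_channels = len(gains)
    choice = stochastic_selection(num_sensors, num_channels, rng)
    counts = channel_loads(choice, num_channels)
    for _ in range(max_rounds):
        moved = False
        for i in range(num_sensors):
            cur = choice[i]
            best_k, best_u = cur, gains[cur] / counts[cur]
            for k in range(num_channels):
                if k != cur:
                    u = gains[k] / (counts[k] + 1)  # payoff after moving alone
                    if u > best_u + 1e-12:
                        best_k, best_u = k, u
            if best_k != cur:
                counts[cur] -= 1
                counts[best_k] += 1
                choice[i] = best_k
                moved = True
        if not moved:
            break
    return choice

def is_pure_nash(choice, gains):
    """True if no sensor gains by unilaterally switching channels."""
    counts = channel_loads(choice, len(gains))
    for cur in choice:
        own = gains[cur] / counts[cur]
        for k in range(len(gains)):
            if k != cur and gains[k] / (counts[k] + 1) > own + 1e-12:
                return False
    return True

rng = random.Random(0)
gains = [rng.random() * 0.3 + 0.6 for _ in range(6)]
for n in (2, 12, 22):
    rnd = stochastic_selection(n, len(gains), rng)
    ne = best_response_selection(n, gains, rng)
    print(n, round(global_throughput(rnd, gains), 3),
          round(global_throughput(ne, gains), 3), is_pure_nash(ne, gains))
```

In this toy model the global throughput equals the sum of the gains of the occupied channels, so it stops growing once every channel is in use. That mirrors the saturation seen in Figure 5 only qualitatively.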

6.3. Evaluation of Offline Learning Based on Reinforcement Learning. When the offline reinforcement learning algorithm is adopted, as shown in Figure 6, a better global throughput is attained than with the stochastic channel selection algorithm. As in Figure 5, the throughput saturates as the number of sensors increases, which again stems from the more severe contention. Figure 6 also shows that the initial part of the performance curve of the reinforcement learning algorithm is unstable. This is because the purpose of the learning algorithm is to help sensors reduce their stochastic behavior: at the beginning a sensor behaves stochastically, and it eventually converges to stable and optimal behavior. The performance gap between the stochastic algorithm and our proposed algorithm increases with the number of sensors. However, the gaps between our proposed algorithms are minor. The reasons can be as follows: (1) more sensors incur more severe contention, and (2) our learning algorithm gradually converges toward the pure NE employed in Section 4.
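The transition from stochastic to stable behavior can be sketched with a single-sensor, stateless Q-learning loop. This is a simplified illustration and not the algorithm of the paper: the fixed per-channel rewards, the epsilon-greedy exploration, and the function name are assumptions, while the learning rate 0.6 and the discount factor 0.9 are taken from the simulation setup.

```python
import random

ALPHA = 0.6  # learning rate from the simulation setup
GAMMA = 0.9  # discount factor from the simulation setup

def q_learning_channel_selection(channel_rewards, steps, epsilon, rng):
    """Learn one Q-value per channel from repeated selections.

    channel_rewards holds a fixed, hypothetical reward for each channel.
    """
    num_channels = len(channel_rewards)
    q = [0.0] * num_channels
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a channel at random (stochastic behavior).
            action = rng.randrange(num_channels)
        else:
            # Exploit: pick the channel with the largest Q-value.
            action = max(range(num_channels), key=q.__getitem__)
        reward = channel_rewards[action]
        # Stateless Q-learning update toward reward + discounted best value.
        target = reward + GAMMA * max(q)
        q[action] += ALPHA * (target - q[action])
    return q

rewards = [0.6, 0.7, 0.9, 0.8]
q_values = q_learning_channel_selection(rewards, 20000, 0.1, random.Random(0))
best = max(range(len(rewards)), key=q_values.__getitem__)
print(best, [round(v, 3) for v in q_values])
```

Early selections are driven by the initial zero Q-values and by exploration, which corresponds to the unstable start of the curve. Once the estimates settle, the greedy choice stays on the channel with the largest reward.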
Figure 7 plots the cumulative distribution function (CDF) of the $Q$-value versus the number of sensors. It can be seen that the convergence performance becomes stable as the number of sensors increases. In addition, convergence is gradually achieved once the number of sensors exceeds the number of channels. Moreover, the growth of the curves becomes stable when the number of sensors added in each turn of iteration is larger than the number of channels. The reasons can be as follows: (1) the sample space on each channel becomes larger as the total number of sensors increases, and (2) the sensors have accumulated enough historical observations and decision experience.
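For readers who want to reproduce a plot of this kind, an empirical CDF can be computed from a list of sampled values as sketched below. The function names and the sample numbers are hypothetical and do not come from the paper.

```python
def empirical_cdf(samples):
    """Return sorted (value, fraction of samples <= value) pairs."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

def cdf_at(samples, threshold):
    """Fraction of samples that do not exceed the threshold."""
    return sum(1 for x in samples if x <= threshold) / len(samples)

# Hypothetical Q-values collected at the end of four independent runs.
final_q_values = [8.7, 9.0, 8.9, 9.0]
for value, fraction in empirical_cdf(final_q_values):
    print(value, fraction)
print(cdf_at(final_q_values, 8.95))  # 0.5
```

A CDF that rises steeply over a narrow range of values indicates that most runs end near the same Q-value, which is one way to read the stability described above.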

Conclusions
In this paper, two channel selection algorithms, based on online self-decision and offline self-learning, respectively, have been investigated for multichannel wireless sensor networks. Sensor nodes in both algorithms behave selfishly and do not negotiate or exchange information with other sensors. The online self-decision is based on a noncooperative game, and the offline self-learning is based on reinforcement learning. The online self-decision can be made immediately and is suitable for real-time applications. By contrast, the offline self-learning algorithm iteratively converges to the optimal channel selection while occupying fewer computational and storage resources; it is therefore suitable for applications that require low computational complexity, communication overhead, and storage. Theoretical analysis and simulation results demonstrate that the proposed channel selection methods improve the throughput performance compared to the existing off-policy strategies.

$\mathcal{K}$: Set of the available channels
$\mathcal{N}$: Set of fully distributed sensor nodes
$N$: Number of sensor nodes
$K$: Number of channels
$h$: Set of channel gains
$h_k$: Channel gain in the $k$th channel
$h_k^i$: Channel gain of sensor $i$ in the $k$th channel
$a_i^{*}$: The optimal strategy for sensor $i$
$u_i$: The utility function of sensor $i$
$s_t$: The channel state at time $t$
$r_t$: The feedback reward at time $t$
$a_t$: The action taken by the sensor node at time $t$
$t$: The discrete time steps
$n$: Number of steps
$Q^{\pi}(s, a)$: The action-value function with policy $\pi$
$Q^{*}(s, a)$: The optimal action-value function
$P_{t+1,\mathrm{RED}}$: State transition probability that the sensor selects the optimal state from $t$ to $t + 1$
$P_{t+1,\mathrm{BLACK}}$: State transition probability that the sensor selects another state from $t$ to $t + 1$
$\pi^{*}_{\mathrm{greedy}}$: The optimal policy with greedy algorithm.

Figure 5: Throughput performance of online noncooperative game algorithm.

Figure 6: Throughput performance of offline reinforcement learning algorithm.

Figure 7: Convergence performance of offline reinforcement learning algorithm under different numbers of channels.