Novel Learning Algorithms for Efficient Mobile Sink Data Collection Using Reinforcement Learning in Wireless Sensor Network

Generally, a wireless sensor network (WSN) is a group of sensor nodes used to continuously monitor and record various physical, environmental, and critical real-time application data. Data traffic received by the sink in a WSN depletes the energy of nearby sensor nodes faster than that of other sensor nodes. This problem is known as the hot spot problem in wireless sensor networks. In this research study, two novel algorithms based upon reinforcement learning are proposed to solve the hot spot problem. The first proposed algorithm, RLBCA, creates cluster heads to reduce energy consumption and saves about 40% of battery power. In the second proposed algorithm, ODMST, a mobile sink collects data from cluster heads as per the demand/request generated by the cluster heads. Here the mobile sink keeps a record of incoming requests from cluster heads in a routing table and visits them accordingly. These algorithms do not create extra overhead on the mobile sink and save energy as well. Finally, the proposed algorithms are compared with existing algorithms such as CLIQUE, TTDD, DBRkM, EPMS, RLLO, and RL-CRC to validate this research study.


Introduction
This research study started with the question of how to enhance the network lifetime of a WSN through better energy optimization of sensor nodes by using reinforcement learning. The answer to this question lies in improving energy-efficient WSN algorithms, a key research area addressed by a large body of literature over the last decades. Sensor nodes are therefore expected to use a sleep and wake-up mechanism to better utilize their energy and extend the network lifetime. The concept of clustering also works very well for this purpose.
Due to multihop communication, the sensor nodes near the base station generally become overloaded because they act as intermediate nodes that forward data from the rest of the wireless sensor network to the base station [1]. This situation is known as the hot spot problem [2,3], where SNs near the sink node send their own data as well as the data of other nodes. This significantly decreases the performance of the wireless sensor network. We were therefore motivated to investigate sink mobility, which has emerged in WSNs to handle the hot spot problem and to decrease the energy communication overheads [4]. Traditionally, a mobile sink [5,6] needs to visit every cluster head [2] to collect the data, leading to a longer mobile sink traversal path, which in turn creates data delivery latency [7,8] and higher energy consumption. For this reason, we propose the RLBCA and ODMST algorithms based upon reinforcement learning. We propose that the mobile sink visits only the interested cluster heads, from which it has received a request message packet for collection of data. However, the design of such an on-demand mobile sink traversal path [3] is a challenging task, as it highly depends upon network coverage, data delivery, energy efficiency, and network lifetime.
Reinforcement learning (RL) techniques [9] are used here. RL is a class of machine learning that permits an agent to learn its behaviour in a new environment without labelled examples. The prime goal of the agent is to generate actions which increase future rewards. These rewards are later used to formulate an optimal policy. The elements of RL can be formalized using the Markov decision process (MDP) framework [9][10][11]. An MDP [9] consists of states, actions, transitions between states, and a reward function definition. Thus, the use of RL techniques can significantly improve WSN performance.
Finally, our contributions in this research study are as follows:
(i) Proposed a reinforcement learning based clustering algorithm (RLBCA) to form cluster heads.
(ii) Proposed an on-demand mobile sink traversal (ODMST) algorithm to collect data.

Related Works
This section presents a review of recent research studies, including energy-efficient routing, network lifetime enhancement, coverage, clustering, and reinforcement learning based WSN solutions. In [17], the authors proposed a geometric model for the mobile sink which performed very well on various performance metrics. In [1,13], the TTDD protocol is designed, where the WSN is partitioned into virtual grids based upon the mobile sink node. The path for the mobile sink is based upon the grid nodes [18,19], which eliminates the hot spot problem. However, this process of developing the grid consumes more energy of the SNs. In [5], the authors focused on energy-efficient routing and clustering based on the PSO algorithm. The authors also presented a technique which extends the network lifetime by eliminating the traffic load of the gateways whose remaining energy is beyond a particular threshold value; however, the authors considered only the failure of gateways due to complete energy depletion. The EPMS algorithm [5] performs virtual clustering by using the PSO algorithm to improve the network performance.
Here the selection of the cluster head depends upon the reception of data to control the movement of the mobile sink. However, this algorithm did not solve the WSN transmission coverage problem. In [5], the authors focused on the delivery latency minimization problem in WSNs with the mobile sink deployed randomly on a plane; the proposed algorithm performs well in terms of shortening data delivery latency and reducing route length. However, transmission issues affected the performance of the WSN. In [20][21][22][23], the authors proposed a data dissemination framework called tree overlay grid to handle mobile target detection, where multiple mobile sinks appear in the WSN, so as to consume less energy and achieve a longer network lifetime; however, implementing this algorithm on a real-time WSN is complex. In [3,24], the authors described how information local to each node can be shared, without extra overhead, as feedback to neighbouring nodes, which enables efficient routing to multiple sinks. Such a situation arises in WSNs with multiple mobile users collecting data from a monitored area; the authors formulated the problem as a reinforcement learning task and applied Q-Routing techniques to derive a solution. Evaluation of the resulting FROMS protocol demonstrates its ability to significantly decrease the network overhead over existing approaches. Here the authors proposed two algorithms, RkM and DBRkM, for path formation of the mobile sink. The RkM algorithm determines a path by joining the SNs through one-hop communication, whereas DBRkM generates a delay-bound path. However, every SN has an equal load of data aggregation, and the sojourn time of the mobile sink is negligible. In [25], the authors proposed the EAPC method, which constructs a data collection path and selects the eligible sensors to work as collection point heads.
The EAPC method constructs a minimum spanning tree rooted at the base station. This method improves the network lifetime and energy consumption, but its throughput degrades somewhat as the number of SNs increases. In [26], the author presented cluster-based routing known as I-UMDPC, which routes delay-sensitive data to mobile nodes within a time period. However, the complexity of this algorithm is somewhat higher than that of existing approaches. In [4], the problem of selecting an optimal cluster is formulated as an MDP, which showed good performance and minimized energy consumption by determining an optimal number of clusters for intra- and intercluster communications. In [12,27], the CLIQUE algorithm was used for data clustering; it saves cluster head selection energy by using reinforcement learning to enable nodes to independently decide whether or not to act as a cluster head on a per-packet basis. However, the setting of a nonuniform data dissemination paradigm requires more work. In [24], a reinforcement learning based clustering algorithm is proposed to address energy and primary user detection challenges in WSNs; here the Q-value slows the convergence of the proposed algorithm due to the long learning period. In [28], the authors presented a survey of multiobjective optimization in WSNs, which covers various performance metrics along with very useful algorithms. In [29], the authors presented a proactive way to enhance network lifetime, coverage, and discovery of redundant nodes, with well-defined simulation results. In [30], an energy-efficient routing algorithm is presented for multiple mobile sinks, which advocates the presence of fewer than three mobile sinks for collection of data. In [16,30], the RLLO and RL-CRC algorithms uniformly distributed the consumption of energy and further enhanced the PDR with better topology control.
Wireless Communications and Mobile Computing

System Model and Problem Formulation
This section presents the network environment model, basic assumptions, energy model, problem formulation, and our contribution.

Network Environment Model.
We deployed multiple sensor nodes in a random topology [31] over a rectangular area with a radius of R. The basic architecture of mobile sink traversal is shown in Figures 1 and 2. All the sensor nodes are static and homogeneous in nature [32]. The entire sensor network environment is divided into equal sectors. Source nodes can adjust their transmission power according to the distance to the target nodes.

Basic Assumptions.
We have made the following assumptions in this research study:
(i) All the deployed WSN nodes are static and homogeneous in nature.
(ii) All the WSN nodes are equipped with the same amount of initial energy [33].
(iii) No physical hurdle/obstacle is present in the network environment.
(iv) The mobile sink [34,35] is able to collect the data from the cluster heads in proper time.

Energy Model.
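In this research study, we considered the first-order radio energy model [23,30] for the calculation of energy consumption. Generally, energy is consumed in two modes: transmission and reception. Equation (1) shows the energy consumed in transmitting an $l$-bit message over a distance $d$:

$$E_{Tx}(l, d) = \begin{cases} l E_{elec} + l \varepsilon_{fs} d^{2}, & d < d_0 \\ l E_{elec} + l \varepsilon_{mp} d^{4}, & d \ge d_0 \end{cases} \tag{1}$$

where $E_{elec}$ represents the consumption of energy per bit, $\varepsilon_{fs}$ and $\varepsilon_{mp}$ represent the coefficients of the free space and multipath fading models, and $d_0$ is the threshold distance between the two models. Equation (2) shows the calculation of reception energy consumption:

$$E_{Rx}(l) = l E_{elec} \tag{2}$$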
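The energy model above can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual first-order radio model form (a quadratic free-space term below a crossover distance, a quartic multipath term above it). The constants `E_ELEC`, `EPS_FS`, and `EPS_MP` are commonly used illustrative values and are assumptions, not the values of Table 5; the function names are hypothetical.

```python
import math

# Illustrative constants (assumptions, not the paper's simulation parameters).
E_ELEC = 50e-9       # J/bit consumed by the radio electronics
EPS_FS = 10e-12      # J/bit/m^2, free-space amplifier coefficient
EPS_MP = 0.0013e-12  # J/bit/m^4, multipath amplifier coefficient
D0 = math.sqrt(EPS_FS / EPS_MP)  # crossover distance between the two models

def tx_energy(l_bits, d):
    """Energy (J) to transmit l_bits over distance d (m)."""
    if d < D0:
        return l_bits * E_ELEC + l_bits * EPS_FS * d ** 2
    return l_bits * E_ELEC + l_bits * EPS_MP * d ** 4

def rx_energy(l_bits):
    """Energy (J) to receive l_bits."""
    return l_bits * E_ELEC
```

With these assumed constants the crossover distance is a little under 88 m, so a cluster head that transmits over 100 m pays the quartic multipath cost, which is why shortening the transmission distance matters for node lifetime.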

Problem Formulation and Contribution.
The key performance factors of a WSN are network lifetime and energy consumption. The lifetime of the WSN is measured as the time until the first node dies. We provide the following contributions in this research study:
(i) Proposed a reinforcement learning based clustering algorithm (RLBCA).
(ii) Proposed a novel algorithm for on-demand mobile sink traversal (ODMST).
The proposed mobile sink traversal path is shown in Figure 2 based upon Table 1 and Equation (4).
Figure 2 shows the formation of the proposed mobile sink traversal path based upon Tables 1 and 2, (4), and Algorithm 4. It is clear from Figure 2 that the MS initially advertises its current position to all CHs. The CHs send their request messages to the MS, if any. The MS calculates the distance between each CH and the MS by using (4) and then creates and executes the mobile sink traversal path. During the traversal of the MS, if any CH sends a request again, the MS updates the traversal path as per the shortest distance and executes it to collect the data. Finally, as per Figure 2, the first round mobile sink traversal path works as follows:

Reinforcement Learning
The reinforcement learning technique determines what to perform and how to react to present actions in order to maximize the reward. The computation of the cumulative reward $\sum r_{t+1}$ is based upon the selection of action and state [9,11,24,37], and the reward value is used to develop the policy [9,36]. RL has several basic components: agent, action, state, reward, policy, value function, and environment model. RL is mainly based upon the MDP [9], and it uses temporal difference learning and the $\epsilon$-greedy approach [24,37] for value estimation and action selection. The basic learning process of RL is shown in Figure 3.
The basic reinforcement learning algorithm works as shown in Algorithm 1.
RL technique also performs policy iteration which has been described in Algorithm 2.
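As a minimal sketch of the greedy selection approach mentioned in this section, the following Python function picks an action from a table of estimated values. The function name, the dictionary layout, and the `epsilon` parameter are illustrative assumptions and are not taken from Algorithm 1 or Algorithm 2.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values (dict: action -> estimated value).

    With probability epsilon a random action is explored; otherwise the
    action with the highest estimated value is exploited.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)
```

A small `epsilon` keeps the agent mostly on its best known action while still occasionally sampling alternatives, which is what allows the value estimates of rarely chosen actions to improve.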
The basic components for clustering by reinforcement learning are given in Table 3.

The Proposed Algorithms
This section highlights the clustering of SNs and formation of cluster heads based on RLBCA and ODMST algorithms.

Clustering of SNs by Using Reinforcement Learning.
In this section, we propose RLBCA, in which each WSN node works as a learning agent. These learning agents learn the energy level of their nearest neighbours to form clusters based upon a certain policy. A Markov decision process (MDP) [9,37] is formulated to find the clusters. The MDP contains state, action, reward, and policy. The learning agent uses the temporal difference method to learn from the network environment and derive an action policy. The RL model used for clustering is given in Table 4.
From Table 4, it is clear that the RL model for clustering is encoded in every SN to calculate the cost of a route to a cluster head node based upon the Q-value update $Q_{t+1}(S_t, A_t)$. The action $A_t$ denotes the selection of next-hop node $j$ to forward data packets to a cluster head [24]. The reward $r_{t+1}$ denotes the link cost towards the next-hop node [9,11]. The basic elements of a Markov decision process (MDP) are [S, T, A, R], where S represents the set of states, T the transition function, A the set of actions, and R the reward function. The learning agent selects an action A in each state S, as shown in Figure 4. The selected action is then used to compute the energy consumption for the cluster. The reward R is derived from the calculated energy consumption to take a proper decision. The resulting decision advances the current state $S_i$ to $S_{i+1}$ and the action $A_i$ to $A_{i+1}$. The learning agent develops the optimal policy Q, which increases the reward from learning experience to create optimal cluster heads [9,11,24,37], as shown in Figure 4.
In an MDP, the state transition function P and the reward function R depend on the current state and action. The main objective of the learning agent is to develop a policy $\pi : S \rightarrow A$. The learning agent takes action $A_i$ as per the current state $S_i$, i.e., $\pi(S_i) = A_i$, so that the cumulative value function $V^{\pi}(S_i)$ derived from the present/initial state $S_i$ is as follows:

$$V^{\pi}(S_i) = r_i + \gamma r_{i+1} + \gamma^{2} r_{i+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{i+k} \tag{5}$$

In (5), $r$ represents the return value and $\gamma$ represents the discount factor. The main objective of the learning agent is to develop an intelligent strategy that makes $V^{\pi}(S_i)$ highest [9,11,24,37]. This strategy is known as the policy and is represented by

$$\pi^{*} = \arg\max_{\pi} V^{\pi}(S_i), \quad \forall S_i \in S \tag{6}$$

Finally, the following equation is used to update the Q-value:

$$Q_{t+1}(S_t, A_t) = (1 - \alpha)\, Q_t(S_t, A_t) + \alpha \left[ r_t + \gamma \max_{A} Q_t(S_{t+1}, A) \right] \tag{7}$$

The Q-table is updated constantly using (7). $\alpha$ and $\gamma$ are the learning rate and the discount factor, $r_t$ is the return value, $\max_{A} Q_t(S_{t+1}, A)$ is the highest Q-value, and $A$ is the action taken by the learning agent [9,11,24,37]. Based upon (5), (6), and (7), the proposed RLBCA for clustering works as shown in Algorithm 3. Algorithm 3 is simulated in MATLAB as per the simulation parameters specified in Table 5. The results are compared with the CLIQUE algorithm [12] on the basis of the average residual energy of cluster heads against the number of rounds, as shown in Figure 5. This comparison shows that, as the number of rounds increases, the average residual energy of the CLIQUE algorithm's cluster heads goes down, while our proposed RLBCA algorithm maintains a higher average residual energy.
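The value function and the Q-value update described in this section can be illustrated with a short Python sketch. It assumes the standard discounted-return and tabular Q-learning forms, with learning rate `alpha` and discount factor `gamma`. The function names and the dictionary-based Q-table are hypothetical choices made for the example, not the authors' MATLAB implementation.

```python
from collections import defaultdict

def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

def q_update(q, state, action, reward, next_state, next_actions, alpha, gamma):
    """One tabular Q-learning step; q maps (state, action) -> value."""
    # Highest Q-value reachable from the next state (0.0 if it has no actions).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] = ((1 - alpha) * q[(state, action)]
                          + alpha * (reward + gamma * best_next))
    return q[(state, action)]

# Usage: states could be sensor nodes and actions candidate next hops.
q_table = defaultdict(float)
q_table[("s1", "x")] = 2.0
new_value = q_update(q_table, "s0", "a", 1.0, "s1", ["x", "y"], alpha=0.5, gamma=0.9)
```

Here `new_value` blends the old estimate for `("s0", "a")` with the observed reward plus the discounted best estimate at `"s1"`, so a larger `alpha` moves the table faster towards recent experience.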

On-Demand Mobile Sink Traversal (ODMST) Algorithm.
Once the cluster heads have been formed by RLBCA, the ODMST algorithm collects the data from the cluster heads as described in Algorithm 4.
ODMST (Algorithm 4) saves the energy of the mobile sink by visiting only the interested cluster heads. This algorithm also prolongs the lifetime of the network. Each round of the data transmission cycle is set as T, which is calculated from V, the moving speed of the mobile sink. The average energy $E_c$ is formulated as per the following equation [9,11,24,37]:

$$E_c = \frac{1}{n} \sum_{i=1}^{n} E_i$$

where $E_i$ is the residual energy of node $i$ and $n$ is the number of nodes in the cluster. The return value of the agent node [9,11,24,37] keeps track of the remaining residual energy as well as the energy consumption. The highest Q-value always leads to the optimal path. The mobile sink keeps the updated Q-values and selects the MS traversal path with the maximum Q-value [37]. If any issue takes place with the SNs, then the path with the second maximum Q-value is selected. Generally, this method saves energy among the SNs.
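The two computations described in this paragraph, the average residual energy of a cluster and the choice of the traversal path with the highest Q-value (falling back to the next best one), can be sketched as follows. The function names, the path identifiers, and the `unavailable` argument are hypothetical. How a path is judged unusable is not specified in the text, so it is left to the caller.

```python
def average_energy(residual_energies):
    """Mean residual energy E_c of the n nodes in a cluster."""
    return sum(residual_energies) / len(residual_energies)

def select_path(q_by_path, unavailable=()):
    """Return the path with the highest Q-value that is still usable.

    q_by_path: dict mapping a path identifier to its Q-value.
    unavailable: paths to skip, e.g. because a sensor node on them failed.
    """
    ranked = sorted(q_by_path, key=q_by_path.get, reverse=True)
    for path in ranked:
        if path not in unavailable:
            return path
    return None  # no usable path remains
```

For example, with Q-values `{"p1": 0.9, "p2": 0.7}` the sink follows `p1`, and if `p1` is reported unavailable the same call returns `p2`, which corresponds to the second maximum Q-value.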

Step 1. Initially, all sensor nodes send a hello message packet to report their residual energy and current position.
Step 2. The learning agent records the total number of neighbour nodes and their residual energy. Periodically, the residual energy of each sensor node is set and the return value of the node is set to zero.
Step 3. Based upon Step 2, the cluster head formation probability is computed. The base station selects the optimal number of cluster heads among the desired cluster heads and creates the list.
Step 4. The base station announces the list of eligible cluster heads.
Step 5. The newly formed cluster heads send advertisement packets to their nearest neighbours for communication purposes.
Step 6. The state-action Q-values [10] are updated by the reward function (equation (*)) and the Q-matrix (equation (**)) to achieve the optimal policy (equation (***)).
Step 7. If the current node's residual energy is greater than that of its neighbour nodes, the sensor node with the higher residual energy is elected as a cluster head for the next round.
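The election rule in the last step above can be illustrated with a small Python sketch. It covers only the residual-energy comparison between a node and its neighbours. The probability computation at the base station and the Q-value updates are not modelled, and the function name and data layout are assumptions made for the example.

```python
def elect_cluster_heads(residual, neighbours):
    """Elect nodes whose residual energy exceeds that of every neighbour.

    residual: dict mapping node id -> residual energy (J).
    neighbours: dict mapping node id -> iterable of neighbour ids.
    A node without neighbours is elected by default in this sketch.
    """
    heads = []
    for node, energy in residual.items():
        if all(energy > residual[n] for n in neighbours.get(node, ())):
            heads.append(node)
    return heads

# Usage with four hypothetical nodes.
residual = {"a": 0.50, "b": 0.30, "c": 0.40, "d": 0.45}
neighbours = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
heads = elect_cluster_heads(residual, neighbours)
```

In this example nodes `a` and `d` are elected, because each holds more residual energy than all of its own neighbours, while `b` and `c` each have a stronger neighbour.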
Step 2. Initially, the mobile sink is placed randomly. The mobile sink advertises its position to all cluster heads.
Step 3. Interested cluster heads send their request-to-visit message packets to the mobile sink.
Step 4. The mobile sink stores the received messages in the routing table to calculate the distance (as per equation (4)) and visits the cluster head to collect the data.
Step 5. If multiple request messages are received by the mobile sink, then:
Step 5.1. The mobile sink calculates the distances of the SNs as per equation (4) and stores them in routing Table 1.
Step 5.2. The mobile sink creates the traversal path as per the shortest distance and executes it.
Step 5.3. During the execution of the mobile sink traversal, if any cluster head sends a request message again, the mobile sink recalculates the shortest distance, updates the traversal path, and executes it.
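The on-demand traversal steps above can be sketched in Python under two stated assumptions: that equation (4) is the Euclidean distance between the mobile sink and a cluster head, and that "shortest distance" means repeatedly visiting the nearest pending cluster head. The function names and the `late_requests` structure, which models requests arriving during the traversal, are hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) points (assumed form of (4))."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def odmst_traversal(sink_pos, requests, late_requests=None):
    """Return the order in which requesting cluster heads are visited.

    requests: dict mapping cluster head id -> (x, y) position.
    late_requests: dict mapping a visit count -> dict of new requests that
        arrive once that many cluster heads have been visited.
    """
    pending = dict(requests)          # acts as the sink's routing table
    late = dict(late_requests or {})
    pos = sink_pos
    order = []
    while pending:
        # Visit the nearest cluster head that has an outstanding request.
        nxt = min(pending, key=lambda ch: distance(pos, pending[ch]))
        pos = pending.pop(nxt)
        order.append(nxt)
        # Merge requests received during the traversal and re-plan.
        pending.update(late.pop(len(order), {}))
    return order
```

Cluster heads that never send a request never enter `pending`, so the sink does not spend energy travelling to them, which is the saving the on-demand scheme aims for. A nearest-first order is a simple heuristic and does not in general give the shortest possible total tour.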
Here the network environment contains 500 sensor nodes in an area of 200 × 200 m². The initial energy of all sensor nodes is 0.5 J. Extensive simulation was carried out in MATLAB 2012 (A) [38] based upon the simulation parameters specified in Table 5.

Result and Discussion
In this section of our research study, the performance of our proposed RLBCA and ODMST algorithms is evaluated using MATLAB, based upon the simulation parameters specified in Table 5. Energy consumption and network lifetime are the main performance criteria for our research study. Therefore, it is essential to ensure low energy consumption for every cluster head and for the mobile sink.
The simulation results are compared with other algorithms such as TTDD [13], DBRkM [3], EPMS [5], RLLO [15], and RL-CRC [16] on the basis of the following performance metrics:
(i) Routing energy loss: Every cluster head in the WSN consumes a certain amount of energy. The cluster heads which are not involved in the routing path of the mobile sink remain idle to save their energy.
(ii) Network lifetime: The duration from the start of WSN operation until the death of the first sensor node.
(iii) Learning time: The sensor nodes learn through the learning agent. As the learning rate (alpha value) increases, the performance of WSN operation also increases, along with a decrease in routing energy loss.
(iv) Convergence of algorithm: The performance of the algorithm, expressed in two ways: rate and order of convergence.
(v) Sum of square error (SSE): A standard way of analysis within a cluster. The SSE shows the performance of cluster heads as per the energy optimization.
(vi) PDR: The ratio of received packets to generated packets. Data loss is reflected in this ratio, which is a very important factor in WSNs.
(vii) Average end-to-end delay: The average time taken by data packets to reach the mobile sink from a cluster head.
(viii) Average node degree: The number of edges connected to each SN. In a WSN, cluster heads form a dense network, which is represented by the average node degree.
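Several of the metrics listed above reduce to short computations, sketched below in Python. The sketch assumes that PDR is taken in its usual sense as the fraction of generated packets that are received, that SSE is the sum of squared deviations from the cluster mean, and that the average node degree is the mean number of neighbours per node. The function names and input layouts are hypothetical.

```python
def packet_delivery_ratio(generated, received):
    """Fraction of generated packets that reached the sink."""
    return received / generated if generated else 0.0

def network_lifetime(death_rounds):
    """Round in which the first sensor node died."""
    return min(death_rounds)

def sum_of_squared_error(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def average_node_degree(adjacency):
    """Mean number of neighbours per node; adjacency: node -> neighbour list."""
    return sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)
```

For instance, `packet_delivery_ratio(200, 180)` gives 0.9, and `network_lifetime([2100, 2000, 2500])` gives 2000 because the lifetime is defined by the first node death.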
After extensive simulation in MATLAB, we observed that as the alpha value (learning parameter) increased (Figure 6(a)), the routing loss decreased as the distance from the initial node to the sink node increased. Figure 6(b) shows that as the alpha value is incremented, the residual energy of the sensor nodes also increases. Figure 6(c) shows the comparison between the alpha value and the Q-value, which indicates that as the alpha value increases, the Q-value also increases. Figure 6(d) shows that as the cluster size increases, the remaining energy of the sensor nodes decreases. Finally, we can see that the RL-based schemes learned the energy dissipation for every cluster by exploring the clusters to find the best cluster.
The RLBCA finds the optimal solution just after exploration of the state-action pairs. This mainly depends upon the learning rate, discount factor, and action selection policy. We simulated the proposed RLBCA to test the convergence of our algorithm over 4000 episodes and evaluated its performance as presented in Figure 7. The simulation result showed that RLBCA finds the optimized solution only after a certain number of episodes. We embedded Q-learning for clustering due to its faster convergence and shorter learning period. Figures 7(a) and 7(b) show the cumulative average rewards for five cluster heads: CH01, CH02, CH03, CH04, and CH05. This indicates that the learning agent adapts to the environment through neighbour cluster heads. Figure 7(c) shows the performance during the learning period and the selection of the optimal cluster head based upon energy dissipation and accuracy of local decision. Figure 7(d) shows the sum of square error (SSE) for the entire network, which is a key component in determining the performance of cluster heads based upon energy optimization. As per Figure 8(a), as the number of rounds increases, the energy consumption of the ODMST algorithm is lower than that of other algorithms such as TTDD [13], DBRkM [3], and EPMS [5]. This is due to the on-demand mobile sink traversal, whereas in the TTDD algorithm [13] every data source establishes a virtual grid network, which consumes more energy. Figure 8(b) shows the network lifetime: in the TTDD algorithm [13] the first node died after 200 rounds, while it is about 400 rounds for DBRkM [3] and 1500 rounds for the EPMS algorithm [5]; in our ODMST algorithm, the first node died after 2000 rounds, which is better than the other algorithms. The ODMST algorithm worked very well up to 2500 rounds. The packet delivery ratio is shown in Figure 8(c), which indicates that the ODMST algorithm provided a much better PDR up to 2500 rounds than the TTDD [13], DBRkM [3], and EPMS [5] algorithms, due to better communication links and a lighter burden on the mobile sink. The mobile sink speed is shown in Figure 8(d), which reflects the average end-to-end delay of packets; here the ODMST algorithm performs better than other state-of-the-art algorithms because our mobile sink did not suffer from flooding of data packets.
Figure 9 shows that the average node degree reaches its maximum as the number of sensor nodes is increased from 100 to 500. The simulation results showed that ODMST and RLBCA are able to maintain node degree even under harsh node deployment in the WSN. Finally, Table 6 shows the overall performance improvement of our proposed algorithms.

Conclusions and Future Scope
This research study has proposed two novel learning algorithms to overcome the hot spot problem in WSNs by using RL techniques. The main idea is to collect the data from cluster heads as per their demand/request, in order to reduce the energy consumption of the mobile sink and to improve the network lifetime. Simulation results showed that RLBCA formed cluster heads properly and that the ODMST algorithm collected the data from the cluster heads through the mobile sink more efficiently than the state-of-the-art algorithms. This research study motivates us to further test the scalability and convergence of the proposed algorithms in large-scale WSNs.

Figure 3: The agent learning environment.

Figure 5: Average residual energy of cluster heads versus number of rounds.

Table 1: Proposed phases of mobile sink traversal path as per Figure 2.

Table 3: Representation of RL based clustering elements.

Table 4: RL model for clustering.