EER-RL: Energy-Efficient Routing Based on Reinforcement Learning

Wireless sensor devices are the backbone of the Internet of Things (IoT), enabling real-world objects and human beings to connect to the Internet and interact with each other to improve citizens' living conditions. However, IoT devices are memory- and power-constrained and cannot run highly computational applications, whereas routing, which is what makes an object part of an IoT network, is a power-hungry task. Therefore, energy efficiency is a crucial factor to consider when designing a routing protocol for IoT wireless networks. In this paper, we propose EER-RL, an energy-efficient routing protocol based on reinforcement learning. Reinforcement learning (RL) allows devices to adapt to network changes, such as mobility and energy level, and improve routing decisions. The performance of the proposed protocol is compared with that of other existing energy-efficient routing protocols, and the results show that the proposed protocol performs better in terms of energy efficiency, network lifetime, and scalability.


Introduction
The emergence of wireless and mobile technologies and information systems has opened up a new era for the Internet of Things (IoT). The latter has become the backbone of ubiquitous computing, enabling the environment to become smart through recognition and identification of objects, data generation, transmission, and retrieval [1,2]. IoT allows real-world things and people to be connected and become part of the virtual world of the Internet through wireless communication. Initially, IoT targeted networks of RFID tags; it was later broadly extended to various devices and applications, with the goal of making objects capable of learning and understanding their environment and interacting with it [3].
Through wireless communication, these objects can interact with each other and enable the system to be remotely controlled via an Internet connection [2-5]. Due to its implications for various fields, IoT has recently received much attention and has been applied to a wide range of applications such as smart cities, smart healthcare systems, smart homes, object tracking, disaster management, and environmental monitoring [6,7].
IoT consists of the interconnection of heterogeneous wireless devices, including smartphones, wireless sensors, actuators, radio frequency identification (RFID) tags, and real-world things with sensing capabilities [8,9]. Generally, a sensor device comprises four units: a power unit, a sensing unit, a processing unit, and a communication unit [10-13]. The sensing unit is responsible for sensing data from the surrounding environment, whereas the processing unit carries out computation tasks. The communication unit is in charge of sending packets across the network. Finally, the power unit consists of a small battery that supplies power to the other three modules. The power unit itself does not consume energy but supplies it to the other modules; the sensing and processing modules consume negligible energy, whereas the communication module is the most energy-consuming [14-16].
However, to accommodate a large number of devices, an IoT network must satisfy several requirements, including energy efficiency, scalability, interoperability, security, and flexibility [2]. Energy efficiency is crucially important to keep the network fully operational for as long as possible [16,17], especially for devices deployed in harsh environments where recharging or replacing the battery is impossible. Hence, energy-efficient routing protocols manage the consumption of devices' available energy and extend the lifetime of the network [6,13].
Reinforcement learning is a subfield of machine learning that addresses the problem of an agent taking actions in an unknown environment and improving over time through a sequence of trial-and-error interactions with that environment [18]. In other words, the agent interacts with the environment by performing actions and receives rewards, which are positive when the action performed was right and negative otherwise. This approach has brought dynamism to data routing and adaptation capability to network communication compared to static routing approaches [19-21]. In IoT networks, RL can be used to deal with problems such as network topology changes due to device mobility, energy levels, and other transmission parameters such as distance, signal strength, and bandwidth, which can change over time and influence network performance.
In this paper, we propose EER-RL, an energy-efficient routing protocol for IoT based on reinforcement learning. The proposed EER-RL balances energy dissipation between devices in an IoT network, extends the network lifetime, and improves network scalability. EER-RL also provides optimal paths using a feedback mechanism that shares local information as a reward; the reward is computed from the residual energy and the hop count to the sink, and the hop-count parameter can reduce the end-to-end delay. To evaluate the performance of EER-RL, we carried out simulations; the results show that EER-RL achieves efficient energy consumption, extends the network lifetime, and scales well to large IoT networks. EER-RL is also compared to LEACH [22] and PEGASIS [23]; the comparison results show that EER-RL outperformed both by providing a better energy balance and extending the network lifetime. The remainder of this paper is organized as follows: Section 2 gives an overview of RL and its application to routing. Section 3 discusses existing routing protocols using RL. Section 4 describes our proposed solution. The performance evaluation of the proposed solution is presented in Section 5, followed by concluding remarks and future work in Section 6.

Overview on Reinforcement Learning
RL problems are formalized as Markov decision processes (MDP) with a tuple (S, A, P, R), where S represents the set of states an agent can be in at a given time t and A is the set of possible actions an agent can take. The probability that an agent in state s(t) at time t, performing action a(t), transitions to state s(t + 1) is denoted P, and R is the reward obtained by the agent for the action performed [18]. Applying RL to routing protocols requires defining the main components of an RL model: agent and environment, state and action, and reward. First, the agent is the decision-maker of an RL model, while the environment is what the agent observes and reacts to. In IoT networks, every device is considered an agent; for the whole network, multiagent RL is required. Secondly, a state is any useful information about the environment at a given time, whereas an action is the agent's reaction in a given state. The state space for an agent is the available routing information from all available neighbouring devices. A state can be a tuple of decision-making factors such as residual energy, hop count, and signal strength, depending on the factors taken into account when designing the protocol. An action, on the other hand, refers to selecting the next-hop to route the packet towards the base station. Thus, the action space represents the set of all available routes through neighbours at a given time. Thirdly, the cost of the action performed by the agent in a given state is called a reward. The following definitions are used in the implementation of the proposed protocol: (1) every device in the network is considered an agent; (2) for each device, the set of available routes through its neighbouring devices to the base station is the state space; (3) the set of all available neighbours through which packets can be sent to the base station is the action space; (4) the agent's behaviour is denoted as a policy.
A policy maps states to actions; it can be stochastic or deterministic and improves over time. The goal of every RL model is to find an optimal policy that maximizes the long-term reward of each state-action pair [18,19]; this goal can be reached through the policy iteration process, which alternates between evaluating and improving the given policy. Policy evaluation assesses the current policy from the observed returns, while policy improvement moves the policy towards a better one [18,20]. Figure 1 depicts a simple RL model.
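To make the mapping above concrete, the sketch below (an illustration, not the paper's implementation) shows a device's routing table as a table of Q-values keyed by neighbour, with a greedy policy that picks the highest-valued neighbour as next-hop:

```python
# Illustrative only: a device's routing table as Q-values per neighbour
# (the action space), with a greedy policy over those values.
routing_table = {"neighbour_A": 0.7, "neighbour_B": 0.4, "neighbour_C": 0.9}

def greedy_next_hop(q_table):
    """Greedy policy: choose the neighbour (action) with the highest Q-value."""
    return max(q_table, key=q_table.get)

best_next_hop = greedy_next_hop(routing_table)
assert best_next_hop == "neighbour_C"
```

Here the state is implicit (the device's current view of its neighbourhood) and each action is a candidate next-hop, matching definitions (2) and (3) above.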

Related Work
RL was first introduced to network routing by Littman and Boyan in [24], who proposed the Q-routing algorithm, a solution that aimed to increase the packet delivery ratio while minimizing the average delivery time. Experimentally, it outperformed shortest-path routing in packet delivery time, especially under heavy network loads. However, it could suffer from a poor learning policy, as the agents did not update their information about the environment; consequently, the selected routes could be unreliable. Q-routing also did not consider devices' residual energy and thus did not guarantee energy efficiency.
Recently, significant research efforts have been made to apply RL to wireless sensor networks (WSN) and IoT. Several approaches have been proposed, including the cooperative approach, which is one of the most widely used, e.g., in SSAR [8], FROMS [25], OPT-Q-Routing [26], EQR-RL [27], and the work in [28]. This approach is deemed suitable for multiagent RL models where agents are required to cooperate and work together for the same purpose. For instance, in self-organized wireless networks [29,30], it allows devices to communicate with each other by sharing local information such as position coordinates, hop count to the base station, initial energy, residual energy, transmission range, and signal strength.
These considerations, taken alone or combined according to some policy, allow devices to make better decisions on the next-hop to which to route the data. However, due to devices' resource-constrained nature, residual energy is a crucial factor in many energy-efficient routing protocols; it is usually combined with other factors to optimize the routing protocol's performance.
In [25], feedback routing for optimizing multiple sinks (FROMS) used a feedback mechanism to share local information as a reward: the receiver of a packet shares its local information as feedback to the sender without extra network overhead. FROMS also provided a recovery mechanism to deal with packet loss after device failure. However, this protocol could suffer from packet loss under sink mobility, especially at high speeds, which triggers routing errors and extra energy consumption.
In [26], OPT-Q-Routing, an energy-balancing routing protocol based on RL, optimized multihop communication and extended the network lifetime by balancing the devices' energy dissipation and reducing the network overhead. However, it considered only the residual energy of devices; thus, the proposed protocol did not ensure a good balance for multihop communication in the long run. In [27], EQR-RL had devices periodically broadcast heartbeat packets that include the delivery ratio estimate and the sender's residual energy.
This information enables each device to compute the next-hop using a probability function. The simulation results showed a good delivery ratio and low end-to-end delay while supporting sink mobility. However, network isolation is likely to happen because some devices are avoided.
In [8], the authors proposed a smart and self-organizing routing algorithm (SSAR), which selects the best route based on several communication parameters, namely, the distance between nodes, link stability, and the residual energy of IoT devices. These parameters have an impact on stability and energy-efficient routing. SSAR achieved good QoS in terms of packet delivery and end-to-end delay; it also extended the network lifetime compared to other existing algorithms. In [28], the authors added bandwidth to the three parameters considered in [8] and used fuzzy logic and RL to train network devices for optimal route selection.
Their proposed algorithm performed better in terms of energy efficiency and network lifetime.

EER-RL Protocol
This section describes the proposed EER-RL protocol, a cluster-based energy-efficient routing protocol for IoT wireless networks using RL. EER-RL allows devices to learn how to make better routing decisions by sharing local information with their neighbourhood to optimize next-hop selection and minimize energy consumption.
The sender includes its local information in the packet header; every device in the neighbourhood that overhears the packet extracts the information encapsulated in the header and updates its routing table. The local information shared includes the device id, residual energy, position coordinates, and hop count. Like other cluster-based routing protocols, EER-RL consists of three steps: network set-up and cluster head election, cluster formation, and data transmission.

Network Set-Up and Cluster Head Election.
This phase consists of two steps. First, the network set-up allows devices to compute the initial Q-value based on their local information. Initially, the base station broadcasts a heartbeat message in which it shares its position coordinates. After receiving this packet, every device saves the base station's position and computes its initial Q-value from its initial energy level and hop count using equations (1) and (2). Furthermore, we assume that all devices have different energy levels, and we define a distance threshold between cluster heads (CHs) and the base station to mitigate network overhead and help sensors far from the base station find a CH easily. Besides, to keep links converging towards the base station rather than diverging, a CH cannot be at the edge of the network, as this would waste energy by extending the communication distance. Algorithm 1 describes the CH election process.
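Equations (1) and (2) are not legible in this copy; the sketch below therefore only illustrates the qualitative relationship the text describes, namely that the initial Q-value grows with the initial energy level and shrinks with the hop count to the base station. The exact functional form is an assumption, not the paper's equation:

```python
def initial_q(initial_energy, hop_count):
    """Assumed form: initial Q-value increases with energy, decreases with hops."""
    return initial_energy / hop_count  # hypothetical stand-in for eqs. (1)-(2)

# A device closer to the base station (fewer hops) starts with a higher Q-value.
assert initial_q(2.0, 1) > initial_q(2.0, 4)
# A device with more initial energy starts with a higher Q-value (same hops).
assert initial_q(2.0, 3) > initial_q(1.0, 3)
```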
After the cluster head election phase, every CH sends an invitation message to inform all devices within its transmission range that it has been elected as CH; the invitation includes the CH id, its initial Q-value, and its location coordinates. Every non-CH device that overhears packets from CHs decides which cluster to join depending on the distance and sends a membership request, including its local information, to the designated CH. If a device receives more than one invitation, i.e., the device lies at the intersection of different clusters, it joins the cluster whose CH is closest. Once all devices have sent membership requests, every CH confirms the memberships and forms its cluster. Algorithm 2 describes the cluster formation process.
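The join rule described above (a device that overhears several invitations joins the closest CH) can be sketched as follows; the function and data layout are illustrative, not from the paper:

```python
import math

def closest_ch(device_pos, ch_positions):
    """Pick the CH whose Euclidean distance to the device is smallest."""
    return min(ch_positions, key=lambda ch: math.dist(device_pos, ch_positions[ch]))

chs = {"CH1": (10.0, 10.0), "CH2": (80.0, 80.0)}
assert closest_ch((20.0, 15.0), chs) == "CH1"  # near CH1
assert closest_ch((70.0, 90.0), chs) == "CH2"  # near CH2
```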
Devices with the base station in their transmission range do not need to join any cluster [4]; they can communicate directly with the base station and save more energy. Another objective of the EER-RL protocol is to provide an energy-efficient multihop scheme for intercluster and intracluster communication. Intercluster communication manages the multihop communication between CHs; for instance, a CH far from the base station can send packets through an intermediate CH, or, if there is a powerful device next to the base station, that device can aggregate data from the distant CH. Thus, CHs far from the base station can communicate in multihop fashion through either other CHs or powerful devices next to the base station.
Similarly, for intracluster communication, devices within the cluster can send packets directly to the CH, or in multihop fashion if they are far from it; in other words, a device far away from the CH can communicate with it through another device in the same cluster. As mentioned earlier, IoT devices have different energy levels and transmission ranges; thus, we assume that devices with high energy levels also have a longer transmission range than those with low energy levels. Figure 2 depicts the cluster formation.

Data Transmission: Application of RL.
The application of RL comes into play at this phase, where every device behaves like a learning agent and learns to make better routing decisions. The learning process consists of updating the Q-value using the immediate reward obtained from the action performed and finding the best policy that optimizes the long-term reward. We use three functions: the energy consumption model, the reward function, and the Q-value update function. The energy consumption model updates the residual energy by subtracting the energy dissipated after every packet transmission; the updated residual energy and the hop count are then used to evaluate the reward. Finally, the reward is used as an argument in the Q-value function to update the Q-value.

Energy Consumption Model.
After a packet transmission, both the sender and the receiver consume energy; the sender consumes more because it must also amplify the signal over the distance. The energy consumption model computes the energy dissipated at packet transmission or reception and updates the residual energy. It is the first-order radio model of [22]:

E_Tx(k, d) = E_elec × k + E_amp × k × d^m,
E_Rx(k) = E_elec × k,

where E_Tx(k, d) and E_Rx(k) are the energy consumed by the transmitter and the receiver, respectively, to handle a k-bit packet over a distance d. E_elec, estimated at 50 nJ/bit, is the energy consumed to run the transmitter or receiver circuitry, whereas the amplification energy E_amp, estimated at 100 pJ/bit/m², is the energy consumed to amplify the signal over the distance, with m = 2 or 4 depending upon the distance.
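A direct sketch of this radio model, using the constants stated in the text (50 nJ/bit and 100 pJ/bit/m², path-loss exponent m = 2 by default; the packet size in the usage lines is only an example):

```python
E_ELEC = 50e-9    # 50 nJ/bit: energy to run transmitter/receiver circuitry
E_AMP = 100e-12   # 100 pJ/bit/m^2: amplification energy coefficient

def tx_energy(k_bits, d, m=2):
    """E_Tx(k, d) = E_elec*k + E_amp*k*d^m (m = 2 or 4 depending on distance)."""
    return E_ELEC * k_bits + E_AMP * k_bits * d ** m

def rx_energy(k_bits):
    """E_Rx(k) = E_elec*k: reception involves no amplification."""
    return E_ELEC * k_bits

# Transmitting a packet always costs more than receiving it, and the gap
# grows with distance.
assert tx_energy(4000, 30.0) > rx_energy(4000)
assert tx_energy(4000, 60.0) > tx_energy(4000, 30.0)
```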

Reward Function.
As mentioned earlier, the learning agent receives compensation, denoted as a reward, for every action performed; the reward is the cost of an action and tells the agent whether the action was good or bad. In EER-RL, an action refers to selecting a neighbour as the next-hop to route the packet. The reward function is computed using the residual energy E_r and the hop count N_h; unlike the previous phase, where only the link distance was used to compute the hop count, the distance between the sender and each neighbour is also considered at this phase. For instance, the distance from a device S_i to the base station via an intermediate device S_j, denoted D_link, is computed as in equations (4)-(6). This distance D_link is also equivalent to N_h × TX_range, where N_h denotes the hop count and TX_range is the transmission range [9]. From the above, the estimated hop count is computed as in equation (2), and the reward as in equation (7).

(6)   While length(CH_Table) ≤ CH_tot, do
(7)     Q_max ← max(S.Q)
(8)     For i ← 1 to n, do
(9)       If Min_Thres ≤ S(i).dist < Max_Thres, then
(10)        If length(CH_Table) == 0, then
(11)          CH_Table.append(S(i))
(12)          S.pop(S(i))
(13)        Else
(14)          For h ← 1 to length(CH_Table), do
(15)            dts ← Euclidean(S(i), CH(h))
(16)            If dts ≥ Min_Thres, then
(17)              C ← True
(18)            Else
(19)              C ← False
(20)              Break
(21)            End if
(22)          End for
(23)          If C == True, then
(24)            CH_Table.append(S(i))
(25)            …

ALGORITHM 1: Cluster head election (fragment).

Mobile Information Systems
where 0 ≤ p ≤ 1 is the probabilistic variable that defines the impact of E_r relative to N_h. A high value of p gives a neighbour with a high energy level a high probability of being selected as next-hop, whereas a high value of q = 1 − p gives the closest neighbour a high probability of being selected. Therefore, a trade-off between p and q is required to optimize the performance of the protocol. If E_r is null (0), S_i assigns a negative reward to S_j and selects another next-hop device. Similarly, S_j repeats the same process when forwarding the packet to its own next-hop and sends feedback to S_i, and the process continues until the packet reaches the destination. The reward is encapsulated in the packet header to avoid network overhead; all neighbourhood devices can overhear the forwarded packet and update their routing tables. For instance, if S_i overhears a packet that it sent earlier, it extracts the reward as feedback.
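Equation (7) itself is not recoverable from this copy; the sketch below assumes a weighted sum of normalized residual energy (weight p) and inverse hop count (weight q = 1 − p), which matches the trade-off described above, including the negative reward for a depleted neighbour. Treat the exact form as an assumption:

```python
def reward(residual_energy, initial_energy, hop_count, p=0.5):
    """Assumed reward: p weights residual energy, q = 1 - p weights closeness."""
    if residual_energy <= 0:
        return -1.0  # depleted neighbour gets a negative reward; pick another hop
    q = 1.0 - p
    return p * (residual_energy / initial_energy) + q * (1.0 / hop_count)

# More residual energy (same hop count) means a larger reward...
assert reward(1.8, 2.0, 3) > reward(0.5, 2.0, 3)
# ...and a depleted neighbour is penalized outright.
assert reward(0.0, 2.0, 3) == -1.0
```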

Function for Updating Q-Value.
To learn the real cost of an action, we compute the action-value function, which defines how good it is to perform an action from a given state while following a policy π [18,19]. The action-value function is the expectation of the discounted sum of returns given a state and an action:

Q^π(s, a) = E_π[ Σ_{k≥0} γ^k r_{t+k+1} | s_t = s, a_t = a ].

In [31], Watkins and Dayan proposed an approach to estimate the action-value function (or Q-function): a model-free learning technique called Q-learning (in contrast to model-based methods, a model-free system needs no model of the environment at all, i.e., agents cannot reason about how the environment will change in response to a single action [18]). The action-value function approximation proposed by Watkins depends on the policy followed by the agent, which makes Q-learning easy to implement and applicable in many situations [19].
In Q-learning, the initial Q-value can be set to a fixed random value or computed from some arguments; this is implementation-dependent. In this work, the initial Q-value is calculated from the initial energy and hop count as described in equation (1). Then, for each action performed, the agent gets a reward and uses it to update the Q-value using the following equation:

Q_{t+1}(s, a) = Q_t(s, a) + α ( r_{t+1}(s, a) + γ max_{a'} Q_t(s', a') − Q_t(s, a) ),

where α is the learning rate, which in most cases is set to 1 to speed up the learning process, and r_{t+1}(s, a) is the immediate reward computed using equation (7). The policy used is such that the sender selects the neighbour with the highest Q-value, denoted max_a Q(s', a). The discount factor γ varies between 0 and 1 and defines the importance of long-term rewards relative to the immediate one. If γ approaches 1, the agent emphasizes future rewards over the immediate reward; therefore, most RL-based protocols choose a value close to 1 to give high importance to future rewards. In EER-RL as well, we use γ = 0.95. Conversely, a discount factor of zero (0) means that the agent is concerned only with maximizing the immediate reward.
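The update rule above, with the paper's settings α = 1 and γ = 0.95, can be sketched as:

```python
def update_q(q_old, reward, max_next_q, alpha=1.0, gamma=0.95):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    return q_old + alpha * (reward + gamma * max_next_q - q_old)

# With alpha = 1 the old estimate is fully replaced by r + gamma * max Q.
assert abs(update_q(0.0, reward=0.5, max_next_q=0.8) - (0.5 + 0.95 * 0.8)) < 1e-12
# With gamma = 0 only the immediate reward would matter.
assert update_q(0.0, reward=0.5, max_next_q=0.8, gamma=0.0) == 0.5
```

Setting α = 1 means each update overwrites the previous estimate, which matches the paper's choice of speeding up learning at the cost of averaging over past experience.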
Algorithm 3 describes the data transmission process. First, devices with residual energy less than or equal to zero are considered dead and thus cannot transmit data. Devices with the highest Q-value in their neighbourhood and devices with the base station in their transmission range communicate directly with the base station without any intermediate device. On the other hand, devices far from the base station send packets through the CH; if the CH is also far, they can reach it through the closest member of the same cluster.

Performance Evaluation
To evaluate the performance of the proposed EER-RL, we conducted simulations using MATLAB and deployed 100 devices distributed randomly over a 100 × 100 m sensing field following the normal distribution [32]. The base station was placed in the middle of the sensing field, at coordinates (50, 50). Furthermore, we assumed the network to be heterogeneous, with devices' energy levels ranging from 1 to 2 joules. The simulation parameters are presented in Table 1.
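The set-up above can be reproduced roughly as follows; uniform placement is used here for simplicity even though the text cites a normal distribution [32], so treat the placement (and the seed) as assumptions:

```python
import random

random.seed(42)              # reproducibility only; the paper sets no seed
FIELD = 100.0                # 100 x 100 m sensing field
BASE_STATION = (50.0, 50.0)  # base station at the centre of the field

devices = [
    {"pos": (random.uniform(0, FIELD), random.uniform(0, FIELD)),
     "energy": random.uniform(1.0, 2.0)}  # heterogeneous: 1 to 2 joules
    for _ in range(100)
]

assert len(devices) == 100
assert all(1.0 <= d["energy"] <= 2.0 for d in devices)
```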

Parameter Tuning.
As mentioned above, the proposed protocol takes the hop count and residual energy into consideration, with probabilistic parameters p and q = 1 − p assigned to the residual energy and hop count, respectively. A high value of p gives devices with high energy levels a high probability of being selected. Similarly, a high value of q increases the selection probability of devices with fewer hops to the base station. Therefore, to optimize the performance of EER-RL, we tested different values of these parameters to select the best ones. The performance evaluation showed slightly different results for different values of p and q. However, with p and q both equal to 0.5, the network lifetime was extended the most while keeping a good energy balance. In some cases, p = 0.4 and q = 0.6 gave similar results. Figure 3 shows the performance of EER-RL for different values of p and q.

Comparison between EER-RL and FlatEER-RL.
Through this comparison, we highlight the difference between the proposed cluster-based protocol EER-RL and its flat-based version, denoted FlatEER-RL. The first difference is that the flat-based protocol consumes more energy at the beginning, until the learning process finishes; after that, the next-hop selection is optimized. In contrast, the cluster-based protocol assigns the CH as the default next-hop for devices far from the base station, which speeds up the learning process and optimizes energy consumption from the beginning. However, in the cluster-based protocol, CHs consume more energy for data aggregation, resulting in higher energy consumption per round, whereas with the flat-based version, after the learning process, the energy consumed per round is balanced between the devices. Thus, the time until the first device dies is slightly longer with FlatEER-RL. Therefore, to balance the energy consumption in EER-RL, we set an energy threshold for CHs; when a CH reaches the threshold, it is replaced before it is completely depleted. This approach extends the network lifetime of EER-RL compared to FlatEER-RL. Figure 4 shows the results of both EER-RL and FlatEER-RL.
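The CH-replacement rule can be sketched as below; the threshold value and the choice of successor (the cluster member with the most residual energy) are assumptions, since the text specifies neither:

```python
def acting_ch(ch, members, threshold=0.5):
    """Replace the CH when its energy drops below the (assumed) threshold."""
    if ch["energy"] < threshold:
        # Hypothetical successor rule: highest-energy cluster member takes over.
        return max(members, key=lambda m: m["energy"])
    return ch

ch = {"id": "CH1", "energy": 0.3}
members = [{"id": "d2", "energy": 1.2}, {"id": "d3", "energy": 0.9}]
assert acting_ch(ch, members)["id"] == "d2"  # CH depleted: replaced
assert acting_ch({"id": "CH1", "energy": 1.5}, members)["id"] == "CH1"
```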

Energy Efficiency and Network Lifetime Evaluation.
The major objective of this paper is to enhance energy efficiency and extend the network lifetime. Various definitions of network lifetime have been presented in the literature; in this paper, we define network lifetime as the time until data transmission is no longer possible. We evaluated the energy efficiency and network lifetime of the proposed protocol by comparing it with existing clustering protocols, namely LEACH [22] and PEGASIS [23], using the following metrics: (1) the number of alive devices per round, which also helps to evaluate the network lifetime; (2) the energy consumed per round, i.e., the sum of the energy consumed by all devices in each round; and (3) the time until the first device dies. These metrics are used to evaluate energy efficiency.

(1)   For i ← 1 to n, do
(2)     If S(i).E > 0, then
(3)       max_Q ← max(Q(i, :))
(4)       If S(i).d ≤ TX_range, then
(5)         If S(i) is next-hop, then
(6)           Aggregate data
(7)           Send data to sink
(8)         Else
(9)           Send data to sink
(10)        End if
(11)      Else if S(i).role == 0, then
(12)        If CH within TX_range, then
(13)          Send data to CH
(14)        Else
(15)          Find closest neighbour in the cluster
(16)          Send data to closest neighbour
(17)        End if
(18)      End if
(19)      Compute reward
(20)      Update Q-value
(21)    End if
(22)  End for

ALGORITHM 3: Data transmission.

In Figure 5, we evaluated the proposed protocol's network lifetime using different numbers of sensor devices, and we compared it with LEACH and PEGASIS to prove the efficiency of the proposed protocol. With 30 and 50 devices, as shown in Figures 5(a) and 5(b), respectively, FlatEER-RL performed better than EER-RL, and EER-RL performed slightly better than LEACH and PEGASIS. With more devices, FlatEER-RL and EER-RL showed similar results, yet both outperformed LEACH and PEGASIS, while with 100 devices, EER-RL clearly outperformed FlatEER-RL. The proposed protocol thus achieves a longer network lifetime than the existing protocols. We considered the residual energy and the hop count together to maximize the network lifetime, because the energy consumed when transmitting data can be high if the distance is too large. Furthermore, we also compared the time until the first device dies in Figure 6(a); in all test scenarios, the proposed protocol outperformed LEACH and PEGASIS. From the above, we note that the cluster-based version of the proposed protocol works better in a large-scale network, while for a small network (50 devices or fewer), the flat-based version may be preferred. On the other hand, energy consumption directly affects network performance; that is, how devices' energy is used can influence the overall network performance for better or worse [33-36]. Therefore, it is crucial to consider energy efficiency in a routing protocol. The results also show that EER-RL consumes less energy and outperforms LEACH and PEGASIS. We used a cluster-based routing protocol with RL to avoid the cold start of a flat-based protocol and enhance the network lifetime.
Moreover, to speed up the learning process, we set the learning rate α to 1; after the learning process, the energy consumed per round was considerably reduced. In addition, we set the discount factor γ to 0.95 to focus more on the future reward. This played a major role in balancing energy consumption in the long run, minimizing the energy consumed by each device, and extending the network lifetime, as shown in Figure 6.

Conclusions and Future Considerations
In this paper, we proposed EER-RL, a cluster-based energy-efficient routing protocol for IoT using reinforcement learning. The objective of this work was to optimize energy consumption and prolong the network lifetime by finding an optimal route for data transmission. We produced two versions of the same algorithm, one cluster-based (EER-RL) and one flat-based (FlatEER-RL), and through comparison we showed that the proposed cluster-based routing protocol is more scalable than the flat-based one. However, in a small network, the flat-based version of the proposed work is preferred.
EER-RL was designed in three phases. The first is network set-up and CH election; in this phase we used the hop count and initial energy to compute the initial Q-value used for the CH election. The second phase forms the clusters: every CH sends an invitation to all devices in its transmission range, and every device far from the base station joins the cluster whose CH is closest. Finally, the data transmission phase, characterized by learning, provides energy-efficient routing, with both the devices' residual energy and the hop count considered when making routing decisions. Moreover, an energy threshold was set for CH replacement. The simulation results showed that EER-RL achieved better energy consumption and network lifetime than LEACH and PEGASIS.
In this paper, we used a lightweight RL to reduce the protocol runtime and minimize energy consumption. In the future, we would like to consider other parameters for a more optimal routing protocol.
Data Availability

The source code used in this manuscript is available from the following author upon request: Mutumbo (mutombo.ka-zadi@gmail.com).

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.