Event Driven Duty Cycling with Reinforcement Learning and Monte Carlo Technique for Wireless Network

Reducing transmission delay and maximizing the network lifetime are important issues for wireless sensor networks (WSN). The existing approaches commonly let the nodes periodically sleep to minimize energy consumption, which adversely increases packet forwarding latency. In this study, a novel scheme is proposed, which effectively determines the duty cycle of the nodes and packet forwarding path according to the network condition by employing the event-based mechanism and reinforcement learning technique. This allows low-latency energy-efficient scheduling and reduces the transmission collision between the nodes on the path. The Monte Carlo evaluation method is also adopted to minimize the overhead of the computation of each node in making the decision. Computer simulation reveals that the proposed scheme significantly improves end-to-end latency, waiting time, packet delivery ratio, and energy efficiency compared to the existing schemes including S-MAC and event-driven adaptive duty cycling scheme.


Introduction
Wireless Sensor Network (WSN) has been used for a wide range of applications, primarily for target area monitoring [1]. Event monitoring applications such as intrusion, lightning, or fire detection should be designed according to their operating condition [2]. In WSN, a large number of sensor nodes are distributed in the target area, which can process the signal and communicate with each other. The major problem in such a WSN-based monitoring system is the limited energy of the nodes, and, therefore, it is important to minimize their energy consumption for extensive network operation. Various energy-efficient communication algorithms and schemes have been proposed to maximize the life of the WSN. The Media Access Control (MAC) layer is responsible for scheduling nodes in WSN to effectively manage communication between nodes. The method commonly adopted with the MAC protocol for minimizing energy consumption in WSN is duty cycling. Here, the nodes stay awake only a fraction of time for sensing and communication.
The periodic dormancy, however, increases the transmission delay, which is detrimental especially to human life-critical applications. Energy-saving at the sacrifice of performance might be fatal for them. The transmission delay is caused by the sleeping nodes on the multihop path between the source and the destination node, called sleep latency [2][3][4][5][6][7]. This is a serious concern with WSN where the transmission range of a node is usually smaller than the distance between the communicating nodes. As the network operation is dynamic, the duty cycle of the nodes is required to be continuously adapted to avoid early sleep under high traffic load or overlistening under low traffic load. Event-driven adaptive duty cycling of the nodes can satisfy this requirement, which is the main objective of this paper.
It was shown that a significant amount of energy can be saved by employing sleep and idle listening mode for the nodes [8]. The duty cycle-based MAC protocols are classified into synchronous and asynchronous approaches. In the synchronous protocol, such as S-MAC [7], T-MAC [9], RMAC [8], and P-MAC [10], a schedule table is created for all the nodes to specify the sleep and wake-up time. S-MAC is based on broadcasting the preframe of SYNC and DATA packet for scheduling. Here, the performance metrics related to the network operators were not included in designing the protocol [10]. The asynchronous MAC protocol such as B-MAC [11], X-MAC [12], and RI-MAC [13] allows the nodes to operate independently to enhance the adaptability against dynamic load changes. To achieve a more adaptive schedule, the authors of [14] have shown that a significant amount of energy can be saved, and the delay is reduced by dynamically adjusting the latency. BADCS is proposed to reduce event detection latency and data routing delay using a duty cycle adjustment algorithm [15].
In this paper, a novel event-driven scheduling approach employing the reinforcement learning (RL) algorithm is proposed to reduce the sleep latency and improve the performance of packet switching in WSN. It adjusts the duty cycle of the nodes in the multihop path according to the status of the network so that the delay and waiting time incurred during packet transmission can be minimized. Here, the low-delay energy-efficient transmission path from the source to the sink node is decided using the RL algorithm. For a node on the path, the feedback information on the delay and energy taken by the path is provided to its next-hop nodes called the parent nodes. The RL algorithm is used to choose the best parent node and wake it up for forwarding the data. Additionally, to reduce the waiting time due to early sleep, a node of high traffic, such as one having many neighbors or located close to the sink node, is kept awake for a relatively long time. The simulation results show that the proposed approach substantially outperforms S-MAC and the existing adaptive duty cycling scheme [16] under various network conditions. The main contributions of the paper are summarized as follows:
(i) The existing node scheduling problem is transformed into a decision problem employing the event-driven approach and RL to effectively deal with the dynamically changing network condition of WSN. The transmission delay is due to early sleep and transmission collision. Early sleep is avoided by the event-driven approach to wake up the sleeping nodes promptly, while transmission collision is avoided by the RL technique to properly select the forwarding path.
(ii) The existing MAC protocols are based on local feedback information in deciding the schedule. In this paper, the Monte Carlo (MC) evaluation technique is employed to obtain global information and sampling, which greatly improves the speed and accuracy for finding a suitable schedule.
(iii) A technique for finding the maximum achievable reward in RL is developed by solving Bellman's optimality equation, which allows accurate solutions in a small number of computation steps.
The rest of the paper is organized as follows: in Section 2, the work related to duty cycling and RL-based scheduling for the MAC of WSN is discussed. The proposed scheme is presented in Section 3. Section 4 discusses the simulation results, and the conclusion is made in Section 5.

Related Work
2.1. Duty Cycling. Generally speaking, each sensor node in WSN operates on battery power, where two factors affect the rate of energy consumption. Firstly, the rate is high if the transceiver is transmitting, receiving, or idle (overhearing), and low during sleeping. Secondly, events other than successful packet transmission, such as collision or retransmission, cause energy waste. Also, the two kinds of delay explained below increase the transmission time, which is affected by the transmission characteristics and duty cycle.
Early Sleep Delay. Assume that some packets in a node need to be sent to another node that wakes and sleeps periodically. The early sleep problem occurs when a packet is sent to a sleeping node on the multihop path, and the data transmission is delayed until it switches back to the active state.
Transmission Collision Delay. Collision occurs if some nodes send packets at the same time when they are in the transmission range of the other node.
Figure 1 compares two types of duty cycling schemes. As shown in Figure 1(a), the nodes of S-MAC periodically switch from sleep to listen mode for prolonging the lifetime. Only the nodes in the listen mode can receive, forward, or process the packets. If the packet arrives during the sleep mode (event-A of Figure 1(a)), the process is delayed until the node switches to the listen mode. Therefore, the latency with periodic duty cycling is usually high. Figure 1(b) shows event-driven duty cycling, which controls the listen/sleep mode of a node based on the arrival and departure event of a packet [16]. Here, the next-hop node is woken up when a packet arrives to reduce the latency.
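The difference between the two schemes of Figure 1 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the helper names `periodic_wait` and `event_driven_wait`, the cycle length of 12 slots, the single listen slot 6, and the zero wake-up cost are hypothetical choices, not parameters taken from S-MAC or from the event-driven scheme of [16].

```python
def periodic_wait(arrival_slot, listen_slots, cycle_len):
    """Slots a packet waits for the receiver's next listen slot
    under periodic (time-based) duty cycling."""
    phase = arrival_slot % cycle_len
    # Distance to each listen slot, wrapping around the end of the cycle.
    return min((s - phase) % cycle_len for s in listen_slots)


def event_driven_wait(wakeup_cost=0):
    """Under event-driven duty cycling the next hop is woken on arrival,
    so only a (hypothetical) wake-up cost remains."""
    return wakeup_cost


CYCLE_LEN = 12      # assumed cycle length in slots
LISTEN_SLOTS = {6}  # assumed: the receiver listens only in slot 6

waits = [periodic_wait(t, LISTEN_SLOTS, CYCLE_LEN) for t in range(CYCLE_LEN)]
average_periodic = sum(waits) / CYCLE_LEN  # -> 5.5 slots
average_event_driven = event_driven_wait()  # -> 0 slots
```

A packet arriving at slot 7 just misses the listen slot and waits 11 slots, which corresponds to the early sleep delay described above.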
Various event-driven approaches have been proposed to address the problem of delay caused by early sleep [17], and the state change of a node is promptly reported by continuous monitoring of the operation. While the event-driven approach reduces the transmission time and energy consumption of a node, an efficient scheme needs to be developed to properly reflect the occurrence of the events to the scheduling. A machine learning technique such as RL is effective for meeting this requirement. RL is a biology-based machine learning approach that acquires knowledge by exploring the operation environment without external supervision or prior knowledge. Numerous studies have been conducted on RL for various applications [18][19][20][21], including the reduction of transmission delay and maximization of sensor node lifetime [22,23]. Improving the performance of the network by replacing time-based duty cycling with event-driven reinforcement learning (EDRL) is the main objective of this paper.
(iii) $P(a \mid s_n, s_{n+1})$: the probability that action a leads the system in $s_n$ to $s_{n+1}$. $P: S \times A \times S \to [0, 1]$ is the state transition probability density function.
(iv) $R(a \mid s_n, s_{n+1})$: the return after the transition from $s_n$ to $s_{n+1}$ due to action a. $R: S \times A \to \mathbb{R}$ is the reward function.
A key feature of MDP is the Markovian property; the probability to reach state s at step-n depends on only the previous step, step-(n − 1) [24]. In discrete-time MDP, which is considered in this paper, the agent is in state $s_n$ ($\in S$) and takes action a ($\in A$) according to the policy, π, at step-n. In response to the action, the environment provides scalar feedback, called a reward, $R(a \mid s_n, s_{n+1})$. This process is illustrated in Figure 2, where the value of state, $v(s_n)$, and action, q(s, a), is returned as the reward. RL is a commonly employed solution for MDP when the application possesses the Markov property. The RL algorithm aims to find a policy that maximizes the accumulated reward. If the system operates in a finite time domain, it can be solved using the dynamic programming approach and Bellman optimality equation. Otherwise, it is solved using the value iteration, policy iteration, linear programming, approximation method, or online learning technique [25]. RL has been used to solve the typical sequence decision problem, using the learner and decision-maker called agent [26]. The agent chooses a good action based on only the current sensory observation and remembers the past sensations to select a good action [27]. The proposed scheme is presented next.
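As an illustration of how the Bellman optimality equation is solved by value iteration for a small discrete-time MDP, consider the following minimal sketch. The three-state forwarding example (`src`, `relay`, `sink`), the rewards of −1 per hop and −3 for the direct link, and the function name `value_iteration` are assumptions introduced only for illustration; they do not come from the proposed scheme.

```python
def value_iteration(states, actions, P, R, gamma=1.0, tol=1e-9):
    """Iterate v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * v(s') ]
    until the largest change in one sweep falls below tol."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:  # terminal state: value stays 0
                continue
            best = max(
                R[s, a] + gamma * sum(p * v[s2] for s2, p in P[s, a].items())
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


# Hypothetical example: a source can forward via a relay or directly.
states = ["src", "relay", "sink"]
actions = {"src": ["to_relay", "direct"], "relay": ["to_sink"], "sink": []}
P = {
    ("src", "to_relay"): {"relay": 1.0},
    ("src", "direct"): {"sink": 1.0},
    ("relay", "to_sink"): {"sink": 1.0},
}
R = {("src", "to_relay"): -1.0, ("src", "direct"): -3.0, ("relay", "to_sink"): -1.0}

v_star = value_iteration(states, actions, P, R)
# v_star["src"] is -2.0: two cheap hops beat the costly direct link.
```

With negative rewards standing for cost, maximizing the accumulated reward selects the two-hop route in this toy example.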

The Proposed Scheme
In this section, the proposed scheme is presented, which decides the communication path using RL so that the transmission delay and energy consumption are minimized. The list of notations used in the paper is given in Table 1.

Design Goal.
Regarding packet transmission in WSN, the transmission delay and energy efficiency are conflicting factors due to the limited energy of the nodes. Various protocols have been developed to reduce the transmission delay between the nodes of finite energy. The primary task of WSN is to monitor and report abnormal or emergency conditions, and each node in the network may serve as a source or relay node. The existence of a duty cycle increases the delay due to early sleep. Another cause of delay is a collision. The proposed scheme effectively avoids early sleep by employing an event-driven approach to wake up the sleeping nodes promptly and avoids transmission collision by the RL technique properly selecting the forwarding path.
Considering the trade-off between performance and scalability, an event-based wake-up strategy is adopted. The proposed scheme consists of two phases: RL phase and report phase. During the RL phase, the nodes of the forwarding path are selected by carrying out exploration producing the consumed energy and delay data as a reward. In the report phase, the value of the RL function is obtained, where the state is input and the state-action pair is output. Then, the function is used to decide and explore the next action with a greedy algorithm. Finally, through the interaction between the nodes and the environment, the optimal wake-up schedule is decided. In the learning process of the proposed scheme, each node selects the forwarding path and then calculates the reward. The result affects the decision and exploration of the next state. Applying the proposed scheme, a proper duty cycle is obtained using the wake-up mechanism for timely transmission. Figure 3 compares the operations of different duty cycling schemes, where the length of the working cycle, |T|, is 12 and the number in the bracket denotes active time slots of each node.
... to the active state to receive data packets, or (ii) it has some packets to transmit to a receiver that is active at that time. A cycle is divided into 12 time slots, and each is long enough to send and receive a packet. In Figure 3(b) of S-MAC, the forwarding nodes and their active slots are predecided and fixed, where $n_7$ sends data to $n_4$ at slot-6 because $n_4$ works only at slot-6, and $n_8$ has to wait for $n_4$ to work in the next cycle to send data. The transmission latency is increased due to this problem. In Figure 3(c) of the event-driven scheme, the node wakes up the next-hop node to reduce the waiting delay if it has a packet to transmit. Observe from the figure that $n_{10}$ wakes up $n_5$ and transmits a packet at slot-4, while $n_6$ transmits a packet to $n_3$.
Since $n_5$ is within the transmission range of $n_6$, a collision occurs causing retransmission, and as a result, the latency becomes greater than 8. As shown in this example, a node needs to be properly chosen when there exists more than one neighbor node to avoid collision and transfer the packet to the sink node fast. Thus, the proposed scheme, which wakes up appropriate nodes based on reinforcement learning, makes use of the available time slots and neighbor nodes. The latency can be reduced to 8, as shown in Figure 3(d).
Table 1: List of notations.
G = (V, E): WSN with the set of nodes, V, and edges, E
r: Transmission range of a node
NB(i): Neighbor nodes of node-i
w(i): Duration of slots when node-i works
F(i): Nodes of NB(i) forbidden to wake up
$p_c(i)$: Candidate parent nodes of node-i
$\tau = (n_s, \ldots, n_d)$: The path from the source to the destination node

The WSN is modeled as G = (V, E), where V is a set of N sensor nodes, and node-i has a queue with a capacity of $q_i$ packets. $E = \{(u, v) \mid 1 \le u \le N, 1 \le v \le N\}$ denotes the links between node-u and node-v. As in [28], all nodes are assumed to have the same transmission range, r, for simplicity. dis(u, v) represents the distance between node-u and node-v, which is no greater than r if node-v is a neighbor node of node-u, i.e., $v \in NB(u)$ ($dis(u, v) \le r$). Each node has sleep and work states. Let T denote a work period that is usually divided into a fixed number of time slots. Each slot is long enough so that a source node and a relay node can either cooperatively transmit one data packet to the destination or transmit one of their packets. Then, the work schedule of node-i, w(i), is defined as the active time slots in T. wk(i) is the slot when node-i is woken up and working.
where c is a nonnegative integer and $t_l$ is an element in the set of the active time slots of the node. With duty cycling, each node can receive data only in the working state, and thus the time duration for receiving data is quite limited. Concerning the energy efficiency of a single node and the lifetime of the entire network, $\min(|wk(i)|)$ and $\min(\sum_{i \in V} |wk(i)|)$ are the objectives, respectively. The duty cycle ratio of a node is k/|T| if it works for k slots. Note that k and wk(i) are fixed with time-based duty cycling. Considering early sleep delay and collision delay, $wk(c(i)) \le wk(i) \le wk(p(i))$ is the basic condition of successful transmission when a packet is transmitted from c(i) to p(i). Here, p(i) and c(i) denote the parent and child node of node-i, which receive and send the packet, respectively. The delay caused by early sleep is $wk(p(i)) - wk(c(i))$ ($\le |T|$). The total transmission delay, D, is then

where $d_c$ represents the delay caused by transmission collision and duty cycling. The existence of collision between the hidden and exposed nodes in the wireless network environment causes the delay. The time-based scheduling approach is not efficient due to the synchronization overhead and lack of information on the network condition. The proposed scheme is event-driven, and RL helps the nodes make the proper local decision based on the feedback information on the global network status. Here, a mechanism is employed to wake up a proper node on the next hop and alleviate the early sleep problem. As for data transmission scheduling, the goal is to construct a set of collision-free transmission schedules allowing aggregation of the data in the sink node. The delay caused by early sleep is reduced by waking up the node instead of waiting for the termination of the sleep period. Note that unreasonable selection of the parent node makes $d_c$ larger. Assume two transmission schedules for node-u and node-v ($(u, v) \in E$), {p(u), wk(u)} and {p(v), wk(v)}.
Here, {p(u), wk(u)} means that the nodes in the set of the senders of node-u are scheduled to transmit to p(u) at wk(u), which is decided by c(u). A collision-free transmission should satisfy one of the following two conditions: (i) the wake-up slots of node-u and node-v are different, $wk(u) \ne wk(v)$; (ii) otherwise, their parent nodes are not in the transmission range of each other, $p(v) \notin NB(u)$ and $p(u) \notin NB(v)$. In the following, the formal definition of the minimum latency problem based on the event-driven scheduling approach is given.
Output: The schedule, sch(i) ($= \{p(i), wk(i)\}$, $\forall i \in V$), satisfies the following conditions:
The length, m, is minimized;
Data sent from $n_i$ to $n_j$ according to sch(i) and sch(j) are collision-free, $\forall i, j \in V$, $i \ne j$. (4)
The wake-up schedule is decided for each node. Here, $wk(i) = wk(c(i)) + \varpi$, where $\varpi$ denotes the time required for transmitting the data from node-c(i). Collision occurs if node-u and node-v send a packet at the same time in the case of $(u, v) \in E$. If $(u, v) \notin E$, it still occurs when p(v) is located inside the transmission range of the other node, i.e., $(u, p(v)) \in E$.
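The pairwise collision-free condition stated above (different wake-up slots, or parents outside each other's range) can be checked with a few lines of code. This is only a sketch: the function name `collision_free` and the dictionary layout are hypothetical, wk is interpreted here as the slot in which a node transmits to its parent, and the node names merely echo the Figure 3(c) example.

```python
def collision_free(u, v, parent, wk, nb):
    """Return True if the schedules {p(u), wk(u)} and {p(v), wk(v)}
    cannot collide: either the slots differ, or neither parent lies in
    the neighbor set of the other sender."""
    if wk[u] != wk[v]:
        return True
    return parent[v] not in nb[u] and parent[u] not in nb[v]


# Hypothetical data echoing Figure 3(c): n10 -> n5 and n6 -> n3 in the
# same slot, with n5 inside the transmission range of n6.
nb = {"n10": {"n5"}, "n6": {"n3", "n5"}}
parent = {"n10": "n5", "n6": "n3"}

same_slot = collision_free("n10", "n6", parent, {"n10": 4, "n6": 4}, nb)
other_slot = collision_free("n10", "n6", parent, {"n10": 4, "n6": 5}, nb)
# same_slot is False (collision at n5); other_slot is True.
```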
To take care of the first cause of collision, for node-u, some nodes are forbidden to be woken at the same time, denoted as F(u). Here, wk(u) is decided by c(u) because the node switches to work state after it is woken up. For the case of (u, v) ∈ E, node-u and node-v are forbidden to be selected as parent node of a node simultaneously.
For the second cause, a node is forbidden to be selected as a parent node if there is a neighbor node transmitting the packet in the same time slot.
Combining the two cases, F(u) becomes

The transmission schedule for each node is decided starting from the leaf node while moving toward the sink. At first, the source node decides the schedule for itself and its parent node, and then the data are sent to the parent node, which does the same thing as the child node. The set of nodes, $p_c(i)$ ($= NB(i) - NB(p(u)) \cup NB(u)$, $(u, i) \notin E$), includes the nodes that are less likely to cause collision. For $|p_c(i)| > 1$, there exist several neighbor nodes for node-i to be selected as parent node, and thus a weight for each candidate parent node is estimated using RL to select the best one. The selection algorithm of the parent node minimizing the latency is shown in Algorithm 1.
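The idea of choosing one parent among several candidates by an RL-estimated weight can be sketched as follows. This is not Algorithm 1; it is a simplified illustration that assumes the candidate set, the forbidden set, and the learned weights are already available, and the name `select_parent` and the numeric weights are hypothetical.

```python
def select_parent(candidates, forbidden, weight):
    """Pick the candidate parent with the largest learned weight,
    skipping nodes that are forbidden to be woken up.
    Returns None when no candidate is allowed."""
    allowed = [n for n in candidates if n not in forbidden]
    if not allowed:
        return None
    return max(allowed, key=lambda n: weight[n])


# Hypothetical weights, e.g. state values estimated by RL.
weight = {"n4": 0.4, "n5": 0.9}

best = select_parent(["n4", "n5"], forbidden=set(), weight=weight)
fallback = select_parent(["n4", "n5"], forbidden={"n5"}, weight=weight)
# best is "n5"; fallback is "n4" because n5 is forbidden.
```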

Exploration of Packet Forwarding Path.
Assume that neighbor nodes allowing minimum early sleep delay and transmission delay have been selected. Then, a packet is transmitted effectively by minimizing the number of hops, m. The process of obtaining the weight for each transmission schedule is given in the following.
Let v(i) be the state value of node-i obtained from the RL process. A set of nodes forming a path is evaluated using the rewards, and then v(i) is updated. When a new event occurs, node-i having a packet to transmit finds p(i) from $p_c(i)$ and wakes it up for packet forwarding. When the sink node receives data from node-i, it records the path and estimates the reward due to node-i for improving the schedule based on RL.
The following shows the model and the process of solving the target problem using RL. Here, s(i) and v(i) denote state-i and the value of state-i estimated by the RL process, respectively. The state is estimated by the feedback information on the amount of energy consumed and forwarding latency after a packet is successfully transmitted to the succeeding node or sink node. The lower the energy consumption and latency, the larger the estimated value. It is preferred to choose the state of large value when making a decision.
In deciding v(i) using a stochastic decision process, an agent interacting with the environment is implemented in each node. It works as follows. Let A(s) be a finite set of control actions allowed to be taken with a state-space denoted as S such that $s(i) \in S$. Suppose that an agent chooses an action, $a_{s(i)}$ ($\in A(s)$), that is available at s(i). After the action, the agent receives an immediate reward, R, and the system makes a transition to a new state, s′ ($a_{s(i)} \times p \to s'$), with a transition probability, p. Policy π, $\pi(s) \to a$, denotes the rule of action selection. An optimal policy, $\pi^*$, maximizes or minimizes the objective function. A state of high value implies that the transition to this state gets more reward. The final solution consists of the states that have a long-term revenue. The value of state-s, $v_\pi(s)$, is defined as a state-value function:

Accordingly, the state-action is viewed as a decision made in the current state and evaluated by the value of the state-action. The state-action value function is

A global track, $\tau^* = \{u_0^*, u_1^*, \ldots, u_m^*\}$, for an m-hop path from node-i to the sink, node-s, is optimal if it satisfies

where J() is the objective function. $u_j$ ($j \ne i$, $u_j \in V$), which is decided by node-(j − 1), forms a sequence of states minimizing the delay and energy consumption of the whole process. Considering the limited feedback information in the local nodes, it is hard to evaluate the decision once it is made. After making a decision, the node needs to get the feedback on the decision regarding the transmission delay and energy consumption and then update the policy, π. Since they are local information, the Monte Carlo (MC) evaluation technique is employed to obtain global information. The target function, $\int R(\tau) P_\pi(\tau)\, d\tau$, is the expected value of the cumulative return denoting the overall revenue of the policy, π.
Define η(τ, π) as the average reward of a policy as follows:

Here, $E^{\pi}_{\pi_0}$ is the expectation with the probability measure generated by the policy, π, with initial policy, $\pi_0$. RL is used to update $\pi_0$ iteratively, leading to the optimal policy, $\pi^*$. Maximizing the expected discounted total reward is the objective, which is defined as follows:

$V^*_\pi(s)$ is the maximum achievable reward at state-s, which is found by solving the following Bellman's optimality equation.
v(s) and q(s, a) can be obtained using the principle of Bellman optimality:

$v^*(s) = \max_a \left[ R_s^a + \gamma \sum_{s' \in S} p_{ss'}^a \, v^*(s') \right]$,

Even though the process evolves in the continuous-time domain, a discrete-time model is assumed in this paper, where time is slotted with intervals of unit length. In the proposed scheme, a node is represented as a state.
where s(i) indicates that node-i has a packet to transmit, and s(u) is the next hop selected by node-i. The action taken is decided according to π and s(i). $E_\tau$ ($\in \{0, 1\}$) equals 1 if an event is reported along the path, τ, and 0, otherwise. $\tau = (n_s, n_i, n_j, \ldots, n_{des})$ represents a path from the source node, $n_s$, to the destination node, $n_{des}$. $n_\tau$ is the number of nodes in τ. The reward function, $R(s_t, a_t)$, is then given by

where d and l denote the delay and total energy consumption of the path, respectively. To avoid a local optimum solution, the state value is not updated until one period is completed with the MC process, which updates the value as follows:

where $G_t$ is the objective of MC. $V(S_t)$ is the expected discounted reward, which is updated after one path is tried. The method to obtain the true expected value by exploring all possible paths is extremely inefficient.
Thus, finding an approximate value through effective sampling is a better way. The MC method conducts sufficient sampling of the state space using the ε-soft greedy algorithm.

$a_n^* = \arg\max_{a_n \in A(s_n, e_n)} Q(s_n, s_{n+1} \mid \theta)$.
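A Monte Carlo state-value update of the kind described above, performed only after one complete path has been tried, can be sketched as follows. The sketch assumes the standard constant-step-size form $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$, which may differ in detail from the update used in the proposed scheme; the function name `mc_update`, the step size α = 0.5, and the per-hop reward of −1 are hypothetical.

```python
def mc_update(V, episode, alpha=0.5, gamma=1.0):
    """Every-visit Monte Carlo update applied after a finished episode.

    episode is a list of (state, reward) pairs in visiting order, where
    reward is received on leaving that state. The return G_t is
    accumulated backwards from the end of the path.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        old = V.get(state, 0.0)
        V[state] = old + alpha * (G - old)
    return V


# Hypothetical three-hop path n10 -> n5 -> n2 -> sink, cost 1 per hop.
V = mc_update({}, [("n10", -1.0), ("n5", -1.0), ("n2", -1.0)])
# V is {"n2": -0.5, "n5": -1.0, "n10": -1.5}
```

Nodes farther from the sink accumulate a larger cost, so their estimated value is lower, which matches the statement that lower delay and energy give a larger state value.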
For a node having more than one parent node, $|p_c| \ge 2$, there exist $|p_c|^m$ paths for the m-hop path. Exploring every path based on RL is not effective. The ε-soft greedy is a popular exploration method used to obtain samples from the probability space and get the sampling space, θ, for exploitation. Also, to ensure sufficient and efficient sampling space, the variable, ε, is added to the RL for a better learning process. The soft greedy policy can ensure sufficient sampling of the state allowing accurate estimation of the state value. The updated state will be relatively small as exploration continues, while excessive exploration delays the convergence to the optimal value. Therefore, a constraint on the update condition for the parameters of the soft greedy policy is needed. ε is used as the constraint. Note that the bigger the change in the state value, the greater the chance of exploring the untried state.
The set of sample data, θ, is obtained by the exploration of the environment with the soft greedy policy. The node evaluation process based on RL is shown in Algorithm 2.
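The ε-soft greedy selection can be illustrated with the common textbook form, in which the greedy action receives probability 1 − ε + ε/|A| and every other action ε/|A|. Whether the proposed scheme uses exactly this form is an assumption; the function names and the Q values below are hypothetical.

```python
import random


def epsilon_soft_probs(q_values, epsilon):
    """Action probabilities of an epsilon-soft greedy policy:
    1 - eps + eps/|A| for the greedy action, eps/|A| for the rest."""
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    return {
        a: (1.0 - epsilon + epsilon / n) if a == greedy else epsilon / n
        for a in q_values
    }


def epsilon_soft_choice(q_values, epsilon, rng=random):
    """Sample one action (here: one candidate parent) from the policy."""
    probs = epsilon_soft_probs(q_values, epsilon)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical Q values of two candidate parent nodes.
q = {"n4": 0.3, "n5": 0.7}
probs = epsilon_soft_probs(q, epsilon=0.2)   # n5: about 0.9, n4: about 0.1
choice = epsilon_soft_choice(q, epsilon=0.2, rng=random.Random(0))
```

Every action keeps a nonzero probability as long as ε > 0, which is what allows untried parent nodes to be sampled.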
The flowchart of the proposed EDRL scheme is depicted in Figure 4. It is implemented in two blocks, the network operation and RL process, which run independently. Unlike the time-based duty cycling, the nodes switch to sleep mode for saving energy until packet transmission is required. The proposed scheme is evaluated next.

Performance Evaluation
In this section, the performance of the proposed EDRL scheme is evaluated. It is also compared with S-MAC and the existing adaptive event-driven scheme (ED) [14] in terms of packet delivery ratio, latency, packet loss rate, and energy efficiency as the load varies.
In the simulation, 25 nodes are distributed randomly in a 50 × 50 area, and the nodes send packets to the sink node via a one-hop or multihop path. All the nodes have the same transmission range, and the interference range is equal to the transmission range. Here, one node is selected as the sink node (destination node), which never goes to sleep, while the other nodes periodically generate packets as an event occurs. The parameters used in the simulation are listed in Table 2.
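A topology of the kind described above, together with the neighbor sets NB(u) defined by $dis(u, v) \le r$, can be generated with a short sketch. The transmission range r = 15, the random seed, the choice of node 0 as the sink, and the function names are assumptions made for illustration; the actual simulation parameters are those of Table 2.

```python
import math
import random


def random_topology(n_nodes=25, side=50.0, seed=0):
    """Place n_nodes uniformly at random in a side x side square."""
    rng = random.Random(seed)
    return {
        i: (rng.uniform(0.0, side), rng.uniform(0.0, side))
        for i in range(n_nodes)
    }


def neighbor_sets(positions, r):
    """NB(u) for every node u: all other nodes v with dis(u, v) <= r."""
    nodes = sorted(positions)
    nb = {u: set() for u in nodes}
    for idx, u in enumerate(nodes):
        for v in nodes[idx + 1:]:
            if math.dist(positions[u], positions[v]) <= r:
                nb[u].add(v)
                nb[v].add(u)
    return nb


positions = random_topology()
nb = neighbor_sets(positions, r=15.0)  # r = 15 is an assumed value
sink = 0                               # assumed: node 0 acts as the sink
```

Because all nodes share the same range, the neighbor relation is symmetric, which the pairwise loop exploits by adding both directions at once.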
In Figure 5, the delivery ratio of the three schemes is compared as the number of packets per event varies from 50 to 300. Compared to ED and S-MAC, the proposed EDRL scheme yields a consistently higher delivery ratio. Note that the load is high when the number of packets per event is large. This demonstrates that the proposed scheme is quite effective in dealing with the duty cycle in response to dynamic load change. This is because the decision of the state is made via RL based on the data obtained from the environment, which responds to the changes of the network on time. The selection of the parent node effectively reduces the transmission collision and improves the packet delivery ratio in the network. The soft greedy policy can provide an adequate sampling of the state allowing for an accurate estimate of the state value. Since ε is bigger than the others, however, the performance of the EDRL scheme is similar to the other schemes in the beginning. Figure 6 shows the end-to-end latency of packet transmission. It can be observed from the figure that the delay of the proposed EDRL scheme is always smaller than that of the other schemes. S-MAC is relatively insensitive to the load, which indicates that it is not adaptable to the traffic load.
The reduction of transmission conflicts and correct path selection make the end-to-end delay smaller compared to the others in the proposed scheme. Small fluctuations with EDRL are due to the RL process involving exploration and periodic MC update of state value and policy. Figure 7 compares the packet loss rate of the three schemes. Observe from the figure that the ratio with the proposed EDRL is always lower than that of the other schemes regardless of the load. EDRL not only displays a slower rate of packet loss than the others with the increase of the load, but also allows no or little packet loss until the number of packets per event exceeds 150. This is attributed to the reward function of RL in equation (14), which indicates that the path is decided based on latency. A packet is transmitted effectively by minimizing the number of hops to obtain the weight for each transmission schedule based on EDRL. Because of this, the proposed scheme is better than ED. Figure 8 shows the average waiting time of the schemes. The proposed EDRL consistently outperforms the other schemes, which validates its effectiveness and robustness regardless of the load condition. It is achieved by properly selecting the node for packet transmission, which results in a reduced collision of the data transmission. The combination of RL with MC yields adaptive decisions, which give more stable and better performance of the proposed scheme than S-MAC and ED.

Algorithm 2: Node evaluation process based on RL.
Initialization;
while e is smaller than the number of total episodes do
  while n is smaller than the maximum step do
    Take action a with ε-soft greedy on Q(s, a);
    while the nodes are in RC phase do
      Wake up the nodes decided from the RL process;
      Generate data packets or receive data packets;
    end while
    Determine the subsequent state; n = n + 1;
    Observe the delay and energy consumption;
    Compute reward;
    Store transition ($s_n$, $a_n$, $s_{n+1}$, R) and τ in sample space Q;
  update π, V(s), ε;
The proposed EDRL adjusts the duty cycle of the nodes in the multihop path according to the status of the network so that the waiting time incurred during packet transmission can be minimized. Figure 9 shows the node survival rate of the three schemes. Observe from the figure that the fraction of dead nodes of the proposed scheme is substantially smaller than that of the other schemes. In the case of relatively low load, the fraction of dead nodes of the ED scheme is smaller than that of S-MAC due to energy saving. As the load increases, the awakened nodes may increase transmission collision with the ED scheme, and as a result, the number of dead nodes becomes larger than that of S-MAC with a fixed duty cycle. Figure 10 shows the energy consumption rates of the three schemes. In the case of high traffic load, the transmission tasks and transmission conflicts significantly increase the energy consumption. The proposed EDRL scheme consistently outperforms the other schemes. This is attributed to the event-driven approach and RL, which reduce the operation time of the nodes and waiting time in forwarding the packet. It can be observed from the figure that the waiting time of the proposed EDRL scheme is significantly smaller than that of the other two schemes because the set of nodes is selected based on equation (5), which makes collision less likely.

Conclusion
In this study, an adaptive duty cycle scheduling scheme applicable to MAC has been proposed for WSN. It effectively improves the performance of the network by employing event-driven duty cycling and RL technique to effectively adapt to dynamic change in the network condition. In addition, the sampling approach based on Monte Carlo evaluation greatly improves the speed and accuracy for finding a suitable schedule. Computer simulation revealed that the proposed scheme substantially reduces the energy consumption, latency, and packet loss rate compared with the existing schemes.
In the future, we will further enhance the proposed scheme with a more sophisticated adaptive approach and reinforcement learning technique, along with the study on other performance metrics including throughput. We will also consider different learning techniques such as hybrid or federated learning to effectively cope with various operating conditions of the network. Furthermore, the proposed scheme will be extended to be applied to the virtualized network environment such as software-defined networking.

Data Availability
All data included in this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.