Reinforcement Learning with Probabilistic Boolean Network Models of Smart Grid Devices

The area of Smart Power Grids needs to constantly improve its efficiency and resilience, to provide high-quality electrical power over a resistant grid, managing faults and avoiding failures. Achieving this requires high component reliability, adequate maintenance, and a studied failure occurrence. Correct system operation involves those activities, along with novel methodologies to detect, classify, and isolate faults and failures, and to model and simulate processes with predictive algorithms and analytics (using data analysis and asset condition to plan and perform activities). We showcase the application of a complex-adaptive, self-organizing modeling method, Probabilistic Boolean Networks (PBN), as a way towards understanding the dynamics of smart grid devices, and to model and characterize their behavior. This work demonstrates that PBNs are equivalent to the standard Reinforcement Learning Cycle, in which the agent/model interacts with its environment and receives feedback from it in the form of a reward signal. Different reward structures were created in order to characterize preferred behavior. This information can be used to guide the PBN to avoid fault conditions and failures.


Introduction
There is not a picture of the present that is complete without electrical power; it has become essential to our civilization. Electrical power has been a constant in our lives for almost two centuries since Faraday's discovery and the first alternating current power grid in 1886. Generating, transmitting, and distributing electrical power has evolved from a commodity to a basic need during this time. This process has not changed much for a long time. Electricity is produced in different ways, but the basic cycle is essentially the same: it is generated (via electromechanical generators, geothermal power, nuclear fission, solar, and other means) and then delivered to clients via a transmission-distribution network.
Most modern systems are still like the first ones: centralized, unidirectional electrical power transmission systems with demand-driven control. In the latter decades of the 20th century, local grids began to arise, and since the early 21st, the industry has attempted to take advantage of telecommunication improvements to solve the limitations imposed by centralization and the challenges brought by the use of renewable sources and new technology like photovoltaic panels and wind turbines [1]. Decentralized systems provide benefits and pose significant challenges, boosting efficient techniques for modeling and controlling smart grid systems [2]. The European Union Commission Task Force on Smart Grids has defined these as an "electricity network that can cost efficiently integrate the behavior and actions of all users connected to it-generators, consumers, and those that do both-in order to ensure economically efficient, sustainable power system with low losses and high levels of quality and security of supply and safety" [3]. Applying Signal Processing and Communications to the power grid has allowed a flow of data that is one of the defining elements of the smart grid. This includes the use of "Smart Devices," such as the Intelligent Power Router (IPR).
This device [4] was inspired by Internet routers, and it has a degree of "intelligence" that allows it to switch lines and shed loads. Devices such as this allow the Electrical Power Distribution System (EPDS) to become reliable, resilient, flexible, and efficient. With them, decisions can be made in the event of power failures or component malfunctions, coordinating with other devices in their vicinity to react to load, demands, faults, and emergencies. This represents a multidisciplinary issue being faced in the last decade [5]. The basic elements of an IPR are shown in Figure 1.
An EPDS that has incorporated IPRs (EPDS-IPR) also has the capacity for automatic service restoration if a network of IPRs is deployed strategically throughout the power grid, and if they are programmed to exchange information to manage and reconfigure the network following a rule set whenever a perturbation occurs. This allows survivability and better use of system resources.
Designing these EPDS-IPR networks is a very complex task. There is no specific model that can guide the designer. The devices must be configured with preset instructions on how to react when a particular set of conditions has occurred. Another challenging task is to make these grids adaptive [6] and not just follow a hard-wired set of instructions. A much more favorable situation is one in which the network can act autonomously and self-reconfigure in the event of a perturbation, i.e., loss of a power source, higher demand in critical loads, or sabotage. IPR devices, when interconnected, can be modeled as an intelligent Probabilistic Boolean Network, which is a complex-adaptive system that can learn from its steady-state behavior and exhibits self-organization and resilience. Methodologies for modeling based on a Probabilistic Boolean Network (PBN) have been presented in [7][8][9][10][11][12][13][14][15][16], validating the use of PBNs as a modeling mechanism for industrial processes and Smart Grids using IPRs and enabling the simulation of several scenarios. This has the potential to allow designers to better program the devices and design a more robust network. We would like to imbue EPDS-IPRs with the intelligence that allows them to survive a wide set of perturbation events that are practically impossible to predict.
Biomimetic approaches have been used to analyze and solve complex problems in general and for EPDS design in particular. Frameworks that are qualitative in nature, such as PBNs, permit the description of biological system networks with no loss of properties relevant to the system, and allow the representation of complex-adaptive behavior, such as self-healing and self-organization. Probabilistic Boolean Networks are used in bioinformatics for Gene Regulatory Network (GRN) modeling. GRNs are DNA segments in a cell that influence other segments and substances in it indirectly, to rule the level of expression of a gene or a set of genes. They are used to estimate the main rules that command the regulation of genes in genomic DNA. These PBNs are state-transition systems satisfying the Markov Property: they have no memory, so they are not reliant on previous states of the system. Proposed by I. Shmulevich and E. Dougherty [16], extending Stuart Kauffman's N-K or Boolean Network (BN) concept [17,18], they retain the rule-based modeling richness of BNs and introduce probabilistic behavior. These PBNs are built upon a collection of constituent BNs which are assigned selection weights or probabilities, in which every BN can be considered a "context." Information for each cell comes from different sources; each represents a cell context. At any point in time t, a particular system can be commanded by a single BN, and the PBN will change to another context or constituent BN at a different time, based on a particular switching probability. The methodology for using PBNs in manufacturing engineering systems was proposed in [12,16], with continued development in [7][8][9][10][11][13][14][15].
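As a concrete illustration of these context-switching dynamics, the following Python sketch simulates a minimal, entirely hypothetical 3-node PBN; the predictor functions, selection probabilities, and switching probability are invented for illustration and are not taken from any model in this work:

```python
import random

# Two hypothetical constituent BNs ("contexts") for a 3-node PBN.
# Each context is a list of predictors, one per node, mapping the
# current state vector to that node's next value.
context_A = [
    lambda s: s[1] or s[2],   # predictor for node 0
    lambda s: s[0] and s[2],  # predictor for node 1
    lambda s: not s[0],       # predictor for node 2
]
context_B = [
    lambda s: s[1],
    lambda s: s[0] or s[1],
    lambda s: s[2],
]
contexts = [context_A, context_B]
selection_probs = [0.7, 0.3]  # probability of selecting each context
switch_prob = 0.2             # probability of re-selecting a context each step

def step(state, ctx):
    """One synchronous PBN transition under the currently active context."""
    if random.random() < switch_prob:  # the PBN may switch context
        ctx = random.choices(range(len(contexts)), weights=selection_probs)[0]
    next_state = tuple(int(f(state)) for f in contexts[ctx])
    return next_state, ctx

state, ctx = (1, 0, 1), 0
for _ in range(10):
    state, ctx = step(state, ctx)
```

At any time step a single constituent BN governs the whole transition, matching the "one context at a time" semantics described above.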
In genomic research, the focus is to discern the way cells exercise control and perform the extensive numbers of operations needed for their operation and function. They are massively parallel and highly cohesive systems, and a path that considers a perspective above that of a single gene is needed so we can understand these biological processes better. Bioinformatics tools and algorithms are in high demand and have proven to be useful in solving these tasks [19]. However, novel computational approaches, digital medicine technologies, and the analysis of networks and metabolic pathways are needed to fully understand biological systems. Genes, cells, and molecules are networked systems that require a deeper understanding, to manufacture improved medicines and delivery mechanisms for treating and eradicating human disease. A mechanism for treating and processing massive quantities of data using computational methods and model checking can be used to understand the rules that govern them and make more accurate predictions about how these systems behave. EPDSs are akin to GRNs because, to understand the main rules that control them and to make accurate forecasts on how they will behave, endure, or decline under a collection of prospects, models that correctly describe the system and its behavior are essential. The harmonized synergy, interaction, and governance between genes and their products form these chains, in which gene expression is an important factor.
In this research, the use of Probabilistic Boolean Networks, already applied successfully in manufacturing engineering systems, is broadened to analyze IPR reliability and trust and to scrutinize faults that may lead to catastrophe. As our main contribution, we explored the PBN model's capacity for performing a basic Reinforcement Learning (RL) cycle, and we explored RL as a means for directing the network's evolution to increase its resilience, working towards the network achieving automatic learning and control of itself using RL.

Preliminaries and Theoretical Background
A review of Boolean and Probabilistic Boolean Networks, Reinforcement Learning, and a basic understanding of Electrical Power Distribution Systems and Intelligent Power Routers is presented in the following subsections.

Probabilistic Boolean Networks in System Modeling and Simulation.

Kauffman N-K or Boolean Networks (BNs) [17,18] and PBNs [20] have been studied for biological systems modeling and their dynamics, and to infer their behaviors [21] with statistical data analysis and simulation [22].
Kauffman's BNs are a finite grouping of Boolean nodes [35,36], in which states are quantized to 0 or 1 (although in PBNs, alternative quantizations are possible). A node's state is determined by the present state of other nodes/genes in the network. The set of entry/input nodes in a BN is known as the regulatory nodes, with a collection of Boolean functions (known as predictors) that dictate the future values of the different nodes. When the set of genes and their respective predictors are defined, the network is defined as well. PBNs are, in essence, a tree of BNs in which, at any particular time period, the node state vector transitions are established by the rules of one of the constituent BNs. Formally, a Kauffman Network is a graph G(V, F) defined by the set V that contains all the network's nodes and the set F of sets of predictor functions, where the subindex denotes the realization or constituent network and the superindex denotes the predictor number, e.g., f_2^(1) is the first predictor of the second constituent network of the PBN. Instead of a single predictor per node, we have one or more predictors, one of which can be selected to determine the future state of node x_j.
The probability of selecting f_j^(i) as the predictor for node x_j is given by c_j^(i), where

∑_i c_j^(i) = 1.

A useful metaphor is to think of the PBN as a tree of BNs, in which each BN is selected with a particular probability.
Let f_i denote the i-th possible realization of the network, with f_i = (f_1^(i_1), f_2^(i_2), ..., f_n^(i_n)), 1 ≤ i_j ≤ l(j) for every node j, where l(j) is the number of predictors for node x_j. A realization of a PBN is one of its constituent BNs. The maximum number of realizations is given by

N = ∏_{j=1}^{n} l(j).

In [12], the authors validated that PBNs are appropriate for modeling engineering systems through a system model that was verified using model checking, and the simulation results were compared with real machine data. In [11], this methodology was applied to a manufacturing process, to gather quantitative occurrence data for DFMEA. In [10], the methods were further expanded, including the application of PBNs in industrial manufacturing processes, using intervention (guided perturbations) as a guide to move a system away from fault conditions and catastrophe, thus postponing its failure. A formal and thorough description of BNs and PBNs is presented in [21].
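The number of realizations follows directly from the per-node predictor counts. A small Python sketch (the predictor counts l(j) below are invented purely for illustration):

```python
from itertools import product
from math import prod

# Hypothetical predictor counts l(j) for a 4-node PBN:
# node 1 has 2 predictors, node 2 has 3, node 3 has 1, node 4 has 2.
l = [2, 3, 1, 2]

# Maximum number of realizations: N = l(1) * l(2) * ... * l(n)
N = prod(l)  # 2 * 3 * 1 * 2 = 12

# Each realization is one choice of predictor index per node.
realizations = list(product(*[range(k) for k in l]))
assert len(realizations) == N
```

Each tuple in `realizations` picks one predictor per node, i.e., one constituent BN of the PBN.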

Reinforcement Learning.

Artificial intelligence techniques are developing and growing rapidly. Methods like Deep Learning and Reinforcement Learning are helping to address the complexities and uncertainty of power systems [37]. In this sense, machine learning is being applied to power systems to predict consumption [38], prices [39], and energy optimization [40].
Born in the field of Behavioral Psychology, Reinforcement Learning (RL) [41,42] is considered an area of Machine Learning (which can be defined as the design and analysis of algorithms that improve on the basis of experience) in the field of Computer Science. It is concerned with how agents should perform actions in a given environment such that they maximize a cumulative reward signal. In Reinforcement Learning, a learner, or agent, is not told what to do or which set of actions to take; rather, it must discover the sequence of actions that achieves an optimal reward by trying them. The use of trial-and-error and delayed rewards are the two characteristic features of this approach. RL is studied in many other disciplines, such as control theory, operations research, statistics, and game theory. RL allows the software agent to learn a correct behavior based only on feedback from the environment, automating the learning scheme, removing the need for human expertise, and cutting the time needed to devise a solution. There are multiple solutions to an RL problem, but the most common approach is to allow the agent to select actions that yield a maximum reward in the long run, by using algorithms with an infinite horizon. One of the most used approaches is to make the agent learn to estimate the expected future rewards of (action, state) pairs. The estimates are adjusted through time by propagating part of the future state's reward, and if all states and all actions are tried numerous times, an optimal policy can be learned. Improved RL has been used for hybrid energy system management and optimization, e.g., SAC-based RL [43] and DDPG-based RL [44].
An RL agent learns by interacting with its environment. The RL agent acquires knowledge from the result of its interactions with the environment, instead of being taught explicitly, and selects its actions based on past interactions (called exploitation) or by making new choices (exploration). The reinforcement signal (mostly numerical in nature) it receives is a reward that encodes the success (or failure) of a given action's outcome, and it seeks to acquire knowledge by selecting actions that maximize the cumulative reward over time. Figure 2 illustrates the standard Reinforcement Learning Cycle.
In a standard Reinforcement Learning model, the learner is known as the agent, who makes decisions and is connected to its environment through perception and action. Agent and environment interact in a sequence of time steps t, and at each interaction step, our agent senses the environment for information and determines the state of its "world." Based on this information, the agent chooses and takes an action. The information on the state of the environment constitutes the input of our agent, and the action chosen by our agent becomes its output. The actions taken by the agent in each step change the state of its environment and its own state. A time step later, the value of the state transition following the action taken is given to our agent by its environment as a numerical value, called the reward.
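The perception-action loop just described can be written as a generic skeleton. The environment and agent below are hypothetical stand-ins (a toy 5-state chain and a random agent), meant only to show the shape of the cycle, not any system from this work:

```python
import random

class ToyEnvironment:
    """A hypothetical 5-state chain; reaching state 4 yields reward 1."""
    def __init__(self):
        self.state = 0

    def step(self, action):  # action: 0 = left, 1 = right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward

class RandomAgent:
    """Chooses actions at random; a learning agent would use the rewards."""
    def select_action(self, state):
        return random.choice([0, 1])

env, agent = ToyEnvironment(), RandomAgent()
state, total_reward = env.state, 0.0
for t in range(100):                      # the agent-environment loop
    action = agent.select_action(state)   # agent acts on its environment
    state, reward = env.step(action)      # environment returns the new state
    total_reward += reward                # ... and a scalar reward signal
```

The state is the agent's input, the action is its output, and the reward arrives one step after the action, exactly as in the cycle of Figure 2.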
Reinforcement Learning differs from supervised learning, another form of learning studied in machine learning, in which the agent learns from examples that are provided by an external supervisor. A challenge that is present in RL and not in other learning methods is that we have to choose and/or balance exploration and exploitation. An agent that uses exploration discovers and tries new actions to see if they produce a greater or lesser reward, while an agent that uses exploitation uses preferred, tried actions that in the past have been successful at producing reward. RL also considers the whole problem: in uncertain environments, the agent does not consider subproblems in isolation but sees how everything fits into the whole picture, starting with an agent that is complete, interactive, and with explicit goals, sensing aspects of its environment and choosing actions that influence it.

Supervised and Unsupervised Learning.
A representative set of pairs of states and actions is provided by a teacher to the agent in supervised learning. The agent must modify its strategy for selecting actions so that its actions get closer each time to the selected target actions. Therefore, the main problem in supervised learning is the approximation of a functional mapping from states to actions that is unknown to the agent and known to the teacher, which can be done with neural networks, fuzzy systems, or other learning models. This is impractical for complex problems because of the inability to specify a representative set of pairs of states and actions, making optimal solutions unattainable in some instances. An agent that performs unsupervised learning in its purest form perceives the states of the process it has under control but does not get any information about the actions, and the control strategy is not evaluated. Unsupervised learning therefore cannot be used to learn control strategies. A typical application of it is the identification of structure in data, as in data clustering.

Reinforcement Learning versus Purely Unsupervised Learning.

Just as in unsupervised learning, an agent performing Reinforcement Learning receives no information about an optimal control strategy, but in RL the rewards or reinforcement signals the agent gets provide feedback about its control strategy. With these signals, the agent can improve the strategy, giving intelligence to the trial-and-error process.
The problem faced by our agent in RL is that it must learn its behavior through trial-and-error interactions with its environment. Two main strategies are used for solving RL problems: an agent can search the behavior space to find a behavior that is appropriate to its environment (the approach used in genetic algorithms and genetic programming), or it can use statistics and dynamic programming to estimate the utility of taking actions in states of its environment.

Main Components of a Reinforcement Learning System.
In addition to the agent and the environment, the principal components of a Reinforcement Learning system are as follows:

(i) The policy dictates the way that our agent will behave at any given time. It maps states of the environment to the actions that are to be taken when those states are reached. The policy is central to our agent since it is the only thing needed to determine the agent's behavior.

(ii) The reward function is the reinforcement signal and, in our RL problem, defines the goal of the agent by mapping the perceived state or states of our world to a numerical value. This way we know which state is more desirable. In RL, our agent's only purpose in "life" is to maximize the total reward it will receive, and our agent must choose which actions contribute to that goal. Reward functions may be stochastic and are the basis for changing the policy of the agent. A strong assumption of the RL framework is that the reward signal can be unequivocally and directly observed, as the feedback the framework receives is part of the environment in which the agent is working. However, rewards are often delayed, as the effective reward is obtained several steps after the action leading to it has been executed.

(iii) The value function is a mapping of states that provides an estimate of how good it is for the agent to be in a given state, defined in terms of the expected future rewards, or the expected return. Value functions are defined with respect to a given policy. The state-value function for a policy π is

V^π(s) = E_π{ ∑_{k=0}^{∞} c^k R_{t+k+1} | s_t = s },

where E_π{·} defines the expected value when the agent follows the policy π, R is the reward, s is a state, and c is the discount factor. The terminal state's value, if it exists, is zero. Most RL algorithms estimate value functions. We also define Q^π(s, a), the action-value function for π, as follows:

Q^π(s, a) = E_π{ ∑_{k=0}^{∞} c^k R_{t+k+1} | s_t = s, a_t = a },

where a represents an action.
In many RL algorithms, the action-value function Q is used instead of the value function V because it easily lets the agent choose the action with the higher expected reward. The RL agent interacts with its world in a series of time steps. At each discrete time step t, the agent receives an observation o_t and a reward r_t. An action a_t is chosen from the group of actions available to the agent and executed within the environment. This moves the environment to a new state s_{t+1}. A new reward for the transition and new state is then determined. The agent needs to accumulate as much reward as possible.
The RL problem is defined as finding a policy for the agent that specifies the action the agent will take when in a given state. Once an MDP is combined with a policy in this way, the action for each state is fixed and the system behaves like a Markov Chain. RL is not considered a solution technique but rather a way of formulating a problem [45]. In [46], the basic problem that is apt for the use of RL is formulated: a system needs to interact with the environment to achieve a certain goal, and based on the current state's feedback, what action should it perform next? RL is the way of learning the correct action to be taken in each situation based solely on feedback obtained from the environment [41]. For our purposes, feedback is a numerical reward that we assign to the actions that an agent takes. RL agents can learn offline or online. Offline learning is similar to the knowledge acquired by a student from a teacher; the agent is taught what it needs to know before venturing into the environment. Online learning is more spontaneous, similar to the way a child learns how to walk, where knowledge is acquired in real time. The agent explores its environment and constantly adds experiences to make better future decisions.
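A common concrete method for learning such a policy is tabular Q-learning, sketched below on a hypothetical toy chain. The environment, parameters, and reward are invented for illustration; the update rule is the standard one-step Q-learning update, which is our illustrative choice and not a method prescribed by this work:

```python
import random

N_STATES, ACTIONS = 5, (0, 1)          # hypothetical 5-state chain; 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def env_step(state, action):
    """Invented dynamics: reward 1.0 only on reaching the goal state 4."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def greedy(state):
    """Greedy action with random tie-breaking."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

random.seed(0)
for episode in range(500):
    s = 0
    for _ in range(50):
        # epsilon-greedy: explore with probability EPSILON, else exploit
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        s2, r = env_step(s, a)
        # one-step Q-learning update (off-policy temporal difference)
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2
        if r == 1.0:
            break

# the greedy policy derived from the learned action-values
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
```

After enough episodes the greedy policy prefers moving toward the rewarded state, illustrating how the action-value function Q lets the agent pick the action with the higher expected reward.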

Electrical Power Distribution Systems and Intelligent Power Routers.

Industrial, commercial, and residential end-users must receive reliable electrical power at their facilities or homes. Several factors, natural and artificial, in the process of generating, transporting, and distributing electricity can damage equipment: wind, ice, storms (thunderstorms, typhoons, hurricanes), vegetation growth that can induce short circuits, and other nature-induced or human disasters, as well as malicious perturbations. Some factors cannot be predicted and must be taken care of in real time. Other events and factors can unbalance the network. As an example, variations in temperature may cause changes in electrical loads, and overall demand for electrical power varies with time, season, weather, and so on. Some of these factors affect the quality of the supplied power, while others cause emergency situations that force network operators to disconnect power to problem-causing regions to prevent chain reactions. Other severe situations may cause power outages or network power imbalance. Intentional power outages should be limited and minimized.
Electrical Power Grids are almost always managed from control rooms. Some can be telemetered, such that control engineers have accurate real-time information about their status. They can also have protection equipment that can be actioned from within the control room, so larger failures are prevented. There are instances in which telemetry may be cost ineffective, and aberrant network states can be reported by operators, engineers, workers, or customer communication. System repairs and maintenance may be performed manually by skilled workers.
Electrical substations [8] can have several busbars, and two of these may be interconnected via a switch or a conductor line. Both ends of the power line are connected to a breaker. A breaker is a standard protection mechanism with a relay that can automatically open in case of a short circuit, giving it the ability to disconnect a single circuit or all circuits from the remaining network. Alarm messages to control rooms can be generated as well, and with these, engineers have the ability to control the state of the breakers.
The main objective in fault management of an EPDS is to restore the power supply quickly to as many end-users as possible. Since in an EPDS there may be different routes through which power can be served, the EPDS can be switched to select alternative routes through breakers and switches that bypass the areas, lines, or devices that cause problems.
The need arises to isolate and determine any malfunction in protective equipment, generate correct diagnoses from the alarm messages that are received, and then postulate a plan to restore electricity safely and efficiently to the largest number of end-users.
EPDSs [8] are a group of sources and power lines that operate under common supervision to provide electrical power to end-users. Systems for electrical power delivery are formed by joining Distribution and Transmission Systems. Transmission Networks transport high-voltage electricity over long distances. Their high voltages are reduced at major load centers, and distribution networks then carry electricity from the Transmission Network to the customers. EPDSs are ubiquitous, from large ships to modern data centers. For our scope, we consider only Generation and Transmission Systems.

Intelligent Power Routers.

IPRs [4] are the principal components of a smart grid, developed as a distributed architecture for decentralized coordination, control, and communication between power system components. Intelligent control and planning of network operations are built into smart computing devices attached to sources, power lines, and other power network devices, allowing them to maintain a picture of current network conditions and assign resources to respond to failures, priority, or demand. They are configured on a Peer-to-Peer (P2P) network architecture, and in the event of a failure, they make local decisions and coordinate with other devices in their neighborhood to return the system to operation from an undesired state. Currently, the control of electrical power generation and distribution, even when redundant generators and lines are present, is done in a centralized way. Future EPDSs should be capable of distributing coordination and control of generation and distribution tasks throughout the network when contingencies or emergencies arise. IPRs were engineered for survivability, fault tolerance, scalability, cost-effectiveness, and continuous unattended operation. At its core, the IPR is a power flow controller with embedded software. An IPR has two principal components: Interfacing Circuits (ICKT) and an Integrated Control and Communications Unit (ICCU). The ICKTs operate power flow control and sensing devices, such as breakers, capacitors, and transformers. They can also receive network status information from sensors and dynamic system monitors.

They have direct control of the ICCU, and with their logic and software, calculate how to route power, change loads, and take any corrective or preventive actions that enhance safety, stability, and security. The network architecture and communication protocols are similar to Internet Protocol (IP) Local Area Networks. A load connected to an IPR can be assigned a priority, and contrary to nonsmart power networks, when a power source fails, the ICCU of an IPR reacts to this failure by reconfiguring the network, so that the load with the highest priority may be served.

Materials and Methods
With the following methodology, faults and failures can be categorized for a single IPR's failure modes in an EPDS. We propose establishing the model using the Probabilistic Symbolic Model Checker (PRISM) [47], verifying its use and formal correctness with Probabilistic Computation Tree Logic (PCTL).
The models were built in PRISM by constructing three modules: one for the environment in which the device operates, a module for the IPR's Probabilistic Boolean Network, and a reward structure. The actual state of the device PBN's nodes is in the second module, which uses the state of the variables available in the environment module and applies the corresponding Boolean Predictor Functions to transition to the next state. With the values of these variables as a base and the device's failure modes, the state of the IPR variables is changed, giving us the device's current state. In this manner, given the device's failure modes (which are based on the possible failure modes of its components), the model produces the failure modes corresponding to the system as an output.
To calculate individual IPR reliability, we have divided the device into three principal subsystems: power hardware (power circuit breakers), computer hardware (used for IPR-to-IPR communications, routing, and CPU functions), and the software that manages the device. The reliability estimates of each of the subsystems that compose the IPR are provided in [4].
The reliability of a circuit breaker was obtained from data sheets as 0.99330. Each IPR has two circuit breakers, a main breaker and a redundant secondary.
The reliability of data routers is estimated at 0.9009 (over a year). Lastly, software reliability is estimated at 0.99.
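From these figures, a composite IPR reliability can be estimated with the standard series/parallel reliability formulas. The sketch below is illustrative only: treating the secondary breaker as a redundant (parallel) path is our reading of the redundancy discussion, not a computation given in the source:

```python
# Component reliability estimates quoted in the text
R_breaker = 0.99330    # single circuit breaker (from data sheets)
R_router = 0.9009      # data router, over a year
R_software = 0.99      # managing software

# Pure series system: the IPR works only if every component works.
R_series = R_router * R_software * R_breaker * R_breaker

# With the secondary breaker as a redundant (parallel) path,
# the breaker pair fails only if BOTH breakers fail.
R_breaker_pair = 1.0 - (1.0 - R_breaker) ** 2
R_redundant = R_router * R_software * R_breaker_pair

# Redundancy can only help: R_redundant >= R_series,
# and a series system never exceeds its weakest component.
assert R_redundant >= R_series
assert R_series <= R_router
```

Under these assumptions, the composite reliability is dominated by the data router, the least reliable component of the series chain.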
PBNs can precisely emulate an EPDS with IPRs, since such a system has parallels with the GRNs that have been modeled with BNs and PBNs. As a first step, the PBN representing the EPDS is built. Each modeled component of the EPDS is equivalent to a gene (node) in a GRN, where a gene can assume one of two states; 0 means the IPR is ON and 1 means it is OFF (by the convention established in [4]). For each node, the Boolean functions that determine the state of its IPR at time t + 1 are applied, given the state of the EPDS's nodes at time t.
In the next step, a matrix is built for every node, to construct its Predictor Function. When calculating the predictors, only relevant nodes, those directly affecting the status or state of the node under study, are considered. All nodes that do not directly affect the current node's state are ignored. Based on the connections between the relevant nodes and the observed node, the equation or set of equations (constructed with the basic Boolean operators) that determine the state of each node is derived. For every node, there exists a set of equations (one or more per node). These Boolean Functions are solved from the examination of the relationships between each node and its relevant nodes. All possible states of all relevant nodes are analyzed, and an evaluation is made of the next state of the node at time t + 1, given the state of all relevant nodes at time t. The proposed method adapts the Fault Detection and Isolation (FDI) scheme described in [48] and shown in Figure 3, where one model is used to describe the normal operation of the process and another model is used to describe each of the faults or failure modes.
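The enumeration of relevant-node states can be sketched as follows: for a node, list every combination of its relevant nodes' states and record the node's next state, yielding the truth table from which the Boolean predictor is read off. The two-input OR rule below is a hypothetical example, not one of the actual predictors of the model:

```python
from itertools import product

def truth_table(relevant_nodes, rule):
    """Enumerate all states of the relevant nodes at time t and record the
    observed node's next state at time t + 1 under the given Boolean rule."""
    table = {}
    for states in product([0, 1], repeat=len(relevant_nodes)):
        assignment = dict(zip(relevant_nodes, states))
        table[states] = int(rule(assignment))
    return table

# Hypothetical predictor: node x3 goes to state 1 (failed) at time t + 1
# if either of its relevant nodes x1, x2 is in state 1 at time t (Boolean OR).
predictor_x3 = truth_table(["x1", "x2"], lambda a: a["x1"] or a["x2"])
# predictor_x3 == {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
```

The resulting table is exactly the per-node matrix described above; with k relevant nodes it has 2^k rows.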

PBNs possess a characteristic called self-organization: they self-organize into attractor states [36]. Attractors are sets of repeating states that, in the case of the models under study, are related to the failure modes that the system exhibits.
There are similarities between the construction and semantics of the models we present and those in [11]. By characterizing the failure modes of the device under study, the models can, with model checking, characterize the state of their nodes to determine the faults and failures correlated to the device's fault conditions. This methodology is flexible, and the design of the network model and its state transitions depends on how much resolution the experts need, based on design specifications. Complexity and expressiveness are scalable in this method, depending on the needs of the experts.
Device operation has been modeled by simulating the network's components, taking into consideration the reliability analysis in [4]. These simulations were performed for the IPR by modeling its relevant components based on their Mean Time Between Failures (MTBF) data. The model can detect and isolate single and multiple IPR failure modes.

Classification.
We begin by classifying the different states of the IPR's components in order to properly model the device. Following the methodology in [48], we first describe the system's normal operation and then model the different types of faults in the system.
Each subsystem is perceived as being in one of two different states:

Breakers: 0: the breakers close/switch properly; 1: the breakers do not close/switch properly.
Router: 0: the data router communicates (sends/receives information) in the network properly; 1: the data router does not communicate in the network.
Software: 0: the software makes correct decisions; 1: the software makes incorrect decisions.

The state of the device is, therefore, a set of the states of its subsystems. There is redundancy in the breakers because all configurations of the IPR reduce to a series system, and the reliability of a series system is below the reliability of its weakest component. Therefore, the only way to increase the reliability of the IPR is to provide a redundant path to the breaker. The device can be in 16 states (Router, Software, Breaker1, and Breaker2), ranging from all subsystems operational (0000) to all subsystems failed (1111). Some of these states, such as the failure of a single breaker, are identical, and after merging there are 12 unique states. Table 1 summarizes the Categories, Types, and states that constitute these categories in the IPR.
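The 16-to-12 state count can be checked mechanically. The sketch below enumerates all (Router, Software, Breaker1, Breaker2) bit vectors and merges states that differ only in which of the two redundant breakers has failed:

```python
from itertools import product

# Enumerate the 16 raw states of (Router, Software, Breaker1, Breaker2).
raw_states = list(product((0, 1), repeat=4))

def canonical(state):
    # States that differ only by which breaker failed are identical,
    # so the breaker pair is treated as unordered.
    r, s, b1, b2 = state
    return (r, s) + tuple(sorted((b1, b2)))

unique_states = {canonical(s) for s in raw_states}
```

Each (Router, Software) combination has three distinct breaker-pair configurations (both up, one down, both down) instead of four, giving 4 × 3 = 12 unique states, as stated in the text.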
The failure probability of each component is assumed to be independent of the others. Reliability estimates for each of the device's components were detailed in the "Electrical Power Distribution Systems and Intelligent Power Routers" section.
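Under the independence assumption, the series-system observation above is a one-line computation. The component reliabilities below are hypothetical placeholders, not the MTBF-derived figures from the paper:

```python
from math import prod

# Illustrative component reliabilities (hypothetical values).
r_router, r_software, r_breaker = 0.99, 0.98, 0.95

# Series system: reliability is the product of the independent
# component reliabilities, so it never exceeds the weakest component.
r_series = prod([r_router, r_software, r_breaker])

# Redundant breaker path: the breaker pair fails only if both
# breakers fail, which raises the pair's effective reliability.
r_breaker_pair = 1 - (1 - r_breaker) ** 2
r_redundant = prod([r_router, r_software, r_breaker_pair])
```

With these numbers the redundant path lifts the breaker-pair reliability from 0.95 to 0.9975, illustrating why a second breaker is the only practical way to raise the IPR's overall reliability.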
The relevant genes of the IPR's PBN are its data router, software, and the main and secondary breakers [8]. For these, the state of their components determines the failure mode they are currently in, as per the categories. Category 1 is a type of fault in which the IPR acts appropriately and changes the state of the breakers on an Active Signal (AS) but may also change them when switching is unnecessary. Category 2 describes the normal operation mode of the IPR. Category 3 describes a (catastrophic) failure of the device. Lastly, Category 4 describes a fault condition in which the device does not act upon an AS and may also switch the breakers unnecessarily when there is no AS. Table 2 presents the predictor Boolean functions for each of the IPR's subsystems, based on their configuration.
This permits the prognosis of fault conditions: those that do not cause a total failure but rather failure modes in which the device continues its operation without performing the required task to specifications.
These are unhealthy states of the device, and they should be treated, or they will otherwise lead to failure. For the device under study, the failure modes described in [4] were used, and an expert determination was made as to which device components and failure modes produce a failure or a fault. PRISM's property verification in PCTL was used to determine the maximum probability of occurrence of the failure modes that could evolve into a fault or a failure. From an initial state of the IPR, such as Category 2, a determination is made about the maximum probability of reaching one of the identified failure modes. Property verification in PRISM permits the verification of the models and also permits, through experiments, an estimate of when in time a fault becomes certain.
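As a rough illustration of what such a PCTL maximum-probability query (e.g., of the form Pmax=? [ F<=T "failure" ]) computes, the following sketch evaluates the probability of reaching an absorbing failure state within T steps for a small Markov chain. The three-state chain and its hourly transition probabilities are invented for illustration, not taken from the IPR model:

```python
# Toy 3-state Markov chain: NORMAL may drift into FAULT or FAILURE;
# FAILURE is absorbing. Transition probabilities are illustrative.

NORMAL, FAULT, FAILURE = 0, 1, 2
P = [
    [0.999, 0.0009, 0.0001],  # from NORMAL
    [0.0,   0.995,  0.005],   # from FAULT
    [0.0,   0.0,    1.0],     # FAILURE is absorbing
]

def prob_failure_within(t_steps):
    """Probability of being in FAILURE within t_steps, starting in NORMAL."""
    dist = [1.0, 0.0, 0.0]
    for _ in range(t_steps):
        dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
    return dist[FAILURE]
```

Plotting this probability against the time bound reproduces the kind of experiment PRISM performs when sweeping T: the curve rises monotonically toward 1, indicating when in time a fault becomes (almost) certain.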

Results and Discussion
PRISM [47] was used to validate the model quantitatively.
These experiments were performed using a PBN representation of the IPR. Its main components (router, software, and breakers) were modeled, and their interrelationships are expressed as Boolean functions, or predictors. These components constitute the PBN's nodes, whose outputs give the overall state of the device. In the experiments, time is expressed in hours (h). A reward is assigned to the interaction of the PBN agent with its environment: a reward of '1' has been assigned to the state in which all components of the IPR are operating correctly. In this way, the agent obtains a feedback signal based on its actions.
The main objective of the PBN-RL agent is to remain in a normal operating state throughout its operation.
We performed reward-based property experiments to test the model's capacity to emulate the standard RL cycle. We studied the agent's actions in the environment both combined and individually. The experiments assess the model's capacity to perform the standard Reinforcement Learning cycle, in which an agent interacts with its environment and, upon performing actions, receives feedback in the form of a reward or cost.
Thus, a PBN-based model was established in PRISM as an MDP, with a module for the environment and a module for the PBN with its predictors. PBNs used as GRN models have been expressed and developed as MDPs [1, 2, 5, 19, 37, 38]. The actions of the model correspond to the different states the model can assume, which are correlated with the classifications previously presented. These classifications correspond to the device's different failure modes, as per the reliability analysis. In this scheme, the only missing element is to assign a reward to the actions that are to be reinforced, so that the agent can receive feedback from its environment. PRISM has a rewards structure that can be used within the model's specification to assign rewards to states or sets of states. Currently, all assigned rewards in PRISM must be positive, and therefore we do not assign penalties or costs to states or sets of states. It is possible, however, to create multiple reward structures within the same model, and with them we can analyze the effect of the different actions in the model. We are also able to conduct reward-based experiments that provide information about the maximum cumulative reward over time for a particular action or state. We have studied the effect of these rewards separately because, although we value the benefits of model checking, we are unable to assess the effect of assigning costs and rewards at the same time with this tool. We understand that the current benefits of model checking outweigh its limitations.
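The quantity PRISM computes for such a reward structure, the maximum expected cumulative reward over a time bound, can be sketched with finite-horizon dynamic programming. The two-state, two-action MDP below is invented for illustration; only the state rewards (5 for normal operation, 1 for a fault) follow the paper's scheme:

```python
# Toy MDP: state 0 = normal operation, state 1 = fault.
# T[s][a] maps an action to a list of (next_state, probability) pairs.

STATES = (0, 1)
ACTIONS = ("stay", "repair")
T = {
    0: {"stay": [(0, 0.99), (1, 0.01)], "repair": [(0, 1.0)]},
    1: {"stay": [(1, 1.0)], "repair": [(0, 0.9), (1, 0.1)]},
}
REWARD = {0: 5.0, 1: 1.0}  # state rewards, mirroring the paper's scheme

def rmax_cumulative(horizon):
    """Maximum expected cumulative reward over `horizon` steps,
    starting from the normal-operation state (backward induction)."""
    v = {s: 0.0 for s in STATES}
    for _ in range(horizon):
        v = {
            s: REWARD[s] + max(
                sum(p * v[s2] for s2, p in T[s][a]) for a in ACTIONS
            )
            for s in STATES
        }
    return v[0]
```

Sweeping the horizon reproduces the shape of the cumulative-reward experiments in the figures: the maximum expected reward grows with the time bound, nearly linearly once the process settles.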
The first experiment was performed to determine the maximum expected reward Rmax for the agent interacting with its environment and executing any of its actions. The rewards structure assigns a reward of '5' to the normal operation mode, a reward of '1' to any of the fault modes, and a reward of '0' to the failure of the IPR (a higher reward for the action that we would like the device to reinforce most). This was executed through verification of the property

Rmax=? [ C<=T ],

where Rmax is the maximum reward property operator, 'C' is the operator for cumulative reward in PRISM, and T is the time bound in hours. Figure 4 presents the results of this experiment, which assesses the maximum expected reward of the agent when interacting with its environment in the standard RL cycle.

In this experiment, the final value of the reward is 42,946. The agent performs its actions and receives feedback from the environment in the form of a reward for each action, resulting in the linear plot in Figure 4. For reference, Figure 5 presents a graph of the maximum probability of occurrence of the normal operation action, which can be approximated by a sigmoid curve. Since this property reaches 100% probability quickly, a full year of operation is not plotted.
Figure 6 presents the maximum expected reward for the normal operation action: a plot of the maximum reward obtained for the normal operation mode of the IPR over a period of one year of operation. The final value of the maximum cumulative reward for this action is 8,620, which translates into the device receiving a reward of '1' for every hour of normal operation, or 8,620 hours of normal device operation in the simulation. This is the action with the largest cumulative reward, as the set of states in this classification has the highest probability of occurrence; all the other actions have a smaller cumulative reward because they also have a lower probability of occurrence. We can adjust the scale of the maximum occurrence experiment in Figure 6 to match the time axis of Figure 5; Figure 7 shows this adjustment.
The reward is initially low in the very early hours of operation, corresponding to the early-failures (infant mortality) period, and as the device enters a steady state, its expected reward rate increases almost linearly.
As PRISM supports multiple reward structures within a single model, each reward structure needs to be identified when running an experiment, as in

R{"normop"}max=? [ C<=T ],

where the property uses the "normop" (normal operation) reward structure of the model. The rest of the experiments were executed with similar properties. Figure 8 presents a plot of the maximum cumulative reward for the failure of the IPR over a year of operation.
The final value of the maximum cumulative reward is 73, which reflects a total of 73 hours, over a period of one year, in which the device was in catastrophic failure in the simulation.
Figure 9 presents the results of a maximum expected reward experiment for the Fault 1 operating mode of the IPR over a year of operation. The maximum cumulative reward was 6, indicating a total of 6 hours in which the device was in a Type 1 fault in the simulation.
Figure 10 presents the maximum cumulative reward for the Fault 2 operation mode of the IPR over a year of operation; this was found to be 118, reflecting 118 hours in the simulation that the device spent in a Type 2 fault.
These results demonstrate that the PBN model of the IPR can correctly emulate the standard RL cycle: at every iteration in time, the variables' states are assessed and compared with the failure modes of the device, and a reward is assigned to the actions the user wants to reinforce. The results of these experiments can be exported from PRISM as a file that can later be used to analyze the model's behavior with statistical packages or machine learning tools.
In these systems, modeled with complex-systems tools, learning occurs in the fundamental sense of adaptation to change; the system adapts to survive. Complex systems self-organize into steady states, which constitute the long-term behavior of the system, and this self-organization is, in the most basic sense, a form of learning (considering learning as a type of adaptation and self-organization as an adaptive mechanism).
The evolution of these complex systems can be controlled externally through interventions [11] so that the system avoids "unhealthy" states (failures or faults).
In a standard RL model, the agent makes decisions and is connected to its environment through perception and action. Agent and environment interact over a sequence of time steps; at each step, the agent senses the environment and determines the state of its "world." Based on this information, the agent takes an action. The state of the environment constitutes the agent's input, and the action chosen by the agent becomes its output. The actions taken at each step change the state of the environment and the agent's own state. One time step later, the value of the state transition following the action taken is given to the agent by its environment as a reward. The model is an RL agent acting on its environment and receiving a reward that reinforces a certain behavior over others. Rewards were assigned to the IPR RL agent acting upon the information in its environment: a reward of '1' was assigned to the state in which all of the IPR components are operating correctly (Category 2), and no rewards were assigned to the other failure modes. A rewards-based property in PRISM was used to run an experiment in which an IPR operates continuously for a year, and we obtained the maximum reward assigned to that action. In Figure 4, the RL agent's goal is to remain in a normal operating mode. If, instead of positively rewarding the correct operation of the IPR, we rewarded one of its failure modes, such as the Category 1 failure mode, the resulting experiment would yield a maximum reward as in Figure 9. The maximum rewards obtained are directly correlated with the estimated reliability of the IPR in [4].
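The perception-action-reward loop described above can be sketched directly. Only the reward of '1' for the all-operational state follows the paper's scheme; the environment dynamics, failure probability, and action names below are invented for illustration:

```python
import random

# Minimal sketch of the standard RL interaction cycle: at each time
# step the agent observes the state, takes an action, and receives a
# reward of '1' when the device is in its healthy state (state 0).

def env_step(state, action, rng):
    # Hypothetical dynamics: a "repair" action restores normal
    # operation; otherwise a component fails with small probability.
    if action == "repair":
        return 0
    return 1 if rng.random() < 0.01 else state

def run_episode(steps, seed=0):
    rng = random.Random(seed)
    state, total_reward = 0, 0
    for _ in range(steps):
        action = "repair" if state == 1 else "operate"  # trivial policy
        state = env_step(state, action, rng)
        total_reward += 1 if state == 0 else 0          # reward '1' when healthy
    return total_reward
```

In the paper's setting the environment dynamics come from the PBN's predictors and the reward structure is evaluated by PRISM rather than simulated, but the cycle of state, action, and reward is the same.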
Therefore, this reward result is related to the occurrence of the Category 1 failure mode. Validating through experiments that PBN-modeled systems can perform the standard RL cycle is an important step towards automatic system control through Machine Learning, where control is not external but intrinsic to the system.

Conclusions and Future Work
In this work, we have studied a smart grid management device, the Intelligent Power Router, assessed its reliability, and presented a bioinspired modeling technique that uses Probabilistic Boolean Networks to create simple, logical models that exhibit complex behavior. These models self-organize into constituent Boolean networks with attractor cycles that can describe long-term system behavior. In this case, that behavior is equivalent to the different states the device's components can assume and, therefore, to the different failure modes of the device. We used PBNs to model Intelligent Power Routers with Reinforcement Learning, and we studied the models' evolution in time and how they may learn to avoid undesirable states autonomously, increasing their reliability and resilience. We proposed PBNs as a building block, together with a novel analysis technique, for problem solving in smart grid modeling through RL. We validated and verified the viability of this methodology through model checking (MC).
We believe this research leaves many areas open for future work. In particular, we have only explored modeling the standard RL cycle, but we expect it is possible to use other RL techniques, such as Q-Learning and deep Q-Learning, to endow the system with automatic machine-learning-based control of its evolution.
A more fundamental question that can be answered through the study of RL in PBN modeling is whether an alternative to artificial neural networks can be achieved using PBNs as the building block of such a structure. To achieve this, artificial neural network neurons would need to be proven equivalent to sets of PBN nodes, which have input and output states. The set of input nodes has a set of predictor functions that defines the output state (like the threshold function). The learning task would be to change the transition probabilities to select a context (constituent BN) representing the steady state in which the network must settle, where the steady states are the goals to be achieved in the system.

Figure 1 :
Figure 1: Basic elements of an IPR.

Figure 5 :
Figure 5: Maximum probability of occurrence of the normal operation mode.

Figure 6 :
Figure 6: Maximum reward obtained for the normal operation mode of the IPR in one year of operation.

Figure 9 :
Figure 9: Maximum reward obtained for the Fault 1 operation mode of the IPR in one year of operation.

Figure 8 :
Figure 8: Maximum expected reward obtained for the failure operation mode of the IPR in one year of operation.
2.6. Markov Decision Processes. Reinforcement learning problems are well modeled as Markov Decision Processes, or MDPs. Named after the Russian scientist Andrey Markov, MDPs can be viewed as RL tasks that satisfy the Markov Property. When a stochastic process satisfies the Markov Property, it is memoryless: the conditional probability distribution of its future states depends only on the present state and not on the past. MDPs are discrete-time processes; when both the state and action spaces are finite, an MDP is said to be finite. The transition function T: S × A ⟶ Π(S), where Π(S) is a probability distribution over S, models the behavior of the problem's environment; it is required by some RL algorithms, although in practice it is only available in relatively simple problems. Policies are sequences of mappings of the form Π: π₀, π₁, ..., where πₖ maps the state sₖ ∈ S to an action aₖ = πₖ(sₖ) ∈ A(sₖ). The Value Function, V^π(s), is of extreme importance in Reinforcement Learning: it estimates the expected cumulative reward of state s.
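The value function mentioned above is commonly given the standard discounted definition, which we state here for completeness:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right], \qquad 0 \le \gamma < 1,
```

where π is the policy being followed, rₜ₊ₖ₊₁ is the reward received k + 1 steps after time t, and γ is the discount factor that weights immediate rewards over distant ones.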
(i) Transmission networks: covering large areas, they make sure that most or all regions of a country are covered and provide the service. Their high voltages (in the range of 230 kV or 138 kV) allow for the minimization of losses in transmission. Different lines are bundled together at electrical power substations, and the networks eventually interconnect and feed power to distribution networks, which reach end customers.

(ii) Distribution networks: these are engineered to provide power to smaller areas and have lower voltages than transmission grids, but they are denser because they serve electricity to the final customers. Lower voltages are used for safety and security reasons and due to installation costs. They can also provide several voltage levels to different end users, through transformers.

Table 2 :
Predictors and selection probability, IPR PBN.

Figure 10: Maximum reward obtained for the Fault 2 operation mode of the IPR in one year of operation.