Resiliency Assessment of Power Systems Using Deep Reinforcement Learning

Evaluating the resiliency of power systems against abnormal operational conditions is crucial for adopting effective actions in planning and operation. This paper introduces the level-of-resilience (LoR) measure to assess power system resiliency in terms of the minimum number of faults needed to produce a system outage (blackout) under sequential topology attacks. Four deep reinforcement learning (DRL)-based agents, namely deep Q-network (DQN), double DQN, REINFORCE (Monte-Carlo policy gradient), and REINFORCE with baseline, are used to determine the LoR. Three case studies based on the IEEE 6-bus test system are investigated. The results demonstrate that the double DQN agent achieved the highest success rate and was the fastest among the agents. Thus, it can be an efficient agent for resiliency evaluation.


Introduction
The deployment of recent technologies in communication, computing, and control of smart grids can benefit both clients and electrical facilities. Energy infrastructures are inherently connected to other critical infrastructures, and an interruption of their supply can have disastrous cascading results [1]. One essential feature of today's smart grids is the ability to run resiliently when attacks, faults, and other contingencies occur.
Determining the resilience of power systems (PSs) has been a subject of concern in recent years. Stochastic and statistical analysis techniques are used to evaluate power system resilience [2]. While these techniques can aid in understanding system resilience to large-scale contingencies, they are not always appropriate for evaluating resilience in the presence of malicious sources. These methods are based on the comparatively simple DC model, which does not capture effects such as voltage collapse that may occur during a cascade. There is also a need to improve their data, sampling methods, and the range of models and effects represented [3]. Accordingly, it is essential to investigate new approaches that evaluate the resilience of the grid using more realistic and scalable AC models. Olowononi et al. [4] identified applications of machine learning (ML) algorithms in the security and resiliency of the power grid. Their aim was to survey the interactions between building a resilient grid using ML and keeping ML itself resilient when it is used in the grid.
Power system cybersecurity and ML have a wide range of interdisciplinary intersections. For instance, reinforcement learning and deep learning (DL) can be used to build smart models for malware classification, for monitoring intrusion detection and prevention systems (IDS/IPS), and for threat intelligence sensing [5].
Reinforcement learning (RL) is one of the established ML approaches [6]. RL does not depend directly on data sets; instead, an agent is placed in an unknown environment and receives feedback in the form of rewards for the actions it takes. By choosing actions that maximize the cumulative reward, the agent learns from its own experience. Compared with supervised and unsupervised learning, the agent focuses on finding an optimal policy rather than analyzing data. The environment usually has dynamics that are unknown to the agent. DL approaches provide computational models composed of multiple processing layers that learn representations of data with multiple levels of abstraction. These approaches have substantially improved visual object recognition, speech recognition, and many other domains [7]. The combination of RL with DL techniques (DRL) is most useful in problems with a high-dimensional state space, which makes it suitable for evaluating the resilience of power systems. Classical RL techniques face a difficult design issue in the choice of features, whereas DRL has been successful in difficult tasks with less prior knowledge [8]. Dick et al. [1] summarized recent advances in DL techniques for creating machine vision models and investigated current applications of this technology to improve the resiliency of critical infrastructure protection (CIP).
Several works have investigated the cybersecurity of power grids using RL and DRL. For instance, Dibaji et al. [9] considered the security of cyber-physical systems from systems and control perspectives in general and briefly discussed the possibilities of using RL and DRL for this purpose. Q-learning was proposed by Yan et al. [10] to analyze transmission grid vulnerability against sequential topology attacks and to determine critical attack sequences while taking physical system behavior into account. A modified Q-learning (termed nearest sequence memory Q-learning) was adopted by Wang et al. [11] to evaluate the threats imposed by false data injection attacks on the voltage control of a power system. Test results revealed that even if only a few substations are attacked, a voltage collapse with its consequences can occur in the system. Secure state estimation using multiagent reinforcement learning was addressed by He et al. [12] under the assumption that measurements are sent over a wireless network subject to jamming attacks. An antijamming game framework was used to determine the optimal path against an intelligent attacker. He et al. [13] considered secure state estimation with a risk-averse transmission path selection method based on RL and demonstrated how the proposed approach can improve the robustness of secure state estimation. The use of RL was discussed by Oozeer et al. [14] in a general framework of cognitive risk control for cyber-attacks in smart grids. RL was used by Chen et al. [15] to evaluate false data injection attacks against the automatic voltage control of power systems (in normal operating states). A Q-learning algorithm with nearest sequence memory was employed for online learning of the attack strategy, and the optimal attack strategy was modelled as a partially observable Markov decision process. Based on kernel density estimation, a bad data detection and correction technique was presented to reduce the disruptive influence of the attacks.
Table 1 shows some recent studies on smart grid security using RL and DRL. The novelty of this work lies in evaluating the power system's level of resilience (LoR) under sequential topological attacks/faults using DRL techniques. The framework design methodology is based on four DRL agents that are trained and optimized with the aim of determining the minimum number of faults required to black out the system. This number is used to determine the LoR for three different topologies of the IEEE 6-bus system case study under single and three-phase attack scenarios. The performance of the tested DRL agents was compared. The double DQN agent was stable and achieved the highest success rate among all agents.
Thus, it can be used for resilience studies that investigate the system's ability to withstand attacks/faults by helping system designers select the most resilient system topology. The rest of the paper is organized as follows: Section 2 illustrates the power system topologies along with the attack/fault scenarios. Section 3 presents the resiliency measure formulation and the DRL techniques. Experimental results are shown in Section 4. Section 5 summarizes the work and presents future directions. Table 2 lists the acronyms and notation used throughout the paper.

Electric Power Grid
Topology. An electrical power grid is an interconnected network for carrying electricity from producers to consumers. Electrical grids range in size from national grids serving whole countries to transnational grids spanning continents [21].
Three power system topologies were considered in this paper: PS1, PS2, and PS3, shown in Figures 1 to 3, respectively. They have identical buses, generation, and load units. Each system is a three-phase electric power system that consists of three loads (each with an active power of 70 MW), three generators (two photovoltaic (PV) generators and one swing generator, each with an active power of 50 MW), six buses, and 36 transmission lines. PS1 is an IEEE 6-bus system introduced by Kennedy [22]. PS2 was generated by altering PS1's topology, while PS3 can be described as a fully connected system in which all the RLC circuits are connected to each other.
In PS1, PS2, and PS3, the loads L1, L2, and L3 are connected to buses 4, 5, and 6, respectively, while the generators Swing, PV1, and PV2 are connected to buses 1, 2, and 3, respectively. The RLC values of the lines are also equal in all three grids. The topologies differ in their transmission line connections, which alter the potential paths of current flow.

Faults Scenarios.
Typically, a power system performs well under balanced conditions. However, the system might become unbalanced for several reasons, such as natural disturbances (e.g., earthquakes, lightning, and high-speed winds), trees falling on the lines, and insulation failure. These causes can lead to short circuits or other faults in the lines [22]. The most harmful faults in power systems are short-circuit faults because they can cause a significant increase in the electrical current. There are two types of short-circuit faults: symmetric and asymmetric [23].
In a symmetric fault, all the phases are short-circuited to each other and often to earth. Such a fault is balanced in the sense that the system remains symmetrical; in other words, the lines are displaced by an equal angle. It is the most severe type of fault, involving the largest current, yet it rarely occurs [24]. An example is the three-phase line-to-ground fault (L-L-L-G), where the fault occurs between the three phases and the ground of the system. An asymmetrical fault gives rise to asymmetrical current, that is, current that differs in magnitude and phase across the three phases of the power system. When a short circuit occurs, the current reaches its peak value rapidly and then decays exponentially with time through three states: subtransient, transient, and permanent (steady) states [25]. Examples of asymmetrical faults are the single line-to-ground (L-G) fault, the line-to-line (L-L) fault, and the double line-to-ground (L-L-G) fault. In this work, the asymmetric (L-L-G) and symmetric (L-L-L-G) faults were considered against the three topologies.

Resiliency Measure Formulation.
LoR is the measure used to capture the evolution of the system's features as its modes of operation change under a sequence of fault and recovery actions. For a number of PSs under a sequence of faults/attacks (an attack scenario), suppose the resulting system modes are represented by $Z_0 \to Z_1 \to \dots \to Z_m$, where $Z_0$ is the initial mode and $Z_h$ is the mode after the $h$th fault and reconfiguration ($h = 1, \dots, m$). A power system is more resilient if it requires a larger number of faults/attacks $N$, over all possible attack scenarios $M$, before its outage. This measure can be determined by using a reinforcement learning agent that finds the optimal number of faults.
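To make the idea of "minimum number of faults before outage" concrete, the following is a minimal sketch that finds this number by exhaustive search on a toy graph. It is not the method of the paper, which uses DRL agents and an electrical simulation. The function names, the two example layouts, and the purely topological blackout criterion (no load bus remains connected to any generator bus) are assumptions made for illustration only.

```python
from itertools import combinations


def supplied(lines, generators, loads):
    """Return the set of load buses still connected to at least one generator."""
    adjacency = {}
    for a, b in lines:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    reached, frontier = set(generators), list(generators)
    while frontier:
        bus = frontier.pop()
        for nxt in adjacency.get(bus, ()):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return reached & set(loads)


def level_of_resilience(lines, generators, loads):
    """Smallest number of line outages after which no load is supplied.

    Assumption: a blackout is purely topological (all loads cut off from
    all generators); electrical effects are ignored in this sketch.
    """
    for n in range(len(lines) + 1):
        for removed in combinations(lines, n):
            remaining = [line for line in lines if line not in removed]
            if not supplied(remaining, generators, loads):
                return n
    return None  # not reached when generators and loads are disjoint


# Hypothetical 6-bus layouts: generators on buses 1-3, loads on buses 4-6.
GENERATORS, LOADS = [1, 2, 3], [4, 5, 6]
sparse = [(1, 4), (2, 5), (3, 6)]
meshed = sparse + [(1, 5), (2, 6), (3, 4)]

print(level_of_resilience(sparse, GENERATORS, LOADS))
print(level_of_resilience(meshed, GENERATORS, LOADS))
```

The exhaustive search grows combinatorially with the number of lines, which is one motivation for learning an attack policy instead of enumerating every scenario.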

DRL Algorithms.
When the agent begins to learn, it is in a state $S$ of the environment; by selecting an action $A$, the agent can switch from one state to another. The transition probability $P$ gives the probability of the state at which the agent will arrive. When the agent takes an action, the environment delivers a reward $R$ as feedback. The model describes the reward function and the transition probabilities. The agent's policy $\pi(S)$ specifies which action is best to take in a given state with the aim of maximizing the cumulative reward. Every state is associated with a value function $V(S)$ that predicts the expected amount of future reward the agent will obtain from this state by acting under the corresponding policy. The future reward (also called the return) $G_t$ is the total sum of discounted rewards in the future:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma \in [0, 1]$ is the discount factor, which penalizes rewards that lie further in the future; values of $\gamma$ close to 1 make the agent weigh future rewards almost as heavily as the immediate reward. Both the policy and the value function are what the agent tries to learn in RL. The interaction between the agent and the environment involves a sequence of actions and rewards evolving in time $t = 1, 2, \dots, T$, where $T$ is the time step at which the terminal state is reached. During this process, the agent gathers information about the environment and decides which action to take next in order to learn the best policy. The state, action, and reward at time step $t$ are denoted $S_t$, $A_t$, and $R_t$, respectively. The full interaction sequence therefore forms one episode (trajectory) that ends at the terminal state:

$$S_1, A_1, R_2, S_2, A_2, \dots, S_T.$$

DQN was introduced by Mnih et al. [26] through a combination of Q-learning with a function approximator (a neural network) to overcome the tabular limitation of Q-learning.
The algorithm was tested on Atari games, where the agent achieved human-level performance. The inputs were the raw pixels of the game, so the same agent could learn multiple games with no need for special processing of the inputs. Earlier attempts to combine Q-learning with function approximators were not successful due to the deadly triad issue [27], where the model suffered from instability and divergence.
This issue was solved by improving and stabilizing the training procedure of Q-learning using two mechanisms: experience replay and a periodically updated target. Here, DQN is a neural network model that receives states as inputs and produces action values $Q(S, A; \theta)$ for network parameters $\theta$. Each episode step $e_t = (S_t, A_t, R_t, S_{t+1})$ is stored in a replay memory $D_t = \{e_1, \dots, e_t\}$, which holds experience tuples collected over many episodes. During Q-learning updates, samples are drawn at random from the replay memory (experience replay), so one sample can be used many times. This reduces the correlation between samples, which helps the network learn without overfitting to recent experience. Moreover, experience replay reuses old experience, which results in smoother learning and more efficient use of the collected tuples.
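The replay memory described above can be sketched with the Python standard library. This is an illustrative sketch only, not the implementation used in the paper; the class name `ReplayMemory`, its method names, and the optional `seed` argument are assumptions.

```python
import random
from collections import deque, namedtuple

# One experience tuple e_t = (S_t, A_t, R_t, S_{t+1}).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Fixed-capacity buffer; the oldest experience is dropped when full."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive steps of the same episode.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3000, seed=0)
for step in range(100):
    memory.store(state=step, action=step % 4, reward=-1.0, next_state=step + 1)
batch = memory.sample(batch_size=64)
print(len(memory), len(batch))
```

Because sampling is independent of insertion order, the same transition can appear in many mini-batches over the course of training.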
In the periodically updated target, DQN keeps a copy of the network with an identical architecture, initialized with the same parameters (weight values). The Q values predicted by the target network are used to update the main Q-network. The target network's parameters are not trained like those of the main network; instead, they are periodically synchronized with the parameters of the main Q-network. The idea is to serve the same goal as the experience buffer, reducing correlation, by using different parameters: $\theta$ for the main Q-network and $\theta^-$ for the target network. The Q values are thus optimized toward the target values, which has been shown to stabilize learning. Here, the target network with parameters $\theta^-$ is the same as the main Q-network except that its parameters are copied every $C$ time steps. $C$ was chosen to be two steps, so that $\theta^-_t = \theta_t$ at every second step, and $\theta^-$ is kept fixed at all other steps. The goal of the main Q-network is to produce an estimate of the Q value for each action that can be taken from a given state, but the objective is to find optimal Q values that satisfy the Bellman optimality equation:

$$q_*(S, A) = \mathbb{E}\left[R_{t+1} + \gamma \max_{A'} q_*(S', A')\right].$$

For any state-action pair $(S, A)$ at time $t$, the expected return from starting in state $S$, selecting action $A$, and following the optimal policy thereafter, $q_*(S, A)$, equals the expected reward $R_{t+1}$ obtained from taking action $A$ in state $S$ plus the maximum expected discounted return that can be achieved from any possible next state-action pair. Since the agent follows an optimal policy, the next state $S'$ is the state from which the best possible next action $A'$ can be taken at time $t + 1$, and the term $\max_{A'} q_*(S', A')$ is output by the target network.
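The periodic synchronization and the Bellman target can be sketched as follows. This is a simplified illustration in which network parameters are stood in for by a plain dictionary; the names `TargetSync` and `td_target`, and the explicit `done` flag for terminal states, are assumptions and not part of the paper.

```python
import copy


class TargetSync:
    """Keeps a frozen copy of the online parameters, refreshed every `period` steps."""

    def __init__(self, online_params, period):
        self.period = period
        self.steps = 0
        self.target_params = copy.deepcopy(online_params)

    def step(self, online_params):
        """Call once per training step; returns the parameters theta^- to use."""
        self.steps += 1
        if self.steps % self.period == 0:
            self.target_params = copy.deepcopy(online_params)
        return self.target_params


def td_target(reward, next_q_target, gamma, done):
    """Bellman target R + gamma * max_a' Q(S', a'; theta^-), or R at a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


online = {"w": [0.0]}
sync = TargetSync(online, period=2)  # C = 2, as stated in the text
online["w"][0] = 1.0                 # pretend a gradient step changed theta
print(sync.step(online)["w"])        # step 1: target still holds the old value
print(sync.step(online)["w"])        # step 2: target is synchronized
print(td_target(-1.0, [0.5, 2.0], gamma=0.9, done=False))
```

Using `deepcopy` matters here: a shallow reference would make the target track every update of the online parameters and defeat the purpose of freezing it.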
The target-network output is eventually used to calculate the loss of the main Q-network, which compares the Q values generated by the main Q-network with the target Q values from the right-hand side of the Bellman equation; the objective is to minimize this loss. After the loss is calculated, the parameters $\theta$ of the main Q-network are updated via stochastic gradient descent (SGD) and backpropagation. This process is repeated over the states visited in the environment until the loss is minimized and an approximately optimal Q value is reached:

$$L(\theta) = \mathbb{E}\left[\left(q_*(S, A) - Q(S, A; \theta)\right)^2\right],$$

which can be rewritten as

$$L(\theta) = \mathbb{E}\left[\left(R_{t+1} + \gamma \max_{A'} Q(S', A'; \theta^-) - Q(S, A; \theta)\right)^2\right].$$

However, DQN tends to overestimate Q values. The overestimation is normally caused by the Q value update rule, which takes the maximum Q value of the new state. Therefore, double DQN was proposed by van Hasselt et al. [28] to overcome the overestimation of DQN. Double DQN improves the Q value update rule by selecting the action that corresponds to the maximum Q value of the current Q-network rather than using the maximum Q value of the target Q-network.
To make sure that the action selected for the next state is the one with the highest value function (highest Q value), the current Q-network is used to find the best action with the highest Q value ($A_{\max}$), and then the target network is used to calculate the target Q value ($Q^-$) of taking this action in the next state:

$$A_{\max} = \arg\max_{A'} Q(S', A'; \theta), \qquad Q^- = R_{t+1} + \gamma\, Q(S', A_{\max}; \theta^-),$$

where $\theta$ and $\theta^-$ are the parameters of the current and target networks, respectively. DQN and double DQN are concerned with learning a state-action value (Q value) function and then selecting actions based on this value, so the Q value indirectly evaluates the policy that the agent follows. On the other hand, policy gradient methods learn the policy $\pi$ directly as a parameterized function $\pi_\theta(A \mid S)$ with respect to $\theta$, where the value of the objective function depends on the policy. Thus, the goal of the algorithm is to optimize $\theta$ to determine the optimal value of the function $\pi_\theta(A \mid S)$.
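The difference between the DQN and double DQN targets discussed above can be sketched for a single transition. The function names and the numeric Q values below are hypothetical, and terminal-state handling is omitted for brevity.

```python
def dqn_target(reward, q_next_target, gamma):
    """DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, q_next_online, q_next_target, gamma):
    """Double DQN: the online network selects A_max, the target network evaluates it."""
    a_max = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return reward + gamma * q_next_target[a_max]


# Hypothetical Q values for three actions in the next state S'.
q_online = [1.0, 3.0, 2.0]   # Q(S', . ; theta)
q_target = [0.5, 1.0, 4.0]   # Q(S', . ; theta^-)

print(dqn_target(-1.0, q_target, gamma=0.9))
print(double_dqn_target(-1.0, q_online, q_target, gamma=0.9))
```

In this example the target network's largest value (4.0) belongs to an action the online network does not prefer, so the double DQN target is lower than the DQN target. Decoupling selection from evaluation in this way is what damps the overestimation.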
REINFORCE [29] (Monte-Carlo policy gradient) is a model-free, online, on-policy reinforcement learning technique. REINFORCE relies on a return estimated by Monte-Carlo methods from episode samples to update the policy parameter $\theta$. Policy gradient methods learn a policy function directly (instead of a Q function). On-policy means that REINFORCE learns from trajectories generated by the current policy. The objective function for policy gradients is defined as follows:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{k=t}^{T-1} R_{k+1}\right].$$

A useful way to learn an approximate policy is to maximize the expected reward directly using a gradient method (i.e., policy gradient), which uses the gradient of the expected reward with respect to the parameters. The objective function $J$ is used to learn a policy that maximizes the cumulative future reward $R$ received from any given time $t$ until the terminal time $T$. The policy optimization process uses gradient ascent with the partial derivative of the objective with respect to the policy parameter $\theta$ to maximize the objective function:

$$\theta \leftarrow \theta + \alpha \frac{\partial J(\theta)}{\partial \theta},$$

where $\alpha$ is the learning rate. REINFORCE works because the expectation of the sample gradient is equal to the actual gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[G_t \nabla_\theta \ln \pi_\theta(A_t \mid S_t)\right].$$

Here, one can measure $G_t$ from real sampled full trajectories and use it to update the policy gradient. A commonly used modification of REINFORCE is to subtract a baseline value from the return $G_t$ to decrease the variance of the gradient estimate while keeping the bias unchanged. For example, a common baseline is to subtract the state value from the action value; if adopted, one could use the advantage $\delta(S, A) = Q(S, A) - V(S)$ in the gradient ascent update. During each training episode, the agent generates episode experience by following the actor policy $\mu(S)$. The agent takes actions until it arrives at the terminal state $S_T$. The episode experience consists of the sequence $S_1, A_1, R_2, S_2, \dots, S_{T-1}, A_{T-1}, R_T, S_T$. Then, the agent calculates the return $G_t$ for each time step.
If a baseline is used, the advantage function $\delta_t$ is calculated using the baseline value function estimated by the critic:

$$\delta_t = G_t - V(S_t; \theta_v).$$

In fact, the REINFORCE-with-baseline technique learns both a policy and a state-value function, but according to Sutton et al. [29], it is not considered an actor-critic method because the state-value function is used only as a baseline, not as a critic.
This means that the critic is not used for bootstrapping, that is, updating the value estimate for a state from the estimated values of subsequent states. Instead, REINFORCE with baseline applies the state-value function only as a baseline for the state whose estimate is being updated. Afterwards, in REINFORCE with baseline, the agent accumulates the gradients for the actor network and the critic network as represented by equations (11) and (12):

$$d\theta_\mu = \sum_{t=1}^{T-1} \delta_t \nabla_{\theta_\mu} \ln \mu(A_t \mid S_t; \theta_\mu), \tag{11}$$

$$d\theta_v = \sum_{t=1}^{T-1} \delta_t \nabla_{\theta_v} V(S_t; \theta_v). \tag{12}$$

Finally, the agent updates the actor parameter $\theta_\mu$ and, when a baseline is used, the state-value parameter $\theta_v$, as shown by equations (13) and (14), respectively, where $\alpha$ and $\beta$ are the learning rates:

$$\theta_\mu = \theta_\mu + \alpha\, d\theta_\mu, \tag{13}$$

$$\theta_v = \theta_v + \beta\, d\theta_v. \tag{14}$$
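The REINFORCE-with-baseline procedure described above (compute returns, subtract the baseline, accumulate gradients, then update both parameter sets) can be sketched with a tabular softmax actor and a tabular baseline in place of neural networks. The tabular parameterization, the function names, and the two-step example episode are assumptions for illustration; the paper itself uses neural networks for both the actor and the baseline.

```python
import math


def discounted_returns(rewards, gamma):
    """G_t for every step, computed backwards from the end of the episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_baseline_update(theta, v, episode, gamma, alpha, beta):
    """One episode update. `episode` is a list of (state, action, reward) steps."""
    states, actions, rewards = zip(*episode)
    returns = discounted_returns(rewards, gamma)
    d_theta = {s: [0.0] * len(theta[s]) for s in theta}
    d_v = {s: 0.0 for s in v}
    for s, a, g in zip(states, actions, returns):
        delta = g - v[s]                       # advantage: return minus baseline
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # Gradient of ln softmax w.r.t. preference b: 1[b == a] - pi(b | s).
            d_theta[s][b] += delta * ((1.0 if b == a else 0.0) - probs[b])
        d_v[s] += delta                        # dV(s)/dv[s] = 1 for a table
    for s in theta:                            # actor step with learning rate alpha
        theta[s] = [p + alpha * d for p, d in zip(theta[s], d_theta[s])]
    for s in v:                                # baseline step with learning rate beta
        v[s] += beta * d_v[s]
    return returns


theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}   # action preferences per state
v = {0: 0.0, 1: 0.0}                      # baseline value per state
episode = [(0, 1, -1.0), (1, 0, -1.0)]    # hypothetical two-step episode
print(reinforce_baseline_update(theta, v, episode, gamma=0.9, alpha=0.005, beta=0.005))
print(theta, v)
```

All advantages are computed with the baseline values from before the update, and the baseline is never used to estimate the value of a later state, which is the sense in which no bootstrapping takes place.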

Agents Features.
To train the agents, the topological line states were given as inputs (also called observations).
Likewise, in every time step $t$, the agent selects one line to fault out of the $I$ or $K$ possible faults, where $A_t(I \lor K) = 1$. Once a fault is selected, the faulted line is disconnected, and the current is rerouted through other available paths (if any exist) toward the loads. The reward function $R$ is defined so that the agent receives a negative reward each time step it selects a line to attack. Because each fault adds one time step, the number of faults needed to cause an outage of the system equals the number of time steps in the episode, and maximizing the cumulative reward minimizes that number. During an episode, the actions taken by the agent are stored in a buffer $W_t = (A_1, A_2, \dots, A_{T-1})$. The current action $A_t$ taken by the agent at time $t$ is compared against $W_t$ to prevent the agent from repeating an action that was already taken in the episode. In this way, the agent is trained to determine the minimum number of faults required to black out the system.
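The agent-environment interface described above can be sketched as a small toy environment. This is not the Simscape Electrical model used in the paper: the class name `LineAttackEnv`, the reward value of -1 per step, the purely topological blackout test, and the three-line example network are all assumptions made for illustration.

```python
class LineAttackEnv:
    """Toy topological environment; observations are line states (1 = in service)."""

    def __init__(self, lines, generators, loads):
        self.lines = list(lines)
        self.generators = set(generators)
        self.loads = set(loads)
        self.reset()

    def reset(self):
        self.in_service = [1] * len(self.lines)
        self.taken = []  # buffer of actions already used in this episode
        return tuple(self.in_service)

    def valid_actions(self):
        """Lines that have not been faulted yet (masks repeated actions)."""
        return [i for i in range(len(self.lines)) if i not in self.taken]

    def _blackout(self):
        # Assumed criterion: no load bus is reachable from any generator bus.
        reached, frontier = set(self.generators), list(self.generators)
        while frontier:
            bus = frontier.pop()
            for (a, b), up in zip(self.lines, self.in_service):
                if up and bus in (a, b):
                    other = b if bus == a else a
                    if other not in reached:
                        reached.add(other)
                        frontier.append(other)
        return not (reached & self.loads)

    def step(self, action):
        if action in self.taken:
            raise ValueError("line already faulted in this episode")
        self.taken.append(action)
        self.in_service[action] = 0
        done = self._blackout()
        return tuple(self.in_service), -1.0, done  # assumed reward: -1 per fault


# Hypothetical network: three generators (buses 1-3) feeding one load (bus 4).
env = LineAttackEnv([(1, 4), (2, 4), (3, 4)], generators=[1, 2, 3], loads=[4])
env.reset()
total, done = 0.0, False
while not done:
    _, reward, done = env.step(env.valid_actions()[0])
    total += reward
print(len(env.taken), total)
```

With a constant negative reward per step, the episode return is the negative of the number of faults, so a policy that maximizes the return is one that reaches the blackout with the fewest faults.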

Networks Parameters.
The DQN and double DQN agents were implemented by first defining critic networks that take the observations as inputs. Each critic network has two hidden layers with 24 neurons each; each hidden layer is followed by a rectified linear unit (ReLU) activation function and feeds the output layer, which produces the Q value for each defined action. The optimizer for the critic network is Adam, with a learning rate of 0.001. The gradient threshold was set to 1 to prevent gradient explosion when the network backpropagates to update its weights.
Gradient explosion usually happens when the gradients increase exponentially in magnitude, which makes training unstable and can cause divergence within a few iterations. Gradient clipping prevents gradient explosion, stabilizing training at higher learning rates and in the presence of outliers. It enables networks to be trained faster and usually does not affect the accuracy of the learned task [30].
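Gradient clipping with a threshold can be sketched as follows. The text does not state which clipping rule was used, so rescaling by the L2 norm is an assumption here, and the function name `clip_by_l2_norm` is hypothetical.

```python
import math


def clip_by_l2_norm(gradients, threshold=1.0):
    """Rescale the gradient vector so that its L2 norm does not exceed `threshold`.

    Assumption: norm-based clipping; the direction of the gradient is preserved.
    """
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= threshold:
        return list(gradients)
    scale = threshold / norm
    return [g * scale for g in gradients]


print(clip_by_l2_norm([3.0, 4.0], threshold=1.0))   # norm 5 is scaled down to 1
print(clip_by_l2_norm([0.3, 0.4], threshold=1.0))   # norm 0.5 is left unchanged
```

Because only the magnitude is limited, a single outlier batch cannot produce an arbitrarily large parameter step, while small gradients pass through untouched.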
Adding a regularization term (L2 regularization factor) for the weights to the loss function is one way to reduce overfitting [31]. Another parameter needed to train the agent is the experience buffer, which was given a size of 3000 since the model is relatively small. The agent computes updates using mini-batches of 64 experiences randomly sampled from the buffer; this size is large enough to reduce the variance of the gradient estimates, at the cost of more computation per update. The discount factor applied to future rewards during training is 0.9. The REINFORCE agent is composed of an actor that has two hidden layers with 24 neurons each, and each hidden layer is followed by a ReLU activation function. Likewise, the REINFORCE with baseline agent was constructed from an actor and a baseline network. The baseline has two hidden layers with 24 neurons each and ReLU activations. As with the DQN agent, the gradient threshold was set to 1, and an Adam optimizer was used with a discount factor of 0.9. Several learning rates were tried for the two REINFORCE agents, and 0.005 was found to produce better results.
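For reference, the hyperparameters stated above can be collected in a small configuration object. The class and field names are hypothetical, and setting the replay and target-update fields to `None` for the REINFORCE agents is an assumption based on their on-policy nature; the paper does not list those fields for them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AgentConfig:
    hidden_layers: Tuple[int, ...] = (24, 24)   # two hidden layers, 24 neurons each
    activation: str = "relu"
    optimizer: str = "adam"
    learning_rate: float = 0.001
    gradient_threshold: float = 1.0
    discount_factor: float = 0.9
    replay_capacity: Optional[int] = 3000
    minibatch_size: Optional[int] = 64
    target_update_period: Optional[int] = 2     # C = 2 steps


# Values for DQN and double DQN as stated in the text.
DQN_CONFIG = AgentConfig()

# REINFORCE variants: same layer sizes, learning rate 0.005.
# Assumption: no replay buffer or target network for these on-policy agents.
REINFORCE_CONFIG = AgentConfig(
    learning_rate=0.005,
    replay_capacity=None,
    minibatch_size=None,
    target_update_period=None,
)

print(DQN_CONFIG)
print(REINFORCE_CONFIG)
```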

Experimental Results
The four agents (DQN, double DQN, REINFORCE, and REINFORCE with baseline) were implemented in the Simulink (Simscape Electrical) environment for the three topologies under the L-L-L-G and L-L-G fault scenarios. The results for the symmetrical L-L-L-G fault scenarios are shown in Figure 4, which plots the training progress of the four agents as success rate against the number of episodes. Each episode describes a scenario of line outages that the agent applies to cause a complete system blackout. The DQN agent found a policy that blacks out the three topologies with a high success rate; it also learned faster than the other agents and was stable during learning. The double DQN agent was slightly slower at the start of training but was later stable and achieved a higher success rate than the DQN agent in the three topologies. In contrast, REINFORCE and REINFORCE with baseline were slower to learn. REINFORCE failed to converge in all three topologies and showed many spikes, which indicates that the agent was not stable during training. REINFORCE with baseline stabilized in PS3, but in PS1 and PS2 it improved only slowly, which suggests that it could converge to an optimal policy if trained for more episodes. The agent cannot explore the state-action space efficiently, so it takes longer to find a good policy. It is worth mentioning that attempts to optimize the REINFORCE agent by adjusting the learning rate and the number of hidden neurons in the actor network were not sufficient to stabilize learning or to find an optimal sequence of actions. Table 3 shows the minimum number of faults needed to black out the three systems PS1, PS2, and PS3, as determined by the four agents.
It can be seen that the double DQN agent was able to find a sequence of actions that results in system outage with a smaller number of faults than the other agents in PS2.
Following Definition 1, the results indicate that the third topology, PS3, is the most resilient, as it required 7 faults to black out the system. This is because PS3 has more redundant paths, so even if a line is faulted, the current can still flow through other paths toward the intended load.
For the second case of asymmetrical L-L-G fault scenarios, the results are illustrated in Figure 5. The results demonstrate that the double DQN agent achieved a higher success rate and was faster than the other agents. It was also capable of finding the optimal number of faults for PS1, while the other agents could not.
The results also show that the REINFORCE agent failed once again to determine the optimal number of faults for the three topologies. In addition, the agent was not stable, and its success rates declined in PS1 and PS2. REINFORCE with baseline improved as in the symmetrical fault scenarios but needed more training episodes to converge. The DQN agent behaved similarly to the double DQN agent but could not find the optimal number of faults in PS1.

Conclusion
A new measure, the LoR, was proposed for comparing the resiliency of PSs under attacks/faults. This measure is based on comparing the minimum number of faults that cause a system outage, determined by reinforcement learning approaches. The reinforcement learning agents were DQN, double DQN, REINFORCE (Monte-Carlo policy gradient), and REINFORCE with baseline.
The LoR values of three different PS topologies under symmetrical and asymmetrical fault scenarios were compared. Experimental results showed that although the three PSs have the same set of generators and the same set of loads, they had distinct resiliency levels due to their topological differences. The multiple paths present in the PS3 topology allowed the generation side to keep supplying the loads' demand. The results also showed that the double DQN agent was stable and achieved the highest success rate among all agents, as opposed to the REINFORCE agent, which failed to determine the minimum number of faults for the three topologies under both symmetrical and asymmetrical faults. In this work, the agents were trained for a certain number of observations (current flow paths and line availability states) and possible attack/fault actions for three IEEE 6-bus topologies. However, investigating the LoR for other PS topologies requires defining and training new agents with new observations and actions. As future work, other factors such as recovery time and stability need to be investigated, and the LoR of more topologies should be checked to determine the most resilient PS design. In addition, further development of resiliency enhancement can be obtained through the adoption of DL and decision-making techniques.

Conflicts of Interest
The authors declare that they have no conflicts of interest.