Cost-Effective Energy Usage in a Microgrid Using a Learning Algorithm

The microgrid is a new concept for integrating distributed energy resources (DER) within the grid. Managing heterogeneous sources of energy is a challenge, especially as most DER are unpredictable. Besides, implementing microgrids should be economically beneficial to the customer, which raises the challenge of decreasing costs while ensuring the energy balance. In this paper, we use a stochastic approach based on a model-free Markov decision process (MDP) to derive the optimal strategy for the home energy management system. The approach aims to decrease the energy bill while taking into account the intermittency of the renewable energy resources (RER) and other constraints. While other proposals charge the battery from the utility energy, making the state of charge (SOC) of the battery a deterministic variable, our work adopts a scenario where the battery is charged from the excess of the generated energy, which makes the SOC a nondeterministic variable affected by the uncertain character of the renewable energy. Therefore, our model considers randomness at two levels: the renewable energy level and the battery SOC level. We take into account the complexity of the solution and propose a simple strategy that can be implemented easily in microgrids.


Introduction
The electric grid is one of the big consumers of fuel. This type of energy source is exhaustible and is expected to run out in the near future. Electricity suppliers across the world are now searching for alternatives to compensate for it and are taking solid steps to incorporate new technologies into many aspects of the grid.
With the rise of new green technologies such as PV panels, wind turbines, and electrical batteries, new ways of consuming energy are emerging. In particular, the combination of such technologies with an information infrastructure created the concept of the smart grid [1]. Though it does not yet have a standardized definition or a defined architecture, the smart grid can be seen as the innovation that will transform the electric grid from a centralized, producer-controlled grid into a distributed, consumer-driven one. The opportunities offered are numerous, and the integration of the information and communication technology (ICT) infrastructure leads to rich content and multiple usage scenarios. In fact, with the help of the ICT infrastructure, more information is available, which enables efficient control and monitoring of the grid system.
The home energy management system (HEMS) is one of the research fields associated with the smart grid concept. Its core topic is how to manage energy efficiently and optimally using the new energy resources and taking advantage of the ICT infrastructure. Several distributed architectures of the HEMS are suggested in the literature; some of them propose a multiagent system [2,3] to monitor the different technologies (power electronics, telecommunications, generation, and energy storage systems) that compose the energy management system [4]. Others propose architectures that manage the energy flow at the substations; the locally generated energy is sold to the utility instead of being consumed locally. The utility then integrates control strategies at the substations to prioritize the DER and make them the first energy supply for the customers [5,6]. Other works have suggested the use of the smart microgrid architecture (e.g., [7][8][9][10]). The US Department of Energy (DOE) defines the microgrid as "a group of interconnected loads and distributed energy resources (DER) with clearly defined electrical boundaries that acts as a single controllable entity with respect to the grid [and can] connect and disconnect from the grid to enable it to operate in both grid-connected or island mode." In the microgrid, the management of the heterogeneous sources of energy is done locally, which presents a challenge, especially as most of the DER are unpredictable. Besides, implementing microgrids should be economically beneficial to the customer, which raises the challenge of decreasing the costs while ensuring the energy balance.
As mentioned above, the microgrid normally works in an "on-grid" mode (or connected mode). In this mode, the microgrid is connected to the utility grid. It can either inject the locally generated energy into the utility grid or draw energy from it. When a malfunction occurs in the utility grid, the microgrid disconnects from it and switches to the autonomous mode (off-grid mode, islanded mode). In this mode, the microgrid works autonomously and uses the local energy only. In previous works [11,12], we proposed a load shedding/shifting strategy for the off-grid and the on-grid mode, respectively. The strategy proposed for the off-grid mode aims to reschedule the home loads efficiently to meet the limited amount of energy available during the off-grid switching, whereas in [12] we suggested a strategy for the on-grid mode that aimed to reduce the energy bill. The optimization strategy was modelled as an MDP, and the model was solved using the value iteration algorithm. We assumed that the transition function and the reward function were known and used sample values in the simulation. However, in more realistic scenarios, we cannot predict those functions. Notably, a microgrid system relies on a complex stochastic structure, which is not amenable to closed form analysis. On the one hand, renewable energy depends on highly random natural phenomena. Research has shown that even the most robust forecasting models yield a forecast error of 10% when the time scale of forecasting exceeds 6 hours. On the other hand, the underlying structure and probability distributions of the consumer data are based on the preferences and behaviour of individual consumers, which are generally unknown. Therefore, it is not possible to find a closed form expression for the transition function. This paper is an improvement of [12]; it models the system as a model-free MDP, and thus does not need to specify the transition function, and it solves the model by reinforcement learning. It proposes a strategy that returns an optimal charging/discharging cycle of the battery while considering the intermittency of the renewable energy resources (RER), reducing the utility energy consumption during peak hours, and ensuring the user comfort.
The rest of the paper is organized as follows. Section 2 presents a literature review of the existing solutions. Section 3 presents the microgrid architecture on which we work. The decision making approach is formulated as an MDP model in Section 4, and its resolution is presented in Section 5. The simulation results and the evaluation of the proposed solution are given in Section 6. The final section concludes this paper and discusses future works.

Related Work
A variety of energy management applications have been addressed for the residential side. Many papers have treated the problem of efficiency and optimality in microgrid systems [13,14] and explored the role of Demand Side Management (DSM) in reducing the electrical energy consumption, minimizing the energy cost, and reducing environmental degradation. Several papers consider the optimal exploitation of renewable energy [5,15,16]. Some of them investigate cost-efficient energy scheduling for residential microgrids equipped with a centralized renewable energy source [5,15]. Others focus on maintaining a continuous energy balance by ensuring uninterrupted energy delivery to end users [16]. Another approach to managing microgrids is load shifting. This approach consists of separating the residential loads into different classes (e.g., shiftable/nonshiftable loads [17], priority based classification [18]) and rescheduling the loads to achieve significant savings while ensuring the user comfort. These works introduce two shifting techniques: day-ahead load scheduling and real-time load scheduling. The scenarios proposed for demand shifting are multiple and include electrical storage shifting (e.g., batteries, electrical vehicles) [17,19], thermal storage shifting [17,20], and electrical equipment shifting [18]. Soares et al. [21] combined these approaches. They present an evolutionary algorithm that optimizes the usage of heterogeneous energy resources (local generation and storage systems) and proposes an optimal strategy for the loads (shiftable loads and thermostatically controllable loads). To this end, a multiobjective model was developed to minimize both the energy cost and the end-user's dissatisfaction associated with the suggested management strategies.
Other studies reduce the optimality problem to the battery and focus on the optimal battery charging/discharging cycles [22][23][24]. In fact, if used optimally, that is, charged or discharged at the right moment, the battery can ensure an optimal performance of the microgrid.
The authors of [22,23] suggested a scenario in which the battery can be charged from the utility grid only; the power flow from the battery to the grid is not permitted. The authors of [22] studied the control of the energy storage under price fluctuations and then derived the structure of the cost-minimizing storage policy using a model-based MDP framework. In [23], the system was solved using model-free reinforcement learning. The objective of the optimal control in that paper was to find the optimal battery charging/discharging/idle control law that minimizes the total expense of the power from the grid while considering the battery limitations. Li and Jayaweera [24] proposed a scenario that consists of storing, fully or partially, the positive excess energy of a customer for its own future use or selling the excess to the utility, taking into account possibly time-variant pricing information. The authors developed an HMDP (hidden MDP) model for customer real-time decision making.
In this paper, we discuss another scenario where the flow of energy from the microgrid to the utility is not allowed, and the battery is charged from the excess of the generated energy. Thus, the battery SOC becomes a nondeterministic variable, affected by the uncertain character of the renewable energy. Therefore, our model takes into account randomness at two levels: the renewable energy level and the battery SOC level.
Our approach aims to enhance cost savings while taking into account the following constraints: (1) the optimal exploitation of the battery, which is charged by the renewable energy, (2) the management of the utility energy, especially during peak hours, (3) the maintenance of the energy balance, (4) the preservation of the user comfort, and (5) the integration of a dynamic pricing method.
We take into account the complexity of the solution, and we reduce the resolution to a simple strategy that can be implemented easily in the microgrid.

Smart Microgrid Architecture
A microgrid system can serve different electricity customers: a house, a residence, a small industry, and so on. We consider an on-/off-grid architecture as defined in Figure 1. The prosumer is connected to two types of energy resources: the renewable energy resource (RER) and the utility energy. The distributed RER is the primary energy supply, and the utility grid is the backup energy supply [3]. The energy generated locally is consumed directly without being injected into the grid. Local energy supply isolates the prosumer appliances from the line and generators; this reduces the loss due to energy transport and provides benefits such as the reduction of signal distortion [5] and the reduction of the costs of maintenance and management operations. Furthermore, the direct consumption of the locally generated energy ensures the flexibility of the microgrid by allowing its operation under two different modes: the connected mode and the islanded mode.
We decided not to allow the energy flow from the microgrid to the utility, since this is the most common case in a great number of countries where power injection into the utility grid is not yet legalized. In fact, the power grid as currently implemented in most developed countries has a hierarchical structure: the centralized power plants, the HV stations (transmission networks), the MV stations (distribution networks), and the low-voltage (220/230 V) lines that run into our houses. The transformer stations that connect the different voltage levels often have some kind of decoupling mechanism that prevents the bidirectional flow of energy, especially from a lower level to a higher level, as it would flow through a linear transformer. In particular, this case is suitable for Morocco, and the study is applied to the microgrid implemented at the institute. The bidirectional energy flow between the microgrid and the utility grid can nevertheless be treated, but only at the theoretical level, and will be the subject of further research.
In order to take full advantage of the RER, we integrate an ESS (Energy Storage System) into the model. The ESS is used to store the excess generated energy and works as a source of electricity supply in case the local energy is not sufficient to meet the load demand. This component helps increase the autonomy rate of the system and reduce the dependence of the prosumer on the utility grid. In the off-grid mode, the ESS is considered as the backup supply that maintains the energy balance. When there is an excess of local energy, that is, if the locally generated energy feeds the load and fully charges the battery, the RER is disconnected so as not to harm the house equipment.
We integrate a HEMS infrastructure within the electric grid. The HEMS is equipped with a smart meter, which is a component of the AMI (Advanced Metering Infrastructure), to enable two-way communication. On the one hand, the smart meter ensures the communication from the HEMS to the control center by transmitting the detailed measurement of the electricity consumption of a household. On the other hand, it serves as a gateway that transmits the information about the utility prices from the control center to the HEMS. The smart meter spots the power losses and helps identify the devices that consume the most electricity. The data collected by the smart meter is transferred to the Supervisory Control and Data Acquisition (SCADA) Center for storage and processing. The data gathered in the SCADA can be used for many purposes. The archive of the generated energy and the power consumption can help elaborate efficient forecasting models. Meanwhile, the real-time data is processed to elaborate control algorithms, including the home energy management strategies. The controller is a subfeature of the SCADA; it is used to communicate the instructions of the SCADA to the components of the electric system through control signals. The above architecture was implemented at INPT at a larger scale for the MDE smart grid project [25]. The institute is composed of different buildings (cf. Figure 2); each building is considered as a microgrid and has all the electrical components shown in Figure 1.

MDP Based Approach Formulation
A Markov decision process is defined by a 4-tuple $\{S, A, T, R\}$, where $S$ is a finite set of states, $A$ is a finite set of actions, $T$ is the transition function, and $R$ is the reward function. A transition element $T_a(s, s')$ of the transition function defines the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t + 1$. Specifying it requires good knowledge of the system states, which are composed (as we will see later) of the renewable generation and the electric consumption. However, renewable energy generation forecasting faces several mathematical and statistical challenges [26,27] at the technical level (material, technology, meteorological estimations, etc.). Besides, the consumer's electric behaviour cannot be predicted; it depends on one's preferences and can vary with the time period (for instance, the consumption differs remarkably between summer days and winter days). Therefore, we use model-free reinforcement learning to solve the system without specifying the state-transition function. The three remaining elements ($S$, $A$, $R$) are mapped onto our case study and detailed in the current section.
4.1. System States: $S$. In each time slot $t$, the system state is defined by the triplet $s(t) = \{D(t), SE(t), P_u(t)\}$, with the following: (i) $D(t)$ is the net demand of the prosumer, $D(t) = L(t) - G(t)$, where $L(t)$ is the demand load of the user at time $t$ and $G(t)$ is the amount of locally produced energy at time $t$; (ii) $SE(t)$ is the amount of stored energy in the battery at time $t$; (iii) $P_u(t)$ is the price of the utility energy at time $t$.
The battery state of charge ($SE(t)$) is not deterministic. Its stochasticity derives from the stochasticity of the renewable energy generation, since the battery is charged from the excess of the generated energy.
The stored energy $SE$ varies in the range $[C_{\min}, C_{\max}]$, with $C_{\max}$ being the maximum capacity of the battery and $C_{\min}$ being the minimum capacity that has to remain in the battery in order to extend its lifetime. The net demand, on the other hand, varies in the range $[-G_{\max}, L_{\max}]$. The lower bound of the net demand corresponds to the case where the generated energy is the highest and there is no load demand (this case is not realistic, since there will always be background appliances running in the house, but we take it as a reference case to limit the demand load range), and the upper bound corresponds to the case where the load is at its peak but there is no generated energy.
Dynamic pricing models have been adopted in the latest smart grid strategies.Methods of dynamic pricing include among others time of use pricing (TOUP), real-time pricing (RTP), and critical peak pricing (CPP).
TOUP determines two or three levels of energy price, each level applying to a certain period of the day. The price levels are predetermined and can be changed only once or twice a year (summer period TOUP, winter period TOUP). In RTP, instead of predetermining the price levels, the exact price value for each period is calculated and announced to the user only at the beginning of the operation period. CPP is an event-driven pricing method; a CPP event is activated when the utility determines that there is a need to call on customers for temporary reductions in electricity use. In this study, we consider a TOUP price with three levels (off-peak price, mid-peak price, and on-peak price), with one model for the summer period and another for the winter period. Because the net demand and the stored energy are continuous quantities, our state space is continuous, which makes an exact resolution of the problem intractable. We use a grid to discretize the state space and approximate the continuous MDP state space by a discrete space, as shown in Table 1. The discrete state space is given by (1), where $i$ refers to the index of the net demand interval to which $D(t)$ belongs, $j$ refers to the index of the state of charge interval to which $SE(t)$ belongs, and $k$ is the price level:
$$S = \{ s_{i,j,k} \} \quad (1)$$
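The discretization described above can be sketched as follows. This is only an illustration under stated assumptions: the bounds `L_MAX`, `C_MIN`, and `C_MAX`, the use of equal-width intervals, and the function names are hypothetical, and the actual interval boundaries are those of Table 1.

```python
# Hypothetical bounds for illustration (the real intervals are given in Table 1).
L_MAX = 4000.0                      # assumed peak net demand, in Wh
C_MIN, C_MAX = 2112.0, 5280.0       # battery bounds, in Wh
N_DEMAND_BINS = 2                   # net demand intervals for D(t) > 0
N_SOC_BINS = 3                      # state-of-charge intervals
PRICE_LEVELS = ("off-peak", "mid-peak", "on-peak")


def bin_index(value, low, high, n_bins):
    """Return the 0-based index of the equal-width bin of [low, high] holding value."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    width = (high - low) / n_bins
    # The upper bound itself is assigned to the last bin.
    return min(int((value - low) / width), n_bins - 1)


def discretize(net_demand, stored_energy, price_level):
    """Map a continuous state to an index triple (i, j, k).

    Returns None when net_demand <= 0: that case is treated as a single
    state in which the generated energy covers the load.
    """
    if net_demand <= 0:
        return None
    i = bin_index(net_demand, 0.0, L_MAX, N_DEMAND_BINS)
    j = bin_index(stored_energy, C_MIN, C_MAX, N_SOC_BINS)
    k = PRICE_LEVELS.index(price_level)
    return (i, j, k)
```

With two net demand intervals, three state-of-charge intervals, and three price levels, this mapping yields 2 × 3 × 3 = 18 index triples for positive net demand, plus the single surplus state.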
4.2. Actions: $A$. The decision making process is limited to the states where $D(t) > 0$, because once the generated energy exceeds the demand load (i.e., $D(t) \le 0$), the best action is to use only this generated energy; this is the most natural choice, since there is enough cheap energy to fulfill the load and to charge the battery.
$a_0$: use the generated energy to fulfill the load and to charge the battery.
Depending on the actual state $s(t) = \{D(t) > 0, SE(t), P_u(t)\}$, the system chooses between the following actions:
$a_1$: feed the net demand $D(t)$ with only the utility energy.
$a_2$: use the battery to feed the net demand and complete with the utility energy if the stored energy is not sufficient.
The battery state (state of charge or depth of discharge) at time $t + 1$ depends on the action taken at time $t$ and on the preceding battery state. The action $a_1$ does not affect the state of the battery, since it uses the battery neither in charge nor in discharge. Strictly speaking, some of the stored energy is lost in this case through self-discharge. Yet, since the battery does not remain idle for a long period of time, we can neglect this effect. This is a reasonable assumption when the time scale over which the self-discharge loss takes place is much larger than the one of interest to us. The battery is discharged if $a_2$ is chosen. The amount of energy discharged from the battery depends on the battery parameters and the available energy, with $r$ being the maximum charging/discharging rate.
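One plausible form of the battery update can be sketched as below. It is not the paper's exact update rule: it assumes that the energy charged or discharged in one time slot is capped by the rate `rate`, by the energy actually needed or available, and by the capacity bounds `c_min` and `c_max`. The function name and signature are hypothetical.

```python
def next_stored_energy(stored, net_demand, action, c_min, c_max, rate):
    """Return (new_stored_energy, utility_energy_used) for one time slot.

    action is "a0" (surplus: charge the battery), "a1" (utility only),
    or "a2" (battery first, utility for the remainder). Energies in Wh.
    Assumed rule: charge/discharge is limited by `rate` and by the bounds.
    """
    if action == "a0":
        surplus = max(-net_demand, 0.0)
        charge = min(surplus, rate, c_max - stored)
        return stored + charge, 0.0
    if action == "a1":
        # The battery is left untouched; self-discharge is neglected.
        return stored, net_demand
    if action == "a2":
        discharge = min(net_demand, rate, stored - c_min)
        return stored - discharge, net_demand - discharge
    raise ValueError(f"unknown action: {action}")
```

For example, with `c_min = 2112`, `c_max = 5280`, and `rate = 528`, a battery holding 2200 Wh can only deliver 88 Wh before reaching its minimum, so the rest of the net demand is drawn from the utility.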

Reward: 𝑅.
The reward is a bounded, real-valued function that expresses the system preferences. One common form of the reward consists of two components, a positive component representing the utility $U$ and a second component corresponding to the cost: $R = U - \mathrm{Cost}$. In this paper, the cost function, Cost, corresponds to a fictive monetary cost of the energy consumed after choosing the action $a$. It depends on $P_s(t)$ and $P_u(t)$, the prices at time $t$ of the stored energy and the utility energy, respectively, and on $E_s(t)$, the amount of stored energy consumed at $t$. The goal of MDP planning is to compute the optimal policy $\pi^*$, that is, the policy that maximizes the expected sum of rewards. An optimal stationary policy is a policy that is not worse than any other policy at any state and any time.
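As an illustration only, one cost expression consistent with the quantities named above, together with the standard planning objective, could be written as follows. The split of the net demand $D(t)$ into a stored part $E_s(t)$ and a utility part $D(t) - E_s(t)$ is an assumption, not a formula taken from the paper.

```latex
% Assumed form: stored energy is billed at P_s, the remainder at P_u.
\mathrm{Cost}\bigl(s(t), a\bigr) = P_s(t)\, E_s(t) + P_u(t)\,\bigl(D(t) - E_s(t)\bigr)

% Standard MDP objective: the optimal policy maximizes the expected sum of rewards.
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t} R\bigl(s(t), \pi(s(t))\bigr)\right]
```

Under this assumed form, choosing $a_1$ gives $E_s(t) = 0$, so the cost reduces to $P_u(t)\, D(t)$.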

Model Resolution: Optimization of Energy Use
Reinforcement learning is a branch of machine learning, and thus of artificial intelligence. It allows machines to determine the best behaviour to enhance their performance. The machine learns its behaviour through a series of trials and errors, based on the feedback (in the form of a reward or a punishment) of the environment. Among the reinforcement learning algorithms existing in the literature, Q-learning, SARSA, actor-critic learning, and average-reward temporal difference learning are the most used in the MDP context.
5.1. Q-Learning. Q-learning is an off-policy algorithm for temporal difference learning. It is exploration-insensitive, which means that it converges to the optimal policy regardless of the exploration policy being followed, under the assumption that each state-action pair is visited an infinite number of times and the learning parameter $\alpha$ is decreased appropriately. The basic idea of Q-learning is to compute a running average of the Q-value. At each iteration, the system faces a new experience, and Q-learning averages the Q-value of the new experience with the current running average:
$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right] \quad (5)$$
where $r$ is the reward received after taking action $a$ in state $s$, $s'$ is the next state, and $\gamma$ is the discount factor, a number in the range $[0, 1]$ used to discount the value of future rewards. Discounting is a reasonable way to favour sooner rewards over later ones. The factor $\gamma$ also helps the algorithm converge, since for $\gamma < 1$ it bounds the sum of discounted rewards: $\sum_{t=0}^{\infty} \gamma^t r_t \le r_{\max}/(1 - \gamma)$. The closer $\gamma$ is to 0, the shorter the horizon; the evaluation is then said to be short-term. In the other case ($\gamma$ closer to 1), the evaluation is said to be "far-sighted." $\alpha$ is called the learning rate; it has to decrease as the learning process progresses, so that the accumulated estimate carries more and more weight relative to each new sample. Some commonly used examples include $\alpha(n) = A/(B + n)$, where $A$ and $B$ are real numbers (typically $A < B$) and $n$ is the iteration step. The rule $1/n$, which corresponds to $A = 1$ and $B = 0$, may not lead to good behaviour [28], although it is still widely used. We suggest the use of the log rule, since it converges to 0 more slowly than the $A/(B + n)$ rule for many values of $A$ and $B$. This helps our algorithm reach the right estimate as the learning progresses.
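The running-average update and the two learning-rate rules can be sketched as follows. The function names and the default values of `a` and `b` are hypothetical, and the "log rule" is assumed here to mean $\log(n)/n$ for $n \ge 2$, which the text does not spell out.

```python
import math


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning step and return the new Q(state, action).

    q is a dict keyed by (state, action); missing entries count as 0.
    """
    best_next = max(q.get((next_state, b), 0.0) for b in actions)
    sample = reward + gamma * best_next           # value of the new experience
    old = q.get((state, action), 0.0)             # current running average
    q[(state, action)] = (1.0 - alpha) * old + alpha * sample
    return q[(state, action)]


def alpha_ratio(n, a=1.0, b=10.0):
    """Learning rate a / (b + n); a and b are illustrative values."""
    return a / (b + n)


def alpha_log(n):
    """Assumed form of the log rule: log(n) / n, intended for n >= 2."""
    return math.log(n) / n
```

For these illustrative values, `alpha_log(1000)` is larger than `alpha_ratio(1000)`, which matches the remark that the log rule decays more slowly; for other choices of `a` and `b` the comparison can go the other way.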

Exploitation versus Exploration.
During the learning process, the system discovers through acting the "goodness" or the "badness" of a state-action pair. On the one hand, the system tries to choose safe actions, which return a positive reward (exploiting the solution); on the other hand, it should explore all the states in order not to skip or neglect a better solution in terms of reward. Therefore, the system must take risks and visit the unknown state-action pairs. The exploitation-exploration dilemma is typical of reinforcement learning problems. The solution is to choose an action randomly while focusing on the actions that currently have better rewards. Moreover, the exploration rate should decrease as the learning process progresses. The Boltzmann probability distribution satisfies the two conditions: (1) focusing on the good states, and (2) decreasing exploration as the learning progresses:
$$P(a \mid s) = \frac{e^{Q(s, a)/\tau}}{\sum_{b \in A} e^{Q(s, b)/\tau}}$$
where $\tau$ is the Boltzmann temperature. $\tau$ is initialized to a large value, making all the state-action pairs equiprobable. By reducing $\tau$ from a large initial value to a small final value, the system focuses on good states. As the temperature approaches 0, the exploration-exploitation rule becomes deterministic, and the system chooses only the state-action pair that returns the best value of the Q-function.
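A minimal sketch of Boltzmann action selection is given below, assuming the usual softmax form over the Q-values of one state. The function names are hypothetical; subtracting the maximum Q-value before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Return the Boltzmann (softmax) probabilities of the given Q-values."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(q_values)
    # Shifting by the maximum avoids overflow for small temperatures.
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With a very large temperature the two actions are almost equiprobable, while a very small temperature makes the choice nearly greedy.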

Final Executed Algorithm
Algorithm 1 describes the main steps of the model resolution. The algorithm starts by observing the initial state of the system. If the net demand is negative, the battery is charged, and the system gets a reward for charging the battery. Otherwise, the system generates a random number and compares it to the exploration rate. At the beginning of the algorithm execution, the exploration rate is equal for all actions (the function $Q$ being initialized at 0 for all the state-action pairs), and the system is more likely to choose a random action. As the learning progresses, the exploration rate decreases, and so does the probability that the random number falls below it. The system then starts focusing on the good actions returning the best rewards. After executing the action, the system receives a reward and updates the state of charge of the battery according to (2). The system then observes the next state and updates the Q-function according to (5).
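The overall learning loop can be sketched as follows. This is not Algorithm 1 itself but a simplified stand-in under several assumptions: the environment is passed in as a hypothetical `step` function (which would also handle the surplus case), actions are sampled directly from the Boltzmann distribution instead of comparing a random number with an exploration rate, the temperature follows a geometric cooling schedule, and the learning rate is taken as $\log(n)/n$.

```python
import math
import random

ACTIONS = ("a1", "a2")  # a1: utility only, a2: battery first


def run_q_learning(step, initial_state, n_steps=5000, gamma=0.9,
                   tau_start=10.0, tau_end=0.05, seed=0):
    """Tabular Q-learning with Boltzmann exploration.

    step(state, action, rng) is a user-supplied environment function
    returning (reward, next_state). All defaults are illustrative.
    """
    rng = random.Random(seed)
    q = {}
    state = initial_state
    for n in range(2, n_steps + 2):
        alpha = math.log(n) / n                           # assumed log rule
        frac = (n - 2) / max(n_steps - 1, 1)
        tau = tau_start * (tau_end / tau_start) ** frac   # assumed cooling
        q_vals = [q.get((state, a), 0.0) for a in ACTIONS]
        top = max(q_vals)
        weights = [math.exp((v - top) / tau) for v in q_vals]
        action = rng.choices(ACTIONS, weights=weights, k=1)[0]
        reward, next_state = step(state, action, rng)
        best_next = max(q.get((next_state, b), 0.0) for b in ACTIONS)
        old = q.get((state, action), 0.0)
        q[(state, action)] = (1 - alpha) * old + alpha * (reward + gamma * best_next)
        state = next_state
    return q


def greedy_policy(q):
    """Return the action with the highest Q-value for every visited state."""
    states = {s for (s, _) in q}
    return {s: max(ACTIONS, key=lambda a: q.get((s, a), 0.0)) for s in states}
```

In a real study, `step` would wrap the generation, load, and price data together with the battery update; here it is left abstract so that the loop structure stays visible.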

Simulation Parameters.
To test the reliability of our model, we simulate it with the following data and parameters.
Statistical Data. We consider statistical data for the generated renewable energy [29] and the demand load [30] during one year. The solar panel technical characteristics are shown in Table 2.
Solar Battery. Among the battery types suitable for photovoltaic systems, which must allow a large number of charging/discharging cycles, we choose for the simulation a battery with a maximum capacity of about 5 kWh (5280 Wh). Yet, to extend the lifetime of the battery, we use only 60% of this capacity (cf. Figure 4), which means that only 3168 Wh is used. The minimum capacity is then equal to 2112 Wh. The maximum charging/discharging rate varies between $C_{\max}/5$ and $C_{\max}/20$, depending on the state of charge (SOC)/depth of discharge (DOD) of the battery. For simplification, we fix the maximum charging rate at $C_{\max}/10$, which gives 528 Wh (0.528 kWh).
Time of Use Price. The TOUP model determines the price of the utility energy; Figure 5 shows the utility price variation depending on the time of the day and the day of the year.
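The battery figures quoted above follow from simple arithmetic, reproduced here as a check. The variable names are illustrative, and the rate is interpreted as an energy limit per time slot, which is an assumption.

```python
C_MAX_WH = 5280.0        # maximum capacity quoted in the text
USABLE_FRACTION = 0.60   # share of the capacity that is actually cycled

usable_wh = USABLE_FRACTION * C_MAX_WH          # energy available for cycling
c_min_wh = C_MAX_WH - usable_wh                 # energy that must stay stored
rate_wh = C_MAX_WH / 10                         # simplified fixed rate
rate_bounds_wh = (C_MAX_WH / 20, C_MAX_WH / 5)  # range of the variable rate

print(f"usable: {usable_wh:.0f} Wh, minimum: {c_min_wh:.0f} Wh")
print(f"rate: {rate_wh:.0f} Wh ({rate_wh / 1000:.3f} kWh)")
print(f"rate range: {rate_bounds_wh[0]:.0f} to {rate_bounds_wh[1]:.0f} Wh")
```

The fixed rate of 528 Wh lies between the two bounds of 264 Wh and 1056 Wh.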

Results and Discussion.
The continuous interval $]0, L_{\max}]$ of $D(t)$ is divided into two intervals. The state of charge of the battery is divided into intervals of width $C_{\max}/3$. Since we have 3 levels of the utility price, we obtain 18 states for the case $D(t) > 0$ and 1 state for the case $D(t) \le 0$. Table 3 summarizes the values of the different parameters of the simulation. Table 4 shows the optimal policy of the suggested system. The resolution is reduced to two actions, which are in fact a simplification of the following actions: energy generation, use of the utility energy as a source, use of the batteries as a source, and the combination of these last two possibilities. The goal is to reduce the complexity while keeping all the elements necessary for the model to be realistic. For the summer case, some of the states are never visited, due to the large amount of energy generated in this period. In fact, during the summer period the net demand $D(t)$ decreases and the battery is almost fully charged. These states are marked by a "-" symbol in the table of optimal policy.
We compare the suggested solution to two other scenarios: (i) Classic scenario: in this case, the battery is always given priority over the utility energy. When the generated energy is not sufficient, the system uses the battery and compensates with the utility energy if the sum of the generated and stored energy is not sufficient to fulfill the user load. (ii) MDP-finite horizon (MDP-FH) scenario: we proposed this case study in [12]. In that work, we considered that the data (the consumer total load and the predicted locally generated energy) is updated daily. Therefore, the resolution program was run every 24 hours. We chose to work with the value iteration method, since it is the most widely adopted method for solving finite-horizon MDPs [31].
Figure 6 visualizes the monthly cost of the three scenarios. The total cost for the classic scenario is 3448.6 cents, while that of the MDP-FH scenario is 2946.3 cents. The Q-learning resolution yields a total cost of 3014.3 cents, which means that the MDP-FH outperforms Q-learning in terms of cost. Nevertheless, the MDP-FH scenario has a greater computational cost. In fact, the MDP-FH treats each day individually and is launched at the beginning of each day, whereas Q-learning solves the system independently of the time of day and is launched only once. In addition, the transition matrix for the MDP-FH was chosen arbitrarily. In a more realistic scenario, this matrix may contain a margin of error due to the use of a statistical estimation method, which makes it risky to use in a real implemented system. For greater readability, we show the results for restricted periods. Figures 7 and 8 show the system behaviour for two different weeks, a winter week and a summer week. Once the generated energy exceeds the demand load, the best action is to use only this generated energy, which is the most natural choice since there is enough cheap energy to feed the load and to charge the battery. The period of energy excess ranges from 5 hours on the winter day to 13 hours on the summer day. In this period, the battery starts charging (cf. Figures 7(c) and 8(c)). The battery is rarely utilized on the summer day, because the peak period of the summer days coincides with the period of generated energy excess, and the period of higher net demand coincides with the lower utility price.
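The relative savings implied by the three totals can be computed as follows; only the three cost figures come from the text, and the names are illustrative.

```python
costs_cents = {"classic": 3448.6, "MDP-FH": 2946.3, "Q-learning": 3014.3}


def saving_percent(baseline, cost):
    """Return the relative saving of cost with respect to baseline, in percent."""
    return 100.0 * (baseline - cost) / baseline


savings = {
    name: saving_percent(costs_cents["classic"], cost)
    for name, cost in costs_cents.items()
    if name != "classic"
}
gap_cents = costs_cents["Q-learning"] - costs_cents["MDP-FH"]

for name, value in savings.items():
    print(f"{name}: {value:.1f}% cheaper than the classic scenario")
print(f"Q-learning costs {gap_cents:.1f} cents more than MDP-FH")
```

Relative to the classic scenario, Q-learning saves about 12.6% and the MDP-FH about 14.6%, so the model-free approach gives up roughly two percentage points in exchange for not requiring a transition matrix.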

Conclusion and Future Work
This paper proposes an intelligent home energy management model that manages the use of intermittent distributed energy under several constraints in order to improve the balance of usage of heterogeneous energy resources. The proposed HEM model elaborates an effective scheduling of the charging/discharging cycles of the battery and returns the optimal strategy that reduces the annual energy bill. The effectiveness of the suggested model in finding an optimal solution is demonstrated through extensive simulations. These simulations indicate that the model-free resolution approaches the results of the model-based resolution in terms of cost optimization. The results could be more satisfying if a microgrid sizing study were carried out. Therefore, in order to increase the consumer benefit, an optimal sizing of the microgrid architecture presented in this paper will be performed as future work.

4.1. System States: $S$. In each time slot $t$, the system state is defined by the triplet $s(t) = \{D(t), SE(t), P_u(t)\}$, with the following: (i) $D(t)$ is the net demand of the prosumer, $D(t) = L(t) - G(t)$, where $L(t)$ is the demand load of the user at time $t$ and $G(t)$ is the amount of locally produced energy at time $t$.

Figure 5: Time of use price.

Figure 6: Cost of the different scenarios.

Figure 7: Simulation for a winter week.

Figure 8: Simulation for a summer week.

Table 3: Simulation parameters.

Figure 4: Cycle lifetime as a function of depth of discharge.