Continuous Reinforcement Algorithm and Robust Economic Dispatching-Based Spot Electricity Market Modeling considering Strategic Behaviors of Wind Power Producers and Other Participants

In a spot wholesale electricity market containing strategic bidding interactions among wind power producers and other participants, such as fossil generation companies and distribution companies, the randomly fluctuating nature of wind power hinders not only the modeling and simulation of the dynamic bidding process and the equilibrium of the electricity market but also the ability of the independent system operator to maintain economy and reliability in market clearing (economic dispatch). Because the gradient descent continuous actor-critic algorithm has been demonstrated to be an effective method for Markov decision-making problems with continuous state and action spaces, and because the robust economic dispatch model can optimize the permitted real-time wind power deviation intervals based on wind power producers' bid power outputs, this paper proposes a gradient descent continuous actor-critic algorithm-based hour-ahead electricity market modeling approach, with the robust economic dispatch model embedded, that accounts for the bidding interactions among wind power producers and other participants. Simulations are implemented on the IEEE 30-bus test system and, to some extent, verify the market operation economy and the robustness against wind power fluctuations achieved by the proposed modeling approach.


Introduction
Wind power is one of the fastest growing renewable power resources [1]. In the spot electricity market (EM) with wind power penetration, the fluctuating and random nature of this intermittent resource hinders both the integration of wind power into the EM and the operation of power systems. Moreover, the strategic interactions among wind power producers (WPPs) and other market participants, such as fossil generation companies (GenCOs) and distribution companies (DisCOs), have increased the complexity of EM modeling, which is a necessary tool for market analysis, design, bidding decision-making, and every market modification [2].
The objective of every participant bidding in the EM is to maximize its own profit. Wind power and some other renewable power resources often participate in the spot EM as "price takers" because of their low marginal costs. Therefore, the only bidding parameter a WPP needs to determine is its production level [3]. On the one hand, the limited predictability of wind power means that WPPs usually cannot meet the production level they bid, which increases the probability of system imbalances [4]. Regulators in many countries have designed various penalty mechanisms to financially punish WPPs for deviations of their real-time production from their bids. Hence, if the marginal cost of wind power is neglected [5], maximizing a WPP's profit means simultaneously minimizing the deviation cost and maximizing the bidding revenue. On the other hand, the fluctuating and random nature of wind power forces the other EM participants to bid in a stochastically fluctuating EM environment in order to maximize their own profits, which in turn affects the bidding revenues of WPPs, mainly through the locational marginal prices (LMPs) cleared by the independent system operator (ISO). Therefore, in this more complicated situation, developing fast and reliable market modeling approaches that capture the bidding interactions among all kinds of participants has become considerably more important than before. One aim of this paper is to apply a new reinforcement learning algorithm, based on the gradient descent continuous actor-critic (GDCAC) algorithm, to double-side hour-ahead EM modeling containing strategic bidding interactions among WPPs and other market participants such as GenCOs and DisCOs.
Generally speaking, the literature relevant to our research can be divided into two categories: optimal wind power (or other renewable power) bidding in an EM with wind penetration, and EM modeling considering (or not considering) wind and other renewable power penetration. Regarding optimal wind power bidding in the EM, methods for finding the optimal bidding strategy of a WPP have been introduced by many researchers. Vilim and Botterud [3] proposed two stochastic bidding models based on kernel density estimation (KDE) for a WPP to obtain the optimal day-ahead bidding strategy. Ravnaas et al. [6] proposed a seasonal autoregressive integrated moving average (SARIMA) algorithm for the same purpose. Sharma et al. [5] studied the behaviors of strategic WPPs in markets dominated by wind generators using the Cournot game model. In [7], Matevosyan et al. proposed an imbalance-cost-minimizing bidding strategy for a WPP by forecasting the wind power probability distribution functions. Li and Shi [8] proposed a stochastic bidding model for a WPP based on the Roth-Erev reinforcement learning algorithm. Laia et al. [9] considered the uncertainty of the electricity price through a set of exogenous scenarios and solved the bidding problem of a price-taking thermal-wind power producer using stochastic mixed-integer linear programming. In [10], Chaves-Ávila et al. analyzed the impact of different balancing rules (penalty mechanisms) on wind power short-term bidding strategies through a stochastic optimization model. Based on the Stackelberg game model, Xiao et al. [11] put forward a closed-form analysis of a WPP's optimal bidding strategy in a day-ahead EM involving large-scale wind power. Lei et al. [12] studied, using a stochastic bilevel model, the optimal bidding decision for a WPP participating in a day-ahead EM that employs stochastic market clearing and energy-reserve cooptimization, in which only the wind generation uncertainty is considered. Similar research on the optimal bidding strategy of a WPP can also be found in [11, 13-18].
However, the authors of [3, 5-18] only studied how to find the optimal bidding strategy for a WPP within the EM environment, and the modeling methods of those works are either static game models (the Cournot and Stackelberg game models) or bilevel stochastic optimization models, which cannot simulate the impact of wind power on the dynamic bidding process of the other participants (GenCOs and DisCOs) in a spot EM with wind power penetration.
To overcome these deficiencies, many studies on spot EM modeling methods, considering or not considering wind and other renewable power penetration, have been proposed.
In general, the main purpose of EM modeling approaches is to regard the EM as a whole system in which the interactions among all market participants are investigated and the bidding process or the equilibrium result is simulated. EM modeling approaches mostly fall into two categories [2]: game-based models and agent-based models. In [2], Salehizadeh and Soltaniyan concluded that game-based EM models are inferior to agent-based models for the following reasons: (1) some game-based models result in a set of nonlinear equations that cannot be easily solved or might yield no solution; (2) some game-based models need to repeatedly solve multilevel mathematical programs to depict the dynamic bidding process in the EM, and this computational complexity limits the ability to simulate large EM systems with a game-based model; and (3) almost all game-based models assume that the probability distribution function of the market clearing price (MCP), or the competitors' bidding strategies, is common knowledge, an assumption that is not applicable in realistic situations [19]. Hence, many applications of agent-based methods to EM modeling have been proposed recently. Rahimiyan and Rajabi Mashhadi [19] modeled and simulated the EM bidding process using both the multiagent Q-learning algorithm with discrete state and action sets and a game-based approach. The comparison of the agent-based model with the game-based model in [19] confirms the superiority of the agent-based model for this problem. Santos et al. [20] proposed an agent-based wholesale EM test bed (called MASCEM: multiagent simulator of competitive electricity markets) in which the variant Roth-Erev reinforcement learning (VRERL) algorithm was used to model the bidding behavior of the GenCO agents. Similar research on agent-based EM modeling can also be found in [21-28], but none of the works in [19-28] considers wind or other renewable power penetration.
Shafie-khah et al. [29] proposed a multiagent EM model based on a heuristic dynamic algorithm to help analyze the market power of GenCOs in an EM considering wind power uncertainty. Dallinger and Wietschel [30], based on an agent-based EM equilibrium model, studied the impact of plug-in electric vehicles on an EM with renewable power penetration. Reeg et al. [31] studied the policy design problem of fostering the integration of renewable energy sources into the EM using an agent-based approach. Zamani-Dehkordi et al. [32] studied the impact of a proposed wind farm project on wholesale and retail electricity prices using EM models based on nonparametric regression algorithms. In [33], using the Q-learning algorithm, Haring et al. proposed a multiagent EM approach to analyze the effects of renewable power uncertainty on the spot EM bidding process. Salehizadeh and Soltaniyan [2] modified the multiagent EM approach through the fuzzy Q-learning algorithm, by which the effects of renewable power uncertainty on the spot EM bidding process were studied within a continuous market state (wind power) space but with discrete action spaces. Paschen [34] analyzed the dynamic behavior of day-ahead EM prices in Germany due to structural shocks in wind and solar power using a dynamic structural vector autoregressive model. Similar studies can also be found in [35, 36]. However, the works in [29-36] regard wind power or other renewable power as an exogenous random variable, so the strategic bidding behaviors of wind or other renewable power producers, as well as the impact of the EM bidding process on WPPs, are neglected.
So far as we know, no existing research covers the following three points simultaneously: (1) constructing a multiagent-based EM model that contains not only the impact of WPPs' uncertain output on the strategic bidding behaviors of the other market participants but also the impact of the EM bidding process on WPPs' bidding decision-making; (2) constructing a multiagent-based EM model in which both the EM environment state space and the bidding strategy (action) spaces of all kinds of market participants, such as WPPs, GenCOs, and DisCOs, are continuous; and (3) constructing a multiagent-based EM model in which the market clearing model of the ISO promotes the wind power accommodation capacity of the power system, which is another aim of this paper.

This paper applies a new modified reinforcement learning algorithm, namely, the GDCAC algorithm, to hour-ahead EM modeling. In our proposed EM approach, all kinds of participants, such as WPPs, GenCOs, and DisCOs, are regarded as interactively strategic bidding agents who, during the bidding process, must select their optimal bidding strategies from their continuous strategy spaces based on the EM environment state they learn within a continuous state space, without suffering the "curse of dimensionality." The market clearing model of the ISO in our approach is a robust economic dispatch model (REDM) [37], which can optimize the permitted real-time wind power deviation intervals based on the WPPs' bid power outputs. Using our proposed approach, the dynamic interactions among all kinds of participants, as well as the Nash equilibrium (NE) results of the EM, can be simulated and obtained. On the one hand, our proposed approach can provide a bidding decision-making tool for WPPs, GenCOs, and DisCOs to earn more profit in the EM. On the other hand, it can also provide an economic and operational analysis tool for promoting the development of renewable resources. Moreover, in our simulation, the proposed approach is implemented on the IEEE 30-bus test system. Beyond testing and verifying the feasibility and rationality of our proposed approach, such as reaching NE results after enough iterations and being superior to other agent-based approaches, a comparison of our proposed market clearing model with that of [12], under the same GDCAC-based bidding approach, is also carried out, which indicates the necessity of adopting the REDM to promote wind power accommodation in the EM.

The rest of this paper is organized as follows: Section 2 explains the multiagent double-side hour-ahead EM model containing strategic bidding interactions among WPPs, GenCOs, and DisCOs. Sections 3 and 4 describe the detailed procedure of applying the GDCAC algorithm to EM modeling. Section 5 conducts the simulations and comparisons. Section 6 concludes the paper.

Multiagent Hour-Ahead EM Modeling
2.1. Participants' Bidding Models. In our proposed double-side hour-ahead wholesale EM model, we consider every WPP, GenCO, and DisCO as an agent. An agent has the ability to learn through its bidding experiences in order to maximize its own profit. For the sake of simplicity and without loss of generality, we assume that every WPP and GenCO has only one generation unit. In each hour, every GenCO and DisCO solves its own bidding problem and sends its price-quantity bid curve for the next hour to the ISO. Moreover, every WPP, because of its "price taker" role in the EM, solves its own bidding problem and sends its bid power output to the ISO. The ISO, after receiving all bid curves from the GenCOs and DisCOs as well as all bid power outputs from the WPPs, performs the robust economic dispatch process and sends the scheduled power results as well as the LMPs to all market participants (WPPs, GenCOs, and DisCOs).
For WPP i (i = 1, 2, ..., N_w), the only bidding parameter for hour t is its planned (bid) power output P_{wi,t} (P_{wi,t} ∈ [P_{wi,min}, P_{wi,max}]). WPP i can adjust its bid by changing this parameter. In the power systems of many countries, wind power is given priority over other nonrenewable resources in the ISO's scheduling [37], which is to say that the prior-scheduled wind power for hour t, namely, P*_{wi,t}, is equal to P_{wi,t}. However, because of the high variability and random nature of this intermittent resource, the (predicted) real-time output power of WPP i for hour t, namely, P^{(r)}_{wi,t} (P^{(r)}_{wi,t} ∈ [P_{wi,min}, P_{wi,max}]), which is actually a random variable [12], usually tends to deviate from the scheduled one, which is harmful to the secure operation of the power system and tends to cause system imbalance. Hence, penalty mechanisms that financially punish WPPs for deviations of their real-time production from their bids must be involved. Taking the penalty method of [12] into consideration, the expected profit of WPP i for hour t can be described as in equation (1), where LMP_{wi,t} represents the hour-ahead nodal price (LMP) for hour t at the bus connecting WPP i. ε is a random variable used to describe the scenarios of wind power uncertainty. S represents the envelope space of wind power scenarios. p_ε represents the probability of occurrence of scenario ε. P^{(r,ε)}_{wi,t} and ρ^{(ε)}_{wi,t} represent the (predicted) real-time power output and the penalty price of WPP i for hour t in scenario ε, respectively. In this paper, the penalty price of WPP i is related to the (predicted) real-time LMP at the bus connecting WPP i [12].
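As a rough illustration of how such an expected profit is evaluated over wind scenarios, the sketch below computes the bid revenue minus the expected deviation penalty. The absolute-deviation penalty form, the function name, and all numbers are our own illustrative assumptions, not data or formulas taken from the paper or from [12]:

```python
def wpp_expected_profit(lmp, p_bid, scenarios):
    """Illustrative expected hourly profit of a WPP under a deviation penalty.

    scenarios: list of (probability, realtime_output, penalty_price) tuples,
    one per wind power scenario.
    """
    revenue = lmp * p_bid
    expected_penalty = sum(
        prob * rho * abs(p_rt - p_bid)  # pay for |deviation| in each scenario
        for prob, p_rt, rho in scenarios
    )
    return revenue - expected_penalty

# Example: bid 40 MW at an LMP of 30 $/MWh, two equiprobable wind scenarios.
profit = wpp_expected_profit(
    lmp=30.0, p_bid=40.0,
    scenarios=[(0.5, 35.0, 10.0), (0.5, 44.0, 10.0)],
)
```

With these numbers the revenue is 1200 and the expected penalty 45, so the bid earns 1155; shrinking the deviation (or the penalty price) raises the profit, which is exactly the trade-off the WPP's bidding problem balances.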
Moreover, there can be a difference between the (predicted) real-time power output and the (predicted) natural power output (namely, P^{(na)}_{wi,t}, with P^{(na)}_{wi,t} ∈ [P_{wi,min}, P_{wi,max}]) of WPP i in hour t. WPP i can determine whether its (predicted) real-time power output equals the natural one by conducting pitch control or using storage equipment [37]. The functional relationship between these two random variables can be formulated as follows [37]:

P^{(r)}_{wi,t} = P^{lb}_{wi,t},   if P^{(na)}_{wi,t} ≤ P^{lb}_{wi,t},
P^{(r)}_{wi,t} = P^{(na)}_{wi,t}, if P^{lb}_{wi,t} < P^{(na)}_{wi,t} < P^{ub}_{wi,t},   (2)
P^{(r)}_{wi,t} = P^{ub}_{wi,t},   if P^{(na)}_{wi,t} ≥ P^{ub}_{wi,t},

where P^{ub}_{wi,t} and P^{lb}_{wi,t} (both ∈ [P_{wi,min}, P_{wi,max}]) represent the permitted upper and lower bounds of the power output of WPP i that can be accepted by the system for hour t. In this paper, we consider the (predicted) real-time natural wind power outputs of all WPPs as common knowledge.
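The piecewise relationship above is simply a clipping of the natural output to the permitted interval, which can be sketched in one line (the function name is ours):

```python
def realtime_output(p_natural, p_lb, p_ub):
    """Equation (2): the WPP tracks its natural output inside the permitted
    interval [p_lb, p_ub] and clips it to the nearest bound (via pitch
    control or storage) outside that interval."""
    return min(max(p_natural, p_lb), p_ub)
```

For example, with a permitted interval of [10, 50] MW, a natural output of 55 MW is curtailed to 50 MW, 5 MW is raised to 10 MW, and 30 MW passes through unchanged.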
For GenCO j (j = 1, 2, ..., N_g), its bid curve for the next hour t is a supply function based on its real marginal cost function [28]:

λ_{gj,t}(P_{gj,t}) = k_{gj,t} (a_j P_{gj,t} + b_j),   (3)

where P_{gj,t} and k_{gj,t} represent the power production (MW) and the bidding strategy ratio of GenCO j for hour t, respectively. GenCO j can adjust its bid curve by changing the parameter k_{gj,t}. The marginal cost function of GenCO j is

MC_j(P_{gj,t}) = a_j P_{gj,t} + b_j,   (4)

where a_j and b_j represent the slope and intercept of GenCO j's marginal cost function, respectively. Moreover, we assume every GenCO is an AGC (automatic generation control [37]) unit that can automatically undertake a certain proportion (namely, α) of the real-time power imbalance of the system.
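Assuming the linear marginal-cost form above, the supply-function bid can be sketched as follows (identifiers and numbers are illustrative):

```python
def bid_price(p, k, a, b):
    """GenCO supply-function bid of equation (3): the linear marginal cost
    a*p + b scaled by the strategy ratio k. k = 1 corresponds to truthful
    marginal-cost bidding; k > 1 bids above cost."""
    return k * (a * p + b)

# At 100 MW with a = 0.02 $/MW^2 and b = 10 $/MWh, a truthful bid (k = 1)
# offers 12 $/MWh, while a strategic ratio k = 1.2 offers 14.4 $/MWh.
truthful = bid_price(100.0, 1.0, 0.02, 10.0)
strategic = bid_price(100.0, 1.2, 0.02, 10.0)
```

The single scalar k is what makes the agent's action space a bounded continuous interval rather than a discrete menu of curves.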
Therefore, the expected profit of GenCO j can be described as in equation (5), where LMP_{gj,t} represents the hour-ahead nodal price (LMP) for hour t at the bus connecting GenCO j, ρ^{(ε)}_{gj,t} represents the (predicted) real-time nodal price (LMP) for hour t at that bus in scenario ε, and P*_{gj,t} represents GenCO j's hour-ahead scheduled output power for hour t.
For DisCO m (m = 1, 2, ..., N_d), its bid curve for the next hour t is a demand function based on its real marginal revenue function [28].

2.2. ISO's Market Clearing Model.
In the traditional dispatching mode with wind power penetration, the ISO sends the scheduled values of wind power to the WPPs, and the WPPs are required to strictly follow the scheduled values within their generation capacities. This traditional mode has the following two obvious defects [37]: (1) When the precision of wind power prediction is low, the traditional dispatching mode is not conducive to wind power accommodation. It can lead to extreme operating conditions, which may seriously threaten system security when the wind power fluctuates violently. (2) It may lead to frequent pitch control when the wind turbines strictly track the scheduled output values, which would shorten the lives of the wind turbines.
The main reason for these two defects is that the traditional dispatch mode does not take the uncertainty of wind power into account. Hence, the ISO does not know the maximum permitted range of wind power output fluctuation that still ensures system security and cannot optimize the wind power accommodation capacity of the grid. Therefore, increasing attention has been paid to the REDM [37], which aims to promote wind power accommodation under wind power uncertainty. According to [37], the robust hour-ahead economic dispatch model for hour t can be mathematically described by equations (8)-(19), including the bound constraints P_{wi,min} ≤ P^{lb}_{wi,t} ≤ P*_{wi,t} ≤ P^{ub}_{wi,t} ≤ P_{wi,max}, ∀i, where M^{ub}_i and M^{lb}_i (M^{ub}_i, M^{lb}_i > 0) in equation (8) represent the deviation penalty coefficients of the permitted upper and lower bounds of the wind power output of WPP i, and equations (9)-(15) represent the hour-ahead system constraints, including the power balance constraint (equation (9)), the DC power flow constraints on each transmission line l (equations (11)-(13)), and the load and power production limits of every DisCO and GenCO (equations (14) and (15)). The hour-ahead LMPs of the system can be calculated using the dual variables of equations (9)-(13); the formulations for the hour-ahead LMPs are given in Appendix A. Equations (16)-(19) represent the (predicted) real-time system constraints, including the power balance constraint (equation (16)), the DC power flow constraints on each transmission line l (equations (17) and (18)), and the power production limits of every WPP (equation (19)).
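The interval-widening idea behind the REDM can be illustrated on a deliberately simplified single-bus system with one AGC unit and one WPP: the permitted wind bounds are widened until the AGC unit's remaining up/down reserve is exhausted. This is only a sketch of the idea, with no network constraints or penalty terms, and all names and numbers are our illustrative assumptions, not the model of [37]:

```python
def permitted_wind_interval(p_w_bid, demand, pg_min, pg_max, pw_min, pw_max):
    """Single-bus sketch of the REDM interval optimization.

    The AGC unit is scheduled to balance the wind bid; extra wind (up to
    p_ub) is absorbed by ramping the unit down to pg_min, and missing wind
    (down to p_lb) by ramping it up to pg_max.
    """
    p_g = demand - p_w_bid  # hour-ahead schedule of the AGC unit
    if not (pg_min <= p_g <= pg_max):
        raise ValueError("bid is infeasible for the AGC unit")
    p_ub = min(pw_max, p_w_bid + (p_g - pg_min))
    p_lb = max(pw_min, p_w_bid - (pg_max - p_g))
    return p_lb, p_ub

# Example: a 30 MW wind bid against a 100 MW demand, with the AGC unit
# limited to [20, 90] MW and the wind plant to [0, 50] MW.
lb, ub = permitted_wind_interval(30.0, 100.0, 20.0, 90.0, 0.0, 50.0)
```

In the example, the unit is scheduled at 70 MW, so it can absorb up to 50 MW of extra wind but only 20 MW of shortfall, giving a permitted interval of [10, 50] MW. The full REDM performs the same computation over the whole network, with the penalty coefficients M^{ub}_i and M^{lb}_i pricing any narrowing of the interval.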
From equations (16)-(18), it is obvious that the (predicted) real-time DC power flow on each transmission line l is a linear function of the (predicted) real-time power output of every WPP. From equation (2), the (predicted) real-time power output of WPP i (i = 1, 2, ..., N_w) must satisfy P^{lb}_{wi,t} ≤ P^{(r)}_{wi,t} ≤ P^{ub}_{wi,t}, which is to say that we can solve the abovementioned REDM by replacing P^{(r)}_{wi,t} with P^{ub}_{wi,t} and P^{lb}_{wi,t}, respectively (Appendix B) [37], generating new (predicted) real-time balancing and transmission constraints (equations (21)-(26)). The (predicted) real-time LMPs of the system when the (predicted) real-time power output of every WPP increases to its (scheduled) permitted upper bound (RTLMP_1s) can be calculated using the dual variables of equations (9) and (21)-(23), and the (predicted) real-time LMPs when the (predicted) real-time power output of every WPP decreases to its (scheduled) permitted lower bound (RTLMP_2s) can be calculated using the dual variables of equations (9) and (24)-(26).
Therefore, RTLMP_1s and RTLMP_2s represent the two extreme real-time dispatching results caused by the real-time wind power deviations of all WPPs. For the sake of simplicity and without loss of generality, we approximately consider the mean value of RTLMP_1 and RTLMP_2 at bus z as the (predicted) real-time LMP at bus z and neglect the impact of different ε ∈ S on the (predicted) real-time LMPs.

Agent-Learning Mechanism
For an agent in our proposed approach, all the other agents together constitute the EM environment it faces.
Therefore, interactions between an agent and all the other agents are equivalent to interactions between this agent and the EM environment it faces. An agent has the ability to learn, through repeated interactions with the EM environment, its optimal action (bidding strategy or bid power output), which maximizes its (expected) profit whatever the EM environment state is. In this paper, in order to clearly describe our proposed approach, we use the following definitions: (1) Iteration. Since the market is assumed to be cleared on an hour-ahead basis, we define each market round as an iteration. (2) State and (3) Action. The EM environment states (x_{wi,t}, x_{gj,t}, and x_{dm,t}) and the agents' actions are given in equations (28)-(30). Obviously, from equations (28)-(30), the action spaces for WPP i, GenCO j, and DisCO m are continuous, closed, and bounded intervals.
(4) Reward. In iteration t, similarly to [28], every agent learns the state of the EM environment (x_{wi,t}, x_{gj,t}, and x_{dm,t}) and then selects its action, which in turn forms its bid power output or bid curve to be sent to the ISO. After receiving all bid outputs and curves, the ISO determines the hour-ahead LMPs, the permitted upper and lower bounds of the (predicted) real-time power outputs of the WPPs, and the hour-ahead power supply and demand schedules with our REDM, represented by equations (8)-(19). The rewards of WPP i, GenCO j, and DisCO m are given by equations (1), (5), and (8), respectively.
Based on the rewards received over enough iterations, an agent in the EM can gradually learn how to take the corresponding optimal hour-ahead action.
As mentioned in [28, 38], table-based reinforcement learning algorithms (TBRLAs) can only rapidly solve Markov decision-making problems with discrete state and action spaces. When either the state or the action space becomes continuous, the so-called "curse of dimensionality" arises, and the learning speed of TBRLAs becomes so slow that the agent cannot find its optimal action under a given environment state within a reasonable number of iterations.
As mentioned in Section 3, both the state and action spaces of every agent in the EM are actually continuous, closed, and bounded spaces or intervals, which guarantees the existence of a global optimum. Therefore, it is improper to model and simulate the dynamic bidding process in our proposed hour-ahead EM, containing strategic bidding interactions among WPPs, GenCOs, and DisCOs, using TBRLAs. The method adopted in this paper is to apply a modified reinforcement learning algorithm, the GDCAC algorithm [28, 38], to model and simulate our proposed EM.

Because the mathematical principle and pseudocode of the GDCAC algorithm have been described in [28], we only present the step-by-step procedure of implementing the GDCAC algorithm for hour-ahead EM modeling containing strategic bidding interactions among WPPs, GenCOs, and DisCOs. Moreover, input the discount factor, the standard deviations, and the maximum numbers of training and decision-making iterations, namely, 0 ≤ γ ≤ 1, σ_{wi} (σ_{gj}, σ_{dm}) > 0, and T_1 and T_2, for every WPP, GenCO, and DisCO.
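Since the exact pseudocode is given in [28], the following is only a generic sketch of the kind of gradient-descent continuous actor-critic learner each agent runs: a Gaussian policy over a continuous action interval with a linear critic and a linear actor mean. The class structure, feature choice, and parameter names are our illustrative assumptions, not the algorithm of [28]:

```python
import random

class ContinuousActorCritic:
    """Sketch of a gradient-descent actor-critic over a continuous action
    interval [a_low, a_high], with linear-in-state critic and actor mean."""

    def __init__(self, a_low, a_high, sigma, alpha=0.05, beta=0.01, gamma=0.9):
        self.a_low, self.a_high = a_low, a_high
        self.sigma = sigma          # exploration std dev (training phase)
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.w = [0.0, 0.0]         # critic weights: V(s) = w0 + w1*s
        self.theta = [0.0, 0.0]     # actor weights: mean(s) = th0 + th1*s

    def value(self, s):
        return self.w[0] + self.w[1] * s

    def mean_action(self, s):
        m = self.theta[0] + self.theta[1] * s
        return min(max(m, self.a_low), self.a_high)

    def act(self, s, greedy=False):
        if greedy:                  # decision-making phase: no exploration
            return self.mean_action(s)
        a = random.gauss(self.mean_action(s), self.sigma)
        return min(max(a, self.a_low), self.a_high)

    def update(self, s, a, reward, s_next):
        td_error = reward + self.gamma * self.value(s_next) - self.value(s)
        # Critic: gradient step on the squared TD error.
        self.w[0] += self.alpha * td_error
        self.w[1] += self.alpha * td_error * s
        # Actor: policy-gradient step, weighted by the TD error.
        score = (a - self.mean_action(s)) / (self.sigma ** 2)
        self.theta[0] += self.beta * td_error * score
        self.theta[1] += self.beta * td_error * score * s
```

During the 3000-iteration training phase an agent would call `act(s)` (Gaussian exploration) and `update(...)` each market round; during the 500-iteration decision-making phase it would call `act(s, greedy=True)` only. A positive TD error shifts the policy mean toward the sampled action, which is how the bid ratio (or bid output) drifts toward its best response.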

Simulation Results and Discussions
5.1. Data and Assumptions. In this section, our proposed approach is implemented on the IEEE 30-bus test system with 2 WPPs, 6 GenCOs, and 20 DisCOs [2]. The schematic structure of this test system is shown in Figure 1. The output power of the WPP connected to bus 7 (marked as WPP 1) and that connected to bus 10 (marked as WPP 2) lie within the ranges [0, 80] MW and [0, 50] MW, respectively. According to [39, 40], we assume that the real-time wind power outputs of these two WPPs independently follow Weibull distributions. The (predicted) real-time wind power output scenarios (WPOSs) of these two WPPs can then be generated using the Monte Carlo method, and the real-time WPOS reduction method is that of [39, 40]. Table 1 shows the reduced 10 (predicted) real-time WPOSs and their corresponding probabilities for these two WPPs, which are used as exogenous parameters in our proposed approach.
Based on Table 1, the number of joint WPOSs corresponding to combinations of (predicted) real-time power outputs of WPP1 and WPP2 is 100 (10 × 10), which is too many for the subsequent calculations. Hence, in this paper, the 100 joint WPOSs are further reduced to 10 using the tabu search algorithm proposed in [40]. Table 2 shows the reduced 10 (predicted) real-time joint WPOSs and their corresponding probabilities.
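The sample-then-reduce workflow can be sketched as follows. Note that the reduction here is a crude equal-probability quantile grouping, only a stand-in for the tabu-search reduction of [40], and the Weibull parameters are illustrative, not those of the paper:

```python
import random

def sample_wind_scenarios(shape, scale, n, p_max):
    """Monte Carlo draws of real-time wind output from a Weibull
    distribution, truncated to the plant's output range [0, p_max]."""
    return [min(random.weibullvariate(scale, shape), p_max) for _ in range(n)]

def reduce_to_quantile_scenarios(samples, k):
    """Crude scenario reduction: sort the samples, split them into k
    equal-probability groups, and represent each group by its mean.
    Returns a list of (probability, output) pairs."""
    xs = sorted(samples)
    size = len(xs) // k
    scenarios = []
    for i in range(k):
        group = xs[i * size:(i + 1) * size] if i < k - 1 else xs[i * size:]
        scenarios.append((len(group) / len(xs), sum(group) / len(group)))
    return scenarios
```

A usage run such as `reduce_to_quantile_scenarios(sample_wind_scenarios(2.0, 30.0, 1000, 80.0), 10)` yields 10 scenarios whose probabilities sum to 1, playing the role of one column pair of Table 1.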
Moreover, parameters of GenCOs' and DisCOs' bid functions are shown in Tables 3 and 4 [2], respectively.
In order to verify the following 3 points: (1) our proposed EM approach can reach dynamic stability and Nash equilibrium (NE) after enough training and decision-making iterations; (2) our proposed EM approach is superior to approaches based on TBRL algorithms (e.g., the Q-learning algorithm) in terms of the participants' (expected) profits and the expected social welfare (SW), where the SW is calculated as the sum of the (expected) profits of all participants [2]; and (3) different market clearing methods (e.g., the REDM and the stochastic economic dispatch model (SEDM) [12]) affect the bidding stability results considering strategic interactions among WPPs and other participants, 3 corresponding simulations, conducted using Matlab R2014a, are carried out one by one as follows.

5.2. Testing the Ability of Our Proposed EM Approach to Reach Dynamic Stability and NE. In this section, we assume that every WPP, GenCO, and DisCO in the market is a GDCAC-based agent with continuous state and action spaces, and the dynamic interactions among all GDCAC-based agents constitute our proposed GDCAC-based EM approach. The related parameters of the GDCAC algorithm are listed in Table 5.
In our simulations and comparisons (as in the subsequent sections), every agent first goes through a training process of 3000 iterations in which all agents' action-selection policies balance exploration and exploitation [28]. After the training process, a decision-making process of 500 iterations is implemented by all agents, in which only the greedy policy is adopted when selecting actions in the face of any market state [28]. Moreover, at the beginning of the first training iteration, because no agent has any experience in strategy selection, we randomly set the hour-ahead bid outputs of the WPPs and the bidding strategies of the GenCOs and DisCOs within their respective intervals.
During the decision-making process, the dynamic adjustments of the EM environment state and of every agent's bidding strategy (output) may remain constant, which means the market has reached dynamic stability. Whether our proposed GDCAC-based approach reaches dynamic stability after 3000 training iterations is tested and verified in Figures 2-4.
From Figures 2-4, we can see that the adjustment processes of the hour-ahead LMPs, the (expected) profit of every agent, and the (predicted) real-time LMPs at the buses connecting the WPPs (the penalty prices charged to the WPPs) in our proposed GDCAC-based approach remain constant during the 500 decision-making iterations. It has been verified in [28] that the other adjustment processes in the EM, such as those of the expected SW and every agent's bidding strategy, remain constant while the adjustment process of the LMPs remains constant. Therefore, we conclude that our proposed GDCAC-based approach reaches dynamic stability after 3000 training iterations. However, dynamic stability is not equivalent to NE. Hence, in order to examine whether the bidding strategies of all agents obtained after the 3000 training iterations and 500 decision-making iterations constitute an NE, we observe each agent's (expected) profit when it changes its bidding strategy while the other agents' bidding strategies are fixed after the 3500 iterations. A combination of the obtained bidding strategies of all agents represents an NE when no agent can increase its (expected) profit while the other agents' bidding strategies remain unchanged. We define a Nash index [2] that equals 1 when the NE is reached and 0 otherwise. Figure 5 demonstrates the adjustment process of the Nash indices during the 3500 iterations in our proposed GDCAC-based approach.
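The Nash-index check described above amounts to a brute-force unilateral-deviation test, which can be sketched as follows (the profit functions and candidate grids in the example are a toy two-agent game, not the market model):

```python
def nash_index(strategies, profit_fns, candidate_sets, tol=1e-6):
    """Return 1 if no agent can raise its profit by unilaterally switching
    to another candidate strategy (all other strategies fixed), else 0."""
    for i, profit in enumerate(profit_fns):
        base = profit(strategies)
        for alt in candidate_sets[i]:
            trial = list(strategies)
            trial[i] = alt  # unilateral deviation by agent i
            if profit(trial) > base + tol:
                return 0
    return 1
```

For instance, in a toy game where agent 0's profit is -(s0 - 1)^2 + s1 and agent 1's is -(s1 - 2)^2, the profile (1, 2) yields a Nash index of 1 over any grid of candidate deviations, while (0, 2) yields 0 because agent 0 gains by moving to 1.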
Figure 5 shows that our proposed GDCAC-based EM approach is able to successfully generalize the agents' experiences from adjacent state points to any state point and to reach an NE after enough training and decision-making iterations. Moreover, using the same method, the ability of the comparative Q-learning-based approach (introduced in Section 5.3) to reach dynamic stability and NE after the same number of iterations can also be verified; this is not demonstrated here for reasons of space.
The hour-ahead LMPs, RTLMP_1s, and RTLMP_2s of the 30 buses obtained after 3500 iterations in our GDCAC-based EM approach are depicted in Figure 6.
It can be seen in Figure 6 that the hour-ahead LMPs of the 30 buses are equal to each other after 3500 iterations, which is to say that the hour-ahead dispatch results cause no congestion on any transmission line of this test system. In addition, there exist differences among the RTLMP_1s and the RTLMP_2s of the 30 buses, with respect to both the permitted upper and lower bounds of the power outputs of the WPPs. Our explanation of these simulation results is as follows: when deviations between the (predicted) real-time outputs of the WPPs and their hour-ahead scheduled values exist, the power output at each generator-connected bus and the power flow on each transmission line are redistributed in order for the system to tolerate the (predicted) real-time wind power deviations to a certain degree, and it is necessary for the hour-ahead REDM not only to make each GenCO maintain a certain reserve capacity but also to reserve some additional transmission capacity on each line to deal with the (predicted) real-time power flow changes. (The test system of Figure 1 is adapted from [41], under the Creative Commons Attribution License/public domain; for the sake of simplicity, the congestion limits on all transmission lines are assumed to be ±40 MW.)

Comparison of Our Proposed Approach with the TBRL-Based Approach.

In this section, for the purpose of comparison, our proposed GDCAC-based EM approach and the Q-learning-based EM approach are implemented on this test system, respectively. Three learning scenarios (LSNs) are set in this paper for simulation and comparison. LSN.1 assumes that every WPP, GenCO, and DisCO in the market is a GDCAC-based agent with continuous state and action spaces, which is the same as our proposed GDCAC-based approach mentioned in Section 5.2. LSN.2 assumes that WPP1 is a Q-learning-based agent with discrete state and action spaces, while the other agents are the same as in LSN.1. LSN.3 assumes that every WPP, GenCO, and DisCO in the market is a Q-learning-based agent with discrete state and action spaces, i.e., the comparative Q-learning-based EM approach. Table 6 presents the related information for LSN.2 and LSN.3, respectively. The parameters of the comparative Q-learning algorithm [19, 28], which uses an ε-greedy policy to balance exploration and exploitation during 3000 training iterations and a greedy policy during 500 decision-making iterations, are also listed in Table 6.
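The comparative Q-learning agents described above use a tabular update with ε-greedy exploration. A minimal sketch of that mechanism (hypothetical table sizes, reward values, and step sizes, not the parameters in Table 6):

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """epsilon-greedy over a discrete action set: explore with prob. epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])

rng = random.Random(0)
# Hypothetical 3-state, 2-action table for a Q-learning bidding agent.
q = [[0.0, 0.0] for _ in range(3)]
q_update(q, 0, 1, 10.0, 1)                 # reward 10 observed for action 1 in state 0
greedy_a = epsilon_greedy(q[0], 0.0, rng)  # epsilon = 0 -> pure greedy policy
```

During the decision-making iterations ε is set to zero, so the agent always plays the greedy action, mirroring the two-phase schedule listed in Table 6.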
After 3500 iterations, the (expected) profits of all agents and the expected SWs in the 3 LSNs are listed in Table 7.
From Table 7, the following can be inferred: (1) After the same number of iterations, WPP1's (expected) profit in LSN.1 is higher than that in LSN.2. This, to some extent, indicates that one can get more profit by using our proposed GDCAC-based method to bid in the EM than by using the Q-learning-based one under the same conditions (namely, the same parameter values, number of iterations, and adaptive learning mechanism of the other agents).
(2) After the same number of iterations, the expected SW in LSN.1 is higher than that in LSN.2, and the expected SW in LSN.2 is higher than that in LSN.3. This, to some extent, indicates that, as the number of agents using our proposed GDCAC-based method to bid in the EM increases, the expected SW improves.
In conclusion, with regard to the (expected) profit of a specific agent and the expected SW, our proposed GDCAC-based approach is clearly better than the comparative Q-learning-based one. The main reasons for this result are as follows: (1) the state and action spaces in the comparative Q-learning approach must be discrete, since finer discretization would cause the curse of dimensionality, whereas the GDCAC-based approach works directly with fully continuous state and action spaces; (2) discrete state and action spaces make it harder to find the globally optimal action for any given state than continuous ones do [28].
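The dimensionality argument in point (1) can be made concrete with simple arithmetic: the size of a tabular Q-function grows geometrically with the discretization resolution, while a linear approximator such as GDCAC's only needs one weight per basis function. A toy calculation (hypothetical discretization levels):

```python
def q_table_size(state_dims, levels_per_dim, action_levels):
    """Number of Q-table entries for a discretized state/action space."""
    return (levels_per_dim ** state_dims) * action_levels

# Hypothetical discretizations for a 2-D state (hour-ahead and real-time LMP)
# and a 1-D action: refining each axis multiplies the table size.
coarse = q_table_size(state_dims=2, levels_per_dim=10, action_levels=10)
fine = q_table_size(state_dims=2, levels_per_dim=100, action_levels=100)
```

A coarse 10-level grid already needs 1,000 entries per agent, and refining each axis tenfold inflates this to 1,000,000, which is why the tabular comparison is restricted to coarse grids.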

Comparison of Different Market Clearing Models in Our Proposed EM Approach.

In this section, two market clearing models embedded in our proposed GDCAC-based EM approach are compared on this test system. One is the REDM mentioned in Section 2.2; the other is the SEDM mentioned in [12]. Under the SEDM, we still assume that priority is given to scheduling the hour-ahead bidding outputs of the WPPs in the system. Moreover, the SEDM, based on the 10 joint real-time WPOSs listed in Table 2, takes maximizing the expected SW as its objective function [12] and simultaneously considers the hour-ahead and (predicted) real-time transmission constraints in order to obtain the optimal hour-ahead scheduled power outputs and demands of all GenCOs and DisCOs. In this paper, we adopt the expected SW, the bidding power outputs, and the permitted (predicted) real-time upper and lower bounds of the power outputs of WPP1 and WPP2 obtained after 3500 iterations for comparison. The results of these indices calculated using the different dispatch models in our proposed EM approach are listed in Table 8.
From Table 8, the following can be inferred: (1) After 3500 iterations, the hour-ahead bidding outputs of the WPPs within the REDM-embedded EM approach are significantly larger than those within the SEDM-embedded EM approach. Our explanation of this simulation result is as follows: although both dispatch models have an endogenous penalty mechanism for wind power output deviations, which affects the dynamic adjustment process of the WPPs' bidding power outputs, in the REDM-embedded EM approach the permitted upper and lower bounds are dynamically adjusted to fit the hour-ahead bidding power output of each WPP, while in each iteration of the SEDM-embedded EM approach the hour-ahead bidding power output of each WPP is required to meet the (predicted) real-time transmission constraints corresponding to the 10 WPOSs listed in Table 2. Therefore, WPPs in the REDM-embedded EM approach can adjust their bidding power outputs to relatively high levels, while those in the SEDM-embedded one are more inclined to adjust their bidding power outputs toward the average level of the 10 WPOSs listed in Table 2 in order to avoid the risk of an (expected) profit decline caused by larger power deviations.

Note: in order to ensure that no DisCO loses in competition merely because its revenue parameters differ obviously from those of the other DisCOs, and to ensure the overall balance between the sums of maximum outputs and demands in the market, a small part of the parameters in the 4th and 6th columns are slightly adjusted from [2].
(2) After 3500 iterations, the expected SW obtained from the REDM-embedded EM approach is significantly higher than that obtained from the SEDM-embedded one. Our explanation of this simulation result is as follows: in order to meet all (predicted) real-time transmission constraints corresponding to the 10 obviously different WPOSs listed in Table 2, the SEDM requires more reserve transmission capacity on each transmission line, which may crowd out more scheduled power outputs and demands of GenCOs and DisCOs than the REDM does under the same bidding power outputs of the WPPs. (3) Moreover, besides the scheduled hour-ahead power outputs and demands of all GenCOs and DisCOs, the REDM also schedules the permitted upper and lower bounds of the (predicted) real-time power output of each WPP. If a WPP's (predicted) natural power output exceeds its permitted power output interval, defined by its scheduled permitted upper and lower bounds, its (predicted) real-time power output can be adjusted to the adjacent bound by conducting pitch control or using storage equipment [37].
This characteristic of the REDM means that arbitrary continuous changes of each WPP's real-time power output within the corresponding permitted interval would not cause congestion in any transmission line of the system. By contrast, the SEDM schedules only the hour-ahead power outputs and demands of all GenCOs and DisCOs. Although a schedule produced by the SEDM can meet all (predicted) real-time transmission constraints corresponding to the 10 WPOSs listed in Table 2, it cannot guarantee that real-time power outputs of the WPPs other than those WPOSs would also cause no congestion in any transmission line of the system, and the WPPs would not know the permitted power deviation intervals according to which they could adjust their natural power outputs by conducting pitch control or using storage equipment. Hence, the SEDM-embedded EM approach is less conducive to wind power accommodation than the REDM-embedded one. Therefore, in terms of both economy and reliability, the REDM has clear advantages over the SEDM when embedded in the EM modeling approach.
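The contrast between the two clearing philosophies can be illustrated with a toy linear line-flow check (hypothetical shift factor, base flow, scenario list, and interval, not the paper's REDM/SEDM formulations):

```python
def flow(p_wind, sf_w, base_flow):
    """Line flow as a linear function of the wind injection (DC model)."""
    return base_flow + sf_w * p_wind

def interval_feasible(lb, ub, sf_w, base_flow, limit=40.0):
    """REDM-style check: a linear flow over [lb, ub] is extremal at the bounds,
    so verifying both endpoints covers every real-time output in between."""
    return all(abs(flow(p, sf_w, base_flow)) <= limit for p in (lb, ub))

def scenarios_feasible(scenarios, sf_w, base_flow, limit=40.0):
    """SEDM-style check: only the listed WPOSs are verified."""
    return all(abs(flow(p, sf_w, base_flow)) <= limit for p in scenarios)

# Hypothetical numbers: the scenario check passes, yet a real-time output of
# 55 MW (not among the listed scenarios) would overload the line; the
# interval check over [10, 55] catches this at the upper bound.
sf_w, base = 0.5, 15.0
scen_ok = scenarios_feasible([10.0, 20.0, 30.0], sf_w, base)
intv_ok = interval_feasible(10.0, 55.0, sf_w, base)
```

The scenario list certifies nothing about outputs between or beyond its entries, while the interval check certifies the whole continuum, which is the robustness property claimed for the REDM.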

Conclusion
In this paper, considering strategic interactions among WPPs, GenCOs, and DisCOs, we have proposed a GDCAC-based EM modeling approach with the REDM embedded. Simulation results have verified the feasibility and soundness of our proposed approach, and the following conclusions can be drawn: (1) With our proposed GDCAC-based EM approach, the simulated bidding process can, after enough training and decision-making iterations, reach a dynamic stability that has been tested and verified as the NE result.
(2) Our simulation on the IEEE 30-bus test system with 28 participants takes only 1.17 minutes to reach the final result. That is to say, the time complexity of our proposed approach is relatively low, so we can extend it to the modeling and simulation of more realistic and more complex EM systems. (3) Our proposed GDCAC-based EM approach is superior to the TBRL- (Q-learning-) based approach in terms of increasing the profit of a specific agent and the expected SW. The main reason is that the TBRL algorithm can only handle Markov decision-making problems with discrete state and action spaces. (4) The obtained bidding results also reveal that, on the premise of maintaining a relatively high wind power accommodation ability of the system, the overall SW can be improved by using the REDM as the market clearing model compared with the SEDM. This, to some extent, has verified the robustness against wind power fluctuations, the reliability of the scheduling results, and the market operation economy of our proposed EM approach with the REDM embedded.
Moreover, on the one hand, our proposed approach can provide a bidding decision-making tool for WPPs, GenCOs, and DisCOs to earn more profit in the EM. On the other hand, it can also provide an economic and operational analysis tool for promoting the development of renewable resources.

where λ, η_{l1}, and η_{l2} represent the dual variables of equations (9), (12), and (13), respectively, and L represents the generalized Lagrange function of the model (equations (8)-(19)).

B. Discussion on the Reformulation of Constraints (16)-(18) into (21)-(25)
From equations (16)-(18), it is obvious that Σ_{z=1}^{Z} P^{(r)}_{Gz,t} × sf_{l,Gz} − Σ_{z=1}^{Z} P^{*}_{Dz,t} × sf_{l,Dz} increases with the increase of P^{(r)}_{wi',t} ∈ [P^{lb}_{wi',t}, P^{ub}_{wi',t}] (i' ∈ BUS_z) and decreases with its decrease. That is to say, a violation of the real-time constraints is most likely to happen when P^{(r)}_{wi',t} = P^{lb}_{wi',t} or P^{(r)}_{wi',t} = P^{ub}_{wi',t}. Hence, for the purpose of maintaining robustness, we can solve the abovementioned REDM by replacing P^{(r)}_{wi,t} with P^{ub}_{wi,t} and P^{lb}_{wi,t}, respectively, and generating new (predicted) real-time balancing and transmission constraints as follows:
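The argument above relies on a function that is linear in each wind output attaining its extremes at the interval bounds, so checking the bounds suffices. A quick numerical check of this vertex property (hypothetical coefficients and intervals):

```python
import itertools
import random

def net_flow(p_w, coeffs, const):
    """A flow term that is linear in each wind output p_w[i]."""
    return const + sum(c * p for c, p in zip(coeffs, p_w))

# Hypothetical coefficients and permitted output intervals for two WPPs.
coeffs, const = [0.3, -0.2], 5.0
bounds = [(10.0, 30.0), (5.0, 25.0)]

# Extremes of a linear function over a box occur at its vertices (the
# interval bounds), which is why substituting the bounds preserves robustness.
vertex_vals = [net_flow(v, coeffs, const) for v in itertools.product(*bounds)]
rng = random.Random(1)
interior_vals = [
    net_flow([rng.uniform(lo, hi) for lo, hi in bounds], coeffs, const)
    for _ in range(200)
]
worst_vertex = max(abs(v) for v in vertex_vals)
worst_interior = max(abs(v) for v in interior_vals)
```

No randomly sampled interior point produces a larger absolute flow than the worst vertex, matching the reasoning that only the bound substitutions P^{ub} and P^{lb} need to appear in the reformulated constraints.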

Figure 1 :
Figure 1: Diagram of the test system. Note: Figure 1 is reproduced from [41] (under the Creative Commons Attribution License/public domain). For the sake of simplicity, it is assumed here that the minimum and maximum congestion constraints on all transmission lines are ±40 MW.

Figure 2 :
Figure 2: Dynamic adjusting process of hour-ahead LMPs in the GDCAC-based approach.

Figure 3 :
Figure 3: Dynamic adjusting process of the (expected) profit of every agent in the GDCAC-based approach.

Figure 4: The dynamic adjusting process of the (predicted) real-time LMPs at the buses connecting WPP1 and WPP2 (penalty prices charged to WPP1 and WPP2) in the GDCAC-based approach.

Figure 5:

A. Formulations for Hour-Ahead LMP

The hour-ahead LMP for the energy credit and load payment at bus Gz (or Dz) can be calculated as

LMP_{Gz} = ∂L/∂P^{*}_{Gz,t} = λ − Σ_l sf_{l,Gz} (η_{l1} − η_{l2}),  (A.1)
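Equation (A.1) decomposes the nodal price into the system energy price λ and a congestion component built from the line-limit duals. A small numerical sketch of this decomposition (hypothetical shift factors and dual values, assuming the λ − Σ_l sf_{l,Gz}(η_{l1} − η_{l2}) sign convention):

```python
def lmp(lmbda, sf, eta1, eta2):
    """Hour-ahead LMP at a bus: the energy price lambda minus the congestion
    component sum_l sf[l] * (eta1[l] - eta2[l]) built from line-limit duals."""
    return lmbda - sum(s * (e1 - e2) for s, e1, e2 in zip(sf, eta1, eta2))

# With no binding line constraint every eta is zero, so all buses share the
# uniform price lambda -- the situation observed for the hour-ahead LMPs.
uncongested = lmp(30.0, [0.4, -0.1], [0.0, 0.0], [0.0, 0.0])
# A binding upper limit on line 1 (hypothetical dual value 5.0) separates prices.
congested_price = lmp(30.0, [0.4, -0.1], [5.0, 0.0], [0.0, 0.0])
```

This also explains the simulation results in Section 5: uniform hour-ahead LMPs correspond to zero congestion duals, while the differing real-time LMPs reflect nonzero duals on the reserved real-time transmission constraints.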

The bidding strategy function of GenCO_j is

SF_{j,t}(P_{gj,t}, k_{gj,t}) = k_{gj,t}(a_j P_{gj,t} + b_j), P_{gj,t} ∈ [P_{gj,min}, P_{gj,max}].

(2) State Variable. For WPP_i in iteration t, the hour-ahead and (predicted) real-time LMPs at the bus connecting WPP_i calculated in iteration t−1, namely LMP_{wi,t−1} and ρ_{wi,t−1}, are defined as the EM environment state variables; for GenCO_j, the hour-ahead and (predicted) real-time LMPs at the bus connecting GenCO_j calculated in iteration t−1, namely LMP_{gj,t−1} and ρ_{gj,t−1}, are defined as the EM environment state variables; for DisCO_m, the hour-ahead LMP at the bus connecting DisCO_m calculated in iteration t−1, namely LMP_{dm,t−1}, is defined as the EM environment state variable. Hence, the state vectors and scalar for WPP_i, GenCO_j, and DisCO_m can be formulated as follows [28]:

x_{wi,t} = [LMP_{wi,t−1}, ρ_{wi,t−1}] ∈ X_{wi},
x_{gj,t} = [LMP_{gj,t−1}, ρ_{gj,t−1}] ∈ X_{gj},
x_{dm,t} = LMP_{dm,t−1} ∈ X_{dm},  (27)

where X_{wi}, X_{gj}, and X_{dm} are continuous, closed, and bounded state spaces for WPP_i, GenCO_j, and DisCO_m, respectively.

(3) Action Variable. For WPP_i, the hour-ahead bidding power output P_{wi,t} (P_{wi,t} ∈ [P_{wi,min}, P_{wi,max}]) is defined as the action variable of this agent in iteration t. For GenCO_j or DisCO_m, the hour-ahead bidding strategy rate k_{gj,t} or k_{dm,t} is defined as the action variable of GenCO_j or DisCO_m in iteration t. Hence, the action scalars for WPP_i, GenCO_j, and DisCO_m can be formulated as follows [28]:

u_{wi,t} = P_{wi,t} ∈ [P_{wi,min}, P_{wi,max}],  (28)
u_{gj,t} = k_{gj,t} ∈ [k_{gj,min}, k_{gj,max}],  (29)
u_{dm,t} = k_{dm,t} ∈ [k_{dm,min}, k_{dm,max}].  (30)
(1) Input. For the whole EM, input common knowledge such as every WPP's reduced (predicted) real-time wind power output scenarios (WPOSs) with corresponding probabilities and all WPPs' joint real-time WPOSs with corresponding probabilities. For WPP_i (i = 1, 2, ..., N_w), input the basis function φ_wi: X_{wi} → R^n for formulating its value function V_{wi,t}(x_{wi,t}) = Σ_{h=1}^{n} φ_{wi,h}(x_{wi,t}) θ_{wih,t} = φ_wi(x_{wi,t})^T θ_{wi,t}, x_{wi,t} ∈ X_{wi}, its optimal policy function u^{(opt)}_{wi,t}(x_{wi,t}) = φ_wi(x_{wi,t})^T ω_{wi,t}, x_{wi,t} ∈ X_{wi}, and the time step length parameter series {α^{(w)}_t}_{t=1}^{∞} and {β^{(w)}_t}_{t=1}^{∞}. For GenCO_j (j = 1, 2, ..., N_g), input the basis function φ_gj: X_{gj} → R^n for formulating its value function V_{gj,t}(x_{gj,t}) = Σ_{h=1}^{n} φ_{gj,h}(x_{gj,t}) θ_{gjh,t} = φ_gj(x_{gj,t})^T θ_{gj,t}, x_{gj,t} ∈ X_{gj}, its optimal policy function u^{(opt)}_{gj,t}(x_{gj,t}) = φ_gj(x_{gj,t})^T ω_{gj,t}, x_{gj,t} ∈ X_{gj}, and the time step length parameter series {α^{(g)}_t}_{t=1}^{∞} and {β^{(g)}_t}_{t=1}^{∞}. For DisCO_m, input analogously the basis function φ_dm, its value function V_{dm,t}(x_{dm,t}) = φ_dm(x_{dm,t})^T θ_{dm,t}, its optimal policy function u^{(opt)}_{dm,t}(x_{dm,t}) = φ_dm(x_{dm,t})^T ω_{dm,t}, x_{dm,t} ∈ X_{dm}, and the time step length parameter series {α^{(d)}_t}_{t=1}^{∞} and {β^{(d)}_t}_{t=1}^{∞}.

(4) Action selection. If t ≤ T_1 (training), WPP_i selects and implements an action u_{wi,t} ∼ N(φ_wi(x_{wi,t})^T ω_{wi,t}, σ²_wi) (u_{wi,t} ∈ [P_{wi,min}, P_{wi,max}]) from state x_{wi,t}; GenCO_j selects and implements an action u_{gj,t} ∼ N(φ_gj(x_{gj,t})^T ω_{gj,t}, σ²_gj) (u_{gj,t} ∈ [k_{gj,min}, k_{gj,max}]) from state x_{gj,t}; and DisCO_m selects and implements an action u_{dm,t} ∼ N(φ_dm(x_{dm,t})^T ω_{dm,t}, σ²_dm) (u_{dm,t} ∈ [k_{dm,min}, k_{dm,max}]) from state x_{dm,t}. If T_1 < t < T_1 + T_2 (decision-making), each agent acts greedily: u_{wi,t} = φ_wi(x_{wi,t})^T ω_{wi,t}, u_{gj,t} = φ_gj(x_{gj,t})^T ω_{gj,t}, and u_{dm,t} = φ_dm(x_{dm,t})^T ω_{dm,t}. After every agent selects its action and sends it to the ISO, the ISO implements the REDM represented by equations (8)-(19), by which the EM environment state variables are updated from x_{wi,t}, x_{gj,t}, and x_{dm,t} to x_{wi,t+1}, x_{gj,t+1}, and x_{dm,t+1}, and the immediate rewards r_{wi,t}, r_{gj,t}, and r_{dm,t} are generated.

(5) WPP_i observes the immediate reward r_{wi,t} by using equation (1) and the new EM environment state x_{wi,t+1}; GenCO_j observes the immediate reward r_{gj,t} by using equation (5) and the new EM environment state x_{gj,t+1}; and DisCO_m observes the immediate reward r_{dm,t} by using equation (8) and the new EM environment state x_{dm,t+1}.

(6) Learning. In this step, θ_{wi,t} and ω_{wi,t} for WPP_i, θ_{gj,t} and ω_{gj,t} for GenCO_j, and θ_{dm,t} and ω_{dm,t} for DisCO_m are updated by using the TD(0) error (namely, δ_{wi,t}, δ_{gj,t}, and δ_{dm,t}) and the gradient descent method. For WPP_i:

δ_{wi,t} = r_{wi,t} + γ φ_wi(x_{wi,t+1})^T θ_{wi,t} − φ_wi(x_{wi,t})^T θ_{wi,t},
θ_{wi,t+1} = θ_{wi,t} + α^{(w)}_t δ_{wi,t} φ_wi(x_{wi,t}),
ω_{wi,t+1} = ω_{wi,t} + β^{(w)}_t δ_{wi,t} (u_{wi,t} − φ_wi(x_{wi,t})^T ω_{wi,t}) φ_wi(x_{wi,t}),

and the updates of θ_{gj,t}, ω_{gj,t}, θ_{dm,t}, and ω_{dm,t} for GenCO_j and DisCO_m are analogous.

(7) t = t + 1.

Output. For WPP_i, θ*_{wi} = θ_{wi,T_1+T_2} and ω*_{wi} = ω_{wi,T_1+T_2}; for GenCO_j, θ*_{gj} = θ_{gj,T_1+T_2} and ω*_{gj} = ω_{gj,T_1+T_2}; likewise for DisCO_m.
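The learning step (6) can be sketched in a few lines. The critic update follows the TD(0) expressions for WPP_i; the actor update direction shown here (moving the policy mean toward actions with positive TD error) is a common continuous actor-critic variant and is an assumption where the extracted text is incomplete; basis centers, step sizes, and all numbers are hypothetical:

```python
import math

def features(x, centers, width=10.0):
    """Gaussian basis functions phi(x) shared by the critic and the actor."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]

def ac_step(x, x_next, u, r, theta, omega, alpha, beta,
            gamma=0.9, centers=(20.0, 30.0, 40.0)):
    """One GDCAC-style learning step: TD(0) error on the linear critic,
    then gradient updates of the critic weights theta and policy weights omega."""
    phi, phi_next = features(x, centers), features(x_next, centers)
    v = sum(f * w for f, w in zip(phi, theta))
    v_next = sum(f * w for f, w in zip(phi_next, theta))
    delta = r + gamma * v_next - v                   # TD(0) error
    theta = [w + alpha * delta * f for w, f in zip(theta, phi)]
    u_mean = sum(f * w for f, w in zip(phi, omega))  # policy mean phi^T omega
    omega = [w + beta * delta * (u - u_mean) * f for w, f in zip(omega, phi)]
    return theta, omega, delta

# One hypothetical step: state = previous LMP, action = bid, reward = profit.
theta0 = [0.0, 0.0, 0.0]
omega0 = [0.0, 0.0, 0.0]
theta1, omega1, delta = ac_step(x=25.0, x_next=28.0, u=12.0, r=10.0,
                                theta=theta0, omega=omega0, alpha=0.1, beta=0.05)
```

A positive TD error both raises the value estimate near the visited state and shifts the policy mean toward the action just taken, which is the mechanism driving the bids toward the equilibrium levels reported in Section 5.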

Table 1 :
The 10 reduced real-time WPOSs of the two WPPs and their corresponding probabilities.

Table 2 :
Reduced joint real-time WPOSs and their corresponding probabilities.

Table 4 :
Parameters of DisCOs' bid functions.

Table 5 :
Related information about the GDCAC algorithm.

Table 8 :
Calculating results of expected SW, bidding power outputs, and permitted (predicted) real-time upper and lower bounds of power outputs considering different dispatch models.

Table 7 :
(Expected) profits of all agents and expected SW results in the 3 LSNs.