Deep Reinforcement Learning-Based Trading Strategy for Load Aggregators on Price-Responsive Demand

With the development of the Internet of things and smart grid technologies, modern electricity markets seamlessly connect demand response to the spot market through price-responsive loads, in which the trading strategy of load aggregators plays a crucial role in profit capture. In this study, we propose a deep reinforcement learning-based strategy for purchasing and selling electricity based on real-time electricity prices and real-time demand data in the spot market, which maximizes the revenue of load aggregators. The deep deterministic policy gradient (DDPG) algorithm is combined with a bidirectional long short-term memory (BiLSTM) network that extracts the market state features used to make trading decisions. The effectiveness of the method is validated using datasets from the New England and Australian electricity markets; the bidirectional LSTM structure introduced into the actor-critic network learns hidden states in a partially observable Markov environment through memory-based inference. Comparative experiments show that the method yields higher returns.


Introduction
The basic feature of the electricity market is that prices follow demand, and price changes affect the quantity demanded [1]. The economic operation of the electricity market helps to reduce the cost of electricity use and is an effective way of enhancing the security of the electricity system through economy [2]. Studying response characteristics in terms of timing, trading rules, and so on can enhance the flexibility of electricity markets and improve the accuracy of forecasting and decision-making [3][4][5]. In recent years, with the development of the Internet of things and smart grid technologies, and especially the advancement of ambient intelligence, the widespread deployment of smart meters has equipped more customers with two-way communication capability, making price-responsive load possible. Price-responsive demand (PRD) unifies the original price-based and incentive-based demand-side response: it turns the originally uncontrollable price-based demand response into a controllable resource and recasts the incentive-based demand response as a response to price.
For system operators, PRD is a reliable real-time resource that can be described as a price-adjusted load, providing a new means and tool for dispatch; for consumers, PRD reduces electricity bills and improves energy use efficiency. The EcoGrid EU trial results show that residential load could be considered price-sensitive on certain test days [6]. Moreover, results from the California ISO, PJM, and Alstom Grid show that PRD helped improve the efficiency of market operations and significantly increase system reliability [7]. Price-responsive mechanisms facilitate the integration of new flexible energy sources and reduce rail operating costs [8].
Load aggregators can consolidate demand response customer resources and act as price-responsive loads in the role of a single large customer. To a certain extent, this eliminates uncertainty in user response behavior and allows small and medium loads to participate in the electricity market in line with their own load control characteristics; aggregated demand response resources can be flexibly managed to improve response efficiency based on forecast or current electricity spot market prices. In Liu et al. [9], a hybrid stochastic/robust optimisation approach with a model that minimizes the expected net cost was proposed for distributed generation (DG), storage, dispatchable DG, and price-sensitive load bidding strategies in the day-ahead market. The results show that wind power output was negatively correlated with the price-based demand response load, and this correlation could reduce the system operating cost and improve the economy of system dispatch. In Geng et al. [10], a two-stage stochastic power purchase model with DR resources was constructed to minimize the energy purchase cost of integrated energy service providers in different types of markets, and the impact of flexible heating load on their power purchase strategy was presented. For the day-ahead market, a multi-time-scale stochastic optimal scheduling model for electric vehicle (EV) charging stations with demand response was proposed with the objective of minimizing daily operating cost, introducing price-based demand response to optimise the net load curve of charging stations [11]. Combining price-based demand response measures, an optimisation was proposed with the objectives of maximizing the revenue of EV load aggregators and minimizing load fluctuation [12].
Heuristic algorithms, meta-heuristics, and intelligent evolutionary algorithms for the optimal solution of decision problems are used in various fields. In Zhao and Zhang [13], a learning-based algorithm is proposed to improve generalisation by adjusting the evolutionary strategy of the algorithm according to feedback information gathered in the optimisation process for the actual problem. Pasha et al. [14] present an integrated optimisation model whose objective is to maximize the total profit generated by the transport business and solve the proposed model with a decomposition-based heuristic algorithm. Kavoosi et al. [15] propose an evolutionary algorithm to solve the developed mathematical model, implemented through an enhanced adaptive parameter control strategy that effectively varies the algorithm parameters throughout the search process. Dulebenets [16] proposes a new adaptive polyploid memetic algorithm to solve transport scheduling problems and to help operators with proper operational planning. Rabbani et al. [17] present a mixed integer linear programming model to find the optimal route sequence and minimize time consumption through the non-dominated sorting genetic algorithm II and multi-objective particle swarm optimisation.
In recent years, deep reinforcement learning, with its autonomous recognition and decision-making capabilities, has been successfully applied in the energy sector [18][19][20]. The feasibility of using it for grid regulation has also been demonstrated [21,22], and it can meet the requirements associated with demand response [23]. Reinforcement learning theory is a mathematical model of learning through repeated, rewarded trial and error; it is based on the psychological concept of operant conditioning and derives its name from the phenomenon that reinforcement increases the frequency of autonomous behavior. A customer agent model applying Q-learning was proposed in [24] for predicting price-sensitive load reductions. A pricing strategy for charging station operators based on noncooperative games and deep reinforcement learning was investigated in [25], and the effectiveness of the proposed framework was validated with real data from cities. Moreover, a real-time pricing technique based on multi-agent reinforcement learning was proposed in [26], and it worked well in consumer-driven applications of mini-smart grids. The researchers behind [27] considered thermostatically controlled loads, energy storage systems (ESS), and price-responsive loads for flexible demand-side dispatch of microgrids based on deep reinforcement learning, which significantly reduced input costs. The researchers of [28] gave a dynamic pricing strategy based on DDPG that considers the historical behavior data of electric vehicles, peak-valley time-of-use tariffs, and the demand-side response pattern to guide customer tariff behavior and exploit the economic potential of the electricity market. Considering the cooperation between wind farms and electric vehicles, an intelligent pricing decision for EV load aggregators based on deep reinforcement learning algorithms was proposed in [29] to increase overall economic benefits.
To maximize the long-term revenue of electricity sellers in the electricity spot market, the researchers of [30] proposed a dynamic optimisation scheme for demand response using reinforcement learning. For the price difference between the day-ahead and real-time markets in the electricity spot market, the researchers of [31] achieved an effective solution for the optimal bidding strategy based on deep reinforcement learning. Further, an improved deep deterministic policy gradient algorithm was proposed in [32] as a building-level control strategy to improve the demand response capability of the distributed electric heating load side. A dual-DQN agent was proposed in [33] to evaluate the resilience of power systems. Other research [34] combined the cross-entropy method (CEM), the maximum mean discrepancy (MMD), and the twin delayed deep deterministic policy gradient algorithm (TD3) with evolutionary strategies to propose diversity evolutionary policy deep reinforcement learning (DEPRL).
In summary, load aggregators, acting on behalf of small and medium electricity consumers in price-responsive load trading, face the problem of how to purchase electricity from the market and sell it to consumers, and they need to optimise their decisions on both purchases and sales in order to maximize profits. Therefore, it is necessary to study the buying and selling strategies for price-responsive loads that load aggregators can carry out in dynamic trading in the electricity spot market. It is also necessary to overcome the slow training convergence that arises when the input dimension of reinforcement learning is too large. To address these problems, this study proposes a BiLSTM-based deep reinforcement learning method for load aggregators to purchase and sell electricity, taking the maximization of load aggregator revenue under the price-responsive load mechanism as the scenario. The contributions of this study are as follows.
We propose a BiLSTM-DDPG model to form the trading strategy for load aggregators. We describe the trading process as a partially observable Markov decision process (POMDP). The bidirectional LSTM neural network processes the state information along both directions of the time axis and generates bidirectionally encoded information to cope with dynamic changes in an uncertain environment. The proposed BiLSTM-DDPG method integrates time-domain processing and has autonomous recognition and decision-making capabilities. BiLSTM can extract features and temporal relationships while avoiding gradient vanishing and gradient explosion. DDPG allows for more accurate recognition and optimal decision-making in complex electricity spot market environments.

BiLSTM
Model. The recurrent neural network (RNN) is a neural network that feeds its own output back as input when processing temporal data. In a single computational unit, the input x_t at moment t and the computational output h_{t−1} from the previous moment t − 1 serve as inputs; in addition to the output y_t, the unit also generates h_t, which is passed on to the next moment t + 1 for the next computation. An RNN based on this design structure has predictive capability. LSTM is an improved RNN; compared with the RNN, the LSTM adds a forget gate and implements the forgetting function through a cell state parameter c. The LSTM structure is shown in Figure 1.
The LSTM cell contains a forget gate, an input gate, and an output gate. The forget gate f_t selectively discards information from the previous cell, as shown in equation (1): it takes the previous hidden state and the current input, and the sigmoid function outputs a value between 0 and 1, which is the proportion of transmitted information that is retained. The input gate, equation (2), controls the proportion of new information admitted from the current cell input; the candidate state C̃_t, equation (3), holds the new information; and equation (4) weights the retained and new information to form the current cell state C_t. The output gate determines how much information is output, and equations (5) and (6) pass part of the current cell state on to later cells [35][36][37]. Written out in standard form, the gate equations are

f_t = σ(W_f · [h_{t−1}, x_t] + b_f), (1)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i), (2)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C), (3)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t, (4)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o), (5)
h_t = o_t ⊙ tanh(C_t). (6)

The DDPG algorithm with LSTM added stores and passes on information about the trend of the hidden state of the environment in the time domain. The BiLSTM propagates the hidden-layer state along the time axis in both the "from the past to the future" and "from the future to the past" directions, as shown in Figure 2, capturing the transformation pattern of features on a bidirectional time axis. In the figure, LSTM1 and LSTM2 are the forward and backward LSTM models, respectively. The output at moment t concatenates the hidden states of the two directions: h_t = [h_t^forward; h_t^backward].
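As a concrete sketch of equations (1)-(6) and the bidirectional pass, the following minimal NumPy illustration runs one LSTM step per gate equation and concatenates forward and backward hidden states; the weight dictionaries and function names are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step following equations (1)-(6)."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, eq. (2)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate state, eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde       # cell state, eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, eq. (5)
    h_t = o_t * np.tanh(c_t)                 # hidden state, eq. (6)
    return h_t, c_t

def bilstm(xs, h0, c0, fwd_params, bwd_params):
    """Run forward and backward passes over the sequence and
    concatenate the per-step hidden states of both directions."""
    fwd, h, c = [], h0, c0
    for x in xs:                              # "past to future"
        h, c = lstm_cell(x, h, c, *fwd_params)
        fwd.append(h)
    bwd, h, c = [], h0, c0
    for x in reversed(xs):                    # "future to past"
        h, c = lstm_cell(x, h, c, *bwd_params)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]
```

Because the output gate and tanh both bound their outputs, every component of the concatenated hidden state stays strictly inside (−1, 1).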

Reinforcement
Learning. The mathematical basis for reinforcement learning is the Markov decision process (MDP), which consists of a state space, an action space, a state transition matrix, a reward function, and a discount factor. The MDP seeks a policy that allows the system to obtain the maximum cumulative reward. The state is a summary of the current environment; the state space is the set of all possible states, denoted S; an action is a decision made; the action space is the set of all possible actions, denoted A; the agent is the subject taking the action; and the policy function controls the agent's action based on the observed state.
Agent-environment interaction (AEI) occurs when an agent observes the state of the environment (s) and takes an action (a); the action changes the state of the environment, and the environment gives the agent a reward (r) and a new state (s′), as shown in Figure 3.
In this study, the decision problem can be expressed as (S, O, A, P, r, γ, ρ), where S is a set of continuous states and A is a set of continuous actions. P: S × A × S → R is the transition probability function, r: S × A → R is the reward function, γ is the discount factor, ρ is the initial state distribution, and O is the set of continuous partial observations corresponding to the states in S. In training, s_0 is obtained by sampling from ρ. At each time step t, the agent observes the current environment state s_t ∈ S, takes the action a_t = π(s_t) according to the policy π: S → A, obtains the reward r(s_t, a_t), and the environment moves to the new state s_{t+1}. The goal of the agent is to maximize the expected return

J(π) = E[Σ_{t≥0} γ^t r(s_t, a_t)].

The return is the discounted sum of future rewards,

G_t = Σ_{k≥0} γ^k r(s_{t+k}, a_{t+k}),

and the Q function is defined as

Q^π(s_t, a_t) = E[G_t | s_t, a_t].

In the partially observable case, the agent acts on partial observations, a_t = π(o_t), where o_t is the partial observation corresponding to the complete state s_t.
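The discounted return defined above can be computed recursively from the end of an episode, since G_t = r_t + γ·G_{t+1}; the sketch below is a generic illustration (the function name is ours):

```python
def returns_to_go(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for one episode, where
    G_t = r_t + gamma * G_{t+1}, computed backwards so each step
    reuses the tail sum instead of re-summing the series."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out
```

For example, with rewards [1, 1, 1] and γ = 0.5 the returns are [1.75, 1.5, 1.0].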

DDPG Model.
The DDPG algorithm incorporates the ideas of DQN and uses a deterministic policy function so that the problem can be handled better on high-dimensional continuous spaces. The learning framework for deterministic policies takes the actor-critic approach, where the actor is the action policy and the critic is the evaluation, which in this case estimates the value function using function approximation methods. The network structure of DDPG is shown in Figure 4.
DDPG uses two neural networks to represent the deterministic policy a = π_θ(s) and the value function Q_μ(s, a), with network parameters θ and μ. The policy network updates the agent's behavior policy and corresponds to the actor network in the actor-critic structure; the value network approximates the value function and provides gradient information for updating the policy network, corresponding to the critic network in the actor-critic structure. DDPG seeks an optimal policy π_θ that maximizes the expected return

J(θ) = E[Q_μ(s, π_θ(s))].

The policy network parameters are updated by the gradient

∇_θ J(θ) = E[∇_a Q_μ(s, a)|_{a=π_θ(s)} · ∇_θ π_θ(s)].

The expected return after taking action a in state s and following policy π thereafter is

Q_μ(s_t, a_t) = E[r(s_t, a_t) + γ Q_μ(s_{t+1}, π(s_{t+1}))].

The value network is updated following the value-network-updating method in DQN; namely, the value network parameters are updated by minimizing the loss

L(μ) = E[(y_t − Q_μ(s_t, a_t))²], with y_t = r(s_t, a_t) + γ Q_{μ′}(s_{t+1}, π_{θ′}(s_{t+1})),

where θ′ and μ′ denote the target actor network and target critic network parameters, respectively. DDPG uses an experience replay mechanism to obtain training samples [38][39][40][41]. The gradient of the Q-value function with respect to the agent's action is passed from the critic network to the actor network, and the policy network is updated in the direction that increases the Q-value according to the policy gradient ∇_θ J(θ).
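Two of the update ingredients described above, the critic target y_t and the soft (Polyak) averaging used for the target networks, can be sketched in a few lines; the parameter-dictionary representation and the symbol tau are illustrative assumptions, not the paper's code:

```python
def td_target(r_t, q_next, gamma, done=False):
    """Critic target y_t = r_t + gamma * Q'(s_{t+1}, pi'(s_{t+1}));
    bootstrapping stops at terminal states."""
    return r_t + (0.0 if done else gamma * q_next)

def soft_update(target_params, source_params, tau):
    """Soft target-network update: theta' <- tau*theta + (1-tau)*theta',
    applied elementwise to every parameter."""
    return {k: tau * source_params[k] + (1.0 - tau) * target_params[k]
            for k in target_params}
```

With a small tau, the target networks trail the learned networks slowly, which is what stabilizes the bootstrapped critic target.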

BiLSTM-DDPG-Based Trading Strategy for Load Aggregators on PRD.
The description of the variables of the BiLSTM-DDPG-based trading strategy for load aggregators on PRD is shown in Table 1. The BiLSTM-DDPG model processing steps for power markets are shown in Figure 5.
The DDPG deep reinforcement learning with the BiLSTM structure is based on the actor-critic network structure, as shown in Figure 6.
For load aggregators, the main objective of participating in price demand response is to maximize the benefits of energy trading. The total benefit received by the load aggregator in the real-time market is

R = R_RT − C_RT − C_DA,

where R_RT is the profit on electricity sales in the real-time market, C_RT is the cost of electricity purchased by the electricity seller in the real-time market, and C_DA is the cost of electricity purchased in the day-ahead market.
Here, λ_t^{RT−} is the selling price in the real-time market at time t, and P_{t,s}^{RT} is the amount of electricity sold; λ_t^{RT+} is the purchase price in the real-time market at time t, and P_{t,p}^{RT} is the amount of electricity purchased; λ_t^{DA} is the purchase price in the day-ahead market at time t, and P_t^{DA} is the amount of electricity purchased. The input of the neural network is the state, and the output is the action value. The network consists of three fully connected layers; the first two are activated by the rectified linear unit function, and the third is a linear layer. The agent is built according to the logic of the pseudo-code: it obtains the reward values, iterates through the Bellman equation, and then performs gradient descent on the difference between the target network and the action network, where the target network is updated using the soft-update method. The parameters and description of the DDPG algorithm used in the case study are shown in Table 2.
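The revenue definition above amounts to summing price-times-quantity terms over settlement intervals; a minimal sketch, with argument names as our own assumptions for the per-interval price and quantity sequences:

```python
def aggregator_profit(sell_price, sell_qty, buy_price_rt, buy_qty_rt,
                      buy_price_da, buy_qty_da):
    """R = R_RT - C_RT - C_DA, summed over all settlement intervals.
    Each argument is a sequence indexed by interval t."""
    r_rt = sum(p * q for p, q in zip(sell_price, sell_qty))      # real-time sales revenue
    c_rt = sum(p * q for p, q in zip(buy_price_rt, buy_qty_rt))  # real-time purchase cost
    c_da = sum(p * q for p, q in zip(buy_price_da, buy_qty_da))  # day-ahead purchase cost
    return r_rt - c_rt - c_da
```

For instance, selling 5 MWh at 10 $/MWh while buying 2 MWh at 8 $/MWh in real time and 3 MWh at 6 $/MWh day-ahead gives 50 − 16 − 18 = 16.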
The DDPG that introduces the BiLSTM needs the temporal order of states during training, so the corresponding experience pool data are saved as whole-episode sequences to provide experience for subsequent updates of the actor and critic networks, where T is the number of steps per episode. When the number of time steps reaches a multiple of T, the historical data record is cleared, and a new empirical sequence is recorded.
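The episode-wise experience pool described above can be sketched as a buffer that accumulates T transitions, commits them as one sequence, and then clears its working history; the class and method names are illustrative, not the paper's implementation:

```python
from collections import deque

class SequenceReplayBuffer:
    """Stores transitions as whole episodes of length T so that the
    BiLSTM can later be unrolled over the stored state order."""

    def __init__(self, capacity, T):
        self.episodes = deque(maxlen=capacity)  # committed sequences
        self.T = T
        self._current = []                      # working history

    def push(self, o, a, r, s):
        """Append one (observation, action, reward, state) transition;
        commit the sequence once T steps have accumulated."""
        self._current.append((o, a, r, s))
        if len(self._current) == self.T:
            self.episodes.append(self._current)
            self._current = []                  # clear history, start anew

    def sample(self, rng):
        """Draw one stored episode uniformly at random."""
        return self.episodes[rng.randrange(len(self.episodes))]
```

Sampling whole sequences (rather than isolated transitions, as in plain DDPG) is what lets the recurrent actor and critic be updated over a consistent hidden-state trajectory.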
We can reconstruct observable historical information and full-state historical information from the empirical data. The critic and actor networks are updated separately; as BiLSTM is a time-series-based RNN, the updates to both networks are backpropagated through time (BPTT). The pseudo-code of BiLSTM-DDPG is as follows:

Initialize the critic network Q_μ(a_t, s_t) and the actor network π_θ(h_t) with parameters μ and θ;
Initialize the target networks: μ′ ← μ, θ′ ← θ;
Initialize the experience replay buffer R;
for episode = 1, …, M do
  Clear the history information h_0 and c_0;
  for t = 1, …, T do
    Get the observation o_t and full state s_t from the environment;
    Update the history information: h_t ← (h_{t−1}, a_{t−1}, o_t);
    Generate the action: a_t = π_θ(h_t) + ξ;
  end for
  Store the empirical sequence (o_1, a_1, r_1, s_1, o_2, a_2, r_2, …);
end for

The comparison of the profit curves for the load aggregator trading strategy in ISO-NE is shown in Figure 7; the performance of buying and selling is shown in Table 3; trading strategies from April 1, 2021, 0:00, to April 2, 2021, 4:30, in ISO-NE are shown in Figure 8. The overall evaluation from April 1, 2021, 0:00, to April 2, 2021, 4:30, in ISO-NE is shown in Table 4. It is demonstrated that the proposed method is more economical than DNN-DDPG, RNN-DDPG, and LSTM-DDPG, indicating that it has better convergence ability.

Experiment 2: Load Aggregator's Trading Strategy Every
Half Hour for 2 Days in AEMO. In this experiment, for the prediction of the half-hourly trading strategy over two days, data from January 1, 2018, to September 27, 2021, are used as the training set, data from September 28, 2021, to September 29, 2021, are used as the validation set, and data from September 30, 2021, are used as the test set. The comparison of the profit curves for the load aggregator trading strategy in AEMO is shown in Figure 9; the performance of buying and selling is shown in Table 5. Trading strategies from April 1, 2021, 0:00, to April 2, 2021, 4:30, in AEMO are shown in Figure 10. The overall evaluation from April 1, 2021, 0:00, to April 2, 2021, 4:30, in AEMO is shown in Table 6. It is also demonstrated that the proposed method is more economical than DNN-DDPG, RNN-DDPG, and LSTM-DDPG, indicating that it has better convergence ability.

Conclusions
This study investigates deep reinforcement learning for load aggregators' trading strategy in the electricity spot real-time market. The proposed improved DDPG algorithm can be used for load aggregators' real-time load purchase and sale transactions in the electricity spot real-time market. The main work is as follows: (1) an improved BiLSTM-DDPG with better convergence ability is proposed to solve the problem that DDPG does not converge easily when the input dimension is too large; (2) deep reinforcement learning is introduced into the analysis of power purchase and sale strategies in the electricity spot market so that load aggregators can participate in demand response with better results; and (3) in the ISO-NE and AEMO cases, it is shown that, under the strategy produced by the proposed method, it is more economical for the load aggregator to participate in the price-responsive load than under DNN-DDPG, RNN-DDPG, or LSTM-DDPG. The proposed algorithm can be applied to scenarios in the electricity market with large data volumes and strict timeliness requirements, providing a direction for the study of such optimisation problems. This study focuses on the load aggregator's purchase and sale model and has not studied point-to-point users. Future research will combine transfer learning and federated learning to achieve distributed peer-to-peer transaction optimisation in the electricity retail market.
Data Availability

The data of the models and algorithms used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.