HVAC Optimal Control with the Multistep-Actor Critic Algorithm in Large Action Spaces

We propose an optimization method, named the Multistep-Actor Critic (MAC) algorithm, which combines a value-network with an action-network based on the deep Q-network (DQN). The method is intended to solve the energy-conservation optimization problem of heating, ventilating, and air-conditioning (HVAC) systems in large action spaces, principally for cases with high computation and convergence times. It employs the multistep action-network and a search tree to generate an original state, and then selects the optimal state with the value-network over the original state and its adjacent states. Applying the MAC algorithm to a TRNSYS simulation problem, which refers to a real supertall building in Hong Kong, shows that the proposed algorithm balances control actions among different HVAC subsystems. Further, it substantially reduces the computational time while maintaining good energy-conservation performance.


Introduction
The building sector, on a global scale, accounts for nearly 40% of total energy consumption and contributes 30% of total CO2 emissions [1]. In China, buildings account for nearly 20.6% of total energy consumption and contribute 19.4% of total CO2 emissions [2]. By 2018, global energy consumption grew almost twice as fast as the average since 2010, with about 80% of it coming from fossil fuels [3]. HVAC is considered the largest energy-consuming system in a building [4], since it accounts for more than 50% of building energy consumption. Therefore, improving the control strategy of the HVAC system can enhance overall efficiency [5]. The energy consumption of HVAC is directly related to the set points of its systems and subsystems [6], in addition to the building type and climate, which also affect consumption.
Various optimization strategies have been used [7] to reduce the energy consumption of the HVAC system. It is far more sustainable and cost-effective to improve the control algorithms with more efficient modern technologies than to replace the HVAC equipment [8]. Research has revealed that appropriate control strategies for the HVAC system can maximize the overall operating efficiency while maintaining satisfactory thermal comfort [9], whereas unreasonable control strategies can result in excessive energy consumption. However, most existing HVAC systems have been optimized only locally, leading to moderate performance [10]. Concurrently, one disadvantage of methods such as dynamic programming, game theory, and Markov decision processes is that they must be recalculated whenever the system needs to be reoptimized or its underlying assumptions change. This can be time consuming in complex and changing environments [11]. Reinforcement learning and deep reinforcement learning are model-free, self-learning, and capable of online learning, so they can provide faster solutions and better adaptation to changing environments than traditional methods. Moreover, they are well suited to the cost-minimization problem, with the ability to learn optimal behavior even though the global optimum is unknown [12].
Reinforcement learning is a branch of machine learning that emphasizes how an agent acts in an environment to maximize the expected reward [13]. The agent learns by trial and error and finally approaches the optimal decision [14]. Studies have shown that reinforcement learning is able to solve stochastic optimal control problems [15] and an energy consumption scheduling problem [16] with dynamic pricing [17]. The combination of the decision-making ability of reinforcement learning and the perception ability of deep learning engenders a novel method termed deep reinforcement learning. Deep reinforcement learning has overcome the challenge of learning from high-dimensional state inputs by taking advantage of the end-to-end learning capability of deep neural networks [18]. Deep reinforcement learning has been successfully applied to games [19], robotics [20], and natural language processing (NLP) [21]. Concurrently, several papers have reported applications of reinforcement learning and deep reinforcement learning to HVAC systems.
We propose a novel optimization method, called the MAC algorithm, for the energy-conservation optimization of HVAC systems based on the DQN and a value-network, balancing performance against computation time. The objective was to determine the ability of MAC to solve the convergence problem caused by the large number of state-action pairs, while maintaining good energy performance. The HVAC system of a supertall building in a subtropical region has been considered; based on this system, various forms of the proposed method for optimizing HVAC systems have been tested and evaluated. The rest of the paper is organized as follows. Section 2 introduces related optimization methods for HVAC. Section 3 introduces the modeling of the HVAC system and the system formulation. The proposed optimization method is expounded step by step in Section 4. Various forms of the proposed method for optimizing HVAC systems are tested and evaluated in Section 5. Lastly, the conclusions are given in Section 6.

Related Work
Many problems in the HVAC system can be transformed into decision-making tasks. Various optimization methods proposed for HVAC systems in recent decades are listed in Table 1. The first category includes the traditional mathematical optimization methods, such as the Newton-Raphson method [22] and the interior point method [23]. These methods generally have the advantage of a rigorous mathematical model and tight logic. However, explicit objective function expressions are difficult to abstract in a myriad of actual optimization scenarios; hence, those methods are often not viable. The second category includes the heuristic methods, such as the genetic algorithm (GA) [24], ant colony optimization [25], and particle swarm optimization [26]. These methods are applicable to almost all optimization problems and perform well, especially for nonconvex optimization. However, heuristic methods usually yield only locally optimal solutions, are less robust, and lack rigorous mathematical proof [27].
Another category includes the machine learning methods, such as reinforcement learning [28][29][30] and deep reinforcement learning [31]. As the most popular technology in recent years, these methods are not restricted by an exact objective function like the traditional mathematical optimization methods, and they are more robust, with more stable convergence, than the heuristic methods. However, because HVAC systems are time varying, an essentially unbounded number of state-action pairs can exist in buildings, owing to the outdoor environment and the thermal characteristics of the spaces. Most existing machine learning methods are limited to the local control of one subsystem or to low-precision control of all subsystems, in order to avoid the extensive computation time required to update the policy [32].
Furthermore, there are several alternative optimization methods for HVAC systems. Sun et al. [33] have proposed a multiplexed optimization scheme for complex air-conditioning systems. He et al. [34] have proposed a computational intelligence algorithm based on performance optimization for HVAC. Wang et al. [35] have proposed a new method called event-driven optimal control for HVAC systems, in which optimization actions are triggered by events instead of a clock. Baldi et al. [36] have proposed a holistic framework for HVAC systems with energy-aware and comfort-driven maintenance. Baldi et al. [37] have also proposed a switched self-tuning approach to solve a multiple-mode feedback-based optimal control problem for HVAC systems.
However, all of these methods suffer from computation time that increases sharply with the size of the search space. For a vast search space, the growing computation time makes most of these optimization algorithms unworkable. According to relevant studies [34], the average computational time of the Strength Pareto Evolutionary Algorithm is nearly 900 seconds for an optimization time interval of 15 minutes; hence, the computation time is unacceptably long. In the online optimal control of HVAC systems, excessively long computational time degrades the optimal control performance owing to the ensuing delay in the control response.

System Description and Formulation
3.1. Complex HVAC System. A supertall building in a subtropical region has been selected as the case study. The HVAC system of the building is all-electric cooling without thermal storage (with primary and secondary chilled water loops), which is a commonly used commercial configuration. Based on the reference building, a simulation model has been established in the TRNSYS software. The HVAC system of the building comprises a cooling water loop, two chilled water loops (a primary loop before the heat exchangers and a secondary loop after them), and an air distribution subsystem. In the simulation model, the dynamic performance of the heat exchangers and the AHUs is simulated by Type 699 and Type 508a. The time delay of the chilled water and the cooling water is simulated by Type 661. The cooling load and the weather information are read into TRNSYS by Type 9e. MATLAB is used to simulate the control algorithms of the HVAC subsystems and the optimization method, via the TRNSYS component Type 155. The system simulation model, which has been verified, is shown in Figure 1, and its accuracy is found to be acceptable. The structure of the complex AC system in this case study is shown in Figure 2; it consists of cooling towers, chillers, heat exchangers, air-handling units, and zones. The cooling towers deliver the condenser water to the chillers. The heat exchangers deliver the cold water supplied by the chillers from the primary to the secondary chilled water loops. The air-handling units introduce outdoor fresh air and exhaust air through the air distribution subsystem.

System Formulation.
The optimization of the HVAC system seeks the optimal settings of all subsystems, viz., the cooling water supply temperature, the chilled water supply temperature, the chilled water supply temperature in the heat exchangers, and the supply air temperature, aiming to lower the system energy consumption at each point. The objective is to minimize the system energy consumption, given by

P_sys,tot = P_ch,tot + P_ct,tot + P_pump,tot + P_fan,tot, (1)

where P is the energy consumption and T is the temperature. The subscripts sys,tot; ch; ct; pump; and fan represent the whole system, the chillers, the cooling towers, the pumps, and the fans, respectively. Further, the cooling water, chilled water, primary loop, secondary loop, and supply air are represented by the subscripts cw, chw, prm, sec, and sa, respectively. The superscript "*" denotes the optimum value of the corresponding decision factor.
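As a minimal sketch, the objective in equation (1) is simply the sum of the subsystem power terms; the function and variable names below are illustrative, not from the paper:

```python
def system_power(p_ch_tot, p_ct_tot, p_pump_tot, p_fan_tot):
    """Total HVAC power per equation (1):
    chillers + cooling towers + pumps + fans."""
    return p_ch_tot + p_ct_tot + p_pump_tot + p_fan_tot

# Hypothetical subsystem readings (kW) for one optimization stage:
print(system_power(820.0, 95.0, 140.0, 210.0))  # -> 1265.0
```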
Here, the four set points are the cooling water supply temperature, the chilled water supply temperature, the chilled water supply temperature in the heat exchangers, and the supply air temperature. Many studies have shown that these set points have a significant impact on the energy consumption of the whole system and that their optimal values vary with the operating conditions.
The ranges of these set points are subject to constraints (2)-(5), which account for operational and other types of constraints. Two additional constraints, (6) and (7), account for system stability and the minimal temperature difference between the primary and secondary sides of the chilled water loops. The specific settings are shown in Table 2.

Multistep-Actor.
We propose the use of basic actions to address the high computation and convergence times caused by a large action space. By combining the transition model and a search tree, the selection of a sequence of basic actions replaces the selection of the optimal action from the full action set. The specific contents are given as follows.

Constructing DQN Based on the Basic Action and
Training Model. Map the states within the continuous HVAC output space to a discrete state set (S), and then abstract the discrete action set (A) from the discrete state set. Under the constraints of the HVAC system, the number of controlled spaces in the HVAC system is N. The traditional DQN is constructed over all controlled spaces in the HVAC system. However, it occasionally fails to converge in finite time when N is large, owing to the extensive computation time required for the policy to converge.
To address these issues, the DQN in our architecture has been built over basic actions (a), which comprise card(A) actions. Concurrently, the value-network and the transition model are also built over the basic actions. The basic actions can be defined as follows: if the action space can be represented by combinations of a subset of itself, then that subset is called the basic actions. The choice of basic actions follows the principles given below. The DQN, the value-network, and the transition model have been built as neural networks with a replay memory and trained against the simulation model. At each optimization stage, we obtain (s, E, a, s_s, s′, E′) and place it in the replay memory, where (s, E, s_s, s′, E′) come from the simulation model and a represents the action abstracted from s to s_s. Here, s is the current state and E is the energy consumption of the current state. The factors in s_s form the next setting state, the factors in s′ form the next actual state, and E′ is the energy consumption of the next state.
The transition model is trained on (s, a, s′) from the replay memory, where s and a are the inputs and s′ is the output. The value-network is trained on (s, E, s′, E′) from the replay memory, where s is the input and E is the output; likewise, s′ is the input and E′ the output. The DQN is trained on tuples (s, a, r, s′) from the replay memory, where r = E′ − E. The training processes of the transition model and the value-network are the same as those of a traditional neural network. The training process of the DQN-based HVAC control algorithm is shown in Algorithm 1, where t mod k is the control time step.
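The bookkeeping described above can be sketched as follows. The container and function names are hypothetical; only the tuple layout (s, E, a, s_s, s′, E′) and the reward r = E′ − E come from the paper:

```python
from collections import deque
import random

replay = deque(maxlen=100_000)  # bounded replay memory

def store(s, E, a, s_set, s_next, E_next):
    """Store one optimization stage: (s, E, a, s_s, s', E')."""
    replay.append((s, E, a, s_set, s_next, E_next))

def sample_batches(batch_size):
    """Split one sampled minibatch into the three training sets."""
    batch = random.sample(list(replay), batch_size)
    # Transition model: (s, a) -> s'
    trans = [((s, a), s_next) for s, _, a, _, s_next, _ in batch]
    # Value-network: s -> E and s' -> E'
    value = ([(s, E) for s, E, *_ in batch]
             + [(s_next, E_next) for *_, s_next, E_next in batch])
    # DQN: (s, a, r, s') with r = E' - E (negative when energy drops)
    dqn = [(s, a, E_next - E, s_next) for s, E, a, _, s_next, E_next in batch]
    return trans, value, dqn
```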

Multistep-Actor Based on the Transition Model and
Search Tree. Under the constraints of the HVAC system, we construct the search tree with the transition model and the current state to obtain leaf nodes, where the action a_m is chosen by the DQN described in Section 4.1.1. The DQN is used to parameterize the Q function and provide actions.
The search tree chooses the next state (s_1m) by a_m. If a_m is not a_Z (a_Z = [0, 0, 0, 0]), then s_1m is fed to the actor (DQN) to get the next action a_i, and the search tree chooses the next state (s_2i) by a_i. Once the action a from the actor leaves the state unchanged, the search tree chooses the next state by the action a_Z and stops the rollout. Finally, we obtain the original state ŝ from the search tree, which is determined by the actions, the transition model, and the current state. The process is shown in Figure 3.
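A minimal sketch of the multistep rollout described above, assuming the actor and transition model are supplied as callables; the names, the depth guard, and the toy dynamics in the test are illustrative:

```python
A_Z = (0.0, 0.0, 0.0, 0.0)  # the "stop" action that leaves the state unchanged

def multistep_rollout(s, actor, transition_model, max_depth=50):
    """Roll the actor forward through the transition model until it
    emits the zero action a_Z; returns the original state s_hat and
    the visited path. max_depth guards against non-terminating policies."""
    path = [s]
    for _ in range(max_depth):
        a = actor(s)                 # DQN picks a basic action for state s
        if tuple(a) == A_Z:          # actor signals "no further improvement"
            break
        s = transition_model(s, a)   # predicted next state
        path.append(s)
    return s, path
```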

Critic.
The critic module is added on top of the multistep-actor to further improve the energy-saving efficiency and find the optimal state. Its purpose is to select the optimal state by comparing the energy consumption values of the original state and its nearby states. The specific contents are as follows.

States Chosen by KNN.
The original state ŝ from the multistep-actor is unlikely to be an optimal state, owing to the error of the transition model and the accumulation of error over multiple steps. Nevertheless, the optimal state must be close to this original state. We need to map from ŝ to elements of S, which can be done as follows: S_k = argmin(k)_{s ∈ S} ||s − ŝ||_2. S_k is the k-nearest-neighbor mapping of the discrete state set (S): the k states in S that are closest to ŝ by Euclidean distance. In the exact case, this lookup has the same complexity as the argmax in the value-function-derived policies described in Section 4.2.2, but each evaluation step is a Euclidean distance, not a full value-function evaluation.
This task has been extensively studied in the approximate-nearest-neighbor literature, and such lookups can be performed in logarithmic time. This step is described by the bottom half of Figure 4, where the actor network produces a proto-action and the k nearest neighbors are chosen from the action embedding.
The time complexity of the above algorithm scales linearly with the number of selected states (k). However, as we will see in practice, increasing k beyond a certain limit does not improve performance. We studied the effect of k, and the results show significant energy-conservation gains for the initial increase in k, after which the additional gains quickly become negligible.
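The k-nearest-neighbor lookup can be sketched with a brute-force scan; a real system would use an approximate-nearest-neighbor index to obtain the logarithmic-time behavior mentioned above, and the names here are illustrative:

```python
import numpy as np

def k_nearest_states(s_hat, state_set, k):
    """Return the k states in the discrete set closest to s_hat by
    Euclidean distance (brute-force stand-in for an ANN index)."""
    s_hat = np.asarray(s_hat, dtype=float)
    states = np.asarray(state_set, dtype=float)
    d = np.linalg.norm(states - s_hat, axis=1)  # distance to each state
    idx = np.argsort(d)[:k]                     # indices of the k smallest
    return states[idx]
```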

Critic Selection Based on Value-Network.
Depending on how well the state representation is learned, states with a low value may occasionally sit closest to ŝ, even in a part of the space where most states have a high value. Furthermore, certain states may be adjacent to each other in the embedding space even though, under certain conditions, they must be distinguished because their values differ. In both of these cases, trivially selecting the element of the previously generated set that is closest to ŝ is not ideal.
To avoid selecting these exceptional states, a refinement is applied to the finally emitted state. The second phase of the algorithm, described by the top part of Figure 4, refines the choice of state by selecting the best-scoring state according to the value-network.
As demonstrated in Section 4.2, this second phase makes our algorithm significantly more robust to imperfections in the choice of state representation, and it is essential for learning in certain domains. The size of the generated state set, k, is task specific and allows an explicit trade-off between policy quality and speed.
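The critic refinement reduces to an argmin over the k candidates under the value-network; a minimal sketch, where the value_network callable stands in for the trained network mapping a state to its predicted energy consumption:

```python
def critic_select(candidates, value_network):
    """Among the k candidate states, return the one with the lowest
    predicted energy consumption: s* = argmin_{s in S_k} V(s)."""
    return min(candidates, key=value_network)

# Any state -> scalar-energy callable works; a trained value-network
# would replace the toy scorer used below.
```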

Multistep-Actor Critic (MAC) Algorithm.
We propose a novel optimization method, termed the Multistep-Actor Critic (MAC), to solve the optimization problem of HVAC energy consumption. The subsystems of the HVAC system interact with each other: a decrease in the energy consumption of one subsystem may increase the energy consumption of another. Hence, the controllable decision factors in the HVAC system are considered as a whole. The details are as follows.
First, basic actions are identified in the action set, and the DQN, the value-network, and the transition model are constructed on the basis of these basic actions. Second, the DQN, the value-network, and the transition model are trained simultaneously online against the simulation model in TRNSYS. Third, at each optimization stage, i.e., every 30 minutes, the search tree is constructed with the transition model to obtain leaf nodes from the current state, and the current state is fed to the DQN to get the best action, i.e., the one that yields the lowest energy consumption. Thereafter, the next state is chosen via the search tree until the state stabilizes. Fourth, the original state obtained in step 3 is used, and k states close to it are chosen by KNN. Finally, the setting values with the optimal energy consumption are obtained by finding, with the value-network, the state among those k states that yields the lowest energy consumption.
The Multistep-Actor Critic algorithm is described fully in Algorithm 2 and illustrated in Figure 4.

Building DQN.
We consider the reliability and robustness of the data in this work. On this basis, the cooling water supply temperature, chilled water supply temperature, chilled water supply temperature in the heat exchangers, supply air temperature, cooling load, dry-bulb temperature, and wet-bulb temperature have been selected as the state (s). Among these, the cooling water supply temperature (Tcw), the chilled water supply temperature (Tchw), the chilled water supply temperature in the heat exchangers (Tchw_HX), and the supply air temperature (Tsa) are the controllable decision factors.
Under the constraints of (6) and (7), the controllable space of the four controllable decision factors has been discretized with a step of 0.1°C. Finally, the DQN constructed in this paper contains seven neurons in the input layer, corresponding to the seven components of the state (s). The three hidden layers contain 600, 1000, and 600 neurons, respectively, and the number of neurons in the output layer equals the number of basic actions.
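The stated layer sizes can be sketched as a plain forward pass. The 625-way output assumes the 5^4 basic-action set used later in the experiments, and the ReLU activation and random initialization are assumptions, since the paper does not state them:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases (initialization assumed)."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

# 7 state inputs -> 600 -> 1000 -> 600 -> |basic actions| Q-values.
# 625 assumes the 5^4 basic-action set of 625-MAC.
sizes = [7, 600, 1000, 600, 625]
layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def q_values(state):
    """Forward pass with ReLU on the hidden layers (activation assumed)."""
    x = np.asarray(state, dtype=float)
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x                        # one Q-value per basic action
```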

Training DQN, Value-Network, and the Transition
Model. Since the largest basic action is 0.3 and the rest are smaller, s′ cannot be obtained exactly owing to PID control, and hence s′ must be approximated by s_s. We therefore default the next state s′ to the set value s_s, which implies that the corresponding probability in the transition model is 1.
The value-network constructed in this paper is a radial-basis-function neural network containing seven input neurons corresponding to the state (s), 50 neurons in the hidden layer, and one output neuron corresponding to the energy consumption (E).
Considering the variations of the cooling load, dry-bulb temperature, and wet-bulb temperature, as well as the time lag of the system, 30 minutes has been set as the time interval for the energy consumption model in this paper. At each optimization stage (every 30 minutes), we obtained (s, E, a, s′, E′) and added it to the replay memory. The value-network has been trained on (s, E, s′, E′) from the replay memory, and the DQN was trained as described in Section 4.1.1, where an episode was 30 minutes, with r = E′ − E and C = 2 (every 1 hour) in this case.

System Optimization.
At each optimization stage (every 30 minutes), first, the search tree is constructed with the transition model to obtain leaf nodes from the current state, and the current state is then fed to the DQN to get the best action, i.e., the one that yields the lowest energy consumption. Thereafter, the next state is chosen via the search tree until the state becomes invariant. Second, the original state obtained in step 1 is used, and k states close to it are selected by KNN. The optimal energy consumption setting is then obtained by finding the state among those k states that yields the lowest energy consumption according to the value-network; hence, s* = argmin_{s ∈ S_k} V_θ(s). Finally, the values of the four controllable decision factors are determined.

Exhaustive Method (EXM).
The exhaustive method (EXM) enumerates all the candidate settings of the problem. In this experiment, there are four controllable decision factors, viz., Tchw_HX, Tsa, Tcw, and Tchw. The method takes into account the stability of the HVAC system (the maximum change of each controllable decision factor should be kept within 1 degree) and the limited response time of the HVAC system (when the range of each decision factor is large, the search time is long).
If the simulation system does not receive the set value in time, the system crashes. Therefore, the range of each decision factor has been defined as [T − 0.5°C, T + 0.5°C], where T is the initial value of each controllable decision factor. When this range exceeds constraints (6) and (7), it is truncated to satisfy them. The four controllable decision factors thus form 14641 states; subsequently, a trained neural network has been employed to predict the energy consumption of each state. Finally, the point with the lowest energy consumption has been selected.

DQN with Low-Precision Control (DQN-L). DQN-L is a method based on the traditional DQN with low-precision control. The control action set of DQN-L is {−1, −0.5, 0, 0.5, 1}, which takes into account the stability of the HVAC system (the maximum change of each controllable decision factor should be kept within 1 degree) and the limited response time of the HVAC system. In this experiment, there are four controllable decision factors, viz., Tchw_HX, Tsa, Tcw, and Tchw. When the range exceeds that of (6) and (7), it is truncated to satisfy the constraint. The method selects the control action with the lowest energy consumption from the DQN. Finally, the values of the four controllable decision factors are set in the HVAC system.
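The EXM candidate enumeration can be sketched as follows. With a 0.1°C step over [T − 0.5, T + 0.5], each factor takes 11 values, giving 11^4 = 14641 candidates; the step size and the starting set points are illustrative, and truncation against constraints (6) and (7) is omitted:

```python
import itertools
import numpy as np

def exhaustive_candidates(current, step=0.1, half_range=0.5):
    """Enumerate all set-point combinations in [T - 0.5, T + 0.5]
    per factor; a trained predictor would then score each candidate."""
    n = int(round(2 * half_range / step)) + 1   # 11 values per factor
    grids = [np.round(np.linspace(t - half_range, t + half_range, n), 2)
             for t in current]
    return list(itertools.product(*grids))

# Hypothetical current set points: Tcw, Tchw, Tchw_HX, Tsa
cands = exhaustive_candidates((30.0, 7.0, 8.5, 16.0))  # 14641 candidates
```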

Results and Analysis.
The experiments focus on the HVAC energy consumption in spring, summer, and autumn, respectively. At each optimization stage (every 30 minutes), all methods have been optimized according to the current environment, including the cooling load, dry-bulb temperature, wet-bulb temperature, and the setting values of the subsystems. With each method, the set points of the cooling water supply temperature, the chilled water supply temperature, the chilled water supply temperature in the heat exchangers, and the supply air temperature have been optimized. The performance of all methods was simulated and evaluated with respect to two indicators, the daily energy saving and the computational time. The first indicator represents the energy performance, whereas the second reflects the computational complexity of the method.

Energy Performance Comparison of Different Basic
Actions.
The energy performance of the algorithm has been compared with a benchmark case in which the decision factors, including the set points of the cooling water supply temperature, the chilled water supply temperature, the chilled water supply temperature in the heat exchangers, and the supply air temperature, are held constant. Table 3 shows the daily energy saving rates of these methods in spring, summer, and autumn. From Table 3, the highest daily energy saving rate was achieved by EXM in spring and autumn, but by 2401-MAC in summer.

Energy Performance Comparison of Different Nearest
States. Although 2401-MAC has a higher energy-saving effect than 625-MAC, the slight improvement seems negligible after comprehensive consideration of the training convergence cost; hence, we choose 625-MAC as the next research object. Figures 8-10 show the energy-saving effect of EXM, DQN-L, and 625-MAC with different numbers of nearest states, compared with the benchmark. The k in MAC-k indicates the number of nearest states. In most time steps, 625-MAC-1000 and 625-MAC-5000 have almost the same power consumption and reduce the energy consumption more than 625-MAC. Furthermore, at certain times, 625-MAC-1000 and 625-MAC-5000 can find the lowest-power set point. Table 4 lists the daily energy saving rates of these methods in spring, summer, and autumn. From Table 4, 625-MAC-1000 and 625-MAC-5000 have the highest daily energy saving rates for all three seasons. Table 5 compares the computational time of EXM, DQN-L, and MAC. The time required for the optimization has been taken as the computational cost for the method to search for the optimal setting, measured on a CPU i5-4460. The calculations of each algorithm simulate the running load of one optimization (the time is the average over one day), and EXM has been used as the benchmark for the computational time comparison.
We note that MAC significantly reduces the computational time relative to EXM, saving 97.4% of it.

(1) Initialize replay memory D to capacity N
(2) Initialize action-value function Q with random weights θ
(3) Initialize target action-value function Q̂ with random weights θ⁻ = θ
(4) For episode = 1 to M do
(5)   Reset the building environment to the initial state
(6)   Initialize sequence s_1 = x_1 and preprocessed sequence ϕ_1 = ϕ(s_1)
(7)   For t = 1 to T do
(8)     If t mod k == 0 then
(9)       With probability ε select a random action a_t
(10)      Otherwise select a_t = argmax_a Q(ϕ(s_t), a; θ)
(11)      Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
(12)      Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess ϕ_{t+1} = ϕ(s_{t+1})
(13)      Store transition (ϕ_t, a_t, r_t, ϕ_{t+1}) in D
(14)      Sample a random minibatch of transitions (ϕ_j, a_j, r_j, ϕ_{j+1}) from D
(15)      Set y_j = r_j if the episode terminates at step j + 1; otherwise y_j = r_j + γ max_{a′} Q̂(ϕ_{j+1}, a′; θ⁻)
(16)      Perform a gradient descent step on (y_j − Q(ϕ_j, a_j; θ))² with respect to θ
(17)      Every C steps reset Q̂ = Q
(18)    End if
(19)  End for
(20) End for

ALGORITHM 1: Training process of the DQN-based HVAC control algorithm.
(1) Build the DQN, constructed over basic actions
(2) Train the DQN, the value-network, and the transition model
(3) For each optimization stage (every 30 minutes)
(4)   Construct the search tree based on the transition model and basic actions
(5)   While the action a is not a_Z (a_Z = [0, 0, 0, 0])
(6)     Choose the best action with the DQN, and pass this choice to the search tree
(7)     Expand the search tree based on this choice and the transition model
(8)   Choose k states close to the original state by KNN from the state set
(9)   Based on the value-network, find the best state, i.e., the one that yields the lowest energy consumption
(10)  Change the set points
(11) End for

ALGORITHM 2: Multistep-Actor Critic algorithm.

Conclusion
We have reviewed the existing HVAC optimization algorithms and proposed a novel optimization method, called MAC, that adapts the DQN framework to solve the HVAC optimal control problem in large discrete action spaces while maintaining good energy performance and low computational time. The results of the 2-month simulation experiment have shown that the basic actions have an important influence on MAC, and that the number of KNN states has a positive effect on MAC's choice of good set points. However, the number of KNN states is not linearly related to the energy-saving performance. We have therefore chosen 625-MAC-1000 as the best setting. The 625-MAC-1000 can be easily trained to obtain lower-energy-consumption points within a short time interval. Overall, MAC offers a comprehensive advantage in terms of both energy saving and computational time.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no known conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.