DDPG-Based Energy-Efficient Flow Scheduling Algorithm in Software-Defined Data Centers

With the rapid growth of data centers, their energy consumption can no longer be underestimated. How to intelligently manage software-defined data center networks to reduce network energy consumption and improve network performance has become an important research subject. In this paper, for flows with deadline requirements, we study how to design a rate-variable flow scheduling scheme that saves energy and minimizes the mean completion time (MCT) of flows while meeting their deadlines. The flow scheduling optimization problem can be modeled as a Markov decision process (MDP). To cope with the large solution space, we design a DDPG-EEFS algorithm to find the optimal scheduling scheme for flows. The simulation results reveal that the DDPG-EEFS algorithm trains on only part of the states yet achieves a good energy-saving effect and network performance. When the traffic intensity is small, the transmission time performance can be improved by sacrificing a little energy efficiency.


Introduction
With the development of 5G technology [1][2][3][4], more and more data centers, as important carriers of data storage and processing, will be established [5][6][7]. Worldwide, data center energy consumption has reached about 8% of global energy consumption. It mainly comes from three aspects: the network, the servers, and the refrigeration system. Network energy consumption accounts for about 20% of total data center energy consumption [8]. With the introduction of DVFS (Dynamic Voltage and Frequency Scaling) and hardware virtualization technology [9,10], the energy efficiency of servers has been greatly improved, and more and more new heat-dissipation technologies have emerged [11]. As a result, the proportion of network energy consumption in total data center energy consumption will continue to increase and cannot be ignored.
Network energy-saving technology can be divided into three types: topology design and transformation [12], device sleeping (DS) [13][14][15][16][17][18], and adaptive link rate (ALR) [19][20][21]. Energy-saving topology design and transformation mainly studies how to maximize the ratio of the forwarding capacity provided by the topology to the total network energy consumption under given constraints, so as to improve the energy utilization of the topology. DS technology exploits the fact that, in a data center network with SDN technology, the "rich connection" topology offers many alternative routes. Under the constraints of network connectivity and performance, network flows are aggregated and transmitted in a subset T of the network topology G, and the devices and links outside T are put to sleep to save energy. The ALR-based energy cost model shows that link energy consumption grows superlinearly with the actual transmission rate on the link. Once the energy-saving topology is determined, the ALR-based energy cost model can guide further improvement of the energy efficiency of network links by adjusting the bandwidth usage of different links at different times. There are few studies on this third approach, and it is the focus of this paper.
The representative ALR-based energy-saving scheme is the preemptive Most-Critical-First flow scheduling algorithm proposed in Ref. [20]. It selects the interval with the largest energy consumption density as the critical interval, and all flows in this critical interval are scheduled preferentially. The transmission rate of every flow is constant; if the rate were time-variable, the energy saving could be more efficient. Hence, we proposed a rate-variable routing scheme based on a bandwidth-sharing mechanism in Ref. [21]. It sorts the flows according to the EDF (earliest deadline first) policy and then calculates the routes in turn. However, the EDF-based scheduling order is too simple, and the energy efficiency still has room for improvement.
In addition to the optimization goal of energy saving, network operators must ensure and optimize QoS (quality of service). There are representative flows in data center networks, such as the many requests and responses generated by search and social-network services, which must traverse the data center to perform the tasks requested by the user. The performance the user cares about is the response speed of the requests, and a tolerable deadline is generally given [22]. If the mean completion time (MCT) of the flows can be decreased while ensuring both the deadlines and the network energy efficiency, the QoS will be greatly improved [23]. Accordingly, this paper assumes the routes have been determined and designs a rate-variable flow scheduling mechanism to minimize network energy consumption and improve QoS performance.
Most current works model the traffic engineering problem as a mixed-integer linear program and propose heuristic algorithms. The dual optimization problem has a large solution space, the above heuristics are close to exhaustive traversal, and their scalability is limited. Researchers have therefore begun to study DRL-based flexible traffic control mechanisms to improve data center network performance. The DRL-based traffic engineering algorithms [24][25][26][27][28][29][30] are mostly about finding the best QoS route, and they are driven by experience to deal with overly complex and dynamic network environments. As far as we know, there is currently no DRL-based flow scheduling scheme with a variable rate.
The proposed energy-saving scheduling optimization problem can be modeled as a Markov decision process (MDP) with a state space, an action space, and a reward function. Although an RL algorithm can learn from its environment by itself, it still needs manually designed features to be able to converge. In practical applications, the number of states may be large, and in many cases the features are difficult to design manually; neural networks happen to be particularly good at processing such massive data. Moreover, the flow rate is designed to be constant in the current literature as far as we know. We suppose the flow transmission rate is more flexible and variable, so designing a variable-rate energy-saving flow scheduling scheme is a continuous control problem. The deep deterministic policy gradient (DDPG) algorithm, proposed by Lillicrap et al. [31], is based on the actor-critic (AC) framework; it combines DQN with the deterministic policy gradient (DPG) method and is an effective method for continuous control problems. Hence, we adopt the DDPG method to solve our problem.
In summary, this paper focuses on the energy-saving flow scheduling problem based on the ALR model and applies the DDPG algorithm to solve it. Our main contributions are threefold:
(1) When the network topology and the routes of flows are determined, to further reduce energy consumption and meet QoS requirements, we formulate, based on the ALR energy cost model, the energy-saving QoS flow scheduling problem with the dual optimization objective of minimizing the energy consumption and the mean completion time of flows. This dual optimization problem is a continuous control problem with a large solution space and can be modeled as a Markov decision process.
(2) Exploiting the advantages of DDPG in solving continuous control problems, and to address the scalability problem, we propose the DDPG Energy-Efficient Flow Scheduling (DDPG-EEFS) algorithm to obtain the optimal scheduling scheme.
(3) Based on the ALR energy cost model, we propose a rate-variable flow transmission mechanism that schedules flows flexibly, balances flow transmission on links in time and space, and improves energy saving effectively.
The rest of the paper is organized as follows: Section 2 analyzes related works. The dual-objective optimization problem is presented in Section 3. We propose the DDPG-EEFS algorithm to solve the problem in Section 4. The simulation results verifying the feasibility and effectiveness of the proposed approach are presented in Section 5. Finally, the conclusion is given in Section 6.

Related Work
In this part, we review the past studies from two aspects: data center network energy-saving technology and DRL-based network traffic control algorithm.
2.1. Data Center Network Energy-Saving Technology. In terms of data center network energy-saving technologies, there is a great deal of research, which can be divided into energy-saving topology design [12] and energy-saving routing and flow scheduling schemes [13][14][15][16][17][18][19][20][21], mainly based on the DS energy cost model or the ALR energy cost model.
The first technology is designed from the perspective of topology design and transformation to save energy. Ref. [12] defined the influence parameters of links on topological connectivity and the threshold of network connectivity decline percentage. Under the constraint of the threshold, the links are deleted from the network topology and the topology is updated according to the increasing order of the influence parameters, to achieve the goal of network energy saving. Energy-saving topology is suitable to be applied in the initial stage of network construction.
The DS-based energy-saving schemes mainly work from the perspective of forwarding and routing. Ref. [13] proposed a method to construct an elastic tree, which can dynamically adjust the set of active nodes and links; while reducing energy consumption, it can deal with burst traffic and has good fault-tolerant performance. Refs. [14], [15], and [16] all proposed heuristic energy-saving routing algorithms to minimize the number of active links under constraints. Ref. [17] proposed an online switching mechanism among multiple topologies in the data plane, which can put some devices and ports to sleep to save energy while meeting the dynamic demand of traffic. Li et al. [18] explored a new energy-aware flow-preempting scheduling method in the time dimension and used the policy of EXR (exclusive routing), i.e., each flow preempts a route according to its priority and occupies that route exclusively. This kind of DS-based energy-saving routing and scheduling involves state transitions of switches and ports, which take some time, so the DS model is not suitable for real-time flow scheduling. Few works target ALR-based energy-efficient flow scheduling. In Ref. [20], the flows in the critical interval with the largest energy consumption density are scheduled first, and the energy efficiency of the network is further improved by balancing the flows over link space and time. Flow scheduling mainly involves the transmission rate and transmission time, and in the current flow scheduling literature, as far as we know, the flow bandwidth allocation is fixed, i.e., the flow rate is constant. In this paper, we provide a rate-variable flow scheduling mechanism to minimize network energy consumption and improve the MCT.

2.2. DRL Algorithm for Network Traffic Control. The solution space of the algorithms in Section 2.1 is large, so their scalability is limited. With the development of DRL, the recent trend in the networking field is to use AI algorithms to control and operate network traffic. Ref. [24] proposed an adaptive multimedia flow control method based on DDPG to optimize QoE performance: in a complex and dynamic network environment, each multimedia flow is allocated an appropriate bandwidth on each path based on experience rather than a mathematical model. Ref. [25] schedules the transmission times and bandwidths of multiple flows; the state is defined as the resource allocation, and the action is the route selection. The contribution ratio of multiple resources to delay reduction is quantified, so that the performance requirements of flows are transformed into resource requirements. Then, the DRL agent interacts with the network continuously and adaptively obtains a feasible path according to the network state, which means allocating optimal network resources for the flow, to improve the network throughput, the completion time of flows, and the load balance of links. Most of the above works focus on non-data-center networks, and their optimization objectives are network throughput, delay, and other performance requirements. There are few DRL-based energy-saving flow control studies in software-defined DCNs.
To optimize the energy-saving effectiveness in DCN, based on the DS energy consumption model, we put forward a DQN-based routing algorithm in Ref. [29]. To further save energy, based on the ALR network energy consumption model, we will use the DDPG algorithm to design a variable flow rate scheduling mechanism to achieve saving energy and improve the QoS.

Model of Network System
3.1. Motivation. The data center network is an undirected graph and can be modeled as G(V, E), where V is the set of switches and E is the set of links. The set of traffic that needs to be transmitted is defined as J = {j_1, j_2, ..., j_i, ..., j_n}, where each flow is defined as j_i = [p_i, q_i, r_i, d_i, w_i], j_i ∈ J. Here, p_i and q_i indicate the source and destination nodes, respectively; r_i and d_i represent the start time and the deadline of the flow, respectively; and w_i is the data size that needs to be transmitted. Taking Figure 1 as an instance, we assume one simple network consisting of five switches and six links, and currently three flows arrive, where j_1 = [1, 3, 0, 4, 8], j_2 = [2, 5, 0, 3, 9], and j_3 = [3, 5, 0, 2, 4]. We assume each flow uses single-path transmission and that the routing has been determined in advance; thus, the active switches and links are also known ahead of time. Let P_i be the sequence of links through which j_i is routed, calculated by the shortest-path principle. Here, we set P_1 = [l_1, l_3], P_2 = [l_4], and P_3 = [l_3, l_4]. Therefore, the set of actually active switches is {1, 2, 3, 5} and the set of actually active links is {1, 3, 4}, both of which are marked in green in Figure 1.
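As an illustration, the flow model above can be sketched in code; the Flow class and the variable names below are illustrative, not part of the paper's notation:

```python
from dataclasses import dataclass

# Sketch of the flow model in Section 3.1 (illustrative names).
@dataclass
class Flow:
    src: int         # source node p_i
    dst: int         # destination node q_i
    start: float     # start time r_i
    deadline: float  # deadline d_i
    size: float      # data size w_i

# The three example flows j_1, j_2, j_3 and their precomputed routes P_i.
flows = [Flow(1, 3, 0, 4, 8), Flow(2, 5, 0, 3, 9), Flow(3, 5, 0, 2, 4)]
routes = {1: [1, 3], 2: [4], 3: [3, 4]}  # flow index -> link numbers

# The actually active links are the union over all routed flows.
active_links = sorted({l for r in routes.values() for l in r})
print(active_links)  # [1, 3, 4]
```

This reproduces the active link set {1, 3, 4} of the Figure 1 example.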
The main problem in flow scheduling is to provide an appropriate flow scheduling scheme that minimizes the energy consumption and the mean completion time of flows while guaranteeing the deadline of each flow. The scheduling scheme mainly specifies the transmission interval and transmission rate of each flow, which have a great impact on link energy consumption and MCT. Flows can share the same link or be transmitted in exclusive-link mode, and a flow can be transmitted at a constant or a variable rate, so there is a large number of possible transmission combinations. Different transmission schemes produce different energy consumption and MCT.
Here, we adopt the ALR power consumption model to calculate the link power. The power consumption function f(x_e) [19] is given by Formula (1) to uniformly characterize how energy is consumed with respect to the transmission rate x_e of each link e ∈ E:

f(x_e) = σ + μ · x_e^α, 0 ≤ x_e ≤ C, (1)

where σ, μ, and α are constants associated with the link type. Constant σ represents the idle power for maintaining the link state, and α > 1 means that f(·) is superadditive. Here, we set the parameter α = 2. Constant C is the maximum link transmission rate.
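The power model can be sketched as follows. The body f(x_e) = sigma + mu * x_e**alpha is the standard ALR form consistent with the constants described above; the numeric values of sigma and mu are illustrative:

```python
# ALR link power model: idle power sigma plus a superadditive term mu * x**alpha.
# alpha = 2 as in the paper; sigma and mu values are illustrative.
SIGMA, MU, ALPHA, C = 1.0, 1.0, 2, 5.0

def link_power(x):
    """Power drawn by an active link transmitting at rate x, 0 <= x <= C."""
    assert 0.0 <= x <= C
    return SIGMA + MU * x ** ALPHA

# Superadditivity: sending 4 units of data at rate 4 for one time unit costs
# more rate-dependent energy than sending at rate 2 for two time units.
burst  = link_power(4.0) + link_power(0.0)  # (1 + 16) + (1 + 0) = 18
smooth = link_power(2.0) + link_power(2.0)  # (1 + 4) + (1 + 4) = 10
print(burst, smooth)  # 18.0 10.0
```

The superadditivity (alpha > 1) is exactly what makes smoothing a flow's rate over time cheaper than bursting, which motivates the rate-variable scheduling studied here.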
For the three flows above, the time range is set from the current time to the next five units of time, and each link has five units of bandwidth. We give three representative transmission schemes, shown in Figure 2. The first scheme adopts a transmission strategy of uniform speed and shared bandwidth; the deadlines are guaranteed, the energy consumption is 15σ + 115μ, and the MCT is 3 units of time. In the second scheme, the flow with the earlier deadline is transmitted first, and each flow occupies the link bandwidth exclusively; the energy consumption is 15σ + 141μ, and the MCT is 2.33 units of time. Scheme 3 improves on scheme 2: each flow is transmitted as evenly as possible in time without changing its transmission interval; its energy consumption is 15σ + 137μ, and the MCT is 2.33 units of time. The MCT of scheme 1 is the largest and its energy consumption is the smallest. Schemes 2 and 3 achieve the smallest MCT, and the energy consumption of scheme 3 is smaller because it balances the flow transmission within each transmission interval as much as possible. By adjusting the transmission rate and transmission interval, the energy consumption and the mean completion time of flows can be effectively reduced.
The goal of this paper is to find a flow transmission scheme to minimize the energy consumption and the MCT.

3.2. Problem Formulation. In this paper, the main optimization objective is to minimize the weighted sum of network link energy consumption and the mean completion time (MCT) of flows, expressed by Formula (2), where the variables φ′ and MCT′ are the normalized values, calculated by Formulas (3) and (4), respectively. The two constants ρ and (1 − ρ) weight the energy consumption and the MCT, respectively, and s represents one feasible flow scheduling solution from the solution set S. Formulas (8)-(10) represent the performance constraints: Formula (8) states that each flow must be completed before its deadline; Formula (9) states the link capacity constraint, i.e., the bandwidth used by network traffic cannot exceed the available bandwidth of the link; to ensure link availability, the available bandwidth is set to β times the link bandwidth capacity, with (1 − β) of the link bandwidth reserved for emergencies; and Formula (10) represents the flow conservation constraint.
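Assuming the normalized energy φ′ and MCT′ of a candidate schedule have already been computed by Formulas (3) and (4), the weighted objective of Formula (2) reduces to a one-liner:

```python
# Weighted dual objective of Formula (2): rho trades energy against MCT.
def objective(phi_norm, mct_norm, rho=0.5):
    """Cost of a candidate schedule given normalized energy and MCT; lower is better."""
    return rho * phi_norm + (1 - rho) * mct_norm

# rho -> 1 optimizes energy only; rho -> 0 optimizes completion time only.
print(objective(0.4, 0.8, rho=1.0))  # 0.4
```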

DDPG-Based Energy-Efficient Flow Scheduling Algorithm
For the energy-saving scheduling optimization model established above, we model it as a Markov decision process (MDP) with a state space, action space, and reward function. This paper tries to apply the DDPG method to seek the most energy-saving transmission rate for each flow while minimizing the MCT of flows. Firstly, we propose a DDPG-EEFS architecture and describe the components. State, action, and reward are also outlined. Secondly, the process of the DDPG-EEFS algorithm is presented.

4.1. DDPG-EEFS Architecture. The DDPG-EEFS architecture is shown in Figure 3. It mainly includes the environment and the agent.

4.1.1. Environment. The environment is an SDN-enabled data center network, which consists of switches, links, and servers.

4.1.2. Agent. An agent is used to communicate with the environment. Once it observes the consequences of its actions, it learns to change its behavior in response to the reward. When DDPG is applied in the system, the SDN controller has a global view of the network environment state; it can be seen as an agent that makes decisions based on observations, carries out a series of actions on the current state, and provides flexible policy deployment. DDPG combines the DQN method and the DPG method within the actor-critic framework and uses neural networks to fit the policy function and the Q function, forming a more efficient and stable control model. To improve the convergence and stability of the networks, two important ideas, experience replay and target networks, are used in the DDPG algorithm. The purpose of the former is to break the correlation between data samples so that the training sequence is closer to being independent and identically distributed. The latter regularly copies the online network parameters to a target network with the same structure and then uses the target network when updating the online network parameters.
(1) Primary Network. The primary network determines an action based on the current state, together with the corresponding critic value; its input is the current state, and its output is an action. The primary network consists of an actor network and a critic network. The actor network is the online policy network; it is responsible for selecting the current action a_i according to the current state s_i and interacts with the environment to generate the next state s_{i+1} and reward r_i. The critic network approximates the value function Q(s, a|θ^Q) of the state-action pair (the action being the output of the actor network), provides gradient information, and helps the actor learn the gradient of the policy.
(2) Target Network. The target network has the same model structure as the primary network and consists of a target actor network and a target critic network. The target actor network is responsible for selecting the optimal next action a_{i+1} according to the next state s_{i+1} sampled from the replay buffer; its input is the transition state (TS) (s_i, s_{i+1}), and its network parameters θ^{μ′} are periodically copied from θ^μ. The target critic network is responsible for calculating the value function Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′}), and its network parameters θ^{Q′} are copied from θ^Q periodically.
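The target-network update can be sketched as follows; parameters are plain Python lists for illustration. The paper copies the online parameters periodically (hard update); the soft variant with rate tau is the common DDPG alternative rather than something specified here:

```python
# Target-network updates. The paper copies online parameters periodically
# (hard update); standard DDPG blends them with a rate tau (soft update).
# Parameters are plain lists for illustration.
def hard_update(online, target):
    target[:] = list(online)

def soft_update(online, target, tau=0.01):
    for i, (o, t) in enumerate(zip(online, target)):
        target[i] = tau * o + (1 - tau) * t

theta_online, theta_target = [1.0, 2.0], [0.0, 0.0]
soft_update(theta_online, theta_target, tau=0.5)
print(theta_target)  # [0.5, 1.0]
```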

(3) Replay Buffer. The observed state transition process is first stored in a replay buffer. After the samples in the replay buffer have accumulated to a certain extent, they are randomly chosen to update the networks. The main reason is that the samples obtained by different flows randomly exploring the surrounding environment form a time-correlated sequence. Due to this temporal correlation, directly using the data as training samples would greatly affect convergence; random sampling solves the time correlation problem. This random extraction disrupts the correlation between experiences and makes neural network updates more efficient. In summary, the replay buffer is an especially important part of the DRL method and greatly improves its performance.
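A minimal replay buffer along these lines (the capacity and the transition layout are illustrative):

```python
import random
from collections import deque

# Minimal replay buffer: store transitions, sample uniformly at random to
# break the temporal correlation described above.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # old samples are evicted when full

    def store(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

rb = ReplayBuffer(capacity=100)
for t in range(10):
    rb.store((t, 0.0, 1.0, t + 1))  # dummy (state, action, reward, next state)
batch = rb.sample(4)
print(len(batch))  # 4
```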
(5) Action. According to the flow transmission rate allocation, the real transmission rates of flows in the transmission interval can be calculated. By adjusting the rate allocation of flows during different time periods, different flow scheduling schemes are obtained. Action a is defined as the transmission rate allocation of each flow in each time period.
(6) Reward. The reward r of the agent is related to the adopted network operation and maintenance strategy; these control policies can be adjusted by changing the reward settings. The immediate reward is defined by analyzing the objective function. Since the objective is to minimize the weighted sum of energy consumption and mean completion time of flows, and a smaller weighted sum should yield a larger reward, the reciprocal of the weighted sum is used as the immediate reward in Formula (12). For those states that do not satisfy constraints (8)-(10), the immediate reward is 0.
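A sketch of the reward described above; the small constant eps is an added safeguard against division by zero and is not part of the paper's Formula (12):

```python
# Immediate reward: reciprocal of the weighted objective for feasible
# schedules, 0 when constraints (8)-(10) are violated. eps is an added
# safeguard against division by zero, not part of the paper's formula.
def reward(phi_norm, mct_norm, feasible, rho=0.5, eps=1e-9):
    if not feasible:
        return 0.0
    return 1.0 / (rho * phi_norm + (1 - rho) * mct_norm + eps)

print(reward(0.5, 0.5, feasible=False))  # 0.0
```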

4.2. DDPG-EEFS Algorithm Process. The general process of DDPG optimizing flow scheduling is as follows. Firstly, the agent obtains an accurate network state. Secondly, the agent determines an action; the SDN controller then produces rules and distributes them to the switches in the data plane. Finally, the agent obtains the reward and the new network state after the new scheduling scheme is implemented. The training goal of the DDPG agent is to find the optimal action a according to the input state s so as to maximize the reward r. The process is shown with black solid lines in Figure 3.
The general DDPG framework does not clearly define how to explore; the TE-aware random search method [23] uses the basic TE solution a_base as a baseline to guide the exploration process. Here, we use this method to improve exploration efficiency. The DRL agent generates action a_base + ε·N with probability ε and action a + ε·N with probability (1 − ε), where a is the output of the actor network and N is uniformly distributed random noise. a_base can be obtained by different methods; here, every flow is assumed to transmit uniformly in the interval between its arrival time and its deadline. Although this is not the best solution, it is sufficient as a baseline. We add random noise to the basic TE solution and the actor network's action with these probabilities instead of directly adopting a_base and a. Moreover, ε decreases as the number of epochs increases, because the more the agent learns, the more its own action output should be adopted.
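The exploration rule can be sketched as follows; a scalar stands in for the action vector, and noise_scale is an assumed parameter:

```python
import random

# TE-aware exploration: with probability epsilon, perturb the baseline
# solution a_base (uniform transmission); otherwise perturb the actor
# output a. A scalar stands in for the action vector; noise_scale is assumed.
def explore(a, a_base, epsilon, noise_scale=0.1):
    noise = random.uniform(-1.0, 1.0) * noise_scale
    if random.random() < epsilon:
        return a_base + epsilon * noise
    return a + epsilon * noise

random.seed(0)
action = explore(a=0.7, a_base=0.5, epsilon=0.3)
print(0.0 <= action <= 1.0)  # True: the perturbation is at most 0.03
```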
Deep neural networks are introduced in these modules, and their parameters are updated through learning. To make efficient use of hardware optimizations, DDPG explores with an off-policy algorithm and trains on minibatches rather than online. The flow scheduling problem can be considered a continuous control problem.
At each time step, the actor primary network and critic primary network are updated by sampling a minibatch from the replay buffer. A sample is made up of a series of transitions, each containing the state, the action, the reward, and the TS. The updates are introduced as follows.
The backpropagation of the policy gradient for the actor module's DPG neural network is shown in Equation (13); the related data flow is shown in Figure 3 with blue dotted lines.
where J is the performance function used to measure the performance of the policy μ. The first factor on the right of the equals sign is the action gradient from the critic network, which characterizes the direction of movement in which the action obtains a higher return; the second is the parameter gradient of the actor network, which characterizes how the actor's neural network should adjust its own parameters so that it selects the actions with the highest return with higher probability. The combination of the two items means that the neural network of the actor module adjusts its own parameters in the direction that makes higher returns more likely.
The update process of the DQN network in the critic module is shown in Equation (14), illustrated in Figure 3 with green dotted lines. The TD error (TD: temporal difference) loss L is the mean squared error between the Q value of the target network and the Q value of the online network, where the target Q value is given in Equation (15): it is based on the Q value of the next state s_{i+1} and the next action a_{i+1}, and the next action comes from the target network of the actor module.
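The TD target and critic loss described by Equations (14)-(15) can be sketched as follows, with an assumed discount factor gamma:

```python
# Critic update, Equations (14)-(15): the TD target y is the reward plus the
# discounted target-network Q value of the next state-action pair; the loss
# is the mean squared error against the online Q values. gamma is assumed.
def td_target(r, q_target_next, gamma=0.99):
    return r + gamma * q_target_next

def critic_loss(q_online, y):
    return sum((q - t) ** 2 for q, t in zip(q_online, y)) / len(y)

y = [td_target(1.0, 2.0), td_target(0.0, 1.0)]  # [2.98, 0.99]
print(round(critic_loss([3.0, 1.0], y), 6))  # 0.00025
```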
Therefore, for the network energy-efficient flow scheduling problem, the DDPG-EEFS algorithm only needs to train on part of the state data to obtain optimal results.
The simple uniform sampling method ignores the significance of the samples in the pool. Prioritized experience replay assigns a priority to each sample, and the DRL agent chooses samples based on priority so as to learn from transitions more effectively. Prioritized experience replay has been applied to the TE problem [26]. DDPG includes actor and critic networks, so the priority also includes two parts: the first part corresponds to the training of the critic network, and the second part relates to the actor network training. Combining the two parts, the priority of samples is shown in Formula (16), where |∇_a Q| is the mean of the absolute value of ∇_a Q, the parameter φ determines the relative importance between the TD error and ∇_a Q, and ς is a small constant that avoids the edge case of a transition with zero error, i.e., if the error were 0, the transition would never be revisited. The probability of sampling a transition is calculated by Formula (17), where the exponent β_0 describes the degree of prioritization; when β_0 = 0, transition sampling degenerates to uniform sampling.
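A sketch of the prioritized sampling described by Formulas (16)-(17); the exact functional form and the values of phi, zeta, and beta_0 below are illustrative assumptions:

```python
# Prioritized replay: priority combines the critic's TD error with the
# actor's action gradient (Formula (16)); sampling probability follows
# Formula (17). phi, zeta, and beta_0 values are illustrative.
def priority(td_error, grad_a_q, phi=0.5, zeta=1e-3):
    return abs(td_error) + phi * abs(grad_a_q) + zeta

def sampling_probs(priorities, beta0=0.6):
    weights = [p ** beta0 for p in priorities]
    total = sum(weights)
    return [w / total for w in weights]

ps = [priority(2.0, 1.0), priority(0.1, 0.0)]
probs = sampling_probs(ps)
print(probs[0] > probs[1])  # True: higher-error transitions sampled more often
```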

Simulation and Results
To verify the effectiveness of the proposed DDPG-EEFS algorithm, simulation is conducted in the SDN-enabled data center network with Fat-Tree topology.

5.1. Simulation Environment and Setting. Under the Windows 10 system, the Python language is used to implement the algorithm. The hardware platform is configured with a 2.4 GHz CPU and 64 GB of memory. This work selects the commonly used Fat-Tree data center network topology, set to consist of 20 four-port switches, 16 hosts, and 48 links. We mainly use the network energy-saving percentage P as the evaluation metric of energy-saving effectiveness, i.e., the network energy consumption saved by method A as a percentage of the total network energy consumption NEC_full when all links are fully loaded. The specific definition is shown in Formula (18).
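The metric described above is straightforward to express; the NEC values below are illustrative:

```python
# Energy-saving percentage P of Formula (18): the share of the fully loaded
# network energy NEC_full that scheme A saves. Values are illustrative.
def energy_saving_percentage(nec_a, nec_full):
    return (nec_full - nec_a) / nec_full * 100.0

print(energy_saving_percentage(nec_a=60.0, nec_full=100.0))  # 40.0
```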

5.2. Simulation Results and Analysis. To verify the validity and performance of the proposed DDPG-EEFS algorithm, we design the simulation in two parts. Firstly, to verify the effectiveness of the algorithm, we design 64 flows belonging to 4 different pods, half of which go through the core switches. To deal with emergencies and failure recovery, the redundancy parameter δ is set to 0.8 in our experiment. Our goal is to find the optimal scheduling scheme that minimizes the objective function. Through learning and constantly adjusting various parameters, we finally obtain the actual parameters of the stably converging algorithm, which are given in Table 1.
After the training of the DDPG-EEFS algorithm is completed, the model is saved and then tested, and the network will find a relatively ideal scheduling scheme. For the test results, the normalized optimization objective is counted every 100 steps, which are shown in Figure 4. It is found that after reaching 800 steps, the algorithm approaches convergence and the objective function tends to be stable.
Secondly, because energy saving and performance optimization sometimes restrict each other, pure DDPG-EEFS, i.e., DDPG-EEFS without the synchronous optimization of flow completion time, is compared with the base solution a_base proposed in Section 4 (every flow transmits uniformly in the interval between its arrival time and its deadline) and with the Most-Critical-First algorithm [20] to evaluate the energy-saving effect and network performance. As shown in Figure 5, we mainly use the network energy-saving percentage P as the evaluation index of the energy-saving effect, taking the energy cost when the network links are fully loaded as the benchmark. Figure 5 shows the network energy-saving percentage of different algorithms under different network loads. The difference between the energy-saving effects of pure DDPG-EEFS and the Most-Critical-First scheduling mechanism is about 2.4%; the latter static scheduling mechanism knows the information of all flows in advance and can therefore carry out global optimization. Compared with the base solution a_base, the network energy-saving percentage of the pure DDPG-EEFS mechanism is increased by 3.2%.
As shown in Figure 6, we can see that under the different load conditions, the MCT by using DDPG-EEFS is lower than that of the baseline solution and Most-Critical-First algorithm. Compared with the base solution, the MCT of flows of DDPG-EEFS is reduced by about 15%.

Conclusions and Future Work
SDN is a promising technology for data center networks, providing centralized network management and global traffic control. In this paper, we propose the dual optimization goals of energy saving and MCT of flows and design the DDPG-EEFS algorithm to achieve them. For network operators, the shorter the mean completion time, the better the performance of the data plane. Compared with heuristic algorithms, DDPG-EEFS makes it easy to design variable transmission rates and achieves a good energy-saving effect and good QoS.

Data Availability
This work selects the commonly used Fat-Tree data center network topology, which is set to consist of 20 four-point switches, 16 hosts, and 48 links. We design 64 flows, which belong to 4 different pods, and there are half of flows that go through the core switch.

Conflicts of Interest
The authors declare that there is no conflict of interest.