Reinforcement Learning-Based Multiple Constraint Electric Vehicle Charging Service Scheduling

The popularization of electric vehicles faces problems such as difficulty in charging, difficulty in selecting fast charging locations, and the need to account for multiple factors and vehicle interactions together. With the increasingly mature application of navigation technology in vehicle-road coordination and other areas, optimal dynamic charging methods for electric fleets based on adaptive learning make it possible for edge computing to help electric fleets execute an optimal route charging plan effectively. We propose a method of electric vehicle charging service scheduling based on reinforcement learning. First, an intelligent transportation system is proposed, and on this basis a framework for the interaction between fast charging stations and electric vehicles is established. Subsequently, a dynamic travel time model for traffic sections is established. Based on the habits of electric vehicle owners, an electric vehicle charging navigation model and a reinforcement learning reward model are proposed. Finally, an electric vehicle charging navigation scheduling method is proposed to optimize the service resources of the fast charging stations in the area. The simulation results show that the method balances the charging load between stations, can effectively improve the charging efficiency of electric vehicles, and increases user satisfaction.


Introduction
With the extensive development of electric vehicles around the world, the number of electric vehicles is increasing, and problems such as difficulty in charging, serious line losses, voltage drops, charging safety, and severe load peaks are expected [1][2][3]. Electric vehicle charging and charging path planning should receive more attention. For electric vehicles whose driving time is longer than their nondriving time, fast charging is an important way to replenish energy [4,5]. The disorderly charging of electric vehicles would not only cause congestion at fast charging stations, which increases the burden on the regional grid, but also concentrate charging times, causing problems such as transformer overload and an increased peak-to-valley difference, which is not conducive to the safe operation of the distribution network [6,7]. Therefore, reasonable guidance and charging scheduling for vehicles with fast charging needs help alleviate the burden on the regional grid while meeting charging needs [8,9].
In response to the above problems, scholars at home and abroad have conducted some research. In [10], the authors studied the uniform charging node in [11] and extended it to the nonuniform charging node in [12] by solving the single-vehicle mixed integer nonlinear programming (MINLP) problem. The remaining energy of the vehicle at each node is expressed as a dynamic programming (DP) problem for the single electric vehicle path problem, and a DP-based algorithm is provided to determine the optimal path and charging strategy at the electric vehicle subflow level. In [13], the authors proposed a distributed electric vehicle path selection system based on the distributed ant colony algorithm (ACA). The distributed architecture minimizes the total travel of electric vehicles to the destination by proposing a set of nearest fast charging stations. In [14], the authors proposed an improved Dijkstra method to solve the multiobjective optimization problem and obtained a multiobjective optimization function including travel time, the number of vehicles at fast charging stations, and charging load, thereby optimizing electric vehicle charging path planning, alleviating traffic congestion around fast charging stations, reducing waiting time, and improving the availability of charging facilities. Each of these studies has its own strengths regarding charging route navigation and charging scheduling, but when studying electric vehicle charging route navigation, they focus only on the economic benefits and waiting time of the vehicle and ignore the impact on fast charging station loads when large numbers of electric vehicles charge. Most charging scheduling uses a fixed strategy and ignores the influence of factors such as the growing number of electric vehicles and user habits on electric vehicle charging scheduling in different time periods.
In this context, we propose an electric vehicle charging service scheduling method based on reinforcement learning to meet the needs of electric vehicle owners. The structure of the paper is as follows. In Section 2, we propose a fast charging station and electric vehicle system framework and use this framework to study electric vehicle charging navigation. In Section 3, we establish a dynamic travel time model for traffic sections and propose an electric vehicle charging navigation model. In Section 4, incorporating reinforcement learning, we further propose an electric vehicle charging navigation scheduling method to rationally optimize the service resources of each fast charging station in the area. In Section 5, we use a certain city as a model and compare the simulation results of the proposed method with those of the traditional electric vehicle charging navigation method to demonstrate the superiority of this method. Conclusions and further research directions are outlined in Section 6.

Fast Charging Station and Electric Vehicle System Framework
With the gradual development and application of 4G and 5G communications, the applications of various technologies for navigation and vehicle-road collaboration have become increasingly mature [15,16]. At the same time, edge computing technology provides technical guarantees for fast-response, low-error-rate operating environments. The computational burden of the central scheduling node is transferred to the user edge side, which greatly increases the processing efficiency and enables electric vehicles and fast charging stations to share information and synchronize processing [17]. Currently, electric vehicles can share information with fast charging stations and other systems through the Internet, upload their status and location in real time, and navigate in real time based on their location [18,19]. Moreover, a variety of optimal dynamic charging methods for electric fleets based on adaptive learning have been proposed, and the results show that these methods can essentially achieve the optimal solution. On this basis, the optimal route charging schedule can be carried out effectively for the electric fleet of an efficient and dynamic transportation system. Inspired by the above research, this paper proposes a guidance system structure for electric vehicles and fast charging stations, which is shown in Figure 1. With the Internet platform as the center, the system dynamically updates intersection information and provides dynamic charging and navigation strategies for electric vehicles by referring to road condition information and fast charging station information. The navigation combines the road condition information with the waiting time at each fast charging station and chooses the fast charging station with the highest overall efficiency for the vehicle to charge at.
The fast charging station itself then charges the electric vehicle according to various factors, such as weather, energy supply and demand, and user habits. At regular intervals, the traffic information and fast charging station information are refreshed according to the above selection, and the charging navigation strategy is provided again.

Preliminary Model Establishment
This section first proposes a dynamic travel time model for traffic sections and, on this basis, establishes a charging navigation model that considers distance, time, and economic benefits for a single electric vehicle.

Dynamic Travel Time Model of Traffic Section.
The dynamic path selection model for electric vehicles in this paper is based on the dynamic travel time model of the road segment. First, the movement of vehicles in the road segment is described by the cumulative number of vehicles $M(a, t)$, which represents the total number of vehicles that have passed observation point $a$ before time $t$. According to the definition of flow and density, the traffic flow $\sigma(a, t)$ and traffic density $\rho(a, t)$ are as follows: where $M(a, t_0)$ and $M(a_0, t)$ are the number of vehicles at position $a$ at time $t_0$ and the number of vehicles at position $a_0$ at time $t$, respectively. According to the traffic volume and traffic density, the traffic velocity $v(a, t)$ can be obtained as follows:

$$v(a, t) = \frac{\sigma(a, t)}{\rho(a, t)}. \quad (2)$$

Assuming that the vehicles on the road section are evenly distributed in the road section, the traffic density $\rho_i(t)$ of road section $i$ is as follows:

Mathematical Problems in Engineering
where $a_i^0$ and $a_i^L$ are the entrance and exit positions of road section $i$, respectively; $n_i$ is the number of vehicles that can be accommodated per unit length in road section $i$; and $L_i$ is the length of road section $i$.
According to the above formula, the vehicle speed $v_i(t)$ on road section $i$ can be expressed as follows [20]: where $v_{i,\mathrm{free}}$ is the free flow velocity of section $i$; $\rho_{i,\max}$ and $\rho_{i,\min}$ are the maximum density and minimum density on section $i$, respectively; $v_{i,\min}$ is the minimum vehicle speed; and $\alpha$ and $\beta$ are system model parameters.
It can be concluded that the passing time $T_i$ of road section $i$ is expressed as follows: If a road congestion signal is received halfway, the system changes the route to reduce the delay time. The subjective probability of the owner changing from road section $i$ to road section $i'$ is $P_{i \to i'}$: where $T_i$ is the travel time of section $i$ in the route; $T_{i'}$ is the travel time of section $i'$ in the route; $T_{\mathrm{Max}}$ is the maximum travel time; and $\eta$ is a subjective coefficient. Therefore, the length of the driving section can be approximated by subjective probability as $d_i$: where $L_{i'}$ is the length of road section $i'$.
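As a hedged illustration of the cumulative-count description above, the following sketch estimates flow, density, and speed from a cumulative count $M(a, t)$ by central finite differences. It assumes the standard cumulative-count definitions (flow as the time derivative of $M$, density as the negative space derivative) and a section travel time of length divided by speed; these forms, the function names, and the uniform traffic example are assumptions for illustration and are not recovered from the source equations.

```python
def flow_density_speed(M, a, t, da=1e-3, dt=1e-3):
    """Estimate flow, density and speed from a cumulative count M(a, t).

    Assumed definitions (not taken from the source): flow = dM/dt and
    density = -dM/da, both approximated by central differences.
    """
    flow = (M(a, t + dt) - M(a, t - dt)) / (2 * dt)      # vehicles per hour
    density = -(M(a + da, t) - M(a - da, t)) / (2 * da)  # vehicles per km
    speed = flow / density                               # km per hour
    return flow, density, speed


def travel_time(length_km, speed_kmh):
    """Section travel time under the assumption T_i = L_i / v_i(t)."""
    return length_km / speed_kmh


# Hypothetical uniform traffic: 1200 veh/h passing any point, 30 veh/km on the road.
def uniform_M(a, t):
    return 1200.0 * t - 30.0 * a


flow, density, speed = flow_density_speed(uniform_M, a=2.0, t=0.5)
section_time = travel_time(10.0, speed)  # a hypothetical 10 km section
```

For this uniform example the estimated speed is the ratio of flow to density, so the travel time of a section scales directly with its length.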

Electric Vehicle Charging Navigation Model.
Electric vehicles need to be charged frequently during use, so there will be demand for fast charging. According to the charging needs of different vehicles, implementing different navigation schemes can effectively improve the response speed of the vehicle. This section comprehensively considers the driving distance required to reach a fast charging station, the total time of driving and charging, and the charging economy to establish a charging navigation model.

For electric vehicle owners with high total driving distance requirements, this article considers the principle that the direction of the fast charging station is the same as the destination direction when all vehicles are connected to the Internet. It is proposed that the sum of the shortest distance $\min D$ from the starting point $O$ of the vehicle to the fast charging station $S$ and from the fast charging station $S$ to the destination $D$ is expressed as follows: where $a$ and $b$ are path nodes; $m$ is the total number of path nodes; $d_{ab,OS}$ and $d_{ab,SD}$ are the lengths of the road section with $a$ and $b$ as the end nodes on the way from the starting point $O$ to the fast charging station $S$ and from the fast charging station $S$ to the destination $D$, respectively; and $\alpha_{ab}$ is a variable that equals 1 for the road section with $a$ and $b$ as the end nodes and equals 0 otherwise.
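The shortest-distance objective above can be sketched with a standard shortest-path search. The example below is a minimal sketch that uses Dijkstra's algorithm (a swapped-in, well-known technique, not necessarily the solver used in the paper) to pick the station minimizing the distance from $O$ to $S$ plus the distance from $S$ to $D$. The toy network, node names, and function names are hypothetical, and the graph is assumed to be undirected.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest distance from source to every reachable node."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, length in graph.get(node, {}).items():
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


def best_station_by_distance(graph, origin, destination, stations):
    """Return (station, total_km) minimizing d(O, S) + d(S, D)."""
    from_origin = shortest_distances(graph, origin)
    # The graph is assumed undirected, so d(S, D) equals d(D, S).
    from_destination = shortest_distances(graph, destination)
    totals = {
        s: from_origin[s] + from_destination[s]
        for s in stations
        if s in from_origin and s in from_destination
    }
    station = min(totals, key=totals.get)
    return station, totals[station]


# Hypothetical undirected toy network (lengths in km), not the network of Figure 4.
toy_graph = {
    "O": {"A": 4.0, "B": 2.0},
    "A": {"O": 4.0, "S1": 3.0, "D": 6.0},
    "B": {"O": 2.0, "S2": 5.0},
    "S1": {"A": 3.0, "D": 2.0},
    "S2": {"B": 5.0, "D": 8.0},
    "D": {"A": 6.0, "S1": 2.0, "S2": 8.0},
}
station, total_km = best_station_by_distance(toy_graph, "O", "D", ["S1", "S2"])
```

In this toy network both stations are 7 km from the origin, so the choice is decided by the remaining distance to the destination.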
For electric vehicle owners with high total time requirements, this article takes the shortest total charging time as the goal to optimize the charging path: The specific solutions of $T_D$ and $T_C$ are as follows: where $T_D$ is the travel time to the fast charging station; $T_Q$ is the waiting time in the fast charging station, which is determined by the number of vehicles; $T_C$ is the charging time; $Q_{\mathrm{Ex}}$ is the expected charge at the end of charging, which is set to 95% of the full charge; $Q_{\mathrm{Re}}$ is the remaining energy on arrival at the fast charging station; $P$ is the charger power; $\theta$ is the charging efficiency; $C_{\mathrm{car}}$ is the electric vehicle battery capacity; $C_{\mathrm{carINI}}$ is the initial state of charge of the electric vehicle; and $Q$ is the electric energy consumed by the electric vehicle per kilometer.
For electric vehicle owners with high cost requirements, this article takes the minimum cost as the goal to optimize the charging path: where $M_D$ is the electricity cost consumed on the charging path and $M_S$ is the cost consumed by the fast charging station.
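To make the total-time objective concrete, the sketch below works through one numerical case. The formulas are assumptions inferred from the symbol definitions rather than the paper's own equations: the remaining energy is taken as $Q_{\mathrm{Re}} = C_{\mathrm{car}} C_{\mathrm{carINI}} - Q d$, the charging time as $T_C = (Q_{\mathrm{Ex}} - Q_{\mathrm{Re}}) / (P \theta)$, and the total time as $T_D + T_Q + T_C$. The function names, the 20 km distance, the 0.9 efficiency, and the driving and waiting times are hypothetical.

```python
def charging_time_h(capacity_kwh, soc_initial, kwh_per_km, distance_km,
                    charger_kw, efficiency, target_soc=0.95):
    """Charging time T_C under the assumed form (Q_Ex - Q_Re) / (P * theta)."""
    q_expected = target_soc * capacity_kwh  # Q_Ex, 95% of full charge
    # Q_Re: energy left on arrival at the station (assumed linear consumption).
    q_remaining = soc_initial * capacity_kwh - kwh_per_km * distance_km
    if q_remaining < 0:
        raise ValueError("station is out of range for this state of charge")
    return (q_expected - q_remaining) / (charger_kw * efficiency)


def total_time_h(drive_h, wait_h, charge_h):
    """Assumed total time objective T = T_D + T_Q + T_C."""
    return drive_h + wait_h + charge_h


# 90 kW.h battery, 400 km range, and 350 kW charger as in the simulation section;
# every other number here is hypothetical.
kwh_per_km = 90.0 / 400.0
t_c = charging_time_h(90.0, 0.30, kwh_per_km, 20.0, 350.0, 0.9)
t_total = total_time_h(drive_h=0.4, wait_h=0.1, charge_h=t_c)
```

Under these assumptions the vehicle arrives with 22.5 kW·h, needs 63 kW·h to reach 95%, and the charger delivers an effective 315 kW, so charging takes about 0.2 h.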

Electric Vehicle Charging Navigation Scheduling Strategy Based on Reinforcement Learning
The goal of the reinforcement learning algorithm is to find an optimal strategy based on the Markov decision process to maximize the expected cumulative return. In this section, the driving distance of the electric vehicle, the total driving and charging time, and the charging economy are optimized in parallel to provide the electric vehicle owner with the best electric vehicle charging navigation scheduling strategy [21,22].

Policy Gradient Algorithm.
The basic principle of reinforcement learning is to learn from exploratory trials and obtain action strategies that achieve established goals. The learning subject is the agent; the object interacting with the agent is the environment. Reinforcement learning is an abstraction of goal-oriented interactive learning problems. In a given environment state, the agent takes an action, and the environment responds to the action, presents the new environment state to the agent, and feeds a reward back to the agent. The agent and the environment continue to interact to achieve the ultimate goal of maximizing returns. The interaction process between the agent and the environment can be described by a time series: in period $t$, the agent takes an action $a_t^n$ according to the current environment state $s_t^n$; in the next period $t + 1$, due to the agent's action $a_t^n$, the environment state changes from $s_t^n$ to $s_{t+1}^n$, and the agent is rewarded with $r(t)^n$. In each time period, the probability distribution over all actions that the agent can take in the current environment state is called the agent's strategy (policy) $\pi$.
The agent continuously changes its strategy through interaction and finally achieves the goal of maximizing rewards. The reinforcement learning problem satisfies the Markov property; that is, the state of the next period depends only on the state $s_t^n$ of the current period and not on the state $s_{t-1}^n$ of the previous period. A policy-based method is used to represent the policy. Assuming that the strategy of electric vehicle charging and navigation control consists of a $t$-step decision, the agent obtains $n$ corresponding training trajectories $\tau^n$ by interacting with the environment as follows:

$$\tau^n = \left\{ s_1^n, r(1)^n, a_1^n, s_2^n, r(2)^n, a_2^n, \ldots, s_t^n, r(t)^n, a_t^n \right\},$$

where $a_t^n$ represents the action determined at time $t$ during the $n$th training, $s_t^n$ represents the state after action $a$ during the $n$th training, and $r(t)^n$ represents the reward obtained after action $a$ during the $n$th training. The expected return reward $R_\theta$ for all stored trajectories is as follows:

$$R_\theta = \sum_{n} R(\tau^n)\, p_\theta(\tau^n), \qquad p_\theta(\tau^n) = \prod_{t} p\left(a_t^n \mid s_t^n, \theta\right) \cdot p\left(r(t+1)^n, s_{t+1}^n \mid s_t^n, a_t^n\right),$$

where $R(\tau^n) = \sum_{i=1}^{t} r(i)^n$ is the reward value of trajectory $\tau^n$, $p_\theta(\tau^n)$ is the probability of trajectory $\tau^n$, $p(r(t+1)^n, s_{t+1}^n \mid s_t^n, a_t^n)$ is the probability of $[r(t+1)^n, s_{t+1}^n]$ in state $[s_t^n, a_t^n]$, and $p(a_t^n \mid s_t^n, \theta)$ is the probability of selecting action $a_t^n$ according to the input-output strategy $\pi^n(\theta)$ in state $s_t^n$. Therefore, reinforcement learning can be expressed as maximizing the expected return reward $R_\theta$. To realize the strategy, the partial derivative with respect to the parameter set $\theta$ is taken to obtain the policy gradient $\nabla R_\theta$ as follows:

$$\begin{aligned}
\nabla R_\theta &= \sum_{n} R(\tau^n)\, \nabla p_\theta(\tau^n) = \sum_{n} R(\tau^n)\, p_\theta(\tau^n)\, \nabla \log p_\theta(\tau^n) \\
&= \sum_{n} R(\tau^n)\, p_\theta(\tau^n)\, \nabla \sum_{t} \left\{ \log p\left(a_t^n \mid s_t^n, \theta\right) + \log p\left(r(t+1)^n, s_{t+1}^n \mid s_t^n, a_t^n\right) \right\} \\
&= \sum_{n} p_\theta(\tau^n) \sum_{t} R(\tau^n) \cdot \nabla \log p\left(a_t^n \mid s_t^n, \theta\right).
\end{aligned}$$

The last step holds because the environment transition probability does not depend on $\theta$.
The reinforcement learning policy gradient algorithm is thus equivalent to solving a partial derivative problem. When the parameter set $\theta$ is updated in the direction of the gradient, the probability of a trajectory $\tau^n$ with a higher reward increases, and the probability of a trajectory with a lower reward decreases. The pseudocode of the policy gradient algorithm is given in Algorithm 1.
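The update rule described in this subsection can be sketched in a few lines. The example below is a minimal sketch of a reward-weighted policy gradient update on a toy problem; it is not the authors' implementation. It assumes a tabular softmax policy in place of the wavelet neural network, fixed-length trajectories instead of a reward threshold, and a made-up reward of 1 when the chosen action index matches the state index. All names and hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(theta, rng, steps=5):
    """Roll out one trajectory; return its (state, action, probs) steps and R(tau)."""
    trajectory, total_reward = [], 0.0
    for _ in range(steps):
        state = rng.randrange(3)       # hypothetical intersection type
        probs = softmax(theta[state])  # pi(a | s, theta)
        action = rng.choices([0, 1, 2], weights=probs)[0]
        total_reward += 1.0 if action == state else 0.0  # toy reward
        trajectory.append((state, action, probs))
    return trajectory, total_reward


def train(episodes=2000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * 3 for _ in range(3)]  # theta[state][action] logits
    for _ in range(episodes):
        trajectory, total_reward = run_episode(theta, rng)
        # theta <- theta + c * R(tau) * sum_t grad log pi(a_t | s_t, theta).
        # For a softmax policy, d log pi(a|s) / d theta[s][k] = 1[k == a] - pi_k.
        for state, action, probs in trajectory:
            for k in range(3):
                indicator = 1.0 if k == action else 0.0
                theta[state][k] += learning_rate * total_reward * (indicator - probs[k])
    return theta


theta = train()
greedy = [max(range(3), key=lambda a: theta[s][a]) for s in range(3)]
```

The gradient is evaluated with the action probabilities stored when the trajectory was collected, which matches weighting each $\nabla \log p(a_t^n \mid s_t^n, \theta)$ term by the trajectory reward $R(\tau^n)$.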

Action Selection.
Taking the vehicle travel path as an example, the control parameter is $a_t^n$, and the vehicle has 3 possible actions at each intersection. The action takes values $a_t^n \in \{0, 1, 2\}$: 0 means going forward, 1 means turning left, and 2 means turning right.

Controller.
For electric vehicle charging navigation, a scheduling algorithm based on the policy gradient algorithm is proposed according to the personal habits of different electric vehicle owners. The controller selects an action directly from the observed information and uses the reward during back propagation to strengthen or weaken the probability of that action, so that good actions become more likely to be selected next time and bad actions become less likely.
A three-layer wavelet neural network is used. The wavelet neural network is a multilayer feedforward neural network trained according to error back propagation [23].
This article uses a three-layer neural network, that is, one input layer, one hidden layer, and one output layer, as shown in Figure 2. The state is the input of the neural network, and its dimension is 3; the hidden layer has 20 neurons; and the output layer contains 3 neurons, corresponding to the 3 output actions. The connection weights and bias terms between the input layer and the hidden layer and between the hidden layer and the output layer are represented by the parameter set $\theta$. The input-output strategy of the agent's wavelet neural network in the $n$th training is defined as $\pi^n(\theta)$. The activation function of the connection between the input layer and the hidden layer is tanh, and its function formula is as follows:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

The activation function connecting the hidden layer and the output layer is a wavelet basis function, and its function formula is as follows: According to the pseudocode of the algorithm, the specific training process can be obtained as shown in Figure 3.
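The 3-20-3 network described above can be sketched as a plain forward pass. This is an illustrative sketch under stated assumptions, not the network used in the paper: the wavelet basis function is assumed to be a Morlet wavelet, a softmax is assumed on top of the output layer to turn the three outputs into action probabilities, and the weight initialization, function names, and sample state are hypothetical.

```python
import math
import random


def morlet(x):
    """Morlet wavelet, an assumed choice of wavelet basis function."""
    return math.cos(1.75 * x) * math.exp(-0.5 * x * x)


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(n_in=3, n_hidden=20, n_out=3, seed=0):
    """Randomly initialize the parameter set theta (weights and biases)."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
        "W2": layer(n_out, n_hidden), "b2": [0.0] * n_out,
    }


def forward(params, state):
    """Map a 3-dimensional state to probabilities over the 3 actions."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, state)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    outputs = [
        morlet(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(params["W2"], params["b2"])
    ]
    return softmax(outputs)  # assumed normalization into a policy


params = init_params()
probs = forward(params, [0.3, 0.5, 0.2])  # hypothetical normalized state
action = random.Random(1).choices([0, 1, 2], weights=probs)[0]  # 0 forward, 1 left, 2 right
```

Sampling the action from the output probabilities, rather than always taking the largest one, keeps the exploration that the policy gradient update relies on.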

Simulation Results and Discussion
Taking the city in Figure 4 as a model, the city includes 21 nodes, 32 road sections, and 4 fast charging stations. The number marked on each road section represents the length of the road section in km. Fast charging stations are located at nodes 9, 12, 14, and 19. For electric vehicles, the battery capacity is 90 kW·h, the cruising range is 400 km, and the fast charging station power is 350 kW. When the electric vehicle leaves the fast charging station, $C_{\mathrm{carINI}}$ is 90%. The training parameters are as follows: the number of training rounds is 1900, the learning coefficient is 0.95, and the discount rate is 0.95. The initial position and target position of the vehicle are set randomly (on the 21 nodes), and the remaining power is set randomly (not higher than 30%). With the distance, the total time consumed, and the cost, as selected by the user, serving as the reward value, the vehicle is trained to travel from the initial position to a fast charging station to charge and from the fast charging station to the target location. After the training is completed, the final reward changes are shown in Figure 5. Figure 5 shows that as the number of training sessions increases, the training reward gradually increases. After 600 training sessions, the curve oscillates, with the reward fluctuating around 190, and in the subsequent training the reward is basically stable. The neural network model obtained from the final training parameters is saved. The 08:00 traffic flow distribution obtained through urban traffic simulation is shown in Figure 6. The green line represents smooth traffic, orange represents traffic congestion, and red represents heavy traffic congestion. For the traffic flow shown in Figure 6, the saved reinforcement learning model is used to obtain the station selection probability of an electric vehicle starting from each network node, as shown in Figure 7.
(1) In the neural network, initialize the parameter set $\theta$ randomly and initialize $n = 1$.
(2) Initialize $t = 0$, randomly initialize action $a_t^n$, obtain the output state $s_{t+1}^n$, calculate the local reward $r(t+1)^n$, and then add the trajectory generated by the $(t+1)$th action to the stored trajectory $\tau^n$ of the $n$th training.
(3) Input state $s_{t+1}^n$ to the neural network and select a random action $a_{t+1}^n$.
(4) After the simulation environment executes action $a_{t+1}^n$, obtain the output state $s_{t+2}^n$, calculate the local reward $r(t+2)^n$, and add the trajectory generated by the $(t+2)$th action to the stored trajectory $\tau^n$ of the $n$th training.
(5) Judge whether $R(\tau^n) = \sum_{i=1}^{t} r(i)^n > R$ is true; if it is true, go to step 6; otherwise, assign $t + 1$ to $t$ and go to step 3, where $i$ is the summation index and $R$ is the expected value of the total reward for a single trajectory.
(6) Calculate the policy gradient $\nabla R_\theta$.
(7) Assign $n + 1$ to $n$, update the parameter set $\theta$ in strategy $\pi^n(\theta)$ to $\theta + c \nabla R_\theta$, and judge whether $n \le N$ is true; if so, go to step 2; otherwise, the reinforcement learning training process is over; save the updated parameter set as the optimal parameter set $\theta^*$ and the optimal strategy $\pi$. $N$ is the maximum number of trajectories.
ALGORITHM 1: Policy gradient algorithm.

It can be concluded that, when congestion is considered, the trained reinforcement learning model can effectively select fast charging stations corresponding to shorter distances according to the target node. Now, take an electric vehicle starting at node 13 and ending at node 2 as an example to analyze its dynamic station selection strategy. The distance, total time, and cost of the charging navigation offered to the owner during driving are shown in Table 1.
Plan 1 takes the minimum distance as the goal and chooses fast charging station No. 9; its travel route is shown as the solid line in Figure 8. Plan 2 takes the minimum time as the goal and chooses fast charging station No. 14; its travel route is shown as the dashed line in Figure 8. Plan 3 takes the minimum cost as the goal and chooses fast charging station No. 12; its travel route is shown as the crossed line in Figure 8.
Multiple routes were selected for testing, and the methods from [10,13] were compared with the charging navigation method proposed in this paper. The performance comparisons under the comprehensive requirements of vehicle owners are shown in Figure 9. The first graph in Figure 9 shows how the average distance changes as the number of test routes increases, for the method in this paper and the methods in the other two references. As the number of routes increased, the average travel distance of the three methods fluctuated and finally stabilized near 17 km. Throughout this process, the total distance predicted by the three methods is basically the same. The second graph in Figure 9 shows the trend of the total time as the number of test routes increases. As the number of routes increases, the total time of the method in this paper steadily decreases and is finally reduced to 0.7 h, whereas the total time curves of the other two methods oscillate, and their time consumption is unstable and greater than that of the method in this paper. The curves show that the method in this paper has the least total time consumption. The third graph in Figure 9 shows the trend of the total cost as the number of test routes increases. As the number of routes increases, the total cost of the method in this paper first increases, then gradually decreases, and finally stabilizes at approximately 30 yuan. For the method from [10], the total cost was initially lower than that of the method in this paper; as the number of test routes increased, its cost began to increase and eventually was significantly higher than that of the method in this paper.
The cost for the method in [13] remained higher than the cost for the method in this paper after initially oscillating lower. It can be concluded that, under the comprehensive performance requirements, the total distances of the three methods are basically the same. On this basis, as the number of test routes increases, the method in this paper has the least total time and cost, which indicates the superiority of the method in this paper.

For the same time, initial point, and destination, we compare user satisfaction under the electric vehicle charging navigation strategies in [10,13,24-26]. The user satisfaction from testing electric vehicles using these methods is shown in Table 2.
Figure 3: Flowchart of training steps.
Figure 9: Distance, time, and cost comparison chart. Panel (c): total cost (yuan) for [10], [13], and this article.

Conclusions
We propose an electric vehicle charging service scheduling method based on reinforcement learning to meet the needs of electric vehicle owners. First, based on an intelligent transportation system, a framework for the interaction between fast charging stations and electric vehicles is proposed. Subsequently, the dynamic travel time model of the traffic section is established, and the electric vehicle charging navigation model is proposed. Finally, combined with reinforcement learning, the electric vehicle charging navigation scheduling method is further proposed to rationally optimize the service resources of each fast charging station in the area. The results show that, compared with the existing methods, the algorithm and model proposed in this paper can effectively optimize electric vehicle charging and navigation scheduling based on the needs of the vehicle owner and can meet the various needs of the vehicle owner.
Data Availability
The MATLAB simulation data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.