Modeling and Optimization of Multiaction Dynamic Dispatching Problem for Shared Autonomous Electric Vehicles

The fusion of electrification, automation, and sharing is forming a new Autonomous Mobility-on-Demand (AMoD) system in current urban transportation, in which Shared Autonomous Electric Vehicles (SAEVs) form a fleet that executes delivery, parking, recharging, and repositioning tasks automatically. To model the decision-making process of the AMoD system and optimize the multiaction dynamic dispatching of SAEVs over a long horizon, the dispatching problem of SAEVs is first modeled as a Markov Decision Process (MDP). Then two optimization models based on combinatorial optimization theory are built, one from a short-sighted view and one from a farsighted view. The former focuses on the instant, single-step reward, while the latter aims at the accumulative, multistep return. After that, the Kuhn–Munkres algorithm is set as the baseline method to solve the first model and obtain optimal multiaction allocation instructions for SAEVs, and a combination of the deep Q-learning algorithm and the Kuhn–Munkres algorithm is designed to solve the second model and realize global optimization. Finally, a toy example, a one-month macrosimulation, and a six-hour microsimulation based on actual historical operation data are conducted. Results show that (1) the Kuhn–Munkres algorithm ensures computational effectiveness in large-scale real-time application of the AMoD system; (2) the second optimization model, which considers the long-term return, can decrease average user waiting time and achieves a 2.78% increase in total revenue compared with the first model; and (3) integrating combinatorial optimization theory with reinforcement learning theory is an effective combination for solving the multiaction dynamic dispatching problem of SAEVs.


Introduction
Three revolutions of electrification, automation, and sharing are booming in current urban transportation [1]. The fusion of L4/L5 level autonomous driving, electric vehicles, and the shared mobility mode is forming a new Autonomous Mobility-on-Demand (AMoD) system [2,3]. In the AMoD system, Shared Autonomous Electric Vehicles (SAEVs) can automatically pick up and deliver passengers from origin to destination, drive to a nearby charging station/pile for electricity replenishment, reposition to hotspots with low vehicle supply and high trip demand, and park on the road waiting for a new assignment [4][5][6]. On the one hand, SAEVs help reduce environmental pollution, carbon emissions, and traffic congestion in cities [7,8]. On the other hand, SAEVs can provide on-demand mobility service to satisfy the immediate trip demand of users [9,10].
In this paper, the multiagent multiaction dynamic dispatching problem of an SAEV fleet is the main research theme, which corresponds to the optimal decision-making process of assigning different SAEVs to serve passengers' trip requests, drive to charging stations/piles for recharging, and head for hotspots to supply vehicles in advance [11]. Few scholars have focused on the dispatching problem of SAEVs, since it is a newly emerging topic. The most studied area is the static or dynamic relocation problem of electric car-sharing systems based on manned vehicles [12][13][14][15][16][17]. However, these studies mostly focus on the pickup and delivery task for users and barely consider the recharging task of electric vehicles or the repositioning task of redundant vehicles [18][19][20]. Meanwhile, the methods involved in the above studies mainly include nonlinear programming models and solving algorithms with high time complexity, which ignore computational efficiency in large-scale applications [13,21,22]. Hence, the existing methods are not suitable for the future AMoD system. A new method for large-scale multiaction dispatching of SAEVs that comprehensively considers the delivery, parking, recharging, and repositioning tasks is urgently needed.
Our goal is to model the decision-making process of the AMoD system and optimize the multiaction dynamic dispatching of SAEVs over a long horizon (e.g., several days or a month) instead of static dispatching. This relies on a proper mathematical modeling method to describe the whole multiaction dispatching process of SAEVs, an optimization model to specify the optimal objective and constraints, and an efficient solving algorithm that can quickly derive optimal task allocation instructions for SAEVs in large-scale applications.
To accomplish this, the MDP framework, including agent, state, action, and reward, is first adopted to model the multiagent multiaction dynamic dispatching problem of SAEVs. Then, a combinatorial optimization method is employed to establish the multiaction dispatching optimization model, and the optimal task allocation instructions are solved by the KM algorithm. After that, to realize long-term and global optimization, the Bellman Equation from reinforcement learning theory is used to transform the instant, single-step reward into an accumulative, multistep return, which is represented as a new match value between each SAEV and each task to update the above combinatorial optimization model. The deep Q-learning and KM algorithms are combined to solve the new optimization model and achieve a better task assignment instruction that considers the future impact. Finally, a toy example based on simple data and two dispatching simulators based on actual vehicle trajectory and trip request data from the Didi Chuxing platform (https://gaia.didichuxing.com) and charging pile information (http://admin.bjev520.com/jsp/beiqi/pcmap/do/index.jsp) are conducted to verify the effectiveness of the above methods. Results show that the latter optimization model, which considers the accumulative, multistep return, brings an improvement over the former one.
We overcome several practical issues, including computational efficiency and multiaction coordination, to make the proposed methods suitable for large-scale application of future SAEV fleets. The contribution of this research is summarized as follows: (1) To the best of our knowledge, this is among the first work to study the multiaction dynamic dispatching problem of SAEVs that comprehensively considers the delivery, recharging, and repositioning tasks simultaneously. Although Al-Kanj et al. [23] also studied a related problem involving the decisions of assigning orders to cars, recharging batteries, and repositioning and parking the vehicles, our research adopts bipartite graph modeling to achieve a faster solving process for the optimal dispatching scheme. (2) By modeling the multiaction dynamic dispatching problem as a sequential decision-making problem through the MDP framework and taking the accumulative, multistep reward into account based on the Bellman Equation, the proposed methods belong to the category of reinforcement learning. Results show that reinforcement learning can be applied to a large-scale real-time AMoD system very well. (3) A stochastic integer linear programming model is established based on combinatorial optimization theory and solved by a combination of the deep Q-learning algorithm and the KM algorithm to achieve the optimal multiaction allocation instructions for SAEVs. Results show an improvement over the baseline KM algorithm, and it turns out that integrating combinatorial optimization theory with reinforcement learning theory is an effective combination for solving the multiaction dynamic dispatching problem of SAEVs. The rest of the paper is organized as follows: Section 2 is a literature review of the dispatching problem of the Mobility-on-Demand system. Section 3 provides the analyzing framework and details the methods, including problem description, problem assumption, and mathematical formulation.
Case studies including a toy model and two dispatching simulators are then described in Section 4. Section 5 concludes the research results and discusses the future work.

Literature Review
Generally, the fleet operation process of SAEVs usually involves eight subproblems: trip demand, fleet size, traffic assignment, vehicle assignment, vehicle distribution, pricing, charging, and parking [24]. Vehicle assignment, charging and parking assignment, and vehicle repositioning are the main focus of this paper's study of the multiagent multiaction dynamic dispatching problem of SAEVs in the AMoD system. Most scholars study these three subproblems separately (as shown in Table 1), while this research integrates them into one methodology framework by combining deep reinforcement learning and combinatorial optimization methods.

Vehicle Assignment.
Vehicle assignment means assigning vehicles to customers, after which the vehicles execute pickup and delivery tasks to satisfy the customers' trip requests. Fixed rules, heuristics, and exact optimization algorithms are the three main approaches to this problem. When modeling the dynamic process of vehicle assignment, a rule-based method is usually implemented [26,28,[44][45][46][47][48]; the most widely used rule is assigning the nearest vehicle to the user request. Beyond such rules, optimization models and solving algorithms have been developed to improve vehicle assignment performance. Liang et al. [39] modeled a dial-a-ride problem of ride-sharing SAEVs in an urban road network as an integer nonlinear programming model and designed a customized Lagrangian relaxation algorithm to solve for the optimal assignment scheme. However, this methodology is not practice-ready due to its computation time and the gap between the upper and lower bounds; this is also a typical problem in other studies [32,[49][50][51]. To make the method more practical, Shi et al. [41] developed a reinforcement learning-based algorithm to operate an electric vehicle fleet, which can also be applied to the SAEV assignment problem. The reward function is designed to minimize customer waiting time, electricity cost, and the operational costs of the vehicles. A deep feed-forward neural network (FNN) is parameterized to approximate the state-value function, and the KM algorithm, with a time complexity of O(n³), is adopted to solve for the optimal dispatching results.
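The nearest-vehicle rule and optimal assignment can be contrasted on a toy instance. The sketch below is illustrative only: the distance matrix is hypothetical, and a brute-force search over permutations stands in for the O(n³) KM algorithm, which returns the same optimum far more efficiently.

```python
from itertools import permutations

# Hypothetical pickup-distance matrix: dist[v][r] = distance (km) from
# vehicle v to trip request r.
dist = [
    [1.0, 2.0, 6.0],
    [1.5, 7.0, 8.0],
    [4.0, 9.0, 3.0],
]

def greedy_nearest(dist):
    """Nearest-vehicle rule: each request, in order, takes the closest free vehicle."""
    free = set(range(len(dist)))
    plan = {}
    for r in range(len(dist[0])):
        v = min(free, key=lambda v: dist[v][r])
        plan[r] = v
        free.discard(v)
    return plan, sum(dist[v][r] for r, v in plan.items())

def optimal_assignment(dist):
    """Exhaustive minimum-cost assignment (O(n!)); the KM algorithm finds
    the same optimum in O(n^3)."""
    n = len(dist)
    best = min(permutations(range(n)),
               key=lambda p: sum(dist[p[r]][r] for r in range(n)))
    return {r: best[r] for r in range(n)}, sum(dist[best[r]][r] for r in range(n))

print(greedy_nearest(dist)[1])       # 1.0 + 7.0 + 3.0 = 11.0
print(optimal_assignment(dist)[1])   # 1.5 + 2.0 + 3.0 = 6.5
```

Here the greedy rule strands request 1 with a distant vehicle (total 11.0 km), while the optimal matching reassigns vehicles 0 and 1 and reaches 6.5 km, which is why exact matching matters at fleet scale.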

Charging and Parking Assignment.
Charging and parking assignment refers to monitoring the real-time battery levels of SAEVs and conducting corresponding strategies to assign vehicles to charging piles or parking lots [34,52]. Regarding charging assignment, Chen et al. [26] insisted that charging vehicles should not be allowed to undock and serve a new trip request, whereas Bauer et al. [30] believed that still-charging vehicles should be allowed to serve new trip requests. Iacobucci et al. [34,53] developed a simulation methodology for evaluating a shared autonomous electric vehicle system in which vehicles interact with passengers and charge at designated charging stations under a heuristic charging strategy, and they further used electricity price information to optimize vehicle charging in a mixed-integer optimization model with charging constraints over longer time scales. Jones and Leibowicz [37] used an energy optimization model to assess the impact of charging SAEVs at times that are optimal for the energy system. Zhang et al. [40] adopted an agent-based simulation model, called BEAM, to describe the complex behaviors of both passengers and AMoD systems in urban cities; BEAM simulates the driving, parking, and charging behaviors of the SAEV fleet with range constraints and identifies the times and locations of their charging demands. Melendez et al. [42] incorporated the power network, involving power purchases, real-time price spikes, arbitrage with battery banks, and solar generation, into SAEV fleet operation planning and constructed a mixed-integer linear programming model to solve the optimal delivery and charging task decisions. To facilitate charging decisions, Basso et al. [54] proposed a probabilistic Bayesian machine learning approach for predicting the expected energy consumption of electric vehicles; the MAPE of this model decreases from 7.95% to 3.59% compared with prior forecast models. Regarding parking assignment, Azevedo et al.
[27] used an optimization algorithm (a facility location problem) to locate parking stations, at which vehicle charging is also possible. Zhang and Guhathakurta [28] minimized cost by routing idle vehicles to low-cost parking areas. Al-Kanj et al. [23] introduced SAEVs into the ride-hailing system and combined the Markov decision process with a one-stage combinatorial optimization method to realize real-time optimal decision-making for charging and parking assignment.

Vehicle Repositioning.
Vehicle repositioning, also referred to as "vehicle rebalancing" or "redistribution," repositions excess vehicles from low-demand areas to high-demand areas when modeling on-demand SAEV service.

[Table 1 appears here: for each reviewed study, the subproblem(s) addressed (marked √) and the method adopted (optimization, simulation, or reinforcement learning).]

Fagnant and Kockelman [7] designed an agent-based simulation model for SAEV operations and introduced vehicle rebalancing into the simulation process; their results show that vehicle rebalancing may save ten times the number of cars needed relative to self-owned personal-vehicle travel. Vosooghi et al. [6] concluded that vehicle repositioning has a significant effect on service performance, such as modal share and fleet usage. Reactive methods such as nearest neighbours are commonly used in the literature, but Babicheva et al. [31] compared six different ways of applying vehicle repositioning and proposed a new index-based proactive redistribution (IBR) algorithm based on predicted near-future demand at stations. A linear programming model for vehicle redistribution is adopted by Zhang and Pavone [25] and Alonso-Mora et al. [29]. Rossi et al. [33] suggested that the vehicle assignment and vehicle rebalancing problems can be decoupled and developed a computationally efficient routing and rebalancing algorithm for SAEVs, in which the rebalancing optimization problem is modeled as a Minimum Cost Flow problem. Dandl et al.
[35] emphasized the importance of trip demand forecasting and concluded that accurate forecasts can help redistribute vehicles among different regions. Mao et al. [38] modeled the repositioning task dispatching problem of SAEVs alone as an MDP and achieved the optimal dispatching results with an actor-critic policy gradient network.

Research Gaps.
Generally, current studies on the multiagent multiaction dynamic dispatching problem of SAEV operation remain insufficient, and three research gaps remain to be addressed.
First, vehicle assignment, charging and parking assignment, and vehicle repositioning are mostly studied separately in previous research. Rossi et al. [33] studied the combination of vehicle assignment and vehicle repositioning, and Melendez et al. [42] combined vehicle assignment and charging assignment together. During the dynamic operation process, it is better to integrate vehicle repositioning into the decision-making process since vehicle redistribution in advance has a significant effect on mobility service performance [6].
Second, current research barely balances operational income, user satisfaction, and electricity cost when deciding which vehicle should be assigned to a specified passenger. Besides, designing the matching weight of each vehicle-passenger pair from a long-term view is verified to be better than from a short-sighted, instant view. Hence, it is better to design a multidimensional accumulated reward function to represent the matching weight between each vehicle and each delivery task.
Third, to guarantee that these methods are practice-ready for the SAEVs' dynamic dispatching process, a combination of deep reinforcement learning and combinatorial optimization is a promising methodology framework. Thus far, some studies have verified the effectiveness of this framework, but they have only been applied in the field of ride-hailing rather than SAEV fleet operation. How to model the dynamic dispatching process comprehensively, considering the delivery, charging, and repositioning tasks together within a reinforcement learning framework, and how to solve for the best charging, delivery, and repositioning assignment scheme by combinatorial optimization methods remain to be explored.

Methods
To assign the different tasks, including delivery, parking, recharging, and repositioning, to SAEVs, MDP is first adopted to model the operation process and transform it into a multiagent multiaction dynamic dispatching problem. Then, to achieve the optimal task allocation instructions for SAEVs, two optimization models with the objective of maximum economic income are established, one from the local view (instant, single-step reward) and one from the global view (accumulative, multistep reward). Accordingly, two algorithms are designed to solve the above optimization models so that the best dispatching scheme can be obtained at each time period as quickly as possible. Finally, to test and verify the effectiveness of the optimization models and the performance of the solving algorithms, two case studies are designed: a toy model with hypothetical data covering 1 time slot, 4 agents, and 4 tasks, and two dispatching simulators covering 1 month (43,200 time slots) in Chengdu, China, based on historical data generated from the Didi Chuxing platform. The analyzing framework is illustrated in Figure 1.
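The per-time-slot decision loop described above can be sketched as a small skeleton. This is not the paper's implementation: all function names are placeholders, the match-value function is a stub, and a brute-force matcher stands in for the KM solver.

```python
from itertools import permutations

def dispatch_round(vehicles, tasks, match_value, solve_matching):
    """One round of task allocation at a single time slot.

    vehicles       : list of SAEV states (t, lng, lat, soc)
    tasks          : list of candidate tasks (delivery/parking/recharging/repositioning)
    match_value    : callable (vehicle, task) -> weight (single- or multistep reward)
    solve_matching : callable (weight matrix) -> {vehicle index: task index}
    """
    weights = [[match_value(v, k) for k in tasks] for v in vehicles]
    return solve_matching(weights)

def brute_force_max(weights):
    """Maximum-weight assignment by enumeration (placeholder for KM)."""
    n = len(weights)
    best = max(permutations(range(n)),
               key=lambda p: sum(weights[i][p[i]] for i in range(n)))
    return {i: best[i] for i in range(n)}

# Toy usage: vehicle v0 is a much better fit for the delivery task.
plan = dispatch_round(
    ["v0", "v1"], ["delivery", "recharge"],
    lambda v, k: 5.0 if (v, k) == ("v0", "delivery") else 1.0,
    brute_force_max)
print(plan)  # {0: 0, 1: 1}
```

In the full system this loop would repeat once per time slot, with `match_value` supplied by the reward model (model 1) or the value-function-augmented reward (model 2).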

Problem Description and Assumption
3.1.1. Problem Description. Shared Autonomous Electric Vehicles (SAEVs) are a free-floating fleet with L4/L5 level autonomy that executes four tasks: delivery, parking, recharging, and repositioning. (1) The delivery task means the SAEV picks up and delivers a passenger to the destination. (2) The parking task indicates the SAEV idles on the road waiting for the next round of task distribution. (3) The recharging task represents the SAEV driving to a nearby charging station or pile for recharging. (4) The repositioning task denotes the SAEV being instructed to drive to a specific area for vehicle replenishment. The problem to be solved in this research is how different tasks should be dynamically assigned to SAEVs in the best way at each time slot. For practical application in future robotaxi fleet operation, SAEVs should make decisions by themselves, fast and optimally, by following a set of multiaction dispatching algorithms, which is exactly the purpose of this research.

Problem Assumption.
The proposed multiaction dispatching optimization models and algorithms are developed based on the following assumptions: (1) SAEVs are assumed to be a fleet with L4/L5 level autonomous driving technology; they can follow the task allocation instructions to execute the four tasks autonomously, without drivers.
(2) When executing the delivery task, the pickup location and destination are assumed to be fixed, following the initial trip request information without any change. (3) When executing the parking task, the parking strategy is assumed to be predetermined: the SAEV keeps still in its original position and waits for the next round of task allocation instructions. (4) When executing the recharging task, the SAEV is assumed to drive to a nearby predesignated charging station/pile and recharge without human assistance; that is, the whole charging process is conducted by the SAEV itself. (5) When executing the repositioning task, the specific hotspots with high trip demand and low vehicle supply at different time periods, and the specific quantities of surplus or missing vehicles, are predetermined.

Model Parameters.
Based on the problem description and assumptions above, a mathematical model will first be constructed following the MDP framework to describe the multiaction dispatching process realistically; then, two optimization models, from the global and local perspectives respectively, will be established to derive reward-maximizing task allocation instructions; finally, two corresponding algorithms will be designed to solve the two optimization models quickly and efficiently. The model parameters are shown in Table 2.

Model Framework.
The multiaction dispatching process of SAEVs is modeled as an MDP [20]. Each SAEV behaves as an "agent" in the "environment." The spatiotemporal status of the SAEV, including geographic position, time, and state of charge (SOC), is set as the "state." The four tasks, delivery, parking, recharging, and repositioning, are defined as four "actions," respectively. In each round of task allocation, each SAEV receives a "reward" representing task income, carbon emission savings, or user satisfaction. A "policy" should be determined to decide how to allocate the different tasks to SAEVs in the best way.
(1) Agent and Environment. Each independent SAEV is modeled as an agent, and the environment contains all the information, including the layout of charging stations/piles, parking spots, users' trip orders, and other agents.
(2) State. The state of an SAEV is represented by its location, time, and battery level. It is defined as a four-dimensional vector covering universal time coordinated (UTC), geographic position, and state of charge (SOC). Formally, we define s = (t, lng, lat, soc) ∈ S, where t ∈ T is the time index, (lng, lat) is the real location of the SAEV at time index t, and soc represents the remaining battery volume of the SAEV at time index t.

(3) Action. The action a ∈ A is the task assigned to the SAEV in the current round of task allocation: delivery, parking, recharging, or repositioning.
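The state vector s = (t, lng, lat, soc) can be sketched as a small data structure. The transition helper, the SOC bookkeeping, and the coordinates below are illustrative assumptions, not the paper's exact representation.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SAEVState:
    """Four-dimensional MDP state s = (t, lng, lat, soc)."""
    t: int        # time index (UTC time slot)
    lng: float    # longitude
    lat: float    # latitude
    soc: float    # state of charge in [0, 1]

def step(s: SAEVState, dt: int, new_lng: float, new_lat: float,
         energy_used: float) -> SAEVState:
    """Hypothetical transition: advance time, move, and drain the battery."""
    return replace(s, t=s.t + dt, lng=new_lng, lat=new_lat,
                   soc=max(0.0, s.soc - energy_used))

s1 = SAEVState(t=0, lng=104.06, lat=30.67, soc=0.80)   # Chengdu-area coordinates
s2 = step(s1, dt=3, new_lng=104.10, new_lat=30.70, energy_used=0.05)
print(round(s2.soc, 2))  # 0.75
```

Freezing the dataclass makes each state an immutable snapshot, which matches how the MDP treats s1, s2, s3 as distinct points along a task.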
(4) Reward. r = (r1, r2, r3, r4) defines the four types of reward an SAEV receives for executing each task (action). Each reward is quantified by the task income, which is highly related to the driving distance dis_r ∈ DIS, elapsed time ela_r ∈ ELA, battery energy consumption con_r ∈ CON, and battery capacity bat_r ∈ BAT of each SAEV, and the charging volume vol_r ∈ VOL of each charging process.

r1 represents the expected return when the SAEV executes the delivery task. r1 is linearly related to the driving distance dis_r ∈ DIS and elapsed time ela_r ∈ ELA. Assume the SAEV drives from state s1 = (t1, lng1, lat1, soc1) to the next state s2 = (t2, lng2, lat2, soc2), picks up the passenger, then continues to drive to the next state s3 = (t3, lng3, lat3, soc3) and arrives at the destination. The formula of r1 is defined in equations (1)-(4). Following the Mercator projection, dis_r is determined in simplified form by equations (2) and (4). It should be noted that dis_r1 is related only to the distance between the pickup point s2 and the destination point s3.

[Figure 1: Method and analyzing framework.]
Journal of Advanced Transportation

r2 denotes the expected return when the SAEV executes the parking task. Following the principle of opportunity cost, r2 is negative, because if the SAEV had instead been assigned a delivery task, the return would have been a positive value linearly related to the driving distance dis_r and elapsed time ela_r. Assume the SAEV parks from state s1 = (t1, lng1, lat1, soc1) until the next state s2 = (t2, lng2, lat2, soc2) before the next round of task allocation instructions. The formula of r2 is defined in equations (5)-(7) and (4).

r3 implies the expected return when the SAEV executes the recharging task. If the soc of the SAEV is lower than 20%, the SAEV is compulsorily assigned the recharging task; otherwise, the SAEV is subject to the global optimal task allocation results. The recharging task is divided into two parts: (1) the SAEV drives from its real-time location (state s1 = (t1, lng1, lat1, soc1)) to the charging station/pile (state s2 = (t2, lng2, lat2, soc2)), and (2) the SAEV stays at the charging station/pile for recharging until the soc reaches 90% (state s3 = (t3, lng2, lat2, soc3), soc3 = 90%). Following the principle of opportunity cost, part 1 causes a negative reward linearly related to the driving distance dis_r and elapsed time ela_r; part 2 also causes a negative reward, comprising the forgone delivery task income, which is linearly related to the elapsed time ela_r, and the charging cost, which is linearly related to the battery capacity bat_r of the SAEV and the charging volume vol_r of the charging process.
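A minimal sketch of the reward structure described above, under explicit assumptions: the pricing coefficients are hypothetical (not from the paper), and an equirectangular planar approximation stands in for the paper's Mercator-projection distance mer(·).

```python
import math

EARTH_R_KM = 6371.0

def mer(lng1, lat1, lng2, lat2):
    """Planar (equirectangular) distance approximation in km, standing in
    for the paper's Mercator-projection distance mer(.)."""
    x = math.radians(lng2 - lng1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_R_KM * math.hypot(x, y)

# Hypothetical pricing coefficients (illustrative only).
ALPHA_DIS, BETA_TIME = 2.0, 0.1   # yuan per km, yuan per minute

def r1_delivery(pickup, dest, elapsed_min):
    """Delivery reward: linear in trip distance (pickup s2 -> destination s3)
    and elapsed time, mirroring equations (1)-(4)."""
    return ALPHA_DIS * mer(*pickup, *dest) + BETA_TIME * elapsed_min

def r2_parking(elapsed_min):
    """Parking reward: negative opportunity cost of a forgone delivery."""
    return -BETA_TIME * elapsed_min

trip = r1_delivery((104.06, 30.67), (104.10, 30.70), elapsed_min=12.0)
print(trip > 0 and r2_parking(5.0) < 0)  # True
```

The same pattern extends to r3 and r4: a driving leg priced like r2's opportunity cost, plus a task-specific term (charging cost for r3, the boundary-to-center constant ζ for r4).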
The formula of r3 is defined in equations (8)-(12) and (4).

r4 indicates the expected return when the SAEV executes the repositioning task. Following the principle of opportunity cost, driving to an area (a regular hexagon with radius rad) with high trip demand and low vehicle supply causes a certain short-term cost for the SAEV: instead of repositioning, the SAEV could have been assigned a delivery task and earned a positive reward. Hence, if the SAEV is assigned a repositioning task, it drives from its real-time location (state s1 = (t1, lng1, lat1, soc1)) to the corresponding area for replenishment. The central point (state s2 = (t2, lng2, lat2, soc2)) of the area is set as the destination of the repositioning task, with dis_r4 = mer(lng1, lat1, lng2, lat2). Note that it is unnecessary for the SAEV to drive all the way to the central point; crossing the boundary of the regular hexagon (state s3 = (t3, lng3, lat3, soc3)) is enough. Hence, ζ is defined as a constant reward covering the remaining leg from the first boundary-crossing point to the central point of the area, and ζ is also counted into r4. The formula of r4 is defined in equations (13)-(16) and (4).

(5) Discount Factor. The discount factor γ ∈ (0, 1] controls how far into the future the MDP looks. It is beneficial to use a small discount factor, as long horizons introduce a large variance in the value function. It is worth noting that, under this setting, the reward should also be discounted: for a task lasting T time slots with reward r and discount factor γ, the final discounted reward r_γ is given by equation (17).

(6) Policy. The policy π represents the strategy used to decide the specific task allocation instruction for SAEVs at each decision time slot. In this paper, it is determined by the combinatorial optimization method.
(7) Value Function. Since r = (r1, r2, r3, r4) defines only the instant, single-step reward, it is hard to reveal the accumulated, multistep reward when the agent chooses different actions. Hence, a value function based on the Bellman Equation is introduced to capture the long-term return of each action (the task reward in a global view). The formula of the value function V is defined in equations (18) and (19).

(8) State Transition. Before task assignment, the state transition is pre-evaluated by assuming different kinds of SAEV-action matches. Once the assignment is completed by solving the optimization model below, the state transition is finally determined.
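How the value function converts an instant reward into a multistep match value can be shown with a tabular sketch of the Bellman one-step lookahead. The state discretization (a cell identifier plus a time slot) and all numbers here are illustrative assumptions.

```python
GAMMA = 0.9  # discount factor, gamma in (0, 1]

# Hypothetical learned state values V(s) for a few discretized states.
V = {("cell_A", 10): 4.0, ("cell_B", 10): 7.5, ("cell_C", 10): 1.0}

def multistep_return(instant_reward, next_state, value_table=V, gamma=GAMMA):
    """Match value in the global view: instant reward plus the discounted
    value of the successor state (Bellman one-step lookahead)."""
    return instant_reward + gamma * value_table.get(next_state, 0.0)

# A delivery ending in high-value cell_B can outrank a slightly larger
# instant reward that strands the SAEV in low-value cell_C.
a = multistep_return(2.0, ("cell_B", 10))   # 2.0 + 0.9 * 7.5 = 8.75
b = multistep_return(3.0, ("cell_C", 10))   # 3.0 + 0.9 * 1.0 = 3.9
print(a > b)  # True
```

This is precisely the difference between the two optimization models below: model 1 ranks matches by the instant reward alone, while model 2 ranks them by this lookahead value.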

Optimization Model.
Based on the MDP framework above, a combinatorial optimization model is constructed in this section to achieve the best policy π, that is, to decide the best task allocation (action selection) instructions for SAEVs from a global view. Given a bipartite graph G = (V, E), V is the set of vertexes comprising two subsets V1 and V2: V1 is the set of SAEVs (agents) and V2 is the set of tasks. E is the set of directed lines e ∈ E from v1 ∈ V1 to v2 ∈ V2, and w(e) represents the weight of directed line e. Let M be a match between V1 and V2 in G; each SAEV (agent) can match only one task, and all SAEVs should be assigned corresponding tasks in each round of the task allocation process. Define W as the total weight of the match M, i.e., the sum of w(e) over all matched directed lines. For instance, Figure 2 illustrates the task allocation instructions of 8 SAEVs based on bipartite graph theory and the MDP framework.
In the bipartite graph, if W reaches its maximum under match M, match M can be identified as the best policy π. A combinatorial optimization model is constructed to solve for the best match M, that is, the best policy π. The mathematical model of the optimization process is as follows: the maximum total reward is set as the optimization objective, mainly representing the total task economic income. x_ij, x_im, x_in, x_ip indicate whether the SAEV transitions from state i to state j to execute the delivery task, from state i to state m to execute the parking task, from state i to state n to execute the recharging task, and from state i to state p to execute the repositioning task, respectively. x_ij, x_im, x_in, x_ip are the decision variables and are all 0-1 variables.
The objective function is as follows:

max Z = Σ_{i=1}^{I} Σ_{j=1}^{J} R_ij x_ij + Σ_{i=1}^{I} Σ_{m=1}^{M} R_im x_im + Σ_{i=1}^{I} Σ_{n=1}^{N} R_in x_in + Σ_{i=1}^{I} Σ_{p=1}^{P} R_ip x_ip.

I, J, M, N, P denote the numbers of states of SAEVs, passengers, parking positions, charging stations, and targeted areas, respectively, and I_s, J_s, M_s, N_s, P_s denote the corresponding sets of states. R_ij, R_im, R_in, R_ip give the weight w(e) (or the reward) from vehicle state i to passenger state j, from vehicle state i to parking position state m, from vehicle state i to charging station state n, and from vehicle state i to targeted area state p, respectively. R_ij, R_im, R_in, R_ip can be calculated from two different perspectives: the global view and the local view.
From the local perspective, R_ij, R_im, R_in, R_ip represent the instant, single-step return of x_ij, x_im, x_in, x_ip, determined by equations (1)-(16). However, this perspective may cause myopia, since this round of task allocation results may only suit the current time period, not the following stages.
Instead, from the global perspective, R_ij, R_im, R_in, R_ip represent the accumulative, multistep return of x_ij, x_im, x_in, x_ip, calculated by combining the instant reward (following equations (1)-(16)) and the long-term value function (following equations (18) and (19)). Therefore, two optimization models from the two perspectives (myopic and global) are compared. Optimization model 1, from the myopic perspective, is listed in the Appendix; model 2, from the global (long-horizon) perspective, is listed as follows.

Mathematical model 2: set the accumulative, multistep reward as the weight value, subject to:

Constraint 4: ∀j ∈ J_s, Σ_{i=1}^{I} x_ij = 1,
Constraint 5: ∀m ∈ M_s, Σ_{i=1}^{I} x_im = 1,
Constraint 6: ∀n ∈ N_s, Σ_{i=1}^{I} x_in = 1,
Constraint 7: ∀p ∈ P_s, Σ_{i=1}^{I} x_ip = 1.
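Constraints 4-7 require every task column to be matched exactly once, while the KM algorithm below assumes a square (complete-matching) weight matrix; in practice the numbers of SAEVs and candidate tasks rarely coincide. A common trick, sketched here as an assumption rather than the paper's stated procedure, pads the weight matrix with zero-weight dummy rows or columns so that every SAEV still receives exactly one (possibly dummy) task.

```python
def pad_square(weights, fill=0.0):
    """Pad a |SAEVs| x |tasks| weight matrix to a square matrix with
    zero-weight dummy entries, so a complete matching always exists."""
    rows = len(weights)
    cols = max(len(r) for r in weights)
    n = max(rows, cols)
    padded = [list(r) + [fill] * (n - len(r)) for r in weights]
    padded += [[fill] * n for _ in range(n - rows)]   # dummy SAEV rows
    return padded

# 2 SAEVs, 3 candidate tasks -> one dummy SAEV absorbs the leftover task.
w = pad_square([[5.0, 1.0, 2.0],
                [3.0, 4.0, 6.0]])
print(len(w), len(w[0]))  # 3 3
```

A task matched to a dummy row simply goes unserved this round, which preserves the "each SAEV matches one task" semantics without distorting the objective value.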

Solving Algorithm.
Based on the two optimization models above, two solving algorithms are designed to obtain the best task allocation instructions for SAEVs at each time period. Model 1 is a typical integer linear programming problem (ILPP), and the Kuhn-Munkres (KM) algorithm is adopted to solve it. Model 2 is a stochastic combinatorial optimization model due to the randomness and nonlinearity of the value function V(s_t); a new algorithm combining the KM algorithm with the deep Q-learning algorithm is designed to solve it.
(1) Alternating Path. A path starting from an unmatched vertex and alternately traversing unmatched lines and matched lines is called an alternating path.
(2) Complete Matching. In a match M of graph G, if the numbers of vertexes satisfy |V1| ≤ |V2| and the number of matched lines satisfies |M| = |V1|, the match M is called a complete matching. The KM algorithm serves to determine the maximum-weight matching under complete matching in the bipartite graph G = (V, E). For this purpose, each vertex i ∈ V1 is given a label V1^i and each vertex j ∈ V2 a label V2^j such that V1^i + V2^j ≥ w(e)_ij for every directed line e from i to j.

Theorem 1. Define as S the set of directed lines from vertex i ∈ V_1 to vertex j ∈ V_2 meeting the condition V_i^1 + V_j^2 = w(e)_ij. If a matching is a complete matching of set S, this matching must also be the maximum-weight complete matching of graph G.
Following Theorem 1, the core of the KM algorithm applied to model 1 is to search for a complete matching of set S and take it as the final optimal matching to instruct the task allocation of SAEVs; the flowchart of the KM algorithm is given in Table 3. For the purpose of solving model 1, the KM algorithm realizes exact solving with a time complexity of O(n³), which is fast and efficient for practical application. However, the KM algorithm alone is not enough to solve model 2, since the value function is neither predetermined nor constant and thus cannot be quantified as a weighted value.
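For illustration, the maximum-weight complete matching that the KM algorithm computes can be checked on a small instance with a brute-force search over permutations. The weight matrix below is hypothetical; production code would use an O(n³) KM implementation such as `scipy.optimize.linear_sum_assignment` with `maximize=True` rather than this exponential sketch.

```python
from itertools import permutations

def max_weight_assignment(w):
    """Brute-force maximum-weight complete matching on a square weight
    matrix w (agents x tasks). Exponential time; for illustration only --
    the KM algorithm solves the same problem in O(n^3)."""
    n = len(w)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):  # perm[i] = task assigned to agent i
        total = sum(w[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_perm, best_total

# Hypothetical 3-agent / 3-task reward matrix
w = [[3, 1, 2],
     [2, 4, 6],
     [5, 2, 1]]
perm, total = max_weight_assignment(w)
print(perm, total)  # → (1, 2, 0) 12
```

Because the brute force enumerates every complete matching, it is a convenient oracle for unit-testing a KM implementation on small inputs.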
To solve this problem, an approximation method, the deep Q-learning algorithm, is put forward as the first step to transform the nonlinear and stochastic value function into an approximately determined accumulative, multistep reward. Next, the KM algorithm is adopted to solve for the optimal task allocation instructions based on the value function.
Before conducting the deep Q-learning algorithm, an experience trajectory including information on SAEVs executing different tasks at different time slots is built from the historical fleet operational data. Based on the experience trajectory, the value function V(s) is updated iteratively by following the temporal difference (TD) principle:

V(s_t) ← V(s_t) + σ · [r_t + c · V(s_{t+1}) − V(s_t)].

Previous studies adopt a Q-learning algorithm to update the value function V(s). However, the Q-learning algorithm can only record a limited number of historical trajectories, while billions of daily trip orders and corresponding trajectories are generated in China every day. This would cause a memory explosion if the Q-learning algorithm were applied to a large-scale operational process.
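The TD update above can be sketched as a tabular procedure. The discount c = 0.9 and step size σ = 0.1 follow the simulation settings later in the paper, while the state keys and trajectory data here are hypothetical.

```python
from collections import defaultdict

def td_update(V, trajectory, c=0.9, sigma=0.1):
    """One pass of tabular TD(0): V(s_t) += sigma * (r_t + c*V(s_next) - V(s_t)).
    trajectory is a list of (s_t, r_t, s_next) transitions from an
    experience trajectory; V maps state keys to value estimates."""
    for s, r, s_next in trajectory:
        td_target = r + c * V[s_next]
        V[s] += sigma * (td_target - V[s])
    return V

V = defaultdict(float)
# Hypothetical experience: an SAEV earns reward 5 moving from state "A" to "B"
V = td_update(V, [("A", 5.0, "B")])
print(V["A"])  # 0.1 * (5 + 0.9*0 - 0) = 0.5
```

The tabular dictionary is exactly what becomes infeasible at nationwide scale, motivating the function-approximation variant described next.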
Hence, a Back Propagation-Deep Neural Network (BP-DNN) based estimator, also called the deep Q-learning algorithm, is constructed to fit the value function V(s); its flowchart is shown in Table 4.
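The fitting loop of Table 4 can be sketched as follows. To keep the sketch self-contained, a linear approximator V(∅; θ) = θ·∅ stands in for the paper's BP-DNN estimator; the feature vectors and targets are hypothetical, but the gradient descent on the squared loss (y − V(∅, θ))² is the same mechanism.

```python
def fit_value_fn(data, dim, lr=0.01, epochs=200):
    """Fit V(phi; theta) = theta . phi by stochastic gradient descent on
    the squared TD loss (y - V(phi, theta))^2, mirroring the training step
    of Table 4 but with a linear model standing in for the BP-DNN.
    data: list of (phi, y) pairs, phi a feature list, y the TD target."""
    theta = [0.0] * dim
    for _ in range(epochs):
        for phi, y in data:
            v = sum(t * x for t, x in zip(theta, phi))
            grad = 2.0 * (v - y)  # derivative of (y - v)^2 w.r.t. v
            for k in range(dim):
                theta[k] -= lr * grad * phi[k]
    return theta

# Hypothetical 2-feature states whose targets are consistent with
# theta* = (1.0, -2.0); training should recover those weights.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -2.0), ([1.0, 1.0], -1.0)]
theta = fit_value_fn(data, dim=2)
```

A BP-DNN replaces the dot product with stacked nonlinear layers, but the offline training loop over episode trajectories keeps this shape.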
Combining the deep Q-learning algorithm with the KM algorithm transforms model 2 from a stochastic combinatorial optimization model into another typical integer linear programming problem. Specifically, the deep Q-learning algorithm is adopted offline to train on the historical experience trajectory data and update the value function representing the long-term return of executing each task. The KM algorithm is then conducted online to realize fast processing and achieve the final optimal matching instructions that guide SAEVs to execute different tasks. This combination saves plenty of online computational resources, ensuring convenience and efficiency in practical application.
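The offline/online split can be sketched as follows: the online step augments each instant reward with the offline-trained value estimate before running the assignment solver. All names and numbers here are hypothetical, and a brute-force matcher stands in for the O(n³) KM algorithm.

```python
from itertools import permutations

def dispatch(instant_reward, next_state, V, c=0.9):
    """Online step of the combined method: weight each (agent, task) pair
    by instant reward + c * V(resulting state), then solve the assignment.
    instant_reward[i][j] and next_state[i][j] are hypothetical inputs;
    V maps a state key to its offline-trained value estimate."""
    n = len(instant_reward)
    w = [[instant_reward[i][j] + c * V[next_state[i][j]] for j in range(n)]
         for i in range(n)]
    # Brute-force stand-in for the KM algorithm
    best = max(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    return list(best)

# Two agents, two tasks: task 1 has a lower instant reward but leads to a
# high-value hotspot state, so the global view flips the assignment.
V = {"hotspot": 10.0, "cold": 0.0}
instant = [[3.0, 1.0], [3.0, 1.0]]
nxt = [["cold", "hotspot"], ["cold", "cold"]]
print(dispatch(instant, nxt, V))  # → [1, 0]: agent 0 takes the hotspot task
```

With a purely local weight (c = 0) both agents would prefer task 0; the value term is what encodes the multistep return inside a single-step assignment problem.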

Case Study
To provide a more complete understanding of the proposed optimization models and solving algorithms, a toy example, a macrosimulation of 1 month, and a microsimulation of 6 hours are demonstrated separately. The toy example is designed on hypothetical data, while the macro- and microsimulation cases share the same real dataset. Before conducting task allocation, the preliminary generation of potential delivery, parking, recharging, and repositioning tasks waiting for matching follows the rules below: (1) delivery tasks to be completed are generated by following the trip order information; (2) it is assumed that each SAEV may receive a parking task; (3) all the charging piles in the hexagon are listed to form recharging tasks for SAEVs; and (4) "hotspot" regions with more trip demand are predetermined and form the potential repositioning tasks.

Dataset Introduction.
Macro- and microsimulation cases involve the real-time geographical locations of vehicles, the O-D information of real-time trip order requests, the static geographical locations of charging piles, and the static hexagonal partition information. In this paper, vehicle and trip order information comes from the open-source dataset of the Didi platform, which is a one-month (November 1 to November 30) ride-hailing operational dataset located in Chengdu, Sichuan, China. Charging pile information in Chengdu is obtained by crawling the website http://admin.bjev520.com/jsp/beiqi/pcmap/do/index.jsp of BAIC BJEV, the holding subsidiary of Beijing Automotive Group Co., Ltd. The field information is shown in Table 5.

Toy Example.
A toy example is designed to illustrate the capacity of optimization model 1 and the KM algorithm in this multiaction dispatching problem.
Experimental setup: based on the MDP framework, there are 4 agents and 4 tasks (1 delivery task, 1 parking task, 1 recharging task, and 1 repositioning task) waiting to be assigned at time slot t. Area D has been predetermined as a hotspot with low SAEV supply and high trip demand at time slot t. Set α = 1, β = 1, θ = 1. Assuming that 10 kWh can be supplemented through each recharging process, the match weighted values between the different agents and tasks follow the reward calculation principle in equations (1)-(19) and are shown in Table 6.

(3) Result of Toy Example. By executing the KM algorithm (Table 3), the best task allocation instructions based on model 1 are achieved as follows: agent 1 executes the parking task and drives straight, agent 2 executes the recharging task, agent 3 executes the pickup-and-delivery task, and agent 4 is repositioned to area D for replenishment. The total reward is 8, which is the maximum return among all possible task allocation options. The specific assigned tasks at time slot t are illustrated in Figure 3. In addition, to examine the computational capacity of the solving algorithm, Table 7 displays the elapsed time of the KM algorithm for different fleet sizes of SAEVs (macOS High Sierra 10.13.6, 2.3 GHz, Intel Core i5). For fast solving of multiaction allocation instructions second by second, the KM algorithm is competitive for SAEV fleets of fewer than 1000 vehicles.

However, the above result may be optimal for time slot t but not for the next several time slots. For instance, in Figure 4, if two new delivery tasks x and y arrive at time slot (t + 1), agent 4 should perhaps be assigned to task 2 at time slot t instead of being repositioned to the hotspot (task 4), while agent 1 ought to be relocated to the hotspot to execute repositioning task 4. Since at time slot t + 1 there will be no delivery task in area A but more delivery tasks will be generated in area C, agent 4 and agent 1 can then satisfy trip demands x and y at time slot t + 1, and repositioning agent 1 to area D is the better choice at time slot t. Hence, optimization model 1 with the KM algorithm is a static multiaction dispatching method from the local view, which may be short-sighted. To solve this problem, a dispatching simulator considering a global view is conducted in the next part.

Macrosimulation of One Month.
The dispatching simulator of 30 days is designed to test the capacity of optimization model 2 and the combinational algorithm of deep Q-learning and KM to improve the final dispatching return from a global view.

Table 3: Flowchart of the KM algorithm.
Step 1: Initialize set S. Set SAEVs as subset V_1; set the delivery, cruising, recharging, and repositioning tasks as subset V_2. For V_1, each vertex value is set as the maximum reward r starting from the corresponding SAEV; for V_2, each vertex value is set as 0.
Step 2: Find an alternating path, until no vertex can be added into the alternating path.
Step 3: Reset the vertex values of V_1 and V_2. Search for all the vertexes both in subset V_1 and in the alternating path; search for all the vertexes in subset V_2 not in the alternating path; form a series of vertex pairs (i, j); find the vertex pair whose d_ij is minimum; subtract d_ij from the value of each such vertex in V_1 and add d_ij to the value of each such vertex in V_2.
Step 4: Repeat Step 2 and Step 3 until the complete matching is found.
Step 5: Output the final matching and transfer it into the optimal task allocation instructions.
(1) Experimental Setup. This simulator is conducted based on the trip data in Chengdu, Sichuan, from November 1, 2016, to November 30, 2016, on the Didi Chuxing platform. The dataset mainly comprises the order data (for delivery task decisions), including order ID, starting billable time, ending billable time, and the longitudes and latitudes of the pickup and drop-off locations, and the predetermined hexagon areas (for repositioning task decisions), including hexagon ID and the longitudes and latitudes of the six vertexes. Also, for recharging task decisions, the GPS data of some charging stations/piles are obtained by crawling the website http://admin.bjev520.com/jsp/beiqi/pcmap/do/index.jsp of BAIC BJEV. Setting 1 minute as a time slot, this simulator conducts the task allocation process of 1 month, i.e., 30 days and 43200 minutes. At each time slot, 100 SAEVs and 100 tasks are randomly extracted from the dataset. First, set c = 0.9 and σ = 0.1, and train the value function V(s) of different states based on the Q-learning algorithm. Second, set α = 1, β = 1, θ = 1, assume that 10 kWh can be supplemented through each recharging process, form the match weighted values between different agents and tasks, and implement the simulation by following Figure 5.
Table 4: Flowchart of the deep Q-learning algorithm.
For each episode trajectory, up to the final episode trajectory K:
  Initialize state s_1 = x_1 and calculate the input sequence ∅_1 = ∅(s_1).
  Repeat for the time steps of the episode trajectory, from t = 1 to T:
    Achieve state s_{t+1} = x_{t+1} according to the state pair (s_t, s_{t+1}, r_t) of the episode trajectory, and calculate the input sequence ∅_{t+1} = ∅(s_{t+1});
    y_j = r_j if the current state ∅_{j+1} is the final state; y_j = r_j + c · V(∅_{j+1}, θ) otherwise;
    Update θ on the loss function (y_j − V(∅_j, θ))² based on the gradient descent method.
  Until the final time slot T.
Output the value function V(s, θ).

A final income (without any opportunity cost) of the whole month is counted to represent the economic benefit of this simulation experiment.
(2) Result of Dispatching Simulation. Compared with the local view (instant, single-step reward), the multiaction allocation simulation from the global view (accumulative, multistep reward) shows a large improvement in total reward. As illustrated in Figure 6, the final reward of the allocation instructions generated from optimization model 2 and the combinational algorithm of deep Q-learning and KM is twice that generated from optimization model 1 and the KM algorithm. Besides, Figure 7 reveals that the total order revenue of the allocation instructions generated from optimization model 2 and the combinational algorithm improves by 1.2% compared with that generated from optimization model 1 and the KM algorithm. The reason why the improvement in total reward is more significant than that in total revenue mainly lies in the long-term return: the reward in the global view is calculated by adding the long-term return to the reward in the local view, and the long-term return is represented by a BP-DNN estimator with a positive value. Another indication from Figures 6 and 7 is that, though the total reward in the global view improves significantly over the local view, the actual order revenue improves only slightly by adopting the global view.

Microsimulation of 6 Hours.
Some basic configuration, including the subregional division results, the initialization of the SAEV fleet and charging piles, and the trip order request distribution of the target subregions, is essential. Figure 8 shows the 19 hexagonal subregions of the inner ring of Chengdu, Sichuan, China, which are the target study areas for the dispatching process. Figure 9 reveals the static configuration information, which contains the number of charging piles and the number of SAEVs in each subregion at 8 a.m.
Since recharging task assignment is conducted separately within each subregion while repositioning task assignment is executed among several subregions, subregion 1 is selected as the target area to illustrate the dispatching performance, and subregions 1-7 are selected as the target repositioning areas in this research (Figure 10). The dispatching methodology put forward in this paper, comprehensively considering delivery, recharging, and repositioning tasks, can greatly increase the order fulfillment at each time step. As illustrated in Figure 11, adding repositioning task assignment keeps 100% order fulfillment for 18 of the 24 time steps during the 6 hours, and the lowest order fulfillment is 50%, which is higher than in the situation considering only delivery and recharging task assignment. These results fully illustrate the importance of introducing vehicle redistribution task assignment in the dispatching process of SAEVs.

"Model 2 + BP-DNN + KM" versus "Model 1 + KM".
As shown in Figure 12, the total order revenue of the allocation instructions generated from optimization model 2 and the combinational algorithm of deep Q-learning (BP-DNN based value function) and the KM algorithm improves by 2.78% compared with that generated from optimization model 1 and the KM algorithm. This further reveals the better performance of the combination of the BP-DNN based deep Q-learning algorithm and the KM algorithm, meaning that dispatching from the global view can bring economic income growth for SAEV fleet operators.

According to the training process of the loss function (y_j − V(∅_j, θ))² based on the gradient descent method in Table 4, 4080 completed historical order records are collected as a training set. The variation of the loss value is shown in Figure 13, converging to a steady loss of about 4651.6. There is still large room for optimizing the loss function in the future.
The results of model 2 from the global view, considering both the instant return and the long-term return, show a decrease in average user waiting time. As shown in Figure 14, the skewness of the frequency distribution of all completed trip orders becomes larger in the global view case than in the local view case. Specifically, the number of completed trip orders in the 0-200 s interval increases from 1600 to 2400, showing shorter user pickup durations and better user trip satisfaction. However, around 3 orders show user waiting times longer than 4000 s, which reveals that the global view case may cause some longer-pickup-duration orders, although the average user waiting time is still decreased.
Results show that the different reward functions of the local view and the global view do not affect the utilization rate of charging piles. As illustrated in Figure 15, the pile utilization rate during the 6 hours (24 time steps) remains almost the same in the two cases. Meanwhile, Figure 15 also reveals a common feature of the two cases: almost 10 time steps keep 100% pile utilization during the 6-hour dispatching process. This may be caused by the low supply of charging piles in the studied area, which also has implications for our future research on the location and layout of charging piles serving SAEVs.

Conclusions and Future Work
In this paper, the operational dispatching process of the L4/L5 level shared autonomous electric fleet, including delivery, parking, recharging, and repositioning, is put forward and modeled as a multiagent multiaction dynamic dispatching problem based on MDP. To achieve the optimal task allocation instructions for each SAEV, two multiaction dispatching optimization models based on combinatorial optimization, one from the local view (instant, single-step reward) and one from the global view (accumulative, multistep reward), are established. Correspondingly, two algorithms, involving the KM algorithm and the Back Propagation-Deep Neural Network algorithm, are designed to realize rapid and exact solutions. Based on the actual order and trajectory data from the Didi Chuxing platform, a toy example, a macrosimulation of 1 month, and a microsimulation of 6 hours are conducted to test the validity and effectiveness of the proposed methods. The results of the case study reveal their validity: MDP is an effective method for modeling the future operating process of SAEVs as a multiagent task allocation problem, and the combinatorial optimization method is feasible for solving the best multiaction allocation instructions.
First, the results prove that the KM algorithm can realize fast solving of the optimal assignment scheme for up to 1000 SAEVs in practical application scenarios. Second, establishing the optimization model from the global view, considering the accumulative, multistep reward, brings an obvious improvement in the total multiaction allocation return compared with the local view. Third, the Q-learning algorithm and the KM algorithm form a perfect combination of offline and online methods, which can be packaged and applied to future SAEV operation applications. Fourth, the deep Q-learning algorithm based on the Back Propagation-Deep Neural Network (BP-DNN) shows better performance than the Q-learning algorithm, although the goodness of fit of the BP-DNN estimator leaves room for improvement. Fifth, adopting model 2 from the global view brings not only a 2.78% increase in total order revenue but also a decrease in average user waiting time (i.e., an increase in user trip satisfaction).
For future work, we are committed to improving this research in three aspects. First, to explore better operational performance, the reward function in the MDP framework should be further refined by adding the influence of user satisfaction and carbon emissions. Second, a supply-demand forecasting model should be established to predetermine the hotspots with low vehicle supply and high trip demand; hence, a more accurate trip demand forecasting model should be studied in the future. Third, the deep Q-learning algorithm is verified to be effective in both increasing total order revenue and decreasing user waiting time; however, the loss value of the BP-DNN estimator remains high, and further exploration will focus on the search for better deep Q-learning algorithms to improve the goodness of fit [55-60].

Data Availability
In this paper, vehicle and trip order information comes from the open-source dataset of the Didi platform, which is a one-month (November 1 to November 30) ride-hailing operational dataset located in Chengdu, Sichuan, China. Charging pile information in Chengdu is obtained by crawling the website http://admin.bjev520.com/jsp/beiqi/pcmap/do/index.jsp of BAIC BJEV, the holding subsidiary of Beijing Automotive Group Co., Ltd.

Conflicts of Interest
The authors declare that they do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Authors' Contributions
Ning Wang conceived and designed the study; Jiahui Guo performed data collection; Jiahui Guo performed analysis and interpretation of results; and Jiahui Guo performed draft manuscript preparation. All the authors reviewed the results and approved the final version of the manuscript.