Time-Driven Scheduling Based on Reinforcement Learning for Reasoning Tasks in Vehicle Edge Computing

Significant challenges remain for scheduling reasoning tasks, including selecting an optimal tasks-servers solution from the numerous possible combinations, due to the heterogeneous resources in edge environments and the complicated data dependencies within reasoning tasks. In this study, a time-driven scheduling strategy based on reinforcement learning (RL) is designed for reasoning tasks in vehicle edge computing. Firstly, the reasoning process of vehicle applications is abstracted as a model based on directed acyclic graphs. Secondly, the execution order of subtasks is defined according to a priority evaluation method. Finally, the optimal tasks-servers scheduling solution is chosen by Deep Q-learning (DQN). Extensive simulation experiments show that the proposed scheduling strategy can effectively reduce the completion delay of reasoning tasks, and that it outperforms classic algorithms in both convergence and runtime.


Introduction
In recent years, the Internet of Vehicles (IoV) has become a research hotspot for the Intelligent Transportation System (ITS) [1]. The autonomous driving enabled by IoV not only improves driving safety but also mitigates traffic inefficiency and lane congestion. It is challenging for autonomous driving to complete the target application under strict time constraints and restricted computing resources. Current work on autonomous driving mostly focuses on designing specific functions, such as traffic recognition, as reasoning tasks [2,3]. Less attention is paid to how to schedule these reasoning tasks to appropriate computing nodes with low latency. Fortunately, IoV in Mobile Edge Computing (MEC) can schedule real-time tasks from vehicles to Road Side Units (RSUs) with powerful computing resources, alleviating the task execution delay. Moreover, a reasonable scheduling of reasoning tasks in MEC can effectively reduce both the execution latency of tasks and the workload of vehicles [4][5][6][7][8][9][10][11]. However, due to the heterogeneous resources in edge environments and the complicated data dependencies within reasoning tasks, significant challenges for reasoning task scheduling remain, including selecting an optimal tasks-servers solution from the numerous possible combinations.
Existing studies mainly address task scheduling and task coordination through heuristic algorithms [12][13][14][15], such as Particle Swarm Optimization (PSO), Colony Algorithm (CA), and Genetic Algorithm (GA). Although these works can obtain feasible solutions while satisfying different constraints, they cannot predict the deviation between the feasible and optimal solutions in advance, so their solutions easily fall into local optima. Several studies have been devoted to task scheduling using reinforcement learning (RL) algorithms [16][17][18][19][20][21][22][23][24][25][26], which can not only correct the deviation between the feasible and optimal solutions but also accelerate convergence toward good results. Specifically, Lin et al. [23] proposed a time-driven scheduling strategy based on the Q-learning algorithm for reasoning tasks of autonomous driving in IoV. Their experimental results demonstrated that RL algorithms based on simulated annealing outperformed other classic algorithms. This work is instructive for ours. Zhao et al. [20] put forward a distribution scheduling algorithm based on DQN to achieve the best balance among latency, computational rate, and energy consumption for an edge access network of IoV. They prioritized the tasks of different vehicles according to the analytic hierarchy process (AHP). Their experimental results showed that the proposed method could effectively improve the task offloading efficiency in terms of average task processing delay. However, the priorities between tasks were not scientifically calculated and weighted, but only evaluated by experts based on their experience. Current work [27,28] on priority evaluation is mostly subjective, relying on expert judgment.
There have been notable achievements in multivehicle collaborative task scheduling [5,7,11,20]. However, time-driven scheduling for single-vehicle reasoning tasks with data dependencies is still an open issue.
In response to this issue, two research questions are considered: (1) how can reasoning tasks with data dependencies be modeled so as to evaluate the latency caused by task execution and data transmission? (2) How can an efficient and reliable scheduling strategy be developed to reduce latency during vehicle driving? To answer these questions, we design a time-driven scheduling strategy based on RL for reasoning tasks in vehicle edge computing, which accounts for the differences between heterogeneous real-time reasoning tasks and optimizes their completion latency. The main contributions of this paper are as follows:
(1) A latency model is designed for reasoning tasks with data dependencies, which captures the latency caused by both task execution and data transmission.
(2) The scheduling of reasoning tasks in MEC is formulated as a Markov Decision Process (MDP), which models the scheduling plan of a reasoning task as the state, the resource allocation decision for each subtask as the action, and the completion latency of the reasoning task as the reward.
(3) A time-driven scheduling strategy based on DQN is designed to explore an optimal tasks-servers solution among the numerous possible combinations in vehicle edge computing.
The remainder of the paper proceeds as follows. Section 2 reviews the related work. Section 3 introduces the problem definition of reasoning task scheduling. Section 4 describes the proposed scheduling strategy in detail. Section 5 conducts comparative experiments and analyzes the performance of the proposed strategy. Finally, Section 6 summarizes this work and looks forward to future research directions.

Methods Based on Heuristic Algorithms.
Xie et al. [12] proposed a novel Directional and Non-local-Convergent Particle Swarm Optimization (DNCPSO) to address workflow scheduling in cloud-edge environments, which could dramatically reduce the makespan and cost and works well for task scheduling in complex applications. Wu et al. [13] studied how to effectively and dynamically partition a given application into local and remote parts while reducing the total cost in a cloud-edge environment. They proposed a Min-Cost Offloading Partitioning (MCOP) algorithm, which could significantly reduce the execution time and energy consumption by optimally distributing tasks between mobile devices and servers. Lin et al. [15] proposed a linear-time rescheduling algorithm for task migration in an MCC environment. The algorithm starts from a minimal-delay scheduling solution and subsequently reduces energy consumption by migrating tasks among the local cores and the cloud.
Methods based on heuristic algorithms easily fall into local optima and thus often fail to produce good results. Moreover, the time budget for processing reasoning tasks in IoV is usually strict, and heuristic methods are ill-suited to such problems because of their long execution times.

Methods Based on Reinforcement Learning.
To adapt scheduling strategies to dynamic scenarios, Deep Reinforcement Learning (DRL) has been widely applied to task scheduling problems in MEC systems in recent years.
Chen et al. [16] designed a double DQN-based computation scheduling policy for a virtual MEC system. Numerical experiments showed that their policy achieves a significant improvement in computation scheduling performance. Xiong et al. [17] proposed an improved DQN algorithm to minimize the long-term weighted sum of the average job completion time and the average number of requested resources in an IoT edge computing system. Simulation results showed that the proposed algorithm performs better than the original DQN algorithm. Wang et al. [18] proposed a new DRL-based scheduling framework to address the challenges of task dependency and adaptation to dynamic scenarios in MEC systems. Their DRL solution can automatically discover the common patterns behind various applications and thereby infer an optimal scheduling policy in different scenarios. Rjoub et al. [21,26] proposed four deep and RL-based scheduling approaches to automate the scheduling of large-scale workloads onto cloud computing resources while reducing both resource consumption and task waiting time.
These approaches derive appropriate task scheduling mechanisms that minimize both task execution delay and cloud resource utilization. Qi et al. [22] first proposed a multitask DRL approach for scalable parallel task scheduling (MDTS) in IoV. To avoid the curse of dimensionality when coping with complex parallel computing environments and jobs with diverse properties, they extended the action selection in DRL to a multitask decision, where the output branches of multitask learning are matched to the parallel scheduling tasks. Huang et al. [24] proposed a DRL-based Online Offloading (DROO) framework to optimally adapt task scheduling decisions and wireless resource allocations to the time-varying wireless channel conditions in a wireless-powered MEC network. Numerical results showed that the framework achieves near-optimal performance while significantly decreasing the computation time.
RL-based methods generally cast the scheduling problem as a learning task. Through preliminary training, a well-designed RL algorithm can quickly form an effective scheduling policy for the task. Note that current work on IoV mostly focuses on multivehicle collaborative scheduling; time-driven scheduling for single-vehicle reasoning tasks with data dependencies is still an open issue.

Table 1 shows the notations used in this paper. Figure 1 gives an example of reasoning task scheduling in vehicle edge computing. This example considers an autonomous driving reasoning system [2,3], which consists of applications such as an emergency rule inference engine and security operations. The user equipment (UE) makes scheduling decisions for these applications according to the status of the edge environment and the application profiles; thus some of them are executed locally on the vehicle (i.e., the UE) while others are scheduled to the edge over wireless channels. In this work, we consider an edge environment composed of m RSUs providing computing, communication, and storage resources to the UE in each time-slot, expressed as F = {f_j | 1 ≤ j ≤ m}. The computation capacities of the vehicle and an RSU are denoted as f_vehicle and f_RSU, respectively.
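The environment just described can be represented in code. The following is a minimal sketch, not the paper's implementation; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class EdgeNode:
    """An RSU offering compute to the vehicle in a time-slot (illustrative)."""
    node_id: int
    cpu_cycles_per_s: float  # f_RSU: computation capacity of this RSU

@dataclass
class EdgeEnvironment:
    """State visible to the UE when it makes a scheduling decision."""
    f_vehicle: float                           # local computation capacity
    rsus: list = field(default_factory=list)   # F = {f_j | 1 <= j <= m}

    @property
    def m(self) -> int:
        """Number of RSUs available in the current time-slot."""
        return len(self.rsus)

# A time-slot with three RSUs, each faster than the vehicle itself.
env = EdgeEnvironment(f_vehicle=1e9, rsus=[EdgeNode(j, 5e9) for j in range(1, 4)])
```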

Problem Definition
In time-slot i, a reasoning task can be expressed as a directed acyclic graph (DAG) G = ⟨N, E⟩ as in Figure 2, where N = {n_i | 1 ≤ i ≤ z} is a set of z subtasks and E = {e_{u,k} | 1 ≤ u, k ≤ z, u ≠ k} is the set of data dependencies between subtasks. A data dependency e_{u,k} indicates that there is a directed arc between subtasks n_u and n_k, and that task n_k cannot start until task n_u has completed. The set of direct precursors of task n_k can be expressed as R_k = {n_u | 1 ≤ u ≤ z, u ≠ k, e_{u,k} ∈ E}. A task cannot be executed until all of its direct precursors have completed.
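The DAG model and the direct-precursor rule can be sketched as follows; the function names are illustrative, and subtasks are identified by integer indices for brevity.

```python
# Minimal sketch of the DAG task model G = <N, E>: an edge (u, k) means
# subtask n_k cannot start until subtask n_u has completed.
from collections import defaultdict

def direct_precursors(edges):
    """Map each subtask k to its set R_k of direct precursors."""
    pred = defaultdict(set)
    for u, k in edges:
        pred[k].add(u)
    return pred

def ready_tasks(edges, z, completed):
    """Subtasks that may execute now: all direct precursors are completed."""
    pred = direct_precursors(edges)
    return [k for k in range(1, z + 1)
            if k not in completed and pred[k] <= completed]

# A small example DAG with z = 4 subtasks (diamond shape).
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
```

With nothing completed, only the source task 1 is ready; once 1, 2, and 3 have finished, task 4 becomes ready.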
In vehicle edge computing, a subtask of the reasoning task can be either offloaded to the edge or executed locally on the vehicle. If a subtask is offloaded, the processing delay depends on the subtask profile and the environment state. The subtask profile includes the CPU cycles required to run the subtask, c_i; the data size of the subtask, data_i; and the tolerable delay of the subtask, t_i^d. The environment state includes the transmission rate of the wireless channel, v_i^transmission. Therefore, the transmission latency t_i^transmission for subtasks executing on edge nodes can be calculated as in (1) and (2). If subtasks are executed locally on the vehicle, there is only the execution latency on the user equipment.

The scheduling plan for a reasoning task is denoted by the distribution relationship matrix in (3). If a_{x,y} = 1, subtask n_x is offloaded to edge node m_y; otherwise, subtask n_x is executed locally. When the edge nodes run normally, the execution latency of the reasoning task can be expressed by (4), where t_process is the processing latency of the reasoning task. If no edge node is available in the edge environment, all subtasks are executed serially on the vehicle, and m_i is set to 0. In this worst-case scheduling, the completion latency of the reasoning task is described by equation (5). To make better use of computing resources in different edge environments, we assume that edge nodes satisfy the following processing principles: (1) a subtask is processed by only one edge node, as formally defined in (6); (2) after all subtasks have been assigned to their corresponding edge nodes, the edge nodes begin to process them; (3) subtasks on different edge nodes without data dependencies can be processed in parallel; (4) subtasks on the same edge node are processed according to their data dependencies.
Otherwise, they are processed according to their corresponding priorities. (5) The tolerable delay of a subtask on an edge node is not greater than the execution latency of the corresponding edge node, as formally defined in (7). The reasoning task scheduling discussed in this paper can be summarized as follows: in each time-slot, a reasoning task on a vehicle is decomposed into several subtasks, and these subtasks are scheduled to edge nodes for processing by a specific scheduling algorithm. The scheduling algorithm proposed in this paper aims to minimize the execution latency of reasoning tasks over all feasible scheduling plans.
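Equations (1)–(5) are not reproduced in this excerpt, but a common form of the per-subtask latency model, used here as an assumption for illustration, divides the offloading cost into a transmission term and a remote execution term:

```python
def offload_latency(data_bits, cpu_cycles, v_transmission, f_rsu):
    """Latency of an offloaded subtask: transmission over the wireless
    channel plus remote execution on the RSU (an assumed form; the
    paper's exact equations (1)-(2) are not reproduced here)."""
    return data_bits / v_transmission + cpu_cycles / f_rsu

def local_latency(cpu_cycles, f_vehicle):
    """Execution latency when the subtask runs locally on the vehicle."""
    return cpu_cycles / f_vehicle

# Example: 1 Mbit over a 10 Mbit/s channel, 2e8 cycles at 4 GHz remotely
# versus 1 GHz locally. Offloading trades transmission time for compute.
t_off = offload_latency(1e6, 2e8, 1e7, 4e9)
t_loc = local_latency(2e8, 1e9)
```

Under these illustrative numbers offloading wins (0.15 s versus 0.2 s), which is exactly the trade-off the distribution matrix a_{x,y} encodes per subtask.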

Algorithm Design
In this section, we first describe the priority evaluation for subtasks in a reasoning task, which determines the execution order of subtasks without data dependencies. We then give an overview of the proposed scheduling algorithm. Finally, we describe the implementation of the scheduling algorithm in detail.

Priority Evaluation for Each Subtask.
It is difficult to estimate the execution time of a reasoning task, which depends on the execution sequence of its subtasks. The fuzzy analytic hierarchy process (FAHP) [27][28][29] is often employed to analyze multiobjective problems; it decomposes a problem hierarchically according to its features and overall goal, forming a bottom-up hierarchy. In this work, FAHP is used to measure the subtask weights, which determine the execution order of subtasks without data dependencies. Each subtask weight is then corrected by calculating the information entropy of objective factors (i.e., each subtask's own parameters) [30,31]. The pseudocode of the priority evaluation for each subtask is given in Algorithm 1, where s_i and s_j represent the relative importance of subtask factors, and α and δ are the number of factors and their information entropy, respectively.

Scheduling Algorithm.
The MDP is the basic model of RL used in this paper. The scheduling algorithm can be simplified according to the Markov property, i.e., the next state is related only to the current state, as shown in Figure 3. In Figure 3, each state represents an allocation strategy for real-time vehicle tasks in a given edge environment and corresponds to a specific reward. Each action is computed by the agent (a neural network) and guides the current state in a better direction. The action value is updated iteratively as

Q_{k+1}(s_t, a_t, θ_t) = Q_k(s_t, a_t, θ_t) + α_k · [R_k + γ · max_{a′} Q_k(s′, a′, θ_t) − Q_k(s_t, a_t, θ_t)].

The model characteristics of the problem discussed in this paper are as follows:
(1) State space: the number of feasible-solution states is not constant; it changes dynamically with the number of subtasks after decomposition and with the distribution of edge nodes in different time-slots.
(2) Action space: the number of optional actions equals the number of subtasks. Selecting an action schedules the corresponding subtask in the current state to a specific edge node.
(3) Reward value: this work aims to minimize the completion latency of the reasoning task, so the reward is set to r_t = 1/t_all.
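The update rule above can be sketched in tabular form; DQN replaces the table with a neural network fitted to the same target, but the temporal-difference step is identical. This is an illustrative sketch with assumed state and action labels, not the paper's implementation.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.005, gamma=0.9):
    """One temporal-difference update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    DQN fits a neural network to the same bootstrapped target."""
    best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)  # all action values start at zero
# Reward is the reciprocal of the completion latency, r_t = 1 / t_all;
# here the task finished in 0.5 s, so r = 2.0.
v = q_update(Q, s="state0", a="offload_to_rsu1", r=1 / 0.5, s_next="state1",
             actions=["offload_to_rsu1", "run_locally"])
```

With all values initialized to zero and α = 0.005, the first update moves Q(state0, offload_to_rsu1) to α · r = 0.01.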
Input: computational complexity a_1, the amount of data a_2, the tolerable delay a_3
Output: the priority of subtask z
(1) Sort the subtask's factors according to equation (9) and construct matrix P
(2) for i ← 0 to maximum rows of P do
(3)   r_i ← 0
(4)   for j ← 0 to maximum columns of P do
(5)     r_i ← r_i + p_{i,j}
(6)   end for
(7) end for
(8) Transform r_i through equation (12) to obtain R
(9) for i ← 0 to maximum rows of R do
(10)   w_i ← 0
(11)   for j ← 0 to maximum columns of R do
(12)     update w_i via equation (13)
(13)   end for
(14) end for
(15) Calculate the information entropy δ_i via equations (14) and (15)
(16) Obtain w_i′ via equation (16)
(17) z ← w_1′ · a_1 + w_2′ · a_2 + w_3′ · a_3

ALGORITHM 1: Priority evaluation for each subtask.

The scheduling strategy is based on the DQN algorithm. The scheduling problem can be abstracted as a function-fitting problem when the dimensions of the discrete state and action spaces are high. The pseudocode of our scheduling algorithm is described in Algorithm 2, where α_k and γ represent the learning rate and discount factor, respectively; s′ is the state after executing action a_t in the k-th iteration; a′ is the action with the largest reward in state s′; and R_k is the accumulated reward during the iterations.
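Since equations (9)–(16) are not reproduced in this excerpt, the idea behind Algorithm 1 can be illustrated with a standard entropy-weighting scheme, stated here as an assumption rather than the paper's exact formulas: factors whose values vary more across subtasks (lower entropy after normalization) receive larger weights, and each subtask's priority is the weighted sum of its three factors.

```python
import math

def entropy_weighted_priority(factors):
    """Sketch of Algorithm 1's idea: weight the three subtask factors
    (computational complexity a1, data amount a2, tolerable delay a3)
    by information entropy, then score each subtask. Standard entropy
    weighting is used in place of the paper's equations (9)-(16)."""
    n = len(factors)                       # number of subtasks
    cols = list(zip(*factors))             # one column per factor
    # Normalize each factor column to proportions.
    props = [[v / sum(col) for v in col] for col in cols]
    # Information entropy per factor; low entropy -> high discrimination.
    delta = [-sum(p * math.log(p) for p in col if p > 0) / math.log(n)
             for col in props]
    w = [1 - d for d in delta]
    total = sum(w)
    w = [x / total for x in w]             # normalized factor weights w_i'
    # Priority z of each subtask: weighted sum of its own factors.
    return [sum(wi * f for wi, f in zip(w, row)) for row in factors]

priorities = entropy_weighted_priority([[2.0, 1.0, 0.5],
                                        [4.0, 3.0, 0.2],
                                        [1.0, 2.0, 0.9]])
```

Among subtasks without data dependencies, the one with the largest priority value would be executed first.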

Algorithm Implementation.
In different time-slots, reasoning tasks and edge environments change dynamically. These changes are summarized as follows: (1) the topological structure of reasoning tasks and the number of nodes in the edge environment; (2) the computational complexity, the datasets exchanged between subtasks, and the tolerable delay of subtasks; (3) the transmission latency and execution latency of subtasks. The algorithm implementation calculates the completion latency of reasoning tasks in edge environments. The pseudocode of the implementation is described in Algorithm 3. Figure 4 presents the calculation process of the execution latency, which includes the following steps.
Step 1: initialize the parameters of Algorithm 3, including the subtask queue Q and the set of predecessor nodes R. Then, the reasoning task is expressed as a specific directed acyclic graph.
Step 2: Q is used to sort the subtasks according to the topology of the reasoning task.
Step 3: calculate the task execution time according to the scheduling strategy derived from Algorithm 2.
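The steps above can be sketched as a layer-by-layer topological traversal. This is a simplified illustration, not Algorithm 3 itself: it assumes per-subtask latencies are already known and that enough edge nodes exist for every subtask in a layer to run in parallel, so each layer costs its maximum subtask latency.

```python
from collections import deque, defaultdict

def completion_latency(z, edges, latency):
    """Traverse subtasks 1..z in topological layers (as Q does in
    Algorithm 3). Subtasks in the same layer have no mutual dependencies
    and run in parallel; layers run one after another."""
    indeg = {i: 0 for i in range(1, z + 1)}
    succ = defaultdict(list)
    for u, k in edges:
        succ[u].append(k)
        indeg[k] += 1
    queue = deque(i for i in indeg if indeg[i] == 0)  # entry subtasks
    total = 0.0
    while queue:
        layer = list(queue)
        queue.clear()
        total += max(latency[i] for i in layer)  # parallel within a layer
        for i in layer:                          # release successors
            for j in succ[i]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return total

# Diamond DAG: layers {1}, {2, 3}, {4} -> 0.2 + max(0.1, 0.3) + 0.1.
lat = {1: 0.2, 2: 0.1, 3: 0.3, 4: 0.1}
t_all = completion_latency(4, [(1, 2), (1, 3), (2, 4), (3, 4)], lat)
```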

Experimental Parameter Settings.
The simulation experiments are implemented in Python 3.7 and conducted on a 64-bit Windows 10 system configured with an Intel(R) Core(TM) i7-7700HQ CPU and 16 GB RAM. Our proposed scheduling algorithm is based on DQN; the Q-learning algorithm [23] and GA-PSO [32] are introduced as comparison algorithms. Based on parameter tuning over many experiments, the parameters of DQN and Q-learning [23] are set as α = 0.005, γ = 0.9, and ε = 0.9. The parameters of GA-PSO [32] are set as w_max = 0.9, w_min = 0.4, C_1^start = 0.9, C_1^end = 0.2, C_2^start = 0.9, and C_2^end = 0.4. In addition, the number of rounds is set to 100 and the number of iterations per round to 1000 for DQN, Q-learning, and GA-PSO.
All the algorithms try to find the optimal scheduling result with the shortest completion latency of reasoning tasks in edge environments.
UEs have different reasoning tasks with various topologies and numbers of subtasks; the topological structures of the reasoning tasks are shown in Figure 5. The related parameters of the vehicle edge computing environment are set according to IEEE 802.11p [33], and other parameters are set as in Table 2. Table 3 shows the completion latency of different reasoning tasks in various edge environments under our proposed scheduling algorithm, where m and n denote the number of edge nodes and subtasks in each experiment. Note that n = 6 corresponds to "Topology I," n = 9 to "Topology II," and n = 12 to "Topology III" in Figure 5. Each grid cell in Table 3 corresponds to an experiment with a reasoning task of a specific topology and a specific number of edge nodes. In addition, the execution order of subtasks is based on two rules: the traditional rule and the priority rule. The traditional rule executes the subtasks according to their corresponding topology depths [23], while the priority rule executes them according to the priority evaluation discussed in Section 4.1.

Input: initial state, maximum number of rounds, maximum number of iterations in a single round
Output: the scheduling strategy for reasoning tasks
(1) Initialize the experience pool of constant storage space, the action-value function Q_θ(s_t, a_t) with random weights θ, and the corresponding target network Q_θ⁻(s_t, a_t)
(2) for i ← 0 to maximum number of rounds do
(3)   s_t ← initial state
(4)   for j ← 0 to maximum number of iterations in a single round do
(5)     With probability ε choose the action a_t = argmax_a Q(s_t, a, θ) with the largest historical reward; otherwise choose a random action
(6)     Execute action a_t to obtain the next state s_{t+1} and use Algorithm 3 to calculate the reward r_t
(7)     Store (s_t, s_{t+1}, r_t, a_t) in the experience pool
(8)     s_t ← s_{t+1}
(9)     Randomly sample (s_j, s_{j+1}, r_j, a_j) from the experience pool
(10)    Construct the error function according to equation (17) and back-propagate to update the parameters θ
(11)    Update the target network Q_θ⁻(s_t, a_t) = Q_θ(s_t, a_t) every few steps
(12)    If s_{t+1} satisfies the termination state, end the current iteration
(13)  end for
(14) end for

ALGORITHM 2: Scheduling algorithm.

(1) Initialization: set the array I, the subtask queue Q, and the set of predecessor nodes R to ∅
(2) Use the constraint relationship G_i to set the array I(i)
(3) Enqueue the i-th subtask with I(i) = 0 to Q; set the number of traversed subtasks u = 0 and the number of subtasks in the current layer k to the current queue size
(4) while Q ≠ ∅ do
(5)   if u = k then
(6)     u = 0, k = size(Q)
(7)   end if
(8)   Dequeue a subtask, expressed as v; u += 1
(9)   for i ← 0 to z_i do
(10)    if there exists a directed edge from v to i then
(11)      Add the v-th subtask and its predecessor node set R(v) to R(i); I(i) −= 1
(12)      if I(i) = 0 then
(13)        enqueue the i-th subtask to Q
(14)      end if
(15)    end if
(16)  end for
(17) end while
(18) According to B_{m_i × z_i}, assign the subtasks to edge nodes
(19) Initialization: set the subtask completion list O to ∅, set the remaining execution latency of subtasks Y by C_{m_i × z_i}, and set the current running time h = 0
(20) while |O| < z_i do
(21)   Determine the subtask to be assigned to each edge node, whose direct predecessor set is a subset of O
(22)   Find the minimum execution latency w among the subtasks currently executed in parallel

ALGORITHM 3: Algorithm implementation.

Analysis of Results.
From Table 3, we find that the completion latency of reasoning tasks decreases as the number of edge nodes increases. Under the same circumstances, the priority rule for subtask execution effectively reduces the completion latency of reasoning tasks compared with the traditional rule, and this gap widens as the topological complexity of the reasoning tasks increases. This is because the maximum number of subtasks running in parallel at the same time is limited by the number of edge nodes, and the priority rule raises the upper limit on the number of parallel subtasks. Figure 6 shows the average completion latency of the different scheduling algorithms (i.e., GA-PSO, Q-learning, and DQN) with different reasoning tasks in various edge environments, where m denotes the number of edge nodes in each experiment. In each subgraph, we record the completion latency of reasoning tasks with different topologies over 100 rounds and display the average completion latency for every 10 rounds. From Figure 6(a), we find that GA-PSO has difficulty converging, although it can obtain a feasible solution with a short completion latency. In contrast, DQN not only obtains a feasible solution with a shorter completion latency but also converges well. From Figures 6(b) and 6(c), the performance of Q-learning is similar to that of DQN when the numbers of subtasks and edge nodes are both small. However, the convergence of Q-learning degrades as the topological complexity of the reasoning tasks increases. The main reason for the different scheduling results of the various algorithms is that the increase in the number of subtasks multiplies the number of solutions in the search space. GA-PSO relies mainly on randomness and its fitness function to find the optimal solution; therefore, when the number of feasible solutions in the search space is huge, GA-PSO easily falls into a local optimum.
Q-learning also has difficulty building the Q table and converging because of the huge number of feasible solutions in the search space. DQN, however, replaces the Q table with a Q-value function approximated by a neural network, which can handle a huge number of states (i.e., feasible solutions) and converges more easily. Table 4 shows the average runtime (s) of the different algorithms with different reasoning tasks in various edge environments; each cell in Table 4 is the average runtime over 100 rounds. From Table 4, we find that the runtime of GA-PSO is relatively stable across reasoning tasks and edge environments, because it mainly depends on the preset number of iterations. In contrast, the runtimes of the RL algorithms decrease as the number of feasible solutions already learned increases. In addition, the neural network architecture used in DQN is more suitable for reasoning task scheduling in vehicle edge computing than the Q table used in Q-learning.

Conclusions
This paper proposes a scheduling strategy based on DQN for reasoning tasks in vehicle edge computing, which aims to reduce their completion latency. Extensive simulation experiments show that the proposed strategy achieves superior performance compared with other classic methods. When the structure of reasoning tasks is simple, our strategy and the classic methods all perform well, although GA-PSO converges poorly. In particular, when the structure of reasoning tasks is complex, our strategy achieves better performance and convergence than all the other classic methods.
In the future, we will improve the scheduling algorithm by optimizing the training efficiency of the neural network to accommodate wireless channel fluctuations and radio interference in vehicle edge computing. In addition, we will further consider a multivehicle collaborative scheduling strategy to alleviate the uneven resource allocation among multivehicle tasks in edge environments.

Data Availability
The data used to support the findings of this study are included within the article.