OL-DEC-MDP Model for Multiagent Online Scheduling with a Time-Dependent Probability of Success

Focusing on the on-line multiagent scheduling problem, this paper considers the time-dependent probability of success and processing duration and proposes an OL-DEC-MDP (opportunity loss-decentralized Markov Decision Processes) model that includes opportunity loss in the scheduling decision to improve overall performance. The success probability of job processing, as well as the processing duration, depends on the time at which processing starts. The probability that an agent completes its assigned job is higher when processing starts earlier, but the opportunity loss can also be high because of the longer engagement. The OL-DEC-MDP model therefore introduces a reward function that accounts for the opportunity loss, which is estimated by predicting upcoming jobs with a sampling method on the job arrivals. Heuristic strategies are introduced to compute the best starting time of each agent for an incoming job, and an incoming job is always scheduled to the agent with the highest reward among all agents under their best starting policies. Simulation experiments show that the OL-DEC-MDP model improves the overall scheduling performance compared with models that do not consider opportunity loss in heavily loaded environments.


Introduction
Problems involving time-dependent success probability arise widely in manufacturing, industrial, and military domains. One example is the scheduling of a procrastinator [1], whose speed and success probability of job processing increase as the due date approaches. Practice shows that a procrastinator's performance varies under different time pressures when processing the same job. As higher time pressure is more likely to force a procrastinator to make mistakes when processing a sophisticated job, the success probability consequently depends on the starting time. Another example is antiship missile defense by SAM (surface-to-air missile) systems, as shown in [2]. SAM systems are scheduled to intercept incoming antiship missiles within a feasible interception time window. The killing probability of an interception is associated with the range at which the interception missile and the antiship missile meet, which in turn depends on the launching time of the interception. Usually, an earlier firing time means more flight time before engagement.
Both of the above examples imply that, although an early starting strategy guarantees the maximal time window for processing, it comes at the price of a longer processing duration. Compared with classic on-line scheduling [3,4], the agent must weigh extra trade-offs between the current job and possible incoming jobs. For example, Figure 1 gives the killing probability associated with the time at which the engagement occurs for a SAM system of a Halifax ship against an incoming antiship missile [5]. It can be inferred that an antiship missile can be intercepted only within a feasible time window. Hence the SAM system can choose the engaging time that yields the highest killing probability. If the interception fails, the SAM system has the remainder of the window in which to take an immediate remedial interception. However, if the SAM system fires earlier so that the engagement occurs earlier, the killing probability is lower, but a longer time window is left in case the interception fails. Therefore, the SAM system needs to make a trade-off between a high killing probability for the current interception and more feasible time left for remedial action in case the interception fails. In addition, for an antiship missile and a SAM system, an earlier firing time always means more flight time before the engagement. Hence adopting the early firing strategy would cause the SAM system to spend a longer duration on the current interception, while losing more opportunities to intercept possible upcoming missiles. This is the second trade-off to be considered.
Similar trade-offs also exist in procrastinator on-line scheduling [1]. To the best of our knowledge, multiagent on-line scheduling with the trade-offs discussed above has not been studied in previous research such as time-dependent scheduling [6][7][8][9], on-line stochastic optimization [3,4,10,11], and stochastic resource allocation in a multiagent environment [12][13][14][15]. Therefore, in this paper, we consider the above trade-offs in a multiagent scheduling process. There are several independent agents that can be scheduled to process stochastically arriving jobs. Each job has a specific feasible time window, during which an agent can process it with a time-dependent success probability. In case of failure, an agent will immediately make another try as long as the remaining time window allows. The objective is to complete all jobs with high probability. A general problem definition is introduced in Section 2, and Section 3 surveys closely related studies. Section 4 builds a DEC-MDP (decentralized Markov Decision Processes) to model the on-line multiagent scheduling process without considering the opportunity loss. An OL-DEC-MDP (opportunity loss-decentralized Markov Decision Processes) model is proposed in Section 5 to include the opportunity loss in the scheduling decision, with proofs of its properties. Section 6 presents the simulation evaluation of the OL-DEC-MDP, and Section 7 contains the conclusions and future work.

Problem Definitions
There is a group of agents, denoted as $A$, that should be scheduled to process a set of stochastically arriving jobs, denoted as $M$. For each job $m \in M$, there is a time interval $T_m = [\underline{t}_m, \overline{t}_m]$ during which processing of job $m$ is feasible; that is, $\underline{t}_m$ and $\overline{t}_m$ are the lower and upper bounds of the feasible time window of job $m$. For each agent $a \in A$, there is a duration $d_{a,m}(t)$ that must be spent to process job $m$ once, starting from time $t$. The outcome of the process at the end of $d_{a,m}(t)$ is either success or failure, and the probability of success is denoted as $p_{a,m}(t)$.
Assumption 1. One agent can only be scheduled to process one job at a time.
Assumption 2. If the agent fails to complete the job by the end of the process, it will immediately start another try, as long as the feasible processing time window of the job will not elapse before the next try can be finished.
Assumption 3. The agent will be released from the current job and become available for the next job if either the current job is completed successfully or the current job is discarded because the remaining time window is insufficient for another try.
According to the above assumptions, an agent $a$ will have several opportunities to complete a job $m$ before the feasible time window $[\underline{t}_m, \overline{t}_m]$ of job $m$ elapses, depending on the process duration $d_{a,m}(t)$ of each try. For example, if a process starting at time $t_0$ fails by time $t_0 + d_{a,m}(t_0)$, agent $a$ must try to reprocess the job immediately at time $t_0 + d_{a,m}(t_0)$, as long as $t_0 + d_{a,m}(t_0) + d_{a,m}(t_0 + d_{a,m}(t_0)) \le \overline{t}_m$.
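The retry rule of Assumptions 2 and 3 can be illustrated with a minimal sketch. The function name `retry_start_times`, the helper `example_duration`, and all numeric values are hypothetical and chosen only for illustration; the sketch assumes the duration function is strictly positive.

```python
from typing import Callable, List


def retry_start_times(t0: float,
                      duration: Callable[[float], float],
                      deadline: float) -> List[float]:
    """Start times of successive tries, assuming every try fails.

    A try starts only if it can finish before the window closes
    (the feasibility check of Assumption 2); a failed try is followed
    immediately by the next one.
    """
    starts: List[float] = []
    t = t0
    while True:
        step = duration(t)
        if step <= 0:
            raise ValueError("duration must be strictly positive")
        if t + step > deadline:
            break  # not enough window left: the job is discarded
        starts.append(t)
        t += step  # immediate retry at the end of the failed try
    return starts


def example_duration(t: float) -> float:
    """Hypothetical duration that shrinks for later starts (Assumption 4)."""
    return max(1.0, 10.0 - 0.2 * t)


if __name__ == "__main__":
    # Illustrative window ending at time 30, first try started at time 12.
    print(retry_start_times(12.0, example_duration, 30.0))
```

The number of entries in the returned list is the maximal number of tries available from the chosen first starting time.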
Assumption 4. For an assignment of job $m$ to agent $a$, the later a try starts, the later it ends, but the shorter its duration.
Assumption 4 is in accordance with observations in time-sensitive applications such as air defense. As the threat approaches, the time needed for an interception diminishes. That is, the earlier a process starts, the earlier its effect can be observed, but the longer the duration that must be spent. According to Assumption 4, for any two starting times $t_1 < t_2$, the try started at $t_1$ ends earlier than the try started at $t_2$ but lasts longer.
Assumption 5. Each agent operates independently, and there is no resource competition or mutual influence between agents.
The objective is to schedule the agents on-line to successfully complete all the arriving jobs with the highest probability. In the off-line case, the problem can be modelled as the maximization (1) subject to constraints (2) and (3). Constraint (2) implies that an agent can process a job more than one time.
Constraint (3) ensures that a try of job processing should not be started if the feasible time window of the job will elapse before a try can be finished.
Compared with a similar model in Karasakal et al. [2], the starting time of job processing in the above model is continuously distributed in a job's feasible time window.Moreover, in the on-line version, the future arriving jobs could not be known in advance, which makes the model more difficult to solve.
As a result, two trade-offs must be made during on-line scheduling to ensure good scheduling quality: (1) the trade-off between the probability of successfully processing the current job and the probability of successfully reprocessing it in case of failure; (2) the trade-off between the reward of successfully processing the current job and the opportunity loss that the agent might incur with other incoming jobs while the current job is being processed.

Related Works
3.1. Scheduling. Job scheduling [16] is the classic domain for such problems, in which jobs need to be handled by one or more machines subject to constraints on due dates, processing times, priorities, and so forth. There are many different models, such as single or parallel machine models, depending on the number of machines. If a job needs to be handled by a series of machines in order, the models are called flow shops, job shops, or open shops under different situations. The objective is to handle all the jobs with a minimum makespan [17] or lateness [18,19]. Time-independent uncertainties such as machine breakdowns, unexpected releases of jobs with high priority [16], processing durations [20], and execution uncertainty [21] are introduced in what is called stochastic scheduling. Recently, models of time-dependent scheduling have been proposed, in which parameters of the scheduling are time-dependent. For example, learning effects and processing times are defined as increasing functions [22,23] or decreasing functions [1,24,25] of their start times. However, most of the above studies address the off-line case, where all of the jobs exist from the beginning. Moreover, the time-dependent parameters are mainly the processing times or the cost of processing, while a time-dependent success probability of job processing is not discussed.

On-Line Stochastic Optimization.
On-line stochastic optimization, such as on-line packet scheduling, stochastic reservations, and vehicle dispatching or routing, has been studied in [3,4]; in these problems, a job or request arrives stochastically and queues for a machine or server. A job or request is successfully processed once a machine is scheduled for it. Scheduling can be centralized or decentralized, depending on whether the scheduling decision is made globally or by each agent. To model real-world problems, time-independent uncertainties such as action duration [11], resource consumption [10], and operation outcomes [12,14] are introduced into the scheduling process.
Sampling approaches are also introduced to estimate future job arrivals and so achieve a globally optimal solution [26][27][28]. Similar to our problem, each agent in the above problems is dynamically scheduled against stochastically arriving jobs, and there is no resource competition or dependence among the agents. However, a time-dependent probability of success is not considered, and each job is successfully processed once an agent is scheduled to process it. Moreover, job discarding is allowed if a more important job arrives. In our problem, by contrast, each agent should retry after a failure as long as time allows.

Stochastic Resource Allocation in Multiagent Environment.
The most closely related studies on stochastic resource allocation in multiagent environments focus on the following problem: each agent can execute a task independently, while different agents may share the same resources. An agent consuming shared resources may decrease the reward of other agents. As the outcome of job execution is uncertain, the resources are allocated to achieve a globally optimal solution. Reference [12] solves this type of problem by introducing the dynamic constraint satisfaction problem (DCSP) model into MDP and constructing a Markovian CSP (MaCSP) model. The best action at each Markovian step depends on the resource availability. As the state space grows exponentially with the number of agents and the types of resources, some studies propose heuristic search [29] and decomposition approaches [14,15] for solving Decentralized Markov Decision Processes (DEC-MDP). As the dependency between different agents is taken into account, an agent starting an action too early or too late may jeopardize the operation of others. Hence, a trade-off is introduced into DEC-MDP to estimate the cost that one agent may suffer because of the negative influence of others [13].
A comparison between our problem and the closely related studies is given in Table 1.

Decentralized MDP (DEC-MDP)
In theory, model (1)-(3) can obtain the optimal scheduling solution for an off-line problem. However, in the on-line case, the scheduling decision should be made in real time according to the state of each agent and the incoming job. MDP therefore provides a suitable approach to modelling on-line scheduling, by mapping the current state of the agents and incoming jobs to an optimal scheduling decision. To construct the MDP model of a problem, the state space of the problem must first be defined.

States of the Agent and the Job.
If a job is being processed by an agent, the state of the job is modelled as the ratio of the remaining time window feasible for the job to be completed. If it is waiting to be processed, its state is set to 0; if it has been completed successfully, the state is set to 1; otherwise the state is set to −1. Let $s_m(t)$ denote the state of job $m$ at time $t$.
For example, at time 0, agent $a$ is unemployed and no job has arrived. At time 10, a job $m$ arrives with feasible time window [12, 30], and it is scheduled to agent $a$, which is due to start at time 12. If job processing fails again by the end of the second try (e.g., at time 29), and the remaining time window of job $m$ (only 1 second is left, since the upper bound of the window is 30) is not enough for another try, then job $m$ will be discarded from time 29. In this case, agent $a$ will be available for other incoming jobs from time 29.
As different agents may be released or start to process a job at different times, it is hard to define a joint action, which is the set of actions for each agent in each decision step of the on-line scheduling process [13]. Moreover, because of the time-dependent state space, the reward of a joint action is difficult to evaluate by a recursive approach as introduced in [30]. Recently, in order to limit the state space in the multiagent environment, there has been significant progress in extending Markov Decision Processes (MDP) for optimizing decentralized control [13,31]. In this paper, as there is no dependence or resource competition among agents, a decentralized MDP is adopted to model the decision process of each agent.

DEC-MDP.
For an agent $a$ and its allocated job $m$ with time window $[\underline{t}_m, \overline{t}_m]$, the corresponding DEC-MDP is defined as a four-element tuple indexed by $t_1$, the time at which the process is started for the first time. Its elements are (i) the state set, whose elements are pairs $(s_a, s_m)$ of the states of agent $a$ and job $m$ during the whole process period (that is, from time $t_1$ to the time when agent $a$ is released from job $m$); (ii) the strategy set of agent $a$, represented by the starting times that agent $a$ may choose to process job $m$ for the first time; (iii) the time-dependent success probability of each try, which governs the state transitions; and (iv) the reward function as defined in (13).
The initial state of a DEC-MDP is the state at time $t_1$, and the absorbing state is the state in which agent $a$ has been released from job $m$. Figure 2 shows the state transition process when agent $a$ starts to process job $m$ at time $t_1$ for the first time, in which the maximal number of tries is $n$, where $n$ is the smallest number that satisfies $t_{n+1} + d_{a,m}(t_{n+1}) > \overline{t}_m$. The reward function, defined in (13), is the probability that agent $a$ will successfully complete job $m$ when starting the first process at time $t_1$, given that, starting from $t_1$, agent $a$ can process job $m$ no more than $n$ times. With (13)-(15), the best time $t_1^*$ for agent $a$ to begin processing job $m$ is the starting time that maximizes this reward. Therefore, during on-line scheduling, we prefer to schedule the incoming job $m$ to the agent $a^*$ with the highest success probability. However, as stated in Section 4, this decision does not take the opportunity loss into account: an agent may lose higher rewards from upcoming jobs during its engagement with the current job. As a result, we introduce an opportunity loss decentralized MDP (OL-DEC-MDP) model in the next section.
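The DEC-MDP reward described above, the probability that the job is eventually completed when the first try starts at a given time, can be sketched as follows. This is a minimal illustration, not the paper's equations (13)-(16): it assumes that the outcomes of successive tries are independent, and it searches a finite list of candidate starting times. The names `success_probability` and `best_start_time` are hypothetical.

```python
from typing import Callable, Iterable


def success_probability(t1: float,
                        p: Callable[[float], float],
                        d: Callable[[float], float],
                        deadline: float) -> float:
    """Probability that at least one try succeeds when the first try starts at t1.

    Assumes independent tries: the job stays incomplete only if every
    feasible try fails, so the result is 1 - prod(1 - p(t_k)).
    Assumes d(t) > 0 so that the loop terminates.
    """
    all_fail = 1.0
    t = t1
    while t + d(t) <= deadline:  # a try must finish inside the window
        all_fail *= 1.0 - p(t)
        t += d(t)                # immediate retry after a failure
    return 1.0 - all_fail


def best_start_time(candidates: Iterable[float],
                    p: Callable[[float], float],
                    d: Callable[[float], float],
                    deadline: float) -> float:
    """Candidate first starting time with the highest overall success probability."""
    return max(candidates, key=lambda t: success_probability(t, p, d, deadline))


if __name__ == "__main__":
    # Hypothetical constant success probability and duration.
    print(success_probability(0.0, lambda t: 0.5, lambda t: 5.0, 12.0))
```

Starting earlier leaves room for more tries, which is why this reward alone tends to favour early starts and ignores the cost of keeping the agent engaged.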

Opportunity Loss Decentralized MDP (OL-DEC-MDP)
An OL-DEC-MDP model has the same state space, strategy set, and transition probability as a DEC-MDP. However, the reward function of an OL-DEC-MDP should be redefined to take the opportunity loss into account.

Opportunity Loss.
As shown in Figure 2, the agent may try at most $n$ times before being released from the current job. It will not be available for other upcoming jobs during the period $\Delta_k(t_1) = [t_1, t_k + d_{a,m}(t_k)]$ ($1 \le k \le n$) with a probability denoted $P_k$, where $t_{k+1} = t_k + d_{a,m}(t_k)$ and $k = 1, 2, \ldots, n - 1$. As a result, the agent will lose all upcoming jobs during $\Delta_k(t_1)$ with probability $P_k$. Hence the opportunity loss $\mathrm{OL}(a, m, t_1)$ for agent $a$ to process job $m$ starting from $t_1$ can be defined as a maximum over the upcoming jobs in $M(\Delta_k(t_1))$ of the reward that the agent would forgo, where each upcoming job is evaluated at its best starting time as decided by (16), and $M(\Delta_k(t_1))$ is the set of all possible jobs that will arrive during the period $\Delta_k(t_1)$. The opportunity loss of the agent is thus the highest possible reward that the agent may lose during the period of engagement with its current job. Considering both the reward and the potential loss in the scheduling decision, we now refine the reward function in OL-DEC-MDP as follows:

$$R^{\mathrm{OL}}(a, m, t_1) = R_{a,m}(t_1) - \mathrm{OL}(a, m, t_1),$$

where $R_{a,m}(t_1)$ is the reward of starting at $t_1$ without opportunity loss. As a result, the best starting time $t_1^*$ of agent $a$ to process job $m$ is the starting time that maximizes this refined reward. Reward function (21) calculates the maximum reward when scheduling agent $a$ to process job $m$ while taking the opportunity loss into account.

Computation of the Reward with Opportunity Loss.
According to (19), to compute $\mathrm{OL}(a, m, t_1)$, we should know $M(\Delta_k(t_1))$, that is, the set of all possible jobs that will arrive during the period $\Delta_k(t_1)$. $M(\Delta_k(t_1))$ can be estimated on-line by the sampling approach described in [26,27], which forecasts possible events according to the job arrival distribution.
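As a rough illustration of how sampling could feed the opportunity-loss estimate, the sketch below draws job arrivals for an engagement period and averages the best reward that would be lost. Everything specific here is an assumption made for the example and is not taken from the paper or from [26,27]: arrivals are modelled as a Poisson process with a known rate, the agent is busy for the whole period with a single probability `busy_prob`, and `reward_of_job` is a hypothetical callable returning the reward of a job arriving at a given time.

```python
import random
from typing import Callable, List


def sample_arrivals(start: float, end: float, rate: float,
                    rng: random.Random) -> List[float]:
    """One sampled scenario of arrival times in [start, end].

    Assumption: Poisson arrivals, i.e. exponential inter-arrival times.
    """
    arrivals: List[float] = []
    t = start + rng.expovariate(rate)
    while t <= end:
        arrivals.append(t)
        t += rng.expovariate(rate)
    return arrivals


def estimate_opportunity_loss(busy_prob: float, start: float, end: float,
                              rate: float,
                              reward_of_job: Callable[[float], float],
                              n_samples: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of the reward lost while the agent is engaged.

    For each sampled scenario, the loss is the highest reward among the
    sampled upcoming jobs (0 if none arrives), weighted by the probability
    that the agent is still busy during the period.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        arrivals = sample_arrivals(start, end, rate, rng)
        best_lost = max((reward_of_job(t) for t in arrivals), default=0.0)
        total += busy_prob * best_lost
    return total / n_samples


if __name__ == "__main__":
    print(estimate_opportunity_loss(0.5, 0.0, 10.0, 0.2, lambda t: 0.8))
```

A longer engagement period or a higher arrival rate makes it more likely that at least one job is sampled, which raises the estimated loss.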

On-Line Scheduling Based on OL-DEC-MDP.
The detailed scheduling algorithm is given as follows.
(1) Queuing Up the Incoming Jobs. When a new job arrives, it is placed in a time-priority queue. A newly arrived job with a smaller lower bound of its feasible time window has higher priority.
(2) Observing the System State Change. Each agent has a job list of length 1, which indicates its next job to be processed. The system state changes when (a) an agent is released from its current job and starts to process the assigned job in its job list (the agent's job list then becomes empty); (b) an agent fails to complete the current job and begins another try (the job in the agent's job list keeps waiting and will be rescheduled); (c) a new job arrives, and there exists at least one agent with an empty job list (the incoming job will be assigned to some agent by being pushed into its job list).
(3) Scheduling and Rescheduling. When the system state changes, a scheduling or rescheduling decision is made to decide or adjust the best next job, as well as the best starting time, for each agent. Let $M_c$ denote the current set of next jobs of all agents before rescheduling.
(a) If $\|M_c\| < \|A\|$, then dequeue $\|A\| - \|M_c\|$ jobs from the queue. Let $M_q$ be the set of these dequeued jobs. Then $M_c \cup M_q$ is the job set to be scheduled or rescheduled, denoted by $M_s$.
(b) Order the jobs in $M_s$ according to time priority, and clear the job lists of all agents.
(c) Schedule each job in $M_s$, in order, to an agent as follows.
(i) A given job $m$ is scheduled to the agent $a^*$ with the highest reward among the set of all agents with an empty job list, where the best starting time $t_1^*$ is decided by (23) for the given $a$ and $m$. For each agent, if it has not been released when job $m$ arrives, its earliest available time is set to the ending time of its current process cycle. That is, the scheduling decision is made on the assumption that every agent will be released by the end of its current process cycle. If observation shows that the assumption is violated, this is treated as a system state change, and rescheduling is performed as described in step 3.
(ii) Push job $m$ into the job list of $a^*$.
(4) Job Processing. When available (that is, once released from its current job), an agent begins to process the job in its job list at time $t_1^*$, and its job list is cleared. By Assumption 2, agents will retry until the assigned job succeeds or its time window expires.
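The assignment step of the algorithm above (ordering jobs by time priority and giving each job to the free agent with the highest reward) can be sketched as follows. The `Job` class, the `schedule` function, and the `reward` callable are hypothetical names; `reward(agent, job)` is assumed to return the pair (best reward, best starting time) that the OL-DEC-MDP computation would supply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    window_start: float  # lower bound of the feasible time window
    window_end: float    # upper bound of the feasible time window


def schedule(jobs: List[Job],
             free_agents: List[str],
             reward: Callable[[str, Job], Tuple[float, float]]
             ) -> Dict[str, Tuple[Job, float]]:
    """Assign jobs, in time-priority order, to agents with an empty job list.

    Each agent holds at most one next job, so an agent is removed from the
    pool of available agents once a job has been pushed into its list.
    Returns a mapping agent -> (job, best starting time).
    """
    assignment: Dict[str, Tuple[Job, float]] = {}
    available = list(free_agents)
    # A smaller lower bound of the feasible window means higher priority.
    for job in sorted(jobs, key=lambda j: j.window_start):
        if not available:
            break  # remaining jobs stay in the queue
        best_agent = max(available, key=lambda a: reward(a, job)[0])
        _, start_time = reward(best_agent, job)
        assignment[best_agent] = (job, start_time)
        available.remove(best_agent)
    return assignment
```

In an on-line setting this function would be called again at every system state change, with the job set and the pool of agents rebuilt as described in steps (a) and (b).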

Properties and Proofs
Property 1. The time complexity of computing the best starting time $t_1^*$ for an agent $a$ to process job $m$ according to (21) is $O(\overline{t}_m - \underline{t}_m)$.

Evaluations
6.1. Evaluation Setting. In the evaluation, a scenario of antiship missile defence by SAM systems, introduced in [2], is studied. Suppose there are four ship-borne SAM systems that can be scheduled to intercept the incoming antiship missiles, and each SAM system is capable of working independently and of intercepting antiship missiles coming from any direction (modern ship-borne vertical launching missile systems match these features and are becoming very popular). The feasible interception time window of each incoming antiship missile $m$ spans 43.29 from the time at which the antiship missile is detected. The length of the interception time window is decided by the detection capability of the SAM system as well as the speed of the antiship missile. As a result, if a SAM system is available when the missile is detected, the characteristic starting times of Section 5 coincide with the detection time, with 15.9 after detection (for two of them), and with 20.47 after detection, based on the scenario in [2]. The killing probability associated with the starting time of each interception, $p_{a,m}(t)$, is shown in Figure 3 and is approximated by a cubic polynomial. The duration function is defined as $d_{a,m}(t) = r(t)/(\nu_{\mathrm{sam}} + \nu_{m})$, where $r(t)$ is the range at time $t$ between the SAM missile and the antiship missile, and $\nu_{\mathrm{sam}}$ and $\nu_{m}$ are the velocities of the SAM missile and the antiship missile.
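The duration function above can be made concrete with a small sketch. It assumes, purely for illustration, that the antiship missile closes on the ship at constant speed from a known range at detection, so that the range at launch time shrinks linearly; the function name `interception_duration` and all numeric values are hypothetical and are not taken from the scenario in [2].

```python
def interception_duration(t: float, t_detect: float, range_at_detect: float,
                          v_sam: float, v_target: float) -> float:
    """Flight time until engagement for an interceptor launched at time t.

    Assumption: the target approaches at constant speed v_target, so the
    range at launch is range_at_detect - v_target * (t - t_detect). The two
    missiles then close at the combined speed v_sam + v_target.
    """
    if t < t_detect:
        raise ValueError("cannot launch before the target is detected")
    range_now = range_at_detect - v_target * (t - t_detect)
    if range_now < 0:
        raise ValueError("target has already reached the ship")
    return range_now / (v_sam + v_target)


if __name__ == "__main__":
    # Hypothetical numbers: 30 km at detection, 700 m/s interceptor, 300 m/s target.
    for launch in (0.0, 10.0, 20.0):
        print(launch, interception_duration(launch, 0.0, 30000.0, 700.0, 300.0))
```

Under this assumption a later launch ends later but takes less time, which is the behaviour stated in Assumption 4.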

Quality of Scheduling.
In the air-defence scenario, failing to intercept even one missile may result in severe damage. Hence the quality of scheduling is measured by the probability of successfully intercepting all incoming antiship missiles, denoted as P-interception. As shown in Figures 4 and 5, both the DEC-MDP and OL-DEC-MDP based scheduling approaches show that the less intensive the attack (fewer antiship missiles within a fixed time span, or a longer time span for a fixed number of incoming antiship missiles), the higher the P-interception. The reason is that with fewer antiship missiles per unit time, more SAM systems are available to be scheduled, so the overall interception performance improves.
However, as shown in Figure 6, the OL-DEC-MDP based scheduling approach always has a higher probability of overall interception than the DEC-MDP model. It can be observed that, for the same time span, the improvement of OL-DEC-MDP becomes more significant as the number of incoming missiles increases; that is, the performance improvement grows under a more intensive attack. On the other hand, the improvement plotted against the time span (for a fixed number of incoming antiship missiles, a longer time span means a less intensive attack) turns out to have the shape of a "cap": the improvement rises sharply as the time span increases at first and then falls after reaching a peak. The reason is that when the time span is small, which means that the antiship missiles arrive very intensively, it is very hard for OL-DEC-MDP to improve the interception performance, since the SAM system reaches its saturation point under very intensive attack. The decision space left for each SAM system to decide the best starting time of the interception is quite small; hence OL-DEC-MDP performs similarly to DEC-MDP. However, when the intensity falls below the saturation point of the SAM system, the improvement brought by OL-DEC-MDP gradually becomes significant, as opportunity loss is taken into account in on-line scheduling to achieve better overall performance. As the attack intensity continues to fall with increasing time span, the whole system has more than enough capability (available SAM systems) to intercept the incoming missiles; hence the improvement brought by OL-DEC-MDP becomes less significant.
Figure 7 shows the best starting time of interception used in OL-DEC-MDP, obtained by the heuristic strategy introduced in Section 5. The plotted time indicates how long after an antiship missile is detected the first interception is launched. It can be observed that when the antiship missiles arrive intensively (a smaller time span with a fixed total number of incoming missiles), OL-DEC-MDP prefers to postpone the first interception launch. Study of the simulation data shows that the optimal starting time for interception in this case is near the time of highest single-shot killing probability, which means that the strategy of achieving the highest killing probability against the antiship missile with one shot is the superior strategy. This observation can be inferred from Property 2: the superior strategy in this case is to release the SAM system as early as possible so that it can treat the next incoming antiship missile. On the other hand, when the attack is less intensive (a longer time span with a fixed total number of incoming missiles), OL-DEC-MDP prefers to start the interception earlier so as to leave more feasible time for retrying in case the interception fails.

Conclusions
This paper proposes an OL-DEC-MDP model for on-line multiagent stochastic scheduling, which considers a probability of success and a processing duration that depend on the starting time. The probability that an agent completes its assigned job is higher when processing starts earlier, but the opportunity loss can also be high because of the longer engagement. The OL-DEC-MDP model therefore introduces a reward function that accounts for the opportunity loss and schedules the incoming job to the agent with the highest reward. To measure the opportunity loss, the OL-DEC-MDP model uses a sampling method to predict upcoming jobs and introduces heuristic strategies to compute the best starting time of an agent for an incoming job. The simulation experiments show that the OL-DEC-MDP model improves the overall scheduling performance compared with models that do not consider opportunity loss, such as DEC-MDP. The overall trend of the performance improvement is studied under different scenarios, which shows that the improvement is most significant when jobs arrive intensively but within the saturation point of the multiagent system.
In future research, the model should be extended to more general cases.
(1) Dependency Between Agents. In some cases, agents may interfere with each other's operation. For example, if soft weapons such as chaff rockets are used during the interception, there may be mutual interference between different air defence weapons: firing a chaff rocket may prevent the missile guiding radar of the SAM system from working normally. In future work, the mutual influence between agents will be considered in constructing the available strategy set and computing the action reward.
(2) Partial Observation. For some real-world problems, the result of an action can only be partially observed. For example, the result of an interception by a SAM system may not be fully observed by other agents because of limited sensing capability. Hence the reward and the opportunity loss should be reevaluated, and a POMDP (partially observable MDP) based approach could be a good candidate.
(3) On-Line Learning. The sampling approach implemented in OL-DEC-MDP is based on prior knowledge of the arrival distribution of the incoming jobs. If this prior knowledge does not exist, an on-line learning method could be used to learn and predict the future incoming jobs.

Figure 1: Engagement time-dependent killing probability of the interception missile.

Figure 4: Probability of overall interception of DEC-MDP based approach.

Figure 5: Probability of overall interception of OL-DEC-MDP based approach.

Figure 6: The increased probability of interception by OL-DEC-MDP compared with DEC-MDP.

Table 1: Comparison of the closely related works.

Agent and the Job. The state of an agent is either busy, when it is processing a job, or unemployed, when it is released from the current job. Let $s_a(t)$ denote the state of agent $a$ at time $t$.
(ii) The strategy set of agent $a$ is represented by the starting times that agent $a$ may choose to process job $m$ for the first time: $\{t \mid t \in [\underline{t}_m, \overline{t}_m],\ t + d_{a,m}(t) \in [\underline{t}_m, \overline{t}_m]\}$; (iii) $p_{a,m}(t)$ is a function of time $t$ that gives the probability that agent $a$ successfully completes job $m$ by time $t + d_{a,m}(t)$ when the process or the reprocess starts at time $t$.

Reward with Opportunity Loss. For an OL-DEC-MDP, if job $m$ is allocated to agent $a$, then agent $a$ should choose a starting time $t_1$ from the time window $[\underline{t}_m, t_{\max}]$, where $t_{\max} + d_{a,m}(t_{\max}) = \overline{t}_m$. If the derivative of the reward function (21) exists within this interval, the optimal starting time can be sought at a stationary point of (21). In addition, the following heuristics are considered.
(1) Start as early as possible. That is, set $t_1 = \underline{t}_m$ if the agent is available earlier than $\underline{t}_m$, or set $t_1$ to the earliest time when agent $a$ becomes available. We denote $t_1$ under this heuristic as $t_E$, with $t_E \in [\underline{t}_m, t_{\max}]$.
(2) Start as late as possible, as long as the agent still has the same number of retrying opportunities ($n$) as when starting as early as possible (that is, at time $t_E$). With this heuristic, $t_{k+1} = t_k + d_{a,m}(t_k)$ ($k = 1, 2, \ldots, n - 1$) and $t_n + d_{a,m}(t_n) = \overline{t}_m$. We denote $t_1$ under this heuristic as $t_L$.
(3) Start at the time with the highest probability of completing job $m$ within the first try, while still having the maximal number of retrying opportunities ($n$). That is, $t_1$ is the time point in $[t_E, t_L]$ with the highest success probability of the first try: $t_1 = \arg\max_{t \in [t_E, t_L]} p_{a,m}(t)$. We denote $t_1$ under this heuristic as $t_P$.
As a result, the best starting time $t_1^*$ of agent $a$ to process job $m$, considering both reward and opportunity loss, can be computed as the candidate, among the stationary point and $t_E$, $t_L$, and $t_P$, that maximizes $R_{a,m}(t_1) - \mathrm{OL}(a, m, t_1)$.
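The three heuristic starting times and the final selection can be sketched as below. This is an illustrative approximation rather than the paper's procedure: it scans a discrete grid of step `step` instead of solving for the exact latest starting time, it omits the stationary-point candidate, and it assumes a strictly positive duration function. The names `count_tries`, `heuristic_candidates`, and `best_start`, and the `reward` and `opportunity_loss` callables, are hypothetical.

```python
from typing import Callable, List


def count_tries(t1: float, duration: Callable[[float], float],
                deadline: float) -> int:
    """Number of tries that fit in the window when the first try starts at t1."""
    n, t = 0, t1
    while t + duration(t) <= deadline:  # assumes duration(t) > 0
        n += 1
        t += duration(t)
    return n


def heuristic_candidates(earliest: float,
                         duration: Callable[[float], float],
                         success_prob: Callable[[float], float],
                         deadline: float, step: float) -> List[float]:
    """Return [earliest start, latest start with the same number of tries,
    start with the highest first-try success probability between them]."""
    n = count_tries(earliest, duration, deadline)
    grid = [earliest]
    k = 1
    # Move later on the grid while the number of available tries is unchanged.
    while n > 0 and count_tries(earliest + k * step, duration, deadline) == n:
        grid.append(earliest + k * step)
        k += 1
    latest = grid[-1]
    most_likely = max(grid, key=success_prob)
    return [earliest, latest, most_likely]


def best_start(candidates: List[float],
               reward: Callable[[float], float],
               opportunity_loss: Callable[[float], float]) -> float:
    """Candidate that maximizes reward minus opportunity loss."""
    return max(candidates, key=lambda t: reward(t) - opportunity_loss(t))
```

Restricting the search to a handful of candidates keeps the on-line decision cheap; the grid step trades accuracy of the late-start candidate against computation.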