SEQUENTIAL SCHEDULING OF PRIORITY QUEUES AND ARM-ACQUIRING BANDITS *

In a queueing network with a single server and r service nodes, a non-preemptive non-idling policy chooses a node to service at each service completion epoch. Under the assumptions of independent Poisson arrival processes, fixed routing probabilities, and linear holding cost rates, we apply Whittle's method for Arm-acquiring bandits to show that, for minimizing discounted cost or long-run average cost, the optimal policy is an index policy. We also give explicit expressions for those priority indices.


INTRODUCTION
In this paper a queueing network consisting of a single server and $r$ service nodes is considered. Each node allows an unbounded queue. At any time $t > 0$, service can only take place at one node (this is time-sharing service). The queueing discipline is non-preemptive and non-idling. The former requires that no interruption of service in a node is permitted. The latter means that the server cannot be idle if at least one node has a non-empty queue. Here a queue includes any customer being serviced.
Several assumptions are made for the probability structure of the system:
(A1) The arrival process at node $i$ from outside the network is a Poisson process with intensity $\lambda_i$, $i = 1, \dots, r$. The $r$ arrival processes at different nodes are independent.
(A2) The service times at node $i$ are iid random variables, which need not have exponential distributions. The $r$ service time sequences at different nodes are independent.
(A3) All service time sequences are independent of all arrival processes.
(A4) The service order at each node is "first-in-first-out." A customer who finishes service at node $i$ will either switch to the end of the queue at node $j$ with probability $p_{ij}$, or leave the network with probability $1 - \sum_j p_{ij}$.
(A5) The set of $r$ nodes associated with a given partial order generates an oriented graph. This graph is a forest consisting of one-root trees oriented towards the root. Hence it contains no closed loops and may be decomposed into connectivity components, each of which is a tree; each tree has one root and is oriented towards this root. The root of a tree is the maximal element with respect to the other vertices of the same tree.
The order is defined as follows: node $j$ is said to be achievable from node $i$ if there exist $n \in \mathbb{N}$ and nodes $i_1, \dots, i_n$ such that $i_1 = i$, $i_n = j$ and $p_{i_1 i_2} \cdots p_{i_{n-1} i_n} > 0$. We denote this by $p(i \to j) > 0$. Hence $i$ precedes $j$ in the order iff $p(i \to j) > 0$. Note that $p_{ij} > 0$ implies that $i$ precedes $j$, but the converse need not be true.
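The achievability relation can be computed mechanically from the routing matrix. The following is a minimal sketch, not part of the original analysis: it assumes nodes are numbered from 0, that paths have at least one hop, and the function names `reachability` and `has_closed_loop` are hypothetical. It uses Warshall's transitive closure to decide whether $p(i \to j) > 0$ and to check the "no closed loops" part of (A5).

```python
def reachability(P):
    """Return a boolean matrix R with R[i][j] True iff p(i -> j) > 0,
    i.e. some routing path with at least one hop leads from i to j."""
    r = len(P)
    R = [[P[i][j] > 0 for j in range(r)] for i in range(r)]
    # Warshall's algorithm: allow node k as an intermediate stop.
    for k in range(r):
        for i in range(r):
            if R[i][k]:
                for j in range(r):
                    if R[k][j]:
                        R[i][j] = True
    return R


def has_closed_loop(P):
    """True iff some node is achievable from itself (violating (A5))."""
    R = reachability(P)
    return any(R[i][i] for i in range(len(P)))


# Hypothetical tandem routing 0 -> 1 -> 2 (illustrative numbers only).
P_example = [[0.0, 0.7, 0.0],
             [0.0, 0.0, 0.4],
             [0.0, 0.0, 0.0]]
R_example = reachability(P_example)
```

In this example node 2 is achievable from node 0 although the direct routing probability from 0 to 2 is zero, which illustrates why the converse of the last implication need not hold.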

Example 1
Here $\lambda_i > 0$, $i = 1, \dots, r$, and $p(i \to j) = 0$ if $i \neq j$.
This queueing network is equivalent to a multi-class system with feedback probabilities, for one can view a customer at node $i$ as a customer of type $i$, or simply an "$i$-customer", $i = 1, \dots, r$. In what follows, we may refer to node $i$ or $i$-customer, depending on which term is more convenient. Now we introduce more notation. Let
$Q_i(t)$ = the queue length at node $i$ at time $t$;
$c_i$ = the holding cost rate at node $i$;
$\tau_n$ = the $n$th service completion epoch;
$d_{n-1}$ = the node (code) which accepts service in the $n$th service stage $(\tau_{n-1}, \tau_n)$.
Note that we usually choose node $d_n$ at each epoch $\tau_n$. However, if at $\tau_n$ all nodes have empty queues and the next new arrival at the network happens to be a $j$-customer, then $d_n = j$ automatically.
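The feedback mechanism of (A4) can be illustrated by sampling the route of a single customer. This is only a sketch under stated assumptions: nodes are numbered from 0, `P[i][j]` holds the routing probability from node i to node j, and the name `sample_route` is hypothetical.

```python
import random


def sample_route(start, P, rng):
    """Return the list of nodes visited by one customer entering at `start`.

    After service at node i the customer moves to node j with probability
    P[i][j] and leaves the network with probability 1 - sum_j P[i][j].
    """
    route = [start]
    node = start
    while True:
        u = rng.random()
        cumulative = 0.0
        next_node = None  # None means "leave the network"
        for j, p in enumerate(P[node]):
            cumulative += p
            if u < cumulative:
                next_node = j
                break
        if next_node is None:
            return route
        route.append(next_node)
        node = next_node


# Hypothetical deterministic tandem: 0 -> 1 -> 2 -> exit.
P_tandem = [[0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0],
            [0.0, 0.0, 0.0]]
route = sample_route(0, P_tandem, random.Random(0))
```

With a routing matrix that satisfies (A5), the loop terminates after finitely many stages because no node can be revisited.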
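Every sequence $\{d_n, n = 0, 1, \dots\}$ specifies a policy $\pi$. For every $\alpha > 0$, define
$V_{\alpha,\pi} = E_\pi \int_0^\infty e^{-\alpha t} \sum_{i=1}^r c_i Q_i(t)\, dt$,
$J_\pi = \limsup_{T \to \infty} \frac{1}{T} E_\pi \int_0^T \sum_{i=1}^r c_i Q_i(t)\, dt$.
$V_{\alpha,\pi}$ is the expected total discounted cost with discount factor $e^{-\alpha}$ and policy $\pi$; $J_\pi$ is the expected long-run average cost with policy $\pi$. In most cases of interest, the limit actually exists.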
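To make the discounted criterion concrete, the sketch below evaluates the discounted holding cost $\int_0^T e^{-\alpha t} \sum_i c_i Q_i(t)\,dt$ along one given piecewise-constant queue-length path over a finite horizon $T$. It computes a pathwise quantity only, not the expectation, and the function name and the path representation are assumptions made for illustration.

```python
import math


def discounted_holding_cost(alpha, c, path, horizon):
    """Discounted holding cost of one sample path on [0, horizon].

    `path` is a list of (t, q) pairs with increasing times t; the
    queue-length vector q is held from t until the next listed time.
    """
    total = 0.0
    for k, (t, q) in enumerate(path):
        end = path[k + 1][0] if k + 1 < len(path) else horizon
        end = min(end, horizon)
        if end <= t:
            continue
        rate = sum(ci * qi for ci, qi in zip(c, q))
        # Exact integral of rate * exp(-alpha * s) over [t, end].
        total += rate * (math.exp(-alpha * t) - math.exp(-alpha * end)) / alpha
    return total


# One customer with cost rate 2 waiting on [0, log 2], discount rate 1.
cost = discounted_holding_cost(1.0, [2.0], [(0.0, [1])], math.log(2.0))
```

Averaging this quantity over simulated paths would give a Monte Carlo estimate of the discounted cost of a policy, under whatever simulation model one chooses to supply.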
Our goal is to find $\pi_\alpha$ and $\pi^*$ such that $V_{\alpha,\pi_\alpha} = \inf_\pi V_{\alpha,\pi}$ for every $\alpha > 0$, and $J_{\pi^*} = \inf_\pi J_\pi$. The problem of finding $\pi_\alpha$ was solved by Harrison [1] for the case of Example 1, using a direct policy improvement method. He also essentially obtained $\pi^*$ in [2] for the same model. Following Harrison's approach with more elegant analysis, Tcha and Pliska [6] provided an algorithm for computing the optimal policy $\pi_\alpha$ for the general network model. Klimov [3], [4] studied the general network model with the long-run average cost criterion. Assuming the system is in steady state, he applied linear programming to characterize the optimal policy $\pi^*$. Whittle [8] obtained the same results as in Harrison [1], [2], using a different method [7], called Arm-acquiring bandits (AAB). Whittle made significant contributions to the important bandit problem.
In this paper we investigate the general network model from the viewpoint of AAB. Motivated by Whittle's idea and methodology, we have succeeded in deriving explicit expressions for $\pi_\alpha$ and $\pi^*$. The two different fields, scheduling of priority queues and multi-armed bandits, have been tied together.
In section 2 the equivalence between our queueing problem and AAB is established by an adequate state-space transformation. We also state Whittle's results for AAB and give some heuristic explanations. Section 3 contains the main results of this paper. To characterize the index policy $\pi_\alpha$, we first derive a recursive formula for the priority indices, then apply compound Poisson process theory to give the probabilistic interpretation of those indices.
Based on the results of section 3, section 4 establishes the explicit expressions for $\pi^*$.
2. Equivalence between sequential scheduling of priority queues (SSPQ) and Arm-acquiring bandits (AAB)
The problem given in section 1 may be called SSPQ. In this section we transform it to an equivalent problem of AAB.
Associated with each node $i$ are the following traffic flows:
$A^O_i(t)$ = # of arrivals at node $i$ from outside the network in $[0,t]$;
$A^I_i(t)$ = # of arrivals at node $i$ from other nodes in $[0,t]$;
$D^O_i(t)$ = # of departures from node $i$ to outside the network in $[0,t]$;
$D^I_i(t)$ = # of departures from node $i$ to other nodes in $[0,t]$.
For every $\alpha > 0$, maximizing $\mathcal{V}_{\alpha,\pi}$ is a semi-Markov decision problem with state space $Z = \{q = (q_1, \dots, q_r): q_i = 0, 1, 2, \dots;\ i = 1, \dots, r\}$ and action space $\mathcal{A} = \{1, \dots, r\}$.
Intuitively, $q_i$ is the queue length at node $i$, and action $i$ represents "servicing node $i$." Naturally we let $E^i_q(\cdot)$ denote the expectation given action $i$ and state $q$.
A non-randomized Markov policy $\pi$ is a sequence $\{d_n, n = 0, 1, \dots\}$ such that every $d_n$ depends only on the state at $\tau_n$ (or at the next new arrival epoch if all nodes have empty queues at $\tau_n$). When $d_n$ does not even depend on $n$, we call $\pi$ a stationary policy. In this paper we omit the definitions of more general policies such as randomized and measurable policies.
The dynamic programming equation for this problem is given by
Theorem 1. For every $\alpha > 0$, there exists a stationary policy $\pi_\alpha$ such that $\mathcal{V}_{\alpha,\pi_\alpha} = \sup_\pi \mathcal{V}_{\alpha,\pi}$, denoted by $\mathcal{V}_\alpha$, and $\mathcal{V}_\alpha$ satisfies the equation
(2.4) $\mathcal{V}_\alpha(q) = \max_{1 \le i \le r,\ q_i > 0} L_i \mathcal{V}_\alpha(q)$,
where the one-stage operator $L_i$ is defined by
Li2/ot(q) g e-t: Zcj.A(D(t) q j=l Pij q ot q eC*rvOx (q(1)+c)
where $\tau$ is the generic notation for the duration of one service stage; $\omega = (\omega_1, \dots, \omega_r)$ with $\omega_i$ being the # of new arriving $i$-customers in the period $\tau$; $q(ij) = (q_1, \dots, q_i - 1, \dots, q_j + 1, \dots, q_r)$ represents that one customer moves from node $i$ to node $j$; and $q(i\cdot) = (q_1, \dots, q_i - 1, \dots, q_r)$ represents that one customer leaves the network from node $i$. $q(ij)$ and $q(i\cdot)$ are well-defined for $q_i > 0$. Notice that $\mathcal{V}_\alpha(\cdot)$, called the value function, depends on the initial state $q$ in general.
Theorem 1 is a standard theorem of Blackwell type. For the proof, see Ross [5].
The problem of maximizing $\mathcal{V}_{\alpha,\pi}$ can be solved by using Whittle's AAB approach. To do that we need to introduce an additional action A, which stands for "retirement." At each epoch $\tau_n$, we either choose some $i \in \mathcal{A}$ provided $Q_i(\tau_n) > 0$, or choose A with a constant welfare $M$. If $Q_i(\tau_n) = 0$ for all $i \in \mathcal{A}$, then A is the only choice. Once A is taken, service of the entire network will terminate from then on.
Let $\mathcal{V}_\alpha(q, M)$ be the analogue of $\mathcal{V}_\alpha(q)$ modified by adding action A with welfare $M$. Then the same conclusions as Theorem 1 hold for $\mathcal{V}_\alpha(q, M)$. We state them without proof as
Theorem 2. For every $\alpha > 0$ and $M \in \mathbb{R}$, there exists a stationary policy $\pi_{\alpha,M}$ such that $\mathcal{V}_{\alpha,\pi_{\alpha,M}}(\cdot, M) = \sup_\pi \mathcal{V}_{\alpha,\pi}(\cdot, M) = \mathcal{V}_\alpha(\cdot, M)$.
The key point of the AAB approach is to decompose (2.5) into $r$ simultaneous equations, which are considerably easier to handle.
Let $e_i = (0, \dots, 0, 1, 0, \dots, 0)$ be the state corresponding to a single $i$-customer. And let $\mathcal{V}_{i,\alpha}(M) = \mathcal{V}_\alpha(e_i, M)$, $E_i(\cdot) = E^i_{e_i}(\cdot)$, where A is a Poisson random variable with intensity $\lambda_i$, $i = 1, \dots, r$.
$\partial \mathcal{V}_{i,\alpha}(m) / \partial m$ is the usual notation for partial derivative, which will be justified later.
Theorem 3. (2.5) is equivalent to the following $r$ simultaneous equations: (2.6)
(2.6) here is just the analogue of [8], p. 227, (5), with a slight difference due to the greater generality of our network model. The verification can be done by repeating the argument in [7] with minor modification. For brevity we instead make some heuristic remarks that give more insight into Theorem 3.
Remarks:
(a) where E is the event that a customer finishing his service by $\tau$ will go to node $j$. Note that the transition probabilities $p_{ij}$, $i, j = 1, \dots, r$, do not depend on $\tau$, hence (2.7) holds.
(b) For every $\alpha > 0$, $M \in \mathbb{R}$, the optimal policy $\pi_{\alpha,M}$ is an index policy, which chooses a node $i$ with the largest priority index $M_i$ provided $M_i > M$, where $M_i = \inf\{m \in \mathbb{R}: \mathcal{V}_{i,\alpha}(m) = m\}$, $i = 1, \dots, r$.
(c) The functions $\mathcal{V}_{i,\alpha}(M)$, $i = 1, \dots, r$, are nondecreasing, convex, and piecewise linear in $M$. Therefore, the derivatives $\partial \mathcal{V}_{i,\alpha}(m) / \partial m$ exist except at $m = M_j$, $j = 1, \dots, r$. At those index points we may define them as the right-derivatives.
(d) Given a subset $\mathcal{N}$ of $\{1, \dots, r\}$, $\pi$ is said to be a write-off policy with write-off set $\mathcal{N}$ if $\pi$ does not choose node $i$ as the next service stage when $i \in \mathcal{N}$ at that decision epoch. If all nodes are written off, then A will be the only available action. Obviously, the index policy $\pi_{\alpha,M}$ is a write-off policy with $\mathcal{N} = \{i: M_i \le M\}$. Note that here $\mathcal{N}$ depends on $M$, denoted by $\mathcal{N}_M$. $\mathcal{N}_M \subset \mathcal{N}_{M'}$ when $M < M'$.
Start with a small value of $M$ and let it increase. If we assume $M_1 > M_2 > \dots > M_r$, then
$\mathcal{N}_M = \emptyset$ for $M < M_r$; $\mathcal{N}_M = \{j, \dots, r\}$ for $M_j \le M < M_{j-1}$, $j = 2, \dots, r$; $\mathcal{N}_M = \{1, \dots, r\}$ for $M \ge M_1$.
Therefore $\mathcal{N}_M$ recruits new members when $M$ passes the index points, and $\mathcal{N}_M$ stays invariant when $M$ lies between two adjacent index points.
(e) The free parameter $M$ introduced in Theorem 2 and Theorem 3 seems to be a nuisance in the original queueing scheduling problem. However, it enables us to determine $M_1, \dots, M_r$. Meanwhile, for sufficiently small $M$, $\pi_{\alpha,M}$ never chooses A unless at the decision epoch all nodes have empty queues. In that case $\pi_{\alpha,M}$ and $\pi_\alpha$ coincide.
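Remarks (b), (d) and (e) can be summarized in a short sketch of how an index policy with retirement acts once the indices are known. The index values below are made up for illustration, nodes are numbered from 0, retirement is represented by `None`, and the function names are hypothetical.

```python
def write_off_set(indices, M):
    """Nodes written off at retirement welfare M: those with index <= M."""
    return {i for i, index in enumerate(indices) if index <= M}


def index_policy_action(indices, q, M):
    """Node to serve in state q, or None for retirement.

    Serves a non-empty node with the largest index among those whose
    index exceeds M; retires if no such node exists.
    """
    candidates = [i for i, qi in enumerate(q) if qi > 0 and indices[i] > M]
    if not candidates:
        return None
    return max(candidates, key=lambda i: indices[i])


# Hypothetical indices, already in decreasing order.
indices = [5.0, 3.0, 1.0]
sets_by_M = [write_off_set(indices, M) for M in (0.0, 1.0, 3.0, 5.0)]
```

As the welfare M increases past each index point, the write-off set grows by one node, and for M below the smallest index the policy retires only when all queues are empty.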
3. Construction of $\pi_\alpha$
Following Whittle's notation, for every $\alpha > 0$ and $M \in \mathbb{R}$ we let $\phi_i(M) = \mathcal{V}_{i,\alpha}(M)$, $i = 1, \dots, r$. It is observed in section 2 that each $\phi_i(M)$ is a piecewise linear function and changes its slope at each index point $M_j$, $j = 1, \dots, r$. Therefore, if we find the slope of $\phi_i(M)$ on each piece $(M_{j+1}, M_j)$, then those $M_j$'s can be located as well. This idea is due to Whittle and can be carried out in our problem even though the network structure is much more complicated.
Recall that $M_i$ is the priority index of node $i$ (or an $i$-customer). Assume that $M_1 \ge M_2 \ge \dots \ge M_r$, since we can always number those nodes (or customers) in order of decreasing priority. For simplicity we also assume that $M_1 > M_2 > \dots > M_r$, since $M_i = M_j$ means that node $i$ and node $j$ are equally preferable so that any tie breaker can be used.
For every $\alpha > 0$ and $i = 1, \dots, r$, let l/i(Ot) E Ci-2 Cj P i j ) lq/i (0), j=l Hi(M) BI/i(oQ Ei i i Pij
The next theorem gives a recursive formula for computing the $M_j$'s.
Theorem 4. Consider a relabeling of nodes (or customers) at each decision epoch, so that node $j$ has the $j$-th highest priority, $j = 1, \dots, r$. Then having $M_1, \dots, M_j$ determined, we have
Set B M M 1, then Hi(M1) Ml/i(). Since $M_1 \ge h_i + H_i(M_1)$, $\forall\, i = 1, \dots, r$, with equality for $i$ being assigned the label 1 in the new labeling, we obtain h
The last step is due to the fact that Wkj for all k > + 1.
Since $M_{j+1} \ge h_i + H_i(M_{j+1})$ for all $i \ge j + 1$, and the equality holds for $i$ being assigned the label $j + 1$ in the new labeling, (3.2) follows. □
(3.2) provides a recursive formula for computing the priority indices. However, for each $j = 0, \dots, r - 1$, to calculate $M_{j+1}$ we still need to know $W_{kl}$, $1 \le k \le l \le j$. Notice that $W_{kl}$ is the slope of $\phi_k(M)$ on the piece $(M_{l+1}, M_l)$, and it has a very nice probabilistic interpretation.
In particular, $B_r = \{A\}$, $C_0 = \emptyset$. Define
$T_{kl}$ = the time needed to bring all relabeled $i$-customers ($i \in C_l$) to the set $B_l$ when the initial state is $e_k$, $1 \le k \le l \le r$.
Proof. Given $M \in (M_{l+1}, M_l)$, $\pi_{\alpha,M}$ is a write-off policy with the write-off set $B_l$.
Starting with the initial state $e_k$, $\pi_{\alpha,M}$ will service some node $i \in C_l$ in each stage until there is no $i$-customer ($i \in C_l$) in the network. Then $\pi_{\alpha,M}$ will retire and take the welfare $M$.
Thus, $\phi_k(M) = V + M E_k e^{-\alpha T_{kl}}$, where $V$ is the expected reward before retirement, independent of $M$. Proposition 1 follows by differentiation. □
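The relation between the linear pieces and the index can be illustrated numerically. The sketch below is not the paper's recursion: it assumes that a convex, piecewise linear value function is given as the upper envelope of the retirement line $m$ and finitely many lines $V + W m$ with slopes $0 \le W < 1$ (each slope playing the role of a discounted expectation such as $E_k e^{-\alpha T_{kl}}$), and it locates the index as the smallest $m$ at which the envelope equals $m$. All numbers and function names are hypothetical.

```python
def phi(m, pieces):
    """Upper envelope of the retirement line m and the lines V + W * m."""
    return max([m] + [V + W * m for V, W in pieces])


def index_from_pieces(pieces):
    """Smallest m with phi(m) == m, assuming every slope W is below 1.

    phi(m) == m holds iff m >= V + W * m for every piece, that is
    m >= V / (1 - W), so the index is the largest of these thresholds.
    """
    if not pieces:
        raise ValueError("at least one continuation piece is required")
    for _, W in pieces:
        if not 0.0 <= W < 1.0:
            raise ValueError("slopes must lie in [0, 1)")
    return max(V / (1.0 - W) for V, W in pieces)


# Hypothetical pieces (V, W): intercept and slope of each linear piece.
pieces = [(2.0, 0.5), (1.0, 0.8)]
M_index = index_from_pieces(pieces)
```

Below the computed index, continuing is strictly better than retiring, and at or above it the envelope coincides with the retirement line.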

Notes.
(i) There is no presumption that $T_{kl} < \infty$. However our interest excludes the case that $T_{kl}$ is a defective random variable, $1 \le k \le l \le r$. We impose the light-traffic condition, specified by
(*) $\rho = \sum_{i=1}^r \eta_i b_i < 1$,
where $b_i = E_i \tau$ is the expected service time at node $i$, $i = 1, \dots, r$; $\eta = (\eta_1, \dots, \eta_r)'$ satisfies the traffic flow equations
$\eta_j = \lambda_j + \sum_{i=1}^r \eta_i p_{ij}$, $j = 1, \dots, r$,
or in matrix form
$\eta' (I - \mathbb{P}(r)) = \lambda'$,
where $\lambda = (\lambda_1, \dots, \lambda_r)'$, $I$ is the $r \times r$ identity matrix, and $\mathbb{P}(r)$ is the $r \times r$ matrix with entries $p_{ij}$, $i, j = 1, \dots, r$.
In fact, the assumption (A4) guarantees that $I - \mathbb{P}(r)$ is invertible, hence $\eta$ is uniquely determined. This will also be explained later in the proof of Lemma 2. Note that $\rho$ is called the traffic intensity of the network and the condition (*) implies that $T_{kl}$ has finite moments of any order, $1 \le k \le l \le r$.
(ii) $T_{kl}$ depends on the target set $B_l$ and the initial state $e_k$ but not on the order in which those nodes in the set $C_l$ are serviced. In what follows, we apply compound Poisson process theory to derive the expressions of $E_k e^{-\alpha T_{kl}}$, $1 \le k \le l \le r$.
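The light-traffic condition in note (i) can be checked numerically. The sketch below assumes the standard form of the traffic flow equations, in which the total arrival rate at node j equals its external rate plus the flow routed to it from every node, and takes the traffic intensity as the sum over nodes of total arrival rate times mean service time. The solver, the function names and the example numbers are illustrative assumptions.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(n):
            if row != col:
                factor = M[row][col] / M[col][col]
                for k in range(col, n + 1):
                    M[row][k] -= factor * M[col][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def traffic_intensity(lam, P, b):
    """Return (eta, rho): total arrival rates and traffic intensity."""
    r = len(lam)
    # eta_j = lam_j + sum_i eta_i * P[i][j]  <=>  (I - P^T) eta = lam
    A = [[(1.0 if i == j else 0.0) - P[j][i] for j in range(r)]
         for i in range(r)]
    eta = solve_linear(A, lam)
    rho = sum(e * bi for e, bi in zip(eta, b))
    return eta, rho


# Hypothetical two-node network: half of node 0's output feeds node 1.
eta, rho = traffic_intensity([1.0, 0.5], [[0.0, 0.5], [0.0, 0.0]], [0.2, 0.4])
```

Here both nodes see a total arrival rate of 1, the traffic intensity is 0.6, and the light-traffic condition holds.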
Note that $\lim_{u \to \infty} E[Z e^{-u(Z-1)} I(Z > 1)] = 0$, and by Fatou's lemma, $\liminf_{u \to \infty} E[Z e^{-u(Z-1)} I(Z < 1)] \ge E[\lim_{u \to \infty} Z e^{-u(Z-1)} I(Z < 1)] = \infty$. Therefore, $g'(u) < 0$ for sufficiently large $u$. Hence (i), (ii), (iii) imply that there exists a $u > 0$ such that $e^{-u} = \beta E e^{-uZ}$. □
In the queueing literature the term "workload" usually refers to the service time(s) associated with a customer. Even in this complex network model we can still imagine that each arriving customer brings a certain workload, which is the sum of service times corresponding to those nodes along the customer's route in the network. Let $X_k$ be the generic notation for the service time at node $k$, $k = 1, \dots, r$; $Y_n$ be the workload brought by the $n$-th arriving customer at the network, $n \in \mathbb{N}$. (Here we assume that no more than one customer arrives at the network at the same time.) For every $j = 1, \dots, r$, let
$I_j$ = the $j \times j$ identity matrix;
$\mathbb{P}(j)$ = the $j \times j$ matrix with entries $p_{kl}$, $k, l = 1, \dots, j$;
$\nu(j) = (\nu_1, \dots, \nu_j)'$ with $\nu_l = \lambda_l / (\lambda_1 + \dots + \lambda_j)$, $l = 1, \dots, j$;
$X(j) = (X_1, \dots, X_j)'$;
$U^{(j)}_l$ = the workload brought by a customer arriving at node $l$ towards the target set $B_j$, i.e. $U^{(j)}_l$ only includes those service times at nodes in $C_j$.
Lemma 2. For fixed $j = 1, \dots, r$, suppose the workload sequence $\{Y_n\}$ is defined with respect to the target set $B_j$; then $Y_1, Y_2, \dots$ are iid random variables, and there exists a random variable $Y$ such that (i) $Y$ and $Y_1$ have the same distribution; and (ii) $Y = \nu'(j) (I_j - \mathbb{P}(j))^{-1} X(j)$.
Proof. Recall (A1), (A2), (A3) and notice that the transition probability matrix $\mathbb{P}(j)$ does not depend on any arrival process or service time sequence. So $Y_1, Y_2, \dots$ are iid.
For an arbitrary arriving customer with workload $Y$, we have $Y = \sum_{l=1}^j \nu_l U^{(j)}_l$. Suppose he enters the network at node $l$. After time $X_l$ he may reach the target set $B_j$ with probability $1 - \sum_{i=1}^j p_{li}$; then no more workload is left with him. Or with probability $p_{li}$ he goes to node $i$ ($i \in C_j$); then his updated workload is $U^{(j)}_i$. Therefore
(3.5) $U^{(j)}_l = \sum_{i=1}^j p_{li} (X_l + U^{(j)}_i) + (1 - \sum_{i=1}^j p_{li}) X_l = X_l + \sum_{i=1}^j p_{li} U^{(j)}_i$, $1 \le l \le j$.
In matrix form (3.5) is written as
$(I_j - \mathbb{P}(j)) (U^{(j)}_1, \dots, U^{(j)}_j)' = X(j)$.
By (A4) every customer will reach the target set $B_j$ after entering the network and passing through a finite number of nodes in $C_j$. This implies that $I_j - \mathbb{P}(j)$ is invertible (cf. Klimov [3], Lemma 3). Therefore,
(3.6) $(U^{(j)}_1, \dots, U^{(j)}_j)' = (I_j - \mathbb{P}(j))^{-1} X(j)$.
where $Y$ is defined by Lemma 2.
Proof. Let
$N_t$ = the total # of customers arriving at all nodes of $C_j$ in $[0,t]$;
$S^{(j)}_t$ = the total residual workload at time $t$ with respect to the target set $B_j$, i.e. $S^{(j)}_t$ is the sum of workloads associated with all customers at those nodes of $C_j$ at time $t$.
So far we have completed the algorithm for computing the indices $M_1, \dots, M_r$.
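The workload recursion used in the proof of Lemma 2 can be evaluated in expectation. The sketch below is an illustration under explicit assumptions: it takes expectations in the recursion "workload at node l = service time at l plus the routed workload", solves the resulting linear system by fixed-point iteration (which converges when every customer eventually leaves the first j nodes), and then averages over the entry node with weights proportional to the external arrival rates. Function names and numbers are hypothetical.

```python
def expected_workloads(P_j, mean_x, tol=1e-12, max_iter=10000):
    """Expected workload for each entry node among the first j nodes.

    Iterates u <- mean_x + P_j u, the expectation of the workload
    recursion, until successive iterates agree within `tol`.
    """
    j = len(mean_x)
    u = list(mean_x)
    for _ in range(max_iter):
        new_u = [mean_x[l] + sum(P_j[l][i] * u[i] for i in range(j))
                 for l in range(j)]
        if max(abs(a - b) for a, b in zip(new_u, u)) < tol:
            return new_u
        u = new_u
    raise RuntimeError("iteration did not converge")


def expected_arrival_workload(lam_j, P_j, mean_x):
    """Expected workload of a customer arriving at the first j nodes."""
    total_rate = sum(lam_j)
    nu = [rate / total_rate for rate in lam_j]  # entry-node probabilities
    u = expected_workloads(P_j, mean_x)
    return sum(weight * ui for weight, ui in zip(nu, u))


# Hypothetical two-node example (same numbers as a small tandem model).
P_j = [[0.0, 0.5], [0.0, 0.0]]
u_mean = expected_workloads(P_j, [0.2, 0.4])
y_mean = expected_arrival_workload([1.0, 0.5], P_j, [0.2, 0.4])
```

In this example a customer entering at either node carries an expected workload of 0.4, so the averaged value is also 0.4.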

4. Construction of $\pi^*$
It usually happens that the optimal policy with respect to long-run average cost is the limit of the optimal policy for discounted cost as the discount factor tends to one. This is indeed the case between $\pi^*$ and $\pi_\alpha$.
Theorem 5. Under the light-traffic condition (*), $\pi_\alpha$ will tend to $\pi^*$ as $\alpha$ approaches zero.
Proof. In this queueing network a busy period is counted from the first arrival epoch (after the server was idle) to the first time that all nodes have empty queues. Assuming light traffic we have an alternating busy-idle sequence. Since only non-idling policies are considered, and all arrival processes and the transition matrix $\mathbb{P}(r)$ are policy-independent, it turns out that the duration of a busy period is policy-independent as well. And the successive busy periods form an iid sequence. The light-traffic condition also implies that a busy period has finite moments of any order. Then Theorem 5 follows from [5], section 7.4. □
For each $\alpha > 0$, $\pi_\alpha$ is characterized by the priority indices $M_1, \dots, M_r$ in Theorem 4.
To characterize $\pi^*$, we need to evaluate the asymptotic behavior of the $M_i$'s as $\alpha$ is close to zero.
Theorem 6. Let lim cM.j ,Ij, r.
Starting with the initial state $e_i$, the one stage expected reward is given by