Chip attach is the bottleneck operation in semiconductor assembly. Chip attach scheduling is, in nature, unrelated parallel machine scheduling subject to practical issues such as machine-job qualification, sequence-dependent setup times, initial machine status, and engineering time. The major scheduling objective is to minimize the total weighted unsatisfied Target Production Volume in the schedule horizon. To apply the Q-learning algorithm, the scheduling problem is converted into a reinforcement learning problem by constructing an elaborate system state representation, actions, and a reward function. We select five heuristics as actions and prove the equivalence of the reward function and the scheduling objective function. We also conduct experiments on industrial datasets to compare the Q-learning algorithm with the five action heuristics and the Largest Weight First (LWF) heuristic used in industry. The results show that Q-learning is remarkably superior to all six heuristics. Compared with LWF, Q-learning reduces three performance measures, the objective function value, the unsatisfied Target Production Volume index, and the unsatisfied job type index, by considerable amounts of 80.92%, 52.20%, and 31.81%, respectively.
1. Introduction
Semiconductor manufacturing consists of four basic steps: wafer fabrication, wafer sort, assembly, and test. Assembly and test are back-end steps. Semiconductor assembly contains many operations, such as reflow, wafer mount, saw, chip attach, deflux, EPOXY, cure, and PEVI. The IS factory is a back-end semiconductor manufacturing site where chip attach is the bottleneck operation of the assembly line. According to the Theory of Constraints (TOC), the capacity of a shop floor is determined by the capacity of its bottleneck, and a bottleneck operation has a tremendous impact on the performance of the whole shop floor. Consequently, scheduling of the chip attach station has a significant effect on the performance of the assembly line. Chip attach is performed in a station consisting of ten parallel machines; thus, chip attach scheduling is, in nature, a form of unrelated parallel machine scheduling under certain realistic restrictions.
Research on unrelated parallel machine scheduling focuses on two sorts of criteria: completion time or flow time related criteria and due date related criteria. Weng et al. [1] proposed a heuristic algorithm called "Algorithm 9" to minimize the total weighted completion time with setup consideration; it was demonstrated to be superior to six heuristic algorithms. Gairing et al. [2] presented an effective combinatorial approximation algorithm for the makespan objective. Mosheiov [3] and Mosheiov and Sidney [4] converted an unrelated parallel machine scheduling problem with a total flow time objective into a polynomial number of assignment problems; the scheduling problem was tackled by solving the derived assignment problems. Yu et al. [5] formulated unrelated parallel machine scheduling problems as mixed integer programs and dealt with them using Lagrangian Relaxation. They examined six measures, such as makespan and mean flow time, and achieved promising results compared with a modified FIFO method.
Besides completion time or flow time related criteria, tardiness objectives are also employed frequently. Dispatching rules are widely applied to production scheduling with a tardiness objective, such as Earliest Due Date (EDD), Shortest Processing Time (SPT), Critical Ratio (CR), Minimal Slack (MS), Modified Due Date (MDD) [6, 7], Apparent Tardiness Cost (ATC) [8, 9], and COVERT [10–12]. More complicated heuristic algorithms and local search methods have also been developed. Bank and Werner [13] addressed the problem of minimizing the weighted sum of linear earliness and tardiness penalties in unrelated parallel machine scheduling. They derived some structural properties useful for searching for an approximate solution and proposed various constructive and iterative heuristic algorithms. Liaw et al. [14] derived efficient lower and upper bounds for minimizing the total weighted tardiness using a two-phase heuristic based on the solution to an assignment problem. They also presented a branch-and-bound algorithm incorporating various dominance rules. Kim et al. [15] studied batch scheduling of unrelated parallel machines with a total weighted tardiness objective and setup time consideration. They examined four search heuristics for this problem: the earliest weighted due date, the shortest weighted processing time, the two-level batch scheduling heuristic, and the simulated annealing method.
In this paper, we are concerned with a particular Target Production Volume (TPV) oriented optimization objective. In real production in the IS factory, the planning department determines the TPV of each job type at the chip attach operation in a schedule horizon. Thus, the major objective of chip attach scheduling is to meet the TPVs to the fullest extent (see Section 2.1 for details). We apply reinforcement learning (RL), an artificial intelligence method, in this study. We first give a brief overview of reinforcement learning.
Reinforcement learning is a machine learning method proposed to approximately solve large-scale Markov Decision Process (MDP) or Semi-Markov Decision Process (SMDP) problems. In a reinforcement learning problem, an agent learns to select optimal or near-optimal actions for achieving its long-term goal (maximizing the total or average reward) through trial-and-error interactions with a dynamic environment. In this paper, we address RL problems that are episodic tasks, that is, problems with a terminal state. Sutton and Barto [16] defined four key elements of RL algorithms: policy, reward function, value function, and model of the environment. A policy determines the agent's action at each state. A reward function determines the payoff on transition from one state to another. A value function specifies the value of a state or a state-action pair in the long run, that is, the expected total reward for an episode. By learning from the interaction between the agent and its environment, value-based RL algorithms aim to approximate the optimal state or action value function through iteration and thus find a near-optimal policy. Compared with dynamic programming, RL algorithms do not need to know the transition probabilities and require less computational effort.
Q-learning, one of the most widely applied RL algorithms, is based on value iteration. It was first proposed by Watkins [17]. Convergence results for tabular Q-learning were obtained by Watkins and Dayan [18], Jaakkola et al. [19], and Tsitsiklis [20]. Bertsekas and Tsitsiklis [21] demonstrated that Q-learning produces the optimal policy in discounted reward problems under certain conditions. Q-learning uses Q(s,a), called the Q-value, to represent the value of a state-action pair. Q(s,a) is defined as follows:
\[
Q(s,a)=\sum_{s'\in S}p(s,a,s')\bigl[r(s,a,s')+\gamma V^{*}(s')\bigr],\tag{1}
\]
where S denotes the state space, p(s,a,s′) denotes the transition probability from s to s′ under action a, r(s,a,s′) denotes the reward on transition from s to s′ under action a, γ (0<γ≤1) is a discount factor, and V*(·) is the optimal state value function.
In terms of Bellman optimality function, the following holds for arbitrary s∈S, where A(s) denotes the set of actions available for state s:
\[
V^{*}(s)=\max_{a\in A(s)}Q(s,a).\tag{2}
\]
From (1) and (2), the following equation holds:
\[
Q(s,a)=\sum_{s'\in S}p(s,a,s')\Bigl[r(s,a,s')+\gamma\max_{a'\in A(s')}Q(s',a')\Bigr]\qquad\forall\,(s,a).\tag{3}
\]
Equation (3) is the basic transformation of Q-learning algorithm. The step-size version of Q-learning is
\[
Q(s,a)\leftarrow Q(s,a)+\alpha\Bigl[r(s,a,s')+\gamma\max_{a'\in A(s')}Q(s',a')-Q(s,a)\Bigr]\qquad\forall\,(s,a),\tag{4}
\]
where α (0<α≤1) is the learning rate. Using historical samples or simulation experiments, Q-learning obtains a near-optimal policy by driving the action-value function Q(s,a) towards the optimal action-value function Q*(s,a) through iteration based on formula (4).
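As a concrete illustration, update (4) can be sketched in a few lines of Python; the state and action labels below are hypothetical placeholders, not data from the scheduling problem.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One application of update (4):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]

Q = {}  # tabular action-value function; missing entries default to 0
q_update(Q, "s0", "a1", r=1.0, s_next="s1", actions=["a1", "a2"])
```

Since all Q-values start at zero, the first update moves Q(s0,a1) to α·r = 0.1.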
Recently, RL has drawn attention in production scheduling research. S. Riedmiller and M. Riedmiller [22] used Q-learning to solve a stochastic and dynamic job shop scheduling problem with an overall tardiness objective. Some typical heuristic dispatching rules, SPT, LPT, EDD, and MS, were chosen as actions and compared with the Q-learning method. Aydin and Öztemel [23] applied a Q-learning algorithm to minimize the mean tardiness of dynamic job shop scheduling. Their results showed that the RL-scheduling system outperformed each of the three rules (SPT, COVERT, and CR) used individually with a mean tardiness objective in most of the test cases. Hong and Prabhu [24] formulated the setup minimization problem (minimizing the sum of due date deviation and setup cost) in JIT manufacturing systems as an SMDP and solved it by a tabular Q-learning method. Experimental results showed that Q-learning algorithms achieved significant performance improvement over usual dispatching rules such as EDD in complex real-time shop floor control problems for JIT production. Wang and Usher [25] applied Q-learning to select dispatching rules for the single machine scheduling problem. Csáji et al. [26] proposed an adaptive iterative distributed scheduling algorithm operated in a market-based production control system, where every machine and job is associated with its own software agent. Singh et al. [27] proposed an online reinforcement learning algorithm for call admission control, optimizing the SMDP performance criterion with respect to a family of parameterized policies. Multi-agent reinforcement learning systems have also been applied to scheduling or control problems; see, for example, Kaya and Alhajj [28], Paternina-Arboleda and Das [29], Mariano-Romero et al. [30], Vengerov [31], and Iwamura et al. [32].
Applications of RL algorithms to scheduling problems have not been thoroughly explored in prior studies. In this study, we employ the Q-learning algorithm to solve the chip attach scheduling problem and achieve strong experimental results compared with six heuristic algorithms. The remainder of this paper is organized as follows. We describe the problem and convert it into an RL problem explicitly in Section 2, present the RL algorithm in Section 3, conduct the computational experiments and analysis in Section 4, and draw conclusions in Section 5.
2. RL Formulation
2.1. Problem Statement
The scheduling problem considered in this paper is described as follows. The work station for the chip attach operation consists of m parallel machines and processes n types of jobs. The larger the weight of a job type, the more important it is. Each job is processed on one machine only, and a machine processes at most one job at a time. Any job type (say, j) is only allowed to be processed on a subset Mj of the m parallel machines. Jobs of the same type j have a deterministic processing time pi,j (1≤i≤m; 1≤j≤n) if they are processed on machine i. The machines are unrelated; that is, pi,j is independent of pk,j for all job types j and all machines i≠k. Production is lot based, and one lot normally contains more than 1000 units. Thus, the processing time is the time for processing one lot, and processing is nonpreemptive (i.e., once a machine starts processing a lot, it cannot process another one until this lot is completely processed). The setup time between job types j1 and j2 is sj1,j2 (1≤j1, j2≤n). Setup times are deterministic and sequence dependent. Trivially, sj,j=0 holds for arbitrary j (1≤j≤n) and sj,x+sx,q>sj,q holds for arbitrary j,x,q (1≤j,x,q≤n).
The usage of a machine falls into one of two categories: engineering time (e.g., maintenance time) and production time. We only need to schedule production in the production time, that is, the total available time in a schedule horizon minus the engineering time. Production time is divided into initial production time and normal production time, since we consider the initial machine status in the schedule horizon. If a machine is processing a lot, called the "initial lot," at the beginning of a schedule horizon, it is not allowed to process any other lot until it completely processes the remaining units in the initial lot (called the initial volume). The time for processing the unprocessed initial volume in the initial lot is called the "initial production time." Since the production of nonbottleneck operations is determined by the bottleneck operation, we assume that jobs are always available for processing at the chip attach operation when they are needed.
The primary objective of chip attach scheduling is to minimize the total weighted unsatisfied TPV of a schedule horizon. Since semiconductor manufacturing equipment is very expensive, machine utilization should be kept at a high level. Hence, on the premise that the TPVs of all job types are entirely satisfied, the secondary objective of chip attach scheduling is to process as much weighted excess volume as possible to relieve the burden of the next schedule horizon. The objective function is formulated as follows:
\[
\min\ \sum_{j=1}^{n}w_j\,(D_j-Y_j)^{+}-\sum_{j=1}^{n}w_j M\,(Y_j-D_j)^{+},\tag{5}
\]
where wj (1≤j≤n) is the weight per unit of job type j, Dj (1≤j≤n) is the predetermined TPV of job type j (including the initial volume in the initial lots), and Yj (1≤j≤n) is the processed volume of job type j. Dj can be represented as follows:
\[
D_j=\sum_{i=1}^{m}\omega(i,j)\,I_i+k_j L\qquad(k_j=0,1,\ldots),\tag{6}
\]
where Ii denotes the initial volume in the initial lot processed by machine i at the beginning of the schedule horizon, L is lot size, and
\[
\omega(i,j)=\begin{cases}1,&\text{if machine } i \text{ is processing job type } j \text{ at the beginning of the schedule horizon},\\[2pt]0,&\text{otherwise}.\end{cases}\tag{7}
\]
The calculation of Yj is rate based, interpreted as follows. Suppose machine i processes lot LQ (belonging to job type q), followed by lot LJ (belonging to job type j). Let t_s^{LJ} denote the start time of the setup for LJ; then the completion time of LJ is t_s^{LJ}+s_{q,j}+p_{i,j}. Let ΔY_{i,j}(t) denote the increase in the processed volume of job type j due to processing LJ on machine i from time t_s^{LJ} to t, defined as follows:
\[
\Delta Y_{i,j}(t)=\frac{\bigl(t-t_s^{LJ}\bigr)L}{s_{q,j}+p_{i,j}}\qquad\bigl(t_s^{LJ}\le t\le t_s^{LJ}+s_{q,j}+p_{i,j}\bigr).\tag{8}
\]
M is a sufficiently large positive number, set according to the following inequality:
\[
M>\max\left\{\frac{(w_q+w_x)\,(s_{v,j}+p_{i,j})}{w_j\,p_{i,q}},\ \frac{w_q\,(s_{v,x}+p_{i,x})\,(s_{x,j}+p_{i,j})}{p_{i,j}\,\min_{1\le c\le m,\;1\le a,b,k\le n}\{w_k\,(s_{a,b}+p_{c,b})\}}\right\}\qquad(\forall\,1\le i\le m,\ 1\le j,q,v,x\le n).\tag{9}
\]
For an optimal schedule minimizing objective function (5), if (9) holds and there exists j (1≤j≤n) such that Yj>Dj, then
\[
Y_j+\sum_{i=1}^{m}\beta(i,j)\,U_i\ge D_j\qquad(\forall\,1\le j\le n),\tag{10}
\]
where Ui denotes the unprocessed volume in the last lot processed by machine i at the end of this schedule horizon (i.e., the initial volume of the next schedule horizon) and
\[
\beta(i,j)=\begin{cases}1,&\text{if machine } i \text{ is processing job type } j \text{ at the end of the schedule horizon},\\[2pt]0,&\text{otherwise}.\end{cases}\tag{11}
\]
According to inequality (9), in any schedule minimizing objective function (5), any machine will not process a lot belonging to a job type whose TPV has been satisfied until TPV of any other job types is also fully satisfied. In other words, inequality (9) guarantees that the objective function takes minimization of the total weighted unsatisfied TPV (the first item of objective function (5)) as the first priority. The fundamental problem in applying reinforcement learning to scheduling is to convert scheduling problems into RL problems, including representation of state, construction of actions, and definition of reward function.
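For illustration, objective function (5) is straightforward to compute once the weights, TPVs, and processed volumes are known; the numbers below are hypothetical.

```python
def objective(w, D, Y, M):
    """Objective (5): total weighted unsatisfied TPV, minus the M-weighted
    excess volume processed beyond the TPVs (the secondary objective)."""
    unsatisfied = sum(wj * max(Dj - Yj, 0) for wj, Dj, Yj in zip(w, D, Y))
    excess = sum(wj * M * max(Yj - Dj, 0) for wj, Dj, Yj in zip(w, D, Y))
    return unsatisfied - excess

# Two job types: type 1 is 100 units short of its TPV, type 2 exceeds its TPV by 50.
val = objective(w=[2.0, 1.0], D=[1000.0, 500.0], Y=[900.0, 550.0], M=10.0)
```

Here the value is 2·100 − 1·10·50 = −300; smaller (more negative) values are better.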
2.2. State Representation and Transition Probability
We first define the state variables. State variables describe the major characteristics of the system and are capable of tracking the change of the system status. The system state can be represented by the vector
\[
\varphi=\bigl[\,T_i^{0}\ (1\le i\le m);\ T_i\ (1\le i\le m);\ t_i\ (1\le i\le m);\ d_j\ (1\le j\le n);\ e_i\ (1\le i\le m)\,\bigr],\tag{12}
\]
where T_i^0 (1≤i≤m) denotes the job type of the latest lot completely processed on machine i, T_i (1≤i≤m) denotes the job type of the lot being processed on machine i (T_i equals zero if machine i is idle), t_i (1≤i≤m) denotes the time elapsed since the beginning of the latest setup on machine i (for convenience, we assume there is a zero-time setup if T_i^0=T_i), d_j (1≤j≤n) is the unsatisfied TPV (i.e., (D_j−Y_j)^+), and e_i (1≤i≤m) represents the unscheduled normal production time of machine i.
Considering the initial status of machines, the initial system state of the schedule horizon is
\[
s_0=\bigl[\,T_{i,0}^{0}\ (1\le i\le m);\ T_{i,0}\ (1\le i\le m);\ t_{i,0}\ (1\le i\le m);\ D_j\ (1\le j\le n);\ TH-\sigma_i-TE_i\ (1\le i\le m)\,\bigr],\tag{13}
\]
where TH denotes the overall available time in the schedule horizon, σi denotes the initial production time of machine i, and TEi denotes the engineering time of machine i.
There are two kinds of events triggering state transitions: (1) completion of processing a lot on one or more machines; (2) any machine’s normal production time is entirely scheduled. If the triggering event is completion of processing, the state at the decision-making epoch is represented as
\[
s_d=\bigl[\,T_{i,d}^{0}\ (1\le i\le m);\ T_{i,d}\ (1\le i\le m);\ t_{i,d}\ (1\le i\le m);\ d_{j,d}\ (1\le j\le n);\ e_{i,d}\ (1\le i\le m)\,\bigr],\tag{14}
\]
where {i∣Ti,d=0,1≤i≤m}≠Φ. If Ti,d=0 (machine i is idle), then ti,d=0. If the triggering event is using up a machine’s normal production time, then {i∣ei,d=0,1≤i≤m}≠Φ.
Assume that after taking action a, the system state immediately transfers from sd to an interim state, s, as follows:
\[
s=\bigl[\,T_i^{0}\ (1\le i\le m);\ T_i\ (1\le i\le m);\ t_i\ (1\le i\le m);\ d_j\ (1\le j\le n);\ e_i\ (1\le i\le m)\,\bigr],\tag{15}
\]
where Ti>0 for all i(1≤i≤m); that is, all machines are busy.
Let Δt denote the sojourn time at state s; then
\[
\Delta t=\min\Bigl\{\min_{1\le i\le m}\bigl\{s_{T_i^0,T_i}+p_{i,T_i}-t_i\bigr\},\ \min_{1\le i\le m}\bigl\{e_i\mid e_i>0\bigr\}\Bigr\}.
\]
Let Λ={i ∣ s_{T_i^0,T_i}+p_{i,T_i}−t_i=Δt}; then the state at the next decision-making epoch is represented as
\[
s'=\Bigl[\,T_i\ (i\in\Lambda),\ T_i^0\ (i\notin\Lambda);\ T_i=0\ (i\in\Lambda),\ T_i\ (i\notin\Lambda);\ 0\ (i\in\Lambda),\ t_i+\Delta t\ (i\notin\Lambda);\ d_j-\Delta t\,L\sum_{i=1}^{m}\frac{\delta_Y(T_i,j)}{s_{T_i^0,j}+p_{i,j}}\ (1\le j\le n);\ \max\{e_i-\Delta t,0\}\ (1\le i\le m)\,\Bigr],\tag{16}
\]
where
\[
\delta_Y(T_i,j)=\begin{cases}1,&\text{if } T_i=j,\\[2pt]0,&\text{if } T_i\ne j.\end{cases}\tag{17}
\]
Apparently we have Psd,s′a=1, where Psd,s′a denotes the one-step transition probability from state sd to state s′ under action a. Let su and τu denote the system state and time, respectively, at the uth decision-making epoch. It is easy to show that
\[
P\{s_{u+1}=X,\ \tau_{u+1}-\tau_u\le t\mid s_0,s_1,\ldots,s_u;\ \tau_0,\tau_1,\ldots,\tau_u\}=P\{s_{u+1}=X,\ \tau_{u+1}-\tau_u\le t\mid s_u;\tau_u\},\tag{18}
\]
where τu+1-τu is the sojourn time at state su. That is, the decision process associated with (s,τ) is a Semi-Markov Decision Process with particular transition probability and sojourn times. The terminal state of an episode is
\[
s_e=\bigl[\,T_{i,e}^{0}\ (1\le i\le m);\ T_{i,e}\ (1\le i\le m);\ t_{i,e}\ (1\le i\le m);\ d_{j,e}\ (1\le j\le n);\ 0\ (1\le i\le m)\,\bigr].\tag{19}
\]
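The sojourn-time computation of Section 2.2 can be sketched as follows; the two-machine instance is hypothetical.

```python
def sojourn_time(setup, p, T0, T, t, e):
    """Sojourn time at an interim state (Section 2.2): the earlier of the first
    lot completion, min_i {setup[T0_i][T_i] + p[i][T_i] - t_i}, and the smallest
    positive remaining normal production time, min_i {e_i | e_i > 0}."""
    m = len(T)
    first_completion = min(setup[T0[i]][T[i]] + p[i][T[i]] - t[i] for i in range(m))
    first_exhaustion = min(ei for ei in e if ei > 0)
    return min(first_completion, first_exhaustion)

# Two machines, two job types (indexed 0 and 1); all quantities hypothetical.
dt = sojourn_time(setup=[[0, 2], [3, 0]], p=[[5, 6], [4, 7]],
                  T0=[0, 1], T=[1, 0], t=[1, 0], e=[10, 3])
```

In this instance both lots would finish 7 time units later, but machine 1 exhausts its normal production time after 3, so the sojourn time is 3.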
2.3. Action
Prior domain knowledge can be utilized to fully exploit the agent's learning ability. Apparently, an optimal schedule must be nonidle (i.e., no machine has idle time during the whole schedule). More than one machine may be free at the same decision-making epoch. An action determines which lot is to be processed on which machine. In the following, we define seven actions using heuristic algorithms.
Action 1.
Select jobs by WSPT heuristics as follows.
Algorithm 1.
WSPT heuristics.
Step 1.
Let SM denote the set of free machines at a decision-making epoch.
Step 2.
Choose machine k to process job type q, with (k,q)=argmin(i,j){(sTi0,j+pi,j)/wj∣1≤j≤n, i∈Mj and i∈SM}.
Step 3.
Remove k from SM. If SM≠Φ, go to Step 2; otherwise, the algorithm halts.
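Algorithm 1 can be sketched as follows; the instance data are hypothetical, and `qualified[j]` stands for the machine set Mj.

```python
def wspt(free_machines, setup, p, w, T0, qualified):
    """WSPT heuristic (Algorithm 1): repeatedly assign to some free machine the
    job type minimizing (setup + processing time) / weight over all qualified
    machine-job pairs, removing each machine once it receives a job."""
    SM = set(free_machines)
    assignment = {}
    n = len(w)
    while SM:
        k, q = min(((i, j) for i in SM for j in range(n) if i in qualified[j]),
                   key=lambda ij: (setup[T0[ij[0]]][ij[1]] + p[ij[0]][ij[1]]) / w[ij[1]])
        assignment[k] = q
        SM.remove(k)
    return assignment

# Hypothetical instance: two machines, two job types, both fully qualified.
plan = wspt(free_machines=[0, 1], setup=[[0, 2], [3, 0]], p=[[5, 6], [4, 7]],
            w=[2.0, 1.0], T0=[0, 1], qualified=[{0, 1}, {0, 1}])
```

Machine 0 gets job type 0 (ratio 5/2 = 2.5, the global minimum), and then machine 1 also prefers job type 0 (ratio 7/2 = 3.5 versus 7 for type 1).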
Action 2.
Select jobs by MWSPT (modified WSPT) heuristics as follows.
Algorithm 2.
MWSPT heuristics.
Step 1.
Define SM as Step 1 in Algorithm 1, and let SJ denote the set of job types whose TPVs have not been satisfied at a decision-making epoch; that is, SJ={j∣Yj<Dj,1≤j≤n}. If SJ=Φ, go to Step 4.
Step 2.
Choose job type q to process on machine k, with (k,q)=argmin(i,j){(sTi0,j+pi,j)/wj∣j∈SJ, i∈Mj and i∈SM}.
Step 3.
Remove k from SM. Set Yq=Yq+L and update SJ. If SJ≠Φ and SM≠Φ, go to Step 2; if SJ=Φ and SM≠Φ, go to Step 4; otherwise, the algorithm halts.
Step 4.
Choose machine k to process job type q, with (k,q)=argmin(i,j){(sTi0,j+pi,j)/wj∣1≤j≤n, i∈Mj and i∈SM}.
Step 5.
Remove k from SM. If SM≠Φ, go to Step 4; otherwise, the algorithm halts.
Action 3.
Select jobs by Ranking Algorithm (RA) as follows.
Algorithm 3.
Ranking Algorithm.
Step 1.
Define SM and SJ as Step 1 in Algorithm 2. If SJ=Φ, go to Step 5.
Step 2.
For each job type j (j∈SJ), sort the machines in increasing order of (s_{V_i,j}+p_{i,j}) (1≤i≤m), where V_i is defined as follows:
\[
V_i=\begin{cases}T_i,&\text{if machine } i \text{ is busy},\\[2pt]T_i^0,&\text{if machine } i \text{ is free}\end{cases}\qquad(1\le i\le m).\tag{20}
\]
Let g_{i,j} (1≤g_{i,j}≤m) denote the order of machine i (1≤i≤m) for job type j (1≤j≤n).
Step 3.
Choose job type q to process on machine k, with (k,q)=argmin(i,j){gi,j ∣ j∈SJ, i∈Mj, and i∈SM}. If two or more machine-job combinations (say, (i1,j1), (i2,j2), …, (ih,jh)) attain the same minimal order, that is, (ie,je)=argmin(i,j){gi,j ∣ j∈SJ, i∈Mj, and i∈SM} holds for every e (1≤e≤h), then choose job type je to process on machine ie, with (ie,je)=argmin(i,j){(sVie,je+pie,je)/wje ∣ 1≤e≤h}.
Step 4.
Remove k or ie from SM. Set Yq=Yq+L or Yje=Yje+L and update SJ. If SJ≠Φ and SM≠Φ, go to Step 3; if SJ=Φ and SM≠Φ, go to Step 5; otherwise, the algorithm halts.
Step 5.
Choose job type q to process on machine k, with (k,q)=argmin(i,j){gi,j ∣ 1≤j≤n, i∈Mj, and i∈SM}. If two or more machine-job combinations (say, (i1,j1), (i2,j2), …, (ih,jh)) attain the same minimal order, choose job type je to process on machine ie, with (ie,je)=argmin(i,j){(sVie,je+pie,je)/wje ∣ 1≤e≤h}.
Step 6.
Remove k or ie from SM. If SM≠Φ, go to Step 5; otherwise, the algorithm halts.
Action 4.
Select jobs by LFM-MWSPT heuristics as follows.
Algorithm 4.
LFM-MWSPT heuristics.
Step 1.
Define SM and SJ as Step 1 in Algorithm 2.
Step 2.
Select a free machine (say, k) from SM by LFM (Least Flexible Machine; see [33]) rule and choose a job type to process on machine k following MWSPT heuristics.
Step 3.
Remove k from SM. If SM≠Φ, go to Step 2; otherwise, the algorithm halts.
Action 5.
Select jobs by LFM-RA heuristics as follows.
Algorithm 5.
LFM-RA heuristics.
Step 1.
Define SM and SJ as Step 1 in Algorithm 2.
Step 2.
Select a free machine (say, k) from SM by LFM rule and choose a job type to process on machine k following Ranking Algorithm.
Step 3.
Remove k from SM. If SM≠Φ, go to Step 2; otherwise, the algorithm halts.
Action 6.
Each free machine selects the same job type as the latest one it processed.
Action 7.
Select no job.
At the start of a schedule horizon, the system is at the initial state s0. If there are free machines, they select jobs to process by taking one of Actions 1-6; otherwise, Action 7 is chosen. Afterwards, whenever any machine completes processing a lot or any machine's normal production time is completely scheduled, the system transfers into a new state, su. The agent selects an action at this decision-making epoch and the system state transfers into an interim state, s. When, again, any machine completes processing a lot or any machine's normal production time is used up, the system transfers into the next decision-making state su+1 and the agent receives reward ru+1, which is computed from su and the sojourn time between the two transitions into su and su+1 (as shown in Section 2.4). This procedure is repeated until a terminal state is reached. An episode is a trajectory from the initial state to a terminal state of a schedule horizon. Action 7 is available only at decision-making states in which all machines are busy.
2.4. Reward Function
A reward function must satisfy several requirements. It should indicate the instant impact of an action on the schedule, that is, link the action with an immediate reward. Moreover, the accumulated reward should indicate the objective function value; that is, the agent receives a large total reward exactly when the objective function value is small.
Definition 6 (reward function).
Let K denote the number of decision-making epochs during an episode, tu (0≤u≤K) the time at the uth decision-making epoch, Ti,u (1≤i≤m, 1≤u≤K) the job type of the lot which machine i processes during the time interval (tu−1,tu], Ti,u0 the job type of the lot which precedes the lot machine i processes during the interval (tu−1,tu], and Yj(tu) the processed volume of job type j by time tu. It follows that
\[
Y_j(t_u)-Y_j(t_{u-1})=\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}},\tag{21}
\]
where δ(i,j) is an indicator function defined as
\[
\delta(i,j)=\begin{cases}1,&T_{i,u}=j,\\[2pt]0,&T_{i,u}\ne j.\end{cases}\tag{22}
\]
Let ru(u=1,2,…,K) denote the reward function at the uth decision-making epoch. ru is defined as
\[
r_u=\sum_{j=1}^{n}\Biggl[\min\biggl\{\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}},\ \bigl[D_j-Y_j(t_{u-1})\bigr]^{+}\biggr\}\,w_j+\max\biggl\{\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}}-\bigl[D_j-Y_j(t_{u-1})\bigr]^{+},\,0\biggr\}\,w_j M\Biggr].\tag{23}
\]
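Reward (23) can be computed directly once the processed volume of each job type during the sojourn time is known; the sketch below takes those volumes (`delta_Y`) as given, and the instance data are hypothetical.

```python
def step_reward(delta_Y, w, D, Y_prev, M):
    """Reward (23) for one transition: volume that fills the remaining TPV earns
    w_j per unit, while volume beyond the remaining TPV earns w_j * M per unit.
    delta_Y[j] is the volume of job type j processed during the sojourn time."""
    r = 0.0
    for j in range(len(w)):
        remaining = max(D[j] - Y_prev[j], 0.0)   # [D_j - Y_j(t_{u-1})]^+
        r += min(delta_Y[j], remaining) * w[j]
        r += max(delta_Y[j] - remaining, 0.0) * w[j] * M
    return r

# Type 0 had 10 units of TPV left; type 1 was already satisfied (hypothetical data).
r = step_reward(delta_Y=[20.0, 10.0], w=[2.0, 1.0], D=[100.0, 50.0],
                Y_prev=[90.0, 50.0], M=10.0)
```

Here the reward is 10·2 + 10·2·10 + 10·1·10 = 320.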
The reward function has the following property.
Theorem 7.
Maximization of the total reward R in an episode is equivalent to minimization of objective function (5).
Proof.
The total reward in an episode is
\[
\begin{aligned}
R=\sum_{u=1}^{K}r_u&=\sum_{u=1}^{K}\sum_{j=1}^{n}\Biggl[\min\biggl\{\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}},\ \bigl[D_j-Y_j(t_{u-1})\bigr]^{+}\biggr\}\,w_j+\max\biggl\{\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}}-\bigl[D_j-Y_j(t_{u-1})\bigr]^{+},\,0\biggr\}\,w_j M\Biggr]\\
&=\sum_{j=1}^{n}\sum_{u=1}^{K}\Biggl[\min\biggl\{\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}},\ \bigl[D_j-Y_j(t_{u-1})\bigr]^{+}\biggr\}\,w_j+\max\biggl\{\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}}-\bigl[D_j-Y_j(t_{u-1})\bigr]^{+},\,0\biggr\}\,w_j M\Biggr].
\end{aligned}\tag{24}
\]
It is easy to show that
\[
Y_j=\sum_{u=1}^{K}\sum_{i=1}^{m}\frac{(t_u-t_{u-1})\,\delta(i,j)\,L}{s_{T_{i,u}^{0},T_{i,u}}+p_{i,T_{i,u}}}.\tag{25}
\]
It follows that
\[
\begin{aligned}
R&=\sum_{j=1}^{n}\bigl[w_j\min\{D_j,Y_j\}+w_jM\max\{0,Y_j-D_j\}\bigr]\\
&=\sum_{j\in\Omega_1}\bigl[w_jD_j+w_jM(Y_j-D_j)\bigr]+\sum_{j\in\Omega_2}w_jY_j\\
&=\sum_{j=1}^{n}w_jD_j-\Bigl\{\sum_{j\in\Omega_1}\bigl[-w_jM(Y_j-D_j)\bigr]+\sum_{j\in\Omega_2}w_j(D_j-Y_j)\Bigr\}\\
&=\sum_{j=1}^{n}w_jD_j-\sum_{j=1}^{n}\bigl[w_j(D_j-Y_j)^{+}-w_jM(Y_j-D_j)^{+}\bigr],
\end{aligned}\tag{26}
\]
where Ω1={j∣Yj>Dj} and Ω2={j∣Yj≤Dj}. Since ∑j=1nwjDj is a constant, it follows that
\[
\max R\ \Longleftrightarrow\ \min\ \sum_{j=1}^{n}\bigl[w_j(D_j-Y_j)^{+}-w_jM(Y_j-D_j)^{+}\bigr].\tag{27}
\]
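The equivalence of Theorem 7 can be checked numerically through the closed form (26); the instance below is hypothetical.

```python
def episode_reward(w, D, Y, M):
    """Closed form (26) of the accumulated reward R over an episode."""
    return sum(wj * min(Dj, Yj) + wj * M * max(0.0, Yj - Dj)
               for wj, Dj, Yj in zip(w, D, Y))

def objective(w, D, Y, M):
    """Objective function (5)."""
    return sum(wj * max(Dj - Yj, 0.0) - wj * M * max(Yj - Dj, 0.0)
               for wj, Dj, Yj in zip(w, D, Y))

# Theorem 7 in numbers: R = sum_j w_j D_j - objective, so maximizing R
# is the same as minimizing (5). All data are hypothetical.
w, D, Y, M = [2.0, 1.0], [1000.0, 500.0], [900.0, 550.0], 10.0
lhs = episode_reward(w, D, Y, M)
rhs = sum(wj * Dj for wj, Dj in zip(w, D)) - objective(w, D, Y, M)
```

For this instance both sides equal 2800, as the theorem requires.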
3. The Reinforcement Learning Algorithm
The chip attach scheduling problem was converted into an RL problem with a terminal state in Section 2. To apply Q-learning to this RL problem, another issue arises: how to tailor the Q-learning algorithm to this particular context. Since some state variables are continuous, the state space is infinite. The RL system is therefore not in tabular form, and it is impossible to maintain Q-values for all state-action pairs. Thus, we approximate the Q-value function by a linear function trained with a gradient-descent method. Q-values are represented as a linear combination of a set of basis functions, Φk(s) (1≤k≤4m+n), as shown in the next formula:
\[
Q(s,a)=\sum_{k=1}^{4m+n}c_k^{a}\,\Phi_k(s),\tag{28}
\]
where cka(1≤a≤6,1≤k≤4m+n) are the weights of basis functions. Each state variable corresponds to a basis function. The following basis functions are defined to normalize the state variables:
\[
\Phi_k(s)=\begin{cases}
\dfrac{T_k^{0}}{n},&1\le k\le m,\\[6pt]
\dfrac{T_{k-m}}{n},&m+1\le k\le 2m,\\[6pt]
\dfrac{t_{k-2m}}{\max\{s_{j_1,j_2}+p_{j_2}\mid 1\le j_1\le n,\ 1\le j_2\le n\}},&2m+1\le k\le 3m,\\[6pt]
\dfrac{d_{k-3m}}{D_{k-3m}},&3m+1\le k\le 3m+n,\\[6pt]
\dfrac{e_{k-3m-n}}{TH},&3m+n+1\le k\le 4m+n.
\end{cases}\tag{29}
\]
Let Ca denote the vector of weights of basis functions as follows:
\[
C^{a}=\bigl(c_1^{a},c_2^{a},\ldots,c_{4m+n}^{a}\bigr)^{T}.\tag{30}
\]
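Approximation (28) with the normalized basis functions (29) can be sketched as follows; `max_setup_proc` stands for the maximum of s_{j1,j2}+p_{j2} appearing in (29), and the instance data are hypothetical.

```python
def features(T0, T, t, d, e, n, max_setup_proc, D, TH):
    """Basis functions (29): each group of state variables is divided by a
    natural upper bound so that all features lie on comparable scales.
    Returns Phi(s) of length 4m + n."""
    phi = [x / n for x in T0]                  # last completed job types
    phi += [x / n for x in T]                  # job types currently in process
    phi += [x / max_setup_proc for x in t]     # time since the latest setup
    phi += [dj / Dj for dj, Dj in zip(d, D)]   # unsatisfied TPV fractions
    phi += [x / TH for x in e]                 # unscheduled normal production time
    return phi

def q_value(C_a, phi):
    """Q(s,a) as the linear combination (28) for one action's weight vector."""
    return sum(c * f for c, f in zip(C_a, phi))

# m = 2 machines, n = 2 job types; all numbers hypothetical.
phi = features(T0=[1, 2], T=[2, 1], t=[3, 0], d=[100, 0], e=[40, 20],
               n=2, max_setup_proc=10, D=[200, 50], TH=80)
```

The resulting feature vector has 4·2 + 2 = 10 entries, each in [0, 1].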
The RL algorithm is presented as Algorithm 8, where α is learning rate, γ is a discount factor, E(a) is the vector of eligibility traces for action a, δ(a) is an error variable for action a, and λ is a factor for updating eligibility traces.
Algorithm 8.
Q-learning with linear gradient-descent function approximation for chip attach scheduling.
Initialize Ca and E(a) randomly. Set parameters α, γ, and λ.
Let num_episode denote the number of episodes having been run. Set num_episode = 0.
While num_episode < MAX_EPISODE do
Set the current decision-making state s←s0.
While at least one of state variables ei(1≤i≤m) is larger than zero do
Select action a for state s by ε-greedy policy.
Implement action a. Determine the next event for triggering state transition and the sojourn time. Once any machine completes processing a lot or any machine’s normal production time is completely scheduled, the system transfers into a new decision-making state, s′(ei′(1≤i≤m) is a component of s′).
Compute reward rs,s′a.
Update the vector of weights in the approximate Q-value function of action a:
\[
\begin{aligned}
\delta(a)&\longleftarrow r_{s,s'}^{a}+\gamma\max_{a'}Q(s',a')-Q(s,a),\\
E(a)&\longleftarrow\lambda E(a)+\nabla_{C^{a}}Q(s,a),\\
C^{a}&\longleftarrow C^{a}+\alpha\,\delta(a)\,E(a).
\end{aligned}\tag{31}
\]
Set s←s′.
If ei=0 holds for all i(1≤i≤m), set num_episode = num_episode + 1.
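The inner update (31) of Algorithm 8 can be sketched as follows; for the linear form (28), the gradient ∇_{C^a}Q(s,a) reduces to the feature vector Φ(s). The two-feature instance is hypothetical.

```python
def td_step(C, E, phi, a, r, phi_next, actions, alpha, gamma, lam):
    """One pass of update (31): TD error, eligibility-trace decay, and a
    gradient step on the weights of the chosen action."""
    def q(vec, f):
        return sum(c * x for c, x in zip(vec, f))   # linear Q-value (28)
    delta = r + gamma * max(q(C[b], phi_next) for b in actions) - q(C[a], phi)
    E[a] = [lam * ei + fi for ei, fi in zip(E[a], phi)]      # E <- lambda*E + Phi(s)
    C[a] = [ci + alpha * delta * ei for ci, ei in zip(C[a], E[a])]

# Two actions, two features; weights and traces start at zero (hypothetical).
C = {0: [0.0, 0.0], 1: [0.0, 0.0]}
E = {0: [0.0, 0.0], 1: [0.0, 0.0]}
td_step(C, E, phi=[1.0, 0.0], a=0, r=1.0, phi_next=[0.0, 1.0],
        actions=[0, 1], alpha=0.5, gamma=0.9, lam=0.8)
```

With all weights initially zero, the TD error is 1, so the trace for action 0 becomes [1, 0] and its weights move to [0.5, 0].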
4. Experiment Results
In the past, the company used a manual process to conduct chip attach scheduling. A heuristic algorithm called Largest Weight First (LWF) was used as follows.
Algorithm 9 (Largest Weight First (LWF) heuristics).
Initialize SM with the set of all machines (i.e., SM={i∣1≤i≤m}) and define SJ as Step 1 in Algorithm 2. Initialize ei(1≤i≤m) with each machine’s normal production time. Set Yj=Ij, where Ij is the initial production volume of job type j.
Step 1.
Schedule the job types in decreasing order of weights in order to meet their TPVs.
While SJ≠Φ and SM≠Φ do
Choose job q with q = argmax{wj∣j∈SJ}.
While SM∩Mq≠Φ and Yq<Dq do
Choose machine i to process job q, with i = argmin{pk,q/wq∣k∈SM∩Mq}.
If ei−sTi0,q<(Dq−Yq)pi,q/L, then
set Yq←Yq+L(ei−sTi0,q)/pi,q, ei=0, and remove i from SM;
else, set ei←ei−sTi0,q−(Dq−Yq)pi,q/L and Yq=Dq.
Set Ti0=q.
Step 2.
Allocate the excess production capacity.
If SM≠Φ, then
For each machine i(i∈SM),
Choose job j with j=argmax{(ei-sTi0,q)wq/pi,q∣1≤q≤n}, set Yj←Yj+L(ei-sTi0,j)/pi,j, ei=0.
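Step 1 of Algorithm 9 can be sketched as follows, with the excess-capacity allocation of Step 2 omitted for brevity; the single-machine instance is hypothetical, and the rate-based bookkeeping follows the text.

```python
def lwf(w, D, p, setup, T0, e, qualified, L):
    """Simplified sketch of Step 1 of LWF (Algorithm 9): serve job types in
    decreasing weight order, using the most efficient qualified free machine
    until each TPV is met or capacity runs out."""
    Y = [0.0] * len(w)
    SM = {i for i in range(len(e)) if e[i] > 0}
    for q in sorted(range(len(w)), key=lambda j: -w[j]):   # largest weight first
        while SM & qualified[q] and Y[q] < D[q]:
            i = min(SM & qualified[q], key=lambda k: p[k][q] / w[q])
            avail = e[i] - setup[T0[i]][q]                 # time left after setup
            need = (D[q] - Y[q]) * p[i][q] / L             # time to meet the TPV
            if avail < need:                               # machine exhausted
                Y[q] += L * avail / p[i][q]
                e[i] = 0.0
                SM.discard(i)
            else:                                          # TPV met on machine i
                e[i] = avail - need
                Y[q] = D[q]
                if e[i] <= 0:
                    SM.discard(i)
            T0[i] = q
    return Y, e

# One machine, one job type: 30 time units of capacity, 2 lots (100 units) needed.
Y, e = lwf(w=[1.0], D=[100.0], p=[[10.0]], setup=[[0.0]], T0=[0],
           e=[30.0], qualified=[{0}], L=50.0)
```

Producing 100 units takes 100·10/50 = 20 time units, so the TPV is met with 10 units of capacity to spare.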
The chip attach station consists of 10 machines and normally processes more than ten job types. We selected 12 sets of industrial data for experiments comparing the Q-learning algorithm (Algorithm 8) and the six heuristics (Algorithms 1-5 and 9): WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. For each dataset, Q-learning repeatedly solves the scheduling problem 1000 times and selects the best of the 1000 solutions. Table 1 shows the objective function values of all datasets using the seven algorithms. Individually, each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA obtains a larger objective function value than LWF on every dataset. Nevertheless, taking WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA as actions, the Q-learning algorithm achieves an objective function value much smaller than LWF on each dataset. In Tables 1-4, the bottom row presents the average value over all datasets. As shown in Table 1, the average objective function value of Q-learning is only 12.233, less than that of LWF, 64.147, by a large margin of 80.92%.
Table 1: Comparison of objective function values using heuristics and Q-learning.

| Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning |
|---|---|---|---|---|---|---|---|
| 1 | 88.867 | 78.116 | 59.758 | 87.689 | 57.582 | 42.253 | −3.8613 |
| 2 | 138.44 | 135.86 | 110.747 | 126.69 | 109.07 | 95.926 | 7.6657 |
| 3 | 119.01 | 108.75 | 124.09 | 104.25 | 121.90 | 83.332 | 23.775 |
| 4 | 83.681 | 60.797 | 39.073 | 69.405 | 45.575 | 33.920 | −4.4275 |
| 5 | 129.38 | 128.47 | 96.960 | 109.17 | 99.827 | 89.863 | 21.414 |
| 6 | 70.840 | 55.692 | 51.108 | 66.213 | 51.041 | 16.930 | −5.4467 |
| 7 | 120.90 | 100.60 | 95.399 | 109.33 | 90.754 | 76.422 | 27.374 |
| 8 | 102.42 | 107.80 | 116.56 | 103.33 | 107.62 | 93.663 | 11.840 |
| 9 | 94.606 | 87.914 | 81.763 | 88.812 | 80.331 | 60.164 | 33.036 |
| 10 | 90.803 | 88.164 | 90.773 | 87.926 | 88.56293 | 56.307 | 22.798 |
| 11 | 111.13 | 88.287 | 82.916 | 97.882 | 85.605 | 60.160 | 16.493 |
| 12 | 100.29 | 89.005 | 86.692 | 95.836 | 78.342 | 60.744 | −3.8617 |
| Average | 104.19 | 94.123 | 86.321 | 95.547 | 84.685 | 64.147 | 12.233 |
Table 2: Comparison of unsatisfied TPV index using heuristics and Q-learning.

| Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning |
|---|---|---|---|---|---|---|---|
| 1 | 0.1179 | 0.1025 | 0.0789 | 0.1170 | 0.0754 | 0.0554 | 0.0081 |
| 2 | 0.1651 | 0.1611 | 0.1497 | 0.1499 | 0.1482 | 0.1421 | 0.0137 |
| 3 | 0.1421 | 0.1289 | 0.1475 | 0.1227 | 0.1455 | 0.0987 | 0.0691 |
| 4 | 0.1540 | 0.1104 | 0.0716 | 0.1258 | 0.0854 | 0.0614 | 0.0088 |
| 5 | 0.1588 | 0.1564 | 0.1186 | 0.1303 | 0.1215 | 0.1094 | 0.0571 |
| 6 | 0.1053 | 0.0819 | 0.0757 | 0.1006 | 0.0762 | 0.0248 | 0.0137 |
| 7 | 0.1462 | 0.1209 | 0.1150 | 0.1292 | 0.1082 | 0.0917 | 0.0582 |
| 8 | 0.1266 | 0.1324 | 0.1437 | 0.1272 | 0.1309 | 0.1150 | 0.0381 |
| 9 | 0.1315 | 0.1211 | 0.1133 | 0.1249 | 0.1127 | 0.0828 | 0.0815 |
| 10 | 0.1154 | 0.1112 | 0.1151 | 0.1105 | 0.1110 | 0.0709 | 0.0536 |
| 11 | 0.1544 | 0.1215 | 0.1146 | 0.1387 | 0.1204 | 0.0827 | 0.0690 |
| 12 | 0.1262 | 0.1112 | 0.1088 | 0.1194 | 0.1002 | 0.0758 | 0.0118 |
| Average | 0.1370 | 0.1216 | 0.1112 | 0.1247 | 0.1113 | 0.0842 | 0.0402 |
Table 3: Comparison of unsatisfied job type index using heuristics and Q-learning.

| Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning |
|---|---|---|---|---|---|---|---|
| 1 | 0.1290 | 0.2615 | 0.1667 | 0.1650 | 0.1793 | 0.0921 | 0.0678 |
| 2 | 0.1924 | 0.2953 | 0.2302 | 0.2650 | 0.2097 | 0.1320 | 0.0278 |
| 3 | 0.2250 | 0.3287 | 0.2126 | 0.2564 | 0.2278 | 0.0921 | 0.0921 |
| 4 | 0.0781 | 0.2987 | 0.0278 | 0.1290 | 0.0828 | 0.0278 | 0.0278 |
| 5 | 0.2924 | 0.3169 | 0.3002 | 0.2224 | 0.2632 | 0.2055 | 0.0571 |
| 6 | 0.2290 | 0.3062 | 0.1817 | 0.2290 | 0.1632 | 0.0571 | 0.0278 |
| 7 | 0.2650 | 0.2987 | 0.2160 | 0.2529 | 0.2075 | 0.1320 | 0.1647 |
| 8 | 0.1924 | 0.3225 | 0.2067 | 0.1813 | 0.1696 | 0.1320 | 0.1320 |
| 9 | 0.2221 | 0.3250 | 0.1403 | 0.1892 | 0.2073 | 0.1320 | 0.0749 |
| 10 | 0.2621 | 0.3304 | 0.2667 | 0.2859 | 0.2708 | 0.1781 | 0.0678 |
| 11 | 0.2029 | 0.2896 | 0.2578 | 0.2194 | 0.2220 | 0.1381 | 0.1542 |
| 12 | 0.1924 | 0.3271 | 0.2302 | 0.1838 | 0.2182 | 0.0921 | 0.0678 |
| Average | 0.2069 | 0.3084 | 0.2031 | 0.2149 | 0.2018 | 0.1176 | 0.0802 |
Table 4. Comparison of the total setup time using heuristics and Q-learning.

| Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.8133 | 0.8497 | 0.3283 | 0.7231 | 0.3884 | 0.3895 | 1.0000 |
| 2 | 0.8333 | 1.3000 | 0.4358 | 0.7564 | 0.5263 | 0.4094 | 1.0000 |
| 3 | 0.8712 | 1.1633 | 0.4207 | 0.6361 | 0.4937 | 0.4298 | 1.0000 |
| 4 | 1.1280 | 0.7123 | 0.4629 | 0.8516 | 0.5139 | 0.4318 | 1.0000 |
| 5 | 0.9629 | 1.3121 | 0.4179 | 0.8597 | 0.5115 | 0.3873 | 1.0000 |
| 6 | 0.7489 | 1.0393 | 0.4104 | 0.7489 | 0.4542 | 0.4074 | 1.0000 |
| 7 | 1.7868 | 2.2182 | 0.8223 | 1.4069 | 1.0125 | 0.4174 | 1.0000 |
| 8 | 0.6456 | 0.8508 | 0.4055 | 0.6694 | 0.5053 | 0.3795 | 1.0000 |
| 9 | 0.9245 | 0.9946 | 0.5013 | 0.7821 | 0.6694 | 0.4163 | 1.0000 |
| 10 | 1.1025 | 1.7875 | 0.6703 | 1.0371 | 0.9079 | 0.4894 | 1.0000 |
| 11 | 0.9973 | 1.3655 | 0.3994 | 0.9686 | 0.5129 | 0.4066 | 1.0000 |
| 12 | 0.7904 | 1.1111 | 0.4419 | 0.6195 | 0.5081 | 0.4258 | 1.0000 |
| Average | 0.9671 | 1.2254 | 0.4764 | 0.8383 | 0.5837 | 0.4158 | 1.0000 |
Besides the objective function value, we propose two indices, the unsatisfied TPV index and the unsatisfied job type index, to measure the performance of the seven algorithms. The unsatisfied TPV index (UPI), defined in formula (32), is the weighted proportion of unfinished Target Production Volume. Table 2 compares the UPIs of all datasets under the seven algorithms. Each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA yields a larger UPI than LWF on every dataset, whereas the Q-learning algorithm achieves a smaller UPI than LWF on every dataset. The average UPI of Q-learning is only 0.0402, less than that of LWF, 0.0842, by a substantial 52.20%. Let J denote the set {j ∣ 1 ≤ j ≤ n, Y_j < D_j}. The unsatisfied job type index (UJTI), defined in formula (33), is the weighted proportion of the job types whose TPVs are not completely satisfied. Table 3 compares the UJTIs of all datasets under the seven algorithms. On most datasets, Q-learning achieves a smaller UJTI than LWF. The average UJTI of Q-learning is 0.0802, which is remarkably less than that of LWF, 0.1176, by 31.81%. Consider

$$\mathrm{UPI}=\frac{\sum_{j=1}^{n} w_j \left(D_j-Y_j\right)^{+}}{\sum_{j=1}^{n} w_j D_j},\qquad(32)$$

$$\mathrm{UJTI}=\frac{\sum_{j\in J} w_j}{\sum_{j=1}^{n} w_j}.\qquad(33)$$
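Formulas (32) and (33) translate directly into code. The following is a minimal sketch (function and variable names `w`, `D`, `Y` are illustrative assumptions, with `w[j]` the job-type weight, `D[j]` its TPV, and `Y[j]` the volume actually produced):

```python
def upi(w, D, Y):
    """Unsatisfied TPV index, formula (32): weighted proportion of unfinished TPV."""
    num = sum(wj * max(Dj - Yj, 0.0) for wj, Dj, Yj in zip(w, D, Y))
    den = sum(wj * Dj for wj, Dj in zip(w, D))
    return num / den

def ujti(w, D, Y):
    """Unsatisfied job type index, formula (33): weighted proportion of the
    job types whose TPV is not completely satisfied, J = {j : Y_j < D_j}."""
    num = sum(wj for wj, Dj, Yj in zip(w, D, Y) if Yj < Dj)
    return num / sum(w)

# Hypothetical example: two job types, the first one short by 20 units.
w, D, Y = [2.0, 1.0], [100.0, 50.0], [80.0, 50.0]
print(upi(w, D, Y))   # 2*20 / (2*100 + 1*50) = 0.16
print(ujti(w, D, Y))  # 2 / (2 + 1) ≈ 0.6667
```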
Table 4 shows the total setup time of all datasets under the seven algorithms. For reasons of commercial confidentiality, the data are normalized: each algorithm's total setup time on a dataset is divided by the total setup time that Q-learning obtains on that dataset. The normalized total setup time of Q-learning is therefore 1.0000 for every dataset, and the figures for the six heuristics are scaled accordingly. Q-learning requires more than twice the setup time that LWF does on each dataset; the average accumulated setup time of LWF is only 41.58 percent of that of Q-learning.
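The normalization used in Table 4 can be stated in a few lines (the raw setup-time figures below are hypothetical placeholders, since the real values are confidential):

```python
def normalize_setup_times(setup, q_learning_setup):
    """Divide each algorithm's total setup time for one dataset by the
    total setup time Q-learning obtained on that dataset, so Q-learning's
    normalized value is always 1.0."""
    return {alg: t / q_learning_setup for alg, t in setup.items()}

# Hypothetical raw totals for one dataset (minutes):
raw = {"WSPT": 6504.0, "LWF": 3116.0, "Q-learning": 8000.0}
norm = normalize_setup_times(raw, raw["Q-learning"])
print(norm)  # WSPT -> 0.813, LWF -> 0.3895, Q-learning -> 1.0
```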
The preceding experimental results reveal that, over whole scheduling tasks, each of the five action heuristics for Q-learning (WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA) performs worse than the LWF heuristic. However, Q-learning greatly outperforms LWF on all three performance measures: objective function value, UPI, and UJTI. This demonstrates that some action heuristics provide better actions than LWF at some states. By solving the scheduling problem repeatedly, the Q-learning system automatically gains insight into the structure of the problem and adjusts its actions toward the optimal ones for different system states. The actions at all states form a new optimized policy that differs from any policy following a single action heuristic or the LWF heuristic. That is, Q-learning incorporates the merits of the five alternative heuristics, applies them flexibly when scheduling jobs, and obtains results much better than any individual action heuristic or LWF. In the experiments, Q-learning achieves high-quality schedules at the cost of additional setup time; in other words, it utilizes the machines more efficiently by increasing conversions among a variety of job types.
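The mechanism just described, learning which dispatching heuristic to apply at each state, can be sketched with a tiny tabular Q-learning loop. Everything below is a toy illustration: the coarse state label, the reward signal, and the assumption that one heuristic dominates in that state are all invented for the example, not taken from the paper's system (which uses an elaborate state representation and a reward tied to the objective function):

```python
import random

random.seed(0)

ACTIONS = ["WSPT", "MWSPT", "RA", "LFM-MWSPT", "LFM-RA"]  # heuristic choices
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = {}  # Q-table over (state, heuristic) pairs

def q(s, a):
    return Q.get((s, a), 0.0)

def choose(s):
    """Epsilon-greedy selection of a dispatching heuristic for state s."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q(s, a))

def update(s, a, r, s_next):
    """One-step Q-learning update."""
    target = r + GAMMA * max(q(s_next, b) for b in ACTIONS)
    Q[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))

# Toy environment: pretend "RA" is the best heuristic in state "backlog-high".
s = "backlog-high"
for a in ACTIONS:                       # try every action once
    update(s, a, 1.0 if a == "RA" else 0.0, "done")
for _ in range(200):                    # then learn epsilon-greedily
    a = choose(s)
    update(s, a, 1.0 if a == "RA" else 0.0, "done")

print(max(ACTIONS, key=lambda a: q(s, a)))  # RA
```

The learned policy picks a different heuristic per state, which is how Q-learning can beat every fixed heuristic even though each one, used alone, loses to LWF.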
5. Conclusions
We apply Q-learning to lot-based chip attach scheduling in back-end semiconductor manufacturing. The critical issue in applying reinforcement learning to scheduling is converting the scheduling problem into an RL problem. We convert the chip attach scheduling problem into a particular semi-Markov decision process (SMDP) through a Markovian state representation. Five heuristic algorithms, WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA, are selected as actions so as to utilize prior domain knowledge. The reward function is directly related to the scheduling objective function, and we prove that maximizing the accumulated reward is equivalent to minimizing the objective function. Gradient-descent linear function approximation is combined with the Q-learning algorithm.
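The combination of Q-learning with gradient-descent linear function approximation can be sketched as follows. The feature vector, its dimension, the step size, and the sample transition are illustrative assumptions, not the paper's actual parameterization; only the update rule, theta_a ← theta_a + alpha · delta · phi(s) with TD error delta = r + gamma · max_b Q(s′, b) − Q(s, a), is the named technique:

```python
N_FEATURES, N_ACTIONS = 4, 5   # illustrative sizes (five heuristic actions)
ALPHA, GAMMA = 0.05, 0.9

# One weight vector per action: Q(s, a) ~= theta[a] . phi(s)
theta = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]

def q_value(phi, a):
    """Linear approximation of the action value."""
    return sum(t * f for t, f in zip(theta[a], phi))

def td_update(phi, a, r, phi_next):
    """Gradient-descent Q-learning step: for a linear approximator the
    gradient of Q(s, a) w.r.t. theta[a] is just phi(s)."""
    best_next = max(q_value(phi_next, b) for b in range(N_ACTIONS))
    delta = r + GAMMA * best_next - q_value(phi, a)          # TD error
    theta[a] = [t + ALPHA * delta * f for t, f in zip(theta[a], phi)]

# One illustrative update from zero weights: reward 1.0 for action 2.
phi, phi_next = [1.0, 0.5, 0.0, 0.2], [0.8, 0.1, 0.3, 0.0]
td_update(phi, 2, 1.0, phi_next)
print(round(q_value(phi, 2), 4))  # alpha * ||phi||^2 = 0.05 * 1.29 = 0.0645
```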
Q-learning exploits the internal structure of the scheduling problem by solving it repeatedly. It learns a domain-specific policy from experienced episodes through interaction and then applies that policy to later episodes. We define two indices, the unsatisfied TPV index and the unsatisfied job type index, which together with the objective function value measure the performance of Q-learning and the heuristics. Experiments with industrial datasets show that Q-learning clearly outperforms the six heuristic algorithms WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. Compared with LWF, Q-learning reduces the objective function value, UPI, and UJTI by averages of 80.92%, 52.20%, and 31.81%, respectively. With Q-learning, chip attach scheduling is optimized by increasing effective job type conversions.
Disclosure
Given the sensitive and proprietary nature of the semiconductor manufacturing environment, we use normalized data in this paper.
Acknowledgments
This project is supported by the National Natural Science Foundation of China (Grant no. 71201026), Science and Technological Program for Dongguan’s Higher Education, Science and Research, and Health Care Institutions (no. 2011108102017), and Humanities and Social Sciences Program of Ministry of Education of China (no. 10YJC630405).