Inventory management is a sequential decision problem that can be solved with reinforcement learning (RL). Although RL in its conventional form does not require domain knowledge, exploiting knowledge of the problem structure, which is usually available in inventory management, can improve both the learning quality and the learning speed of RL. Ruminative reinforcement learning (RRL) has been introduced recently based on this approach. RRL is motivated by how humans contemplate the consequences of their actions in trying to learn how to make a better decision. This study further investigates the issues of RRL and proposes new RRL methods applied to inventory management. Our investigation provides insight into different RRL characteristics, and our experimental results show the viability of the new methods.
1. Introduction
Inventory management is a crucial business activity and can be modeled as a sequential decision problem. Bertsimas and Thiele [1], among others, addressed the need for an efficient and flexible inventory solution that is also simple to implement in practice. This may be among the reasons for extensive studies of reinforcement learning (RL) application to inventory management.
RL [2, 3] is an approach to solving sequential decision problems based on learning the underlying state value or state-action value. Because it relies on a learning mechanism, RL in its typical form does not require knowledge of the structure of the problem. Therefore, RL has been studied in a wide range of sequential decision problems, for example, virtual machine configuration [4], robotics [5], helicopter control [6], ventilation, heating, and air conditioning control [7], electricity trade [8], financial management [9], water resource management [10], and inventory management [11]. Acceptance of RL is credited to its effectiveness, its potential [12], its link to mammalian learning processes [13, 14], and its model-free property [15].
Despite fascination with RL’s model-free property, most inventory management problems can naturally be formulated into a well-structured part interacting with another part that is less understood. That is, replenishment cost, holding cost, and penalty cost can be determined precisely in advance. On the other hand, customer demand or, in some cases, delivery time or availability of supplies is usually less predictable. However, once a value of a less predictable variable is known, the period cost can be determined precisely. Specifically, a warehouse would know its period inventory cost after its replenishment has arrived and all demand orders in the period have been observed. Calculation of a period cost is a well-defined formula, while another part, for example, demand, is less predictable. Knowledge about the well-structured part can be exploited, while a learning mechanism can be used to handle the less understood part.
Utilizing this knowledge, Kim et al. [16] proposed asynchronous action-reward learning, which used simulation to evaluate consequences of actions not taken in order to accelerate the learning process in a stateless system. Extending the idea to state-based systems, Katanyukul [17] developed ruminative reinforcement learning (RRL) methods, that is, ruminative SARSA (RSarsa) and policy-weighted RSarsa (PRS). The RRL approach is motivated by how humans contemplate the consequences of their actions to improve their learning, hoping to make a better decision. His study of RRL reveals good potential of the approach. However, the existing methods show strengths in different scenarios: RSarsa learns fast but leads to inferior learning quality in the long run, whereas PRS leads to superior learning quality in the long run but learns at a slower rate.
Our proposed method here is developed to exploit the fast learning characteristic of RSarsa and good learning quality in a long-term run of PRS. Our experimental results show effectiveness of the proposed method and support our assumption underlying development of RRL.
2. Background
An objective of sequential inventory management is to minimize a long-term cost, $C_0(s_0) = \min_{a_0,\ldots,a_T} \sum_{t=0}^{T} \gamma^t E[c_t \mid s_0, a_0, \ldots, a_T]$, subject to $a_t \in A_{s_t}$ for $t = 0, \ldots, T$, where $E[c_t \mid s_0, a_0, \ldots, a_T]$ is the expected period cost of period $t$ given an initial state $s_0$ and actions $a_0, \ldots, a_T$ over periods 0 to $T$, respectively; $\gamma$ is a discount factor; and $A_{s_t}$ is the feasible action set at state $s_t$. Under certain assumptions, the problem can be posed as a Markov decision problem (MDP) (see [15] for details). In this case, what we seek is an optimal policy, which maps each state to an optimal action. Given an arbitrary policy $\pi$, the long-term state cost for that policy can be written as
$$C^{\pi}(s) = r^{\pi}(s) + \gamma \sum_{s'} p^{\pi}(s' \mid s)\, C^{\pi}(s'), \tag{1}$$
where $r^{\pi}(s)$ is the expected period state cost and $p^{\pi}(s' \mid s)$ is the transition probability, that is, the probability of the next state being $s'$ when the current state is $s$. The superscript $\pi$ indicates dependence on the policy $\pi$. In practice, an exact solution to (1) is difficult to find. Reinforcement learning (RL) [2] provides a framework to find an approximate solution. An approximate long-term cost of state $s$ is obtained by summing the period cost and the discounted long-term cost of the next state: $Q(s) = r(s) + \gamma Q(s')$.
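For a small problem where the transition probabilities are known, (1) can be solved by repeated substitution. The following is a minimal sketch, not part of the original study: the function name `evaluate_policy`, the two-state example, and the tolerance are illustrative assumptions.

```python
def evaluate_policy(r, P, gamma=0.8, tol=1e-10):
    """Iterate C(s) = r(s) + gamma * sum_s' P[s][s'] * C(s') until it settles.

    r: list of expected period costs under a fixed policy (hypothetical data).
    P: row-stochastic transition matrix under the same policy.
    """
    n = len(r)
    C = [0.0] * n
    while True:
        C_new = [r[s] + gamma * sum(P[s][t] * C[t] for t in range(n))
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(C_new, C)) < tol:
            return C_new
        C = C_new


# Toy example: state 0 costs 10 and always moves to the cost-free state 1.
r = [10.0, 0.0]
P = [[0.0, 1.0],
     [0.0, 1.0]]
C = evaluate_policy(r, P)
```

In inventory problems the transition probabilities are rarely available in this form, which is why the learning-based approximation is used instead.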
The RL approach is based on temporal difference (TD) learning, which uses temporal difference error ψ (2) to estimate the long-term cost (3):
$$\psi = r + \gamma Q(s', a') - Q(s, a), \tag{2}$$
$$Q^{(\text{new})}(s, a) = Q^{(\text{old})}(s, a) + \alpha \cdot \psi, \tag{3}$$
where r is the period cost, which corresponds to taking action a in state s, α is a learning rate, and s′ and a′ are the state and action taken in the next period, respectively.
Once the values of Q(s,a) are thoroughly learned, they are good approximations of long-term costs. We often refer to Q(s,a) as the “Q-value.” Most RL methods determine the actions to take based on Q-values. These methods include SARSA [2], a widely used RL algorithm. We use SARSA as a benchmark, representing a conventional RL method, to compare with other methods under investigation. In each period, given observed state s, action taken a, observed period cost r, observed next state s′, and anticipating next action taken a′, the SARSA algorithm updates the Q-value based on TD learning (2) and (3).
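The TD update in (2) and (3) can be sketched in a few lines. This is an illustrative sketch only: the tabular dictionary for the Q-values and the function name `sarsa_update` are assumptions, and the default values of the learning rate and discount factor are simply the ones reported in the experiments.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.7, gamma=0.8):
    """Apply one TD update, Eqs. (2) and (3), and return the TD error psi."""
    psi = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # Eq. (2)
    Q[(s, a)] += alpha * psi                           # Eq. (3)
    return psi


# Q-values start at zero; one observed transition with period cost 100.
Q = defaultdict(float)
psi = sarsa_update(Q, s=0, a=1, r=100.0, s_next=1, a_next=0)
```

Only the visited pair (s, a) changes; every other Q-value is left untouched.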
Based on the Q-value, we can define a policy π to determine an action to take at each state. The policy is usually stochastic, defined by a probability p(a∣s) to take an action a given a state s. The policy has to balance between taking the best action based on the currently learned Q-value and trying another alternative. Trying another alternative gives the learning agent a chance to explore its state-action space and the associated consequences thoroughly. This helps to create a constructive cycle of improving the quality of learned Q-values, which in turn helps the agent to choose better actions and reduces the chance of getting stuck in a local optimum. This is the issue of balancing exploitation and exploration, as discussed in Sutton and Barto [2]. (Since the RL algorithm is autonomous and interacts with its environment, we sometimes use the term “learning agent.”)
An $\epsilon$-greedy policy is a general RL policy that is also easy to implement. With probability $\epsilon$, the policy chooses an action at random from $A(s)$, the set of allowable actions given state $s$. Otherwise, it takes the action with the minimal current Q-value, $a^{*} = \arg\min_{a} Q(s, a)$.
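A minimal sketch of this policy for a cost-minimizing agent follows. The dictionary-based Q-table, the function name `epsilon_greedy`, and the uniform random draw during exploration are assumptions made for illustration.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon=0.2, rng=random):
    """Pick a random action with probability epsilon, else the min-cost action."""
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore (assumed uniform)
    # exploit: Q-values are costs, so the greedy action minimizes Q(s, a)
    return min(actions, key=lambda a: Q.get((s, a), 0.0))


# Hypothetical Q-values for one state and three replenishment quantities.
Q = {(0, 0): 5.0, (0, 50): 2.0, (0, 100): 9.0}
actions = [0, 50, 100]
greedy_action = epsilon_greedy(Q, 0, actions, epsilon=0.0)
```

With `epsilon=0.0` the call never explores, so it returns the action with the lowest learned cost.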
3. Ruminative Reinforcement Learning
The conventional RL approach, SARSA, assumes that the agent knows only the current state s, the action a it takes, the period cost r, the next state s′, and the action a′ it will take in the next state. Each period, the SARSA agent updates the Q-value based on the TD error calculated with these five variables. Figure 1 illustrates the SARSA agent, the five variables it needs to update the Q-value, and its interaction with its environment.
Figure 1: SARSA agent and interacting variables (this figure is adapted from Figure 6.15 of Sutton and Barto [2]).
However, in inventory management problems, we usually have extra knowledge about the environment. That is, the problem structure can naturally be formulated such that the period cost $r$ and next state $s'$ are determined by a function $k: (s, a, \xi) \mapsto (r, s')$, where $\xi$ is an extra information variable. This variable $\xi$ captures the stochastic aspect of the problem. The process generating $\xi$ may be unknown, but the value of $\xi$ is fully observable after the period is over. Given a value of $\xi$, along with $s$ and $a$, the deterministic function $k$ can precisely determine $r$ and $s'$.
Without this extra knowledge, each period, the SARSA agent updates only one value of $Q(s,a)$ corresponding to current state $s$ and action taken $a$. However, with the function $k$ and an observed value of $\xi$, we can do “rumination”: evaluating the consequences of other actions $\hat{a}$, even those that were not taken. Figure 2 illustrates rumination and its associated variables. Given the rumination mechanism, we can provide information required by SARSA’s TD calculation for any underlying action. Katanyukul [17] introduced this rumination idea and incorporated it into the SARSA algorithm, resulting in the ruminative SARSA (RSarsa) algorithm. Algorithm 1 shows the RSarsa algorithm. It should be noted that RSarsa is similar to SARSA, but with inclusion of rumination from line 8 to line 13.
Algorithm 1: RSarsa algorithm.
(L00) Initialize $Q(s,a)$.
(L01) Observe $s$.
(L02) Determine $a$ by policy $\pi$.
(L03) For each period,
(L04)     observe $r$, $s'$, and $\xi$;
(L05)     determine $a'$ by policy $\pi$;
(L06)     calculate $\psi = r + \gamma Q(s',a') - Q(s,a)$;
(L07)     update $Q(s,a) \leftarrow Q(s,a) + \alpha \cdot \psi$;
(L08)     for each $\hat{a} \in \hat{A}(s)$,
(L09)         calculate $\hat{r}, \hat{s}'$ with $k(s,\hat{a},\xi)$,
(L10)         determine $\hat{a}'$,
(L11)         calculate $\psi = \hat{r} + \gamma Q(\hat{s}',\hat{a}') - Q(s,\hat{a})$,
(L12)         update $Q(s,\hat{a}) \leftarrow Q(s,\hat{a}) + \alpha \cdot \psi$
(L13)     until ruminated all $\hat{a} \in \hat{A}$;
(L14)     set $s \leftarrow s'$ and $a \leftarrow a'$
(L15) until termination.
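One period of Algorithm 1 (lines 4 to 14) might be sketched as follows. This is an illustration rather than the implementation used in the experiments: the tabular Q-values, the greedy stand-in for the policy, and the toy structure function `toy_k` are all hypothetical. The sketch follows the listing literally, so the taken action is ruminated as well if it belongs to the ruminative action set.

```python
from collections import defaultdict


def greedy(Q, s, actions):
    """Stand-in policy (assumption): always pick the min-cost action."""
    return min(actions, key=lambda a: Q[(s, a)])


def toy_k(s, a, xi):
    """Hypothetical structure function k: (s, a, xi) -> (r, s')."""
    s_next = s + a - xi
    return 1.0 * a + 2.0 * abs(s_next), s_next


def rsarsa_step(Q, s, a, r, s_next, xi, k, actions, policy,
                alpha=0.7, gamma=0.8):
    """One period of RSarsa; returns the next (state, action) pair."""
    a_next = policy(Q, s_next, actions)                       # L05
    psi = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]         # L06
    Q[(s, a)] += alpha * psi                                  # L07
    for a_hat in actions:                                     # L08
        r_hat, s_hat = k(s, a_hat, xi)                        # L09
        a_hat_next = policy(Q, s_hat, actions)                # L10
        psi_hat = (r_hat + gamma * Q[(s_hat, a_hat_next)]
                   - Q[(s, a_hat)])                           # L11
        Q[(s, a_hat)] += alpha * psi_hat                      # L12
    return s_next, a_next                                     # L14


actions = [0, 1, 2]
Q = defaultdict(float)
s, a, xi = 0, 1, 1
r, s_next = toy_k(s, a, xi)      # in practice r and s' are observed
s, a = rsarsa_step(Q, s, a, r, s_next, xi, toy_k, actions, greedy)
```

After a single period, all three actions of the visited state carry an updated Q-value, whereas plain SARSA would have updated only the action actually taken.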
Figure 2: Environment, knowledge of its structure, and rumination.
The experiments in [17] showed that RSarsa had performed significantly better than SARSA in early periods (indicating faster learning), but its performance was inferior to SARSA in later periods (indicating poor convergence to the appropriate long-term state cost approximation). Katanyukul [17] attributed RSarsa’s poor long-term learning quality to its lack of natural action visitation frequency.
TD learning (2) and (3) update the Q-value as an approximation of the long-term state cost. The transition probability pπ(s′∣s) in (1) does not appear explicitly in the TD learning calculation. Conventional RL relies on sampling trajectories to reflect the natural frequency of visits to state-action pairs corresponding to the transition probability. It updates only the state-action pairs as they are actually visited; therefore, it does not require explicit calculation of the transition probability and still eventually converges to a good approximation.
However, because RSarsa does rumination for all actions ignoring their sampling frequency, this is equivalent to disregarding the transition probability, which leads to RSarsa’s poor long-term learning quality.
To address this issue, Katanyukul [17] proposed policy-weighted RSarsa (PRS). PRS explicitly calculates probabilities of actions to be ruminated and adjusts the weights of their updates. PRS is similar to RSarsa, but the rumination update (line 12 in Algorithm 1) is replaced by
$$Q(s,\hat{a}) \leftarrow Q(s,\hat{a}) + \beta \cdot \psi, \tag{4}$$
where $\beta = \alpha \cdot p(\hat{a})$ and $p(\hat{a})$ is the probability of taking action $\hat{a}$ in state $s$ with policy $\pi$. Given an $\epsilon$-greedy policy, we have $p(\hat{a}) = \epsilon / |\hat{A}(s)|$ for $\hat{a} \neq a^{*}$ and $p(\hat{a}) = \epsilon / |\hat{A}(s)| + (1 - \epsilon)$ otherwise, where $|\hat{A}(s)|$ is the number of allowable actions. PRS has been shown to perform well in early and later periods, compared to SARSA. However, RSarsa is reported to significantly outperform PRS in early periods.
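The policy-weighted rates can be computed directly from the Q-values of the current state. The sketch below is illustrative; the function names and the four example Q-values are assumptions.

```python
def epsilon_greedy_probs(q_values, epsilon=0.2):
    """p(a_hat) for each action under an epsilon-greedy, cost-minimizing policy."""
    n = len(q_values)
    best = min(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n          # exploration share for every action
    probs[best] += 1.0 - epsilon       # extra mass on the greedy action
    return probs


def prs_rates(q_values, alpha=0.7, epsilon=0.2):
    """Rumination learning rates beta = alpha * p(a_hat) used in Eq. (4)."""
    return [alpha * p for p in epsilon_greedy_probs(q_values, epsilon)]


# Hypothetical Q-values of four candidate actions in one state.
probs = epsilon_greedy_probs([5.0, 2.0, 9.0, 4.0])
rates = prs_rates([5.0, 2.0, 9.0, 4.0])
```

With these numbers the greedy action receives probability 0.85 and each of the others 0.05, so non-greedy actions are ruminated with a much smaller effective learning rate.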
4. New Methods
According to the results of [17], although RSarsa may converge to a wrong approximation, RSarsa was shown to perform impressively in the very early periods. This suggests that if we jump-start the learning agent with RSarsa and then later switch to PRS, before the Q-values settle into bad spots, we may be able to achieve both faster learning and good approximation for a long-term run.
PRS.Beta. We first introduce a straightforward idea, called PRS.Beta, where we will use a varying ruminative learning rate as a mechanism to shift from full rumination (RSarsa) to policy-weighted rumination (PRS). Similar to PRS, the rumination update is determined by (4). However, the value of the rumination learning rate β is determined by
$$\beta = \alpha \cdot \{1 - (1 - p(\hat{a})) \cdot f\}, \tag{5}$$
where $f$ is a function having a value between 0 and 1. When $f \to 0$, $\beta \to \alpha$ and the algorithm behaves like RSarsa. When $f \to 1$, $\beta \to \alpha \cdot p(\hat{a})$ and the algorithm behaves like PRS. We want $f$ to start out close to 0 and grow to 1 at a proper rate. Our preliminary experiments indicate that the TD error gets smaller as the learning converges, which is in fact a property of TD learning. Given this property, we can use the magnitude of the TD error $|\psi|$ to control the shifting, such that
$$f(\psi) = \exp(-\tau \cdot |\psi|), \tag{6}$$
where $\tau$ is a scaling factor. Figure 3 illustrates the effects of different values of $\tau$. Since the magnitude of $\tau$ should be relative to $|\psi|$, we set $\tau = |2 / (r + Q(s,a))|$, so that $\tau$ is on a proper scale relative to $|\psi|$ and is adjusted automatically.
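Equations (5) and (6) can be combined into two small helper functions. This is a sketch under stated assumptions: the function names are illustrative, and the handling of a zero denominator in the scaling factor is a guard added here that the text does not specify.

```python
import math


def shift_factor(psi, r, q_sa):
    """f(psi) = exp(-tau * |psi|) with tau = |2 / (r + Q(s, a))|, Eq. (6)."""
    denom = r + q_sa
    if denom == 0.0:
        # Assumption (not in the text): treat tau as infinite here.
        return 1.0 if psi == 0.0 else 0.0
    tau = abs(2.0 / denom)
    return math.exp(-tau * abs(psi))


def prs_beta_rate(alpha, p_a_hat, f):
    """Rumination learning rate beta of Eq. (5)."""
    return alpha * (1.0 - (1.0 - p_a_hat) * f)


# Large TD error -> small f -> rate close to alpha (RSarsa-like).
early = prs_beta_rate(0.7, 0.05, shift_factor(psi=1000.0, r=100.0, q_sa=100.0))
# Vanishing TD error -> f = 1 -> rate alpha * p(a_hat) (PRS-like).
late = prs_beta_rate(0.7, 0.05, shift_factor(psi=0.0, r=100.0, q_sa=100.0))
```

The two example calls show the intended shift: `early` is close to the full learning rate, while `late` equals the policy-weighted rate.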
Figure 3: Function $\exp(-\tau|\psi|)$ and effects of different $\tau$ values.
RSarsa.TD. Building on the PRS.Beta method above, we next propose another method, called RSarsa.TD. The underlying idea is that since SARSA performs well in a long-term run (see [2] for theoretical discussion of SARSA’s optimality and convergence properties), then after we speed up the early learning process with rumination, we can just switch back to SARSA. This approach is to utilize the fast learning characteristic of full rumination in early periods and to avoid its poor long-term performance. In addition, as a computational cost of rumination is proportional to the size of the ruminative action space |A^(s)|, this also helps to reduce the computational cost incurred by rumination. It is also intuitively appealing in the sense that we do rumination only when we need it.
The intuition to selectively do rumination was introduced in [17] in an attempt to reduce the extra computational cost from rumination. There, the probability to do rumination was a function of the magnitude of the TD error:
$$p(\text{rumination}) = 1 - \exp\left(-\left|\frac{2\psi}{r + Q(s,a)}\right|\right). \tag{7}$$
However, Katanyukul [17] investigated this selective rumination only with the policy-weighted method and called it PRS.TD. Although PRS.TD was able to improve the computational cost of the rumination approach, the inventory management performance of PRS.TD was reported to have mixed results, implying that incorporation of selective rumination may deteriorate performance of PRS.
This performance deterioration may be due to using p(rumination) with policy weighted correction. Both schemes use |ψ| to control their effect of rumination; therefore, they might have an effect equivalent to overcorrecting the state-transition probability. Unlike PRS, RSarsa does not correct the state-transition probability. Incorporating selective rumination (7) will be the only scheme controlling rumination with |ψ|. Therefore, we expect that this approach may allow the advantage of RSarsa’s fast learning, while maintaining the long-term learning quality of SARSA.
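The selective rumination rule (7) amounts to a biased coin flip before the rumination loop. The sketch below is illustrative only: the function names, the zero-denominator guard, and the choice of which Q(s,a) value the caller passes in (before or after the regular update) are assumptions that the text leaves open.

```python
import math
import random


def rumination_probability(psi, r, q_sa):
    """p(rumination) = 1 - exp(-|2 * psi / (r + Q(s, a))|), Eq. (7)."""
    denom = r + q_sa
    if denom == 0.0:
        # Assumption (not in the text): always ruminate on a nonzero error.
        return 1.0 if psi != 0.0 else 0.0
    return 1.0 - math.exp(-abs(2.0 * psi / denom))


def should_ruminate(psi, r, q_sa, rng=random):
    """Decide whether to run the rumination loop in this period."""
    return rng.random() < rumination_probability(psi, r, q_sa)


p_large = rumination_probability(psi=100.0, r=100.0, q_sa=100.0)
p_zero = rumination_probability(psi=0.0, r=100.0, q_sa=100.0)
```

In RSarsa.TD, the rumination loop of Algorithm 1 would run with the full learning rate only when `should_ruminate` returns true, so rumination fades out as the TD error shrinks.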
5. Experiments and Results
Our study uses computer simulations to conduct numerical experiments on three inventory management problem settings (P1, P2, and P3). All problems are periodic-review, single-echelon problems with nonzero setup cost. P1 and P2 have a one-period lead time. P3 has a two-period lead time. The same Markov model is used to govern all problem environments, but with different settings. For P1 and P2, the problem state space is $I \times \{0, I^{+}\}$, for on-hand and in-transit inventories, $x$ and $b^{(1)}$, respectively. P3’s state space is $I \times \{0, I^{+}\} \times \{0, I^{+}\}$, for $x$ and in-transit inventories $b^{(1)}$ and $b^{(2)}$. The action space is $\{0, I^{+}\}$, for replenishment order $a$.
The state transition is specified by
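$$x_{t+1} = x_t + b_t^{(1)} - d_t, \qquad b_{t+1}^{(i)} = b_t^{(i+1)} \ \text{for } i = 1, \ldots, L-1, \qquad b_{t+1}^{(L)} = a_t, \tag{8}$$
where $d_t$ is the demand in period $t$ and $L$ is the number of lead time periods.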
The inventory period cost is calculated from the equation
$$r_t = K \cdot \delta(a_t) + G \cdot a_t + H \cdot x_{t+1} \cdot \delta(x_{t+1}) - B \cdot x_{t+1} \cdot \delta(-x_{t+1}), \tag{9}$$
where K, G, H, and B are setup, unit, holding, and penalty costs, respectively, and δ(·) is a step function. Five RL agents are studied: SARSA, RSarsa, PRS, RSarsa.TD, and PRS.Beta.
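Equations (8) and (9) together play the role of the structure function used for rumination: given the state, the order, and the observed demand, they return the period cost and the next state. A minimal sketch follows; the function name `inventory_step`, the list representation of the in-transit pipeline, and the convention that the step function equals 1 only for strictly positive arguments are assumptions. The default cost parameters are those of P1.

```python
def inventory_step(x, b, a, d, K=200.0, G=100.0, H=20.0, B=200.0):
    """Return (period cost, next on-hand, next in-transit list), Eqs. (8)-(9).

    x: on-hand inventory (negative values are backlog)
    b: in-transit orders [b(1), ..., b(L)], b(1) arrives first
    a: replenishment order placed this period
    d: demand observed this period
    """
    x_next = x + b[0] - d                     # Eq. (8), on-hand update
    b_next = b[1:] + [a]                      # Eq. (8), pipeline shift
    setup = K if a > 0 else 0.0               # K * delta(a), delta(0) = 0 assumed
    holding = H * max(x_next, 0)              # H * x * delta(x)
    penalty = B * max(-x_next, 0)             # -B * x * delta(-x)
    return setup + G * a + holding + penalty, x_next, b_next


# One-period lead time: 50 on hand, 40 arriving, order 30, demand 60.
cost, x_next, b_next = inventory_step(x=50, b=[40], a=30, d=60)
# Shortage case: nothing ordered, demand exceeds stock by 40 units.
short_cost, short_x, _ = inventory_step(x=10, b=[0], a=0, d=50)
```

Rumination re-evaluates this function for actions that were not taken, using the same observed demand.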
Each experiment is repeated 10 times. In each repetition, an agent is initialized with all zero Q-values. Then, the experiment is run consecutively for $N_E$ episodes. Each episode starts with an initial state and action as follows: for all problems, $b_1^{(1)}$ and $a_1$ are initialized with values randomly drawn between 0 and 100. In P1, $x_1$ is initialized to 50; in P2, $x_1$ is initialized with values randomly drawn between -50 and 400; in P3, $x_1$ is initialized to 100 and $b_1^{(2)}$ is initialized with values randomly drawn between 0 and 100. Each episode ends when $N_P$ periods are reached or an agent has visited a termination state, which is a state lying outside the valid range of the Q-value implementation. The maximum number of periods in each episode, $N_P$, defines the length of the problem horizon, while the number of episodes $N_E$ specifies the variety of problem scenarios, that is, different initial states and actions.
Three problem settings are used in our experiments. Problem 1 (P1) has $N_E=100$, $N_P=60$, $K=200$, $G=100$, $B=200$, and $H=20$. Demand $d_t$ is normally distributed, with mean 50 and standard deviation 10, denoted as $d_t \sim N(50, 10^2)$. The environment state $[x, b^{(1)}]$ is set as the RL agent state $s = [x, b^{(1)}]$. Problem 2 (P2) has $N_E=500$, $N_P=60$, $K=200$, $G=50$, $B=200$, and $H=20$, with demand $d_t \sim N(50, 10^2)$. The RL agent state is set as the inventory level $s = x + b^{(1)}$. Therefore, the RL agent state is one-dimensional. Problem 3 (P3) has $N_E=500$, $N_P=60$, $K=200$, $G=50$, $B=200$, and $H=20$. The demand $d_t$ is AR1/GARCH(1,1): $d_t = a_0 + a_1 \cdot d_{t-1} + \epsilon_t$; $\epsilon_t = e_t \cdot \sigma_t$ and $\sigma_t^2 = \nu_0 + \nu_1 \cdot \epsilon_{t-1}^2 + \nu_2 \cdot \sigma_{t-1}^2$, where $a_0$ and $a_1$ are AR1 model parameters; $\nu_0$, $\nu_1$, and $\nu_2$ are GARCH(1,1) parameters; and $e_t$ is white noise distributed according to $N(0,1)$. The values of the AR1/GARCH(1,1) parameters in our experiments are $a_0=2$, $a_1=0.8$, $\nu_0=100$, $\nu_1=0.1$, and $\nu_2=0.8$, with initial values $d_1=50$, $\sigma_1^2=100$, and $\epsilon_1=2$. The RL agent state in P3 is three-dimensional, $s = [x, b^{(1)}, b^{(2)}]$. In all three problem settings, the RL agent period cost and action are the inventory period cost and replenishment order, respectively. For RSarsa, PRS, RSarsa.TD, and PRS.Beta, the extra information required by rumination is the inventory demand variable $\xi = d_t$.
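The AR1/GARCH(1,1) demand process of P3 can be simulated with the recursion given above. The sketch below is an illustration: the function name, the seeding, and the absence of any rounding or truncation of the generated demand are assumptions, since the text does not say how non-integer or negative values are handled.

```python
import random


def ar1_garch_demand(n, a0=2.0, a1=0.8, nu0=100.0, nu1=0.1, nu2=0.8,
                     d1=50.0, sigma2_1=100.0, eps1=2.0, seed=None):
    """Generate n demand values d_1, ..., d_n from the AR1/GARCH(1,1) model."""
    if n <= 0:
        return []
    rng = random.Random(seed)
    d, sigma2, eps = d1, sigma2_1, eps1
    demands = [d]
    for _ in range(n - 1):
        sigma2 = nu0 + nu1 * eps ** 2 + nu2 * sigma2   # conditional variance
        eps = rng.gauss(0.0, 1.0) * sigma2 ** 0.5      # eps_t = e_t * sigma_t
        d = a0 + a1 * d + eps                          # AR(1) demand
        demands.append(d)
    return demands


demands = ar1_garch_demand(60, seed=1)   # one episode of N_P = 60 periods
```

Each generated value would serve as the observed extra information for rumination in the corresponding period.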
The Q-value is implemented using grid tile coding [2] without hashing. Tile coding is a function approximation method based on a linear combination of weights of activated features, called “tiles.” The approximation function with argument z is given by
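$$f(z) = w_1 \phi_1(z) + w_2 \phi_2(z) + \cdots + w_M \phi_M(z), \tag{10}$$
where $w_1, w_2, \ldots, w_M$ are tile weights and $\phi_1(z), \phi_2(z), \ldots, \phi_M(z)$ are tile activation functions, with $\phi_i(z) = 1$ only when $z$ lies inside the hypercube of the $i$th tile and $\phi_i(z) = 0$ otherwise.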
The tile configuration, that is, ϕ1(z),…,ϕM(z), is predefined. Each Q-value is stored using tile coding through the weights. Given a value Q to store at any entry of z, the weights are updated according to
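$$w_i = w_i^{(\text{old})} + \frac{Q - Q^{(\text{old})}}{N}, \tag{11}$$
where $w_i^{(\text{old})}$ and $Q^{(\text{old})}$ are the weight of the $i$th tile and the approximation before the new update, and $N$ is the number of tiling layers. The update applies to the tiles activated by $z$, one in each layer.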
For P1, we use a tile coding with 10 tiling layers. Each layer has 8×3×4 three-dimensional tiles, covering the multidimensional state-action space of [-300,500]×[0,150]×[0,150] corresponding to $s = [x, b^{(1)}]$ and $a$. This means that this tile coding allows only a state lying in [-300,500]×[0,150] and a value of action between 0 and 150. The dimensions, along $x$, $b^{(1)}$, and $a$, are partitioned into 8, 3, and 4 partitions, creating 96 three-dimensional hypercubes for each tiling layer. All layers overlap to constitute the entire tile coding set. Layer overlapping is arranged randomly. For P2, we use a tile coding with 5 tiling layers. Each tiling has 11×5 two-dimensional tiles, covering the space of [-300,650]×[0,150] corresponding to $s = x + b^{(1)}$ and $a$. For P3, we use a tile coding with 10 tiling layers. Each tiling has 8×3×3×4 four-dimensional tiles, covering the space of [-400,1200]×[0,150]×[0,150]×[0,150] corresponding to $s = [x, b^{(1)}, b^{(2)}]$ and $a$.
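A grid tile coder along the lines of (10) and (11) can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments: the class name, the way random layer offsets are drawn, and the use of dictionaries for the weights are assumptions. In particular, shifting a layer by a random fraction of a tile width means a layer may use one extra tile per dimension beyond the counts quoted above.

```python
import random


class TileCoder:
    """Grid tile coding with randomly offset layers (illustrative sketch)."""

    def __init__(self, lows, highs, bins, n_layers, seed=0):
        rng = random.Random(seed)
        self.lows = lows
        self.n_layers = n_layers
        self.widths = [(hi - lo) / n for lo, hi, n in zip(lows, highs, bins)]
        # Assumed offset scheme: each layer shifted by a random fraction of a tile.
        self.offsets = [[rng.random() * w for w in self.widths]
                        for _ in range(n_layers)]
        self.weights = [dict() for _ in range(n_layers)]

    def _tile(self, layer, z):
        """Index of the single tile that z activates in the given layer."""
        return tuple(int((zi - lo + off) // w)
                     for zi, lo, off, w in zip(z, self.lows,
                                               self.offsets[layer], self.widths))

    def value(self, z):
        """Eq. (10): sum of the weights of the activated tiles."""
        return sum(self.weights[l].get(self._tile(l, z), 0.0)
                   for l in range(self.n_layers))

    def store(self, z, q):
        """Eq. (11): spread the correction evenly over the N activated tiles."""
        delta = (q - self.value(z)) / self.n_layers
        for l in range(self.n_layers):
            t = self._tile(l, z)
            self.weights[l][t] = self.weights[l].get(t, 0.0) + delta


# P1-like configuration: z = (x, b1, a), 10 layers of 8 x 3 x 4 tiles.
tc = TileCoder(lows=[-300, 0, 0], highs=[500, 150, 150],
               bins=[8, 3, 4], n_layers=10)
tc.store((50, 40, 30), 3800.0)
stored = tc.value((50, 40, 30))
far_away = tc.value((-250, 140, 140))
```

A stored value is recovered exactly at the same point, nearby points that share tiles inherit part of it, and points that share no tile are unaffected.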
All RL agents use the ϵ-greedy policy with ϵ=0.2. The learning update uses the learning rate α=0.7 and discount factor γ=0.8.
Figures 4, 5, and 6 show moving averages (of degree 1000) of period costs, in P1, P2, and P3, obtained with different learning agents, as indicated in the legends (“R.TD” is short for RSarsa.TD). Figures 7 and 8 show box plots of average costs obtained with the different methods in early and later periods, respectively.
Figure 4: Moving average of period costs, P1.
Figure 5: Moving average of period costs, P2.
Figure 6: Moving average of period costs, P3.
Figure 7: Average costs in early periods.
Figure 8: Average costs in late periods.
The results are summarized in Table 1. The computation costs of the methods are measured by relative average computation time per epoch, shown in lines 1–3. Average costs are used as the inventory management performance measure; they are shown in lines 4–6 for early periods (periods 1–2000 in P1 and P2 and periods 1–4000 in P3) and in lines 7–9 for later periods (all periods after the early periods). Each entry gives the average cost obtained with the corresponding method. The parentheses report results of one-sided Wilcoxon rank-sum tests: “W” indicates that the average cost is significantly lower than the average cost obtained with SARSA (P<0.05); otherwise, the P value is shown instead.
Table 1: Experimental results.

| Line | Problem | SARSA | RSarsa | PRS | RSarsa.TD | PRS.Beta |
|---|---|---|---|---|---|---|
| | Relative computation time/epoch | | | | | |
| 1 | P1 | 1 | 30 | 26 | 5 | 30 |
| 2 | P2 | 1 | 20 | 21 | 3 | 19 |
| 3 | P3 | 1 | 31 | 29 | 6 | 31 |
| | Average cost of early periods | | | | | |
| 4 | P1 | 8,421 | 7,619 (W) | 8,379 (p = 0.43) | 7,597 (W) | 7,450 (W) |
| 5 | P2 | 4,935 | 4,606 (W) | 4,792 (p = 0.06) | 4,685 (W) | 4,411 (W) |
| 6 | P3 | 10,502 | 8,694 (W) | 9,958 (p = 0.20) | 9,390 (p = 0.07) | 8,472 (W) |
| | Average cost of later periods | | | | | |
| 7 | P1 | 7,214 | 7,355 (p = 0.68) | 7,051 (W) | 7,110 (p = 0.11) | 7,010 (W) |
| 8 | P2 | 4,308 | 4,388 (p = 0.90) | 4,248 (p = 0.14) | 4,375 (p = 0.84) | 4,194 (W) |
| 9 | P3 | 8,613 | 8,139 (p = 0.29) | 8,312 (p = 0.37) | 8,486 (p = 0.43) | 7,664 (p = 0.18) |
The computation costs of RSarsa, PRS, and PRS.Beta (full rumination) are about 20–30 times that of SARSA (RL without rumination). RSarsa.TD (selective rumination) dramatically reduces the computation cost of rumination, by a factor of roughly 5–7, to about 3–6 times that of SARSA. An evaluation of the effectiveness of each method (compared to SARSA) shows that RSarsa and PRS.Beta significantly outperform SARSA in early periods for all 3 problems. Average costs obtained from RSarsa.TD are lower than those from SARSA, but significance tests confirm this only in P1 and P2. It should be noted that the PRS results in early periods do not show significant improvement over SARSA. This agrees with results in a previous study [17]. With respect to performance in later periods, average costs of PRS and PRS.Beta are lower than SARSA’s in all 3 problems. However, significance tests confirm only a few of these results (P1 for PRS and P1 and P2 for PRS.Beta).
Table 2 shows a summary of results from significance tests comparing the previous study’s RRL methods (RSarsa and PRS) to our proposed methods (RSarsa.TD and PRS.Beta). An entry with “W” indicates that our proposed method in the corresponding column significantly outperforms the previous method in the corresponding row (P<0.05). Otherwise, the P value is indicated.
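Table 2: Experimental results.

| Line | Problem | Previous method | RSarsa.TD | PRS.Beta |
|---|---|---|---|---|
| | Early periods | | | |
| 1 | P1 | RSarsa | 0.49 | 0.16 |
| 2 | P1 | PRS | W | W |
| 3 | P2 | RSarsa | 0.95 | W |
| 4 | P2 | PRS | 0.10 | W |
| 5 | P3 | RSarsa | 0.80 | 0.37 |
| 6 | P3 | PRS | 0.26 | W |
| | Later periods | | | |
| 7 | P1 | RSarsa | W | W |
| 8 | P1 | PRS | 0.63 | 0.14 |
| 9 | P2 | RSarsa | 0.46 | W |
| 10 | P2 | PRS | 0.97 | 0.12 |
| 11 | P3 | RSarsa | 0.66 | 0.26 |
| 12 | P3 | PRS | 0.60 | 0.18 |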
6. Conclusions and Discussion
Our results have shown that PRS.Beta achieves our goal, which is to address the slow learning rate of PRS, as it significantly outperforms PRS in early periods in all 3 problems, and to address the long-term learning quality of RSarsa, as it significantly outperforms RSarsa in later periods in P1 and P2 and its average cost is lower than RSarsa’s in P3. It should be noted that although the performance of RSarsa.TD may not seem impressive when compared to PRS.Beta’s, RSarsa.TD requires less computational cost. Therefore, as RSarsa.TD shows some improvement over SARSA, this reveals that selective rumination is still worth further study.
It should be noted that PRS.Beta employs TD error to control its behavior (6). The notion to extend TD error to determine learning factors is not limited only to rumination. It may be beneficial to use the TD error signal to determine other learning factors, such as the learning rate, for an adaptive-learning-rate agent. A high TD error indicates that the agent has a lot to learn, that what it has learned is wrong, or that things are changing. For each of these cases, the goal is to make the agent learn more quickly. So, a high TD error should be a clue to increase the learning rate, increase the degree of rumination, or increase the chance to do more exploration.
To address issues in RL worth investigation, more efficient Q-value representations should be among the priorities. Regardless of the action policy, every RL policy relies on Q-values to determine the action to take. Function approximations suitable to represent Q-values should facilitate efficient realization of an action policy. For example, ϵ-greedy policy has to search for an optimal action. A Q-value representation suitable for an ϵ-greedy policy should allow efficient search for an optimal action given a state. Another general RL action policy is the softmax policy [2]. Given a state, the softmax policy has to evaluate the probabilities of candidate actions based on their associated Q-values. A representation that facilitates efficient mapping from Q-values to the probabilities would have great practical importance in this case. Due to the interaction between the Q-value representation and the action policy, there are considerable efforts to combine these two concepts. This is an active research direction under the rubric of policy gradient RL [18].
Many issues in RL still need to be explored, both theoretically and in applications. Our findings reported in this paper provide another step in understanding and applying RL to practical inventory management problems. Even though we only investigated inventory management problems here, our methods can be applied beyond this specific domain. This early step in the study of using TD error to control learning factors, along with investigation of other issues in RL, may yield a more robust learning agent that is useful in a wide range of practical applications.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would like to thank the Colorado State University Libraries Open Access Research and Scholarship Fund for supporting the publication of this paper.
References
[1] Bertsimas D., Thiele A. A robust optimization approach to inventory theory.
[2] Sutton R. S., Barto A. G.
[3] Powell W. B.
[4] Rao J., Bu X., Xu C. Z., Wang L., Yin G. VCONF: a reinforcement learning approach to virtual machines auto-configuration. Proceedings of the 6th International Conference on Autonomic Computing (ICAC '09), June 2009, Barcelona, Spain, ACM, pp. 137–146. doi:10.1145/1555228.1555263
[5] Khan S. G., Herrmann G., Lewis F. L., Pipe T., Melhuish C. Reinforcement learning and optimal adaptive control: an overview and implementation examples.
[6] Coates A., Abbeel P., Ng A. Y. Apprenticeship learning for helicopter control.
[7] Anderson C. W., Hittle D., Kretchmar M., Young P. Robust reinforcement learning for heating, ventilation, and air conditioning control of buildings. In: Si J., Barto A., Powell W., Wunsch D. (eds.)
[8] Lincoln R., Galloway S., Stephen B., Burt G. Comparing policy gradient and value function based reinforcement learning methods in simulated electrical power trade.
[9] Tan Z., Quek C., Cheng P. Y. K. Stock trading with cycles: a financial application of ANFIS and reinforcement learning.
[10] Castelletti A., Pianosi F., Restelli M. A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run.
[11] Katanyukul T., Chong E. K. P., Duff W. S. Intelligent inventory control: is bootstrapping worth implementing?
[12] Akiyama T., Hachiya H., Sugiyama M. Efficient exploration through active learning for value function approximation in reinforcement learning.
[13] Doya K. Metalearning and neuromodulation.
[14] Dayan P., Daw N. D. Decision theory, reinforcement learning, and the brain.
[15] Katanyukul T., Duff W. S., Chong E. K. P. Approximate dynamic programming for an inventory problem: empirical comparison.
[16] Kim C. O., Kwon I. H., Baek J. G. Asynchronous action-reward learning for nonstationary serial supply chain inventory control.
[17] Katanyukul T. Ruminative reinforcement learning: improve intelligent inventory control by ruminating on the past. Proceedings of the 5th International Conference on Computer Science and Information Technology (ICCSIT '13), 2013, Paris, France, Academic Publisher.
[18] Vien N. A., Yu H., Chung T. Hessian matrix distribution for Bayesian policy gradient reinforcement learning.