Solving a Joint Pricing and Inventory Control Problem for Perishables via Deep Reinforcement Learning

We study a joint pricing and inventory control problem for perishables with positive lead time in a finite-horizon periodic-review system. Unlike most studies, which assume a continuous demand density, we let customer demand depend on the current-period price and arrive according to a homogeneous Poisson process. We consider both backlogging and lost-sales cases, and our goal is to find a joint ordering and pricing policy that maximizes the expected discounted profit over the planning horizon. When there is no fixed ordering cost, we design a deep reinforcement learning algorithm to obtain a near-optimal ordering policy and show that the learned policy has some monotonicity properties. We also show that our deep reinforcement learning algorithm performs better than tabular Q-learning algorithms. When a fixed ordering cost is involved, we show that our deep reinforcement learning algorithm is effective and efficient and that it circumvents the "curse of dimensionality."


Introduction
The inventory control of perishables has received increasing attention from the business community and academia. According to a report released by the Food Market Institute (2012) in the United States, as of 2005, the total sales of perishables accounted for more than half of sales in supermarkets and grocery stores in the US, and this proportion is still increasing. Meanwhile, losses due to the deterioration of perishables also account for a large proportion of the total retail cost. Pricing is also an important and effective lever for the retail industry to manage the profitability of perishables. As shown in Karaesmen et al. [1] and Chen et al. [2], a firm's profit increases significantly when the prices of perishables are adjusted dynamically according to the availability of inventory and the remaining lifetimes of the items.
In this research, we study a joint pricing and inventory control problem for perishables over a finite planning horizon. Demand in each period depends on the current price and follows a Poisson distribution. The problem of inventory control for perishables is usually more difficult than that for nonperishables, in which the inventory state can be represented by a single variable. The state of perishables has to be recorded by a vector to account for items with different remaining lifetimes, which makes analytical study much more difficult. Because a fixed ordering cost makes the problem even more difficult in a dynamic setting, few studies consider it, for reasons of analytical tractability.
One main contribution of this research is that we consider a fixed ordering cost in our model. We study both the backlogging case and the lost-sales case and allow for positive lead time. Our goal is to find a near-optimal ordering and pricing policy that maximizes the expected profit over the planning horizon. This problem is hard to analyze with the traditional dynamic programming approach of the inventory control literature. Therefore, we use a reinforcement learning approach to solve it.
In the literature, a few papers study inventory control problems with reinforcement learning, such as Chaharsooghi et al. [3], Dogan et al. [4], and Kara and Dogan [5]. Unlike these papers, which use Q-learning, we take a deep reinforcement learning approach and show that it outperforms Q-learning models that do not use neural networks. The superior performance of deep reinforcement learning on complex problems has also been shown by Ke et al. [6] and Shihab et al. [7].
In this paper, we set up deep reinforcement learning models to study the joint pricing and inventory control problem for perishables. We adopt a FIFO (first-in-first-out) policy in this study. When there is no fixed ordering cost, we show that the fixed pricing strategy is dominated by the dynamic pricing strategy, under which the price can be adjusted according to the availability of inventory and the remaining lifetimes of the items. We set up a benchmark based on realized demand for this case and show that our deep reinforcement learning methods perform better than tabular Q-learning. We also find some monotonicity properties in the learned policies: the learned order quantity is nonincreasing in the inventory position or on-hand inventory, and the price decision is most sensitive to the oldest on-hand inventory. Moreover, to show the extensibility of the proposed algorithm, we extend the demand distribution to the additive form in Chen et al. [2], where customer demand depends on the current-period price plus an additive random term; our deep reinforcement learning models again obtain near-optimal performance. When the fixed ordering cost is taken into account in the joint pricing and inventory control system, we set up a performance upper bound based on the realized demand in each period to assess the performance. Through our proposed methods, we find convergent policies and the critical values under which orders should be placed.

Literature Review
We review two streams of literature which are closely related to our research: traditional inventory control management for perishables and inventory control management with reinforcement learning.

Traditional Inventory Control Management for Perishables.
There is a considerable literature devoted to dynamic inventory control for nonperishable products; see, for example, Presman and Sethi [8], Caliskan-Demirag et al. [9], Alp et al. [10], Almaktoom et al. [11], Azghandi et al. [12], Li et al. [13], and Gan et al. [14]. Dynamic inventory control for perishable products has not been studied as widely. This is not to say that the literature fails to recognize the importance of studying perishables.
Indeed, there are a number of papers devoted to the study of inventory decisions for perishable products. Nahmias and Pierskalla [15] studied dynamic inventory control for perishable products with a fixed lifetime, zero lead time, and uncertain demand. Nahmias [16], Fries [17], and Nahmias [18] studied the same problem with a multiperiod lifetime and zero lead time. Their studies all rely on the assumption that only products that exceed their lifetime are discarded, which is known as the first-in-first-out (FIFO) policy and is widely used in research on perishable retailing. They also proved that the optimal order quantity is decreasing in the inventory levels of the different ages. Prastacos [19] reviewed some important theories and practices in blood inventory management and proposed that this kind of application can be extended to other perishable-product inventory control problems. Ferguson and Koenigsberg [20] considered a two-period joint pricing and inventory control problem with a random lifetime, emphasizing and discussing the impact of competition between new inventory and surplus inventory from the previous period on the inventory and pricing decisions of the first period. Chen et al. [21] used Pontryagin's maximum principle to investigate the optimal pricing and replenishment policies for fashion apparel with short product life cycles. Heuristic algorithms are also increasingly used to address dynamic pricing and inventory control for perishables. Li et al. [22] proposed a base-stock/list-price heuristic policy for the dynamic pricing and inventory control of a perishable product, assuming that demand is a function of price and that lead time is zero. Li and Lu [23] studied the joint optimization of the price and order quantity of a perishable product and proposed a minimax regret algorithm. Li et al. [24] discussed a new dynamic pricing and inventory control scenario for perishables in which new and old products cannot be sold at the same time. The seller can decide whether to discard the inventory remaining from the previous period, even though its lifetime may not be over. They proposed a fractional programming heuristic algorithm to obtain a stable structural policy.
Chen et al. [2] is closely related to our research. They considered positive lead time, used the concept of L-convexity/concavity to analyze the problem, and proposed a heuristic algorithm to solve it. However, the traditional approach used in their research is not able to solve the problem with a fixed ordering cost. By using neural networks with hidden layers to approximate state-action values, our deep reinforcement learning approach exploits the advantages of deep learning [25] and reinforcement learning and is shown to be effective and efficient in finding the solution.

Inventory Control with Reinforcement Learning.
In the literature, there have been few papers that study inventory control problems with reinforcement learning.
Giannoccaro and Pontrandolfo [26] studied the coordination of the inventory policies adopted by different supply chain actors, which is a major issue in supply chain inventory management. They used a reinforcement learning approach to manage inventory decisions at all stages of the supply chain in an integrated manner, aiming to optimize the performance of the whole supply chain. Chaharsooghi et al. [3] proposed an inventory control system based on reinforcement learning methods, which included uncertain delivery times and uncertain customer requirements, to determine the ordering policy for each order point in the supply chain; they used Q-learning to solve the supply chain ordering problem and applied it to the beer game. Jiang and Sheng [27] proposed a case-based reinforcement learning algorithm (CRL) for dynamic inventory control in a multiagent supply chain system. They studied a multiagent simulation of a simplified two-echelon supply chain and showed the effectiveness of their method. Sui et al. [28] considered a vendor-managed inventory (VMI) system in which the supplier makes the inventory management decisions for the retailer, and the retailer is not responsible for placing orders. Through a methodology based on reinforcement learning and a numerical study, they showed that their approach can outperform the newsvendor policy. Zarandi et al. [29] presented a flexible fuzzy reinforcement learning algorithm in which the value function is approximated by a fuzzy rule-based system. They considered the problem of a fuzzy agent (the supplier), that is, how to determine the order amount for each retailer based on the retailer's utility for the supplier when supply capacity is limited, and they demonstrated the effectiveness of the proposed algorithm by simulation. Dogan et al. [4] used the Q-learning method to study an ordering and pricing policy in a multiretailer environment.
Rana and Oliveira [30] used reinforcement learning methods to develop dynamic pricing strategies for interdependent perishable products or services. Kara and Dogan [5] used the Q-learning and Sarsa reinforcement learning algorithms to study a dynamic inventory control problem for perishable products with positive lead time and fixed lifetime. Our research goes further and uses deep reinforcement learning to study this dynamic inventory control of perishable products. The aforementioned studies investigate inventory problems for nonperishable and perishable products and use nondeep reinforcement learning methods. Compared with their problems, ours focuses on the inventory control of perishables, which makes the problem much more difficult. We use neural networks to avoid the curse of dimensionality and show that our deep reinforcement learning model outperforms traditional reinforcement learning models that do not use neural networks.

Model
We consider a periodic-review single-product inventory system over a finite horizon of $T$ periods. The whole process can be defined as a Markov decision process. The decision maker is called the agent, and what it interacts with is called the environment. In each period (step) of a sequence of discrete time periods, $t = 1, 2, \ldots, T$, the agent and the environment interact: the agent selects an action denoted by $A^t$, and the environment responds to $A^t$ and presents a new situation to the agent. At the end of the period, the agent receives a numerical reward denoted by $R^{t+1} \in \mathbb{R}$, in part as a consequence of its action. Throughout this paper, we let superscript $t$ denote the period. More specifically, superscript $t$ refers to the beginning of period $t$; the end of period $t$ coincides with the beginning of the next period and is denoted by $t + 1$.
Customer demand at the beginning of period $t$, denoted by $D^t$, follows a Poisson distribution with parameter $d^t$, or $d^t(A^t)$ if the agent's action $A^t$ at the beginning of period $t$ changes the demand distribution; $d(\cdot)$ is a function of the selling price $p$ and is strictly decreasing in $p$. Let the product's finite lifetime be denoted by $l$, the variable cost by $c^t$, and the lead time by $L^t$ ($0 \le L^t < l$). Let the age of an item be 0 at the time it is shipped to the agent, and let its residual lifetime be $l - i$ when its age is $i$. When an item's age is greater than $l$, it has to be disposed of. The inventory state, also known as the state of the agent, at the beginning of period $t$ can be represented by an $(l - 1)$-dimensional vector, $X^t = (x_1^t, x_2^t, \ldots, x_{l-1}^t)$, where $x_i^t$ represents the level of inventory position of the items at the age of $i$. The level of on-hand inventory is derived from this vector, and $x^t \equiv \sum_{i=1}^{l-1} x_i^t$ is the level of inventory position of all ages.
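The demand model above can be illustrated with a minimal Python sketch. It assumes the linear price-demand function $d(p) = 380 - 10p$ that is used later in the experiments; the function names, the Knuth sampler, and the example state are illustrative choices, not part of the original model description.

```python
import math
import random

def mean_demand(price, a=380.0, b=10.0):
    """Assumed linear price-demand function d(p) = a - b*p, strictly decreasing in p."""
    return max(a - b * price, 0.0)

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def inventory_position(state):
    """Total inventory position x^t: the sum of the components of X^t."""
    return sum(state)

rng = random.Random(0)
means = [mean_demand(p) for p in (32, 34, 37)]        # higher price, lower mean demand
draws = [sample_poisson(mean_demand(37), rng) for _ in range(2000)]
empirical_mean = sum(draws) / len(draws)              # should be close to d(37) = 10
```

Knuth's method is adequate for the moderate demand rates in this price range; for very large rates a different sampler would be preferable.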

An Action.
Here, the action $A^t$ refers to the order quantity $q^t$ and the price decision $p^t$. The selling price $p^t$ is restricted to an interval $[\underline{p}, \bar{p}]$.

Update Rule.
The update rule, denoted by $h(\cdot)$, describes the update of the environment state. In our research, the supply state remains unchanged in each period ($M^{t+1} = M^t$). The demand state in each period depends on the selling price $p$. Last, we need to define the update rules for the inventory state. The update rules for the inventory state $X^t$ can be divided into two cases according to how unmet demand is handled. We first consider the backlogging case.

Reward Function.
In our study, the goal is to maximize the accumulated expected profit over the planning horizon, and the reward function $R^{t+1}$ is defined accordingly. It is specified separately for the backlogging case and for the lost-sales case, each with $L = 0$. In the reward, inventory carried forward to the next period incurs a unit holding cost $h$, unmet demand incurs a unit penalty cost $u$, and $\nu$ is the unit disposal cost. When a fixed ordering cost $K$ is considered, $K$ is subtracted from the reward whenever the order quantity is not 0. The sequence of events in period $t$ is as follows:
(1) Based on the environment state $E^t \equiv (D^t, M^t, X^t)$, the agent selects an action $A^t$. Note that $A^t$ is a vector that includes the ordering and pricing decisions. The order will be delivered at the beginning of period $t + L$; when $L = 0$, the order is delivered immediately.
(2) During period $t$, demand $D^t$ arrives, which is discrete and stochastic and depends on the selling price $p^t$, and the agent satisfies it from the on-hand inventory as far as possible. Unsatisfied demand is either backlogged or lost; the remaining inventory with positive lifetime is carried over to the next period.
(3) At the end of period $t$, the agent receives a reward $R^{t+1}$, which depends on the environment state and the action $A^t$.
(4) At the beginning of period $t + 1$, the agent receives an order (if any), and the environment state is updated to $E^{t+1}$ according to the update rule.
For this joint pricing inventory problem, we introduce the notations in Table 1.
In this paper, we assume, as in Chen et al. [2], that $c \le u/(1 - \gamma)$, where $\gamma$ is the discount factor, which eliminates the incentive to carry back orders intentionally. We also assume that items with different lifetimes are charged the same price and that back orders are met at cost $c$ at the end of each planning period.

Deep Reinforcement Learning Methods
The objective of reinforcement learning is to learn a policy $\pi$ that achieves a near-optimal accumulated reward for the agent. Q-learning [31] is a widely used value-iteration reinforcement learning method in which the expected total discounted rewards of state-action pairs are approximated by a Q-function table based on the Bellman equation, as shown in equation (7). Q-learning has an obvious limitation: when the state space is large, it is impractical and inefficient to record all the states and actions. Mnih et al. [32] extended Q-learning to the deep Q-network (DQN), which uses a neural network to approximate the Q-function table. DQN updates the parameters of the neural network by minimizing the difference between the predicted Q-values and the target Q-values, where the target Q-values are estimated from the current reward and the predicted Q-values of the next state. Meanwhile, to avoid the training instability caused by correlation between training data, a replay memory pool is used.
As mentioned before, in our joint pricing and inventory control problem, the state of the agent is expressed by the inventory state $X^t$, which integrates the different ages and the corresponding quantities. Here, the initial inventory state is $X^0$. In our proposed algorithm PAQ-DQN, there are two neural networks with the same structure but different parameters, $\theta$ and $\theta^-$. We adopt the fixed Q-targets policy of standard DQN: the network that predicts Q-values has the latest parameters, while the network that predicts target Q-values uses the old parameters. Each neural network has two hidden layers with 128 neurons each, and we use ReLU as the activation function. In each time period, based on the Q-values from the neural network, the ε-greedy policy is executed to select an action from the action space, which contains combinations of ordering and pricing decisions.
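The DQN ingredients described above (a joint order-price action space, ε-greedy selection, a replay memory pool, and a target built from the reward and the next-state Q-values) can be sketched in plain Python. This is a minimal illustration assuming integer prices in [32, 37] and order quantities in [0, 31], with a standard one-step target; the neural network itself is omitted, and all names are hypothetical.

```python
import random
from collections import deque

PRICES = range(32, 38)                                # assumed integer prices in [32, 37]
ORDERS = range(0, 32)                                 # order quantities in [0, 31]
ACTIONS = [(q, p) for q in ORDERS for p in PRICES]    # joint (order, price) actions

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action index with probability epsilon, else the greedy index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def td_target(reward, next_q_values, gamma, terminal=False):
    """One-step target: reward plus discounted best next-state Q-value (target network)."""
    return reward if terminal else reward + gamma * max(next_q_values)

class ReplayMemory:
    """Fixed-capacity pool of transitions; the oldest entries are overwritten first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        # Uniform sampling breaks the correlation between consecutive transitions.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

rng = random.Random(0)
memory = ReplayMemory(capacity=2)
for i in range(3):
    memory.store(("state", i, 0.0, "next_state"))
```

In a full implementation, `next_q_values` would come from the target network with the old parameters, and the loss would be the squared difference between `td_target` and the online network's prediction for the stored action.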
After receiving the reward from the environment, the target Q-values are estimated from the current rewards and the discounted predicted Q-values of the next state, as shown in equation (9). The parameters $\theta$ of the network are updated by minimizing the difference between the predicted Q-values and the target Q-values, as shown in equation (10). After a fixed number of steps, the value of $\theta$ is assigned to $\theta^-$. The details of the algorithm, named perishables integrate age and quantity deep Q-network (PAQ-DQN), are shown in Algorithm 1.
The second reinforcement learning algorithm is named perishables integrate age and quantity advantage actor-critic (PAQ-A2C). A2C is a method combining policy gradient and function approximation. Advantage actor-critic (A2C) has two networks: a policy network, known as the actor, which outputs the policy, and a value network, known as the critic, which evaluates the policy from the actor. In our algorithm, both the policy network and the value network have two hidden layers with 128 neurons each and the ReLU activation function. The activation function of the output layer of the policy network is the softmax, which outputs the probability of each action being executed in the current state. In each time period, based on the current inventory state, an action is executed by the policy network. After receiving the reward from the environment, the value network evaluates this policy and outputs a TD error (td_error). The parameters $\theta_v$ of the value network are updated by equation (11), where $y^t$ is the target value calculated by equation (12). The policy network is updated by $\theta_p \leftarrow \theta_p + \alpha \nabla_{\theta_p} J(\theta_p)$, where $\alpha$ is the learning rate and the gradient $\nabla_{\theta_p} J(\theta_p)$ is shown in equation (13), in which the advantage function is estimated by equation (14). The details of the proposed perishables integrate age and quantity advantage actor-critic (PAQ-A2C) algorithm are shown in Algorithm 2.
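The actor-critic quantities can be illustrated with a small sketch. It assumes the common one-step form in which the TD error $r + \gamma V(s') - V(s)$ serves as the advantage estimate, and it treats the softmax logits themselves as the policy parameters instead of using a neural network; both are simplifying assumptions for illustration and may differ from equations (11)-(14).

```python
import math

def softmax(logits):
    """Numerically stable softmax that turns logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def advantage(reward, value, next_value, gamma, terminal=False):
    """One-step TD error, used here as the advantage estimate."""
    target = reward if terminal else reward + gamma * next_value
    return target - value

def actor_step(logits, action, adv, lr):
    """One gradient-ascent step on adv * log pi(action) with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    """
    probs = softmax(logits)
    return [z + lr * adv * ((1.0 if i == action else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

adv = advantage(reward=1.0, value=2.0, next_value=3.0, gamma=0.5)
new_logits = actor_step([0.0, 0.0], action=0, adv=adv, lr=0.1)
```

A positive advantage raises the probability of the action that was taken, and a negative advantage lowers it, which is the behavior that the policy-gradient update is meant to produce.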

Experiments
In this section, we conduct simulation studies to evaluate the performance of our proposed reinforcement learning algorithms and to investigate the positive effects of dynamic pricing on profit under the proposed algorithms and the impacts of the key parameters. Ordering and pricing policies are also discussed for the situation involving a fixed ordering cost. Here, we only discuss the backlogging case; the discussion of the lost-sales case is given in the Appendix. The values of the various parameters are set in Table 2. For simplicity, the price $p$ and the order quantity $q$ are restricted to $[32, 37]$ and $[0, 31]$, respectively.
In reinforcement learning, hyperparameters have a strong effect on the final performance, so we need to set the variation rules for the relevant parameters, the exploration rate $\varepsilon$ and the learning rate $\alpha$. We adopt the ε-greedy policy here; $\varepsilon$ decreases linearly, that is, in the search-then-converge form of Darken et al. [33], where $y = \text{epoch}^2/\varepsilon_{\text{decay}}$, $\varepsilon_0$ is the initial value of $\varepsilon$, and $\varepsilon_{\text{decay}}$ is the decay parameter.
In order to examine the positive impact of dynamic pricing, we consider a fixed-price policy in which the agent always takes the fixed best price, that is, the price that achieves the highest revenue. Let $\text{MEP}$ and $\text{MEP}_{FP}$ be the expected mean epoch profits for the dynamic ordering and pricing policy and the fixed-price ordering policy, respectively, and let $\text{MDC}$ and $\text{MDC}_{FP}$ be the corresponding mean epoch disposal costs. After ten thousand simulations, we get the results in Table 3. Table 3 shows that PAQ-DQN achieves better performance than PAQ-A2C when lifetime is ...

(1) Initialize replay memory pool $D$ to capacity $N$
(2) Initialize the action-value function $Q$ with random weights $\theta$
(3) Initialize the target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
(4) for epoch = 1 to number of epochs do
(5)   Reset the environment and initialize state $X^0$
(6)   for $t = 1, \ldots, T$ do
(7)     With probability $\varepsilon$, select a random action $A^t$; otherwise select $A^t = \arg\max_{A} Q(X^t, A; \theta)$ (ε-greedy policy)
(8)     Execute action $A^t$ and observe reward $R^{t+1}$ and $X^{t+1}$
(9)     Store transition $(X^t, A^t, R^{t+1}, X^{t+1})$ in the replay memory pool $D$
(10)    Set the current state to $X^{t+1}$
(11)    Sample a minibatch of transitions $(X_i^t, A_i^t, R_i^{t+1}, X_i^{t+1})$, $\forall i = 1, \ldots, N$, from the replay memory pool $D$
(12)    Calculate the target Q-value by equation (9)
(13)    Update the parameters $\theta$ of the network by equation (10)
(14)    Every $C$ steps reset $\hat{Q} = Q$
(15)  end for
(16) end for
ALGORITHM 1: Perishables integrate age and quantity deep Q-network.

(1) Initialize the policy network and the value network with random weights $\theta_p$ and $\theta_v$
(2) for epoch = 1 to number of epochs do
(3)   Reset the environment and initialize state $X^0$
(4)   for $t = 1, \ldots, T$ do
(5)     Take action $A^t$ based on the action probability $\pi_{\theta_p}(\cdot \mid X^t)$
(6)     Execute action $A^t$ and observe reward $R^{t+1}$ and $X^{t+1}$
(7)     Update the parameters $\theta_v$ of the value network by minimizing the loss function in equation (11)
(8)     Estimate the advantage function by equation (14)
(9)     Update the policy network parameters $\theta_p \leftarrow \theta_p + \alpha_p \nabla_{\theta_p} J(\theta_p)$, where $\nabla_{\theta_p} J(\theta_p)$ is calculated by equation (13)
(10)    Set the current state to $X^{t+1}$
(11)  end for
(12) end for
ALGORITHM 2: Perishables integrate age and quantity advantage actor-critic.

From Table 3, it is easy to see that it is better to adjust the price dynamically, so that the price responds to the availability of inventory and the remaining lifetime of the product and the profit is maximized. Table 4 compares tabular Q-learning with the deep reinforcement learning methods on mean epoch profits and mean epoch disposal cost. From the table, we can see that our proposed PAQ-DQN and PAQ-A2C clearly perform better than the Q-learning method. As mentioned before, Q-learning is a tabular method that stores every state-action value in a table. Because our perishable inventory system tracks the different ages, the state space grows exponentially with the lifetime, which makes the tabular approach inefficient and impractical. Moreover, the computing power and time required by the Q-learning method increase greatly with the lifetime.
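The exponential growth of the Q-table can be made concrete with a rough count. The sketch below assumes, purely for illustration, that each of the $l - 1$ state components takes integer values from 0 to 31 and that there are 192 joint actions (32 order quantities times 6 integer prices); with backlogging the components can also be negative, so the real table would be larger.

```python
def tabular_entries(max_level, lifetime, n_actions):
    """Q-table entries when each of the lifetime - 1 state components is in 0..max_level."""
    n_states = (max_level + 1) ** (lifetime - 1)
    return n_states * n_actions

# Hypothetical bounds: component values 0..31 and 192 joint (order, price) actions.
sizes = {l: tabular_entries(31, l, 192) for l in (2, 3, 4)}
```

Under these assumptions each additional period of lifetime multiplies the table size by 32, whereas the input layer of a neural network only gains one unit.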

Experiments on the Performance of Proposed Algorithms.
In this case, we compute the mean epoch profits for the optimal policy and for the proposed PAQ-DQN and PAQ-A2C with zero lead time. In particular, we set up an upper-bound benchmark for this computation and define it as the optimal policy. The optimal policy takes the same price action as PAQ-DQN and PAQ-A2C in each period, and its order quantity is always equal to the real demand $D^t$ in each period, which means that there is never any holding cost, penalty cost, or disposal cost over the planning horizon. Although this benchmark may still be unreasonable in some respects, it is a useful metric to gauge the performance of the agent. Table 5 shows the computed results after twenty thousand simulations, together with $\text{MEP}_{\text{average}} = (\text{MEP}_{\text{optimal}} - \text{MEP})/T$, which denotes the mean per-period difference between the mean epoch profits from the deep reinforcement learning methods and the mean epoch profits from the optimal policy. From the table, we can see that our proposed algorithms perform well for the three different lifetimes: the benchmark is a loose upper bound based on the real demand $D^t$, and the difference from the average optimal profit is almost always less than the highest possible profit per unit, that is, $\text{MEP}_{\text{average}} \le p - c$. Algorithm PAQ-A2C is slightly better than algorithm PAQ-DQN. Figure 1 shows the real epoch profits for the proposed PAQ-DQN and PAQ-A2C (in order to show the variation, we set the initial negative values to zero); from the figure, we can see that both methods quickly reach a relatively flat level of profit and that PAQ-A2C is more stable at the beginning of the learning process. Figures 2-4 show scatter plots of the profit difference between the optimal policy and the proposed PAQ-DQN algorithm for the three different lifetimes. To better show the convergence rate, the figures are drawn on a log-log scale.
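The benchmark and the gap measure can be computed as in the following minimal sketch. It assumes an undiscounted profit with no fixed ordering cost, so that ordering exactly the realized demand leaves only the margin $(p^t - c) D^t$ in each period; the numbers are invented for illustration.

```python
def benchmark_profit(prices, demands, unit_cost):
    """Profit when each order equals realized demand: no holding, penalty, or disposal cost."""
    return sum((p - unit_cost) * d for p, d in zip(prices, demands))

def mep_average(mep_optimal, mep, horizon):
    """Mean per-period gap between the benchmark profit and the learned policy's profit."""
    return (mep_optimal - mep) / horizon

prices = [35, 34, 36]        # hypothetical price actions taken by the agent
demands = [10, 12, 8]        # hypothetical realized demands
upper = benchmark_profit(prices, demands, unit_cost=20)
gap = mep_average(upper, mep=416, horizon=3)
```

Here the gap of 10 per period is below the largest unit margin in the example (36 − 20 = 16), which is the kind of comparison made with $p - c$ in the text.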
From the three figures, we can see that the three MEP differences begin to decrease rapidly after about fifty simulations; this demonstrates that our deep reinforcement learning method works: the agent gradually learns how to order and price near-optimally. The fitting lines in the figures depict the convergence rate, and the fitted functions for lifetimes 2, 3, and 4 are
\[ \log \text{MEP}_{\text{diff}} \approx -0.609 \log(\text{epochs}) + 11.794, \quad r^2 = 0.963, \tag{16} \]
\[ \log \text{MEP}_{\text{diff}} \approx -0.737 \log(\text{epochs}) + 12.445, \quad r^2 = 0.986, \tag{17} \]
\[ \log \text{MEP}_{\text{diff}} \approx -0.575 \log(\text{epochs}) + 11.278, \quad r^2 = 0.956. \tag{18} \]
Here, we also carry out a sensitivity analysis to investigate the effects of the learning rate $\alpha$ and the exploration parameter $\varepsilon_{\text{decay}}$ on the training of the proposed deep reinforcement learning methods. Figure 5 shows the MEP for three different learning rates on PAQ-DQN: the learning rate $\alpha = 0.001$ is the best for the three different lifetimes, and $\alpha = 0.01$ is very close to the best performance. Figure 6 shows the effects of the exploration parameter $\varepsilon_{\text{decay}}$ on PAQ-DQN: when $\varepsilon_{\text{decay}}$ is $1 \times 10^3$, the agent gets a higher reward than with the other two values, and the difference between the three values is very clear. These two sensitivity analyses confirm the importance of hyperparameters, which is a common issue in deep learning. To show the extensibility of the algorithm, we also extend the demand distribution to the additive form in Chen et al. [2], where the random term has zero mean. By letting the random term follow a uniform distribution on $[A, B]$, where $A$ and $B$ are symmetric and have absolute value 2, we get a near-optimal performance with an optimal rate of 96.344%.
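A fitting line of this kind can be obtained by ordinary least squares on the logarithms of both variables. The sketch below assumes natural logarithms (the base is not stated in the text; the slope does not depend on it, but the intercept does) and checks itself on synthetic data generated from the coefficients of equation (16).

```python
import math

def loglog_fit(epochs, diffs):
    """Least-squares line log(diff) = slope * log(epoch) + intercept, with r^2."""
    xs = [math.log(e) for e in epochs]
    ys = [math.log(d) for d in diffs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy * sxy / (sxx * syy)
    return slope, intercept, r2

# Synthetic, noise-free data following the line reported in equation (16).
epochs = list(range(50, 1000, 50))
diffs = [math.exp(11.794) * e ** -0.609 for e in epochs]
slope, intercept, r2 = loglog_fit(epochs, diffs)
```

On real training curves the points scatter around the line, so $r^2$ falls below 1, as in the reported values of 0.963, 0.986, and 0.956.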

Experiments on Dynamic Ordering and Pricing with No Fixed Ordering Cost.
In this case, when the real epoch profits gradually become stable (stable means the real epoch profits ...). Chen et al. [2] discussed the properties of optimal policies in the joint pricing and inventory system without a fixed ordering cost. Under the above settings and with our proposed reinforcement learning methods, we find that, when there is no fixed ordering cost, the learned order quantity is nonincreasing in both the outstanding and the on-hand inventory levels. When $L = 0$, the learned price is always equal to the price that achieves the highest expected revenue, and when $L > 0$, the learned price is most sensitive to the oldest on-hand inventory. Figure 7 shows that the order quantity decreases with the inventory position and the on-hand inventory. In order to show the sensitivity, we extract the fragment $P_5$ from $P$ as an example, where $l = 4$, $L = 2$, and $n = 1000$. In the inventory state $(x_1, x_2, x_3)$, where $x_1$ is the outstanding order and $x_3$ is the oldest on-hand inventory, we find that $x_1$ and $x_2$ are equal to a fixed value more than 500 times out of 1000, and when $x_1 = x_2$, the price decreases with the oldest on-hand inventory. The same results can be obtained from other fragments. In this setting, Figure 8 shows that the price decreases with the oldest on-hand inventory, which means that when the oldest inventory increases, the agent tends to set a lower price.
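One way to carry out such a sensitivity check on logged decisions is sketched below. It is a hypothetical procedure, not necessarily the one used for Figure 8: it keeps the records that share the most frequent pair $(x_1, x_2)$, averages the chosen price for each level of the oldest on-hand inventory $x_3$, and tests whether the averages are nonincreasing. The log entries are invented.

```python
from collections import Counter, defaultdict

def price_by_oldest(log):
    """Map x3 to the mean price among records sharing the most frequent (x1, x2).

    log is a list of ((x1, x2, x3), price) pairs.
    """
    common, _ = Counter((s[0], s[1]) for s, _ in log).most_common(1)[0]
    grouped = defaultdict(list)
    for s, price in log:
        if (s[0], s[1]) == common:
            grouped[s[2]].append(price)
    return {x3: sum(v) / len(v) for x3, v in sorted(grouped.items())}

def is_nonincreasing(values):
    """True if no value in the sequence exceeds its predecessor."""
    return all(a >= b for a, b in zip(values, values[1:]))

log = [((10, 10, 0), 37), ((10, 10, 2), 36), ((10, 10, 2), 36),
       ((10, 10, 5), 34), ((9, 10, 1), 33)]
curve = price_by_oldest(log)
```

Holding the other components fixed isolates the effect of the oldest on-hand inventory on the price, which is the comparison described in the paragraph above.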

Experiments on Dynamic Ordering and Pricing with Fixed Ordering Cost.
In this part, we consider the case in which there is a fixed ordering cost in the joint pricing and inventory system. We use our proposed deep reinforcement learning algorithms to solve this case, and in order to measure the final performance, we set up a loose upper bound as our benchmark. In this benchmark, there are trade-offs between the different costs. The price decision is supplied by the algorithms. For ease of discussion and simplicity, we assume that zero penalty costs and zero disposal costs are achieved, which means that each demand is met and each order is sold within $l$ periods. We also assume that the initial inventory is zero; thus, the first order is always placed at the beginning of the planning horizon. In particular, when $L > 0$, there may be a penalty cost at the beginning. $D^t$ is the real demand in period $t$, $t = 1, \ldots, l, \ldots, T$. It is obviously unwise to order in every period in this setting.
When $L = 0$, taking into account the finite lifetime $l$ and the minimization of total cost, it is easy to see that the agent needs to place at least one order every $l$ periods and that every order is just consumed when the next one arrives. In the first $l$ periods, if $[(D^2 + \cdots + D^l) + (D^3 + \cdots + D^l) + \cdots + (D^{l-1} + D^l) + D^l] \cdot h \le K$, only one order is needed, at the beginning of the first period, with order quantity $q = D^1 + \cdots + D^l$. Moreover, at some point $t_o$, whether to order depends on the time of the last order $t_n$, with $t_o - t_n \le l$: if $[(D^{t_n+1} + \cdots + D^{t_o-1}) + (D^{t_n+2} + \cdots + D^{t_o-1}) + \cdots + D^{t_o-1}] \cdot h \le K$ and $[(D^{t_n+1} + \cdots + D^{t_o}) + (D^{t_n+2} + \cdots + D^{t_o}) + \cdots + D^{t_o}] \cdot h > K$, the agent should order at point $t_o$, and the order quantity at $t_n$ is $q = D^{t_n} + \cdots + D^{t_o-1}$. When $L > 0$, taking into account the finite lifetime $l$ and the minimization of total cost, it is easy to see that the agent needs to place at least one order every $l$ periods. In the first $l$ periods, there is a penalty cost, $(D^1 + \cdots + D^L) \cdot u$, due to the lag of the order. At some point $t_o$ ($t_o \ne 1$), whether to order depends on the time of the last order $t_n$ and on keeping the subsequent penalty cost at zero: if $[(D^{t_n+L+1} + \cdots + D^{t_o+L}) + (D^{t_n+L+2} + \cdots + D^{t_o+L}) + \cdots + D^{t_o+L}] \cdot h > K$, the agent should order at point $t_o$. When $t_n = 1$, the order quantity at $t_n$ is $q = D^{t_n} + \cdots + D^{t_o+L-1}$; when $t_n \ne 1$, the order quantity at $t_n$ is $q = D^{t_n+L} + \cdots + D^{t_o+L-1}$.
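For $L = 0$, the ordering rule of this benchmark can be read as a greedy procedure: an order placed in period $t_n$ covers consecutive periods for as long as the accumulated holding cost stays within $K$ and the coverage does not exceed $l$ periods. The sketch below is one such reading under these assumptions, with zero-based period indices and invented demands; a unit of demand served $k$ periods after the order contributes $k \cdot h$ to the holding cost, which matches the nested sums in the text.

```python
def benchmark_orders(demand, h, K, lifetime):
    """Return {order period: order quantity} for the zero-lead-time benchmark."""
    orders = {}
    t, n = 0, len(demand)
    while t < n:
        end = t + 1                # the order always covers its own period
        carry = 0.0                # holding cost accumulated by this order
        while end < n and end - t < lifetime:
            extra = h * (end - t) * demand[end]   # demand[end] is held end - t periods
            if carry + extra > K:
                break              # covering one more period costs more than a new order
            carry += extra
            end += 1
        orders[t] = sum(demand[t:end])
        t = end                    # the next order is placed when this one is used up
    return orders

plan = benchmark_orders([5, 5, 5, 5], h=1.0, K=10.0, lifetime=4)
```

With $K = 0$ the procedure orders in every period, and with a very large $K$ the lifetime constraint alone decides the spacing of the orders.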
We consider two fixed ordering costs, $K \in \{25, 50\}$, two penalty costs, $u \in \{10.78, 4.18\}$, and two price-demand functions $d^t$: the first is shown in Table 2, and the other is $d^t = 380 - 10p$. The lifetime is $l = 4$, the lead time is $L \in \{0, 1\}$, and a larger order action space is considered for $d^t = 380 - 10p$. Under all of the above settings, we find that when $L = 0$, the convergent price is always the price that maximizes the expected revenue, which is in line with expectations. As in the case with no fixed ordering cost, the order quantity is nonincreasing in both the outstanding and the on-hand inventory levels. Table 6 shows the MEP results from PAQ-DQN and PAQ-A2C after thirty thousand simulations when the fixed ordering cost is 25 and 50 and the price-demand function is $d^t = 380 - 10p$. From the table, we can see that, under the same conditions, the mean epoch profit MEP decreases with the lead time and with the fixed ordering cost. When $L = 0$, algorithm PAQ-A2C performs better than PAQ-DQN, and when $L > 0$, our proposed PAQ-DQN performs better than PAQ-A2C. More interestingly, in our learned convergent policies, we find that there exist one or two critical values of the inventory position in each period in each case when $L = 0$ and $L = 1$. We denote by $cv^t$ the critical value in each period for the cases with one critical value and by $cv_1^t$ and $cv_2^t$, with $cv_1^t < cv_2^t$, the critical values in each period for the cases with two critical values. In the case of one critical value, when $x^t < cv^t$, there is a fixed order quantity $q_1$; when $x^t \ge cv^t$, the fixed order quantity is $q_2$. In the case of two critical values, when $x^t < cv_1^t$, the fixed order quantity is $q_1$; when $cv_1^t \le x^t < cv_2^t$, the fixed order quantity is $q_2$; when $x^t \ge cv_2^t$, the fixed order quantity is $q_3$. For more details about obtaining the critical values, see the Appendix.
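The critical-value structure amounts to a piecewise-constant ordering rule in the inventory position. The following minimal sketch implements such a rule for any number of critical values; the thresholds and order quantities in the example are hypothetical and are not taken from the learned policies.

```python
from bisect import bisect_right

def threshold_order(x, critical_values, quantities):
    """Order quantity for inventory position x under a critical-value policy.

    critical_values must be sorted in increasing order, and quantities must have
    one more entry than critical_values (q1 below cv1, q2 from cv1 up to cv2, ...).
    """
    if len(quantities) != len(critical_values) + 1:
        raise ValueError("need exactly one more quantity than critical values")
    # bisect_right counts the critical values that are <= x, so x = cv_k uses q_{k+1}.
    return quantities[bisect_right(critical_values, x)]

# Hypothetical two-critical-value policy: cv1 = -9, cv2 = 13.
q_low = threshold_order(-10, [-9, 13], [20, 8, 0])
q_mid = threshold_order(-9, [-9, 13], [20, 8, 0])
q_high = threshold_order(13, [-9, 13], [20, 8, 0])
```

Using `bisect_right` reproduces the convention in the text that the boundary value itself belongs to the upper interval.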
Table 7 shows the MEP of the learned policies from algorithm PAQ-DQN; from the table, we can see that our learned policies achieve a higher optimal rate and are closer to the upper bound than the results in Table 6. Table 8 shows the MEP comparison between the learned policies and the algorithm PAQ-DQN; from the table, we can see that our learned policies achieve a higher MEP and a lower MDC, which means that our learned policies are working well.
The above discussion is mainly based on the current inventory state. Next, we add historical inventory states and action information to the state and discuss the impact on the final performance. Here, we define the new state to be $S_t = (X_{t-L}, A_{t-L}, X_{t-L+1}, \ldots, X_t)$ when $L > 0$; when $L = 0$, we also discuss how gradually adding information to the state affects the final performance. The dimension of the state increases with the added information. Table 9 shows the results after ten thousand simulations, where we add the inventory state and action from the previous period to the new state when $L = 0$. From the table, we can see that the new state, which contains more information, almost always performs better than the state consisting of only the current inventory, which means that the agent's current decisions are influenced not only by the current inventory state but also by the inventory states of previous periods. In Table 10, MEP $S_1$ and MEP $S_2$ denote the results when the inventory state and action information of the previous period and of the previous two periods, respectively, are added to the current state.
From the table, we find that the final performance does not keep improving as more historical information is added. This is consistent with the observation above that the dimension of the state grows with the added information, which may have a negative impact on learning.
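The state augmentation discussed above for $L = 0$ (appending the inventory state and action of the previous one or two periods to the current state) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the class name `HistoryState`, the zero padding at the start of an epoch, and the two-component action (order quantity, price) are all hypothetical choices.

```python
from collections import deque


class HistoryState:
    """Builds (X_{t-k}, A_{t-k}, ..., X_{t-1}, A_{t-1}, X_t) as a flat tuple."""

    def __init__(self, k, state_dim, action_dim=2, pad=0.0):
        self.k = k                            # number of past periods to include
        self.block = state_dim + action_dim   # features contributed per past period
        self.pad = pad                        # assumed filler before history exists
        self.past = deque(maxlen=k)           # keeps only the k most recent periods

    def reset(self):
        """Clear the history at the start of a new epoch."""
        self.past.clear()

    def build(self, x):
        """Return the augmented state for the current inventory vector x."""
        missing = self.k - len(self.past)
        features = [self.pad] * (missing * self.block)
        for past_x, past_a in self.past:
            features.extend(past_x)
            features.extend(past_a)
        features.extend(x)
        return tuple(features)

    def record(self, x, a):
        """Store the state and action of the period that just ended."""
        self.past.append((tuple(x), tuple(a)))


if __name__ == "__main__":
    hs = HistoryState(k=1, state_dim=2)
    print(hs.build([5, 3]))          # first period: history is padded
    hs.record([5, 3], [10, 32])      # hypothetical order quantity and price
    print(hs.build([4, 6]))
```

The augmented state has `k * (state_dim + action_dim) + state_dim` entries, so each extra period of history enlarges the input the agent has to learn from, which is the dimension growth referred to in the text.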

Conclusions
In this paper, we investigate a joint pricing and inventory control problem and obtain near-optimal pricing and replenishment policies for stochastic perishable inventory systems with positive lead time by deep reinforcement learning algorithms. Through our designed algorithms, we show that, in a perishable inventory control problem, the expected profit is maximized by adjusting the price according to the availability of inventory and the remaining lives of the items. We consider both the case with no fixed ordering cost and the case with a fixed ordering cost, and we find near-optimal policies for both. Our findings for the case with a fixed ordering cost, which has not been studied before, contribute to the literature on inventory control for perishables. In this paper, we focus only on a single agent's joint pricing and inventory control problem. However, multiple agents are usually involved in supply chains, and their interactions may have a big impact on each agent's pricing and inventory decisions. Therefore, the study of competition and cooperation among participants under complete and incomplete information is an interesting topic for future research.
Appendix
This section discusses the lost-sales case from Section 5. Table 11 shows that it is better to adjust the price dynamically, so that the price responds to the availability of inventory and the remaining life of the product and profits are maximized. From Table 12, we can also see that our proposed PAQ-DQN achieves better performance than PAQ-A2C when the lead time is positive, and that our proposed deep reinforcement learning methods clearly perform better than the Q-learning method.

B. Experiments on the Performance of Proposed Algorithms
In this case, we compute the mean epoch profits for the optimal policy and for the proposed PAQ-DQN and PAQ-A2C with zero lead time. Table 13 shows the computed results after twenty thousand simulations. From the table, we can see that our proposed algorithms achieve good performance for three different lifetimes, where the benchmark is a loose upper bound computed from the real demand $D_t$, and the difference from the average optimal profit is almost always less than the highest possible profit per unit. PAQ-A2C is slightly better than PAQ-DQN. Figures 9-11 show scatter plots of the profit difference between the optimal policy and the proposed PAQ-DQN algorithm for the three lifetimes. To better show the convergence rate, the figures are drawn on a log-log scale. From the three figures, we can see that all three MEP differences begin to decrease rapidly after about ninety simulations; this demonstrates that our deep reinforcement learning method works and that the agent gradually learns how to order and price optimally. Besides, the fitting lines in the figures are used to depict the convergence rate, and the

B.1. Experiments on Dynamic Ordering and Pricing with no Fixed Ordering Cost
In this section, under the same settings as the backlogging case, we get the same results, where the learned order quantity is nonincreasing in both the outstanding and on-hand inventory levels. When $L = 0$, the learned price is always equal to the price that achieves the highest expected revenue, and when $L > 0$, the learned price is most sensitive to the oldest on-hand inventory.

B.2. Experiments on Dynamic Ordering and Pricing with Fixed Ordering Cost
Firstly, we introduce the steps to get the critical values mentioned in the backlogging case. We first  ($t = 1, \ldots, T$), we can observe the relationship between the inventory level and the order quantity and price. In the backlogging case, when $l = 4$, $L \in \{0, 1\}$, $K \in \{25, 50\}$, and $u/(h + u) = 98\%$, we get the following ordering and pricing policies. In each epoch, the first period starts from zero initial inventory, so we cannot observe the critical values there, and the following values are obtained from the second period onward. Tables 14 and 15 show the critical values in the different settings. When $L = 0$ and $K = 25$, the price is always 32 and $q_1 = 80$, $q_2 = 65$, and $q_3 = 35$. When $L = 0$ and $K = 50$, the price is always 32 and $q_1 = 95$, $q_2 = 65$, and $q_3 = 50$. When $L = 1$ and $K = 50$, the price is 32 except for the first period, and $q_1 = 120$ and $q_2 = 0$. When $L = 1$ and $K = 25$, the price is 32 when the inventory position is less than the critical value $cv^1$ and 33 otherwise, with $q_1 = 75$ and $q_2 = 45$. In the lost-sales case, Table 16 shows the MEP for the different settings, and the same structure of learned convergent policies as in the backlogging case is obtained.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.