Application of Reinforcement Learning Algorithm in Delivery Order System under Supply Chain Environment

With intensifying market competition and the globalization of markets, the efficiency of supply chain order management has become an important competitive resource for enterprises. Competition among enterprises is fierce. To respond to customers quickly and effectively, enterprises must minimize the time spent on supply chain order management and refine the order processing workflow. This article presents a study of supply chain order management strategy based on a reinforcement learning algorithm. The article first combines a reinforcement learning algorithm with a deep learning algorithm, drawing on the decision-making ability of the former and the data perception and analysis ability of the latter. The combined approach examines data on the order process, order cycle, and order delivery process in supply chain order management and gives an optimal decision. Questionnaire surveys and seminars were conducted to understand the current supply chain order management process, and its problems were derived from analysis of the data with the deep learning algorithm. Finally, the supply chain order management process was improved using the optimal strategy output by the reinforcement learning algorithm, and the satisfaction survey was conducted again. The survey showed that satisfaction improved, reaching more than 90%.


Research Background and Significance.
With the rapid development of the times, enterprises are also flourishing. The growing number of enterprises and the innovation of supply chain management within them have brought great challenges [1]. Therefore, an enterprise that wants to keep developing must continuously explore new supply chain management methods to improve its competitiveness. Supply chain management is a new type of management mode and way of thinking [2]. Within it, in addition to the importance of suppliers, the management of supply chain orders is essential: order management directly affects the entire enterprise, because it involves customers, who are both the source of purchases and the center of interest. Supply chain management has therefore become an important means for enterprises to obtain sustainable competitive advantages [2,3]. It can shorten the order cycle, improve efficiency, and reduce the total cost of the enterprise.
The main goal of inventory optimization is to control inventory-related expenses. The cost of capital rises as inventory increases, and a backlog of goods is something companies should try to avoid. Realizing the benefits of supply chain management is the central concern of supply chain management strategy research [4][5][6]. Therefore, this article starts from enterprise practice to study the problems of supply chain management and gives reasonable recommendations for supply chain order management so that those benefits can be achieved.

Related Content.
Supply chain management is increasingly regarded as the management of key business processes across the network of organizations that makes up the supply chain [7,8]. Croxton et al. describe the interfaces between processes and give an example of how to implement a process approach internally [9]. Although many people have recognized the benefits of using a process approach to manage the business and the supply chain, most are still unsure which processes to consider, which subprocesses and activities each process includes, and how the processes interact with traditional functional silos. Thus, the stated goal is to provide managers with a framework for implementing supply chain management, provide instructors with material for building supply chain management courses, and provide researchers with a set of opportunities for further developing the field [10,11]. However, this work covers only the key processes and lacks a complete process. Gosavi proposed a reinforcement learning (RL) algorithm based on policy iteration [12] to solve average-reward Markov and semi-Markov decision problems. The algorithm is asynchronous and model-free, so it can be used for large-scale problems. Its core idea is to compute the value function of a given policy and search in the policy space [13,14]. In applied operations research, RL is used to provide good solutions to problems previously considered difficult. He therefore tested the proposed algorithm in commercial case studies related to practical problems in the aviation industry [15,16]. In his experiments, he combined the algorithm with the nearest-neighbor method to handle larger state spaces [17,18]. However, this kind of algorithm has a large error. Tao et al. proposed a production planning management system architecture for a steel manufacturing plant based on the management ideas of make-to-order (MTO) and make-to-stock (MTS) [19].
In this architecture, the order planning process is discussed in detail, and a nonlinear integer programming model is constructed for the order planning problem. The proposed model considers inventory matching and production planning at the same time and has multiple objectives, such as the total cost of earliness/tardiness penalties, tardiness penalties within the delivery time window, and penalties for production, inventory matching, and order cancellation. The results of using the PSO, TS, and hybrid PSO/TS algorithms to solve models with three different orders are also compared [20,21]. Numerical results show that the hybrid PSO/TS algorithm provides a better solution with high computational efficiency. However, while computational efficiency is improved, the accuracy of the result is uncertain.

Main Content and Innovation.
The main content of this article is a study of supply chain order management strategy based on reinforcement learning algorithms.
Through questionnaire surveys and discussions, we obtain data on all aspects of supply chain order management in the enterprise and apply reinforcement learning algorithms to it. The data are first processed to identify a series of problems in order management. The combination of the reinforcement learning algorithm and the deep learning algorithm is then used to analyze these problems and output optimal strategies, from which corresponding strategy recommendations are given for the problems in supply chain order management. The innovation of this paper is to use the data processing ability and decision output ability of the reinforcement learning algorithm, together with the algorithm's data analysis and the accuracy of its decisions, to analyze supply chain order management and give optimal strategy recommendations.

Reinforcement Learning Algorithms and Supply Chain Management Concepts.
Supply chain management integrates the functions of the enterprise with relevant data while improving the competitiveness of the enterprise. It involves a wide range of issues [22]; in other words, it is a complex dynamic network structure that connects the material handling and channel selection of logistics management with coordination, pricing decisions, and so on. With the widespread application of supply chain business models, making the ordering and pricing decisions of all members of the chain acceptable to everyone has become more important. More complex supply chains are composed of different enterprises, and as the supply chain structure grows gradually more complex and moves from a "chain" to a "net," decision-making has to be rebalanced [23,24]. Supply chain management is also a form of effective enterprise management, reflecting the strategic optimization of the enterprise's whole process. Based on semi-Markov theory, a reinforcement learning algorithm is used to learn the joint replenishment problem in the supply chain without a teacher [25,26]. In most cases, unpredictable urgent demands cause delivery delays and reduce the efficiency of all links. To unite the different supply chain links and solve these problems, the basic cycle of each kind of goods is used as the initial state. The Markov decision chain computes the joint replenishment Q value from the actions and transition probabilities, the parameter selection principles, and the termination conditions, and finally an example verifies the effectiveness and practicality of the algorithm [27,28].
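As a minimal sketch of how a Q value of this kind can be updated from experience, the following Python fragment applies one tabular Q-learning step. It assumes a discounted, discrete-time setting rather than the semi-Markov average-reward formulation mentioned above, and the state encoding, action names, cost value, `alpha`, and `gamma` are all hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: states are replenishment-cycle indices and the
# reward is the negative of the cost incurred in the cycle.
q = defaultdict(float)                # all Q values start at 0.0
actions = ["order_now", "wait"]       # assumed action set
new_value = q_update(q, state=0, action="order_now",
                     reward=-5.0, next_state=1, actions=actions)
```

Because costs enter as negative rewards, maximizing the Q value in this sketch corresponds to minimizing replenishment cost.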

Reinforcement Learning and Deep Learning Combined Algorithm.
Traditional reinforcement learning has a well-developed theoretical model and its algorithms are general, but it suffers from low training efficiency and difficulty in processing high-dimensional data [29,30]. Deep learning passes data through complex network layers and transformations. It can analyze data in detail, extract high-level representations, has a complete training mechanism, and provides approximate solution methods for optimization problems, achieving the best results in many applications [31]. Deep learning focuses on the analysis of data, whereas reinforcement learning is stronger at outputting strategies [32]. Each algorithm has its own advantages. Deep reinforcement learning integrates the perception ability of deep learning with the decision-making ability of reinforcement learning. It uses deep learning to learn information automatically from large-scale input data and uses reinforcement learning to optimize decisions based on this information. It is an end-to-end perception and control system with strong generality. This gives rise to the deep reinforcement learning model [33].

Strategy Gradient Learning Algorithm.
The policy gradient reinforcement learning PG-SVM multiround coordinated control method studies the joint replenishment problem with fuzzy variable demand under a single supplier. Demand is a fuzzy variable: its membership function is listed, and the objective function is solved using trapezoidal fuzzy numbers. The objective function obtained from the fuzzy membership degree gives the replenishment period of each product, and the corresponding basic replenishment period length is determined by the optimal replenishment period of each product. In the joint replenishment problem with fuzzy demand, the system obtains a reward after each action, and the mathematical model is processed by the learning algorithm. The function finally solved minimizes the order cost. Because of the large variance in gradient estimation, the policy gradient algorithm converges very slowly, which has become an obstacle to the wide application of policy gradient reinforcement learning. Here it is assumed that the supply chain operates on a weekly cycle and that a cycle consists of several systems. As the supply chain grows gradually more complex, competitive decision-making is composed of units. The strategy for a cycle is expressed as s; where this causes no confusion, s is written as si. The goal of reinforcement learning is to find the optimal parameters, that is, the parameters that maximize the expected cumulative return. Suppose the states and actions of an episode are arranged consecutively as a sequence, R is the cumulative return of the episode, and p is the probability of the episode occurring under the strategy. The expected return can then be expressed as formula (1). Similarly, for another action sequence, it can be expressed as formula (2). The gradient method is used to optimize the objective function, as in formula (3), where the coefficient of the gradient term is the learning rate.
Expanding the gradient term in formula (3) gives formula (4). Expressing the episode probability in that formula in terms of the state transition probability at each time step gives formula (5), and combining the two steps gives formula (6). The expectation in formula (6) consists of two terms. The first term is a direction vector: the direction in which the probability of the current episode changes fastest with respect to the parameters. Updating the parameters along this direction increases or decreases the probability of that episode to the greatest extent. The second term is a scalar that scales this vector in the policy gradient: the larger R is, the more the vector is amplified. Intuitively, the policy gradient increases the probability of high-return trajectories and reduces the probability of low-return trajectories.
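The steps described above appear to follow the standard likelihood-ratio (REINFORCE) form of the policy gradient, which can be sketched as below. The notation is an assumption and not necessarily the article's own: $\tau$ is an episode of states $s_t$ and actions $a_t$, $\theta$ the policy parameters, $\pi_\theta$ the policy, $p(\tau;\theta)$ the probability of the episode, $R(\tau)$ its cumulative return, $J(\theta)$ the expected return, and $\alpha$ the learning rate.

```latex
\[
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[R(\tau)\big]
           = \sum_{\tau} p(\tau;\theta)\, R(\tau), \\
\theta &\leftarrow \theta + \alpha\, \nabla_{\theta} J(\theta), \\
\nabla_{\theta} J(\theta) &= \sum_{\tau} \nabla_{\theta}\, p(\tau;\theta)\, R(\tau)
           = \sum_{\tau} p(\tau;\theta)\, \nabla_{\theta} \log p(\tau;\theta)\, R(\tau), \\
p(\tau;\theta) &= p(s_0) \prod_{t=0}^{T-1} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t), \\
\nabla_{\theta} J(\theta) &= \mathbb{E}_{\tau \sim p(\tau;\theta)}
           \left[ \left( \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right) R(\tau) \right].
\end{aligned}
\]
```

The transition probabilities $p(s_{t+1} \mid s_t, a_t)$ and the initial-state probability $p(s_0)$ do not depend on $\theta$, so they vanish when the logarithm is differentiated. In the last line, the bracketed sum is the direction vector and $R(\tau)$ is the scalar weight discussed above.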

Questionnaire Survey Method.
Personnel involved in order management at a company were selected for a questionnaire survey. The survey covers the order process and its existing problems, problems in the order delivery process, and satisfaction after the improvements. The investigation is divided into two phases. The first investigates the company's existing order process and problems and collects and analyzes the data. The second is a satisfaction survey on the improvements that follow from the first phase, which measures satisfaction with the improved result.

Staff Discussion Method.
The staff discussion method cannot be carried out directly. A detailed question sheet must first be drawn up for the information to be collected, making clear the purpose of the questions and how they relate to order processing and to shortcomings in the enterprise's supply chain management. The information is then collected through face-to-face conversation. The seminar method obtains information not only through questions but also by using conversation to actively draw out more complete information.
This article uses the staff discussion method to identify employees' dissatisfaction with the company's process and then uses the deficiencies they report to improve the process.

Modeling of Reinforcement Learning Problems.
In the supply chain, the coordinated unit-cycle cost of operators, distributors, and retailers is taken as the state; the order cycle time can also serve as the state. When an enterprise places an order, the main cost of each cycle is already determined. Both the timing and the quantity of the order are important factors. If goods are ordered too early and sit idle in the warehouse, or too many goods are purchased and pile up in the warehouse, inventory costs rise. Ordering too much also ties up excessive capital and lengthens turnaround time, which raises the enterprise's costs. Ordering too little means ordering more often, which increases transportation costs. Good inventory management balances the questions of when to order and how much to order, so inventory decision optimization has become an important link in the supply chain. The order cost increases as inventory increases. According to the basic economic order quantity (EOQ) model, the total inventory cost over a period includes an ordering cost term and, as the latter term, an inventory holding cost term. The holding cost derives from the costs surrounding the inventory, such as warehouse, water and electricity, staff, site, and insurance costs. Long-term accumulation of goods increases this term.
The ordering cost covers all costs incurred from placing an order to warehousing the goods, such as supplier relations costs, the transportation cost of the goods, and the purchaser's travel expenses; it is accounted per unit period. In real life, however, the demand rate for goods is often uncertain; it is a fuzzy number. Therefore, the EOQ model is combined with fuzzy numbers to establish a new fuzzy-demand inventory model whose goal is still the lowest total inventory cost. The assumptions are as follows: the demand rate n is a fuzzy variable with a known distribution, and the fuzzy variables differ from period to period; late delivery is allowed, meaning that stock may run out during the supply process, but once it does, the goods owed must be replenished; all orders are delivered at one time; and the demand per unit period is N. Under these assumptions, a total system cost model is obtained. According to fuzzy mathematics, the demand is expressed as a trapezoidal fuzzy number, and an objective function is defined. Let M be the fuzzy maximum set of the objective function on the fuzzy set, with its own membership expression, and form the fuzzy decision set. The conditional extreme point of the decision set is then sought, which gives the simplified fuzzy demand, and from it the optimal value of the basic replenishment period is obtained.
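To make the cost trade-off concrete, the following Python sketch combines the classic EOQ formula with a trapezoidal fuzzy demand. It is an illustration under stated assumptions, not the article's model: backorders are ignored, the fuzzy demand is reduced to a crisp value by averaging its four corner points (one simple defuzzification rule among several), and all numbers are hypothetical.

```python
import math


def trapezoid_membership(x, a, b, c, d):
    """Membership degree of x in the trapezoidal fuzzy number (a, b, c, d),
    assuming a < b <= c < d."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    if x <= c:
        return 1.0                 # plateau
    return (d - x) / (d - c)       # falling edge


def defuzzify_mean(a, b, c, d):
    """Crisp demand as the mean of the four corner points (assumed rule)."""
    return (a + b + c + d) / 4.0


def eoq(demand, order_cost, holding_cost):
    """Classic economic order quantity: Q* = sqrt(2 * D * K / h)."""
    return math.sqrt(2.0 * demand * order_cost / holding_cost)


def total_cost(q, demand, order_cost, holding_cost):
    """Ordering cost per period plus average holding cost per period."""
    return order_cost * demand / q + holding_cost * q / 2.0


# Hypothetical figures: demand is "about 1000 units" per period.
demand = defuzzify_mean(800, 900, 1100, 1200)
q_star = eoq(demand, order_cost=20.0, holding_cost=4.0)
best = total_cost(q_star, demand, order_cost=20.0, holding_cost=4.0)
```

At the optimum the ordering and holding terms are equal, which is the balance between "when to order" and "how much to order" described above; any other order quantity gives a higher total cost.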

Supply Chain Model.
The research in this paper mainly considers a three-level supply chain system comprising a retailer, a wholesaler, and a manufacturer, with only one merchant at each level. This three-level system is closely related to supply chain orders.
As shown in Figure 1, the relationships among the levels of the supply chain are as follows. The retailer faces the market: it accepts consumer demand for goods, forecasts demand, controls inventory, and issues orders to the level above. Consumers buy goods from the retailer, which meets their demand by selling the corresponding quantity of goods, and its inventory falls. Based on this reduction in inventory and on its forecast of future market demand, the retailer adopts an ordering strategy and passes its demand for goods to the wholesaler. After receiving the order, the wholesaler makes its own ordering decision based on the retailer's demand and on its inventory after meeting that demand, and sends its demand for goods to the manufacturer.
In Figure 2, H represents the environment and s the initial state of the system. The learning process is as follows. First, the state sensor Z perceives the environment H and processes the information through the signal acquisition system to obtain the initial state s of the environment. The state sensor then transforms s and sends a signal a to the action selector D and the learner X. The selector D takes action b according to the learned knowledge and the signal a, and this action affects the environment. Because the environment H is affected by the agent's behavior, it changes and enters a new state. At the same time, the environment H feeds back a reinforcement signal R that reflects the effect of action b on the state. The learner X updates its strategy according to the reinforcement signal R and the internal signal a. The structure of the reinforcement learning system shows that the perception of state signals plays an important role in the entire learning process.
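The order flow through the three levels can be sketched in Python as follows. The ordering strategy is an assumption: each level uses a simple order-up-to rule, replenishment is instantaneous (no lead time), and the manufacturer can always produce up to its target. The class name, target levels, and demand figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Echelon:
    name: str
    inventory: int
    target: int  # order-up-to level (assumed ordering strategy)

    def serve(self, demand: int) -> int:
        """Ship as much of the demand as current inventory allows."""
        shipped = min(demand, self.inventory)
        self.inventory -= shipped
        return shipped

    def order_quantity(self) -> int:
        """Order enough to bring inventory back up to the target level."""
        return max(self.target - self.inventory, 0)


def step(chain, consumer_demand):
    """Run one period; return the demand seen at each level, retailer first."""
    requests = [consumer_demand]
    chain[0].serve(consumer_demand)            # retailer sells to consumers
    for lower, upper in zip(chain, chain[1:]):
        order = lower.order_quantity()         # lower level orders upstream
        requests.append(order)
        lower.inventory += upper.serve(order)  # upstream ships what it can
    chain[-1].inventory += chain[-1].order_quantity()  # manufacturer produces
    return requests


chain = [
    Echelon("retailer", inventory=20, target=20),
    Echelon("wholesaler", inventory=40, target=40),
    Echelon("manufacturer", inventory=80, target=80),
]
requests = step(chain, consumer_demand=8)
```

In a reinforcement learning formulation, the inventories would form part of the state, the order quantity would be the action chosen by the agent instead of the fixed rule used here, and the cost of the period would provide the reinforcement signal.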

Deep Reinforcement Learning Model.
The deep reinforcement learning model is shown in Figure 3. This model uses only raw video image information as input; after network processing, the data are mapped to the connection layer, and the optimal value is finally output. The model has achieved results beyond the human level and has advantages over traditional reinforcement learning algorithms in data input and analysis. Both algorithms can output the optimal value, but the data analysis ability of deep reinforcement learning is stronger.

Mobile Information Systems

As shown in Figure 3, data are first input into the deep reinforcement learning model. The data pass through network processing and then reach the connection layer, which analyzes them again and finally produces the output value. In this paper, the DQN network is introduced in the target detection task to learn the search strategy for candidate regions. The basic theory of DQN is explained below. A series of actions and observations circulates between the agent and the environment. At each time step, the agent selects an action from a set of actions. The action is passed to the simulator, whose internal state and output score change accordingly. In general, the environment is stochastic, and the agent does not observe the real internal state of the simulator. What the agent sees is only the raw pixel array representing the current screen. During the interaction, the agent receives from the simulator a signal representing the change in score. In general, the score can depend on the complete sequence of previous actions and observations, and feedback on an action may take thousands of steps to appear. Because the agent observes only the image on the current screen, the information it observes is only a partial description of the simulator's internal state; the current state cannot be fully understood from the current screen alone. The deep reinforcement learning model therefore relies on a sequence of actions and observations to learn the strategy for the entire game.
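Two ingredients of this description can be sketched in plain Python: keeping a short history of observations as the state, and computing the one-step target that a DQN-style network is trained to match. This is an illustration only; the function and class names are hypothetical, and the neural network that would produce the Q values is replaced by plain lists.

```python
from collections import deque


class History:
    """Keeps the last k observations, so that the state is a short
    sequence rather than a single screen."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)

    def push(self, observation):
        self.frames.append(observation)

    def state(self):
        return tuple(self.frames)


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step targets: y = r for a terminal step,
    otherwise y = r + gamma * max_a' Q(s', a')."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        targets.append(r if done else r + gamma * max(q_next))
    return targets


# Hypothetical batch of two transitions with two actions each.
targets = td_targets(
    rewards=[1.0, 2.0],
    next_q_values=[[0.5, 1.0], [3.0, 4.0]],
    dones=[False, True],
    gamma=0.5,
)

history = History(k=2)
for frame in ["screen_0", "screen_1", "screen_2"]:
    history.push(frame)
```

The discount factor `gamma` controls how strongly delayed feedback, which may arrive many steps after the action that caused it, is propagated back to earlier decisions.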

Sensitivity Analysis of Order Inventory Price Name.
According to the reinforcement learning problem model, the relationship between purchase cost and price is calculated numerically. As shown in Table 1, when the cost rises from 15 to 18, the price P1 rises from 31 to 39 and the price P2 rises from 40 to 45. As unit cost and prices increase, profit falls from 4000 to 3300. This can be explained as follows: as the purchase cost increases, the company's profit is lower than when the purchase cost is low. To raise its profit, the company can only increase the product's selling prices P1 and P2.
This shows that changes in different parameters can have an unfavorable impact on the product's order quantity and on its price at different sales stages. This analysis further strengthens the understanding of product inventory management and helps the company set appropriate sales prices and procurement strategies.
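The kind of one-at-a-time sensitivity check described above can be sketched in Python. The profit model here is purely hypothetical (linear demand in each of two sales stages, with made-up coefficients) and is not the model behind Table 1, so its numbers differ from the table; only the cost and price levels are taken from the text.

```python
def profit(cost, p1, p2, a1=450.0, b1=7.0, a2=400.0, b2=5.0):
    """Hypothetical two-stage profit with linear demand a - b * p
    in each sales stage (all coefficients are assumptions)."""
    d1 = max(a1 - b1 * p1, 0.0)   # demand in the first sales stage
    d2 = max(a2 - b2 * p2, 0.0)   # demand in the second sales stage
    return (p1 - cost) * d1 + (p2 - cost) * d2


# Vary the purchase cost while holding both prices fixed.
costs = [15, 16, 17, 18]
fixed_price_profits = [profit(c, 31, 40) for c in costs]

# Raise both prices at the highest cost level.
adjusted_profit = profit(18, 39, 45)
```

In this toy model, profit falls steadily as cost rises at fixed prices, and raising both prices at the higher cost recovers part, but not all, of the lost profit, which matches the qualitative pattern described above.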

Analysis and Optimization of Order Shipping Process.
In the current delivery process, the sales department issues a delivery plan instruction to the logistics system based on the delivery time required by the contract order. Following this plan, logistics verifies the inventory quantity, communicates with the carrier about the delivery route, dispatches a vehicle to the warehouse, loads the goods, and completes the shipment. If no suitable vehicle is available on the route that day, the shipment is moved to the next day, and it is agreed that the delivery task must be completed within that day. The specific process is shown in Figure 4.
It can be seen from Figure 4 that ordered goods are first scheduled by warehousing and delivery and are sent directly to the customer when the goods are in stock. When no vehicle is available, the goods cannot be shipped and are returned to the warehouse. The delivery instruction is issued by the sales department according to the time in the customer's contract order, and the warehousing and delivery department merely executes it. If the carrier has a vehicle going to the customer's location, the goods can be shipped. If there is no vehicle on the corresponding route, the goods cannot be delivered even if they are in the warehouse. The main reasons for the low turnover rate of warehouse goods that already have orders are the uncertainty of vehicle resources and the instability of warehousing and delivery plans.
In response to these problems, we made on-site visits to logistics delivery plan management and discussed the various routes to customers' locations with the carrier to identify the crux of the problem. The on-site investigation showed that the shipping plan and the sales plan work in one direction only; the model is shown in Figure 5. Analysis of the above process shows that information is currently transmitted in one direction only and that goods are shipped solely according to the sales plan. If there are not enough vehicles, shipments can only wait, and the carrier's advantages in gathering vehicle information and overall planning are not fully used. The intermediate process is simplistic and lacks integration and communication. Therefore, the shipping process is redesigned as in Figure 6.
As shown in Figure 6, in the redesigned delivery process the carrier actively provides vehicle route information to the supply chain logistics delivery system. Logistics delivery no longer simply waits for the sales department's delivery plan but actively passes the vehicle information to the sales department. Based on this information, the sales department negotiates with customers in the area along the route, signs contracts, and confirms the shipment of goods. This not only uses vehicle resources effectively but also strengthens the overall handling and management of vehicle information. It also objectively promotes sales activities and improves the turnover rate of goods.
As shown in Table 2, the turnover rate increased in every quarter and the number of turnover days decreased. The increase was 25% in the first quarter, 8% in the second quarter, 23% in the third quarter, and 28% in the fourth quarter. This greatly reduces the amount of funds tied up in inventory, reduces management costs, and avoids the creation of slow-moving materials and the resulting losses and environmental protection issues. At the same time, after the new order shipping process was adopted, customer satisfaction and other delivery indicators were surveyed, and the satisfaction survey data are analyzed in Figure 7. As shown in Figure 7, customer satisfaction in the second half of the year is higher than in the first half. The difference is largest for the stability of orders, and satisfaction with the supply capacity for orders, the timeliness of delivery, the after-sales handling of orders, and the timeliness of complaint handling has also increased. It can be seen that the new shipping process is effective for order management.
Through the optimization of supply chain management, order management has been optimized and work efficiency has been greatly improved. The optimization of order processing and order delivery has shown good results.
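The link between turnover rate and turnover days reported for Table 2 can be illustrated with a short Python calculation. The baseline of 4 turns per quarter and the 90-day quarter are hypothetical; only the 25% first-quarter increase is taken from the text.

```python
def turnover_days(turnover_rate, period_days=90):
    """Average number of days goods stay in stock: the length of the
    period divided by the number of inventory turns in that period."""
    return period_days / turnover_rate


base_rate = 4.0                  # hypothetical turns per quarter
new_rate = base_rate * 1.25      # 25% increase, as in the first quarter

base_days = turnover_days(base_rate)     # 22.5 days
new_days = turnover_days(new_rate)       # 18.0 days
reduction = 1 - new_days / base_days     # about 0.20
```

Because turnover days are inversely proportional to the turnover rate, a 25% higher rate shortens turnover days by about 20%, not 25%.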

Problems in Order Cycle and Optimization Analysis.
Employee discussions and a questionnaire survey were conducted at a company to analyze the information collected on order cycle problems. The following problems were found to be the main ones in the order fulfillment process.
As shown in Table 3, the existing problems include diverse customer order types and ordering methods, unstable customer demand, diverse order shipping modes, outdated order processing methods, and unstable supply. These easily lead to order confusion and missed orders. Moreover, because the procedures are complicated, the order cycle is long, which leads to the loss of customers. These are the main problems behind the long order cycle. Therefore, the causes of the above problems and the optimization goals are analyzed.
As shown in Table 4, the causes are the lack of an electronic processing system, long procurement times, a complex delivery process, and the uneven quality of cooperating customers, and optimization suggestions are given for each cause. The first is to establish an electronic system for order processing. The others are to purchase key materials in advance to guard against unstable supply and shorten procurement time, to simplify the delivery process, and to raise the threshold for cooperating customers so as to select high-quality customers.

Optimization of Order Processing Process.
The figure below shows the general process of the existing order. In this process, the delivery date is entered into the production plan and fed back; after the order is produced and put into storage, the order manager arranges the shipment according to the customer's instructions; after the goods are delivered, order management assists the financial department in dunning and settlement of the order. As shown in Figure 8, the existing order process has many steps, most orders are processed manually, and processing still passes through too many departments. The existing process therefore wastes a lot of time and needs to be improved.
After becoming familiar with the order process and collecting extensive suggestions from the company's senior management and related departments, we improved the existing process. The improvements are shown in Figure 9.
As shown in Figure 9, one order receiving window is eliminated, the order management office deals with customers' orders directly, and the previous manual operations are replaced by electronic system operations. This not only improves the accuracy of order processing but also reduces the order administrator's workload and saves processing time; the order processing cycle is significantly shorter after the improvement than before. It has fundamentally solved the missed-order problem caused by staff instability, and electronic operation has improved both the order delivery time and the pre-order processing time.

Discussion
This paper analyzes supply chain order management based on reinforcement learning theory. It analyzes and improves the problems in the order processing process, the order transfer process, and the order cycle, and then applies the reinforcement learning algorithm to process and analyze the problem data and give the optimal strategy. It improves the order processing and shipping processes and makes strategic recommendations on the order cycle: adopting an electronic office system, abandoning traditional manual order processing, improving the efficiency of order processing, simplifying the process, and shortening the order cycle, all of which benefit the enterprise.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.