The Applicability of Reinforcement Learning Methods in the Development of Industry 4.0 Applications

Reinforcement learning (RL) methods can successfully solve complex optimization problems. Our article gives a systematic overview of the major types of RL methods and their applications in the field of Industry 4.0 solutions, and it provides methodological guidelines for determining the approach best fitted to different problems; moreover, it can serve as a point of reference for R&D projects and further research.


Introduction
Reinforcement learning (RL) has a significant chance to revolutionize artificial intelligence (AI) applications by serving a novel approach to machine learning (ML) development that lets the user handle large-scale problems efficiently. These techniques, together with widespread Internet of things tools, have opened up new possibilities for optimizing complex systems, including the domains of logistics, project planning, scheduling, and further industry-related domains. Exploiting this potential can result in fundamental progress of the Industry 4.0 transformation [1]. During this digital transformation, vertical and horizontal integration will be strengthened, flexibility should be raised, and human control and supervision need to be in focus [2,3]. Furthermore, the data produced by the integrated tools are increasing exponentially, which requires a higher level of autonomous processes and decisions. Reinforcement learning can serve as a valuable tool in the development of self-optimising and self-organising Industry 4.0 solutions. The main challenge of developing these applications is that there are several methods and techniques and a wide range of parameters that need to be defined. As the definition of these parameters requires detailed knowledge of the nature of RL algorithms, the main goal of this paper is to provide a comprehensive overview of RL methods from the viewpoint of Industry 4.0 and smart manufacturing.
To the best of our knowledge, there exists no similar overview article of reinforcement learning methods in Industry 4.0 applications. Next to the fundamental book [4], there are several overviews of reinforcement learning methods from a theoretical point of view. A detailed semantic overview of Industry 4.0 frameworks [5] and a categorization of Industry 4.0 research fields are also described. An overview of the key elements of Industry 4.0 research and several application scenarios [6] highlighted the wide scope of smart manufacturing. Although many authors found a lack of extensive reviews of the Industry 4.0 revolution from different aspects, owing to their persistent work, several articles are available on this topic nowadays [7]. A survey on the applications of optimal control to scheduling in production, supply chain, and Industry 4.0 systems [8] focused on maximum principle-based studies. Most of the surveys and review articles on Industry 4.0 declare the importance of optimization, but mostly only general approaches are discussed, and no detailed guidelines are extracted. A comprehensive survey in the field of Industry 4.0 and optimization [9] discussed the recent developments in data fusion and machine learning for industrial prognosis, placing an emphasis on the identification of research trends, niches of opportunity, and unexplored challenges. Even though it considered several ML methods and algorithms, RL was mentioned only briefly without extracting its key fundamentals. The facts collected above strengthened our motivation to prepare a detailed overview of RL applications and methods used in the field of Industry 4.0.
Our main goals with this are (i) presenting a hands-on reference for researchers who are interested in RL applications, (ii) giving compact descriptions of applicable RL methods, and (iii) serving a guideline to support them in easily identifying the best-fitting subset of RL methods for their problems and hence letting them focus on the relevant part of the literature. Our systematic review is based on an examination of the literature available from Scopus by following PRISMA-P (Preferred Reporting Items for Systematic Reviews and Meta-Analysis Protocols). The PRISMA-P workflow contains a 17-item checklist that supports the preparation and reporting of a robust protocol in a standardized way for systematic reviews. The literature source list was queried in February 2021 with the following keywords: TITLE-ABS-KEY ("reinforcement learning" AND ("smart factory" OR "IOT" OR "smart manufacturing" OR "industry 4.0" OR "CPS")).
Both author keywords and index keywords were involved in the analysis. The keyword processing started with an extensive data cleansing process by (1) building up a standardized keyword unit (SKU) list and splitting complex keywords into SKUs, (2) assigning SKUs to one of the following keyword classification types: (i) principle captured, (ii) industrial field of application, (iii) application field of solution, and (iv) mathematical approach of application methodology, and (3) identifying major classification groups by classification types. 781 articles were involved in the analysis. Out of 14,035 original author and index keywords, 2,579 duplications were filtered out. The remaining 11,456 keywords were sliced into 45,824 SKUs. Finally, 12,017 keywords were assigned to classification types that provide the major tendencies and relations of industrial applications of reinforcement learning methods. Figure 1 shows the change of the assessed literature size over the PRISMA steps.
Our article consists of the following major parts: First, in Section 2, we will give a short general introduction to the reinforcement learning framework and summarize some major mathematical properties behind RL techniques. Furthermore, we will present a classification of RL methods that gives the reader a map for the further discussions.
As a next step, in Sections 3.1-3.3, we will present the key findings of the systematic review and a hands-on reference for further research. Then, in Sections 3.4 and 3.5, we will discuss the conclusions and give a detailed guideline to help the reader choose the most adequate RL method for different problems. Finally, in Appendices A-H, we will provide a compact overview of 18 different RL methods.

Theoretical Background of Reinforcement Learning
In this section, we will summarize the fundamental concept of reinforcement learning, and then we will present a general classification of RL methods. There are three main paradigms in machine learning: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, a functional relationship of a regression model or a classifier is learnt based on data that represent the input and output of the model. In unsupervised learning, the hidden structure of the data is explored, usually by clustering [9].
Reinforcement learning (RL) also refers to learning problems. As Figure 2 represents the process, an agent takes observations of the environment; then, on the basis of these, it executes an action (A_t). As a result of the action in the environment, the agent gets a reward (R_t), it can take a new observation (O_t) from the environment, and the cycle is repeated. The problem is to make the agent learn so as to maximize the total reward. The reinforcement learning concept was introduced in ([4], Section 3.1). While in supervised and unsupervised learning the model fitting requires a complete set of observations, in reinforcement learning the learning process is sequential. Reinforcement learning is based on the reward hypothesis, which states that all goals can be described by the maximisation of expected cumulative rewards. Formally, the history is the sequence of observations, actions, and rewards: H_t = O_1, R_1, A_1, . . ., A_{t-1}, O_t, R_t.
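The interaction loop described above can be sketched in a few lines of Python. The toy environment and the random agent below are illustrative assumptions made up for this sketch, not part of any surveyed method.

```python
import random

class ToyEnvironment:
    """Illustrative single-state environment: action 1 yields reward 1."""
    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        observation = 0  # O_{t+1}: this toy environment has only one state
        return observation, reward

class RandomAgent:
    """Illustrative agent that picks actions uniformly at random."""
    def act(self, observation):
        return random.choice([0, 1])

random.seed(42)
env, agent = ToyEnvironment(), RandomAgent()
history, total_reward, obs = [], 0.0, 0
for t in range(100):
    action = agent.act(obs)         # A_t: agent selects an action
    obs, reward = env.step(action)  # environment returns O_{t+1}, R_{t+1}
    total_reward += reward          # objective: maximize cumulative reward
    history.append((obs, action, reward))
```

A learning agent, unlike this random one, would use the accumulated history to prefer actions that have produced higher rewards.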

A state contains all the information needed to determine what happens next. Formally, the state is a function of the history: S_t = f(H_t). A policy covers the agent's behaviour in all possible cases, so it is essentially a map from states to actions. There are two major categories: (1) deterministic policy, a = π(s), and (2) stochastic policy, π(a|s) = P[A_t = a | S_t = s]. The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π: q_π(s, a) = E_π[G_t | S_t = s, A_t = a], where G_t denotes the return, that is, the discounted sum of future rewards. Practically, the state-value function is a prediction of the expected present value (PV) of future rewards that allows evaluating the goodness of states, so it is a map from states to scalars: v_π(s) = E_π[G_t | S_t = s]. It is easy to see that if an optimal state-value function is known, then an optimal action-value function and an optimal policy can be derived.
The reinforcement learning concept is based on stochastic processes and on Markov chains. The Markov property is fundamental to the mathematical basis of reinforcement learning methods. A state S_t is Markov if and only if the condition P[S_{t+1} | S_t] = P[S_{t+1} | S_1, . . ., S_t] holds. By definition, a Markov decision process (MDP) is a tuple 〈S, A, P, R, γ〉, where S is a finite set of states, A is a finite set of actions, P is a state transition probability matrix with entries P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a], R is a reward function, and γ ∈ [0, 1] is a discount factor. Figure 3 summarizes a classification of reinforcement learning methods in a tree structure. Further details of the different RL methods are described in the Appendix.
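As an illustration of the definition above, the tuple 〈S, A, P, R, γ〉 of a finite MDP can be written down directly. The two-state example and all its numbers are assumptions made up for this sketch.

```python
# Hedged sketch: a finite MDP <S, A, P, R, gamma> encoded with plain
# Python dictionaries; the two-state example values are invented.
S = ["idle", "busy"]
A = ["wait", "work"]
# P[s][a][s2] = probability of moving to state s2 after action a in state s
P = {
    "idle": {"wait": {"idle": 1.0, "busy": 0.0},
             "work": {"idle": 0.2, "busy": 0.8}},
    "busy": {"wait": {"idle": 0.5, "busy": 0.5},
             "work": {"idle": 0.0, "busy": 1.0}},
}
# R[s][a] = expected immediate reward for taking action a in state s
R = {"idle": {"wait": 0.0, "work": 1.0},
     "busy": {"wait": 0.0, "work": 2.0}}
gamma = 0.9  # discount factor

# Sanity check: transition probabilities must sum to 1 for every (s, a)
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```

Such an explicit model is exactly what the dynamic programming methods of Appendix A assume to be perfectly known.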

Overview of the Industry 4.0 Relevant Applications
In this section, we will present hands-on references in tabular format based on the results of our data cleansing process, along with some major results of the systematic literature analysis. These results highlight general trends that can lead the reader to successfully applicable RL methods, preventing inappropriate trials and hence shortening development periods. In the final part of the section, we will present a hands-on guideline that summarizes the key conclusions.

Classification of Applications by Principle Captured.
The main goal of this section is to give an overview of the principle-captured problem types that reinforcement learning has been applied to, to describe the major tools that gave an impressive performance for each problem category, and finally to highlight some typical issues that needed to be taken care of during implementation.
By performing SKU analysis, we identified the most relevant keywords that are assigned to a principle captured. In Table 1, the associated publications are listed by principle captured categories.
Furthermore, Figure 4 shows the principle-captured classes by reinforcement learning method. Although the related frequency table does not meet all the required criteria, a χ²-test calculation is presented in Table 2; it makes it possible to identify some significant deviations from the overall distribution of RL methods by principle-captured class.
In the class of prediction, forecasting, estimation, and planning, value function approximation methods and Markov decision processes are over-represented. This lets us conclude that complex methods are less in focus, which is fully in line with the goal of better understanding the behaviour of the environment without strong optimization aims. In the class of detection, recognition, prevention, avoidance, and protection, the policy gradient methods are over-represented, while MDPs are under-represented. This shows that researchers are more interested in complex models with higher predictive performance than in basic solutions. In the classes of evaluation and assessment and of allocation, assignment, and resource management, the multiagent methods are more in focus, which indicates that this field is on the way to distributing tasks to lower-level tools instead of centralized data processing. While in the first class the distribution of the further RL methods follows the overall distribution, in the second class the policy gradient methods are over-represented, which comes from the fact that allocation-related problems favour creating an optimal policy. In the classes of classification, clustering, and decision making and of scheduling, queuing, and planning, the situation is the opposite: multiagent methods are under-represented, which means that research on these kinds of operations is still focused on centralized solutions. In the class of control, the temporal-difference methods, Markov decision process constructions, and multiagent methods are over-represented, while complex approaches, like policy gradient methods, are under-represented.

Classification of Publications by Industrial Field of Application.
Similarly, as we have shown in Section 3.1, by performing the SKU analysis, we also identified the most relevant keywords that are assigned to industrial fields. In Table 3, the associated publications are listed by industrial field categories. Similarly, as we presented the categories of principle captured, we also prepared Figure 5, which shows the industrial field classes by reinforcement learning method. Although the related frequency table does not meet all the required criteria, a χ²-test calculation is presented in Table 4; it makes it possible to identify some significant deviations from the overall distribution of RL methods by industrial field class.
In the class of energy, solar, power, and electric, the applications of Q-learning methods are over-represented, while more basic methods and policy gradient methods are under-represented. In the class of telecommunication, communication, networking, internet, 5G, Wi-Fi, and mobile, the policy gradient methods are over-represented, and there is a strong focus on the applications of edge computing. In the class of wireless, radio, antenna, and signal, the applications of Markov decision processes are highlighted. Similarly, in the class of vehicle, unmanned aerial vehicle, drone, and aircraft, the applications of Markov decision processes are over-represented together with policy gradient methods, while multiagent solutions are less discussed. In the classes of cyber-physical system and robot and of manufacturing and factory, the basic dynamic programming methods and Q-learning approaches are more popular. Finally, in the class of city and building, the multiagent methods are over-represented.

Classification of Publications by Mathematical Approach of Application Methodology.
Similarly, as we have shown in the previous sections, we also performed the SKU analysis for the third major dimension of keywords, which is the methodological approach of the solution. The most relevant keywords were identified, and then, in Table 5, the associated publications are listed by methodological approach categories.
Although it is not feasible to summarize all the different methodological approaches in detail, we would like to highlight some specialties of selected cases to demonstrate how widely RL approaches are used and to motivate researchers to find a solution to their problems from a new perspective.
As we described in Section 2, reinforcement learning methods are based on the Markov property, and hence it is fundamental to model the problems as Markov decision processes (MDPs), which is far from trivial in several cases. When formulating an MDP, we need to take care of the state space design, especially guaranteeing that a state representation contains all the relevant information to evaluate a situation; in other words, whenever the system is in the same state, the environment responds with the same characteristics to a particular action [96,104,191,203,313,346].
Actor-critic methods are model-free learning methods that learn both the optimal policy for taking an action and the value function for the most accurate evaluation of the current state. Most of the publications mainly discuss distributed autonomous IoT device networks. In these cases, the focus is shifted towards learning and knowledge transfer solutions:
A stochastic model of cloud-based IoT for fog computing computation offloading and radio resource allocation [97].
A centralized joint resource allocation solution for handling the shortage of frequency resources of cellular systems by using a neural-network-embedded reinforcement learning algorithm [176].
Determining the optimal sampling time of IoT devices for energy harvesting to save batteries; since the state space contains continuous quantities, a linear function approximation was used, and a set of novel features was introduced to represent the large state space [349].
A bio-inspired modular RL architecture that is able to perform skill-to-skill knowledge transfer, called the transfer expert RL (TERL) model; its architecture is based on an RL actor-critic model where both the actor and the critic have a hierarchical structure, inspired by the mixture-of-experts model [392].
A deep reinforcement learning-based cooperative edge caching approach [338].
Multiple IoT devices send data in parallel, but in general, they do not provide additional information to the existing knowledge, so it is not necessary to send data permanently; by using an actor-critic method, it can be determined which data packages need to be sent to prevent redundant or irrelevant communication [221].
A mobile edge computing and energy harvesting framework of centralized training with decentralized execution by adopting the MD-hybrid-AC method [120].
An asynchronous advantage actor-critic method for mobile edge computing, because computation offloading cannot achieve good performance in many situations, but the optimal algorithm to use on the IoT side can be chosen [196].
Optimization of the robustness of IoT network topology with a scale-free network model which has good performance under random attacks; a deep deterministic learning policy (DDLP) is proposed to improve stability for large-scale IoT applications [337].
IoT devices lack storage capacity; therefore, a joint cache content placement and delivery policy for cache-enabled D2D networks was constructed [17].
A federated reinforcement learning architecture was presented where each agent working on its independent IoT device shares its learning experience (i.e., the gradient of the loss function) with the others [237].
By applying multiagent methods, there are multiple ways to organize learning: (i) local learning and no centralized knowledge (see Figure 6(a)); (ii) local knowledge deployment, local learning, and central knowledge collection; (iii) local knowledge deployment and local learning with knowledge transfer to close neighborhoods (see Figure 6(b)); and (iv) local knowledge deployment and centralized learning (see Figure 6(c)).
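One of the organizations listed above, local learning with central knowledge collection, can be sketched as follows. The parameter vectors and the simple averaging rule are illustrative assumptions, not a method taken from the surveyed papers.

```python
def central_collect(local_params):
    """Central step: average the parameter vectors learned by all agents."""
    n, dim = len(local_params), len(local_params[0])
    return [sum(p[i] for p in local_params) / n for i in range(dim)]

# Each agent learns locally, producing its own parameter vector ...
agents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# ... then the central node collects and averages the knowledge ...
shared = central_collect(agents)
# ... and each agent continues learning from the shared parameters.
agents = [list(shared) for _ in agents]
```

Federated schemes such as [237] follow the same pattern but exchange loss gradients instead of full parameter vectors.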
Although the first research efforts focused on designing learning algorithms with provable convergence time, other issues, such as incentive mechanisms, were explored later: a deep reinforcement learning-based incentive mechanism has been designed to determine the optimal pricing strategy for the parameter server and the optimal training strategies for edge nodes [147].

Hierarchical Methods.
Hierarchical approaches are applied primarily to solve communication channel or information processing capacity issues. The model structure usually follows the structure of the information path. In a two-layer approach, a local IoT device needs to transfer information to a local hub, and then the local hub transmits the collected information to the central decision maker. In this case, separate models can be set up for both layers to find the optimal scheduling order for communication.
A new crowd sensing framework based on a hierarchical structure is introduced to organize different resources, and it is solved by using a deep reinforcement learning-based strategy to ensure quality of service [88]. A hierarchical correlated Q-learning (HCEQ) approach is presented to solve the dynamic optimization of generation command dispatch (GCD) for automatic generation control (AGC) [231]. An enhanced version of a bio-inspired reinforcement learning modular architecture is presented to perform skill-to-skill knowledge transfer, called the transfer expert RL (TERL) model. The TERL architecture is based on an RL actor-critic model where both the actor and the critic have a hierarchical structure, inspired by the mixture-of-experts model, formed by a gating network that selects experts specializing in learning the policies or value functions of different tasks [392]. A new cloud computing model is proposed that is hierarchically composed of two layers: a cloud control layer (CCL) and a user control layer (UCL). The CCL manages cloud resource allocation, service scheduling, service profile, and service adaptation policy from a system performance point of view. Meanwhile, the UCL manages end-to-end service connection and service context from a user performance point of view. The proposed model can support nonuniform service binding and its real-time adaptation using metaobjects by intelligent service-context management using a supervised and reinforcement learning-based machine learning framework [150].
(Table: the associated publications grouped by RL method, including multiarmed bandit, dynamic programming, Q-learning, policy gradient, actor-critic, double deep Q-network, imitation, and multiagent methods.)
A new cooperative resource allocation algorithm is presented which couples reinforcement learning networks and prediction neural networks for accurate mobile target tracking. Specifically, a hierarchical structure that performs collaborative computing is designed to alleviate the computing pressure of front-end devices, which are supported by edge servers [397]. A slightly different approach is applied to a resilient control problem studied for cyber-physical systems (CPSs) under the denial-of-service (DoS) attack. The term resilience is interpreted as the ability to be robust to the physical layer external disturbance and to defend against cyber layer DoS attacks.
The overall resilient control system is described by a hierarchical game, where the cyber security issue is modeled as a zero-sum matrix game, and the physical minimax control problem is described by a zero-sum dynamic game. By virtue of the reinforcement learning method, the defense/attack policy in the cyber layer can be obtained, and additionally, the physical layer control strategy can be obtained by using the dynamic programming method [398]. Further publications in hierarchical RL topics are related to balancing timeliness and criticality when gathering data from multiple sources [116] and to ubiquitous user connectivity and collaborative computation offloading for smart cities [248].

Distributed and Parallel Methods.
It can be stated with certainty that the biggest potential of industrial applications lies in intelligent devices. In this context, intelligence means some kind of ability to make decisions autonomously and, furthermore, to perform learning steps locally. Significant efforts have been made to develop functional solutions to reach this goal. Computation offloading can provide a solution for the high computation requirements of resource-constrained mobile devices. The mobile cloud is the well-known existing offloading platform, which is usually a far-end network solution, but this can cause other issues, such as higher latency or network delay, which negatively affects real-time mobile Internet of things (IoT) applications. Therefore, a deep Q-learning-based autonomic management framework is proposed as a near-end network solution of computation offloading in mobile edge [133].
Another way to extend single reinforcement learning applications is to handle multiple objectives. There are two major solution practices to handle such problems. The most obvious idea is to construct a mixed reward function that returns a combined result according to the different objectives [161,259,370]. Another possible way is to combine multi-objective ant colony optimization methods.
The decision flow diagram presented in Figure 8 builds on the following rules and conclusions:
R-01: Reinforcement learning methods exploit the Markov property; hence, it needs to be ensured that each and every potential state contains all the relevant information that can have any influence on the outcomes, that is, on the rewards and on the state transitions.
R-02: Only the feasible actions need to be involved in the action space to simplify it, which can significantly speed up learning convergence. Furthermore, the effects of an action in a particular state should be based on the same deterministic or stochastic behaviour.
R-03: If the rewards and the state transitions can be determined (in a deterministic environment) or can be simulated (in a stochastic environment), then the learning process can be done in a virtual environment.
R-04: If an RL agent is able to decide on the next action and hence to discover unknown or undervalued actions, then a trial-and-error learning process can be an option.
R-05: Even if it is usually quite limited, it is also an option for learning to observe an external system in use.
R-08: In some cases, the episodes can take extremely long, or it cannot be guaranteed that episodes end within a limited time period.
R-09: Sharing knowledge and experience between RL agents can improve the learning performance, but it is not applicable in single-agent and in separated setups.
C-02: If not, then it is suggested to rework the action space definition.
C-04: RL methods require learning from own experience or observing an environment in use with rewards and state transitions. If neither is feasible, RL cannot be an option.
C-05: Only off-policy methods can be applied.
C-06: On-policy methods can also be applied.
C-11: Multiagent solutions are suggested. There are techniques with centralized knowledge sharing and also with distributed methods, depending on the problem properties.
By distributing computational tasks to IoT devices, a fundamental change is required: it is not possible to assign as much human effort to data processing and predictive model development supervision as before, during the centralized era.
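The mixed-reward practice mentioned earlier for multi-objective problems can be sketched as a weighted scalarization. The objectives (latency, energy, throughput) and the weights are assumptions made up for this illustration.

```python
def mixed_reward(latency, energy, throughput,
                 w_latency=0.5, w_energy=0.3, w_throughput=0.2):
    """Combine per-objective rewards into one scalar reward.

    Lower latency and energy consumption are better (negative weight),
    while higher throughput is better (positive weight).
    """
    return (-w_latency * latency
            - w_energy * energy
            + w_throughput * throughput)

r = mixed_reward(latency=1.0, energy=1.0, throughput=1.0)
```

The choice of weights encodes the trade-off between objectives and typically has to be tuned for the application at hand.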
This was the major reason for the appreciation of RL methods: they provide a general self-learning framework that basically requires no manual or human interaction to maintain. Early research focused on the applicability of reinforcement learning techniques with single agents. Then, more and more complex problems were solved, and multiagent solutions started to be analyzed. In recent years, the focus of researchers has been shifting to multiagent structures. The setups of the agents and their goals or reward functions show very creative solutions. In a new wave of research, the agents are defined with different roles, often with attacker-defender objectives, and each agent is trained to an optimal strategy accordingly. Then, the stability and robustness of the system can be analyzed, and the weakest items can be purposefully improved.
As Figure 7 demonstrates, the number of Industry 4.0-related reinforcement learning-based research works dynamically increases, and there is no sign of it slowing down.

Discussion and Guideline Process to Determine Appropriate RL Method to Use.
On the basis of the previous section, it can be highlighted that there are several ways and methods in which reinforcement learning can be applied to Industry 4.0-related problems, and it is far from trivial which one can provide a successful solution.
We prepared a questionnaire, and we present it as a decision flow diagram in Figure 8. Our primary goal was to set up a method to help readers formulate their RL tasks. The first questions of the questionnaire-based process verify whether the state and action spaces are appropriately defined and how the reward can be obtained. The further questions systematically narrow down the set of applicable RL methods. The possibility of using simulation or learning from own experience can determine the general learning mechanism. In contrast, the nature of reward propagation can determine a smaller subset of applicable RL methods. Even if the conclusions are soft-defined, a user with some basic knowledge of RL methods can easily interpret them, or they can be the basis of an RL method selector wizard. We believe that researchers will have fewer failed attempts by using our guideline, and the time-to-solution can be reduced significantly.
We should keep in mind that the whole reinforcement learning concept is based on Markov decision processes. A direct conclusion is that the state space should be constructed so that every potential state contains all the relevant information that can have any influence on the outcomes. Moreover, the action space should be constructed similarly: the effects of an action in a particular state should be based on the same deterministic or stochastic behaviour. This will let the RL agent learn the underlying effect mechanism.
Once the state and action spaces are defined, it needs to be investigated whether performing simulations is an option or not. If we are able to determine the environment's behaviour when an action is taken in a particular state, that is, to derive the reward value and the state transition, then an extensive learning process can be executed by using model-based RL methods in a cost-efficient way without the significant risk of applying untrained agents. The general rule is also true in this case: the RL solution will be as adequate as the simulation is. If there is an option to validate the simulation outcomes against the real environment, then this can help to ensure the validity of the solution.

Conclusions
As we pointed out, reinforcement learning methods have high potential in Industry 4.0 applications, which is a common agreement of researchers; one of the biggest reasons behind this is that smart tools require a high level of optimization which cannot be satisfied with human interventions.
This continuously raises the demand for self-learning solutions, and RL techniques have proven their efficiency in multiple fields. A major goal of our article was to give an overview of RL applications in the field of Industry 4.0. As a first step, we served a high-level overview of the general RL framework and a classification of RL methods to easily see through the possibilities, while we also presented a more detailed summary of the most widely used RL methods of Industry 4.0 applications in the Appendix. Therefore, our publication can serve as a starting point of further research on RL applications. Then, we highlighted the results of our systematic literature overview of reinforcement learning applications in the field of Industry 4.0. An extensive keyword analysis led us to identify some typical patterns for choosing an adequate RL method for particular combinations of principle-captured categories and industrial fields. Although there is no unique optimal RL method, there are RL methods that provide efficient solutions for certain problems. Our summary can be used as a hands-on reference for further research, and it can help researchers shorten the preparation time of their research.
Furthermore, we prepared a questionnaire that provides a methodology to set up the reinforcement learning system in a proper way and to choose an appropriate method for the learning problem that the researcher is facing. We believe that an extension of our questionnaire can be the basis of a wizard tool that enables the user to find the most fitting RL method for the learning task and guides them through the setup process. On the other hand, by knowing the key properties of the different RL methods, it becomes faster to adopt an existing one or to modify it to fit specific needs and hence develop an own RL method.
We hope that our article strengthens researchers in their decision to use RL methods for further applications, as numerous successful applications show their high efficiency.

Appendix

The following sections summarize the major reinforcement learning methods and their evolutionary stages by following David Silver's approach, from the simplest ones to the more complex ones.

A. Dynamic Programming
Dynamic programming (DP) solves a decision process by breaking it down into a sequence of elementary decision steps over time. "Dynamic" refers to the sequential approach, while "programming" refers to its optimization objective.
In this section, all the methods work under the assumption that the environment is perfectly known. The iterative policy evaluation method is described for learning the state-value function of a given policy π; then the value iteration method is used to determine the optimal state-value function; and last but not least, policy iteration is presented to derive an optimal policy for the environment.
In general, there is limited practical usage of dynamic programming algorithms, both because of the assumption that the environment is perfectly known and because of their high computational requirements. On the other hand, dynamic programming methods provide the essence of the ideas that are used in advanced methods, in an easily understandable form.
Iterative Policy Evaluation. Let us assume that a policy π is given and actions are taken according to it. The goal is to determine the state-value function v_π by iterative application of the Bellman backup: v_1 ⟶ v_2 ⟶ ... ⟶ v_π. At each iteration step, the state-value function should be updated in the following way: v_{k+1}(s) = Σ_{a∈A} π(a | s)(r(s, a) + c Σ_{s′∈S} P(s′ | s, a) v_k(s′)). The second term shows the cumulative reward from state s by taking action a and applying a single Bellman decomposition, while the first term provides the probability of taking action a by following policy π. It can be proven that, under weak conditions, the proposed state-value function update converges to v_π(s) ([4], Section 4.2).
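As an illustration, the Bellman backup above can be sketched in a few lines of Python. The 2-state, 2-action MDP (transition probabilities, rewards, the uniform policy, and the discount factor) is a hypothetical example, not taken from the article:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; P, R, pi, and gamma are illustrative assumptions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: expected immediate reward
              [0.5, 2.0]])
pi = np.full((2, 2), 0.5)                 # pi[s, a]: uniform random policy
gamma = 0.9                               # discount factor (c in the text)

def iterative_policy_evaluation(P, R, pi, gamma, tol=1e-10):
    """Repeat the Bellman backup v_{k+1}(s) = sum_a pi(a|s)(R[s,a] + gamma*sum_s' P[s,a,s'] v_k(s'))."""
    v = np.zeros(P.shape[0])
    while True:
        v_new = (pi * (R + gamma * P @ v)).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

v_pi = iterative_policy_evaluation(P, R, pi, gamma)
```

Because c < 1, the backup is a contraction, so the loop terminates for any tolerance.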
Value Iteration. The iterative policy evaluation method can be extended to find the optimal state-value function v_*(s).
The main idea is that the iteration should be done by starting from the final reward and working backward. Let us assume that the solution of the subproblem v_*(s′) is known. Then, in the next iteration step, v_*(s) can be found by a one-step look-ahead: v_*(s) = max_{a∈A}(r(s, a) + c Σ_{s′∈S} P(s′ | s, a) v_*(s′)). It can easily be seen that for a finite state space S, the determination of the optimal state-value function for all available states can be done in a finite number of steps ([4], Section 4.4).
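The one-step look-ahead can be sketched as follows; the toy MDP below is an illustrative assumption (not from the article), and the greedy maximum over actions replaces the policy average of iterative policy evaluation:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-10):
    """One-step look-ahead: v_{k+1}(s) = max_a (R[s,a] + gamma * sum_s' P[s,a,s'] v_k(s'))."""
    v = np.zeros(P.shape[0])
    while True:
        v_new = (R + gamma * P @ v).max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

v_star = value_iteration(P, R, gamma)
greedy_actions = (R + gamma * P @ v_star).argmax(axis=1)  # greedy policy read off v_star
```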
Policy Iteration. The iteratively learnt knowledge can be extracted by improving the policy, acting greedily with respect to v_π. This practically means picking the action a from a particular state s which maximizes the sum of the immediate reward r(s, a) and the discounted state-value c·v_π(s′) of the successor state s′ ([4], Section 4.6). The learning process of policy iteration is demonstrated in Figure 9.
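The evaluate-then-improve loop can be sketched as follows, again on a hypothetical toy MDP (all numbers are illustrative assumptions):

```python
import numpy as np

# Policy iteration on a toy 2-state, 2-action MDP (values are illustrative assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma, tol=1e-10):
    """Alternate iterative policy evaluation with greedy improvement until the policy is stable."""
    n = P.shape[0]
    policy = np.zeros(n, dtype=int)           # deterministic policy: state -> action
    while True:
        # Evaluate the current deterministic policy
        v = np.zeros(n)
        while True:
            v_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v for s in range(n)])
            done = np.max(np.abs(v_new - v)) < tol
            v = v_new
            if done:
                break
        # Improve: pick the action maximizing r(s, a) + c * v(s')
        new_policy = (R + gamma * P @ v).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

policy, v = policy_iteration(P, R, gamma)
```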

B. Model-Free Prediction Methods
Unlike dynamic programming, model-free methods do not require a perfectly known environment; only experience samples are needed, in other words, just sequences of states, actions, and rewards, with no prior knowledge of the environment. In this section, the Monte-Carlo learning method is presented for learning simply by averaging the experience; then the temporal-difference learning method is discussed, which lets the agent learn in more frequent but smaller steps by applying bootstrapping techniques; finally, the temporal-difference (λ) learning method is described as an extension of the temporal-difference method from one-step to multi-step learning.
Monte-Carlo Learning. A Monte-Carlo (MC) agent solves the reinforcement learning problem by averaging sample returns, so it learns from complete episodes. Hence, it must be guaranteed that every episode terminates; otherwise, the learning process cannot be performed. MC uses the simplest idea by assigning the empirical mean of returns to a specific state ([4], Section 5.1). There are two major types of MC methods:

First-visit MC: only the first visit of a state is involved in the calculation during an episode. Let us assume that state s is first visited at time period t. Let us denote by G_t the total return from time period t, by N(s) the number of times that state s has been visited, and by S(s) the sum of the G_t returns up to the current episode. In this case, the state-value estimate will be the empirical mean: V(s) = S(s)/N(s). As experience grows, so that N(s) ⟶ ∞, the empirical mean converges to the state-value function: V(s) ⟶ v_π(s).

Every-visit MC: all visits of a state are involved in the calculation during an episode. Formally, the main difference to first-visit MC is that N(s) is incremented at every time period t whenever state s is visited.
From a computational point of view, it is important to mention that in practice the empirical mean is determined incrementally. Let us denote by V^(n)(s) the value function estimate and by S^(n)(s) the cumulative sum of returns after episode n, let G_t^(n) be the total return in episode n from time period t when state s is visited, and assume that state s has been visited k times overall. Then the incremental update is V(s) ⟵ V(s) + (1/k)(G_t^(n) − V(s)). Figure 10 demonstrates the learning process of the Monte-Carlo method. As we can see, the learning step is performed at the end of an episode.
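As an illustration, first-visit MC prediction with the incremental mean can be sketched as follows; the state names, rewards, and episodes are hypothetical:

```python
# First-visit Monte-Carlo prediction with an incremental mean update
# (states, rewards, and episodes below are illustrative assumptions).
def first_visit_mc(episodes, gamma=1.0):
    """episodes: list of [(state, reward), ...] pairs; returns V(s) = S(s)/N(s)."""
    N, V = {}, {}
    for episode in episodes:
        # Compute the return G_t backwards through the episode
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        seen = set()
        for state, G in reversed(returns):
            if state in seen:            # first-visit: count each state once per episode
                continue
            seen.add(state)
            N[state] = N.get(state, 0) + 1
            # Incremental mean: V(s) <- V(s) + (1/N(s)) * (G - V(s))
            V[state] = V.get(state, 0.0) + (G - V.get(state, 0.0)) / N[state]
    return V

V = first_visit_mc([[('A', 1.0), ('B', 2.0)], [('A', 0.0), ('B', 4.0)]])
```

With the two episodes above and gamma = 1, the returns from 'A' are 3 and 4, and from 'B' they are 2 and 4, so the estimates are their means.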

Temporal-Difference Learning. A temporal-difference (TD) agent learns from incomplete episodes by applying bootstrapping. Compared to MC learning, TD uses the best guess of the total return, formally R_{t+1} + cV(S_{t+1}) instead of the episodic experience G_t, to calculate the value function estimate V(s). This single difference means that a TD agent can perform a learning step after every single action ([4], Section 6.1), as Figure 11 shows. As a consequence, it can also be applied to never-ending episodes.
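A single TD(0) backup can be sketched in one line of Python; the state names and numbers are hypothetical:

```python
# One-step temporal-difference update (state names and numbers are illustrative).
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma*V(S_{t+1}) - V(S_t));
    the bracketed term is the TD error, usable after every single action."""
    V[s] = V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0))
    return V

V = td0_update({'A': 0.0, 'B': 1.0}, 'A', r=1.0, s_next='B')
```

Here the TD target is 1.0 + 0.9 * 1.0 = 1.9, so V('A') moves one tenth of the way towards it.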
Temporal-Difference (λ) Learning. There are intermediate solutions between TD, which performs value function estimate updates after the 1-step return, and MC, which performs updates only at the end of an episode (practically an ∞-step return). The main idea is to apply the normalized geometric series (1 − λ)λ^{n−1} for weighting the n-step returns G_t^(n) ([4], Section 7.1). In this case, the value function estimate will use the weighted total return G_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} G_t^(n). It can be shown that TD(1) is equivalent to every-visit MC learning, while TD(0) is equivalent to the original one-step TD learning method. Furthermore, TD(λ) methods can be applied both in a forward and in a backward view. The algorithms shown in this section can be used either

in offline mode: value function estimate updates are accumulated within episodes but applied only at the end of the episode, or

in online mode: value function estimate updates are accumulated within episodes and can be applied immediately.
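The backward view of TD(λ) can be sketched with accumulating eligibility traces; the tiny episode below is an illustrative assumption:

```python
# Backward-view TD(lambda) with accumulating eligibility traces over one episode
# (the tiny episode below is an illustrative assumption).
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """episode: list of (s, r, s_next) transitions; every state's value is moved
    towards the one-step TD error, weighted by its eligibility trace E(s)."""
    E = {s: 0.0 for s in V}
    for s, r, s_next in episode:
        delta = r + gamma * V.get(s_next, 0.0) - V[s]   # one-step TD error
        E[s] += 1.0                                     # accumulating trace
        for state in V:
            V[state] += alpha * delta * E[state]
            E[state] *= gamma * lam                     # decay all traces
    return V

V = td_lambda_episode({'A': 0.0, 'B': 0.0}, [('A', 1.0, 'B')])
```

With lam = 0 the traces vanish after one step and the update reduces to plain TD(0).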
A unified view of model-free prediction techniques is shown in Figure 12. It was originally created by Richard Sutton, but this version was prepared by David Silver. It highlights the two most important dimensions of learning methods: the vertical dimension represents the depth of the updates, while the horizontal dimension represents the width of the updates.

C. Model-Free Control Methods
In the previous section, model-free prediction methods were summarized.
These methods learn the value function of a given policy: the acting policy is fixed and managed externally. In contrast, control methods let the algorithm take actions on the basis of its own policy. Hence, a major objective steps to the front: optimizing the policy.
In this section, ϵ-greedy policy iteration is described, which combines exploitation of the current knowledge of optimal decisions with exploration of unknown new potentials. Furthermore, the on-policy temporal-difference control method known as SARSA is presented, which applies bootstrapping techniques to speed up the learning process.
ϵ-Greedy Policy Iteration Control. ϵ-Greedy policy iteration is a combined solution. On the one hand, the MC method is applied to learn the action-value function Q(s; a). On the other hand, the agent can act greedily, which means that it will choose the best action on the basis of the current action-value function Q(s; a).
This kind of action policy exploits only the current experience and does not support exploring alternatives. With a small change in the strategy, this issue can be solved: let the agent act randomly with probability ϵ and greedily with probability (1 − ϵ) ([4], Section 5.4). Formally, for m available actions, π(a | s) = 1 − ϵ + ϵ/m if a = argmax_{a′} Q(s; a′), and π(a | s) = ϵ/m otherwise.

Figure 9: Learning by policy iteration method.
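The ϵ-greedy selection rule can be sketched as follows; the Q-values and action names are hypothetical:

```python
import random

# Epsilon-greedy action selection (Q-values and action names are illustrative).
def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """Explore uniformly with probability epsilon, otherwise exploit argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {('s0', 'left'): 1.0, ('s0', 'right'): 2.0}
greedy_action = epsilon_greedy(Q, 's0', ['left', 'right'], epsilon=0.0)
```

With epsilon = 0 the choice is purely greedy; raising epsilon trades exploitation for exploration.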
On-Policy Temporal-Difference Control Method, aka SARSA Method. Similar to the model-free prediction methods, there is also an algorithm that lets the agent learn from incomplete episodes by applying bootstrapping ([4], Section 6.4). In this case, the ϵ-greedy policy iteration method needs to be modified in the following way: instead of the MC method, TD learning should be applied for learning the action-value function Q(s; a), which makes it possible to perform a learning step after every single action and to act according to the most updated action-value function, similarly to ϵ-greedy policy iteration. The SARSA name comes from the acronym of state s ⟶ action a ⟶ reward r ⟶ state s′ ⟶ action a′. Following the SARSA method, the action-value function update looks like Q(s; a) ⟵ Q(s; a) + α(r + cQ(s′; a′) − Q(s; a)). It can be proved that under certain conditions, the SARSA action-value function converges to the optimal action-value function: Q(s; a) ⟶ q_*(s; a).
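A single SARSA backup over an (S, A, R, S′, A′) tuple can be sketched as follows; all names and numbers are hypothetical:

```python
# One SARSA backup over a (S, A, R, S', A') tuple (all names illustrative).
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma*Q(s',a') - Q(s,a));
    a' is the action actually chosen by the current (e.g. epsilon-greedy) policy."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q

Q = sarsa_update({}, 's0', 'a0', r=1.0, s_next='s1', a_next='a1')
```

Because a′ comes from the agent's own policy, this is an on-policy update, in contrast to Q-learning below.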

D. Off-Policy Learning
There are several situations in which the learning process is not based only on the agent's own experience. Formally, this means that the target policy π(a | s), the state-value function v_π(s), or the action-value function q_π(s; a) is determined by observing the results of an external behaviour policy μ(a | s).
In this section, importance sampling is shown as a way to determine an accurate estimate of the learning objective, and then Q-learning is described as an effective alternative that achieves the value function iteration with a lower variance.
Importance Sampling. One possible way to handle the difference between the target and behaviour policies is importance sampling, where a correction multiplier is applied when processing the observations ([4], Section 5.8). If MC learning is combined with importance sampling, then the value function update will look like V(S_t) ⟵ V(S_t) + α(G_t^{π/μ} − V(S_t)). But because the corrections are made at the end of an episode, the product of the multipliers can lead to a dramatically high variance, and hence MC learning is not suitable for off-policy learning. Therefore, TD learning is much more adequate to combine with importance sampling, because the correction multiplier is applied for only a single step and not for a whole episode: V(S_t) ⟵ V(S_t) + α((π(A_t | S_t)/μ(A_t | S_t))(R_{t+1} + cV(S_{t+1})) − V(S_t)). (A.5)

Q-Learning. Another possible way to handle the difference between the target and behaviour policies is to modify the value function update logic, as Q-learning does ([4], Section 6.5). Assume that in state S_t, the next action is derived by using the behaviour policy: A_{t+1} ∼ μ(· | S_t). By taking action A_{t+1}, the immediate reward R_{t+1} and the next state S_{t+1} are determined. But for the value function update, let us consider an alternative successor action on the basis of the target policy: A′ ∼ π(· | S_t). Therefore, importance sampling is not necessary, and the Q-learning value function update will look like Q(S_t; A_t) ⟵ Q(S_t; A_t) + α(R_{t+1} + cQ(S_{t+1}; A′) − Q(S_t; A_t)). In a special case, if the target policy π is chosen as a pure greedy policy and the behaviour policy μ follows an ϵ-greedy policy, then the so-called SARSAMAX update can be defined as follows: Q(S; A) ⟵ Q(S; A) + α(R + c max_{a′} Q(S′; a′) − Q(S; A)). Last but not least, it has been proven that Q-learning control converges to the optimal action-value function: Q(s; a) ⟶ q_*(s; a).
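The SARSAMAX (Q-learning) backup can be sketched as follows; the states, actions, and values are hypothetical:

```python
# SARSAMAX (Q-learning) backup: bootstrap with the greedy successor action,
# regardless of which action the behaviour policy actually takes (values illustrative).
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    return Q

Q = {('s1', 'x'): 2.0, ('s1', 'y'): 1.0}
Q = q_learning_update(Q, 's0', 'x', r=1.0, s_next='s1', actions=['x', 'y'])
```

The max over successor actions implements the pure greedy target policy, so no importance sampling correction is needed.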

E. Value Function Approximation
The reinforcement learning methods discussed in the previous sections represented value functions by lookup tables, but in practice, it is not feasible to operate with state-level or state-action-level lookup tables. On the one hand, it would be very memory- and computation-intensive, and on the other hand, the learning process would be too slow if the state and/or action spaces are large. The solution for large problems is to estimate the state-value and action-value functions with function approximation: v(s; w) ≈ v_π(s), and similarly, q(s; a; w) ≈ q_π(s; a).
There are many kinds of function approximation methods that can be applied: linear combinations of features, neural networks, decision trees, and Fourier bases. In this section, the first two types are discussed. First, the gradient descent method is presented, which can be effectively combined with Monte-Carlo or temporal-difference methods for value function approximation; then, the deep Q-network is described, which provides a more sample-efficient way of learning.
Value Function Approximation by Gradient Descent. A well-known tool for function approximation is gradient descent ([4], Section 9.3). Let us denote by J(w) a differentiable function of the parameter vector w. Define the gradient of J(w) as ∇_w J(w) = (∂J(w)/∂w_1, ..., ∂J(w)/∂w_n)^T. To find a local minimum of J(w), the parameter w needs to be adjusted in the direction of the negative gradient by Δw = −(1/2)α∇_w J(w), where α is the learning step-size parameter.
An effective solution is to use gradient descent with a linear combination of features, because in this case the formulas become much simpler. The value function representation will look like v(S; w) = x(S)^T w = Σ_{i=1}^n x_i(S)w_i, while the objective function that minimises the mean-squared error between the true value function and its approximation is J(w) = E_π[(v_π(S) − v(S; w))^2]. It is proven that stochastic gradient descent with a linear combination of features converges to the global optimum. Furthermore, the update rule is quite simple: ∇_w v(S; w) = x(S), and then Δw = α(v_π(S) − v(S; w))x(S). The result shows that the parameter adjustment of w consists of three components: learning step-size, prediction error, and feature value. In practice, the true value function is usually not known, but a noisy sample of it is available in the different methods:

For the MC method, the target is G_t, and hence the parameter update is Δw = α(G_t − v(S_t; w))∇_w v(S_t; w).

For the TD(0) method, the target is the TD target R_{t+1} + cv(S_{t+1}; w), and the parameter update is Δw = α(R_{t+1} + cv(S_{t+1}; w) − v(S_t; w))∇_w v(S_t; w).

For TD(λ), the target is the λ-return G_t^λ, and the parameter update is Δw = α(G_t^λ − v(S_t; w))∇_w v(S_t; w).

Whichever method is chosen, the RL learning process needs to update the value function approximation with the same frequency as in the original method.
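A single stochastic gradient step for the linear case can be sketched as follows; the feature vector, the target, and the step size are illustrative assumptions:

```python
import numpy as np

# Stochastic gradient step for a linear value function v(S; w) = x(S)^T w.
# The feature vector, target return, and step size are illustrative assumptions.
def linear_vfa_update(w, x_s, target, alpha=0.05):
    """Delta w = alpha * (target - x(S)^T w) * x(S): step size * prediction error * features.
    'target' is G_t for MC, R + gamma*v(S'; w) for TD(0), or G_t^lambda for TD(lambda)."""
    return w + alpha * (target - x_s @ w) * x_s

w = linear_vfa_update(np.zeros(2), np.array([1.0, 2.0]), target=1.0)
```

Swapping in the MC, TD(0), or TD(λ) target changes only the `target` argument, not the update rule.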
Deep Q-Network. Even if gradient descent-based value function approximation methods can be very computation-effective and updates can be managed incrementally, they are less sample-efficient, which means that the information that could be extracted from an observation is not necessarily exploited.
There are batch methods that work with experience replay. As a preliminary step, all the observed experiences should be collected. Let us denote by D the collected experience of state-value pairs: D = ⟨⟨s_1; v_1^π⟩, ..., ⟨s_n; v_n^π⟩⟩. Artificial observations can be generated by random sampling from the experience history: ⟨s; v^π⟩ ∼ D. Therefore, stochastic gradient descent can be applied to it: Δw = α(v^π − v(s; w))∇_w v(s; w). In this way, w converges to the optimal least-squares solution.
One of the most commonly used RL methods was born by combining experience replay and Q-learning with a periodically frozen target policy:

(1) By using the behaviour policy, action a_t is taken according to an ϵ-greedy policy.
(2) Transitions are stored in the replay memory D as ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩.
(3) Random mini-batch samples of transitions (s, a, r, s′) are generated from D.
(4) On the basis of them, the Q-learning targets are determined by using the fixed parameters w^−.
(5) The mean-squared error between the Q-network and the Q-learning targets is minimised: L_i(w_i) = E_{(s,a,r,s′)∼D}[(r + c max_{a′} Q(s′, a′; w^−) − Q(s, a; w_i))^2]. (A.7)
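A minimal sketch of this training step follows. A linear Q-function q(s, a) = ϕ(s, a)^T w stands in for the deep network, and the feature map, transitions, and hyper-parameters are all illustrative assumptions:

```python
import random
import numpy as np

# Minimal sketch of the DQN training step: replay memory plus periodically frozen
# target weights w_minus. A linear Q-function replaces the deep network here.
def phi(s, a, n_actions=2):
    """Hypothetical feature map: copy the state features into the block of action a."""
    x = np.zeros(n_actions * len(s))
    x[a * len(s):(a + 1) * len(s)] = s
    return x

def dqn_step(w, w_minus, memory, rng, batch_size=4, alpha=0.05, gamma=0.9, n_actions=2):
    """Steps (3)-(5): sample a mini-batch, build targets with frozen w_minus, and take
    one SGD sweep reducing the squared error between Q-network and Q-learning targets."""
    batch = rng.sample(memory, min(batch_size, len(memory)))
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * max(
            phi(s_next, a2, n_actions) @ w_minus for a2 in range(n_actions))
        x = phi(s, a, n_actions)
        w = w + alpha * (target - x @ w) * x
    return w

memory = [(np.array([1.0, 0.0]), 0, 1.0, np.array([0.0, 1.0]), True)]
w = dqn_step(np.zeros(4), np.zeros(4), memory, random.Random(0), batch_size=1)
```

In a full implementation, `w_minus` would be copied from `w` only every fixed number of steps, which is what keeps the targets stable.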

F. Policy Gradient
In contrast to value-based methods, where the optimal action can be determined on the basis of the learnt value function in a particular state, policy gradient methods approximate the optimal policy directly: π_θ(s, a) = P[a | s, θ]. An objective function J(θ) is necessary to measure how well policy π_θ fits the optimal policy. In this case, policy-based RL becomes an optimization problem: find the optimal θ according to J(θ). There are methods that use the gradient, such as gradient descent, conjugate gradient, or quasi-Newton methods, and there are methods that do not, such as hill climbing, simplex, or genetic algorithms. In general, these kinds of methods show better convergence properties and can work effectively with high-dimensional or continuous action spaces, and last but not least, they can learn stochastic policies. On the other hand, policy gradient methods typically converge to a local rather than the global optimum. It is important to highlight that value functions can also be used to learn the optimal parameter θ, but once it is learnt, value functions are not necessary to select the optimal action.

Softmax. Let J(θ) be a policy objective function. Policy gradient algorithms search for a local optimum of J(θ) by ascending the gradient of the objective: Δθ = α∇_θ J(θ). By assuming that policy π_θ is differentiable with gradient ∇_θ π_θ(s, a), likelihood ratios can be used to transform the gradient into the following form: ∇_θ π_θ(s, a) = π_θ(s, a)(∇_θ π_θ(s, a)/π_θ(s, a)) = π_θ(s, a)∇_θ log π_θ(s, a), where ∇_θ log π_θ(s, a) is called the score function.
The softmax policy method is based on weighting actions by linear combinations of features ϕ(s, a)^T θ ([4], Section 13.2). Therefore, the probabilities of the actions are proportional to the exponentiated weights: π_θ(s, a) ∝ e^{ϕ(s,a)^T θ}. The score function looks like ∇_θ log π_θ(s, a) = ϕ(s, a) − E_{π_θ}[ϕ(s, ·)].

Gaussian/Natural Policy Gradient. In continuous action spaces, a Gaussian policy is a natural option. In this case, the mean is a linear combination of features: μ(s) = ϕ(s)^T θ. By fixing the variance as σ², the policy will be Gaussian: a ∼ N(μ(s), σ²).
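The softmax policy and its score function can be sketched as follows; the feature vectors and parameter values are illustrative assumptions:

```python
import numpy as np

# Softmax policy over linear action preferences phi(s,a)^T theta, together with its
# score function (the feature map and theta below are illustrative assumptions).
def softmax_policy(theta, phis):
    """phis: one feature vector phi(s, a) per action; pi(a) proportional to exp(phi^T theta)."""
    prefs = phis @ theta
    prefs = prefs - prefs.max()          # subtract the max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def score(theta, phis, a_idx):
    """grad_theta log pi_theta(s, a) = phi(s, a) - sum_b pi_theta(s, b) phi(s, b)."""
    p = softmax_policy(theta, phis)
    return phis[a_idx] - p @ phis

phis = np.array([[1.0, 0.0], [0.0, 1.0]])   # phi(s, a) for two actions
p = softmax_policy(np.zeros(2), phis)
```

With theta = 0, all preferences are equal, so the policy is uniform and the score is the feature deviation from its mean.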
Monte-Carlo Policy Gradient Method, aka REINFORCE. The Monte-Carlo policy gradient method, better known as the REINFORCE algorithm, updates the parameter θ by using stochastic gradient ascent. It is strongly based on the policy gradient theorem, which generalizes the likelihood ratio approach to multistep MDPs by replacing the immediate reward r with the long-term value Q_π(s, a), under weak restrictions on J(θ). The key idea is that a locally optimal policy can be found by gradient ascent on the objective function as follows: θ_{t+1} ⟵ θ_t + α∇_θ log π_θ(s_t, a_t)v_t, where v_t is an unbiased sample of Q_{π_θ}(s_t, a_t).

Actor-Critic Policy Gradient. In practice, REINFORCE still has a high variance. To handle it, the action-value function can also be estimated: Q_w(s, a) ≈ Q_{π_θ}(s, a). In this way, there are two sets of parameters:

Critic: it updates the action-value function parameters w.
Actor: it updates the policy parameters θ according to the actual version of the critic.

Updates should be done at each elementary step as follows:
Sample reward r = R_s^a, transition s′ ∼ P_s^a, and action a′ ∼ π_θ(s′, a′).
δ = r + cQ_w(s′, a′) − Q_w(s, a)
θ ⟵ θ + α∇_θ log π_θ(s, a)Q_w(s, a)
w ⟵ w + βδϕ(s, a)
s ⟵ s′, a ⟵ a′
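One elementary actor-critic step can be sketched as follows. The linear critic Q_w(s, a) = ϕ(s, a)^T w, the precomputed score vector, and all numbers are illustrative assumptions:

```python
import numpy as np

# One elementary step of the action-value actor-critic (QAC); the linear critic
# Q_w(s,a) = phi(s,a)^T w, the score vector, and all numbers are assumptions.
def actor_critic_step(theta, w, phi_sa, phi_next, score_sa, r,
                      alpha=0.01, beta=0.01, gamma=0.9):
    """Critic: w <- w + beta * delta * phi(s,a), with TD error delta;
    actor: theta <- theta + alpha * grad log pi_theta(s,a) * Q_w(s,a)."""
    q = phi_sa @ w
    delta = r + gamma * (phi_next @ w) - q      # TD error from the sampled transition
    theta = theta + alpha * score_sa * q        # actor update
    w = w + beta * delta * phi_sa               # critic update
    return theta, w

theta, w = actor_critic_step(np.zeros(2), np.zeros(2),
                             phi_sa=np.array([1.0, 0.0]),
                             phi_next=np.array([0.0, 1.0]),
                             score_sa=np.array([0.5, -0.5]), r=1.0)
```

The critic chases the TD target while the actor moves in the score direction scaled by the critic's current estimate, which is what reduces the variance compared to REINFORCE.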

G. Model-Based Methods
Model-free methods learn the value function and/or the policy directly from their experience of a real environment. The accuracy of the RL knowledge can be raised by extending the experience collection process. This can be achieved either by setting up an artificial virtual environment, by defining reward and state transition functions that describe the real environment well, or by building an own model that approximates the real environment by learning from its history.
If it is assumed that the state space S and action space A are known, then the model M = ⟨P_η; R_η⟩ is a representation of the MDP ⟨S; A; P; R⟩ if S_{t+1} ∼ P_η(S_{t+1} | S_t, A_t) and R_{t+1} = R_η(R_{t+1} | S_t, A_t). Learning the model from experience is a supervised learning problem. Figure 13 presents the basic concept of model-based learning methods. First, the model should be learnt, and thereby an internal simulation environment can be defined.
Then, using the model representation, the model-free RL methods can be applied. So, model-based techniques differ from model-free techniques in that they use an internal model representation to derive rewards and state transitions.

H. Multiagent Learning Systems
In Industry 4.0 applications, usually not a single RL agent is set up but multiple ones. The multiagent RL topic addresses the sequential decision-making problem of multiple autonomous agents that operate in a common or quite similar environment, each of which aims to optimize its own long-term return by interacting with the environment, a central system, and/or the other agents.
Markov Games. One way to generalize MDPs to multiple agents is Markov games (MG), also known as stochastic games. Formally, a Markov game can be defined as a tuple ⟨N, S, {A^i}_{i∈N}, P, {R^i}_{i∈N}, c⟩, where N = {1, ..., N} denotes the set of N > 1 agents, S denotes the joint state space of all the agents, and A^i denotes the action space of agent i ∈ N. By introducing A = A^1 × ... × A^N, let P: S × A ⟶ S be the transition probability function from any state s ∈ S to a particular state s′ ∈ S for a joint action a ∈ A, while R^i: S × A × S ⟶ R is the reward function that determines the immediate reward of agent i when starting from state s, taking action a, and moving to state s′. Last but not least, c ∈ [0, 1) is the discount factor. Figure 14 shows the general framework of Markov games.
MG problems can be classified by the knowledge sharing strategies between the agents and the central system and by their goals: whether they can learn from each other, whether it is worth sharing observations or policies with each other, or whether their goals are conflicting. The main categories are

the cooperative agents problem,
the conflicting agents problem, and
the mixed problem.

In a fully cooperative setting, all agents have the very same, identical reward function: R^1 = R^2 = ... = R^N = R. This is also referred to as a multiagent MDP (MMDP). With this approach, the state- and action-value functions are identical for all agents, which enables single-agent RL algorithms to be applied if all agents are coordinated as one decision maker. The global optimum for cooperation then constitutes a Nash equilibrium of the game.
A Nash equilibrium (NE) characterizes an equilibrium point π_*, from which none of the agents has any incentive to deviate. As a standard learning goal for MARL, an NE always exists for discounted MGs, but it may not be unique in general. Most of the MARL algorithms are designed to converge to such an equilibrium point.
We believe that our summary of the major reinforcement learning methods gives a useful and efficient overview of the concepts behind them. As our literature overview shows, there are numerous further modifications and extensions built upon the basic methods. By following our questionnaire in Figure 8, it becomes easier to determine the relevant area of RL methods that can provide an appropriate solution fitted to a given learning problem.

Data Availability
No data were used to support this study.