Learning from Demonstrations and Human Evaluative Feedbacks: Handling Sparsity and Imperfection Using Inverse Reinforcement Learning Approach

Programming by demonstrations is one of the most efficient methods for knowledge transfer to develop advanced learning systems, provided that teachers deliver abundant and correct demonstrations, and learners correctly perceive them. Nevertheless, demonstrations are sparse and inaccurate in almost all real-world problems. Complementary information is needed to compensate for these shortcomings of demonstrations. In this paper, we target programming by a combination of nonoptimal and sparse demonstrations and a limited number of binary evaluative feedbacks, where the learner uses its own evaluated experiences as new demonstrations in an extended inverse reinforcement learning method. This provides the learner with a broader generalization and less regret as well as robustness in the face of sparsity and nonoptimality in demonstrations and feedbacks. Our method alleviates the unrealistic burden on teachers to provide optimal and abundant demonstrations. Employing an evaluative feedback, which is easy for teachers to deliver, provides the opportunity to correct the learner’s behavior in an interactive social setting without requiring teachers to know and use their own accurate reward function. Here, we enhance the inverse reinforcement learning (IRL) to estimate the reward function using a mixture of nonoptimal and sparse demonstrations and evaluative feedbacks. Our method, called IRL from demonstration and human’s critique (IRLDC), has two phases. The teacher first provides some demonstrations for the learner to initialize its policy. Next, the learner interacts with the environment and the teacher provides binary evaluative feedbacks. Taking into account possible inconsistencies and mistakes in issuing and receiving feedbacks, the learner revises the estimated reward function by solving a single optimization problem. The IRLDC is devised to handle errors and sparsities in demonstrations and feedbacks and can generalize different combinations of these two sources of expertise.
We apply our method to three domains: a simulated navigation task, a simulated car driving problem with human interactions, and a navigation experiment of a mobile robot. The results indicate that the IRLDC significantly enhances the learning process where the standard IRL methods fail and learning from feedbacks (LfF) methods have high regret. Also, the IRLDC works well at different levels of sparsity and optimality of the teacher’s demonstrations and feedbacks, where other state-of-the-art methods fail.


Introduction
The next generation of technologies focuses on the capabilities of artificial intelligent agents to become an integral part of our daily lives. To reach that goal, artificial agents, instead of being preprogrammed, need to be equipped with efficient learning systems to rapidly adapt to novel, dynamic, and complex situations. On top of that, the agents should have the flexibility to be personalized to user preferences, that is, to learn styles and behaviors that their human users prefer and enjoy. Therefore, considering vast individual differences among human beings, in terms of both preferences and technical expertise, the learning systems should be able to learn from nontechnical users with minimum burden on them. A significant body of research has targeted solving this problem, especially by using Learning from Demonstrations (LfD), where the learner agent derives its policy by observing its teacher's demonstrations, and Learning from Feedbacks (LfF), where the teacher provides critiques to indicate the desirability of the learner's actions (see Table 1).
In LfD, also known as imitation learning, the learner generalizes the teacher's demonstrations to derive its policy.
There exist two major approaches in the LfD framework based on the way that the learner deals with these demonstrations. The first one is the direct approach, termed behavioral cloning, where the goal is to learn the mapping between states and actions (i.e., the policy) in the teacher's demonstrations using a supervised learning technique. This approach suffers from several problems, including the cascading error issue [1] and sensitivity to the dynamic model of the environment [2].
The other approach is known as apprenticeship learning [3] and is usually cast as an inverse reinforcement learning (IRL) problem [4]. In this approach, the policy is derived indirectly by estimating the reward function underlying the teacher's demonstrations, and then a planning algorithm [5] is employed to derive the policy that maximizes the estimated reward function.
This approach overcomes the challenges that the preceding one faces [2,6]. In addition, in this approach, the learner agent not only replicates the observed behavior, but also infers the "reason" behind it [7] and generalizes the demonstrations accordingly. As a result, the learning process becomes transferable and robust to changes in the configuration of the agent and the environment [8-10]. In this paper, we focus on this approach, mainly on the IRL problem. We should note that the IRL is usually used to accomplish two objectives: apprenticeship learning and reward learning, where in the latter gaining the knowledge of the reward function is a goal by itself [10,11].
Most existing works on the IRL assume that (1) the teacher's demonstrations are reliable, i.e., demonstrations are optimal or near-optimal, (2) the teacher's demonstrations are abundant and sufficiently available, and (3) samples of the teacher's policy are provided by demonstrations. In practice, there are several reasons why these assumptions may not hold, which imposes severe limitations on the applicability of IRL in the real world. These reasons include teachers' inability to perform the task optimally, insufficiency and nondiversity of demonstrations due to the dangers for teachers and the burden on them, and poor correspondence between teachers and learners. Moreover, teachers prefer to express their intentions and preferences in multiple modalities rather than just by demonstrations. Consequently, these limitations highly restrict the generalization capability of the standard IRL methods, which leads to poor performance of the learner. Some methods in the literature partially address these nonoptimality and sparsity issues (see Section 2 for details), but they do not take into consideration that the nonoptimality may exist in all demonstrations and that its amount may be significant rather than being only noise. Other works tackle these issues by adding another source of information, in addition to the teacher's demonstrations, to the learning process. The most recent state-of-the-art works employ reinforcement learning (RL) along with demonstrations [12-14].
These methods require a predefined environmental reward function that should be consistent with the teacher's demonstrations. This somehow necessitates knowing the teacher's reward function a priori, which is not practical in complex situations. Another recent work in this area is our previous method [15], which adds evaluative human feedback information (i.e., right/wrong instructions) to solve the nonoptimality problem in demonstrations. Providing evaluative feedbacks is far simpler than constructing the reward function required in RL methods. Nevertheless, Ezzeddine's study [15] uses evaluative feedbacks solely to correct mistakes in the teacher's demonstrations and cannot handle sparsity in demonstrations. A side effect of this limitation is that it decreases the robustness against errors in evaluative feedbacks. In this paper, we successfully handle both sparsity and nonoptimality in demonstrations and evaluative feedbacks. We employ negative evaluative feedbacks to boost alternative actions and employ the learner's own experiences along with the teacher's demonstrations to improve solving the IRL problem.
This results in faster learning and higher robustness against sparsity and nonoptimality levels.
Motivated by the challenges stated above and in order to leverage learning from humans, we propose a practical approach, called IRLDC, that blends both the teacher's task demonstrations and her binary evaluative feedbacks (true/false) into a unified IRL framework. In the presented method, the learning process is done within two phases. In the first phase, the learner acquires its initial skills from the teacher's demonstrations. In the second phase, the learner interacts with the environment and receives binary evaluative feedbacks from the teacher. Here, by taking into account the natural inconsistencies and errors in the teacher's feedbacks, we propose a feedback model coding how the teacher's feedbacks are provided. In addition, the learner takes its own evaluated experiences as new demonstrations. Using these feedbacks and demonstrations, an enhanced version of the IRL is employed to estimate the reward function, and the learner policy is revised by using dynamic programming [16].
The cycle of interact-feedback-update continues until the teacher is satisfied. In summary, the proposed framework contributes in three ways to boost the robustness and the speed of learning:
(i) Developing an IRL framework, which deals with both the teacher's demonstrations and evaluative feedbacks at different levels of sparsity, optimality, and inconsistency. This framework, unlike those restricted to demonstrations, is also capable of operating in extreme cases where only erroneous and inconsistent feedback data are available.
(ii) Deriving the teacher's preference model from the noisy and inconsistent feedback data provided by the teacher. For that, we employ a feedback model that incorporates recent and old observations to implicitly handle inconsistencies in providing feedback in addition to handling errors.
(iii) Presenting a new IRL objective function that combines demonstrations and feedbacks as a single optimization problem and allows the teacher's preference model to affect the optimization process when searching for the reward function. In our objective function, the algorithm learns from the incorrect data instead of filtering them out.
The approach presented in this paper can bring notable benefits and possibilities: (1) it can effectively treat the nonoptimality and sparsity in demonstrations and feedbacks; (2) it allows the teacher to express his/her intention and style for solving the task by using two instructive modalities, i.e., demonstrations and evaluative feedbacks; (3) it exploits the complementary and teacher-dependent expertise embedded in demonstrations and feedbacks [17,18] (see Table 1); (4) it is possible to teach the learner by only feedbacks, if needed; (5) being an incremental learning method, it lets the teacher provide demonstrations at one time or place and provide feedbacks at another; and (6) it is possible to provide demonstrations by one teacher and feedbacks by another.
The rest of this paper is organized as follows: Section 2 discusses and reviews the related works. Section 3 formulates the problem. In Section 4, our IRLDC framework is introduced and formalized. The experimental setup and the results are reported and discussed in Section 5. Finally, Section 6 draws conclusions and discusses future research directions.

Related Work
In this section, we describe the closest works to ours, scrutinizing the way they have dealt with nonoptimal and sparse demonstrations in the IRL setting, and how humans can teach learning agents using both modalities, i.e., demonstrations and evaluative feedbacks.

Inverse Reinforcement Learning.
As previously discussed, the LfD is comprised of two main learning trends: imitation learning (direct approach) and IRL (indirect approach) (see [19,20]). In the IRL category, there are many approaches that differ in their algorithmic view [8,17], the objective function they optimize [11,18,21,22], and the challenges they try to solve in the IRL [23-27]. Most of the existing works in this framework assume that demonstrations are abundant and their quality is optimal, which is rarely the case in reality. On the other hand, there are also some methods that slightly relax these assumptions. Bayesian IRL approaches [11,28,29] give way to slight deviations from the optimal demonstration assumption, due to the probabilistic nature of the Bayesian approach and the inclusion of the teacher's model. Authors in [21] suppose that the suboptimality in demonstrations can occur at a small scale, and they handle this suboptimality by smoothing the constraints of the objective function. In [25], it is assumed that demonstrations are locally optimal, but due to this assumption, this work cannot benefit from globally optimal demonstrations, in case they are available. In [30-32], the problem of nonoptimality is handled by using a generative model to learn the optimal demonstrations from a large number of suboptimal ones. Authors in [33] suppose that demonstrations are abundant but noisy, and they pretreat this limited suboptimality by a maximum a posteriori estimation to reconstruct near-optimal demonstrations. In [10], it is assumed that a sparse noise exists in some trajectories in demonstrations, and a model is used to identify and separate noisy trajectories from the reliable ones. Unlike [10], in [34], the authors do not filter out the noisy trajectories; instead, they learn from them, provided that some successful demonstrations are available, which is not always a realistic assumption. None of the aforementioned methods can deal with real-world cases where demonstrations are sparse and far from optimal (more than noise), or with cases where nonoptimality exists in all demonstrations. In this paper, we target learning in such conditions by extending IRL to incorporate the teacher's demonstrations and her binary evaluative feedbacks.

Learning from Evaluative Feedbacks.
Learning from feedbacks is another direction for teaching an agent. Out of the different forms that feedbacks can take, here we only focus on binary evaluative feedbacks. Among the large body of literature on this subject, there are some works that provide an evaluative feedback for each entire trajectory executed by learners (see [35-37]). Using this type of feedback, the majority of works target direct derivation of the optimal policy. On the other hand, in other works, an evaluative feedback is provided for each action and is used either to communicate a numeric reward [38-41] or to transfer the performed action's correctness (true/false) in order to derive the optimal policy. The latter type is used for policy shaping [42,43], while RL methods [16] are mainly used for policy improvement in the former case. Many recent works emphasize the effectiveness of policy shaping in comparison with using evaluative feedbacks as a numeric reward function (see [44-46]). Nevertheless, these learning methods are sensitive to nonoptimality of feedbacks. A way to handle this nonoptimality is by employing probabilistic feedback models to deal with errors and sparsities in the teacher's feedbacks (see [42,45,47]). In our work, the human teacher's feedbacks are considered to be binary and evaluative and are provided for each action. We suggest a novel nonprobabilistic feedback model that depends on recent and old observations to handle natural and unavoidable inconsistencies and errors in human feedbacks. Unlike former approaches, our model can implicitly handle the inconsistency in feedbacks. Contrary to most of the works that directly derive a policy or modify it by using feedbacks, we employ a revised version of the IRL to estimate the teacher's reward function and, hence, generalize the experience of sparse interactions with the teacher to the entire task space.

Combining Human Demonstrations and Reinforcement Learning.
In a more realistic approach, to deal with nonoptimal and sparse demonstrations, most of the state-of-the-art methods combine human demonstrations with the experience of interacting with the environment using reinforcement learning (RL), which requires a critic knowing the reward function [12-14,48]. Human demonstrations can be used to initialize a policy, which is then refined using RL (see Section 5.1 of [20] for a survey). This approach is appealing and results in a good learner performance. However, to learn an acceptable policy, such an approach suffers from the curse of dimensionality and high regret, especially with sparse and nonoptimal demonstrations. In addition, this approach does not express the teacher's preferences well and requires designing an environmental reward function consistent with the mentor's behavior.
Our work differs from this approach in that we focus on leveraging learning from human data by combining the teacher's sparse and nonoptimal demonstrations with her error-prone correct/wrong evaluative feedbacks.
The human evaluative feedback is different in its nature from the environmental reinforcement signal (see [39,49]). In addition, the goal of this approach is to derive the optimal policy directly, whereas our work follows the IRL approach to derive the reward function underlying the task, which results in less regret due to the inherent generalization capability of IRL.

Combining Human Demonstrations and Feedbacks.
Different combinations of human demonstrations and feedbacks are used in the literature to accelerate and enhance the learning process or to allow teachers to provide information using different modalities. Human feedbacks that are combined with the teacher's demonstrations mainly take the following forms: corrective action, advice preferences, and evaluation of performed actions (evaluative feedback). Corrective action feedback is used in interactive learning systems [50] and in the active learning setting [8,17,51]; for providing this kind of feedback, the teacher should be able to provide an optimal action, which is not available in most realistic cases. Advice preference feedback is a kind of prior knowledge for solving the task [52,53] and is usually combined with other types of learning. Since advice preference feedback is provided by domain experts, its use is restricted to those cases wherein experts are available. In human evaluative feedback, or critique, the human teacher provides evaluative critiques to indicate the desirability of the performed action. This kind of feedback is simple and requires minimal information from the teacher.
In this work, we use binary evaluative feedbacks along with demonstrations. A limited number of works have been done within this setting [54-56]. The closest approach to ours, in terms of human information and feedbacks, is the work of [55,56]. However, the work of [55] employs a supervised learning method, while we use the IRL and extract human preferences. In addition, nonoptimality or sparsity of human demonstrations as well as erroneous feedbacks are not considered in [55,56]. Recently, a paper from our group managed to treat the nonoptimality in demonstrations (within a certain limit) in the presence of feedbacks and abundant demonstrations [15]. However, it failed to handle sparsity and high levels of nonoptimality in demonstrations. Also, the feedback error that it could deal with was very limited. In addition, the nonoptimality and inconsistencies in data were filtered out instead of being learned from. All these issues are successfully handled in this paper.

Problem Formulation
The underlying decision-making process of an IRL agent learning from human demonstrations is modeled by a Markov decision process (MDP) without a reward function (MDP/R). MDP/R is a 4-tuple $(S, A, T, \gamma)$, where $S$ is a set of states in the environment and $A$ is a set of actions available to the learner.
Moreover, $T(s, a, s')$ is the transition model, where $s$ is the current state, $a$ is the performed action, and $s'$ is the next state. In our case, this model is known in advance. This assumption is realistic in many cases, such as when we learn a new task or a novel style in a familiar environment. Furthermore, $\gamma \in [0, 1]$ is a discount factor. The aim of an IRL problem is to extract the reward function $R: S \times A \to \mathbb{R}$, which assigns a real-valued reward for executing action $a$ in state $s$. Usually, the number of states is too large. Therefore, for the reward function to admit a practical representation and to allow recovering it from a smaller number of demonstrations, the reward is represented as a function of state-action features, $R(s, a) = f(\varphi(s, a))$, where $\varphi: S \times A \to \mathbb{R}^m$ is a known $m$-dimensional state-action feature function. As in other research studies [3,24,33,34,57], here we use a linear function, i.e., $R_\theta(s, a) = \theta^T \varphi(s, a)$, where $\theta$ is the weighting vector of features.
Given a reward function, in general, solving an MDP involves obtaining a policy, $\pi: S \times A \to [0, 1]$, where $\pi(s, a)$ is the probability of choosing action $a$ in state $s$, that maximizes the expected return, i.e., $E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid \pi, T\right]$. The optimal state value function can be computed recursively using the Bellman equation as
$$V^*(s) = \max_{a \in A}\Big[R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s')\Big].$$
Similarly, the optimal state-action value function can be recursively computed as
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s').$$
Also, the optimal state value function can be written in terms of the state-action value function as $V^*(s) = \max_{a \in A} Q^*(s, a)$. Thus, the optimal state-action value function becomes
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a' \in A} Q^*(s', a'). \tag{1}$$
Typically, the IRL seeks the reward function underlying the demonstration set of the task. This demonstration set is generated according to a certain teacher's policy. Similar to the formulation used in many IRL methods, the demonstrations are represented by a set of trajectories $D = \{\tau_i\}_{i=1}^{K}$, where $K$ is the number of trajectories and a trajectory is defined as a set of state-action pairs $\tau_i = \{(s_1, a_1), (s_2, a_2), \ldots\}$. We should note that, in our framework, we use $D_E$ to indicate the demonstrations provided by the teacher and $D_L$ to indicate the demonstrations collected from the learner's motion. In this paper, the learner is provided with nonoptimal and sparse (i.e., an insufficient number of) demonstrations to estimate the reward function. Taking these assumptions into consideration, one can easily deduce that the traditional IRL alone cannot lead to learning the optimal policy.
The likelihood of the demonstration data $D$ given the reward function $R_\theta$ is defined as
$$L(D \mid R_\theta) = \prod_{i=1}^{K} \prod_{(s, a) \in \tau_i} \pi_\theta(s, a).$$
Similar to other works [11,58], our learning process is not sensitive to the trajectories in the demonstration dataset. It depends on the $(s, a)$ pairs in the demonstration dataset regardless of the trajectory they belong to; thus, the likelihood function can be written as
$$L(D \mid R_\theta) = \prod_{(s, a) \in D} \pi_\theta(s, a). \tag{2}$$
The policy $\pi_\theta(s, a)$ is a stochastic policy, defined by the Boltzmann distribution:
$$\pi_\theta(s, a) = \frac{e^{\alpha Q_\theta(s, a)}}{\sum_{a' \in A} e^{\alpha Q_\theta(s, a')}}, \tag{3}$$
where $\alpha$ controls randomness in the policy.
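To make the formulation above concrete, the following is a minimal NumPy sketch of the backup in equation (1), the Boltzmann policy of equation (3), and the log of the likelihood in equation (2). The two-state MDP/R, the single feature, and all function names are hypothetical choices made only for illustration; they are not taken from the paper's experiments.

```python
import numpy as np

def q_value_iteration(T, R, gamma, n_iter=500):
    """Iterate the backup of equation (1).

    T: (S, A, S) array of transition probabilities T(s, a, s').
    R: (S, A) array of rewards R(s, a).
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(n_iter):
        V = Q.max(axis=1)          # V*(s) = max_a Q*(s, a)
        Q = R + gamma * (T @ V)    # R(s, a) + gamma * sum_s' T(s, a, s') V*(s')
    return Q

def boltzmann_policy(Q, alpha):
    """Equation (3): softmax over actions; alpha controls randomness."""
    z = alpha * (Q - Q.max(axis=1, keepdims=True))  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def log_likelihood(pi, demos):
    """Log of equation (2): sum of log pi(s, a) over demonstrated pairs."""
    return float(sum(np.log(pi[s, a]) for s, a in demos))

# Hypothetical toy MDP/R: 2 states; action 0 = stay, action 1 = switch state.
T = np.zeros((2, 2, 2))
for s in range(2):
    T[s, 0, s] = 1.0
    T[s, 1, 1 - s] = 1.0

# One assumed feature that fires when the action leads to (or stays in) state 1.
phi = np.zeros((2, 2, 1))
phi[0, 1, 0] = 1.0
phi[1, 0, 0] = 1.0
theta = np.array([1.0])
R = phi @ theta                    # linear reward R_theta(s, a) = theta^T phi(s, a)

Q = q_value_iteration(T, R, gamma=0.9)
pi = boltzmann_policy(Q, alpha=1.0)
ll = log_likelihood(pi, demos=[(0, 1), (1, 0)])
```

In this toy problem the demonstrated pairs are exactly the ones the reward favors, so the policy assigns them more than half of the probability mass in each state.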
In our work, we utilize the Bayesian IRL approach (see [8,11,28,29]). More specifically, we adopt the maximum likelihood IRL suggested by [29]. The maximum likelihood IRL (MLIRL) works as follows: given the demonstration dataset $D$, we seek the reward function $R_\theta$ that maximizes the likelihood of the demonstration data (equation (2)). To that end, an iterative gradient ascent procedure is used. First, we take an arbitrary value for $\theta$, and then $\pi_\theta$ is computed by solving the MDP and using equation (3). After that, the likelihood of the demonstrated data (equation (2)) and its gradient with respect to $\theta$ are computed. Thereby, $\theta$ is updated, and so on (see Figure 1(b)).
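The MLIRL loop described above can be sketched as follows. This is an illustrative approximation rather than the method of [29]: it uses a central finite-difference gradient of the log-likelihood instead of an analytic gradient, and the toy MDP/R, step size `beta`, and function names are assumptions introduced here.

```python
import numpy as np

def policy_from_theta(theta, phi, T, gamma, alpha, n_iter=200):
    """Solve the MDP for R_theta and return the Boltzmann policy (equation (3))."""
    R = phi @ theta
    Q = np.zeros_like(R, dtype=float)
    for _ in range(n_iter):
        Q = R + gamma * (T @ Q.max(axis=1))
    z = alpha * (Q - Q.max(axis=1, keepdims=True))
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def log_lik(theta, demos, phi, T, gamma, alpha):
    """Log of equation (2) for the policy induced by theta."""
    pi = policy_from_theta(theta, phi, T, gamma, alpha)
    return float(sum(np.log(pi[s, a]) for s, a in demos))

def mlirl(demos, phi, T, gamma=0.9, alpha=1.0, beta=0.1, steps=100, h=1e-5):
    """Gradient ascent on the log-likelihood (finite differences: an assumption)."""
    theta = np.zeros(phi.shape[-1])            # arbitrary starting value
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for k in range(theta.size):
            e = np.zeros_like(theta)
            e[k] = h
            grad[k] = (log_lik(theta + e, demos, phi, T, gamma, alpha)
                       - log_lik(theta - e, demos, phi, T, gamma, alpha)) / (2 * h)
        theta = theta + beta * grad            # ascent step
    return theta

# Hypothetical toy MDP/R: 2 states; action 0 = stay, action 1 = switch state.
T = np.zeros((2, 2, 2))
for s in range(2):
    T[s, 0, s] = 1.0
    T[s, 1, 1 - s] = 1.0
phi = np.zeros((2, 2, 1))
phi[0, 1, 0] = 1.0     # switching from state 0 reaches state 1
phi[1, 0, 0] = 1.0     # staying in state 1

demos = [(0, 1), (1, 0)]                       # teacher goes to state 1 and stays
theta_hat = mlirl(demos, phi, T)
ll_before = log_lik(np.zeros(1), demos, phi, T, 0.9, 1.0)
ll_after = log_lik(theta_hat, demos, phi, T, 0.9, 1.0)
```

Finite differences keep the sketch short but scale poorly with the number of features; an analytic gradient would normally be preferred.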

The Proposed Learning Method
In this section, we present our proposed framework, called IRLDC. In the following, we discuss the detailed framework and delineate the learning and optimization process.
Our framework targets learning by a mixture of sparse as well as nonoptimal demonstrations and human binary evaluative feedbacks, where the learner uses its own evaluated experiences as new demonstrations in an extended IRL method.
The learning process starts by providing some demonstrations from the teacher (sparse and/or nonoptimal) for the learner to initialize its policy. Next, the learner interacts with the environment and acquires binary evaluative feedbacks from the teacher. Such feedbacks indicate the desirability of the learner's actions. By taking into account possible inconsistencies and errors in issuing and receiving feedbacks, the learner derives the teacher's preference model. This model is used to revise the estimated reward function by solving a single optimization problem. The cycle of interact-feedback-update continues until the teacher is satisfied.

IRLDC Framework.
Our IRLDC framework includes two main stages: (1) the demonstration stage and (2) the feedback stage.
The general framework is shown in Figure 1(a), and it is described procedurally in Algorithm 1.
In the first stage, the teacher provides a demonstration dataset $D_E$ (sparse and/or nonoptimal) and the learner uses the IRL algorithm (Figure 1(b)) to estimate the reward function parameter $\theta_{\text{initial}}$ (line 02). In the second stage, the learner employs $\theta_{\text{initial}}$ to generate its initial policy (line 06).
Thereafter, the learner observes the world (gets state $s$), chooses its action $a$ using the initial policy (line 08 and line 09), and records its trajectories in $D_L = \{\tau_i^L\}$, where $\tau_i^L = \{(s_1, a_1), (s_2, a_2), \ldots\}$ (line 11). Then, the teacher provides a binary evaluative feedback signal $f_a \in \{f^+, f^-\}$ (line 10) for the action $a$ executed in state $s$, where $f^+$ and $f^-$ indicate "good" and "bad" actions, respectively. Note that the teacher may give multiple feedbacks at different times in state $s$, denoted by $f_s = \{f_a\}$. Also, $f = \{f_s\}$ denotes the feedback set given by the teacher.
After $M$ steps of interaction with the environment, the performed demonstrations and the received feedbacks are provided as inputs to our proposed IRL algorithm, called the maximum likelihood inverse reinforcement learning with demonstration and critique (MLIRLDC). So, the learner uses $\theta_{\text{initial}}$, $f$, and $D_L$ as inputs for MLIRLDC to update the reward estimation parameter $\theta$ (line 15). Using this parameter, the learner updates its policy and executes an action in the environment (line 06 and line 09). The process of execution and reward function update continues until the teacher's satisfaction is attained (lines 06-16). We should note that the learner can take different exploration strategies for deriving its policy in the second stage (probabilistic, greedy, and random policies). As seen in Figure 1(a), in our framework, demonstrations ($D_L$) are collected from the learner's motions in the second stage. On the other hand, the demonstrations provided by the teacher ($D_E$) are used in the initialization of the MLIRLDC and in the initial policy of the learner's execution.

Journal of Robotics
This allows the learner to operate with diverse combinations of the teacher's demonstrations and feedbacks, ranging from demonstrations of any amount or quality to the teacher's feedbacks only.
It is worth mentioning that, from the feedback data, the learner extracts the teacher's preference model $H_E$, which represents the teacher's preferences over actions in a given state $s$. This preference model is used to weight the likelihood of the demonstrations in the MLIRLDC (Algorithm 2). In the following, we first detail the derivation of $H_E$ and thereafter describe the MLIRLDC in more detail.
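The second stage of the framework can be read as the control loop sketched below. This is only a skeleton under stated assumptions: every callable (`make_policy`, `env_reset`, `env_step`, `get_feedback`, `update_reward`, `teacher_satisfied`) is a hypothetical stand-in supplied by the caller, `update_reward` takes the place of MLIRLDC, and warm-starting each update from the current `theta` is a design choice of this sketch rather than something stated in the text.

```python
import numpy as np

def irldc_feedback_stage(theta_initial, make_policy, env_reset, env_step,
                         get_feedback, update_reward, teacher_satisfied,
                         M=10, max_rounds=100, seed=0):
    """Interact-feedback-update cycle of the second stage (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = theta_initial
    D_L, f = [], []                  # learner demonstrations and feedback set
    for _ in range(max_rounds):
        pi = make_policy(theta)      # (S, A) array of action probabilities
        s = env_reset()
        for _ in range(M):           # M interaction steps between updates
            a = int(rng.choice(pi.shape[1], p=pi[s]))
            D_L.append((s, a))       # record the learner's own experience
            fb = get_feedback(s, a)  # +1 ("good"), -1 ("bad"), or None
            if fb is not None:
                f.append((s, a, fb))
            s = env_step(s, a)
        theta = update_reward(theta, D_L, f)      # stands in for MLIRLDC
        if teacher_satisfied(make_policy(theta)):
            break
    return theta, D_L, f

# Toy stand-ins, chosen only so that the skeleton runs end to end.
theta, D_L, f = irldc_feedback_stage(
    theta_initial=0.0,
    make_policy=lambda th: np.full((2, 2), 0.5),
    env_reset=lambda: 0,
    env_step=lambda s, a: (s + a) % 2,
    get_feedback=lambda s, a: 1 if a == 1 else -1,
    update_reward=lambda th, D, fb: th + 1.0,
    teacher_satisfied=lambda pi: True,
)
```

Passing the components as callables keeps the loop independent of any particular environment, exploration strategy, or reward-update rule.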

Estimating the Teacher's Feedback Model.
Usually, the critique provided by a human teacher is noisy due to errors in reporting her true assessment (feedback error) and inconsistency in assessing the learner's behavior in a single situation at different times. The inconsistency in feedback can occur due to changes in the teacher's behavior during the teaching process [59], dependency of the teacher's feedback on the current agent policy [44], the teacher's lack of confidence in providing feedbacks, and multiple teachers providing feedbacks. Therefore, we use the following feedback model to handle the noise. The feedback model for getting feedback $f_{a_j}$ in state $s$ for performing action $a_j$ is given by equation (4). This model assumes that the teacher determines if the performed action is consistent with her policy $\pi^*$, with the probability of error $\varepsilon$ (feedback error). If the teacher interprets the learner's action as correct, she gives a positive feedback ($f^+$, "good" feedback), so that the action gets a proportion of the "good" feedback equal to $1 - \varepsilon$, and each one of the other actions gets $\varepsilon/(|A| - 1)$. The same model is used for a negative feedback ("bad" feedback). The error $\varepsilon$ can also encode the error in the learner's perception of feedback. The teacher's preference about the agent's action in a certain state is complete and transitive, so we can model it with a utility function $H(s, a \mid f_s)$ (equation (5)). This utility function is the difference between the number of "good" and "bad" critiques, and its value is directly correlated with the teacher's preference for the corresponding action. Equation (5) depends on the history of feedbacks, and therefore, the effects of feedback error and inconsistency in the teacher's critiques are implicitly encoded in it.

By scaling $H$ between zero and one, it can be mathematically regarded as a cumulative probability distribution. Subsequently, the teacher's model $H_E$ can be obtained as in equation (6), where $k$ is a very small number. Also, $H_0(s, a \mid f_s)$ is defined similarly to equation (5) while considering $\varepsilon = 0$ for the collected feedback dataset $f_s$. Note that forms of scaling other than the minimum of $H_0$ can also be used. This distribution allows the teacher's model to be informative even for actions that do not receive the teacher's critique.
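One possible reading of the feedback model and its scaling is sketched below. The error-smoothed counting follows the verbal description (the critiqued action receives $1 - \varepsilon$ and every other action receives $\varepsilon/(|A| - 1)$), and the utility is the difference between "good" and "bad" counts. The specific scaling, which shifts by the per-state minimum of $H_0$ plus a small `k` and then normalizes over actions, is an assumption of this sketch and may differ from the paper's equation (6).

```python
import numpy as np

def utility(feedbacks, n_states, n_actions, eps):
    """Difference between error-smoothed 'good' and 'bad' critique counts."""
    good = np.zeros((n_states, n_actions))
    bad = np.zeros((n_states, n_actions))
    spill = eps / (n_actions - 1) if n_actions > 1 else 0.0
    for s, a, fb in feedbacks:                 # fb is +1 ("good") or -1 ("bad")
        target = good if fb > 0 else bad
        target[s] += spill                     # every action gets eps / (|A| - 1)
        target[s, a] += (1.0 - eps) - spill    # critiqued action ends up with 1 - eps
    return good - bad

def preference_model(feedbacks, n_states, n_actions, eps, k=1e-6):
    """Scaled teacher preference H_E (assumed scaling, see the lead-in)."""
    H = utility(feedbacks, n_states, n_actions, eps)
    H0 = utility(feedbacks, n_states, n_actions, 0.0)   # utility with eps = 0
    shifted = H - H0.min(axis=1, keepdims=True) + k     # strictly positive
    return shifted / shifted.sum(axis=1, keepdims=True)

# Hypothetical feedback log: (state, action, +1 or -1).
feedbacks = [(0, 1, +1), (0, 1, +1), (0, 0, -1)]
H_E = preference_model(feedbacks, n_states=2, n_actions=2, eps=0.1)
```

With this scaling, a state that never received a critique gets a uniform preference, so it neither favors nor penalizes any action.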

MLIRLDC Optimization Process and Algorithm.
Unlike the majority of IRL algorithms, our proposed IRL algorithm (MLIRLDC) takes demonstrations and evaluative feedbacks as inputs. The implicit assumption in the MLIRL likelihood (equation (2)) is that every demonstrated action is correct, i.e., $a \in O(s)$ for each demonstrated pair $(s, a)$, where $O(s)$ is the set of correct actions in state $s$. We may not have access to the correct action in every state, due to the nonoptimality of the teacher's demonstrations or the absence of them, but we can use the critique data $\{f^+, f^-\}$, which provide partial evidence for the suitability of action $a$ in state $s$. Accordingly, we calculate the likelihood using the critique data. To do so, we modify the likelihood model (equation (2)). In the simple case, when there is no inconsistency or error in the teacher's feedbacks, we search for $\theta$ such that:
(i) If the feedback $f_a$ for the pair $(s, a)$ is positive ($f_a = f^+$), then the action $a$ is exactly correct for that state ($a \in O(s)$); thus, in the likelihood objective function, we must maximize the policy $\pi_\theta(s, a)$.
(ii) If the feedback $f_a$ for the pair $(s, a)$ is negative ($f_a = f^-$), then the action $a$ is not suitable and exactly wrong for that state ($a \notin O(s)$); thus, in the likelihood objective function, we must maximize $1 - \pi_\theta(s, a)$.
As a result, the likelihood objective function of the demonstrations given the teacher's feedbacks takes the form of equation (7). When the teacher's critiques contain inconsistencies and errors, instead of considering actions that are exactly correct or wrong, we use $H_E$ (equation (6)) and modify the likelihood (equation (7)) so that the degree of correctness is included (equation (8)). The teacher's preference model $H_E$ affects the optimization process when searching for $\theta$ according to its value. If $H_E(s, a)$ is large ($H_E(s, a) \to 1$), i.e., $a$ is more likely to be a correct action in state $s$, the term $\pi_\theta(s, a)^{H_E(s, a)}$ will highly affect the search for the parameter $\theta$. In contrast, when $H_E(s, a)$ is small ($H_E(s, a) \to 0$), i.e., $a$ is more likely to be an incorrect action, the term $\pi_\theta(s, a)^{H_E(s, a)}$ will be close to one whatever the value of $\pi_\theta(s, a)$, and its effect on the search is very low. This means that the pair $(s, a)$ will effectively be filtered out from the demonstrations $D_L$. However, in order to fully benefit from the demonstrations and the teacher's preference model $H_E$ (rather than only filtering out the pair $(s, a)$), we can learn from the unsuitability of action $a$ in state $s$ ($H_E(s, a) \to 0$) by estimating the most likely correct action in that state using the teacher's preference model. Thus, we first enhance the demonstration data $D_L$ according to the teacher's preference model $H_E$ (equation (9)). Then, we use the enhanced demonstrations (equation (9)) in the likelihood objective function (equation (10)). The role of the teacher's preference model $H_E(s, a)$ in equation (9) is to determine the best action in state $s$, and its role in equation (10) is to determine the degree of correctness of action $a$ in state $s$.

ALGORITHM 2: MLIRLDC algorithm.
(1) Input: MDP/R, feature $\varphi$, $D_L$, $f$, $\theta_{\text{initial}}$, and learning rate $\beta$
(2) $\theta \leftarrow \theta_{\text{initial}}$
(3) compute teacher model $H_E$ {using equations (4)-(6)}
(4) enhance the demonstration $D_L$ {using equation (9)}
(5) while not converged
(6)     compute $Q_\theta$, $dQ_\theta/d\theta$, and $\pi_\theta$ {using equations (3) and (12)}
(7)     compute $\nabla \log(L_{DC})$ {using equation (11)}
(8)     $\theta \leftarrow \theta + \beta \nabla \log(L_{DC})$
(9) end while
(10) Output: $\theta$
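The two roles of $H_E$ described above can be illustrated with the short sketch below. It assumes, based only on the verbal description, that the enhancement step replaces a learner action by the action the teacher model prefers most in that state whenever that action is strictly preferred, and that the feedback-weighted log-likelihood is the sum of $H_E(s, a) \log \pi_\theta(s, a)$ over the enhanced pairs. The function names and the numeric values are hypothetical, and the paper's equations (9) and (10) may differ in detail.

```python
import numpy as np

def enhance_demonstrations(D_L, H_E):
    """Swap an action for the teacher-preferred one when it is strictly better."""
    enhanced = []
    for s, a in D_L:
        best = int(np.argmax(H_E[s]))
        enhanced.append((s, a if H_E[s, a] >= H_E[s, best] else best))
    return enhanced

def weighted_log_likelihood(pi, demos, H_E):
    """Sum of H_E(s, a) * log pi(s, a): each pair counts by its degree of correctness."""
    return float(sum(H_E[s, a] * np.log(pi[s, a]) for s, a in demos))

# Hypothetical teacher model and learner policy for 2 states and 2 actions.
H_E = np.array([[0.1, 0.9],
                [0.5, 0.5]])
pi = np.array([[0.2, 0.8],
               [0.5, 0.5]])

D_L = [(0, 0), (1, 1)]                 # learner chose a disliked action in state 0
D_hat = enhance_demonstrations(D_L, H_E)
ll = weighted_log_likelihood(pi, D_hat, H_E)
```

Keeping the learner's action when no other action is strictly preferred avoids overwriting experiences in states that received no informative critique.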
We should note that the state-action value function is not differentiable due to the "max" operator in equation (1). To make it differentiable, as in [29], we replace the "max" operator with a weighting of the state-action values by the Boltzmann distribution. Hence, the state-value function becomes $V_\theta(s) = \sum_{a \in A} \pi_\theta(s, a)\, Q_\theta(s, a)$. Thus, the state-action value function and its gradient can be computed recursively (equation (12)).
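The Boltzmann-weighted backup can be sketched as follows. This is a minimal illustration under stated assumptions: a tabular MDP with state-only rewards, a discount factor `gamma`, an inverse temperature `inv_temp`, and a fixed number of sweeps. These names and the toy MDP are hypothetical, and the gradient computation of equation (12) is not shown.

```python
import numpy as np


def boltzmann_policy(q, inv_temp=1.0):
    """Row-wise Boltzmann (softmax) distribution over actions."""
    z = inv_temp * (q - q.max(axis=1, keepdims=True))  # shift for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def soft_value_iteration(reward, transition, gamma=0.9, inv_temp=1.0, n_iter=200):
    """Value iteration with V(s) = sum_a pi(s, a) Q(s, a) instead of max_a Q(s, a).

    reward: (S,) state rewards; transition: (S, A, S) probabilities.
    A fixed number of sweeps is used rather than a convergence test.
    """
    n_states, n_actions, _ = transition.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        pi = boltzmann_policy(q, inv_temp)
        v = (pi * q).sum(axis=1)                        # Boltzmann-weighted value
        q = reward[:, None] + gamma * transition @ v    # Bellman backup
    return q, boltzmann_policy(q, inv_temp)


# Toy MDP: action 0 always leads to state 0, action 1 always leads to state 1.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
q, pi = soft_value_iteration(np.array([0.0, 1.0]), T)
```

Because every operation in the backup is smooth in the reward, the resulting $Q_\theta$ and $\pi_\theta$ are differentiable with respect to the reward parameters, which is what the gradient ascent in Algorithm 2 relies on.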
The optimization process is summarized in Algorithm 2.

Experiments
In our experiments, we assess the performance of our IRLDC framework under different conditions: (i) diverse degrees of demonstration optimality, (ii) different degrees of demonstration sparsity, (iii) lack of demonstration (learning only from feedbacks), (iv) different types of agent policy, and (v) diverse degrees of feedback error.
The experiments are divided into two parts. The first part includes a simulation domain, where the effect of the aforesaid aspects is studied. The second part is carried out in two domains to investigate the applicability of our framework to real human data and real-world problems: a highway car driving simulator and a real mobile robot navigation task, both instructed by a human.
In the experiments, the performance evaluation measure is the "expected value" score (EV), to evaluate the optimality level of the learned policy under the "true" reward function.
This score is computed by finding the greedy policy $\pi$ from the learned reward function and then measuring its expected return under the "true" reward function $R_T$. The "expected value" score of the teacher's policy $\pi_T$, derived from the "true" reward function $R_T$, is the upper baseline (named "teacher policy") and is used for comparison.
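The EV score can be sketched for a small tabular MDP as follows. This is only an illustration: the discounted infinite-horizon return, the discount factor `gamma`, the start-state distribution `start_dist`, and the function names are assumptions, since the text does not specify how the expected return is computed.

```python
import numpy as np


def greedy_policy(reward, transition, gamma=0.9, n_iter=500):
    """Greedy policy for a state reward, found by standard value iteration."""
    n_states = transition.shape[0]
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = reward[:, None] + gamma * transition @ v
        v = q.max(axis=1)
    return q.argmax(axis=1)


def expected_value(policy, true_reward, transition, start_dist, gamma=0.9):
    """Expected discounted return of a deterministic policy under the true reward."""
    n_states = transition.shape[0]
    p_pi = transition[np.arange(n_states), policy]       # (S, S) under the policy
    v = np.linalg.solve(np.eye(n_states) - gamma * p_pi, true_reward)
    return float(start_dist @ v)


# Toy MDP: action 0 always leads to state 0, action 1 always leads to state 1.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
true_reward = np.array([0.0, 1.0])
start = np.array([1.0, 0.0])

ev_good = expected_value(greedy_policy(true_reward, T), true_reward, T, start)
ev_bad = expected_value(greedy_policy(np.array([1.0, 0.0]), T), true_reward, T, start)
```

Here `ev_good` plays the role of the "teacher policy" baseline, because the learned reward equals the true one, while `ev_bad` shows how a wrongly learned reward lowers the score.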
In the literature, the only direct method that uses nonoptimal demonstrations with evaluative feedbacks is our previous work (LfDHF) [15], with which we compare our method. We also suggest two indirect scenarios to compare our work with LfD methods, which use optimal demonstrations, and with LfF approaches:

(i) Standard IRL: the standard IRL methods require abundant optimal demonstrations, whereas our method employs sparse and nonoptimal demonstrations along with evaluative feedbacks. Therefore, to make a fair comparison, we provide the standard IRL method with the same demonstrations we use in our method as well as a set of optimal demonstrations equivalent to the number of evaluative feedbacks we employ in IRLDC. Although, according to our assumptions, providing optimal demonstrations might be impractical, we do so only for comparison purposes. Here, we used MLIRL [29]; other IRL methods yield similar results in face of sparse and nonoptimal demonstrations.
(ii) Policy combination: we derive the policy ($\pi_{\text{demo}}$) from the teacher's demonstrations by using MLIRL to calculate the reward function and then deriving the policy by means of dynamic programming [16]. Then, we derive the policy ($\pi_{\text{feedback}}$) from the teacher's feedback [45] (for this method, we use the following settings: the probability of giving explicit and implicit feedbacks is equal, and the feedback error is equal to zero). Thereafter, we combine the two policies using an idea suggested by [42].

In general, we should note that the amount of information provided by optimal demonstrations is greater than that provided by binary evaluative feedbacks. In the following, we relate the information content of these two sources. For $|A|$ actions and a single optimal action per state, by providing optimal demonstrations, the teacher can directly give the optimal action with a single interaction per state. By providing binary evaluative feedbacks, the learner may get the optimal action from the first interaction or after $(|A| - 1)$ interactions per state. Formally, $i(F \mid s) \in \{1, 2, \ldots, |A| - 1\}$ with a uniform distribution, where $i(F \mid s)$ is the number of feedback interactions per state needed to get the optimal action. Therefore, the average number of feedback interactions per state to get the optimal action will be

$$i(F \mid s)_{\text{average}} = \frac{1}{|A| - 1} \sum_{k=1}^{|A| - 1} k = \frac{|A|}{2}.$$

So, the average number of feedback interactions ($i(F)_{\text{average}}$) needed to achieve the same learning performance as optimal demonstration interactions will be

$$i(F)_{\text{average}} = \frac{|A|}{2}\, i(D), \qquad (15)$$

where $i(D)$ is the number of state-action pairs in the demonstrations. In case of error-free feedbacks and nonoptimal demonstrations, the number of required feedbacks will be reduced.
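The relation between feedback interactions and demonstration pairs can be checked numerically. The sketch below only evaluates the mean of the uniform distribution over $\{1, \ldots, |A| - 1\}$ stated above and scales it by $i(D)$; the function names are hypothetical.

```python
def average_feedbacks_per_state(n_actions):
    """Mean of a uniform distribution over {1, ..., |A| - 1} (needs |A| >= 2)."""
    counts = range(1, n_actions)
    return sum(counts) / len(counts)


def equivalent_feedbacks(n_actions, n_demo_pairs):
    """Average number of feedbacks matching n_demo_pairs optimal state-action pairs."""
    return average_feedbacks_per_state(n_actions) * n_demo_pairs


per_state = average_feedbacks_per_state(5)      # five actions, as in the grid world
for_100_pairs = equivalent_feedbacks(5, 100)
for_140_pairs = equivalent_feedbacks(5, 140)
```

With five actions the average is 2.5 feedbacks per state, so 100 and 140 optimal state-action pairs correspond to 250 and 350 feedbacks, respectively.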

Simulated Navigation Domain.
In this experiment, we consider a simulated navigation task on a 16 × 16 multifeature grid world, as in Figure 2(a). The learner robot has five actions for navigation (up, down, left, right, and stay motionless), where each action has a 10% chance of failure, leading to a single random move. The purpose of the learner robot is to navigate in the environment by following the teacher's navigation style to reach the goal.
To capture the teacher's navigation style, five features are defined in the environment, namely, ground, puddle, grass, obstacle, and goal, yielding a 5-dimensional binary feature vector $\phi$ that characterizes each state. For example, a navigation style could be moving in the environment while avoiding the obstacles, with a priority for going through the grass as much as possible; otherwise, it is preferred to pass through the ground rather than over a puddle. The learner's state is represented by its position in the grid, which has the Markov property. The reward function is a linear combination of the state's features ($R_\theta(s) = \theta^{T} \phi(s)$) and is unknown to the learner. By manually setting a feature weight vector $\theta_{\text{True}}$, we obtain a "true reward" function ($R_T = R_{\theta_{\text{True}}}$) which represents a specific teacher's navigation style. Then, we use a planning algorithm to compute the optimal teacher's policy ($\pi_T$) for this reward function. Thereafter, the nonoptimal demonstrations are derived by drawing the starting state from a fixed distribution and then sampling the optimal policy with a certain chance (degree of nonoptimality $\eta \in [0, 1]$) of selecting a nonoptimal action in each state. Each demonstration is terminated when reaching the goal or after 50 steps have elapsed. Among the derived nonoptimal trajectories, we select those whose nonoptimality level is close to $\eta$.
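The demonstration-generation procedure can be sketched as follows. This is an illustrative sketch under assumptions: the nonoptimal action is drawn uniformly from the other actions, the start state is passed in directly rather than drawn from a distribution, and the final selection of trajectories whose nonoptimality is close to $\eta$ is omitted. All names and the toy MDP are hypothetical.

```python
import numpy as np


def linear_reward(theta, features):
    """R_theta(s) = theta^T phi(s) for every state; features has shape (S, d)."""
    return features @ theta


def sample_demonstration(optimal_policy, transition, start_state, goal_state,
                         eta, rng, max_steps=50):
    """Follow the optimal policy, but pick a nonoptimal action with probability eta."""
    n_states = transition.shape[2]
    n_actions = transition.shape[1]
    trajectory = []
    s = start_state
    for _ in range(max_steps):
        if s == goal_state:
            break
        a_opt = int(optimal_policy[s])
        if rng.random() < eta:
            # Assumption: nonoptimal actions are chosen uniformly at random.
            a = int(rng.choice([b for b in range(n_actions) if b != a_opt]))
        else:
            a = a_opt
        trajectory.append((s, a))
        s = int(rng.choice(n_states, p=transition[s, a]))
    return trajectory


# Toy MDP with 4 states and 3 actions and uniform transitions (hypothetical).
rng = np.random.default_rng(0)
T = np.full((4, 3, 4), 0.25)
teacher_policy = np.array([0, 1, 2, 0])
clean_demo = sample_demonstration(teacher_policy, T, 0, 3, eta=0.0, rng=rng)
noisy_demo = sample_demonstration(teacher_policy, T, 0, 3, eta=1.0, rng=rng)
```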
Similarly, in the execution phase (stage 2; see Section 4.1), the learner agent starts from a state drawn from a certain distribution and terminates its episode either when reaching the goal or after 50 steps. The simulated teacher provides an evaluative feedback after each of the learner's actions, depending on the teacher's policy and the feedback error $\varepsilon$.
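A simulated teacher of this kind can be sketched in a few lines. The sketch assumes a single correct action per state (the teacher's policy action) and models the feedback error $\varepsilon$ as an independent flip of the binary feedback; the function name and the +1/-1 encoding are hypothetical.

```python
import numpy as np


def simulated_feedback(state, action, teacher_policy, error_rate, rng):
    """Return +1 (positive) or -1 (negative) feedback, flipped with prob. error_rate."""
    correct = bool(teacher_policy[state] == action)
    if rng.random() < error_rate:
        correct = not correct          # erroneous feedback
    return 1 if correct else -1


rng = np.random.default_rng(0)
teacher_policy = np.array([2, 0, 1])   # hypothetical teacher actions per state
f_right = simulated_feedback(0, 2, teacher_policy, error_rate=0.0, rng=rng)
f_wrong = simulated_feedback(0, 1, teacher_policy, error_rate=0.0, rng=rng)
f_flipped = simulated_feedback(0, 2, teacher_policy, error_rate=1.0, rng=rng)
```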
The simulation is performed using the settings summarized in Table 2. The results of learning in stage 1 of our framework are shown in Figure 3, which shows that the agent's performance degrades as the demonstrations become fewer and less optimal.
The results are averaged over 100 repetitions. The plain lines in the graphs are the mean of the EV scores, and the shaded colored areas are the standard deviation.

Comparison with Other Approaches.
Figure 4(a) illustrates the performance of our framework in face of nonoptimal demonstrations, where we need only 200 evaluative feedbacks to statistically reach the teacher's performance. This is a reasonable number in comparison with the information transferred by the evaluative feedbacks. Compared to our previous work (LfDHF) [15], our current method (IRLDC) results in significantly higher learning performance because we use human demonstrations to initialize our method and employ the learner's own experience trials as new demonstrations. This greatly increases sample efficiency and expedites the generalization of experiences. In addition, our previous work filtered out nonoptimal demonstrations, whereas here we learn from them. In contrast, the "policy combination" method hardly reaches the desired result even if a large number of feedbacks is provided. The results of the MLIRL method, which is considered one of the best methods for dealing with nonoptimal demonstrations in an IRL domain, show that nonoptimality has a deep influence on its performance and that it requires a large number of additional optimal demonstrations to attain an acceptable result, while providing optimal demonstrations is against our realistic assumption. Therefore, a large number of feedbacks and optimal demonstrations cannot resolve the nonoptimality effect within the "policy combination" and "MLIRL" methods, respectively, whereas with a small number of feedbacks our approach (IRLDC) yields a much better result.

Figure 4(b) is the case where a limited number of optimal demonstrations is provided. With optimal demonstrations ($\eta \to 0$), IRLDC and MLIRL statistically exhibit similar performance. For IRLDC, we need $i(F) \approx 260$ feedbacks, which is equivalent to $i(D) \approx 100$ extra state-action pairs in optimal demonstrations in MLIRL. Considering that the number of actions $|A|$ equals five, these values obey equation (15). Our previous work (LfDHF) [15] could not exploit evaluative feedbacks to compensate for sparsity in demonstrations because it focused only on using human evaluative feedbacks to correct the teacher's demonstrations. The "policy combination" method needs a large number of feedbacks to reach an acceptable result. Naturally, the methods employing evaluative feedbacks, i.e., IRLDC and policy combination, show a larger variance.

Nonoptimal Demonstration Effect.
According to Figure 5, in all cases, the effects of nonoptimality on the learning process can be compensated by using evaluative feedbacks. Nevertheless, the figure shows that when demonstrations are more misleading than informative (with an optimality degree below 50%), it is better to use only feedbacks and ignore the demonstrations.

Sparse Demonstration Effect.
The number of required feedbacks increases nonlinearly as the demonstrations become sparser (see Figure 6(a)). Figure 6(b) confirms the intuition that using any amount of optimal demonstrations makes the learning process faster than using only feedbacks. It also shows that the lack of demonstrations (i.e., sparsity) can be compensated by employing a reasonable number of feedbacks.

Figure 3: Performance of the standard IRL method (MLIRL) used in the first stage of our framework. The plain curves are the mean of the "EV" scores with respect to demonstration steps and nonoptimality degree. The blue, red, and black circles are different initialization settings for stage 2 of our framework. (Vertical axis: expected value (EV); curves: teacher policy and 100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, and 10% optimal demonstrations.)

Learning Only from Feedbacks (No Demonstrations).
The performance shown in Figure 7 indicates that, even in the absence of demonstrations, feedback data alone are sufficient for the IRLDC to get a good result. Though convergence is slow in the early learning trials, after collecting a sufficient number of feedbacks, the convergence is expedited; this is due to the generalization capability embedded in the IRLDC.
This makes the IRLDC perform better than the method of [45] used in the "policy combination" scenario (Figure 4(b)). In addition, learning only from feedbacks obeys equation (15), where it needs $i(F) \approx 350$ to achieve the same score as $i(D) \approx 140$.

Effects of the Learner Policy on Learning Process.
In the IRLDC framework, in the first stage, the agent observes demonstrations and then, in the second stage, it uses the gained knowledge to learn interactively. In the second stage, the agent receives feedbacks from the critic and uses that information in its IRL engine to improve its behavior.
The agent can use different policies in this stage. Figure 8 compares the performance of the agent against different numbers of feedbacks using random, probabilistic, and greedy policies. This experiment is done using the batch learning mode for the collected feedbacks. Since the demonstrations are neither optimal nor sufficient, the agent needs to balance exploration and exploitation to gain sufficient feedbacks while minimizing its regret. The greedy policy is the worst, since it gains information from feedbacks mostly in states where demonstrations are not optimal, and it cannot collect diverse information. In contrast, probabilistic and random policies provide the agent with the chance of facing states not seen in the demonstrations.

Figure 4: (a) The effect of nonoptimality in demonstrations. The experiment setting: initial nonoptimal demonstration "A5" with 60% optimality and 100 demonstrations in the first stage (see Figure 3). (b) The case where all demonstrations in the first stage are optimal but sparse. The experiment setting: initial sparse demonstration "B2" and 20 demonstrations in the first stage (see Figure 3). Two kinds of data are provided during the experiment: evaluative feedbacks related to the IRLDC, LfDHF, and policy combination (first horizontal axis), and state-action pairs in extra optimal demonstrations used in the MLIRL (second horizontal axis).

Figure 5: IRLDC's stage-two performance in face of abundant and different demonstration optimality levels in stage-one (points "A1", ..., "A8" and "B10" in Figure 3) and the number of evaluative feedbacks. The black curve has no initial demonstration (point "C" in Figure 3). (Curves: teacher policy; 100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, and 10% optimal demonstrations; learning only by feedback.)

Figure 6: (a) The relation between the sparsity level of demonstrations in stage-one and the number of feedbacks needed to reach an "EV" score equal to 1.17 using IRLDC. (b) IRLDC's stage-two performance in face of optimal and different demonstration sparsity levels in stage-one (points "B1", ..., "B10" in Figure 3) and the number of evaluative feedbacks. The black curve has no initial demonstration (point "C" in Figure 3).
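The random, probabilistic, and greedy learner policies compared in this section can be sketched as a single action-selection routine. This is a sketch under an assumption: the "probabilistic" policy is taken to be a Boltzmann distribution over the current state-action values, which matches the form of $\pi_\theta$ used in the method but is not confirmed by the text for this experiment. The function name and the `inv_temp` parameter are hypothetical.

```python
import numpy as np


def select_action(q_row, mode, rng, inv_temp=1.0):
    """Choose an action in one state from its row of state-action values."""
    n_actions = len(q_row)
    if mode == "random":
        return int(rng.integers(n_actions))         # pure exploration
    if mode == "greedy":
        return int(np.argmax(q_row))                # pure exploitation
    if mode == "probabilistic":
        z = inv_temp * (q_row - np.max(q_row))      # assumed Boltzmann weighting
        p = np.exp(z) / np.exp(z).sum()
        return int(rng.choice(n_actions, p=p))
    raise ValueError(f"unknown mode: {mode}")


rng = np.random.default_rng(0)
q_row = np.array([0.1, 2.0, -1.0])                  # hypothetical values
greedy_action = select_action(q_row, "greedy", rng)
random_action = select_action(q_row, "random", rng)
soft_action = select_action(q_row, "probabilistic", rng)
```

Only the random and probabilistic modes can select actions other than the current best one, which is what lets the learner receive feedback in states not covered by the demonstrations.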

Effect of the Feedback Error.
As mentioned earlier, our model can handle errors and inconsistencies in the feedbacks. Due to space constraints, in this experiment, we study only the effect of feedback error. Figure 9(a) shows that the learning performance remains acceptable and the navigation style can be learned even in the presence of noisy feedbacks. It can also be seen that the negative effect of the noise diminishes as the number of feedbacks grows, provided that the noise level is below 50%. Figure 9(b) illustrates the performance of our previous work (LfDHF) [15] in face of errors in the feedbacks; as the feedback error increases, the learning performance deteriorates, and as a result, LfDHF needs a large number of feedbacks to attain acceptable results.

Highway Car Driving Experiment.
In this section, we investigate the applicability of our framework with real human data in a dynamic environment. We utilized the car driving experiment devised in our previous work [15]. Our task is to navigate the agent car through three busy highway lanes (Figure 2(b)) using five actions: moving left/right, speeding up/down, and no action. The learner agent car moves faster than all of the other cars even at its lowest speed. The state space consists of the learner's speed, its lane, and the distribution of other cars on the highway. We consider two driving styles:

Style 1. Giving the highest priority to avoiding collisions with other cars, preferring the middle lane with high speed over the left lane with high speed, and over the right lane with low speed.
Style 2. The highest preference is to collide with other cars as much as possible, and it is preferred to drive in the middle lane at high speed.

Each of these styles is learned from demonstrations and feedbacks from a real human teacher interacting with the simulator through a keyboard.
The nonoptimality in the demonstrations is imposed by assuming that the learner agent perceives the teacher's demonstrations with 30% error; this is on top of the unmeasurable natural error in human demonstrations and feedbacks. In order to decrease the direct communication between the teacher and the learner, only negative feedbacks are given by the teacher. The pace of the simulator is set so that the teacher can conveniently give feedback on each decision.
When working with a human teacher, her "true" reward function is not available; instead, a task-specific performance measure is needed for the evaluation purpose [3,25].Here, we apply the standard IRL to the teacher's optimal demonstrations and take the extracted reward as a proxy of the "true" reward function.
Table 3 shows the results averaged over 5 independent runs, with $M = 40$ interaction steps with the environment before the learner policy is updated. These results illustrate that the IRLDC, with various demonstrations and a reasonable number of feedbacks, achieves the same performance as the standard IRL given the teacher's optimal demonstrations. A video of this experiment and the learned behavior can be found at http://bit.ly/31FnwGT.

E-Puck Robot Experiment.
Here we use an E-puck educational mobile robot [60] navigating in two environments similar to the one employed in Section 5.1 (see Figure 10). The robot learns the navigation style of the human teacher interacting with it through a keyboard. The robot's odometer and an external camera are used for localization and motion error correction (see Figure 10(c)). The robot has five actions: moving forward/backward, rotating clockwise/counterclockwise, and staying motionless. The transition model is estimated from the previously collected sequences of transition triplets $(s, a, s')$. The teacher's navigation style is as follows: moving in the environment to reach the goal (the red cell) while avoiding the gray cells, with a priority for going through the green cells as much as possible; otherwise, it is preferred to pass through the white cells rather than the blue ones. Two environments are involved in this experiment (Figures 10(a) and 10(b)).

We induce nonoptimality in the demonstrations by distracting the attention of the teacher when providing the demonstrations; this is on top of the unmeasurable natural error in human demonstrations and feedbacks. The "true" reward function is estimated using the standard IRL on the optimal human demonstrations on a simulated version of the task. The feedback protocol and the learner pace setting are similar to the previous section. The results of this experiment are summarized in Table 4. They show that our framework performs well in the real-world environment as well as when the learned reward function is generalized to a new environment. Also, the results are consistent with the previous simulated domain. Indeed, the results provide further affirmation that nonoptimal and sparse demonstrations are useful and help the learning process when used along with evaluative feedbacks. A video of this experiment and the learned behavior can be found at http://bit.ly/31FnwGT.

Conclusions
In this paper, we introduced the IRLDC to learn from a mixture of sparse and imperfect demonstrations and human evaluative binary feedbacks. Employing these two sources of information, the IRLDC is a practical and convenient tool to program artificial systems in real-world situations, where nonoptimal and sparse human demonstrations are common and inconsistency as well as error in human feedbacks is usual. Having the state transition model, the IRLDC estimates the reward function in a single optimization problem in order to generalize the expertise embedded in demonstrations and feedbacks, whereas standard IRL methods fail in face of sparse and imperfect demonstrations, and learning from feedbacks (standard LfF methods) suffers from the curse of dimensionality and a high load on the human teacher to provide rewards. The closest approach [15] to IRLDC does not benefit from the learner's experiences to improve the learning process and focuses only on using human evaluative feedbacks to correct the teacher's demonstrations and to filter out the nonoptimal ones. These limitations result in failure to handle sparsity as well as limited robustness against nonoptimality in demonstrations. In contrast, in IRLDC we use the learner's own experiences as additional demonstrations, which enhances sample efficiency and generalization and leads to lower regret and faster learning. In addition, we exploit errors in demonstrations, instead of filtering them out, to improve IRL by giving a higher chance to alternative decisions. These properties make the method faster and highly robust in face of errors in demonstrations and feedbacks.
Compared to other state-of-the-art methods, which combine demonstrations with RL experience or use corrective actions, advice, or preferences to learn from nonoptimal and sparse demonstrations, we follow a different paradigm for learning from a human: we allow her to simply express her preferences by adding evaluative feedbacks. Unlike the aforementioned rich sources of information, evaluative feedback is simple, has its own strengths, and imposes minimum constraints on the teacher during the teaching process. Nevertheless, corrective actions and advice, if available, can be directly used in our IRL model and boost our results further.
We studied the functionality of the IRLDC in three distinct problems: a grid world task, a car driving simulator, and an E-puck mobile robot navigation task, where human data are used in the last two cases. The results showed that the addition of feedbacks in our framework exploits the nonoptimal and sparse demonstrations well when the nonoptimality is below 50%. In addition, learning was successful in face of intrinsic errors in human feedbacks.
Moreover, the IRLDC worked well when programming solely by feedbacks; however, convergence occurred slowly in a linear way.
One of the major assumptions in the IRLDC, as well as in standard IRL methods, is having the state transition model.
This assumption is very realistic and prevalent when learning a new task or style in a known environment. Testing the IRLDC's robustness in face of limited errors in the state transition model is a problem for further studies. Furthermore, we assumed that every decision of the learner can be distinctly evaluated by the teacher. However, this setting is not practical in some situations where the pace of the learner is fast or the effect of multiple decisions is evaluated at once. These situations in turn give rise to the credit assignment problem [38]. Handling such situations is the next step of this research. In addition, we would like to employ our method in deep neural networks to attain higher generalization in face of more complex problems.

Figure 1 :
Figure 1: (a) Schematic overview of the IRLDC framework. (b) MLIRL block diagram with only demonstration input. (c) Our MLIRLDC block diagram with demonstration and feedback inputs.

Figure 2 :
Figure 2: (a) Grid world navigation domain. The white, blue, green, gray, and red cells depict the ground, puddle, grass, obstacle, and goal, respectively. The black circle represents the learner robot. (b) A snapshot of our highway car driving simulator.

Figure 6(b) indicates that when the optimal demonstration steps $(s, a)$ increase, rapid improvement in the performance occurs.

Figure 7 :Figure 8 :
Figure 7: The performance of the IRLDC framework when learning only from the teacher's evaluative feedback. This is the case where no demonstration is available (point "C" in Figure 3).

Figure 9 :
Figure 9: (a) The performance of the IRLDC framework under different feedback error values used in the interactive phase when 100 state-action pairs of 60% optimal demonstrations are given (see Table 2). (b) The performance of our previous work (LfDHF) [15] with a similar setting to (a).
The two environments (Figures 10(a) and 10(b)) use the same features and state representation and serve the following purposes: (i) testing the performance of the learned reward function, where the reward function is learned in one environment and evaluated in the second; (ii) providing demonstrations in one environment and feedbacks in another.

Figure 10 :
Figure 10: (a) and (b) E-puck robot navigating in two different environments.(c) E-puck external camera perspective used to localize its position.

Table 1 :
Strengths and weaknesses of LfD and LfF approaches.

Table 2 :
Settings used to study different aspects of our framework (IRLDC) in the simulated navigation domain. Note: learning from the collected feedbacks is done in the batch learning mode.

Table 3 :
Learning the driving styles from a human teacher in different conditions.

Table 4 :
Results of the learned navigation experiment by the E-puck robot in two different environments, with $M = 30$ interaction steps with the environment before the learner policy is updated. Here, in experiments 1 and 3, demonstrations and feedbacks are provided in the same environment, and then the performance of the learned reward function is examined in the second environment. Experiment 2 is done by providing sparse demonstrations in one environment and feedbacks in another.