A Robot Human-Like Learning Framework Applied to Unknown Environment Interaction

Learning from demonstration (LfD) is one of the most promising approaches for fast robot programming. Most learning systems learn both movements and stiffness profiles from human demonstrations. However, they rarely consider interaction with an unknown environment. In this paper, a robot human-like learning framework is proposed that can learn human skills through demonstration and complete an interaction task with an unknown environment. Firstly, the desired trajectory is generated by a dynamic movement primitive (DMP) based on the human demonstration. Then, an adaptive optimal admittance control scheme is employed to interact with the environment using the reference adaptation method. Finally, an experimental study is conducted, and the effectiveness of the proposed framework is verified via a group of curved-surface wiping experiments on a balloon with unknown model parameters.


Introduction
Robot learning from demonstration (LfD) has recently drawn much attention because of its high efficiency in robot programming [1].
Thus, robots can be programmed quickly to perform varied operating skills and to relieve human tutors of such tasks in complex industrial environments [2]. Compared to conventional programming methods using a teaching pendant, LfD is an easier and more intuitive way for people who are unfamiliar with programming. Besides, human characteristics involved in the demonstrations are available for robots to further improve the flexibility and compliance of motions [3,4].
After the demonstration, how to use the information from the human tutor is very important. The dynamic movement primitive (DMP) is a common method in human-robot skill transfer tasks [5]. DMPs have several advantages. First, the DMP model is simple: only a few parameters need to be adjusted to model a trajectory. Second, a regression algorithm can be used to quickly learn the model parameters during online trajectory planning [6]. In addition, the DMP model generalizes easily: a trajectory with the same style as the original can be generated simply by adjusting the starting and ending coordinates of the trajectory [7,8]. Because of these advantages, DMPs have been widely used in human-robot skill transfer tasks [9].
Appropriate control strategies help robots reproduce human skills more accurately and stably. In some specific tasks such as surface cleaning, cargo handling, and environment identification, robots are required to track a task trajectory and achieve compliance in the interaction with environments [10]. In the previous literature on interaction control, two main methods have been widely studied: impedance control [11] and hybrid position/force control [12]. Admittance control, which can be regarded as position-based impedance control, achieves good interaction performance through trajectory adaptation [13][14][15]. According to the admittance model, the external forces applied to the robot are transformed into adjustments of the end-effector position, and the desired interaction performance is then ensured by trajectory adaptation and tracking [16]. The control strategies mainly include proportional-integral-derivative (PID) control, adaptive control, adaptive control using neural networks, and fuzzy control [17][18][19][20]. When robots perform different tasks in unknown, complex, and dynamic environments, accurate task models and environmental information are usually difficult to obtain, and various errors may seriously affect the final control results [21]. In recent years, control methods based on neural network learning have shown better adaptability to system and environmental uncertainty, but they require large amounts of system data and have difficulty integrating the various constraints of unknown environments in real time [22,23].
In this paper, firstly, the desired trajectory is generated from the human demonstration, and then an adaptive admittance control scheme is applied to interact with the environment using the reference adaptation method. The contributions can be summarized as follows: (1) An adaptive optimal admittance controller is developed that takes the unknown interaction environment dynamics into account. By combining the generalization ability of the DMP model with the compliance control ability of the adaptive optimal admittance model, the interaction performance between the robot and the unknown environment is improved. (2) A complete human-like learning framework is developed. First, the desired trajectory is obtained quickly and accurately by human teaching and generalization. Then, the online adaptive controller recalculates and updates the original desired trajectory to obtain a new reference trajectory. The framework can update the reference trajectory for different interaction environments, which greatly enhances the interaction accuracy.

The rest of the article is organized as follows: In Section 2, the methods of desired trajectory generation and the adaptive optimal admittance controller used in this paper are introduced. In Section 3, the experimental study is presented, and the effectiveness of the proposed framework is verified via balloon surface wiping experiments. Finally, Section 4 summarizes the whole paper.

Overview of the Framework.
The scheme of the proposed framework is shown in Figure 1. In the proposed learning framework, the human tutor first presents a demonstration. The trajectory learned by the DMP model is regarded as the desired trajectory. Then, the desired trajectory and the interaction force measured by the force sensor are input into an adaptive admittance controller to obtain the modified reference trajectory.
Here, x and ẋ represent the current position and velocity, respectively; x_d and ẋ_d represent the desired position and velocity; x_r represents the reference position; q, q̇, and τ represent the current joint angle, angular velocity, and torque; q_r and τ_r represent the reference angle and reference torque; and f_int represents the interaction force. Finally, new manipulation motions are executed by the robot joint controller, and the newly collected data are taken as a new demonstration for repetitive training.

Dynamic Movement Primitives (DMPs).
In this paper, the motion DMP is obtained by using the DMP model to fit the demonstrated motion trajectory. The principles of the motion DMP used in this paper are stated as follows [24,25]. In essence, a DMP is a second-order nonlinear dynamical system comprising a spring and a damper. A single-degree-of-freedom motion can be expressed by the following equations, where the time variable is omitted for simplicity (e.g., β₁(t) is written as β₁):

τβ̇₂ = b(g − β₁) − aβ₂ + f(s; ω),   (1)
τβ̇₁ = β₂,   (2)
τṡ = −k₁s,   (3)

Here, a and b represent the damping coefficient and spring constant of the system, respectively, and b is usually set as b = a²/4 so that the system is critically damped; g is the target value of the motion trajectory; and τ represents the time scaling constant. β₁ and β₂ represent the position and velocity of the motion trajectory, respectively, and the relationship between these two variables is given by equation (2). ω denotes the weights of the Gaussian models, and s is the phase variable of the system, governed by the canonical system in equation (3), in which k₁ is a positive constant. The nonlinear forcing function f(s; ω) is defined as

f(s; ω) = (Σᵢ₌₁ᴺ ψᵢ(s)ωᵢ / Σᵢ₌₁ᴺ ψᵢ(s)) · s(g − β₀),   ψᵢ(s) = exp(−dᵢ(s − cᵢ)²),

where cᵢ, dᵢ, and ωᵢ are the centre, width, and weight of the i-th kernel function, respectively; β₀ is the initial value of the motion trajectory; and N is the total number of Gaussian models. In general, the initial value of s is set to 1 and gradually decays to zero. As s tends to zero, the forcing function f(s; ω) remains bounded and vanishes, so the model reduces to a stable second-order spring-damper system.
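To make these dynamics concrete, the following minimal Python sketch integrates the single-degree-of-freedom DMP system above with Euler steps. The gains, kernel layout, and step sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

def dmp_rollout(beta0, g, centers, widths, weights, b=25.0, tau=1.0, k1=4.0,
                dt=0.001, steps=1000):
    """Euler integration of the DMP equations (1)-(3).

    beta1: position, beta2: velocity, s: phase variable decaying from 1 to 0.
    """
    a = 2.0 * np.sqrt(b)          # critical damping, b = a^2 / 4
    beta1, beta2, s = beta0, 0.0, 1.0
    path = []
    for _ in range(steps):
        psi = np.exp(-widths * (s - centers) ** 2)                  # kernels
        f = psi @ weights / (psi.sum() + 1e-10) * s * (g - beta0)   # forcing term
        dbeta2 = (b * (g - beta1) - a * beta2 + f) / tau            # eq. (1)
        dbeta1 = beta2 / tau                                        # eq. (2)
        ds = -k1 * s / tau                                          # eq. (3)
        beta2 += dbeta2 * dt
        beta1 += dbeta1 * dt
        s += ds * dt
        path.append(beta1)
    return np.asarray(path)
```

With all weights set to zero, the forcing term vanishes and the rollout simply converges to the goal g; learned weights reshape the transient while preserving the start/goal generalization described above.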
In general, a supervised learning algorithm such as locally weighted regression (LWR) is used to determine the model parameters ω [26]. Given the teaching trajectory β(t), where t = 1, 2, ..., T, and g = β(T), the target forcing function can be determined according to

f_target(t) = τ²β̈(t) + Dτβ̇(t) − K(g − β(t)),

where K and D represent the stiffness and damping of the system, respectively. ω can then be determined by solving

ω = argmin_ω Σₜ (f_target(t) − f(s(t); ω))².
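The LWR fit above reduces to one weighted least-squares problem per kernel. The sketch below illustrates this; the kernel parameters and the K, D gains are illustrative assumptions, and `np.gradient` stands in for whatever differentiation the original pipeline uses:

```python
import numpy as np

def fit_dmp_weights(demo, dt, centers, widths, K=25.0, D=10.0, tau=1.0, k1=4.0):
    """Locally weighted regression for the DMP forcing-term weights (a sketch)."""
    T = len(demo)
    g, beta0 = demo[-1], demo[0]
    vel = np.gradient(demo, dt)                      # numerical derivatives
    acc = np.gradient(vel, dt)
    t = np.arange(T) * dt
    s = np.exp(-k1 * t / tau)                        # closed-form phase variable
    f_target = tau**2 * acc + D * tau * vel - K * (g - demo)
    xi = s * (g - beta0 + 1e-10)                     # basis scaling s * (g - beta0)
    psi = np.exp(-widths[:, None] * (s[None, :] - centers[:, None]) ** 2)
    # per-kernel weighted least squares: w_i = sum(psi_i xi f) / sum(psi_i xi^2)
    w = (psi * xi * f_target).sum(axis=1) / ((psi * xi**2).sum(axis=1) + 1e-10)
    return w
```

Each weight ωᵢ is fit only to the samples where its kernel ψᵢ is active, which is what makes LWR fast enough for the online trajectory planning mentioned earlier.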

Adaptive Optimal Admittance Control.
In this section, an adaptive task-specific admittance controller is developed. It adapts the parameters of the prescribed robot admittance model so that the robot system assists the human in achieving task-specific objectives. The task information is modeled by the DMP so that the controller can adapt to the characteristics of the human tutor. The designed adaptive admittance controller is used in the reproduction phase.
As shown in Figure 1, the process of adaptive admittance control in this article is as follows: the robot obtains the desired trajectory x_d, ẋ_d, ẍ_d through LfD and DMP generalization; then, the force sensor collects the interaction force between the robot end-effector and the environment in real time. These are used as the input of the adaptive admittance model, and the desired trajectory x_d is modified according to the admittance model. A new reference trajectory x_r is thereby obtained and transmitted to the controller as the control signal to ensure fast and accurate tracking of the reference trajectory by the actual trajectory. The core of adaptive admittance control is that the model parameters are not fixed but are optimized online by an adaptive algorithm according to the real-time position and interaction force information, in order to minimize a quadratic cost function [13]. The prescribed admittance model is defined as follows:

M_E(ẍ − ẍ_d) + C_E(ẋ − ẋ_d) + K_E(x − x_d) = f_int,

where x, ẋ, and ẍ represent the current position, velocity, and acceleration, respectively; ẋ_d and ẍ_d represent the desired velocity and acceleration, respectively; and M_E, C_E, and K_E represent the unknown mass, damping, and stiffness matrices of the model, respectively. However, the mass matrix M_E is usually highly nonlinear. In this study, the mass-damping-stiffness model is therefore simplified to a damping-stiffness model, which is used to interact with a balloon as a kind of flexible object. The simplified model is as follows:

C_E(ẋ − ẋ_d) + K_E(x − x_d) = f_int.

Consider the following continuous-time linear system:

ξ̇ = Aξ + Bu,

where ξ denotes the system state and u the control input. The optimal control input of the system is designed as u = −Kξ, and the control objective is to minimize a cost function by design of the control system. The cost function is defined as follows:

J = ∫₀^∞ (ξᵀQ′ξ + uᵀRu) dt,

where Q is a constant matrix, Q′ = [Q, −Q]ᵀ[1, −1] represents the weight matrix of the tracking error, and R represents the weight matrix of the external force.
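As a concrete illustration of the simplified damping-stiffness model, the sketch below performs one reference-adaptation step in a single axis: it solves the model for the reference velocity and integrates it. The scalar gains and the sign convention for f_int are assumptions for illustration:

```python
import numpy as np

def admittance_step(x_r, x_d, xdot_d, f_int, C, K, dt):
    """One Euler step of a scalar damping-stiffness admittance model (sketch).

    Solves C * (xr_dot - xd_dot) + K * (x_r - x_d) = f_int for the reference
    velocity, then integrates it to update the reference position x_r.
    """
    xdot_r = xdot_d + (f_int - K * (x_r - x_d)) / C
    return x_r + xdot_r * dt
```

With zero interaction force the reference trajectory relaxes back to the desired one; a nonzero f_int deflects it compliantly, which is exactly the trajectory-adaptation behavior the controller exploits.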
In this paper, the design of the cost function takes into account both the robot system state and the external environment to evaluate the interactive control effect.
In the case that A and B are unknown constant matrices, an algorithm is proposed to obtain the optimal control signal by online learning. First, some variables are defined as follows:

Figure 1: Overview of the proposed framework.
where the integrals are computed over the sampling interval, ⊗ represents the Kronecker product, and P_K is a symmetric matrix that is rearranged into vector form. The principle of the adaptive optimal admittance scheme is summarized in Algorithm 1 [27,28].
where I_m is the m-dimensional identity matrix and vec(·) is the function that transforms a matrix into a vector. Through equation (17), we can obtain the optimal feedback control gain K*. Substituting K* into u = −Kξ, the optimal feedback control signal u is obtained.
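For intuition, when A and B are known, the fixed point that Algorithm 1 approaches from data can be computed directly by Kleinman-style policy iteration, which alternates the same P_K evaluation and K update. This model-based stand-in is a sketch for illustration, not the paper's model-free scheme:

```python
import numpy as np

def kleinman_iteration(A, B, Q, R, K0, tol=1e-9, max_iter=100):
    """Model-based policy iteration for u = -K x.

    Algorithm 1 estimates the same P_K / K_{K+1} updates from measured data
    when A and B are unknown; here we use them directly.
    """
    n = A.shape[0]
    K, P_prev = K0, None
    for _ in range(max_iter):
        Ak = A - B @ K
        Qk = Q + K.T @ R @ K
        # policy evaluation: solve Ak^T P + P Ak = -Qk via Kronecker products
        M = np.kron(np.eye(n), Ak.T) + np.kron(Ak.T, np.eye(n))
        P = np.linalg.solve(M, -Qk.reshape(-1)).reshape(n, n)
        P = 0.5 * (P + P.T)                      # enforce symmetry of P_K
        K = np.linalg.solve(R, B.T @ P)          # policy improvement K_{K+1}
        if P_prev is not None and np.linalg.norm(P - P_prev) < tol:
            break                                # ||P_K - P_{K-1}|| < eps
        P_prev = P
    return K, P
```

The iteration requires a stabilizing initial gain K_0, matching the role of the initial feedback gain in Algorithm 1, and converges to the solution of the associated algebraic Riccati equation.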

Inverse Kinematics Using CLIK.
The closed-loop inverse kinematics (CLIK) algorithm is employed to resolve the Cartesian reference trajectory x_r into the joint-space trajectory q_r [29][30][31]. The solution error is e = k(q_r) − x_r, where k(·) denotes the forward kinematics, and the error dynamics are assigned as

ė = −K_e e,

where K_e is a positive user-defined matrix that determines the convergence rate of e. Differentiating e and combining with ẋ = Jq̇ and J = ∂k(q)/∂q, where J is the Jacobian matrix of the robot, the following equation is obtained:

Jq̇_r − ẋ_r = −K_e e.

Furthermore, we obtain the CLIK method:

q̇_r = J⁺(ẋ_r − K_e e),

where J⁺ denotes the pseudoinverse of the Jacobian matrix.
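The CLIK update can be sketched for a planar 2-link arm as follows. The link lengths, gain matrix, and arm model are illustrative assumptions (the paper applies the method to the 7-DOF Baxter arm):

```python
import numpy as np

def fk(q, l1=0.3, l2=0.25):
    """Forward kinematics k(q) of a planar 2-link arm (illustrative stand-in)."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q, l1=0.3, l2=0.25):
    """Analytic Jacobian J = dk(q)/dq of the same arm."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def clik_step(q_r, x_r, xdot_r, Ke, dt):
    """One CLIK update: qdot_r = J^+ (xdot_r - Ke e), with e = k(q_r) - x_r."""
    e = fk(q_r) - x_r
    qdot_r = np.linalg.pinv(jacobian(q_r)) @ (xdot_r - Ke @ e)
    return q_r + qdot_r * dt
```

Because the error dynamics are ė = −K_e e, the Cartesian solution error decays at a rate set by K_e, and the integrated q_r converges to a joint trajectory whose forward kinematics track x_r.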

Experiment and Analysis
In this section, the performance of the proposed learning framework was validated through experiments on a 7-DOF Baxter robot, as shown in Figure 2. The manipulator was equipped with an ATI Mini45 force/torque sensor. The end-effector was wrapped in a towel used to wipe the drawn curve off the surface of the balloon. The force sensor and the system controller communicate via the UDP protocol, with the sampling rate and control rate set to 100 Hz and 50 Hz, respectively. To prevent displacement of the balloon from affecting the experimental results, the balloon to be wiped was fixed in a fixing box. The box is a paper carton just big enough to hold the balloon, of size 43 cm × 32 cm × 18 cm, and is fixed on the test bench with adhesive tape.

Demonstration Stage.
In the teaching stage, a curve was first drawn on the surface of the balloon with a whiteboard pen; then, the human tutor dragged the left arm of the Baxter robot to complete the teaching task (wiping). In the meantime, the teaching trajectory was recorded and input into the DMP model through the program. The system then learnt and generalized it to obtain the desired trajectory x_d. At the same time, the force sensor recorded the interaction forces in the X, Y, and Z directions for subsequent analysis.

Reproduction of the Wiping Task.
In the beginning, a new curve was drawn on the surface of the balloon. The robot end-effector was controlled to move to the starting point of the desired trajectory at [0.992, 0.280, 0.227] m. At this point, the end of the robot arm was interacting with the environment and changed from free-space motion to constrained-space motion. Since a balloon was used as the interactive environment, its parameters are unknown, so adaptive optimal admittance control was proposed to solve this problem. According to the set cost function, online adaptive learning of the interactive environment model parameters helps to achieve the desired control effects and complete the wiping task for the new curve.

In the first wiping experiment, the trajectory obtained by demonstration was directly used as the reference trajectory, and the admittance model parameters were specified as C_E = [−0.5, 0.01, −0.8] and K_E = [7, 2, 10]. In the second experiment, the trajectory obtained by teaching was input into the DMP model for learning and generalization; the generated trajectory was used as the reference trajectory and then applied to the admittance model of the first experiment. In the third experiment, the desired trajectory obtained by DMP learning and generalization of the teaching trajectory was used as the input of the adaptive optimal admittance controller, and finally a new reference trajectory was obtained and input to the Baxter joint controller. The initial values of the state feedback gain in the X, Y, and Z directions were set to [..., 0.5781], respectively. Next, to verify the effectiveness of the proposed framework, the results of the above three experiments were compared, and the trajectory tracking errors and interaction force changes were analyzed.

Experimental Results and Discussion.
First of all, the three-dimensional curves of the teaching trajectory, the DMP-generalized trajectory, and the three experimental trajectories are given in the same spatial rectangular coordinate system, as shown in Figure 3. As can be seen from the figure, the unprocessed teaching trajectory exhibits serious jitter, while the desired trajectory after DMP generalization is much smoother.
In the first experiment, the teaching trajectory is used as the reference input of the joint controller of the Baxter robot. The wiping effect is shown in Figure 4(b). It can be seen that the wiping task is not successfully completed under these experimental conditions. The time-varying curve of the interaction force during this process is shown in Figure 5. It shows that the robot performed the wiping task between 6 s and 18 s. However, the interaction force is not large enough and later decreases toward 0, so the robot fails to wipe the handwriting clean.
In the second experiment, although the curve is erased from the surface of the balloon, it can be seen from the force trajectory that the interaction force is very large. As shown in Figure 6, the maximum force in the Z direction reaches 25 N. The task photograph (Figure 4(c)) also shows that a very serious inward depression occurs on the balloon. If the interactive environment were not a flexible object such as a balloon but a highly rigid one, this could damage the robot arm or the interactive object. Therefore, this experiment also fails to complete the wiping task well.
Input: the initial feedback gain K_0 and the state variable ξ;
Output: the optimal feedback gain K*;
Phase 1: set f_int,0 = K_0 ξ as the initial input while the manipulator is in contact with the environment;
Repeat: compute Δ_ξξ, I_ξξ, and I_ξu;
Until: rank[I_ξξ, I_ξu] = m(m + 1)/2 + mr;
Phase 2:
Repeat: solve P_K and K_{K+1} according to equation (17);
Until: ||P_K − P_{K−1}|| < ε;
Return: K*;
ALGORITHM 1: Framework of the adaptive optimal admittance scheme.

The third experiment is based on the learning framework proposed in this paper. It can be seen from Figure 4(d) that the wiping effect is greatly improved compared with the previous two experiments, and the handwriting curve is basically wiped clean. The interaction force graph (Figure 7) shows that the robot interacted with the balloon during roughly 6-20 s, during which the force in the Z direction, the dominant force of the wiping task, varies smoothly between 0 N and 8 N. Compared with the previous two experiments, the interaction force is clearly closer to optimal.
From the three-dimensional trajectories in Figure 3, it can be seen intuitively that the trajectories of the first and second experiments deviate substantially from the expected trajectory. The interaction between the manipulator end-effector and the balloon is essentially completed within 20 seconds; the following analyzes the trajectory tracking error during this stage. As shown in Figure 8, in the first experiment, the tracking error reaches its maximum in the later stage of the wiping task, and the maximum value in the Z direction exceeds 0.1 m. Likewise, the trajectory tracking error of the second experiment (Figure 9) is also relatively large. Only the third experiment tracks the expected trajectory well. The tracking error curves in the X, Y, and Z directions for the third experiment are shown in Figure 10. The error values all lie within ±0.04 m, and during the wiping process, the error is basically stable around 0. This also proves that the learning framework proposed in this paper is able to track the reference trajectory well.

Conclusions
In this paper, a robot human-like learning framework based on robot interaction with an unknown environment was proposed. The LfD approach enables the robot to obtain the reference input more quickly and accurately. At the same time, by combining the generalization ability of the DMP model with the compliance control ability of the adaptive optimal admittance model, the interaction performance between the robot and the unknown environment was enhanced. Finally, the effectiveness of the proposed framework was verified by the balloon surface wiping experiment. Our future work will apply the proposed framework to different complex tasks and environments, such as writing on an unknown curved surface, and the learning and generalization of interaction force will also be considered in force control.

Data Availability
The detailed parameters of the model and controller used are given in the article. The results were computed in MATLAB R2018a, and the relevant results are also given in the manuscript.

Conflicts of Interest
The authors declare that they have no conflicts of interest.