An Improved Reinforcement Learning Algorithm for Cooperative Behaviors of Mobile Robots

Reinforcement learning for multirobot systems becomes very slow as the number of robots increases, because the state space grows exponentially. A sequential Q-learning algorithm based on knowledge sharing is presented. The rule repository of robot behaviors is first initialized in the process of reinforcement learning. Mobile robots obtain the present environmental state from their sensors. The state is then matched to determine whether the relevant behavior rule has been stored in the database. If the rule is present, an action is chosen in accordance with the knowledge and the rules, and the matching weight is refined. Otherwise the new rule is appended to the database. The robots learn according to a given sequence and share the behavior database. We examine the algorithm on a multirobot following-surrounding behavior and find that the improved algorithm can effectively accelerate convergence.


Introduction
In recent years, multirobot systems (MRSs) have received considerable attention because such systems possess special capabilities such as greater flexibility, adaptability, and efficiency in dealing with complex tasks [1]. Multirobot learning is the process of acquiring new cooperative behaviors for a particular task by trial and error in the presence of other robots. The desired cooperative behaviors may emerge from local interactions among robots that have limited sensing capabilities. A multirobot system can perform more complex tasks via cooperation and coordination [2,3].
Normally, multirobot learning methods can be classified as collective swarm learning and intentionally cooperative learning, according to the level of explicit communication. Collective swarm systems allow participating robots to learn swarm behaviors with only minimal explicit communication among robots [4,5]. In these systems a large number of homogeneous mobile robots interact implicitly with each other through the shared environment. The robots are organized on the basis of local control laws, such as the stigmergy introduced by Garnier et al. [6]. Stigmergy is a mechanism of indirect interaction mediated by modifications of the agents' shared environment [7]. The information coming from the local environment can guide the activity of each participating individual. Complex intelligent behavior emerges at the colony level from the local interactions that take place among individuals exhibiting simple behaviors. At present, swarm behaviors are often modeled using methods inspired by biology. Along with the advent of artificial life, some self-organizing models of social animals have provided useful inspiration [8]. Beckers et al. conducted early simulations and physical experiments to interpret the nesting behavior of termites with the stigmergy mechanism [9]. Swarm intelligence learning mainly involves the ant colony algorithm and the particle swarm optimization algorithm [10]. Beyond these methods, reinforcement learning and evolutionary algorithms are also important in the design of collective swarm systems. Givigi and Schwartz presented an evolutionary method of behavior learning for swarm robot systems [11]. Chromosomes were exchanged with a distributed genetic algorithm to improve the behavior ability of the robots. Sang-Wook et al. discussed the use of evolutionary psychology to select a set of personality traits that evolve through a learning process based on reinforcement learning [12]. Game theory is introduced in conjunction with the use of external payoffs.
Unlike collective swarm systems, robots in intentionally cooperative systems share the information of joint actions to determine the next state and the rewards of each learning robot. More and more attention has been paid to improved reinforcement learning algorithms. Kobayashi et al. proposed an objective-based reinforcement learning system for multiple autonomous mobile robots to acquire cooperative behavior. The proposed system employs profit sharing (PS) as a learning method [13]. Fernández et al. studied the issues of credit assignment and the large-scale state space in a multirobot environment and combined domain knowledge with machine learning algorithms to achieve successful cooperative behaviors [14]. Lee et al. presented an algorithm of behavior learning and online distributed evolution for the cooperative behavior of a group of autonomous robots. Individual robots improve the state-action mapping through online evolution with a crossover operator based on the Q-values and their update frequencies [15]. Ahmadabadi et al. introduced some expertness measuring criteria and a new cooperative learning method called weighted strategy sharing (WSS). Each robot assigns a weight to the knowledge of the other learning robots and uses it according to the expertness of its teammates [16]. The hybrid policy reduced the dimensions of the robot state space.
In both swarm learning and cooperative learning the state space grows exponentially in the number of team members. To deal with the slow convergence of the standard learning algorithm, we propose an improved reinforcement learning algorithm based on knowledge sharing. The robots perform the learning process according to a predefined sequence. The learning robot senses the current state at each time step. If the same state already exists in the rules repository, the learning robot chooses an action on the basis of the knowledge base and the rules repository, and the corresponding weight vector is updated based on reinforcement learning. Otherwise the learning robot chooses an appropriate action according to the state transition probability function, and the new rule is appended to the rules repository. During reinforcement learning, the learning robot assigns a weight to each robot based on weighted strategy sharing. The behavior weight vector is refined on the basis of the weighted sum of teammate expertness. Because each robot does not need to observe the actions of all its teammates, the improved reinforcement learning algorithm results in a significantly smaller state space than that of the standard Q-learning algorithm.

The Model of Multirobot Environment State
The work space is a bounded two-dimensional environment. Each robot is considered an omnidirectional mobile robot with limited sensing capabilities. The robot has eight range sensors, indexed by $i$ ($i = 1, 2, \ldots, 8$). The eight sensors are arranged evenly. Accordingly, the detecting zone is evenly separated into eight sectors, numbered counterclockwise from the direction of the robot's heading. The sectors are together represented by an 8-bit binary code, in which each bit is 1 when a robot or obstacle is located in the corresponding sector and 0 otherwise. This binary code represents the object distribution around the robot in the detection circle. The environmental state shown in Figure 1 can be described as the state $S = [0\ 1\ 0\ 1\ 1\ 1\ 1\ 1]$.
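The sector encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_state`, the Cartesian coordinate convention, and the default detection radius are assumptions, not details taken from the paper.

```python
import math

def encode_state(robot_xy, heading, objects_xy, radius=1.0, n_sectors=8):
    """Return one 0/1 flag per sector, counted counterclockwise from the heading."""
    state = [0] * n_sectors
    width = 2 * math.pi / n_sectors  # angular width of one sector
    for ox, oy in objects_xy:
        dx, dy = ox - robot_xy[0], oy - robot_xy[1]
        dist = math.hypot(dx, dy)
        if dist == 0 or dist > radius:
            continue  # coincident with the robot or outside the detection circle
        # Bearing of the object relative to the heading, wrapped to [0, 2*pi)
        angle = (math.atan2(dy, dx) - heading) % (2 * math.pi)
        sector = min(int(angle // width), n_sectors - 1)
        state[sector] = 1
    return state

# Two objects in range (ahead-left and behind-left) and one out of range.
print(encode_state((0.0, 0.0), 0.0, [(0.5, 0.1), (-0.5, 0.1), (2.0, 0.0)]))
```

Several objects in the same sector set the same bit, so the code records only whether a sector is occupied, not how many objects it contains.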

The Behaviors Database of Robots
The robot behaviors database includes the knowledge base and the rules repository. The selection rules are stored in the rules repository. Once the learning robots find the target, its position and state are stored in the shared knowledge base. The knowledge base is $K : \{T_i(x_i, y_i), \mathit{state}_i\}$, $i = 1, 2, \ldots, n$, where $T_i$ is the $i$th target, $(x_i, y_i)$ is the position of $T_i$, $n$ is the number of targets, and $\mathit{state}_i$ is the state of the current target: $\mathit{state}_i$ is equal to 0 if the target is stationary and 1 otherwise.
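One possible in-memory layout for the shared knowledge base is sketched below. The class and field names (`TargetRecord`, `KnowledgeBase`, `report`) are hypothetical, chosen only to mirror the target index, position, and stationary/moving flag described above.

```python
from dataclasses import dataclass

@dataclass
class TargetRecord:
    target_id: int
    x: float
    y: float
    moving: int  # 0 if the target is stationary, 1 otherwise

class KnowledgeBase:
    """Shared store of discovered targets, keyed by target index."""

    def __init__(self):
        self.targets = {}

    def report(self, target_id, x, y, moving):
        # A later report for the same target overwrites the earlier one.
        self.targets[target_id] = TargetRecord(target_id, x, y, moving)

    def known_targets(self):
        return list(self.targets.values())

kb = KnowledgeBase()
kb.report(1, 2.0, 3.5, 0)
kb.report(1, 2.5, 3.5, 1)  # the same target, now observed moving
```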
The robots can choose their behaviors at each step in $N$ directions. Each group of behaviors corresponds to a weight vector, which gives the selection probability of every behavior:
$$W_i = [w_{i1}, w_{i2}, \ldots, w_{iN}],$$
where $W_i$ is the weight vector that corresponds to the state $S_i$, $S_i$ is the current state, and $w_{ik}$ is the weight of the corresponding behavior. When perceiving the environment state $S_i$, the robot chooses a behavior according to the elements of the weight vector $W_i$. The behavior selection strategy is
$$P(a_k \mid S_i) = w_{ik}, \quad k = 1, \ldots, N,$$
where $a_k$ is the selected action, $S_i$ is the current state, $P(a_k \mid S_i)$ is the probability of selecting that action, $N$ is equal to 8, and the weights $w_{ik}$ ($k = 1, \ldots, N$) sum to 1. The behavioral rules of the robots are represented by
$$R_i : S_i \to a_i,$$
where $R_i$ is the behavioral rule corresponding to the state $S_i$ and $a_i$ is the selected action.
If the robots cannot perceive the target, then $S_0 = [0\ 0\ 0\ 0\ 0\ 0\ 0\ 0]$, and the robots choose behaviors equiprobably in $N$ directions, meaning that the learning robots move randomly; that is, $W_0 = [1/N, 1/N, \ldots, 1/N]$. In this case the selection rule maps $S_0$ to the uniform weight vector $W_0$. In the course of learning the robot perceives the current state $S_i$ and determines whether the same state already exists in the rules repository. If it does, the corresponding weight vector is updated based on reinforcement learning. Otherwise the new rule $R_i$ is appended to the rules repository. All learning robots share the behaviors database; that is, any robot responds to the same state according to the same rule. The rules repository is
$$\mathcal{R} = \{R_1, R_2, \ldots, R_M\},$$
where $\mathcal{R}$ is the rules repository, $R_i$ is the behavioral rule corresponding to the state $S_i$, and $M$ is the maximum number of rules.
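A minimal sketch of such a shared rules repository follows, assuming a dictionary keyed by the state code and a weight vector that starts out uniform for any unseen state. The class name `RuleRepository` and its methods are illustrative and do not come from the paper.

```python
import random

N_ACTIONS = 8  # one behaviour per sector direction

class RuleRepository:
    """Shared mapping from a perceived state to a behaviour weight vector."""

    def __init__(self, n_actions=N_ACTIONS):
        self.n_actions = n_actions
        self.rules = {}  # state tuple -> list of weights summing to 1

    def lookup(self, state):
        key = tuple(state)
        if key not in self.rules:
            # Unseen state: append a new rule with equiprobable behaviours.
            self.rules[key] = [1.0 / self.n_actions] * self.n_actions
        return self.rules[key]

    def select_action(self, state, rng=random):
        # Sample a behaviour index with probability equal to its weight.
        weights = self.lookup(state)
        return rng.choices(range(self.n_actions), weights=weights, k=1)[0]

repo = RuleRepository()
action = repo.select_action([0] * 8)  # no target perceived: uniform random move
```

Because every robot reads and writes the same `RuleRepository` instance, a rule refined by one robot is immediately available to the others.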

Reinforcement Learning Algorithm Based on Weighted Strategy Sharing
Reinforcement Learning Algorithm for Mobile Robots.
Q-learning is a form of model-free reinforcement learning, which does not require explicit knowledge of the environment. It allows a robot to synthesize and improve behaviors through trial and error. Within the reinforcement learning framework a robot chooses an appropriate action for the current state that results in an immediate reward and attempts to maximize the long-term rewards. The algorithm converges with probability one to the optimal Q-values so long as each state-action pair is visited infinitely often and the learning rate declines.
In single-agent reinforcement learning, a robot operates in accordance with a finite-discrete-time Markov Decision Process, which is formally defined as follows.
Definition 1. A finite Markov Decision Process is a four-tuple $\langle S, A, T, R \rangle$, where $S$ is a finite discrete set of states, $A$ is a finite discrete set of agent actions, $T : S \times A \to \Pi(S)$ is the state transition probability function, and $R : S \times A \to \mathbb{R}$ is a reward function.
In the process of reinforcement learning, the learning robot senses the current state and chooses an appropriate action. The environment changes its state to the succeeding state according to the state transition probability function. The task of robot reinforcement learning is to obtain the optimal policy $\pi^*$, which lets the robots acquire the maximum cumulative reward $V^{\pi}$ beginning at every state. The cumulative value $V^{\pi}(s_t)$ achieved by following an arbitrary policy $\pi$ from an arbitrary initial state $s_t$ is defined as follows:
$$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i},$$
where $\pi$ is the policy, $r$ is the immediate reward, and $\gamma$ is the discount factor.
The optimal policy $\pi^*$ with the maximum cumulative reward is given by
$$\pi^* = \arg\max_{\pi} V^{\pi}(s), \quad \forall s.$$
The maximum cumulative reward obtained by following the optimal policy from the current state is denoted as $V^*(s)$. The value of the Q function is defined as the immediate reward plus the discounted maximum cumulative reward of the succeeding state:
$$Q(s, a) = r(s, a) + \gamma V^*(s'),$$
where $\gamma$ is the discount factor between 0 and 1, $s'$ is the succeeding state reached by performing the action $a$ under the current state $s$, and $V^*(s')$ is closely related to $Q(s', a')$, as shown by
$$V^*(s') = \max_{a'} Q(s', a'),$$
where $a'$ is the optimal action under the state $s'$. The Q-values indexed on the agent's state and action are then updated at each step by
$$Q(s, a) = (1 - \alpha) Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right),$$
where $Q(s, a)$ is the value of the state-action pair $(s, a)$, $\alpha$ is a small learning rate, $r$ is the immediate reward, $\gamma$ is the discount factor, $a'$ is the action selected in $s'$, and $s'$ is the succeeding state reached by performing the action $a$ under the current state $s$.
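The tabular Q-value update can be sketched as follows. The dictionary-based Q-table and the function name `q_update` are illustrative choices. The defaults 0.3 and 0.95 are borrowed from the experimental settings on the assumption that they denote the learning rate and the discount factor, respectively.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.3, gamma=0.95):
    """Apply one Q-learning update to Q[(s, a)] and return the new value."""
    # Greedy estimate of the value of the succeeding state
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

Q = defaultdict(float)  # unseen state-action pairs start at 0
q_update(Q, "s0", 0, 0.2, "s1", range(8))
```

With an all-zero table, the first update simply moves `Q[("s0", 0)]` a fraction `alpha` of the way toward the immediate reward.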

Weighted Strategy Sharing.
Weighted strategy sharing (WSS) is an effective method of knowledge sharing for multirobot systems with communication ability [16]. It is assumed that a group of robots is learning in distinct environments and that their actions do not change the working environment of the others. The learning robot calculates the expertise value of the participating robots. In this method, the weights of each robot's expertise must be specified properly. The expertise of each robot is evaluated on the basis of the robot's location and sensor information. This strategy allows each learning robot to make a decision based on the expertise value of the other robots. The expertise is the embodiment of the knowledge and ability of the individual robots. The methods used to evaluate robot expertise fall into two categories. In the first, the learning system needs additional information to evaluate the expertise, and the relevant information has to be acquired continually in the process of learning. The reinforcement signal is usually calculated by this category of method. In the second, the learning system needs only the Q-values to evaluate the expertise, without additional information or prior knowledge. In practice, the method of expertise evaluation should be chosen according to the particular environment state.
The cooperation of robots can be implemented at different levels. The robots can share sensor information, joint actions, and reinforcement signals. The knowledge and expertise of the robots are all reflected in the Q-tables, so a learning policy based on sharing Q-tables can fully embody the collaboration among robots. The punishments obtained by learning robots have greater significance in the early stage of learning, whereas the rewards become more important as the cooperative behavior evolves. Therefore, the weight of punishments decreases with the learning time. The expertise value $e_i$ of the $i$th learning robot is computed from the immediate rewards $r_i(t)$ that the robot receives at each time step $t$ of each trial. The weight-assigning mechanism is that the learning robot only learns from more experienced robots.
The learning robot assigns different weights to the other robots based on their expertise values in the process of learning. The Q-values of the learning robot are updated based on the weighted mean of the Q-values of the other robots. The learning robot $i$ assigns a weight $W_{ij}$ to the knowledge and expertise of each other robot $j$; the weight depends on the number of robots $n$, the expertise value $e_i$ of the learning robot, and the expertise values $e_j$ of the other robots. The Q-values indexed on the robot's state and action are then updated at each step from this weighted mean, using a small learning rate $\alpha$ between 0 and 1, the succeeding state $s'$ reached by performing the action $a$ under the current state $s$, and the optimal action $a'$ under the state $s'$.
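The idea that a robot learns only from more expert teammates can be illustrated with the sketch below. The specific normalization, the `impressibility` parameter, and the function names are assumptions in the spirit of weighted strategy sharing [16]; they are not a formula taken from this paper.

```python
def wss_weights(expertise, i, impressibility=0.5):
    """Weights that robot i assigns to every robot's Q-table (including its own)."""
    e_i = expertise[i]
    # Only teammates that are more expert than robot i contribute.
    gaps = [max(e_j - e_i, 0.0) for e_j in expertise]
    total = sum(gaps)
    if total == 0:
        # Robot i is (one of) the most expert: it keeps its own knowledge.
        return [1.0 if j == i else 0.0 for j in range(len(expertise))]
    weights = [impressibility * g / total for g in gaps]
    weights[i] = 1.0 - impressibility
    return weights

def share_q(q_tables, weights):
    """Weighted mean of several Q-tables stored as dicts keyed by (state, action)."""
    keys = set().union(*(q.keys() for q in q_tables))
    return {k: sum(w * q.get(k, 0.0) for w, q in zip(weights, q_tables)) for k in keys}

w = wss_weights([1.0, 3.0, 2.0], i=0)  # robot 0 is the least expert
```

The weights always sum to 1, so the shared Q-value stays on the same scale as the individual Q-values.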

An Improved Reinforcement Learning Algorithm for Robots.
The state space grows exponentially with the number of team members when single-agent reinforcement learning is scaled up to the multirobot domain. In order to speed up the convergence, a sequential Q-learning algorithm based on knowledge sharing is proposed. In the sequential Q-learning algorithm, the robots learn the decision strategy one by one according to a predefined sequence. The rule repository of robot behaviors is first initialized in the process of reinforcement learning. Mobile robots obtain the present environmental state from their sensors. The state is then matched to determine whether the relevant behavior rule has been stored in the database. If the rule is present, an action is chosen in accordance with the knowledge and the rules, and the corresponding weight is refined. Otherwise the new rule, consisting of the state and the initial Q-value, is appended to the database. The robot then randomly chooses an action according to the initial behavior weight and continues the learning process. During reinforcement learning, the learning robot assigns a weight to each robot based on weighted strategy sharing. The behavior weight vector is refined on the basis of the weighted sum of teammate expertness. Figure 2 is the flow chart of the improved Q-learning algorithm for mobile robots. The sequential Q-learning algorithm based on knowledge sharing may be summarized as follows.
(4) If the multirobot system reaches a stable state or the learning system reaches the maximum time, the learning ends. Otherwise, go to Step (2).
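The sequential, shared-repository loop described in this section might be organized as in the sketch below. It is a simplified illustration under stated assumptions: the callbacks `sense`, `act`, and `reward` are hypothetical stand-ins for the simulator, the additive weight adjustment followed by renormalization is a placeholder for the paper's weight refinement, and expertise-based sharing is omitted for brevity.

```python
import random

def sequential_learning(robots, sense, act, reward, repo, n_actions=8,
                        alpha=0.3, max_steps=100, rng=None):
    """Robots learn one by one, in a fixed order, over a shared rule repository."""
    rng = rng or random.Random()
    for _ in range(max_steps):
        for robot in robots:  # predefined learning sequence
            state = tuple(sense(robot))
            if state not in repo:
                # New state: append a rule with equiprobable behaviours.
                repo[state] = [1.0 / n_actions] * n_actions
            weights = repo[state]
            action = rng.choices(range(n_actions), weights=weights, k=1)[0]
            act(robot, action)
            r = reward(robot, action)
            # Assumed refinement: shift the chosen weight by alpha * r, keep it
            # positive, then renormalise so the weights again sum to 1.
            weights[action] = max(weights[action] + alpha * r, 1e-6)
            total = sum(weights)
            repo[state] = [w / total for w in weights]
    return repo

# Toy usage: every robot always sees the same state and is rewarded for action 1.
last_action = {}
repo = sequential_learning(
    robots=["r1", "r2"],
    sense=lambda rb: [0, 1, 0, 0, 0, 0, 0, 0],
    act=lambda rb, a: last_action.__setitem__(rb, a),
    reward=lambda rb, a: 0.2 if a == 1 else -0.2,
    repo={}, max_steps=50, rng=random.Random(1),
)
```

Because both robots write to the same `repo`, each state receives several updates per step, which is the mechanism the text credits for the faster convergence.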

Environment Settings.
To illustrate the advantage of the proposed improved Q-learning algorithm, the implementation of a cooperative pursuit behavior is presented as an example. Cooperative pursuit is a very challenging task for a multirobot system, and performance on this task is significant for multirobot collaborative behavior. The predators in the task search for the preys using limited perception and local interaction among the robots, and they finally surround the preys. In the multirobot pursuit system twenty predators and four preys are randomly distributed in a 10 × 10 area. The red asterisk and the black dot denote the predator and the prey, respectively (see Figure 3). The task of the predators is to follow and surround the prey. A predator perceives the current state and chooses an appropriate action. The preys always strive to escape from being surrounded. If the state is $S = [1\ 0\ 1\ 1\ 0\ 1\ 0\ 1]$, the prey randomly chooses an action in the second, fifth, or seventh sector.
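The prey's escape rule in the example above amounts to picking uniformly among the unoccupied sectors. A small sketch follows; the function names are illustrative, and returning `None` when every sector is occupied is an assumption about the fully surrounded case.

```python
import random

def free_sectors(state):
    """Return the 1-based indices of sectors whose bit is 0 (unoccupied)."""
    return [i + 1 for i, bit in enumerate(state) if bit == 0]

def prey_action(state, rng=random):
    """Pick one free sector uniformly at random, or None if the prey is surrounded."""
    options = free_sectors(state)
    return rng.choice(options) if options else None

print(free_sectors([1, 0, 1, 1, 0, 1, 0, 1]))  # [2, 5, 7]
```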
In our demonstration we set up the parameters of the learning process in the following way. The sensing radius of the predators is 1. The maximum distance a predator can move in each step is 1. The minimum distance between robots is 0.6. The reinforcement learning parameters are chosen as $\alpha = 0.3$ and $\gamma = 0.95$. The exploration probability $\varepsilon$ is initialized at 0.5 and decays as the cooperative behaviors evolve. At each learning step, the predator chooses an action according to its current state and receives a reward of 0.2 for actions that approach a prey, −0.2 for actions that move away from a prey, and 0 otherwise.
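The reward scheme above can be written as a short function. Measuring "approaching" by the change in Euclidean distance to the prey is an assumption, as is the geometric decay schedule for the exploration probability, since the text gives only its initial value of 0.5.

```python
import math

def step_reward(old_pos, new_pos, prey_pos, tol=1e-9):
    """Return 0.2 if the move reduces the distance to the prey, -0.2 if it increases it, else 0."""
    d_old = math.dist(old_pos, prey_pos)
    d_new = math.dist(new_pos, prey_pos)
    if d_new < d_old - tol:
        return 0.2
    if d_new > d_old + tol:
        return -0.2
    return 0.0

def explore_probability(trial, eps0=0.5, decay=0.95):
    """Assumed schedule: exploration probability shrinks geometrically per trial."""
    return eps0 * decay ** trial

print(step_reward((0.0, 0.0), (0.5, 0.0), (1.0, 0.0)))  # 0.2
```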

Results and Analysis.
To compare the performance of conventional Q-learning and the improved Q-learning algorithm, experiments were run with each algorithm in the same work space. Figure 3 shows the trajectories of the predators during the first twenty trials under the two learning policies. Figure 3(a) shows the trajectories of the predators during the first twenty trials with the sequential Q-learning algorithm. The experimental results demonstrate that most of the predators search randomly in the early stage of learning because they have not found a target. Only the individual predators that find the preys can track the target. [...] the improved Q-learning algorithm. Therefore, the improved Q-learning algorithm, being based on knowledge sharing, is highly efficient in the early stage of learning. When the knowledge base and the rules repository are both in the initial state, the predator has no information about the prey in the learning process and searches for the prey randomly. Once any of the predators finds the object, all predators follow the prey based on knowledge sharing and completely surround it. Figure 3(c) shows a final state of the following-surrounding behaviors.
Theoretically, there are $2^N$ rules in the rules repository ($N$ is the number of sensing sectors), but the robots encounter only a few of these states. Most of the robots cannot find the target in the initial stage of learning. The perceived state is $\mathrm{Stimulus}_0 = [0\ 0\ 0\ 0\ 0\ 0\ 0\ 0]$. Most of the predators move randomly because they have no information about the preys (see Figure 4). Once a predator detects the prey, all of them track it on the basis of knowledge sharing. As learning goes on, more and more predators find the target, and the system reaches an equilibrium state quickly. Figure 4(a) shows the evolution of the number of searching predators under the improved Q-learning. When the database of the robots is in the initial state, the predators search randomly. The number of searching predators gradually diminishes until it becomes zero. Once all of the predators have found the preys, the system reaches an equilibrium state quickly. Figure 4(b) shows the evolution of the number of searching predators under conventional Q-learning. In contrast with the improved Q-learning, the conventional Q-learning converges more slowly.
In most cases predators find the target in one or two sectors. Figure 5(a) shows the evolution of the number of predators that find the target in the first sector and the corresponding weight. By analyzing the simulation, we find that when the predators find the target in the first sector, they choose an action in the first or eighth sector. Finally the predators follow the target in the direction of the first sector. Figure 5(b) shows the evolution of the number of predators that find the target in the fourth sector and the corresponding weight. When the predators find the target in the fourth sector, they choose an action in the fifth or fourth sector. Finally they follow the target in the direction of the fourth sector.
The sequential Q-learning algorithm based on knowledge sharing clearly reduces the complexity of traditional Q-learning. The same rule is updated many times in one trial of the learning, so the improved algorithm speeds up the convergence efficiently. However, the predators and preys are randomly distributed in the workspace in the initial state, and the predators search randomly when they have no shared knowledge. Therefore, the execution time of each cooperative task is random. To test the effectiveness and the performance of our approach, each experiment was repeated 50 times under the same conditions. Figure 6 shows the execution time under the different Q-learning algorithms. It is the randomness in the system that makes the execution time of the predators differ from task to task. The simulation shows that the improved algorithm has higher efficiency. The improvement is chiefly due to the fact that the predators act cooperatively based on the shared knowledge.

Conclusion
When traditional Q-learning is performed in the multirobot domain, the state space grows exponentially in the number of team members. To speed up the convergence and reduce the complexity of the traditional reinforcement learning algorithm, we propose a sequential Q-learning algorithm based on knowledge sharing. The learning system initializes the rules repository and shares the knowledge base. The learning robots perceive the current state in turn, according to a predefined sequence. If the same state already exists in the rules repository, the robot chooses an action on the basis of the knowledge base and the rules repository, and the corresponding weight vector is updated based on reinforcement learning. Otherwise the learning robot chooses an appropriate action according to the state transition probability function, and the new rule is appended to the rules repository. The behavior weight vector is then refined on the basis of the weighted sum of teammate expertness. The sequential Q-learning algorithm based on knowledge sharing is performed in a subspace and improves the learning efficiency. The validity of this method is tested via simulation experiments.

Figure 1: The state model of robot.

Figure 2: The flow chart of improved Q-learning algorithm for mobile robots.

Figure 3: The evolution of the multirobot following-surrounding behavior.

Figure 4: The evolution of the number of the searching predators.

Figure 5: The learning process of partial state-action pairs.

Figure 6: The execution time under different Q-learning algorithms (legend: the improved Q-learning; the sequential Q-learning).