Multiagent Reinforcement Learning with Regret Matching for Robot Soccer

This paper proposes a novel multiagent reinforcement learning (MARL) algorithm, Nash-Q learning with regret matching, in which regret matching is used to speed up the well-known MARL algorithm Nash-Q learning. Choosing a suitable action selection strategy that balances exploration and exploitation is critical to the online learning ability of Nash-Q learning. In a Markov game, the joint action of agents adopting the regret matching algorithm converges to a set of no-regret points that can be viewed as coarse correlated equilibria, a set that includes the Nash equilibria. It can therefore be inferred that regret matching guides exploration of the state-action space so that the convergence rate of the Nash-Q learning algorithm is increased. Simulation results on robot soccer show that, compared to the original Nash-Q learning algorithm, using regret matching during the learning phase of Nash-Q learning gives better online learning and improved performance in terms of scores, average reward, and policy convergence.


Introduction
Multi-robot systems (MRS) have received increasing attention because of their broad application prospects. Research platforms for MRS include formation [1], foraging [2], prey-pursuing [3, 4], and robot soccer [5][6][7]. Robot soccer involves robot architecture, cooperation, decision making, planning, modeling, learning, vision tracking, sensing, and communication, and thus exhibits all the key features of MRS. The robot soccer system is therefore used as a test benchmark in this paper [8].
Although reinforcement learning (RL), for example, Q-learning [9][10][11], can be applied directly to MRS for decision making, doing so violates the static environment assumption of the Markov decision process (MDP) [12]. In MRS, the action selection of the learning robot is unavoidably affected by the actions of other agents, so multiagent reinforcement learning (MARL), which involves joint states and joint actions, is a more suitable and promising method for MRS [13][14][15][16].
MARL based on the stochastic game (SG), also called the Markov game (MG), provides a solid theoretical foundation for MRS and has developed several branches, such as Minimax-Q learning [17], Nash-Q learning [18], FF-Q learning [19], and CE-Q learning [20]. Agents adopting these algorithms are also called equilibrium learners [17, 20, 21]; equilibrium learning is one way of handling the loss of stationarity of the MDP. These algorithms learn joint action values, which are stationary, and in certain circumstances guarantee that these values converge to Nash equilibrium (NE) values [22] or correlated equilibrium (CE) values. Using these values, each agent's policy corresponds to its component of some Nash or correlated equilibrium [23]. Based on the fundamental solution concept of NE for MG, the Nash-Q learning algorithm, which finds an NE at each state in order to obtain NE policies for Q value updating, is therefore an effective and typical MARL method.
In the single-agent learning scenario, Q-learning is guaranteed to converge to the optimal action independent of the action selection strategy. In a multiagent setting, however, the action selection policy becomes crucial for convergence to any joint action. A major challenge in defining a suitable action selection strategy is to strike a balance between exploring the usefulness of actions that have been attempted only a few times and exploiting those for which the agents' confidence in obtaining a high reward is relatively strong. This is known as the exploration and exploitation problem [24].
Regret matching can better balance exploration and exploitation. Regret has been studied both in game theory [25] and in computer science [26, 27]. Regret measures how much worse an algorithm performs compared with the best static strategy, with the goal of guaranteeing at least zero average regret [23]. Regret matching [25], which belongs to the family of no-regret algorithms, guarantees that the joint action asymptotically converges to a set of no-regret points that can be referred to as coarse correlated equilibria in MG [28]. Because every Nash equilibrium is also a coarse correlated equilibrium [28], it can be inferred that regret matching, which leads the joint action to points of coarse correlated equilibrium, can effectively improve the convergence rate of the original Nash-Q learning algorithm.
This paper is organized as follows. Section 2 reviews multiagent reinforcement learning and the Nash-Q learning algorithm. Section 3 briefly describes the regret matching algorithm and then shows how to incorporate regret matching into the original Nash-Q learning algorithm. Section 4 describes the reinforcement learning structure of the soccer robot. Section 5 presents a simulation of our algorithm in robot soccer. Section 6 draws conclusions and summarizes the main points of this paper.

Multiagent Reinforcement Learning and Nash-Q Learning
2.1. Markov Game. A Markov game (MG) can be viewed as an extension of the MDP to multiagent environments [29, 30], where all agents select their actions simultaneously. The reward that each agent receives depends on the joint action of all agents and on the current state, and state transitions obey the Markov property [31]. MG is the theoretical foundation of MARL, and Figure 1 shows the architecture.
A reinforcement learning framework of MG can be defined as follows.
An $n$-agent MG $\Gamma$ is a tuple $\langle n, S, A_1, \ldots, A_n, T, r_1, \ldots, r_n \rangle$, where $n$ is the number of agents, $S$ is the state space, $A_i$ is the action space of agent $i$ ($i = 1, \ldots, n$), $T : S \times A_1 \times \cdots \times A_n \to \Delta(S)$ is the transition function, which depends on the actions of all agents, $\Delta(S)$ is the set of probability distributions over the state space $S$, and $r_i : S \times A_1 \times \cdots \times A_n \to \mathbb{R}$ is the reward function for agent $i$, which also depends on the actions of all agents. Given state $s$, the agents independently choose their actions $a_1, \ldots, a_n$ and then receive rewards $r_i(s, a_1, \ldots, a_n)$, $i = 1, \ldots, n$. The next state $s'$ arrives after the joint action $(a_1, \ldots, a_n)$ is taken at state $s$, according to fixed transition probabilities that satisfy

$$\sum_{s' \in S} T(s, a_1, \ldots, a_n, s') = 1, \tag{1}$$

where $T(s, a_1, \ldots, a_n, s')$ denotes the probability of reaching $s'$. In a discounted MG, the objective of each agent is to maximize the discounted sum of rewards with discount factor $\gamma \in [0, 1)$. Denote by $\pi_i$ the strategy of agent $i$. For a given initial state $s$, agent $i$ tries to maximize

$$v_i(s, \pi_1, \ldots, \pi_n) = \sum_{t=0}^{\infty} \gamma^t E\left(r_i^t \mid \pi_1, \ldots, \pi_n, s_0 = s\right). \tag{2}$$
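The two conditions above, that transition probabilities sum to one and that each agent maximizes a discounted sum of rewards, can be illustrated with a minimal Python sketch. The function names `is_valid_transition` and `discounted_return` and the dictionary representation of a distribution over next states are assumptions made for this illustration, not part of the original formulation; a finite reward sequence stands in for the infinite sum.

```python
from typing import Dict, Sequence


def is_valid_transition(next_state_probs: Dict[str, float], tol: float = 1e-9) -> bool:
    """Check that T(s, a_1..a_n, .) is a probability distribution over next states."""
    nonnegative = all(p >= 0.0 for p in next_state_probs.values())
    sums_to_one = abs(sum(next_state_probs.values()) - 1.0) < tol
    return nonnegative and sums_to_one


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Finite-horizon version of the discounted sum: sum over t of gamma^t * r_t."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


if __name__ == "__main__":
    # Hypothetical two-state transition row and a short reward sequence.
    print(is_valid_transition({"s0": 0.3, "s1": 0.7}))
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

The expectation in the value function would be estimated by averaging such returns over sampled trajectories generated under the joint strategy.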

Comparison of Existing Algorithms. The traditional Q-learning algorithm [9] for computing an optimal policy in an MDP with unknown reward and transition functions is as follows:

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ r(s, a) + \gamma V(s') \right], \qquad V(s) \leftarrow \max_{a} Q(s, a). \tag{3}$$

The simplest way to extend this to the multiagent MG setting is to add a subscript $i$ to the formulation above and to let the $Q$ values depend on the joint action of all agents. $V$ is then updated with the outcome computed from the $Q$ values by the respective algorithm. The Minimax-Q learning algorithm, the first MARL algorithm, extends traditional Q-learning to the two-player zero-sum MG. In Minimax-Q learning, $V$ is updated with the minimax of the $Q$ values:

$$V_1(s) \leftarrow \max_{P_1 \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} P_1(a_1) Q_1(s, (a_1, a_2)). \tag{4}$$

The policy used in the Minimax-Q learning algorithm guarantees that the agent receives the largest value possible in the absence of knowledge of the opponent's policy. Hu and Wellman [21] extended the Minimax-Q algorithm to the $n$-player general-sum MG. The extension requires that each agent maintain $Q$ values for all of the agents, and the linear programming solution used to find the equilibrium of zero-sum games is replaced by a quadratic programming solution for finding an equilibrium in $n$-player general-sum games. Nash-Q updates the $V$ values based on some NE in the game defined by the $Q$ values:

$$V_i(s) \leftarrow \mathrm{Nash}_i\left(Q_1(s, a), \ldots, Q_n(s, a)\right), \tag{5}$$

where $Q_i(s, a)$ denotes the payoff matrix of player $i$ and $\mathrm{Nash}_i$ denotes the Nash payoff to that player. Since Nash-Q is in essence limited to zero-sum and common-payoff games, Littman reinterpreted it as the Friend-or-Foe-Q (FF-Q) learning framework [19]. Although FF-Q can be applied in multiplayer scenarios, for simplicity we show how the $V$ values are updated in a two-player game:

$$\text{Friend:} \quad V_1(s) \leftarrow \max_{a_1 \in A_1, a_2 \in A_2} Q_1(s, (a_1, a_2)), \qquad \text{Foe:} \quad V_1(s) \leftarrow \max_{P_1 \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} P_1(a_1) Q_1(s, (a_1, a_2)). \tag{6}$$

Thus Friend-Q updates $V$ similarly to regular Q-learning, and Foe-Q updates it as Minimax-Q does. The algorithms above all depend on some method of computing the NE for the matrix game defined by the $Q$ values of all players in each state. The value to each player of a mutually agreed-on equilibrium is the value function used in the $Q$ update process. Instead of computing Nash equilibria of the stage games defined by the $Q$ values, the agent can compute other solution concepts. One option is to compute the CE. This is the technique used by Greenwald and Hall in the unambiguously named Correlated-Q (CE-Q) algorithm [20]. A CE is more general than an NE, since it allows dependencies among the agents' probability distributions while maintaining the property that agents are optimizing. Compared to an NE, a CE can be computed easily via linear programming. CE-Q learning is similar to Nash-Q but instead uses the value of a correlated equilibrium to update $V$:

$$V_i(s) \leftarrow \mathrm{CE}_i\left(Q_1(s, a), \ldots, Q_n(s, a)\right). \tag{7}$$

Like Nash-Q, it requires agents to select a unique equilibrium, an issue that the authors address explicitly by suggesting several possible selection mechanisms.
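The difference between these algorithms lies only in how the state value is computed from the stage-game Q values. The following Python sketch shows the shared Q update together with a Friend-Q value and a Foe-Q (minimax) value. It is illustrative only: the function names are hypothetical, the game is restricted to two players with two actions each, and the minimax value uses the closed-form solution of a 2x2 zero-sum game instead of the linear program that a general implementation would need.

```python
from typing import List

# q[a1][a2] is agent 1's Q value for joint action (a1, a2) in one state.
Matrix = List[List[float]]


def q_update(q_old: float, reward: float, v_next: float, alpha: float, gamma: float) -> float:
    """Shared update: Q <- (1 - alpha) * Q + alpha * (r + gamma * V(s'))."""
    return (1.0 - alpha) * q_old + alpha * (reward + gamma * v_next)


def friend_value(q: Matrix) -> float:
    """Friend-Q: the other agent is assumed to help, so maximize over joint actions."""
    return max(max(row) for row in q)


def foe_value_2x2(q: Matrix) -> float:
    """Foe-Q / Minimax-Q value of a 2x2 zero-sum stage game (row player maximizes)."""
    (a, b), (c, d) = q
    maximin = max(min(a, b), min(c, d))   # best pure-strategy guarantee for the row player
    minimax = min(max(a, c), max(b, d))   # best pure-strategy guarantee for the column player
    if maximin == minimax:
        return maximin                     # saddle point in pure strategies
    # No saddle point: value of the mixed-strategy equilibrium.
    return (a * d - b * c) / (a + d - b - c)


if __name__ == "__main__":
    matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
    print(friend_value(matching_pennies), foe_value_2x2(matching_pennies))
```

For the matching-pennies payoffs used in the example, the friend value is the largest entry while the foe value is the mixed-strategy value of the game, which shows how strongly the assumed opponent model changes the target of the Q update.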

Nash-Q Learning.
The following is based on [18]. Extending Q-learning to the multiagent learning domain with the NE concept, the Nash-Q equilibrium value is defined as the expected sum of discounted rewards when all agents follow specified Nash equilibrium strategies from the next period on. The literature usually uses the terms policy and strategy interchangeably. A Nash equilibrium is a joint strategy in which each agent's strategy is a best response to the others' strategies.
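In MG $\Gamma$, a Nash equilibrium point is a tuple of $n$ strategies $(\pi_1^*, \ldots, \pi_n^*)$ such that, for all $s \in S$ and $i = 1, \ldots, n$,

$$v_i(s, \pi_1^*, \ldots, \pi_n^*) \ge v_i(s, \pi_1^*, \ldots, \pi_{i-1}^*, \pi_i, \pi_{i+1}^*, \ldots, \pi_n^*) \quad \text{for all } \pi_i \in \Pi_i, \tag{8}$$

where $\Pi_i$ is the set of strategies available to agent $i$. $Q_i^*$ is defined as a Nash-Q function for agent $i$, and $Q_i^*(s, a_1, \ldots, a_n)$ is called the Nash-Q equilibrium value. The Nash-Q function of agent $i$ is defined over $(s, a_1, \ldots, a_n)$ as the sum of agent $i$'s current reward and its future rewards when all agents follow the joint NE strategy. That is,

$$Q_i^*(s, a_1, \ldots, a_n) = r_i(s, a_1, \ldots, a_n) + \gamma \sum_{s' \in S} T(s, a_1, \ldots, a_n, s') \, v_i(s', \pi_1^*, \ldots, \pi_n^*), \tag{9}$$

where $(\pi_1^*, \ldots, \pi_n^*)$ is the joint Nash equilibrium strategy, $r_i(s, a_1, \ldots, a_n)$ is agent $i$'s one-stage reward in state $s$ under joint action $(a_1, \ldots, a_n)$, and $v_i(s', \pi_1^*, \ldots, \pi_n^*)$ is agent $i$'s total discounted reward over infinite periods starting from state $s'$, given that the agents follow the equilibrium strategies.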
In the case of multiple equilibria, different NE strategy profiles may yield different Nash-Q functions. In this paper, the learning agent picks the NE that yields the highest expected payoff to the agents as a whole. The learning agent, indexed by $i$, learns its $Q$ values by forming an arbitrary guess at time 0. One simple guess is to let $Q_i^0(s, a_1, \ldots, a_n) = 0$ for all $s \in S$, $a_1 \in A_1, \ldots, a_n \in A_n$. At each time $t$, agent $i$ observes the current state and then takes its action. After the actions are taken, agent $i$ observes its own reward, the actions taken by all other agents, the others' rewards, and the new state $s'$. It then calculates a Nash equilibrium $\pi_1(s') \cdots \pi_n(s')$ and updates its $Q$ values according to

$$Q_i^{t+1}(s, a_1, \ldots, a_n) = (1 - \alpha_t) Q_i^t(s, a_1, \ldots, a_n) + \alpha_t \left[ r_i^t + \gamma \, \mathrm{Nash}Q_i^t(s') \right], \tag{10}$$

where

$$\mathrm{Nash}Q_i^t(s') = \pi_1(s') \cdots \pi_n(s') \cdot Q_i^t(s'), \tag{11}$$

$\alpha_t \in (0, 1)$ is the learning rate, and $\mathrm{Nash}Q_i^t(s')$ is defined in (11). Let $t := t + 1$.
To obtain the NE $\pi_1(s') \cdots \pi_n(s')$, agent $i$ needs to know $Q_1^t(s'), \ldots, Q_n^t(s')$. Agent $i$ should have conjectures about those $Q$-functions at the beginning of play. As the game proceeds, agent $i$ observes the other agents' immediate rewards and previous actions. That information can then be used to update agent $i$'s conjectures about the other agents' $Q$-functions. Agent $i$ updates its beliefs about agent $j$'s $Q$-function according to the same updating rule (10) that it applies to its own:

$$Q_j^{t+1}(s, a_1, \ldots, a_n) = (1 - \alpha_t) Q_j^t(s, a_1, \ldots, a_n) + \alpha_t \left[ r_j^t + \gamma \, \mathrm{Nash}Q_j^t(s') \right]. \tag{12}$$

Note that $\alpha_t = 0$ for $(s, a_1, \ldots, a_n) \ne (s_t, a_1^t, \ldots, a_n^t)$. Therefore (12) does not update all the entries in the $Q$-functions. It updates only the entry corresponding to the current state and the actions chosen by the agents. This type of updating is called asynchronous updating [18].
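The asynchronous update described above can be sketched in Python for the two-agent case. This is a simplified illustration under stated assumptions: the equilibrium strategies `pi1` and `pi2` at the next state are taken as given (computing them requires a separate equilibrium solver that is not shown), the Q table is a dictionary from state to a payoff matrix, and the names `nash_value` and `nash_q_update` are hypothetical.

```python
from typing import Dict, List, Sequence

# q_table[state][a1][a2] holds one agent's Q value for the joint action (a1, a2).
QTable = Dict[str, List[List[float]]]


def nash_value(pi1: Sequence[float], pi2: Sequence[float], q_next: List[List[float]]) -> float:
    """NashQ(s') = pi1(s') * pi2(s') * Q(s'): expected payoff under the equilibrium strategies."""
    return sum(
        pi1[a1] * pi2[a2] * q_next[a1][a2]
        for a1 in range(len(pi1))
        for a2 in range(len(pi2))
    )


def nash_q_update(q_table: QTable, state: str, a1: int, a2: int, reward: float,
                  next_state: str, pi1: Sequence[float], pi2: Sequence[float],
                  alpha: float, gamma: float) -> float:
    """Update only the entry for the visited state and joint action (asynchronous update)."""
    target = reward + gamma * nash_value(pi1, pi2, q_table[next_state])
    q_table[state][a1][a2] = (1.0 - alpha) * q_table[state][a1][a2] + alpha * target
    return q_table[state][a1][a2]


if __name__ == "__main__":
    # Hypothetical two-state table; Q values for state "s" start from the zero guess.
    table = {"s": [[0.0, 0.0], [0.0, 0.0]], "s_next": [[2.0, 0.0], [0.0, 2.0]]}
    new_q = nash_q_update(table, "s", 0, 1, reward=1.0, next_state="s_next",
                          pi1=[0.5, 0.5], pi2=[0.5, 0.5], alpha=0.5, gamma=0.9)
    print(new_q, table["s"])
```

In a full learner, agent $i$ would keep one such table per agent and apply the same function to each of them with the corresponding observed reward, which matches the belief update for the other agents' Q-functions.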

Regret Matching Algorithm for Action Selection
Observing how humans handle problems, we find that people often reflect on how much they regret a decision they have made. By reflecting on past actions and feeling regret, a person can gain experience, find improved actions in a complicated environment, and learn more efficiently. Regret enables a person to obtain a better policy and to make progress quickly. If all members of a community adopt this idea, the joint action will bring each of them a good reward. Based on this notion, no-regret learning algorithms have been proposed and have been widely studied and applied in multiagent learning. No-regret learning algorithms form a family of algorithms which guarantee that the joint action converges asymptotically to a set of no-regret points, also called coarse correlated equilibria [32]. A no-regret point represents a case in which the average reward that an agent actually obtained is at least as much as the average reward the agent "would have" obtained had it used a different fixed strategy at all previous time steps [28]. Figure 2 shows that the Nash equilibria are contained in both the correlated equilibria and the coarse correlated equilibria. In other words, convergence to an NE point also implies convergence to a coarse correlated equilibrium point (no-regret point).
The prominent feature of regret matching [25], as a branch of no-regret learning algorithms, is that, compared to other learning algorithms such as fictitious play [33], it can be easily applied in large-scale MRS [28]. A detailed description of regret matching can be found in [25]. A new algorithm, Nash-Q learning with regret matching, is proposed here to increase the rate of convergence in MG. In the proposed algorithm, regret matching is used to select the action in each state so as to increase the convergence rate toward the Nash equilibrium policy.
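The nesting of the three equilibrium concepts can be made explicit with their standard definitions for a single stage game. The block below is a sketch in LaTeX; the symbols $z$ (a probability distribution over joint actions) and $a_{-i}$ (the actions of all agents except $i$), and the use of $r_i$ for the stage reward, are notational assumptions introduced for this illustration.

```latex
% z is a probability distribution over joint actions a = (a_i, a_{-i}) in the stage game at state s.

% Coarse correlated equilibrium (no-regret point):
% for every agent i and every fixed alternative action a_i' in A_i,
\sum_{a} z(a)\, r_i(s, a) \;\ge\; \sum_{a} z(a)\, r_i(s, a_i', a_{-i})

% Correlated equilibrium:
% for every agent i and every pair of actions a_i, a_i' in A_i,
\sum_{a_{-i}} z(a_i, a_{-i}) \bigl[ r_i(s, a_i, a_{-i}) - r_i(s, a_i', a_{-i}) \bigr] \;\ge\; 0

% Nash equilibrium: a correlated equilibrium in which z is a product of independent strategies,
z(a) = \prod_{i=1}^{n} \pi_i(a_i)
```

Summing the correlated equilibrium condition over $a_i$ for a fixed $a_i'$ yields the coarse correlated equilibrium condition, so every NE is a CE and every CE is a coarse CE.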
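Equation (13) shows that the average regret of agent $i$ for action $a_i \in A_i$ represents the average improvement in its reward if it had chosen $a_i$ in all past steps while all other agents' actions had remained unchanged up to time $t$:

$$R_i^{a_i}(s, t) = \frac{1}{t} \sum_{\tau=0}^{t-1} \left[ r_i(s, a_i, a_{-i}(\tau)) - r_i(s, a(\tau)) \right], \tag{13}$$

where $a(\tau)$ is the joint action taken at step $\tau$ and $a_{-i}(\tau)$ denotes the actions of all agents other than $i$. Under regret matching, each agent $i$ computes $R_i^{a_i}(s, t)$ for every action $a_i \in A_i$ using the following iterative equation:

$$R_i^{a_i}(s, t+1) = \frac{t}{t+1} R_i^{a_i}(s, t) + \frac{1}{t+1} \left[ r_i(s, a_i, a_{-i}(t)) - r_i(s, a(t)) \right]. \tag{14}$$

Note that at each time step $t > 0$, agent $i$ updates all entries of its average regret vector $R_i(s, t) = [R_i^{a_i}(s, t)]_{a_i \in A_i}$. After agent $i$ has computed its average regret vector $R_i(s, t)$, action $a_i(s, t)$ is selected according to the probability distribution $p_i(t)$, as shown in the following equation:

$$p_i(t) = \frac{[R_i(s, t)]^+}{\mathbf{1}^T [R_i(s, t)]^+} \quad \text{if } \mathbf{1}^T [R_i(s, t)]^+ > 0, \tag{15}$$

where $[\cdot]^+$ denotes the componentwise positive part; otherwise $p_i(t)$ is the uniform distribution over $A_i$. In other words, an agent using regret matching selects a particular action at any time step with probability proportional to the positive part of the average regret for not having selected that action in past time steps.
If all agents of one team adopt the regret matching algorithm for robot soccer, then the joint action converges asymptotically to a set of coarse CE points. It can therefore be inferred that regret matching can effectively improve the convergence rate of the original Nash-Q learning, which is validated by the simulation below.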
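The regret matching rule described above can be sketched as a small Python class that keeps one average regret vector for a single state. This is a minimal illustration, not the authors' implementation: the class name `RegretMatcher` is hypothetical, and the caller is assumed to supply the reward each alternative action would have earned with the other agents' actions held fixed (for example, read from the learned Q values).

```python
import random
from typing import List, Sequence


class RegretMatcher:
    """Average-regret bookkeeping and action selection for one agent in one state."""

    def __init__(self, n_actions: int) -> None:
        self.n_actions = n_actions
        self.t = 0                                  # number of updates so far
        self.avg_regret = [0.0] * n_actions

    def update(self, counterfactual_rewards: Sequence[float], received_reward: float) -> None:
        """Incremental average: R(t+1) = t/(t+1) * R(t) + 1/(t+1) * (r(a_i, a_-i) - r(a))."""
        self.t += 1
        for a in range(self.n_actions):
            instant = counterfactual_rewards[a] - received_reward
            self.avg_regret[a] += (instant - self.avg_regret[a]) / self.t

    def distribution(self) -> List[float]:
        """Probabilities proportional to positive average regret, uniform if none is positive."""
        positive = [max(r, 0.0) for r in self.avg_regret]
        total = sum(positive)
        if total <= 0.0:
            return [1.0 / self.n_actions] * self.n_actions
        return [p / total for p in positive]

    def sample(self, rng: random.Random) -> int:
        """Draw an action index according to the regret matching distribution."""
        return rng.choices(range(self.n_actions), weights=self.distribution())[0]


if __name__ == "__main__":
    matcher = RegretMatcher(n_actions=2)
    matcher.update(counterfactual_rewards=[1.0, 3.0], received_reward=1.0)
    matcher.update(counterfactual_rewards=[4.0, 1.0], received_reward=1.0)
    print(matcher.avg_regret, matcher.distribution(), matcher.sample(random.Random(0)))
```

A learner would keep one such object per visited state and call `sample` in place of random action selection, then call `update` once the joint action and rewards have been observed.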

Environment States and Joint Action of Robot.
Robot soccer is a very challenging and interesting domain for applying machine learning algorithms to real-world problems. Research groups have applied many different machine learning approaches to many facets of autonomous soccer-playing MRS [34].
Behavior-based (action-based) approaches are well suited to soccer because they outperform deliberative control in uncertain and dynamic environments. Behavior design for the robot (agent) soccer team is based on the following two characteristics. First, points are scored by kicking the ball into the opponent team's goal. Second, robots should avoid kicking the ball in the wrong direction, lest they score against their own team [35].
In this paper, environment states represented in Figure 3 are used to activate the robot.For simplicity each team is composed of three agents (players) as shown in Figure 5.
Based on [36], a motor schema-based reactive control system is used for action design, in which each agent is provided with three preprogrammed actions (behavior assemblages) that correspond to steps in achieving the task, as shown in Table 1. These actions are in turn composed of more primitive behaviors called motor schemas. Several motor schemas are described as follows.
Move to kickspot: high gain to draw the robot to a point one-half of a robot radius behind the ball. If the robot bumps the ball from that location, the ball is propelled in the direction of the opponent's goal.
Avoid teammates: gain sufficiently high to keep the robots on the team spread apart.
Move to halfway point: high gain to draw the robot to a point halfway between the ball and the defended goal.
Swirl ball: a ball-dodging vector with gain sufficiently high to keep the robots from colliding with the ball.
Move to defended goal: high gain to draw the robot to the defended goal [35].
The shoot ball action is shown in Figure 4. Analogously to Figure 4, the chase ball action is composed of three primitive schemas: move to halfway point, swirl ball, and avoid teammates. The goal keeping action is composed of two primitive schemas: move to defended goal and move to kickspot.
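How an action is assembled from motor schemas can be sketched in Python. The sketch assumes the usual motor-schema convention that each schema outputs a 2D vector and that an assemblage is the gain-weighted sum of its schema outputs; the text does not state the combination rule or the gain values, so the function names, the gains, and the coordinates below are all hypothetical.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float]


def unit(vx: float, vy: float) -> Vec:
    """Normalize a 2D vector; the zero vector is returned unchanged."""
    norm = math.hypot(vx, vy)
    return (0.0, 0.0) if norm == 0.0 else (vx / norm, vy / norm)


def kickspot(ball: Vec, opponent_goal: Vec, robot_radius: float) -> Vec:
    """Point one-half of a robot radius behind the ball, on the line toward the opponent's goal."""
    dx, dy = unit(opponent_goal[0] - ball[0], opponent_goal[1] - ball[1])
    offset = 0.5 * robot_radius
    return (ball[0] - dx * offset, ball[1] - dy * offset)


def move_to(robot: Vec, target: Vec) -> Vec:
    """Schema output: unit vector pointing from the robot toward the target."""
    return unit(target[0] - robot[0], target[1] - robot[1])


def blend(schema_outputs: List[Tuple[float, Vec]]) -> Vec:
    """Assumed assemblage rule: gain-weighted vector sum of the schema outputs."""
    x = sum(gain * vec[0] for gain, vec in schema_outputs)
    y = sum(gain * vec[1] for gain, vec in schema_outputs)
    return (x, y)


if __name__ == "__main__":
    ball, goal, robot = (0.0, 0.0), (1.0, 0.0), (-1.0, 0.5)
    spot = kickspot(ball, goal, robot_radius=0.2)
    # Hypothetical goal-keeping style blend of two move-to schemas with made-up gains.
    command = blend([(1.0, move_to(robot, spot)), (0.5, move_to(robot, (-2.0, 0.0)))])
    print(spot, command)
```

Bumping the ball from the computed kickspot pushes it along the line toward the opponent's goal, which is the geometric reason for placing the spot behind the ball.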

Reward Function for Soccer.
As an instant evaluator, the reward function for the action taken at a given state is important for reinforcement learning. Global reinforcement [35] refers to the case where a single reinforcement signal is delivered simultaneously to all robots. A potential problem with global reinforcement is ambiguity in credit: a robot may be rewarded merely because it happened to be near the goal while another soccer robot kicked the ball for a score from a distance. Two important factors should therefore be considered, time and distance, and the reward function $r_i(s, a)$ for agent $i$ is modified from global reinforcement as defined in (16), where $s$ denotes the soccer robot's state, $a$ represents the joint action of all agents $(a_1, \ldots, a_n)$, touch is the time in milliseconds since the soccer robot last touched the ball, and $d$ represents the distance in meters between the ball and the robot. The decay parameter, which varies between 0.5 and 1 and indicates how quickly a potential reward should decay after the ball is touched, is set to 0.7 in this paper.

Table 1

| Action | Robot activity |
|---|---|
| Shoot ball | If the robot is close to the ball and the goal, this action is used to shoot the ball. |
| Chase ball | When the robot is far away from the ball, this action is given to go after the ball. |
| Goal keeping | A robot playing as goal keeper gets this action to prevent losing a point. |

Simulation
TeamBots, shown in Figure 5, is a Java-based collection of application programs and Java packages for multiagent mobile robotics research, in which the control system of a robot interacts with a well-defined sensor-actuator interface. The simulation proceeds in discrete steps. In each step the robots process their sensor data and then issue appropriate actuator commands. The simulation models physical interactions, including robot, ball, and wall collisions, sensors, and motor-driven actuators [24]. Two teams of soccer robots, A and B, are designed, and each team is composed of three agents. Team A adopts Nash-Q learning with regret matching, and Team B uses the original Nash-Q learning, which learns Nash-Q equilibrium values with a random action selection strategy. When a goal is scored, the ball is returned to the center of the field without repositioning the agents, and the match goes on. Historical data, including scores, average reward, and the average number of policy changes, are saved as the match proceeds. The agents preserve the learned Nash-Q values between matches. No time limit is imposed: a match ends only when a total of 10 points have been scored. The simulation is composed of 100 10-point matches. The robots of Team A and Team B adopt the same reward function, as shown in (16).
At the beginning, the $Q$ values of all robots were initialized to zero. An important performance measure for robot soccer is the score difference $D$:

$$D = S_{\text{teamA}} - S_{\text{teamB}}, \tag{17}$$

where $S_{\text{teamA}}$ denotes the score of Team A and $S_{\text{teamB}}$ the score of Team B. A negative value indicates that Team A lost the match, while a positive value indicates that Team A won. Figure 6 plots the score difference $D$. It shows that, after the 37th match, the robots of Team A found a good joint action strategy that resulted in a draw or in Team A scoring more than 5 points, and that Team A outperformed Team B from the 52nd match to the end of the simulation. In summary, the robots of Team A, using regret matching as their action exploration strategy, accumulated experience by computing the regret value of every action and gradually improved their joint action after the 37th match. Through online learning with regret matching, the joint actions of the robots of Team A gradually approach points of coarse correlated equilibrium, which greatly improved the offensive and defensive capabilities of the whole team.
Figure 7 indicates that the robots of Team A received positive rewards in most of the matches. The average reward per match increases as the matches proceed and the robots gain more experience of cooperating. It rises from approximately 3.8 to 8.7 over 100 consecutive matches. Because a larger average reward indicates that the robots have employed good cooperation strategies to score more goals, Figure 7 confirms that regret matching as an action selection strategy is effective in helping the agents improve their tactical coordination in cooperative attacks. Although Team A performs worse than Team B in the initial learning phase (the first 37 matches), its performance improves steadily as the matches proceed, as expected before the simulation. In the last 63 matches, the robots of Team A quickly adapt to transitions of the environment state and coordinate their joint action, reducing conflict with their own teammates and obtaining more and more positive rewards.
Learning speed is evaluated by monitoring policy convergence, which is tracked by recording the average number of policy changes over all agents of Team A. For example, an agent from Team A may have been following a strategy of goal keeping when opponents appear in the front area, the ball is in the front area, and teammates are to the left and right, but owing to regret matching it switches to the chase ball action instead. Such an alteration is counted as a policy change. The average number of policy changes for Team B is stochastic because of its random action selection strategy, so only the curve of policy changes of Team A is analyzed. The data plotted in Figure 8 show good convergence for Team A using the regret matching algorithm. The average number of policy changes per match dropped to 0.47 after 100 matches.
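The policy-change metric can be sketched in Python as follows. The sketch assumes that each agent's policy is stored as a mapping from a state label to the chosen action and that a snapshot of all policies is compared against the previous snapshot; the exact bookkeeping used in the simulation is not specified in the text, so the function names and state labels are hypothetical.

```python
from typing import Dict, List

Policy = Dict[str, str]   # state label -> chosen action


def count_policy_changes(previous: Policy, current: Policy) -> int:
    """Number of states in which the agent now prefers a different action."""
    return sum(
        1
        for state, action in current.items()
        if state in previous and previous[state] != action
    )


def average_policy_changes(previous: List[Policy], current: List[Policy]) -> float:
    """Average number of policy changes over all agents of a team."""
    changes = [count_policy_changes(p, c) for p, c in zip(previous, current)]
    return sum(changes) / len(changes)


if __name__ == "__main__":
    # Hypothetical snapshots for a three-agent team.
    before = [
        {"ball_front": "goal keeping", "ball_far": "chase ball"},
        {"ball_front": "shoot ball", "ball_far": "chase ball"},
        {"ball_front": "shoot ball", "ball_far": "chase ball"},
    ]
    after = [
        {"ball_front": "chase ball", "ball_far": "chase ball"},
        {"ball_front": "shoot ball", "ball_far": "chase ball"},
        {"ball_front": "shoot ball", "ball_far": "goal keeping"},
    ]
    print(average_policy_changes(before, after))
```

A value that falls toward zero over successive matches means the agents have stopped revising their preferred actions, which is the sense in which the policy is said to converge.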
The number of policy changes for the robots of Team A is initially high but decreases gradually in later matches. There is a turning point at around the 52nd match, after which the average number of policy changes of Team A decreases monotonically. An extended simulation shows that the average number of policy changes for Team A reached zero after 150 matches.
From Figures 6 to 8, it is clear that the new Nash-Q learning with regret matching algorithm has higher learning efficiency than the original Nash-Q learning algorithm in robot soccer. Regret matching better balances the tradeoff between exploration and exploitation, so that the agent can reinforce its evaluation of the actions it already knows to be good while also exploring new actions. In particular, the new algorithm takes about 150 matches to complete policy convergence and thus finds the Nash-Q equilibrium values early.

Conclusion
This paper presents a new multiagent reinforcement learning approach that combines Nash-Q learning with regret matching to increase the convergence rate of the original Nash-Q learning algorithm, which learns Nash-Q equilibrium values by random action selection in a multiagent system. Regret matching, an online learning method from the family of no-regret learning algorithms, guarantees that the joint action asymptotically converges to a set of coarse correlated equilibrium points, which includes the Nash equilibrium points. We therefore investigate how regret matching can improve action selection in the original Nash-Q learning algorithm. Robot soccer is adopted as the platform to test the proposed approach. The experimental results show that, compared to the original Nash-Q learning, Nash-Q learning with regret matching performs better in terms of scores, average reward, and policy convergence toward the Nash equilibrium policy.

Figure 1: Architecture of MARL based on Markov game.

Figure 6: Score difference as the number of matches increases. Legend: Team A using Nash-Q learning with regret matching; Team B using original Nash-Q learning.

Figure 7: The average reward received by the robots in match.

Figure 8: The average number of policy changes per match.