A Decentralized Partially Observable Decision Model for Recognizing the Multiagent Goal in Simulation Systems



Introduction
With the rapid development of computational software and artificial intelligence techniques, agent-based simulation systems have become increasingly popular for staff training, policy analysis and evaluation, and even entertainment. In developing these systems, people often need to create humanlike agents that can make decisions and interact with other agents or humans autonomously. For example, in the famous real-time strategy game StarCraft, the AI players have to construct buildings, collect resources, produce units, and defeat their enemies [1]. Unfortunately, even though many decision and planning algorithms have been applied to improve the intelligence of these agents, they are still easily defeated, especially when human players play against them in the same scenario several times. One important reason is that these agents are unable to recognize the goals of their opponents or friends, whereas this is what human players usually do in the game [2]. Obviously, if agents know the goals of others, they can make counter decisions more efficiently.
Because goal recognition is significant for creating humanlike agents and for decision support, many related models and algorithms have been proposed and applied in different fields, such as hidden Markov models (HMMs) [3], partially observable Markov decision processes (POMDPs) [4], Markov logic networks (MLNs) [5], and particle filtering (PF) [6,7]. However, most of the existing research focuses on single agent scenarios. In some scenarios, missions are so complex that a number of agents have to constitute a group and achieve their joint goal through cooperation, and our aim is to identify the joint goal of the group rather than that of one member. In most cases, methods for recognizing the goal of a single agent cannot be applied directly to multiagent goal recognition, because we have to consider the relations or interactions between agents and the state space is usually very large.
There are three fundamental components in the framework of multiagent goal recognition: (a) modeling the agents' behaviors, the environment, and the observations for the recognizer; (b) estimating the parameters of the model through learning or other methods; and (c) inferring the goals from the observations. Previous work has addressed all of these aspects. However, we still have some difficulties in recognizing multiagent goals in simulation systems:
(a) For modeling behaviors, we usually have little knowledge about the details of agents' cooperation, such as the decomposition of the complex task, the allocation of the subtasks, the communication, and other details. Even when this information is available, it is hard in practice to represent all of it formally in a model.
(b) For learning parameters, a training dataset for supervised or unsupervised learning sometimes cannot be provided. Even if we have a training set, unsupervised learning is still infeasible, because the state space in multiagent scenarios is very large. Additionally, supervised learning may suffer from the overfitting problem, which will be shown in our experiments.
(c) For inferring goals, traditional exact filters such as an HMM filter are infeasible because the state space is large. The widely applied PF can compute the posterior distribution of goals, but it may fail when there are not sufficient particles, and increasing the number of particles consumes much more computing time.
To solve the problems above, we present a solution for recognizing multiagent goals in simulation systems. The core of our method is a novel decentralized partially observable Markov decision model (Dec-POMDM). After modeling the agents' behaviors, the environment, and the observations for the recognizer by the Dec-POMDM, we use an existing multiagent reinforcement learning (MARL) algorithm to estimate the behavior parameters and a marginal filter (MF) to infer the joint goal of agents. Our method has several advantages with respect to the above problems:
(a) For the modeling problem, the Dec-POMDM presents the agents' behaviors in a compact way. The Dec-POMDM can be regarded as an extension of the well-known decentralized partially observable Markov decision process (Dec-POMDP) [8]. As in the Dec-POMDP, all details of cooperation are hidden in joint policies in the Dec-POMDM. In this implicit way of behavior modeling, we only need to concern ourselves with the selection of primitive actions given goals and situations. Further knowledge of interactions between agents is unnecessary. Another advantage of the Dec-POMDM is that it can make use of the large number of existing algorithms for the Dec-POMDP, which will be explained later.
(b) For the problem of estimating the agents' joint policies, the MARL algorithm does not need a training dataset. We borrow the definition of goals from the domain of planning under uncertainty and associate each goal with a reward function. Then, we assume that agents achieve their joint goal by executing optimal policies, which bring them the maximum cumulative reward. The optimal policies define cooperative behaviors well and can be computed exactly or approximated by any algorithm for the Dec-POMDP. In this paper, cooperative colearning based on the Sarsa algorithm is exploited, because it does not need information about the model, which may be difficult to get in complex scenarios [9]. Heuristic algorithms such as memory-bounded dynamic programming (MBDP) and joint equilibrium-based search for policies (JESP) may also work if we have enough information about the environment model [10,11].
(c) For the inference, the MF outperforms the PF when the state space is discrete and large, which has been shown in [12,13]. Additionally, we will show that the MF solves the inference failure problem of the PF in our research. Another contribution is that we implement the MF under the framework of the Dec-POMDM, in which the filtering process is different from the work in [12,13].
To validate the Dec-POMDM together with the MARL and MF in multiagent goal recognition, we modify the classic predator-prey problem and design a new scenario [9]. In this scenario, there is more than one prey, and predators have to choose one prey and capture it by moving on a grid map. Additionally, predators may change their goal halfway. Our method is applied to recognize the real goal of predators based on the observed noisy traces. We also use the simulation model to generate different training datasets, which consist of different numbers of labeled traces. With these datasets, we compare the performance of our method with that of the widely applied hidden Markov model (HMM), whose transition matrix is obtained through supervised learning (the MF is used for inference in both models). In the inference part, results computed by the MF are compared to those of the PF with different numbers of particles. All performances are evaluated in terms of precision, recall, and F-measure.
The rest of the paper is organized as follows: Section 2 introduces related work. Section 3 gives the formal definition of the Dec-POMDM, the model's DBN representation, the cooperative colearning algorithm, and the baseline used for comparison. Section 4 introduces how to use the MF to infer the joint goal. Section 5 presents the scenarios, settings, and results of our experiments. Subsequently, we draw conclusions and discuss future work in Section 6.

Related Work
In many simulation systems such as training systems and commercial digital games, effects of actions are uncertain, usually for two reasons: (a) agents execute erroneous actions with a given probability; (b) environment states are affected by events that are not under the control of the agents. Because of this uncertainty, a goal cannot be achieved by a fixed sequence of primitive actions, as is done in classical planning. Instead, we have to find a set of policies, which define the distribution over actions in given situations. Since policies essentially reveal the goal of planning, goals can be inferred as discrete parameters or hidden states once their corresponding policies are known, and this process is usually implemented under the Markov decision framework.

Recognizing Goals of a Single Agent.
People have proposed many methods for recognizing goals of a single agent; some of them are foundations of methods for multiagent goal recognition. Baker et al. proposed a computational framework based on Bayesian inverse planning for recognizing mental states such as goals [14]. They assumed that the agent is rational: actions are selected based on an optimal or approximately optimal value function, given the beliefs about the world, and the posterior distribution of goals is computed by Bayesian inference. The core of Baker's method is that policies are computed by the planner based on the standard Markov decision process (MDP), which does not model the observing process of the agent. Thus, Ramírez and Geffner extended Baker's work by applying the goal-POMDP in formalizing the problem [4]. Compared to the MDP, the POMDP models the relation between the real world state and the observation of the agent explicitly; compared to the POMDP, the goal-POMDP defines the set of goal states. Besides, Ramírez and Geffner also solved the inference problem even when observations are incomplete. The works in [4,14] are very promising, but both of them suffer from two limitations: (a) the input for goal recognition is an action sequence; however, sometimes we only have observations of environment states from real or virtual sensors, and translating observations of states to actions is not easy; (b) the goal is estimated as a static parameter; however, it may be interrupted and changed within one episode.
Recently, the computational state space model (CSSM) has become more and more popular for human behavior modeling [15,16]. In the CSSM, transition models of the underlying dynamic system can be described by any computable function using compact algorithmic representations. Krüger et al. also discussed the performance of applying the CSSM to intention recognition in different real scenarios [16,17]. In this research, (a) intentions as well as actions and environment states are modeled as hidden states, which can be inferred by an online filtering algorithm; (b) observations reflect not only primitive actions but also environment states. The limitation of the research in [17] is that goal inference is not implemented in scenarios where results of actions are uncertain. Another related work on human behavior modeling under the MDP framework was done by Tastan et al. [18]. By making use of inverse reinforcement learning (IRL) and the PF, they learned the opponent's motion model and tracked it in the game Unreal 2004. This work made the following contributions: (a) the features for decision in the pursuit problem were abstracted; (b) IRL was used to learn the reward function of the opponent; (c) the solved decision model was regarded as the motion function in the PF. However, IRL relies on a large dataset, and Tastan's method is proposed for tracking rather than for goal recognition.

Multiagent Goal Recognition Based on Inverse Planning.
The inverse planning theory can also be used in the multiagent domain. Baker et al. inferred relational goals between agents (such as chasing and fleeing) by using a multiagent MDP framework to model interactions between agents and the environment [19]. In this model, each agent selects actions based on the world state, its goal, and its beliefs about other agents' goals. Mental state inference is done by inverse planning, under the assumption that all agents are approximately rational. Ullman et al. also successfully applied this theory to more complex social goals, such as helping and hindering, where an agent's goals depend on the goals of other agents [20]. In the military domain, Riordan et al. borrowed Baker's idea and applied Bayesian inverse planning to infer intents in multi-Unmanned Aerial Systems (UASs) [21]. Additionally, IRL was also used to learn the reward function. Even though Baker's theory is quite promising, it can only work when every agent has accurate knowledge of the world state, because the multiagent MDP does not model the observing process. Besides, Bayesian inverse planning does not allow the goal to change. Another related work under the Markov decision framework in multiagent settings was done by Doshi et al. [22]. Although their main aim is to learn the agents' behavior models, without recognizing goals, the process of estimating mental states is very similar to Bayesian approaches for probabilistic plan recognition. In Doshi's work, the interactive partially observable Markov decision process (I-POMDP) was used to model interactions between agents. The I-POMDP is an extension of the POMDP for multiagent settings. Compared to the POMDP, the I-POMDP defines an interactive state space, which combines the traditional physical state space with explicit models of other agents sharing the environment in order to predict their behavior. Thus, the I-POMDP is applicable in situations where agents may have identical or conflicting objectives. However, the I-POMDP has to deal with the problem "what do you think that I think that you think," which makes finding optimal or approximately optimal policies very hard [23]. In many multiagent scenarios, such as a football game or a first-person shooting game, the agents being recognized share a common goal. This makes the Dec-POMDP framework sufficient for modeling cooperation behaviors. Additionally, the increasing number of works on planning theory based on the Dec-POMDP provides us with a large number of planners [24,25].

Multiagent Goal Recognition Based on DBN Filtering.
If all actions and world states (the agent and the environment) are defined as variables with time labels, the MDP can be regarded as a special case of directed probabilistic graphical models (PGMs). With this idea, some researchers ignore the reward function, concern themselves only with the policies, and infer goals under the dynamic Bayesian framework. For example, Saria and Mahadevan presented a theoretical framework for online probabilistic plan recognition in cooperative multiagent systems. This model extends the Abstract Hidden Markov Model and consists of a Hierarchical Multiagent Markov Process that allows reasoning about the interaction among multiple cooperating agents. Rao-Blackwellized particle filtering (RBPF) is also used for the inference [26,27]. Pfeffer et al. [28] studied the problem of monitoring goals, team structure, and state in a dynamic environment: an urban warfare field, where uncoordinated or loosely coordinated units attempt to attack a target. They used the standard DBN to model cooperation behaviors (communication, group constitution) and the world states. An extension of the PF named factored particle filtering is also exploited in the inference. We also proposed a Logical Hierarchical Hidden Semi-Markov Model (LHHSMM) to recognize goals as well as cooperation modes of a team in a complex environment, where the observation was noisy and partially missing and the goal was changeable. The LHHSMM is a branch of the Statistical Relational Learning (SRL) method, which combines the PGM and first-order logic; it also presents the team behaviors in a hierarchical way. The inference for the LHHSMM was done by a logical particle filter. These works based on directed PGM theory have the advantage that they can use filtering algorithms. However, they suffer from some problems: (a) constructing the graph model needs a lot of domain knowledge, and we have to vary the structure in different applications; (b) the graph structure will be very complex when the number of agents is large, which makes parameter learning and goal inference time consuming and sometimes even infeasible; (c) they need a training dataset. Other models based on data-driven training, such as Markov logic networks and deep learning, have the same problems listed above [29,30].

The Model
We propose the Dec-POMDM for formalizing the world states, behaviors, and goals in the problem. In this section, we first introduce the formal definition of the Dec-POMDM and explain relations among variables in this model by a DBN representation. Then, the planning algorithm for finding the policies is given.
3.1. Formalization. The foundation of our Dec-POMDM is the widely applied Dec-POMDP. However, the Dec-POMDP is proposed for solving decision problems, and it defines neither the goal nor the observation model for the recognizer. Additionally, the multiagent joint goal may be terminated because of achievement or interruption. Thus, we design the Dec-POMDM as a combination of three parts: (a) the standard Dec-POMDP; (b) the joint goal and goal termination variable; (c) the observation model for the recognizer.
A Dec-POMDM is a tuple $\langle D, S, A, T, R, \Omega, O, \lambda, h, I, G, E, \Upsilon, Y, Z, C, B \rangle$, where
(i) $D = \{1, 2, \ldots, n\}$ is the set of $n$ agents;
(ii) $S$ is a finite set of world states $s$, which contains all necessary information for making a decision;
(iii) $A$ is the finite set of joint actions;
(iv) $T$ is the state transition function;
(v) $R$ is the reward function;
(vi) $\Omega$ is the finite set of joint observations for agents making a decision;
(vii) $O$ is the observation probability function for agents making a decision;
(viii) $\lambda$ is the discount factor;
(ix) $h$ is the horizon of the problem;
(x) $I$ is the initial state distribution at stage $t = 0$;
(xi) $G$ is the set of possible joint goals;
(xii) $E$ is the set of goal termination variables;
(xiii) $\Upsilon$ is the observation function for the recognizer;
(xiv) $Y$ is the finite set of joint observations for the recognizer;
(xv) $Z$ is the goal selection function;
(xvi) $C$ is the goal termination function;
(xvii) $B$ is the initial goal distribution at stage $t = 0$.
Symbols including $D$, $S$, $A$, $\Omega$, $O$, $\lambda$, $h$, and $I$ in the Dec-POMDM have the same meanings as those in the Dec-POMDP. More definition details and explanations can be found in [9,27]. The remaining elements are defined as follows. The reward function is defined as $R : S \times A \times S \times G \rightarrow \mathbb{R}$, which shows that the reward depends on the joint goal. The goal set $G$ consists of all possible joint goals. The goal termination variable set $E = \{0, 1\}$ indicates whether the current goal will be continued in the next step: if $e \in E$ is 0, the goal will be continued; otherwise, a new goal will be selected in the next step. The observation function for the recognizer is defined as $\Upsilon : S \times Y \rightarrow [0, 1]$, where $\Upsilon(s, y) = p(y \mid s)$ is the probability that the recognizer observes $y \in Y$ while the real world state is $s \in S$. The goal selection function is defined as $Z : S \times G \rightarrow [0, 1]$, where $Z(s, g) = p(g \mid s)$ is the conditional probability that agents select $g \in G$ as the new goal while the world state is $s$. The goal termination function is defined as $C : S \times G \rightarrow [0, 1]$, where $C(s, g) = p(e = 1 \mid s, g)$ is the conditional probability that agents terminate their goal $g \in G$ while the world state is $s$.
In the Dec-POMDP, the policy of the $i$th agent is defined as
$$\Pi_i : \Omega_i^{*} \times A_i \rightarrow [0, 1],$$
where $A_i$ is the set of possible actions of agent $i$ and $\Omega_i^{*}$ is the set of observation sequences. Thus, given an observation sequence $\{o_1^i, o_2^i, \ldots, o_t^i\}$, agent $i$ selects an action with a probability defined by $\Pi_i$. Since the selection of actions depends on the history of the observations, the Dec-POMDP does not satisfy the Markov assumption. This attribute makes inferring goals online very hard: (a) if we precompute policies and store them offline, a very large memory is required because of the combinatorial explosion of possible observations; (b) if we compute policies online, the filtering algorithm is infeasible because weights of possible states cannot be updated with only the last observation. One possible solution is to define an interior belief variable for each agent to filter the world state, but it would make the inference process much more complex. In this paper, we simply assume that all agents are memoryless, as in [9]. Then, the policy of the $i$th agent is defined as
$$\Pi_i : \Omega_i \times G \times A_i \rightarrow [0, 1],$$
where $\Omega_i$ is the set of possible observations of agent $i$. The definition of policies in the Dec-POMDM shows that (a) an agent does not need the history of observations for making decisions; (b) the selection of actions depends on the goal at that stage. In the next section, we further explain the relations of variables in the Dec-POMDM by its DBN representation.

The DBN Representation.
After estimating the agents' policies by a multiagent reinforcement learning algorithm, we do not need the reward function in the inference process. Thus, in the DBN representation of the Dec-POMDM, there are six sorts of variables in total: the goal, the goal termination variable, the action, the observations for agents making decisions, the state, and the observations for the recognizer. In this section, we first analyze how these variables are affected by other factors. Then, we give the DBN representation in two adjacent time slices.
(A) Goal ($g$). The goal $g_t$ at time $t$ depends on the goal $g_{t-1}$, the goal termination variable $e_{t-1}$, and the state $s_{t-1}$. If $e_{t-1} = 0$, agents keep their goal at time $t$; otherwise, $g_t$ is selected depending on $s_{t-1}$.
The DBN representation of the Dec-POMDM in two adjacent time slices presents all dependencies among the variables, as is shown in Figure 1.

(B) Goal Termination Variable (𝑒
In Figure 1, only the actions and observations of two agents are presented for simplicity; each agent has no knowledge about the others and can only make decisions based on its own observations. Although the Dec-POMDM has a hierarchical structure, it models the task decomposition and allocation implicitly: all information about cooperation is hidden in the joint policies. From the point of view of filtering theory, the joint policies play the role of the motion function. Thus, estimating policies is a key problem for goal inference.
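The goal dependency described in (A) can be illustrated with a short Python sketch. This is only an illustration under assumptions: `goal_selection` is a hypothetical callable standing in for the goal selection function (it returns a dictionary from goals to probabilities for a given state), and the goal labels are placeholders.

```python
def goal_transition_prob(goal, prev_goal, prev_e, prev_state, goal_selection):
    """Return p(goal_t | goal_{t-1}, e_{t-1}, s_{t-1}).

    goal_selection(state) is a hypothetical stand-in for the goal selection
    function: it returns a dict mapping each goal to its probability.
    """
    if prev_e == 0:
        # The goal is not terminated, so it is kept with probability 1.
        return 1.0 if goal == prev_goal else 0.0
    # The goal was terminated, so a new goal is selected given the state.
    return goal_selection(prev_state).get(goal, 0.0)


def example_selection(state):
    # Hypothetical, state-independent goal selection used only for the demo.
    return {"PA": 0.6, "PB": 0.4}


print(goal_transition_prob("PA", "PA", 0, None, example_selection))
print(goal_transition_prob("PB", "PA", 1, None, example_selection))
```

In a filter, this probability would be one factor of the transition from one particle to the next.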

Learning the Policy.
Because the Dec-POMDP is essentially a DBN, we can simply use some data-driven methods to learn parameters from a training dataset. However, a training dataset is not always available. Besides, when the number of agents is large, the DBN structure becomes large and complex, which makes supervised or unsupervised learning time consuming. To solve these problems, we assume that the agents to be recognized are rational, which is reasonable when there is no history of agents. Then, we can use an existing planner based on the Dec-POMDP framework to find the optimal or approximately optimal policies for each possible goal.
Various algorithms have been proposed for solving Dec-POMDP problems. Roughly, these algorithms can be divided into two categories: (a) model-based algorithms, under the general name of dynamic programming; (b) model-free algorithms, under the general name of reinforcement learning [8]. In this paper, we select a multiagent reinforcement learning algorithm, cooperative colearning based on Sarsa [9], because it does not need a state transition function of the world.
The main idea of cooperative colearning is that at each step one chooses a subgroup of agents and updates their policies to optimize the task, given that the rest of the agents have fixed plans; then, after a number of iterations, the joint policies can converge to a Nash equilibrium. In this paper, we only consider settings where agents are homogeneous. All agents share the same observation model and policies. Thus, we only need to define one POMDP $M$ for all agents. All parameters of $M$ can be obtained directly from the given Dec-POMDP, except for the transition function $T'$. Later, we will show how to compute $T'$ from $T$, which is the transition function of the Dec-POMDP. The Dec-POMDP problem can be solved by the following steps [9].
Step 1. We set $k = 0$ and start from an arbitrary policy $\Pi^0$.
Step 2. We select an arbitrary agent and compute the state transition function $T'$ of $M$, considering that the policies of all agents are $\Pi^k$, except for the selected agent, which is refining its plan.
Step 3. We compute $\Pi^{*}$, which is the optimal policy of $M$, and set $\Pi^{k+1} = \Pi^{*}$.
Step 4. We update the policy of each agent to $\Pi^{k+1}$, set $k = k + 1$, and return to Step 2.
In Step 2, the state transition function of the POMDP for any agent can be computed by formula (1) (we assume that we refine the plan for the first agent):
$$T'(s, a_1, s') = \sum_{(o_2, \ldots, o_n)} T\big(s, a_1, \Pi^k(o_2), \ldots, \Pi^k(o_n), s'\big) \prod_{i=2}^{n} O'(s, o_i), \quad (1)$$
where $O'$ is the observation function defined in $M$, $a_i$ is the action of the $i$th agent, $o_i$ is the observation of the $i$th agent, and $T(s, a_1, \Pi^k(o_2), \ldots, \Pi^k(o_n), s')$ is the probability that the state transits from $s$ to $s'$ while the action of the first agent is $a_1$ and the other agents choose actions based on their observations and the policy $\Pi^k$.
Unfortunately, computing formula (1) is always very difficult in complex scenarios. Thus, we apply the Sarsa algorithm to find the optimal policy of the POMDP (in Steps 2 and 3) [8]. In this process, the POMDP problem is mapped to an MDP problem by regarding the observation as the world state. Then, agents get feedback from the environment, and we do not need to compute the updated reward function and state transition function.
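The mapping from observations to MDP states can be made concrete with a minimal tabular Sarsa update. This is a sketch under assumptions, not the implementation used in the experiments: the Q-table is keyed by (observation, action) pairs, the learning rate `alpha` is a hypothetical value that the text does not specify, and the observation labels are placeholders.

```python
from collections import defaultdict


def sarsa_update(q, obs, action, reward, next_obs, next_action,
                 alpha=0.1, discount=0.8):
    """One tabular Sarsa step where the observation plays the role of the state.

    q maps (observation, action) to a value. alpha is an assumed learning
    rate; discount follows the discount factor reported in the settings.
    """
    td_target = reward + discount * q[(next_obs, next_action)]
    q[(obs, action)] += alpha * (td_target - q[(obs, action)])
    return q[(obs, action)]


# Unvisited (observation, action) pairs default to a value of 0.
q_table = defaultdict(float)
new_value = sarsa_update(q_table, "prey_north", "north", 1.0,
                         "prey_adjacent", "stay")
print(new_value)
```

Because the update only uses sampled transitions, neither the transition function of the single-agent POMDP nor its reward model has to be computed explicitly.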

Inference
In the Dec-POMDM, the goal is defined as a hidden state indirectly reflected by observations. In this case, many filtering algorithms can be used to infer the goals. However, in the multiagent setting, the world state space is large because of combinations of agents' states and goals. Besides, the Dec-POMDM models a discrete system. To solve this problem, we use an MF algorithm to infer multiagent joint goals under the framework of the Dec-POMDM. The MF has been shown to be efficient when the state space is large and discrete.
Nyolt et al. discussed the theory of the MF and how to apply it in a causal model in [12,13]. Their main idea still works in our paper, but there are also differences between the Dec-POMDM and the causal model: (a) the initial distribution of states in our model is uncertain; (b) the results of actions in our model are uncertain; (c) we do not model the duration of actions for simplicity. Thus, we have to modify the MF for causal models and make it applicable to the Dec-POMDM.
When the MF is applied to infer goals under the framework of the Dec-POMDM, the set of particles is defined as $\{x_t^i\}_{i=1:N_t}$, where $x_t^i = \langle g_t^i, e_t^i, s_t^i \rangle$. $N_t$ is the number of particles at time $t$, and the weight of the $i$th particle is $w_t^i$. The detailed procedures of goal inference are given in Algorithm 1.
Let $P_t$ denote the set $\{x_t^i, w_t^i\}_{i=1:N_t}$, which contains all particles and their weights at each stage. When we initialize $\{x_0^i, w_0^i\}_{i=1:N_0}$, the weights are computed from the initial state probability $p(s_0^i)$, the goal termination probability $p(e_0^i \mid g_0^i, s_0^i)$, and the probability of the initial goal. Let $P'$ denote a temporary memory of particles and their weights which are transited from one specific particle. Because a world state $s_t$ may be reached after the execution of different actions from a given $s_{t-1}$, we have to use a LOOKUP operation to query the record in $P'$ which covers the new particle $x_t$. The operation LOOKUP($x_t$, $P'$) searches for $x_t$ in $P'$; if there is a record $\langle x, w \rangle$ in $P'$ which covers $x_t$, the operation returns this record; otherwise, it returns empty. This process is time consuming if we scan $P'$ for every query. One alternative is to build a matrix for each $P'$, which records the indices of all reached particles. Then, if the index of $x_t$ is null, we add a new record to $P'$ and update the index matrix; otherwise, we read the index of $x_t$ from the matrix and merge the weight directly. After we finish building $P'$, its index matrix can be deleted to release memory. Note that this solution saves computing time but needs extra memory.
The MERGE operation combines $P'$ and the existing $P_t$. In this process, if a particle $x$ in $P'$ has not appeared in $P_t$, we directly put $x$ and its corresponding weight $w$ into $P_t$; otherwise, we add $w$ to the weight of the record in $P_t$ which covers $x$. Similarly, an index matrix can also be used to save computing time in the MERGE operation.
Under the framework of the Dec-POMDM, we update $w_t^i$ by
$$w_t^i \leftarrow w_t^i \cdot p(y_t \mid s_t^i),$$
where $w_t^i$ and $x_t^i = \langle g_t^i, e_t^i, s_t^i \rangle$ are the weight and particle, respectively, of the $i$th record in $P_t$, and $p(y_t \mid s_t^i)$ is the observation function for the recognizer in the model. The details of the PRUNE operation can be found in [13], and we can use the existing pruning technique directly in our paper.
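The predict, merge, update, and prune cycle described above can be sketched in Python as follows. This is a simplified illustration and not Algorithm 1 itself: particles are stored in a dictionary so that hashing replaces the LOOKUP and MERGE index matrix, `transition` and `obs_likelihood` are hypothetical callables supplied by the model, and pruning is done by keeping the heaviest particles, which is an assumption standing in for the pruning technique of [13].

```python
from collections import defaultdict


def marginal_filter_step(particles, transition, obs_likelihood, observation,
                         max_particles=None):
    """One step of a marginal filter over discrete particles.

    particles: dict mapping a hashable particle x to its weight.
    transition(x): hypothetical callable yielding (x_next, probability) pairs.
    obs_likelihood(observation, x): hypothetical callable returning p(y | x).
    """
    # Predict: identical successor particles are merged by summing weights.
    predicted = defaultdict(float)
    for x, w in particles.items():
        for x_next, p in transition(x):
            if p > 0.0:
                predicted[x_next] += w * p

    # Update: multiply each weight by the likelihood of the observation.
    updated = {}
    for x, w in predicted.items():
        weight = w * obs_likelihood(observation, x)
        if weight > 0.0:
            updated[x] = weight

    # Prune (assumed strategy): keep only the heaviest particles.
    if max_particles is not None and len(updated) > max_particles:
        keep = sorted(updated, key=updated.get, reverse=True)[:max_particles]
        updated = {x: updated[x] for x in keep}

    total = sum(updated.values())
    if total == 0.0:
        raise ValueError("observation has zero likelihood for all particles")
    return {x: w / total for x, w in updated.items()}
```

The posterior of a goal is then obtained by summing the weights of all particles that carry this goal. Unlike a particle filter, each distinct state appears at most once, so no weight is wasted on duplicate samples.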

The Predator-Prey Problem.
The predator-prey problem is a classic problem for evaluating multiagent decision algorithms [9]. However, it cannot be used directly in this paper because predators have only one possible goal. Thus, we modify the standard problem by setting more than one prey on the map, and our aim is to recognize the real target of the predators at each time. Figure 2 shows the grid-based map in our experiments.
In our scenario, the map consists of 5 × 5 grids, two predators (red diamonds: Predator PX and Predator PY), and two preys (blue stars: Prey PA and Prey PB). The two predators select one of the preys as their common goal and move around to capture it. As is shown in Figure 2, the observation of the predators is not perfect: a predator only knows the exact position of itself and of other agents which are in the nearest 8 grids. If another agent is out of its observing range, the predator only knows its direction (8 possible directions). For the situation in Figure 2, Predator PX observes that no agent is near itself, Prey PB is in the north, Prey PA is in the southeast, and Predator PY is in the south; Predator PY observes that no agent is near itself, Prey PB and Predator PX are in the north, and Prey PA is in the east. In each step, all predators and preys can move into one of the four adjacent grids (north, south, east, and west) or stay at the current grid. When two or more agents try to move into the same grid or try to move out of the map, they have to stay in the current grid. The predators achieve their goal if and only if both of them are adjacent to their target. Additionally, the predators may change their goal before they capture a prey. The recognizer can get the exact positions of the two preys, but its observations of the predators are noisy. We need to compute the posterior distribution of the predators' goals from the observation trace. The modified predator-prey problem can be modeled by the Dec-POMDM. Some important elements are as follows:
(e) $\Omega$: the directions of agents far away and the real positions of agents nearby;
(f) $Y$: the real positions of preys and the noisy positions of predators;
(g) $R$: predators get a reward of +1 once they achieve their goal; otherwise, the immediate reward is 0;
(h) $h$: infinite horizon.
With the definition above, the effects of the predators' actions are uncertain, and the state transition function depends on the distribution of the preys' actions. Thus, the actions of preys play the role of events in discrete dynamic systems.
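The movement rule of the scenario (blocked moves at the map border and when several agents target the same grid) can be sketched as follows. Several details are assumptions made only for this illustration: positions are (x, y) tuples with north decreasing y, an agent that stays also counts as claiming its own grid, conflicts are resolved repeatedly until no two agents end in the same grid, and two agents swapping grids is allowed.

```python
from collections import Counter

# Assumed coordinate convention: x grows to the east, y grows to the south.
MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0),
         "west": (-1, 0), "stay": (0, 0)}


def resolve_moves(positions, actions, size=5):
    """Return the next positions of all agents on a size x size map.

    positions: list of distinct (x, y) tuples; actions: list of keys of MOVES.
    """
    targets = []
    for (x, y), action in zip(positions, actions):
        dx, dy = MOVES[action]
        nx, ny = x + dx, y + dy
        inside = 0 <= nx < size and 0 <= ny < size
        # A move out of the map is cancelled and the agent stays.
        targets.append((nx, ny) if inside else (x, y))

    changed = True
    while changed:
        changed = False
        counts = Counter(targets)
        for i, pos in enumerate(positions):
            # A moving agent whose target is claimed more than once stays.
            if counts[targets[i]] > 1 and targets[i] != pos:
                targets[i] = pos
                changed = True
    return targets


print(resolve_moves([(0, 0), (2, 0)], ["east", "west"]))
```

In the usage line, both agents try to enter the grid between them, so both remain where they are.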

Settings.
In this section, we provide additional details for the scenario and the parameters used in the policy learning and inference algorithm.
(A) The Scenario. The preys are senseless: they select each action with equal probability. Initial positions of agents are uniformly distributed. The initial goal distribution assigns a probability of 0.6 to one prey and 0.4 to the other.
We simplify the goal termination function in the following way: if predators capture their target, the goal is achieved; otherwise, the predators change their goal with a probability of 0.05 for every state. The observation for the recognizer $y_t \in Y$ reveals the real position of each predator with a probability of 0.5, and the observation may be any one of the 8 neighbors of the current grid with a probability of 0.5/8 each. When the predator is on the edge of the map, the observation may be out of the map. The observations of different agents do not affect each other.
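The noise model of the recognizer described above can be written as a small likelihood function. This is a minimal sketch: positions are assumed to be (x, y) tuples, and the function names are hypothetical.

```python
def predator_obs_prob(observed, true_pos):
    """Probability of observing a predator at `observed` given its true grid.

    The true grid is reported with probability 0.5 and each of the 8
    neighboring grids with probability 0.5 / 8. Observed positions outside
    the map are allowed, so no boundary check is applied.
    """
    dx = abs(observed[0] - true_pos[0])
    dy = abs(observed[1] - true_pos[1])
    if dx == 0 and dy == 0:
        return 0.5
    if max(dx, dy) == 1:
        return 0.5 / 8
    return 0.0


def joint_obs_prob(observed_positions, true_positions):
    """Joint probability for several predators observed independently."""
    prob = 1.0
    for observed, true_pos in zip(observed_positions, true_positions):
        prob *= predator_obs_prob(observed, true_pos)
    return prob


print(joint_obs_prob([(2, 2), (0, 1)], [(2, 2), (0, 0)]))
```

Since at most nine true positions have a nonzero likelihood for each observed predator position, the number of states a filter has to keep after each observation stays small.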
(B) Multiagent Reinforcement Learning. The discount factor is $\lambda = 0.8$. The predator selects an action $a_i$ given the observation $o_i$ with a probability
$$p(a_i \mid o_i) = \frac{\exp\big(Q(o_i, a_i)/\beta\big)}{\sum_{a' \in A_i} \exp\big(Q(o_i, a')/\beta\big)},$$
where $\beta = 0.1$ is the Boltzmann temperature. We set $\beta > 0$ as a constant, which means that predators always select approximately optimal actions. In our scenario, the Q-value converges after 750 iterations. In the learning process, if predators cannot achieve their goal in 10000 steps, we reinitialize their positions and begin the next episode.
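Boltzmann action selection with a fixed temperature can be sketched as follows. The Q-values in the usage line are made-up numbers for illustration only, and subtracting the maximum before exponentiation is a standard numerical safeguard rather than part of the method described above.

```python
import math


def boltzmann_policy(q_values, temperature=0.1):
    """Map a dict of action -> Q-value to a dict of action -> probability."""
    # Shifting by the maximum leaves the distribution unchanged but avoids
    # overflow when Q-values are large relative to the temperature.
    best = max(q_values.values())
    exps = {a: math.exp((q - best) / temperature) for a, q in q_values.items()}
    norm = sum(exps.values())
    return {a: v / norm for a, v in exps.items()}


# Hypothetical Q-values of one predator for a single observation.
probs = boltzmann_policy({"north": 0.50, "east": 0.45, "stay": 0.10})
print(max(probs, key=probs.get))
```

With a small temperature such as 0.1, actions with slightly lower Q-values keep a noticeable probability, while clearly worse actions are almost never chosen.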
(C) The Marginal Filter. In the MF inference, a particle consists of the joint goal, the goal termination variable, and the positions of the predators and preys. Although there are 25 possible positions for each predator or prey, after getting a new observation we can identify the positions of the preys, and there are only 9 possible positions for each predator. Thus, the number of possible values of particles at one step never exceeds 9 × 9 × 2 × 2 = 324 after the UPDATE operation. In our experiments, we simply set the maximum number of particles to 324; we then do not need to prune any possible particles, which means that we perform exact inference. We also use the real settings of the Dec-POMDM and the real policies of the predators in the MF inference.
Results and Discussion.
With the learned policy, we ran the simulation model repeatedly and generated a test dataset consisting of 100 traces. There are 11.83 steps on average in one trace, and the number of steps in one trace varies from 6 to 34 with a standard deviation of 5.36. We also found that the predators changed their goal at least once in 35% of the test traces. To validate our method and compare it with baselines, we did experiments on three aspects: (a) to discuss the details of the recognition results obtained with our method, we computed the recognition results of two specific traces; (b) to show the advantages of the Dec-POMDM in modeling, we compared the recognition performances when the problem was modeled as a Dec-POMDM and as an HMM; (c) to show the efficiency of the MF under the framework of the Dec-POMDM, we compared the recognition performances when the goal was inferred by the MF and by the PF.
In the second and the third parts, performances were evaluated by statistical metrics: precision, recall, and F-measure. Their meanings and computation details can be found in [31]. The value of each metric is between 0 and 1; a higher value means a better performance. Since these metrics can only evaluate the recognition results at a single step, and traces in the dataset had different lengths, we defined a positive integer k (k = 1, 2, ..., 5). The metric with k is computed on the observation sequences that run from time 1 to time ⌈k · length_j / 5⌉ of the jth trace (j = 1, ..., 100), where length_j is the length of the jth trace. We need to recognize the goal at the end of each such observation sequence. Thus, metrics with different k show the performances of the algorithms in different phases of the simulation. Additionally, we regarded the goal with the largest probability as the final recognition result.
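The evaluation procedure can be sketched as follows. The exact definitions used in the experiments are those of [31]; the per-goal computation below is the common textbook form and is an assumption here, as are the function names.

```python
def prefix_length(k, length, parts=5):
    """Number of steps used for phase k: ceil(k * length / parts), in integer arithmetic."""
    return -(-k * length // parts)


def precision_recall_f(true_goals, predicted_goals, goal):
    """Precision, recall, and F-measure of one goal label over paired lists of results."""
    pairs = list(zip(true_goals, predicted_goals))
    tp = sum(t == goal and p == goal for t, p in pairs)
    fp = sum(t != goal and p == goal for t, p in pairs)
    fn = sum(t == goal and p != goal for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For a trace of 11 steps, `prefix_length(1, 11)` selects the first 3 observations and `prefix_length(5, 11)` selects the whole trace.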
(A) The Recognition Results of the Specific Traces. To show the details of the recognition results obtained with our method, we selected two specific traces from the dataset (number 1 and number 4). These two traces were selected because Trace number 1 was the first trace where the goal was changed before it was achieved, and Trace number 4 was the first trace where the goal was kept until it was achieved. The detailed information about the two traces is shown in Table 1.
In Trace number 1, the predators selected Prey PA as their first goal from t = 1 to t = 7. Then, the goal was changed to Prey PB, which was achieved at t = 13. In Trace number 4, the predators selected Prey PB as their initial goal. This goal was kept until it was achieved at t = 14. Given the real policies and the other parameters of the Dec-POMDM, we used the MF to compute the posterior distribution of goals at each time. The recognition results are shown in Figure 3.
In Figure 3(a), the probability of the real goal (Prey PA) increases fast during the initial period. At t = 3, the probability of Prey PA exceeds 0.9, and it stays high until t = 7. When the goal is changed at t = 8, our method responds very quickly, because the predators select highly certain joint actions at this time. In Figure 3(b), the probability of the false goal increases at t = 2, and the probability of the real goal (Prey PB) is low at first. This failure happens because the observations suggest joint actions that the predators would select only with small probability if the goal were Prey PB. Nevertheless, the probability of the real goal increases continuously and exceeds that of Prey PA after t = 5. From the recognition results of the two specific traces, we conclude that the Dec-POMDM and the MF can perform well regardless of whether the goal changes.
(B) Comparison of the Performances of the Dec-POMDM and HMMs. To show the advantages of the Dec-POMDM, we modeled the predator-prey problem as the well-known HMM as a baseline. In the HMM, we set the hidden state as the tuple of the goal and the positions of Predator PX, Predator PY, Prey PA, and Prey PB. The observation model for the recognizer in the HMM was the same as that in the Dec-POMDM. Thus, there were 25 × 24 × 23 × 22 × 2 = 607,200 possible states in this HMM, and the dimension of the transition matrix was 607,200 × 607,200. Since the state space and the transition matrix were too large, an unsupervised learning method such as the Baum-Welch algorithm was infeasible in this problem. Instead, we used a simple supervised learning method: counting the state transitions in labeled training datasets. With the real policies of the predators, we ran the simulation model repeatedly and generated five training datasets. The detailed information of these datasets is shown in Table 2.
The datasets HMM-50, HMM-100a, and HMM-200a were generated in a random and incremental way (HMM-100a contains HMM-50, and HMM-200a contains HMM-100a). Since HMM-100a and HMM-200a both covered 79% of the traces of the test dataset, which might cause an overfitting problem, we removed the traces covered by the test dataset from HMM-100a and HMM-200a and replaced them with extra traces. In this way, we got the new datasets HMM-100b and HMM-200b, which did not cover any trace in the test dataset. With these five labeled datasets, we estimated the transition matrix by counting state transitions. Then, the MF was used to infer the goal. However, in the inference process, we may reach some states that have not been explored in the training dataset (HMM-200a only explores 492,642 states, but there are 607,200 possible states in total). In this case, we assumed that the hidden state would transit to a possible state with a uniform distribution. The rest of the parameters in the HMM inference were the same as those in the Dec-POMDM inference. We compared the performances of the Dec-POMDM and the HMMs. The recognition metrics are shown in Figure 4.
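The counting-based estimation with a uniform fallback for unexplored states can be sketched as follows. The data layout (each trace as a list of hashable hidden states) and the `candidates` argument, which stands for the set of possible successor states, are assumptions for illustration.

```python
from collections import Counter, defaultdict


def estimate_transitions(traces):
    """Count state transitions in labeled traces (each trace is a list of hidden states)."""
    counts = defaultdict(Counter)
    for trace in traces:
        for state, next_state in zip(trace, trace[1:]):
            counts[state][next_state] += 1
    return counts


def transition_prob(counts, state, next_state, candidates):
    """Relative frequency of a transition, or a uniform guess for an unexplored state."""
    row = counts.get(state)
    if not row:
        # The state never occurred as a source in training: spread mass uniformly.
        return 1.0 / len(candidates) if next_state in candidates else 0.0
    return row[next_state] / sum(row.values())
```

A transition that was never counted from an explored state gets probability 0 in this sketch, which illustrates why an HMM trained this way depends so strongly on how well the training traces cover the test traces.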
Figure 4 compares the results of the Dec-POMDM and the HMMs trained on different datasets in terms of precision, recall, and F-measure. More precisely, HMM-200a had the highest performance; HMM-100a performed comparably to our Dec-POMDM, but the Dec-POMDM performed better after k = 4; HMM-50 had the worst performance. Generally, there was no big difference between the performances of the Dec-POMDM, HMM-100a, and HMM-200a, even though the number of traces in HMM-200a was twice as large as that in HMM-100a. The main reason is that these two HMMs were overfitted, since their training datasets covered most traces of the test dataset. Indeed, there was a very serious performance decline after we removed the traces covered by the test dataset from HMM-200a and HMM-100a. In this case, HMM-200b performed better than HMM-100b, but worse than our Dec-POMDM.
The results in Figure 4 showed that (a) our Dec-POMDM performed well on the three metrics, almost as well as the overfitted models; (b) the learned HMM suffered from an overfitting problem, and there is a serious performance decline if the training dataset does not cover most traces in the test dataset. The curves of HMM-50, HMM-100b, and HMM-200b also showed that when we model the problem through the HMM, it may be possible to improve the recognition performance by increasing the size of the training dataset. However, this solution is infeasible in practice. The dataset HMM-200a, which consisted of 200,000 traces, only covered 81.13% of all possible states, and only 71.46% of the states in HMM-200a had been reached more than once. Thus, we can conclude that HMM-200a will perform poorly if agents select actions with higher uncertainty. Besides, the training dataset would have to be extremely large for most states to be reached a large number of times. In real applications, it is very hard to obtain a training dataset of such a size, especially when all traces must be labeled.
We also performed a Wald test over the Dec-POMDM and the HMMs trained on different datasets to test whether their recognition results came from different distributions. Given our test dataset, there were 100 goals to be recognized for each value of k. Let X be the set of samples obtained from the Dec-POMDM; we set x_i = 1 if the recognition result of the Dec-POMDM is correct on test case i (i = 1, 2, ..., 100) and x_i = 0 otherwise. Let Y be the set of samples obtained from the baseline model (one of the HMMs); we set y_i = 1 if the recognition result of the baseline is correct on test case i and y_i = 0 otherwise. We define d_i = x_i − y_i and let δ = E(d_i) = E(x_i) − E(y_i) = P(x_i = 1) − P(y_i = 1); the null hypothesis is δ = 0, which means that the recognition results from the two models follow the same distribution. A more detailed test process can be found in [32]. The matrix of p values is shown in Table 3.
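A paired Wald test of this kind can be sketched as follows. The detailed procedure used in the experiments is the one in [32]; the version below is the usual large-sample form (estimated difference divided by its standard error, compared with a standard normal) and is given as an assumption, with `paired_wald_test` a hypothetical name.

```python
import math


def paired_wald_test(x, y):
    """Two-sided Wald test of H0: E(x_i) - E(y_i) = 0 for paired 0/1 outcomes.

    Returns the estimated difference and the p value.
    """
    n = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]
    delta_hat = sum(d) / n
    variance = sum((di - delta_hat) ** 2 for di in d) / (n - 1)
    se = math.sqrt(variance / n)
    if se == 0.0:
        # All paired differences are identical; the statistic is degenerate.
        return delta_hat, (1.0 if delta_hat == 0.0 else 0.0)
    w = delta_hat / se
    # 2 * (1 - Phi(|w|)) written with the complementary error function.
    return delta_hat, math.erfc(abs(w) / math.sqrt(2.0))
```

A small p value indicates that the two models differ in how often they recognize the goal correctly on the same test cases.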
The p values in Table 3 show, with high confidence, that the recognition results of the Dec-POMDM follow different distributions from those of HMM-50, HMM-100b, and HMM-200b. We cannot reject the null hypothesis when the baseline is an overfitted model. The Wald test results are consistent with the metrics in Figure 4. We also performed the Wilcoxon test, and the test results showed the same trend.
(C) Comparison of the Performances of the MF and the PF. Here, we exploited the PF as the baseline. The model information used in the PF is the same as that in the MF. We evaluated the PF with different numbers of particles and compared their performances with that of the MF. All inference was done under the framework of the Dec-POMDM. We have to note that, since the PF uses weighted particles to approximate the posterior distribution of the goal, it is possible that all weights decrease to 0 if the number of particles is not large enough.
In this case, we simply reset all weights to 1/Np to continue the inference, where Np is the number of particles in the PF. The recognition metrics of the MF and the PF are shown in Figure 5.
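The weight reset can be sketched as a small helper. The function name is hypothetical, and the surrounding particle filter (propagation, weighting, resampling) is not shown.

```python
def normalize_or_reset(weights):
    """Normalize particle weights; if every weight is 0, reset them all to 1/Np."""
    n_particles = len(weights)
    total = sum(weights)
    if total == 0.0:
        return [1.0 / n_particles] * n_particles
    return [w / total for w in weights]
```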
In Figure 5, the red solid curve indicates the metrics of the MF. The green, blue, purple, black, and cyan dashed curves indicate the metrics of the PF with 1000, 2000, 4000, 6000, and 16000 particles, respectively. All filters had similar precision. However, considering the recall and the F-measure, the MF had the best performance, and the PF with the largest number of particles performed better than the other PFs. We got these results because an exact MF (without a pruning step) is used in this section, and the PF approximates the real posterior distribution better with more particles.
Similar to the testing method we used in Part (B), we also performed the Wald test on the MF and the PF with different numbers of particles. The matrix of the p values is shown in Table 4.
The null hypothesis is that the recognition results of the baseline and the MF follow the same distribution. Generally, with larger k and a smaller number of particles, we can reject the null hypothesis with higher confidence, which is consistent with the results in Figure 5.
Since there was variance in the results inferred by the PF (because the PF performs approximate inference), we ran the PF with 6000 particles 10 times. The mean values of the metrics at different k, with their 90% belief intervals, are shown in Figure 6.
The blue solid line indicates the mean value of the PF metrics, and the cyan area shows the 90% belief interval. Since we do not know the underlying distribution of the metrics, an empirical distribution was used to compute the belief interval. At the same time, because we ran the PF 10 times, the bounds of the 90% belief interval also indicate the extrema of the PF. We can see that the metrics of the MF are better than the mean of the PF when k > 1, and even better than the maximum value of the PF except for k = 4. In fact, the MF also outperforms 90% of the runs of the PF at k = 4. Additionally, the MF needs on average only 75.78% of the time that the PF needs for inference at each step. Thus, the MF consumes less time and performs better than the PF with 6000 particles.
Generally, the computational complexities of the PF and the MF are both linear functions of the number of particles. When there are multiple possible results of an action, the MF consumes more time than the PF when their numbers of particles are equal. However, since the MF does not duplicate particles, it needs far fewer particles than the PF when the state space is large and discrete. In fact, the number of possible states in the MF after the UPDATE operation is never more than 156 in the test dataset. At the same time, the PF has to duplicate particles in the resampling step to approximate the exact distribution, which makes it inefficient under the framework of the Dec-POMDM.
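The reason why the MF avoids duplicated particles can be illustrated with a generic marginal filter step over a discrete state space. This is a simplified sketch and not Algorithm 1: the state is an arbitrary hashable value rather than the tuple of goal, termination variable, and positions, and `transition`, `likelihood`, and `max_particles` are hypothetical names.

```python
from collections import defaultdict


def marginal_filter_step(belief, transition, likelihood, observation, max_particles=None):
    """One predict/update step over a discrete belief {state: probability}.

    transition(state) returns {next_state: probability};
    likelihood(observation, state) returns P(observation | state).
    """
    predicted = defaultdict(float)
    for state, weight in belief.items():
        for next_state, prob in transition(state).items():
            # Identical successors are merged into one entry instead of duplicated.
            predicted[next_state] += weight * prob
    updated = {s: w * likelihood(observation, s) for s, w in predicted.items()}
    updated = {s: w for s, w in updated.items() if w > 0.0}
    if max_particles is not None and len(updated) > max_particles:
        # Pruning: keep only the heaviest particles (inference becomes approximate).
        keep = sorted(updated, key=updated.get, reverse=True)[:max_particles]
        updated = {s: updated[s] for s in keep}
    total = sum(updated.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: w / total for s, w in updated.items()}
```

Every possible successor is enumerated, which is why one step costs more than sampling a single successor per particle, but each distinct state is stored only once, so the number of particles is bounded by the number of states consistent with the observation.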

Conclusion and Future Work
In this paper, we propose a novel model, the Dec-POMDM, for solving multiagent goal recognition problems and present its corresponding learning and inference algorithms.
First, we use the Dec-POMDM to model the general multiagent goal recognition problem. The Dec-POMDM represents the agents' cooperation in a compact way; details of the cooperation are unnecessary in the modeling process. It can also make use of existing algorithms for solving the Dec-POMDP problem.
Second, we show that an existing MARL algorithm can be used to estimate the agents' policies in the Dec-POMDM, assuming that the agents are approximately rational. This method needs neither a training dataset nor the state transition function of the environment.
Third, we use the MF to infer goals under the framework of the Dec-POMDM, and we show that the MF is more efficient than the PF when the state space is large and discrete.
Last, we design a modified predator-prey problem to test our method. In this modified problem, there are multiple possible joint goals, and agents may change their goals before they are achieved. With this scenario, we compare our method with baselines including the HMM and the PF. The experimental results show that the Dec-POMDM together with the MARL and MF algorithms can recognize the multiagent goal well whether the goal is changed or not; the Dec-POMDM outperforms the HMM in terms of precision, recall, and F-measure; and the MF can infer goals more efficiently than the PF.
In the future, we plan to apply the Dec-POMDM to more complex scenarios. Further research on pruning techniques for the MF is also planned.

Figure 1: The DBN representation of the Dec-POMDM in two adjacent time slices.

Figure 2: The grid-based map in the predator-prey problem.
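Elements (a) to (d) of the Dec-POMDM for the modified predator-prey problem: (a) the agents: the two predators; (b) the states: the positions of predators and preys; (c) the actions: five actions for each predator, moving in four directions and staying; (d) the goals: Prey PA or Prey PB.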

Figure 3: Recognition results of two specific traces computed by the Dec-POMDM and MF.

Figure 4: Recognition metrics of the Dec-POMDM and HMMs.

Figure 5: Recognition metrics of the MF and PF.

Figure 6: Metrics of the PF with 6000 particles and the MF.
The goal termination variable at time t depends on the goal and the state at the same time; the goal is terminated with a probability conditioned on that goal and state.
(C) Action. The action selected by agent i at time t depends on the goal and the agent's observation at the same time, and the distribution of actions is defined by Π.
(D) Observations for Agents Making Decision. The observation of agent i at time t reflects the real world state at time t − 1, and the agent receives this observation with a probability conditioned on that state.
(E) State. The world state at time t depends on the state at time t − 1 and the actions of all agents at time t. The distribution of the updated states can be computed by the state transition function.
(F) Observations for the Recognizer. The observation for the recognizer at time t reflects the real world state at the same time, and the recognizer receives this observation with a probability conditioned on that state.
The operation PUT adds a new record, consisting of a particle and its weight, to the particle set and indexes this new record. The generated set contains a group of particles and corresponding weights. Some of these particles may already be covered in the existing set. Thus, we have to use a MERGE operation to get a new set by …
Algorithm 1: Goal inference based on the marginal filter under the framework of Dec-POMDM.
Set t = 0 and initialize the set of weighted particles. % Initialization
For t = 1 : max t % We totally have max t + 1 observations
  Foreach value of a particle component whose conditional probability is greater than 0
    Foreach successor state whose transition probability from the previous state under the joint action is greater than 0
      Compute the new weight as the product of the previous weight, the conditional probabilities, and the transition probability.
      Store the new particle and its weight. % A temp memory of a particle
      LOOKUP the new particle in the set. % Find out the record which is equal to the particle in the set
      If ⟨…

Table 1: The details of two traces.

Table 2: Information of training datasets for the HMM.

Table 3: The p values of the Wald test over the Dec-POMDM and HMMs.

Table 4: The p values of the Wald test over the MF and PFs.