Risk-Sensitive Multiagent Decision-Theoretic Planning Based on MDP and One-Switch Utility Functions

In high-stakes situations, decision-makers are often risk-averse, and decision-making often takes place in group settings. This paper studies multiagent decision-theoretic planning under the Markov decision process (MDP) framework, taking into account how an agent's risk attitude changes as his wealth level varies. Based on the one-switch utility function, which describes how an agent's risk attitude changes with his wealth level, we give additive and multiplicative aggregation models of group utility and adopt the maximization of expected group utility as the planning objective. The characteristics of the optimal policy as the wealth level approaches negative infinity are analyzed for the additive and multiplicative aggregation models, respectively. A backward-induction method is then proposed that divides the wealth level interval from negative infinity to the initial wealth level into subintervals and determines the optimal policy for each state and subinterval. The proposed method is illustrated by numerical examples, and the influence of the agents' risk aversion parameters and weights on group decision-making is also analyzed.


Introduction
Decision-theoretic planning computes an optimal policy, that is, a course of action that maximizes expected reward when actions have uncertain outcomes [1]. In high-stakes situations with the possibility of large wins and losses, such as emergency and crisis response, business and investment decisions, military battles, and lotteries, decision-makers are often risk-averse. In risk-sensitive decisions, the exponential utility function is one of the typical utility functions used to model a decision-maker's risk aversion, and maximizing expected utility is the most commonly used rule instead of maximizing expected reward. However, the risk attitude of a decision-maker modeled by an exponential utility function is independent of his wealth level and does not change as his wealth level varies, while in reality a person's risk attitude often changes with his wealth level [2][3][4]. Bell proposes a kind of utility function, named the one-switch utility function, to model an agent who is always risk-averse but becomes risk neutral as his wealth increases [2]. Bell and Fishburn further study the characteristics of a form of one-switch utility function that combines a linear utility function and an exponential utility function [5]. Liu and Koenig give a form of one-switch utility function with an explicit risk aversion parameter $\gamma$ ($0 < \gamma < 1$); that is, $U(w) = w - D\gamma^{w}$, where $w$ denotes the wealth level and $D$ is a parameter that adjusts the tradeoff between risk neutrality and risk aversion. This form of one-switch utility function not only describes the change of the agent's risk attitude but also expresses the degree of the agent's risk aversion through the quantitative risk aversion parameter $\gamma$ [6][7][8].
For decision-theoretic planning problems, the Markov decision process (MDP) framework is broadly adopted as the underlying model. Howard and Matheson, in their seminal paper, introduce risk-sensitive MDPs based on maximizing the expected exponential utility [9]. In follow-up studies, structural properties of optimal solutions and algorithms for computing optimal policies are investigated based on the exponential utility function [10][11][12]. If an agent is risk-sensitive, it is necessary to consider how the agent's risk attitude may change as his wealth level varies and how this influences the decisions in the next stage. Liu and Koenig study Markov decision processes in which the agent's risk-sensitive attitude is modeled by a one-switch utility function and propose an exact backward-induction algorithm to compute the optimal policy [8].
In reality, decision-making often takes place in group settings because a single decision-maker's decision-making ability is limited. In group decision-making problems, the group utility is usually obtained by aggregating personal utilities, and group decisions are then made based on the group utility. The aggregation methods include the additive value model and the multiplicative value model. Other methods, such as multiobjective linear programming [13], fuzzy set methods [14,15], and interactive approaches [16,17], are used to aggregate individual decision information, including attribute weights and attribute values, into group decisions. In addition, some studies of group decision-making take time into consideration. Xu investigates multistage multiattribute group decision-making problems in which the weight information on a collection of attributes and the decision information on a finite set of alternatives with respect to the attributes are collected at different stages [18].
This paper focuses on decision-theoretic planning problems in which sequential decisions are made by a group of risk-sensitive members. Considering the agents' risk-sensitive attitudes and wealth levels, this paper studies the risk-sensitive multiagent decision-theoretic planning problem based on the one-switch utility function and the MDP framework. Two group utility functions are given, based on the additive value model and the multiplicative value model of one-switch utility functions, respectively. Backward-induction algorithms are proposed for these two kinds of group utility functions to compute the optimal policy of risk-sensitive group decision-making under the MDP framework.
The rest of this paper is organized as follows. The one-switch utility function and the risk-sensitive MDP model augmented with wealth level are introduced in Section 2. In Section 3, the additive and multiplicative aggregation models of one-switch utility functions are given. We also analyze, in Section 3, the characteristics of the optimal policy as the wealth level approaches negative infinity for the additive and multiplicative aggregation models, respectively. In Section 4, detailed backward-induction algorithms are proposed to solve the multiagent decision-theoretic planning problem based on the additive and multiplicative aggregation models, respectively. In Section 5, numerical examples illustrate the proposed method and are used to analyze the influence of the agents' risk aversion parameters and weights on group decision-making. Finally, conclusions and suggested topics for future research are presented in Section 6.

Risk-Sensitive MDP Model Augmented with Wealth Level
2.1. One-Switch Utility Function. A one-switch utility function describes the change of an agent's risk attitude as his wealth level varies. In detail, there exists a wealth level threshold: when the agent's wealth level is below the threshold, the agent is risk-averse, but when his wealth level rises above it, the agent becomes risk neutral. For agent $i$, the one-switch utility function given by Liu and Koenig is as follows [6][7][8]:

$$U_i(w) = w - D_i \gamma_i^{w}, \tag{1}$$

where $w$ is the wealth level, $\gamma_i$ is agent $i$'s risk aversion parameter, and $0 < \gamma_i < 1$. $D_i$ is a constant that provides an adjustable tradeoff between risk neutrality (linear term) and risk aversion (exponential term). $U_l(w) = w$ is a linear utility function, and $U_{i,e}(w) = -\gamma_i^{w}$ is agent $i$'s exponential utility function.
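As a rough numerical illustration of the switch in risk attitude, the sketch below evaluates the one-switch utility $U(w) = w - D\gamma^{w}$ together with its Arrow-Pratt coefficient of absolute risk aversion, $-U''(w)/U'(w)$. The function names and the sample values $\gamma = 0.9$ and $D = 1$ are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def one_switch_utility(w, gamma, D=1.0):
    """One-switch utility U(w) = w - D * gamma**w, with 0 < gamma < 1."""
    return w - D * gamma ** w


def absolute_risk_aversion(w, gamma, D=1.0):
    """Arrow-Pratt coefficient -U''(w) / U'(w) of the one-switch utility."""
    ln_g = math.log(gamma)                    # negative because 0 < gamma < 1
    first = 1.0 - D * ln_g * gamma ** w       # U'(w), always positive
    second = -D * ln_g ** 2 * gamma ** w      # U''(w), always negative
    return -second / first


# Illustrative values only: risk aversion at low, zero, and high wealth.
for wealth in (-200.0, 0.0, 200.0):
    print(wealth, absolute_risk_aversion(wealth, gamma=0.9))
```

Under these assumptions the coefficient approaches $-\ln\gamma$ as $w \to -\infty$ (the behavior of a purely exponential utility) and approaches 0 as $w$ grows (risk neutrality), which is one way to read the one-switch property described above.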

Risk-Sensitive MDP Model Augmented with Wealth Level.
In this paper, the goal-directed Markov decision problem (GDMDP) is adopted as the underlying model of the decision-theoretic planning problem [8]. A GDMDP is an MDP with a finite set of goal states. When an agent reaches a goal state, he stops acting and receives no more rewards thereafter. The one-switch utility function is used to describe the agent's risk-sensitive attitude, and maximizing expected utility is adopted as the planning objective instead of maximizing expected reward. Because the wealth level appears in the one-switch utility function, it is necessary to treat the wealth level as a component of the system state of the GDMDP.
Formally, a GDMDP consists of a finite set of states $S$ with wealth levels $W$, so the augmented state set of the GDMDP is denoted by $(S, W)$. The goal state set is $(G, W)$, where $G \subseteq S$. The nongoal state set is $(S', W)$, where $S' = S \setminus G$.
The agent's action set is $A_s$. The agent chooses an action $a \in A_s$ to execute in its current state $s \in S$.
The agent's execution of action $a$ in state $s$ results in a finite reward $r(s, a, s')$ and a transition to successor state $s' \in S$ with probability $P(s' \mid s, a)$. In this paper only costs are considered, so the reward is assumed to be strictly negative: $r(s, a, s') < 0$.
We also use $s_t$ and $a_t$ to denote the state and action at time step $t$ ($t = 0, 1, 2, \ldots$). $r_t = r(s_t, a_t, s_{t+1})$ denotes the reward for executing action $a_t$. After the agent reaches a goal state, $r_t = 0$.
$w_t = w_0 + \sum_{\tau=0}^{t-1} r_\tau$ is the agent's wealth level at time step $t$, where the initial wealth level is denoted by $w_0$.
For the MDP model augmented with wealth level, the optimal policy maps every combination of a state $s \in S'$ and wealth level $w$ to the action $a \in A_s$ that an agent in state $s$ with wealth level $w$ should execute to maximize expected utility.
For agent $i$'s exponential utility function $U_{i,e}(w)$ and all policies $\pi$, we define the value $V^{\pi}_{i,e}(s, w) = \lim_{T \to \infty} E^{\pi}_{s,w}[U_{i,e}(w_T)]$ as the expected exponential utility of agent $i$ who starts in state $s$ with initial wealth level $w$ and follows policy $\pi$.
The optimal value $V^{*}_{i,e}(s, w) = \max_{\pi} V^{\pi}_{i,e}(s, w)$ is defined as the highest possible expected exponential utility of agent $i$ with initial state $s$ and initial wealth level $w$. Assume that $V^{*}_{i,e}(s, w)$ is finite for all states $s \in S$ and wealth levels $w$.
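To make the expected exponential utility concrete, the following sketch evaluates a fixed policy on a tiny GDMDP by fixed-point iteration. The model (one nongoal state, one goal state, a single action that costs 1 and reaches the goal with probability 0.5), the dictionary layout, and the function names are all hypothetical. The sketch relies on the standard observation that, for the exponential utility $-\gamma^{w}$, the value at an augmented state factors as $\gamma^{w}$ times a wealth-independent value.

```python
# Hypothetical toy GDMDP (not from the paper): nongoal state "s0", goal state "g".
# TRANSITIONS[state][action] is a list of (probability, reward, successor).
TRANSITIONS = {
    "s0": {"a1": [(0.5, -1.0, "g"), (0.5, -1.0, "s0")]},
}
GOALS = {"g"}


def evaluate_exponential(policy, gamma, sweeps=500):
    """Fixed-point iteration for the wealth-independent value v(s).

    v(goal) = -1 and v(s) = sum_s' P(s'|s,a) * gamma**r(s,a,s') * v(s'),
    so v(s) is the expected value of -gamma**(total future reward).
    """
    v = {s: -1.0 for s in set(TRANSITIONS) | GOALS}
    for _ in range(sweeps):
        for s, a in policy.items():
            v[s] = sum(p * gamma ** r * v[s2] for p, r, s2 in TRANSITIONS[s][a])
    return v


def expected_exponential_utility(s, w, policy, gamma):
    """Expected exponential utility at augmented state (s, w): gamma**w * v(s)."""
    return gamma ** w * evaluate_exponential(policy, gamma)[s]


policy = {"s0": "a1"}
value = expected_exponential_utility("s0", -3.0, policy, gamma=0.9)
print(value)
```

The iteration only converges when the expected exponential utility is finite, which is the finiteness assumption stated above; for this toy model that requires $0.5\,\gamma^{-1} < 1$.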
It is worth noting that, unlike Liu and Koenig [8], this paper focuses on decision-making in a group setting. The value function in the MDP is therefore replaced by the group utility function, which aggregates the personal one-switch utilities, and the planning objective is to maximize the expected group utility.

Utility Aggregation Model of One-Switch Utility Functions
Group utility is the aggregation of personal utilities. Common aggregation methods include the additive value model and the multiplicative value model. In the following sections we discuss the additive and multiplicative value models for the aggregation of personal one-switch utility functions, respectively.

Additive Aggregation Model of One-Switch Utility Functions.
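In general, the additive aggregation model of group utility is defined as follows:

$$U(w) = \sum_{i=1}^{n} k_i U_i(w), \tag{2}$$

where $k_i$ is the weight of agent $i$'s utility, $0 < k_i < 1$, and $\sum_{i=1}^{n} k_i = 1$. Thus the additive aggregation model of one-switch utility functions is defined as follows:

$$U(w) = \sum_{i=1}^{n} k_i \left(w - D_i \gamma_i^{w}\right) = w - \sum_{i=1}^{n} k_i D_i \gamma_i^{w}. \tag{3}$$

According to the MDP model, the expected group utility for all policies $\pi$ and states $(s, w)$ is

$$V^{\pi}(s, w) = \lim_{T \to \infty} E^{\pi}_{s,w}\left[U(w_T)\right]. \tag{4}$$

Then the optimal value $V^{*}(s, w)$ is

$$V^{*}(s, w) = \max_{\pi} V^{\pi}(s, w). \tag{5}$$

Next, we derive the relationship between the expected group utility and the expected personal linear and exponential utilities for the additive aggregation model. For all policies $\pi$, the expected group utility of the additive aggregation model of one-switch utility functions is

$$V^{\pi}(s, w) = w + V^{\pi}_{l}(s) + \sum_{i=1}^{n} k_i D_i \gamma_i^{w} V^{\pi}_{i,e}(s), \tag{6}$$

where $V^{\pi}_{l}(s)$ and $V^{\pi}_{i,e}(s)$ satisfy the following policy-evaluation equations, respectively [8,19,20]:

$$V^{\pi}_{l}(s) = \begin{cases} 0, & s \in G, \\ \sum_{s' \in S} P(s' \mid s, \pi(s)) \left[ r(s, \pi(s), s') + V^{\pi}_{l}(s') \right], & s \in S', \end{cases} \tag{7}$$

$$V^{\pi}_{i,e}(s) = \begin{cases} -1, & s \in G, \\ \sum_{s' \in S} P(s' \mid s, \pi(s))\, \gamma_i^{r(s, \pi(s), s')}\, V^{\pi}_{i,e}(s'), & s \in S'. \end{cases} \tag{8}$$

Equation (6) gives the relationship between $V^{\pi}(s, w)$, the expected linear utility $V^{\pi}_{l}(s)$, and the expected personal exponential utilities $V^{\pi}_{i,e}(s)$ for the additive aggregation model, where $V^{\pi}_{l}(s)$ and $V^{\pi}_{i,e}(s)$ are independent of the wealth level $w$.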
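The additive decomposition into a linear part and weighted exponential parts can be checked numerically. The sketch below does so on a hypothetical one-state chain in which every step pays reward $r < 0$ and reaches the goal with probability $p$. The closed forms in `chain_values`, the agent tuples `(k_i, D_i, gamma_i)`, and all numbers are illustrative assumptions, not values from the paper.

```python
def chain_values(p, r, gamma):
    """Closed-form policy evaluation for a hypothetical one-state chain.

    Every step pays reward r < 0 and reaches the goal with probability p;
    otherwise the agent stays in the same nongoal state.
    """
    g = gamma ** r
    if (1.0 - p) * g >= 1.0:
        raise ValueError("expected exponential utility is not finite")
    v_lin = r / p                             # expected total reward
    v_exp = -p * g / (1.0 - (1.0 - p) * g)    # expected -gamma**(total reward)
    return v_lin, v_exp


def additive_group_value(w, p, r, agents):
    """w + v_lin + sum_i k_i * D_i * gamma_i**w * v_exp_i.

    agents is a list of (k_i, D_i, gamma_i); the weights k_i must sum to 1.
    """
    total = w + r / p
    for k, D, gamma in agents:
        _, v_exp = chain_values(p, r, gamma)
        total += k * D * gamma ** w * v_exp
    return total


def direct_expectation(w, p, r, agents, horizon=400):
    """Brute-force check: sum over the number of steps n needed to reach the goal."""
    def group_utility(x):
        return sum(k * (x - D * gamma ** x) for k, D, gamma in agents)
    return sum(p * (1.0 - p) ** (n - 1) * group_utility(w + n * r)
               for n in range(1, horizon))


agents = [(0.5, 1.0, 0.9), (0.5, 1.0, 0.8)]   # hypothetical agents
decomposed = additive_group_value(-2.0, 0.5, -1.0, agents)
brute_force = direct_expectation(-2.0, 0.5, -1.0, agents)
print(decomposed, brute_force)
```

The two numbers agree up to floating-point error, which illustrates why the wealth level can be factored out of the exponential terms and handled separately from the state.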

Multiplicative Aggregation Model of One-Switch Utility Functions.
For simplicity, in this paper we only consider the case $n = 2$. For $n > 2$, the derivation of the multiplicative aggregation model is similar.
For $n = 2$, the multiplicative aggregation model of one-switch utility functions simplifies to

$$U(w) = k_1 U_1(w) + k_2 U_2(w) + k_{12} U_1(w) U_2(w). \tag{10}$$

According to the MDP model, the optimal value $V^{*}(s, w)$ is

$$V^{*}(s, w) = \max_{\pi} V^{\pi}(s, w). \tag{11}$$

For the multiplicative aggregation model of one-switch utility functions and all policies $\pi$, the expected group utility is

$$V^{\pi}(s, w) = \lim_{T \to \infty} E^{\pi}_{s,w}\left[ k_1 U_1(w_T) + k_2 U_2(w_T) + k_{12} U_1(w_T) U_2(w_T) \right]. \tag{12}$$

From the policy-evaluation equation (12), we obtain the relationship between the expected group utility $V^{\pi}(s, w)$, the expected linear utility $V^{\pi}_{l}(s)$, and the expected personal exponential utilities $V^{\pi}_{i,e}(s)$ for the multiplicative aggregation model.
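Multiplying out the product term of the two-agent multiplicative model shows where a combined exponential term with parameter $\gamma_1\gamma_2$ comes from. The sketch below compares the factored and expanded forms numerically. It assumes the one-switch form $U_i(w) = w - D_i\gamma_i^{w}$; the function names and sample parameter values are illustrative only.

```python
def one_switch(w, D, gamma):
    """One-switch utility w - D * gamma**w."""
    return w - D * gamma ** w


def multiplicative_group_utility(w, k1, k2, k12, agent1, agent2):
    """k1*U1 + k2*U2 + k12*U1*U2, where each agent is a (D, gamma) pair."""
    u1 = one_switch(w, *agent1)
    u2 = one_switch(w, *agent2)
    return k1 * u1 + k2 * u2 + k12 * u1 * u2


def expanded_group_utility(w, k1, k2, k12, agent1, agent2):
    """The same utility with the product term multiplied out."""
    (D1, g1), (D2, g2) = agent1, agent2
    g3 = g1 * g2   # parameter of the combined exponential term
    return ((k1 + k2) * w - k1 * D1 * g1 ** w - k2 * D2 * g2 ** w
            + k12 * (w * w - D1 * w * g1 ** w - D2 * w * g2 ** w
                     + D1 * D2 * g3 ** w))


# Illustrative parameters only.
params = (0.85, 0.1, -0.05, (1.0, 0.9), (1.0, 0.8))
for wealth in (-5.0, -1.0, 0.0, 2.0):
    print(wealth,
          multiplicative_group_utility(wealth, *params),
          expanded_group_utility(wealth, *params))
```

In the expanded form, the only term that grows like $(\gamma_1\gamma_2)^{w}$ is $k_{12} D_1 D_2 (\gamma_1\gamma_2)^{w}$, and for very negative wealth it outgrows every other term.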

Preparation for Backward-Induction Method
To compute the optimal policy for the additive and multiplicative aggregation models of one-switch utility functions, a backward-induction method is adopted. In this paper the wealth level ranges over the continuous interval $(-\infty, w_0]$. We first compute the optimal policy as the wealth level $w \to -\infty$, which is denoted by $\pi^{*}_{-\infty}$. We then increase the wealth level until $\pi^{*}_{-\infty}$ is no longer an optimal policy, which yields a wealth level threshold. Increasing the wealth level further yields the next wealth level threshold in the same way. The backward-induction method ends when the wealth level exceeds the initial wealth level $w_0$. Thus the continuous wealth level interval is divided into subintervals by the thresholds. We use $((s, (\hat{w}^{j}, \hat{w}^{j+1}]), a)$ to denote action $a$ executed in state $s$ and wealth level interval $(\hat{w}^{j}, \hat{w}^{j+1}]$ ($\hat{w}^{j}$ denotes a wealth level threshold and $j = 0, 1, 2, \ldots$). In this section we analyze the characteristics of the optimal policy as the wealth level $w$ approaches negative infinity for the additive and multiplicative aggregation models, respectively.

Lemma 1. For additive aggregation model of one-switch utility functions, if agent 𝑖 is the most risk-averse one, that is, 𝛾
Proof.For all optimal policies  * , we have lim As ).On the other hand, for all optimal policies  * , , according to the fact that V * (, ) ≥ V  (, ) for all policies , we have lim Therefore, the lemma holds.
Lemma 1 implies that, as the wealth level $w \to -\infty$, the optimal policy for the additive aggregation model of one-switch utility functions is the same as the most risk-averse agent's optimal policy for the exponential utility function.
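The implication of Lemma 1 can be illustrated on a small example. The sketch below uses a hypothetical single nongoal state with two actions, restricts attention to stationary policies (always `a1` or always `a2`), and compares the group's preferred action at a very low wealth level with the action preferred under the exponential utility of the agent with the smallest $\gamma$. The model, the agent tuples `(k_i, D_i, gamma_i)`, and the closed-form values are assumptions made for this illustration.

```python
# Hypothetical actions: name -> (probability of reaching the goal, reward per step).
ACTIONS = {"a1": (0.5, -1.0), "a2": (1.0, -3.0)}
# Hypothetical agents: (k_i, D_i, gamma_i); a smaller gamma means more risk aversion.
AGENTS = [(0.5, 1.0, 0.9), (0.5, 1.0, 0.6)]


def exp_value(p, r, gamma):
    """Expected value of -gamma**(total reward) when one action is repeated."""
    g = gamma ** r
    return -p * g / (1.0 - (1.0 - p) * g)


def group_value(w, p, r):
    """Additive group value of repeating one action from wealth level w."""
    return w + r / p + sum(k * D * gamma ** w * exp_value(p, r, gamma)
                           for k, D, gamma in AGENTS)


def best_group_action(w):
    return max(ACTIONS, key=lambda a: group_value(w, *ACTIONS[a]))


def best_exponential_action(gamma):
    return max(ACTIONS, key=lambda a: exp_value(*ACTIONS[a], gamma))


print(best_group_action(-100.0), best_exponential_action(0.6))
print(best_group_action(0.0), best_exponential_action(0.9))
```

In this toy model the group sides with the more risk-averse agent at very low wealth and prefers the action with the higher expected reward at wealth level 0. The example ignores wealth-dependent policies, so it is only a partial illustration of the lemma.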

Lemma 2. For multiplicative aggregation model of one-switch utility functions, lim
) for all states $s \in S$ and $\gamma_3 = \gamma_1 \gamma_2$, where $V^{*}_{3,e}(s, w)$ is the highest expected exponential utility with risk aversion parameter $\gamma_3$.
Proof.For all optimal policies  * , we have lim As On the other hand, for all optimal policies  * 3, , according to the fact that V * (, ) ≥ V  (, ) for all policies , we have lim Therefore, the lemma holds.
Lemma 2 implies that, as the wealth level $w \to -\infty$, the optimal policy for the multiplicative aggregation of one-switch utility functions is the same as the optimal policy for a virtual agent's exponential utility function. The virtual agent's risk aversion parameter is the product of the risk aversion parameters of all agents in the group.
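The role of the virtual agent can be seen directly in the utility function. The sketch below divides the two-agent multiplicative group utility by $(\gamma_1\gamma_2)^{w}$ and evaluates the ratio at increasingly negative wealth levels. It assumes the form $k_1 U_1 + k_2 U_2 + k_{12} U_1 U_2$ with $U_i(w) = w - D_i\gamma_i^{w}$; the function names and parameter values are illustrative.

```python
def multiplicative_utility(w, k1, k2, k12, D1, g1, D2, g2):
    """Two-agent multiplicative aggregation of one-switch utilities."""
    u1 = w - D1 * g1 ** w
    u2 = w - D2 * g2 ** w
    return k1 * u1 + k2 * u2 + k12 * u1 * u2


def normalized_by_virtual_agent(w, k1, k2, k12, D1, g1, D2, g2):
    """Group utility divided by (g1 * g2)**w, the virtual agent's exponential term."""
    g3 = g1 * g2
    return multiplicative_utility(w, k1, k2, k12, D1, g1, D2, g2) / g3 ** w


# Illustrative parameters only.
params = dict(k1=0.85, k2=0.1, k12=-0.05, D1=1.0, g1=0.9, D2=1.0, g2=0.8)
for wealth in (-10.0, -50.0, -300.0):
    print(wealth, normalized_by_virtual_agent(wealth, **params))
```

As the wealth level decreases, the ratio tends to $k_{12} D_1 D_2$, so the group utility behaves like a scaled exponential utility with parameter $\gamma_1\gamma_2$. With a negative $k_{12}$, as in the numerical examples of the paper, maximizing the group utility at very low wealth then amounts to maximizing the virtual agent's exponential utility.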

Division of Wealth Level Interval and Backward-Induction Method
The previous section gives the optimal policy as the wealth level approaches negative infinity for the additive and multiplicative aggregation models of one-switch utility functions, respectively. The next step is to divide the wealth level interval and to determine the wealth level thresholds and the optimal policies on the subintervals by backward induction. In this section we discuss the backward-induction method for the additive and multiplicative aggregation models.

The Case of Additive Aggregation Model.
For the additive aggregation model of one-switch utility functions, we first give the following theorem, which establishes the existence of a wealth level threshold $\hat{w}$, and then give the backward-induction algorithm.
Theorem 3. For all optimal policies $\pi^{*}$, there exists a wealth level threshold $\hat{w}$ such that, for all states $s \in S'$ and all wealth levels $w \le \hat{w}$, $\pi^{*}(s, w) = \pi^{*}_{-\infty}(s)$. Please see Appendix A for the proof of Theorem 3.
Theorem 3 shows the existence of the wealth level threshold $\hat{w}$. Next we show how to determine the wealth level threshold $\hat{w}$.
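After obtaining $\pi^{*}_{-\infty}$ for the wealth level $w \to -\infty$, assume that $\pi^{*}_{-\infty}$ is the optimal policy on the wealth level interval $(-\infty, \hat{w}]$. For a wealth level $w$ in the interval $(\hat{w}, \min_{s \in S'} (\hat{w} - r(s, a, s'))]$, where $r(s, a, s')$ is the reward obtained by executing one action, $\pi^{*}_{-\infty}$ is no longer the optimal policy; assume that $\pi'$ is the optimal policy. According to (5), for all nongoal states we have

$$V^{*}(s, w) = \max_{a \in A_s} \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s', w + r(s, a, s')).$$

As $w + r(s, a, s') \le \hat{w}$, the successor values are attained by $\pi^{*}_{-\infty}$, where, by (6),

$$V^{*}(s', w + r(s, a, s')) = w + r(s, a, s') + V^{\pi^{*}_{-\infty}}_{l}(s') + \sum_{i=1}^{n} k_i D_i \gamma_i^{w + r(s, a, s')} V^{\pi^{*}_{-\infty}}_{i,e}(s').$$

For $s \in S'$ and $a' \in A_s \setminus \pi^{*}_{-\infty}(s)$, because under the assumption the optimal policy is now $\pi'$ and not $\pi^{*}_{-\infty}$, we have

$$\sum_{s' \in S} P(s' \mid s, a')\, V^{*}(s', w + r(s, a', s')) \ge V^{\pi^{*}_{-\infty}}(s, w),$$

or, equivalently,

$$\sum_{s' \in S} P(s' \mid s, a') \left[ r(s, a', s') + V^{\pi^{*}_{-\infty}}_{l}(s') + \sum_{i=1}^{n} k_i D_i \gamma_i^{w + r(s, a', s')} V^{\pi^{*}_{-\infty}}_{i,e}(s') \right] \ge V^{\pi^{*}_{-\infty}}_{l}(s) + \sum_{i=1}^{n} k_i D_i \gamma_i^{w} V^{\pi^{*}_{-\infty}}_{i,e}(s).$$

We obtain a wealth level threshold $\hat{w}_{\pi^{*}_{-\infty}, s, a'}$ from the equality case of the above weak inequality.
From the above, we obtain the wealth level threshold $\hat{w}$:

$$\hat{w} = \min_{s \in S',\; a' \in A_s \setminus \pi^{*}_{-\infty}(s)} \hat{w}_{\pi^{*}_{-\infty}, s, a'}. \tag{25}$$

After obtaining $\hat{w}$, the next step is to divide the wealth level interval $(\hat{w}, w_0]$ further into subintervals and to solve for the optimal policy on each subinterval in the same way.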
The main procedure of the backward-induction algorithm for group decision-making in the case of the additive aggregation model is as follows.
Step 1. By maximizing the expected exponential utility, obtain the optimal policy $\pi^{*}_{i,e}$ of the most risk-averse agent as $w \to -\infty$.
Step 4. Calculate the wealth level threshold $\hat{w}$ according to (25).
Step 5. For the wealth level interval $(\hat{w}, w_0]$, increase the wealth level further according to the reward obtained by executing one action, and determine the wealth level thresholds and optimal policies as in the steps above.
Step 6. If, for all $s \in S'$, the wealth level $w$ is larger than $w_0$, then end the algorithm.
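The threshold computation in the steps above can be sketched for a hypothetical single-state model by comparing a one-step deviation with the policy assumed optimal at very low wealth and locating the break-even wealth level by bisection. The model (action `a2` reaches the goal for certain at cost 3, action `a1` costs 1 and reaches the goal with probability 0.5), the agent tuples `(k_i, D_i, gamma_i)`, and the search interval are assumptions for illustration; this is not the paper's full algorithm.

```python
# Hypothetical agents: (k_i, D_i, gamma_i) with weights summing to 1.
AGENTS = [(0.5, 1.0, 0.9), (0.5, 1.0, 0.6)]


def group_utility(w):
    """Additive aggregation of one-switch utilities at final wealth level w."""
    return sum(k * (w - D * gamma ** w) for k, D, gamma in AGENTS)


def value_low_wealth_policy(w):
    """Expected group utility of taking a2, assumed optimal as w -> -infinity."""
    return group_utility(w - 3.0)


def value_one_step_deviation(w):
    """Take a1 once, then follow a2 if the goal has not been reached."""
    return 0.5 * group_utility(w - 1.0) + 0.5 * value_low_wealth_policy(w - 1.0)


def advantage(w):
    """Positive when deviating to a1 at wealth level w beats the low-wealth policy."""
    return value_one_step_deviation(w) - value_low_wealth_policy(w)


def find_threshold(lo=-60.0, hi=0.0, tol=1e-10):
    """Bisection for the wealth level at which the deviation breaks even."""
    if not (advantage(lo) < 0.0 < advantage(hi)):
        raise ValueError("the search interval does not bracket a threshold")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if advantage(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


threshold = find_threshold()
print(threshold)
```

Below the returned wealth level the low-wealth policy stays optimal in this toy model, and just above it the deviation becomes preferable. In a model with several states and actions, one such break-even point would be computed per state-action pair and the smallest one taken, in the spirit of Step 4.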

The Case of Multiplicative Aggregation Model.
For the multiplicative aggregation model of one-switch utility functions, we also have the following theorem, which shows the existence of a wealth level threshold $\hat{w}$.
Theorem 4. For all optimal policies $\pi^{*}$, there exists a wealth level threshold $\hat{w}$ such that, for any state $s \in S'$ and wealth level $w \le \hat{w}$, $\pi^{*}(s, w) = \pi^{*}_{-\infty}(s)$. Please see Appendix B for the proof of Theorem 4.
As for the additive aggregation model, we determine the wealth level threshold $\hat{w}$ for the multiplicative aggregation model. Assume that $\pi^{*}_{-\infty}$ is the optimal policy on the wealth level interval $(-\infty, \hat{w}]$; then, for a wealth level $w$ in the interval $(\hat{w}, \min_{s \in S'} (\hat{w} - r(s, a, s'))]$, $\pi^{*}_{-\infty}$ is no longer the optimal policy.
Step 1. By maximizing the expected exponential utility, obtain the optimal policy $\pi^{*}_{3,e}$ of the virtual agent as $w \to -\infty$.
Step 4. Calculate the wealth level threshold $\hat{w}$ according to (25).
Step 5. For the wealth level interval $(\hat{w}, w_0]$, increase the wealth level further according to the reward obtained by executing one action, and determine the wealth level thresholds and optimal policies as in the steps above.
Step 6. If, for all $s \in S'$, the wealth level $w$ is larger than $w_0$, then end the algorithm.

Numerical Examples
Consider a simple GDMDP model. There are two agents, Agent 1 and Agent 2, with risk aversion parameters $\gamma_1$ and $\gamma_2$, respectively. The state set of the GDMDP model consists of the initial state $s_0$ and the goal state $g$. The agents' action set is $\{a_1, a_2\}$. In state $s_0$, executing action $a_j$ ($j = 1, 2$) results in a finite reward $r_j$ and a transition to the goal state $g$ with probability $p_j$. When an agent reaches the goal state $g$, it stops acting and receives no more rewards thereafter. Figure 1 shows the transitions of the system states. The agents need to agree on an optimal policy to reach the goal state.
First, consider the situation in which each agent makes decisions alone. Each agent's optimal policy and wealth level threshold are computed with the method proposed by Liu and Koenig [8]. The results are shown as follows: Next, we consider group decision-making based on the additive and multiplicative aggregation models of one-switch utility functions. In the case of the additive aggregation model, assume that the agents have equal weights; that is, $k_1 = k_2 = 0.5$; then the optimal policy and the wealth level threshold obtained with the proposed method are as follows: The above result shows that, in the wealth level interval $(-106.5, -16.4)$, the optimal policy is to take action $a_1$ in state $s_0$ if Agent 1 makes decisions alone, but $a_2$ if Agent 2 makes decisions alone. If they make decisions together, then action $a_2$ is taken. Now we consider how the wealth level threshold $\hat{w}$ of group decision-making changes as the weights of the agents vary. In detail, $k_1$ changes from 0.05 to 0.95, while $k_2 = 1 - k_1$ changes from 0.95 to 0.05. The change of the wealth level threshold of group decision-making is shown in Figure 2.
Figure 2 shows that the wealth level threshold $\hat{w}$ of group decision-making stays close to the wealth level threshold of Agent 2, who is more risk-averse, even when $k_2$ is small and $k_1 = 1 - k_2$ is large. This means that the weights have little influence on group decision-making when the agents' risk aversion parameters differ, whereas the risk aversion parameters play an important role in this situation.
Consider the situation in which the two agents have similar risk attitudes; that is, their risk aversion parameters are similar; for example, their one-switch utility functions are $U_1(w) = w - 0.952^{w}$ and $U_2(w) = w - 0.95^{w}$, respectively. When each agent makes decisions alone, their optimal policies and wealth level thresholds are as follows: Changing the values of the weights in the same way as for Figure 2 gives the result shown in Figure 3. In contrast to Figure 2, Figure 3 shows that the wealth level threshold $\hat{w}$ of group decision-making is close to the wealth level threshold of Agent 1 when $k_2$ is small and $k_1 = 1 - k_2$ is large. So if the difference between the agents' risk aversion parameters is small, the weights of the agents play a critical role. Finally, we consider group decision-making based on the multiplicative aggregation model and focus in particular on the influence of the product term of the group utility, that is, $k_{12} U_1(w) U_2(w)$, on group decision-making. Given the same one-switch utility functions of the agents as in Figure 3 and assuming two sets of values of $k_1$ and $k_2$, in detail, $k_1 = 0.85$, $k_2 = 0.1$ and $k_1 = 0.$… with $k_{12}$ varying from $-0.001$ to $-0.05$, we get two curves of the wealth level threshold of group decision-making in Figure 4.
The two curves gradually approach each other and converge to a point as the absolute value of $k_{12}$ increases from 0.001 to 0.05. Compared with the result of the additive aggregation model, this point is very close to the wealth level threshold of the additive aggregation model with $k_1 = k_2 = 0.5$. The reason is that, when the absolute value of $k_{12}$ is small, the absolute values of $k_1 U_1(w)$ and $k_2 U_2(w)$ in the group utility are larger than the absolute value of $k_{12} U_1(w) U_2(w)$, so $k_{12} U_1(w) U_2(w)$ has little influence on group decision-making. When the absolute value of $k_{12}$ becomes large enough, $k_{12} U_1(w) U_2(w)$ mainly determines group decision-making; furthermore, $k_{12} U_1(w) U_2(w)$ has the same influence on Agent 1 and Agent 2; therefore the wealth level threshold of the multiplicative aggregation model approaches the threshold of the additive aggregation model with $k_1 = k_2 = 0.5$. This implies that the multiplicative aggregation model prevents group decision-making from being dominated completely by the weights of the individuals.
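A weight sweep of the kind summarized in Figure 2 can be sketched as follows. The single-state model, the risk aversion parameters 0.9 and 0.6, and the break-even condition are hypothetical, so the numbers do not reproduce the paper's figure; the sketch only shows how a threshold could be recomputed for each weight $k_1$ with $k_2 = 1 - k_1$.

```python
def threshold_for_weights(k1, gamma1=0.9, gamma2=0.6, lo=-80.0, hi=0.0, tol=1e-10):
    """Break-even wealth level of a hypothetical two-action model for weight k1.

    Action a2 reaches the goal for certain at cost 3; action a1 costs 1 and
    reaches the goal with probability 0.5, after which a2 is taken.
    """
    k2 = 1.0 - k1

    def utility(w):
        return k1 * (w - gamma1 ** w) + k2 * (w - gamma2 ** w)

    def advantage(w):
        # One-step deviation to a1 minus the value of taking a2 immediately.
        return 0.5 * utility(w - 1.0) + 0.5 * utility(w - 4.0) - utility(w - 3.0)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if advantage(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


weights = [0.05 + 0.1 * n for n in range(10)]   # k1 from 0.05 to 0.95
thresholds = [threshold_for_weights(k1) for k1 in weights]
for k1, t in zip(weights, thresholds):
    print(round(k1, 2), round(t, 3))
```

In this toy model the threshold moves to lower wealth levels as the weight of the less risk-averse agent grows. How strongly it moves depends on how different the two risk aversion parameters are, which is the kind of comparison made between Figures 2 and 3.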

Conclusion and Future Works
This paper has studied how to extend a single agent's risk-sensitive decision-theoretic planning under the MDP framework to the multiagent problem. Based on the one-switch utility function, which describes an agent's risk-sensitive attitude, additive and multiplicative aggregation models of group utility have been proposed. According to the characteristics of the group utility, a backward-induction method has been presented to divide the wealth level interval and compute the optimal policy. The paper has also offered numerical examples and discussed how the weights and risk aversion parameters influence group decision-making. The numerical examples show that, for the additive aggregation model, the risk aversion parameters have a clear influence on group decision-making when they differ between agents, while the weights of the agents play a critical role when the risk aversion parameters are similar. For the multiplicative aggregation model, group decision-making is not dominated completely by the weights of the individuals; the product term of the group utility also influences group decision-making.
In the future we intend to study multiattribute group decision-making under the MDP framework with one-switch utility functions. Based on the work of Tsetlin and Winkler [21], we will further study how to extend our method to that group decision-making problem.

Figure 2: The change of the wealth level threshold of group decision-making with different risk aversion parameters of agents.

Figure 3: The change of the wealth level threshold of group decision-making with similar risk aversion parameters of agents.

Figure 4: The change of the wealth level threshold of group decision-making based on the multiplicative aggregation model.