Rational Probabilistic Deciders — Part I : Individual Behavior

This paper models a decision maker as a rational probabilistic decider (RPD) and investigates its behavior in stationary and symmetric Markov switch environments. RPDs take their decisions based on penalty functions defined by the environment. The quality of decision making depends on a parameter referred to as the level of rationality. The dynamic behavior of RPDs is described by an ergodic Markov chain. Two classes of RPDs are considered, local and global: the former take their decisions based on the penalty in the current state, while the latter take all states into account. It is shown that asymptotically (in time and in the level of rationality) both classes behave quite similarly. However, the second largest eigenvalue of the Markov transition matrix for global RPDs is smaller than that for local ones, indicating faster convergence to the optimal state. As an illustration, the behavior of a chief executive officer, modeled as a global RPD, is considered, and it is shown that the company's performance may or may not be optimized, depending on the pay structure employed. While the current paper investigates individual RPDs, a companion paper will address collective behavior.

1. Introduction

1.1. Motivation. The theory of rational behavior (TRB) is a set of models intended to capture one of the main features of the behavior of living organisms: the ability to select the most favorable decisions among all possible options in a decision space. TRB involves two major components: dynamical systems, which model rational behavior, and rules of interaction among these systems and with the environment. Analysis of the resulting complex dynamics reveals fundamental properties of rational behavior. TRB emerged in the early 1960s in the work of mathematicians and physicists, primarily in Russia. A summary of their results can be found in [1, 2], while the earliest publications are [3, 4]. In [1-4], the decision makers were modeled as automata, the states of which represent various decisions. The transitions among the states were driven by inputs (penalties or rewards) generated either by the environment or by the decisions of other automata. The transition diagrams of the automata were designed so that the steady state probabilities corresponding to the decisions with the largest rewards were maximized. This was interpreted as rational behavior of an individual decision maker or of a collective of decision makers.
In [5], this approach was generalized by assuming that the decision makers were not necessarily automata but general dynamical systems in certain decision spaces. The trajectories of these systems were assumed to satisfy two axioms: ergodicity and retardation. Roughly speaking, the ergodicity axiom implied that all possible decisions were considered within the process of decision making, while the retardation axiom required that the trajectories slow down (i.e., retard) in the vicinity of the most advantageous states. Along with enlarging the set of possible decision makers, this framework exhibited additional properties of rational behavior, such as the possibility of rapid convergence to the optimal state, which was impossible in the framework of automata.
These two modeling approaches involved not only issues of rationality but also complex dynamic behavior, which, on one hand, made the analysis of the resulting systems difficult and, on the other hand, obscured the issue of rational behavior. As a result, the steady state probabilities of various decisions, as functions of the parameters of the decision makers and the rules of interaction, were all but impossible to analyze, especially when the environment was time-varying and/or more than one decision maker was involved.
The main purpose of the present work is to develop a purely probabilistic modeling approach to rational behavior, one that does not lead to complicated dynamics and that provides a more complete and transparent analysis of the main features of rational behavior. To accomplish this, we introduce the notion of rational probabilistic deciders (RPDs), which select various states in their decision spaces with certain probabilities. How this selection takes place is left unspecified; it could be accomplished by automata, by the dynamical systems of [5], or by any other, perhaps unknown, mechanism. But, as long as this selection takes place, the approach of this work, being divorced from the issues of the dynamics of decision making, leads to a simple Markov chain analysis of fundamental properties of rational behavior. This paper addresses the issues of individual behavior, while a forthcoming paper analyzes collective behavior.

1.2. Brief review of existing results.
The work on using automata to model the simplest form of rational behavior first appeared in [3], where the so-called finite automata with linear tactics were constructed and their individual behavior in stationary and nonstationary media was investigated. Based on this work, [6-10] developed other types of rational automata and investigated their individual behavior. All these automata were shown to behave optimally with arbitrarily large memory in stationary media. In nonstationary media, it was shown that there exists a finite memory size for the automata to behave optimally.

P. T. Kabamba et al.

The collective behavior of the automata mentioned above was investigated in [4, 11-21]. In particular, [14-16] analyzed the collective behavior of asymptotically optimal automata as players in zero-sum matrix games. The results showed that automata with arbitrarily large memory converge to the saddle point in matrix games with a pure optimal strategy and do not converge to the saddle point in mixed strategies. References [17-19] investigated the collective behavior of automata in the so-called Gur game, and conditions for the group to behave optimally were derived. Specifically, it was shown that as the number of automata, M, and their memory capacity, N, become arbitrarily large, the ratio N/M must exceed some constant for the collective to behave optimally.
Similar results, in the framework of general dynamical systems, were obtained in [5]. In addition, [5] provided many examples of dynamical systems that exhibit rational behavior. Chief among them was the so-called ring-element, which can be viewed as the simplest search algorithm for global extrema. Unlike the automata, where convergence to the optimal decision is exponentially slow (with the exponent being the memory capacity), ring-elements can converge arbitrarily fast. Also, [5] provided a detailed study of collective behavior under homogeneous and nonhomogeneous fractional interactions and showed that, even if N/M tends to 0 (as N, M → ∞), convergence to the optimal state may still take place if the interaction is nonhomogeneous.
The earliest western work on rational behavior appeared in [22-24]. Specifically, [22] introduced a new type of rational automata, and [23, 24] initiated the application to learning systems. The learning system approach, along with applications to telephone routing problems, was continued in [25-28]. The results of this research were summarized in [29].
A number of applications of rational behavior to various practical problems have also been reported. Namely, [30, 31] applied rational automata to distributed systems, and [32-34] discussed applications to cooperative mobile robots, to quality of service for sensor networks, and to control of flapping wings in micro air vehicles, respectively. Recently, TRB was applied to power-efficient operation of wireless personal area networks [35].

1.3. Goals of the study.
The main goals of this work are as follows.
(α) Contributing to TRB by introducing a purely probabilistic modeling framework for rational behavior. (β) Using this framework, developing methods for the analysis of rational behavior, that is, methods that allow one to investigate the steady state probabilities of various decisions as functions of system parameters. (γ) Addressing the issues of synthesis of desirable rational behavior, that is, methods for selecting parameters of RPDs and rules of interactions that lead to desired individual and collective behavior. This paper pursues these goals for the case of individual behavior; a forthcoming paper will address the collective one.
The outline of this paper is as follows. In Section 2, we define and characterize RPDs and state the analysis and synthesis problems. The individual behavior of RPDs in stationary and symmetric Markov switch environments is investigated in Sections 3 and 4, respectively. An application of RPDs to a pay and incentive system is described in Section 5. Finally, in Section 6, we state the conclusions. All proofs are given in the appendices.
(e) when a state transition occurs, the PD selects any other state with equal probability.

The vector of steady state probabilities, where κ_i is the steady state probability of the PD choosing state i, can then be calculated from the balance equations of the resulting Markov chain and, moreover, satisfies (2.11) and (2.12). Equation (2.11) means that, at steady state, the probability of an RPD being in a less penalized state is higher than the probability of it being in a more penalized state. Equation (2.12) means that, as N increases, the RPD becomes more selective, preferring the states that are least penalized. As before, the parameter N is referred to as the level of rationality (LR) of the RPD.
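The selectivity behavior expressed by (2.11) and (2.12) can be illustrated with a minimal numerical sketch. The leave probability p_i = φ_i^N used below, with penalties in (0, 1), is only one admissible choice, not the paper's unique form; the function name `steady_state` and its arguments are ours.

```python
# Two-state local RPD sketch. Assumption (not the paper's unique form):
# the probability of leaving state i at each step is p_i = phi_i**N,
# with penalties phi_i in (0, 1) and level of rationality N.
def steady_state(phi1, phi2, N):
    """Stationary probabilities of the two-state chain with leave
    probabilities p1 and p2: kappa1 = p2 / (p1 + p2)."""
    p1, p2 = phi1 ** N, phi2 ** N
    k1 = p2 / (p1 + p2)
    return k1, 1.0 - k1

k1_low, _ = steady_state(0.2, 0.8, 1)    # mildly selective
k1_high, _ = steady_state(0.2, 0.8, 10)  # strongly prefers the better state
```

As N grows, κ_1 approaches 1: the less penalized state is preferred ever more strongly, which is exactly the meaning of (2.12).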

Analytical characterization of local PDs.
The class of functions satisfying (P.1) is denoted as Π.
Theorem 2.1. An L-PD is, in fact, an RPD if and only if P ∈ Π.
We refer to L-PDs that are RPDs as local RPDs (L-RPDs). Clearly, an L-RPD is characterized by the function P, P ∈ Π. Examples of functions P satisfying (P.1) can be constructed from any strictly increasing function of the penalty; the specific expressions are given in (2.16)-(2.18). Finally, we state an analytic property of P, which will be useful later.

Global RPDs.
Consider PDs that satisfy the following.
(a) The decision space contains two states. (b) The steady state probabilities, κ_i(ϕ_1, ϕ_2; N), i = 1, 2, can be written in the form of a composite function, as in (2.22). The purpose of (2.21) is to simplify the characterization. The function G in (2.22) is intended to model how the PD perceives the ratio of the penalties associated with the states, while the function F in (2.22) models how the PD makes decisions based on this perception. In addition, (2.22) implies that, when the two penalties are equal, the two states are selected with equal probability, which indicates that the decisions of the PD are not prejudiced. We now specify which functions in ᏼ of the PD give rise to (2.22).
Theorem 2.3. A necessary and sufficient condition for ᏼ to lead to (2.22) is given by (2.25). Note that, according to (2.25), the probability of leaving a state depends not only on the penalty of this particular state but also on the penalty of the other state. For this reason, PDs satisfying (a) and (b) are called global probabilistic deciders (G-PDs).
We investigate the properties of F and G that guarantee that a G-PD is an RPD, that is, that (2.11) and (2.12) are satisfied. Assume the following.
We refer to G-PDs that are RPDs as global RPDs (G-RPDs). A G-RPD is characterized by the pair (F, G), F ∈ Ᏺ, G ∈ Ᏻ. The functions in ᏼ of the G-RPD can be reconstructed by Theorem 2.3. For example, consider the G-RPD characterized by the pair (F_1, G_1) defined in (2.26); the corresponding functions in ᏼ can take either of the forms in (2.27) and (2.28). We note that Krinskiy's automata [8] with two actions can be characterized by the pair (F_1, G_1) defined in (2.26). Furthermore, new RPDs can be found by using other functions in Ᏺ and Ᏻ. For example, a G-RPD characterized by a different pair, (F_2, G_2), has not appeared before.
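A concrete instance of the composite form (2.22) can be sketched as follows. The particular choices F(x) = x/(1 + x) and G equal to the penalty ratio are illustrative assumptions for this sketch, not the paper's (F_1, G_1) verbatim.

```python
# Hypothetical instance of the composite form (2.22): kappa_1 = F(G(...)**N).
# Assumed (illustrative): G returns the penalty ratio phi2/phi1,
# and F(x) = x / (1 + x).
def kappa1_global(phi1, phi2, N):
    g = phi2 / phi1       # perceived penalty ratio (the role of G)
    x = g ** N            # sharpened by the level of rationality N
    return x / (1.0 + x)  # turned into a probability (the role of F)
```

Note that kappa1_global(φ, φ, N) = 1/2 for any N, i.e., equal penalties yield unprejudiced decisions, consistent with the property noted after (2.22).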
Remark 2.5. Two methods can be used to extend the characterization of G-RPDs discussed above to G-RPDs with more than two states in the decision space. The first method is to characterize the G-RPD iteratively: in the first step of the iteration, the decision space is partitioned into two subspaces; the G-RPD then selects one of the two subspaces in a probabilistic way, with the probability of this selection modeled in a form similar to (2.22). In the next iteration, the subspace selected in the previous step is partitioned into two subspaces, and the G-RPD proceeds as before. The second method is to characterize the G-RPD in a pairwise fashion. Specifically, for each pair of states, we characterize the probabilities of the decisions of the G-RPD in the form of (2.22). This method is used in Section 5 below.
To conclude this subsection, we formulate a lemma that will be useful in later sections.
Analysis. Given an RPD and an environment, analyze the probability of various decisions as a function of the level of rationality and the parameters of the environment.
Synthesis. Given an RPD and an environment, calculate the level of rationality and/or parameters of the environment that lead to various types of RPD behavior.
Exact formulations of these problems along with appropriate answers are given in Sections 3 and 4.

3.1. Analysis.
In order to characterize the behavior of RPDs qualitatively, we introduce the following definition.

Definition 3.1 (Asymptotically optimal behavior). The behavior of a PD is asymptotically optimal if, for all 0 < ε < 1/2 and for all Φ with ϕ_1 < ϕ_2, there exists an N* such that κ_1(ϕ_1, ϕ_2; N) > 1 − ε for all N ≥ N*.

Clearly, asymptotically optimal behavior means that, no matter how close the penalties associated with the two states are, there is an LR large enough so that the RPD selects the state with the smaller penalty reliably.
We have the following qualitative results.

Theorem 3.2. Both L-RPDs and G-RPDs exhibit asymptotically optimal behavior.
Although, as stated in Theorem 3.2, L-RPDs and G-RPDs behave qualitatively similarly in the asymptotic regime, their behavior for fixed N and ϕ_2 − ϕ_1 might differ. To illustrate this, consider the L-RPD defined by (2.16), (2.18) and the G-RPD defined by (2.26), respectively. In addition, G-RPDs converge to the steady state probabilities faster than L-RPDs. Indeed, when the decision space is {1, 2}, the second largest eigenvalue, λ_2, of the transition matrix (2.5) for L-RPDs satisfies, due to Lemma 2.2, λ_2 → 1 as N → ∞; that is, the transient period tends to infinity as the LR becomes large. This phenomenon does not take place for G-RPDs: for the G-RPD defined by (2.26) and (2.27), λ_2 tends to 1/2 as N → ∞. This qualitatively different behavior is illustrated in Figure 3.2 for the L-RPD defined by (2.16), (2.18) and the G-RPD defined by (2.26), (2.27).
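The eigenvalue contrast can be checked numerically. For a 2 × 2 stochastic matrix [[1 − a, a], [b, 1 − b]] the eigenvalues are 1 and 1 − a − b (since the trace is 2 − a − b), so λ_2 is available in closed form. The leave probabilities below are illustrative stand-ins for (2.16) and (2.26)-(2.27), not the paper's exact formulas: the local rule uses p_i = φ_i^N, which vanishes as N grows, while the stylized global rule uses q_i = 0.5 φ_i^N/(φ_1^N + φ_2^N), whose sum stays at 1/2.

```python
# Second-largest eigenvalue of the two-state chain [[1-a, a], [b, 1-b]]
# is (1 - a - b); the other eigenvalue is always 1.

def lambda2_local(phi1, phi2, N):
    # Assumed local leave probabilities p_i = phi_i**N, phi_i in (0, 1):
    # both vanish as N grows, so lambda_2 -> 1 (slow convergence).
    return 1.0 - phi1 ** N - phi2 ** N

def lambda2_global(phi1, phi2, N):
    # Stylized global leave probabilities that always sum to 1/2,
    # so lambda_2 stays at 1/2 (fast convergence) for every N.
    s = phi1 ** N + phi2 ** N
    return 1.0 - 0.5 * phi1 ** N / s - 0.5 * phi2 ** N / s
```

The sketch reproduces the qualitative picture: the local λ_2 climbs toward 1 with N, while the global λ_2 stays bounded away from 1.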

3.2. Synthesis.
We address the following synthesis question. S: given ϕ_1 < ϕ_2 and 0 < ε < 1/2, how large should N be so that κ_1(ϕ_1, ϕ_2; N) ≥ 1 − ε holds? The following theorem gives an answer to this synthesis question.
Given ϕ_1 < ϕ_2 and 0 < ε < 1/2, the value of N* introduced in Definition 3.1 is given by explicit expressions N*_i, i = 1, 2, which are monotonically decreasing in ε and in the penalty ratio. In addition, we observe that the G-RPD defined by (F_1, G_1) is more efficient than that defined by (F_2, G_2), since it requires a smaller LR for the same probability of making the right decision.
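Under the assumed steady-state law κ_1(N) = r^N/(1 + r^N) with r = ϕ_2/ϕ_1 (an illustrative closed form, not the paper's exact expression for N*), the synthesis question S can be solved explicitly: κ_1(N) ≥ 1 − ε is equivalent to r^N ≥ (1 − ε)/ε.

```python
import math

# Smallest integer level of rationality N such that
# kappa_1(N) = r**N / (1 + r**N) >= 1 - eps, with r = phi2/phi1 > 1.
# (Assumed steady-state law; illustrative, not the paper's N* formula.)
def min_level_of_rationality(phi1, phi2, eps):
    r = phi2 / phi1
    return math.ceil(math.log((1.0 - eps) / eps) / math.log(r))

N_star = min_level_of_rationality(1.0, 2.0, 0.05)  # -> 5 for these values
```

Consistent with the monotonicity above, the required N decreases as ε grows or as the penalties become easier to distinguish (larger r).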

4. RPDs in symmetric Markov switch environments
where ϕ_1(n) and ϕ_2(n) are the penalties associated with states 1 and 2, respectively, at time n; (c) the dynamics of ϕ_i(n), i = 1, 2, are defined by a symmetric Markov chain with two states, E_1 and E_2, and a state transition matrix whose diagonal entries are 1 − ε and off-diagonal entries are ε, 0 < ε < 1. At time n, if the environment is in state E_1, the penalties are ϕ_1(n) = φ_1 and ϕ_2(n) = φ_2; in state E_2, the penalties are interchanged, where 0 < φ_i < ∞, i = 1, 2, and, without loss of generality, φ_1 < φ_2. The environment defined by (a)-(c) is referred to as a symmetric Markov switch environment. In this section, we consider the behavior of L-RPDs and G-RPDs in this environment.

4.2. Analysis
where a = p(φ_1), b = p(φ_2), 0 < a < b < 1, and 0 < ε < 1. Denote the steady state probability vector of this chain by κ, where κ_ij is the steady state probability of the environment being in state E_i and the L-RPD being in state j. Solving the stationary equations, we obtain the κ_ij, and hence the average penalty incurred by the L-RPD at steady state, Ψ_ε(N).

In Section 3, the RPDs operate in a stationary environment, and the steady state probability of an RPD being in the less penalized state serves as the measure of performance. However, in a symmetric Markov switch environment, the less penalized state changes from time to time. Hence, the steady state probability of being in a particular state is not a suitable measure of performance; instead, we use Ψ_ε(N). The analysis question considered here is as follows. A: given an L-RPD in a symmetric Markov switch environment, how does the value of Ψ_ε(N) behave as a function of N and ε? A partial answer to this question is provided by the following theorem.

Theorem 4.1. For Ψ_ε(N) defined in (4.7), the four properties (i)-(iv) hold.

Thus, for 1/2 < ε < 1, Ψ_ε(N) is a function of N with a unique maximum that is larger than (φ_1 + φ_2)/2; this is clearly not an interesting case. For 0 < ε < 1/2, Ψ_ε(N) is a function of N with a unique minimum that is smaller than (φ_1 + φ_2)/2. The behavior of Ψ_ε(N) is illustrated in Figure 4.1 for φ_1 = 1, φ_2 = 5, a = 0.2, and b = 0.8.
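The shape of the average penalty as a function of N can be reproduced with a small numerical sketch. Two assumptions are made for concreteness: the leave probability is p_N(φ) = (φ/(1 + φ))^N, one admissible choice rather than the paper's exact function, and the four-state chain is lumped into "matched" (RPD in the currently less penalized state) and "mismatched" states, which is legitimate because the chain is invariant under the simultaneous swap E_1 ↔ E_2, 1 ↔ 2.

```python
# Average steady-state penalty of a local RPD in a symmetric Markov switch
# environment (lumped matched/mismatched chain; assumptions in the text).
def avg_penalty_local(phi1, phi2, eps, N):
    a = (phi1 / (1.0 + phi1)) ** N  # leave prob in the matched state
    b = (phi2 / (1.0 + phi2)) ** N  # leave prob in the mismatched state
    stay = (1 - eps) * (1 - a) + eps * a   # matched -> matched in one step
    enter = (1 - eps) * b + eps * (1 - b)  # mismatched -> matched in one step
    m = enter / (1.0 - stay + enter)       # stationary P(matched)
    return m * phi1 + (1.0 - m) * phi2

psi = [avg_penalty_local(1.0, 5.0, 0.1, N) for N in (0, 3, 50)]
```

For these values the average penalty dips below (φ_1 + φ_2)/2 = 3 at moderate N and climbs back toward 3 as N grows, in line with the unique interior minimum asserted for 0 < ε < 1/2.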
Let N*_ε be the minimizer of Ψ_ε(N) for 0 < ε < 1/2. Then, we have the following theorem. Thus, as the environment switches more slowly, the L-RPD can use a higher LR, which requires more time for the L-RPD to settle in the least penalized state. As a result, the L-RPD is better able to discriminate between the two states in between switches of the environment. Hence, the average penalty incurred at steady state is smaller.

4.2.2. G-RPDs.
Consider a G-RPD characterized by (F_1, G_1) defined in (2.26) with ᏼ given by (2.27). The dynamics of a system consisting of this G-RPD and the symmetric Markov switch environment (a)-(c) are described by a four-state ergodic Markov chain with transition matrix (4.12). The average penalty incurred at steady state, Ψ_ε(N), is computed analogously to the L-RPD case. The analysis question considered here is as follows. A: given a G-RPD in a symmetric Markov switch environment, how does the value of Ψ_ε(N) behave as a function of N and ε? A partial answer is given by the following: unlike for L-RPDs, the larger the LR, the smaller the penalty incurred by the G-RPD. When the LR is arbitrarily large, the G-RPD is least penalized.
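The contrast with the local case can be sketched under a stylized global leave rule in which the probability of leaving a state depends on both penalties, q_i = 0.5 φ_i^N/(φ_1^N + φ_2^N); this is an illustrative assumption, not the exact chain of (2.26)-(2.27). Lumping the four-state chain into matched/mismatched states by its swap symmetry (E_1 ↔ E_2, 1 ↔ 2) gives:

```python
# Average steady-state penalty of a stylized global RPD in a symmetric
# Markov switch environment. Because the leave probability from the
# mismatched state tends to 1/2 (not to 0) as N grows, the average
# penalty decreases monotonically with N in this sketch.
def avg_penalty_global(phi1, phi2, eps, N):
    s = phi1 ** N + phi2 ** N
    q_m = 0.5 * phi1 ** N / s   # leave prob, matched state
    q_mm = 0.5 * phi2 ** N / s  # leave prob, mismatched state
    stay = (1 - eps) * (1 - q_m) + eps * q_m     # matched -> matched
    enter = (1 - eps) * q_mm + eps * (1 - q_mm)  # mismatched -> matched
    m = enter / (1.0 - stay + enter)             # stationary P(matched)
    return m * phi1 + (1.0 - m) * phi2

psi_g = [avg_penalty_global(1.0, 5.0, 0.1, N) for N in (1, 3, 50)]
```

In this sketch, the larger the LR, the smaller the incurred penalty, matching the qualitative conclusion for G-RPDs.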

4.3. Synthesis.
In this section, we ask the following synthesis question.

S: given an L-RPD in a symmetric Markov switch environment with 0 < ε < 1/2, what is the optimal LR, N*_ε, that minimizes Ψ_ε(N)?

The answer to S can be obtained by setting the derivative of Ψ_ε(N) with respect to N to zero and solving for N. However, this involves solving the transcendental equation (4.18). An example is shown in Figure 4.4, where a = 0.2, b = 0.6, and ε = 0.1. An initial approximation of N*_ε can be obtained in closed form, which gives (4.21). Note that this is a very rough approximation of N*_ε, since it depends only on a but not on b. An example is shown in Figure 4.5, where the percentage errors Δ_N and Δ_Ψ are plotted. The approximation of N*_ε can be improved by iterating once in Newton's method for solving (4.18), using the value in (4.21) as the initial guess. For the same values of φ_1, φ_2, a, and b as in Figure 4.5, Figure 4.6 plots the resulting percentage errors as functions of ε. Compared to Δ_N and Δ_Ψ, the improved errors Δ¹_N and Δ¹_Ψ reach −80% and 3%, respectively, as ε approaches 0.45. Further improvement can be expected if Newton's method for solving (4.18) is carried out for more than one iteration.

5. Application
As an application, we use G-RPDs to model the behavior of a Chief Executive Officer (CEO) of a company, to provide insights on how to induce the CEO, through his own self-interest, to act in the best interests of the company.

5.1. Environment.
Assume that the CEO is making decisions in a two-stage decision process. The information structure of the decision process is depicted by the graph in Figure 5.1. At Stage 1, the company is at node A, and the CEO can choose between two decisions, x_1 or x_2, which lead the company to nodes B or C, respectively. Similarly, at Stage 2, whether the company is at node B or C, the CEO can choose between two decisions, x_1 or x_2. At the end of the process, the company is at node S_ij if the CEO takes the sequence of decisions x_i at Stage 1 and x_j at Stage 2. We denote this sequence of decisions by x_i x_j. The numbers a_1, a_2, a_11, a_12, a_21, and a_22 on the edges of the graph denote the reward received by the CEO for each decision made. These numbers are assumed to be functions of the company's stock prices and reflect the situation of the company due to the decisions of the CEO; the larger the number, the better the situation the company is in. The information structure is assumed to be "probabilistically" known to the CEO in the sense that he takes decisions in the form of a G-RPD, where ϕ_1 and ϕ_2 are the penalties associated with states x_1 and x_2, respectively, and F_1 and G_1 are defined in (2.26). The decision processes at Stages 1 and 2 are described below.
Stage 1. At node A, the CEO considers all four possible sequences of decisions, x_i x_j, i, j = 1, 2. The objective reward for the sequence of decisions x_i x_j is a_i + a_ij. However, subjectively, the CEO views his reward for x_i x_j as a_i + αa_ij, 0 ≤ α ≤ 1, where the value of α is defined by the contractual relation between the CEO and the company. When the employment of the CEO is long term, encompassing both stages, α is large; otherwise, α can be small. Thus, the parameter α models how much the rewards at Stage 2 are taken into account by the CEO at Stage 1.
Let κ_ij denote the probability that the CEO favors the sequence of decisions x_i x_j. Then, the probability that the CEO chooses x_i, i = 1, 2, at Stage 1 is κ_i1 + κ_i2. The probabilities κ_ij are determined by pairwise comparison of the sequences of decisions. More specifically, taking the reciprocals of the rewards as penalties, the ratio κ_ij/κ_kl is formed, where the numerator and denominator are the probabilities of the CEO favoring x_i x_j and x_k x_l, respectively, if there were only these two choices.
Since F_1 and G_1 are defined in (2.26), (5.3) yields explicit expressions for the ratios of the κ_ij. Since Σ_{i,j} κ_ij = 1, we obtain the κ_ij themselves, where i, j = 1, 2, i ≠ j. Similarly, if the company is at node C, the probability of selecting x_i follows by the same pairwise comparison. Given the above discussion, at the end of the decision process, the probability that the company is at S_ij is given by (5.9).

5.2. Analysis.
We next analyze how the probabilities P_ij change as functions of α and N. For this purpose, we assume a_1 = 1, a_2 = 5, a_11 = 10, a_12 = 15, a_21 = 2, a_22 = 3. Since a_1 + a_12 has the largest value, the sequence of decisions x_1 x_2 is the best for the company. Figure 5.2 shows P_ij, calculated according to (5.6)-(5.9), as a function of N for various values of α, and Figure 5.3 shows P_ij as a function of α for various values of N.
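A possible executable reading of the two-stage calculation is given below. The closed forms used, Stage 1 weights proportional to (a_i + αa_ij)^N and Stage 2 weights proportional to a_ij^N, are a plausible reduction of the pairwise comparisons under (F_1, G_1) with reciprocal-reward penalties; they are assumptions for illustration, not a verbatim transcription of (5.3)-(5.9).

```python
# Two-stage CEO model with the example rewards of Section 5:
# a_1 = 1, a_2 = 5, a_11 = 10, a_12 = 15, a_21 = 2, a_22 = 3.
def p_sequences(alpha, N, a=(1.0, 5.0), a2=((10.0, 15.0), (2.0, 3.0))):
    # Stage 1: weight of sequence x_i x_j is its subjective reward to power N.
    w = [[(a[i] + alpha * a2[i][j]) ** N for j in range(2)] for i in range(2)]
    total = sum(sum(row) for row in w)
    p1 = [sum(w[i]) / total for i in range(2)]  # P(choose x_i at Stage 1)
    # Stage 2 at the node reached by x_i: weight of x_j is a_ij**N.
    P = [[0.0, 0.0] for _ in range(2)]
    for i in range(2):
        s2 = a2[i][0] ** N + a2[i][1] ** N
        for j in range(2):
            P[i][j] = p1[i] * a2[i][j] ** N / s2
    return P

P_short = p_sequences(0.2, 20)  # short-horizon CEO: P_12 nearly 0
P_long = p_sequences(1.0, 20)   # long-horizon CEO: P_12 nearly 1
```

With these numbers the sketch reproduces observation (i) below: a high LR alone does not make x_1 x_2 likely; a sufficiently large α is also required.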
Based on these figures, we observe the following.
(i) For a fixed α, the probability of the CEO making the best decision for the company, P_12, does not necessarily become large as N becomes large. For small α (0.2 and 0.3), the CEO does not make the best decision with high probability as N becomes large; moreover, P_12 tends to 0 as N becomes large. However, for large α (0.6 and 1), the CEO does make the best decision with high probability if N is large enough. This means that, for the CEO to make the best decision for the company, having a high LR is not enough: the CEO must also take future rewards into account. (ii) For a fixed N, P_12 increases as α becomes larger. This means that, for a CEO with a small LR, the probability of making the best decision can be improved by increasing his ability to take future rewards into account.

5.3. Synthesis.
From the observations in Section 5.2, it follows that it is best for a company to have a CEO who takes future rewards into account when making decisions, that is, a CEO with a large α. One way to ensure a large α is to guarantee a relatively long-term contractual relationship between the CEO and the company.

6. Conclusions
This paper shows that rational behavior can be modeled in a purely probabilistic way, thus avoiding the complex dynamics associated with other approaches. The rational probabilistic deciders introduced in this work, in both the so-called local and global implementations, allow us to quantitatively investigate individual rational behavior in stationary and symmetric Markov switch environments. Although both G-RPDs and L-RPDs perform well, there are qualitative differences between them. Specifically, (1) in the stationary environment, with the same LR and penalties associated with their states, the G-RPD selects its least penalized state with a higher steady state probability than the L-RPD; (2) as the LR becomes large, the rate of convergence for the G-RPD is much faster than that for the L-RPD in stationary environments; (3) in a slowly switching symmetric Markov switch environment, the average penalty incurred by the G-RPD is a monotonically decreasing function of the LR, while for the L-RPD, the average penalty is minimized by a finite optimal LR and increases as the LR deviates from this optimal value. As shown in the application, the results of this paper can be used as a mathematical model for investigating the efficacy of various pay and incentive systems.

C. Proofs of Section 4
Proof of parts (i) and (ii) of Theorem 4.1. We first prove part (i). The function f(N) defined in (4.8) satisfies the required limit relations. Part (ii) then follows from the discussion in Section 4.2. To prove parts (iii) and (iv), we need the following lemmas.
Proof. We note that part (ii) of Theorem 4.1 and Rolle's theorem imply that there must be at least one real positive solution to (C.5). Suppose that 0 < ε < 1, ε ≠ 1/2, and that there are more than two solutions to (C.5). Then, Lemma C.1 implies that these solutions are isolated. Hence, we can find two solutions, N_1 and N_2, to (C.5) such that N_1 < N_2 and there are no other solutions to (C.5) between them. Finally, parts (ii), (iii), and (iv) imply part (i).
Figures 3.1(a) and 3.1(b) show the probability of selecting state 1 as a function of N and ϕ_1, respectively. Clearly, the G-RPD, having more information, outperforms the L-RPD.

4.2.1. L-RPDs. The dynamics of the system consisting of an L-RPD characterized by the function P in (2.16) and the symmetric Markov switch environment (a)-(c) are described by a four-state ergodic Markov chain with the state transition matrix