Combining Multiple Strategies for Multiarmed Bandit Problems and Asymptotic Optimality

This brief paper provides a simple algorithm that, at each time, selects a strategy from a given set of multiple strategies for stochastic multiarmed bandit problems, and then plays the arm chosen by the selected strategy. The algorithm follows the idea of the probabilistic ε_t-switching in the ε_t-greedy strategy and is asymptotically optimal in the sense that the selected strategy converges to the best in the set under some conditions on the strategies in the set and on the sequence {ε_t}.


Introduction
This paper considers the stochastic non-Bayesian multiarmed bandit (MAB) problem, in which a player has to decide which arm of a bandit to play at each time, among the available arms, to maximize the sum of rewards earned through a sequence of plays. When played, each arm provides a random reward drawn from an unknown distribution specific to that arm. The problem models the well-known trade-off between "exploration" and "exploitation" in sequential learning: the player needs to obtain new knowledge (exploration) and at the same time optimize her decisions based on existing knowledge (exploitation), and she attempts to balance these competing tasks in order to achieve her goal. Many practical problems, for example, in networking [1, 2], in games [3], and in prediction [4], and problems such as clinical trials and ad placement on the Internet (see, e.g., [1, 5, 6] and the references therein) have been studied with (properly extended) models of the MAB problem.
Specifically, we consider a stochastic K-armed bandit problem where there is a finite set of arms A = {1, 2, ..., K}, K > 1, and one arm in A needs to be sequentially played. When an arm a ∈ A is played at time t ≥ 1, the player obtains a bounded sample reward X_{a,t} ∈ R drawn from an unknown distribution associated with a, whose unknown expectation and variance are μ_a and σ_a², respectively. We define a strategy π = {π_t, t = 1, 2, ...} as a sequence of mappings such that π_t maps from the set of past plays and rewards, H_{t−1} := (A × R)^{t−1}, to the set of all possible distributions over A, where H_0 is an arbitrarily given nonempty subset of R. We denote the set of all possible strategies by Π. Given a particular sequence h_{t−1} ∈ H_{t−1} of the past plays and rewards obtained by following π ∈ Π over t − 1 time steps, π selects a ∈ A to be played at time t with probability π_t(h_{t−1})(a). We assume that π_1(h_0) is arbitrarily given. Let the random variable I_t^π denote the arm selected by π at time t and let f_t^π(a) be the distribution over A given by π at time t, so that f_t^π(a) = Pr{I_t^π = a}. We assume that X_{a,t} and X_{b,s} are independent for any a ≠ b and any t, s, and that the X_{a,t}'s for t ≥ 1 are identically distributed for any fixed a in A.
Let μ* = max_{a∈A} μ_a and A* = {a ∈ A | μ_a = μ*}. For a given π ∈ Π, if ∑_{a∈A\A*} f_t^π(a) → 0 as t → ∞, then we say that π is an asymptotically optimal strategy. The notion of asymptotic optimality was introduced by Robbins [5]. He presented a strategy which achieves the optimality for the K = 2 case where, in a single play, each a ∈ A produces a reward of 1 or 0 with unknown probabilities μ_a and 1 − μ_a, respectively. Bather [7] considered the same Bernoulli problem with K ≥ 2 and established an asymptotically optimal index-based strategy π such that, at time t, π selects an arm in arg max_{a∈A} {X̄_{a,T_a^π(t−1)} + λ(T_a^π(t−1)) Z_a(T_a^π(t−1))}, where T_a^π(t−1) denotes the number of times a has been played by π during the first t − 1 plays, X̄_{a,n}, n ≥ 1, denotes the average of the n reward samples obtained by playing a when π is followed, {λ(n), n ≥ 1} is a sequence of strictly positive constants such that λ(n) → 0 as n → ∞, and Z_a(n), a ∈ A, n ≥ 1, are i.i.d. positive and unbounded random variables whose common distribution function F satisfies F(0) = 0 and F(z) < 1 for all z > 0. The idea is to ensure that each arm is played infinitely often by adding small perturbations to X̄_{a,T_a^π(t−1)} and to make the effect vanish as T_a^π(t−1) increases. The well-known asymptotically optimal ε_t-greedy strategy [8] with ∑_{t=1}^∞ ε_t = ∞ and lim_{t→∞} ε_t = 0 follows exactly the same idea: at time t, with probability ε_t ∈ (0, 1], it selects a ∈ A with probability 1/K and, with probability 1 − ε_t, it selects an arm in arg max_{a∈A} {X̄_{a,T_a^g(t−1)}}, where g refers to the ε_t-greedy strategy.
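The ε_t-greedy rule just described can be sketched in a few lines. The following is a minimal illustration on Bernoulli arms, with ε_t = min(1, c/t) chosen so that ∑ε_t = ∞ and ε_t → 0; the class and function names are ours, not notation from [8].

```python
# Minimal sketch of the eps_t-greedy strategy on Bernoulli arms.
import random

class EpsGreedy:
    """With probability eps_t pick an arm uniformly; otherwise pick the
    empirically best arm. eps_t = min(1, c/t) satisfies the conditions
    sum_t eps_t = infinity and eps_t -> 0 used in the text."""
    def __init__(self, n_arms, c=5.0, rng=None):
        self.n = n_arms
        self.c = c
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms   # sample averages X-bar_{a, T_a(t)}
        self.t = 0
        self.rng = rng or random.Random(0)

    def select(self):
        self.t += 1
        eps = min(1.0, self.c / self.t)
        if self.rng.random() < eps:
            return self.rng.randrange(self.n)   # explore uniformly
        return max(range(self.n), key=lambda a: self.means[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental sample-average update
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def run(means, horizon=20000, seed=1):
    rng = random.Random(seed)
    strat = EpsGreedy(len(means), rng=random.Random(seed + 1))
    picks = [0] * len(means)
    for _ in range(horizon):
        a = strat.select()
        r = 1.0 if rng.random() < means[a] else 0.0
        strat.update(a, r)
        picks[a] += 1
    return picks

picks = run([0.2, 0.8, 0.5])
print(picks)  # play counts concentrate on the best arm over time
```

Because exploration decays but its sum diverges, each arm is sampled infinitely often while the greedy choice dominates in the limit, mirroring Bather's perturbation idea.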
This brief paper provides a randomized algorithm ε_t-comb(Φ), which follows the spirit of the ε_t-greedy strategy, for combining the multiple strategies in a given finite nonempty Φ ⊂ Π. At each time t, we use the probabilistic ε_t-switching either to select a strategy uniformly from Φ or to select the strategy with the highest sample average of the rewards obtained so far by playing the bandit. Once a strategy π is selected, the arm chosen by π is played on the bandit. Analogous to the case of the ε_t-greedy strategy, the algorithm is asymptotically optimal in the sense that the selected strategy converges to the "best" in the set under some conditions on the strategies in Φ and on {ε_t}.

Related Work
In the following, we briefly summarize the works in the literature most relevant to the results of the present paper. A seminal work by Gittins and Jones [9] provides an optimal policy (or allocation index rule) that maximizes the discounted reward over an infinite horizon when the rewards are given by Markov chains whose statistics are perfectly known in advance. Note that our model does not consider discounting in the rewards and assumes that the relevant statistics are unknown.
Auer et al. [10] presented an algorithm, called Exp4, which combines multiple strategies in a nonstochastic bandit setting. In the nonstochastic MAB, it is assumed that each arm is initially assigned an arbitrary and unknown sequence of rewards, one for each time step. In other words, the rewards obtained by playing a specific arm are predetermined. In Exp4, the "uniform expert," which always selects an action uniformly over A, needs to be always included in Φ. At each time t, Exp4 mixes the distributions over A recommended by the experts according to weights updated exponentially in the experts' estimated cumulative rewards, and plays an arm drawn from the resulting distribution.
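For concreteness, here is a hedged sketch of the Exp4 update as it is commonly stated (exponential weights over experts, a γ-uniform exploration mixture, and importance-weighted reward estimates); the variable names and the toy experts below are our own illustration, not notation from [10].

```python
# Sketch of Exp4: exponential weighting over expert advice vectors.
import math
import random

def exp4(experts, reward_fn, horizon, gamma, seed=0):
    """experts: list of functions t -> probability vector over K arms."""
    rng = random.Random(seed)
    K = len(experts[0](1))
    w = [1.0] * len(experts)          # one weight per expert
    total = 0.0
    for t in range(1, horizon + 1):
        advice = [e(t) for e in experts]
        W = sum(w)
        # mix expert advice, then blend with gamma/K uniform exploration
        p = [(1 - gamma) * sum(w[i] * advice[i][j] for i in range(len(w))) / W
             + gamma / K for j in range(K)]
        r, acc, j_t = rng.random(), 0.0, K - 1
        for j in range(K):            # sample arm j_t from p
            acc += p[j]
            if r < acc:
                j_t = j
                break
        x = reward_fn(j_t, rng)
        total += x
        # importance-weighted reward estimate for the played arm
        xhat = [x / p[j] if j == j_t else 0.0 for j in range(K)]
        for i in range(len(w)):
            yhat = sum(advice[i][j] * xhat[j] for j in range(K))
            w[i] *= math.exp(gamma * yhat / K)
    return total, w

uniform = lambda t: [0.25] * 4            # the required "uniform expert"
expert1 = lambda t: [1.0, 0.0, 0.0, 0.0]  # always recommends arm 0
means = [0.9, 0.2, 0.2, 0.2]
reward = lambda a, rng: 1.0 if rng.random() < means[a] else 0.0
total, w = exp4([uniform, expert1], reward, 5000, gamma=0.1)
print(w)  # the expert recommending the good arm accumulates weight
```

In our stochastic setting, this update tracks the expert with the highest estimated cumulative reward rather than converging to a best strategy in the asymptotic-optimality sense studied here.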
McMahan and Streeter [13] proposed a variant of Exp4, called NEXP, in the same nonstochastic setting. NEXP needs to solve a linear program (LP) at each time step to obtain a distribution over A that offers a "locally optimal" trade-off between exploration and exploitation. Although some improvement over Exp4 was shown, this comes at the expense of solving an LP at every time step.
de Farias and Megiddo [14] presented another variant of Exp4, called EEE, within a "reactive" setting. In this setting, at each time a player chooses an arm and an environment chooses its state, which is unknown to the player. The reward obtained by the player depends on both the chosen arm and the current state but is not necessarily determined by a distribution specific to them. An example of this setting is playing a repeated game against another player. When an expert is selected by EEE for a phase, it is followed for multiple time steps during the phase, rather than a different expert being picked at each time, and the average reward accumulated over that phase is kept track of. At each phase, either the current best strategy with respect to the estimate of the average reward or a random strategy is selected, via a control rule of exploration and exploitation similar to the ε_t-schedule we consider here. (See also the survey section in [15] for expert-combining algorithms in different scenarios.)
Because these representative approaches combine multiple experts, we compare them with our algorithm after adapting them to our setting (cf. Section 4). More importantly, however, the notion of the best strategy in nonstochastic or reactive settings does not directly apply to the stochastic setting. Establishing some kind of asymptotic optimality for Exp4 or its variants with respect to a properly defined best strategy (even after adapting Exp4 as a strategy in Π) is an open problem. In fact, to the authors' best knowledge, there seems to be no notable work yet which studies asymptotic optimality in combining multiple strategies for stochastic multiarmed bandit problems.
Finally, we stress that this paper focuses on asymptotic optimality as the performance measure, also termed "instantaneous regret" [8], and not on the "expected regret" typically considered in the (recent) bandit-theory literature (see, e.g., [12] for a survey). It is worthwhile to note that instantaneous regret is a stronger measure of convergence than expected regret [8].

Algorithm and Convergence
Assume that a finite nonempty subset Φ of Π is given. Once π ∈ Φ is selected by the algorithm γ = ε_t-comb(Φ) at time t, the bandit is played with the arm selected by π, and the resulting sample reward is credited to π. We formally describe the algorithm γ = ε_t-comb(Φ) below.
The γ = ε_t-comb(Φ) Algorithm. Note that ε_t-comb(Φ) as given above involves a general schedule of {ε_t}. By setting the {ε_t}-schedule in ε_t-comb(Φ) properly, {ε_t} subsumes the schedules used in the ε-greedy, ε-first, and ε-decreasing strategies [16, 17]. In particular, as a special case, if Φ = {π^a, a = 1, ..., K} and each π^a is given such that f_t^{π^a}(a) = 1 for all t, then γ degenerates to the ε_t-greedy strategy. As shown in the experimental results in [16, 17], the performance of the tuned ε-greedy strategy is no worse than (or very close to) those of the ε-first, ε-decreasing, and related strategies, even after tuning the schedule (and the relevant parameters) each strategy uses. However, because these schedules are usually tuned heuristically and ε-greedy uses a constant value of ε, it is not necessarily guaranteed that employing such schedules in ε_t-comb(Φ) achieves asymptotic optimality. Furthermore, it is very difficult to tune the value of ε in advance. The theorem below establishes general conditions for asymptotic optimality of γ with respect to a properly defined best strategy in Φ. It states that if each π ∈ Φ is selected infinitely often by γ, each a ∈ A is selected infinitely often by each π, each π's action-selection distribution converges to a stationary distribution, and the selection of γ becomes greedy in the limit, then the strategy selected by γ converges to the best strategy in Φ.
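The selection rule of ε_t-comb(Φ) described above can be sketched as follows; the strategy interface (select/update) and the schedule ε_t = min(1, c/t) are our assumptions, not the paper's formal statement. With the degenerate single-arm strategies of the special case noted above, the sketch reduces to the ε_t-greedy strategy.

```python
# Sketch of eps_t-comb(Phi): with probability eps_t pick a strategy
# uniformly from Phi; otherwise pick the strategy with the highest
# sample-average reward earned so far. Play the arm the chosen strategy
# selects and credit the reward to that strategy.
import random

def eps_comb(strategies, reward_fn, horizon, c=5.0, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(strategies)   # times each strategy was selected
    means = [0.0] * len(strategies)  # sample-average reward per strategy
    uses = [0] * len(strategies)
    for t in range(1, horizon + 1):
        eps = min(1.0, c / t)        # sum eps_t = inf, eps_t -> 0
        if rng.random() < eps:
            k = rng.randrange(len(strategies))       # explore a strategy
        else:
            k = max(range(len(strategies)), key=lambda i: means[i])
        arm = strategies[k].select()                 # the strategy plays
        r = reward_fn(arm, rng)
        strategies[k].update(arm, r)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]       # credit reward to k
        uses[k] += 1
    return uses

class FixedArm:
    # degenerate strategy pi^a with f_t(a) = 1, as in the special case
    def __init__(self, a): self.a = a
    def select(self): return self.a
    def update(self, arm, r): pass

arm_means = [0.2, 0.8, 0.5]
reward = lambda a, rng: 1.0 if rng.random() < arm_means[a] else 0.0
uses = eps_comb([FixedArm(a) for a in range(3)], reward, 20000)
print(uses)  # selections concentrate on the best strategy
```

Any object exposing select/update can be placed in Φ here, so heterogeneous strategies (e.g., an index policy and a learning automaton) can be combined without changing the outer loop.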
Theorem 1. Given a finite nonempty Φ ⊂ Π, consider γ = ε_t-comb(Φ). Suppose that lim_{t→∞} ε_t = 0, that ∑_{t=0}^∞ ε_t = ∞, that there exists δ > 0 such that f_t^π(a) ≥ δ for all π ∈ Φ, a ∈ A, and t ≥ 1, and that lim_{t→∞} f_t^π(a) = f^π(a) for all π ∈ Φ and a ∈ A. Then the strategy selected by γ converges to the best strategy in Φ.

We sketch the argument. As t → ∞, the first term on the right-hand side of (4) goes to zero because T_π^γ(t − 1) → ∞ from ∑_{t=0}^∞ ε_t = ∞ while T_π^γ(τ) < ∞. For the second term on the right-hand side of (4), we rewrite it as in (5) and establish the convergence of the right product term in (5), which can be rewritten via (6) into the form (7). Because f_t^π(a) ≥ δ for all t, we have that, as t → ∞, T_a^π(T_π^γ(t − 1)) − T_a^π(τ) → ∞. Therefore the second product term inside the summation on the right-hand side of (7) goes to f^π(a) by the law of large numbers, and the convergence of the first product term inside the summation on the right-hand side of (7) follows by Poisson's limit theorem [18, Chapter 11].

We remark that this algorithm can be used for solving a bandit problem in a decomposed manner. Suppose that we partition A into M nonempty subsets A_i, i = 1, ..., M, such that A_i ∩ A_j = ∅ for i ≠ j and ⋃_{i=1}^M A_i = A. Choose any asymptotically optimal strategy and associate A_i with a strategy π_i such that ∑_{a∈A_i\A_i*} f_t^{π_i}(a) → 0 as t → ∞, where A_i* = {a ∈ A_i | μ_a = max_{b∈A_i} μ_b}.

A Numerical Example
For a proof-of-concept implementation of the approach, we consider three simple numerical examples.
For the first case, A is partitioned into A_i, i = 1, 2, 3, such that A_1 = {1, 2, 3, 4}, A_2 = {5, 6, 7}, and A_3 = {8, 9, 10}, and the ε_t-greedy strategy associated with A_i, playing only the arms in A_i, corresponds to π_i (cf. the remark given at the end of Section 3). Thus we have Φ = {π_i, i = 1, 2, 3} and, trivially, the best strategy is π_1. The second case considers combining two pursuit learning automata (PLA) algorithms [19] with different learning rates, designed for solving stochastic optimization problems. Even though PLA was not designed specifically for solving multiarmed bandit problems, PLA guarantees "ε-optimality" and can be cast as a strategy for playing the bandit. (Roughly, the error probability of choosing nonoptimal solutions is bounded by ε.) The first PLA strategy uses the learning rate of 0.000002, which corresponds to the parameter setting in [19, Theorem 3.1] with ε = 0.1 for a theoretical performance guarantee, and the second one uses the tuned learning rate of 0.002, which achieves the best performance among the various rates we tested for the above distribution. These two PLAs are contained in Φ, and the PLA with learning rate 0.002 is taken as the best strategy.
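As a rough illustration of how a PLA can be cast as a bandit strategy, the following sketch maintains an action-probability vector and moves it toward the arm with the current best reward estimate at learning rate lam; the initialization and the form of the estimates are our simplification for illustration, not the exact algorithm of [19].

```python
# Simplified pursuit learning automaton (PLA) as a bandit strategy:
# sample an arm from probability vector p, update that arm's reward
# estimate, then "pursue" the arm with the best estimate.
import random

def pla(means, horizon, lam, seed=0):
    rng = random.Random(seed)
    K = len(means)
    p = [1.0 / K] * K            # action probabilities
    d = [0.0] * K                # reward estimates
    n = [0] * K
    for _ in range(horizon):
        r, acc, a = rng.random(), 0.0, K - 1
        for j in range(K):       # sample arm a from p
            acc += p[j]
            if r < acc:
                a = j
                break
        x = 1.0 if rng.random() < means[a] else 0.0
        n[a] += 1
        d[a] += (x - d[a]) / n[a]            # sample-average estimate
        m = max(range(K), key=lambda j: d[j])
        # move p toward the unit vector of the current best estimate
        p = [(1 - lam) * p[j] + (lam if j == m else 0.0) for j in range(K)]
    return p

p = pla([0.2, 0.8, 0.5], horizon=5000, lam=0.002)
print(p)  # probability mass shifts toward the best arm
```

A smaller lam slows the concentration of p but reduces the chance of locking onto a suboptimal arm early, which is the trade-off behind the two learning rates compared in the text.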
Figures 1-3 show the percentage of selections of the best strategy and of plays of the optimal arm by (tuned) ε_t-comb(Φ) for each case, along with those of (tuned) Exp4, NEXP, and EEE, respectively. (The percentage for the optimal arm for the second case is not shown due to the space constraint.) A tuned strategy corresponds to the best empirical parameter setting we obtained. The performances of all tested strategies were averaged over 1000 different runs, where each strategy is followed over 100,000 time steps in a single run. For the first case, that is, Φ = {3 ε_t-greedy's}, we set ε(π) = 0.15 for all π ∈ Φ, ε(γ) = 0.15 for γ = ε_t-comb(Φ), and ε(EEE) = 0.15 for EEE. (The value of 0.15 was chosen for a reasonable allocation of exploration and exploitation.) The tuned γ uses 0.075 for ε(γ) and the tuned EEE uses 0.07 for ε(EEE). Exp4 uses 0.0116 for its exploration parameter. The third case considers combining two different strategies, the ε_t-greedy strategy and UCB1-tuned [8], and was chosen to show some robustness of ε_t-comb(Φ).
Here the ε_t-greedy strategy uses the constant value 0.3 for the {ε_t}-schedule. (Tuning the ε_t-greedy strategy would make it more competitive with UCB1-tuned; this particular value was chosen for illustration purposes only.) Figure 4 shows the average regret of each strategy π tested over t time steps for two different distributions, respectively, calculated by 10^{−3} ∑_{e=1}^{10^3} ∑_{a∈A} (μ* − μ_a) T_{a,e}(t), where T_{a,e}(t) denotes the number of times arm a has been played by π during the first t plays in the eth run. The first distribution is the same as the one used above for the first and second cases. The second distribution again consists of Bernoulli reward distributions with K = 10, but the reward expectations are given by 0.9 for the optimal arm 1 and 0.7 for all the remaining arms. For the first distribution, UCB1-tuned's regret is much smaller than that of the ε_t-greedy strategy, but for the second distribution, the ε_t-greedy strategy is slightly better than UCB1-tuned. In sum, we have a case where which of the two strategies performs better is distribution-dependent in terms of average regret, even though both achieve asymptotic optimality empirically (not shown here). By combining the two strategies via γ = ε_t-comb({UCB1-tuned, ε_t-greedy}), we obtain a reasonably distribution-independent algorithm for playing the bandit. As we can see from Figure 4, the tuned γ, with ε(γ) = 0.0375 and 0.01, respectively, shows robust performance across the two distributions. For both cases, the performances of the tuned Exp4, NEXP, and EEE are not good.
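The average-regret formula above translates directly into code; the play counts below are hypothetical, for illustration only.

```python
# Average regret over runs: (1/E) * sum_e sum_a (mu* - mu_a) * T_{a,e}(t),
# computed from per-run play counts as in the formula above.
def avg_regret(mu, counts_per_run):
    """counts_per_run[e][a] = number of times arm a was played in run e."""
    mu_star = max(mu)
    E = len(counts_per_run)
    return sum((mu_star - mu[a]) * counts[a]
               for counts in counts_per_run for a in range(len(mu))) / E

mu = [0.9, 0.7, 0.7]                 # second-distribution-style means
runs = [[80, 10, 10], [90, 5, 5]]    # hypothetical play counts, t = 100
print(avg_regret(mu, runs))
```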

Concluding Remarks
In this paper, we provided a randomized algorithm ε_t-comb(Φ) for playing a given stochastic MAB when a finite nonempty strategy set Φ is available. Following the spirit of the ε_t-greedy strategy, the algorithm combines the strategies in Φ and finds the best strategy in the limit. Specifically, at each time t, we use the probabilistic ε_t-switching either to select a strategy uniformly from Φ or to select the strategy with the highest sample average of the rewards obtained so far by playing the bandit. Once a strategy π is selected, the arm chosen by π is played on the bandit. We showed that the algorithm is asymptotically optimal in the sense that the selected strategy converges to the best in the set under some conditions on the strategies in Φ and on {ε_t}, and we illustrated the result by simulation studies on some example problems.
If each π ∈ Φ is stationary (as opposed to the general case studied in the present paper) in that, for a distribution f^π over A, f_t^π = f^π for all t, then each π ∈ Φ can be viewed as an arm, and playing according to π provides a sample reward X_{I_t^π, t} with unknown expectation ∑_{a∈A} f^π(a) μ_a. Therefore, in this case, any asymptotically optimal strategy in Π can be adapted for selecting a strategy at each time, instead of ε_t-comb(Φ), to achieve asymptotic optimality with respect to a strategy that achieves max_{π∈Φ} ∑_{a∈A} f^π(a) μ_a. Finally, although we showed convergence to the best strategy in the set, the convergence rate has not been discussed.