Some Simple Formulas for Posterior Convergence Rates

We derive some simple relations that demonstrate how the posterior convergence rate is related to two driving factors: a “penalized divergence” of the prior, which measures the ability of the prior distribution to propose a nonnegligible set of working models to approximate the true model, and a “norm complexity” of the prior, which measures the complexity of the prior support, weighted by the prior probability masses. These formulas are explicit, involve no essential assumptions, and are easy to apply. We apply this approach to the case with model averaging and derive some useful oracle inequalities that can optimize the performance adaptively without knowing the true model.


Introduction
In Jiang [1], there are some general results on the posterior convergence rate, which are very simple and easy to apply. The current paper is related to, and developed from, the ideas of Jiang [1]. There is no essentially new idea behind the proofs. However, the results have been much simplified from the earlier groundbreaking works in this area, such as Ghosal et al. [2], Walker [3], and Ghosal et al. [4], so that the current work can be applied much more easily and displays the intrinsic driving factors behind the convergence rate more directly.
The current paper cannot be used to derive new convergence rates better than what are achievable in the existing literature, except that the current convergence rate is described in a $d_t^2$-divergence for any $t \in (-1, 0)$ (defined later), which is more general than the squared Hellinger distance $d_{-1/2}^2$ (corresponding to $t = -1/2$). A recent work by Norets [5] has obtained convergence rates in the stronger Kullback-Leibler divergence $d_0^2$ (corresponding to the limit $t = 0$); however, the rate in $d_0^2$ is much worse than the rate in $d_{-1/2}^2$ (e.g., about $1/\sqrt{n}$ in parametric cases, instead of about $1/n$, where $n$ is the sample size). Interestingly, although we allow any $d_t^2$ for $t \in (-1, 0)$, the rate stays at about $1/n$, which is essentially as good as in the Hellinger case at $t = -0.5$ and does not deteriorate to the rate of the Kullback-Leibler limit at $t = 0$.
Aside from this technical difference in the divergence measures used in the current paper, the difference from the previous works is essentially aesthetic. The key difference is that the previous results are presented as bounds on the posterior probability (outside a neighborhood of the true density), while the current paper presents almost sure bounds on the distance (or divergence) from the true density directly. Applications of previous works sometimes require guessing the convergence rate and then checking that it simultaneously satisfies several inequality conditions, while the current paper presents explicit formulas that are essentially assumption free.

Main Results
Let $D$ denote the observed data generated from a probability distribution $p_0$. Consider a prior distribution $\pi$ supported on a set of densities $\mathcal{P}$. Define for a subset $A \subset \mathcal{P}$ the posterior probability as $\Pi(A) = \int_{p \in A} p(D)\,\pi(dp) / \int_{p \in \mathcal{P}} p(D)\,\pi(dp)$. Denote 0 Π( ) =

Define the divergence $d_t^2(p, q) = t^{-1} \int p\,((p/q)^t - 1)$ between densities $p$ and $q$, on a suitable dominating measure, for $t > -1$. (The Kullback-Leibler divergence corresponds to the limiting case of $t = 0$; the squared Hellinger distance corresponds to the case $t = -1/2$; the $\chi^2$ divergence corresponds to the case $t = 1$.) Then we have the following result.

Proposition 1.
For any $t \in (0, 1)$ and any $\delta \in (0, \infty)$, for any $A \subset \mathcal{P}$, one has

This result requires essentially no assumption. The relation displayed is explicit. In the result, $p_0$ and $p$ are regarded as probability densities of the entire set of data $D$.

Now consider the case with the iid assumption, so that $D = (X_1, \ldots, X_n)$, with $X, X_1, \ldots, X_n$ being iid (independent and identically distributed), generated from density $p_0$ for a single copy $X$. Let $\pi$ be a prior distribution supported on a set $\mathcal{P}$ of densities of $X$. Consider any $A \subset \mathcal{P}$ with any of its countable convex covers $\cup_{j \in \mathbb{N}} B_j$. Using relations such as $p_0(D) = \prod_{i=1}^n p_0(X_i)$ and $1 + x \le e^x$ for any real $x$, the previous result becomes as follows.

Proposition 2.
For iid data with sample size $n$, for any $-1 < -t < 0 < \delta < \infty$ and any $A \subset \mathcal{P}$, one has

The only essential assumption here is that the data are iid. The relation displayed is explicit.
We will now consider a sequence of densities $p^{(n)}$ for iid data, which are generated from the posterior distributions based on iid data with increasing sample sizes $n$, and study how they converge to the true density $p_0$.

Condition 1 ("posterior sequence" of random densities for iid data). A "posterior sequence" $p^{(n)}$ (labeled by sample size $n$) of random density functions in $\mathcal{P}$, in a probability space, satisfies, for any subset $A \subset \mathcal{P}$,

Here $\mathcal{P}$ is a set of density functions, $\pi(\cdot)$ is the prior distribution of $p \in \mathcal{P}$, and $\Pi(\cdot)$ is the fraction in the integrand, which can be regarded as the posterior distribution of $p$ based on iid data $D = (X_1, \ldots, X_n)$.

At any fixed sample size $n$, this probability law is equivalent to assuming that $p^{(n)}$ is sampled from the posterior $\Pi(\cdot)$ given data $D = (X_1, \ldots, X_n)$, and $D$ is an iid sample of $X$ with density $p_0$. We will often omit the superscript and write $p^{(n)} = p$.

Suppose $\mathcal{P}$ can be covered by a finite number of the $B_j$'s, each being an $L_1$ ball with radius $n^{-1/t}$. Then the following result can be obtained.
Remark 4. The result can be extended to continuously valued $t \in (0, 1)$. This is because the divergence $d_t$ is monotonically increasing in $t$. For any $d_{-t}$ such that $1/t$ is not an integer, we can use a more stringent divergence $d_{-t'}$, with $-t'$ being the next larger value from the set $\{-1/2, -1/3, -1/4, \ldots\}$, to bound the convergence rate in $d_{-t}$.
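The special cases of the divergence family and its monotonicity in the index can be checked numerically. The following is only a minimal sketch for discrete densities, assuming the definition $d_t^2(p, q) = t^{-1} \sum_x p(x)\,((p(x)/q(x))^t - 1)$ with the sum in place of the integral; the function name `d2` and the two example densities are illustrative choices, not part of the paper.

```python
import math

def d2(p, q, t):
    """d_t^2(p, q) = t^{-1} * sum_x p(x) * ((p(x)/q(x))**t - 1) for discrete densities.

    t = 0 is treated as the Kullback-Leibler limit sum_x p(x) * ln(p(x)/q(x)).
    Assumes p and q are strictly positive on the same finite support.
    """
    if t <= -1:
        raise ValueError("t must be greater than -1")
    if t == 0:
        return sum(a * math.log(a / b) for a, b in zip(p, q))
    return sum(a * ((a / b) ** t - 1.0) for a, b in zip(p, q)) / t

# Two hypothetical densities on a three-point space (illustration only).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Closed forms of the named special cases, computed independently of d2.
hellinger_sq = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
kullback_leibler = sum(a * math.log(a / b) for a, b in zip(p, q))
chi_sq = sum(a * a / b for a, b in zip(p, q)) - 1.0

# The divergence evaluated on an increasing grid of indices t.
t_grid = (-0.9, -0.5, -0.1, 0.0, 0.5, 1.0)
values = [d2(p, q, t) for t in t_grid]

for t, v in zip(t_grid, values):
    print(f"t = {t:+.1f}   d_t^2 = {v:.6f}")
```

On this example the values increase with $t$, which is the property used to pass from a non-reciprocal-integer index to the next more stringent one.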

Remark 5.
In this and other works, we notice that we often encounter in the convergence rate results a quantity similar to the "penalized divergence" of the form $d_\delta^2(p_0, \pi) \equiv \inf_B [\sup_{p \in B} d_\delta^2(p_0, p) + n^{-1} \ln(1/\pi(B))]$, related to a prior $\pi$. The first part, $\sup_{p \in B} d_\delta^2(p_0, p)$, describes the maximal divergence of a set $B$ (proposed by a prior $\pi$) from $p_0$. We can understand this part as the approximation error of the prior when it is used to propose densities to approximate a true density $p_0$. The second part penalizes an unlikely set $B$ with a small prior $\pi(B)$. Combining the two parts, we can perhaps try to interpret $d_\delta^2(p_0, \pi)$ as the approximation error (away from $p_0$) by a not-too-unlikely set proposed by a prior $\pi$. This "penalized divergence" is a critically important driving factor for determining the convergence rates in the previous results. It is noted that although this factor corresponds to the approximation ability of $\pi$, it already has a complexity penalty built in implicitly. This is from the penalty against a small prior; the second part is $n^{-1} \ln(1/\pi(B))$, which is, roughly speaking, about $K/n$, where $K$ is the number of parameters proposed by the prior (e.g., for a uniform prior $\pi$, for a small $K$-dimensional cube $B$ with volume $v$, we have $\pi(B) \propto v$).
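The trade-off inside the penalized divergence can be illustrated on a toy case. The sketch below is not taken from the paper: it assumes a Bernoulli model with a uniform prior on the success probability, uses the index $\delta = 1$ (the $\chi^2$ divergence), takes the penalty coefficient to be $1/n$, and restricts the sets $B$ to intervals centered at the true parameter, so the returned value is only an upper bound on the infimum over all sets. The names `chi2_bernoulli` and `penalized_divergence_bound` are hypothetical.

```python
import math

def chi2_bernoulli(theta0, theta):
    """chi^2 divergence (index 1) between Bernoulli(theta0) and Bernoulli(theta).

    Equals theta0^2/theta + (1-theta0)^2/(1-theta) - 1,
    which simplifies to (theta - theta0)^2 / (theta * (1 - theta)).
    """
    return theta0 ** 2 / theta + (1.0 - theta0) ** 2 / (1.0 - theta) - 1.0

def penalized_divergence_bound(theta0, n, half_widths):
    """Upper bound on inf_B [ sup_{p in B} d^2(p0, p) + ln(1/pi(B)) / n ].

    B ranges only over intervals (theta0 - h, theta0 + h) inside (0, 1);
    pi is the uniform prior on (0, 1), so pi(B) = 2h (assumption of this sketch).
    """
    best = math.inf
    for h in half_widths:
        lo, hi = theta0 - h, theta0 + h
        if lo <= 0.0 or hi >= 1.0:
            continue  # interval must stay inside the prior support
        # The divergence is convex in theta with minimum at theta0,
        # so its supremum over the interval is approached at an endpoint.
        approximation_error = max(chi2_bernoulli(theta0, lo),
                                  chi2_bernoulli(theta0, hi))
        penalty = math.log(1.0 / (2.0 * h)) / n
        best = min(best, approximation_error + penalty)
    return best

# Candidate half-widths on a logarithmic grid (illustrative choice).
half_widths = [10.0 ** (-k / 4.0) for k in range(2, 40)]

for n in (100, 10_000, 1_000_000):
    bound = penalized_divergence_bound(0.3, n, half_widths)
    print(f"n = {n:>9d}   bound = {bound:.3e}   bound * n / ln(n) = {bound * n / math.log(n):.3f}")
```

In this one-parameter example the bound shrinks at roughly the order of $\ln n / n$: a wide interval has a cheap penalty but a large approximation error, while a narrow one has the opposite, and the minimizing width balances the two.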
Remark 6. The other factor behind the convergence rate is related to the complexity of the model, which is proportional to $n^{-1} \ln N$, where $N$ is some number that increases with the number of small convex balls needed to cover the prior support of the model. Typically, this "complexity factor" is roughly about $K/n$, up to some logarithmic factors, where $K$ is the dimension of the parameters involved in the prior. It is noted, however, that with model averaging the higher dimensional model can be downweighted by the model prior, so that effectively one can make $K$ of order 1 for this complexity factor, and the convergence rate will then be controlled by the first factor ("the penalized divergence") alone.
The convergence rate result in Proposition 3 can be extended to the case of model averaging, when the prior is $\pi(p, m) = \pi_m \pi(p \mid m)$, jointly over a model index $m \in M$ in a set of nonoverlapping models $M$, and density $p \in \mathcal{P}_m$ (the support of the prior $\pi(\cdot \mid m)$ under model $m$), and the posterior is $\Pi(A) \propto \sum_{m \in M} \pi_m \int_{p \in \mathcal{P}_m} I((p, m) \in A) \prod_{i=1}^n p(X_i)\, \pi(dp \mid m)$ for an event $A$. (We assume nonoverlapping models for simplicity, where $\mathcal{P}_m \cap \mathcal{P}_{m'} = \emptyset$ for any two different model indexes $m$ and $m'$. This is only a technical convention for defining the prior supports, which typically does not affect real applications; see, e.g., Section 3.1.) In this case, let $N_m = N((1/n)^{1/t}, L_1, \mathcal{P}_m)$ be the number of $L_1$ balls of radius $(1/n)^{1/t}$ needed to cover the prior support $\mathcal{P}_m$ under model $m$. Then we have the following, under the iid assumption.
Proposition 7. Consider a "posterior sequence" of densities for iid data satisfying Condition 1. For any ∈ {1/2, 1/3, 1/4, . . .} and any 0 < < ∞, with probability 1, for almost all large sample size , one has 2 − ( 0 , ) ≤ 2 ( 0 , ) P and P are supports of the mixing prior = ∑ ( | ) and the model-prior ( | ), respectively. This is an oracle inequality that achieves the best performance of all models for the bound on the right hand side. Again, the convergence rate is displayed explicitly, and we will try to explain the driving factors of the convergence rate later. This is unlike the previous works where one has to conjecture a rate and check that it satisfies many conditions. So far, we have assumed existence of a finite covering number for the prior support, such as in Proposition 3 or in Proposition 7. They determine the "complexity factor" as commented in Remark 6. A deeper analysis of the "complexity factor" is to regard it as an upper bound for a better complexity measure related to the prior , developing an idea pioneered by Walker [3].
Remark 8. The complexity in Remark 6 is not satisfactory when the prior support $\mathcal{P}$ is unbounded and the covering number is infinite. However, the proofs of the propositions can be easily adapted to show that the covering number can be replaced by $\sum_{j \in \mathbb{N}} \pi(B_j)^t$ from Proposition 2, where we have relaxed $\cup_j B_j$ to be a cover of the entire prior support $\mathcal{P}$, and we have freedom in choosing the cover $\cup_j B_j$. Therefore, we can define a quantity that is related to the prior $\pi$ itself. Let $C(\pi)$ be the infimum of the $\ell_t$ norm $[\sum_{j \in \mathbb{N}} \pi(B_j)^t]^{1/t}$ over all such covers $\cup_j B_j$ of $\mathcal{P}$, where each $B_j$ is an $L_1$ ball of radius $n^{-1/t}$. We may call it the "$\ell_t$-norm prior complexity" for covering the prior support. An unbounded prior support may still be coverable by infinitely many $B_j$'s so that $C(\pi)$ is finite, even with an infinite covering number. Then we have a better way of formulating a bound corresponding to Proposition 3:

where $C(\pi)$ is the "$\ell_t$-norm complexity" of this prior defined in this remark.
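That a prior with unbounded support can have a finite norm complexity is easy to see numerically. The sketch below is an illustration under stated assumptions rather than the paper's construction: it takes a standard normal prior on a single real parameter, covers the real line with equal-width intervals (used here as stand-ins for the $L_1$ balls in density space), and evaluates $[\sum_j \pi(B_j)^t]^{1/t}$ for that one cover, which upper-bounds the infimum over covers. The function names and the truncation parameter `half_range` are hypothetical.

```python
import math

def upper_tail(x):
    """P(Z > x) for a standard normal Z, computed with erfc for tail accuracy."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def lt_norm_of_cover(width, t, half_range=40.0):
    """[sum_j pi(B_j)^t]^(1/t) for the cover of the real line by intervals
    [j*width, (j+1)*width), under a standard normal prior pi.

    The infinite sum is truncated at |x| <= half_range; the terms beyond that
    are far below floating-point resolution. Symmetry of the prior is used,
    so only the nonnegative half-line is summed and then doubled.
    """
    j_max = int(math.ceil(half_range / width))
    half_sum = 0.0
    for j in range(j_max):
        mass = upper_tail(j * width) - upper_tail((j + 1) * width)
        half_sum += mass ** t
    return (2.0 * half_sum) ** (1.0 / t)

for width in (0.4, 0.2, 0.1, 0.05):
    value = lt_norm_of_cover(width, t=0.5)
    print(f"width = {width:4.2f}   l_(1/2) norm = {value:9.3f}   width * norm = {width * value:.4f}")
```

Although infinitely many intervals are needed, the rapidly decaying prior masses keep the sum finite. For $t = 1/2$ the product of the width and the norm settles near $(\int \sqrt{\phi})^2 = 2\sqrt{2\pi}$, where $\phi$ is the standard normal density, which is the kind of Riemann-sum approximation by an integral of the prior density that Remark 9 relies on.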
Remark 9. We now describe heuristically how to bound the "norm complexity" ( ) defined in the previous remark in parametric models, where densities ∈ P are parameterized by a dimensional parameter , and a prior on induces a prior on the densities in P. A more rigorous treatment is given in the example of Section 3.2. In typical situations with some smoothness conditions on the densities, we can relate the 1 distance between two densities 1 and 2 by the maximal norm | ⋅ | ∞ : Then, to cover the parameter space, we can use ℓ ∞ ball 's in the parameter space with radius ℎ = ( 1/ ) −1 , so that the corresponding densities cover the 1 -ball with the required radius The sum in the square bracket is a Riemann sum over a fine grid, which we will assume to be approximated by an integral under some regularity conditions, even if the domain may be unbounded. Therefore, we have an upper bound of the norm complexity as for all large enough . Assume that the prior density is integrable in the parameter space, and the norm | | scales as (const) as in the case of an iid prior ( ) = ∏ =1 1 ( ).
Then the complexity term in the bound of Remark 8 can be derived as which increases with the dimension .
Remark 10. Similar to Remark 8, we have a better way of formulating a bound corresponding to Proposition 7: where ( ) is the "ℓ -norm complexity" of this prior , which in this case should be the infimum of the ℓ norm of P, where each is an 1 ball of radius −1/ , and under each model , ∪ ∈N represents a cover of its prior support P using possibly infinitely many balls. The defining expression of ( ) can also be related to the norm complexities of all the conditional priors (⋅, ) given the model choices: ( ) = [∑ ∈ { ( (⋅ | ))} ] 1/ . With model averaging using some suitable weights , this term ( ) and its effect on the convergence rate no longer diverge with the complexity of the model, in contrast to the conclusion of Remark 9. The convergence rate is then mainly determined by the penalized divergence 2 ( 0 , ). An example below (in its second part) is used to illustrate this.

A Simple Example for Illustration
This is a simple binary regression example intended for illustration. We will see that model averaging can be used to derive nearly optimal convergence rates that are adaptive to the assumptions on the true model. In the first part, we will illustrate how to bound the penalized divergence with a uniform prior with a bounded support. In the second part, we will illustrate how to bound the norm complexity when the prior has an unbounded support. (For technically defining the prior supports to be nonoverlapping for different models, one can further require the 's to be mutually distinct. The resulting prior would be unchanged almost everywhere and would not affect the discussions later.)

When the Prior Has a Bounded Support
We will consider two different setups of the true model.
Setup 1 (dense true model). In the first setup, the true 0 has continuous derivative bounded by . We call this a "dense" setup since we may need a large piecewise constant model (with large increasing with sample size ) to approximate this quite arbitrary true mean function 0 .
Let 0 and be the densities corresponding to 0 and , respectively. Then sup ∈ 2 to the triangle inequality. The prior probability over ( , ) is (2Δ) . Therefore, the "penalized divergence" We will take ∝ − ( −1) ln for some large enough constant > 0 and apply Proposition 7. (It can be shown that this will make the "complexity" term −1 ln( 6 (∑ ∞ =1 √ ) 2 ) negligible compared to $d_1^2(p_0, \pi)$ (by showing ≤ 2 ; we omit the tedious details here).) We can take $K \sim n^{1/3}$ and Δ = 1/ for an upper bound of the inf ,Δ . Therefore, the "penalized divergence" $d_1^2(p_0, \pi)$ and the resulting convergence rate $d_{-1/2}^2(p_0, p)$ are both of order $O(n^{-2/3} \ln n)$, which is within a $\ln n$ factor of the minimax optimal rate. It is noted that the model averaging automatically achieves this near optimal rate.

Setup 2 (sparse true model). Consider a second setup, where we assume that the true model is a 0 -piecewise constant, where we do not know the value 0 . We call this a sparse case since we only need an -piecewise constant model * to approximate the true mean function 0 perfectly, where = 0 can be much smaller than the choice of $K \sim n^{1/3}$ in Setup 1.
In summary, the prior is adaptive in the sense that, in either the dense or the sparse case, the resulting posterior distribution works nearly optimally, even if we do not know whether the true model is dense or sparse.
), which is finite despite the unbounded prior support. Then according to Remark 10, So the norm complexity term in Remark 10 is of order $O(\ln n / n)$, which, when compared with the last formula in Remark 9, behaves as if the dimension has been reduced to order $O(1)$ by model averaging. Therefore, the norm complexity term does not affect the convergence rate significantly due to model averaging, and the convergence rate is mainly determined by the penalized divergence $d_1^2(p_0, \pi)$. The bounding of the penalized divergence is similar to the example discussed in the previous subsection, and we omit the details. The resulting convergence rates are essentially the same as when the uniform priors (with bounded supports) are used, despite the fact that we now allow priors with unbounded supports (such as normal priors in the parametrization of log-odds).

Using Markov's inequality, for any > 0, we have for any in the support of .
due to Jensen's inequality; due to Fubini's Theorem; All these combine to ( * ) Now apply a result that is a straightforward extension of Ghosal et al. ([4], Lemma 6.1). For any convex set , there exist such that for any , > 0, ∈ (0, 1), Therefore, we can find so that Given any > 0, we can choose ( , ) so that / = in the above statement.
This leads to the proof by applying Proposition 1.
Proof of Proposition 3. This is a special case of Proposition 7, where focuses on only one model.
Proof of Proposition 7. Repeat the proofs of Propositions 1 and 2 for the case with model averaging, with the support of prior being P = ∪ ∈ P , where P is the support of (⋅ | ).
Suppose the convex cover of ∩ P is doubly indexed as ∪ ∈ ∪ ∈N , where N has cardinality at most and ∪ ∈N is a convex cover of the support ∩ P . Then the result in Proposition 2 holds with In Proposition 2, let = [ : 2 − ( 0 , ) > ]. Suppose all the convex sets are such that inf , inf ∈ 2 − ( 0 , ) > . Then we have ( †) where is an upper bound for the number of convex sets needed to cover ∩ P . Now we try to define the convex sets in more detail. They are used to cover , so without loss of generality, each contains a point in , say 1 , which is not close to 0 since 2 − ( 0 , 1 ) > . If is small so that any two points in it are close together, then any point 2 in (which may fall outside ) can be made to be also not close to 0 , so that 2 − ( 0 , 2 ) > for some > 0 related to . This would be easy to establish by a triangle inequality, were it not for the difficulty that the divergence $d_{-t}$ is not a true distance for $t \neq 1/2$. So we would not be able to say, for example, that should be a small $d_{-t}$-ball.

Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.