Joint behaviour of semirecursive kernel estimators of the location and of the size of the mode of a probability density function

Let $\theta$ and $\mu$ denote the location and the size of the mode of a probability density. We study the joint convergence rates of semirecursive kernel estimators of $\theta$ and $\mu$. We show how the estimation of the size of the mode makes it possible to measure the relevance of the estimation of its location. We also highlight that, beyond their computational advantage over nonrecursive estimators, the semirecursive estimators are preferable for the construction of confidence regions.


Introduction
Let $X_1, \ldots, X_n$ be independent and identically distributed $\mathbb{R}^d$-valued random variables with unknown probability density $f$. The aim of this paper is to study the joint kernel estimation of the location $\theta$ and of the size $\mu = f(\theta)$ of the mode of $f$. The mode is assumed to be unique, that is, $f(x) < f(\theta)$ for any $x \neq \theta$, and nondegenerate, that is, the second order differential $D^2 f(\theta)$ at the point $\theta$ is nonsingular (in the sequel, $D^m g$ will denote the differential of order $m$ of a multivariate function $g$).
The problem of estimating the location of the mode of a probability density has been widely studied. Kernel methods were considered, among many others, by Parzen [18], Nadaraya [17], Van Ryzin [26], Rüschendorf [23], Konakov [10], Samanta [24], Eddy ([5], [6]), Romano [20], Tsybakov [25], Vieu [27], Mokkadem and Pelletier [13], and Abraham et al. ([1], [2]). To our knowledge, the behaviour of estimators of the size of the mode has not been investigated in detail, although there are at least two statistical motivations for estimating this parameter. First, an estimator of the size is needed for the construction of confidence regions for the location of the mode (see, e.g., Romano [20]). As a more important motivation, let us underline that the height of the peak gives information on the shape of a density; from this point of view, as suggested by Vieu [27], the location of the mode is more related to the shape of the derivative of $f$, whereas the size of the mode is more related to the shape of the density itself. Moreover, the knowledge of the size of the mode makes it possible to measure the relevance of the location of the mode as a parameter.
Let us mention that, even if the problem of estimating the size of the mode has not been investigated in the framework of density estimation, it has been studied in the framework of regression estimation. Müller [16] proves in particular the joint asymptotic normality and independence of kernel estimators of the location and of the size of the mode in the framework of nonparametric regression models with fixed design. In the framework of nonparametric regression with random design, a similar result is obtained by Ziegler ([32], [33]) for kernel estimators, and by Mokkadem and Pelletier [14] for estimators derived from stochastic approximation methods.
This paper focuses on semirecursive kernel estimators of $\theta$ and $f(\theta)$. To explain why we chose semirecursive estimators, let us first recall that the well-known (nonrecursive) kernel estimator of the location of the mode introduced by Parzen [18] is defined as a random variable $\theta_n^*$ satisfying
$$f_n^*(\theta_n^*) = \sup_{y \in \mathbb{R}^d} f_n^*(y),$$
where $f_n^*$ is Rosenblatt's estimator of $f$; more precisely,
$$f_n^*(x) = \frac{1}{n h_n^d} \sum_{i=1}^n K\left(\frac{x - X_i}{h_n}\right),$$
where the bandwidth $(h_n)$ is a sequence of positive real numbers going to zero and the kernel $K$ is a continuous function satisfying $\lim_{\|x\| \to +\infty} K(x) = 0$ and $\int_{\mathbb{R}^d} K(x)\,dx = 1$. The asymptotic behaviour of $\theta_n^*$ has been widely studied (see, among others, [5], [6], [10], [13], [17], [18], [20], [23], [24], [26], [27]), but, from a computational point of view, the estimator $\theta_n^*$ has a main drawback: its update, from a sample of size $n$ to a sample of size $n + 1$, is far from immediate. Applying the stochastic approximation method, Tsybakov [25] introduced a recursive kernel estimator $(T_n)$ of $\theta$, defined through a stochastic approximation recursion in which $T_0 \in \mathbb{R}^d$ is arbitrarily chosen and the stepsize $(\gamma_n)$ is a sequence of positive real numbers going to zero. The main advantage of this estimator is that its update is very fast. Unfortunately, for reasons inherent to the properties of stochastic approximation algorithms, very strong assumptions on the density $f$ are required to ensure its consistency.
A recursive version $f_n$ of Rosenblatt's density estimator was introduced by Wolverton and Wagner [30] (and discussed, among others, by Yamato [31], Davies [3], Devroye [4], Menon et al. [12], Wertz [29], Wegman and Davies [28], Roussas [22], and Mokkadem et al. [15]). Let us recall that $f_n$ is defined as
$$f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_i^d} K\left(\frac{x - X_i}{h_i}\right). \qquad (1)$$
Its update from a sample of size $n$ to one of size $n + 1$ is immediate since $f_n$ clearly satisfies the recursive relation
$$f_n(x) = \frac{n-1}{n} f_{n-1}(x) + \frac{1}{n h_n^d} K\left(\frac{x - X_n}{h_n}\right).$$
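The recursive relation can be checked numerically. The following is a minimal sketch, assuming $d = 1$, a Gaussian kernel, evaluation on a fixed grid, and the illustrative bandwidth $h_i = i^{-1/5}$; the function names are hypothetical and none of these choices is prescribed by the text above.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density on R (d = 1 is assumed here)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def recursive_update(f_prev, n, x_new, h_n, grid):
    """One recursive step: f_n = ((n-1)/n) f_{n-1} + K((x - X_n)/h_n) / (n h_n)."""
    return ((n - 1) / n) * f_prev + gaussian_kernel((grid - x_new) / h_n) / (n * h_n)

def batch_estimate(sample, bandwidths, grid):
    """Direct evaluation of f_n(x) = n^{-1} sum_i h_i^{-1} K((x - X_i)/h_i)."""
    u = (grid[:, None] - sample[None, :]) / bandwidths
    return (gaussian_kernel(u) / bandwidths).mean(axis=1)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)
bandwidths = np.arange(1, 201) ** (-1.0 / 5.0)  # illustrative choice, an assumption
grid = np.linspace(-4.0, 4.0, 401)

# Process the observations one at a time, as the recursion allows.
f_rec = np.zeros_like(grid)
for n, (x, h) in enumerate(zip(sample, bandwidths), start=1):
    f_rec = recursive_update(f_rec, n, x, h, grid)

f_batch = batch_estimate(sample, bandwidths, grid)
```

Each update costs one kernel evaluation per grid point, whereas recomputing Rosenblatt's estimator after a new observation requires revisiting the whole sample because its bandwidth changes with $n$.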
This property of rapid update of the density estimator is particularly important in the framework of mode estimation, since the number of points where $f$ must be estimated is very large. We thus define a semirecursive version of Parzen's estimator of the location of the mode by using Wolverton-Wagner's recursive density estimator, rather than Rosenblatt's density estimator. More precisely, our estimator $\theta_n$ of the location $\theta$ of the mode is a random variable satisfying
$$f_n(\theta_n) = \sup_{y \in \mathbb{R}^d} f_n(y). \qquad (2)$$
Let us mention that, in the same way as for Parzen's estimator, the fact that the kernel $K$ is continuous and vanishes at infinity ensures that the choice of $\theta_n$ as a random variable satisfying (2) can be made with the help of an order on $\mathbb{R}^d$. For example, one can consider the following lexicographic order: $x \leq y$ if the first nonzero coordinate of $x - y$ is negative. The definition
$$\theta_n = \inf\left\{y \in \mathbb{R}^d \text{ such that } f_n(y) = \sup_{x \in \mathbb{R}^d} f_n(x)\right\},$$
where the infimum is taken with respect to the lexicographic order on $\mathbb{R}^d$, ensures the measurability of the kernel mode estimator.
Let us also mention that, in order to speed up the computation of the kernel estimator of the location of the mode, Abraham et al. ([1], [2]) proposed the following alternative version of Parzen's estimator $\theta_n^*$:
$$\tilde\theta_n^* = \arg\max_{x \in \{X_1, \ldots, X_n\}} f_n^*(x).$$
Similarly, we could consider the following alternative version of our semirecursive estimator $\theta_n$:
$$\tilde\theta_n = \arg\max_{x \in \{X_1, \ldots, X_n\}} f_n(x).$$
However, to establish the asymptotic properties of $\tilde\theta_n^*$, Abraham et al. [2] prove the asymptotic proximity between $\theta_n^*$ and $\tilde\theta_n^*$, which allows them to deduce the asymptotic weak behaviour of $\tilde\theta_n^*$ from that of $\theta_n^*$. In the same way, we can conjecture that the asymptotic weak behaviour of $\tilde\theta_n$ could be deduced from that of $\theta_n$, but, in this paper, we limit ourselves to establishing the asymptotic properties of $\theta_n$.
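As an illustration of why restricting the search to the observations is cheap, here is a minimal sketch of the sample-point variant applied to the recursive density estimator. It assumes $d = 1$, a Gaussian kernel and the illustrative bandwidth $h_i = i^{-1/7}$; the function name `sample_point_mode` is hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density on R (d = 1 is assumed here)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def sample_point_mode(sample, bandwidths):
    """Return the observation X_j at which the recursive estimate f_n is largest."""
    # Row j holds the kernel terms for the evaluation point X_j.
    u = (sample[:, None] - sample[None, :]) / bandwidths
    f_at_sample = (gaussian_kernel(u) / bandwidths).mean(axis=1)
    return sample[np.argmax(f_at_sample)]

rng = np.random.default_rng(2)
n = 2000
sample = rng.normal(size=n)                     # true mode located at 0
bandwidths = np.arange(1, n + 1) ** (-1.0 / 7)  # illustrative choice, an assumption
theta_tilde = sample_point_mode(sample, bandwidths)
```

The search involves only $n$ candidate points, so no grid or numerical optimizer over $\mathbb{R}^d$ is needed.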
Let us now come back to the problem of estimating the size $f(\theta)$ of the mode. The estimator ordinarily used is defined as $\mu_n^* = f_n^*(\theta_n^*)$ ($f_n^*$ being Rosenblatt's density estimator and $\theta_n^*$ Parzen's mode estimator); the consistency of $\mu_n^*$ is sufficient to allow the construction of confidence regions for $\theta$ (see, e.g., Romano [20]). Adapting the construction of $\mu_n^*$ to the semirecursive framework would lead us to estimate $f(\theta)$ by
$$\mu_n = f_n(\theta_n).$$
However, this estimator has two main drawbacks (as does $\mu_n^*$). First, the use of a higher order kernel $K$ is necessary for $(\mu_n - \mu)$ to satisfy a central limit theorem, and thus for the construction of confidence intervals for $\mu$ (and of confidence regions for $(\theta, \mu)$). Moreover, when a higher order kernel is used, it is not possible to choose a bandwidth for which both estimators $\theta_n$ and $\mu_n$ converge at the optimal rate. These observations lead us to use two different bandwidths, one for the estimation of $\theta$, the other for the estimation of $\mu$. More precisely, let $\tilde f_n$ be the recursive kernel density estimator defined as
$$\tilde f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{\tilde h_i^d} K\left(\frac{x - X_i}{\tilde h_i}\right),$$
where the bandwidth $(\tilde h_n)$ may be different from the bandwidth $(h_n)$ used in the definition of $f_n$ (see (1)); we estimate the size of the mode by
$$\tilde\mu_n = \tilde f_n(\theta_n),$$
where $\theta_n$ is still defined by (2), and thus with the first bandwidth $(h_n)$.
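The two-bandwidth construction can be sketched as follows. This is only an illustration under stated assumptions: $d = 1$, a Gaussian kernel, a grid search standing in for the exact supremum, and the illustrative bandwidths $h_i = i^{-1/7}$ and $\tilde h_i = i^{-1/5}$. The helper names are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density on R (d = 1 is assumed here)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def recursive_density(points, sample, bandwidths):
    """Evaluate n^{-1} sum_i b_i^{-1} K((x - X_i)/b_i) at each x in `points`."""
    points = np.atleast_1d(points)
    u = (points[:, None] - sample[None, :]) / bandwidths
    return (gaussian_kernel(u) / bandwidths).mean(axis=1)

def mode_location_and_size(sample, h, h_tilde, grid):
    """Location from f_n (bandwidth h), size from the second estimator (bandwidth h_tilde)."""
    f_n = recursive_density(grid, sample, h)
    # np.argmax returns the first maximizer, i.e. the smallest one on an increasing grid.
    theta_n = grid[np.argmax(f_n)]
    mu_tilde_n = recursive_density(theta_n, sample, h_tilde)[0]
    return theta_n, mu_tilde_n

rng = np.random.default_rng(1)
n = 2000
sample = rng.normal(loc=0.0, scale=0.5, size=n)  # true mode at 0, size about 0.80
i = np.arange(1, n + 1)
h = i ** (-1.0 / 7.0)        # bandwidth for the location (illustrative assumption)
h_tilde = i ** (-1.0 / 5.0)  # bandwidth for the size (illustrative assumption)
grid = np.linspace(-3.0, 3.0, 1201)

theta_n, mu_tilde_n = mode_location_and_size(sample, h, h_tilde, grid)
```

In dimension one, taking the first maximizer on an increasing grid mirrors the infimum convention used to select $\theta_n$ among possible maximizers.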
The purpose of this paper is the study of the joint asymptotic behaviour of $\theta_n$ and $\tilde\mu_n$. We first prove the strong consistency of both estimators. We then establish the joint weak convergence rate of $\theta_n$ and $\tilde\mu_n$. We prove in particular that adequate choices of the bandwidths lead to the asymptotic normality and independence of these estimators, and that the use of different bandwidths makes it possible to obtain simultaneously the optimal convergence rate of both estimators. We then apply our weak convergence rate result to the construction of confidence regions for $(\theta, \mu)$, and illustrate this application with a simulation study. This application highlights the advantage of using semirecursive estimators rather than nonrecursive estimators. It also shows how the estimation of the size of the mode gives information on the relevance of estimating its location. Finally, we establish the joint strong convergence rate of $\theta_n$ and $\tilde\mu_n$.

Assumptions and Main Results
Throughout this paper, $(h_n)$ and $(\tilde h_n)$ are defined as $h_n = h(n)$ and $\tilde h_n = \tilde h(n)$ for all $n \geq 1$, where $h$ and $\tilde h$ are two positive functions.

Strong consistency
The conditions we require for the strong consistency of $\theta_n$ and $\tilde\mu_n$ are the following.
(A1) i) $K$ is an integrable, differentiable, and even function such that $\int_{\mathbb{R}^d} K(z)\,dz = 1$.
iii) There exists $\eta > 0$ such that $z \mapsto \|z\|^{\eta} f(z)$ is a bounded function.
iv) There exists $\theta \in \mathbb{R}^d$ such that $f(x) < f(\theta)$ for all $x \neq \theta$.

Remark 1
Note that (A1)iv) implies that K is bounded.

Remark 2
The assumptions required on the probability density to establish the strong consistency of the semirecursive estimator of the location of the mode are slightly stronger than those needed for the nonrecursive estimator (see, e.g., [13], [20]), but are much weaker than the ones needed for the recursive estimator (see [25]).

Weak convergence rate
In order to state the weak convergence rate of $\theta_n$ and $\tilde\mu_n$, we need the following additional assumptions on $K$ and $f$.

Remark 4 Note that (A4)ii) and (A4)iii) imply that $\nabla K$ is Lipschitz-continuous and integrable; it is thus straightforward to see that $\lim_{\|x\| \to \infty} \|\nabla K(x)\| = 0$ (and in particular $\nabla K$ is bounded).
We also need to add conditions on the bandwidths. Let us set $L_\theta(n) = n^{a} h_n$ and $L_\mu(n) = n^{\tilde a} \tilde h_n$. (In view of (A3), $L_\theta$ and $L_\mu$ are positive slowly varying functions, see Remark 3.) In the statement of the weak convergence rate of $\theta_n$ and $\tilde\mu_n$, we shall refer to the following conditions.
(C1) One of the following two conditions is fulfilled.
(C2) One of the following two conditions is fulfilled.

Remark 6 The simultaneous weak convergence rate of nonrecursive estimators of the location and size of the mode can be established by following the lines of the proof of Theorem 1. More precisely, set
let $\theta_n^*$ be Parzen's kernel estimator of the location of the mode and $\tilde\mu_n^* = \tilde f_n^*(\theta_n^*)$ be the kernel estimator of the size of the mode defined with the help of $\theta_n^*$ and of Rosenblatt's density estimator $\tilde f_n^*$ (the bandwidth $(\tilde h_n)$ defining $\tilde f_n^*$ being possibly different from the bandwidth $(h_n)$ used to define $\theta_n^*$); Theorem 1 holds when $\theta_n$, $\tilde\mu_n$, $B_q(\theta)$, $\Sigma$ are replaced by $\theta_n^*$, $\tilde\mu_n^*$, $B_q^*(\theta)$, $\Sigma^*$, respectively.
Part 1, and Part 2 in the case $c = \tilde c = 0$ (respectively Part 3), of Theorem 1 correspond to the case when the biases (respectively the variances) of both estimators $\theta_n$ and $\tilde\mu_n$ are negligible compared with their respective variances (respectively biases). When $c, \tilde c > 0$, Part 2 of Theorem 1 corresponds to the case when the bias and the variance of each estimator $\theta_n$ and $\tilde\mu_n$ have the same convergence rate. Other possible conditions lead to different combinations; these have been omitted for the sake of simplicity.
Theorem 1 gives the joint weak convergence rate of $\theta_n$ and $\tilde\mu_n$. Of course, it is also possible to estimate the location and the size of the mode separately. Concerning the estimation of the location of the mode, let us highlight that the advantage of the semirecursive estimator $\theta_n$ over its nonrecursive version $\theta_n^*$ is that its asymptotic variance $[1 + a(d + 2)]^{-1} f(\theta) G$ is smaller than that of Parzen's estimator, which equals $f(\theta) G$ (see, e.g., Romano [20] for the case $d = 1$ and Mokkadem and Pelletier [13] for the case $d \geq 1$); this advantage of semirecursive estimators will be discussed again in Section 2.3. The estimation of the size of the mode is of course not independent of the estimation of the location, since the estimator $\tilde\mu_n$ is constructed with the help of the estimator $\theta_n$. To get a good estimate of the size of the mode, it seems obvious that $\theta_n$ should be computed with a bandwidth $(h_n)$ leading to its optimal convergence rate (or, at least, to a convergence rate close to the optimal one). The main information given by Theorem 1 is that, for $\tilde\mu_n$ to converge at the optimal rate, the use of a second bandwidth $(\tilde h_n)$ is then necessary.
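To make the comparison of the asymptotic variances of $\theta_n$ and $\theta_n^*$ concrete, here is a worked instance under illustrative assumptions: dimension $d = 1$ and bandwidth exponent $a = 1/7$ (the power of $n$ appearing in the bandwidth $h_n$ of the simulations in Section 2.3).

```latex
\[
[1 + a(d+2)]^{-1}
  = \Bigl[1 + \tfrac{1}{7}\cdot 3\Bigr]^{-1}
  = \tfrac{7}{10},
\qquad
\sqrt{\tfrac{7}{10}} \approx 0.84 .
\]
```

Under these assumptions the asymptotic variance of $\theta_n$ is 70% of that of Parzen's estimator, so the corresponding asymptotic standard deviation is reduced by roughly 16%.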
Let us highlight that, when $\theta_n$ and $\tilde\mu_n$ satisfy a central limit theorem (Parts 1 and 2 of Theorem 1), these estimators are asymptotically independent, although, by its definition, the estimator of the size of the mode is closely tied to the estimator of its location. As pointed out by a referee, this property was to be expected. As a matter of fact (and as mentioned in the introduction), the location of the mode is a parameter which gives information on the shape of the density derivative, whereas the size of the mode gives information on the shape of the density itself. This observation must be related to the fact that the weak (and strong) convergence rate of $\theta_n$ is given by that of the gradient of $f_n$, whereas the weak (and strong) convergence rate of $\tilde\mu_n$ is given by that of $\tilde f_n$ itself; since the variance of the density estimators converges to zero faster than that of the estimators of the density derivatives, the asymptotic independence of $\theta_n$ and $\tilde\mu_n$ is fully explained.
Let us finally say a word on our assumptions on the bandwidths. In the framework of nonrecursive estimation, there is no need to assume that $(h_n)$ and $(\tilde h_n)$ are regularly varying sequences. In the case of semirecursive estimation, this assumption obviously cannot be omitted, since the exponents $a$ and $\tilde a$ appear in the expressions of the asymptotic bias $B_q(\theta)$ and variance $\Sigma$. This might be seen as a slight drawback of semirecursive estimation; however, as highlighted in the following section, it turns out to be an advantage, since the asymptotic variances of the semirecursive estimators are smaller than those of the nonrecursive estimators.

Construction of confidence regions and simulations studies
The application of Theorem 1 (and of Remark 6) allows the construction of confidence regions (simultaneous or not) for the location and for the size of the mode, as well as confidence ellipsoids for the couple $(\theta, \mu)$. Hall [9] shows that, in order to construct confidence regions, avoiding bias estimation by a slight undersmoothing is more efficient than explicit bias correction. In the framework of undersmoothing, the asymptotic bias of the estimator is negligible compared with its asymptotic variance; from the point of view of estimation by confidence regions, the quantity to minimize is thus the asymptotic variance. Now, recall that the asymptotic variances of the semirecursive estimators are smaller than those of the nonrecursive estimators, $\Sigma$ (respectively $\Sigma^*$) being the asymptotic covariance matrix of the semirecursive estimators $(\theta_n, \tilde\mu_n)$ (respectively of the nonrecursive estimators $(\theta_n^*, \tilde\mu_n^*)$). In order to construct confidence regions for the location and/or size of the mode, it is thus much preferable to use semirecursive estimators rather than nonrecursive estimators. Simulation studies confirm this theoretical conclusion, whatever the parameter ($\theta$, $\mu$ or $(\theta, \mu)$) for which confidence regions are constructed. For the sake of brevity, we do not give all these simulation results here, but focus on the construction of confidence ellipsoids for $(\theta, \mu)$; the aim of this example is of course to highlight the advantage of using semirecursive estimators rather than nonrecursive estimators, but also to show how this confidence region gives information on the shape of the density and, consequently, makes it possible to measure the relevance of the location of the mode as a parameter.
To construct confidence regions for (θ, µ), we consider the case d = 1. The following corollary is a straightforward consequence of Theorem 1.

Remark 7
In view of Remark 6, in the case when the nonrecursive estimators $\theta_n^*$ and $\tilde\mu_n^*$ are used, (7) becomes (and, again, this convergence still holds when the parameters $f(\theta)$ and $f''(\theta)$ are replaced by consistent estimators).
Let $\check f_n''$ (respectively $\check f_n^{*\prime\prime}$) be the recursive estimator (respectively the nonrecursive Rosenblatt's estimator) of $f''$ computed with the help of a bandwidth $\check h_n$, and set Moreover, let $c_\alpha$ be such that $P(Z \leq c_\alpha) = 1 - \alpha$, where $Z$ is $\chi^2(2)$-distributed; in view of Corollary 1 and Remark 7, the sets Let us dwell on the fact that both confidence regions have the same asymptotic level, but the lengths of the axes of the first one (constructed with the help of the semirecursive estimators $\theta_n$ and $\tilde\mu_n$) are smaller than those of the second one (constructed with the help of the nonrecursive estimators $\theta_n^*$ and $\tilde\mu_n^*$).
We now present simulation results. In order to see the relationship between the shape of the confidence ellipsoids and that of the density, the density $f$ we consider is the density of the $N(0, \sigma^2)$-distribution, the parameter $\sigma$ taking the values 0.3, 0.4, 0.5, 0.7, 0.75, 1, 1.5, 2, and 2.5. We use the sample size $n = 100$ and the coverage level $1 - \alpha = 95\%$ (and thus $c_\alpha = 5.99$). In each case, the number of simulations is $N = 5000$. The kernel we use is the standard Gaussian density; the bandwidths are $h_n = n^{-1/7}(\log n)$, $\tilde h_n = n^{-1/5}(\log n)$, $\check h_n = n^{-1/9}$. Table 1 below gives, for each value of $\sigma$, the empirical values of $\theta_n$, $\theta_n^*$, $\mu_n$, $\mu_n^*$ (with respect to the 5000 simulations), and:
$b$, the empirical length of the $\theta$-axis of the confidence ellipsoid $E_{5\%}$;
$b^*$, the empirical length of the $\theta$-axis of the confidence ellipsoid $E^*_{5\%}$;
$a$, the empirical length of the $\mu$-axis of the confidence ellipsoid $E_{5\%}$;
$a^*$, the empirical length of the $\mu$-axis of the confidence ellipsoid $E^*_{5\%}$;
$p$, the empirical coverage level of the confidence ellipsoid $E_{5\%}$;
$p^*$, the empirical coverage level of the confidence ellipsoid $E^*_{5\%}$.
Confirming our theoretical results, we see that the empirical coverage levels of both confidence ellipsoids $E^*_{5\%}$ and $E_{5\%}$ are similar, but that the empirical areas of the ellipsoids $E_{5\%}$ (constructed with the help of the semirecursive estimators) are always smaller than those of the ellipsoids $E^*_{5\%}$ (constructed with the help of the nonrecursive estimators).
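The value $c_\alpha = 5.99$ can be recovered in closed form, since the $\chi^2(2)$ distribution function is $1 - e^{-x/2}$. A minimal check, in which the function name is hypothetical:

```python
import math

def chi2_2df_quantile(level):
    """Quantile of the chi-square law with 2 degrees of freedom.

    Its distribution function is 1 - exp(-x/2), so the quantile of order
    `level` solves 1 - exp(-x/2) = level.
    """
    return -2.0 * math.log(1.0 - level)

c_alpha = chi2_2df_quantile(0.95)  # coverage level 1 - alpha = 95%
```

This closed form is specific to two degrees of freedom; other dimensions would need a numerical quantile routine.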
Let us now discuss the interest of estimating the size of the mode and of jointly estimating the location and size of the mode. Both estimations give information on the shape of the probability density and, consequently, make it possible to measure the relevance of the location of the mode as a parameter. Of course, the parameter $\theta$ is significant only when the height of the peak is large enough; since we consider here the example of the $N(0, \sigma^2)$-distribution, this corresponds to the case when $\sigma$ is small enough. Estimating only the size of the mode gives a first idea of the shape of the density around the location of the mode (for instance, when the size is estimated around 0.16, it is clear that the density is very flat). Now, the shape of the confidence ellipsoids gives a more precise idea. As a matter of fact, for small values of $\sigma$, the length of the $\mu$-axis is larger than that of the $\theta$-axis; as $\sigma$ increases, the length of the $\mu$-axis decreases, and that of the $\theta$-axis increases (for $\sigma = 2.5$, the length of the $\theta$-axis is more than 20 times that of the $\mu$-axis). Let us underline that these variations of the lengths of the axes are not due to poor estimation results; Table 2 below gives the values of the lengths $b$ (respectively $b^*$) of the $\theta$-axis and $a$ (respectively $a^*$) of the $\mu$-axis of the ellipsoids computed with the semirecursive estimators $\theta_n$ and $\tilde\mu_n$ (respectively with the nonrecursive estimators $\theta_n^*$ and $\tilde\mu_n^*$) when the true values of the parameters $f(\theta)$ and $f''(\theta)$ are used (that is, by straightforwardly applying (7) and (8)).
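For reference, the true size of the mode of the $N(0, \sigma^2)$ density is $f(0) = 1/(\sigma\sqrt{2\pi})$. The short sketch below evaluates it for the values of $\sigma$ used in the simulations; the variable and function names are hypothetical.

```python
import math

# Values of sigma considered in the simulation study.
sigmas = [0.3, 0.4, 0.5, 0.7, 0.75, 1, 1.5, 2, 2.5]

def gaussian_mode_size(sigma):
    """Height of the N(0, sigma^2) density at its mode, i.e. f(0)."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

mode_sizes = {sigma: gaussian_mode_size(sigma) for sigma in sigmas}
```

The size decreases as $1/\sigma$, and the flat case mentioned in the text (a size around 0.16) corresponds to $\sigma = 2.5$.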

Strong convergence rate
To establish the joint strong convergence rate of $\theta_n$ and $\tilde\mu_n$, we need the following additional assumption.
Moreover, condition (C2) is replaced by the following one.
i) If (C1) is fulfilled, then, with probability one, the sequence is relatively compact and its limit set is the ellipsoid
ii) If $a = (d + 2q + 2)^{-1}$, $\tilde a = (d + 2q)^{-1}$, and if there exist $c, \tilde c \geq 0$ such that $\lim_{n\to\infty} n h_n^{d+2q+2}/(2 \log\log n) = c$ and $\lim_{n\to\infty} n \tilde h_n^{d+2q}/(2 \log\log n) = \tilde c$, then, with probability one, the sequence is relatively compact and its limit set is the ellipsoid
Laws of the iterated logarithm for Parzen's nonrecursive kernel mode estimator were established by Mokkadem and Pelletier [13]. The proof techniques used in the framework of nonrecursive estimators are totally different from those employed to prove Theorem 2. This is due to the following fundamental difference between the nonrecursive estimator $\theta_n^*$ and the semirecursive estimator $\theta_n$: the study of the asymptotic behaviour of $\theta_n^*$ comes down to that of a sum over a triangular array of independent variables, whereas the study of the asymptotic behaviour of $\theta_n$ reduces to that of a sum of independent variables. Of course, this difference is not very important for the study of the weak convergence rate. But, for the study of the strong convergence rate, it makes the case of semirecursive estimation much easier than the case of nonrecursive estimation. In particular, in contrast to the weak convergence rate, the joint strong convergence rate of the nonrecursive estimators $\theta_n^*$ and $\tilde\mu_n^*$ cannot be obtained by following the lines of the proof of Theorem 2, and remains an open question.

Proofs
Let us first note that an important consequence of (A3) which will be used throughout the proofs is that if $\beta a < 1$, then lim Moreover, for all $\varepsilon > 0$ small enough, As a matter of fact: (i) if $aq < 1$, (10) follows easily from (9); (ii) if $aq > 1$, since $\sum_i h_i^q$ is summable, (10) holds; (iii) if $aq = 1$, since $a(q - \varepsilon) < 1$, using (9) again, we have $n^{-1} \sum_{i=1}^n h_i^q = O(h_n^{q-\varepsilon})$, and thus (10) follows. Of course (9) and (10) also hold when $(h_n)$ and $a$ are replaced by $(\tilde h_n)$ and $\tilde a$, respectively.
Our proofs are organized as follows. Section 3.1 is devoted to the proof of the strong consistency of $\theta_n$ and $\tilde\mu_n$. In Section 3.2, we give the convergence rate of the derivatives of $f_n$. In Section 3.3, we show how the study of the joint weak and strong convergence rate of $\theta_n$ and $\tilde\mu_n$ can be related to that of $\nabla f_n(\theta)$ and $\tilde f_n(\theta)$. In Section 3.4 (respectively in Section 3.5), we establish the joint weak convergence rate (respectively the joint strong convergence rate) of $\nabla f_n(\theta)$ and $\tilde f_n(\theta)$. Finally, Section 3.6 is devoted to the proof of Theorems 1 and 2.

Proof of Proposition 1
Since $\theta_n$ is the mode of $f_n$ and $\theta$ the mode of $f$, we have: The application of Theorem 5 in Mokkadem et al. [15] with $|\alpha| = 0$ and $v_n = \log n$ ensures that for any $\delta > 0$, there exists $c(\delta) > 0$ such that . In view of (9), since $ad < 1$, we can write The Borel-Cantelli Lemma ensures that $\lim_{n\to\infty} \|f_n - E(f_n)\|_\infty = 0$ a.s. Since $\lim_{n\to\infty} \|E(f_n) - f\|_\infty = 0$, it follows from (11) that $\lim_{n\to\infty} f(\theta_n) = f(\theta)$ a.s. Since $f$ is continuous, $\lim_{\|z\|\to\infty} f(z) = 0$ and $\theta$ is the unique mode of $f$, we deduce that $\lim_{n\to\infty} \theta_n = \theta$ a.s. Now, we have where the last inequality follows from (11). As previously, one can show that $\lim_{n\to\infty} \|\tilde f_n - f\|_\infty = 0$ a.s. and thus $\lim_{n\to\infty} \tilde\mu_n = \mu$ a.s.

Convergence rate of the derivatives of the density
For any $d$-tuple $[\alpha] = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$, we set $|\alpha| = \alpha_1 + \cdots + \alpha_d$ and, for any function $g$, let
Lemma 1 Assume (A3)-(A5) hold. Let $(g_n)$ and $(b_n)$ be defined as follows: $g_n = f_n$ and $b_n = h_n$, or $g_n = \tilde f_n$ and $b_n = \tilde h_n$.

Relationship between $((\theta_n - \theta)^T, \tilde\mu_n - \mu)^T$ and $([\nabla f_n(\theta)]^T, \tilde f_n(\theta) - f(\theta))^T$
By definition of $\theta_n$, we have $\nabla f_n(\theta_n) = 0$ so that For each $i \in \{1, \ldots, d\}$, a Taylor expansion applied to the real-valued function $\partial f_n/\partial x_i$ implies the existence of $\varepsilon_n(i) = (\varepsilon^{(1)}$ Define the $d \times d$ matrix $H_n = (H_n^{(i,j)})_{1 \leq i,j \leq d}$ by setting $H_n^{(i,j)} = \frac{\partial^2 f_n}{\partial x_i \partial x_j}(\varepsilon_n(i))$; Equation (13) can then be rewritten as $H_n(\theta_n - \theta) = -\nabla f_n(\theta)$. Now, set We can then write: Let $U$ be a compact set of $\mathbb{R}^d$ containing $\theta$. The combination of Lemmas 1 and 2 with $|\alpha| = 2$, $g_n = f_n$ and $b_n = h_n$ ensures that for any $\gamma > 0$ and $\varepsilon > 0$ small enough, Since $D^2 f$ is continuous in a neighbourhood of $\theta$ and since $\lim_{n\to\infty} \theta_n = \theta$ a.s., (16) ensures that $\lim_{n\to\infty} H_n = D^2 f(\theta)$ a.s. It follows that the weak and a.s. behaviours of $((\theta_n - \theta)^T, (\tilde\mu_n - \mu))^T$ are given by those of the right-hand side of (15).

Weak convergence rate of $([\nabla f_n(\theta)]^T, \tilde f_n(\theta) - f(\theta))^T$
Let us first assume that the following lemma holds.

is relatively compact and its limit set is
The combination of either (18) or (19) and of Lemma 4 gives the almost sure convergence rate of $([\nabla f_n(\theta)]^T, \tilde f_n(\theta) - f(\theta))^T$:
• If (C1) holds, then, with probability one, the sequence is relatively compact and its limit set is $E = \{\nu \in \mathbb{R}^{d+1} \text{ such that } \nu^T \Sigma^{-1} \nu \leq 1\}$.
• If $a = (d + 2q + 2)^{-1}$, $\tilde a = (d + 2q)^{-1}$, and if there exist $c, \tilde c \geq 0$ such that $\lim_{n\to\infty} n h_n^{d+2q+2}/(2 \log\log n) = c$ and $\lim_{n\to\infty} n \tilde h_n^{d+2q}/(2 \log\log n) = \tilde c$, then with probability one, the sequence is relatively compact and its limit set is
• If (C'2) holds, then
We now prove Lemma 4. Set let $(\varepsilon_n)$ be a sequence of $\mathbb{R}^{d+1}$-valued, independent and $N(0, \Gamma)$-distributed random vectors, and set $S_n = \sum_{k=1}^n Q_k \varepsilon_k$. In order to prove Lemma 4, we first establish the following Lemma 5 in Section 3.5.1, and then show in Section 3.5.2 how Lemma 4 can be deduced from Lemma 5.
Let $V$ be a compact set that contains $\theta$; for $n$ large enough, we get On the one hand, let us recall that the a.s. convergence rate of $(\theta_n - \theta)$ is given by that of $[D^2 f(\theta)]^{-1} \nabla f_n(\theta)$ (see (15) and the comment below). One can apply (27), (28), and (29) and obtain the exact a.s. convergence rate of $\theta_n - \theta$. However, to avoid assuming (A6), we apply here Lemmas 1 and 2 (with $|\alpha| = 1$ and $(g_n, b_n) = (f_n, h_n)$), and get the following upper bound of the a.s. convergence rate of $\theta_n - \theta$: for any $\gamma > 0$ and $\varepsilon > 0$ small enough, On the other hand, we have Let $L$ denote a generic slowly varying function that may vary from line to line.