Asymptotically sufficient statistics in nonparametric regression experiments with correlated noise

We find asymptotically sufficient statistics that could help simplify inference in nonparametric regression (NPR) problems with correlated errors. These statistics are derived from a wavelet decomposition that is used to whiten the noise process and to effectively separate high-resolution and low-resolution components. The lower-resolution components contain nearly all the available information about the mean function, and the higher-resolution components can be used to estimate the error covariances. The strength of the correlation among the errors is related to the speed at which the variance of the higher-resolution components shrinks, and this is treated as an additional nuisance parameter in the model. We show that the NPR experiment with correlated noise is asymptotically equivalent to an experiment that observes the mean function in the presence of a continuous Gaussian process that is similar to a fractional Brownian motion. These results provide a theoretical motivation for some commonly proposed wavelet estimation techniques.

AMS 2000 subject classifications: Primary 62B15; secondary 62G20, 62G08.


Introduction
A nonparametric regression (NPR) problem consists of estimating an unknown mean function that changes smoothly between observations at different design points. There are $n$ observations $Y_i$ of the form

$Y_i = \mu(i/n) + \xi_i, \qquad i = 1, \dots, n,$  (1.1)

where $\mu$ is the unknown smooth mean function on $[0,1]$ and the errors $\xi_i$ are observations from a zero-mean Gaussian process. For NPR problems that have a particular long-memory structure to the covariance of the error terms, we will find a continuous Gaussian experiment approximation to the problem of estimating the mean. Brown and Low [1] showed that the NPR experiment is asymptotically equivalent to the white-noise model in which the mean function is observed in the presence of a Brownian motion process. This result paralleled the work of Nussbaum [2] in showing that asymptotic results in nonparametric function estimation problems can be simplified using approximations by the continuous white-noise experiments that Pinsker [3] studied. The original asymptotic equivalence results for NPR experiments were extended by Brown et al. [4] and Carter [5, 6], along with refinements in the approximations from Rohde [7] and Reiss [8].
All of these results assume that the errors $\xi_i$ in (1.1) are independent, and this assumption is critical in establishing the appropriateness of a white-noise model that also has independent increments. We want to consider the effect of correlation between the observations on these approximations. Presumably, if the correlation is weak, then its effect washes out asymptotically. However, we wish to consider cases where there is sufficient long-range correlation to affect the form of the approximation. In particular, we will show that the appropriate approximation is by a continuous Gaussian process experiment that is no longer white noise but is closer to a fractional Brownian motion.
Our approach is motivated by the work of Johnstone and Silverman [9] and Johnstone [10]. They investigated the wavelet decomposition of data of this type and used a fractional Brownian motion approximation in the limit:

$dY(t) = \mu(t)\,dt + n^{-(\beta+1)/2}\,dB_K(t), \qquad t \in [0,1].$  (1.2)
They argued that the wavelet decomposition resulted in nearly independent coefficients, which simplified the inference significantly. We will assume that the $B_K(t)$ process is decorrelated by a wavelet decomposition and then show that this continuous model is asymptotically equivalent to the NPR experiment with the same covariance structure.
Theorem 1.1. The nonparametric regression experiment F observes $Y_i$ as in (1.1) for an unknown mean function $\mu$ from a parameter set $\mathcal M(M, \alpha)$ defined in Section 1.2 and a known covariance structure as described in Section 1.3. This experiment is asymptotically equivalent to the experiment E that observes

$dY(t) = \mu(t)\,dt + \sigma n^{-(\beta+1)/2}\,dB_K(t),$  (1.3)

where $B_K(t)$ is a Gaussian process with covariance kernel $K$.
This will be proven in two steps. First, Lemma 2.1 proves that the first n wavelet coefficients in a decomposition of dY t are asymptotically sufficient in E for estimating μ. For the second step, Lemma 3.1 shows that a discrete wavelet transform of the observations from F produces observations with nearly the same distribution as these asymptotically sufficient statistics.
Furthermore, in both experiments the lower-frequency terms in the wavelet decomposition are sufficient for estimating the means, allowing the higher-frequency terms to be used to give information about the variance process. This leads to Theorem 1.2, which proposes an experiment that allows some flexibility in the error structure.

Theorem 1.2. The NPR experiment F observes the $Y_i$ as in (1.1), where the covariance structure depends on the parameters $\beta$ and $\gamma$ and is such that the variance of the wavelet coefficients is $2^{\gamma + \beta(j+1)}$. The experiment E* observes the pair $(\tilde\gamma, \tilde\beta)$ (where the design matrix $x$ and weight matrix $\Lambda$ are defined in Section 4) and then, conditionally on $\tilde\gamma$ and $\tilde\beta$, observes the continuous Gaussian process

$dY(t) = \mu(t)\,dt + 2^{\tilde\gamma/2} n^{-(1+\tilde\beta)/2}\,dB_K(t),$

where the covariance $K$ is such that $\operatorname{Var} B_K(\psi_{jk}) = 2^{\tilde\beta(j+1)}$. The estimators $\tilde\beta$ and $\tilde\gamma$ are the same as $\hat\beta$, $\hat\gamma$ but truncated so that $-1 \le \tilde\beta \le 0$ and $\tilde\gamma \ge -c$.
This theorem can be seen as an extension of Theorem 1.1 of Carter [6] from a case where there is a single unknown variance for all the wavelet coefficients to a case where the variance changes as a log-linear function of the resolution level (or frequency). Wang [11] addressed the issue of asymptotically sufficient statistics in the fractional Brownian motion process. In Section 3 of that article there is an argument that bounds the difference between minimax errors in an NPR experiment with correlated errors and an experiment that observes the mean in the presence of fractional Brownian motion error. This result extends the sort of approximation used by Donoho and Johnstone [12] to correlated errors and is very much in the spirit of our Theorem 1.1 here. Our results differ from Wang [11] in that we have made a stronger assumption on the covariance structure of the errors in order to obtain the full asymptotic equivalence of the experiments, as discussed in Section 1.1.
Lemma 2.1 is presented and proven in Section 2. Section 3 presents Lemma 3.1 and the proof of Theorem 1.1. The proof for Theorem 1.2 is in Section 4 with some relevant bounds in Sections 5 and 6.

Asymptotic Sufficiency
Instead of focusing on single estimation techniques, we will consider approximations of the entire statistical experiment. For large sample sizes, there is often a simpler statistical experiment that can approximate the problem at hand. One benefit of finding an approximating experiment is that it may have convenient sufficient statistics even when they are not available in the original experiment.
Our approximations will therefore be of experiments rather than particular distributions. A statistical experiment P that observes data X consists of a set of probability distributions $\{P_\theta\}$ indexed by a parameter $\theta$ in the parameter set $\Theta$. We wish to compare the information about $\theta$ in P to that in another experiment Q that observes data Y from among the set of distributions $\{Q_\theta\}$ indexed by the same parameter $\theta$. Implicitly, we are concerned with two sequences of experiments $P_n$ and $Q_n$, where n roughly denotes the increasing sample size, but generally we will leave off the subscript n. It will always be understood that the distributions depend on the "sample size."

The NPR experiment will be approximated using Le Cam's notion of asymptotically equivalent experiments [13, 14] and asymptotically sufficient statistics [15]. Asymptotically equivalent experiments have corresponding inference procedures (such as estimators or tests) in each experiment that perform nearly as well. Specifically, if there is an estimator $\tau(X)$ in P with risk $E_{P_\theta} L(\tau(X))$, then, for bounded loss functions, there is an estimator $\varsigma(Y)$ such that $E_{Q_\theta} L(\varsigma(Y)) \le E_{P_\theta} L(\tau(X)) + \epsilon_n$ with $\epsilon_n \to 0$ as $n \to \infty$. These asymptotic equivalence results are stronger than the equivalence of minimax rates that is derived under a similar model by, for example, Wang [11]. Our results imply a correspondence over a range of bounded loss functions. Thus, the equivalence holds for a global $L_2$ error as well as for local error measurements or other distances.

Asymptotic sufficiency is a stronger notion: if $T(X)$ is a sufficient statistic for inference about $\theta$ in P, then $T(Y)$ is asymptotically sufficient for Q when the total-variation distance between $P_\theta$ and $Q_\theta$ is negligible. These asymptotically sufficient statistics generate experiments that are all asymptotically equivalent. In particular, P and Q are asymptotically equivalent, and they are also asymptotically equivalent to the experiments generated by the distributions of $T(X)$ and $T(Y)$.
As a result, an estimator in P should generally be of the form $\tau(T(X))$, and there is a corresponding estimator $\tau(T(Y))$ that performs nearly as well in the Q experiment. There is a basic transitivity property of asymptotic equivalence: if P is asymptotically equivalent to Q, and Q is asymptotically equivalent to R, then P is asymptotically equivalent to R.
Le Cam's asymptotic equivalence is characterized using the total-variation distance $\delta(P_\theta, Q_\theta)$ between the distributions. We will abuse this notation a bit by writing $\delta(P, Q) = \sup_\theta \delta(P_\theta, Q_\theta)$. It will often be more convenient to use the Kullback-Leibler divergence $D(P, Q) = \int \log(dP/dQ)\,dP$ to bound the total-variation distance:

$\delta(P, Q) \le \sqrt{D(P, Q)/2}.$
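This divergence bound can be checked numerically. The following sketch (my own illustration with hypothetical parameter values, not from the paper) computes the total-variation distance between two univariate Gaussians by numerical integration and compares it with the Pinsker-type bound $\sqrt{D/2}$:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def tv_distance(mu1, var1, mu2, var2, lo=-12.0, hi=12.0, steps=200000):
    # Total variation = (1/2) * integral of |p - q|, by a midpoint Riemann sum.
    h = (hi - lo) / steps
    return 0.5 * h * sum(
        abs(normal_pdf(lo + (i + 0.5) * h, mu1, var1)
            - normal_pdf(lo + (i + 0.5) * h, mu2, var2))
        for i in range(steps)
    )

def kl_divergence(mu1, var1, mu2, var2):
    # Closed form for D( N(mu1, var1) || N(mu2, var2) ).
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

tv = tv_distance(0.0, 1.0, 0.5, 1.3)
kl = kl_divergence(0.0, 1.0, 0.5, 1.3)
pinsker = math.sqrt(kl / 2.0)
```

The inequality guarantees `tv <= pinsker` for any pair of distributions; the integration here only confirms it for one hypothetical pair.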

Wavelet Basis
We will use orthonormal wavelet bases to characterize the function space and to simplify the covariance structure of the errors. Assuming we are considering periodic functions on the interval $[0, 1]$, we can construct periodic wavelet bases as in Daubechies [17, Chapter 9.3]. We start with a space $V_j$ consisting of functions of the form $\sum_k \theta_k \phi_{jk}(t)$, where $\{\phi_{jk}\}$ is an orthonormal set of periodic functions generated via periodization of the translates and dilations $2^{j/2}\phi(2^j t - k)$. We will work with a $\phi$ function having finite support $[0, N]$, and at the boundaries of the interval the $\phi_{jk}(t)$ are given the proper periodic extensions. This space generates wavelet functions $\psi_{jk}$ that span the difference between $V_j$ and $V_{j-1}$ and can be written $\psi_{jk}(t) = 2^{j/2}\psi(2^j t - k)$, with the proper periodic adjustment at the boundary. This periodic adjustment has a small effect at the high resolution levels but is a larger factor for small values of $j$. In particular, the scaling function at level 0 is $\phi_0(t) = \sum_{k=0}^{N} \phi(t + k) \equiv 1$. The mean functions $\mu(t)$ will be assumed to be constructed from this wavelet basis:

$\mu(t) = \theta_0 \phi_0(t) + \sum_{j \ge 0} \sum_k \theta_{jk} \psi_{jk}(t).$

We will restrict the mean functions to those that belong to a Hölder $\alpha$ class of functions. Specifically, the class of periodic mean functions is

$\mathcal M(M, \alpha) = \{\mu : |\mu(x) - \mu(y)| \le M|x - y|^{\alpha}\}$

for some $1/2 < \alpha < 1$ and $M > 0$. This smoothness condition on the functions bounds the rate of growth of the higher-frequency terms in the orthonormal expansion: originally from Meyer [18], and in Daubechies [17], the wavelet coefficients of such a $\mu$ satisfy a geometric decay bound in $j$ with exponent determined by $\varepsilon = 2\alpha - 1$.
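As an illustration of how smoothness controls the higher-frequency terms, the following sketch (my own, using the Haar wavelet rather than the smoother periodic bases assumed above, and a hypothetical smooth test function) computes $\theta_{jk} = \int \mu\,\psi_{jk}$ on a fine grid and shows the per-level maxima shrinking geometrically:

```python
import math

GRID = 2 ** 12  # fine grid on [0, 1) for approximating the integrals

def haar_psi(j, k, t):
    # psi_jk(t) = 2^{j/2} psi(2^j t - k) for the Haar mother wavelet.
    u = (2 ** j) * t - k
    if 0.0 <= u < 0.5:
        return 2 ** (j / 2)
    if 0.5 <= u < 1.0:
        return -(2 ** (j / 2))
    return 0.0

def mu(t):
    # A smooth periodic "mean function" (hypothetical example).
    return math.sin(2 * math.pi * t + 0.7)

def level_max_coeff(j):
    # max_k |theta_jk|, with theta_jk computed by a midpoint Riemann sum.
    best = 0.0
    for k in range(2 ** j):
        s = sum(mu((i + 0.5) / GRID) * haar_psi(j, k, (i + 0.5) / GRID)
                for i in range(GRID))
        best = max(best, abs(s / GRID))
    return best

maxima = [level_max_coeff(j) for j in range(1, 6)]
```

For a Lipschitz function the Haar coefficients decay like $2^{-3j/2}$ per level, so the maxima drop quickly; only the qualitative decay, not the exact rate, is asserted here.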

Error Structure
These results rely on a specific structure for the covariance matrix of the errors in the NPR experiment. As in Johnstone [10], the fractional Brownian motion is the motivating example for our continuous Gaussian model. However, this model does not necessarily provide the independent coefficients that would simplify the inference. Instead, we will consider an error structure that has roughly some of the properties of the fractional Brownian motion.

Journal of Probability and Statistics
Traditionally, the asymptotics of the NPR experiment have assumed independent noise. This white-noise model is especially convenient because all of the eigenvalues of the covariance operator are equal. Thus, any orthonormal basis generates a set of independent standard normal coefficients. With a more general covariance function, the eigenvalues are different and only particular decompositions lead to independent coefficients. Thus there is much less flexibility in the choice of basis, and this basis determines some of the structure of the covariance.
Following Johnstone [10], Johnstone and Silverman [9], Zhang and Walter [19], Wang [11], and Cavalier [20], among others, we will assume a covariance structure that is whitened by a wavelet decomposition. When there is a long-range positive correlation between the observations, the wavelet decomposition tends to decorrelate the error process because the wavelet functions act like bandpass filters.
We will assume that there exists an orthonormal basis $\phi_0$ and $\psi_{jk}$ for $j \ge 0$ and $k = 0, \dots, 2^j - 1$ such that the decomposition of the error process generates independent normal coefficients. In other words, the error process is a zero-mean Gaussian process that is roughly, in the distributional sense,

$\xi(t) = \xi_0 \phi_0(t) + \sum_{j,k} \xi_{jk} \psi_{jk}(t),$  (1.4)

where the $\xi_{jk}$ are independent normals. The $\operatorname{Var}\xi_{jk}$ will be assumed to depend on $j$ and not on $k$, as a sort of stationarity condition. In particular, we will assume that $\operatorname{Var}\xi_0 = \sigma^2$ and then $\operatorname{Var}\xi_{jk} = \sigma^2 2^{\beta(j+1)}$ for some $\beta$ in the interval $(-1, 0]$. If $\beta = 0$, then this is the white-noise process.

This is a convenient form for the error, but it is not completely unrealistic. Wavelet decompositions nearly whiten the fractional Brownian motion process. Wornell [21] argued that long-memory processes can be constructed via a wavelet basis with variances at resolution level $j$ shrinking like $2^{-\gamma j}$ for $0 < \gamma < 2$. McCoy and Walden [22] showed that the discrete wavelet transform nearly decorrelates the noise in fractionally differenced white-noise processes. Alternatively, Wang [11] used a wavelet-vaguelette decomposition [23] to find a decomposition of the fractional Brownian motion that results in independent coefficients for a nearly orthonormal basis. Section 7 demonstrates some properties of the specific Gaussian process generated by using the Haar basis as the wavelet basis. These properties are consistent with the sort of behavior that we want in the covariances of our observations. The correlation between observations decreases like $d^{-(\beta+1)}$ for $\beta < 0$, where $d$ measures the distance between the locations of the coefficients.
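To make the assumed structure concrete, the following sketch (my own illustration, using the discrete Haar transform as the whitening basis, written so that the rows of $W$ are the basis vectors) builds the implied error covariance from the level variances $2^{\beta(j+1)}$ and verifies that the wavelet transform returns uncorrelated coefficients:

```python
import math

def haar_matrix(J):
    # Orthonormal discrete Haar transform; rows are the basis vectors
    # (one scaling row, then 2^j wavelet rows at each level j).
    n = 2 ** J
    rows = [[1.0 / math.sqrt(n)] * n]
    for j in range(J):
        block = n // 2 ** j
        amp = (2 ** (j / 2)) / math.sqrt(n)
        for k in range(2 ** j):
            row = [0.0] * n
            for t in range(block):
                row[k * block + t] = amp if t < block // 2 else -amp
            rows.append(row)
    return rows

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

J, beta = 4, -0.5
n = 2 ** J
W = haar_matrix(J)

# Level variances: Var(y_0) = 1 and Var(y_jk) = 2^{beta (j+1)}.
variances = [1.0] + [2 ** (beta * (j + 1)) for j in range(J) for _ in range(2 ** j)]
D = [[variances[r] if r == i else 0.0 for i in range(n)] for r in range(n)]

# Covariance of the correlated errors xi = W^T y:  Sigma = W^T D W.
Sigma = matmul(transpose(W), matmul(D, W))
# Transforming back should recover the diagonal D:  W Sigma W^T.
back = matmul(W, matmul(Sigma, transpose(W)))

orth = matmul(W, transpose(W))
max_orth_err = max(abs(orth[i][j] - (1.0 if i == j else 0.0))
                   for i in range(n) for j in range(n))
max_offdiag = max(abs(back[i][j]) for i in range(n) for j in range(n) if i != j)
max_diag_err = max(abs(back[i][i] - variances[i]) for i in range(n))
```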
A well-established method for estimating the parameter β in these long-range dependent models is to fit a linear function to the log of an estimate of the variances of the coefficients at each resolution level. This idea goes back at least to Abry and Veitch [24] and is now a standard approach that has been improved upon in subsequent work; see Veitch and Abry [25] and Stoev et al. [26], among others. This motivates the asymptotically sufficient statistics in Theorem 1.2, which are least-squares estimates from the fitted line.
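A minimal version of this log-regression idea (my own illustration, not the paper's exact weighted estimator) recovers $\gamma$ and $\beta$ exactly from noiseless level variances $2^{\gamma+\beta(j+1)}$, since the expected level energy $E V_j = 2^j \cdot 2^{\gamma+\beta(j+1)}$ is log-linear in $j$:

```python
import math

gamma_true, beta_true = -0.3, -0.6
levels = list(range(3, 10))

# Expected level energies: E V_j = 2^j * 2^{gamma + beta (j+1)}.
EV = [2 ** j * 2 ** (gamma_true + beta_true * (j + 1)) for j in levels]

# Regress log2(E V_j) - j = gamma + beta (j+1) on (1, j+1) by least squares.
xs = [j + 1 for j in levels]
ys = [math.log2(v) - j for j, v in zip(levels, EV)]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
beta_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
gamma_hat = ybar - beta_hat * xbar
```

With data, $E V_j$ would be replaced by the empirical sum of squared coefficients at level $j$, and the fit is typically weighted by the effective number of coefficients per level, as in Abry and Veitch [24].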
The assumptions in Theorem 1.1 on the covariance structure of the errors are strong and could limit the applicability of the result. However, if we allow the variances at the different scales to have a range of log-linear relationships, we obtain a sufficiently rich class of error models. Theorem 1.2 allows for this somewhat larger class of models, and it seems likely that the changing magnitude of the variances over the different resolution levels has a greater effect on the distribution of the observed errors than the underlying basis does.

Approximate Sufficiency in the Gaussian Sequence Experiment
The first step in the proof of Theorem 1.1 is to establish that a truncated wavelet decomposition is asymptotically sufficient for the continuous Gaussian experiment.
In the Gaussian sequence experiment E, where only the mean $\mu(t)$ is to be estimated, the likelihood is

$\frac{dP_\mu}{dP_0}(Y) = \exp\left\{ \frac{n^{1+\beta}}{\sigma^2}\left( \theta_0 y_0 - \frac{\theta_0^2}{2} \right) + \frac{n^{1+\beta}}{\sigma^2} \sum_{j,k} 2^{-\beta(j+1)} \left( \theta_{jk} y_{jk} - \frac{\theta_{jk}^2}{2} \right) \right\},$

where $P_\mu$ is the distribution of $Y(t)$, and $P_0$ is the distribution of the version with mean 0, which would just be $\sigma n^{-(\beta+1)/2} B_K(t)$. We want to approximate this experiment E by a similar experiment Ē where the mean is projected onto the first $j^*$ resolution levels; that is, $\mu$ is replaced by

$\bar\mu(t) = \theta_0 \phi_0(t) + \sum_{j \le j^*} \sum_k \theta_{jk} \psi_{jk}(t).$

The likelihood then depends on the data only through the coefficients up to level $j^*$. Therefore, this experiment Ē has sufficient statistics $y_0$ and $y_{jk}$ for $0 \le j \le j^*$. These observations are approximately sufficient in the E experiment if the distance between the distributions in the two experiments is small.

By (5.3), the distance between these two experiments is controlled by the energy of the mean in the resolution levels above $j^*$. For the parameter space $\mathcal M(M, \alpha)$, (5.4) bounds the distance between the two experiments by a quantity that is negligible as $n \to \infty$ when the dimension of the set of sufficient statistics increases quickly enough that $j^* > J/(2\alpha)$, for $-1 < \beta < 0$. The worst case is $\beta = 0$, and thus we have proved Lemma 2.1.

Approximating the NPR Experiment
Theorem 1.1 can now be proven by approximating the sufficient statistics from Lemma 2.1 using the observations from the NPR experiment F. We suppose that we have n observations from the NPR experiment as in (1.1), where the $\xi_i$ are Gaussian random variables with a specified covariance function. Specifically, let $W$ be the $n \times n$ orthogonal matrix that performs the discrete wavelet transform, and let $W^\top$ be its inverse. The vector of random wavelet coefficients from Lemma 2.1, $y = (y_0, y_{00}, y_{01}, \dots, y_{J-1, 2^{J-1}-1})$ where $J = \log_2 n$, can be transformed via the inverse wavelet transform to create $y_J = W^\top y$.
The expected value of this transformed vector is $E y_J = W^\top \theta$, where $\theta$ is the vector of wavelet coefficients of $\mu$. For a $\mu(t)$ function that is nearly constant around $i/n$, $\mu_{Ji} \approx 2^{-J/2}\mu(i/n)$, so we can approximate $y_J$ by $(1/\sqrt n)(Y_1, Y_2, \dots, Y_n)$. In the original NPR experiment, the variances are $\operatorname{Var} Y_i = \operatorname{Var}\xi_i = C\sigma^2$ for a constant $C$ that depends on $\beta$ and the basis we will be using. The covariance matrix for these $Y_i$ will be assumed to be $\Sigma = W D W^\top$, where $D$ is the diagonal matrix of $\operatorname{Var} y_{jk} = \sigma_n^2 2^{\beta(j+1)}$. The variance of the $y_{Jk}$ should be the same as that of $Y_i n^{-1/2}$, and in the model described, $\operatorname{Var} y_{Jk} \propto \sigma_n^2 n^{\beta}$. Therefore, $\sigma_n^2$ should be set to $\sigma^2 n^{-1-\beta}$. The NPR observations are such that the covariance matrix of the $\xi_i$ is also $\Sigma$, and therefore the total-variation distance between the distributions is bounded in (5.2) by the $\Sigma^{-1}$-norm of the difference between the mean vectors. A standard calculation bounds this difference when $\phi(t) < M$ with support on $[0, N]$ and the $\mu(t)$ are Hölder $\alpha$ for $\alpha < 1$. The covariance matrix is a positive definite matrix such that $\Sigma^{-1} = W D^{-1} W^\top$. The first column of the wavelet transform matrix is $n^{-1/2}\mathbf 1$, where $\mathbf 1$ is the vector of 1's. Therefore, the distance between the two sets of observations is negligible for large $n$ and $\alpha > 1/2$, which establishes Lemma 3.1.

Proof of Theorem 1.1
The theorem follows from the fact that the observations $y_0$, $\{y_{jk}\}$ for $j = 0, \dots, J-1$ are asymptotically sufficient for the continuous process in (1.2). A linear function $y_J = W^\top y$ of these sufficient statistics is then still approximately sufficient. Thus, the experiment that seeks to draw inference about $\mu$ from the observations $y_{Ji}$ is asymptotically equivalent to the experiment that observes (1.2) by Lemma 2.1.
Furthermore, by Lemma 3.1, the original NPR experiment, which has the same covariance structure as the $y_{Ji}$, is asymptotically equivalent to that experiment and thus, by transitivity, to the experiment that observes the process (1.2) as well. This proves Theorem 1.1.

Remarks on the Covariance Structure
This result is restrictive in that it requires a specific known covariance structure. We are working under the assumption that the covariance matrix has eigenfunctions that correspond to a wavelet basis. This does not generally lead to a typical covariance structure. It does not even necessarily lead to a stationary Gaussian process; see the Haar basis example below.
The difficulty is that the requirement for having asymptotically equivalent experiments is quite strict, and the total-variation distance between processes with even small differences in the structure of the covariance is not negligible. For two multivariate Gaussian distributions with the same means, where one covariance matrix is $\Sigma$ and the other is $D$, a diagonal matrix with the same diagonal elements as $\Sigma$, the Kullback-Leibler divergence between the distributions is $\frac12\left(\log|D| - \log|\Sigma|\right)$; the trace term drops out because the two matrices share a diagonal.
If the correlation between the highest-level coefficients is $\operatorname{Corr}(\xi_{j^*k}, \xi_{j^*,k+1}) = \rho$, then the contribution to the difference of the log determinants is on the order of $\rho^2 2^{j^*}$. The dimension of the problem is growing while the correlations are generally not going to 0 sufficiently quickly. For instance, in a typical wavelet basis decomposition of the true fractional Brownian motion, $\operatorname{Corr}(\xi_{j^*k}, \xi_{j^*,k+1}) = c_\beta$, where $c_\beta$ is a constant that depends on $\beta$ but not on $j^*$ or $n$. Thus, the difference $\log|D| - \log|\Sigma|$ will not go to 0 as the sample size increases. Therefore, for the sort of long-range correlation structures that we are considering here, the eigenfunctions of the kernel $K$ need to be known, or else the experiments will not be asymptotically equivalent.
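The non-vanishing log-determinant gap can be seen in a toy calculation (my own illustration, not the paper's wavelet covariance): take Σ block-diagonal with 2×2 blocks [[1, ρ], [ρ, 1]], so that the matching diagonal matrix D is the identity; then the divergence is $-\frac{m}{2}\log(1-\rho^2)$ for $m$ blocks, which grows linearly with the dimension:

```python
import math

def det_gauss(M):
    # Determinant by Gaussian elimination with partial pivoting.
    A = [row[:] for row in M]
    n = len(A)
    det = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            det = -det
        det *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for cc in range(c, n):
                A[r][cc] -= f * A[c][cc]
    return det

def kl_vs_diagonal(m, rho):
    # KL( N(0, Sigma) || N(0, D) ) with equal diagonals: (1/2) log(|D|/|Sigma|);
    # here D is the identity, so log|D| = 0.
    n = 2 * m
    Sigma = [[0.0] * n for _ in range(n)]
    for b in range(m):
        Sigma[2 * b][2 * b] = Sigma[2 * b + 1][2 * b + 1] = 1.0
        Sigma[2 * b][2 * b + 1] = Sigma[2 * b + 1][2 * b] = rho
    return -0.5 * math.log(det_gauss(Sigma))

rho = 0.4
kl8 = kl_vs_diagonal(8, rho)
kl16 = kl_vs_diagonal(16, rho)
closed_form8 = -0.5 * 8 * math.log(1 - rho ** 2)
```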

Estimating the Covariance of the Increments
The key limitation of Theorem 1.1 is that it supposes that the covariance structure of the errors is known to the experimenter. To make the approximation more useful, it would help if the covariance structure was more flexible. A strategy similar to that used by Carter 6 can be used to estimate the variances of the coefficients.
In Carter [6], I showed that a model with a variance that changes slowly over time can still be approximated by the Gaussian process as long as all of the observations are independent. Our result here is that, for correlated observations, if the variance is a log-linear function of the frequency, then a similar technique can be used to establish a set of asymptotically sufficient statistics.
Flexibility with regard to the covariance structure is added by allowing the magnitude of $\operatorname{Var} y_{jk}$ to depend on the resolution level $j$. The variances will be described by two parameters, $\gamma$ and $\beta$, which characterize the size of the error and the speed at which it shrinks at higher resolution levels. These nuisance parameters can be estimated using part of the data, and then the inference can be carried out conditionally on the estimates.
Specifically, the experiment $E_n$ observes independent components

$y_0 \sim N\big(\theta_0,\; n^{-(1+\beta)} 2^{\gamma}\big), \qquad y_{jk} \sim N\big(\theta_{jk},\; n^{-(1+\beta)} 2^{\gamma + \beta(j+1)}\big), \quad 0 \le j \le J - 1,$

where the $n^{-(1+\beta)}$ factor is included to match up with the scaling functions at the $J$th resolution level. These observations form a new experiment with a parameter set that includes $(\mu, \gamma, \beta)$, where $\mu(t) \in \mathcal M(M, \varepsilon)$, $-1 < \beta < 0$, and $\gamma$ is bounded below by a constant $-c$.
This experiment $E_n$ with the parametric variances is no longer approximately sufficient for the experiment that observes all of the $\theta_{jk}$: that experiment has too much information about the variance. If we observed the entire sequence at all resolution levels $j$, then $\gamma$ and $\beta$ could be estimated exactly. We need to adopt another approximating experiment, as in Carter [6]. Many of the bounds in this section follow arguments from that paper.

Proof of Theorem 1.2
The theorem can be proven by applying Lemma 3.1 and then a version of Lemma 2.1 that uses only a small proportion of the low-frequency wavelet coefficients. The rest of the coefficients can be used to fix the parameters in the covariance of the observations.

The first step is to decompose the nonparametric regression into a set of wavelet coefficients. The n NPR observations $Y_i$ can be transformed by dividing by $\sqrt n$ and then performing the discrete wavelet transformation as in Lemma 3.1. The result is that the sequence of n wavelet coefficients $y_0$ and $y_{jk}$ for $j = 0, \dots, J-1$ is equivalent to the original NPR observations, with a total-variation distance between the distributions whose supremum over all $\gamma > -c$ and $\beta < 0$ satisfies

$\delta(P, \bar P) \le C 2^{c} n^{1 - 2\alpha},$  (4.3)

which is negligible for $\alpha > 1/2$.
The key strategy is to break the observations from this wavelet decomposition into pieces starting at level $j^*$: the observations at levels $j \le j^*$ are assumed to be informative about the means, and the higher resolution levels are used to estimate the covariance structure.
For each resolution level with $j > j^*$, we generate the approximately sufficient statistics $V_j = \sum_k y_{jk}^2$. Along with the $y_{jk}$ for $j \le j^*$, the collection of $V_j$ is exactly sufficient if the means are $\theta_{jk} = 0$ for $j > j^*$, because if there is no information about the means in the higher-frequency terms, then we have a piece of the experiment that is like a normal scale family. This new experiment $E_v$ is asymptotically equivalent to our $E_n$.
The error in approximating $E_n$ by $E_v$, where the means of the higher-frequency coefficients are 0, is bounded by (5.3):

$\frac{n^{1+\beta}}{2^{\gamma+1}} \sum_{j > j^*} \sum_k \theta_{jk}^2\, 2^{-\beta(j+1)}.$  (4.4)
For $\theta_{jk}$ in the $\mathcal M(M, \varepsilon)$ space, (5.4) bounds this distance by a quantity that is negligible when $j^* > J/(1 + 2\varepsilon)$. This $E_v$ experiment has sufficient statistics $y_0$, the $y_{jk}$ for $j \le j^*$, and the $V_j$ for $j^* < j < J$. Furthermore, there are approximately sufficient statistics in this experiment $(y_{jk}, \hat\gamma, \hat\beta)$, where $\hat\gamma$ and $\hat\beta$ are the weighted least-squares estimates of $\gamma$ and $\beta$ from the data $\log V_j$. These are exactly sufficient statistics in the experiment E′ that observes the $y_0$ and $y_{jk}$ for the lower resolution levels $j \le j^*$ as before, in addition to the observations $2^{W_j}$ for $j^* < j < J$, where $W_j$ is a normal approximation to $\log_2 V_j$. The distance between E′ and $E_v$ depends on the distance between the distribution of the log of the Gamma variables and the normal approximation to this distribution. The calculation in [6, Section 10.1] gives a bound on the Kullback-Leibler divergence of $D(Q_j, \hat Q_j) \le 2^{-j}$, where $Q_j$ is the distribution of $V_j$ and $\hat Q_j$ is the distribution of $2^{W_j}$. Therefore, the total error between the two experiments is bounded by $\sum_{j > j^*} 2^{-j} \le 2^{-j^*}$, so the observations in E′ are asymptotically sufficient for $E_v$, and thus also for $E_n$, as long as $j^* \to \infty$ with n.
In the experiment E′, the sufficient statistics for estimating $\gamma$ and $\beta$ are the weighted least-squares estimators $\hat\gamma$ and $\hat\beta$:

$(\hat\gamma, \hat\beta)^\top = (x^\top \Lambda x)^{-1} x^\top \Lambda W_J,$

where $\Lambda$ is the diagonal matrix with entries $2^j$ for $j = j^* + 1, \dots, J - 1$ along its diagonal, $x$ is the design matrix with rows $(1, j - J + 1)$, and $W_J$ is the column of observations $W_j - J$. The vector of estimators is normal with mean $(\gamma, \beta)^\top$ and covariance $\frac{1}{2 \log 2}\,(x^\top \Lambda x)^{-1}$.
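The role of the weights can be sketched as follows (my own illustration, with a hypothetical level range and constant c, under the assumption that the log-level statistics have variances proportional to $2^{-j}$): for weighted least squares with weights proportional to the inverse variances, error propagation collapses the estimator covariance to exactly $c\,(x^\top \Lambda x)^{-1}$.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(M):
    # Inverse of a 2x2 matrix.
    (a, b), (c2, d) = M
    det = a * d - b * c2
    return [[d / det, -b / det], [-c2 / det, a / det]]

c = 2.0
levels = list(range(4, 12))
m = len(levels)
x = [[1.0, float(j + 1)] for j in levels]                       # design rows (1, j+1)
Lam = [[(2.0 ** levels[r] if r == i else 0.0) for i in range(m)] for r in range(m)]
C = [[(c * 2.0 ** -levels[r] if r == i else 0.0) for i in range(m)] for r in range(m)]

xtLx_inv = inv2(matmul(transpose(x), matmul(Lam, x)))
H = matmul(xtLx_inv, matmul(transpose(x), Lam))   # WLS map: estimate = H @ observations
cov_est = matmul(H, matmul(C, transpose(H)))      # error propagation through H
target = [[c * v for v in row] for row in xtLx_inv]

max_err = max(abs(cov_est[i][j] - target[i][j]) for i in range(2) for j in range(2))
```

This is the standard reason for choosing $\Lambda = \operatorname{diag}(2^j)$: deeper levels average more coefficients and so their log-statistics are less variable.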
Therefore, we can compare this experiment E′ to an experiment E″ that observes the same $(\hat\gamma, \hat\beta)$, but where the $y_0$ and $y_{jk}$ for $j \le j^*$ are replaced by Gaussian random variables $\hat y_{jk}$ with variances conditional on $(\hat\gamma, \hat\beta)$ that are $\operatorname{Var} \hat y_{jk} = 2^{\hat\gamma + \hat\beta(j - J + 1) - J}$. The error in this approximation depends on the distance between the two sets of independent normal experiments with different variances. Letting $P_{jk}$ be the distribution of $y_{jk}$ and $\hat P_{jk}$ the distribution of $\hat y_{jk}$, the bound (6.12) in Section 6 gives a bound (4.10) on the expected divergence between $P_{jk}$ and $\hat P_{jk}$ for each coordinate.
There are $2^{j^*+1}$ independent normals $y_{jk}$ for $j \le j^*$, so the total divergence (4.11) is $2^{j^*+1}$ times this bound. Therefore, the experiments E′ and E″ are asymptotically equivalent for

$j^* = J - 2\log_2 J - \eta_n$  (4.12)

for some $\eta_n \to \infty$.
We can improve this approximation by replacing the estimators $\hat\beta$ and $\hat\gamma$ in E″ with the truncated versions

$\tilde\beta = (\hat\beta \vee (-1)) \wedge 0$  (4.13)

and $\tilde\gamma = \hat\gamma \vee (-c)$, to match up with the bounds on the parameter space. The new version of this experiment therefore observes $(\tilde\gamma, \tilde\beta)$ and the normal coordinates $\hat y_{jk} \sim N\big(\theta_{jk},\; n^{-(1+\tilde\beta)} 2^{\tilde\gamma} 2^{\tilde\beta(j+1)}\big)$ for $0 \le j \le j^*$. The error between E′ and this new version of E″ is smaller because $|\tilde\gamma - \gamma + (\tilde\beta - \beta)(j - J)| \le |\hat\gamma - \gamma + (\hat\beta - \beta)(j - J)|$, which makes the bound in (6.2) uniformly smaller.

Finally, we create a continuous Gaussian version of the E″ experiment. This approximation E* observes all the $\hat y_{jk}$ for $j \ge 0$ with means $\theta_{jk}$ and variances $n^{-(1+\tilde\beta)} 2^{\tilde\gamma + \tilde\beta(j+1)}$. The observations in E″ are actually sufficient statistics for the experiment E* that observes $(\tilde\gamma, \tilde\beta)$ and the $\hat y_{jk}$ both for $0 \le j \le j^*$ and for $j > j^*$, since the coefficients above level $j^*$ carry no information about the means. The difference between the experiments E″ and E*, conditional on $\tilde\gamma$ and $\tilde\beta$, is, as in Section 2 and (5.4), less than $M 2^{-\tilde\gamma}\, 2^{(J - j^*)(1+\tilde\beta)/2 - \tilde\beta/2 - \varepsilon j^*}$. The expectation of this bound when averaged over the possible values of $(\tilde\gamma, \tilde\beta)$ is a bound on the unconditional error. Furthermore, this expectation is less than the minimum over the possible values of $(\tilde\gamma, \tilde\beta)$ (this is the real advantage that comes from going from $(\hat\gamma, \hat\beta)$ to $(\tilde\gamma, \tilde\beta)$). Thus, the bound goes to 0 as $J \to \infty$ for $\varepsilon > 0$. At the same time, the bound in (4.11) remains negligible for the choice of $j^*$ in (4.12). Thus, Theorem 1.2 is established.

Bounding the Total Variation Distance
We need a bound on the distance between two multivariate normal distributions with different means in order to bound the error in many of our approximations. For shifted Gaussian processes, the total-variation distance between the distributions is

$\delta\big(P_{\mu_1}, P_{\mu_2}\big) = 2\Phi\big(\|\Delta\|/2\big) - 1,$  (5.1)

where $\|\Delta\|^2 = (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)$. The expression in (5.1) for the total-variation distance is concave for positive $\|\Delta\|$, so a simple expansion gives

$\delta\big(P_{\mu_1}, P_{\mu_2}\big) \le \frac{\|\Delta\|}{\sqrt{2\pi}}.$  (5.2)

For the Gaussian process with correlated components, we will assume that the variance of each wavelet coefficient is of the form $\operatorname{Var} y_{jk} = \sigma^2 n^{-(1+\beta)} 2^{\beta(j+1)}$, where the variance is calibrated so that $\operatorname{Var} Y_i = \sigma^2 n \operatorname{Var} B_K(\phi_{J\cdot})$. A bound on the error in the projection onto the span of the $\psi_{jk}$ for $j > j^*$ comes from (5.2), which depends on the sum of the squared coefficients above level $j^*$ (5.3); the upper bound (5.4) on this sum follows from $n = 2^J$, the definition of $\mathcal M(M, \alpha)$, the bound in (1.13), and $-1 < \beta < 0$. This error is negligible as $J \to \infty$ whenever

$j^* > \frac{J}{2\alpha}.$  (5.5)
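In the one-dimensional case, the shifted-Gaussian total-variation formula and its linearization can be checked directly (a sketch with a hypothetical shift, not taken from the paper): for $N(0,1)$ versus $N(\Delta,1)$, the total variation is $2\Phi(\Delta/2) - 1$, which concavity bounds by $\Delta/\sqrt{2\pi}$.

```python
import math

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tv_shifted(delta, lo=-15.0, hi=15.0, steps=200000):
    # (1/2) * integral of |phi(x) - phi(x - delta)| dx, midpoint Riemann sum.
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        p = math.exp(-x * x / 2.0)
        q = math.exp(-(x - delta) ** 2 / 2.0)
        total += abs(p - q)
    return 0.5 * total * h / math.sqrt(2.0 * math.pi)

delta = 0.8
tv_numeric = tv_shifted(delta)
tv_formula = 2.0 * Phi(delta / 2.0) - 1.0
linear_bound = delta / math.sqrt(2.0 * math.pi)
```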

Bounds from the Estimated Variances
In order to expand our asymptotically sufficient statistics out into a continuous Gaussian experiment, we need a bound on the total-variation distance between $E_n$, which, for $0 \le j \le j^*$, observes a sequence of normals with variances $n^{-(1+\beta)} 2^{\gamma + \beta(j+1)}$, and $E_g$, which observes a similar set of normals with the estimated variances $n^{-(1+\tilde\beta)} 2^{\tilde\gamma + \tilde\beta(j+1)}$.
For two normal distributions with the same mean $\mu$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, the Kullback-Leibler divergence is

$D = \frac12 \left( \frac{\sigma_1^2}{\sigma_2^2} - 1 - \log \frac{\sigma_1^2}{\sigma_2^2} \right).$  (6.1)

Thus, for $Q_{jk}$ the distribution of the $y_{jk}$ and $\hat Q_{jk}$ the distribution of the $\hat y_{jk}$, the divergence between the conditional distributions given $\tilde\gamma$ and $\tilde\beta$ follows from (6.1). This divergence between conditional distributions can be used to bound the joint divergence:

$D\big(Q_{jk}, \hat Q_{jk}\big) \le E\, D\big(Q_{jk}, \hat Q_{jk} \mid \tilde\gamma, \tilde\beta\big),$  (6.2)

where the expectation is taken over the estimators $\tilde\gamma$ and $\tilde\beta$.
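The closed form for the divergence between two equal-mean normals can be verified against numerical integration (a standalone check with hypothetical variances):

```python
import math

def kl_closed(var1, var2):
    # KL( N(mu, var1) || N(mu, var2) ) = (1/2)(r - 1 - log r), with r = var1/var2.
    r = var1 / var2
    return 0.5 * (r - 1.0 - math.log(r))

def kl_numeric(var1, var2, lo=-12.0, hi=12.0, steps=200000):
    # Direct integral of p * log(p/q) by a midpoint Riemann sum.
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        p = math.exp(-x * x / (2.0 * var1)) / math.sqrt(2.0 * math.pi * var1)
        q = math.exp(-x * x / (2.0 * var2)) / math.sqrt(2.0 * math.pi * var2)
        total += p * math.log(p / q)
    return total * h

v1, v2 = 1.7, 0.9
closed = kl_closed(v1, v2)
numeric = kl_numeric(v1, v2)
```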
To bound the expected value of the divergence in (6.2), we need the distribution of the estimators; plugging the normal distribution of $(\hat\gamma, \hat\beta)$ into the expectation yields the bound (6.12) on the expected divergence for each of the $y_{jk}$. Analogously, the expected divergence between $Q_0$ and $\hat Q_0$ is bounded by (6.13). If we add up these errors over the $2^{j^*}$ observations in the experiment, we get that the error in the approximation is less than $C(\log n)^2 / (n 2^{-j^*} - 1)$, which is negligible for $j^*$ sufficiently small.

Haar Basis Covariance
The Haar basis is a simple enough wavelet basis that we can make some explicit calculations of the properties of the error distribution. We will show that the resulting errors $\xi_i$ have variances of approximately $n^{\beta}$, as expected, and that the correlation between $\xi_i$ and $\xi_j$ decreases at about the rate $|i - j|^{-(1+\beta)}$.
The scaling functions for the Haar basis are constant on the $2^j$ dyadic intervals at resolution level $j$. The assumption is that we have a single scaling-function coefficient with $\operatorname{Var} y_0 = 1$, and that every wavelet coefficient $y_{jk}$ is independent with variance $2^{\beta(j+1)}$. The covariances can then be calculated from the synthesis formula for the Haar basis.
The formula for synthesizing the scaling-function coefficients $y_{Jk}$ from the wavelet decomposition is

$y_{Jk} = 2^{-J/2}\, y_0 + \sum_{j=0}^{J-1} \zeta_{j,J,k}\, 2^{(j-J)/2}\, y_{jk^*},$  (7.1)

where $k^*$ is the index such that $\psi_{jk^*}$ has support that includes the support of $\phi_{Jk}$. The $\zeta_{j,J,k}$ is either 1 or $-1$ depending on whether $\phi_{Jk}$ sits in the positive or negative half of the $\psi_{jk^*}$ function. Using the covariance structure described above, the variance of $y_{Jk}$ is

$\operatorname{Var} y_{Jk} = 2^{-J}\left( 1 + 2^{\beta}\, \frac{2^{(1+\beta)J} - 1}{2^{1+\beta} - 1} \right)$  (7.2)

for $-1 < \beta < 0$. For $\beta = 0$, the variance of each scaling-function coefficient is 1, as in white noise. For $\beta = -1$, direct calculation leads to a variance of $2^{-J}(1 + J/2)$. To find the covariance between two variables $y_{Jk_1}$ and $y_{Jk_2}$, we need $j^*$, the highest resolution level such that the support of $\psi_{j^*k^*}$ includes the support of both scaling functions $\phi_{Jk_1}$ and $\phi_{Jk_2}$. The covariance is thus

$\operatorname{Cov}\big(y_{Jk_1}, y_{Jk_2}\big) = 2^{-J}\left( 1 + 2^{\beta}\, \frac{2^{(1+\beta)j^*} - 1}{2^{1+\beta} - 1} - 2^{\beta}\, 2^{(1+\beta)j^*} \right).$  (7.3)
For large $J$, the correlation is on the order of $d^{-(1+\beta)}$, where $d = 2^{J - j^*}$ is a proxy for the distance between the observations. For $\beta = 0$, all of these covariances are 0. For $\beta = -1$, the correlation is $(j^* + 1)/(J + 2)$.
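These Haar calculations can be reproduced numerically (my own illustration with $J = 5$ and $\beta = -0.5$): build the covariance implied by the synthesis formula with $\operatorname{Var} y_0 = 1$ and $\operatorname{Var} y_{jk} = 2^{\beta(j+1)}$, then read off correlations at dyadic separations.

```python
import math

def haar_matrix(J):
    # Orthonormal discrete Haar transform; rows are the basis vectors.
    n = 2 ** J
    rows = [[1.0 / math.sqrt(n)] * n]
    for j in range(J):
        block = n // 2 ** j
        amp = (2 ** (j / 2)) / math.sqrt(n)
        for k in range(2 ** j):
            row = [0.0] * n
            for t in range(block):
                row[k * block + t] = amp if t < block // 2 else -amp
            rows.append(row)
    return rows

J, beta = 5, -0.5
W = haar_matrix(J)
variances = [1.0] + [2 ** (beta * (j + 1)) for j in range(J) for _ in range(2 ** j)]

def cov(i1, i2):
    # Sigma = W^T D W, so Sigma[i1][i2] = sum_r var_r * W[r][i1] * W[r][i2].
    return sum(v * row[i1] * row[i2] for v, row in zip(variances, W))

var0 = cov(0, 0)
corrs = [cov(0, d) / var0 for d in (1, 2, 4, 8, 16)]
```

The resulting correlations are positive and fall off with the dyadic separation, consistent with the $d^{-(1+\beta)}$ rate discussed above.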