We propose a new test for detecting heteroscedasticity of the error term in nonparametric regression. Simulation experiments are then conducted to evaluate the performance of the proposed methodology. Finally, a real-world data set is analyzed to demonstrate the application of the method.
1. Introduction
In recent years, nonparametric regression models have been widely applied in a variety of areas for data analysis. The estimation of the regression function and related statistical inferences in nonparametric models are usually based on the assumption that the error term is homoscedastic. However, in many real-world problems, we rarely know a priori whether this assumption can be guaranteed. Therefore, it is necessary to develop a method for detecting heteroscedasticity in the error terms before we embark on the model fitting and inferential issues.
In the literature on nonparametric regression, there have been many papers on testing heteroscedasticity (see, e.g., [1–8]). Among these, Dette and Munk [2] developed a procedure based on an estimator for the best L2-approximation of the variance function by a constant, which You and Chen [5] extended to partially linear regression models. Dette [1] proposed a test for heteroscedasticity in nonparametric regression. Eubank and Thomas [3] suggested a residual-based statistic to detect heteroscedasticity of the error term in nonparametric models. Furthermore, Zhang and Mei [7] obtained a test for the constant variance of the model errors based on residual analysis.
Most of the existing procedures, including those mentioned above, belong to the class of parametric hypothesis tests. That is, the methods work quite well when the model errors follow the assumed distribution, while their performance decreases significantly when that distribution cannot be guaranteed. Therefore, it is necessary to develop a test which is robust to the error distribution. To the best of our knowledge, however, little work has been done on this issue.
In this paper, we propose a fully nonparametric hypothesis test for detecting heteroscedasticity of the error term in nonparametric regression. The test statistic is constructed from an appropriate transformation of the residuals obtained after fitting the regression model by local linear estimation. To evaluate the performance of the proposed method, we conduct a simulation comparison with Zhang and Mei’s procedure [7], and a real-world data set is analyzed to show the application of the method.
The remainder of this paper is organized as follows. In Section 2, we briefly describe the local linear estimation method. In Section 3, a testing procedure is described that uses the residuals obtained after fitting the regression model by local linear estimation and applies the idea of trend analysis from nonparametric statistics. In Section 4, we conduct some simulations to assess the performance of the test. A real-world data set is analyzed in Section 5 to demonstrate the application of the proposed method. The paper concludes with some final remarks in Section 6.
2. A Brief Description of the Local Linear Estimation
Consider the univariate nonparametric regression model
$$Y_i = m(X_i) + \sigma(X_i) e_i, \quad i = 1, 2, \ldots, n, \tag{1}$$
where $Y$ and $X$ denote the response and explanatory variables, respectively, and $(Y_i, X_i)$ $(i = 1, 2, \ldots, n)$ is a random sample from model (1). The regression function $m(\cdot)$ and the variance function $\sigma(\cdot)$ are unknown. The errors $e_1, e_2, \ldots, e_n$ are assumed to be independent and identically distributed random variables with zero mean and unit variance, and $X$ and $e$ are independent.
Due to its several attractive mathematical properties (see [9–11] for details), the local linear estimation procedure is used to calibrate the model in (1). Specifically, suppose that the second order derivative of the regression function m(x) in model (1) is continuous in the domain of the variable X, say D, and x0 is a given point in D. According to Taylor’s expansion, we have in the neighborhood of x0 that
$$m(x) \approx m(x_0) + m'(x_0)(x - x_0), \tag{2}$$
where m′(x0) denotes the first order derivative of m(x) at x0. By replacing m(x) in model (1) with its linear approximation in (2) and combining the least-squares procedure, the local linear estimate of the regression function m(x) at x0 can be obtained by solving the following weighted least-squares problem:
$$\text{minimize} \; \sum_{i=1}^{n} \left[ Y_i - m(x_0) - m'(x_0)(X_i - x_0) \right]^2 K_h(X_i - x_0) \tag{3}$$
with respect to $m(x_0)$ and $m'(x_0)$, where $K_h(\cdot) = K(\cdot/h)/h$, $K(\cdot)$ is a given kernel function, generally taken to be a symmetric probability density function, and $h$ is the bandwidth. The bandwidth can be determined by data-driven methods such as cross-validation, generalized cross-validation, or the corrected Akaike information criterion (see [12–14] for more details). Specifically, in the cross-validation procedure, the optimal value of the bandwidth $h$ is chosen to minimize the following expression:
$$\mathrm{CV}(h) = \sum_{i=1}^{n} \left[ Y_i - \hat{Y}_{(i)}(h) \right]^2, \tag{4}$$
where $\hat{Y}_{(i)}(h)$ stands for the $i$th predicted value of the response $Y$ under the bandwidth $h$ with the $i$th observation omitted from the calibration process.
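The cross-validation criterion in (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a Gaussian kernel, a brute-force leave-one-out loop, a small hypothetical bandwidth grid, and synthetic data; the function names `local_linear_fit` and `cv_score` are ours and do not come from the paper.

```python
import numpy as np

def gauss_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def local_linear_fit(x0, X, Y, h):
    """Local linear estimate of m(x0): solve the weighted least-squares problem (3)."""
    w = gauss_kernel((X - x0) / h) / h                 # K_h(X_i - x0)
    Z = np.column_stack([np.ones_like(X), X - x0])     # columns: 1, X_i - x0
    ZtW = Z.T * w                                      # Z^T W without forming W
    beta = np.linalg.solve(ZtW @ Z, ZtW @ Y)
    return beta[0]                                     # intercept estimates m(x0)

def cv_score(h, X, Y):
    """CV(h): sum of squared leave-one-out prediction errors, as in (4)."""
    n = len(X)
    keep = np.ones(n, dtype=bool)
    total = 0.0
    for i in range(n):
        keep[i] = False                                # omit the i-th observation
        pred = local_linear_fit(X[i], X[keep], Y[keep], h)
        total += (Y[i] - pred) ** 2
        keep[i] = True
    return total

# Synthetic illustration (hypothetical data and grid, not from the paper).
rng = np.random.default_rng(0)
X = np.arange(1, 101) / 100
Y = 1 + np.sin(X) + 0.5 * rng.standard_normal(100)
grid = np.linspace(0.05, 0.5, 10)
scores = [cv_score(h, X, Y) for h in grid]
h_cv = grid[int(np.argmin(scores))]
print("bandwidth selected on this grid:", h_cv)
```

A grid search is the simplest choice here; a one-dimensional numerical optimizer could replace it when a finer bandwidth is needed.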
For convenience, we introduce some matrix notation. Let
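$$X(0) = \begin{pmatrix} 1 & X_1 - x_0 \\ 1 & X_2 - x_0 \\ \vdots & \vdots \\ 1 & X_n - x_0 \end{pmatrix}, \quad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \begin{pmatrix} \sigma(X_1) e_1 \\ \sigma(X_2) e_2 \\ \vdots \\ \sigma(X_n) e_n \end{pmatrix}, \quad W(0) = \mathrm{Diag}\left( K_h(X_1 - x_0), K_h(X_2 - x_0), \ldots, K_h(X_n - x_0) \right). \tag{5}$$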
By solving the weighted least-squares problem in (3), we can obtain the local linear estimate of m(x) at x=x0 as
$$\hat{m}(x_0) = e_1^T \left[ X^T(0) W(0) X(0) \right]^{-1} X^T(0) W(0) Y, \tag{6}$$
where $e_1 = (1, 0)^T$ denotes the two-dimensional vector whose first element is 1 and whose second element is 0 (not to be confused with the error $e_1$ in model (1)).
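Taking $x_0$ in (6) to be $X_1, X_2, \ldots, X_n$, respectively, we obtain the fitted value of $Y = (Y_1, Y_2, \ldots, Y_n)^T$, denoted by $\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)^T$, as
$$\hat{Y} = \begin{pmatrix} \hat{m}(X_1) \\ \hat{m}(X_2) \\ \vdots \\ \hat{m}(X_n) \end{pmatrix} = \begin{pmatrix} e_1^T \left[ X^T(1) W(1) X(1) \right]^{-1} X^T(1) W(1) Y \\ e_1^T \left[ X^T(2) W(2) X(2) \right]^{-1} X^T(2) W(2) Y \\ \vdots \\ e_1^T \left[ X^T(n) W(n) X(n) \right]^{-1} X^T(n) W(n) Y \end{pmatrix} = LY, \tag{7}$$
where $X(i)$ and $W(i)$ denote $X(0)$ and $W(0)$ with $x_0$ replaced by $X_i$, and
$$L = \begin{pmatrix} e_1^T \left[ X^T(1) W(1) X(1) \right]^{-1} X^T(1) W(1) \\ e_1^T \left[ X^T(2) W(2) X(2) \right]^{-1} X^T(2) W(2) \\ \vdots \\ e_1^T \left[ X^T(n) W(n) X(n) \right]^{-1} X^T(n) W(n) \end{pmatrix} \tag{8}$$
is called the “hat” matrix or smoothing matrix.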
Further, the residual vector can be computed from
$$\hat{\varepsilon} = (\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n)^T = Y - \hat{Y} = (I - L) Y, \tag{9}$$
which will be used in the next section.
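To make the construction of the smoothing matrix and the residual vector concrete, the following Python sketch builds $L$ row by row and then forms $(I - L)Y$. It assumes a Gaussian kernel, a fixed illustrative bandwidth, and synthetic data; the function name `smoothing_matrix` is hypothetical.

```python
import numpy as np

def smoothing_matrix(X, h):
    """Smoothing matrix L of the local linear fit with a Gaussian kernel.

    Row j is e1^T [X(j)^T W(j) X(j)]^{-1} X(j)^T W(j), following (8).
    """
    n = len(X)
    L = np.empty((n, n))
    for j in range(n):
        d = X - X[j]                                        # X_i - x0 with x0 = X_j
        w = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
        Z = np.column_stack([np.ones(n), d])                # design matrix X(j)
        ZtW = Z.T * w                                       # X(j)^T W(j)
        L[j] = np.linalg.solve(ZtW @ Z, ZtW)[0]             # first row picks out e1^T(...)
    return L

# Synthetic illustration (hypothetical data and bandwidth).
rng = np.random.default_rng(0)
n = 50
X = np.arange(1, n + 1) / n
Y = 1 + np.sin(X) + 0.5 * rng.standard_normal(n)

L = smoothing_matrix(X, h=0.1)
Y_hat = L @ Y                        # fitted values, as in (7)
resid = (np.eye(n) - L) @ Y          # residual vector, as in (9)
```

A quick sanity check is that each row of `L` sums to one and that `L @ X` reproduces `X`, since a local linear fit reproduces linear functions exactly.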
3. A Procedure for Detecting Heteroscedasticity in Nonparametric Regression
As mentioned in the introduction, in real-world data analysis we rarely know in advance whether the error term is homoscedastic, which leads to the problem of testing for heteroscedasticity. That is, the hypothesis to be tested is
$$H_0: \sigma^2(X_i) = \sigma^2 \longleftrightarrow H_1: \sigma^2(X_i) \neq \sigma^2, \tag{10}$$
where $\sigma^2 > 0$ is a certain constant.
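Let $\hat{\varepsilon} = (\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n)^T = Y - \hat{Y} = (I - L) Y$ be the residual vector described in (9). In order to construct a test statistic suitable for quantifying the heteroscedasticity of the error term in nonparametric regression, we use the transformed residuals
$$r_i = \frac{\hat{\varepsilon}_i}{\sqrt{\hat{\sigma}_0^2 h_{ii}}}, \quad i = 1, 2, \ldots, n, \tag{11}$$
where
$$\hat{\sigma}_0^2 = \frac{Y^T (I - L)^T (I - L) Y}{\mathrm{tr}\left[ (I - L)^T (I - L) \right]} \tag{12}$$
with “tr” standing for the trace of a matrix and $h_{ii}$ the $i$th diagonal element of the matrix $H = (I - L)^T (I - L)$.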
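The transformed residuals can be computed directly from the smoothing matrix. The sketch below reads (11) as the residual divided by the square root of $\hat{\sigma}_0^2 h_{ii}$, which is our interpretation of the formula; the Gaussian kernel, the bandwidth, the synthetic data, and the function names are likewise assumptions made for illustration.

```python
import numpy as np

def smoothing_matrix(X, h):
    """Local linear smoothing matrix with a Gaussian kernel (assumed kernel)."""
    n = len(X)
    L = np.empty((n, n))
    for j in range(n):
        d = X - X[j]
        w = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
        Z = np.column_stack([np.ones(n), d])
        ZtW = Z.T * w
        L[j] = np.linalg.solve(ZtW @ Z, ZtW)[0]
    return L

def transformed_residuals(L, Y):
    """Transformed residuals r_i, reading (11) as resid_i / sqrt(sigma0^2 * h_ii)."""
    n = len(Y)
    M = np.eye(n) - L
    H = M.T @ M                                  # H = (I - L)^T (I - L)
    resid = M @ Y                                # residual vector from (9)
    sigma2_0 = (Y @ H @ Y) / np.trace(H)         # variance estimate from (12)
    return resid / np.sqrt(sigma2_0 * np.diag(H))

# Synthetic illustration (hypothetical data and bandwidth).
rng = np.random.default_rng(1)
X = np.arange(1, 101) / 100
Y = 1 + np.sin(X) + 0.5 * rng.standard_normal(100)
L = smoothing_matrix(X, h=0.1)
r = transformed_residuals(L, Y)
```

By construction, the weighted sum of `r**2` with weights $h_{ii}$ equals $\mathrm{tr}(H)$, which gives a simple consistency check on an implementation.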
If the null hypothesis $H_0$ in (10) is true, which means that the variance of the error term in model (1) is constant, the values of $r_i^2$ $(i = 1, 2, \ldots, n)$ should not have any trend, whereas there will be some variations in $r_1^2, r_2^2, \ldots, r_n^2$ if heteroscedasticity is present. Therefore, we can test heteroscedasticity of the error term by analyzing the trend of $r_i^2$ $(i = 1, 2, \ldots, n)$. Along this line of thinking, the hypothesis in (10) amounts to the hypothesis
$$H_0: r_1^2, r_2^2, \ldots, r_n^2 \text{ have no trend} \longleftrightarrow H_1: r_1^2, r_2^2, \ldots, r_n^2 \text{ have a certain trend}. \tag{13}$$
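According to Diblasi and Bowman [15] and Wei et al. [16], the random variables $r_1, r_2, \ldots, r_n$ are approximately independent and identically distributed. Let
$$c = \begin{cases} \dfrac{n}{2}, & n \text{ is even}; \\ \dfrac{n+1}{2}, & n \text{ is odd}, \end{cases} \qquad n' = \begin{cases} c, & n \text{ is even}; \\ c - 1, & n \text{ is odd}, \end{cases} \qquad D_i = r_i^2 - r_{i+c}^2, \quad i = 1, 2, \ldots, n'. \tag{14}$$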
Then $D_1, D_2, \ldots, D_{n'}$ are approximately independent under $H_0$ and $P_{H_0}(D_i > 0) = P_{H_0}(D_i < 0) = 1/2$. Therefore, the test statistic is constructed as follows:
$$T = \sum_{i=1}^{n'} I(D_i > 0), \tag{15}$$
where $I(\cdot)$ is the indicator function.
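The pairing of the first and second halves of the squared transformed residuals can be illustrated with a short Python sketch. The function name `sign_statistic` and the toy input vectors are hypothetical; the index arithmetic follows the definitions of $c$, $n'$, and $D_i$ in (14), with the sum in (15) taken over the $n'$ available differences.

```python
import numpy as np

def sign_statistic(r):
    """Return (T, n_prime, D) for transformed residuals r, following (14) and (15)."""
    r2 = np.asarray(r, dtype=float) ** 2
    n = r2.size
    c = n // 2 if n % 2 == 0 else (n + 1) // 2
    n_prime = c if n % 2 == 0 else c - 1
    # D_i = r_i^2 - r_{i+c}^2 for i = 1, ..., n' (0-based slices below);
    # for odd n the middle observation is left unpaired.
    D = r2[:n_prime] - r2[c:c + n_prime]
    T = int(np.sum(D > 0))                       # number of positive differences
    return T, n_prime, D

# Even n: pairs (1,4), (2,5), (3,6).
T_even, n_even, D_even = sign_statistic([3, 1, 2, 1, 2, 3])
# Odd n: pairs (1,5), (2,6), (3,7); the 4th value is unused.
T_odd, n_odd, D_odd = sign_statistic([1, 2, 3, 0, 0.5, 1, 4])
print(T_even, n_even, T_odd, n_odd)              # prints: 1 3 2 3
```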
If the null hypothesis H0 in (10) (or (13)) is true, which means the model error term is homoscedastic, we have
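$$T \sim B\left(n', \tfrac{1}{2}\right), \tag{16}$$
where $B(n', 1/2)$ denotes the binomial distribution with $n'$ trials and success probability $1/2$. Since the distribution of $T$ under $H_0$ is symmetric about $n'/2$, the value of $|T - n'/2|$ tends to be large if heteroscedasticity is present. Therefore, the p-value of testing $H_0$ versus $H_1$ based on the statistic $T$ is
$$\begin{aligned} p &= P_{H_0}\left( \left| T - \tfrac{n'}{2} \right| \ge \left| t - \tfrac{n'}{2} \right| \right) \\ &= P_{H_0}\left( T \le \tfrac{n'}{2} - \left| t - \tfrac{n'}{2} \right| \right) + P_{H_0}\left( T \ge \tfrac{n'}{2} + \left| t - \tfrac{n'}{2} \right| \right) \\ &= 2 P_{H_0}\left( T \le \tfrac{n'}{2} - \left| t - \tfrac{n'}{2} \right| \right) \\ &= 2 \times \frac{1}{2^{n'}} \sum_{k=0}^{(n'/2) - |t - n'/2|} C_{n'}^{k} \\ &\approx 2 \left[ 1 - \Phi\left( \frac{|t - n'/2|}{\sqrt{n'}/2} \right) \right], \end{aligned} \tag{17}$$
where $t$ is the observed value of $T$ computed by (15) and $\Phi$ is the standard normal distribution function. For a given significance level $\alpha$, reject $H_0$ if $p < \alpha$; otherwise, do not reject $H_0$.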
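As an illustration of the p-value in (17), the sketch below evaluates both the exact binomial tail sum and the normal approximation using only the Python standard library. The function name is hypothetical, and two details are our assumptions rather than part of the paper: the upper summation limit is written as $\min(t, n' - t)$, which equals $n'/2 - |t - n'/2|$ for integer $t$, and the exact value is capped at 1 for the case $t = n'/2$. The standard deviation $\sqrt{n'}/2$ of $B(n', 1/2)$ is used as the scale in the normal approximation.

```python
from math import comb, erf, sqrt

def sign_test_pvalue(t, n_prime):
    """Two-sided p-value for T ~ B(n', 1/2): (exact, normal approximation)."""
    # n'/2 - |t - n'/2| equals min(t, n' - t) when t is an integer.
    k_max = min(t, n_prime - t)
    lower_tail = sum(comb(n_prime, k) for k in range(k_max + 1)) / 2 ** n_prime
    p_exact = min(1.0, 2.0 * lower_tail)          # cap at 1 (our assumption)

    z = abs(t - n_prime / 2) / (sqrt(n_prime) / 2)
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))        # standard normal CDF
    p_normal = 2.0 * (1.0 - Phi)
    return p_exact, p_normal

# Worked example: t = 3 positive differences out of n' = 10 pairs.
# Exact: 2 * (1 + 10 + 45 + 120) / 1024 = 0.34375.
p_exact, p_normal = sign_test_pvalue(3, 10)
print(p_exact, p_normal)
```

For small $n'$ the two values can differ noticeably, so the exact sum is the safer choice whenever it is cheap to compute.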
4. Simulation Studies
As mentioned in the introduction, Zhang and Mei [7] also proposed a test method for detecting heteroscedasticity in nonparametric models. The particular method that they used is the t-test applied to the squared residuals $\hat{\varepsilon}_i^2$ $(i = 1, 2, \ldots, n)$, where the residuals are given in (9). A comparison with Zhang and Mei’s method is conducted in this section to assess the validity of the proposed test method.
The following three types of regression and variance functions are considered:
$m(x) = 1 + x$, $\sigma(x) = \sigma (1 + a \sin(8x))^2$;
$m(x) = 1 + \sin x$, $\sigma(x) = \sigma (4 + 4a \cos(4x))$;
$m(x) = 1 + \sin x$, $\sigma(x) = \sigma \exp(ax)$,
where σ=0.5 and a is a constant.
Using the above regression and variance functions, we can formulate three models to generate the experimental data. For convenience, the models that correspond to those three settings of regression and variance functions are denoted by Model 1, Model 2, and Model 3, respectively.
In each model, the observations X1,X2,…,Xn of the explanatory variable X are equidistantly taken on the interval [0,1]; that is, Xi=i/n, i=1,2,…,n. The constant a in the variance functions is considered to be 0, 0.5, and 1.0, respectively. Note that a=0 refers to the model with the error term being homoscedastic, and the variance function deviates from homoscedasticity more and more significantly with the value of a increasing. The sample sizes are taken to be n=100 and n=200, respectively.
Furthermore, in order to evaluate the robustness of the test methods (the proposed and Zhang and Mei’s methods) on the error distributions, the random numbers e1,e2,…,en are independently drawn from N(0,1), U(-3,3), and the standardized Chi-square distribution with 4 degrees of freedom, respectively.
For each of Models 1, 2, and 3 and each combination of the constant a, the error distribution, and the sample size, we ran N=500 replications of the testing procedure for both our proposed method and Zhang and Mei’s method, using the Gaussian kernel $K(x) = \exp(-x^2/2)/\sqrt{2\pi}$ and selecting the bandwidth h by the cross-validation procedure. Over the N=500 replications, we record the frequency of rejecting the null hypothesis at the significance level α=0.05; the results are reported in Table 1.
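Table 1: Rejection frequencies of 500 replications of the testing procedure.

| n | Model | a | N(0,1): Proposed method | N(0,1): Zhang and Mei [7] | U(-3,3): Proposed method | U(-3,3): Zhang and Mei [7] | χ2(4): Proposed method | χ2(4): Zhang and Mei [7] |
|---|---|---|---|---|---|---|---|---|
| 100 | Model 1 | 0 | 0.022 | 0.038 | 0.032 | 0.022 | 0.058 | 0.024 |
| 100 | Model 1 | 0.50 | 0.346 | 0.014 | 0.402 | 0.002 | 0.348 | 0.026 |
| 100 | Model 1 | 1.00 | 0.540 | 0.020 | 0.570 | 0.006 | 0.506 | 0.012 |
| 100 | Model 2 | 0 | 0.020 | 0.038 | 0.038 | 0.024 | 0.048 | 0.030 |
| 100 | Model 2 | 0.50 | 0.294 | 0.020 | 0.490 | 0.048 | 0.352 | 0.064 |
| 100 | Model 2 | 1.00 | 0.996 | 0.032 | 1.000 | 0.056 | 0.996 | 0.080 |
| 100 | Model 3 | 0 | 0.024 | 0.038 | 0.038 | 0.026 | 0.044 | 0.032 |
| 100 | Model 3 | 0.50 | 0.162 | 0.510 | 0.258 | 0.796 | 0.202 | 0.248 |
| 100 | Model 3 | 1.00 | 0.544 | 0.954 | 0.736 | 1.000 | 0.574 | 0.714 |
| 200 | Model 1 | 0 | 0.034 | 0.034 | 0.034 | 0.018 | 0.062 | 0.032 |
| 200 | Model 1 | 0.50 | 0.596 | 0.012 | 0.678 | 0.002 | 0.620 | 0.014 |
| 200 | Model 1 | 1.00 | 0.870 | 0.014 | 0.884 | 0.004 | 0.952 | 0.016 |
| 200 | Model 2 | 0 | 0.032 | 0.040 | 0.038 | 0.018 | 0.058 | 0.030 |
| 200 | Model 2 | 0.50 | 0.436 | 0.022 | 0.690 | 0.052 | 0.514 | 0.068 |
| 200 | Model 2 | 1.00 | 1.000 | 0.034 | 1.000 | 0.058 | 1.000 | 0.072 |
| 200 | Model 3 | 0 | 0.038 | 0.042 | 0.038 | 0.018 | 0.058 | 0.030 |
| 200 | Model 3 | 0.50 | 0.236 | 0.692 | 0.358 | 0.946 | 0.318 | 0.362 |
| 200 | Model 3 | 1.00 | 0.714 | 0.996 | 0.912 | 1.000 | 0.750 | 0.862 |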
We see from Table 1 that, when the error term is normally distributed, the rejection frequencies of both methods under H0 (i.e., a=0) lie between 0.020 and 0.042, reasonably close to but somewhat below the significance level α=0.05, for both sample sizes. On the other hand, the two test methods perform quite differently for different types of variance functions under the alternative hypothesis. Although our method has lower power than Zhang and Mei’s method for the monotone variance function (see Model 3), its rejection frequency is much larger than that of Zhang and Mei’s method for high frequency variance functions (see Models 1 and 2), which means that our method has satisfactory power in detecting heteroscedasticity, especially when the variance function shows many alternations.
When the distribution of the model error term is nonnormal, we see from Table 1 that, under H0 (i.e., a=0), the empirical sizes of our method are more stable and closer to the significance level than those of Zhang and Mei’s method across the different error distributions, which indicates that the proposed test is more robust to the choice of the error distribution. Furthermore, the rejection frequencies of both test methods under a≠0 show the same patterns as in the normal case, which demonstrates that our test is more powerful in detecting high frequency variance functions.
5. An Example on the Application of the Proposed Method
A real-world data set is analyzed in this section to demonstrate the application of the proposed method. Specifically, from the observed daily average temperature (AT) in Xi’an, China, from January 1, 1951, to December 31, 2000, the mean of the average temperatures recorded on the same calendar day over the 50 years is taken as the value of the average temperature for that day (unit: degree). It is worth pointing out that the data on February 29 during the 50 years have been excluded. Furthermore, the observations of the explanatory variable X in the regression function are taken as the day order within the year, from January 1 to December 31.
Based on the observations (ATi,Xi)(i=1,2,…,365) which are graphically shown in Figure 1, the following nonparametric regression model is considered:
$$\mathrm{AT}_i = m(X_i) + \sigma(X_i) e_i, \quad i = 1, 2, \ldots, 365, \tag{18}$$
where $e_i$ is assumed to satisfy $E(e_i \mid X_i) = 0$ and $\mathrm{Var}(e_i \mid X_i) = 1$. Here, we test whether or not the model error term is homoscedastic. By using the proposed test method with the Gaussian kernel function, the optimal value of the bandwidth h selected by the cross-validation procedure is h=2 and the resulting p-value is $4.713 \times 10^{-27}$. Because of this extremely small p-value, we may conclude that the heteroscedasticity of the model error is significant over the period from January 1 to December 31.
Figure 1: The original data of the response AT in model (18).
6. Final Remarks
In this paper, a test that does not depend on the type of the error distribution is developed for detecting heteroscedasticity in nonparametric regression models. Specifically, the statistic is constructed from an appropriate transformation of the residuals obtained after fitting the regression model by local linear estimation, together with the idea of trend analysis in nonparametric statistics. To assess the performance of the proposed method, we conduct a simulation comparison with Zhang and Mei’s procedure, and the results are satisfactory, especially for high frequency variance functions.
Compared to Zhang and Mei’s method, the power of the proposed test when heteroscedasticity is present tends to be lower for monotone variance functions. This is reasonable because Zhang and Mei’s method is mainly formulated on the monotone trend of the squared residuals, while the proposed test is a sign-based testing method.
Nevertheless, due to its conceptual simplicity and easy implementation, our method is useful in testing heteroscedasticity in nonparametric regression, especially for variance functions with many alternations.
Conflict of Interests
The authors declare that there is no conflict of interests regarding publication of this paper.
Acknowledgments
This research is supported by the National Natural Science Foundations of China (no. 11326181 and no. 11201123), the International Cooperative Project in Henan Province (no. 134300510034), and the start fund of doctoral scientific research (no. 09001624). The authors are especially grateful to the reviewer and editor for their valuable comments and suggestions, which led to significant improvements in the paper.
References
[1] H. Dette, “A consistent test for heteroscedasticity in nonparametric regression based on the kernel method.”
[2] H. Dette and A. Munk, “Testing heteroscedasticity in nonparametric regression.”
[3] R. L. Eubank and W. Thomas, “Detecting heteroscedasticity in nonparametric regression.”
[4] H. Liero, “Testing homoscedasticity in nonparametric regression.”
[5] J. H. You and G. M. Chen, “Testing heteroscedasticity in partially linear regression models.”
[6] J. H. You, G. M. Chen, and Y. Zhou, “Statistical inference of partially linear regression models with heteroscedastic errors.”
[7] L. Zhang and C.-L. Mei, “Testing heteroscedasticity in nonparametric regression models based on residual analysis.”
[8] S. L. Shen, C.-L. Mei, and Y. J. Zhang, “Spatially varying coefficient models: testing for heteroscedasticity and reweighting estimation of the coefficients.”
[9] J. Q. Fan, “Design-adaptive nonparametric regression.”
[10] J. Q. Fan, “Local linear regression smoothers and their minimax efficiencies.”
[11] D. Ruppert and M. P. Wand, “Multivariate locally weighted least squares regression.”
[12] C. M. Hurvich, J. S. Simonoff, and C. L. Tsai, “Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion.”
[13] J. D. Hart.
[14] T. J. Hastie and R. J. Tibshirani.
[15] A. Diblasi and A. Bowman, “Testing for constant variance in a linear model.”
[16] B. C. Wei, X. G. Lin, and F. C. Xie.