Testing Heteroscedasticity in Nonparametric Regression Based on Trend Analysis

Si-Lian Shen (1) (http://orcid.org/0000-0002-5850-9534), Jian-Ling Cui (2), and Chun-Wei Wang (1)

(1) School of Mathematics and Statistics, Henan University of Science and Technology, Luoyang 471003, China
(2) Electronic Equipment Test Center, Luoyang 471003, China

Academic Editor: Zhihua Zhang

Research Article. Journal of Applied Mathematics (ISSN 1687-0042, 1110-757X), Hindawi Publishing Corporation, Volume 2014, Article ID 435925, doi:10.1155/2014/435925

Received 27 January 2014; Revised 26 May 2014; Accepted 26 May 2014; Published 11 June 2014

Copyright © 2014 Si-Lian Shen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this paper, we propose a new test for detecting heteroscedasticity of the error term in nonparametric regression. Simulation experiments are then conducted to evaluate the performance of the proposed method, and a real-world data set is analyzed to demonstrate its application.

1. Introduction

In recent years, nonparametric regression models have been widely applied in a variety of areas for data analysis. The estimation of the regression function and related statistical inferences in nonparametric models are usually based on the assumption that the error term is homoscedastic. However, in many real-world problems, we rarely know a priori whether this assumption can be guaranteed. Therefore, it is necessary to develop a method for detecting heteroscedasticity in the error terms before we embark on the model fitting and inferential issues.

In the literature on nonparametric regression, there have been many papers on testing heteroscedasticity (see, e.g., [1-8]). Among these, Dette and Munk [2] developed a procedure based on an estimator of the best $L^2$-approximation of the variance function by a constant, which was extended by You and Chen [5] to partially linear regression models. Dette [1] proposed a test for heteroscedasticity in nonparametric regression based on the kernel method. Eubank and Thomas [3] suggested a residual-based statistic for detecting heteroscedasticity of the error term in nonparametric models. Furthermore, Zhang and Mei [7] obtained a test for constant variance of the model errors based on residual analysis.

Most of the existing procedures, including those mentioned above, are parametric hypothesis tests. That is, they work quite well when the model errors follow the preassumed distribution, but their performance deteriorates markedly when this distributional assumption fails. It is therefore necessary to develop a test that is robust to the error distribution. To the best of our knowledge, however, little work has been done on this issue.

In this paper, we propose a fully nonparametric hypothesis test for detecting heteroscedasticity of the error term in nonparametric regression. The test statistic is constructed from an appropriate transformation of the residuals obtained after fitting the regression model by local linear estimation. To evaluate the performance of the proposed method, we conduct a simulation comparison with Zhang and Mei's procedure [7], and a real-world data set is analyzed to show the application of the method.

The remainder of this paper is organized as follows. In Section 2, we briefly describe the local linear estimation method. In Section 3, we describe a testing procedure that uses the residuals from the local linear fit together with the idea of trend analysis in nonparametric statistics. In Section 4, we conduct some simulations to assess the performance of the test. A real-world data set is analyzed in Section 5 to demonstrate the application of the proposed method. The paper concludes with some final remarks.

2. A Brief Description of the Local Linear Estimation

Consider the univariate nonparametric regression model

$$Y_i = m(X_i) + \sigma(X_i)\, e_i, \quad i = 1, 2, \ldots, n, \tag{1}$$

where $Y$ and $X$ denote the response and the explanatory variable, respectively, and $(Y_i, X_i)$ ($i = 1, 2, \ldots, n$) is a random sample from model (1). Here $m(\cdot)$ and $\sigma(\cdot)$ are the unknown regression and variance functions, and $e_1, e_2, \ldots, e_n$ are generally assumed to be independent and identically distributed random variables with zero mean and unit variance. In addition, $X$ and $e$ are independent.

Because of its attractive mathematical properties (see [9-11] for details), the local linear estimation procedure is used to fit model (1). Specifically, suppose that the second-order derivative of the regression function $m(x)$ in model (1) is continuous on the domain $D$ of the variable $X$, and let $x_0$ be a given point in $D$. By Taylor's expansion, we have in a neighborhood of $x_0$ that

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0), \tag{2}$$

where $m'(x_0)$ denotes the first-order derivative of $m(x)$ at $x_0$. Replacing $m(X_i)$ in model (1) with its linear approximation in (2) and applying the least-squares procedure, the local linear estimate of the regression function $m(x)$ at $x_0$ is obtained by solving the weighted least-squares problem

$$\text{minimize} \quad \sum_{i=1}^{n} \left[ Y_i - m(x_0) - m'(x_0)(X_i - x_0) \right]^2 K_h(X_i - x_0) \tag{3}$$

with respect to $m(x_0)$ and $m'(x_0)$, where $K_h(\cdot) = K(\cdot/h)/h$, $K(\cdot)$ is a given kernel function that is generally taken to be a symmetric probability density function, and $h$ is the bandwidth, which can be determined by data-driven methods such as cross-validation, generalized cross-validation, and the corrected Akaike information criterion (see [12-14] for more details). In the cross-validation procedure, the optimal value of the bandwidth $h$ is chosen to minimize

$$\mathrm{CV}(h) = \sum_{i=1}^{n} \left[ Y_i - \hat{Y}_{(i)}(h) \right]^2, \tag{4}$$

where $\hat{Y}_{(i)}(h)$ stands for the $i$th predicted value of the response $Y$ under the bandwidth $h$ with the $i$th observation omitted from the fitting process.
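The fitting step in (3) and the cross-validation criterion in (4) can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian kernel; the function names `local_linear`, `cv_score`, and `select_bandwidth` and the idea of searching a user-supplied grid of bandwidths are illustrative choices, not part of the paper.

```python
import math


def gauss_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear(x0, xs, ys, h):
    """Local linear estimate of m(x0): solves the weighted problem (3)."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = gauss_kernel(d / h) / h          # K_h(X_i - x0)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    det = s0 * s2 - s1 * s1                  # determinant of the 2x2 normal equations
    return (s2 * t0 - s1 * t1) / det         # intercept = estimate of m(x0)


def cv_score(xs, ys, h):
    """Leave-one-out cross-validation score CV(h) in (4)."""
    total = 0.0
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]           # drop the i-th observation
        ys_i = ys[:i] + ys[i + 1:]
        total += (ys[i] - local_linear(xs[i], xs_i, ys_i, h)) ** 2
    return total


def select_bandwidth(xs, ys, grid):
    """Pick the bandwidth in a candidate grid that minimizes CV(h)."""
    return min(grid, key=lambda h: cv_score(xs, ys, h))
```

For example, `select_bandwidth(xs, ys, [0.05, 0.1, 0.2])` returns whichever of the three candidates gives the smallest leave-one-out error; the grid itself has to be chosen by the user.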

For convenience, we introduce matrix notation. Let

$$X(0) = \begin{pmatrix} 1 & X_1 - x_0 \\ 1 & X_2 - x_0 \\ \vdots & \vdots \\ 1 & X_n - x_0 \end{pmatrix}, \quad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \begin{pmatrix} \sigma(X_1) e_1 \\ \sigma(X_2) e_2 \\ \vdots \\ \sigma(X_n) e_n \end{pmatrix},$$

$$W(0) = \mathrm{Diag}\left( K_h(X_1 - x_0), K_h(X_2 - x_0), \ldots, K_h(X_n - x_0) \right). \tag{5}$$

By solving the weighted least-squares problem in (3), we obtain the local linear estimate of $m(x)$ at $x = x_0$ as

$$\hat{m}(x_0) = e_1^T \left[ X^T(0) W(0) X(0) \right]^{-1} X^T(0) W(0) Y, \tag{6}$$

where $e_1$ denotes the two-dimensional vector whose first element is 1 and whose second element is 0.
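Because the matrix to be inverted in (6) is only $2 \times 2$, the estimate can also be written out explicitly as a weighted average of the responses. The short derivation below is a sketch; the symbols $S_j(x_0)$ are notation introduced here for convenience and do not appear in the paper.

```latex
% Kernel-weighted moments of the centered design points (notation assumed here)
S_j(x_0) = \sum_{i=1}^{n} K_h(X_i - x_0)\,(X_i - x_0)^j, \qquad j = 0, 1, 2.

% Then X^T(0) W(0) X(0) is the 2x2 matrix with rows (S_0, S_1) and (S_1, S_2),
% and taking the first row of its inverse gives
\hat{m}(x_0)
  = \frac{\sum_{i=1}^{n} K_h(X_i - x_0)\,\bigl[S_2(x_0) - (X_i - x_0)\,S_1(x_0)\bigr]\,Y_i}
         {S_0(x_0)\,S_2(x_0) - S_1(x_0)^2}.
```

The weights multiplying the $Y_i$ sum to one, so the estimate reproduces constant (and, more generally, linear) regression functions exactly.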

Taking $x_0$ in (6) to be $X_1, X_2, \ldots, X_n$ in turn, and writing $X(i)$ and $W(i)$ for $X(0)$ and $W(0)$ with $x_0$ replaced by $X_i$, we obtain the fitted value of $Y = (Y_1, Y_2, \ldots, Y_n)^T$, denoted by $\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)^T$, as

$$\hat{Y} = \begin{pmatrix} \hat{m}(X_1) \\ \hat{m}(X_2) \\ \vdots \\ \hat{m}(X_n) \end{pmatrix} = \begin{pmatrix} e_1^T \left[ X^T(1) W(1) X(1) \right]^{-1} X^T(1) W(1) Y \\ e_1^T \left[ X^T(2) W(2) X(2) \right]^{-1} X^T(2) W(2) Y \\ \vdots \\ e_1^T \left[ X^T(n) W(n) X(n) \right]^{-1} X^T(n) W(n) Y \end{pmatrix} = L Y, \tag{7}$$

where

$$L = \begin{pmatrix} e_1^T \left[ X^T(1) W(1) X(1) \right]^{-1} X^T(1) W(1) \\ e_1^T \left[ X^T(2) W(2) X(2) \right]^{-1} X^T(2) W(2) \\ \vdots \\ e_1^T \left[ X^T(n) W(n) X(n) \right]^{-1} X^T(n) W(n) \end{pmatrix} \tag{8}$$

is called the "hat" matrix or smoothing matrix.

Further, the residual vector can be computed from

$$\hat{\varepsilon} = (\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n)^T = Y - \hat{Y} = (I - L) Y, \tag{9}$$

which will be used in the next section.
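The smoothing matrix in (8) and the residuals in (9) can be assembled row by row. The following is a minimal sketch in plain Python, assuming a Gaussian kernel and using the closed form of the $2 \times 2$ inverse; the names `hat_matrix` and `residuals` are illustrative.

```python
import math


def hat_matrix(xs, h):
    """Rows of the smoothing matrix L in (8), Gaussian kernel assumed."""
    L = []
    for x0 in xs:
        d = [x - x0 for x in xs]                                  # X_j - x0
        w = [math.exp(-0.5 * (dj / h) ** 2) / (h * math.sqrt(2.0 * math.pi))
             for dj in d]                                         # K_h(X_j - x0)
        s0 = sum(w)
        s1 = sum(wj * dj for wj, dj in zip(w, d))
        s2 = sum(wj * dj * dj for wj, dj in zip(w, d))
        det = s0 * s2 - s1 * s1
        # First row of (X^T W X)^{-1} X^T W, written out for the 2x2 case
        L.append([wj * (s2 - s1 * dj) / det for wj, dj in zip(w, d)])
    return L


def residuals(L, ys):
    """Residual vector (I - L) Y from (9)."""
    return [y - sum(lij * yj for lij, yj in zip(row, ys))
            for row, y in zip(L, ys)]
```

Each row of `L` sums to one, which is a quick sanity check on an implementation; building the full matrix costs on the order of $n^2$ operations, which is unproblematic for the sample sizes used in this paper.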

3. A Procedure for Detecting Heteroscedasticity in Nonparametric Regression

As mentioned in the introduction, in real-world data analysis we rarely know in advance whether the error term is homoscedastic, which leads to the problem of testing for heteroscedasticity. That is, the hypothesis to be tested is

$$H_0: \sigma^2(X_i) = \sigma^2 \quad \text{versus} \quad H_1: \sigma^2(X_i) \neq \sigma^2, \tag{10}$$

where $\sigma^2 > 0$ is a certain constant.

Let $\hat{\varepsilon} = (\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n)^T = Y - \hat{Y} = (I - L) Y$ be the residual vector given in (9). To construct a test statistic suitable for quantifying the heteroscedasticity of the error term in nonparametric regression, we use the transformed residuals

$$r_i = \frac{\hat{\varepsilon}_i}{\sqrt{\hat{\sigma}_0^2\, h_{ii}}}, \quad i = 1, 2, \ldots, n, \tag{11}$$

where

$$\hat{\sigma}_0^2 = \frac{Y^T (I - L)^T (I - L) Y}{\mathrm{tr}\left[ (I - L)^T (I - L) \right]} \tag{12}$$

with "tr" standing for the trace of a matrix, and $h_{ii}$ is the $i$th diagonal element of the matrix $H = (I - L)^T (I - L)$.
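The transformation in (11) and (12) can be sketched as follows for a given smoothing matrix. The function name `transformed_residuals` is illustrative, and the square root in the denominator of (11) is taken as the intended scaling (so that each $r_i$ has roughly unit variance under the null hypothesis), which is an assumption about the formula rather than something verified against the authors' code.

```python
import math


def transformed_residuals(L, ys):
    """Transformed residuals r_i of (11) for a smoothing matrix L (list of rows)."""
    n = len(ys)
    # A = I - L
    A = [[(1.0 if i == j else 0.0) - L[i][j] for j in range(n)] for i in range(n)]
    # Residuals (I - L) Y
    eps = [sum(A[i][j] * ys[j] for j in range(n)) for i in range(n)]
    # Diagonal of H = A^T A: h_ii is the squared norm of column i of A
    h = [sum(A[k][i] ** 2 for k in range(n)) for i in range(n)]
    # Variance estimate (12): Y^T A^T A Y / tr(A^T A)
    sigma2 = sum(e * e for e in eps) / sum(h)
    return [e / math.sqrt(sigma2 * hi) for e, hi in zip(eps, h)]
```

As a simple check, if every entry of `L` equals $1/n$ (so the fit is the sample mean), then $H$ is the centering matrix, $\hat{\sigma}_0^2$ is the usual sample variance, and the squared transformed residuals sum to $n$.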

If the null hypothesis $H_0$ in (10) is true, which means that the variance of the error term in model (1) is constant, the values of $r_i^2$ ($i = 1, 2, \ldots, n$) should not have any trend, whereas $r_1^2, r_2^2, \ldots, r_n^2$ will show some systematic variation if heteroscedasticity is present. Therefore, we can test heteroscedasticity of the error term by analyzing the trend of $r_i^2$ ($i = 1, 2, \ldots, n$). Along this line of thinking, the hypothesis in (10) amounts to the hypothesis

$$H_0: r_1^2, r_2^2, \ldots, r_n^2 \text{ have no trend} \quad \text{versus} \quad H_1: r_1^2, r_2^2, \ldots, r_n^2 \text{ have a certain trend}. \tag{13}$$

According to Diblasi and Bowman [15] and Wei et al. [16], the random variables $r_1, r_2, \ldots, r_n$ are approximately independent and identically distributed. Let

$$c = \begin{cases} \dfrac{n}{2}, & n \text{ is even}; \\[4pt] \dfrac{n+1}{2}, & n \text{ is odd}, \end{cases} \qquad n' = \begin{cases} c, & n \text{ is even}; \\ c - 1, & n \text{ is odd}, \end{cases} \qquad D_i = r_i^2 - r_{i+c}^2, \quad i = 1, 2, \ldots, n'. \tag{14}$$

Then $D_1, D_2, \ldots, D_{n'}$ are approximately independent under $H_0$ and $P_{H_0}(D_i > 0) = P_{H_0}(D_i < 0) = 1/2$. Therefore, the test statistic is constructed as

$$T = \sum_{i=1}^{n'} I(D_i > 0), \tag{15}$$

where $I(\cdot)$ is the indicator function.
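The construction in (14) and (15) pairs each squared residual in the first half of the sample with the one $c$ positions later and counts positive differences. A minimal sketch, with the hypothetical function name `trend_sign_statistic` and zero-based indexing:

```python
def trend_sign_statistic(r):
    """Return (T, n_pairs): the sign statistic of (15) and the number of pairs."""
    n = len(r)
    c = n // 2 if n % 2 == 0 else (n + 1) // 2
    n_pairs = n - c                      # equals c for even n and c - 1 for odd n
    # D_i = r_i^2 - r_{i+c}^2, counted when strictly positive
    T = sum(1 for i in range(n_pairs) if r[i] ** 2 - r[i + c] ** 2 > 0)
    return T, n_pairs
```

For an odd sample size the middle observation is left unpaired, which is why the number of differences drops to $c - 1$. For instance, `trend_sign_statistic([3, 2, 1, 0.5, 0.2, 0.1])` compares positions (1, 4), (2, 5), and (3, 6) and returns `(3, 3)`.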

If the null hypothesis $H_0$ in (10) (or (13)) is true, which means that the model error term is homoscedastic, we have

$$T \sim B\left( n', \tfrac{1}{2} \right), \tag{16}$$

where $n'$ is the number of differences $D_i$ in (14) and $B(n', 1/2)$ denotes the binomial distribution with $n'$ trials and success probability 1/2. Since the distribution of $T$ under $H_0$ is symmetric about $n'/2$, the value of $|T - n'/2|$ tends to be large if error heteroscedasticity is present. Therefore, the $p$-value for testing $H_0$ versus $H_1$ based on the statistic $T$ is

$$\begin{aligned} p &= P_{H_0}\left( \left| T - \tfrac{n'}{2} \right| \geq \left| t - \tfrac{n'}{2} \right| \right) \\ &= P_{H_0}\left( T \leq \tfrac{n'}{2} - \left| t - \tfrac{n'}{2} \right| \right) + P_{H_0}\left( T \geq \tfrac{n'}{2} + \left| t - \tfrac{n'}{2} \right| \right) \\ &= 2 P_{H_0}\left( T \leq \tfrac{n'}{2} - \left| t - \tfrac{n'}{2} \right| \right) = 2 \times \frac{1}{2^{n'}} \sum_{k=0}^{(n'/2) - |t - n'/2|} C_{n'}^{k} \\ &\approx 2 \left[ 1 - \Phi\left( \frac{|t - n'/2|}{\sqrt{n'}/2} \right) \right], \end{aligned} \tag{17}$$

where $t$ is the observed value of $T$ computed by (15) and $\Phi$ is the standard normal distribution function. For a given significance level $\alpha$, reject $H_0$ if $p < \alpha$; otherwise, do not reject $H_0$.
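Both the exact binomial form and the normal approximation of the $p$-value in (17) are straightforward to evaluate. The sketch below uses only the Python standard library; the function names are illustrative, and capping the exact value at 1 (which matters only when $t$ equals the center of the distribution, where the two tails overlap) is a safeguard added here rather than something stated in the paper.

```python
import math


def exact_p_value(t, n_pairs):
    """Two-sided exact p-value of (17) under T ~ B(n_pairs, 1/2)."""
    k_max = min(t, n_pairs - t)          # equals n'/2 - |t - n'/2|
    tail = sum(math.comb(n_pairs, k) for k in range(k_max + 1)) / 2 ** n_pairs
    return min(1.0, 2.0 * tail)          # cap added here; see the lead-in


def normal_p_value(t, n_pairs):
    """Normal approximation in (17): B(n', 1/2) has mean n'/2 and sd sqrt(n')/2."""
    z = abs(t - n_pairs / 2) / (math.sqrt(n_pairs) / 2)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal cdf
    return 2.0 * (1.0 - phi)
```

With 10 pairs and an observed count of 1, the exact value is $2 \times (1 + 10)/2^{10} = 22/1024 \approx 0.0215$, so the null hypothesis would be rejected at the 5% level.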

4. Simulation Studies

As mentioned in the introduction, Zhang and Mei [7] also proposed a test for detecting heteroscedasticity in nonparametric models. Their method applies the $t$-test to the squared residuals $\hat{\varepsilon}_i^2$ ($i = 1, 2, \ldots, n$), where the residuals are given in (9). A comparison with Zhang and Mei's method is conducted in this section to assess the validity of the proposed test.

The following three types of regression and variance functions are considered:

(i) $m(x) = 1 + x$, $\sigma(x) = \sigma \left( 1 + a \sin(8x) \right)^2$;

(ii) $m(x) = 1 + \sin x$, $\sigma(x) = \sigma \left( 4 + 4a \cos(4x) \right)$;

(iii) $m(x) = 1 + \sin x$, $\sigma(x) = \sigma \exp(ax)$,

where $\sigma = 0.5$ and $a$ is a constant.

Using the above regression and variance functions, we can formulate three models to generate the experimental data. For convenience, the models that correspond to those three settings of regression and variance functions are denoted by Model 1, Model 2, and Model 3, respectively.

In each model, the observations $X_1, X_2, \ldots, X_n$ of the explanatory variable $X$ are taken equidistantly on the interval $[0, 1]$; that is, $X_i = i/n$, $i = 1, 2, \ldots, n$. The constant $a$ in the variance functions is set to 0, 0.5, and 1.0, respectively. Note that $a = 0$ corresponds to a homoscedastic error term, and the variance function deviates more and more from homoscedasticity as $a$ increases. The sample sizes are taken to be $n = 100$ and $n = 200$.

Furthermore, in order to evaluate the robustness of the two tests (the proposed method and Zhang and Mei's method) to the error distribution, the random numbers $e_1, e_2, \ldots, e_n$ are independently drawn from $N(0, 1)$, $U(-\sqrt{3}, \sqrt{3})$, and the standardized chi-square distribution with 4 degrees of freedom, respectively.

For each of Models 1, 2, and 3 and each combination of the value of the constant $a$, the error distribution, and the sample size, we ran $N = 500$ replications of the testing procedure for both our proposed method and Zhang and Mei's method, in which the Gaussian kernel function $K(x) = \exp(-x^2/2)/\sqrt{2\pi}$ is adopted and the bandwidth $h$ is selected by the cross-validation procedure. Over the $N = 500$ replications, we record the frequency of rejecting the null hypothesis at the significance level $\alpha = 0.05$; the results are reported in Table 1.
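The data-generating step of this simulation design can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the string labels for the error distributions are hypothetical, and the uniform errors are drawn from the interval with endpoints $\pm\sqrt{3}$ on the assumption that the errors are meant to have unit variance, as model (1) requires.

```python
import math
import random


def m_fn(model, x):
    """Regression function of Model 1, 2, or 3."""
    return 1 + x if model == 1 else 1 + math.sin(x)


def sigma_fn(model, x, a, sigma=0.5):
    """Variance function sigma(x) of Model 1, 2, or 3."""
    if model == 1:
        return sigma * (1 + a * math.sin(8 * x)) ** 2
    if model == 2:
        return sigma * (4 + 4 * a * math.cos(4 * x))
    if model == 3:
        return sigma * math.exp(a * x)
    raise ValueError("model must be 1, 2, or 3")


def draw_error(dist, rng):
    """One error with zero mean and unit variance (labels are hypothetical)."""
    if dist == "normal":
        return rng.gauss(0.0, 1.0)
    if dist == "uniform":                      # assumed unit-variance uniform
        return rng.uniform(-math.sqrt(3.0), math.sqrt(3.0))
    if dist == "chisq4":                       # chi-square(4): mean 4, variance 8
        z = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(4))
        return (z - 4.0) / math.sqrt(8.0)
    raise ValueError("unknown error distribution")


def generate(model, a, n, dist, seed=0):
    """Equidistant design X_i = i/n and responses from model (1)."""
    rng = random.Random(seed)
    xs = [i / n for i in range(1, n + 1)]
    ys = [m_fn(model, x) + sigma_fn(model, x, a) * draw_error(dist, rng) for x in xs]
    return xs, ys
```

A rejection frequency as reported in Table 1 would then be obtained by calling `generate` with 500 different seeds, applying the test to each sample, and recording the share of $p$-values below 0.05.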

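Table 1: Rejection frequencies of 500 replications of the testing procedure. "Proposed" denotes the proposed method and "Z&M" denotes the method of Zhang and Mei [7].

| $n$ | Model | $a$ | $N(0,1)$: Proposed | $N(0,1)$: Z&M | $U(-\sqrt{3}, \sqrt{3})$: Proposed | $U(-\sqrt{3}, \sqrt{3})$: Z&M | $\chi^2(4)$: Proposed | $\chi^2(4)$: Z&M |
|---|---|---|---|---|---|---|---|---|
| 100 | Model 1 | 0 | 0.022 | 0.038 | 0.032 | 0.022 | 0.058 | 0.024 |
| 100 | Model 1 | 0.50 | 0.346 | 0.014 | 0.402 | 0.002 | 0.348 | 0.026 |
| 100 | Model 1 | 1.00 | 0.540 | 0.020 | 0.570 | 0.006 | 0.506 | 0.012 |
| 100 | Model 2 | 0 | 0.020 | 0.038 | 0.038 | 0.024 | 0.048 | 0.030 |
| 100 | Model 2 | 0.50 | 0.294 | 0.020 | 0.490 | 0.048 | 0.352 | 0.064 |
| 100 | Model 2 | 1.00 | 0.996 | 0.032 | 1.000 | 0.056 | 0.996 | 0.080 |
| 100 | Model 3 | 0 | 0.024 | 0.038 | 0.038 | 0.026 | 0.044 | 0.032 |
| 100 | Model 3 | 0.50 | 0.162 | 0.510 | 0.258 | 0.796 | 0.202 | 0.248 |
| 100 | Model 3 | 1.00 | 0.544 | 0.954 | 0.736 | 1.000 | 0.574 | 0.714 |
| 200 | Model 1 | 0 | 0.034 | 0.034 | 0.034 | 0.018 | 0.062 | 0.032 |
| 200 | Model 1 | 0.50 | 0.596 | 0.012 | 0.678 | 0.002 | 0.620 | 0.014 |
| 200 | Model 1 | 1.00 | 0.870 | 0.014 | 0.884 | 0.004 | 0.952 | 0.016 |
| 200 | Model 2 | 0 | 0.032 | 0.040 | 0.038 | 0.018 | 0.058 | 0.030 |
| 200 | Model 2 | 0.50 | 0.436 | 0.022 | 0.690 | 0.052 | 0.514 | 0.068 |
| 200 | Model 2 | 1.00 | 1.000 | 0.034 | 1.000 | 0.058 | 1.000 | 0.072 |
| 200 | Model 3 | 0 | 0.038 | 0.042 | 0.038 | 0.018 | 0.058 | 0.030 |
| 200 | Model 3 | 0.50 | 0.236 | 0.692 | 0.358 | 0.946 | 0.318 | 0.362 |
| 200 | Model 3 | 1.00 | 0.714 | 0.996 | 0.912 | 1.000 | 0.750 | 0.862 |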

We see from Table 1 that, when the error term is normally distributed, the rejection frequencies of both methods under $H_0$ (i.e., $a = 0$) are reasonably close to the significance level for both sample sizes. Under the alternative hypothesis, on the other hand, the two tests perform quite differently for different types of variance functions. Although our method rejects less often than Zhang and Mei's method for the monotone variance function (see Model 3), its rejection frequency is much larger than that of Zhang and Mei's method for high-frequency variance functions (see Models 1 and 2). This means that our method has satisfactory power in detecting heteroscedasticity, especially when the variance function shows many alternations.

When the distribution of the model error term is nonnormal, we see from Table 1 that, under $H_0$ (i.e., $a = 0$), the empirical sizes of our method are more stable across the error distributions and closer to the significance level than those of Zhang and Mei's method, which indicates that the proposed test is more robust to the choice of error distribution. Furthermore, under $a \neq 0$ the rejection frequencies of both tests show the same patterns as in the normal case, which again demonstrates that our test is more powerful in detecting high-frequency variance functions.

5. An Example on the Application of the Proposed Method

A real-world data set is analyzed in this section to demonstrate the application of the proposed method. Specifically, we use the observed average temperature (AT) of each day in Xi'an, China, from January 1, 1951, to December 31, 2000. For each calendar day, the mean of the average temperatures recorded on that day over the 50 years is taken as the value of the response AT (unit: degree). Note that the data for February 29 have been excluded. The observations of the explanatory variable $X$ in the regression function are taken as the day orders from January 1 to December 31.

Based on the observations $(AT_i, X_i)$ ($i = 1, 2, \ldots, 365$), which are shown graphically in Figure 1, the following nonparametric regression model is considered:

$$AT_i = m(X_i) + \sigma(X_i)\, e_i, \quad i = 1, 2, \ldots, 365, \tag{18}$$

where $e_i$ is assumed to satisfy $E(e_i \mid X_i) = 0$ and $\mathrm{Var}(e_i \mid X_i) = 1$. Here, we test whether or not the model error term is homoscedastic. Using the proposed test with the Gaussian kernel function, the optimal value of the bandwidth selected by the cross-validation procedure is $h = 2$ and the resulting $p$-value is $4.713 \times 10^{-27}$. Because the $p$-value is extremely small, we may conclude that the heteroscedasticity of the model error is significant over the period from January 1 to December 31.

Figure 1: The original data of the response AT in model (18).

6. Final Remarks

In this paper, a test that does not depend on the type of error distribution is developed for detecting heteroscedasticity in nonparametric regression models. Specifically, the statistic is constructed on the basis of an appropriate transformation of the residuals obtained after fitting the regression model by local linear estimation, together with the idea of trend analysis in nonparametric statistics. To assess the performance of the proposed method, we conducted a simulation comparison with Zhang and Mei's procedure, and the results are satisfactory, especially for high-frequency variance functions.

Compared with Zhang and Mei's method, the proposed test tends to have lower power for monotone variance functions when heteroscedasticity is present. This is reasonable because the former is mainly formulated based on the monotone trend of the squared residuals, while the latter (our test) is a sign-based testing method.

Nevertheless, owing to its conceptual simplicity and ease of implementation, our method is useful for testing heteroscedasticity in nonparametric regression, especially for variance functions with many alternations.

Conflict of Interests

The authors declare that there is no conflict of interests regarding publication of this paper.

Acknowledgments

This research is supported by the National Natural Science Foundations of China (no. 11326181 and no. 11201123), the International Cooperative Project in Henan Province (no. 134300510034), and the start fund of doctoral scientific research (no. 09001624). The authors are especially grateful to the reviewer and editor for their valuable comments and suggestions, which led to significant improvements in the paper.

References

[1] H. Dette, "A consistent test for heteroscedasticity in nonparametric regression based on the kernel method," Journal of Statistical Planning and Inference, vol. 103, no. 1-2, pp. 311-329, 2002. doi:10.1016/S0378-3758(01)00229-4. MR1896998.

[2] H. Dette and A. Munk, "Testing heteroscedasticity in nonparametric regression," Journal of the Royal Statistical Society B, vol. 60, no. 4, pp. 693-708, 1998. doi:10.1111/1467-9868.00149. MR1649535.

[3] R. L. Eubank and W. Thomas, "Detecting heteroscedasticity in nonparametric regression," Journal of the Royal Statistical Society B, vol. 55, no. 1, pp. 145-155, 1993. MR1210427.

[4] H. Liero, "Testing homoscedasticity in nonparametric regression," Journal of Nonparametric Statistics, vol. 15, no. 1, pp. 31-51, 2003. doi:10.1080/10485250306038. MR1958958.

[5] J. H. You and G. M. Chen, "Testing heteroscedasticity in partially linear regression models," Statistics and Probability Letters, vol. 73, no. 1, pp. 61-70, 2005. doi:10.1016/j.spl.2005.03.002. MR2154061.

[6] J. H. You, G. M. Chen, and Y. Zhou, "Statistical inference of partially linear regression models with heteroscedastic errors," Journal of Multivariate Analysis, vol. 98, no. 8, pp. 1539-1557, 2007. doi:10.1016/j.jmva.2007.06.011. MR2370106.

[7] L. Zhang and C.-L. Mei, "Testing heteroscedasticity in nonparametric regression models based on residual analysis," Applied Mathematics, vol. 23, no. 3, pp. 265-272, 2008. doi:10.1007/s11766-008-1648-0. MR2438675.

[8] S. L. Shen, C.-L. Mei, and Y. J. Zhang, "Spatially varying coefficient models: testing for heteroscedasticity and reweighting estimation of the coefficients," Environment and Planning A, vol. 43, no. 7, pp. 1723-1745, 2011.

[9] J. Q. Fan, "Design-adaptive nonparametric regression," Journal of the American Statistical Association, vol. 87, no. 420, pp. 998-1004, 1992. MR1209561.

[10] J. Q. Fan, "Local linear regression smoothers and their minimax efficiencies," The Annals of Statistics, vol. 21, no. 1, pp. 196-216, 1993. doi:10.1214/aos/1176349022. MR1212173.

[11] D. Ruppert and M. P. Wand, "Multivariate locally weighted least squares regression," The Annals of Statistics, vol. 22, no. 3, pp. 1346-1370, 1994. doi:10.1214/aos/1176325632. MR1311979.

[12] C. M. Hurvich, J. S. Simonoff, and C. L. Tsai, "Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion," Journal of the Royal Statistical Society B, vol. 60, no. 2, pp. 271-293, 1998. doi:10.1111/1467-9868.00125. MR1616041.

[13] J. D. Hart, Nonparametric Smoothing and Lack-of-Fit Tests, Springer Series in Statistics, Springer, New York, NY, USA, 1997. doi:10.1007/978-1-4757-2722-7. MR1461272.

[14] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Chapman and Hall Press, London, UK, 1990. MR1082147.

[15] A. Diblasi and A. Bowman, "Testing for constant variance in a linear model," Statistics and Probability Letters, vol. 33, no. 1, pp. 95-103, 1997. doi:10.1016/S0167-7152(96)00115-0.

[16] B. C. Wei, X. G. Lin, and F. C. Xie, Statistical Diagnostics, Higher Education Press, Beijing, China, 2009. MR2554230.