Testing Heteroscedasticity in Nonparametric Regression Based on Trend Analysis

In this paper, we propose a new test for detecting heteroscedasticity of the error term in nonparametric regression. Simulation experiments are conducted to evaluate the performance of the proposed methodology, and a real-world data set is analyzed to demonstrate the application of the method.


Introduction
In recent years, nonparametric regression models have been widely applied in a variety of areas for data analysis. The estimation of the regression function and related statistical inferences in nonparametric models are usually based on the assumption that the error term is homoscedastic. However, in many real-world problems, we rarely know a priori whether this assumption holds. Therefore, it is necessary to develop a method for detecting heteroscedasticity in the error terms before we embark on model fitting and inference.
In the literature on nonparametric regression, there have been many papers on testing heteroscedasticity (see, e.g., [1][2][3][4][5][6][7][8]). Among these papers, a procedure was developed by Dette and Munk [2] based on an estimator for the best $L^2$-approximation of the variance function by a constant and was extended by You and Chen [5] to partially linear regression models. Dette [1] proposed a test for heteroscedasticity in nonparametric regression. A residual-based statistic was suggested by Eubank and Thomas [3] to detect heteroscedasticity of the error term in nonparametric models. Furthermore, Zhang and Mei [7] obtained a test for the constant variance of the model errors based on residual analysis.
Most of the existing procedures, including those mentioned above, belong to the class of parametric hypothesis tests. That is, the methods work quite well when the model errors follow the preassumed distribution, while their performance decreases significantly when that distribution cannot be guaranteed. Therefore, it is necessary to develop a test which is robust to the error distribution. To the best of our knowledge, however, little work has been done on this issue.
In this paper, we propose a completely nonparametric hypothesis test for detecting heteroscedasticity of the error term in nonparametric regression. In this method, the test statistic is constructed on the basis of an appropriate transformation of the residuals after fitting the regression model with the local linear estimation. In order to evaluate the performance of the proposed method, we conduct a simulation comparison with Zhang and Mei's procedure [7], and a real-world data set is analyzed to show the application of the method.
The remainder of this paper is organized as follows. In Section 2, we briefly describe the local linear estimation method. By using the residuals after fitting the regression model with the local linear estimation and applying the idea of trend analysis in nonparametric statistics, a testing procedure is described in Section 3. In Section 4, we conduct some simulations to assess the performance of the test. A real-world data set is analyzed in Section 5 to demonstrate the application of the proposed method. The paper concludes with some final remarks.

A Brief Description of the Local Linear Estimation
Consider the univariate nonparametric regression model
\[ Y = m(X) + \sigma(X)\varepsilon, \tag{1} \]
where $Y$ and $X$ indicate the response and explanatory variable, respectively, and $(Y_i, X_i)$ $(i = 1, 2, \ldots, n)$ is a random sample from model (1). $m(\cdot)$ and $\sigma^2(\cdot)$ are unknown regression and variance functions. $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are generally assumed to be independently and identically distributed random variables with zero mean and unit variance. Also $\varepsilon$ and $X$ are independent.
Due to its several attractive mathematical properties (see [9][10][11] for details), the local linear estimation procedure is used to calibrate the model in (1). Specifically, suppose that the second order derivative of the regression function $m(x)$ in model (1) is continuous in the domain of the variable $x$, say D, and $x_0$ is a given point in D. According to Taylor's expansion, we have in the neighborhood of $x_0$ that
\[ m(x) \approx m(x_0) + m'(x_0)(x - x_0), \tag{2} \]
where $m'(x_0)$ denotes the first order derivative of $m(x)$ at $x_0$. By replacing $m(x)$ in model (1) with its linear approximation in (2) and combining the least-squares procedure, the local linear estimate of the regression function $m(x)$ at $x_0$ can be obtained by solving the following weighted least-squares problem:
\[ \min \sum_{i=1}^{n} \left[ Y_i - m(x_0) - m'(x_0)(X_i - x_0) \right]^2 K_h(X_i - x_0) \tag{3} \]
with respect to $m(x_0)$ and $m'(x_0)$, where $K_h(\cdot) = K(\cdot/h)/h$, $K(\cdot)$ is a given kernel function that is generally taken to be a symmetric probability density function, and $h$ is the bandwidth, which can be determined by data-driven methods such as cross-validation, generalized cross-validation, and the corrected Akaike information criterion (see [12][13][14] for more details). Specifically, in the cross-validation procedure, the optimal value of the bandwidth $h$ is chosen to minimize the following expression:
\[ \mathrm{CV}(h) = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_{(i)}(h) \right)^2, \tag{4} \]
where $\hat{Y}_{(i)}(h)$ stands for the $i$th predicted value of the response $Y$ under the bandwidth $h$ with the $i$th observation omitted from the calibration process.
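For convenience, we introduce the following matrix notation. Let
\[ \mathbf{X}(x_0) = \begin{pmatrix} 1 & X_1 - x_0 \\ 1 & X_2 - x_0 \\ \vdots & \vdots \\ 1 & X_n - x_0 \end{pmatrix}, \qquad \mathbf{W}(x_0) = \mathrm{diag}\left( K_h(X_1 - x_0), K_h(X_2 - x_0), \ldots, K_h(X_n - x_0) \right). \tag{5} \]
By solving the weighted least-squares problem in (3), we can obtain the local linear estimate of $m(x)$ at $x = x_0$ as
\[ \hat{m}(x_0) = \mathbf{e}_1^{T} \left( \mathbf{X}^{T}(x_0) \mathbf{W}(x_0) \mathbf{X}(x_0) \right)^{-1} \mathbf{X}^{T}(x_0) \mathbf{W}(x_0) \mathbf{Y}, \tag{6} \]
where $\mathbf{e}_1$ indicates a two-dimensional vector with its first element being 1 and the other being 0. Taking $x_0$ in (6) to be $X_1, X_2, \ldots, X_n$, respectively, we can get the fitted value of $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)^{T}$, denoted by $\hat{\mathbf{Y}} = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)^{T}$, as
\[ \hat{\mathbf{Y}} = \mathbf{L}\mathbf{Y}, \tag{7} \]
where
\[ \mathbf{L} = \begin{pmatrix} \mathbf{e}_1^{T} \left( \mathbf{X}^{T}(X_1) \mathbf{W}(X_1) \mathbf{X}(X_1) \right)^{-1} \mathbf{X}^{T}(X_1) \mathbf{W}(X_1) \\ \mathbf{e}_1^{T} \left( \mathbf{X}^{T}(X_2) \mathbf{W}(X_2) \mathbf{X}(X_2) \right)^{-1} \mathbf{X}^{T}(X_2) \mathbf{W}(X_2) \\ \vdots \\ \mathbf{e}_1^{T} \left( \mathbf{X}^{T}(X_n) \mathbf{W}(X_n) \mathbf{X}(X_n) \right)^{-1} \mathbf{X}^{T}(X_n) \mathbf{W}(X_n) \end{pmatrix} \tag{8} \]
is called the "hat" matrix or smoothing matrix. Further, the residual vector can be computed from
\[ \hat{\boldsymbol{\varepsilon}} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{L})\mathbf{Y}, \tag{9} \]
which will be used in the next section.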
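The estimation steps of this section can be illustrated with a minimal Python sketch. It builds the smoothing matrix row by row from the weighted least-squares solution, forms the residuals, and evaluates a leave-one-out cross-validation score. The function names (`gaussian_kernel`, `hat_matrix`, `cv_score`), the test regression function, and the bandwidth grid are hypothetical choices made for illustration only; the cross-validation score uses the standard leave-one-out identity for linear smoothers rather than refitting the model n times.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def hat_matrix(x, h):
    """Smoothing matrix L of the local linear estimator with bandwidth h."""
    x = np.asarray(x, dtype=float)
    n = x.size
    e1 = np.array([1.0, 0.0])
    L = np.empty((n, n))
    for i in range(n):
        d = x - x[i]                           # X_j - x_0 with x_0 = X_i
        w = gaussian_kernel(d / h) / h         # K_h(X_j - x_0)
        X = np.column_stack((np.ones(n), d))   # local design matrix
        XtW = X.T * w                          # X^T W, shape (2, n)
        # i-th row of L: e1^T (X^T W X)^{-1} X^T W
        L[i] = e1 @ np.linalg.solve(XtW @ X, XtW)
    return L

def cv_score(x, y, h):
    """Leave-one-out CV score via (Y_i - Yhat_i) / (1 - L_ii)."""
    L = hat_matrix(x, h)
    resid = y - L @ y
    loo = resid / (1.0 - np.diag(L))
    return float(np.sum(loo**2))

# Illustrative usage on simulated data (the regression function is an assumption).
rng = np.random.default_rng(0)
n = 100
x = np.arange(1, n + 1) / n
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(n)

grid = np.linspace(0.02, 0.3, 15)                 # hypothetical bandwidth grid
h_opt = min(grid, key=lambda h: cv_score(x, y, h))
L = hat_matrix(x, h_opt)
residuals = (np.eye(n) - L) @ y                   # residual vector (I - L) Y
```

A grid search is used here only for simplicity; any one-dimensional optimizer over the bandwidth would serve the same purpose.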

A Procedure for Detecting Heteroscedasticity in Nonparametric Regression
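In this section, we consider whether the error term in model (1) is homoscedastic, which is the problem of testing for heteroscedasticity. That is, the hypothesis to be tested is
\[ H_0: \sigma^2(x) = \sigma^2 \quad \text{versus} \quad H_1: \sigma^2(x) \neq \sigma^2, \tag{10} \]
where $\sigma^2 > 0$ is a certain constant. Let $\hat{\boldsymbol{\varepsilon}} = (\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n)^{T} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{L})\mathbf{Y}$ be the residual vector which is described in (9). In order to construct a test statistic suitable for quantifying the heteroscedasticity of the error term in nonparametric regression, we use the transformed residuals defined in (11) and (12), in which "tr" stands for the trace of a matrix and $h_{ii}$ is the $i$th diagonal element of the matrix $\mathbf{H} = (\mathbf{I} - \mathbf{L})^{T}(\mathbf{I} - \mathbf{L})$.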
If the null hypothesis $H_0$ in (10) (or (13)) is true, which means that the model error term is homoscedastic, the test statistic $T$ has a binomial null distribution, and the $p$-value of the test is computed from $t$, the observed value of $T$ computed by (15). For a given significance level $\alpha$, reject $H_0$ if $p < \alpha$; otherwise, do not reject $H_0$.
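The decision rule can be made concrete with a small sketch of a two-sided p-value under a binomial null distribution with success probability 1/2. This is an illustration under stated assumptions, not the paper's implementation: the function name `binomial_two_sided_pvalue` and the argument name `n_eff` (the binomial sample size) are hypothetical, and the p-value is taken to be the null probability that the statistic deviates from half the sample size at least as much as the observed value does.

```python
from math import comb

def binomial_two_sided_pvalue(t, n_eff):
    """P(|T - n_eff/2| >= |t - n_eff/2|) for T ~ Binomial(n_eff, 1/2)."""
    if not 0 <= t <= n_eff:
        raise ValueError("t must lie between 0 and n_eff")
    # Compare deviations on the integer scale |2k - n_eff| to avoid rounding issues.
    observed_dev = abs(2 * t - n_eff)
    favourable = sum(
        comb(n_eff, k)
        for k in range(n_eff + 1)
        if abs(2 * k - n_eff) >= observed_dev
    )
    return favourable / 2**n_eff

# Illustrative usage: reject H0 at level alpha when the p-value falls below alpha.
alpha = 0.05
p_value = binomial_two_sided_pvalue(t=8, n_eff=10)
reject = p_value < alpha
```

Exact integer arithmetic is used for the tail sum, so the result is not affected by floating-point error in the binomial probabilities.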

Simulation Studies
As mentioned in the introduction, Zhang and Mei [7] also proposed a test for detecting heteroscedasticity in nonparametric models. Their method applies a test to the squared residuals $\hat{\varepsilon}_i^2$ $(i = 1, 2, \ldots, n)$ obtained from (9). A comparison with Zhang and Mei's method is conducted in this section to assess the validity of the proposed test.
Using the above regression and variance functions, we can formulate three models to generate the experimental data.For convenience, the models that correspond to those three settings of regression and variance functions are denoted by Model 1, Model 2, and Model 3, respectively.
In each model, the observations $x_1, x_2, \ldots, x_n$ of the explanatory variable $X$ are equidistantly taken on the interval [0, 1]; that is, $x_i = i/n$, $i = 1, 2, \ldots, n$. The constant $c$ in the variance functions is set to 0, 0.5, and 1.0, respectively. Note that $c = 0$ refers to the model with the error term being homoscedastic, and the variance function deviates from homoscedasticity more and more significantly as the value of $c$ increases. The sample sizes are taken to be $n = 100$ and $n = 200$, respectively. Furthermore, in order to evaluate the robustness of the test methods (the proposed and Zhang and Mei's methods) to the error distributions, the random numbers $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independently drawn from $N(0, 1)$, $U(-\sqrt{3}, \sqrt{3})$, and the standardized chi-square distribution with 4 degrees of freedom, respectively.
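The three error distributions all have zero mean and unit variance, which the following sketch makes explicit. The function name `draw_errors` and the distribution labels are hypothetical; the chi-square errors are standardized by subtracting the mean 4 and dividing by the standard deviation, the square root of 8, of a chi-square variable with 4 degrees of freedom.

```python
import numpy as np

def draw_errors(dist, n, rng):
    """Draw n i.i.d. errors with zero mean and unit variance."""
    if dist == "normal":
        return rng.standard_normal(n)
    if dist == "uniform":
        # U(-sqrt(3), sqrt(3)) has variance (2*sqrt(3))^2 / 12 = 1.
        return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)
    if dist == "chisq4":
        # Chi-square with 4 degrees of freedom has mean 4 and variance 8.
        return (rng.chisquare(4, n) - 4.0) / np.sqrt(8.0)
    raise ValueError(f"unknown error distribution: {dist}")

# Illustrative usage: equidistant design on [0, 1] and one error sample per distribution.
rng = np.random.default_rng(2024)
n = 200
x = np.arange(1, n + 1) / n
errors = {name: draw_errors(name, n, rng) for name in ("normal", "uniform", "chisq4")}
```

Standardizing every distribution in this way means that differences in rejection frequency across error types reflect the shape of the distribution rather than its scale.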
For each of Models 1, 2, and 3 and each combination of the value of the constant $c$, the error distribution, and the sample size, we ran 500 replications of the testing procedure for both our proposed method and Zhang and Mei's method, in which the Gauss kernel function $K(t) = \exp(-t^2/2)/\sqrt{2\pi}$ is adopted and the bandwidth $h$ is selected by the cross-validation procedure. Over the 500 replications, we recorded the frequency of rejecting the null hypothesis at the significance level $\alpha = 0.05$; the results are reported in Table 1.
We see from Table 1 that, when the error term is normally distributed, the rejection frequency of both methods under $H_0$ (i.e., $c = 0$) is reasonably close to the nominal significance level for both sample sizes. On the other hand, the two test methods perform quite differently for different types of variance functions under the alternative hypothesis. Although the rejection frequency of our method tends to be lower for a monotone variance function (see Model 3), it is much larger than that obtained by Zhang and Mei's method for high frequency variance functions (see Models 1 and 2), which means that our method has satisfactory power in detecting heteroscedasticity, especially when the variance function shows many alternations.
When the distribution of the model error term is nonnormal, we see from Table 1 that, under $H_0$ (i.e., $c = 0$), the rejection frequencies of our method are more stable and closer to the nominal significance level than those obtained by Zhang and Mei's method across the different types of error distributions, which indicates that the proposed test is more robust to the choice of error distribution. Furthermore, the rejection frequencies of both test methods under $c \neq 0$ show the same patterns as in the normal case, which demonstrates that our test is more powerful in detecting high frequency variance functions.

An Example on the Application of the Proposed Method
A real-world data set is analyzed in this section to demonstrate the application of the proposed method. Specifically, with the observed data of the average temperature (AT) of each day in Xi'an, China, from January 1, 1951, to December 31, 2000, the mean of the average temperatures collected on the same day of each of the 50 years is taken as the value of the average temperature (unit: degree). It is worth pointing out that the data on February 29 during the 50 years have been excluded. Furthermore, the observations of the explanatory variable $t$ in the regression function are taken as the time orders from January 1 to December 31. Based on the observations $(\mathrm{AT}_i, t_i)$ $(i = 1, 2, \ldots, 365)$, which are graphically shown in Figure 1, the following nonparametric regression model is considered:
\[ \mathrm{AT}_i = m(t_i) + \sigma(t_i)\varepsilon_i, \quad i = 1, 2, \ldots, 365, \tag{18} \]
where $\varepsilon_i$ is assumed to satisfy $E(\varepsilon_i \mid t_i) = 0$ and $\mathrm{Var}(\varepsilon_i \mid t_i) = 1$. Here, we test whether or not the model error term is homoscedastic. By using the proposed test method with the Gauss kernel function, the optimal value of the bandwidth $h$ selected by the cross-validation procedure is $h = 2$ and the resulting $p$-value is $4.713 \times 10^{-27}$. Because of the extremely small $p$-value of the test statistic, we may conclude that the heteroscedasticity of the model error is significant over the period from January 1 to December 31.

Final Remarks
In this paper, a test which is free of the type of the error distribution is developed for detecting heteroscedasticity in nonparametric regression models. Specifically, the statistic is constructed on the basis of an appropriate transformation of the residuals after fitting the regression model with the local linear estimation, together with the idea of trend analysis in nonparametric statistics. In order to assess the performance of the proposed method, we conduct a simulation comparison with Zhang and Mei's procedure, and the results are satisfactory, especially for high frequency variance functions.
Compared to Zhang and Mei's method, the power of the proposed test tends to be lower for monotone variance functions. This is reasonable because Zhang and Mei's method is mainly formulated based on the monotone trend of the squared residuals, whereas the proposed test is a sign-based testing method.
Nevertheless, due to its conceptual simplicity and easy implementation, our method is useful in testing heteroscedasticity in nonparametric regression, especially for variance functions with many alternations.

Figure 1: The original data of the response AT in model (18).
Here, $B(n', 1/2)$ denotes the binomial distribution with the parameter being 1/2 and the sample size being $n'$. Since the test statistic $T$ is symmetric or approximately symmetric with respect to $n'/2$, the value of $|T - n'/2|$ tends to be large if the error heteroscedasticity is present. Therefore, the $p$-value of testing $H_0$ versus $H_1$ based on the statistic $T$, with $t$ the observed value of $T$, is
\[ p = P_{H_0}\left( |T - n'/2| \geq |t - n'/2| \right). \]

Table 1: Rejection frequencies of 500 replications of the testing procedure.