Remodeling and Estimation for Sparse Partially Linear Regression Models

When the dimension of covariates in the regression model is high, one usually uses a submodel as a working model that contains signi�cant variables. �ut it may be highly biased and the resulting estimator of the parameter of interest may be very poor when the coeﬃcients of removed variables are not exactly zero. In this paper, based on the selected submodel, we introduce a two-stage remodeling method to get the consistent estimator for the parameter of interest. More precisely, in the �rst stage, by a multistep adjustment, we reconstruct an unbiased model based on the correlation information between the covariates; in the second stage, we further reduce the adjusted model by a semiparametric variable selection method and get a new estimator of the parameter of interest simultaneously. Its convergence rate and asymptotic normality are also obtained. e simulation results further illustrate that the new estimator outperforms those obtained by the submodel and the full model in the sense of mean square errors of point estimation and mean square prediction errors of model prediction.


Introduction
Consider the following partially linear regression model: where is a scalar response variable, and are, respectively, -dimensional and -dimensional continuous-valued covariates with being �nite and , is the parameter vector of interest and is the nuisance parameter vector which is supposed to be sparse in the sense that ‖ ‖ 2 is small, ( ) is an unknown function satisfying ( ) for identi�cation, is the random error satisfying ( ) . For simplicity, we assume that is univariate. Let ( ), , be i.i.d. observations of ( ) obtained from the above model. A feature of the model is that the parametric part contains both the parameter vector of interest and nuisance parameter vector. e reason for this coefficient separation is as follows. In practice we oen use such a model to distinguish the main treatment variables of interest from the state variables. For instance, in a clinical trial, consists of treatment variables and can be easily controlled, is a vector of many clinical variables, such as patient ages and body weights. e variables in may have an impact on but are not of primary interest and the effects may be small. In order to make up for potentially nonnegligible effects on the response , the nuisance covariate are introduced into model (1); see Shen et al. [1]. Model (1) contains all relevant covariates and in this paper we call it full model. e purpose of this paper is to estimate , the parameter of interest, when is removed from the model. e main idea is remodeling based on the following working model: As is known, ( ) is a nonzero function if ( ) , which relies on two elements, one is ( ), related with the correlation between the covariates of and ( ), the other is , determined by the nuisance parameter in the removed part. us the least squares estimator based on model (2) may be inconsistent. In the following, we will make use of the above two elements. Speci�cally, in the �rst stage, we shall construct a remodeled model by a multistep-adjustment to correct the submodel bias based on the correlation information between the covariates. is adjustment is motivated by Gai et al. [2]. In the paper, they proposed a nonparametric adjustment by adding a univariate nonparametric estimation to the working model (2), and it can dramatically reduce the bias of the working model. But this only holds in a subset of the covariates, although the subset may be fairly large. In order to obtain a globally unbiased working model for linear regression model, Zeng et al. [3] adjusted the working model by multiple steps. Because only those variables in correlated with ( may have impact on estimation of , in each step a univariate nonparametric part was added to the working model and consequently a globally unbiased working model was obtained.
However, when many components of are correlated with ( , the number of nonparametric functions added in the above working model is large. Such a model is improper in practice. us, in the second stage, we further simplify the above adjusted model by a semiparametric variable selection procedure proposed by Zhao and Xue [4]. eir method can select signi�cant parametric and nonparametric components simultaneously under sparsity condition for semiparametric varying coefficient partially linear models. e relevant papers include Fan and Li [5], Wang et al. [6,7], among others. A�er two-stage remodeling, the �nal model is conditionally unbiased. Based on this model, the estimation and model prediction are signi�cantly improved. e rest of this paper is organized as follows. In Section 2, a multistep adjustment and remodeled models are �rstly proposed, then the models are further simpli�ed via the semiparametric SCAD variable selection procedure. A new estimator of the parameter of interest based on the simpli-�ed model is derived, its convergence rate and asymptotic normality are also obtained. Simulations are given in Section 3. A short conclusion and some remarks are contained in Section 4. Some regular conditions and theoretical proofs are presented in the appendix.

New Estimator for the Parameter of Interest
In this paper, we suppose that covariate has zero mean, is �nite and , ( and Var( 2 . We also assume that covariates and and parameter are prespeci�ed, so that the submodel (2) is a �xed model.

Multistep-Adjustment by Correlation.
In this subsection, we �rst adjust the submodel to be conditionally unbiased by a multistep-adjustment.
When is normally distributed, the principal component analysis (PCA) method will be used. Let Σ be the covariance matrix of , then there exists an orthogonal matrix such that Σ Λ, where Λ is the diagonal matrix diag( 1  . When is centered but nonnormally distributed, we shall apply independent component analysis (ICA) method. Assume that is generated by a nonlinear combination of independent components ( , that is ( , where ( is an unknown nonlinear mapping from to , is an unknown random vector with independent components. By imposing some constraints on the nonlinear mixing mapping or the independent components ( , the independent components ( can be properly estimated. See Simas Filho and Seixas [8] for an overview of the main statistical principles and some algorithms for estimating the independent components. For simplicity, in this paper we suppose that

�o�e� �imp�i�cation.
When the most of the features in the full model are correlated, then 0 is very large and even is close to . In this case, the adjusted model (3) is improper in practice, so we shall use the group SCAD regression procedure, proposed by Wang et al. [6], and the semiparametric variable selection procedure, proposed by Zhao and Xue [4], to further simplify the model.
, and assume that the model (3) is sparse, that is, is small. We de�ne the semiparametric penalized least squares as where ‖ ‖ 2 1/2 , and ⋅ is the SCAD penalty function with being a tuning parameter de�ned as with > 2, > 0 and 0 0. In (6), ⋅ denotes the set { 1 0 }. Because are nonparametric functions, thus they cannot be directly applied for minimization. Here we will replace ⋅ and ⋅ by basis function approximations. For 1 ≤ ≤ 0 , let {Ψ 1 } be orthogonal basis functions satisfying where is the density function of . Similarly, let {Ψ 0 1 } be orthogonal basis functions satisfying the above condition which is only replaced by the support and density function of . Denote Ψ en and can be approximated by Denote ‖ ‖ 2 1/2 , invoking that Ψ ⋅Ψ the identity matrix, we get Denote by , 1 0 and the least squares estimators based on the penalized function (10) }. So we get the following simpli�ed working model where , 1 and Under the assumption of sparsity, the model (11) contains all of signi�cant nonparametric functions and fully utilizes both the correlation information of covariates and the model sparsity on nuisance covariate.
If is centered and normally distributed with covariance matrix Σ the identity matrix, then , 1 , where denotes the unit vector with 1 at position , and is sparse with . So the model (4) is sparse. For model (5), the special case of model (4), we can apply the SCAD penalty method proposed by Fan and Li [5] to select variables in 0 and estimate parameters and simultaneously. e selected covariate and the corresponding parameter are denoted by and , the resulting parameter estimators are denoted by and , respectively. Finally, we can use the simpli�ed model for model prediction. Under the condition of sparsity, its model size is much smaller than those of the multistep adjusted model (5) and the full model (1).

Asymptotic Property of Point
Estimator. Let 0 , 0 , 0 , and 0 ⋅ , 0 ⋅ be the true values of , , , and ⋅ , ⋅ , respectively, in model (3) From the last paragraph of Section 2.2 we know that, for linear regression model and normally distributed , the multistep adjusted model (5) is a linear model. By orthogonal basis functions, such as power series, we have = ∞, then ‖ ‖ = ( /2 , implying the estimator has the same convergence rate as that of the SCAD estimator in Fan and Li [5]. Remark 3. By Remark 1 of Fan and Li [5], we have that, if max → as → ∞, then → . Hence from eorems 1 and 2, by choosing proper tuning parameters, the variable selection method is consistent and the estimators of nonparametric components achieve the optimal convergence rate as if the subset of true zero coefficients was already known; see Stone [10].
be the nonzero components of , corresponding covariates are denoted by Ψ * = . In addition, let

eorem 4. Suppose that the regularity conditions (C1)-(C6) in the appendix hold and the number of terms
where " →" denotes the convergence in distribution.
Remark 5. From eorems 1 and 4, it can be found that the penalized estimators have the oracle property. Furthermore, the estimator of the parameter of interest has the same asymptotic distribution as that based on the correct submodel.

Some Issues on Implementation.
In the adjusted model (4), = are used. When the population distribution is not available, they need to be approximated by estimators. When is normally distributed and eigenvalues = of the covariance matrix Σ are different from each other, then √ ( is asymptotically ( where is the th eigenvector of = ( /( ∑ = ( ( with = ( / ∑ = ; see Anderson [11]. For the case when the population size is large and comparable with the sample size, if the covariance matrix is sparse, we can use the method in Rütimann and Bühlmann [12] or Cai and Liu [13] to estimate the covariance matrix. So we can use to approximate . When in model (4) are replaced by these consistent estimators, one can see that the approximation error can be neglected without changing the asymptotic property.
e nonparametric parts ( ( in the adjusted model depend on the univariate variable ( , for = . So it needs to choose the steps �rstly. In real implementation, we compute all the multiple correlation coefficients of ( ( = ) with and . en we choose the components = ( : |mcorr( ( ( | = for given small number , where mcorr( denotes the multicorrelation coefficient between and and can be approximated by its sample form; see Anderson [11]. ere are some tuning parameters needing to choose in order to implement the two-stage remodeling procedure. Fan and Li [5] showed that the SCAD penalty with = performs well in a variety of situations. Hence, we use their suggestion throughout this paper. We still need to choose the positive integer for basis functions and the tuning parameter of the penalty functions. Similar to the adaptive lasso of Zou [14], we suggest taking = /‖ ( ‖ 2 , where ( is initial estimator of by using ordinary least squares method based on the �rst term in (10). So the two remaining parameters and can be selected simultaneously using the leave-one-out CV or GCV method; see Zhao and Xue [4] for more details.

Simulation Studies
In this section, we investigate the behavior of the newly proposed method by simulation studies.

Linear Model with Normally Distributed
Covariates. e dimensions of the full model (1) and the submodel (2) are chosen to be 100 and 5, respectively. We set with 0.5 or 0. . e error term is assumed to be normally distributed as (0, 0.3 2 ).
Here we denote the submodel (2) as model (I), the multistep adjusted linear model (5)  We also compare mean square prediction errors (MSPEs) of the above mentioned models with corresponding estimators. e data are simulated from the full model (1) with sample size 100 and simulation times 1000. We use the sample-based PCA approximations to substitute 's. e parameter in the SCAD penalty function is set to be 3.7 and is selected by leave-one-out CV method. Table 1 reports the MSEs of point estimators on the parameter and the MSPEs of model predictions. From the table, we have the following �ndings: (1) has the largest MSEs and takes the second place, nearly all the new estimator TS has the smallest MSEs. (2) When 0.5, the MSEs of SCAD are smaller than those of , while when 0. they are larger than those of . ese show that if the correlation between the covariates is strong, the MSEs of SCAD are larger than those of , the multistep-adjustment is necessary, so the estimations and model predictions based on two-stage model are signi�cantly improved. In summary, Table 1 indicates that the two-stage adjusted linear model (12) performs much better than the full model, and better than the submodel, the SCAD-penalized model and the multistep adjusted model.

Partially Linear Model with Nonnormally Distributed
Covariates. e dimensions of the linear part in the full model (1) and the submodel (2) are chosen to be 50 and 5, respectively. We set We assume that the covariates are distributed in the following two ways.
Here we denote the submodel (2) as model (I) ′ , the multistep adjusted additive partially linear model (3)  e data are simulated from the full model (1) with sample size 100 and simulation times 500. We use the sample-based approximations of ICA, see Hyvärinen and Oja [15]. e parameter in the SCAD penalty function is set to be 3.7, the number and the parameter is selected by GCV method. We use the standard Fourier orthogonal basis as the basis functions. Table 2 reports the MSEs of point estimators on the parameter , the MASEs of ( ) and the MSPEs of model predictions. From the table, we have the following results: (1) has the largest MSEs, its MSEs are much larger than the MSEs of the other estimators, and the new estimator TS always has the smallest MSEs. (2) e MASEs of ( ) have T 1: MSEs on the parameter and MSPEs of the two-stage adjusted linear model (12)   In summary, Table 2 indicates that the two-stage adjusted model (11) performs much better than the full model and the multistep adjusted model, and better than the submodel.

Some Remarks
In this paper, the main objective is to consistently estimate the parameter of interest . When estimating the parameter of interest, its bias is mainly determined by the relevant variables, and its variance may be impacted by other variables. Because variable selection much relies on the sparsity of the parameter, when we directly consider the partially linear model, some irrelevant variables with nonzero coefficients may be selected in the �nal model. is may affect the estimation of the parameter on its efficiency and stability. us based on the prespeci�ed submodel, a two-stage remodeling method is proposed. In the new remodeling procedure, the correlation among the covariates ( and the sparsity of the regression structure are fully used. �o the �nal model is sufficiently simpli�ed and conditionally unbiased. Based on the simpli�ed model, the estimation and model prediction are signi�cantly improved. �enerally, a�er the �rst stage the adjusted model is an additive partially linear model. erefore, the remodeling method can be applied to partially linear regression model with linear regression model as a special case.
From the remodeling procedure, we can see that it can be directly applied to additive partially linear model, in which the nonparametric function ( has componentwise additive form. As for general partially linear model with multivariate nonparametric function, we should resort to multivariate nonparametric estimation method. If the dimension of covariate is high, it may be faced with "the curse of dimensionality". In the procedure of model simpli�cation, orthogonal series estimation method is used. is is only for technical convenience, because the semiparametric penalized least squares (6) can be easily transformed into parametric penalized least squares (10) and then the theoretic results are obtained. Although other nonparametric methods such as kernel and spline can be used without any essential difficulty, they can not directly achieve this goal. Compared with kernel method, it is somewhat difficult for series method to establish the asymptotic normality result for the nonparametric component ( under primitive conditions. Conditions (C1)-(C3) are some regular constraints on the covariates and condition (C4) is some constraints on the regression structure as those in Härdle et al. [16]. Conditions (C5)-(C6) are assumptions on the penalty function which are similar to those used in Fan and Li [5] and Wang et al. [7]. Similarly, there exists a local minimizer satis�es that ‖ − 0 ‖ = − 2 + ) + ). en we can get ‖ − 0 ‖ = − 2 + ) + ).