Linearity Identification for General Partial Linear Single-Index Models

Partial linear models, a popular family of semiparametric models, provide an interpretable and flexible framework for modelling complex data. One challenging question in partial linear models is structure identification, that is, deciding which components are linear and which are nonlinear, especially for high-dimensional data. This paper considers the structure identification problem in general partial linear single-index models, where the link function is unknown. We propose two penalized methods based on a modern dimension reduction technique. Under certain regularity conditions, we show that the second estimator identifies the underlying true model structure correctly. The convergence rate of the new estimator is established as well.


Introduction
Partially linear models (PLMs), containing both linear and nonlinear additive components, provide a useful class of tools for modelling complex data. PLMs have wide applications in practice due to their flexibility and interpretability.
Given the data set $\{y_i, \mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}\}_{i=1}^{n}$, where $y_i$ is the response and $\mathbf{x}_i^{(1)} = (x_{i1}, \ldots, x_{ip})^\top$ and $\mathbf{x}_i^{(2)} = (x_{i(p+1)}, \ldots, x_{i(p+q)})^\top$ are vectors of covariates, the basic PLM has the following form:
$$y_i = \mu + \boldsymbol{\beta}^\top \mathbf{x}_i^{(1)} + \sum_{j=1}^{q} f_j\bigl(x_{i(p+j)}\bigr) + \varepsilon_i, \quad i = 1, \ldots, n, \quad (1)$$
where $\mu$ is the intercept, $\boldsymbol{\beta}$ is a vector of unknown parameters for the linear terms, each $f_j$ is an unknown nonlinear function from $\mathbb{R}$ to $\mathbb{R}$, and the $\varepsilon_i$'s are i.i.d. random errors with zero conditional mean and finite variance.
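As a concrete illustration of model (1), the following sketch simulates data from a basic PLM with two linear and two nonlinear covariates. The particular component functions, sample size, and noise level are illustrative assumptions, not taken from the paper.

```python
# Simulate data from y_i = mu + beta' x_i^(1) + f1(x_i3) + f2(x_i4) + eps_i,
# with p = 2 linear covariates and q = 2 nonlinear ones (assumed toy setup).
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 2, 2
X = rng.uniform(-1.0, 1.0, size=(n, p + q))
mu, beta = 0.5, np.array([1.0, -2.0])

def f1(t):  # an assumed nonlinear component
    return np.sin(np.pi * t)

def f2(t):  # another assumed nonlinear component, centred on U[-1, 1]
    return t ** 2 - 1.0 / 3.0

y = (mu + X[:, :p] @ beta + f1(X[:, p]) + f2(X[:, p + 1])
     + 0.1 * rng.standard_normal(n))
print(y.shape)  # (200,)
```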
Given that the linear and nonlinear terms are known in advance, estimation and inference for various PLMs have been well studied in the literature, such as [1][2][3]. Under the same settings, motivated by sparse models and the Lasso idea developed rapidly in recent years, there exists a substantial body of work on variable selection and estimation for sparse PLMs, including [4][5][6] and others. However, faced with complicated data-generating processes, a key and challenging question in using PLMs is how to decide which features should be assigned to the linear part and which to the nonlinear part, especially for high-dimensional problems such as text categorization and biology. Furthermore, the difficulty of model search increases dramatically as the data dimension grows, due to the curse of dimensionality. In the last several years, a number of methods have been proposed to address various aspects of this problem. Two commonly used approaches are screening and hypothesis testing. The screening method is sometimes useful in practice but lacks theoretical justification, while, for hypothesis testing, it is often hard to construct proper test statistics, and the tests may perform poorly when the number of covariates is large. To handle estimation and component identification efficiently for PLMs, Zhang et al. [7] originally proposed a penalized method called LAND for automatically identifying linear and nonlinear terms. This method is based on the orthogonal decomposition of the Sobolev space of order 2 into a linear subspace and nonlinear subspaces. Another main stream of research is based on spline approximation of the nonparametric functions and penalized likelihood. For example, Huang et al. [8] start with a nonparametric additive model but use spline series expansions to approximate the component functions. This spline series approximation approach allows them to prove selection consistency for more general functions. Lian et al. [9] proposed a doubly SCAD-penalized approach for partial linear structure identification problems of nonpolynomial (NP) dimensionality. These methods can be easily extended to generalized partial linear models, that is, models where an added link function is known in advance. If the link function is unknown, a standard technique in the literature is two-step estimation: first, with the component functions fixed, a rough estimator of the link function is obtained; then, with the derived link function fixed, the component functions are estimated again, and the process is repeated until convergence. However, this method is unstable and quite sensitive to the choice of penalty parameters. Moreover, it involves a nonconvex optimization, which cannot guarantee satisfactory numerical results.
In this paper we consider a general single-index model with the partial linear form
$$y_i = g\Bigl(\boldsymbol{\beta}^\top \mathbf{x}_i^{(1)} + \sum_{j=1}^{q} f_j\bigl(x_{i(p+j)}\bigr),\ \varepsilon_i\Bigr), \quad i = 1, \ldots, n, \quad (2)$$
where the error term $\varepsilon_i$ is assumed to be independent of the covariates, besides satisfying the standard moment conditions mentioned above. The link function $g(\cdot)$ is typically unknown. Note that an additive structure of the unknown link function and the error is not assumed in model (2). This model is quite general; it includes the classical generalized partial linear model of the form $y_i = g_1(\boldsymbol{\beta}^\top \mathbf{x}_i^{(1)} + \sum_{j=1}^{q} f_j(x_{i(p+j)})) + \varepsilon_i$, as well as the partial linear transformation model $g_2(y_i) = \boldsymbol{\beta}^\top \mathbf{x}_i^{(1)} + \sum_{j=1}^{q} f_j(x_{i(p+j)}) + \varepsilon_i$. Clearly, when the link function $g(\cdot)$ is not specified, the partial linear term is identifiable only up to a multiplicative scalar, because any location-scale change in $\boldsymbol{\beta}^\top \mathbf{x}_i^{(1)} + \sum_{j=1}^{q} f_j(x_{i(p+j)})$ can be absorbed into the link function.
By making use of cubic spline approximation to smooth functions and the regularization technique, we develop a new penalized method to identify the model structure of the general partial linear model (2). The proposed method is formulated through the theory of sufficient dimension reduction and two different variational forms of eigenvalue problems. We specifically use a modern sufficient dimension reduction technique for estimating the linear and nonlinear terms, called gradient-based kernel dimension reduction (gKDR), proposed by Fukumizu and Leng [10]. This method shows some specific advantages compared with the abovementioned techniques; for example, the gKDR method can handle any type of response variable $Y$, including multivariate or nonvectorial ones, in the same way. Besides, the linearity condition and constant variance condition are avoided when the gKDR is adopted.
Based on the gKDR and two variational forms of the eigenvalue problem, we design two different penalized approaches by adding twofold Lasso-type penalties. The first algorithm is a nonconvex optimization problem with an equality constraint, which can be solved efficiently by the alternating direction method of multipliers (ADMM). To obtain better statistical properties, a convex relaxation of the eigenvalue problem is used to formulate our other method, a semidefinite programme. Using this convex approach, we can establish estimation consistency and support recovery for the general partial linear model. The convex optimization problem can also be solved by a polynomial-time ADMM algorithm.
The rest of the article is organized as follows. In Section 2 we transform model (2) into a surrogate model by cubic spline approximation. Next we introduce the gKDR from the dimension reduction literature. Then we propose two different penalized approaches based on two variants of the eigenvalue problem. Statistical properties of the new estimators, including convergence rates and selection consistency, are established in Section 3. All the proofs are relegated to the Appendix.

Problem Formulation
Notation. We collect here some standard notation used throughout the paper. For matrices $A$, $B$ of the same dimension, $\langle A, B \rangle := \mathrm{tr}(A^\top B)$ is the Frobenius inner product, and $\|A\|_F := \sqrt{\langle A, A \rangle}$ is the Frobenius norm. $\|\cdot\|_2$ denotes the usual Euclidean norm. The $(a,b)$-norm $\|A\|_{a,b}$ is defined to be the $\ell_b$ norm of the vector of rowwise $\ell_a$ norms of $A$. We write $A \preceq B$ to mean that $B - A$ is positive semidefinite.
We mainly consider the problem of linear identification for the general single-index model (2) and estimation of the variables relevant to the response. For the nonparametric components in model (2), one often uses spline or wavelet approximations as finite representations, because of their good approximation capabilities for functions in various nonparametric classes and their computational advantages. In our specific setting, we adopt the cubic spline approximation, because two properties of cubic splines are critical for our analysis. One property is that all its bases are linearly independent of each other. The other is that the identity function $x$ is contained among the cubic spline bases. We refer the reader to [11] for a detailed description of spline functions. Thus, thanks to these two properties, each function of this spline space has a unique finite representation, and the coefficient of the base function $x$ corresponds to the linear part.
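A minimal sketch of a cubic spline basis with the two properties used here: the identity function $x$ appears as the first base, and the bases are linearly independent. The truncated power form and the knot placement below are assumed concrete choices for illustration; the paper does not commit to a particular basis construction.

```python
# Truncated power cubic spline basis: [x, x^2, x^3, (x-k1)_+^3, ...].
# The first column is exactly the identity base x, so a purely linear
# component corresponds to zero coefficients on all remaining columns.
import numpy as np

def cubic_spline_basis(x, knots):
    """Return the n x (3 + len(knots)) spline design matrix."""
    x = np.asarray(x, dtype=float)
    cols = [x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 50)
B = cubic_spline_basis(x, knots=[0.25, 0.5, 0.75])
print(B.shape, np.allclose(B[:, 0], x))       # (50, 6) True
print(np.linalg.matrix_rank(B))               # full column rank: bases independent
```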
In this article, we use the gKDR, which is formulated via reproducing kernel techniques. As noted in the Introduction, this method can handle any type of response variable, including multivariate or nonvectorial ones, in the same way. Besides, the nonparametric nature of the kernel method avoids strong assumptions on the distribution of the covariates, the response, or the conditional probability, which are usually required by many classical dimension reduction methods such as SIR, pHd, and contour regression.

gKDR Method.
To present the gKDR method, we need some technical notation. For a compact set $\Omega$ in some metric space, a positive-definite kernel on $\Omega \times \Omega$ is a symmetric function $k : \Omega \times \Omega \to \mathbb{R}$ satisfying the positive semidefiniteness condition $\sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) \ge 0$ for any $x_1, \ldots, x_n$ in $\Omega$ and $c_1, \ldots, c_n \in \mathbb{R}$. It is known [15] that a positive-definite kernel on $\Omega$ is uniquely associated with a Hilbert space $\mathcal{H}$ consisting of functions on $\Omega$, and $\mathcal{H}$ is called the reproducing kernel Hilbert space (RKHS) associated with the kernel $k$.
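The positive semidefiniteness condition can be checked numerically: for a Gaussian RBF kernel (an assumed example of a positive-definite kernel, with an illustrative bandwidth), every quadratic form $\sum_{i,j} c_i c_j k(x_i, x_j) = \mathbf{c}^\top K \mathbf{c}$ is nonnegative.

```python
# Gram matrix of a Gaussian RBF kernel and a check of its quadratic form.
import numpy as np

def gram_gauss(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
K = gram_gauss(X)
c = rng.standard_normal(30)
# c' K c >= 0 (up to floating-point roundoff) for any coefficient vector c
print(c @ K @ c >= -1e-10)
```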
For short, we denote $\mathcal{Z} := \{\Phi(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\}$, a subset of $\mathbb{R}^{m(p+q)}$. Let $(\mathcal{Z}, \mathcal{B}_Z, \mu_Z)$ and $(\mathcal{Y}, \mathcal{B}_Y, \mu_Y)$ be measurable spaces and $(Z, Y)$ be a random variable on $\mathcal{Z} \times \mathcal{Y}$ with probability $P$. Let $k_Z$ and $k_Y$ be positive-definite kernels on $\mathcal{Z}$ and $\mathcal{Y}$, respectively, with corresponding RKHSs $\mathcal{H}_Z$ and $\mathcal{H}_Y$. It is assumed that $E(k_Z(Z, Z))$ and $E(k_Y(Y, Y))$ are finite. Now we introduce two integral operators, so that an appropriate variant of the problem of estimating $\theta_0$ in (3) can be established. The cross-covariance operator $C_{YZ} : \mathcal{H}_Z \to \mathcal{H}_Y$ is defined as the operator satisfying
$$\langle g, C_{YZ} f \rangle_{\mathcal{H}_Y} = E\bigl[(f(Z) - E[f(Z)])(g(Y) - E[g(Y)])\bigr] \quad \text{for all } f \in \mathcal{H}_Z,\ g \in \mathcal{H}_Y.$$
Similarly, $C_{ZZ}$ denotes the self-covariance operator on $\mathcal{H}_Z$; that is,
$$\langle f_1, C_{ZZ} f_2 \rangle_{\mathcal{H}_Z} = E\bigl[(f_1(Z) - E[f_1(Z)])(f_2(Z) - E[f_2(Z)])\bigr] \quad \text{for all } f_1, f_2 \in \mathcal{H}_Z.$$
Remark that these definitions are natural extensions of ordinary covariance matrices on Euclidean spaces. Although $C_{YZ}$ and $C_{ZZ}$ depend on the kernels, we omit this dependence for notational simplicity.
With these definitions, we now define the population $m(p+q) \times m(p+q)$ matrix $M(\mathbf{z}) = (M_{ab}(\mathbf{z}))$ by
$$M(\mathbf{z}) = \bigl\langle \nabla_{\mathbf{z}} k_Z(\cdot, \mathbf{z}),\ C_{ZZ}^{-1} C_{ZY} C_{YZ} C_{ZZ}^{-1} \nabla_{\mathbf{z}} k_Z(\cdot, \mathbf{z}) \bigr\rangle_{\mathcal{H}_Z}.$$
Reference [10] has shown that the eigenvector corresponding to the largest eigenvalue of $M(\mathbf{z})$ equals the true vector $\theta_0$ for any $\mathbf{z}$, provided that the maximal eigenvalue is simple.
To be precise, the following equality holds:
$$\theta_0 = \vec{v}_{\max}\bigl(M(\mathbf{z})\bigr),$$
where $\vec{v}_{\max}(A)$ denotes the eigenvector corresponding to the maximum eigenvalue of the matrix $A$. Thus, the problem of estimating $\theta_0$ in (3) is equivalent to the eigenvalue problem for $M(\mathbf{z})$.
We now turn to the empirical version of $M(\mathbf{z})$, so as to estimate $\theta_0$ from the available sample. Denote the $n \times n$ Gram matrices $(k_Z(\mathbf{z}_i, \mathbf{z}_j))$ and $(k_Y(y_i, y_j))$ by $G_Z$ and $G_Y$, respectively. Let $\nabla \mathbf{k}_Z(\mathbf{z}) = (\partial k_Z(\mathbf{z}_1, \mathbf{z})/\partial \mathbf{z}, \ldots, \partial k_Z(\mathbf{z}_n, \mathbf{z})/\partial \mathbf{z})^\top \in \mathbb{R}^{n \times m(p+q)}$. Then the gKDR uses the eigenvectors of the $m(p+q) \times m(p+q)$ symmetric matrix
$$\widehat{M}_n = \frac{1}{n} \sum_{i=1}^{n} \nabla \mathbf{k}_Z(\mathbf{z}_i)^\top (G_Z + n \varepsilon_n I_n)^{-1} G_Y (G_Z + n \varepsilon_n I_n)^{-1} \nabla \mathbf{k}_Z(\mathbf{z}_i), \quad (7)$$
where $\varepsilon_n$ is the parameter of a Tikhonov-type regularization. Then we define $\hat{\theta} = \vec{v}_{\max}(\widehat{M}_n)$ as an empirical estimator of $\theta_0$. In the next two subsections, we introduce two variational forms of the eigenvalue problem, so that our proposed Lasso-based methods can be formulated naturally.
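A rough numerical sketch of the empirical matrix $\widehat{M}_n$ in (7), following the gKDR construction of Fukumizu and Leng [10] with Gaussian kernels on both $Z$ and $Y$. The kernel choices, bandwidths, regularization value, and the toy single-index data are illustrative assumptions.

```python
# Empirical gKDR matrix M_hat; for a Gaussian kernel the gradient is
# d k(z_j, z)/dz = k(z_j, z) (z_j - z) / sigma^2.
import numpy as np

def gram_gauss(A, B, sigma):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def gkdr_matrix(Z, Y, sigma_z=1.0, sigma_y=1.0, eps=0.01):
    n, d = Z.shape
    Gz = gram_gauss(Z, Z, sigma_z)
    Gy = gram_gauss(Y, Y, sigma_y)
    Ginv = np.linalg.inv(Gz + n * eps * np.eye(n))  # Tikhonov-regularized inverse
    M = np.zeros((d, d))
    for i in range(n):
        # rows of D: gradient of k_Z(z_j, .) evaluated at z_i
        D = (Gz[:, i][:, None] * (Z - Z[i])) / sigma_z ** 2   # n x d
        B = Ginv @ D
        M += B.T @ Gy @ B
    return M / n

rng = np.random.default_rng(1)
Z = rng.standard_normal((40, 3))
theta0 = np.array([1.0, 0.0, 0.0])
Y = np.tanh(Z @ theta0)[:, None] + 0.05 * rng.standard_normal((40, 1))
M = gkdr_matrix(Z, Y)
w, V = np.linalg.eigh(M)
print(np.abs(V[:, -1]))  # leading eigenvector, ideally aligned with theta0
```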

A Nonconvex Variational Algorithm.
In order to employ the popular Lasso idea, we need to transform the eigenvalue problem (7) into an optimization scheme. Recall the Courant-Fischer variational representation of the maximal eigenvalue and eigenvector:
$$\vec{v}_{\max}\bigl(\widehat{M}_n\bigr) = \arg\max_{\|\theta\|_2 = 1} \theta^\top \widehat{M}_n \theta. \quad (10)$$
To identify the linear and nonlinear terms in model (3), we define an estimator by adding twofold convex penalties to (10) so that sparse estimates can be generated. To be precise, let $\theta = (\theta_1^\top, \ldots, \theta_{p+q}^\top)^\top$, where each $\theta_j \in \mathbb{R}^m$. The penalized estimation method is defined as follows:
$$\hat{\theta} = \arg\min_{\|\theta\|_2 = 1} \Bigl\{ -\theta^\top \widehat{M}_n \theta + \lambda \sum_{j=1}^{p+q} \bigl( \tau \|\theta_{j,1}\|_2 + (1-\tau) \|\theta_{j,-1}\|_2 \bigr) \Bigr\}, \quad (11)$$
where $\theta_{j,1}$ is the first element of $\theta_j$ and $\theta_{j,-1}$ is the vector of the last $m-1$ components of $\theta_j$, $j = 1, \ldots, p+q$. Note that $\lambda$ is the penalty parameter and $\tau$ is a tuning parameter, typically chosen as $\tau = \lambda/(1+\lambda)$ for normalization. Our penalized method is motivated by the following fact. For any given function $f(x) = \sum_{l=1}^{m} \beta_l B_l(x)$ in the cubic spline representation, as mentioned earlier, the first spline base $B_1(x) = x$ is the identity function and all the bases are linearly independent of each other. Consequently, in this unique representation, $f$ contains a nonlinear part if and only if $\beta_l \neq 0$ for some $l > 1$, and $f$ contains no linear part if and only if $\beta_1 = 0$. Based on this and the Lasso-type penalty in (11), the proposed method can identify both the linear and nonlinear terms and, of course, also find the relevant features simultaneously.
We observe that optimization of (11) is complicated, since it belongs to a class of constrained nonconvex programmes. In this article, we consider an alternative formulation of (11); precisely, we use the ADMM [16]. ADMM is an algorithm intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. Let $P(\theta) = \lambda \sum_{j=1}^{p+q} (\tau \|\theta_{j,1}\|_2 + (1-\tau) \|\theta_{j,-1}\|_2)$; by introducing an additional variable with an equality constraint, we can reformulate algorithm (11) as
$$\min_{\theta, \eta \in \mathbb{R}^{m(p+q)}} \bigl\{ \infty \times \Pi_2(\eta) - \theta^\top \widehat{M}_n \theta + P(\theta) \bigr\} \quad \text{subject to } \theta = \eta,$$
where $\Pi_2$ is the $0$-$1$ indicator function of the unit $\ell_2$-ball surface in $\mathbb{R}^{m(p+q)}$ and we adopt the convention $\infty \times 0 = 0$. As in the method of multipliers, we form the augmented Lagrangian with a parameter $\rho$:
$$L_\rho(\theta, \eta, \hat{\mu}) = \infty \times \Pi_2(\eta) - \theta^\top \widehat{M}_n \theta + P(\theta) + \hat{\mu}^\top (\theta - \eta) + \frac{\rho}{2} \|\theta - \eta\|_2^2,$$
where $\hat{\mu}$ is the Lagrange multiplier. Defining the residual $\mathbf{r} = \theta - \eta$, we have
$$\hat{\mu}^\top \mathbf{r} + \frac{\rho}{2} \|\mathbf{r}\|_2^2 = \frac{\rho}{2} \|\mathbf{r} + \mathbf{u}\|_2^2 - \frac{\rho}{2} \|\mathbf{u}\|_2^2,$$
where $\mathbf{u} = \hat{\mu}/\rho$ is called the scaled dual variable. Using the scaled dual variable, we can express ADMM as
$$\theta^{k+1} = \arg\min_{\theta} \Bigl\{ -\theta^\top \widehat{M}_n \theta + P(\theta) + \frac{\rho}{2} \|\theta - \eta^k + \mathbf{u}^k\|_2^2 \Bigr\}, \quad (15)$$
$$\eta^{k+1} = \Pi_{\mathbb{B}_2}\bigl(\theta^{k+1} + \mathbf{u}^k\bigr), \quad (16)$$
$$\mathbf{u}^{k+1} = \mathbf{u}^k + \theta^{k+1} - \eta^{k+1},$$
where $\Pi_{\mathbb{B}_2}$ is the Euclidean projection onto the unit $\ell_2$-ball surface.
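The ADMM updates above (the $\theta$-minimization, the sphere projection, and the dual update) can be sketched as follows. This is a minimal sketch, not the paper's exact algorithm: the $\theta$-update is approximated by a few proximal-gradient steps rather than an exact BCD solve, and the toy matrix, group structure, and tuning values are assumptions for illustration.

```python
# ADMM sketch for the penalized eigenproblem: minimize -theta' M theta + P(theta)
# over the unit sphere, with P the twofold group-Lasso penalty.
import numpy as np

def prox_group(v, groups, kappa, tau):
    """Prox of kappa * P(.): soft-threshold the first element and the rest of each group."""
    out = v.copy()
    for g in groups:
        for idx, w in ((g[:1], tau), (g[1:], 1.0 - tau)):
            nrm = np.linalg.norm(out[idx])
            scale = 0.0 if nrm == 0 else max(0.0, 1.0 - kappa * w / nrm)
            out[idx] = scale * out[idx]
    return out

def admm_sparse_eig(M, groups, lam=0.1, tau=0.5, rho=None, iters=300):
    d = M.shape[0]
    rho = 4.0 * np.linalg.norm(M, 2) if rho is None else rho  # keep theta-step convex
    theta = np.ones(d) / np.sqrt(d)
    eta, u = theta.copy(), np.zeros(d)
    step = 1.0 / (2.0 * np.linalg.norm(M, 2) + rho)           # smooth-part Lipschitz bound
    for _ in range(iters):
        for _ in range(10):  # approximate theta-update via proximal-gradient steps
            grad = -2.0 * M @ theta + rho * (theta - eta + u)
            theta = prox_group(theta - step * grad, groups, step * lam, tau)
        v = theta + u
        nv = np.linalg.norm(v)
        eta = v / nv if nv > 0 else eta                        # projection onto unit sphere
        u = u + theta - eta                                    # scaled dual update
    return eta

M = np.diag([5.0, 1.0, 0.5, 0.1])            # toy stand-in for M_hat
groups = [np.arange(0, 2), np.arange(2, 4)]  # two covariates, m = 2 bases each
theta_hat = admm_sparse_eig(M, groups)
print(np.round(np.abs(theta_hat), 2))        # concentrates on the leading coordinate
```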
There are many convergence results for ADMM discussed in the literature. ADMM consists of iteratively minimizing $L_\rho(\theta, \eta, \hat{\mu})$ with respect to $\theta$, minimizing it with respect to $\eta$, and then updating the dual variable $\mathbf{u}$. The $\theta$-update involves iterative evaluation of the proximal operator associated with the penalty $P$. To this end, we use the block coordinate descent (BCD) method [17] for $\ell_1/\ell_2$-regularization to compute $\theta^{k+1}$ of (15), where $\mathrm{Prox}_{\|\cdot\|_2}$ is the soft-thresholding operator associated with $\|\cdot\|_2$, defined as $\mathrm{Prox}_{\|\cdot\|_2}(\theta_j) = (1 - \kappa/\|\theta_j\|_2)_{+}\, \theta_j$ for any $\theta_j$. Under mild conditions, the BCD iterates converge to some vector as the number of inner iterations goes to infinity, denoted by $\theta^{k+1}$. It is worthwhile to note that, since the penalty is block separable, the BCD iteration can be computed in parallel across groups.
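The soft-thresholding operator has a closed form that is easy to verify numerically; a minimal sketch with illustrative inputs:

```python
# Group soft-thresholding: Prox(v) = (1 - kappa/||v||_2)_+ v.
import numpy as np

def prox_l2(v, kappa):
    nrm = np.linalg.norm(v)
    if nrm == 0.0:
        return np.zeros_like(v)
    return max(0.0, 1.0 - kappa / nrm) * v

v = np.array([3.0, 4.0])    # ||v||_2 = 5
print(prox_l2(v, 1.0))      # shrinks the norm by 1: [2.4 3.2]
print(prox_l2(v, 6.0))      # kappa >= ||v||_2: mapped to zero, [0. 0.]
```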
With regard to the projection computation in (16), it is known that, whenever $\theta^{k+1} + \mathbf{u}^k \neq 0$,
$$\eta^{k+1} = \frac{\theta^{k+1} + \mathbf{u}^k}{\|\theta^{k+1} + \mathbf{u}^k\|_2};$$
otherwise, $\eta^{k+1}$ is set to any given value on the $\ell_2$-ball surface. Thus, an efficient implementation of ADMM is obtained; to be precise, the overall procedure can be stated as in Algorithm 1.

A Semidefinite Programming Relaxation.
This section introduces a convex relaxation of (10). Since estimates of individual eigenvectors of $\widehat{M}_n$ in (7) are unstable if the gap between their eigenvalues is small, it seems reasonable to instead focus on their span, that is, the principal subspace of variation. Let $\Lambda_0 = \theta_0 \theta_0^\top$ be the projection onto the subspace spanned by $\theta_0$; a lesser known but equivalent variational representation of (10) is the convex program
$$\Lambda_0 = \arg\max_{\Lambda \in \mathcal{S}} \langle M, \Lambda \rangle, \quad \text{where } \mathcal{S} = \{\Lambda : 0 \preceq \Lambda \preceq I,\ \mathrm{tr}(\Lambda) = 1\}$$
is the unit Fantope, the convex hull of the set of rank-one projection matrices. Adding the sparsity penalty then yields the convex estimator
$$\widehat{\Lambda} = \arg\max_{\Lambda \in \mathcal{S}} \bigl\{ \langle \widehat{M}_n, \Lambda \rangle - \lambda \|\Lambda\|_{1,1} \bigr\}. \quad (24)$$
Using the ADMM again, we rewrite (24) as the equivalent problem with an equality constraint
$$\min_{\Theta, \Delta} \bigl\{ \infty \times \mathbb{1}_{\mathcal{F}^1}(\Theta) - \langle \widehat{M}_n, \Theta \rangle + \lambda \|\Delta\|_{1,1} \bigr\} \quad \text{subject to } \Theta = \Delta,$$
where $\mathbb{1}_{\mathcal{F}^1}$ is the $0$-$1$ indicator function of the Fantope $\mathcal{F}^1 = \mathcal{S}$. We notice that computing the projection $\Pi_{\mathcal{F}^1}(A)$ involves an eigendecomposition of $A$ followed by modifying the eigenvalues through solving a monotone, piecewise-linear equation.
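The Fantope projection described here, an eigendecomposition followed by a monotone piecewise-linear equation for the eigenvalue shift, can be sketched as follows. The bisection solver and the toy matrix are illustrative choices, not the paper's implementation.

```python
# Projection onto the unit Fantope {0 <= Theta <= I, tr(Theta) = 1}:
# shift-and-clip the eigenvalues so they lie in [0, 1] and sum to 1.
import numpy as np

def fantope_projection(A, tol=1e-10):
    w, V = np.linalg.eigh((A + A.T) / 2.0)

    def trace_after_shift(gamma):
        return np.sum(np.clip(w - gamma, 0.0, 1.0))

    # trace_after_shift is nonincreasing in gamma: solve for trace = 1 by bisection
    lo, hi = w.min() - 1.0, w.max()
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if trace_after_shift(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = np.clip(w - (lo + hi) / 2.0, 0.0, 1.0)
    return (V * lam) @ V.T

A = np.diag([5.0, 1.0, -2.0])
P = fantope_projection(A)
print(np.round(np.diag(P), 3), round(np.trace(P), 3))
```

For this diagonal example the projection collapses onto the leading eigendirection, so P is (numerically) the rank-one projector diag(1, 0, 0).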

Statistical Properties
Before giving statistical results for the estimator (24) in terms of consistency and model selection properties, we recall the existing error bound [10] for $\widehat{M}_n(\mathbf{z}) - M(\mathbf{z})$ in the Frobenius norm, valid for any $\mathbf{z}$.
Recall that the following technical conditions are required to derive the gradient-based estimator. For shorthand, let $D = m(p+q)$.