Penalized Quadratic Inference Function-Based Variable Selection for Generalized Partially Linear Varying Coefficient Models with Longitudinal Data

Semiparametric generalized varying coefficient partially linear models with longitudinal data arise in contemporary biology, medicine, and life science. In this paper, we consider a variable selection procedure based on the combination of the basis function approximations and quadratic inference functions with SCAD penalty. The proposed procedure simultaneously selects significant variables in the parametric components and the nonparametric components. With appropriate selection of the tuning parameters, we establish the consistency, sparsity, and asymptotic normality of the resulting estimators. The finite sample performance of the proposed methods is evaluated through extensive simulation studies and a real data analysis.


Introduction
Identifying the significant variables is of great importance in any regression analysis. In practice, a number of variables are available for an initial analysis, but many of them may not be significant and should be excluded from the final model in order to increase the accuracy of prediction. Various procedures and criteria, such as stepwise selection and subset selection with the Akaike information criterion (AIC), Mallows' C_p, and the Bayesian information criterion (BIC), have been developed. Nevertheless, these selection methods suffer from expensive computational costs. Many shrinkage methods have been developed for the purpose of computational efficiency, e.g., the nonnegative garrote [1], the LASSO [2], the bridge regression [3], the SCAD [4], and the one-step sparse estimator [5]. Among these, the SCAD possesses the virtues of continuity, unbiasedness, and sparsity. There are a number of works on SCAD estimation in various regression models, e.g., [6][7][8][9]. Zhao and Xue [8] proposed a variable selection method that selects significant variables in the parametric components and the nonparametric components simultaneously for varying coefficient partially linear models (VCPLMs).
On the other hand, longitudinal data occur frequently in biology, medicine, and life science, where it is often necessary to make repeated measurements of subjects over time. The responses from different subjects are independent, but the responses from the same subject are very likely to be correlated. This feature is called "within-cluster correlation". Qu et al. [10] proposed the method of quadratic inference functions (QIFs) to handle longitudinal data. The QIF accounts for the within-cluster correlation and is more efficient than the generalized estimating equation (GEE) [11] approach when the working correlation structure is misspecified. The QIF approach has been applied to many models, including varying coefficient models (VCM) [12,13], partially linear models (PLM) [14], varying coefficient partially linear models (VCPLMs) [15], and generalized partially linear models (GPLM) [16]. Wang et al. [13] proposed a group SCAD procedure for variable selection in VCMs with longitudinal data. More recently, Tian et al. [15] proposed a QIF-based SCAD penalty for variable selection in VCPLMs with longitudinal data.
As introduced in Li and Liang [17], the generalized partially linear varying coefficient model (GPLVCM) possesses the great flexibility of a nonparametric regression model and provides the explanatory power of a generalized linear regression model, arising naturally in the presence of categorical covariates. Many models are special cases of the GPLVCM, e.g., the VCM, VCPLM, PLM, and GLM. Li and Liang [17] studied variable selection for the GPLVCM, where the parametric components are identified via the SCAD but the nonparametric components are selected via a generalized likelihood ratio test instead of shrinkage. In this paper, we extend the QIF-based group SCAD variable selection procedure to the GPLVCM with longitudinal data, adopting B-spline methods to approximate the nonparametric components of the model. With suitably chosen tuning parameters, the proposed variable selection procedure is consistent, and the estimators of the regression coefficients possess the oracle property; i.e., the estimators of the nonparametric components achieve the optimal convergence rate, and the estimators of the parametric components have the same asymptotic distribution as those based on the correct submodel.
The rest of this paper is organized as follows. In Section 2, we propose a variable selection procedure for the GPLVCM with longitudinal data. Asymptotic properties of the resulting estimators and an iteration algorithm are presented in Section 3. In Section 4, we carry out simulation studies to assess the finite sample performance of the method. A real data analysis is given in Section 5 to illustrate the proposed methodology. The details of proofs are provided in the appendix.

Methodology
2.1. GPLVCM with Longitudinal Data. In this article, we consider a longitudinal study with n subjects and m_i observations over time for the ith subject (i = 1, …, n), for a total of N = Σ_{i=1}^n m_i observations. Each observation consists of a response variable Y_ij and predictor variables (X_ij, Z_ij, U_ij), where X_ij ∈ R^p, Z_ij ∈ R^q, and U_ij is a scalar. We assume that observations from different subjects are independent, but those within the same subject are dependent. The generalized partially linear varying coefficient model (GPLVCM) with longitudinal data takes the form

h(μ_ij) = X_ij^T β + Z_ij^T α(U_ij),   (1)

where μ_ij is the expectation of Y_ij given X_ij, Z_ij, and U_ij, β = (β_1, …, β_p)^T is an unknown p × 1 regression coefficient vector, h(·) is a known smooth monotonic link function, and α(u) = (α_1(u), α_2(u), …, α_q(u))^T is a q × 1 vector of unknown smooth functions. Without loss of generality, we assume U ∼ U[0, 1].
We approximate α_k(·) by B-spline basis functions B(u) = (B_1(u), …, B_L(u))^T of order M, where L = K + M + 1 and K is the number of interior knots; i.e.,

α_k(u) ≈ B(u)^T γ_k,  k = 1, …, q,

where γ_k = (γ_{k1}, …, γ_{kL})^T is an L × 1 vector of unknown regression coefficients. Accordingly, h(μ_ij) is approximated by

h(μ_ij) ≈ X_ij^T β + (Z_ij ⊗ B(U_ij))^T γ,

where γ = (γ_1^T, …, γ_q^T)^T and "⊗" denotes the Kronecker product.
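As a minimal numerical sketch of this approximation step (the function and parameter names are our own, not the authors'), the B-spline design matrix B(u) can be built with SciPy, with interior knots at sample quantiles as described in Section 4; a nonparametric component is then a linear combination of the columns:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(u, n_interior=3, degree=3):
    """Rows are B(u_i)^T; L = n_interior + degree + 1 columns (L = K + M + 1 with M = degree)."""
    # interior knots at sample quantiles of u (as in Section 4); boundary knots clamped at 0 and 1
    qs = np.quantile(u, np.linspace(0, 1, n_interior + 2)[1:-1])
    t = np.r_[np.zeros(degree + 1), qs, np.ones(degree + 1)]
    return BSpline.design_matrix(u, t, degree).toarray()

rng = np.random.default_rng(0)
u = np.sort(rng.uniform(0, 1, 200))
B = bspline_design(u)                    # shape (200, 7): K = 3, degree = 3
gamma_k = rng.normal(size=B.shape[1])    # hypothetical spline coefficients for one component
alpha_k_hat = B @ gamma_k                # alpha_k(u) ≈ B(u)^T gamma_k
```

With the spline coefficients in hand, each α_k behaves like a group of L pseudodesign variables, which is what makes the group penalization below natural.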
We use the B-spline basis functions because they are numerically stable and have bounded support [18]. The spline approach also treats a nonparametric function as a linear function with the basis functions as pseudodesign variables, and thus, any computational algorithm for the generalized linear models can be used for the GPLVCMs.
To incorporate the within-cluster correlation, we apply the QIF to estimate β and γ. Write θ = (β^T, γ^T)^T and define the extended score

g_N(θ) = (1/n) Σ_{i=1}^n g_i(θ),  g_i(θ) = ( [μ̇_i^T A_i^{-1/2} M_1 A_i^{-1/2} (Y_i − μ_i)]^T, …, [μ̇_i^T A_i^{-1/2} M_s A_i^{-1/2} (Y_i − μ_i)]^T )^T,

where μ̇_i = ∂μ_i/∂θ, A_i = diag(Var(Y_i1), …, Var(Y_im_i)) is the marginal variance matrix of Y_i, and M_1, …, M_s are known base matrices used to represent the inverse of the working correlation matrix R in the GEE approach, i.e., R^{-1} ≈ Σ_{t=1}^s a_t M_t. Following Qu et al. [10], we define the quadratic inference function as

Q_n(θ) = n g_N(θ)^T Ω_n^{-1}(θ) g_N(θ),

where Ω_n(θ) = (1/n) Σ_{i=1}^n g_i(θ) g_i(θ)^T. Note that Ω_n depends on θ. The QIF estimate θ̃ is then given by θ̃ = argmin_θ Q_n(θ).

2.2. Penalized QIF. In real data analysis, the true regression model is unknown. An overfitted model lowers the efficiency of estimation, while an underfitted one leads to a biased estimator. A popular approach that identifies the relevant predictors while simultaneously estimating the nonzero parameters and functions in model (1) is to add a penalty to the original objective function. We choose the smoothly clipped absolute deviation (SCAD) penalty because of its unbiasedness, sparsity, and continuity. The SCAD-penalized quadratic inference function (PQIF) is defined as

Q_n^p(θ) = Q_n(θ) + n Σ_{l=1}^p p_{λ_1}(|β_l|) + n Σ_{k=1}^q p_{λ_2}(‖γ_k‖_H),   (7)

where the SCAD penalty p_λ(·) satisfies p_λ(0) = 0 and

p'_λ(ω) = λ{ I(ω ≤ λ) + (aλ − ω)_+ /[(a − 1)λ] · I(ω > λ) },  ω > 0,

with a > 2; we take a = 3.7 as in [4]. Note that ‖γ_k‖_H = (γ_k^T H γ_k)^{1/2} with H = ∫_0^1 B(u)B(u)^T du, so that ‖γ_k‖_H equals the L_2 norm of the spline approximant of α_k. This group-wise penalization ensures that the spline coefficient vector of each nonparametric component is treated as a single group in model selection.
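The structure of the extended score and Q_n can be illustrated in a few lines. The sketch below is a deliberate simplification of the general formula, not the authors' implementation: it assumes an identity link (so μ̇_i = D_i) and unit marginal variances (A_i = I), with the compound-symmetry base matrices M_1 = I and M_2 the off-diagonal ones matrix:

```python
import numpy as np

def cs_base_matrices(m):
    """Base matrices spanning the inverse of a compound-symmetry working correlation."""
    M1 = np.eye(m)
    M2 = np.ones((m, m)) - np.eye(m)   # ones off the diagonal, zeros on it
    return [M1, M2]

def qif(theta, D, Y, base_mats):
    """Q_n(theta) = n * g_N^T Omega_n^{-1} g_N for an identity-link model mu_i = D_i theta.

    D: list of (m_i, d) design matrices; Y: list of (m_i,) responses.
    With identity link and unit marginal variance, each score block reduces to
    D_i^T M_t (Y_i - D_i theta)."""
    n = len(Y)
    gs = []
    for Di, Yi in zip(D, Y):
        resid = Yi - Di @ theta
        gs.append(np.concatenate([Di.T @ M @ resid for M in base_mats]))
    G = np.asarray(gs)                 # n x (s*d) matrix of the g_i(theta)
    g_bar = G.mean(axis=0)             # extended score g_N(theta)
    Omega = G.T @ G / n                # Omega_n(theta)
    return n * g_bar @ np.linalg.solve(Omega, g_bar)
```

Because Q_n is small near the truth and grows quickly away from it, it plays the role of a (negative, doubled) log-likelihood in the penalized criterion that follows.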
Denote by θ̂ the penalized estimator obtained by minimizing the penalized objective function (7). Then β̂ = (θ̂_1, …, θ̂_p)^T is the estimator of the parameter β, and the estimator of the nonparametric function α_k(u) is α̂_k(u) = B(u)^T γ̂_k, k = 1, …, q.

Asymptotic Properties
3.1. Oracle Property. We next establish the asymptotic properties of the resulting penalized QIF estimators. We first introduce some notation. Let β_0 and α_0(·) denote the true values of β and α(·). In addition, γ_0 is the spline coefficient vector from the spline approximation to α_0(·). Without loss of generality, we assume that β_{0l} ≠ 0 for l = 1, …, p_1 and β_{0l} = 0 for l = p_1 + 1, …, p, i.e., only the first p_1 components of β_0 are nonzero. Similarly, we assume that α_{0k}(·) ≠ 0 for k = 1, …, q_1 and α_{0k}(·) ≡ 0 for k = q_1 + 1, …, q, i.e., only the first q_1 components of α_0(·) are nonzero. For convenience, C denotes a positive constant that may take different values at each appearance, and ‖A‖ denotes the largest singular value of the matrix or vector A. Before presenting our main theorems, we list the regularity conditions used in this paper.
Assumption (A1). The spline regression parameter γ is identifiable; that is, γ_0, the spline coefficient vector from the spline approximation to α_0(·), is unique.

Assumption (A2). The weight matrix Ω_n(θ) = (1/n) Σ_{i=1}^n g_i(θ) g_i(θ)^T converges almost surely to a constant matrix Ω_0, and Ω_0 is invertible.
Theorem 1 indicates that the estimators of the nonparametric components achieve the optimal convergence rate.
Furthermore, under suitable conditions, Theorem 2 shows that the penalized QIF estimator possesses the sparsity property.
where Δ^{⊗2} = ΔΔ^T and τ = (τ_{ij})_{n×n} is an n × n block matrix. Theorem 3 states that β̂_*, the estimator of the nonzero parametric components, is asymptotically normally distributed.
3.2. Selection of Tuning Parameters. Theorems 1–3 imply that the proposed variable selection procedure possesses the oracle property. This attractive feature, however, relies on the choice of the tuning parameters λ_1 and λ_2. Popular criteria for choosing them include cross-validation, generalized cross-validation, AIC, and BIC. Wang et al. [19] suggested using BIC for the SCAD estimator in linear and partially linear models and proved its model selection consistency, i.e., the optimal tuning parameter chosen by BIC identifies the true model with probability tending to one; Tian et al. [15] established a similar result. Hence, we adopt BIC to choose the optimal {λ_1, λ_2}. Following [19][20][21], we simplify the tuning parameters to

λ_{1,l} = λ_0 / |β̃_l^{(0)}|,  l = 1, …, p,   λ_{2,k} = λ_0 / ‖γ̃_k^{(0)}‖_H,  k = 1, …, q,

where β̃_l^{(0)} and γ̃_k^{(0)} are the unpenalized QIF estimates. Consequently, the original two-dimensional problem reduces to a univariate search over λ_0, which is selected by minimizing the BIC-type criterion

BIC(λ_0) = Q_n(θ̂_{λ_0}) + df_{λ_0} log n,

where df_{λ_0} is the number of nonzero components of θ̂_{λ_0}. By Theorem 4 of Tian et al. [15], the BIC tuning parameter selector selects the true model consistently.
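The univariate BIC search can be sketched as a grid loop; `fit_pqif` below stands in for a hypothetical penalized-QIF solver (not part of the paper), and `qn` evaluates Q_n at the fitted value:

```python
import numpy as np

def select_lambda0(grid, fit_pqif, qn, n):
    """BIC-type grid search for the single tuning parameter lambda_0.

    fit_pqif(lam0) -> theta_hat : hypothetical penalized-QIF solver;
    qn(theta) -> Q_n(theta); df = number of nonzero estimated components."""
    best = (np.inf, None, None)
    for lam0 in grid:
        theta = fit_pqif(lam0)
        df = int(np.count_nonzero(theta))
        bic = qn(theta) + df * np.log(n)
        if bic < best[0]:
            best = (bic, lam0, theta)
    return best[1], best[2]
```

The log(n) degrees-of-freedom charge is what pushes the criterion toward sparser fits than cross-validation would select, which is the source of the consistency property cited above.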

3.3. An Algorithm Using Local Quadratic Approximation
Based on the local quadratic approximation (LQA) approach of Fan and Li [4], we propose an iterative algorithm to minimize the PQIF (7). As in Tian et al. [15], we take the unpenalized QIF estimator θ̃ as the initial value. Let θ^k = (β_1^k, …, β_p^k, (γ_1^k)^T, …, (γ_q^k)^T)^T be the value of θ at the kth iteration. If β_l^k (or γ_l^k) is close to zero, i.e., |β_l^k| ≤ ε (or ‖γ_l^k‖_H ≤ ε) for some small threshold ε, then we set β_l^k = 0 (or γ_l^k = 0). We use ε = 10^{−6} in our simulations.
Suppose that β_l^{k+1} = 0 for l = p_k + 1, …, p and γ_l^{k+1} = 0 for l = q_k + 1, …, q, so that θ^{k+1} contains p − p_k zero coefficients and q − q_k zero functions and has the same length and partition as θ^k. For the parametric components, if |β_l^k| > ε, the penalty function at β_l ≈ β_l^k is approximated by

p_{λ_1}(|β_l|) ≈ p_{λ_1}(|β_l^k|) + (1/2){ p'_{λ_1}(|β_l^k|) / |β_l^k| } (β_l^2 − (β_l^k)^2).

Computational and Mathematical Methods in Medicine
Similarly, for the nonparametric components, if ‖γ_l^k‖_H > ε, the penalty function at γ_l ≈ γ_l^k is approximated by

p_{λ_2}(‖γ_l‖_H) ≈ p_{λ_2}(‖γ_l^k‖_H) + (1/2){ p'_{λ_2}(‖γ_l^k‖_H) / ‖γ_l^k‖_H } (‖γ_l‖_H^2 − ‖γ_l^k‖_H^2),

where p'_λ is the first-order derivative of the penalty function p_λ. This leads to a local quadratic approximation of the PQIF Q_n^p(θ):

Q_n^p(θ) ≈ Q_n(θ^k) + Q̇_n(θ^k)^T (ω_{11} − ω_{11}^k) + (1/2)(ω_{11} − ω_{11}^k)^T Q̈_n(θ^k)(ω_{11} − ω_{11}^k) + (n/2) ω_{11}^T Σ_λ(θ^k) ω_{11},   (22)

where Q̇_n(θ^k) = ∂Q_n(θ^k)/∂ω_{11}, Q̈_n(θ^k) = ∂²Q_n(θ^k)/∂ω_{11}∂ω_{11}^T, ω_{11} = (β_N^T, γ_Z^T)^T collects the components of θ that are currently nonzero, and Σ_λ(θ^k) is the block-diagonal matrix of LQA weights p'_{λ_1}(|β_l^k|)/|β_l^k| and {p'_{λ_2}(‖γ_l^k‖_H)/‖γ_l^k‖_H} H. Minimizing the quadratic function (22), we obtain ω_{11}^{k+1}. The Newton–Raphson method then iterates

ω_{11}^{k+1} = ω_{11}^k − { Q̈_n(θ^k) + n Σ_λ(θ^k) }^{-1} { Q̇_n(θ^k) + n Σ_λ(θ^k) ω_{11}^k }

until convergence.
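The SCAD derivative and one LQA/Newton update can be sketched as follows. This is a schematic of the general recipe under simplifying assumptions of our own (scalar groups, and the gradient and Hessian of Q_n supplied externally), not the authors' code:

```python
import numpy as np

def scad_deriv(w, lam, a=3.7):
    """p'_lambda(w) for the SCAD penalty (w >= 0), with a = 3.7 as in Fan and Li."""
    w = np.asarray(w, dtype=float)
    return lam * ((w <= lam) + np.maximum(a * lam - w, 0) / ((a - 1) * lam) * (w > lam))

def lqa_step(theta, grad, hess, lam, eps=1e-6):
    """One local-quadratic-approximation Newton update for the penalized objective.

    grad, hess: first and second derivatives of Q_n at theta (taken as given);
    components with |theta_l| <= eps are treated as already shrunk to zero."""
    n = len(theta)  # illustrative stand-in for the sample-size factor
    active = np.abs(theta) > eps
    sigma = np.zeros_like(theta)
    sigma[active] = scad_deriv(np.abs(theta[active]), lam) / np.abs(theta[active])
    S = np.diag(sigma)
    theta_new = theta - np.linalg.solve(hess + n * S, grad + n * S @ theta)
    theta_new[~active] = 0.0
    return theta_new
```

Note how components that fall below the threshold are frozen at zero, which is how the iteration produces exactly sparse solutions despite the smooth quadratic surrogate.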

Simulation Studies
4.1. Assessment Criteria. In this section, we conduct simulation studies to assess the finite sample performance of the proposed procedures. Following [17], the performance of the estimator β̂ is assessed by the generalized mean square error (GMSE),

GMSE = (β̂ − β_0)^T E(XX^T) (β̂ − β_0).

The performance of the estimator α̂(·) is assessed by the square root of the average squared errors (RASE),

RASE = { (1/M) Σ_{v=1}^M ‖α̂(u_v) − α(u_v)‖^2 }^{1/2},

where u_v, v = 1, …, M, are the grid points at which the function α̂(u) is evaluated. We use M = 300 in our simulations.
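For concreteness, both criteria are straightforward to compute; a minimal sketch (function names are ours, with E(XX^T) replaced by its sample estimate):

```python
import numpy as np

def gmse(beta_hat, beta0, X):
    """Generalized MSE: (b - b0)^T E(XX^T) (b - b0), with E(XX^T) estimated from X (N x p)."""
    d = beta_hat - beta0
    return d @ (X.T @ X / X.shape[0]) @ d

def rase(alpha_hat, alpha0, grid):
    """Square root of the average squared error of alpha_hat over the grid points."""
    diff = alpha_hat(grid) - alpha0(grid)
    return np.sqrt(np.mean(diff ** 2))
```

The E(XX^T) weighting makes the GMSE comparable across covariate scales, unlike a plain Euclidean error.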
To assess the performance of the variable selection, we use "C" to denote the average number of zero regression coefficients correctly estimated as zero, and "IC" to denote the average number of nonzero regression coefficients erroneously set to zero. The closer the value of "C" is to the number of true zero coefficients in the model, and the closer the value of "IC" is to zero, the better the performance of the variable selection procedure.
In our simulations, we place the knots at sample quantiles of U_ij and take the number of interior knots to be 3, that is, O(N^{1/5}). This choice is consistent with the asymptotic theory in Section 3 and performs well in the simulations. For each simulated dataset, we compute penalized QIF estimators with both the SCAD and LASSO penalty functions. The tuning parameters λ_1 and λ_2 for the penalty functions are chosen by BIC from 50 equally spaced grid points in [−15, 5]. For each method, the average number of zero coefficients over the 500 simulated datasets is reported.

Study 1 (Partial Penalty). Consider a Bernoulli response Y_ij with

logit(μ_ij) = X_ij^T β + α(U_ij),

where β = (2, 1.5, 0.7, 0_{17}^T)^T, m = 6, X_ij ∼ N(0, I_{20}), α(U_ij) = 0.4 cos((π/2)U_ij), and the U_ij are drawn independently from U[0, 1]. The response Y_ij with compound symmetry (CS) correlation structure is generated according to Oman [22]. In our simulation study, we consider ρ = 0.25 and ρ = 0.75, representing weak and strong correlation, respectively. In some situations, we prefer not to shrink certain components in the variable selection procedure when prior information is available; a partial penalty arises naturally in such cases. In this example, we penalize only the parametric component, i.e., the coefficient β, so the PQIF (7) becomes

Q_n^p(θ) = Q_n(θ) + n Σ_{l=1}^p p_{λ_1}(|β_l|).
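Correlated binary clusters of this kind can be generated in several ways; the sketch below uses a Gaussian-copula construction as a stand-in for Oman's method [22] (which we do not reproduce here), sharing an exchangeable latent correlation ρ across the m observations of a cluster:

```python
import numpy as np
from scipy.stats import norm

def correlated_bernoulli(p, rho, n, rng):
    """Draw n clusters of Bernoulli(p_j) responses with exchangeable latent correlation rho.

    Gaussian-copula stand-in for Oman's method: latent normals with CS correlation rho
    are mapped through the normal CDF, so marginals are exactly Bernoulli(p_j) while
    within-cluster responses are positively dependent."""
    m = len(p)
    cov = rho * np.ones((m, m)) + (1 - rho) * np.eye(m)
    z = rng.multivariate_normal(np.zeros(m), cov, size=n)
    u = norm.cdf(z)                      # uniform marginals on (0, 1)
    return (u < np.asarray(p)).astype(int)
```

The marginal success probabilities are preserved exactly, while the induced binary correlation is somewhat smaller than the latent ρ, as is typical for copula constructions.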

The variable selection results are reported in Tables 1 and 2. They show that the performance of the proposed variable selection approach improves as n increases: the number of correctly identified zero coefficients approaches the number of true zero coefficients in the model, and the GMSE of β̂ decreases. The RASE of α̂(u) also decreases as n increases, meaning that the estimated curve of α̂(u) fits the true function better as the sample size grows. Moreover, the SCAD penalty outperforms the LASSO penalty in terms of correct variable selection rate, which markedly reduces model uncertainty and complexity.
The results are also reported in Tables 3 and 4. Table 3 reports the variable selection results for the parametric components: the performance improves as n increases, e.g., the number of correctly identified zero coefficients (the column labeled "C") approaches the true number of zero regression coefficients, while the GMSE decreases steadily. Table 4 shows that the performance for the nonparametric components is similar: as n increases, the RASE of the estimated nonparametric functions becomes smaller, reflecting that the estimated curves fit the corresponding true functions better as the sample size increases. Again, the SCAD penalty outperforms the LASSO penalty in terms of correct variable selection rate.
To study the influence of a misspecified correlation structure on the proposed approach, we perform variable selection with the working correlation structure specified as CS and as first-order autoregressive (AR-1), respectively. The results are listed in Table 5. The QIF estimator is known to be insensitive to misspecification of the correlation structure, and Table 5 shows that the proposed variable selection procedure gives similar results even when the correlation structure is misspecified. This indicates that our method is robust.
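The base matrices behind the two working structures can be checked numerically: the inverse of a CS correlation matrix lies exactly in the span of {I, 11^T − I}, while the AR-1 inverse is tridiagonal and needs a boundary-correction matrix in addition to I and the adjacent-pair matrix. A small verification sketch (matrix names are ours):

```python
import numpy as np

def cs_corr(m, rho):
    return (1 - rho) * np.eye(m) + rho * np.ones((m, m))

def ar1_corr(m, rho):
    return rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))

def in_span(Rinv, mats, tol=1e-8):
    """True if Rinv is (numerically) a linear combination of the given base matrices."""
    A = np.column_stack([M.ravel() for M in mats])
    coef, *_ = np.linalg.lstsq(A, Rinv.ravel(), rcond=None)
    return np.linalg.norm(A @ coef - Rinv.ravel()) < tol

m, rho = 6, 0.25
M1 = np.eye(m)
M2_cs = np.ones((m, m)) - np.eye(m)                                 # CS: ones off the diagonal
M2_ar = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)    # AR-1: adjacent pairs
M3_ar = np.zeros((m, m))
M3_ar[0, 0] = M3_ar[-1, -1] = 1.0                                   # AR-1: boundary correction

cs_ok = in_span(np.linalg.inv(cs_corr(m, rho)), [M1, M2_cs])
ar_ok = in_span(np.linalg.inv(ar1_corr(m, rho)), [M1, M2_ar, M3_ar])
```

Because the CS base matrices cannot reproduce the AR-1 inverse (and vice versa), a misspecified working structure only costs efficiency, not consistency, which is consistent with the robustness seen in Table 5.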

Study 3 (High-Dimensional Setup).
In this example, we examine how the proposed variable selection procedure behaves in the "large n, diverging p/q" setup for longitudinal models, using a high-dimensional version of study 2. In this simulation, we take n = 300, m = 6, p = 20 = O(N^{1/4}), and q = 10 = O(N^{1/4}). The true coefficient vector is β = (2, 1.5, 0.7, 0_{17}^T)^T and α(u) = (α_1(u), α_2(u), 0_8^T)^T, where α_1(u) and α_2(u) are defined in study 2. The other settings are the same as in study 2. The results are reported in Table 6. The proposed variable selection procedure correctly identifies the true model and works well in the "large n, diverging p/q" setup.

Application to Infectious Disease Data
We apply the proposed method to an infectious disease dataset (indon.data), which has been analyzed by many authors, e.g., [16,[23][24][25][26][27]. In this study, a total of 275 preschool children were examined every three months for 18 months. The response is the presence of respiratory infection (1 = yes, 0 = no). The primary interest is the relationship between the risk of respiratory infection and vitamin A deficiency (1 = yes, 0 = no).
In our study, we consider the following GPLVCM:

logit(μ_ij) = α_0(t_ij) + α_1(t_ij) Z_{1,ij} + Σ_{k=1}^6 β_k X_{k,ij},

where t is age, X_1 is vitamin A deficiency, X_2 and X_3 are the seasonal cosine and seasonal sine variables, respectively, which indicate the season in which the examinations took place, X_4 is gender (1 = female, 0 = male), X_5 is height, X_6 is stunting status (1 = yes, 0 = no), and Z_1 = X_5^2 is the square of height. The within-cluster correlation structure is assumed to be exchangeable, i.e., compound symmetric. This structure is also used in [16,26,27].
Applying the proposed QIF-based group SCAD variable selection procedure to the above model, we identify five nonzero coefficients and one nonzero function α_0(t), with β̂_1 = 0.842, β̂_2 = −0.685, β̂_3 = −0.309, β̂_4 = −0.554, and β̂_6 = 0.966. The results are generally consistent with previous studies, but ours indicate that height has no significant impact on the infection rate and can be removed from the model. Figure 1 displays the curve of the baseline age function α_0(t) estimated by the QIF-based group SCAD, by the unpenalized QIF, and by the QIF-based SCAD with a partial penalty on β as in [16], where a GPLM without the varying coefficient term is used. Figure 1 implies that the probability of respiratory infection increases at the very early stage, then decreases steadily, and declines sharply after the age of 5.5 years. This also agrees with previous results [16,26,27].

Conclusion and Discussion
We proposed a QIF-based group SCAD variable selection procedure for generalized partially linear varying coefficient models with longitudinal data. The procedure selects significant variables in the parametric components and the nonparametric components simultaneously. Under mild conditions, the estimators of the regression coefficients possess the oracle property. Simulation studies indicate that the proposed procedure is very effective in selecting significant variables and estimating the regression coefficients.
In this paper, we assume that the dimensions of the covariates X and Z are fixed. Study 3 in the simulations shows that the proposed approach still performs well when the dimensions p and q diverge as n → ∞. In the ultrahigh-dimensional case, however, the proposed variable selection procedure may no longer work well. It is an interesting future research topic to consider variable selection for generalized partially linear varying coefficient models with ultrahigh-dimensional covariates.

Appendix

Then,

(A.2)

Proof of Theorem 1. Let δ = n^{−1/2}, β = β_0 + δD_1, γ = γ_0 + δD_2, and D = (D_1^T, D_2^T)^T. We first show that, for any given ε > 0, there exists a large constant C such that

P{ inf_{‖D‖ = C} Q_n^p(β_0 + δD_1, γ_0 + δD_2) > Q_n^p(β_0, γ_0) } ≥ 1 − ε.

Note that β_{0l} = 0 for all l = p_1 + 1, …, p and γ_{0k} = 0 for all k = q_1 + 1, …, q; together with Assumption (A1) and p_λ(0) = 0, this gives

(A.4)

By a Taylor expansion and Assumption (A4), we have

(A.5)

Invoking the proof of Theorem 2 in Zhang and Xue [16] and the spline approximation results in [18], we get ‖R_k(u)‖ = O(K^{−r}). Therefore,

(A.7)

Thus, we complete the proof of Theorem 1.
Similarly, we can prove that, with probability tending to one, γ̂_k = 0 for k = q_1 + 1, …, q. Since ‖B(u)‖ = O(1) and α̂_k(u) = B(u)^T γ̂_k, the second part of Theorem 2 follows. Thus, we complete the proof of Theorem 2.