Model and Variable Selection Procedures for Semiparametric Time Series Regression

Semiparametric regression models are very useful for time series analysis. They facilitate the detection of features resulting from external interventions. The complexity of semiparametric models poses new challenges for issues of nonparametric and parametric inference and model selection that frequently arise from time series data analysis. In this paper, we propose penalized least squares estimators which can simultaneously select significant variables and estimate unknown parameters. An innovative class of variable selection procedure is proposed to select significant variables and basis functions in a semiparametric model. The asymptotic normality of the resulting estimators is established. Information criteria for model selection are also proposed. We illustrate the effectiveness of the proposed procedures with numerical simulations.


Introduction
Non-and semiparametric regression has become a rapidly developing field of statistics in recent years.Various types of nonlinear model such as neural networks, kernel methods, as well as spline method, series estimation, local linear estimation have been applied in many fields.Non-and semiparametric methods, unlike parametric methods, make no or only mild assumptions about the trend or seasonal components and are, therefore, attractive when the data on hand does not meet the criteria for classical time series models.However, the price of this flexibility can be high; when multiple predictor variables are included in the regression equation, nonparametric regression faces the so-called curse of dimensionality.
A major problem associated with non-and semiparametric trend estimation involves the selection of a smoothing parameter and the number of basis functions.Most literature on nonparametric regression with dependent errors focuses on the kernel estimator of the trend function see, e.g., Altman  where y i is the response variable and x i is the d × 1 covariate vector at time i, α t i is an unspecified baseline function of t i with t i i/n, β is a vector of unknown regression coefficients, and ε i is a Gaussian and zero mean covariance stationary process.
We assume the following properties for the error terms ε i and vectors of explanatory variables x i .
A.1 It holds that {ε i } is a linear process given by where b 0 1 and {e i } is an i.i.d.Gaussian random variable with E{e i } 0 and E{e i 2 } σ 2 e .A. 2 The coefficients b j satisfy the conditions that for all |z| < 1, ∞ j 0 b j z j / 0 and We define γ k cov ε t , ε t k E{ε t ε t k }.The assumptions on covariate variables are as follows.B.1 Also x i x i1 , . . ., x id ∈ R d and {x ij }, j 1, . . .d, have mean zero and variance 1.
The trend function α t i is expressed as a linear combination of a set of m underlying basis functions: where {φ t i φ 1 t i , . . ., φ m t i } is an m-dimensional vector constructed from basis functions {φ k t i ; k 1, . . ., m}, and w w 1 , . . ., w m is an unknown parameter vector to be estimated.The examples of basis functions are B-spline, P-spline, and radial basis functions.A P-spline basis is given by φ t i t i , . . ., t p i , t i − κ 1 p , . . ., t i − κ k p , 2.4 where {κ k } k 1,...,K are spline knots.This specification uses the so-called truncated power function basis.The choice of the number of knots K and the knot locations are discussed by Yu and Ruppert 18 .
Radial basis function RBF emerged as a variant of artificial renewal network in late 80s.Nonlinear specification of using RBF has been widely used in cognitive science, engineering, biology, linguistics, and so on.If we consider the RBF modeling, a basis function can take the form where μ k determines the location and s 2 k determines the width of the basis function.

Journal of Probability and Statistics
Selecting appropriate basis functions, then the semiparametric regression model 2.1 can be expressed as a linear model y Xβ Bw ε, 2.6 where X x 1 , . . ., x n , y y 1 , . . ., y n , B φ 1 , . . ., φ n with φ i φ 1 i/n , . . ., φ m i/n .The penalized least-square estimator is then a minimizer of the function where ξ is the smoothing parameter controlling the tradeoff between the goodness-of-fit measured by weighted least-square and the roughness of the estimated function.Also K is an appropriate positive semidefinite symmetric matrix.For example, if K satisfies w Kw where y I−S y, X I−S X ,and I is the identity matrix of order n.Applying least-square to the linear model 2.9 , we obtain the semiparametric ordinary least-square estimator SOLSE result: Speckman 21 studied similar solutions for partial linear models with independent observations.Since the errors are serially correlated in model 2.1 , β SOLSE is not asymptotically efficient.To obtain an asymptotically efficient estimator for β, we use the prewhitening transformation.Note that the errors {ε i } in 2.6 are invertible.Let b L ∞ j 1 b j e i−j , where L is the lag operator and a L b L −1 a 0 − ∞ j 1 a j L j with a 0 1.Applying a L to the model 2.

2.13
The regression errors in 2.12 are i.Under the usual regularity conditions the coefficients a j decrease geometrically so, letting τ τ n denote a truncation parameter, we may consider the truncated autoregression on ε i : where e i are i.i.d.random variables with E e i 0. We make the following assumption about the truncation parameter.
C.1 The truncation parameter τ satisfies τ n c log n for some c > 0.

Journal of Probability and Statistics
Now our estimation problem for the semiparametric time series regression model can be expressed as the minimization of the function Based on the work by Aneiros-Pérez and Vilar-Fernández 11 , an estimator for T τ is constructed as follows.We use the residuals ε y − X β SOLSE − B w SOLSE to construct an estimate of T τ using the ordinary least square method applied to the model

2.18
Define the estimate a τ a 1 , a 2 , . . ., a τ of a τ a 1 , a 2 , . . ., a τ , where where ε ε τ 1 , . . ., ε n and E τ is the n − τ × τ matrix of regressors with the typical element ε i−j .Then T τ is obtained from T τ by replacing a j with a j , σ 2 e with σ 2 e , and so forth.Applying least-square to the linear model, we obtain Then where X τ I − S X τ and y τ I − S y τ , with y τ T τ y and X τ T τ X.The following theorem shows that the loss in efficiency associated with the estimation of the autocorrelation structure is modest in large samples.

Theorem 2.1. Let the conditions of (A.1), (A.2), (B.1), and (C.1) hold, and assume that
where is nonsingular and let w 0 denote the true value of w, then one has 25

Variable Selection and Penalized Least Squares
Variable and model selection are an indispensable tool for statistical data analysis.However, it has rarely been studied in the semiparametric context.Fan and Li 23 studied penalized weighted least-square estimation with variable selection in semiparametric models for longitudinal data.In this section, we introduce the penalized weighted least-square approach.
We propose an algorithm for calculating the penalized weighted least-square estimator of θ β , w in Section 2.3.In Section 2.4 we present the information criteria for the model selection.
From now on, we assume that the matrices X τ and B τ are standardized so that each column has mean 0 and variance 1.The first term in 2.7 can be regarded as a loss function of β and w, which we will denote by l β, w .Then expression 2.7 can be written as L β, w l β, w nξw Kw.

2.26
The methodology in the previous section can be applied to the variable selection via penalized least-square.A form of penalized weighted least-square is

An Estimation Algorithm
In this section we describe an algorithm for calculating the penalized least-square estimator of θ β , w .The estimate of θ minimizes the penalized sum of squares L θ given by 2.17 .First we obtain θ SOLSE in Step 1.In Step 2, we estimate T τ by using ε obtained in Step 1. Then θ HT SGLSE is obtained using T τ Step 3 .Here the penalty parameters λ, and ξ, and the number of basis functions m are chosen using information criteria that will be discussed in Section 2.4.
Step 1.First we obtain β SOLSE and w SOLSE by 2.10 and 2.11 , respectively.Then we have the model y B w SOLSE X β SOLSE ε.

2.31
Step 2. An estimator for T τ is constructed followings the work of Aneiros-Pérez and Vilar-Fernández 4 .We use the residuals ε y − B w SOLSE − X β SOLSE to construct an estimate of T τ using the ordinary least square method applied to the model The estimator T τ is obtained from T τ by replacing parameters with their estimates.
Step 3. Our SGLSE of θ is obtained by using the model where y τ T τ y, B τ T τ B, X τ T τ X, and ε τ T τ ε.Finding the solution of the penalized least-square of 2.27 needs the local quadratic approximation, because the L 1 and hard thresholding penalty are irregular at the origin and may not have second derivatives at some points.We follow the methodology of Fan and Li 14 .Suppose that we are given an initial value θ 0 that is close to the minimizer of 2.27 .If θ 0 j is very close to 0, then set θ 0 j 0. Otherwise they can be locally approximated by a quadratic function as when θ 0 j / 0. Therefore, the minimization problem 2.27 can be reduced to a quadratic minimization problem and the Newton-Raphson algorithm can be used.The right-hand side of equation 2.27 can be locally approximated by where

2.36
The solution can be found by iteratively computing the block matrix equation:

2.37
This gives the estimators

Information Criteria
Selecting suitable values for the penalty parameters and number of basis functions is crucial to obtaining good curve fitting and variable selection.The estimate of θ minimizes the penalized sum of squares L θ given by 2.17 .In this section, we express the model 2.15 as where A τ X τ , B τ and θ β , w .In many applications, the number of basis functions m needs to be large to adequately capture the trend.To determine the number of basis functions, all models with m ≤ m max are fitted and the preferred model minimizes some model selection criteria.
The Schwarz BIC is given by

2.42
Here Σ λ θ is defined by 2.44 in what follows.
We also consider the use of the BICp criterion to choose appropriate values for these unknown parameters.Denote 43

2.44
Let N 1 and N 2 be the number of zero components in β 0 and w 0 , respectively.Then the BICp criterion is BICp where J G θ is the d m 1 × d m 1 matrix of second derivatives of the penalized likelihood defined by Here Λ is a diagonal matrix with ith element Λ i diag e 1 , . . ., e n and 1 n 1, . . ., 1 .The n-dimensional vector q has ith element where T ij is the element in the ith row and jth column of T τ .Also K is the d m × d m matrix defined by Konishi and Kitagawa 15 proposed a framework of Generalized Information Criteria GIC to the case where the models are not estimated by maximum likelihood.Hence, we also consider the use of GIC for the model evaluations.The GIC for the hard thresholding penalty function is given by where I G is a m d 1 × m d 1 matrix.Also I G is basically the product of the empirical influence function and the score function.It is defined by The number of basis functions m, penalty parameters ξ, λ 1 , λ 2 are determined by minimizing BICm, BICp or GIC.

Sampling Properties
We now study the asymptotic properties of the estimate resulting from the penalized leastsquare function 2.27 .
First we establish the convergence rate of the penalized profile least-square estimator.To achieve the root n convergence rate, we have to take λ ij small enough so that a n O p n −1/2 .
Next we establish the oracle property for the penalized least-square estimator.Let β S 1 consist of all nonzero components of β 0 and let β N 1 consist of all zero components.Let w S 2 consist of all nonzero components of w 0 and let w N 2 consist of all zero components.Let

2.53
If

Numerical Simulations
We now assess the performance of semiparametric estimators of the proposed in previous section via simulations.We generate simulation data from the model where α t i exp −3 i/n sin 3πi/n , β 3, 1.5, 0, 0.2, 0, 0, 0 and ε t is a Gaussian AR 1 process with autoregressive coefficient ρ.We used the radial basis function network modeling to fit the trend component.We simulate the covariate vector x from a normal distribution with mean 0 and cov x i , x j 0.5 |i−j| .In each case, the autoregressive coefficient is set to 0, 0.25, 0.5 or 0.75 and the sample size n is set to 50, 100 or 200. Figure 1 depicts some examples of simulation data.
We compare the effectiveness of our proposed procedure PLS HT with an existing procedure PLS .We also compare the performance of the information criteria BICm, GIC and BICp for evaluating the models.As discussed in Section 3, the proposed procedure PLS HT excludes basis functions as well as explanatory variables.First we assess the performance of α t by the square root of average squared errors RASE α :

Journal of Probability and Statistics
where {t k , k 1, . . ., n grid } are the grid points at which the baseline function α • is estimated.In our simulation, we use n grid 200.Table 1 shows the means and standard deviations of RASE α for ρ 0, 0.25, 0.5, 0.75 based on 500 simulations.RASE α increases as the autoregressive coefficient increases but decreases as the sample size increases.From Table 1, we see that the proposed procedure PLS HT works better than PLS and that models evaluated by BICp work better than those based on BICm or GIC.
Then the performance of β is assessed by the square root of average squared errors RASE β :

3.3
The means and standard deviations of RASE β for ρ 0, 0.25, 0.5, 0.75 based on 500 simulations are shown in Table 2.We can see that the proposed procedure PLS HT works better than the existing procedure.There is almost no change in RASE β as the autoregressive coefficient changes unlike the procedure of You and Chen 10 , whereas RASE β depends strongly on the information, BICp works the best among the criteria.We can also confirm the consistency of the estimator, that is RASE β decreases as the sample size increases.The first step ahead prediction error PE , which is defined as is also investigated.Table 3 shows the means and standard errors of PE for ρ 0, 0.25, 0.5, 0.75 based on 500 simulations.The PE increases as the autoregressive coefficient increases, but the PE decreases as the sample size increases.From Table 3, we see that PLS HT works better than the existing procedures and there is almost no difference in the PE depending on the information criteria.The models evaluated by BICm perform well for large sample sizes.The means and standard deviations of the number and deviation of basis functions are shown in Tables 4 and 5.The BICp gives a smaller number of basis functions than the other information criteria.The models evaluated by BICp also give smaller standard deviations of the number of basis functions.The models determined by BICp tend to choose larger deviations of basis functions than those based on BICm and GIC.The number of basis functions increases gradually as the sample size or ρ increase.From Table 4, it appears that the number of basis functions does not depend on the sample size n.From Table 5, it also appears that the deviations of basis functions do not depend on the sample size n and ρ.
We now compare the performance of our procedure with existing procedures in terms of the reduction of model complexity.Table 6 shows simulation results of the means and standard deviations of the number of parameters excluded β 0 or w 0 by the proposed procedure.The results indicate that the proposed procedure reduces model complexity.From Table 6, It appears that the models determined by BICp tend to exclude fewer parameters and give smaller standard deviations for the number of parameters excluded.This is due to the selection of a smaller number of basis functions compared to the selection based on the other criteria see Table 4 .There is almost no dependence of the number of excluded parameters on ρ.The models evaluated by BICp give a larger number of excluded parameters as the sample size increases.On the other hand, the models evaluated by BICm or GIC give a smaller number of excluded parameters as the sample size increases.Table 7 shows the means and standard deviations of the number of basis functions excluded as w 0 by the proposed procedure.From Table 7 it appears that the models evaluated by BICp tend to exclude fewer basis functions than those based on GIC and BIC.Again this is due to the selection of a smaller number of basis functions see Table 4 .The models determined by BICp also give smaller standard deviations of the number of basis functions than the other criteria.There is almost no dependence of the number of basis functions on ρ.
Table 8 shows the means and standard deviations of the number of basis functions excluded as β 0 by the proposed procedure.The number of β which values really 0 was five.From Table 8 we see that the proposed procedure gives nearly five.The models determined by BICp give results more close to five and smaller standard deviations of the number of basis functions than the other criteria.The number of basis functions approaches five as the sample size increases.The standard deviations of the number of basis functions excluded decrease as ρ increases.These results indicate that the proposed procedure reduces model complexity.
Table 9 shows the percentage of times that various β i were estimated as being zero.As for the parameters β j / 0, j 1, 2, 5, these parameters were not estimated zero for every simulations, we omit to show the corresponding results on Table 9.The results indicate that the proposed procedure excludes insignificant variables and selects significant variables.It can be seen that the proposed procedure gives a better performance as the sample size increases and that BICp is superior to the other criteria.

Real Data Analysis
In this section we present the consequence of analyzing the real-time series data using proposed procedure.We use two data in this study; the data about the spirit consumption data in United Kingdom and the association between fertility and female employment in Japan.

The Spirit Consumption Data in the United Kingdom
We now illustrate our theory through an application to spirit consumption data for the United Kingdom from 1870 to 1938.The data-set can be found in Fuller 26, page 523 .In this dataset, the dependent variable y i is the logarithm of the annual per capita consumption of spirits.The explanatory variables x i1 and x i2 are the logarithms of per capita income and the price of spirits, respectively, and x i3 x i1 x i2 .Figure 2 shows that there is a change-point at the start of the First World War 1914 .Therefore, we prepare a variable z: z 0 from 1870 to 1914 and z 1 from 1915 to 1933.From this we derive another three explanatory variables: x i4 x i1 z, x i5 x i2 z, and x i6 x i3 z.We consider the semiparametric model: We computed the basis function estimate for α using the existing procedure PLS and the proposed procedure PLS HT with BICp.The resulting estimates and standard errors SEs of β are given in Table 10.The selected number of basis function is seven with one excluded basis function and the spread parameter s is estimated as 0.12.Therefore, we obtain the model The results indicate that the proposed procedure excludes some variable and reduces model complexity.Table 10 shows that PLS HT selects only β 4 and β 6 .That indicates possible interactions between consumption and income and between consumption and income×price after 1915.Consumption increases as income increases; however, as income increases and the price also increases, consumption decreases.We plot the estimated trend curve, residuals and autocorrelations functions in Figures 3 to 5 You and Chen 10 used the following semiparametric partially linear regression model: The semiparametric least-square SLS regression gives y i α t i 0.65x i1 − 0.95x i2 .The residual mean square is 2.2 × 10 −4 , which is more than that of our SGLSE fit.For a fair comparison, we use model 4.3 to revise You and Chen's estimation.Our semiparametric generalized least-square gives y i α t i − 0.71x i2 .The result indicates that x i1 is insignificant in model 4.3 .

The Association between Fertility and Female Employment in Japan
Recent literature finds that in OECD countries the cross-country correlation between the total fertility rate and the female labor force participation rate, which until the beginning of the 1980s had a negative value, has since acquired a positive value.See for example, Brewster and Rindfuss 27 , Ahn and Mira 28 , and Engelhardt et al. 29 .This result is often interpreted as evidence for a changing sign in the time series association between fertility and female employment within OECD countries.However, OECD countries, including Japan, have different cultural backgrounds.We investigate whether or not the relation between the total fertility rate TFR and the female labor force participation rate FLP has changed in Japan from a negative value to a positive value.This application challenges previous findings and could be good news for policy  makers, as a positive relationship implies that a rising FLP is associated with an increasing TFR.
Usually, FLP contains all women aged 15 to 64.However, TFR is a combination of fertility rates for ages 15-49, so we use the FLP of women aged 15 to 49 instead of women aged 15 to 64.We take the TFR from 1968 to 2007 in Japan.The estimation is a semiparametric regression of log TFR i on log FLP i .As the law of the Equal Employment Act came into force in 1985, we use the interaction variables "dummy for 1968-1984 × log TFR " x i2 and for 1985-2007 x i3 .We also use dummy variables for 1990-1999 and 2000-2007 x i4 , x i5 and consider the semiparametric model log TFR i α t i β x i ε i , i 1, . . ., 40.

4.4
We applied the existing procedure PLS and proposed procedure PLS HT with BIC p .The resulting estimates and standard errors SE of β are given in Table 11.Therefore, we obtain the model y i α t i − 0.27x i1 − 0.20x i2 , i 1, . . ., 40.

4.5
The residual mean square of PLS HT is 2.24 × 10 −2 and that of PLS is 2.47 × 10 −2 .The selected number of basis functions is six with one excluded basis function and the spread parameter s is estimated as 0.30.

Concluding Remarks
In this article we have proposed variable and model selection procedures for the semiparametric time series regression.We used the basis functions to fit the trend component.An algorithm of estimation procedures is provided and the asymptotic properties are investigated.From the numerical simulations, we have confirmed that the models determined by the proposed procedure are superior to those based on the existing procedure.They reduce the complexity of models and give good fitting by excluding basis functions and nuisance explanatory variables.
The development here is limited to the case with Gaussian stationary errors, but it seems likely our approach can be extended to the case with non-Gaussian long-range dependent errors, along with the lines explored in recent work by Aneiros-Pérez et al. 30 .However, the efficient estimation for regression parameter is an open question in case of long-range dependence.This is a question we hope to address in future work.We also plan to explore the question of whether the proposed techniques can be extended to the cointegrating regression models with an autoregressive distributed lag framework.

Appendix Proofs
In this appendix we give the proofs of the theorems in Section 2. We use x to denote the Euclidian norm of x.
Let a τ,n a 1,n , . . ., a τ,n be the infeasible estimator for a τ a 1 , . . ., a τ constructed using OLS methods.That is a τ,n a 1,n , a 2,n , . . ., a τ,n E τ E τ −1 E τ ε, where ε ε τ 1 , . . ., ε n and E τ ε i,j , i 1, . . ., n, j 1, . . ., τ with ε i,j ε i−j−τ .For ease of notation, we set a j,n a j,n 0 for j > τ, and a 0,n a 0,n 1.We write Γ k for cov ε 0 , ε k .Then we can construct the infeasible estimate V using a τ,n and Γ k , k 0, . . ., τ.The following lemma states that the estimators β and w given in Theorem 2.1 have asymptotically normal distributions.Proof of Theorem 2.1.Let the estimator a τ,n a 1,n , . . .a τ,n be the ordinary least-square estimate applied to model 2.18 .For the ease of notation, we set a j,n a j,n 0 for j > τ and a 0,n a 0,n 1.Then we write

Lemma .1. Under the assumptions of Theorem 2.1, one has
where A.20 From assumptions A.1 , A.2 , and Lemma .1 we can see that under the assumptions about τ and by the Caucy-Schwarz inequality A A.24 Notice that y i−j − β x i−j − w φ i−j e i,n .Since e i,n is a stationary and invertible process whose linear process coefficients satisfy the given summability assumptions, we have for some M > 0, From the above decomposition and evaluation, we can see that Therefore, in order to prove the second equation in Theorem 2.1 we just need to show A.27 To see the above results are true, let y τ,i be the ith element y τ of model 2.20 .We have for T τ,i the ith row of T τ , X τ,i the ith column of X τ , and B τ,i the ith column of B τ e i T τ,i ε e i τ j 1 Proof.We show that with probability tending to 1, as n → ∞, for any β 1 and w 1 satisfying , and w 2 ≤ C 2 n −1/2 , ∂l β, w /∂β j and β j have the same signs for β j ∈ −C 1 n −1/2 , C 1 n −1/2 , for j S 1 1, . . ., d.Also ∂l β, w /∂w j and w j have the same signs for w j ∈ −C 2 n −1/2 , C 2 n −1/2 , for j S 2 1, . . ., m.Thus the minimization is attained at β 2 w 2 0.

1 ,
Hart 2 and Herrmann et al. 3 .These results have been extended to the case with long-memory errors by Hall and Hart 4 , Ray and Tsay 5 ,

BIC n log 2π σ 2 e 2 e trS θ log n, 2 .41 where σ 2 en − 1 y
log n the number of parameters , 2.40where σ 2 e is the least-square estimate of σ 2 e without a degree of freedom correction.Hastie and Tibshirani 16 used the trace of the smoother matrix as an approximation to the effective number of parameters.By replacing the number of parameters in BIC by trS β , we formally obtain information criteria for the basis function Gaussian regression model in the form BICm n log 2π σ − S θ y 2 and

47 and|
K| and |Σ λ θ | are the product of the m−N 1 and d m−N 1 −N 2 nonzero eigenvalues of K and Σ λ θ , respectively.

Figure 1 :
Figure 1: Simulation data with a n 50 and ρ 0.5, b n 100 and ρ 0.5, c n 200 and ρ 0.5.The dotted lines represent α t ; the solid lines α t ε t .

Figure 2 :
Figure 2: Application data-set: a : y i log the annual per capita consumption of spirits ; b : x i1 log per capita income ; c : x i2 log price of spirits .

Figure 3 :Figure 4 :
Figure 3: Plots of estimated curves.The solid line represents y; the dotted lines are the estimates of y; the dashed lines are the estimated curve α.

Figure 7 :
Figure 7: Plots of estimated curves, the solid line represents y; the dotted lines are the estimated curves of y; the dashed lines are the estimated curves of α.

τ j 1 a√ mα 1n a 1n u α 2 1n b 1n u 2 2 n√ m b 1n C , dα 2n a 2n u α 2 n b 2n u 2 Lemma . 2 .
j − a j X i−j,i , B τ,ij T τ,j • B τ,•i B τa j B i−j,i ≡ B τ,ij τ j 1 a j − a j B i−j,i Cα Cα2 2n d b 2n C A.37 by the Taylor expansion and the Cauchy-Schwarz inequality.As b n → 0, the first term on the right side of A.34 will dominate A.35 and A.36 as well as the second term on the right side of A.34 , by taking C sufficiently large.Hence A.31 holds for sufficiently large C.This completes the proof of the theorem.Under the conditions of Theorem 2.3, with probability tending 1, for any given β and w, satisfying β 1 − β 10 w 1 − w 10 O p n −1/2 and any constant C, S β 1 , 0 , w 1 , 0 min β 2 ≤C 1 n −1/2 , w 2 ≤C 2 n −1/2 , S β 1 , β 2 , w 1 , w 2 .A.38
Assume that penalty functions p λ 1j • and p λ 2j • are negative and nondecreasing with p λ 1j 0 p λ 2j 0 0. Let β 0 and w 0 denote the true values of β and w, respectively.Also let j p λ 2j w j0 : w j0 / 0 .2.50 Theorem 2.2.Under the conditions of Theorem 2.1, if a 1n , b 1n , a 2n , and b 2n tend to 0 as n → ∞, then with probability tending to 1, there exist local minimizers β and w of L β, w such that β p n −1/2 a 1n and w HT SLOSE − w 0 O p n −1/2 a 2n .Theorem 2.2 demonstrates how the rate of convergence of the penalized least-square estimator θ SGLSE , w 'HT SGLSE of L θ depends on λ ij for i 1, 2.
w S 2 .HT SGLSE .Let w 1 consist of the first S 2 components of w and let w 2 consist of the last m − S 2 components of w HT SGLSE .Assume that the penalty functions p λ 1 |β j | and p λ 2 |w k | satisfy 2n |w 10 | sgn w 10 , . . ., p λ S 2 n |w S 2 0 | sgn w S 2 0 .2.52 Further, let β 1 consist of the first S 1 components of β and let β 2 consist of the last d − S 1 components of β

Table 3 :
Means standard deviations of first step ahead prediction errors.

Table 4 :
Means standard deviations of the number of basis functions.

Table 5 :
Means standard deviations of the deviations of basis functions.

Table 6 :
Means standard deviations of the number of parameters excluded.

Table 7 :
Means stnadard deviations of the number of basis functions excluded.

Table 8 :
Means standard deviations of the number of explanatory variables excluded.

Table 9 :
Percentage of times β i is estimated as zero.
Table 11 shows that PLS HT selects only log FLP i 1968-1984 and 1985-2007.That indicates a negative correlation between TFR and FLP for 1968-2007, especially for 1968-1984, which means TFR decreases as FLP increases.We could not see the positive association in 80s which has been reported in recent studies, see, for example, Brewster and Rindfuss 27 , Ahn and Mira 28 , and Engelhardt et al. 29 .We plot the estimated trend curve, residuals and autocorrelation functions in Figures 7 to 9.