A Penalized h-Likelihood Variable Selection Algorithm for Generalized Linear Regression Models with Random Effects

Reinforcement learning is one of the paradigms and methodologies of machine learning developed in the computational intelligence community. Recently, reinforcement learning algorithms have faced major challenges in settings with complex dynamics. From the perspective of variable selection, we often encounter situations where too many variables are included in the full model at the initial stage of modeling. For longitudinal data, likelihood inference is computationally challenging because it involves a high-dimensional and intractable integral, and computationally intensive methods can suffer from very slow convergence or even nonconvergence. Recently, hierarchical likelihood (h-likelihood) has come to play an important role in inference for models having unobservable or unobserved random variables. This paper focuses on linear models with random effects in the mean structure and proposes a penalized h-likelihood algorithm that incorporates variable selection procedures into mean modeling via h-likelihood. The penalized h-likelihood method avoids the messy integration for the random effects and is computationally efficient. Furthermore, it demonstrates good performance in selecting relevant variables. Through theoretical analysis and simulations, it is confirmed that the penalized h-likelihood algorithm produces good fixed effect estimation results and can identify zero regression coefficients in modeling the mean structure.


Introduction
Reinforcement learning is specified as trial and error (variation and selection and search) plus learning (association and memory) in Sutton and Barto [1]. Traditional variable selection procedures, such as LASSO in Tibshirani [2] and OMP in Cai and Wang [3], only consider the fixed effect estimates in linear models. However, in real life, much existing data involve both fixed effects and random effects. For example, in clinical trials, several observations are taken over a period of time for each particular patient. After collecting the data for all the patients, it is natural to include a random effect for each individual patient in the model, since a common error term for all the observations is not sufficient to capture the individual randomness. Moreover, random effects, which are not directly observable, are of interest in themselves if inference is focused on each individual's response. Therefore, to handle the random effects and obtain good estimates, Lee and Nelder [4] proposed hierarchical generalized linear models (HGLMs). HGLMs are based on the idea of h-likelihood, a generalization of the classical likelihood that accommodates the random components entering the model. It is preferable because it avoids the integration required for the marginal likelihood and uses the conditional distribution instead.
Inspired by the ideas of reinforcement learning and hierarchical models, this paper proposes a method that adds a penalty term to the h-likelihood. This method considers not only the fixed effects but also the random effects in the linear model, and it produces good estimation results with the ability to identify zero regression coefficients in joint models of mean-covariance structures for high-dimensional multilevel data. The rest of this paper is organized as follows: Section 2 provides a literature review of current variable selection methods based on partial linear models and h-likelihood. Section 3 presents a penalty-based h-likelihood variable selection algorithm and demonstrates via simulation that the proposed algorithm exhibits the desired sample properties and can be useful in practical applications. Finally, Section 4 concludes the paper and gives some future research directions.

Reinforcement Learning in the Perspective of Nonlinear Systems.
Reinforcement learning, one of the most active research areas in artificial intelligence, is introduced and defined in Sutton and Barto [1] as a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In addition, in Sutton and Barto [5], reinforcement learning is specified as trial and error (variation and selection and search) plus learning (association and memory). Furthermore, Barto and Mahadevan [6] propose hierarchical control architectures and associated learning algorithms. Approaches to temporal abstraction and hierarchical organization, which mainly rely on the theory of semi-Markov decision processes, are reviewed and discussed in Barto and Mahadevan's paper [6]. Recent works, such as Dietterich [7], have focused on hierarchical methods that incorporate subroutines and state abstractions, instead of solving "flat" problem spaces.
Nonlinear control design has attracted attention in the research community for a long time. In the industrial field, the controlled system usually exhibits strong nonlinearity. Various adaptive optimal control models have been applied to the identification of nonlinear systems in the past literature. In fact, the two important fundamental principles of controller design are optimality and veracity. He et al. [8] study a novel policy iterative scheme for the design of online H ∞ optimal control laws for a class of nonlinear systems and establish the convergence of the scheme to the optimal control law. He et al. [9] investigate an online adaptive optimal control problem for a class of continuous-time Markov jump linear systems (MJLSs) by using a parallel reinforcement learning (RL) algorithm with completely unknown dynamics. A novel parallel RL algorithm is proposed, and the convergence of the proposed algorithm is shown. Wang et al. [10] study a new online adaptive optimal controller design scheme for a class of nonlinear systems with input time delays. An online policy iteration algorithm is proposed, and the effectiveness of the proposed method is verified. He et al. [11] propose an online adaptive optimal controller design for a class of nonlinear systems through a novel policy iteration (PI) algorithm. Cheng et al. [12] investigate the observer-based asynchronous fault detection problem for a class of nonlinear Markov jump systems and introduce a hidden Markov model to ensure that the observer modes run synchronously with the system modes. Cheng et al. [13] propose a finite-time asynchronous output feedback control scheme for a class of Markov jump systems subject to external disturbances and nonlinearities.

Partial Linear Models.
Linear models have been widely used in the literature. One extension of linear models, introduced by Nelder and Wedderburn [14], is generalized linear models (GLMs). GLMs allow the class of distributions to be expanded from the normal distribution to the one-parameter exponential families. In addition, GLMs generalize linear regression in two ways: first, GLMs allow the linear model to be related to the response variable via a link function, or equivalently a monotonic transform of the mean, rather than the mean itself. Second, GLMs allow the magnitude of the variance of each measurement to be a function of its predicted value.
In contrast, Laird and Ware [15] propose linear mixed effect models (LMEs), which are widely used in the analysis of longitudinal and repeated measurement data. Linear mixed effect models have gained popularity since they take into consideration within-cluster and between-cluster variations simultaneously. Vonesh and Chinchilli [16] have investigated and applied statistical estimation as well as inference for this class of LME models. However, the model selection problem in LME models has often been ignored. This disregarded problem was noticed and pointed out by Vaida and Blanchard [17], who state that when the focus is on clusters instead of the population, traditional selection criteria such as AIC and BIC are not appropriate. In the paper of Vaida and Blanchard [17], the conditional AIC is proposed for mixed effects models, with detailed discussion of how to define degrees of freedom in the presence of random effects. Furthermore, Pu and Niu [18] study the asymptotic behavior of the proposed generalized information criterion method for selecting fixed effects. In addition, Rajaram and Castellani [19] use ordinary differential equations and the linear advection partial differential equation (PDE) and introduce a case-based density approach to modeling big data longitudinally.
Recently, Fan and Li [20] have developed a class of variable selection procedures for both fixed effects and random effects in linear mixed effect models by incorporating the penalized profile likelihood method. By this regularization method, both fixed effects and random effects can be selected and estimated. There are two outstanding aspects of Fan and Li's [20] method. First of all, the proposed procedures can estimate the fixed effects and random effects separately. In other words, the fixed effects can be estimated without the random effects being estimated, and vice versa. In addition, the method works in the high-dimensional setting by allowing the dimension of the random effects to grow exponentially with the sample size.
Combining the ideas of generalized linear models (GLMs) and linear mixed effect (LME) models, one extension, generalized linear mixed models (GLMMs), has been developed. In traditional GLMs, it is assumed that the observations are uncorrelated. To relax this constraining assumption, GLMMs allow for correlation between observations, which often arises in longitudinal data and clustered designs. The advantages of GLMMs are as follows: first of all, GLMMs allow random effects to be included in the linear predictor. As a result, the correlations between observations can be explained through an explicit probability model. Second, when the focus is on estimating the fixed effects for a particular individual, GLMMs provide good subject-specific parameter estimates. However, since GLMMs are also multilevel models, fitting them is generally more computationally intensive.

So far, all these GLMs and GLMMs are well-established parametric regression models. A serious disadvantage of parametric modeling is that a parametric model may be too restrictive in some applications. To overcome this restrictive assumption in parametric regression, nonparametric regression has gained popular attention in the literature. There are many nonparametric and smoothing methods, such as kernel smoothing, local polynomial fitting, and penalized splines. In this section, two often-used smoothing methods for estimating a nonparametric model are described in the following paragraphs, since they are used later in simulations and applications.

The first type is called local linear kernel smoothing. The main idea of local linear kernel smoothing is to approximate the function f locally by a linear function. Local linear kernel smoothing uses Taylor expansion as a fundamental tool. In particular, Taylor expansion states that any smooth function can be locally approximated by a polynomial of some degree.
Suppose we have the simple nonparametric model y_i = f(t_i) + ε_i for i = 1, . . . , n. Let t0 be an arbitrary fixed point where the function f is estimated. Assume f(t) has a first-order continuous derivative at t0. Then, by Taylor expansion, f(t) can be locally approximated by f(t) ≈ f(t0) + f^(1)(t0)(t − t0) in a neighborhood of t0 that allows the above expansion, where f^(1)(t0) denotes the first derivative of f(t) at t0.
Let α0 = f(t0) and α1 = f^(1)(t0). The local linear smoother is obtained by fitting the data set locally with a linear function, minimizing the following weighted least squares criterion: Σ_{i=1}^{n} [y_i − α0 − α1(t_i − t0)]^2 K_h(t_i − t0), where K_h(.) = K(./h)/h is obtained by rescaling a kernel function K(.) with a positive constant bandwidth h. The primary purpose of the bandwidth h is to specify the size of the local neighborhood [t0 − h, t0 + h] where the local fitting is conducted. Moreover, the kernel function K(.) determines how observations within the neighborhood contribute to the fit at t0. A detailed introduction of the kernel function is provided in the later paragraphs.
The local linear smoother f̂_h(t0) = α̂0 can be expressed as a weighted average f̂_h(t0) = Σ_{i=1}^{n} w_i(t0) y_i, where the weights w_i(t0) depend only on the kernel K, the bandwidth h, and the design points t_1, . . . , t_n. A local linear smoother is often good enough for most problems if the kernel function K(.) and the bandwidth h are adequately determined. Moreover, it enjoys many good properties that other linear smoothers may lack. Fan [21], Fan and Gijbels [22], and Hastie and Loader [23] separately discuss those good properties in detail.

The kernel function K(.) used in the local linear smoother is a symmetric probability density function. The kernel K(.) specifies how the observations contribute to the local linear kernel fit at t0, whereas the bandwidth h specifies the size of the local neighborhood [t0 − h, t0 + h]. Widely used kernel functions include the uniform kernel K(t) = 1/2 for |t| ≤ 1 and the Gaussian kernel K(t) = (2π)^{−1/2} exp(−t^2/2). Suppose, for instance, the uniform kernel is used. All the t_i's within the neighborhood [t0 − h, t0 + h] contribute equally, or equivalently receive the same weights, in the local linear kernel fit at t0; in contrast, all the t_i's outside the neighborhood [t0 − h, t0 + h] contribute nothing. Suppose, for another example, the Gaussian kernel is used. The contribution of each t_i is determined by its distance from t0. In other words, a smaller distance (t_i − t0) results in a larger contribution, since the Gaussian kernel is a bell-shaped curve that peaks at the origin.

The second type of smoothing is called regression spline smoothing. In local linear kernel smoothing introduced above, local neighborhoods were defined by a bandwidth h and a fixed point t0. In contrast, in regression spline smoothing, local neighborhoods are defined by a group of locations known as knots, τ_1 < · · · < τ_K, which are referred to as interior knots or simply knots. The knots divide the range of t into local neighborhoods, and within any two neighboring knots, a Taylor expansion up to some degree is applicable. A regression spline can be constructed in terms of a truncated power basis.
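As a concrete illustration, the weighted least squares fit above can be sketched as follows. This is a minimal sketch assuming a Gaussian kernel; the function name `local_linear_smoother` is ours, not the paper's.

```python
import numpy as np

def local_linear_smoother(t, y, t0, h):
    """Fit y_i ~ a0 + a1*(t_i - t0) by weighted least squares with
    kernel weights K_h(t_i - t0); return f_hat(t0) = a0_hat."""
    d = t - t0
    # Gaussian kernel, rescaled: K_h(.) = K(./h)/h
    w = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    X = np.column_stack([np.ones_like(d), d])   # design for (a0, a1)
    W = np.diag(w)
    coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return coef[0]
```

Evaluating the function on a grid of points t0 traces out the full curve estimate f̂_h(t).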
As mentioned earlier, there are K knots τ_1, . . . , τ_K, and the k-th degree truncated power basis can be expressed as {1, t, . . . , t^k, (t − τ_1)_+^k, . . . , (t − τ_K)_+^k}, where a_+^k denotes the k-th power of the positive part of a, with a_+ = max(0, a). In most of the literature, this is called the "constant, linear, quadratic, or cubic" truncated power basis when k = 0, 1, 2, and 3, respectively. For the purposes of this paper, the cubic truncated power basis is used in the subsequent sections on simulations and applications.
We still consider the abovementioned simple nonparametric model y_i = f(t_i) + ε_i for i = 1, . . . , n. Following convention, denote the truncated power basis by Φ(t) = (1, t, . . . , t^k, (t − τ_1)_+^k, . . . , (t − τ_K)_+^k)^T, where p = K + k + 1 is the number of basis functions involved. Then, the regression fit of the function f(t) in the nonparametric model can be expressed as f̂(t) = Φ(t)^T α̂, where α̂ is the vector of least squares coefficients.

To sum up, parametric models are very useful for longitudinal data analysis since they provide a clear and easy description of the relationship between the response variable and its covariates. However, in much data analysis, a parametric model does not fit the data well, resulting in biased estimates. To overcome the restrictive assumptions of parametric forms, various nonparametric models, such as nonparametric mixed effects models, have been proposed for longitudinal data; see, for example, Fan and Zhang [24] and Wu and Rice [25], among others. There is always a trade-off between model assumptions and model complexity. Parametric models are less robust against model assumptions, but they are efficient when the models are correctly specified. In contrast, nonparametric models are more robust against model assumptions, but they are less efficient and more complex. A trade-off between efficiency and complexity via an information measure is fully investigated and discussed in Caves and Schack [26]. Zhang et al. [27] propose an improved K-means clustering algorithm, called the covering K-means algorithm (C-K-means).
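The truncated power basis and the resulting regression spline fit can be sketched as follows; this is a minimal sketch, and the function names are our own illustrative choices.

```python
import numpy as np

def truncated_power_basis(t, knots, k=3):
    """Columns: 1, t, ..., t^k, (t - tau_1)_+^k, ..., (t - tau_K)_+^k,
    so p = K + k + 1 basis functions in total."""
    t = np.asarray(t, dtype=float)
    cols = [t ** j for j in range(k + 1)]
    cols += [np.maximum(t - tau, 0.0) ** k for tau in knots]
    return np.column_stack(cols)

def spline_fit(t, y, knots, k=3):
    """Least squares regression spline fit of f(t) in the basis above."""
    B = truncated_power_basis(t, knots, k)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ coef
```

Since the polynomial part of the cubic basis already spans quadratics, a function such as f(t) = t^2 is reproduced essentially exactly by the fit.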
There are two advantages of the C-K-means algorithm. First of all, it achieves efficient and accurate clustering results under both sequential and parallel conditions. Furthermore, it self-adaptively provides a reasonable number of clusters based on the data features.
Semiparametric models arise from the need to compromise and retain the good features of both parametric and nonparametric models. A semiparametric model has two essential components: a parametric component and a nonparametric component. More specifically, the parametric component is often used to model important factors that affect the responses parametrically, whereas the nonparametric component is often used for less important and nuisance factors. Various semiparametric models for longitudinal data include semiparametric population mean models, proposed in Martinussen and Scheike [28] and Xu [29], among others, and semiparametric mixed effects models, studied by Zeger and Diggle [30], Groll and Tutz [31], and Heckman et al. [32]. For the purposes of this paper, we restrict our attention to partially linear regression models.

h-Likelihood.
In longitudinal studies, there are two types of models: marginal models and conditional models. By definition, marginal models are usually referred to as population-average models, since they ignore cluster random effects. In contrast, conditional models include random effects and are subject-specific models. The main difference between marginal and conditional models is whether the regression coefficients describe an individual's response or the marginal response to changing covariates; in other words, whether changing covariates attempts to control for unobserved subject random effects. Diggle et al. [33] suggest the random effect model for inferences about individual responses and the marginal model for inferences about population margins.
The idea of h-likelihood was introduced by Lee and Nelder [4]. h-likelihood is an extension of the Fisher likelihood to GLMs with additional random effects in the linear predictor. The concept of h-likelihood is intended for inference about unobserved random variables. In fact, h-likelihood is a special kind of extended likelihood, in which the random effect parameter is specified on a scale satisfying certain conditions, as we discuss in more detail later. At the same time, hierarchical generalized linear models (HGLMs) were introduced along with the idea of h-likelihood in Lee and Nelder's [4] paper. This class of hierarchical GLMs allows various distributions for the random component. In addition, these distributions are conjugate to the distribution of the response y. Four conjugate HGLMs were introduced in [4], namely, normal-normal, Poisson-gamma, binomial-beta, and gamma-inverse gamma (Table 1). If we let y be the response and u be the unobserved random component, then v is the scale on which the random effect u enters linearly in the linear predictor. In other words, u and v are linked via some strictly monotonic function.
Consider the hierarchical model in which y|v and v follow arbitrary distributions listed in Table 1. The h-likelihood, denoted by l_h, is defined as l_h = l(β, ϕ; y|v) + l(α; v), where l(α; v) is the log likelihood function of v given parameter α and l(β, ϕ; y|v) is that of y|v given parameters β and ϕ. One point to note is that the h-likelihood is not a traditionally defined likelihood, since v is not directly observable. In the traditional maximum likelihood estimation for models with random effects, the method is based on the marginal likelihood as the objective function. In this marginal likelihood approach, the random effects v are integrated out, and what remain in the maximized function are the fixed effects β and the dispersion parameter ϕ. There are two disadvantages of the marginal likelihood approach. First of all, the integration over v is typically intractable. In addition, the random effects are not estimable after integration. In contrast, the h-likelihood approach avoids such intractable integration. In fact, as clearly stated by Lee and Nelder [4], "we can treat the h-likelihood as if it were an orthodox likelihood for the fixed effects β and random effects v, where the v are regarded as fixed parameters for realized but unobservable values of the random effects." Furthermore, the h-likelihood yields a fixed effect estimator that is asymptotically as efficient as the marginal maximum likelihood estimator. Last but not least, the maximum h-likelihood estimates are derived by solving the two equations ∂l_h/∂β = 0 and ∂l_h/∂v = 0 simultaneously. An outstanding property expected of likelihood inference is invariance with respect to transformations. For maximum h-likelihood estimates, the estimates of the random effects are invariant with respect to transformations of the random components u.
Furthermore, Lee and Nelder [4] introduced the adjusted profile h-likelihood, defined as p_v(l_h) = [l_h − (1/2) log det{D(l_h)/(2π)}] evaluated at v = v̂, where D(l_h) = −∂²l_h/∂v∂v^T. It eliminates the nuisance effects v from the h-likelihood, and the D(l_h) part is often referred to as the adjustment term for this elimination. In fact, this adjusted profile h-likelihood, which is used for the estimation of dispersion components, acts as an approximation of the marginal likelihood without integrating v out. There are a few outstanding contributions in Lee and Nelder's [4] publication. First of all, it widens the choice of random effect distributions in mixed generalized linear models. In addition, it introduces the h-likelihood as a device for estimation and prediction in hierarchical generalized linear models. Compared with the traditional marginal likelihood, the h-likelihood avoids the messy integration for the random effects and hence is convenient to use. Furthermore, maximum h-likelihood estimates are obtained by iteratively solving equation (14). To conclude, the h-likelihood is used for inference about the fixed and random effects given the dispersion parameter ϕ.
In contrast, Lee and Nelder [34] demonstrated the use of an adjusted profile h-likelihood for inference about the dispersion components given the fixed and random effects. In that paper, the focus is on the joint modeling of the mean and dispersion structures. An iterative weighted least squares (IWLS) algorithm is used to estimate both the fixed and random effects via the extended likelihood, and the dispersion parameters via the adjusted profile likelihood. Later, in [35], the algorithm was modified by replacing the extended likelihood with the first-order adjusted profile likelihood for estimating fixed effects in the mean structure.
Lee and Nelder [36] proposed a class of double hierarchical generalized linear models in which random effects can be specified for both the mean and dispersion. Compared with HGLMs, double hierarchical generalized linear models allow heavy-tailed distributions to be present in the model. Random effects are introduced in the dispersion model to solve heteroscedasticity between clusters.
Then, h-likelihood is applied for statistical inference and efficient algorithms, as the synthesis of the inferential tools. In addition, Lee and Noh [37] proposed a class of double hierarchical generalized linear models in which random effects can be specified for both the mean and dispersion, allowing models with heavy-tailed distributions and providing robust estimation against outliers. Greenlaw and Kantabutra [38] address the parallel complexity of hierarchical clustering. Instead of the traditional sequential algorithms, the top-down algorithm described in Greenlaw and Kantabutra [38] is parallelized and runs in O(log n) time.
In conclusion, for both hierarchical generalized linear models (HGLMs) and double hierarchical generalized linear models (DHGLMs), h-likelihood plays an important role in inference for models having unobservable or unobserved random variables. Furthermore, numerical studies have shown that h-likelihood gives statistically efficient estimates for HGLMs as well as DHGLMs. In addition, Noh and Lee [39] have shown that the h-likelihood procedure outperforms existing methods, including MCMC-type methods, in terms of bias. Last but not least, compared with the traditional marginal likelihood, the h-likelihood avoids the messy integration for the random effects and hence is convenient to use. Therefore, the h-likelihood method is worth attention.

Model Setup.
Suppose that we have k independent groups and each group contains m subjects. Let y_ij be the j-th subject of group i, where i = 1, . . . , k and j = 1, . . . , m. Based on the idea of modeling the mean structure in the HGLM framework, we consider a partial linear model for the conditional mean μ_ij = E(y_ij | u_i): g(μ_ij) = x_ij^T β + f(t_ij) + v_i, where f(.) is an unknown smooth function of t, t_ij is a univariate explanatory variable taking values in [0, 1] for simplicity, g(.) is the canonical link function for the conditional distribution of y_ij, and x_ij is a p × 1 covariate vector with β as the associated coefficients. In matrix representation, g(μ) = Xβ + f(t) + Zv, where Z is the corresponding random effects design matrix. We assume that, conditional on the random components u_i, the responses y_ij are from an exponential family with mean E(y_ij | u_i) = μ_ij and variance var(y_ij | u_i) = ϕV(μ_ij). We also assume that (X^T, t)^T and ε are independent. The random effects v_i appearing in the mean model are linked to u_i on the scale that allows the definition of h-likelihood given in Lee and Nelder [4]. In this paper, the identity link v_i = u_i is used, and hence this canonical scale corresponds to the case where the conditional distribution of the response y is normal, i.e., y_ij | v_i ∼ N(μ_ij, ϕ). For simplicity, random effects are considered in the form of a random intercept throughout this paper. If a random intercept is not sufficient to represent the variation exhibited in the data, the model can easily be extended to a more general form with a more complex random effects structure.

Estimation Procedure via Penalized h-Likelihood
Thus, the log of the h-likelihood is l_h(β, v) = l(β, ϕ; y|v) + l(α; v). For the purposes of this paper, the first and second derivatives of l_h(β, v) with respect to β and v are derived. The maximum likelihood estimate of the random effects v is obtained by setting ∂l_h(β, v)/∂v to zero. Then, an approximated likelihood for the fixed effects can be obtained by plugging the estimate v̂ into l_h(β, v). In addition, the marginal likelihood is approximated by the adjusted profile likelihood p_v(l_h)(β) = [l_h(β, v) − (1/2) log det{D(l_h(β, v))/(2π)}] evaluated at v = v̂, where D(l_h(β, v)) = −∂²l_h(β, v)/∂v∂v^T. Now the problem arises of how to estimate the smooth function f(t). In this paper, we use two nonparametric approaches to estimate f(t): the local linear regression technique and the spline technique.
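For the normal-normal special case used later in the simulations, solving ∂l_h/∂β = 0 and ∂l_h/∂v = 0 jointly reduces to one augmented (mixed-model) linear system. The sketch below assumes known dispersion components σ_u² and ϕ and ignores the nonparametric part f(t); the function name and interface are ours.

```python
import numpy as np

def max_h_likelihood(X, y, group, sigma2_u, phi, n_groups):
    """Jointly solve dl_h/dbeta = 0 and dl_h/dv = 0 for the normal-normal
    model y_ij = x_ij' beta + v_i + e_ij, v_i ~ N(0, sigma2_u),
    e_ij ~ N(0, phi), via one augmented least squares system."""
    n, p = X.shape
    Z = np.zeros((n, n_groups))
    Z[np.arange(n), group] = 1.0                  # random-intercept design
    A = np.column_stack([X, Z])
    H = A.T @ A / phi
    H[p:, p:] += np.eye(n_groups) / sigma2_u      # penalty from log N(v; 0, sigma2_u)
    sol = np.linalg.solve(H, A.T @ y / phi)
    return sol[:p], sol[p:]                       # (beta_hat, v_hat)
```

The ridge-like block I/σ_u² is exactly the contribution of l(α; v) to the second derivative, which is why the random effect predictions come out shrunk toward zero.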
In the framework of penalized variable selection, we apply a penalty to the approximated marginal likelihood, so that l_p(β) = p_v(l_h)(β) − n Σ_{j=1}^{p} P_λ(|β_j|), where P_λ(.) is the penalty function with tuning parameter λ.
Our aim is to maximize l_p(β) and obtain the maximum likelihood estimates of the fixed effects β. We give brief theoretical support for the derivation of the estimates in the following paragraphs. First of all, L_1 penalty functions are singular at the origin and do not have continuous second-order derivatives. However, they can be locally approximated by a quadratic function, as follows. Assume that we are given an initial value β_0 that is close to the maximizer of l_h(β). If β_j0 is very close to 0, then set β̂_j = 0. Otherwise, the penalty can be locally approximated by a quadratic function via [P_λ(|β_j|)]′ = P_λ′(|β_j|) sgn(β_j) ≈ {P_λ′(|β_j0|)/|β_j0|} β_j when β_j ≠ 0; in other words, P_λ(|β_j|) ≈ P_λ(|β_j0|) + (1/2){P_λ′(|β_j0|)/|β_j0|}(β_j² − β_j0²) for β_j ≈ β_j0. A drawback of this approximation is that once a coefficient is shrunk to zero, it stays at zero. Furthermore, note that the first two derivatives of the log h-likelihood function l_h(β, v) are continuous. Around a given point β_0, the log h-likelihood function can be approximated by l(β) ≈ l(β_0) + ∇l(β_0)^T(β − β_0) + (1/2)(β − β_0)^T ∇²l(β_0)(β − β_0). Similarly, l_p(β) can be locally approximated by the quadratic function l_p(β) ≈ C + ∇l(β_0)^T(β − β_0) + (1/2)(β − β_0)^T ∇²l(β_0)(β − β_0) − (n/2) β^T Σ_λ(β_0) β, where C is a constant term, ∇l(β_0) = ∂l(β_0)/∂β, ∇²l(β_0) = ∂²l(β_0)/∂β∂β^T, and Σ_λ(β_0) = diag{P_λ′(|β_10|)/|β_10|, . . . , P_λ′(|β_p0|)/|β_p0|}. This quadratic maximization problem yields the iterative solution β_1 = β_0 − {∇²l(β_0) − nΣ_λ(β_0)}^{-1}{∇l(β_0) − nΣ_λ(β_0)β_0}. When the algorithm converges, the estimator satisfies the penalized likelihood equation for the nonzero elements of β. As stated in Fan and Li [20], in the maximum likelihood estimation (MLE) setting, with a good initial value β_0, the one-step procedure can be as efficient as the fully iterative procedure when the Newton-Raphson algorithm is used. Thus, if we have a good initial value for β, the very next iteration can be regarded as a one-step procedure, and the resulting estimator can be as efficient as the fully iterative method.
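To make the update concrete, here is a minimal local quadratic approximation (LQA) sketch for a Gaussian log-likelihood l(β) = −||y − Xβ||²/2 (so ∇²l = −X'X), for which the Newton step above reduces to a ridge-type solve on the active set. The function name, thresholds, and iteration cap are our choices, not the paper's.

```python
import numpy as np

def lqa_adaptive_lasso(X, y, lam, w, n_iter=100, zero_tol=1e-4):
    """Iterate beta1 = (X'X + n * Sigma_lambda(beta0))^{-1} X'y, where
    Sigma_lambda(beta0) = diag(P'(|b_j0|)/|b_j0|) for the adaptive lasso
    penalty P(|b_j|) = lam * w_j * |b_j|.  Coefficients shrunk below
    zero_tol are set to zero and stay there (the known LQA drawback)."""
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # initial value beta0
    for _ in range(n_iter):
        active = np.abs(beta) > zero_tol
        if not active.any():
            break
        Xa = X[:, active]
        S = n * lam * np.diag(w[active] / np.abs(beta[active]))
        beta_new = np.zeros_like(beta)
        beta_new[active] = np.linalg.solve(Xa.T @ Xa + S, Xa.T @ y)
        if np.max(np.abs(beta_new - beta)) < 1e-10:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) <= zero_tol] = 0.0
    return beta
```

Each pass re-weights the ridge penalty by 1/|β_j0|, so small coefficients are driven toward zero geometrically while large ones are barely penalized.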

Variable Selection via the Adaptive Lasso Penalty.
There are many penalized likelihood variable selection criteria available in the literature on penalized approaches, such as the lasso penalty and SCAD. In this paper, we focus on the adaptive lasso penalty, introduced by Zou [40]. The adaptive lasso penalty function takes the form P_λ(β) = λ Σ_{j=1}^{p} w_j |β_j|, where w is a known weight vector and λ is the tuning parameter satisfying λ > 0. It has been shown that if the weights are data-dependent and cleverly chosen, the weighted lasso can achieve the oracle properties; in other words, it performs as well as if the true underlying model were known in advance. This is the main reason for our choice of penalty function. In addition, the adaptive lasso is less complicated than the smoothly clipped absolute deviation (SCAD) penalty introduced by Fan and Li [20] and hence is easier to implement.
For the choice of the data-dependent weight vector w, we use the hierarchical generalized linear model estimate β̂_hglm. Specifically, the weights are taken as ŵ_j = 1/|β̂_hglm,j|.
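A small sketch of the penalty and the data-dependent weights; `beta_pilot` stands in for the unpenalized HGLM estimate β̂_hglm, and both function names are our own illustrative choices.

```python
import numpy as np

def adaptive_weights(beta_pilot):
    """w_j = 1 / |beta_pilot_j|: finite where the pilot estimate is away
    from zero, diverging where the true coefficient is zero (since a
    consistent pilot estimate then tends to zero)."""
    return 1.0 / np.abs(beta_pilot)

def adaptive_lasso_penalty(beta, lam, w):
    """P_lambda(beta) = lam * sum_j w_j * |beta_j|, with lam > 0."""
    return lam * np.sum(w * np.abs(beta))
```

The diverging weights are what let the weighted lasso penalize true zeros hard enough for oracle-type selection while leaving large coefficients nearly unbiased.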

As the sample size grows, the weights for zero-coefficient estimators tend to infinity, whereas the weights for nonzero coefficients converge to a finite constant. A significant part of our proposed method is the variable selection carried out through an appropriate penalty function. As a result, the choice of the tuning parameter λ in the penalty function becomes important. The most popular methods in the literature for choosing such tuning parameters are K-fold cross-validation and generalized cross-validation. In fact, the selection consistency of various shrinkage methods relies on an appropriate choice of the tuning parameters, and the generalized cross-validation (GCV) method has been widely used in the past literature. Therefore, we adopt the traditional generalized cross-validation method for the choice of the tuning parameter. In particular, suppose we have the fitted values Ŷ = HY for a linear method under squared error; then the standard formula for generalized cross-validation is GCV(λ) = (1/n) Σ_{i=1}^{n} [(y_i − ŷ_i)/(1 − tr(H)/n)]². Then, we choose the tuning parameter λ that minimizes the GCV.
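The selection rule above can be sketched as follows for any linear smoother with hat matrix H; the candidate grid and function names are illustrative, not the paper's.

```python
import numpy as np

def gcv_score(y, H):
    """GCV = (1/n) * sum_i [(y_i - yhat_i) / (1 - tr(H)/n)]^2
    for a linear fit yhat = H y with hat matrix H."""
    n = len(y)
    resid = y - H @ y
    denom = 1.0 - np.trace(H) / n
    return np.mean((resid / denom) ** 2)

def select_lambda(y, hat_matrices):
    """Return the candidate tuning parameter whose hat matrix minimizes
    GCV; hat_matrices maps each lambda to its hat matrix."""
    return min(hat_matrices, key=lambda lam: gcv_score(y, hat_matrices[lam]))
```

The denominator 1 − tr(H)/n charges each fit for its effective degrees of freedom, so a heavily overfit (large tr(H)) smoother is not rewarded for its small residuals.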

Computational Algorithm.
We propose the following h-likelihood algorithm (Algorithm 1) to implement the method discussed in this paper. The computational cost of the proposed penalized h-likelihood algorithm is of order O(np²), where n is the sample size and p is the number of associated coefficients in equation (16). The efficient path algorithm makes the proposed penalized h-likelihood algorithm an attractive method for real applications. In particular, if we have a good initial value for β, the very next iteration can be regarded as a one-step procedure, and the resulting estimator can be as efficient as the fully iterative method.

Simulation Studies.
To assess the finite sample performance of our proposed method, we conduct several simulation studies. All simulations are conducted in R. Our models have the form y_ij = x_ij^T β + f(t_ij) + v_i + ε_ij, with v_i ∼ N(0, σ_u²) and ε_ij ∼ N(0, ϕ). It is assumed throughout this paper that σ_u² = 0.2 and ϕ = 1. In addition, the distribution of the response y_ij conditional on the random component v_i is also assumed to be normal. To generate the covariates x_ij = (x_ij1, . . . , x_ij10)^T for the model, we draw random samples from a multivariate normal distribution N(0, Σ), where the covariance matrix Σ is assumed to have an AR(1) structure with σ² = 1 and ρ = 0.5. The correlation parameter ρ is fixed here since, after trying several values of ρ ∈ [0.1, 0.9], we found the choice of correlation to have little impact on the resulting penalized estimates of β. Furthermore, the t_ij are simulated from a uniform [0, 1] distribution. We conduct the simulation studies through several examples. For each case, we run a simulation study over 100 simulated datasets.
Furthermore, for the nonparametric part of the model, we use three different functions for simulation purposes: f(t) = exp(0.1t), f(t) = sin(0.1πt), and f(t) = t². Both f(t) = exp(0.1t) and f(t) = t² are nonlinear, increasing functions, whereas f(t) = sin(0.1πt) represents a nonlinear and nonmonotonic function.
In order to examine the finite sample performance of our proposed method, we run simulations based on the following six examples.

Example 1. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by i = 1, . . . , 100 and j = 1, . . . , 10. The size of the true model is d₀ = 5, with the true values of the parameters set to β = (7.7, 4.6, 3.8, 2.9, 5.3, 0, 0, 0, 0, 0)^T. In addition to the linear component, the nonparametric component is f(t) = t².

Example 2. Similar to Example 1 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by i = 1, . . . , 100 and j = 1, . . . , 5. The size of the true model is d₀ = 5, with the true values of the parameters set to β = (7.7, 4.6, 3.8, 2.9, 5.3, 0, 0, 0, 0, 0)^T. In addition to the linear component, the nonparametric component is f(t) = t².

Example 3. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by i = 1, . . . , 100 and j = 1, . . . , 10. The size of the true model is d₀ = 3, with the true values of the parameters set to β = (2, 1, 3, 0, 0, 0, 0, 0, 0, 0)^T. In addition to the linear component, the nonparametric component is f(t) = exp(0.1t).

Example 4. Similar to Example 3 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by i = 1, . . . , 100 and j = 1, . . . , 5. The size of the true model is d₀ = 3, with the true values of the parameters set to β = (2, 1, 3, 0, 0, 0, 0, 0, 0, 0)^T. In addition to the linear component, the nonparametric component is f(t) = exp(0.1t).

Example 5. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by i = 1, . . . , 100 and j = 1, . . . , 10. The size of the true model is d₀ = 3, with the true values of the parameters set to β = (2, 1, 3, 0, 0, 0, 0, 0, 0, 0)^T. In addition to the linear component, the nonparametric component is f(t) = sin(0.1πt).

Example 6. Similar to Example 5 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by i = 1, . . . , 100 and j = 1, . . . , 5. The size of the true model is d₀ = 3, with the true values of the parameters set to β = (2, 1, 3, 0, 0, 0, 0, 0, 0, 0)^T. In addition to the linear component, the nonparametric component is f(t) = sin(0.1πt).

We simulate each random effect v_i from a normal distribution with mean 0 and variance σ_u² = 0.2, and we simulate the t_ij from a uniform [0, 1] distribution. We then obtain the smooth function f(t) by plugging in the values of t_ij. Once we have the random effects and the nonparametric part f(t), we can simulate the response y_ij by computing its mean and variance through the model; in this case, μ_ij = x_ij^T β + f(t_ij) + v_i and ϕ = 1. By default, we estimate the unknown smooth function f(t) by two methods, the local linear kernel smoothing method and the cubic spline smoothing method, and denote the corresponding estimators by PHKernel and PHSpline. In addition, we also compute the cubic spline smoothing estimator without the penalty term, i.e., with λ = 0, and denote it by HSpline. However, due to the computational cost of the local linear kernel smoothing method, we compare the local linear kernel and cubic spline smoothers only for Examples 1 and 2; for the remaining four examples, we run the simulations only for HSpline and PHSpline.
Before we report the simulation performance of our proposed penalty-based procedure, we introduce several terms that will appear in the summary tables. First, let the percentage of correctly fitted and the percentage of overfitted be the proportions of selected models that are correctly fitted and overfitted, respectively. In the case of overfitting, the columns "1," "2," and ">2" give the proportions of selected models that include one, two, and more than two irrelevant predictors, respectively.
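These summary measures can be computed as in the following Python sketch (the function name, the zero tolerance, and the toy replications are ours; the paper's tabulation code is not shown):

```python
import numpy as np

def selection_summary(beta_hats, true_support, tol=1e-6):
    """Summarize variable selection over replications.

    Returns the percentage of correctly fitted models and, for
    overfitted models, how many include 1, 2, or >2 irrelevant
    predictors (the "1", "2", ">2" columns in the tables)."""
    true_support = set(true_support)
    counts = {"correct": 0, "1": 0, "2": 0, ">2": 0, "underfit": 0}
    for b in beta_hats:
        selected = set(np.flatnonzero(np.abs(b) > tol))
        if not true_support <= selected:
            counts["underfit"] += 1      # a relevant predictor was missed
        elif selected == true_support:
            counts["correct"] += 1
        else:
            extra = len(selected - true_support)
            counts["1" if extra == 1 else "2" if extra == 2 else ">2"] += 1
    total = len(beta_hats)
    return {k: 100.0 * v / total for k, v in counts.items()}

# Toy check: three replications, true support {0, 1, 2}
reps = [np.array([2, 1, 3, 0, 0]),       # correct fit
        np.array([2, 1, 3, 0.5, 0]),     # overfit by 1
        np.array([2, 0, 3, 0, 0])]       # underfit
summary = selection_summary(reps, [0, 1, 2])
```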
Furthermore, to characterize the ability of a method to produce sparse solutions, we define the percentage of correct zeros (%), the proportion of truly zero coefficients that are estimated as zero; to characterize a method's underfitting, we further define the percentage of incorrect zeros (%), the proportion of truly nonzero coefficients that are estimated as zero.

The steps of Algorithm 1 are as follows:

(i) Assume a partial linear model without variable selection and express f(t_ij) in a parametric way. For example, a cubic regression spline can be expressed using the truncated power basis, where the 5 knots τ₁, . . . , τ₅ are percentiles of t, α₀, . . . , α₈ are the associated coefficients, and s = 3 and r = 5 are the numbers corresponding to the cubic regression spline representation.

(ii) Initialize the fixed effects β^(0) = β̂_hglm, where β̂_hglm is the h-likelihood estimate obtained by treating f(t_ij) parametrically.

(iii) For the (k + 1)th iteration, take the estimator β^(k) from the kth iteration and update β by minimizing the penalized h-likelihood criterion.

Several findings can be observed from Table 2. In particular, when we have a sparser representation of the fixed effects β with smaller magnitudes, our proposed PHSpline tends to give somewhat more conservative results than in Examples 1 and 2 in terms of variable selection accuracy. For Example 3, our proposed PHSpline method yields 74% correct fits, 14% overfits with 1 irrelevant predictor included, 8% with 2 irrelevant predictors included, and 4% with more than 2 irrelevant predictors included. In fact, the overall variable selection consistency for Example 3 is good, with 93.3% correct zeros. In contrast, when the number of within-cluster subjects decreases from 10 to 5 in Example 4, the percentage of correct zeros drops to 84.4%, meaning that more irrelevant predictors are included in the model. Last but not least, similar trends can be observed for Examples 5 and 6 as compared with Examples 3 and 4.
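The truncated power basis used in step (i) can be sketched in Python as follows (the function name and the exact percentile placement of the knots are our assumptions; the basis has the 9 columns corresponding to α₀, . . . , α₈):

```python
import numpy as np

def truncated_power_basis(t, n_knots=5, degree=3):
    """Truncated power basis for a cubic regression spline:
    columns 1, t, t^2, t^3, (t - tau_1)_+^3, ..., (t - tau_5)_+^3,
    with knots tau_r placed at percentiles of t (step (i) above)."""
    knots = np.percentile(t, np.linspace(0, 100, n_knots + 2)[1:-1])
    poly = np.vander(t, degree + 1, increasing=True)    # 1, t, ..., t^3
    trunc = np.clip(t[:, None] - knots[None, :], 0.0, None) ** degree
    return np.hstack([poly, trunc])                     # n x (1 + s + r)

t = np.random.default_rng(2).uniform(0, 1, 200)
B = truncated_power_basis(t)
# f(t) = t^2 lies in the span of the basis, so least squares fits it exactly
alpha, *_ = np.linalg.lstsq(B, t ** 2, rcond=None)
```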
Example 5 returns 71% correct fits, 20% overfits with 1 irrelevant predictor included, 2% with 2 irrelevant predictors included, and 7% with more than 2 irrelevant predictors included. In contrast, Example 6 returns 64% correct fits, 21% overfits with 1 irrelevant predictor included, 6% with 2 irrelevant predictors included, and 9% with more than 2 irrelevant predictors included. As a result, the 71% correct fit rate for Example 5 outperforms the 64% for Example 6 in terms of variable selection consistency. Hence, generally speaking, our proposed PHSpline method works better when the number of within-cluster subjects increases.
Besides the variable selection accuracy summarized in Table 2, the prediction accuracy of the fixed effect estimates in the various examples is also of interest. In the following paragraphs, the prediction accuracy results for the fixed effects β are discussed and interpreted, with Tables 3–8 presented. Table 3 summarizes the simulation results over 100 replications for Example 1. As we can see, both PHKernel and PHSpline recover the relevant predictors accurately. In addition, the fixed effect estimates from both PHKernel and PHSpline differ very little from the true values of β. However, in terms of speed, the PHSpline method is much faster than the PHKernel method and hence easier to implement. In contrast, the HSpline method returns the h-likelihood estimates of the fixed effects without the penalty term. As we can observe from Table 3, the HSpline method gives nonzero estimates for all components of β, resulting in poor variable selection performance compared with PHSpline, which involves a penalty term. Furthermore, the PHSpline estimates tend to have smaller standard deviations than the HSpline estimates. Therefore, the PHSpline method outperforms the other two methods in both variable selection accuracy and implementation speed. The simulation results over 100 replications for Example 2 are summarized in Table 4. Example 2 has a smaller number of within-cluster subjects than Example 1. In fact, similar to the results obtained in Example 1, both the PHKernel and PHSpline methods return relatively good estimates of the fixed effects β in terms of variable selection accuracy and prediction accuracy. In particular, both PHKernel and PHSpline wrongly select one irrelevant covariate. In addition, the fixed effect estimates from both methods differ very little from the true values of β.
In contrast, as we can observe from Table 4, the HSpline method gives nonzero estimates for all components of β, resulting in poor variable selection performance compared with PHSpline. Furthermore, the PHSpline estimates tend to have smaller standard deviations than the HSpline estimates. In fact, it is not surprising that both the PHKernel and PHSpline methods include X₆ as a relevant predictor in the model or, equivalently, that both methods return a nonzero β₆. The reason is that the covariates follow an AR(1) structure, so there is a correlation of ρ = 0.5 between X₅ and X₆.
As we compare the simulation results of Examples 1 and 2, our proposed PHSpline method tends to perform better when the number of within-cluster subjects increases. A similar conclusion can be drawn for the PHKernel method. Furthermore, both PHKernel and PHSpline work well when the nonparametric component is f(t) = t².
Tables 5 and 6 present the simulation results over 100 replications for Examples 3 and 4. In these two examples, we have a sparser representation of the fixed effects β than in Examples 1 and 2, and the magnitudes of the fixed effects are set to be smaller. For both sets of results, the PHSpline method outperforms the HSpline method in terms of variable selection in two ways. First, the PHSpline method identifies some of the irrelevant predictors accurately, whereas the HSpline method gives nonzero estimates for all components of β. Though PHSpline cannot guarantee 100% selection accuracy, it does improve on the poor variable selection performance of HSpline by adding a penalty term. Furthermore, the PHSpline estimates tend to have smaller standard deviations than the HSpline estimates. Therefore, the PHSpline method performs better than the HSpline method even when the fixed effects β are sparse.
Similarly, the simulation results over 100 replications for Examples 5 and 6 are presented in Tables 7 and 8. Again, we have a sparser representation of the fixed effects β than in Examples 1 and 2, with smaller magnitudes of the fixed effects.
The PHSpline method works well in terms of variable selection for Example 5. Overall, the simulation results show that our proposed penalized h-likelihood approach performs well in terms of variable selection accuracy because of its ability to recover the true zeros, especially when the number of within-cluster subjects is not too small. Generally, our proposed PHSpline method works better as the number of within-cluster subjects increases. In addition, even when the true model is sparse, our penalized estimator does no worse than the h-likelihood estimator in terms of estimation accuracy.

Conclusion
To conclude, in this paper we have introduced a new penalized h-likelihood approach for identifying the nonzero relevant fixed effects in the partial linear model setting. This penalized h-likelihood incorporates variable selection procedures into mean modeling via h-likelihood. A few advantages of the newly proposed method are as follows. First, compared with the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and hence is convenient to use. In addition, the h-likelihood plays an important role in inference for models with unobserved random variables. Last but not least, the simulation studies demonstrate that the proposed penalty-based method is able to identify zero regression coefficients in modeling the mean structure and produces good fixed effect estimates.
As for future research, it would be interesting to extend the proposed penalized h-likelihood approach to more complicated settings for partial linear models. The model in this paper assumes only a simple one-component structure for the random effects, in that only a random intercept is considered. For possible future research, we may consider a partial linear model for the conditional mean with more than one random effect, i.e., an extended multicomponent random effects model. Other future work, including variance component estimation for the random effects and the study of the penalized h-likelihood estimator's theoretical and asymptotic properties, such as its convergence rate, remains to be investigated.

Data Availability
This is a theoretical study, and we do not have experimental data.

Disclosure
This work was part of the Ph.D. thesis originally written by the first author in 2013 [41].

Conflicts of Interest
The authors declare that they have no conflicts of interest.