Estimation of Stochastic Frontier Models with Fixed Effects through Monte Carlo Maximum Likelihood

Estimation of nonlinear fixed-effects models is plagued by the incidental parameters problem. This paper proposes a procedure for choosing appropriate densities for integrating the incidental parameters from the likelihood function in a general context. The densities are based on priors that are updated using information from the data and are robust to possible correlation of the group-specific constant terms with the explanatory variables. Monte Carlo experiments are performed in the specific context of stochastic frontier models to examine and compare the sampling properties of the proposed estimator with those of the random-effects and correlated random-effects estimators. The results suggest that the estimator is unbiased even in short panels. An application to a cross-country panel of EU manufacturing industries is presented as well. The proposed estimator produces a distribution of efficiency scores suggesting that these industries are highly efficient, while the other estimators suggest much poorer performance.


Introduction
The incidental parameters problem was formally defined and studied by Neyman and Scott 1 .In general, the problem appears in many models for which the number of parameters to be estimated grows with the number of observations.In such a model, even parameters that are common to all observations cannot be consistently estimated due to their dependence on observation-or group-specific parameters.In econometrics, the issue appears to be more relevant in panel-data models where the incidental parameters-although not as much parameters as latent data-represent group-specific intercepts.In this setting, the number of incidental parameters grows linearly with the cross-sectional dimension of the panel.Evidence on the inconsistency of estimators when the problem is ignored are available for discrete choice models 2 , the Tobit 3 , and the stochastic frontier models 4, 5 .
Lancaster 6 identified three axes around which the proposed solutions concentrate: i integrate the incidental parameters out from the likelihood based on an assumed density, ii replace the incidental parameters in the likelihood function by their maximum-likelihood estimates and maximize the resulting profile with respect to the common parameters, and iii transform the incidental parameters in a way that they become approximately orthogonal to the common parameters, and then integrate them from the likelihood using a uniform prior.
In the case of integrated likelihood, the Bayesian approach is straightforward: formulate a prior for each incidental parameter and use this prior to integrate them from the likelihood.As pointed out by Chamberlain 7 , such a procedure does not provide a definite solution to the problem.When the number of incidental parameters grows, the number of priors placed on these parameters will grow as well and, therefore, the priors will never be dominated by the data.It appears, however, that this is the best that could be done.In the end, the problem of incidental parameters becomes one of choosing appropriate priors.
There exists no direct counterpart to the Bayesian approach in frequentist statistics.Instead, a random-effects formulation of the problem could be used 4, 5 .In this setting, the incidental parameters are integrated from the likelihood based on a "prior" density.However, this density is not updated by the data in terms of its shape, but only in terms of its parameters.As such, it cannot be considered a prior in the sense the term is used in a Bayesian framework.The usual practice is to use a normal density as a "prior" which does not depend on the data.This random-effects formulation will produce a fixed-T consistent estimator as long as the true underlying data-generating process is such that the group-specific parameters are uncorrelated with the regressors.
Allowing for the incidental parameters to be correlated with the regressors, Abdulai and Tietje 8 use Mundlak's 9 view on the relationship between fixed-and random-effects estimators, in the context of a stochastic frontier model.Although this approach is likely to mitigate the bias of the random-effects estimator, there is no evidence on how Mundlak's estimator performs in nonlinear models.
This study proposes a different method for integrating the incidental parameters from the likelihood in a frequentist setting.In panel-data models, the incidental parameters are treated as missing data and the approach developed by Gelfand and Carlin 10 is used to update a true prior in the Bayesian sense on the incidental parameters using information from the data.The formulated posterior is then used to integrate the incidental parameters from the likelihood.
The rest of the paper is organized as follows: in Section 2, the proposed estimator is developed in a general framework and related to existing frequentist and Bayesian estimators.The following section discusses some practical considerations and possible computational pitfalls.Section 4 presents a set of Monte Carlo experiments in the specific context of stochastic frontier models.The sampling properties of the proposed estimator are compared to those of the linear fixed effects, random effects, and random effects with Mundlak's approach.The next section provides an application of the estimators to a dataset of EU manufacturing industries, while Section 6 presents some concluding comments.

Monte Carlo Maximum Likelihood in Panel Data Models
We consider the following general formulation of a panel-data model: where y it and x it are time-varying observed data.The z i s are time-invariant and unobserved data, which are potentially correlated with the x i s.Since the z i s are unobserved, they will be absorbed in the group-specific constant term.The estimable model becomes The view of the fixed effects as latent data rather than parameters justifies, from a frequentist perspective, the subsequent integration of the α i s from the likelihood function.The nature of the dependent variable discrete, censored, etc. and different distributional assumptions on ε give rise to an array of econometric models.
In such a model, it is usually straightforward to derive the density of the dependent variable conditional on the independent and the group-specific intercept.Let y i and x i be the vector and matrix of the stacked data for group i.The contribution to the likelihood of the ith group conditional on α i is where f y it | x it , θ, α i is easy to specify.Maximum likelihood estimation is based on the density of observed data; that is, on the density of y i marginally with respect to α i .In an integrated-likelihood approach, the fixed effects are integrated out from the joint density of y i and α i .We follow Gelfand and Carlin 10 to derive an appropriate density according to which such an integration can be carried out by writing the density of the data marginally with respect to the fixed effects as where ψ is any point in the parameter space of θ.
It is obvious from this formulation that f y i , α i | x i , ψ plays the role of an importance density for the evaluation of the integral.However, it is a very specific importance density: it has the same functional form as the unknown f y i , α i | x i , θ , but is evaluated at any chosen by the researcher point ψ.The same functional form of the integrand and the importance density can be exploited to reach a form in which, under some additional assumptions, every density will be known or easy to assume a functional form for it.Typically, the integral in 2.4 would be evaluated by simulation.For this formulation of the marginal likelihood, Geyer 11 showed that, under loose regularity conditions, the Monte Carlo likelihood hypoconverges to the theoretical likelihood and the Monte Carlo maximum likelihood estimates converge to the maximum likelihood estimates with probability 1.
The joint density of y i and α i can be written as the product of the known from 2.3 conditional likelihood and the marginal density of α i .Then, 2.4 becomes The following assumption is imposed on the data-generating process: In words, this assumption means that the way α i and x i are related does not depend on θ.
One may think of this as the relationship between α i and x i being determined by a set of parameters η, prior to the realization of y i .In mathematical terms, this would require that . This implies that the set of parameters that enter the distribution of α i conditionally on x i , but unconditionally on y i , is disjoint of θ.
In practice and depending on the application at hand, this assumption may or may not be restrictive.Consider, for example, the specification of a production function where y is output, x is a vector of inputs, and a represents the effect of time-invariant unobserved characteristics, such as the location of the production unit, on output.The assumption stated in 2.6 implies that, although location may affect the levels of input use, the joint density of location and inputs does not involve the marginal productivity of inputs.On the other hand, conditionally on output, the density of α does involve the marginal productivity coefficients, since this conditional density is obtained by applying Bayes' rule on f y i | x i , θ, α i .
Under the assumption stated in 2.6 , 2.5 can be simplified to Theoretically, f α i | y i , x i , ψ can be specified in a way that takes into account any prior beliefs on the correlation between the constant terms and the independent variables.Then, the integral can be evaluated by simulation.Practically, however, there is no guidance on how to formulate these beliefs.Furthermore, the choice of f α i | y i , x i , ψ is not updated during the estimation process and it is not truly a prior, just as in frequentist random effects.Alternatively, we can only specify the marginal density of α i and use Bayes' rule to get Again, there is not much guidance on how to specify p α i | x i , ψ .Additionally, in order to be consistent with assumption 2.6 , we need to assume a density for α i that does not involve ψ.But now the issue is not as important: it is f α i | y i , x i , ψ and not the assumed p α i | x i , ψ that is used for the integration.That is, p α i | x i , ψ is a prior in the Bayesian sense of the term since it is filtered through the likelihood for a given ψ before it is used for the integration.Accordingly, f α i | y i , x i , ψ is the posterior density of α i .
Before examining the role of the prior in the estimation, we note that the frequentist random-effects approach can be derived by using 2.8 to simplify 2.7 .If α i is assumed to be independent of x i and the parameters of its density are different from ψ, then the unconditional likelihood does not depend on ψ and the estimator becomes similar to the one Greene 4 suggests 2.9 It is apparent that there is an advantage in basing the estimation on the likelihood function in 2.7 rather than 2.9 .By sampling from f α i | y i , x i , ψ instead of p α i , we are using information contained in the data on the way α i is correlated with x i .For example, we may assume that in the prior α i is normally or uniformly distributed and that it is independent of the data.But even this prior independence assumption will not impose independence in the estimation, because of the filtering of the prior through the likelihood in 2.8 .
As it is the case in Bayesian inference, the role of the prior density of α i diminishes with the increase in the number of time observations per group.But the short time dimension of the panel is the original cause of the incidental parameters problem.The estimator proposed here is still subject to the critique that was developed for the corresponding Bayesian estimator: the density of the data will not dominate the prior as N → ∞ with T held fixed.On the other hand, when the true data-generating process is such that the group-specific constant terms are correlated with the independent variables, the method proposed here will mitigate the bias from which the random-effects estimator suffers.

Calculations and Some Practical Considerations
The first step in the application of the MC maximum likelihood estimator developed in the previous section is to sample from the posterior of α i given ψ.Since this posterior density involves the likelihood function, its functional form will, in general, not resemble the kernel of any known distribution.But, this posterior is unidimensional for every α i and simple random sampling techniques, such as rejection sampling, can be used.Of course, the Metropolis-Hastings algorithm provides a more general framework for sampling from any distribution.In the context of the posterior in 2.8 , a Metropolis-Hastings algorithm could be used to construct a Markov chain for each α i , while holding ψ fixed.
Given that M random numbers are drawn from the posterior of each α i , the simulated likelihood function for the entire dataset can be written as where α ij is the jth draw from f α i | y i , x i , ψ .The MC likelihood function can be maximized with respect to θ.The first term in the product is constant with respect to θ and can be ignored during the optimization.The relevant part of the simulated log-likelihood is One practical issue that remains to be resolved is the choice of ψ.Theoretically, this choice should not matter.In practice, however, when the calculations are carried on finiteprecision machines, it does.In principle, f y i , α i | x i , ψ should mimic the shape of f y i , α i | x i , θ , as it plays the role of an importance density for the evaluation of the integral in 2.4 .If ψ is chosen to be far away from θ, then the two densities will have probability mass over different locations and the ratio in 3.2 will be ill behaved in the points of the parameter space where the proposal density approaches zero, while the likelihood does not.
Gelfand and Carlin 10 propose solving this problem by choosing an initial ψ and running some iterations by replacing ψ with the MC maximum likelihood estimates from the previous step.In the final step, the number of samples is increased to reduce the Monte Carlo standard errors.The estimator produced by this iterative procedure has the same theoretical properties as an estimator obtained by choosing any arbitrary ψ.On the other hand, this iterative procedure introduces another problem: if during this series of iterations ψ converges to the value of θ supported by the data, then in the subsequent iteration the ratio in 3.2 will be approximately unity.In practice, the simulated likelihood will never be exactly one due to the noise introduced through the random sampling.As a consequence, the MC likelihood function will no longer depend on θ or at least it will be very flat.This leads to numerical complications that now have to do with the routine used for maximizing the likelihood.A way to overcome this problem is by introducing some noise to the estimate of θ from iteration, say k, before using it in place of ψ for iteration k 1.Additionally, increasing the variance parameter s contained in ψ will result in the proposal density having heavier tails than the likelihood, alleviating in this way the numerical instability problem in the ratio of the two densities.

Monte Carlo Experiments
In this section, we perform a set of Monte Carlo experiments on the stochastic frontier model 12, 13 .Wang and Ho 14 have analytically derived the likelihood function for the class of stochastic frontiers models that have the scaling property 15 by using within and first-difference transformations.Instead of restricting attention to this class of models, the formulation proposed by Meeusen and van den Broeck 13 is used here where the noise component of the error term is assumed to follow a normal distribution with mean zero and variance σ 2 v , while the inefficiency component of the error is assumed to follow an exponential distribution with rate λ.The technical efficiency score for observation i in period t is defined as TE it exp{−u it } and assumes values on the unit interval.
Under the described specification and assuming independence over t, the contribution of group i to the likelihood conditional on the fixed effects is where ε it y it − α i − x it β and θ β log σ 2 v log λ 2 .A major objective of an application of a stochastic frontier model is usually the estimation not only of the model's parameters, but also of the observation-specific efficiency scores.These estimates can be obtained as where μ it −ε it − σ 2 v /λ.Three experiments are performed for panels of varying length T 4, 8, and 16 , while keeping the total number of observations cross-section and time dimensions combined fixed at 2000.The sampling properties of four estimators are examined: i linear fixed effects within estimator, ii MC maximum likelihood, iii simple random effects, and iv correlated random effects using Mundlak's approach.
The data are generated in the following sequence: i N α i s are drawn from a normal distribution with mean zero and variance 2, ii for each i, T draws are obtained from a normal distribution with mean α i 1/2 α 2 i and standard deviation equal to 1/2 for two independent variables, x 1 and x 2 , iii NT draws are obtained from a normal distribution with zero mean and standard deviation equal to 0.3 for v it , iv NT draws are obtained from an exponential distribution with rate equal to 0.3 for u it , v the dependent variable is constructed as y it α i 0.7x 1,i 0.4x 2,i v it − u it .
For the MC maximum likelihood estimator, uniform priors are assumed for the α i s and their integration from the likelihood is based on 3000 random draws from their posterior.These draws are obtained using a Metropolis-Hastings random-walk algorithm.For the random-effects estimators, each α i is assumed to follow a normal distribution with mean μ and variance σ 2 α .Integration of the unobserved effects for the random-effects estimator, is carried using Gaussian quadratures.Although integration of the unobserved effects can be carried using simulation as suggested by Greene 4 , under normally distributed α i 's integration through a Gauss-Hermite quadrature reduces computational cost substantially.
Table 1 presents the means, mean squared errors, and percent biases for the four estimators, based on 1000 repetitions.The linear fixed-effects estimator is unbiased with respect to the slope parameters, as well as with respect to the standard deviation of the composite error term.This estimator, however, cannot distinguish between the two components of the error.Nevertheless, it can be used to provide group-specific but timeinvariant efficiency scores using the approach of Schmidt and Sickles 16 .This approach has the added disadvantage of treating all unobserved heterogeneity as inefficiency.
On the other hand, as expected, the simple random-effects estimator is biased with respect to the slope parameters.Interestingly, however, the bias is much smaller for the variance parameter of the inefficiency component of the error term.This suggests that one may use the simple random-effects estimator to obtain an indication of the distribution of the industry-level efficiency even in the case where the group effects are correlated with the independent variables.
The MC maximum likelihood and the correlated random-effects estimators are virtually unbiased both with respect to the slope and the variance parameters, even for small T .Furthermore, the mean squared errors of both estimators decrease as the time dimension of the panel increases.For the MC maximum likelihood estimator this can be attributed to the fact that as T increases more information per group is used to formulate the posterior of α i .
Obtaining estimates of observation-specific efficiency scores involves first generating estimates of the group intercepts.Estimates of the group effects can be obtained for the random-effects estimators using group averages of the dependent and independent variables, accounting at the same time for the skewness of the composite error term 14 .On the other hand, the MC maximum likelihood estimator can provide estimates of the α i s by treating them as quantities to be estimated by simulation after the estimation of the common parameters of the model.In both estimators, the α i s and θ are replaced in 4.3 by their point estimates to obtain estimates of the observation-specific efficiency scores.
Nevertheless, both the random-effects and the MC maximum likelihood estimators of the α i s are only T consistent.A different approach, which is consistent with treating the α i s as missing data, is to integrate them from the expectation in 4.3 .That is, one may obtain the expectation of e −u it | ε it unconditionally on the missing data.In this way, the uncertainty associated with the α i s is accommodated when estimating observation-specific efficiency scores.The integration of the α i s is achieved using the following procedure: 1 draw M samples from f α i | y i , x i , θ where θ is either the random-effects or the MC maximum likelihood point estimate, 3 take the sample mean of the E e −u it | ε it,j s over j.
By the law of iterated expectations, this sample mean will converge to the unconditional expectation of e −u it | ε it,j .In the random-effects model, integration can also be performed by quadratures rather than simulation.Figure 1 presents scatter plots of the actual versus the predicted efficiency scores for the MC maximum likelihood and the correlated random-effects estimators for a particular Monte Carlo repetition.Apart from the known problem of underestimating the scores of highly efficient observations, the approach of integrating the α i s from the expectation of e −u it | ε it produces good predictions for the efficiency scores for the MC maximum likelihood estimator.On the other hand, the predictions of the correlated random-effects estimator are more dispersed around the 45 • line.The MC maximum likelihood estimator has an advantage over the random-effects estimator because it does not need to specify a systematic relationship between the group effects and the independent variables.In other words, the quality of the estimates of the efficiency scores from the random-effects estimator deteriorates if there is a lot of noise in the relationship between the group effects and the group means of the independent variables.

Application
This section presents an application of the estimators discussed in this paper to a panel of European manufacturing industries.The dataset comes from the EU-KLEMS project 17 and covers the period from 1970 to 2007.It contains annual information at the country level for industries classified according to the 4-digit NACE revision 2 system.The part of the dataset used for the application covers 10 manufacturing industries for 6 countries for which the required data series are complete Denmark, Finland, Italy, Spain, The Netherlands, and UK .The production frontier is specified as Cobb-Douglass in capital stock and labor input, with value added being the dependent variable.A linear time trend is included to capture autonomous technological progress.The model specification is 5.1 where it is assumed that v it ∼ N 0, σ 2 v and u it ∼ Exp λ .Each industry in each country is treated as a group with its own intercept, but the production technologies of all industries across all countries are assumed to be represented by the same slope parameters.
The model is estimated using the linear fixed-effects, MC maximum likelihood, and simple and correlated random-effects estimators.The results are presented in Table 2. Given that under the strict model specification the group effects are expected to be correlated with the regressors, it does not come as a surprise that relatively large discrepancies between the parameter estimates of the linear fixed-effects and the simple random-effects estimators appear.Nonnegligible discrepancies are also observed between the linear fixedeffects estimate of β K and the corresponding estimates from the estimators that account for possible correlation between the group effects and the independent variables.Although this result appears to be in contrast with the findings of the Monte Carlo simulations, we need to keep in mind that the Monte Carlo findings are valid for the estimators on average, while the application considers a single dataset where particularities could lead to these discrepancies.For example, limited within variation in the capital and labor variables could induce multicollinearity and render the point estimates less precise.
On the other hand, all three estimators that can distinguish between noise and inefficiency effects produce very similar parameter estimates for the variances of the two error terms.The estimates of the parameter associated with the time trend suggest that the industries experience, on average, productivity growth at a rate slightly larger than 2%.
Figure 2 presents kernel density estimates of the observation-specific technical efficiency scores obtained by integrating the group effects from the expectation in 4.3 using the MC maximum likelihood and the two random-effects estimators.It appears that only the MC maximum likelihood estimator produces a distribution of technical efficiency scores similar to the original assumptions imposed by the model, with the majority of the industries being highly efficient.On the other hand, the simple random-effects estimator yields a bimodal distribution of efficiency scores.

Conclusions and Further Remarks
This paper proposes a general procedure for choosing appropriate densities for frequentist integrated-likelihood methods in panel data models.The proposed method requires the placement of priors on the density of the group-specific constant terms.These priors, however, are updated during estimation and in this way their impact on the final parameter estimates is minimized.
A set of Monte Carlo experiments were conducted to examine the sampling properties of the proposed estimator and to compare them with the properties of existing relevant estimators.Although the experiments were conducted in the specific context of a stochastic  frontier model, the proposed estimator can be generalized to other nonlinear models.The results suggest that, even in very short panels, both the MC maximum likelihood estimator and random-effects estimator augmented by the group averages of the regressors are virtually unbiased in the stochastic frontier model.Returning to Chamberlain's 7 observation that in panel-data settings the contribution of the prior is never dominated by the data, the results from the Monte Carlo experiments suggest that this is not an issue of major importance.It appears that when the objective is not the estimation of the incidental parameters but their integration from the likelihood, then even very vague priors do not introduce any bias in the common parameter estimates.
In the end, which estimator should be chosen?From the estimators considered here, the MC and Mundlak's random-effects estimators are able to distinguish inefficiency from group-and time-specific unobserved heterogeneity, while being reasonably unbiased with respect to the common parameters.The difference between the two is based on theoretical grounds.The MC estimator is able to account for the correlation of the group-specific parameters with the regressors in any unknown form.On the other hand, the correlated random-effects estimator lacks such a theoretical support; there still exist no analytical results on the properties of this estimator in nonlinear settings.
Another disadvantage of the correlated random-effects estimator is that it requires the inclusion of the group means of independent variables in the model.This approach could induce a high degree of multicollinearity if there is little within variability in the data.Lastly, in the specific context of stochastic frontier models, the MC maximum likelihood estimator provides better estimates of the observation-specific efficiency scores.

Figure 1 :
Figure 1: Actual versus predicted efficiency scores with simulated data.

Figure 2 :
Figure 2: Kernel density estimates of efficiency scores from the three estimators.

Table 1 :
Simulation results for the stochastic frontier model.

Table 2 :
Results for the EU-KLEMS model.