On the semi-parametric efficiency of the Scott-Wild estimator under choice-based and two-phase sampling

Using a projection approach, we obtain an asymptotic information bound for estimates of parameters in general regression models under choice-based and two-phase outcome-dependent sampling. The asymptotic variances of the semiparametric estimates of Scott and Wild (1997, 2001) are compared to these bounds and the estimates are found to be fully efficient.


Introduction
Suppose that for each of a number of subjects, we measure a response y and a vector of covariates x, in order to estimate the parameters β of a regression model which describes the conditional distribution of y given x. If we have sampled directly from the conditional distribution, or even the joint distribution, we can estimate β without knowledge of the distribution of the covariates.
In the case of a discrete response, which takes one of J values $y_1, \ldots, y_J$, say, we often estimate β using a case-control sample, where we sample from the conditional distribution of X given $Y = y_j$. This is particularly advantageous if some of the values $y_j$ occur with low probability. In case-control sampling, the likelihood involves the distribution of the covariates, which may be quite complex, and direct parametric modelling of this distribution may be too difficult. To get around this problem, the covariate distribution can be treated non-parametrically. In a series of papers (Scott and Wild 1986, 1997; Wild 1991), Scott and Wild developed an estimation technique which yields a semi-parametric estimate of β. They dealt with the unknown distribution of the covariates by profiling it out of the likelihood, and derived a set of estimating equations whose solution is the semi-parametric estimator of β.
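To make the case-control setting concrete, here is a small simulation (ordinary logistic regression fitted by Newton-Raphson, not the Scott-Wild procedure itself) illustrating the well-known fact that retrospective sampling shifts only the intercept of a logistic model, while the slope is still estimated consistently. All numerical choices are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, iters=50):
    """Newton-Raphson for logistic regression with an intercept column."""
    Z = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ b))
        W = p * (1 - p)
        H = Z.T @ (Z * W[:, None])          # observed information matrix
        b = b + np.linalg.solve(H, Z.T @ (y - p))
    return b

rng = np.random.default_rng(0)
N = 200_000
x = rng.normal(size=N)
beta0, beta1 = -3.0, 1.0                    # rare "disease": low intercept
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
y = rng.uniform(size=N) < p

# Case-control sample: take all cases, and an equal number of controls.
cases = np.flatnonzero(y)
controls = rng.choice(np.flatnonzero(~y), size=len(cases), replace=False)
idx = np.concatenate([cases, controls])
b_cc = fit_logistic(x[idx], y[idx].astype(float))
# The slope b_cc[1] is close to beta1; the intercept b_cc[0] absorbs the
# (here heavily unequal) case and control sampling rates.
```

The intercept shift equals the log ratio of the case and control sampling fractions, which is exactly the quantity the Scott-Wild profiling handles in a general regression model.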
This technique also works well for more general sampling schemes, for example for two-phase outcome-dependent stratified sampling. Here, the sample space is partitioned into S disjoint strata which are defined completely by the values of the response and possibly some of the covariates. In the first phase of sampling, a prospective sample of size N is taken from the joint distribution of x and y, but only the stratum the individual belongs to is observed. In the second phase, for $s = 1, \ldots, S$, a sample of size $n_1^{(s)}$ is selected from the $n_0^{(s)}$ individuals in stratum s who were selected in the first phase, and the rest of the covariates are measured. Such a sampling scheme can reduce the cost of studies by confining the measurement of expensive variables to the most informative subjects. It is also an efficient design for elucidating the relationship between a rare disease and a rare exposure, in the presence of confounders.
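The two-phase design just described can be sketched in a few lines of code. The stratification rule (stratify on the response itself), the sample sizes, and the logistic model below are illustrative assumptions, not part of the Scott-Wild formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Phase 1: a prospective sample in which only the stratum (here, simply
# the binary response) is recorded; the covariate x is not yet measured.
N = 10_000
x = rng.normal(size=N)                       # the "expensive" covariate
p = 1.0 / (1.0 + np.exp(-(-2.0 + x)))        # illustrative logistic model
y = (rng.uniform(size=N) < p).astype(int)
n0 = np.bincount(y, minlength=2)             # first-phase stratum counts

# Phase 2: measure x only for a stratified subsample, e.g. all cases and
# an equal number of controls, drawn within strata without replacement.
n1 = np.array([min(n0[0], n0[1]), n0[1]])
phase2 = np.concatenate([
    rng.choice(np.flatnonzero(y == s), size=n1[s], replace=False)
    for s in (0, 1)
])
x_observed = x[phase2]                       # covariates known only here
```

The point of the design is visible in the code: `x` is "measured" for only `n1.sum()` of the `N` first-phase subjects, with the rare stratum fully represented.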
Another generalized scheme that falls within the Scott-Wild framework is that of case-augmented sampling, where a prospective sample is augmented by a further sample of controls. In the prospective sample, we may observe both disease state and covariates, or covariates alone. Such schemes are discussed in Lee, Scott and Wild (2006).
In this paper, we introduce a general method for demonstrating that the Scott-Wild procedures are fully efficient. We use a (slightly extended) version of the theory of semi-parametric efficiency due to Bickel et al. (1993) to derive an "information bound" for the asymptotic variance of the estimates. We then compute the asymptotic variances of the Scott-Wild estimators, and demonstrate their efficiency by showing that the asymptotic variance coincides with the information bound in each case.
The efficiency of these estimators has been studied by several authors, who have also addressed this question using semi-parametric efficiency theory. This theory assumes an i.i.d. sample, so various ingenious devices have been used to apply it to the case of choice-based sampling. For example, Breslow, Robins and Wellner (2000) consider case-control sampling, assuming that the data are generated by Bernoulli sampling, where either a case or a control is selected by a randomisation device with known selection probabilities, and the covariates of the resulting case or control are measured. The randomisation at the first stage means that the i.i.d. theory can be applied.
The efficiency of regression models under an approximation to the two-phase sampling scheme has been considered by Breslow, McNeney and Wellner (2003) using missing value theory. In this approach, a single prospective sample is taken. For some individuals, the response and the covariates are both observed. For the rest, only the response is measured, the covariates being regarded as missing values. The efficiency bound is obtained using the missing value theory of Robins, Hsieh and Newey (1995).
In this paper, we adopt a more direct approach. First, we sketch an extension of Bickel-Klaassen-Ritov-Wellner theory to cover the case of sampling from several populations, which we require in the rest of the paper. Such extensions have also been studied by McNeney and Wellner (2000) and Bickel and Kwon (2001). Then information bounds for the regression parameters are derived assuming that separate prospective samples are taken from the case and control populations.
The minor modifications to the standard theory required for the multi-sample efficiency bounds are sketched in Section 2. This theory is then applied to case-control sampling and an information bound derived in Section 3. We also derive the asymptotic variance of the Scott-Wild estimator and show that it coincides with the information bound.
In Section 4, we deal with the two-phase sampling scheme. We argue that a sampling scheme equivalent to the two-phase scheme described above is to regard the data as arising from separate independent sampling from S +1 populations. This allows the application of the theory sketched in Section 2. We derive a bound and again show that the asymptotic variance of the Scott-Wild estimator coincides with the bound. Finally, mathematical details are given in Section 5.
In the context of data that are independently and identically distributed, Newey (1994) characterizes the information bound in terms of a population version of a profile likelihood, rather than a projection. A parallel approach to calculating the information bound for the case-control and two-phase problems, using Newey's "profile" characterization, is contained in Lee and Hirose (2007).

Multi-samples, information bounds and semi-parametric efficiency
In this section, we give a brief account of the theory of semi-parametric efficiency when the data are not independently and identically distributed, but rather consist of separate independent samples from different populations.
Suppose we have J populations. From each population, we independently select separate i.i.d. samples, so that for $j = 1, \ldots, J$, we have a sample $\{x_{ij},\ i = 1, \ldots, n_j\}$ from a distribution with density $p_j$, say. We call the combined sample a multi-sample. We will consider asymptotics where $n_j/n \to w_j$, with $n = n_1 + \cdots + n_J$. Suppose that $p_j$ is a member of the family of densities
$$\mathcal{P} = \{p_j(x, \beta, \eta):\ \beta \in B,\ \eta \in N\},$$
where B is a subset of $\mathbb{R}^k$ and N is infinite-dimensional. We denote the true values of β and η by $\beta_0$ and $\eta_0$, and $p_j(x, \beta_0, \eta_0)$ by $p_{j0}$. Consider asymptotically linear estimates $\hat\beta$ of β, of the form
$$\sqrt{n}(\hat\beta - \beta_0) = \frac{1}{\sqrt{n}} \sum_{j=1}^J \sum_{i=1}^{n_j} \phi_j(x_{ij}) + o_p(1),$$
where $E_j \phi_j(X) = 0$, $E_j$ denoting expectation with respect to $p_{j0}$. The functions $\phi_j$ are called the influence functions of the estimate, and its asymptotic variance is $\sum_{j=1}^J w_j E_j[\phi_j \phi_j^T]$. The semi-parametric information bound is a matrix B that is a lower bound for the asymptotic variance of all asymptotically linear estimates of β: we have
$$\sum_{j=1}^J w_j E_j[\phi_j \phi_j^T] \geq B$$
in the sense of non-negative definiteness, where the $\phi_j$ are the influence functions of $\hat\beta$.
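The stated asymptotic variance follows directly from the asymptotically linear form and the independence of the J samples; a one-line sketch of the calculation:

```latex
% Variance of a multi-sample asymptotically linear estimate.
% Assume \sqrt{n}(\hat\beta-\beta_0)
%   = n^{-1/2}\sum_{j=1}^{J}\sum_{i=1}^{n_j}\phi_j(x_{ij}) + o_p(1),
% with E_j[\phi_j]=0 and the J samples independent of one another.
\operatorname{Var}\!\Big[n^{-1/2}\sum_{j=1}^{J}\sum_{i=1}^{n_j}\phi_j(x_{ij})\Big]
  = \sum_{j=1}^{J}\frac{n_j}{n}\,E_j\big[\phi_j\phi_j^{T}\big]
  \;\longrightarrow\; \sum_{j=1}^{J} w_j\,E_j\big[\phi_j\phi_j^{T}\big],
\qquad n_j/n \to w_j .
```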
The efficiency bound is found as follows. Let T be a subset of $\mathbb{R}^p$ and let $\{\eta(t):\ t \in T\}$ be a subfamily of N, so that $\mathcal{P}_T = \{p_j(x, \beta, \eta(t)):\ \beta \in B,\ t \in T\}$ is a $p$-dimensional submodel of $\mathcal{P}$. We also suppose that if $\eta_0$ is the true value of η, then $\eta(t_0) = \eta_0$ for some $t_0 \in T$. Thus, the submodel includes the true model, having $\beta = \beta_0$ and $\eta = \eta_0$.
Consider the vector-valued score functions
$$\dot{l}_{j,\eta} = \frac{\partial \log p_j(x, \beta, \eta(t))}{\partial t},$$
whose elements are assumed to be members of $L_2(P_{j0})$, where $P_{j0}$ is the measure corresponding to $p_j(x, \beta_0, \eta_0)$. Consider also the space $L_2^k(P_{j0})$, the space of all $\mathbb{R}^k$-valued functions square-integrable with respect to $P_{j0}$, and the Cartesian product H of these spaces, equipped with the norm defined by
$$\|(f_1, \ldots, f_J)\|_H^2 = \sum_{j=1}^J w_j E_j\big[\|f_j\|^2\big].$$
The subspace of H generated by the score functions $(\dot{l}_{1,\eta}, \ldots, \dot{l}_{J,\eta})$ is the set of all vector-valued functions of the form $(A\dot{l}_{1,\eta}, \ldots, A\dot{l}_{J,\eta})$, where A ranges over all $k \times p$ matrices. Thus, to each finite-dimensional subfamily of $\mathcal{P}$, there corresponds a score function and a subspace of H generated by the score function. The closure in H of the span (over all such subfamilies) of all these subspaces is called the nuisance tangent space and is denoted by $T_\eta$. Consider also the score functions $\dot{l}_{j,\beta} = \partial \log p_j(x, \beta, \eta)/\partial \beta$. The projection $\dot{l}^*$ in H of $\dot{l}_\beta = (\dot{l}_{1,\beta}, \ldots, \dot{l}_{J,\beta})$ onto the orthogonal complement of $T_\eta$ is called the efficient score, and its elements are denoted by $\dot{l}^*_j$. The matrix B (the efficiency bound) is given by
$$B = \Big( \sum_{j=1}^J w_j E_j\big[\dot{l}^*_j \dot{l}^{*T}_j\big] \Big)^{-1}.$$
The functions $B\dot{l}^*_j$ are called the efficient influence functions, and any multi-sample asymptotically linear estimate of β having these influence functions is asymptotically efficient.
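In a finite-dimensional analogue, the projection defining the efficient score is ordinary least squares: regress the β-score on the nuisance scores and keep the residual. A sketch with simulated "score" samples (the variables below are illustrative stand-ins, not the paper's actual scores):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
# Illustrative samples of a scalar beta-score correlated with two
# nuisance scores; in the theory these would be evaluated score functions.
l_eta = rng.normal(size=(n, 2))                   # nuisance scores
l_beta = 0.7 * l_eta[:, 0] + rng.normal(size=n)   # correlated beta-score

# Projection onto span(l_eta) in the empirical L2 inner product is a
# least-squares regression; the residual plays the role of the
# efficient score l*.
A, *_ = np.linalg.lstsq(l_eta, l_beta, rcond=None)
l_star = l_beta - l_eta @ A                       # "efficient score"
# The information is E[l*^2]; it is smaller than E[l_beta^2], reflecting
# the price paid for not knowing the nuisance parameter.
info = float(np.mean(l_star ** 2))
```

The residual is orthogonal to every nuisance direction by construction, which is exactly the defining property of the efficient score in H.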

The efficiency of the Scott-Wild estimator in case-control studies
In this section, we apply the theory sketched above in Section 2 to regression models where the data are obtained by case-control sampling. Suppose that we have a response Y (assumed discrete, with possible values $y_1, \ldots, y_J$) and a vector X of covariates, and we want to model the conditional distribution of Y given X using a regression function
$$f_j(x, \beta) = \Pr(Y = y_j \mid X = x;\ \beta),$$
say, where β is a k-vector of parameters. If the distribution of the covariates X is specified by a density g, then the joint distribution of X and Y is $f_j(x, \beta) g(x)$, and the conditional distribution of x given $Y = y_j$ is
$$p_j(x, \beta, g) = \frac{f_j(x, \beta)\, g(x)}{\pi_j(\beta, g)}, \qquad \pi_j(\beta, g) = \int f_j(x, \beta)\, g(x)\, dx.$$
In case-control sampling, the data are not sampled from the joint distribution, but rather are sampled from the conditional distributions of X given $Y = y_j$. We are thus in the situation of Section 2, with g playing the role of η.

3.1 The information bound in case-control studies. To apply the theory of Section 2, we must identify the nuisance tangent space $T_\eta$ and calculate the projection of $\dot{l}_\beta$ onto its orthogonal complement. Direct calculation shows that
$$\dot{l}_{j,\beta} = \frac{\partial \log f_j(x, \beta)}{\partial \beta} - E_j\Big[\frac{\partial \log f_j(X, \beta)}{\partial \beta}\Big],$$
where $E_j$ denotes expectation with respect to the true density $p_{j0}$, given by $p_{j0}(x) = p_j(x, \beta_0, g_0)$, where $\beta_0$ and $g_0$ are the true values of β and g. Here, and in what follows, all derivatives are evaluated at the true values of the parameters.
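The probabilities $\pi_j = \int f_j(x, \beta) g(x)\, dx$ and the retrospective densities $p_j$ can be approximated by simple quadrature. The two-class logistic regression function and standard normal covariate density below are illustrative assumptions:

```python
import numpy as np

# Grid approximation of pi_j and p_j for a two-class logistic model with
# a standard normal covariate density g (all modelling choices illustrative).
x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]
g = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # covariate density g
f1 = 1.0 / (1.0 + np.exp(-(-2.0 + x)))         # f_1(x, beta) = P(Y = y_1 | x)
f2 = 1.0 - f1                                  # f_2(x, beta)

pi1 = float(np.sum(f1 * g) * dx)               # pi_1 = ∫ f_1 g dx
pi2 = float(np.sum(f2 * g) * dx)               # pi_2 = ∫ f_2 g dx
p1 = f1 * g / pi1                              # density of x given Y = y_1
```

On the grid, `pi1 + pi2` is 1 up to quadrature error, and `p1` integrates to 1, as the retrospective density must.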
Also, for any finite-dimensional family $\{g(x, t)\}$ of densities with $g(x, t_0) = g_0(x)$, we have
$$\dot{l}_{j,\eta} = \frac{\partial \log g(x, t)}{\partial t} - E_j\Big[\frac{\partial \log g(X, t)}{\partial t}\Big].$$
It follows by the arguments of Bickel et al. (1993, p. 52) that the nuisance tangent space is of the form
$$T_\eta = \big\{ \big(h - E_1[h], \ldots, h - E_J[h]\big):\ h \in L_{2,k}(G_0) \big\},$$
where $dG_0 = g_0\, dx$, and $L_{2,k}(G_0)$ is the space of all k-dimensional functions f satisfying the condition $\int \|f\|^2\, dG_0(x) < \infty$. The efficient score, the projection of $\dot{l}_\beta$ onto the orthogonal complement of $T_\eta$, is described in our first theorem. In the theorem, we use the notation
$$\phi_l(x) = \sum_{j=1}^J \frac{w_j}{\pi_{j0}}\, \dot{l}_{\beta,jl}(x)\, f_j(x, \beta_0),$$
where $\dot{l}_{\beta,jl}$ is the $l$th element of $\dot{l}_{j,\beta}$. Then we have the following result:

Theorem 1. The efficient score has $(j, l)$ element
$$\dot{l}^*_{jl} = \dot{l}_{\beta,jl} - \big( h^*_l - E_j[h^*_l] \big),$$
where $h^*_l$ is any solution in $L_2(G_0)$ of the operator equation (4). A proof is given in Section 5.1.

It remains to identify a solution to (4). Define
$$f^*(x) = \sum_{j=1}^J \frac{w_j f_j(x, \beta_0)}{\pi_{j0}}, \qquad P_j(x) = \frac{w_j f_j(x, \beta_0)}{\pi_{j0}\, f^*(x)},$$
together with $W = \operatorname{diag}(w_1, \ldots, w_J)$, $V = (v_{rs})$ with $v_{rs} = \int P_r P_s f^*\, dG_0$, and $M = W - V$.
Note that the row and column sums of M are zero, since $\sum_s P_s(x) = 1$, so that $\sum_s v_{rs} = \int P_r f^*\, dG_0 = w_r$. Using these definitions and (3), we see that $h^*_l$ will be of the form
$$h^*_l(x) = \sum_{j=1}^J c_j P_j(x)$$
for some constants $c_1, \ldots, c_J$. In order that $h^*_l$ satisfy (4), the constants must satisfy the linear system $Mc = d$, where $d = (d_j)$ with $d_j = (P_j, \phi_l)_2$. Our next result gives the information bound.
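The singular system determining the constants can be illustrated numerically. The matrix below is an illustrative stand-in with the stated zero row- and column-sum property (a graph-Laplacian construction, not the actual $W - V$ of the text); because such an M is singular, a pseudoinverse yields the minimum-norm solution:

```python
import numpy as np

# Toy instance of a singular system M c = d in which M has zero row and
# column sums, so c is determined only up to an additive constant.
rng = np.random.default_rng(3)
J = 4
A = rng.uniform(0.5, 1.5, size=(J, J))
A = (A + A.T) / 2
M = np.diag(A.sum(axis=1)) - A        # symmetric, PSD, zero row/col sums
d = M @ rng.normal(size=J)            # a consistent right-hand side

c = np.linalg.pinv(M) @ d             # minimum-norm solution of M c = d
```

Any other solution differs from `c` by a constant vector, which is harmless here because adding a constant to $h^*_l$ does not change the centred quantities $h^*_l - E_j[h^*_l]$ entering the efficient score.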
Theorem 2. The inverse of the information bound B is given by (7). See Section 5.2 for a proof.
3.2. Efficiency of the Scott-Wild estimator in case-control studies. Suppose we have J disease states (typically J = 2, with disease states case and control), and we choose $n_j$ individuals at random from disease population j, $j = 1, \ldots, J$, observing covariates $x_{1j}, \ldots, x_{n_j,j}$ for the individuals sampled from population j. Also suppose that we have a regression function $f_j(x, \beta)$, $j = 1, \ldots, J$, giving the conditional probability that an individual with covariates x has disease state j. The unconditional density g of the covariates is unspecified. The true values of β and g are denoted by $\beta_0$ and $g_0$, and the true probability of being in disease state j is $\pi_{j0} = \int f_j(x, \beta_0) g_0(x)\, dx$. The log-likelihood under the case-control sampling scheme is given by Scott and Wild (2001). Scott and Wild show that the semi-parametric MLE of β is the "β" part of the solution of the estimating equation
$$\frac{\partial l^*(\theta)}{\partial \theta} = 0, \qquad (9)$$
where $\theta = (\beta, \rho)$, $\rho = (\rho_1, \ldots, \rho_{J-1})$, and
$$l^*(\theta) = \sum_{j=1}^J \sum_{i=1}^{n_j} \log P^*_j(x_{ij}, \beta, \rho).$$
A Taylor series argument shows that the solution of (9) is an asymptotically linear estimate. Thus, to estimate β, we are treating the function $l^*(\theta)$ as though it were a log-likelihood. Moreover, Scott and Wild indicate that we can obtain a consistent estimate of the standard error by using the second derivative $-\partial^2 l^*(\theta)/\partial \theta \partial \theta^T$, which they call the "pseudo-information matrix". Now let $n = n_1 + \cdots + n_J$, let the $n_j$'s converge to infinity with $n_j/n \to w_j$, $j = 1, \ldots, J$, and let $\rho_0 = (\rho_{01}, \ldots, \rho_{0,J-1})^T$, where $\exp(\rho_{0j}) = (w_j/\pi_{0j})/(w_J/\pi_{0J})$. It follows from the law of large numbers and the results of Scott and Wild that the asymptotic variance of the estimate of β is the ββ block of the inverse of the matrix
$$I^* = -\lim_{n \to \infty} \frac{1}{n}\, \frac{\partial^2 l^*(\theta)}{\partial \theta \partial \theta^T},$$
where all derivatives are evaluated at $(\beta_0, \rho_0)$. Using the partitioned matrix inverse formula, the ββ block of $(I^*)^{-1}$ is the inverse of
$$I^*_{\beta\beta} - I^*_{\beta\rho} (I^*_{\rho\rho})^{-1} I^*_{\rho\beta}, \qquad (12)$$
where $I^*$ is partitioned as
$$I^* = \begin{pmatrix} I^*_{\beta\beta} & I^*_{\beta\rho} \\ I^*_{\rho\beta} & I^*_{\rho\rho} \end{pmatrix}.$$
To prove the efficiency of the estimator, we show that the information bound (7) coincides with the asymptotic variance (12).
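The pseudo-information idea — treating $l^*(\theta)$ as a log-likelihood and estimating standard errors from its negative Hessian at the solution — can be sketched generically. The quadratic objective below is an illustrative stand-in for the Scott-Wild pseudo-log-likelihood, chosen so its curvature is known; the finite-difference Hessian is the generic device:

```python
import numpy as np

def num_hessian(f, theta, h=1e-5):
    """Central finite-difference Hessian of a scalar function f at theta."""
    k = len(theta)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            t = np.array(theta, dtype=float)
            def g(si, sj):
                t2 = t.copy()
                t2[i] += si * h
                t2[j] += sj * h
                return f(t2)
            H[i, j] = (g(1, 1) - g(1, -1) - g(-1, 1) + g(-1, -1)) / (4 * h * h)
    return H

# Stand-in "pseudo-log-likelihood" with known curvature matrix A.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
ell_star = lambda th: -0.5 * th @ A @ th

theta_hat = np.zeros(2)                          # its maximiser
I_pseudo = -num_hessian(ell_star, theta_hat)     # pseudo-information matrix
se = np.sqrt(np.diag(np.linalg.inv(I_pseudo)))   # standard errors
```

For this quadratic stand-in, `I_pseudo` recovers `A` exactly (up to rounding), so the standard errors match the known curvature, mirroring how the pseudo-information matrix is used in practice.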
To prove this, the following representation of the matrix $I^*$ will be useful. Let S be the $J \times k$ matrix with $(j, l)$ element $S_{jl} = \partial \log f_j(x, \beta)/\partial \beta_l\,|_{\beta = \beta_0}$ and jth row $S_j$, and let E be the $J \times k$ matrix with $(j, l)$ element $E_j[S_{jl}]$. Also note that $P_j(x) = P^*_j(x, \beta_0, \rho_0)$, let $U = WE - \int P P^T S\, f^*\, dG_0$, and write $P = (P_1, \ldots, P_J)^T$. Then we have the following theorem:

Theorem 3.
1. $I^*_{\beta\beta} = \sum_{j=1}^J w_j E_j[S_j^T S_j] - \int S^T P P^T S\, f^*\, dG_0$;
2. $I^*_{\rho\beta}$ consists of the first $J - 1$ rows of U;
3. $I^*_{\rho\rho}$ consists of the first $J - 1$ rows and columns of $M = W - V$.
A proof is given in Section 5.3.

Now, we show that the information bound coincides with the asymptotic variance. Using the definition
$$\phi_l(x) = \sum_{j=1}^J \frac{w_j}{\pi_{j0}}\, \dot{l}_{\beta,jl}(x)\, f_j(x, \beta_0),$$
we can write $\phi = (S - E)^T P f^*$, and substituting this and the relationship $\dot{l}_\beta = S - E$ into (7), we get (13). Moreover, the solution $h^*_l$ found above allows the integrals in (13) to be expressed in terms of the matrices U, W and V. Substituting this into (13) and using the relationships described in Theorem 3, we get (14). By Theorem 3, the matrix $I^*_{\beta\beta} - I^*_{\beta\rho} (I^*_{\rho\rho})^{-1} I^*_{\rho\beta}$ coincides with the right-hand side of (14), which equals $B^{-1}$; the asymptotic variance therefore attains the information bound, and the Scott-Wild estimator is fully efficient under case-control sampling.

Efficiency of the Scott-Wild estimator under two-stage sampling
In this section, we use the same techniques to show that the Scott-Wild non-parametric MLE is also efficient under two-stage sampling.
4.1 Two-stage sampling. In this sampling scheme, the population is divided into S strata, where stratum membership is completely determined by an individual's response y and possibly some of the covariates x, typically those that are cheap to measure. In the first sampling stage, a random sample of size $n_0$ is taken from the population, and the stratum to which each sampled individual belongs is recorded. For the ith individual, let $Z_{is} = 1$ if the individual is in stratum s, and zero otherwise, and let $n_0^{(s)} = \sum_{i=1}^{n_0} Z_{is}$ be the number of first-stage individuals in stratum s. In the second stage, for $s = 1, \ldots, S$, a subsample of size $n_1^{(s)}$ is selected from the $n_0^{(s)}$ individuals in stratum s, and the covariates of the subsampled individuals are measured; conditional on the $n_0^{(s)}$'s, the subsamples are independent. As in Section 3, let $f(y \mid x, \beta)$ be the conditional density of y given x, which depends on a finite number of parameters β, the parameters of interest. Let g denote the density of the covariates. We will regard g as an infinite-dimensional nuisance parameter. The conditional density of $(x, y)$, conditional on being in stratum s, is, using Bayes' theorem,
$$p_s(x, y, \beta, g) = \frac{I_s(x, y)\, f(y \mid x, \beta)\, g(x)}{Q_s(\beta, g)},$$
where $I_s(x, y)$ is the stratum indicator, having value 1 if an individual having covariates x and response y is in stratum s, and zero otherwise, and $Q_s(\beta, g)$ is the unconditional probability of being in stratum s in the first phase. Introduce the function $Q_s(x, \beta) = \int I_s(x, y) f(y \mid x, \beta)\, dy$. Then
$$Q_s(\beta, g) = \int Q_s(x, \beta)\, g(x)\, dx.$$
Under two-phase sampling, the log-likelihood (15) is given by Wild (1991) and Scott and Wild (2001), with $m_s = n_0^{(s)} - n_1^{(s)}$. Scott and Wild show that the semi-parametric MLE $\hat\beta$ (i.e. the "β" part of the maximiser $(\hat\beta, \hat{g})$ of (15)) is equal to the "β" part of the solution of the estimating equations
$$\frac{\partial \ell^*}{\partial \beta} = 0, \qquad \frac{\partial \ell^*}{\partial \rho} = 0. \qquad (16)$$
The function $\ell^*$ is given by Scott and Wild (2001); it involves probabilities $Q_1(\rho), \ldots, Q_S(\rho)$ defined by $\sum_{s=1}^S Q_s(\rho) = 1$ and $\log Q_s/Q_S = \rho_s$, $s = 1, \ldots, S - 1$, and the quantities $\mu_s(\rho) = c\,\big(n_0 - m_s/Q_s(\rho)\big)$. The $\mu_s$'s depend on the quantity c and the $m_s$'s, and for fixed values of these quantities are completely determined by the $S - 1$ quantities $\rho_s$. Note that the estimating equations (16) are invariant under the choice of c. It will be convenient to take $c = N^{-1}$, where $N = n_0 + n_1$ and $n_1 = \sum_{s=1}^S n_1^{(s)}$. In order to apply the theory of Section 2 to two-phase sampling, we will prove that the asymptotics under two-phase sampling are the same as those under the following multi-sample scheme: a sample of size $n_0$ of stratum indicators is taken from the first-phase stratum distribution, and, independently for $s = 1, \ldots, S$, a sample of size $n_1^{(s)}$ is taken from the conditional distribution of $(x, y)$ given s, having density $p_s(x, y, \beta, g) = I_s(x, y) f(y \mid x, \beta) g(x)/Q_s$. We note that the likelihood under this modified sampling scheme is the same as before, and we show in Theorem 4 below that the asymptotic distribution of the parameter estimates is also the same. It follows that if an estimate is efficient under the multi-sample scheme, it must also be efficient under two-phase sampling.
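The map from ρ to the $Q_s$ is a softmax-type transformation (fix $\rho_S = 0$ and normalise), and the $\mu_s$ follow directly from it. A minimal sketch with illustrative values for $n_0$, the $m_s$, and c:

```python
import numpy as np

def Q_of_rho(rho):
    """Stratum probabilities from the log-odds rho_s = log(Q_s / Q_S),
    with rho_S = 0 implicitly; they sum to one by construction."""
    z = np.append(rho, 0.0)
    e = np.exp(z)
    return e / e.sum()

def mu_of_rho(rho, n0, m, c):
    """mu_s(rho) = c * (n0 - m_s / Q_s(rho)); c is a free scaling constant."""
    return c * (n0 - m / Q_of_rho(rho))

rho = np.array([0.5, -0.3])                      # S = 3 strata, illustrative
Q = Q_of_rho(rho)
mu = mu_of_rho(rho, n0=1000, m=np.array([200, 150, 100]), c=1 / 1300)
```

Rescaling `c` simply rescales every `mu_s`, which is consistent with the invariance of the estimating equations (16) under the choice of c.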
Theorem 4. Let $N = n_0 + n_1$, where $n_1 = \sum_{s=1}^S n_1^{(s)}$, and suppose that $n_0/N \to w_0$ and $n_1^{(s)}/N \to w_s$ for $s = 1, \ldots, S$. Let $\hat\theta$ be the solution of the estimating equation (16), and let $\theta_0$ be the solution of the corresponding limiting equation, where $E_s$ denotes expectation with respect to $p_s$. Then, under both the two-phase and the multi-sample schemes, $\sqrt{N}(\hat\theta - \theta_0)$ converges in distribution to a multivariate normal with mean zero and variance $(I^*)^{-1} V (I^*)^{-1}$; the matrices $I^*$ and V, and a proof, are given in Section 5.4.

4.2 The information bound. Now we derive the information bound for two-stage sampling. By the arguments of Section 4.1, the information bound for two-phase sampling is the same as that for the case of independent sampling from the $S + 1$ densities $p_s(x, y, \beta, g)$, $s = 0, 1, \ldots, S$, where $p_0$ is the distribution of the first-phase stratum indicators. First, we identify the form of the nuisance tangent space (NTS) for this problem. As in Section 3, we see that the score functions for this problem are
$$\dot{l}_{s,\beta} = \frac{\partial \log p_s(x, y, \beta, g)}{\partial \beta},$$
where all derivatives are evaluated at the true values, and $E_s$ denotes expectation with respect to the true density $p_s(x, y, \beta_0, g_0)$.
Similarly, if $g(x, t)$ is a finite-dimensional subfamily of densities, the corresponding nuisance score is $\partial \log p_s(x, y, \beta, g(x, t))/\partial t$. Arguing as in Section 3, we see that the NTS consists of all elements of the form $\big(h - E_1[h], \ldots, h - E_S[h]\big)$ with $h \in L_{2,k}(G_0)$, where E denotes expectation with respect to $G_0$.
As before, the efficient score is the projection of $\dot{l}_\beta$ onto the orthogonal complement of the NTS in H; equivalently, it is obtained by minimising the squared distance in H between $\dot{l}_\beta$ and the NTS. An explicit expression for this squared distance is given in (18), where $h_j$ and $S_j$ are the jth elements of h and S respectively. To obtain the projection, we must choose $h_j$ to minimise the term in the braces in (18). Some algebra shows that this term may be written as (19), where $Q_{s0} = \int Q_s(x, \beta_0)\, g_0(x)\, dx$ is the true value of $Q_s$, $(\cdot\,,\cdot)_2$ is the inner product on $L_2(G_0)$, and A is a self-adjoint, non-negative-definite operator on $L_2(G_0)$. As in Section 3, (19) is minimised when $h_j = h^*_j$, where $h^*_j$ is a solution of $A h_j = \phi_j$, which must be of the form
$$h^*_j(x) = \sum_{r=1}^S c_{rj} P_r(x)$$
for constants $c_{rj}$ satisfying the equations (23), where $v_{rs} = \int P_r P_s\, Q^*\, dG_0$, $d_{sj} = (P_s, \phi_j)_2$, $Q^*(x) = \sum_{s=1}^S w_s Q_s(x, \beta_0)/Q_{s0}$, and $P_s(x) = w_s Q_s(x, \beta_0)/(Q_{s0}\, Q^*(x))$. Writing $\Gamma = (\gamma_{rs})$, $C = (c_{rj})$, $D = (d_{rj})$, $W = \operatorname{diag}(w_1, \ldots, w_S)$ and $V = (v_{rs})$, (23) can be expressed in matrix terms as
$$M C = D,$$
where $M = W(I - \Gamma)^{-1} - V$. These results allow us to find the efficient score and hence the information bound, which is described in the following theorem:

Theorem 5. The information bound B is given by (25). The proof is similar to that of Theorem 2 and is omitted.

Efficiency of the Scott-Wild estimator
Let $\hat\theta = (\hat\beta, \hat\rho)$ be the solution of the estimating equations (16). By Theorem 4, under suitable regularity conditions, $\hat\theta$ is asymptotically normal with asymptotic variance $(I^*)^{-1} V (I^*)^{-1}$, where $I^*$ and V are as in Theorem 4. It turns out that the matrix V has a special form involving the non-singular matrix A of Theorem 6 below, so that the asymptotic variance of $\hat\theta$ simplifies, and it follows from the partitioned matrix inverse formula that the asymptotic variance matrix of $\hat\beta$ is the inverse of
$$I^*_{\beta\beta} - I^*_{\beta\rho} (I^*_{\rho\rho})^{-1} I^*_{\rho\beta}, \qquad (27)$$
where $I^*$ is partitioned as
$$I^* = \begin{pmatrix} I^*_{\beta\beta} & I^*_{\beta\rho} \\ I^*_{\rho\beta} & I^*_{\rho\rho} \end{pmatrix}.$$
To demonstrate the efficiency of $\hat\beta$, we must show that (27) and (25) coincide. To do this, we need a more explicit formula for $I^*$. Let S be the $S \times k$ matrix with $(s, j)$ element $\partial \log Q_s(x, \beta)/\partial \beta_j\,|_{\beta = \beta_0}$, and note that $P_s(x) = P^*_s(x, \beta_0, \rho_0)$, where $\rho_0$ satisfies $Q_s(\rho_0) = Q_{s0}$, $s = 1, \ldots, S$. Finally, write $P = (P_1, \ldots, P_S)^T$. Then we have the following theorem:

Theorem 6.
1. $I^*_{\beta\beta} = \sum_{s=1}^S w_s E_s[S_s^T S_s] - \int S^T P P^T S\, Q^*\, dG_0(x)$;
2. let $U = WE - \int P P^T S\, Q^*\, dG_0(x)$; then $I^*_{\rho\beta} = A^T U_0$, where $U_0$ consists of the first $S - 1$ rows of U and A is a non-singular $(S-1) \times (S-1)$ matrix.
The proof is given in Section 5.5.

We now use Theorems 4 and 5 to show that the efficiency bound (25) equals the asymptotic variance (27). Arguing as in Section 3, we get (29). We complete the argument by showing that the term in the braces in (29) is zero. Hence the term in the braces in (29) is zero, the asymptotic variance coincides with the information bound, and so the Scott-Wild estimator has full semi-parametric efficiency.
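The partitioned-inverse step used here can be checked numerically: the ββ block of $(I^*)^{-1}$ equals the inverse of the Schur complement $I^*_{\beta\beta} - I^*_{\beta\rho} (I^*_{\rho\rho})^{-1} I^*_{\rho\beta}$. The matrix below is a random positive-definite stand-in for $I^*$, not the actual pseudo-information matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
k, r = 2, 3                                   # beta and rho block sizes
R = rng.normal(size=(k + r, k + r))
I_star = R @ R.T + (k + r) * np.eye(k + r)    # symmetric positive definite
Ibb, Ibr = I_star[:k, :k], I_star[:k, k:]
Irb, Irr = I_star[k:, :k], I_star[k:, k:]

# Schur complement of the rho-rho block.
schur = Ibb - Ibr @ np.linalg.inv(Irr) @ Irb
# The beta-beta block of the full inverse equals inv(schur).
bb_block = np.linalg.inv(I_star)[:k, :k]
```

This identity is exactly why the asymptotic variance of $\hat\beta$ can be written as the inverse of (27) rather than extracted from the full $(I^*)^{-1}$.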

Proof of Theorem 3
The proof proceeds by direct calculation. We first record two preliminary identities, and then compute the required derivatives, all evaluated at $(\beta_0, \rho_0)$. These calculations establish the three parts of the theorem in turn.

Proof of Theorem 4
Under the two-stage sampling scheme, the joint distribution of $n_0^{(1)}, \ldots, n_0^{(S)}$ is multinomial. Conditional on the $n_0^{(s)}$'s, the random variables $\{(x_{is}, y_{is}),\ i = 1, \ldots, n_1^{(s)},\ s = 1, \ldots, S\}$ are independent, with $\{(x_{is}, y_{is}),\ i = 1, \ldots, n_1^{(s)}\}$ being an i.i.d. sample from the conditional distribution of $(x, y)$, conditional on being in stratum s, having density $p_s(x, y, \beta_0, g_0)$. Then the estimating equations (16) can be written in the form $S_N(\theta) = 0$, where
$$S_N(\theta) = \frac{1}{N} \sum_{i=1}^{n_0} \psi_0(Z_{i1}, \ldots, Z_{iS}, \theta) + \frac{1}{N} \sum_{s=1}^S \sum_{i=1}^{n_1^{(s)}} \psi_s(x_{is}, y_{is}, \theta).$$
Note that the functions $\psi_s$ have mean zero at $\theta_0$. A standard Taylor expansion argument gives
$$0 = S_{Nj}(\hat\theta) = S_{Nj}(\theta_0) + \frac{\partial S_{Nj}(\bar\theta_j)}{\partial \theta^T} (\hat\theta - \theta_0),$$
where $S_{Nj}$ is the jth element of $S_N$ and $\|\bar\theta_j - \theta_0\| \leq \|\hat\theta - \theta_0\|$.
$S_N^{(1)}$ and $S_N^{(2)}$ are the first and second terms above.
Under the alternative multi-sample scheme, $S_N(\theta_0)$ is a normalised sum of independent terms; provided the sample fractions converge to the $w_s$ sufficiently quickly, we see that $S_N$ is asymptotically normal with zero mean and asymptotic variance $V = \sum_{s=0}^S w_s \operatorname{Var}[\psi_s]$. Conversely, under two-phase sampling, the characteristic function of $S_N$ may be decomposed into first- and second-phase contributions. If $n_0 > N^*$, the sum of the second two terms is less than ε in absolute value. Again by the same arguments as above, $E[e^{itS_N^{(1)}}]$ converges to $\exp\{-\frac{1}{2} t^T V_1 t\}$, where $V_1 = w_0 \operatorname{Var}[\psi_0(Z_1, \ldots, Z_S, \theta_0)]$, so that $E[e^{itS_N}]$ converges to $\exp\{-\frac{1}{2} t^T V t\}$, and hence $S_N$ converges in distribution to a multivariate normal with variance $V = V_1 + V_2$.
Proof of Theorem 6

From the definition of $I^*$ in Theorem 4 and the law of large numbers, we get (36). The second term of this expression is zero. Now we evaluate $I^*_{\beta\beta}$. For the ββ submatrix, the third and fourth terms of (36) are zero. Thus, using the derivative of $P^\dagger_s$ with respect to β, we get
$$I^*_{\beta\beta} = \sum_{s=1}^S w_s E_s[S_s^T S_s] - \int S^T P P^T S\, Q^*\, dG_0(x),$$
which proves part 1. Now, consider $I^*_{\rho\beta, rj}$. Again, the third and fourth terms of (36) are zero. Introduce the parameters $\lambda_1, \ldots, \lambda_{S-1}$ defined by $\lambda_r = \log(\mu_r(\rho)/\mu_S(\rho))$, $r = 1, \ldots, S - 1$. Then, as in Theorem 3, we see that $u_{pj}$ is the $(p, j)$ element of U, and so part 2 of the theorem holds with $A_{pr} = \partial \lambda_p/\partial \rho_r$. For the ρρ submatrix, let $\kappa_s = Q_{s0} w_s/c_s$. It follows from (37), as in Section 5.3, that the first term of the expression is $\delta_{pq} w_p - v_{pq}$. Routine calculations using the relationships $\lambda_p = \log(\mu_p/\mu_S)$ and $\mu_p = w_0 - c_p/Q_p$ give
$$\frac{\partial Q_p}{\partial \lambda_q} = \delta_{pq} \kappa_p - \frac{\kappa_p \kappa_q}{\kappa^*},$$
where $\kappa^* = \sum_{p=1}^S \kappa_p$. This representation implies the stated form of $I^*_{\rho\rho}$, which completes the proof.