Two-Stage Adaptive Optimal Design with Fixed First-Stage Sample Size

In adaptive optimal procedures, the design at each stage is an estimate of the optimal design based on all previous data. Asymptotics for regular models with a fixed number of stages are straightforward if one assumes the sample size of each stage goes to infinity with the overall sample size. However, it is not uncommon for a small pilot study of fixed size to be followed by a much larger experiment. We study the large-sample behavior of such studies. For simplicity, we assume a nonlinear regression model with normal errors. We show that the distribution of the maximum likelihood estimate converges to a scale mixture family of normal random variables. Then, for a one-parameter exponential mean function, we derive the asymptotic distribution of the maximum likelihood estimate explicitly and present a simulation to compare the characteristics of this asymptotic distribution with some commonly used alternatives.


Introduction
Elfving [1] introduced a geometric approach for determining a c-optimal design for linear regression models. Kiefer and Wolfowitz [2] developed the celebrated equivalence theorem, which provides an efficient method for verifying whether a design is D-optimal, again for a linear model. These two results were generalized to nonlinear models by Chernoff [3] and White [4], respectively. See Bartroff [5], O'Brien and Funk [6], and references therein for extensions to the geometric and equivalence approaches. Researchers in optimal design have built an impressive body of theoretical and practical tools for linear models based on these early results. However, advances for nonlinear models have not kept pace.
One reason for the prevalence of the linear assumption in optimal design is that the problem can be described explicitly. The goal of optimal design is to determine efficient experiments. Define an approximate design, proposed by Kiefer and Wolfowitz [7], as ξ = {(x_i, w_i)}_{i=1}^{K}, where ξ is a probability measure on X consisting of support points x_i ∈ X with weights w_i. We consider the nonlinear regression model

y_ij = η(x_i, θ) + ε_ij,  ε_ij ~ N(0, σ²),  j = 1, ..., n_i,  i = 1, 2,  (2.1)

where η(x, θ) is some nonlinear mean function. In most practical examples it is necessary to consider a bounded design space; that is, x_i ∈ X = [a, b], −∞ < a < b < ∞. It is assumed that the y_ij are independent conditional on treatment x_i, where x_1 is fixed and x_2 is selected adaptively. Denote the adaptive design by ξ_A = {(x_i, w_i)}_{i=1}^{2}, where w_i = n_i/n. The likelihood for model (2.1) is proportional to

exp{ −(1/2σ²) Σ_{i=1}^{2} Σ_{j=1}^{n_i} [y_ij − η(x_i, θ)]² },

where ȳ_i = n_i^{−1} Σ_{j=1}^{n_i} y_ij are the stage-specific sample means, and the total score function is

S(θ) = Σ_{i=1}^{2} S_i(θ) = σ^{−2} Σ_{i=1}^{2} n_i [ȳ_i − η(x_i, θ)] dη(x_i, θ)/dθ,  (2.3)

where S_i represents the score function for the ith stage.
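The two-stage score above can be sketched numerically. The following is a minimal illustration, assuming the exponential mean η(x, θ) = e^{−θx} of Section 4 for concreteness; the function names are ours, not from the paper.

```python
import numpy as np

# Per-stage score S_i(theta) = n_i * (ybar_i - eta(x_i, theta)) * d_eta(x_i, theta) / sigma^2
# and total score S(theta) = S_1(theta) + S_2(theta).
# The exponential mean eta(x, theta) = exp(-theta * x) is assumed for illustration.

def eta(x, theta):
    return np.exp(-theta * x)

def d_eta(x, theta):
    # derivative of eta with respect to theta
    return -x * np.exp(-theta * x)

def stage_score(ybar, n_i, x_i, theta, sigma2):
    return n_i * (ybar - eta(x_i, theta)) * d_eta(x_i, theta) / sigma2

def total_score(ybars, ns, xs, theta, sigma2):
    return sum(stage_score(yb, n, x, theta, sigma2)
               for yb, n, x in zip(ybars, ns, xs))
```

The MLE solves total_score(...) = 0; in particular, when each stage mean equals its fitted value the score vanishes exactly.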

The Adaptive Optimal Procedure
Fix the first-stage design point x_1 and let θ̃_{n_1} represent an estimate based on the first-stage complete sufficient statistic ȳ_1. The locally optimal design point for the second stage is x* = arg max_{x∈X} M(x, θ). Traditionally, x*|_{θ = θ̂_{n_1}}, where θ̂_{n_1} is the MLE of θ based on the first-stage data, is used to estimate x*. However, when n_1 is small the bias of the MLE can be considerable; therefore, for some mean functions η, using a different estimate would be beneficial. In general, the adaptively selected stage-two treatment is

x_2 = min{b, max[a, x*|_{θ = θ̃_{n_1}}]}.  (2.5)
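As a sketch, the truncation of the estimated optimal point to the design space can be implemented by maximizing the per-subject information numerically and clipping the maximizer to [a, b]. We assume the exponential mean of Section 4, for which x* = 1/θ; the names below are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def info(x, theta, sigma2=1.0):
    """Per-subject information M(x, theta) = [d eta(x, theta)/d theta]^2 / sigma^2
    for the exponential mean eta(x, theta) = exp(-theta * x)."""
    return (x * np.exp(-theta * x)) ** 2 / sigma2

def adaptive_x2(theta_hat, a, b):
    """Maximize M(x, theta_hat) on a wide interval, then truncate the
    maximizer to the design space [a, b], as in the adaptive rule."""
    res = minimize_scalar(lambda x: -info(x, theta_hat),
                          bounds=(1e-8, 10 * b), method="bounded")
    return min(b, max(a, res.x))
```

For the exponential mean the interior maximizer is x* = 1/θ, so adaptive_x2 reduces to min{b, max[a, 1/θ̂]}.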

Fisher's Information
Since x_2 must lie in X = [a, b], a bounded design space, but ȳ_1 ∈ R, there is a positive probability that x_2 will equal a or b. Denote these probabilities by π_a = P(x_2 = a) and π_b = P(x_2 = b), respectively. Then the per-subject information can be written as

M(ξ_A, θ) = σ^{−2} { w_1 [dη(x_1, θ)/dθ]² + w_2 E[dη(x_2, θ)/dθ]² },

where the expectation combines the continuous interior values of x_2 with the point masses π_a and π_b at the boundaries, and x_2 is the random variable defined by the onto transformation (2.5) of ȳ_1.
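The boundary probabilities rarely have convenient closed forms, but they are easy to approximate by simulation. Below is a Monte Carlo sketch, assuming the exponential mean and the adaptive rule of Section 4 (so that the first-stage MLE and x* = 1/θ̂ are explicit); all names are ours.

```python
import numpy as np

def boundary_probs(theta, x1, n1, sigma, a, b, reps=100_000, seed=0):
    """Monte Carlo estimates of pi_a = P(x2 = a) and pi_b = P(x2 = b)
    under eta(x, theta) = exp(-theta * x) and x2 = min{b, max[a, 1/theta_hat]}."""
    rng = np.random.default_rng(seed)
    ybar1 = rng.normal(np.exp(-theta * x1), sigma / np.sqrt(n1), size=reps)
    # theta_hat = -log(ybar1)/x1 is positive only for ybar1 in (0, 1);
    # ybar1 >= 1 pushes x* toward b, ybar1 <= 0 pushes it toward a.
    xstar = np.full(reps, b)
    xstar[ybar1 <= 0] = a
    inside = (ybar1 > 0) & (ybar1 < 1)
    xstar[inside] = -x1 / np.log(ybar1[inside])
    x2 = np.clip(xstar, a, b)
    return float(np.mean(x2 == a)), float(np.mean(x2 == b))
```

With the simulation settings used later in the paper (θ = 1, x_1 = 2, n_1 = 5, σ = .5, a = .25, b = 4), the lower boundary mass is substantial while the upper one is small.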

Asymptotic Properties
We examine three different ways of deriving an asymptotic distribution of the final MLE, which may be used for inference at the end of the study. The first assumes that both n_1 and n_2 are large. The second considers the data from the second stage alone. The third assumes a fixed first-stage sample size and a large second-stage sample size.

Large Stage-1 and Stage-2 Sample Sizes
If dη(x_2, θ)/dθ is bounded and continuous, and provided common regularity conditions hold, then

√n(θ̂_n − θ) converges in distribution to N(0, M^{−1}(ξ*, θ)).  (3.1)

This result is used to justify the common practice of using x*|_{θ = θ̂_{n_1}} to estimate x* in order to make inferences about θ. However, if dη(x_2, θ)/dθ is not bounded and continuous, then it is very difficult to obtain the result in (3.1), and for certain mean functions the result will not hold. In such cases the asymptotic variance in (3.1) must be replaced with lim_{n_1→∞} M^{−1}(ξ_A, θ). Lane et al. [17] examine using the exact Fisher information for an adaptive design ξ_A, M(ξ_A, θ), instead of M(ξ*, θ) in (3.1) to obtain an alternative approximation of the variance of the MLE θ̂_n.

Distribution of the MLE If Only Second-Stage Data Are Considered
Often pilot data are discarded after being used to design a second experiment. The derivation of the distribution of the MLE using only the second-stage data then takes x_2 to be fixed:

√n_2(θ̂_{n_2} − θ) converges in distribution to N(0, M_2^{−1}(x_2, θ)) as n_2 → ∞,  (3.2)

where M_2(x_2, θ) = σ^{−2} [dη(x_2, θ)/dθ]². The estimate θ̂_{n_2} will likely perform poorly in comparison with θ̂_n if n_1 and n_2 are of roughly the same size, but it may conceivably perform quite well when n_1 is much smaller than n. For this reason it represents an informative benchmark distribution.

Fixed First-Stage Sample Size; Large Second-Stage Sample Size
When the first-stage sample size is fixed and the second stage is large we have the following result.
Theorem 3.1. For model (2.1), with x_2 as defined in (2.5), if dη/dθ ≠ 0 for all x ∈ X and θ ∈ Θ, x_2 is an onto function of ȳ_1, and |dη/dθ| < ∞, then, provided common regularity conditions hold, √n(θ̂_n − θ_t) converges in distribution to UQ as n_2 → ∞, where Q ∼ N(0, σ²) and U is a random function of ȳ_1 determined by [dη(x_2, θ_t)/dθ]^{−1}.

Proof. As in classical large sample theory (cf. Ferguson [21] and Lehmann [22]), the score S(θ̂_n) can be expanded around S(θ_t) as

0 = S(θ̂_n) = S(θ_t) + (θ̂_n − θ_t) dS(θ*)/dθ,

where θ_t is the true value of the parameter and θ* ∈ (θ_t, θ̂_n). Solving for √n(θ̂_n − θ_t) gives

√n(θ̂_n − θ_t) = −√n S(θ_t) / [dS(θ*)/dθ].  (3.4)

It can be shown that θ̂_n is consistent for θ_t if n_2 → ∞ and n_1/n → 0, which justifies the expansion leading to (3.4). Now, decompose the right-hand side of (3.4) as

3.7
As n_2 → ∞, S_1/√n → 0, n_2/n → 1, and (1/n) dS_2/dθ converges as n → ∞. Thus, the first term in (3.7) goes to 0 as n → ∞. Write the second term in (3.7) as (3.9). The first term in (3.9) goes to 0. To evaluate the second term, it is important to recognize that ε̄_2 = ȳ_2 − η(x_2, θ) ∼ N(0, σ²/n_2) and ε̄_1 = ȳ_1 − η(x_1, θ) ∼ N(0, σ²/n_1) are independent, and thus the second term converges in distribution to UQ, where U is a random function of ȳ_1 determined by [dη(x_2, θ)/dθ]^{−1} and Q ∼ N(0, σ²). Now, with √w_2 → 1 as n_2 → ∞, the result follows from an application of Slutsky's theorem.
Remark 3.2. Provided dη(x, θ)/dθ is bounded and continuous, UQ is the asymptotic distribution of √n(θ̂_n − θ_t) as n → ∞. The important case for this exposition is presented in Theorem 3.1; however, the two other potential cases can be shown easily.
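The scale-mixture limit of Theorem 3.1 is easy to sample from directly. Below is a sketch, again assuming the exponential mean and truncated adaptive rule of Section 4; U is evaluated at the true θ, and the names are ours.

```python
import numpy as np

def sample_UQ(theta, x1, n1, sigma, a, b, reps=50_000, seed=1):
    """Draw from the limit U * Q: U is the reciprocal of |d eta(x2, theta)/d theta|
    (a function of the first-stage sample mean), and Q ~ N(0, sigma^2) is
    independent of U. Exponential mean eta(x, theta) = exp(-theta * x) assumed."""
    rng = np.random.default_rng(seed)
    ybar1 = rng.normal(np.exp(-theta * x1), sigma / np.sqrt(n1), size=reps)
    xstar = np.full(reps, b)           # ybar1 >= 1 maps to the upper boundary
    xstar[ybar1 <= 0] = a              # ybar1 <= 0 maps to the lower boundary
    inside = (ybar1 > 0) & (ybar1 < 1)
    xstar[inside] = -x1 / np.log(ybar1[inside])
    x2 = np.clip(xstar, a, b)
    U = 1.0 / (x2 * np.exp(-theta * x2))   # |d eta/d theta|^{-1} at the true theta
    Q = rng.normal(0.0, sigma, size=reps)
    return U * Q
```

Because Q is symmetric and independent of U, the resulting mixture is symmetric about zero, which the simulated sample reflects.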

Example: One Parameter Exponential Mean Function
In model (2.1) let η(x, θ) = e^{−θx}, where x ∈ X = [a, b], 0 < a < b < ∞, and θ ∈ (0, ∞). The simplicity of the exponential mean model facilitates our illustration, but it is also important in its own right. For example, Fisher [9] used a variant of this model to examine the information in serial dilutions. Cochran [23] further elaborated on Fisher's application using the same model.
For this illustration we use the MLE of the first-stage data to estimate the second-stage design point. Here,

θ̂_{n_1} = −ln(ȳ_1)/x_1,  ȳ_1 > 0.  (4.1)
Since the locally optimal design point is x* = 1/θ, the adaptively selected second-stage treatment as given by (2.5) is

x_2 = min{b, max[a, 1/θ̂_{n_1}]} = min{b, max[a, −x_1/ln(ȳ_1)]}.  (4.2)
Thus, the exact per-subject Fisher information is

M(ξ_A, θ) = σ^{−2} [ w_1 x_1² e^{−2θx_1} + w_2 E(x_2² e^{−2θx_2}) ].
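The expectation over x_2 in this information can be approximated by Monte Carlo. The sketch below is our implementation, generating x_2 from the first-stage sample mean as in (4.2).

```python
import numpy as np

def per_subject_info(theta, x1, n1, n2, sigma, a, b, reps=100_000, seed=2):
    """Monte Carlo approximation of the exact per-subject information
    M(xi_A, theta) = sigma^-2 [w1 x1^2 e^{-2 theta x1} + w2 E(x2^2 e^{-2 theta x2})]
    for the exponential mean, with x2 = min{b, max[a, 1/theta_hat]}."""
    n = n1 + n2
    w1, w2 = n1 / n, n2 / n
    rng = np.random.default_rng(seed)
    ybar1 = rng.normal(np.exp(-theta * x1), sigma / np.sqrt(n1), size=reps)
    xstar = np.full(reps, b)
    xstar[ybar1 <= 0] = a
    inside = (ybar1 > 0) & (ybar1 < 1)
    xstar[inside] = -x1 / np.log(ybar1[inside])
    x2 = np.clip(xstar, a, b)
    fixed = w1 * x1**2 * np.exp(-2 * theta * x1)
    adaptive = w2 * np.mean(x2**2 * np.exp(-2 * theta * x2))
    return float((fixed + adaptive) / sigma**2)
```

A useful sanity check is that the result can never exceed the information of the single best design point, σ^{−2}(1/θ)²e^{−2}.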
The asymptotic distributions of the MLE in Sections 3.1 and 3.2 can be derived easily. For the asymptotic distribution of the MLE in Section 3.3, consider the following corollary. For details on the functions h, v_1, and v_2, see the proof of the corollary.
Corollary 4.1. √n(θ̂_n − θ) converges in distribution to UQ as n → ∞, where the distribution of UQ is defined as follows, with Φ(·) the standard normal cumulative distribution function. Let Ψ(q) = Φ(√n_1 [q − η(x_1, θ)]/σ) and h(s) = s^{−1} e^{θs}. Then if h(a) < h(b),

4.6
If h(b) < h(a), then

4.7
Proof. First, we find the distribution of U, where U = h(z) and the random variable z is defined by (4.9), h(z) = c, for some constant c. Denote the solutions to (4.9) by W(w), Lambert's product log function, and let V(c) = −W(−θ/c)/θ. The W function is real valued on w ≥ −1/e, single valued at w = −1/e, and double valued on w ∈ (−1/e, 0). Moreover, U ∈ [θe, max{h(a), h(b)}] and x_1 ∈ [a, b], 0 < a < b < ∞; therefore V(c) is real valued for all θ ∈ (0, ∞). For simplicity, define v_1 = min V(c) and v_2 = max V(c) for a given c. We present the proof for the cumulative distribution function (CDF) of U and the CDF of UQ for the case where x* ∈ (a, b) and h(a) < h(b). The derivation of the distributions under the alternative cases is tedious and does not differ greatly from this case. Now consider the distribution of UQ. Recall Q ∼ N(0, σ²) and that U and Q are independent. If t ∈ (−∞, 0), then

4.18
The distribution is symmetric; thus the derivation of the CDF for t ∈ (0, ∞) is analogous.
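The solutions of h(z) = z^{−1}e^{θz} = c used in the proof can be computed with Lambert's product log, available as scipy.special.lambertw. Rearranging gives −θz e^{−θz} = −θ/c, so z = −W(−θ/c)/θ on each real branch. The sketch below uses our names.

```python
import numpy as np
from scipy.special import lambertw

def h(z, theta):
    """h(z) = z^{-1} exp(theta * z)."""
    return np.exp(theta * z) / z

def solve_h(c, theta):
    """Both real solutions of h(z) = c, which exist when c > theta * e,
    i.e., when -theta/c lies in (-1/e, 0) where W is double valued."""
    w = -theta / c
    z0 = -np.real(lambertw(w, 0)) / theta    # principal branch
    z1 = -np.real(lambertw(w, -1)) / theta   # lower branch
    return z0, z1
```

The two branches give the two preimages of c under h, one on each side of the minimizer z = 1/θ where h attains θe.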

Comparisons of Asymptotic Distributions
First, consider the distribution described in (3.1), using M(ξ_A, θ) in place of M(ξ*, θ), and the distribution described in (3.2). When n_1 is significantly smaller than n_2, M(ξ_A, θ) and M(x_2, θ) can differ significantly as a function of ȳ_1. This is primarily because M(x_2, θ) is a function of x_2, whereas M(ξ_A, θ) is an average over ȳ_1. Through simulation it can be seen that N(0, M^{−1}(x_2, θ)) is a better approximate distribution of √n(θ̂_n − θ) than N(0, M^{−1}(ξ_A, θ)) for only a small interval of x_2 values, and this interval has very small probability. For these reasons the distribution of the MLE using only the second-stage data, as described in Section 3.2, is not considered further. An asymptotic distribution can be justified for inference if it is approximately equal to the true distribution. In this case the true distribution is that of √n(θ̂_n − θ). However, θ̂_n does not have a closed form, and thus its distribution cannot be obtained analytically or numerically. To approximate this distribution, 10,000 Monte Carlo simulations have been completed for each example to create a benchmark distribution.
Figure 3 plots the three candidate approximate distributions, found exactly using numerical methods, together with the distribution of √n(θ̂_n − θ) approximated using Monte Carlo simulations, for θ = 1, x_1 = 2, σ = .5, a = .25, b = 4, n_1 = 5, and n ∈ {30, 1000}. Note that the y-axis represents P(T_i ≤ t), i = 1, 2, 3, where T_1 is N(0, M^{−1}(ξ*, θ)), T_2 is N(0, M^{−1}(ξ_A, θ)), and T_3 is UQ. When n = 30 it is difficult, graphically, to determine whether T_2 or T_3 provides a better approximation for √n(θ̂_n − θ). It seems that for t ∈ (−4, 0) the distribution T_3 is preferable to T_2; however, for t ∈ (0, 4) the opposite appears to be the case. It is fairly clear that for this example T_1 performs poorly.
When n = 1000, it is clear that T_3 is much closer to √n(θ̂_n − θ) than both T_1 and T_2. Further, comparing the two plots, one can see how the distribution of √n(θ̂_n − θ) has nearly converged to UQ but still differs significantly from T_1 and T_2, as predicted by Theorem 3.1 and Corollary 4.1.
Using only graphics it is difficult to assess which of T_1, T_2, and T_3 is nearest √n(θ̂_n − θ) across a variety of cases. To get a better understanding, the integrated absolute difference of the CDFs of T_1, T_2, and T_3 versus that of √n(θ̂_n − θ) for x_1 = 2, σ = .5, a = .25, b = 4, n_1 ∈ {5, 10, 15}, and n ∈ {30, 50, 100, 400} is presented in Table 1. First, consider the results for θ = .5. The locally optimal stage-1 design point is x_1 = 2 when θ = .5; as a result, this scenario is the most generous to distribution T_1. However, even in this ideal scenario, T_3 outperforms T_1 and T_2 for all values of n_1. In many cases the difference between T_3 and T_1 is quite severe. In this scenario T_3 also outperforms T_2; however, the differences are not great. Next, examine the results for θ = 1 and θ = 1.5. Once again T_3 outperforms T_1 and T_2 in all but 2 cases, and in many cases its advantage is quite significant. Also note that T_2 outperforms T_1 about half the time when θ = 1 and the majority of the time when θ = 1.5. This supports our observation that as the distance between x_1 and x* increases, the performance of T_1 compared with T_2 and T_3 worsens, which indicates a lack of robustness for the commonly used distribution T_1. This lack of robustness is not evident for T_2 and T_3.
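The integrated absolute CDF difference used in Tables 1 and 2 can be computed directly from a Monte Carlo benchmark sample. The following is our sketch of the metric; the benchmark sample and candidates here are illustrative, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def integrated_abs_diff(sample, candidate_cdf, grid):
    """Approximate the integral of |F_benchmark(t) - F_candidate(t)| over the
    (uniform) grid, using the empirical CDF of the benchmark sample."""
    ecdf = np.searchsorted(np.sort(sample), grid, side="right") / len(sample)
    dt = grid[1] - grid[0]
    return float(np.sum(np.abs(ecdf - candidate_cdf(grid))) * dt)

# Illustrative check: a standard normal benchmark against two candidates.
rng = np.random.default_rng(3)
benchmark = rng.normal(0.0, 1.0, size=200_000)
grid = np.linspace(-6.0, 6.0, 2001)
d_same = integrated_abs_diff(benchmark, norm.cdf, grid)            # near 0
d_shift = integrated_abs_diff(benchmark, lambda t: norm.cdf(t - 1.0), grid)
```

For a pure location shift of 1 the exact integrated difference is 1, so d_shift is near 1 while d_same is near 0; smaller values indicate a better candidate approximation.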
One final comparison is motivated by the fact that if n_1 → ∞, then T_1, T_2, and T_3 have the same asymptotic distribution. Although our method is motivated by the scenario where stage 1 is a small pilot study, there is no theoretical reason that T_3 will not perform competitively when n_1 is large. Table 2 presents the integrated differences of the distributions T_2 and T_3 from √n(θ̂_n − θ) for x_1 = 2, θ = 1, σ = .5, a = .25, b = 4, n_1 ∈ {50, 100, 200}, and n ∈ {400, 1000}. T_1 is not included in the table due to its lack of robustness; it can perform better or worse than the other two distributions depending on the value of θ. Even with larger values of n_1, T_3 performs slightly better when n_1 = 50 and 100 and only slightly worse when n_1 = 200, indicating that T_3 is robust for moderately large n_1.

Discussion
Assuming a finite first-stage sample size and a large second-stage sample size, we have shown, for a general one-parameter nonlinear regression model with normal errors, that the asymptotic distribution of the MLE is a scale mixture distribution. We considered only one parameter for simplicity and clarity of exposition. For the one-parameter exponential mean function, the distribution of the adaptively selected second-stage treatment and the asymptotic distribution of the MLE were derived assuming a finite first-stage sample size and a large second-stage sample size. Then the performance of the normalized asymptotic distribution of the MLE, UQ, was analyzed and compared to popular alternatives in a set of simulations.
The distribution of UQ was shown to represent a considerable improvement over the other proposed distributions when n_1 was considerably smaller than n. This was true even when n_1 was moderately large.
Since the optimal choice of n_1 was shown to be of order √n for this model in Lane et al. [17], these findings could have significant implications for many combinations of n_1 and n.
Suppose it is desired that P(D_1 ≤ √n(θ̂_n − θ_t) ≤ D_2) = 1 − α, where 1 − α is the desired confidence level and θ_t is the true parameter. If one were to use the large-sample approximate distribution given in (3.1), then D_1 and D_2, and therefore n, cannot be determined until after stage 1. However, using (3.1) with M(ξ_A, θ) in place of M(ξ*, θ), or using UQ, one can compute the overall sample size necessary to solve for D_1 and D_2 before stage 1 is initiated. One could determine n initially using (3.1) with M(ξ_A, θ) or UQ and then update this calculation after the stage-1 data are available. Such sample size recalculation requires additional theoretical justification and investigation of its practical usefulness.
We have not, in this paper, addressed the efficiency of the estimate θ̂_n. One additional way to improve inference would be to find bias-adjusted estimates that are superior to θ̂_n for finite samples. We have not investigated the impact on inference of estimating the variances in the distributions of UQ, N(0, M^{−1}(ξ*, θ)), N(0, M^{−1}(ξ_A, θ)), and N(0, M^{−1}(x_2, θ)); instead, the distributions themselves are compared. For some details on the question of estimation and consistency see Lane et al. [17] and Yao and Flournoy [20].

Figure 1

Figure 1 illustrates the map from U to z ∈ [a, b] for θ = 1, σ = .5, a = .25, and b = 4. Lambert's product log function (cf. Corless et al. [24]) is defined by the solutions W(w) of

W(w) e^{W(w)} = w.  (4.8)

Figure 2
Figure 2 plots the CDF of U for θ = 1, x_1 = 2, n_1 = 5, σ = .5, a = .25, and b = 4. The distribution is a piecewise function with discontinuities at the boundary points a and b.