Computational and Mathematical Methods in Medicine (CMMM), 2012, Article ID 639124
ISSN 1748-670X, 1748-6718. Hindawi Publishing Corporation. doi:10.1155/2012/639124

Research Article

Let Continuous Outcome Variables Remain Continuous

Enayatollah Bakhshi (1), Brian McArdle (2), Kazem Mohammad (3), Behjat Seifi (4), Akbar Biglarian (1)

(1) Department of Statistics and Computer, University of Social Welfare and Rehabilitation Sciences, Tehran 1985713834, Iran (uswr.ac.ir)
(2) Department of Statistics, The University of Auckland, Private Bag 92010, Auckland, New Zealand (auckland.ac.nz)
(3) Department of Biostatistics, School of Public Health and Institute of Public Health Research, Tehran University of Medical Sciences, Tehran, Iran (tums.ac.ir)
(4) Department of Physiology, School of Medicine, Tehran University of Medical Sciences, Tehran, Iran (tums.ac.ir)

Academic Editor: Alberto Guillén

Dates recorded in the article metadata: 8 November 2011; 21 February 2012; 29 February 2012; 29 May 2012.

Copyright © 2012 Enayatollah Bakhshi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The complementary log-log model is an alternative to the logistic model. In many areas of research, the outcome data are continuous. We aim to provide a procedure that allows the researcher to estimate the coefficients of the complementary log-log model without dichotomizing and without loss of information. We show that the sample size required for a specific power is substantially smaller for the proposed approach than for the dichotomizing method. We find that estimators derived from the proposed method are consistently more efficient than those from the dichotomizing method. To illustrate the use of the proposed method, we employ data from the National Health Survey in Iran (NHSI).

1. Introduction

Recently, logistic regression has become a popular tool in biomedical studies. The parameter in logistic regression has the interpretation of a log odds ratio, which is easy for practitioners such as physicians to understand. The probit and complementary log-log models are alternatives to the logistic model. For a covariate X and a binary response variable Y, let $\pi(x) = P(Y = 1 \mid X = x)$. A model related to the complementary log-log link is the log-log link. For it, $\pi(x)$ approaches 0 sharply but approaches 1 slowly. When the complementary log-log model holds for the probability of a success, the log-log model holds for the probability of a failure.

These models use a categorical (dichotomous or polytomous) outcome variable. In many areas of research, however, the outcome data are continuous. Many researchers have no hesitation in dichotomizing a continuous variable, but this practice discards within-category information. Several investigators have noted the disadvantages of dichotomizing both independent and outcome variables. Ragland showed that the magnitude of the odds ratio and the statistical power depend on the cutpoint used to dichotomize the response variable. From a clinical point of view, binary outcomes may be preferred because they (1) set diagnostic criteria for disease and (2) offer a simpler interpretation of common effect measures from statistical models, such as odds ratios and relative risks. However, these advantages come at the cost of lost information. From a statistical point of view, this loss of information means that a larger sample is required to attain a prespecified power.

Moser and Coombs provided a closed-form relationship that allows a direct comparison between the logistic and linear regression coefficients. They also provided a procedure that allows the researcher to analyze the original continuous outcome without dichotomizing. To date, no method has been available that applies the complementary log-log model without dichotomizing and without loss of information.

We aim to (a) provide a method that allows the researcher to estimate the coefficients of the complementary log-log model without dichotomizing and without loss of information, (b) show that the coefficient of the complementary log-log model can be interpreted in terms of the regression coefficients, (c) demonstrate that the coefficient estimates from this method have smaller variances and shorter confidence intervals than the dichotomizing method.

2. Methods

2.1. Model

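Let $y_1, y_2, \ldots, y_n$ be $n$ independent observations on $y$, and let $x_1, x_2, \ldots, x_{p-1}$ be $p-1$ predictor variables thought to be related to the response variable $y$. The multiple linear regression model for the $i$th observation can be expressed as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + E_i, \qquad i = 1, 2, \ldots, n,$$

or

$$y_i = x_i'\beta + E_i, \qquad i = 1, 2, \ldots, n,$$

where $x_i' = (1, x_{i1}, x_{i2}, \ldots, x_{i,p-1})$. To complete the model, we make the following assumptions:

(i) $E(E_i) = 0$ for $i = 1, 2, \ldots, n$,

(ii) $\operatorname{var}(E_i) = \sigma^2$ for $i = 1, 2, \ldots, n$,

(iii) the independent $E_i$ follow an extreme value distribution for $i = 1, 2, \ldots, n$.

Writing the model for each of the $n$ observations in matrix form, we have

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} E_1 \\ E_2 \\ \vdots \\ E_n \end{bmatrix},$$

or $y = X\beta + E$. The preceding three assumptions on $E_i$ and $y_i$ can be expressed in terms of this model:

(i) $E(E) = 0$,

(ii) $\operatorname{cov}(E) = \sigma^2 I$,

(iii) each $E_i$ is extreme value $(0, \sigma^2)$ for $i = 1, 2, \ldots, n$.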

2.2. (Largest) Extreme Value Distribution

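The PDF and CDF of the extreme value distribution are given by

$$f(y \mid x'\beta, \sigma) = \frac{\pi}{\sigma\sqrt{6}} \exp\left\{ -\frac{y - x'\beta + k\sigma}{\sigma} \cdot \frac{\pi}{\sqrt{6}} - \exp\left( -\frac{y - x'\beta + k\sigma}{\sigma} \cdot \frac{\pi}{\sqrt{6}} \right) \right\},$$

$$P(y \le c) = \exp\left\{ -\exp\left( -\frac{c - x'\beta + k\sigma}{\sigma} \cdot \frac{\pi}{\sqrt{6}} \right) \right\}, \tag{6}$$

with $-\infty < x < \infty$, $\sigma > 0$, and $k \approx 0.45$.

It is easy to check that

$$\omega_j = \frac{\ln \pi_1}{\ln \pi_2} = \frac{\ln P(y \le c \mid x)}{\ln P(y \le c \mid x_{(-1,j)})} = \frac{-\exp\left( -\dfrac{c - x'\beta + k\sigma}{\sigma} \cdot \dfrac{\pi}{\sqrt{6}} \right)}{-\exp\left( -\dfrac{c - x_{(-1,j)}'\beta + k\sigma}{\sigma} \cdot \dfrac{\pi}{\sqrt{6}} \right)} = \exp\left( \frac{\pi}{\sqrt{6}} \frac{\beta_j}{\sigma} \right), \tag{7}$$

so that $\pi_1 = \pi_2^{\exp\left((\pi/\sqrt{6})(\beta_j/\sigma)\right)}$, where

$$x' = (1, x_1, \ldots, x_j, \ldots, x_{p-1}), \qquad x_{(-1,j)}' = (1, x_1, \ldots, x_j - 1, \ldots, x_{p-1}), \qquad \beta = (\beta_0, \beta_1, \ldots, \beta_j, \ldots, \beta_{p-1})'.$$

Returning to a random sample of observations $(y_1, y_2, \ldots, y_n)$, we conclude that the PDF and CDF of each independent $y_i$ are given by (6), and the sample version of equality (7) is

$$\frac{\ln \hat{\pi}_1}{\ln \hat{\pi}_2} = \exp\left( \frac{\pi}{\hat{\sigma}\sqrt{6}} \hat{\beta}_j \right),$$

where the estimate $\hat{\beta}_j$ is the $(j+1)$th element of the vector $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_j, \ldots, \hat{\beta}_{p-1})'$. It is readily shown that the results also hold true for the smallest extreme value distribution (Appendix A).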
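The identity in (7) can be checked numerically. The following is a minimal sketch, assuming a largest extreme value (Gumbel) error with mean 0 and standard deviation σ, so that k equals the Euler-Mascheroni constant times √6/π. The function names (`cdf_largest_ev`, `omega_from_cdf`, `omega_closed_form`) are illustrative and do not come from the article.

```python
import math

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant
K = EULER_GAMMA * math.sqrt(6) / math.pi   # mean-centering offset, about 0.45


def cdf_largest_ev(c, mean, sigma):
    """P(y <= c) for a largest extreme value variable with given mean and sd."""
    z = (c - mean + K * sigma) / sigma * math.pi / math.sqrt(6)
    return math.exp(-math.exp(-z))


def omega_from_cdf(c, beta, x, j, sigma):
    """ln(pi_1) / ln(pi_2), where pi_2 uses x with its j-th entry lowered by 1."""
    x_shift = list(x)
    x_shift[j] -= 1
    mean_1 = sum(b * v for b, v in zip(beta, x))
    mean_2 = sum(b * v for b, v in zip(beta, x_shift))
    log_pi_1 = math.log(cdf_largest_ev(c, mean_1, sigma))
    log_pi_2 = math.log(cdf_largest_ev(c, mean_2, sigma))
    return log_pi_1 / log_pi_2


def omega_closed_form(beta_j, sigma):
    """Closed form exp((pi / sqrt(6)) * beta_j / sigma)."""
    return math.exp(math.pi / math.sqrt(6) * beta_j / sigma)
```

Under these assumptions the ratio does not depend on the cutoff c or on the other covariates, which is the property the proposed method relies on.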

2.3. The Proposed Confidence Intervals

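Let

$$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_j, \ldots, \hat{\beta}_{p-1})' = (X'X)^{-1}X'Y, \quad j = 0, \ldots, p-1, \qquad \hat{\sigma}^2 = \frac{Y'\left(I_n - X(X'X)^{-1}X'\right)Y}{n-p}.$$

According to the preceding three assumptions on $E_i$ and $y_i$, we obtain

$$E(\hat{\beta}) = E\left[(X'X)^{-1}X'Y\right] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta,$$

$$\begin{aligned}
E(\hat{\sigma}^2) &= \frac{1}{n-p} E\left( Y'\left(I_n - X(X'X)^{-1}X'\right)Y \right) \\
&= \frac{1}{n-p} \left\{ \operatorname{tr}\left[\left(I_n - X(X'X)^{-1}X'\right)\sigma^2 I\right] + E(Y)'\left[I_n - X(X'X)^{-1}X'\right]E(Y) \right\} \\
&= \frac{1}{n-p} \left\{ \sigma^2 \operatorname{tr}\left[I_n - X(X'X)^{-1}X'\right] + \beta'X'\left[I_n - X(X'X)^{-1}X'\right]X\beta \right\} \\
&= \frac{1}{n-p} \left\{ \sigma^2 \left[n - \operatorname{tr}\left(X(X'X)^{-1}X'\right)\right] + \beta'X'X\beta - \beta'X'X(X'X)^{-1}X'X\beta \right\} \\
&= \frac{1}{n-p} \left\{ \sigma^2 \left[n - \operatorname{tr}\left(X(X'X)^{-1}X'\right)\right] + \beta'X'X\beta - \beta'X'X\beta \right\} \\
&= \frac{1}{n-p} \sigma^2 \left[n - \operatorname{tr}(I_p)\right] = \frac{1}{n-p} \sigma^2 (n-p) = \sigma^2.
\end{aligned}$$

Therefore, $\hat{\beta}$ and $\hat{\sigma}^2$ are unbiased estimators of $\beta$ and $\sigma^2$.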

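We have assumed that $E_i$ is distributed as an extreme value, and we approximate the extreme value distribution of the errors $E_i$ by the normal distribution. For normally distributed observations, $\hat{\beta}_j/(\hat{\sigma}\sqrt{\delta_j})$ follows a noncentral $t$ distribution with $n-p$ degrees of freedom and noncentrality parameter $-\infty < \beta_j/(\sigma\sqrt{\delta_j}) < \infty$, so that

$$1 - \alpha = P\left\{ t_{1-(\alpha/2)}\left[n-p, \frac{\beta_j}{\sigma\sqrt{\delta_j}}\right] < \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{\delta_j}} < t_{\alpha/2}\left[n-p, \frac{\beta_j}{\sigma\sqrt{\delta_j}}\right] \right\},$$

where $t_{\alpha/2}[r, s]$ represents the $100(1-(\alpha/2))$ percentile point of a noncentral $t$ distribution with $r$ degrees of freedom and noncentrality parameter $-\infty < s < \infty$, and $\delta_j$ is the $(j+1)$st diagonal element of $(X'X)^{-1}$. We approximate the percentiles of the noncentral $t$ distribution by the standard normal percentiles, which gives

$$1 - \alpha = P\left\{ \frac{\dfrac{\beta_j}{\sigma\sqrt{\delta_j}} - z_{\alpha/2}\left[1 + \dfrac{\beta_j^2/(\sigma^2\delta_j) - z_{\alpha/2}^2}{2(n-p)}\right]^{1/2}}{1 - \dfrac{z_{\alpha/2}^2}{2(n-p)}} < \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{\delta_j}} < \frac{\dfrac{\beta_j}{\sigma\sqrt{\delta_j}} + z_{\alpha/2}\left[1 + \dfrac{\beta_j^2/(\sigma^2\delta_j) - z_{\alpha/2}^2}{2(n-p)}\right]^{1/2}}{1 - \dfrac{z_{\alpha/2}^2}{2(n-p)}} \right\},$$

$$\left(\frac{\beta_j}{\sigma}\right)_U = \frac{\hat{\beta}_j}{\hat{\sigma}}\left[1 - \frac{z_{\alpha/2}^2}{2(n-p)}\right] + z_{\alpha/2}\left[\delta_j\left(1 + \frac{\hat{\beta}_j^2/(\hat{\sigma}^2\delta_j) - z_{\alpha/2}^2}{2(n-p)}\right)\right]^{1/2},$$

$$\left(\frac{\beta_j}{\sigma}\right)_L = \frac{\hat{\beta}_j}{\hat{\sigma}}\left[1 - \frac{z_{\alpha/2}^2}{2(n-p)}\right] - z_{\alpha/2}\left[\delta_j\left(1 + \frac{\hat{\beta}_j^2/(\hat{\sigma}^2\delta_j) - z_{\alpha/2}^2}{2(n-p)}\right)\right]^{1/2}.$$

Thus, we obtain an approximate $100(1-\alpha)$ percent confidence interval for $\omega_j$:

$$\left\{ \exp\left[\frac{\pi}{\sqrt{6}}\left(\frac{\beta_j}{\sigma}\right)_L\right], \ \exp\left[\frac{\pi}{\sqrt{6}}\left(\frac{\beta_j}{\sigma}\right)_U\right] \right\}.$$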
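The interval above can be computed directly from an ordinary least squares fit. The sketch below is one possible implementation under the stated normal approximation; the function name `omega_confidence_interval` and its arguments are assumptions made for illustration, and NumPy is assumed to be available.

```python
import math
from statistics import NormalDist

import numpy as np


def omega_confidence_interval(X, y, j, alpha=0.05):
    """Point estimate and approximate CI for omega_j from a linear fit.

    X is the n x p design matrix (first column of ones), y the continuous
    outcome, and j the 0-based coefficient index (j = 0 is the intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y                      # least squares estimate
    resid = y - X @ beta_hat
    sigma_hat = math.sqrt(resid @ resid / (n - p))    # unbiased variance estimate
    delta_j = xtx_inv[j, j]                           # (j+1)th diagonal element
    z = NormalDist().inv_cdf(1 - alpha / 2)

    ratio = beta_hat[j] / sigma_hat
    shrink = 1 - z**2 / (2 * (n - p))
    half = z * math.sqrt(delta_j * (1 + (ratio**2 / delta_j - z**2) / (2 * (n - p))))
    lower = ratio * shrink - half                     # (beta_j / sigma)_L
    upper = ratio * shrink + half                     # (beta_j / sigma)_U

    c = math.pi / math.sqrt(6)
    return math.exp(c * ratio), math.exp(c * lower), math.exp(c * upper)
```

Because the interval is built on the scale of β_j/σ and then exponentiated, it is not symmetric around the point estimate of ω_j.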

3. Comparison of the Two Methods

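Let $Y_i$ be a continuous outcome variable. For a fixed value of $C$, we define $Y_i^*$ such that

$$Y_i^* = \begin{cases} 1 & \text{if } Y_i \ge C, \\ 0 & \text{if } Y_i < C. \end{cases} \tag{15}$$

Suppose that $Y_1^*, \ldots, Y_n^*$ form a random sample of observations, and we fit a complementary log-log model

$$\pi_{i1} = P(Y_i^* = 1 \mid x_i) = \exp(-\exp(x_i'\theta)), \qquad \pi_{i2} = P(Y_i^* = 1 \mid x_{(-1,i)}) = \exp(-\exp(x_{(-1,i)}'\theta)),$$

where $x_i = (1, x_{i1}, \ldots, x_{i,p-1})'$ is the $p \times 1$ vector of covariates for the $i$th observation, and $\theta = (\theta_0, \ldots, \theta_{p-1})'$ is the $p \times 1$ vector of unknown parameters. The dichotomized parameter $\omega_j^*$ corresponding to the effect $\theta_j$ is

$$\omega_j^* = \frac{\ln(\pi_1)}{\ln(\pi_2)} = \frac{\ln P(Y^* = 1 \mid x)}{\ln P(Y^* = 1 \mid x_{(-1,j)})} = \frac{\exp(x'\theta)}{\exp(x_{(-1,j)}'\theta)} = \exp(\theta_j), \qquad j = 0, \ldots, p-1. \tag{17}$$

In general, maximum likelihood estimation (MLE) can be used to estimate the parameter $\theta = (\theta_0, \ldots, \theta_{p-1})'$. Let $\hat{\theta} = (\hat{\theta}_0, \ldots, \hat{\theta}_{p-1})'$ be the $p \times 1$ ML estimate of $\theta$, and let $\operatorname{COV}(\hat{\theta})$ be the $p \times p$ covariance matrix of $\hat{\theta}$. Using $\operatorname{COV}(\hat{\theta})$ from (23), one can construct confidence intervals. This matrix has as its diagonal the estimated variances of the ML estimates; the $(j+1)$th diagonal element is $\hat{\sigma}_{\hat{\theta}_j}^2$. Therefore,

$$\hat{\omega}_j^* = \exp(\hat{\theta}_j),$$

and for large samples, $(\hat{\theta}_{jL}, \hat{\theta}_{jU}) = (\hat{\theta}_j - z_{\alpha/2}\hat{\sigma}_{\hat{\theta}_j}, \ \hat{\theta}_j + z_{\alpha/2}\hat{\sigma}_{\hat{\theta}_j})$ is a $100(1-\alpha)$ percent confidence interval for the true $\theta_j$. Then $(\exp(\hat{\theta}_{jL}), \exp(\hat{\theta}_{jU}))$ is a $100(1-\alpha)$ percent confidence interval for the true $\omega_j^*$.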

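We now compare the $\omega_j$ from (7) with the $\omega_j^*$ from (17):

$$\omega_j = \frac{\ln(\pi_1)}{\ln(\pi_2)}, \qquad \omega_j^* = \frac{\ln(\pi_1)}{\ln(\pi_2)},$$

$$\omega_j^* = \omega_j \ \Longrightarrow \ \exp\left(\frac{\pi}{\sqrt{6}} \frac{\beta_j}{\sigma}\right) = \exp(\theta_j) \ \Longrightarrow \ \frac{\pi}{\sqrt{6}} \frac{\beta_j}{\sigma} = \theta_j \quad \text{for all } \beta_j, \theta_j, \sigma.$$

This shows that the coefficient of the complementary log-log model, $\theta_j$, can be interpreted in terms of the regression coefficient $\beta_j$. Note that $\beta$ is related to the responses through the general linear regression model

$$y_i = x_i'\beta + E_i, \qquad i = 1, \ldots, n,$$

where the independent $E_i$ are distributed as an extreme value with mean 0 and variance $\sigma^2 > 0$.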

4. Covariance Matrix of Model Parameter Estimators

4.1. Derivation of $\operatorname{var}(\omega_j^*)$ for Large $n$

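The information matrix of generalized linear models has the form $\mathcal{I} = X'WX$, where $W$ is the diagonal matrix with diagonal elements $w_i = (\partial\mu_i/\partial\eta_i)^2/\operatorname{var}(y_i)$, $y$ is the response variable with independent observations $(y_1, \ldots, y_n)$, $x_{ij}$ denotes the value of predictor $j$ for observation $i$, and

$$\mu_i = E(y_i), \qquad \eta_i = g(\mu_i) = \sum_j \theta_j x_{ij}, \qquad j = 0, 1, \ldots, p-1.$$

The covariance matrix of $\hat{\theta}$ is estimated by $(X'\hat{W}X)^{-1}$.

Maximum likelihood estimation for the complementary log-log model is a special case of the generalized linear models. Let

$$\mu_i = \pi_i = \exp\left(-\exp\left(\sum_j \theta_j x_{ij}\right)\right) \ \Longrightarrow \ \pi_i = \exp(-\exp(\eta_i)),$$

$$\frac{\partial\mu_i}{\partial\eta_i} = (-\exp(\eta_i))\exp(-\exp(\eta_i)) = \pi_i \ln \pi_i, \qquad w_i = \frac{(\pi_i \ln \pi_i)^2}{\pi_i(1-\pi_i)} = \frac{\pi_i(\ln \pi_i)^2}{1-\pi_i},$$

then

$$X'WX = \begin{bmatrix} \sum_{i=1}^n w_i & \sum_{i=1}^n x_{i1} w_i & \cdots & \sum_{i=1}^n x_{i,p-1} w_i \\ \sum_{i=1}^n x_{i1} w_i & \sum_{i=1}^n x_{i1}^2 w_i & \cdots & \sum_{i=1}^n x_{i1} x_{i,p-1} w_i \\ \vdots & \vdots & & \vdots \\ \sum_{i=1}^n x_{i,p-1} w_i & \sum_{i=1}^n x_{i1} x_{i,p-1} w_i & \cdots & \sum_{i=1}^n x_{i,p-1}^2 w_i \end{bmatrix}, \qquad w_i = \frac{\pi_i(\ln \pi_i)^2}{1-\pi_i}. \tag{23}$$

It is readily shown that the results hold true for the largest extreme value distribution (Appendix A).

In large samples, $\operatorname{var}(\hat{\theta}_j)$ approaches $\sigma_{\theta_j}^2 \big|_{\theta = \hat{\theta}}$, which equals the $(j+1)$th diagonal element of $(X'WX)^{-1}$.

By applying the delta method with $f(\hat{\theta}_j) = \exp(\hat{\theta}_j)$, we have

$$\operatorname{var}(\hat{\omega}_j^*) = \operatorname{var}(\exp(\hat{\theta}_j)) = \operatorname{var}(f(\hat{\theta}_j)) \approx \left( \frac{\partial f(\hat{\theta}_j)}{\partial \hat{\theta}_j} \bigg|_{\hat{\theta}_j = \theta_j} \right)^2 \operatorname{var}(\hat{\theta}_j) = (\exp(\theta_j))^2 \times \sigma_{\hat{\theta}_j}^2. \tag{24}$$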
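As an illustration of (23) and (24), the following sketch evaluates the weights, the information matrix, and the delta-method variance for a given design matrix and parameter vector. The helper names `loglog_weights` and `var_omega_star` are hypothetical, NumPy is assumed, and θ is treated as known rather than estimated.

```python
import numpy as np


def loglog_weights(eta):
    """Weights w_i = pi_i (ln pi_i)^2 / (1 - pi_i) with pi_i = exp(-exp(eta_i))."""
    pi = np.exp(-np.exp(eta))
    return pi * np.log(pi) ** 2 / (1 - pi)


def var_omega_star(X, theta, j):
    """Large-sample variance of exp(theta_hat_j) via the delta method."""
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)
    w = loglog_weights(X @ theta)
    info = X.T @ (w[:, None] * X)          # information matrix X'WX
    cov_theta = np.linalg.inv(info)        # covariance matrix of theta_hat
    return np.exp(theta[j]) ** 2 * cov_theta[j, j]
```

In practice θ would be replaced by its ML estimate, which gives the estimated covariance matrix used for the dichotomized confidence intervals.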

4.2. Derivation of $\operatorname{var}(\hat{\omega}_j)$ for Large $n$

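In large samples, from (10), $\hat{\sigma}^2 \to \sigma^2$. Therefore,

$$\operatorname{var}(\hat{\omega}_j) = \operatorname{var}\left(\exp\left(\frac{\pi\hat{\beta}_j}{\hat{\sigma}\sqrt{6}}\right)\right) \approx \operatorname{var}\left(\exp\left(\frac{\pi\hat{\beta}_j}{\sigma\sqrt{6}}\right)\right).$$

In addition, $\operatorname{var}(\hat{\beta}_j) = \sigma^2\delta_j$.

By applying the delta method with $g(\hat{\beta}_j) = \exp(\pi\hat{\beta}_j/(\sigma\sqrt{6}))$, we have

$$\operatorname{var}(\hat{\omega}_j) \approx \operatorname{var}(g(\hat{\beta}_j)) \approx \left( \frac{\partial g(\hat{\beta}_j)}{\partial \hat{\beta}_j} \bigg|_{\hat{\beta}_j = \beta_j} \right)^2 \operatorname{var}(\hat{\beta}_j) = \left( \frac{\pi}{\sigma\sqrt{6}} \exp\left(\frac{\pi\beta_j}{\sigma\sqrt{6}}\right) \right)^2 \sigma^2\delta_j = \frac{\pi^2}{6}\,\delta_j \left( \exp\frac{\pi\beta_j}{\sigma\sqrt{6}} \right)^2. \tag{26}$$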

5. Sample Sizes Saving

5.1. The Power for the Dichotomized Method

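In large samples, $\hat{\sigma}_{\hat{\theta}_j}$ converges to $\sigma_{\hat{\theta}_j}$ almost surely. Therefore, for a given value of $\omega_j = \exp\theta_j$ (i.e., $\ln\omega_j = \theta_j$), the power is given by

$$\begin{aligned}
p(\omega_j) &= P\{\text{rejection of } \omega_j = 1 \mid \omega_j \ne 1\} \\
&= P\{\exp(\hat{\theta}_{jL}) > 1 \mid \theta_j\} + P\{\exp(\hat{\theta}_{jU}) < 1 \mid \theta_j\} \\
&= P\{\hat{\theta}_j > z_{\alpha/2}\sigma_{\hat{\theta}_j} \mid \theta_j\} + P\{\hat{\theta}_j < -z_{\alpha/2}\sigma_{\hat{\theta}_j} \mid \theta_j\} \\
&= P\left\{Z > \frac{z_{\alpha/2}\sigma_{\hat{\theta}_j} - \ln\omega_j}{\sigma_{\hat{\theta}_j}}\right\} + P\left\{Z < \frac{-z_{\alpha/2}\sigma_{\hat{\theta}_j} - \ln\omega_j}{\sigma_{\hat{\theta}_j}}\right\} \\
&= P\left\{Z > z_{\alpha/2} - \frac{\ln\omega_j}{\sigma_{\hat{\theta}_j}}\right\} + P\left\{Z < -z_{\alpha/2} - \frac{\ln\omega_j}{\sigma_{\hat{\theta}_j}}\right\} \\
&= P\{Z > z_1\} + P\{Z < -z_2\},
\end{aligned}$$

where

$$z_1 = z_{\alpha/2} - \frac{\ln\omega_j}{\sigma_{\hat{\theta}_j}}, \qquad z_2 = z_{\alpha/2} + \frac{\ln\omega_j}{\sigma_{\hat{\theta}_j}}.$$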
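The power expression can be evaluated with the standard normal CDF. The sketch below uses only the Python standard library; the function name `power_two_sided` and the argument `se_log_omega` (the large-sample standard error of the estimated log effect) are illustrative choices, not notation from the article.

```python
import math
from statistics import NormalDist


def power_two_sided(omega, se_log_omega, alpha=0.05):
    """P{Z > z1} + P{Z < -z2} with z1, z2 = z_{alpha/2} -/+ ln(omega) / se."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    shift = math.log(omega) / se_log_omega
    z1 = z - shift
    z2 = z + shift
    return (1 - std_normal.cdf(z1)) + std_normal.cdf(-z2)
```

When ω equals 1 the expression reduces to the test size α, and it grows as ω moves away from 1 or as the standard error shrinks.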

5.2. The Power for the Proposed Method

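In large samples, $\hat{\sigma}$ converges to $\sigma$ almost surely. Therefore, for a given value of $\omega_j = \exp(\pi\beta_j/(\sigma\sqrt{6}))$ (i.e., $\beta_j = \sigma\ln\omega_j\sqrt{6}/\pi$), the power is given by

$$\begin{aligned}
p(\omega_j) &= P\left\{\exp\left(\frac{\pi}{\sqrt{6}}\left(\frac{\beta_j}{\sigma}\right)_L\right) > 1 \,\Big|\, \omega_j\right\} + P\left\{\exp\left(\frac{\pi}{\sqrt{6}}\left(\frac{\beta_j}{\sigma}\right)_U\right) < 1 \,\Big|\, \omega_j\right\} \\
&= P\left\{\hat{\beta}_j > z_{\alpha/2}\sigma\sqrt{\delta_j} \,\Big|\, \beta_j = \frac{\sigma\ln\omega_j\sqrt{6}}{\pi}\right\} + P\left\{\hat{\beta}_j < -z_{\alpha/2}\sigma\sqrt{\delta_j} \,\Big|\, \beta_j = \frac{\sigma\ln\omega_j\sqrt{6}}{\pi}\right\} \\
&= P\left\{Z > \frac{z_{\alpha/2}\sigma\sqrt{\delta_j} - \sigma\ln\omega_j\sqrt{6}/\pi}{\sigma\sqrt{\delta_j}}\right\} + P\left\{Z < \frac{-z_{\alpha/2}\sigma\sqrt{\delta_j} - \sigma\ln\omega_j\sqrt{6}/\pi}{\sigma\sqrt{\delta_j}}\right\} \\
&= P\left\{Z > z_{\alpha/2} - \frac{\sqrt{6}\ln\omega_j}{\pi\sqrt{\delta_j}}\right\} + P\left\{Z < -z_{\alpha/2} - \frac{\sqrt{6}\ln\omega_j}{\pi\sqrt{\delta_j}}\right\} \\
&= P\{Z > z_1\} + P\{Z < -z_2\},
\end{aligned}$$

where

$$z_1 = z_{\alpha/2} - \frac{\sqrt{6}\ln\omega_j}{\pi\sqrt{\delta_j}}, \qquad z_2 = z_{\alpha/2} + \frac{\sqrt{6}\ln\omega_j}{\pi\sqrt{\delta_j}}.$$

Because the proposed method is based on continuous rather than dichotomized data, it is likely to be more powerful. We show that the proposed method can produce a substantial sample size saving for a given power. Let

(i) the number of parameters be $p = 2$ (i.e., $\theta = (\theta_0, \theta_1)'$),

(ii) $x_i = (1, x_{i1})'$ with $x_{i1} \in \{-a + 2am/(g-1) : m = 0, \ldots, g-1\}$, that is, $x_{i1}$ follows a discrete uniform distribution with range $(-a, a)$. For simplicity, $a = 2$.

(iii) The total sample sizes are $n$ and $n^*$ for the proposed and dichotomized methods, respectively. These samples consist of $k$ and $k^*$ sets of the $g$ uniformly distributed points for the proposed and dichotomized methods, respectively. That is, $n = gk$ and $n^* = gk^*$.

Then

$$\delta_j = \left[k\sum_{i=1}^g (x_{1i} - \bar{x}_{1\cdot})^2\right]^{-1}, \qquad j = 1,$$

and from (23), with $w_i = \pi_i(\ln\pi_i)^2/(1-\pi_i)$,

$$\sigma_{\hat{\theta}_j}^2 = \frac{\sum_{i=1}^g w_i}{k^*\left\{\sum_{i=1}^g x_{1i}^2 w_i \sum_{i=1}^g w_i - \left[\sum_{i=1}^g x_{1i} w_i\right]^2\right\}}.$$

We require the same power for the two methods, $z_1 = z_1^*$ and $z_2 = z_2^*$:

$$z_{\alpha/2} \mp \frac{\ln\omega_j}{\sigma_{\hat{\theta}_j}} = z_{\alpha/2} \mp \frac{\sqrt{6}\ln\omega_j}{\pi\sqrt{\delta_j}} \ \Longrightarrow \ \frac{\pi}{\sqrt{6}}\sqrt{\delta_j} = \sigma_{\hat{\theta}_j}, \qquad j = 1,$$

$$\frac{\pi^2}{6}\left[k\sum_{i=1}^g (x_{1i} - \bar{x}_{1\cdot})^2\right]^{-1} = \frac{\sum_{i=1}^g w_i}{k^*\left\{\sum_{i=1}^g x_{1i}^2 w_i \sum_{i=1}^g w_i - \left[\sum_{i=1}^g x_{1i} w_i\right]^2\right\}},$$

so the relative sample size is

$$\frac{n^*}{n} = \frac{k^*}{k} = \frac{6\sigma_{\hat{\theta}_j}^2}{\pi^2\delta_j} = \frac{\sum_{i=1}^g (x_{1i} - \bar{x}_{1\cdot})^2 \times \sum_{i=1}^g w_i}{(\pi^2/6)\left\{\sum_{i=1}^g x_{1i}^2 w_i \sum_{i=1}^g w_i - \left[\sum_{i=1}^g x_{1i} w_i\right]^2\right\}}. \tag{34}$$

That is, (34) is independent of $\sigma^2$ and applies for any power and any test size $\alpha$.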

Table 1 presents relative sample sizes $n^*/n$ for a given fixed parameter $\omega_j^*$ and an average proportion of successes $\bar{\pi}$. We consider the situations in which $\bar{\pi} = \sum_{i=1}^g \pi_i/g = 0.1, 0.2, 0.3, 0.4, 0.5$, with $g = 9$ and $\omega_j^* = 0.25, 0.50, 0.75$.

Table 1: Relative sample sizes required to attain any power for the dichotomizing method versus the proposed method.

ω* = exp(θ)   Average proportion of successes (π̄)
              0.1       0.2      0.3      0.4      0.5
0.25          23.7166   9.5092   7.4954   7.1996   6.8575
0.50          10.6719   5.4176   3.4215   2.5209   2.1784
0.75           7.7088   3.8713   2.5171   1.9380   1.5841

For given fixed $\omega_j^*$ and $\bar{\pi}$, the relative sample sizes in Table 1 can be computed by the following steps:

(i) compute the value of $\theta_j$ via the equation $\theta_j = \ln(\omega_j^*)$,

(ii) calculate the cut-off point $C$ iteratively such that $\bar{\pi}$ attains the specified value for the values $x_{i1}$, using the value of $\theta_j$ from (i).
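The two steps and formula (34) can be sketched in code as follows. This is an illustration under explicit assumptions: the weights are those of (23), the cut-off point enters only through the intercept θ0 (found here by bisection, a choice made for this sketch), and all function names are hypothetical. The article does not spell out every computational detail, so the values returned need not coincide exactly with those in Table 1.

```python
import math


def success_probs(theta0, theta1, xs):
    """pi_i = exp(-exp(theta0 + theta1 * x_i)) at each design point."""
    return [math.exp(-math.exp(theta0 + theta1 * x)) for x in xs]


def solve_intercept(theta1, xs, pi_bar, lo=-30.0, hi=30.0):
    """Bisection on theta0; the mean probability decreases as theta0 grows."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(success_probs(mid, theta1, xs)) / len(xs) > pi_bar:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def relative_sample_size(omega_star, pi_bar, g=9, a=2.0):
    """n*/n from (34) for g equally spaced design points on [-a, a]."""
    xs = [-a + 2 * a * m / (g - 1) for m in range(g)]
    theta1 = math.log(omega_star)                      # step (i)
    theta0 = solve_intercept(theta1, xs, pi_bar)       # step (ii)
    ps = success_probs(theta0, theta1, xs)
    ws = [p * math.log(p) ** 2 / (1 - p) for p in ps]  # weights from (23)
    s0 = sum(ws)
    s1 = sum(x * w for x, w in zip(xs, ws))
    s2 = sum(x * x * w for x, w in zip(xs, ws))
    x_bar = sum(xs) / g
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxx * s0 / ((math.pi ** 2 / 6) * (s2 * s0 - s1 ** 2))
```

A call such as `relative_sample_size(0.75, 0.5)` returns the ratio for one combination of ω* and average success proportion; looping over a grid of both arguments gives a table with the same layout as Table 1.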

As can be seen from Table 1, all values are greater than 1. The values of $n^*/n$ increase as $\omega_j^*$ moves farther away from 1. The values in Table 1 immediately highlight the improvement achieved by the proposed method.

6. Relative Efficiency of $\hat{\omega}_j$ with $\hat{\omega}_j^*$

Here, we examine the relative efficiency of the estimate  ω̂j to the estimate ω̂j*.

Using (24) and (26), the relative efficiency is given by

$$\text{r.e.}(\hat{\omega}_j, \hat{\omega}_j^*) = \frac{\operatorname{var}(\hat{\omega}_j^*)}{\operatorname{var}(\hat{\omega}_j)} = \frac{6(\exp(\theta_j))^2 \times \sigma_{\hat{\theta}_j}^2}{\pi^2\delta_j\left(\exp\left(\pi\beta_j/(\sigma\sqrt{6})\right)\right)^2} = \frac{6\sigma_{\hat{\theta}_j}^2}{\pi^2\delta_j}, \tag{35}$$

where the last step uses $\theta_j = (\pi/\sqrt{6})(\beta_j/\sigma)$. Note that the relative efficiency is independent of $n$ and $\sigma^2$ and converges to a constant. Comparing (34) and (35), the relative efficiency equals the relative sample size. Therefore, as in Table 1, the proposed method is a consistent improvement over the dichotomizing method with respect to relative efficiencies.

It should be noted that these results hold true under the following assumptions:

(i) the responses $y_i$ and $\beta$ are related through the equation $y_i = x_i'\beta + E_i$, where the independent $E_i$ are distributed as an extreme value with mean 0 and variance $\sigma^2 > 0$,

(ii) the independent variables $x_i$ follow a discrete uniform distribution.

7. Odds Ratio

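For values of $\pi$ larger than 0.90, $-\ln(\pi)$ and $(1-\pi)/\pi$ are very close. Hence, for large values of $\pi$,

$$\frac{\ln(\pi_1)}{\ln(\pi_2)} \approx \frac{(1-\pi_1)/\pi_1}{(1-\pi_2)/\pi_2} = \mathrm{OR},$$

where the odds are those of the complementary event, which has probability $1-\pi$. From (7), the odds ratio is then given by

$$\mathrm{OR} \approx \exp\left(\frac{\pi}{\sqrt{6}}\frac{\beta_j}{\sigma}\right).$$

In this setting, the parameters estimated from the linear regression can be interpreted in terms of an odds ratio.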

8. Simulation Study

As in Table 1, the proposed method is a consistent improvement over the dichotomizing method with respect to relative efficiencies. These results hold true under the assumptions that the predictor variable has a discrete uniform distribution and that the random variables $E_i$ follow an extreme value distribution. To examine the robustness of this conclusion to changes in the distribution of the predictor variable, simulations were run under different distributional conditions. The data were sampled 10000 times for three sample sizes ($n = 250, 500, 1000$), three average proportions of successes ($\bar{\pi} = 0.10, 0.50, 0.95$), and seven values of $\omega_j$ ($\omega_j = 0.75, 0.90, 1.1, 1.2, 1.3, 1.4, 1.5$). The simulated data were generated using the following algorithm:

(i) Generate $y_i$, where $y_i = \beta_0 + \beta_1 x_i + E_i$ and $\beta_1 = \sqrt{6}\ln\omega_j/\pi$ through (7) to produce the correct $\omega_j$; for simplicity, $\beta_0 = 0$ and $\sigma^2 = 1$.

(ii) For fixed $\bar{\pi}$, generate the cutoff point $C$ using (15).

We simulated the data for two scenarios based on the distribution of the explanatory variable. In the first scenario, the independent variable follows a continuous uniform distribution with range (−2, 2); in the second, the independent variable follows a truncated normal distribution with mean 0 and range (−2, 2). The relative mean square errors, relative interval lengths, absolute biases, and coverage probabilities were calculated.
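A single replicate of the first scenario, for the proposed method only, might look like the sketch below. It assumes NumPy, draws the errors from a largest extreme value (Gumbel) distribution shifted and scaled to mean 0 and variance σ², and uses a hypothetical function name `simulate_omega_hat`; the dichotomized fit and the cutoff step are not shown.

```python
import math

import numpy as np

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant


def simulate_omega_hat(omega, n, rng, sigma=1.0, beta0=0.0):
    """One simulated estimate of omega_j from the linear regression fit."""
    beta1 = sigma * math.sqrt(6) * math.log(omega) / math.pi   # from (7)
    x = rng.uniform(-2.0, 2.0, size=n)                          # scenario 1
    scale = sigma * math.sqrt(6) / math.pi                      # variance sigma^2
    errors = rng.gumbel(loc=-EULER_GAMMA * scale, scale=scale, size=n)  # mean 0
    y = beta0 + beta1 * x + errors

    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma_hat = math.sqrt(resid @ resid / (n - 2))
    return math.exp(math.pi / math.sqrt(6) * beta_hat[1] / sigma_hat)
```

Repeating the call many times with `rng = np.random.default_rng(seed)` and averaging the squared deviations from the true ω_j would give a Monte Carlo estimate of the mean square error for the proposed method.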

Results of the simulations addressing the validity of the proposed method are displayed in Tables 2 and 3.

Table 2: Simulated relative mean square errors, relative interval lengths, coverage probabilities, and absolute biases for the proposed and dichotomizing methods (using a continuous uniform distribution for the explanatory variable and an extreme value distribution for the errors).

Sample                    ω
size  Cut off  Row  0.75    0.9     1.1     1.2     1.3     1.4     1.5
1000  0.10     a    1.15    1.07    1.09    1.14    1.24    1.47    1.71
               b    1.10    1.03    1.03    1.07    1.14    1.23    1.35
               c    0.943   0.948   0.949   0.949   0.945   0.938   0.933
               d    0.948   0.947   0.949   0.947   0.951   0.947   0.953
               e    0.05    0.04    0.12    0.14    0.10    0.15    0.11
               f    0.07    0.01    0.17    0.13    0.24    0.34    0.58
1000  0.50     a    1.23    1.26    1.27    1.28    1.27    1.24    1.26
               b    2.16    1.13    1.23    1.14    1.15    1.17    1.19
               c    0.940   0.951   0.951   0.945   0.942   0.937   0.934
               d    0.951   0.949   0.951   0.950   0.948   0.947   0.948
               e    0.04    0.01    0.08    0.10    0.05    0.09    0.04
               f    0.05    0.04    0.15    0.12    0.09    0.12    0.13
1000  0.95     a    12.75   12.44   13.22   12.68   13.14   12.91   12.79
               b    3.67    3.57    3.58    3.63    3.69    3.76    3.84
               c    0.943   0.951   0.952   0.944   0.944   0.938   0.929
               d    0.952   0.954   0.952   0.952   0.951   0.951   0.951
               e    0.04    0.07    0.11    0.10    0.10    0.17    0.10
               f    0.75    0.68    0.86    1.01    1.21    1.45    1.24
500   0.10     a    1.30    1.08    1.07    1.17    1.24    1.54    1.95
               b    1.16    1.03    1.04    1.08    1.15    1.25    1.39
               c    0.942   0.950   0.951   0.95    0.944   0.941   0.936
               d    0.951   0.950   0.949   0.951   0.954   0.954   0.953
               e    0.12    0.07    0.24    0.25    0.21    0.18    0.29
               f    0.23    0.08    0.33    0.39    0.41    0.73    1.21
500   0.50     a    1.35    1.10    1.27    1.26    1.26    1.25    1.26
               b    1.26    1.03    1.13    1.14    1.16    1.17    1.20
               c    0.940   0.949   0.947   0.948   0.943   0.940   0.933
               d    0.952   0.951   0.949   0.949   0.954   0.950   0.951
               e    0.23    0.34    0.27    0.23    0.26    0.25    0.38
               f    0.48    0.11    0.17    0.18    0.31    0.26    0.42
500   0.95     a    13.04   13.17   13.8    13.90   14.45   14.48   14.47
               b    3.72    3.65    3.68    3.73    3.82    3.91    3.99
               c    0.942   0.947   0.951   0.949   0.947   0.938   0.935
               d    0.953   0.952   0.954   0.955   0.955   0.953   0.954
               e    0.05    0.11    0.08    0.08    0.24    0.32    0.27
               f    0.94    1.38    1.78    1.92    2.52    3.00    2.90
250   0.10     a    13.41   14.46   1.12    1.28    1.52    1.96    2.33
               b    3.78    3.73    1.04    1.09    1.18    1.30    1.45
               c    0.942   0.949   0.949   0.945   0.942   0.942   0.933
               d    0.957   0.954   0.948   0.949   0.952   0.957   0.953
               e    0.02    0.20    0.38    0.33    0.42    0.41    0.66
               f    2.11    2.74    0.42    0.84    1.18    1.78    2.24
250   0.50     a    1.27    1.25    1.32    1.28    1.30    1.30    1.29
               b    1.16    1.13    1.13    1.14    1.16    1.18    1.20
               c    0.941   0.948   0.952   0.947   0.945   0.943   0.933
               d    0.951   0.951   0.951   0.950   0.951   0.951   0.951
               e    0.12    0.13    0.35    0.44    0.41    0.53    0.55
               f    0.11    0.22    0.39    0.47    0.51    0.74    0.59
250   0.95     a    12.98   14.6    15.64   15.46   17.05   16.89   18.33
               b    3.75    3.72    3.82    3.88    4.01    4.12    4.29
               c    0.945   0.955   0.946   0.948   0.940   0.937   0.932
               d    0.959   0.955   0.955   0.959   0.958   0.957   0.952
               e    0.02    0.16    0.39    0.22    0.46    0.47    0.51
               f    1.22    2.75    3.97    3.98    4.99    5.19    6.19

Rows: a, relative mean square errors; b, relative interval lengths; c, coverage probability (proposed); d, coverage probability (dichotomized); e, % bias (proposed); f, % bias (dichotomized).

Table 3: Simulated relative mean square errors, relative interval lengths, coverage probabilities, and absolute biases for the proposed and dichotomizing methods (using a truncated normal distribution for the explanatory variable and an extreme value distribution for the errors).

Sample                    ω
size  Cut off  Row  0.75    0.9     1.1     1.2     1.3     1.4     1.5
1000  0.10     a    1.17    1.02    1.08    1.13    1.19    1.28    1.36
               b    1.11    1.03    1.03    1.06    1.10    1.25    1.22
               c    0.942   0.948   0.948   0.952   0.944   0.942   0.940
               d    0.951   0.951   0.950   0.952   0.949   0.951   0.951
               e    0.08    0.06    0.03    0.14    0.13    0.14    0.16
               f    0.10    0.11    0.15    0.23    0.30    0.39    0.39
1000  0.50     a    1.26    1.24    1.26    1.28    1.28    1.25    1.28
               b    1.24    1.13    1.13    1.14    1.14    1.15    1.17
               c    0.944   0.948   0.952   0.947   0.947   0.944   0.941
               d    0.948   0.951   0.949   0.949   0.947   0.950   0.949
               e    0.02    0.09    0.08    0.07    0.18    0.16    0.13
               f    0.03    0.06    0.12    0.16    0.20    0.16    0.14
1000  0.95     a    12.33   13.12   13.03   12.71   12.86   12.55   12.88
               b    3.62    3.59    3.61    3.62    3.64    3.68    3.71
               c    0.944   0.951   0.948   0.948   0.945   0.945   0.946
               d    0.952   0.948   0.95    0.949   0.949   0.951   0.952
               e    0.10    0.04    0.11    0.04    0.16    0.16    0.20
               f    1.26    1.05    1.56    1.36    1.43    1.80    1.94
500   0.10     a    1.18    1.09    1.06    1.75    1.23    1.32    1.58
               b    1.11    1.03    1.03    1.06    1.11    1.16    1.23
               c    0.945   0.95    0.951   0.951   0.949   0.943   0.944
               d    0.953   0.953   0.953   0.950   0.949   0.951   0.950
               e    0.04    0.13    0.31    0.18    0.33    0.36    0.37
               f    0.21    0.08    0.37    0.50    0.62    0.69    0.96
500   0.50     a    1.25    1.27    1.27    1.29    1.27    1.29    1.25
               b    1.14    1.13    1.13    1.14    1.15    1.16    1.17
               c    0.944   0.948   0.949   0.947   0.948   0.944   0.935
               d    0.951   0.951   0.951   0.948   0.951   0.948   0.949
               e    0.13    0.22    0.35    0.37    0.35    0.30    0.44
               f    0.16    0.19    0.39    0.48    0.44    0.41    0.54
500   0.95     a    13.11   14.02   14.02   13.5    13.54   13.80   14.32
               b    3.73    3.71    3.73    3.75    3.77    3.81    3.86
               c    0.944   0.95    0.951   0.950   0.947   0.944   0.944
               d    0.954   0.95    0.951   0.953   0.948   0.956   0.953
               e    0.15    0.10    0.24    0.38    0.32    0.33    0.43
               f    2.50    2.70    2.92    3.10    2.92    3.36    3.89
250   0.10     a    1.28    1.11    1.12    1.19    1.33    1.54    1.76
               b    1.11    1.03    1.04    1.08    1.13    1.19    1.28
               c    0.947   0.951   0.950   0.947   0.950   0.950   0.942
               d    0.951   0.950   0.950   0.952   0.954   0.952   0.951
               e    0.40    0.34    0.37    0.64    0.69    0.58    0.81
               f    0.26    0.06    0.69    1.08    1.30    1.55    2.22
250   0.50     a    1.32    1.30    1.27    1.33    1.31    1.33    1.31
               b    1.15    1.13    1.13    1.14    1.18    1.17    1.18
               c    0.951   0.95    0.953   0.951   0.940   0.945   0.940
               d    0.949   0.951   0.952   0.948   0.948   0.950   0.948
               e    0.22    0.43    0.57    0.69    0.66    0.58    0.66
               f    0.38    0.53    0.64    0.89    0.91    0.82    0.91
250   0.95     a    14.09   14.51   16.27   15.91   15.89   15.73   15.60
               b    3.86    3.87    3.93    3.92    3.98    4.04    4.11
               c    0.943   0.95    0.951   0.951   0.947   0.944   0.937
               d    0.953   0.95    0.953   0.956   0.953   0.956   0.952
               e    0.30    0.37    0.57    0.68    0.42    0.62    0.75
               f    4.98    5.52    6.547   5.91    6.17    6.88    7.72

Rows: a, relative mean square errors; b, relative interval lengths; c, coverage probability (proposed); d, coverage probability (dichotomized); e, % bias (proposed); f, % bias (dichotomized).

The simulations show that the relative mean square errors are all greater than 1, and that they increase with the average proportion of successes and as $\omega_j$ moves farther away from 1. The results in Tables 2 and 3 demonstrate that the proposed method provides confidence intervals that maintain their nominal 95 percent coverage. For the proposed method in the first scenario, 51 out of 63 coverage probabilities fell within (0.94, 0.96), and all 63 coverage probabilities are greater than 0.93; in the second scenario, almost all coverage probabilities fell within (0.94, 0.96). The absolute biases for the proposed method are never greater than a few percent. The proposed method is less biased than the dichotomizing method in all but 6 of the 63 simulations in each of the two scenarios.

9. An Example

To illustrate the application of the proposed method presented in the previous sections, we use data from the National Health Survey in Iran. Other analyses of these data have been published elsewhere.

In this study, 14176 women aged 20–69 years were investigated. BMI (body mass index), our dependent variable, was calculated as weight in kilograms divided by height in meters squared (kg/m²). Independent variables included place of residence, age, smoking, economic index, marital status, and education level. The independent variables considered were both categorical and continuous. First, BMI was treated as a continuous variable, and $\hat{\omega}_j$ and 95 percent confidence intervals were calculated using the proposed linear regression method. Then subjects were classified as obese (BMI ≥ 30 kg/m²) or nonobese (BMI < 30 kg/m²). A complementary log-log model was used for the binary analysis, with obese or nonobese as the outcome measure. The $\hat{\omega}_j^*$ and 95 percent confidence intervals were calculated using the dichotomized method. Table 4 presents the coefficient estimates, estimated confidence intervals, and relative confidence interval lengths. The proposed and dichotomizing methods produced different confidence intervals, although $\hat{\omega}_j$ and $\hat{\omega}_j^*$ were similar, varying only slightly. The $\hat{\omega}_j$ estimates from the proposed method had smaller variances and shorter confidence intervals than those from the dichotomizing method. All relative confidence interval lengths were greater than 1, ranging from 1.47 to 2.58.

Table 4: Adjusted ω̂j* and ω̂j for obesity and confidence intervals using the two methods for the National Health Survey.

Covariate                     ω̂j (ω̂j*)       95% CI (proposed)   95% CI (dichotomized)   Relative length of CI
Place of residence            1.65 (1.97)     1.58-1.74           1.79-2.18               2.43
Age                           1.021 (1.019)   1.018-1.022         1.015-1.022             1.75
Years of education            0.99 (0.98)     0.985-0.997         0.971-0.994             1.92
Smoking                       0.76 (0.68)     0.66-0.90           0.51-0.92               1.71
Marital status                1.16 (1.42)     1.10-1.22           1.27-1.58               2.58
Lower-middle economy index    1.24 (1.32)     1.14-1.32           1.18-1.48               1.67
Upper-middle economy index    1.21 (1.26)     1.14-1.29           1.12-1.42               2.0
High economy index            1.20 (1.21)     1.11-1.30           1.08-1.36               1.47

CI: confidence interval. Estimates are shown as proposed (dichotomized). Relative length of CI is dichotomized/proposed.

10. Discussion

When the errors $E_i$ are assumed to follow an extreme value distribution, as noted before, the method has several advantages. First, the method allows the researcher to apply the complementary log-log model without dichotomizing and without loss of information. Second, the $\hat{\omega}_j^*$ from the dichotomizing method depends on the chosen cutoff point $C$ and will vary with $C$. The proposed $\hat{\omega}_j$, however, is independent of $C$, since $\hat{\omega}_j$ is a function of the continuous $Y_i$ and not of the dichotomized $Y_i^*$ defined through $C$. Third, we show that the coefficient of the complementary log-log model, $\theta_j$, can be interpreted in terms of the regression coefficient $\beta_j$. Fourth, when the independent variables $x_i$ follow a discrete uniform distribution, the proposed method is a consistent improvement over the dichotomizing method with respect to relative efficiencies. The proposed method can provide a sample size saving, smaller variances, and shorter confidence intervals than the dichotomized method. Fifth, when $\pi$ is large, the parameters estimated from the linear regression can be interpreted as odds ratios.

Our results are consistent with the findings of Moser and Coombs and of Bakhshi et al., which show the greater efficiency of parameter estimates from a regression method that avoids dichotomizing in comparison with the more traditional dichotomizing method using logistic regression.

Our main recommendation is to let continuous response remain continuous. Do not throw away information by transforming the data to binary. This means that if the objective is to estimate and/or test coefficients when responses are continuous, please resist dichotomizing your response variable.

Appendix

A. Largest Extreme Value Distribution

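(a) The PDF and the tail probability are given by

$$f(y \mid x'\beta, \sigma) = \frac{\pi}{\sigma\sqrt{6}} \exp\left\{ -\frac{y - x'\beta - k\sigma}{\sigma} \cdot \frac{\pi}{\sqrt{6}} - \exp\left( -\frac{y - x'\beta - k\sigma}{\sigma} \cdot \frac{\pi}{\sqrt{6}} \right) \right\},$$

$$P(y \ge c) = 1 - \exp\left\{ -\exp\left( -\frac{c - x'\beta - k\sigma}{\sigma} \cdot \frac{\pi}{\sqrt{6}} \right) \right\}, \tag{A.1}$$

with $-\infty < x < \infty$ and $\sigma > 0$, where $Y$ is a continuous outcome variable, $x = (1, x_1, \ldots, x_{p-1})'$ is the $p \times 1$ vector of known independent variables, $\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})'$ is the $p \times 1$ vector of unknown parameters, and $k \approx 0.45$.

It is easy to check that

$$\omega_j = \frac{\ln(1-\pi_1)}{\ln(1-\pi_2)} = \frac{\ln(1 - P(y \ge c \mid x))}{\ln(1 - P(y \ge c \mid x_{(-1,j)}))} = \frac{-\exp\left( -\dfrac{c - x'\beta - k\sigma}{\sigma} \cdot \dfrac{\pi}{\sqrt{6}} \right)}{-\exp\left( -\dfrac{c - x_{(-1,j)}'\beta - k\sigma}{\sigma} \cdot \dfrac{\pi}{\sqrt{6}} \right)} = \exp\left( \frac{\pi}{\sqrt{6}} \frac{\beta_j}{\sigma} \right), \tag{A.2}$$

so that $1 - \pi_1 = (1 - \pi_2)^{\exp\left((\pi/\sqrt{6})(\beta_j/\sigma)\right)}$, where

$$x' = (1, x_1, \ldots, x_j, \ldots, x_{p-1}), \qquad x_{(-1,j)}' = (1, x_1, \ldots, x_j - 1, \ldots, x_{p-1}), \qquad \beta = (\beta_0, \beta_1, \ldots, \beta_j, \ldots, \beta_{p-1})'.$$

(b) Suppose that $E_i$ is distributed as a largest extreme value with mean 0 and variance $\sigma^2 > 0$. We conclude that the PDF and tail probability of each independent $Y_i$ are given by (A.1), and the sample version of equality (A.2) is

$$\hat{\omega}_j = \frac{\ln(1-\hat{\pi}_1)}{\ln(1-\hat{\pi}_2)} = \exp\left( \frac{\pi}{\sqrt{6}} \frac{\hat{\beta}_j}{\hat{\sigma}} \right).$$

(c) Similarly, for the largest extreme value distribution,

$$\mu_i = \pi_i = 1 - \exp\left(-\exp\left(\sum_j \theta_j x_{ij}\right)\right) \ \Longrightarrow \ \pi_i = 1 - \exp(-\exp(\eta_i)),$$

$$\frac{\partial\mu_i}{\partial\eta_i} = -(-\exp(\eta_i))\exp(-\exp(\eta_i)) = -(1-\pi_i)\ln(1-\pi_i), \qquad w_i = \frac{\left((1-\pi_i)\ln(1-\pi_i)\right)^2}{\pi_i(1-\pi_i)} = \frac{(1-\pi_i)(\ln(1-\pi_i))^2}{\pi_i},$$

then

$$X'WX = \begin{bmatrix} \sum_{i=1}^n w_i & \sum_{i=1}^n x_{i1} w_i & \cdots & \sum_{i=1}^n x_{i,p-1} w_i \\ \sum_{i=1}^n x_{i1} w_i & \sum_{i=1}^n x_{i1}^2 w_i & \cdots & \sum_{i=1}^n x_{i1} x_{i,p-1} w_i \\ \vdots & \vdots & & \vdots \\ \sum_{i=1}^n x_{i,p-1} w_i & \sum_{i=1}^n x_{i1} x_{i,p-1} w_i & \cdots & \sum_{i=1}^n x_{i,p-1}^2 w_i \end{bmatrix}, \qquad w_i = \frac{(1-\pi_i)(\ln(1-\pi_i))^2}{\pi_i}.$$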

Conflict of Interests

The authors have declared no conflict of interests.

References

Agresti A. Categorical Data Analysis. 2nd ed. New York, NY, USA: Wiley; 2002.

Zhao L. P., Kolonel L. N. Efficiency loss from categorizing quantitative exposures into qualitative exposures in case-control studies. American Journal of Epidemiology. 1992;136(4):464-474.

MacCallum R. C., Zhang S., Preacher K. J., Rucker D. D. On the practice of dichotomization of quantitative variables. Psychological Methods. 2002;7(1):19-40. doi:10.1037//1082-989X.7.1.19

Cohen J. The cost of dichotomization. Applied Psychological Measurement. 1983;7(3):249-253. doi:10.1177/014662168300700301

Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology. 1995;6(4):450-454.

Austin P. C., Brunner L. J. Inflation of the type I error rate when a continuous confounding variable is categorized in logistics regression analyses. Statistics in Medicine. 2004;23(7):1159-1178. doi:10.1002/sim.1687

Vargha A., Rudas T., Delaney H. D., Maxwell S. E. Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral Statistics. 1996;21(3):264-282.

Maxwell S. E., Delaney H. D. Bivariate median splits and spurious statistical significance. Psychological Bulletin. 1993;113(1):181-190.

Streiner D. L. Breaking up is hard to do: the heartbreak of dichotomizing continuous data. Canadian Journal of Psychiatry. 2002;47(3):262-266.

Chen H., Cohen P., Chen S. Biased odds ratios from dichotomization of age. Statistics in Medicine. 2007;26(18):3487-3497. doi:10.1002/sim.2737

Ragland D. R. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology. 1992;3(5):434-440.

Moser B. K., Coombs L. P. Odds ratios for a continuous outcome variable without dichotomizing. Statistics in Medicine. 2004;23(12):1843-1860. doi:10.1002/sim.1776

Johnson N. L., Welch H., Wei C. Z. Application of the non-central t distribution. Biometrika. 1940;31(3-4):362-389.

Serfling R. J. Approximation Theory of Mathematical Statistics. New York, NY, USA: Wiley; 1980.

Lai T. L., Robbins H., Wei C. Z. Strong consistency of least squares estimates in multiple regression. Proceedings of the National Academy of Sciences of the United States of America. 1978;75(7):3034-3036.

Bakhshi E., Eshraghian M. R., Mohammad K., Seifi B. A comparison of two methods for estimating odds ratios: results from the National Health Survey. BMC Medical Research Methodology. 2008;8, article 78. doi:10.1186/1471-2288-8-78