Within the context of clinical and other scientific research, a substantial need exists for an accurate determination of the point estimate in a lognormal mean model, given that highly skewed data are often present. As such, logarithmic transformations are often advocated to achieve the assumptions of parametric statistical inference. Despite this, existing approaches that utilize only a sample’s mean and variance may not necessarily yield the most efficient estimator. The current investigation developed and tested an improved efficient point estimator for a lognormal mean by capturing more complete information via the sample’s coefficient of variation. Results of an empirical simulation study across varying sample sizes and population standard deviations indicated relative improvements in efficiency of up to 129.47 percent compared to the usual maximum likelihood estimator and up to 21.33 absolute percentage points above the efficient estimator presented by Shen and colleagues (2006). The relative efficiency of the proposed estimator increased particularly as a function of decreasing sample size and increasing population standard deviation.
1. Introduction
The presence of highly skewed data is commonplace across both basic and applied sciences [1, 2]. In certain instances, such data are log-transformed primarily to obtain a normal distribution, $N(\mu,\sigma^2)$, and to stabilize the variance, which may include removing heteroskedasticity in the process. Patterson [3] provided seminal work concerning the statistical challenges involved in estimating the population mean following the transformation of data. More recently, Shen et al. [4] proposed an improved efficient minimum risk/relative mean squared error (RMSE) estimator of the lognormal mean, and numerous researchers have addressed the estimation of parameters within this distribution from both a frequentist and Bayesian context [5–9].
In presenting the lognormal distribution from a more fundamental perspective, if $Y$ is a random variable with a lognormal distribution and a mean of $E(Y)=\varsigma$, then $\ln(Y)$ will be normally distributed with a mean of $\mu$ and a variance of $\sigma^2$. Therefore, $Y$ may also be expressed as $Y\sim\ln(\mu,\sigma^2)$ with a mean of $\varsigma$, observing that
$$\varsigma=\exp\left(\mu+\frac{\sigma^2}{2}\right). \tag{1}$$
As such, when a random sample $y_1,y_2,\ldots,y_n$ is i.i.d. as $\ln(\mu,\sigma^2)$ with a mean of $\varsigma$, then $x_i=\ln(y_i)$ is i.i.d. as $N(\mu,\sigma^2)$ for $i=1,\ldots,n$. The following may also be defined:
$$\bar{x}=\sum_{i=1}^{n}\frac{x_i}{n},\qquad s^2=\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{n-1}, \tag{2}$$
noting that $\bar{x}$ and $s^2$ are the maximum likelihood estimators (MLE) for $\mu$ and $\sigma^2$, respectively [2]. By applying (1) to (2), the usual estimator (UER) for $\varsigma$ is
$$\mathrm{UER}_{\varsigma}=\exp\left(\bar{x}+\frac{s^2}{2}\right). \tag{3}$$
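As a minimal sketch of (1) through (3), the following Python snippet computes the usual estimator from raw lognormal observations. The function name `usual_estimator` and the sample values are illustrative assumptions and are not part of the original study.

```python
import math
import statistics


def usual_estimator(y):
    """Usual estimator (UER) of the lognormal mean, following (3)."""
    x = [math.log(v) for v in y]      # x_i = ln(y_i)
    x_bar = statistics.fmean(x)       # sample mean of the logs, as in (2)
    s2 = statistics.variance(x)       # sample variance with divisor n - 1, as in (2)
    return math.exp(x_bar + s2 / 2.0)


# Hypothetical positive, right-skewed observations (illustration only).
sample = [1.2, 0.8, 2.5, 1.7, 0.9, 3.1]
uer = usual_estimator(sample)
print(f"UER = {uer:.4f}")
```

Because $s^2\geq 0$, the value returned is never below the geometric mean $\exp(\bar{x})$ of the observations.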
As noted above, Shen et al. [4] proposed a new estimator of $\varsigma$ by minimizing the relative mean squared error (RMSE) through an application of the delta method [10]. For the usual estimator in (3), the RMSE is
$$\mathrm{RMSE}(\mathrm{UER}_{\varsigma})=E\left[\frac{\mathrm{UER}_{\varsigma}-\varsigma}{\varsigma}\right]^2. \tag{4}$$
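Notably, the class of estimators used by the authors was
$$\varsigma_c=\exp\left(\bar{x}+\frac{1}{n+d}\cdot\frac{(n-1)s^2}{2}\right), \tag{5}$$
where $(n-1)s^2=\sum_{i=1}^{n}(x_i-\bar{x})^2$ is the sum of squared deviations, with
$$d>-n. \tag{6}$$
By minimizing the RMSE of the estimators in the class up to order $1/n^2$, the optimal value of $d$ was obtained, wherein the optimal coefficient, $c=1/(n+d)$, in the class was identified as
$$c=\frac{1}{n+4+(3\sigma^2/2)}. \tag{7}$$
By applying the usual unbiased estimate $s^2$ of $\sigma^2$, a novel minimum risk/relative mean squared error estimator of $\varsigma$ was developed:
$$\mathrm{SHEN}(2006)=\exp\left\{\bar{x}+\frac{(n-1)s^2}{2(n+4)+3s^2}\right\}, \tag{8}$$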
with a relative mean squared error of
$$\mathrm{RMSE}(\mathrm{SHEN}(2006))=E\left[\frac{\mathrm{SHEN}(2006)-\varsigma}{\varsigma}\right]^2. \tag{9}$$
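The Shen et al. [4] estimator can be sketched in the same way. The snippet below assumes that $s^2$ is the sample variance with divisor $n-1$, so that the correction added to $\bar{x}$ is $(n-1)s^2/\{2(n+4)+3s^2\}$; the function names and the log-scale values are hypothetical and serve only as an illustration.

```python
import math


def log_moments(x):
    """Return n, the sample mean, and the sample variance (divisor n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    s2 = sum((v - x_bar) ** 2 for v in x) / (n - 1)
    return n, x_bar, s2


def usual_estimator(x):
    """UER from log-scale data x_i = ln(y_i): exp(x_bar + s2 / 2)."""
    _, x_bar, s2 = log_moments(x)
    return math.exp(x_bar + s2 / 2.0)


def shen_estimator(x):
    """Shen-type estimator from log-scale data (assumed form, see lead-in)."""
    n, x_bar, s2 = log_moments(x)
    correction = (n - 1) * s2 / (2.0 * (n + 4) + 3.0 * s2)
    return math.exp(x_bar + correction)


# Hypothetical log-transformed observations (illustration only).
logs = [0.18, -0.22, 0.92, 0.53, -0.11, 1.13]
uer = usual_estimator(logs)
shen = shen_estimator(logs)
print(f"UER = {uer:.4f}, SHEN = {shen:.4f}")
```

Under this form the correction is always smaller than $s^2/2$, so the Shen-type estimate lies between $\exp(\bar{x})$ and the usual estimator.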
Given the prevalence of log-transformed data across research disciplines and the importance of efficiently estimating their means, the purpose of the current study was to derive and assess an approach to obtain statistically efficient estimators of the lognormal mean. More specifically, the objective was to incorporate more comprehensive information from the sample’s coefficient of variation, $\mathrm{CV}=s/(n\cdot\bar{x})$, following a logarithmic transformation of the sample data, in estimating the nontransformed population mean of the original distribution.
2. Preparatory Improvement
By the Rao-Blackwell theorem, any function of the sample mean and sample variance that is an unbiased estimator, or a minimum mean squared error estimator (MMSE), of a population quantity is the uniformly minimum variance unbiased estimate (UMVUE) or the uniformly minimum mean squared error estimator (UMMSE) of that quantity, respectively. Hence, the exponent of the usual estimator (UER) presented in (3), $\bar{x}+s^2/2$, is the UMVUE of $\ln(\varsigma)$, where $\varsigma$ is the population mean of the original lognormal distribution. Furthermore, it should be noted that $(n-1)\cdot s^2/\sigma^2$ follows a $\chi^2$ distribution with $(n-1)$ degrees of freedom.
Numerous empirical analyses involve conditions wherein small values of the sample estimate of the coefficient of variation are observed. Therein, an alternate estimator appearing in Lovric and Sahai [11], denoted by $g^{\otimes}$, of the population mean, $\mu$, is offered rather than the typical estimator, $\bar{x}$:
$$g^{\otimes}=\bar{x}+\frac{\bar{x}}{(n\bar{x}^2/s^2)-1}. \tag{10}$$
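Applying principles outlined in Nikulin [12], the relative efficiency (i.e., a key measure of an estimator’s optimality) of $g^{\otimes}$ versus $\bar{x}$ can be expressed as a percentage as
$$\frac{100}{\eta}=100\cdot\left\{\frac{E(\bar{x}-\mu)^2}{E(g^{\otimes}-\mu)^2}\right\}, \tag{11}$$
where $\eta$ denotes the ratio of the mean squared error of $g^{\otimes}$ to that of $\bar{x}$.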
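The UMVUE of the efficiency ratio, $\eta$, as a function of $(\bar{x},s^2)$ may be determined as follows, given that $(\bar{x},s^2)$ is a complete sufficient statistic for $(\mu,\sigma^2)$:
$$\eta=\frac{E(g^{\otimes}-\mu)^2}{E(\bar{x}-\mu)^2}=\frac{n}{\sigma^2}\cdot E\left[(\bar{x}-\mu)+\frac{\bar{x}}{(n\bar{x}^2/s^2)-1}\right]^2=1+2J+K, \tag{12}$$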
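with
$$\begin{aligned}J&=\frac{n}{\sigma^2}\cdot E\left[\frac{\bar{x}(\bar{x}-\mu)}{(n\bar{x}^2/s^2)-1}\right]=\frac{n}{\sigma^2}\cdot E_{s^2}E_{\bar{x}}\left[\frac{\bar{x}(\bar{x}-\mu)}{(n\bar{x}^2/s^2)-1}\right]\\&=\frac{n}{\sigma^2}\left(\frac{n}{2\pi\sigma^2}\right)^{1/2}E_{s^2}\left[\int_{-\infty}^{+\infty}\frac{\bar{x}}{(n\bar{x}^2/s^2)-1}\,(\bar{x}-\mu)\exp\left\{-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right\}d\bar{x}\right]\\&=-\left(\frac{n}{2\pi\sigma^2}\right)^{1/2}E_{s^2}\left[\int_{-\infty}^{+\infty}\frac{\bar{x}}{(n\bar{x}^2/s^2)-1}\,\frac{d}{d\bar{x}}\exp\left\{-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right\}d\bar{x}\right].\end{aligned} \tag{13}$$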
To explain the above in more detail, particularly the appearance of $E_{s^2}$ and $E_{\bar{x}}$ in the development of the term $J$, suppose $\phi(\bar{x},s^2)$ is a function of the sample mean $\bar{x}$ and the sample variance $s^2$ for a random sample from a normal population, for which $\bar{x}$ and $s^2$ are known to have independent sampling distributions. Consequently, the expectation $E\{\phi(\bar{x},s^2)\}$ may be evaluated in two phases. In the first phase, $E_{\bar{x}}\{\cdot\}$ is the expectation with respect to the random variable $\bar{x}$, treating $s^2$ as a constant. In the second phase, $E_{s^2}\{\cdot\}$ is the expectation of the result with respect to the random variable $s^2$. Cumulatively, therefore, $E(\phi(\cdot))=E_{\bar{x},s^2}\{\phi(\cdot)\}=E_{s^2}[E_{\bar{x}}\{\phi(\cdot)\}]$. Applied specifically to the term $J$ in (12), the integration by parts may be detailed accordingly to ultimately yield $J=E_{s^2}E_{\bar{x}}[j]=E[j]$:
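$$\begin{aligned}E_{\bar{x}}\left[\frac{\bar{x}(\bar{x}-\mu)}{(n\bar{x}^2/s^2)-1}\right]&=\left(\frac{n}{2\pi\sigma^2}\right)^{1/2}\int_{-\infty}^{\infty}\frac{\bar{x}}{(n\bar{x}^2/s^2)-1}\,(\bar{x}-\mu)\exp\left\{-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right\}d\bar{x}\\&=-\frac{\sigma^2}{n}\left(\frac{n}{2\pi\sigma^2}\right)^{1/2}\int_{-\infty}^{\infty}\frac{\bar{x}}{(n\bar{x}^2/s^2)-1}\,\frac{d}{d\bar{x}}\exp\left\{-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right\}d\bar{x}.\end{aligned} \tag{14}$$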
Notably, the other term vanishes by the well-known properties of the definite integral, $\int_{-A}^{A}\{\text{integrand}\}\,d\bar{x}$, as the integrand is an odd function of $\bar{x}$. Subsequently, the remainder of the derivation follows by way of integration by parts as
$$J=E_{s^2}E_{\bar{x}}[j]=E[j], \tag{15}$$
with
$$j=-\left\{\frac{n\bar{x}^2}{s^2}+1\right\}\cdot\left\{\frac{n\bar{x}^2}{s^2}-1\right\}^{-2}=-(u+1)\cdot(u-1)^{-2}, \tag{16}$$
as
$$u=\frac{n\bar{x}^2}{s^2}. \tag{17}$$
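Again, it is important to note that, independent of $\bar{x}$, $(n-1)\cdot s^2/\sigma^2$ follows a $\chi^2$ distribution with $(n-1)$ degrees of freedom. Therefore, again applied to (12), and with $u$ as defined in (17),
$$\begin{aligned}K&=\frac{n}{\sigma^2}\cdot E_{s^2}E_{\bar{x}}\left[\bar{x}^2\cdot(u-1)^{-2}\right]\\&=\frac{1}{\sigma^2}\left(\frac{n-1}{2\sigma^2}\right)^{(n-1)/2}\frac{1}{\Gamma\left((n-1)/2\right)}\cdot E_{\bar{x}}\left[\int_{0}^{+\infty}u\cdot(u-1)^{-2}\cdot(s^2)^{(n-1)/2}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}ds^2\right].\end{aligned} \tag{18}$$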
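Furthermore,
$$\frac{d}{ds^2}\left[(s^2)^{(n-1)/2}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}\right]=\left\{\frac{n-1}{2\sigma^2}\right\}\sigma^2\,(s^2)^{(n-3)/2}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}-\left\{\frac{n-1}{2\sigma^2}\right\}(s^2)^{(n-1)/2}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}. \tag{19}$$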
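The derivative in (19) can be checked numerically. The sketch below assumes the product-rule form, namely $\{(n-1)/2\}(s^2)^{(n-3)/2}e^{-(n-1)s^2/2\sigma^2}$ minus $\{(n-1)/2\sigma^2\}(s^2)^{(n-1)/2}e^{-(n-1)s^2/2\sigma^2}$, and compares it with a central finite difference. The evaluation point, the step size, and the function names are arbitrary illustrative choices.

```python
import math


def w(t, n, sigma2):
    """Kernel (s^2)^((n-1)/2) * exp(-(n-1) s^2 / (2 sigma^2)), with t = s^2."""
    return t ** ((n - 1) / 2) * math.exp(-(n - 1) * t / (2 * sigma2))


def dw_analytic(t, n, sigma2):
    """Product-rule derivative of w with respect to t = s^2."""
    e = math.exp(-(n - 1) * t / (2 * sigma2))
    first = ((n - 1) / 2) * t ** ((n - 3) / 2) * e
    second = ((n - 1) / (2 * sigma2)) * t ** ((n - 1) / 2) * e
    return first - second


n, sigma2, t, h = 31, 0.36, 0.4, 1e-5    # hypothetical evaluation point
dw_numeric = (w(t + h, n, sigma2) - w(t - h, n, sigma2)) / (2 * h)
exact = dw_analytic(t, n, sigma2)
rel_err = abs(dw_numeric - exact) / abs(exact)
print(f"relative error = {rel_err:.2e}")
```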
By applying (19) within (18), the following may be obtained:
$$\begin{aligned}K&=\left(\frac{n-1}{2\sigma^2}\right)^{(n-1)/2}\frac{1}{\Gamma\left((n-1)/2\right)}\cdot E_{\bar{x}}\left[\int_{0}^{+\infty}u\cdot(u-1)^{-2}\cdot(s^2)^{(n-3)/2}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}ds^2\right]\\&\quad-\frac{2}{n-1}\left(\frac{n-1}{2\sigma^2}\right)^{(n-1)/2}\frac{1}{\Gamma\left((n-1)/2\right)}\cdot E_{\bar{x}}\left[\int_{0}^{+\infty}u\cdot(u-1)^{-2}\cdot\frac{d}{ds^2}\left\{(s^2)^{(n-1)/2}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}\right\}ds^2\right].\end{aligned} \tag{20}$$
Again through the application of integration by parts, and noting that $du/ds^2=-u/s^2$,
$$\begin{aligned}K&=E_{s^2}E_{\bar{x}}\left[u\cdot(u-1)^{-2}\right]-\frac{2}{n-1}\left(\frac{n-1}{2\sigma^2}\right)^{(n-1)/2}\frac{1}{\Gamma\left((n-1)/2\right)}\cdot E_{\bar{x}}\left[\int_{0}^{+\infty}u\cdot(u-1)^{-2}\cdot\frac{d}{ds^2}\left\{(s^2)^{(n-1)/2}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}\right\}ds^2\right]\\&=E\left[u\cdot(u-1)^{-2}\right]+\frac{2}{n-1}\cdot E\left[u\cdot(u+1)\cdot(u-1)^{-3}\right]\\&=E\left[u\cdot(u-1)^{-2}+\frac{2}{n-1}\cdot u\cdot(u+1)\cdot(u-1)^{-3}\right],\end{aligned} \tag{21}$$
or, expressed differently, noting that $K=E(k)$:
$$k=u\cdot(u-1)^{-2}+\frac{2}{n-1}\cdot u\cdot(u+1)\cdot(u-1)^{-3}. \tag{22}$$
As such, the UMVUE relating to the statistical relative efficiency of $g^{\otimes}$ versus $\bar{x}$, given that $\eta=1+2J+K$ (hence, $\hat{\eta}=1+2j+k$) from (12), and including (16) and (22), would be derived from
$$\begin{aligned}\hat{\eta}=1+2j+k&=1-2\cdot(u+1)\cdot(u-1)^{-2}+\frac{u\cdot(u-1)^{-3}\cdot\{(n+1)\cdot u-(n-3)\}}{n-1}\\&=1-\frac{(u-1)^{-3}\cdot\{(n-3)\cdot u^2+(n-3)\cdot u-2\cdot(n-1)\}}{n-1},\end{aligned} \tag{23}$$
where $u=n\bar{x}^2/s^2$ as in (17),
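and, thus,
$$\frac{100}{\eta}=100\cdot\frac{E(\bar{x}-\mu)^2}{E(g^{\otimes}-\mu)^2}>100\%, \tag{24}$$
if $0<\hat{\eta}<1$, that is, $0<1+2j+k<1$, or $0<\{(n\bar{x}^2/s^2)^2\cdot(n-3)+(n\bar{x}^2/s^2)\cdot(n-3)-2\cdot(n-1)\}$ per (17), as $(n\bar{x}^2/s^2)>1$ for all $\bar{x}^2>s^2/n$ (i.e., with the coefficient of variation of $\bar{x}$ below 1) per (16), or if $(n\bar{x}^2/s^2)>(n+1)/(n-3)$.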
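To make the condition concrete, the following sketch evaluates $\hat{\eta}$ for one hypothetical pair of sample moments. It assumes the closed form of (23) reads $\hat{\eta}=1-\{(n-3)u^2+(n-3)u-2(n-1)\}/\{(n-1)(u-1)^3\}$ with $u=n\bar{x}^2/s^2$; the function name and input values are illustrative only.

```python
def eta_hat(x_bar, s2, n):
    """Estimated efficiency ratio, assuming the closed form of (23)."""
    u = n * x_bar ** 2 / s2                       # u as in (17)
    numerator = (n - 3) * u ** 2 + (n - 3) * u - 2 * (n - 1)
    return 1.0 - numerator / ((n - 1) * (u - 1.0) ** 3)


# Hypothetical sample moments on the log scale (illustration only).
n, x_bar, s2 = 31, 0.5, 0.36
eta = eta_hat(x_bar, s2, n)
u_threshold = (n + 1) / (n - 3)                   # bound quoted in the text
print(f"eta_hat = {eta:.4f}, u = {n * x_bar ** 2 / s2:.3f}, threshold = {u_threshold:.3f}")
```

A value of $\hat{\eta}$ between 0 and 1, as obtained for these inputs, corresponds to a relative efficiency of $g^{\otimes}$ versus $\bar{x}$ above 100%.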
Given the aforementioned, the alternate estimator defined in (10), $g^{\otimes}$, would be a more efficient estimator of the normal population mean, $\mu$. It is important to also note that this proposed estimator could be expressed as a function of the square of the sample coefficient of variation:
$$g^{\otimes}\cong\bar{x}\cdot\left[1+\frac{s^2}{n\bar{x}^2}\right]. \tag{25}$$
3. The Improved Lognormal Mean Estimator
As previously noted, the purpose of this research investigation was to improve and test the estimator proposed by Shen et al. [4], presented initially in (8),
$$\mathrm{SHEN}(2006)=\exp\left\{\bar{x}+\frac{(n-1)\cdot s^2}{2\cdot(n+4)+3\cdot s^2}\right\}. \tag{26}$$
As offered in (25), the development of the proposed estimator, $g^{\otimes}$, draws upon (10) and, through substitution from ((8), (26)), the following is obtained:
$$g^{\otimes}=\bar{x}+\frac{s^2}{n\cdot\bar{x}}=\bar{x}\cdot\left(1+\frac{s^2}{n\bar{x}^2}\right). \tag{27}$$
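Describing the approach through which $g^{\otimes}$ from (10) was developed to the expression in (25) and further to (27),
$$\begin{aligned}g^{\otimes}&=\bar{x}+\frac{\bar{x}}{(n\bar{x}^2/s^2)-1}=\bar{x}\left[\frac{n\bar{x}^2/s^2}{(n\bar{x}^2/s^2)-1}\right]=\bar{x}\left[1+\left\{\frac{n\bar{x}^2}{s^2}-1\right\}^{-1}\right]\\&=\bar{x}\left[1+\frac{s^2}{n\bar{x}^2}\left\{1-\frac{s^2}{n\bar{x}^2}\right\}^{-1}\right]=\bar{x}\left[1+\frac{s^2}{n\bar{x}^2}\left\{1+\frac{s^2}{n\bar{x}^2}+\frac{s^4}{n^2\bar{x}^4}+O\left(\frac{1}{n^3}\right)\right\}\right]\\&=\bar{x}\left[1+\frac{s^2}{n\bar{x}^2}+O\left(\frac{1}{n^2}\right)\right]\simeq\bar{x}\left[1+\frac{s^2}{n\bar{x}^2}\right]\ \text{up to a term of}\ O\left(\frac{1}{n}\right).\end{aligned} \tag{28}$$
To illustrate, even for $n=31$, $1/n^2=1/(31)^2\approx 0.001$, so the omitted term is negligible.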
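The size of the omitted term can be seen numerically. The sketch below compares the exact form of $g^{\otimes}$ in (10) with its first-order approximation in (25) for hypothetical sample moments; the function names and input values are assumptions made for illustration.

```python
def g_exact(x_bar, s2, n):
    """Alternate estimator of the normal mean, as in (10)."""
    return x_bar + x_bar / (n * x_bar ** 2 / s2 - 1.0)


def g_approx(x_bar, s2, n):
    """First-order approximation in the squared coefficient of variation, as in (25)."""
    return x_bar * (1.0 + s2 / (n * x_bar ** 2))


# Hypothetical sample moments on the log scale (illustration only).
x_bar, s2, n = 0.5, 0.36, 31
exact = g_exact(x_bar, s2, n)
approx = g_approx(x_bar, s2, n)
gap = exact - approx
print(f"exact = {exact:.6f}, approx = {approx:.6f}, gap = {gap:.6f}")
```

For these inputs the two forms agree to roughly three decimal places, with the exact form slightly larger because the dropped series terms are all positive when $s^2/(n\bar{x}^2)<1$.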
To derive an efficient estimator of the normal variance using the sample coefficient of variation, the sixth-iteration efficient estimator of the normal variance presented in Lovric and Sahai [11] may be applied as
(29)E6ER(σ2)=6M1·G6**+M2·G*(M1+M2),
where
(30)M1=2·(n-1)(n+1)2,M2=([1+(s2/n·(x¯)2)]2[2·(s2n·(x¯)2)]·[2+(s2n·(x¯)2)]h·[1+(s2n·(x¯)2)]2[2·(s2/n·(x¯)2)]·[2+(s2/n·(x¯)2)]·[1+(s2/n·(x¯)2)]2)×(3·[1+(s2n·(x¯)2)2-2]2)-1,G6**=(s2)·{1+(s2/n·(x¯)2)3·[1+(s2/n·(x¯)2)2-2]6}G*=(s2)·{(n-1)(n+1)}.
Consequently, the proposed lognormal mean estimator from the current investigation, GS(2014), is
$$\mathrm{GS}(2014)=\exp\left\{g^{\otimes}+\frac{\mathrm{E6ER}(\sigma^2)}{2}\right\}. \tag{31}$$
4. Empirical Simulation Study
4.1. Study Methodology
To compare the proposed estimator, GS(2014), to the existing efficient estimator, SHEN(2006), and the usual maximum likelihood estimator, UER, a simulation study was undertaken using various sample sizes (i.e., n=31, n=41, n=51, n=61, n=71, n=81, n=91, n=101, n=121, n=151, and n=201, as reported in Table 1) and population mean values (i.e., μ=0.30, μ=0.35, μ=0.40, μ=0.45, μ=0.50, μ=0.55, and μ=0.60), with a fixed variance of $\sigma^2=0.36$ (i.e., $\sigma=0.60\geq\mu$). Some 11,000 iterations across sample sizes were conducted using MATLAB 2010b [The MathWorks Inc., Natick, MA], drawing randomly from a population of $N(\mu,\sigma^2)$. Comparisons of GS(2014) versus UER and SHEN(2006) versus UER were drawn in percentage terms via relative efficiencies, RelEff%, as
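$$\begin{aligned}\mathrm{RelEff}\%\{\mathrm{SHEN}(2006)\ \text{versus}\ \mathrm{UER}\}&=100\cdot\frac{\mathrm{RMSE}(\mathrm{UER})}{\mathrm{RMSE}(\mathrm{SHEN}(2006))},\\\mathrm{RelEff}\%\{\mathrm{GS}(2014)\ \text{versus}\ \mathrm{UER}\}&=100\cdot\frac{\mathrm{RMSE}(\mathrm{UER})}{\mathrm{RMSE}(\mathrm{GS}(2014))},\end{aligned} \tag{32}$$
as, per (4) and (9),
$$\mathrm{RMSE}(\mathrm{UER})=E\left[\frac{\mathrm{UER}-\varsigma}{\varsigma}\right]^2, \tag{33}$$
$$\mathrm{RMSE}(\mathrm{SHEN}(2006))=E\left[\frac{\mathrm{SHEN}(2006)-\varsigma}{\varsigma}\right]^2, \tag{34}$$
$$\mathrm{RMSE}(\mathrm{GS}(2014))=E\left[\frac{\mathrm{GS}(2014)-\varsigma}{\varsigma}\right]^2. \tag{35}$$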
Therein, actual RMSEs of all estimators in (33), (34), and (35) were calculated as an average across each of the simulation’s 11,000 iterations.
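The simulation design can be sketched in Python for a single cell of the study. This is an illustrative re-implementation rather than the original MATLAB code. It covers only UER and SHEN(2006), because the variance estimator needed for GS(2014) is not reproduced here, and it assumes the Shen-type correction $(n-1)s^2/\{2(n+4)+3s^2\}$. The function name, seed, and default arguments are hypothetical.

```python
import math
import random


def relative_efficiency(n, mu, sigma, iterations=11000, seed=2014):
    """RelEff% of the Shen-type estimator versus UER, as in (32), for one (n, mu) cell."""
    rng = random.Random(seed)
    target = math.exp(mu + sigma ** 2 / 2.0)          # true lognormal mean, as in (1)
    sq_err_uer = 0.0
    sq_err_shen = 0.0
    for _ in range(iterations):
        # Draw the log-scale sample directly from N(mu, sigma^2).
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(x) / n
        s2 = sum((v - x_bar) ** 2 for v in x) / (n - 1)
        uer = math.exp(x_bar + s2 / 2.0)
        shen = math.exp(x_bar + (n - 1) * s2 / (2.0 * (n + 4) + 3.0 * s2))
        # Accumulate squared relative errors, as in (33) and (34).
        sq_err_uer += ((uer - target) / target) ** 2
        sq_err_shen += ((shen - target) / target) ** 2
    rmse_uer = sq_err_uer / iterations
    rmse_shen = sq_err_shen / iterations
    return 100.0 * rmse_uer / rmse_shen


rel_eff = relative_efficiency(n=31, mu=0.30, sigma=0.60)
print(f"RelEff% (SHEN vs UER) = {rel_eff:.2f}")
```

Both estimators are evaluated on the same simulated samples in each iteration, which keeps the Monte Carlo noise in the ratio small. The exact value printed depends on the seed and will not match Table 1 digit for digit.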
4.2. Results
As presented in Table 1, both the proposed estimator, GS(2014), and the existing efficient estimator, SHEN(2006), improved upon the usual maximum likelihood estimator, UER, in terms of relative efficiency. In general, relative efficiencies for both efficient estimators increased as a function of lower sample size, though the proposed estimator, GS(2014), was also more efficient at lower population standard deviations. Across 74 of the 77 analytic categories (96%), GS(2014) showed higher relative efficiencies than SHEN(2006), with the absolute difference being most pronounced at lower sample sizes combined with lower population standard deviations. To illustrate, the greatest absolute percentage difference favoring the efficiency of GS(2014) compared to SHEN(2006) was +21.33 percent at n=31 and μ=0.30. Comparatively, the maximum difference across the three cases where SHEN(2006) outperformed the proposed estimator was +0.9 percent at n=201 and μ=0.60. The relative efficiencies for GS(2014) ranged from a low of 100.64 percent (μ=0.55, n=201) to a high of 129.47 percent (μ=0.30, n=31), compared to a low of 101.05 percent (μ=0.55, n=201) and a high of 108.71 percent (μ=0.50, n=31) for the SHEN(2006) estimator. Collectively, the findings support the proposed estimator, GS(2014), under almost all combinations of sample sizes and population standard deviations compared to the current efficient estimator and the usual maximum likelihood estimator. In instances where the existing efficient estimator did perform better, the differences were small and perhaps clinically negligible (i.e., absolute differences below 1.0%).
Table 1: Relative efficiencies of SHEN(2006) and GS(2014) estimators across varying sample sizes and population standard deviations. Entries are relative efficiencies versus the usual estimator, UER [in %]; columns are labeled "Population standard deviation" in the source.

| Sample size | Estimator | μ=0.30 | μ=0.35 | μ=0.40 | μ=0.45 | μ=0.50 | μ=0.55 | μ=0.60 |
|---|---|---|---|---|---|---|---|---|
| n=31 | SHEN(2006) | 108.14 | 108.06 | 107.89 | 107.39 | 108.71 | 108.11 | 108.60 |
| n=31 | GS(2014) | 129.47 | 125.70 | 123.76 | 119.88 | 117.23 | 113.86 | 112.19 |
| n=41 | SHEN(2006) | 106.89 | 106.38 | 106.10 | 105.29 | 106.84 | 105.52 | 106.10 |
| n=41 | GS(2014) | 126.19 | 123.80 | 118.30 | 113.86 | 112.55 | 108.98 | 108.00 |
| n=51 | SHEN(2006) | 104.79 | 104.36 | 104.38 | 104.58 | 104.40 | 104.57 | 104.70 |
| n=51 | GS(2014) | 124.74 | 118.70 | 114.05 | 110.72 | 108.10 | 106.59 | 105.63 |
| n=61 | SHEN(2006) | 104.25 | 104.29 | 103.85 | 104.26 | 104.57 | 103.93 | 103.43 |
| n=61 | GS(2014) | 120.78 | 115.46 | 111.35 | 108.80 | 107.33 | 105.20 | 103.76 |
| n=71 | SHEN(2006) | 103.20 | 103.49 | 104.04 | 103.50 | 103.32 | 103.27 | 103.32 |
| n=71 | GS(2014) | 117.87 | 112.91 | 110.03 | 107.06 | 105.23 | 104.06 | 103.33 |
| n=81 | SHEN(2006) | 103.51 | 103.75 | 102.76 | 102.93 | 102.70 | 103.08 | 103.08 |
| n=81 | GS(2014) | 115.47 | 111.51 | 107.52 | 105.69 | 104.00 | 103.53 | 102.96 |
| n=91 | SHEN(2006) | 103.02 | 102.89 | 102.83 | 102.69 | 102.20 | 103.09 | 103.02 |
| n=91 | GS(2014) | 113.48 | 109.63 | 106.91 | 104.88 | 103.11 | 103.39 | 102.80 |
| n=101 | SHEN(2006) | 102.53 | 102.27 | 102.41 | 102.30 | 102.65 | 102.40 | 102.27 |
| n=101 | GS(2014) | 111.65 | 108.05 | 105.76 | 104.03 | 103.43 | 102.46 | 101.86 |
| n=121 | SHEN(2006) | 101.87 | 101.54 | 102.37 | 102.29 | 102.07 | 101.75 | 101.94 |
| n=121 | GS(2014) | 109.28 | 105.84 | 104.93 | 103.60 | 102.44 | 101.49 | 101.45 |
| n=151 | SHEN(2006) | 101.58 | 102.15 | 102.03 | 101.67 | 101.84 | 101.55 | 101.72 |
| n=151 | GS(2014) | 107.08 | 105.40 | 103.83 | 102.39 | 102.04 | 101.25 | 101.25 |
| n=201 | SHEN(2006) | 101.11 | 101.33 | 101.29 | 101.10 | 101.18 | 101.05 | 101.28 |
| n=201 | GS(2014) | 104.82 | 103.39 | 102.28 | 101.38 | 101.08 | 100.64 | 100.83 |

SHEN(2006): Shen et al. (2006) [4] estimator; GS(2014): proposed estimator; UER: usual maximum likelihood estimator; n: sample size; μ: population standard deviation.
5. Conclusion
Within the context of clinical and other scientific research, a substantial need exists for an accurate determination of the point estimate for a lognormal mean. The transformation of highly skewed data is often undertaken to achieve assumptions required for parametric statistical inference. Despite this, existing approaches that capture only a sample’s mean and variance do not necessarily yield the most efficient estimator. The current investigation developed and tested more efficient point estimators for a lognormal mean model by capturing more complete information within the sample’s coefficient of variation. Results of an empirical simulation study across varying sample sizes and population standard deviations indicated relative improvements in efficiency of up to 129.47 percent compared to the usual maximum likelihood estimator and up to 21.33 percentage points above the current efficient estimator. The relative efficiency of the proposed estimator increased particularly as a function of decreasing sample size and increasing population standard deviation.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] G. H. Skrepnek, "Regression methods in the empirical analysis of health care data," vol. 11, no. 3, pp. 240–251, 2005.
[2] G. H. Skrepnek, E. L. Olvey, and A. Sahai, "Econometric approaches in evaluating costs and outcomes within pharmacoeconomic analyses," vol. 14, no. 1, pp. 105–122, 2012. doi: 10.3233/PPL-2011-0345.
[3] R. L. Patterson, "Difficulties involved in the estimation of a population mean using transformed sample data," vol. 8, pp. 535–537, 1966.
[4] H. Shen, L. D. Brown, and H. Zhi, "Efficient estimation of log-normal means with application to pharmacokinetic data," vol. 25, no. 17, pp. 3023–3038, 2006. doi: 10.1002/sim.2456.
[5] I. G. Evans and S. A. Shaban, "A note on estimation in lognormal models," vol. 69, pp. 779–781, 1974. doi: 10.1080/01621459.1974.10480204. Zbl 0291.62038.
[6] A. L. Rukhin, "Improved estimation in lognormal models," vol. 81, pp. 1046–1049, 1986. doi: 10.1080/01621459.1986.10478371. Zbl 0609.62041.
[7] G. H. Skrepnek, "The contrast and convergence of bayesian and frequentist statistical approaches in pharmacoeconomic analysis," vol. 25, no. 8, pp. 649–664, 2007. doi: 10.2165/00019053-200725080-00003.
[8] W. J. Padgett and L. J. Wei, "Bayes estimation of reliability for the two-parameter lognormal distribution," vol. 6, pp. 443–457, 1977. doi: 10.1080/03610927708827505. Zbl 0356.62074.
[9] X. H. Zhou, "Estimation of the log-normal mean," vol. 17, pp. 2251–2264, 1998.
[10] G. W. Oehlert, "A note on the delta method," vol. 46, pp. 27–29, 1992.
[11] M. M. Lovric and A. Sahai, "An iterative algorithm for efficient estimation of normal variance using sample coefficient of variation," article 001, 2011.
[12] M. S. Nikulin, "Efficiency of a statistical procedure," M. Hazewinkel, Ed., Springer, Berlin, Germany, 1992.