The Lambert Way to Gaussianize Heavy-Tailed Data with the Inverse of Tukey's h Transformation as a Special Case

I present a parametric, bijective transformation to generate heavy tail versions of arbitrary random variables. The tail behavior of this heavy tail Lambert  W × F X random variable depends on a tail parameter δ ≥ 0: for δ = 0, Y ≡ X, for δ > 0 Y has heavier tails than X. For X being Gaussian it reduces to Tukey's h distribution. The Lambert W function provides an explicit inverse transformation, which can thus remove heavy tails from observed data. It also provides closed-form expressions for the cumulative distribution (cdf) and probability density function (pdf). As a special case, these yield analytic expression for Tukey's h pdf and cdf. Parameters can be estimated by maximum likelihood and applications to S&P 500 log-returns demonstrate the usefulness of the presented methodology. The R package LambertW implements most of the introduced methodology and is publicly available on CRAN.


Introduction
Statistical theory and practice are both tightly linked to Normality. In theory, many methods require Gaussian data or noise: (i) regression often assumes Gaussian errors; (ii) many time series models are based on Gaussian white noise [1][2][3]. In such cases, a model M N , parameter estimates and their standard errors, and other properties are then studied, all based on the ideal(istic) assumption of Normality.
In practice, however, data/noise often exhibits asymmetry and heavy tails, for example, wind speed data [4], human dynamics [5], or Internet traffic data [6]. Particularly notable examples are financial data [7,8] and speech signals [9], which almost exclusively exhibit heavy tails. Thus a model M N developed for the Gaussian case does not necessarily provide accurate inference anymore.
One way to overcome this shortcoming is to replace M N with a new model M , where is a heavy tail distribution: (i) regression with Cauchy errors [10]; (ii) forecasting long memory processes with heavy tail innovations [11,12], or ARMA modeling of electricity loads with hyperbolic noise [13]. See also Adler et al. [14] for a wide range of statistical applications and methodology for heavy-tailed data.
While such fundamental approaches are attractive from a theoretical perspective, they can become unsatisfactory from a practical point of view. Many successful statistical models and techniques assume Normality, their theory is very well understood, and many algorithms are implemented for the simple and often much faster Gaussian case. Thus developing models based on an entirely unrelated distribution is like throwing out the (Gaussian) baby with the bathwater.
It would be very useful to transform a Gaussian random variable to a heavy-tailed random variable and vice versa and thus rely on knowledge and algorithms for the well-understood Gaussian case, while still capturing heavy tails in the data. Optimally such a transformation should (a) be bijective, (b) include Normality as a special case for hypothesis testing, and (c) be parametric so the optimal transformation can be estimated efficiently. Figure 1 illustrates this pragmatic approach: researchers can make their observations y as Gaussian as possible (x ) before making inference based on their favorite Gaussian model M N . This avoids the development of, or the data analysts waiting for, a whole new theory of M and new implementations based on a particular heavy-tailed distribution , while still improving statistical inference from  heavy-tailed data y. For example, consider y = ( 1 , . . . , 500 ) from a standard Cauchy distribution C(0, 1) in Figure 2(a): modeling heavy tails by a transformation makes it even possible to Gaussianize this Cauchy sample (Figure 2(c)). This "nice" data x can then be subsequently analyzed with common techniques. For example, the location can now be estimated using the sample average ( Figure 2(d)). For details see Section 6.1.
Liu et al. [15] use a semiparametric approach, where has a nonparanormal distribution if ( ) ∼ N( , 2 ) and (⋅) is an increasing smooth function; they estimate (⋅) using nonparametric methods. This leads to a greater flexibility in the distribution of , but it suffers from two drawbacks: (i) nonparametric methods have slower convergence rates and thus need large samples, and (ii) for identifiability of (⋅), E ( ) ≡ E and Var( ( )) ≡ Var( ) must hold. While (i) is inherent to nonparametric methods, point (ii) requires to have finite mean and variance, which is often not met for heavy-tailed data. Thus here we use parametric transformations which do not rely on restrictive identifiability conditions and also work well for small sample sizes.
The main contributions of this work are threefold: (a) a metafamily of heavy tail Lambert W × distributions (see also [16]) with Tukey's ℎ distribution [17] as a special case, (b) a bijective transformation to "Gaussianize" heavy-tailed data (Section 2), and (c) simple expressions for the cumulative distribution function (cdf) ( ) and probability density function (pdf) ( ) (Section 2.4). In particular, analytic expressions for the pdf and cdf for Tukey's ℎ (Section 3) are presented here, to the best of the author's knowledge, for the first time in the literature. Section 4 introduces a method of moments estimator and studies the maximum likelihood estimator (MLE). Section 5 shows their finite sample properties. As has been shown in many case studies, Tukey's ℎ distribution (heavy tail Lambert W × Gaussian) is useful to model data with unimodal, heavy-tailed densities. Applications to S&P 500 logreturns confirm the usefulness of the Lambert W framework (Section 6). Finally, we discuss the new methodology and future work in Section 7. Detailed derivations and proofs are given in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/909231.
Computations, figures, and simulations were done in R [18]. The R package LambertW implements most of the presented methodology and is publicly available on CRAN.

Multivariate Extensions.
While this work focuses on the univariate case, multivariate extensions of the presented methods can be defined component-wise, analogously to the multivariate version of Tukey's ℎ distribution [19]. While this may not make the transformed random variables jointly Gaussian, it still provides a good starting point for more wellbehaved multivariate estimation.

Box-Cox Transformation.
A popular method to deal with skewed, high variance data is the Box-Cox transformation A major limitation of (1) is the nonnegativity constraint on y, which prohibits its use in many applications. To avoid this limitation it is common to shift the data,ỹ = y + | min(y)| ≥ 0, which restricts to a half-open interval. If, however, the underlying process can occur on the entire real line, such a shift undermines statistical inference for yet unobserved data (see [20]). Even if out-of-sample prediction is not important for the practitioner, Figure 2(b) shows that the Box-Cox transformation in fact fails to remove heavy tails from the Cauchy sample. (We useỹ = y + | min(y)| + 1 and use boxcox from the MASS R package;̂= 0.37.) Moreover, the main purpose of the Box-Cox transformation is to stabilize variance [21][22][23] and remove right tail skewness [24]; a lower empirical kurtosis is merely a byresult of the variance stabilization. In contrast, the Lambert W framework is designed primarily to model heavy-tailed random variables and remove heavy tails from data and has no difficulties with negative values.

Generating Heavy Tails Using Transformations
Random variables exhibit heavy tails if more mass than for a Gaussian random variable lies at the outer end of the density support. A random variable has a tail index if its cdf satisfies 1 − ( ) ∼ ( ) − , where ( ) is a slowly varying function at infinity, that is, lim → ∞ ( )/ ( ) = 1 for all > 0 [25]. (There are various similar definitions of heavy, fat, or long tails; for this work these differences are not essential.) The heavy tail index is an important characteristic of ; for example, only moments up to order can exist.

Tukey's ℎ Distribution.
A parametric transformation is the basis of Tukey's ℎ random variables [17] = exp ( where is standard Normal random variable and ℎ is the heavy tail parameter. The random variable has tail parameter = 1/ℎ [17] and reduces to the Gaussian for ℎ = 0. Morgenthaler and Tukey [26] extend the ℎ distribution to the skewed, heavy-tailed family of ℎℎ random variables where again ∼ N(0, 1). Here ℓ ≥ 0 and ≥ 0 shape the left and right tail of , respectively; thus transformation (3) can model skewed and heavy-tailed data; see Figure 3(a). For the sake of brevity let ( ) := exp(( /2) 2 ). However, despite their great flexibility, Tukey's ℎ and ℎℎ distributions are not very popular in statistical practice, because expressions for the cdf or pdf have not been available in closed form. Although Morgenthaler and Tukey [26] express the pdf of (2) as (ℎ ≡ ) they fall short of making −1 ( ) explicit. So far the inverse of (2) or (3) has been considered analytically intractable [4,27]. Thus parameter inference relied on matching empirical and theoretical quantiles [4,17,26], or by the method of moments [28]. Only recently Headrick et al. [28] provided numerical approximations to the inverse. However, numerical approximations can be slow and prohibit analytical derivations. Thus a closed form, analytically tractable pdf, which can be computed efficiently, is essential for a widespread use of Tukey's ℎ (and variants). In this work I present this long sought closed-form inverse, which has a large body of literature in mathematics and is readily available in standard statistics software. For ease of notation and concision main results are shown for ℓ = = ; analogous results for ℓ ̸ = will be stated without details.

Heavy Tail Lambert W Random
Variables. Tukey's ℎ transformation (2) is strongly related to the approach taken by Goerg [16] to introduce skewness in continuous random variables ∼ ( ). In particular, if ∼ Tukey's ℎ, then 2 ∼ skewed Lambert W × 2 1 with skew parameter = ℎ. Adapting the skew Lambert W × input/output idea (see Figure 1), Tukey's ℎ random variables can be generalized to heavy-tailed Lambert W × random variables. (Most concepts and methods from the skew Lambert W × case transfer one-to-one to the heavy tail Lambert W random variables presented here. Thus for the sake of concision I refer to Goerg [16] for details of the Lambert W framework.) Definition 1. Let be a continuous random variable with cdf ( | ), pdf ( | ), and parameter vector . Then, is a noncentral, nonscaled heavy tail Lambert W × random variable with parameter vector = ( , ), where is the tail parameter.

Definition 2. For a continuous location-scale family random
with parameter vector = ( , ), where = ( − )/ and and are mean and standard deviation of , respectively.
The input is not necessarily Gaussian (Tukey's ℎ) but can be any other location-scale continuous random variable, for example, from a uniform distribution, ∼ ( , ) (see Figure 4).
is a scaled heavy-tailed Lambert W × random variable with parameter = ( , ).
Remark 4 (only nonnegative ). Although < 0 gives interesting properties for , it defines a nonbijective transformation and leads to parameter-dependent support and nonunique input. Thus for the remainder of this work assume ≥ 0, unless stated otherwise.

Inverse Transformation: "Gaussianize" Heavy-Tailed Data.
Transformation (6) is bijective and its inverse can be obtained via the Lambert W function, which is the inverse of = exp( ), that is, that function which satisfies ( ) exp( ( )) = . It has been studied extensively in mathematics, physics, and other areas of science [29][30][31] and is implemented in the GNU Scientific Library (GSL) [32]. Only recently the Lambert W function received attention in the statistics literature [16,[33][34][35]. It has many useful properties (see Appendix A in the supplementary material and Corless et al. [29]), in particular, ( ) is bijective for ≥ 0.

Lemma 5. The inverse transformation of (6) is
and sgn( ) is the sign of . ( ) is bijective for all ≥ 0 and all ∈ R.
Lemma 5 gives for the first time an analytic, bijective inverse of Tukey's ℎ transformation: −1 ( ) of Morgenthaler and Tukey [26] is now analytically available as (8). Bijectivity implies that for any data y and parameter , the exact input x = (y) ∼ ( ) can be obtained. In view of the importance and popularity of Normality, we clearly want to back-transform heavy-tailed data to data from a Gaussian rather than yet another heavy-tailed distribution. Tail behavior of random variables is typically compared by their kurtosis 2 ( ) = E( − ) 4 / 4 , which equals 3 if is Gaussian. Hence for the future when we "normalize y" we cannot only center and scale but also transform it to x witĥ2(x ) = 3 (see Figure 2(c)). While 6 The Scientific World Journal "normalizing" the Lambert Way can also improve KDE for heavy-tailed data (see also [38,39]).
Remark 6 (generalized transformation). Transformation (2) can be generalized to The inner term 2 guarantees bijectivity for all > 0. The inverse is For comparison with Tukey's ℎ I consider = 1 only. For = 1/2 transformation (10) is closely related to skewed Lambert W × distributions.
The explicit formula (14) allows a fast computation and theoretical analysis of the likelihood, which is essential for statistical inference. Detailed properties of (14) are given in Section 4.1. Figure 4 shows (13) and (14) for various ≥ 0 with four different input distributions: for = ℎ = 0 the input equals the output (solid black); for larger the tails of ( | ) and ( | ) get heavier (dashed colored).

Quantile Function.
Quantile fitting has been the standard technique to estimate , , and of Tukey's ℎ. In particular, the medians of and are equal. Thus for symmetric, location-scale family input the sample median of y is a robust estimate of for any ≥ 0 (see also Section 5). General quantiles can be computed via [17] = exp ( 2 2 ) + , where = ( ) are the -quantiles of ( ).

Corollary 8. The cdf of Tukey's ℎ equals
where Φ( ) is the cdf of a standard Normal. The pdf equals (for > 0) The Scientific World Journal Proof. Take ∼ N( , 2 ) in Theorem 7.
Section 4.1 studies functional properties of (23) in more detail.

Tukey's ℎ versus Student's . Student's ] -distribution with
] degrees of freedom is often used to model heavy-tailed data [41,42], as its tail index equals ]. Thus the th moment of a Student's random variable exists if < ]. In particular, and kurtosis Comparing (24) and (25) with (18) and (19) shows a natural association between 1/] and and a close similarity between the first four moments of Student's and Tukey's ℎ ( Figure 5). By continuity and monotonicity, the first four moments of a location-scale -distribution can always be exactly matched by a corresponding location-scale Lambert W × Gaussian. Thus if Student's is used to model heavy tails and not as the true distribution of a test statistic it might be worthwhile to also fit heavy tail Lambert W × Gaussian distributions for an equally valuable "second opinion. " For example, a parallel analysis on S&P 500 logreturns in Section 6.2 leads to divergent inference regarding the existence of fourth moments.

Parameter Estimation
Due to the lack of a closed form pdf of , = ( , ) has typically been estimated by matching quantiles or a method of moments estimator [4,26,28]. These methods can now be replaced by the, fast and usually efficient, maximum likelihood estimator (MLE). Rayner and MacGillivray [43] introduce a numerical MLE procedure based on quantile functions, but they conclude that "sample sizes significantly larger than 100 should be used to obtain reliable estimates. " Simulations in Section 5 show that the MLE using the closed form Lambert W × distribution converges quickly and is accurate even for sample sizes as small as = 10.
The MLE is that = ( , ) which maximizes (26); that is, Since ( | , ) is a function of ( | ), the MLE depends on the specification of the input density. Equation (26) can be decomposed as where is the log-likelihood of the back-transformed data x = ( ,1 , . . . , , ) (via (8)) and Note that ( , , ; ) only depends on ( ) and ( ) (and ), but not necessarily on every coordinate of . Decomposition (28) shows the difference between the exact MLE (̂,̂) based on y and the approximate MLÊx based on x alone: if we knew = ( , , ) beforehand, then we could back-transform y to x and estimatêx from x (maximize (29) with respect to ). In practice, however, must also be estimated and this enters the likelihood via the additive term R( ; y). A little calculation shows that for any ∈ R, log ( , , ; ) ≤ 0 if ≥ 0, with equality if and only if = 0. Thus R( ; y) can be interpreted as a penalty for transforming the data. Maximizing (28) faces a trade-off between transforming the data to follow ( | ) (and increasing ℓ( ; x̂)) and the penalty of a more extreme transformation (and decreasing R( ; y)); see Figure 6(b). Figure 6(a) shows a contour plot of ( = 0, = 1, ; ) as a function of and = : it increases (in absolute value) either if gets larger (for fixed ) or for larger (for fixed ). In both cases, increasing makes the transformed data ( ) get closer to 0 = , which in turn increases its input likelihood. For = 0, the penalty disappears since input equals output; for = 0 there is no penalty since (0) = 0 for all . Figure 6(b) shows an i.i.d. sample ( = 1000) z ∼ Lambert W × Gaussian with = 1/3 and the decomposition of the log-likelihood as in (28). Since = (0, 1) is known, the likelihood and penalty are only functions of . Theorem 9 shows that the convexity of the penalty (decreasing, red) and concavity of the input likelihood (increasing, green) as a function of holds true in general for any data z, and their sum (solid, black) has a unique maximum; herêM LE = 0.37 (blue, dashed vertical line). See Theorem 9 below for details.
The maximization of (28) can be carried out numerically. Here I show existence and uniqueness of̂M LE assuming that and are known. Further theoretical results for̂M LE remain for future work. Given the "nice" form of ( ), which is continuous, twice differentiable (under the assumption that (⋅) is twice differentiable), the MLE for = ( , ) should have the usual optimality properties, such as being consistent and efficient [44].

Properties of the MLE for the Heavy Tail Parameter .
Without loss of generality let = 0 and = 1. In this case Theorem 9 (see [14]). Let have a Lambert W × Gaussian distribution, where = 0 and = 1 are assumed to be known and fixed. Also consider only ∈ [0, ∞). (While for some samples z the MLE also exists for < 0, it cannot be guaranteed for all z. If < 0 (and ̸ = 0), then ( ) is either The Scientific World Journal 9

not unique in R (principal and nonprincipal branch) or does not have a real-valued solution in
then̂= 0.
Proof. See Appendix B in the supplementary material.
Condition (34) says that̂M LE > 0 only if the data is heavy-tailed enough. Points (b) and (c) guarantee that there is no ambiguity in the heavy tail estimate. This is an advantage over Student's -distribution, for example, which has numerical problems and local maxima for unknown (and small) ] [45,46]. On the contrary,̂M LE is always a global maximum.
Given the heavy tails in z one might expect convergence issues for larger . However, ( ) ∼ N(0, 1) for the true ≥ 0 and close to a standard Gaussian if̂M LE ≈ . Since the log-likelihood and its gradient depend on and z only via (z) the performance of the MLE should thus not get worse for large as long as the initial estimate is close enough to the truth. Simulations in Section 5 support this conjecture, even for̂M LE .

Iterative Generalized Method of Moments (IGMM).
A disadvantage of the MLE is the mandatory a priori specification of the input distribution. Especially for heavy-tailed data the eye is a bad judge to choose a particular parametric ( | ). It would be useful to directly estimate without the intermediate step of estimating first.
Goerg [16] presented an estimator for based on iterative generalized methods of moments (IGMM). The idea of IGMM is to find a such that the back-transformed data x has desired properties, for example, being symmetric or having kurtosis 3. An estimator for , , and can be constructed entirely analogous to the IGMM estimator for the skewed Lambert W × F case. See the Supplementary Material, Appendix C for details.
IGMM requires less specific knowledge about the input distribution and is usually also faster than the MLE. Oncê IGMM has been obtained, the back-transformed x̂I GMM can be used to check if has characteristics of a known parametric distribution ( | ). It must be noted though that testing for a particular distribution is too optimistic as x̂has "nicer" properties regarding than the true x would have.
However, estimating the transformation requires only three parameters and for a large enough sample, losing three degrees of freedom should not matter in practice.

Simulations
This section explores finite sample properties of estimators for = ( , , ) and ( , ) under Gaussian input ∼ N( , 2 ). In particular, it compares Gaussian MLE (estimation of and only), IGMM and Lambert W × Gaussian MLE, and, for a heavy tail competitor, the median. (For IGMM, optimization was restricted to ∈ [0, 10].) All results below are based on = 1, 000 replications.

Estimating Only.
Here I show finite sample properties of̂M LE for ∼ N(0, 1), where = 0 and = 1 are known and fixed. Theorem 9 shows that̂M LE is unique: either at the boundary = 0 or at the globally optimal solution to (35). Results in Table 1 were obtained by numerical optimization restricted to ≥ 0 (⇔ log ∈ R) using the nlm function in . Table 1 suggests that the MLE is asymptotically unbiased for every and converges quickly (at about = 400) to its asymptotic variance, which is increasing with . Assuming and to be known is unrealistic and thus these finite sample properties are only an indication of the behavior of the joint MLE,̂M LE . Nevertheless they are very remarkable for extremely heavy-tailed data ( > 1), where classic averagebased methods typically break down. One reason lies in the particular form of the likelihood (32) and its gradient (35) (Theorem 9): although both depend on z, they only do so through (z) = u ∼ N(0, 1). Hence as long aŝM LE is sufficiently close to the true , (32) and (35) are functions of almost Gaussian random variables and standard asymptotic results should still apply.

Estimating All Parameters Jointly.
Here we consider the realistic scenario where and are also unknown. We consider various sample sizes ( = 50, 100, and 1000) and different degrees of heavy tails, ∈ {0, 1/3, 1, 1.5}, each one representing a particularly interesting situation: (i) Gaussian data (does additional, superfluous, estimation of affect other estimates?), (ii) fourth moments do not exist, (iii) nonexisting mean, and (iv) extremely heavy-tailed data (can we get useful estimates at all?) The convergence tolerance for IGMM was set to tol = 1.22 ⋅ 10 −4 . Table 2 summarizes the simulation.
Gaussian Data ( = 0). This setting checks if imposing the Lambert W framework, even though it is superfluous, causes a quality loss in the estimation of = or = . Furthermore, critical values for 0 : = 0 (Gaussian) can be obtained. As in the -only case above, Table 2(a) suggests that estimators are asymptotically unbiased and quickly tend to a large-sample variance. Additional estimation of does not affect the efficiency of̂compared to estimating solely . Estimating directly by Gaussian MLE does not give better results than the Lambert W × Gaussian MLE.
No Fourth Moment ( = 1/3). Here ( , = 1) = 2.28, but fourth moments do not exist. This results in an increasing empirical standard deviation of̂as grows. In contrast, estimates for are not drifting off. In the presence of these large heavy tails the median is much less variable than Gaussian MLE and IGMM. Yet, Lambert W × Gaussian MLE for even outperforms the median.
Nonexisting Mean ( = 1). Both sample moments diverge, and their standard errors are also growing quickly. The median still provides a very good estimate for the location but is again inferior to both Lambert W estimators, which are closer to the true values and appear to converge to an asymptotic variance at rate √ .
Extreme Heavy Tails ( = 1.5). As in Section 5.1, IGMM and Lambert W MLE continue to provide excellent estimates even though the data is extremely heavy tailed. Moreover, Lambert W MLE also has the smallest empirical standard deviation overall. In particular, the Lambert W MLE for has an approximately 15% lower standard deviation than the median. The last column shows that for some about 1% of the = 1, 000 simulations generated invalid likelihood values (NA and ∞). Here the search for the optimal led into regions with a numerical overflow in the evaluation of ( ). For a comparable summary, these few cases were omitted and new simulations added until full = 1, 000 finite estimates were found. Since this only happened in 1% of the cases and also such heavy-tailed data is rarely encountered in practice, this numerical issue is not a limitation in statistical practice.

Discussion of the Simulations.
IGMM performs well independent of the magnitude of . As expected the Lambert W MLE for has the best properties: it can recover the truth for all , and for = 0 it performs as well as the classic sample mean and standard deviation. For small it has the same empirical standard deviation as the Gaussian MLE, but a lower one than the median for large .
Hence the only advantage of estimating and by sample moments of y is speed; otherwise the Lambert W × Gaussian MLE is at least as good as the Gaussian MLE and clearly outperforms it in presence of heavy tails.

Applications
Tukey's ℎ distribution has already proven useful to model heavy-tailed data, but parametric inference was limited to quantile fitting or methods of moments estimation [4,27,28]. Theorem 7 allows us to estimate by ML.
This section applies the presented methodology on simulated as well as real world data: (i) Section 6.1 demonstrates Gaussianizing on the Cauchy sample from the Introduction, and (ii) Section 6.2 shows that heavy tail Lambert W × Gaussian distributions provide an excellent fit to daily S&P 500 log-return series.

Estimating Location of a Cauchy with the Sample Mean.
It is well known that the sample mean y is a poor estimate of the location parameter of a Cauchy distribution, since the sampling distribution of y is again a Cauchy (see [47] for a recent overview); in particular, its variance does not go to 0 for → ∞.  Heavy-tailed Lambert W × Gaussian distributions have similar properties to a Cauchy for ≈ 1. The mean of ( ) equals the location of ( ) due to symmetry around and , respectively. Thus we can estimate from the Cauchy sample y, transform y to x̂, get̂from x̂=̂(y), and thus obtain an estimate for bŷ=̂.
The random sample y ∼ C(0, 1), with pdf ( ) = 1/ (1 + 2 ), in Figure 2  LE is approximately Gaussian and we can use the sample average for inference: x̂M LE = 0.033(0.0472) correctly fails to reject a zero location for y. The transformed x̂M LE features additional Gaussian characteristics (symmetric, no excess kurtosis), and even the null hypothesis of Normality cannot be rejected ( -value ≥ 0.5). Note, however, that Normality for the transformed data is only an empirical approximation; the random variable (( − )/ ), where is Cauchy, does not have a Normal distribution. Figure 2(d) shows the cumulative sample average for the original sample and its Gaussianized version. For a fair comparison̂( ) MLE was reestimated cumulatively for each = 5, . . . , 500 and then used to compute ( 1 , . . . , ). The transformation works extremely well: data point 49 is highly influential for y but has no relevant effect on x̂( )

MLE
. Even for small it is already clear that the location of the underlying Cauchy distribution is approximately zero.
Although it is a simulated example, it demonstrates that removing (strong) heavy tails from data works well and provides "nice" data that can then be analyzed with more refined Gaussian methods.

Heavy Tails in Finance: S&P 500 Case Study.
A lot of financial data displays negative skewness and excess kurtosis. Since financial data in general is not i.i.d., it is often modeled with a (skew) Student's -distribution underlying a (generalized) autoregressive conditional heteroskedastic (GARCH) [2,48] or a stochastic volatility (SV) [49,50] model. Using the Lambert W approach we can build upon the knowledge and implications of Normality (and avoid deriving properties of a GARCH or SV model with heavytailed innovations) and simply "Gaussianize" the returns before fitting more complex, GARCH or SV, models.   Table 3(b) confirms the heavy tails (sample kurtosis 7.70) but also indicates negative skewness (−0.296). As the sample skewnesŝ1(y) is very sensitive to outliers, we fit a distribution which allows skewness and test for symmetry. In case of the double-tail Lambert W × Gaussian this means testing 0 : ℓ = = versus 1 : Using the likelihood expression in (28), we can use a likelihood ratio test with one degree of freedom (3 versus 4 parameters). The log-likelihood of the double-tail fit ( Table 4( Here the symmetric fit gives a transformed sample that is more Gaussian, but it pays a greater penalty for transforming the data. Comparing twice their difference to a 2 1 distribution gives a -value of 0.29. For comparison, a skew-fit [51], with location , scale , shape , and ] degrees of freedom, also yields (Function st.mle in the R package sn.) a nonsignificant̂ (Table 4(b)). Thus both fits cannot reject symmetry. Assume we have to make a decision if we should trade a certificate replicating the S&P 500. Since we can either buy or sell, it is not important if the average return is positive or negative, as long as it is significantly different from zero.  (Table 4(c)) and Student -fit (Table 4(d)) reject the zero mean null ( -values: 10 −4 and 3 ⋅ 10 −5 , resp.).
Location and scale estimates are almost identical, but tail estimates lead to different conclusions: while for] = 3.71 only moments up to order 3 exist, in the Lambert W × Gaussian case moments up to order 5 exist (1/0.172 = 5.81). This is especially noteworthy as many theoretical results in the (financial) time series literature rely on finite fourth moments [52,53]; consequently many empirical studies test this assumption [7,54]. Here Student's and a Lambert W × Gaussian fit give different conclusions. Since previous empirical studies often use Student's as a baseline [41], it might be worthwhile to reexamine their findings in light of heavy tail Lambert W × Gaussian distributions.
is an adequate (unconditional) Lambert W × Gaussian model for the S&P 500 log-returns y. For trading, this means that the expected return is significantly larger than zero (̂= 0.055 > 0) and thus replicating certificates should be bought.

Gaussian MLE for Gaussianized
Data. For = ≡ < 1, also ≡ . We can therefore replace testing = 0 versus ̸ = 0 for a non-Gaussian y, with the very well-understood hypothesis test = 0 versus ̸ = 0 for the Gaussian x̂M LE . In particular, standard errors based on̂/ √ and thus t and -values should be closer to the "truth" (Tables 4(c) and 4(d)) than a Gaussian MLE on the non-Gaussian y (Table 4(e)). Table 4(f) shows that standard errors for̂x are even a bit too small compared to the heavytailed versions. Since the "Gaussianizing" transformation was estimated, treating x̂M LE as if it was original data is too optimistic regarding its Normality (recall the penalty (30) in the total likelihood (28)).
This example confirms that if a model and its theoretical properties are based on Normality, but the observed data is heavy-tailed, then Gaussianizing the data first gives more reliable inference than applying Gaussian methods to the original, heavy-tailed data ( Figure 1). Clearly, a joint estimation of the model parameters based on Lambert W × Gaussian random variables (or any other heavy-tailed distribution) would be optimal. However, theoretical properties and estimation techniques may not be available or well understood. The Lambert Way to Gaussianize data is Table 4: MLE fits to S&P 500 y Tables 4(a), 4(b), 4(c), 4(d), and 4(e) and the Gaussianized data x̂M LE  thus a pragmatic method to improve statistical inference on heavy-tailed data, while preserving the applicability and interpretation of Gaussian models.

Discussion and Outlook
In this work I use the Lambert W function to model and remove heavy tails from continuous random variables using a data-transformation approach. For Gaussian random variables this not only contributes to existing work on Tukey's ℎ distribution but also gives convincing empirical results: unimodal data with heavy tails can be transformed to Gaussian data. Properties of a Gaussian model M N on the back-transformed data mimic the features of the "true" heavy-tailed model M very closely. Since Normality is the single most typical and often required assumption in many areas of statistics, machine learning, and signal processing, future research can take many directions. From a theoretical perspective properties of Lambert W × distributions viewed as a generalization of already well-known distributions can be studied. This area will profit from existing literature on the Lambert W function, which has been discovered only recently by the statistics community. Empirical work can focus on transforming the data and comparing approximate Gaussian with joint heavy tail analyses. The comparisons in this work showed that approximate inference for Gaussianized data is comparable with the direct heavy tail modeling and so provides a simple tool to improve inference for heavy-tailed data in statistical practice.
I also provide the R package Lambert W, publicly available at CRAN, to facilitate the use of heavy tail Lambert W × distributions in practice.

Disclosure
The first version of this article entitled "Closed-form cdf and pdf of Tukey's ℎ-distribution, the heavy-tail Lambert W approach, and how to bijectively 'Gaussianize' heavy-tailed data" has appeared online [56].This research was performed while the author was at Carnegie Mellon University. Currently, Georg M. Goerg works at Google Inc., New York, NY 10011, USA.