Inference about the tail of a distribution. Improvement on the Hill estimator

The Hill estimator is often used to infer the power behavior in tails of experimental distribution functions. This estimator is known to produce bad results in certain situations which have lead to the so-called Hill horror plots. In this brief note, we propose an improved estimator which is simple and coherent and often provides an efficient remedy in the bad situations, especially when the distribution is decreasing slowly, when the data is restricted by external cuts to lie within a finite domain, or even when the distribution is increasing.


Introduction
It has been advocated that self-organization, first discovered to dominate sand pile formation [1], may very well apply to many financial, economic, traffic control or social phenomena. The general outcome of these systems is in general an asymptotic power like behavior of the experimental distribution x −µ in some variable x. The value of the critical exponent µ is often a major prediction of the models. Let us also mention models leading to Pareto [2], Zipf [3], Levy-flight [4] or Padé [5] type distributions which also provide asymptotic power like behaviors with some definite critical exponent.
A particularly simple estimator of the critical exponent for a set of data measures [6] has been provided by Hill [7]. It has proved to be very useful and has been widely used. See for example the KU Leuven web page [8]. Following Hill, the estimator is first applied to a subset of measures corresponding to the highest values of the variable. In order to increase the statistics, this subset is then extended to include progressively further experimental values of the variable, every set providing a value of µ. These values are then plotted in a so-called Hill plot [9]. The subset is extended as long as it includes values which can be considered to lie in the asymptotic domain where the distribution is power like. The values obtained when they are stable provide the best value of the Hill estimator, as usually seen on the Hill plot. It is then obvious that the estimator usually takes into account all the highest experimental values of the variable larger than a certain value x ≥ L. The data points with x < L do not belong to the asymptotic region or turn out to be unsafe to use.
Various properties, aspects and generalizations of the Hill estimator have been discussed and studied in numerous publications. One should mention that many questions about the asymptotic properties, about asymptotic normality, and about the volatility of the index have been addressed (see for example [10], [11], [12], [13], [14]). Some authors have proposed, in order to improve the high volatility of the Hill plots, to smooth the result by averaging the Hill estimator values corresponding to different numbers of order statistics [15]. Other authors have smoothed the Hill estimator by convoluting the experimental random variables by a kernel function together with a bandwidth parameter [16]. An optimal choice of the kernel and of the parameter improves the situation greatly.
However, it should be stressed that, in many cases, experimental distributions are related to discrete phenomena, which carry their own natural limits.
In other cases, large values in the data may be biased, unsafe or unreachable for technical reasons. Let us give two examples.
One example of considerable importance relates to the evaluation of risk in finance, in particular those related to the variations of interest rates. Starting from their known daily variation in the past, one may try to evaluate the probability of a major overnight variation of say 3 or 4 percent in the future. Using a normal law, as often used by risk managers, the catastrophe should occur of the order of once every 10,000 years. Starting from the same available data but with a power law this may happen tree or four times per century. The precise evaluation of the parameter µ is obviously of paramount importance. In data of the Federal Reserve System, if any day the variation is larger than a certain predefined limit the quotation is suspended. Hence the data is artificially cut for high x. The relevant X i must be restricted in a finite domain extending from some value L where asymptotics begins to some value R where the data is artificially cut.
Another example occurs in sociology. Suppose that some phenomena depends on the number of inhabitants of cities, it is clear that no data on earth today will be obtained for towns larger than say thirty millions inhabitants or smaller than say ten inhabitants. The available data will be restricted to a finite domain.
To conclude, in various cases, the variable is not only discrete but also restricted to lie within a finite domain D L ≤ x ≤ D R . Data outside the domain is either not available or unsafe to use or does not correspond to the asymptotic region. Remark that the domain may not only have a lowest (left) bound D L but also a highest (right) bound D R . Though the lower bound D L is taken into account by the Hill estimator, the possible presence of the highest bound D R has not been addressed by the Hill.
In this paper, we thus wish to provide a very simple improvement of the Hill estimator (and of the Hill plot) which takes into account in a perfectly symmetrical way of a lowest value L (D L ≤ L) and of a highest value R (R ≤ D R ) defining a safe domain [L, R] where the power behavior is at work, where data exists, and from which the critical exponent should be inferred.
The problem of existing limits on the data has been subject to a much lower scrutiny than the Hill estimator itself. We would like to cite the work of Beirlant and Guillou and coworkers [17], [18] dealing in particular with insurance policies where policy provisions (deduction, limits) constrain the data. They suppose censored data i.e. that the number of events above the constrain value is known and discuss the influence of censoring. This is not exactly the aim of our article.
We have focused our attention on data which exist in a finite range only and have obtained the simplest estimator which takes into account in a symmetrical way the lower and higher bounds of the domain.
Finally, we would like to mention the related problems of variance and bias of the estimators and especially the question of their experimental determination. For the Hill estimator, this is delicate and it has given rise to much research as is testified by numerous references [19], [20], [21], [22]. Starting from the basic definition of the variance-bias parameters applied to the improved estimator, a detailed discussion following the paths set up for the Hill estimator are certainly worth further work and publications, both in theory and when actual data, which carry their own uncertainty, are used. Indeed, the experimental precision of each data point is not always easy to determine, may in fact depend of the value of the X i itself and has to be taken into account precisely. On the other hand, in simulations, the result depends on the details of the random number generators.

The Hill Estimator
One supposes that, based on some theoretical model, one predicts a distribution density f (x) which behaves asymptotically as where f and x are real and the critical exponent µ is a positive real parameter. When x → ∞, the function g(x) is supposed to be a rather smooth function which, for large x (say x > L), is often assumed to become essentially a constant λ. In those cases and within a certain margin of error, the distribution is thus approximated in the form which is scale free. As explained in the introduction, it has been conjectured by different authors that this type of power like distributions and in particular those based on self-organized critical models, rather than the often used Gauss like exponential forms, could very well dominate certain financial, economic or social phenomena. Let Y i (Y i > 0) be a random sample obtained from experimental data for a phenomenon which is supposed to follow a distribution law f satisfying the requirements (1) and/or (2). The question is to draw inference on the critical exponent µ from the random sample.
This most important question was discussed very carefully by Hill [7] both from a Bayesian and a frequentist approach. He showed that, in a first approach, both points of view lead to the same very useful answer. His recipe for estimating µ can be outlined as follows.
• Let the set X i be the set Y i reordered (reversed order statistics) in such a way that i.e. the set X i is ordered in a decreasing fashion, X 1 being the highest Y value.
• Construct the sets of numbers H k , α k and µ k for k > 2. The set H k is defined by which is identical to the set proposed by Hill (see [7]) The set α k is defined by and the set µ k defined by • For quite many distributions, Hill showed that µ k is an estimator of µ improving when k is increased until "it seems unwise to proceed". It is the point X k ≈ L at the left of which the form of the actual distribution is not anymore approximated safely enough by the equation (2). In other words, this occurs when the variable x is not large enough anymore to be in the asymptotic region of the distribution. Both the phenomenon and the limitation can easily be seen by constructing specific examples.
Unfortunately, there are simple but somewhat more subtle distributions for which the set µ k does not converge well when k is increased. This type of situation has led to what has been called "horror Hill plots". Such horror plots can easily be constructed by considering simple examples or by looking at page 194 of the reference [23]. As argued in the introduction, there are also cases when the discrete experimental distribution carry their own domain, bounded on the left and on the right D L < x < D R .
In this short note, we show how to remedy some of these unfortunate situations in a very straightforward way. We construct some numerical examples to show this explicitly. This result will be achieved by taking into account, not only a left boundary L but also a right boundary R. It is supposed that the data cannot be trusted and/or is not power like at the left of L or at the right of R.

Improvement on the Hill estimator
We first derive a simple heuristic formula and then show how to apply it to improve the Hill estimator.

A simple and exact formula
Let us first suppose that f (x) is exactly, for x > 0, of a power law form with an arbitrary normalization constant λ. Take two arbitrary positive numbers 0 < L < R and define the average value < ln(x) > LR of ln(x) on the interval [L, R] as After some algebra, defining µ = α + 1 (10) and for later convenience one finds, from (8), the exact relation which depends on the correction function C(α, L, R) Equation (12) is our basic equation which is exactly valid for an exact power law distribution (8) and approximately correct for a distribution which satisfies (1) and/or (2).

The basic Hill Estimator
The basic Hill estimator is obtained when the upper limit R is chosen at infinity. Indeed, when µ is larger than 1 and R → ∞, Equation (12) The formula (14) can easily be used to draw inference for α and µ from the k highest values of the random sample X with the chosen order statistics which we consider as the conditional event. Take < ln(x) > LR as the experimental average value of ln(x) from L = X k to the highest empirical value X 1 , We find the formula for the Hill H k estimator (4) exactly.
Remark that by using a Simpson rule an experimental average, slightly better than (15), is obtained by The difference for the Hill estimator using the slightly better (17) rather than (15) is usually too minute to care.

The Improvement
From the equation (12), we see that the Hill estimator can be improved by taking into account not only a lowest value L = X l in the reduced empirical set {X l , X l−1 , . . . , X r } but also a highest value R = X r . Taking again a sample estimate for < ln(x) > LR in the reduced set, one obtains the solution α lr of as an inference for µ µ lr = 1 + α lr .
This obviously provides an easy generalization of the Hill procedure. It makes sense when it is known theoretically that the density function f (x) behaves as x −µ times a constant λ or times a slowly varying function g(x) and that the random sample is secure for D L ≤ L ≤ x ≤ R ≤ D R . It takes into account the facts that the exact theoretical form of f (x) is not known on the left of x = L and that the data points do not extend beyond R = X r because of limited statistics or when the data is poorly known outside a finite D L , D R domain. In this case, the set of X i derived from the random sample Y should include the values inside the interval only and thus take the X i 's between the two limit points (L < X l ≤ X i ≤ X r < R) into account.
The generalised Hill plot is obtained by taking first l = r + 1, then increasing l until "t seems unwise to proceed" and plotting the µ lr as a function of l. Otherwise l can be increased and r decreased until "it seems unwise to proceed". When the data is not biased for the largest values of X i , the choice r = 1 is optimal.
It is finally worth noting the very important fact that our basic equation (18) is perfectly symmetrical under the exchange X l ↔ X r . As a result, it will be apply and provide meaningful answer for distribution which increase (µ negative) rather than decrease with x, as will be seen in the examples.
Our fundamental equation (18) is transcendental and hence requires a further treatment. The estimation of the value of α lr inferred from (18) can, for example, be obtained numerically in the two following ways Method (1): Direct Evaluation. By using standard ad hoc computer programs, the equation (18) can be solved numerically for α lr .
Method (2): Iteration. A simple way to achieve the same result is as follows. Define the function D(α, L, R) as the derivative of the correction function C(α, L, R) with respect to α Take the first order α [1] lr as the Hill solution and define the successive approximations α lr , p = 2, . . . by iteration as In the right hand side, the correction function C and its derivative D are estimated for the value of α of the preceding iteration. Then α [p] lr → α lr for p → ∞. Empirically, for [p] as low as [4] or [5], the approximation for α lr and hence for µ lr = α lr +1 is already very good.

Theoretical comparison between the Hill estimator and the new estimator
When should the new estimator obviously be preferred to the Hill estimator?
We remark that the latter (see Eq.(4)) depends critically on two quantities only. These are the average value of the logarithm of the data points and the smallest X l (highest l) included in the sample. The improved estimator depends on these two quantities but also crucially on the third one: the highest attained value X r . Let us justify this by discussing more carefully than in the introduction three examples of situations, pertaining to three completely different domains of research (sociology, economy, high energy physics), where both limits X l and X r should be taken into account.
• Suppose that some sociological theory predicts an asymptotic power law for some phenomenon (crime rate, economic growth, power consumption) as a function of the number of inhabitants N inh in towns, either when this number is high or when it is small. Obviously, since every country has a finite number of inhabitants, the number of people in any city on earth is less than some number, actually 11.914 × 10 6 inhabitants for Bombay (India). There can be no data for larger numbers. The data is cut artificially on this higher side essentially by the fact that the earth has a finite population. On the other end of the statistics, when N inh is small, no aggregation of human beings is called a city if it contains less than say about ten inhabitants, depending on the rules for defining towns in different countries. Even more, no town will ever be defined with a fraction of one inhabitant. Hence the data is cut on the lower side by country depending administrative decisions.
• Suppose that some economic theory predicts a power law for the daily variation of the interest rates (or of the price of some share) when they are large (the tail of the variations). Usually, if this variation on some day exceeds some predefined number the floor has rules to suspend the quotation. Hence no data will ever be produced with higher variations. This artificial limit, which is usually not taken into account in the models, is included in the analysis of the data by the dependence of the new estimator in the highest value X r .
• In high energy physics, when two beams of initial particles are made to scatter head on at very high speed along a certain direction, final particles emerge from the scattering region at an angle with respect to the direction of the initial beams. The probability distribution is often plotted as a function of the transverse momentum which is related to this angle. The interaction region is surrounded by detectors. In the forward and in the backward directions there is a blind cone where the scattered particles cannot be identified among those of the intensive initial beams. Hence there are artificial limits both at low and at high transverse momenta outside which no data is produced because of the physics of the measuring devices. When the tails of the distributions of the scattered particles with respect to beam are studied, these artificial limitations have to be taken into account.
• In practical situations, to obtain the improved estimator the parame-ters bounds X l and X r , which limit on the left and on the right the data points, have to be chosen in an appropriate manner. There is, a priori, no unique method for choosing them. For actual phenomena, related to some definite model (for example to solutions of a differential equation), the model itself should provide some information on the bounds. The onset of the asymptotic domain of the model provides an indication on X l . On the other hand, the measure themselves may carry a natural right bound X r where the measurements become meaningless (for example a town of more than 5×10 7 inhabitants). The estimator should be evaluated for various values of the bounds X l , X r and educated guesses have to be applied. In general, it is convenient to simply use the highest data point as X r unless there are good reasons to reject it. In high energy physics, "good experimentalists" are known to apply suitable and correct cuts to their data, depending on a deep inside knowledge of their apparatus.

Tests of the new Estimator
In order to test the new estimator, we have used it for some trial functions.
The new estimator performs as well or even slightly better than the original Hill estimator in cases where the latter is known to be good. When applied to cases where the Hill estimator is known to perform more poorly, the improvement is often spectacular and the new estimator converges to a much better approximate value. Using the method (2) above (see Eq. (22)), a good value α lr is usually obtained after three steps only and always after four steps i.e. from α lr ≈ α  (14)- (17), the Hill plot is drawn for somewhat more complicated distributions where the deviation from a purely power behavior shows up for small x. In all the cases, r = 1 (see (18)) has been chosen.
1. We have used FORTRAN to do numerical calculations 2. We have used the DRNGCS module of the International Mathematical and Statistical Libraries (IMSL) to generate randomly a certain num-ber N rand of X i , inside a predefined interval [D L , D R ] and following a given distribution P . It should be noted that the normalization of P is inessential. As required by DRNGCS, this normalization is chosen in such a way that the cumulated distribution N (y) has N (D R ) = 1. In our examples, the distribution P and the cumulated distribution N have always been defined on a grid of 1000 to 10,000 points regularly separated between D L and D R . This is precise enough for our purpose.

For each example (1)-(13), we give the results in a table which includes
• The trial distribution P , up to an arbitrary factor, fed to the random generator.
• The predefined limits D L and D R chosen for the data D L ≤ X i ≤ D R produced by the random generator and ordered according to Eq.(3).
• The number N rand of randomly produced X i .
• The smallest value L = X Nran and highest value R = X 1 obtained from the random generator. Obviously D L ≤ L < R ≤ D R .
• The average value σ of the natural logarithm of the X i as produced by the random generator.
• The value expected for µ, the exponent of the power decrease of the starting distribution P .
• The values µ [5] produced by the 4 d iteration. Quite generally, the values µ [4] produced by the 3 d iteration are already very good except in example (13) below where we had to proceed to µ [5] ). In almost all cases (except in example (13)), it turns out that µ [j] ≈ µ [4] for j > 4 with at least four significant digits well within the errors of the method.

Example (1)
Let us first give an example where the old and the new estimators perform both well. We have supposed a power law (8) with µ = 5, D L = 3 and D R = 150. In this case, D R is large enough and R can be essentially ignored in all the formulas. Following the procedure outlined above we obtain the line (1) of the table. We see that in the values produced by the Hill estimator and by the new estimator are almost equal and close to the expected value µ = 5. The new estimator is slightly better. The same behavior is found when µ and the related D R are large enough.

Example (2)
The second example is almost identical to the first example except that the data is artificially reduced to the interval D L = 3, D R = 4. We find the line (2) of the table.
Here we see that in this extreme case, where the interval allowed has been reduced drastically, the Hill estimator has produced a very bad result µ Hill ≈ 10 while the new estimator converges toward a much better value close to the expected one.

Example (3)
The third example in line (3) of the table is identical to the second except that the number of X i , N rand is increased to 5000. The new estimator behaves even better while the Hill one is still very bad. This behavior was expected.
It should be noted that, when D R is increased from 4 to over 100, the two estimators tend gradually to produce compatible results. There is a weak dependency in the value chosen for N rand provided that it is chosen large enough.
Examples (4),(5), (6) When the probability distribution is chosen to decrease very slowly, here P = λ/ √ x, even if D R is chosen rather large D R = 150, D R = 1500 or D R = 15000, we see in line (4), (5) and (6) of the table that the new estimator is definitively better than the Hill one. An increase of N rand and/or of D R does not alter this fact drastically.

Examples (7),(8)
Let us now give an example where the initial distribution involves a slowly varying function as the consequence of a Padé(1,4) type distribution in the positive x region x > 0 We expect a µ = 4 type behavior provided that x is chosen sufficiently large.
The actual left limit L to be used depends on the values of the parameters p i . We analyze the problem in the case of an experimental symmetric distribution of variation of interest rates of a Padé type (see [5] and especially the results obtained in the first reference). Data which were extracted from the Federal Reserve System lead to p 2 = 494.7 and p 4 = 4886.0 for a lag of one day and a maturity of one year. Here, x is expressed in the convenient unit of percent/year. The parameters D R and D R as well as L and R are given percent/year. It is easy to see that for the term in p 4 to dominate over the term p 2 in the denominator, a value of x of about one percent in needed.
Hence L has to be chosen greater than one percent. We do not expect a x −4 behavior for smaller values. In lines (7) and (8) of the table, the left boundary D L is taken to be one percent while the right one D R , when the floor would be supposed to suspend the quotation, is respectively two and five percent.
Examples (9),(10), (11), (12) Let us now turn to two examples when there is a logarithmic factor, namely and when there is an inverse logarithmic factor In both cases the distribution decreases rather slowly so that taking into account the right boundary has usually a major effect. In these cases , we have increased the value of N rand to 5000. The results depend rather weakly on this choice. In lines (11) and (12) of the table, the presence of the ln(x) factor in the denominator leads to an effective decrease of the probability faster than 1/x. Hence, the effective µ [5] is expected and turns out to be somewhat higher than one. In the opposite way, in lines (9) and (10), when the logarithm appears in the numerator, the effective µ [5] is expected to be somewhat lower than one and it is.

Example (13)
Since the treatment of the new estimator (18) is perfectly symmetrical under the exchange X l ↔ X r of two limits X r and X l , it can be applied directly without any change to infer the asymptotic power behavior of distributions which increase with x and hence have a power behavior with negative µ up to a smooth multiplication function. Let us give a last example for µ = −3.5, D L = 3.0 and D R = 10000.0. Obviously the Hill estimator produces a wrong sign for µ Hill . The new estimator is perfect but the iterations had to be carried effectively to the fourth level since the starting µ [1] = µ Hill is so bad.
Example (14). Figure 1 In figure  Example (15). Example (16). Figure 3 In figure 3, the distribution is taken as ln(x)/x) related to example (10). The random data with N rand = 10000 has been collected in the interval D L = 100 and D R = 400. The continuous line is the original Hill plot, the dotted line is the improved plot and the dashed line is the expected asymptotic value µ = 1 (regardless of the ln(x)). In fact one sees that when k reaches about 4000, the value of µ is close to 1. For larger values of k, µ decreases below µ = 1. This is due to the fact that the slowly varying function ln(x) in the numerator leads to a slightly decreased value of µ whose effects increases with k. Indeed, since ln(x) is an increasing function of x, the distribution is expected to decrease somewhat more slowly than 1/x.
Example (17). Figure 4 In figure 4, the distribution is taken as 1/ √ x related to examples (4), (5), (6). The random data with N rand = 10000 has been collected in the interval D L = 3 and D R = 1500. The continuous line is the original Hill plot, the dotted line is the improved plot and the dashed line is the expected asymptotic value µ = 0.5. (1) x −5 Table (see text for the notation)

Conclusion
We have shown that, by a very simple alteration, the Hill estimator, which provides a useful inference for the power behavior of the tail of a distribution, can be easily improved to cover cases where it performs badly. This includes cases when the tail is not of the form of a power law but is multiplied by a slowly varying functions g(x) including logarithms for example. It also applies when the power µ is of the order or smaller than one. It even works for inferring the tails of increasing distributions i.e. when µ is negative. Needless to say, the approach outlined in this paper is one of efficiency. From the frequentist point of view, it is perfectly justified. However, in some way, the Bayesian point of view is also, but granted not completely, met as the theoretical and empirical knowledge of extreme tails of the distribution is somehow better taken into account. It should be stressed that the new estimator applies particularly when the data is confined by artificial external conditions to lie within a finite domain bounded not only by a lowest value of the variable but also by a highest value.