Defining Sample Quantiles by the True Rank Probability

. Many definitions exist for sample quantiles and are included in statistical software. The need to adopt a standard definition of sample quantiles has been recognized and different definitions have been compared in terms of satisfying some desirable properties, but no consensus has been found. We outline here that comparisons of the sample quantile definitions are irrelevant because the probabilities associated with order-ranked sample values are known exactly. Accordingly, the standard definition for sample quantiles should be based on the true rank probabilities. We show that this allows more accurate inference of the tails of the distribution, and thus improves estimation of the probability of extreme events.


Introduction
The quantile of a continuous, strictly monotonous distribution function  is defined as where  is the probability of nonexceedance of a variable value.When the distribution is unknown, sample quantiles provide estimators of their population counterparts based on a set of independent order-ranked observations  1 , . . .,   .The associated sample probabilities are then  1 , . . .,   , where   is the probability of a new sampled value  +1 being less than or equal to   .These nonexceedance probabilities are those defining the cumulative distribution function (CDF).Many different formulas for defining sample quantiles have been used in literature and statistical software.This has caused considerable confusion, in particular when performing extreme value analysis for various applications where probabilities of rare events need to be estimated.In a widely cited article Hyndman and Fan [1] identified this problem and emphasized that there is a need to adopt a standard definition for sample quantiles.The same problem was discussed again by Langford [2] who identified twelve different sample quantile definitions that are used in statistical software.
Hyndman and Fan [1] analysed nine different sample quantile definitions.They selected six "desirable properties" for an estimator of a sample quantile and considered how well different definitions satisfy them.This approach is similar to judging the plotting position estimators by five "postulates" as done by Gumbel [3] and by three "purposes" by Kimball [4].Hyndman and Fan [1] proposed   = ( − 1/3)/( + 1/3) to be used as the basis of the standard definition.
However, the definition of the quantile has not yet been standardized.Modern statistical software, such as Matlab, Excel, SciPy, STATA, , and , include different definitions and offer user-selected options for the formulation of the quantile function, as well as for plotting positions in quantile plots and quantile-quantile plots; see, for example, Castillo-Gutiérrez et al. [5].The inability to agree on a standard definition has arisen from the many proposals [6][7][8] and the subjective nature of the "criteria" and "desired properties." Since the quantile function is the reverse of the cumulative distribution function, the quality of its definition must be judged by how close the probabilities defined by it are to the true probabilities of the cumulative distribution function.Thus, the definition of a sample quantile function should be based on the true nonexceedance probabilities.It was pointed out by Makkonen [9] that, for order-ranked data, they are known exactly.We outline here two rigorous proofs of this conclusion and show how the appropriate definition for the sample quantile function follows from it.

Sample Probabilities
We present in the following two deductions of the probability   .
Consider in Figure 1 an order-ranked sample (a) of  random observations (white circles) and a new observation (grey circle) sampled randomly from the population the distribution of which is unknown.In the new sample (b) obtained by including the new observation the new value may fall in any interval of the original sample or be smaller than  1 or larger than   .In the sample, each observation  1 , . . .,  +1 has the same probability 1/( + 1) to be the smallest one.In particular, ( +1 is smallest) = ( +1 ≤  1 ) = 1/( + 1).
The result deduced above can also be derived by formal mathematics [10].Consider variate  with cumulative distribution function  and a sample of  observations  1 , . . .,   ranked in ascending order.Values   in different samples of size  are random values of variate   for which the probability density function   in terms of  and its derivative   =  is given by [3,11] We wish to associate a probability   = { ≤   } with each observed rank .The precise meaning of   is illustrated in Figure 2. Since the probability of event { ≤   } is controlled by two variates   and , the probability is obtained by integrating the joint density function    , of the variates   and  over the area where  ≤   .Due to the mutual independence of   and , their joint density function    , (  , ) equals   (  )(), where   and  =   are the density functions of   and , respectively.The nonexceedance probability { ≤   } is, therefore, obtained by integration of the joint density function over zone  ≤   : The last step in the deduction above is based on direct application of Euler's -function.

Plotting Positions
One might expect that, based on (3) and the definition of CDF, /( + 1) would have been the sample probability used by everybody.Unfortunately, this is not so, as discussed, for example, in the reviews [12,13].Even though /( + 1) has been recommended already by Weibull [7] and used by numerous researchers since the 1950s, there has been a lot of research aiming at "improving" the probability by using probability estimates called plotting positions of the form ( + )/( + ), where  and  are constants depending on the type of parent distribution, size of the sample, and so forth.These attempts include, for example, Benard and Bos-Levenbach [14], Blom [6], Langbein [15], Gringorten [8], Wilk and Gnanadesikan [16], Barnett [17], Cunnane [18], Guo [19], Jones [20], Yu and Huang [21], and Folland and Anderson [22].To the effect of adding to this complexity, even numerical methods to calculate plotting positions have been proposed [23].
The main reason for this trend appears to be a misunderstanding in the role of the mean value both in theoretical considerations and when interpreting the results of Monte Carlo simulations.This is shown in the Appendix, in which also the ideal performance of the Weibull positions is demonstrated in one special case using the bin frequency criterion introduced by Makkonen et al. [24].According to the bin frequency criterion, the method of least squares (MLS) using the Weibull probabilities gives a better distribution function than MLS using any of the plotting positions of Cunnane, Gringorten and Blom.

Sample Quantiles
It is apparent that even if there may be some reasons, for example, bias in the MLS, to use the linear transformations of /( + 1), that is, the plotting positions of the type of Cunnane, Blom, Gringorten, Hyndman and Fan, and so forth, in curve fitting, none of them is valid when the sample probabilities are plotted.The definition of a sample quantile does not need to be based on any "estimator" of   .It can be defined objectively based on the true nonexceedance probability   = /( + 1) of a new value drawn from the population.For variable values in between the observed values, one may interpolate linearly.Accordingly, we define the following.
For  = 1, . . ., For  = 1, . . .,  − 1 and   ≤  ≤  +1 we define The definition of the quantile function in ( 5) is illustrated in Figure 3.The definition of ( 5) has all the desired properties of Hyndman and Fan [1].The huge advantage of (5) over all other suggestions for the sample quantile definition is that it is based on the true probabilities.This makes it possible to standardize the sample quantile function in statistical software by sound theoretical basis instead of subjective criteria.
Using any other definition, for example, / promoted by Langford [2], as an estimate of the probability   , results in significant relative errors, particularly at the tails of the distribution.This is illustrated in Figure 4 showing the relative errors for different ranks .The relative error in the exceedance probability 1 −   is defined as Note that, at the upper tail of the distribution, the relative error remains large even when the sample size  goes to infinity.The error of using / or some other improper estimate of the probability plotting positions is particularly persistent in the extreme value analysis methods [13].This confusion alone justifies the definition as proposed here.

Conclusions
The sample quantiles can be defined by the true nonexceedance probabilities /( + 1) of the order-ranked sample values.Following the basic principles of the probability calculus, the sample quantile function can, therefore, be defined by (5).This definition removes the methodological uncertainty related to calculating sample quantiles and should be adopted as a standard in statistical software.
The claim that the true nonexceedance probabilities, socalled Weibull plotting positions, result in a biased estimate of a CDF is shown to be false and founded on a misunderstanding in theoretical considerations and when interpreting the results of Monte Carlo simulations.
The definition of the quantile function proposed here should, of course, be applied to its reverse function, EDF, as well, and used in the inference of data.This is particularly important in the extreme value analysis where probabilities of rare events need to be estimated.

Appendix Evaluating Plotting Positions
Consider a normally distributed variate  with mean 0, standard deviation 1, and distribution function .Take  values   =  −1 (/( + 1)),  = 1, . . ., .Assume that  represents the function which transforms the probabilities to the probability paper; that is, all points (  ; (  )) fall on the same straight line which represents the cumulative distribution function (CDF).The slope and intercept of the straight line are independent of .
Plot next the points (  ; (   )) on the same probability paper.Here, values   are the same as those above but the probabilities    represent Blom's [6] plotting position ( − 0.375)/( + 0.25), which have been developed for normal distribution.With increasing  the resulting curve approaches the correct straight line, but they never coincide as illustrated in Figure 5.In this way we have detected samples which are correctly represented by the Weibull plotting positions and incorrectly by any other.Vice versa, choosing properly plotting positions ( + )/( + ) with (, ) ̸ = (0, 1) we get a curve which by a linear regression can be forced to compensate for any error somewhere else.This is exactly what has been done in the history of plotting positions.The reason for and the nature of the error are characterized in the following.
First, the reasons why the Weibull [7] plotting positions have been objected in the literature are discussed.
(1) Gumbel showed that the expected value of the probability (  ) is ((  )) = /( + 1), and, for example, Langbein [15] argued that the probability of the next observation to fall in interval (  ;  +1 ) is 1/( + 1), but these observations have not been regarded as rigorous justifications to use the Weibull positions.It is likely that the terminology used, for example, by Chow [25] (mean number of exceedances in  future trials) and by Langbein [15] (mean value of exceedance probabilities), has not been understood as giving an ordinate on the CDF.A rigorous proof for ( ≤   ) = /( + 1) was presented in the textbook by Madsen et al. [11], but this has not deserved much attention in the later research.
(2) The way of thinking for some researchers has been that the sample values are given and the probabilities associated with them are random while the correct way is to think vice versa that the sample probabilities are exact and the sample values associated with them are variates.Some others, for example, Benson [26], appear to have fully understood that the plotting positions different from those of Weibull are not probabilities.Nevertheless, for example, Cunnane [18] claimed that it is not necessary to use the true probabilities because the final result, that is, the resulting regression line, is decisive.Cunnane [18] tried to estimate the probability ((  )) and used it as the plotting position when determining the CDF.In other words he used ( ≤ (  )) to represent ( ≤   ), although they are different concepts.It is not at all surprising that the Weibull positions corresponding to the latter probability are not representative of the former.
(3) In Monte Carlo simulations by the Weibull positions, the conventional curve fitting procedures, like the MLS tend to result in parameter estimates, the means of which do not coincide with the parameters of the distribution from which the samples are taken.It has been observed that the difference can be reduced by transforming the Weibull points linearly, that is, by replacing probabilities /( + 1) by ( + )/( + ) = [( + 1)/( + )][/( + 1)] + /( + ).
Geometrically, this simply means that, to improve the fit, the straight line on the probability paper, resulting from linear regression, is replaced by another straight line.In more detail, when points ( , , /( + 1)) are replaced by points ( , , ( + )/( + )), the linearity is also affected as illustrated in Figure 5.This effect remains hidden, however, because the linear regression forces the fitted curve to be linear.The behaviour described above has often been explained by stating that the Weibull probabilities are "biased" or more "biased" than some other probabilities.This is misuse of terminology.The bias is defined as (â) − , where (â) is the expected value of an estimator â determined from a sample and  is the correct parameter value.The Weibull probabilities are exact values in the same way as 1/2 is the probability of heads when tossing a coin.There is no need for estimation here.
In contrast to the abovementioned arguments against the Weibull positions, the bias in the parameters, resulting from the traditional Monte Carlo simulations by the Weibull positions, is, in fact, attributable to taking the mean of the parameter estimates and to the curve fitting method.From the mathematics we know that if  and  are nonlinearly related as  = (), it follows that () ̸ = (()).From the elementary statistics we know that the sample variance is a biased estimate of the population variance.Why should we then believe that the mean of standard deviations or any other distribution parameters obtained from successive samples ( 1, ,  2, , . . .,  , ) would approach the parameter of the population?On probability paper, the slope of the regression line represents /(()), where  stands for a proper nonlinear transformation of the probability axis.Thus, there is no a priori reason to expect that a mean of sample slopes in MC simulations presents something relevant in the probabilistic sense.
Consequently, the convergence of the mean of successive parameter estimates to the correct parameter value cannot be regarded as a goodness criterion for plotting positions.We should use a criterion based on the bin frequency instead, because it is the frequency by which probability is defined.From a parent distribution with given parameters, take a sample of size , find the estimated straight line () =  + , take from the parent distribution one additional random value  +1 , record the bin [0, 1/( + 1)], (1/( + 1), 2/( + 1)], . . ., (/( + 1), 1] to which  +1 =  −1 ( +1 + ) belongs, and repeat the steps.A uniform distribution of hits to each bin means that the method has been successful.The fit on each bin (( − 1)/( + 1), /( + 1)] can be considered separately using the criterion or the whole distribution by Here  is the number of simulations and   is the number of hits to bin .Such an analysis was made by Makkonen et al. [24].The simulations verified that the Weibull positions give the most accurate estimate in the sense of criterion (A.2) for all considered distributions, that is, for Gumbel, Weibull, normal, and lognormal distribution.Another simulation for the Gumbel distribution () = exp(− exp(−( − )/)) with mean = 5 and standard deviation = 2 was carried out using the Weibull and Gringorten plotting positions, and its results are presented in Figure 6.Sample size 2 was chosen to ) when using Weibull's [7], and Gringorten's [8] plotting positions in Monte Carlo simulations with 50,000 cycles.eliminate the possible bias due to the linear regression.The results confirm the performance of the Weibull positions in the same way as a nearly uniform distribution in numbers 1, . . ., 6 confirms that the die is fair.The erroneous deduction based on calculating the mean of the estimated distribution parameters  and  results in a traditional (and incorrect) conclusion that the Weibull positions are worse than those proposed by Gringorten [8].The source of this misunderstanding is demonstrated in Table 1.
The discussion above shows that all the claims presented against the Weibull plotting positions are unfounded.Particularly, the recent Monte Carlo simulations supporting these claims [27][28][29][30][31][32][33] are based on a misunderstood role of the mean of the sample parameters.The performance of a single fitted curve should always be compared with the Weibull probabilities plotted against the observed orderranked values.The plotting positions, different from those of Weibull, should not be used to eliminate the error observed when a mean of the sample estimates is taken in Monte Carlo simulations, because such an error never occurs in a practical situation where we have only one sample and one estimate for each parameter.

Figure 1 :
Figure 1: Re-ranking when adding a new sample.

Figure 2 :
Figure 2: Contours of joint density function    , drawn in   , plane.   , integrated over the grey half plane where  ≤   gives the probability { ≤   }.

Figure 3 :Figure 4 :
Figure 3: Illustration of the proposed sample quantile definition by an example where  = 9.

Figure 5 :
Figure5: cdf (straight line) of a normal distribution on probability paper and the curves due to using Blom's plotting positions[6] instead of the true probabilities.Sample sizes 10, 20, and 50.

Table 1 :
Parameters obtained for Gumbel distribution from MC simulation with 50,000 cycles.These parameters cannot be used in evaluating plotting positions (see text).