A Cluster Truncated Pareto Distribution and Its Applications

The Pareto distribution is a heavy-tailed distribution with many applications in the real world. The tail of the distribution is important, but the threshold of the distribution is difficult to determine in some situations. In this paper we consider two real-world examples with heavy-tailed observations, which lead us to propose a mixture truncated Pareto distribution (MTPD) and study its properties. We construct a cluster truncated Pareto distribution (CTPD) by using a two-point slope technique to estimate the MTPD from a random sample. We apply the MTPD and CTPD to the two examples and compare the proposed method with existing estimation methods. The results of log-log plots and goodness-of-fit tests show that the MTPD and the cluster estimation method fit the real-world data very well.


Introduction
There are many real-world problems modelled by heavy-tailed distributions, especially the Pareto distribution [1,2]. However, there are some difficulties in the estimation of Pareto distributions. First, the Pareto distribution has infinite moments in some heavy-tailed cases, so the moment estimation method for the shape parameter cannot be used in these situations. This is a loss for the estimation process since the moment estimator is a robust estimator. Several authors suggest using a truncated Pareto distribution, which always has finite moments (e.g., [3][4][5]).
In some situations, data behave differently between different thresholds. For example, losses from hurricane damage can be classified into small, medium, and large hurricane groups. The data in these classes may have different distributions, or, after grouping, data with self-similarity may have the same kind of distribution but with different parameters. A cluster method is needed to determine these groups when dealing with real data sets. In this paper we study an example of the 49 most damaging Atlantic hurricanes occurring between 1900 and 2005 [6]. The costs are standardized to 2005 USD; see Figure 1.
Coia and Huang [7] applied Pareto and truncated Pareto models to fit the hurricane data set. The maximum likelihood estimator (MLE) and the moment estimator for the shape parameter were used. The results are shown in a log-log plot in Figure 2. Coia and Huang [7] also used Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises goodness-of-fit tests. We note that the two estimated truncated Pareto curves (by the MLE and the moment method) fit the data set quite well; they fit much better in the tail than the original Pareto distribution (which appears as a straight line). But the truncated Pareto curves do not fit the data uniformly well, especially for the middle values. We observed that the data can be classified into three groups. The data in these three groups may still be Pareto distributed but with different shape parameters. In the literature, researchers study similar data sets by using cluster methods; for example, Coia and Huang [7] proposed a sieve model.
In this paper, we propose a more general method, the mixture truncated Pareto distribution (MTPD), in Section 2. We study the properties of the MTPD in Section 3. In Section 4, we propose a cluster method by using a two-point slope technique to estimate the MTPD from data, which utilizes a cluster truncated Pareto distribution (CTPD). In Section 5, we review the nonparametric kernel density estimation method. In Section 6, we analyze the hurricane data and a second example regarding sizes of fish caught off the Atlantic shore of Massachusetts, USA, by using the CTPD, the nonparametric kernel estimation method, and three other existing semiparametric estimation methods in log-log plots (see Figure 3 in Section 6). We also perform Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises goodness-of-fit tests on these two data sets. The results show that the proposed cluster method and the nonparametric method are superior to the other existing estimation methods in both examples.

ISRN Probability and Statistics

Figure 1: Hurricane damage cost ($10^9$ USD) by year, 1900 to 2005, with the Great Miami hurricane (1926) and Katrina (2005) labelled.
When the shape parameter $\alpha$ satisfies $0 < \alpha \le 1$, which is a heavy-tailed case, the mean and variance of the Pareto random variable $X$ are infinite, and the distribution is heavier in the right tail as $\alpha$ decreases.
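The truncated Pareto distribution (TPD) was originally used to describe the distribution of oil fields by size. It has a lower limit $\gamma$, an upper limit $\nu$, and a shape parameter $\alpha$. In fact, it has been shown that the truncated Pareto distribution fits better than the nontruncated Pareto distribution for some positively skewed populations [3].

Definition 2. The p.d.f. and c.d.f. of a random variable $X$ having the truncated Pareto distribution are given by
\[ f(x) = \frac{\alpha \gamma^{\alpha} x^{-\alpha-1}}{1 - (\gamma/\nu)^{\alpha}}, \qquad F(x) = \frac{1 - (\gamma/x)^{\alpha}}{1 - (\gamma/\nu)^{\alpha}}, \qquad 0 < \gamma \le x \le \nu < \infty, \ \alpha > 0, \]
where $\gamma$ and $\nu$ are the left and right truncation points.

The quantile function of the truncated Pareto distribution is
\[ F^{-1}(u) = \gamma \left[ 1 - u \left( 1 - (\gamma/\nu)^{\alpha} \right) \right]^{-1/\alpha}, \qquad 0 < u < 1. \]

The mean, second moment, and variance of $X$ are, respectively,
\[ E(X) = \frac{\alpha \gamma^{\alpha} \left( \nu^{1-\alpha} - \gamma^{1-\alpha} \right)}{(1-\alpha) \left[ 1 - (\gamma/\nu)^{\alpha} \right]} \ (\alpha \ne 1), \qquad E(X^2) = \frac{\alpha \gamma^{\alpha} \left( \nu^{2-\alpha} - \gamma^{2-\alpha} \right)}{(2-\alpha) \left[ 1 - (\gamma/\nu)^{\alpha} \right]} \ (\alpha \ne 2), \qquad \mathrm{Var}(X) = E(X^2) - [E(X)]^2. \]

We consider a vector of thresholds $T = (t_0, t_1, \ldots, t_k)$, where $0 < \gamma = t_0 < t_1 < \cdots < t_k = \nu < \infty$, $t_i \in \mathbb{R}$, $i = 1, 2, \ldots, k$.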
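The truncated Pareto formulas can be illustrated with a short Python sketch. It assumes the standard truncated Pareto form on $[\gamma, \nu]$, that is, $F(x) = [1 - (\gamma/x)^{\alpha}] / [1 - (\gamma/\nu)^{\alpha}]$; the function and argument names (`tpd_cdf`, `tpd_quantile`, `gamma`, `nu`, and so on) are illustrative choices and are not taken from the paper.

```python
import random


def tpd_pdf(x, alpha, gamma, nu):
    """Density of the truncated Pareto distribution on [gamma, nu]."""
    if x < gamma or x > nu:
        return 0.0
    return alpha * gamma ** alpha * x ** (-alpha - 1.0) / (1.0 - (gamma / nu) ** alpha)


def tpd_cdf(x, alpha, gamma, nu):
    """C.d.f. of the truncated Pareto distribution on [gamma, nu]."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)


def tpd_quantile(u, alpha, gamma, nu):
    """Inverse of tpd_cdf for 0 <= u <= 1 (closed form)."""
    return gamma * (1.0 - u * (1.0 - (gamma / nu) ** alpha)) ** (-1.0 / alpha)


def tpd_mean(alpha, gamma, nu):
    """Mean of the truncated Pareto distribution, valid for alpha != 1."""
    norm = 1.0 - (gamma / nu) ** alpha
    return (alpha * gamma ** alpha * (nu ** (1.0 - alpha) - gamma ** (1.0 - alpha))
            / ((1.0 - alpha) * norm))


def tpd_sample(n, alpha, gamma, nu, seed=0):
    """Draw n values by inverse-transform sampling."""
    rng = random.Random(seed)
    return [tpd_quantile(rng.random(), alpha, gamma, nu) for _ in range(n)]
```

Because the upper limit is finite, the mean stays finite even in the heavy-tailed case $\alpha \le 1$ (with $\alpha = 1$ requiring a separate logarithmic formula that this sketch does not cover).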
Proposition 5. For given $u$, $0 < u < 1$, the quantile function of a mixture truncated Pareto distribution in (10) is given by solving for $x$ in the following equation:
\[ F(x; T, \Lambda; W) = u. \]
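Since the quantile is defined only implicitly, it has to be found numerically. The sketch below uses plain bisection on a continuous, nondecreasing c.d.f. The mixture used for the demonstration is an assumption: it takes the MTPD to be a weighted sum of truncated Pareto components on adjacent intervals, which may differ from the exact form in (10). The names `mixture_cdf` and `quantile_by_bisection` are hypothetical.

```python
def tpd_cdf(x, alpha, lower, upper):
    """Truncated Pareto c.d.f. on [lower, upper]."""
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (1.0 - (lower / x) ** alpha) / (1.0 - (lower / upper) ** alpha)


def mixture_cdf(x, thresholds, alphas, weights):
    """Assumed mixture form: sum of w_i * TPD_i(x), with component i
    living on [thresholds[i], thresholds[i + 1]]."""
    total = 0.0
    for i, (alpha, w) in enumerate(zip(alphas, weights)):
        total += w * tpd_cdf(x, alpha, thresholds[i], thresholds[i + 1])
    return total


def quantile_by_bisection(cdf, u, lo, hi, tol=1e-10, max_iter=200):
    """Solve cdf(x) = u on [lo, hi] for a continuous, nondecreasing cdf."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid   # the root lies to the right of mid
        else:
            hi = mid   # the root lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

A call such as `quantile_by_bisection(lambda x: mixture_cdf(x, [1.0, 5.0, 50.0], [0.5, 1.5], [0.6, 0.4]), 0.9, 1.0, 50.0)` searches between the smallest and largest thresholds.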
Definition 6. The c.d.f. of a random variable having the cluster truncated Pareto distribution (CTPD) is given by (16), where $F(x; T, \Lambda; W)$ is a c.d.f. of the MTPD in (10) and $n_i$ is the sample size in the $i$th cluster in the $i$th domain $(t_{i-1}, t_i)$:
\[ n_i = c_i - c_{i-1}, \qquad i = 1, 2, \ldots, k, \]
where the $c_i$'s depend on the vector $C = (c_0, c_1, \ldots, c_k)$, where $0 = c_0 < c_1 < \cdots < c_k = n$, and $c_i$ is the number of data less than or equal to the threshold $t_i$. Note that $c_i$ is a function of $t_i$ and the random sample $(X_1, X_2, \ldots, X_n)$. Thus, for $i = 1, 2, \ldots, k$,
\[ c_i = \sum_{j=1}^{n} I_{\{X_j \le t_i\}}, \]
where $I_A$ is the indicator function of set $A$.
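The counting in Definition 6 can be made concrete with a small sketch. It assumes that the thresholds are given in increasing order, that the first threshold is the lower limit, and that the count for the lower limit is set to zero as in the definition. The function name `cluster_counts` and the toy data in the usage note are hypothetical.

```python
from bisect import bisect_right


def cluster_counts(sample, thresholds):
    """Return (c, sizes), where c[i] is the number of observations less
    than or equal to thresholds[i] (with c[0] fixed at 0) and
    sizes[i - 1] = c[i] - c[i - 1] is the size of the i-th cluster."""
    xs = sorted(sample)
    # bisect_right gives the number of sorted values <= t.
    c = [0] + [bisect_right(xs, t) for t in thresholds[1:]]
    sizes = [c[i] - c[i - 1] for i in range(1, len(c))]
    return c, sizes
```

For example, ten observations 1, 2, ..., 10 with thresholds 1, 4, 7, 10 give cumulative counts 0, 4, 7, 10 and cluster sizes 4, 3, 3.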

Definition 7. A two-point slope is defined in (19), where the number of clusters $k$ depends upon empirical observations of differences between successive two-point slopes.

We propose seven steps to construct a cluster truncated Pareto distribution in (16).

Step 2. Determine $k$ by using (20); there are two main factors:
(1) Determining $k$ depends upon empirical observations of differences between successive two-point slopes, namely cases where the absolute difference between two successive slopes is much larger than the previous absolute difference. (This technique is used on the two examples in Section 6.)
(2) We also ensure that the sample size within each group is sufficiently large (usually $n_i \ge 5$).

Then, by (22) and (23), we have $k$ clusters. This construction is shown in Box 1.

Step 6. Estimate $\hat{\alpha}_{i,\mathrm{Moment}}$. We suggest using the moment estimator $\hat{\alpha}$ in (26) since it has robust properties, but there are other estimators available in (25) and (27).
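The slope-based search for thresholds can be sketched as follows. This is only an illustration under stated assumptions: the two-point slope is taken here to be the slope between successive points of the log-log plot of the empirical survival function, which may differ in detail from (19), and the decision about which jumps are "much larger" is left to the reader, as in the text. The names `two_point_slopes` and `successive_differences` are hypothetical.

```python
import math


def two_point_slopes(sample):
    """Slopes between successive points (log x, log S(x)) of the sorted
    sample, where S(x) is the fraction of observations >= x.
    Assumes positive, distinct observations."""
    xs = sorted(sample)
    n = len(xs)
    # Index i (0-based) has n - i observations greater than or equal to it.
    pts = [(math.log(x), math.log((n - i) / n)) for i, x in enumerate(xs)]
    slopes = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x1 == x0:
            raise ValueError("tied observations: slope undefined")
        slopes.append((y1 - y0) / (x1 - x0))
    return slopes


def successive_differences(slopes):
    """Absolute differences between successive two-point slopes."""
    return [abs(b - a) for a, b in zip(slopes, slopes[1:])]
```

For data that follow a single Pareto law with shape $\alpha$, these slopes scatter around $-\alpha$, so a pronounced jump in the differences is a candidate location for a cluster boundary. Real data with ties (for example, lengths recorded to the nearest unit) would need the tied values handled before the slopes are computed.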

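Remark 8. There are three estimation methods for the shape parameters for all sub-samples, given by the following.

(1) Hill Estimator (original Pareto distribution in (1)). The Hill [8] MLE $\hat{\alpha}_{\mathrm{Hill}}$ for $\alpha$ is defined as
\[ \hat{\alpha}_{\mathrm{Hill}} = \left[ \frac{1}{r} \sum_{i=1}^{r} \left( \log X_{n-i+1,n} - \log X_{n-r,n} \right) \right]^{-1}, \]
where $X_{i,n}$ is the $i$th smallest order statistic and $r$ is the cut-off point.

(2) Moment Estimator (truncated Pareto distribution in (3)).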
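The Hill estimator in item (1) can be sketched in a few lines. The sketch assumes the standard form of the estimator, the reciprocal of the mean log-excess of the $r$ largest order statistics over the $(n-r)$th smallest one; the name `hill_estimator` and the argument `r` for the cut-off point are illustrative.

```python
import math


def hill_estimator(sample, r):
    """Hill estimate of the Pareto shape parameter based on the r largest
    observations (1 <= r < n). Assumes positive observations."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= r < n:
        raise ValueError("r must satisfy 1 <= r < n")
    # xs[n - r - 1] is the (n - r)-th smallest order statistic.
    log_cut = math.log(xs[n - r - 1])
    # xs[n - i] is the i-th largest order statistic, i = 1, ..., r.
    mean_log_excess = sum(math.log(xs[n - i]) - log_cut for i in range(1, r + 1)) / r
    return 1.0 / mean_log_excess
```

The choice of `r` trades bias against variance: a small `r` uses only the extreme tail, while a large `r` brings in observations that may not follow the Pareto tail.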

Nonparametric Kernel Density Estimation
We apply the kernel density estimation method (KDE) [10]. It is a smoothing technique for estimating distributions. It is well known that the classical kernel density estimator from a random sample $X_1, X_2, \ldots, X_n$ for the true probability density function (p.d.f.) $f$ is
\[ \hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right), \]
where $K(\cdot)$ is a symmetric density function and $h > 0$ is a bandwidth.
We will compare the KDE estimator and the CTPD estimator in the next section.
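A kernel density estimate with a standard normal kernel can be sketched as below. The bandwidth rule shown, $h = 1.06\,\hat{\sigma}\,n^{-1/5}$, is the common normal-reference rule of thumb and is an assumption here; it is not necessarily the bandwidth used in the paper. The names `gaussian_kde` and `rule_of_thumb_bandwidth` are hypothetical.

```python
import math
import statistics


def gaussian_kde(sample, h):
    """Return the function x -> (1 / (n h)) * sum_i K((x - X_i) / h),
    where K is the standard normal density."""
    n = len(sample)
    norm = n * h * math.sqrt(2.0 * math.pi)

    def f_hat(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm

    return f_hat


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth 1.06 * s * n^(-1/5) (an assumed choice)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)
```

For heavy-tailed data a single global bandwidth tends to oversmooth the body of the distribution or undersmooth the tail, so the bandwidth deserves a visual check in the log-log plot.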

Applications
Now we apply the cluster truncated Pareto distribution to the hurricane example in Section 1.
This construction is shown in Box 2.

Kernel and Other Estimation Methods in Log-Log Plot.
We apply the kernel estimator in (29), which is normalized, to the hurricane data. Here, we use a standard normal kernel and the optimal bandwidth in (33) [10, p. 45]. We ensure that the bandwidth is sufficiently large that the estimated tail distribution is smooth enough.

Table 1 gives the parameter estimates, median, 5% Value-at-Risk (VaR), and 1% VaR of each of four estimation methods. We note that the cluster method and the kernel method give the largest medians. The cluster method gives the smallest VaRs.

Figure 3 is a log-log plot which exhibits the data and five estimated distribution curves. We note that the original Pareto distribution does not fit the data well in the right tail. The moment and Aban estimated truncated Pareto distributions (TPD) fit the data well in the right tail, but not so well for the smaller or middle values. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 3 suggests that a single distribution may not fully represent how natural data are distributed, so we may consider grouping data by using the cluster method. Figure 3 also shows that the nonparametric kernel estimated distribution fits the data well. Since Figure 3 provides only a visual comparison, it is necessary to run goodness-of-fit tests to confirm which estimated distribution best fits the hurricane data.

Goodness-of-Fit Tests.
In this section we conduct three goodness-of-fit tests: Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises. All three tests are based on the distance between the empirical distribution function and the proposed distribution function: the original Pareto distribution in (1), the truncated Pareto distribution in (3), or the mixture truncated Pareto distribution in (10).
Box 2: Construction of cluster truncated Pareto distribution from the data. The thresholds are the order statistics $X_{1,n}$, $X_{c_1,n}$, $X_{c_2,n}$, $X_{c_3,n}$, and the cluster sample sizes are $n_1 = 27$, $n_2 = 13$, $n_3 = 9$.

Each test considers the same null and alternative hypotheses:
\[ H_0: F(x) = F^*(x) \quad \text{versus} \quad H_1: F(x) \ne F^*(x), \]
where $F(x)$ is the unknown true distribution of the sample data and $F^*(x)$ is one of our proposed four estimated distributions: (1) Pareto distribution in (1) with Hill estimator $\hat{\alpha}_{\mathrm{Hill}}$ in (25); (2) truncated Pareto distribution (TPD) in (3) with Aban estimator $\hat{\alpha}_{\mathrm{Aban}}$ in (27). We ran a test for each estimated distribution as $F^*(x)$.
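The distance used by the Kolmogorov-Smirnov test can be sketched directly from its definition as the largest gap between the empirical distribution function and the hypothesized c.d.f. The function name `ks_statistic` is hypothetical, and `cdf` stands for whichever fitted distribution is being tested; the sketch computes only the statistic, not the $p$-value.

```python
def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance sup_x |F_n(x) - cdf(x)|, evaluated at
    the jump points of the empirical distribution function F_n."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # F_n jumps from (i - 1) / n to i / n at the i-th order statistic,
        # so both sides of the jump have to be checked.
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d
```

For instance, the sample 0.1, 0.4, 0.7 tested against the uniform c.d.f. on (0, 1) gives a distance of 0.3, attained just below the largest observation's upper step.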

(3) Cramer-von Mises Test (C-v-M Test). Anderson and Darling [12] proposed this test by using a weight function equal to 1 in (37). Thus under $H_0$ the test statistic and $p$-value are given by expressions involving $K_{1/4}(\cdot)$, the modified Bessel function of the second kind.

Table 2 gives the values of the test statistics and $p$-values of the three goodness-of-fit tests. Note that the cluster truncated Pareto distribution has the smallest test statistic in the K-S test (i.e., the smallest error) and the largest $p$-value. The kernel estimated distribution gives the smallest test statistics in the A-D test and the C-v-M test. This means the cluster truncated Pareto distribution and the kernel estimated distribution have the best fit to the hurricane data.
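The Cramer-von Mises statistic can be computed from the ordered sample with the standard computing formula $W^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[ F^*(X_{i,n}) - \frac{2i-1}{2n} \right]^2$. The sketch below implements only this statistic; the $p$-value, which requires the limiting distribution, is not attempted. The name `cvm_statistic` is hypothetical.

```python
def cvm_statistic(sample, cdf):
    """Cramer-von Mises statistic W^2 for a fully specified cdf."""
    xs = sorted(sample)
    n = len(xs)
    total = 1.0 / (12.0 * n)
    for i, x in enumerate(xs, start=1):
        # Compare the fitted c.d.f. with the midpoint of the i-th step
        # of the empirical distribution function.
        total += (cdf(x) - (2.0 * i - 1.0) / (2.0 * n)) ** 2
    return total
```

Unlike the K-S distance, which records only the single largest gap, this statistic accumulates squared gaps over the whole sample, so it is more sensitive to a moderate misfit spread across the middle of the data.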
In Table 3, we consider the largest data in the sample and compute the absolute error and the integrated error of each estimated distribution. Table 3 gives the absolute errors and integrated errors of the five estimation methods for the 49, 18, and 10 largest observations. We note that the cluster truncated Pareto distribution has the smallest errors in all 6 cases. This means the cluster method is superior in fitting the hurricane data compared with the other existing methods.

Fishing Example.
Another example is determining the chance of catching a fish of record length (size). Overfishing is a serious issue that has been known to collapse many fish populations. To help control the population, limits are set upon anglers to determine the size of fish which can be kept.
The data are from a fishing trip, May 29-June 3, 2011, to Buzzard's Bay in the Cape Cod area of Massachusetts, USA, by Coia [13]. The lengths of the 72 largest black sea bass were measured, out of a total of 326 black sea bass caught on the trip, using 43.0 cm as a lower-limit threshold. The threshold of 43.0 cm was chosen to be conservative. When sampling, locations producing smaller fish were avoided and the largest fish were targeted. A time-series plot in Figure 4 shows the lengths of these largest 72 fish in the order in which they were caught.
By using 71 two-point slopes in (19) and the seven steps in Section 4, we construct $k = 4$ clusters. This construction is shown in Box 3. We also apply the kernel estimator in (29) and the bandwidth in (33) to the fish data.

Figure 5 is a log-log plot which exhibits the data and five estimated distribution curves. We note that the original Pareto distribution does not fit the data well in the right tail. The moment and Aban estimated truncated Pareto distributions (TPD) fit the data well in the right tail, but not so well for the smaller or middle values. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 5 suggests that a single distribution may not fully represent how natural data are distributed, so we group the data by using the cluster method. Figure 5 also shows that the nonparametric kernel estimated distribution fits the data well.

Table 4 gives the parameter estimates, median, 5% VaR, and 1% VaR of each of four estimation methods. We note that the cluster method and the kernel method give the largest medians and the smallest VaRs.
We also ran goodness-of-fit tests to confirm which estimated distribution best fits the fish data. Table 5 gives the values of the test statistics and $p$-values of each of the three goodness-of-fit tests. We note that the cluster truncated Pareto distribution has the smallest test statistic (i.e., the smallest error) and the largest $p$-value in the K-S test. The kernel estimated distribution has the smallest test statistics in the A-D test and the C-v-M test. Thus the cluster truncated Pareto distribution and the kernel estimated distribution fit the fish data best. Table 6 gives the absolute errors and integrated errors of five estimation methods for the 72, 50, and 10 largest observations. We note that the cluster truncated Pareto distribution and the kernel estimated distribution have the smallest errors in all 6 cases. The cluster method and the kernel estimation method are superior in fitting the fish data compared with the other existing semiparametric estimation methods.

Table 5: Goodness-of-fit tests for the fish data, kernel and cluster rows (rank in parentheses).
Method     K-S statistic   K-S p-value   A-D statistic   A-D p-value   C-v-M statistic   C-v-M p-value
Kernel     (2)             0.7240 (2)    0.7034 (1)      0.5560 (1)    0.0379 (1)        0.9438 (1)
Cluster    0.0687 (1)      0.7348 (1)    0.8528 (2)      0.4445 (2)    0.0590 (2)        0.8202 (2)

Table 6: Absolute and integrated errors for the fish data, kernel and cluster rows (rank in parentheses).
Method     Absolute, 72   Absolute, 50   Absolute, 10   Integrated, 72   Integrated, 50   Integrated, 10
Kernel     (2)            0.0561 (1)     0.0179 (1)     0.0040 (1)       0.0035 (1)       0.0022 (1)
Cluster    0.0687 (1)     0.0687 (2)     0.0332 (2)     0.0055 (2)       0.0055 (2)       0.0046 (2)

Conclusions
Overall, after the studies in this paper, we may conclude the following.
(1) Truncated Pareto models are useful for analyzing real-world data.
(2) Cluster truncated Pareto models are useful for grouped data.
(3) The two-point slope technique is innovative and useful for determining the threshold points in clustering.
(4) A nonparametric kernel estimated distribution fits heavy-tailed data very well.
(5) It is a difficult problem to determine $k$, the number of groups, which is based on the two-point slope in (20) in Step 2 of the construction of the CTPD in (16).
The choice of $k$ depends very much on empirical observations. We displayed this technique in the two examples in Section 6. We plan to do further studies on more complex data sets in the future.