On Compound Distributions for Natural Disaster Modelling in Kenya

Kenyan communities are exposed to natural disasters by an amalgamation of factors such as poverty, aridity, and settlements in areas susceptible to natural disasters or in areas with poor infrastructure. This exposure is expected to increase due to the effects of climate change. In an attempt to explain some of this variability, we model the extreme damages from natural disasters in Kenya by developing a compound distribution that takes into account both the frequency and the severity of the extreme events. The resulting distribution is based on a threshold model and a compound extreme value distribution. We find that the frequency of events exceeding a threshold of 150,000 follows a negative binomial distribution, while the severity of the exceedances follows a generalized Pareto distribution. This distribution fits the data well and is found to be a better model for natural disasters in Kenya than the traditional extreme value threshold model.


Introduction
Kenya has continued to face an increasing vulnerability to natural disaster risk. The communities are exposed to natural disasters by an amalgamation of factors such as poverty, aridity, and settlements in areas susceptible to natural disasters or in areas with poor infrastructure. These factors, coupled with naturally occurring hazards that are currently being intensified by climate change, pose an extreme threat to Kenyan society. As of 2018, a total of 113 natural disaster events had been recorded in the last six decades, affecting approximately 62 million people and resulting in 6,900 deaths. The total damage from these occurrences is estimated to be 609 million US dollars. Most of the natural disaster events are weather-related, with almost 70% of the landmass being affected by drought and a total of 55 flooding events recorded in various parts of the country.
As with most parts of the world, natural disasters are expected to increase in the future due to climate change. The World Bank projects the number of drought days in many parts of the world to increase by more than 20% by 2080, and the number of people exposed to drought could increase by 9-17% in 2030 and 50-90% in 2080. The number of people exposed to river floods could also increase by 4-15% in 2030 and 12-29% in 2080 [1]. However, the effects would be felt mostly in the less-developed countries like Kenya. A study conducted by the Centre for Research on the Epidemiology of Disasters (CRED) found that people living in the poorer nations are six times more likely to be injured, to lose their homes, be displaced, or require emergency assistance than those in the wealthier nations. They are also seven times more likely to die from a natural disaster than those in richer nations. Therefore, with the increasing frequency and potential damage of natural disasters in Kenya, there is an urgent need to understand the characteristics of such events in the country. In modelling extremal and rare events, extreme value theory (EVT) emerges as a vital tool.
There are two main approaches in EVT: classical EVT (block maxima) and excess over threshold (EOT) [2]. Several studies have applied EVT to natural disasters, including Engeland et al. [3], who used the block maxima method to model hydrological floods and droughts in the USA, and Jindrova and Pacakova [4], who used EOT to model historical natural catastrophe losses in the USA. Both studies found the EVT distributions to be a good fit for the data. However, such models only consider the loss severity distribution without explicitly considering the frequency of occurrence. As a result, Tebfu and Fengshi proposed the compound extreme value distribution (CEVD) and successfully used it to model typhoons in South China. They assume that the frequency of occurrence is Poisson distributed while the severity follows an extreme value distribution. Other compound extreme value distributions have since been developed, including the Poisson-Weibull CEVD [5], the Poisson-generalized Pareto CEVD [6], and the geometric-Gumbel CEVD [7].
In this paper, we seek to model the extreme damages from natural disasters in Kenya by developing a compound distribution that takes into account both the frequency and the severity of the extreme events. The rest of the paper is structured as follows: Section 2 presents the methodology, where we discuss extreme value theory and compound distributions and use them to develop a model for natural disasters in Kenya. The data analysis results and discussion are then presented in Section 3, and the study concludes in Section 4.

Extreme Value Theory.
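The cornerstone of extreme value theory is the study of the stochastic behaviour of the maximum (or minimum) of a sequence of random variables. Define

$$M_n = \max\{Y_1, \ldots, Y_n\},$$

where $Y_1, \ldots, Y_n$ is a sequence of independent random variables with a common distribution function $F$ and $M_n$ represents the maximum (minimum) of the observed process over $n$ blocks or time units. If $F$ is known, the distribution of $M_n$ is

$$P(M_n \le y) = P(Y_1 \le y, \ldots, Y_n \le y) = F^n(y).$$

However, $F$ is usually unknown in practice and has to be estimated from the data. This poses a problem, since a small error in the estimation of $F$ can lead to large discrepancies in $F^n(y)$. An alternative approach is to model $F^n(y)$ through the asymptotic theory of $M_n$, where we study the behaviour of $F^n(y)$ as $n \to \infty$. Since $F(y) < 1$ for $y < y_{\sup}$, where $y_{\sup}$ is the upper end-point of $F$, we have $F^n(y) \to 0$ as $n \to \infty$, so the limit is degenerate. We can remove this degeneracy by allowing a linear renormalization of $M_n$:

$$M_n^{*} = \frac{M_n - d_n}{c_n},$$

where $c_n$ and $d_n$ are sequences of constants with $c_n > 0$. Under a suitable choice of $c_n$ and $d_n$, the distribution of $M_n^{*}$ can be stabilised, leading to the "extremal types theorem" [2].

Theorem 1. If there exist sequences of constants $c_n > 0$ and $d_n$ such that

$$P\left(\frac{M_n - d_n}{c_n} \le y\right) \to G(y) \quad \text{as } n \to \infty,$$

where $G$ is a nondegenerate distribution function, then $G$ belongs to one of the following families:

$$\text{I:}\quad G(y) = \exp\left\{-\exp\left[-\left(\frac{y - d}{c}\right)\right]\right\}, \quad -\infty < y < \infty,$$

$$\text{II:}\quad G(y) = \begin{cases} 0, & y \le d, \\ \exp\left\{-\left(\dfrac{y - d}{c}\right)^{-\alpha}\right\}, & y > d, \end{cases}$$

$$\text{III:}\quad G(y) = \begin{cases} \exp\left\{-\left[-\left(\dfrac{y - d}{c}\right)\right]^{\alpha}\right\}, & y < d, \\ 1, & y \ge d, \end{cases}$$

for $c > 0$, $d \in \mathbb{R}$ and, in the case of families II and III, $\alpha > 0$.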
The three classes of distributions are called extreme value distributions of type I (Gumbel), type II (Fréchet), and type III (Weibull), respectively. von Mises [8] and Jenkinson [9] combined the three types of extreme value distributions, leading to the generalized extreme value (GEV) distribution.
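The degeneracy of $F^n(y)$ and its removal by renormalization can be illustrated numerically. The following is a minimal sketch, assuming unit exponential data and the illustrative choice $c_n = 1$, $d_n = \log n$ (neither is taken from the paper); in that case the renormalized maximum approaches the type I (Gumbel) family with $c = 1$ and $d = 0$.

```python
import math

def exp_cdf(y):
    """CDF of the unit exponential distribution: F(y) = 1 - exp(-y) for y > 0."""
    return 1.0 - math.exp(-y) if y > 0 else 0.0

def max_cdf(y, n):
    """P(M_n <= y) = F(y)**n for the maximum of n i.i.d. draws."""
    return exp_cdf(y) ** n

def renormalised_max_cdf(y, n):
    """P((M_n - d_n) / c_n <= y) with the assumed choice c_n = 1, d_n = log(n)."""
    return max_cdf(y + math.log(n), n)

def gumbel_cdf(y):
    """Type I (Gumbel) limit with c = 1 and d = 0."""
    return math.exp(-math.exp(-y))

if __name__ == "__main__":
    y = 1.0
    for n in (10, 100, 10_000):
        # First column shrinks towards 0; second column approaches the Gumbel value.
        print(n, max_cdf(y, n), renormalised_max_cdf(y, n), gumbel_cdf(y))
```

For a fixed $y$, the unnormalized probability collapses to zero as $n$ grows, while the renormalized one settles near the Gumbel value, which is the behaviour the extremal types theorem describes.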

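Theorem 2. If there exist sequences of constants $c_n > 0$ and $d_n$ such that

$$P\left(\frac{M_n - d_n}{c_n} \le y\right) \to G(y) \quad \text{as } n \to \infty,$$

where $G$ is a nondegenerate distribution function, then $G$ is a member of the GEV family:

$$G(y) = \exp\left\{-\left[1 + \zeta\left(\frac{y - \nu}{\sigma}\right)\right]^{-1/\zeta}\right\},$$

defined on $y$ such that $1 + \zeta(y - \nu)/\sigma > 0$ and with parameters: scale $\sigma > 0$, location $\nu \in \mathbb{R}$, and shape $\zeta \in \mathbb{R}$.

Theorem 2 suggests that regardless of the underlying distribution $F$, if a nondegenerate limit can be obtained by linear renormalization of $M_n$, then the limit distribution will be the GEV distribution. This approach is, however, inefficient in terms of data usage, since only the maximum within each time period is used for modelling. An alternative approach is excess over threshold, where all the data above some sufficiently high threshold are used for modelling.

Theorem 3. Given a set of independent and identically distributed random variables $Z_1, \ldots, Z_n$ with a common distribution function $F$, the conditional excess distribution function, $F_v(z) = P(Z \le z \mid Z > v)$, of a random variable $Z$ above a high threshold $v$ can be approximated by

$$H(z) = 1 - \left[1 + \frac{\zeta (z - v)}{\sigma}\right]^{-1/\zeta}$$

for $z > v$ and $1 + \zeta(z - v)/\sigma > 0$, with parameters scale $\sigma > 0$ and shape $\zeta \in \mathbb{R}$.

Combining this approximation with the probability of exceeding the threshold, the distribution function of $X$ above $v$ can be written as

$$G(x) = 1 - a\left[1 + \frac{\zeta (x - v)}{\sigma}\right]^{-1/\zeta}, \quad x > v, \qquad (9)$$

where $a = P(X > v)$ is treated as a parameter to be estimated. The density function of the exceedances can be shown to be

$$g(x) = \frac{a}{\sigma}\left[1 + \frac{\zeta (x - v)}{\sigma}\right]^{-1/\zeta - 1}, \quad x > v.$$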
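As a rough illustration of the threshold model, the sketch below evaluates the generalized Pareto approximation of Theorem 3 and the corresponding unconditional tail probability and density. It assumes the standard threshold-model form $P(X > x) = a[1 + \zeta(x - v)/\sigma]^{-1/\zeta}$ for $x > v$; the function names and the parameter values used in the example are hypothetical and are not estimates from the Kenyan data.

```python
import math

def gpd_excess_cdf(z, v, sigma, zeta):
    """GPD approximation to P(Z <= z | Z > v); returns 0 at or below the threshold."""
    if z <= v:
        return 0.0
    if zeta == 0.0:
        # Limiting case zeta -> 0 is the exponential distribution.
        return 1.0 - math.exp(-(z - v) / sigma)
    t = 1.0 + zeta * (z - v) / sigma
    if t <= 0.0:
        # Beyond the finite upper end point (only possible when zeta < 0).
        return 1.0
    return 1.0 - t ** (-1.0 / zeta)

def exceedance_tail(x, v, sigma, zeta, a):
    """Unconditional tail P(X > x) for x > v, where a = P(X > v)."""
    return a * (1.0 - gpd_excess_cdf(x, v, sigma, zeta))

def exceedance_density(x, v, sigma, zeta, a):
    """Density of X above the threshold v (derivative of 1 - exceedance_tail)."""
    if x <= v:
        raise ValueError("density is only defined above the threshold")
    if zeta == 0.0:
        return (a / sigma) * math.exp(-(x - v) / sigma)
    t = 1.0 + zeta * (x - v) / sigma
    if t <= 0.0:
        return 0.0
    return (a / sigma) * t ** (-1.0 / zeta - 1.0)

if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    v, sigma, zeta, a = 10.0, 2.0, 0.5, 0.3
    print(gpd_excess_cdf(14.0, v, sigma, zeta))
    print(exceedance_tail(14.0, v, sigma, zeta, a))
    print(exceedance_density(14.0, v, sigma, zeta, a))
```

Treating $a$ as a separate argument mirrors the text, where the exceedance probability is a parameter to be estimated alongside $\sigma$ and $\zeta$.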

Compound Distributions.
Let $X_1, \ldots, X_N$ be a sequence of independent and identically distributed random variables with a common distribution function $Q$. Also, let $N$ be a counting random variable with probability function $P_n = P(N = n)$ that is independent of the $X_i$. A compound distribution is the distribution of the random sum

$$S_N = X_1 + X_2 + \cdots + X_N.$$

The distribution function of $S_N$ is given by

$$F_{S_N}(x) = P(S_N \le x) = \sum_{n=0}^{\infty} P_n\, Q^{n*}(x),$$

where $Q^{n*}(x)$ is the $n$-fold convolution power of $Q$.
Definition 1 (convolution). The convolution of two distribution functions $Q_X(\cdot)$ and $Q_Z(\cdot)$ on the positive real line is

$$Q_S(s) = (Q_X * Q_Z)(s) = \int_0^{s} Q_Z(s - x)\, q_X(x)\, dx,$$

where $q_X(x) = (d/dx)\, Q_X(x)$ and $S = X + Z$, with $X$ and $Z$ independent.
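The compound distribution function can be evaluated directly as a weighted sum of convolution powers when these are available in closed form. The sketch below is only an illustration under assumed ingredients that are not part of the paper: unit exponential severities, whose $n$-fold convolution is the Erlang distribution, and a geometric claim count with $P_n = p(1 - p)^n$. For this particular pair the compound distribution is known in closed form, $P(S_N \le x) = p + (1 - p)(1 - e^{-px})$, which gives a convenient check.

```python
import math

def erlang_cdf(x, n):
    """n-fold convolution power of the unit exponential cdf (n = 0 is a point mass at 0)."""
    if x < 0:
        return 0.0
    if n == 0:
        return 1.0
    term, partial = 1.0, 0.0  # term holds x**k / k!
    for k in range(n):
        partial += term
        term *= x / (k + 1)
    return 1.0 - math.exp(-x) * partial

def geometric_pmf(n, p):
    """Assumed frequency model: P(N = n) = p * (1 - p)**n for n = 0, 1, 2, ..."""
    return p * (1.0 - p) ** n

def compound_cdf(x, p, n_max=400):
    """Truncated series sum_n P_n * Q^{n*}(x) for the compound distribution function."""
    return sum(geometric_pmf(n, p) * erlang_cdf(x, n) for n in range(n_max + 1))

def compound_cdf_closed_form(x, p):
    """Closed form for the geometric-exponential pair, used only as a check."""
    return p + (1.0 - p) * (1.0 - math.exp(-p * x))

if __name__ == "__main__":
    print(compound_cdf(2.0, 0.3), compound_cdf_closed_form(2.0, 0.3))
```

The truncation point `n_max` is a hypothetical tuning constant; it only needs to be large enough that the remaining frequency mass is negligible.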

Remark 1.
The distribution function of a sum of $n$ independent and identically distributed random variables with common cdf $Q$ is the $n$-fold convolution power of $Q$:

$$Q^{n*}(x) = P(X_1 + \cdots + X_n \le x).$$

Compound distributions are used to model total losses in a portfolio or a group of insurance policies. In this context, $S_N$ denotes the total losses, $N$ is the number of losses, and $X_i$ is the size of the $i$-th independent loss. However, in extreme value analysis, we are interested in the tails of distributions, i.e., the distribution of the largest losses, and how they affect the total losses. A natural class of large-loss distributions is given by the subexponential family, which is a subclass of the heavy-tailed distributions. Heavy-tailed distributions have tails that decay more slowly than that of any exponential distribution, and in many common cases the tail decays like a power function. All commonly used heavy-tailed distributions are subexponential.
Definition 2 (subexponential distribution). A distribution function $Q$ on the positive real line is subexponential if, for all $n \ge 2$,

$$\lim_{x \to \infty} \frac{1 - Q^{n*}(x)}{1 - Q(x)} = n.$$

This can be interpreted as follows: for a sum of $n$ independent random variables $X_1, \ldots, X_n$ with common distribution $Q$,

$$P(X_1 + \cdots + X_n > x) \sim P(\max(X_1, \ldots, X_n) > x) \quad \text{as } x \to \infty. \qquad (16)$$

Equation (16) is usually referred to as the catastrophe principle. We can also express it as

$$\lim_{x \to \infty} \frac{P(S_n > x)}{P(M_n > x)} = 1,$$

where $M_n = \max(X_1, \ldots, X_n)$ and $S_n = \sum_{i=1}^{n} X_i$. This relation implies that the total losses are driven by the largest losses. In other words, the sum of $n$ losses gets large if and only if its maximum gets large. Therefore, assuming that the underlying distribution is subexponential (heavy-tailed), we can write the distribution function of the compound distribution in terms of the maximum values:

$$F(x) = \sum_{n=0}^{\infty} P_n\, [G(x)]^n, \qquad (19)$$

where $P_n = P(N = n)$ for $n = 0, 1, 2, \ldots$ and $G(x)$ is the distribution of the maxima. Equation (19) is called a compound extreme value distribution (CEVD). Unlike EVT models, which do not consider the distribution of the frequency of occurrence in detail, the CEVD treats the frequency of extreme events as a random variable. Theorem 4 presents the CEVD as proposed by Liu and Ma [10].
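The catastrophe principle can be checked by simulation. The following is a minimal Monte Carlo sketch, assuming Pareto losses with $P(X > x) = x^{-\alpha}$ for $x \ge 1$; the distribution, the sample sizes, and the function names are illustrative assumptions rather than anything used in the paper. It counts how often the sum and the maximum of $n$ losses exceed a high level $x$; for heavy-tailed losses the two counts should be close.

```python
import random

def pareto_draw(rng, alpha):
    """Inverse-transform draw from a Pareto law with P(X > x) = x**(-alpha), x >= 1."""
    return (1.0 - rng.random()) ** (-1.0 / alpha)

def tail_counts(n, x, alpha, n_sims, seed=0):
    """Count simulations in which the sum S_n and the maximum M_n exceed the level x."""
    rng = random.Random(seed)
    sum_hits = 0
    max_hits = 0
    for _ in range(n_sims):
        draws = [pareto_draw(rng, alpha) for _ in range(n)]
        sum_hits += sum(draws) > x
        max_hits += max(draws) > x
    return sum_hits, max_hits

if __name__ == "__main__":
    s_hits, m_hits = tail_counts(n=5, x=500.0, alpha=1.5, n_sims=200_000)
    # The ratio estimates P(S_n > x) / P(M_n > x); it is at least 1 for positive losses.
    print(s_hits, m_hits, s_hits / m_hits)
```

Because the losses are positive, every simulation in which the maximum exceeds $x$ also has a sum exceeding $x$, so the estimated ratio can only deviate from one upwards; the excess shrinks as $x$ grows.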

Theorem 4. Let $Y$ and $Z$ be random variables with cumulative distribution functions $G(x)$ and $T(x)$, respectively. Let $N$ be another random variable, independent of $Y$ and $Z$, with probability function

$$P(N = n) = P_n, \quad n = 0, 1, 2, \ldots$$

Define a random variable $X$ as

$$X = \begin{cases} Z, & N = 0, \\ \max\limits_{1 \le i \le N} Y_i, & N \ge 1, \end{cases}$$

where $Y_i$ is the $i$-th independent observation of $Y$. Then, the distribution function of $X$ is

$$F(x) = P_0\, T(x) + \sum_{n=1}^{\infty} P_n\, [G(x)]^n. \qquad (22)$$

International Journal of Mathematics and Mathematical Sciences

We can express equation (22) as

$$F(x) = F_0(x) + \varepsilon(x), \quad F_0(x) = \sum_{n=0}^{\infty} P_n\, [G(x)]^n, \quad \varepsilon(x) = P_0\,[T(x) - 1].$$

Since $\varepsilon(x)$ vanishes as $x \to \infty$ and we are interested in the upper tail of the distribution, we can ignore $\varepsilon(x)$ and take $F_0(x)$ to be the value of $F(x)$. The full proof is discussed by Liu and Ma [10]. It can be shown that $F_0(x)$ is monotone nondecreasing and right continuous [6]. It satisfies $F_0(\infty) = 1$ and $F_0(-\infty) = P_0$. It is, however, worth noting that $F_0(x)$ is not a distribution function when $P_0 > 0$. We can modify the probability $P_0$ to make $F_0(-\infty) = P_0 = 0$, but since we are interested in the upper tail of $F_0(x)$, we will not consider the details of the modification.
As a result of the above discussion, we can formally define CEVD as follows.
Definition 3. Given a random variable $N$ with probability mass function $P(N = n) = P_n$ for $n = 0, 1, \ldots$ and a set of independent and identically distributed random variables $Y_1, \ldots, Y_N$ with a common distribution function $G(x)$, assumed to be independent of $N$, the compound extreme value distribution comprising $N$ and the maximum of the $Y_i$, $X = \max_{1 \le i \le N} Y_i$, is defined as

$$F(x) = \sum_{n=0}^{\infty} P_n\, [G(x)]^n. \qquad (24)$$

Using Definition 3, we can now develop a distribution to model natural disasters in Kenya. Let $X_1, \ldots, X_N$ be a sequence of independent and identically distributed random variables. For a sufficiently high threshold $v$, the excesses $x - v$ of the observations that exceed $v$ are called exceedances. Denote the number of exceedances by $N_v$. Assume that $N_v$ follows a negative binomial distribution with parameters $\kappa > 0$ and $0 < \rho < 1$ such that

$$P(N_v = n) = \frac{\Gamma(\kappa + n)}{\Gamma(\kappa)\, n!}\, \rho^{\kappa} (1 - \rho)^{n}, \quad n = 0, 1, 2, \ldots \qquad (25)$$

Substituting equation (25) into the CEVD formula (24) gives

$$F(x) = \sum_{n=0}^{\infty} \frac{\Gamma(\kappa + n)}{\Gamma(\kappa)\, n!}\, \rho^{\kappa} (1 - \rho)^{n}\, [G(x)]^{n} = \left[\frac{\rho}{1 - (1 - \rho)\, G(x)}\right]^{\kappa}, \qquad (26)$$

where $G(x)$ is the cumulative distribution function of the exceedances given in equation (9). Equation (26) is then the distribution function of what we will call the negative binomial-generalized Pareto compound extreme value distribution (NB-GP CEVD).
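The NB-GP CEVD distribution function can be evaluated either as the series in the CEVD definition or through the negative binomial probability generating function. The sketch below compares the two. It assumes the negative binomial parameterization $P(N_v = n) = \frac{\Gamma(\kappa + n)}{\Gamma(\kappa)\, n!} \rho^{\kappa} (1 - \rho)^n$ and the exceedance cdf $G(x) = 1 - a[1 + \zeta(x - v)/\sigma]^{-1/\zeta}$ with $\zeta \ne 0$; the function names and all parameter values are hypothetical and are not the estimates reported in Table 5.

```python
import math

def exceedance_cdf(x, v, sigma, zeta, a):
    """Assumed cdf of the exceedances, G(x) = 1 - a * [1 + zeta (x - v) / sigma]**(-1/zeta)."""
    t = 1.0 + zeta * (x - v) / sigma
    if t <= 0.0:
        return 1.0  # beyond the upper end point when zeta < 0
    return 1.0 - a * t ** (-1.0 / zeta)

def nb_pmf(n, kappa, rho):
    """Negative binomial pmf, computed on the log scale for numerical stability."""
    log_p = (math.lgamma(kappa + n) - math.lgamma(kappa) - math.lgamma(n + 1)
             + kappa * math.log(rho) + n * math.log(1.0 - rho))
    return math.exp(log_p)

def cevd_cdf_series(g, kappa, rho, n_max=2000):
    """Truncated series sum_n P_n * g**n, with g = G(x)."""
    return sum(nb_pmf(n, kappa, rho) * g ** n for n in range(n_max + 1))

def cevd_cdf_closed(g, kappa, rho):
    """Closed form via the negative binomial generating function."""
    return (rho / (1.0 - (1.0 - rho) * g)) ** kappa

if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    v, sigma, zeta, a = 150_000.0, 1_000_000.0, 0.5, 0.4
    kappa, rho = 1.5, 0.4
    g = exceedance_cdf(400_000.0, v, sigma, zeta, a)
    print(cevd_cdf_series(g, kappa, rho), cevd_cdf_closed(g, kappa, rho))
```

The closed form is usually preferable inside a likelihood routine because it avoids truncating the series; the series version is kept here only as a cross-check.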

Data Analysis, Results, and Discussion
We use data for all the natural disasters recorded in Kenya in the period 1964-2018, obtained from the CRED database. The severity of natural disasters is quantified in terms of the total number of people affected on an annual basis, which we deemed to be more reliable than the total damage in monetary terms. Table 1 shows the descriptive statistics for both the annual occurrence and the impacts. In summary, the minimum annual number of disaster occurrences and the corresponding severity are both zero, which corresponds to years in which no natural disaster event was recorded. At the other extreme, the maximum annual number of occurrences is 9, and the maximum number of people affected is 23,331,469. The mean is 2 for the number of occurrences and 1,130,198 for the severity. We can also observe that the mean is greater than the median for both variables, indicating that the data are right-skewed.
We start with an exploratory analysis of the natural disaster data. The scatterplots in Figure 1 show that there are no serious violations of the independence assumptions. The exponential Q-Q plot in Figure 2 displays a convex departure from the straight line, indicating that the theoretical quantiles grow slower than the empirical quantiles. This suggests that the severity data are heavy-tailed.

Threshold Selection.
We use three graphical tools to select an appropriate threshold. First, we plot the mean excess at each of 200 thresholds spanning the whole dataset against the corresponding threshold, with confidence bands at the 5% significance level. Figure 3(a) shows that the graph is approximately linear from the beginning until around 8,000,000. This suggests a threshold of 0.
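The quantity plotted in a mean excess plot is the average amount by which observations exceed a threshold. The sketch below shows one simple way to compute it on a grid of thresholds; the function names and the toy data are hypothetical, and confidence bands are omitted. Approximate linearity of the resulting curve above some threshold is what supports a generalized Pareto model from that threshold onwards.

```python
def mean_excess(data, v):
    """Average of (x - v) over the observations x that exceed the threshold v."""
    excesses = [x - v for x in data if x > v]
    if not excesses:
        raise ValueError("no observation exceeds the threshold")
    return sum(excesses) / len(excesses)

def mean_excess_curve(data, n_thresholds):
    """Mean excess on an equally spaced grid from the minimum up to just below the maximum."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    step = (hi - lo) / n_thresholds
    thresholds = [lo + i * step for i in range(n_thresholds)]
    return [(v, mean_excess(data, v)) for v in thresholds]

if __name__ == "__main__":
    toy = [1.0, 2.0, 3.0, 4.0, 10.0]  # toy values, not the Kenyan data
    for v, e in mean_excess_curve(toy, 5):
        print(v, e)
```

The grid stops one step short of the maximum so that every threshold has at least one exceedance.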
Next, we plot the maximum likelihood estimates of the GPD parameters at 80 different thresholds against their corresponding thresholds, together with 95% confidence intervals. Figure 3(b) shows that the scale parameter becomes stable at around 50,000, while the reparametrized shape parameter is constant right from the start. The candidate thresholds from this plot are therefore 0 and 50,000.
Finally, we plot a Gerstengarbe plot, which involves plotting the series of differences $\Delta_r = z_{(r)} - z_{(r-1)}$, $r = 2, 3, \ldots, n$, of the order statistics $z_{(1)} \le z_{(2)} \le \cdots \le z_{(n)}$, from the start to the end and from the end to the start. The crossing point (Figure 3(c)) is at the observation numbered $k = 19$, which corresponds to a threshold of 150,000. We also carry out the sequential version of the Mann-Kendall test to check whether this point is the starting point of the extreme region. The results are given in Table 2. The null hypothesis that there is no change in the series of differences is rejected with a p value less than 0.001. We now investigate the goodness of fit of the GPD at each of the three candidate thresholds. Figure 4 shows that the GPD fits the data best at the 50,000 and 150,000 thresholds. To avoid violating the asymptotic arguments underlying the GPD, we choose the higher of the two, 150,000, as the threshold.
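The ingredients of the Gerstengarbe plot can be sketched as follows. The code computes the differences of consecutive order statistics and the normalized sequential Mann-Kendall statistic in one common formulation, where the cumulative count of earlier smaller values is centred by $i(i-1)/4$ and scaled by the square root of $i(i-1)(2i+5)/72$. This is an illustrative sketch with hypothetical function names and toy data; it may differ in detail from the software used to produce Figure 3(c) and Table 2.

```python
import math

def order_statistic_differences(data):
    """Differences z_(r) - z_(r-1), r = 2, ..., n, of the sorted observations."""
    z = sorted(data)
    return [z[r] - z[r - 1] for r in range(1, len(z))]

def sequential_mann_kendall(series):
    """Normalised sequential Mann-Kendall statistics U_1, ..., U_m for a series of length m."""
    stats = []
    running = 0  # cumulative number of (earlier, current) pairs with earlier < current
    for i, current in enumerate(series, start=1):
        running += sum(1 for earlier in series[:i - 1] if earlier < current)
        mean = i * (i - 1) / 4.0
        var = i * (i - 1) * (2 * i + 5) / 72.0
        # The first statistic has zero variance and is set to 0 by convention here.
        stats.append((running - mean) / math.sqrt(var) if var > 0 else 0.0)
    return stats

if __name__ == "__main__":
    toy = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 30.0]  # toy values, not the Kenyan data
    diffs = order_statistic_differences(toy)
    forward = sequential_mann_kendall(diffs)
    backward = sequential_mann_kendall(diffs[::-1])
    print(diffs)
    print(forward)
    print(backward)
```

Plotting the forward series against the backward series (the latter in reversed index order) and locating where the two curves cross gives the candidate start of the extreme region.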

Fitting the Negative Binomial-Generalized Pareto Compound Extreme Value Distribution.
Given a threshold of 150,000, we first investigate the fit of the negative binomial distribution to the number of exceedances and of the GPD to the exceedances. Tables 3 and 4 show that the p value is greater than 0.01 in both cases, indicating that the distributions are a good fit to their respective variables.
We can then fit the NB-GP CEVD to the data on natural disasters in Kenya. Table 5 shows the parameter estimates, and Table 6 shows the fit of the distribution. The maximized value of the log-likelihood function is found to be −192.0693.
As observed in Table 6, the p values in both tests are greater than 0.01. Thus, we fail to reject the null hypothesis that natural disasters in Kenya follow an NB-GP CEVD at the 1% level of significance. We can therefore conclude that the proposed distribution is a good fit for the data.
To assess the improvement achieved by using the NB-GP CEVD instead of the GPD, we investigate the quality of the proposed model relative to that of the GPD. This is done using the Bayesian information criterion (BIC) and the Akaike information criterion (AIC). Table 7 shows that both measures are smaller for the NB-GP CEVD than for the GPD, suggesting that the former is a better model for natural disasters in Kenya.
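Both criteria are simple functions of the maximized log-likelihood. As a brief sketch, the standard definitions $\mathrm{AIC} = 2k - 2\ell$ and $\mathrm{BIC} = k \ln n - 2\ell$ can be coded as below, where $\ell$ is the maximized log-likelihood, $k$ the number of estimated parameters, and $n$ the number of observations. The log-likelihood value is the one reported above; the values of $k$ and $n$ in the example are assumptions made only for illustration, so the printed numbers are not expected to reproduce Table 7.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2.0 * n_params - 2.0 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

if __name__ == "__main__":
    log_lik = -192.0693   # maximized log-likelihood reported in the text
    n_params = 4          # assumed parameter count, for illustration only
    n_obs = 30            # assumed sample size, for illustration only
    print(aic(log_lik, n_params), bic(log_lik, n_params, n_obs))
```

Under both criteria the model with the smaller value is preferred, and the BIC penalizes additional parameters more heavily than the AIC once $n$ exceeds about 7.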

Conclusion
A compound distribution is developed to model the extreme damages from natural disasters in Kenya. Unlike the traditional extreme value theory models that only consider the severity of extreme events, the distribution proposed here captures both frequency and severity. The distribution is based on a threshold model and a compound extreme value distribution, where the frequency of events exceeding a threshold of 150,000 is found to follow a negative binomial distribution, while the severity of the exceedances follows a generalized Pareto distribution. The exceedances are assumed to be independent, and the number of exceedances is also assumed to be independent of the severity. The distribution is shown to fit the data well and is found to be a better model for natural disasters in Kenya than the traditional extreme value threshold model. The proposed distribution can be an important tool for understanding the risks associated with natural disasters in Kenya. It can be particularly useful to the country's disaster management bodies and other stakeholders in improving existing disaster preparedness strategies, which will in turn reduce the negative economic and social impacts of such events.

Data Availability
The data used in this study are open source and available at https://www.emdat.be/emdat_db/.