A Poisson Gamma Model for Zero Inflated Rainfall Data

Rainfall modeling is significant for prediction and forecasting purposes in agriculture, weather derivatives, hydrology, risk and disaster preparedness. Normally two models are used to model the rainfall process as a chain dependent representing the occurrence and intensity of rainfall. Such two models helps to understand the physical features and dynamics of rainfall process. However rainfall data is zero inflated and exhibit overdispersion which is always underestimated by such models.<br><br>In this study we have modeled the two processes simultaneously as a compound Poisson process. The rainfall events are modeled as a Poisson process while the intensity of each rainfall event is Gamma distributed. We minimize overdispersion by introducing the dispersion parameter in the model implemented through Tweedie distributions.<br><br>Simulated rainfall data from the model shows a resemblance of the actual rainfall data in terms of seasonal variation, means, variance and magnitude. The model also provides mechanisms for small but important properties of the rainfall process. The model developed can be used in forecasting and predicting rainfall amounts and occurrences which is important in weather derivatives, agriculture, hydrology, and predicting of drought and flood occurrences.<br>


Introduction
Climate variables, in particular, rainfall occurrence and intensity hugely impacts on human and physical environment. Knowledge of the frequency of the occurrence and intensity of rainfall events is essential for planning, designing and management of various water resources system (Hussain, 2008). Specifically rain-fed agriculture is a sensitive sector to weather and crop production is directly dependent on the amount of rainfall and its occurrence. Rainfall modeling has a great impact in crop growth, weather derivatives, hydrological systems, drought and flood management and crop simulated studies.
Rainfall modeling is also important in pricing of weather derivatives which are financial instruments that are used as a tool for risk management to reduce risk associated with adverse or unexpected weather conditions. Further as climate change greatly affects the environment there is an urgent need for predicting the variability of rainfall for future periods for different climate change scenarios in order to provide necessary information for high quality climate related impact studies (Hussain, 2008).
However modeling precipitation poses a lot of challenges namely accurate measurement of precipitation since rainfall data consists of sequences of values which are either zero or some positive numbers (intensity) depending on the depth of accumulation over discrete intervals. In addition factors like wind can affect collection accuracy. Rainfall is localized unlike temperature which is highly correlated across regions, therefore a derivative holder based on rainfall may suffer geographical basis risk in case of pricing weather derivatives. The final challenge is the choice of a proper probability distribution function to describe precipitation data. The statistical property of precipitation is far more complex and a more sophisticated distribution is required (Cao, Li, & Wei, 2004).
Rainfall has been modeled as a chain dependent process where a two state Markov chain model represents the occurrence of rainfall and the intensity of rainfall is modeled by fitting a suitable distribution like Gamma (Leobacher & Ngare, 2011), exponential and mixed exponential (Odening, Mußhoff, & Xu, 2007;Hussain, 2008). These models are easy to understand, interpret and use maximum likelihood to find the parameters. However models involve many parameters to fully describe the dynamics of rainfall as well as making several assumptions for the process. Wilks (1998) proposed a multi-site model for daily precipitation using a combination of two-state Markov process (for the rainfall occurrence) and a mixed exponential distribution (for the precipitation amount). He found that the mixture of exponential distributions offered a much better fit than the commonly used Gamma distribution.
In Leobacher and Ngare (2011) the precipitation is modeled in monthly basis by constructing a suitable Markov-Gamma process to take into account seasonal changes of precipitation. It is assumed that rainfall data for different years of the same month is independent and identically distributed. It is assumed that precipitation can be forecast with sufficient accuracy for a month.
Another approach of modeling rainfall is based on the Poisson cluster model where two of the most recognized cluster based models in the stochastic modeling of rainfall are the Newman-Scott Rectangular Pulses model and the Bartlett-Lewis Rectangular Pulse model. These models represent rainfall sequences in time and rainfall fields in space where both the occurrence and depth processes are combined. The difficulty in Poisson cluster models as observed by Onof et al. (2000) is the challenge of how many features should be addressed so that the model is still mathematically tractable. In addition the models are best fitted by the method of moments so requires matching analytic expressions for the statistical properties such as mean, variance etc. Carmona and Diko (2005) developed a time-homogeneous jump Markov process to describe rainfall dynamics. The rainfall process was assumed to be in form of storms which consists of cells themselves. At a cell arrival time the rainfall process jumps up by a random amount, at extinction time it jumps down by a random amount both modeled as Poisson process. Each time the rain intensity changes, an exponential increase occurs either upwards or downwards.
To preserve non negative intensity, the downward jump size is truncated to the current jump size. The Markov jump process also allows for a jump of directly to zero corresponding to the state of no rain (Benth & Benth, 2012).
In this study the rainfall process is modeled as a single model where the occurrence and intensity of rainfall are simultaneously modeled. The Poisson process models the daily occurrence of rainfall while the intensity is modeled using Gamma distribution as the magnitude of the jumps of the Poisson process. Hence we have a compound Poisson process which is Poisson-Gamma model. The contribution of this study is twofold: a Poisson Gamma model that simultaneously describes the rainfall occurrence and intensity at once and a suitable model for zero inflated data which reduces overdispersion. This paper is structured as follows: in section 2 the Poisson-Gamma model is described and then formulated mathematically while section 3 presents methods of estimating the parameters of the model. In section 4 the model is fitted to the data and goodness of fit of the model is evaluated by mean deviance whereas residuals quantile does the diagnostics check of the model. Simulation and forecasting is in section 5 and the study concludes in section 6.

Model Description
Rainfall comprises of discrete and continuous components in that if it does not rain the amount of rainfall is discrete where as if it rains the amount is continuous. In most research works (Odening et al., 2007;Leobacher & Ngare, 2011;López Cabrera, Odening, & Ritter, 2013) the rainfall process is presented by use of two separate models one for the occurrence and conditioned on the occurrence another model is developed for the amount of rainfall. Rainfall occurrence is basically modeled as first or higher order Markov chain process and conditioned on this process a distribution is used to fit the precipitation amount. Commonly used distributions are Gamma, exponential, mixture of exponential, Weibull etc. These models work based on several assumptions and inclusion of several parameters to capture the observed temporal dependence of the rainfall process. However rainfall data exhibit over dispersion (Harrold, Sharma, & Sheather, 2003) which is caused by various factors like clustering, unaccounted temporal correlation or the fact that the data is a product of Bernoulli trials with unequal probability of events. The stochastic models developed in this way underestimate the overdispersion of rainfall data which may result in underestimating the risk of low or high seasonal rainfall.
Our interest in this research is to simultaneously model the occurrence and intensity of rainfall in one model. We would model the rainfall process by using a Poisson Gamma probability distribution which is flexible to models the exact zeros and the amount of rainfall together.
Rainfall is modeled as a compound Poisson process which is a Lévy process with Gamma distributed jumps. This is motivated by the sudden changes of rainfall amount from zero to a large positive value following each rainfall event which are modeled as pure jumps of the compound Poisson process.
We assume rainfall arrives in forms of storms following a Poisson process, and at each arrival time the current intensity increases by a random amount based on Gamma distribution. The jumps of the driving process represent the arrival of the storm events generating a jump size of random size. Each storm comprises of cells that also arrive following another Poisson process.
The Poisson cluster processes gives an appropriate tool as rainfall data indicate presence of clusters of rainfall cells. As observed by Onof et al. (2000) use of Gamma distributed variables for cell depth improves the reproduction of extreme values. Lord (2006) used the poisson-Gamma compound process to model the motor vehicle crashes where they examined the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Wang (2010) proposed a poisson-Gamma compound approach for species richness estimation.

Mathematical Formulation
Let N t be total number of rainfall event per day following a Poisson process such that P (N t = n) = e −λ λ n n! , ∀n ∈ N and The amount of rainfall is the total sum of the jumps of each rainfall event, say (y i ) i≥1 , assumed to be identically and independently Gamma distributed and independent of the times of the occurrence of rainfall: (2) such that y i ∼ Gamma(α, P ) with probability density function Lemma 1. The compound Poisson process (2) has a cumulant function Proof. The moment generating function Φ(s) of L(s) is given by If we observe the occurrence of rainfall for n periods then we have the sequence which are independent and identically distributed.
If on a particular day there is no rainfall that occurred then Therefore the process has a point mass at 0 which implies it is not entirely continuous random variable.
Lemma 2. The probability density function of L in (2) is where δ 0 (L) is a dirac function at zero.
Proof. Let q 0 = 1 − p 0 the probability that it rained. Hence for L i > 0 we have We can express the probability density function f θ (L) in terms of a Dirac function as Consider a random sample of size n of L i with the probability density function If we assume that there are m positive values L 1 , L 2 , . . . , L m then there are M = n − m zeros where m > 0.
We observe that m ∼ Bi(n, 1 − exp(−λ)) and p(m = 0) = exp(−nλ), hence the likelihood function is We can observe from the above evaluation that λ can not be expressed in closed form, similar derivation also shows that α as well can not be expressed in closed form. Therefore we can only estimate λ and α using numerical methods. Withers and Nadarajah (2011) also observed that the probability density function can not be expressed in closed form and therefore difficult to find the analytic form of the estimators. So we will express the probability density function in terms of exponential dispersion models as described below: Definition 3. (Frees, Derrig, & Meyers, 2014) A probability density function of the form for suitable functions k() and a() is called an exponential dispersion model.
Θ > 0 is the dispersion parameter. The function k(θ) is the cumulant of the Exponential dispersion model, since Θ = 1 then k () are the successive cumulants of the distribtuion (Dunn & Smyth, 2008). The exponential dispersion models were first introduced by Fisher in 1922.
The relationship µ = dk(θ) dθ is invertible so that θ can be expressed as a function of µ, as Definition 4. The family of exponential dispersion models whose variance functions is of the Examples are, for p = 0 then we have a normal distribution, p = 1 and Θ = 1 it is a Poisson distribution, and Gamma distribution for p = 2 while when p = 3 it is Gaussian inverse distribution. Tweedie densities can not be expressed in closed form (apart from the examples above) but can instead be identified by their cumulants generating functions.
Hence we can solve for µ and k(θ) as follows: by equating the constants of integration above to zero.
Proposition 5. The cumulant generating function of a Tweedie distribution for 1 < p < 2 is Proof. From (6) the moment generating function is given by: For on 1 < p < 2 we substitute θ and k(θ) to have By comparing the cumulant generating functions in Lemma 1 and Proposition 5 the compound Poisson process can be thought of as Tweedie distribution with parameters (λ, α, P ) expressed as follows, The requirement that the Gamma shape parameter P be positive implies that only Tweedie distributions between 1 < p < 2 can represent the Poisson-Gamma compound process. In addition for λ > 0, α > 0 implies µ > 0 and Θ > 0.
Proposition 6. Based on tweedie distribution, the probability of receiving no rainfall at all is and the probability of having a rainfall event is Proof. This follows by direct substituting the values of λ and, θ, k(θ) into (6).
The function W (λ, α, L, P ) is an example of Wright's generalized Bessel function, however it can not be expressed in terms of the more common Bessel function. To evaluate it the value of j is determined for which W j the function reaches maximum (Dunn & Smyth, 2008).

Parameter Estimation
We approximate the function W (λ, α, L, P ) = ∞ j=1 λ j (αL) jP e −λ j!Γ(jP ) = ∞ j=1 W j following the procedure by (Dunn & Smyth, 2008) where the value of j is determinded for which W j reaches maximum. We treat j as a continuous so that W j is differentiated with respect to j and set the derivative to zero. So for L > 0 we have Lemma 7. (Dunn & Smyth, 2008) The log maximum approximation of W j is given by Substituting the values of λ, α in the above equation we have The term µ (2−p)j+(p−1)jP depends on the L, p, P, Θ values so we maximize the summation Using Stirling's approximation of Gamma functions we have log Γ(1 + j) ≈ (1 + j) log(1 + j) − (1 + j) + 1 2 log( 2π 1 + j ) log Γ(P j) ≈ P j log(P j) − P j + 1 2 log( 2π P j ) and hence we have For 1 < p < 2 we have P = 2 − p p − 1 > 0 hence the logarithms have positive arguments. Differentiating with respect to j we have where 1 j is ignored for large j. Solving for ∂ log W j ∂j = 0 we have Hence the result follows.
It can be observed that ∂W j ∂j is monotonically decreasing hence log W j is strictly convex as a function of j. Therefore W j decays faster than geometrically on either side of j max (Dunn & Smyth, 2008). Therefore if we are to estimate W (L, Θ, P ) byŴ (L, Θ, P ) = ju j=j d W j the approximation error is bounded by geometric sum For quick and accurate evaluation of W (λ, α, L, P ), the series is summed for only those terms in the series which contribute significantly to the sum.
Generalized linear models extend the standard linear regression models to incorporate nonnormal response distributions and possibly nonlinear functions of the mean. The advantage of GLMs is that the fitting process maximizes the likelihood for the choice of the distribution for a random variable y and the choice is not restricted to normality unlike linear regression (Agresti, 2015).
The exponential dispersion models are the response distributions for the generalized linear models. Tweedie distributions are members of the exponential dispersion models upon which the generalized linear models are based. Consequently fitting a Tweedie distribution follows the framework of fitting a generalized linear model.
Lemma 8. In case of a canonical link function, the sufficient statistics for Proof. For n independent observations y i of the exponential dispersion model (6) the log likelihood function is Proposition 9. Given y i is distributed as (6) then its distribution depends only on its first two moments namely µ i and V ar(y i ) Proof. Let g(µ i ) be the link function of the GLM such that η i = p j=1 β j x ij = g(µ i ). The the likelihood equations are Using chain rule we have Since V ar(y i ) = V (µ i ), therefore the relationship between the mean and variance characterizes the distribution.
Clearly a GLM only requires the first two moments of the response y i hence despite the difficulty in full likelihood analysis of Tweedie distribution as it can not be expressed in closed form for 1 < p < 2 we can still fit a Tweedie distribution family. The likelihood is only required to estimate p and Θ as well as diagnostic check of the model. Proof. From the log-likelihood, the covariance matrix of the distribution is the inverse of the Thereforeβ has an approximate N [β, (X T W X) −1 ] with V ar(β) = (X TŴ X) −1 whereŴ is evaluated atβ.
To computeβ we use the iteratively re-weighted least square algorithm proposed by Dobson and Barnett (2008) where the iterations use the working weights w i However estimating p is more difficult than estimating β and Θ such that most researchers working with Tweedie densities have p a priori. In this study we use the procedure by (Dunn & Smyth, 2008) where the maximum likelihood estimator of p is obtained by directly maximizing the profile likelihood function. For any given value of p we find the maximum likelihood estimate of β, Θ and compute the log-likelihood function. This is repeated several times until we have a value of p which maximizes the log likelihood function.
Given the estimated values of p and β, then the unbiased estimator of Θ is given bŷ Since for 1 < p < 2 the Tweedie density can not be expressed in closed form, it is recommended that the maximum likelihood estimate of Θ must be computed iteratively from full data (Dunn & Smyth, 2008).

Data analysis
Daily rainfall data of Balaka district in Malawi covering the period 1995-2015 is used. The data was obtained from Meteorological Surveys of Malawi. Figure 1 shows a plot of the data. In summary the minimum value is 0mm which indicates that there were no rainfall on a particular days, whereas the maximum amount is 123.7mm. The mean rainfall for the whole period is 3.167mm.
We investigated the relationship between the variance and the mean of the data by plotting the log(variance) against log(mean) as shown in Figure  V ariace = A * mean β , A ∈ R Hence the variance can be expressed as some power β ∈ R of the mean agreeing with the Tweedie variance function requirement.

Fitted Model
To model the daily rainfall data we use sin and cos as predictors due to the cyclic nature and seasonality of rainfall. We have assumed that February ends on 28th for all the years to be uniform in our modeling.
The canonical link function is given by log µ i = a 0 + a 1 sin 2πi 365 + a 2 cos 2πi 365 where i = 1, 2, . . . 365 corresponds to days of the year and a 0 , a 1 , a 2 are the coefficients of regression.
In the first place we estimatep by maximizing the profile log likelihood function. Figure 3 shows the graph of the profile log likelihood function. As can be observed the value of p that maximizes the function is 1.5306.

Figure 3: Profile likelihood
From the results obtained after fitting the model, both the cyclic cosine and sine terms are important characteristics for daily rainfall Comparing the actual means and the predicted means for 2nd July we haveμ = 0.3820 where as µ = 0.4333 similarly for 31st December we haveμ = 9.0065 and µ = 10.6952 respectively. Figure 4 shows the estimated mean and actual mean where the model behaves well generally.
Dev(y,μ) is called the deviance of the model and the greater the deviance the poorer the fitted model as maximizing the likelihood corresponds to minimizing the deviance.
In terms of tweedie distributions with 1 < p < 2 with the deviance is Based on results from fitting the model, the residual deviance is 43144 less than the null deviance 62955 which implies that the fitted model explains the data better than a null model.

Diagnostic Check
The model diagnostic is considered as a way of residual analysis. The fitted model presents challenges to assess especially for days with no rainfall at all as they produce spurious results and distracting patterns similarly as observed by (Dunn & Smyth, 2008). Since this is a nonnormal regression, residuals are far from being normally distributed and having equal variances unlike in a normal linear regression. Here the residuals lie parallel to distinct values hence difficult to make any meaningful decision about the fitted model Figure 5 So we assess the model based on quantile residuals which removes the pattern in discrete data by adding the smallest amount of randomization necessary on the cumulative probability scale.
The quantile residuals are obtained by inverting the distribution function for each response and finding the equivalent standard normal quantile.
Mathematically, let a i = lim y↑y i F (y;μ i ,Θ) and b i = F (y i ;μ i ,Θ) where F is the cumulative function of the probability density function f (y; µ, Θ) then the randomized quantile residuals for y i is

Simulation
The model is simulated to test whether it produces data with similar characteristics to the actual observed rainfall. The simulation is done for a period of two years where one was the last year of the data (2015)    The different statistics of the simulated data and actual data are shown in Table 2for comparisons.
The main objective of simulation is to demonstrated that the Poisson-Gamma can be used to predict and forecast rainfall occurrence and intensity simultaneously. Based on the results above the model has shown that it works well in predicting the rainfall intensity hence can be used in agriculture, actuarial, hydrology etc.
However the model performed poorly on predicting probability of rainfall occurrence as it underestimated the probability of rainfall occurrence. It is suggested here that probably use of truncated Fourier series can improve this estimation as compared to the sinusoidal.
But it performed better in predicting probability of no rainfall on days where there was little or no rainfall as indicated in the Figure 8.
It can also be observed that the model produces synthetic precipitation that agrees with the four characteristics of a stochastic precipitation model as suggested by (Odening et al., 2007)  we can also tell that a probability of a rain day is higher if the previous day was wet which is the basis of precipitation models that involve the Markov process. From Figure 7 we can also observe variation of rainfall intensity based on time of the season.
In additional the model allows modeling of exact zeros in the data and is able to predict a probability of no rainfall event simultaneously.

Conclusion
A daily stochastic rainfall model was developed based on a compound Poisson process where rainfall events follow a Poisson distribution and the intensity are independent of events following a Gamma distribution. Unlike several research that has been carried on precipitation modeling where by two models are developed for occurrence and intensity, the model proposed here is able to model both processes simultaneously. The proposed model is also able to model the exact zeros, the event of no rainfall which is not the case with the other models. This precipitation model is an important tool to study the impact of weather on a variety of systems including ecosystem, risk assessment, drought predictions, weather derivatives as we can be able to simulate synthetic rainfall data. The model provides mechanisms for understanding the fine scale structure like number and mean of rainfall events, mean daily rainfall, and probability of rainfall occurrence. This is applicable in agriculture activities, disaster preparedness and water cycle systems.
The model developed can easily be used for forecasting future events and in terms of weather derivatives, the weather index can be derived from simulating a sample path by summing up daily precipitation in the relevant accumulation period. Rather than developing a weather index which is not flexible to forecast future events, we can use this model in pricing weather derivatives.
Rainfall data is generally zero inflated in that the amount of rainfall received on a day can be zero with a positive probability but continuously distributed otherwise. This makes it difficult to transform the data to normality by power transforms or to model it directly using continuous distribution. The Poisson-Gamma distribution has a complicated probability density function which is difficult to estimate its parameters. Hence expressing it in terms of a Tweedie distribution makes it easy to estimate the parameters. As well as Tweedie distributions belong to the exponential family of distributions upon which generalized linear models are based, hence there is an already existing framework in place for fitting and diagnostic testing of the model.
The model developed allows the information in both zero and positive observations to contribute to the estimation of all parts of the model unlike the other model (Odening et al., 2007;Leobacher & Ngare, 2011;López Cabrera et al., 2013) which condition rainfall intensity based on probability of occurrence. In additional the introduction of the dispersion parameter in the model helps to reduce underestimation of overdispersion of the data which is also common in the aforementioned models.

Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper. Acknowledgments