A Poisson-Gamma Model for Zero Inflated Rainfall Data

Rainfall modeling is significant for prediction and forecasting purposes in agriculture, weather derivatives, hydrology, and risk and disaster preparedness. Normally two models are used to model the rainfall process as a chain dependent process representing the occurrence and intensity of rainfall. Such two models help in understanding the physical features and dynamics of rainfall process. However rainfall data is zero inflated and exhibits overdispersion which is always underestimated by such models. In this study we have modeled the two processes simultaneously as a compound Poisson process. The rainfall events are modeled as a Poisson process while the intensity of each rainfall event is Gamma distributed. We minimize overdispersion by introducing the dispersion parameter in the model implemented through Tweedie distributions. Simulated rainfall data from the model shows a resemblance of the actual rainfall data in terms of seasonal variation, means, variance, and magnitude. The model also provides mechanisms for small but important properties of the rainfall process. The model developed can be used in forecasting and predicting rainfall amounts and occurrences which is important in weather derivatives, agriculture, hydrology, and prediction of drought and flood occurrences.


Introduction
Climate variables, in particular, rainfall occurrence and intensity, hugely impact human and physical environment.Knowledge of the frequency of the occurrence and intensity of rainfall events is essential for planning, designing, and management of various water resources system [1].Specifically rain-fed agriculture is a sensitive sector to weather and crop production is directly dependent on the amount of rainfall and its occurrence.Rainfall modeling has a great impact on crop growth, weather derivatives, hydrological systems, drought, and flood management and crop simulated studies.
Rainfall modeling is also important in pricing of weather derivatives which are financial instruments that are used as a tool for risk management to reduce risk associated with adverse or unexpected weather conditions.
Further as climate change greatly affects the environment there is an urgent need for predicting the variability of rainfall for future periods for different climate change scenarios in order to provide necessary information for high quality climate related impact studies [1].
However modeling precipitation poses a lot of challenges, namely, accurate measurement of precipitation since rainfall data consists of sequences of values which are either zero or some positive numbers (intensity) depending on the depth of accumulation over discrete intervals.In addition factors like wind can affect collection accuracy.Rainfall is localized unlike temperature which is highly correlated across regions; therefore a derivative holder based on rainfall may suffer geographical basis risk in case of pricing weather derivatives.The final challenge is the choice of a proper probability distribution function to describe precipitation data.The statistical property of precipitation is far more complex and a more sophisticated distribution is required [2].
Rainfall has been modeled as a chain dependent process where a two-state Markov chain model represents the occurrence of rainfall and the intensity of rainfall is modeled by fitting a suitable distribution like Gamma [3], exponential, and mixed exponential [1,4].These models are easy to understand and interpret and use maximum likelihood to find the parameters.However models involve many parameters to fully describe the dynamics of rainfall as well as making several assumptions for the process.
Wilks [5] proposed a multisite model for daily precipitation using a combination of two-state Markov process (for the rainfall occurrence) and a mixed exponential distribution (for the precipitation amount).He found that the mixture of exponential distributions offered a much better fit than the commonly used Gamma distribution.
In study of Leobacher and Ngare [3] the precipitation is modeled on a monthly basis by constructing a suitable Markov-Gamma process to take into account seasonal changes of precipitation.It is assumed that rainfall data for different years of the same month is independent and identically distributed.It is assumed that precipitation can be forecast with sufficient accuracy for a month.
Another approach of modeling rainfall is based on the Poisson cluster model where two of the most recognized cluster based models in the stochastic modeling of rainfall are the Newman-Scott Rectangular Pulses model and the Bartlett-Lewis Rectangular Pulse model.These models represent rainfall sequences in time and rainfall fields in space where both the occurrence and depth processes are combined.The difficulty in Poisson cluster models as observed by Onof et al. [6] is the challenge of how many features should be addressed so that the model is still mathematically tractable.In addition the models are best fitted by the method of moments and so requires matching analytic expressions for the statistical properties such as mean and variance.
Carmona and Diko [7] developed a time-homogeneous jump Markov process to describe rainfall dynamics.The rainfall process was assumed to be in form of storms which consists of cells themselves.At a cell arrival time the rainfall process jumps up by a random amount and at extinction time it jumps down by a random amount, both modeled as Poisson process.Each time the rain intensity changes, an exponential increase occurs either upwards or downwards.To preserve nonnegative intensity, the downward jump size is truncated to the current jump size.The Markov jump process also allows for a jump directly to zero corresponding to the state of no rain [8].
In this study the rainfall process is modeled as a single model where the occurrence and intensity of rainfall are simultaneously modeled.The Poisson process models the daily occurrence of rainfall while the intensity is modeled using Gamma distribution as the magnitude of the jumps of the Poisson process.Hence we have a compound Poisson process which is Poisson-Gamma model.The contribution of this study is twofold: a Poisson-Gamma model that simultaneously describes the rainfall occurrence and intensity at once and a suitable model for zero inflated data which reduces overdispersion.
This paper is structured as follows.In Section 2 the Poisson-Gamma model is described and then formulated mathematically while Section 3 presents methods of estimating the parameters of the model.In Section 4 the model is fitted to the data and goodness of fit of the model is evaluated by mean deviance whereas quantile residuals perform the diagnostics check of the model.Simulation and forecasting are carried out in Section 5 and the study concludes in Section 6.

Model Formulation
2.1.Model Description.Rainfall comprises discrete and continuous components in that if it does not rain the amount of rainfall is discrete whereas if it rains the amount is continuous.In most research works [3,4,9] the rainfall process is presented by use of two separate models: one is for the occurrence and conditioned on the occurrence and another model is developed for the amount of rainfall.Rainfall occurrence is basically modeled as first or higher order Markov chain process and conditioned on this process a distribution is used to fit the precipitation amount.Commonly used distributions are Gamma, exponential, mixture of exponential, Weibull, and so on.These models work based on several assumptions and inclusion of several parameters to capture the observed temporal dependence of the rainfall process.However rainfall data exhibit overdispersion [10] which is caused by various factors like clustering, unaccounted temporal correlation, or the fact that the data is a product of Bernoulli trials with unequal probability of events.The stochastic models developed in this way underestimate the overdispersion of rainfall data which may result in underestimating the risk of low or high seasonal rainfall.
Our interest in this research is to simultaneously model the occurrence and intensity of rainfall in one model.We would model the rainfall process by using a Poisson-Gamma probability distribution which is flexible to model the exact zeros and the amount of rainfall together.
Rainfall is modeled as a compound Poisson process which is a Lévy process with Gamma distributed jumps.This is motivated by the sudden changes of rainfall amount from zero to a large positive value following each rainfall event which are modeled as pure jumps of the compound Poisson process.
We assume rainfall arrives in forms of storms following a Poisson process, and at each arrival time the current intensity increases by a random amount based on Gamma distribution.The jumps of the driving process represent the arrival of the storm events generating a jump size of random size.Each storm comprises cells that also arrive following another Poisson process.
The Poisson cluster processes gives an appropriate tool as rainfall data indicating presence of clusters of rainfall cells.As observed by Onof et al. [6] use of Gamma distributed variables for cell depth improves the reproduction of extreme values.
Lord [11] used the Poisson-Gamma compound process to model the motor vehicle crashes where they examined the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter.Wang [12] proposed a Poisson-Gamma compound approach for species richness estimation.

Mathematical Formulation.
Let   be total number of rainfall event per day following a Poisson process such that The amount of rainfall is the total sum of the jumps of each rainfall event, say (  ) ≥1 , assumed to be identically and independently Gamma distributed and independent of the times of the occurrence of rainfall: such that   ∼ Gamma(, ) is with probability density function Lemma 1.The compound Poisson process (2) has a cumulant function for 0 ≤  <  and  ∈ R, where   () is the moment generating function of the Gamma distribution.
Proof.The moment generating function Φ() of () is given by ( If we observe the occurrence of rainfall for  periods, then we have the sequence {  }  =1 which is independent and identically distributed.
If on a particular day there is no rainfall that occurred, then Therefore the process has a point mass at 0 which implies that it is not entirely continuous random variable.

Lemma 2. The probability density function of 𝐿 in (2) is
where  0 () is a dirac function at zero.Proof.Let  0 = 1 −  0 be the probability that it rained.Hence for   > 0 we have where If we let V =   and   (V  ) = ∑ ∞ =1 (V  /!Γ()), then we have We can express the probability density function   () in terms of a Dirac function as Consider a random sample of size  of   with the probability density function If we assume that there are  positive values  1 ,  2 , . . .,   , then there are  =  −  zeros where  > 0.
We observe that  ∼ (, 1 − exp (−)) and ( = 0) = exp (−); hence the likelihood function is and the log-likelihood for  = (, , ) Now for λ we have We can observe from the above evaluation that  can not be expressed in closed form; similar derivation also shows that  as well can not be expressed in closed form.Therefore we can only estimate  and  using numerical methods.Withers and Nadarajah [13] also observed that the probability density function can not be expressed in closed form and therefore it is difficult to find the analytic form of the estimators.So we will express the probability density function in terms of exponential dispersion models as described below.
Definition 3 (see [14]).A probability density function of the form for suitable functions () and () is called an exponential dispersion model.
Θ > 0 is the dispersion parameter.The function () is the cumulant of the exponential dispersion model; since Θ = 1, then   () are the successive cumulants of the distribution [15].The exponential dispersion models were first introduced by Fisher in 1922.
If we let   = log (  ;   , Θ) as a contribution of   to the likelihood function  = ∑    , then However we expect that E(  /  ) = 0 and −E( Furthermore Therefore the mean of the distribution is E[] =  = ()/  and the variance is Var() = Θ( 2 ()/ 2 ).
The relationship  = ()/ is invertible so that  can be expressed as a function of ; as such we have Var() = Θ(), where () is called a variance function.Examples are as follows: for  = 0 then we have a normal distribution,  = 1, and Θ = 1; it is a Poisson distribution, and Gamma distribution for  = 2, while when  = 3 it is Gaussian inverse distribution.Tweedie densities can not be expressed in closed form (apart from the examples above) but can instead be identified by their cumulants generating functions.
From Var() = Θ( 2 ()/ 2 ), then for Tweedie family distribution we have Hence we can solve for  and () as follows: by equating the constants of integration above to zero.
Proposition 5.The cumulant generating function of a Tweedie distribution for 1 <  < 2 is Proof.From ( 16) the moment generating function is given by For 1 <  < 2 we substitute  and () to have By comparing the cumulant generating functions in Lemma 1 and Proposition 5 the compound Poisson process can be thought of as Tweedie distribution with parameters (, , ) expressed as follows: The requirement that the Gamma shape parameter  be positive implies that only Tweedie distributions between 1 <  < 2 can represent the Poisson-Gamma compound process.In addition, for  > 0,  > 0 implies  > 0 and Θ > 0.

Proposition 6. Based on Tweedie distribution, the probability of receiving no rainfall at all is
and the probability of having a rainfall event is where Proof.This follows by directly substituting the values of  and , () into ( 16).
The function (, , , ) is an example of Wright's generalized Bessel function; however it can not be expressed in terms of the more common Bessel function.To evaluate it the value of  is determined for which the function   reaches the maximum [15].

Parameter Estimation
We approximate the function (, , , ) = ∑ ∞ =1 (  ()   − /!Γ()) = ∑ ∞ =1   following the procedure by [15] where the value of  is determined for which   reaches maximum.We treat  as continuous so that   is differentiated with respect to  and set the derivative to zero.So for  > 0 we have the following.

𝑊 (𝜆, 𝛼, 𝐿
Substituting the values of ,  in the above equation we have The term  (2−)+(−1) depends on the , , , Θ values so we maximize the summation And hence we have For 1 <  < 2 we have  = (2 − )/( − 1) > 0; hence the logarithms have positive arguments.Differentiating with respect to  we have where 1/ is ignored for large .Solving for ( log )/ = 0 we have Substituting  max in log   to find the maximum approximation of   we have Hence the result follows.
It can be observed that   / is monotonically decreasing; hence log   is strictly convex as a function of .Therefore   decays faster than geometrically on either side of  max [15].Therefore if we are to estimate (, Θ, ) by Ŵ(, Θ, ) = ∑ (41) For quick and accurate evaluation of (, , , ), the series is summed for only those terms in the series which contribute significantly to the sum.Generalized linear models extend the standard linear regression models to incorporate nonnormal response distributions and possibly nonlinear functions of the mean.The advantage of GLMs is that the fitting process maximizes the likelihood for the choice of the distribution for a random variable  and the choice is not restricted to normality unlike linear regression [16].
The exponential dispersion models are the response distributions for the generalized linear models.Tweedie distributions are members of the exponential dispersion models upon which the generalized linear models are based.Consequently fitting a Tweedie distribution follows the framework of fitting a generalized linear model.

Lemma 8. In case of a canonical link function, the sufficient statistics for {𝛽
Proof.For  independent observations   of the exponential dispersion model ( 16) the log-likelihood function is Proposition 9. Given that   is distributed as (16) then its distribution depends only on its first two moments, namely,   and Var(  ).
Proof.Let (  ) be the link function of the GLM such that The likelihood equations are Using chain rule we have Hence Since Var(  ) = (  ), the relationship between the mean and variance characterizes the distribution.
Clearly a GLM only requires the first two moments of the response   ; hence despite the difficulty of full likelihood analysis of Tweedie distribution as it can not be expressed in closed form for 1 <  < 2 we can still fit a Tweedie distribution family.The likelihood is only required to estimate  and Θ as well as diagnostic check of the model.Proposition 10.Under the standard regularity conditions, for large , the maximum likelihood estimator β of  for generalized linear model is efficient and has an approximate normal distribution.
Proof.From the log-likelihood, the covariance matrix of the distribution is the inverse of the information matrix J = E(− 2 ()/ ℎ   ).
To compute β we use the iteratively reweighted least square algorithm proposed by Dobson and Barnett [17] where the iterations use the working weights   : where (  ) =    .However estimating  is more difficult than estimating  and Θ such that most researchers working with Tweedie densities have  a priori.In this study we use the procedure in [15] where the maximum likelihood estimator of  is obtained by directly maximizing the profile likelihood function.For any given value of  we find the maximum likelihood estimate of , Θ and compute the log-likelihood function.This is repeated several times until we have a value of  which maximizes the log-likelihood function.
Given the estimated values of  and , then the unbiased estimator of Θ is given by Since for 1 <  < 2 the Tweedie density can not be expressed in closed form, it is recommended that the maximum likelihood estimate of Θ must be computed iteratively from full data [15].

Data and Model Fitting
4.1.Data Analysis.Daily rainfall data of Balaka district in Malawi covering the period 1995-2015 is used.The data was obtained from Meteorological Surveys of Malawi.Figure 1 shows a plot of the data.In summary the minimum value is 0 mm which indicates that there were no rainfall on particular days, whereas the maximum amount is 123.7 mm.The mean rainfall for the whole period is 3.167 mm.We investigated the relationship between the variance and the mean of the data by plotting the log(variance) against log(mean) as shown in Figure 2. From the figure we can observe a linear relationship between the variance and the mean which can be expressed as log (Variance) =  +  log (mean) Variace =  * mean  ,  ∈ R.
Hence the variance can be expressed as some power  ∈ R of the mean agreeing with the Tweedie variance function requirement.

Fitted Model.
To model the daily rainfall data we use sin and cos as predictors due to the cyclic nature and seasonality of rainfall.We have assumed that February ends on 28th for all the years to be uniform in our modeling.
The canonical link function is given by where  = 1, 2, . . ., 365 corresponds to days of the year and  0 ,  1 ,  2 are the coefficients of regression.
In the first place we estimate p by maximizing the profile log-likelihood function.Figure 3 shows the graph of the profile log-likelihood function.As can be observed the value of  that maximizes the function is 1.5306.
From the results obtained after fitting the model, both the cyclic cosine and sine terms are important characteristics for daily rainfall Table 1.The covariates were determined to take into account the seasonal variations in the stochastic model.
Comparing the actual means and the predicted means for 2 July we have μ = 0.3820, whereas  = 0.4333; similarly for 31 December we have μ = 9.0065 and  = 10.6952,respectively.Figure 4 shows the estimated mean and actual mean where the model behaves well generally.
Dev(, μ) is called the deviance of the model and the greater the deviance, the poorer the fitted model as maximizing the likelihood corresponds to minimizing the deviance.
In terms of Tweedie distributions with 1 <  < 2, the deviance is Based on results from fitting the model, the residual deviance is 43144 less than the null deviance 62955 which implies that the fitted model explains the data better than a null model.to be assessed especially for days with no rainfall at all as they produce spurious results and distracting patterns similarly as observed by [15].Since this is a nonnormal regression, residuals are far from being normally distributed and having equal variances unlike in a normal linear regression.Here the residuals lie parallel to distinct values; hence it is difficult to make any meaningful decision about the fitted model (Figure 5).
So we assess the model based on quantile residuals which remove the pattern in discrete data by adding the smallest amount of randomization necessary on the cumulative probability scale.
The quantile residuals are obtained by inverting the distribution function for each response and finding the equivalent standard normal quantile.
Mathematically, let   = lim ↑  (; μ , Θ) and   = (  ; μ , Θ), where  is the cumulative function of the probability density function (; , Θ); then the randomized quantile residuals for   are with   being the uniform random variable on (  ,   ].The randomized quantile residuals are distributed normally barring the variability in μ and Θ.
Figure 6 shows the normalized Q-Q plot and as can be observed there are no large deviations from the straight line, only small deviations at the tail.The linearity observed indicates an acceptable fitted model.

Simulation
The model is simulated to test whether it produces data with similar characteristics to the actual observed rainfall.The simulation is done for a period of two years where one was the last year of the data (2015) and the other year (2016) was a future prediction.Then comparison was done with a graph for 2015 data as shown in Figure 7.
The different statistics of the simulated data and actual data are shown in Table 2 for comparisons.
The main objective of simulation is to demonstrate that the Poisson-Gamma can be used to predict and forecast rainfall occurrence and intensity simultaneously.Based on the results above (Figure 8), the model has shown that it works well in predicting the rainfall intensity and hence can be used in agriculture, actuarial science, hydrology, and so on.
However the model performed poorly in predicting probability of rainfall occurrence as it underestimated the probability of rainfall occurrence.It is suggested here that probably the use of truncated Fourier series can improve this estimation as compared to the sinusoidal.
But it performed better in predicting probability of no rainfall on days where there was little or no rainfall as indicated in Figure 8.
It can also be observed that the model produces synthetic precipitation that agrees with the four characteristics of a stochastic precipitation model as suggested by [4] as follows.The probability of rainfall occurrence obeys a seasonal pattern (Figure 8); in addition we can also tell that a probability of a rain in a day is higher if the previous day was wet which is the basis of precipitation models that involve the Markov process.From Figure 7 we can also observe variation of rainfall intensity based on time of the season.
In addition the model allows modeling of exact zeros in the data and is able to predict a probability of no rainfall event simultaneously.

Conclusion
A daily stochastic rainfall model was developed based on a compound Poisson process where rainfall events follow a Poisson distribution and the intensity is independent of events following a Gamma distribution.Unlike several researches that have been carried out into precipitation modeling whereby two models are developed for occurrence and intensity, the model proposed here is able to model both processes simultaneously.The proposed model is also able to model the exact zeros, the event of no rainfall, which is not the case with the other models.This precipitation model is an important tool to study the impact of weather on a variety of systems including ecosystem, risk assessment, drought predictions, and weather derivatives as we can be able to simulate synthetic rainfall data.The model provides mechanisms for understanding the fine scale structure like number and mean of rainfall events, mean daily rainfall, and probability of rainfall occurrence.This is applicable in agriculture activities, disaster preparedness, and water cycle systems.
The model developed can easily be used for forecasting future events and, in terms of weather derivatives, the weather index can be derived from simulating a sample path by summing up daily precipitation in the relevant accumulation period.Rather than developing a weather index which is not flexible enough to forecast future events, we can use this model in pricing weather derivatives.
Rainfall data is generally zero inflated in that the amount of rainfall received on a day can be zero with a positive probability but continuously distributed otherwise.This makes it difficult to transform the data to normality by power transforms or to model it directly using continuous distribution.The Poisson-Gamma distribution has a complicated probability density function whose parameters are difficult to estimate.Hence expressing it in terms of a Tweedie distribution makes estimating the parameters easy.In addition, Tweedie distributions belong to the exponential family of distributions upon which generalized linear models are based; hence there is an already existing framework in place for fitting and diagnostic testing of the model.
The model developed allows the information in both zero and positive observations to contribute to the estimation of all parts of the model unlike the other model [3,4,9] which conditions rainfall intensity based on probability of occurrence.In addition the introduction of the dispersion parameter in the model helps in reducing underestimation of overdispersion of the data which is also common in the aforementioned models.

4. 3 .
Goodness of Fit of the Model.Let the maximum likelihood estimate of   be θ for all  and μ as the model's mean