Clustering of Rainfall Stations in RH-24 Mexico Region Using the Hurst Exponent in Semivariograms

An important topic in the study of the time series behavior and, in particular, meteorological time series is the long-range dependence. This paper explores the behavior of rainfall variations in different periods, using long-range correlations analysis. Semivariograms and Hurst exponent were applied to historical data in different pluviometric stations of the Rı́o Bravo-San Juan watershed, at the hydrographic RH-24Mexico region.Thedatabasewas provided by theWaterNational Commission (CONAGUA). Using the semivariograms, the Hurst exponent was obtained and used as an input to perform a cluster analysis of rainfall stations. Groups of homogeneous samples that might be useful in a regional frequency analysis were obtained through the process.


Introduction
When limited observations of hydrological events are available, the ability to provide appropriate characterization, analysis, and predictions of a phenomenon gets compromised.However, the analysis can be improved by identifying homogeneous samples that can be used in combination to make better estimates of a probability model.This is one of the major concerns within the practice of regional frequency analysis (RFA), where the final output is the estimation of extreme events in a geographical area that can be used as input in risk analysis, water management, zoning, and land use applications, Hosking and Wallis [1].However, the estimation of extreme events is considered a complex problem, mostly because the information is usually limited, serial correlation exists, multiple change-points might be present, and observations follow trends and seasonal patterns.To address these issues, hydrological time series studies have been successfully applied in the past, Machiwal and Jha [2]; however, most research efforts have been focused on trend detection tests, leaving aside other important properties such as stationarity, homogeneity, periodicity, and persistence.By addressing these properties, a better selection of homogeneous samples might be possible, and as a consequence, practitioners might achieve better predictions.
Previous works on time series analysis in climatology with applications in precipitation go back to Bhuiya [3], with the development of a test for stationarity after periodic and trend components were subtracted from hydrologic series.Buishand [4] used trend tests to evaluate the difference in precipitation between rural and urban areas of Amsterdan and Rotterdam.Buishand [5,6] constructed several tests of homogeneity in the mean of series with the use of cumulative sums, likelihood tests, and Bayesian inference.Kothyari et al. [7] evaluated three stations in India, Agra, Dehradun, and Dehli, to test for changes in rainfall and temperature, providing evidence of a change in the number of rainy days during monsoon season and an increment in temperature.Giakoumakis and Baloutsos [8] performed a trend analysis on historical series of annual precipitations from the basin of the Evinos Riven in Greece.By applying different tests of randomness, decreasing trends were found in the rainfall records.Other authors dealing with trend analysis, homogeneity, and change-points found in the literature are Angel and Huff [9], Mirza et al. [10], Tarhule and Woo [11], Luís et al. [12], Kripalani and Kulkarni [13], Adamowski and Bougadis [14], Yu et al. [15], and Kumar et al. [16].A comprehensive review of these works can be found in Machiwal and Jha [2] with descriptions of related developments in hydrological time series analysis.
Recent developments in hydrological analysis include the works of Golian et al. [17] with a classification and clustering approach of rainfall data using the natural-breaks classification method and the fuzzy c-means (FCM) algorithm.Shi et al. [18] analyzed variations in trends for precipitation data using a linear regression method, the Mann-Kendall test, and the Hurst exponent.The Hurst exponent, as part of a fractal analysis, was used to evaluate long-range dependence and the possibility of trends in the data.The following works around the Hurst exponent include the developments of Golder et al. [19], where the Hurst exponent is also used to explore long-term correlations, and cumulative rainfall observations were modeled using the alpha-stable probability law to deal with heavy-tailed distributions.Chang [20] extended the application of the Hurst exponent by developing a computation approach to estimate the exponent over time series that fits a discrete time fractional Brownian motion and fractional Gaussian noise.Yu et al. [21] also studied longterm correlations using the Hurst exponent and performed a multifractal analysis of rainfall series (see Kantelhardt [22]) based on a multiplicative cascade model and a multifractal detrended fluctuation analysis.Other recent works on time series analysis can be found in Carbone et al. [23] with the construction of a simulation model of storms using a double exponential distribution.Chou [24] investigated the complexity at different temporal scales of rainfall and runoff time series using the sample-entropy method, and finally, García-Marín et al. [25] performed a regional frequency analysis over rainfall data from Málaga, Spain, where the grouping of stations into homogenous regions has been done by following a cluster analysis with multifractal values of the different series.
In this paper, a clustering approach is used to group stations into homogeneous samples after summarizing the results of semivariogram analysis into a Hurst exponent.As a case study, a sample of pluviometric stations, the Río Bravo-San Juan watershed, at the RH-24 Mexico region was analyzed (Figure 1).A map of the Río Bravo-San Juan watershed is shown in Figure 2.This region is located in Mexico between the states of Nuevo León, Coahuila, and Tamaulipas covering an approximate area of 29420 km 2 .Some of the rainfall stations are shown in Figure 3.The data used has been provided by CONAGUA, the local institution responsible for water management in the country.

Problem Description
In practice, to perform RFA, it is required to identify homogeneous regions where data follow similar patterns that can be analyzed together to improve the identification of probability models that in turn can be used to estimate extreme events  and their frequency in terms of return periods.This analysis is usually executed when dealing with droughts, pollution, wind movement, temperature, atmospheric pressure, and rainfall observations, to name a few.These researches deal with the problem of finding groups of rainfall stations that create homogeneous regions by considering fractal structures captured through semivariograms and Hurst exponents.Rainfall data from a sample of the hydrographic region RH-24 Mexico, the Río Bravo-San Juan watershed, are used as a case study to evaluate the proposed approach.

Methodology
Semivariograms, in the present study, are used to quantify long-range correlations of data from different pluviometric stations using monthly records.By considering the analysis of semivariograms of historical series, a rescaled range analysis / is performed to obtain a measure of the Hurst exponent [26].The Hurst exponent is used as a metric of a particular pluviometric station.The process is repeated over each pluviometric station within the region under analysis.Hurst exponents are used as a reference to identify stations that exhibit similar patterns.As a consequence, a cluster analysis is applied to identify homogeneous samples.An advantage of the Hurst exponent is the simplicity of its algorithm that can be used to measure the condition of persistence or antipersistence of a process, and it provides a metric that can be used to classify different time series.

Semivariogram. The semivariogram or variogram 𝛾(ℎ)
is used to describe the relationship of paired observations separated by a distance ℎ.It is a geostatistical technique that allows a quantitative measure of the long-range persistence in nonstationary time series Witt and Malamud [27], Haslett [28], and Dmowska and Saltzman [29].Correlations over time and space create patterns that can be used to describe the behavior of a set of observations.Mathematically, the variogram estimates the expected squared difference between neighboring random variables.This calculation is performed over different ℎ values.Given a time series or stochastic processes {  ,  ≥ 0}, the autocovariance function at the point (,  + ℎ) is defined as with [  ] as the mean of the process at time .The semivariogram (ℎ) is given by half of the variance of the difference between pairs of observations at different "locations" in time: where −1 ≤   (,  + ℎ) ≤ 1 is the autocorrelation function (the autocovariance function normalized).
In the special case when ( +ℎ ,   ) = 0, ∀(, ℎ), it is said that the stochastic processes {  ,  ≥ 0} are uncorrelated and the semivariogram is reduced to the arithmetic mean of the variance of processes at times  and  + ℎ: If the random field {  ,  ≥ 0} has constant mean = [  ] = [ +ℎ ] =  ∀, the semivariogram (1) adopts the simple form: If () and ( + ℎ) are independent random variables ∀, again, though for a different reason, the semivariogram is reduced to the special case (2).
In principle, given a stochastic process {  ,  ≥ 0}, the expected value of differences [ +ℎ −   ] at time  and with lag ℎ is empirically estimated by the average over a "large enough" ensemble of realizations or paths in time.However, for a single time series {  ,  = 1, 2, . . ., }, the expected value can be estimated assuming an ergodic hypothesis, that is, a statistical principle of equivalence according to which the average over time and the average over the ensemble are the same, Lefebvre [30].Thereby the differences  +ℎ −   , which would be obtained with an infinitely reproducible process, are "simulated" or "cloned" from the "mother series."Thus, the average value of differences  +ℎ −   is estimated by where (ℎ) is the number of differences with a lag ℎ.When ℎ = 1, 2, 3, . .., the averages (4) are, respectively, According to (5) for a maximum value of ℎ "relatively moderate" or ℎ/ < 1, except in the presence of isolated extreme outliers, the two summations in (5) are roughly of the same order, such that the empirical average value (4) can be approximated by [ +ℎ ] ≈ [  ] =  = constant.This is an observed characteristic in the time series of the pluviometric stations.Therefore, the corresponding estimator of ( 3) is simply

Measurement of the Hurst Exponent 𝐻.
To estimate the Hurst exponent from a temporal series {  }, with  ∈ 1, 2, . . ., , the series is divided in a group of -subseries of length .Really, the size  is an average number.A standard way, though not the only, to obtain the  size of the subseries is partitioning the original series in powers of base 2. In doing so, in each of the successive partitions, the approximate value of  is as follows: , /2, /2 2 , /2 3 , . ... For each subseries  = 1, 2, . . ., , do the following: (1) Calculate the mean   and the standard deviation   .
(3) Get the partial sums: (4) Calculate the range: (5) Normalize the range: (6) For each subseries of length  take the average: (7) Hurst [26] found the relation of the statistical ⟨/⟩  given by the following power law: where  is the Hurst exponent and  is a positive constant.Two factors involved in the determination of the Hurst coefficient are the way time series is divided into a group of subseries and the asymptotic behavior of the rescaled range.First, the range of values  are used to calculate the slope of log(⟨/⟩  ) given the relationship Second, the determination of  is the result of the asymptotic behavior of the rescaled range, that is, when the value  tends to infinity.The analysis of the rescaled ⟨/⟩  over some values of  is estimated using a log/log expression given in (13).To obtain  coefficient, the least-squares method is used.The slope of this line is the Hurst coefficient .This exponent is considered a fractal index, Mandelbrot and Wallis [31], and provides information about long-term correlations exhibited by a series of observations; for a theoretical review of the Hurst exponent, see Mandelbrot [32].In practice, the Hurst exponent can take values between 0 and 1, where (i) 0 <  < 0.5 indicates nonpersistency in a series; that is, an increment is more likely to be followed by a decrement and vice versa; (ii)  = 0.5 indicates lack of serial correlation (Gaussian white noise); (iii) 0.5 <  < 1 indicates persistency; that is, an increment is likely to be followed by an increment and a decrement by another decrement.

Results
To illustrate the procedure, only the analysis of three rainfall stations was selected to be presented in this section.The measured values of monthly precipitation in millimeters from the stations Apodaca, El Cuchillo, and La Boca are displayed in Figure 4, and results obtained taking into account all stations (following the same process) are presented at the end of this section.
As can be seen, patterns and relationships between different stations are difficult to assess only through "eyeball" analysis.However, when semivariograms are obtained, as shown in Figure 5, a footprint of the data becomes more evident.A closer inspection in every station shows a seasonal pattern that repeats every 12 observations in the semivariogram.This can be explained due to the fact that monthly observations were used in the analysis.
Once the semivariograms were obtained, the Hurst exponent for the series of the  values was calculated.It can be seen that Hurst coefficients are close to 1, which indicates a positive long dependency of the data in the semivariograms.Hurst exponents from all stations under analysis are presented in Table 1, where the long dependency in all variograms becomes clear.Some of the coefficients that appear in the table exceed the interval established of possible values of the Hurst exponent, 0 <  < 1; this is a known error due to estimation bias or a possible linear retrogression.
To address the issue of finding homogeneous samples, a cluster analysis is performed using estimates of the Hurst exponent.As shown in Figure 6, a histogram of frequencies Distributions were selected based on a goodness of fit analysis.After identifying a set of feasible distributions with  values bigger than a significant level of 0.05, in every case, the distribution with the highest average  value was selected for each station.Gamma and Generalized Extreme Value were the distributions that gave the best fit over the data analyzed.These functions are respectively.As a result of the analysis, estimated distributions do not mix within cluster.This input can be used in a posterior analysis to obtain maximum likelihood estimators of the distribution parameters using all data concurrently.

Conclusions
Variograms followed by analysis of / or Hurst exponent estimation were used as input of a cluster analysis.exponent provides a measure to determine if a time series is like a Gaussian white noise or has underlying trends, and it can be used to cluster the pluviometric stations according to the values of their semivariograms.As a case study to evaluate this approach, a sample of rainfall stations, those included in the Río Bravo-San Juan watershed from the hydrographic region RH-24 Mexico, was used in the analysis.Long-range dependency was found in every variogram evaluated with the Hurst exponent; however, it was still found useful as an input of a cluster analysis.A goodness of fit process was executed with every series, and the results showed that existing dominant distributions within a feasible set (found independently in each station) do not overlap over clusters.The probability distributions were found nested within each cluster.This is indicative that homogeneous patterns were identified within groups, and groups were heterogeneous between themselves.
The study of the rainfall stations with semivariograms and / analysis provides a powerful tool that allows practitioners to analyze long-term correlations and clustering in hydrological time series.In future work, -moments and spectral and wavelets analysis will be used to improve understanding of complex time series of pluviometric rainfall

Figure 1 :
Figure 1: The watersheds of Mexico with Google Earth.In black edge, the Río Bravo-San Juan watershed.Source of the database: http://www.conagua.gob.mx/.

Figure 5 :
Figure 5: Semivariogram (6) corresponding to the time series of rainfall measurements from the three stations of the Río Bravo-San Juan watershed shown in Figure 4.In each case the maximum lag ℎ max = 0.20 *  was used, with  indicating the size of the time series.For comparison, the semivariogram corresponding to Gaussian white noise is included.From top to bottom, respectively: Gaussian white noise, Apodaca (station number 19004), El Cuchillo (station number 19016), and La Boca (station number 19069).

FrequencyFigure 6 :
Figure 6: Histogram for the Hurst exponents of the analyzed pluviometric stations.

Table 2 :
Clustering of pluviometric stations by the Hurst exponent of the semivariogram.