Statistics of the Performance of Gridded Precipitation Datasets in Indonesia

Gridded precipitation datasets have been used as alternatives to rain gauge observations, but their applicability for a specific region should be thoroughly evaluated. This article aims at finding the most appropriate one for climatological and hydrological applications in Indonesia, by evaluating the statistics of the performance of eight different datasets (research products) having horizontal resolutions between 0.1 and 0.25 and with a time span of data availability from 2003 to 2015. The datasets are compared against the observed daily rainfall at 133 stations using 13 statistical metrics that can be classified into three groups with different characteristics of measurements, namely distribution, time sequence, and extreme value representations. By applying summation of rank (SR), it is found that MSWEP and TMPA 3B42 are the top two datasets that outperformed based on distribution and time sequence performance metric groups. The extreme performances for all datasets are still good in 75 th percentiles; however, the performances decrease at more than 75th percentiles indicating still a poorly representation of daily extreme rainfall for all gridded datasets. Results of this study suggest that MSWEP (v2) is presently the best gridded precipitation datasets available for climatological and hydrological applications in Indonesia.


Introduction
Climate variability at subseasonal, seasonal, interannual, and interdecadal timescales has potential societal impacts across the globe. In terms of agricultural production, for example, roughly one-third of the observed variations in global yield is caused by climate variability [1]. Furthermore, climate change is causing extreme weather events and climate anomalies to increase in both frequency and intensity [2], leading to greater risks for natural and human systems [3]. e risks are even higher for countries like Indonesia that are prone to natural disasters. During 1900 to 2011, 56% of the disasters that killed almost 241,000 people a ected about 28 million population at a cost of around US$ 24 billion are of hydrometeorological (climate-related) type [4]. erefore, accurate estimation of hazards and risks due to historical and projected climate anomalies is essential to make development and business plans more climate proof and adaptive to climate change.
High impacts of hydrometeorological disasters (drought, wild re, ood, and landslide) in Indonesia are associated with the excess or de cit of rainfall and more prevalent than other types of climatological disasters such as heat wave [5]. While climate change analyses using the top-down approach have been facilitated by downscaling of projected future precipitation under the WCRP CORDEX (the Program Coordinated Regional Downscaling Experiment sponsored by World Climate Research Program) for the Southeast Asia region [6], the feasibility of further quantitative impact studies depends on the availability of observational data to calibrate the model output. In this context, the availability and quality of baseline climate data are crucial to carry out both bottom-up and top-down climate change studies [7] at regional scales. However, long-term continuous rainfall observations in Indonesia are only available at a very limited number of locations. Jakarta, for example, is an exceptional location where 130-year records, from 1983 to 2012, of rainfall and temperature data are available [8,9]. Other than that, rainfall data vary in length, network density, quality, and consistency for different regions, making it difficult to assess climate hazards and risks associated with extreme events, even for regions with important sociopolitical contexts.
In recent decades, there have been efforts to develop globally gridded precipitation datasets by various research groups and institutions.
ose datasets vary in terms of purpose, data origin, area coverage, record length, as well as spatial and temporal resolution [10]. In any case, the availability of such precipitation datasets is potentially helpful for coping with the lack of rain gauge observations. In fact, gridded datasets such as Tropical Rainfall Measuring Mission (TRMM)-based precipitation products have been extensively used in various studies with main concerns on large-scale climatic features [11]. However, prior to their application to study climate impacts at regional scales, global precipitation datasets need to be evaluated to understand their advantages, limitations, and uncertainties [10,12]. Moreover, Indonesia's archipelago constitutes the largest part of the Maritime Continent (MC) where spatial variations of rainfall climatology are prominent due to complex land-sea distribution, topography, and strong influence of Asian-Australian monsoons [13].
Intercomparisons of global precipitation datasets have been conducted for monsoon and MC regions involving reanalysis and TRMM precipitation products [14][15][16]. ese studies used EOF analysis, correlation coefficient, and bias in comparing the datasets with observations, except [16] that focused on the relative differences among the products.
ere are also studies focusing on Indonesian regions and the validation of specific datasets such as TRMM [17] and GSMaP [18,19] using rain gauge data. An intercomparison of four precipitation datasets, that is, SA-OBS, APHRO-DITE, CMORPH, and TRMM, has also been conducted for performance evaluation against rain gauge data [20]. Another study focused on performance evaluation for a specific purpose to detect low rainfall for drought monitoring on three datasets, that is, TMPA 3B42RT, PERSIANN, and CMORPH [21], and another one for a specific region of Bali Island on three other different datasets, that is, GSMaP, IMERG, and CHIRPS [22]. ese studies have compared different, but still limited, number of datasets. Moreover, only a few performance metrics such as bias and correlation are used, except Liu et al. [22] who used more diverse metrics of continuous, categorical, and volumetric types. None of the studies compared the statistical distribution between precipitation datasets and observed data.
In this work, we performed a more comprehensive evaluation on eight precipitation products that are derived from rain gauges, satellite-based estimates, and their combinations (see Table 1). is study aims to find the most robust precipitation dataset for climatological and meteorological research and applications in Indonesia. We propose a multimetric approach [34,35] with a total of 13 metrics that can be classified into three groups of statistical measures against observations. e first group is for assessing data distribution: (1) mean (g), (2) standard deviation (SC), (3) coefficient of variance (CV), (4) PDF skill scores (SS), and (5) Kolmogorov-Smirnov test (KST). e second is for evaluating the relationships in sequential data pairs as time series: (6) Pearson correlation coefficient (r), (7) mean error (ME), (8) root mean square error (RM), (9) relative change (RC), (10) T-test (TT), and (11) Z-test (ZT). e third is for addressing the performance on extreme events detection: (12) fraction skill score (FSS) and (13) Anderson-Darling test for the 75th, 90th, and 98th percentiles. For overall performance, we apply a summation of rank (SR) to all metrics used in all groups of scores and select the top one dataset. As a reference, we employed rain gauge data of 133 meteorological stations belonging to the Indonesian Meteorological, Climatological, and Geophysical Agency (BMKG) observed from 2003 to 2015 (see Figure 1).

Gridded Precipitation Datasets.
e summary of eight primary datasets used in this study is presented in Table 1.
ese datasets are research products with high-latency data transfer (in the order of several months) and generally derived from combination of rain gauge, satellite, and reanalysis data. It should be noted that all datasets have daily temporal resolution, but the spatial resolutions are 0.25 for five (CHIRPS, CMORPH-CDR, GFD, PERSIANN-CDR, and TMPA) and 0.1 for the other three (GSMaP RNL, GPM-IMERG and MSWEP) datasets. Comparisons between gridded and observed station precipitations were performed for the station locations by interpolating the gridded data, using the "nearest neighbor" method. In addition to the eight primary datasets, we also analyzed other five gridded datasets that have different specifications (see the Discussion section).

Observation Datasets.
e observation datasets used as a reference employs rain gauge data from 133 meteorological stations in Indonesia from 2003 to 2015 ( Figure 1). ese periods overlap the years between observation and all precipitation datasets that were being compared. e rainfall data are the same observed daily precipitation dataset that was used in Supari et al. [9] up to 2012, with additional stations and time periods. e same quality control analysis described in Supari et al. [9] was applied, consisting of checking for gross errors, missing values, outliers, and overall data homogeneity. e accumulation of daily precipitation measured at 07.00 local time assigned the date of that day's precipitation data regarding the guidelines of the Indonesia Meteorological Service (BMKG) [36]. e intercomparison between precipitation datasets and observation rain gauge in this study was carried out without a day shift (for more details, please refer to Van den Besselaar et al. [20]).

Metrics of Distribution-Based
e mean (or average) is the measure of the central tendency for both discrete and continuous data.

Advances in Meteorology
Given μ o the time mean, σ o the standard deviation of the observation, and μ m the time mean of a precipitation dataset with the same period, we de ne the performance metric where n g is a scale factor taken as equal to 1 [34,35]. e maximum value of g 1, and g < 0 if the di erence between the time mean of rainfall dataset and observation is greater than ng multiplied by σ o . e performance indices for this metric were calculated for daily (g_d), monthly (g_m), seasonal (g_DJF, g_MAM, g_JJA, and g_SON), and annual (g_a) timescales. Herein, DJF, MAM, JJA, and SON are the months of December-January-February, March-April-May, and September-October-November, respectively.

Standard Deviation (SC).
Standard deviation measures the spread of data distribution. e metric SC is a normalized quantity to represent the comparison between the spreads of evaluated datasets and the referent, given by where σ o and σ m are the standard deviations observation of gridded precipitation datasets, respectively, so that SC 1 is being the perfect skill [34,35]. As with g, the performance indices for SC were calculated for daily (SC_d), monthly (SC_m), seasonal (SC_DJF, SC_MAM, SC_JJA, and SC_SON), and annual (SC_a) timescales.

Coe cient of Variance (CV).
e metric CV is a normalized measure of dispersion. Given CV o and CV m are the coe cients of variation for the observation and precipitation dataset [34,35], the performance index is calculated as for daily (CV_d), monthly (CV_m), seasonal (CV_DJF, CV_MAM, CV_JJA, CV_SON), and annual (CV_a) timescales.

PDF Skill Score (SS).
e probability density function (PDF) skill score (SS) is calculated using samples of rain days (days with precipitation >0.5 mm) [36]. e SS compares the PDF observed and gridded precipitation datasets by the following formula [37]: where f k m and f k o are relative frequency of occurrence of a value in the k th bin belonging to the histograms of the dataset and observation, whereas Nb is the number of bins used to calculate the empirical PDF. If SS 1, the precipitation dataset perfectly simulates the observed PDF [35,37].

Kolmogorov-Smirnov Test (KST).
e KST test is like SS but for ECDF (empirical cumulative distribution function). Given F o (x) and F m (x) are ECDFs of the observed data and a precipitation dataset, the KST performance index is calculated by with D KS is the maximum absolute difference between ECDF of the two different datasets.
e metric values are normalized and bounded by 1 as the perfect skill [34,35].

Metrics of Time Sequence
2.4.1. Normalized Mean Error (ME). Normalized mean error (ME) is to measure value-to-value differences between the two time series calculated as where m i and O i are the i th data of the time series of the gridded dataset and observation, and N is the number of data records [38]. Calculations of ME are at daily (ME_d), monthly (ME_m), annual (ME_a), and seasonal (ME_DJF, ME_MAM, ME_JJA, and ME_SON) time sequences with ME � 1 means the perfect skill.

Normalized Root Mean Square Error (RM).
Root mean square error, or RMSE, is a common measure to quantify the difference between the two time series. For two time series p(t) and f(t) with n data records, it can be calculated as follows: Herein, we use normalized RMSE [35], which is expressed as follows: where n g and σ i o are the scale factor and observation standard deviation relevant to the index i of interest related to daily (RM_d), monthly (RM_m), annual (RM_a), and seasonal (RM_DJF, RM_MAM, RM_JJA, and RM_SON) time sequences.

Relative Change (RC).
e RC metric is only applied for annual rainfall P by calculating changes in two consecutive years as RC index for the whole data period (years) is then calculated as the mean difference of C i for the observation and precipitation datasets using equation (1) [34,35].

Pearson Correlation Coefficient (r).
e Pearson correlation coefficient (r) of the sequential time series for every point of observation and the corresponding grid cell of a precipitation dataset is computed using the following equation: where x and y are the variables of observation and the precipitation dataset with the mean x and y, and n is the degree of freedom of the variables [35]. e calculations of r are at daily (r_d), monthly (r_m), annual (r_a), and seasonal timescales (r_DJF, r_MAM, r_JJA, and r_SON).

Z-Test (ZT)
. Z-test compares the significant difference between the mean values of observation and a precipitation dataset taking into account the difference in sample size.
where x m , σ m , and n m are the mean, standard deviation, and sampling size of the dataset, respectively; x o , σ o , and n o are the mean, standard deviation, and sampling size of observation, respectively. e test P value of this statistic is approximated using the standard Gaussian distribution at 95%. P value <0.05 means the average of the precipitation dataset is significantly different with observation. e score of ZT is the number of stations with an insignificant P value (≥0.05) divided by the total number of stations [35]. ZT calculations are at daily (ZT_d), monthly (ZT_m), seasonal (ZT_DJF, ZT_MAM, ZT_JJA, and ZT_SON), and annual (ZT_a) timescales.

T-Test (TT).
e metric TT is computed the same way as the Z-test but using Student's t distribution [35].

Metrics of Extreme Value Representation.
To evaluate the performance of precipitation dataset for extreme value representation, two metrics are used: fraction skill score (FSS) and the Anderson-Darling Tests (ADT). e FSS uses the forecast verification approach to evaluate the detectability of moderate-to-heavy rainfall events [39], where rainfall data that exceed a threshold transform into a binary number of 1, otherwise it is 0 following the formula FBS represents the differences of mean squares between the referent (O i ), and the precipitation dataset (F i ) on each grid computed as

Advances in Meteorology
where N is the amount of data, while FBS w is the largest FBS that can be obtained. FSS ranges between 0 (no skill) and 1 (perfect skill). Although FSS can be calculated with absolute thresholds, in this study it is defined relatively to the 75th (p75), 90th (p90) [40], and 98th (p98) percentiles of each dataset.
A modified version of Anderson-Darling [41,42] was applied to test the differences between the precipitation dataset distribution with reference data, using the following formula: A represents the distribution of daily precipitation from the dataset that reproduces the distribution of daily observations concerning moderate-to-heavy rainfall (using the same thresholds as FSS). Smaller values of A indicate a similarity between the two distributions of daily precipitation at a 95% significance level. Let X and Y be an n and m sample with the empirical cumulative distribution function (CDF) of F and G. H denotes a measure determined by the weighted average of F and G, N � n + m. e rainfall distribution dataset is significantly different from the observed rainfall when P values <0.05. e score values were normalized value of A and bounded by 1 as the perfect skill (ADT).

e Summation of Rank.
e metrics were applied based on point-to-grid comparisons since the observation data as referents were at point locations, which may affect the representativeness of the values being compared. However, Tan et al. [43] pointed out that results of point-to-grid comparison are substantially similar to the grid-to-grid comparison. In this work, we use two scoring schemes for indexing the precipitation dataset performance with the distribution and time sequence metrics: the "ratio" scheme, which summates the station index of more than 0.5 divided by the total number of stations; and the "mean" scheme, which averages the entire index from all stations. ese indices and scores measure statistical performance at daily, monthly, seasonal, and annual timescales. However, the performance scores for extreme value representation were only calculated with the "mean" scheme. e summation of rank (SR) [34,35,44] method is used to summarize and quantify the total score of all metrics.

Distribution-Based Performance.
Scores for five performance metrics in the data distribution group, namely g, SC, CV, SS, and KST, are presented as heatmaps in Figure 2. It should be noted that the scores are calculated as an aggregate of all validated points in Figure 1, with two representation schemes, that is, "ratio" (Figure 2(a)) and "mean" (Figure 2(b)) as previously explained. e number of samples for CMORPH-CDR, GPM-IMERG-F, GSMaP_RNL, GFD, and TMPA 3B42 datasets are 133 or sampled at the 133 validating points. However, PER-SIANN-CDR, MSWEP, and CHIRPS had slightly fewer samples (128, 131, and 131, respectively) because some grids do not enclose observational data.
It can be seen from Figure 2(a) that, based on the ratio scoring (summation) scheme, MSWEP has the highest scores followed by TMPA 3B42 and GPM-IMERG-F. e MSWEP also has the highest scores with the mean scoring scheme, followed by TMPA 3B42 and CMORPH-CDR (Figure 2(b)). It is of interest to note that the metric gshows relatively low scores for seasonal and annual, in comparison to daily and monthly, timescales. Seasonally, the g scores are worse for DJF and MAM than those for JJA and SON. At most stations, the JJA and DJF periods correspond to dry and rainy season, respectively. e spatial distribution of SR distribution-based performance at all 133 stations in Indonesia can be seen on Figure S1 in the Supplementary Material (SM).

Time Sequence Performance.
Time sequence performance for each dataset was evaluated using six metrics, namely ME, RM, RC, r, TT, and ZT, and are presented as 17 Figure 2: Heatmaps of distribution performance metrics by ratio (a) and mean (b) with corresponding SR. heatmaps in Figure 3. As shown in Figure 2, the heatmaps in Figures 3(a) and 3(b) correspond to the results of the ratio and mean scoring schemes.
e SR scores show that the MSWEP dataset consistently appears in the topmost rank, along with the TMPA 3B42 in the second rank. On the other hand, the CHIRPS now emerges as the third-ranking dataset overperforming GPM-IMERG-F and CMORPH-CDR. In contrast to the distribution-based performance, the metrics for time sequence performance are worse with daily comparing to monthly, seasonal, and annual timescales. us, in general, longer temporal aggregates improve the time sequence performance of the datasets.
It is notable from Figure 3 that RM or normalized RMSE metric groups show the worst scores at all timescales. Most datasets have RM scores less than 0.5 (0.3) for the ratio (mean) scoring schemes (Figures 3(a) and 3(b)). ere are even negative scores, which indicate that the RMSE of dataset precipitation is larger than 1 standard deviation of the observation [35]. Negative values also appear in the normalized mean error (ME) scores at daily timescale, with the mean scoring scheme, except that of TMPA 3B42. In general, the mean scoring scheme tends to produce lower scores for all metrics in Figures 2 and 3. e spatial distribution of time sequence SR for all datasets at 133 stations as seen on Figure S2 in SM shows randomly distributed and lower values than the SR of the distribution performance.

Extreme Value Representation.
e two metrics for extreme value representation, that is, FSS and AD can be calculated using both percentiles and absolute thresholds. However, considering that the distributions of extreme values can be significantly different among datasets ( Figure S3 in the SM), we calculated the scores only based on three percentiles, that is, p75, p90, and p98, and show the results in Figure 4. In contrast to the results of two previously discussed performance rankings, Figure 4 shows that the MSWEP and TMPA 3B42 datasets are in the bottom three with the SR scores and only perform better when compared to CMORPH-CDR. In this case, the PERSIAN-CDR is in the top rank followed by GFD and CHIRPS. Figure S3 in SM shows the empirical CDF of all stations average ( Figure S3(a)) and five stations in Sumatra, Borneo, Java, Sulawesi, and Papua (from Figure S3(b) to S3(f )).
In general, all scores in this category drop drastically with higher percentiles. e ADT scores for p75 are the highest but those for p98 are the lowest among all scores. On the other hand, the scores of FSS decrease more gradually with p75, p90, and p98. ese results indicate that extreme precipitations are still poorly represented in the globally gridded datasets, bearing in mind that daily precipitation values are being compared.

Discussion
Based on SR scores, MSWEP and TMPA 3B42 are the top two datasets that consistently outperform the others for both distribution-based and time sequence performance metric groups. It should be noted that MSWEP combines several datasets including other gridded data, such as CMORPH and TMPA 3B42, and station (GHCN) data. MSWEP data processing involve bias and frequency correction, CDF matching, improvement in the peak attenuation of rain distribution, weighted combination of data sources, and refining the spatial resolution to 0.1 [33]. Results of a previous study [10] that compared 22 precipitation datasets also show that MSWEP had the best performance in temporal correlation with rain gauge observations and calibration scores for hydrological model applications. However, Figure 3 clearly shows that the scores of time sequence performance metrics are low at daily resolution. Longer temporal aggregation is likely needed to improve the representativeness of the datasets.
In contrast, statistical performances of MSWEP and TMPA 3B42 datasets are low in extreme value representation. e averaging process when combining several data sources could have smoothing effects that eliminates real extreme values. A study by Hamada et al. [45] shows that maximum near-surface rain rate from TRMM PR data surpasses 50 mm per hour only at percentiles higher than 90. erefore, gridded precipitation datasets may still be useful for qualitative studies of extreme events, but additional efforts should be needed in more quantitative applications.
Considering the variations in rainfall climatology of Indonesian region [46], spatial variation of performance indices might also be of user's concerns. Maps of SR scores for all 133 validating stations can be found in the SM 18  Advances in Meteorology ( Figure S1). In general, the spatial distribution of SR shows random patterns. However, relatively lower scores tend to concentrate over the northern part of Sumatra Island. is area is characterized by a bimodal annual rainfall and mountainous topography.
So far, we have discussed the statistical performance of eight gridded precipitation datasets that are categorized as (high latency) research data. Some previous studies include rain gauge-based datasets, as well as near real-time (low latency) satellite products in their analysis. For more comparisons, we applied the same procedures of scoring and ranking to five additional datasets, that is, SA-OBS, APH-RODITE, CMORPH-Raw, GPM-IMERG-Early Run, and GSMaP NRT (see Table S4 in the SM), and summarize the results in Table 2. By comparing the SR scores with those in Figures 2 and 3, the performance of MSWEP is comparable with that of rain gauge-based SA-OBS, whereas APHRO-DITE does not show good performance. On the other hand, from three low latency satellite products, GPM-IMERG Early Run show the best statistical performance comparable to MSWEP in terms of SR scores.
is study adopts a multimetric approach [34,35] to evaluate climate models. We use a combination of standard continuous and categorical verification statistics [47] as quantitative measures to assess the accuracy of the rainfall estimation amounts and occurrence of the gridded precipitation dataset. We apply continuous type metrics for the performance of data distribution comprehensively and time sequences in data pairs from time to time between the rainfall of the dataset and the observed data. In addition, categorical type metrics were employed to evaluate the representation of extreme events with a threshold value. e summation of ranks diagnoses the rank of precipitation dataset performance to decide the best performance robustly. Previous studies [10,21,22] used fewer metrics without rank scoring and a greater focus on biases and correlation in comparison. is study compares more of the statistical distribution for all rainfall data and extreme rain days. e weakness of this study lies in not considering physical factors when comparing the rainfall of precipitation datasets and observed data, such as comparisons with altitude differences [22], while regional influences of monsoon [35] were only briefly discussed. is research is still purely statistical analysis with the quantitative evaluation of the "value-to-to-value" between gridded precipitation datasets and point-based rain gauge stations. Nonetheless, our results could be informative for those who need to use gridded precipitation datasets, especially for climatological and hydrological applications in the Indonesian region.

Conclusions
e performance and reliability of eight gridded precipitation datasets, namely CHIRPS v2.0, CMORPH-CDR v1.0,  GFDv3, PERSIANN-CDR v01r01, TMPA 3B42v7, GSMaP_RNL V06, GPM-IMERG V06 (Final Run), and MSWEPv2, were compared with rain gauge station observations for daily, monthly, seasonal, and annual timescales in the period of 2003-2015. e findings of this study can be summarized as follows: (i) A multimetric approach of 13 metrics is grouped into three: data distribution, time sequence, and extreme value representation. e application of summation of rank deals with the ranking of all datasets for every performance metric and quantifies the scores of all metrics for diagnosing and deciding the best performing dataset.
(ii) e results show that MSWEPv2 is the best product, followed by TMPA 3B42 for daily, monthly, seasonal, and annual precipitation in comparison with rain gauge data based on summation of rank. e extreme performance of all gridded precipitation datasets is low in more than 75th percentile of daily rainfall. is study implicates for the application of climatology and hydrology in the Indonesian region using gridded precipitation datasets.
Data Availability e meteorological station data are available in BMKG or Indonesia Agency for Meteorology, Climatology, and Geophysics, which can be accessed at https://dataonline. bmkg.go.id/home. Precipitation datasets used for this study are included within the paper.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this article. Acknowledgments e authors acknowledge the Indonesia Agency for Meteorology, Climatology, and Geophysics for facilitating the data of rainfall stations. TW drafted the initial manuscript and statistical data processing. TWH revised and improved the overall manuscript including the storyline and discussion. AS and LMH performed further revisions of the manuscript. All the authors were involved in discussions during the review process and all the authors have read and approved the final manuscript.
is manuscript contain materials from a dissertation by TW, as a part of the fulfillment to obtain a doctoral degree at Bandung Institute of Technology. e authors would like to express their gratitude to the Indonesia Endowment Fund for Education (LPDP) for funding this research (Grant Number 201812210213600). e second author (TWH) was partially supported by ITB-P3MI research grants.