An Assessment of GCM Performance at a Regional Scale Using a Score-Based Method

A multicriteria score-based method was developed to assess the performances of 18 general circulation models (GCMs) in the study region from 1970 to 2005. The results indicate the following. (1) GCMs simulate temperature better than rainfall. The simulated temporal and spatial distributions of temperature agreed well with the observations, whereas the simulated spatial distribution of precipitation agreed poorly. Most of the GCMs underestimated temperature and overestimated precipitation. (2) The Grubbs test was used to detect anomalous moving changes in the rank score (RS) results; the inm-cm4 and ipsl-cm5b-lr models were rejected when simulating temperature, while the bnu-esm and canesm2 models performed poorly when simulating precipitation. (3) Adding or removing any criterion does not significantly influence the RS result, which indicates that the multicriteria score-based method is robust. The advantages of using the multicriteria score-based method to assess GCM performance were demonstrated, and this method provides a more comprehensive assessment than the single-criterion method. The criteria in the multicriteria method can be replaced according to the research requirements, and the method can be easily extended to different study regions; the results can be used for better informed regional climate change impact analyses.


Introduction
General circulation models (GCMs) are the most common tools for projecting future climate change. Errors and uncertainties in GCM metadata range in severity and can result in an inability to simulate observed meteorological events. GCM simulations are often characterized by biases and uncertainties that limit their direct application [1]. Different forcing scenarios, GCMs, and subgrid-scale forcings and processes cause uncertainties; the resulting simulations contain an abundance of information, but a large amount of work is required to identify the useful information, which limits GCM applications [2]. Despite continuous efforts to improve GCM simulation performance, the application of assessment methods remains essential for climate change impact studies [3].
To improve the accuracy of GCM applications, GCMs have been assessed in many studies [4][5][6]. These assessments emphasize various aspects of GCMs according to their different applications. For example, in one study, where a long-term climate change analysis was the main focus, the assessment of GCM performance before application focused only on the long-term temporal and spatial distribution simulations. However, the drawback of this assessment was that a single criterion could only describe the temporal or spatial performances of the GCMs and may not meet the other requirements of the study [4]. A more comprehensive understanding of the advantages and disadvantages of GCMs is possible when more criteria are included in a GCM assessment.
To date, no assessment method used in the study of GCMs has been widely accepted, and assessing the performance of GCMs before using them is becoming an important issue. In this paper, a multicriteria score-based method was analyzed, and the performances of all GCMs were quantitatively calculated and examined. We studied this method with the aim of comprehensively and accurately evaluating the performances of the GCMs. The outline of the paper is as follows: the data and methods are presented in Sections 2 and 3, respectively. Section 4 describes the performance of each GCM; the GCM simulations of temperature and precipitation are evaluated in the study region. The concluding remarks are provided in Section 5.

Study Region and Dataset
The performances of the GCMs in the Yellow-Huai-Hai region were assessed in this study. The Yellow-Huai-Hai region, which is located in north-central China between 30° and 42.5°N and 90° and 122.5°E (Figure 1), has the largest fluvial plain in China. Most parts of the study area are semiarid and semihumid (i.e., the Yellow River and Hai River basins, respectively), and only a small part of the region in the southeast of the study area has a humid climate (the area covering the Huai River basin). The Yellow-Huai-Hai region is an agricultural breadbasket and a prime urban and industrial region in China. This region, therefore, plays an important role in the social and economic development of the country. Thus, the consequences of climate change seriously restrain economic growth [7,8].
All GCM data are from the fifth phase of the Coupled Model Intercomparison Project (CMIP5), which is the most important tool for analyzing future climate change.
High-quality temperature and precipitation data were derived from the daily dataset on China's surface climate (V3.0) for the period 1970-2005 provided by the National Meteorological Information Center. These data are based on gauged data from 128 meteorological stations (Figure 1) and have been quality controlled, with an accuracy of nearly 100%; for more details, see http://data.cma.cn/data/cdcdetail/dataCode/SURF_CLI_CHN_MUL_DAY_V3.0.html. To effectively assess the performances of the GCMs, the daily data observed by the meteorological stations were aggregated to monthly data and interpolated to 2.5° × 2.5° cells using the inverse distance weighted method. The hollow circles in Figure 1 mark the locations of the GCM grid points, and the grid points marked with black circles were selected for the assessment in this study.
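The interpolation step can be sketched as follows. This is a minimal illustration of inverse distance weighting, not the authors' implementation: the source does not state the distance metric, the power parameter, or any search radius, so the sketch assumes planar distance in degrees, a power of 2, and the use of all stations. The function name `idw_interpolate` and the example coordinates and values are hypothetical.

```python
import math

def idw_interpolate(stations, target, power=2.0):
    """Inverse distance weighted estimate at target = (lon, lat).

    stations: iterable of (lon, lat, value) tuples.
    Assumptions (not from the source): planar distance in degrees,
    power = 2, and every station contributes to the estimate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lon, lat, value in stations:
        dist = math.hypot(lon - target[0], lat - target[1])
        if dist == 0.0:
            # The target coincides with a station: return its value directly.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical monthly values at two stations equidistant from a cell centre.
example_stations = [(110.0, 35.0, 10.0), (112.0, 35.0, 20.0)]
cell_value = idw_interpolate(example_stations, (111.0, 35.0))
```

With equidistant stations the weights are equal, so the estimate reduces to the plain average of the two station values.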

Methods
In this study, a multicriteria score-based method was developed to assess the performances of GCM simulations at a regional scale.
The criteria included mean annual data, standard deviation, annual climate cycle, normalized root mean square error (NRMSE), spatial distribution, climate change trend, empirical orthogonal function (EOF), and probability density function (PDF); these criteria are listed in Table 1.
In the assessment, each individual criterion is scored with a rank score (RS) between 0 and 9. The RS is computed from $x_i$, where $x_i$ represents the relative error (RE) between the $i$th GCM result and the observation, or the related statistical value for the $i$th GCM. For the RE, a larger $x_i$ yields a larger RS in the GCM performance assessment; thus, if a GCM effectively simulates an observation, then the RS is small. The total RS for each GCM is the weighted sum of the RSs for all criteria. This RS method is used to describe the degree of fit between the observed and simulated sequential statistical characteristics. According to the fitting results, each GCM was assigned a score between 0 and 9 to assess its performance. The RS does not represent the actual simulation accuracy of a specific model but is suitable for comparing the performances of different GCMs. Criteria that serve the same statistical purpose have weights of 0.5 each during this summation (Table 1): the Mann-Kendall (M-K) test statistic (Z) and trend magnitude (β), which are criteria for trend analyses; EOF1 and EOF2, which are criteria for EOF analyses; and the Brier score (BS) and significance score (Sscore), which are criteria for PDFs (described later). The other individual criteria have weights of 1.0.

The RE was used to quantify the similarity between simulated and observed values for long-term monthly means and standard deviations, where $X_{Gi}$ and $X_{Oi}$ represent the simulated and observed data of the time series, respectively, and $n$ represents the length of these samples (432 months from 1970 to 2005).

The GCM performances for time series are evaluated by the NRMSE [11,12]. Based on historical data, $X_{Gi}$ and $X_{Oi}$ represent the GCM and observation results at historical time $i$, respectively; $\bar{X}_O$ represents the mean of the observations; and $n$ represents the length of the time series. The advantage of the NRMSE is that it accounts for both the mean and the standard deviation of the predictor. The NRMSE is essentially the root mean square error divided by the standard deviation of the corresponding observations. The lowest value of the NRMSE is always associated with the best results, and this lowest value is reliable for determining the best simulation. The NRMSE ranges from 0 to positive infinity, where 0 indicates perfect agreement between the GCM data and the reference data.
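The scoring steps described above can be sketched in a few lines. The exact RS equation is not legible in this text, so the sketch assumes a min-max scaling onto 0-9, $RS_i = 9\,(x_i - x_{\min})/(x_{\max} - x_{\min})$, which matches the stated range and the statement that a larger $x_i$ gives a larger RS. The NRMSE follows the verbal definition (root mean square error divided by the standard deviation of the observations), assuming the same $1/n$ normalization in the numerator and denominator. All function names are hypothetical.

```python
import math

def rank_scores(errors):
    """Map per-GCM error values x_i onto rank scores in [0, 9].

    Assumption: min-max scaling, so the smallest error scores 0 and the
    largest scores 9. The source only states the 0-9 range and that a
    larger x_i gives a larger RS.
    """
    lowest, highest = min(errors), max(errors)
    if highest == lowest:
        return [0.0 for _ in errors]
    return [9.0 * (x - lowest) / (highest - lowest) for x in errors]

def nrmse(simulated, observed):
    """Root mean square error divided by the standard deviation of the observations.

    Assumption: both terms use the same 1/n normalization, which cancels.
    """
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((g - o) ** 2 for g, o in zip(simulated, observed))
    squared_deviation = sum((o - mean_obs) ** 2 for o in observed)
    return math.sqrt(squared_error / squared_deviation)

def total_rank_score(scores_by_criterion, weights):
    """Weighted sum of the per-criterion rank scores of one GCM."""
    return sum(weights[name] * score for name, score in scores_by_criterion.items())
```

Under this sketch, paired criteria such as Z and β would each enter `weights` with 0.5 and the remaining criteria with 1.0, as listed in Table 1.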
The correlation coefficient of the annual cycle was calculated between the observed and modeled long-term monthly mean values. For the spatial distribution, the correlation coefficient was calculated between the observed and modeled long-term means for each individual grid cell.
The M-K test and trend magnitude method were applied to determine the long-term monotonic annual trends and quantify their magnitudes [13]. The rank-based nonparametric M-K test statistic (Z) for the climate variables in the GCMs and observations was estimated by

$$Z = \begin{cases} \dfrac{S - 1}{\sqrt{\operatorname{Var}(S)}}, & S > 0, \\ 0, & S = 0, \\ \dfrac{S + 1}{\sqrt{\operatorname{Var}(S)}}, & S < 0, \end{cases}$$

where

$$S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \operatorname{sgn}(x_j - x_i), \qquad \operatorname{Var}(S) = \frac{n(n-1)(2n+5) - \sum_t t(t-1)(2t+5)}{18},$$

$x$ represents the time series of the annual climate variable, $n$ represents the number of years, $t$ represents the extent of any given tie (the number of equal values in the tie), and $\sum_t$ denotes the summation over all ties. The trend magnitude β, i.e., Sen's slope, which was proposed by Sen [14] and developed by Hirsch et al. [13], is defined as

$$\beta = \operatorname{Median}\left(\frac{X_j - X_i}{j - i}\right),$$

where $1 \le i < j \le n$. The slope estimator, β, is equal to the median over all possible combinations of pairs for the whole dataset [7], and $X$ represents the time series of the variable to be assessed in the study. Because Sen's slope is based on pairwise differences within the time series, it can limit the adverse influence of missing data on the analysis. The RE in Equation (2) was used to assess how close the values of Z and β for each GCM are to the observed values.
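The trend criteria can be illustrated with a short sketch that uses the standard forms of the M-K statistic and Sen's slope. It assumes a complete annual series without missing values and treats a tie as a group of equal values regardless of position; the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median

def mann_kendall_z(series):
    """Standardized Mann-Kendall statistic Z with the tie correction in Var(S)."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    # Each group of t equal values contributes t(t-1)(2t+5); untied values give 0.
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in Counter(series).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18.0
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0

def sens_slope(series):
    """Median of the pairwise slopes (x_j - x_i) / (j - i) over all i < j."""
    n = len(series)
    slopes = [(series[j] - series[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return median(slopes)
```

For a two-sided test at the 0.05 significance level, a trend is significant when |Z| exceeds 1.96; this is the threshold against which the observed Z values reported below are judged.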
An EOF analysis was used in this study to compare the spatial distribution differences between the modeled climate variables and observations [15]. An EOF can identify and quantify the spatial structures of correlated variabilities [16].
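For reference, the explained-variance percentage that this criterion compares follows the standard EOF definition. The notation is introduced here for illustration and is not taken from the source: $\lambda_k$ denotes the $k$th eigenvalue of the covariance matrix of the space-time anomaly field, and $M$ is the number of modes.

```latex
\mathrm{EV}_k = \frac{\lambda_k}{\sum_{m=1}^{M} \lambda_m} \times 100\%
```

Under this definition, the EOF1 and EOF2 criteria correspond to $\mathrm{EV}_1$ and $\mathrm{EV}_2$, computed separately for the observations and for each GCM.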
The two leading modes are selected in this assessment since they account for a majority of the total variance.

The BS and Sscore were used to assess the PDFs of the monthly climate variables in the GCMs:

$$\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} \left(P_{mi} - P_{oi}\right)^2,$$

$$S_{\mathrm{score}} = \sum_{i=1}^{n} \min\left(P_{mi}, P_{oi}\right),$$

where $P_{mi}$ and $P_{oi}$ represent the simulated and observed probability values in the $i$th bin, respectively, and $n$ represents the number of bins. According to the data ranges, we set the number of bins to 100; thus, we divided the data range into 100 equal parts and then calculated the probability in each bin. In this study, the BS represents the mean square error measure for probability forecasts [17,18], and the Sscore represents the cumulative minimum value of the observed and simulated distributions over all bins, which quantifies the overlap between the observed and simulated data [19,20]. Therefore, a lower BS and a higher Sscore indicate better GCM performance.
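The two PDF criteria can be sketched as follows. The sketch assumes that the observed and simulated values are binned over a shared range, so that bin $i$ means the same interval for both, and that the range is not degenerate (`hi > lo`). The helper `bin_probabilities` and all names are hypothetical.

```python
def bin_probabilities(values, lo, hi, n_bins=100):
    """Fraction of values falling in each of n_bins equal-width bins on [lo, hi].

    Assumption: lo <= v <= hi for every value and hi > lo.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        index = min(int((v - lo) / width), n_bins - 1)  # v == hi goes to the last bin
        counts[index] += 1
    return [c / len(values) for c in counts]

def brier_score(p_model, p_obs):
    """Mean squared difference between simulated and observed bin probabilities."""
    return sum((pm - po) ** 2 for pm, po in zip(p_model, p_obs)) / len(p_obs)

def s_score(p_model, p_obs):
    """Overlap of the two distributions: sum of the bin-wise minimum probability."""
    return sum(min(pm, po) for pm, po in zip(p_model, p_obs))
```

Identical distributions give a BS of 0 and an Sscore of 1 (100%), while distributions with no common bins give an Sscore of 0.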

Assessment of Temperature.
Table 2 includes the assessment results of the GCM performance for temperature in the Yellow-Huai-Hai region. The observed mean temperature during the historical period in the study region is 8.49 °C, while the temperature simulated by the GCMs is 3.62-8.09 °C for the same period. Most GCMs underestimate the mean temperature by approximately 2 °C. The standard deviation of the observations is 0.53 °C, and most standard deviations of the GCMs are in the range 0.4-0.6 °C. The NRMSE is used to compare the difference between the observations and simulations; therefore, if sets of data have very similar means and standard deviations, a smaller NRMSE indicates a better simulation. For the monthly mean temperature, the best NRMSE occurs with the mpi-esm-lr GCM (0.16), while the NRMSE for the ipsl-cm5b-lr model was the largest of the GCMs. The simulated monthly distribution of the annual climate cycle for each GCM was similar to that of the observed data, as shown by the correlation coefficients (all values are larger than 0.995). Consequently, the correlation results for the monthly distribution of the annual cycle were rounded to 1.
The correlation coefficients of the spatial temperature distribution between each GCM and the observations were also larger than 0.9.
The simulated spatial temperature had a distribution similar to that of the observations, where the temperature increased from west to east; the temperature was lowest in the source region of the Yellow River and highest in the southern region of the Huai River basin (Figure 2).
According to the results of the M-K analysis in Table 2, the temperature in the Yellow-Huai-Hai region has increased over the past 36 years.Most GCMs show an increasing trend in temperature, excluding the giss-e2-h model.
The performances of the different GCMs in simulating the change trend differ. The Z value in the M-K test for observed temperature is 4.81, which means that the observed mean temperature increases significantly at the 0.05 significance level. However, the Z values for most GCMs are between 1.13 and 4.59 (excluding giss-e2-h and canesm2), which indicates that most GCMs underestimate the temperature change trend in this region. The trend magnitude, β, via Sen's slope shows similar results.
The results of the EOF analysis of spatial temperature show that the first and second EOF modes for the observed monthly temperature account for 98.9% and 0.51% of the total variance (Table 2), respectively. The first explained variance of the GCMs ranges between 96.99% and 98.63%, while the second explained variance ranges between 0.55% and 1.23%.
This result evaluates the GCMs' performance by using only the two explained variance values and indicates that all GCMs simulate the variability well. According to the EOF results, all the GCMs perform well in terms of the physical process of temperature variability. It should be noted that there are certain special cases where the simulated spatial patterns differ from the observed patterns even though their explained variances are similar. However, this situation is relatively rare and is, therefore, not discussed in this study.
The empirical cumulative probability distributions (Figure 3) for monthly mean temperature simulated by most GCMs are quite close to the observations (excluding the inm-cm4 and ipsl-cm5b-lr models, which underestimate the ensemble temperature in the Yellow-Huai-Hai region). The results of the Sscore and BS across all 53 selected grid points are presented in Figure 4. The variations in the scores across all 53 grid points imply spatial differences. A high Sscore with a relatively low BS indicates excellent GCM performance in terms of the probability distributions in the grid points. The mean Sscores in the grid points of most GCMs are over 80%. The results of the ipsl-cm5b-lr model, which has larger BS and smaller Sscore values, are consistent with the empirical cumulative probability plots. The BS and Sscore of each model differ between grid points, reflecting the spatial variability of climatic elements. For example, in some GCMs, the Sscore in some grid points is over 90% and the BS is close to 0, which means that the probability density distribution of the GCM temperature in these grid points is very similar to that of the observations. However, in other grid points the Sscore does not exceed 50% and the BS is high, which indicates that the probability distribution of temperature simulated by these GCMs agrees poorly with the observations in these grid points.

By using the RS assessment, the performances of the GCMs have been evaluated and the final score of each model has been calculated. The ccsm4 model ranks best, while the inm-cm4 model ranks worst. Figure 5 describes the differences in annual temperature changes between the observations and the best and worst performing models in the Yellow-Huai-Hai region. We can clearly see that even though the ccsm4 model underestimates the mean temperature, the model simulates a change trend similar to that of the observations. In contrast, the inm-cm4 model vastly underestimates the temperature and simulates an incorrect temperature change in comparison to that of the observations.

Assessment of Precipitation.
Table 3 includes the assessment results of the GCM performances for precipitation.
In comparison with temperature, the GCMs perform poorly in terms of precipitation. The observed mean annual precipitation in the Yellow-Huai-Hai region is 568 mm, while most GCMs overestimate precipitation (650-1256 mm). Specifically, precipitation in the bnu-esm model reaches 1,256 mm, which is more than twice the observed precipitation amount. The standard deviations in the bnu-esm and mri-cgcm3 models are 83.4 mm and 33.58 mm, respectively, which are quite different from that of the observations (61.5 mm). The NRMSE for precipitation (0.54-1.5) is much larger than that for temperature (0.16-0.55). The correlation coefficients for monthly precipitation in the annual cycle via the GCMs are lower than those for temperature, but most correlation coefficient values are still greater than 0.9. However, when we analyze the performances of the GCMs in terms of the precipitation spatial distribution, the spatial correlation coefficients of the GCMs are 0.45-0.82, which indicates that the GCM simulations of the spatial distribution of precipitation are much worse than those of the spatial distribution of temperature. Figure 6 shows that the bnu-esm model performs poorly in terms of simulating spatial precipitation and, specifically, that it incorrectly locates the high precipitation area in the study region.
The annual precipitation in the Yellow-Huai-Hai region experiences a nonsignificant decrease at the 0.05 significance level. According to the Z value and the magnitude of Sen's slope, most GCMs show weaker decreasing trends than the observations, and some GCMs even show increasing trends during the study period.
The M-K test shows that precipitation in the GCMs follows different change trends, which indicates that the precipitation simulated by the GCMs is much more uncertain than the simulated temperature. Analyzing the EOFs reveals that the difference between the observations and GCMs is larger than that for temperature, which is consistent with the other criteria assessment results, where precipitation is relatively poorly simulated compared to temperature. The physical mechanisms affecting precipitation are mainly influenced by large-scale circulation factors; the inconsistent spatial distribution of simulated precipitation indicates that some GCMs could not explain the influence of circulation factors.
The empirical cumulative probability distributions for the GCM monthly precipitation are compared with the observations (Figure 7). [...] precipitation in the 53 grid points is much larger than that for monthly temperature, and the outliers also indicate an inconsistency between the GCMs and observations (Figure 8). Although the Sscore median values have almost the same magnitudes as those for monthly temperature, the higher BS and lower Sscore also indicate that the temperature simulations in the GCMs are better than the precipitation simulations (especially for the bnu-esm model). The RS values of the GCMs are shown in Table 3.
The csiro-mk3.6 model simulated precipitation better than the other models, and its RS was only 12.24. In contrast, the bnu-esm model performed the worst in terms of simulating precipitation, with the highest RS (48.52). Figure 9 describes the annual precipitation change in the Yellow-Huai-Hai region. The bnu-esm model vastly overestimated annual rainfall in the study region, and the oscillations in the model seem to have a reversed phase compared with those in the observations. The csiro-mk3.6 model also appears to slightly overestimate the annual precipitation in the study area and exhibits fluctuations different from those of the observations in the early 1970s; however, after 1975 the model fluctuates similarly to the observed data.
Five GCMs with a total RS for temperature of less than 17 (ccsm4, hadgem2, mpi-esm-lr, cesm1-bgc, and access1-0) and six GCMs with a total RS for precipitation of less than 20 (csiro-mk3.6, access1-0, ccsm4, cnrm-cm5, hadgem2, and cesm1-bgc) were chosen as the good GCM groups (Figure 10). Compared with the full set of GCMs, the good GCM groups have narrower uncertainty intervals, and their mean values are closer to those of the observations. Note that the errors in the GCM metadata affect the entire intensity spectrum, and bias correction is required to improve the GCM simulation capacity. After a simple bias correction, the models in the good GCM groups could be effectively applied in future studies.

Overall Performance of the GCMs and Sensitivity Analysis.
The RSs for temperature and precipitation are used for assessing the performances of all GCMs (the last columns in Tables 2 and 3). In ascending order, the difference between two successive ranking scores (i.e., the moving range (MR)) is used to detect the presence of any change points [21][22][23]. In addition, the Grubbs test [24] is used to test whether there is an anomalous value in a univariate dataset. If the tests indicate significant differences, then we have evidence to reject the GCMs within the larger ranking score group. The results of the MR analysis and Grubbs test are shown in Table 4.
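The two screening steps can be sketched as follows. This illustration computes only the moving ranges of the sorted total RSs and the standard two-sided Grubbs statistic, $G = \max_i |x_i - \bar{x}| / s$. The critical value is not computed here and would have to be taken from a Grubbs table or from the t-distribution at the chosen significance level. The function names are hypothetical.

```python
import math

def moving_ranges(scores):
    """Differences between successive total rank scores in ascending order."""
    ordered = sorted(scores)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def grubbs_statistic(scores):
    """Return (G, suspect), where G = max|x - mean| / s and s is the sample std."""
    n = len(scores)
    mean = sum(scores) / n
    sample_std = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    # The value farthest from the mean is the outlier candidate.
    suspect = max(scores, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / sample_std, suspect
```

A large jump in `moving_ranges` marks a candidate change point, and the GCM returned as `suspect` would be rejected only if G exceeds the critical value; repeating the test after removing a rejected model is one common way to screen for a second outlier.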
Two change points in the temperature RSs were detected, and the result of the Grubbs test indicates that these change points are outliers at the 0.05 significance level (Figure 11). Thus, these GCMs (inm-cm4 and ipsl-cm5b-lr) should be rejected because their RSs are significantly different from those of the other GCMs. The differences among the first 8 GCMs are not significant. Some GCMs with high ranking scores could not be rejected by the test because these GCMs could capture one or more characteristics in the temporal or spatial distributions of monthly temperature.
For precipitation, two change points were detected (Figure 12). The result of the Grubbs test indicates that the bnu-esm and canesm2 models should be rejected due to their poor performances in simulating precipitation. The RSs of the last two GCMs are remarkably different from those of the other GCMs, while the differences in the RSs among the other models are small.
According to Table 4, the simulations in the GCMs for temperature perform better than those for precipitation.
This result is consistent with the study of global-scale AR4 GCMs, which revealed that most GCMs could capture the characteristics of monthly temperature but not those of precipitation [12]. We should note that the same GCM performs differently for different climate variables. For example, the bnu-esm model is the 6th best model for temperature, but it is the worst model for precipitation. In addition, the csiro-mk3.6 model simulates precipitation the best, while it ranks only 10th in the RS assessment for temperature.
In addition, the GCMs perform differently in different regions.For example, the bnu-esm model is not suitable for projecting future climate changes in the Yellow-Huai-Hai region, but it may potentially perform well for another study region.
To analyze the influence of each individual assessment criterion on the ranking results, the overall results were compared with the results obtained after removing an individual statistical criterion. Based on Figure 13, adding or removing any assessment criterion does not obviously influence the overall ranking. The RS may change after a criterion is removed, but the better-performing models still perform well after adding or removing a criterion. The results indicate that this RS method robustly assesses the GCM performances. This robustness is an advantage of using the multicriteria method to assess the performances of GCMs rather than using an individual assessment criterion, because a GCM may simulate individual statistical factors well but not provide good simulations for other factors.

Advances in Meteorology

The RSs produced by each single criterion were also used individually and compared with the overall results (Figure 14). According to the correlation analysis, no single criterion produced exactly the same result as the overall ranking, which also affirmed that the multicriteria method produced more information than the single-criterion assessment. The single-criterion assessments provided different results: the RS of the NRMSE criterion was close to the overall ranking, with a correlation coefficient of 0.75, while the correlation coefficient of the spatial distribution criterion was only 0.08. Thus, if a GCM simulates the spatial and seasonal distributions well in the Yellow-Huai-Hai region, this does not mean that the model would also perform well in simulating other statistics (e.g., long-term means, trend magnitude, or probability density).
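The sensitivity analysis can be sketched as a leave-one-criterion-out comparison plus a correlation between each single-criterion RS and the overall RS. The sketch assumes that the scores are stored as nested dictionaries and that Pearson correlation is used; the source does not state which correlation measure underlies Figure 14. All names are hypothetical.

```python
import math

def total_scores(scores, weights, skip=None):
    """Weighted total RS per GCM, optionally leaving out one criterion.

    scores: {gcm_name: {criterion_name: rank_score}}
    weights: {criterion_name: weight}
    """
    return {
        gcm: sum(weights[c] * rs for c, rs in per_criterion.items() if c != skip)
        for gcm, per_criterion in scores.items()
    }

def ranking(totals):
    """GCM names ordered from the lowest (best) to the highest (worst) total RS."""
    return sorted(totals, key=totals.get)

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(var_a * var_b)
```

Comparing `ranking(total_scores(scores, weights))` with `ranking(total_scores(scores, weights, skip=c))` for each criterion `c` shows whether removing that criterion reorders the models, and `pearson` applied to one criterion's RSs and the overall totals gives a per-criterion correlation of the kind summarized in Figure 14.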

Conclusion
In this paper, a multicriteria score-based method is developed to assess GCM performance in the Yellow-Huai-Hai region from 1970 to 2005. The RSs of these criteria are applied to comprehensively assess the temporal and spatial performances of 18 GCMs when simulating precipitation and temperature in the study region.
All GCMs perform well when simulating temperature. Although all of the models underestimated the mean temperature, the simulated temporal and spatial distributions are quite close to those from the observations. The GCMs did not simulate precipitation as well as temperature, especially in terms of simulating precipitation spatial distributions. Most GCMs overestimate mean precipitation in the study area.
The well-performing models are selected to comprise the good GCM groups, where the means of the good GCM groups are closer to the observations. By analyzing the sensitivity of the criteria, we found that removing or adding a criterion does not obviously influence the results of the assessment, which indicates that the multicriteria score-based method is a robust method for assessing GCMs. This study provides an alternative to single-criterion evaluation for assessing the GCMs' simulation ability. Researchers could specify the criteria relevant to their specific application and research requirements to select appropriate GCMs for their study. This method could be easily applied to different study regions and guide the selection of GCMs for use in regional climate change impact studies.

Figure 1: Locations of meteorological stations and selected GCM grid points (solid circles) in the study region.

Figure 3: Empirical cumulative probabilities for monthly mean temperature via observations and GCMs.

Figure 5: Annual temperature changes in the Yellow-Huai-Hai region.

Figure 7: Empirical cumulative probabilities of monthly mean precipitation from observations and GCMs.

Figure 10: Comparison between the simulation results for all GCMs and those in the good GCM groups (the black solid circles represent the mean value of the group).

Figure 11: Moving test and breaking points of temperature via GCMs: (a) moving range for temperature and (b) breaking point of ranking scores for temperature.

Figure 12: Moving test and breaking points of precipitation via GCMs: (a) moving range for precipitation and (b) breaking point of ranking scores for precipitation.

Figure 14: Correlations between single criterion and overall ranking scores.

Table 1: Statistics of climate variables and their weights.

Table 2: Model performances for monthly mean temperature. EOFs 1 and 2 are percentages of the explained variance for the first two leading modes of each EOF; PDF is the probability density function.

Table 3: Model performances for precipitation. EOFs 1 and 2 are percentages of the explained variance for the first two leading modes of each EOF; PDF is the probability density function.

Table 4: The scores of the GCMs for monthly temperature and precipitation.