Data-Driven versus Köppen–Geiger Systems of Climate Classification

Climate zone classication promotes our understanding of the climate and provides a framework for analyzing a range of environmental and socioeconomic data and phenomena.e Köppen–Geiger classication system is themost widely used climate classication scheme. In this study, we compared the climate zones objectively dened using data-driven methods with Köppen–Geiger rule-based classication. Cluster analysis was used to objectively delineate the world’s climatic regions. We applied three clustering algorithms—k-means, ISODATA, and unsupervised random forest classication—to a dataset comprising 10 climatic variables and elevation; we then compared the obtained results with those from the Köppen–Geiger classication system. Results from both the systems were similar for some climatic regions, especially extreme temperature ones such as the tropics, deserts, and polar regions. Data-driven classication identied novel climatic regions that the Köppen–Geiger classication could not. Renements to the Köppen–Geiger classication, such as precipitation-based subdivisions to existing Köppen–Geiger climate classes like tropical rainforest (Af) and warm summer continental (Dfb), have been suggested based on clustering results. Climatic regions objectively dened by data-driven methods can further the current understanding of climate divisions. On the other hand, rule-based systems, such as the Köppen–Geiger classication, have an advantage in characterizing individual climates. In conclusion, these two approaches can complement each other to form a more objective climate classication system, wherein ner details can be provided by data-driven classication and supported by the intuitive structure of rule-based classication.


Introduction
Categorizing regions across the globe based on their climate is beneficial for summarizing climatological data and for explaining and disseminating environmental and sociopolitical data. Climate classification has been used as a basis for regionalization at global and regional scales in various fields, including ecological monitoring [1,2], hydrology [3,4], evolutionary anthropology [5], agriculture [6][7][8], and epidemiology [9,10]. Many classification systems have been previously reported [11][12][13]. Among those, the system proposed by Wladimir Köppen [14] is the most famous; it classifies global climate into 30 classes under five main groups, originally intended to be representative of the distribution of five vegetation types described by De Candolle according to the climate zones known to ancient Greeks [15].
The Köppen system describes climate zones based on several climatic variables derived from monthly temperatures and precipitation. Subsequent refinements and modifications were made to the original Köppen system [16,17]. In this study, we refer to the Köppen-Geiger classification system (hereafter called "KG"), following which world maps of climatic regions were prepared by Grieser et al. [18], Kottek et al. [19], and Peel et al. [20] for 1951-2000; Rubel and Kottek [21] for 1901-2100; and Beck et al. [22] for 1980-2016 and the future. The KG classification system is a rule-based, top-down approach to climate classification. Classification criteria in KG are tuned for their original purpose of reproducing the distribution of vegetation. Therefore, the climate zones produced by the KG classification system are subjective and limited to predetermined climate types. At the same time, the rule-based nature gives the KG classification system the advantage of being reproducible and intuitive, which is desirable for ease of communication and wide-scale adoption. KG classification has been critically evaluated by several researchers for its ability to delineate distinct climates around the world. Triantafyllou and Tsonis [23] evaluated the KG classification system by classifying climate stations into Köppen classes on an annual basis and estimating the frequency of changes between major Köppen climate groups. They found that in some regions of the world, KG is unstable to interannual changes in climate. They also reported that KG is either unable or slow to respond to long-term climate change, such as global warming. They suggested that statistical and data-driven approaches, such as factor analysis and cluster analysis, can be the basis for a more objective and robust classification system.
Rubel and Kottek [24], in their comments on Köppen's original paper and the review of KG's subsequent developments, remarked that in the past, climate classification was exclusively based on human expertise; however, today, it is supported by various statistical techniques, such as cluster analysis, which can objectively define the global climatic regions.
In this study, we aimed to examine how a data-driven, bottom-up approach to climate classification would produce objectively delineated climate zones and how they compare with the outcomes of KG. We present an objective and data-driven classification of the world's climate based on a cluster analysis approach. Previous studies [25][26][27][28] have used cluster analysis methods to regionalize the global climate as well as regional climates. DeGaetano [29], Russell and Moore [30], and Unal et al. [31] applied cluster analysis to primary station data to regionalize climates. Fovell and Fovell [32] regionalized the climate in the United States by clustering 344 climate divisions of the National Climatic Data Center (NCDC) dataset. Marston and Ellis [33] and Park et al. [34] applied cluster analysis to gridded climate data to regionalize the climates of the United States and the Korean peninsula, respectively. Hoffman et al. [35] used cluster analysis of the outputs of a general circulation model (GCM) to identify climate regimes and compare different simulation scenarios. Kumar et al. [36] applied a parallel processing implementation of a k-means clustering algorithm on large high-dimensional datasets comprising observed, remotely sensed, and simulated data to identify ecoregions in the United States. The choice of input data, their preparation, and the selection of clustering algorithms are important factors that determine the nature of the clusters produced [26].
Our study differs from previous data-driven classifications in the selection and preparation of input data and the setup of clustering methods. We prepared climatic variables similar to those in the KG. While the KG criteria distinguish climate classes using predefined thresholds in the data variables, we intend to reveal natural groupings in the data. Because the input climatic data were selected to be similar, a comparison could be made between the rule-based KG climate classes and data-driven clusters. When setting up the clustering methods, we followed multiple approaches toward two main objectives. One was to minimize subjectivity owing to any prior information given when setting up cluster analyses, such as the number of clusters. For this purpose, we included ISODATA clustering in our study because it does not require prior specification of the number of classes. With the k-means and random forest clustering algorithms, which require prior specification of the number of classes, we minimized subjectivity by estimating the optimum number of clusters. Second, to compare data-driven and rule-based classifications, we set up cluster analyses to create the same number of clusters as in the KG.

Gridded Climatic Data.
Reanalysis data from the Climatic Research Unit gridded Time Series (CRU TS) is a widely used global climate dataset that covers all land areas, except Antarctica [37]. It provides 10 climatic variables at a spatial resolution of 0.5°. For this study, the monthly mean 2 m air temperature and monthly precipitation rate from the CRU TS dataset version 4.05 were used. Because this study was aimed at classifying the present climate, data for the 30-year period from 1991 to 2020 were extracted. The GMTED2010 global digital elevation model (DEM) was used for obtaining elevation data. It was developed based on data derived from multiple elevation data sources and is available at different resolutions [38].

KG Classification.
The KG classification system was adopted following the criteria described by Peel et al. [20], which have also been used by Kriticos et al. [39] and Beck et al. [22]. This classification system has been slightly modified from the original method presented by Köppen [14] and Geiger [16], and the differences have been discussed by Beck et al. [22]. The KG classification criteria are included in the Supplementary Materials (S1). The criteria for classification in KG were defined using 11 climatic variables that were calculated using the monthly precipitation and mean temperature data. These variables were first calculated at a 1-year time resolution. Then, the latest 30-year (1991-2020) averages of these variables were calculated. The updated KG classification of the present climate was prepared by applying the classification criteria to the 30-year averaged variables.

Data-Driven Classification.
Ten climatic variables were derived from the precipitation and temperature data for data-driven classification: mean annual temperature (T_mean), mean annual precipitation (P_year), air temperature of the coldest month in summer (T_smin), air temperature of the warmest month in summer (T_smax), air temperature of the coldest month in winter (T_wmin), air temperature of the warmest month in winter (T_wmax), precipitation of the driest month in summer (P_sdry), precipitation of the wettest month in summer (P_swet), precipitation of the driest month in winter (P_wdry), and precipitation of the wettest month in winter (P_wwet). The definition of the seasons is consistent with that in the KG system; out of the two 6-month periods (April-September and October-March), the warmer is designated as summer and the colder as winter, at each grid cell. A map of the regions that experienced summer in April-September is presented in Figure 1. This indicates that the two regions are not separated along the equator. Some April-September summer areas were enclaved within the other region. Notably, these areas coincide with the Amazon and Congolian rainforests. All climatic variables were prepared in the form of global 0.5° grids, consistent with the original CRU TS data. The GMTED2010 DEM dataset [38] provides elevation data at various resolutions. The 0.5° resolution elevation data (ELE) were used in this study.
These climatic variables were selected after performing trial clustering exercises. Initial attempts with monthly means of the temperature and precipitation failed to recognize similar climates in different hemispheres because they occurred in different months. The use of seasonal variables allowed us to overcome this drawback. Because the seasons were defined in the same way as in KG, results that are more comparable to KG classes were obtained. However, one may obtain a more objective clustering output with an objective definition of seasons.
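The seasonal definition above can be illustrated for a single grid cell. The following is a minimal sketch, not the authors' code: it assumes 12 long-term monthly mean temperatures and 12 monthly precipitation totals (January first) as input, takes P_year as the sum of the monthly totals, and resolves a tie between the two half-years in favor of April-September. The function name `seasonal_variables` and the dictionary keys are illustrative.

```python
from statistics import mean

APR_SEP = (3, 4, 5, 6, 7, 8)    # zero-based month indices, April to September
OCT_MAR = (9, 10, 11, 0, 1, 2)  # October to March


def seasonal_variables(temp, precip):
    """Derive the ten clustering variables for one grid cell.

    temp   -- 12 monthly mean temperatures (deg C), January first
    precip -- 12 monthly precipitation totals (mm), January first (assumed unit)
    """
    # The warmer of the two fixed half-years is summer at this cell.
    if mean(temp[m] for m in APR_SEP) >= mean(temp[m] for m in OCT_MAR):
        summer, winter = APR_SEP, OCT_MAR
    else:
        summer, winter = OCT_MAR, APR_SEP
    t_s = [temp[m] for m in summer]
    t_w = [temp[m] for m in winter]
    p_s = [precip[m] for m in summer]
    p_w = [precip[m] for m in winter]
    return {
        "T_mean": mean(temp),
        "P_year": sum(precip),
        "T_smin": min(t_s), "T_smax": max(t_s),
        "T_wmin": min(t_w), "T_wmax": max(t_w),
        "P_sdry": min(p_s), "P_swet": max(p_s),
        "P_wdry": min(p_w), "P_wwet": max(p_w),
    }
```

Because the variables are indexed by season rather than by calendar month, a climate and its six-month-shifted counterpart in the other hemisphere yield identical values, which is the property that monthly means lacked.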

Principal Component Analysis.
The redundancy of information in the chosen data variables can be a source of bias in the clustering analysis. With regard to their hierarchical clustering analysis of the climate in the United States, Fovell and Fovell [32] discussed the problem of information redundancy. They used principal component analysis (PCA) to reduce the correlation among the data, thereby attempting to reduce redundancy. However, they noted that a certain amount of redundancy can remain even among the principal components (PCs) that are orthogonal to each other.
We employed PCA to reduce redundancy in the data. Furthermore, the reduction in dimensions would help reduce the complexity of the computations. PCA was applied to normalized data variables. Both PCA and normalization have been employed in the cluster analyses of climatic data (Fovell [40], Fovell and Fovell [32], Gómez-Zotano et al. [41], Kozjek et al. [42]). Netzel and Stepinski [26] highlighted the importance of proper normalization and reported that their modified normalization method for precipitation data performed better than uniform normalization, which tends to produce clusters that are largely influenced by temperature. In this study, we standardized each data variable as a z-score.
The number of PCs retained for cluster analysis was determined by inspecting the scree plot. Selecting PCs based on the location of the elbow in a scree plot is an accepted stopping rule in PCA [43]. Three PCs were selected based on the scree plot shown in Figure 2. These represent 90% of the variance in the data. The loadings for the three PCs are listed in Table 1.
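The standardization and PCA steps can be sketched as follows. This is an illustrative outline using NumPy rather than the software used in the study; the random matrix `X` is a hypothetical stand-in for the data matrix (grid cells by variables), and treating elevation as an eleventh column is an assumption. The explained-variance ratios returned by `pca` are the values one would plot in a scree plot.

```python
import numpy as np


def zscore(X):
    """Standardize each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca(X_std):
    """Return (scores, loadings, explained_variance_ratio), largest PC first."""
    # Eigen-decomposition of the covariance matrix of the standardized data.
    cov = np.cov(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # eigh returns ascending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = X_std @ eigvecs           # data projected onto the PCs
    return scores, eigvecs, eigvals / eigvals.sum()


rng = np.random.default_rng(0)
# Hypothetical correlated data: 500 "grid cells" by 11 variables.
X = rng.normal(size=(500, 11)) @ rng.normal(size=(11, 11))

scores, loadings, ratio = pca(zscore(X))
n_keep = 3                    # number of PCs read off the scree plot
reduced = scores[:, :n_keep]  # input to the clustering step
```

The number of retained components is left as a manual choice here because the elbow rule is a visual judgment rather than a formula.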

Cluster Analysis.
Cluster analysis was used in this study as a data-driven approach to delineate climatic regions. Out of the many available clustering algorithms, three were considered in this study.

k-Means Clustering.
k-means clustering is a widely used clustering method in which k partitions are constructed by assigning each observation to the cluster whose mean is nearest [44]. The k-means clustering algorithm of MacQueen [45], as implemented in the Cluster package in R version 4.1.1 [46], was used in this study. Euclidean distance was used as the distance function. The three selected PCs of the data variables were used for clustering.
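The assignment and update steps of k-means can be sketched in a few lines. Note that this sketch uses the batch (Lloyd-style) iteration with centers initialized from randomly sampled observations, which is a common variant and not necessarily the MacQueen procedure or the R implementation used in the study; the function name and parameters are illustrative.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Batch k-means on a list of equal-length tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centers
```

The result depends on the seed because the iteration only reaches a local minimum of the within-cluster sum of squares, a point the Discussion returns to.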

ISODATA Clustering.
The iterative self-organizing data analysis technique (ISODATA), a partitioning-type clustering method, is a modification of the k-means clustering algorithm [47]. Unlike k-means clustering, it does not require prior specification of the number of classes. Starting with an initial user-defined number of clusters, the ISODATA algorithm alters the number of clusters by merging, splitting, or deleting clusters based on certain heuristics to converge to a solution with an optimum number of clusters. In this study, ISODATA clustering was performed using the fast implementation of the ISODATA algorithm provided in the SAGA GIS package [48].

Random Forest Clustering.
As a supervised learning method, random forest classification requires training with a labeled or classified dataset. In this study, we decided to use synthetic training data generated by k-means clustering on a random sample of 5000 grid cells of the dataset. The random forest model was trained on the synthetic training data, and the full dataset was then clustered using the trained model. Random forest classification was performed using the randomForest package in R [49].

Number of Clusters.
Although the number of different climate zones is not known a priori, it is a necessary input for clustering methods, such as k-means. Some climate clustering studies have adopted a top-down approach to select the number of clusters (k) such that five and 13 cluster solutions are derived to match the first two levels of division in KG classification [26]. Various statistical measures, such as the Akaike information criterion, Bayes information criterion, information-theoretical V-measure, and Calinski-Harabasz criterion, have been used to determine the optimum number of clusters [42,50,51]. In this study, we used the Calinski-Harabasz criterion, which uses the pseudo-F statistic as a measure of cluster cohesiveness [52] because it has been widely used to determine the optimum number of clusters in many applications, including climatological clustering studies [33,53,54].
To allow a closer comparison with KG classification, 30-cluster solutions were also developed using both k-means and random forest clustering.
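For reference, the pseudo-F statistic of the Calinski-Harabasz criterion is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: F = [B / (k - 1)] / [W / (n - k)] for n observations in k clusters. A minimal standard-library sketch is given below; the function name is illustrative, and it is undefined for k = 1 or for zero within-cluster dispersion.

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo-F for a labeled set of equal-length tuples."""
    n, dims = len(points), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dims)]
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    between = within = 0.0
    for members in clusters.values():
        center = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        # Between-cluster sum of squares, weighted by cluster size.
        between += len(members) * sum((c - g) ** 2 for c, g in zip(center, grand))
        # Within-cluster sum of squares around the cluster mean.
        within += sum((x - c) ** 2 for p in members for x, c in zip(p, center))
    return (between / (k - 1)) / (within / (n - k))
```

Higher values indicate compact, well-separated clusters, so the statistic is evaluated over a range of k and inspected for peaks.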

Advances in Meteorology

Comparison of Clustering Results.
We used the Jaccard similarity coefficient [55] to investigate the similarity between clusters produced by different methods and KG classes. The Jaccard similarity coefficient is the ratio between the intersection and union of two sets; it has values ranging from zero for non-intersection to one for exact similarity. This index is widely used in the evaluation of similarity in clustering in addition to applications such as image recognition and text analysis [56,57].
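Treating each class or cluster as the set of grid cells it contains, the coefficient is J(A, B) = |A ∩ B| / |A ∪ B|. The sketch below computes it for every pair of classes from two label sequences over the same cells. It counts every grid cell equally, which is an assumption; whether the study weighted cells by area is not stated here. The function names are illustrative.

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| for two collections of grid-cell identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_table(labels_x, labels_y):
    """Jaccard coefficient for every (class in x, class in y) pair."""
    cells_x, cells_y = {}, {}
    for i, (x, y) in enumerate(zip(labels_x, labels_y)):
        cells_x.setdefault(x, set()).add(i)
        cells_y.setdefault(y, set()).add(i)
    return {
        (x, y): jaccard(sx, sy)
        for x, sx in cells_x.items()
        for y, sy in cells_y.items()
    }
```

For example, `jaccard_table(kg_labels, km_labels)[("EF", 1)]` would give the overlap between a KG class labeled "EF" and a cluster labeled 1, where both label sequences are hypothetical inputs listed in the same grid-cell order.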

Reproduction of the KG Classification.
A map of the KG classification of the present climate was prepared at a 0.5° resolution (Figure 3). Antarctica was not classified because the CRU TS dataset does not cover the continent. Maps of the five main KG groups and 13 level-2 classes are presented in Figures S2-1 and S2-2 in the Supplementary Materials.
By applying the KG classification scheme at an annual scale for 1901-2020, the annual variation in KG classes was investigated. Figure 4 shows the variability in the five main Köppen climate groups. Maps of variability between individual climate pairs are included in Supplementary Material S3. In the case of the main KG climate groups, two types of areas could be differentiated, as shown in Figure 4. There are narrow and sharp regions that suggest that the corresponding climate groups are well-defined with less ambiguity. There are also wider and fuzzy regions that suggest that the definitions of the corresponding climate groups are ambiguous. The identified regions of high variability agreed well with the findings of Triantafyllou and Tsonis [23]. Although KG is intended to be a classification of long-term climates, its application at an annual scale allows the identification of climates that are prone to be ambiguously characterized.

PCA.
The first PC, which represents 57% of the variance, is a combination of all 10 climatic variables, with P_year having the largest magnitude. P_year also has the largest magnitude in the second PC, which explains 25% of the variance. Overall, most climatic variables had similar magnitudes in the first two components, suggesting that variance in the original data was shared similarly between the temperature and precipitation variables.

Cluster Analysis.
k-means clustering was performed on data variables transformed into the three PCs. The pseudo-F statistic was calculated for clustering solutions of up to 35 clusters. An optimum number of clusters (k = 12) was selected by detecting the presence of a local peak, followed by a sharp decline in the pseudo-F statistic (Figure 5). Based on this, a 12-cluster solution (KM12) was prepared using k-means clustering, as shown in Figure 6. Then, a 30-cluster solution (KM30) was prepared (Figure 7). ISODATA clustering resulted in a 16-cluster solution (Figure 8) named ISO16. Random forest clustering was used to develop a 30-cluster solution (Figure 9) named RF30. All cluster maps were visualized using the same color scale applied in the order of the mean annual temperature of each cluster. Cluster identification numbers were assigned in the same order.
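The "local peak followed by a sharp decline" rule can be made explicit in code. The sketch below is one possible formalization and is an assumption: it lists every local maximum of the pseudo-F curve together with the relative drop to the next k, leaving the final choice to the analyst. The function name `local_peaks` is illustrative.

```python
def local_peaks(scores):
    """Return [(k, relative_drop), ...] for local maxima of a {k: pseudo-F} mapping.

    relative_drop is the fractional decrease from the peak to the next k.
    """
    ks = sorted(scores)
    peaks = []
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        if scores[k] > scores[prev] and scores[k] > scores[nxt]:
            peaks.append((k, (scores[k] - scores[nxt]) / scores[k]))
    return peaks
```

A candidate k would then be the peak with the largest drop, for instance `max(local_peaks(scores), key=lambda kd: kd[1])[0]`, provided at least one local peak exists.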

Jaccard Coefficient.
The Jaccard coefficient between the KG classification and each data-driven classification was calculated. The Jaccard coefficient values are listed in Supplementary Material S4. Further, the proportions of coverage between different classification schemes are visualized in Figure 10.

Discussion
Between the k-means 12-cluster solution and the 13 level-2 KG climate classes, the highest similarity is found between the EF KG climate and KM12 cluster 1 and between the BWh KG climate and KM12 cluster 10, with Jaccard similarity coefficient values of 0.81 and 0.67, respectively. Figure 11 shows how some of the regional boundaries are in close proximity, signaling that KG is able to identify some of the natural clusters in climate data. The dissimilarity between the BW KG class and KM12 cluster 10 is mainly due to the cold desert regions that correspond to the BWk class and are not included in KM12 cluster 10. Figure 12 plots the mean values of the two main PCs for the 13 level-2 KG climates and the KM12 clusters. Some centers are close together, indicating similar regions in the two classification systems. At the same time, cluster analysis has recognized some unique climatic regions that were not seen in the KG classification results, as indicated by the presence of several isolated cluster centers.

Cluster 11 of KM12 stands out in Figure 12. In the KG, that area is shared between Af, Am, and Cf classes with Jaccard coefficients of 0.446, 0.035, and 0.13, respectively. The relative contribution of the parameters for the clustering result can be studied based on the standard deviation. In distance-based k-means clustering and ISODATA clustering methods, the objective of the algorithm is to minimize the distance between cluster members and cluster mean, given by the within-cluster sum of squares (WCSS), which is equivalent to variance. Therefore, by calculating the standard deviation of each variable of the cluster members (Figure 13 for KM12), the contribution of the variables to the minimization of the objective function of WCSS can be compared. A variable that has high similarity, i.e., low standard deviation, contributes more to the differentiation of the clusters. The contribution of the precipitation variables is significantly high in some clusters like 2, 3, and 10 but significantly low in clusters 11 and 12. The influence of temperature variables appears to be similar for all clusters. This distinction can be explained by the high dynamic range of original precipitation variables. Contribution of elevation is comparatively low across all clusters.
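The link between WCSS and the per-variable standard deviations can be written out directly: the WCSS is the sum, over clusters, of the cluster size times the sum of the (population) variances of its variables. The sketch below computes both quantities; it assumes the variables are already on a common standardized scale, and the function names are illustrative.

```python
from statistics import pstdev


def cluster_spread(rows, labels):
    """Per-cluster population standard deviation of every variable."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    return {
        lab: [pstdev(col) for col in zip(*members)]
        for lab, members in groups.items()
    }


def wcss(rows, labels):
    """Within-cluster sum of squares, rebuilt from the per-variable spreads."""
    labels = list(labels)
    spread = cluster_spread(rows, labels)
    return sum(
        labels.count(lab) * sum(sd ** 2 for sd in sds)
        for lab, sds in spread.items()
    )
```

Comparing the entries of `cluster_spread` across variables for one cluster corresponds to reading one group of bars in a plot such as Figure 13.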

Climatic Regions Discovered by Cluster Analysis.
There are instances wherein cluster analysis has subdivided the Köppen classes. The Af KG climate for tropical rainforests has no level-3 subdivision in KG. In the KM12 classification, the region corresponding to the Af climate was shared mainly between two clusters, numbers 11 and 12. As revealed by the climographs (Figure 14) of the regions, k-means clustering distinguished different precipitation levels, although the temperature variation was similar in all three regions. KM12 cluster 11 has distinctly higher precipitation than KM12 cluster 12. Noting that, in KG, the Af class is defined by minimum monthly precipitation >60 mm, we can investigate the value of the same statistic for the KM12 clusters. The mean of the minimum monthly precipitation is 132.84 mm in KM12 cluster 11 and 48.11 mm in KM12 cluster 12. The mean annual precipitation is 3368 mm in cluster 11 and 2360 mm in cluster 12. These statistics can be interpreted as follows. When identifying the wettest climate, KM12 establishes a lower threshold for precipitation that is more than twice the value in KG. While the Af class contains all three major rainforest regions in Central Africa, South America, and Southeast Asia, cluster 11 is not present in the African continent. The 30-cluster solutions from k-means and random forest clustering provided further insights into refining the KG climate classes. The warm summer continental climate of Dfb, which is present in Eurasia and North America, has been placed in separate clusters in the random forest classification (Figure 15). Climographs revealed that the North American part of the Dfb class receives distinctly higher precipitation, suggesting that the Dfb class can be subdivided further.
Cluster analysis has detected isolated geographical areas with unique climates that were hidden in the KG classification results. For instance, both 30-cluster solutions [...] in the Tibesti mountains is different from that in the rest of the Sahara, as it receives more rainfall. Figure 16 shows that cluster 20 of the KM30 classification, which includes the above areas, receives substantially higher rainfall, in contrast to cluster 27, which is more similar to the BWh-hot desert KG climate.

Clustering with Random Forest.
Some clusters created by the distance-based k-means and ISODATA clustering methods appear to be primarily sensitive to either temperature or precipitation, rather than both. For instance, clusters 11 and 12 of KM12 (which are the two warmest clusters), cluster 25 of KM30 (which is the sixth warmest cluster), and cluster 14 of ISO16 (which is the third warmest) have parts in colder regions, such as Scotland, Norway, New Zealand, and Chile, corresponding to locations of the boreal rainforests that receive very high precipitation. These regions are clustered together with tropical rainforest regions, even though the temperature regimes are considerably different. However, random forest clustering did not show a similar outcome. In RF30, the same higher precipitation regions at higher latitudes were clustered together with other colder regions. This highlights some advantages of machine learning models, such as random forest and other decision tree-based models, over distance-based clustering methods. In particular, compared with distance-based methods, tree-based methods are generally more robust against outliers [58].

Suggestions for Dissemination of Data-Driven Classification Results.
Although data-driven classification offers an objective classification scheme with finer details, wider adoption of such schemes may be discouraged by certain common drawbacks. The climate zones produced by data-driven methods need to be retrospectively characterized. Not only may it be challenging to uniquely characterize all individual clusters, but a single definition may also not be satisfactory for users in different disciplines. Another problem is that the clustering results can be inconsistent. For instance, because the k-means algorithm converges to a local minimum, it can produce different clusters in each trial with different initial conditions [59]. Although methods to search for optimum initial conditions [60,61] and reproduce a selected clustering outcome are available, the above attributes may make data-driven classifications less appealing than the intuitive and unique results offered by rule-based classifications.
Figure 12: Means of the two PCs for the 13 level-2 KG climates (Köppen13) and 12 k-means clusters (KM12). Centers of some clusters (5, 7, and 11) are distant from any KG class means.
Maps of climate zones produced by data-driven methods are highly fragmented with complex edges in some regions. For the ease of communication, maps of data-driven classifications can be post-processed to reduce noise and sharpen edges. Denoising is a process employed in signal and image processing to remove noise from analog and digital signals and images. There are many different image denoising techniques [62]. For application in climate classification maps, a suitable technique can be chosen considering the requirements that the pixel values in the resulting filtered image must be limited to the values (cluster identification) in the original image and that edges must be preserved and sharpened. Rank filtering satisfies both these requirements. It replaces a pixel value with a specified ranking value from the sorted values of the neighborhood. Often, the rank is specified as the median, and this process is called a median filter [63]. Supplementary Material S5 shows a map of the KM30 clusters post-processed using a median filter. The climate zones in the denoised map are less fragmented, have sharper edges, and are generally more discernible. Therefore, such post-processing techniques can be used to produce maps that are better suited for the communication of the results of data-driven classification.
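A median filter on a map of cluster identification numbers can be sketched as follows. This is an illustrative outline, not the procedure used for Supplementary Material S5. It assumes a rectangular grid of integer IDs with no missing cells, clips the window at the map edges (no longitude wrap-around), and relies on the IDs being ordered by mean annual temperature so that a median is meaningful. The low median is used so that the output is always an ID already present in the window.

```python
from statistics import median_low


def median_filter(grid, radius=1):
    """Median-filter a 2-D list of integer cluster IDs with a square window."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(n_rows):
        for c in range(n_cols):
            # Collect the neighborhood, clipped at the map edges.
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(n_cols, c + radius + 1))
            ]
            # median_low returns an element of the window, so no new IDs appear.
            out[r][c] = median_low(window)
    return out
```

With the default 3 by 3 window, an isolated single-cell speck is removed while a straight boundary between two zones is left in place; larger radii smooth more aggressively.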

Concluding Remarks
While the rule-based KG classification system has been well established as the foremost climate classification system, data-driven classifications offer an alternative with the promise of being more objective. Our study was devised to explore naturally emerging clusters in climate data and compare the identified climatic regions with those obtained with the KG classification system. Global climatic regions were objectively delineated by conducting cluster analyses on a data matrix comprising 10 climatic variables and elevation. In the climatic regions identified by cluster analyses, strong similarities to KG climate classes were observed in regions with extreme temperatures. All clustering methods delineated prominent climate regions that were similar to the KG climate classes in groups A (tropical), B (arid), and E (polar). Higher Jaccard coefficient values were also reported among the above groups, confirming that the best consensus between data-driven classifications and KG exists in these climates. In the temperate (C) and cold (D) KG climate groups, agreement with data-driven classifications was limited.
Further refinements to the KG climate classes were suggested based on the results of data-driven clustering. Instances wherein KG climate classes could be subdivided into distinct climates were identified, such as the subdivision of the Af-tropical rainforest climate and the Dfb warm summer continental climate. Unique climatic regions that were obscured in KG were also identified, such as mountainous regions within the Sahara.
In summary, our clustering results show that even though it is a rule-based classification system, KG approximates some of the natural clusters in terms of the global climate. Simultaneously, it obscures regions that can be differentiated in data-driven classifications. With no definitive measure of the performance of climate classification systems, it is impossible to conclude that one system is better than the other. Clustering-based climate classifications may be less appealing as a stand-alone system because of the lack of formal definitions, whereas KG has established a wide appeal due to its familiar definitions. However, definitions for data-driven clusters can be formulated based on climatology and geography, as demonstrated in selected cases.
We therefore conclude that data-driven classifications are best used to complement and refine the structure and definition provided by the rule-based KG classification system. To that end, we demonstrated how Köppen classes can be refined using data-driven insights. In addition, post-processing methods, such as the median filtering demonstrated here, may help produce climate zone maps that are suitable for interpretation and communication.
The climate data selected in this study were limited to monthly means of the temperatures and precipitation. There is an opportunity to enrich data-driven classifications by including more variables that are descriptive or predictive of the climate. Further studies should focus on investigating additional variables that could produce more insightful clustering outputs.

Conflicts of Interest
The authors declare that they have no conflicts of interest.