A Novel Framework for Selecting Informative Meteorological Stations Using Monte Carlo Feature Selection (MCFS) Algorithm

The spatial distribution of meteorological stations plays a significant role in hydrological research. Meteorological data are central to drought monitoring, so an accurate and suitable provision of meteorological stations is crucial for improving the skill of drought prediction. The choice of meteorological stations in a specific region is therefore of substantial importance for accurate estimation and continuous monitoring of drought hazards at the regional level. However, installing a large number of meteorological stations and mining their data requires high cost and resources. It is therefore necessary to rank the existing meteorological stations in a particular region and to find dependencies among them for further climatological analysis and reanalysis of databases. In this paper, a framework based on the Monte Carlo feature selection and interdependency discovery (MCFS-ID) algorithm is proposed to identify the important meteorological stations in a particular region. We applied the proposed framework to 12 meteorological stations situated in varying climatological regions of Punjab (Pakistan). We employed the drought index SPTI on 1-, 3-, 6-, 9-, 12-, 24-, and 48-month time-scale data to find the interdependencies among meteorological stations at various locations. We found that Sialkot has significant regional importance for studying the SPTI-3, SPTI-6, and SPTI-48 indices. This regional importance is based on relative importance (RI) scores; for example, the RI values for the SPTI-3, SPTI-6, and SPTI-48 indices are 0.1570, 0.1080, and 0.0270, respectively. Furthermore, the Jhelum station has higher relative importance (RI = 0.1410 and 0.1030) for the SPTI-1 and SPTI-9 indices, while varying concentration behaviour is observed at the remaining time scales.


Introduction
Drought is a creeping phenomenon and a recurrent natural disaster in many regions of the world [1][2][3]. It is an insidious, slow-moving natural hazard that has adverse social and environmental impacts and can influence the economy of any country [4]. These effects can be felt well outside the affected area, even at the global level. The complexity of the effects is largely due to the dependence of so many sectors on water to provide goods and services; the availability and quality of water therefore have serious implications for water resource management [5,6]. Since rising pressure on water and other natural resources leads to drought [7], it is clear that producing more comprehensive assessments over time is challenging [8].
Due to its complex features, drought has severe and prolonged adverse effects [9]. Attempts have been made to identify the complexity of these effects at the local, regional, or national level [10]. Because of the insidious behaviour of drought, it is almost impossible to track the databases and trends of a region. Researchers and policymakers have proposed different strategies for their countries to improve drought preparedness by building better early warning systems and adopting drought policies and response and mitigation plans [11,12]. It is therefore imperative for scientists and policymakers to develop efficient early warning drought-monitoring tools to avoid the severe effects of drought [13][14][15]. However, accurate regional drought forecasting and the development of an early warning tool require long-term records of drought indices from regionally representative meteorological stations. In this regard, accurate estimation and continuous monitoring of future drought at the regional level require a dense meteorological network. However, implementing such a network requires high cost and resources. Particularly in developing countries, the high cost of installation and the complex sampling design may force a compromise in the allocation and installation of meteorological stations [16][17][18].
In recent decades, several algorithms and methods for the optimal selection of meteorological stations have been developed [19,20]. These algorithms and methods reduce the size of the network while providing more accurate and regionally representative estimates of meteorological variables. In this perspective, the resulting spatial distribution of optimal meteorological stations plays a key role, specifically in hydrological research [21,22]. An essential task of hydrological research is therefore to obtain an optimal set of meteorological stations based on some meteorological characteristics [20,23]. In achieving the optimal network, geographical features of a region such as deserts, hills, and forests are key factors that significantly contribute to the distribution of meteorological stations. Consequently, the optimal selection of meteorological networks requires a standardized climatic indicator.
There are several tools for monitoring and characterizing drought [24]. However, estimating standardized drought indices (SDIs) and other novel drought indices at all meteorological stations creates a chaotic situation for regional forecasting and early warning management policies. The SDI has been used in many applications [25][26][27]. SDIs are useful for drought characterization and for comparing meteorological stations with different climatological characteristics [28]. Nevertheless, long-term SDI time-series data are crucial for drought characterization at individual meteorological stations [29]. It is thus necessary to rank existing meteorological stations and find dependencies among them for climatological and reanalysis databases. In this regard, advanced statistical procedures are helpful for finding important and regionally dependent stations under complex meteorological network settings.
In this study, we propose a framework to identify important meteorological stations and to discover dependencies among stations for up-to-date real-time drought-monitoring systems. The core configuration of the framework is based on MCFS-ID [30,31]. We applied the proposed framework to 12 meteorological stations situated in varying climatological regions of Punjab (Pakistan). The analysis is performed with the drought index SPTI on 1-, 3-, 6-, 9-, 12-, 24-, and 48-month time-scale data to find the interdependencies among meteorological stations at various locations.

Standardized Precipitation Temperature Index (SPTI).
There are numerous procedures that use a multiscalar drought index to describe the severity of drought. One such index is the Standardized Precipitation Index (SPI), which uses long-term precipitation records to quantify precipitation scarcity [32]. The SPI can be computed at different time scales. It is obtained by fitting a suitable probability distribution to the observed monthly cumulative precipitation time series and standardizing the result. Thus, positive and negative SPI values indicate greater than and less than median precipitation, respectively. Another index, the Standardized Precipitation Evapotranspiration Index (SPEI), follows the same mathematical procedure as the SPI but is based on a water balance, namely the difference between precipitation and potential evapotranspiration (PET) [33]. One significant advantage of the SPEI over the SPI is that it includes the influence of evaporation in the area being studied. Following the same methodology as the SPI and SPEI, a multiscalar drought index, the SPTI, was more recently developed for the characterization of drought in both cold (minimum temperature −5.50 °C) and hot (maximum temperature 45.2 °C) climate regions [34]. In this study, we used the SPTI for the following three reasons: (1) the selected regions have both cold and hot climatic weather, (2) the SPTI provides true values for regions observed with low temperature, and (3) there is no mathematical contention in the SPTI mechanism.

The procedure for SPTI estimation is as follows. In step one, for each selected station, the De Martonne Aridity Index (DAI) is evaluated from the monthly total precipitation and average temperature as

$$\mathrm{DAI}_i = \frac{P_i}{T_i + 10}, \tag{1}$$

where $P_i$ denotes the monthly total precipitation and $T_i$ the monthly mean temperature. In step two, candidate probability distributions are considered for the $\mathrm{DAI}_i$ series of each station. More specifically, in this work the 32 most frequently used probability distributions were fitted using the propagate R package [35] to find the most suitable one. In step three, a distribution is selected for each station's $\mathrm{DAI}_i$ time series on the basis of the minimum values of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). For each station, we then standardize the cumulative distribution function (CDF) of the fitted distribution. To adjust for the effect of undefined values in the DAI, a small amendment to the CDF is made:

$$H(x) = q + (1 - q)\,F(x), \tag{2}$$

where $F(x)$ is the CDF of the fitted distribution. For example, in the case of the gamma distribution, $q$ is the probability of a zero value in the DAI time series of a station. If $m$ is the number of zeros in the $\mathrm{DAI}_i$ time series, then $q$ can be estimated by $m/n$, where $n$ is the total number of observations in the series. Other familiar probability plotting position methods, such as that in [36], can be used instead to regulate the probability of undefined values in the CDF. The SPTI is then obtained from $H(x)$ by the approximation

$$\mathrm{SPTI} = -\left(t - \frac{C_0 + C_1 t + C_2 t^2}{1 + d_1 t + d_2 t^2 + d_3 t^3}\right), \tag{3}$$

where

$$t = \sqrt{\ln\!\left(\frac{1}{H(x)^2}\right)} \quad \text{for } 0 < H(x) \le 0.5, \tag{4}$$

and

$$\mathrm{SPTI} = +\left(t - \frac{C_0 + C_1 t + C_2 t^2}{1 + d_1 t + d_2 t^2 + d_3 t^3}\right), \tag{5}$$

in which

$$t = \sqrt{\ln\!\left(\frac{1}{(1 - H(x))^2}\right)} \quad \text{for } 0.5 < H(x) < 1. \tag{6}$$

Here, $C_0 = 2.515517$, $C_1 = 0.802853$, $C_2 = 0.010328$, $d_1 = 1.432788$, $d_2 = 0.189269$, and $d_3 = 0.001308$.
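The estimation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the De Martonne index as P/(T + 10), the zero-adjusted CDF as H(x) = q + (1 − q)F(x), and the usual rational approximation with the coefficients quoted in the text. The function names, the exponential stand-in for the fitted CDF, and the sample precipitation and temperature values are all hypothetical.

```python
import math

# Coefficients of the rational approximation quoted in the text.
C0, C1, C2 = 2.515517, 0.802853, 0.010328
D1, D2, D3 = 1.432788, 0.189269, 0.001308


def de_martonne(precip, temp):
    """De Martonne aridity index for one month (assumed form P / (T + 10))."""
    return precip / (temp + 10.0)


def adjusted_cdf(x, fitted_cdf, q):
    """Zero-adjusted CDF: H(x) = q + (1 - q) * F(x); H equals q at x = 0."""
    if x <= 0.0:
        return q
    return q + (1.0 - q) * fitted_cdf(x)


def standardize(h):
    """Map a probability h in (0, 1) to an approximate standard normal score."""
    if not 0.0 < h < 1.0:
        raise ValueError("h must lie strictly between 0 and 1")
    p = h if h <= 0.5 else 1.0 - h
    t = math.sqrt(math.log(1.0 / (p * p)))
    num = C0 + C1 * t + C2 * t * t
    den = 1.0 + D1 * t + D2 * t * t + D3 * t ** 3
    z = t - num / den
    return -z if h <= 0.5 else z


def spti_series(dai, fitted_cdf):
    """Standardize a DAI series; q is estimated as (number of zeros) / n."""
    q = sum(1 for v in dai if v == 0.0) / len(dai)
    return [standardize(adjusted_cdf(v, fitted_cdf, q)) for v in dai]


# Hypothetical monthly precipitation (mm) and mean temperature (deg C).
precip = [12.0, 0.0, 85.5, 140.2, 30.1, 5.4]
temp = [11.5, 14.0, 28.3, 31.0, 24.6, 13.2]
dai = [de_martonne(p, t) for p, t in zip(precip, temp)]

# Stand-in for the fitted distribution: an exponential CDF with the sample mean.
mean_dai = sum(dai) / len(dai)
scores = spti_series(dai, lambda x: 1.0 - math.exp(-x / mean_dai))
print([round(s, 3) for s in scores])
```

In practice the stand-in CDF would be replaced by the CDF of whichever distribution was selected for the station and time scale in question.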

Monte Carlo Feature Selection and Interdependency Discovery (MCFS-ID).
In the past two decades, substantial progress has been made in the area of feature ranking and selection for high-dimensional classification. Draminski et al. [30] gave an effective and well-presented method for ranking features with respect to their classification importance. Recently, a Bayesian technique for automatic relevance determination has been developed among the nonfilter approaches (see [37]). Apart from this, the importance of a variable (i.e., feature) can be inferred using random forests [38]. Determining the importance of variables is not required for the construction of a random forest, but it is a subroutine that can be run alongside the construction of the forest [39,40]. Feature ranking by variable significance can thus be considered a by-product of the classifier [41]. In the present approach, we rely heavily on a classifier, but we do not use it for the classification task per se. In fact, we use classification only to (i) rank features according to their importance, that is, their classification power, and (ii) find interdependencies between features.
In the MCFS-ID algorithm, the important features are estimated as follows. Specifically, s subsets of m features each are randomly selected from all d features, with m fixed and m ≪ d, and for every subset of features, t trees are built and their performance is assessed. Every one of the t trees in the inner loop is trained and evaluated on a different, randomly selected training and test set, produced by dividing the complete training data into two subsets. Further detailed steps of the algorithm are given in Figure 1.
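The sampling scheme described above can be sketched as follows. This is only an outline of the two nested loops, assuming s feature subsets of size m and t random train/test splits per subset; it does not build or evaluate any trees. The function name `mcfs_plan`, the `test_fraction` default, and the seed are hypothetical choices made for illustration.

```python
import random


def mcfs_plan(n_samples, d, m, s, t, test_fraction=1 / 3, seed=0):
    """Yield (feature_subset, train_idx, test_idx) for each of the s * t trees.

    The outer loop draws s random subsets of m features out of d; the inner
    loop draws t random splits of the samples into training and test sets.
    """
    if not 0 < m <= d:
        raise ValueError("m must satisfy 0 < m <= d")
    if n_samples < 2:
        raise ValueError("need at least two samples to split")
    rng = random.Random(seed)
    n_test = min(n_samples - 1, max(1, round(n_samples * test_fraction)))
    for _ in range(s):
        features = sorted(rng.sample(range(d), m))
        for _ in range(t):
            order = list(range(n_samples))
            rng.shuffle(order)
            yield features, sorted(order[n_test:]), sorted(order[:n_test])


# Example: 12 features (one per station), subsets of 4, 5 subsets, 3 splits each.
plan = list(mcfs_plan(n_samples=30, d=12, m=4, s=5, t=3))
print(len(plan), "trees would be trained")
```

Each yielded triple would be handed to a tree learner, and the tree's weighted accuracy on the test indices recorded for the importance computation.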
The relative importance of feature $g_k$, $RI_{g_k}$, is defined as

$$RI_{g_k} = \sum_{\tau=1}^{s \cdot t} (wAcc_\tau)^u \sum_{n_{g_k}(\tau)} GR(n_{g_k}(\tau)) \left( \frac{\text{no.in } n_{g_k}(\tau)}{\text{no.in } \tau} \right)^v,$$

where the outer sum runs over all $s \cdot t$ trees; the inner sum runs over all nodes $n_{g_k}(\tau)$ of the $\tau$th tree on which the split is made on feature $g_k$; $wAcc_\tau$ is the weighted accuracy of the $\tau$th tree; $GR(n_{g_k}(\tau))$ is the gain ratio for the node $n_{g_k}(\tau)$; no.in $n_{g_k}(\tau)$ is the number of samples in the node $n_{g_k}(\tau)$; no.in $\tau$ is the number of samples in the root of the $\tau$th tree; and $u$ and $v$ are fixed positive real values, set to 1 by default [30]. The normalizing factor no.in $\tau$, which has the same value for all $\tau$, is included mainly for computational reasons. Furthermore, $m$, $s$, and $t$ are three parameters to be set by the experimenter. The expected constraint is that $s$ is not too large for the chosen subset size $m$ of features selected for each series of $t$ experiments.
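As a small worked illustration of the relative importance score, the sketch below accumulates, for every split node, the tree's weighted accuracy raised to u times the node's gain ratio times the node's share of root samples raised to v. The record layout (`wacc`, `n_root`, `nodes`) and the two example trees are hypothetical; a real run would take these quantities from the trained trees.

```python
from collections import defaultdict


def relative_importance(trees, u=1.0, v=1.0):
    """Sum wAcc^u * GR(node) * (n_in_node / n_root)^v over all split nodes.

    Each tree is a dict with keys 'wacc' (weighted accuracy), 'n_root'
    (samples in the root) and 'nodes', a list of
    (feature, gain_ratio, n_in_node) tuples, one per split node.
    """
    ri = defaultdict(float)
    for tree in trees:
        weight = tree["wacc"] ** u
        for feature, gain_ratio, n_in_node in tree["nodes"]:
            ri[feature] += weight * gain_ratio * (n_in_node / tree["n_root"]) ** v
    return dict(ri)


# Two hypothetical trees splitting on hypothetical features "A" and "B".
trees = [
    {"wacc": 0.8, "n_root": 100, "nodes": [("A", 0.5, 100), ("B", 0.2, 40)]},
    {"wacc": 0.6, "n_root": 100, "nodes": [("B", 0.4, 100), ("A", 0.1, 50)]},
]
ri = relative_importance(trees)
ranking = sorted(ri, key=ri.get, reverse=True)
print(ranking, ri)
```

With u = v = 1, feature "A" collects 0.8·0.5·1 + 0.6·0.1·0.5 = 0.43 and feature "B" collects 0.8·0.2·0.4 + 0.6·0.4·1 = 0.304, so splits near the root of accurate trees dominate the score.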
Once the features have been ranked by the MCFS-ID algorithm, a natural question arises about potential interdependencies between informative features. Interdependence between features is frequently modelled using interactions, such as those in experimental design and analysis of variance [42]. Possibly the most widely used approach to identifying interdependencies is to find correlations between features or to find groups of features that behave in the same way [43]. In the present approach, the focus is on identifying the features that "support" each other in determining whether a sample belongs to a particular class. The interdependency (ID) graph is built from the information provided by all the $s \cdot t$ trees (see Figure 1). To see how an ID graph is created, recall that each node in each of the classification trees represents a feature on which a split is made. For each node in each classification tree, all its antecedent nodes along the path to which the node belongs can be taken into account; each pair of an antecedent node and the given node yields a directed edge, which connects two distinct features in a directed way. Edges are found along all paths in all $s \cdot t$ MCFS-ID trees, and the same edge may occur more than once, even in a single tree. The strength of the interdependence between two nodes, that is, between two features connected by a directed edge, is called the ID weight of that edge, or ID weight for short; it is equal to the gain ratio (GR) in the given node multiplied by a fraction of sample counts. Thus, for node $n_k(\tau)$ in the $\tau$th tree, $\tau = 1, \ldots, s \cdot t$, and its antecedent node $n_i(\tau)$, the ID weight of the directed edge from $n_i(\tau)$ to $n_k(\tau)$, denoted $w[n_i(\tau) \to n_k(\tau)]$, equals

$$w[n_i(\tau) \to n_k(\tau)] = GR(n_k(\tau)) \cdot \frac{\text{no.in } n_k(\tau)}{\text{no.in } n_i(\tau)},$$

where $GR(n_k(\tau))$ denotes the gain ratio for the node $n_k(\tau)$, no.in $n_k(\tau)$ is the number of samples in the node $n_k(\tau)$, and no.in $n_i(\tau)$ is the number of samples in the node $n_i(\tau)$.
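The edge weight can be illustrated with a short sketch that computes, for each (antecedent, node) pair, the gain ratio of the node times the ratio of sample counts. Summing the weights of repeated edges into one total per directed feature pair is an assumption made here for illustration, and the tuple layout and feature names are hypothetical.

```python
from collections import defaultdict


def id_weights(edges):
    """Accumulate ID weights per directed feature pair.

    Each edge is (parent_feature, child_feature, child_gain_ratio,
    n_child, n_parent); its weight is child_gain_ratio * n_child / n_parent.
    Repeated edges are summed (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for parent, child, gain_ratio, n_child, n_parent in edges:
        totals[(parent, child)] += gain_ratio * n_child / n_parent
    return dict(totals)


# Hypothetical edges collected from the trees; "A" -> "B" occurs twice.
edges = [
    ("A", "B", 0.5, 40, 100),
    ("A", "B", 0.2, 30, 60),
    ("B", "C", 0.4, 10, 40),
]
weights = id_weights(edges)
strongest = max(weights, key=weights.get)
print(strongest, weights)
```

Here the edge from "A" to "B" accumulates 0.5·40/100 + 0.2·30/60 = 0.3, while the edge from "B" to "C" gets 0.4·10/40 = 0.1; keeping only the largest totals gives the strongest edges of the graph.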

The Proposed Framework: MCFS-ID Algorithm-Based Selection Framework
A key objective of this study is to develop a new framework for the selection of meteorological stations by combining the MCFS-ID algorithm with SDI time-series data. To this end, this section describes the MCFS-ID-based computational procedure for the choice of meteorological stations. The proposed framework has two preliminary steps, which are shown in the flow chart (see Figure 2) and detailed as follows:
(1) Defining Region. This step determines the region selected for drought monitoring. A specific region is delineated for regional drought monitoring. A suitable selection of region strengthens accurate and efficient drought mitigation policies at the province or country level.
(2) Defining Meteorological Station. After a significant region has been selected, a suitable selection of meteorological stations (monitoring stations) is made. Long climatic records play a significant role in model structure and in measurable statistical inference. Accordingly, meteorological stations with a rich history of drought-monitoring observations are suggested.
After these two points have been defined and characterized, the stepwise execution of the proposed framework comprises three phases. Detailed explanations are given in the following sections.

Phase 1: The Choice of Drought Indices and Their Estimation.
This phase comprises the selection of a drought indicator from the list of all available drought indicators of the SDI procedure, together with its estimation procedure. Numerous studies have proposed various drought indicators for the standardized procedure of the drought index [24]. Section 2 gives a brief summary of various SDI indicators and their applications in various regions. Alongside SDI procedures, recent developments also focus on parametric and nonparametric estimation [44]. This phase is therefore important for accurate regional drought monitoring and analysis. Its foremost concern is the selection of the climatic parameters and the time scale for the estimation of the multiscalar drought index. This choice is conditional: it depends on climate, soil type, and tropical status, and different drought indices require different climatic parameters such as temperature, precipitation, solar radiation, and humidity. An optimized selection of drought indices and their estimation procedure can therefore contribute meaningfully to accurate and reliable drought monitoring. In particular, this step requires deep knowledge of the time scale of the multiscalar drought indices, which is designated here. For example, short time scales are suggested for meteorological data [45], whereas a longer time scale is recommended for the monitoring of agricultural and hydrological drought [46].

... time-series data of the SDI of various meteorological stations. By incorporating the MCFS algorithm, we were able to decide which stations are more important for the reanalysis purpose. The selection of important stations is based on relative importance (RI) values: a station with a higher RI value is considered more important. Furthermore, suitable graphics provide the information needed to select the most important station among the others. The colour intensity of a node is proportional to the corresponding feature's RI. The node size is proportional to the number of edges associated with that node. The level of darkness and the width of an edge are proportional to the ID weight of that edge.

Phase 3: The Choice of Meteorological Stations.
In this phase, a meteorological station is identified from the ID graph on the basis of the stations' RI values. The station with the highest RI value is considered the most important among the stations compared in the study. Through careful implementation of MCFS-ID, this study identifies important meteorological stations according to their importance. The first step is to find the RI values and the corresponding ID weights by configuring the MCFS-ID algorithm on the time-series data of the SDI.

Application
In this section, the application of the proposed framework is discussed. The preliminary application of the proposed framework is presented for Punjab Province of Pakistan (see Figure 3). Long-term time-series data of precipitation and temperature are required for index calculation. Therefore, secondary data on these variables were collected for 46 years, from January 1971 to December 2017. This data set satisfies the requirements of the World Meteorological Organization (WMO) and is used for the analysis in this study.

Data and Study Area.
The data were collected from Punjab Province of Pakistan, and the study area comprised 12 meteorological stations: Bahawalpur, Bahawalnagar, Faisalabad, Jhelum, Khanpur, Lahore, Mianwali, Multan, Murree, Rawalpindi, Sargodha, and Sialkot (see Figure 3). The agriculture sector depends significantly on the areas these stations cover, and most of the stations are of considerable importance for crops and farming. Punjab has richer agricultural attributes than the other provinces. The agriculture sector of Punjab Province therefore continues to play a central role in Pakistan's economy in terms of gross domestic product (GDP). However, several parts of the country are suffering from severe drought conditions due to the growing consequences of climate change and the effect of global warming. In 2018, moderate-to-severe drought appeared in the arid land of the Punjab regions. Its intensity also prevailed in other parts of the country, including northern areas; moreover, a direct effect of drought on rice crops has been observed.

Results.
This section presents the results for the proposed framework. The framework is proposed for drought monitoring and the categorization of stations, and its detailed description is given in Figure 2. The monthly data set was used to calculate the drought index at varying time scales. The 12 stations were taken into account, and the index was calculated for seven time scales (1, 3, 6, 9, 12, 24, and 48 months).

Estimation of Drought Indices.
Different probability distributions are used to estimate the drought indices at different time scales. In the estimation procedure, suitable probability distributions are fitted to the DAI series and evaluated using the propagate R package. The CDF of the distribution with the smallest BIC value is then standardized according to the approximation described in Section 2.1.
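The selection rule can be sketched as follows, assuming the textbook definition BIC = k ln(n) − 2 ln(L) and, for brevity, only two candidate distributions with closed-form maximum likelihood fits rather than the 32 used in the study. How the propagate package computes its criterion internally is not covered here, so the values from this sketch are not comparable with those reported in the tables. All function names and the sample data are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def fit_exponential(data):
    """Maximum likelihood fit; returns (log-likelihood, number of parameters)."""
    n = len(data)
    mean = sum(data) / n
    return -n * (math.log(mean) + 1.0), 1


def fit_normal(data):
    """Maximum likelihood fit; returns (log-likelihood, number of parameters)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0), 2


def select_by_bic(data, candidates):
    """Fit every candidate and return (name of the smallest BIC, all scores)."""
    scores = {}
    for name, fit in candidates.items():
        log_likelihood, n_params = fit(data)
        scores[name] = bic(log_likelihood, n_params, len(data))
    return min(scores, key=scores.get), scores


# Hypothetical positive series standing in for one station's DAI values.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
best, scores = select_by_bic(
    data, {"exponential": fit_exponential, "normal": fit_normal}
)
print(best, scores)
```

The same loop would be run once per station and per time scale, with the winning distribution's CDF passed on to the standardization step.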
This procedure is repeated for all DAI time scales. Table 1 shows the BIC values for all the time scales of the Sialkot station. The three-parameter (3P) Weibull distribution has the minimum BIC value (−692.1) at the one-month time scale; moreover, the Weibull distribution has several applications in hydrology and related disciplines [47], which makes it a better candidate for standardization. The generalized extreme value distribution has the lowest BIC value (−535.0) for the three-month time scale of the DAI. The second and third choices for DAI-3 are the generalized normal and Johnson SU distributions. In a similar way, DAI-6, DAI-9, DAI-12, DAI-24, and DAI-48 have gamma distribution with BIC = −543.1,

[Figure 2: Flow chart of the proposed framework.]

The Choice of the Meteorological Station under MCFS-ID.
The MCFS-ID algorithm-based selection framework is designed to identify the important meteorological stations in a specific region. Figure 4 shows the theoretical vs. empirical histograms of the selected distributions for all time scales of the DAI series. DAI-1 shows the closest agreement between theoretical and empirical histograms, whereas a significant discrepancy remains at other time scales. This discrepancy is natural and cannot be controlled because of the behaviour of the data. Recently, a probabilistic drought indicator has been developed to address this discrepancy issue [48]. However, addressing these discrepancies is beyond the scope of this paper. In this paper, the varying distribution concept is used for the estimation of drought indices [49]. Figure 5 shows the temporal behaviour of the SPTI at the various time scales. Following the same rationale, the SPTI is estimated for all other stations. Suitable information about the stations can be obtained from the graphics. The colour intensity of a node is proportional to the corresponding feature's RI. The size of a node is proportional to the number of edges related to that node. The width and level of darkness of an edge are proportional to the ID weight of that edge. The graphical representations are given for the 12 stations for the SPTI-1, SPTI-3, SPTI-6, SPTI-9, SPTI-12, SPTI-24, and SPTI-48 indices, respectively. We examine and analyze only the strongest ID weights. For SPTI-1 (see Figure 6(a), ID graph of SPTI-1), the node for the Lahore station is particularly large because of the number of edges related to it, while the colour intensity of the Jhelum node reflects its higher RI value; the darkness of each edge is proportional to the weight of that edge. Furthermore, for SPTI-48 (see Figure 6(g), ID graph of SPTI-48), the node for Sargodha is large because of its related edges, while the colour intensity for Sialkot is particularly high because of the higher RI value of this station; again, the darkness of the edges reflects the corresponding weights.
In Table 1, some statistics for precipitation and maximum and minimum temperatures are given for all selected stations, while Table 2 shows the BIC values for all stations at varying time scales. We are interested in finding out which stations are highly important. Table 3 shows that the more important stations have higher RI values than the other stations. For SPTI-1, the table identifies which six stations are the most important among the 12 selected stations. The results from Table 3 show

The findings from this study can be helpful for meteorological networks in particular regions from the data mining point of view and for analysis and reanalysis purposes. Moreover, it is useful to obtain information on dependencies among existing meteorological stations in a region. In the estimation procedure, we standardized the CDF of suitable distributions selected on the basis of the smallest BIC. The results calculated for the Sialkot station show that the 3P Weibull distribution is suitable for the one-month time scale, the generalized extreme value distribution is an appropriate choice for the three-month time scale of the DAI, and so on. Furthermore, the information is obtained by setting up an exhaustive framework based on the MCFS-ID algorithm. In addition to discovering dependencies among stations, the framework treats the station with the highest RI value as the strongest candidate.

Conclusion
Meteorological data play an important role in drought monitoring. The choice of meteorological stations in a specific region is therefore of substantial importance. In this study, using our proposed MCFS-ID-based method for identifying important meteorological stations, we found that the Jhelum station has higher relative importance for the SPTI-1 and SPTI-9 indices, while Sialkot has regional importance for studying the SPTI-3, SPTI-6, and SPTI-48 indices. In summary, our framework can discover dependencies among stations for up-to-date real-time drought-monitoring systems. It can be useful for making informed mitigation policies and for developing an early warning system for drought monitoring.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Ethical Approval
This study was conducted in accordance with the ethical standards of the responsible committee on human experimentation and with the latest (2008) version of the 1975 Helsinki Declaration.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.