Fuzzy Clustering-Based Ensemble Approach to Predicting Indian Monsoon

Indian monsoon is an important climatic phenomenon and a global climatic marker. Both statistical and numerical prediction schemes for Indian monsoon have been widely studied in the literature. Statistical schemes are mainly based on regression or neural networks. However, the variability of monsoon over the years is significant, and a single model is often inadequate. Meteorologists revise their models in different years based on prevailing global climatic events like El-Niño. These events often have a degree of severity associated with them. In this paper, we cluster the monsoon years based on their fuzzy degree of association with these climatic event patterns. Next, we develop individual prediction models for the year clusters. A weighted ensemble of these individual models is used to obtain the final forecast. The proposed method performs competitively with existing forecast models.


Introduction
Monsoon is a complex phenomenon of the climatic system. It is influenced by multiple climatic parameters and sea-atmosphere interactions. Prediction of monsoon is challenging due to the large variability present in its patterns. The Indian Meteorological Department (IMD) has issued forecasts of Indian summer monsoon rainfall (ISMR) since 1886. Indian monsoon forecasting was initiated by Blanford [1] as early as 1882. The success of the forecasts during 1882-1885 encouraged Blanford to design an operational long-range forecast model for monsoon in 1886. Subsequently, Walker [2] developed models studying the statistical correlations between rainfall and different global climate parameters. Thapliyal and Kulshrestha [3] introduced a regression model for predicting south-west Indian monsoon rainfall. Gowariker et al. [4] proposed a power regression model for long-term forecast of monsoon, which provided accurate forecasts for a long period but failed to predict the extreme condition of 2002. In 2004, Rajeevan et al. [5] reassessed different climatic parameters and introduced four new parameters to design a statistical model for issuing long-range forecasts of Indian monsoon. In 2007, Rajeevan et al. [6] built models using ensemble multiple regression and projection pursuit regression to forecast Indian rainfall, which proved superior to past IMD models. Schewe and Levermann [7] explain the change in distribution of Indian rainfall and the reasons behind the failure of monsoon in certain years. Wu et al. [8] propose a linear Markov model to predict short-term climate variability of East Asian monsoon. Fan et al. [9] develop two statistical prediction schemes for seasonal forecast of East Asian summer monsoon. The schemes take the direct outputs of the existing models and give better prediction of the summer monsoon.
Artificial neural networks (ANN) [10] are widely used in modelling the nonlinearity present in the monsoon process. Sahai et al. [11] use ANN techniques with error backpropagation to forecast Indian summer monsoon rainfall. Hong [12] predicts Indian summer monsoon utilizing a recurrent neural network and also demonstrates successful employment of support vector machines in solving nonlinear regression and time series problems. Three different backpropagation neural learning rules, namely, momentum learning, conjugate gradient descent learning, and Levenberg-Marquardt learning, are used by S. Chattopadhyay and G. Chattopadhyay [13] to perform a comparative study of different neural network methods for predicting rainfall time series.
The large variability in monsoon patterns makes it difficult for a single model to predict its distribution. A number of uncertainties, including boundary condition, parameter, and structural uncertainties, are involved in the construction of these models. Thus, it remains fundamentally challenging to rely on a single model for prediction. Multimodel ensembles, which combine the outcomes of different models, are proposed to overcome the weakness of a single model [14,15]. In addition, monsoon shows different characteristics over the years. There exist groups of years where the variation of climatic parameters and the pattern of rainfall are similar. We use fuzzy clustering to group similar years together and model them separately. The motivation behind using fuzzy clustering is that each year manifests a mixture of physical climatic events. We cannot hard cluster a year into a specific group; each year has a membership of belongingness to every cluster. Fuzzy clustering is used to capture the characteristics of the different events related to a year of study. We use the same set of climatic parameters as the predictor set for every cluster but frame different models for each cluster.
A number of prediction models, namely, multiple regression (MR), multilayer perceptron (MLP), recurrent neural network (RNN), and generalized regression neural network (GRNN) models, are used for prediction of Indian monsoon for the year clusters. There are sound reasons for using neural networks like MLP, RNN, and GRNN for modelling: (i) Indian monsoon is a complex process, which cannot be adequately modelled by linear models; (ii) nonlinearity in the time-series pattern can be well captured by neural network learning; (iii) climatic events are more closely related to parameter disturbances in nearby years than in distant years, and a neural network can weight the yearly parameters accordingly.
In this work, climatic parameters that are strongly correlated with Indian monsoon are identified at the outset, followed by fuzzy clustering of years into groups with a degree of belongingness of each year to the clusters. Then we model each cluster with four types of models, namely, MR, MLP, RNN, and GRNN, to forecast rainfall. The weighted ensemble of the forecasts given by the respective cluster models is taken as the final predicted rainfall. Analysis and comparisons are performed on aggregate Indian rainfall and, finally, a meteorological interpretation of the obtained clusters is presented.
The paper is organised in the following manner. We discuss the details of the data and predictor climatic parameters in Sections 2 and 3. The proposed clustering-based approach, prediction models, and ensemble technique are presented in Section 4, with experimental results in Section 5. Meteorological significance is discussed in Section 6 and, finally, conclusions are provided in Section 7.

Data Sets Used
We consider the annual Indian summer monsoon rainfall (ISMR), occurring in the four months of June, July, August, and September. Annual ISMR is considered during the period 1948-2013 for our study. The long period average (LPA) [...] [18], available at a resolution of 2.5° × 2.5°. Finally, Niño 3.4 data, which is the sea surface temperature anomaly for the spatial coverage of 5°S to 5°N and 170°W to 120°W in the Pacific Ocean region, is acquired from the National Center for Atmospheric Research (http://www.cpc.ncep.noaa.gov/products/analysismonitoring/ensostuff/ensoyears.shtml) [19]. All the above monthly data are considered for the period 1948-2013 in our study and analysis.

Global Climatic Parameters Influencing Indian Monsoon
Indian monsoon is strongly influenced by several global climatic parameters, occurring at places distant from the Indian subcontinent. Identification of predictor parameters relies on a physical understanding of the monsoon event and wind flow patterns. We have selected the climatic parameters based on the parameters used by the Indian Meteorological Department's models [5,6], studying their correlation with Indian summer monsoon rainfall (ISMR) during our period of study. In the data preprocessing phase, climatic anomaly data are evaluated by calculating the deviation of each parameter value from the long-term average value of that parameter, separately for each month, followed by a correlation study between ISMR and the climatic parameters for lags of zero to twelve months. For each parameter, we select the lagged predictor month having the highest correlation with ISMR. The predictor climatic parameters and their correlation values with Indian monsoon are shown in Table 1. Figure 1 shows the geographic location of the climatic parameters influencing Indian monsoon.
Predictor Sets of Climatic Parameters. Based on the correlation with Indian monsoon, we have built five predictor sets for forecasting. Different combinations of the identified climatic parameters (Table 1) form the predictor sets. The predictor sets are shown in Table 2.
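The anomaly and lagged-correlation screening described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the choice of June (month index 5) as the reference month from which lags are counted backwards, and the rule of picking the lag with the largest absolute correlation are all assumptions made for the example.

```python
from statistics import mean

def monthly_anomalies(values):
    """values[y][m]: parameter value for year index y, month m (0-11).
    Returns deviations from each calendar month's long-term mean."""
    n_years = len(values)
    clim = [mean(values[y][m] for y in range(n_years)) for m in range(12)]
    return [[values[y][m] - clim[m] for m in range(12)] for y in range(n_years)]

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def best_lag(anomalies, ismr, max_lag=12, reference_month=5):
    """Correlate ISMR of each year with the anomaly `lag` months before the
    reference month (assumed: June) and return (lag, correlation) with the
    largest absolute correlation."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        xs, ys = [], []
        for year, rain in enumerate(ismr):
            flat = year * 12 + reference_month - lag  # flat month index
            if flat < 0:
                continue  # lag reaches before the start of the record
            xs.append(anomalies[flat // 12][flat % 12])
            ys.append(rain)
        r = pearson(xs, ys)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

With this convention, a lag of 3 refers to the March anomaly of the same year, and a lag of 12 to the June anomaly of the previous year.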

Methodology
We propose fuzzy clustering of monsoon years into groups followed by building models for each group separately and finally predicting Indian summer monsoon rainfall (ISMR) as weighted ensemble of forecasts provided by cluster models.
The block diagram of the proposed fuzzy clustering-based approach to prediction of ISMR is shown in Figure 2. Detailed steps are described in the following subsections. Clustering the years is effective because it lets us build a separate model for each cluster. These cluster models are expected to be more accurate, as the variation within a cluster is smaller. Finally, the ensemble of forecasts of these cluster models results in better prediction of Indian monsoon. As an example, consider two clusters of years corresponding to strong El-Niño and North Atlantic Oscillation, respectively. A drought year is correlated with both events and hence might have a significant degree of belongingness to both clusters.

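Fuzzy Clustering of Monsoon Years.
Fuzzy $c$-means clustering is used for grouping similar years together. Fuzzy $c$-means (FCM) is a clustering method which allows one input instance to belong to more than one cluster, with some membership of belongingness. FCM attempts to partition a set of $n$ elements $X = \{x_1, \ldots, x_n\}$ into a collection of $c$ fuzzy clusters with centers $C = \{\mathrm{cen}_1, \ldots, \mathrm{cen}_c\}$ and a partition matrix $U = [u_{ij}]$, $u_{ij} \in [0, 1]$, $i = 1, \ldots, n$, $j = 1, \ldots, c$, where $u_{ij}$ gives the degree of belongingness of element $x_i$ to the cluster with center $\mathrm{cen}_j$. FCM aims to minimize the objective function in (1). The partition matrix and the centers are updated in accordance with (2) and (3), respectively:
\[ J = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \, \lVert x_i - \mathrm{cen}_j \rVert^{2}, \tag{1} \]
\[ u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_i - \mathrm{cen}_j \rVert}{\lVert x_i - \mathrm{cen}_k \rVert} \right)^{2/(m-1)}}, \tag{2} \]
\[ \mathrm{cen}_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}, \tag{3} \]
where $m > 1$ denotes the level of cluster fuzziness.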
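The alternating updates of fuzzy c-means can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard algorithm under stated assumptions (random initial memberships, Euclidean distance, fuzziness 2, a fixed iteration cap and tolerance); the paper does not specify these settings, and the function name `fcm` is hypothetical.

```python
import random

def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means. points: list of equal-length numeric lists.
    Returns (centers, memberships); each membership row sums to 1."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Random initial partition matrix, rows normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Center update: mean of the points weighted by u_ij^m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            tot = sum(w)
            centers.append([sum(w[i] * points[i][k] for i in range(n)) / tot
                            for k in range(d)])
        # Membership update from relative distances to all centers.
        new_u = []
        for i in range(n):
            dist = [max(sum((points[i][k] - centers[j][k]) ** 2
                            for k in range(d)) ** 0.5, 1e-12)
                    for j in range(c)]
            new_u.append([1.0 / sum((dist[j] / dist[l]) ** (2.0 / (m - 1.0))
                                    for l in range(c))
                          for j in range(c)])
        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(n) for j in range(c))
        u = new_u
        if change < tol:
            break  # memberships have stabilised
    return centers, u
```

In the setting of this paper, each point would be the vector of predictor anomalies for one monsoon year and `c` would be 3; those inputs are not reproduced here.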

Prediction Models.
Multiple regression and three models of artificial neural networks (ANN), namely, multilayer perceptron, recurrent neural network, and generalized regression neural network, are used to design prediction models for each cluster exclusively. A forecast of annual ISMR is provided by each cluster model separately and also by the ensemble of all the cluster models' forecasts. We describe below the models used.

Multiple Regression (MR).
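The multiple regression model is used to learn the relationship between several independent predictor variables ($x_j$s) and a dependent variable ($y$):
\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, \tag{4} \]
where $x_{ij}$ is the $i$th observation of the $j$th independent variable, the first independent variable takes the value 1 for all $i$ (so that $\beta_1$ acts as the intercept), $\beta_j$ are the regression coefficients, and $\epsilon_i$ represents the residual.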
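A least-squares fit of such a model can be sketched as below. The sketch assumes ordinary least squares solved through the normal equations, which is one common way to estimate the coefficients; the source does not state the estimation procedure, and `fit_mr` and `predict_mr` are hypothetical helper names.

```python
def fit_mr(X, y):
    """Ordinary least squares via the normal equations.
    X: rows of predictor values (without the constant column); y: targets.
    Returns [intercept, coef_1, ..., coef_p]."""
    A = [[1.0] + list(row) for row in X]  # prepend the constant predictor
    p = len(A[0])
    # Augmented matrix for (A^T A) beta = A^T y.
    M = [[sum(r[a] * r[b] for r in A) for b in range(p)]
         + [sum(r[a] * t for r, t in zip(A, y))]
         for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

def predict_mr(beta, row):
    """Evaluate the fitted linear model for one row of predictors."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))
```

The normal equations are adequate for a handful of predictors; for strongly correlated predictors a QR or SVD based solver would be numerically safer.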

Multilayer Perceptron Neural Network (MLP).
Multilayer perceptron neural network is a class of ANN where connections between the neurons do not form a directed cycle. In this network, the information propagates in only one direction, from input nodes, through hidden nodes, to the output nodes. The independent and dependent variables constitute the input and output layers, respectively. The number of hidden layers and their corresponding nodes must be determined empirically for each prediction task. Four different parameter sets, chosen empirically for the models designed to forecast ISMR, are shown in Table 3.

Recurrent Neural Network (RNN).
Recurrent neural network is a class of ANN which creates an internal state of the network to exhibit dynamic temporal behaviour. Climatic changes or events occurring in the same or nearby time periods are highly correlated. Similarly, rainfall patterns are more strongly correlated with influencing factors in nearby years than in distant years. This phenomenon is well captured by RNN, which during training gives weights in decreasing order to the values from near to distant years. Thus, it assists in modelling the system dynamics in a more natural manner. The same parameter sets as for the MLP network (Table 3) are considered, with a delay span of 2 units.

Generalized Regression Neural Network (GRNN).
Generalized regression neural network is a variant of the radial basis function network. GRNN has three layers of artificial neurons: input, hidden, and output. The hidden layer has radial basis neurons, while neurons in the output layer have a linear transfer function. The output of each radial basis neuron depends on the distance of the input from the neuron's center, scaled by the spread factor. Given $N$ input-output pairs $(x_i, y_i) \in \mathbb{R}^n \times \mathbb{R}^1$, with $n$ input variables and $i = 1, 2, \ldots, N$, $h_i$ represents the output from each hidden unit. The GRNN output for a test point, $x \in \mathbb{R}^n$, is described by
\[ \hat{y}(x) = \frac{\sum_{i=1}^{N} y_i \, h_i}{\sum_{i=1}^{N} h_i}, \tag{5} \]
where
\[ h_i = \exp\left( -\frac{\lVert x - x_i \rVert^{2}}{2\sigma^{2}} \right) \tag{6} \]
and $\sigma$ is the spread factor. The reasons behind modelling using GRNN are (i) only one tunable design parameter (the spread factor), (ii) a one-pass algorithm (less time consuming), and (iii) accurate approximation of functions from sparse data.
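Because GRNN needs no iterative training, a prediction reduces to a kernel-weighted average of the training targets. The sketch below uses the standard Gaussian-kernel form of GRNN; the function name `grnn_predict`, the default spread, and the fallback to the plain mean when all kernel values underflow are assumptions made for this example.

```python
import math

def grnn_predict(train_x, train_y, x, spread=1.0):
    """Standard GRNN output: average of training targets weighted by a
    Gaussian kernel of the distance between x and each training input."""
    h = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi))
                  / (2.0 * spread ** 2))
         for xi in train_x]
    total = sum(h)
    if total == 0.0:
        # All kernel values underflowed; fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(hi * yi for hi, yi in zip(h, train_y)) / total
```

A small spread makes the prediction follow the nearest training year closely, while a large spread pulls it towards the mean of all training targets, which is why the spread is the single parameter to tune.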

Ensemble of Predictors.
The complexity of the monsoon process makes it difficult for a single model to predict rainfall accurately. We design separate models for each cluster of years obtained by fuzzy clustering, using the four predictors described in Section 4.3. Finally, annual ISMR is presented as a weighted ensemble of the forecasts of the models designed for each cluster. The weight is taken as the fuzzy membership of belongingness of the test year in the different clusters:
\[ F_i = \frac{\sum_{j=1}^{c} u_{ij} \, P_j}{\sum_{j=1}^{c} u_{ij}}, \tag{7} \]
where $F_i$ is the ensemble forecast for the $i$th test year, $P_j$ represents the prediction given by the model for cluster $j$, $u_{ij}$ is the fuzzy membership of the $i$th test year to cluster $j$, and $c$ is the total number of clusters. When the memberships of a year sum to one, as in FCM, the denominator equals one.
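The membership-weighted combination amounts to a one-line weighted average. The snippet below is a small sketch with hypothetical numbers; `ensemble_forecast` is an illustrative name, and the division by the membership sum is included only so that the function also behaves sensibly if the weights are not normalised.

```python
def ensemble_forecast(memberships, cluster_predictions):
    """Weighted average of per-cluster forecasts for one test year,
    using that year's fuzzy memberships as weights."""
    total = sum(memberships)
    return sum(u * p for u, p in zip(memberships, cluster_predictions)) / total

# Hypothetical example: memberships of one test year in three clusters and
# the three cluster models' forecasts in percent of LPA.
forecast = ensemble_forecast([0.6, 0.3, 0.1], [100.0, 90.0, 110.0])
```

Here the year belongs mostly to the first cluster, so the ensemble value stays close to that cluster's forecast while still reflecting the other two.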

Validation of Proposed Approach.
The study is performed on data for the period 1948-2013. Fuzzy clustering is performed over the period to cluster it into three groups. The number of clusters is decided based on cluster quality. Separate prediction models are designed for all three clusters, and the ensemble of forecasts of these models is provided as the predicted Indian summer monsoon rainfall. The test period 2001-2013 is considered to evaluate the forecasting skill of our proposed approach.
The forecast models for annual ISMR are chiefly evaluated in terms of mean absolute error. Other error statistics, namely, root mean square error, prediction yields, Pearson correlation, and Willmott index of agreement, are also evaluated to judge the efficacy of our proposed approach for prediction. They are described below.

(i) Mean Absolute Error (MAE). Mean absolute error for prediction of annual ISMR is calculated in the following way:
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert a_i - p_i \rvert, \tag{8} \]
where $a$ and $p$ are the actual and predicted ISMR series for the test period and $n$ denotes the total number of test years.
(ii) Root Mean Square Error (RMSE). Root mean square error measures the differences between the model-predicted output and the actual values. It is a good measure for comparing the forecasting errors of various models:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (a_i - p_i)^{2}}, \tag{9} \]
where $a$ and $p$ are the actual and predicted ISMR series for the test period and $n$ is the number of test years.
(iii) Prediction Yield (PY). Prediction yields are evaluated at three different error categories (5%, 10%, and 15% errors) to assess the overall prediction results by judging the percentage of predicted years within each allowed range of errors.
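(iv) Pearson Correlation Coefficient (PC). Pearson correlation coefficient measures the strength of linear association between actual and predicted values, where a value of 1 means a perfect positive correlation and a value of −1 means a perfect negative correlation:
\[ \mathrm{PC} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^{2} \, \sum_{i=1}^{n} (p_i - \bar{p})^{2}}}, \tag{10} \]
where $a$ and $p$ are the actual and predicted ISMR series for the test period and $\bar{a}$ and $\bar{p}$ are their corresponding means.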
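(v) Willmott Index of Agreement (WI). Willmott index of agreement is a standardized measure of the degree of model prediction error. It varies between 0 and 1, with higher values indicating a better fit of the model for prediction:
\[ \mathrm{WI} = 1 - \frac{\sum_{i=1}^{n} (a_i - p_i)^{2}}{\sum_{i=1}^{n} \left( \lvert p_i - \bar{a} \rvert + \lvert a_i - \bar{a} \rvert \right)^{2}}, \tag{11} \]
where $a$ and $p$ are the actual and predicted ISMR series for the test period and $\bar{a}$ is the mean of the actual series.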
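The five verification statistics are straightforward to compute once the actual and predicted series are available. The sketch below uses the standard textbook definitions; the function names are illustrative, and measuring the prediction-yield error relative to the actual value of each year is an assumption, since the source does not state the reference quantity.

```python
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def prediction_yield(actual, predicted, tolerance_pct):
    """Percent of years whose absolute error, relative to the actual value
    (assumed convention), is within tolerance_pct percent."""
    hits = sum(1 for a, p in zip(actual, predicted)
               if abs(a - p) / abs(a) * 100.0 <= tolerance_pct)
    return 100.0 * hits / len(actual)

def pearson(actual, predicted):
    ma = sum(actual) / len(actual)
    mp = sum(predicted) / len(predicted)
    num = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    den = math.sqrt(sum((a - ma) ** 2 for a in actual)
                    * sum((p - mp) ** 2 for p in predicted))
    return num / den

def willmott(actual, predicted):
    ma = sum(actual) / len(actual)
    num = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    den = sum((abs(p - ma) + abs(a - ma)) ** 2
              for a, p in zip(actual, predicted))
    return 1.0 - num / den

# Hypothetical four-year series in percent of LPA, for illustration only.
actual = [100.0, 90.0, 110.0, 95.0]
predicted = [98.0, 93.0, 108.0, 100.0]
summary = (mae(actual, predicted), rmse(actual, predicted),
           prediction_yield(actual, predicted, 5.0))
```

When the series are expressed in percent of LPA, as in the figures of this paper, MAE and RMSE are read directly as percentages.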

Experimental Results and Analysis
In this section we present the evaluation of our proposed fuzzy clustering-based approach. We first present the results of fuzzy clustering of the monsoon years for different predictor sets. Forecasting skills are evaluated for all cluster models and the ensemble model in terms of mean absolute errors for the test period 2001-2013. In addition, other measures like root mean square errors in prediction, correlation between predicted and actual rainfall, prediction yields, and agreement index between actual and predicted rainfall are also estimated to establish the efficiency of our proposed approach to prediction of Indian summer monsoon rainfall.

Clustering of Monsoon Years.
Fuzzy clustering is performed over the period 1948-2013 to cluster the data into three clusters. We have performed an $\alpha$-cut, with value $\alpha = 0.3$, to assign the data instances to the clusters. The value is ascertained empirically such that the distribution of elements within clusters is regular. A data instance can be assigned to more than one cluster simultaneously. The cluster sizes for the various predictor sets are shown in Table 4.
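The cut-based assignment can be illustrated with a short sketch. It assumes that a year joins every cluster in which its membership is at least the cut value (the source does not say whether the comparison is strict), and the membership values below are hypothetical.

```python
def alpha_cut(memberships, alpha=0.3):
    """Assign each instance to every cluster where its fuzzy membership
    reaches the cut value. Returns one list of instance indices per cluster."""
    n_clusters = len(memberships[0])
    clusters = [[] for _ in range(n_clusters)]
    for i, row in enumerate(memberships):
        for j, u in enumerate(row):
            if u >= alpha:
                clusters[j].append(i)
    return clusters

# Hypothetical memberships of three years in three clusters.
example = alpha_cut([[0.5, 0.4, 0.1],
                     [0.2, 0.2, 0.6],
                     [0.34, 0.33, 0.33]])
```

With three clusters and a cut of 0.3, a year with nearly uniform memberships lands in all three clusters, which is why the cluster sizes can add up to more than the number of years.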

Prediction Accuracy.
We predict annual rainfall for all five predictor sets (Table 2) separately using four models, namely, MR, MLP, RNN, and GRNN. The test period is 2001 to 2013.

Multiple Regression Model (MR).
Multiple regression models are built for every cluster by ascertaining the optimal training period for each predictor set. The optimal training period is found by varying the training years and selecting those that give the least absolute error in prediction during the validation period (1984-1993). Individual cluster-based as well as weighted ensemble models are considered for prediction. Table 5 gives the mean absolute error for individual cluster-based and ensemble models for the test period 2001-2013. The model provides a mean absolute error of 6.2% for PredSet4 (Table 2). It is observed that the ensemble model outperforms all the single cluster models for every predictor set. Figure 3 shows the interannual variability of actual and ensemble predicted rainfall as percent of long period average (LPA).
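One way to search for a training period is to vary the first training year and keep the start that gives the lowest mean absolute error on the validation years. The sketch below assumes that every candidate training window ends immediately before the validation period and that only its start is varied; the source does not describe the search in this detail, and `best_training_start`, `fit`, and `predict` are hypothetical names for a generic model interface.

```python
def best_training_start(years, X, y, fit, predict, val_years, min_train=3):
    """Try every training window [start, first validation year) and return
    (start_year, validation_mae) for the window with the lowest error."""
    val_idx = [i for i, yr in enumerate(years) if yr in val_years]
    first_val = min(val_idx)
    best = None
    for start in range(0, first_val - min_train + 1):
        model = fit(X[start:first_val], y[start:first_val])
        err = sum(abs(predict(model, X[i]) - y[i]) for i in val_idx) / len(val_idx)
        if best is None or err < best[1]:
            best = (years[start], err)
    return best
```

Any model with a fit and a predict step can be plugged in; for the regression case the callables would wrap a least-squares fit and its linear prediction.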

Multilayer Perceptron Neural Network Model (MLP).
Multilayer perceptron neural network model is designed with the four different sets of parameters described in Table 3. Mean absolute errors of all cluster and ensemble models are shown in Table 6. The MLP model reports an error of 4.0% for PredSet4 (Table 2) with MLP parameters ParSet1 (Table 3). The actual rainfall and the rainfall predicted by the cluster models and the ensemble model are shown in Figure 4. Ensemble predicted rainfall closely follows actual rainfall.

Recurrent Neural Network Model (RNN).
Mean absolute errors for prediction of annual rainfall by the recurrent neural network model for the test period 2001-2013 are presented in Table 7. PredSet3 (Table 2) with RNN parameters ParSet1 (Table 3) gives an error of 5.1%. RNN assigns weights to the training years that decrease with their distance from the test year. The pattern of actual and ensemble predicted rainfall in terms of percentage of LPA is shown in Figure 5.

Generalized Regression Neural Network Model (GRNN).
The mean absolute errors of the generalized regression neural network ensemble and individual cluster models are presented in Table 8. The model reports an error of 6.1% for PredSet3 (Table 2). The GRNN ensemble forecast and the actual rainfall are compared in Figure 6.

Meteorological Analysis
Next, we try to visualize each cluster in terms of physical climatic events. The clusters obtained by fuzzy clustering are physically interpreted as being characterized by some global climatic events. The climatic events considered and studied during the time period 1948 to 2013 (the period considered for clustering in our work) are El-Niño, La-Niña (http://ggweather.com/enso/oni.htm), positive and negative Indian Ocean dipole (http://bom.gov.au/climate/IOD), drought, and flood, shown in Table 11.
(i) Support. Support is defined as the percentage of the total number of years in the cluster that correspond to the climatic event:
\[ \mathrm{Support} = \frac{n_{ce}}{N} \times 100, \tag{12} \]
where $n_{ce}$ denotes the number of years associated with a specific climatic event in the cluster and $N$ is the total count of years in the cluster.
(ii) Confidence. Confidence is defined as the percentage of years associated with the climatic event in the cluster relative to the total number of such event years:
\[ \mathrm{Confidence} = \frac{n_{ce}}{N_{ce}} \times 100, \tag{13} \]
where $N_{ce}$ is the number of years associated with the climatic event during the period 1948-2013.
We relate a cluster to a physical climatic event described in Table 11 if both support and confidence measures attain the corresponding thresholds. The thresholds are chosen such that 50% of the years of study are under consideration. A low threshold weakens the significance of relating a climatic event to a particular cluster; on the other hand, retaining fewer years requires high threshold values, which in turn would leave out most of the clusters. Therefore, as a compromise between the extremes, 50% of years are considered. Figure 9 shows histograms with confidence and support as bins of year-count before and after the thresholding process, respectively, for predictors PredSet1 (Table 2). The threshold values obtained for the predictor sets are presented in Table 12. For each predictor set, we associate the clusters with physical climatic events if they satisfy both support and confidence thresholds. The climatic events corresponding to each cluster are shown in Table 13. The results establish the coexistence of La-Niña and flood events. They also highlight the high probability of El-Niño, drought, and positive IOD events occurring simultaneously.
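Support and confidence for one cluster and one climatic event can be computed from two sets of years. The sketch below follows the definitions given in this section; the function name is illustrative and the year lists in the example are made-up placeholders, not the clusters or event years of the study.

```python
def support_confidence(cluster_years, event_years):
    """Support: percent of the cluster's years that are event years.
    Confidence: percent of all event years that fall in the cluster."""
    cluster, event = set(cluster_years), set(event_years)
    common = len(cluster & event)
    support = 100.0 * common / len(cluster)
    confidence = 100.0 * common / len(event)
    return support, confidence

# Placeholder years for illustration only.
support, confidence = support_confidence(
    cluster_years=[1951, 1957, 1965, 1972, 1988],
    event_years=[1957, 1965, 1972, 1982],
)
```

A cluster would then be linked to the event only when both values reach their thresholds.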

Conclusion
Monsoon is an important phenomenon for the economic development of an agrarian country like India. The large variability of monsoon over the years makes prediction of rainfall a challenging task. This paper attempts to address the problem by clustering the years into similar groups and providing a multimodel ensemble forecast of Indian summer monsoon rainfall.
Different climatic parameters with their best correlated month values are identified, and five different predictor sets are built for prediction of Indian monsoon. Four different models, namely, MR, MLP, RNN, and GRNN, are designed for each cluster exclusively. The final forecast is provided by a weighted ensemble of the forecasts of each cluster's model, where the weight is the fuzzy membership of belongingness in each cluster. The multilayer perceptron ensemble model provides a mean absolute error of 4.0% for prediction of annual rainfall, which is appreciable for forecasting the complex monsoon process. The proposed fuzzy clustering-based ensemble approach surpasses the conventional approach without clustering. Performance of the proposed clustering-based ensemble models is superior to the existing IMD models [4,5]. The error statistics also establish the superiority of the multilayer perceptron model over the other three proposed models. Lastly, in a meteorological context, the clusters are linked with global climatic events.
In the future, a larger number of climatic parameters influencing Indian monsoon can be explored, and different predictor sets can be used for different clusters of years to provide even better forecasting accuracy.

4.1. Motivation: Variability of Monsoon Patterns. Trends and distributions of monsoon vary to a large extent over the years. It is thus necessary to group the years into clusters which have similar patterns of predictor climatic parameters affecting monsoon.


Figure 2: Proposed fuzzy clustering-based ensemble approach for prediction of Indian summer monsoon rainfall.

Figure 3: Performance of forecasts by the proposed fuzzy clustering-based ensemble model and its three respective cluster models by MR for PredSet4. The deep and light purple bars represent the actual and predicted ISMR in terms of percent of LPA. The symbols represent forecasts given by individual cluster models. The results are shown for test period 2001-2013.

Figure 4: Performance of forecasts by the proposed fuzzy clustering-based ensemble model and its three respective cluster models by MLP for PredSet4. The deep and light purple bars represent the actual and predicted ISMR in terms of percent of LPA. The symbols represent forecasts given by individual cluster models. The results are shown for test period 2001-2013.
Figure 6 shows the interannual variations of the rainfall forecast by the GRNN ensemble model along with the actual rainfall pattern in terms of percentage of LPA for the period 2001-2013. It is observed that the predicted values are close to the actual rainfall patterns. Predictions by the models designed for the individual clusters are shown by different symbols.

Figure 5: Performance of forecasts by the proposed fuzzy clustering-based ensemble model and its three respective cluster models by RNN for PredSet3. The deep and light purple bars represent the actual and predicted ISMR in terms of percent of LPA. The symbols represent forecasts given by individual cluster models. The results are shown for test period 2001-2013.

Figure 6: Performance of forecasts by the proposed fuzzy clustering-based ensemble model and its three respective cluster models by GRNN for PredSet3. The deep and light purple bars represent the actual and predicted ISMR in terms of percent of LPA. The symbols represent forecasts given by individual cluster models. The results are shown for test period 2001-2013.

Figure 8 shows the El-Niño and La-Niña years associated with drought, normal, and excess rainfall years during 1948-2013. The years having rainfall 10% above LPA are excess rainfall years and years having rainfall 10% below LPA are drought years. The El-Niño and La-Niña years are shown by color codes (light green and green) in the figure. The chart helps to visualize the co-occurrence of El-Niño and La-Niña events with extremes of ISMR.
6.1. Measuring Association between Climatic Events and ISMR. Support and confidence measures are considered to relate physical climatic events to the clusters generated by fuzzy clustering. Both measures are defined in Section 6.

Figure 9: Histogram of the confidence and support measures as bins of year-count before (a) and after (b) thresholding for PredSet1.

Table 1: Climatic parameters (CP) influencing Indian monsoon with geographical location, correlation values, and correlated month (0 signifies same year and −1 signifies previous year).

Figure 1: Climatic parameters over the globe governing Indian monsoon (purple patches signify the location of climatic parameters taken, and blue patch represents the Indian region); CP$i$ represents parameter $i$ in Table 1.

Table 2: Predictor sets with climatic parameters.

Table 3: Model parameter setting for MLP models.

Table 4: Cluster size (number of years) by fuzzy $c$-means clustering with $\alpha$-cut of 0.3 over the period 1948-2013.

Table 5: Mean absolute errors (%) for annual Indian summer monsoon rainfall prediction by individual MR cluster models and ensemble model for test period 2001-2013. The minimum error is 6.2%.

Table 6: Mean absolute errors (%) for annual Indian summer monsoon rainfall prediction by individual MLP cluster models and ensemble model for test period 2001-2013. The minimum error is 4.0%.

Table 7: Mean absolute errors (%) for annual Indian summer monsoon rainfall prediction by individual RNN cluster models and ensemble model for test period 2001-2013. The minimum error is 5.1%.

Table 8: Mean absolute errors (%) for annual Indian summer monsoon rainfall prediction by individual GRNN cluster models and ensemble model for test period 2001-2013. The minimum error is 6.1%.

Table 9 shows different forecast verification statistics for the ensemble models during the test period 2001-2013. We summarize the observations below.
(i) Root Mean Square Error (RMSE). The MLP ensemble model gives an RMSE of 5.3%, followed by the RNN ensemble model with 6.4%. The GRNN and MR models give RMSE of 7.4% and 8.4%, respectively.

Table 10: Comparison of absolute errors for rainfall prediction by the proposed ensemble models (Ensml) with clustering (WC) against the standard approach using the same models without clustering (NC).

Table 11: Physical climatic events under study.

Table 12: Threshold of support and confidence measures for associating obtained clusters with physical climatic events.

Table 13: Identified physical climatic events associated with the clusters obtained by fuzzy clustering.