Modeling the Relationship between Rice Yield and Climate Variables Using Statistical and Machine Learning Techniques

+is paper presents the application of a multiple number of statistical methods and machine learning techniques to model the relationship between rice yield and climate variables of amajor region in Sri Lanka, which contributes significantly to the country’s paddy harvest. Rainfall, temperature (minimum and maximum), evaporation, average wind speed (morning and evening), and sunshine hours are the climatic factors considered for modeling. Rice harvest and yield data over the last three decades and monthly climatic data were used to develop the prediction model by applying artificial neural networks (ANNs), support vector machine regression (SVMR), multiple linear regression (MLR), Gaussian process regression (GPR), power regression (PR), and robust regression (RR). +e performance of each model was assessed in terms of the mean squared error (MSE), correlation coefficient (R), mean absolute percentage error (MAPE), root mean squared error ratio (RSR), BIAS value, and the Nash number, and it was found that the GPR-based model is the most accurate among them. Climate data collected until early 2019 (Maha season of year 2018) were used to develop the model, and an independent validation was performed by applying data of the Yala season of year 2019. +e developed model can be used to forecast the future rice yield with very high accuracy.


Introduction
e ability to predict the future crop yield facilitates the responsible authorities to take the most appropriate decisions in order to ensure food security. As the human population continues to increase, which requires the efficient utilization of lands, enhancing the yield would be more important than increasing the farming area. Climate is one of the primary factors beyond human control, which determines the crop yield. In this sense, modeling and prediction of crop yield by considering the climate variables has become an interesting research topic.
Use of statistical techniques as well as machine learning algorithms to identify the relationship between past climate variables and yield data is presented in the literature [1][2][3]. As rice is a primary source of food for more than half the world's population, numerous research approaches were proposed for predicting the rice yield [4]. Similar research studies were conducted to model the relationship between the climatic factors and the yield of some other crops such as barley [5], corn [6], sugar cane [7], citrus [8], tea [9], coconut [10], sorghum [11], maize, and soybean [12]. A multiple number of climatic factors were considered in such research studies for the application of statistical methods and machine learning techniques.
Regression techniques, support vector machines (SVMs), and artificial neural networks (ANNs) are some of the techniques applied to model the relationship between rice yield and climate variables. ANN was applied with some climate parameters (precipitation, minimum temperature, average temperature, and maximum temperature) and reference crop evapotranspiration and yield over four years to predict rice yield in Maharashtra State, India [13]. is model was validated with an accuracy of 97.5%, a sensitivity of 96.3%, and specificity of 98.1% by developing a multilayer perceptron neural network. A similar research work performed on the data of several paddy grown areas in Sri Lanka proved that ANN model (with MSE < 0.386) can be used with less computational time to predict the future paddy yield based on future climatic data [14]. An advanced application of ANN integrated with multiple linear regression (MLR) and penalized regression models for prediction of rice yield based on weather parameters at the west coast of India was presented by Das et al. [15]. Its normalized root mean square error varied between 0.98 and 36.7%. Upland rice yield responses in Sahel, West Africa, were modeled to climate factors, by using several techniques, namely, MLR, boosted tree regression, and ANN [16]. As per the results, ANN outperformed the other two techniques and the research findings concluded that rainfall, not temperature, was the main climate driver of the rice yield in Sahel. A hybrid MLR-ANN model produced better accuracy than the conventional models, namely, ANN, MLR, support vector regression (SVR), k-nearest neighbour (KNN), and random forest (RF) [17].
SVM is another commonly used machine learning technique applied to model the relationship between rice yield and climate variables. e applicability of SVM in determining the relative influence of several climate factors on paddy rice yield in Southwest China was investigated, and it was found that SVMs outperformed ANN and MLR [18]. e relationship between climate variables and rice yield was quantified by applying MLR, principal component analysis, and SVM on 36 years of climate and yield data in Southwest Nigeria [19]. It concluded solar radiation as the climate variable of the highest influence on rice yield, which maximized yield during monsoon and postmonsoon periods. Testing eleven combinations of phenology, climate, and geography data to predict the site-based rice yields in South China using MLR and advanced machine learning methods like backpropagation neural network, SVM, and RF is presented in [20]. It was shown that machine learning methods were more precise than MLR, and SVM produced the highest precision in yield prediction. A hybrid SVR technique was applied to predict rice yield based on the climate and agricultural data in Taiwan from 1995 to 2015, resulting in an average RMSE and R 2 of 60 and 0.996 [21].
Statistical techniques such as MLR and Gaussian process regression (GPR) have also been used for prediction of rice yield. GPR has proven to be more accurate than SVM with an R 2 > 0.75 and yield error less than 10%, when they were applied to predict winter wheat yield in China based on climate and soil data [22]. However, the authors could not find any research publication on the application of GPR to build a relationship between rice yield and climatic data. Application of MLR to estimate the yield of crops such as sugar cane [7], citrus [8], and tea [9] was presented in the literature. Ji et al. compared the effectiveness of MLR models with ANN models for rice yield predictions in the mountainous Fujian Province of China [23]. Based on the values of R 2 and the RMSE, they justified the superiority of ANN models (R 2 � 0.67, RMSE � 891) for accurate yield prediction over MLR models.
As per the literature, ANN, support vector machine regression (SVMR), and MLR were applied to predict rice yield accurately based on the climate variables. In this paper, a research study conducted to model the relationship between rice yield and climate variables of a major province in Sri Lanka, which contributes significantly to the paddy harvest in the country, is presented. In addition to the aforementioned techniques, GPR, power regression (PR), and robust regression (RR), which have not been used or rarely used, were considered in this research. As rice is the staple food in Sri Lanka, identifying a suitable technique for predicting the yield is important in numerous ways. Section 2 presents a description of the data used for this research and a statistical analysis of them. A brief introduction to the techniques used for modeling and the criteria used for evaluating their performance are also demonstrated. In Section 3, the research findings are presented, results are analyzed, and the proposed models are validated. Finally, conclusions are presented in Section 4.

Data Collection.
Rice harvest, yield, and climatic data of two districts in Sri Lanka, namely, Kurunegala and Puttalam, over the last three decades were collected from the Department of Census and Statistics and the Department of Meteorology of Sri Lanka. Paddy yield data of the two major agricultural seasons (Yala and Maha) were considered. Yala season spans from May to August, while Maha season spans from the September to March of the following year. Rainfall, minimum temperature, maximum temperature, evaporation, average wind speed (morning and evening), and sunshine hours are the climatic factors considered for modeling. ese monthly climatic data except rainfall were averaged for each season in both districts, and they were used with the total rainfall of each season. e nonlinear relationship between paddy yield and the climatic parameters was defined as given in the following equation: (1) where RF is the rainfall in mm, T min is the minimum temperature in°C, T max is the maximum temperature in°C, E is the evaporation in mm, SH is the number of sunshine hours, and AWS m is the average wind speed in the morning in km/h and AWS e is that in the evening.

Analysis of Data.
It is important to understand the variation of rice yield and the harvest over the three decades so as to focus on the priority of the study [24]. e data produced values of 3.6 t/ha and 3.96 t/ha for the yield in Kurunegala Yala and Maha seasons, respectively, and those for the Puttalam Yala and Maha seasons were found to be 3.63 t/ha and 3.78 t/ha in order (Table 1). Net harvested area in Kurunegala district is about 65,000 hectares, while it is only about 10,000 hectares in Puttalam district. e highest harvest was produced in Kurunegala district during the Maha season with a mean of 230.5 tons and a standard deviation of 88 tons, followed by the Yala season of the same district with a mean of 132.9 tons and a standard deviation of 68 tons (Figure 1). e harvest in the Puttalam district is much lower than that in Kurunegala district during the corresponding agricultural seasons with a mean of 36.5 tons during the Maha season and 22.1 tons in the Yala season. erefore, it is evident that the rice yield as well as harvest in the Maha season is higher than that in the Yala season in each district. A much higher harvest in Kurunegala district is obvious due to the huge difference between the net harvested areas. Significant drops in the paddy harvest could be ob- e climatic data were also analyzed, and a summary of the statistics is presented in Table 2. As per the analysis of rainfall in both Kurunegala and Puttalam districts, the average rainfall in Maha season was always higher than that in Yala season, which must be due to the less precipitation during the southwest monsoon period between May and September and the heavy northeast monsoon that blows in October and November. Further, the average rainfall in Kurunegala district is more than that in Puttalam district during the corresponding seasons. T min in corresponding seasons is higher in Puttalam district, which lies in three climatic zones (arid, dry, and intermediate), than Kurunegala district, which is in the dry zone and intermediate zone of Sri Lanka. However, the highest T max occurs in Kurunegala district during the respective seasons except the minimum of T max during Yala season. e evaporation is greater in Puttalam district than in Kurunegala district during the comparable Yala and Maha seasons separately, as evident from the statistics in both districts. is may be due to the influence of more coastal areas in Puttalam district associated with strong winds that account for higher evaporation. e figures of the number of sunshine hours suggest that Kurunegala district gets more sunshine during both seasons compared to Puttalam district. Concurring with basic climatic features, the mean and other associated values in Table 2 indicate that wind speed in the evening is always higher than that in the morning during both seasons of the two districts. Further, stronger winds can be identified during Yala season than during Maha season in both districts.

Development of Models.
e Levenberg-Marquardt (LM) algorithm, which was developed by combining the gradient descent method and Gauss-Newton method, was used in ANN [25]. A single hidden layer was considered, while climatic data and rice yield data were formed as the data vectors and fed into the ANN model as the input layer and output layer, respectively. In this research study, 70% of the climatic data and the corresponding yield data were used for training, while rest of the data were equally distributed and applied for validation (15%) and testing (15%). Feed-forward network with backpropagation algorithm was used to predict the future yield. Backpropagation is a learning algorithm that enables a network to minimize the error (predicted yield-actual yield) by modifying the weights, which connect neurons (equation (2)). LM algorithm was used in optimization problems due to its faster convergence, and the backpropagation is defined by using the Gauss-Newton method [26]. e behavior of the LM algorithm can be expressed as a relationship between the new weight (x k+1 ) calculated as a gradient function and the current weight (x k ) determined by using the Newton algorithm.
where H is the Hessian calculation approximation, g is the gradient calculation, μ is a constant, and I is an identity matrix. SVMs are supervised learning methods, which use machine learning theory to improve the accuracy of the prediction models. ey can also be used as a regression method (SVMR), keeping all the main features that characterize the maximum margin algorithm [27]. Given a set (x 1 , y 1 ), (x 2 , y 2 ), . . . , (x m , y m ) ∈ (X, Y), where X is the set of input vectors and Yis the set of output, y is predicted for an unseen x ∈ X. SVMR was used to develop a linear function that can make the best approximation of the dependent response variables. e similarity measures were  done based on dot products as indicated in the following equation: where w and b are regression parameters. During the implementation of SVMR-based models, the input was mapped to a high-dimensional feature space using a kernel function, and then a linear regression model was constructed in the new feature space to achieve two conflicting objectives, namely, minimizing errors and avoiding overfitting. Kernel functions (linear, polynomial, Gaussian, etc.) were incorporated for tuning, and most of the times the Gaussian kernel function outperformed the others. GPR is a nonparametric machine learning technique, which implements Gaussian processes for regression purposes and works well on nonlinear problems with small sample sizes [28]. It is a modified linear regression model, which explains the response by introducing latent variables.
e Gaussian process is a collection of random variables (x, y), whose properties are a finite number of subsets with a joint Gaussian distribution defined as where the mean function m(x) represents the expectation and the kernel function k(x, x ′ ) defines the covariance. It is a collection of random variables x ∈ X whose properties are any finite number of subsets with a joint Gaussian distribution. e GPR was used in combination with the kernel functions named rational quadratic, squared exponential, Matern 5/2, and exponential. All the aforementioned machine learning techniques-based models (ANN, SVMR, and GPR) were developed in MATLAB environment (version 9.4.0.813654-R2018a). MLR is a statistical technique, which can be used to model a linear relationship between the input variables (climatic data) and output variable (rice yield) with more than one explanatory variable. MLR also allows determining the overall fit (variance explained) of the model and the relative contribution of each predictor to the total variance explained [29]. It is an extension of ordinary least squares regression and can be expressed as follows: where n � i is the number of observations, β 0 is the yintercept, which is a constant term, β n represents the slope coefficients for each input variable, and ε is the model deviation. PR is a nonlinear regression model in which the output (response variable) is modeled in proportion to a power (polynomial) of the inputs (explanatory variables) [30].
where i is the number of observations and a, b, c, . . . , p are constants. RR provides an alternative to least squares regression that works with less restrictive assumptions and performs well even when outliers are present in the data [31]. Outliers violate the assumption of normally distributed residuals in least squares regression and tend to distort the least squares coefficients by having more influence than they deserve. Although RR can particularly benefit untrained users, careful consideration should be given as it conducts its own residual analysis and down-weights or completely removes some observations. All the aforementioned statistical models (MLR, PR, and RR) were developed by programming in R software (R 4.0.3).

Evaluation of Models.
e performance of each model was assessed in terms of the mean squared error (MSE), correlation coefficient (R), mean absolute percentage error (MAPE), root mean squared error ratio (RSR) [32], BIAS value [33], and the Nash number.
where x is the actual yield, y is the predicted yield, N is the number of data values, and σ x is the standard deviation of actual yield. In highly accurate models, R and Nash number are closer to 1, while the MSE approaches 0. e negative and positive BIAS values show underestimation and overestimation, respectively, while the values close to zero indicate high model accuracy. Further, the relative standard deviation of the actual yield was calculated to understand the behavior of actual yield data.

3.1.
Results. e climatic and rice yield data were separated into four groups based on the district and the agricultural season and applied on each of the six techniques to identify the relationship among them.
e performance of each model was measured in terms of the MSE and correlation coefficient ( Figure 2). As per the results, MSE of the models developed for Maha season of Kurunegala district was less than 0.06 and correlation coefficients were higher than 0.78. However, when the Yala season of the same district is considered, five models exhibit correlation coefficients over 0.55, but SVMR-based model demonstrated R � 0.17. When Puttalam district is considered, MSE of the models corresponding to Maha season varied between 0.024 and 0.098. Its correlation coefficients varied between 0.38 and 0.89, while those for Puttalam Yala season are higher than 0.71. In terms of MSE, the models corresponding to Puttalam Yala season are the best with MSE < 0.033. As per the performance analysis of the models, they were not much accurate, particularly in terms of the correlation coefficient. e less number of source data (climatic data and corresponding rice yield) may be a potential reason for this behavior. erefore, having considered the full set of data as a single sample, all the statistical methods and machine learning techniques were applied again to model the relationship between rice yield and climate variables. ese results are depicted in Figure 3 and summarized in Table 3. e models resulted in MSE < 0.1, which is comparable to previous results but with higher correlation coefficients. e yield functions obtained by applying statistical methods and derived in terms of the climatic factors are given in equations (9)- (11). Some climatic factors did not appear in the yield function as their impact was minimal compared to rainfall, temperature, and average wind speed.
As per the evaluation performed based on MSE, R, MAPE, Nash number, RSR, and BIAS value, the application of GPR generated the most accurate model among all the statistical methods and machine learning techniques considered in this research. e GPR outperformed the others demonstrating the lowest MSE, MAPE, and RSR. e BIAS value of the GPR model was also very close to zero despite being the second lowest among the six models. Moreover, R and the Nash number of GPR model were closer to 1 compared to the other five techniques.

Validation.
Variation of the rice yield predicted by applying GPR for the period of 2000-2018 was compared with the actual rice yield (Figure 4). e predictions were very close to the actual yield values, with maximum absolute errors of 0.22 t/ha and 0.24 t/ha in Kurunegala and Puttalam districts, respectively. ese very low error values indicate the validity of the GPR model. It was further analyzed by comparing the RSD and MAPE as well. ough the RSD of the actual yield data (9.3%) shows the uncertainty of actual yield in the ensuing years, the GPR model resulted in very lower MAPE (2%), demonstrating the possibility of accurate prediction of yield.
In order to validate the accuracy of the GPR model further, climatic data in year 2019 corresponding to Yala    season of the Kurunegala district were applied on the GPRbased model. e predicted rice yield was 3.77 t/ha, while the actual yield was 4.01 t/ha, resulting in an insignificant absolute error of 0.24 t/ha.

Conclusions
e relationship between the rice yield and climate variables of two geographically adjacent districts in Sri Lanka was modeled by applying three statistical methods and three machine learning techniques. Based on the significance of weightages and exponents associated with the climatic parameters in the yield functions of the statistical models, the rainfall, temperature, and average wind speed were found to be the most influential climatic factors. e accuracy of the models was evaluated in terms of the MSE, correlation coefficient, MAPE, RSR, BIAS value, and the Nash number. All the machine learning techniques (ANN, SVMR, and GPR) outperformed statistical methods (MLR, PR, and RR) in developing accurate relationship models. e GPR-based      model was the most accurate having MSE, MAPE, RSR, and BIAS values closer to zero while the Nash number and the correlation coefficient approaching 1. Further, an independent validation of the model was conducted by using a recent dataset, which was not used for developing the models. e results demonstrated the capability of the GPR-based model to predict the rice yield by using the known or forecast climatic data.
Data Availability e data used for the research are available from the corresponding author upon request subject to approval of the relevant authorities.

Conflicts of Interest
e authors declare that they have no conflicts of interest.