This paper presents the development of crop-weather models for the paddy yield in Sri Lanka based on nine weather indices, namely, rainfall, relative humidity (minimum and maximum), temperature (minimum and maximum), wind speed (morning and evening), evaporation, and sunshine hours. The statistics of seven geographical regions, which contribute to about two-thirds of the country’s total paddy production, were used for this study. The significance of the weather indices on the paddy yield was explored by employing Random Forest (RF) and the variable importance of each of them was determined. Pearson’s correlation and Spearman’s correlation were used to identify the behavior of correlation in a positive or negative direction. Further, the pairwise correlation among the weather indices was examined. The results indicate that the minimum relative humidity and the maximum temperature during the paddy cultivation period are the most influential weather indices. Moreover, RF was used to develop a paddy yield prediction model and four more techniques, namely, Power Regression (PR), Multiple Linear Regression (MLR) with stepwise selection, forward (step-up) selection, and backward (step-down) elimination, were used to benchmark the performance of the machine learning technique. Their performances were compared in terms of the Root Mean Squared Error (RMSE), Correlation Coefficient (

Favorable weather conditions, together with other factors such as the adoption of modern farming technologies, food preservation techniques, improved seed varieties, and fertilizers, all contribute to enhanced food security and agricultural productivity. Among the many steps taken towards the sustainable expansion of major crops grown worldwide, long-term plans for self-sufficiency and higher productivity in paddy cultivation are sensitive issues for agricultural scientists and policymakers, because paddy rice remains the primary source of food in many countries, particularly in Asia. With the world population growing towards the 10 billion mark by the middle of this century, the demand for rice will keep increasing, and agricultural technologists will be hard pressed to devise yield-enhancing techniques, as the land available for paddy cultivation is expected to be exhausted within a few years.

Researchers have studied the factors that influence regionwise crop yield differences under technological, biological, and environmental categories [

Because of this significant influence of weather on crop yield, it is useful to identify the most impactful weather factors and the correlations among them, so that appropriate measures can be considered to maximize the effect of conducive factors and minimize that of harmful factors on the paddy yield. Given the uncontrollable and unpredictable nature of weather, the researchers’ scope is limited to using secondary data on regular weather patterns to develop crop-weather models for accurate yield prediction despite occasional extreme weather conditions.

Some related studies in the literature have used the following regression techniques to address this topic in other countries. Sharma and Joshi examined the spatial and temporal performance of rice production and yield and the factors determining the acreage and yield of paddy in coastal regions of India [^{2}) of 0.9234 using only four predictors, namely, percentage of rice area, number of days with minimum temperature, average daily minimum temperature, and monthly average solar radiation. In this paper, Power Regression (PR) and three Multiple Linear Regression (MLR) models with stepwise selection, forward selection, and backward elimination of variables are used to relate the paddy yield to weather indices, and their performance is compared with that of the more powerful nonparametric Random Forest (RF) method to identify the most suitable model(s) in the Sri Lankan context, which is characterized by two major paddy growing seasons in seven regions with different weather conditions.

Machine learning techniques have also been used to develop crop-weather models and to understand the most influential weather factors. Konduri et al. compared the performance of linear and nonlinear regression models in terms of R^{2} and the Root Mean Square Error (RMSE) and found that Support Vector Regression (SVR) and RF perform comparatively better than the linear models of Principal Component Regression and Ridge Regression in assessing the impact of climate on the crop yield [

Although weather factors are known to control the crop yield to a great extent, a comprehensive study of their relative importance and correlation with the paddy yield has not yet been conducted for Sri Lanka. Therefore, the objective of the present study was to investigate the most impactful weather indices on paddy yield in Sri Lanka. In light of the numerous modelling techniques cited above, it was possible to narrow down the choice of methods that would help achieve this objective. Owing to the success reported in using RF, it is also used here to shed more light on interregressor correlation, which is an important determinant of the behavior of the variable importance matrix.

In Section

Eleven years of secondary data on paddy yield were obtained from the reports published by the Department of Census and Statistics, the premier state institute in Sri Lanka, maintaining the official repository of information on diverse fields collected using appropriate scientific methods and instruments. The temporal scope of data included the two main paddy cultivation seasons spanning from May to August (Yala season) and September to March (Maha season) of the ensuing year during the period from 2009 to 2019, while the spatial coverage encompassed seven administrative districts, which together contribute to nearly 62% of the overall annual paddy production in Sri Lanka (Figure

Study areas.

Table

Paddy yield in the study areas.

District | Average contribution to the paddy production in Sri Lanka (%) | Season | Mean yield (t/ha) | Median yield (t/ha) | Yield range (t/ha) |
---|---|---|---|---|---|
Ampara | 15.57 | Yala | 4.8 | 4.8 | 1.1 |
Ampara | | Maha | 4.6 | 4.8 | 2.0 |
Polonnaruwa | 10.64 | Yala | 4.9 | 5.0 | 1.3 |
Polonnaruwa | | Maha | 5.0 | 5.0 | 1.6 |
Kurunegala | 10.56 | Yala | 3.7 | 3.6 | 0.6 |
Kurunegala | | Maha | 4.2 | 4.0 | 1.5 |
Anuradhapura | 9.07 | Yala | 4.6 | 4.5 | 1.4 |
Anuradhapura | | Maha | 4.7 | 4.6 | 1.7 |
Batticaloa | 6.17 | Yala | 5.2 | 5.1 | 1.6 |
Batticaloa | | Maha | 5.8 | 5.8 | 1.4 |
Hambantota | 6.23 | Yala | 4.2 | 4.1 | 1.8 |
Hambantota | | Maha | 3.1 | 3.1 | 1.9 |
Monaragala | 3.49 | Yala | 4.0 | 4.0 | 1.4 |
Monaragala | | Maha | 4.2 | 4.2 | 0.9 |
Total | 61.73 | | | | |

Weather data for the same period as the paddy yield data were purchased from another state institute, the Department of Meteorology in Sri Lanka. The total rainfall during a cultivation season was used together with the seasonal averages of eight monthly mean weather indices: relative humidity (minimum and maximum), temperature (minimum and maximum), wind speed (morning and evening), evaporation, and sunshine hours. Thus, the above temporal and spatial extent provided 11 years × 7 districts × 2 seasons of data for the analysis carried out using MLR, PR, and RF. In MLR, three variable selection methods, namely, stepwise selection, forward selection, and backward elimination, were employed.

Table

Mean weather in the study areas during the period of paddy cultivation.

District | Rainfall (mm) | Minimum relative humidity (%) | Maximum relative humidity (%) | Minimum temperature (°C) | Maximum temperature (°C) | Evaporation (mm) | Sunshine hours | Morning wind speed (km/h) | Evening wind speed (km/h) |
---|---|---|---|---|---|---|---|---|---|
Ampara | 741.5 | 71.7 | 76.0 | 24.6 | 33.0 | 3.7 | 7.1 | 3.4 | 5.5 |
Polonnaruwa | 896.3 | 60.8 | 74.2 | 21.9 | 33.5 | 4.4 | 7.6 | 3.6 | 3.9 |
Kurunegala | 644.2 | 72.0 | 83.4 | 23.3 | 32.3 | 3.1 | 6.8 | 2.9 | 3.9 |
Anuradhapura | 681.5 | 70.2 | 83.2 | 23.6 | 32.2 | 3.6 | 7.4 | 5.8 | 6.1 |
Batticaloa | 994.8 | 71.0 | 83.4 | 25.4 | 32.2 | 3.6 | 7.3 | 3.0 | 6.9 |
Hambantota | 574.3 | 73.0 | 78.0 | 24.1 | 32.6 | 4.2 | 6.4 | 4.8 | 5.2 |
Monaragala | 801.7 | 64.1 | 78.0 | 22.3 | 33.0 | 3.2 | 6.3 | 2.9 | 4.2 |

The relative importance of predictors is usually measured by evaluating how much each predictor contributes to increasing the model accuracy [

In this research, the in-built variable importance method of RF regression model [

For each decision tree, RF regression calculates nodes’ importance using Gini Importance, assuming only two child nodes (binary tree). The importance of node

Next, the feature importance values are normalized and the normalized feature importance for

The final feature importance at the RF level is its average over the total number of trees (
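
As a hedged illustration, the three steps described above (node importance, normalization, and averaging over trees) are commonly written as follows. The notation is an assumption introduced here and may differ from the authors' symbols: $ni_j$ is the importance of node $j$, $w_j$ the fraction of samples reaching node $j$, $C_j$ the impurity of node $j$, $fi_i$ the importance of feature $i$ within one tree, and $T$ the total number of trees. The passage names Gini Importance; for regression trees the node impurity is usually the variance of the response within the node, so $C_j$ is left generic.

```latex
\begin{align}
  ni_j &= w_j C_j - w_{\mathrm{left}(j)}\, C_{\mathrm{left}(j)}
          - w_{\mathrm{right}(j)}\, C_{\mathrm{right}(j)} \\
  fi_i &= \frac{\sum_{j \,:\, \text{node } j \text{ splits on feature } i} ni_j}
               {\sum_{k \,\in\, \text{all nodes}} ni_k} \\
  \mathrm{norm}fi_i &= \frac{fi_i}{\sum_{m \,\in\, \text{all features}} fi_m} \\
  RFfi_i &= \frac{1}{T} \sum_{t=1}^{T} \mathrm{norm}fi_{i,t}
\end{align}
```

Under this formulation the importances of all features sum to one, which is consistent with reporting each weather index's importance as a fraction.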

The correlation between the yield and each weather index was determined to quantify its impact and to identify whether the impact is positive or negative. Pearson’s correlation coefficient and Spearman’s correlation coefficient were calculated in R using RStudio (version 1.3.1093). Pearson’s correlation coefficient is a test statistic that measures both the strength and direction of a pairwise linear relationship between two quantitative continuous variables [

A positive correlation coefficient implies that both variables change in the same direction, and a negative value means that they change in opposite directions. The correlation matrix thus obtained is given in Table

Pearson’s correlation matrix.

As some studies had reportedly shown nonlinear relationships between the yield and weather indices [

Spearman’s correlation matrix.
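
The difference between the two coefficients can be illustrated with a minimal sketch. The study itself used R; the Python functions below (`pearson`, `average_ranks`, `spearman`) and the sample values are hypothetical and only show that Spearman’s coefficient is Pearson’s coefficient computed on ranks, so it captures monotonic relationships that are not linear.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: a monotonic but nonlinear relationship.
index_values = [1.0, 2.0, 3.0, 4.0, 5.0]
yield_values = [v ** 2 for v in index_values]

r_pearson = pearson(index_values, yield_values)
r_spearman = spearman(index_values, yield_values)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

In this toy case the Spearman value is exactly 1 because the ranks agree perfectly, while the Pearson value falls slightly below 1 because the relationship is curved.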

A value of

As the number of observations is much more than the number of variables, linear regression is known to be a strong classical parametric method [

The three MLR methods used differ in their variable selection procedure: forward (step-up) selection, backward (step-down) elimination, and stepwise selection. Stepwise regression is a combination of the other two techniques, wherein variables are added step by step after verifying their significance against a tolerance level. In the forward (step-up) selection method, the predictor variables (weather indices) are added in decreasing order of their correlation with the dependent variable (yield). The opposite process takes place in the backward (step-down) elimination method, in which each predictor variable that does not contribute to the regression equation is removed.

PR is a nonlinear regression model in which the output is modelled in proportion to a power of the explanatory variables. In PR, the function is a power (polynomial) equation of the form y = ax^{b}, where
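
A minimal sketch of how a power relationship can be fitted is given below. It assumes a single-predictor form y = a·x^b with positive data and estimates the parameters by ordinary least squares on the log-transformed values; this is one common approach and not necessarily the procedure used in the study. The function name `fit_power` and the synthetic data are hypothetical.

```python
import math


def fit_power(x, y):
    """Fit y = a * x**b by least squares on ln(y) = ln(a) + b * ln(x).

    Requires strictly positive x and y values.
    """
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    slope_num = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(log_x, log_y))
    slope_den = sum((u - mean_lx) ** 2 for u in log_x)
    b = slope_num / slope_den           # exponent
    a = math.exp(mean_ly - b * mean_lx)  # multiplicative constant
    return a, b


# Synthetic, noise-free data generated from y = 2 * x**0.5 (illustration only).
x_values = [1.0, 4.0, 9.0, 16.0, 25.0]
y_values = [2.0 * v ** 0.5 for v in x_values]

a_hat, b_hat = fit_power(x_values, y_values)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

Fitting in log space minimizes relative rather than absolute errors, so the estimates can differ from a direct nonlinear least-squares fit when the data are noisy.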

RF is a widely used supervised learning-based machine learning technique that has proved its efficiency in modelling the crop yield owing to its sound performance in many prediction domains [

RF constructs a predictive model and estimates the relative importance of predictors [

In this research, first, the data were feature normalized as an input set

After developing the models of RF, MLR with stepwise selection, MLR with forward (step-up) selection, MLR with backward (step-down) elimination, and PR, their performance was evaluated in terms of the correlation coefficient (
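
The error measures named in this study (RMSE, MAE, and MAPE) can be computed as in the following sketch. The function names and the actual and predicted yield values are hypothetical and serve only to show the definitions.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical yields, for illustration only.
actual_yield = [4000.0, 5000.0, 4500.0, 4200.0]
predicted_yield = [4100.0, 4900.0, 4500.0, 4000.0]

rmse_value = rmse(actual_yield, predicted_yield)
mae_value = mae(actual_yield, predicted_yield)
mape_value = mape(actual_yield, predicted_yield)
print(f"RMSE = {rmse_value:.1f}, MAE = {mae_value:.1f}, MAPE = {mape_value:.2f}%")
```

RMSE penalizes large errors more heavily than MAE, while MAPE expresses the error relative to the size of the actual yield, which makes it comparable across districts with different yield levels.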

The feature importance of each independent variable on the paddy yield was measured as a fraction and the distribution of the two most important variables was examined to clarify their correlation values with the paddy yield in correlation matrices. The correlation of each weather index with the yield and the remaining weather indices was quantified using Pearson’s correlation method and Spearman’s correlation method. Strong and moderate correlations were distinguished from the weaker correlations based on three ranges. The performance of the five models can be understood in comparison with each other in terms of the statistical measures of

Minimum relative humidity was found to be the most important independent variable (Figure

Variable importance.

Distribution of minimum relative humidity data: (a) relationship with paddy yield and (b) data distribution.

The second most important independent variable is the maximum temperature. Both Pearson’s correlation and Spearman’s correlation indicate a positive relationship between the maximum temperature and the paddy yield. The positive Pearson’s correlation is coherent with the linear relationship (Figure

Distribution of maximum temperature data: (a) linear relationship between the maximum temperature and the paddy yield; (b) nonlinear relationship between the maximum temperature and the paddy yield; (c) normal distribution of maximum temperature data; (d) normal distribution of the dependent variable.

Wind speed is the third most important variable, although the morning and evening winds affect the yield in opposite ways: morning wind speed is positively correlated with the yield, whereas evening wind speed is negatively correlated. This contrast may be due to the negative effect caused by stronger evening winds (Table

Correlations between two indices were classed as strong if both correlation values (Pearson’s and Spearman’s) lie within the interval [0.75, 1.0] or [−1.0, −0.75], and as mediocre if at least one of the values lies within the interval [0.50, 0.74] or [−0.74, −0.50] and the other value lies in the higher (strong) interval. Accordingly, both strong and mediocre correlations of positively and negatively associated variables are summarized in Table
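
The classification rule described above can be sketched as a small function. Several points are assumptions made for this illustration: the two values are taken to be the Pearson and Spearman coefficients of the same pair, the cutoffs 0.75 and 0.50 are applied to absolute values, a pair in which both values fall in the mediocre band is also labelled mediocre, and pairs whose two coefficients disagree in sign are labelled weak. The function name `correlation_level` is hypothetical.

```python
def correlation_level(pearson_r, spearman_rho, strong=0.75, mediocre=0.50):
    """Label a pair of weather indices from its two correlation values."""
    # Assumption: the two coefficients must agree in direction.
    if pearson_r * spearman_rho <= 0:
        return "weak"
    direction = "positive" if pearson_r > 0 else "negative"
    # The weaker of the two magnitudes decides the level.
    weaker = min(abs(pearson_r), abs(spearman_rho))
    if weaker >= strong:
        return f"strong {direction}"
    if weaker >= mediocre:
        return f"mediocre {direction}"
    return "weak"


# Hypothetical coefficient pairs, for illustration only.
print(correlation_level(0.80, 0.91))
print(correlation_level(-0.62, -0.80))
print(correlation_level(0.30, 0.80))
```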

Correlation between the weather indices.

Level of correlation | Positively correlated pairs of weather indices | Negatively correlated pairs of weather indices |
---|---|---|
Strong | Maximum relative humidity and minimum relative humidity; evaporation and maximum temperature | Maximum temperature and minimum relative humidity; rainfall and maximum temperature; maximum relative humidity and maximum temperature; maximum relative humidity and sunshine hours |
Mediocre | Rainfall and minimum relative humidity; sunshine hours and maximum temperature; evaporation and evening wind; sunshine hours and evaporation; maximum relative humidity and morning wind; maximum relative humidity and rainfall | Maximum relative humidity and evening wind; rainfall and evaporation; maximum relative humidity and evaporation; sunshine hours and rainfall |

A total of five crop-weather models were developed in this study taking both linear and nonlinear aspects into consideration and their performance is summarized in Table

Performance of the regression models.

Technique | RMSE | Correlation coefficient | MAE | MAPE (%) |
---|---|---|---|---|
MLR: forward method | 489 | 0.54 | 374 | 9.2 |
MLR: backward method | 483 | 0.53 | 374 | 9.2 |
MLR: stepwise method | 472 | 0.75 | 361 | 8.9 |
PR | 485 | 0.75 | 356 | 8.7 |
RF | 71 | 0.99 | 60 | 1.4 |

The regression equations that emerged from stepwise MLR and PR are given in (

Error of the yield predicted by applying regression techniques: (a) MLR (stepwise), (b) PR, and (c) RF.

The most encouraging results were generated by the nonlinear RF method with the highest correlation coefficient and the least RMSE, MAE, and MAPE (Table

Distribution of error of the predicted yield.

Actual versus predicted yield of the RF model.

Researchers have used numerous statistical and machine learning techniques to develop crop-weather models for a variety of crops such as paddy, wheat, and corn. A summary of relevant research studies is presented in Table

Crop-weather models.

Ref. | Crop | Country | Evaluation criteria | Weather indices | The most influential weather indices |
---|---|---|---|---|---|
[ | Paddy | India | Full model and stepwise MLR | Rice area, number of days with minimum temperature below 22°C, average daily temperature (maximum and minimum), sunshine hours, rainfall, and solar radiation | Solar radiation |
[ | Corn | USA | Kincer’s method | Precipitation, temperature, sunshine, and relative humidity | Relative humidity |
[ | Crops | Uganda | MLR | Precipitation, temperature, and CO_{2} emissions | CO_{2} emissions |
[ | 7 crops including paddy and corn | Taiwan | MLR | Temperature and precipitation | Temperature and precipitation |
[ | Paddy | India | Gaussian process regression (GPR) and lasso regression | Temperature, average humidity, rainfall, wind speed, UV index, sun hours, and pressure values | Rainfall |
[ | Wheat | China | RF, SVM, and GPR | Maximum temperature, minimum temperature, drought index, and precipitation | Minimum temperature |
[ | Paddy | Korea | Random forest | Temperature (minimum, mean, and maximum) and sunshine hours | Minimum temperature and sunshine hours |

In the context of Sri Lanka where rice is the staple food, the effects of climatic variation were extensively researched [

Though paddy yield prediction models were developed by applying numerous techniques [

This study was carried out with data available at the Department of Meteorology and the Department of Census and Statistics of Sri Lanka, with the objective of identifying the most influential weather factors on the paddy yield in Sri Lanka. The data covered seven major paddy growing regions that account for nearly two-thirds of the country’s overall production, over eleven years and in both agricultural seasons. A total of five regression techniques that can model linear relationships as well as nonlinearities and interactions were used. Of these, the RF model was the most accurate. The difference in performance between the forward selection and backward elimination methods of MLR was insignificant, while the stepwise MLR method was better and remained on par with the PR method. However, the superior accuracy of the RF model was evident from the statistical performance indicators as well as from the distribution of errors between the actual yield and the model-predicted yield. This research may be extended by applying projected climate conditions to the RF model to predict future paddy yield. The ability to predict the future yield will help the agriculture authorities to ensure food security. Such projections are useful at the macro level, as the country’s economic activities are dominated by the agriculture sector, in which the major crop is paddy.

RF regression was used to rank the weather indices affecting the paddy yield in Sri Lanka. The minimum relative humidity emerged as the most impactful weather index, having a nonlinear correlation with the paddy yield, followed by the maximum temperature, which showed both linear and nonlinear relationships with the paddy yield. The morning wind speed was found to be positively correlated with the paddy yield, while the evening wind speed was negatively correlated. Pearson’s and Spearman’s correlation matrices provided further insight into the degree of association between pairs of weather indices. Maximum and minimum relative humidity were strongly positively correlated with each other, as were evaporation and maximum temperature. In contrast, maximum temperature, rainfall, and maximum relative humidity were strongly negatively correlated with minimum relative humidity, maximum temperature, and sunshine hours, respectively. In future research, nonclimatic factors may also be incorporated and their importance investigated.

The data used for the research are available from the corresponding author upon request, subject to the approval of the relevant authorities.

The authors declare that they have no conflicts of interest.

The authors are grateful to the Department of Census and Statistics and the Department of Meteorology, Sri Lanka, for providing past records of paddy harvest, yield, and climate data.