Prediction Model Optimization through Centroid Clustering by Reducing the Sample Size: Integrating Statistical and Machine Learning Techniques for Wheat Productivity

Machine learning algorithms are being deployed rapidly and have made manifold breakthroughs in various fields. The optimization of algorithms has received abundant attention from researchers, being a core component in deploying a machine learning model (MLM) able to learn the parameters of the given data in significant ways. Modeling crop productivity through innumerable agronomical constraints has become a crucial task for evolving sustainable agricultural policies. A cross-sectional dataset of 26,430 (D1) crop-cut experiments, selected by 2nd-stage area frame sampling, is collected from the crop reporting service. This research proceeds as follows: firstly, three more effective numerically optimized datasets (D2, D3, and D4) are generated from D1 by taking the centroid points of the features, which decreases the sample size; secondly, the MLM is integrated with the traditional statistical models (TSMs) for multiple linear regression (MLR); and thirdly, decision tree regression (DTR) and random forest regression (RFR) are deployed to obtain optimized models able to predict wheat productivity well, with 75% of each dataset used to train and 25% to test the model, using the evaluation metrics (R2, RMSE), information criterion (AIC) with weights (AICW), evidence ratio (E.R), and decompositions of prediction error. The MLR performed better as an MLM than as a TSM. The performance capability of both MLM and TSM improved on the generated datasets. The RFR was the optimized and best-performing model for D1, D2, D3, and D4. This study demonstrates strong evidence for deploying MLMs for the prediction of wheat productivity as an alternative to traditional statistical modeling.


Significance, Motivation, and Objectives of the Study.
Producing enough food for an exploding population has become a major concern for the world. Agriculture, as the core contributor to food production, is key to ensuring sustainable food availability [1]. Food security has been considered the foremost global threat, and it is therefore essential to steer strategies that determine policies for future food security and sustainable food availability [2,3]. The Food and Agriculture Organization, the International Food Policy Research Institute, and many other international organizations have placed great emphasis on this threat to attain sustainable food availability [4][5][6]. Modeling crop productivity through innumerable agronomical constraints has become a crucial task for attaining sustainable agriculture and evolving effective agricultural strategies [7]. A precise crop model based on certain conditions is a foremost need of the time to handle the prevailing food trepidations [8,9]. Wheat, being the 3rd-largest food crop, plays a vital role in assuring the food supply of the world [4,[10][11][12]. Developing food prediction models capable of true estimation of food availability can assure veracious policy decisions for managing national action plans for food security [13]. Pakistan stands 6th for wheat production, 8th for cultivated area under wheat crop, and 59th for wheat productivity [14]. An exigent need of the era is to develop an accurate and precise wheat productivity model capable of predicting production from reliable statistics, which would help to assess the assurance or nonassurance of future food demand [15]. Islam et al. [2] presented a study on large datasets for building a statistical prediction model for wheat productivity in Pakistan, using a hierarchical regression approach for selecting the features to address the food security threat based on cross-sectional records.
Their study presented traditional statistical modeling and introduced the theory of centroid clustering used to generate three more datasets from the original dataset. The generated datasets enhanced the model prediction capability while reducing the sample size. They applied different evaluation metrics, adjusted R2, ΔR2, and MSE, and information criterion approaches such as the Akaike information criterion (AIC), Schwarz information criterion (SIC), and weighted information criterion (Akaike weight "Wi") with evidence ratio "E.R," etc. The normality analysis and constant error variance are checked by graphical presentation. The VIF is applied for multicollinearity, and nonconstant error variance is checked by the Breusch-Pagan test, developed in 1979 by Trevor Breusch and Adrian Pagan. The reliability analysis is performed by Cronbach's alpha test.
Machine learning algorithms are developing widely, are deployed rapidly, and have made manifold breakthroughs in various fields. The advancement of science and technology and the implementation of innumerable agronomical constraints in various fields of agriculture lead to immense volumes of data [1,8,16]. The optimization of algorithms has become a significant part of machine learning and has received abundant attention from researchers, and the proficiency of numerically optimized datasets markedly influences machine learning model performance for massive amounts of data [17]. In this research, firstly, effective numerically optimized datasets are developed by taking the centroid points of the features, enhancing machine learning model performance by decreasing the sample size; secondly, machine learning models are integrated with the traditional statistical models; and thirdly, different machine learning models are deployed to obtain optimized models able to predict wheat productivity well. This study is designed to apply supervised machine learning techniques, i.e., the multiple linear regression model (MLRM), decision tree regression model (DTRM), and ensemble learning random forest regression model (RFRM), on the same datasets with the aim of enhancing model performance by reducing the sample size through centroid clustering.
This study integrates the efficacies of machine learning algorithms with benchmark traditional statistical models for wheat productivity.

Data Collection, Sampling Method, and Important Features Selection.
Punjab is the 2nd-largest province of Pakistan and accounts for a 76% share of the total wheat cultivation area. The administrative setup of Punjab comprises nine divisions, thirty-six districts, and one hundred and forty-five tehsils. The 26,430 wheat crop-cut experiments (C.C.E) are taken from the Crop Reporting Service (CRS), Punjab, for the years 2016-17 to 2019-20. The list frame sampling (LFS) technique using systematic random sampling (SyRS), in which a complete village (sample unit) was selected as the basic unit, remained in practice in the CRS, but after 2018-19, 2nd-stage area frame sampling (AFS) is applied to select the sample for C.C.E [18].
The probability of selecting a village at the first stage is P_i = Z_i / Σ_{i=1}^{N} Z_i, where Z_i = cropped area of the i-th village in the j-th union council of a district, Σ_{i=1}^{N} Z_i = total cropped area of the villages in the j-th union council of the district, and P_i = probability of selecting the i-th village as a sample. Qayyum and Shera [18] reported that, at stage I, union councils are considered the population and villages the sampling units, using probability proportional to size (PPS), while at stage II, the selected sample village is considered the population and land area segments the sampling units, using the simple random sampling (SRS) technique. The C.C.E is selected in land area segments. Wheat productivity, measured in maunds/acre, along with seven quantitative agronomical features, i.e., fertilizer urea kg/acre, fertilizer DAP kg/acre, other fertilizers kg/acre, number of waterings, seed quantity used kg/acre, number of pest sprays, and number of weed sprays, and eight binary categorical (0 for absence and 1 for presence) agronomical features, i.e., seed treatment, soil type clay loam, varieties adoption, harvest period April 1-20, planting in November, land irrigated, farmers' area >25 acres, and seed type, is used in the current study. The experiment is performed using Python's key library scikit-learn (Sklearn) in Jupyter Notebook, as documented at https://scikitlearn.org/stable/supervised_learning.html. Sklearn offers various prominent features for data processing, classification, clustering, evaluation, and model selection. Model_selection is the Sklearn module used for splitting datasets for analysis and then evaluating them on unseen data.
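As a rough illustration of the stage-I PPS selection, the probability P_i = Z_i / Σ Z_i can be sketched as follows; the village areas below are hypothetical, not CRS data.

```python
import numpy as np

def pps_select(areas, n, rng=None):
    """Select n village indices with probability proportional to cropped area.

    Implements P_i = Z_i / sum(Z) as in the two-stage area frame sampling
    described above. Input areas are illustrative, not the CRS records.
    """
    rng = rng or np.random.default_rng(0)
    areas = np.asarray(areas, dtype=float)
    p = areas / areas.sum()  # selection probability P_i for each village
    return rng.choice(len(areas), size=n, replace=False, p=p)

# Hypothetical cropped areas (acres) for five villages in one union council
areas = [120.0, 340.0, 80.0, 560.0, 200.0]
sample = pps_select(areas, n=2)
```

Larger villages are proportionally more likely to enter the sample, which mirrors the PPS scheme at stage I before SRS is applied within the selected village.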

Supervised Machine Learning Technique.
Machine learning is viewed as an innovative extension of statistics capable of dealing with massive datasets by adding methods from computer science to the repertoire of statistics [19]. Machine learning methods are categorized as advanced tools applied for the prediction of agricultural production [20][21][22][23]. According to Jeong et al. [9], machine learning offers modern process-based techniques as an alternative to traditional statistical modeling. Machine learning is viewed as an assumption-free approach to the data structure of a model, and it is applied to complex prediction concerns, e.g., the functional form for crop yield prediction [8,24]. Arthur Samuel (1901-1990), a pioneer in artificial intelligence (AI), coined the term machine learning in 1959 as the "field of study that gives computers the capability to learn without being explicitly programmed" [25,26]. The prominent layout of the machine learning process is narrated as follows.

Multiple Linear Regression Models (MLRMs). MLR is used to model the relationship of the features with wheat productivity for prediction in both statistical and machine learning modeling as Y_i = X_i β + ε_i, where Y_i = wheat productivity in maunds/acre, X_i = the vector of features, and β = the vector of feature coefficients.
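A minimal sketch of fitting such an MLR with scikit-learn; the feature matrix and coefficients below are synthetic stand-ins for the 15 agronomical features, not the crop-cut data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the 15 agronomical features and yield (maunds/acre)
rng = np.random.default_rng(42)
X = rng.random((200, 15))
beta = rng.random(15)
y = X @ beta + rng.normal(scale=0.1, size=200)  # Y_i = X_i beta + e_i

mlr = LinearRegression().fit(X, y)  # estimates beta and the intercept
r2 = mlr.score(X, y)                # performance score R^2
```

The fitted `coef_` vector plays the role of β and `score` returns the R2 used as the performance score throughout this study.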

Decision Tree Regression Model (DTRM).
The decision tree regression model (DTRM) uses a flowchart structure to predict the response. In a DTR, each internal node signifies a test, branches signify the outcomes of the test, and each leaf node signifies the final decision [27,28]. In other words, leaf nodes produce the prediction outcomes after the hierarchical leaf-and-branch structure is traversed in the root-to-leaf direction. DTRMs with depths ranging from 1 to 20 are plotted for training and test performance to determine the optimum DTRM capable of predicting wheat productivity well. Hyperparameter tuning is exercised using GridSearchCV, a scikit-learn utility applied to find the optimum values of min_samples_split and max_depth (tree depth). Figure 1 shows the structural flow of the decision tree model.
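The depth sweep described above (depths 1 to 20, comparing training and test performance) can be sketched as follows; the data-generating function is an illustrative assumption, not the study's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 15 features, yield driven mostly by two of them
rng = np.random.default_rng(0)
X = rng.random((500, 15))
y = X[:, 0] * 10 + np.sin(5 * X[:, 1]) + rng.normal(scale=0.2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Record train/test R^2 for depths 1..20, as in the text
scores = {}
for depth in range(1, 21):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

best_depth = max(scores, key=lambda d: scores[d][1])  # depth with best test R^2
```

Plotting the two score curves against depth reproduces the train/test comparison the study uses to pick the optimum tree depth.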

Random Forest Regression Model (RFRM).
The RFRM involves almost the same set of hyperparameters for tuning as the DTRM, except those specific to the random forest (RF). RF adds additional randomness while growing the regression trees: instead of splitting a node on the single most important feature, it searches for the best feature among a random subset of features. The RFRM averages multiple regression decision trees to avoid the overfitting problem, and the number-of-trees parameter (n_sample) in the forest, ranging from 10 to 100, is used [29,30]. The RFRM builds the forest at random and searches for the best features [31]. The RFRM uses bootstrap aggregating for agricultural decisions related to crop productivity prediction [21,30,31]. Figure 2 depicts the structural flow of the RFR.
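A minimal RFR sketch on synthetic stand-in data; note that scikit-learn's name for the number-of-trees parameter described above is `n_estimators`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crop-cut features
rng = np.random.default_rng(0)
X = rng.random((500, 15))
y = X[:, 0] * 10 + X[:, 1] * 5 + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 50 bootstrap-aggregated trees; the study tunes this count over 10-100
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
train_r2 = rf.score(X_tr, y_tr)
test_r2 = rf.score(X_te, y_te)
```

Each tree is grown on a bootstrap resample with a random feature subset per split, so averaging the trees reduces variance relative to a single DTR.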

Preparation of Datasets.
Data preprocessing is a technique from data mining applied to derive accurate datasets from large datasets, based on identification, classification, clustering, and regression [32][33][34].
Three new datasets are generated from the original 26,430 C.C.E by data preprocessing using centroid point clustering, to increase the prediction interpretability and capability of the models by reducing the sample size, based on village-, tehsil-, and district-level datasets [2].
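Assuming the records carry village, tehsil, and district identifiers, the centroid-clustering step (replacing each cluster of records by its feature means) can be sketched with pandas; the column names and values below are illustrative, not the study's data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crop-cut records: location identifiers plus two
# numeric features; the real data carry 15 features.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "district": rng.choice(["A", "B"], size=100),
    "tehsil":   rng.choice(["t1", "t2", "t3"], size=100),
    "village":  rng.choice([f"v{i}" for i in range(20)], size=100),
    "urea_kg_acre":     rng.random(100) * 50,
    "yield_munds_acre": rng.random(100) * 40,
})

# Centroid clustering: each cluster of records is replaced by its feature
# centroid (mean), shrinking the sample from record level (D1) to village
# (D2), tehsil (D3), and district (D4) centroids.
d2 = df.groupby(["district", "tehsil", "village"], as_index=False).mean(numeric_only=True)
d3 = df.groupby(["district", "tehsil"], as_index=False).mean(numeric_only=True)
d4 = df.groupby("district", as_index=False).mean(numeric_only=True)
```

Each successive grouping level yields a strictly smaller dataset, matching the reduction from 26,430 records to 6,034, 145, and 36 centroids reported in the conclusions.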

Data Partition.
Sklearn provides a way to generate accurate results able to make true predictions; for that, the model must be trained on training datasets and then tested on unseen datasets, using Sklearn's train_test_split function. The train_test_split function splits a single dataset into two different subsets using random partitions, called the training subset and the testing subset. The training subset is used to learn or build the model, and the testing subset is used to evaluate the model performance on unseen data. For the current study, data partition is carried out using a randomized train-test split, and the performance capability of the models is investigated on the four types of datasets, taking 75% of the data as the training subset and 25% as the testing/validation subset, as follows.
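The 75%/25% randomized partition used for all four datasets can be sketched as:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 toy records, 2 features
y = np.arange(20)

# 75% training / 25% testing, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```

With 20 records, this yields 15 training and 5 testing rows; the same call applied to D1-D4 produces the training/testing sizes reported in Section 3.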
While applying the machine learning algorithms to predict the response variable (wheat productivity), the datasets are split into two parts named training and testing datasets (Section 2.4). Two types of error are reported in prediction of the response using machine learning algorithms [35]: the error reported during the training phase, called training error or bias, which is measured over the observed data samples in the training phase, and the out-of-sample error (generalization error), which measures the expected error in the testing phase or on unseen datasets and relates to variance. Both underfit (high bias and low variance) and overfit (low bias and high variance) algorithms mislead the machine learning model's prediction capability, and the bias-variance trade-off is a common property in machine learning model building. The prediction error decomposes as the sum of three components: bias, variance, and irreducible error [25,36]. Mathematically, the target variable (wheat yield) is predicted by the machine learning model from the covariates (15 features) through the relation y = g(x) + e, where "e" is an error term assumed to follow normality. Using the machine learning modeling technique, the estimated model of g(x) is ĝ(x), and the expected squared prediction error at "x" decomposes as prediction error = bias² + variance + irreducible error. The irreducible error term may be called the noise term, which exists in the true relationship between the features and the response; in machine learning model prediction, the aim is to decrease both the bias and variance terms.
However, in machine learning model prediction there exists a bias-variance trade-off, and optimum model complexity means a situation where the model predicts well with low variance and low bias and is free from overfitting and underfitting [37]. Figure 3 elaborates the conditions of overfitting and underfitting at higher and lower model complexity, while in the ideal range of model complexity the MLM predicts well.
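A Monte-Carlo sketch of the decomposition above: refitting a shallow tree on many training sets drawn from y = g(x) + e and measuring the bias² and variance of its prediction at one point. The function g and the noise scale are illustrative choices, not quantities from the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def g(x):
    """Illustrative true regression function."""
    return np.sin(2 * np.pi * x)

x0 = np.array([[0.3]])  # single evaluation point
noise_sd = 0.3          # standard deviation of the irreducible noise e

# Refit the same shallow model on 200 independent training sets
preds = []
for _ in range(200):
    x = rng.random((50, 1))
    y = g(x).ravel() + rng.normal(scale=noise_sd, size=50)  # y = g(x) + e
    model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)
    preds.append(model.predict(x0)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - g(x0)[0, 0]) ** 2   # (E[g_hat(x0)] - g(x0))^2
variance = preds.var()                         # Var[g_hat(x0)]
expected_sq_error = bias_sq + variance + noise_sd ** 2  # + irreducible term
```

Raising `max_depth` shrinks the bias² term while inflating the variance term, which is exactly the trade-off Figure 3 depicts.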

Evaluation Metrics and Information Criterion.
The evaluation metrics performance score (R2) and root mean square error (RMSE) are applied to measure the accuracy of the regression models. A lower RMSE and a higher performance score support a good fit. The Akaike information criterion (AIC), using the log-likelihood function with a simple penalty, is applied to determine the theoretical and logical relevance of the predictors to the response and their statistical significance in the model. A lower AIC value leads to the conclusion that the fitted regression model is good [38][39][40].

RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²),
AIC = e^{2k/n} · û², where û² = the mean squared residual, k = the number of features plus the intercept, n = the sample size, and 2k/n = the penalty factor. One key objective of deriving the AIC is to rank a range of models by their relative AIC values. For comparing multiple models, we can measure how much better the best candidate model is than the next best models, and the easiest way to make the comparison is to measure the change in AIC values between the best model and the i-th other model, ΔAIC_i = AIC_i − AIC_min. ΔAIC_i is also used to measure the relative strength of the best model against the other models and to determine the level of empirical support in model comparisons as a quick strength of evidence; a lower difference supports the model. Burnham and Anderson [41] defined the evidence ratio "E.R," used to compare the efficiencies of various models, as a measure of how much more likely the best model is than the other models [42].
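The ΔAIC, Akaike-weight, and evidence-ratio calculations can be sketched numerically. The AIC inputs below are the MLR values this study reports for D1-D4 (4.43, 4.07, 2.43, 1.62); the weight formula W_i = exp(−ΔAIC_i/2)/Σ exp(−ΔAIC_j/2) is the standard Burnham-Anderson definition.

```python
import numpy as np

# MLR AIC values reported in this study for D1, D2, D3, D4
aic = np.array([4.43, 4.07, 2.43, 1.62])

delta = aic - aic.min()        # Delta AIC_i = AIC_i - AIC_min
w = np.exp(-delta / 2.0)
w /= w.sum()                   # Akaike weights W_i, summing to one
er = w.max() / w               # evidence ratio of the best model vs. each model
```

This reproduces (to rounding) the weights 0.11, 0.13, 0.30, 0.45 and the evidence ratios near 4.06, 3.41, and 1.50 reported for the MLR models below.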
The Akaike weight is used to determine the probability that a model has good capability to predict wheat productivity, with the weights summing to unity (Σ W_i(AIC) = 1). Higher weights indicate a model with relatively good prediction capability, and vice versa [38,43]. Cronbach's alpha "α" and reliability analysis are applied to determine the degree of consistency and relevance of the predictors with reference to the measure of the response [44,45].
α = (k/(k − 1)) · (1 − Σ_{i=1}^{k} s_i² / s_T²), where k = the number of items, s_i² = the variance of the i-th item, and s_T² = the variance of the aggregate (total) score.
The reliability coefficient ranges from 0 to 1; values near 0 indicate poor reliability, while values near 1 depict strong reliability. The prediction capabilities of the models are integrated using the four different-sized datasets generated through the centroid clustering scheme. This study integrates the efficacies of machine learning models with benchmark traditional statistical models to select the optimum model according to the evaluation metrics and information criteria.
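Cronbach's alpha as defined above can be computed directly; the item data below are synthetic (four items sharing a common factor), not the study's datasets.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: alpha = k/(k-1) * (1 - sum(s_i^2) / s_T^2).

    Rows are observations, columns are items; s_T^2 is the variance of the
    total (row-sum) score.
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # s_i^2 for each item
    total_var = items.sum(axis=1).var(ddof=1)    # s_T^2 of the total score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Synthetic correlated items: a shared component plus independent noise
rng = np.random.default_rng(0)
base = rng.random((60, 1))
data = base + rng.normal(scale=0.3, size=(60, 4))
alpha = cronbach_alpha(data)
```

Because the four columns share a common component, alpha lands well above 0, illustrating the "strong reliability" end of the 0-to-1 scale described above.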

Importance of Agronomical Features and Reliability of Datasets.
Feature importance refers to techniques that ascribe an importance score to input variables, which is useful to investigate how useful the features are for predicting the response. Feature importance scores provide insight into the datasets as well as into the model and improve the efficiency, predictability, and effectiveness of a predictive machine learning model. Before deploying the machine learning approaches on the different datasets, the variations of the agronomical features, ordered by importance, are particularized in Figure 4 for D1, Figure 5 for D2, Figure 6 for D3, and Figure 7 for D4. Table 1 shows the values of Cronbach's alpha for the reliability measure and reports the reliability coefficients as 0.35 for D1, 0.39 for D2, 0.63 for D3, and 0.64 for D4. The reliability of the datasets becomes progressively stronger as we advance from D1 to D4.

Performance Measures of Multiple Linear Regression Models.
The performance of the prediction capability of multiple linear regression on the generated datasets of different sizes is evaluated and integrated for both the traditional statistical models and the machine learning approaches.

Machine Learning Models.
Multiple linear regression models (MLRMs) are constructed using the machine learning approach and integrated with benchmark traditional statistical models. For the MLM, Table 2 shows the performance scores 0.266, 0.289, 0.838, and 0.932 for the training datasets and 0.264, 0.285, 0.834, and 0.655 for the testing/validation datasets for D1, D2, D3, and D4, respectively. The R2 becomes progressively stronger as we advance from D1 to D4 for the training datasets (R2_Dtrain(i) < R2_Dtrain(i+1)) and likewise for the test data, except for D4. The RMSE is found to be 9.14 and 9.21 for D1. The model is trained and deployed using the 75% training subsets. D4 shows the lowest AIC, 1.62, with the highest Akaike weight (AIC_W), 0.45, followed by AIC 2.43 and AIC_W 0.30 for D3, AIC 4.07 and AIC_W 0.13 for D2, and AIC 4.43 and AIC_W 0.11 for D1. The Akaike weights increase (AIC_w(i) < AIC_w(i+1)) and the AIC decreases (AIC_(i) > AIC_(i+1)) as we advance from D1 to D4. The evidence ratio corroborates these results: the D4 model is 4.06, 3.41, and 1.50 times more likely than the D1, D2, and D3 models, respectively.

Integrating Machine Learning and Traditional Statistical Modeling for MLR.
The evidence ratio shows that the D4 model is more likely than the D1, D2, and D3 models, and the E.R is found better in the MLM compared with the TSM for all datasets (E.R_TSM < E.R_MLM). All the performance measures optimized well in the ML models, clarifying that the MLM has good capability for the prediction of wheat productivity based on agronomical features. Figure 8 shows the graphical relations of the learning points of the models for the evaluation metrics and information criterion for both MLM and TSM, showing that machine learning performed well for all the datasets and that D4 optimized the machine learning multiple regression models.

Decision Tree and Random Forest Regression Models.
The machine learning models trained and deployed for multiple linear regression, which predicted well, are further extended to the important and most prominent machine learning algorithms, i.e., decision tree regression models (DTRMs) and random forest regression models (RFRMs), with the aim of obtaining the most optimized models able to predict wheat productivity well, using 75% of the data to learn the model and 25% as the validation dataset to evaluate the model's capability on unseen data.

Hyperparametric Tuning of DTRM and RFRM.
Hackeling [46] reported that hyperparameter tuning of DTRM models is applied to avoid over- and underfitting, using scikit-learn's GridSearchCV to find the optimum values of min_samples_split and max_depth (tree depth). Figure 9 shows the DTR for D1, having 19,822 sample points for the training and 6,608 sample points for the testing phase, and illustrates that at lower model complexity the model is underfit (high bias and low variance), while the error curve for the testing set rises again after tree depth 10, which leads the model to overfit. For Figure 10, the DTR for D2 has 4,525 sample points for training and 1,509 sample points for testing, and the same pattern prevails after tree depth 06, indicating that the optimum tree-depth hyperparameter is found to be 10 and 06 for the DTR models based on D1 and D2. The tree-depth values are optimized at 05 and 04 for the models based on D3, having 108 sample points for training and 37 for testing, and D4, having 27 sample points for training and 09 for testing (Figures 11 and 12). The min_samples_split value is optimized at 29, 28, 6, and 2 for D1, D2, D3, and D4, respectively. The RFR and DTR share the same set of hyperparameters, except the random forest's number of trees in the forest (n_sample), whose default range is 10-100. D1 is optimized at 10 trees, D2 and D3 at 50 trees, and D4 at 100 trees for the wheat productivity prediction model.
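The GridSearchCV tuning described above can be sketched as follows on synthetic stand-in data; the candidate grids mirror the ranges in the text, but the optimum values found for this toy data need not match the study's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the crop-cut records
rng = np.random.default_rng(0)
X = rng.random((300, 15))
y = X[:, 0] * 10 + rng.normal(scale=0.5, size=300)

param_grid = {
    "max_depth": list(range(1, 21)),       # tree depths 1..20
    "min_samples_split": [2, 6, 28, 29],   # candidate splits reported for D1-D4
}

# 5-fold cross-validated grid search over all depth/split combinations
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0), param_grid, cv=5
).fit(X, y)
best = search.best_params_
```

`best_params_` holds the cross-validated optimum, the same mechanism that yields the tree depths 10, 06, 05, 04 and the min_samples_split values reported above for D1-D4.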

Decision Tree Regression Models.
For the DTRM, performance improves for the train and test models as we advance from D1 to D4. The DTR model is trained and deployed using the 75% training subsets. The AIC reports a diminishing trend of 4.28, 3.96, 1.44, and 0.29 for D1 to D4 (AIC_(i) > AIC_(i+1)). The AIC_W of the model based on D4 is highest, with probability 0.54. The RFR model is likewise trained and deployed using the 75% training subsets. Its AIC reports a diminishing trend of 4.26, 3.92, 2.18, and 0.70, with increasing AIC_W of 0.09, 0.11, 0.26, and 0.54 for the D1, D2, D3, and D4 models (AIC_w(i) < AIC_w(i+1) and AIC_(i) > AIC_(i+1)). The highest AIC weight is reported for the model learned from D4, followed by the models learned from D3, D2, and D1. The E.R values of the RFR models show that the model learned from D4 is 5.92, 5.0, and 2.10 times more likely than the models learned from D1, D2, and D3.

Comparative Quantification of Machine Learning Models for Different Datasets.
Section 3.3.1 depicts that machine learning performed well compared with the traditional statistical approaches for the multiple regression models. Section 3.3 presents the models further trained and deployed for the machine learning algorithms, i.e., decision tree regression models (DTRMs) and random forest regression models (RFRMs), with the aim of obtaining the most optimized models able to predict wheat productivity well. In Tables 2 and 3 and Figure 13, the performance score of the RFR models is well reported for all training and testing datasets, followed by DTR and MLR for D1 and D2. The performance score of RFR is found to be high for the D3 training set, while a little variation is found for the testing sets; for D4, all models show performance above 90% for the training sets. The decomposition prediction errors order as (P.E_MLRM,Di) > (P.E_DTRM,Di) > (P.E_RFRM,Di). The RFRM reveals a good performance score and the lowest decomposition prediction error as we advance from D1 to D4. The RFRM successfully predicted wheat productivity when compared against the other models using the original and generated datasets.

Conclusions
This study integrated the efficacies of machine learning regression algorithms, multiple linear regression models (MLRMs), decision tree regression models (DTRMs), and random forest regression models (RFRMs), with benchmark traditional statistical models to converge on the optimization capability of prediction models for wheat productivity. The original dataset of 26,430 (D1) crop-cut experiments, along with fifteen features, is collected from the crop reporting service. 2nd-stage area frame sampling is applied to select the sample. A new centroid clustering scheme is introduced that can enhance model performance by reducing the sample size. Three more datasets are generated to optimize the model performance for both the machine learning models (MLMs) and traditional statistical models (TSMs). The generated datasets comprise 6,034, 145, and 36 sample points generated from village-, tehsil-, and district-level centroid clusters. 75% of each dataset is used as the training subset and 25% as the testing subset. The evaluation metrics (R2, RMSE), Akaike information criterion (AIC) with weights (AIC_W), evidence ratio (E.R), reliability analysis, and decomposition prediction error (P.E) are applied to compare the performance of the models. The performance score (P.S) increased, while the RMSE and AIC decreased, for both the MLM and TSM. The RFRM revealed a good P.S and the lowest P.E for all the datasets. The RFRM successfully predicted wheat productivity, followed by the DTRM and MLRM, for D1, D2, D3, and D4. It is demonstrated that machine learning models provide superior performance through centroid clustering even as the sample size is reduced from D1 to D4. This study demonstrated strong evidence for the implementation of machine learning models as an alternative to traditional statistical models for future research directions and correct policy decisions regarding wheat productivity.
The advancement of science and technology and the implementation of innumerable agronomical constraints in various fields of agriculture lead to immense volumes of data, and this study provides a detailed hierarchy of centroid clustering which increases model performance by reducing the sample size. This hierarchy of centroid clustering could be extended to multistage centroid clustering in future research, and it could also be applied with all supervised machine learning algorithms to enhance model performance.

Data Availability
The cross-sectional original datasets and generated datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Authors' Contributions
Muhammad Islam performed descriptions, data preparations, methodologies, data analysis, and conclusion. Farrukh Shehzad contributed to supervision, preparations, data analysis, and descriptions.