Attribute Selection Impact on Linear and Nonlinear Regression Models for Crop Yield Prediction

Efficient cropping requires yield estimation for each involved crop, where data-driven models are commonly applied. In recent years, some data-driven modeling technique comparisons have been made, looking for the best model to yield prediction. However, attributes are usually selected based on expertise assessment or in dimensionality reduction algorithms. A fairer comparison should include the best subset of features for each regression technique; an evaluation including several crops is preferred. This paper evaluates the most common data-driven modeling techniques applied to yield prediction, using a complete method to define the best attribute subset for each model. Multiple linear regression, stepwise linear regression, M5′ regression trees, and artificial neural networks (ANN) were ranked. The models were built using real data of eight crops sowed in an irrigation module of Mexico. To validate the models, three accuracy metrics were used: the root relative square error (RRSE), relative mean absolute error (RMAE), and correlation factor (R). The results show that ANNs are more consistent in the best attribute subset composition between the learning and the training stages, obtaining the lowest average RRSE (86.04%), lowest average RMAE (8.75%), and the highest average correlation factor (0.63).


Introduction
Crop yield prediction (CYP) is important for agricultural planning and resource distribution decision making [1]. Regrettably, CYP is a difficult task because many variables are interrelated [2]. Yield is affected by human producer decisions or activities (such as irrigated water, land, and crop rotation) or incontrollable factors (such as weather). Commonly, cropping planners use the previous yield as an estimation of future yield. Nevertheless, crop yield varies spatially and temporally with a nonlinear behavior, introducing large deviations from one year to another [3]. Thus, more efficient methods for CYP have been developed, in which crop growth and data-driven models are the most popular. Crop growth models, using site-specified experimental data, regional calibration, and plot level observations, are recognized as robust and efficient models. However, they are available only for some crops, with development time and cost being extremely large [3]. On the other hand, data-driven models work with high-level information and are built empirically without a deep knowledge about physical mechanisms which produce the data. Previous works suggest that data-driven models have better adaptability for cropping planning than crop growth methods due to their friendly implementation and performance [4].
Data driven models are widely applied using classical statistics and data-mining methods. Statistical models use parametric structures tuned with sum-of-squares residuals, validated by hypotheses test and confidence intervals. Most of the statistical applications for CYP have been linear [3], obtaining a range from bad to moderate results. Data mining applies machine learning techniques and nonparametric structures, in which validation uses prediction accuracy. Machine learning (ML) obtains nonlinear models from massive datasets [5]. Most common ML techniques for CYP are regression trees [6] and neural networks [3,4,7]. Despite the high site dependency, neural networks are widely recognized as robust models, obtaining good results for CYP 2 The Scientific World Journal [7]. Comparisons between linear and nonlinear models for CYP show a small advantage in favor of nonlinear models [3,7]. However, the attribute subset is usually the same for all the evaluated techniques. In practice, the explanatory attributes are selected from expertise assessment or previous publications, for instance, [3,8]. However, the explanatory attributes may have a different impact on each technique, even using the same dataset [9]. A fairer comparison should include the best attribute subset for each technique, selected with some performance metrics [10]. Regrettably, only an exhaustive approach can guarantee the optimal subset for all regression techniques. Some CYP datasets have relatively few attributes and an exhaustive approach can be applied to model comparison purposes [8].
In this paper, a comparison between linear and nonlinear data-driven modeling techniques for CYP is presented. The best attribute subset for each technique is determined by measuring the predictive accuracy of each model subset. To obtain the optimal subset, a recursive algorithm finds all the feature combinations, building a regression model of each subset. The models are built using most samples of training datasets, leaving the more recent to measure the performance. The best subset for each technique is tested with samples representing future information which had not been included in the training stage. The most common techniques for CYP were compared: multiple linear regression, stepwise linear regression, M5 regression trees, and perceptron multilayer neural networks. Results per technique are compared against those obtained using the optimal attribute combination derived from the test dataset. The potential attributes considered for this work were irrigation water depth (mm), accumulated rainfall (mm), solar radiation (MJ/m 2 ), maximum and minimum temperatures ( ∘ C), relative humidity (%), and the farm location. To build the models, historical data of eight crops were obtained from one irrigation module located in Mexico. Results show the best CYP technique, the most influential attributes for each model, and the fact that an exhaustive approach on the training dataset does not guarantee optimality on testing dataset. This paper is organized as follows. Section 2 describes data sources, data-driven techniques, accuracy metrics, and the recursive algorithm used to build and test the models. Section 3 shows the experimental results and discussion. Finally, Section 4 presents the conclusions about realized work.

Data
Description. This paper uses data obtained from the irrigation district 075 (Santa Rosa III-1 module) in Sinaloa, Mexico (one of the largest and most productive districts in the country). Two data sources from the year 1999 to 2007 were collected: (a) agricultural production data and (b) weather information data. The former included attributes regarding sowed areas, crop types, quantity of irrigated water, starting and ending sowing dates, and crop yield. Such data were obtained from Spriter-GIS system [11]. The second data source includes climatological variables measures such as rainfall, solar radiation, and temperatures. Weather data were collected from the National Meteorological Service (SMN) stations located in the module vicinity. The CRISP-DM [5] methodology was applied to clean, homogenize, and integrate both data sources into one single database, obtaining eight crop representative datasets. Eight potential attributes (Table 1) were selected based on previous CYP works [12] and the data availability. Such attributes are referred to as potential because this work uses a complete algorithm to find the best attribute subset for each regression technique. Thus, the final subset of attributes depend on the algorithm execution. Average of weather attributes (solar radiation, temperatures, and humidity) was estimated with the last three crop growing stages, the most influential in the crop development. The crop datasets are described in Table 2. To simplify future references of these datasets, an ID is assigned to each one (which is shown in the first column). Table 2 describes the quantity of records and periods of time used for the training and testing stages. In order to maintain realistic conditions, the last year of available data was reserved for testing.

Multiple and Stepwise Linear Regression.
Multiple linear regression (MLR) is a popular technique which can be applied to predict a dependent variable , using a set of independent variables . MLR model is described by [14] where is the number of independent variables, is a regression coefficient, is the value for the observation , and is the residual error. If is a nonsingular matrix, an approximation for ( ) can be obtained by = ( )  efficiency. Linear regression is performed by adding or removing independent variables on each iteration. Initially, the variable with the highest correlation ( -squared) measured with respect to the dependent variable is included. Then, the remaining independent variable with the highest correlation with respect to the dependent variable is selected. This iterative process is repeated while the addition of a remaining independent variable increases -squared with a significant quantity. We use the SLR implementation in SPSS [15], which combines forward selection and backward elimination [16]. At each step, the best remaining variable is added according to a significance criterion of five percent; then the entire set of variables is reviewed to decide whether a single variable is removed using an of ten percent.

Regression Trees.
A regression tree (RT) is based on a decision tree, a classifier expressed as a recursive partition of the samples' space [17]. A tree is formed by nodes, in which the first is named the root node (without incoming edges).
All the other nodes have exactly one incoming edge. A node with outgoing edges is called a test node and a node without outgoing edges is called a leaf node. Each internal node in the tree splits the samples' space into two or more subspaces based on conditions of the input attributes values. In the case of numerical attributes the condition refers to a range of values. Each leaf is assigned to one class representing the most appropriate target value. Samples are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path. For regression trees, the class at the leaf nodes assigns a numerical value to the tested sample which corresponds to the value predicted by the regression model. The most common algorithms to build RTs are CART, M5, and M5 [17]. This work uses M5 algorithm implemented by Weka [17]; the standard deviation reduction (SDR) is applied as a measure of impurity on continuous attributes. The parameters to build an RT with a minimum of two samples by node, pruned and smoothed, were selected. Once calibrated, inputs are combined and transmitted to other processing units via the appropriate interconnections. The units are organized by layers, hiding the intermediate layers to the user. This process is represented by a nonadditive and nonlinear function that maps the set of inputs to a set of outputs [17]. The training stage is an iterative process performed to pound connections and it is guided for error measure. There are many ANN topologies and training algorithms. This work uses the most popular topologies and learning algorithm combinations: multilayer perceptron (MLP) and backpropagation algorithm [17]. MLP network has been a popular choice for CYP [1,3]. Backpropagation algorithm minimizes the error function using the gradient descent method. The combination of weights obtained is a solution of the learning problem. Since this method requires computation of the gradient of the error function at each iteration step, the continuity and differentiability of this function should be verified. In addition, an activation function is required where the sigmoidal function (1/(1 + − )) is commonly used [17]. In this work, a topology with three layers and 10 neurons on a single hidden layer was used; this topology was applied in other works [4]. The most recommended parameters were applied such as the weight decay and numeric attribute normalization [3]. Training epochs, learning rate, and the momentum were established by experimentation, being 1000, 0.3, and 0.01, respectively. Quantity of neurons at the input layer depends on number of attributes (see Section 2.5), while the output layer has only a neuron (CYP estimation).

Accuracy Metrics.
We use three of the most common metrics of regression models [5]: the root relative square error (RRSE), correlation factor ( ), and the relative mean absolute error (RMAE). RRSE compares the model prediction against the mean, which is frequently used to supply the crop yield value. An RRSE less than 100% indicates a prediction that is better than the average value. is the average of the real yield values, and̂is the average of predictions: Many CYP works use root mean squared error (RMSE) as accuracy metric. RMSE measures the difference between real and estimations values, exaggerating the presence of outliers [5]. We use RRSE instead of RMSE because the former applies the average value as common reference point, being easy to understand by people unaccustomed to physical crop yield dimensions.

Method to Find the Best Attribute Subset.
A combinatorial procedure to perform a complete enumeration of all the subsets {x 1 , x 2 , x 3 , . . . , x m } is presented in this paper. The procedure starts with a potential set of attributes = { 1 , 2 , . . . , }, such that each x 1 is a subset of . Each x k subset is evaluated using the training dataset, which is The Scientific World Journal 5 range is really used for training, and data from year b is reserved for validation. The model's validation is made using the metrics described in Section 2.3. Each validation result and the related attribute subset are registered in a sorted list according to these metrics. Ties are solved in the following order: RRSE (lower), (higher), and RMAE (lower). At the end of the process, the subset at the top is taken as the best. Algorithm 1 shows the algorithm of the optimal attribute search process.
The function evalModel(model, validSamples) of Algorithm 1 evaluates the argument model with samples taken from the validSamples dataset. This function uses the percentage-split validation scheme approach [5]. We tried other validation schemes as well, such as training and validating the models with the entire training dataset and cross-validation (CV). The former provided very poor results to predict the yield of future samples. On the other hand, CV (considered a robust validation scheme) was difficult to apply because (1), for subsets required for CV, − 1 models should be stored, and (2) the computational cost of the entire process is increased − 1 times for each evaluation [3], being not computationally tractable in practical applications.

Distance to the Optimal Attribute Subset (OAS).
Algorithm 1 can be applied to both the learning and testing stages. When this algorithm is applied to the former, a ranking of attribute combinations is obtained, placing the best attribute combination at the top. This subset is named the learning attribute subset (LAS). In testing stage, the algorithm is applied to the union of the training and the testing datasets, obtaining a rank of attribute subsets. In this last case, the subset at the rank's top is named the optimal attribute subset (OAS). Evidently, this last rank cannot be available in practice, because testing dataset represents unseen samples from the future. However, the rank of attribute combinations that originated the OAS can be used to define a new performance metric, which should be used only for evaluation purposes. Let be an attribute subset and the number of combinations that separates the OAS results from the subset results. Then can be used as a performance measure of . We called measure the "distance to the optimal attribute subset. "

Results and Discussion
Experimental results are presented in the next three sections. Section 3.1 shows the metric measures obtained in testing dataset with the OAS. Section 3.2 shows the metric measures using the potential attributes. Section 3.3 describes the results using the LAS on testing dataset.

Metric Measures Using the OAS on Testing Dataset.
The OAS for the testing dataset to each crop technique was obtained with the algorithm of Section 2.4. Table 3 shows every metric obtained per technique (RRSE, , and RMAE). RRSE shows that all techniques achieve better predictions than the average. For the potato crop datasets (PA06 and PA07), MLR obtains only slightly better results than the average (RRSE of 95%). In general, nonlinear techniques show some improvements over MLR, introducing small RRSE measures and values near to 0.7.
Tables 4(a), 4(b), 4(c), and 4(d) show the OAS composition found for each regression technique. The attributes in OAS are shown in shaded cells. Evidently, OAS is the same for SLR and MLR (Tables 4(a) and 4(b), resp.). Attributes selected are grouped in Table 5, which shows the quantity of times that a particular attribute is included in the OAS for each crop dataset. The average column in Table 5 indicates that the IWD, RH, SGR, and MINT attributes appear in more than half of optimal crop yield models, mostly independent of the regression technique. Besides, IWD (irrigation water depth) was the attribute most selected by all techniques.
Because attributes selected can be influenced by temporal elements, Figure 1 shows the results obtained only with the five crop testing datasets of year 2007. Attributes most frequently selected by MLR technique were MinT and IWD, with the latter always included in the OAS. Attributes most frequently selected by M5 were MinT and RH, with the latter always included in the OAS. On the other hand, attributes selected by ANN technique were IWD and RF. Unlike other techniques, ANN did not always select a specific attribute.     Table 6 shows the RRSE, , and RMAE measures using all the potential attributes as explanatory variables. RRSE indicates that only two of the eight crop models per technique obtain better predictions than the mean yield value. MLR has three models with good predictions. However, PA06 model shows an value of 0.07, indicating a very low linear relationship between the prediction and the real yield. The models for the PJ01 dataset are the most consistent, showing good results with every technique and a small improvement with nonlinear models. For every technique, the set of RRSEs lower than one hundred percent was averaged; in the case of Table 6, the figures obtained were 94.41, 74.34, and 75.79 for MLR, M5 , and ANN, respectively. The averages with the entire set of RRSEs were also calculated and shown in the row named Average (all) of Table 6. We decided to average the RMAEs with an RRSE lower than one hundred percent and an factor close to one and greater than a threshold value, set as 0.6 in this work. As is well known, a good prediction model should have a low RRSE and an value close to 1. Therefore, for all the potential attributes and when only RRSE and are considered, we can observe that M5 is the best technique. Averaging the RMAEs that accomplish this criterion (RRSE < 100% and an > 0.6), the best techniques were again M5 and ANN. The distance to the optimal attribute subset (described in Section 2.5) provides an idea of how far are the OAS results to those obtained with all the potential attributes. Table 7 shows values for the evaluated techniques, which indicates that very few models are close to the optimal results using all the potential attributes. Considering all the 256 possible combinations, most of the obtained results with all the attributes are located beyond the middle of the rank of combinations.    Table 8 show the obtained results. Our attribute selection algorithm (Section 2.4) improves the MLR, M5 , and ANN models performance, increasing the number of CYP models with RRSE measures lower than 100% (MLR obtained five models, while M5 and ANN obtained six models each). SLR has a poor performance, with only one model with an RRSE measure lower than 100%, and an average error even higher than MLR using all the potential attributes ( Table 9). RRSE of nonlinear techniques shows small improvements with respect to MLR. In addition, measures of MLR are higher than those obtained by M5 and inferior to those obtained with ANN. Average values for the models with RRSE lower than 100% are greater than 0.5. Only nonlinear techniques obtained an average RRSE value lower than 100%. Among these, ANN obtained better results, with the lowest RRSE, the highest , and the lowest RMAE measures. Table 9 shows that MLR, M5 , and ANN model errors built with the LAS decrease considerably when they are compared against the use of all the potential attributes. Average RMAE values decreased as follows (in %): (a) for MLR, from 17.54 to 13.59; (b) for M5 , from 19.49 to 11.93; and (c) for ANN, from 18.23 to 12.89. The average RMAE measure for SLR was scored worse than MLR using all the potential attributes, obtaining an RMAE of 18.65%. Table 10 shows the distances of the error measure obtained from the LAS to the OAS results. ANN is the regression technique with more CYP models closer to the optimal. MLR and M5 present similar average results. SLR was too far to optimal combination. Tables 4(a), 4(b), 4(c), and 4(d) show the attributes contained in each LAS grouped per technique. These attributes are identified with a √ symbol. As it can be seen, LAS and OAS differ, showing that the best attribute subset can vary from one year to another. Nevertheless, such behavior is different for each technique. Let us illustrate the situation with Figures 2(a), 2(b), and 2(c), which show the frequency when an attribute appears in the LAS and OAS for the MLR, M5 , and ANN regression techniques (resp.). To avoid mixing results from different years, these charts only include attribute subsets obtained from the crop datasets with 2007 testing 8 The Scientific World Journal  (Figure 2(b)). As a result, ANN is the most consistent technique since its best attribute subsets scarcely varied from year to year (Figure 2(c)).

Metric Measures Using All the Potential Attributes.
We applied the measure to the attributes in LAS that intersects the OAS and the RRSE. This allows us to distinguish the error caused by including the relevant attributes in the model and the errors due to the regression technique predictive ability. This measure is shown in column two of Table 11. We also applied the metric to the attributes excluded from LAS and those selected in OAS with the

Conclusions
This paper presents a comparison among several methods (linear and nonlinear) for crop yield prediction. The comparison is made using the best attribute subset found in the training dataset for each method, which was detected using a complete algorithm and the percentage-split validation scheme. The algorithm uses the oldest samples in training datasets to build the models, leaving the most recent to search the optimal attribute subset. The best attribute subset performance is measured with testing datasets composed of unseen samples from the future. The comparison covered eight crop datasets. The most common data-driven techniques for crop yield prediction were evaluated: stepwise linear regression, multiple linear regression, regression trees, and neural networks. The experimentation shows that our attribute selection using a complete method substantially improves the performance of all the evaluated techniques. ANN and M5 obtained the best prediction, and, between them, the former achieved the lower RRSE, the higher R correlation, and the lower RMAE value. With respect to the optimal attribute subset composition, MLR and ANN techniques show small differences between the best attribute subset in learning stage and the optimal attribute subset for the testing stage, while M5 shows the largest differences. The optimal attribute composition was different for all the evaluated techniques, which reinforces the hypothesis that using the same attributes subset for all the techniques is unfair. Nevertheless, none of the techniques was able to obtain the optimum subset with the training data for all the eight crops. The best technique was ANN, which achieved three attribute subsets equal to the optimal, and the other two subsets were very close to it. Thus, an attribute subset that can be used permanently in all the years for all the crops is difficult to select. Results obtained from machine-learning methods cannot be directly applied to a different set of crop databases, due their high data dependency. However, the procedure presented in this paper can be extended for a larger number of techniques and crop datasets. A future research focused on finding the best minimal subset of attributes which provide a good yield of predictions on other irrigation zones should be done.