Evaluation of Several Machine Learning Models for Field Canal Improvement Project Cost Prediction

Project cost prediction is one of the key elements in the development of civil engineering activities. Project cost is highly sensitive to diverse parameters and hence follows complex trends that make it difficult to predict and fully understand. The current study was initiated in response to the massive advancement of soft computing (SC) and the Internet of things (IoT). Several machine learning (ML) models, including extreme learning machine (ELM), multivariate adaptive regression spline (MARS), and partial least square regression (PLS), were adopted to predict field canal cost. Several essential predictors were used to develop the prediction network (the learning process), including the total length of the PVC pipeline, served area, geographical zone, construction year, and cost and duration of field canal improvement project (FCIP) construction. Data were collected from the open-source published literature. The modeling results evidenced the potential of the applied SC models in predicting the FCIP cost. In terms of numerical magnitude, the MARS model attained the lowest root mean square error (RMSE = 27422.7), mean absolute error (MAE = 19761.8), and mean absolute percentage error (MAPE = 0.05454), with Nash–Sutcliffe efficiency (NSE = 0.94), agreement index (MD = 0.89), and coefficient of determination (R2 = 0.94). The best prediction precision was obtained using all predictors except the geographical zone, which showed less influence on the construction cost. In general, the research outcome provides an informative preliminary cost estimate for civil engineering projects.


Introduction
The scarcity of freshwater has been a global problem recently and is expected to worsen in the future due to the increasing human population and the decline in annual water allocation per capita [1,2]. The present scenario portrays water unsustainability due to the drastic increase in water utilization (more than sixfold) in the 20th century [3]. It is presently estimated that about 1.2 billion people globally have no access to a clean water supply [4]. Hence, several policies and projects are being implemented globally to ensure water sustainability. One such project is the FCIP, which aims at increasing the conveyance efficiency of field canals by about 25% via improvement of the field canals used during irrigation processes in farmlands [5]. The project requires the construction of a buried PVC pipeline rather than relying on earthen field canals, in order to reduce water seepage or losses during field operations [6]. The FCIP is comprised of several simple components and structures, which include plain concrete intakes for water collection from the source; water is channelled through the suction pipes to a plain concrete sump [7]. Water is first accumulated in the sump before being pumped by the pumping sets through the PVC pipelines to the irrigation valves.
The FCIPs are comprised of civil works, mechanical components, and electrical components as the major components. The components of the civil works are the pump house, pipelines, suction pipes, intake, and sump structure, while the mechanical components are the irrigation valves, pump sets, and mechanical connections. The electrical boards and connections make up the electrical components of FCIPs [8].
The most interesting part of FCIPs is the cost estimation that must be performed; manual cost estimation processes are time-consuming [9]. However, in some cases, it can be attained based on personal engineering judgment and decision-makers' expertise. Such cost estimation is highly associated with bias and inaccuracy [10]. Therefore, SC models have been proposed as a potential solution to overcome these issues. In line with this, the aim of this work is to come up with a robust ML-based SC model for FCIP cost estimation. The proposed models are expected to help decision-makers and management engineers in making decisions from the perspective of the stakeholders.
The literature suggests that numerous studies have focused on the development of reliable regression and mathematical techniques that can be used for cost estimation in civil engineering projects [11][12][13][14][15][16].
The nagging problem in this domain still relates to the performance accuracy of these models, as the predicted cost is required to be highly accurate before the conception of the project. A weighted ANN was developed for unit cost prediction in highway projects by [16], while a parametric cost model was developed based on a questionnaire survey for the estimation of the final cost of pump stations by [17]. A fuzzy logic- (FL-) based parametric cost estimate model was presented by [18] for the prediction of the cost of building projects in the Gaza Strip. The study by [19] presented a hybrid ANN-FL model for cost prediction of water infrastructure. The prediction of the unit cost of highway projects in Libya using the ANN model was presented by [20], and the performance of the ANN model was excellent. A conceptual cost model for German residential building projects was developed by [21] using historical data for 75 residential projects sourced from the building cost information center. The use of ANN to determine the relevant parameters for cost prediction during tunnel construction in Greece was reported by [22] based on survey questionnaires. The survey was based on expert opinions and interviews in relation to the key cost drivers. The reviewed literature suggests the need for intelligence models that are robust and capable of capturing the complexity of civil engineering in a more realistic manner. Several ML models have been reported recently, such as ANN [23], SVM [24], ANFIS [25], genetic programming [26], decision tree [27], and gradient boosting [28], and several others were reported in the latest review [29]. However, the fact remains that each of these models behaves differently in terms of prediction accuracy. Some existing models are also capable of providing interpretable results; for instance, the variable coefficients of regression models can explain the influence of each variable on the response of the model.
Numerous studies have focused on building projects without giving much attention to the conceptual cost of FCIPs. Hence, this study focuses on pipeline construction projects, which have not attracted appropriate research attention, especially regarding the provision of detailed model development steps in terms of sample size, multicollinearity, outliers, and singularity. For instance, the study by [16] applied only 14 and 4 cases for the training and validation of their neural network model, respectively. This may have elicited concerns about the sample size in that study, as stated by [30]. The current study was motivated by the reviewed literature to predict the FCIP cost using newly explored machine learning models including ELM, MARS, and PLS. These models have been shown to be advantageous, as they have very quick learning speeds with good performance and are useful in capturing complicated data mappings over large sets of predictors while producing interpretable results [31][32][33][34]. The modeling structure was adopted based on the correlation statistic to identify the input predictors for the built ML models. Based on the modeling results, comprehensive comparative analyses were reported and discussed.

Extreme Learning Machine.
The ELM model is a recently developed method for training single-layer feedforward neural networks [35]. The traditional ELM, as shown in Figure 1, has one input layer, one hidden layer, and one output layer; each of these layers has a specific number of neurons. The linear function is generally selected as the activation function of the input and output layers of ELM, while the sigmoid function is selected for the hidden layer [36]. The first step of the standard ELM is the random determination of the input weights and hidden biases, followed by the determination of the output weights of the hidden layer using the Moore-Penrose generalized inverse method to achieve the optimal solution of the linear system [37]. The advantages of the ELM over gradient-based methods are its strong generalization capability, absence of parameter tuning, and fast learning; these have made ELM popular in numerous engineering tasks [38][39][40]. Consider a training dataset with N samples; the input vectors are first mapped into an L-dimensional feature space via a nonlinear transformation, and the simulated values of the ELM model are expressed as follows:

$$o_i = \sum_{l=1}^{L} \beta_l \, g(w_l \cdot x_i + b_l), \quad i = 1, 2, \ldots, N, \tag{1}$$

where N represents the number of samples for training; o_i represents the simulated output vector associated with the input vector x_i; β_l stands for the weight vector that connects the lth hidden neuron to the output layer; w_l is the weight vector that connects the lth hidden neuron with the input layer; b_l is the bias; and g is the activation function.
In the ELM, the idea is that the classical single-layer ANN can approach all the samples with zero deviation, as mathematically expressed in the relation:

$$\sum_{i=1}^{N} \lVert o_i - t_i \rVert = 0, \tag{2}$$

where t_i is the target output vector that is related to the input vector x_i. The reconstruction of the above expression gives the following:

$$H\beta = T, \tag{3}$$

where

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}, \tag{4}$$

where β is the weight matrix that connects the hidden and output layers; H is the hidden layer output matrix based on N samples; and T is the target output matrix based on N samples. Assuming that the hidden biases and input weights are constant, the model may be considered a special linear system in which H and T are the known matrices of the independent and dependent parameters, respectively, while β is the coefficient matrix that should be optimized. Hence, the least-squares solution of the linear system above can be derived as

$$\hat{\beta} = H^{\dagger} T, \tag{5}$$

where H† is the Moore-Penrose generalized inverse matrix of H.
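The training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `elm_fit` and `elm_predict`, the uniform range of the random weights, the number of hidden neurons, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Fit a single-hidden-layer ELM and return (W, b, beta)."""
    n_features = X.shape[1]
    # Step 1: input weights w_l and hidden biases b_l are drawn at random
    # (the uniform range [-1, 1] is an assumption for this sketch).
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step 2: hidden layer output matrix H (N x L) with sigmoid activation.
    H = sigmoid(X @ W + b)
    # Step 3: output weights from the Moore-Penrose generalized inverse of H.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Synthetic data for illustration only (not the FCIP dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 3))
T = X @ np.array([2.0, -1.0, 0.5]) + 0.3

W, b, beta = elm_fit(X, T, n_hidden=20, rng=rng)
pred = elm_predict(X, W, b, beta)
rmse = float(np.sqrt(np.mean((pred - T) ** 2)))
```

Because only the output weights are solved for, training reduces to one matrix pseudo-inversion, which is the source of the fast learning speed mentioned above.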

Multivariate Adaptive Regression Spline Model.
MARS algorithms are nonlinear, nonparametric, flexible regression models that were first developed by [41] and have found application in many fields of engineering due to their robustness [42]. This model is built with three major components, which are the basis functions (BFs), the knots, and the spline function [43]. The role of the BFs is to capture the relationship between the predictors and the predicted variable; they take the form max(0, c − x) or max(0, x − c), where x is the input variable value and c is the threshold value. The knots mark the endpoints of the BFs. A regression model is developed for each node by applying a spline function that consists of one or more BFs, followed by the substitution of the principal predictors [44]. In the MARS model, the predicted value is based mainly on a linear combination of BF elements. The MARS model can be described as follows: consider Y as the target variable and X = (X1, X2, . . . , XP) as the matrix of P input variables; then, the equation of the MARS model can be written as follows:

$$Y = \beta_0 + \sum_{m=1}^{M} \beta_m \, BF_m(X), \tag{6}$$

where β_0 is the initial fixed value; β_m is the coefficient of the mth BF; BF_m is the applied BF for the fitting of the MARS model; and M is the total number of BFs [45]. The two major phases of the MARS model are the selection phase (or forward search) and the backward pruning phase, as seen in Figure 2. The forward (selection) phase can be regarded as the selection of a set of candidate input parameters.
A complicated, overfitted model normally results from an excessive forward stepwise selection process due to a series of splits, and such models cannot perform well predictively despite fitting the data perfectly. Hence, the backward procedure is normally applied to improve the predictive performance of the model by removing the unwanted variables that were selected in the selection phase. The generalized cross-validation (GCV) is calculated as the deletion criterion, as it is the basis for the backward pruning process [46,47]:

$$GCV = \frac{\frac{1}{N} \sum_{i=1}^{N} \left[ O_i - f(x_i) \right]^2}{\left[ 1 - \frac{C(M)}{N} \right]^2}, \tag{7}$$

where O_i is the observed value; N is the number of data; f(x_i) is the predicted value for pattern i; M is the number of BFs; and C(M) = (M + 1) + dM is the penalty factor. In equation (7), the value of the parameter d significantly impacts the procedure, as it is the optimization cost of each BF; its range is 2 ≤ d ≤ 4. The inclusion of several BFs can result in overfitting; therefore, it is important to omit some BFs during the pruning phase to enable the emergence of a well-fitted model with the least GCV value [48].
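To make the roles of the basis functions and the GCV criterion concrete, the sketch below builds one mirrored pair of hinge BFs at a known knot, fits their coefficients by least squares, and scores the fit. It is an illustration only: the knot location, the value d = 3, the helper names `hinge_pair` and `gcv`, and the synthetic data are assumptions, and no forward search or pruning is performed.

```python
import numpy as np

def hinge_pair(x, c):
    """Return the mirrored BFs max(0, x - c) and max(0, c - x) for knot c."""
    return np.maximum(0.0, x - c), np.maximum(0.0, c - x)

def gcv(observed, predicted, n_bfs, d=3.0):
    """GCV with penalty factor C(M) = (M + 1) + d*M (d = 3 is an assumed value)."""
    n = observed.size
    c_m = (n_bfs + 1) + d * n_bfs
    mse = np.mean((observed - predicted) ** 2)
    return mse / (1.0 - c_m / n) ** 2

# Synthetic piecewise-linear response with a kink at x = 4 (illustration only).
x = np.linspace(0.0, 10.0, 60)
y = 1.0 + 2.0 * np.maximum(0.0, x - 4.0)

right, left = hinge_pair(x, 4.0)
# Design matrix: intercept (beta_0) plus the two BFs of the pair.
design = np.column_stack([np.ones_like(x), right, left])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
score = gcv(y, fitted, n_bfs=2)
```

In a full MARS fit, the pruning phase would compare such GCV scores across candidate models with different numbers of BFs and keep the one with the lowest value.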

Partial Least Square Regression (PLS) Model.
The PLS regression model was first introduced in the literature by [49], and since then the model has been widely considered a new multivariate analysis technique in many fields [50,51]. It combines the features of principal component analysis, typical multiple regression, and linear regression analysis; hence, it is suitable for solving numerous problems, especially problems that cannot be solved using the conventional multiple regression methods and problems with multiple correlations [52]. The efficiency of PLS in such cases is based on its ability to decompose and screen the variables that mostly explain the dependent variables [53]. The first step of the PLS method is to extract a new variable, called the component, which serves as the independent variable, followed by the determination and establishment of the linear relationship between the dependent and independent variables [54]. After calculating the coefficients using PLS, the next step is the construction of the regression equation of the dependent variable. The regression model developed by using the PLS method is represented as

$$\hat{y}_m = a_{0m} + a_{1m} x_1 + \cdots + a_{Pm} x_P, \tag{8}$$

where x_1, . . . , x_P represent the linear combinations of the input variables and a_0m, a_1m, . . . , a_Pm are the PLS-computed regression model parameters. A higher number of principal components in the model established by PLS translates to better model accuracy; however, an excessive number of principal components results in overfitting and higher error. Thus, the optimal number of principal components must be determined to achieve a balanced PLS model. The cross-validation method was used to calculate the sum of squared residuals in this study. The prediction ability of the resulting model is a function of the predictive residual error sum of squares (PRESS) value.
So, the optimal number of principal components can be determined based on the minimum PRESS value, and this PRESS value can be calculated as

$$PRESS = \sum_{i=1}^{k} \left( y_i - \hat{y}_{i,-i} \right)^2, \tag{9}$$

where y_i and ŷ_{i,−i} represent the measured value of the ith sample and the estimated value upon exclusion of the ith sample, respectively, and k is the number of iterations for validation.
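The component selection described above can be illustrated with a small NumPy sketch. It assumes a single response variable, a NIPALS-style PLS1 fit, and leave-one-out validation (so that k equals the number of samples); these choices, the helper names `pls1_fit` and `press`, and the synthetic data are assumptions for the example and are not taken from the study.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 regression (NIPALS-style); returns (coef, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)       # weight vector of the component
        t = Xk @ w                      # component scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        qk = yk @ t / tt                # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - qk * t                # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef

def press(X, y, n_comp):
    """Leave-one-out PRESS: each sample is predicted by a model fitted without it."""
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, b0 = pls1_fit(X[keep], y[keep], n_comp)
        total += (y[i] - (X[i] @ coef + b0)) ** 2
    return total

# Synthetic data for illustration only (not the FCIP dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.5, -2.0, 0.7, 0.0]) + rng.normal(scale=0.1, size=40)

press_by_k = [press(X, y, k) for k in range(1, 5)]
best_k = 1 + int(np.argmin(press_by_k))  # number of components with minimum PRESS
```

Other validation schemes (for example, k-fold splits) would change only the loop in `press`; the selection rule of taking the minimum PRESS stays the same.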

Case Study and Data Explanation
For modeling purposes, the datasets were collected from the open-source literature [7]. The datasets describe the key cost drivers of the FCIPs. The data include P1, the served area; P2, the total length of the PVC pipeline; P3, irrigation valve number; P4, construction year; P5, geographical zone; and the cost and duration of field canal improvement project (FCIP) construction. The dataset helps irrigation authorities and decision makers gain a prior understanding of the FCIP cost. The data of the current research were collected from a survey conducted for the Soltani Canal, Egypt. The costs relate to construction sites recorded between 2011 and 2018. The polyvinyl chloride (PVC) pipeline system is illustrated in Figure 3, with diameters ranging between 22.5 and 35 cm. The statistical properties of the dataset over the training and testing phases are reported in Tables 1 and 2. Altogether, 228 data records were used for the training and testing phases. In Tables

Application Results and Analysis
The feasibility of three machine learning models (ELM, MARS, and PLS) was evaluated for predicting the cost of FCIP construction. The models were built based on different input combinations, as reported in Table 3. Based on the correlation statistics, the input combinations were constructed as shown in Figure 4.
Based on the tabulated input parameters, it can be recognized that the total length of the PVC pipeline has the strongest correlation with the construction cost, followed by the time duration, served area, irrigation valve number, and geographical zone.
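One way such correlation-based input combinations could be assembled is sketched below. It is an assumption that the combinations are nested in order of decreasing absolute Pearson correlation; the actual combinations are those of Table 3. The predictor names, the helper functions, and the synthetic data are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def rank_by_correlation(X, y, names):
    """Order predictor names by |Pearson r| with the target, strongest first."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    order = np.argsort(-np.abs(r))
    return [names[j] for j in order], r[order]

def nested_combinations(ranked):
    """Combination k uses the k most strongly correlated predictors."""
    return [ranked[: k + 1] for k in range(len(ranked))]

# Synthetic predictors and cost for illustration only (not the FCIP dataset).
rng = np.random.default_rng(2)
names = ["served_area", "pipeline_length", "duration"]  # hypothetical labels
X = rng.uniform(0.0, 1.0, size=(500, 3))
cost = 0.1 * X[:, 0] + 3.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.2, size=500)

ranked, r_sorted = rank_by_correlation(X, cost, names)
combos = nested_combinations(ranked)
```

Each entry of `combos` would then define the input set of one candidate model, so that the effect of adding a weaker predictor can be read off from the change in the error metrics.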
Different statistical performance metrics, including the determination coefficient (R2), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), Nash-Sutcliffe efficiency (NSE), and agreement index (MD), were calculated to validate the applied models statistically [55,56]:

$$R^2 = \left[ \frac{\sum_{i=1}^{N} (y_o - \bar{y}_o)(y_p - \bar{y}_p)}{\sqrt{\sum_{i=1}^{N} (y_o - \bar{y}_o)^2 \, \sum_{i=1}^{N} (y_p - \bar{y}_p)^2}} \right]^2, \tag{10}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_o - y_p)^2}, \tag{11}$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_o - y_p \right|, \tag{12}$$

$$MAPE = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_o - y_p}{y_o} \right|, \tag{13}$$

$$NSE = 1 - \frac{\sum_{i=1}^{N} (y_o - y_p)^2}{\sum_{i=1}^{N} (y_o - \bar{y}_o)^2}, \tag{14}$$

$$MD = 1 - \frac{\sum_{i=1}^{N} \left| y_o - y_p \right|^{j}}{\sum_{i=1}^{N} \left( \left| y_p - \bar{y}_o \right| + \left| y_o - \bar{y}_o \right| \right)^{j}}, \tag{15}$$

where y_o and y_p are the observed and predicted values of the FCIP cost; ȳ_o and ȳ_p are the mean values of the observed and predicted values of the FCIP cost; N is the number of observations; and j is the exponent term.
Tables 4 and 5 report the statistical measures over the training and testing phases, respectively. In general, the models were less accurate when only a few predictors were used. However, the MARS model exhibited better predictability over both the training and testing phases. The maximum determination coefficient was achieved by MARS for model M6 (R2 = 0.94), with a minimum RMSE of 28458.17 in the training phase and 27422.7 in the testing phase, using all the predictor parameters except the geographical zone, which showed less influence on the cost. In comparison, the coefficient of determination (R2) of ELM and PLS maxed out at 0.90, with RMSE of 36011.43 and 36013.16, respectively, for model M6 in the training phase. Similarly, in the testing phase, the coefficient of determination (R2) of ELM and PLS maxed out at 0.89, with RMSE of 37141.8 and 37140.3, respectively, for model M6. In addition, the ratio of the MSE to the potential error, which is denoted by MD, is 0.89 for the MARS M6 model in both the training and testing phases. The model performances were also assessed using graphical presentations such as scatter plots and the Taylor diagram. Figure 5 shows the scatter plots between the actual observations and the predicted values. Among the three applied prediction models, the MARS model is indicated as the best identical match, with a high correlation value. On the other hand, Figure 6 shows the Taylor diagram, in which the prediction models were evaluated based on their coordinate distance in accordance with multiple statistical metrics (i.e., standard deviation, RMSE, and correlation value).

Table 3: The modeling input combinations for the adopted dataset.
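The six performance metrics used in this section can be computed with a short NumPy helper. The sketch below is illustrative: the function name `evaluate`, the default exponent j = 1 for MD, the expression of MAPE as a fraction rather than a percentage, and the toy values are assumptions made for the example.

```python
import numpy as np

def evaluate(y_obs, y_pred, j=1):
    """Return R2, RMSE, MAE, MAPE, NSE, and MD for observed/predicted values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_obs - y_pred
    dev_o = y_obs - y_obs.mean()      # deviations of observations from their mean
    dev_p = y_pred - y_pred.mean()    # deviations of predictions from their mean
    return {
        "R2": np.sum(dev_o * dev_p) ** 2 / (np.sum(dev_o ** 2) * np.sum(dev_p ** 2)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y_obs)),   # fraction, not percent
        "NSE": 1.0 - np.sum(err ** 2) / np.sum(dev_o ** 2),
        "MD": 1.0 - np.sum(np.abs(err) ** j)
              / np.sum((np.abs(y_pred - y_obs.mean()) + np.abs(dev_o)) ** j),
    }

# Toy values for illustration only (not results from the study).
scores = evaluate([100.0, 200.0, 300.0, 400.0], [110.0, 190.0, 310.0, 390.0])
```

RMSE and MAE are in the units of the cost, whereas R2, NSE, and MD are dimensionless, which is why they are convenient for comparing models across the training and testing phases.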

Discussion
Various studies have been conducted to develop reliable parametric cost models, but no available study has been carried out for FCIP [5]. However, the prediction of cost is not new; a simplex optimization of ANN weights was used to create a model for estimating the unit cost of highway projects with a mean absolute percentage error (MAPE) of 1% [16]. Another study used a combination of ANN and fuzzy logic to create a high-precision cost prediction model for water infrastructure based on the sum of squared errors. During the validation phase, the researchers produced multiple prediction models with error percentages ranging from 4.6 percent to 0.6 percent

Conclusion and Remarks
The prediction of cost related to civil engineering projects is considered a vital topic to be studied comprehensively. In this study, three machine learning models, namely extreme learning machine (ELM), multivariate adaptive regression spline (MARS), and partial least square regression (PLS), were developed to predict field canal improvement project (FCIP) cost. For the purpose of model development, datasets related to irrigation projects were collected from the open-source published literature. Input combinations were constructed based on the total length of the PVC pipeline, served area, geographical zone, construction year, and cost and duration of FCIP construction. The prediction results showed that the MARS and ELM models performed favorably in comparison with the PLS model, with the MARS model reporting the superior results. Also, the research findings showed that all the predictors contribute substantially to the cost calculation, with almost no influence from the geographical zone of the pipeline network.

Data Availability
The data used in this study can be provided upon request from the authors.

Conflicts of Interest
The authors report no conflicts of interest.