ANN Based Approach for Estimation of Construction Costs of Sports Fields

Cost estimates are essential for the success of construction projects. Neural networks, as tools of artificial intelligence, offer significant potential in this field. Applying neural networks, however, requires dedicated studies because of the specifics of different kinds of facilities. This paper proposes an approach to the estimation of construction costs of sports fields which is based on neural networks. The general applicability of artificial neural networks to the formulated cost estimation problem is investigated. The applicability of multilayer perceptron networks is confirmed by the results of the initial training of a set of various artificial neural networks. Moreover, one network was tailored for mapping the relationship between the total cost of construction works and the selected cost predictors which are characteristic of sports fields. Its prediction quality and accuracy were assessed positively. The research results legitimize the proposed approach.


Introduction
The results presented in this paper are part of broad research, in which the authors participate, aimed at developing tools for fast cost estimates dedicated to the construction industry. The main aim of this paper is to present the results of investigations on the applicability of artificial neural networks (ANNs) to the problem of estimating the total cost of construction works in the case of sports fields as specific facilities. The authors propose herein a new approach based on ANNs for estimating construction costs of sports fields.

1.1. Cost Estimation in Construction Projects.
Cost estimation is a key issue in construction projects. Both underestimation and overestimation of costs may lead to the failure of a construction project. The use of different tools and techniques throughout the project life cycle should provide information about costs to the participants of the project and support a complex decision-making process. In general, cost estimating methods can be classified as either qualitative or quantitative [1,2].
The construction industry expects the time necessary to predict costs to be shortened, while the estimates must remain sufficiently reliable and accurate. Publications from around the world report research results that respond to these expectations. Examples of the use of regression analysis (based on both parametric and nonparametric methods) are as follows: application of multivariate regression to predict the accuracy of cost estimates at the early stage of construction projects [3], implementation of linear regression analysis methods to predict the cost of raising buildings in the UK [4], proposal and discussion of a construction cost estimation method which combines bootstrap and regression techniques [5], and application of boosting regression trees in preliminary cost estimates for school building projects in Korea [6]. Another mathematical tool for which examples can be given is fuzzy logic, for example, implementation of fuzzy logic for parametric cost estimation in building projects in the Gaza Strip [7] or proposal and presentation of a fuzzy risk assessment model for estimating a cost overrun risk rating [8]. Case based reasoning (CBR) is also an approach which can be found in publications dealing with construction costs, for example, implementation of the CBR method improved by the analytical hierarchy process (AHP) for the purposes of cost estimation of residential buildings in Korea [9] or the use of case based reasoning in cost estimation of adapting military barracks, also in Korea [10]. Examples of publications which report and discuss the applications of artificial neural networks in the field of cost estimation and cost analyses in the construction process are presented in the next subsection.

1.2. Artificial Neural Networks Cost Estimation in Construction Projects.
Artificial neural networks (ANNs) can be defined as mathematical structures and their implementations (both hardware and software) whose mode of action is based on and inspired by nervous systems observed in nature. In other words, ANNs are tools of artificial intelligence which have the ability to model data relationships with no need to assume a priori the equations or formulas which bind the variables. The networks come in a wide variety depending on their structures, way of processing signals, and applications. The theory of this subject is widely presented in the literature (e.g., [11-15]). The main applications of ANNs are as follows (cf., e.g., [11,12,15]): prediction, approximation, control, association, classification and pattern recognition, associating data, data analysis, signal filtering, and optimization. ANN features which make them beneficial in cost estimating problems (in particular for cost estimating in construction) include (i) applicability in regression problems where the relationships between the dependent variable and many independent variables are difficult to investigate.
Some examples of ANN applications reported for a range of cost estimating and cost analyses in construction are replication of past cost trends in highway construction and estimation of future cost trends in this field in the state of Louisiana, USA [16], computation of the whole life cost of construction with the use of the concept of cost significant items in Australia [17], prediction of the total structural cost of construction projects in the Philippines [18], estimation of site overhead costs in a dam project in Egypt [19], prediction of the cost of road project completion on the basis of bidding data in New Jersey, USA [20], and cost estimation of building structural systems in Turkey [21]. The authors of this paper have also contributed to studies on the use of ANNs in cost estimation problems in construction. In some previous works, the authors presented ANN applications for conceptual cost estimation of residential buildings in Poland [22-24] and estimation of overhead cost in construction projects in Poland [25,26].
1.3. Justification for Research. It needs to be emphasized that, despite a number of publications reporting research on the use of artificial neural networks in cost analyses and cost estimation in construction, each of the problems is specific and unique. Each such problem requires an individual approach and investigation due to the distinct conditions, determinants, and factors that influence the costs of construction projects. An individual approach to cost estimation in construction is needed primarily because of the specificity of the facilities, including sports fields. The costs of a sports field are significant not only at the construction stage but also later, in terms of its maintenance. The decisions made about size, functionality, and quality are crucial for the future use and operational management of sports fields. Success in investigating the applicability of ANNs to the problem will allow proposing a new approach for estimation of the construction cost of sports fields. The new approach, based on the advantages offered by neural networks, will allow predicting the total construction cost of sports fields much faster than with traditional methods; moreover, it will make it possible to check many variants and their influence on the cost in a very short time.

Formulation of the Problem and Research Framework
2.1. General Assumptions. The general aim of the research was to develop a model that supports the process of estimating construction costs of sports fields. The authors decided to investigate the implementation of ANNs for the purpose of mapping a multidimensional space of cost predictors into a one-dimensional space of construction costs. In formal notation, the problem can be defined generally as follows:
$$f: X \rightarrow Y \tag{1}$$
where $f$ is the sought-for function of several variables, $X$ is the input of the function $f$, which consists of vectors $x = [x_1, x_2, \ldots, x_n]$, where variables $x_1, x_2, \ldots, x_n$ represent cost predictors characteristic of sports fields as construction objects, and $Y$ is a set of values which represent construction costs of sports fields.
In the statistical sense, the problem comes down to solving a regression problem and estimating a relationship $f$ between the cost predictors, being independent variables belonging to the set $X$, and the construction cost of a sports field, being the dependent variable belonging to the set $Y$. According to the methodology of cost estimating based on statistical methods, one can distinguish between two main approaches: estimating based on parametric methods and estimating based on nonparametric methods (cf. [1,2,25]). Both methods rely on real-life data, that is, representative samples of cost predictor values and related construction cost values. In the case of parametric methods, the function $f$ is assumed a priori and the structural parameters of the model are estimated. On the other hand, nonparametric methods are based on fitting the function $f$ to the data. According to the assumptions made for the research presented in this paper, the sought-for function was supposed to be implemented implicitly by an ANN.
A general framework of the adopted research strategy is depicted in Figure 1.

2.2. Characteristics of Sports Fields Covered by the Research.
Sports fields are facilities for which some types of works are usually repeated during the construction stage. The main types of works that can be listed are (i) geodetic surveying, (ii) earthworks (topsoil stripping, trenching, compacting of the natural subgrade, etc.), (iii) works on subgrade preparation for the sports field surface, (iv) works on the sports field surface (usually surfaces are either natural or synthetic grass), (v) assembly of fixtures and in-ground furnishings (e.g., football/handball goals, basketball goal systems, volleyball or tennis poles and nets), (vi) works on fencing and ball-net installation, (vii) minor road works and works on sidewalks, and (viii) landscape works and arranging green areas around the sports fields.
In the course of the research, a number of completed projects on sports fields in Poland were investigated. Both fields dedicated to one discipline and multifunctional fields were taken into account. The facilities subject to the analysis differed in the size of the playing area, arranged area for communication, arranged green area, and fencing. The surfaces were of two types: either natural or synthetic grass. It must be stressed here that the quality expectations for surfaces varied significantly and played an important role in the construction costs. The completed facilities are located all over Poland, both in urban areas (in cities of different sizes) and outside urban areas (in villages).

2.3. Preselection of Variables.
As the problem was formally expressed and the assumption about the use of ANNs was made, the authors focused on the analyses which allowed them to preselect cost predictors. The preselection was preceded by studying both technical and cost aspects of construction projects on sports fields. This stage of the research allowed collecting the necessary background knowledge about the nature of sports fields as specific construction objects with their characteristic elements, the range and sequence of construction works which must be completed, and the clients' quality expectations.
In the next step, 129 construction projects on sports fields that were completed in Poland in recent years were investigated. For the purposes of fast cost analysis, the authors preselected the following data to be the variables of the sought-for relationship:
(i) Total cost of construction works as the dependent variable
(ii) Playing area of a sports field, location of a facility, number of sport functions, the type of the playing field's surface (natural or artificial), quality standard of the playing field's surface, ball stop net's surface, arranged area for communication, fencing's length, and arranged greenery area as independent variables.
The criteria for such preselection were the availability of the data in the investigated tender documents and keeping the developed model simple enough that a potential client would be able to formulate expectations about the sports field to be ordered by specifying values of the potential cost predictors at the early stage of the project. Most of the mentioned variables were of a quantitative type. In the case of the location of the facility, the type of the playing field's surface, and the quality standard of the playing field's surface, only descriptive information was available; these three variables were of the categorical type. Table 1 presents synthetically the characteristics of all of the variables that were preselected in the course of the analysis of the problem.
The next step included data collection and scaling of the categorical variables. In the case of three of the variables (namely, location of the facility, type of the playing field's surface, and quality standard of the playing field's surface), categorical values were replaced by numerical values. The studies of the problem, analyses of a number of completed construction projects on sports fields, and, especially, the analyses of construction works costs led to conclusions on how the categorical values of the three variables are associated with the costs of construction works. Table 2 explains how a change of the categorical values stimulates the costs of construction works in general. This observation made it possible to order the values and transform them into numbers in the range from 0.1 to 0.9.
Categorical values for the location of the facility took numerical values as follows: urban area: 0.9 for big cities (population over 100,000), 0.66 for medium cities (population between 20,000 and 100,000), and 0.33 for small cities (population below 20,000); outside the urban area: 0.1 for villages. In the case of the type of the playing field's surface, artificial grass took the value of 0.9 and natural grass took the value of 0.1. Finally, depending on the client's expectations and the specifications available in the tender documents, the descriptions of the demands for the quality standard of the playing field's surface took values from the range between 0.1 and 0.9.
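The coding of the categorical variables described above can be illustrated with a minimal Python sketch. The numerical codes are those given in the text; the dictionary names, the category labels used as keys, and the function `encode_record` are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
# Numerical codes as stated in the text; all names below are illustrative assumptions.
LOCATION_CODE = {
    "big city": 0.9,      # urban area, population over 100,000
    "medium city": 0.66,  # urban area, population between 20,000 and 100,000
    "small city": 0.33,   # urban area, population below 20,000
    "village": 0.1,       # outside the urban area
}
SURFACE_CODE = {"artificial grass": 0.9, "natural grass": 0.1}


def encode_record(location, surface, quality):
    """Return [location, surface type, surface quality] as numbers in [0.1, 0.9]."""
    if not 0.1 <= quality <= 0.9:
        raise ValueError("quality standard must lie between 0.1 and 0.9")
    return [LOCATION_CODE[location], SURFACE_CODE[surface], quality]


print(encode_record("medium city", "artificial grass", 0.5))
```

Keeping the codes in lookup tables makes the assumed ordering of the categories (from the lowest to the highest expected cost) explicit and easy to revise.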
The studies of tender documents for public construction projects where completion of sports fields was the subject matter of the contract allowed for collecting the data for 129 projects. The data were collected for projects completed in the last four years all over Poland. The collected information was ordered in the database. After the analysis of outliers, the authors decided to reject some extreme cases for which the total construction cost was unusually high or unusually low. After the elimination of outliers, the data for 115 projects remained.

2.4. Selection of the Final Set of Variables.
Further analysis included the investigation of the significance of correlations between the dependent variable and all of the initially considered independent variables, that is, the preselected cost predictors. The significance of correlations was assessed at $p$-value < 0.05. The results of this step are synthetically presented in Table 3.
As the correlations for two of the preselected cost predictors (namely, location of the sports ground and the number of sport functions) appeared to be insignificant, they were rejected and no longer taken into account as cost predictors.
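The text does not state which significance test was applied. One common way to assess a Pearson correlation at the 0.05 level is the t-test sketched below; the test itself, the function names, and the critical value (an approximation of the two-sided 5% value of Student's t distribution for 113 degrees of freedom, taken from standard tables) are assumptions made only for illustration.

```python
import math

# Approximate two-sided 5% critical value of Student's t for n - 2 = 113
# degrees of freedom (assumed from standard tables, not from the paper).
T_CRIT_N115 = 1.98


def t_statistic(r, n):
    """t statistic for a Pearson correlation r computed from n records (|r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def is_significant(r, n=115, t_crit=T_CRIT_N115):
    """True if the correlation differs from zero at roughly the 0.05 level."""
    return abs(t_statistic(r, n)) > t_crit


# Hypothetical correlation values, not results from Table 3.
print(is_significant(0.5), is_significant(0.1))
```

With 115 records, even moderate correlations pass such a test, whereas weak ones do not, which is consistent with rejecting predictors whose correlation with the total cost is close to zero.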
Table 4 presents ten exemplary records with the specific numerical values of the dependent variable $y$ (total cost of construction works) and independent variables $x_j$ (cost predictors) accepted for the model.
Table 5 presents descriptive statistics for the variables accepted for the model. Average, minimum, and maximum values are presented for each of the variables as well as the standard deviation.
It is noteworthy that the minimum value, namely, 0.00, for the variables $x_4$, $x_5$, $x_6$, and $x_7$ corresponds with the fact that in the case of certain sports fields, elements such as ball stop nets, arranged area for communication, fencing, and arranged greenery were not included in the project's scope (cf. Table 4). Moreover, there is some regularity in Table 5, which manifests itself in average values lying closer to the minimum values than to the maximum values. This is due to the fact that the number of small-sized and medium-sized sports fields in the database was greater than the number of large-sized ones. This can be explained by a general rule valid for all kinds of construction: the number of small-sized and medium-sized facilities of all types, either newly built or existing, is always greater than the number of large-sized ones.
The database records (115 in total) were used as training patterns $p$ for ANNs in the course of the research.
The values of the variables were scaled automatically before and after each ANN training run. This was done by means of the functionalities of the ANN software simulator used in the course of the research. The variables were scaled linearly to the range of values appropriate for the activation functions employed in the given ANN. The results, especially the ANNs' training errors, presented further in the paper, are given as original, not scaled, values.
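Linear scaling of this kind can be sketched as follows. This is a minimal illustration of min-max scaling and its inverse, not the routine of the simulator used by the authors; the function name `fit_scaler`, the target range, and the sample values are assumptions.

```python
def fit_scaler(values, lo=0.0, hi=1.0):
    """Build a linear map of `values` onto [lo, hi] together with its inverse."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("a constant variable cannot be scaled linearly")
    scale = (hi - lo) / (vmax - vmin)

    def forward(v):
        # original units -> scaled units
        return lo + (v - vmin) * scale

    def inverse(s):
        # scaled units -> original units (used to report results unscaled)
        return vmin + (s - lo) / scale

    return forward, inverse


# Hypothetical values of one variable, scaled to [-1, 1] as for a tanh activation.
forward, inverse = fit_scaler([100.0, 300.0, 500.0], lo=-1.0, hi=1.0)
print(forward(300.0), inverse(forward(300.0)))
```

The inverse map is what allows errors and predictions to be reported in original, not scaled, values.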

Initial Training of Neural Networks
After the selection of independent variables, a formal notation of the relationship in the statistical sense can be given as follows:
$$y = f(x_1, x_2, x_3, x_4, x_5, x_6, x_7) + \varepsilon \tag{2}$$
Consequently, a prediction can be formally made:
$$\hat{y} = f(x_1, x_2, x_3, x_4, x_5, x_6, x_7) \tag{3}$$
where $y$ is the dependent variable, the total cost of construction works as observed in real life, $\hat{y}$ is the predicted total cost of construction works, $f$ is the sought-for function, an implicit relationship implemented by an ANN, $x_1, x_2, x_3, x_4, x_5, x_6, x_7$ are the independent variables, the selected cost predictors as presented in Tables 3 and 4, and $\varepsilon$ are random deviations (errors) for which $E(\varepsilon) = 0$.
The aim of this stage of the research, namely, the initial training of ANNs, was to assess their applicability to the problem in general and to decide whether to continue the research or not. A variety of feed-forward ANNs were trained in the automatic mode. The overall number of networks equalled 200; the authors took into account 100 multilayer perceptron (MLP) networks and 100 radial basis function (RBF) networks as the types appropriate for regression analysis and suitable for the formulated problem.
The main criteria to assess the applicability of the ANNs were the quality of predictions made by the trained networks and the errors. The measure of error was the root mean squared error (RMSE), whereas the measure of the quality of predictions was Pearson's correlation coefficient $R(y, \hat{y})$ between the real-life values and the values predicted by the networks for the dependent variable, the total construction cost:
$$R(y, \hat{y}) = \frac{\operatorname{cov}(y, \hat{y})}{\sigma_{y}\,\sigma_{\hat{y}}} \tag{4}$$
where $\operatorname{cov}(y, \hat{y})$ is the covariance between $y$ and $\hat{y}$, $\sigma_{y}$ is the standard deviation for $y$, and $\sigma_{\hat{y}}$ is the standard deviation for $\hat{y}$.
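Both measures can be computed as in the minimal Python sketch below. Pearson's coefficient follows the definition above (covariance divided by the product of standard deviations); RMSE is taken in its usual form, the square root of the mean squared difference between real-life and predicted values, which is assumed here since its formula is not reproduced in the text. The function names and the sample values are illustrative.

```python
import math


def pearson_r(y, y_hat):
    """Pearson's correlation: cov(y, y_hat) / (sigma_y * sigma_y_hat)."""
    n = len(y)
    mean_y, mean_p = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_hat)) / n
    sigma_y = math.sqrt(sum((a - mean_y) ** 2 for a in y) / n)
    sigma_p = math.sqrt(sum((b - mean_p) ** 2 for b in y_hat) / n)
    return cov / (sigma_y * sigma_p)


def rmse(y, y_hat):
    """Root mean squared error between real-life and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


# Hypothetical costs: predictions shifted by a constant correlate perfectly
# with the real values, yet the error is not zero.
real = [100.0, 200.0, 300.0]
predicted = [110.0, 210.0, 310.0]
print(pearson_r(real, predicted), rmse(real, predicted))
```

The example shows why the two measures are used together: a high correlation alone does not guarantee small errors.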
In the case of RBF networks, both the quality of predictions and the errors were so unsatisfactory that the authors decided to focus on the MLP networks only. The results for the MLP networks were satisfactory; they are presented synthetically below in Figures 2 and 3 and in Table 6 and discussed briefly. The main assumptions for this stage were as follows:
(i) From the database of training patterns, learning ($L$), validating ($V$), and testing ($T$) subsets were randomly drawn 10 times.
(iii) For each drawing, 10 different networks were trained.
(iv) The networks varied in the number of neurons in the hidden layer, $H$.
(v) Distinct activation functions, such as linear, sigmoid, hyperbolic tangent, and exponential, were applied in the neurons of the hidden and output layers.
The learning and validating subsets, that is, $L$ and $V$, were used in the course of the training process. The third subset, $T$, was used for testing purposes after completing the training process as an additional check of the generalization capabilities (cf., e.g., [11]). The number of neurons in the hidden layer, $H$, was assessed according to the following equation [27] and inequality [28]:
$$H = \sqrt{N \cdot M} \tag{5}$$
where $N$ is the number of neurons in the input layer and $M$ is the number of neurons in the output layer, and
$$NNP = NNW + NNB < L \tag{6}$$
where NNP is the number of the ANN's parameters, NNW is the number of the ANN's weights, NNB is the number of the ANN's biases, and $L$ is the cardinality of the learning subset. From (5), $H \approx 2.646$. According to the assumptions about the cardinality of the $L$ subset and from inequality (6), NNP < 69. As a compromise between these two conditions, the number of neurons in the hidden layer, $H$, varied between 2 and 6.
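The arithmetic behind these two conditions can be checked with a short Python sketch. It assumes a fully connected MLP with one hidden layer and one bias per hidden and output neuron, which is the usual convention but is not stated explicitly in the text; the function and constant names are illustrative.

```python
import math

N_INPUTS = 7        # neurons in the input layer (selected cost predictors)
N_OUTPUTS = 1       # neurons in the output layer (total cost)
LEARNING_SIZE = 69  # upper bound on the number of parameters, as stated in the text


def n_parameters(n_in, n_hidden, n_out):
    """Weights plus biases of a fully connected MLP with one hidden layer."""
    weights = n_in * n_hidden + n_hidden * n_out
    biases = n_hidden + n_out  # assumption: one bias per non-input neuron
    return weights + biases


# Equation (5): geometric mean of the input and output layer sizes.
h_estimate = math.sqrt(N_INPUTS * N_OUTPUTS)
print(f"H from equation (5): {h_estimate:.3f}")

# Inequality (6): parameter count compared with the learning subset size.
for h in range(2, 9):
    nnp = n_parameters(N_INPUTS, h, N_OUTPUTS)
    print(f"H = {h}: NNP = {nnp}, NNP < {LEARNING_SIZE}: {nnp < LEARNING_SIZE}")
```

Under these assumptions every hidden layer size from 2 to 6 satisfies inequality (6), so the range adopted in the research lies between the estimate from equation (5) and the limit set by the size of the learning subset.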
The overall results, in terms of the quality of predictions and errors, are presented in Figures 2 and 3.
Part (a) of Figures 2 and 3 depicts scatter plots of Pearson's correlation coefficients calculated for each network after the training process.Figure 2 shows coefficients for learning and testing subsets, whereas Figure 3 shows the coefficients for learning and validating subsets.In part (b) of both figures, one can see scatter plots of errors (namely, RMSE).Figure 2 presents the scatter plot for learning and testing subsets, and Figure 3 presents the scatter plot for learning and validating subsets.
Table 6 presents the summary of the initial training of 100 MLP networks. The average and standard deviation of $R(y, \hat{y})$ as well as the percentage of cases for which $R(y, \hat{y})$ is greater than 0.9 are presented in the table. Additionally, the average and standard deviation of RMSE errors are given. All values were calculated for the learning, validating, and testing subsets separately.
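A summary of this kind can be produced as in the sketch below. The correlation values are hypothetical and do not come from Table 6; the use of the sample standard deviation and the function name `summarize` are assumptions.

```python
import statistics


def summarize(correlations, threshold=0.9):
    """Average, standard deviation, and percentage of values above `threshold`."""
    above = sum(1 for r in correlations if r > threshold)
    return {
        "average": statistics.mean(correlations),
        "std_dev": statistics.stdev(correlations),  # sample standard deviation (assumed)
        "percent_above": 100.0 * above / len(correlations),
    }


# Hypothetical correlation coefficients of four trained networks for one subset.
print(summarize([0.95, 0.92, 0.88, 0.97]))
```

The same function can be applied separately to the coefficients obtained for the learning, validating, and testing subsets.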
As can be seen in Figures 2 and 3 and in Table 6, the correlations in most cases are very high. For more than 80% of the networks, $R(y, \hat{y})$ was greater than 0.9 for the learning, validating, and testing subsets. There are evident clusters of points in Figures 2(a) and 3(a). This stage of the research confirmed the general applicability of ANNs to the investigated problem. The decision was to continue the research. Moreover, it allowed choosing a group of MLP networks to be trained in the next stage.

Results of Neural Networks Training in the Closing Phase of the Research
With respect to the initial training results, a group of 5 networks was chosen for the closing phase of the research. The details, including the networks' structures and activation functions, are given in Table 7. (All of the networks consisted of 7 neurons in the input layer, from 2 to 5 neurons in one hidden layer, and one neuron in the output layer. All of the networks were trained with the use of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.) The assumptions for the networks' training and testing were different than in the initial stage. From the set of 115 training patterns, the testing subset, $T$, was chosen randomly at the beginning. This subset remained unchanged for all of the networks and it was used to assess the generalization capabilities after all of the training processes. The cardinality of the $T$ subset equalled 10% of the total number of training patterns. The remaining patterns were involved in the 10-fold cross-validation of the networks (cf. [11]). The patterns available for the training process after the sampling of the $T$ subset were divided 10 times into complementary learning and validating subsets. The ratio of the subsets was $L/V$ = 90%/10%, respectively. The sum of squared errors of prediction (SSE) was used as the error function in the course of training:
$$SSE = \sum_{p} \left(y^{p} - \hat{y}^{p}\right)^{2} \tag{7}$$
where the sum runs over the training patterns $p$.
The performance of the ANNs was assessed in general in the light of the correlation between real-life and predicted values and the RMSE errors of prediction (as in the stage of initial training). Table 8 presents a summary of the training results for the five chosen networks after the training process based on the 10-fold cross-validation approach. The results are given in terms of the networks' performance and errors. To synthesize and assess the performance and stability of training of each network, the maximum, average, and minimum as well as the dispersion of the correlation coefficients $R(y, \hat{y})$ between real-life and predicted values were calculated. Errors, namely, RMSE values, are presented in the same manner. Both correlations and errors are given separately for the $L$, $V$, and $T$ subsets. The analysis of the results made it possible to finally choose one of the five networks. The most stable performance was observed for the MLP s-l 7-5-1.
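The data division used in this phase (a fixed testing subset plus 10-fold cross-validation on the remaining patterns) can be sketched in Python as follows. The sketch works on pattern indices only; the rounding of 10% of 115 patterns down to 11, the random seed, and the function name `holdout_and_folds` are assumptions, since these details are not given in the text.

```python
import random


def holdout_and_folds(n_patterns=115, test_share=0.10, k=10, seed=0):
    """Draw a fixed testing subset, then build k learning/validating splits."""
    rng = random.Random(seed)
    indices = list(range(n_patterns))
    rng.shuffle(indices)

    n_test = int(n_patterns * test_share)  # 11 of 115 (rounding down is assumed)
    testing, remaining = indices[:n_test], indices[n_test:]

    # Every remaining pattern falls into exactly one of the k folds.
    folds = [remaining[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validating = folds[i]
        learning = [p for j, fold in enumerate(folds) if j != i for p in fold]
        splits.append((learning, validating))
    return testing, splits


testing, splits = holdout_and_folds()
print(len(testing), [(len(l), len(v)) for l, v in splits])
```

Each pattern outside the testing subset serves as a validating pattern exactly once, and the learning and validating subsets in every fold stay close to the 90%/10% ratio.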
Figure 4 depicts a scatter plot of the points that represent pairs $(y^{p}, \hat{y}^{p})$ for the finally chosen network MLP s-l 7-5-1. Real-life values $y$ are set together with predicted values $\hat{y}$. The points represent training results for the learning and validating subsets, as well as testing results for the testing subset. In the legend of the chart, apart from the letters $L$, $V$, and $T$ that indicate the membership of a given pattern in the learning, validating, or testing subset, respectively, the numbers from 1 to 10 are given next to each letter. These numbers indicate the respective fold of the cross-validation process. In Figure 4, one can see that the points in the scatter plot are distributed along a perfect fit line. The deviations are in the acceptable range. In respect of the analysis of $R(y, \hat{y})$, RMSE errors, and the distribution of points in the scatter plot, the results are satisfactory.
Two more criteria, relating to the accuracy of cost estimation, were also specified for assessment of the selected network MLP s-l 7-5-1. The accuracy of estimates was assessed with the use of three error measures: mean absolute percentage error (MAPE), absolute percentage error calculated for Table 9 shows the maximum, average, and minimum MAPE and $PE_{\max}$ errors, as well as the standard deviations of these errors, after the 10-fold cross-validation for the selected network MLP s-l 7-5-1. The errors are given for the learning, validating, and testing subsets, that is, $L$, $V$, and $T$, respectively. MAPE and $PE_{\max}$ errors were carefully investigated for the selected network in all 10 cases of cross-validation training and testing. In respect of average MAPE errors, the results were satisfactory. Average MAPE errors were expected to be smaller than 15% for learning, validating, and testing

(i) Qualitative cost estimating: (1) cost estimating based on heuristic methods; (2) cost estimating based on expert judgments.
(ii) Quantitative cost estimating: (1) cost estimating based on statistical methods; (2) cost estimating based on parametric methods; (3) cost estimating based on nonparametric methods; (4) cost estimating based on analogous/comparative methods; (5) cost estimating based on analytical methods.

(ii) Ability to gain knowledge in the automated training process
(iii) Ability to build and store the knowledge on the basis of the collected training patterns (real-life examples)
(iv) Ability of knowledge generalization; predictions can be made for data which have not been presented to the ANNs during the training process.

Figure 1: Scheme of the research framework (source: own study).

Figure 2: Quality and errors of ANNs after the initial training phase: (a) scatter diagram of Pearson's correlation coefficients and (b) scatter diagram of errors (RMSE); $L$ and $T$ stand for learning and testing subsets, respectively (source: own study).
The clusters of points in Figures 2(a) and 3(a) represent networks with the potential for acceptable or good quality of prediction. There are only a few cases outside the clusters for which the correlation coefficients are very low, due to the failure of the training process. RMSE errors are in the acceptable range at this stage.

Figure 3: Quality and errors of ANNs after the initial training phase: (a) scatter diagram of Pearson's correlation coefficients and (b) scatter diagram of errors (RMSE); $L$ and $V$ stand for learning and validating subsets, respectively (source: own study).

Table 1: Characteristics of dependent variables and independent variables considered initially to be used in the course of a regression analysis (source: own study).

Table 2: General relationship between the three categorical variables and cost (source: own study).

Table 3: Significance of correlations between the variables (source: own study).

Table 4: Exemplary records of the database including training patterns (source: own study).

Table 5: Descriptive statistics for the models' variables (source: own study).

Table 6: Summary of the initial training of ANNs for MLP networks (source: own study).

Table 7: Details of the selected ANNs for further training (source: own study).

Table 8: Results of the selected ANNs training (source: own study).