Estimating the Concrete Compressive Strength Using Hard Clustering and Fuzzy Clustering Based Regression Techniques

Understanding the compressive strength of concrete is important for activities such as construction arrangement, prestressing operations, proportioning new mixtures, and quality assurance. Regression techniques are widely used for prediction tasks in which the relationship between the independent variables and the dependent (predicted) variable is identified. The accuracy of regression-based prediction can be improved if clustering is used along with regression, because fitting a separate curve to each group of similar records gives a closer fit between the dependent and independent variables. In this work a cluster regression technique is applied for estimating the compressive strength of concrete, and a novel approach is proposed for predicting the concrete compressive strength. The objective of this work is to demonstrate that clustering along with regression gives lower prediction errors for estimating the concrete compressive strength. The proposed technique consists of two major stages: in the first stage, clustering is used to group concrete records with similar characteristics; in the second stage, regression techniques are applied over these clusters (groups) to predict the compressive strength within each cluster. The experiments show that clustering along with regression gives the lowest errors for predicting the compressive strength of concrete, and that the fuzzy C-means clustering algorithm performs better than the K-means algorithm.


Introduction
Concrete is the most commonly used structural material and is composed of individual base materials [1]. The materials used in concrete, the mix ratios, the mixing process, transportation, and placement of concrete are all important parameters that define concrete performance [2]. Concrete is a heterogeneous material and consists of separate phases such as hydrated cement paste, transition zone, and aggregate. The compressive strength and failure of concrete are related to the weakest part of the microstructure [3]. The strength of concrete is governed by the proportioning of cement, coarse and fine aggregates, water, and various admixtures. The ratio of water to cement is an important parameter for determining concrete strength. A lower water-cement ratio gives higher compressive strength. A minimum amount of water is necessary for the proper chemical action in the hardening of concrete, but extra water increases the workability and reduces the strength of concrete [4].
A number of methods and techniques have been developed and proposed by many researchers to predict the properties of concrete. Determination of the compressive strength of concrete requires preparation, curing, and testing of special specimens [5]. The early prediction of concrete properties is an important activity. Properties of hardened concrete such as strength and deformation are measured using a number of tests; some of these tests and their categorization can be found in the study of Gencel et al. [6,7]. Prediction of the compressive strength of concrete is an important activity in construction technology. Timely knowledge of concrete strength helps to schedule operations such as prestressing and removal of formwork. The speed of construction can be increased using maturity methods for determining concrete compressive strength. Understanding the compressive strength of concrete also helps in achieving construction quality control parameters such as durability of structures and avoiding excessive loading [8]. Regression techniques are among the simplest and most efficient techniques for prediction tasks; however, if clustering is performed along with regression, more accurate predictions may be possible. Applying regression over clustered data can identify more accurate relationships between the independent and dependent variables; in other words, a closer curve fit between the independent and dependent variables is possible within each cluster. The objective of this work is to demonstrate that regression along with clustering gives lower prediction errors for estimating the compressive strength of concrete.
Three regression techniques, namely, Simple Linear Regression (SLR), Logistic Regression (LR), and Least Median of Squares (LMS), along with the two popular clustering techniques K-means and fuzzy C-means, are used for estimating the compressive strength of concrete in this work. [9][10][11]. ANNs are also used for the prediction of concrete compressive strength based on a variety of nondestructive tests [12][13][14]. Topçu and Saridemir [15] proposed an ANN and fuzzy logic based technique to determine the compressive strength of fly ash added concretes. [16] proposed a technique for predicting the compressive strength of concrete with some admixtures. Altun et al. (2008) [17] proposed ANN and multiple linear regression based techniques to estimate the compressive strength of steel fiber reinforced concrete. An artificial neural network based model is used for prediction of the thermal fields in young concrete structures [1,18]. Statistical regression analysis and artificial neural network (ANN) based techniques are proposed for predicting cost and schedule performance [19]. Ni and Wang [10] proposed an ANN and soft computing based approach for predicting concrete strength.
A neural network based approach has been proposed for the evaluation of concrete compressive strength by the use of ultrasonic pulse velocity values [20]. Regression and artificial neural network (ANN) based technique is proposed for the estimation of compressive strength of vacuum processed concrete [2]. An artificial neural network of the feed-forward back-propagation type is used for the prediction of density and compressive strength properties of the cement paste portion of the concrete mixtures [3]. An artificial neural networks based technique is proposed to help concrete structure designers and engineers to compute the effects of some concrete initial parameters [21].

Other Computing Approaches.
Fuzzy set theory has been applied in a wide range of scientific and construction research areas [15]. A number of studies are available on compressive strength, which is related to other properties or performance of concrete, such as flexural strength, splitting tensile strength, modulus of elasticity, and durability [22][23][24]. A soft computing technique, namely, the adaptive neurofuzzy inference system (ANFIS), is proposed by Özel [25] for predicting the compressive strength of concretes using the mix design and flow properties of concrete. Prediction of the 3-day compressive strength of concrete is presented by Viviani et al. [26]. A number of construction factory alternatives are proposed by Kim et al. [27] for the realization of a more desirable automated construction environment. The proposed techniques are evaluated on the basis of wind speed and air temperature using computational fluid dynamics (CFD) simulation. Compressive strength computation of normal and recycled aggregate concrete is performed by Janković et al. [28], and an equation for calculating compressive strength is presented. The effect of increasing the water-cement ratio on cement hydration is studied by Çolak (2006) [29]; the study can help in understanding the w/c-strength (water-cement ratio versus strength) relationship in concrete as the natural consequence of a progressive weakening. An extensive study of the concrete curing process is performed by Abdel-Jawad (2006) [30], where it is shown that the technical process of concrete curing involves a number of conditions for improving cement hydration. Rajamane et al. (2007) [31] analyzed the impact of temperature on strength and concluded that it depends on the time-temperature history of casting and curing; in that work multivariable equations are proposed for the prediction of the compressive strength of concrete. The equations given by Tanigawa et al. (1984) [32] demonstrated that a prediction performance with an RMSE equal to 2.1000 can be achieved using the multivariable equation technique. Multilayer Perceptron, M5P tree models, and Linear Regression are used to predict the compressive strength of high performance concrete by [4]. An estimation equation is derived for compressive strength development of concrete containing fly ash [33]. In recent years other evolving computing techniques have also been applied for predicting the compressive strength of concrete.
ANNs and most of the soft computing based techniques are supervised learning techniques, which require good training datasets for preparing the prediction models and can give low accuracies for prediction and other related tasks when such datasets are not available. To address this problem a cluster regression based technique is proposed in this work to predict the compressive strength of concrete. The purpose of the proposed work is to demonstrate that a regression technique along with clustering gives better prediction of the compressive strength of concrete. Rather than applying regression directly over the whole concrete dataset, it is applied over groups (clusters) of concrete records. These clusters are formed on the basis of similarities between concrete records. Applying regression techniques over the clusters identifies a more suitable estimation for each cluster (mathematically, a better curve fit between the independent and dependent variables is achieved), which reduces the forecasting errors for estimating the compressive strength of concrete.

Methodology
The methodology of the proposed technique for estimating (predicting) the concrete compressive strength is summarized in Figure 1. The proposed method is composed of three major steps: clustering, applying regression, and performance evaluation using various parameters. These steps are summarized and explained below.
Clustering. In the first step, groups of similar concrete records are created from the concrete dataset. The number of groups can be decided by the user. In this study different numbers of groups are considered for the experiments. The groups are created using the K-means clustering technique and the fuzzy C-means clustering technique.
Regression Analysis. After creating the groups of similar concrete records, regression techniques are applied to identify the relationship between the concrete compressive strength and the other components of concrete. In this work three regression techniques, SLR, LR, and LMS, are used for estimating the compressive strength of concrete.
Performance Evaluation. The predictions made in the previous steps are evaluated using two popular error parameters, MAE (Mean Absolute Error) and RMSE (Root Mean Square Error). These errors are calculated for each individual cluster, and an overall weighted average is also calculated for measuring the prediction errors.
Clustering is performed using the K-means and fuzzy C-means algorithms to create groups of concrete records with similar characteristics. Overviews of both algorithms are given in this section. Distance functions play an important role in clustering and are also discussed here in brief.

Algorithm 1: K-means.
Input: k: the number of clusters; D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
The K-means algorithm [35] is a partitioning based clustering algorithm. It takes an input parameter, k, that is, the number of clusters to be formed, and partitions a set of n objects into k clusters. The algorithm works in three steps. In the first step, k of the objects are selected randomly, each of which represents the initial mean or center of a cluster. In the second step, the remaining objects are assigned to the cluster whose center or mean is at minimum distance. In the third step, the new mean of each cluster is computed, and the process iterates until the criterion function converges. The performance of K-means is measured using the square-error function defined in the following:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^{2},$$
where $E$ is the sum of the square error, $p$ is the point in space representing a given object, and $m_i$ is the mean of cluster $C_i$. This criterion tries to make the resulting clusters as compact and as separate as possible. The algorithm consists of five major steps, which are summarized in Algorithm 1.
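The steps of Algorithm 1 and the square-error criterion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the WEKA implementation used in the experiments: the function names `k_means` and `squared_error`, the `seed` and `max_iter` parameters, and the six two-dimensional toy points are all hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(data, k, max_iter=100, seed=0):
    """Partition `data` into k clusters; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)            # step 1: arbitrary initial centers
    labels = None
    for _ in range(max_iter):                # step 2: repeat
        # step 3: assign every object to its nearest center
        new_labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                      for p in data]
        if new_labels == labels:             # step 5: stop when nothing changes
            break
        labels = new_labels
        # step 4: recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:                      # keep the old center if a cluster empties
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


def squared_error(data, centers, labels):
    """Criterion E: sum of squared distances of objects to their cluster mean."""
    return sum(euclidean(p, centers[lab]) ** 2 for p, lab in zip(data, labels))


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = k_means(points, k=2)
error = squared_error(points, centers, labels)
```

Because the initial centers are chosen at random, different seeds can in general lead to different partitions; on well-separated data such as this toy set the result is stable.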

C-Means (Fuzzy) Clustering Algorithm.
The fuzzy C-means (FCM) algorithm [36,37] is one of the most widely used methods in fuzzy clustering (Algorithm 2). It is based on the concept of fuzzy partitioning, which is summarized as follows.
(a) Choose a number of clusters.
(b) Randomly assign to each point coefficients for being in the clusters.
(c) Repeat steps (d) and (e) until the algorithm has converged (i.e., the change in the coefficients between two iterations is no more than the given sensitivity threshold).
(d) Compute the centroid of each cluster as the mean of all points weighted by their coefficients for that cluster.
(e) For each point, recompute its coefficients of being in the clusters from its distances to the centroids.
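The fuzzy partitioning steps (a) to (e) can be illustrated with the following sketch. It assumes the standard FCM update rules with a fuzzifier `m` (centroids as means weighted by membership raised to the power `m`, memberships from inverse distance ratios); the paper's own formulas are not reproduced in this text, so the exact form, the function name `fuzzy_c_means`, the default `m=2.0`, the tolerance, and the toy points are all assumptions.

```python
import math
import random


def fuzzy_c_means(data, c, m=2.0, tol=1e-5, max_iter=200, seed=0):
    """Return (centroids, u), where u[i][j] is the membership of point i in cluster j."""
    rng = random.Random(seed)
    # (b) random membership coefficients, normalised so that each row sums to 1
    u = []
    for _ in data:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    dim = len(data[0])
    centroids = []
    for _ in range(max_iter):                        # (c) repeat until converged
        # (d) centroid = mean of all points weighted by membership ** m
        centroids = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(len(data))]
            total = sum(weights)
            centroids.append(tuple(
                sum(w * p[d] for w, p in zip(weights, data)) / total
                for d in range(dim)))
        # (e) update memberships from the distances to the new centroids
        max_change = 0.0
        for i, p in enumerate(data):
            dists = [max(math.dist(p, cj), 1e-12) for cj in centroids]
            for j in range(c):
                new = 1.0 / sum((dists[j] / dl) ** (2.0 / (m - 1.0)) for dl in dists)
                max_change = max(max_change, abs(new - u[i][j]))
                u[i][j] = new
        if max_change <= tol:                        # sensitivity threshold reached
            break
    return centroids, u


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, u = fuzzy_c_means(points, c=2)
hard = [row.index(max(row)) for row in u]            # most likely cluster per point
```

Unlike K-means, every point keeps a degree of membership in every cluster; the `hard` list above only reports the cluster with the highest membership.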

Distance Function.
Clustering algorithms create groups of similar records. To find similar records, distance functions are used, which measure the distance between two objects (or records). If the distance between a pair of records is small, the records are said to be similar. In this work two clustering algorithms, K-means and fuzzy C-means, are used to generate clusters (groups) of concrete records, and the Euclidean distance function is used to measure the distance between a pair of concrete records. The Euclidean distance between two concrete records $x$ and $y$ with $n$ features is given in the following:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}}.$$
In this work the concrete data from the UCI machine learning repository [38] is used. There are eight ($n = 8$) attributes (features) in this dataset, namely, Cement (C), Blast Furnace Slag (BFS), Fly Ash (F), Water (W), Superplasticizer (S), Coarse Aggregate (CA), Fine Aggregate (FA), and Age (A). The description of the dataset is presented in Table 1.
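As a small worked example of the distance function, the sketch below computes the Euclidean distance between two eight-attribute records. The attribute order follows the list above; the two records are made-up values for illustration and are not rows of the UCI dataset, and the function name `euclidean_distance` is hypothetical.

```python
import math

# Attribute order assumed from the text: C, BFS, F, W, S, CA, FA, A.
FEATURES = ("C", "BFS", "F", "W", "S", "CA", "FA", "A")


def euclidean_distance(x, y):
    """Euclidean distance between two records with the same number of features."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of features")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two illustrative records that differ only in cement (by 3) and water (by 4).
record_a = (300.0, 0.0, 0.0, 180.0, 0.0, 1000.0, 800.0, 28.0)
record_b = (303.0, 0.0, 0.0, 184.0, 0.0, 1000.0, 800.0, 28.0)
distance = euclidean_distance(record_a, record_b)   # sqrt(3**2 + 4**2) = 5.0
```

One design point worth noting: the attributes are measured on different scales, so without normalisation the attributes with the largest numeric ranges dominate the distance. Whether the experiments normalised the attributes is not stated in the text.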

Regression Techniques.
Regression techniques are used to discover the relationship between a set of variables. These techniques identify the patterns (relations) between independent and dependent variables; the independent variables are technically termed predictors, and the dependent variable is termed the response. Regression techniques typically try to relate some statistical measure, such as the mean, between the sets of variables to identify the relationship between them. In this paper three regression techniques, namely, Simple Linear Regression (SLR), Logistic Regression (LR), and Least Median of Squares (LMS) regression, are used to estimate the compressive strength of concrete.

Simple Linear Regression.
Simple linear regression is the simplest regression analysis technique, where a line equation ($y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + c$) is used to relate the predictor variables ($x_1, x_2, \ldots, x_n$) to the response ($y$). Simple linear regression is a statistical technique that fits a straight line to a set of $(x, y)$ data pairs. The slope and intercept of the fitted line are chosen so as to minimize the sum of squared differences between observed response values and fitted response values. That is, the method of ordinary least squares is used to fit a straight line model to the data.
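For a single predictor, the ordinary least squares slope and intercept have a closed form, which the following sketch computes. The function name `fit_line` and the four data pairs are illustrative assumptions; the data lie exactly on a line so that the result can be checked by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the products of deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

With several predictors, as in the eight-attribute concrete data, the same least squares criterion is minimized over all coefficients at once, which is normally done with a linear algebra routine rather than this one-variable formula.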

Logistic Regression.
Logistic regression is another regression technique, which is used to perform regression on multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories. If the dependent variable, $y$, is a binary response or dichotomous variable, logistic regression can be used to describe its relationship with several predictor variables, ($x_1, x_2, \ldots, x_n$), and an odds ratio can be estimated. Logistic regression is mostly used for prediction tasks involving multiple independent variables and is also used for exploring the strong (dominating) independent variables in prediction tasks. Logistic regression predicts the probability of occurrence of an event by fitting data to a logistic curve. Logistic regression techniques are designed using the logistic model, which can be expressed as given in the following:
$$\operatorname{logit}(p) = \ln\frac{p}{1 - p} = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + c,$$
where $p$ is the probability of a classification match, and $x_1, x_2, \ldots, x_n$ are the explanatory, independent variables.

Least Median of Squares Regression.
Rousseeuw (1984) [39] introduced Least Median of Squares (LMS) as a robust regression procedure. In this regression technique, instead of minimizing the sum of squared residuals, the coefficients are chosen so as to minimize the median of the squared residuals. In contrast to conventional least squares (LS), there is no closed-form solution with which to easily calculate the LMS line, since the median is an order or rank statistic. A general nonlinear optimization algorithm performs poorly because the surface of the median of squared residuals is so rutted that merely local minima are often incorrectly reported as the solution.
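Because the LMS criterion has no closed-form solution, one simple way to approximate it for a single predictor is to try the line through every pair of data points and keep the candidate with the smallest median squared residual. The sketch below illustrates that idea only; it is an assumption-laden toy, not the procedure of [39] and not the WEKA implementation, and the function name `lms_line` and the data are hypothetical.

```python
from itertools import combinations
from statistics import median


def lms_line(xs, ys):
    """Approximate LMS fit: test the line through each pair of points and keep
    the one whose squared residuals have the smallest median."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue                          # vertical line, cannot be a candidate
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        score = median((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
        if best is None or score < best[0]:
            best = (score, slope, intercept)
    return best[1], best[2]


# Illustrative data on y = 2x, except for one gross outlier at the end
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 100.0]
slope, intercept = lms_line(xs, ys)
```

The outlier barely affects the median of the squared residuals, so the fitted line follows the six consistent points, whereas an ordinary least squares fit would be pulled towards the outlier. The exhaustive pair search is only practical for small samples.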

Regression Clustering.
In regression clustering (RC) [40], several regression functions are applied to the dataset simultaneously and guide the clustering of the dataset into subsets, each with a simpler distribution matching its guiding function. Each function is regressed on its own subset of data with a much smaller residue error. Both the regressions and the clustering optimize a common objective function.

Performance Measurements.
The performance of regression based prediction techniques is measured in terms of the errors in regression. Two common errors in regression based prediction are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE).

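Mean Absolute Error.
The Mean Absolute Error (MAE) is the average of the absolute values of the residuals (errors). The MAE is very similar to the RMSE but is less sensitive to large errors. The MAE is calculated using the following:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,$$
where $n$ is the number of samples, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value for the $i$th sample.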

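Root Mean Squared Error.
The Root Mean Squared Error (RMSE) is the square root of the average squared distance of a data point from the fitted line. The RMSE is calculated using the following:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}},$$
where $n$ is the number of samples, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value for the $i$th sample.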
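Both error measures are straightforward to compute. The sketch below uses four made-up observed and predicted strength values (not results from the paper) chosen so that the arithmetic can be followed by hand; the function names are illustrative.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared residual."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Illustrative values only
actual = [30.0, 40.0, 50.0, 60.0]
predicted = [28.0, 40.0, 50.0, 66.0]
mae = mean_absolute_error(actual, predicted)        # (2 + 0 + 0 + 6) / 4 = 2.0
rmse = root_mean_squared_error(actual, predicted)   # sqrt((4 + 0 + 0 + 36) / 4) = sqrt(10)
```

The single residual of 6 contributes much more to the RMSE than to the MAE, which illustrates why the RMSE is the more sensitive of the two to large errors.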

Concrete Compressive Strength Estimation Algorithm
Based on the methodology presented in the previous section, the algorithm for predicting the compressive strength of concrete is presented in this section (Algorithm 3). The algorithm takes two input parameters: k, the number of clusters, and D, a dataset containing n objects of concrete samples. The algorithm is composed of five steps. In the first step the UCI machine learning repository concrete dataset is selected and given as input to the algorithm; in the second step, k clusters are created using the K-means or fuzzy C-means clustering technique. In the third step regression techniques are applied on each cluster created in the second step.

Table 10: Equations for different clusters for estimating concrete compressive strength using LMS regression (columns: Number of clusters, Cluster, Equation).

Datasets.
To perform the experiments with the proposed technique, the popular concrete compressive strength dataset from the UCI machine learning repository [38] is used. The dataset consists of eight input variables and one output variable, namely, "concrete compressive strength." The dataset is summarized in Table 1, where the names of the components are given along with their data types and measurement units.

Implementation.
All the experiments are carried out using the Java programming language and the Java based WEKA (Waikato Environment for Knowledge Analysis) tool [41]. WEKA is a popular open source tool for knowledge analysis. The implementations of the K-means clustering technique and of the regression techniques are taken from WEKA. The implementation is performed in three major steps (as shown in Figure 1). In the first step clustering is performed using the K-means and fuzzy C-means algorithms, and the groups of similar concrete records are created. In the second step regression techniques are applied on the created clusters to identify, for each cluster, the relationship between the dependent variable (compressive strength of concrete) and the independent variables (the other variables). In the third step, errors are calculated for the estimations. The model is thus prepared for predicting the compressive strength of new concrete records (for which the compressive strength is unknown). For a new concrete record, the record is mapped to a suitable cluster by measuring its belongingness to each cluster, and the compressive strength is then estimated using the equation of that cluster.
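The prediction step for a new record can be sketched as follows. This is a simplified illustration rather than the Java and WEKA implementation used in the experiments: it assumes that belongingness is measured by the nearest cluster centroid (for fuzzy clusters the highest membership could be used instead), it uses only two attributes for brevity, and the centroids, coefficients, and intercepts in `MODEL` are made-up numbers, not equations from the paper's tables.

```python
import math

# Hypothetical fitted model: one centroid and one linear equation per cluster.
# Records are (cement, age) pairs; all numbers are illustrative only.
MODEL = [
    {"centroid": (200.0, 28.0), "coef": (0.10, 0.20), "intercept": 5.0},
    {"centroid": (450.0, 90.0), "coef": (0.08, 0.10), "intercept": 12.0},
]


def nearest_cluster(record, model):
    """Index of the cluster whose centroid is closest to the record."""
    return min(range(len(model)),
               key=lambda j: math.dist(record, model[j]["centroid"]))


def predict_strength(record, model):
    """Apply the regression equation of the cluster the record is mapped to."""
    cluster = model[nearest_cluster(record, model)]
    return cluster["intercept"] + sum(
        w * x for w, x in zip(cluster["coef"], record))


new_record = (220.0, 28.0)
cluster_index = nearest_cluster(new_record, MODEL)   # closest to the first centroid
estimate = predict_strength(new_record, MODEL)       # 5 + 0.10 * 220 + 0.20 * 28
```

Keeping the centroid and the regression equation together per cluster makes the two-stage structure explicit: the clustering decides which equation applies, and the equation produces the estimate.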

Results Analysis and Discussions.
To estimate the compressive strength of concrete, clustering is applied first (as shown in Section 3). Two clustering techniques, namely, K-means and fuzzy C-means, are applied to create hard and soft (fuzzy) clusters. The distribution of the number of concrete instances for different numbers of clusters using the K-means algorithm is shown in Figure 2. Similarly, the distribution of concrete instances for the C-means algorithm is shown in Figure 3. Since C-means is a fuzzy technique, one instance can belong to more than one cluster; hence more instances belong to the clusters than with the K-means algorithm.
After the clustering is done, regression techniques are applied on each individual cluster for predicting the compressive strength of the concrete. Three different regression techniques, SLR, LR, and LMS, are applied on each individual cluster, and the prediction errors MAE and RMSE are recorded for each experiment. The MAE and RMSE values for each cluster using the SLR regression technique are tabulated in Table 2 for the K-means algorithm; similarly, the MAE and RMSE values for clusters using the LR and LMS regression techniques are tabulated in Tables 3 and 4, respectively, using the K-means clustering algorithm. The overall weighted MAE and overall weighted RMSE values are also recorded for each number of clusters. The overall weighted MAE and RMSE are calculated as follows:
$$\mathrm{MAE}_{\text{overall}} = \frac{\sum_{i=1}^{k} n_i \, \mathrm{MAE}_i}{\sum_{i=1}^{k} n_i}, \qquad \mathrm{RMSE}_{\text{overall}} = \frac{\sum_{i=1}^{k} n_i \, \mathrm{RMSE}_i}{\sum_{i=1}^{k} n_i},$$
where $\mathrm{MAE}_i$, $\mathrm{RMSE}_i$, and $n_i$ are the MAE value, RMSE value, and number of concrete samples belonging to the $i$th cluster and $k$ is the total number of clusters. The fuzzy clustering technique is based on fuzzy set theory, where the belongingness of an element to a set is decided by its degree of membership. In the fuzzy clustering technique one record may belong to several clusters at a time; therefore, for the same dataset and the same number of clusters, more concrete records are assigned to the clusters than with the K-means algorithm. The error measures MAE and RMSE for the C-means algorithm with the SLR, LR, and LMS regression techniques are tabulated in Tables 5, 6, and 7, respectively. The relationship between the number of concrete data clusters and the overall MAE for the K-means and C-means algorithms is shown in Figures 4(a) and 4(b), respectively. From the figures it can be seen that low values of MAE are found when the number of clusters is between four and seven. The LR regression technique gives the minimum MAE in prediction. From the clustering point of view, the minimum error is achieved by the C-means algorithm, which is a fuzzy clustering algorithm and gives a natural way of expressing the belongingness of concrete data to a particular cluster.
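The overall weighted errors are averages of the per-cluster errors weighted by cluster size. A minimal sketch follows, using made-up cluster sizes and per-cluster MAE values (not figures from Tables 2 to 7); the function name `weighted_average` is illustrative, and the same function applies unchanged to per-cluster RMSE values.

```python
def weighted_average(errors, sizes):
    """Average of per-cluster errors weighted by the number of samples per cluster."""
    return sum(e * n for e, n in zip(errors, sizes)) / sum(sizes)


# Illustrative values for three clusters
cluster_sizes = [100, 300, 600]
cluster_mae = [4.0, 5.0, 6.0]
overall_mae = weighted_average(cluster_mae, cluster_sizes)   # (400 + 1500 + 3600) / 1000 = 5.5
```

Weighting by cluster size prevents a small cluster with an unusually low or high error from dominating the overall figure.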
Similar behavior is observed in Figures 5(a) and 5(b), where the relationship between the RMSE values and the number of clusters is presented for the K-means and C-means algorithms. The minimum values of RMSE are achieved with the LR regression technique for the C-means clustering algorithm when the number of clusters is between four and seven.
It is observed from the experiments that the minimum MAE and RMSE occur when the number of clusters is between four and seven. The LR regression technique performs better than the other regression techniques, and the fuzzy clustering algorithm C-means performs better than the K-means algorithm since it gives a more natural belongingness of a concrete sample to a particular cluster. It can therefore be concluded from the experiments that, for the UCI machine learning repository concrete data, the best estimation of the compressive strength of concrete is obtained by clustering the concrete samples into four to seven clusters using the fuzzy clustering technique and then applying the LR regression technique to estimate the concrete compressive strength for each individual cluster.
From the equations generated by the SLR regression technique it is found that the components C and BFS have larger weights (multiplication factors) and are thus more important in estimating the compressive strength of concrete; the components S and FA also have a slight effect. Similarly, for the LR regression technique the components C, BFS, W, and FA have larger weights than the other components. For the LMS regression technique the weights of the components C, BFS, W, and S are larger in the regression equations, which indicates that these components are more important than the other components for estimating the compressive strength of concrete.

Conclusion and Future Scope
In the presented work, K-means clustering and fuzzy C-means clustering, along with three regression techniques, namely, simple linear regression (SLR), logistic regression (LR), and least median of squares (LMS), are used to predict (estimate) the compressive strength of concrete. The purpose of the work is to demonstrate that combining clustering with a regression technique reduces the prediction errors for estimating compressive strength. The experiments show that if a suitable number of clusters is created on the concrete data before applying the regression, the prediction errors can be reduced. The clustering techniques K-means and C-means are generic in nature and are capable of grouping objects of any data type and any number of features (attributes); hence, the same proposed model can be used for estimating compressive strength on different concrete datasets. In this study it is found that, for the UCI machine learning repository concrete data, four to seven clusters along with the regression technique give the minimum errors for predicting the compressive strength of concrete. It is also found that the fuzzy clustering algorithm C-means is more effective than the K-means clustering algorithm for creating the clusters and gives the minimum errors in predicting the compressive strength of concrete. Partitioning based clustering algorithms (K-means and C-means) are used in this work; density based clustering algorithms can be explored for creating the groups of concrete data as future work. The proposed model can also be validated on other standard datasets, and comparisons can be made between partitioning based and density based clustering techniques for predicting the compressive strength of concrete.