Applications of Gene Expression Programming for Estimating Compressive Strength of High-Strength Concrete

Department of Civil Engineering, College of Engineering in Al-Kharj, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
Department of Civil Engineering, COMSATS University Islamabad, Abbottabad 22060, Pakistan
Department of Civil and Environmental Engineering, College of Engineering, King Faisal University (KFU), P.O. Box 380, Al-Hofuf, Al Ahsa 31982, Saudi Arabia
Department of Architecture and Civil Engineering, City University of Hong Kong, Kowloon 999077, Hong Kong


Introduction
High-strength concrete (HSC) production has increased steadily in recent years to meet the demands of modern construction work [1][2][3]. Improving concrete performance ultimately enhances the overall effectiveness of modern concrete structures. HSC has a compressive strength greater than 40 MPa, well above that of conventional concrete [4]. HSC is a modified form of concrete that can be placed with or without vibration; it is dense and homogeneous, with high strength and superior durability compared with traditional concrete, which makes it widely applicable in the concrete industry [5,6]. For example, it is widely used for high-rise buildings, long-span bridges, and piers. The American Concrete Institute (ACI) defines HSC as "concrete that possesses specific requirement for its working which cannot be achieved by conventional concrete" [7]. Its use in construction improves the working environment and opens the way for the automation of concrete construction. However, the major problem lies in its design procedure, owing to the complex nature of HSC. Various researchers have reported different guidelines and standards for mixture design, which involve the use of chemical and mineral admixtures [8][9][10]. Because HSC is more complex than conventional-strength concrete, its design requires experience and thorough knowledge of the constituents used in the mixture. This complexity calls for an arduous mix design procedure to attain the essential properties. Strength is a key property of high-strength concrete; however, the constituents, chemical and mineral admixtures, and design specifications vary from source to source [11][12][13][14]. This creates ambiguity in the general relationship among the ratio of cement to mineral admixtures, chemical admixtures, water-to-binder (w/b) ratio, and aggregate grain sizes.
If not properly managed, these variations in constituents produce a deficiency in concrete strength. The constituents can be managed by using their desired (optimized) quantities, which yield the highest strength, rather than by relying on experimental work alone. Such experimental work costs resources and time, because the desired quantities are found by trial and error. In this regard, numerous researchers have used traditional methods based on linear and nonlinear equations to predict high-performance concrete (HPC) strength. These methods were based on statistical analysis; however, accurate prediction from equation-based approaches is difficult and requires considerable research effort. In recent years, machine learning and neural-network-based approaches have overcome these difficulties and provided accurate predictions of concrete strength.
Machine learning approaches such as gene expression programming (GEP) [15][16][17], artificial neural networks (ANN) [18][19][20][21], support vector machine (SVM), decision tree (DT), adaptive boost algorithm (ABA), and adaptive neuro-fuzzy inference system (ANFIS) [22][23][24][25][26] have been widely used in the civil engineering domain [27]. Dong et al. used ANN and ANFIS to predict the 28-day compressive strength of geopolymer concrete with 210 data samples. The authors concluded that both approaches give good predictions; however, ANFIS outperformed ANN in terms of the coefficient of determination (R²) and model performance [28]. Nour and Güneyisi [29] used GEP to predict the compressive strength of recycled aggregate concrete-filled steel tube columns (RACFSTC) with 97 test datasets and concluded that GEP provides an accurate prediction together with an empirical relation. The authors reported R² values of 0.996 and 0.995 for testing and training, respectively [29]. Bingöl et al. modelled the compressive strength of lightweight concrete exposed to high temperature using an ANN approach [30]. The authors concluded that ANN is an advanced predictive approach and that the model predicts the strength with adequate accuracy. Moreover, researchers have used ANN and other machine learning approaches to predict the properties of recycled aggregate concrete and high-performance concretes [31][32][33][34][35][36]. In the study by Pala et al., the experiments included concrete mixtures with different water-cement ratios, including the lowest and highest fly ash concentrations, with or without small additional amounts of silica fume. Based on the results, ANNs have tremendous potential as a means to examine the effect of secondary raw materials on the compressive strength of concrete [37]. Iqbal et al. used GEP to predict the properties of green concrete with 234 data samples. The authors reported that GEP gives high prediction accuracy together with an empirical relationship [32]. Javed et al. [15] conducted an experimental program and predicted the strength of sugarcane bagasse ash concrete using different machine learning approaches. The authors obtained a strong correlation between input and output using the GEP approach. The same trend was observed by Azim et al. [31], who used GEP for the prediction of reinforced concrete structures with high accuracy. GEP compares favourably with existing methods such as feature selection, ANN, and M5P. The choice of features is an essential step in data processing and is seen in many areas, such as genetics, medicine, and bioinformatics. Selecting the key elements (genes) is necessary to uncover new information concealed in the genetic code and to recognise relevant biomarkers. Although the proposed algorithms can help sort through large numbers of genes related to the problem at hand, the results they generate appear to be unstable and thus cannot be reproduced in other studies. It is important to emphasize that the two machine learning models most widely employed in previous studies, the ANN and M5P models, sometimes struggle to predict outcomes reliably in data domains with complicated input-output relationships (i.e., highly nonlinear or nonmonotonic) [17][38][39][40][41]. This is because ANN models and their variants, such as MLP-ANN, rely on local optimization and search algorithms (e.g., the backpropagation technique used in many neural-network models to tune the network weights), which are highly susceptible to becoming trapped in local minima instead of converging to the global optimum.
This paper aims to build a GEP-based model, expressed as an empirical equation, for accurate prediction of the compressive strength of high-strength concrete. For this purpose, a dataset comprising 357 data points has been acquired from previously published work, as shown in Table 1. This research focuses on estimating the compressive strength of high-strength concrete using gene expression programming. The parameters used in modelling HSC are cement, water, fine aggregate, coarse aggregate, and superplasticizer. Section 2 presents the input data and the output (strength) with optimal quantities, together with a graphical representation (KDE contour graphs) produced in Python. Section 3 then shows the importance of each variable to the output by means of sensitivity analysis (SA) or permutation feature importance (PFI). Section 4 presents the statistical measures for the model.

Advances in Civil Engineering

Genetic Programming Machine Learning Approach.
GP was first developed by John Koza in 1988; it generates a computer-based model to solve a problem by applying the Darwinian selection principle [42]. GP is a predictive tool based on artificial intelligence that develops a program by emulating the evolution of living organisms [42]. GP is a generalization of the genetic algorithm (GA) [43]. The two approaches differ in how they represent a solution. GA represents a solution as a linear, fixed-length binary string (chromosome), whereas GP represents it as a tree-like structure expressed in a programming language [44]. Because GP works with nonlinear entities of different shapes and sizes, it is a versatile approach for predicting properties. In other words, the solution is expressed as a parse tree of varying size and shape. The hierarchy of problems in GP is similar to that in GA. The computer program then searches for the optimized solution of the problem independently [45][46][47]. The overall GP procedure consists of the following steps:
(1) Generate an initial population of individual chromosomes by randomly selecting elements from the function set and the terminal set. Individuals are built at random as tree-form computer models whose branches extend from the root and end in elements of the terminal set, as shown in Figure 1.
(2) The GP algorithm then iteratively evaluates fitness, selects the fittest chromosomes, and generates new individuals by three operations, namely, reproduction, crossover, and mutation. The process is analogous to natural evolution.
(A) Reproduction: During this procedure, individuals (chromosomes) are copied without any modification into the new population [44].
(B) Crossover: During this operation, a node is randomly selected in each of two parent programs, and the subtrees rooted at those nodes (functions together with their terminals) are swapped to create new offspring programs, as shown in Figure 2. Two new offspring are thus generated from two parent programs [42,44].
(C) Mutation: During this procedure, a node of an individual, from the terminal set or the function set, is selected at random and replaced by another element of the same kind. This creates new offspring, and the best of the generation appears in tree form, as shown in Figure 3 [46].
(3) Genetic programming finally returns the best program found as its solution to the problem [48,49].
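The crossover and mutation operations described in the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function set, the terminal set, the nested-list tree encoding, and all function names are hypothetical choices made for the example.

```python
import random

# Hypothetical function and terminal sets, chosen only for illustration.
FUNCTIONS = ["+", "-", "*"]        # every function here takes two arguments
TERMINALS = ["A", "B", "C", 3, 4]  # variables and constants


def random_tree(depth, rng):
    """Grow a random parse tree: [op, left, right] or a bare terminal."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return [rng.choice(FUNCTIONS),
            random_tree(depth - 1, rng),
            random_tree(depth - 1, rng)]


def evaluate(tree, env):
    """Evaluate a tree for the variable values given in env."""
    if isinstance(tree, list):
        op, left, right = tree
        a, b = evaluate(left, env), evaluate(right, env)
        if op == "+":
            return a + b
        if op == "-":
            return a - b
        return a * b
    return env[tree] if isinstance(tree, str) else tree


def node_paths(tree, prefix=()):
    """Yield the index path to every node; the root has the empty path."""
    yield prefix
    if isinstance(tree, list):
        for i in (1, 2):
            yield from node_paths(tree[i], prefix + (i,))


def subtree(tree, path):
    """Return the node reached by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replaced(tree, path, new):
    """Return a copy of tree in which the node at path is replaced by new."""
    if not path:
        return new
    out = list(tree)
    out[path[0]] = replaced(tree[path[0]], path[1:], new)
    return out


def crossover(parent1, parent2, rng):
    """Swap one randomly chosen subtree between two parents."""
    p1 = rng.choice(list(node_paths(parent1)))
    p2 = rng.choice(list(node_paths(parent2)))
    child1 = replaced(parent1, p1, subtree(parent2, p2))
    child2 = replaced(parent2, p2, subtree(parent1, p1))
    return child1, child2


def mutate(tree, rng):
    """Point mutation: swap one node for another symbol of the same kind."""
    path = rng.choice(list(node_paths(tree)))
    node = subtree(tree, path)
    if isinstance(node, list):   # function node: keep its children
        new = [rng.choice(FUNCTIONS)] + node[1:]
    else:                        # terminal node
        new = rng.choice(TERMINALS)
    return replaced(tree, path, new)


rng = random.Random(0)
parent1, parent2 = random_tree(3, rng), random_tree(3, rng)
child1, child2 = crossover(parent1, parent2, rng)
mutant = mutate(child1, rng)
```

Trees are treated as immutable here: each operator returns a new tree, so parents survive unchanged and reproduction amounts to passing an individual on as it is.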
In recent years, variants such as linear genetic programming (LGP), multi-expression programming (MEP), and gene expression programming (GEP) have been used to predict properties in many domains, including civil engineering. These approaches are rooted in genetic algorithms and genetic programming. Moreover, they reduce limitations such as genetic operations on trees, code growth with complexity, and implementation difficulties. Owing to these benefits, they are favourable candidates for complex forecasting problems. In this paper, gene expression programming was used to predict the strength of high-strength concrete.

Gene Expression Programming (GEP) Approach.
Ferreira [50] proposed GEP, an algorithm developed as a modification of GA and GP. It incorporates both linear strings of fixed length and parse trees. The linear variant uses the same genetic operators as GP, with minor modifications. A GEP model consists of five components analogous to those of GP: fitness function, terminal set, function set, control parameters, and termination condition. The GEP algorithm creates a population of randomly generated chromosomes and then converts each individual into an expression tree of a particular shape and size, which represents its solution as a mathematical expression. The predicted values are then compared with the targets, and the fitness score of each individual is determined. The algorithm stops if the best fitness is achieved; otherwise, individuals are selected by roulette wheel sampling. This retains the fittest chromosomes and passes them to the next generation. The loop continues until a chromosome with a satisfactory fitness score is obtained. The basic steps involved in representing a solution are shown in Figure 4.
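The loop described above (evaluate fitness, stop if good enough, otherwise select by roulette wheel and carry the best individual forward) can be sketched as follows. This is a generic illustration, not the software used in the study: individuals are plain numbers instead of GEP chromosomes, the error-to-fitness conversion 1/(1 + error) is an assumption, and the names `roulette_select` and `evolve` are hypothetical.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.random() * total
    running = 0.0
    for individual, f in zip(population, fitness):
        running += f
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


def evolve(population, error_fn, vary, generations, rng, target_error=0.0):
    """Generational loop with roulette-wheel selection and elitism."""
    best = min(population, key=error_fn)
    history = [error_fn(best)]
    for _ in range(generations):
        if history[-1] <= target_error:   # termination condition
            break
        # Lower error means higher fitness (assumed conversion).
        fitness = [1.0 / (1.0 + error_fn(ind)) for ind in population]
        offspring = [vary(roulette_select(population, fitness, rng), rng)
                     for _ in range(len(population) - 1)]
        population = [best] + offspring   # the best individual survives
        best = min(population, key=error_fn)
        history.append(error_fn(best))
    return best, history


# Toy problem: find a number close to 3 by random perturbation.
rng = random.Random(1)
start = [rng.uniform(-10, 10) for _ in range(20)]
best, history = evolve(start,
                       error_fn=lambda x: abs(x - 3.0),
                       vary=lambda x, r: x + r.gauss(0.0, 0.5),
                       generations=50, rng=rng, target_error=1e-3)
```

Because the best individual is always copied into the next generation, the best error recorded in `history` can never increase from one generation to the next.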

where A, B, C, D are variables (terminal set) and 3, 4 are constants. This term is expressed as a K-expression (Karva notation), which is used to develop an empirical relationship between sets and individual chromosomes [51]. The Karva expression can also be represented by an expression tree (ET) diagram [52]. For example, the ET diagram of the above-mentioned expression is shown in Figure 5. The transformation of a K-expression into an ET starts from the first position, which corresponds to the root of the ET, and continues through the string [29]. Similarly, an ET is transformed into a K-expression by recording the nodes from the root level down to the deepest layer. The GEP gene in mathematical form can also be expressed as

Figure 2: Crossover example based on genetic programming.
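The translation of a K-expression into an expression tree can be illustrated with a short sketch. The gene used below is a hypothetical example, not the expression of the original equation, and the list-based node encoding and function names are assumptions made for the illustration. The decoder reads the string from left to right and fills the tree level by level, starting from the root.

```python
from collections import deque

ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}  # any other symbol is a terminal


def decode_karva(symbols):
    """Build an expression tree from a K-expression, level by level.

    Each node is a list: [symbol] for a terminal, [symbol, left, right]
    for a function. Symbols left over after the tree is complete are
    ignored, as in the non-coding tail of a GEP gene.
    """
    symbols = iter(symbols)
    root = [next(symbols)]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols)]
            node.append(child)
            queue.append(child)
    return root


def to_infix(node):
    """Render the tree as a fully parenthesised formula."""
    if len(node) == 1:
        return str(node[0])
    return f"({to_infix(node[1])} {node[0]} {to_infix(node[2])})"


def evaluate(node, env):
    """Evaluate the tree for the variable values given in env."""
    symbol = node[0]
    if len(node) == 1:
        return env[symbol] if isinstance(symbol, str) else symbol
    a, b = evaluate(node[1], env), evaluate(node[2], env)
    if symbol == "+":
        return a + b
    if symbol == "-":
        return a - b
    if symbol == "*":
        return a * b
    return a / b


# Hypothetical gene: root "*", second level "+" and "-", then the terminals.
gene = ["*", "+", "-", "A", "B", "C", "D"]
tree = decode_karva(gene)
formula = to_infix(tree)
value = evaluate(tree, {"A": 1, "B": 2, "C": 5, "D": 3})
```

For this gene the decoded formula is ((A + B) * (C - D)). Reading the tree back row by row from the root reproduces the original string, which is the reverse transformation described in the text.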

Experimental Datasets.
In this paper, 357 data samples acquired from previously published papers (see Table 1) were used to model high-strength concrete. The aim is to use these values to predict optimized quantities rather than resorting to trial and error in experimental work. The database of 357 samples is randomly divided into training, validation, and testing sets. This partitioning is common practice in machine learning to avoid overfitting, and it gives a more reliable coefficient of determination (R²). The training set is used to fit the model, the validation set to check it, and the testing set, which consists of unseen data, to assess the forecasting of high-strength concrete properties. Of the 357 data points, 251 (70%) were assigned to the training set, and the remaining data were split into testing and validation sets of 53 points (15%) each [53,54].
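A random 70/15/15 partition of this kind can be sketched as below. The function name, the fixed seed, and the choice to round the two smaller sets down are assumptions made for the illustration; with 357 samples they reproduce the 251/53/53 split quoted above.

```python
import random


def split_indices(n_samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle sample indices and cut them into three disjoint sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)    # rounded down
    n_test = int(n_samples * test_fraction)  # rounded down
    validation = indices[:n_val]
    testing = indices[n_val:n_val + n_test]
    training = indices[n_val + n_test:]      # everything that is left
    return training, validation, testing


train_idx, val_idx, test_idx = split_indices(357)
```

Fixing the seed keeps the partition reproducible, so that reported statistics for each set can be regenerated.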

Python Measures for Presenting Database.
The database was explored using Anaconda-based Python, version 3.7. The data obtained from the literature consist of five parameters used in modelling strength: cement, water, fine aggregate, coarse aggregate, and superplasticizer. Every parameter influences the strength properties. Python was used to find the correlation of each variable with compressive strength and to find the optimal dosage and influence of the variables by conducting permutation feature importance. The correlations and distributions of the variables are shown in Figure 6. It is well established that model performance is strongly affected by its variables [55]. Deep learning is a handy tool in neuron-based artificial intelligence for predicting mechanical properties from the actual concentrations of the variables. The correlation plot was produced in Python with the seaborn library. The variables used in the model are described in Table 2.
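Permutation feature importance can be sketched without any external library: shuffle one input column at a time and record how much the prediction error grows. The sketch below is an assumption-laden illustration rather than the authors' script; the model, the synthetic data, and the function names are hypothetical, and RMSE is assumed as the error measure.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = rmse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data: the output depends strongly on column 0, weakly on
# column 1, and not at all on column 2 (hypothetical model).
rng = random.Random(42)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(60)]


def model(row):
    return 5.0 * row[0] + 0.5 * row[1]


y = [model(row) for row in X]
scores = permutation_importance(model, X, y)
```

A variable whose shuffling leaves the error unchanged, like the third column here, has no influence on the model output, while larger scores mark the more influential inputs.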

Design of HSC Using Python.
This section deals with the parameters involved in reaching the optimal goal. The variables in any model play a significant role in determining its outcome, so a variable study was conducted using Python.
(1) Contour Maps. The five contour plots obtained from the Python model, with the corresponding regression graphs, are illustrated in Figures 7(a)-7(j). As mentioned previously, model performance depends on its variables, so it is important to know the optimal quantity of each variable rather than rely on experimental work. This provides a useful graph for predicting the strength at 28 days.

compressive strength in the form of a contour, giving the required quantity of cement, and Figure 7(f) shows the regression graph of cement versus strength. Most of the data points used in the literature lie between 300 and 400 kg/m³. However, significant strength was also achieved with a binder content in the range of 500 kg/m³. Moreover, the deepest contour for cement lies in the range of 300 to 400 kg/m³. It is worth mentioning that machine learning provides the range needed to achieve the desired goal.

(b) Effect of Fine and Coarse Aggregate on Compressive Strength. Fine and coarse aggregates are used to fill voids and impart strength to concrete; however, their dosage, type, and condition affect concrete strength. It is clear from Figures 7(b) ... Table 3. In other words, using these quantities in HSC yields maximum output, thus reducing the need for experimental work.

Development of Model Using Gene Expression Programming
This paper aims to develop a generalized equation for the compressive strength of high-strength concrete. Therefore, a terminal set and a function set are used. These variables and functions have a significant effect on the performance of the model. For modelling the strength of HSC, four variables are selected as input parameters in gene expression programming: d0, cement; d1, fine-to-coarse aggregate ratio; d2, water; and d3, superplasticizer. The four basic arithmetic operations (addition, subtraction, multiplication, and division) are used as the function set. The mechanical strength of HSC therefore depends on the relation given in equation (3):

f_c = f(d0, d1, d2, d3)    (3)

The selection of variables has a significant effect on the generalization ability of the GEP-based model. The settings used in the model are presented in Table 4. The run time of the model is governed by the arithmetic operations, head size, number of chromosomes, population size, and complexity. It is preferable to select settings that give a generalized model in reasonable time. These settings were determined by trial and error. Model performance is assessed using the root mean square error (RMSE). Afterward, GEP presents the model architecture in terms of head size and number of genes [53].
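As a hedged illustration of how a function set, a terminal set, and an RMSE fitness fit together, the sketch below scores one candidate expression against a few samples. The candidate expression, the sample values, and the protected-division rule (returning 1 when the divisor is zero, a common convention in genetic programming) are assumptions for the example and are not taken from the study.

```python
import math
import operator


def protected_div(a, b):
    """Division that returns 1.0 instead of failing on a zero divisor."""
    return a / b if abs(b) > 1e-12 else 1.0


# Function set: the four basic arithmetic operations.
FUNCTION_SET = {"+": operator.add, "-": operator.sub,
                "*": operator.mul, "/": protected_div}
# Terminal set: d0 cement, d1 fine-to-coarse aggregate, d2 water,
# d3 superplasticizer.
TERMINAL_SET = ("d0", "d1", "d2", "d3")


def evaluate(expr, sample):
    """expr is (op, left, right), a terminal name, or a numeric constant."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return FUNCTION_SET[op](evaluate(left, sample), evaluate(right, sample))
    if isinstance(expr, str):
        return sample[expr]
    return float(expr)


def rmse_fitness(expr, samples, targets):
    """Root mean square error of the expression over all samples."""
    errors = [(evaluate(expr, s) - t) ** 2 for s, t in zip(samples, targets)]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical candidate and made-up samples, for illustration only.
candidate = ("/", ("*", "d0", 0.5), "d2")
samples = [{"d0": 400.0, "d1": 0.6, "d2": 160.0, "d3": 4.0},
           {"d0": 320.0, "d1": 0.7, "d2": 160.0, "d3": 2.0}]
targets = [evaluate(candidate, s) for s in samples]
fitness = rmse_fitness(candidate, samples, targets)
```

In an actual run, many such candidates would be generated and varied by the genetic operators, and the one with the lowest RMSE on the training data would be retained.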

Model Performance Analysis
The performance of any model on the training, validation, and testing sets is evaluated by the coefficient of determination (R²) and by error measures such as the root mean square error (RMSE), mean absolute error (MAE), relative squared error (RSE), and relative root mean square error (RRMSE). The expressions for these error functions are written in terms of ex_i and mo_i, the experimental and model-predicted strengths, and of the corresponding average values of the experimental and predicted outcomes. The accuracy of the model is characterized by its coefficient of determination (R²). For an effective model, this value should be close to 1, and a value greater than 0.8 indicates high accuracy [56]. It reflects the correlation between experimental and predicted outcomes. An R² value close to 1 and low error values (MAE, RRMSE, RMSE, and RSE) indicate higher accuracy of the model. Moreover, a performance index (ρ) is proposed to measure model efficiency as a function of both the correlation coefficient (R) and RRMSE [55]. A lower value of the index indicates better agreement between experimental and predicted outcomes.
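For reference, the measures named above can be computed as in the following sketch. The formulas are the definitions commonly used in the GEP literature and are assumed here, since they should be checked against the equations of the original paper; in particular, the performance index is taken as ρ = RRMSE / (1 + R). The function name and the sample values are hypothetical.

```python
import math


def performance_metrics(ex, mo):
    """Error measures between experimental (ex) and model (mo) values."""
    n = len(ex)
    ex_mean = sum(ex) / n
    mo_mean = sum(mo) / n
    squared_error = sum((e - m) ** 2 for e, m in zip(ex, mo))

    rmse = math.sqrt(squared_error / n)                       # RMSE
    mae = sum(abs(e - m) for e, m in zip(ex, mo)) / n         # MAE
    rse = squared_error / sum((e - ex_mean) ** 2 for e in ex)  # RSE
    rrmse = rmse / abs(ex_mean)                               # RRMSE

    # Pearson correlation coefficient R between ex and mo.
    covariance = sum((e - ex_mean) * (m - mo_mean) for e, m in zip(ex, mo))
    spread = math.sqrt(sum((e - ex_mean) ** 2 for e in ex)
                       * sum((m - mo_mean) ** 2 for m in mo))
    r = covariance / spread

    rho = rrmse / (1.0 + r)                                   # performance index
    return {"RMSE": rmse, "MAE": mae, "RSE": rse,
            "RRMSE": rrmse, "R": r, "R2": r ** 2, "rho": rho}


# Made-up strengths in MPa, for illustration only.
results = performance_metrics([40.0, 50.0, 60.0], [42.0, 49.0, 61.0])
```

With these definitions a perfect model gives RMSE = MAE = RSE = RRMSE = 0, R = 1, and ρ = 0, which matches the statement that lower values of the index indicate better performance.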
In deep and machine learning approaches, overfitting is a major concern. To counter this, researchers have used an objective function (OBF) to assess model accuracy (equation (5)). The OBF combines the error and the regression coefficient over the whole dataset to give the best generalized model [55]. This is achieved by the equation presented by Gandomi et al. A higher value of R and lower error values result in significantly lower values of the performance index and the OBF.
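For orientation, the objective function is usually quoted in the GEP literature in the form below. This is given as an assumption to be checked against the original equation: the subscripts T and V denote the training and validation sets, n is the total number of data points used, and ρ is the performance index built from RRMSE and R.

```latex
\mathrm{OBF} = \left(\frac{n_{T} - n_{V}}{n}\right)\rho_{T}
             + 2\left(\frac{n_{V}}{n}\right)\rho_{V},
\qquad
\rho = \frac{\mathrm{RRMSE}}{1 + R}
```

In this form the validation term carries extra weight relative to its share of the data, so a model that fits the training set well but generalizes poorly is penalized, and values of the OBF close to zero indicate a well-generalized model.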

Formulation of Compressive Strength of HSC Using GEP.
The gene expression algorithm is used to predict the mechanical response of HSC in the form of an empirical relation. This formulation is a function of the variables expressed in equation (6). The relationship is derived from the expression trees shown in Figure 8. GEP combines linear and nonlinear elements in forming the tree structure. Moreover, the tree uses arithmetic operators, variables, and constants to predict strength. GEP employs basic operators to solve three sets of expressions. Each subprogram or chromosome reflects specific features of the problem, and together they form the solution to the problem [50].
The gene structure, number of chromosomes, and operators are selected before running the GEP algorithm. The best model is selected through several trials by varying the head size, number of genes, and number of chromosomes together with the operators. The GEP algorithm selects the best generation and gene within the population. Figure 8 presents the best outcome for f_c. The linking function employed in GEP is a basic arithmetic operator; in the figure, c denotes constant values and d denotes the input variables. The fitness function used in modelling is RMSE.

Evaluation of Model and Analysis.
The comparison between actual and predicted values is shown in Figure 9. The GEP-based algorithm is an effective tool for assessing strength. The slope of the regression line for the training, testing, and validation sets approaches 1. Model accuracy and validity can be judged by the coefficient of determination (R²). Figures 9(a)-9(c) show that R² is greater than 0.8 in every case: it is 0.910, 0.914, and 0.9 for the testing, training, and validation sets, respectively. These sets comprise the 357 data samples, divided into training, testing, and validation sets in a 70/15/15 percent ratio. The fitted data modelled by the GEP algorithm indicate good agreement between output and target values. Moreover, the data were normalized to the range 0 to 1 to give a generalized relation. The accuracy of the model over all data can be seen in the normalized graph shown in Figure 9(d).
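Scaling to the range 0 to 1 is typically done by min-max normalization. The sketch below assumes that method, which the text does not name explicitly; the function names and values are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:                      # constant input: avoid dividing by 0
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def denormalize(scaled, low, high):
    """Map normalized values back to the original range."""
    return [low + s * (high - low) for s in scaled]


# Made-up strengths in MPa, for illustration only.
strengths = [40.0, 50.0, 60.0]
scaled = min_max_normalize(strengths)
restored = denormalize(scaled, min(strengths), max(strengths))
```

Keeping the original minimum and maximum allows normalized predictions to be converted back to MPa when the equation is used in practice.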
Model performance can also be evaluated through statistical measures such as MAE, R², RRMSE, and RSE. The statistical measures of the proposed GEP-based model for the testing, training, and validation sets are shown in Table 5. Further analysis can be done by determining the covariance (COV) and standard deviation (SD) of predicted to actual targets. The covariance and SD for the training set are 0.16 and 0.059, respectively. Together, the R² values, the error values, and the low objective function for all sets show that the proposed model agrees closely with the actual values. The accuracy of the proposed model can also be evaluated more broadly by checking the absolute error between predicted and actual targets, as shown in Figure 10. The figure shows close agreement between predicted and actual values, with a maximum average error of 2.64. The absolute errors of the predicted data lie in the range of 0.029 MPa to 7.5 MPa, which are the minimum and maximum values over the predicted datasets. The small difference between experimental and model values reflects the adaptive nature of gene expression programming.
The reliability of any model depends greatly on its dataset. A larger number of data points increases the accuracy of the model. However, the adequacy of the data relative to the number of variables is a major concern in modelling. To check the validity of the dataset, Frank and Todeschini [57] stated that the ratio of the number of data points to the number of input variables should be at least 5. This criterion is for an ideal model. The current study comfortably exceeds this ratio, with 357/4 = 89.25. Moreover, the GEP model can be validated by external statistical measures on the testing set. Golbraikh and Tropsha [58] proposed that the slope of the regression line through the origin (k′ or k) should be close to 1. Similarly, various scholars have suggested that the squared correlation coefficient through the origin between the output and target values (Ro′²), or between the predicted and experimental values (Ro²), should be close to 1 [44]. These external checks on the GEP-based model are presented in Table 6. Hence, it can be concluded that the model has genuine predictive capability and is not merely a correlation between the input and output variables. The prediction of the mechanical behaviour of high-strength concrete by the gene expression algorithm is therefore reliable with respect to the ratio of data samples to variables. The behaviour of the GEP-based model can also be compared with linear and nonlinear regression models by presenting empirical relationships between predicted and experimental results. The empirical relations for both are given in equations (5) and (6). Moreover, Figure 11 shows the behaviour of the modelled data. The GEP-based model outperforms the linear and nonlinear models on the testing, validation, and training sets, with R greater than for both [59].
This is due to one of the advantages of GEP: it handles both linear and nonlinear relationships in the data, which improves prediction accuracy, and the resulting expression trees (Figure 8) are then decoded into a simplified, generalized equation. Moreover, the simplicity of the equation allows researchers to calculate the compressive strength by hand. Such algorithms help at the predesign stage by giving forecasts close to experimental results [56]. The accuracy can also be checked through the residual errors shown in Figure 12, which presents the frequency distribution of the errors of the GEP model and its regression accuracy.
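The external validation checks discussed above (the data-to-variable ratio, the slopes of the regression lines through the origin, and the corresponding squared correlation through the origin) can be sketched as follows. Naming conventions for k, k′, and Ro² differ between papers, so the definitions in the comments are one self-consistent convention assumed for illustration; the function name and sample values are hypothetical.

```python
def external_validation(ex, mo):
    """Through-origin regression checks between experimental and model values."""
    sum_em = sum(e * m for e, m in zip(ex, mo))
    k = sum_em / sum(e * e for e in ex)        # slope of the fit mo ~ k * ex
    k_prime = sum_em / sum(m * m for m in mo)  # slope of the fit ex ~ k' * mo

    mo_mean = sum(mo) / len(mo)
    # Squared correlation through the origin for the fit mo ~ k * ex.
    residual = sum((m - k * e) ** 2 for e, m in zip(ex, mo))
    total = sum((m - mo_mean) ** 2 for m in mo)
    ro_squared = 1.0 - residual / total
    return k, k_prime, ro_squared


# Ratio of data points to input variables quoted in the text.
samples_per_variable = 357 / 4

# Made-up strengths in MPa, for illustration only.
k, k_prime, ro_squared = external_validation([40.0, 50.0, 60.0],
                                             [42.0, 49.0, 61.0])
```

For a model without systematic bias, both slopes and the through-origin squared correlation come out close to 1, which is the condition the cited criteria ask for.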

Comparison of GEP Model with Other Models. The performance of the GEP model is compared with other models available in the literature [7,60,61]. Al-Shamiri et al. [61] used an extreme learning machine (ELM) and compared its prediction accuracy with that of a backpropagation neural network (BP-NN). The authors reported R² values for the testing set of about 0.9937 and 0.9938 for ELM and BP-NN, respectively [61]. Similarly, Öztaş et al. [7] predicted the compressive strength and slump of HSC using a neural network. The authors reported a strong correlation between input and output for the testing set, about 0.99 for both slump and compressive strength. Baykasoǧlu et al. [60] predicted the parameters of high-strength concrete using machine learning techniques. Regression analysis, gene expression programming, and neural networks were first employed to derive generalized equations. Afterwards, a multiobjective optimization model was built to predict the outcome, and the prediction and optimization results were compared. Singh et al. [62] predicted the compressive strength of HSC using random forest and M5P techniques. The authors achieved a better fit with random forest than with M5P, with R² = 0.876 and 0.814 for the testing set, respectively. The strength of HSC has thus been predicted with different approaches, but none of these methods gives a convenient equation that predicts the strength by hand calculation. Employing the GEP approach gives not only R² = 0.90 but also an equation in terms of the parameters involved.

Conclusion
The machine learning approach provides close agreement between the modelled and experimental data. This will help in the predesign phase by reducing the need for trial-based experimental tests. The following conclusions have been drawn from the use of GEP.
(i) A Python-based analysis in an Anaconda Jupyter Notebook was conducted on the input variables and the compressive strength. This programming technique provides the optimal values of the influential variables, which will help researchers to design their experimental work starting from these optimized values.
(ii) The GEP approach provides a simplified formulation of compressive strength with close agreement between modelled and experimental results. This shows its versatility in handling both linear and nonlinear data.
(iii) The statistical analysis shows good accuracy for the training, testing, and validation sets, with a coefficient of determination of at least 0.9. Moreover, errors such as MAE, RSE, and RRMSE are low while the R-value is high. These contrasting values confirm the accuracy of the modelled data.
(iv) The GEP model was compared with linear and nonlinear analyses and outperforms both. Moreover, the current model was also compared with other published models; the GEP model provides the required equation, which allows prediction with the current parameters by hand calculation.
(v) Permutation feature importance was computed in Python to identify the most influential variables in the model. In other words, PFI shows which parameters influence the compressive strength of HSC.

Data Availability
The data used in the study were collected from different research papers for modelling purposes.

Conflicts of Interest
The authors declare that they have no conflicts of interest.