On the Evaluation of Rhamnolipid Biosurfactant Adsorption Performance on Amberlite XAD-2 Using Machine Learning Techniques

Biosurfactants are organic compounds composed of two parts, hydrophobic and hydrophilic. Because of properties such as low toxicity and biodegradability, they are widely used in the food industry. Important applications include health products, oil recycling, and biological refining. In this research, the least-squares support vector machine (LSSVM) algorithm is used to calculate the breakthrough curves of rhamnolipid adsorption on Amberlite XAD-2. The model is built from 204 adsorption data points. Various graphical and statistical approaches are applied to verify the model output. The findings of this study are compared with studies that used artificial neural network (ANN) and group method of data handling (GMDH) models. The model used in this study has a lower average absolute deviation percentage (AAD%) than the ANN and GMDH models, estimated at 1.71%. The LSSVM is very valuable for investigating the breakthrough curve of rhamnolipid, and it can also help chemists working on biosurfactants. Moreover, our graphical interface program lets users easily determine the breakthrough curves of rhamnolipid adsorption on Amberlite XAD-2.


Introduction
As mentioned above, biosurfactants are organic compounds that are produced by microorganisms and consist of two parts: hydrophilic and hydrophobic. They are often produced by bacteria on living surfaces. Their amphiphilic properties are one of the reasons they have attracted many industrial applications. Among the outstanding capabilities of biosurfactants used in industries such as mining, fertilizers, petrochemicals, and petroleum are environmental degradability, reduction of surface and interfacial tension, and low toxicity. The reduction of the interfacial tension is due to the increase in the solubility of hydrophilic molecules when using biosurfactants. Capabilities such as surface modification and interfacial tension reduction have made surfactants attractive to industry. Rhamnolipids (RLs) are the most studied type of biosurfactant. According to the literature, rhamnolipids can reduce water surface tension by about 60% [1][2][3] at RL concentrations of 50-65 mg/L. The production of RLs usually yields the final product in a dilute solution contaminated with undesirable impurities. There are several ways to increase the concentration and eliminate contaminants, among which the adsorption process is widely studied.
In this research, activated carbon is used as the adsorbent in the process. The breakthrough curve of a packed column is a very significant attribute of this system. As a result, determining such curves is useful for optimizing and understanding the performance of the column. To model the adsorption phenomena, the mass balances in the liquid and solid phases are evaluated. The model may also include the pore and liquid-film resistances as well as axial dispersion. Finally, the resulting set of differential equations can be solved with a suitable software package.
Ill-conditioning and uncertainty in the differential equations make conventional mathematical models unsuitable. Intelligent models have proven to be a powerful tool for solving process modeling problems. To predict the targets in various fields, such as petroleum and gas, methods such as SVM, ANN, group method of data handling (GMDH), fuzzy logic systems, and adaptive neuro-fuzzy inference systems can be used. In an ANN, interactions between artificial neurons are achieved by connecting different units. Each weighted output is related to the sum of the outputs from the previous synaptic weight layer, and it is then used as an input for a specific neuron. Backpropagation ANNs are extensively applied, as they have been shown to be a capable and powerful tool [4]. The GMDH model is a type of ANN that was proposed by Ivakhnenko [5]. Darwin's theory of selection inspired this approach. The prominent feature of this method is the internal process of the elements [6][7][8].
To process elements in a conventional ANN, log sigmoid, hard limit, linear, and tangent sigmoid transfer functions are considered. The GMDH method, on the other hand, first constructs simple polynomials that roughly predict the targeted system. In the next step, the complexity of the polynomials is increased until satisfactory models are achieved [9,10].
Because a trustworthy estimation of breakthrough curves is important, this research is aimed at predicting breakthrough curves with the LSSVM method for rhamnolipid (Figure 1) adsorption over Amberlite XAD-2. Furthermore, the results are compared with those of ANN and GMDH models. The investigated model uses 204 data points for adsorption over Amberlite XAD-2. Various graphical and statistical methods are considered to evaluate the accuracy of this strategy.

Model Development
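In the present research, the LSSVM strategy was applied to calculate the breakthrough curves of rhamnolipid adsorption on Amberlite XAD-2 in a simplified way [11,12]. The SVM can be defined as the following function:

$$ f(x) = w^{T}\varphi(x) + b \tag{1} $$

The parameters of the above expression are as follows: $w^{T}$ denotes the transpose of the weight vector corresponding to the output layer, $b$ is the bias, and $\varphi(x)$ is the nonlinear mapping that defines the kernel function. The input $x$ has dimension $N \times n$, in which $n$ and $N$ are the numbers of input parameters and data points, respectively. The following cost function is optimized to evaluate $w^{T}$ and $b$ [13]:

$$ \text{Cost function} = \frac{1}{2} w^{T} w + c \sum_{k=1}^{N} \left( \xi_k + \xi_k^{*} \right) \tag{2} $$

which is constrained by

$$ y_k - w^{T}\varphi(x_k) - b \le \varepsilon + \xi_k, \qquad w^{T}\varphi(x_k) + b - y_k \le \varepsilon + \xi_k^{*}, \qquad \xi_k,\ \xi_k^{*} \ge 0, \qquad k = 1, \dots, N \tag{3} $$

Here, $y_k$ is the $k$th output, $x_k$ is the $k$th input, $c$ is a positive regularization constant, and $\varepsilon$ stands for the fixed precision of the estimation. The slack variables ($\xi_k$, $\xi_k^{*}$) determine the acceptable error margin. The following Lagrangian, written in its dual form, is applied to minimize the cost function:

$$ L(a, a^{*}) = -\frac{1}{2} \sum_{k,l=1}^{N} (a_k - a_k^{*})(a_l - a_l^{*}) K(x_k, x_l) - \varepsilon \sum_{k=1}^{N} (a_k + a_k^{*}) + \sum_{k=1}^{N} y_k (a_k - a_k^{*}) \tag{4} $$

subject to $\sum_{k=1}^{N} (a_k - a_k^{*}) = 0$ and $a_k, a_k^{*} \in [0, c]$, where $a_k$ and $a_k^{*}$ stand for the Lagrangian multipliers and $K(x_k, x_l) = \varphi(x_k)^{T}\varphi(x_l)$ is the kernel function. In the last step, the SVM is given by

$$ f(x) = \sum_{k=1}^{N} (a_k - a_k^{*}) K(x, x_k) + b \tag{5} $$

A quadratic programming problem must be solved to determine the SVM parameters. The LSSVM avoids the difficulties of solving this quadratic programming problem by replacing the inequality constraints with equality constraints, so that the parameters follow from a set of linear equations [11,12]. The LSSVM uses the following cost function in the process of model development:

$$ \text{Cost function} = \frac{1}{2} w^{T} w + \frac{1}{2} \gamma \sum_{k=1}^{N} e_k^{2} \tag{6} $$

where $\gamma$ denotes the tuning parameter and $e_k$ is the error variable. The following constraint is applied to the cost function:

$$ y_k = w^{T}\varphi(x_k) + b + e_k, \qquad k = 1, \dots, N \tag{7} $$

The Lagrangian of the LSSVM is expressed as

$$ L(w, b, e, a) = \frac{1}{2} w^{T} w + \frac{1}{2} \gamma \sum_{k=1}^{N} e_k^{2} - \sum_{k=1}^{N} a_k \left( w^{T}\varphi(x_k) + b + e_k - y_k \right) \tag{8} $$

In the above expression, $a_k$ represents the Lagrangian multipliers. To optimize Eq. (8), its derivatives are set to zero, and as a result, the following equations are obtained:

$$ \frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{k=1}^{N} a_k \varphi(x_k), \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{k=1}^{N} a_k = 0, \qquad \frac{\partial L}{\partial e_k} = 0 \Rightarrow a_k = \gamma e_k, \qquad \frac{\partial L}{\partial a_k} = 0 \Rightarrow w^{T}\varphi(x_k) + b + e_k - y_k = 0 \tag{9} $$

By solving the aforementioned equations, the LSSVM parameters are obtained. The LSSVM employs the kernel function in the same way that the SVM strategy does. The most commonly applied kernel function is the radial basis function (RBF), which is given by

$$ K(x, x_k) = \exp\left( -\frac{\lVert x_k - x \rVert^{2}}{\sigma^{2}} \right) \tag{10} $$

where $\sigma^{2}$ stands for the tuning parameter of the kernel function. As a result, two tuning parameters ($\sigma^{2}$ and $\gamma$) are adjustable. These parameters can be determined by minimizing the error between the predicted and experimental values, measured by the mean square error (MSE):

$$ \text{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( y_{\text{pred},k} - y_{\text{exp},k} \right)^{2} \tag{11} $$

where $y$ is the output value, and the subscripts exp and pred denote experimental and predicted values, respectively. In this paper, the particle swarm optimization (PSO) algorithm was used to determine these tuning parameters. A typical diagram of the proposed LSSVM approach is shown in Figure 2.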
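The LSSVM training step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that eliminating w and the error variables from the zero-derivative conditions leaves one linear system in the bias b and the multipliers a, and it uses a synthetic S-shaped curve in place of the adsorption data. The function names, the toy data, and the values of gamma and sigma2 are all hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """RBF kernel matrix with K[i, j] = exp(-||A[i] - B[j]||^2 / sigma2)."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve the LSSVM linear system for the bias b and multipliers a."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma2)
    # Block system: [[0, 1^T], [1, K + I/gamma]] [b, a]^T = [0, y]^T
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]

def lssvm_predict(X_train, b, a, sigma2, X_new):
    """Evaluate f(x) = sum_k a_k K(x, x_k) + b at the rows of X_new."""
    return rbf_kernel(X_new, X_train, sigma2) @ a + b

# Toy data (assumption): a smooth S-shaped curve standing in for C/C0 vs. time.
X_demo = np.linspace(-1.0, 1.0, 30)[:, None]
y_demo = 1.0 / (1.0 + np.exp(-4.0 * X_demo[:, 0]))

b, a = lssvm_fit(X_demo, y_demo, gamma=1000.0, sigma2=0.5)
y_fit = lssvm_predict(X_demo, b, a, 0.5, X_demo)
print("max training error:", np.max(np.abs(y_fit - y_demo)))
```

Because training reduces to a single call to a linear solver, no quadratic programming routine is needed, which is the practical advantage of the LSSVM over the standard SVM.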
The adjustable parameters of the LSSVM model are γ and σ². They are determined by minimizing the cost function (Eq. (11)) with an optimization technique. The values of γ and σ² obtained in this study are 984523.52 and 0.246, respectively, using the PSO algorithm with a swarm size of 80 and 1000 iterations.
Different statistical error measures, such as the mean absolute error (MAE), coefficient of determination (R²), and root mean square error (RMSE), are employed to analyze the model's performance.
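A basic particle swarm optimizer of the kind used for this tuning step can be sketched as follows. This is a generic global-best PSO written for illustration, not the authors' code: the inertia and acceleration coefficients, the box bounds, and the quadratic stand-in objective are assumptions. In the setting of this paper, the objective would instead return the MSE of an LSSVM model trained with a candidate pair (γ, σ²).

```python
import numpy as np

def pso_minimize(objective, lower, upper, swarm_size=30, n_iter=200,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective(p) over a box with a basic global-best PSO."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    # Random initial positions inside the box, zero initial velocities.
    pos = rng.uniform(lower, upper, size=(swarm_size, dim))
    vel = np.zeros_like(pos)
    pbest_pos = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    g = int(np.argmin(pbest_val))
    gbest_pos, gbest_val = pbest_pos[g].copy(), pbest_val[g]

    for _ in range(n_iter):
        r1 = rng.random((swarm_size, dim))
        r2 = rng.random((swarm_size, dim))
        # Velocity update: inertia + pull toward personal and global bests.
        vel = (inertia * vel
               + c1 * r1 * (pbest_pos - pos)
               + c2 * r2 * (gbest_pos - pos))
        pos = np.clip(pos + vel, lower, upper)

        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest_pos[improved] = pos[improved]
        pbest_val[improved] = vals[improved]

        g = int(np.argmin(pbest_val))
        if pbest_val[g] < gbest_val:
            gbest_pos, gbest_val = pbest_pos[g].copy(), pbest_val[g]

    return gbest_pos, gbest_val

# Stand-in objective (assumption): a quadratic bowl with a known minimum.
def toy_objective(p):
    return (p[0] - 3.0) ** 2 + (p[1] - 0.25) ** 2

best_pos, best_val = pso_minimize(toy_objective, lower=[0.0, 0.01], upper=[6.0, 2.0])
print("best position:", best_pos, "best value:", best_val)
```

Since γ can span several orders of magnitude, searching over log10(γ) rather than γ itself is a common choice; whether the authors did so is not stated.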
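The three error measures named above can be computed as in the following sketch. It uses the usual textbook definitions, which are assumed here because the exact formulas are not reproduced in the text, and the five-point data set is invented purely for illustration.

```python
import numpy as np

def regression_metrics(y_exp, y_pred):
    """Return MAE, RMSE, and R^2 for experimental vs. predicted values."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_pred - y_exp
    mae = np.mean(np.abs(resid))
    rmse = np.sqrt(np.mean(resid**2))
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - np.sum(resid**2) / np.sum((y_exp - y_exp.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Invented example values of C/C0 (not data from the study).
metrics = regression_metrics(
    y_exp=[0.0, 0.2, 0.5, 0.8, 1.0],
    y_pred=[0.1, 0.2, 0.4, 0.8, 1.0],
)
print(metrics)
```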

Identification of Outlying Experimental Data
An outlier is a data point, or a set of data points, that behaves differently from the bulk of the data. Finding outliers can remarkably improve the accuracy and reliability of a proposed model.
Outliers can be traced by numerical and graphical procedures. One of the most powerful methods is the Leverage method, in which the deviation of the estimated values from the experimental ones is calculated. It also involves the Hat matrix, which is built from the experimental and predicted data. The equation below is used for calculating the Hat indices [14][15][16]:

$$ H = X \left( X^{T} X \right)^{-1} X^{T} \tag{13} $$

where $X$ is a $p \times f$ matrix.

BioMed Research International
Here, p and f stand for the number of data points and the number of model parameters, respectively.
A reliable model would contain the majority of the predicted values within the following limits:

$$ 0 \le H \le H^{*}, \qquad -3 \le R \le 3 $$

where $R$ is the standardized residual and $H^{*}$ is the warning leverage, conventionally taken as $3(f+1)/p$. Regardless of the value of H, if the value of R for a given data point is outside the above range, the point is considered a possible outlier. The data used in this paper are provided in Table S1 and are taken from a previous paper [17]. As discussed, the input parameters are initial rhamnolipid concentration, fixed bed height, flow velocity, and run time, while the ratio of final to initial rhamnolipid concentration (C/C_0) is the output parameter. The 204 data points are divided into two sets: training and testing.
To create the LSSVM model, 75% of the data points are used as training points, and the rest are used to examine the efficiency of the proposed model. Furthermore, the data are normalized to the range [-1, 1] by applying the equation below:

$$ D_N = 2\,\frac{D - D_{\min}}{D_{\max} - D_{\min}} - 1 $$

Here, $D$ and $D_N$ represent the actual and normalized data points, respectively, and $D_{\min}$ and $D_{\max}$ stand for the minimum and maximum values of the data points, respectively.
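The data preparation described above can be sketched as follows. The sketch assumes a random 75/25 split and column-wise scaling, since the text does not say how the points were assigned to each set. The input matrix is synthetic: the ranges for concentration, bed height, and flow velocity follow the operating conditions quoted later in the paper, while the run-time range is purely hypothetical.

```python
import numpy as np

def normalize(D, D_min, D_max):
    """Map values from [D_min, D_max] to [-1, 1]."""
    return 2.0 * (D - D_min) / (D_max - D_min) - 1.0

def denormalize(D_N, D_min, D_max):
    """Inverse of normalize: map values from [-1, 1] back to [D_min, D_max]."""
    return (D_N + 1.0) * (D_max - D_min) / 2.0 + D_min

def train_test_split(X, y, train_fraction=0.75, seed=0):
    """Randomly split the rows of X and y (the random split is an assumption)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train, test = idx[:n_train], idx[n_train:]
    return X[train], X[test], y[train], y[test]

# Synthetic stand-in for the 204 points: columns C0, H0, U, run time.
rng = np.random.default_rng(42)
X_raw = np.column_stack([
    rng.uniform(8.0, 24.0, 204),    # initial concentration C0
    rng.uniform(7.0, 11.0, 204),    # fixed bed height H0
    rng.uniform(80.0, 240.0, 204),  # flow velocity U
    rng.uniform(0.0, 100.0, 204),   # run time (hypothetical range)
])
y_raw = rng.uniform(0.0, 1.0, 204)  # placeholder for C/C0

X_min, X_max = X_raw.min(axis=0), X_raw.max(axis=0)
X_norm = normalize(X_raw, X_min, X_max)
X_train, X_test, y_train, y_test = train_test_split(X_norm, y_raw)
print(X_train.shape, X_test.shape)
```

In practice the minimum and maximum are often taken from the training set only, so that no information from the test points enters the scaling; the source does not state which convention was used.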
3.1. Evaluation of the Model's Accuracy
The predictive model's accuracy is investigated using different graphical and statistical methods. Figure 3 presents the experimental data points and the estimates of the proposed LSSVM model in the training and testing stages. Figure 4 shows the predicted values against the experimental ones. The closer the points lie to the line Y = X, the better the prediction of the proposed model.
The ANN model based on these four input variables consists of (1) an input layer, (2) a hidden layer including six neurons, and (3) an output layer. Figure 5 presents the cross plot of the aforementioned strategies. The data points of the LSSVM model are closer to the line Y = X than those of the ANN and GMDH models. The calculated coefficient of determination also shows that the proposed LSSVM approach is superior to ANN and GMDH in terms of accuracy.
Compared to the ANN and GMDH models, the proposed LSSVM model shows a lower relative error. Figure 6 indicates the higher reliability of the suggested LSSVM model.
Estimation accuracy is also investigated using the coefficient of determination (R²), standard deviation (STD), average absolute deviation (AAD), and root mean square error (RMSE). Table 1 presents the statistical values of the presented model compared with the ANN and GMDH approaches. The LSSVM model shows a higher value of R² and lower values of STD, AAD, and RMSE, and as a result, it possesses higher accuracy and reliability than the others. The dependency of the output parameter (C/C_0) on the input parameters is illustrated in Figure 7.
Four different conditions (H_0 = 7, U = 160, and C_0 = 24; H_0 = 11, U = 160, and C_0 = 8; H_0 = 7, U = 240, and C_0 = 24; and H_0 = 11, U = 80, and C_0 = 8) were investigated to compare the prediction ability of the LSSVM, ANN, and GMDH models. The comparison indicates that the LSSVM model gives the better estimation. As the figure shows, the ratio C/C_0 increases with time.
In the last part of this research, the leverage approach is applied to find outliers, employing the Hat matrix, Williams plot, and residuals. As discussed, Eq. (13) is used to calculate the H values. Figure 8 illustrates the Williams plot. Nearly all of the R values are in the range [-3, +3] and the H values are in the range [0, 0.08], so the accuracy of the proposed model is satisfactory. Only two of the data points are outside the applicability domain; they are marked in the figure by a blue circle. As the R values approach zero and the H values decrease, the reliability of the data points increases [18][19][20][21][22][23].
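The leverage screening discussed in this section can be sketched as follows. The code computes the diagonal of the Hat matrix and standardized residuals, then flags points outside the usual Williams-plot limits. Several choices are assumptions rather than details taken from the paper: the warning leverage 3(f+1)/p, the residual standardization by the sample standard deviation, and the synthetic inputs and predictions.

```python
import numpy as np

def hat_diagonal(X):
    """Diagonal of the Hat matrix H = X (X^T X)^{-1} X^T."""
    XtX_inv = np.linalg.pinv(X.T @ X)
    # h_i = x_i^T (X^T X)^{-1} x_i for every row x_i of X
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

def standardized_residuals(y_exp, y_pred):
    """Residuals scaled by their sample standard deviation (assumed definition)."""
    resid = np.asarray(y_exp, dtype=float) - np.asarray(y_pred, dtype=float)
    return resid / resid.std(ddof=1)

def williams_screen(X, y_exp, y_pred, r_limit=3.0):
    """Return hat values, standardized residuals, warning leverage, outlier flags."""
    p, f = X.shape                     # p data points, f model parameters
    H = hat_diagonal(X)
    R = standardized_residuals(y_exp, y_pred)
    h_star = 3.0 * (f + 1) / p         # conventional warning leverage
    inside = (H <= h_star) & (np.abs(R) <= r_limit)
    return H, R, h_star, ~inside

# Synthetic stand-in data: 204 points, 4 normalized inputs.
rng = np.random.default_rng(1)
X_demo = rng.uniform(-1.0, 1.0, size=(204, 4))
y_exp_demo = rng.uniform(0.0, 1.0, size=204)
y_pred_demo = y_exp_demo + rng.normal(0.0, 0.01, size=204)

H, R, h_star, is_outlier = williams_screen(X_demo, y_exp_demo, y_pred_demo)
print("warning leverage:", h_star, "flagged points:", int(is_outlier.sum()))
```

With p = 204 and f = 4, this convention gives a warning leverage of about 0.074, which is of the same order as the upper H limit of 0.08 quoted above. Whether the authors used exactly this threshold is not stated.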

Conclusion
The LSSVM approach was employed to estimate the breakthrough curves of rhamnolipid adsorption over Amberlite XAD-2 as a function of fixed bed height, flow velocity, run time, and initial rhamnolipid concentration. The particle swarm optimization method was employed in the training process to enhance the accuracy of the proposed model. Various statistical and graphical methods were applied to evaluate the model's reliability, showing that the AAD% value for adsorption over activated carbon was 0.75%. For the ANN and GMDH models that were developed by Padilha et al., the AAD% for activated carbon is reported to be 1.9% and 6.2%, respectively. Based on the above evidence, the proposed LSSVM model is more reliable for predicting the breakthrough curves.