Application of a Supervised Learning Machine for Accurate Prognostication of Hydrogen Contents of Bio-Oil

This paper deals with modeling the hydrogen content of bio-oil (H-BO) as a function of pyrolysis conditions and the biomass composition of the feedstock. A support vector machine optimized by the grey wolf optimization method was used for this purpose. Comprehensive data were aggregated from previously published sources and reports. The calculated values of R², MRE (%), MSE, and RMSE were 0.973, 1.98, 0.0568, and 0.241, respectively. These results, together with several graphical and statistical analyses, show that the model predicts the measured values with high accuracy. A comparison with previously proposed models also showed that this model is more accurate. The algorithm can be a good alternative to costly and time-consuming laboratory measurements.


Introduction
Consumption of fossil fuel-based energy is increasing because of the growth of several developing economies and the rise in population. This causes a rise in greenhouse gas emissions, a depletion of fossil fuel reserves in several countries, and an increase in market fuel prices [1,2]. Renewable energy resources can be substituted for fossil fuel-based energy to manage these issues and to decrease geographical reliance on fossil fuels [3]. Different sustainable energy resources, such as wind, solar, geothermal, hydro, and biomass energy, are possible alternatives [4,5]. Among these renewable resources, bioenergy (biomass energy) is the most sustainable and promising one, and it could replace conventional fuels in chemical and energy applications. Biomass is mostly derived from plants and includes municipal solid waste, forestry and agricultural residues, sewage sludge, and food waste [6,7]. It can be transformed into liquid, gaseous, and solid products through thermochemical and biochemical conversion processes. Owing to substantial progress in recent years, thermochemical processes now offer comparatively high conversion performance with easy pretreatment and low cost [8,9].
Pyrolysis is a process in which materials are thermally decomposed in the absence of oxygen to produce noncondensable gas, biochar, and bio-oil (BO). The BO, also called liquid product or pyrolysis oil, is a viscous dark brownish fluid commonly composed of 350 highly oxygenated compounds [10,11]. The yield of BO mostly depends on the pyrolysis conditions and the composition of the biomass feedstock [12]. The composition of biomass feedstock is generally described by ultimate and proximate analysis.
The ultimate analysis gives the elemental composition of the biomass, comprising the H, O, C, and N contents, while the proximate analysis gives the ash, volatile matter, fixed carbon, and moisture contents. Different factors, such as heating rate, pyrolysis temperature, residence time, and biomass particle size, can affect the pyrolysis process [13,14]. Several investigations have addressed the effect of feedstock composition and pyrolysis conditions on BO production. For example, Gholizadeh et al. studied the production of BO from twenty different biomass feedstocks and found that the mean BO yield was greater from woody biomass (52 wt.%, wet basis) than from herbaceous biomass (38 wt.%, wet basis) [15]. Sarkar and Wang used slow pyrolysis of waste coconut shells to investigate the effect of temperature on the BO yield and found that the highest BO production (48.7 wt.%) was obtained at 600°C [16]. Hao et al. found that, at 500°C, a combination of Ulva prolifera macroalgae (UPM) and rice straw (RS) gave the maximum amount of BO (46.68 wt.%) [17]. Hanif et al. also investigated the effect of reaction temperature on BO output and found that a moderate temperature of 300-350°C resulted in the highest BO output from algal biomass (48 wt.%) [18]. Traditional methods for determining the BO yield and its relationship to influential parameters, such as pyrolysis conditions and biomass composition, need extensive testing, which is labor-intensive, costly, and time-consuming. Therefore, it is necessary to analyze the behavior of biomass pyrolysis in terms of feedstock composition and process parameters using data mining, machine learning, and deep learning approaches, in order to assess their cumulative influence on the efficiency of BO production.
As a result of advances in artificial intelligence (AI), several advanced methods have been coupled with traditional ones to improve performance on both linear and nonlinear problems [19-23]. Compared with traditional methods, machine learning (ML), a subset of AI, and techniques such as random forest (RF), multilinear regression (MLR), decision tree (DT), and support vector machine (SVM) have shown significant performance in biomass pyrolysis because of their high predictive ability [24,25]. Linear regression analysis is widely used on datasets that exhibit a linear relationship between the input variables and the target. Hussain and Mustafa developed a linear regression model for the production of BO from biomass by fast pyrolysis, correlating retention time, biomass content, and reaction temperature with BO output [26]. According to their findings, the determination coefficients of the different models were in the range of 0.81-0.99. In another study, a linear regression-based methodology was used to investigate the relationship between 20 different biomass feedstock samples and the distribution of BO components [27]. The BO components and the biomass composition were found to be strongly related: woody biomass produced more phenols, straw yielded more ketones, algal biomass produced more fatty acids, and shell biomass yielded more furans. However, linear regression models only consider linear relationships between variables and are ineffective for complex processes that require nonlinear correlations. Furthermore, these linear regression models were typically developed from a restricted number of experimental results and a limited set of influential factors, which reduces their applicability and reliability.
Thus, it is important to perform a comparative examination of various predictive machine learning models. In the current research, a machine learning model that hybridizes the support vector machine with the grey wolf optimizer is used for the BO yield prediction from the composition of the biomass (proximate and ultimate analysis) and the pyrolysis conditions. A wide range of experimental input data and various statistical analyses are used to evaluate the accuracy of this model. A distinctive feature of the proposed model is that its performance is largely independent of outliers.

Experimental Database
A total of 116 experimental output values were gathered to build a forecasting tool for the hydrogen content of bio-oil. Details of this database are available elsewhere [20]. The dataset was randomly divided into two subsets of 82 and 34 points for the training and testing phases, respectively. The training dataset is used to fit the model; the function of the testing dataset, by contrast, is to assess the model's generalization, that is, its ability to predict unseen data.
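The random partition described above can be sketched as follows. This is a minimal illustration in Python and not the authors' procedure: the function name `split_dataset` and the fixed seed are assumptions introduced here, and only the sizes (116 points, 82 for training, 34 for testing) come from the text.

```python
import random


def split_dataset(n_points, n_train, seed=0):
    """Randomly partition the indices 0..n_points-1 into training and testing sets."""
    indices = list(range(n_points))
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Sizes taken from the text: 116 points, 82 for training, 34 for testing.
train_idx, test_idx = split_dataset(n_points=116, n_train=82)
print(len(train_idx), len(test_idx))  # prints: 82 34
```

Keeping the two index sets disjoint is what allows the testing error to be read as an estimate of performance on unseen data.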

Model Statement
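3.1. SVM. The support vector machine (SVM) may be used as a regression method, in which case it is referred to as support vector regression, an approach rooted in statistical learning theory. The main feature of this approach is that, by using a proper kernel (covariance) function, linear regression is carried out after mapping the inputs from a low-dimensional space to a high-dimensional feature space. The input data are described as $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ are the m-dimensional input vector and the scalar output, respectively. The support vector regression function is described as follows [28]:
$$f(x) = \lambda \cdot \varphi(x) + b \tag{1}$$
where $\lambda$ and $b$ denote the weight vector and the bias term of the regression function, and $\varphi(x)$ is the mapping to the high-dimensional feature space. By minimizing the regularized risk function, the problem can be turned into an optimization problem based on the following loss:
$$L_{\varepsilon}\big(f(x), y\big) = \begin{cases} 0, & |f(x) - y| \le \varepsilon \\ |f(x) - y| - \varepsilon, & \text{otherwise} \end{cases} \tag{2}$$
Vapnik (1995) established this equation, which is known as the ε-insensitive loss function [29]. The role of ε is to set the tolerance band of the regression: if the deviation between the predicted and actual values is less than ε, the loss equals 0; otherwise, the loss equals the absolute deviation of the model minus ε. The optimization objective is defined as follows:
$$\min_{\lambda, b, \xi, \xi^{*}} \; \frac{1}{2}\|\lambda\|^{2} + C\sum_{i=1}^{N}\left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - \lambda \cdot \varphi(x_i) - b \le \varepsilon + \xi_i \\ \lambda \cdot \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i, \xi_i^{*} \ge 0 \end{cases} \tag{3}$$
where $C$ is a penalty factor (regularization parameter) and $\xi_i$, $\xi_i^{*}$ are slack variables that account for training deviations larger than ε. This problem can be restated as a dual problem:
$$\max_{\alpha, \alpha^{*}} \; -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})\,\varphi(x_i)\cdot\varphi(x_j) - \varepsilon\sum_{i=1}^{N}(\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{N} y_i(\alpha_i - \alpha_i^{*}) \quad \text{s.t.} \quad \sum_{i=1}^{N}(\alpha_i - \alpha_i^{*}) = 0, \;\; 0 \le \alpha_i, \alpha_i^{*} \le C \tag{4}$$
where $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers. The optimal weight vector of the hyperplane is given by
$$\lambda = \sum_{i=1}^{N}(\alpha_i - \alpha_i^{*})\,\varphi(x_i) \tag{5}$$
The final form of the regression function is as follows:
$$f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha_i^{*})\,K\langle x_i, x\rangle + b \tag{6}$$
where $K\langle x_i, x_j\rangle$ is the kernel function, defined as the scalar product of $\varphi(x_i)$ and $\varphi(x_j)$. The Gaussian radial basis function, which is employed in this work, is a prevalent type of kernel:
$$K\langle x_i, x_j\rangle = \exp\left(-\gamma\,\|x_i - x_j\|^{2}\right) \tag{7}$$
where $\gamma$ is the kernel parameter [30]. It is noteworthy that $C$, $\varepsilon$, and $\gamma$ are the key tunable parameters of the SVM model.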
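As an illustration of the building blocks of support vector regression named in this section, the sketch below implements the Gaussian radial basis kernel, the ε-insensitive loss, and the kernel-expansion form of the regression function in plain Python. The function names, the argument `dual_coefs` (standing for the differences of the Lagrange multipliers), and the numerical values in the usage lines are hypothetical; the code does not train a model and is not the authors' implementation.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Gaussian radial basis kernel: exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def eps_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon band, absolute deviation minus epsilon outside."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion: bias + sum_i dual_coefs[i] * K(sv_i, x).

    dual_coefs[i] plays the role of (alpha_i - alpha_i^*).
    """
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Hypothetical values, for illustration only.
support_vectors = [[0.0, 1.0], [1.0, 0.0]]
dual_coefs = [0.5, -0.25]
prediction = svr_predict([0.0, 1.0], support_vectors, dual_coefs, bias=6.0, gamma=0.5)
loss = eps_insensitive_loss(y_true=6.3, y_pred=prediction, epsilon=0.1)
```

In practice the multipliers and the bias come from solving the dual problem, which a library solver would normally handle; only the penalty factor, ε, and the kernel parameter are left for an outer search such as the grey wolf optimizer.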

3.2. GWO.
Mirjalili proposed the grey wolf optimizer (GWO), a novel metaheuristic method [31]. This approach, a swarm intelligence methodology, is based on the hunting behavior of grey wolves and their naturally occurring social hierarchy. The GWO has been reported to outperform other metaheuristic approaches, for example, Particle Swarm Optimization [32], Ant Colony Optimization [33], and Genetic Algorithm [34]. The GWO algorithm usually comprises four parts: hierarchy, chasing, encircling, and attacking. Grey wolves are gregarious animals at the top of the food chain. In the algorithm, α is considered to be the best solution, β the second-best, and δ the third-best, while ω denotes the remaining candidate solutions. The α, β, and δ wolves steer the optimization, and the other wolves follow them. The encircling behavior is specified as follows:
$$\vec{D} = \left|\vec{C}\cdot\vec{X}_p(t) - \vec{X}(t)\right|, \qquad \vec{X}(t+1) = \vec{X}_p(t) - \vec{A}\cdot\vec{D} \tag{8}$$
where $\vec{X}$ denotes the current position vector of a wolf, $\vec{X}_p$ is the current position of the prey, and $t$ is the iteration number. $\vec{A}$ and $\vec{C}$ are coefficient vectors, calculated as follows:
$$\vec{A} = 2\vec{a}\cdot\vec{r}_1 - \vec{a}, \qquad \vec{C} = 2\,\vec{r}_2 \tag{9}$$
where the components of $\vec{a}$ decrease from 2 to 0 over the iterations, $\vec{r}_1$ and $\vec{r}_2$ are random vectors with values between 0 and 1, and $\vec{A}$ varies randomly between $-a$ and $a$. If $|\vec{A}|$ is less than 1, the wolves attack the prey and move toward its current position. The vector $\vec{C}$ can be viewed as modeling the effect of obstacles around the prey in nature. Its random values assign unpredictable weights to the prey, which helps avoid stagnation in local optima, particularly during the final iterations. Grey wolves are capable of locating and pursuing the prey. The α, β, and δ wolves lead this process at each iteration, and the role of the ω agents is to update their positions according to these three current best positions. This part can be defined as follows:
$$\vec{D}_\alpha = \left|\vec{C}_1\cdot\vec{X}_\alpha - \vec{X}\right|, \quad \vec{D}_\beta = \left|\vec{C}_2\cdot\vec{X}_\beta - \vec{X}\right|, \quad \vec{D}_\delta = \left|\vec{C}_3\cdot\vec{X}_\delta - \vec{X}\right|,$$
$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1\cdot\vec{D}_\alpha, \quad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2\cdot\vec{D}_\beta, \quad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3\cdot\vec{D}_\delta, \qquad \vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3} \tag{10}$$
In conclusion, the algorithm begins with a population of randomly generated grey wolves; the α, β, and δ wolves are identified from the fitness values and indicate the likely location of the prey. During the optimization, $\vec{A}$ and $\vec{C}$ govern the attack and exploration operations. Finally, once the termination criterion is reached, the process stops.
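A compact sketch of the search loop described above is given below, assuming the standard formulation of the grey wolf optimizer: the leaders are the three best solutions found so far, the control parameter decreases linearly from 2 toward 0, and each wolf moves to the average of the three leader-guided positions. The function name, the population size, the iteration count, the seed, and the sphere test function are assumptions chosen for illustration; in the setting of this paper the objective would instead be a prediction error of the SVM as a function of its parameters.

```python
import random


def grey_wolf_optimizer(objective, bounds, n_wolves=20, n_iter=100, seed=0):
    """Minimize `objective` over box `bounds`; returns (best_x, best_f, history)."""
    if n_wolves < 3:
        raise ValueError("at least three wolves are needed for alpha, beta, delta")
    rng = random.Random(seed)
    dim = len(bounds)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    # Best-so-far leaders as (fitness, position): alpha, beta, delta.
    leaders = [(float("inf"), None)] * 3
    history = []
    for t in range(n_iter):
        # Hierarchy: keep the three best solutions seen so far.
        for w in wolves:
            f = objective(w)
            leaders = sorted(leaders + [(f, list(w))], key=lambda p: p[0])[:3]
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 toward 0
        for w in wolves:
            for d in range(dim):
                guided = []
                for _, leader in leaders:
                    A = 2.0 * a * rng.random() - a  # varies randomly in [-a, a]
                    C = 2.0 * rng.random()          # random weight on the leader
                    D = abs(C * leader[d] - w[d])
                    guided.append(leader[d] - A * D)
                lo, hi = bounds[d]
                # Average of the three guided positions, clipped to the bounds.
                w[d] = min(max(sum(guided) / 3.0, lo), hi)
        history.append(leaders[0][0])
    return leaders[0][1], leaders[0][0], history


def sphere(x):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_f, history = grey_wolf_optimizer(sphere, [(-5.0, 5.0)] * 3)
```

Because the leaders are kept as best-so-far solutions, the recorded best fitness never increases from one iteration to the next, which makes `history` a convenient convergence curve.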

Accuracy Evaluation of Dataset
The quality of the data used is one of the key issues in building a forecasting tool; thus, evaluating the reliability of the dataset is vital. For this reason, leverage analysis was conducted. The hat matrix, an important notion in this method, is defined as follows [35]:
$$H = X\left(X^{T}X\right)^{-1}X^{T} \tag{11}$$
where X is an m × n matrix, with m and n denoting the number of data points and the number of model parameters, respectively. The main diagonal of H gives the hat value of each data point. To distinguish outliers from valid points, the Williams plot shows the hat values on the x-axis and the standardized residuals on the y-axis. Figure 1 shows that X suspicious points lie outside the designated valid zone of [−3, 3] for the standardized residuals. The figure also shows the critical leverage value, denoted by H*, which is calculated as follows:
$$H^{*} = \frac{3(n+1)}{m} \tag{12}$$
On the basis of the established zone for outlier detection, the majority of the data points are valid and reliable for the construction of a forecasting tool.
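The leverage calculation can be illustrated with a short sketch. It assumes the usual definitions, namely that the hat values are the diagonal of X(XᵀX)⁻¹Xᵀ and that the critical leverage is 3(n + 1)/m; the helper names and the small design matrix (an intercept column plus one input, with one deliberately extreme point) are hypothetical and are not the data of this study.

```python
def solve_linear(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def hat_values(X):
    """Diagonal of H = X (X^T X)^-1 X^T for an m x n design matrix X."""
    n = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    # h_ii = x_i^T (X^T X)^-1 x_i, obtained by solving (X^T X) z = x_i.
    return [sum(xi * zi for xi, zi in zip(row, solve_linear(xtx, row))) for row in X]


def critical_leverage(m, n):
    """Critical leverage H* = 3 (n + 1) / m."""
    return 3.0 * (n + 1) / m


# Hypothetical design matrix: 11 regular points and one extreme point.
X = [[1.0, i / 10.0] for i in range(11)] + [[1.0, 10.0]]
hats = hat_values(X)
h_star = critical_leverage(m=len(X), n=len(X[0]))
high_leverage = [i for i, h in enumerate(hats) if h > h_star]
```

In a Williams plot each point would be drawn at its hat value and standardized residual; points to the right of H* or outside the [−3, 3] residual band are the ones treated as suspicious.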

Sensitivity Analysis
Sensitivity analysis was performed on the input data to determine the effect of each input on the target parameter. More details about this method are given elsewhere [36,37]. Figure 2 shows the results of this analysis for the proposed model. Accordingly, H has the strongest positive effect on the target parameter and O the strongest negative effect, with relevancy factors of +0.73 and −0.63, respectively.
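The paper defers the definition of the relevancy factor to the cited references. A commonly used definition is the Pearson correlation coefficient between one input variable and the output, and the sketch below assumes that definition; the function name and the sample values are hypothetical.

```python
import math


def relevancy_factor(inputs, outputs):
    """Pearson correlation between one input variable and the model output."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    var_x = sum((x - mean_x) ** 2 for x in inputs)
    var_y = sum((y - mean_y) ** 2 for y in outputs)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: an input that rises with the output and one that falls.
output = [6.0, 6.5, 7.1, 7.4]
r_rising = relevancy_factor([5.0, 5.4, 6.1, 6.3], output)
r_falling = relevancy_factor([45.0, 43.0, 40.5, 39.0], output)
```

Under this definition the factor lies between −1 and +1; its sign gives the direction of the effect and its magnitude the strength, which is how values such as +0.73 and −0.63 would be read.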

Parameters of Model Evaluation
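To assess the agreement between the estimated and actual output values, the following statistical parameters are used:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i^{\mathrm{exp}} - y_i^{\mathrm{pred}}\right)^{2} \tag{13}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i^{\mathrm{exp}} - y_i^{\mathrm{pred}}\right)^{2}} \tag{14}$$
$$\mathrm{MRE}\,(\%) = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_i^{\mathrm{pred}} - y_i^{\mathrm{exp}}}{y_i^{\mathrm{exp}}}\right| \tag{15}$$
$$R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(y_i^{\mathrm{exp}} - y_i^{\mathrm{pred}}\right)^{2}}{\sum_{i=1}^{N}\left(y_i^{\mathrm{exp}} - \bar{y}^{\mathrm{exp}}\right)^{2}} \tag{16}$$
where $N$ is the number of data points, $y_i^{\mathrm{exp}}$ and $y_i^{\mathrm{pred}}$ are the actual and estimated output values, and $\bar{y}^{\mathrm{exp}}$ is the mean of the actual values.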
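The four statistical parameters used in this paper (MSE, RMSE, MRE, and R²) can be computed as in the sketch below, which assumes their commonly used definitions. The function name and the four sample values are hypothetical and serve only to show the arithmetic.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MRE (in percent), and R2 for paired values."""
    n = len(actual)
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mse = sq_err / n
    rmse = math.sqrt(mse)
    mre = 100.0 / n * sum(abs((p - a) / a) for a, p in zip(actual, predicted))
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - sq_err / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MRE": mre, "R2": r2}


# Hypothetical values, not taken from the study's database.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

For these sample values the squared errors sum to 0.1, giving MSE = 0.025 and R² = 0.98. Note that MRE divides by the actual values, so it is only meaningful when none of them is zero.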

Modeling Results
The support vector machine was trained on the training data, with its parameters tuned by grey wolf optimization. Once the optimum SVM structure is determined, assessing the performance of the forecasting tool is crucial. To that end, Figure 3 gives a visual comparison between the predicted and actual output values for the training and testing datasets. Plotting model outputs and measured data together is a common tool for model evaluation. As this figure demonstrates, the predicted and actual target values overlap closely. The proximity of the predicted output values to the actual ones supports the model's correctness.
In Figure 4, the cross-plot of actual versus predicted output values is shown for both the training and testing stages. The closer the points lie to the bisector line, the more accurate the model. The line fitted through these points also aids the judgment. As demonstrated, there is a high degree of agreement between the predicted and actual values, with R² values of 0.9722 and 0.977 for the training and testing stages, respectively. These values indicate how well the fitted line describes the data, that is, the correlation between the predicted and actual output values. The findings in both the training and testing stages show that this model is suitable for predicting the hydrogen content of bio-oil. Furthermore, Figure 5 depicts the relative deviation of the predicted values from the actual ones. For the model output, these deviations are within 10%. This low deviation further supports the accuracy of the proposed model.
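The relative deviation check mentioned above can be sketched as follows. The sign convention (predicted minus actual, divided by actual) and the sample values are assumptions made for illustration; only the 10% band comes from the text.

```python
def relative_deviation_percent(actual, predicted):
    """Signed relative deviation of each prediction from its actual value, in %."""
    return [100.0 * (p - a) / a for a, p in zip(actual, predicted)]


# Hypothetical values, not taken from the study's database.
deviations = relative_deviation_percent([10.0, 20.0], [10.5, 19.0])
within_band = all(abs(d) <= 10.0 for d in deviations)
```

Unlike MRE, which averages absolute values, the signed deviations show whether the model tends to overpredict or underpredict in particular regions of the data.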
A statistical comparison is helpful after the visual comparison. Table 1 provides a concise overview of the previously discussed parameters (MSE, RMSE, MRE, and R², stated in equations (13)-(16)). A previous study [20] concluded that its models were able to predict the target parameter with R² and RMSE equal to 0.352 and 1.41 and to 0.84 and 0.56, respectively.
Together with the training assessment, the model's effectiveness in predicting unseen output values must be investigated. The findings obtained during the testing stage show that GWO-SVM generalizes well to target values that were not used in training.

Conclusion
Because biodiesel is a clean fuel for producing energy, the need for studies on biodiesel qualities is evident to researchers working on this subject. A novel prediction technique based on GWO-SVM was developed in this study to estimate the hydrogen content of bio-oil as a function of pyrolysis conditions and the biomass composition of the feedstock. As previously stated, a distinctive feature of this model is that its performance is largely independent of outliers. To check the reliability of the databank, the leverage methodology was applied to the output data points for the first time in the literature, and this investigation confirmed the reliability of the databank. Comparing the model outputs with the 116 experimental target values yielded R² = 0.973, MRE = 1.98%, MSE = 5.68E-02, and RMSE = 2.41E-01, as well as good visual agreement between the experimental values and the GWO-SVM outputs. This analysis shows that the SVM-based model is an accurate forecasting tool for the target values over the operating conditions covered by the data. Furthermore, the effects of the various input parameters on the output were determined. The model and the sensitivity analysis results might be useful for scientists working on biodiesel and environmentally friendly production challenges. In generating clean fuels, the studied tools are useful for simulating various processes. As a result, they may help address global warming issues.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.