The Start of Combustion Prediction for Methane-Fueled HCCI Engines: Traditional vs. Machine Learning Methods

In this work, 11 regression models based on machine learning techniques were employed to provide a fast-response and accurate model for predicting the start of combustion in homogeneous charge compression ignition engines fueled with methane. These regression models are categorized into linear and nonlinear types. Although the robust random sample consensus (RANSAC) model is of the nonlinear type, like the simple algebraic model (SAM), it raises the prediction accuracy from 89.3% to 98.4%. The same accuracy is also achieved by the linear models, namely, the ordinary least squares, ridge, and Bayesian ridge models. Owing to their linear hypothesis (the correlation used for the start of combustion prediction), the presented models have a response time that is acceptable for real-time control applications such as engine electronic control units.


Introduction
Using environmentally friendlier devices is one of the most important approaches toward a greener atmosphere in the coming decades. Homogeneous charge compression ignition (HCCI) engines are considered more efficient and cleaner than both conventional engine types, spark ignition (SI) and compression ignition (CI) [1]. These engines release less NOx due to the low temperature combustion (LTC) process they employ and consume less fuel because they are designed to run in lean-burn mode [2]. Furthermore, carbon-based emissions such as CO2, CO, and HC may be decreased or even eliminated thanks to the ability of these engines to run on light fuels such as methane or hydrogen [3,4]. Consequently, the target of a cleaner environment can be pursued by extending the applications of this type of engine in the automotive industry [5,6], alongside other approaches such as reactivity controlled compression ignition (RCCI) [7,8], catalytic converter development [9,10], and engine downsizing [11,12].
One of the most challenging issues for the industrial application of HCCI engines is combustion stability, especially at high-speed operating conditions, which must be controlled through the start of combustion (SOC) timing [13]. There is no direct control actuator for the SOC, such as the spark plug in SI engines or the injector (injection timing) in CI engines, and the SOC is affected by the engine inlet parameters such as the pressure and temperature of the inlet charge, the equivalence ratio, and the engine speed [14]. Consequently, accurate real-time prediction of SOC may extend the usage of these engines in mobile applications, as SOC is considered one of the key parameters in combustion stability [15,16].
A wide range of SOC predictors has been presented in the literature. To study the effects of key parameters on the SOC at a fundamental level, three-dimensional computational fluid dynamics (3D-CFD) models are used [17,18]. These models offer high accuracy because they employ detailed chemical kinetics mechanisms, but they need a long run time. The 0D thermodynamic models are considered the second type of SOC predictor and have a much better run time than 3D-CFD models [19,20]. Although the run time is noticeably reduced by the 1D thermodynamic models, the search for faster predictors led researchers to a third type, the semiempirical models [21,22]. These models generally use engine performance data and provide an algebraic correlation to predict the SOC. Early efforts produced models that require unmeasurable parameters [23,24]; these were later developed into models with measurable inputs [25,26]. Although the accuracy of these models is acceptable for control applications, their run time still needs to be improved for real-time applications.
The flourishing of machine learning (ML) techniques is a promising development for enhancing the performance of SOC predictors. These techniques have recently been used by researchers to develop control models in the engine industry [27-29]. Using ML, it is possible to build numerical models that predict the SOC timing from an engine performance dataset. These models may have better run times than the modified knock integral model (MKIM) [25] and the simple algebraic model (SAM) [26] if the proposed correlations are less nonlinear.
In this work, several SOC prediction models are presented using different ML regression techniques and a dataset that consists of the performance and operating conditions of 3 different HCCI engines fueled with methane, adopted from a thermodynamic model. Their performance is compared with that of the simple algebraic model (SAM). The main novelty of this work is a simple linear model with higher accuracy, obtained thanks to the ML techniques. Owing to its linearity, the model is sufficient for control applications and for use in the electronic control unit (ECU).

Model Description
In general, three different types of numerical models are employed in this research, namely, SAM, virtual engine, and ML regressor models. In this study, 6 linear and 5 nonlinear supervised type regressor models were used. In this section, these models are described in detail.

SAM.
The SOC timing of methane-fueled HCCI engines is predicted via a simple algebraic correlation of the engine inlet parameters [26], in which N is the engine rotational speed, ϕ is the equivalence ratio, EGR is the mass ratio of recirculated exhaust gas to inlet charge, and P_IVC and T_IVC are the inlet charge pressure and temperature at the inlet valve closing (IVC) time, respectively. The constants A, B, D, E, and F relate to the fuel used, and C_1 and C_2 depend on the real compression ratio (CR).

Virtual Engine.
Each control-oriented model uses an experimental dataset to provide its own semiempirical correlations. In this research, the results of a stand-alone thermodynamic model of the engine closed cycle [3,5,26] are employed instead of experimental data. This model includes a detailed chemical kinetics mechanism of methane oxidation, GRI 3.0 [30], which consists of 325 reactions and 53 species. The validity of the employed model has been evaluated for several engines and different operating conditions, and SOC is defined as the crank angle at which 5% of the fuel is consumed.
The main strength of this model is SOC detection with acceptable accuracy; therefore, it is sufficient to be used as the virtual engine for this study. The virtual engine is run for a wide range of engine inlet parameters, namely, engine speed, equivalence ratio, EGR, charge pressure and temperature at IVC, relative humidity (RH), and real CR.
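The SOC definition used above (the crank angle at which 5% of the fuel is consumed) can be illustrated with a short sketch. This is not the virtual engine's own post-processing, which the source does not show. The function name `soc_crank_angle`, the use of linear interpolation between samples, and the sign convention (negative crank angles before TDC) are all assumptions made for illustration.

```python
import numpy as np


def soc_crank_angle(crank_angle_deg, fuel_mass, consumed_fraction=0.05):
    """Crank angle at which `consumed_fraction` of the initial fuel is consumed.

    Hypothetical helper: interpolates linearly between the two samples that
    bracket the threshold. Returns None if the threshold is never reached.
    """
    crank_angle_deg = np.asarray(crank_angle_deg, dtype=float)
    fuel_mass = np.asarray(fuel_mass, dtype=float)

    # Fraction of the initial fuel mass consumed at each sample.
    consumed = 1.0 - fuel_mass / fuel_mass[0]

    reached = np.nonzero(consumed >= consumed_fraction)[0]
    if reached.size == 0:
        return None  # no ignition detected in this trace
    i = reached[0]
    if i == 0:
        return float(crank_angle_deg[0])

    # Linear interpolation between sample i-1 (below) and i (at or above).
    x0, x1 = crank_angle_deg[i - 1], crank_angle_deg[i]
    y0, y1 = consumed[i - 1], consumed[i]
    return float(x0 + (consumed_fraction - y0) * (x1 - x0) / (y1 - y0))


# Toy trace (assumed convention: negative angles are before TDC).
angles = [-20.0, -10.0, 0.0, 10.0]
fuel = [1.0, 1.0, 0.9, 0.0]
print(soc_crank_angle(angles, fuel))
```

With a real closed-cycle simulation, the fuel mass trace would come from the chemical kinetics solver output, sampled finely enough that the interpolation error is small compared with the 2 CAD tolerance used later.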

ML Regressor Model.
ML regressor models generally use a hypothesis to define the target value (SOC). For multivariate linear regression, the hypothesis is defined as [31]

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x,$$

where $x_j$ is the $j$th independent variable and $\theta_j$ is the coefficient of that variable. The subscript $n$ refers to the number of independent variables. It should be noted that $x_0$ is taken equal to 1.00. Using the training dataset, the goal of ML techniques is to find the vector $\theta$ that gives the smallest residual on the test dataset. The cost function to be minimized is therefore what distinguishes the different ML techniques; as an example, the cost function of the ordinary least squares (OLS) method is defined as [32]

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - Y^{(i)} \right)^2,$$

where $Y$ is the vector of the dependent variable (the target value of each case), the superscript $(i)$ indexes the training cases, and $m$ is the number of training cases. A variety of approaches are used to minimize the cost function, but they can generally be divided into two categories, namely, noniterative and iterative. In the noniterative approach, the best vector $\theta$ is calculated from $\partial J(\theta) / \partial \theta_j = 0$ over the training dataset as [33]

$$\theta = \left( X^T X \right)^{-1} X^T Y,$$

where $X$ is the matrix of the independent variables of the training cases. For the iterative approach, the vector $\theta$ is simultaneously updated based on the learning rate ($\alpha$) in each iteration [34].
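The noniterative and iterative approaches described above can be compared in a minimal NumPy sketch. It assumes that the iterative approach is plain batch gradient descent on the least squares cost, which the source does not state explicitly. The function names, the synthetic data, and the values of the learning rate and iteration count are illustrative choices, not taken from the paper.

```python
import numpy as np


def add_bias(X):
    """Prepend the constant feature x_0 = 1 to every case."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_normal_equation(X, y):
    """Noniterative fit: solve (X^T X) theta = X^T y for theta."""
    Xb = add_bias(X)
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)


def fit_gradient_descent(X, y, alpha=0.1, n_iter=5000):
    """Iterative fit: all theta_j are updated simultaneously each step."""
    Xb = add_bias(X)
    m = Xb.shape[0]
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        gradient = Xb.T @ (Xb @ theta - y) / m  # gradient of the squared-error cost
        theta = theta - alpha * gradient
    return theta


# Synthetic, noise-free data with a known (hypothetical) coefficient vector.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.5, 3.0])  # [theta_0, theta_1, theta_2, theta_3]
y = add_bias(X) @ true_theta

theta_ne = fit_normal_equation(X, y)
theta_gd = fit_gradient_descent(X, y)
print("normal equation :", np.round(theta_ne, 4))
print("gradient descent:", np.round(theta_gd, 4))
```

Both routes should recover the same coefficients on this well-conditioned toy problem. On real engine data, the inputs (speed, pressure, temperature, and so on) differ by orders of magnitude, so gradient descent would typically need feature scaling to converge in a reasonable number of iterations.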

Results and Discussion
Using the validated virtual engine [3,5,26], the required dataset is constructed by running the virtual engine for the 3 engines defined in Table 1. This dataset consists of 1177 training cases and 253 test cases, obtained by running the virtual engine for each engine at different operating conditions. The independent parameters are engine speed, equivalence ratio, EGR, charge pressure and temperature at IVC, RH, and real CR, and the dependent parameter is SOC. For the training dataset, SOC ranges from 10 CAD before top dead center (TDC) to 10 CAD after TDC, as shown in Figure 1; however, in most cases (more than 67%) it occurs before TDC.
The performance of the SAM in SOC prediction for both the training and test samples is studied in the first step, as shown in Figure 2. The accepted range of SOC prediction in control applications is reported as a residual (|SOC_real − SOC_model|) of less than 2 CAD [26]. As reported in Table 2, SAM predicted the SOC within the acceptable range for 89.3% of the test samples and 70.6% of the training samples. In addition, the root mean squared error (RMSE) is 1.64 and 1.84 for the test and training samples, respectively.
In the next step, both linear and nonlinear ML regression methods are applied to the training dataset and the results are illustrated in detail. Built-in Python code was used for this investigation, and the employed models are reported in Table 3. The coefficient vectors (vectors of θ) obtained for the linear methods are presented in Table 4. The OLS, ridge, linear support vector regression (SVR), and Bayesian ridge (BR) models identify the equivalence ratio as the most influential parameter on SOC variations, while the Lasso and Huber models assign a more important role to the CR, as shown in Figure 3.
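The two figures of merit used for SAM above, the share of cases with a residual below 2 CAD and the RMSE, can be written as a short sketch. The function names and the toy values are illustrative assumptions; only the 2 CAD tolerance and the residual definition come from the text.

```python
import numpy as np


def acceptable_percentage(soc_real, soc_model, tolerance_cad=2.0):
    """Percentage of cases with |SOC_real - SOC_model| below the tolerance."""
    residual = np.abs(np.asarray(soc_real, dtype=float) - np.asarray(soc_model, dtype=float))
    return 100.0 * float(np.mean(residual < tolerance_cad))


def rmse(soc_real, soc_model):
    """Root mean squared error between real and predicted SOC."""
    diff = np.asarray(soc_real, dtype=float) - np.asarray(soc_model, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy values in CAD, for illustration only (not data from the study).
soc_real = [0.0, -5.0, 3.0, 8.0]
soc_model = [1.0, -5.5, 6.0, 8.0]
print(acceptable_percentage(soc_real, soc_model))  # 3 of 4 residuals are below 2 CAD
print(rmse(soc_real, soc_model))
```

The two metrics answer different questions: the acceptable percentage counts how often the prediction is usable for control, while the RMSE is dominated by the few cases with large errors.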
Figure 3 shows the impact factors of the independent parameters on SOC for the linear ML regressors: OLS, Lasso, ridge, SVR, BR, and Huber. A closer look at Figure 3 and Table 4, in particular the out-of-range intercepts of the SVR and Huber models, suggests that these models did not converge and that learning failed. In addition, the impact factor of 0.00 assigned to the equivalence ratio by the Lasso model implies that the effect of the equivalence ratio can be ignored, which is not acceptable in real applications. However, the main criteria for comparing these models are the percentage of acceptable residuals (less than 2 CAD) and the score of the model (the best score is 1.00) on the test dataset. Table 5 reports the performance of the models defined in Table 3 on the test dataset. Considering the reported scores and acceptable residual percentages, it can be concluded that, among the nonlinear models, the robust RANSAC model learned very well and has outstanding accuracy, predicting up to 98.4% of cases within the acceptable range. The same notable performance is achieved by the OLS, ridge, and BR models among the linear models; the performance of the Lasso model is still acceptable, predicting up to 99.6% of cases with errors of less than 3 CAD. The scatter plots of the residuals achieved by the models are shown in Figures 4 and 5 for the test and training datasets, respectively. From these plots, it can be concluded that the main reason for the errors of the Lasso model is that its learning process approaches overfitting. The same trend is seen for the nonlinear SVR and nearest neighbors models. Complete overfitting has occurred for the decision tree model, and the rather weak performance of the MLP neural network, Huber, and linear SVR models is due to convergence issues.
Mathematical Problems in Engineering
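The comparison procedure described above (fit each regressor on the training set, then compare the score and the acceptable residual percentage on the test set) can be sketched as follows. The source only mentions built-in Python code, so the use of scikit-learn here is an assumption, as is reading the "score" as the coefficient of determination returned by each estimator's `score` method. The data are synthetic stand-ins with the same split sizes as the study, and `true_theta` is a hypothetical coefficient vector; none of the printed numbers reproduce the paper's tables.

```python
import numpy as np
from sklearn.linear_model import (BayesianRidge, Lasso, LinearRegression,
                                  RANSACRegressor, Ridge)
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the engine dataset: 7 inputs, linear target plus noise.
rng = np.random.default_rng(0)
n_train, n_test, n_features = 1177, 253, 7
true_theta = np.array([2.0, -1.5, 0.5, 1.0, -2.0, 0.3, 1.2])  # hypothetical
X = rng.normal(size=(n_train + n_test, n_features))
y = X @ true_theta + rng.normal(scale=0.5, size=n_train + n_test)
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# A subset of the model families named in the text, with default settings.
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Bayesian ridge": BayesianRidge(),
    "Lasso": Lasso(alpha=1.0),
    "RANSAC": RANSACRegressor(random_state=0),
    "Decision tree": DecisionTreeRegressor(random_state=0),
}


def acceptable_percentage(y_true, y_pred, tolerance=2.0):
    """Percentage of cases whose absolute residual is below the tolerance."""
    return 100.0 * float(np.mean(np.abs(y_true - y_pred) < tolerance))


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_score": model.score(X_train, y_train),  # R^2 on training data
        "test_score": model.score(X_test, y_test),     # R^2 on unseen data
        "test_acceptable_pct": acceptable_percentage(y_test, model.predict(X_test)),
    }

for name, r in results.items():
    print(f"{name:15s} train R2={r['train_score']:.3f} "
          f"test R2={r['test_score']:.3f} "
          f"acceptable={r['test_acceptable_pct']:.1f}%")
```

A large gap between the training and test scores, as an unconstrained decision tree typically shows, is the usual symptom of the overfitting discussed above. A strongly regularized Lasso can also drive small coefficients to exactly zero, which is one way an input can end up with an impact factor of 0.00.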

Conclusion
In this work, 11 regression models based on machine learning techniques were employed to provide a fast-response and accurate model for predicting the start of combustion in homogeneous charge compression ignition engines fueled with methane. The dataset for this investigation is adopted from a validated virtual engine that uses the detailed chemical kinetics of methane combustion. The main achievements of the work are listed in the following:

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.