QSPR Models for Octane Number Prediction

A quantitative structure-property relationship (QSPR) study was performed to predict the octane number of hydrocarbons by correlating it with parameters calculated from molecular structure: molecular mass (M), hydration energy (EH), boiling point (BP), octanol/water distribution coefficient (log P), molar refractivity (MR), critical pressure (CP), critical volume (CV), and critical temperature (CT). Principal component analysis (PCA) and multiple linear regression (MLR) were used to examine the relationship between these parameters and the octane number of hydrocarbons. The PCA results describe the interrelationships between octane number and the individual variables. Correlation coefficients were calculated using Microsoft Excel. The data set was split into a training set of 40 hydrocarbons and a validation set of 25 hydrocarbons. The linear relationship between the selected descriptors and the octane number has a coefficient of determination R² = 0.932, statistical significance F = 53.21, and standard error s = 7.7. Applied to the validation set, the QSPR model gave R²CV = 0.942 and s = 6.328.


Introduction
Octane rating, or octane number, is a standard measure of the performance of gasoline fuel. The most common types of octane rating worldwide are the research octane number (RON) and the motor octane number (MON). Octanes are a family of hydrocarbons that are typical components of gasoline. The octane rating of gasoline is measured in a test engine and is defined by comparison with the mixture of 2,2,4-trimethylpentane (isooctane) and heptane that has the same antiknock capacity as the fuel under test. For example, gasoline with the same knocking characteristics as a mixture of 90% isooctane and 10% heptane has an octane rating of 90. The ASTM standard method uses an internal combustion test engine, and the octane number is obtained by interpolating between the nearest reference standards above and below the unknown sample [1]. The procedure is time consuming, involves expensive and maintenance-intensive equipment, and requires skilled labour.
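The interpolation between bracketing standards can be illustrated with a minimal sketch. It assumes a simple linear interpolation on a knock-intensity reading; the function name, the readings, and their units are hypothetical and are not taken from the ASTM procedure or from this study.

```python
def bracketed_octane(on_low, ki_low, on_high, ki_high, ki_sample):
    """Linearly interpolate an octane number between two reference blends.

    on_low, on_high: octane numbers of the bracketing reference blends
    ki_low, ki_high: knock-intensity readings of those blends (hypothetical units)
    ki_sample: knock-intensity reading of the fuel under test
    """
    if ki_low == ki_high:
        raise ValueError("reference readings must differ")
    # Fraction of the way from the lower-octane blend to the higher-octane blend
    fraction = (ki_low - ki_sample) / (ki_low - ki_high)
    return on_low + fraction * (on_high - on_low)


# Hypothetical readings: the lower-octane blend knocks harder than the higher one
sample_on = bracketed_octane(90.0, 55.0, 92.0, 45.0, 50.0)
print(sample_on)  # 91.0
```

A sample whose reading falls midway between the two reference blends is assigned the midpoint octane number, which is all the sketch is meant to convey.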
The relations between the structure of alkanes and their physicochemical properties, and the empirical rules for the dependence of octane number (ON) on alkane structure, are discussed by A. Perdih and F. Perdih [2]. The relation between the structure of hydrocarbons and their octane number was studied using a number of topological indices [3]. Nikolaou et al. [4] presented a calculation method that uses compositional data from high-resolution capillary GC analysis together with the widely published measured pure and blending RON values of various hydrocarbons. Chung et al. [5] concluded that ridge regression is a viable method for calibrating RON against NIR data; the ridge calibration model showed more stable prediction performance, especially when the spectral baselines varied. The correlation of the octane number of heptane and octane isomers with various topological indices was studied by Hosoya [6].
Prediction of gasoline octane numbers using NIR spectrophotometry has also been studied. Kelly et al. [7] analysed fifty-nine unleaded gasoline samples spectroscopically to evaluate the chemometric techniques of principal components regression (PCR) and partial least squares regression (PLS) and to assess the accuracy of the predictions as a function of wavelength range and spectral resolution. They demonstrated, for the data obtained, that greater resolution is not crucial to prediction accuracy. In determining the octane numbers of gasoline compounds from their chemical structure by 13C NMR spectroscopy and neural networks, Meusinger and Moros [8] showed that the statements from neural network calculations cannot be interpreted in a chemical or physical manner.
Determination of motor gasoline adulteration using FTIR spectroscopy and multivariate calibration was studied [9] with a practical procedure based on density, distillation temperatures, and FTIR analysis. Thirteen absorbance peaks at wavenumbers 434, 461, 484, 673, 694, 1030, 1086, 1217, 1231, 1460, 1497, 1606, and 3028 cm⁻¹ were chosen to perform the multivariate calibration. Artificial neural network (ANN) models were investigated by Pasadakis et al. [10] to determine the research octane number (RON) of gasoline blends. Graph-theoretical approaches have found application in diverse areas of chemical, industrial, environmental, pharmacochemical, and medical research [11, 12]. Reid et al. [13] and Albahri [3] investigated the structural dependency of octane number using a structural group contribution approach.
The physical and chemical properties of a compound are a function of its molecular structure. A quantitative structure-property relationship (QSPR) is an empirically defined relationship between molecular structure and observed properties. The most comprehensive chemometric data analysis used in quantitative structure-activity and structure-property relationships (QSAR and QSPR) is explained by Ferreira [14]. QSPR models are developed by finding suitable descriptors, which may be constitutional, topological, electrostatic, geometrical, or quantum chemical. A QSPR relationship is most often derived by using curve-fitting software to find the best predictive model. The resulting model can then be used to estimate the properties of other molecules from their chemical structure alone, without experimental determination or synthesis. Abdel-Moghny et al., Gad, and Gad and Khairou have previously studied QSPR for critical micelle concentration (CMC), hydrophilic-lipophilic balance (HLB), and crude oil emulsion stability [15-17].
In this study, eight descriptors were selected for the QSPR study of octane number because of their direct physical significance for the behaviour of the whole chemical structure of a hydrocarbon. The molecular geometry of the hydrocarbons was optimized using AM1, a semiempirical self-consistent field (SCF) method. Molecular mass (M), hydration energy (EH), boiling point (BP), octanol/water distribution coefficient (log P), molar refractivity (MR), critical pressure (CP), critical volume (CV), and critical temperature (CT) were calculated. The descriptors were examined by principal component analysis (PCA) before regression to obtain a better overview of the variables.
Multiple linear regression analysis (MLRA) was then performed to model and estimate the octane number of different hydrocarbons from the calculated descriptors of their chemical structures. The empirical equation obtained appears acceptable for predicting the octane number of unknown hydrocarbons.

Methodology
Experimentally determined octane numbers of the selected hydrocarbons were taken from the Russian Chemical Bulletin [18]. The data set contains 65 molecules, divided into a training set of 40 molecules and a validation set of 25 molecules. The molecular structures were drawn using ChemSketch freeware (Advanced Chemistry Development, Inc., ACD/Labs Release 12.00 (2010)) and then optimized using the AM1 semiempirical method in HyperChem 6.03 for Windows 8.1. Single-point calculations were performed. Each molecule was subjected to molecular mechanics optimization, comparing energies to determine the global minimum energy conformation. Molecular mass (M), hydration energy (EH), boiling point (BP), octanol/water distribution coefficient (log P), molar refractivity (MR), critical pressure (CP), critical volume (CV), and critical temperature (CT) were calculated using ChemOffice Ultra 2004. The eight calculated descriptors are listed in Tables 1 and 2, together with the experimental octane numbers from the literature.

Data Processing
Principal component analysis (PCA) was performed to determine the correlation coefficient matrix of the individual descriptors using the statistical software XLSTAT version 7.1. The correlation coefficient matrix of the calculated descriptors and the octane number values was determined, and multiple linear regression analysis (MLRA) was then carried out using Microsoft Excel 2013. The resulting correlation model for prediction of octane number has the form $\mathrm{ON} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$, where $b_0$ is the intercept (the value of octane number when all $x_i = 0$) and $b_i$ is the regression coefficient, or slope, for the variable $x_i$ (the calculated descriptors).
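A linear model of this form can be fitted by ordinary least squares. The following is a minimal sketch in plain Python rather than the Excel workflow used in this study; it solves the normal equations directly. The function names and the small data set are illustrative assumptions and are not values from Tables 1 to 4.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so that the inputs are not modified
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(descriptors, response):
    """Return [b0, b1, ..., bn] that minimize the sum of squared residuals."""
    # A leading column of ones makes b0 the intercept
    design = [[1.0] + list(row) for row in descriptors]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Evaluate b0 + b1*x1 + ... + bn*xn for one set of descriptor values."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


# Made-up data generated from y = 2 + 3*x1 - x2 (not taken from the paper)
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
ys = [2 + 3 * x1 - x2 for x1, x2 in rows]
coeffs = fit_mlr(rows, ys)
print([round(c, 6) for c in coeffs])
```

Because the made-up response is exactly linear in the two descriptors, the fit recovers the generating coefficients up to rounding. With eight strongly intercorrelated descriptors, the normal equations can become ill conditioned, so a dedicated least-squares routine would be preferable in practice.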

Training Set.
The biplot is a visualization technique for investigating the interrelationships between the octane number (ON) and the different descriptors in multivariate data. Normally, the clusters of observations are illustrated by plotting the scores for the first and second principal components (PC1 and PC2), as shown in Figure 1.
The position of a parameter on the biplot shows that octane number is influenced by the vectors that lie near it or on the opposite side, whereas vectors that lie roughly perpendicular to octane number have low correlation with it. The biplot reveals that one of the descriptors lies close to octane number, which means that it is positively correlated with octane number. The other parameters lie nearly in the opposite direction, which means that they are negatively correlated with octane number. Both the positive and the negative correlations are significant.
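The geometric reading of the biplot can be illustrated with a minimal sketch that computes the first principal component of a correlation matrix by power iteration. This is not the XLSTAT procedure used in this study; the variable names and numerical values are made up for illustration only.

```python
import math


def standardize(column):
    """Scale a column to zero mean and unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    z = [standardize(col) for col in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=200):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    k = len(matrix)
    # Uneven start vector, so that it is unlikely to be orthogonal to the answer
    vec = [1.0 / (i + 1) for i in range(k)]
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)
    )
    return eigenvalue, vec


# Made-up values: ON plus two hypothetical descriptors
on = [10.0, 20.0, 30.0, 40.0, 50.0]
desc_a = [1.1, 1.9, 3.2, 3.9, 5.1]  # rises with ON
desc_b = [9.0, 8.2, 6.9, 6.1, 5.0]  # falls as ON rises

corr = correlation_matrix([on, desc_a, desc_b])
eigenvalue, loadings = first_component(corr)
explained = eigenvalue / len(corr)  # share of total variance carried by PC1
```

In this toy example the loading of `desc_a` has the same sign as that of ON, while the loading of `desc_b` has the opposite sign. This corresponds to vectors lying close to, or opposite to, the octane number vector on a biplot. The overall sign of an eigenvector is arbitrary, so only relative signs are meaningful.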
The relative importance of the descriptors can be confirmed from the correlation matrix shown in Table 2. The larger the absolute value of a correlation coefficient, the more significant the correlation, regardless of whether its sign is positive or negative.
Combinations of the eight descriptors were tested by multiple linear regression analysis against the octane number values of the 40 hydrocarbons in the training set. The squared correlation coefficient (or coefficient of multiple determination), also called the R-squared of the equation, is denoted R². It measures the explanatory power of the regression equation and falls in the range 0 to 1, where 0 means that the regression accounts for none of the variation and 1 means that the relationship is deterministic and the regression accounts for all of the variation. The coefficient of determination R² is found to be 0.932. The standard error expresses the variation of the residuals, that is, the variation about the regression line, and thus measures the model error; it is equal to 7.76, with F = 53.22 and significance F = 5.687E-16. The regression model is therefore considered highly significant.
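The relationship among these statistics can be made explicit with a short sketch. It uses the standard definitions for a least-squares fit with an intercept; the function names and the small observed/predicted lists are illustrative assumptions. Only the last call uses figures reported in the text (R² = 0.932, 40 molecules, 8 descriptors).

```python
import math


def f_from_r_squared(r_squared, n_samples, n_descriptors):
    """F statistic implied by R squared, sample count, and descriptor count."""
    df_resid = n_samples - n_descriptors - 1  # residual degrees of freedom
    return (r_squared / n_descriptors) / ((1.0 - r_squared) / df_resid)


def regression_statistics(observed, predicted, n_descriptors):
    """Return (R squared, standard error, F) for a fit with an intercept."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    r_squared = 1.0 - ss_resid / ss_total
    std_error = math.sqrt(ss_resid / (n - n_descriptors - 1))
    f_stat = f_from_r_squared(r_squared, n, n_descriptors)
    return r_squared, std_error, f_stat


# Made-up observed and predicted values for a one-descriptor model
r2, s, f = regression_statistics(
    [1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.0], 1
)

# Consistency check using the values reported in the text
f_check = f_from_r_squared(0.932, 40, 8)
print(round(f_check, 1))  # 53.1
```

The value of about 53.1 obtained from the rounded R² agrees with the reported F = 53.22 to within the rounding of R² to three decimal places.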
Table 3 shows the regression coefficients, standard errors, t-stat, P value (significance), and confidence intervals of the linear regression model (R² = 0.932, s = 7.76, F = 53.22, and Sig. = 5.687E-16). The t-values and P values indicate that the proposed eight descriptors are relevant for predicting ON. Table 3 also gives the confidence intervals for the regression coefficients of the descriptors. Positive regression coefficients indicate that the descriptors contribute positively to the octane number; negative values indicate that the greater the value of the descriptor, the lower the octane number.
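The t-stat and confidence interval of a single coefficient follow directly from the coefficient and its standard error, as the sketch below shows. The coefficient and standard error are hypothetical and are not taken from Table 3. The critical value 2.04 is an approximate two-sided 95% t value for 40 - 8 - 1 = 31 residual degrees of freedom.

```python
def coefficient_summary(coefficient, std_error, t_critical):
    """Return the t-stat and confidence interval of one regression coefficient."""
    t_stat = coefficient / std_error
    half_width = t_critical * std_error
    return t_stat, (coefficient - half_width, coefficient + half_width)


# Hypothetical coefficient and standard error (not values from Table 3)
t_stat, (lower, upper) = coefficient_summary(-0.5, 0.1, 2.04)

# A coefficient is significant at this level when its interval excludes zero
excludes_zero = lower > 0 or upper < 0
```

Here the interval runs from about -0.70 to about -0.30 and excludes zero, so a coefficient with these hypothetical values would be regarded as significantly negative.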
The t-stat measures the statistical significance of the regression coefficients; higher t-stat values correspond to relatively more significant regression coefficients. The P values indicate that the model is statistically significant. The resulting correlation model for prediction of the property of interest is referred to as model (1). The predicted ON shows a linear relationship with the literature ON for the training set, as shown in Figure 2. The linear correlation between the experimental and predicted values of ON for the validation set is represented graphically in Figure 3.
We conclude that, in model (1), the octane number decreases as the hydration energy (EH), boiling point (BP), molar refractivity (MR), and octanol/water distribution coefficient (log P) increase. An increase in the other descriptors, namely, molar mass (M), critical pressure (CP), critical volume (CV), and critical temperature (CT), increases the octane number.

Figure 1: The biplot shows the intercorrelation of different descriptors.

Figure 2: Linear relationship between the predicted and the literature octane number for the training set.

Figure 3: Linear relationship between the literature and predicted values of the octane number for the validation set (R²CV = 0.9419).

Table 1: Training set of hydrocarbons with their molecular mass (M), hydration energy (EH), boiling point (BP), molar refractivity (MR), octanol/water distribution coefficient (log P), critical pressure (CP), critical volume (CV), critical temperature (CT), the literature octane number ON(L), and the predicted octane number ON(P).

Table 2: Correlation coefficients matrix of the selected descriptors.
In order to test the predictive power of model (1), the regression coefficients shown in Table 3 were used to predict the octane number for the remaining molecules, given in Table 4. The results for the validation set are good, characterized by an R²CV value of 0.94 and a standard error of 6.96. It is clear from Figure 3 that the predicted values of ON are in good agreement with the literature values.
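The validation step can be sketched as follows: fixed coefficients from the training fit are applied to held-out molecules, and the agreement with the literature values is summarized by a squared correlation. The coefficients, descriptor rows, and octane numbers are placeholders, not the contents of Table 3 or Table 4. Using the squared Pearson correlation as the validation statistic is an assumption about how the reported value was obtained.

```python
def predict_on(coefficients, descriptors):
    """coefficients = [b0, b1, ..., bn]; descriptors = [x1, ..., xn]."""
    return coefficients[0] + sum(
        b * x for b, x in zip(coefficients[1:], descriptors)
    )


def squared_correlation(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Placeholder coefficients and validation data (not from Table 3 or Table 4)
coefficients = [10.0, 2.0, -0.5]
validation_rows = [(20.0, 4.0), (25.0, 10.0), (30.0, 6.0), (35.0, 12.0)]
literature_on = [49.0, 54.0, 68.0, 73.0]

# The coefficients stay fixed: nothing is refitted on the validation set
predicted_on = [predict_on(coefficients, row) for row in validation_rows]
r2_validation = squared_correlation(literature_on, predicted_on)
```

Keeping the coefficients fixed is what makes the held-out molecules a test of predictive power rather than of goodness of fit.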

Table 4: Cross-validation set of hydrocarbons with their molecular mass (M), hydration energy (EH), boiling point (BP), molar refractivity (MR), octanol/water distribution coefficient (log P), critical pressure (CP), critical volume (CV), critical temperature (CT), and the literature octane number ON.