Rapid and Simultaneous Prediction of Eight Diesel Quality Parameters through ATR-FTIR Analysis

Quality assessment of diesel fuel is essential, but the standard methods are costly and time-consuming. Therefore, this study aimed to develop an analytical method capable of simultaneously determining eight diesel quality parameters (density; flash point; total sulfur content; distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery; cetane index; and biodiesel content) through attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy and partial least squares (PLS) multivariate regression. For this purpose, the quality parameters of 409 samples were determined using standard methods, and their spectra were acquired in the range of 4000–650 cm−1. Two multivariate filters, generalized least squares weighting (GLSW) and orthogonal signal correction (OSC), were evaluated to improve the signal-to-noise ratio of the models. Likewise, four variable selection approaches were tested: manual exclusion, forward interval PLS (FiPLS), backward interval PLS (BiPLS), and genetic algorithm (GA). The multivariate filters and variable selection algorithms generated better-fitted and more accurate PLS models. According to the validation, the FTIR/PLS models presented accuracy comparable to the reference methods and, therefore, the proposed method can be applied in routine diesel monitoring to significantly reduce costs and analysis time.


Introduction
Diesel fuel is a petroleum-derived product of great importance for a country's economy since most of the transportation of industrial and agricultural products depends on diesel vehicles [1,2]. This fuel is a complex mixture composed mainly of paraffinic, olefinic, and aromatic hydrocarbons ranging from 8 to 28 carbon atoms and, in lower concentrations, substances containing oxygen, nitrogen, sulfur, and metals [3][4][5]. The diesel composition is influenced by several factors, such as the origin of the crude oil, the operating variables of the refinery, the addition of fractions from cracking processes, and the insertion of additives to increase engine performance [3]. Therefore, the fuel quality is susceptible to many variables until the fuel reaches the consumer. In this perspective, the monitoring of diesel quality parameters is extremely important for commercialization, engine performance, consumer rights, business competition, and environmental risks [5,6]. The assays performed to ensure diesel quality are based on standardized procedures that require specific equipment to determine each physicochemical parameter. According to the standard methods, the quality assessment requires considerable sample volume and analysis time, besides the great expense of equipment maintenance and several specialized analysts [7][8][9][10][11][12]. Therefore, the development of accurate, fast, and environmentally friendly methods to monitor diesel quality is highly necessary [13]. This becomes possible with attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy associated with multivariate regression methods such as partial least squares (PLS). Studies have demonstrated the possibility of predicting some diesel properties using mid-infrared spectroscopy combined with chemometric tools [14][15][16][17][18]; some aimed at the prediction of biodiesel content [16], and others were devoted to the identification of diesel adulteration with waste vegetable oils [17,18].
In the USA, the European Community, and Japan, the regulations of diesel properties for consumption are established, respectively, by ASTM D975, EN 590, and JIS K2204 [19][20][21]. In Brazil, the regulation and supervision of fuels are performed according to ANP (National Agency of Petroleum, Natural Gas, and Biofuels) Resolution no. 30/2016, which requires that assays be conducted according to ASTM, EN, or NBR standards [22]. According to this resolution, at least eight quality parameters of diesel are analyzed in official monitoring laboratories: aspect; color; density; flash point; total sulfur; volatility (distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery); cetane index; and biodiesel content [23]. The development of an alternative method for determining the physicochemical parameters of diesel through ATR-FTIR has several advantages for routine quality monitoring. The use of ATR-FTIR can reduce costs, increase analytical frequency, use a smaller sample volume, and provide the determination of all required parameters using a single instrument. Moreover, infrared spectrometers are already purchased by monitoring laboratories for the determination of biodiesel content in diesel according to EN 14078.
In view of the high costs and long time required to assess diesel quality by standard methods, this work aimed at the development of a simple and fast analytical method based on ATR-FTIR analysis and the PLS regression method to determine eight diesel quality parameters simultaneously. In this study, multivariate filters and variable selection techniques, such as genetic algorithm (GA), forward interval PLS (FiPLS), and backward interval PLS (BiPLS), were evaluated to obtain the best model predictive ability.

Samples.
For eight months, the quality parameters of 3549 samples of diesel fuel were analyzed by Cempeqc (Center for Monitoring and Research of the Quality of Fuels, Biofuels, Crude Oil and Derivatives) according to ASTM and EN standards. The samples were stored at 10°C for further spectroscopic analysis. The standards and equipment used in the determination of quality parameters are presented in Table 1.
Although an extensive sample set can provide greater robustness to a prediction model, this work aimed at the development of a simple method that can be easily reproduced by other laboratories. Therefore, we selected about 10% of the diesel samples for spectroscopic and chemometric analysis. The 3549 diesel samples were divided into groups using hierarchical cluster analysis (HCA) to select the most representative samples. An HCA was executed for each month, and the physicochemical parameters were used as variables. The clustering used a 60% similarity threshold, the complete linkage method, and autoscale preprocessing so that all variables had the same influence. The software used for HCA was Pirouette (Infometrix), version 3.11. At the end of the eight months, 409 diesel samples were selected.
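As an illustration of the clustering step, the following Python sketch autoscales a small table of physicochemical parameters and groups the samples by complete linkage. It assumes that the similarity between two samples is 1 − d/d_max, where d is the Euclidean distance on autoscaled data, so that a 60% similarity threshold corresponds to cutting the dendrogram at 0.4·d_max. This definition, the function names, and the toy values are assumptions made for illustration; the sketch does not reproduce the Pirouette implementation, and it does not cover how representative samples were picked from each cluster.

```python
import math


def autoscale(rows):
    """Mean-center each column and scale it to unit standard deviation.

    Assumes no column is constant (its standard deviation would be zero).
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def complete_linkage_clusters(rows, similarity=0.60):
    """Agglomerative clustering with complete linkage.

    Merging stops when the closest pair of clusters is farther apart than
    (1 - similarity) * d_max, an assumed definition of the similarity scale.
    Returns a list of clusters, each a sorted list of row indices.
    """
    data = autoscale(rows)
    dist = [[math.dist(a, b) for b in data] for a in data]
    d_max = max(max(r) for r in dist)
    threshold = (1.0 - similarity) * d_max
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data (hypothetical): density and flash point for four samples.
rows = [[0.830, 40.0], [0.831, 41.0], [0.850, 60.0], [0.851, 61.0]]
groups = complete_linkage_clusters(rows)
```

The pairwise search is cubic in the number of samples, which is acceptable for a sketch but would be replaced by a dedicated clustering library for monthly sets of several hundred samples.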

Spectroscopic Analysis.
The infrared spectra of the 409 samples were obtained with a Nicolet 6700 FTIR spectrometer (Thermo Scientific, Waltham, USA) using 32 scans and 4 cm−1 resolution. A Smart ARK ATR sampling accessory with a ZnSe crystal and a 45° angle of incidence was used to acquire the infrared spectra. The ATR accessory required one milliliter of each sample, and a new background spectrum was acquired every hour to reduce baseline shifting and ambient variations. The temperature and relative humidity during the analysis were 20.7 ± 2.0°C and 40 ± 9%, respectively.

Chemometric Analysis
The chemometric analysis was executed using Matlab 2013a (MathWorks) with PLS toolbox 7.3.1 (Eigenvector Research Inc.). The FTIR spectra were converted into vectors of 1738 variables, and the combination of the vectors resulted in the matrix X of dimension 409 by 1738. Prior to the development of PLS models, the sample set was separated into two-thirds for calibration (273 samples) and one-third for validation (136 samples). The Onion algorithm was used to select the samples with less covariance (based on distance from the mean) for each set and, consequently, to obtain greater sample representativeness in both sets [24,25]. The algorithm was performed for each parameter to ensure that the calibration set had the largest range of reference values.
Initially, the PLS models were developed using the full spectra (full X-block) preprocessed by mean centering or autoscaling, depending on the best fit. The number of latent variables (LV) was chosen based on the root mean square errors of calibration (RMSEC), cross-validation (RMSECV), and prediction (RMSEP) in order to minimize the prediction errors and avoid model overfitting [26,27]. The cross-validation was performed using the venetian blinds mode with 10 splits. Then, statistical tests were applied according to ASTM E1655 [28] to detect the presence of outliers in the calibration and validation sets. Outliers include high-leverage samples and samples whose reference values are inconsistent with the model. Therefore, samples with high leverage and high studentized residuals were excluded from the sample sets.
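The venetian blinds cross-validation with 10 splits can be sketched in Python as follows. In this scheme, split k holds out every tenth sample starting at position k, and the RMSECV pools the squared errors of all held-out predictions. The function names and the `fit_predict` callable are hypothetical placeholders for any regression routine (for example, a PLS model with a fixed number of latent variables); the sketch is not the PLS toolbox implementation.

```python
import math


def venetian_blinds(n_samples, n_splits=10):
    """Return a list of (train_indices, test_indices) pairs.

    Split k holds out samples k, k + n_splits, k + 2 * n_splits, ...
    """
    folds = []
    for k in range(n_splits):
        test = list(range(k, n_samples, n_splits))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        folds.append((train, test))
    return folds


def rmsecv(x, y, fit_predict, n_splits=10):
    """Root mean square error of cross-validation.

    fit_predict(x_train, y_train, x_test) is a user-supplied (hypothetical)
    callable that fits a model and returns one prediction per test row.
    """
    squared_error = 0.0
    for train, test in venetian_blinds(len(y), n_splits):
        predictions = fit_predict([x[i] for i in train],
                                  [y[i] for i in train],
                                  [x[i] for i in test])
        squared_error += sum((y[i] - p) ** 2 for i, p in zip(test, predictions))
    return math.sqrt(squared_error / len(y))


def mean_predictor(x_train, y_train, x_test):
    """Trivial stand-in model: predicts the mean of the training responses."""
    mean = sum(y_train) / len(y_train)
    return [mean for _ in x_test]


folds = venetian_blinds(273, 10)   # 273 calibration samples, 10 splits
toy_rmsecv = rmsecv([[0], [0], [0], [0]], [1.0, 2.0, 3.0, 4.0],
                    mean_predictor, n_splits=2)
```

With 273 calibration samples, the first three splits hold out 28 samples each and the remaining seven hold out 27. Because the held-out samples are interleaved, the result depends on the sample order, which is why this scheme is usually applied to sets without a systematic ordering.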

Preprocessing Evaluation.
Spectral data usually present baseline shifting due to instrumental variations and reflectance deviations [29]. The baseline shifting is typically corrected by applying the first or second derivative, or by methods that correct the displacement based on a standard spectrum or on the statistics of each spectrum, for example, multiplicative scatter correction (MSC) and standard normal variate (SNV). In addition, digital filters such as smoothing are also used to improve the signal-to-noise ratio of spectral data [28]. Multivariate filters, such as generalized least squares weighting (GLSW) and orthogonal signal correction (OSC), are less common preprocessing methods, but these filters are very useful to eliminate baseline shifting and increase the signal-to-noise ratio [30][31][32][33]. Therefore, the following preprocessing methods were evaluated in modeling: mean center, autoscale, Savitzky-Golay smoothing and derivatives, SNV, MSC, GLSW, and OSC.
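Two of the simpler preprocessing steps listed here, SNV and mean centering, can be sketched in a few lines of Python. SNV operates on each spectrum separately (row-wise), whereas mean centering operates on each variable across all samples (column-wise). The function names and toy values are illustrative assumptions, and the sample standard deviation (n − 1 denominator) is assumed for SNV.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own
    mean and sample standard deviation."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(v - mean) / sd for v in spectrum]


def mean_center(spectra):
    """Subtract the column mean (the mean spectrum) from every spectrum."""
    n = len(spectra)
    column_means = [sum(col) / n for col in zip(*spectra)]
    return [[v - m for v, m in zip(row, column_means)] for row in spectra]


# A spectrum and a copy with an additive offset and a multiplicative factor
# (hypothetical values): SNV maps both onto the same profile.
original = snv([1.0, 2.0, 3.0])
shifted_scaled = snv([12.0, 14.0, 16.0])   # 2 * original + 10
centered = mean_center([[1.0, 4.0], [3.0, 8.0]])
```

The example shows why SNV removes baseline offsets: a constant shift and a scale factor applied to a whole spectrum cancel out after row-wise centering and scaling.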

Variable Selection Methods.
Many studies have shown that variable selection is an efficient way to increase the signal-to-noise ratio and, as a consequence, improve the predictive ability of the model [34,35]. When the noise dominates over the information related to the property of interest, the removal of variables often leads to better accuracy and performance of the analytical method [35,36]. The selection of variables can be performed based on spectral knowledge (manual approach) or through algorithms that search for the variables that provide the minimum prediction error to the model. Some of the most popular methods for selecting variables are the interval selection methods, such as the forward interval PLS (FiPLS) and the backward interval PLS (BiPLS), and the genetic algorithm (GA), a technique that employs a probabilistic and nonlocal search process which manipulates binary strings with the coded experimental variables. Details on these variable selection methods can be found in [35].
In this study, four different approaches were evaluated to select variables: manual exclusion, FiPLS, BiPLS, and GA. The manual exclusion was carried out by evaluating the spectral residuals and loadings plots. Spectral regions with no absorbance or high relative standard deviation (RSD) were excluded from the data, and the results were compared with those obtained using the full spectra. Both iPLS methods were executed using an interval size of 25 variables, and the number of intervals was determined by the algorithm to obtain the lowest value of RMSECV. The GA was performed with a population size of 128 models, one variable per window, initial terms of 30%, a mutation rate of 0.5%, double crossover, 200 generations, and the PLS regression method. All approaches were performed using only the calibration set to avoid overestimated results.
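The interval-based search can be sketched in Python as follows. The first function splits the 1738 spectral variables into windows of 25 variables; the second performs a greedy forward selection, adding at each step the interval that gives the lowest score and stopping when no interval improves it. The `score` callable is a hypothetical placeholder for the RMSECV of a PLS model built on the selected intervals; the function names, the stopping rule, and the toy score are assumptions for illustration and do not reproduce the PLS toolbox routine.

```python
def make_intervals(n_variables, interval_size=25):
    """Split variable indices into consecutive (start, stop) windows.

    The last window is shorter when n_variables is not a multiple of
    interval_size.
    """
    return [(start, min(start + interval_size, n_variables))
            for start in range(0, n_variables, interval_size)]


def forward_interval_selection(intervals, score):
    """Greedy forward selection of intervals.

    score(selected_intervals) is a user-supplied (hypothetical) callable,
    for example the RMSECV of a PLS model built on those intervals.
    Returns the selected intervals and the best score reached.
    """
    selected, best = [], float("inf")
    remaining = list(intervals)
    while remaining:
        trial_score, trial = min(
            ((score(selected + [iv]), iv) for iv in remaining),
            key=lambda pair: pair[0],
        )
        if trial_score >= best:
            break  # no remaining interval improves the model
        selected.append(trial)
        remaining.remove(trial)
        best = trial_score
    return selected, best


intervals = make_intervals(1738, 25)

# Toy score (hypothetical): two informative windows lower the error,
# every other window adds noise.
informative = {(0, 25), (50, 75)}


def toy_score(selected):
    chosen = set(selected)
    return 10 - 3 * len(informative & chosen) + len(chosen - informative)


chosen, best_score = forward_interval_selection(make_intervals(100, 25), toy_score)
```

For 1738 variables this yields 70 intervals, the last one containing 13 variables. A backward (BiPLS-style) search would start from all intervals and remove, at each step, the interval whose exclusion lowers the score the most.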

Model Validation.
The PLS models were statistically evaluated by figures of merit (FOM) according to ASTM E1655 and Valderrama et al. [28,37]. The accuracy of the models, defined as the degree of agreement between a measured value and a reference value, was assessed by the values of RMSECV, RMSEP, correlation coefficients (r), average relative errors (ARE), and relative percent difference (RPD). The RMSECV was obtained by cross-validation using the venetian blinds mode with 10 splits, and the RMSEP was obtained from the validation samples, which were measured independently from the calibration samples. Then, the RMSECV and RMSEP were compared with the reproducibility of the reference method.
The ARE was used as a parameter to evaluate the magnitude of the prediction errors in relation to the reference values [38]. The ARE value was calculated by

$$\mathrm{ARE}\,(\%) = \frac{100}{n_v} \sum_{i=1}^{n_v} \frac{\left| y_i - \hat{y}_i \right|}{y_i} \qquad (1)$$

where $y_i$ and $\hat{y}_i$ correspond, respectively, to the reference value and the value predicted by the model and $n_v$ is the number of validation samples. The relative percent difference (RPD) was obtained as the ratio of the standard deviation of the validation set reference values to the RMSEP value. RPD values above 2.5 indicate that the model has acceptable accuracy over the measurement range, while values above 10 are considered excellent for alternative methods [39]. Linearity is an important parameter to evaluate the performance of the model since the PLS regression method is not suitable for nonlinear relationships between the variables x and the property of interest [40]. The linearity corresponds to the ability of the model to provide results directly proportional to the property of interest. One way to evaluate this parameter in multivariate models is through the residual plots of the calibration and validation samples. If the distribution of the residuals is random, it can be said that the model shows a linear behavior. In addition to the residual plots, the linearity was also evaluated by the values of the determination coefficients (R²) and bias. This last FOM indicates the presence of systematic errors in the model. Bias can be assessed by a t-test for the validation samples at a confidence level of 95%. The average bias was calculated by summing the differences between the reference value and the predicted value and dividing by the number of validation samples [28]:

$$\mathrm{bias} = \frac{1}{n_v} \sum_{i=1}^{n_v} \left( y_i - \hat{y}_i \right) \qquad (2)$$

Then, the standard deviation of validation errors (SDV) was calculated as

$$\mathrm{SDV} = \sqrt{\frac{\sum_{i=1}^{n_v} \left[ \left( y_i - \hat{y}_i \right) - \mathrm{bias} \right]^2}{n_v - 1}} \qquad (3)$$

and finally, the value of $t_{\mathrm{bias}}$ was given by

$$t_{\mathrm{bias}} = \frac{\left| \mathrm{bias} \right| \sqrt{n_v}}{\mathrm{SDV}} \qquad (4)$$

If the value obtained for $t_{\mathrm{bias}}$ was greater than the critical value for $n_v - 1$ degrees of freedom, then the multivariate model presented significant systematic errors.
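The accuracy and bias figures of merit described in this section can be computed with a short Python sketch. It assumes the definitions given in the text: ARE as the mean absolute error relative to the reference value (in percent), RPD as the standard deviation of the reference values divided by the RMSEP, bias as the mean of reference minus predicted values, and t_bias as |bias|·√n_v/SDV. The function name and the toy values are illustrative assumptions; the critical t value for n_v − 1 degrees of freedom must be taken from a table or a statistics library.

```python
import math
import statistics


def figures_of_merit(reference, predicted):
    """Validation figures of merit for one property.

    reference and predicted are equal-length sequences of reference values
    and model predictions for the validation samples.
    """
    n = len(reference)
    errors = [y - p for y, p in zip(reference, predicted)]
    rmsep = math.sqrt(sum(e * e for e in errors) / n)
    # Average relative error, in percent of the reference value.
    are = 100.0 / n * sum(abs(e) / abs(y) for e, y in zip(errors, reference))
    # Ratio of the reference standard deviation to the RMSEP.
    rpd = statistics.stdev(reference) / rmsep
    bias = sum(errors) / n
    # Standard deviation of the validation errors around the bias.
    sdv = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))
    t_bias = abs(bias) * math.sqrt(n) / sdv
    return {"RMSEP": rmsep, "ARE": are, "RPD": rpd,
            "bias": bias, "SDV": sdv, "t_bias": t_bias}


# Toy validation set (hypothetical values).
fom = figures_of_merit([10.0, 20.0, 30.0, 40.0], [11.0, 19.0, 31.0, 41.0])
```

In this toy case t_bias is 1.0, which is below the two-sided 95% critical value for 3 degrees of freedom (about 3.18), so no significant systematic error would be flagged.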
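The precision of the models was evaluated by the analysis of 14 replicates of 30 diesel samples performed on different days. The average of the relative standard deviations (RSD) and the intermediate precision, calculated through (5), where n is the number of samples, m is the number of replicates, $\hat{y}_{ij}$ is the predicted value of replicate j of sample i, and $\bar{\hat{y}}_i$ is the mean predicted value of sample i, were used as parameters [37]. Then, the intermediate precision was compared to the repeatability value of the reference method:

$$\mathrm{intermediate\ precision} = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \left( \hat{y}_{ij} - \bar{\hat{y}}_i \right)^2}{n\,(m - 1)}} \qquad (5)$$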
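The two precision parameters can be sketched in Python under the assumption that the intermediate precision is the pooled standard deviation of the replicate predictions, that is, the square root of the summed squared deviations of each replicate from its sample mean divided by n(m − 1). The function names and the toy values are illustrative assumptions.

```python
import math
import statistics


def intermediate_precision(replicates):
    """Pooled standard deviation over n samples with m replicates each.

    replicates[i][j] is the predicted value of replicate j for sample i.
    """
    n = len(replicates)
    m = len(replicates[0])
    sum_sq = 0.0
    for row in replicates:
        mean = sum(row) / m
        sum_sq += sum((v - mean) ** 2 for v in row)
    return math.sqrt(sum_sq / (n * (m - 1)))


def mean_rsd(replicates):
    """Average relative standard deviation (percent) over all samples."""
    rsds = [100.0 * statistics.stdev(row) / statistics.fmean(row)
            for row in replicates]
    return statistics.fmean(rsds)


# Toy data (hypothetical): 2 samples with 3 replicate predictions each.
toy = [[1.0, 2.0, 3.0], [10.0, 12.0, 14.0]]
precision = intermediate_precision(toy)
average_rsd = mean_rsd(toy)
```

For the design described in the text, `replicates` would have 30 rows (samples) and 14 columns (days), giving 30 × 13 = 390 degrees of freedom for the pooled estimate.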

Physicochemical Assays.
The reproducibility and repeatability values of the reference methods and the conformity ranges of the quality parameters are presented in Table 2. The quality parameter that presented the highest number of nonconforming samples was T10, followed by T85 and biodiesel content. As ANP Resolution no. 65 allows a variation of only 0.5% (v/v) in biodiesel content, most of the samples were in a narrow range of concentration. The same occurred with the total sulfur but in two different ranges of concentration due to the availability of two types of commercial diesel with distinct sulfur content.

Spectroscopic Analysis.
The FTIR spectra of all diesel samples are represented in Figure 1. Functional groups of the sample constituents could be observed through the characteristic absorption bands of each group of atoms in the infrared spectra. The most intense bands were caused by the stretching (3000–2800 cm−1) and angular deformations (1464 cm−1 and 1379 cm−1) of C-H groups [42]. The bands at 2350 cm−1 and 667 cm−1 resulted, respectively, from the asymmetrical stretch and angular deformation of CO2 molecules present in the atmosphere [43]. The presence of biodiesel in the samples was observed by the carbonyl absorption band (1750–1735 cm−1) and the aliphatic ester absorption band (1300–1000 cm−1). Aromatic compounds had characteristic bands of low intensity at 900–675 cm−1 from the C-H out-of-plane angular deformation. Sulfur is present in diesel as mercaptans and sulfides, and it was observed by the S-H axial stretch at 2600–2550 cm−1 and the C-S axial stretch at 700–650 cm−1 [43,44]. The S-H stretch was very weak; however, few groups absorb in this region, so it was useful for the total sulfur parameter. The assignment of vibrational groups to each band is presented in Table 3.

Outlier Detection.
During calibration, outlier statistics were applied to identify samples that had unusual leverage and studentized residuals. The outlier detection was performed prior to the variable selection because the exclusion of variables may reduce the outlier detection capabilities of the model [28]. The number of outliers from each sample set is shown in Table 4. Considering the calibration and validation sample sets with, respectively, 273 and 136 samples, the number of outliers (3% maximum) was not significant for the prediction models.
High studentized residual values may be the result of errors in the reference measurement, spectral acquisition error, reference value transcription, or even a failure of the model. An error in the spectral acquisition would lead to the presence of the same outlier in all prediction models; however, different outliers were detected for each model. The absence of new outliers in the model after removal of the anomalous samples indicated that there was no failure in the model. Therefore, errors in the reference values were most likely responsible for the presence of outliers.

Preprocessing Evaluation.
The baseline shifting in the raw spectra can be observed in Figure 1. The shifting may be the result of variations in the position of the ZnSe crystal since it was removed from the spectrometer for cleaning before each analysis. All the evaluated preprocessing methods (derivatives, MSC, SNV, GLSW, and OSC) provided baseline correction and higher correlation coefficients than mean center or autoscale preprocessing. Moreover, the multivariate filters (GLSW and OSC) provided models with greater explained variance using fewer latent variables (Table 5). Therefore, all models were preprocessed using OSC, except the model for T85, which presented a better fit with GLSW preprocessing.

Variable Selection.
The exclusion of regions without information on the sample constituents or with a low signal-to-noise ratio may improve the performance of the models [35]. In Figure 2, the noisy spectral regions can be observed through the relative standard deviation (RSD), represented by the blue line, and calculated from the mean of 14 replicates, represented by the red line. In addition, there was no absorption by the components of diesel in the ranges 4000–3100 cm−1 and 2450–1950 cm−1; thus, these spectral regions were excluded, and new models were developed. The RMSEP and correlation coefficient of validation (r_val) obtained by the different variable selection approaches are presented in Table 6. The manual exclusion of variables provided better results only for the prediction models of flash point, T10, cetane index, and biodiesel content. The manual selection of variables carries the risk of inadvertently excluding variables that are important for the modeling, impairing the performance of the model. The selection of variables by interval selection methods reduces the values of RMSEC and RMSECV but might decrease the predictive ability of the model. The FiPLS method usually uses few intervals to correlate the spectral variables with the property of interest and, as a consequence, the calibration model is more susceptible to overfitting and the prediction of unknown samples is impaired, especially for properties that are correlated to several spectral variables. As the sulfur content is correlated only to the S-H and C-H bond variables, the FiPLS method provided the best fit to the model. The distillation temperatures and the cetane index depend on the size and structure of the hydrocarbon chains of the diesel components; therefore, these are properties related to several functional groups with a response in the mid-infrared region.
Thus, selection methods such as BiPLS, which seek to exclude noisy variables rather than include the variables most correlated to the property of interest, tend to be more suitable for the optimization of these diesel parameters. The analytical signals in the mid-infrared region result in many correlated variables; that is, FTIR data present many collinearities. Normally, the problem of collinearity can be attenuated by the application of the genetic algorithm, since the spectral variables are manipulated in binary strings and the search for the variables that provide a lower prediction error is performed by a probabilistic and nonlocal process [35]. GA was the best variable selection approach for the prediction models of density, flash point, T50, and biodiesel content.
In general, the selection of variables by iPLS and GA provided improvements in the predictive ability of the calibration models, except for T85, and the difference between the results obtained by both algorithms was not significant. The selected variables used in the best-fitted models are presented in the supplementary material attached to the article.

Model Performance.
After defining the most appropriate variable selection method for each parameter, the figures of merit were determined for the prediction models (Table 7). The complexity of the diesel composition, consisting of hundreds of compounds, generates a large amount of information in the FTIR spectra and, therefore, the correlation between the matrix X and the property of interest requires a considerable number of LVs. Although the use of OSC reduces the collinearity problem and increases the captured variance of the X and y blocks, several analytical signals were correlated to the properties of diesel, so several LVs were required. The accuracy of the models was evaluated by comparing the RMSEP values (Table 7) with the reproducibility values of the reference methods (Table 2). Since all models presented RMSEP values below or equivalent to the reproducibility value, the FTIR/PLS method could be considered accurate for predicting diesel parameters. In addition, the correlation coefficients were above 0.89, except for T85; thus, the predicted values were well correlated with the reference values (Figure 3). Although the prediction model for sulfur content presented r_val equal to 0.987, the obtained ARE value was high when compared to the others. The high relative errors that resulted in an ARE equal to 14.10% were caused by the low sensitivity of the model for the prediction of S500 diesel samples. However, the RPD value indicated that the model was accurate when the RMSEP value is compared to the sulfur content range of the validation sample set. The determination coefficients (R²) indicated that the prediction models of flash point and T85 presented lower linearity than the other parameters. Figure 3 shows that, for these parameters, the residuals tend toward negative values as the reference value increases. Although the models have low bias values, the t-test revealed that there were systematic errors in the prediction models for sulfur content, T85, and biodiesel content.
Since the models for sulfur and biodiesel content presented good linearity (R²_val > 0.88), the presence of systematic errors can be reduced by the addition of more samples to the model. The precision of the models was evaluated by the analysis of 30 diesel samples on 14 consecutive days. Although the samples were stored at 10°C between the analyses, diesel fuel consists of semivolatile compounds and, therefore, changes in sample composition during the acquisition of the replicates imply an increase in measurement uncertainty.
The intermediate precision values of the models were above the repeatability values of the reference methods, except for the biodiesel content prediction. However, the RSD values showed that almost all models had good precision (RSD below 1%) and only the models for the prediction of flash point and sulfur content presented low precision.
Although the prediction models for flash point, sulfur content, and T85 have the limitations mentioned above, the conformity ranges of these parameters (Table 2) can be assessed reliably by the FTIR/PLS models since the accuracy and precision of the method are known. If an unknown sample is analyzed by FTIR and the result obtained is in the nonconformity range, it is recommended that the result be confirmed by the standard method. Since only about 2% of the diesel samples in Brazil presented nonconformities in 2017 [45], the FTIR/PLS method can be applied in routine monitoring of diesel quality to reduce the costs and time of analysis.

Conclusions
This study showed the possibility of applying ATR-FTIR spectroscopy with the PLS regression method to predict the quality parameters (density; flash point; total sulfur content; distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery; cetane index; and biodiesel content) in commercial diesel samples.
All the evaluated preprocessing methods (derivatives, MSC, SNV, GLSW, and OSC) provided baseline correction and higher correlation coefficients. In addition, the GLSW and OSC preprocessing provided greater explained variance to the model using fewer latent variables. The selection of variables by iPLS or GA provided better predictive ability to the calibration models, except for T85. However, the difference between the results obtained by both algorithms was not significant.
According to the model validation, all PLS models presented acceptable accuracy when compared to the values of reproducibility and had good precision, except for the sulfur content prediction of S500 diesel samples. Since the application of the ATR-FTIR/PLS method is able to reduce costs and considerably increase the analytical frequency, the diesel quality monitoring programs, as well as the final consumer, can benefit greatly from the application of the proposed method.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Supplementary Materials
The supplementary material shows the variables used in the PLS models that presented the best predictive abilities. The variable selection method used for each diesel property is also presented in the supplementary material. Supplementary Figure 1