Dynamic Localized SNV, Peak SNV, and Partial Peak SNV: Novel Standardization Methods for Preprocessing of Spectroscopic Data Used in Predictive Modeling

An essential part of multivariate analysis in a spectroscopic context is preprocessing. The aim of preprocessing is to remove scattering phenomena or disturbances in the spectra due to measurement geometry in order to improve subsequent predictive models. Especially in vibrational spectroscopy, the Standard Normal Variate (SNV) transformation has become very popular and is widely used in many practical applications, but standardization is not always ideal when performed across the full spectrum. Herein, three new standardization techniques are presented that apply SNV to defined regions rather than to the full spectrum: Dynamic Localized SNV (DLSNV), Peak SNV (PSNV), and Partial Peak SNV (PPSNV). DLSNV is an extension of the Localized SNV (LSNV) that allows a dynamic starting point of the localized windows on which the SNV is executed individually. Peak and Partial Peak SNV are based on picking regions from the spectra with a high correlation to the target value and performing SNV on these essential regions to ensure optimal scatter correction. All proposed methods are able to significantly improve the model performance in cross validation and robustness tests compared to SNV. The prediction errors could be reduced by up to 16% and 29% compared with LSNV for two regression models.


Introduction
Chemometric approaches are becoming increasingly popular as they enable a more comprehensive extraction of relevant information from the complex data provided by modern instrumental analytics. At the same time, advances in data analysis make it possible to reduce the size of the instrument hardware by compensating for the lower measurement quality of miniaturized instruments. In combination with multivariate calibration, low-cost analytics such as vibrational spectroscopy allow the development of models that predict parameters usually determined with cost-intensive measuring instruments or complex methods. Monitoring the alcoholic fermentation [1] and determining the viscosity of engine oil [2,3] or proteins in milk [4] by spectroscopic means thus become feasible. It has also been possible to determine specific viscosity modifiers and pour point depressant additive compounds in engine oils [5] by FTIR, which is possible because, according to the Lambert-Beer law, the light absorbance of the medium depends linearly on the concentration of a component [6,7].
Preprocessing methods play a decisive role for the performance of these models, as spectra can be influenced by various disturbing factors that interfere with the significance of the measurement [8-11]. The main influence comes from the measuring geometry, which includes the sample thickness, the distance from the detector to the sample, the contact pressure, and the angle from the light source to the sample [12,13]. The elimination of scattering effects caused by particles of different size and distribution also plays a major role in preprocessing.
Different spectroscopic measurement techniques suffer from different major disturbing factors. In near-infrared spectroscopy, it is usually a constant or linear baseline offset due to scattered light; Raman spectra often show a polynomial fluorescence background; and for mid-infrared spectra, the sample thickness and thus the spectroscopic response plays a crucial role [14,15]. The information about the sample is present in the shape of the spectrum and is independent of the offset (additive effect) and the scaling of the absolute signal intensity (multiplicative effect). The task of preprocessing is to remove these interfering factors from the informative part of the spectrum, and there are different approaches for this.
A method for eliminating constant offset terms is to calculate the first derivative [9]. This procedure can be extended to higher-order derivatives, which also eliminate offset terms with linear or quadratic baseline curves. The disadvantage of calculating the derivative of a spectrum is that noise effects are amplified.
Multiplicative signal correction (MSC) is another tool which can deal with the two major effects. A reference spectrum, in most cases represented by the mean spectrum of the calibration data set, is defined, and the spectra are corrected for the baseline and the multiplicative amplification effects [16,17]. The approach is associated with the Kubelka-Munk theory, which takes optical phenomena caused by light scattering into account [18,19]. For each spectrum, the two correction parameters are estimated via a least squares regression calculation.
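The MSC correction described above can be illustrated with a minimal NumPy sketch. It assumes that the spectra are stored row-wise and that each spectrum is modeled as an offset plus a scaled copy of the reference; the function name `msc` and its arguments are illustrative and not taken from the cited works.

```python
import numpy as np

def msc(spectra, reference=None):
    """Sketch of multiplicative signal correction for row-wise spectra."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        # Common choice: the mean spectrum of the calibration set
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for k, x in enumerate(spectra):
        # Least squares fit of x ~ a + b * reference
        # (np.polyfit returns the slope first, then the intercept)
        b, a = np.polyfit(reference, x, deg=1)
        # Remove the additive offset a and the multiplicative factor b
        corrected[k] = (x - a) / b
    return corrected
```

With an explicit reference, spectra that differ from it only by an offset and a scale factor are mapped back onto the reference itself.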
Standard normal variate (SNV) removes a constant offset term by subtracting the mean value of the full spectrum and brings all spectra to the same scale by subsequent division by the standard deviation of the full spectrum [20]. Due to its simplicity, SNV is a popular preprocessing method [21]. SNV and MSC usually yield similar results and are often regarded as exchangeable [22]. Since the SNV transformation needs no extra regression step to estimate the correction parameters, and the models should be kept as simple as possible, the focus in the following lies on SNV.
Some efforts have been made to optimize standardization techniques. A piecewise MSC (PMSC) method has been proposed by Isaksson and Kowalski [23], which significantly improved the predictive power of several regression models based on near-infrared transmittance spectra. A Localized SNV (LSNV) approach has been introduced by Bi et al., performing the SNV not on the full spectrum but on consecutive sequences [24]. This strategy also yielded very promising results in several regression cases based on benchmark NIR data sets. In the following, a dynamic version of the LSNV algorithm, called DLSNV, is presented. By allowing for a dynamic starting point of the first and subsequent SNV windows, it offers more flexibility to align the SNV to important vibrational bands in the spectra. PSNV and PPSNV are based on the idea that the standardization can be optimized when performed on distinct wavenumber windows across highly specific regions of the spectrum.

Experimental
As a sample set, data originating from an investigation of aging and interaction phenomena in Automatic Transmission Fluids (ATF) were used. Many ATF samples were stored for different periods at several temperatures to produce artificially aged samples.
The aim of the presented study was to transfer information coming from a highly specific, costly, and complex measurement method (High-Performance Liquid Chromatography coupled with Quadrupole Time-of-Flight Mass Spectrometry (HPLC-QToF-MS)) to data measured with a low-cost, flexible tabletop instrument (Fourier-Transform Infrared (FTIR) spectrometer). This was achieved by analyzing each sample coming from the storage experiment and determining the additive response signals in these samples by HPLC-QToF-MS. By using these additive responses as reference values, a calibration model was created in order to be able to predict the concentration of the additive compounds in the samples by evaluating the FTIR spectra. The new standardization techniques proposed here are tested on these regression models.

Additive Compounds. Two additive compounds from two different ATF oils were analyzed:
Within ATF A: an unsaturated ethoxylated amine known as friction modifier
Within ATF B: a bis-tert-butyl-hydroxytoluene (BHT) derivative known as phenolic antioxidant

Samples and Experiments.
For the investigation of degradation phenomena in ATFs, a comprehensive storage experiment had been set up. The effects of different materials on ATFs and the impact of temperature on oil aging were to be analyzed.
Therefore, the ATFs were stored under various conditions in an oven. Three parameters had been varied: the storage temperature, the storage time, and added materials. The storage times had been adjusted to the temperatures so that a comparable load, according to the Arrhenius law, could be expected. The parameters are listed in Table 1.
For all time/temperature combinations, three interaction experiments have been conducted:
(i) storage with pure oil
(ii) storage with oil plus copper alloy chips
(iii) storage with oil plus chips from copper alloy, iron, and PA66
The samples were prepared by storing 100 ml fresh oil in a glass jar with a screw cap. The lid had been modified with a central hole that allowed air exchange.

FTIR.
The FTIR spectra were collected in transmission with a Bruker Alpha instrument in combination with the QuickSnap™ transmission sample compartment in the wavenumber region ranging from 4000 to 600 cm⁻¹ with a spectral resolution of 4 cm⁻¹. The samples were measured without any special sample preparation with two different setups: (1) a droplet of ATF between two potassium bromide (KBr) discs separated by a Teflon spacer with a thickness of about 50 μm, and (2) a fixed KBr cuvette of 100 μm thickness filled with ATF.
After each sample measurement, the KBr discs and the cuvette were rinsed several times with petroleum ether in order to prevent cross contamination. The cuvette was dried with N₂ gas after rinsing, and the KBr discs were dried under ambient air. For measurement type (1), 4 spectra per sample were recorded, and for type (2), one spectrum per sample was recorded.
Due to the sample layer thickness, the hydrocarbon bands are saturated, and therefore, the spectra had to be cut in the wavenumber regions between 3000 and 2815 cm⁻¹ (C-H stretching mode) and between 1491 and 1424 cm⁻¹ (C-H bending and rocking mode). Additionally, the CO₂ bands were eliminated by cutting out the region from 2387 to 2285 cm⁻¹. The spectra of ATF A are shown in Figure 1(a) in transmission as measured, without any preprocessing; in Figure 1(b) after truncation and SNV transformation; and in Figure 1(c) SNV transformed after calculating the absorbance spectra by using A = −log(T). In Figure 2, the same diagrams are shown for ATF B. In both cases, two series of curves can be discriminated in the raw spectra by eye. The blue series comes from measurement type (1), and the red set comes from the cuvette measurements (2). Combining the two data sets from the measurement setups (1) and (2) is a challenging task for a predictive model, as the main variance is due to the thickness variation. The data set demonstrates the importance of suitable and sophisticated preprocessing methods in order to eliminate the difference in the spectra induced by the varying sample thickness. The standardization techniques presented here are able to meet this need.
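The truncation and the absorbance conversion described above can be sketched as follows. This is an illustrative example only: the helper names are hypothetical, the region limits are assumed to be inclusive, the logarithm is assumed to be decadic, and the transmission is assumed to be given as a fraction between 0 and 1.

```python
import numpy as np

# Regions cut out in the text (cm^-1): saturated C-H bands and the CO2 bands
CUT_REGIONS = [(2815.0, 3000.0), (1424.0, 1491.0), (2285.0, 2387.0)]

def truncate(wavenumbers, spectra, regions=CUT_REGIONS):
    """Drop all data points whose wavenumber lies inside one of the regions."""
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    spectra = np.asarray(spectra, dtype=float)
    keep = np.ones(wavenumbers.shape, dtype=bool)
    for low, high in regions:
        # Assumption: both region limits are removed as well
        keep &= ~((wavenumbers >= low) & (wavenumbers <= high))
    return wavenumbers[keep], spectra[:, keep]

def to_absorbance(transmission):
    """A = -log10(T); assumes T is a fraction with 0 < T <= 1."""
    return -np.log10(np.asarray(transmission, dtype=float))
```

Truncating before standardization keeps the saturated bands from dominating the mean and standard deviation of a spectrum.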

Liquid Chromatography Coupled with Mass Spectrometry.
The measurements for the determination of the additive compound signals were performed with an Agilent liquid chromatograph 1260 coupled with a high-resolution QToF 6540 mass spectrometer with methanol/water/ammonium acetate and isopropanol as an eluent. Ionization was carried out by means of electrospray (ESI). The final compound peak area data set was created using the Agilent MassHunter Qualitative Analysis B.06.00 analysis software.
The response signals of the additive compounds were standardized by subtracting the mean and dividing by the standard deviation in order to bring all signal values onto the same scale.
The standardized signals are depicted in Figure 3.

Implementation.
The proposed novel standardization methods and the respective optimization processes were implemented via Python scripts.

Regression Algorithm-Ridge.
For the prediction, the ridge regression estimator implemented in the Python scikit-learn framework for machine learning applications was used [25]. It is a linear model which solves a regression task via the least squares loss function J(w) with L2 regularization [26]. Regularization is an approach to minimize the issue of overfitting, which is particularly important for high-dimensional data such as FTIR spectra, by controlling the sum of squares of the model coefficients w. This is done by adding the L2 penalty term weighted by the hyperparameter λ.
Thus, the loss function is defined as

$$J(w) = \sum_{i=1}^{n} \left( y_i - y_{i,\mathrm{pred}} \right)^2 + \lambda \lVert w \rVert_2^2$$

where y_i stands for the reference value of the ith sample and y_{i,pred} for the prediction of this sample. Since the performance of the preprocessing methods has to be assessed independently of the regression model actually used, the same regression model with an identical hyperparameter λ was applied to the various preprocessed data sets. For the regression of the friction modifier compound of ATF A, λ = 5, and for the antioxidant of ATF B, λ = 3 was used. These parameters turned out to be the best choices regarding cross validation and robustness for the SNV-transformed data set in a previously conducted internal study.
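As an illustration of the regression setup, the following sketch fits the scikit-learn `Ridge` estimator, whose `alpha` parameter plays the role of λ, and evaluates the penalized loss for a fitted model. The helper names `fit_ridge` and `ridge_loss` are hypothetical, and the sketch is not the implementation used in the study.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_ridge(X, y, lam):
    """Fit a ridge model; scikit-learn calls the hyperparameter `alpha`."""
    return Ridge(alpha=lam).fit(X, y)

def ridge_loss(model, X, y, lam):
    """Sum of squared residuals plus the L2 penalty on the coefficients."""
    residuals = np.asarray(y, dtype=float) - model.predict(X)
    return np.sum(residuals ** 2) + lam * np.sum(model.coef_ ** 2)
```

A larger λ shrinks the coefficient vector, which trades a higher calibration error for a model that reacts less strongly to perturbations of individual data points.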

Model Performance Evaluation.
To assess the performance of our models, two different approaches were chosen, namely, the predictive power under cross validation and noise addition.

Cross Validation.
For cross validation, the mean of the different measurements of one sample was calculated. The sample set was randomly divided 50 times into a calibration and a validation set by taking 70% of the data as training samples and 30% as test samples, with a different combination in each validation iteration. Each separation run was provided with a unique random seed to ensure that the data set was split into the same training and test sets for each model, enabling better comparability of results between the different models.
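The repeated random splitting can be sketched with scikit-learn as shown below. The function name, the use of the loop index as random seed, and the assumption that replicate spectra have already been averaged per sample are illustrative choices rather than details confirmed by the text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def repeated_split_rmsep(X, y, lam, n_repeats=50, test_size=0.3):
    """RMSEP of a ridge model for repeated, seeded 70/30 random splits."""
    rmsep = []
    for seed in range(n_repeats):
        # A fixed seed per run yields identical splits for every preprocessing
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = Ridge(alpha=lam).fit(X_tr, y_tr)
        errors = y_te - model.predict(X_te)
        rmsep.append(np.sqrt(np.mean(errors ** 2)))
    return np.array(rmsep)
```

Because the seeds are fixed, calling the function with differently preprocessed versions of the same samples compares the methods on identical calibration and validation sets.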

Robustness against Noise.
In order to assess the model performance under noisy input spectra, the model was calibrated with the full original data set. Random Gaussian-distributed white noise was added to each data point. These perturbed samples were predicted by the model, and the prediction error was monitored. This was done for different noise levels. The random numbers added to each data point were generated by a standard normal distributed (mean: μ = 0 and standard deviation: σ = 1) random number generator.
The noise levels were defined by the factors (0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, and 0.45), which were multiplied with the output of the random number generator. For each noise level, 50 simulated noisy data sets were generated and predicted by the pretrained model in order to be able to make well-founded statements about the model performance under noise perturbation.
The noise robustness workflow is a very helpful tool to investigate whether a good calibration error is a real advantage or whether the model ran into overfitting. When the same regression algorithm is used twice with different regularization parameters λ, the less regularized model will generate a lower initial calibration error than the more stringently regularized model. But if the models are tested for robustness, the latter tends to have a lower error slope when the noise level increases.
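A minimal sketch of this noise robustness test is given below. It assumes a pretrained model that is passed in as a `predict` callable (for example the `predict` method of a fitted regressor) and that the noise is added to the same representation of the spectra the model was calibrated on; the function and parameter names are hypothetical.

```python
import numpy as np

NOISE_LEVELS = (0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45)

def noise_robustness(predict, X, y, levels=NOISE_LEVELS, n_repeats=50, seed=0):
    """Mean and standard deviation of the RMSE per noise level."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    means, stds = [], []
    for level in levels:
        errors = []
        for _ in range(n_repeats):
            # Standard normal noise scaled by the noise level factor
            noisy = X + level * rng.standard_normal(X.shape)
            errors.append(np.sqrt(np.mean((y - predict(noisy)) ** 2)))
        means.append(np.mean(errors))
        stds.append(np.std(errors))
    return np.array(means), np.array(stds)
```

Plotting the returned means against the noise levels, with the standard deviations as error bars, gives the kind of error curve whose slope is compared between preprocessing methods.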

Evaluation Metrics.
The built-in functions R² score and mean squared error (MSE) of the scikit-learn framework were used as performance metrics.

Mean Squared Error (MSE).
The mean squared error (MSE) of a prediction is calculated from the squared differences between the predicted value y_{i,pred} and the reference value y_i of the ith sample. For a given data set with n samples, the MSE is the average value over all samples. It follows the formula [27]

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_{i,\mathrm{pred}} \right)^2$$

The best possible MSE value is 0, and small values are desirable as they indicate a low deviation from the correct prediction. From the MSE, the root-mean-squared error (RMSE) was calculated by taking the square root. The RMSE value has the same dimension as the original reference target values.
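As a small worked example of these two metrics, the following sketch computes MSE and RMSE with NumPy. The function names and the numbers are purely illustrative and are not data from the study.

```python
import numpy as np

def mse(y_ref, y_pred):
    """Mean of the squared differences between reference and prediction."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_ref - y_pred) ** 2)

def rmse(y_ref, y_pred):
    """Square root of the MSE, in the units of the target values."""
    return np.sqrt(mse(y_ref, y_pred))

# Illustrative values: squared errors 0.25, 0, 0.25, 0 give MSE = 0.125
y_ref = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.5, 3.0]
print(mse(y_ref, y_pred), rmse(y_ref, y_pred))
```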

R² Coefficient of Determination.
R² describes the portion of the variance in the target values (dependent variables) that can be predicted from the spectra (independent variables) by the model [28]. The best possible score for R² is 1.0. R² is 0.0 for a constant model that always predicts the mean of the reference values, regardless of the input features. For linear regression modeling with intercept, R² is equal to the square of the Pearson correlation coefficient between predicted and reference target values [29]. For a data set comprising n samples, the R² score is given as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - y_{i,\mathrm{pred}} \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$

where y_{i,pred} is the model prediction of the ith sample, which has the reference value y_i, and ȳ is the mean of all reference values.

Standard Normal Variate (SNV).
SNV standardizes each spectrum by bringing it to zero mean and unit variance. For this purpose, the mean value x̄ of the spectrum is subtracted from each data point x_i, and the result is divided by the standard deviation σ of the spectrum:

$$x_{i,\mathrm{SNV}} = \frac{x_i - \bar{x}}{\sigma}$$

with x̄ and σ denoting the mean and the standard deviation over all data points of the spectrum.
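The SNV transformation can be sketched in a few lines of NumPy. The example assumes row-wise spectra; the choice of the sample standard deviation (`ddof=1`) is an assumption, since the divisor is not specified here, and the function name `snv` is illustrative.

```python
import numpy as np

def snv(spectra, ddof=1):
    """Standard normal variate: each spectrum to zero mean and unit variance."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    # Mean and standard deviation are taken per spectrum (per row)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=ddof, keepdims=True)
    return (spectra - mean) / std
```

Two spectra that differ only by a constant offset and a positive scale factor are mapped onto the same SNV-transformed spectrum, which is the property exploited for scatter correction.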

Dynamic Localized SNV (DLSNV).
The DLSNV workflow is based on the SNV-transformed spectra data set (Figure 4(a)). To calculate the DLSNV data, the spectra are divided into multiple regions. On each of these regions, standardization is performed. To adjust the windows to important areas in the spectrum, a starting point can be defined. In Figure 4(b), the DLSNV spectra are shown with a starting point of 100 and a window size of 300 pixels.

DLSNV algorithm
(i) Perform SNV on a window of the spectrum ranging from the first data point to the sth one
(ii) Subdivide the spectra from the sth data point onward into windows of equal size ws, and perform SNV on each window

To optimize the two parameters, window size ws and starting point s, a three-step approach is performed. In each step, the predictive power of the model is assessed via R². The optimization steps can be summarized as follows:
(1) Perform LSNV with window sizes from 50 to 500 pixels, and determine R² for all window sizes. Find the optimal window size ws_opt1.
(2) Perform LSNV with the optimal window size of step 1, ws_opt1, vary the starting point from 0 to 2·ws_opt1, and select the optimal starting point s_opt.
(3) Perform LSNV with the optimal starting point s_opt and window sizes from 50 to 2·ws_opt1 in order to find the best combination of window size ws_opt2 and starting point s_opt.
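The DLSNV algorithm and the three-step optimization above can be sketched as follows. This is an illustrative sketch, not the original implementation: the function names, the grid step sizes, the `score` callable (assumed to return R² for a transformed data set), and the merging of very short edge windows into their neighbors are all assumptions.

```python
import numpy as np

def snv(segment, ddof=1):
    """SNV of row-wise data restricted to one window."""
    mean = segment.mean(axis=1, keepdims=True)
    std = segment.std(axis=1, ddof=ddof, keepdims=True)
    return (segment - mean) / std

def dlsnv_edges(n_points, start, window, min_points=3):
    """Window edges: [0, start) first, then windows of equal size."""
    edges = [0] if start > 0 else []
    edges += list(range(start, n_points, window))
    edges.append(n_points)
    # Assumption: edge windows with too few points join their neighbor
    if len(edges) > 2 and edges[-1] - edges[-2] < min_points:
        del edges[-2]
    if len(edges) > 2 and edges[1] - edges[0] < min_points:
        del edges[1]
    return edges

def dlsnv(spectra, start, window):
    """Apply SNV separately on every DLSNV window (start=0 gives LSNV)."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    edges = dlsnv_edges(spectra.shape[1], start, window)
    out = np.empty_like(spectra)
    for lo, hi in zip(edges[:-1], edges[1:]):
        out[:, lo:hi] = snv(spectra[:, lo:hi])
    return out

def optimize_dlsnv(spectra, score, sizes=range(50, 501, 50), step=10):
    """Three-step search; `score` maps transformed spectra to R^2."""
    # Step 1: plain LSNV, best window size
    ws1 = max(sizes, key=lambda ws: score(dlsnv(spectra, 0, ws)))
    # Step 2: best starting point between 0 and 2 * ws1
    s_opt = max(range(0, 2 * ws1 + 1, step),
                key=lambda s: score(dlsnv(spectra, s, ws1)))
    # Step 3: re-tune the window size for the chosen starting point
    ws2 = max(range(50, 2 * ws1 + 1, step),
              key=lambda ws: score(dlsnv(spectra, s_opt, ws)))
    return s_opt, ws2
```

In practice, `score` would wrap the calibration of the regression model on the transformed spectra, so that each candidate combination of starting point and window size is judged by the resulting model performance.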
In Figure 4(d), the final DLSNV spectra after optimization are shown. Note that jumps can occur between the individual standardization windows since, for each window, the mean value of that window is subtracted. However, this does not affect the regression model.
In Figure 5(a), the ATF A samples are shown with SNV performed on the entire spectral region, and in Figure 5(b), the same spectra are depicted after DLSNV optimization. Figure 5(c) shows a zoom-in view of the highlighted region of Figure 5(a), and in Figure 5(d), the same region is depicted after DLSNV optimization. The baseline is removed for exactly this spectral sequence, and thus, peaks are aligned in a way that the different aging levels of the samples can already be recognized by eye. The spectral excerpt shown is the phenolic antioxidant region. Thus, the decrease of this band can be associated with the aging level. Magenta indicates (relatively) fresh samples, whereas red indicates a strong degradation level.

Peak SNV.
The idea behind the Peak SNV method is to standardize the important areas of the spectrum independently of each other. The optimization workflow for PSNV is shown in Figure 6, starting from the single SNV-transformed data set. Data points with a high correlation with the target values (points of interest, POI) are selected (Figure 6(a)), and the SNV transformation is performed on windows around the centroids. Once the POIs are identified, the PSNV transformation is conducted as follows:
PSNV algorithm
(i) Subdivide the spectra into sequences ranging from half the distance to the previous POI to half the distance to the next one (Figure 6(b)). SNV is performed across these windows.
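The window construction of the PSNV algorithm can be sketched as follows, assuming that the POI centroids are already known as data point indices. The function names are hypothetical, and it is an assumption that the first window starts at the beginning and the last window ends at the end of the spectrum.

```python
import numpy as np

def psnv_edges(n_points, pois):
    """Window edges at the midpoints between neighboring POI centroids."""
    pois = np.sort(np.asarray(pois, dtype=float))
    midpoints = (pois[:-1] + pois[1:]) / 2.0
    inner = np.ceil(midpoints).astype(int)
    # Assumption: outer windows extend to the ends of the spectrum
    return [0, *inner.tolist(), n_points]

def psnv(spectra, pois, ddof=1):
    """Apply SNV separately on each window around a POI centroid."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    edges = psnv_edges(spectra.shape[1], pois)
    out = np.empty_like(spectra)
    for lo, hi in zip(edges[:-1], edges[1:]):
        segment = spectra[:, lo:hi]
        mean = segment.mean(axis=1, keepdims=True)
        std = segment.std(axis=1, ddof=ddof, keepdims=True)
        out[:, lo:hi] = (segment - mean) / std
    return out
```

For centroids at data points 100, 300, and 700 in a spectrum of 1000 points, the windows are [0, 200), [200, 500), and [500, 1000), so a centroid far away from its neighbors ends up in a wide window.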
To find the POIs, an initial regression model is fitted to the data. In order to identify important regions of the spectra, the model coefficients are assessed. The normalized absolute values of the coefficient vector are fed into a peak-picking algorithm. Since it may occur that POIs are in close proximity, an agglomeration of the POIs is conducted in order to prevent very narrow standardization windows. Peak centroids are calculated as the mean value of the combined POIs. The task of the optimization process is to find the best window for POI agglomeration, agg_opt, which is done by analyzing the calibration R² for each agglomeration window and picking the window size with maximal correlation between the predicted and reference target values (Figure 6(c)). The steps are summarized as follows:
(1) Fit the data set to the target values (only calibration)
(2) Pick peaks from the normalized model coefficient vector (|w|/max(|w|)), using a peak threshold of 0.1
(3) Combine peaks which lie within a certain window agg, and calculate the centroid of the agglomerated POIs
(4) Perform PSNV across the centroids of the POIs
(5) Evaluate the performance via R² for agg between 10 and 50 data points, and choose agg_opt according to maximal R²

After optimization, each window has an individual size and ranges over the peak centroid of an important signal in the spectrum. On these windows, the SNV transformation provides an optimal baseline and scatter effect removal. The optimized spectrum is shown in Figure 6(d).

Partial Peak SNV.
The idea behind Partial Peak SNV is similar to PSNV: picking the regions of the spectrum which show a high correlation with the target values, agglomerating POIs in close proximity, and standardizing these important spectral features (Figure 7(a)). But unlike for PSNV, not the whole spectrum is finally taken into account but only a small window around each POI. It may occur that the same data point appears several times in different standardizations (see the overlapping regions in Figure 7(b)). The optimization concludes with the following steps:
(3) Perform PPSNV across the peaks with the window size pw
(4) Evaluate the performance via R² for pw between 1 and 200 data points, and choose pw_opt according to maximal R²

The saturation effect in the low intensity area of the compound response is almost completely removed in the latter three cases. It is also notable that the scattering around the green bisecting line is significantly reduced. Thus, the confidence interval for the predictions is improved. The RMSEP values during the cross validation of the regression of the friction modifier component are summarized in Figure 9 in a box-and-whisker plot representation. The red line indicates the median, the boxes depict the interquartile range (IQR), which contains 50% of the data, and the margins of the whiskers represent Q1 − 1.5 · IQR and Q3 + 1.5 · IQR for the lower and upper bound, respectively (Q1 is the value below which 25% of the data lie, and Q3 is the value below which 75% of the data lie). Figure 9(a) refers to the transmission spectra and Figure 9(b) to the absorbance spectra. The labels are associated with (1) without standardization, (2) single SNV transformation on the full spectral range, (3) Localized SNV, (4) Dynamic Localized SNV, (5) Peak SNV, and (6) Partial Peak SNV.

Results and Discussion
It is noticeable that the RMSEP is very poor in the case of the crude transmission spectra and that SNV improves it considerably, whereas the improvement after SNV is low for absorbance spectra.
For all sophisticated optimized standardization approaches, DLSNV, PSNV, and PPSNV, the median and the scattering around the median of the RMSEP decrease drastically with respect to the SNV-transformed full spectra, but LSNV also seems to be a reasonable choice. DLSNV on absorbance spectra is characterized by the lowest median and the smallest scattering, as confirmed by Table 2, which summarizes the mean values and standard deviations of the RMSEP.

The summarized RMSEPs of the regression on the antioxidant additive of ATF B are shown in Figure 10 in a box-and-whisker plot representation, where Figure 10(a) refers to the transmission spectra and Figure 10(b) to the absorbance spectra. In this case, DLSNV, PSNV, and PPSNV reduce both the median and the scattering around the median enormously when compared with SNV on full spectra. The best performance is achieved by PPSNV conducted on the transmission spectra, as confirmed by Table 2. On transmission spectra, DLSNV and PPSNV perform better than LSNV, but PSNV only has a positive effect when compared to SNV. Relative to LSNV, PSNV reduces the predictive power. The fact that PPSNV is the best choice for this regression use case suggests that it is beneficial to use only spectral regions with a high correlation with the target value and to drop regions with low or no correlation.

Noise Robustness.
In Figure 11, the performance of the regression model for the prediction of noisy spectra is shown for the friction modifier. In Figure 11(a), the curves for all preprocessing methods are depicted, and in Figure 11(b), a zoomed view is shown. Without any preprocessing, the initial calibration error for both transmission and absorbance spectra is very high and rises very fast with the increasing noise level factor. Although the sophisticated preprocessing methods LSNV, DLSNV, PSNV, and PPSNV show a lower initial calibration error, the slope of their error curves is nevertheless lower than for SNV. In Figure 11(b), the trend of the transmission spectra having a low error steepness is visible. One may say that the three proposed standardization techniques show a very similar noise robustness behavior and are significantly better than no preprocessing or a single SNV transformation on the full spectrum.
In Figure 12, the performance of the regression model of the antioxidant for the prediction of noisy spectra is shown.
The absorbance spectra with or without preprocessing show a similar noise trend as the SNV-transformed spectra. In Figure 12(b), the localized versions are shown in a zoomed view. The PPSNV preprocessing on the transmission spectra is characterized by the flattest noise dependency. These results demonstrate the superiority of the PPSNV method in this use case. As mentioned above, PSNV is not advantageous in this application and shows the lowest noise immunity among the localized methods, but it is preferable to the SNV across the whole spectrum.

Summary.
The optimized parameters for the preprocessing methods are summarized in Table 3. The LSNV optimization process selects the same window size as DLSNV. Thus, the second window size run has no influence on the final result in these two cases, but the starting point produces an improvement.
As already mentioned, Table 2 summarizes the cross validation performances of the tested methods as mean values and standard deviations over all cross validation runs. For the friction modifier, the performances of DLSNV, PSNV, and PPSNV are very similar. Table 2 also lists the relative improvements against the benchmark preprocessing, LSNV, accompanied by the corresponding p values from a two-sided t-test, which tests whether the mean values differ significantly (relative improvements may differ between entries with the same mean value because the improvements were calculated from exact rather than rounded values).
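The comparison against the benchmark can be sketched with SciPy as shown below. The text only states that a two-sided t-test was used, so the choice of the independent two-sample test `scipy.stats.ttest_ind` with its default settings is an assumption, as are the function name and the example values.

```python
import numpy as np
from scipy import stats

def compare_to_benchmark(rmsep_method, rmsep_benchmark):
    """Relative improvement of the mean RMSEP (in %) and two-sided p value."""
    rmsep_method = np.asarray(rmsep_method, dtype=float)
    rmsep_benchmark = np.asarray(rmsep_benchmark, dtype=float)
    mean_method = rmsep_method.mean()
    mean_benchmark = rmsep_benchmark.mean()
    improvement = 100.0 * (mean_benchmark - mean_method) / mean_benchmark
    # Assumption: independent two-sample t-test (two-sided by default)
    result = stats.ttest_ind(rmsep_method, rmsep_benchmark)
    return improvement, result.pvalue

# Hypothetical RMSEP values from five cross validation runs
benchmark = np.array([0.30, 0.32, 0.28, 0.31, 0.29])
improvement, p_value = compare_to_benchmark(0.8 * benchmark, benchmark)
print(improvement, p_value)
```

Since the same seeded splits are used for all methods, a paired test would be a possible alternative; which variant was applied is not stated here.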
The best mean RMSEP value for the regression model of the friction modifier, 0.26, is produced by DLSNV based on absorbance spectra. The antioxidant compound is modeled best by PPSNV preprocessing of the transmission spectra, which yields a very low prediction error of 0.17.
To summarize, one may say that all proposed methods performed very well, reducing both the mean and the standard deviation of the cross validation error compared with SNV. PSNV is not reasonable for the antioxidant additive, as its performance is poor compared with the benchmark preprocessing method LSNV.
Which preprocessing method is best depends on the actual regression use case, but in general, it is shown that PPSNV outperforms PSNV. This suggests that it is beneficial to drop spectral regions showing low or no dependency on the target value and to consider only highly correlated peaks.
For the antioxidant compound, PPSNV yielded an enormous improvement. This could be explained by the fact that the phenolic aging inhibitor is a compound with a very narrow vibrational band in ATF B, which thus does not have a great impact when the SNV is carried out across the entire spectrum. This may lead to a suboptimal alignment of this band. With the novel standardizations, the SNV is tailored to the highly correlated bands, and scatter effects can be compensated for exactly these regions. The fact that PSNV is unsuccessful for the antioxidant may be because the POIs are not centered in the middle of the SNV window and may have large left and right margins if they are far away from other POIs. As a result, they may not be optimally standardized. This is shown in Figure 6(b), where the single POI at about 2700 cm⁻¹ has a large single SNV window.
The study provides an overview of model performances when using transmission or absorbance spectra, suggesting that both cases can lead to valid regression models. However, for quantitative models built on transmission spectra, the SNV is vital, whereas in the absorbance case, the predictive power does not depend on the SNV transformation. In absorbance spectra, the influence of the baseline constant is reduced because high transmission values are converted into low absorbance values.
To conclude, DLSNV, PSNV, and PPSNV were able to improve both transmission and absorbance predictive models. The scattering around the mean values is also drastically reduced because the model does not have to learn how to compensate for the baseline shift in each cross validation step, leading to more reproducible results. Each vibrational band is optimally aligned so that the additive depletion trend is encoded in the absolute signal intensity, and the model does not have to weight a data point as background correction.

Conclusion
The results presented in this study demonstrate that the proposed novel standardization strategies Dynamic Localized SNV, Peak SNV, and Partial Peak SNV drastically improve both the mean and the scatter of the RMSEP values in cross validation, as well as the robustness against noise, with respect to the SNV transformation executed on the entire spectrum. Against the benchmark LSNV, an enhancement of the predictive power of a ridge regression model by up to 16% and 29% could be achieved for the friction modifier and the antioxidant compound, respectively. The demonstrated optimization workflows for performing SNV on specific regions of the spectrum have been introduced here for the first time. The standardization methods used in this paper are capable of eliminating nonlinearities by flexible rescaling in defined areas. To our knowledge, such standardization techniques have not been presented elsewhere.

Figure 1: FTIR spectra of ATF A. (a) Raw full transmission spectra without any preprocessing. The two data sets with different measurement setups can be discriminated by eye. The blue spectra originate from measurement type (1) with two KBr discs separated by a Teflon spacer, and the red set of curves originates from the cuvette measurement (2). (b) SNV-transformed transmission spectra after truncation of the saturated C-H vibrational regions and CO₂ areas and (c) SNV-transformed absorbance spectra after truncation.

Figure 3: Standardized additive responses used as target value for the FTIR regression model for (a) the friction modifier compound and (b) the antioxidant, plotted against time for a storage temperature of 140 °C for all three storage experiments measured by HPLC-QToF.

Figure 4: Demonstration of the workflow and optimization process for Dynamic Localized SNV. (a) Single SNV, (b) Dynamic Localized SNV with starting point 100 and window 300 for visualization, (c) three-stage optimization process for window, starting point, and final window optimization, and (d) optimized DLSNV.

Figure 5: Demonstration of the improvement of peak alignment for DLSNV. (a) SNV-transformed transmission spectra with marked zoom area of (c). (b) Optimized DLSNV spectra with marked zoom area of (d). Magenta indicates (relatively) fresh samples, whereas red indicates a strong degradation level.

Figure 6: Demonstration of the workflow and optimization process for Peak SNV. (a) Picked peaks of the normalized absolute coefficient vector and indication of the POIs in one spectrum, (b) spectrum separation according to agglomerated peaks, (c) optimization process in order to find the optimal agglomeration window size, and (d) optimized PSNV.

Figure 7: Demonstration of the workflow and optimization process for Partial Peak SNV. (a) Picked peaks of the normalized absolute coefficient vector and indication of the POIs in one spectrum, (b) spectrum separation according to agglomerated peaks, (c) optimization process in order to find the optimal window size around the POIs, and (d) optimized PPSNV.

Cross Validation.
In Figure 8(a), the cross validation recovery function for predictions of the SNV-preprocessed spectra of the regression on the friction modifier compound is shown. A 50-fold cross validation strategy with a calibration/validation splitting of 70%/30% was used. Red dots represent the predictions of calibration samples, and blue dots represent those of validation samples. It is obvious that the linear model struggles to predict the high and low compound intensity regions correctly. The nonlinearity is visualized by an arrow and a dashed line to guide the eye. Figure 8 also shows the cross validation recovery functions for predictions after Dynamic Localized SNV (Figure 8(b)), Peak SNV (Figure 8(c)), and Partial Peak SNV (Figure 8(d)) optimization.

Figure 9: Box-and-whisker plot representation of the root-mean-squared error of prediction of the cross validation strategy (50 folds, random train test split of 70/30% of the data) for the friction modifier compound of ATF A. In (a), the RMSEP values for the transmission spectra are shown, and in (b), the RMSEP values for the absorbance spectra are shown. Boxplot (1) is without standardization, (2) is with a single SNV transformation on the full spectrum, (3) is with optimized LSNV, (4) is with optimized DLSNV, (5) is with optimized PSNV, and (6) is with optimized PPSNV.

Figure 10: Box-and-whisker plot representation of the root-mean-squared error of prediction of the cross validation strategy for the phenolic antioxidant compound of ATF B. In (a), the RMSEP values for the transmission spectra are shown, and in (b), the RMSEP values for the absorbance spectra are shown. Boxplot (1) is without standardization, (2) is with a single SNV transformation on the full spectrum, (3) is with optimized LSNV, (4) is with optimized DLSNV, (5) is with optimized PSNV, and (6) is with optimized PPSNV.

Figure 11: RMSE as a function of the noise level factor for the regression of the friction modifier compound of ATF A calibrated with the unperturbed full data set. The error bars represent the standard deviation of the prediction error calculated from the statistics of 50 repetitions of noise addition. In subplot (a), all curves are shown, and in (b), the sophisticated standardizations are shown.

Figure 12: RMSE as a function of the noise level factor for the regression of the antioxidant compound of ATF B calibrated with the original full data set. The error bars represent the standard deviation of the prediction error calculated from the statistics of 50 repetitions of noise addition. In subplot (a), all curves are shown, and in (b), the sophisticated standardizations are shown.

Table 2: Summary of the model performances described by the mean value and the standard deviation of the RMSEP values during cross validation. The relative improvements and respective p values compared with LSNV are also listed.