Crude Oil Source Identification of Asphalt via ATR-FTIR Approach Combined with Multivariate Statistical Analysis

The type of crude oil used to produce asphalt has a decisive influence on its performance, including aging resistance and durability. To discriminate and predict the crude oil source of different asphalt samples, a discrimination model was established using 12 markedly different infrared (IR) characteristic absorption peaks (CAPs) as predictive variables. The model combines attenuated total reflectance-Fourier transform infrared spectroscopy (ATR-FTIR) with fingerprint recognition techniques, namely principal component analysis (PCA) and multinomial logistic regression analysis, so that the crude oil source of different asphalt samples can be effectively discriminated. First, PCA was applied to the 12 CAPs in the IR spectra of the asphalt samples to reduce their dimension and isolate the key factors, and the scores of the principal components were calculated for each sample. The principal component scores were then modelled by multinomial logistic regression to discriminate and predict the crude oil source of the asphalt samples. The results showed that the logistic regression model has a favourable goodness of fit, with a prediction accuracy of 93.9% for the crude oil source of the asphalt samples. The method offers ease of operation and high accuracy, which is important for controlling the source and quality of asphalt and improving its performance.


Introduction
Asphalt pavements are widely used, and asphalt, a black binding material produced from crude oil, serves as the binder in asphalt mixtures [1][2][3]. Because crude oils differ in origin and production mode, their properties strongly influence the performance of asphalt mixtures and lead to significant differences in the performance of the asphalts produced from them [4][5][6][7][8].
The conventional performance of asphalts of the same grade is very similar; however, different asphalts differ considerably in high- and low-temperature performance, durability, and fatigue properties, which are regarded as external expressions of the chemical composition, molecular structure, and transformation of asphalt [9][10][11]. Furthermore, studies show that differences in the composition and structure of asphalt depend mainly on the crude oil source and the refining process. Owing to differences in geological structure, oil generation conditions, and age, the nature and composition of crude oil vary greatly between regions, whereas crude oils from the same region have similar properties and composition and therefore similar processing, storage, and transportation options. At the same time, most petroleum asphalt is currently produced by distillation, so the molecules in the asphalt retain the state they had in the crude oil. Therefore, most of the composition and structure of asphalt is inherited from the crude oil; that is, the structural performance of asphalt depends mainly on the crude oil source. Because asphalts are produced from different types of crude oil, the physical and chemical composition information of each asphalt is unique. Like a human fingerprint, the components that express the unique structure of an asphalt can be called its "oil fingerprint". It is the uniqueness of this "oil fingerprint" information that makes it feasible to discriminate between asphalts from different crude oil sources [12][13][14][15][16].
At present, as the composition and structure of asphalt are extremely complex, the characterization of its structure requires more high-resolution and high-throughput analysis means and equipment, so there are few reports on the identification and analysis of asphalt oil fingerprints [17]. However, the identification and analysis of marine oil spill fingerprints has always been an issue of widespread concern. Similar to the method and purpose of identifying "oil fingerprints" of oil spills at sea, the purpose of recognising oil fingerprints of asphalt is to attain oil fingerprint information of asphalt through different methods such as physical, chemical, and biological methods [18]. Moreover, by applying multivariate statistical methods (including principal component analysis (PCA) and regression analysis), the chemical composition variables of oil fingerprints are summarised, classified, and discriminated [19,20]. On this basis, qualitative and quantitative relationships between data are obtained to distinguish the crude oil source of asphalt, thus effectively controlling their qualities. Meanwhile, some testing methods used in the "oil fingerprint" identification of marine oil spills have been successfully used to analyse the composition and structure of asphalt [21][22][23]. For example, a gas chromatograph-mass spectrometer (GC-MS) was used to explore the chemical compositions of smoke released by asphalt materials during heating [24,25]. Gel permeation chromatography (GPC) and thin-layer chromatography (TLC) were used to measure the molecular weights and the composition distributions of asphalt [26][27][28]. Nuclear magnetic resonance (NMR) and Fourier transform infrared spectroscopy (FTIR) were used to investigate the compositions, structures, and functional groups of asphalt [29,30]. 
Compared with other analytical techniques such as GC-MS and NMR, which generally suffer from high cost, damage to samples, and laborious, time-consuming analysis, infrared (IR) spectroscopy is the most widely used technique for investigating asphalt materials. The reason is that IR spectroscopy is label-free, rapid, nondestructive, and low-cost, with simple sample preparation [31][32][33]. However, in the studies above, the chemical structures of asphalt were analysed only qualitatively, mainly for one or several specific asphalt samples, and quantitative research into the types of asphalt is lacking. Research into discriminating the types of asphalt, tracing the production area, and controlling the quality of asphalt has not yet been reported. Therefore, in this study, attenuated total reflectance-Fourier transform infrared spectroscopy (ATR-FTIR) was used to discriminate and quantitatively analyse the characteristic functional groups of asphalt from different crude oil sources. Based on multivariate statistics, PCA and logistic regression analysis were conducted on the IR spectral data to establish a discriminant function. An accurate, nondestructive, and stable method of discriminating the crude oil source of asphalt samples was thus explored, which provides a scientific basis for the reasonable selection, quality supervision, and origin assurance of asphalt.

Experimental Materials.
For the experiment, 33 asphalt samples were purchased from asphalt production plants in China. Before use, the samples were sealed in their original oxygen-free containers at 5°C to prevent oxidation, and none of the samples was processed before use. As mentioned in Section 1, the differences in the "oil fingerprint" of asphalt are determined by the crude oil from which it is produced. Because the geological structure, oil generation conditions, and age are the same within a region, the composition and chemical structure of crude oil from that region are also very similar. Therefore, the "oil fingerprints" of asphalt produced from crude oil of the same region are very similar. Examples are crude oil from the Middle East Gulf region, including Saudi Arabia, Iran, Kuwait, Iraq, and the United Arab Emirates; crude oil from South America, including Marry, Poscan, Maya, and Castilla; and crude oil from the Bohai Rim region of China, such as Bohai Bay, Huanxiling, and Caofeidian. The crude oil of the 33 asphalt samples came from these three regions. Accordingly, the crude oil source of asphalt is divided into three categories: the Middle East, South America, and the Bohai Rim region of China. The basic performance measures of the asphalt (penetration ratio (ASTM D5), ductility ratio (ASTM D113), and softening point (ASTM D36)) and its crude oil source are listed in Table 1. Note that the last digit of each asphalt number in Table 1 denotes a different sampling batch of the same asphalt.

FTIR Analysis.
Through ATR-FTIR (using a Cary 630 FTIR microscope), the IR spectra of the asphalt samples were collected. Each spectrum was acquired over the range 400-4,000 cm⁻¹ with 64 scans at a resolution of 1 cm⁻¹. The samples were placed on a horizontal zinc selenide ATR crystal, where the beam undergoes multiple reflections. After each measurement, the ATR crystal was cleaned with acetone. The raw spectral data were first baseline-corrected in the OMNIC software to eliminate baseline effects. Afterwards, the preprocessed spectral data were standardised to eliminate differences caused by the masses of different samples.
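The two preprocessing steps can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not confirm: the end-point linear baseline is only a crude stand-in for the OMNIC baseline correction, the standardisation is assumed to resemble a standard normal variate (SNV) transform applied to each spectrum, and the readings in `raw` are hypothetical.

```python
from statistics import mean, pstdev


def subtract_linear_baseline(spectrum):
    """Subtract the straight line joining the first and last points (assumed baseline)."""
    n = len(spectrum)
    first, last = spectrum[0], spectrum[-1]
    return [y - (first + (last - first) * i / (n - 1)) for i, y in enumerate(spectrum)]


def standard_normal_variate(spectrum):
    """Centre one spectrum on its own mean and scale it by its own standard deviation."""
    mu = mean(spectrum)
    sigma = pstdev(spectrum)
    return [(y - mu) / sigma for y in spectrum]


# Hypothetical transmittance readings for one sample (not data from the study).
raw = [92.0, 90.5, 71.2, 88.4, 79.9, 93.1]
corrected = subtract_linear_baseline(raw)
scaled = standard_normal_variate(corrected)
```

Scaling each spectrum by its own spread removes overall intensity differences between samples, which is one plausible reading of eliminating the effect of sample mass.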

Multivariate Statistical Analysis.
The infrared spectral data are analysed through a combination of principal component analysis (PCA) and multinomial logistic regression analysis to establish the discrimination model for the crude oil source of asphalt. Logistic regression analysis is a multivariate method that analyses and predicts a categorical dependent variable from one or more continuous or categorical independent variables. In variable screening and parameter estimation, the independent variables are required to be independent of one another. In many studies, however, there is a certain degree of linear dependence between the variables, which is called multicollinearity. Multicollinearity may increase the mean square error and standard error of the estimated parameters, which makes the results of the logistic regression model unstable. The main cause of multicollinearity is overlapping information. PCA can reduce this redundancy and eliminate multicollinearity by extracting mutually independent principal components from the explanatory variables.
For this reason, this study used a multinomial logistic regression model based on PCA to improve the discrimination accuracy. First, PCA was used to reduce the dimension of the CAP variables of the infrared spectrum, so that strongly correlated variables were integrated into the same principal components. The principal components are independent of each other; thus, the multicollinearity between variables was eliminated. Then, with these principal components as independent variables, the discriminant model for the crude oil source of asphalt was obtained by logistic regression analysis.

PCA Analysis.
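PCA simplifies multidimensional data into a small number of composite variables (principal components) through dimension reduction. The retained principal components together reflect most of the information in the original variables, and the information they carry is not repeated. PCA can therefore compress a large amount of information and simplify complex problems [34]. The modelling process of PCA is as follows:

(1) Calculating the correlation coefficient matrix:

$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}, \qquad (1)$$

where $r_{ij}$ ($i, j = 1, 2, \ldots, p$) is the correlation coefficient of the original variables $X_i$ and $X_j$, with $r_{ij} = r_{ji}$, calculated as

$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}}, \qquad (2)$$

where $x_{ki}$ is the value of variable $X_i$ for sample $k$, $\bar{x}_i$ is the mean of $X_i$, and $n$ is the number of samples.

(2) Calculating eigenvalues and eigenvectors: The characteristic equation $|\lambda I - R| = 0$ is solved, generally by the Jacobi method, and the eigenvalues are arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. The eigenvector $e_i$ ($i = 1, 2, \ldots, p$) corresponding to each eigenvalue $\lambda_i$ is then calculated, satisfying $\|e_i\| = 1$, that is, $\sum_{j=1}^{p} e_{ij}^2 = 1$, where $e_{ij}$ is the $j$th component of $e_i$.

(3) Calculating the contribution and cumulative contribution of the principal components:

$$\text{contribution: } \frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k} \quad (i = 1, 2, \ldots, p), \qquad \text{cumulative contribution: } \frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \quad (i = 1, 2, \ldots, p).$$

In general, the eigenvalues whose cumulative contribution is not lower than 70% are taken; $\lambda_1, \lambda_2, \ldots, \lambda_m$ then correspond to the first, second, ..., $m$th ($m \le p$) principal components.

(4) Calculating the loads of the principal components:

$$l_{ij} = \sqrt{\lambda_i}\, e_{ij} \quad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, p).$$

(5) Calculating the scores of the principal components for each sample $k$:

$$z_{ki} = \sum_{j=1}^{p} e_{ij}\, \tilde{x}_{kj} \quad (i = 1, 2, \ldots, m),$$

where $\tilde{x}_{kj}$ is the standardised value of variable $X_j$ for sample $k$.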
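The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the SPSS procedure used in the study: the function names, the cyclic form of the Jacobi rotation scheme, and the small `data` matrix are all hypothetical.

```python
import math


def correlation_matrix(data):
    """Pearson correlation matrix R; rows of data are samples, columns are variables."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]

    def cross(i, j):
        return sum(row[i] * row[j] for row in centred)

    return [[cross(i, j) / math.sqrt(cross(i, i) * cross(j, j)) for j in range(p)]
            for i in range(p)]


def jacobi_eigen(matrix, max_sweeps=100, tol=1e-20):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric matrix."""
    p = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off_diagonal < tol:
            break
        for r in range(p - 1):
            for s in range(r + 1, p):
                if a[r][s] == 0.0:
                    continue
                # Rotation angle that zeroes the (r, s) entry.
                theta = 0.5 * math.atan2(2.0 * a[r][s], a[r][r] - a[s][s])
                c, sn = math.cos(theta), math.sin(theta)
                for k in range(p):  # columns: A <- A J
                    akr, aks = a[k][r], a[k][s]
                    a[k][r], a[k][s] = c * akr + sn * aks, c * aks - sn * akr
                for k in range(p):  # rows: A <- J^T A
                    ark, ask = a[r][k], a[s][k]
                    a[r][k], a[s][k] = c * ark + sn * ask, c * ask - sn * ark
                for k in range(p):  # accumulate eigenvectors: V <- V J
                    vkr, vks = v[k][r], v[k][s]
                    v[k][r], v[k][s] = c * vkr + sn * vks, c * vks - sn * vkr
    order = sorted(range(p), key=lambda i: a[i][i], reverse=True)
    eigenvalues = [a[i][i] for i in order]
    eigenvectors = [[v[k][i] for k in range(p)] for i in order]
    return eigenvalues, eigenvectors


# Hypothetical data: 5 samples x 3 variables (not measurements from the study).
data = [
    [2.0, 1.9, 0.5],
    [3.1, 3.0, 0.9],
    [4.2, 3.8, 0.4],
    [5.0, 5.2, 1.1],
    [6.1, 5.9, 0.6],
]
R = correlation_matrix(data)
eigenvalues, eigenvectors = jacobi_eigen(R)
total = sum(eigenvalues)
contribution = [lam / total for lam in eigenvalues]
cumulative = [sum(contribution[:i + 1]) for i in range(len(contribution))]
# Smallest m whose cumulative contribution reaches the 70% rule of thumb.
m = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.70)
# Loads of the retained components: sqrt(eigenvalue) times eigenvector entries.
loadings = [[math.sqrt(max(lam, 0.0)) * e for e in vec]
            for lam, vec in zip(eigenvalues[:m], eigenvectors[:m])]
```

Because R is a correlation matrix, its eigenvalues sum to the number of variables, which gives a quick sanity check on the decomposition.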

Logistic Regression Analysis.
Logistic regression is a multivariate analysis method for investigating the relationship between binomial or multinomial observation results (the dependent variable) and influencing factors (the independent variables); it belongs to the probabilistic nonlinear regression methods. When the dependent variable has two states, the method is binomial logistic regression; when it has more than two states, it is multinomial logistic regression [35,36]. Because the crude oil of the asphalt samples comes from three sources (the Bohai Rim region of China, South America, and the Middle East), multinomial logistic regression is applied to discriminate and classify the crude oil source.
(1) Model fitting: For multinomial logistic regression, one level of the dependent variable is defined as the reference level, and $i - 1$ generalised logit models (where $i$ is the number of levels of the dependent variable) are fitted by comparing each of the other levels with it. Taking a three-level dependent variable as an example, suppose that the dependent variable takes the values 1, 2, and 3 with probabilities $\pi_1$, $\pi_2$, and $\pi_3$, respectively. With level 3 as the reference level and $m$ independent variables, two models are fitted:

$$\operatorname{logit}\frac{\pi_1}{\pi_3} = \ln\frac{\pi_1}{\pi_3} = \alpha_1 + \beta_{11}x_1 + \cdots + \beta_{1m}x_m,$$

$$\operatorname{logit}\frac{\pi_2}{\pi_3} = \ln\frac{\pi_2}{\pi_3} = \alpha_2 + \beta_{21}x_1 + \cdots + \beta_{2m}x_m.$$

(2) Meaning of regression parameters: In multinomial logistic regression, each independent variable has $(i - 1)$ parameters, one for each fitted model. The parameter $\beta_{1m}$ gives the change in the log-odds of level 1 relative to the reference level when the independent variable $x_m$ changes by one unit while the other independent variables remain unchanged. The odds ratio (OR) is subjected to logarithmic transformation to obtain the linear form of the logistic regression model, $\ln(p_i/(1 - p_i)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$.
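To make the three-level case concrete, the sketch below turns the two fitted logits into class probabilities, assuming that level 3 is the reference level. The coefficients in `models` and the predictor vector `x` are hypothetical values chosen for illustration, not parameters from the study.

```python
import math


def class_probabilities(x, models):
    """Probabilities of each non-reference level, followed by the reference level.

    models holds one (intercept, [betas]) pair per non-reference level.
    """
    logits = [alpha + sum(b * xi for b, xi in zip(betas, x)) for alpha, betas in models]
    denominator = 1.0 + sum(math.exp(z) for z in logits)
    return [math.exp(z) / denominator for z in logits] + [1.0 / denominator]


# Hypothetical coefficients for m = 3 independent variables.
models = [
    (0.5, [1.2, -0.7, 0.3]),    # ln(pi_1 / pi_3)
    (-0.2, [0.4, 0.9, -1.1]),   # ln(pi_2 / pi_3)
]
x = [0.8, -0.5, 1.0]
pi1, pi2, pi3 = class_probabilities(x, models)
# The predicted level is the one with the largest probability (1-based).
predicted_level = max(range(3), key=lambda k: (pi1, pi2, pi3)[k]) + 1
```

The reference level has an implicit logit of zero, which is why its probability is the reciprocal of the denominator and the three probabilities always sum to one.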

Establishment of Discrimination Indices for Crude Oil Source of Asphalt.
FTIR is an important means of identifying organic compounds. When organic matter is irradiated with IR light, its molecules absorb the light and undergo vibrational energy level transitions, and different chemical bonds or functional groups absorb at different frequencies. The contents of various materials are reflected in their IR absorption spectra, which can be analysed quantitatively according to peak location and absorption intensity. However, the structural composition of asphalt is complex, and asphalts differ significantly in behaviour. For these reasons, the differences between asphalts from different crude oils cannot be characterised effectively merely by quantitatively comparing the peak areas of the IR spectrograms. Therefore, by observing the shapes and locations of the IR spectrograms, 12 significant characteristic absorption peaks (CAPs) were selected, and the transmittances of these absorption peaks were analysed.
The IR absorption spectra of the 33 asphalt samples are similar. Using the mean value method, the common pattern of the IR spectrograms of all asphalt samples was constructed (Figure 1). The assignments of the 12 characteristic peaks are as follows. The strong absorption peaks around 2850 cm⁻¹ and 2920 cm⁻¹ are caused by the stretching vibration of CH₂, and a very weak absorption peak around 1700 cm⁻¹ is caused by the stretching vibration of C=O. The vibration of the benzene ring produces the absorption peak in the vicinity of 1600 cm⁻¹, and the absorption peaks at 1380 cm⁻¹ and 1460 cm⁻¹ are caused by the bending vibration of CH₃. The fingerprint region lies below 1300 cm⁻¹, in which the absorption peaks at 1166 cm⁻¹ and 1032 cm⁻¹ are caused by the stretching vibrations of C=S and S=O, respectively. The stretching vibration of CH results in a weak absorption peak around 969 cm⁻¹, while the absorption peaks at 872 cm⁻¹ and 812 cm⁻¹ are caused by vibrations of an isolated hydrogen atom and of two adjacent hydrogen atoms on the benzene ring, respectively. Additionally, the absorption peak at 723 cm⁻¹ is also caused by the stretching vibration of CH₂.

Analysis of Predictive Variables Based on Descriptive Statistics.
The IR spectra of all asphalt samples are similar, and it is difficult to distinguish between asphalt samples by comparing spectrograms alone. Hence, 12 significantly different CAPs were selected from the spectrograms, and the transmittances of these absorption peaks were summarised with descriptive statistics. The samples are described in terms of central location (indices such as the mean and median) and degree of dispersion (indices such as the extreme values) so as to reflect the spectrographic data (Table 2).
According to the descriptive statistics of the transmittances of the 12 CAPs in Table 2, the asphalt produced from crude oil from the Bohai Rim region of China showed larger transmittances, whereas the transmittances of asphalt produced from crude oil from the Middle East and South America were consistently low. However, the oil source of an asphalt cannot be distinguished from the descriptive statistics of its infrared spectral transmittances alone. Therefore, it is necessary to introduce multivariate statistical analysis methods, such as the PCA-based multinomial logistic regression analysis described in Section 2.3.

Correlation Analysis of Predictive Variables.
Correlation analysis explores the correlation among multiple variables and is also an important tool for evaluating the fingerprint variables of asphalt [37]. To evaluate further whether the 12 selected variables were of sufficient significance to the prediction model, a correlation analysis of the 12 CAP variables was required. Correlation analysis is generally conducted with Pearson or Spearman correlation coefficients. The Pearson correlation coefficient is generally applicable to data that satisfy a normal distribution, and the Spearman correlation coefficient is used for data that do not. Therefore, before the correlation test, the 12 variables must be tested for normality to determine the appropriate correlation test method.
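The difference between the two coefficients can be seen in a short sketch, in which the Spearman coefficient is computed as the Pearson coefficient of the ranks. The function names and the two short series are hypothetical and serve only as an illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical monotonic but nonlinear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

For this monotonic series the rank-based coefficient is exactly one, while the Pearson coefficient is slightly lower because the relationship is not linear.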
The skewness-kurtosis test was used to assess whether the transmittances of the 12 CAPs of the 33 asphalt samples conform to a normal distribution, and the K-S test was used as an auxiliary method to confirm the test results [38,39]. The 12 variables were processed by importing them into SPSS19 (Tables 3 and 4).
It can be seen from Table 3 that the skewness and kurtosis values of the transmittances of the 12 CAPs of all asphalt samples from the three oil origins fluctuate within a small positive and negative range around zero. It can be further seen from Table 4 that the asymptotic significances of the 12 variables all exceed 0.05. Together with the result of the skewness-kurtosis test, this indicates that the 12 variables of the 33 asphalt samples all conform to a normal distribution, which provides the basis for choosing the method of testing correlation among the variables. Therefore, the Pearson correlation coefficient is used to analyse the correlation between the variables (Table 5).
As shown in Table 5, the IR CAP at 2850 cm⁻¹ showed a significant correlation with those at 2920, 1460, and 723 cm⁻¹. Additionally, there are significant correlations among the IR CAPs at 1700, 1600, 1460, 1380, 1166, 1032, 969, 872, 812, and 723 cm⁻¹. Moreover, multiple CAPs exhibited a high correlation. The highly correlated CAPs cover all 12 CAPs. This shows that the 12 selected CAPs contain most of the fingerprint information about the asphalt, thus providing a basis for selecting variables capable of discriminating the different crude oil sources of asphalt.
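A skewness-kurtosis check of this kind can be sketched as follows. The formulas are the bias-adjusted sample skewness and excess kurtosis with their standard errors, as commonly reported by statistical packages; whether they match the SPSS19 output in Tables 3 and 4 exactly is an assumption, and the `sample` values are hypothetical.

```python
import math


def skewness_kurtosis(values):
    """Bias-adjusted skewness, excess kurtosis, and their standard errors (needs n >= 4)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g1 = m3 / m2 ** 1.5          # moment-based skewness
    g2 = m4 / m2 ** 2 - 3.0      # moment-based excess kurtosis
    skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)
    se_skew = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, kurt, se_skew, se_kurt


# Hypothetical symmetric sample (not transmittances from the study).
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
skew, kurt, se_skew, se_kurt = skewness_kurtosis(sample)
# A common rule of thumb treats |statistic / standard error| < 1.96 as compatible with normality.
looks_normal = abs(skew / se_skew) < 1.96 and abs(kurt / se_kurt) < 1.96
```

Values of skewness and kurtosis that stay close to zero relative to their standard errors are what the text describes as fluctuating within a small range around zero.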

Establishment of Logistic Regression and Discriminant Model Based on PCA
PCA on All Variables. The correlation analysis of the variables shows that the information contained in the 12 CAPs is partly repeated. PCA can remove the repeated information while retaining the key information, thus realising dimension reduction. Furthermore, it makes the logistic regression and discrimination modelling more reliable by reducing the disturbance caused by accidental factors. The transmittances of the 12 CAPs of the 33 asphalt samples were input into the SPSS19 software for PCA. The results are displayed in Table 6 and Figure 2. As shown in Table 6, there are three principal components whose eigenvalues exceed one. The first, second, and third principal components explain 77.658%, 15.498%, and 3.508% of the nature of the original variables, respectively. The cumulative variance contribution of the three principal components is 96.664% (research shows that there is a high explanation rate when the cumulative contribution is higher than 70%). It can be seen from the scree plot (Figure 2) that the broken line is steep over the first three principal components and becomes shallower thereafter. This further indicates that it is appropriate to extract the three principal components (PCA1, PCA2, and PCA3). According to the correlation coefficients between the principal components and the original variables, the principal components $Y_1$, $Y_2$, and $Y_3$ are expressed separately (formulae (9)-(11)); the expression for $Y_3$ is

$$Y_3 = -0.025x_1 + 0.393x_2 + 0.527x_3 + 0.049x_4 - 0.781x_5 - 0.997x_6 + 0.004x_7 + 0.305x_8 - 0.025x_9 + 0.129x_{10} + 0.069x_{11} + 0.471x_{12},$$

where $x_1, x_2, \ldots, x_{12}$ represent the transmittances of the CAPs at 2920, 2850, 1700, 1600, 1460, 1380, 1166, 1032, 969, 872, 812, and 723 cm⁻¹, respectively.
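The reported figures can be checked with a brief sketch that sums the explained variances and evaluates the expression for the third principal component. The coefficients and percentages are taken from the text; the function name and the input vector are hypothetical, and whether the inputs should be raw or standardised transmittances is not stated here, so the sketch simply applies the coefficients to whatever values are supplied.

```python
# Coefficients of the third principal component, ordered as the CAPs at
# 2920, 2850, 1700, 1600, 1460, 1380, 1166, 1032, 969, 872, 812, and 723 cm^-1.
Y3_COEFFICIENTS = [-0.025, 0.393, 0.527, 0.049, -0.781, -0.997,
                   0.004, 0.305, -0.025, 0.129, 0.069, 0.471]


def third_component_score(x):
    """Evaluate Y3 for a vector of 12 CAP values (x1..x12)."""
    if len(x) != len(Y3_COEFFICIENTS):
        raise ValueError("expected 12 CAP values")
    return sum(c * xi for c, xi in zip(Y3_COEFFICIENTS, x))


# Explained variance (%) of the three extracted components, as reported in the text.
explained = [77.658, 15.498, 3.508]
cumulative = [round(sum(explained[:k + 1]), 3) for k in range(len(explained))]

# Hypothetical input: only the 1380 cm^-1 peak (x6) deviates by one unit.
unit_x6 = [0.0] * 5 + [1.0] + [0.0] * 6
score = third_component_score(unit_x6)
```

The last entry of `cumulative` reproduces the cumulative contribution of 96.664% quoted above, and the unit input shows that the 1380 cm⁻¹ peak carries the largest weight in the third component.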

Advances in Materials Science and Engineering
The Process and Result of Multinomial Logistic Analysis.
By substituting the transmittances of the 12 CAPs of the 33 asphalt samples into formulae (9)-(11), the scores of the three principal components can be calculated (Table 7). The scores of the principal components are then taken as factors, and the three origins of asphalt are taken as the dependent variable. The crude oil from the Bohai Rim region of China is used as the reference group to establish a multinomial logistic regression model based on the principal components. On this basis, the parameters of the three principal components in the logistic regression model are obtained.
Based on the regression parameters, the logistic regression model (formula (12)) is obtained, where $p_1$, $p_2$, and $p_3$ refer to the probabilities of the crude oil sources of asphalt (the Middle East, South America, and the Bohai Rim region of China, respectively) and $Y_1$, $Y_2$, and $Y_3$ denote the first, second, and third principal components, respectively.
By substituting expressions (9), (10), and (11) into expression (12), the expression (formula (13)) relating the logistic regression model to the 12 variables is obtained. For discrimination and prediction, the probabilities of each crude oil source are obtained by substituting the transmittances of the 12 CAPs of an asphalt sample into formula (13). The maximum probability corresponds to the predicted origin.
Additionally, to validate whether the model has adequate practical meaning, it is necessary to test its goodness of fit, pseudo R-squared, and likelihood ratio. The goodness-of-fit tests (the Pearson chi-square test and the deviance chi-square test) check whether the model fits the original data; if the significance level exceeds 0.05, the fit is favourable. The pseudo R-squared value indicates how much of the information in the original variables is explained by the model and is reported as the Cox and Snell, Nagelkerke, and McFadden pseudo R-squared values; the closer the result is to 1, the better the explanation. The likelihood ratio test measures the contribution of the original variables to the model; if the significance level is lower than 0.05, the contribution of the original variables is high.
According to the test results for the logistic regression model (Table 8), the goodness of fit, pseudo R-squared, and likelihood ratio of the model all satisfy the test requirements. This indicates that the extracted principal components PCA1, PCA2, and PCA3 retain the key information in the data while effectively realising dimension reduction, and that they contribute significantly to the construction of the logistic regression model. The final result obtained through model regression is therefore meaningful.
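The likelihood ratio statistic and the three pseudo R-squared values can be computed from the log-likelihoods of the intercept-only model and the fitted model, as in the sketch below. The class counts of 15, 12, and 6 are the group sizes reported in the text, but the fitted log-likelihood `ll_model` is a hypothetical number used only for illustration; the values in Table 8 are not reproduced here.

```python
import math


def fit_statistics(ll_null, ll_model, n):
    """Likelihood ratio chi-square and pseudo R-squared values from two log-likelihoods."""
    lr_chi_square = 2.0 * (ll_model - ll_null)
    cox_snell = 1.0 - math.exp(-lr_chi_square / n)
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    mcfadden = 1.0 - ll_model / ll_null
    return {
        "lr_chi_square": lr_chi_square,
        "cox_snell": cox_snell,
        "nagelkerke": nagelkerke,
        "mcfadden": mcfadden,
    }


# Group sizes reported in the text: Middle East, South America, Bohai Rim.
counts = [15, 12, 6]
n = sum(counts)
# Log-likelihood of the intercept-only model, which predicts the class proportions.
ll_null = sum(c * math.log(c / n) for c in counts)
ll_model = -6.0  # hypothetical fitted log-likelihood, not a value from the study
stats = fit_statistics(ll_null, ll_model, n)
```

The Nagelkerke value rescales the Cox and Snell value by its maximum attainable value, so it is never smaller than the Cox and Snell value and can reach 1 for a perfect fit.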

Validation of Discriminatory Effect of the Model.
The IR CAPs of the 33 original asphalt samples were taken as verification samples, and the discrimination effect of the multinomial logistic regression model was evaluated by applying formula (13). The discrimination result is shown in Table 9.
As shown in Table 9, the discrimination accuracies for the 15, 12, and 6 asphalt samples produced from crude oil from the Middle East, South America, and the Bohai Rim region of China are 93.3%, 91.7%, and 100%, respectively. The overall discrimination accuracy is 93.9%. This result shows that multinomial logistic regression analysis based on PCA can rapidly discriminate the origins of asphalt.
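The reported accuracies are mutually consistent, as the following short check illustrates. The numbers of correctly classified samples (14 and 11) are inferred from the reported percentages and group sizes and are therefore an assumption, since Table 9 is not reproduced here.

```python
# Group sizes from the text; correct counts inferred from the reported percentages.
group_sizes = {"Middle East": 15, "South America": 12, "Bohai Rim": 6}
correct = {"Middle East": 14, "South America": 11, "Bohai Rim": 6}  # assumed counts

# Per-group accuracy in percent, rounded to one decimal place.
accuracy = {name: round(100.0 * correct[name] / size, 1)
            for name, size in group_sizes.items()}

# Overall accuracy: all correctly classified samples over all samples.
overall = round(100.0 * sum(correct.values()) / sum(group_sizes.values()), 1)
```

Under these inferred counts, 31 of the 33 samples are classified correctly, which matches the overall accuracy of 93.9%.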

Conclusions
Based on ATR-FTIR technology, the infrared spectra of 33 asphalt samples produced from crude oil from the Middle East, South America, and the Bohai Rim region of China were collected, and the 12 selected CAPs of the infrared spectra were analysed by multivariate statistics. The overall accuracy of the PCA-based logistic regression model in discriminating asphalt produced from crude oil from the three regions reached 93.9%. The results indicate that the combination of ATR-FTIR spectral analysis and multivariate statistics can accurately and nondestructively discriminate between the different crude oil sources of asphalt. Moreover, the method offers ease of operation, rapidity, and high accuracy, which is important for controlling the origin and quality of asphalt and improving its performance. The method presented in this paper is suitable for identifying the oil source of base asphalt produced from crude oil from different regions and can also provide a reference for other kinds of asphalt, such as polymer-modified asphalt. However, the accuracy and applicability of the method need to be improved further. In particular, whether asphalt produced from blends of crude oil from different regions can be effectively identified requires further research.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.