Quantitative Analysis of the Main Controlling Factors of Oil Saturation Variation

With the high-speed development of artificial intelligence, machine learning methods have become key technologies for intelligent exploration, development, and production in oil and gas fields. This article presents a workflow for analysing the main controlling factors of oil saturation variation using machine learning algorithms based on static and dynamic data from actual reservoirs. The dataset in this study, generated from 468 wells, includes thickness, permeability, porosity, net-to-gross (NTG) ratio, oil production variation (OPV), water production variation (WPV), water cut variation (WCV), neighbouring liquid production variation (NLPV), neighbouring water injection variation (NWIV), and oil saturation variation (OSV). A data processing workflow has been implemented to replace outliers and to increase model accuracy. A total of 10 machine learning algorithms are tested and compared on the dataset. Random forest (RF) and gradient boosting (GBT) are optimal and are selected to conduct quantitative analysis of the main controlling factors. Analysis results show that NWIV is the variable with the highest degree of impact on OSV, with an impact factor of 0.276. Optimization measures are proposed for the development of this kind of sandstone reservoir based on the main controlling factor analysis. This study provides a reference case for quantitative oil saturation analysis based on machine learning methods that will help reservoir engineers make better decisions.


Introduction
The variation of oil saturation is a matter of long-standing concern to reservoir engineers. At the middle-to-late development stage, complex geological characteristics and different development scenarios make the oil saturation variation more complex. Clarifying the main controlling factors influencing oil saturation is essential to optimize development plans.
There are three main conventional methods for measuring, analysing, and predicting oil saturation. The first category is the material balance equation method, which estimates average oil saturation of an entire reservoir based on reservoir geological data and development and production data [1]. Deng et al. developed an improved time-differentiated variable multiple material balance model to evaluate residual oil saturation in water-flooded zones by differentiating the water flooding into numerous displacement processes. This method's calculations show that the residual oil saturation is in excellent agreement with the experimental results of the core analysis [2]. Shahamat and Clarkson discussed the application of conventional flowing material balance (FMB) to the analysis of single-phase or multiphase flow in single or multiwell scenarios and proposed a new, comprehensive FMB. The developed FMB can be used to determine the original volumes of hydrocarbons in place in both conventional and unconventional reservoirs [3]. Rahman et al. proposed a new, rigorous material balance equation for gas flow in the presence of a compressible formation and residual fluid saturation [4].
The second category is core analysis and logging analysis. Core analysis is a laboratory method for the direct measurement of oil saturation. Based on the coring tool, core analysis is classified as conventional coring [5,6], pressure coring [7,8], and sponge coring [9,10]. Zhang et al. sampled three-meter-scale core intervals and calculated the oil saturation index by X-ray diffraction analysis, TOC analysis, and programmed pyrolysis analysis [11]. Xiao et al. systematically studied the geological and geochemical characteristics of the First Member of the Cretaceous Qingshankou Formation in the Qijia Sag based on core samples and core analysis [12]. Logging is the most widely used method for obtaining reliable oil saturation in oil fields [13], especially pulsed neutron logging [14]. Dong et al. studied the detection principles, modes, advantages, and disadvantages of pulsed neutron logging tools, established a formation model based on the Monte Carlo method, and analysed the sensitivity of detection [14]. Nie et al. introduced a novel oil content model for shale oil reservoirs by analysing logging and core experimental data and building the relationship between kerogen and the different well logging porosities, including nuclear magnetic resonance (NMR) porosity, neutron porosity, and density porosity [15].
The third category is reservoir numerical simulation, which simulates the development process and calculates key parameters (production, injection, saturation, etc.) based on a geological model. Ren and Duncan utilized a commercial reservoir simulator to simulate CO2-enhanced oil recovery (CO2-EOR). These simulations explore the effects of the strength of aquifer flow, flow direction, and capillary pressure on the nature and distribution of oil saturation in residual oil zones (ROZs) [16]. Ren and Duncan also explored the impact of various elements, including oil saturation, well patterns, reservoir heterogeneity, and permeability anisotropy, based on simulations [17].
In recent years, the world has entered the era of "Industry 4.0", and so has the oil and gas development industry. Machine learning is widely known for its ability to extract complex patterns from massive data and is widely applied in the upstream oil industry, for example in production prediction and optimization [18][19][20], geological modelling and uncertainty management [21,22], and characterization of connectivity and heterogeneity [23,24]. Wang et al. developed a novel equal probability gene expression programming (EP-GEP) method to analyse the production decline of carbonate reservoirs. Validation and comparison showed that this method outperformed the traditional Arps model in prediction accuracy [18]. Liu et al. proposed a well production performance prediction method based on an artificial neural network. This method can help engineers analyse the massive data from unconventional reservoirs to understand the underlying patterns and relationships, especially in enhanced oil recovery (EOR) pilot projects [19]. Niu et al. established a multiparameter comprehensive intelligent prediction method of karst curtain grouting volume (KCGV) based on support vector machine (SVM). The application results show that this method achieves excellent prediction performance on the KCGV and can provide practical and beneficial help for field karst curtain grouting projects [20]. Kang et al. established a classification model for determining the proper geological scenario among plausible geological uncertainties by utilizing machine learning methods including support vector machine (SVM), artificial neural network (ANN), and convolutional neural network (CNN). The results show that this method generates more reliable reservoir models and successfully reduces the uncertainty in the geological scenario [21]. Du et al. 
utilized deep transfer learning (DTL) to extract features from a training image (TI) of porous media to replace the process of scanning a TI for different patterns as in multiple-point statistical methods. The experimental results show that the proposed method is highly efficient while preserving features similar to those of the target image, shortening reconstruction time [22]. Song et al. developed a novel prediction model to forecast vertical heterogeneity of the reservoir based on various deep neural network algorithms. The machine learning models have the ability to learn and capture hidden relationships between dynamic production data and reservoir heterogeneity. The application results show that the proposed models achieve excellent performance in predicting heterogeneity [23]. Liu developed a machine learning method to evaluate the connectivity between injectors and producers based on back propagation (BP) and convolutional neural network (CNN) algorithms in interlayer reservoirs. The model training dataset consists of dynamic production data under different permeabilities, interlayer dip angles, and injection pressures. The results show that CNN has better prediction performance than BP [24]. Huang et al. developed a long short-term memory (LSTM) neural network model to forecast well performance [25].
Previous studies have looked at the distribution of remaining oil saturation and the influencing factors [26][27][28][29][30]. These studies mainly focused on static characteristics of the reservoir, such as microscopic pore structure and heterogeneity. However, they failed to combine static geological data with dynamic production data, to conduct quantitative analysis, and to calculate main controlling factors.
In this study, machine learning algorithms are employed in the quantitative analysis of the main controlling factors of oil saturation variation during the water flooding process of a Middle East reservoir. This study is divided into the following sections. In Section 2, we briefly describe the geological characteristics, development history, well layout, and dynamic production performance of the research reservoir. In Section 3, we propose the workflow of this study. We introduce detailed data gathering and processing, which is significant as the basis of the model training. We screen a total of 10 machine learning algorithms suitable for the analysis and present their basic principles. In addition, the model evaluation method is introduced in this section. In Section 4, we obtain the optimal algorithms by comparison and apply them to calculate the main controlling factors affecting oil saturation variation. Finally, the discussion and conclusion appear in Section 5.

Research Area
The research area (RM reservoir) is a long-axis anticline reservoir located in the Middle East with a NW-SE trend, and its main lithology is sandstone. The sedimentary environments include shallow-marine continental slope, open sea, reef front, and lagoon. The RM reservoir has good reservoir continuity, and the physical properties of the northern part are better than those of the southern part. Core analysis results show a good porosity-permeability relationship, with an average porosity range of 10-15% and an average permeability range of 40-300 md. The formation thickness of the RM reservoir ranges from 10 to 60 m, and the single-well effective thickness ranges from 6 to 40 m. The target reservoir was developed by primary depletion for almost 40 years; in 2009, its daily oil production was approximately 200,000 barrels from 127 open producers. Water flooding was implemented in this reservoir after 2010 when formation pressure decreased rapidly, resulting in the closure of a large number of producers and declining production. Current reservoir production is approximately 150,000 barrels per day. Long periods of production and development have accumulated a large amount of valuable surveillance and test data, providing a solid foundation for the application of machine learning algorithms. The current development of this reservoir faces difficulties in evaluating the effect of water flooding and in optimizing development measures.

Methodology
3.1. Workflow. The specific workflow of this research includes 4 steps as shown in Figure 1. The first step is data processing including variable selection, correlation analysis, and outlier processing. Training models by utilizing different algorithms is the second step. The next step is model evaluation including scoring and comparing different algorithm models. The final step is quantitatively calculating the main controlling factors of oil saturation variation.

Data Collection and Processing
The most crucial component in machine learning analysis is the database, which is the foundation of modelling. Through data collection and processing, a large amount of useful data can be gathered, and meaningful insights can be gained from the relationships between different variables. In this process, inappropriate variables are screened out, outliers are handled, missing values are filled in, and an integrated dataset is established for model training.
The factors that affect oil saturation involve geology, oil reservoirs, engineering, and others, as shown in Figure 2. The dataset established and utilized in this study contains original data from 468 wells representing the actual situation of target reservoir development.
Static geological data include formation thickness, permeability, porosity, and net-to-gross (NTG) ratio. The detailed data come from seismic interpretation, logging interpretation reports, core analysis, and related studies. Dynamic production data include oil production variation (OPV), water production variation (WPV), water cut variation (WCV), neighbouring liquid production variation (NLPV), and neighbouring water injection variation (NWIV), which come from daily and monthly monitoring. In order to better analyse the factors impacting oil saturation, we also considered the injection and production of neighbouring wells. We define the neighbouring relationship as two wells that are directly adjacent to each other without any well in between. Diagrams of neighbouring wells and neighbouring relationships are shown in Figure 3. Oil saturation variation (OSV), the dependent variable, is also the key parameter in this study. All saturation data come from production logging test (PLT) and reservoir saturation test (RST) reports. We use the variation of dynamic data instead of the rate, because variation can reflect the production performance over a period of time. The variation of dynamic data in this study is equal to the difference between the two saturation tests.
In this study, the Pearson coefficient was employed to analyse the correlation between variables; the calculation is shown in Equation (1):

ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y), (1)

where ρ_{X,Y} is the correlation coefficient of a pair of random variables (X, Y), cov is the covariance, σ_X is the standard deviation of X, and σ_Y is the standard deviation of Y. The coefficient ranges from -1 (negative correlation) to 1 (positive correlation); a correlation coefficient of 0 means that there is no correlation between the two variables.
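As an illustrative sketch (not the authors' code), Equation (1) can be evaluated directly with NumPy; the function name `pearson` is hypothetical:

```python
import numpy as np

def pearson(x, y):
    """Pearson coefficient, Equation (1): rho = cov(X, Y) / (sigma_X * sigma_Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Perfectly correlated and perfectly anti-correlated pairs
x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson(x, 2 * x + 1))   # -> 1.0
print(pearson(x, -x))          # -> -1.0
```

In practice the same coefficient is available directly as `numpy.corrcoef` or `pandas.DataFrame.corr`, which is the usual route for building a correlation heat map such as the one discussed below.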
The heat map of the correlation analysis results is shown in Figure 4. In this reservoir, porosity has a moderate correlation with permeability and a strong correlation with NTG, which indicates that the higher the NTG, the higher the porosity and the better the reservoir properties. Water cut variation is strongly correlated with water production variation, which is in line with the physical laws of reservoir development. In addition, the correlation between oil saturation variation and the other variables is very weak, so it is impossible to establish a simple linear model to analyse the implicit relationships between them.
The existence of outliers seriously affects the performance of machine learning models, so proper handling of outliers is essential to enhancing model accuracy. A boxplot is a graphical method for depicting the distribution of data through quartiles. Boxplots show the five-number summary of a dataset: minimum, maximum, first quartile (Q_1), third quartile (Q_3), and median. The boxplot method defines an index, the interquartile range (IQR), to demarcate outliers; the calculation is shown in Equations (2) and (3):

IQR = Q_3 - Q_1, (2)

x is an outlier if x < Q_1 - 1.5 IQR or x > Q_3 + 1.5 IQR. (3)

The evaluation index of model accuracy is the coefficient of determination (R²), calculated as shown in Equation (4):

R² = 1 - ∑_i (y_i - ŷ_i)² / ∑_i (y_i - ȳ)², (4)

where y_i and ŷ_i are the actual data and the prediction, and ȳ is the average of the data. The overview and outlier distribution of this dataset are shown in Figure 5. The three variables of water production variation (WPV), water cut variation (WCV), and neighbouring liquid production variation (NLPV) have a relatively high proportion of outliers. There are four methods to deal with outliers: Ignore, Mean, Median, and Delete. "Ignore" means to leave outliers without any processing. "Mean" and "Median" mean to replace outliers with mean values or median values, respectively. "Delete" means removing outlier data points. The four processing methods were tested on the three algorithms RF, GBT, and ADBT, and the results are shown in Table 1. Using median values to replace outliers obtains the highest training and testing R².
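The IQR rule of Equations (2)-(3), the "Median" replacement strategy, and the R² metric of Equation (4) can be sketched as follows; the function names are hypothetical and the numbers are toy data, not values from the 468-well dataset:

```python
import numpy as np

def replace_outliers_with_median(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Equations (2)-(3))
    and replace them with the median -- the 'Median' strategy."""
    v = np.asarray(values, float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    out = v.copy()
    out[(v < lo) | (v > hi)] = np.median(v)
    return out

def r2_manual(y, y_hat):
    """Coefficient of determination, Equation (4)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

data = [3.0, 3.2, 2.9, 3.1, 50.0]            # 50.0 is an obvious outlier
print(replace_outliers_with_median(data))    # 50.0 becomes the median, 3.1
print(r2_manual([1, 2, 3], [1, 2, 3]))       # perfect prediction -> 1.0
```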

Machine Learning Algorithms
The Pearson coefficient can only analyse the linear effect of a single variable on oil saturation variation. In the process of reservoir development, oil saturation variation is affected by multiple variables nonlinearly. In order to clarify the contribution of each variable to oil saturation variation, machine learning methods are employed to produce a quantitative analysis.
In this study, a total of 10 machine learning algorithms are selected and used to develop regression models based on the dataset. These regression models can be used to analyse hidden relationships between the dependent variable, oil saturation variation, and the independent variables and to quantify feature variable importance (impact factor). The 10 algorithms include ensemble algorithms (random forest, AdaBoost, and gradient boosting), linear regression algorithms (linear regression, polynomial regression, Lasso, ridge regression, and elastic net regression), support vector machine, and multilayer perceptron.
Random forest (RF) is an ensemble learning algorithm used for classification and regression and is composed of decision trees [31][32][33]. In classification or regression, the output of RF is based on the output values of its internal decision trees. RF outperforms a single decision tree due to its ability to avoid overfitting. RF has the following advantages:
(i) It is unsurpassed in accuracy among current algorithms
(ii) It can process large datasets and thousands of input variables efficiently without dimensionality reduction
(iii) It can handle datasets with missing data and maintains accuracy when a large proportion of the data is missing
(iv) It can calculate estimates of variable importance
Adaptive boosting or AdaBoost (ADBT) is an iterative algorithm that trains different weak classifiers on the same training dataset and then combines these weak classifiers to form a stronger final classifier [34,35]. The ADBT classifier has high accuracy and is not prone to overfitting.
Gradient boosting (GBT) is a classification and regression method based on the boosting technique [36,37]. The core idea of GBT is to train a newly added weak learner on the negative gradient of the current model's loss function and then to integrate the trained weak learner into the existing model.
Least absolute shrinkage and selection operator (Lasso) is a regression analysis method that performs feature selection and regularization at the same time to enhance accuracy and interpretability [38][39][40]. Lasso uses L1 regularization, which drives some learned feature weights to 0, thereby achieving sparseness and feature selection.
Ridge regression is a biased estimation regression method dedicated to the analysis of collinear data [41,42]. It is essentially an improved least squares estimation method. Ridge regression obtains more realistic regression coefficients by giving up the unbiasedness of the least squares method, at the cost of losing some information and reducing accuracy. Ridge regression fits abnormal data more robustly than the least squares method.
Elastic net regression (ENR) is a hybrid of ridge regression and Lasso. ENR is a linear regression model trained using both L1 and L2 regularization as a prior regularization term [43,44].
Support vector machine (SVM) is a supervised learning algorithm used for data analysis, classification, and regression. SVM establishes a hyperplane in a high-dimensional space to make a good separation, with the largest distance to the nearest data points [45][46][47].
Multilayer perceptron (MLP) is a multilayer fully connected feedforward neural network. The basic structure of MLP consists of three layers: an input layer, one or more hidden layers, and an output layer. Each node is a neuron that uses a nonlinear activation function. This enables MLP to process complex, linearly inseparable data [48,49].
Linear regression (LR) is a regression analysis method that uses least squares to model the relationship between one or more independent variables and a dependent variable [50].
Polynomial regression (PR) is considered a special form of multiple linear regression. PR models a nonlinear relationship between the independent variable x and the dependent variable y as an nth degree polynomial in x [51].

Accuracy Comparison of Different Algorithms
The 10 algorithms mentioned above have been trained and tested on the same processed dataset. The best ones can be screened out through horizontal comparison of the multiple algorithms. Train and test performances of the 10 algorithms are shown in Figure 6, the detailed values are shown in Table 2, and crossplots of the actual values and the model predictions are shown in Figure 7.
Random forest (RF) and gradient boosting (GBT) are significantly better than the other algorithms. These two accurately capture the features of the dataset and fit it well.
AdaBoost (ADBT) performs well in testing but fits poorly in model training. Support vector machine (SVM), multilayer perceptron (MLP), linear regression (LR), and least absolute shrinkage and selection operator (Lasso) rank in the middle. The fit of polynomial regression (PR) is very poor, and its crossplot presents a divergent shape, as shown in Figure 7. Elastic net regression (ENR) and ridge regression are completely inapplicable to this dataset.
In comparison, the more advanced ensemble learning algorithms outperformed the traditional regression algorithms. RF and GBT are therefore utilized to quantitatively analyse the main controlling factors of oil saturation variation.
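A minimal sketch of such a horizontal comparison, using scikit-learn, might look like the following. The data here are synthetic (the 468-well dataset is not reproduced in this article), and the model settings are illustrative defaults, not the authors' tuned hyperparameters:

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor,
                              GradientBoostingRegressor, AdaBoostRegressor)
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 8 features, nonlinear target with noise
X = rng.normal(size=(400, 8))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "GBT": GradientBoostingRegressor(random_state=0),
    "ADBT": AdaBoostRegressor(random_state=0),
    "LR": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),
    "Ridge": Ridge(),
    "ENR": ElasticNet(alpha=0.01),
    "SVM": SVR(),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (r2_score(y_tr, model.predict(X_tr)),
                    r2_score(y_te, model.predict(X_te)))
# Rank algorithms by test R^2, as in the horizontal comparison
for name, (tr, te) in sorted(scores.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:6s} train R2 = {tr:.3f}  test R2 = {te:.3f}")
```

On nonlinear data of this kind, the ensemble methods typically dominate the linear models, mirroring the pattern reported above.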

Main Controlling Factors
RF and GBT are used to calculate the impact factor of each variable on OSV, as shown in Figure 8. "Average" is the mean of the results of RF and GBT. Neighbouring water injection variation is the variable with the highest degree of impact on OSV, which is consistent with the development dynamics of the target reservoir after 2010, as shown in Figure 9. This reflects the positive effect of water flooding in this reservoir.
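The averaged impact-factor calculation can be sketched with scikit-learn's impurity-based feature importances. The data below are synthetic and deliberately constructed so that the last column dominates; only the feature names come from this study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(1)
features = ["thickness", "permeability", "porosity", "NTG",
            "OPV", "WPV", "WCV", "NLPV", "NWIV"]
X = rng.normal(size=(400, len(features)))
# Hypothetical target in which NWIV (last column) dominates,
# mimicking the structure of the paper's finding, not its data
y = 1.5 * X[:, -1] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=400)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gbt = GradientBoostingRegressor(random_state=0).fit(X, y)

# "Average" impact factor = mean of RF and GBT importances (each sums to 1)
avg = (rf.feature_importances_ + gbt.feature_importances_) / 2
for name, imp in sorted(zip(features, avg), key=lambda p: -p[1]):
    print(f"{name:12s} {imp:.3f}")
```

Because each model's importances sum to 1, the averaged impact factors also sum to 1, which is consistent with reporting a single dominant factor such as 0.276 for NWIV.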
Permeability and oil production variation rank second and third, respectively, in importance, which is also in line with the laws of reservoir development. Neighbouring liquid production variation ranks fourth, indicating that complete injection-production well patterns are essential to improve oil recovery and tap the potential of remaining oil. Based on the analysis of the main controlling factors of oil saturation, we propose optimization measures for the development of this kind of sandstone reservoir.
(i) Continue to implement water flooding development, establish an effective displacement system, and closely monitor the performance of producers. Reservoir saturation tests (RST) and production logging tests (PLT) need to be conducted to monitor oil saturation variation and prevent water breakthrough
(ii) In areas with high permeability, optimize the injection-production well patterns, for example by infilling wells, to tap the potential of remaining oil
(iii) In areas with low permeability or poor properties, reservoir reconstruction measures such as hydraulic fracturing and EOR measures such as low-salinity water flooding can be applied

Discussion and Conclusion
Machine learning is a data-driven analysis method. It can process massive data and clarify hidden relationships between variables. Conventional methods fail to quantitatively calculate the impact factors of the variables that affect oil saturation. The main advantage of this research is that all data used in the machine learning analysis come from an actual sandstone reservoir. Machine learning has been widely used in the field of oil and gas development, but not all algorithms are equally applicable. The purpose of this research is to select suitable algorithms by comparing and testing 10 different machine learning algorithms, to make full use of real oil field data, and to conduct quantitative analysis of the main controlling factors of oil saturation. This research can provide strong support for further research that characterizes the oil saturation distribution of the whole reservoir by developing a neural network model. This article proposes a method for analysing the main controlling factors of oil saturation variation. Actual static geological data and dynamic production data are gathered to establish machine learning analysis models. The specific conclusions are as follows.
(1) A data processing workflow was established, including correlation analysis and outlier processing. The median method was the most successful outlier processing method in this study
(2) In the comparison and testing of 10 machine learning algorithms, RF and GBT were the optimal algorithms and obtained the highest modelling accuracy
(3) The oil saturation analysis model was established using the RF and GBT algorithms, and the main controlling factors were quantitatively calculated. NWIV is the most important factor, with an impact factor of 0.276
(4) The ranking of the variables provides the basis of the proposal for optimizing reservoir development. The workflow is also an advanced and complex data analysis method, which provides a foundation for the subsequent establishment of a neural network saturation prediction model
(5) Water flooding development should be continued, an effective displacement system established, and the performance of wells closely monitored to prevent water breakthrough

Data Availability
The manuscript is a self-contained data article; the entire data used to support the findings of this study are included within the article. If any additional information is required, this is available from the corresponding author upon request to weichenji@petrochina.com.cn.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.