Prediction of Fluid Viscosity in Multiphase Reservoir Oil System by Machine Learning

Rapid and accurate prediction of fluid viscosity in a multiphase reservoir oil system is important for improving oil production in petroleum engineering. This study proposes three viscosity prediction models based on machine learning approaches. A comparison of prediction accuracy shows that the random forest (RF) model predicts the viscosity of each phase of the reservoir most accurately, with the lowest errors and the highest R² values. The RF model is also fast, with a computing time of 0.53 s for 1200 sets of data. In addition, sensitivity analysis indicates that, in a multiphase reservoir system, the viscosity of each phase is determined by different factors. Among them, the viscosity of oil, which is vital for oil production, is mainly affected by the molar ratio of gas to oil (MR-GO).


Introduction
The fluid viscosity of an oil-gas reservoir [1,2] is a key factor in determining the final development effect and economic benefit of the reservoir. It has therefore become an important basis for formulating oil-gas field development plans [3], studying oil and gas reservoir performance, implementing plan adjustments, and evaluating stimulation [4].
A PVT device combined with a high-pressure falling-ball viscometer [5,6] enables laboratory analysis of reservoir samples to obtain viscosity values under reservoir conditions (high pressure and temperature). A PVT device [7,8] can create specific temperatures and pressures to simulate reservoir conditions; it has therefore been widely used in the oil industry in recent years [9]. However, acquiring such data, including sampling and subsequent analysis, is costly and time-consuming [10].
In addition, to obtain reservoir fluid viscosity quickly and to analyze the factors that influence it, a large number of simulation studies on viscosity have appeared in the petroleum industry, and many commonly used viscosity models have been proposed, including the LBS viscosity model [11], CS viscosity model, LLS viscosity model, Pt viscosity model, and PR viscosity model [12,13]. These models can estimate and predict the viscosity of reservoir fluids with a specific composition. However, the viscosity of oil-gas reservoir fluid, especially in an oil-gas-water multiphase reservoir system, is affected by many factors, including reservoir environmental conditions, oil and gas composition, and water and gas injection [14]. At present, no general viscosity model is available to describe the viscosity characteristics of fluids in multiphase reservoir systems.
Therefore, the objective of this study is to establish a reliable and accurate machine learning model for predicting the viscosity of each phase in a multiphase mixed oil-gas-water system. Research shows that deep neural networks (DNN), random forests (RF), and support vector regression (SVR) are very good at capturing and learning the nonlinear relationships between features in data, and they can accurately predict parameters in a data-driven manner without physical models. Compared with some classic machine learning algorithms, these algorithms can often maintain high prediction accuracy even with small samples while mapping more feature spaces. Trained models are also more portable and can quickly adapt to different application scenarios. This paper therefore uses these three machine learning methods to predict and analyze the viscosity of each fluid in a multiphase mixed oil-gas-water system.
To achieve this research purpose, Tarim reservoir oils were taken as an example. Tarim Oilfield is located in Kuqa County, Xinjiang, China, where the key problems of enhanced oil recovery urgently need to be tackled. Early gas injection research recognized gas injection as the practical technical direction for enhanced oil recovery in the Tarim reservoirs. This study collects a large amount of experimental and simulation data for machine learning. In addition, because the sensitivity analysis of influencing factors plays a vital role in guiding the development of the reservoir oils, the collected data were sorted and analyzed according to the reservoir environmental factors (temperature T and pressure P) and the reservoir composition (the molar ratio of gas to oil MR-GO and the molar ratio of water to oil MR-WO [15,16]). Therefore, the developed viscosity prediction model covers wider ranges of input data, which is important to the production of oil reservoirs.
The structure of this paper is as follows. In the following section, the background, governing equations, and development methodology of the three presented models, including RF, DNN, and SVR, are introduced and described in detail. In addition, this section will also give the calculation method of the statistical indicators for evaluating the three models. Next, in Section 3, the accuracy and calculation time comparison of these three developed models will be evaluated by the statistical indicators, and the reliability analysis of the calculation process of the RF model will be given. Moreover, the sensitivity analysis of the influencing factors will be carried out, and the influence weight of each influencing factor on the output viscosity in the multiphase system will also be given. Finally, Section 4 will present the key findings of this paper.

Prediction Models.
Three prediction models were used to predict the viscosity of a multiphase reservoir oil system from input data of crude oil systems, namely MR-GO, MR-WO, reservoir pressure (P, MPa), and reservoir temperature (T, °C). Table 1 lists the statistical characteristics of the input data.
Random Forest (RF).
Random forest [17], an ensemble learning algorithm based on classification and regression trees, has been widely used in many fields. Ho [18] first proposed the random forest algorithm in 1995, and Breiman [19] improved it in 2001. RF is a machine learning method based on statistical learning theory. First, multiple samples are extracted from the original data set through the bootstrap resampling method. Then, a decision tree model is established for each bootstrap sample, and the outputs of the individual trees are averaged to give the final prediction:

$$M_{rf}(x) = \frac{1}{K}\sum_{i=1}^{K} t_i(x) \tag{1}$$

where $M_{rf}$ represents the calculation result of the random forest model, $K$ is the number of regression trees required, $t_i$ represents a single regression tree model, and $x$ is the input vector. In the calculation process of the random forest model, there are two critical hyperparameters: the number of regression trees (n-estimators) and the number of random variables at the nodes (max depth). Too few regression trees reduce the accuracy of the calculation, whereas too many increase its complexity. The training process of random forest is shown in Figure 1.

Figure 1: The training process of random forest.

Figure 2: The training process of deep neural network (input layer, hidden layer, output layer).
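The bootstrap-and-average idea behind the RF prediction can be illustrated with a minimal pure-Python sketch. It is not the model used in this study: each "tree" is only a depth-1 regression stump on a single input, the data are made-up toy values, and the names `fit_stump`, `fit_forest`, and `predict_forest` are hypothetical. In scikit-learn, the corresponding estimator would be `RandomForestRegressor`, whose `n_estimators` and `max_depth` parameters set the number and depth of the trees.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: choose the split with the lowest squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # candidate thresholds; both sides stay non-empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # every x in this bootstrap sample is identical
        m = mean(ys)
        return lambda x: m
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm


def fit_forest(xs, ys, k, seed=0):
    """Train k stumps, each on its own bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n points with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Ensemble prediction: the average of the individual tree outputs."""
    return sum(t(x) for t in trees) / len(trees)


# Toy data (assumption): one normalized input and a step-like response.
xs = [i / 10 for i in range(10)]
ys = [1.0] * 5 + [0.2] * 5
forest = fit_forest(xs, ys, k=25)
print(predict_forest(forest, 0.0), predict_forest(forest, 0.9))
```

Averaging over bootstrap-trained trees is what makes the ensemble less sensitive to any single resample; a larger `k` smooths the prediction at the price of more tree evaluations.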

Deep Neural Network (DNN).
A deep neural network [20,21] is a machine learning method that combines a multilayer perceptron structure and a backpropagation algorithm. It is mainly composed of three parts: the input layer, hidden layer, and output layer (Figure 2).
The key structure of a deep neural network is the neuron, which characterizes the nonlinear mapping relationship between input and output. The output equation of each neuron is as follows:

$$O = g\left(\sum_{i} w_i x_i + b\right) \tag{2}$$

where $O$ is the output value, $g$ is the activation function, $w_i$ is the weight of the input parameter $x_i$, and $b$ is the threshold. The training process of the neural network model is divided into two steps: forward propagation and backward propagation. First, forward propagation is used to calculate the predicted value of the model. The error gradient between the predicted value and the true value is then obtained from the loss function. According to the error gradient, the weights and thresholds in the neural network are adjusted through the backpropagation algorithm. Repeating this process lets the network continuously learn the hidden features in the data.
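The forward and backward steps described above can be sketched for a single neuron. This is a minimal illustration under stated assumptions, not the network used in this study: the sigmoid activation, the squared-error loss, the learning rate, the additive treatment of the threshold b, and the input values are all chosen for illustration only, and `neuron_output` and `train_step` are hypothetical names.

```python
import math


def sigmoid(z):
    """Example activation function g (an assumption; other choices are possible)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b, g=sigmoid):
    """Forward propagation for one neuron: O = g(sum_i w_i * x_i + b)."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_step(x, target, w, b, lr=0.5):
    """One forward pass followed by one gradient update of w and b."""
    o = neuron_output(x, w, b)
    loss = 0.5 * (o - target) ** 2
    # Error gradient with respect to the pre-activation value (sigmoid derivative).
    delta = (o - target) * o * (1.0 - o)
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    new_b = b - lr * delta
    return new_w, new_b, loss


# Hypothetical normalized inputs standing in for P, T, MR-GO, and MR-WO.
x = [0.5, 0.2, 0.8, 0.1]
target = 0.3
w, b = [0.0] * 4, 0.0
losses = []
for _ in range(500):
    w, b, loss = train_step(x, target, w, b)
    losses.append(loss)
print(losses[0], losses[-1], neuron_output(x, w, b))
```

A full network stacks many such neurons in layers and propagates `delta` backwards through them; the single-neuron case only shows how the error gradient drives the weight and threshold updates.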

Support Vector Regression (SVR).
Support vector machine [22-24] is a machine learning method proposed by Cortes and Vapnik [25] in 1995 to learn the mapping relationship between parameters. The core idea of support vector regression is to find the nonlinear mapping relationship between input space and output space. Relying on the nonlinear mapping, data are mapped to a high-dimensional feature space. The estimating function is as follows:

$$f(x) = w^{T}\varphi(x) + b \tag{3}$$

where $f$ is the linear function, $\varphi(x)$ denotes the nonlinear mapping of the input $x$ into the feature space, and $w$ and $b$ are the identified weight vector and the bias term, respectively. In the high-dimensional feature space, the optimization problem for SVR with the ε-insensitive loss function is as follows:

$$\min_{w,\,b,\,\xi,\,\zeta}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}(\xi_i + \zeta_i) \quad \text{s.t.} \quad \begin{cases} y_i - f(x_i) \le \varepsilon + \xi_i, \\ f(x_i) - y_i \le \varepsilon + \zeta_i, \\ \xi_i \ge 0,\ \zeta_i \ge 0, \quad i = 1, \dots, N \end{cases} \tag{4}$$

where $y_i$ is the target value of the $i$-th sample, $\|w\|^{2}$ in the objective function is the confidence range reflecting the generalization ability, $\xi_i$ and $\zeta_i$ are the slack variables that represent the upper and lower limits of allowable error, $\sum_{i=1}^{N}(\xi_i + \zeta_i)$ denotes the empirical risk reflecting the learning capacity of the function, $\varepsilon > 0$ is an insensitive loss coefficient, and the parameter $C$ ($C \ge 0$) is a penalty factor. In SVR, the dual problem of Equation (4) is often derived by using the Lagrange multiplier method, based on which a linear regression function can finally be constructed.
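To make the roles of ε, the slack variables, and the penalty factor C concrete, the following sketch evaluates the SVR objective for given residuals. It only computes the quantities; it does not solve the optimization problem. The function names are hypothetical, the numbers are illustrative, and assigning ξ to points above the ε-tube and ζ to points below it is a convention assumed here.

```python
def slack_variables(y_true, y_pred, eps):
    """Return (xi, zeta): how far a point lies above or below the eps-tube."""
    r = y_true - y_pred
    xi = max(0.0, r - eps)      # target above the tube
    zeta = max(0.0, -r - eps)   # target below the tube
    return xi, zeta


def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors smaller than eps cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)


def svr_objective(w, slacks, C):
    """0.5 * ||w||^2 + C * sum_i (xi_i + zeta_i)."""
    norm_sq = sum(wi * wi for wi in w)
    return 0.5 * norm_sq + C * sum(xi + zeta for xi, zeta in slacks)


# Illustrative values only (assumptions, not data from this study).
targets = [1.0, 0.5, 1.0]
predictions = [0.5, 1.0, 0.95]
slacks = [slack_variables(y, p, eps=0.1) for y, p in zip(targets, predictions)]
print(slacks)
print(svr_objective([1.0, 2.0], slacks, C=10.0))
```

The third point lies inside the tube, so both of its slack variables are zero and it adds nothing to the penalty term; increasing C makes violations of the tube more expensive relative to the norm of w.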

Statistical Evaluation of Three Models.
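To evaluate the accuracy of the machine learning models, three statistical indicators were used: mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R²) [26-28]. They are calculated as follows:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(X_i^{\mathrm{data}} - X_i^{\mathrm{model}}\right)^{2} \tag{5}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|X_i^{\mathrm{data}} - X_i^{\mathrm{model}}\right| \tag{6}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(X_i^{\mathrm{data}} - X_i^{\mathrm{model}}\right)^{2}}{\sum_{i=1}^{N}\left(X_i^{\mathrm{data}} - \bar{X}^{\mathrm{data}}\right)^{2}} \tag{7}$$

where $N$ represents the total number of samples, $X_i^{\mathrm{data}}$ is the reference value (the actual expected value), $X_i^{\mathrm{model}}$ represents the value predicted by the machine learning method, and $\bar{X}^{\mathrm{data}}$ is the mean of the reference values.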
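The three indicators can be computed with a few lines of Python. The sketch below uses their standard definitions; the function names and the sample values are illustrative assumptions rather than data from this study.

```python
def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative reference and predicted values (assumptions).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(mse(actual, predicted), mae(actual, predicted), r2(actual, predicted))
```

MSE penalizes large deviations more strongly than MAE, while R² relates the residual error to the spread of the reference data, so a value close to 1 means the model explains most of the variation.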

Comparison of Three Proposed Models.
To evaluate the precision of each constructed model, MSE, MAE, and R² are calculated for the gas viscosity (υ_g), oil viscosity (υ_o), and water viscosity (υ_w) outputs. The calculated evaluation results are presented in Table 2.
According to Table 2, RF is the most suitable model for viscosity prediction. Taking oil viscosity as an example, among the three prediction models, the RF model has the lowest MSE of 0.008, the lowest MAE of 0.0093, and the highest R² value of 0.9623. Of the three models, SVR gives the worst prediction results.
To visualize the performance of each prediction model, cross-plots of model predictions against the corresponding experimental values are drawn in Figure 3 for gas viscosity (υ_g), oil viscosity (υ_o), and water viscosity (υ_w). In each plot, blue points show the RF estimates, red points the SVR estimates, and yellow points the DNN estimates, while the Y = X line represents the experimental values. In a cross-plot, the closer the points lie to the Y = X line, the higher the precision. In Figures 3(a)-3(c), only the RF model keeps the majority of its data points close to the Y = X line for all three viscosities, showing very good agreement between model predictions and the corresponding experimental values. For the RF model, R² is as high as 0.9824, 0.9623, and 0.9999 for gas viscosity (υ_g), oil viscosity (υ_o), and water viscosity (υ_w), respectively, which is much higher than for the other models.
Moreover, Figure 4 compares the oil viscosity (υ_o) predicted by the three proposed models with the corresponding experimental values. The predictions of the RF model are very close to the experimental results, showing that the RF model achieves good accuracy.
Finally, taking 1200 sets of data as an example, the prediction calculation time of the machine learning models was compared with that of the traditional numerical method (TDM), as shown in Table 3. In this paper, the experiment is based on the TensorFlow and scikit-learn libraries with the Python language. The hardware resources include an Intel i7-7700HQ processor at 2.8 GHz, 16 GB of memory, and an Nvidia GTX 1060 (6 GB) graphics card. Table 3 indicates that the machine learning models need only 0.53 s, 0.82 s, and 0.76 s for the prediction calculation with RF, DNN, and SVR, respectively, whereas the traditional numerical method needs as much as 52.8 s. Machine learning models, therefore, have a clear advantage in computing time.
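The timing comparison can be restated as speed-up factors. The short calculation below uses only the prediction times quoted in the text for 1200 sets of data; the variable names are illustrative, and the per-sample figures assume that the cost is spread evenly over the samples.

```python
# Prediction times for 1200 sets of data, as quoted in the text (seconds).
tdm_time = 52.8
ml_times = {"RF": 0.53, "DNN": 0.82, "SVR": 0.76}
n_samples = 1200

# Speed-up of each machine learning model relative to the numerical method.
speedup = {name: tdm_time / t for name, t in ml_times.items()}

# Average prediction time per sample in milliseconds (assumes even cost per sample).
per_sample_ms = {name: 1000.0 * t / n_samples for name, t in ml_times.items()}

for name in ml_times:
    print(f"{name}: {speedup[name]:.1f}x faster, {per_sample_ms[name]:.3f} ms per sample")
```

On these figures, all three models are well over an order of magnitude faster than the numerical method, with RF close to two orders of magnitude.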

Results Analysis by RF Model.
Statistical evaluation parameters of the best model, RF, for the training, testing, and total data sets are presented in Table 4. For the training set, R² of the RF model is as high as 0.9921, while MAE and MSE are very low, at 0.0021 and 9.942E-5, respectively. For the testing set, the R², MAE, and MSE values are 0.9816, 0.0035, and 0.0003. For the total set, the corresponding values are 0.9811, 0.0023, and 0.0001, respectively, indicating good performance of the RF model.
The cross-plot of the proposed RF model is shown in Figure 5. The points are densely distributed around the Y = X line for all of the data, and the error stays essentially within 5%, indicating sufficient accuracy and reliability of the developed RF model.

Importance Analysis.
An importance analysis of the influencing factors was carried out as a sensitivity analysis of viscosity in the multiphase gas-oil-water system. For each viscosity (gas, oil, and water), each of the independent influencing factors (P, T, MR-GO, and MR-WO) was evaluated. The results of the importance analysis are shown in Figure 6. For each output, the influence proportions of the four factors (P, T, MR-GO, and MR-WO) sum to 1. The higher the impact proportion, the stronger the relationship between the input parameter and the output.
As expected, for gas viscosity, MR-GO is the most significant factor, with an impact proportion of 0.53. Next are the environmental factors P and T, with proportions of 0.20 and 0.25, respectively. Here, the water content has little effect. For oil viscosity, the output value is mainly affected by ambient temperature T and gas content MR-GO, with impact proportions of 0.42 and 0.40, respectively. Pressure P also plays an important role, with a proportion of 0.17. Finally, the viscosity of water in the multiphase system is mainly determined by the ambient temperature T, with a proportion of 0.76, followed by its content MR-WO in the system, with a proportion of 0.23.
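Because the proportions for each output sum to 1, the share left for the factors not quoted in the text can be estimated by subtraction. The sketch below does this with the reported values; since those values are rounded, the remainders are approximate, and the dictionary layout and function name are illustrative assumptions.

```python
# Impact proportions quoted in the text for each viscosity output.
reported = {
    "gas": {"MR-GO": 0.53, "P": 0.20, "T": 0.25},
    "oil": {"T": 0.42, "MR-GO": 0.40, "P": 0.17},
    "water": {"T": 0.76, "MR-WO": 0.23},
}


def residual_share(shares):
    """Share left for the unquoted factors, given that all proportions sum to 1."""
    return round(1.0 - sum(shares.values()), 2)


for phase, shares in reported.items():
    dominant = max(shares, key=shares.get)
    print(phase, "dominant factor:", dominant, "remaining share:", residual_share(shares))
```

For gas and oil viscosity, the remainder corresponds to MR-WO, which is consistent with the statement that the water content has little effect; for water viscosity, it is shared between P and MR-GO.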

Conclusions
In this study, three machine learning models, namely, random forest (RF), deep neural network (DNN), and support vector regression (SVR), were proposed for calculating and predicting the phase viscosities of multiphase reservoir oil systems.
To assess the accuracy of each developed model, statistical evaluation indicators, including mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R²), were applied. The results show that the RF model has higher accuracy than the other two models. Specifically, for the RF model on the total data set, the R², MAE, and MSE values are 0.9811, 0.0023, and 0.0001, respectively, indicating good performance. Moreover, machine learning models have an advantage in computing time: the RF model needs only 0.53 s to predict 1200 sets of data.
In addition, an importance analysis of the influencing factors was carried out on viscosity in this multiphase gas-oil-water system. For gas viscosity, MR-GO is the most significant influencing factor, with an impact proportion of 0.53, followed by P and T. Oil viscosity is mainly affected by T and MR-GO, and P also has an effect. Finally, the viscosity of water is mainly determined by T, followed by MR-WO.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
All authors confirm that there is no financial/personal interest or belief that could affect our objectivity, and no conflicts exist.