CNN-LSTM Model Optimized by Bayesian Optimization for Predicting Single-Well Production in Water Flooding Reservoir

Geared toward the problems of predicting the unsteadily changing single oil well production in water ﬂ ooding reservoir, a machine learning model based on CNN (convolutional neural network) and LSTM (long short-term memory) is established which realizes precise predictions of monthly single-well production. This study is considering more than 60 dynamic and static factors that a ﬀ ect the changes of oil well production, introduce water injection parameters into data set, select 11 main control factors, and then, build a CNN-LSTM model optimized by Bayesian optimization. The e ﬀ ectiveness of the proposed model is veri ﬁ ed in a realistic reservoir. The experiment results show that the prediction accuracy of the proposed model is over 90%, which suggests a penitential application in an extensive range of applications. Production forecasting by the developed model is simple, e ﬃ cient, and accurate, which can provide a guidance for the dynamic analysis of a water ﬂ ooding reservoir, and work as a good reference of the development and production of other types of reservoirs.


Introduction
Oil production prediction runs through the entire development course of a water-driven oilfield; it is the foundation of the well stimulation and plays an important role in making investment decisions.For a long time, the main production prediction methods are reservoir numerical simulation, mechanism model, and decline curve analysis, which have their respective advantages and disadvantages.Since the 1990s, fuzzy comprehensive evaluation (FE), BP neural network (BPNN), grey model (GM), and other methods have been applied to oilfield production prediction [1,2].Recently, with the maturity and wide application of big data, artificial intelligence, and other technical theories and methods, more and more intelligent algorithms have been introduced into the petroleum industry, providing a new way to solve complex engineering problems [3,4].
As a nonlinear fitting method, machine learning can learn rules from data and make a prediction.In the past 20 years, there has been extensive research on production prediction based on machine learning.Common methods include random forest (RF), support vector machine (SVM), fuzzy comprehensive evaluation (FE), artificial neural network (ANN), and autoregressive integrated moving average model (ARIMA) [5][6][7][8][9][10][11][12][13][14].These classical machine learning methods have been fully applied in the field of petroleum industry and show strong vitality.
Y. Duana et al. use RTS (Rauch Tung Striebel) to smooth the gas production series, and then, an ARIMA model was established to predict the gas production [6].J. Gu et al. proposed an oil well production prediction model which combines ARIMA and Kalman filter to eliminate the influence of nonsynchronicity and hysteresis [7].Viet Nguyen-Le et al. propose 3 ANN models to predict the parameters in Arps's hyperbolic decline and then reconstructs the production profile [13].Y. Zhu et al. established an "ε-SVM" production prediction model based on sequential minimal optimization (SMO) algorithm using production records and bottom hole pressure data [14].
L. Zhang et al. proposed a GRU-FNN model to predict single-well production in water drive oilfield [23].X. Song et al. adopted particle swarm optimization (PSO) to optimize the basic structure of LSTM model and achieved better model performance [24].Weiss applied fuzzy ranking and neural network to establish correlations to predict oil production [25].H. Wang et al. established a production prediction model for high-water-cut oil fields considering both timing and engineering factors based on LSTM and increased the input features to 17 items, with a relative error of only 1% [26].Sagheer et al. compared the application effects of simple RNN and LSTM in oil and gas production predictions, verified that LSTM had better performance than simple RNN [27].
All the predictions made in the above studies are pretty accurate, but there is still room for further improvements.First, most of the works are aimed at predicting the total production of a reservoir or an oilfield, few research focused on the single-well production prediction.In fact, accurately predicting the production of single well is not easy because the fluctuation of single-well production is more irregular and more drastic than that of the block/reservoir production.Also, there are too many factors affecting single-well production, and we can hardly analyze them clearly.Second, in most studies, the production series is divided into 3 segments based on time, the earlier one for training, the middle one for validating, and the latest one for testing.However, as we all know, the production changes of oilfield/oil well have significant stages, and the early stage, middle stage, and late stage have different dominant control factors or rules of change.Learning the rules in early stage and predicting the production changes in late stage are bound to cause errors.Finally, there are few studies that consider both dynamic and static characteristics in time series prediction.Based on these considerations, to achieve accurate single-well production prediction in water flooding oilfield, the data set was built considering over 60 geological and development factors that affect oil well production.The influence of the water injection on oil production is also quantified and added to the data set as a feature.The missing data are filled in accordance with the industry knowledge and the distribution pattern of data.Secondly, the feature correlation analyses tailored to time series prediction are carried out, through which 11 dominant control factors of oil well production are selected to build the samples.The production prediction model is established based on a one-dimensional convolution neural network and a long short-term memory neural network, and the model hyperparameters are optimized by Bayesian optimization.Finally, an example numerical test is carried out in a practical reservoir.Results show that the CNN-LSTM model has better performance compared with the Bi-LSTM model, Attention-LSTM model, or any other models.Production forecasting by the developed model is simple, efficient, and accurate, which can provide a guidance for the dynamic analysis of a water flooding reservoir and work as a good reference of the development and production of other types of reservoirs.

Methodology
2.1.LSTM and Conv1D.LSTM (long short-term memory) can be the most successful recurrent neural network (RNN) nowadays.Its unique design of gate structure and cell state makes it possible to capture long-term dependencies, which is also the key to its great success.As shown in Figure 1, different from the other simple RNN units, there are three gates in the LSTM cell unit: forgetting gate, input gate, and output gate.In each time step, LSTM cells will partially pass the information to the next step and retain some information in the cell as the cell state, which is the key for LSTM to achieve long-term memory.LSTM neural network performs well in the field of time series prediction [28].
Conv1D (1-dimensional convolutional neural network) is famous in the areas of image recognition, but in recent years, people find that 1-dimensional CNN (Conv1D) can well accomplish the task of time series prediction and even outperforms LSTM and other time series prediction models 2 Geofluids in certain cases.When Conv1D is used for time series prediction, the inputs, n temporal steps with m features, can be viewed as an image of size of n × m, and Conv1D can extract a new image from the input data by scanning and calculating in one direction.Changing the number of filters and the size of convolutional kernels, we can easily control the size of outputs (see Figure 2).It should be noted that the convolution kernel of Conv1D is also 2-dimensional, but it can only slide windows in one direction, and that is the difference between Conv1D and 2-dimensional convolutional neural network.

Spearman Correlation
Coefficient.Spearman correlation coefficient, also known as the rank correlation coefficient, can measure the degree of nonlinear correlation between two features and is a method to analyze the correlation between two variables.Spearman correlation coefficient is calculated as in The model illustrated in this figure wants to predict production for the next three months, that is, it wants three outputs, which we achieve using a simple dense layer containing three neurons.4 Geofluids When the ρ is negative, it means that the two features have a negative correlation, one feature increases means the other decreases.Similarly, if ρ is positive, the two features are positively associated.The larger the absolute value is, the stronger the correlation between two features is.
2.3.Bayesian Optimization.Acclaimed as one of the best hyperparameter tuning algorithm at present stage, Bayesian optimization is a general gradient-free global optimization strategy, which can identify a good set of hyperparameters with few iterations.It is suitable in two cases: (1) the objective function is extremely complex and time-consuming to evaluate; (2) the target function is difficult to differentiate in respect to the independent variable.
In each iteration, Bayesian optimization decomposes the optimization problem into multiple small optimization problems.It samples the original function curve by a certain method and builds an alternative curve by fitting these points.Bayesian optimization builds a model to describe the parameter-distribution of the objective function using the Gaussian process model and then solves the minimum value of the alternative model and uses this minimum value as the optimal solution of the original function in this iteration.This process is called surrogate optimization.By gradually increasing the number of sampling points, the model will gradually approach the original objective function, and an optimal combination of hyperparameters will be obtained.
In this paper, Bayesian optimization is used to optimize network model hyperparameters, including network depth and width, initial learning rate, and network activation functions.For all hyperparameters to be optimized, a maximum value and a minimum value are first specified for them, respectively.Based on the Bayesian optimization method in the package Keras_tuner, appropriate maximum iteration and optimization objective are set and then executed to obtain the optimal combination of hyperparameters.

Data Processing and Feature Engineering
3.1.Sources of Data.Data used in this paper were collected from 426 oil wells and 94 water wells in a Chinese reservoir named A with an average production time of 406 months.The oldest wells in the data set can trace back to 1956.During more than 60 years of development, the reservoir has experienced a variety of production methods such as pumping production, water flooding production, and fracturing production.At present, the reservoir has entered the ultrahigh water cut stage, with an average water cut of more than 95%.
The initial data include monthly production data of 426 oil wells, monthly injection data of 94 water-injecting wells, and single-well reservoir information.The specific items are shown in Table 1.

Data Preprocessing
3.2.1.Filling.In this paper, most of the missing data are bottom hole flowing pressure, reservoir static pressure, or dynamic liquid level height.According to their own charac-teristics, we use different filling methods, taking well A as an example to illustrate.
(1) Reservoir static pressure According to the oil field development experience, the change of static pressure during the oilfield development process will not be significant and may have an obvious trend; therefore, the regression fitting method can be useful.Through the regression analysis using existing data, the relationship between the static pressure and time can be obtained, and the missing static pressure data will be calculated, as shown in Figure 3(a).
(2) Bottom hole flowing pressure (FBHP) Figure 3(b) shows that changes of bottom hole flowing pressure during a well's production life has stages, and the range of FBHP in different stages varies greatly, which limits the application of fitting regression method.In this case, we use the cubic spline interpolation, and the filling results are shown in Figure 3(b), which is basically consistent with the actual value distribution.
(3) Working fluid level As the working fluid level and FBHP were highly correlated (see Figure 3(c)), we can easily obtain a linear relationship between the two features, and one can be calculated from another.If either the working fluid level height or bottom hole flowing pressure is available, cubic spline interpolation is used.3.2.2.Encoding.Since a large part of static features in the data set are category features, the one-hot encoding process for category features should be carried out to concordance numerical features and category features.The features that are treated with this operation include units, layers, well types, and producing schemes.

Measurement of Injection-Production Relationship.
A major factor that affects the production of oil well in water-driven reservoir is waterflooding measure.In this paper, we propose the parameter injection t i , producing well i injected volume in month t, and add it into the data set as a feature "inject." As shown in Figure 4, an oil/producing well is often controlled by n water/water-injecting wells.We assume that a water-injecting well has a certain radiation radius, and in the radiation radius around a water well, the oil well within the radius is affected by that water-injecting well, and the smaller the distance is, the greater the impact of inject well on producing well is, and the larger producing well injected volume is.The total producing well injected volume equals the superposition of all "affected water-injecting wells." According to this hypothesis, we can quantify the producing well injected volume, as shown in Here, i denotes the producing well i, j denotes the waterinjecting well j, m denotes the number of water-injecting wells who influences producing well i.And the distance between producing well i and water-injecting well j is distance i,j .In tth month, α t j units of water were injected to reservoir by water-injecting well j.
The radiation radius is determined by correlation analysis: the closer the radiation radius to the actual radiation radius, the higher the correlation between producing well injected volume and the well production.By establishing the relationship between radiation radius and correlation coefficient, we can find the optimal radiation radius.As shown in Figure 5, 1000 m is the optimal radiation radius.Under this radius, producing well injected volume in a The correlation between producing well injected volume and production features.Seven correlation coefficients are used to compare; they are ρ (radiation radius, water cut), ρ (radiation radius, cumulative oil production), ρ (radiation radius, cumulative water production), ρ (radiation radius, monthly oil production), and ρ (radiation radius, recovery).6 Geofluids certain time and certain producing well is calculated and added to the data set as the feature "inject." 3.3.Feature Engineering 3.3.1.Feature Dimension Reduction.For better prediction accuracy, feature compression is needed to improve the quality of features and limit the number of features.In the process of feature compression, static features and dynamic features need to be treated separately, so are the discrete features and continuous features.Otherwise, the static features representing the characteristics of individual wells will submerge in the continuously changing dynamic features.Similarly, the highly sparse 0-1 features created by one-hot encoding will also submerge in the continuously changing numerical features.In the data processing working flow of this paper (Figure 6), we first separate the dynamic features (features that change over time), as shown in Table 1.There are in total 11 dynamic features without dimension reduction.Secondly, we separate the 0-1 features from the static features for principal component analysis (PCA) dimensionality reduction (with a confidence of 95%); then,    the remaining numerical features are compressed by another PCA model (with a confidence of 95%), and two sets of compressed features are spliced to form the features numbered no.0-no.32 (see Figure 7) finally.

Feature Selection.
In previous studies, the analyses of correlation with time series are generally carried out in the same time step in terms of production prediction; this correlation is the relationship between the current month's output and the current month's flowing pressure (or other features).This correlation can reflect the relationship between targets and variables and between variables and variables.However, there are still some loose places: the data used in prediction is not the data of the current month, but the data of the pre-vious n months, which means that if we want to get a more accurate correlation coefficient, the calculation should also be carried out between the target month's output and the input data used to predict it (see Figure 8).
The study takes the input length of 12 and the output length of 12 (using the previous 12 months' data to predict the next 12 months' production) as an example to illustrate the problem.
Figure 7 shows the visualization results of Spearman correlation coefficient between the inputs and outputs.Figure 7(a) shows the average of 12 correlation coefficients between the production in the next nth month and the feature k in the last m months: Here, ρðoutput n , feature k Þ denotes the Spearman correlation coefficient between output n and feature k ; output n denotes the production in the nth month; feature k denotes the kth feature; fature k m denotes the feature k in the last m months.
Figure 7(b) shows the correlation coefficient between the production in the next 1th month and the features in the last mth months.When a certain feature is fixed, from left to right, the color of the grids in Figure 7(a) gradually becomes lighter, and the color of the grids in Figure 7(b) gradually becomes darker, which means that the larger the time interval between the inputs and outputs, the lower the correlation between them, and the greater the difficulty of prediction.8 Geofluids Observing row by row, we can easily find that dynamic features have stronger correlation with future productions than static features, which can be easily explained: the correlation coefficient depends to a certain extent on the relative variation of the value of features in the sample.Compared with the production which remains changed over time, static features only vary from well to well, so the overall correlation is weak.We also note that the forecast target (production) has a strong autocorrelation-future production is far more correlated with past production than any other characteristic (up to 0.9), which means that the forecast value will be largely determined by the past production data.
To reduce the degree of difficulty in model fitting, accelerate the convergence, and eliminate invalid features, features are selected, as shown in Figure 7(c): with ±0.2 as the threshold, the features of correlation ðaccumulative correlation score over all outputsÞ ≥ 0:2 or ≤−0.2 will be retained, and 11 features are eventually entered into the data set.The correlation between them is shown in Figure 9.We can find that the correlation between monthly water production and monthly liquid production is much greater than that between monthly oil production and monthly liquid production, which confirms that the block has entered the ultrahigh water cut stage, and most of the produced liquid is water.

Sample Generation.
Before generating the input samples, we need to standardize the data, so that different features have the same scale, and they will have a fair chance to be learned by the model.As mentioned in the introduction, in order to enable the model to learn the rule of production variation at each production stage, we divided the data set by wells: 341 wells constitute the training set for model training, 43 wells constitute the validation set for hyperparameter optimization, and 42 wells constitute the test set for model

Model Design and Evaluation
4.1.Evaluation Metrics.In this paper, in order to evaluate the prediction performance of the model comprehensively, five evaluation metrics are used to evaluate the model, as shown in Table 2.It is worth noting that, to reduce the uncertainty caused by the potential randomness of the model, the model evaluation results in this paper are all from the average of 3 experiments under the same setting.4.2.Model Structure Design.After extensive investigations and comparative experiments, LSTM and CNN are selected to build a hybrid model.CNN is stacked in the first place, its excellent ability of feature extraction enables the model to extract as much hidden knowledge as possible.After the CNN layers, we stacked the LSTM layers in the hope that the model could learn the changes of timing sequence better.In addition, the layer normalization is used between the LSTM layers, which prevents possible gradient extinction and gradient explosion.
In this paper, Bayesian optimization is used to optimize the hyperparameters and the structure of the model.Since the limited space, only the structure and hyperparameter optimization results of the optimal model (CNN-LSTM) are presented here, as shown in Table 3.

Comparative
Model.There were 8 models built for the optimal model selection, as shown below: CNN-LSTM: Conv1D for feature extraction.LSTM for timing sequence capture [29,30].
LSTM: classical LSTM model.Bi-LSTM: the normal LSTM model can only learn the information from front to back, but cannot catch the information from back to front.Bidirectional LSTM is an improvement in this aspect.It combines forward LSTM and backward LSTM to capture forward and backward information at the same time [31,32].
GRU: GRU is one of the varieties of LSTM and maybe the most successful one.It simplifies the three gate structures of LSTM cells into two and can achieve almost the same prediction accuracy as LSTM while greatly accelerating convergence [23].
CNN: one-dimensional convolutional neural network model for time series prediction [33].
Attention-LSTM: the LSTM model supplemented with an attention mechanism in hope that the addition of attention mechanism can help the model better capture the critical time steps which may contain the key information about production changes.This paper adopts Luong Attention mechanism and "General" score function.To focus attention on multiple time steps instead of one, we modify the weight activation function to "sigmoid" [34][35][36].
Self-Attention: do not use RNNs or CNNs, and the multihead self-attention mechanism is used to realize the prediction of time series.The structure of the model refers to BERT [37], but word embedding and location embedding are dropped, and the activation function adopts "sigmoid" [38].
ARIMA: one of the most common time series prediction models.
It is worth noting that the hyperparameters of all the above models are also the most combined ones obtained by Bayesian optimization.

The Training Set.
The input of the neural network is selected features as well as production records in a certain  84-3: using the data from the previous 84 months to predict oil production of the next three months.
10 Geofluids duration, and the model is trained in a "supervised" fashion with future well production being the outputs.In this case, the inputs at each step of the model include historic production/features data for 82 consecutive months; the output is the production in the next 3 months.
In the training process, the model based on the encoderdecoder uses the "Teacher Forcing" hybrid training strategy [35], in which 60% of the input data of each time step in the decoder stage is real data, and the other 40% is the predicted value of the previous model output.This hybrid strategy can prevent overfitting caused by rapid convergence while ensuring the prediction accuracy of the model.Through the experiment, "Adagrad" was chosen as the model optimizer.The size of batch and the initial learning rate were set at 36 and 0.05, respectively.Callback function ReduceL-ROnPlateau was used to adapt the learning rate.The maximum epoch is 150; training process is controlled by the callback function EarlyStop.And model loss is MAE of predicted production.

Results and Discussion
5.1.Discussion of Different Models.Among the 8 models proposed, CNN-LSTM model achieved the best performance, and the predicting performance of other models are shown in Figure 10 and Table 4.
Results show that the prediction precision of attention mechanism model (including self-attention mechanism) is the worst.In many papers, attention mechanism improves the score of time series prediction [32,[34][35][36].However, it is not the case in this paper.In fact, the production of a well in a certain month does not heavily depend on one or more certain previous months, so adding weights to different time steps is not very helpful in forecasting.Besides, the addition of attention mechanism greatly aggravates the training burden of the model, which is another possible reason for the low prediction accuracy of the model.
Bi-LSTM model also failed to reach the expected score, and its performance was slightly worse than that of classical LSTM model.Unlike the semantic recognition task such as machine translation, changing the order of oil well production series makes no significant difference in predicting production, and the use of Bi-LSTM brings more complex model structure undoubtedly, making model training more difficult.
GRU and CNN models gain similar results, with LSTM model having slightly higher accuracy than LSTM, but it should be noted that the training time of GRU and CNN was almost half of LSTM.GRU or CNN models may be more appropriate in some less-desirable cases.

Discussion of Model Input
Length.In this paper, CNN, LSTM, and CNN-LSTM model were used to conduct contrast experiments, respectively, and to verify the influence of different input length on model prediction error, as shown in Figure 11.
It is easy to see that the performance of CNN model is better than that of LSTM and CNN-LSTM when the input length is short.However, with the increase of the input length, the prediction error of CNN model also increases significantly, while that of LSTM and CNN-LSTM is decreasing.Compared with CNN and LSTM, CNN-LSTM model seems to inherit advantages from both, better than LSTM in short case and better than the other two in long case.Considering data characteristics, case requirements, and model performance, the following example uses 84 months' data to predict monthly well production over the next three months.

Discussion of Feature Selection.
To verify the effectiveness of feature selection, two feature selection plan were used for feature selection experiment.Plan A: input all features; plan B: input select features.The prediction accuracy of each model using different features is shown in Table 5.
It is found that features and algorithms have cross influence on the prediction accuracy of the model.The RNN model has strong sensitivity to the number of features, and the selection of the number of features can reduce its prediction error.The model with convolutional layer is insensitive to the changes in the number of features.It can be seen from Table 5 that there is not much difference in model prediction accuracy before and after feature selection.The analysis shows that the RNN (LSTM/GRU) model is more inclined to capture the connection of samples in time series, while the convolution structure model is more inclined to analyze and extract high-dimensional features.Therefore, prior feature selection for the LSTM model means that part of the    14 Geofluids real tendency, but other models also can fit the curve closely.Another obvious phenomenon that the predictions at earlier time stages are worse compared to latter ones; in other words, the error at high production is greater than the error at low production.This is because production changes are more dramatic in the early stages of the well development.
Extensive experiments show that proposed scheme significantly improves production prediction accuracy and enhances predict efficiency.

Conclusion
This paper proposes a machine learning model for predicting single-well production in water flooding reservoir.The specific conclusions are as follows: (1) More than 60 factors of geology and development that affect the changes of oil well production was comprehensively considered to build the data set.Data filling and feature extraction were carried out, respectively, according to the characteristics of data.
Features were analyzed and selected from the perspective of time sequence.The data set is divided by wells to make the sample distribution more practical (2) Eight models with good performance in the time series prediction are constructed.By comparison, CNN-LSTM model gains the best score, while the improvement of attention mechanism and Bi-LSTM model is limited.It also illustrates that complex models that are doing well in other tasks may not be suited to the well production prediction (3) The Bayesian optimization is used to optimize the hyperparameter of the models, which can greatly improve the efficiency of hyperparameter optimization and improve the prediction accuracy of the models (4) An experiment case is carried out in reservoir A, and the results prove that the model proposed in this paper can accomplish the prediction task of singlewell production successfully and provides a good reference and guidance for development and production in water flooding reservoir

Figure 1 :
Figure 1: The basic structure of LSTM cell.In each cell unit, long-term information and short-term information are conveyed, respectively.

Figure 2 :
Figure 2: Diagram of Conv1Ds.For the inputs sized m × n and a Conv1D layer with k filters, the outputs' size is k × n.Using multiple Conv1Ds, the final output size is 1 × n.The model illustrated in this figure wants to predict production for the next three months, that is, it wants three outputs, which we achieve using a simple dense layer containing three neurons.

Figure 3 :
Figure 3: Static pressure, working fluid level and FBHP in well A. (a) The relationship between time and static pressure.(b) Comparison between real values and predicted values of FBHP.(c) The relationship between working fluid level and FBHP.

Figure 4 :
Figure 4: Diagram of injection-production relationship.Producing wells always be influenced by several water-injecting wells.

Figure 6 :Figure 5 :
Figure 6: The data processing working flow.Static features and dynamic features need to be treated separately, also the discrete features and continuous features do.

Figure 7 :Figure 8 :
Figure 7: Heatmap of correlation coefficient between inputs and outputs.(a) The average of 12 correlation coefficients between the production in the next nth month and the features in the last 12 months.(b) The correlation coefficient between the production in the next 1th month and the features in the last mth months.(c) The average of 12 correlation coefficients between the production in the next 1th month and the features in the last 12 months.m feaures

Figure 10 :
Figure 10: The predicting performance of different models.

Figure 11 :
Figure 11: Different input length vs. model prediction error.

Table 1 :
Initial data set.The italicized items are category features which need to be one-hot encoding. 8

Table 5 :
Comparison of prediction accuracy of different models under different feature selection strategies (84-3).