Development of Decline Curve Analysis Parameters for Tight Oil Wells Using a Machine Learning Algorithm

To obtain a reliable production forecast, one has to establish a geological model with well logs and seismic data. The geological model usually has to be upscaled using certain upscaling techniques. Then, a dynamic reservoir model is constructed with another dataset, including completion data, production data, ﬂ uid properties, and relative permeability curves. At last, the dynamic model needs to be validated by a history matching process. This approach is data-intensive, time-consuming, and often not rigorously accomplished due to the lack of skillset and time. In this study, 10,000 groups of reservoir/completion input data were generated by Latin hypercube sampling method, and then, 10,000 groups of output (oil rate and cumulative production data) were obtained by numerical simulation. Next, a machine learning technique was applied to establish a model between the input data and determining parameters of a decline curve analysis model by ﬁ tting the generated cumulative production rate. Overall coe ﬃ cients of determination ( R 2 ) of the three Arps decline curve factors were 0.966, 0.990, and 0.945. The validation result shows that the production rate and cumulative production predicted by the proposed machine learning – decline curve analysis (ML-DCA) model agreed well with those simulated by reservoir simulation. As a result of the ML-DCA regression model, a complete understanding can be established of the impact of reservoir properties on the DCA model. The proposed ML-DCA model not only provides a quick and robust method for petroleum engineers to estimate production performance for unconventional reservoirs from reservoir and completion properties without full- ﬁ eld geocellular modeling but also can be used to optimize the completion and operation parameters for wells of interest.


Introduction
Unconventional oil and gas reservoirs have been able to be developed economically with advancements in horizontal well drilling and multistage hydraulic fracturing technology. Management and operation of unconventional oil and gas reservoirs demand accurate prediction of production rates, which facilitates better development strategy, more economic feasibility, more reliable reserve evaluations, and more informed business decisions.
Many efforts have been made to develop efficient numerical models for simulation of unconventional oil and gas production considering complex hydraulic and natural fracture geometries and multiple gas transport mechanisms in nanopores [1][2][3][4]. High-resolution, three-dimensional (3D), geocellular models characterize geological features and frac-ture network complexities with grid blocks and their related rock and fluid properties, and they yield a comprehensive geographic distribution of pressure and saturation over a period of time. However, such a model is computationally prohibitive as a large number of simulations are required for history matching and production optimization in a close-loop reservoir management and decision-making context. The challenges become even more discouraging with a lack of rock and fluid data and insufficient production history. Researchers have tried to accelerate simulations through techniques such as upscaling [5,6], multiscaling [7][8][9], and streamline methods [10,11]. However, all of these speed-ups require a full physics-based model as a starting point for simplification.
Decline curve analysis (DCA) as an alternative technique to forecast oil and gas production has been applied in the oil industry for a long time. The most commonly used curve-fitbased decline curve models for unconventional oil and gas reservoirs include the multisegment Arps decline model [12][13][14][15], modified Arps DCA model [16], transient hyperbolic model [17], stretched exponential decline method (SEDM) [18], Duong method [19], and power law exponential (PLE) method [20]. It is well known that curve-fit-based DCA models are easy to apply for unconventional reservoirs once the flow regimes related to depth of investigation are appropriately identified, but they fall short of prediction accuracy as they cannot capture the reservoir features or properties that are highly relevant to production performance.
With the introduction of data analytics in the oil and gas industry, such practical challenges have caused emerging data-driven solutions in the area of data mining and machine learning (ML), in which geological data, completion data, and dynamic data are input to the ML algorithm to unravel hidden physical relationships contained in the data but not represented in existing simulation models to output future production performance. Li and Han [21] applied a neural network to study the correlation between selected reservoir/completion properties and the determining parameters of the logistic growth model (LGM). Thus, a trained ML-DCA model can be used to predict the production trends for both an existing well and a new well according to the given reservoir/completion properties. Sun et al. [22] proposed a recurrent neural network-(RNN-) based sequence-to-sequence model to forecast production. Mukherjee et al. [23] tested and performed various ML algorithms, including linear regression (LR), principal component analysis (PCA), neural network regression (NNR), boosted decision tree (BDT), and binned decision tree (BiDT), to forecast gas production. Tamimi et al. [24] presented a comprehensive intelligent decision support system (IDSS) to forecast and determine parameters of Arps decline curve model from 3,400 unconventional wells. Temizel et al. [25] applied the long short-term memory (LSTM) method for predicting long-term production behavior in unconventional shale reservoirs. Xue et al. [26] proposed a multiobjective random forest (MORF) algorithm to forecast the production rate, with reservoir and completion characteristics as input. Gross et al. [27] proposed a physics-informed ML workflow combining fast reduced-order models (ROMs) with reservoir simulation models to predict production as a function of pressure management in a fractured Marcellus shale reservoir. Doan and Vo [28] used ML techniques to enhance the accuracy of production forecasting in the North Malay Basin.
Data-driven models may not capture many geological features of a reservoir, but they run much faster than fullphysics models, and the trained model on real data can avoid uncertainty or making hypothesis. They can provide reliable predictions with enough data in the calibration process. However, when the number of parameters becomes large, data-driven models require a certain number of samples for training and testing before they can be effectively used.
The unique nature of this study is using a simulation model as a basis for the production dataset, which provides an alternative for utilizing the machine learning algorithms when actual data are insufficient or unavailable. This research offers less experienced engineers with an effective ML-DCA model that correlates reservoir and fluid properties, as well as completion parameters, with the DCA model to forecast the production rates without full-field geocellular modeling for unconventional reservoirs. Therefore, without even working with complicated geomodels, less experienced engineers can make robust estimations about the production forecast once the ML-DCA model is established and delivered by experienced reservoir engineers. The implementation between DCA and the simplified reservoir model with ML algorithm makes it easy to extend the workflow for any other type of reservoirs, such as thermal or compositional.
This study is organized as follows. Firstly, the methods in the proposed ML-DCA workflow, including decline curve model and artificial neural network were introduced, followed by synthetic production profiles generated and established from single-well numerical simulation considering 15 geological and completion parameters that included matrix and fracture network properties for data-driven study. Next, the ML-DCA model was trained and tested by the synthetic production profiles; then, a case study was given to evaluate the prediction performance of the ML-DCA model, which shows the power and accuracy the proposed ML-DCA workflow; and a sensitivity analysis was performed to investigate the effect of geological and reservoir parameters on the production with the ML-DCA model; finally, the limitations and future work were discussed and conclusions were drawn.

Methodology
2.1. Decline Curve Analysis. The Arps decline curve is the most common DCA. The Arps hyperbolic decline curve model is [29].
where q is the predicted production, q i is the initial production, t is time, b is a constant, and D i is the initial decline rate. When the constant loss ratio b is 0, the decline curve reduces to an exponential decline; if b is 1, the decline curve becomes harmonic; if b is greater than 0 but less than 1, the decline curve is hyperbolic; and if b is greater than 1, often during transient flow, the decline curve is called superhyperbolic. When one deals with low-and ultralow-permeability wells, transient flow lasting months to years is common, followed by a transition flow regime and later BDF. In practice, b is constant for a long time during transient flow in hydraulically fractured wells and during BDF, so one can apply the Arps model with confidence to those time periods during which b is constant [12]. In this study, cumulative oil production was selected to perform DCA regression as it is much smoother than the production rate. This approach also mitigates the effect of irregular and noisy production data, especially for real field data. Cumulative oil production is the integral of the production rate (Equation (1)), which is defined as where Q is the cumulative production. Integrating and rearranging Equation (2), the cumulative production-time relationship for different b values can be derived and shown as in Table 1.

Artificial Neural
Network. An artificial neural network (ANN) is a common ML technique inspired by the structure of neurons in the human brain. The ANN was selected as the ML method because it has been successfully applied in many engineering and science problems to extract complex and nonlinear relationships among process variables. In this study, a typical three-layer back-propagation ANN structure was established, consisting of an input layer, a hidden layer, and an output layer, as shown in Figure 1. Several key variables, including hidden layer neurons, training algorithm, and transfer functions, were considered to get the optimal ANN structure. The neuron number in the input layer equals the input feature number. The neuron number in the hidden layer was tuned to optimize the objective function. Each neuron received the input information from the output of the neurons in the previous layer and generated and passed output to the next layer. The mathematical model of the i th neuron is expressed as where y i is the output of the i th neuron on the next layer, x j is the input from the previous layer, w ij is the weight, n is the input number from the previous layer, and ϕ is the activation function. The weight was tuned for each neuron through optimizing the loss function in the model training process. The R-squared score (R 2 ) is commonly used as the loss function. It is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by an independent variable or variable in a regression model. R 2 score is given as whereŷ i is the predicted value for i, and y i is the mean of y i .

Machine Learning-Decline Curve Analysis Algorithm.
The ML algorithm provides a statistical technique to analyze a system with an existing dataset without being explicitly programmed. To introduce how the ML algorithm solves a regression problem, one usually defines the input vector, x i , as where N is the sample size or the number of input vectors and P is the number of input parameters, such as permeability and Input layer Hidden layer Output layer Figure 1: General ANN structure.  3 Geofluids porosity. Thus, the input matrix X can be defined as In this study, the input was a column matrix composed of reservoir, completion, and operation parameters, such as fracture half-length, fracture width, fracture permeability, and porosity.
The corresponding output parameters can be given as where the output vector y i is the determining parameters of the decline curve model corresponding to the input vector x i .
where t is the number of determining parameters for the decline curve model. Clearly, there are three parameters in the Arps decline curve model; thus, the output y i is a column matrix of b, D i , and q i .
The existing measurement dataset was used to train the ML model. The training dataset is given as The purpose of ML is to establish a mapping function, ∅ð•Þ, from the training dataset The proposed ML-DCA model can be expressed and established as Figure 2 with the training dataset.
2.4. Pearson Correlation. The most commonly used type of correlation that describes the degree of relationship between two variables is the Pearson correlation. Pearson's r measures the linear relationship between two variables, say X and Y. A correlation of 1 indicates that the data points perfectly lie on a line for which Y increases as X increases. A value of -1 also implies that the data points lie on a line;

Hydraulic fractures
Well-1 Figure 4: MFHW model to generate the simulated oil and gas production data for DCA. 4 Geofluids however, Y decreases as X increases. The formular for r is Figure 3 shows the proposed workflow of ML-DCA. First, a 3D numerical model was established for a typical multistage fractured horizontal well (MFHW). Then, the Latin hypercube sampling (LHS) method was performed to generate enough (e.g., 10,000) experimental designs within the predetermined distribution type and ranges of certain input parameters, which were then used to simulate the cumulative oil production rates by running reservoir simulation. Next, the DCA regression algorithm was then performed to fit the simulated cumulative oil production rates and obtain the determining parameters of the DCA model. Finally, the ML algorithm (e.g., ANN, but it can be other algorithms in future work) was trained and tested to investigate the correlations or mapping function between geologi-cal and completion parameters and the determining parameters of the DCA model.
Note that the proposed workflow can be extended to investigate various decline curve models, such as SEDM, PLE, and Duong model, by simply replacing the decline model. In addition, all related algorithms were developed with open-source libraries that can be easily integrated with an in-house or commercial simulator.

Data Generation
Building a data-driven model requires a large set of geological features and completion data (production, pressure, well logs, etc.) as inputs and the factors of the DCA model as output. The ML method was then used to establish the correlation between reservoir features and production. Ideally, it is better to obtain the dataset used to build the data-driven model from the actual field. However, synthetic data generated from numerical or analytical models can alternatively be used if good-quality, actual data are insufficient or unavailable [26,30]. In this study, we used a numerical 5 Geofluids model to generate cumulative production profiles with randomly generated geological and completion parameters. Later, the simulated cumulative production rates were used to establish the relationship between reservoir properties and the DCA model.

Single-Well Reservoir
Model. The MFHW model to simulate production from tight oil reservoirs was a three-phase, 3D rectangular model that was established with a CMG simulator. A tartan grid was used to model the MFHW as it is the best way to catch transient behavior. Otherwise, ones need to use local grid refinement (LGR), which is more time consuming. The grid number (Nx × Ny × Nz) of the model is set to be 50 × 21 × 7. The grid size for each direction (DI, DJ, and DK) was one of the uncertain parameters that were generated by the sampling method later. Thus, the well control area and drainage area (Lx = Nx × DI, Ly = Ny × DJ, Lz = Nz × DK) was changed with the grid size. The horizontal well was placed in the center of the reservoir model and produced under the constraint   Geofluids of constant flowing BHP. Duration of production was set to be 10 years, which is considered a realistic tight oil production scenario. Figure 4 shows the sideview of the MFHW model, in which the darkened sections of the grid blocks represent the hydraulic fractures.    Table 2). Among the 15 parameters, the first three features were used to define the drainage area. Matrix permeability and porosity were the tight oil reservoir properties. The following four features, including horizontal well length, fracture half-length, fracture spacing, and fracture effective permeability, define the horizontal well and completion parameters. The layer-up and layer-down specified the number of layers that defines the fracture penetrate above/below the horizontal well. Thus, these two parameters were related to the fracture height. Bubble point pressure was one of the  8 Geofluids fluid properties, and initial pressure defined the initial reservoir condition. Bubble point pressure was set to be lower than the initial reservoir pressure for each scenario. The water saturation of the reservoir was constant in this study, and the value was set to 0.4. Relative permeability curve was predefined. The last two features were operation parameters during production. Oil production changed with the operating BHP, which was set to lower than the initial reservoir pressure. The well was shut off when the oil rate was lower than the monitored oil rate defined by well constraints. To obtain effective training and testing of datadriven models, 10,000 samples of each parameter were generated through the LHS method, with the ranges and distribution type listed in Table 2. The probability distributions of each parameter are shown in Figure 5.

Generation of Cumulative
Production. The corresponding 10-year monthly oil production for each combination was then simulated by the numerical model. The distribution of the 10,000 cumulative oil productions and recoveries are displayed in Figures 6 and 7. Basically, the cumulative production of 50% of wells was less than 10 × 10 4 bbl. The typical oil recovery usually was less than 35%, mainly distributed between 5% and 15%.

Decline Curve Analysis Regression.
To obtain the best predictive model, production data from 10,000 wells were transformed into DCA space. DCA best fit curves were usually computed with the least-squares regression [34]. Thus, leastsquares regression was programmed to estimate the three determining parameters of the Arps decline model from the 10,000 synthetic cumulative production profiles generated by the represented reservoir simulation. Figure 8 shows parts of 10,000 fitting results as an example. It shows good agreement between the cumulative oil production calculated by the DCA model and the synthetic cumulative production simulated by the numerical model. Figure 9 is the distribution of coefficient of determination (R 2 ). 99% of the DCA regression had a R 2 greater than 0.99. In other words, the DCA curve can be a proxy model of the single-well simulation to forecast tight oil production in this study. Figure 10 gives the histogram of the determined initial rate (q i ), b value, and initial decline rate (D i ) by the ML-DCA model. The distribution can be used for further study if one performs production uncertainty analysis with the DCA model, such as to forecast P10, P50, and P90. b value affects long-term production but does not make much difference in short-time production. Initial production rate and initial decline rate affect the short-term rate. As shown in Figure 11, there was no apparent correlation between b and initial production rate (q i ), meaning that initial production alone cannot represent how well or how poorly a well will produce in the long term. However, b showed a significant correlation with the initial decline rate (D i ) (Figure 11(b)). This observation was similar to Tamimi et al.'s [24] study based on more than 3,400 unconventional wells in DJ basin. Simply put, a higher initial decline rate indicates a higher b value. Thus, it seems a higher decline rate may indicate a bad well, but eventually it produces more in the long term.  Figure 12: ANN model used in this study to estimate the three Arps parameters from geological and operational parameters.  Training algorithm Bayesian regularization Iterations to achieve optimal structure 185 250 324 Overall

Neural Network Regression.
In this study, three-layer back-propagation neural network was established, as shown in Figure 12. As it can been seen, the three factors of the DCA model were solved sequentially rather than simultaneously, as the sequential network shown as Figure 12 achieved the highest overall R 2 and proven to be the optimal regression workflow to determine the factors of the DCA model ( Table 3). The optimal regression workflow is summarized as follows:   Figure 16: Flowchart to evaluate the production prediction performance.  (v) Step 5: the ANN's training process continues interactively until the desired level of error or maximum iteration number is reached, and the first final network used to determine the vector of q i is achieved.
(vi) Step 6: the vector of q i is added to the input matrix, and Step 2 to Step 5 are repeated to obtain the vector of b (vii) Step 7: the vector of b is added to the input matrix, and Step 2 to Step 6 are repeated to determine the vector of D i The optimum network was achieved and is given in Table 4.  Figure 16 shows the process to evaluate the prediction performance of the ML-DCA model by comparing the production profiles with numerical simulation. Table 5 presents the values of the 15 selected key features used to forecast the production profile of a tight oil well. The comparison result of the well is depicted in Figure 17. Overall, both the production rate and cumulative production between these two methods agreed with each other very well, indicating that the performance of the ML-DCA model is acceptable and reliable for science and engineering applications. It can be concluded that the ML-DCA model has the same accuracy as singlewell numerical simulation. Thus, instead of establishing a complex numerical model, engineers who have less experience with numerical simulation can use the ML-DCA model as a proxy model to forecast tight oil production with low cost.

Sensitivity Analysis.
Pearson correlation coefficient was calculated to evaluate the significance of each property in the three determining parameters (q i , b, and D i ) ( Figure 18). Figure 18(a) reflects permeability, grid size in the vertical direction (DK), and fracture spacing to be the top three factors having the most significant effect on the initial rate, followed by initial pressure, fracture permeability, well length, matrix porosity, fracture half-length, etc. Among them, fracture spacing had a significant negative impact on the initial rate. And operating BHP, monitored oil rate, layer-up, and layer-down had a minor effect on the initial rate. Figure 18(b) indicates that permeability has mainly a negative influence on b. The higher the permeability, the lower the b value. It also can be concluded that operating BHP, bubble point pressure, well length, initial pressure, fracture half-length, and fracture permeability had a negative impact on b value, while fracture spacing, porosity, and grid size had a positive effect on b value. Figure 18(c) shows the correlation between the initial decline rate (D i ) and reservoir and completion parameters. Fracture spacing, used to define the number of fractures in the reservoir model, had a prominent influence on the initial decline rate D i . Meanwhile, the grid size (DK, DI, and DJ), well length, porosity, and fracture half-length had a negative influence. Results also indicate matrix permeability, initial pressure, fracture permeability, bubble point pressure, and operating BHP to have a positive impact on the initial decline rate.

Discussion and Future Work
It is well known that ML-DCA can be powerful if the data quantity and quality can be improved by including actual field data. However, with most tight oil wells having less than 60 months of production history, these data cannot be directly used to perform history matching. This study proves that this simulation-based proxy tool is reliable with a well-maintained database generated from a singlewell reservoir simulation, and its efficiency in computational time allows the practicing engineer to achieve modeling objectives and to reduce uncertainty in a rapid way. As a result, the ML-DCA algorithm not only can be used as a tool to determine DCA factors and predict production, once the initial reservoir conditions, rock properties, and completion and operation parameters are given, but also can be used to optimize the completion and operation parameters for a target reservoir, such as fracture spacing, fracture half-length, and operating BHP. As for future work, further studies can be conducted to investigate the effect of measurement errors and sample size on production prediction performance.

Conclusions
In this study, a ML-DCA algorithm, integrating a ML technique with a DCA model, was developed to predict tight oil production performance, serving as a proxy for analytical/numerical reservoir simulation. The following critical conclusions can be summarized: (1) With the reservoir and completion parameters as the inputs and the DCA factors as outputs, the ML-DCA model can be trained to determine DCA factors accurately, with the overall prediction errors (R 2 ) of the three Arps decline curve factors being 0.966, 0.990, and 0.945