Research on the Rate of Penetration Prediction Method Based on Stacking Ensemble Learning

Rate of penetration (ROP) is an important index for evaluating the efficiency of oil and gas drilling. To accurately predict the ROP of an oil field in the Xinjiang working area, an ROP prediction model was established from the historical drilling data of this working area using stacking ensemble learning. The model integrates the K-nearest neighbor algorithm and the support vector machine algorithm through a stacking ensemble strategy and uses a genetic algorithm to optimize the model parameters, forming a new ROP prediction method suited to this oil field. The prediction results show that the accuracy of ROP prediction by this method reaches 92.5% with stable performance. The method can provide a reference for optimizing drilling parameters in this oil field and has guiding significance for improving the efficiency of drilling operations.


Introduction
Rate of penetration (ROP) is an essential technical index for evaluating the efficiency of drilling operations. Predicting ROP provides a basis for drilling decisions and for optimizing construction parameters, which reduces cost and increases efficiency. For ROP prediction, most scholars at home and abroad use traditional methods to establish predictive equations [1]. For example, Bahari and Seyed [2] established the W.C. Maurer equation, but the factors it considers are not comprehensive enough. Adebayo and Akande [3] established an empirical formula relating rock characteristics and physical properties to ROP, but did not consider the influence of drilling parameters on ROP. Changsheng [4] improved the B-Y ROP equation with the multiple regression method so that the coefficients could vary between formations, but the equation applies only to mud drilling. Kumar and Murthy [5] obtained a more accurate ROP prediction model, but its parameters were significantly affected by lithology and were relatively complex. Hung et al. [6] summarized the ROP prediction model for rotary percussion bits, but this model mainly applies to high-hardness formations. Jing et al. [7] established an ROP prediction model using the analytic hierarchy process and a feedforward neural network, which helped to optimize bit parameters. Wang et al. [8] established an ROP prediction model based on a neural network algorithm, which can improve development efficiency and reduce development costs. Chiranth et al. [9] established a preliminary ROP prediction model using a machine learning algorithm and optimized it using the random search method, the particle swarm optimization algorithm, and the eyeball method. Based on the drilling data of a directional well in Changqing Oilfield, Liu et al. [10] established an ROP prediction model using a BP neural network with high calculation accuracy. Xinghua et al. [11] established ROP prediction models using four machine learning algorithms, including the K-nearest neighbor algorithm and the support vector machine. The prediction results showed that the model based on the boosting tree algorithm had the highest accuracy. Liu et al. [12] established a grey-weighted Markov dynamic prediction model for ROP, based on grey system theory combined with Markov theory, to realize dynamic real-time prediction.
Investigation and analysis show that existing ROP prediction mainly relies on experience, improves on equations established by predecessors, or tries to find the main factors controlling ROP by controlling variables. Most of the machine learning models for predicting ROP described above use a single model, or a single model with optimized parameters, which can lead to large errors and does not meet current drilling construction needs. To solve these problems, this paper proposes a combined prediction model that joins the K-nearest neighbor algorithm and the support vector machine algorithm through a stacking ensemble strategy to predict ROP and optimizes it with a genetic algorithm. The stacking ensemble learning algorithm is a classic heterogeneous ensemble learning algorithm. The greater the diversity among the primary learners in the model, the better the model tends to perform. Therefore, before modeling, we should not only analyze the prediction performance of each primary learner but also consider the differences between the primary learners. The K-nearest neighbor algorithm is a lazy learning method, which is widely used because of its efficient training and good classification performance. The training time complexity of the support vector machine algorithm is high, but so is its classification accuracy, which makes it effective in practical applications. Applied to the historical drilling data of an oil field in the Xinjiang working area, the proposed model reaches a prediction accuracy of 92.5%, which meets the engineering design requirements. It helps to guide the optimization of drilling parameters and to improve drilling efficiency.

Rate of Penetration Prediction Algorithm
2.1. K-Nearest Neighbor Algorithm. The K-nearest neighbor (KNN) algorithm [13][14][15] is a supervised machine learning algorithm with a mature theoretical basis and good application stability. The core of the algorithm is to measure the distance between a new sample and the labeled samples in the training data set and then to assign the new sample to the class held by the majority of its K nearest neighbors. The basic process is as follows: (1) using the historical data set, build a large and representative database; (2) set the state vector; (3) calculate the distance between the new state vector and the historical state vectors; (4) determine the number of nearest neighbors K; (5) determine the nearest neighbor weights; and (6) obtain the predicted value by weighting the K neighbors. In Figure 1, "?" represents a sample of unknown category, and "triangle" and "circle" mark samples of known categories. With K set to 5, the neighborhood must contain 5 samples of known category. Because the number of "triangle" samples in the neighborhood is greater than the number of "circle" samples, the "?" sample is assigned to the "triangle" category by voting. Practical applications of the algorithm are often more complex than this example, and the data have more dimensions.
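The voting procedure can be illustrated with a minimal sketch in plain Python. The function names (`minkowski`, `knn_predict`), the toy coordinates, and the inverse-distance weighting formula are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from collections import defaultdict


def minkowski(a, b, p=2):
    """Distance between two feature vectors (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, k=5, p=2, weighted=False):
    """Classify `query` by an optionally distance-weighted vote of its k nearest neighbors."""
    # Step (3): distance from the new state vector to every historical one.
    dists = sorted((minkowski(x, query, p), label) for x, label in zip(train_X, train_y))
    # Steps (4)-(6): keep the k closest samples and accumulate their votes.
    votes = defaultdict(float)
    for dist, label in dists[:k]:
        votes[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


# Toy 2-D data in the spirit of Figure 1 (coordinates are illustrative only).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (3.0, 3.0), (3.2, 2.9), (1.1, 1.1)]
train_y = ["triangle", "triangle", "triangle", "circle", "circle", "triangle"]

print(knn_predict(train_X, train_y, query=(1.5, 1.5), k=5))
```

With `k=5`, four of the five nearest neighbors of the query are triangles, so the unweighted vote returns `"triangle"`. Setting `p=1` and `weighted=True` switches to Manhattan distance with inverse-distance voting.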

Support Vector Machine Algorithm.
The support vector machine (SVM) algorithm [16][17][18][19] is built on the structural risk minimization principle and the VC dimension theory of statistical learning theory. The SVM algorithm has many unique advantages in solving small-sample and nonlinear regression problems. It largely overcomes the "curse of dimensionality" and overfitting, and it can be used both to classify discrete dependent variables and to predict continuous dependent variables.
The principle of the algorithm is shown in Figure 2, where triangles are labeled as category 1, circles are labeled as category 2, and the support vectors all lie on the dashed lines. SVM uses a kernel function to transform linearly inseparable problems into linearly separable ones. For a linearly inseparable problem, a nonlinear transformation is needed. After the transformation, the equation of the separating hyperplane is

$$f(x) = \omega^{T}\phi(x) + b,$$

where $\omega$ is the normal vector, $b$ is the displacement term, and $\phi(x)$ is the mapping transformation of $x$.
To maximize the distance between the hyperplane and the different categories, the following requirements should be met:

$$\min_{\omega,\,b}\ \frac{1}{2}\lVert\omega\rVert^{2} \quad \text{s.t.}\quad y_i\left(\omega^{T}\phi(x_i) + b\right) \ge 1,\quad i = 1, 2, \ldots, n.$$

The hyperplane equation is then obtained by quadratic programming and the kernel function:

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i\, k(x, x_i) + b,$$

where $\alpha_i$ is the weight coefficient, $y_i$ is the truth value, $k$ is the kernel function, $k(x, x_i)$ represents $\phi(x)\cdot\phi(x_i)$, and $\phi(x)$ represents the mapping transformation with respect to $x$.
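As a small worked example, the kernel form of the hyperplane equation, f(x) = Σ α_i y_i k(x, x_i) + b, can be evaluated directly once the weight coefficients and support vectors are known. The sketch below assumes the common Gaussian kernel parameterization k(x, z) = exp(-γ‖x - z‖²). The support vectors, coefficients, bias, and γ are hypothetical values chosen for illustration, not fitted to any data.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_function(x, support_vectors, alphas, labels, b, gamma):
    """Evaluate f(x) = sum_i alpha_i * y_i * k(x, x_i) + b."""
    return sum(
        a * y * rbf_kernel(x, sv, gamma)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) + b


# Hypothetical support vectors, weights, and bias (not fitted to any data).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [+1, -1]
b = 0.0
gamma = 0.5

for point in [(0.2, 0.1), (1.9, 2.1)]:
    score = decision_function(point, support_vectors, alphas, labels, b, gamma)
    print(point, "->", 1 if score >= 0 else -1)
```

The sign of f(x) gives the predicted category. A point close to the positive support vector receives a positive score, and a point close to the negative one receives a negative score.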

Stacking Algorithm.
The stacking algorithm [20][21][22][23] is a combination strategy in ensemble learning. The algorithm requires multiple learners to support its training process. These learners are divided into primary learners and meta-learners: the former are trained on the individual data, and the latter are trained on the combined data. As a heterogeneous ensemble learning technique, the stacking algorithm can exploit the advantages of different prediction models, stack an arbitrary number of primary learners, and use the meta-learner to optimize the combination of the different primary learners. Therefore, compared with a single prediction model, the stacking algorithm has better performance, stronger robustness, and better generalization. This paper uses KNN and the support vector machine algorithm as the primary learners in stacking ensemble learning. The operating principle of the stacking algorithm is shown in Figure 3.
It can be seen from Figure 3 that the new training data are obtained from the outputs of the trained primary models. Overfitting will occur if the meta-model is trained on predictions that the primary models made for the same data they were trained on. Therefore, the K-fold cross-validation method should be used so that the training data for the meta-model are generated from samples not used to train the primary model.
The main steps of ROP prediction based on stacking ensemble learning are as follows:
(1) The original data of the drilling blocks are cleaned and normalized. The normalized data set is then divided into a training set and a test set.
(2) The training set is divided into K parts, and the K-fold cross-validation method is used for cross-validation of each primary learner. In each cross-validation, 1 part is used as the validation set of the model, and the remaining (K-1) parts serve as the training set. After each cross-validation, the trained learner predicts the test set and the validation set.
(3) The predicted values of each validation set obtained by a single primary learner after K cross-validations are integrated into matrix M, and matrix Q is obtained by averaging the K predicted values of the test set row by row.
(4) The output feature matrix $(M_1, M_2, \cdots, M_n)$ of the training set is obtained after the training of the primary learners in the first stage, and this matrix is used as the training set of the meta-learner in the second stage.
(5) In the second stage, after the meta-learner is trained, the matrix $(Q_1, Q_2, \cdots, Q_n)$ is taken as the test set, and the predicted value of the model is finally output.
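The construction of the matrices M and Q in steps (2)-(5) can be sketched in plain Python as follows. This is an illustration under stated assumptions: the two toy learners (`fit_one_nn`, `fit_nearest_centroid`) stand in for the paper's KNN and SVM primary learners, a 1-nearest-neighbor model stands in for the meta-learner (the paper does not name one), and the data are invented. Class labels are the numeric grades 1, 2, and 3, so that the test-set predictions can be averaged.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def fit_one_nn(X, y):
    """Toy learner: predict the label of the single nearest training sample."""
    def predict(x):
        return y[min(range(len(X)), key=lambda i: sq_dist(x, X[i]))]
    return predict


def fit_nearest_centroid(X, y):
    """Toy learner: predict the class whose centroid is nearest."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    centroids = {c: [sum(col) / len(pts) for col in zip(*pts)] for c, pts in groups.items()}
    def predict(x):
        return min(centroids, key=lambda c: sq_dist(x, centroids[c]))
    return predict


def stacking_features(fit, X_train, y_train, X_test, k):
    """Return (M, Q): out-of-fold train predictions and fold-averaged test predictions."""
    M, test_preds = [None] * len(X_train), []
    for fold in kfold_indices(len(X_train), k):
        held = set(fold)
        X_fit = [x for i, x in enumerate(X_train) if i not in held]
        y_fit = [t for i, t in enumerate(y_train) if i not in held]
        predict = fit(X_fit, y_fit)
        for i in fold:  # validation-set predictions fill M
            M[i] = predict(X_train[i])
        test_preds.append([predict(x) for x in X_test])
    Q = [sum(col) / k for col in zip(*test_preds)]  # average the K test predictions
    return M, Q


# Invented, well-separated toy data with grades 1, 2, 3.
X_train = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8), (0.15, 0.1), (0.55, 0.45), (0.85, 0.9),
           (0.2, 0.15), (0.45, 0.5), (0.95, 0.85), (0.05, 0.1), (0.5, 0.55), (0.9, 0.95)]
y_train = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
X_test = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]

features = [stacking_features(fit, X_train, y_train, X_test, k=4)
            for fit in (fit_one_nn, fit_nearest_centroid)]
meta_train = list(zip(*(M for M, _ in features)))  # rows of (M_1, M_2)
meta_test = list(zip(*(Q for _, Q in features)))   # rows of (Q_1, Q_2)
meta_predict = fit_one_nn(meta_train, y_train)     # stand-in meta-learner
final = [meta_predict(row) for row in meta_test]
print(final)
```

Because every entry of M comes from a model that did not see that sample during fitting, the meta-learner is trained on predictions that resemble those it will receive at test time.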
As a model fusion algorithm, stacking ensemble learning can achieve better performance than a single algorithm or other simple ensemble learning algorithms in many practical problems, especially classification problems. Stacking ensemble learning has been widely applied in classification, recognition, prediction, and other problems [21][22][23][24].

Model Establishment
3.1. Data Preprocessing. The modeling data used in this paper all come from an oil field in the Xinjiang working area. To prevent the differing dimensions (units and scales) of the original features from affecting the prediction results, the 400 groups of original data are normalized with the max-min method:

$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_i'$ is the normalized data, $x_i$ is the original sample data before normalization, $x_{\min}$ is the minimum value of the feature in the original sample data, and $x_{\max}$ is the maximum value of the feature in the original sample data.
The ROP for drill bits of different sizes is divided into three grades: low ROP, medium ROP, and high ROP. The ROP classification makes the ROP grade independent of the size of the drill bit and transforms the modeling task from a regression problem into a classification problem, which helps to improve the accuracy of model prediction. The ROP classification standard of this field is shown in Table 1.
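The max-min normalization described above maps each feature to the interval [0, 1]. A minimal sketch is given below. The function name `min_max_normalize`, the handling of a constant column, and the sample values are assumptions for illustration and are not taken from the field data set.

```python
def min_max_normalize(column):
    """Scale a list of values to [0, 1] via x' = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Illustrative bit-weight values in kN (invented, not from the field data).
weight_on_bit_kn = [60.0, 80.0, 100.0, 120.0]
normalized = min_max_normalize(weight_on_bit_kn)
print(normalized)
```

Each feature column is normalized independently, so features with large numeric ranges do not dominate the distance calculations used by KNN and the Gaussian kernel.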

Establish the Prediction Model of ROP.
To ensure that the prediction model generalizes well and that the prediction results are accurate and stable, 380 sets of original data obtained in the field were randomly divided into a training set and a test set at a ratio of 9:1: 342 training samples were used to construct the model, and 38 test samples were used to evaluate it.
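The 9:1 random division can be reproduced with a short sketch. The helper name `train_test_split`, the fixed seed, and the integer stand-ins for the data groups are assumptions for illustration.

```python
import random


def train_test_split(samples, test_ratio=0.1, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(380))  # stand-ins for the 380 data groups
train, test = train_test_split(samples)
print(len(train), len(test))
```

With 380 groups and a test ratio of 0.1, the split yields 342 training samples and 38 test samples, matching the counts stated above.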
ROP prediction models with default parameters were established based on the KNN algorithm, the SVM algorithm, and the stacking algorithm. The inputs are the drilling weight, drilling speed, rotation speed, drilling fluid displacement, density, funnel viscosity, bit pressure drop, and bit diameter in the sample data, and the output is the ROP grade. Because this is a classification problem, the accuracy rate is used to evaluate the models: classification accuracy = number of correctly classified groups / total number of predicted groups. The total number of groups in this prediction is fixed at 38. The prediction results of the default models are shown in Tables 2-5, where low ROP is 1, medium ROP is 2, and high ROP is 3. It can be seen from Tables 2-5 that the classification accuracy of the KNN, SVM, and stacking models with default parameters is low. The classification accuracy of the stacking ensemble model is the highest, but it is only 81.6%. Therefore, the algorithm parameters need to be optimized to improve the classification accuracy of the model.
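The accuracy definition above corresponds to the sum of the diagonal of a confusion matrix divided by the sum of all its entries. The sketch below uses a hypothetical 3x3 confusion matrix: the individual counts are invented for illustration and are not the values of Tables 2-4. Only the total of 38 groups is taken from the text.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified groups / total predicted groups."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical confusion matrix (rows: actual grade 1-3, columns: predicted grade 1-3).
# The counts are invented; they sum to 38 predicted groups.
confusion = [
    [12, 2, 0],
    [1, 11, 2],
    [0, 2, 8],
]
print(f"{accuracy_from_confusion(confusion):.1%}")
```

In this invented example, 31 of the 38 groups lie on the diagonal, which gives an accuracy of 31/38, or about 81.6%.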

Optimization Model.
The genetic algorithm (GA) [25][26][27][28][29][30][31] is an efficient random search and optimization method based on natural genetic mechanisms and biological evolution. In the solving process, the algorithm simulates the evolutionary mechanism of the biological world and obtains the optimal solution through chromosome selection, crossover, and mutation. The core of the algorithm is to evaluate the merits of individuals through the fitness function. The dominant individuals occupy a higher proportion of the next-generation population. Compared with traditional optimization algorithms, the genetic algorithm has strong robustness. Its global search capability and implicit parallelism are two further significant advantages.
The genetic algorithm optimization steps are as follows: (1) determine the input and output variables, and transform the optimization problem into a population reproduction problem by coding (binary coding is adopted in this paper); (2) randomly generate individuals to form the initial population; (3) determine the fitness function; (4) perform the selection, crossover, and mutation operations; (5) calculate the fitness values and eliminate the unfit individuals; (6) if the iteration stop condition is met, output the optimal solution; otherwise, repeat steps (4)-(6).
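The loop formed by these steps can be sketched with a small binary-coded genetic algorithm. Several details here are assumptions made for illustration, since the text does not specify them: tournament selection, single-point crossover, bit-flip mutation, elitism, and a toy fitness function that counts 1-bits. The sketch maximizes fitness, whereas an error-based fitness would be minimized (for example, by negating it). The population size, chromosome length, and number of generations are reduced so that the example runs quickly.

```python
import random


def run_ga(fitness, length=20, pop_size=30, generations=40,
           p_cross=0.85, p_mut=0.01, seed=0):
    """Maximize `fitness` over binary chromosomes; return (best, best-fitness history)."""
    rng = random.Random(seed)
    # Step (2): random initial population of binary-coded individuals.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Steps (3) and (5): score the individuals and record the current best.
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        next_pop = [elite[:]]  # elitism: the best individual survives unchanged
        while len(next_pop) < pop_size:
            # Step (4), selection: tournament of size 3 (an assumed operator).
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = p1[:]
            # Step (4), crossover: single-point, applied with probability p_cross.
            if rng.random() < p_cross:
                cut = rng.randint(1, length - 1)
                child = p1[:cut] + p2[cut:]
            # Step (4), mutation: flip each bit with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop  # weaker individuals are replaced by the offspring
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy fitness (an assumption): number of 1-bits, standing in for a model score.
best, history = run_ga(fitness=sum)
print(history[0], "->", history[-1])
```

In a hyperparameter search, each chromosome would be decoded into parameter values (for example, the KNN neighbor count or the SVM penalty coefficient) and the fitness would be computed from the resulting model's validation performance.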
In this paper, the parameters of the genetic algorithm are set as follows: the population size is 150, the maximum number of generations is 70, the chromosome length is 100, the crossover probability is 0.85, the mutation probability is 0.01, and the absolute error is selected as the fitness function. After optimization, the number K of KNN neighbor samples is 30, the voting weight of each neighbor sample is inversely proportional to its distance, and the distance metric is the Manhattan distance. The penalty coefficient of the optimized SVM is 12, the kernel function parameter is 0.1, and the kernel function is the Gaussian kernel.
The optimized parameters were input into the corresponding models, and the optimized models were fused using the stacking ensemble algorithm. The precision and generalization of the optimized models were verified with 40 groups of test data. The prediction results of each optimized model are shown in Tables 6-9, where low ROP is 1, medium ROP is 2, and high ROP is 3. The prediction results of the ROP level show that the accuracy of each model improved after optimization by the genetic algorithm. The classification accuracy of the stacking ensemble model was the highest, at 92.5%. The stacking ensemble model is therefore the most suitable for predicting ROP in this oilfield and provides a reference for optimizing drilling construction parameters.

Prediction of Rate of Penetration of an Oilfield in Xinjiang
ROP prediction software was developed based on the stacking model optimized by the genetic algorithm. The construction parameters of the deep well section of an oilfield in the Xinjiang working area were entered as follows: the bit entering degree is 100%, the bit out degree is 50%, the bit tooth wear is 50%, the bit weight is 80 kN, the rotation speed is 30 r/min, the bit pressure drop is 0.53 MPa, the bit diameter is 215.9 mm, the drilling fluid density is 1.6 g/cm³, the drilling fluid circulation displacement is 24 L/s, and the drilling fluid funnel viscosity is 57 s. For these construction parameters, the software predicts the ROP level "1-low speed, ROP less than 4 m/h." In the same way, ROP predictions for two other sets of construction parameters were obtained, as shown in Table 10.
Table 10 shows that the ROP predicted by the stacking ensemble model for the oilfield is consistent with the actual ROP, confirming that the ROP prediction method based on stacking ensemble learning can be used to predict the ROP level and range in the field. Field operation personnel can adjust construction parameters based on the prediction results and dynamically optimize the drilling construction plan to improve the rate of penetration and drilling efficiency.

Conclusion
(1) The original data of an oil field in the Xinjiang working area were cleaned and processed. KNN, SVM, and the stacking algorithm were used to establish ROP prediction models, and the prediction accuracies were 63.2%, 73.7%, and 81.6%, respectively. The highest accuracy, that of the stacking ensemble model, was 81.6%, which needs to be further optimized.
(2) The parameters of each model were optimized by the genetic algorithm. The optimized accuracies of the KNN, SVM, and stacking models for predicting ROP are 73.7%, 78.9%, and 92.5%, respectively. The prediction accuracy of the stacking ensemble model is the highest, and the model can be used for ROP prediction and drilling construction parameter optimization in an oilfield in the Xinjiang working area.

Figure 3: Operating principle of the stacking model.

(3) ROP prediction software suitable for an oil field in Xinjiang was developed based on the ensemble learning model optimized by the genetic algorithm. Field operators can adjust the construction parameters based on the prediction results and dynamically optimize the drilling construction scheme.
(4) Because of the limited data available, only the few features mentioned in this paper could be considered as factors affecting the ROP. More detailed features, such as the specific composition of the drilling fluid and formation parameters, can be considered in the future.

Table 1: Classification of ROP corresponding to different bit sizes in an oil field in Xinjiang working area.

Table 2: Confusion matrix of the K-nearest neighbor algorithm for predicting ROP grade.

Table 3: Confusion matrix of the support vector machine algorithm for predicting ROP grade.

Table 4: Confusion matrix of the stacking integrated model for predicting ROP grade.

Table 5: Default model prediction results.

Table 6: Confusion matrix of the optimized K-nearest neighbor algorithm for predicting ROP level.

Table 7: Confusion matrix of the optimized support vector machine algorithm for predicting ROP level.

Table 8: Confusion matrix of the optimized stacking integrated model for predicting ROP level.

Table 9: Model prediction results optimized by genetic algorithm.

The default parameters of each model are as follows: the number K of KNN neighbor samples is 5, the voting weights of the neighbor samples are equal, and the distance metric is the Euclidean distance. The penalty coefficient C of SVM is set to 1, the kernel function parameter g is set to 1/n_features, and the kernel function is the Gaussian kernel.

Building the stacking integrated learning model is divided into two stages. First, the primary learner (primary model) is trained by the stacking algorithm using the training dataset. Then, a new training dataset is generated based on the primary learner to train the meta-learner (meta-model).

Table 10: Comparison of predicted ROP data and actual ROP data.