Combining K-Means Clustering and Random Forest to Evaluate the Gas Content of Coalbed Methane Reservoirs

The accurate calculation of the gas content of coalbed methane (CBM) reservoirs is of great significance. However, because the logging response of CBM reservoirs correlates only weakly with gas content and the relationship is strongly nonlinear, conventional gas content calculation algorithms struggle to produce reliable results. This paper proposes a CBM reservoir gas content assessment method that combines K-means clustering and random forest. K-means clustering is used to divide the reservoirs into types, and a random forest model is established for each type. In the study block, the prediction accuracy of the new method is significantly higher than that of the original method, and more accurate gas content predictions are obtained for different types of reservoirs. The method can therefore improve the accuracy of gas content evaluation and better support the exploration and development of CBM reservoirs. The results show that clustering effectively separates samples whose logging responses relate to gas content in different ways. On this basis, random forest modeling can characterize the complex relationship between gas content and logging curve response, so the gas content of the reservoir can be calculated accurately even when the correlation between gas content and the logging curves is poor.


Introduction
With the continuous progress of exploration and development, research on unconventional reservoirs such as coalbed methane (CBM) and shale gas is in full swing, and these reservoirs are key growth points for reserves [1][2][3]. CBM is natural gas that is generated by biochemistry and pyrolysis during the formation and evolution of coal seams and is stored in coal seams. At present, the United States, Canada, Australia, Russia, India, China, and other countries have all started exploration and development of CBM [4][5][6].
The gas content of a CBM reservoir is a very important parameter, which determines the reserves and final production of the reservoir [7,8]. However, compared with other reservoirs, the gas content of CBM reservoirs is more difficult to calculate. This severely restricts the identification of high-quality reservoirs and the formulation of development plans and leaves CBM reservoirs poorly understood. Among core-based gas content calculation methods, Kim proposed combining the moisture and ash content with the coal bed temperature and pressure and an equilibrium water state correction to calculate the coalbed gas content [9]. Ahmed et al. established an isotherm adsorption model, based on an isotherm adsorption experiment, to describe the gas content [10]. Hawkins et al. proposed using the Langmuir coal rank equation to predict gas content [11]. However, none of these methods can predict the gas content continuously along the vertical direction of the formation. Logging is currently the only method that can accurately predict the vertical gas content change of a single well, so it is of great significance to establish an accurate gas content logging evaluation model. Some scholars have studied logging-based calculation of the gas content of CBM reservoirs. Liu et al. [12], Meng et al. [13], and Shao et al. [14] each proposed a statistical method for evaluating the gas content of CBM reservoirs. Jin et al. [15] and Fu et al. [16] also used this method. However, the relationship between gas content and logging response is too complicated to be captured by statistical models or volume models alone. Methods for evaluating coal reservoir gas content using machine learning algorithms have therefore gradually emerged. Hou and Wang [17] used the error back propagation neural network (BPNN) to predict the gas content and achieved certain results. Pan and Huang [18] and Wu also used BPNN to predict the gas content. Lian et al. [19] introduced support vector machines to the evaluation of gas content. Guo et al. [20,21] used the grey system and random forest to predict the gas content. Xiang et al. [22] proposed an application of deep learning to CBM logging interpretation and concluded that a deep belief network predicts CBM gas content better than BPNN, multiple regression, and the Langmuir equation method.
Although many scholars have proposed a variety of methods for evaluating the gas content of CBM, it cannot be ignored that CBM reservoirs are highly complex, even more so than shale gas reservoirs. The logging response of such a complex reservoir is affected by many factors, and the main controlling factors of the logging response differ from reservoir to reservoir. No matter how strong the approximation ability of a model is, it is difficult to evaluate the gas content accurately with a single evaluation model. The innovation of this article is to use clustering to separate data with different feature relationships, so that the machine learning algorithm can be targeted at each kind of data, and to use a more efficient machine learning algorithm to improve the prediction effect.
Based on this idea, this paper proposes a K-means clustering + random forest gas content evaluation method. First, data are collected and classified with the clustering method. Next, a separate model is established for each class of data to evaluate the gas content. Finally, the set of established models is applied to the logging curves to obtain the final gas content prediction curve for the entire well section. In this way, the influence of differing main control factors on the prediction is reduced as much as possible, so that each model is more targeted and the prediction effect improves. Although this is a more complicated modeling approach, it greatly improves the prediction of actual reservoir gas content. Judging from the prediction effect in the study block, the method proposed in this paper is effective and can help the exploration and development of CBM.

Data
The study block is located in the southeast of Qinshui Basin in Central China. Drilling revealed that there are 16 coal seams in the Taiyuan formation and Shanxi formation, with a maximum total thickness of 23. The production of wells in different positions varies greatly, and the gas content distribution is unclear, which restricts the exploration and development of coalbed methane. We collected 169 measured core gas content data points from 22 CBM parameter wells in this block, together with 6 logging curves including natural gamma ray, spontaneous potential, borehole diameter, deep and shallow lateral resistivity, and bulk density. The measured data show that the gas content of the No. 3 coal seam is mainly distributed between 5 and 20 m³/t (Figure 1).

Method
3.1. K-Means Clustering. Supervised learning methods such as classification or regression are often used to predict categories or values, but in many situations an unsupervised learning method is needed to obtain a set of data categories. When the amount of data is large, clustering algorithms can be used to obtain the different data categories. Clustering belongs to unsupervised learning, which does not rely on predefined classes or labeled training examples. Among clustering methods, K-means clustering is a classic one [23].
K-means clustering first requires a quantitative measure of the difference between two comparable elements. The smaller the degree of difference, the more similar the two samples are, and the more likely they are rock samples of the same type. We define the degree of dissimilarity mathematically here.
Suppose $X = \{x_1, x_2, x_3, \cdots, x_n\}$ and $Y = \{y_1, y_2, y_3, \cdots, y_n\}$, where X and Y are two element items, each with n measurable characteristic attributes; then, the degree of dissimilarity between X and Y is defined as

$$d(X, Y) = f(X, Y) \rightarrow R,$$

where R is the real number field. That is to say, the degree of dissimilarity is a mapping of two elements to the real number field, and the real number quantitatively represents the degree of dissimilarity of the two elements. The dissimilarity can be calculated with the Euclidean distance, Manhattan distance, Minkowski distance, and so on. Usually, we use the Euclidean distance:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$

This way of calculating dissimilarity has a problem: attributes with a large value range influence the distance more than attributes with a small value range. To solve this problem, the attribute values generally need to be normalized. Normalization maps each attribute value proportionally to the same value interval, so as to balance the influence of each attribute on the distance. Usually, each attribute is mapped to the interval [0,1], and the mapping formula is

$$a_i' = \frac{a_i - \min(a_i)}{\max(a_i) - \min(a_i)},$$

where $\max(a_i)$ and $\min(a_i)$ represent the maximum and minimum values of the ith attribute over all element items. The clustering problem is as follows: given a set of elements D, where each element has n observable attributes, use some algorithm to divide D into k subsets such that the dissimilarity between elements within each subset is as low as possible and the dissimilarity between elements of different subsets is as high as possible. Each subset is called a cluster. Clustering differs from classification. Classification is learning from examples: the categories must be defined before classification, and each element is mapped to one of them. Clustering is learning by observation: the categories, and even the number of categories, may not be known before clustering.

Geofluids
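The normalization and Euclidean distance described above can be illustrated with a minimal Python sketch. The function names `min_max_normalize` and `euclidean` and the sample values are assumptions made for illustration; they are not taken from the study data.

```python
import math


def min_max_normalize(rows):
    """Map every attribute (column) of `rows` onto the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # A constant attribute carries no information, so it is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def euclidean(x, y):
    """Euclidean dissimilarity between two elements with n attributes each."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical samples: two attributes with very different value ranges.
samples = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
normalized = min_max_normalize(samples)
print(normalized)
print(euclidean(normalized[0], normalized[1]))
```

Without normalization, the second attribute (range 200) would dominate the distance over the first (range 4); after normalization both contribute on the same scale.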
K-means tries to find the natural categories of the data. The user sets the number of categories, and the algorithm searches for good category centers. The algorithm flow is as follows:
(1) Input the data set and the number of categories K
(2) Choose K initial category centers
(3) Assign each sample to its nearest category center
(4) Move each category center to the mean of the samples assigned to it
(5) Repeat steps 3 and 4 until the assignments no longer change
After a number of cycles, the best classification effect can be obtained. Unlike in marine shale reservoirs, the relationship between the gas content of coal reservoirs and their logging response is relatively poor and follows inconsistent laws, which makes a single final prediction model unreliable. This is because coal reservoirs are more complex than shale reservoirs and have worse continuity, so the logging of coal seams is affected by multiple factors. Using the clustering method to obtain multiple categories and establishing a corresponding prediction model for each category can greatly improve the prediction results.
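As an illustration of the K-means procedure, the following is a minimal sketch of the standard iteration (assign each sample to its nearest center, then move each center to the mean of its samples). It is a generic textbook version under stated assumptions, not the exact implementation used in this study; the function name `kmeans`, its parameters, and the sample points are hypothetical.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initial centers: k distinct samples drawn at random (an assumed choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical normalized samples forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the initial centers, so in practice the procedure is often repeated with several seeds and the run with the smallest average sample-to-center distance is kept.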

3.2. Random Forest.
Random forest is a highly flexible machine learning algorithm that emerged at the start of the 21st century. It refers to an ensemble that contains multiple decision trees, and the thinking behind it is similar to the wisdom of crowds. In the 1980s, Breiman et al. invented an algorithm for classification trees, which performs classification or regression through repeated binary splitting of the data and greatly reduces the amount of calculation. In 2001, Breiman combined classification trees into a random forest: the use of variables and the use of data are randomized, many classification trees are generated, and the results of the trees are then aggregated [24]. Random forest improves the prediction accuracy without a significant increase in the amount of calculation. It is not sensitive to multivariate collinearity, its results are relatively robust to missing data and unbalanced data, and it can handle thousands of explanatory variables well.
Random forest uses a random method to build a forest. The forest contains many decision trees, and the trees are built independently of one another. Once the forest is obtained, each new input sample is judged separately by every decision tree in the forest to decide which category the sample belongs to, and the class chosen most often is the predicted class. For a continuous target such as gas content, the outputs of the trees are averaged instead of voted. Random forest can also handle attributes with discrete values. The construction process of random forest is as follows:
(1) If there are N samples, N samples are randomly selected with replacement (one sample is randomly selected each time and then returned before the next selection). The selected N samples are used to train a decision tree and serve as the samples at the root node of that tree
(2) If each sample has M attributes, then whenever a node of the decision tree needs to be split, m attributes are randomly selected from the M attributes, with m ≪ M. From these m attributes, a strategy such as information gain is used to select one attribute as the split attribute of the node
(3) While the decision tree is being formed, every node is split according to step 2 until it can no longer be split. Note that no pruning is performed during the entire decision tree formation process
(4) Steps 1-3 are repeated to build a large number of decision trees, which form the random forest
In the process of building each decision tree, attention should be paid to the impact of sampling and of complete splitting. First, there are two random sampling processes: random forest samples the input data by rows and by columns. Row sampling is done with replacement, so the sampled set may contain duplicate samples. If there are N input samples, N samples are drawn. In this way, the training samples of each tree are not the full sample set, which makes overfitting less likely. Then, column sampling is performed: m features are selected from the M features (m ≪ M).
After that, a decision tree is built on the sampled data by complete splitting, so that each leaf node of the decision tree either cannot be split further or contains only samples that point to the same category. Many decision tree algorithms include an important pruning step, but it is not performed here. Because the two random sampling processes ensure randomness, overfitting is unlikely even without pruning. Using a random forest to predict gas content should therefore achieve better results.
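The construction steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: because gas content is a continuous value, the sketch grows regression trees that split on the reduction of squared error (swapped in for the information gain named in the text) and averages the tree outputs. All function names, parameters, and data are hypothetical.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (impurity of a node)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, m_try, rng):
    """Grow an unpruned regression tree; a leaf is a number, a node a tuple."""
    if len(set(y)) == 1:
        return mean(y)  # pure node: cannot be split further
    best = None
    # Column sampling: only m_try randomly chosen attributes are considered.
    for f in rng.sample(range(len(X[0])), m_try):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return mean(y)  # the sampled attributes are constant in this node
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], m_try, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], m_try, rng))


def predict_tree(node, x):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def fit_forest(X, y, n_trees=50, m_try=1, seed=0):
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        # Row sampling: N draws with replacement (bootstrap sample).
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], m_try, rng))
    return forest


def predict_forest(forest, x):
    """Average the outputs of all trees (regression counterpart of voting)."""
    return mean([predict_tree(tree, x) for tree in forest])


# Hypothetical toy data: two normalized log responses and a gas content value.
X = [[i / 19, (7 * i % 20) / 19] for i in range(20)]
y = [5.0 if row[0] < 0.5 else 15.0 for row in X]
# m_try equals M here only because the toy set has two attributes.
forest = fit_forest(X, y, n_trees=50, m_try=2, seed=0)
print(predict_forest(forest, [0.1, 0.3]), predict_forest(forest, [0.9, 0.3]))
```

With real logging data, m_try would be set well below the number of attributes, as required by step 2.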

3.3. Combination Method of K-Means Clustering and Random Forest. It is difficult to evaluate the gas content of coal reservoirs because the logging response is affected by various factors, which results in a poor relationship between the logging response and the core data. Different types of data are affected differently, so only after the logging responses are classified, by clustering or similar methods, does the relationship between logging response and gas content within each category become closer. Therefore, K-means clustering is performed first, and then, based on the clustering results, a random forest model is established for each type for final application. The underlying idea is similar to that of random forest itself: K-means clustering is combined with random forests to form a "forest group" that predicts gas content more accurately. The modeling and forecasting process is as follows:
(1) Use K-means clustering to divide the data into several categories. The metric usually used to compare the results of different K values is the average distance between a data point and its cluster centroid. Increasing the number of clusters always reduces this distance, and the metric reaches zero when K equals the number of data points, so it cannot be used as the sole target. Instead, the average distance to the centroid is plotted as a function of K, and the "elbow point" at which the rate of reduction changes sharply can be used to roughly determine the K value
(2) Use the K sets of data and the random forest algorithm to train K models. After the category of new data has been determined, the corresponding model can be used to calculate the gas content
(3) When predicting new data, first determine the category of the new data by calculating the Euclidean distance between the sample and the centroid of each class. The new data belongs to the category whose centroid has the smallest Euclidean distance. After the category is determined, the corresponding model is used for prediction to obtain the predicted gas content of the sample point, and the reliability of the algorithm is assessed by comparison with the real value
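The elbow selection in step 1 and the routing in step 3 can be sketched as follows. The elbow rule used here (largest second difference of the average-distance curve) is one simple assumed criterion, since the text only says the elbow is read from the plot. The centroids, distance values, and constant stand-in models are hypothetical; in the actual workflow each model would be the random forest trained on that cluster.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def elbow_k(avg_dist):
    """Pick K where the curve bends most; avg_dist[i] belongs to K = i + 1."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(avg_dist) - 1):
        # Discrete second difference: large where the decrease slows sharply.
        bend = avg_dist[i - 1] - 2 * avg_dist[i] + avg_dist[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


def assign_cluster(x, centroids):
    """Index of the centroid with the smallest Euclidean distance to x."""
    dists = [euclidean(x, c) for c in centroids]
    return dists.index(min(dists))


def predict_gas_content(x, centroids, models):
    """Route the sample to the model of its cluster and predict with it."""
    k = assign_cluster(x, centroids)
    return k, models[k](x)


# Hypothetical average sample-to-centroid distances for K = 1..6.
avg_dist = [1.0, 0.6, 0.3, 0.27, 0.25, 0.24]
print(elbow_k(avg_dist))

# Hypothetical centroids in normalized log space and stand-in models.
centroids = [[0.2, 0.2], [0.5, 0.8], [0.9, 0.4]]
models = [lambda x: 6.0, lambda x: 12.0, lambda x: 18.0]
print(predict_gas_content([0.85, 0.45], centroids, models))
```

New samples must be normalized with the same minimum and maximum values that were used when the clusters were built, otherwise the distances to the centroids are not comparable.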

Result
First, the data needs to be further analyzed to clarify the relationship between the gas content of the CBM reservoir and the logging response. The corresponding results are shown in Figure 2.
In Figure 2, Vg refers to the total gas content obtained through experiments. AC refers to the acoustic transit time curve response, CAL refers to the caliper (well diameter) curve response, and CNL refers to the neutron porosity curve response. DEN refers to the density curve response, GR refers to the natural gamma curve response, and RD refers to the deep resistivity curve response. It can be clearly seen that the correlation between each curve and gas content is poor, which is obviously different from marine shale reservoirs. Among the curves, deep resistivity logging, neutron logging, and sonic logging have a relatively good relationship with the gas content of the reservoir, and we recommend using these curves as the inputs of the model. When the gas content increases, the sonic transit time of the coal seam increases significantly. Because gas content raises the hydrogen index of the coal seam, the neutron porosity also increases. In addition, the response value of deep resistivity logging increases significantly with gas content, indicating that the adsorbed gas in coal reservoirs can significantly increase the resistivity of the reservoir and reduce the conductivity of the CBM reservoir. We also recommend that the resistivity curve be logarithmically transformed before input. Based on the above data, the K-means clustering study is carried out. Figure 3 shows the relationship between the number of clusters and the average sample-to-centroid distance. It can be clearly seen that when there are more than 3 clusters, the average distance decreases much more slowly, indicating that more than 3 clusters are unnecessary. We therefore choose 3 clusters. After clustering, a gas content prediction model is established for each category. The established models are then used to predict the modeling samples, and the results are shown in Figure 4.
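The curve screening and the logarithmic transformation of resistivity can be illustrated with a short sketch. The Pearson correlation coefficient is assumed here as the correlation measure, since the text does not name one, and the resistivity and gas content values are invented for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical deep resistivity readings (ohm-m) and core gas content values.
rd = [10.0, 100.0, 1000.0, 10000.0]
vg = [5.0, 10.0, 15.0, 20.0]

# Resistivity spans orders of magnitude, so it is transformed with log10.
log_rd = [math.log10(v) for v in rd]

print(pearson(rd, vg))      # correlation with the raw resistivity
print(pearson(log_rd, vg))  # correlation after the logarithmic transform
```

In this invented example the gas content is linear in the logarithm of resistivity, so the transformed curve correlates more strongly than the raw one; with field data the gain would have to be checked curve by curve.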
It can be seen from Figure 4 that, first of all, the poor correlation between the logging curve response and the total gas content does have a great impact on the prediction of gas content. Even though the random forest algorithm has strong approximation and generalization capabilities, the prediction effect of a single model is very poor. It can be clearly seen from the results on the right (Figure 4(b)) that classified modeling after clustering predicts markedly better than the model in Figure 4(a). The prediction effect is poor only when the total gas content is less than 5 cm³/g, and reservoirs with total gas content below 5 cm³/g are not the reservoirs of concern. Through clustering, data with relatively consistent main control factors are grouped together, and the models established on this basis are more targeted. Judging from the core prediction results, the ideas proposed in this article are very helpful for the gas content prediction of coal-measure reservoirs. The method is then used to predict the gas content of test wells A and B in the study area. The results are shown in Figures 5 and 6, respectively.
In Figures 5 and 6, the first track is the depth track, and the second track is the caliper curve measured by the four-arm caliper tool. In the third track, the SP curve is the spontaneous potential logging curve, the GR curve is the natural gamma logging curve, Rxo is the microsphere focused resistivity logging curve, RS is the shallow lateral resistivity logging curve, and RD is the deep lateral resistivity logging curve. In the fifth track, DEN is the density logging curve, AC is the sonic logging curve, and CNL is the neutron porosity logging curve. In the sixth track, Vg_RF is the gas content curve directly predicted by random forest, and Vg_core is the gas content value of the core. In the seventh track, Vg_KRF is the gas content curve obtained by random forest prediction after clustering, Category is the clustering result of the curve, and Conclusion is the interpretation conclusion of the CBM reservoir.
It can be seen from Figure 5 that the Vg_KRF curve matches the core data much better than the Vg_RF curve, indicating that random forest modeling based on the clustering method performs better. Analysis of the curves shows that the gas content of type I reservoirs is relatively low, while the corresponding natural gamma values are relatively high. This indicates that this type of reservoir has a high mud content, which affects the logging response and makes the gas content prediction inaccurate when a single unified model is established. In addition, for the coal reservoirs in category III, the natural gamma response is low, the density response is low, the acoustic response is relatively high, and the resistivity response is relatively high. This shows that type III coal reservoirs are high-quality coal reservoirs with higher coal content, and their gas content should also be higher than that of other reservoirs. In terms of prediction effect, the gas content predicted directly by the random forest algorithm is clearly too low in type III reservoirs, especially in the intervals with very high gas content. In practice, this would make it difficult to find the best quality reservoirs. It should also be mentioned that although only the acoustic, resistivity, and neutron log responses are used, the other curves also correspond well with the categories, which supports the accuracy of the classification. Type IV reservoirs obviously correspond to the enlarged-borehole interval, and targeted modeling for this interval can enhance the reliability of the model as much as possible. Therefore, judging from the application to well A, the gas content evaluation method proposed in this paper is more reliable than previous methods. Figure 6 shows the importance of targeted models. The coal reservoir in Figure 6 is basically a type I reservoir. The gas content curve obtained by directly using random forest for modeling and prediction fluctuates very little and is not very specific, which makes it difficult to use the results directly for identifying high-quality reservoirs. The Vg_KRF prediction is more accurate and can be used for high-precision characterization of gas content. However, at 1243.1-1245.5 m, the Vg_KRF prediction is too low, although the predicted trend is consistent with the actual core trend, indicating that the clustering results need to be adjusted. Alternatively, to save CBM reservoir development costs, the logging tools used for measurement may lack sufficient resolution, so that the logging response is disturbed where the reservoir changes sharply in the vertical direction, which ultimately leads to inaccurate classification. Therefore, the next step of the study can focus on log curve super-resolution based on wavelet transform and other methods to further improve the prediction effect.

Figure 5: Gas content prediction results of well A in the study area.
In general, the method proposed in this paper is of great help to the gas content evaluation of coal reservoirs with poor correlation between logging response and reservoir parameters.

Summary and Conclusions
The calculation of the gas content of CBM reservoirs is more complicated than that of other reservoirs. Various characteristics of coal reservoirs have a series of effects on the logging response. This paper proposes a combination of K-means clustering and the random forest algorithm to solve the difficult problem of calculating the gas content of CBM reservoirs. The research conclusions are as follows:
(1) The calculation of the gas content of CBM reservoirs is more complicated because the CBM reservoir itself is relatively complex, which causes the logging response to be affected by multiple factors. This in turn weakens the relationship between CBM logging response and gas content, making the correlation poor
(2) The samples are first clustered, and a random forest gas content prediction model is established for each type of sample. The clustering results, combined with analysis of the actual logging curves, clearly show that clustering algorithms can effectively divide CBM reservoirs into different types. Furthermore, comparison of the CBM gas content prediction results shows that the models established after clustering are more targeted and evaluate the gas content more accurately. The method proposed in this paper can improve the calculation accuracy of CBM gas content and provides an approach to parameter evaluation when the relationship between logging response and reservoir parameters is poor

Nomenclature
CBM: Coalbed methane
RF: Random forest
Vg: Gas content
AC: Sonic log
CAL: Caliper log
CNL: Neutron log
DEN: Density log
GR: Gamma log
RD: Resistivity log

Data Availability
All data are presented in the figures of the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Figure 6: Gas content prediction results of well B in the study area.