Total Organic Carbon Content Prediction in Lacustrine Shale Using Extreme Gradient Boosting Machine Learning Based on Bayesian Optimization

The total organic carbon (TOC) content is a critical parameter for estimating shale oil resources. However, common TOC prediction methods rely on empirical formulas, and their applicability varies widely from region to region. In this study, a novel data-driven Bayesian optimization extreme gradient boosting (XGBoost) model was proposed to predict the TOC content using wireline log data. The lacustrine shale in the Damintun Sag, Bohai Bay Basin, China, was used as a case study. First, correlation analysis was used to analyze the relationship between the well logs and the core-measured TOC data. Based on the degree of correlation, six logging curves reflecting the TOC content were selected to construct the training dataset for machine learning. Then, the performance of the XGBoost model was tested using K-fold cross-validation, and the hyperparameters of the model were determined using a Bayesian optimization method to improve the search efficiency and reduce the uncertainty caused by rules of thumb. Next, through the analysis of prediction errors, the coefficient of determination (R²) between the TOC content predicted by the XGBoost model and the core-measured TOC content reached 0.9135, while the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) were 0.63, 0.77, and 12.55%, respectively. In addition, five commonly used methods, namely, the ΔlogR method, random forest, support vector machine, K-nearest neighbors, and multiple linear regression, were used to predict the TOC content to confirm that the XGBoost model has higher prediction accuracy and better robustness. Finally, the proposed approach was applied to predict the TOC curves of 20 exploration wells in the Damintun Sag, yielding the first quantitative contour maps of the TOC content of this block. The results of this study facilitate the rapid detection of the sweet spots of lacustrine shale oil.


Introduction
Recently, unconventional shale oil and gas reservoirs have profoundly revolutionized the energy industry in North America and China [1,2]. Unlike marine shales, the shale plays of the Bohai Bay Basin in northeast China are mainly lacustrine, with frequent changes in the sedimentary environment and strong reservoir inhomogeneity. The accurate and efficient identification of sweet spots in thin shale plays is a hot research issue. Studies have found that the exploration potential of shale oil is primarily associated with three factors: the hydrocarbon generation potential, reservoir capacity, and recoverability [3]. Organic matter is an essential material for determining the hydrocarbon generation potential and hydrocarbon enrichment [4]. The total organic carbon (TOC) content is a key indicator for evaluating organic matter abundance [5]. An accurate TOC value is typically obtained from Rock-Eval pyrolysis of a core sample; however, drilling to obtain core samples is time-consuming and expensive, resulting in a discontinuous and nonuniform distribution of core-measured TOC data. Moreover, the thickness of organic-rich lacustrine shale plays is usually very small; thus, it is unreasonable to evaluate the hydrocarbon generation potential using discrete core-measured TOC data points. Well logs have high resolution and provide continuous data. The variation of the organic matter content affects the petrophysical properties of the formation, such as radioactivity, resistivity, and density (DEN), producing a unique logging response; therefore, the TOC curve can be predicted using well logs [6].
At present, methods using well logs to predict the TOC content include statistical correlation, overlay methods, multiple regression, and machine learning. Beers first proposed using the natural gamma radioactivity intensity to evaluate the TOC content [7]. Subsequently, many scholars established empirical relationships between natural gamma (GR) logs and TOC in different areas [8,9]. Swanson found that the radioactivity of organic matter was mainly related to the amount of uranium (U) adsorbed in the formation. Thus, researchers predicted the TOC content using natural gamma-ray spectrometry logs [10,11], for example, by establishing a linear correlation between the TOC content and the U log [12] or a multivariate statistical relationship between the TOC content and the U log combined with the thorium (Th)/U ratio log [13]. Schmoker found that the main reason for a decrease in the DEN of an organic-rich formation was the increase in the organic matter content; therefore, a regression relationship was established between the DEN log and the TOC content [14]. Herron proposed a method to determine the TOC content using the carbon-oxygen ratio log [15]. Passey et al. proposed the ΔlogR method [16], which overlays the porosity logs with the deep resistivity (RD) log and uses a nonsource rock zone as the baseline to establish an empirical relationship between the TOC content and the well logs. Subsequently, many scholars proposed improvements based on the ΔlogR method [17-20]. In recent years, the emergence of special logging methods has provided further approaches to predicting the TOC content, such as calculating the TOC content using elemental capture spectroscopy logs [21] or combining nuclear magnetic resonance logs with the DEN log [22]. All the above methods are developed from a rock physical model (RPM) and rely extensively on empirical formulas. With the third artificial intelligence boom, machine learning has been widely used for lithology identification [23-25] and reservoir evaluation [26,27]. Machine learning methods for TOC content prediction include the support vector machine (SVM) [28,29], Gaussian process regression (GPR) [30,31], the extreme learning machine (ELM) [32,33], neural networks [34,35], fuzzy clustering [36], and random forest (RF) [37]. Machine learning is data-driven, which improves the accuracy and efficiency of TOC prediction compared to conventional methods.
Practically, most of the core samples that can be used as machine learning samples are concentrated in key reservoir zones, while few labeled data points exist in nonreservoir zones, leading to a significant imbalance in the training samples. Individual models easily fall into local minima when optimizing the objective function and have poor generalization ability. Ensemble learning can effectively solve this problem by training multiple individual models and combining their outputs into an optimal predictive model, which provides higher prediction accuracy than any individual model. A popular example of an ensemble model is RF, which has been used for seismic reservoir prediction [38], lithology identification [39], and hydrocarbon source rock prediction [40]. However, RF is based on the bagging technique and is sensitive to noise and prone to overfitting in regression prediction. In contrast, the gradient boosting decision tree (GBDT) is based on the boosting technique and generally performs better for regression problems. Chen et al. first proposed the extreme gradient boosting (XGBoost) method based on GBDT [41]. Unlike the GBDT algorithm, which uses only first-order derivative information, XGBoost carries out a second-order Taylor expansion of the loss function and includes a regularization term in the objective function to find the optimal solution while avoiding overfitting, making the method highly efficient, flexible, and portable. Yan et al. applied XGBoost to the well logging interpretation of tight sandstone and found that it performed better for fluid identification than the SVM and RF models [27]. Nguyen et al. used XGBoost to predict compressional and shear wave velocities in micritic limestones and achieved higher accuracy than an artificial neural network (ANN) and SVM [42]. Gu et al. used a particle swarm optimization (PSO) algorithm to determine the hyperparameters of the XGBoost algorithm and applied XGBoost to predict the permeability of tight sandstone [43]. To date, the XGBoost model has not been applied to the TOC prediction of reservoirs. Therefore, in this study, a workflow consisting of XGBoost machine learning based on Bayesian optimization for TOC prediction is proposed and applied to lacustrine shale oil in the Bohai Bay Basin. The prediction results are compared with those of the ΔlogR method and other typical machine learning methods to demonstrate the accuracy and reliability of the proposed method.

Theory of Machine Learning
Theory of the XGBoost Model.
XGBoost is an ensemble boosting algorithm that consists of multiple decision tree iterations. It is an improvement of the GBDT. Multiple classification and regression tree (CART) models are first constructed to make predictions on the dataset; these trees are then combined into a new tree model. The models are continuously and iteratively enhanced, with each iteration generating a new tree that fits the residuals of the previous tree. As more trees are added, the complexity of the ensemble model becomes progressively higher until it approaches the complexity of the data itself; thus, training achieves optimal results [41]. If there are K regression trees, the prediction function is defined as

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

where $f_k$ is the $k$th regression tree, $F$ represents the set of CARTs, and $\hat{y}_i$ is the predicted value of the $i$th sample.

The loss function $L$ is defined in terms of the predicted value $\hat{y}_i$ and the true value $y_i$:

$$L = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right),$$

where $n$ is the number of samples. The prediction accuracy of the model is jointly determined by the bias and the variance. The loss function represents the bias of the model, and the variance is controlled by the regularization term $\Omega$, which suppresses the complexity of the model. Therefore, the objective function Obj can be defined as

$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k), \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda \left\lVert \omega \right\rVert^2,$$

where $T$ represents the number of leaf nodes, $\omega$ is the leaf weight value, $\gamma$ is the penalty factor on the number of leaves, and $\lambda$ is the leaf weight penalty factor. XGBoost uses a gradient boosting strategy in which the newly generated regression tree fits the residuals of the previous prediction. The objective function at the $t$th iteration can therefore be rewritten as

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t).$$

A second-order Taylor expansion of the objective function yields

$$\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ is the first-order derivative of the loss function and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ is the second-order derivative of the loss function.
Therefore, it is only necessary to calculate the $g_i$ and $h_i$ values of the loss function at each step and optimize the objective function to obtain $f_t(x)$ for that step. Finally, an optimal ensemble model is obtained by the additive method.
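To make the above concrete, the following is a minimal sketch of fitting an XGBoost regressor with the open-source xgboost package; the synthetic features and targets, and all hyperparameter values, are illustrative placeholders rather than the paper's settings.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-ins for the six selected well logs and core-measured TOC.
rng = np.random.default_rng(0)
X = rng.normal(size=(104, 6))  # e.g., AC, DEN, CNL, K, TH, RD
y = 2.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=104)

model = xgb.XGBRegressor(
    n_estimators=200,       # number of boosted trees (K in the text)
    learning_rate=0.1,      # shrinkage applied to each new tree's contribution
    max_depth=4,            # depth of each CART
    gamma=0.0,              # penalty per leaf (gamma in the objective)
    reg_lambda=1.0,         # L2 leaf-weight penalty (lambda in the objective)
    objective="reg:squarederror",
)
model.fit(X, y)             # each tree fits the residuals of the previous ones
toc_pred = model.predict(X)
```

Note how gamma and reg_lambda map directly onto the γ and λ terms of the regularized objective above.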

Bayesian Optimization of the Hyperparameters.
When a machine learning model is established, the hyperparameters need to be determined in advance, and their selection has a significant impact on prediction accuracy. Therefore, it is important to obtain the optimal combination of hyperparameters. Hyperparameter optimization is a typical black-box optimization problem. Commonly used optimization methods include grid search (GS), random search (RS), the genetic algorithm (GA), PSO, and Bayesian optimization [44]. The GA and PSO algorithms require a sufficient number of initial sample points and are not very efficient. At present, GS, RS, and Bayesian optimization are the most common methods. The GS method must traverse all possible parameter combinations, which is very time-consuming for large data volumes and many hyperparameter dimensions. The RS method randomly samples the hyperparameters in a certain range and selects among them by comparing the performance of different combinations, which does not guarantee that the optimal combination will be found. Moreover, GS and RS evaluate each hyperparameter combination independently: the current evaluation makes no use of the results of previously searched points, even though this information could guide the search and improve both the quality of the results and the search speed. In contrast, Bayesian optimization selects the most promising hyperparameters by evaluating past results, enabling the selection of appropriate hyperparameters with fewer iterations than the RS method [45,46]. Theoretically, Bayesian optimization seeks the global optimum of the objective function:

$$x^* = \arg\max_{x \in X} f(x),$$

where $x$ denotes the hyperparameters to be optimized, $X$ is the set of hyperparameters to be optimized, $f(x)$ is the objective function, and $x^*$ is the optimal combination of hyperparameters. The core of the Bayesian optimization algorithm consists of two parts: first, the posterior probability distribution is calculated from past results using GPR to obtain the expected mean and variance of the objective function at each sampling point; second, an acquisition function is constructed on the posterior distribution to determine the next sampling point.
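As an end-to-end illustration, the sketch below tunes four XGBoost hyperparameters with Gaussian-process-based Bayesian optimization, here using scikit-optimize's gp_minimize (one of several BO libraries; the paper does not name its implementation). The search ranges, n_calls budget, and synthetic data are assumptions for illustration.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Integer, Real

# Synthetic stand-ins for the normalized log features and core TOC labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(104, 6))
y = 2.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=104)

def objective(params):
    """Cross-validated RMSE of an XGBoost model; BO minimizes this."""
    n_estimators, max_depth, learning_rate, reg_lambda = params
    model = xgb.XGBRegressor(
        n_estimators=int(n_estimators), max_depth=int(max_depth),
        learning_rate=learning_rate, reg_lambda=reg_lambda,
        objective="reg:squarederror",
    )
    score = cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    return -score  # negate: sklearn returns a score, BO wants a loss

space = [
    Integer(50, 500, name="n_estimators"),
    Integer(2, 10, name="max_depth"),
    Real(0.01, 0.3, prior="log-uniform", name="learning_rate"),
    Real(0.1, 10.0, prior="log-uniform", name="reg_lambda"),
]

# GP surrogate + expected improvement acquisition, as detailed below.
result = gp_minimize(objective, space, n_calls=30, acq_func="EI", random_state=0)
print("best hyperparameters:", result.x, "best CV RMSE:", result.fun)
```

The two sections that follow detail the two ingredients this call wraps: the Gaussian process surrogate and the acquisition function.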

Gaussian Process.
The Gaussian process (GP) is a generalization of the multivariate Gaussian probability distribution, defined by the mean function $m(x)$ and the covariance function $k(x, x')$.
The GP can be expressed as

$$f(x) \sim \mathrm{GP}\left(m(x), k(x, x')\right).$$

For convenience in practical applications, the prior mean function is set to 0, so the function values at the training points $X$ follow the Gaussian distribution

$$f \sim N\left(0, K(X, X)\right).$$

The covariance matrix $K(X, X)$ can be expressed as

$$K(X, X) = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix},$$

where a typical choice of covariance function is the squared-exponential kernel

$$k(x, x') = \exp\left(-\frac{1}{2} \left\lVert x - x' \right\rVert^2\right).$$

According to the properties of the GP, after adding the sample $X_*$ to be predicted, the new joint Gaussian distribution can be expressed as

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right).$$

Then, the joint posterior distribution of $f_*$ is

$$f_* \mid X, f, X_* \sim N\left(\bar{f}_*, \mathrm{cov}(f_*)\right), \qquad \bar{f}_* = K(X_*, X) K(X, X)^{-1} f, \qquad \mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*).$$

By evaluating the mean and covariance matrices, $f_*$ can be sampled from the joint posterior distribution.
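A minimal numpy sketch of this posterior computation, assuming a zero prior mean and the squared-exponential kernel above (the length-scale and noise values are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * length_scale^2))
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_star, noise=1e-6):
    """Posterior mean and covariance of f* at query points X_star."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    K_inv = np.linalg.inv(K)          # fine at this scale; prefer solve() for large n
    mean = K_s.T @ K_inv @ y_train    # posterior mean of f*
    cov = K_ss - K_s.T @ K_inv @ K_s  # posterior covariance of f*
    return mean, cov
```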

Acquisition Functions.
The acquisition function determines the next sample point based on the posterior results of the probabilistic surrogate model. Usually, the selection of sample points by the acquisition function must balance exploring new areas of the objective space against exploiting areas that are already known. Exploitation refers to searching for the global optimal solution around the current optimum to improve the mean value of the objective function; exploration refers to probing unevaluated sample points to reduce the uncertainty of the objective function. When the GP is used as the probabilistic surrogate model, the four commonly used acquisition functions are the probability of improvement (PI), entropy search (ES), upper confidence bound (UCB), and expected improvement (EI) [45]. In this paper, the EI is chosen as the acquisition function; its mathematical expression is

$$\mathrm{EI}(x) = \mathbb{E}\left[\max\left(f' - f(x), 0\right) \mid \mathcal{D}\right],$$

where $\mathcal{D} = (X, \theta)$ represents the observations and $f'$ is the minimum value of the current observations of $f$.
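Under the GP posterior, this expectation has the standard closed form EI(x) = (f' − μ(x))Φ(Z) + σ(x)φ(Z) with Z = (f' − μ(x))/σ(x), sketched below; the optional xi margin is a common exploration tweak, not something stated in the text.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Closed-form EI for minimization at candidate points.

    mu, sigma: GP posterior mean and standard deviation arrays.
    f_best: current observed minimum f'.
    """
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (f_best - mu - xi) / sigma
    return (f_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```

The next sample point is the candidate that maximizes this value.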

TOC Prediction Process.
The flowchart of the TOC prediction based on the Bayesian optimization XGBoost model is shown in Figure 1. It contains three parts, namely, data preprocessing, model building, and model application, which are described as follows.
(1) Data preprocessing: we first collect the core-measured TOC data and the corresponding well log data. The data are depth-corrected, outlier-processed, and normalized; then, the well logs relevant to the TOC prediction are selected as machine learning input features using linear regression crossplots and Pearson correlation coefficient techniques. Finally, the processed data are randomly divided into a training set and a test set using an appropriate rule.

(2) Model building: the XGBoost model is trained on the training set, its hyperparameters are determined by Bayesian optimization, and its performance and robustness are tested by K-fold cross-validation against the other methods.

(3) Model application: the trained model is applied to the well logs of uncored exploration wells to predict continuous TOC curves.

Geological Setting.
During the sedimentation of E₂s₄ᴸ, lake level oscillations caused by tectonic movement led to cyclic changes in the sedimentary environment, and the lithology of the formation shows "sandwich" characteristics (Figure 3). The upper part is the E₂s₄ᴸ-I group, characterized by dark oil shales with locally observed thin-bedded sandstone. The middle part is the E₂s₄ᴸ-II group, which is composed of siltstone and argillaceous dolomite. The lower part is the E₂s₄ᴸ-III group, characterized by intercalated oil shales, marl, and dolomite. The total thickness of E₂s₄ᴸ ranges from 20 m to 220 m, the TOC content ranges from 2% to 12.8%, Ro ranges from 0.4% to 0.6%, and the organic matter is mainly type I, with some types II₁ and II₂. The hydrocarbon generation intensity is about 4200 × 10⁴ t/km². The latest assessment indicates that the E₂s₄ᴸ formation holds 20.9 × 10⁸ t of hydrocarbon resources, demonstrating significant potential for shale oil exploration [47].

Data Analysis.
The data used in this study are from the key exploration well S352 and consist of well log data and core-measured TOC data. Well S352 penetrated the E₂s₄ᴸ formation from 3150 to 3352 m, and 145.92 m of sealed coring was completed at depths of 3169-3348.97 m, recovering a core length of 122.47 m for a core recovery rate of 83.9%. A total of 107 experimental core samples were taken at nonequal intervals in this cored section (3169-3348.97 m). A Leco carbon and sulfur analyzer was used to measure the TOC content according to Chinese standard GB/T 19145-2003, and 104 valid TOC data points were obtained. The available conventional well logs include GR, spontaneous potential (SP), caliper (CAL), compensated neutron (CNL), DEN, acoustic transit time (AC), RD, and the natural gamma-ray spectrum (U, TH, K). Before the data were used, depth correction and outlier filtering were performed to ensure that the core-measured TOC data and the well log data had a one-to-one correspondence. Table 1 shows the distribution characteristics of the preprocessed well logs, including the mean, maximum, and minimum values, standard deviation, skewness, and kurtosis. Most of the log curves approximately follow a Gaussian distribution, except for RD, which is strongly skewed. Thus, we applied a logarithmic transformation to the RD data before use.
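A brief sketch of this preprocessing step, assuming a pandas DataFrame of logs; the column values here are synthetic, and the min-max normalization is one common choice (the paper does not specify its scaling method):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for a few of the logs; RD is generated right-skewed.
rng = np.random.default_rng(0)
logs = pd.DataFrame({
    "AC": rng.normal(260.0, 20.0, 104),
    "DEN": rng.normal(2.45, 0.10, 104),
    "RD": rng.lognormal(2.0, 1.0, 104),   # resistivity: large positive skew
})

print(logs.skew())                         # RD stands out from the Gaussian logs
logs["RD"] = np.log10(logs["RD"])          # logarithmic transform before use
normalized = (logs - logs.min()) / (logs.max() - logs.min())  # min-max scaling
```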
Crossplots were created to analyze the correlation between the core-measured TOC content and the well logs, and linear regression was used to fit the data. The coefficient of determination (R²) was calculated to evaluate the goodness of fit of the linear model. It is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2},$$

where $y_i$ is the observed value, $\hat{y}_i$ is the fitted value, $\bar{y}$ is the mean of the observed values, and $n$ is the number of samples.

The crossplots are shown in Figure 4. AC, CNL, RD, GR, TH, and U have a positive linear relationship with the TOC content. The R² of AC is the highest (0.3431), followed by CNL (0.2984). The linear relationships between RD, GR, TH, and the TOC content are weaker, with R² values of 0.0408, 0.0112, and 0.0957, respectively. DEN and potassium (K) have a negative linear relationship with the TOC content, with a higher R² for DEN (0.2805) and a lower R² for K (0.1002).
For the multisource data, Pearson's correlation coefficient was calculated to measure the degree of linear correlation between the well log data and the TOC content. It is calculated using

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y},$$

where $\rho_{x,y}$ reflects the degree of linear correlation between variables $x$ and $y$, $\mathrm{cov}(x, y)$ is the covariance of $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, respectively.
The correlations between the variables can be evaluated with a heat map of the Pearson correlation coefficients. In Figure 5, the values represent the correlation coefficient $\rho_{x,y}$: a negative number represents negative correlation, a positive number represents positive correlation, 0 means no correlation, and a value close to 1 or -1 indicates a strong correlation. The highest correlation with TOC occurs for AC (0.59), followed by CNL (0.55) and DEN (-0.53). The correlations between GR, U, and TOC are relatively poor (0.02 and 0.07, respectively).
In summary, none of the well logs were significantly correlated with the TOC content. However, the results provide a ranking of the well log data according to their association with the core-measured TOC content. Thus, we can identify and remove irrelevant and redundant features from the training dataset, reduce the complexity of the model by reducing the dimensionality of the input data, and improve the efficiency of the model [37]. Therefore, based on the results, we selected six logs (AC, DEN, CNL, K, TH, and RD) as input training features.
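The feature ranking described here can be reproduced with a few lines of pandas (corr() computes Pearson's r by default); the DataFrame below is a synthetic stand-in for the preprocessed logs with their TOC labels:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: eight candidate logs plus a TOC label column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(104, 8)),
                  columns=["AC", "DEN", "CNL", "K", "TH", "RD", "GR", "U"])
df["TOC"] = 2 + 0.8 * df["AC"] - 0.5 * df["DEN"] + rng.normal(scale=0.5, size=104)

corr_with_toc = df.corr()["TOC"].drop("TOC")       # Pearson's r for each log
ranked = corr_with_toc.abs().sort_values(ascending=False)
selected = ranked.head(6).index.tolist()           # keep the six strongest logs
print(ranked)
print("selected features:", selected)
```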

K-Fold Cross-Validation (CV).
In machine learning, the data are typically randomly divided into three parts: a training set, a test set, and a validation set. However, we had very few labeled data points, resulting in strong uncertainty when using a small validation dataset to evaluate the model performance and robustness. A standard way to avoid this problem is K-fold CV. The dataset is split into K parts; in each iteration, K − 1 parts are used as the training set and the remaining part as the test set, yielding K models. K-fold CV makes use of all the data, substantially improves the learning ability of the model, and increases the model's robustness. In this paper, following the suggestions of Zhang et al. [48] and Wong [49], the fold number K was set to 5, a choice that reflects the trade-off between computation time and bias (Figure 6).
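A minimal 5-fold CV sketch with scikit-learn's KFold; X and y are synthetic stand-ins for the normalized log features and core-measured TOC values:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(104, 6))                       # six selected log features
y = 2.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=104)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, test_idx in kf.split(X):
    model = xgb.XGBRegressor(objective="reg:squarederror")
    model.fit(X[train_idx], y[train_idx])           # train on K-1 folds
    pred = model.predict(X[test_idx])               # evaluate on the held-out fold
    fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
print("per-fold RMSE:", fold_rmse, "mean:", np.mean(fold_rmse))
```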

Comparison of Models.
We compared the performance of the XGBoost model with that of other machine learning algorithms. Four methods were selected: RF, SVM, K-nearest neighbors (KNN), and multiple linear regression (MLR). Detailed descriptions of these algorithms can be found in the book by Mohri et al. [50]. The hyperparameters of each machine learning algorithm were determined using the Bayesian optimization method to ensure fairness. Additionally, we included the most widely used ΔlogR method for comparison. This method overlays the RD log in logarithmic coordinates and a porosity log in arithmetic coordinates and calculates the TOC content in organic-rich shales, where the two curves separate. The difference between the two curves, ΔlogR, is derived empirically using

$$\Delta \log R = \log_{10}\left(\frac{R}{R_{\mathrm{baseline}}}\right) + 0.02 \times \left(\Delta t - \Delta t_{\mathrm{baseline}}\right),$$

where $R$ is the resistivity (Ω·m), $\Delta t$ is the measured transit time (μs/ft), and $R_{\mathrm{baseline}}$ and $\Delta t_{\mathrm{baseline}}$ are the resistivity and transit time values, respectively, where the two curves overlap at the baseline in the organic-deficient zone. The ΔlogR value and the organic maturity are then used to determine the TOC content in the organic-rich zones:

$$\mathrm{TOC} = \Delta \log R \times 10^{\left(2.297 - 0.1688 \times \mathrm{LOM}\right)} + \Delta \mathrm{TOC},$$

where LOM is the level of organic maturity and ΔTOC is the TOC content background level in organic-rich shale.
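The following sketch reconstructs this calculation from the two equations above (the constants 2.297 and 0.1688 are from Passey's original method); the baseline values and LOM passed in the usage line are illustrative picks, normally read by an interpreter from an organic-lean interval:

```python
import numpy as np

def delta_log_r(rd, dt, rd_baseline, dt_baseline):
    """Curve separation between resistivity and transit time logs."""
    return np.log10(rd / rd_baseline) + 0.02 * (dt - dt_baseline)

def toc_passey(rd, dt, rd_baseline, dt_baseline, lom, delta_toc=0.0):
    """TOC (%) from DeltaLogR and the level of organic maturity (LOM)."""
    dlr = delta_log_r(rd, dt, rd_baseline, dt_baseline)
    return dlr * 10 ** (2.297 - 0.1688 * lom) + delta_toc

# Illustrative usage with assumed baseline values and maturity.
toc = toc_passey(rd=20.0, dt=90.0, rd_baseline=5.0, dt_baseline=70.0, lom=8.0)
print(f"TOC = {toc:.2f}%")
```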

Evaluation Criteria.
In addition to R², we chose the root mean square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) to evaluate the model performance. These criteria are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2},$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\max\left(\left| y_i \right|, \epsilon\right)},$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, $\epsilon$ is a small positive value that prevents division by zero, and $n$ is the number of samples.
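These four criteria are straightforward to compute directly; a sketch follows (the max(|y|, eps) guard in the MAPE matches the reconstruction above and is an assumption about where the text's ϵ enters):

```python
import numpy as np

def evaluate(y_true, y_pred, eps=1e-8):
    """Return R2, RMSE, MAE, and MAPE (%) for one set of predictions."""
    resid = y_true - y_pred
    r2 = 1.0 - np.sum(resid**2) / np.sum((y_true - y_true.mean())**2)
    rmse = np.sqrt(np.mean(resid**2))
    mae = np.mean(np.abs(resid))
    mape = 100.0 * np.mean(np.abs(resid) / np.maximum(np.abs(y_true), eps))
    return r2, rmse, mae, mape
```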

Results and Discussion
In this study, 5-fold CV was adopted to test the model performance and robustness. The code was implemented in Python on a computer with an Intel Core i7-7700 CPU, 32 GB of RAM, and a Windows 10 system. The SVM, KNN, MLR, and RF models were implemented with the open-source Scikit-learn machine learning package; the XGBoost algorithm was run with the open-source XGBoost toolkit, and the ΔlogR method code was written by the authors.

Comparison of Model Performance.
The dataset was randomly divided into a training set and a test set for the 5-fold CV. All data were normalized to eliminate the effect of unit and scale differences between the well logging parameters. The crossplots of the predicted and core-measured TOC content are shown in Figure 7; the solid line is the 1:1 line, and the dashed line is the linear regression line. It should be noted that the ΔlogR method used all available data for analysis, without 5-fold CV. The results show that the XGBoost model has the best prediction performance, with an R² of 0.9135, followed by the RF model with an R² of 0.8931 and the ΔlogR method with an R² of 0.8345. The other three methods have mediocre prediction performance, with R² values around 0.74. Furthermore, we compared the RMSE and MAPE of the different methods under 5-fold CV. Figure 8 shows the RMSEs on the test set; the RMSEs of XGBoost and RF are substantially lower than those of the other methods. Moreover, the XGBoost model can be considered the most reliable because its RMSE is the lowest in all folds except k = 1. Figure 9 shows the MAPEs on the test set. In terms of relative error, the XGBoost model outperforms the other models, with a maximum MAPE of 16.14% at k = 4, a minimum of 9.77% at k = 1, and a mean MAPE of 12.55%. The second-best model is RF, with a maximum MAPE of 17.18% at k = 4, a minimum of 9.05% at k = 1, and a mean MAPE of 12.97%. The MAPE of the SVM fluctuates considerably, with a maximum of 22.86% and a minimum of 11.06%. The mean MAPE of KNN is 16.49%. The MLR had the lowest performance, with MAPE values exceeding 20% in every fold. Table 2 lists the mean MAE, RMSE, and MAPE of the different methods under 5-fold CV. The mean MAE, RMSE, and MAPE of the XGBoost model are 0.63, 0.77, and 12.55%, respectively, each the lowest among all methods. The error analysis indicates that the XGBoost method has the highest accuracy, providing a significant advantage over the other machine learning methods, as well as the ΔlogR method, for TOC prediction.

Model Validation.
We selected well S352 to validate the TOC prediction results of the different methods. The well logs, core-measured TOC data, and TOC curves predicted by the different methods are shown in Figures 10 and 11. The first track shows the mud logging lithology, the second track the lithology indicator logs, the third track the resistivity logs, the fourth track the porosity logs, and the fifth track the natural gamma-ray spectrum logs; the remaining tracks show the TOC curves predicted by the ΔlogR, MLR, KNN, SVM, RF, and XGBoost methods, and the red dots represent the core-measured TOC data. Figure 10 shows the predicted results of the E₂s₄ᴸ-I group. Between 3150 and 3200 m, the lithology is oil shale, and the fluctuations of the well logs are small, indicating good formation homogeneity. The TOC curves predicted by all methods are highly correlated with the core-measured TOC content, and the trends are similar. However, between 3200 and 3236 m, the lithology starts to change. The resistivity logs show high-resistance characteristics, and the core-measured TOC content increases significantly. The prediction results of the XGBoost, RF, and ΔlogR methods are in good agreement with the core-measured TOC content. In contrast, the values predicted by the MLR, SVM, and KNN methods are considerably smaller than the core-measured TOC content. Figure 11 shows the predicted results of the deeper groups. Overall, the XGBoost model performs reliably in nonhomogeneous formations, providing the highest prediction accuracy and best generalization ability. Thus, this method is more suitable for TOC prediction of lacustrine shale oil than the other methods used in this study.

Prediction of the TOC Distribution.
We selected 20 exploration wells drilled through the E₂s₄ᴸ formation to predict the TOC content using the XGBoost model. The contour maps of the predicted TOC content of the E₂s₄ᴸ-I, E₂s₄ᴸ-II, and E₂s₄ᴸ-III groups in the study area are shown in Figure 12. In the E₂s₄ᴸ-I group, the TOC content is relatively high on the west side of the A10-A49-A95 well line (>4%), and the highest values occur near well S166 (>6%). The area with a TOC content exceeding 4% is 73 km² (Figure 12(a)). In the E₂s₄ᴸ-II group, the TOC content is relatively low, ranging from 1.5% to 3.1%; the area with a TOC value greater than 2% covers 115 km² (Figure 12(b)). In the E₂s₄ᴸ-III group, areas with a TOC content greater than 4% are located near wells S224, A49, A104, Sh25, and Sh17, covering 23 km²; the TOC content elsewhere is below 4% (Figure 12(c)). Vertically, the E₂s₄ᴸ-I group has the highest TOC content, followed by the E₂s₄ᴸ-III and E₂s₄ᴸ-II groups. Horizontally, high-quality source rocks are mainly distributed on the western slope of the study area and sporadically in other regions.

Conclusions
We proposed a robust data-driven Bayesian optimization XGBoost model to predict the TOC content using wireline log data. The data were obtained from the Damintun Sag, Bohai Bay Basin, China, consisted of well logs and coremeasured TOC data. Linear regression crossplots were obtained, and Pearson's correlation coefficients were calculated to evaluate the relationship between the well logs and the core-measured TOC data. The results indicated that none of the well logs were significantly correlated with the TOC content. However, the correlation analysis enabled us to identify and remove irrelevant and redundant well logging features for the TOC prediction and reduce the model complexity by reducing the dimensionality of the input data. The model performance was evaluated using 5-fold CV. The quantitative error analysis of the four criteria showed that the proposed approach performs better compared to the traditional method (ΔlogR), with R 2 increasing from 0.8345 to 0.9135 and MAE, RMSE, and MAPE decreasing from 0.79, 1.07, and 14.80% to 0.63, 0.77, and 12.55%, respectively. Also, the XGBoost model outperforms other popular machine learning algorithm (i.e., RF, SVM, KNN, and MLR) in terms of robustness, accuracy, and generalization in predicting TOC for strongly nonhomogeneous lacustrine shale plays. We used the proposed approach for the TOC prediction of 20 exploration wells in the Damintun Sag and obtained contour maps of the TOC content in the E 2 s 4 L formation. The maps enabled the identification of areas with high hydrocarbon generation potential, which is useful for finding sweet spots. Generally, machine learning relies extensively on the quality and quantity of the training 15 Geofluids data. As new exploration occurs, additional data should be added in real time to improve the reliability and generalization ability of the model. In the future, we plan to create a database for machine learning. In addition to predicting the TOC content, this database can be used to predict other petrophysical and geomechanical properties of reservoirs.