Fault Recognition Method Based on Attention Mechanism and the 3D-UNet

Oil and gas reservoirs are of great significance for economic benefits. Faults act as important conduits for transporting hydrocarbons and as essential sealing conditions. The location and morphology of faults reflect changes in the shape of the strata, so fault interpretation of seismic data has been an essential task in oil and gas exploration and development. The traditional fault identification method is time-consuming and inaccurate with large uncertainties. This paper proposed a fault recognition method based on 3D-UNet that added attention mechanisms to a convolutional neural network (CNN). This approach takes advantage of the UNet end-to-end architecture and the attention mechanism to focus on essential areas and suppress irrelevant information, allowing the model to focus on more valuable features. A fault identification network for seismic data was proposed by combining the 3D-UNet architecture with the Squeeze-and-Excitation (SE) attention mechanism. The 3D-UNet architecture comprises two parts: encoding and decoding, and the network architecture realizes end-to-end training. At the same time, SE was used to focus on the advantages of feature channels, further improving the accuracy of the network. The model performance was evaluated using the synthetic dataset provided by Wu. Experimental results show that the proposed model has better prediction performance in terms of accuracy, better recognition continuity, and richer fault detail.


Introduction
Faults are one of the fundamental geological phenomena. As oil and gas development moves from shallow formations to deep formations, faults serve as the main paths of oil and gas migration or as underground sealing structures for oil, gas, and water. To speed up exploration and development, interpretation should be as fast and as precise as possible. Therefore, promoting efficiency and improving productivity have become the most critical issues in modern seismic exploration. As a result, fault identification methods have evolved from conventional manual fault interpretation to artificial intelligence methods.
After seismic data processing, the conventional manual fault interpretation method distinguishes the discontinuities of seismic events in the profile. Although this method can identify faults accurately, it relies heavily on the subjective judgment of the interpreter and imposes a heavy, time-consuming workload. To overcome the shortcomings of manual fault identification, many scholars have conducted extensive research on speeding up fault identification, mainly by calculating seismic data attributes. Some scholars calculated the semblance [1], coherency [2,3], variance [4], and gradient magnitude [5] from seismic data to assist fault interpretation. Although the interpretation efficiency was greatly improved, the extracted seismic attributes may contain noise, so the identification accuracy of faults is relatively poor, and subsequent work still requires manual intervention.
Researchers have proposed automatic or semiautomatic techniques to further reduce the time needed to identify faults. For example, the ant tracking algorithm [6,7] and optimal surface voting technology [8] enhance seismic attributes while suppressing noise, increasing fault identification accuracy and speed. Jacquemin and Mallet [9] proposed a double Hough transform to detect faults automatically. Yan et al. [10] proposed combining principal component analysis (PCA) with forward and backward diffusion to enhance fault features. These methods can extract fault features well, but the fault interpretation speed is slow and not all faults can be detected.
With the development of machine learning [11-14], researchers have combined seismic attributes and machine learning algorithms to identify faults automatically. Di et al. [15] used seismic attributes to highlight faults, created a dataset, and conducted a support vector machine (SVM) analysis on the dataset to detect faults. Tingdahl and de Rooij [16] used various seismic attributes as the input of artificial neural networks to identify and predict faults. These methods are computationally intensive and noise sensitive and do not consider the spatial characteristics of the fault distribution. Recently, deep learning has made remarkable achievements, and many scholars have used deep learning to solve the problem of fault identification. Wu et al. [17] proposed a fault network model based on the UNet end-to-end model. Seismic data samples were taken as the input of the neural network model, which predicts faults by learning fault characteristics. Xiong et al. [18] used slices in three directions as the three-channel input of a neural network to judge whether the point where the three slices intersect lies on a fault. Chang et al. [19] used a residual neural network to build a UNet model for fault detection and carried out multiscale and multilevel feature extraction of seismic data, achieving fault recognition with high accuracy and low loss.
Many of the deep learning methods above use the encoding-decoding structure of UNet, which shows that UNet performs well in seismic fault identification and provides a new solution for it. In this paper, we transform the fault identification problem into a binary image segmentation problem; that is, 1 represents faults, and 0 represents nonfaults. The 3D-UNet architecture is used as the base model, and an attention module is added (see Section 3). The main contribution of this paper is the design of a network model that adds an attention mechanism. The SE module can be easily implemented and embedded into the 3D-UNet model, and the combined model can be trained end-to-end to obtain the fault prediction model and realize automatic fault identification. The SE module captures channel attention across feature channels and assigns high importance scores to fault-related features, which increases attention to fault areas. Our proposed method performs well on the synthetic seismic dataset provided by Wu.
This paper is organized as follows. A brief overview of related research on attention mechanisms is presented in Section 2. Subsequently, we describe the proposed approach and other models for prediction in Section 3. Section 4 presents the experimental results and analysis. Lastly, we end with some concluding remarks in Section 5.

Related Work
Since the AlexNet [20] network achieved good results in the image classification challenge in 2012, many methods have been proposed to improve the performance of convolutional networks. Some work adapts the structure through the connection mode of the convolution layer, pooling layer, and fully connected layer [21,22], and some work focuses on deeper networks [23,24]. These models further improve the performance of convolutional networks in terms of width and depth. In 2015, Ronneberger et al. [25] proposed UNet to solve the image segmentation problem in the medical field. UNet is mainly divided into an encoding part and a decoding part, which form a symmetric structure. It uses skip connections to concatenate the output of each encoder layer with the input of the corresponding upsampling layer so that the low-level features can be better expressed. Inspired by UNet, Zhou et al. [26] proposed UNet++. This model can be regarded as a combination of many small UNets stacked from many units. In the final test, a pruning algorithm removes redundant connections. The difference between 3D-UNet [27] and 2D UNet is that the 2D operations are transformed into 3D operations, while the basic structure remains unchanged. 3D images are not converted into slices and then input into a 2D network; instead, the whole volume is input into the network model as data. V-Net [28] is a variant of UNet proposed for 3D data: its final output is a single-channel 3D volume, a short connection is added in each stage, and it also realizes end-to-end training. Since seismic fault data have a Z-axis, which makes them equivalent to a cube, 3D convolution is applied to the cube to obtain the feature map. The disadvantage is that the number of three-dimensional convolution parameters is large, the network is difficult to train, and it cannot use transfer learning from pretrained models.
The attention mechanism is derived from the study of human vision. Humans selectively focus on the area of interest in an image and devote much attention to this area while ignoring other places, in order to obtain more details of the target and suppress irrelevant information [29-31]. Recently, the combination of CNNs and attention mechanisms has been effectively applied to many fields. For example, Trebing et al. [32] designed SmaAt-UNet, which combines UNet with the convolutional block attention module (CBAM) for weather forecasting, and Li et al. [33] proposed using the attention mechanism to establish the connection between feature images and combined lightweight attention with UNet for the segmentation of retinal vessel images. Both achieved good results.
Attention value calculation identifies the critical areas in the image through a new layer of weight parameters so that the neural network can learn the crucial information of each picture. Then, the network can automatically adjust the weights to adapt to different recognition tasks. Its essence is to locate information of interest and suppress useless information, so that the information most valuable for the current task is quickly screened out of a large amount of information and the feature extraction ability is improved. Therefore, the core of the attention mechanism lies in how to calculate attention. Scholars have studied the calculation of attention values extensively, mainly in three directions: channel attention, spatial attention, and mixed channel and spatial attention. Hu et al. [34] proposed the SE module, whose innovation lies in its focus on the relationship between channels, because the convolution operation of the original CNN model multiplies the feature information in the convolution region by the convolution kernel and then sums the products to obtain the new channel information. The squeeze-excitation operation scores each channel feature, making the network pay more attention to the most informative channel features while suppressing the less important ones. The FC layers used in SE for dimensionality reduction have a side effect on the attention mechanism, and the dependencies between all channels do not need to be captured. Therefore, the ECA [35] module gives up dimensionality reduction and considers each channel together with its k neighbors to achieve appropriate channel interaction. ECA applies a Conv1D after global average pooling, and the local cross-channel interaction range, namely k, is realized by the kernel size of the one-dimensional convolution. This kernel size is chosen adaptively; that is, it varies with the number of channels. The ECA module is lighter and has fewer parameters than SE.
Since the SE and ECA modules only calculate attention between channels, they treat all spatial positions with equal weight and do not calculate the importance of the spatial features. The spatial attention mechanism is just the opposite: it calculates the weight of spatial features and considers the weights of the channel features to be the same. Therefore, Woo et al. [36] proposed the CBAM, which combines spatial and channel attention. CBAM comprises two modules, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). CAM applies max-pooling and average pooling in parallel and passes both results through a shared multilayer perceptron (MLP). The two resulting feature maps are added, passed through the sigmoid function, and multiplied by the original feature map to obtain the channel attention feature map. The input of spatial attention is the channel attention feature map, on which max-pooling and average pooling are carried out. Then, the two feature maps are concatenated and convolved. After the sigmoid function, the new feature map is obtained by multiplying the result by the input feature map. CAM is equivalent to adding max-pooling to the SE module. CBAM shows that combining global max-pooling with average pooling can improve performance.
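The adaptive kernel size that ECA derives from the channel count can be illustrated with a short sketch. It follows the rule published with ECA-Net, k = |log2(C)/γ + b/γ| rounded to an odd number, with the defaults γ = 2 and b = 1; whether the ECA baseline in this paper used these defaults is an assumption, and the function name `eca_kernel_size` is hypothetical.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size for ECA (gamma and b are assumed defaults).

    The kernel grows with log2 of the channel count and is forced to be odd
    so that the 1D convolution stays centered on each channel.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


# Kernel sizes for a few typical channel counts.
for c in (16, 64, 128, 512):
    print(c, eca_kernel_size(c))
```

Because the kernel size depends only on the channel count, an ECA layer adds just k weights per attention module, which is why it is lighter than the two FC layers of SE.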

Proposed Network.
We propose a network architecture combining 3D-UNet and SE modules. The proposed model extends the 3D-UNet network architecture, as shown in Figure 1. The original UNet comprises four downsampling layers and four upsampling layers. This design reduces the number of downsampling and upsampling layers to three due to GPU limitations. The core function of the downsampling layers is to extract context information that supports classification; each downsampling layer extracts further features that serve the higher-level features. The function of an upsampling layer is to double the spatial size of its input. The 3D-UNet network presents a U-shaped structure composed of an encoding part and a decoding part. In the encoding part on the left, each level contains two convolution blocks with a kernel size of 3 * 3 * 3, followed by our SE module, which receives the feature map produced by the convolutions. The pooling operation is connected to the attention module output, and max-pooling is used to further reduce the number of parameters and extract salient features. The pooling step size is 2 * 2 * 2. After the encoding part, the corresponding decoding part follows. The decoder has the same number of levels as the encoder, that is, three upsampling layers. The skip connection splices the low-level semantic information with the high-level semantic information to capture more helpful information (the innovation of the UNet network). The attention output of each level of the encoding part is concatenated with the corresponding upsampling output, and two convolution operations then halve the number of feature channels. The size of the convolution kernel is also 3 * 3 * 3. The last layer applies a 1 * 1 * 1 convolution and a sigmoid function so that the value of each pixel in the final output image lies between 0 and 1. Finally, the output size of the network model is consistent with the input size.
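The effect of the three 2 * 2 * 2 pooling steps on the spatial size can be checked with a small sketch. It assumes the 128 * 128 * 128 input used in the experiments and "same" padding in the convolutions, so that only pooling and upsampling change the size; the helper name `unet_spatial_sizes` is hypothetical.

```python
def unet_spatial_sizes(input_size=128, levels=3, pool=2):
    """Edge length of the feature volume at each encoder level.

    Assumes 'same'-padded convolutions, so only pooling changes the size.
    """
    sizes = [input_size]
    for _ in range(levels):
        if sizes[-1] % pool:
            raise ValueError("input size must be divisible by pool at every level")
        sizes.append(sizes[-1] // pool)
    return sizes


encoder_sizes = unet_spatial_sizes()   # sizes going down the encoder
decoder_sizes = encoder_sizes[::-1]    # upsampling doubles the size back
print(encoder_sizes, decoder_sizes)
```

Each decoder level has the same edge length as the encoder level it is concatenated with, which is what makes the skip connections possible and keeps the output size equal to the input size.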
The principle of the SE module is to weight each channel according to its importance. The first step, called squeeze, is global average pooling: the spatial dimensions of each output channel are compressed so that each channel yields one scalar with a global receptive field. For C channels, the size changes from H * W * C to 1 * 1 * C, so the output dimension matches the number of input feature channels. The second step, called excitation, passes these global features through an FC-ReLU-FC-Sigmoid structure, a small fully connected network that performs a nonlinear transformation on the result of the squeeze operation. The first FC layer reduces the feature dimension to 1/r of the input, and the second FC layer restores the original feature dimension. Using two FC layers with a ReLU activation function reduces the number of parameters and adds nonlinearity. The sigmoid function then yields C values between 0 and 1, which are the weights of the channels. Finally, the original feature map is multiplied by these weights to obtain a new feature map. The squeeze-excitation operation is therefore equivalent to scoring each channel feature: the network pays more attention to channel features that carry a large amount of information while suppressing relatively unimportant ones.
Since the SE module was proposed for two-dimensional data while seismic data are three-dimensional, the original module needs to be modified. The frame diagram of the SE module is shown in Figure 2. The first step is the squeeze operation. That is, through a global average pooling operation, the X * Y * Z * C feature volume becomes 1 * 1 * 1 * C, one scalar per channel, and the formula is as follows:

$$z_c = F_{sq}(u_c) = \frac{1}{X \times Y \times Z} \sum_{i=1}^{X} \sum_{j=1}^{Y} \sum_{k=1}^{Z} u_c(i, j, k), \quad (1)$$

where $u_c$ represents the original feature of channel $c$, and X, Y, and Z represent the seismic input size.
Next is the excitation operation, as shown in formula (2). Multiplying $W_1$ by $z$ is equivalent to the first FC layer. The dimension of $W_1$ is C/r * C, where r is the scaling parameter that reduces the number of parameters; we set it to 8 in this article. Therefore, the result of $W_1$ multiplied by $z$ has dimension 1 * 1 * 1 * C/r, and the ReLU activation leaves this dimension unchanged. Then, multiplying $W_2$ by the output of the activation function is equivalent to the second FC layer, and the dimension of $W_2$ is C * C/r. So the output dimension is 1 * 1 * 1 * C, and the sigmoid function finally yields C scalars between 0 and 1, so the dimension of $s$ is 1 * 1 * 1 * C:

$$s = F_{ex}(z, W) = \sigma(W_2 \, \delta(W_1 z)), \quad (2)$$

where $z$ is the result obtained by squeezing, $\delta$ is the ReLU activation function, $\sigma$ is the sigmoid function, and $s$ is the weight of each channel. Finally, the weight is applied to the original feature by multiplication, and the formula is as follows:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c. \quad (3)$$

Figure caption: The numbers above each bar display the number of channels; the numbers on the left and right sides correspond to the input size.
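The squeeze, excitation, and scaling steps described above can be traced on a small random volume with NumPy. This is a minimal sketch, not the Keras implementation used in the paper: it assumes a channels-last layout (X, Y, Z, C), omits the bias terms of the FC layers, and uses random weights in place of trained ones. The function name `se_block_3d` is hypothetical.

```python
import numpy as np


def se_block_3d(u, w1, w2):
    """Apply a 3D squeeze-and-excitation step to one feature volume.

    u  : feature volume, shape (X, Y, Z, C), channels last (assumed layout)
    w1 : first FC weight, shape (C // r, C)
    w2 : second FC weight, shape (C, C // r)
    Bias terms are omitted for brevity.
    """
    z = u.mean(axis=(0, 1, 2))                 # squeeze: one scalar per channel, shape (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # first FC + ReLU, shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second FC + sigmoid, shape (C,)
    return u * s, s                            # scale: broadcast s over X, Y, Z


rng = np.random.default_rng(0)
C, r = 16, 8                                   # r = 8 as in the text
u = rng.standard_normal((4, 4, 4, C))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
out, s = se_block_3d(u, w1, w2)
print(out.shape, s.shape)
```

With C = 16 and r = 8 the hidden layer has only two units, so the two FC layers hold 2 * 16 + 16 * 2 = 64 weights instead of the 16 * 16 = 256 a single full C-by-C layer would need.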

Other Networks.
For comparison, we also trained other network structures based on 3D-UNet with two different attention mechanisms, CBAM and ECA. The four models compared are the proposed model (3D-UNet with SE), the plain 3D-UNet, 3D-UNet with CBAM, and 3D-UNet with ECA. Table 1 shows the training parameters of each attention mechanism. Since the attention mechanism is only added to UNet, the complexity of all models is comparable. These models are implemented in Keras, where we used a 3 * 3 * 3 convolution kernel for the convolution layers and a 2 * 2 * 2 max-pooling step for the pooling layers. We also compare our proposed model with the classical image segmentation model, the fully convolutional network (FCN), and with AttUNet, which adds attention gates; the FCN models use strides of 4 and 8. Table 2 shows the number of parameters and the model complexity of the image segmentation models. FLOPs are calculated with an input resolution of 128 * 128 * 128.

Data.
This paper uses the public dataset provided by Wu for testing, which was simulated based on the classical convolution model theory. Firstly, an initial horizontal reflectance model with transverse variation is generated by a modified two-dimensional Gaussian formulation. Secondly, to increase the complexity of the model, a new reflectance model is obtained by applying planar shearing to the initial horizontal reflectance model. In addition, faults with different strikes, dips, throws, and displacements are added to the geophysical model, and a Ricker wavelet is convolved with the final reflectivity model to obtain the synthetic seismic data. Finally, a certain amount of random noise is added to the simulated seismic data to improve the realism of the synthetic seismic records [39]. The dataset contains 200 training samples and 20 validation samples. Each sample contains seismic data and the corresponding fault labels, with a size of 128 * 128 * 128, as shown in Figure 3. Figure 4 depicts slices of selected planes in the Inline, Xline, and Time directions. The Inline direction is generally the main survey line, perpendicular to the geological structures and evenly arranged, so the fault is more visible in the Inline slice.
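The last two steps of the simulation, convolving a reflectivity series with a Ricker wavelet and adding random noise, can be sketched in one dimension. The peak frequency, sampling interval, wavelet length, and noise level below are illustrative assumptions and are not the values used to build the dataset; the function names are hypothetical.

```python
import numpy as np


def ricker(peak_freq, dt, length):
    """Ricker wavelet w(t) = (1 - 2 a) exp(-a), with a = (pi f t)^2, centered at t = 0."""
    t = (np.arange(length) - length // 2) * dt
    a = (np.pi * peak_freq * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)


def synthetic_trace(reflectivity, peak_freq=30.0, dt=0.002, noise_std=0.05, seed=0):
    """Convolve a reflectivity series with a Ricker wavelet and add Gaussian noise.

    All parameter defaults are assumptions chosen for illustration.
    """
    wavelet = ricker(peak_freq, dt, length=41)
    clean = np.convolve(reflectivity, wavelet, mode="same")
    rng = np.random.default_rng(seed)
    return clean + noise_std * rng.standard_normal(clean.shape)


# A sparse random reflectivity series standing in for one vertical column of the model.
rng = np.random.default_rng(1)
reflectivity = np.where(rng.random(128) < 0.1, rng.uniform(-1.0, 1.0, 128), 0.0)
trace = synthetic_trace(reflectivity)
print(trace.shape)
```

In the full workflow the same convolution is applied along the vertical axis of the faulted 3D reflectivity model, so a fault appears as a lateral break in otherwise continuous wavelet responses.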

Hyperparameter Settings.
The models mentioned above are trained for 25 epochs with the Adam optimizer [40], and the initial learning rate is set to 1e-4. The hardware used for training is four Tesla V100-SXM2-32GB graphics cards. We feed 3D seismic images into the network one batch at a time. Each batch includes the original image and its copies rotated by 90°, 180°, and 270°. The image size of the seismic data is 128 * 128 * 128.
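A batch of this kind, the original volume plus its three rotated copies, can be assembled as follows. This is a sketch under stated assumptions: axis 0 is taken to be the vertical (time) axis, so the rotation acts in the horizontal plane, and a trailing channel axis is added for a channels-last network. The source does not state the rotation axis, and the function name `rotated_batch` is hypothetical.

```python
import numpy as np


def rotated_batch(seismic, label):
    """Build a batch of four samples: rotations by 0, 90, 180, and 270 degrees.

    The seismic volume and its fault label are rotated identically so that
    they stay aligned. Axis 0 is assumed to be the vertical axis.
    """
    xs = np.stack([np.rot90(seismic, k, axes=(1, 2)) for k in range(4)])
    ys = np.stack([np.rot90(label, k, axes=(1, 2)) for k in range(4)])
    # Add a trailing channel axis: (batch, n1, n2, n3, 1).
    return xs[..., np.newaxis], ys[..., np.newaxis]


# Small stand-in cube; the dataset samples are 128 * 128 * 128.
volume = np.arange(8 * 8 * 8, dtype=float).reshape(8, 8, 8)
faults = (volume % 7 == 0).astype(float)
xs, ys = rotated_batch(volume, faults)
print(xs.shape, ys.shape)
```

Rotating about the vertical axis changes fault strike but keeps dips and the layering direction physically plausible, which is the usual reason for restricting augmentation to this axis.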

Model
The loss function uses $\beta = \sum_{i=0}^{N} (1 - y_i)/N$ to represent the proportion of nonfault pixels in the entire 3D seismic image, where $y_i$ is the label of pixel $i$ and $N$ is the number of pixels, and $1 - \beta$ represents the proportion of fault pixels in the whole 3D seismic image. To judge whether each pixel of the 3D image is a fault or a nonfault, we use precision-recall (PRC) [42] and receiver-operating-characteristic (ROC) [43] curves to evaluate the pixel-level classification result and the performance of the classifier. The ROC curve plots the true-positive rate (TPR) against the false-positive rate (FPR). When the distribution of positive and negative samples is not balanced, the ROC curve is the more stable indicator of model quality. If the number of negative samples increases, false positives (FP) and true negatives (TN) will increase. Precision depends on FP, so the coordinates of the PRC curve are affected and the curve changes accordingly. In the ROC calculation, however, FP and TN increase proportionally when the number of negative samples increases, so FPR is not affected. The closer the PRC curve bulges toward the upper right, the better; the closer the ROC curve bulges toward the upper left, the better. Because we wanted to analyze the model quantitatively, we calculated the AUC (the area under the ROC curve). The larger the AUC, the better the model performance. The F-1 score is the harmonic mean of precision and recall; it is large only when precision and recall are both good. The definitions are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN},$$

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.
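The quantities above can be computed on a handful of voxels in plain Python. The metric functions follow the standard definitions. The loss is an assumption: this section gives only the weight β, so the sketch uses the class-balanced binary cross-entropy that is commonly paired with such a weight, averaged over voxels. The 0.5 decision threshold and all function names are also illustrative.

```python
import math


def class_balance_weight(labels):
    """beta: fraction of nonfault voxels (label 0) in the volume."""
    return sum(1 - y for y in labels) / len(labels)


def balanced_cross_entropy(labels, probs, eps=1e-7):
    """Class-balanced binary cross-entropy (assumed form, averaged over voxels).

    Fault voxels are weighted by beta and nonfault voxels by 1 - beta, so the
    rare fault class is not swamped by the majority class.
    """
    beta = class_balance_weight(labels)
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= beta * y * math.log(p) + (1 - beta) * (1 - y) * math.log(1 - p)
    return total / len(labels)


def classification_metrics(labels, probs, threshold=0.5):
    """Return precision, recall (TPR), FPR, and F-1 at a fixed threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fpr, f1


labels = [1, 0, 0, 0, 1, 0, 0, 0]
probs = [0.9, 0.2, 0.6, 0.1, 0.4, 0.3, 0.2, 0.1]
print(class_balance_weight(labels))
print(balanced_cross_entropy(labels, probs))
print(classification_metrics(labels, probs))
```

Sweeping `threshold` from 0 to 1 and recording (FPR, TPR) or (recall, precision) at each value traces out the ROC and PRC curves, respectively.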

Results and Discussion
After training the four models discussed in Section 3, we select, for each model, the checkpoint with the lowest validation loss from the training process and then test and evaluate it on the validation set. Figure 5(a) shows the model accuracy curves of the training and test sets; the accuracy of our proposed model is stable at about 94% after 25 epochs. Figure 5(b) shows the loss function curves of the training and test sets, and the value gradually converges to 0.01 at about 25 epochs. Comparing the results of the proposed model with those of the other models, we find that all models gradually reach about 92% accuracy and 0.02 loss after 25 epochs, and the proposed model (blue line) has the highest accuracy and the fastest convergence rate among all models, as shown in Figure 6. The accuracy rate of the 3D-UNet with the ECA module is slightly lower than that of the SE module but higher than that of the other two models. Although the CBAM attention mechanism combines the attention of both space and channel, our proposed model is more accurate than the 3D-UNet with the CBAM module. In fact, the performance of the 3D-UNet with the CBAM module is worse than that of the original 3D-UNet model (accuracies of 0.92 and 0.93, respectively). Although the proposed module increases the number of parameters slightly, its accuracy is greatly improved compared with the CBAM module. The 3D-UNet with the CBAM module is slightly less accurate than the other models. The proposed model has the fastest convergence rate among all models, and its loss value drops fastest. The 3D-UNet with the ECA module has a large loss value after the loss curve becomes stable, but its convergence speed is fast. Although the 3D-UNet model converges more slowly than our proposed model, its loss continues to decrease after a certain learning period, its final loss value is also small, and it is better than the 3D-UNet model with the CBAM module. The 3D-UNet model with the CBAM module converges slightly more slowly than the other models.

Table 1: The number of parameters of the compared attention mechanisms.

Table 2: Number of parameters and model complexity of the image segmentation models.
Model              Parameters    GFLOPs
3D-UNet-SE [34]    1634865       2.72e+02
FCN4 [37]          1503337       21.7
FCN8 [37]          1615689       21.6
AttUNet [38]       1756008       3.76e+02
In addition, we calculated the precision-recall and receiver-operating-characteristic curves of the models on the validation set, as shown in Figure 7. The ROC curve of the 3D-UNet with the SE module lies highest above the diagonal, and that of the 3D-UNet with the CBAM module lies lowest, so the classification effect of the 3D-UNet with the SE module is better. However, the ROC curve of the 3D-UNet with the ECA module is nearly the same as that of our proposed model. Since the gap between the ROC curves is small, we also compared the PRC curves, as shown in Figure 7(a). According to the PRC curves, the 3D-UNet with the SE module is at the top right, the 3D-UNet with the CBAM module is at the bottom right, and the 3D-UNet with the ECA module lies above and to the right of the remaining two models. Therefore, the classification effect of our proposed model is better.
We calculated the four evaluation indicators of the four models on the validation set and listed the scores in Table 3. This table shows that our proposed model performs best in most indicators. The lowest precision score belongs to 3D-UNet. The difference between the highest and lowest precision obtained by the four models on the dataset is 0.116, while the lowest recall score belongs to the 3D-UNet with the ECA module. Precision and recall trade off against each other: when one is high, the other tends to be low. Because the F-1 score is the harmonic mean of precision and recall, it better reflects the quality of the model. The highest F-1 score belongs to the proposed model, and the lowest F-1 score belongs to 3D-UNet. The difference between the highest and the lowest is 0.096, which also shows that the SE attention we added plays a significant role. As a larger AUC represents better performance, our proposed model, which has the highest AUC, performs best. When ranking the models, we can see that 3D-UNet with SE performs best in almost all indicators. Although the dataset has an imbalance between positive and negative samples, that is, between faults and nonfaults, the added SE attention helps fault recognition. The comprehensive comparison shows that the proposed model is better than the other models regarding accuracy, loss, and evaluation indicators. It also indicates that the proposed model performs well in fault identification of seismic data.
The experiments also compare the classical semantic segmentation model FCN and AttUNet, which adds the attention gate, with our proposed model. We calculated the evaluation metrics of the different segmentation models on the validation set and show the results in Table 4.

Computational Intelligence and Neuroscience
As shown in Table 4, our proposed UNet-SE model achieves the best performance among the models in terms of the evaluation metrics. The FCN model has lower accuracy and F-1 score than the UNet model due to its weakness in combining context information. AttUNet differs from our proposed model in that it adds an attention gate; although it improves on the FCN model in accuracy to a certain degree, its accuracy, AUC, and F-1 score are all lower than those of the UNet-SE model.
Theoretical data D013 (the 13th test sample) is used as an example to show the identification results of faults in slices along the Xline, Inline, and Time directions, as shown in Figure 8. The shallow seismic events in the Xline slice show strong amplitude, large fault throw, and apparent discontinuity, and these features make fault identification easy. In contrast, the middle and deep seismic events in the Xline slice are weak in amplitude, which blurs the faults. Therefore, it is challenging to identify these faults using manual interpretation or other identification methods that track seismic events and compare their misalignment features.
This paper uses 3D seismic data and fault label samples to establish the neural network that has learned the 3D characteristics of fault distribution and can identify faults with different occurrences and degrees of concealment. According to the figure, both the location and shape of the predicted faults are in line with the labeled real faults, which further verifies the excellent performance of the neural network model proposed in this paper regarding the accuracy and reliability of fault recognition.
Moreover, this paper compares the prediction results of the proposed model and the other models in Xline, Inline, and Time slices, as shown in Figure 9. Figure 9(a) shows the results of the proposed model; Figure 9(b) shows the results after adding the CBAM module to the 3D-UNet; Figure 9(c) shows the results with the 3D-UNet only; Figure 9(d) shows the results after adding the ECA module to the 3D-UNet. The plain 3D-UNet network can identify faults, but its results contain some fuzzy details, indicating a low recognition accuracy. After adding the SE module, the recognition results are better than those obtained with the other attention modules, and the results closely fit the labeled data. This paper also compares the different semantic segmentation models in Xline, Inline, and Time slices, as shown in Figure 10. Among the models, the FCN model has the lowest continuity and accuracy in the recognition results, because it upsamples directly with strides of 8 and 4, resulting in a checkerboard effect on the images. AttUNet is better than the FCN model but worse than the UNet model in terms of continuity and accuracy of recognition.

Conclusions
This paper transforms the fault recognition problem into a binary classification problem and proposes a new method to recognize faults using 3D-UNet and the attention mechanism of the neural network. Several conclusions can be drawn as follows:
(1) The UNet architecture can effectively characterize the fault features in seismic data through encoding, decoding, and end-to-end training. By fusing the feature information of multiple layers through skip connections, the network improves the accuracy and reliability of fault recognition.
(2) The network uses the SE attention mechanism to capture channel attention through feature channels, which further improves the accuracy of network training. After multiple experiments, the loss function converges to 0.01, and the model accuracy rate is 94%. Compared with other attention mechanisms and network models, the proposed method can more accurately identify the fault location, enhance the generalization ability of the network structure, and reduce the artificial uncertainty of the fault interpretation results.
(3) Due to the large size of the 3D data, the existing network architecture leads to intensive computation. A combination of 3D volumes and 2D slices can be considered in the future to train a more lightweight network architecture and achieve higher accuracy with less computation.
Data Availability
The network code is available from the corresponding author upon request. The data are available from https://github.com/xinwucwp.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.