Seismic Facies Segmentation Using Ensemble of Convolutional Neural Networks

Machine learning for seismic interpretation is a growing area of research interest. Manual interpretation demands time and specialized effort, and machine learning models can expedite the process. Convolutional Neural Networks (CNNs) are a class of deep learning algorithms for image data. In this paper, seismic facies segmentation using encoder-decoder CNN architectures is proposed. The proposed method fills a gap by using a multimodel approach for seismic interpretation; the novelty of the approach is that it is not limited to the current dataset or to particular semantic segmentation models. In an encoder-decoder architecture, the input and output sizes are the same, which allows every pixel of the image to be labelled. Four models are trained on the open-sourced Netherlands F3 block dataset. Images of 128 × 128 pixels were extracted from the data. Data augmentation is used in two of the models to enlarge the training set for better model learning. Results of the individual models and their ensemble are compared. The ensemble is formed by averaging the class probabilities obtained from the trained models, and it gives the best results: seven classes are segmented with a global pixel accuracy (GPA) of 98.52%, a mean class accuracy (MCA) of 96.88%, and a mean intersection over union (MIoU) of 93.92%.


Introduction
Discovery of new reserves of oil and gas is strategic for countries. Seismic reflection surveying is used to obtain subsurface information for locating drilling sites in the oil and gas industry [1]. Structural and stratigraphic geometric features and potential hydrocarbon reservoirs can be configured from the seismic reflections. Accurate interpretation of seismic amplitudes and zones is significant for the discovery of oil and gas [2]: it leads to fewer drills and significantly impacts the characterization of the reservoir. Seismic amplitude is manually interpreted by field experts and geophysicists to differentiate between bodies of rock (strata) with different physical properties [3]. The geophysicist needs to manually analyse the images generated from a seismic survey to mark the boundaries between different strata, known as horizons. This process must be repeated for thousands of seismic images. The traditional method of manual interpretation for seismic delineation takes time and produces false positives [4]. Accurate interpretation saves economic resources, as resources are spent on the drilling of the reservoirs.
For data and image classification, methods in computer vision and pattern recognition look for unique features, and the success of these methods depends on the feature extraction techniques [5]. Convolutional Neural Networks (CNNs) provide an alternative by automatically learning domain-specific features [6]. A CNN consists of a large number of convolutional layers, which learn the most useful features automatically and eliminate the need for manual feature extraction. The initial layers of a CNN detect basic features such as edges, colours, and shapes. The deeper layers then combine these basic features to extract more complex features that represent the whole image. The hierarchical structure of the convolutional nodes of CNNs identifies complex objects.
A deep CNN contains millions of parameters and requires a large amount of data for training. CNNs are used for various image applications such as image classification, object detection, localization, and object segmentation.
Semantic segmentation networks extend CNNs in the form of an encoder-decoder architecture [7]. The encoder is used for object detection and classification, and the decoder is used for object localization. Semantic segmentation is the process of taking an image and labelling each pixel in that image with a certain class.
Seismic facies interpretation is an important step for hydrocarbon exploration and exploitation [8]. A major issue in applying deep learning to seismic data is the lack of labeled datasets, which undermines the potential of deep learning. Annotating seismic data is a time-consuming task and requires subject-level expertise, so very few labeled seismic datasets are available online. It is also not feasible to take a model trained on one location and apply it directly to another, because the structures and dynamics differ between locations. Researchers have therefore annotated their own seismic facies datasets [9][10][11][12]; in particular, the authors of [12] annotated the Netherlands F3 block (North Sea) dataset for training deep neural networks and open sourced it for further research. They used the 3D seismic reflection data in conjunction with well log data to manually annotate the 3D seismic volume.
The North Sea contains hydrocarbon deposits, and this area is well-studied. The continental shelf of the North Sea is situated in the waters of the Netherlands. A rectangular area of dimension 16 km × 24 km known as the F3 block is located in the North Sea. A 3D seismic survey was conducted in 1987 to search for hydrocarbons and understand the lithostratigraphy of the area. The data was open sourced by dGB Earth Sciences. The F3 dataset is used extensively in research and studies [12].
Fully Convolutional Network (FCN), SegNet, U-Net, ENET, and DeepLab are a few of the semantic segmentation networks, each with its own architecture. DeepLab's variant DeepLabv3Plus and SegNet are fine-tuned in the proposed method using the pretrained encoders ResNet-18 and VGG-16, respectively. The results show a mean class accuracy of 96.88% and a mean intersection over union of 93.92%.
1.1. Artificial Intelligence. Machine intelligence is another name for artificial intelligence. When humans are born, they have natural sensors in their bodies. They start to observe the world: their parents first, then voices, the hotness or coldness of a body, and so on. With the help of such observations, they develop their responses and actions. This can be termed the training phase. With the passage of time, the brain develops, memory becomes stronger, and the actions and responses become well connected to previous experiences. This example is given as an analogy for understanding human intelligence.
Similarly, a computer can be equipped with sensors and storage. The computer can then be trained by feeding it thousands of images and voices. With the help of these images, the computer can learn the features differentiating them. After the training phase, when an unseen image (one not used to train the computer) is shown to it, the computer can tell whether or not it belongs to a class of images, with very promising results using modern AI techniques, especially deep CNNs.

1.2. Machine Learning. Machine learning (ML) brings the promise of deriving meaning from all the data presently available in the form of audio, images, and datasets. Google Search is one example of machine learning: each time Google Search is used, several ML models working in the background are activated to understand and correct the text and adjust the results according to user interests, based on previous searches and data collected from other applications the user is using. Several ML-based learning algorithms are available to train models, including Random Forest, Decision Tree, Naïve Bayes, SVM, and kNN, to name a few.

Related Work
The application of computational techniques to seismic data followed the development of the machine learning community. Initially, different mathematical features and techniques were calculated to assist geologists in making predictions. The popularity of machine learning techniques then enabled researchers to feed the calculated features into different machine learning models to extract results. In the third stage, GPUs made it possible to build extremely complex models, which gave rise to Convolutional Neural Networks. With CNNs, there is no requirement for feature engineering, and images are fed directly to the network.
Initially, different computational techniques were used to classify images by their geological attributes; an application of textural analysis to 3D seismic volumes is presented in [13]. In this paper, the authors combine image textural analysis with a neural network classification to quantitatively map seismic facies in three-dimensional data. In 2011, seismic texture analysis was a developing concept in exploration geology and geophysics, and a large number of different algorithms were published in the literature. In [14], a review of seismic texture concepts and methodologies is presented, focusing on the latest developments in seismic amplitude texture analysis, with particular reference to the gray-level cooccurrence matrix (GLCM) and texture model regression (TMR) methods. There are discontinuities in seismic images with varying illumination and contrast. In [15], a solution using the congruency of phase in Fourier components is proposed; the algorithm shows better and more efficient results in terms of accuracy compared to texture-based methods for salt dome boundary detection. A data-driven algorithm is proposed in [16]. This protocol overcame the limitations of existing texture attribute-based salt dome detection techniques, which depend on the relevance of attributes to the geological nature of salt domes and the number of attributes used for classification. The authors used a gray-level cooccurrence matrix (GLCM) with attributes extracted from a Gabor filter to delineate salt domes in seismic data. In [17], seismic attributes are combined with their spatial locations for unsupervised seismic analysis using the fuzzy c-means algorithm. This method reduced the effect of seismic noise present in discontinuous regions.
A comprehensive evaluation of the accuracy and performance of three texture descriptors, Gabor filters, GLCM, and Local Binary Patterns (LBP), is presented in [18] for seismic image retrieval to assist the human interpreter in selecting a region of interest (ROI). Before deep learning techniques became popular, features used to be hand engineered and fed into machine learning algorithms such as Random Forest regressors, Support Vector Machines, and boosting algorithms. In [19], an extremely randomized tree to automatically identify salt boundaries is presented. The protocol extracted the features of signal amplitude, curve length, and second amplitude for each voxel and made predictions using extremely randomized trees; a postprocessing step is added afterwards to further increase the accuracy. Reference [20] used a machine learning approach for classifying facies, using 3D seismic reflection data of the North Sea. Fifteen different attributes are extracted for each pixel, such as reflector dip, continuity, and frequency range. The attributes are trained on twenty ML models such as Support Vector Machines (SVM), K-nearest neighbours, regression trees, and neural networks; the best accuracy of 98.3% is obtained using SVM. Reference [21] uses 3D seismic survey data from New Zealand and calculates four features (peak spectral frequency, GLCM, homogeneity, and curvedness). These four features are input to Artificial Neural Networks (ANN) and Support Vector Machines (SVM), and the ANN gives better accuracy on the test set than the SVM. Reference [22] uses 6-7 features (4-5 measured properties and two geologically derived features) and trains models using K-nearest neighbours (KNN), Naive Bayes, fuzzy logic, and ANNs; the ANN is the most effective model and gives the best result.
Unsupervised learning techniques like self-organizing maps and principal component analysis (PCA) have also been used to classify lithostratigraphic zones. Reference [23] uses unsupervised techniques including K-means clustering, PCA, projection pursuit, vector quantization, and Kohonen self-organizing maps. Reference [24] uses competitive neural networks on seismic data to distinguish the distinct seismic behaviour of facies; heterogeneity of classes was indicated in the results, but the resulting classes were not labeled. In [25], Kohonen self-organizing maps are used to estimate the number of seismic facies and make maps, and the wavelet transform is used to detect seismic trace singularities. Others tried deep convolutional autoencoders (DCAE) for facies classification. Reference [26] uses a DCAE, as it can learn nonlinear, discriminant, and invariant features from unlabeled seismic data. The results show that the DCAE outperforms conventional methods like K-means or SOMs; moreover, the DCAE results are of much higher resolution and highlight important information. Reference [27] uses a sparse autoencoder architecture that can detect major geological features from unlabeled seismic data; the model is tested on real and synthetic seismic data to extract relevant structures.
The development of computational power led to a greater emphasis on supervised CNN algorithms for seismic applications. The supervised CNN approaches can be divided into seismic classification and seismic segmentation methodologies. Seismic classification uses convolutional, max pooling, and fully connected layers to predict the class of the centre pixel of the image. In [28], a novel CNN consisting of six convolutional layers followed by a fully connected layer is presented and used for salt detection. Cubes of 65 × 65 × 65 are extracted from the 3D seismic data, and the centre of the cube represents the class (salt or not salt) of that pixel. Reference [29] uses CNNs (vgg-16, ResNet-50, and Waldeland) to classify seismic images of the F3 dataset. The centre pixel of each image is classified, and this step is repeated until all pixels are classified. The results for the vgg-16 and Waldeland architectures show significant improvement in accuracy; however, ResNet-50 is found to be ineffective. An investigation of fully supervised CNNs and semisupervised Generative Adversarial Networks (GANs) is presented in [8]. The models are tested on realistic synthetic images. The results show that CNNs perform better in scenarios where abundant data is available, whereas GANs work better on new sites with limited data. Reference [30] uses a custom-built CNN architecture to detect faults in a 3D seismic cube. The input to the network consists of three orthogonal slices of 24 × 24, and the voxel at their intersection is classified as fault or not fault. The network is trained on synthetic images and tested on both synthetic and real data; the CNN obtains a classification accuracy of 74% on the real dataset.
In [31], an encoder-decoder structure is introduced for CNNs: the encoder learns the distinctive features from the image, and the decoder maps the features back semantically. In [2], a novel segmentation network (Danet-FCN) is proposed, in which Danet is combined with an FCN for pixel classification. For validation, the F3 and Penobscot datasets are used, and a mean IoU of more than 98% is obtained on both. A modification of Danet-FCN is presented in [4], proposing the new architectures Danet-FCN2 and Danet-FCN3 by removing the fused connections; Danet-FCN3 improves the IoU to 99% on the Penobscot dataset. Reference [32] uses a U-Net architecture with dilated convolutions and soft attention mechanisms, where the soft attention mechanism allows the model to suppress noise and learn the main features. The authors trained the models from scratch; the CNN results improve with the use of dilated convolutions and soft attention. Reference [33] presents work on the TGS salt classification dataset at Kaggle. A semisupervised technique is used for salt classification using an ensemble of CNNs: in an iterative process, the predictions at each stage are treated as pseudolabels, and the network is retrained using the training data and the confident pseudolabels. An ensemble of U-Net architectures with ResNet34 and ResNeXt50 encoders is used, and the results show an IoU of 0.896. In [34], a modified U-Net architecture is proposed to detect salt in seismic images; the model is trained and tested on a synthetic dataset, and the results show a mean IoU of 90.53%. A comparison of a 3D patch-based model and an encoder-decoder architecture on the F3 dataset is presented in [9], where the dataset is divided into 9 facies and manually interpreted for training. The work concludes that the encoder-decoder model gives better results at near real-time speeds, at the expense of a long training time.
Reference [13] presents a fully annotated 3D geological model of the Netherlands F3 block. This model is based on the study of the 3D seismic data in addition to 26 well logs and is grounded in careful study of the geology of the region. The study proposed two baseline models for facies classification based on a deconvolution network architecture and made their code publicly available. The first model is patch based: patches are extracted from the crosslines and inlines, and the model is trained on them. The second approach is section based: complete inlines and crosslines are fed to the model. The results show that the section-based model (MCA of 0.817) gives better results than the patch-based model (MCA of 0.705).
2.1. Dataset. A major challenge in seismic facies classification is the availability of annotated datasets. The data needs to be manually labeled by domain experts; the labelling process requires the availability of a geophysicist and is subject to human bias. Due to these limitations, researchers applying deep learning have labeled their own datasets. The authors in [13] labeled the F3 Netherlands dataset and open sourced it for further research. In this paper, the same labeled F3 seismic dataset is used for model verification and results.
The Netherlands F3 block dataset is in 3D NumPy array format. First, the dataset is converted into image/label form (see Figure 1). There are a total of 22368 images, each of size 128 × 128. The dataset is split randomly into train, validation, and test sets with a ratio of 60%, 20%, and 20%, respectively (see Table 1). Each image is divided into 7 facies.

Methodology
In this paper, two deep learning models for semantic segmentation are used: DeepLabv3Plus and SegNet. The models are discussed below, along with a brief overview of semantic segmentation and data augmentation.
The main elements of the framework are as follows. The dataset is converted from NumPy arrays to images and labels; this conversion makes the dataset readily usable on any platform. Images and their corresponding labels are resized to suit the available hardware resources. The resizing of images to a suitable size is an iterative process to avoid GPU memory issues while training; through iterations, the size of 128 × 128 is chosen. The dataset is then split into three parts: training, validation, and testing.
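The preparation steps above can be sketched as follows. This is a minimal NumPy illustration under the assumption that non-overlapping 128 × 128 tiles are cut from each slice of a generic 3D volume; the function names and the exact extraction scheme are illustrative, not taken from the authors' code.

```python
import numpy as np

def extract_patches(volume, patch=128, stride=128):
    """Cut non-overlapping patch x patch tiles from every slice of a 3D volume."""
    n_slices, height, width = volume.shape
    patches = []
    for i in range(n_slices):
        for y in range(0, height - patch + 1, stride):
            for x in range(0, width - patch + 1, stride):
                patches.append(volume[i, y:y + patch, x:x + patch])
    return np.stack(patches)

def split_dataset(patches, ratios=(0.6, 0.2, 0.2), seed=0):
    """Random 60/20/20 train/validation/test split, as used in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(patches))
    n_train = int(ratios[0] * len(patches))
    n_val = int(ratios[1] * len(patches))
    return (patches[idx[:n_train]],
            patches[idx[n_train:n_train + n_val]],
            patches[idx[n_train + n_val:]])
```

The same index permutation would be applied to the label patches so that images and labels stay paired.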
3.1. Semantic Segmentation. Semantic segmentation is the process of categorizing each pixel of an image into a class. It is applicable to a variety of computer vision tasks like autonomous driving and medical image diagnosis and is one of the most challenging problems in computer vision. The objective of a classification problem is to predict the presence of an object in the image [35]; in segmentation, the objective is instead to predict the class of each pixel within the image. For autonomous driving applications, for example, the image needs to be segmented into different objects like cars, pedestrians, roads, and road signals.
In [31], the use of CNNs for semantic segmentation is proposed. The authors use skip connections to join the semantic information from a deep layer with the localization information from a shallow layer to produce a pixel-wise segmentation of the image. The decoder part was implemented as bilinear interpolation. The method improved the PASCAL VOC results by 20%; however, one of its major drawbacks is that it tends to ignore small objects. In [36], a deep deconvolutional network for decoding is presented, consisting of deconvolution, unpooling, and activation layers; this model performs better on the PASCAL VOC 2012 segmentation task, with an accuracy of 72.5% (see Figure 2). In the encoder-decoder architecture used for semantic segmentation in this paper, the encoder phase is similar to a conventional CNN classification model and consists of multiple convolutional and pooling layers. Each convolutional layer first convolves its input and then applies batch normalization and an activation function. The pooling layer is used to downsample the image: a sliding window is passed over the image and is used to summarize the information by selecting the minimum/maximum/average from the window. The encoder is used to classify the objects within an image, and at the end of the encoder stage, a low-resolution feature map is obtained. The encoder is followed by a decoder, which works in the opposite manner: it consists of multiple upsampling and convolutional layers to bring the output to the same size as the input. The decoder is used for the localization of objects, generating the boundaries of the objects within the image.
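The sliding-window summarization performed by a pooling layer can be illustrated with a minimal NumPy sketch. This assumes 2 × 2 windows with stride 2 (a common choice that halves the resolution); the function name is illustrative.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Summarize each non-overlapping 2x2 window of a 2D array by its
    minimum, maximum, or average, halving the spatial resolution."""
    h, w = x.shape
    # Group rows and columns into pairs, then reduce over each 2x2 window.
    windows = x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    op = {"max": np.max, "min": np.min, "average": np.mean}[mode]
    return op(windows, axis=(1, 3))
```

The decoder's upsampling layers perform the inverse role, growing the low-resolution feature map back to the input size.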
For semantic segmentation problems in computer vision, the encoder-decoder architecture gives better results than other CNN architectures. We use two encoder-decoder architectures: DeepLabv3Plus and SegNet.

3.2. DeepLabv3Plus. DeepLab is one of the popular semantic segmentation architectures and is used in [37]. DeepLab is a model designed and open sourced by Google. It uses Atrous convolution in place of deconvolutional networks. Atrous convolution enlarges the field of view of filters without increasing the number of parameters or the amount of computation, and multiple Atrous convolutions are used in parallel to capture contextual information at multiple scales.
A deep convolutional network consists of multiple layers, due to which information about smaller-scale objects is lost, because the feature map shrinks as the input moves through the network. Atrous convolutions are used in the convolutional layers of DeepLab to counter this problem. An Atrous convolution has an additional rate parameter r, the stride at which the input signal is sampled; normal convolution is the specific case r = 1. In [38], denser features are extracted by the use of Atrous convolutions without the need for extra parameters.
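The rate parameter can be illustrated with a one-dimensional NumPy sketch (illustrative only, not DeepLab's implementation): the kernel taps are spaced r samples apart, so the field of view grows with r while the number of kernel weights stays the same, and r = 1 recovers ordinary convolution.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' Atrous (dilated) convolution of a 1D signal.
    Taps are spaced `rate` samples apart; the sliding dot product is the
    cross-correlation convention used in deep learning frameworks."""
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective field of view of the filter
    out = []
    for start in range(len(signal) - span + 1):
        taps = signal[start:start + span:rate]
        out.append(float(np.dot(taps, kernel)))
    return np.array(out)
```

With a 3-tap kernel, rate 1 sees 3 consecutive samples, while rate 2 sees a 5-sample span at the same cost of 3 multiplications.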
DeepLabv3Plus consists of multiple Atrous convolutions (see Figure 3). A different Atrous rate is applied to each of these convolutional layers, which enables feature extraction at different resolutions and thus the extraction of denser features. DeepLabv3Plus obtained an accuracy of 89% on the PASCAL VOC 2012 test dataset.
This research work uses DeepLabv3Plus, the upgraded version of DeepLab, whose results show improvement across object boundaries. ResNet-18 is used as the encoder for DeepLabv3Plus. Two variants of the model are trained: one with data augmentation and one without.

3.3. SegNet. SegNet is a semantic segmentation model developed at the University of Cambridge [7]. The encoder consists of the 13 convolutional layers of the VGG16 network. At the decoder stage, SegNet performs nonlinear upsampling using the pooling indices computed in the corresponding encoder stage. This is the major difference between SegNet and other encoder-decoder architectures: during downsampling in the encoder stage, the pooling indices are stored, and during the decoder stage, they are used to place the values back at their original positions before downsampling. In [7], this passing of pooling-index information to the upsampling stage is used to produce dense feature maps (see Figure 4). The U-Net presented in [39], on the other hand, transfers the entire feature map from the encoder to the decoder, which requires more memory and training time. The first advantage of SegNet is that the upsampling layer in the decoder keeps the high-frequency details intact. The second advantage is that convolutional layers are used in place of fully connected layers; the convolutional layers can remember the indices of image features, as discussed in [38,[40][41][42][43].
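SegNet's index-preserving upsampling can be sketched in NumPy as follows. This is a simplified single-channel illustration of the idea (store the argmax position of each 2 × 2 pooling window, then scatter the pooled values back to those positions), not the library implementation.

```python
import numpy as np

def maxpool_with_indices(x):
    """2x2 max pooling over a 2D array that also records, for each window,
    the flat index of the maximum in the original array."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    indices = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            k = int(np.argmax(window))  # position of the max within the window
            pooled[i, j] = window.flat[k]
            indices[i, j] = (2 * i + k // 2) * w + (2 * j + k % 2)
    return pooled, indices

def max_unpool(pooled, indices, shape):
    """Place each pooled value back at its recorded position; zeros elsewhere.
    This is the sparse upsampling SegNet's decoder densifies with convolutions."""
    out = np.zeros(int(np.prod(shape)))
    out[indices.ravel()] = pooled.ravel()
    return out.reshape(shape)
```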
In this paper, SegNet with a VGG16 encoder is used in two variants: one with data augmentation and one without.

3.4. Data Augmentation. CNNs require a large quantity of data to learn from images and perform well. Data augmentation is a technique for enlarging a dataset from the original data: the original training data is transformed to produce new, artificial training data. The transformations include a range of operations such as flipping, rotation, padding, scaling, cropping, and colour changes.
The purpose of data augmentation in the presented models of this research work is to increase the data size, which allows the models to generalize better and learn from the data. Translation and reflection are used as the data augmentation techniques (see Figure 5 for the effect of reflection along the x-axis).
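A minimal NumPy sketch of the two augmentations, applied identically to an image and its label so that pixel classes stay aligned. The shift amount and the use of a circular roll are illustrative assumptions; the paper does not specify the exact translation used.

```python
import numpy as np

def augment(image, label, shift=8):
    """Return the original pair plus a reflected and a translated copy.
    The same transform is applied to image and label to keep them aligned."""
    pairs = [(image, label)]
    # Reflection along the x-axis (left-right flip).
    pairs.append((np.fliplr(image), np.fliplr(label)))
    # Translation, implemented here as a circular shift for simplicity.
    pairs.append((np.roll(image, shift, axis=1),
                  np.roll(label, shift, axis=1)))
    return pairs
```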
3.5. Application. In this paper, five models are developed for seismic facies segmentation of the images generated from the F3 Netherlands block.

Figure 7: Visualization of the ratio of the pixel counts of classes. Lower NS is the class with the highest ratio of pixels, at about 45%.

Two semantic segmentation networks, DeepLabv3Plus and SegNet, are used, both based on the encoder-decoder architecture. Two models are trained using DeepLabv3Plus with ResNet-18 as the encoder: one without data augmentation and one with data augmentation (translation and reflection). Two more models are trained using SegNet with vgg16 as the encoder, again one with augmented training data and one without. An ensemble of the four models is then created: for each pixel, the predicted scores of the four models for each class are averaged, and the class with the highest average probability represents the pixel (see Figure 6). The results of the individual models and the ensemble are compared.
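The averaging ensemble described above can be sketched as follows; this is a NumPy illustration with hypothetical score maps, not the authors' code.

```python
import numpy as np

def ensemble_predict(score_maps):
    """score_maps: list of (H, W, n_classes) arrays, one per model, holding
    per-pixel class probabilities. Average the maps across models, then take
    the argmax per pixel to get the ensemble's class map."""
    avg = np.mean(np.stack(score_maps), axis=0)
    return np.argmax(avg, axis=-1)
```

A confident correct model can thus outvote a marginally wrong one, which is why the ensemble tends to fix errors at class boundaries.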
The numbers of pixels of the seven classes in the generated training patches are imbalanced (see Figure 7). This imbalance is detrimental to the learning process because learning is biased in favour of the dominant classes. Class weighting is used to handle this issue: median frequency class weights are calculated (see Table 2) and placed in the pixel classification layers of both DeepLabv3Plus and SegNet so that training is unbiased with respect to class frequency. The training options used for the four models are as follows: the Adam optimizer is used for weight optimization, the initial learning rate is set to 0.001, the squared gradient decay factor is set to 0.99, and the minibatch size is 32. Owing to the pretrained encoders, training of all four models converged in a few epochs. Training of each model is started with ten epochs, but since there is no change in accuracy and RMSE after five epochs, early stopping is applied to all four models after five epochs. The validation loss and accuracy plots over the epochs are presented in Figures 8 and 9 for the DeepLabv3Plus and SegNet models, respectively.
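Median frequency class weighting can be sketched as follows, under the common definition in which the frequency of a class is its pixel count divided by the total pixels of the images in which it appears, and each weight is the median frequency divided by the class frequency. This NumPy illustration assumes every class appears in at least one image; the names are not from the authors' code.

```python
import numpy as np

def median_frequency_weights(label_stack, n_classes):
    """label_stack: (N, H, W) integer label maps. Returns one weight per
    class; rare classes get weights above 1, dominant classes below 1."""
    freqs = np.zeros(n_classes)
    pixels_per_image = label_stack[0].size
    for c in range(n_classes):
        # Images in which class c appears at all.
        present = np.array([(img == c).any() for img in label_stack])
        count = sum((img == c).sum() for img in label_stack)
        freqs[c] = count / (present.sum() * pixels_per_image)
    return np.median(freqs) / freqs
```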

Results
To assess the performance of the proposed architecture, the following evaluation metrics are used.
(i) Class accuracy (CA): the percentage of correctly classified pixels in a class i,

CA_i = |P_i ∩ T_i| / |T_i|,

where i represents the class, P_i the pixels predicted as class i, and T_i the true pixels of class i.
The results of the individual models are presented in Tables 3-6, giving a comprehensive analysis of the presented models in terms of accuracy and intersection over union.
Results of the ensemble of the four proposed models are presented in Table 7; the ensemble gives the best result for GPA, MCA, and MIoU. Global pixel accuracy (GPA) is the percentage of pixels over all classes that are correctly classified. Mean class accuracy (MCA) is the average of the class accuracy over all classes, where the class accuracy of a class is the percentage of pixels of that class that are correctly classified. Intersection over union (IoU) measures the overlap between the predicted and true pixel sets of a class and equals 1 if and only if all pixels of that class are correctly classified; averaging the IoU over all classes gives the mean intersection over union (mean IoU).
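The three metrics can be computed from a confusion matrix. The sketch below is a minimal NumPy illustration of the definitions above, in which classes absent from both the prediction and the ground truth are excluded from the means; it is not the evaluation code used in the paper.

```python
import numpy as np

def segmentation_metrics(pred, true, n_classes):
    """Compute GPA, MCA, and mean IoU from integer label arrays of the same
    shape. cm[t, p] counts pixels with true class t predicted as class p."""
    cm = np.zeros((n_classes, n_classes))
    for p, t in zip(pred.ravel(), true.ravel()):
        cm[t, p] += 1
    tp = np.diag(cm)                    # correctly classified pixels per class
    gpa = tp.sum() / cm.sum()           # global pixel accuracy
    support = cm.sum(axis=1)            # true pixels per class
    predicted = cm.sum(axis=0)          # predicted pixels per class
    union = support + predicted - tp    # |P_i union T_i|
    present = union > 0                 # class occurs in pred or ground truth
    mca = np.mean(tp[support > 0] / support[support > 0])
    miou = np.mean(tp[present] / union[present])
    return gpa, mca, miou
```

Note that a class predicted and present but with no overlap contributes an IoU of 0, dragging the mean down, which matches the behaviour discussed below.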
The MCA and MIoU achieved with the ensemble method are 0.9688 and 0.9392, the highest amongst all five of the models. This shows that an ensemble of various models is a useful technique for improving results. The errors of the individual models generally occur at the boundaries between classes; because the ensemble averages the scores for each pixel, a wrong prediction made by one model can be compensated by the right predictions of the other models.
Amongst the individual models, DeepLabv3Plus with a ResNet-18 basenet and data augmentation gives the best results, with an MCA and MIoU of 0.9655 and 0.9355, respectively. For both DeepLabv3Plus and SegNet, the results with data augmentation are better than those without. A random image and its labels are taken from the test set to calculate the MIoU using the individual models and the ensemble (see Figure 10). The ensemble did not give the highest MIoU on this random test image (see Table 8).
Visual results of the ensemble are nevertheless better than those of the other models (see Figure 11), because the ensemble achieves the highest IoU for two classes, whereas each of the other models achieves the highest IoU for at most one class (the highest values are highlighted in italic). In Figure 11, when predicted pixels of a class go beyond the boundary of that class, those pixels are marked green; when predicted pixels of a class do not reach the boundary, the gap is marked magenta.
When pixels of a certain class are present in both the prediction and the ground truth but in nonoverlapping regions, the IoU for that class is 0. When a class is absent from both the prediction and the ground truth, its IoU is undefined and is excluded from the mean.

Conclusions
The ensemble of semantic segmentation networks gave better results than the individual models. An ensemble of Fully Convolutional Network (FCN), SegNet, U-Net, ENET, and DeepLabv3Plus with ResNet-50 and ResNet-101 as encoders is proposed for future work. The dataset in image form is open sourced so that researchers may try other semantic segmentation networks and ensemble them.
Automatic seismic facies segmentation proves to be a promising alternative to the manual labelling of seismic facies by geologists. Manual labelling requires a high degree of subject expertise, which can introduce human bias into the results. The process is complicated and exhausting and requires a high degree of accuracy. It is not feasible for geologists to label an entire area, so labelling is usually performed on only a few of the images or portions of the block. These limitations can be countered by deep learning techniques, since they prove to give accurate results. Deep learning requires the labelling of only some of the images, on which the models can be trained; afterwards, the models can be applied to the complete area to make predictions.

Future Work
For future work, pretrained CNNs like VGG-19, ResNet-34, ResNet-50, ResNet-101, ResNet-152, Inception-v1, Inception-v3, SE-ResNet, ResNeXt, SENet-154, DenseNet-121, DenseNet-169, and DenseNet-201 can be used as encoders in semantic segmentation networks like U-Net, LinkNet, PSPNet, and FPN. Moreover, the average ensemble results can be compared with a voting ensemble. The success of the proposed architecture encourages experimenting not only with other semantic segmentation networks and encoders but also with the data augmentation. Also, the proposed architecture uses the cross-entropy loss function; this can be replaced with other loss functions to check whether an improvement in IoU can be achieved.

Data Availability
The dataset is publicly available in [12].