A Robust Context-Based Deep Learning Approach for Highly Imbalanced Hyperspectral Classification

Hyperspectral imaging is an area of active research with many applications in remote sensing, mineral exploration, and environmental monitoring. Deep learning and, in particular, convolution-based approaches are the current state-of-the-art classification models. However, in the presence of noisy hyperspectral datasets, these deep convolutional neural networks underperform. In this paper, we propose a feature augmentation approach to increase noise resistance in imbalanced hyperspectral classification. Our method calculates context-based features and feeds them to a deep convolutional neuronet (DCN). We tested our proposed approach on the Pavia datasets and compared three models, DCN, PCA + DCN, and our context-based DCN, using the original datasets and the datasets plus noise. Our experimental results show that DCN and PCA + DCN perform well on the original datasets but not on the noisy datasets. Our robust context-based DCN outperformed the others in the presence of noise and maintained a comparable classification accuracy on clean hyperspectral images.


Introduction
Advances in data collection and data warehousing technologies have led to a wealth of massive repositories of data. Together with active research in artificial intelligence, big data science promises mountain ranges of unexplored datasets and the smart tools to extract relevant information. An important goal in computer-based hyperspectral imaging is to perform this information mining accurately without human intervention. Government, industry, and academia seek to automate this process. They find it valuable to reduce the human requirement in core processing tasks, such as segmentation, classification, and their applications.
Ever since Vapnik's [1,2] work transformed the statistical learning theory community, research has indicated the considerable potential of SVM in supervised classification. However, in many real-world classification problems such as remote sensing, medical diagnosis, object recognition, and business decision-making, the cost of selecting a poor kernel for high-dimensional data is too high in terms of computational performance and is a handicap to robust, real-time hyperspectral classification and segmentation.
More recently, deep networks have dominated classification problems, such as image segmentation. Convolutional neural networks (CNNs) are driving advances in recognition. CNNs are not only improving across all domains of image classification [3][4][5][6][7] but also making progress on object detection [8][9][10], key-point-based prediction [11,12], and local correspondence [13]. The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used deep CNNs for image segmentation [14][15][16][17][18][19][20], in which each pixel is labeled, but with shortcomings that this work addresses.
Typically, DCN-based algorithms use the output of the last layer of the network to assign category labels. Imposing a softmax layer on top of a fully-connected dense layer, DCN focuses on semantic information. However, when the task we are interested in is more granular, such as one of classifying mixed pixels or dealing with imbalanced multiclass classification of hyperspectral images, these last layers are not optimal.
Image segmentation faces yet another challenging gap: global information answers the what, while local information provides the where. It is not immediately clear that deep convolutional neural networks for image classification yield a structure sound enough for accurate, pixel-wise multiclass classification. Moreover, when working with high-dimensional features, there is often no go-to algorithm that is exact and has acceptable performance. To obtain a speed improvement, many practical applications are forced to settle for approximation approaches, which do not return exact answers. In practice, numerical optimizations and fast approximations saturate the spectrum of algorithms and research. However, image segmentation can also be explored as the reconstruction of a low-quality image from its high-quality observations. This point of view has many important applications, such as low-level image processing, remote sensing, medical imaging, and surveillance.
There are also paramount applications that would benefit from advances in unsupervised image segmentation, such as medical applications and homeland security. Early detection of tumors, kidney disease, heart disease, microbleeds, and microdamages is critical to worldwide public health. There is significant research and new investment in advancing magnetic resonance imaging technology that can accurately aid in early diagnosis. The authors in [21] reviewed the principles and applications of gradient echo MRI, the so-called T2*-weighted imaging. During COVID, the pharmaceutical industry joined forces with academia to develop algorithms for automated assessment of large-scale datasets [22]. Detection of illicit drugs, warfare agents, and dangerous substances is critical to security. The authors in [23] introduced a new technology that can rapidly detect explosives using a thermal imager.
This thermal spectroscopy pushes the boundaries of traditional image and signal processing techniques. The problem is that the state of the art in machine learning and data science demands an abundance of labeled samples, which require domain expert input.
It is not feasible to spend time and effort labeling training samples. It is more efficient to develop a new method that scales and requires a small number of labeled training samples.
Moreover, noise is a challenging variable, especially within imbalanced data. Hyperspectral imaging is one such kind of data, containing highly imbalanced classes. Multiclass classification using DCN suffers from the presence of noise. Therefore, this study proposes a method that can address these challenges using a deep learning-based image clustering model that combines both an adaptive dimensionality reduction approach and a robust feature augmentation approach, and which can cluster different types of imaging datasets with high positive predictive value. The main contribution of this paper is a new preprocessing approach to deal with noisy, highly imbalanced hyperspectral classification. In Section 2, we present a literature review. In Section 3, we explain our approach. In Section 4, we explain our experiments, while in Section 5, we compare our results. Finally, in Section 6, we present our conclusions and future lines of research.

Related Works
This section presents previous works and relevant literature in the areas of dimensionality reduction, feature augmentation, noise reduction, and hyperspectral image classification.

Dimensionality Reduction.
As big data and cloud computing become the standard for data storage, high-dimensional datasets are more and more commonplace. To process such large oceans of data, dimensionality reduction offers two options: feature projection and feature selection. Feature projection techniques transform data from a high-dimensional space to a new space with a lower dimensionality. Principal Component Analysis is one of the most popular linear transformations. In [24] the authors effectively conducted a dimension reduction by applying principal component analysis to a highly overlapped photo-thermal infrared imaging dataset. Feature selection techniques are an alternative that aims to choose the most information-rich features and discard irrelevant features and noise. The authors in [25,26] present different feature selection techniques to integrate spectral band selection and hyperspectral image classification in an adaptive fashion, with the ultimate goal of improving the analysis and interpretation of hyperspectral imaging.
Recent literature [27] proposes a Kronecker-decomposable component analysis model that combines dictionary learning and component analysis with great results on low-rank modeling. The Kronecker product is compatible with the most common matrix decompositions. Therefore, it can be used to learn low-rank dictionaries in tensor factorization. It can also effectively remove noise.
Principal Component Analysis [28] or PCA is a classical dimensionality reduction with multiple implementations. One intuitive implementation consists of six steps: standardization, covariance, eigenvalues, eigenvectors, reduction, and projection.
This formulation is based on maximizing variance within a low-dimensional projection.
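The six steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation from [28] or the one used in the experiments of this paper: the function name `pca_six_steps` and its arguments are hypothetical, rows are taken to be samples (pixels), and columns are taken to be features (spectral bands).

```python
import numpy as np

def pca_six_steps(x, n_components):
    """Hypothetical helper: variance-maximizing PCA in six explicit steps."""
    x = np.asarray(x, dtype=float)

    # 1. Standardization: zero mean and unit variance per feature.
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    z = (x - mean) / std

    # 2. Covariance matrix of the standardized features.
    cov = np.cov(z, rowvar=False)

    # 3-4. Eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 5. Reduction: keep the directions with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]

    # 6. Projection onto the retained principal components.
    return z @ components, eigvals[order]
```

The variance of each projected column equals the corresponding eigenvalue, which is the sense in which this formulation maximizes variance.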
There are other formulations that scale better to high dimensionality. One such solver implementation breaks PCA down into two easy-to-calculate subproblems: alternating least squares linear regressions [29], using an iterative algorithm based on the idea that the product of principal orthogonal components can be an approximation to the original data.
Despite the fact that PCA is among the most established techniques for dimensionality reduction, the story does not end here. There are many other techniques that show great empirical applications and theoretical guarantees. The authors in [30] introduced Forward Selection Component Analysis and obtained results comparable to PCA and Sparse PCA. In [31,32], anomaly and change detection was carried out with great success in hyperspectral imaging. Yet [33] suggests that PCA remains a powerful preprocessing step to denoise data. Similarly to numerous other noise reduction methods, including patents [34], PCA works under the assumption that the signal needs to be cleaned from the same global noise.

Image Classification.
Deep learning and big data science are the state of the art in image classification. From support vector machines to convolutional neural networks to spectral clustering, both academia and industry keep pushing for more innovative research. Collaborative and, in particular, interdisciplinary research is needed to bring these advances to other fields and transform innovations into applications. The authors in [35] and [36] bear witness to the benefits of incorporating diversity into research teams. With authors holding top degrees in civil engineering, computer science, and communications, alongside graduate and undergraduate authors, these teams show that in order to push the science forward we need the help of everyone.
There are many classic image segmentation algorithms, from simple thresholding to similarity-based clustering to connectedness and discontinuity-based detection.
Threshold-based image segmentation seeks to divide the scale range into background and a set of target foregrounds based on global or local information, for instance, maximizing their interclass variance, maximizing entropy, and/or applying fuzzy set theory. One big advantage of these simple methods is their low computational cost, which is evident in their fast operation.
This is mainly because thresholding does not take into account spatial information. One drawback is that in the presence of noise, results are not optimal. Similarity-based segmentation uses the idea of clustering based on certain aggregation in feature space. K-means clustering is one of the most well-known unsupervised algorithms. K-means groups together pixels based on their distance; hence, it is considered a distance-based partition method. Connectedness-based image segmentation is a region growing approach that links together points with similar features, creating homogeneous and smoothly connected segments. Discontinuity-based image segmentation seeks to detect object edges or high changes in intensity. Its motivation comes from the idea that there is always a discontinuity between different regions or segments. These discontinuities can be detected using derivatives. Prewitt, Sobel, and Laplacian operators are among the most popular differential operators for spatial domain edge detection, and they can be applied using convolution for image segmentation.
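To make the discontinuity-based idea concrete, the following sketch applies the Sobel operator by direct 2D convolution and thresholds the gradient magnitude. It is only an illustration: the function names `convolve2d_valid` and `sobel_edges` and the threshold value are hypothetical, and only the "valid" region (no border padding) is computed.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical derivatives.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def convolve2d_valid(image, kernel):
    """2D convolution restricted to positions where the kernel fits entirely."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * image[i:i + h - kh + 1, j:j + w - kw + 1]
    return out

def sobel_edges(image, threshold):
    """Return a boolean edge mask and the gradient magnitude."""
    image = np.asarray(image, dtype=float)
    gx = convolve2d_valid(image, SOBEL_X)
    gy = convolve2d_valid(image, SOBEL_Y)
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold, magnitude
```

On an image with a single vertical step, only the output columns whose 3-by-3 window straddles the step exceed the threshold.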
There are also emerging machine learning and deep learning approaches. The Support Vector Machine (SVM) is a machine learning algorithm that models classification tasks as optimization problems subject to inequality constraints. The original algorithm [1] was invented by Vapnik and Chervonenkis in 1963. SVM uses a dual Lagrangian, which depends only on labeled samples. The traditional SVM philosophy consists of finding the hyperplane that maximizes the margin between points of different classes. Note that the hyperplane is at the centre of the margin that separates the two classes. The kernel trick was introduced in [2] by Cortes in 1995. This hyperplane is denoted by the perpendicular vector w from the origin and it is characterized by (12). Introduce a new variable $y_i$ such that $y_i$ is positive (+1) for gray samples and negative (−1) for yellow samples. This optimization problem is solved using Lagrange multipliers (13). After applying the partial derivatives, it is evident that the solution depends only on the inner products of the support vectors $x_i$. Different kernel functions may be employed so that SVM can handle samples that are not linearly separable. Thus, SVM performs well on binary classification.
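For reference, the textbook hard-margin formulation that this description follows can be written out as below. This is the standard derivation and is offered as a sketch; its notation (the bias $b$, the multipliers $\alpha_i$) is an assumption and may differ from the equations numbered (12) and (13).

```latex
% Separating hyperplane and margin constraints
w \cdot x + b = 0, \qquad y_i \,(w \cdot x_i + b) \ge 1, \quad y_i \in \{+1, -1\}

% Primal problem: maximizing the margin 2 / \|w\|
\min_{w,\,b} \; \tfrac{1}{2}\,\|w\|^2
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1

% Lagrangian with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \tfrac{1}{2}\,\|w\|^2
  - \sum_i \alpha_i \left[ y_i \,(w \cdot x_i + b) - 1 \right]

% Setting the partial derivatives to zero
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0

% Dual problem: the data enter only through inner products x_i \cdot x_j
\max_{\alpha \ge 0} \; \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \,(x_i \cdot x_j)
```

Replacing the inner product $x_i \cdot x_j$ with a kernel function $k(x_i, x_j)$ is what allows the same dual problem to separate samples that are not linearly separable in the input space.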
A Deep Convolutional Neuronet (DCN) is a deep learning algorithm that models a classification task as a series of convolutional layers, pooling layers, dropout, and an activation layer usually consisting of a softmax function. CNN-based learning has recently achieved expert-level performance in various applications. In [37] the authors present a deep fully convolutional neural network for semantic pixel-wise segmentation. Evaluation of the decoder variants shows that accuracy increases for larger decoders for a given encoder network. Experimental results on road scenes and indoor scenes show that the proposed SegNet outperforms other segmentation benchmarks.
Some other applications of DCN-based segmentation are listed in [38,39] and [40]. In [38], the authors extended the original DeepLab with more speed, accuracy, and simplicity by compiling a comprehensive evaluation on benchmark and challenging datasets, such as PASCAL VOC 2012 and Cityscapes, among others. In [39] the authors present a new unsupervised image segmentation based on the centre of a local region. The authors validated their work on 2D and 3D medical images. MATLAB was used to implement the approach on X-rays and on abdominal and cardiovascular MRI images. In [40] the authors present an image segmentation approach that recasts the problem into a binary pairwise classification of pixels.
The high speed and accuracy of deep learning come at a price: subject matter expert (SME) labor for labeling. DCN-based approaches are supervised learning methods, and labeled samples are needed in abundance, which results in a high demand for SME input. Despite the shortcomings, multiple research initiatives are pushing the boundaries of noninvasive medicine, remote sensing, and natural language processing. Deep learning-based models stand at the core of these emerging applications.

Applications in Medical Image Processing.
The U-Net deep FCN structure is highly applicable for medical image segmentation. Multiple U-Net variants [41][42][43] and domain-specific models [44] have been applied to process medical images. For instance, [41] presents a U-Net variant for image segmentation on brain tumor MRI scans while [42] presents another U-Net variant based on nested and dense skip connections for medical image segmentation. Moreover, [43] introduces a robust self-adapting U-Net-based framework for medical image segmentation. And [44] adds the emerging attention mechanism to a nested U-Net architecture for image segmentation on liver CT scans. One interesting medical application of image segmentation using a deep learning model is presented in [45]. A new hybrid of the classic V-Net architecture is used to help detect kidney and renal tumors on CT imaging with successful performance of medical segmentation. This wealth of deep learning research branches out from the U-Net model and provides expert-level solutions to medical image segmentation.
Computational Intelligence and Neuroscience
Recently, one-shot learning models have been proposed to detect COVID-19 using medical images. Signoroni et al. [46] introduced a learning-based solution designed to assess the severity of COVID-19 disease by means of automated X-ray image processing, a domain-specific implementation of [42]. Furthermore, [47] compiles an early survey of medical imaging research toward COVID-19 detection, diagnosis, and follow-up. One of their findings is the proliferation of AI-empowered applications which use X-rays and/or CT scans to provide partial information about patients with COVID-19.
This reinforces the sense that deep learning-based solutions are widely used in medical image processing.
Tensor-based learning has also been incorporated into medical image processing and hyperspectral imaging. An et al. [48] presented a tensor-based low-rank decomposition model for hyperspectral images and evaluated its classification accuracy on hyperspectral cubes. Moreover, the authors in [49] proposed another tensor-based representation to better preserve the spatial and spectral information and capture the local and global structures of hyperspectral images. Yet these models do not focus on imbalanced datasets, nor do they try to solve the denoising problem. Recently, in the field of optical coherence tomography (OCT), [50] introduced a tensor-based learning model that tackles the denoising problem on high-resolution OCT medical images with great results. However, it is unclear how well tensor-based models would represent the structure of imbalanced datasets, and they will remain outside the scope of our work.

Applications in Natural Language Processing.
Natural language processing (NLP) is a field with multiple machine-learning (ML) and deep-learning (DL) based research initiatives. With sentiment analysis as a fundamental task of NLP, researchers have proposed several domain-specific applications of ML- and DL-based frameworks. The main challenge encountered in machine-learning-based sentiment classification is the unmanageable amount of data. To address this challenge, [51] presents an ensemble learning (EL) approach for feature selection, which successfully aggregates several different feature selection results, so that we can obtain a more robust and efficient feature subset. Moreover, [52] also explores the predictive performance of different feature engineering schemes, four supervised ML-based algorithms, and three EL-based methods, obtaining experimental results that yield higher predictive performance compared to the individual feature sets. Furthermore, in [53], the author presents yet another comprehensive analysis, this time of keyword extraction approaches, with empirical results that indicate an enhanced predictive performance and scalability of keyword-based representation of text documents in conjunction with EL-based models.
Sentiment analysis is a critical task of extracting subjective information from online text documents, mainly based on feature engineering to build efficient sentiment classifiers. To improve the feature selection process, [54] proposes and validates the effectiveness of a hybrid ensemble pruning scheme based on clustering and randomized search for text sentiment classification. Sentiment analysis can be reduced to a text classification problem. However, the text classification problem suffers from the curse of high dimensional feature space and feature sparsity problems. To mitigate and lift this curse, [55] explores several classification algorithms and EL-based methods on different datasets.
To recognize sentiment in information-rich but unstructured text, [56] presents a DL-based approach to sentiment analysis on product reviews that outperforms the compared methods. Since Twitter can serve as an essential source for several applications, including event detection, news recommendation, and crisis management, in [57], the author presents a DL-based scheme for sentiment analysis on Twitter messages with consistent and encouraging results.
ML- and DL-based models are at the core of NLP research. For instance, Onan [58] indicated that DL-based methods outperform EL-based methods and supervised ML-based methods for the task of sentiment analysis on educational data mining. And the list does not stop here. Onan [59] indicated that topic-enriched word embedding schemes utilized in conjunction with conventional feature sets can yield promising results for sarcasm identification. Onan [60] presented the first usage of supervised clustering to obtain a diverse ensemble for text classification and compared it to ML- and DL-based models. Onan and Toçoglu [61] employed a three-layer stacked bidirectional long short-term memory architecture to identify sarcastic text documents with promising classification accuracy results. Onan [62] presented an extensive comparative analysis of different feature engineering schemes and five different ML-based learners in conjunction with EL-based methods.

Methodology
The main objective of our proposed approach is to optimize the performance of DCN on hyperspectral images. We developed a context-based feature augmentation approach to provide resistance against noise to deep learning classification of highly imbalanced hyperspectral images. The classification apparatus used in this study relies on a deep convolutional neuronet (DCN) to perform multiclass classification based on findings in [63]. The input to this network is a highly imbalanced hyperspectral image or cube. Figure 1 shows a hyperspectral cube. Figure 2 shows a 1-by-1 column along the spectral dimension.
Our proposed approach will be a preprocessing module in this classification apparatus as shown in Figure 3. Our four-step approach is introduced as follows. Full details are presented in Sections 3.1 through 3.4.
(i) Local gradients are feature vectors of differences, defined in Section 3.1. In this step, we calculate these feature vectors for each pixel p in the hyperspectral cube, as differences between the pivotal pixel p and its surrounding pixels in a 3-by-3-by-3 local neighborhood. This set of differences will constitute the local gradients of p.
(ii) Reference clusters are feature vectors of high and low thresholds, defined in Section 3.2. In this step, we calculate these feature vectors for each pixel p in the hyperspectral cube, as statistical thresholds of the surrounding 9-by-9 reference neighborhood. This set of thresholds will constitute the reference clusters of p.
(iii) Prototype contexts are feature vectors of similarity, defined in Section 3.3. In this step, we calculate these feature vectors for each pixel p in the hyperspectral cube, as the degree of membership of the local gradients to the reference clusters. This set of similarity degrees will constitute the prototype contexts of p.
(iv) Concatenated features are all feature vectors, defined in Section 3.4. In this step, we concatenate local gradients, reference clusters, and prototype contexts into one context-based feature vector for each pixel p in the hyperspectral cube.

Calculate Local Gradients.
The first step of our approach is to calculate the local gradients [64]. Figure 4 shows a pivotal pixel p(1, 1, 1) in its 3-by-3-by-3 local neighborhood.
It is important to note that this moving cubic-shaped local neighborhood only uses partial data around the borders of the hyperspectral image. Thus the indexes i, j, k will only run from 1 to the dimension length −1 for each dimension x, y, z.
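A minimal NumPy sketch of this step is given below, under assumptions that go beyond the text. The exact definition in [64] is not reproduced here, so the sketch simply takes the difference between each neighbor and the pivotal pixel. The names `neighbor_offsets` and `local_gradients` are hypothetical, and reducing the 26 neighbors to 13 directions by keeping one offset from each opposite pair is an assumption made to match the 13 discrete directions used in the later steps.

```python
import itertools
import numpy as np

def neighbor_offsets(half=True):
    """Offsets of the 3x3x3 neighborhood around a pivotal pixel (centre excluded)."""
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    if half:
        # Assumption: keep one offset from each antipodal pair -> 13 directions.
        offsets = [o for o in offsets if o > (0, 0, 0)]
    return offsets

def local_gradients(cube, half=True):
    """Differences between each interior pixel and its neighbors.

    Border pixels are skipped, so the output has shape
    (X - 2, Y - 2, Z - 2, number_of_directions).
    """
    cube = np.asarray(cube, dtype=float)
    nx, ny, nz = cube.shape
    core = cube[1:-1, 1:-1, 1:-1]
    grads = []
    for dx, dy, dz in neighbor_offsets(half):
        shifted = cube[1 + dx:nx - 1 + dx, 1 + dy:ny - 1 + dy, 1 + dz:nz - 1 + dz]
        grads.append(shifted - core)
    return np.stack(grads, axis=-1)
```

Passing `half=False` returns all 26 neighbor differences instead.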

Calculate Reference Clusters.
The second step of our approach is to calculate the reference clusters [64]. Figure 5 shows a pivotal pixel p(5, 5, 5) in its 9-by-9 reference neighborhood. The reference clusters ζ are the sets of high and low thresholds $\{hi_1, hi_2, hi_3, \ldots, hi_{13}\}$ and $\{lo_1, lo_2, lo_3, \ldots, lo_{13}\}$, where $hi_i$ is the central value of the high-valued gradients and $lo_i$ is the central value of the low-valued gradients within p's reference neighbors for each discrete direction i. We calculate these central values using the mean $\mu$ and variance $\sigma^2$ equations presented in (1) and (2) to set $hi = \mu + 2\sigma$ and $lo = \mu - 2\sigma$. Such reference clusters are calculated for each pixel $p_{i,j,k}$ within the hyperspectral cube. It is important to note that this moving square-shaped reference neighborhood only uses partial data around the borders of the hyperspectral image. Thus the indexes i, j will only run from 5 to the dimension length −5 for each spatial dimension. It will, however, use all the spectral bands on the z dimension.
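The thresholds hi = μ + 2σ and lo = μ − 2σ over a moving 9-by-9 window can be sketched as follows. This is an illustration under assumptions: the function name `reference_clusters` is hypothetical, the input is assumed to be an array of gradients with shape (rows, columns, bands, directions), and the population standard deviation is used because equation (2) is not reproduced here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def reference_clusters(grads, window=9, k=2.0):
    """High and low thresholds per pixel and direction.

    grads: array of shape (rows, cols, bands, directions).
    Returns (hi, lo), each of shape
    (rows - window + 1, cols - window + 1, bands, directions),
    because border pixels without a full window are skipped.
    """
    grads = np.asarray(grads, dtype=float)
    # View of every window x window spatial patch; the patch axes come last.
    win = sliding_window_view(grads, (window, window), axis=(0, 1))
    mu = win.mean(axis=(-2, -1))
    sigma = win.std(axis=(-2, -1))  # assumption: population standard deviation
    return mu + k * sigma, mu - k * sigma
```

`sliding_window_view` requires NumPy 1.20 or later. It returns a view, so no patch is copied until the reductions run.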

Construct Prototype Contexts.
The third step of our approach is to construct the prototype contexts. The prototype contexts κ are the set of similarity features $\{c_1, c_2, c_3, \ldots, c_{13}\}$, where $c_i$ is the prototype context with the highest degree of membership for each discrete direction i. We calculate this degree of membership M with the equations presented in (3)-(6), where $D^2$ is the square of the Mahalanobis distance, χ is the vector of local gradients, κ is the vector of prototype contexts, W is the inverse pooled covariance matrix, and the K factor is equal to the square root of the product between the highest value in χ and the highest value in κ. Such prototype contexts are calculated for each pixel $p_{i,j,k}$ within the hyperspectral cube.
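The ingredients named above can be sketched in NumPy. This sketch covers only the squared Mahalanobis distance, the pooled covariance, and the K factor as described in the text. The membership function M of (3)-(6) is not reproduced here and is deliberately left out. All function names are hypothetical, and the pooled covariance uses the usual two-group formula, which is an assumption about how W is obtained.

```python
import numpy as np

def pooled_covariance(a, b):
    """Pooled covariance of two groups of row-vector samples (standard formula)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    s_a = np.cov(a, rowvar=False)
    s_b = np.cov(b, rowvar=False)
    return ((n_a - 1) * s_a + (n_b - 1) * s_b) / (n_a + n_b - 2)

def mahalanobis_sq(chi, kappa, w):
    """Squared Mahalanobis distance D^2 = (chi - kappa)^T W (chi - kappa).

    w is the inverse pooled covariance matrix.
    """
    d = np.asarray(chi, dtype=float) - np.asarray(kappa, dtype=float)
    return float(d @ w @ d)

def k_factor(chi, kappa):
    """Square root of the product of the highest values in chi and kappa."""
    return float(np.sqrt(np.max(chi) * np.max(kappa)))
```

A typical use would compute `w = np.linalg.pinv(pooled_covariance(group_a, group_b))` once per neighborhood and then call `mahalanobis_sq` for each candidate context. With `w` set to the identity matrix, the distance reduces to the squared Euclidean distance.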

Concatenated Augmented Features.
The fourth step of our approach is to concatenate all feature vectors. These feature vectors consist of the local gradients, reference clusters, and prototype contexts. Such context-based feature vectors are concatenated for each pixel $p_{i,j,k}$ within the hyperspectral cube. Figure 6 shows how our context-based approach integrates into a deep learning classification model. Note that to evaluate the robustness of our approach, we added synthetic noise to the original datasets. This noise was generated using a Gaussian equation. Classification accuracy was used as the main measurement to compare the performance of the models and, in particular, the resistance to noise in imbalanced hyperspectral images. Details are presented in the following section.

Experiments
In this section, we describe the datasets, dataset partition policy, and experimental settings. Multiple settings are designed to evaluate the performance of our approach on noisy and clean data, as well as on imbalanced and balanced data.

Datasets.
Four datasets were used in our experiments. The first two are the Pavia Centre and Pavia University datasets.
These two datasets were acquired by the ROSIS sensor during a flight campaign over Pavia, Italy. The original Pavia Centre dataset is a hyperspectral cube with a spatial resolution of 1096 × 715 pixels and 102 spectral bands, and the original Pavia University dataset is a hyperspectral cube with a spatial resolution of 610 × 340 pixels and 103 spectral bands. The corresponding ground truths differentiate nine classes. For more details, please visit the following link, which was last accessed on February 1, 2021 (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_Centre_and_University).
It is important to note that the Pavia Centre data are considered a balanced hyperspectral cube, whereas the Pavia University data are considered an imbalanced hyperspectral cube. It is clear from Figure 7 that the Pavia Centre samples are evenly distributed between classes. But, in Figure 8, the majority of Pavia University samples belong to one single class, namely the class Meadows.
Thus, this predominant class dwarfs minority classes, such as Shadows, Bitumen, and Painted Metal Sheets.
This disparity is what makes Pavia University data imbalanced.
To evaluate the robustness of our approach, we added synthetic noise to the original "clean" datasets and produced two additional synthetic datasets. Thus, together with the two clean datasets, two noisy datasets were used in our experiments, corresponding to the noisy Pavia Centre and the noisy Pavia University datasets. Identically to their clean counterparts, the noisy Pavia Centre dataset is a hyperspectral cube with a spatial resolution of 1096 × 715 pixels, 102 spectral bands, and 9 distinct classes, and the noisy Pavia University dataset is a hyperspectral cube with a spatial resolution of 610 × 340 pixels, 103 spectral bands, and 9 distinct classes.
To produce these noisy datasets, an intermittent irregular noise was incorporated. Equations (7)-(9) were used to generate a noise signal corresponding to a signal-to-noise value of $SNR_{dB} = 120$. In (7), G and F are random variables and N follows a Gaussian distribution with a probability density function presented in (8). Similarly to [65], this weighted random noise will follow a Gaussian normal distribution N(μ, σ), where the mean μ is zero and the variance σ is determined from the signal-to-noise ratio ($SNR_{dB}$) formula presented in (9).
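As a rough illustration of how a zero-mean Gaussian noise level follows from a target SNR in decibels, the sketch below uses the common definition SNR_dB = 10 log10(P_signal / P_noise). It is a simplified stand-in, not equations (7)-(9): the weighting by the random variables G and F that makes the noise intermittent is not reproduced, and the function name `add_gaussian_noise` is hypothetical.

```python
import numpy as np

def add_gaussian_noise(cube, snr_db, rng=None):
    """Add zero-mean white Gaussian noise at a target SNR (in dB).

    Assumes SNR_dB = 10 * log10(signal_power / noise_power), with the
    signal power taken as the mean squared value of the whole cube.
    """
    rng = np.random.default_rng() if rng is None else rng
    cube = np.asarray(cube, dtype=float)
    signal_power = np.mean(cube ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=cube.shape)
    return cube + noise
```

Under this definition a larger `snr_db` gives weaker noise, so the value passed in determines how strongly the clean cube is perturbed.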

Dataset Partition Policy.
Datasets were divided into training and testing sets; 80% of the data was used during the training (a.k.a. model-fitting) phase while the remaining 20% of the data was used for the testing (a.k.a. model-prediction) phase. One-fourth of the training set was used as a validation set during the fitting phase, which yields an overall split of 60% training, 20% validation, and 20% testing. Figure 9 shows the full partition schema.
To rank our context-based DCN approach, two additional models are implemented: (i) a baseline deep learning approach, namely, DCN, and (ii) a benchmark approach, that is, PCA + DCN. Classification metrics are used to evaluate and compare the performance and effectiveness of our approach.
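The partition policy can be sketched as a random split of sample indices. This is a minimal illustration with hypothetical names (`partition_indices`, `seed`). It shuffles uniformly at random and does not stratify by class, which is an assumption, since the text does not say whether the split is stratified.

```python
import numpy as np

def partition_indices(n_samples, test_frac=0.2, val_frac_of_train=0.25, seed=0):
    """Split sample indices into train / validation / test.

    20% is held out for testing; one-fourth of the remaining 80% becomes
    the validation set, giving 60% / 20% / 20% overall.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round((n_samples - n_test) * val_frac_of_train))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test
```

For an imbalanced cube such as Pavia University, a stratified variant that splits each class separately would keep minority classes represented in all three parts.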

Baseline Experiments.
As a baseline, we observe the performance of a deep learning model without any preprocessing on the different hyperspectral datasets. Four types of experiments are included in this section. First, we work on clean data, running individual experiments for balanced and imbalanced datasets. Then, we focus on noisy data, and again we run individual experiments for balanced and imbalanced datasets. A Deep Convolutional Neuronet (DCN) was used as a baseline to perform the classification. We used a DCN which consists of three types of layers, namely, input layer, hidden convolutional layer(s), and output layer. In Figure 10, the input dataset is shown as a cube. Similarly to [40], the hidden convolutional layers are shown as flat squares, the max-pooling layers in a whiter color, and the dropout layer in pale. Straight lines are used to depict fully connected layers or dense layers. Finally, for multiclass classification, the activation function is based on a softmax function.
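For readers unfamiliar with the final activation, a numerically stable softmax can be sketched as below. This shows the output stage only, not the DCN itself, and the function name is chosen for this illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Map raw class scores to probabilities that sum to one along `axis`."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)
```

The predicted class for each pixel is then the index of the largest probability, for example `softmax(scores).argmax(axis=-1)`.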
During the model-fitting phase, we run for 20 epochs. At this point, the network achieves stability without running into overfitting. DCN used the two original datasets and the two noisy datasets. The results of our fitting phase are presented in Figures 11 to 14. The average classification accuracy on clean test data was 86.1 ± 3.9 percent, whereas on noisy data it was 66.9 ± 2.9 percent. These results suggest an adverse effect of noise on our basic model.

Benchmark Experiments.
As a benchmark comparison, we observe the performance of a deep learning model with noise reduction model as a preprocessing on the different hyperspectral datasets. Similarly, to the previous section, this section presents four types of experiments. First, we work on clean data, running individual experiments for balanced and imbalanced datasets. en, we focus on noisy data, and again we run individual experiments for balanced and imbalanced datasets.
Principal Component Analysis (PCA) together with DCN was used as a benchmark to perform the classification. Ten principal components are sufficient to represent 99% of the variability of the data. Figure 15 shows the scree curves for both the Pavia Centre dataset in Figure 15(a) and the Pavia University dataset in Figure 15(b).
As suggested by the scree curves, PCA + DCN was implemented using only the first ten principal components. Twenty epochs were used during the model-fitting (training) phase. In our experimental runs, the dataset partition policy was kept the same, and both the original datasets and the noisy datasets were randomly split into training, validation, and testing sets.
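The choice of the number of components from a scree curve can be sketched as follows. The idea is to keep the smallest number of leading components whose cumulative share of the variance reaches a target such as 99%. The eigenvalues below are hypothetical and are not the Pavia values, and the function name `components_for_variance` is introduced here only for illustration.

```python
def components_for_variance(eigenvalues, target=0.99):
    """Return the smallest k whose k largest eigenvalues explain `target` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)

# Hypothetical eigenvalues of a band covariance matrix (illustration only).
eigenvalues = [60.0, 25.0, 10.0, 3.0, 1.2, 0.5, 0.2, 0.1]
k = components_for_variance(eigenvalues, target=0.99)
```

For these made-up eigenvalues the first four components explain 98% of the variance and the first five explain 99.2%, so the function selects five components.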
The results of our fitting phase are presented in Figures 16 to 19. The average classification accuracy on clean test data was 84.1 ± 6.1 percent, whereas on noisy data it was 37.3 ± 4.7 percent. Compared to the results for vanilla DCN, these results strongly suggest an adverse effect of noise on the principal component-based model. Another important point is that, during training of PCA + DCN on noisy data, the model suffered from overfitting after 4 epochs, as shown in Figure 18.

This dataset is considered balanced because each class has roughly the same number of samples.

Pavia University classes: Asphalt, Bare soil, Bitumen, Gravel, Meadows, Painted metal sheets, Self-blocking bricks, Shadows, Trees.

Figure 9: Partition policy: datasets are divided into 3 parts (20%, 20%, and 60%). The training task uses 60% of the samples. The validation task uses 20%. The testing task uses the remaining 20%.
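The 60/20/20 partition policy of Figure 9 can be sketched as a random split of sample indices. This is a minimal illustration, not the code used in our experiments: the function name `split_indices`, the fixed seed, and the sample count are assumptions.

```python
import random

def split_indices(n_samples, train=0.6, val=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]  # the remaining samples (about 20%)
    return train_idx, val_idx, test_idx

# Hypothetical dataset of 1000 labelled pixels.
train_idx, val_idx, test_idx = split_indices(1000)
```

Because the split is purely random, the class proportions of an imbalanced dataset are only approximately preserved in each part.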

Enhanced Experiments.
We integrate our context-based feature augmentation module as a preprocessing step to the deep learning model. We observe the performance of the context-based deep learning model on the original highly imbalanced hyperspectral dataset. Then, we observe the performance of our enhanced model in the presence of noise. We also run our context-based DCN for 20 epochs using the two original datasets and the two noisy datasets. All context-based features were used to achieve better noise resistance. The results of the model-fitting phase are presented in Figures 20 to 23. The average classification accuracy on clean test data was 87.5 ± 3.4 percent, whereas on noisy data it was 85.0 ± 4.2 percent. Compared to previous results, these percentages suggest that our proposed approach exhibits a high level of accuracy on clean data and robustness against noise on both the Pavia University and the Pavia Centre datasets. Tables 3 and 4 present the classification results on the synthetic "noisy datasets", Pavia Centre with noise and Pavia University with noise, respectively.

Results and Discussion
Our experimental results suggest that all models suffer in the presence of noise, but the negative impact of noise can be mitigated with our proposed context-based approach. Tables 3 and 4 present the precision, recall, F1-score, and overall accuracy scores for DCN, PCA + DCN, and our context-based DCN. Table 3 focuses on the noisy Pavia Centre dataset, while Table 4 focuses on the noisy Pavia University dataset. In both tables, we can observe that our proposed model achieves better results.
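For reference, the scores reported in Tables 3 and 4 follow the standard definitions, which can be sketched as below. The labels are a tiny hypothetical example, not data from our experiments, and the helper names are introduced only for illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced example: six "Meadows" pixels and two "Shadows" pixels.
y_true = ["Meadows"] * 6 + ["Shadows"] * 2
y_pred = ["Meadows"] * 5 + ["Shadows"] + ["Shadows", "Meadows"]

accuracy = overall_accuracy(y_true, y_pred)
shadows = per_class_metrics(y_true, y_pred, "Shadows")
meadows = per_class_metrics(y_true, y_pred, "Meadows")
```

In this toy case the overall accuracy is 0.75 while the F1-score of the minority class is only 0.5, which is why per-class scores are reported alongside overall accuracy for imbalanced data.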

Tabular Summary and Analysis.
Comprehensive summary tables are presented as follows. A total of three approaches were analyzed: a basic DCN with no preprocessing, a PCA + DCN, and a context-based DCN. They are listed in different rows. Four datasets were used: two without noise, referenced as "clean data", and the same ones with random noise, referenced as "noisy data". Imbalanced datasets are listed in the shaded columns of the tables. The values in each cell represent overall classification accuracy. Table 5 summarizes the overall accuracy of each model during the fitting/learning phase, whereas Table 6 summarizes the overall accuracy of each model during the testing/prediction phase.
It is important to note that during training on labeled samples as well as during testing on new samples, our proposed context-based DCN outperformed both DCN and PCA + DCN, especially in the presence of random noise. PCA + DCN did not perform well for noisy cases because it was not able to remove our synthetic noise signal, which was not just random but also intermittent and irregular.
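To illustrate what an intermittent and irregular noise signal might look like, the sketch below corrupts short runs of consecutive bands in a single spectrum. It is a purely hypothetical generator and should not be read as the procedure used to build our noisy datasets: the function name and the parameters `burst_prob`, `burst_len`, and `sigma` are all assumptions.

```python
import random

def add_intermittent_noise(spectrum, burst_prob=0.1, burst_len=5, sigma=0.2, seed=0):
    """Add Gaussian noise to randomly placed short runs of bands (hypothetical scheme)."""
    rng = random.Random(seed)
    noisy = list(spectrum)
    band = 0
    while band < len(noisy):
        if rng.random() < burst_prob:
            # Start a burst: corrupt up to `burst_len` consecutive bands.
            for b in range(band, min(band + burst_len, len(noisy))):
                noisy[b] += rng.gauss(0.0, sigma)
            band += burst_len
        else:
            # Leave this band untouched and move on.
            band += 1
    return noisy

# Hypothetical flat spectrum with 100 bands.
clean = [1.0] * 100
noisy = add_intermittent_noise(clean)
```

Noise of this kind affects only some bands of some pixels, so it is not concentrated in the low-variance components that PCA discards.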

Conclusions
Hyperspectral imaging is an area of active research. Deep learning-based approaches to classification are the current state-of-the-art. However, our experimental results showed that in the presence of noisy hyperspectral datasets, these expert-level models underperform. To address this shortcoming, this paper presented a context-based feature augmentation approach to increase noise resistance in highly imbalanced hyperspectral classification. On noisy datasets, our robust approach outperformed a basic deep learning model and outclassed the combined PCA + DCN approach. In addition, on highly imbalanced noisy data, our context-based DCN approach suffered only a small loss in classification accuracy (less than 10%), whereas DCN and PCA + DCN suffered alarming cuts of 25% and 50% in classification accuracy, respectively.
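As a rough cross-check, the accuracy loss of each model can be recomputed from the average test accuracies reported in the experimental sections. The sketch below uses only those reported averages. Because they are averaged over all datasets, the resulting figures need not coincide with the percentages quoted above for highly imbalanced noisy data.

```python
# Average test accuracy (percent) on clean and noisy data, as reported above.
reported = {
    "DCN": (86.1, 66.9),
    "PCA + DCN": (84.1, 37.3),
    "Context-based DCN": (87.5, 85.0),
}

def accuracy_drop(clean, noisy):
    """Return the loss in percentage points and the loss relative to clean accuracy."""
    absolute = clean - noisy
    relative = 100.0 * absolute / clean
    return absolute, relative

drops = {name: accuracy_drop(c, n) for name, (c, n) in reported.items()}
for name, (absolute, relative) in drops.items():
    print(f"{name}: -{absolute:.1f} points ({relative:.1f}% relative)")
```

On these averages, DCN loses about 19 percentage points, PCA + DCN about 47, and the context-based DCN 2.5.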
Future lines of research should focus on applying our context-based approach to other noisy datasets in areas such as MRI and other highly imbalanced 3D medical images.

Data Availability
The datasets used to support the findings of this study are available at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.

Conflicts of Interest
The authors declare that they have no conflicts of interest.