Multiscale Dense Cross-Attention Mechanism with Covariance Pooling for Hyperspectral Image Scene Classification



Introduction
Hyperspectral remote sensing technology can generate a set of images with rich information. It can significantly improve the quality of data analysis, in particular its detail, reliability, and credibility. In general, because different categories of ground features exhibit different optical behavior and show obvious differences in the band space, feature types can be identified and classified at the pixel level. This technology is now used in agriculture [1][2][3], environmental remote sensing [4][5][6], physics [7,8], oceanography [9][10][11], and other fields, among which the research on hyperspectral image (HSI) classification is extremely important. The internal information of an image contains various types of features. It is precisely because these features take different forms inside an image that they can be extracted for analysis. The process of assigning targets to different categories is called image classification. HSI classification makes it possible to identify different ground-feature targets from the acquired remote sensing information. In addition, the inherent laws of the ground features can be analyzed during the classification process. Because HSI acquisition costs are relatively high, the number of training samples is often insufficient. Further, the image itself has a relatively high spectral dimension and a relatively large amount of data, which increases its complexity.
In early research, classification based on spectral information alone was a popular procedure [12][13][14]. Feature selection and dimensionality reduction methods [15,16] were often used to cope with the high dimensionality of the spectral domain. As research on this topic advanced, the complex spatial and spectral feature distribution of HSIs became the main problem in their classification. Many researchers have chosen to add local spatial connections to improve the models [16,17] and have achieved positive results to a certain extent. However, these methods are primarily based on handcrafted and shallow models, which rely heavily on expert knowledge and have poor generalization ability, making it difficult to extract representative discriminative features.
In recent years, deep learning models have been fruitful and have emerged as a new research hotspot among the different methods used for HSI classification. Although deep learning methods can obtain deep-level feature representations from the original HSIs, detection and recognition usually rely on many training samples. However, it is time-consuming and labor-intensive to label the data used for HSI classification, so the annotated data are very limited. In view of these problems, maintaining the performance of deep learning methods with fewer training samples is still a significant challenge. In 2014, Chen et al. [18] proposed a deep learning framework that combines spatial and spectral features. The framework combines principal component analysis (PCA) with a deep learning architecture that employs stacked autoencoders (SAE) to obtain deep features and uses logistic regression (LR) for classification (SAE-LR). Although the SAE-LR deep learning method has shown great potential in HSI classification, the autoencoder model usually flattens the local image patches into vectors and then feeds them into the model. This destroys the image's two-dimensional structure, resulting in the loss of spatial information, and its training time is also unsatisfactory. In 2015, Makantasis et al. [19] used convolutional neural networks (CNNs) to develop a deep supervised HSI classification algorithm. Their algorithm uses randomized PCA to reduce the original input data's dimensionality, a CNN to extract the deep-level features, and a multilayer perceptron (MLP) for classification. In 2016, Zhao and Du proposed the spectral-spatial feature classification (SSFC) method. The framework uses a balanced local discriminant embedding (BLDE) algorithm to extract the spectral features and classifies them with multifeature classifiers.
In addition, three-dimensional spectral-spatial feature extraction methods have also been developed to discover useful but difficult-to-extract information from HSIs. Tsai et al. [20] proposed a three-dimensional gray-level cooccurrence feature extraction method for HSI classification based on the traditional two-dimensional morphological operators, and Weeks et al. [21] proposed a three-dimensional discrete wavelet transform, which applies a series of one-dimensional wavelet transforms along the three directions of the HSI for processing and classification. Feature extraction from HSIs is important for improving classification performance. Deep learning models provide a new solution to the problems in HSI classification. However, applying deep learning to such a classification task also faces some difficulties.
(1) The high-dimensional nature of HSIs makes them different from ordinary natural images. HSI data have a three-dimensional structure in which the spectral information is very rich and the spatial information is relatively limited. The first step in applying deep learning to HSI classification is therefore to design a network structure suited to the HSI structure, one that can effectively use both the rich spectral information and the limited spatial information. (2) In HSI data, there is often a serious imbalance in the number of samples among classes. If weight constraints are not imposed on the samples of the various classes, the trained network's robustness is low.
Based on the above limitations, a novel multiscale dense cross-attention mechanism algorithm with covariance pooling (MDCA-CP) is proposed in this work for HSI scene classification. Multiscale convolution can detect the subtle changes between pixels in local areas of the spatial and spectral dimensions of HSIs, and it can be applied to feature extraction from hyperspectral data of complex and diverse types and structures. Traditional algorithms assign attention weights in only one direction, resulting in the loss of feature information. The dense cross-attention mechanism proposed in this work distributes the attention weights horizontally and vertically at the same time to efficiently extract the most representative features of the hyperspectral data. In addition, covariance pooling is used in this study to further extract the second-order features of HSIs. Experiments have been conducted on three well-known hyperspectral datasets, and the results demonstrate the effectiveness of the MDCA-CP algorithm.
The main contributions of this work are as follows: (1) A novel dense cross-attention (horizontal and vertical) mechanism algorithm is proposed, together with weight addition and maximum weight strategies, which can mine more representative features from the hyperspectral data. (2) A novel attention-guided covariance pooling method is proposed to make full use of the second-order information in the hyperspectral feature map. It allows the neural network to learn more representative features when dealing with remote sensing scene classification problems. The rest of the paper is organized as follows: Section 2 briefly describes related work. Section 3 introduces the principle and implementation details of the MDCA-CP algorithm. Section 4 presents the experimental results and visual analysis. Section 5 summarizes the conclusions of this research.

Related Work
In recent years, deep learning has been mainly based on multilayer neural networks, which approximate nonlinear functions arbitrarily closely by using three or more layers to learn large amounts of abstract feature information from the image. Because it is considerably difficult to obtain labels for HSIs, using deep learning methods to achieve better classification results with a small number of training samples has always been a focus of research in hyperspectral remote sensing. The classic neural network models based on deep learning include the autoencoder (AE) [22], the stacked autoencoder (SAE) [23], and the restricted Boltzmann machine (RBM) [24,25]. Although these methods have relatively good classification performance, they have many model parameters and extract only shallow image features, so scholars have designed deep neural network models based on CNNs and other architectures. These methods are combined and applied to the classification of HSIs and provide superior feature extraction and high classification performance.
Chen et al. [18] proposed a deep feature extraction and HSI classification method based on CNNs. The CNN is used for extracting deep features from the HSIs that are nonlinear, discriminative, and invariant. A combination of the CNN and a regularization method is used for extracting the spatial-spectral features of HSIs for classification. Xu et al. [26] proposed a multisource remote sensing data classification method based on a CNN that mainly uses two channels. The CNN extracts the spatial-spectral joint features of the HSI, fuses them with the features of other sources of remote sensing data, and classifies them. Luo et al. [27] proposed a new HSI classification method based on a CNN in which the spatial-spectral features of the target pixel and its neighboring pixels are extracted. Convolution operations are performed on them, and the convolution results are stacked into a two-dimensional matrix that serves as the input of a standard CNN. Finally, the XGBoost classification model, proposed by Wang et al. [28], is an HSI classification method that combines the random forest technique with a CNN. It regards the CNN as an individual classifier that is used for extracting the discriminative features of HSIs. The random forest technique randomly selects the extracted features and training samples to establish a multiclassifier system for the classification task. The abovementioned CNN-based models are all constructed with standard convolutional layers and applied to HSI classification. To retain the advantages of the standard convolutional layer while saving a considerable amount of computing resources, some scholars have introduced the attention mechanism into the classification models. Mei et al. [29] proposed a spectral-spatial attention network for HSI classification.
This method can learn the internal spectral correlation in the continuous spectrum, and the attention model focuses on the neighboring pixels in the spatial dimension. Experimental results show that this method can make full use of spectral information. In order to extract the spectral and spatial features, Ma et al. [30] proposed a dual-branch multiattention mechanism network (DBMA) for HSI classification. The network has two branches that extract the spectral and spatial features, respectively, reducing the interference between the two kinds of features. In addition, because the two branches have different characteristics, a different type of attention mechanism is applied in each branch, which ensures more distinctive spectral and spatial features. Excellent results have been achieved using this method. These studies show that the attention mechanism [31,32] is extremely effective in hyperspectral data classification tasks. Figure 1 shows a schematic of the overall architecture of our MDCA-CP algorithm. First, locality preserving projections (LPPs) are used to reduce the dimensionality of the HSI, eliminating redundant information and reducing computational costs. Second, the reduced-dimensional features are fed into our MDCA module, and the second-order features are extracted through CP. Finally, the SoftMax function is used to predict the class of each pixel, which yields the classification of the hyperspectral data.

Locality Preserving Projections.
The spectrum of each pixel of an HSI is a mixture of multiple endmember spectra. When the HSI is classified, the high-dimensional data can be projected onto a low-dimensional subspace in which each pixel represents a single endmember, that is, a characteristic of one ground feature. The locality preserving dimensionality reduction method used in this study finds a linear map $M$ under which adjacent pixels of the original HSI remain relatively close in the projection space, thereby retaining the relevant information in the local neighborhood and protecting the diversified local structure of the original HSI. Assume that the training samples of the original data are given together with their class labels, the number of sample categories, and the total number of training samples, and let $x_i$ and $x_j$ be two training samples. The affinity between $x_i$ and $x_j$ is expressed by equation (1), in which the scale term represents the local scale of the sample $x_i$ and $x_i^{(m)}$ represents the $m$th adjacent sample of the pixel $x_i$. The local between-class and within-class scatter matrices $L^{(lb)}$ and $L^{(lw)}$ used in the locality preserving dimensionality reduction are defined in equation (2).

Mobile Information Systems
Both $L^{(lb)}$ and $L^{(lw)}$ are $n \times n$ square matrices.
Equation (1) provides the weights of adjacent pixels of the same class, so nonadjacent pixels of the same class hardly affect the scatter matrices $L^{(lb)}$ and $L^{(lw)}$. Based on these local scatter matrices, the Fisher ratio is maximized, and the transformation matrix $T$ is obtained from the solution of this maximization. Via the transformation matrix $T$, LPPs bring adjacent data of the same class closer together and effectively separate adjacent data of different classes while retaining the data's local characteristics. Preprocessing the original HSI with this locality preserving dimensionality reduction eliminates redundant information, makes the distribution of samples of the same class more compact, reduces noise, and improves the classification accuracy.
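As an illustration of the "local scale" idea described above, the following is a minimal sketch, assuming a Gaussian affinity with local scaling in which the scale of each sample is its distance to its $m$th nearest neighbor. This specific form, the function name `local_scaling_affinity`, and the default `m=7` are assumptions for illustration and are not confirmed by the text.

```python
import numpy as np


def local_scaling_affinity(samples, m=7):
    """Affinity between all sample pairs with a per-sample local scale.

    Assumed form (hypothetical, for illustration only):
        A[i, j] = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)),
    where sigma_i is the distance from x_i to its m-th nearest neighbour.
    `samples` has shape (n_samples, n_features) and needs n_samples > m.
    """
    diff = samples[:, None, :] - samples[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, index 0 is the sample itself (distance 0),
    # so index m is the distance to the m-th nearest neighbour.
    sigma = np.sort(dist, axis=1)[:, m]
    return np.exp(-dist ** 2 / (sigma[:, None] * sigma[None, :]))
```

With this kind of scaling, samples in dense regions get a small scale and samples in sparse regions a large one, so the affinity adapts to the local structure of the data.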

Multiscale Convolution.
In this work, the use of multiscale convolution kernels has two major advantages. First, and most importantly, convolution kernels of different sizes can extract features of different scales from the HSIs, so the filters can extract and learn richer information from them. Second, when a CNN is trained, training is achieved by learning the parameters of the filters (weights and offsets), i.e., by continuously updating these parameters until the output is as close as possible to the label. In this study, multiscale convolution kernels give every convolutional layer a variety of filters, which diversifies the learned weights and biases and allows the useful information in the HSI to be extracted and learned completely and effectively.
In computer vision models, multiscale inference methods are usually used to obtain the best results. Generally, finer details are predicted better at larger image sizes, whereas larger objects are predicted better at smaller sizes, where the receptive field of the network can capture the scene more accurately. In contrast to the traditional multiscale structure, a novel multiscale dense convolution model (as shown in Figure 2) is proposed in this study. In particular, we apply convolution to extract features at four kernel sizes: 11 × 11, 7 × 7, 5 × 5, and 3 × 3. In the corresponding calculation formulae, the terms up to $w^{4}_{ij}$ represent the neurons and weights from the kernels of the multiscale convolutional layer, and $n$ represents the number of filters in CONV_i. As the output of the convolutional layer, $Y$ is fed to the cross-attention-guided CP module.
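The multiscale idea can be illustrated with a small NumPy sketch: one input band is filtered with kernels of the four sizes named above, and the resulting feature maps are stacked. This is only a sketch under stated assumptions. The random kernel weights, the zero padding, the single-band input, and the function names `conv2d_same` and `multiscale_features` are hypothetical and do not reproduce the trained Keras model.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' size).

    Assumes a square kernel with an odd side length.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad, mode="constant")
    # windows has shape (H, W, k, k): one k x k patch per output pixel.
    windows = sliding_window_view(padded, (k, k))
    return np.einsum("hwij,ij->hw", windows, kernel)


def multiscale_features(image, kernel_sizes=(11, 7, 5, 3), seed=0):
    """Stack ReLU-activated responses of one random kernel per scale."""
    rng = np.random.default_rng(seed)
    maps = []
    for k in kernel_sizes:
        # Random weights stand in for learned filter parameters.
        kernel = rng.standard_normal((k, k)) / (k * k)
        maps.append(np.maximum(conv2d_same(image, kernel), 0.0))
    # Output shape (H, W, number_of_scales).
    return np.stack(maps, axis=-1)
```

Because every branch keeps the spatial size of its input, the branch outputs can be concatenated directly along the channel axis before being passed to a later module.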

Attention Mechanism.
In essence, the attention mechanism in deep learning is similar to the visual attention mechanism that humans use when making choices. Both aim to filter out, from a large amount of information, the key information that is most helpful to the task at hand and to suppress the information that is not relevant to it. The attention mechanism is prominent in two aspects: on the one hand, it can by itself select the locally important information in the overall input that needs to be focused on; on the other hand, it can allocate limited computing resources to the small amount of information that matters most to the task objectives. These two strengths have led attention networks to be widely applied in image recognition, where the attention mechanism strengthens the characteristics of local information. Further, the position of the attention area changes depending on the task objectives; by understanding the target's local information, the most useful information corresponding to the target can be put to good use. Figure 3 shows the framework of the attention mechanism network. The output of the decoder part can be expressed as follows:
$$S_t = f(S_{t-1}, y_{t-1}, C_t),$$
where $S_t$ is the state output of the decoder at time $t$, $S_{t-1}$ is the state output of the decoder at time $t-1$, $y_{t-1}$ is the label at time $t-1$, and $f$ is the dense layer.
$$C_t = \sum_{j} a_{tj} h_j,$$
where $C_t$ is the context vector passed to the next state, $h_j$ is the encoded output of the $j$th input, and $a_{tj}$ is the attention weight.
$$a_{tj} = \frac{\exp\left(g(S_{t-1}, h_j)\right)}{\sum_{k} \exp\left(g(S_{t-1}, h_k)\right)},$$
where $a_{tj}$ represents the degree of alignment between the current decoder state and the $j$th input, which is the attention weight, and $g$ is used for calculating the relationship score between $S_{t-1}$ and $h_j$. The higher the score, the more the attention distribution is focused on that input.
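The weighting step described above can be sketched in a few lines of NumPy. The sketch assumes a plain dot product as the score function $g$, which the text does not specify, and the names `softmax` and `attention_context` are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def attention_context(prev_state, inputs):
    """Return attention weights a_tj and the context vector C_t.

    prev_state: decoder state S_{t-1}, shape (d,)
    inputs:     encoded inputs h_j stacked row-wise, shape (T, d)
    The dot product below is an assumed stand-in for the score function g.
    """
    scores = inputs @ prev_state      # one score per input, shape (T,)
    weights = softmax(scores)         # attention weights, sum to 1
    context = weights @ inputs        # weighted sum of the h_j, shape (d,)
    return weights, context
```

An input that aligns well with the previous decoder state receives a high score and therefore dominates the context vector, which is the focusing behavior described in the text.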

Cross-Attention.
Because the traditional attention mechanism assigns attention weights from only a single direction, it often loses feature information. This study proposes a cross-attention mechanism that constructs two novel weight distribution strategies: weight addition and weight maximization. First, the cross-attention mechanism calculates the feature weight coefficients along the horizontal and vertical directions. These two weight coefficients are added in order to enhance the feature, and the largest weight coefficient is obtained by applying the maximization strategy. Finally, the outputs of the two strategies are fused. Figure 4 shows a schematic diagram of the cross-attention mechanism. The equations for calculating the addition and maximization of weights are as follows:
$$S_{\mathrm{add}} = \mathrm{add}(S_1, S_2), \qquad S_{\max} = \max(S_1, S_2),$$
where $S_1$ and $S_2$ represent the weight coefficients of the horizontal and vertical attention mechanisms, respectively, add represents the addition of the weight coefficients, and max represents the maximization operation of the weight coefficients.
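The two strategies can be illustrated on a single 2D feature map. The sketch below makes several assumptions that the text does not confirm: the horizontal and vertical weights are computed with a softmax along rows and columns, respectively, and the two strategy outputs are fused by summing the reweighted feature maps. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum(axis=axis, keepdims=True)


def cross_attention(feature_map):
    """Apply weight addition and weight maximization to an (H, W) map."""
    s1 = softmax(feature_map, axis=1)   # horizontal: normalise within rows
    s2 = softmax(feature_map, axis=0)   # vertical: normalise within columns
    s_add = s1 + s2                     # weight addition strategy
    s_max = np.maximum(s1, s2)          # weight maximization strategy
    # Assumed fusion: sum of the two reweighted feature maps.
    fused = feature_map * s_add + feature_map * s_max
    return fused, s_add, s_max
```

Since both weight maps are nonnegative, the maximized weight never exceeds the added weight at any position, so the addition branch keeps contributions from both directions while the maximization branch keeps only the stronger one.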

Covariance Pooling.
A traditional CNN uses max/average pooling after the convolutional layers, and together with the fully connected layer it can only capture first-order information. Although the rectified linear unit (ReLU) introduces nonlinearity, it is limited to the single-pixel level. We believe that covariance matrix pooling (as shown in Figure 5) can extract features from HSIs more efficiently than first-order statistics. If $f_1, f_2, \ldots, f_n \in \mathbb{R}^d$ is a set of features, its covariance matrix is
$$C = \frac{1}{n-1} \sum_{i=1}^{n} (f_i - \bar{f})(f_i - \bar{f})^{T},$$
where $\bar{f} = (1/n) \sum_{i=1}^{n} f_i$. Only when the number of linearly independent components in $f_1, f_2, \ldots, f_n$ is greater than $d$ is the resulting matrix a symmetric positive definite (SPD) matrix. The covariance matrix must be SPD in order to use the geometry-preserving layers of the SPD manifold network. However, even if the matrix is only positive semidefinite, it can be regularized by adding a multiple of the trace to the diagonal terms of the covariance matrix as follows:
$$C^{+} = C + \lambda \, \mathrm{trace}(C) \, I,$$
where $\lambda$ is the regularization parameter and $I$ is the identity matrix.
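Covariance pooling with trace regularization can be sketched as follows. The sketch assumes the unbiased $1/(n-1)$ normalization and treats each row of the input as one feature vector; the function name `covariance_pooling` and the default value of `lam` are hypothetical.

```python
import numpy as np


def covariance_pooling(features, lam=1e-3):
    """Second-order pooling of n feature vectors of dimension d.

    features: array of shape (n, d), one feature vector per row.
    Returns the (d, d) covariance matrix plus lam * trace(C) * I, which
    keeps the result positive definite even when n is small relative to d.
    """
    n, d = features.shape
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (n - 1)   # assumed 1/(n-1) normalization
    return cov + lam * np.trace(cov) * np.eye(d)
```

For example, with only three feature vectors in five dimensions the plain covariance matrix is rank deficient, whereas the regularized matrix has strictly positive eigenvalues and can be passed to layers that require an SPD input.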

Feature Fusion.
The output features of the multiscale convolution are concatenated, and the CP guided by the cross-attention mechanism is used to obtain the final deep features. The corresponding equation is written as follows:

Evaluation Index.
The classification ability of the MDCA-CP algorithm must be evaluated on the HSI classification results, which requires evaluation indicators as a measurement standard. In this work, three evaluation indicators are used for HSI classification: the kappa coefficient, the average accuracy (AA), and the overall accuracy (OA).

OA.
This evaluation index is the ratio of the number of correctly classified pixels to the total number of labeled pixels. If $n$ represents the number of categories of the image feature objects, $N_i$ represents the number of pixels in the $i$th category, and $h_{ii}$ represents the number of pixels that are correctly classified in the $i$th category, then the OA is expressed as
$$\mathrm{OA} = \frac{\sum_{i=1}^{n} h_{ii}}{\sum_{i=1}^{n} N_i}.$$

AA.
This evaluation index is defined as follows. For each category, the ratio of the number of correctly classified pixels to the category's total number of pixels is calculated. These ratios are summed over all categories, and the sum is divided by the number of categories, which gives the average classification accuracy. If $N$ represents the total number of sample pixels to be tested, $n$ represents the number of categories, $N_i$ represents the number of pixels in the $i$th category, and $h_{ii}$ represents the number of correctly classified pixels of the $i$th category, then AA is expressed as
$$\mathrm{AA} = \frac{1}{n} \sum_{i=1}^{n} \frac{h_{ii}}{N_i}.$$

Kappa.
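This coefficient is a performance index that measures the accuracy of classification from the classification confusion matrix and is used for consistency testing. In its standard form, it can be expressed as follows:
$$\kappa = \frac{N \sum_{i=1}^{n} h_{ii} - \sum_{i=1}^{n} h_{i+} h_{+i}}{N^{2} - \sum_{i=1}^{n} h_{i+} h_{+i}},$$
where $N$ is the total number of test pixels, $n$ is the number of categories, $h_{ii}$ is the number of correctly classified pixels of the $i$th category, and $h_{i+}$ and $h_{+i}$ are the sums of the $i$th row and the $i$th column of the confusion matrix, respectively.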
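The three indicators can be computed together from a confusion matrix, as in the following minimal sketch. It assumes that rows of the matrix correspond to true classes and columns to predicted classes, and it uses the standard definition of the kappa coefficient; the function name `classification_scores` is hypothetical.

```python
import numpy as np


def classification_scores(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    Assumed layout: confusion[i, j] counts pixels of true class i that
    were predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    correct = np.diag(confusion)              # correctly classified per class
    per_class_total = confusion.sum(axis=1)   # pixels per true class

    oa = correct.sum() / total
    aa = np.mean(correct / per_class_total)

    # Agreement expected by chance, from the row and column marginals.
    expected = (confusion.sum(axis=0) * per_class_total).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa
```

OA weights every pixel equally and can therefore be dominated by large classes, whereas AA weights every class equally, which is why both are usually reported for imbalanced hyperspectral datasets.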

Datasets.
This paper tests the proposed MDCA-CP on three benchmark datasets to verify its effectiveness: IP (Indian Pines), PU (University of Pavia), and SA (Salinas). The IP dataset was acquired by an AVIRIS sensor over northwestern Indiana. It comprises 145 × 145 pixels in the spatial domain. The spectral domain is composed of 224 spectral reflectance bands, and the wavelength range is 0.4-2.45 μm. The available ground truth is classified into 16 categories. In this paper's experiment, 20 water absorption bands are removed, and the final image size is 145 × 145 × 200. The PU dataset was acquired by a ROSIS sensor over northern Italy. It consists of 610 × 340 pixels in the spatial domain; after removing the water absorption bands, the spectral domain consists of 103 spectral bands, and the spectral coverage ranges from 430 to 860 nm. The ground truth is classified into 9 categories, and the size of the image used in the experiment of this article is 610 × 340 × 103.
The SA dataset was acquired over California by the AVIRIS sensor. It has 224 bands in the spectral domain and 512 × 217 pixels in the spatial domain. It also has a high spatial resolution (3.7 m per pixel). After removing the 20 water absorption bands, the experimental image's size is 512 × 217 × 204, and the available ground truth is classified into 16 categories. Tables 1-3 describe the sample numbers of these three datasets in detail.

Hyperparameter Setting.
The MDCA-CP proposed in this paper is implemented in Python with the Keras deep learning framework. The experimental environment was the Windows 10 operating system with 16 GB RAM and an NVIDIA GeForce GTX 1080 GPU with 8 GB of memory. To prevent deviations caused by different training samples, the average of more than 20 experimental results under the same conditions was analyzed in this paper. In this model, stochastic gradient descent is adopted to update the weights. The learning rate is 0.01, the proportion of disconnected neurons in the dropout layer of the fully connected layer is set to 0.5, and the activation function is ReLU. In this paper, the bilateral fusion block network was trained with mini-batch gradient descent. The number of training samples was set to 1000, and the number of epochs was set to 600. Table 4 shows the detailed hyperparameter settings.

Experimental Results Obtained from Different Methods.
To verify the validity and correctness of our proposed method, we compared the results obtained using MDCA-CP with those obtained using AlexNet [23], ResNet [33], DenseNet [34], PRAN [35], FSSFNet [36], and SAGP [37]. To ensure a fair comparison, all hyperparameters of the compared networks were set to the same values. The results obtained by applying the different models to the three datasets are given in Tables 5-7.

Experimental Results Obtained Using the IP Dataset.
We randomly selected 5% of the samples from the IP dataset for training and used the remaining 95% for testing. It can be seen from Table 5 and Figure 6 that the AlexNet model has the worst classification performance, with considerable noise in its classification map. This is because the model is not deep enough to extract representative discriminative features, and it makes no corresponding adjustments for overfitting and the decline in resolution during training. Among the other classification models, those with an attention mechanism achieve better classification results than the other methods. In addition, SAGP has a better classification performance than PRAN, which is related to the extreme imbalance of category samples in the IP dataset. The attention mechanism overcomes this disadvantage, so the MDCA-CP proposed in this paper achieves the best results.

Experimental Results Obtained Using the PU Dataset.
From Table 6 and Figure 7, it can be seen that, on the Pavia University dataset, the MDCA-CP method proposed in this work achieves the highest classification accuracy among the compared methods. As observed in Figure 7, the classification map of the proposed method has a larger number of correctly classified pixels and less noise in some areas than those of the other methods. This is because SAGP contains an attention mechanism network that strengthens the local features, focuses on the features that play an important role in the current task goal, and selects the features relevant to the current state from among many features. The network model can analyze the key features instead of the overall image features and increase the weight of the features related to the current task goal, while features that do not help the classification are weakened. During classification, attention can thus be focused on finding the useful features associated with the current output to obtain better feature representations for the different categories, which improves the final classification performance. From the above description, we can see that the method proposed in this study is better than the other methods.

Experimental Results Obtained Using the SA Dataset.
For the SA dataset, this paper randomly selects 1%, 5%, and 10% of the samples for training, and the remaining samples are used as test samples. Table 7 and Figure 8 show the comparative experimental results. In terms of visual effects, the ground object classification map produced on the SA dataset by the method proposed in this paper again has the least noise. In the quantitative analysis, the OA of the MDCA-CP model on the SA dataset reaches 96.15%, and the model outperforms the other comparison methods in OA, kappa, and AA. The proposed MDCA-CP model can effectively identify most of the ground features and performs better on the SA dataset.

Results of Ablation Experiments on Different Weighting Strategies.
Table 8 shows the experimental results obtained by using different weight allocation strategies. From the table, it can be seen that the combined addition and maximization strategy is significantly better than the individual strategies of addition only or maximization only. The combined strategy enhances the model's classification ability, showing that the two strategies of MDCA-CP are effective. Further, the weight addition strategy is superior to the weight maximization strategy, which indicates that using only the maximization strategy leads to a loss of some features. Figures 9-11 also show the pie charts of each indicator on the three datasets with different weighting strategies.

Results of Ablation Experiments on Different Submodules.
This section presents the results of the ablation experiments on the cross-attention module and the CP module. Table 9 and Figure 12 show the corresponding experimental results, from which it is observed that the combination of the two modules (MDCA-CP) provides superior results. We also found that the cross-attention module performs better than the covariance pooling module, which further shows the importance of feature mining in hyperspectral image classification. The MDCA module can effectively select the most representative features. The covariance pooling module, in turn, is better than the baseline model, showing that extracting second-order features from HSIs is also effective. This further supports the superiority of our MDCA-CP model. Figures 13-15 also show the histograms of the indicators of the different submodules on the three datasets.

Conclusion
This paper addresses data dimensionality reduction and feature extraction and presents the novel MDCA-CP model for HSI scene classification. Multiscale convolution has been shown to effectively detect the subtle changes between the pixels in local areas of the spatial and spectral dimensions of an HSI, and it can be applied to the feature extraction of hyperspectral data of complex types and structures. Traditional algorithms assign the attention weights in only one direction, which results in the loss of feature information.
The dense cross-attention mechanism in this work jointly distributes the attention weights horizontally and vertically. As a result, it can efficiently mine the most representative features of the hyperspectral data. Feature mining is further enhanced by the weight addition and maximization strategies. This paper also demonstrates the effectiveness and superiority of the MDCA-CP algorithm by comparing the results obtained from different models and by conducting ablation experiments.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.