Spectral-Spatial Attention Transformer with Dense Connection for Hyperspectral Image Classification

In recent years, deep learning has been widely used in hyperspectral image (HSI) classification and has shown good capabilities. Particularly, the use of convolutional neural network (CNN) in HSI classification has achieved attractive performance. However, HSI contains a lot of redundant information, and the CNN-based model is limited by the receptive field of CNN and cannot balance the performance and depth of the model. Furthermore, considering that HSI can be regarded as sequence data, CNN-based models cannot mine sequence features well. In this paper, we propose a model named SSA-Transformer to address the above problems and extract spectral-spatial features of HSI more efficiently. The SSA-Transformer model combines a modified CNN-based spectral-spatial attention mechanism and a self-attention-based transformer with dense connection. The SSA-Transformer model can combine the local and global features of HSI to improve the performance of the model. A series of experiments showed that the SSA-Transformer achieved competitive classification accuracy compared with other CNN-based classification methods using three HSI datasets: University of Pavia (PU), Salinas (SA), and Kennedy Space Center (KSC).


Introduction
Hyperspectral image (HSI) contains rich information in both spectral and spatial dimensions with high correlation [1,2]. Based on such advantages, HSI has been applied in many fields, such as mineral exploration, environmental monitoring, and urban development. So far, much effort has been made in the field of HSI analysis and processing, including classification [3], anomaly detection [4,5], and dimensionality reduction [6]. Previous studies of HSI classification mostly used support vector machines (SVM) [7][8][9], k-nearest neighbor (k-NN) [10], and multinomial logistic regression (MLR) [11]. However, these models heavily rely on experts' domain knowledge and engineering experience.
With the development of deep learning methodologies, multiple HSI classification methods have been developed and widely used in the past few years, including Stacked Autoencoder (SAE) [12], Deep Belief Network (DBN) [13][14][15], and Recurrent Neural Network (RNN) [16,17]. In addition, CNNs can directly process 3D image patches and extract a large amount of spatial context information, so a large number of CNN-based models have appeared in the field of HSI classification [18][19][20][21][22][23]. Hu et al. [18] were the first to use convolutional neural networks for HSI classification. However, their model contains only 1D convolution kernels, so it uses only the spectral information of HSI and ignores the spatial context information, which could reduce the accuracy of the model. Later, various models emerged that use both spatial and spectral information for classification. He et al. [19] used multiple 3D convolution kernels of different sizes to build the M3D-DCNN model, which can extract multiscale spectral-spatial information from HSI. Gao et al. [20] used small convolutions and dense connections in their model to extract spectral-spatial features. However, the above models are shallow, and their performance is not good enough.
A CNN-based model can extract richer features by increasing its depth. Paoletti et al. [21] proposed a deep residual network (pResNet) by stacking the pyramid bottleneck residual units derived from the pyramid residual network [24]. The depth of the model can reach more than 30 layers, which allows it to extract rich spectral and spatial features. This model contains a large number of 2D convolution kernels, and its performance is much better than that of the above models, but as the model becomes deeper, the training time becomes longer. Li et al. [22] proposed a dual-branch network to extract spectral and spatial information separately; however, stacking multiple 3D convolution kernels also made the training time too long, and the performance of the model did not improve much. The receptive field of a CNN is limited by its small convolution kernels. As a result, a CNN-based model cannot extract global features, which creates a bottleneck in its performance.
To solve the above problems, many transformer-based models have emerged. Considering the large spectral dimension of HSI, HSI can be regarded as sequence data. Just as the word vector in the NLP field represents the meaning of a word, the spectral vector of an HSI pixel represents land cover information. Moreover, the spatial information of HSI is similar to the context of the target word in the NLP field [25]. The transformer model was originally used in natural language processing (NLP) [26][27][28]; its self-attention mechanism can mine the global features of a sequence, which made it a great success in the field of NLP. A self-attention-based transformer model can make better use of the correlation of HSI information and can extract the global features of neighborhood pixel blocks. However, the performance of many transformer-based models is still not good enough. The model proposed by Hu et al. [29] combines 1D-CNN and Vision Transformer (ViT) [30], but its overall accuracy on the PU and SA datasets is only 93.77% and 96.15%, respectively. The reason is that ViT directly segments the input image and cannot handle the low-level features of the input image well [31]. Inspired by the model proposed by Yuan et al. [31], we use a spectral-spatial attention mechanism to process neighborhood pixel blocks to obtain feature maps and then segment the feature maps. In this study, we propose a model that combines a CNN-based spectral-spatial attention mechanism and a self-attention-based transformer (SSA-Transformer). The advantage of the SSA-Transformer is that it can extract both local and global features of HSI data and improve classification results. Specifically, the spectral-spatial attention mechanism extracts the local features of the neighborhood pixel blocks and reduces redundant information; the resulting feature maps are then processed into sequences, and finally the self-attention-based transformer encoder blocks extract the global features.
We compared the proposed SSA-Transformer with other CNN-based methods on three public HSI datasets, which revealed its competitive classification performance. The main contributions of this work can be summarized as follows: (1) In our proposed model, we use a spectral-spatial attention mechanism to extract the low-level features of the neighborhood pixel blocks, which addresses the weakness that the transformer part of the model cannot extract rich low-level features of the input image.

Methodology
In this section, we first explain the details of the spectral-spatial attention mechanism. Next, we introduce the principles of linear embedding and the transformer encoder. Finally, we discuss the overall architecture of the proposed HSI classification method.

Spectral-Spatial Attention.
A transformer-only model directly segments the input image, which prevents it from extracting rich low-level features. Therefore, we use an attention mechanism to extract the rich low-level features of the neighborhood pixel blocks fed to the model. Specifically, we use a modified CBAM [32] as the spectral-spatial attention mechanism for feature extraction of neighborhood pixel blocks. The spectral-spatial attention module consists of a spectral attention module (SeAM) and a spatial attention module (SaAM) [33]. SeAM is used to select spectral features that are useful for classification, while SaAM is used to select spatial features that are useful for classification. Figure 1 shows the structure of the entire spectral-spatial attention mechanism. We first apply convolution operations to the HSI neighborhood pixel block $y \in \mathbb{R}^{H \times H \times C}$, where H and C represent the spatial size and the spectral dimension, respectively. Next, we extract the features in the input data that contribute to classification with SeAM and SaAM, in that order. Note that none of these steps change C. Finally, a 1 × 1 Conv is used to extract discriminative features from the neighborhood pixel blocks and reduce the size of the spectral dimension. During this process, useless information is discarded so that it does not reduce classification performance. We introduce the detailed process of the spectral-spatial attention mechanism in the following subsections. The overall attention calculation can be summarized as follows:
$$y' = \mathrm{Conv1}(y), \quad y'' = M_{se}(y') * y', \quad y''' = M_{sa}(y'') * y'', \quad \mathrm{Output} = \mathrm{Conv2}(y'''),$$
where $*$ denotes elementwise multiplication, $\mathrm{Conv1}(\cdot)$ consists of two 3 × 3 convolution layers, $M_{se}(\cdot)$ denotes the spectral attention module, $M_{sa}(\cdot)$ denotes the spatial attention module, and $\mathrm{Conv2}(\cdot)$ consists of one 1 × 1 convolution layer.

Spectral Attention Module.
For different classes of pixels in HSI, the spectral bands that contribute to the classification are different, and some spectral bands will reduce the accuracy of the classification [34]. Therefore, the role of SeAM is to strengthen the contribution of the spectral bands that are helpful to the classification and weaken the contribution of the spectral bands that are useless or even harmful to the classification.
This module maps the input into a weight vector that indicates the contribution of each spectral band to the classification result. The structure of SeAM is shown in Figure 2. The module first generates two 1 × 1 × C vectors, $P_{se}^{avg}$ and $P_{se}^{max}$: $P_{se}^{avg}$ is generated by global average pooling, and $P_{se}^{max}$ is generated by global max pooling. After that, the two vectors are passed through the fully connected layer $F_1$ for dimensionality reduction, and then the dimensionality is restored through the fully connected layer $F_2$. Next, the spectral weight vector $P_{se}$ is generated by adding these two vectors and applying the ReLU activation function. $P_{se}$ is calculated by
$$P_{se} = \sigma\left(F_2\left(F_1\left(P_{se}^{avg}\right)\right) + F_2\left(F_1\left(P_{se}^{max}\right)\right)\right),$$
where $\sigma$ denotes the ReLU activation function.
Finally, the spectral weight vector $P_{se}$ is multiplied by the input spectralwise to get the output $y''$.
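The SeAM computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the reduction ratio `r`, the random weights `w1` and `w2` standing in for F1 and F2, and the absence of a nonlinearity between F1 and F2 (the text does not mention one) are all hypothetical choices.

```python
import numpy as np

def spectral_attention(y, w1, w2):
    """Reweight the spectral bands of one pixel block y of shape (H, H, C)."""
    p_avg = y.mean(axis=(0, 1))   # global average pooling -> (C,)
    p_max = y.max(axis=(0, 1))    # global max pooling -> (C,)

    def f2_f1(p):
        # F1 reduces C to C // r, F2 restores it; both vectors share the weights.
        return (p @ w1) @ w2

    # Sum of the two branches followed by ReLU, as stated in the text.
    p_se = np.maximum(f2_f1(p_avg) + f2_f1(p_max), 0.0)
    # Broadcasting applies one weight per band at every spatial position.
    return y * p_se, p_se

rng = np.random.default_rng(0)
H, C, r = 9, 16, 4                           # r is a hypothetical reduction ratio
y = rng.normal(size=(H, H, C))
w1 = rng.normal(size=(C, C // r)) * 0.1      # stand-in for the F1 layer
w2 = rng.normal(size=(C // r, C)) * 0.1      # stand-in for the F2 layer
out, p_se = spectral_attention(y, w1, w2)
print(out.shape, p_se.shape)
```

The output keeps the shape of the input block, which matches the statement that the attention steps do not change C.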
Spatial Attention Module.
All pixels of a neighborhood block are initially considered to belong to the class of the center pixel; that is, the contribution of all neighbor pixels to the class of the center pixel is initially the same [33]. However, this gives no way to distinguish the contributions of different pixels in the neighborhood, which may affect the classification of pixels located on the boundary between two different categories. In addition, not all pixels in the neighborhood contribute to the classification of the center pixel, and some of them may even reduce the classification effect [33]. Therefore, the role of SaAM is to enhance the contribution of pixels that are helpful for classification and weaken the contribution of pixels that are useless or even interfere with the classification. The structure of SaAM is shown in Figure 3. This module first calculates the average value and the maximum value of the elements along the spectral dimension and obtains the outputs $P_{sa}^{avg}$ and $P_{sa}^{max}$, each with the shape H × H × 1. Next, we concatenate these two outputs and pass them through a convolution operation and a sigmoid activation function to get a new output, $P_{sa}$, which represents the contribution of each pixel. The specific calculation process is as follows:
$$P_{sa} = \mathrm{sigmoid}\left(\mathrm{Conv}\left(\left[P_{sa}^{avg}; P_{sa}^{max}\right]\right)\right).$$
Finally, $P_{sa}$ is multiplied by the input spatialwise to get the output $y'''$.
Computational Intelligence and Neuroscience
Linear Embedding.
The self-attention mechanism can capture the global information (long-term correlation) of the input patch at any location [35], but it causes the input vectors to lose their positional relationship. Therefore, ViT solves this problem by processing the sequence into a linear embedding sequence [30]. The overall process is shown in Figure 4: first segment the input data into patches, then flatten each patch into a vector, then add an extra vector for classification, and finally add a position code to each vector.
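The last two steps of the linear embedding (adding the extra classification vector and the position code) can be illustrated as follows. This is a minimal sketch assuming the patches have already been flattened; placing the extra vector at the start of the sequence follows the usual ViT convention and is an assumption here, and `cls_vector` and `pos_code` are hypothetical placeholders for parameters that would be learned in practice.

```python
import numpy as np

def embed_sequence(patches, cls_vector, pos_code):
    """patches: (N, D) flattened patches; returns a sequence of shape (N + 1, D)."""
    # Prepend the extra vector used for classification (assumed position: first).
    seq = np.concatenate([cls_vector[None, :], patches], axis=0)
    # Add one position code to each vector so that order information is kept.
    return seq + pos_code

rng = np.random.default_rng(0)
N, D = 9, 27
patches = rng.normal(size=(N, D))
cls_vector = np.zeros(D)                        # learnable in a real model
pos_code = rng.normal(size=(N + 1, D)) * 0.02   # learnable in a real model
x = embed_sequence(patches, cls_vector, pos_code)
print(x.shape)
```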
Transformer Encoder Block.
Inspired by the transformer [36], the Vision Transformer [30], which is based on self-attention, has been successfully applied in the field of computer vision. The self-attention mechanism in the transformer model can extract global features, which is the key to its effectiveness [30]. Figure 5 shows the architecture of the transformer encoder block, which consists of a multihead self-attention sublayer and a feedforward network sublayer. A residual connection is used around each sublayer, and the input of each sublayer is normalized using LayerNorm (LN). The self-attention mechanism can be defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where K, Q, V, and the output are matrices. Q, K, and V are obtained by multiplying the input matrix by the $W^{Q}$, $W^{K}$, and $W^{V}$ matrices, respectively, and the dimension $d_k$ of Q is used for scaling. Note that this is not the only self-attention mechanism in each transformer encoder block.
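The scaled dot-product self-attention defined above can be written directly in NumPy. The sketch below covers a single head only; the sizes and the random projection matrices `w_q`, `w_k`, and `w_v` are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (N, D) input sequence; returns the (N, d_k) output and (N, N) weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of every pair of positions
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
N, D, d_k = 10, 27, 8                    # illustrative sizes
x = rng.normal(size=(N, D))
w_q, w_k, w_v = (rng.normal(size=(D, d_k)) * 0.1 for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because every row of `weights` spans all N positions, each output vector can draw on the whole sequence, which is the global-feature property the text refers to.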
There are multiple such self-attention mechanisms, which together constitute a multihead self-attention mechanism. We define the multihead self-attention mechanism as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O},$$
where $W^{O}$ is a weight matrix and h is the number of heads. The feedforward network in each transformer encoder block consists of two fully connected layers and a GeLU activation function, which can be defined as
$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2,$$
where $\sigma$ denotes the GeLU activation function and $W_1$, $b_1$, $W_2$, and $b_2$ are the weights and biases of the two fully connected layers.

Overview of the Proposed Model.
Figure 6 shows the overall architecture of our proposed model. First, we take each labeled pixel as the center and extract a neighborhood pixel block of size h × h × c, where h is the length and width of the pixel block and c is the spectral dimension, which differs between HSIs. We use padding for edge pixels whose neighborhoods cannot be extracted directly. This gives sample data of shape (n, h, h, c), where n is the total number of samples. Then, we use the spectral-spatial attention module to extract the spatial and spectral features of the sample data and reduce redundant information.
The spectral-spatial attention mechanism reduces the redundant information of the input data in the spectral bands, and the shape of the output data is (n, h, h, k), where k is the number of spectral bands retained after processing through the 1 × 1 convolutional layer.
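The block extraction step at the start of this overview can be sketched as follows. The text only says that padding is used for edge pixels, so the reflect mode below is an assumption, as are the function name `extract_blocks` and the convention that label 0 marks unlabeled pixels.

```python
import numpy as np

def extract_blocks(image, labels, h):
    """image: (rows, cols, c); labels: (rows, cols), 0 = unlabeled (assumed).

    Returns blocks of shape (n, h, h, c) and the n center-pixel labels.
    """
    assert h % 2 == 1, "h must be odd so that the block has a center pixel"
    m = h // 2
    # Pad only the two spatial axes so that edge pixels also get full blocks.
    padded = np.pad(image, ((m, m), (m, m), (0, 0)), mode="reflect")
    rows, cols = np.nonzero(labels)
    blocks = np.stack([padded[r:r + h, c:c + h, :] for r, c in zip(rows, cols)])
    return blocks, labels[rows, cols]

image = np.arange(6 * 5 * 4, dtype=float).reshape(6, 5, 4)   # toy 6 x 5 image, 4 bands
labels = np.zeros((6, 5), dtype=int)
labels[0, 0], labels[3, 2] = 1, 2                            # two labeled pixels
blocks, targets = extract_blocks(image, labels, h=3)
print(blocks.shape, targets)
```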
Next, we segment each output of shape (h, h, k) into (h × h)/(p × p) patches of shape (p, p, k). We set p to 3. Each patch of shape (p, p, k) is reshaped into a one-dimensional vector of length p × p × k. The shape of the data can therefore be written as (N, D), where N = (h × h)/(p × p) is the length of the sequence and D = p × p × k is the dimension of each vector in the sequence.
Finally, by adding the embedding vector and the position code, we create a matrix of size (batch size, N + 1, D) to use as the input to the transformer part of our model. We use multiple transformer encoder modules to successively extract image features and use the dense connection structure to reduce the loss of information.
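The segmentation of one attention output into a sequence of shape (N, D) can be checked with a short reshape. This is a sketch assuming that h is divisible by p (true for the spatial sizes 9, 15, and 21 with p = 3); the function name `to_sequence` and the value k = 4 are hypothetical.

```python
import numpy as np

def to_sequence(feature_map, p):
    """feature_map: (h, h, k) -> (N, D) with N = (h*h)/(p*p) and D = p*p*k."""
    h, _, k = feature_map.shape
    assert h % p == 0, "this sketch assumes h is divisible by p"
    g = h // p                                   # patches per side
    x = feature_map.reshape(g, p, g, p, k)       # split both spatial axes
    x = x.transpose(0, 2, 1, 3, 4)               # (g, g, p, p, k): one patch per cell
    return x.reshape(g * g, p * p * k)           # flatten each patch into a vector

h, p, k = 9, 3, 4                                # k = 4 is only for illustration
fmap = np.arange(h * h * k, dtype=float).reshape(h, h, k)
seq = to_sequence(fmap, p)
print(seq.shape)                                 # N = 9, D = 36
```

Adding the extra embedding vector then gives N + 1 = 10 vectors per sample in this toy setting.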

Experiments
In this section, we first introduce the three HSI datasets used to measure the performance of the model: Kennedy Space Center (KSC), University of Pavia (PU), and Salinas (SA), as illustrated in Figures 7-9. The details of the datasets are shown in Table 1. Next, we specify the model configuration process. Finally, we analyze the four factors that affect the performance of the proposed model. We choose overall accuracy (OA), average accuracy (AA), and the KAPPA coefficient (κ) as the measurement indices of SSA-Transformer performance.
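For reference, the three indices can be computed from a confusion matrix as in the sketch below, which uses the standard definitions (OA is the fraction of correct predictions, AA is the mean per-class recall, and κ corrects OA for chance agreement). The labels are toy values, not results from the paper.

```python
def classification_metrics(y_true, y_pred, num_classes):
    """Return (OA, AA, kappa) for integer class labels in [0, num_classes)."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1                            # rows: true class, columns: predicted
    n = len(y_true)
    row = [sum(cm[i]) for i in range(num_classes)]
    col = [sum(cm[i][j] for i in range(num_classes)) for j in range(num_classes)]
    oa = sum(cm[i][i] for i in range(num_classes)) / n
    recalls = [cm[i][i] / row[i] for i in range(num_classes) if row[i] > 0]
    aa = sum(recalls) / len(recalls)
    pe = sum(row[i] * col[i] for i in range(num_classes)) / (n * n)  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = [0, 0, 0, 0, 1, 1, 2, 2]                # toy labels
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
oa, aa, kappa = classification_metrics(y_true, y_pred, num_classes=3)
print(round(oa, 4), round(aa, 4), round(kappa, 4))
```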
In the Salinas dataset, we randomly selected 10% of the samples for training. In the Kennedy Space Center dataset, we randomly selected 200 samples per class as the training set. In the University of Pavia dataset, we randomly selected 400 samples per class as the training set. When a category has too few labeled samples to meet the requested number, 80% of the labeled samples in that category are selected as the training set. We randomly take out 25% of the training set to serve as the validation set. To be fair, we repeated every subsequent experiment ten times with randomly selected training data and report the mean and standard deviation of the results.
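The per-class sampling rule can be sketched with the standard library alone. The function name `split_indices`, the seed, and the exact condition for "too few" samples (a class that does not have more than `per_class` samples) are assumptions made for illustration; the text does not state the threshold precisely.

```python
import random

def split_indices(labels, per_class, fallback=0.8, val_fraction=0.25, seed=0):
    """Return (train, val, test) index lists for a per-class sampling strategy."""
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    train, test = [], []
    for c, idx in sorted(by_class.items()):
        rng.shuffle(idx)
        # Assumed rule: small classes contribute 80% of their samples instead.
        n = per_class if len(idx) > per_class else int(fallback * len(idx))
        train += idx[:n]
        test += idx[n:]
    rng.shuffle(train)
    n_val = int(val_fraction * len(train))       # 25% of the training set
    return train[n_val:], train[:n_val], test

labels = [0] * 100 + [1] * 10                    # toy data: class 1 is very small
train, val, test = split_indices(labels, per_class=20)
print(len(train), len(val), len(test))
```

In this toy case class 0 contributes 20 samples and class 1 contributes 8, and a quarter of those 28 samples is held out for validation.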

Experimental Datasets
Kennedy Space Center (KSC). This dataset was collected by Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensors in the Kennedy Space Center. It has a total of 224 spectral bands. After removing the water absorption and low signal-to-noise ratio (SNR) bands, the remaining 176 bands are used for experiments. Its size is 512 × 614 pixels, with a total of 5,211 marked pixels and 13 land cover categories. Table 2 lists the specific division of the dataset.
University of Pavia (PU). This dataset was collected by ROSIS sensors in the urban area of the University of Pavia in northern Italy. It has 115 spectral bands. After removing the bands affected by noise, 103 bands are left for experiments. It has a size of 610 × 340 pixels, a total of 42,776 marked pixels, and 9 land cover categories. Table 3 lists the specific division of the dataset.
Salinas (SA). This dataset was acquired by the AVIRIS sensor. The dataset has 224 spectral bands, with 20 water absorption bands removed, leaving 204 bands for experiments. The size of Salinas is 512 × 217 pixels. There are a total of 54,128 marked pixels and 16 land cover categories. Table 4 lists the specific division of the dataset.

Experimental Configuration.
To evaluate the performance of the model proposed in this paper, the experiments are implemented on a computer with an AMD CPU R7-4800 at 2.9 GHz, a memory size of 16 GB, and an RTX2060 graphical processing unit (GPU). The model proposed in this paper was implemented with Python version 3.7.0 and the deep learning framework PyTorch version 1.2.0. Optimization is performed with the SGD optimizer [37]. The loss function of our proposed model is the cross-entropy function. In the experiment on the PU dataset, the learning rate is set to 0.01, and it decays to 0.001 at the 41st epoch. In the experiment on the SA dataset, the learning rate is set to 0.005, decays to 0.001 at the 41st epoch, and decays to 0.0001 at the 81st epoch. In the experiment on the KSC dataset, the learning rate is set to 0.01, decays to 0.001 at the 41st epoch, and decays to 0.0001 at the 81st epoch.
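The learning rate schedules above amount to a small lookup table. The helper below is a framework-independent sketch; the function name `learning_rate` and the 1-based epoch convention are assumptions.

```python
def learning_rate(dataset, epoch):
    """Return the learning rate for a 1-based epoch on the given dataset."""
    # (first epoch at which the value applies, learning rate), as stated in the text.
    schedules = {
        "PU": [(1, 0.01), (41, 0.001)],
        "SA": [(1, 0.005), (41, 0.001), (81, 0.0001)],
        "KSC": [(1, 0.01), (41, 0.001), (81, 0.0001)],
    }
    lr = None
    for start, value in schedules[dataset]:
        if epoch >= start:
            lr = value                           # keep the latest applicable value
    return lr

for name in ("PU", "SA", "KSC"):
    print(name, [learning_rate(name, e) for e in (1, 40, 41, 80)])
```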

Parameter Setting.
Some factors have a significant impact on the classification performance of the model, and we analyze the impact of these factors in this subsection. These factors are batch size, spatial size, training sample, and the number of heads of multihead self-attention. The total numbers of epochs for the PU, SA, and KSC experiments are set to 80, 120, and 120, respectively.

Spatial Sizes.
The spatial size determines the spatial information that the model can use for classification and has a great impact on the performance of the model. To evaluate the impact of the spatial size on the performance of the SSA-Transformer, we choose spatial sizes of 9, 15, and 21 for the experiment. Figure 11 shows the overall accuracy (OA) of the SSA-Transformer for each spatial size. We observed that the accuracy of the model did not necessarily increase as the spatial size increased.

Training Sample.
We consider using 5%, 10%, and 15% of the sample data in SA and 200, 300, and 400 samples per class in KSC and PU as the training set, respectively. The rest of the data is used as the test set. Table 5 shows the results obtained by training our proposed model on the corresponding training sets. From Table 5, we can see that, in the experiment on KSC, the accuracy of the three settings is not much different. On the remaining two datasets, the larger the training set, the higher the accuracy. The reason is that a larger training set can alleviate the overfitting problem of the model. There is a trade-off between the performance and the training time of the model. For the Pavia University dataset, we employ a 400 per class strategy. For KSC, we employ a 200 per class strategy. For Salinas, we employ 10% of the sample data.

The Number of Heads of the Multihead Self-Attention Mechanism.
The multihead self-attention mechanism can focus on different positions and can more effectively mine the relationships between the vectors of the sequence. Therefore, we choose head = 4, 6, and 8 for the experiment. Figure 12 shows the effect of different numbers of heads on the accuracy of the model. There is a trade-off between the performance and the training time of the model. The number of heads for the SA, PU, and KSC datasets is set to 8, 6, and 8, respectively.

Results and Discussion
In this section, we use several recently developed, typical CNN-based models as baselines to evaluate the performance of our proposed model, including 1D-CNN [18], M3D-DCNN [19], SC-FR [20], pResNet [21], and DBDA [22]. We repeat all experiments on the three datasets 10 times to ensure the fairness of the comparison. We uniformly use the spatial size, training sample, and batch size determined in Section 3 as the input of both the comparison models and our proposed model. The evaluation indicators OA, AA, and KAPPA are expressed in the form of "mean ± standard deviation." In addition, we use the variance of OA and the variance of AA to express the volatility of accuracy.

Comparing with Other Methods.
The classification results for each of the methods are shown in Tables 6-8. The experimental results demonstrate that our proposed model achieves the best performance on the PU and SA datasets. On the KSC dataset, compared with DBDA, our proposed model is 0.01%, 0.01%, and 0.02% lower in OA, AA, and KAPPA, respectively, but the gap is not significant. On the Salinas dataset, compared with 1D-CNN, the OA, AA, and KAPPA of our model are 10.02%, 9.36%, and 13.42% higher, respectively. This is because 1D-CNN extracts only the spectral features of HSI and not the spatial features. M3D-DCNN, pResNet, and SC-FR are models based on 3D pixel blocks that do not use the attention mechanism. M3D-DCNN uses a variety of convolution kernels of different sizes to obtain multiscale information. Even though 3D convolution is used, since the attention mechanism is not used, its OA, AA, and KAPPA are 3%, 1.55%, and 3.34% lower than those of our proposed model on the Salinas dataset, respectively.
Figure 11: Overall classification accuracy of different spatial sizes.
Figure 12: Overall classification accuracy of different number of heads.
Although DBDA uses a spatial attention mechanism and a spectral attention mechanism to extract spatial and spectral features, its OA on the PU and SA datasets is 0.17% and 0.43% lower than that of our proposed model, respectively. The reason is that this model cannot use the global information of neighborhood pixel blocks for classification. Although, on the KSC dataset, the OA of DBDA is 0.01% higher than that of our proposed model, the accuracy of our proposed model on class 5 (Oak/Broadleaf) reaches 100%, while the accuracy of DBDA on class 5 is only 99.52%. The performance of our proposed model on the three datasets shows that, compared to CNN-based models, a model that combines transformer and CNN can also reach a competitive level of accuracy.
Figures 13-15 visualize the classification results of our proposed model and the other five models on the three datasets. The classification maps clearly show many noise points for 1D-CNN because that model does not extract the spatial features of HSI. The remaining comparison models and our proposed model all use the spatial information of HSI to help classification. Thus, the noise point problem is solved. Moreover, since M3D-DCNN, SC-FR, and pResNet do not use an attention mechanism, these models are more likely to be disturbed by pixels and spectral bands that do not contribute to the classification. For example, on the SA dataset, none of these models can accurately mark class 15 (Vinyard_untrained), and our proposed model marks class 15 most accurately. Although DBDA also uses the attention mechanism, the classification map shows that its classification effect is not as good as that of our proposed model. Specifically, compared against the ground-truth images, our proposed model achieves a more accurate and smoother classification result.
The above experiments show that our proposed model can achieve competitive performance compared with CNN-based models. But balancing performance and efficiency is also important for the model. Table 9 shows the training time and test time of pResNet, DBDA, and our proposed model on PU, KSC, and SA. Our proposed model requires less training time than DBDA and pResNet. Although DBDA performs better on KSC than our proposed model, the training time of our proposed model is only 65% of that of DBDA, which shows that our model achieves a better balance between efficiency and accuracy.

Computing Time for Selecting Different Numbers of Bands.
When we introduced the spectral-spatial attention module in Section 2, we mentioned that this module selects appropriate bands in its last layer (i.e., the 1 × 1 convolution layer) to reduce redundant information, which can also reduce the time required to train and test the model. Table 10 shows the training time and test time of the model when the spectral-spatial attention module selects 16, 32, and 64 bands. We find that as the number of selected bands decreases, both the training time and the testing time of the model decrease.
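One way to see why fewer bands shorten the computation is that a 1 × 1 convolution is a per-pixel linear map over the spectral axis, and the token dimension D = p × p × k grows linearly with the number of retained bands k. The sketch below illustrates this with NumPy; the random weights are placeholders, and C = 103 is simply the band count of the PU dataset used as an example.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """x: (H, H, C); weight: (C, k); bias: (k,). Returns (H, H, k)."""
    # A 1 x 1 convolution mixes the C bands of each pixel independently.
    return x @ weight + bias

rng = np.random.default_rng(0)
H, C, k = 9, 103, 16
x = rng.normal(size=(H, H, C))
weight = rng.normal(size=(C, k)) * 0.1           # placeholder for learned weights
bias = np.zeros(k)
out = conv1x1(x, weight, bias)

p = 3                                            # patch size used in the paper
token_dims = [p * p * bands for bands in (16, 32, 64)]
print(out.shape, token_dims)
```

With p = 3, keeping 16, 32, or 64 bands gives token dimensions of 144, 288, and 576, so every later layer that operates on the tokens processes proportionally more values.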

Effectiveness of the Dense Connection.
Dense connection can improve the flow of information between transformer encoder blocks and reduce the loss of information. To verify the effectiveness of dense connection, we removed it and compared the performance of the two models. Figure 16 shows the improvement in model performance brought by dense connections. A model with dense connection achieves higher accuracy. We conclude that dense connection can improve the classification performance of the model.

Effectiveness of the Spectral-Spatial Attention Module.
In Section 2, we explained the role of the CNN-based spectral-spatial attention module. To verify the effectiveness of the spectral-spatial attention module, we removed the spectral-spatial attention module, the spectral attention, and the spatial attention, respectively, and compared the performance of these four models. The impact of the spectral-spatial attention module on the model performance is shown in Figure 17. The performance of the model is greatly improved by extracting low-level local features from neighborhood pixel blocks. This shows that combining the local features extracted by CNN and the global features extracted by the transformer can more effectively improve the performance of the model. It is worth noting that, on the SA dataset, the accuracy of our proposed model improves after removing the spectral attention. The reason is that many pixels with different labels in the SA dataset have similar spectral characteristics. After adding spectral attention, the model pays too much attention to the spectral features. We conclude that the spectral-spatial attention module can improve the classification performance of the model.

Conclusion
In this paper, we propose a model that combines a transformer and a CNN-based spectral-spatial attention mechanism. This model can separately extract the local and global features of HSI. The experimental results show that the model combining transformer and CNN has better performance than the CNN-based models. The model first uses a spectral-spatial attention mechanism to extract local features and reduce the impact of redundant information on classification, then converts neighborhood pixel blocks into sequences, and extracts global features through the transformer part of the model. Finally, classification is performed by the fully connected layer.
In the experiment, we first analyzed the influence of batch size, spatial size, training samples, and number of heads of the multihead self-attention mechanism on the performance of the model.
Future research should focus more on efficient transformer encoder blocks and attention mechanisms to process HSI information. By combining the local and global features of HSI more effectively, the accuracy of the HSI classification model can be further improved, and a more effective HSI classifier can be constructed.
Data Availability
The data that support the findings of this study are openly available in Hyperspectral Remote Sensing Scenes at https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.

Conflicts of Interest
The authors declare that there are no conflicts of interest.