HMFT: Hyperspectral and Multispectral Image Fusion Super-Resolution Method Based on Efficient Transformer and Spatial-Spectral Attention Mechanism

Due to the imaging mechanism of hyperspectral sensors, the spatial resolution of the resulting images is low. An effective way to address this problem is to fuse the low-resolution hyperspectral image (LR-HSI) with the high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Current state-of-the-art fusion approaches are based on convolutional neural networks (CNNs), and few have attempted to use the Transformer, which shows impressive performance on high-level vision tasks. In this paper, a simple and efficient hybrid network based on the Transformer is proposed to solve the hyperspectral image fusion super-resolution problem. We combine convolution and the Transformer as the backbone network to fully extract spatial-spectral information, taking advantage of the local focus of the former and the global focus of the latter. To pay more attention to informative features such as the high-frequency details needed for HR-HSI reconstruction, and to explore the correlation between spectra, a convolutional attention mechanism is used to further refine the extracted features in the spatial and spectral dimensions, respectively. In addition, considering that the resolution of HSI is usually large, we use the feature split module (FSM) to replace the self-attention computation of the native Transformer, reducing the computational complexity and memory footprint of the model and greatly improving training efficiency. Extensive experiments show that the proposed network achieves the best qualitative and quantitative performance compared with recent HSI super-resolution methods.


Introduction
Hyperspectral imaging can capture images of the same scene at different wavelengths simultaneously. Its rich spectral features are of great importance in the field of remote sensing [1]. HSI has been applied to tracking [2], classification [3][4][5], segmentation [6], and clustering [7], with significantly improved results. Compared with conventional images (e.g., colour or grayscale images), HSI contains richer spectral information about real scenes. However, images with both high spatial resolution and high spectral resolution cannot be obtained simultaneously due to the limitations of existing imaging sensors. In contrast, it is easier for conventional cameras to obtain RGB (red, green, blue) or panchromatic images with higher spatial resolution but lower spectral resolution [8]. Therefore, fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) is an effective way to solve the hyperspectral super-resolution problem. This process is often referred to as HSI super-resolution or HSI fusion.
Currently, the fusion problem is formulated as an image restoration problem [8][9][10][11][12][13][14][15][16][17][18]. The method follows a physical degradation model in which the input LR-HSI and HR-MSI are considered as spatially degraded and spectrally degraded observations of the latent HR-HSI, respectively. The observation model is expressed as

Y = XBS, Z = RX, (1)

where Y ∈ R^(S×hw), Z ∈ R^(s×HW), and X ∈ R^(S×HW) represent the two-dimensional matrices of LR-HSI, HR-MSI, and HR-HSI after unfolding along the third dimension, respectively. h and w refer to the height and width at low resolution, H and W refer to the height and width at high resolution, and s and S represent the number of spectral bands at the multispectral and hyperspectral levels, respectively. In addition, B ∈ R^(HW×HW), S ∈ R^(HW×hw), and R ∈ R^(s×S) are the blur matrix, subsampling matrix, and spectral response matrix, respectively. Based on the observation model (1), many methods have been proposed and have achieved good performance. HSI super-resolution involves a large scale factor in both the spatial and spectral domains, and it is a highly ill-posed problem. Therefore, it is vital to integrate a priori information to constrain the solution space. References [19][20][21] use the prior knowledge that the spatial information of HR-HSI can be sparsely represented under a dictionary trained on HR-HSI [22]. A locally smooth spatial prior for HR-HSI has also been assumed and encoded into the optimization model using total variation regularization, and the low-rank structure of the spectrum has been exploited to reduce spectral distortion [23]. Although valid for some applications, the rationality of such methods depends on subjective prior assumptions about the unknown HR-HSI. However, HR-HSI collected in real scenes is highly diverse both spatially and spectrally, and these traditional learning methods cannot adapt to different HSI structures.
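As a concrete illustration of observation model (1), the following Python sketch builds toy-sized degradation matrices and checks the dimensions. All shapes, the identity blur, and the random spectral response are our own illustrative choices, not the actual sensor models.

```python
import numpy as np

# Toy sizes: S/s hyperspectral/multispectral band counts, factor-4 downsampling.
S_bands, s_bands = 31, 3
H, W, ratio = 16, 16, 4
h, w = H // ratio, W // ratio

rng = np.random.default_rng(0)
X = rng.random((S_bands, H * W))     # latent HR-HSI, S x HW
B = np.eye(H * W)                    # blur matrix, HW x HW (identity for simplicity)
Smat = np.zeros((H * W, h * w))      # subsampling matrix, HW x hw
# keep every `ratio`-th pixel in both directions
for j in range(h * w):
    r, c = divmod(j, w)
    Smat[(r * ratio) * W + c * ratio, j] = 1.0
R = rng.random((s_bands, S_bands))   # spectral response matrix, s x S

Y = X @ B @ Smat                     # spatially degraded observation (LR-HSI), S x hw
Z = R @ X                            # spectrally degraded observation (HR-MSI), s x HW
```

The matrix shapes chain exactly as in (1): (S×HW)(HW×HW)(HW×hw) gives the S×hw LR-HSI, and (s×S)(S×HW) gives the s×HW HR-MSI.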
In recent years, deep learning has outperformed traditional methods in many computer vision tasks [24] and has also achieved satisfactory performance in the HSI/MSI fusion field. An enhanced deep learning model has been proposed by combining the observation model with the low-rank prior on the HSI spectrum. For example, Dian et al. [25] designed a deep model that learns the prior information inside the image through a deep CNN; this learned prior can be combined with a traditional regularization model to obtain better image features than a single regularization model. In the study of [26], a multiscale fusion model is proposed, which adaptively fuses features with an attention model, showing the ability to retain spectral information and spatial details and thus obtaining state-of-the-art HSI super-resolution results. Compared with traditional methods, CNN-based approaches are significantly better, but they still suffer from their own drawbacks. First, CNN-based methods focus on fine architecture design, and the models are usually complex. Second, CNNs pay more attention to local features and struggle to model long-range dependencies and global features.
Recently, the Transformer [27] and its variants have achieved remarkable results in natural language processing and high-level computer vision tasks, and Transformers have also been introduced into low-level vision tasks due to their excellent performance. For example, Chen et al. [28] proposed IPT, a multitask large-scale model based on the original Transformer, for image super-resolution. Liang et al. [29] proposed the SwinIR model based on the Swin Transformer to solve image restoration problems. Both methods target natural-image restoration and do not account for the properties of HSI. Subsequently, Hu et al. [30] proposed Fuse-Former, a pixel-level Transformer model for the HSI/MSI fusion problem. However, that network's ability to extract local features is insufficient, and it mainly addresses the restoration of spatial details without considering spectral features.
Based on the previous observations, we propose HMFT, a simple and efficient Transformer-based hybrid network model, to solve the hyperspectral image fusion problem. Specifically, HMFT consists mainly of spectral information extraction and spatial information extraction. (1) For spectral information, the LR-HSI is upsampled to the high-resolution scale and transmitted directly to the end of the network through a long skip connection, so as to retain the spectral information it contains to the maximum extent. (2) For spatial information, the spatial details and the remaining spectral information are extracted by combining the advantages of the CNN, which attends to local features, and the Transformer, which attends to global features. In addition, the feature split module (FSM [31]) is added to reduce the time and space complexity of the network, and the Convolutional Block Attention Module (CBAM [32]) is added to explore the correlation between spectra and to promote spatial enhancement and spectral consistency. Finally, the extracted spectral and spatial information is fused to generate the HR-HSI. In summary, the main contributions of this article are as follows: (i) A novel Transformer-based model, HMFT, is proposed to solve the super-resolution problem of hyperspectral images. The self-attention mechanism of the Transformer can capture global interactions between contexts, compensating for the fact that CNNs only focus on local features; we combine the advantages of both to extract rich feature information. (ii) Considering the huge amount of hyperspectral data, the native Transformer needs a great deal of memory and computation, and the model is difficult to train. Therefore, the feature split module (FSM) is introduced to replace the native Transformer self-attention calculation, reducing the space and time complexity of the model.
(iii) Experimental results on three different datasets demonstrate that our proposed network model HMFT is effective and generalizes well compared to previous state-of-the-art methods.

Deep CNN.
Deep CNN-based learning methods [33][34][35][36][37][38][39][40][41][42][43][44][45] have achieved good performance in the field of image SR. Yang et al. [39] proposed the PanNet network model, which uses ResNet as the feature extraction backbone. In particular, the network is trained on the high-frequency information of images, which reduces the training burden and retains more high-frequency detail while enhancing the generalization ability of the model. MHFnet [34] defines the fusion task as an optimization problem, combining the degradation model of the hyperspectral image with the spectral low-rank prior of HSI to construct the algorithm. Different from traditional optimization solvers, the authors unroll a proximal iterative algorithm into a CNN to learn the proximal operator and the model parameters.
Hu et al. [26] designed HSRnet, a multiscale fusion model that extracts spatial information at different scales and introduces an attention mechanism so that the network focuses on important components of the image and suppresses noise. Although all the previous methods achieve good results, the fact that CNN networks are limited by the size of their convolutional kernels cannot be ignored.

Vision Transformer.
In recent years, the natural language processing model Transformer [27] has gradually been applied to image super-resolution thanks to its excellent performance. Chen et al. [28] proposed the multitask model IPT, which extracts global image information by stacking multiple native Transformer modules to solve low-level vision tasks. Liang et al. [29] proposed the SwinIR model, which uses residual Swin Transformer blocks (RSTB) as the basic unit of a deep feature extraction network to solve the single-image SR problem. Hu et al. [30] proposed the FuseFormer fusion model, which uses each pixel of the hyperspectral image as the input to the Transformer module to construct a pixel-level end-to-end mapping network. The Transformer module can significantly improve network performance thanks to its ability to build long-range dependencies across images. However, due to its huge parameter scale and high GPU memory consumption, it is rarely used in the field of hyperspectral image fusion.

Methodology
This section describes HMFT in detail. The purpose is to learn an end-to-end mapping function F_θ(·) with parameters θ by fully mining the spatial and spectral information between the low-resolution hyperspectral image LR-HSI, the high-resolution multispectral image HR-MSI, and the ground-truth HR-HSI. Finally, an image with both high spatial resolution and hyperspectral characteristics is reconstructed through

I = F_θ(y, z),

where y ∈ R^(h×w×S) and z ∈ R^(H×W×s) are the LR-HSI and the HR-MSI, respectively, I represents the fusion result, and θ represents the network parameters, which can also be regarded as implicit prior knowledge. Figure 1 presents the overall schematic diagram of the HS/MS fusion super-resolution model HMFT. The network takes LR-HSI and HR-MSI as input and finally outputs an HR-HSI. The network is divided into upper and lower parts. The upper part consists of an upsampling module and a long residual connection, which preserves the spectral information in the LR-HSI to the greatest extent; the lower part consists of convolutional layers, Efficient Transformer layers, and the Spatial-Spectral Attention Module, which preserves the spatial information and the remaining spectral information.

Input of Transformer.
As Figure 1 shows, the network takes the LR-HSI y ∈ R^(h×w×S) and the HR-MSI z ∈ R^(H×W×s) as input.
The y is first upsampled to the same scale as z using bicubic interpolation to obtain y_u, which is then concatenated with the HR-MSI along the spectral dimension and immediately followed by a 3 × 3 convolutional layer; such a convolution works well for early visual processing and is more stable and effective for extracting shallow spatial-spectral features [46].
The native Transformer [47] divides the original image into nonoverlapping blocks and then stretches them into one-dimensional vectors, while adding positional embeddings to represent the positional relationships between patches. Since our test data come from images of various sizes in different scenes, this causes a parameter mismatch. For this reason, the positional embedding is removed, and the unfolding technique is used to manipulate the feature map D ∈ R^(H×W×F). The partitions of the "Unfold" operation are sequential, and it automatically reflects the position information of each patch [31]. In detail, the feature map D ∈ R^(H×W×F) is unfolded (with kernel = stride = k) into a patch sequence P_i ∈ R^(k²F), i = 1, …, N, where N = HW/k² is the total number of 1-D features. After that, the P_i are sent to the Transformer module for further processing.
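The unfold-based tokenization described above can be sketched in PyTorch as follows; the shapes and variable names are illustrative choices of ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

# Toy sizes: batch, feature channels, height, width, and patch size k.
B, Fch, H, W = 1, 8, 16, 16
k = 4  # kernel == stride, so patches do not overlap

D = torch.randn(B, Fch, H, W)                   # feature map from the conv stem
patches = F.unfold(D, kernel_size=k, stride=k)  # B x (k*k*Fch) x N
patches = patches.transpose(1, 2)               # B x N x (k*k*Fch): token sequence
N = (H // k) * (W // k)                         # number of patches: HW / k^2
```

Because kernel equals stride, the patch order alone encodes each patch's location, which is why no explicit positional embedding is needed.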

Efcient Transformer Blocks.
This part consists of several Transformer encoder modules. As illustrated in Figure 2, a single encoder block mainly consists of an efficient multihead self-attention module (EMHA [31]) and a multilayer perceptron (MLP), with layer normalization (Norm [48]) and residual connections interspersed.
Suppose the input feature P has shape B × N × C. Due to the large number of hyperspectral image bands, the input feature dimension C after patch division is too high, which would lead to too many network training parameters; the model would easily fall into overfitting, making the network difficult to train. Therefore, we add a reduction layer, consisting of a fully connected operation, to reduce the dimension of the input features by a factor of n. Then, for the input feature P ∈ R^(B×N×(C/n)), the query, key, and value matrices Q, K, and V are calculated as

Q = PW_Q, K = PW_K, V = PW_V,

where W_Q, W_K, and W_V are the projection matrices. Generally, we have Q, K, V ∈ R^(B×h×N×(C/n)), where h is the number of heads. The original MHA directly uses Q, K, and V for large-scale matrix multiplications, occupying a large amount of GPU memory and computational resources: when calculating directly with Q and K, the shape of the self-attention matrix is B × h × N × N, and a matrix multiplication with V is then performed. However, hyperspectral images usually have high resolution, so N is very large after dividing into patches. Obviously, direct calculation is not suitable for hyperspectral data.
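A minimal PyTorch sketch of the reduction layer and the Q/K/V projections, with toy sizes of our own choosing, makes the memory issue concrete: the last line forms the B × h × N × N attention map whose size grows quadratically in N.

```python
import torch
import torch.nn as nn

B, N, C, n, heads = 1, 64, 32, 4, 4
P = torch.randn(B, N, C)

reduce = nn.Linear(C, C // n)        # fully connected reduction layer
proj_q = nn.Linear(C // n, C // n)   # W_Q
proj_k = nn.Linear(C // n, C // n)   # W_K
proj_v = nn.Linear(C // n, C // n)   # W_V

Pr = reduce(P)                       # B x N x (C/n)
d = (C // n) // heads                # per-head dimension

def split_heads(t):
    # B x N x (C/n) -> B x h x N x d
    return t.view(B, N, heads, d).transpose(1, 2)

Q, K, V = split_heads(proj_q(Pr)), split_heads(proj_k(Pr)), split_heads(proj_v(Pr))
# Native MHA would materialize this full B x h x N x N map -- prohibitive for large N.
attn = (Q @ K.transpose(-2, -1)) / d ** 0.5
```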
In SR tasks, the predicted pixels of the super-resolution image usually depend only on a local neighbourhood in the LR image; however, this local neighbourhood is much larger than a CNN's receptive field. The spatial and temporal complexity of the model can be reduced by dividing the features into blocks. Hence, we use the feature split module (FSM [31]) to divide Q, K, and V into s segments, where the ith segment is denoted by the triplet (Q_i, K_i, V_i); the size of the local neighbourhood is controlled by s. Each segment performs a self-attention calculation separately, and the intrasegment self-attention matrix O_i is computed as

O_i = Softmax(Q_i K_i^T / √d) V_i,

where d is the per-head feature dimension. Then, the results of all segments are combined to generate the complete attention matrix O. GPU memory usage is further reduced significantly by using segmentation for the self-attention computation.

Figure 2: The Transformer encoder module used in the network. The attention function is performed h times in parallel, and the results are concatenated to form multihead self-attention (MHA).
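The segmented attention of the FSM can be sketched as follows. This is a hedged illustration with toy shapes; the actual module in [31] differs in implementation detail.

```python
import torch

# Toy shapes: batch, heads, tokens, per-head dim, number of segments.
B, h, N, d, s = 1, 4, 64, 8, 4
Q = torch.randn(B, h, N, d)
K = torch.randn(B, h, N, d)
V = torch.randn(B, h, N, d)

seg = N // s
outs = []
for i in range(s):
    Qi = Q[:, :, i * seg:(i + 1) * seg]   # B x h x (N/s) x d
    Ki = K[:, :, i * seg:(i + 1) * seg]
    Vi = V[:, :, i * seg:(i + 1) * seg]
    # intrasegment self-attention: O_i = Softmax(Q_i K_i^T / sqrt(d)) V_i
    Ai = torch.softmax(Qi @ Ki.transpose(-2, -1) / d ** 0.5, dim=-1)
    outs.append(Ai @ Vi)
O = torch.cat(outs, dim=2)                # recombined output, B x h x N x d
```

Each segment's attention map is only (N/s) × (N/s) instead of N × N, so the attention-map memory shrinks by roughly a factor of s.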
Next, a multilayer perceptron (MLP) with two fully connected layers and a GELU nonlinearity between them is used for further feature transformation. Layer Norm (LN) layers are placed before the MHA and the MLP, and both modules are wrapped with residual connections. Finally, to be consistent with the dimensions of the original input features, we restore them to the original dimension through an expansion layer. The whole process is formulated as

P' = MHA(Norm(P)) + P,
P'' = MLP(Norm(P')) + P',

where P denotes the input to the efficient Transformer block.
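The two residual sub-blocks can be written out as a minimal pre-norm encoder step. Here PyTorch's stock nn.MultiheadAttention stands in for the EMHA (it has neither the reduction layer nor the FSM), so this is only a structural sketch with toy sizes.

```python
import torch
import torch.nn as nn

B, N, C = 1, 16, 32
P = torch.randn(B, N, C)

norm1, norm2 = nn.LayerNorm(C), nn.LayerNorm(C)
mha = nn.MultiheadAttention(C, num_heads=4, batch_first=True)  # stand-in for EMHA
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))

x = norm1(P)
P = mha(x, x, x)[0] + P   # P' = MHA(Norm(P)) + P
P = mlp(norm2(P)) + P     # P'' = MLP(Norm(P')) + P'
```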

Spatial-Spectral Attention Module.
Spectral characteristics are another important feature of HSI. Conventional convolution operations usually act on all bands at once, which leads to spectral confusion and distortion. In addition, the Transformer prefers to capture low-frequency information and lacks local high-frequency information. To address these problems, we use CBAM [32] (see Figure 3) as a spatial-spectral attention module to refine and correct the features. In detail, the weight descriptor of each channel is first calculated along the spatial dimension and multiplied with the corresponding channel of the feature map F to make it consistent with the GT spectral features, which plays an important role in correcting the spectrum. Next, spatial weight descriptors are computed along the spectral dimension and multiplied with the pixels at each corresponding location to enhance important regions of the image, such as edges. The specific calculation steps are as follows:

M_c = σ(MLP(AvgPool_s(F)) + MLP(MaxPool_s(F))),
M_s = σ(f([AvgPool_c(F'); MaxPool_c(F')])),

where the subscripts s and c indicate pooling along the spatial dimension and along the channel dimension, respectively, AvgPool and MaxPool represent average and maximum pooling operations, F' = M_c ⊗ F is the channel-refined feature, f is a convolution, and σ is the sigmoid operation.

As shown in Figure 4, the upsampled LR-HSI y_u and the GT HR-HSI x ∈ R^(H×W×S) have the same number of bands, and most of the spectral information of the HR-HSI is contained in the upsampled LR-HSI; we plot the spectral vectors of GT and y_u at one location in Figure 4 to confirm this. Therefore, in order to maximize the retention of spectral information, y_u is passed to the end of the network through a long residual connection and summed directly with the output features of the other branch of the network, followed by a 3 × 3 convolution for adaptive spatial-spectral feature fusion, finally generating the high-resolution hyperspectral image I. The remaining spectral information is acquired by the other branch of the network.
In addition, y_u contains mostly low-frequency information, and transmitting it directly to the end lets the network focus on learning high-frequency information, reducing the pressure on the network to reconstruct the whole HR-HSI.
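The channel-then-spatial refinement performed by CBAM can be sketched as follows; this is a simplified stand-in for the module of [32], with toy layer sizes of our own choosing.

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 31, 16, 16
F_in = torch.randn(B, C, H, W)

# Channel attention: pool over space, share an MLP, gate each band.
mlp = nn.Sequential(nn.Linear(C, C // 4), nn.ReLU(), nn.Linear(C // 4, C))
avg = F_in.mean(dim=(2, 3))                       # B x C, AvgPool over space
mx = F_in.amax(dim=(2, 3))                        # B x C, MaxPool over space
Mc = torch.sigmoid(mlp(avg) + mlp(mx)).view(B, C, 1, 1)
F_c = F_in * Mc                                   # spectrally refined features

# Spatial attention: pool over channels, convolve, gate each pixel.
conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
pooled = torch.cat([F_c.mean(dim=1, keepdim=True),
                    F_c.amax(dim=1, keepdim=True)], dim=1)  # B x 2 x H x W
Ms = torch.sigmoid(conv(pooled))
F_out = F_c * Ms                                  # spatially refined features
```

The channel gate rescales each spectral band as a whole (spectral correction), while the spatial gate rescales each pixel location (emphasizing edges and other important regions).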

Loss Function.
In order to measure super-resolution performance, several cost functions have been studied to make the super-resolution result close to the real high-resolution ground truth. In the current literature, L2 and L1 are the most used and relatively reliable loss functions. Compared with the L1 loss, the MSE loss function is beneficial to the peak signal-to-noise ratio (PSNR), but it has several limitations, such as slow convergence and over-smoothing. Therefore, we use the L1 loss to measure the accuracy of network reconstruction, defined as the mean absolute error (MAE) between all reconstructed images and the ground truth:

L_1 = (1/N) Σ_{i=1}^{N} ||I^i − x^i||_1,

where the superscript i denotes the ith of the N total training images. This loss preserves the spatial information of the super-resolution results well. However, the correlation between spectral features is ignored, and the reconstructed spectral information may be distorted. To simultaneously ensure the spectral consistency of the reconstruction results, we apply a spectral angle loss between the reconstructed image and the ground truth:

L_SAM = (1/N) Σ_{i=1}^{N} arccos((Σ_{s=1}^{S} I^{i,s} · x^{i,s}) / (||I^i||_2 ||x^i||_2)),

where the superscript s denotes the sth of the S total spectral bands. In summary, the final objective used to optimize the model is the weighted sum of the previous two losses,

L = L_1 + α L_SAM,

where α balances the contributions of the different losses. In our experiments, we set it to the constant α = 0.5.
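The combined loss can be sketched as follows. This is a simplified per-batch version with our own normalization choices; the exact averaging in the original may differ.

```python
import torch

def fusion_loss(pred, gt, alpha=0.5, eps=1e-8):
    """L = L1 + alpha * spectral-angle term, for B x S x H x W tensors."""
    l1 = (pred - gt).abs().mean()                 # mean absolute error
    # spectral angle per pixel: angle between the band vectors of pred and gt
    dot = (pred * gt).sum(dim=1)
    cos = dot / (pred.norm(dim=1) * gt.norm(dim=1) + eps)
    sam = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7)).mean()
    return l1 + alpha * sam

pred = torch.rand(2, 31, 8, 8)   # batch of reconstructed patches
gt = torch.rand(2, 31, 8, 8)     # ground-truth patches
loss = fusion_loss(pred, gt)     # alpha = 0.5 as in the paper
```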

Data Sets.
We conduct a detailed analysis and evaluation of our proposed method on three public hyperspectral image datasets: two natural hyperspectral image datasets, the CAVE dataset [49] and the Harvard dataset, and one remote sensing hyperspectral image dataset, the Chikusei dataset [50].

Computational Intelligence and Neuroscience
The abovementioned three datasets serve as the ground-truth high-spatial-resolution HR-HSI, and the corresponding HR-MSI is generated using the appropriate camera spectral response function (SRF): the CAVE and Harvard datasets (see Figure 5) use the Nikon D700 SRF, and the Chikusei dataset uses the Canon EOS 5D Mark II SRF.

Comparison Methods.
Five state-of-the-art hyperspectral super-resolution methods are selected as baselines for comparison with our proposed method. Among them, three are traditional fusion methods, namely, CSTF [16], FUSE [51], and GLP-HS [52], and the other two are deep learning fusion methods, MHFnet [34] and HSRnet [26]. For a fair comparison, all comparison methods use their publicly released code. In addition, the training datasets for HSRnet and MHFnet are consistent with those in this paper.

Evaluation Measures.
Six quantitative image quality metrics widely used in the imaging domain are used to comprehensively evaluate the performance of our proposed method, namely, Cross Correlation (CC), Spectral Angle Mapping (SAM) [53], Root Mean Square Error (RMSE), the relative dimensionless global error in synthesis (ERGAS) [54], Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) [55]. PSNR and SSIM evaluate the spatial reconstruction quality of each band of the image, CC and SAM evaluate the spectral reconstruction quality, and RMSE and ERGAS evaluate the overall reconstruction error.

Performance on CAVE Dataset.
The CAVE dataset contains 32 indoor HSIs captured under controlled lighting conditions, each of size 512 × 512 with 31 bands covering 400 to 700 nm in 10 nm steps.
For the CAVE dataset, 20 images are randomly selected as the training set and the rest are used for testing. First, we normalize the data to the range [0, 1] and randomly crop each image to extract 3920 patches of size 64 × 64 × 31 as HR-HSI. Then, bicubic downsampling of the HR-HSI (the OpenCV-Python resize function) is used to generate the corresponding LR-HSI patches, with a downsampling factor of 4. HR-MSI patches are generated with the Nikon D700 spectral response function, as in most prior experiments; 80% of the training set is used for training and 20% for validation.
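The patch degradation pipeline just described can be sketched as follows, with torch's bicubic interpolation standing in for the OpenCV resize used in the paper and a random placeholder in place of the actual Nikon D700 SRF.

```python
import torch
import torch.nn.functional as F

# One normalized HR-HSI patch: batch x bands x height x width.
hr_hsi = torch.rand(1, 31, 64, 64)

# Factor-4 bicubic downsampling produces the simulated LR-HSI patch.
lr_hsi = F.interpolate(hr_hsi, scale_factor=0.25,
                       mode='bicubic', align_corners=False)

# Placeholder SRF (3 RGB channels x 31 bands); rows normalized to sum to 1.
srf = torch.rand(3, 31)
srf = srf / srf.sum(dim=1, keepdim=True)

# Applying the SRF over the band dimension yields the simulated HR-MSI patch.
hr_msi = torch.einsum('sc,bchw->bshw', srf, hr_hsi)
```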
We test the trained model directly on the 11 test images. Table 1 gives the average metric results over the 11 test images for the different methods. To give the reader an intuitive feeling, we select a test image, flower, and present pseudo-colour images of the fusion results and the corresponding error maps for the different methods. Table 2 gives the metrics of the different methods on this image. It is obvious that our method outperforms the other comparative methods. As can be seen from the error maps in Figure 6, the error of our proposed method is smaller than that of the comparison methods, which shows that HMFT is more effective in recovering both fine-grained texture and coarse-grained structure. In contrast, the fusion result of the HSRnet method has some obvious light spots, while the image outline is still clearly visible in the MHFnet error map, and its error is large. In addition, we plot the spectral vectors of the selected image to observe the spectral fidelity (see Figure 7). The spectral vectors of the fusion results of our method are the most similar to those of the GT.

Figure 3: The architecture of the Convolution Block Attention Module (channel attention followed by spatial attention).

Performance on Harvard Dataset.
We select the upper left corner (1000 × 1000) of each image and then randomly select 10 images for testing. As in the previous settings, the raw data are regarded as HR-HSI, and the LR-HSI and HR-MSI are generated in the same way as in the CAVE experiments. Consistent with the HSRnet approach, we test the models directly on the Harvard dataset without any retraining or fine-tuning, so the performance on the Harvard dataset directly reflects the generalization ability of the models.
The Harvard dataset covers the same spectral bands as CAVE, so no additional training is performed, and the 10 randomly selected images are tested directly. Table 3 gives the average metric results for the different methods. Likewise, we select a test image, computer, and plot pseudo-colour images of the fusion results of the different methods, the corresponding error maps, and spectral vectors. Table 4 gives detailed metrics of the different methods on this image. From Figure 8, there is a significant colour difference in the fusion results of MHFnet and HSRnet. In addition, the ERGAS and SAM values of CSTF and MHFnet fluctuate significantly across images, indicating that these models are sensitive to the parameters of different images and have weak generalization ability. In contrast, our proposed method is stable on all metrics, indicating that its generalization ability is better than that of the other methods. Figure 9 also shows that our spectral fidelity is better than that of the other methods.

Performance on Chikusei Dataset.
In order to demonstrate the performance of our proposed method on hyperspectral remote sensing images, we conduct experiments on the Chikusei dataset. The Chikusei dataset contains an airborne HSI taken by a visible and near-infrared (VNIR) imaging sensor over agricultural and urban areas in Chikusei, Ibaraki, Japan. The hyperspectral data have 128 bands in the spectral range from 363 nm to 1018 nm, and the scene consists of 2517 × 2335 pixels.
Likewise, the original data are treated as HR-HSI, and the LR-HSI is simulated according to the previous experimental method. The HR-MSI is generated by the Canon EOS 5D Mark II spectral response function. After that, we select the 1024 × 2048 region in the upper left corner as training data and randomly crop 3920 overlapping patches of size 64 × 64. Eight nonoverlapping 512 × 512 patches are cropped from the remainder as test data. Table 5 gives the average metrics on the test data for the different comparison methods. Clearly, our method outperforms the other comparison methods on every metric. Likewise, we select an image and display its pseudo-colour image and error map for visual comparison; Table 6 gives the corresponding metrics. It can be clearly seen from Figure 10 that the fusion results of FUSE, GLP-HS, and CSTF are blurry and contain obvious spectral distortion. In addition, we also plot spectral vectors to observe the spectral fidelity (see Figure 11). From both a visual and a quantitative perspective, our method still performs well on hyperspectral remote sensing images.

Ablation Study
(1) Convolution Layer Analysis. The Transformer focuses more on global features, and its self-attention mechanism can capture the global interactions between contexts well, while convolution focuses more on local features and can capture rich local details. We believe that an effective combination of the two can better learn the spatial-spectral information representation. To verify the validity of the convolution layers, we compare our model with a variant that has no convolutional layers. Tables 7 and 8 show the average metrics of the two networks on the CAVE dataset and the Harvard dataset. All metrics are improved, so the network with convolution performs better.
(2) Feature Split Module Analysis. Hyperspectral images usually have high resolution, and direct calculation with the native Transformer leads to huge computation and storage costs and may even cause memory overflow. In the SR task, we consider that the predicted pixels of the super-resolution image usually depend only on a local neighbourhood in the LR image. Therefore, the same effect can be achieved by using the feature split module (FSM) to divide the features into blocks and then compute the attention. To demonstrate the effectiveness of the FSM, we conduct detailed comparative experiments on it. Table 9 shows the average quality metrics of the two models on the CAVE test images. Obviously, the network with the FSM module performs better; in particular, the test time differs by a factor of 10 and the memory usage by a factor of 4, where the memory usage is obtained as the memory difference before and after the test module runs.
(3) Convolutional Attention Mechanism Analysis. In order to pay more attention to informative features such as the high-frequency information conducive to HR-HSI reconstruction and to explore the correlation between spectra, we add a convolutional attention mechanism to further refine the extracted features in the spatial and spectral dimensions, respectively. To demonstrate its effectiveness, we compare networks with and without the convolutional attention mechanism. Table 10 shows the comparison results on the 11 test images of the CAVE dataset. The network with the convolutional attention mechanism performs better.

Conclusions
In this paper, we propose a simple and efficient hyperspectral and multispectral fusion method. The network first uses convolution to extract shallow details and then uses the Transformer to extract the spatial and spectral information of the LR-HSI and HR-MSI. We add the FSM module to the MHA module of the Transformer to reduce computational and memory costs. In addition, the CBAM module is added so that the network focuses on more important channels and regions. Finally, a joint L1 and spectral angle loss function is used to train the whole network.
In future work, the HSI and MSI fusion method proposed in this paper will be extended in two directions. On the one hand, multiscale techniques will be considered to further improve the feature extraction capability of the network; on the other hand, we will try to further reduce the computational and memory cost of the model.

Conflicts of Interest
The authors declare that they have no conflicts of interest.