Deep Convolutional Asymmetric Autoencoder-Based Spatial-Spectral Clustering Network for Hyperspectral Image

Due to the complex properties of hyperspectral images (HSI), such as spatial-spectral structure, high dimensionality, and great spectral variability, HSI clustering is a challenging task. In this paper, we propose a novel deep convolutional asymmetric autoencoder-based spatial-spectral clustering network (DCAAES²C-Net), which employs a convolutional autoencoder (CAE) and an asymmetric autoencoder to exploit spatial-spectral information. First, we use a CAE to extract spatial-spectral features. Then, we introduce an asymmetric autoencoder between the encoder and decoder of the CAE to suppress non-material-related spatial information in the latent features, such as shading and texture. By training the proposed networks with a collaborative strategy, we obtain low-dimensional feature representations. Furthermore, we improve the k-means algorithm with the concept of over-clustering to handle fuzzy representations that are difficult to assign to a cluster, and use it to obtain the final HSI clustering result. Experimental results demonstrate that the proposed method outperforms other methods on frequently used hyperspectral image datasets.


Introduction
Recent years have witnessed rapid progress in remote sensing technology, which has promoted the study of hyperspectral remote sensing [1]. HSI is captured by hyperspectral sensors such as hyperspectral imaging spectrometers, which can image regions of interest with nanoscale spectral resolution, gathering rich spectra that capture information about numerous ground objects [2,3]. An HSI is a 3D data cube with tens to hundreds of bands that contains information on various ground objects and allows fine-grained ground object classification using deep networks [4]. It has been widely employed in a variety of fields, such as mineral exploration [5,6], vegetation monitoring [7,8], quantitative inversion of physical and biological parameters [9,10], and military reconnaissance [11,12]. Deep networks that employ supervised learning, however, typically require a substantial quantity of labeled data. Unfortunately, sample collection is time consuming, labor intensive, and costly in practice, and training samples may be unavailable in some remote and uninhabited areas, which severely limits the application of hyperspectral remote sensing [13]. Thus, to increase the application potential of hyperspectral remote sensing, unsupervised ground object recognition theories and methods need to be developed to overcome the dependence on labeled samples and prior information.
Generally, learning-based HSI clustering methods include two important elements: the clustering algorithm and feature extraction. Clustering is a typical unsupervised information analysis technique: it does not rely on any training samples but achieves a natural division of pixels only by mining the essential characteristics of the data, which makes classification possible without prior information. According to the principle and working mechanism of the clustering algorithm, Zhai et al. [13] summarized hyperspectral clustering into nine categories of methods: centroid-based methods [14,15], density-based methods [16,17], probability-based methods [18], bionics-based methods [19], intelligent computing-based methods [20], graph-based methods [21,22], subspace clustering methods [23,24], deep learning-based methods [25], and hybrid mechanism-based methods [26]. Centroid-based and density-based methods are widely used. The centroid-based approach was the first to be introduced into hyperspectral clustering analysis, and it is also one of the most classic clustering methods.
The purpose of feature extraction is to find a mapping from a high-dimensional space to a low-dimensional space that reduces redundant information while preserving crucial information. In the early years, researchers focused on extracting HSI features with linear transformations, such as linear discriminant analysis [27], independent component analysis [28], minimum noise separation transformation [29], and some PCA-based methods [30,31]; traditional clustering algorithms were then applied to obtain clustering results. However, due to the complicated properties of HSI data, the performance of these approaches is limited [32,33]. Nowadays, learning-based methods are a more advanced approach to HSI clustering. They are widely used to tackle the nonlinearity problem and exceed the performance of many traditional methods. There are generally two types of learning-based clustering algorithms: spectral-only methods and spatial-spectral methods. Spectral-only methods, such as automatic fuzzy clustering based on adaptive multi-objective differential evolution (AFCMDE) [34], scalable graph-based clustering with nonnegative relaxation (SGCNR) [35], and a robust manifold matrix factorization-based method (RMMF) [36], cluster HSI pixels by learning feature representations in the spectral domain. A specific land-cover class is intuitively represented by an area with multiple pixels, so the center pixel and its neighboring pixels are most likely from the same category. However, approaches that exclusively use spectral information discard the spatial relationship between neighboring pixels. Thus, some researchers introduced spatial-spectral HSI clustering algorithms that combine spatial and spectral information to obtain more discriminative features; clustering methods are then applied to these spatial-spectral feature representations to produce the final clustering result. Lei et al. [24] proposed a deep spatial-spectral subspace clustering network (DS³C-Net), which employs a multiscale autoencoder and self-expressive layers to explore spatial-spectral information and learn the subspace structures, and then uses spectral clustering to generate the final result. Murphy and Maggioni [37] integrated spectral-spatial diffusion geometry into the diffusion learning algorithm, which achieves competitive performance and allows high-dimensional HSI data to be analyzed in a manner that respects both the intrinsic pixel geometry of the data and the spatial regularity of the 2D image structure. Nalepa et al. [38] used a 3D convolutional autoencoder to extract HSI features and achieved good results in unsupervised segmentation.
A specific land-cover class in HSI data is generally represented by an area of multiple pixels with similar spectral characteristics; thus, making better use of spatial information and extracting discriminative spatial-spectral features is critical for the HSI clustering task. Most spatial-spectral methods are based on convolutional networks, so their features include a large amount of spatial information such as shadow, texture, and geometric information, owing to the characteristics of convolutional networks. Shen et al. [39] showed that the actual reflectivity of a substance is related only to its material, and that the shadows and textures generated by the interaction of light with the shape of the substance's surface interfere with its actual properties; such information is called non-material-related spatial information. Kang et al. [40] showed that non-material-related spatial information is meaningless for the clustering task and that hyperspectral images are mainly classified based on the similarity of the spectral properties of substances. In addition, their method demonstrates that effectively removing useless spatial information, such as shading and texture, which is not directly related to the material of different objects, yields better results. Unlike classification, clustering is unsupervised and is therefore very sensitive to the characteristics of the data itself. For this reason, effectively removing such non-material-related information from the feature vector is crucial.
In this letter, we concentrate on investigating spatial-spectral information from pixel patches and on using the concept of over-clustering to improve the k-means algorithm. Our key contributions include the following:
(1) We propose a novel deep convolutional asymmetric autoencoder-based spatial-spectral clustering network (DCAAES²C-Net) to extract discriminative spatial-spectral features.
(2) An asymmetric autoencoder is introduced to suppress non-material-related spatial information in the feature representations generated by the CAE.
(3) We improve the k-means algorithm with the concept of over-clustering to handle fuzzy representations that are difficult to assign to a cluster.
The remainder of this paper is organized as follows. Section 2 presents the proposed DCAAES²C-Net for unsupervised spatial-spectral feature learning. Section 3 reports and discusses experimental results over three benchmark hyperspectral datasets. Finally, conclusions are drawn in Section 4.

Method
A simple autoencoder is a three-layer feed-forward fully connected network. Units in each layer are connected to all units in the next layer. The sizes of the input layer and the output layer are both equal to the input size. According to the universal approximation theorem, deepening the network can provide more advantages. Thus, a deep autoencoder is typically used to learn feature representations. We show this network in Figure 1.
The convolutional autoencoder (CAE) extends this fundamental structure by replacing the fully connected layers with convolution layers. The input layer and output layer sizes are the same as in the standard autoencoder, while the decoder network changes to convolution layers and the decoder …
In this paper, we first employ a CAE to learn the spatial-spectral information, and the input data of the network is changed from a spectral vector to a patch. In general, it is considered that the center pixel and its neighborhood contain correlated information, and the main purpose of introducing spatial information is to use the correlation between the center pixel and its neighborhood to enhance the features of the center pixel [41]. However, because the convolutional network is very sensitive to the geometric characteristics of the image, some non-material-related spatial information, such as the shape of edges, textures, and shadows, is also incorporated in the feature vector. For example, when we use the CAE to learn the information of a point near the edge of a land-cover region, the feature vector will contain information about the edge's shape, which becomes the principal component of the feature vector. As a result, the feature vectors of edge points cannot correctly describe their material. In the following, we call this kind of non-material-related spatial information spatial noise. To address this problem, we designed an asymmetric autoencoder stacked on the pretrained CAE to suppress the spatial noise in the output of the CAE, as shown in Figure 3.
Algorithm 1: Over-clustering k-means
Input: HSI feature vectors, n, k.
1: Set k according to the number of goal clusters; n is 2~3 times k
2: repeat
3: Run k-means clustering with n clusters
4: Keep the top k clusters with the most elements; the other clusters are set to the background cluster.
In this process, since the hidden layer compresses the output of the CAE and decompresses it to the corresponding spectrum, the spatial noise components are discarded. $f_1(x)$ represents the encoding map of the CAE, $g_1(x)$ represents the decoding map of the CAE, $x_u$ is the input data of the convolutional network, and $u$ is the original spectral information of the central pixel of $x_u$. For the CAE, suppose there is a mapping such that, for any $\varepsilon$, equation (2) holds. Because $v$ can completely reconstruct $x_u$, it is believed that the feature vector $v$ can represent the information contained in … The overall correlation between $u$ and $x_u$ can be measured using the reciprocal of the Euclidean distance between the spectral vector $u$ at the target point and the feature vector $u'$ at the network output, as shown in equation (4). The information in $v$ that is only weakly correlated with $u$ is discarded by an asymmetric autoencoder, and a dropout layer is set in the decoder part to guarantee that the spatial information is not discarded in its entirety. The encoding mapping of the deep asymmetric autoencoder is denoted by $f_2(x)$, and the decoding mapping is denoted by $g_2(x)$.
Maximizing the correlation between $u$ and $x_u$ is then equivalent to maximizing the correlation between $u$ and $u'$, as shown in equation (6), when $d(u', u) \to 0$.
In the Asymmetric-AE, the input data is $v$, and the reconstruction target is the original spectral information of the target pixel. Its main function is to enhance the spectral information of the target pixel contained in $v$ and to suppress the spatial information that is irrelevant to the target pixel, that is, the spatial noise.
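As a reading aid, the relationships described above can be sketched as follows. This is one plausible formalization, assuming a squared-error reconstruction loss (MSE is the loss named in the experimental settings); it is not necessarily the exact form of equations (1) to (6). The symbols $\hat{x}_u$, $\mathcal{L}_{\mathrm{CAE}}$, $\mathcal{L}_{\mathrm{Asym}}$, and $\rho$ are notation introduced only for this sketch.

```latex
% CAE: encode the patch x_u, then reconstruct the whole patch.
v = f_1(x_u), \qquad \hat{x}_u = g_1(v), \qquad
\mathcal{L}_{\mathrm{CAE}} = \lVert \hat{x}_u - x_u \rVert_2^2

% Asymmetric-AE: encode v, then reconstruct only the center-pixel spectrum u.
u' = g_2\bigl(f_2(v)\bigr), \qquad
d(u', u) = \lVert u' - u \rVert_2, \qquad
\mathcal{L}_{\mathrm{Asym}} = d(u', u)^2

% Correlation as the reciprocal of the distance: it grows as d(u', u) \to 0.
\rho(u, u') = \frac{1}{d(u', u)}
```

Under these assumptions, minimizing $\mathcal{L}_{\mathrm{Asym}}$ drives $d(u', u)$ toward 0 and therefore increases the correlation measure. The asymmetry lies in the targets: the input is the patch-level feature $v$, while the output is the single spectrum $u$.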

2.2. Over-Clustering K-Means. The k-means clustering algorithm is a centroid-based clustering method. It is sensitive to outlier noise points, which destroy the stability of clustering and strongly affect clustering accuracy. Density-based methods, such as the DBSCAN clustering algorithm [42], are not sensitive to such noise points. By prior assumption, the distribution of noise vectors in feature space is sparse. Therefore, when DBSCAN is used to cluster hyperspectral data, the noise vectors are not assigned to any of the final clusters. Based on this idea, we improve the k-means clustering algorithm so that, like DBSCAN, it can separate the noise vectors in sparse regions. The improved method is shown in Algorithm 1. The cost function of the improved k-means algorithm is shown in equation (7), where k is the number of goal clusters and n is a hyper-parameter that is always larger than k.
The noise points are assigned to the background cluster, and the cost function $E$ is then equivalent to $E'$, where $\mu_0$ … By prior assumption, the values of all elements in the background cluster are 0, that is, $\mu_0 = 0$. The distance within the noise cluster is generally larger than that within a non-noise cluster. Therefore, the following formula can be derived.

Wireless Communications and Mobile Computing
It can be seen that by dividing the noise points into background clusters, the cost function is further optimized while removing the noise points, and the clustering performance is better.
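The over-clustering idea described in this section (run k-means with n > k clusters, keep the k most populated clusters, and send the remaining clusters to a background cluster) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the random initialization, the optional `init` argument, the use of label 0 for the background cluster, and the single pass (the stopping rule of the "repeat" step is not specified) are all assumptions.

```python
import math
import random
from collections import Counter


def kmeans(points, n_clusters, init=None, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    if init is None:
        # Assumption: centers are initialized from randomly chosen points.
        init = random.Random(seed).sample(points, n_clusters)
    centers = [list(c) for c in init]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [
            min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def over_clustering_kmeans(points, k, n, init=None, seed=0):
    """Cluster into n > k groups, keep the k largest, label the rest 0."""
    if n <= k:
        raise ValueError("n must be larger than k")
    raw = kmeans(points, n, init=init, seed=seed)
    # The k most populated clusters become labels 1..k.
    kept = [cluster for cluster, _ in Counter(raw).most_common(k)]
    remap = {cluster: new for new, cluster in enumerate(kept, start=1)}
    # Every other cluster is merged into the background cluster (label 0).
    return [remap.get(cluster, 0) for cluster in raw]
```

With two dense groups and a few isolated points, calling `over_clustering_kmeans(points, k=2, n=4)` tends to leave the isolated points in small clusters of their own, which are then relabeled as background. For real HSI feature sets, an optimized k-means implementation would replace the pure-Python loop.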

Experiments Setting and Dataset.
In this paper, all experiments are carried out on a PC equipped with an Intel Core i7-10700K CPU and a single GeForce RTX 3070 GPU.
The Salinas scene is used as the experimental data in this paper. The size of the image is 512 × 217, and it contains 16 classes. We select 10 categories for research. The data labels of the 10th, 13th, and 16th categories are set to zero. The 3rd and 5th categories are merged into one category. No changes are made to categories 1, 2, 4, 6, 7, 8, 9, 11, and 12, as shown in Figure 4.
We first normalize the hyperspectral data by min-max normalization. To prevent mutual influence between bands, each band is normalized separately. To evaluate the effectiveness of the feature extraction network more accurately, we use bootstrap sampling to draw the training set, and the elements that do not appear in the training set are used as the test set [43]. … Tables 1 and 2. The number of epochs for the CAE is set to 300 with a batch size of 256. The number of epochs for the Asymmetric-AE is set to 1000 with a batch size of 1000. Adam is used as the optimizer, MSE as the loss function, and ReLU as the activation function.
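The two preprocessing steps described above, per-band min-max normalization and a bootstrap train/test split, can be sketched as follows. This is a minimal illustration under stated assumptions: the image is flattened to a list of spectra (one list of band values per pixel), a constant band is mapped to 0, and the function names `minmax_per_band` and `bootstrap_split` are hypothetical.

```python
import random


def minmax_per_band(pixels):
    """Scale every band (column) to [0, 1] independently of the others."""
    bands = list(zip(*pixels))  # one tuple of values per band
    lows = [min(band) for band in bands]
    highs = [max(band) for band in bands]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant band -> 0
            for value, lo, hi in zip(spectrum, lows, highs)
        ]
        for spectrum in pixels
    ]


def bootstrap_split(n_samples, seed=0):
    """Draw n_samples indices with replacement for training.

    Indices that were never drawn form the test set.
    """
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test
```

Because sampling is done with replacement, some indices appear several times in the training set and, for large sample counts, roughly a third of the samples are typically left for testing.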
3.3. Evaluation. This paper uses PSNR and SSIM as the evaluation criteria for the similarity between the input image and the reconstructed image [44,45]. For the CAE, 10 input images are randomly selected from the test set for reconstruction. The original images and their reconstructions are shown in Figure 5. The PSNR and SSIM between these sub-images and their reconstructed images are shown in Figure 6. Across the ten pairs, the PSNR ranges from 31 dB to 46 dB and the SSIM from 0.967 to 0.994. It can be seen that the CAE can learn the information in the sub-image and compress it into a feature vector of smaller dimension. The Asymmetric-AE is trained on the feature vectors generated by the CAE. Since the output of the Asymmetric-AE is the spectral vector, the similarity between the original image and its reconstructed image is analyzed directly. The PSNR and SSIM between the original image and the reconstructed image are about 41.67 dB and 0.983, respectively. The reconstructed image and the original image of the Asymmetric-AE are shown in Figure 7. The PSNR and SSIM of each band are shown in Figure 8. It can be seen that PSNR and SSIM remain high in general, and the original image can still be reconstructed well after the feature vector is reduced again.
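For reference, PSNR can be computed from the mean squared error as sketched below. The default peak value of 1.0 is an assumption that matches data scaled to [0, 1] by min-max normalization; the function name `psnr` and the flat-sequence input are illustrative choices, not taken from the paper.

```python
import math


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A 2D patch or a full band can be passed after flattening it into a single sequence. SSIM additionally compares local means, variances, and covariances and is usually taken from an image-processing library rather than written by hand.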
The feature extraction effectiveness of the proposed model is compared with that of other typical autoencoder models in Table 3. Figure 9 shows the label map of noise points in the DBSCAN clustering of the CAE features and of the features of the model proposed in this paper (neighborhood parameters $(\epsilon, \mathrm{MinPts}) = (3, 200)$).
The number of noise points in the clustering results of the CAE is 9997, and the number of noise points in the clustering results of the proposed model is 6996. The comparison shows that the features extracted by the proposed model are better suited to clustering, but some spatial noise remains. Therefore, the improved k-means clustering algorithm is used to further reduce the influence of noise points on clustering accuracy.
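The notion of a DBSCAN noise point used in this comparison can be made concrete with a brute-force sketch. A point is treated as a core point if at least MinPts points lie within ε of it, and as noise if it is not a core point and has no core point within ε. Counting the point itself in its own neighborhood follows the convention of common implementations such as scikit-learn and is an assumption about the setup used here; the function name `count_dbscan_noise` is hypothetical.

```python
import math


def count_dbscan_noise(points, eps, min_pts):
    """Count the points that DBSCAN would label as noise.

    Brute force with O(n^2) distance evaluations, for illustration only.
    """
    # Indices of all points within eps of each point (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    # Noise: not a core point and not within eps of any core point.
    return sum(
        1
        for i, nb in enumerate(neighbors)
        if not is_core[i] and not any(is_core[j] for j in nb)
    )
```

Calling `count_dbscan_noise(features, eps=3, min_pts=200)` on two feature sets gives counts that can be compared in the same way as above; for image-sized data, an indexed DBSCAN implementation is far more practical than this quadratic loop.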
To evaluate the performance of the proposed clustering method, FMI, RI, AMI, and DB are used as evaluation indicators. Among them, FMI, RI, and AMI are external indicators that require ground truth as a reference standard [46-48]; the higher the score, the better the clustering performance. DB is an internal index [49] that represents the average similarity between clusters, where similarity is measured as the ratio of the average distance within a cluster to the distance between clusters. Its lowest value is 0, and the lower the value, the better the clustering. Because of the randomness of the k-means clustering algorithm [50], each experiment consists of 10 clustering runs, and the clustering result with the highest evaluation index is taken as the final output. The clustering results of our method and other typical clustering methods on the Salinas dataset are compared in Figure 10. The clustering results on Indian Pines and PaviaU are shown in Figures 11 and 12, respectively. The average evaluation scores over the 10 runs are shown in Table 4 and plotted in Figure 13; they indicate that the clustering performance of the proposed feature extraction model combined with the improved k-means clustering algorithm is significantly better than that of other typical models (the number of clustering centers of the improved k-means clustering algorithm is 33, and the number of real clusters is 11).
Figure 13: The curve of average evaluation score.
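Two of the external indicators, RI and FMI, are defined on pairs of samples and can be sketched directly from that definition. The code below is a minimal pair-counting illustration with hypothetical function names; it is quadratic in the number of samples and is not the evaluation code used for the reported results.

```python
import math
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count sample pairs by agreement; returns (tp, fp, fn, tn)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both partitions
        elif same_pred:
            fp += 1  # together only in the prediction
        elif same_true:
            fn += 1  # together only in the ground truth
        else:
            tn += 1  # separated in both partitions
    return tp, fp, fn, tn


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two partitions agree."""
    tp, fp, fn, tn = pair_counts(labels_true, labels_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and pairwise recall."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))
```

Both scores depend only on which samples are grouped together, so they are unaffected by how the cluster labels are numbered. For full images, library routines such as scikit-learn's `rand_score`, `fowlkes_mallows_score`, `adjusted_mutual_info_score`, and `davies_bouldin_score` are the more practical choice.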
The inference runtime of our method on different datasets is shown in Table 5. Because our network compresses the information into a 10-dimensional vector, the computational cost is greatly reduced.

Conclusion
In this letter, we propose a novel DCAAES²C-Net, which explores spatial-spectral information and uses an asymmetric autoencoder to suppress the spatial noise component in the features. In addition, we use the concept of over-clustering to improve the k-means algorithm and reduce the influence of fuzzy features. Finally, the improved clustering algorithm is applied to the output of the autoencoder network to obtain the HSI clustering result. Experimental results on the Salinas scene demonstrate the effectiveness of the proposed method. The RI of the clustering results is 0.9960, an improvement of 0.4%~6.3%; the FMI is 0.9954, an improvement of 0.5%~7.8%; the AMI is 0.9536, an improvement of 2.8%~30.6%; and the DB index is 0.8359, a decrease of -17.9%~39.9%.
In the future, we will focus on exploring spatial-spectral information from multiscale patches, combining the proposed method with one-stage clustering methods, and applying this method to improve the performance of semi-supervised classification tasks.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.