Stacked Denoise Autoencoder Based Feature Extraction and Classification for Hyperspectral Images

Deep learning methods have been successfully applied to learn feature representations for high-dimensional data, where the learned features are able to reveal the nonlinear properties exhibited in the data. In this paper, a deep learning method is exploited for feature extraction of hyperspectral data, and the extracted features can provide good discriminability for the classification task. Training a deep network for feature extraction and classification includes unsupervised pretraining and supervised fine-tuning. We utilized the stacked denoise autoencoder (SDAE) method, which is robust to noise, to pretrain the network. In the top layer of the network, the logistic regression (LR) approach is utilized to perform supervised fine-tuning and classification. Since sparsity of features might improve the separation capability, we utilized the rectified linear unit (ReLU) as the activation function in SDAE to extract high level and sparse features. Experimental results using Hyperion, AVIRIS, and ROSIS hyperspectral data demonstrated that the SDAE pretraining in conjunction with the LR fine-tuning and classification (SDAE LR) can achieve higher accuracies than the popular support vector machine (SVM) classifier.


Introduction
Hyperspectral remote sensing images are becoming increasingly available and potentially provide greatly improved discriminant capability for land cover classification. Popular classification methods like k-nearest-neighbor [1], support vector machine [2], and semisupervised classifiers [3] have been successfully applied to hyperspectral images. Besides, some feature matching methods in the computer vision area can also be generalized for spectral classification [4,5].
Feature extraction is very important for classification of hyperspectral data, and the learned features may increase the separation between spectrally similar classes, resulting in improved classification performance. Commonly used linear feature extraction methods such as principal component analysis (PCA) and linear discriminant analysis (LDA) are simple and easily implemented. However, these methods fail to model the nonlinear structures of data. Manifold learning methods, which are proposed for nonlinear feature extraction, are able to characterize the nonlinear relationships between data points [1,6,7]. However, they can only process a limited number of data points due to their high computational complexity. Deep learning methods, which can also learn nonlinear features, are capable of processing large scale data sets. Therefore, we utilized deep learning for feature extraction of hyperspectral data in this paper.
Deep learning is proposed to train a deep neural network for feature extraction and classification. The training process includes two steps: unsupervised layer-wise pretraining and supervised fine-tuning. The layer-wise pretraining [8] can alleviate the difficulty of training a deep network, since the learned network weights, which encode the data structure, are used as the initial weights of the whole deep network. The supervised fine-tuning, which is performed by the logistic regression (LR) approach, aims to further adjust the network weights by minimizing the classification errors of the labeled data points. Training the network can achieve both high level features and classification simultaneously. Popular deep learning methods include autoencoders (AE) [9], denoise autoencoders (DAE) [10], convolutional neural networks (CNN) [11], deep belief networks (DBN) [12], and convolutional restricted Boltzmann machines (CRBM) [13]. In the field of hyperspectral data analysis, Chen utilized AE for data classification [14], and Zhang utilized CNN for feature extraction [15].
In this paper, we focus on the stacked DAE (SDAE) method [16], since DAE is very robust to noise, and SDAE can obtain higher level features. Moreover, since sparsity of features might improve the separation capability, we utilized the rectified linear unit (ReLU) as the activation function in SDAE to extract high level and sparse features. After the layer-wise pretraining by SDAE, an LR layer is used for fine-tuning the network and performing classification. The features of the deep network that are obtained by SDAE pretraining and LR fine-tuning are called tuned-SDAE features, and the classification approach that utilizes the LR classifier on the tuned-SDAE features is hereafter denoted as SDAE LR in this paper.
The organization of the paper is as follows. Section 2 describes the DAE, SDAE, and SDAE LR approaches. Section 3 discusses the experimental results. Conclusions are summarized in Section 4.

Methodology
Given a neural network, AE [14] trains the network by constraining the output values to be equal to the input values, which also means that the output layer has as many nodes as the input layer. The reconstruction error between the input and the output of the network is used to adjust the weights of each layer. Therefore, the features learned by AE can represent the input data well. Moreover, the training of AE is unsupervised, since it does not require label information. DAE is developed from AE but is more robust, since DAE assumes that the input data contain noise and is suitable for learning features from noisy data. As a result, the generalization ability of DAE is better than that of AE. Moreover, DAE can be stacked to obtain high level features, resulting in the SDAE approach. The training of the SDAE network is layer-wise, since each DAE with one hidden layer is trained independently. After training the SDAE network, the decoding layers are removed and the encoding layers that produce features are retained. For the classification task, a logistic regression (LR) layer is added as the output layer. Moreover, LR is also used to fine-tune the network. Therefore, the features are learned by SDAE pretraining in conjunction with LR fine-tuning.

Denoise Autoencoder (DAE)
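DAE contains three layers: input layer, hidden layer, and output layer, where the hidden layer and output layer are also called encoding layer and decoding layer, respectively. Suppose the original data is $\mathbf{x} \in \mathbb{R}^{d}$, where $d$ is the dimension of data. DAE firstly produces a corrupted vector $\tilde{\mathbf{x}}$ by setting some of the elements of $\mathbf{x}$ to zero or adding Gaussian noise to $\mathbf{x}$. DAE uses $\tilde{\mathbf{x}}$ as input data. The number of units in the input layer is $d$, which is equal to the dimension of the input data $\tilde{\mathbf{x}}$. The encoding of DAE is obtained by a nonlinear transformation function:
$$\mathbf{y} = f_e(\mathbf{W}\tilde{\mathbf{x}} + \mathbf{b}), \tag{1}$$
where $\mathbf{y} \in \mathbb{R}^{h}$ denotes the output of the hidden layer and can also be called feature representation or code, $h$ is the number of units in the hidden layer, $\mathbf{W} \in \mathbb{R}^{h \times d}$ is the input-to-hidden weights, $\mathbf{b}$ denotes the bias, $\mathbf{W}\tilde{\mathbf{x}} + \mathbf{b}$ stands for the input of the hidden layer, and $f_e(\cdot)$ is called the activation function of the hidden layer. We chose the ReLU function [17] as the activation function in this study, which is formulated as
$$f_e(t) = \max(0, t). \tag{2}$$
If the value of $\mathbf{W}\tilde{\mathbf{x}} + \mathbf{b}$ is smaller than zero, the output of the hidden layer will be zero. Therefore, the ReLU activation function is able to produce a sparse feature representation, which may have better separation capability. Moreover, ReLU can train the neural network for large scale data faster and more effectively than other activation functions.
The decoding or reconstruction of DAE is obtained by using a mapping function $f_d(\cdot)$:
$$\mathbf{z} = f_d(\mathbf{W}'\mathbf{y} + \mathbf{b}'), \tag{3}$$
where $\mathbf{z} \in \mathbb{R}^{d}$ is the output of DAE, which is also the reconstruction of the original data $\mathbf{x}$. The output layer has the same number of nodes as the input layer. $\mathbf{W}' = \mathbf{W}^{T}$ is referred to as tied weights. If $\mathbf{x}$ ranges from 0 to 1, we choose the softplus function as the decoding function $f_d(\cdot)$; otherwise we preprocess $\mathbf{x}$ by zero-phase component analysis (ZCA) whitening and use a linear function as the decoding function:
$$f_d(\mathbf{a}) = \begin{cases} \log(1 + e^{\mathbf{a}}), & \mathbf{x} \in [0, 1], \\ \mathbf{a}, & \text{otherwise}, \end{cases} \tag{4}$$
where $\mathbf{a} = \mathbf{W}'\mathbf{y} + \mathbf{b}'$. DAE aims to train the network by requiring the output data $\mathbf{z}$ to reconstruct the input data $\mathbf{x}$, which is also called reconstruction-oriented training. Therefore, the reconstruction error should be used as the objective function or cost function, which is defined as follows:
$$L(\mathbf{x}, \mathbf{z}) = \begin{cases} -\dfrac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{d}\left[x_j^{(i)}\log z_j^{(i)} + \left(1 - x_j^{(i)}\right)\log\left(1 - z_j^{(i)}\right)\right] + \lambda\|\mathbf{W}\|^{2}, & \mathbf{x} \in [0, 1], \\ \dfrac{1}{m}\sum\limits_{i=1}^{m}\left\|\mathbf{x}^{(i)} - \mathbf{z}^{(i)}\right\|^{2} + \lambda\|\mathbf{W}\|^{2}, & \text{otherwise}, \end{cases} \tag{5}$$
where the cross-entropy function is used when the value of input $\mathbf{x}$ ranges from 0 to 1; the square error function is used otherwise. $x_j^{(i)}$ denotes the $j$th element of the $i$th sample and $\|\mathbf{W}\|^{2}$ is the L2-regularization term, which is also called the weight decay term. Parameter $\lambda$ controls the importance of the regularization term. This optimization problem is solved by using the minibatch stochastic gradient descent (MSGD) algorithm [18], and $m$ in (5) denotes the size of the minibatch.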
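As a rough illustration of the encoding, decoding, and cost computation described above, the following NumPy sketch runs one forward pass of a DAE with additive Gaussian corruption, a ReLU encoder, tied weights, and either a linear or a softplus decoder. The function names, the toy sizes, the noise level, and the weight decay value are illustrative assumptions, not the configuration used in the experiments, and the sketch uses rows as samples rather than column vectors.

```python
import numpy as np

def corrupt(x, rng, noise_std=0.2):
    """Additive Gaussian corruption of a batch x (rows are samples)."""
    return x + rng.normal(0.0, noise_std, size=x.shape)

def relu(t):
    return np.maximum(0.0, t)

def softplus(a):
    # log(1 + exp(a)), computed in a numerically stable way
    return np.logaddexp(0.0, a)

def dae_forward(x_tilde, W, b, b_prime, linear_decoder=True):
    """Encode with ReLU, then decode with tied weights (decoder uses W transposed)."""
    y = relu(x_tilde @ W.T + b)      # hidden code, shape (m, h)
    a = y @ W + b_prime              # decoder pre-activation, shape (m, d)
    z = a if linear_decoder else softplus(a)
    return y, z

def squared_error_cost(x, z, W, lam):
    """Squared reconstruction error averaged over the minibatch, plus weight decay."""
    m = x.shape[0]
    return np.sum((x - z) ** 2) / m + lam * np.sum(W ** 2)

# Toy example: m = 4 samples, d = 8 bands, h = 5 hidden units (assumed sizes).
rng = np.random.default_rng(0)
d, h, m = 8, 5, 4
x = rng.random((m, d))
W = rng.normal(0.0, 0.1, size=(h, d))
b = np.zeros(h)
b_prime = np.zeros(d)

x_tilde = corrupt(x, rng)
y, z = dae_forward(x_tilde, W, b, b_prime)
cost = squared_error_cost(x, z, W, lam=1e-4)
```

Note that the cost compares the reconstruction with the clean input `x`, while the network only ever sees the corrupted `x_tilde`; this is what makes the learned code robust to noise.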

Stacked Denoise Autoencoder (SDAE)
DAE can be stacked to build a deep network which has more than one hidden layer [16]. Figure 1 shows a typical instance of the SDAE structure, which includes two encoding layers and two decoding layers. In the encoding part, the output of the first encoding layer acts as the input data of the second encoding layer. Supposing there are $L$ hidden layers in the encoding part, we have the activation function of the $l$th encoding layer:
$$\mathbf{y}^{(l)} = f_e\left(\mathbf{W}^{(l)}\mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}\right), \quad l = 1, \ldots, L, \tag{6}$$
where the input $\mathbf{y}^{(0)}$ is the original data $\mathbf{x}$. The output $\mathbf{y}^{(L)}$ of the last encoding layer is the high level features extracted by the SDAE network. In the decoding part, the output of the first decoding layer is regarded as the input of the second decoding layer. The decoding function of the $l$th decoding layer is
$$\mathbf{z}^{(l)} = f_d\left(\mathbf{W}^{\prime(L-l+1)}\mathbf{z}^{(l-1)} + \mathbf{b}^{\prime(L-l+1)}\right), \quad l = 1, \ldots, L, \tag{7}$$
where the input $\mathbf{z}^{(0)}$ of the first decoding layer is the output $\mathbf{y}^{(L)}$ of the last encoding layer. The output $\mathbf{z}^{(L)}$ of the last decoding layer is the reconstruction of the original data $\mathbf{x}$.
The training process of SDAE is provided as follows.
Step 1. Choose input data, which can be randomly selected from the hyperspectral images.
Step 2. Train the first DAE, which includes the first encoding layer and the last decoding layer. Obtain the network weights $\mathbf{W}^{(1)}$ and $\mathbf{b}^{(1)}$ and the features $\mathbf{y}^{(1)}$, which are the output of the first encoding layer.
Step 3. Use $\mathbf{y}^{(l)}$ as the input data of the $(l+1)$th encoding layer. Train the $(l+1)$th DAE and obtain $\mathbf{W}^{(l+1)}$ and $\mathbf{b}^{(l+1)}$ and the features $\mathbf{y}^{(l+1)}$, where $l = 1, \ldots, L-1$ and $L$ is the number of hidden layers in the network.
It can be seen that each DAE is trained independently, and therefore the training of SDAE is called layer-wise training. Moreover, the network weights trained by SDAE act as the initial weights in the following LR fine-tuning phase. Therefore, SDAE pretrains the network.
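The layer-wise procedure in Steps 1 to 3 can be sketched as follows. This is a minimal NumPy illustration under several assumptions not stated in the text: each DAE uses a linear decoder with squared error, gradients are derived by hand, Gaussian noise is also applied to the inputs of the deeper DAEs, and the data, layer sizes, epochs, and batch size are toy values. The names `train_dae`, `encode`, and `pretrain_sdae` are hypothetical.

```python
import numpy as np

def train_dae(data, n_hidden, noise_std, rng, epochs=20, batch=32, lr=0.01, lam=1e-4):
    """Train one ReLU DAE (tied weights, linear decoder) by minibatch SGD."""
    n, d = data.shape
    W = rng.normal(0.0, 0.1, size=(n_hidden, d))
    b = np.zeros(n_hidden)
    c = np.zeros(d)                                   # decoder bias
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            x = data[order[start:start + batch]]
            x_tilde = x + rng.normal(0.0, noise_std, size=x.shape)
            pre = x_tilde @ W.T + b                   # encoder pre-activation
            y = np.maximum(0.0, pre)                  # ReLU code
            z = y @ W + c                             # linear reconstruction
            dz = 2.0 * (z - x) / x.shape[0]           # d(cost)/dz
            dpre = (dz @ W.T) * (pre > 0)             # backprop through ReLU
            # W appears in both decoder and encoder because the weights are tied.
            grad_W = y.T @ dz + dpre.T @ x_tilde + 2.0 * lam * W
            W -= lr * grad_W
            b -= lr * dpre.sum(axis=0)
            c -= lr * dz.sum(axis=0)
    return W, b

def encode(data, W, b):
    return np.maximum(0.0, data @ W.T + b)

def pretrain_sdae(data, layer_sizes, noise_std, rng):
    """Train one DAE per hidden layer; each uses the previous layer's features."""
    params, features = [], data
    for n_hidden in layer_sizes:
        W, b = train_dae(features, n_hidden, noise_std, rng)
        params.append((W, b))                         # keep only the encoder
        features = encode(features, W, b)
    return params, features

rng = np.random.default_rng(0)
data = rng.random((200, 20))                          # stand-in for normalized spectra
params, features = pretrain_sdae(data, [16, 8], noise_std=0.2, rng=rng)
```

Only the encoder weights are returned for each layer, which mirrors the fact that the decoding layers are discarded once pretraining is finished.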

SDAE LR
SDAE LR includes SDAE pretraining and LR fine-tuning. SDAE trains the network weights and obtains features by the reconstruction-oriented learning, and the learned weights act as the initial weights of the network. Further, LR is used to fine-tune the network weights and obtain the fine-tuned features. It is worth noting that SDAE is unsupervised, while LR is supervised and only the data with label information can be used in the LR stage. The SDAE LR network is illustrated in Figure 2, which shows a two-category classification problem (there are two output values). We can see that the decoding part of SDAE is removed and the encoding part of SDAE is retained to produce the initial features. In addition, the output layer of the whole network, which is also called the LR layer, is added. The following sigmoid function is used as the activation function of the LR layer:
$$h(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{W}\mathbf{x} - \mathbf{b})}, \tag{8}$$
where $\mathbf{x}$ is the output $\mathbf{y}^{(L)}$ of the last encoding layer, that is, the deep features that are pretrained by the SDAE method. The output of the sigmoid function is between 0 and 1, which denotes the classification results. Labels are associated with the training data points, and therefore we can use the errors between the predicted classification results and the true labels to fine-tune the whole network weights. The cost function is defined as the following cross-entropy function:
$$\mathrm{Cost} = -\frac{1}{m}\sum_{i=1}^{m}\left[t^{(i)}\log h\left(\mathbf{x}^{(i)}\right) + \left(1 - t^{(i)}\right)\log\left(1 - h\left(\mathbf{x}^{(i)}\right)\right)\right], \tag{9}$$
where $t^{(i)}$ denotes the label of the sample $\mathbf{x}^{(i)}$. By minimizing the cost function, we can update the network weights. This optimization problem is also solved by the MSGD method. The steps of SDAE LR network training are as follows.
Step 1. SDAE is utilized to train the initial network weights, described in Section 2.2.
Step 2. Initial weights of the LR layer are randomly set.
Step 3. Training data are used as input data, and their predicted classification results are produced with the initial weights of the whole network.
Step 4. Network weights are iteratively tuned by minimizing the cost function in (9) using MSGD optimization method.
After the network training, we can calculate the features of any input data, which are the output of the last encoding layer. We call the features learned by SDAE pretraining and LR fine-tuning tuned-SDAE features. It is worth noting that the LR classifier is a part of the network. The output of the LR layer, which is also the output of the whole network, denotes the classification results. Therefore, SDAE LR performs feature extraction and classification simultaneously. In addition, besides LR, other supervised classifiers like support vector machine (SVM) can also be combined with the tuned-SDAE features.
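The fine-tuning steps above can be sketched in NumPy as follows. This is a simplified illustration, not the Theano implementation used in the experiments: it assumes a binary problem with a single sigmoid output, full-batch gradient descent instead of MSGD, and randomly initialized encoder weights standing in for SDAE-pretrained ones. The names `forward`, `cross_entropy`, and `finetune_step` are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, enc, V, v0):
    """Run the encoding layers, then the LR layer; keep intermediates for backprop."""
    acts, pres = [x], []
    for W, b in enc:
        pres.append(acts[-1] @ W.T + b)
        acts.append(np.maximum(0.0, pres[-1]))
    p = sigmoid(acts[-1] @ V + v0)
    return pres, acts, p

def cross_entropy(p, t, eps=1e-12):
    return -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

def finetune_step(x, t, enc, V, v0, lr=0.1):
    """One gradient step on the cross-entropy cost through the whole network."""
    pres, acts, p = forward(x, enc, V, v0)
    delta = (p - t) / x.shape[0]             # gradient w.r.t. the LR pre-activation
    grad_V = acts[-1].T @ delta
    grad_v0 = delta.sum()
    back = np.outer(delta, V)                # gradient w.r.t. the last features
    new_enc = []
    for k in reversed(range(len(enc))):
        W, b = enc[k]
        dpre = back * (pres[k] > 0)          # backprop through ReLU
        back = dpre @ W                      # gradient w.r.t. this layer's input
        new_enc.append((W - lr * dpre.T @ acts[k], b - lr * dpre.sum(axis=0)))
    new_enc.reverse()
    return new_enc, V - lr * grad_V, v0 - lr * grad_v0

rng = np.random.default_rng(0)
x = rng.random((100, 6))
t = (x[:, 0] > 0.5).astype(float)            # toy binary labels
# Stand-in for pretrained encoder weights (assumption: random initialization).
enc = [(rng.normal(0, 0.5, (8, 6)), np.full(8, 0.1)),
       (rng.normal(0, 0.5, (4, 8)), np.full(4, 0.1))]
V, v0 = rng.normal(0, 0.1, 4), 0.0           # randomly set LR weights (Step 2)

loss_before = cross_entropy(forward(x, enc, V, v0)[2], t)
for _ in range(300):
    enc, V, v0 = finetune_step(x, t, enc, V, v0)
loss_after = cross_entropy(forward(x, enc, V, v0)[2], t)
tuned_features = forward(x, enc, V, v0)[1][-1]   # output of the last encoding layer
```

After fine-tuning, `tuned_features` plays the role of the tuned-SDAE features, which could also be passed to another classifier such as an SVM.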

Experimental Results and Analysis
3.1. Data Description. Three hyperspectral images were used for experiments. One was collected over Indian Pine (INP) in 1992. The spatial resolution of this image is 20 m; 200 bands are available for analysis after removal of noisy and water absorption bands. One was acquired by the Hyperion instrument over the Okavango Delta, Botswana (BOT), in May 2001. The 224-band Hyperion data have 10 nm spectral resolution over the range of 400 nm-2500 nm. The last, a high spatial resolution hyperspectral image, was collected by the reflective optics system imaging spectrometer (ROSIS) over the University of Pavia (PU), Italy. This data set has 103 bands covering a spectral range from 430 nm to 860 nm, and its spatial resolution is 1.3 m. Both BOT and PU data contain 9 land cover types, and INP has 13 land cover types. Figure 3 shows the RGB images and the ground reference information with class legends of the BOT, PU, and INP images. Table 1 lists the class names and the number of labeled samples of the three data sets.

Network Configuration.
We first normalized the data to the range between 0 and 1 and then randomly selected 20 thousand data points from the BOT, PU, and INP images, which were used for unsupervised pretraining of SDAE. In the supervised LR training stage, we randomly divided the labeled data into training data, validation data, and testing data, with a ratio of 5 : 2 : 3. The training data are used in LR for fine-tuning, the validation data are used for parameter tuning and for terminating the iterations of the MSGD method, and the testing data are used for evaluating the algorithm.
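The normalization and the 5 : 2 : 3 split described above could be set up as in the following sketch. It uses only the Python standard library; the helper names, the fixed seed, and the choice of min-max scaling over a single list of values are assumptions for illustration, since the text does not specify how the normalization was computed.

```python
import random

def min_max_normalize(values):
    """Scale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0          # avoid division by zero for constant input
    return [(v - lo) / span for v in values]

def split_indices(n, ratios=(5, 2, 3), seed=0):
    """Randomly split indices 0..n-1 into training, validation, and testing parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

scaled = min_max_normalize([2, 4, 6])
train, val, test = split_indices(1000)
```

With 1000 labeled samples this gives 500 training, 200 validation, and 300 testing indices, with no index shared between the three parts.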
Network configuration contains three parameters, which are the number of hidden layers, the number of units $h$ in each hidden layer, and the standard deviation of Gaussian noise. The number of hidden layers is selected in the range from 1 to 5, the number of units is chosen from [10, 30, 60, 100, 200, 300, 400], and the standard deviation of Gaussian noise is selected from [0.2, 0.4, 0.6, 0.8]. These parameters are selected according to the best classification results on the validation data. For BOT, PU, and INP data, the optimal number of layers is 4, 3, and 3, respectively; the best options for the number of units are 100, 300, and 200, respectively; the optimal selections of the standard deviation of Gaussian noise are 0.6, 0.6, and 0.2, respectively. In addition, network training includes two parameters: the epochs of pretraining and fine-tuning are set to be 200 and 1500, and the learning rates of pretraining and fine-tuning are selected as 0.01 and 0.1 empirically. We used Theano for conducting the SDAE LR classification. Theano is a Python library that can define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently and can use a GPU to speed up the calculation.
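One way to organize the parameter selection described above is an exhaustive search over the three candidate lists, keeping the configuration with the best validation accuracy. Whether the original experiments searched the full grid or varied one parameter at a time is not stated, so the exhaustive search here is an assumption, and `toy_validate` is a made-up stand-in for training a network and measuring its validation OA.

```python
from itertools import product

layers = [1, 2, 3, 4, 5]
units = [10, 30, 60, 100, 200, 300, 400]
noise = [0.2, 0.4, 0.6, 0.8]

def select_best(validate):
    """Return the (layers, units, noise) triple with the highest validation score.

    `validate` is a hypothetical callable that would train SDAE LR with the
    given configuration and return its overall accuracy on the validation data.
    """
    return max(product(layers, units, noise), key=lambda cfg: validate(*cfg))

def toy_validate(n_layers, n_units, noise_std):
    # Artificial score that peaks at (3, 300, 0.6); not real experimental data.
    return -(abs(n_layers - 3) + abs(n_units - 300) / 100 + abs(noise_std - 0.6))

grid_size = len(list(product(layers, units, noise)))   # 5 * 7 * 4 configurations
best = select_best(toy_validate)
```

The grid contains 140 configurations, so in practice each call to the validation routine would involve a full pretraining and fine-tuning run.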

SDAE LR Classification Performance
The SDAE LR method is compared with the SVM classifier in this section, where SVM classifiers with linear and RBF kernels were applied to the original data and are denoted as LSVM and RSVM, respectively. The parameters of the RSVM classifier are tuned by cross-validation, and the penalty parameter in LSVM is set to be 2. The comparison results using overall accuracies (OA%) are shown in Table 2. It can be seen that SDAE LR outperformed LSVM on all three data sets and obtained higher accuracies than RSVM on PU and INP data. This demonstrates that the features learned by the SDAE pretraining and LR fine-tuning can effectively increase the separation between classes. Figure 4 shows the classification results of the whole images using SDAE LR for the three images. The acceptable results demonstrate the good generalization ability of the SDAE LR approach. Using a machine with Intel Xeon CPU I7-4770, GPU NVIDIA Q4000, and 8 GB RAM, the computational time of the three classifiers on BOT, PU, and INP data is shown in Table 3, where LSVM and RSVM are implemented using the CPU and SDAE LR utilized the GPU for computation. LSVM takes the least time and RSVM is the most time-consuming because of the parameter tuning. We did not provide the exact time for RSVM on PU data since it is longer than 12 hours. The proposed SDAE LR is much faster than RSVM, since it is implemented using Theano, which accelerates the computation significantly. It is worth noting that the SDAE pretraining is fast and the LR fine-tuning is time-consuming, because the former is layer-wise training and the latter propagates errors through the whole network.
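For reference, the overall accuracy (OA%) used in the comparisons is the percentage of test samples whose predicted label matches the reference label. A minimal sketch, with made-up labels and a hypothetical function name:

```python
def overall_accuracy(reference, predicted):
    """Percentage of samples whose predicted label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for r, p in zip(reference, predicted) if r == p)
    return 100.0 * correct / len(reference)

# Toy labels for illustration only: three of four predictions are correct.
oa = overall_accuracy([1, 2, 2, 3], [1, 2, 3, 3])
```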

Comparison of Different Feature Extraction Methods.
Features of the SDAE LR network are obtained by SDAE pretraining and LR fine-tuning and are called tuned-SDAE features. We compare the proposed method with four popular feature extraction methods, including PCA, Laplacian Eigenmaps (LE), locally linear embedding (LLE), and LDA. The first three methods are unsupervised methods and LDA is supervised. In addition, PCA and LDA are linear methods, while LE and LLE are nonlinear methods. We set the number of features to be 50 for PCA, LE, and LLE empirically. The tuned-SDAE features are obtained by using the same network configuration described in Section 3.2.
After feature extraction by PCA, LE, LLE, LDA, and SDAE LR, we used SVM classifiers (LSVM and RSVM) for classification. In addition, we also applied SVMs to the raw hyperspectral data. Tables 4 and 5 show the overall accuracies of these methods using LSVM and RSVM, respectively. Several observations can be made: (1) among the different feature extraction methods, tuned-SDAE performed the best. It significantly outperformed the others with the LSVM classifier on all three data sets. When the RSVM classification was employed, the tuned-SDAE features also obtained the highest accuracies on most of the data sets; (2) compared to the SVM classification on the raw hyperspectral data, the four feature extraction methods (PCA, LE, LLE, and LDA) may not improve the accuracies, while the proposed tuned-SDAE features obtained better performance on most data sets; (3) among the four feature extraction methods (PCA, LE, LLE, and LDA), we cannot find one method that is consistently better than the others. The features obtained by SDAE LR produced stable and good performances on all the data sets; (4) RSVM performed better than LSVM on the raw data and the features extracted by PCA, LE, LLE, and LDA, while RSVM and LSVM provided similar results on the tuned-SDAE features.
From the last column of Tables 2, 4, and 5, we can also observe that, with the tuned-SDAE features, different classifiers (LR, LSVM, and RSVM) resulted in similar performances. Among the three classifiers, LR is the simplest since it is a part of the network, and the output of the network is the LR classification result.
Computational times of different feature extraction methods on the three data sets are listed in Table 6. Since the computational complexity of LE and LLE is $O(dN^{2})$, where $d$ is the number of dimensions and $N$ is the number of points, LE and LLE cannot process large scale data sets. For PU data, we randomly selected 5000 data points for LE and LLE, and the features of the remaining data points are calculated by a kernel-based generalization method [1]. We can see that PCA ...
Figure 5 shows the results of parameter analysis. When one parameter was tested, the values of the other parameters were set to the values described in Section 3.2.
(1) For the number of layers of the deep network, we tested five different values (1, 2, 3, 4, and 5), and the classification results are shown in Figure 5(a). For INP and PU data, the best number of layers is 3; for BOT data, the optimal selection is 4. Results on BOT and PU data are not sensitive to this parameter when the number of layers is larger than 2, while results on INP data indicate that only values of 2, 3, and 4 produced satisfactory performance.
(2) For the number of units in each hidden layer, we evaluated seven different values (10, 30, 60, 100, 200, 300, and 400). As shown in Figure 5(b), the best numbers of units are 100, 300, and 200 for BOT, PU, and INP data, respectively. For INP data, small values like 10 deteriorate the classification performance. However, SDAE LR is not very sensitive to this parameter over a large range (number of units > 100).
(3) For the standard deviation of Gaussian noise, we tested four different values (0.2, 0.4, 0.6, and 0.8). The classification results with respect to this parameter are shown in Figure 5(c). The optimal values are 0.6, 0.6, and 0.2 for BOT, PU, and INP data, respectively. It can be seen that SDAE LR is not very sensitive to this parameter.
Selection of the activation function of the network is very important, and we chose the ReLU function as the activation function in this paper, since it is able to produce sparse features. To demonstrate the effectiveness of the sparsity, we compared two activation functions: the ReLU function and the sigmoid function, where the latter cannot obtain sparse features. The extracted features of SDAE LR are the outputs of the last hidden layer, and therefore the dimensionality of the features is equal to the number of units in the hidden layer. We define the sparsity rate as the ratio of the number of zeros in the feature to the dimensionality of the feature. A high sparsity rate means there are many zeros in the feature and the feature is highly sparse. Figure 6 plots the sparsity rates versus different unit numbers of the hidden layer using the ReLU activation function. For all tested numbers of units, the sparsity rate is high, and the number of nonzero values in the feature is small. Take PU data for example; when the number of units is 400, the sparsity rate is 0.9626. It means the number of zeros in the feature is about 385, and the feature only contains about 15 nonzero values. Table 7 shows the OA using SDAE LR with the ReLU function and the sigmoid function. It can be seen that the ReLU function outperformed the sigmoid function on all three data sets, which demonstrates the effectiveness of the sparse features obtained with the ReLU function.
The number of training data also affects the network training, since LR fine-tuning is supervised and only training data can be used to further adjust the network weights. Figure 7 shows the SDAE LR performance with respect to different rates of training data (1%, 5%, 10%, 25%, and 50%). In general, high training data rates resulted in high accuracies, since LR performs supervised fine-tuning and classification.
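The sparsity rate defined above is straightforward to compute. The sketch below uses a made-up 400-dimensional feature with 385 zeros to mirror the PU example; the function name and the toy feature are illustrative assumptions.

```python
def sparsity_rate(feature):
    """Ratio of the number of zero entries to the dimensionality of the feature."""
    return sum(1 for v in feature if v == 0) / len(feature)

# Toy feature: 385 zeros and 15 nonzero values out of 400 units.
feature = [0.0] * 385 + [1.0] * 15
rate = sparsity_rate(feature)

# Converting a reported rate back to an approximate count of zeros.
approx_zeros = round(0.9626 * 400)
```

A rate of 385/400 is 0.9625, which is consistent with the reported value of 0.9626 if that figure is an average over many samples.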

Conclusion
Deep learning by SDAE LR is proposed for hyperspectral feature extraction and classification, where SDAE pretrains the network in an unsupervised manner, and LR fine-tunes the whole network and also performs classification. The features are learned by SDAE pretraining and LR fine-tuning. In the network, the ReLU activation function was exploited to achieve sparse features, which may improve the separation capability of the features. In experiments, SDAE LR outperformed the popular SVM classifier with linear and RBF kernels. The tuned-SDAE features also provide better classification accuracies than several popular feature extraction methods, which demonstrates the good discriminant ability of the extracted features. In SDAE, the utilized ReLU function performed better than the sigmoid function, indicating the effect of the sparsity of features.
In the SDAE LR method, we only utilized spectral features of the data. Plenty of spatial information of hyperspectral images can also be extracted and exploited [2,19], such as the texture feature, the morphological feature, the spatial coordinate information, and the relations between spatially adjacent pixels. Our future work is to combine spatial information in the SDAE LR framework to further improve the classification performance.

Figure 1: The SDAE network is stacked by two DAE structures.

Figure 2: SDAE LR structure includes the encoding part of SDAE for feature extraction and LR for fine-tuning and classification.

Figure 3: Three band false color composite and ground references. (a) False color composite of BOT image with ground reference. (b) Class legend of BOT image. (c) False color composite of PU image. (d) Ground reference of PU image. (e) Class legend of PU image. (f) IND PINE scene. (g) Ground reference of IND PINE image. (h) Class legend of IND PINE image.

Figure 4: Classification results of the whole image on BOT (a), PU (b), and INP (c) data set.

Figure 5: Parameter analysis of SDAE LR approach. (a) For the parameter of the number of hidden layers. (b) For the parameter of the number of units in hidden layer. (c) For the parameter of standard deviation of Gaussian noise.

Figure 6: Sparsity rate of the network with different unit number of hidden layer.

Figure 7: SDAE LR classification performance with respect to the rates of training data.

Table 1: Class information of three datasets and the number of labeled samples in each class.

Table 3: Comparison of computational time of SDAE LR and SVM classifier (seconds).

Table 6: Comparison of computational time of different feature extraction methods (seconds).

Table 7: OA% with different activation functions.