Deep Learning for Hyperspectral Data Classification through Exponential Momentum Deep Convolution Neural Networks

Classification is a hot topic in the hyperspectral remote sensing community. In the last decades, numerous efforts have been concentrated on the classification problem. Most existing studies follow the conventional pattern recognition paradigm, which is based on complex handcrafted features. However, it is rarely known which features are important for the problem. In this paper, a new classification framework based on deep machine learning is proposed for hyperspectral data. The proposed framework, which is composed of an exponential momentum deep convolution neural network and a support vector machine (SVM), can hierarchically construct high-level spectral-spatial features in an automated way. Experimental results and quantitative validation on widely used datasets showcase the potential of the developed approach for accurate hyperspectral data classification.


Introduction
Recent advances in optics and photonics have allowed the development of hyperspectral data detection and classification, which is widely used in agriculture [1], surveillance [2], environmental sciences [3,4], astronomy [5,6], and mineralogy [7]. In the past decades, hyperspectral data classification methods have been a hot research topic. Many classical classification algorithms, such as k-nearest neighbors, maximum likelihood, parallelepiped classification, minimum distance, and logistic regression (LR) [8,9], have been proposed. However, there are several critical problems in the classification of hyperspectral data: (1) high-dimensional data, which leads to the curse of dimensionality; (2) a limited number of labeled training samples, which leads to the Hughes effect; (3) large spatial variability of the spectral signature [10].
Most of the existing work on the classification of hyperspectral data follows the conventional pattern recognition paradigm: complex handcrafted features are extracted from the raw data, and classifiers are then trained on them. Classical feature extraction methods include principal component analysis, singular value decomposition, projection pursuit, self-organizing maps, and fusion feature extraction. Many of these methods extract features in a shallow manner and do not hierarchically extract deep features automatically. In contrast, the deep machine learning framework can extract high-level abstract features, which are invariant to rotation, scaling, and translation [11,12].
In recent years, deep learning models, especially the deep convolution neural network (CNN), have been shown to yield competitive performance in many fields, including classification and detection tasks that involve images [13][14][15], speech [16], and language [17]. However, most CNNs take the original image as input without any preprocessing based on prior knowledge. This directly extends both the training time and the feature extraction time [18,19]. Besides, the traditional CNN has too many parameters, which makes it difficult to initialize, and training algorithms based on gradient descent may become trapped in local optima and suffer from gradient dispersion. Moreover, there has been little study so far on improving the convergence rate and smoothness of CNN training.
In this paper, we propose an improved hyperspectral data classification framework based on an exponential momentum deep convolution neural network (EM-CNN). To address the problem of gradient diffusion in deep networks, we also propose a new method for updating the parameters of the CNN based on exponential momentum gradient descent.
The rest of the paper is organized as follows. Section 2 describes feature learning and deep learning. The proposed EM-CNN framework is introduced in Section 3, while Section 4 details the new exponential momentum gradient descent method, which yields the highest accuracy compared with homologous momentum-based parameter updating methods. Section 5 presents the experimental results. Section 6 summarizes the results and draws a general conclusion.

Feature Learning
Feature extraction is necessary and useful in real-world applications because data such as images, videos, and sensor measurements are usually redundant, highly variable, and complex. Traditional handcrafted feature extraction algorithms are time-consuming and laborious and usually rely on prior knowledge of a certain visual task. In contrast, feature learning allows a machine both to learn a specific task and to learn the features themselves.
Deep learning is part of a broader family of machine learning methods based on learning representations of data. It attempts to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and nonlinear transformations. Typical deep learning models include the autoencoder (AE) [20], deep restricted Boltzmann machine (DRBM) [21], deep Boltzmann machines (DBM) [22], deep belief networks (DBN) [23], stacked autoencoder (SAE) [24], and deep convolutional neural networks (DCNN) [25].
The deep convolution neural network (DCNN) is an effective method for feature extraction. It can produce progressively more abstract and complex features at higher layers, and the learnt features are generally invariant to most local changes of the input. It has been shown to yield competitive performance in many fields, such as object detection [13][14][15], simultaneous speech interpretation [16], and language classification [17]. As classification performance depends heavily on the features [26], we adopt the DCNN as part of our hyperspectral data classification framework.

Structure Design of Hyperspectral Data Classification Framework
In a deep convolutional neural network, the input data, the convolution kernel, and the threshold parameter are the three most important issues [27][28][29]. The input data is the basis of feature extraction and determines the final classification performance. The size of the convolution kernel determines the degree of abstraction of the features. If the kernel is too small, effective local features are difficult to extract; if it is too large, the extracted features exceed the range that the convolution kernel can express. The threshold parameter is mainly used to control the degree of response of the feature submodes. Besides, the network depth and the dimension of the output layer also influence the quality of feature extraction. Deeper networks have stronger feature expression ability, but they can lead to overfitting and poor real-time performance. The dimension of the output layer directly determines the convergence speed of the network. When the sample sets are limited, too low a dimension of the output layer cannot guarantee the validity of the features, while too high a dimension produces feature redundancy.
Since the traditional CNN feeds the original image directly into the deep network, and the input data play a crucial part in the final feature extraction [28,29], three images obtained by image data preprocessing are used as inputs to improve the convergence speed and the performance on the specific classification task. To obtain better features, the convolution layer filters are of sizes 9 × 9, 5 × 5, and 3 × 3, respectively, and the network depth is seven, as determined by the experimental results.
Besides, the downsampling layers apply max-pooling, and the nonlinear mapping function is the leaky rectified linear (LReL) function, which is shown in the following formula:

$$h^{(i)} = \begin{cases} w^{(i)T} x, & w^{(i)T} x > 0, \\ a\, w^{(i)T} x, & \text{otherwise}, \end{cases} \tag{1}$$

where $a$ is a small nonzero constant, $w^{(i)}$ is the weight of the neuron, and $x$ is its input. The setting of $a$ ensures that inactive neurons receive a nonzero gradient value, so that the neuron retains the possibility of being activated. Based on the above analysis, a deep network framework for hyperspectral data classification based on the deep convolutional neural network is proposed in Figure 1.
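The leaky rectifier described above can be illustrated with a short NumPy sketch. The slope value 0.01 and the function names `lrel` and `lrel_grad` are illustrative assumptions, not values taken from the paper; the point is only that the gradient never becomes exactly zero for negative inputs.

```python
import numpy as np

def lrel(z, a=0.01):
    """Leaky rectifier: z where z > 0, a * z elsewhere (a is an assumed slope)."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, a * z)

def lrel_grad(z, a=0.01):
    """Derivative of lrel with respect to z: 1 where z > 0, a elsewhere."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, a)

# z stands for the neuron pre-activation (weight times input).
z = np.array([-2.0, -0.5, 0.0, 1.5])
activations = lrel(z)
gradients = lrel_grad(z)
print(activations)
print(gradients)
```

Because the gradient for inactive units is `a` rather than zero, a unit that starts in the negative regime can still receive weight updates.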
In the proposed deep CNN model, the first, third, and fifth layers are convolution layers, which perform feature extraction from lower level to higher level. The second, fourth, and sixth layers are downsampling layers, used for feature dimension reduction. The final layer is the output layer, which is a fully connected layer and outputs the final extracted features.
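To make the layer arrangement concrete, the following sketch traces the spatial size of a feature map through three convolution layers (9 × 9, 5 × 5, 3 × 3) alternating with pooling layers. The 64 × 64 input size, the unpadded stride-1 convolutions, and the 2 × 2 non-overlapping pooling windows are assumptions made for illustration; the paper does not state them in this section.

```python
def conv_out(size, kernel):
    """Side length after an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Side length after non-overlapping max-pooling."""
    return size // window

def trace_shapes(input_size, kernels=(9, 5, 3), window=2):
    """Return (layer name, side length) after each convolution and pooling layer."""
    shapes = []
    size = input_size
    for idx, k in enumerate(kernels):
        size = conv_out(size, k)
        shapes.append((f"layer {2 * idx + 1}: conv {k}x{k}", size))
        size = pool_out(size, window)
        shapes.append((f"layer {2 * idx + 2}: max-pool", size))
    return shapes

# Hypothetical 64 x 64 input image.
for name, size in trace_shapes(64):
    print(f"{name} -> {size} x {size}")
```

Under these assumptions the sixth layer produces 5 × 5 maps, which the seventh, fully connected layer would flatten into the final feature vector.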

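Figure 1: Proposed hyperspectral data classification framework (labels: original data, CNN, SVM).

The gradient descent method for updating the weights is shown in formula (2), and the bias updating method is shown in formula (3) [30]:

$$w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial E}{\partial w_{\text{old}}}, \tag{2}$$

$$b_{\text{new}} = b_{\text{old}} - \eta \frac{\partial E}{\partial b_{\text{old}}}. \tag{3}$$

In the formulas, $\eta$ is the learning rate, $-\partial E / \partial w_{\text{old}}$ is the gradient of the error with respect to the weight, and $-\partial E / \partial b_{\text{old}}$ is the gradient of the error with respect to the bias, namely, the sensitivity of parameter adjustment. To optimize the weights and biases, these two gradients must first be obtained.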
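For a convolution layer, the output is given by the following formula:

$$x_j^l = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big), \tag{4}$$

where $b_j^l$ is the bias of the $j$th type of feature map, $M_j$ is the block of input feature maps, and $k_{ij}^l$ is the convolution kernel. According to the derivation of the sensitivity function, the sensitivity of the convolution layer can be represented by the following formula:

$$\delta_j^l = \beta_j^{l+1} \big( f'(u_j^l) \circ \operatorname{up}(\delta_j^{l+1}) \big), \tag{5}$$

where $u_j^l$ is the argument of $f$ in formula (4), $\beta_j^{l+1}$ is the convolution kernel (weight) of sampling layer $l+1$, and up represents upsampling. The sensitivity map of the sampling layer is smaller than that of the convolution layer, so upsampling should be conducted. The symbol $\circ$ represents elementwise multiplication.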
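Thus, the gradient of the convolution layer error with respect to the bias is shown in formula (6), where $(u, v)$ is the element location in the sensitivity matrix:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} \big(\delta_j^l\big)_{uv}. \tag{6}$$

The gradient of the convolution layer error with respect to the weight is shown in formula (7), where $p_i^{l-1}$ is the block of $x_i^{l-1}$ that is convolved with the kernel $k_{ij}^l$, and $(u, v)$ is the element location in the block:

$$\frac{\partial E}{\partial k_{ij}^l} = \sum_{u,v} \big(\delta_j^l\big)_{uv} \big(p_i^{l-1}\big)_{uv}. \tag{7}$$

Substituting formulas (6) and (7) into formulas (2) and (3) yields the updated values of the convolution layer's weight and bias.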
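The output of a sampling layer can be expressed by formula (8), in which $\beta_j^l$ and $b_j^l$ represent the multiplicative bias and the additive bias, respectively, and down denotes downsampling. The multiplicative bias is generally set to 1:

$$x_j^l = f\big(\beta_j^l \operatorname{down}(x_j^{l-1}) + b_j^l\big). \tag{8}$$

According to the sensitivity formula of gradient descent, the sensitivity of the sampling layer is given by the following formula, where $u_j^l$ is the argument of $f$ in layer $l$, $k_j^{l+1}$ is the kernel of the following convolution layer, and $\star$ denotes full convolution:

$$\delta_j^l = f'(u_j^l) \circ \big(\delta_j^{l+1} \star \operatorname{rot180}(k_j^{l+1})\big). \tag{9}$$

From this, the gradient of the error with respect to the sampling layer's bias is obtained, as shown in formula (10). According to formula (3), the updated bias value can then be obtained:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} \big(\delta_j^l\big)_{uv}. \tag{10}$$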
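The convolution-layer gradients and the gradient descent update can be checked numerically on a toy example. The sketch below is a minimal illustration under stated assumptions: a single input map and a single kernel, a linear activation, a toy squared-error objective, cross-correlation in place of flipped convolution, and a hypothetical learning rate of 0.001. The helper names are not from the paper.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Unpadded, stride-1 2-D cross-correlation of map x with kernel k."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for u in range(oh):
        for v in range(ow):
            out[u, v] = np.sum(x[u:u + kh, v:v + kw] * k)
    return out

def conv_layer_grads(x, delta):
    """Kernel and bias gradients given the sensitivity map delta."""
    kh = x.shape[0] - delta.shape[0] + 1
    kw = x.shape[1] - delta.shape[1] + 1
    grad_k = np.zeros((kh, kw))
    for u in range(delta.shape[0]):
        for v in range(delta.shape[1]):
            # Each sensitivity element weights the input block it was computed from.
            grad_k += delta[u, v] * x[u:u + kh, v:v + kw]
    grad_b = delta.sum()  # bias gradient: sum over all sensitivity elements
    return grad_k, grad_b

def toy_error(x, k, b):
    """Assumed toy objective: E = 0.5 * sum(u^2) with u = x (*) k + b."""
    return 0.5 * np.sum((correlate2d_valid(x, k) + b) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
k = rng.normal(size=(3, 3))
b = 0.1
eta = 0.001  # hypothetical learning rate

# With a linear activation and the toy objective, the sensitivity equals u.
delta = correlate2d_valid(x, k) + b
grad_k, grad_b = conv_layer_grads(x, delta)

# Gradient descent step on weight and bias.
k_new = k - eta * grad_k
b_new = b - eta * grad_b
print(toy_error(x, k, b), toy_error(x, k_new, b_new))
```

The analytic gradients can be compared against finite differences of `toy_error`, which is a cheap way to catch indexing mistakes in a hand-written backward pass.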

Exponential Momentum Training Algorithm.
The traditional gradient descent method only transmits the gradient error between single layers, which leads to a slow convergence rate of the network. Increasing the learning rate $\eta$ improves the convergence speed, but it also makes the network unstable, a problem known as "oscillation." To address this, paper [19] proposes the momentum method, which increases the convergence speed by adding a momentum factor. Paper [31] proposes the self-adaptive momentum method based on paper [19]. However, neither of these methods considers the relation between oscillation, convergence, and momentum, so the momentum factor does not effectively promote convergence or enhance learning performance. This paper applies an exponential function of the error gradient to adjust the momentum factor. The function increases the momentum factor in flat regions of the error curve, which accelerates network convergence, and decreases it in steep regions, which avoids overshooting. Such a method can improve the convergence rate of the algorithm. In the update formula, $\Delta w_t = w_{t+1} - w_t$, and the gradient term represents the gradient of the error with respect to the weight.
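The idea of a gradient-dependent momentum factor can be sketched as follows. The exact update formula of the paper is not reproduced here; the schedule `alpha_max * exp(-lam * ||g||)`, the constants, and the toy quadratic error surface are all assumptions chosen only to show the qualitative behavior (large momentum on flat regions, small momentum on steep ones).

```python
import numpy as np

def momentum_factor(grad, alpha_max=0.9, lam=1.0):
    """Hypothetical schedule: near alpha_max for small gradients, near 0 for large ones."""
    return alpha_max * np.exp(-lam * np.linalg.norm(grad))

def train(grad_fn, w0, eta=0.1, steps=300, alpha_max=0.9, lam=1.0):
    """Gradient descent with a momentum factor that depends on the current gradient."""
    w = np.asarray(w0, dtype=float)
    delta_w = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        alpha = momentum_factor(g, alpha_max, lam)
        # delta_w plays the role of w_{t+1} - w_t.
        delta_w = -eta * g + alpha * delta_w
        w = w + delta_w
    return w

# Toy error surface E(w) = 0.5 * (w1^2 + 10 * w2^2), with its minimum at the origin.
scales = np.array([1.0, 10.0])

def grad_fn(w):
    return scales * w

w_final = train(grad_fn, w0=[3.0, -2.0])
print(w_final)
```

In this sketch the steep direction is handled almost like plain gradient descent at the start, while the momentum term grows as the iterate approaches the flat region around the minimum.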

Experiment and Analysis
In this section, the performance of the proposed algorithm is evaluated on the AVIRIS and ROSIS hyperspectral datasets. The overall accuracy, generalized accuracy, and kappa parameter, the three most important criteria, are used to evaluate the performance of the proposed framework.
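These criteria can all be computed from a confusion matrix. The sketch below assumes that "generalized accuracy" means the mean of the per-class accuracies (often called average accuracy); that reading, the function name, and the toy labels are assumptions made for illustration.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return overall accuracy, mean per-class accuracy, and Cohen's kappa."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true class, columns: predicted class
    n = cm.sum()
    overall = np.trace(cm) / n
    # Per-class accuracy; assumes every class occurs at least once in y_true.
    per_class = np.diag(cm) / cm.sum(axis=1)
    average = per_class.mean()
    # Agreement expected by chance, from the row and column marginals.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    kappa = (overall - expected) / (1 - expected)
    return overall, average, kappa

# Toy labels for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
oa, aa, kappa = classification_scores(y_true, y_pred, 3)
print(oa, aa, kappa)
```

Kappa discounts the agreement that would be expected by chance, so it is lower than the overall accuracy whenever the class marginals make chance agreement nonzero.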

Data Description.
In our experiments, we validated the proposed framework with the AVIRIS and ROSIS hyperspectral datasets. The AVIRIS hyperspectral data 92AV3C was obtained by the AVIRIS sensor in June 1992. The ROSIS hyperspectral datasets were gathered by a sensor known as the reflective optics system imaging spectrometer (ROSIS-3) over the city of Pavia, Italy. In particular, we employed the Indian Pines dataset, which depicts Indiana and consists of 145 × 145 pixels and 224 spectral bands in the wavelength range $0.4 \times 10^{-6}$ to $2.5 \times 10^{-6}$ meters. It contains a total of 16 categories, as shown in Table 1. Its ground truth is shown in Figure 2. The other dataset we employed is the Pavia dataset. The experiments were organized step by step: the influence of the convolution kernel size and the network depth on the classification results was analyzed first; then we verified the performance of the exponential momentum training algorithm; finally, classifications based on the CNN framework were conducted.

Effect of Kernel Size and Depth.
The influence of the kernel size and the network depth on the classification performance of the proposed framework is analyzed in this section. The deep convolution neural network is trained with a series of different kernel sizes and network depths under a fixed network structure and fixed algorithm parameters. The results are shown in Tables 3 and 4. Table 3 suggests that the overall accuracy of the method is only slightly affected by the convolution kernel size, and that the kernel size should be consistent with the feature size of the image data. Table 4 shows that deeper structures can achieve better classification accuracy.

Exponential Momentum Training Algorithm.
In this section, we verified the general accuracy and the convergence speed of the algorithm.
We select adaptive momentum [31] and elastic momentum [32] as the comparative methods and observe how the training loss changes over the iteration rounds. It can be seen from Figure 4 that adaptive momentum converges at iteration 14, elastic momentum at iteration 8, and exponential momentum at iteration 7. Exponential momentum therefore needs the fewest iterations to converge, and its training time is also the shortest.
For the general accuracy test, the LeNet5 neural network [33] and the standard multiple neural network [34] are chosen for comparison. The accuracy results obtained are shown in Table 5. It can be seen from the table that, compared with the corresponding training models of the [...]

All the logistic regression classifiers are set to have a learning rate of 0.1 and are iterated on the training data for 8000 epochs. The result is shown in Figure 5. Experiments show that, by combining with SVM, the proposed method outperforms all other feature extraction methods and achieves the highest accuracy.

Comparing with Other Classification Methods.
We examine the classification accuracy of the EFM-CNN-SVM framework by comparing it with spatial-dominated methods, such as radial basis function-(RBF-) linear SVM, principal component analysis-(PCA-) RBF-SVM, and stacked autoencoder-(SAE-) logistic regression (LR). By putting both the spectral and spatial information together to form a hybrid input and utilizing the deep classification framework detailed in Section 3, we obtain the highest classification accuracy we have attained. The experiments were performed with the same parameter settings above 100. The results are shown in Table 6 and Figure 6. From Table 6, we can see that the EFM-CNN-SVM method performs better than all other methods, and that the joint features yield higher accuracy than the spectral features in terms of mean performance. In Figure 6, we look into the classification accuracy from a visual perspective. It can be seen that the classification results of the proposed method are closer to the ideal classification results than those of the RBF-SVM and linear SVM methods.
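One simple way to form a hybrid spectral-spatial input for a single pixel is sketched below. This is not the preprocessing used in the paper, which is not specified in this section; the band-averaged image, the 5 × 5 window, the reflect padding, and the random cube standing in for a 145 × 145 × 224 scene are all illustrative assumptions.

```python
import numpy as np

def hybrid_feature(cube, row, col, window=5):
    """Concatenate a pixel's spectrum with a flattened spatial window around it."""
    half = window // 2
    spectrum = cube[row, col, :]
    # Spatial context taken from the band-averaged image (an assumed choice).
    band_mean = cube.mean(axis=2)
    padded = np.pad(band_mean, half, mode="reflect")
    # After padding, rows row..row+window-1 are centered on the original pixel.
    patch = padded[row:row + window, col:col + window]
    return np.concatenate([spectrum, patch.ravel()])

# Random stand-in for a hyperspectral cube of height x width x bands.
rng = np.random.default_rng(0)
cube = rng.random((145, 145, 224))
feat = hybrid_feature(cube, 0, 144)
print(feat.shape)
```

Padding lets border pixels, such as the corner used here, receive a full window; the resulting vector has 224 spectral entries followed by 25 spatial entries.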

Conclusion
In this paper, a hyperspectral data classification framework is proposed based on a deep CNN feature extraction architecture. An improved error transmission algorithm, the self-adaptive exponential momentum algorithm, is also proposed. Experimental results show that the improved error transmission algorithm converges quickly compared with homologous error optimization algorithms such as adaptive momentum and elastic momentum. The proposed EFM-CNN-SVM framework provides better performance than the PCA-SVM, KPCA-SVM, and SAE-LR frameworks. Our experimental results suggest that deeper layers lead to higher classification accuracies, although operation time and accuracy trade off against each other. The results show that the deep architecture is useful for classification and that high-level spectral-spatial features increase the classification accuracy. When the data scale is larger, the extracted features have better recognition ability.

Figure 6: Classification results (panels: ideal, proposed method, RBF-SVM, linear SVM).

Figure 2: Indian Pines hyperspectral imagery and ground truth of classification.

Figure 3: Pavia hyperspectral imagery and ground truth of classification.

Figure 5: Comparison with other feature extraction methods.

Table 1: Sixteen classes of Indian Pines dataset.

Nine land cover classes are selected from the Pavia dataset, which are shown in Figure 3. The numbers of samples for each class are displayed in Table 2. To investigate the performance of the proposed methods, the experiments were organized step by step, beginning with the influence of the convolution kernel size and the depth of the network.

Table 2: Nine classes of Pavia dataset.

Table 3: Accuracy comparison of different kernel size.

Table 4: Accuracy comparison of different depth.

Table 5: Accuracy comparison of algorithms' image recognition.

Table 6: Accuracy comparison of different classifier.