Active Learning Algorithms for the Classification of Hyperspectral Sea Ice Images

Sea ice is one of the most serious marine disasters, especially in polar and high-latitude regions. Hyperspectral imagery is well suited to sea ice monitoring because it contains nearly continuous spectral information and therefore has a better capability for target recognition. The principal bottleneck in hyperspectral image classification is the large number of labeled training samples required; however, the collection of labeled samples is time-consuming and costly. To address this problem, we apply active learning (AL), which selects the most informative samples, to hyperspectral sea ice detection. Moreover, we propose a novel AL algorithm based on the evaluation of two criteria: uncertainty and diversity. The uncertainty criterion is based on the difference between the probabilities of the two classes having the highest estimated probabilities, while the diversity criterion is based on a kernel k-means clustering technique. In experiments on a scene of Baffin Bay, northwest Greenland, acquired on April 12, 2014, our proposed AL algorithm achieves the highest classification accuracy (89.327%) compared with other AL algorithms and random sampling, and it requires a lower labeling cost to reach the same classification accuracy.


Introduction
As a component of the global marine and atmospheric system, sea ice, with its high albedo, affects the heat and material exchange between the ocean and the atmosphere. Sea ice also plays a key role in the radiation, energy, and mass balances at the ocean surface [1]. Besides its influence on marine hydrology, atmospheric circulation, and ecosystems, sea ice poses a great threat to shipping and to the facilities of marine resource development, and it has become one of the most prominent oceanic disasters in polar and high-latitude regions.
For the prevention and mitigation of ice disasters and for hazard assessment, we need not only real-time sea ice area and ice-edge information but also more detailed data on the type, thickness, and distribution of sea ice. With traditional methods of sea ice detection, it is very difficult to obtain continuous, large-area sea ice conditions, whereas remote sensing is an effective means of acquiring large-area sea ice data rapidly. Currently, the main research areas of sea ice detection with remote sensing are the polar and high-latitude regions, and the countries carrying out the relevant research include America, Canada, Norway, Australia, and Germany. This research has mainly been aimed at moderate-resolution remote sensing, such as airborne remote sensing, the moderate-resolution imaging spectrometer (MODIS) [2], and synthetic aperture radar (SAR). Shi et al. [3] extracted sea ice information from the surface temperature based on NOAA/AVHRR and obtained the relation between ice thickness and reflectivity by an empirical formula; Meyer et al. [4] introduced an approach to map landfast ice extent based on L-band SAR data; Ozsoy-Cicek et al. [5] verified by field survey that active microwave can distinguish the ice edge and floating ice; Hong [6] proposed using passive microwave for the inversion of small-scale roughness of the ice surface and the refractive index of sea ice. In contrast with traditional remote sensing technology, hyperspectral images contain nearly continuous spectral information and abundant spatial information, which gives a higher capability of target recognition and can greatly improve the accuracy of target detection. From the published literature, research on sea ice detection with hyperspectral technology has so far rarely been addressed.
Many methods have been developed for the classification of hyperspectral images, which can be divided into two main types [7,8]: unsupervised classification and supervised classification. Unsupervised classification needs no prior knowledge and can classify the raw hyperspectral images directly; it is simple and easy to implement, but the classification accuracy is low. Supervised classification needs some prior knowledge in advance, obtaining a classifier by training on labeled samples; the trained classifier can then be used to categorize the unlabeled samples, which yields more accurate classification. A supervised classifier can be obtained in the following ways: a probabilistic model, empirical risk minimization (ERM), or structural risk minimization (SRM) [9,10]. The classic method based on a probabilistic model is the maximum likelihood classifier (MLC), which has high computational complexity and requires a large number of training samples to obtain good classification results [11]. The commonly used methods based on ERM include decision trees and neural networks, which tend to face the "Hughes" and "overfitting" problems in the case of high dimensionality and small samples. The SRM principle considers the empirical risk and the structural risk simultaneously and increases the generalization capability on future samples; the classic method based on the SRM principle is the support vector machine (SVM). At present, SVM has made great progress in both theoretical research and algorithm realization and has obtained better classification results than traditional classifiers [12-16]. For example, Melgani and Bruzzone [17] compared different SVM methods with k-means and a radial basis function (RBF) neural network in the original feature space and in feature subspaces. Camps-Valls et al. [18] proposed classifying crops by SVM and compared it with other neural network methods, such as a multilayer perceptron neural network and RBF. Pal and Mather [19] compared SVM with MLC and a multilayer perceptron neural network and verified the results with Landsat-7.
The classification of hyperspectral remote sensing images often applies supervised classification techniques, which require a large amount of labeled samples. The quality and the number of available training samples are important for accurate classification. Because of the limitations of environment and conditions, measured data for sea ice detection are very rare. The interpretation of hyperspectral images must be supported by traditional remote sensing images with higher spatial resolution acquired at the same time over the same scene. But the available training samples are usually not sufficient for adequate learning of the classifier. How to label as few samples as possible manually while obtaining good classification performance thus becomes a key issue in sea ice detection. To solve this classification problem, active learning (AL) approaches based on SVM have been proposed, which have achieved remarkable success in real-world learning. The SVM classifier is well suited to AL because its classification rule can be characterized by a small number of support vectors that are easy to update over successive iterations [20]. At each iteration, the classifier does not passively accept the training samples provided by the user but actively selects the most valuable samples for the current classification model. The samples labeled by the user are incorporated into the training set, and the classifier is retrained and updated. In this way, we can greatly reduce the labeling time and improve the classification accuracy. In recent years, researchers have conducted a large number of studies on active learning and proposed many AL methods. Tong et al. [21,22] proposed margin sampling (MS), which selects the samples closest to the current separating hyperplane as the most uncertain and informative samples. Another popular strategy is given by committee-based active learners: a set of unlabeled samples is evaluated by different classifiers, and the approach selects the examples on which the disagreement between the classifiers is maximal. Later, a method based on entropy was proposed, in which the examples with the highest entropy are selected to query the user. In [23], Joshi et al. put forward the best-versus-second-best (BvSB) approach, which uses the difference between the probabilities of the two classes having the highest estimated probabilities as a measure of uncertainty. In addition, there are many other AL algorithms, such as the Fisher information matrix method. In this paper, we propose a novel AL technique for the classification of hyperspectral sea ice images, which can select the most informative samples and obtain good classification results.

AL Process Model. AL was put forward by Professor
Angluin of Yale University in the paper "Queries and Concept Learning" [24]. Currently, AL algorithms are widely used in text classification and image retrieval. AL can also be applied to the classification of remote sensing images by considering the specific features of this domain. In remote sensing problems, the land-cover labels of an area are collected by three methods: photo interpretation, ground survey, and mixed strategies. However, these strategies involve high cost and much time. We therefore expect the AL process to be conducted with few labeled training samples without reducing the convergence capability. The classification framework based on AL is shown in Figure 1.
The active learning process is an iterative process that can be described by the quintuple (C, T, Q, U, S) [25], in which C is the supervised classifier to be trained; T is the labeled training set; Q is the query function used to select the most informative unlabeled samples from the unlabeled sample pool U; and S is a human expert who can label the selected samples with one of the three strategies mentioned above.
The iterative process of AL can be described as follows. First, the classifier C is trained on the initial training set T, which is made up of a few labeled samples. After initialization, a set of samples is selected by the query function Q from the pool U and submitted to the expert S for labeling. Then, these labeled samples are added to T, and the classifier C is retrained on the updated training set. This process continues until a stopping criterion (e.g., a labeling budget or a target generalization accuracy) is satisfied. The algorithm is described as follows.

Algorithm 1 (active learning).
Inputs. The inputs are the labeled sample set T, the unlabeled sample pool U, the initial classifier C, and the query function Q.
Output. The output is the updated classifier C.
(1) Train the initial classifier C on the labeled sample set T.
(2) Repeat:
(3) Select unlabeled samples with the query function Q from the unlabeled sample pool U.
(4) Label the selected samples by the human expert S.
(5) Add the newly labeled samples to the training set T and retrain the classifier C.
(6) Until the stopping criterion is satisfied.
From the above description, the selection of the classifier and of the sampling strategy are the two key components of AL.
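The steps above can be sketched in a few lines of Python. The following is an illustrative reconstruction on synthetic data, not the authors' code; it uses margin sampling as an example query function Q and a simulated expert S.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical sketch of the generic AL loop in Algorithm 1: a
# margin-sampling-style query function Q picks the most uncertain pool
# samples, a simulated expert S supplies labels, and the SVM classifier
# C is retrained on the enlarged training set T.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 5))                 # stand-in for pixel spectra
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Initial training set T: three samples per class.
labeled = list(np.flatnonzero(y_pool == 0)[:3]) + list(np.flatnonzero(y_pool == 1)[:3])
unlabeled = [i for i in range(len(y_pool)) if i not in labeled]

clf = SVC(kernel="rbf")
clf.fit(X_pool[labeled], y_pool[labeled])

for _ in range(5):                                 # five AL iterations
    # Query function Q: distance to the hyperplane (margin sampling).
    margin = np.abs(clf.decision_function(X_pool[unlabeled]))
    picked = [unlabeled[i] for i in np.argsort(margin)[:4]]
    labeled += picked                              # expert S labels them
    unlabeled = [i for i in unlabeled if i not in picked]
    clf.fit(X_pool[labeled], y_pool[labeled])      # retrain C on updated T

print(len(labeled))  # 6 initial + 5 rounds x 4 samples = 26
```

The stopping criterion here is simply a fixed number of rounds; a labeling budget or a target validation accuracy could be substituted.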

The Classification Model.
Because SVM shows outstanding performance in solving small-sample, nonlinear, and high-dimensional pattern recognition problems, we choose the SVM classifier in this paper. SVM is directly applicable only to two-class tasks. To solve the multiclass classification problems of hyperspectral sea ice images, SVM is implemented with a multiclass strategy.
Suppose that the training sample set T is made up of N independent samples {(x_i, y_i)}, i = 1, ..., N, where x_i denotes a training sample and y_i ∈ {+1, −1} denotes the associated label. The basic idea of SVM is to map the data through a proper nonlinear transformation into a higher-dimensional feature space, in order to find an optimal hyperplane that maximizes the margin between the two classes.
The classification problem can be transformed into a typical convex programming problem [5] on the basis of the Kuhn-Tucker theorem. Accordingly, the convex programming problem can be converted into the following dual problem via the Lagrange multipliers α_i associated with the original training patterns x_i:

max over α:  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j),
subject to  Σ_{i=1}^{N} α_i y_i = 0  and  0 ≤ α_i ≤ C,  i = 1, ..., N.

The dual problem has a global optimum. The α_i values corresponding to nonsupport vectors are zero, so the optimal classification decision function for the binary problem is obtained by solving the above problem:

f(x) = sgn( Σ_{i∈SV} α_i y_i K(x_i, x) + b ),

where SV is the set of support vectors, α_i and b are the parameters that define the optimal hyperplane, and K(·, ·) is the kernel function (we adopt the radial basis kernel, which gives good classification performance).
Multiclass classification builds on binary classification: the problem can be decomposed into multiple binary classification problems. The construction of a multiclass SVM classifier can be approached in two ways [26,27]. The first constructs a series of binary classifiers and takes the decision by combining the partial decisions of the single members of the ensemble [26,27]; two common techniques are the one-against-all (OAA) strategy and the one-against-one (OAO) strategy. The second method formulates SVM directly as a multiclass optimization problem; it has poor stability, and the multiclass optimization can degrade the classification accuracy. The OAO method for SVM is computationally efficient and shows good classification performance [28], so we use the OAO approach for multiclass classification in this paper. If there are K classes, we need to construct K(K − 1)/2 binary classifiers in total. In this case, a binary classifier separates class i from class j by means of a discriminant function

f_ij(x) = w_ij · φ(x) + b_ij,

where w_ij is the normal vector of the hyperplane discriminating class i from class j. The final decision in the OAO strategy is taken on the basis of the "winner-takes-all" rule, which corresponds to assigning x to the class that collects the largest number of pairwise votes:

y(x) = argmax_i Σ_{j ≠ i} sgn(f_ij(x)).

Sometimes, conflicts may occur when two different classes are characterized by the same score. Such ambiguities can be resolved by selecting the class with the smaller index.
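As a concrete illustration (not the authors' implementation), scikit-learn's SVC uses exactly this OAO decomposition internally; with K = 3 classes it trains K(K − 1)/2 = 3 pairwise binary classifiers. The toy data below are a made-up stand-in for the three sea ice classes.

```python
import numpy as np
from sklearn.svm import SVC

# Three well-separated synthetic classes standing in for
# seawater, thin ice, and thick ice (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 4)) + np.repeat(np.eye(3, 4) * 3, 30, axis=0)
y = np.repeat([0, 1, 2], 30)   # K = 3 classes

# decision_function_shape="ovo" exposes the raw pairwise scores f_ij(x).
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
scores = clf.decision_function(X[:1])
print(scores.shape)  # (1, 3): one score per pair, K(K-1)/2 = 3 classifiers
```

With the default `decision_function_shape="ovr"` the same pairwise machinery is used internally; only the reported score shape changes.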

The Sampling Strategy.
The sampling strategy is crucial in distinguishing the pros and cons of different AL algorithms. The selection of unlabeled samples under different strategies depends on the information content of the samples (i.e., the influence of the unlabeled samples on the generalization capability of the classifier). Currently, the most widely used sampling strategies are based on uncertainty. Among them, MS is one of the most popular and effective measures for active SVM learning, but it is applicable only to the binary SVM classification problem. Although the entropy method is suitable for multiclass classification, it has a drawback: its value is heavily affected by the probability values of unimportant classes, which can confuse the classifier [23]. To solve the aforementioned problems, the BvSB method was proposed, which achieves better performance in multiclass classification.

It is important to observe that the above strategies consider only uncertainty. As a result, the selected samples may all receive the same label after querying the user; that is, many redundant samples may be selected, which provide no additional information and are unfavorable for the convergence of the algorithm. To address this shortcoming, we adopt the enhanced clustering-based diversity (ECBD) criterion, which is based on the diversity of the sample distribution. In the following sections, we introduce the BvSB method and the ECBD method, respectively.

AL Algorithm Based on Uncertainty and Diversity
The BvSB method can be viewed as a greedier measure of uncertainty. We use the OAO strategy for multiclass classification. Assume that C_{i,j} (i, j ∈ K) is the classifier used to discriminate a sample x between class i and class j. If the true class label of an unlabeled sample x is c, then once x is labeled and added to the training set, it will modify the boundaries of the classifiers that separate class c from the other classes; we denote these classifiers by E_c = {C_{c,j} (c, j ∈ K, j ≠ c)}. Because the true label of x is unknown, we use the optimal label y_best as an estimate of the true label. Thus, the classifier set in contention is E_best = {C_{y_best, j} (j ∈ K, j ≠ y_best)}. For the classifier set E_best, the uncertainty of sample x can be measured by the difference between the estimated class probabilities, p(y_best | x) − p(y_second-best | x), which can be taken as an indicator of the information content of x. By minimizing this difference, that is, maximizing the classification uncertainty, the BvSB criterion is obtained:

x_BvSB = argmin_x [p(y_best | x) − p(y_second-best | x)].    (6)

According to formula (6), the q samples with the smallest differences are selected as the uncertain samples. From the viewpoint of changing the classification boundaries, the BvSB criterion can be considered an efficient approximation for selecting informative samples. Figure 2 shows the architecture of the BvSB method.
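Given class-membership probabilities, the BvSB selection in formula (6) reduces to a few lines. The following is a minimal sketch with made-up probability values, not the authors' code; the function name is illustrative.

```python
import numpy as np

# BvSB uncertainty, formula (6): for each sample, the difference between
# the best and second-best class probabilities; the q samples with the
# smallest difference are the most uncertain.
def bvsb_select(proba, q):
    """proba: (n_samples, n_classes) class-membership probabilities."""
    top2 = np.sort(proba, axis=1)[:, -2:]     # [second-best, best] per row
    margin = top2[:, 1] - top2[:, 0]          # p(y_best|x) - p(y_second_best|x)
    return np.argsort(margin)[:q]             # indices of the q most uncertain

proba = np.array([[0.34, 0.33, 0.33],   # very uncertain
                  [0.90, 0.05, 0.05],   # confident
                  [0.50, 0.45, 0.05]])  # moderately uncertain
print(bvsb_select(proba, 2))  # → [0 2]
```

In the full method, `proba` would come from the probability outputs of the OAO SVM classifier on the unlabeled pool.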

The ECBD Method Based on the Diversity.
Considering the distribution of the uncertain samples at the diversity step, clustering is an effective way to select the most diverse samples. As noted in the previous section, similar samples may be selected as informative samples by the BvSB method, so we consider combining the sampling strategy with unsupervised clustering. In this case, representative samples from different clusters are selected for labeling; that is, h < q samples are selected by clustering, where q samples were obtained in the uncertainty step. The standard k-means clustering algorithm works in the original feature space, while the SVM classification hyperplane works in the kernel space; therefore, the samples selected in the original space may not be suitable in the kernel space. To overcome this shortcoming, we adopt the enhanced clustering-based diversity (ECBD) method, which clusters in the kernel space and is an improvement on standard k-means clustering. The ECBD is described as follows [20].
Assume that q samples (x_1, x_2, ..., x_q) are selected at the uncertainty step. The idea of kernel k-means is to divide the q samples into h clusters (C_1, C_2, ..., C_h) in the kernel space; the most uncertain sample of each cluster is then taken as its representative sample. The cluster centers are denoted by (μ_1, μ_2, ..., μ_h). Suppose that sample x_i mapped into the kernel space is denoted by φ(x_i). The Euclidean distance between samples x_i and x_j in the kernel space is written as

D(x_i, x_j) = sqrt( K(x_i, x_i) − 2K(x_i, x_j) + K(x_j, x_j) ).

Let μ_v be the center of cluster C_v in the kernel space:

μ_v = (1 / |C_v|) Σ_{j=1}^{q} δ(x_j, C_v) φ(x_j),

where |C_v| denotes the total number of samples in cluster C_v and is computed as |C_v| = Σ_{j=1}^{q} δ(x_j, C_v). δ(x_j, C_v) (1 ≤ v ≤ h) is the indicator function:

δ(x_j, C_v) = 1 if x_j ∈ C_v, and 0 otherwise.

The distance between φ(x_i) and μ_v can then be expressed entirely through the kernel:

D(x_i, μ_v)² = K(x_i, x_i) − (2 / |C_v|) Σ_{j=1}^{q} δ(x_j, C_v) K(x_i, x_j) + (1 / |C_v|²) Σ_{j=1}^{q} Σ_{l=1}^{q} δ(x_j, C_v) δ(x_l, C_v) K(x_j, x_l).    (10)

By applying (10) in the standard k-means clustering, we obtain the kernel-based k-means algorithm described as follows:

(1) Assign initial values of δ(x_i, C_v) (i = 1, 2, ..., q; v = 1, 2, ..., h), obtaining h initial clusters C_1, C_2, ..., C_h.
(2) For each sample x_i, compute the distance D(x_i, μ_v) to every cluster center using (10).

(3) Reassign each sample to the cluster with the nearest center, updating δ(x_i, C_v).

(4) Repeat steps (2) and (3) until the cluster assignments no longer change.

(5) For each cluster C_v, select the sample closest to the center in the kernel space as the pseudocenter of C_v:

m_v = argmin_{x_i ∈ C_v} D(x_i, μ_v).

After m_1, m_2, ..., m_h are obtained, the most informative sample of each cluster is selected as its representative. This sample is defined as follows:

x_BvSB+ECBD^v = argmin_{x_i ∈ C_v} [p(y_best | x_i) − p(y_second-best | x_i)],    (14)

where x_BvSB+ECBD^v represents the vth sample chosen by the sampling strategy (i.e., BvSB + ECBD); it is the most uncertain sample of the vth cluster (i.e., the sample with the minimum uncertainty difference in the vth cluster). In total, h samples are selected using (14), one per cluster.
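A minimal sketch of the kernel k-means step, computing the distances of Eq. (10) purely through the kernel matrix. The toy data, the deterministic initialization, and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Kernel k-means: distances to cluster centers are computed entirely
# through the kernel matrix K, never in the (implicit) feature space.
def kernel_kmeans(K, h, n_iter=20):
    n = K.shape[0]
    assign = np.arange(n) % h                 # deterministic initial delta(x_i, C_v)
    for _ in range(n_iter):
        d2 = np.empty((n, h))
        for v in range(h):
            idx = np.flatnonzero(assign == v)
            if idx.size == 0:
                d2[:, v] = np.inf             # guard against empty clusters
                continue
            # ||phi(x_i) - mu_v||^2 via the kernel trick, Eq. (10)
            d2[:, v] = (np.diag(K)
                        - 2.0 * K[:, idx].mean(axis=1)
                        + K[np.ix_(idx, idx)].mean())
        assign = d2.argmin(axis=1)            # reassign to nearest center
    return assign

# Toy RBF kernel on two well-separated 1-D groups.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])[:, None]
K = np.exp(-((x - x.T) ** 2))
labels = kernel_kmeans(K, h=2)
print(labels)  # the two groups fall into different clusters
```

The pseudocenter of each cluster (step (5)) would then be the sample minimizing the same kernel-space distance within its cluster.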

AL Algorithm Based on BvSB + ECBD.
Based on consideration of both the uncertainty of the current classifier and the diversity of the sample distribution, we design a multiclass classification algorithm based on the BvSB + ECBD method. The BvSB criterion aims at selecting the most informative samples; the ECBD criterion is used to select diverse samples by clustering in the kernel space. In this way, the most representative samples are selected, and the user is queried for their labels. The obtained samples and the corresponding labels are then incorporated into the training set, and the classifier is retrained. The algorithm can be summarized as follows.

Inputs

q is the number of samples selected based on the BvSB method; h is the number of samples selected and added to the training set at each iteration.

(3) The Clustering Based on ECBD in the Kernel Space. The q samples are clustered in the kernel space; the detailed description is given in Section 3.2. The h samples are selected according to formula (14) and marked as S.

Until the algorithm converges or the maximum number of iterations is reached.
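The final selection of formula (14) can be sketched as follows; the margin values, cluster labels, and helper name are hypothetical stand-ins for the outputs of the BvSB and ECBD steps.

```python
import numpy as np

# Final BvSB + ECBD selection step, Eq. (14): after the q most uncertain
# samples have been clustered into h groups in the kernel space, keep the
# single most uncertain sample (smallest BvSB difference) of each cluster.
def pick_representatives(margins, clusters, h):
    """margins: BvSB differences of the q uncertain samples;
    clusters: cluster index (0..h-1) of each sample."""
    picked = []
    for v in range(h):
        idx = np.flatnonzero(clusters == v)
        picked.append(idx[np.argmin(margins[idx])])  # min margin = most uncertain
    return np.array(picked)

margins = np.array([0.40, 0.05, 0.30, 0.02, 0.20, 0.10])
clusters = np.array([0, 0, 1, 1, 2, 2])
print(pick_representatives(margins, clusters, h=3))  # → [1 3 5]
```

The h indices returned are the samples S that the user is asked to label before the classifier is retrained.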

Experiment Analysis
4.1. Data Set Description. The Hyperion sensor is mounted on the EO-1 Earth observation satellite, which was launched by NASA in November 2000. A Hyperion hyperspectral image has a total of 242 bands and a spatial resolution of 30 m; 220 unique spectral channels are collected, covering a complete spectrum from 357 to 2576 nm. Because the sensor is experimental, the coverage of a hyperspectral image is small, only 7.7 km × 44 km [30]. Because hyperspectral images have high spectral resolution and continuous spectra, they have been widely used in vegetation studies, geological surveys, precision agriculture, marine remote sensing, and so on.
The data set is a hyperspectral image acquired over a marine area of Baffin Bay in northwest Greenland on April 12, 2014. The data are at the L1Gst level, having undergone geometric correction, projection registration, and topographic correction. The image consists of 2395 × 1769 pixels (including background pixels). The number of bands is first reduced to 176 by removing bands with low signal-to-noise ratio and water absorption. The available labeled samples (1678 samples) are collected by Landsat-8 image interpretation and are illustrated in Figure 3(a). Figure 3(b) is a subset of the entire image in Figure 3(a). As can be seen from the image, three classes are available, namely, seawater, thin ice, and thick ice. All labeled samples are randomly divided to derive a pool P and a validation set V. We use Landsat-8 data with a spatial resolution of 15 m, acquired at the same time over the same scene, as the test set. The final classification performance is evaluated on the validation set.

The experiments are designed to compare the classification accuracy of different AL algorithms, namely, BvSB and the investigated and proposed techniques (BvSB + k-means and BvSB + ECBD). In the AL experiments, three samples of each class are randomly chosen from the pool P as initial training samples, and the rest are treated as unlabeled samples. All reported results are averages of the classification accuracies obtained in ten trials with ten random initializations of the training samples. At each round of active learning, first, in the uncertainty step, we select q samples on the basis of uncertainty (the difference between the two highest estimated class probabilities, i.e., BvSB). In the diversity step, the most diverse h < q samples are chosen by either standard k-means or ECBD to query the user for labels. Then the selected samples and the corresponding labels are incorporated into the training set. Finally, the classifier is retrained. The numbers of samples chosen by the different methods are shown in Table 2.

Experimental Results
This section reports the experimental results of random sampling and of the AL algorithms, namely, BvSB, BvSB + k-means, and BvSB + ECBD. Results are presented as learning curves, which show the relation between the average overall classification accuracy and the number of active learning rounds used to train the SVM classifier. By analyzing Figure 4, we can observe that the three AL algorithms are generally better than random sampling. The results show that our proposed BvSB + ECBD technique yields the highest accuracies in most of the iterations. Furthermore, for the same number of training samples (the same point on the x-axis), BvSB + ECBD shows significantly improved classification accuracy. From another perspective, to achieve the same classification accuracy (the same point on the y-axis), our proposed active learning algorithm needs far fewer training samples than random selection, as shown in Table 3. This indicates that the proposed method selects the most useful samples at each iteration, so that the user input can be effectively concentrated on the most relevant samples [23].
From Figure 4, one can see that the BvSB + ECBD method provides more informative samples than the BvSB and BvSB + k-means methods and achieves higher accuracies for the same number of active learning rounds. Figures 5 and 6 show the distributions of the chosen training samples and the pool (considering bands 49 and 84 of the hyperspectral image) after six iterations of the AL process with the BvSB method and the BvSB + ECBD method, respectively. Note that, since the BvSB method considers only the uncertainty of samples, it may select similar samples that provide only redundant information. We can also see that performing the clustering in the kernel space improves the classification accuracy compared with standard k-means clustering. Indeed, because of the kernel mapping, the set of most diverse samples in the original space may not be the most diverse in the kernel space [20]. Figure 7 shows the original hyperspectral image and the classification results. To assess the classification performance of our proposed method, Table 4 reports the confusion matrix, the per-class accuracy, and the Kappa coefficient at the last iteration of the BvSB + ECBD method. It is important to observe that the accuracies of seawater and thin ice are low. Since the hyperspectral image was acquired in April 2014, thin ice had begun to melt with the increasing temperature, so many pixels of seawater and thin ice were confused and wrongly classified. Finally, we analyze the sensitivity of our proposed BvSB + ECBD method to the number of initial training samples. In Figure 8, N denotes the total number of initial training samples over the three classes, with the same number of samples selected for each class. One can see that different N values result in similar classification accuracies. This indicates that different initial training samples do not provide a large benefit in the BvSB + ECBD method; that is, the classification accuracy is not sensitive to the selection of the initial training samples [23]. Furthermore, we can also observe that convergence is achieved more easily with high N values than with small ones, because a greater N means more training samples for the same number of rounds.

Conclusions
In this paper, AL algorithms for the classification of hyperspectral images have been addressed, which can reduce the number of labeled samples added to the training set and improve the classification accuracy with respect to traditional passive techniques. A query function based on BvSB in the uncertainty step, with standard k-means clustering or ECBD in the diversity step, has been generalized to multiclass problems. Moreover, our proposed BvSB + ECBD method is compared with BvSB, BvSB + k-means, and random sampling in terms of classification accuracy. From the experimental results, we can summarize as follows: (1) the proposed BvSB + ECBD method achieves the best classification accuracy and can save a large number of labeled samples compared with random sampling; (2) the BvSB + k-means method provides slightly lower classification accuracies than the BvSB + ECBD technique: at the diversity step, because of the kernel mapping, the most diverse samples found in the original space by standard k-means clustering may not be the most diverse in the kernel space, which means the most informative samples for the current classifier may not be selected; (3) the BvSB method yields poorer classification accuracies than the other AL algorithms, so we conclude that the uncertain samples obtained by the BvSB technique alone may be similar to one another and fail to provide additional information; (4) our proposed BvSB + ECBD method is not sensitive to the selection of the initial training samples.
As a future development, we plan to extend our proposed AL technique by integrating semisupervised methods into the classification of hyperspectral images. During the iterative process, we can also make full use of the abundant spectral information to select more representative samples and identify the ice types more accurately.

Figure 1 :
Figure 1: The classification framework based on AL.

S = {x_BvSB+ECBD^1, x_BvSB+ECBD^2, ..., x_BvSB+ECBD^h}. These selected samples are labeled by the user as the representative samples. (4) Updating the Training Sample Set and Retraining the Classifier. Renew the training sample set and the unlabeled sample set with the newly selected samples, T = T ∪ S and U → U \ S; then the SVM classifier is retrained with the new training sample set.

Figure 4 :
Figure 4: Overall classification accuracy obtained with different AL algorithms and random sampling.

Figure 5 :
Figure 5: Distribution of the chosen training samples by the BvSB method and pool samples considering bands 49 and 84 of the hyperspectral image.

Figure 6 :
Figure 6: Distribution of the chosen training samples by the BvSB + ECBD method and pool samples considering bands 49 and 84 of the hyperspectral image.
Figure 7(a) is the original hyperspectral image, composed of bands 25, 49, and 84.

Figure 7(b) is the classification image of the Landsat-8 data obtained by the standard SVM classifier, which is used as a reference image for evaluating the classification performance. Figure 7(c) is the classification image of the hyperspectral image obtained with the BvSB + ECBD method.

Figure 7 :
Figure 7: (a) Hyperspectral image (a false color image composed of R: 84, G: 49, and B: 25).(b) Result of the classification of the Landsat-8 data.(c) Result of the classification of the hyperspectral image.

Figure 8 :
Figure 8: Classification overall accuracy obtained by the BvSB + ECBD method with different  values.
3.1. The BvSB Method Based on the Uncertainty. Let us assume that the unlabeled sample set is U = {x_1, ..., x_n} and that the associated labels are denoted by Y = {y_1, ..., y_n}, where p(y_i | x_i) denotes the class-membership probability [29]. Under the BvSB criterion, we consider only the difference between the probabilities of the two classes having the highest estimated probabilities as the measure of uncertainty. The optimal label and the suboptimal label are denoted by y_Best and y_Second-best, respectively, and the other classes are ignored. The probabilities of the optimal and suboptimal labels for sample x_i are p(y_Best | x_i) and p(y_Second-best | x_i); the criterion can then be described as

x_uncertainty = argmin_{x_i} [p(y_Best | x_i) − p(y_Second-best | x_i)].

Table 1 :
Number of samples of each class in P and V for the data set. The classes and the related numbers of samples used in the experiments are shown in Table 1.

4.2. Design of Experiments. In our experiments, without loss of generality, we adopt an SVM classifier with a radial basis function (RBF) kernel. The values of the regularization parameter C and the spread σ of the RBF kernel are obtained by a cross-validated grid search on the validation set. The best value of the parameter C is 32; the optimal kernel width σ is found to be 16.
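The grid search described above can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the authors' code; note that scikit-learn parameterizes the RBF kernel by gamma rather than by the width σ, so the candidate values below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Cross-validated grid search for the RBF-SVM regularization parameter C
# and kernel parameter gamma (the reported C = 32 and width 16 were
# obtained this way on the paper's validation set).
X, y = make_classification(n_samples=120, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [1, 8, 32, 128], "gamma": [0.1, 1, 16]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)  # best (C, gamma) pair found by 3-fold CV
```

In practice the search would be run on the held-out validation set V rather than on synthetic data.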

Table 2 :
Number of samples chosen by the different methods.

Table 3 :
Percentage reduction in the number of training samples.

Table 4 :
Confusion matrix for the classification with the proposed BvSB + ECBD method.