Object-Oriented Semisupervised Classification of VHR Images by Combining MedLDA and a Bilateral Filter

A Bayesian hierarchical model is presented to classify very high resolution (VHR) images in a semisupervised manner, in which a maximum entropy discrimination latent Dirichlet allocation (MedLDA) and a bilateral filter are combined into a novel application framework. The primary contribution of this paper is to overcome the disadvantages of traditional probabilistic topic models with respect to pixel-level supervised information and to achieve the effective classification of VHR remote sensing images. This framework consists of the following two iterative steps. In the training stage, the model utilizes the central labeled pixel and its neighborhood, as a square labeled image object, to train the classifiers. In the classification stage, each central unlabeled pixel with its neighborhood, as an unlabeled object, is classified as a user-provided geoobject class label with the maximum posterior probability. Gibbs sampling is adopted for model inference. The experimental results demonstrate that the proposed method outperforms two classical SVM-based supervised classification methods and probabilistic-topic-models-based classification methods.


Introduction
With the development of imaging technology, many airborne and satellite sensors, for example, QuickBird, IKONOS, and WorldView, can provide very high resolution (VHR) images. Pixel-based supervised classification methods that have been successfully applied to low or moderate resolution remote sensing images do not produce desirable results when applied to VHR images, because the spatial relationship among pixels is neglected when these methods are used and the "pepper and salt" effect is often observed in classification results of VHR remote sensing images. To solve this problem, object-based classification methods are often used to classify VHR images. For example, Wang et al. integrated the pixel and object-based methods for mapping mangroves with IKONOS imagery [1]. Salehi et al. developed an object-based classification framework for QuickBird imagery coupled with a layer of height points to classify a complex urban environment [2]. Kim et al. investigated the use of a geographic object-based image analysis approach with the incorporation of object-specific grey-level cooccurrence matrix texture measures from a multispectral IKONOS image for mapping forest type [3]. These methods often consist of two sequential steps, that is, segmentation and classification [4]. Although this two-step procedure works well in some cases, several problems still remain. First, the classification results depend heavily on the segmentation algorithm. Second, training objects must be labeled before the image analysis in these classification methods. However, supervised information is often obtained at the pixel level. Thus a contradiction is noted between image objects and pixel-level supervised information, and labels cannot be used directly to train object classifiers [4].
In natural language processing, probabilistic topic models can be applied to find the latent topic representations of documents in a corpus [5,6]. When these models are used to model remote sensing images, the images are often partitioned into a set of image tiles (i.e., documents) and the characteristics of pixels are often treated as visual words [4-9]. These probabilistic topic models have been used to discover semantic structures from VHR remote sensing images, such as in target detection [7], image clustering [8,10,11], and image annotation [12]. Probabilistic topic models can also use supervised information for discovering latent topic representations, such as supervised latent Dirichlet allocation (sLDA) [13], discriminative latent Dirichlet allocation (DiscLDA) [14], semisupervised latent Dirichlet allocation (ssLDA) [4], and maximum entropy discrimination latent Dirichlet allocation (MedLDA) [15]. These models differ in how they approach classification: the sLDA, DiscLDA, and ssLDA are fully generative models learned under likelihood-driven objective functions, whereas the MedLDA introduces the discriminative max-margin principle into the process of topic learning, which is more suitable for classification tasks. Additionally, the supervised information in these models, except for the ssLDA, is associated with documents or image tiles rather than individual pixels. Thus, a contradiction is noted between the classification results of individual pixels and tile-level supervised information in these models, and tile-level labels cannot be used directly to train classifiers for individual pixels.
To address the aforementioned problems in object-based supervised classification methods and probabilistic topic models, we present an object-oriented semisupervised classification method [4,16] for VHR remote sensing images based on the MedLDA model [15,17] and the bilateral filter [18] in a novel framework, which is referred to as the semisupervised MedLDA (ssMedLDA). Each image object is defined as a square image block consisting of a pixel and its neighborhood, without image segmentation [10]. The main contribution of our proposed model is the combination of a probabilistic topic model with pixel-level supervised information to achieve effective object-oriented semisupervised classification of VHR remote sensing images. The remainder of this paper is organized as follows. In Section 2, the proposed approach is presented in detail. The experimental results are given in Section 3. Finally, the conclusions are presented in Section 4.

Methodology
First, the mismatch between object-level and pixel-level supervised information is discussed in this section. Second, the proposed ssMedLDA method for classification is introduced. Third, the algorithm is presented. In this paper, a linear discriminative function, for example, $y_d = \eta^{\top}\bar{z}_d$, is used to bridge the user-provided geoobject class label $y_d$ and an observed image object $O_d$ via an inferred latent image feature $\bar{z}_d$, where $\eta$ is learned under the max-margin principle.
As shown in Figure 1, the latent image feature $\bar{z}_d$ in probabilistic topic models is an expected measurement for the $d$th image object $O_d$, which consists of $N = h \times h$ pixels, that is, $\vec{w}_d = \{w_{d1}, w_{d2}, \ldots, w_{dN}\}$. As for image object $O_d$, its latent image feature $\bar{z}_d$ is a $K$-dimensional vector with element

$$\bar{z}_d^{k} = \frac{1}{N} \sum_{n=1}^{N} I(z_{dn}^{k} = 1), \tag{1}$$

where $I(\cdot)$ is an indicator function that equals 1 when the predicate holds and 0 otherwise. Because the latent topic assignments of pixels in image object $O_d$, that is, $\{z_{d1}, z_{d2}, \ldots, z_{dN}\}$, are assumed to be independent and identically distributed in the MedLDA, the contribution or distance coefficient of each latent topic assignment to the image object $O_d$ is assumed to be equal, that is, $1/N$. When image objects are given geoobject class labels, the parameter $\eta$ of the linear discriminative function can be learned using the inferred latent features. As shown in Figure 1, the user-provided geoobject class labels are often given only for partially labeled pixels. Thus a contradiction is noted between image objects $O_d$ and pixel-level supervised information $y_d$: the supervised information has different influences on each pixel in the image object $O_d$ and cannot be used directly to train object-oriented classifiers. Thus the problem is how to use labeled pixels directly to train object-oriented classifiers.
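The equal-weight latent image feature described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `latent_feature`, the integer encoding of topics, and the 3 × 3 example object are hypothetical and are not taken from the authors' MATLAB implementation.

```python
from collections import Counter


def latent_feature(topic_assignments, num_topics):
    """Empirical topic frequencies of one image object.

    Every pixel contributes the same weight 1/N, which is the
    equal-contribution assumption described in the text.
    """
    n = len(topic_assignments)
    counts = Counter(topic_assignments)
    return [counts.get(k, 0) / n for k in range(num_topics)]


# Hypothetical 3 x 3 object (N = 9 pixels) with K = 3 topics, encoded 0..2.
z_d = [0, 0, 1, 0, 2, 1, 0, 0, 1]
z_bar = latent_feature(z_d, num_topics=3)
```

In this toy object, five of the nine pixels are assigned to the first topic, so the first element of the feature is 5/9, and the elements sum to one.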
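To solve the above-mentioned problem, an edge-preserving filter (EPF) [19], for example, the bilateral filter [18], is used to construct the relationship between the central labeled pixel and the other pixels in the image object. The pixel-level supervised information can be diffused to object-level supervised information based on a distance that includes both the spatial distance $\delta^{\mathrm{spa}}_{dn}$ and the spectral distance $\delta^{\mathrm{spe}}_{dn}$ between the center pixel $w_{dc}$ and the $n$th pixel $w_{dn}$ in object $O_d$. $G_{\mathrm{spa}}$ and $G_{\mathrm{spe}}$ are defined as the spatial and spectral decay functions, respectively. Therefore, the pixel-level supervised information diffuses to object-level semisupervised information in this paper through the distance coefficient $\omega_{dn}$:

$$\omega_{dn} \propto G_{\mathrm{spa}}\left(\delta^{\mathrm{spa}}_{dn}\right) G_{\mathrm{spe}}\left(\delta^{\mathrm{spe}}_{dn}\right), \qquad \delta^{\mathrm{spa}}_{dn} = \lVert p_{dc} - p_{dn} \rVert, \qquad \delta^{\mathrm{spe}}_{dn} = \lVert v_{dc} - v_{dn} \rVert, \tag{2}$$

where $p_{dc}$ is the spatial location of $w_{dc}$, $p_{dn}$ is the spatial location of $w_{dn}$, $v_{dc}$ is the spectral location of $w_{dc}$, $v_{dn}$ is the spectral location of $w_{dn}$, $\sigma_{\mathrm{spa}}$ is the spatial decay parameter of $G_{\mathrm{spa}}$, $\sigma_{\mathrm{spe}}$ is the spectral decay parameter of $G_{\mathrm{spe}}$, and $\sigma_{\mathrm{spa}} > 0$ and $\sigma_{\mathrm{spe}} > 0$.

In step (3) of the generative process of the ssMedLDA, the class label $y_d$ is sampled conditioned on $\eta^{\top}\bar{z}_d$, where $\bar{z}_d$ is a vector with element $\bar{z}_d^{k} = \sum_{n=1}^{N} \omega_{dn} \cdot I(z_{dn}^{k} = 1)$, and $I(\cdot)$ is an indicator function that equals 1 when the predicate holds and 0 otherwise in the image object $O_d$.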
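A minimal Python sketch of such distance coefficients is given below. It rests on several assumptions that the text does not confirm: Gaussian decay functions of the form exp(-distance² / (2σ²)), as in the standard bilateral filter [18]; normalization of the coefficients so that they sum to one within an object; and hypothetical function names and toy data.

```python
import math


def distance_coefficients(positions, spectra, center, sigma_spa, sigma_spe):
    """Bilateral-style weights of all pixels relative to the center pixel.

    Assumption: Gaussian spatial and spectral decay, normalized to sum to 1.
    """
    pc, vc = positions[center], spectra[center]
    raw = []
    for p, v in zip(positions, spectra):
        d_spa2 = sum((a - b) ** 2 for a, b in zip(p, pc))  # squared spatial distance
        d_spe2 = sum((a - b) ** 2 for a, b in zip(v, vc))  # squared spectral distance
        g_spa = math.exp(-d_spa2 / (2.0 * sigma_spa ** 2))
        g_spe = math.exp(-d_spe2 / (2.0 * sigma_spe ** 2))
        raw.append(g_spa * g_spe)
    total = sum(raw)
    return [r / total for r in raw]


def weighted_latent_feature(topic_assignments, weights, num_topics):
    """Latent feature in which each pixel contributes its distance coefficient."""
    z_bar = [0.0] * num_topics
    for k, w in zip(topic_assignments, weights):
        z_bar[k] += w
    return z_bar


# Hypothetical 3 x 3 single-band object; the right-hand column lies across an edge.
positions = [(i, j) for i in range(3) for j in range(3)]
spectra = [(10.0,), (11.0,), (50.0,), (10.5,), (10.0,), (52.0,), (9.5,), (10.0,), (51.0,)]
topics = [0, 0, 1, 0, 0, 1, 0, 0, 1]
weights = distance_coefficients(positions, spectra, center=4, sigma_spa=5.0, sigma_spe=3.0)
z_bar = weighted_latent_feature(topics, weights, num_topics=2)
```

With equal weights, the second topic would account for one third of this toy object. Under the assumed weights, the three pixels beyond the spectral edge receive almost no weight, so the feature is dominated by the topic of the pixels that resemble the labeled center pixel.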
The distance coefficient $\omega_{dn}$ is integrated into the ssMedLDA through conditional topic random fields [20], as shown in Figure 2.

Algorithm. As shown in Figure 3, Gibbs sampling is used to perform model inference and approximate the posterior distribution of latent variables in the model. The algorithm of the ssMedLDA can be summarized as follows.
Step 1 (initialization). In this step, the parameters in the model are initialized. The following parameters must be set: the number of latent topics, $K$; the Dirichlet hyperparameter, $\alpha$; the neighborhood size, $h$; the positive regularization parameter, $c$; and the cost of making a wrong prediction, $\ell$. In addition, the matrix of topics $\mathbf{z}$ is initialized by assigning random topics to all pixels.
Step 2 (inference). In the ssMedLDA, topics are learned in two ways, depending on whether the supervised information of the central pixel is available.
(a) When the central pixel $w_{dc}$ in the $d$th image object $O_d$ is unlabeled, analogous to the derivation in previous studies [5,21], the conditional distribution of $z_{dn}$ is determined by the following quantities: $(\mu_k, \Sigma_k)$, the parameters of the $k$th Gaussian distribution; $C^{k}_{d,\neg n}$, the number of pixels in the object $O_d$ that are associated with topic $k$, excluding $w_{dn}$; and $\alpha_k$, the value of the Dirichlet hyperparameter associated with topic $k$.
(b) When the central pixel $w_{dc}$ is labeled, analogous to the derivation previously presented [17], the conditional posterior of $z_{dn}$ changes to incorporate the supervised information. The augmented variables $\lambda_d$ are drawn such that $(\lambda_d)^{-1}$ obeys the inverse Gaussian distribution, denoted by $\mathcal{IG}$, with $\zeta_d = \ell - y_d\,\eta^{\top}\bar{z}_d$. The linear discriminative function $\eta$ is then updated from the labeled image objects, where $D_{\mathrm{lab}}$ indicates the number of labeled pixels.
Step 3 (classification). Each unlabeled pixel $w_{dc}$ is classified as the user-provided geoobject class label with the maximum posterior probability.
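The unlabeled-pixel update in Step 2(a) and the decision in Step 3 can be sketched in Python as follows. This is a sketch under explicit assumptions, not the authors' implementation: the conditional is assumed to be proportional to the Gaussian likelihood of the pixel under topic k multiplied by the topic count plus the Dirichlet hyperparameter, which is a standard collapsed Gibbs form; pixels are single-band so that a univariate Gaussian suffices; and the class decision uses hypothetical linear scores as a stand-in for the posterior probabilities. The labeled case of Step 2(b) is not covered.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian (single-band pixels are assumed)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def topic_conditional(x, counts_without_pixel, alpha, means, variances):
    """Assumed form: p(z = k | rest) proportional to N(x | mu_k, var_k) * (C_k + alpha_k)."""
    scores = [
        gaussian_pdf(x, m, v) * (c + a)
        for c, a, m, v in zip(counts_without_pixel, alpha, means, variances)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def classify(z_bar, eta):
    """Pick the class with the largest score (stand-in for the maximum posterior)."""
    scores = [sum(e * z for e, z in zip(eta_y, z_bar)) for eta_y in eta]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical state: two topics, counts exclude the pixel being resampled.
probs = topic_conditional(
    x=0.9, counts_without_pixel=[6, 2], alpha=[1.0, 1.0],
    means=[1.0, 5.0], variances=[1.0, 1.0],
)
new_topic = random.Random(0).choices(range(2), weights=probs)[0]  # one Gibbs draw
label = classify(z_bar=[0.8, 0.2], eta=[[1.0, -1.0], [-1.0, 1.0]])
```

A full sampler would sweep this update over every pixel of every object and repeat it for a number of iterations before the latent features are read off for classification.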

Experimental Result and Discussion
In this section, the experimental data and the quantitative evaluation methods for the experimental results are described. The performance of the ssMedLDA is then compared with that of the pixel-based SVM, spectral-spatial SVM, and ssLDA methods. Finally, the influence of different numbers of labeled pixels on the methods is analyzed. The proposed ssMedLDA algorithm and the other methods are implemented in a MATLAB environment.

Experimental Data.
The VHR data were collected by the ROSIS optical sensor over the urban area of the University of Pavia, Italy. The image data, with a size of 610 × 340 pixels and a high spatial resolution of 1.3 m/pixel, were used in the experiment. Figure 4(a) shows a color composite of the image, whereas Figure 4(b) presents the ground truth map, which includes nine geoobject classes of interest, that is, asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows, each of which is represented by a color.

Comparison with Existing Methods and Parameters Setting.
To evaluate the effectiveness of the proposed method, the performance of the ssMedLDA was compared with that of three classification methods: (1) the pixel-based SVM; (2) the spectral-spatial SVM, in which a pixel-based SVM classification was generated and followed by majority voting within the neighborhoods defined by image segmentation (termed SVM + MV) [22]; and (3) the ssLDA, which is a supervised probabilistic topic model that uses pixel-level supervised information. The segmentation map for the three existing classification methods was produced by the entropy rate superpixel segmentation algorithm, which is effective and efficient [23]. The optimal number of segments, 4000, was experimentally derived. In the two SVMs, a multiclass one-versus-one SVM classification with an RBF kernel from LIBSVM was used [24], and the other parameters were determined by parameter optimization. In the ssLDA, the number of latent topics $K$ was 80 and the Dirichlet hyperparameter $\alpha$ was $1 + 50/K$. In the ssMedLDA, the parameters $K$ and $\alpha$ were identical to those of the ssLDA. The positive regularization parameter $c$ and the cost of making a wrong prediction $\ell$ were both 1, the number of classes of interest $L$ was 9 according to the ground truth, the neighborhood size $h$ was 11, the spatial decay parameter $\sigma_{\mathrm{spa}}$ was 5, and the spectral decay parameter $\sigma_{\mathrm{spe}}$ was the variance of the VHR image.
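The majority voting step of the SVM + MV baseline can be sketched in Python as follows. The sketch assumes that a per-pixel class map and a segment identifier for each pixel are already available; the function name `majority_vote` and the toy labels are hypothetical, and neither the SVM nor the superpixel segmentation is reproduced here.

```python
from collections import Counter, defaultdict


def majority_vote(pixel_labels, segment_ids):
    """Relabel every pixel with the most frequent label inside its segment.

    Ties are resolved in favor of the label encountered first in the segment.
    """
    votes = defaultdict(Counter)
    for label, seg in zip(pixel_labels, segment_ids):
        votes[seg][label] += 1
    winner = {seg: counter.most_common(1)[0][0] for seg, counter in votes.items()}
    return [winner[seg] for seg in segment_ids]


# Hypothetical pixel-wise predictions and two segments of three pixels each.
pixel_labels = ["trees", "trees", "meadows", "asphalt", "asphalt", "asphalt"]
segment_ids = [0, 0, 0, 1, 1, 1]
smoothed = majority_vote(pixel_labels, segment_ids)
```

In this toy case the isolated "meadows" pixel in the first segment is relabeled as "trees", which illustrates both the smoothing effect and the dependence of the result on the segmentation map.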
In total, 10% of the labeled pixels were selected at equal intervals for training the four methods, and the remaining 90% of the labeled pixels were set aside for testing. The corresponding classification maps for the 10% labeled pixels condition are shown in Figures 5(a)-5(d). The ssMedLDA achieves a more compact and smoother classification result than the SVM and ssLDA. The visual results from SVM + MV and ssMedLDA look similar, but the latter does not require a segmentation map. These results support the advantages of integrating the max-margin principle, topic modeling, and the distance-based diffusion of pixel-level supervised information to the neighborhood.
Table 1 displays the class-specific accuracies (producer accuracies), overall accuracies, and kappa coefficients for all methods and shows that the ssMedLDA method yields the best overall accuracy and kappa coefficient. Its producer accuracies for asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, and bricks are better than those of the other methods. The overall accuracy and kappa coefficient of SVM + MV are better than those of the SVM, but its producer accuracy for shadows is not better because of the segmentation map. Therefore, the influence of the segmentation on the classification is not always beneficial.
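For reference, the three measures reported in Table 1 can be computed from a confusion matrix as in the following Python sketch. The 2 × 2 matrix is a hypothetical example and does not reproduce any value from Table 1; rows are assumed to hold the reference classes and columns the predicted classes.

```python
def accuracy_metrics(confusion):
    """Overall accuracy, producer accuracies, and Cohen's kappa.

    confusion[i][j] = number of test pixels of reference class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = sum(confusion[i][i] for i in range(n_classes))
    overall = diagonal / total
    # Producer accuracy: correctly classified pixels divided by the reference total.
    producer = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    # Chance agreement expected from the row and column marginals.
    expected = sum(sum(confusion[i]) * col_sums[i] for i in range(n_classes)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    return overall, producer, kappa


# Hypothetical two-class confusion matrix with 100 test pixels.
confusion = [[45, 5], [10, 40]]
overall, producer, kappa = accuracy_metrics(confusion)
```

For this toy matrix the overall accuracy is 0.85, the producer accuracies are 0.9 and 0.8, and the kappa coefficient is 0.7, since the chance agreement from the marginals is 0.5.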

Influence of Different Sizes of Labeled Pixels.
To explore how the performance of the ssMedLDA and the other object-based models varies with the number of labeled pixels, experiments were conducted with different proportions of the ground truth pixels used for training, that is, approximately 1.0%, 2.0%, 4.0%, 5.9%, 7.7%, 10.0%, 12.5%, 14.3%, 16.7%, 20.0%, and 25.0%. Labeled training pixels for each class were acquired at equal intervals from the ground truth, and the remaining labeled pixels of the ground truth were used for testing. The identical labeled pixels were used to train the SVM + MV, ssLDA, and ssMedLDA. All of the unlabeled pixels were used for training in the ssLDA and ssMedLDA. Figure 6 shows the overall accuracies of the SVM + MV, ssLDA, and ssMedLDA against the proportions of labeled training pixels. Two results are noted: (1) regardless of the number of labeled pixels, the ssMedLDA method achieves a better performance than the ssLDA and SVM + MV, because the ssMedLDA integrates the mechanisms of probabilistic topic models, maximum entropy discrimination, semisupervised learning, and spatial coherence; and (2) the performance of the ssMedLDA, SVM + MV, and ssLDA is not always enhanced by increasing the proportion of labeled pixels, because the samples are chosen at equal intervals, which means that the samples of the 4% condition do not completely contain the 2% samples.
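The non-nested nature of equal-interval sampling can be illustrated with a small Python sketch. It assumes the simplest possible scheme, taking every s-th labeled pixel of a class starting from the first one; the actual offsets and ordering used in the experiments are not stated, and the class size of 1000 pixels is hypothetical. Two of the listed proportions, about 5.9% and 7.7%, correspond to intervals of 17 and 13 pixels.

```python
def equal_interval_sample(indices, stride):
    """Take every stride-th labeled pixel, starting from the first one (assumed scheme)."""
    return indices[::stride]


# Hypothetical class with 1000 labeled ground truth pixels.
class_pixels = list(range(1000))
sample_17 = set(equal_interval_sample(class_pixels, 17))  # about 5.9% of the class
sample_13 = set(equal_interval_sample(class_pixels, 13))  # about 7.7% of the class

shared = sample_17 & sample_13   # pixels present in both training sets
nested = sample_17 <= sample_13  # is the smaller set contained in the larger one?
```

Under this assumed scheme the two training sets contain 59 and 77 pixels but share only 5 of them, so a larger proportion replaces most of the training pixels rather than extending the smaller set, which is consistent with accuracy not increasing monotonically.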

Conclusions
In this paper, a semisupervised method has been proposed to address the problem of VHR remote sensing image classification. The method combines the MedLDA model and the bilateral filter through conditional topic random fields for training. The proposed method takes advantage of spatial and spectral relationships in VHR images. Additionally, the ssMedLDA does not require a segmentation map, and the pixels with their neighborhoods are used as objects to enforce spatial regularization over the classification results. The experimental results show that the proposed approach is superior to the SVM + MV and ssLDA methods.

Figure 1: Supervised information $y_d$ on pixel $w_{dc}$ and image object $O_d$ and the latent image feature.

2.1. Problem Statement. Let $S = \{s = (i, j) \mid 1 \le i \le I,\ 1 \le j \le J\}$ be the set of lattice sites in the given satellite image, where $I$ and $J$ are the width and height of the image, respectively. A random field indexed by $S$ is given by $\mathbf{W} = \{W_s = w_s \mid w_s \in V,\ s \in S\}$, where a random variable $W_s$ at site $s$ takes a value $w_s$ in its state space $V$. The set $\mathbf{w} = \{w_s \mid s \in S\}$ is drawn from the state space $X$ with the joint probability $P(\mathbf{W} = \mathbf{w})$.

Figure 2: Graphical model. The shaded nodes, including the pixel value $w_{dn}$ and the set of distance coefficients $\vec{\omega}_d$ in the $d$th image object, represent the observed random variables, and the hashed fill node including $y_d$ denotes the partially observed random variable class. The intuitive explanations of the model are shown at the end of the red arrows.

2.2. ssMedLDA. Let $\vec{w}_d = \{w_{d1}, w_{d2}, \ldots, w_{dN}\}$ be a vector of $N$ pixels appearing in the $d$th image object $O_d$ in the remote sensing image, and let $K$ be the number of latent topics in the model. The vector of the discrete response variable $\mathbf{y} = \{y_d \mid d \in S,\ y_d \in \{1, 2, \ldots, L\}\}$ in the remote sensing image contains the geoobject class labels, and $L$ is the number of class labels. Different from the MedLDA model, the variable $y_d$ is the geoobject class label for the central pixel $w_{dc}$ of the object $O_d$, as shown in Figure 2. The half hashed pattern denotes that the class node is partially observed. The generative process of the ssMedLDA model is as follows:
(1) Sample topic proportions $\theta_d \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$.
(2) For each of the $N$ pixels $w_{dn}$ in the $d$th image object $O_d$,
(a) sample a topic assignment $z_{dn} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)$;
(b) sample a word $w_{dn}$ from $p(w_{dn} \mid z_{dn}, \beta)$, a multinomial probability conditioned on $z_{dn}$, namely, $w_{dn} \mid z_{dn}, \beta_{1:K} \sim$ Multi-Gaussian distribution$(\beta_{z_{dn}})$.
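Steps (1) and (2) of the generative process can be simulated with the following Python sketch. It is a simplified illustration: pixels are assumed to be single-band so that each topic is a univariate Gaussian, the Dirichlet draw is obtained by normalizing Gamma variates, and all parameter values and the function name `sample_object` are hypothetical. The sampling of the class label is omitted.

```python
import random


def sample_object(num_pixels, alpha, means, stds, rng):
    """Draw topic proportions, topic assignments, and pixel values for one object."""
    # (1) theta ~ Dirichlet(alpha), obtained by normalizing Gamma(alpha_k, 1) draws.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]

    topics, pixels = [], []
    for _ in range(num_pixels):
        # (2a) z ~ Multinomial(theta): pick one topic index.
        k = rng.choices(range(len(alpha)), weights=theta)[0]
        topics.append(k)
        # (2b) w ~ Gaussian of the chosen topic (univariate here by assumption).
        pixels.append(rng.gauss(means[k], stds[k]))
    return theta, topics, pixels


rng = random.Random(42)
# An object with neighborhood size h = 11 contains 11 x 11 = 121 pixels.
theta, topics, pixels = sample_object(
    num_pixels=121, alpha=[1.5, 1.5, 1.5],
    means=[0.0, 5.0, 10.0], stds=[1.0, 1.0, 1.0], rng=rng,
)
```

Inference runs in the opposite direction: the pixel values are observed, and the topic assignments and proportions are recovered by Gibbs sampling.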

Figure 3: Gibbs sampling for model inference in the ssMedLDA.

Figure 4: The initial VHR image and its ground truth.

Figure 5: Classification results in the 10% labeled pixels condition for training.

Figure 6: Overall accuracy versus the proportion of labeled pixels.

Table 1: Comparison of the class-specific accuracies of the four classification results.