Multiscale Region-Level VHR Image Change Detection via Sparse Change Descriptor and Robust Discriminative Dictionary Learning

Very high resolution (VHR) image change detection is challenging due to the low discriminative ability of change feature and the difficulty of change decision in utilizing the multilevel contextual information. Most change feature extraction techniques put emphasis on the change degree description (i.e., in what degree the changes have happened), while they ignore the change pattern description (i.e., how the changes changed), which is of equal importance in characterizing the change signatures. Moreover, the simultaneous consideration of the classification robust to the registration noise and the multiscale region-consistent fusion is often neglected in change decision. To overcome such drawbacks, in this paper, a novel VHR image change detection method is proposed based on sparse change descriptor and robust discriminative dictionary learning. Sparse change descriptor combines the change degree component and the change pattern component, which are encoded by the sparse representation error and the morphological profile feature, respectively. Robust change decision is conducted by multiscale region-consistent fusion, which is implemented by the superpixel-level cosparse representation with robust discriminative dictionary and the conditional random field model. Experimental results confirm the effectiveness of the proposed change detection technique.


Introduction
Remote sensing image change detection (CD) aims to identify the land cover changes from the coregistered multitemporal images. It has extensive applications such as damage assessment [1] and forest monitoring [2]. Over the past decades, great efforts have been made to detect changes from the images of different resolutions [2][3][4][5]: low, moderate, high, and even very high resolution. Change detection, especially for very high resolution (VHR) images, is an ongoing hot topic in remote sensing image processing [5][6][7].
In the remote sensing literature, traditional CD approaches [4][5][6][8][9][10] usually consist of two sequential steps: change feature extraction and change decision. The change features are often organized based on visual features, like spectral feature, Gabor feature, and morphological feature. Such features try to describe the local structure of images. Among them, the filter-based features especially Gabor wavelet and morphological profiles have been proved to perform well on the task of change detection and hyperspectral classification [7,11,12]. The underlying reason is that the local structure can be well captured by different frequency components and the false alarms are significantly reduced by taking the spatial-contextual information into consideration. Furthermore, the morphological profile (MP) feature [12] draws much attention due to the nonlinear nature of morphological operations.
Having the visual features extracted, many approaches explicitly define the change features to describe the change signatures at a pixel or within a small region [8-10, 13, 14]. The most common strategy to design the change feature is differentiating [8,9,15] or stacking [6] the pairwise bitemporal visual features. Change feature built upon the differentiating operation directly reflects the change degree; that is, the difference feature vectors from the changed region usually present larger magnitudes than those from the unchanged region. However, the change feature by differentiating lacks the robustness to registration noise and viewpoint variation. In contrast, the change feature derived by stacking is more robust to the above impacts at the cost of higher feature dimension.
Subsequently, change decision is required to generate the change map (CM). In general, existing approaches can be categorized into two groups.
(1) Thresholding. This group of methods are based on thresholding the difference image. Image differencing [16] and change vector analysis [8] are the representatives. They are simple and easy to interpret results; however, low precision especially on high resolution images is their drawback.
(2) Clustering or Classification. This group of algorithms are mainly based on the clustering or classification techniques. For example, Celik [15] applied -means to cluster the spectral difference features, and promising performances were achieved on multispectral images. Recently, Volpi et al. [17] extended this method to the nonlinear case. However, these unsupervised methods are less effective for high resolution images that contain complex changes. Meanwhile, the performances by the unsupervised methods can vary with different change features and the results sometimes lack meaningful interpretation.
To tackle these issues, the supervised and semisupervised classifiers are introduced for change detection. For example, support vector machine (SVM) was utilized to classify change features in [6] and a transition matrix indicating the "fromto" changes was derived. As the scarcity of training sample may lead to low generalization performance, Bovolo et al. [18] employed the transductive SVM, a semisupervised learner, to classify change features. By exploiting the user-labeled samples, the supervised CD methods are superior to the unsupervised ones in diminishing the semantic gap between the change features and the real changes and in explaining the change maps semantically.
Despite the novelties of the traditional change features and the change decision strategies in the literature, they are inadequate for VHR image change detection due to the following two factors.
(1) In general, change feature contains the following two components: change degree description (CDD, i.e., in what degree the changes have happened) and change pattern description (CPD, i.e., how the changes changed). The two components are equally important for characterizing changes. However, the traditional change features treat them overwhelmingly unequally; that is, the latter component is usually being ignored.
(2) Both the multiscale and the region-consistency characteristics are important for the reliable change detection. However, the traditional change decision procedures seldom consider them simultaneously.
In this paper, the above two items are considered carefully. The rationale of the proposed approach includes the following.
(1) Change degree description and change pattern description are integrated seamlessly. Intuitively, if two image regions are similar, the features from one region can sparsely represent the features from the other with the low errors and vise versa. Compared to the simple Euclidean distance, the sparse representation error is more robust to the illumination change, seasonal variation, and registration noise. Therefore, it is a suitable candidate for describing the change degree. On the other hand, since MP is good at representing the local appearance of high resolution images, it is computed at each position and stacked with the change degree descriptor to generate the final change feature, sparse change descriptor (SCD). Compared to the traditional change descriptors [6,8,9,13], SCD captures both the change degree and change pattern.
(2) The simultaneous consideration of the regionconsistency and the multiscale characteristic of the changes is implemented by the region-level cosparse representation and conditional random field (CRF) fusion. In detail, the proposed change decision strategy starts with constructing a bitemporal image pyramid of the considered images. At each scale of the pyramid, the bitemporal images are cosegmented into small parcels (i.e., superpixels or image objects, or segmentation regions). For each pair of the corresponding parcels from some scale, its probability that belongs to the changed class is predicted by the isotonic regression model [19] trained on the labeled sparse representation errors. These errors are computed by the cosparse representation with the learned robust discriminative dictionaries (RDD), an improved version of the traditional dictionaries [20,21]. The robustness is realized by weighting the features. Once the change probability maps for all scales are estimated, they are fed into CRF to merge the information from different scales and smooth the resultant change map.
Compared with the related approaches, the contributions of this paper lie in the following aspects.
(i) A new change feature, SCD, is proposed, which represents both the change degree and change pattern feature.
(ii) A robust discriminative dictionary learning (RDDL) model is presented, which is resistant to outliers in modeling the change feature distribution.
(iii) The cosparse representation is resorted to predict the change probability for all pixels within an image object, which makes the detected changes more consistent within a homogeneous region.
(iv) CRF is used to fuse the change detection results from different scales, which makes the proposed method capable of recognizing the changes from different scales.
The rest of this paper is organized as follows: Section 2 illustrates the proposed CD method, Section 3 discusses the parameter estimation problem, Section 4 presents the experimental results and analysis, and finally Section 5 concludes the paper.

The Proposed Approach
As illustrated by Figure 1, the proposed VHR image CD approach consists of the following four steps: pyramid  generation, sparse change descriptor extraction, supervised probability prediction, and conditional random field fusion.
In the following, we will describe our CD approach in detail.

Pyramid Generation.
The objects in an image and the changes between the bitemporal images are highly dependent on the scale, and multiscale analysis is helpful to improve the performance. For this reason, multiscale images are firstly generated by the pyramid decomposition. In detail, the images I +1 and J +1 at the ( +1)th scale ( = 1, . . . , −1) of the pyramid are built by downsampling the images I and J at the th scale with the downsampling ratio ( > 1), respectively. Here, I 1 = I and J 1 = J are the original coregistered bitemporal images, which both have the size × × . , , and ( = 3 in this paper) are the number of rows, columns, and channels, respectively. Obviously, the sizes of images I and J are = round( / −1 ) rows, = round( / −1 ) columns, and channels, where round(⋅) is the rounding function.

Sparse Change Descriptor Extraction.
One of the main difficulties related to the change detection of VHR images lies in the lack of separability of change features caused by the abundant details and the low spectral resolution [5]. In this paper, the discriminative ability of change feature is improved by the sparse change descriptor (SCD). Different from the classical change features [8,9,13], two components are contained in SCD: the change degree description (CDD) and the change pattern description (CPD).

Change Degree Description.
Motivated by the state-ofthe-art performance of the sparse-representation-based classifier [22] in face recognition, where the images have serious occlusion and lighting change, the sparse representation error is used as the change degree description. In other words, the magnitude of this error is able to indicate the degree that a test face belongs to a person.
The sparse representation errors are computed on all scales of the image pyramid. At each position of the pairwise images from some scale, two informative representation errors are derived based on two local dictionaries. For this purpose, 3D patches are collected as follows. (2) These vectors are arranged by column to build the local dictionary D , of the image J at ( , ). Obviously, the size of this dictionary is 2 × ( + 2) 2 .
(3) Similarly, x and D , can also be constructed from the image I .
Based on the dictionaries D , and D , , two representation errors are computed: the error , of x under D , and the error , of y under D , . To compute these errors, the features x and y are expanded under the dictionaries D , and D , , respectively. Specifically, where ‖x‖ = (∑ | | ) 1/ ( > 0) denotes the ℓ -norm of vector x,̂and̂are the optimal representation vectors, and controls the sparsity of representation coefficients. Based on̂and̂, , and , can be computed by To build the change degree feature for the th scale, we stack the two errors; that is, e = [ , , , ] ( denotes the transposition). Obviously, larger values of these errors indicate more remarkable changes and vise versa. The Scientific World Journal

Change Pattern Description.
As mentioned before, the morphological profile feature [12] has been proved adept at describing the local structure of images; therefore, it is adopted in our change pattern description. For an image I with channels, let us denote the morphological profile feature at pixel ( , ) of scale as mp , = [(mp ,1 , ) , . . . , (mp , , ) , . . . , (mp , , ) ] , where mp , , is the morphological profile feature of image I at the th scale and the th channel.
Based on the change degree description and the change pattern description, the SCD of the bitemporal images I and J at pixel ( , ) and scale is the concatenation of the feature e with mp , and mp , ; that is, At each scale, two structuring elements (SEs) with the radius parameters = 0, ( − 1)/2 are used. Here, = 0 denotes the original spectral features. By combining the change degree descriptor and the change pattern descriptor, a powerful change feature, SCD, is formed. The change degree component makes it robust to false changes and the change pattern component improves its descriptive power. By this change feature, the change decision in the following sections can be made more reliably.

Supervised Probability Prediction.
This step is aimed at estimating a probability map that indicates the change state of all pixels in each scale of the image pyramid. These maps will be used in the CRF model to generate the final change map. For estimating these probability maps reliably, we employ the cosparse representation model [23] which can exploit the spatial-contextual information effectively. Before that, two conditions should be satisfied.

Condition 1: Training Robust and Discriminative Dictionaries.
Good dictionaries would produce the representation errors that have strong discrimination ability and stability, which are both helpful for reliable probability prediction. One simple strategy for dictionary learning is to train a dictionary on the given training data for each class separately. However, this strategy has at least three drawbacks: (1) the scarcity of the labeled data cannot ensure the generalization ability of the learned dictionaries; (2) the relationship between the dictionaries of different classes is ignored; (3) a few mislabeled data may mislead the dictionary learning process. To deal with these problems, an "add-and-remove" strategy for dictionary training is proposed. It consists of two basic steps: (1) enlarge the amount of samples on the basis of the userlabeled data by a preclassification process ("add" step, see Section 2.3.1) and (2) train all dictionaries simultaneously on the augmented and weighted training data. The weights are updated gradually to remove the intraclass outliers ("remove" step, see Section 2.3.2).

Condition 2: Estimating the Probability that Pairwise Pixels
Belong to the Changed Class Based on the Representation Errors Obtained by the Cosparse Representation. One can either use a parametric (e.g., logistic regression) or a nonparametric (e.g., isotonic regression [19]) model to predict this probability. Considering the nonparametric model is distribution-free and has the advantage in reducing the number of parameters to be adjusted (e.g., regularization parameter in parametric model), it is used in the probability prediction step (see Sections 2.3.2 and 2.3.3).
In the following three parts, the proposed supervised probability prediction method is elaborated step by step. KNN. This is the "add" step, which is aimed at preclassifying the extracted change descriptors based on the user-labeled data and making a new training set for all scales. Furthermore, each training sample is allocated with an initial weight, which will be used in the proposed robust discriminative dictionary learning model.

Initial Probability Prediction by
For the change feature f extracted at the position ( , ) from the images I and J , nearest neighbors are found from the provided training set of the scale . Assume that there are neighbors of the changed class ( ) and neighbors of the unchanged class ( ). The probability of f belonging to can be estimated by ( | f ) = / . After computing the probability for each position, a coarse probability map cPM is obtained.
To incorporate the relationship between different scales, all the probability maps are resized to × and combined into cPM = ∑ =1 UP(cPM , −1 ), where is the weight for the th scale; UP(I, ) is the operation that upsamples the image I with the upsampling ratio . It is reasonable to set the maps that have higher confidence (reflected by the total classification accuracy) with higher weights. The adaptive weights are computed as follows: where TFR is the total false rate of the probability map at the scale , which is achieved by comparing the ground truth labels of the training samples and the predicted labels by KNN. After that, cPM is downsampled to different scales to replace the original cPM ( = 1, . . . , ). Figure 2 gives the estimated probability maps for a toy data set. Figures 2(e)-2(h) show the probability maps before using multiscale fusion and the weighted sum of them. From these figures, the weighted sum is more accurate than each monoscale estimation.

Robust Discriminative Dictionary
Learning. After computing the coarse probability maps, the dictionaries can be learned at each scale based on the augmented and weighted training sets (the "remove" step). The new training sets are defined as X = {( , ) | cPM > 0.5} and X = {( , ) | cPM ≤ 0.5} for the changed class and unchanged class, respectively. The weight for a sample in X is = cPM and that for a sample in X is = 1 − cPM . For convenience, the scale symbol is omitted in the following sections except for specific explanation. X and The Scientific World Journal X may contain some outliers, which would undermine the discriminative ability of the learned dictionaries D and D (for the classes and , resp.). To solve this problem, each sample is weighted and the weight is updated during dictionary training. By this way, intraclass outliers can be stably removed and the discriminative ability of dictionaries is simultaneously kept.
For this purpose, we build a nonparametric model between and the corresponding representation errors. Here, the isotonic regression model [19,24] is adopted as it can fit an isotonic function without any assumption about the specific form. For ( , ) ∈ X , the mapping is = IR(SRF( , )); for ( , ) ∈ X , the mapping is = 1 − IR(SRF( , )). The class-specified representation errors and are defined as wherêand̂are the best representation vectors of f under D and D , respectively. IR(⋅) is an isotonic regression function with single variable [19,24]. SRF(⋅, ⋅) is named the signed ratio function, which is defined as where sgn( − ), max( , ), and min( , ) denote the sign of − and the maximal and minimal value between and , respectively. The role of the SRF function is to convert the complex bivariate case to the tractable univariate case by considering the relationship between and ; that is, if one of them is large the other would be small, and vise versa.
Given a set of labeled triplets, {( , , label )} (label is 1 if the pixel ( , ) is in the user-specified changed regions and 0 if it is in the user-specified unchanged regions), we first compute the SRF values, SRF( , ). The generalized pooladjacent-violators algorithm (GPAV) [24] is subsequently adopted to fit the univariate isotonic function IR(⋅) on {(SRF( , ), label )}. Using the fitted function (i.e., a staircase function), the label value can be evaluated on all pixels. They are used to build the weights of the reconstruction terms in the following robust discriminative dictionary learning (RDDL) model: 6 The Scientific World Journal where Ω is a convex set that consists of matrices and each column of each matrix has the ℓ 2 -norm less than 1. and are the number of samples in X and X , respectively, A and A collect all representation coefficients from and , respectively, 1 and 2 are the trade-off parameters, and tr(⋅) is the trace operator.
We propose using the alternative coordinate descent technique [20,21] to solve the above problem. It consists of the following three steps.
Step 1. Update A and A with the fixed D , D , { }. We update the columns of A and A separately. Actually, only the following problem needs to be solved: where D = D if ( , ) ∈ X and D = D otherwise. The least angle regression algorithm [25] is adopted to solve this subproblem due to its computational efficiency. Once all are obtained, we can get the new A and A . Step D and D can be separately updated by the projection gradient descending procedure [26].
Step 3. Update { } with the fixed D , D , A , and A . With the updated representation coefficients A and A , the representation errors and can be computed for each pixel by (5). Given the labeled triplets {( , , label )}, a new isotonic regression model IR(SRF( , )) can be fitted. Based on this model, the weights are updated by ← IR (SRF ( , )) , ∀ ( , ) ∈ X , The above steps are iterated alternatively until the solution keeps stable.
By the robust discriminative dictionary learning, two dictionaries, D and D , are learned for each scale. And these dictionaries will be utilized in the following superpixel-level cosparse representation step.

Superpixel-Level Cosparse
Representation. The role of the cosparse representation is to use the learned dictionaries in the former step to estimate a more accurate probability map for each scale. With the help of these maps, a better data term for the subsequent CRF model can be built, which enables us to further enhance the change detection performances. Since the traditional sparse representation models [22,27] treat samples independently, they will generate noisy CMs. Thus, we run the cosparse representation [23] on each segmentation region to encode all the features within it simultaneously.
Before cosparse representation, the images I and J (of some scale ) are cosegmented to generate small homogeneous regions. We first segment each image individually by the simple linear iterative clustering (SLIC) algorithm [28] and then obtain the final regions by merging the two segmentation results using the same strategy as [5].
Suppose that I and J are cosegmented into homogeneous regions R = { | = 1, . . . , }, where is the set that collects all indices of the pixels in the th region. If the feature matrix of the region ∈ R is F, its encoding matrix A can be computed by solvinĝ where 1 is the regularization parameter that controls the sparsity of A, | | denotes the number of features in region , ‖ ⋅ ‖ is the Frobenius norm, ‖A‖ 2,1 is the ℓ 2,1 -norm of A, which sums the ℓ 2 -norm of all its rows, and D is defined as Let us denote the th column of F by f and the corresponding representation coefficient vector bŷ= [(̂) , (̂) , (̂) ] , wherê,̂, and̂are the coefficient vectors corresponding to the submatrices D , D , and E, respectively. If we define the representation errors = ‖f − D̂‖ 2 2 and = ‖f −D̂‖ 2 2 , then the isotonic regression can be used to estimate the probability that a feature f belongs to the class ; that is, ( | f ) = IR(SRF( , )). By computing the probability for each pixel of the image at the scale , a refined probability map rPM can be obtained. Similar to the technique proposed in Section 2.3.1, a merged probability map rPM can also be computed. Figures 2(i)-2(l) show the refined probability maps by cosparse representation. By comparing the second and the third rows of Figure 2, it can be concluded that the refined probability maps are more accurate than the coarse ones.

Conditional Random Field Fusion.
To reduce the saltand-pepper-like noise contained in the final CM and to merge the information from different sources, the CRF model [29] is resorted. One simple strategy is to build the data term of CRF only by the refined probability map rPM. However, a better choice is to utilize both rPM and cPM; the reason is that they are usually complementary.
Given cPM and rPM, the energy function of CRF is defined as where C represents the 0-1 label configuration matrix, is the smoothness parameter, and V and V = 1 − V are the weights The Scientific World Journal 7 that balance the effects of the coarse and refined probability maps. V should be set larger than V because the refined map is usually more reliable in terms of total accuracy. By the simple try-and-error strategy, we found that V = 0. 8 where , = 1 if ̸ = and , = 0 otherwise, is related to the kernel width, and , is the squared Euclidean distance between the spectral features at ( , ) and ( , ).
Note that the problem in (12) can be solved efficiently by the max-flow algorithm [30] even for the large-size images. The fast speed enables us to tune conveniently.

Parameter Estimation
There are some important parameters in the proposed CD approach: in (1), 1 in (7) and (11), 2 in (7), and the parameters in SLIC. Considering the computational efficiency, 1 and 2 are set manually, and their sensitivities are analyzed in Section 4.1.1. The remaining parameters are tuned automatically according to some supervised or unsupervised criterions.

Parameter Estimation
: . The selection of should ensure the change degree feature as discriminative as possible. At each scale , for is estimated separately. Recalling the effectiveness of Fisher discriminant criterion [5,31] in evaluating the separability of feature, the score function for is defined as score ( ) = tr(S ( ))/ tr(S ( )), where S ( ) and S ( ) are the intraclass and inter-class scatter matrices of feature , respectively. To choose the optimal parameter, is discretized at points 1 , . . . , , and the one with the minimal score is chosen as the best parameter. The parameter searching results of the images in Figures 2(a) and 2(b) at three different scales are shown in Figure 3(a). From this figure, the best for all scales is about 2 × 10 −3 .

Parameter Estimation:
and . SLIC [28] has two parameters: and . The former determines the minimal area of the segmented regions; the latter is related to the region homogeneity. Smaller generates more homogeneous but more irregular regions. In our change detection task, the regions are expected to be as homogeneous as possible. Therefore, can be fixed as a small value, 0.01, for example, and only needs tuning. Intuitively, large and homogeneous regions are helpful for better performances. Accordingly, the following metric is defined to evaluate the segmentation quality: where x is the 2 dimensional spectral feature that concatenates the spectral features from the two times at the th ( ∈ {1, . . . , }) position of the th ( ∈ {1, . . . , }) region, x is the average spectral feature of region , and is the area of the th region. The best is the value that minimizes the function. We search this value in the range [10,100]. Figure 3(b) shows the relation between and computed for the images in Figures 2(a) and 2(b) at the first scale. From Figure 3(b), 1 = 50 can be regarded as the best parameter for this scale. Running the search algorithm for all scales is time-consuming. Considering the relationship between different scales, we estimate the optimal parameters for the higher scales by = round( 1 / −1 ) ( = 2, . . . , ).

Experiments
In this section, four different experiments on three data sets are carried out to test the validity of the proposed techniques.
The first experiment is to evaluate the overall performance of the proposed method. To this end, it is compared with other related CD methods qualitatively and quantitatively (Section 4.1). The second experiment is to validate the effectiveness of multiscale fusion. Hence, the fused results are compared against all monoscale results (Section 4.2). The third experiment is aimed at assessing the effectiveness of the proposed change feature. In particular, SCD is compared with five commonly used change features (Section 4.3). The last experiment makes it possible to validate the effectiveness of the proposed robust discriminative dictionary by comparison with some other dictionaries (Section 4.4).
Three data sets named DS1, DS2, and DS3 are used for performance comparison, which were taken by QuickBird 2 satellite over Beijing, China. The sizes of them are 1024×1024 pixels, 1001 × 1170 pixels, and 451 × 525 pixels, respectively. As shown in Figures 4(a)-4(d), 5(a)-5(d), and 6(a)-6(d), each data set consists of two coregistered pansharpened images (a) and (b), a reference ground truth (GT) image (c), and a training mask (TM) image (d). The resolution of the pansharpened images is 0.7 m/pixel. In the ground truth images, the changed class is in red. The training mask images are used to generate the training samples of supervised change detection methods. In these images, the changed and unchanged training regions are labeled with red and blue, respectively.
For performance comparison, three metrics are used: false alarm rate (FAR), missed alarm rate (MAR), and total false rate (TFR) [32].

First Experiment: Overall Performance Evaluation.
To demonstrate the effectiveness of the proposed CD method, the following methods are used for comparison.
(i) pxmsCRF. Similar to our method, we decompose the images into three scales. At each scale, CRF (we use the code provided by http://users.cecs.anu.edu.au/jdomke/JGMT/) is trained to classify the SCD feature extracted at each pixel. The final CM is obtained by the majority voting strategy. (ii) pxlinSVM. It is a multiscale pixel-level CD method. It extracts the MP feature at each pixel using the SE parameters = 0, 3, 7, 15 and classifies the multiscale features by the linear SVM classifier. The regularization parameter is selected by 5-fold cross-validation.
(iii) pxrbfSVM. It is a modified version of pxlinSVM by using the RBF kernel in SVM. Both the regularization parameter and the RBF kernel width are set via 5-fold cross-validation.
(iv) splinSVM. It is an object-level CD method. It first segments the two images I and J by SLIC. As pxlinSVM, the MP feature is extracted at all pixels. Based on these features, an object-level change feature is computed for each region by the approach used in [5]. The object-level change features are classified by the linear SVM classifier to generate the final CM. There are three tunable parameters in this method, that is, in linear SVM, , and in SLIC. is selected by 5-fold cross-validation; and are searched in the range [10, 100] × [10 −3 , 1], and the best test performance is reported.
(v) sprbfSVM. In this approach, linear SVM in splinSVM is replaced by the RBF kernel SVM classifier. Both and are set via cross-validation. Similar to splinSVM, and are also tuned to find the best test performance.
(vi) pKNN. cPM is computed by our approach, and the CM is generated by the threshold 0.5 on cPM. Specifically, for the th row and the th column of the change map CM, CM = 1 if cPM > 0.5 and CM = 0 otherwise.
(vii) pKNN-CSR. rPM is computed by our approach, and the CM is generated by the threshold 0.5 on rPM.
In the methods pKNN, pKNN-CSR, and pKNN-CRF, the computation cost of KNN (in Section 2.3.1) increases with the number of training samples. Therefore, instead of using KNN directly, the samples of the changed and unchanged class are clustered into and clusters, respectively. This modified version is called the prototype KNN (pKNN) classifier. The level of the pyramid , the downsampling ratio , and the size of image patch should be set according to the image resolution. As all data sets have the same resolution, these parameters are manually set as = 3, = 2, and = 7.
The parameters and in pKNN are set as 500. The number of the neighbors for classification in pKNN is set as 7. The regularization parameters 1 and 2 are set as 1 and 0.01, respectively. In addition, the parameter in CRF is tuned to reach the lowest TFR.
As 1 and 2 are designated manually, it is necessary to evaluate the influence of them on the final performances. To this aim, the TFRs of pKNN-CSR are computed on the parameter grid {0.5, 1, 1.5, 2} × {10 −4 , 10 −3 , 0.01, 0.1} for 1 and 2 . The relations between TFR and 1 , 2 on DS1 are shown in Figure 3(d).
After inspecting the trend of TFR with varying 2 , we can conclude that the term tr(D D ) in (7) works. In detail, the TFR decreases slowly with the increased 2 . Once it reaches the minimum point at about 0.01, it begins to ascend rapidly with the increasement of 2 . Therefore, the suitable range for 2 to keep TFR low and stable is about [10 −4 , 10 −2 ].
The effect of 1 on change detection performance is more apparent than 2 . Too small or too large 1 would degrade the total performance. Even though there are variances of the performance with different 1 , the TFR is relatively low, and the effective RDD is in the range 0.5 ≤ 1 ≤ 2 as can be seen from Figure 3(d). It is worthy noting that if there are abundant computation resources and enough training samples, it is recommended to conduct cross-validation of 1 and 2 to select the best settings. Figures 4(e)-4(l), 5(e)-5(l), and 6(e)-6(l). Compared to the other methods, the proposed CD method can obtain better CMs. In detail, pxmsCRF tends to generate over-smoothed CMs, which disables it to accurately capture the edges between the changed and unchanged regions. This phenomenon is particularly remarkable on DS3 (Figure 6(e)). The results of pixel-level CD methods, pxlinSVM and pxrbfSVM, contain lots of saltand-pepper noise due to the ignorance of the contextual constraints ((f) and (g) of Figures 4-6). The object-level CD methods, splinSVM and sprbfSVM, work much better than the pixel-level counterparts since the local consistency of change is considered. However, as the change features are not discriminative and robust enough, the CMs still have many false and missed alarms ((h) and (i) of Figures 4-6).

Results and Analysis. For visual comparison, the CMs by different methods are shown in
For the proposed method, pKNN-CRF, it outperforms all other baseline methods. As can be observed from (l) of Figures 4-6, almost all the changed regions are detected correctly, and the CMs of pKNN-CRF are less noisy. Furthermore, the edges between the changed and unchanged regions nearly align with the true edges.
To conduct the quantitative comparison, the FAR, MAR, and TFR on all data sets are listed in Table 1. From this table, the false alarms and the missed alarms of pKNN-CRF are significantly lower than other related methods, and the improvements can be attributed to the discriminativeness and robustness of the proposed SCD and the robust dictionary-based multiscale region-consistent change decision strategy.
Despite the promising performance of the large-margin classifier in classifying the features, however, as can be seen from both   Table 1, the simple prototype KNN classifier combined with the proposed sparse change descriptor is better than SVM equipped with the pixel-level MP features in most cases. This difference demonstrates the effectiveness of the proposed change feature.
As for pKNN and pKNN-CSR, it can be found from Table 1 that the TFRs of pKNN-CSR are lower than the TFRs of pKNN by 1.26%, 1.61%, and 6.10% on DS1, DS2, and DS3, respectively. This indicates that the refined probability maps rPM obtained by the cosparse representation are more accurate than the rough ones cPM estimated by KNN.
When comparing pKNN-CSR and pxmsCRF from Table 1, they have similar TFR on DS1 and DS2. Even so, as mentioned above, pxmsCRF tends to generate excessively smooth CMs. In contrast, by taking the advantage of superpixel-level cosparse representation, pKNN-CSR has the desirable edge-preserving property. Furthermore, by further smoothing the results of pKNN-CSR with CRF, better results (i.e., the results of pKNN-CRF) are obtained.

Second Experiment: Effectiveness Validation of Multiscale
Fusion. To validate the effectiveness of multiscale fusion, both the monoscale and the fused results on three data sets are  shown in Figure 7. From the figure, with the increased scale, the CMs become smoother. Compared to each monoscale result, the multiscale change detection results have better visual effects. For example, on the multiscale result of DS2, lots of false changes are removed compared to the monoscale results.
The performances of the monoscale and the multiscale results are listed in Table 2. From the table, we can find that the best single-scale result appears at different scales for different data sets. For example, on DS1 the third scale has the lowest TFR, while on DS3 the first scale is optimal. This result implies that it is important to conduct the multiscale analysis for VHR image change detection. By fusing the information from several scales, the change detection performances are greatly enhanced. This is validated by the fact that the fused results obtain the lowest TFR.

Third Experiment: Effectiveness Validation of Feature
Extraction. To further demonstrate the effectiveness of the proposed change feature, a comparative experiment is performed on the following change features. (i) SCD. Sparse change descriptor (SCD) is the proposed feature. In this experiment, is set as 7.
(ii) dfSPEC. This feature is computed by [x , y , ‖x − y‖ 2 ] , where x and y are the spectral features extracted from the 7×7 patches of the bitemporal images I and J.
(iii) dfMP. This feature is extracted with the same manner as dfSPEC, but the spectral feature is replaced by the MP feature with the structuring element parameter = 0, 3.
(iv) dfGB. This feature is extracted with the same manner as dfSPEC, but the spectral feature is replaced by the Gabor feature with three-scale and six-orientation filters.
(v) dfUDW. This feature is extracted with the same manner as dfSPEC, but the feature for each pixel is computed by stacking three-scale undecimated discrete wavelet decomposition coefficients [33].
(vi) dfSIFT. This feature is extracted with the same manner as dfSPEC, but the spectral feature is replaced by SIFT [34] feature with a 4 × 4 spatial cell.
All the mentioned features are extracted from the image pyramid of three levels with = 2, and they are fed into a classifier for estimating a probability map PM for each scale. Summing these maps with the weights computed by (4) generates a finer map PM, and the final CM is acquired by thresholding PM with 0.5. Two classifiers are used for probability estimation: the prototype KNN (pKNN) classifier and the nearest neighbor (NN) classifier (preclustering of training samples is not used). In this experiment, the numbers of cluster centers and in pKNN for the changed and unchanged class are set as 500. The number of nearest neighbors for pKNN, , is set by 5-fold cross-validation on the training set.
The performances by different change features are listed in Table 3, from which the proposed change feature SCD is superior to the other change features in most cases. The performance improvement demonstrates the discriminative ability of the proposed change feature, which mainly comes from the sparse-representation-error based change degree description. Specifically, by comparing the performance of SCD and dfMP, the sparse-representation-error based change pattern description is better than the feature-differencemagnitude based one. The underlying reason is that the regression error is more robust to the seasonal variation and registration noise than the Euclidean distance. In short, the combination of the change degree description and the change pattern description is important for improving the separability of change feature.

Fourth Experiment: Effectiveness Validation of RDDL.
To demonstrate the effectiveness of RDD, three other kinds of dictionaries are used for comparison.
(i) OrigD. This dictionary is learned by the model in (7) on the original training sets with the weights of reconstruction term that are set as 1 and 2 = 0.
(ii) AugD. This dictionary is learned by the model in (7) on the augmented training sets with the weights of reconstruction term that are set as 1 and 2 = 0.
(iii) AugDw. This dictionary is learned by the model in (7) on the augmented training sets with 2 = 0. In this case, the weights are set by the method in Section 2.3.2 and their values are not updated.
(iv) RDD. This is the proposed robust discriminative dictionary, which is learned by the model in (7) on the augmented training sets. Note that the weights are allowed to be updated.
Obviously, OrigD is learned on the original training data. AugD and AugDw are learned on the augmented training data without and with weighting, respectively.
To evaluate the performances of these dictionaries quantitatively, FAR, MAR, and TFR are computed for the CMs generated by pKNN-CSR. The parameters of pKNN are set as = = 500 and = 7. For fair comparison, 1 and 2 are tuned to obtain the lowest TFRs for each type of dictionary and the best performances are reported. The searching range for 1 is [10 −3 , 2] and that for 2 is [10 −4 , 1].
The results of different dictionaries are shown in Table 4. From Table 4, the proposed dictionary, RDD, performs best in terms of TFR. For DS1, the labeled data are representative enough to predict the probabilities of the unlabeled data belonging to the changed class with a relatively high accuracy. However, there are still some outliers that may degrade the discriminative ability of the dictionaries. Thus, the TFR of AugD is higher than OrigD. Because the correct probability prediction weakens the impact from the outliers by weighting, the TFR of AugDw matches that of OrigD and is lower than that of AugD. For RDD, the updated weights enable it to recognize more outliers than AugDw and the discriminativeness of the learned dictionaries is improved.
The Scientific World Journal 13  On DS2, the labeled samples are too scarce to acquire a precise probability map, which leads to the high TFR of OrigD. The outliers contained in the extra samples prevent AugD to acquire lower TFR than OrigD. For AugDw, as some samples are recognized as outliers and they are assigned with small weights, the TFR is much lower than that of OrigD and AugD. For the robust learning manner in RDD, more outliers are removed by gradually reducing the weights. This makes RDD achieve the lowest TFR. Similar analysis can be conducted on DS3.
In summary, the initially unlabeled data helps improve the abundance of training data and thus reduces the missed changes. Moreover, the robustness of the RDDL model helps to remove the mislabeled data and keep the FAR low.

Conclusions
This paper proposes a supervised change detection approach for VHR images. It includes two parts: sparse change descriptor extraction and multiscale region-level change decision. The sparse change descriptor integrates the change degree and change pattern description, which improves the discriminative ability of the change feature; meanwhile it makes it robust to false changes. On the other hand, the multiscale region-level change decision strategy enables the proposed change detection method to detect the changes from different scales and reduce the salt-and-pepper noise of change maps. Experiments on several data sets demonstrate the superiority of the proposed change feature and the change decision procedure.