Multiview Active Learning for Scene Classification with High-Level Semantic-Based Hypothesis Generation

Multiview active learning (MVAL) can shrink the version space much more than traditional active learning and has great potential for large-scale data analysis. This paper studies MVAL-based scene classification, which helps a computer understand diverse and complex environments at a macroscopic level and is widely used in fields such as image retrieval and autonomous driving. The main contribution of this paper is that different high-level image semantics replace the traditional low-level features to generate more independent and diverse hypotheses in MVAL. First, our algorithm uses different object detectors to obtain local object responses in the scenes. Furthermore, we design a cascaded online LDA model for mining the theme semantic of an image. The experimental results demonstrate that the proposed theme modeling strategy suits large-scale data learning, and that our MVAL algorithm with both high-level semantic views achieves a significant improvement in scene classification over traditional active learning-based algorithms.


Introduction
Scene classification is defined as using a computer to understand the class of an image scene. The related research can be roughly divided into two branches: some studies focus on fast holistic scene perception based on visual psychology and physiology [1,2], while others build statistical models through local image analysis to understand the scene, which is also the main development trend [3][4][5]. Many methods for image representation, a key step in scene classification, have been proposed in the past two decades. Low-level features such as color, texture, and edge have been widely used to represent the local regions of an image. Some researchers trained object detectors to obtain high-level semantics such as an object's class, size, and shape for more accurate image representation [6,7]. The prevailing statistical models are bag-of-words (BoW) and related theme statistical models. These models, such as pLSA [8] and LDA [9], reduce the gap between low-level features and high-level semantics by mining the hidden themes from local image regions. Other new scene statistical models [10][11][12] were proposed for more accurate object recognition in the scene. However, the models mentioned above mainly focus on the occurrence of the image semantics, and the spatial semantic correlations between different image regions are usually ignored.
For mining the spatial context information from an image, some researchers considered the information interaction between different spatial pyramid levels [13][14][15], and building reasonable attention mechanisms can also lead to significant improvement in scene classification. These methods use deep neural networks, and their large-scale network parameter estimation usually leads to much higher computational complexity than methods that are not based on deep learning.
Active learning ranks the unlabeled samples iteratively and selects only the samples with high uncertainty or those that cause great ambiguity for the classifier. In PAC learning theory, compared with traditional passive learning, it can exponentially reduce the sample complexity to O(log(1/ε)) in the feature space for learning a classifier with expected classification error ε [16][17][18], which gives it good potential for wide application in large-scale data learning. However, in most traditional active learning algorithms the hypotheses are generated from low-level image features and therefore lack diversity, which limits their performance. This paper proposes an MVAL-based scene classification algorithm, which uses different high-level semantics as its views and can reduce the size of the version space by more than half, and it is more efficient than both single-hypothesis-based and committee-based active learning [19].

Proposed Algorithm.
The flowchart of our proposed algorithm is illustrated in Figure 1. Our algorithm uses different high-level semantics as its views to generate the corresponding hypotheses. First, object detectors are trained to obtain the responses of different object classes in image regions. Furthermore, we design a cascaded online LDA (CO-LDA) as a secondary view for more accurate image representation. Finally, a fine-tuned MVAL algorithm uses both high-level image semantics as its views to classify the scene of an image.

Object Semantic-Based Image Representation.
Our object semantic-based image representation is illustrated in Figure 2.
First, multiple object detectors are used to obtain the local object response maps. Second, these maps are decomposed into three spatial pyramid levels, and the maximal object responses are computed in the image blocks of each spatial level, which are marked as red blocks in Figure 2. Finally, an object response histogram is computed, which effectively reduces the influence of object response errors over the whole image. For generating the object responses, a latent SVM-based detector [7] is applied for recognizing bulk-type object classes such as car and pedestrian. Another geometric context-based detector [6] is utilized for recognizing object classes with different textures such as tree, sky, and building.
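The block-wise maximum over a spatial pyramid can be sketched in Python. This is a minimal sketch under stated assumptions: level l of the pyramid splits the response map into 2^l × 2^l blocks (the paper does not state the grid sizes), the map is a plain nested list, and the function name `pyramid_max_pool` is hypothetical.

```python
def pyramid_max_pool(response_map, levels=3):
    """Max-pool one 2-D object response map over a spatial pyramid.

    Assumption: level l uses a 2**l x 2**l grid of blocks, and the map has
    at least 2**(levels - 1) rows and columns so that no block is empty.
    Returns the block maxima, coarsest level first.
    """
    rows, cols = len(response_map), len(response_map[0])
    features = []
    for level in range(levels):
        n = 2 ** level  # number of blocks per side at this level
        for bi in range(n):
            r0, r1 = bi * rows // n, (bi + 1) * rows // n
            for bj in range(n):
                c0, c1 = bj * cols // n, (bj + 1) * cols // n
                block = [response_map[r][c]
                         for r in range(r0, r1)
                         for c in range(c0, c1)]
                features.append(max(block))
    return features


# Toy 4 x 4 response map whose entry at (r, c) is 4 * r + c.
toy_map = [[4 * r + c for c in range(4)] for r in range(4)]
toy_features = pyramid_max_pool(toy_map)
```

With three levels the feature vector has 1 + 4 + 16 = 21 entries per detector; concatenating the vectors of all detectors would give the object semantic view of one image.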

Theme Semantic-Based Image Representation.
To accommodate the dynamic update of an active learning training set, an online LDA model [20] based on a stochastic gradient descent strategy is used. It adds new samples sequentially and no longer stores old samples, which enables efficient and accurate parameter estimation in large-scale data training.
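Online LDA computes the posterior probability distribution $p(\theta, z, \beta \mid w, \alpha, \eta)$ of the hidden nodes based on the observed samples. Because this posterior is intractable, it uses variational inference to maximize a lower bound on the log-likelihood $\log p(w \mid \alpha, \eta)$ given $\alpha$ and $\eta$. Three variational parameters $\phi$, $c$, and $\lambda$ parameterize a multinomial distribution over $z$ and Dirichlet distributions over $\theta$ and $\beta$, respectively. The variational distribution follows

$$q(z_{di} = k) = \phi_{d w_{di} k}, \quad q(\theta_d) = \mathrm{Dirichlet}(\theta_d; c_d), \quad q(\beta_k) = \mathrm{Dirichlet}(\beta_k; \lambda_k). \tag{1}$$

The optimal $(c, \phi)$ is solved by maximizing the lower bound in the following equation:

$$\log p(w \mid \alpha, \eta) \ge L(w, \phi, c, \lambda) = E_q[\log p(w, z, \theta, \beta \mid \alpha, \eta)] - E_q[\log q(z, \theta, \beta)], \tag{2}$$

where $E_q$ denotes the mathematical expectation under $q$. Maximizing the lower bound $L(w, \phi, c, \lambda)$ is equivalent to minimizing the KL divergence between $q(\theta, z, \beta \mid c, \phi, \lambda)$ and $p(\theta, z, \beta \mid w, \alpha, \eta)$:

$$\log p(w \mid \alpha, \eta) - L(w, \phi, c, \lambda) = \mathrm{KL}\big(q(\theta, z, \beta \mid c, \phi, \lambda) \,\|\, p(\theta, z, \beta \mid w, \alpha, \eta)\big), \tag{3}$$

where $L(w, \phi, c, \lambda)$ is factorized as follows:

$$L(w, \phi, c, \lambda) = \sum_d \Big\{ E_q[\log p(w_d \mid \theta_d, z_d, \beta)] + E_q[\log p(z_d \mid \theta_d)] - E_q[\log q(z_d)] + E_q[\log p(\theta_d \mid \alpha)] - E_q[\log q(\theta_d)] + \big(E_q[\log p(\beta \mid \eta)] - E_q[\log q(\beta)]\big)/M \Big\}. \tag{4}$$

Equation (4) can be transformed into formula (5):

$$L(w, \phi, c, \lambda) = \sum_d l(n_d, \phi_d, c_d, \lambda). \tag{5}$$

In equation (5), $n_{dw}$ denotes the frequency with which word $w$ occurs in text $d$, and $n_d$ is the corresponding word frequency vector. $l(n_d, \phi_d, c_d, \lambda)$ reflects the contribution of $d$ to the lower bound, which is iteratively optimized by a coordinate ascent algorithm. $\phi_{dwk}$ in equation (5) is iteratively solved:

$$\phi_{dwk} \propto \exp\{E_q[\log \theta_{dk}] + E_q[\log \beta_{kw}]\}, \quad E_q[\log \theta_{dk}] = \Psi(c_{dk}) - \Psi\Big(\sum_{i=1}^{K} c_{di}\Big), \quad E_q[\log \beta_{kw}] = \Psi(\lambda_{kw}) - \Psi\Big(\sum_{i=1}^{W} \lambda_{ki}\Big), \tag{6}$$

where the digamma function $\Psi$ is the first-order derivative of $\log \Gamma$. $c_{dk}$ and $\lambda_{kw}$ are iteratively solved in the following way:

$$c_{dk} = \alpha + \sum_w n_{dw} \phi_{dwk}, \quad \lambda_{kw} = \eta + \sum_d n_{dw} \phi_{dwk}.$$

When the $t$-th vector of word frequency $n_t$ is observed, we keep $\lambda$ unchanged and update the local optimal solution of $c_t$ and $\phi_t$ in the E step. In the M step, $\phi_t$ and $\lambda$ from the last iteration are both used to update $\lambda$:

$$\lambda \leftarrow (1 - \rho_t)\lambda + \rho_t \tilde{\lambda}, \quad \rho_t = (\tau + t)^{-\kappa}. \tag{7}$$

$\tilde{\lambda}$ in formula (7) is solved as follows:

$$\tilde{\lambda}_{kw} = \eta + \frac{M}{S} \sum_{s=1}^{S} n_{tsw} \phi_{tskw}, \tag{8}$$

where $n_{ts}$ is the $s$-th text in each batch text set, $M$ is the number of texts in the training set, and $S$ is the size of each batch text set. Hyperparameters $\alpha$ and $\eta$ are updated by the Newton-Raphson method:

$$\alpha \leftarrow \alpha - \rho_t \tilde{\alpha}(c_t), \quad \eta \leftarrow \eta - \rho_t \tilde{\eta}(\lambda). \tag{9}$$

Here, $\tilde{\alpha}(c_t)$ is the product of the inverse Hessian matrix and the gradient $\nabla_\alpha l$ of the objective function $l(n_d, \phi_d, c_d, \lambda)$, and $\tilde{\eta}(\lambda)$ is the product of the inverse Hessian matrix and the gradient $\nabla_\eta L$ of the objective function $L(w, \phi, c, \lambda)$.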
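The M step described above blends the previous λ with an estimate computed from the current batch. The following Python sketch illustrates that step only, assuming the standard online LDA step size ρ_t = (τ + t)^(−κ) from [20] and a batch estimate scaled by M/S. The function names and the nested-list layout of `counts` and `phi` are hypothetical, and the E step that produces `phi` is not shown.

```python
def learning_rate(t, tau, kappa):
    """Step size rho_t = (tau + t) ** (-kappa); it decays as t grows."""
    return (tau + t) ** (-kappa)


def batch_lambda_tilde(eta, counts, phi, num_texts):
    """Batch estimate of lambda, returned as a K x W nested list.

    counts[s][w]  : frequency of word w in the s-th text of the batch
    phi[s][w][k]  : variational topic responsibility for that word
    num_texts     : M, the number of texts in the whole training set
    """
    batch_size = len(counts)      # S
    vocab = len(counts[0])        # W
    topics = len(phi[0][0])       # K
    scale = num_texts / batch_size
    return [[eta + scale * sum(counts[s][w] * phi[s][w][k]
                               for s in range(batch_size))
             for w in range(vocab)]
            for k in range(topics)]


def blend_lambda(lam, lam_tilde, rho):
    """Weighted average of the old lambda and the batch estimate."""
    return [[(1.0 - rho) * old + rho * new
             for old, new in zip(row_old, row_new)]
            for row_old, row_new in zip(lam, lam_tilde)]


# Toy batch: 2 texts, 2 words, 2 topics, uniform responsibilities.
toy_counts = [[2, 0], [1, 1]]
toy_phi = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
toy_tilde = batch_lambda_tilde(0.1, toy_counts, toy_phi, num_texts=4)
toy_rho = learning_rate(0, tau=256, kappa=0.5)
toy_lambda = blend_lambda([[1.0, 1.0], [1.0, 1.0]], toy_tilde, toy_rho)
```

A larger τ slows down the early updates and a larger κ makes old batches fade more slowly, which is why these two values are tuned per dataset.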
Based on online LDA, we propose the CO-LDA theme model, which is structurally similar to the classic SP-pLSA model and enhances the spatial correlation between different image regions. The framework of CO-LDA is illustrated in Figure 3. The main difference between CO-LDA and SP-pLSA is that different online LDAs (LDA1, LDA2, and LDA3) are applied at different spatial levels to jointly mine the theme of an image. The main advantage of CO-LDA is that it integrates the spatial correlation of objects at different image resolutions, which further improves holistic scene understanding. The visual histogram in online LDA is computed in the same way as the object response histogram in Section 2.2, and the theme feature of each spatial block is represented by the variational parameter c of the online LDA model.
Finally, the theme feature c of the whole image is obtained by concatenating the theme features of the blocks in the different spatial pyramid levels:

$$c = w_1 c^{L_1} \oplus w_2 c^{L_2} \oplus w_3 c^{L_3},$$

where $c^{L_i}$ denotes the theme feature of the corresponding block in the $L_i$-th pyramid level, $\oplus$ denotes the linear concatenation of feature vectors, and the weights of the different spatial levels are configured as follows: $w_1 = 1/2$, $w_2 = 1/2$, and $w_3 = 1/4$.
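The weighted concatenation can be sketched as follows. This is a minimal illustration, assuming each pyramid level contributes a list of per-block theme vectors that are scaled by the level weight and appended in order; the function name `concat_theme_features` and the toy vectors are hypothetical.

```python
def concat_theme_features(level_features, weights):
    """Scale every block's theme vector by its level weight and concatenate.

    level_features[i] is the list of block theme vectors of pyramid level i,
    and weights[i] is the weight assigned to that level.
    """
    image_feature = []
    for blocks, weight in zip(level_features, weights):
        for block in blocks:
            image_feature.extend(weight * value for value in block)
    return image_feature


# Toy input: one block at level 1, two at level 2, one at level 3,
# each with a 2-topic theme vector (real pyramids have more blocks).
toy_levels = [
    [[0.2, 0.8]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.4, 0.6]],
]
toy_image_feature = concat_theme_features(toy_levels, [0.5, 0.5, 0.25])
```

The length of the result is the total number of blocks times the number of themes, so it grows with the depth of the pyramid.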

Multiview Active Learning.
The MVAL used in this paper comes from our previous work [21], which improves both hypothesis generation and selective sampling. First, a boosting-like technique is integrated into MVAL: following the iterative optimization of weak classifiers, the current hypothesis is boosted by weighted voting over all the hypotheses from the past queries. Furthermore, an adaptive hierarchical competition sampling strategy is presented. In this strategy, if the number of contention samples is large, an unsupervised spectral clustering is activated to obtain the coarse spatial distribution of these contention samples in the high-dimensional feature space. A multiview-based batch mode selective sampling is then run on two measures, sample uncertainty and sample redundancy, by solving a quadratic programming problem to determine the queried samples in each cluster.

Hypothesis Generation.
If active learning selects enough contention samples to improve the hypothesis in each query, the number of incorrectly classified unlabeled samples will decrease. This is quite similar to the way boosting optimizes weak classifiers. The MVAL therefore incorporates the AdaBoost algorithm into its framework to boost the hypothesis generated in each query, and the main flowchart is shown in Figure 4.
In Figure 4, a support vector machine (SVM) is used as the base classifier to construct a multiview classifier, which replaces the single-view classifier in AdaBoost; the multiview classifier in each query can be regarded as the weak classifier of one AdaBoost iteration. The hypothesis of the multiview classifier $h_i(x)$ is computed by weighted voting of $n$ SVM base classifiers $v_1, v_2, \ldots, v_n$ whose weights are $\omega_1, \omega_2, \ldots, \omega_n$. Unlike traditional query by boosting, we update the weight of each base classifier in each query and obtain the boosted hypothesis $H_i(x)$ by weighting all the hypotheses from the past queries, not only the one from the current query. The detailed process of the MVAL's hypothesis generation based on AdaBoost is as follows:
(a) Weighted voting over the per-view classification confidences is used to generate the initial multiview-based hypothesis. Here $f_i^t(x_j)$ is the classification confidence of sample $x_j$ by view $i$, and $\omega_i^t$ denotes the contribution of view $i$ to the classification. $\omega_i^t$ is determined by the soft classification error rate $\varepsilon_i^t$, which measures how correctly the samples are classified and is computed from $\sum_{x \in L, y = 1} f_i^t(x)$ and $\sum_{x \in L, y = -1} f_i^t(x)$, the sums of the classification confidences of the samples in $L$ that are labeled $y = 1$ and $y = -1$, respectively. For a positive (negative) sample, its distance to the decision boundary on the positive (negative) side reflects how correctly it is classified, and this information is used to calculate the error degree $\varepsilon_i^t$ instead of the traditional classification error calculated by the decision hypothesis in AdaBoost. $\omega_i^t$ is updated from $\varepsilon_i^t$ in each query, and the classification confidence $\delta_t$ of the multiview classifier is then computed.
(b) After iteration $t$, the labeled sample set grows as $J_t = J_{t-1} \cup L_t$, where $J_t$ denotes the labeled sample set in iteration $t$ and $L_t$ denotes the samples newly added after the query. Because the size of the labeled sample set $|J_t|$ increases during the iterations of active learning, the influence of $|J_t|$ should be considered when updating the weight $\eta_t$ of the multiview classifier if the initial labeled training set is small. The weight of each sample is then updated using $\beta_t = \delta_t / (1 - \delta_t)$, with $e_j = 0$ if $x_j$ is correctly classified and $e_j = 1$ otherwise.
(c) The final boosted hypothesis $H_t(x)$ of the queried sample $x_i$ is the weighted sum of all the hypotheses from the past $K$ queries.
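Two of the steps above, the weighted vote across views and the reweighting of samples with β_t, can be sketched in Python. This is a hedged illustration rather than the exact procedure of [21]: the view weights are taken as given inputs, and the sample update is assumed to follow the standard AdaBoost.M1 rule (multiply the weight of a correctly classified sample by β_t, then renormalize). The function names are hypothetical.

```python
def multiview_vote(confidences, view_weights):
    """Combine signed per-view confidences f_i(x) with view weights omega_i.

    Returns the predicted label (+1 or -1) and the weighted score.
    """
    score = sum(w * f for w, f in zip(view_weights, confidences))
    label = 1 if score >= 0 else -1
    return label, score


def reweight_samples(weights, errors, delta):
    """AdaBoost.M1-style sample reweighting (assumed form).

    errors[j] is 0 if sample j is correctly classified and 1 otherwise;
    delta is the error-like confidence of the multiview classifier and
    must lie strictly between 0 and 1.
    """
    beta = delta / (1.0 - delta)
    updated = [w * beta ** (1 - e) for w, e in zip(weights, errors)]
    total = sum(updated)
    return [w / total for w in updated]


toy_label, toy_score = multiview_vote([0.8, -0.2], [0.5, 0.5])
toy_weights = reweight_samples([0.25] * 4, [0, 0, 0, 1], delta=0.25)
```

With δ below 0.5, β_t is below 1, so correctly classified samples lose weight and the misclassified one dominates the next round, which is the behavior the boosting analogy relies on.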

Sampling Strategy.
The MVAL uses a new hierarchical competition-based sampling strategy to query, with high probability, the contention samples in different sample distributions, which is illustrated in Figure 5.
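In co-testing style multiview learning, contention samples are the unlabeled samples on which the views predict different labels. A minimal Python sketch of that selection, assuming each view returns a signed confidence whose sign gives the predicted label; the function name `contention_points` and the toy scores are hypothetical.

```python
def contention_points(view_scores):
    """Return indices of samples on which the views disagree.

    view_scores[i][j] is the signed confidence of view i on sample j;
    a non-negative score is read as label +1, a negative score as -1.
    """
    num_samples = len(view_scores[0])
    selected = []
    for j in range(num_samples):
        labels = {1 if scores[j] >= 0 else -1 for scores in view_scores}
        if len(labels) > 1:  # at least two views predict different labels
            selected.append(j)
    return selected


# Two views scoring four unlabeled samples.
toy_view_scores = [
    [0.9, -0.4, 0.2, -0.7],
    [0.5, 0.3, -0.1, -0.6],
]
toy_contention = contention_points(toy_view_scores)
```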
(1) Intercluster Sampling Competition. In the MVAL, a fast approximate spectral clustering algorithm is designed to reduce the computational complexity significantly to $O(KNT) + O(K^3)$, where $T$ is the number of iterations of K-means clustering and $N$ is the total number of contention samples. The detailed process is as follows: (a) perform traditional K-means clustering on the contention unlabeled samples $x_1, x_2, \ldots, x_N$, compute the centroids of the clusters $y_1, y_2, \ldots, y_K$ as $K$ representative points, and build a correspondence table that associates each $x_i$ with its nearest cluster centroid $y_j$; (b) run the normalized cut algorithm on $y_1, y_2, \ldots, y_K$ to obtain an m-way cluster membership for each $y_j$; and (c) recover the cluster membership of each $x_i$ by looking up the cluster membership of its corresponding centroid in the correspondence table.
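Steps (a) and (c) amount to a table lookup, which the following Python sketch illustrates. It assumes the centroids and the cluster label of each centroid are already available; neither K-means nor the normalized cut of step (b) is implemented here, and the function names are hypothetical.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` in squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def recover_membership(points, centroids, centroid_cluster):
    """Give every sample the cluster label of its nearest centroid.

    centroid_cluster[k] is the m-way cluster label that a spectral step
    (for example, normalized cut) assigned to centroid k.
    """
    table = [nearest_centroid(x, centroids) for x in points]  # step (a)
    return [centroid_cluster[k] for k in table]               # step (c)


toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 9.0)]
toy_centroids = [(0.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
toy_membership = recover_membership(toy_points, toy_centroids, [0, 1, 1])
```

Because the expensive spectral step only sees the K centroids, its cubic cost depends on K rather than on the number of contention samples N.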
After fast spectral clustering, two intercluster sampling measures are defined: the number of samples in the cluster and its information entropy. Both measures are weighted to obtain the number of selected samples $N_S^C$ in cluster $C$ in the following equation:

$$N_S^C = \left[ \frac{N_T \big( c \cdot \mathrm{Num}(C) + (1 - c) \cdot \mathrm{Ent}(C) \big)}{Z} \right],$$

where $\mathrm{Num}(C)$ is proportional to the total number of samples $N_C$ in cluster $C$, and $\mathrm{Ent}(C)$ is computed through kernel density estimation of the samples $x$ in cluster $C$. The weight $c = 0.5$ balances the impact of both measures in the intercluster sampling competition, $Z$ is the normalization factor, $N_T$ is the total number of selected samples in the current query, and $[\cdot]$ is the rounding operation.
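The allocation of the query budget across clusters can be sketched as below. This is an illustration under stated assumptions: Num(C) is taken as the cluster's share of all samples, Ent(C) as the cluster's share of the summed entropies (the entropies themselves are passed in rather than estimated by kernel density estimation), and the two are mixed with the weight 0.5 given in the text. The function name `cluster_quotas` is hypothetical.

```python
def cluster_quotas(sizes, entropies, total, weight=0.5):
    """Split a query budget of `total` samples across clusters.

    sizes[i]     : number of contention samples in cluster i
    entropies[i] : precomputed information entropy of cluster i
    weight       : trade-off between the size term and the entropy term
    """
    size_sum = sum(sizes)
    entropy_sum = sum(entropies)
    scores = [
        weight * s / size_sum + (1.0 - weight) * h / entropy_sum
        for s, h in zip(sizes, entropies)
    ]
    normalizer = sum(scores)  # Z
    # Rounding each quota independently may not sum exactly to `total`.
    return [round(total * score / normalizer) for score in scores]


toy_quotas = cluster_quotas([60, 30, 10], [2.0, 1.0, 1.0], total=12)
```

A small but high-entropy cluster therefore still receives queries, which keeps the batch from being dominated by the largest cluster.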
(2) Intracluster Sampling Competition. In the MVAL, an efficient quadratic programming-based method [22] is utilized, which dynamically estimates the weights of the redundancy and uncertainty of an unlabeled sample in each query. It is used for intracluster selective sampling and is solved by minimizing an objective function, equation (16), which estimates the normalized parameter $p_i \in [0, 1]$ that reflects how probable it is that the unlabeled sample $x_i$ is selected.
Here $x_1, \ldots, x_l$ are the queried unlabeled samples, $u$ is a unit vector, and $k = N_S^C$ is the number of unlabeled samples selected in batch mode. The first part of the objective denotes the sample uncertainty in the $v$-th view: the sampling strategy tends to select the contention samples near the classification hyperplane of the $v$-th view by minimizing $p^T f_v$. The second part denotes the sample redundancy in the $v$-th view: selecting mutually similar samples is discouraged by minimizing $p^T K_{u,u} p$. The sampling probability $p$ is calculated by convex quadratic programming, and finally $p_1^v, p_2^v, \ldots, p_l^v$, which correspond to $x_1, \ldots, x_l$ in the $v$-th view, are obtained. For selective sampling in each cluster, the conservative sampling strategy of the classic co-testing algorithm [23] is used.
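To make the two terms concrete, the sketch below only evaluates an objective of this shape for a given selection vector p; it does not solve the quadratic program. It assumes f holds each sample's absolute distance to the hyperplane in one view, K is a similarity (kernel) matrix, and the two terms are added with equal weight, whereas the method in [22] estimates the trade-off dynamically. The function name `batch_objective` and the toy values are hypothetical.

```python
def batch_objective(p, f, K):
    """Uncertainty term p^T f plus redundancy term p^T K p.

    p : selection weights in [0, 1], one per unlabeled sample
    f : distance of each sample to the decision boundary (small = uncertain)
    K : pairwise similarity matrix between the samples
    """
    size = len(p)
    uncertainty = sum(p[i] * f[i] for i in range(size))
    redundancy = sum(p[i] * K[i][j] * p[j]
                     for i in range(size) for j in range(size))
    return uncertainty + redundancy


# Three equally uncertain samples; samples 0 and 1 are near-duplicates.
toy_f = [0.1, 0.1, 0.1]
toy_K = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
similar_pair = batch_objective([1, 1, 0], toy_f, toy_K)
diverse_pair = batch_objective([1, 0, 1], toy_f, toy_K)
```

The batch made of the two near-duplicates has the larger objective value, so a minimizer prefers the diverse pair when uncertainty is equal.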

Results and Discussion
In our experiment, two classic image sets (the OT image set from MIT [9] and the UIUC sports event image set from UIUC [24]) are used for algorithm comparison. Average classification precision (ACP) and mean average classification precision (MACP) are both used to evaluate the performance of the CO-LDA model and the multiview active learning algorithms.

Evaluation of Theme Semantic.
The first experiment is designed to evaluate the performance of our proposed theme semantic. On the OT and UIUC Sports datasets, the parameter configuration of the CO-LDA model is as follows: (1) κ_OT = 0.5, τ_OT = 256 and κ_UIUC = 0.8, τ_UIUC = 1024 in formula (7). (2) The batch sizes of sampled images in MVAL are S_OT = S_UIUC = 512.
We observe the variation in MACP of the CO-LDA model by changing the numbers of themes and visual words: T = 20, 30, 40, 50 and W = 200, 500, 800, 1200, 1500, which gives a total of twenty groups of (T, W). In Figure 6, the (T, W) curves for both datasets show a similar trend: MACP first increases and then decreases. Thus, in our CO-LDA model, we set T_OT = 30, W_OT = 500 and T_UIUC = 40, W_UIUC = 1200. In the OT image set, there are significant differences in the theme probability distributions of the four scene classes "Highway," "Forest," "Mountain," and "Tall building," and the multiview SVM classifier works well in scene classification. In the UIUC Sports image set, the theme probability distributions of the four scene classes "Bocce," "Croquet," "Polo," and "Snowboarding" are very similar, which significantly increases the difficulty of scene classification.
Furthermore, we compare the CO-LDA model with the traditional LDA [9] and SP-pLSA [8] models, and the performance comparison of the three theme models is shown in Table 1. N1 ∼ N8 denote the following eight natural scene classes: "Coast," "Forest," "Mountain," "Open country," "Highway," "Inside city," "Tall building," and "Street." S1 ∼ S8 denote the following eight event scene classes: "Badminton," "Bocce," "Croquet," "Polo," "Rock Climbing," "Rowing," "Sailing," and "Snowboarding." In the LDA model, each image is divided into 11 × 11 blocks, and neighboring blocks overlap by 5 pixels. For feature representation, gray-scale SIFT descriptors are sparsely sampled, and the means of the three color channels are calculated. The numbers of themes and visual words selected by cross validation are T_OT = 30, W_OT = 200 and T_UIUC = 50, W_UIUC = 800. In the SP-pLSA model, the image division and feature representation are the same as in the LDA model. The numbers of themes and visual words selected by cross validation are T_OT = 25, W_OT = 1200 and T_UIUC = 50, W_UIUC = 1500.
In the OT image set, CO-LDA achieves a higher ACP than SP-pLSA in six scene classes, the exceptions being "Mountain" and "Inside city," and it also achieves a higher MACP. LDA performs the worst in all scene classes except "Street." This indicates that CO-LDA captures more accurate scene semantics than the other two classic methods. In the UIUC Sports image set, CO-LDA achieves the highest ACP in the following three event classes: "Croquet," "Polo," and "Rowing," and SP-pLSA achieves the highest ACP in the following three event classes: "Bocce," "Rock Climbing," and "Snowboarding." In the event classes "Badminton" and "Sailing," in which LDA has the highest ACP, CO-LDA still performs better than SP-pLSA. Thus, we can conclude that our proposed CO-LDA also has slightly better performance in theme mining than the two classic image representation methods.

Evaluation of MVAL.
In the second experiment, we compare our algorithm with other single-view active learning algorithms that use either high-level semantics or low-level features for scene classification. In our experimental setting, the initial labeled training set has label size = 150, with batch size = 20 and iteration = 10. The comparison is reported in Table 2.
From Table 2, our algorithm MVAL_HS has a higher MACP than the other four algorithms in almost all scene classes of both image sets, which demonstrates that high-level semantics bring a more significant improvement in holistic scene understanding than traditional low-level image features. Furthermore, MVAL_LS performs better in most cases than the other three single-view algorithms, which suggests that the multiview setting, thanks to its independent and diverse views, reduces the size of the version space more than traditional single-view active learning.
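As a quick check of the labeling budget implied by this setting, the sketch below tracks the size of the labeled set across queries. It assumes every query adds one full batch of newly labeled samples to the labeled set; `labeled_set_sizes` is a hypothetical helper.

```python
def labeled_set_sizes(initial, batch, iterations):
    """Size of the labeled set before the first query and after each query."""
    sizes = [initial]
    for _ in range(iterations):
        sizes.append(sizes[-1] + batch)  # one batch of new labels per query
    return sizes


budget = labeled_set_sizes(initial=150, batch=20, iterations=10)
```

Under this assumption the labeled set grows from 150 to 350 samples over the ten queries.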

Conclusion
This paper proposed an MVAL-based scene classification algorithm, which applies two different high-level image semantics to generate the corresponding hypotheses. Different object detectors are first trained to obtain the responses of different object classes as the object semantic. Furthermore, a CO-LDA model is proposed to obtain a more accurate theme semantic by integrating the spatial correlation of objects at different image resolutions, which improves holistic scene understanding. With the help of these two independent views, our MVAL algorithm has the potential not only to handle large-scale data training but also to improve the performance of scene classification.

Conflicts of Interest
The authors declare that they have no conflicts of interest.