Deep Multiview Active Learning for Large-Scale Image Classification

Multiview active learning (MAL) is a technique which can achieve a larger decrease in the size of the version space than traditional active learning and has great potential for large-scale data analysis. In this paper, we present a new deep multiview active learning (DMAL) framework, which is the first to combine multiview active learning and deep learning for annotation effort reduction. Within this framework, our approach advances existing active learning methods in two aspects. First, we incorporate two different deep convolutional neural networks into active learning, exploiting multiview complementary information to improve feature learning. Second, through the properly designed framework, the feature representation and the classifier can be simultaneously updated with progressively annotated informative samples. Experiments on two challenging image datasets demonstrate that our proposed DMAL algorithm achieves better results than several state-of-the-art active learning algorithms.


Introduction
Active learning was first proposed by Simon and Lea [1]. It iteratively ranks the unlabeled samples and selects only those with high uncertainty or those which cause great ambiguity for the classifier. According to PAC learning theory, compared with traditional passive learning, it can exponentially reduce the sample complexity to O(log(1/ε)) for learning a classifier with expected classification error ε. In the past twenty years, active learning has developed rapidly and been applied in many computer vision-related fields such as image classification [2], object segmentation [3], face recognition [4], object tracking [5], scene reconstruction [6], and activity recognition [7].
The key problem of active learning is selective sampling, for which there are two main solutions: pool-based sampling and stream-based sampling. Stream-based methods usually need manual thresholding for unlabeled sample evaluation and cannot compare the input samples against one another, which restricts their further application. Pool-based methods can in turn be divided into two groups according to the number of generated hypotheses. (a) Single-hypothesis-based methods have only one base classifier, which uses uncertainty reduction [8] or expected error minimization [9] as the sampling criterion; they usually perform better than random sampling. (b) Committee-based methods use two or more hypotheses to set up a committee, and the committee queries the sample which maximally reduces the size of the version space. In most cases, methods of this kind such as Query by Committee [10], Query by Bagging [11], and Query by Boosting [12] achieve a larger decrease in the size of the version space than single-hypothesis-based ones. However, in committee-based methods, the generated hypotheses usually lack diversity, which affects their performance. A special kind of committee-based method is MAL, such as Co-Testing [13], which uses the concept of redundant views from Co-Training [14]. These methods combine multiple conditionally independent hypotheses (views) to maximally improve the query performance of one of the views, and they query the sample which leads to a classification error in at least one view, because a classifier usually improves itself from its errors. Theoretically, MAL can reduce the size of the version space by more than half, which is more efficient than both single-hypothesis-based and committee-based active learning.

Related Work
The first MAL algorithm, Co-Testing, was proposed by Muslea in 2000 [13]. Since then, MAL has developed quickly. Later, Muslea et al. proposed the Co-EMT algorithm [15], which used Co-Testing to query the unlabeled samples and multiview learning to improve the hypotheses for text classification. Furthermore, Muslea et al. also presented another MAL algorithm called Aggressive Co-Testing and showed how to use the weak view to improve both selective sampling and hypothesis generation [16]. Cheng and Wang proposed the Co-SVM algorithm, which converges faster than traditional single-view active learning (SAL), as shown through image retrieval experiments [17]. Wang and Zhou designed a semisupervised MAL algorithm and proved that its sample complexity is reduced exponentially in nonrealizable conditions, whereas SAL can achieve at most a polynomial reduction in sample complexity [18]; they also discussed MAL in the nonrealizable case [19]. Di and Crawford presented a new coregularization-based active learning framework which combines the consistency between views and local proximity in the hypothesis space to optimize selective sampling [20]. Cai et al. also proposed a multiview active learning approach to fully exploit the visual view of videos while querying as few annotations as possible from the text view to reduce the annotation cost [21]. MAL has thus been widely studied and applied in many applications.
Recently, incredible progress on visual recognition tasks has been made by deep learning (DL) approaches. With sufficient labeled data, deep convolutional neural networks can be trained to learn features directly from raw pixels and have achieved state-of-the-art performance for many vision-based applications such as image classification. However, in many real applications of large-scale image classification, the labeled data are far from enough, since the tedious manual labeling process requires a lot of time and labor. Thus, developing a framework that combines DL and active learning has great practical significance. Some remarkable works have revealed that such a combination can achieve significant performance improvements [22][23][24][25][26]. Due to these successes, a new combination of MAL and DL is worth comprehensive research.

DMAL Algorithm
To the best of our knowledge, we are the first to incorporate MAL and DL into a unified framework for large-scale image classification. Our objective is to apply active learning to deep image classification tasks by progressively selecting complementary samples for model updating. The flowchart of our DMAL framework is illustrated in Figure 1.

Suppose we have a dataset D of M categories and N samples, denoted as D = {x_i}_{i=1}^N. We denote the currently annotated samples of D as D_L and the unlabeled ones as D_U. The label of x_i is denoted as y_i = j, j ∈ {1, . . . , M}. In our dataset D, almost all data are unlabeled, and D_U is fed into the DMAL framework incrementally. Following [27], the DMAL objective for image classification is formulated as the minimization of the softmax loss over the annotated samples:

min_{W, {y_i}} L(W) = - Σ_{x_i ∈ D} Σ_{j=1}^{M} 1(y_i = j) log p(y_i = j | x_i; W),    (1)

where p(y_i = j | x_i; W) denotes the mean of the softmax outputs of the two CNNs for the jth category, which represents the probability of the sample x_i belonging to the jth class:

p(y_i = j | x_i; W) = Σ_{k=1}^{2} α_k p_k(y_i = j | x_i; W_k),    (2)

where α_k is the normalized importance weight (Σ_k α_k = 1) which measures the importance of the outputs of the individual views. Equation (2) suggests that multiview fusion may recover the real class of x_i more accurately.
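As an illustrative sketch (not the paper's implementation), the fusion of equation (2) can be computed from the raw logits of the two views as follows; the function names and array shapes here are our assumptions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_views(logits_per_view, alphas):
    """Weighted mean of the per-view softmax outputs (equation (2)).

    logits_per_view : list of (N, M) logit arrays, one per CNN view.
    alphas          : importance weights, assumed normalized to sum to 1.
    """
    return sum(a * softmax(z) for a, z in zip(alphas, logits_per_view))
```

With α_1 = α_2 = 0.5 (as used in the experiments), this is simply the average of the two softmax outputs.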
Specifically, the algorithm alternately updates the inferred labels y_i of the samples x_i ∈ D_U and the network parameters W. In the following, we introduce the details of the optimization steps and give their physical interpretations. The practical implementation of DMAL is discussed at the end.
The whole procedure is summarized in Algorithm 1.
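As a hedged sketch of this alternation, the loop below renders Algorithm 1 in Python; every callable it takes (predict_views, fuse, uncertainty, oracle, finetune) is a hypothetical stand-in supplied by the caller, not the paper's API:

```python
import numpy as np

def dmal_loop(X, T, batch_size, predict_views, fuse, uncertainty,
              oracle, finetune):
    """Illustrative rendering of the DMAL alternation (Algorithm 1).

    X : (N, d) array holding the unlabeled pool D_U.
    Returns the indices and labels of the accumulated labeled set D_L.
    """
    unlabeled = list(range(len(X)))        # indices still in D_U
    labeled, labels = [], []               # the growing D_L
    for t in range(T):                     # while t < T
        if not unlabeled:
            break
        probs = fuse(predict_views(X[unlabeled]))       # equation (2)
        scores = uncertainty(probs)                     # equation (3)
        picked = np.argsort(scores)[::-1][:batch_size]  # most uncertain first
        query = [unlabeled[i] for i in picked]
        labeled += query                                # add to D_L
        labels += [oracle(X[i]) for i in query]         # annotator provides y_i
        unlabeled = [i for i in unlabeled if i not in set(query)]
        finetune(X[labeled], labels)                    # update W (equation (4))
    return labeled, labels
```

The key design point is that sample selection and network fine-tuning happen inside the same loop, so the uncertainty scores of the next round are computed with an updated feature representation.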
We measure the uncertainty of an unlabeled sample x_i by the entropy of its fused class posterior:

u(x_i) = - Σ_{j=1}^{M} p(y_i = j | x_i; W) log p(y_i = j | x_i; W).    (3)

This measure takes all class label probabilities into consideration. The higher the entropy value is, the more uncertain the sample is.
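A minimal sketch of this entropy measure and of batch selection by it (the function names are ours, not the paper's):

```python
import numpy as np

def entropy_uncertainty(probs, eps=1e-12):
    """Entropy of the fused class posterior; a higher value means a more
    uncertain sample. eps guards against log(0)."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def select_most_uncertain(probs, batch_size):
    """Indices of the batch_size samples with the highest entropy."""
    return np.argsort(entropy_uncertainty(probs))[::-1][:batch_size]
```

For example, a uniform posterior [0.5, 0.5] scores higher than a confident one like [0.99, 0.01], so near-uniform samples are queried first.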

CNN Parameter Update.
Most active learning methods focus only on classifier retraining. Their strategies for selecting the most informative/uncertain samples depend heavily on the assumption that the feature representation is fixed. However, feature learning and classifier training are jointly optimized in CNNs, and simply fine-tuning CNNs within the traditional active learning framework probably faces a divergence problem.
To solve this problem, after sample selection, we employ standard back propagation to update the CNN parameters W_k, k = 1, 2, mainly based on D_L. Specifically, let L denote the loss function of equation (4); the partial derivative of L with respect to the network parameters W is

∂L/∂W = - Σ_{x_i ∈ D_L} Σ_{j=1}^{M} (1(y_i = j) - p(y_i = j | x_i; W)) ∂z_j(x_i; W)/∂W,    (5)

where {z_j(x_i; W)}_{j=1}^{M} denotes the activations of the last layer of the CNN for the ith sample before they are fed into the softmax classifier, which is defined as

p(y_i = j | x_i; W) = exp(z_j(x_i; W)) / Σ_{m=1}^{M} exp(z_m(x_i; W)).    (6)

After CNN fine-tuning, we go to the next iteration until the maximum iteration T is reached.
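Under the softmax cross-entropy loss, the gradient with respect to the last-layer activations reduces to the softmax output minus the one-hot label, p_j − 1(y_i = j), which is the per-sample factor appearing in equation (5). The sketch below verifies this analytic form against finite differences (names are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a single logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_wrt_logits(z, y):
    """dL/dz_j for L = -log p_y: the softmax output minus the one-hot label."""
    g = softmax(z)
    g[y] -= 1.0
    return g
```

Back propagation then multiplies this factor by ∂z_j/∂W layer by layer, which is what standard deep learning frameworks do automatically.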

Datasets and Experiment Settings.
In our experiments, we use two large-scale image datasets, CIFAR-10 [28] and ImageNet ILSVRC [29], to test our algorithm. The CIFAR-10 dataset consists of 60000 32 × 32 color images in 10 different classes, with 6000 images per class. The ImageNet ILSVRC dataset contains more than 1.2 million training images and 100 thousand testing images in 1000 different classes. We use different CNNs for hypothesis generation in DMAL: in this paper, AlexNet [30] and VGGNet [31] are utilized for multiview feature learning. For CIFAR-10, we resize all images to 200 × 150 and set the learning rate of all layers to 0.01. For ImageNet ILSVRC, we resize all images to 256 × 256 and set the learning rate of all layers to 0.001. We set the normalized importance weights α_1 = α_2 = 0.5.

Algorithm Comparison.
We compare our DMAL algorithm with the following seven state-of-the-art active learning algorithms: (a) SVM_RAN: during training, we randomly select samples to be annotated and use a support vector machine (SVM) as the base classifier. This method discards all active learning techniques and can be considered a lower bound.

Experiment 1: Fixed Label Size and Batch Size.
In Experiment 1, we set label size = 5000, batch size = 500, and iteration count T = 15. Figure 2 shows the comparison of different active learning algorithms by average precision (AP) over different relevance feedbacks. Here, we show only four intermediate results, at iteration intervals of 4, due to space restrictions. A1-A6 denote the following six active learning algorithms: SVM_RAN, SVM_QP, MAL, AL_DIS, AL_TC, and AL_DL, drawn with dotted lines. A7 and A8 denote the two deep active learning algorithms DAL_ALEX and DMAL, drawn with solid lines.
In Figure 2, we can see that (1) all seven algorithms achieve higher AP than SVM_RAN, which demonstrates that their sample selection strategies work and can query informative/uncertain unlabeled samples; (2) both DAL_ALEX and DMAL achieve higher AP than the other six algorithms, which suggests that deep CNNs are more efficient for feature representation than traditional features; and (3) DAL_ALEX achieves lower AP than DMAL after four iterations, which demonstrates that DAL with complementary views can achieve better performance than DAL with a single view.

Experiment 2: Variable Label Size.
In Experiment 2, we fix batch size = 500 and evaluate the performance of the algorithms by varying the label size. We again use average precision (AP) and mean average precision (MAP) as the measures. Figure 3 shows AP variations of the first 200 feedback images for different label sizes (label size = 1000, 2000, 3000, 4000, and 5000). Both DAL_ALEX and DMAL always achieve the highest AP among all algorithms. As the label size increases, the AP growth of both DAL_ALEX and DMAL slows, which demonstrates that they have good convergence and also fit small training sets. Table 1 illustrates the MAP of the eight active learning algorithms at different label sizes for the first 500 feedback images. The value after the symbol ± denotes the standard deviation of AP.
In Table 1, we can see that (1) compared with SVM_RAN, the other seven algorithms A2-A8 achieve higher MAP, which demonstrates that their sampling strategies are more effective than random sampling; (2) the MAP of DAL_ALEX is 6.13% and 4.3% higher than that of AL_DL on the two datasets, which demonstrates that the CNN helps improve the feature representation; and (3) DMAL outperforms the state-of-the-art algorithms AL_DL, AL_TC, and AL_DIS in MAP, which suggests that DL and MAL form an effective combination.

Experiment 3: Variable Batch Size.
In Experiment 3, we fix label size = 5000 and evaluate the performance of the algorithms by varying the batch size. Figure 4 illustrates AP variations of the first 200 feedback images for different batch sizes (batch size = 100, 200, 300, 400, and 500). From Figure 4, we can draw the following conclusion: our proposed DMAL algorithm outperforms all other algorithms under all batch size conditions. Furthermore, we can observe an implicit regularity: as the batch size increases, the AP growth ratios of both DAL_ALEX and DMAL are higher than those of algorithms such as AL_TC and AL_DL that do not use a CNN for feature representation. For example, on the CIFAR-10 dataset, when the batch size increases from 200 to 300, the AP growth ratios of DMAL, DAL_ALEX, and AL_DL are 7.52%, 7.83%, and 5.48%, respectively; when the batch size increases from 300 to 400, the AP growth ratios of the three algorithms are 3.25%, 3.17%, and 1.60%, respectively. On the ImageNet ILSVRC dataset, when the batch size increases from 200 to 300, the AP growth ratios of DMAL, DAL_ALEX, and AL_DL are 6.54%, 6.78%, and 3.91%, respectively; when the batch size increases from 300 to 400, the AP growth ratios of the three algorithms are 5.45%, 5.29%, and 4.77%, respectively. This evidence demonstrates that deep learning-based and multiview-based active learning algorithms improve faster than the other algorithms, which strongly suggests that incorporating deep learning into the multiview active learning framework is an efficient way to improve active learning. Table 2 illustrates the MAP of all eight algorithms at different batch sizes for the first 500 feedback images.

Algorithm 1: DMAL algorithm.
(2) while the maximum iteration T is not reached do
(3)   utilize the two CNNs to obtain p(y_i = j | x_i; W) based on equation (2);
(4)   add uncertain samples into D_L based on equation (3);
(5)   in iteration t, update W via fine-tuning the CNNs according to equation (4) with the updated D_L;
(6) end while
(7) return W
Obviously, we can draw similar conclusions to those from Table 1.

Conclusions
In this paper, we propose a novel DMAL framework for large-scale image classification tasks. To the best of our knowledge, we are the first to incorporate deep learning into the multiview active learning framework. Furthermore, unlike traditional active learning, both the feature representation and the classifier are jointly updated with progressively added informative samples. The detailed experimental results on two challenging public benchmarks justify the effectiveness of our proposed DMAL framework.

Conflicts of Interest
The authors declare that they have no conflicts of interest.