Combining Convolutional Neural Network and Markov Random Field for Semantic Image Retrieval

With the rapidly growing number of images over the Internet, efficient and scalable semantic image retrieval becomes increasingly important. This paper presents a novel approach for semantic image retrieval by combining a Convolutional Neural Network (CNN) and a Markov Random Field (MRF). As a key step, image concept detection, that is, automatically recognizing multiple semantic concepts in an unlabeled image, plays an important role in semantic image retrieval. Unlike previous work that applies single-concept classifiers one by one, we detect a semantic multiconcept by using a multiconcept scene classifier. In other words, our approach takes multiple concepts as a holistic scene for multiconcept scene learning. Specifically, we first train a CNN as a concept classifier, which includes two types of classifiers: a single-concept fully connected classifier that is best suited to single-concept detection and a multiconcept scene fully connected classifier that is good for holistic scene detection. Then we propose an MRF-based late fusion approach that is able to effectively learn the semantic correlation between the single-concept classifier and the multiconcept scene classifier. Finally, the semantic correlation among the subconcepts of images is captured to further improve detection precision. To investigate the feasibility and effectiveness of our proposed approach, we conduct comprehensive experiments on two publicly available image databases. The results show that our proposed approach outperforms several state-of-the-art approaches.


Introduction
With the rapid development of information technology, a large number of multimedia objects such as images are available on the Web. Given a semantic query, effectively finding relevant images in such a large-scale Web database remains a challenge. For semantic image retrieval, image concept detection is a vital step. To address this issue, many approaches have been proposed, such as Markov random walk [1], group sparsity [2], ensemble learning [3], and multiview semantic learning [4]. Although effective, these approaches work only in the case of single-concept-based image retrieval. This means that each semantic query is supposed to contain only one semantic concept, restricting its practical usability.
In this paper, we specifically consider the problem of multiconcept-based image retrieval. This paradigm allows users to employ multiple semantic concepts to search for relevant images. Its critical step is image multiconcept detection, that is, identifying multiple semantic concepts in an unseen image. Most previous studies [5,6] utilize multiple independent single-concept classifiers to detect such a semantic multiconcept scene. Nonetheless, this method may be ineffective, since a visual multiconcept scene (e.g., "grass, person, soccer, and sports") is hard to detect solely with single-concept classifiers. Therefore, further studies on image multiconcept detection are necessary.
In recent years, CNNs have achieved state-of-the-art performance in many image tasks, such as single-concept-based image retrieval [7,8], face recognition [9], image segmentation [10], and image reconstruction [11]. This indicates that a CNN can learn robust visual features by capturing semantic structures of images. A natural idea is to devise a specific CNN for image multiconcept detection. Most conventional CNNs, however, focus only on single-concept detection of images. As a result, they perform suboptimally on images with multiconcept scenes. We hence design a specific CNN that suits holistic scene detection, with two kinds of fully connected classifiers: a single-concept classifier and a multiconcept scene classifier. The former suits single-concept detection, while the latter is for holistic scene detection. Differing from existing works that use single-concept classifiers, our method employs a multiconcept scene classifier to detect a semantic multiconcept scene, regarding multiple concepts as a holistic scene for multiconcept scene learning. Using our proposed MRF-based fusion method, we model the semantic correlation between the single-concept classifier and the multiconcept scene classifier and estimate the relevance score for an image multiconcept scene. The semantic link among the subconcepts present in the images is further used to improve detection accuracy. Experimental results on the MIR Flickr 2011 [12] and NUS-WIDE [13] datasets demonstrate the effectiveness of our proposed method. The major contribution of this paper is twofold:
(1) Combining CNN and MRF, we propose a unified, novel CNN framework for image multiconcept scene detection.
(2) We model the semantic link between a single-concept classifier and a holistic scene classifier in a way that effectively detects the semantic multiconcept scene in an unlabeled image.
The remainder of this paper is organized as follows. Section 2 briefly reviews some related works. Section 3 details our proposed approach. Section 4 reports our experiments with setup, results, and analysis, and Section 5 concludes this paper with some remarks on further studies.

Related Work
Image concept detection, a vital step for semantic image retrieval, can be grouped into discriminative, generative, and nearest-neighbor methods. A discriminative method learns a classifier that projects visual images to semantic concepts, for example, Stochastic Configuration Networks (SCN) [14], while a generative method (e.g., a feature-word-topic model [15]) concentrates on learning the correlation between visual images and semantic concepts. By a majority vote of the nearest neighbors of an image, a nearest-neighbor method assigns a semantic concept to this image. An influential work is TagProp [6], which employed a weighted nearest-neighbor graph to learn semantic concepts of unseen images, achieving competitive learning performance. The above-mentioned methods ignore the valuable semantics latently embedded in image concepts so as to simplify the design of the system and the related calculation. Alternatively, some others effectively integrate the semantic information under a unified learning framework, achieving sound performance of concept detection. In [16], the Google semantic distance was proposed to extract the semantics of semantic concepts and phrases. In [17], a semantic ontology-based hierarchical pooling method was proposed to improve the coverage or diversity of the training images.
In the research field of image retrieval, MRF-based methods are also widely used, achieving promising performance. Laferte et al. [18] proposed a discrete MRF approach, which employed maximum a posteriori estimation on the quadtree so as to reduce the computational expense. Metzler et al. [19] proposed an MRF-based query expansion approach that provided an effective mechanism for modeling semantic dependencies of image concepts. In [20], a potential function was proposed for parameter estimation and model inference, which strengthened the learning ability of a concept classifier. Kawanabe et al. [1] utilized Markov random walks on graphs of textual tags to improve the performance of image retrieval. Lu et al. [21] utilized maximum-likelihood estimation to train a spatial Markov model and then employed this model for image concept detection. Dong et al. [22] proposed a sub-Markov random walk approach with concept prior to image retrieval, which can be interpreted as a conventional random walker on a graph with added auxiliary nodes. Most traditional methods concentrate on single-concept-based image retrieval. For an image multiconcept query, they employ a combination of single-concept classifiers [5,6] to detect the image multiconcept scene.
CNN-based deep learning has recently achieved state-of-the-art performance in single-concept-based image tasks. Simonyan et al. [23] trained a deep CNN termed VGG, achieving competitive performance on the large-scale dataset ImageNet. Szegedy et al. [7] proposed a deeper CNN architecture termed GoogLeNet, achieving better learning performance on ImageNet. To improve the performance of image retrieval, Hoang et al. [24] proposed three masking schemes to select a representative subset of local convolutional features. Girshick et al. [8] proposed a scalable object detection approach, Regions with CNN features (R-CNN), which applied high-capacity CNNs to bottom-up region proposals. Ren et al. [25] proposed a Region Proposal Network (RPN) that shared full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. In [26], a Multi-Loss regularized Deep Neural Network (ML-DNN) framework was proposed, which exploited multiple loss functions with different theoretical motivations to mitigate overfitting during semantic concept learning. He et al. [27] proposed a residual learning framework to ease the training of deep neural networks. Wang et al. [28] proposed a deep ensemble learning approach for large-scale data analytics. Huang et al. [29] proposed a Dense Convolutional Network (DenseNet) that connected each layer to every other layer in a feed-forward fashion, strengthening feature propagation and reducing training expense. Despite their effectiveness, these methods are confined to single-concept-based image retrieval, limiting their practical usability. This motivates us to devise a new model to resolve this issue.

Proposed Approach
Our approach, called CMMR, aims to combine CNN and MRF for multiconcept-based image retrieval. Suppose that $D$ and $T$ stand for a training set and a test set, respectively. Each image $x$ in $D$ or $T$ is represented as a low-level visual feature vector. Given a vocabulary $V$ with $|V|$ unique semantic concepts, each concept $c$ in $V$ is a single concept, for example, "grass" or "person." Each image in the training set $D$ is labeled with several semantic single concepts $c$, while the images in the test set $T$ have no concept labels. Each semantic scene with the multiconcept $q$, for example, "clouds, sky, and sunset," is an element of the power set of $V$, that is, $q \in 2^{V}$ or $q \subseteq V$. Given a multiconcept query $q \in 2^{V}$ (e.g., "grass, person, soccer, and sports") and the target set $T$, our goal is to find a result set $R \subseteq T$ with relevant images. The result set $R$ satisfies the following conditions: (1) each relevant image $x$ in $R$ includes all target single concepts $c \in q$; and (2) $\forall x \in R$ and $x' \in T - R$, $s(q, x) > s(q, x')$, where $s(q, x)$ and $s(q, x')$ stand for the relevance scores for $q$.
Figure 1 shows our proposed CMMR framework and its working mechanisms. The framework consists of three main components: the CNN framework, MRF-based fusion, and online retrieval. The CNN component learns concept classifiers. Normally the last layer of a CNN is a single-concept classifier. We replace it with two types of classifiers: a single-concept fully connected classifier for single-concept detection and a multiconcept scene fully connected classifier for holistic scene detection. The MRF-based fusion component learns the semantic correlation between these two types of classifiers and produces the ultimate semantic score for a given multiconcept query with a semantic scene $q$. Online retrieval obtains the search result for this $q$ in four steps. First, CMMR generates the detection context $N_c(q)$ by using a semantic neighbor approach. Second, the proposed CNN learns a single-concept classifier and a multiconcept scene classifier. Third, the MRF-based fusion approach learns the ultimate semantic scores of $q$. Finally, CMMR employs the learned semantic scores to perform semantic image retrieval.

Multiconcept Vocabulary Generation.
CMMR regards each multiconcept $q \in 2^{V}$ as a whole, that is, one concept of a holistic scene. In order to avoid meaningless concept permutations, CMMR chooses only meaningful multiconcepts $q$ to generate a multiconcept vocabulary $V^{+}$ according to a cooccurrence rule over the training set $D$. The rule constrains the cardinality $|q|$ of $q$, for example, $|\{\text{grass, person, soccer, sports}\}| = 4$, with a threshold $\alpha$, and the multiconcept frequency $f(q)$ of $q$ with a threshold $\beta$. If the size of $V^{+}$ is too large, we can adjust the thresholds $\alpha$ and $\beta$ to reduce the computational expense. In this way, $V^{+}$ containing the multiconcepts $q$ is generated.
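The vocabulary generation step can be illustrated with a minimal sketch. It assumes the rule keeps a multiconcept when its cardinality is at most `alpha` and its training-set frequency exceeds `beta`; the function name, the lower bound of two concepts per multiconcept, and the toy annotations are all illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def build_multiconcept_vocabulary(annotations, alpha, beta):
    """Count every concept subset of size 2..alpha that co-occurs in an image
    and keep those whose frequency exceeds beta (assumed form of the rule)."""
    freq = Counter()
    for labels in annotations:
        labels = sorted(set(labels))
        for size in range(2, min(alpha, len(labels)) + 1):
            for combo in combinations(labels, size):
                freq[frozenset(combo)] += 1
    return {q: n for q, n in freq.items() if n > beta}


# Toy training annotations (hypothetical).
training = [
    {"grass", "person", "soccer"},
    {"grass", "person"},
    {"clouds", "sky"},
    {"grass", "person", "sky"},
]
vocab = build_multiconcept_vocabulary(training, alpha=2, beta=1)
print(vocab)
```

Enumerating subsets grows exponentially with `alpha`, which is consistent with the text's remark that the thresholds can be adjusted to keep the vocabulary size manageable.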

CNN Network of Our Proposal.
Normally a CNN has multiple convolutional layers followed by fully connected classifier layers. The convolutional layers learn and extract robust visual features, while the classifier layers learn a concept classifier. Any CNN for image tasks can be incorporated into our framework. Without loss of generality, we choose an influential model, GoogLeNet [7], to build our convolutional layers. Image concept detection serves as a critical step in semantic image retrieval. Most conventional CNNs concentrate on image single-concept detection, thus performing suboptimally on image multiconcept scene detection. Furthermore, an original CNN (e.g., GoogLeNet) aims to predict one concept label of an unseen image, whereas in our case each image is labeled with multiple concepts. Therefore, we modify GoogLeNet so as to fit multiconcept scene detection.
First, we design a specific fully connected classifier layer that suits holistic scene detection, comprising two kinds of classifiers: a multiconcept scene classifier and a single-concept classifier. They share one convolutional layer, since this convolutional layer generates a general visual representation. Second, we follow [30] to define our softmax loss function $J$ of multiconcept learning. With this definition, the normalized prediction $p(q_j \mid x_i)$ of the image $x_i$ for the $j$th multiconcept $q_j$ is calculated by a softmax over the activations of all multiconcepts, where $q_j$ (e.g., "grass, person, soccer, and sports") is one holistic scene concept, $f(x_i, q_j)$ is the activation function, and $n_2$ is the number of multiconcepts. Following [30], we use a rectified linear unit as our nonlinear activation function.
The loss $J$ is obtained by minimizing the Kullback-Leibler divergence between the prediction and the ground truth. The positive and negative samples of a multiconcept are determined from the annotation set $A(x_i)$ of each training image $x_i$. Based on these positive and negative samples, we train the multiconcept classifier. For traditional single-concept classifier training, the images labeled with the concept $c_i$ are employed as positive samples and the rest as negative samples.
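A small numerical sketch may help make the softmax prediction and the Kullback-Leibler objective concrete. It assumes the standard softmax over per-multiconcept activations and a ground-truth distribution obtained by normalizing the 0/1 labels of each image; these choices, and the function names, are assumptions for illustration and are not taken from the paper's equations.

```python
import math


def softmax(activations):
    """Numerically stable softmax over the activations of all multiconcepts."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def kl_loss(activations_per_image, targets_per_image):
    """Mean KL divergence between normalized 0/1 targets and softmax outputs.

    Each image is assumed to have at least one positive target."""
    loss = 0.0
    for acts, targets in zip(activations_per_image, targets_per_image):
        pred = softmax(acts)
        norm = sum(targets)
        for p_hat, t in zip(pred, targets):
            if t > 0:  # terms with zero target probability contribute nothing
                p = t / norm
                loss += p * math.log(p / p_hat)
    return loss / len(activations_per_image)


# One image, two multiconcepts, equal activations.
print(kl_loss([[0.0, 0.0]], [[1, 1]]))  # prediction matches the target
print(kl_loss([[0.0, 0.0]], [[1, 0]]))  # prediction is too flat
```

Because the ground-truth distribution is fixed, minimizing this divergence is equivalent to minimizing the cross-entropy between targets and predictions.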

Detection Context Generation.
Given a multiconcept query with a semantic scene $q$, $k_1$ concept neighbors participate in the concept detection and output the relevance scores.
These concept neighbors are tightly linked to $q$ and hence can be taken as the detection context, denoted as $N_c(q)$. Some details on the procedure of generating the detection context are given below.
First, we generate a semantic neighbor set $N(q) \subset V^{+}$ by choosing neighbor concepts $q_i$ with probabilities $p(q \mid q_i) > 0$. This symmetric semantic probability $p(q \mid q_i)$ measures the interdependency between two concepts $q$ and $q_i$ and is computed from their occurrence frequencies and their cooccurrence frequency over the training set.
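The context-generation procedure (neighbor set first, then subconcepts, then the most related remaining concepts, followed by normalization) can be sketched as below. The exact form of the symmetric probability is not recoverable from the text, so this sketch assumes a Jaccard-style ratio of cooccurrence to union frequency; that formula, the function names, and the toy data are assumptions.

```python
def cooccurrence_probability(q, qi, annotations):
    """Assumed symmetric score: f(q ∪ qi) / (f(q) + f(qi) - f(q ∪ qi))."""
    f_q = sum(1 for labels in annotations if q <= labels)
    f_qi = sum(1 for labels in annotations if qi <= labels)
    f_both = sum(1 for labels in annotations if (q | qi) <= labels)
    denom = f_q + f_qi - f_both
    return f_both / denom if denom else 0.0


def detection_context(q, vocabulary, annotations, k1):
    """Return up to k1 context concepts with weights normalized to sum to 1."""
    scored = {qi: cooccurrence_probability(q, qi, annotations) for qi in vocabulary}
    neighbors = {qi: p for qi, p in scored.items() if p > 0}
    # Subconcepts of q (including q itself) always enter the context.
    context = [qi for qi in neighbors if qi <= q]
    # Remaining neighbors are ranked by decreasing probability.
    rest = sorted((qi for qi in neighbors if not qi <= q),
                  key=lambda qi: (-neighbors[qi], sorted(qi)))
    context += rest[:max(0, k1 - len(context))]
    total = sum(neighbors[qi] for qi in context)
    return {qi: neighbors[qi] / total for qi in context}


annotations = [
    frozenset({"grass", "person", "soccer"}),
    frozenset({"grass", "person"}),
    frozenset({"grass", "sky"}),
    frozenset({"clouds", "sky"}),
]
vocabulary = [frozenset({"grass"}), frozenset({"person"}),
              frozenset({"grass", "person"}), frozenset({"sky"}),
              frozenset({"soccer"})]
context = detection_context(frozenset({"grass", "person"}), vocabulary, annotations, k1=4)
print(context)
```

With this toy data, "sky" never cooccurs with the full query scene, so it receives probability zero and is excluded from the neighbor set.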

MRF-Based Fusion for Multiconcept Scene Learning.
With our CNN, the concept classifier has been learned. This concept classifier projects visual images to semantic concepts. If a semantic concept and its related concepts frequently appear in images, the relevance prediction of this semantic concept will be boosted in our model. Given a multiconcept query with the semantic scene $q$, all concepts $q_i$ in the detection context $N_c(q)$ are used for estimating the relevance prediction $p(q \mid x; A)$. The relevance prediction $p(q_i \mid x)$ predicted by a multiconcept classifier A is seen as evidence of $q_i$ in an image $x$, while the semantic correlation $p(q \mid q_i)$ is treated as the weight of this relevance prediction. In view of the promising performance in single-concept learning reported in [6,7], a single-concept classifier B is integrated into the classifier layer of our CNN. Following [6], the single-concept prediction $p(q \mid x; B)$ between $q$ and $x$ is estimated from the $|q|$ conventional single concepts $c_i \in q$, each predicted by the single-concept classifier B, where $|q|$ is the cardinality of $q$.
As a graphical model, an MRF provides a basis for modeling contextual constraints in image retrieval. Hence, we employ an MRF to analyze the semantic link between the two types of classifiers mentioned above and produce the ultimate semantic score for $q$. We first construct a specific MRF for the two types of classifiers and the query concept, that is, $\{A, B, q\}$, so as to model their correlation. Then we derive the MRF-based fusion method for image concept detection.
Given a set of random variables $Y = \{y_1, \ldots, y_n\}$ on an MRF graph, the joint probability of the MRF is a Gibbs distribution [31]: $p(Y) = \frac{1}{Z}\exp(-E(Y))$, where $Z$ is a normalization factor and $E(Y)$ is the energy function, that is, the sum of clique potentials over all possible cliques. If the random variable $y_q \in \{0, 1\}$ represents the absence or presence of a multiconcept $q$ for an image $x$, the joint probability of the random variable set $\{A, B, y_q\}$ is defined through potential functions with parameters $\lambda = [\lambda_1, \lambda_2]$, which are the CMMR parameters to be estimated, subject to $\lambda_1 + \lambda_2 = 1$.
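To illustrate how a Gibbs distribution with two weighted cliques can fuse two classifier outputs, here is a minimal sketch. The potential functions of the paper are not recoverable from the text, so this example assumes log-linear potentials, $E(y) = -(\lambda_1 \log p_A(y) + \lambda_2 \log p_B(y))$ with $\lambda_1 + \lambda_2 = 1$; this energy and the function name are assumptions, chosen only to show the mechanics of normalizing by $Z$.

```python
import math


def gibbs_fusion(p_a, p_b, lambda1):
    """P(y = 1) under an assumed two-clique Gibbs distribution.

    p_a, p_b: probabilities (strictly between 0 and 1) that the scene is
    present according to classifiers A and B; lambda1 weights A and
    lambda2 = 1 - lambda1 weights B."""
    lambda2 = 1.0 - lambda1

    def energy(y):
        pa = p_a if y == 1 else 1.0 - p_a
        pb = p_b if y == 1 else 1.0 - p_b
        return -(lambda1 * math.log(pa) + lambda2 * math.log(pb))

    # Unnormalized Gibbs weights exp(-E(y)) for absence (0) and presence (1).
    weights = {y: math.exp(-energy(y)) for y in (0, 1)}
    z = weights[0] + weights[1]  # normalization factor Z
    return weights[1] / z


print(gibbs_fusion(0.9, 0.2, 0.7))
```

Under this assumed energy, the fused score reduces to a normalized weighted geometric mean of the two predictions, so setting `lambda1` to 1 or 0 recovers classifier A or B alone.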

Online Retrieval.
CMMR concentrates on semantic image retrieval, including single-concept-based image retrieval and multiconcept-based image retrieval. A user employs multiple concepts to search for the top-K semantically similar images from a database. In summary, we perform four steps for semantic image retrieval.
Step 1. Employ a semantic neighbor method to build the detection context $N_c(q)$.
Step 2. Learn a multiconcept scene classifier A and a single-concept classifier B by our proposed CNN.
Step 3. Learn the final relevance score of $q$ by using MRF-based fusion.
Step 4. Perform semantic image retrieval by using the learned relevance scores. Images with higher relevance scores rank higher.
The detailed process of semantic image retrieval is presented in Algorithm 1. Based on Algorithm 1, we analyze the time and space complexity. Computing the set $V^{+}$ of multiconcept scenes is an offline process, costing $O(1)$ time. Training a CNN is also an offline process, including deep feature extraction and classifier layer learning. This consumes $O(mn)$ time, where $m$ and $n$ are the number of trainable parameters of the CNN and the size of the image set, respectively. By initializing our CNN with a pretrained GoogLeNet and using a very small classifier layer, the number $m$ is substantially reduced, boosting training efficiency. Computing the detection context $N_c(q)$ is an online process, with $O(1)$ time and $O(1)$ space. For each test image, the time and space complexities of computing and fusing predictions are both $O(1)$. Therefore, all test images take $O(n)$ time and $O(n)$ space. Ultimately, ranked images are returned through heap sort, consuming $O(n \log n)$ time and $O(1)$ space. Hence, the time and space complexities of Algorithm 1 are $O(n \log n)$ and $O(n)$, respectively.

7 ... by Eqs. (15) and (16);
8 end
9 Perform heap sort over all predictions $p(y_q \mid x; A, B)$ to obtain the top-$K$ images;
10 Output the image list $\{x_{(1)}, x_{(2)}, \ldots, x_{(K)}\}$ that stands for the search result $R$;
Algorithm 1: Semantic image retrieval process.
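The final ranking step of Algorithm 1 can be sketched with Python's standard heap utilities. The sketch assumes the fused relevance scores are already available as a mapping from image identifiers to scores; the function name and the toy scores are hypothetical.

```python
import heapq


def retrieve_top_k(scores, k):
    """Return the k image ids with the highest fused relevance scores,
    ordered from most to least relevant."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical fused scores for four test images.
fused = {"img_a": 0.21, "img_b": 0.93, "img_c": 0.55, "img_d": 0.08}
print(retrieve_top_k(fused, 2))
```

`heapq.nlargest` keeps only a heap of size `k`, so it runs in roughly $O(n \log k)$ time rather than the $O(n \log n)$ of a full heap sort, which can matter when only a short result list is requested.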

Experiments
Our experiments on semantic image retrieval include multiconcept-based image retrieval and single-concept-based image retrieval.

Datasets.
We conducted comprehensive experiments on two public datasets: MIR Flickr 2011 and NUS-WIDE. Since they include large vocabularies, we chose them to evaluate the performance of multiconcept-based image retrieval. These two datasets are publicly available and contain images and ground truth for single-concept task evaluation. MIR Flickr 2011 contains 18,000 images labeled with 99 semantic concepts. We split it into 8000 training images and 10,000 test images. NUS-WIDE comprises 269,648 images with a vocabulary of 81 semantic concepts. We downloaded 230,708 images in total for our experiments. This dataset is randomly divided into two sets: 138,375 images for training and the remaining 92,333 images for testing.
On MIR Flickr 2011, we follow [33] and use GIST, HOG, SIFT, and RGB histograms as visual features, each compared with its own distance measure. On NUS-WIDE, we use six visual features [13]. Similarly, wavelet texture, edge direction, SIFT, and the LAB and HSV color features are each compared with their own distance measure, as in [33].
The average number of images associated with a concept is around 940 in MIR Flickr 2011 and 5381 in NUS-WIDE.The average number of concepts associated with an image is approximately 11 in MIR Flickr 2011 and about 3 in NUS-WIDE.The label vocabularies consist of dozens of label concepts, and around two-thirds of the semantic concepts have frequencies less than the mean concept frequency.Hence, semantic scene retrieval on these imbalanced datasets is challenging.

Evaluation Measures.
Given a query with semantic scene $q$, the ground truth for $q$ is defined as follows: if an image depicts all $|q|$ target concepts $c_i \in q$, it is considered relevant; otherwise it is irrelevant. To evaluate the performance of semantic retrieval, we use three evaluation measures: Mean Average Precision (MAP), Precision at $K$ (P@$K$), and the Precision-Recall (PR) curve. For each semantic query, Average Precision (AP) is computed as $AP = \frac{1}{M}\sum_{k} P(k)\,\mathrm{rel}(k)$, where $M$ is the total number of relevant images in the test set $T$, $k$ is the rank in the retrieved image list, $\mathrm{rel}(k)$ is an indicator function that equals 1 if the $k$th image is relevant to $q$ and 0 otherwise, and $P(k)$ is the precision at cut-off $k$, defined as the ratio between the number of relevant images among the top $k$ and the number $k$ of retrieved images. MAP is the mean value of the APs over all queries. For AP, the correctness of highly ranked retrieved images counts more. Clearly, the higher the MAP, the better the retrieval performance. P@$K$ is a variant of precision in which only the top-$K$ ranked images are considered. A higher P@$K$ means better retrieval performance. Besides MAP and P@$K$, we employ the PR curve to measure semantic retrieval performance.
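The measures above can be computed as in the following minimal sketch. Each ranked list is represented as a sequence of 0/1 relevance flags in rank order; the function names and the toy rankings are illustrative assumptions.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of relevant images among the top-k retrieved images."""
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance, num_relevant):
    """Sum of precision values at the ranks of relevant images, divided by
    the total number of relevant images in the test set."""
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / num_relevant


def mean_average_precision(runs):
    """Mean of AP over (ranked_relevance, num_relevant) pairs, one per query."""
    return sum(average_precision(r, m) for r, m in runs) / len(runs)


ranking = [1, 0, 1, 0]  # relevant images at ranks 1 and 3
print(average_precision(ranking, num_relevant=2))
print(precision_at_k(ranking, 2))
```

In the toy ranking, the relevant image at rank 1 contributes a precision of 1 while the one at rank 3 contributes 2/3, which shows why correct results near the top of the list weigh more in AP.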

Experimental Configurations.
In (1) and (2), $\alpha$ and $\beta$, respectively, control concept cardinality and concept frequency. Since training images with 11 and 3 concepts appear the most frequently, we set $\alpha = 11$ for MIR Flickr 2011 and $\alpha = 3$ for NUS-WIDE, respectively. To reduce computational cost, the size of $V^{+}$ is limited to an acceptable one. This means that if the frequency of a concept exceeds $\beta$, it is put into $V^{+}$; otherwise it is discarded. We set $\beta = 200$ for MIR Flickr 2011 and $\beta = 50$ for NUS-WIDE in our experiments. Thus, $V^{+}$ contains 15,970 and 2084 multiconcepts, respectively. In (7), $k_1$ is used to control the size of $N_c(q)$, which is determined by 5-fold cross-validation. By testing $k_1$ from a candidate set $\{2i \mid i = 1, \ldots, 20\}$, we observe that the best performance is achieved when setting $k_1 = 10$ on MIR Flickr 2011 and $k_1 = 4$ on NUS-WIDE, respectively. Therefore, we set their values accordingly. In addition, all the parameters in the compared methods are tuned to the best performance reported in the relevant literature.
The basic structure of the convolution layers we use is the same as the one used in [7]. The classifier layer starts with a densely connected layer with an output size of 1024, followed by a 20% dropout. For all layers, the rectified linear unit is employed as the nonlinear activation function. The whole CNN is optimized by stochastic gradient descent with a mini-batch size of 128 and a momentum of 0.9. At the beginning, the CNN learning rate is

Table 1: MAPs (%) and P@10s (%) of semantic image retrieval over all 1599 semantic queries on MIR Flickr 2011. MAP scores and P@10 scores are given in the format MAP/P@10.

As a classical nearest-neighbor method, TagProp uses single-concept techniques to resolve multiconcept-based image retrieval. FastTag learns two linear classifiers coregularized in a joint convex loss function that can be efficiently optimized in closed form on large-scale datasets. The others are influential single-concept-based deep learning methods.
After experimenting with TagProp on the large-scale dataset NUS-WIDE, we found that this method is difficult to scale up to a large-scale dataset due to its $O(n^2)$ time and space complexity. As such, we perform the TagProp experiments by using 25,000 examples on NUS-WIDE. In addition, following [6], we use (9) to compute the relevance prediction, given a query with multiconcept scene $q$.

Experiments on Semantic Image Retrieval.
To evaluate retrieval performance, we construct a test query set Q in two steps. First, all single-concept queries $c \in V$ are added to Q. Then 1500 randomly generated queries with multiconcept scenes $q_i \in 2^{V}$ are put into Q, with 500 2-concepts, 500 3-concepts, and 500 4-concepts, where a $k$-concept is a multiconcept with cardinality $k$. In this way, Q is built. On MIR Flickr 2011, Q comprises 1500 multiconcepts and 99 single concepts, while Q contains 1500 multiconcepts and 81 single concepts on NUS-WIDE. The MAPs and P@10s are used for evaluation on semantic image retrieval with varying query lengths. Tables 1 and 2 report MAP scores and P@10 scores in the format MAP/P@10.
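The construction of the query set Q can be sketched as follows. The text does not specify how the random multiconcept queries are drawn, so uniform sampling of distinct concept combinations is an assumption here, as are the function name and the placeholder concept names.

```python
import random


def build_query_set(vocabulary, per_cardinality=500, cardinalities=(2, 3, 4), seed=0):
    """All single-concept queries plus per_cardinality distinct random
    k-concept queries for each k in cardinalities (uniform sampling assumed).

    The vocabulary must admit at least per_cardinality distinct
    combinations for every requested cardinality."""
    rng = random.Random(seed)
    concepts = sorted(vocabulary)
    queries = [frozenset([c]) for c in concepts]
    for size in cardinalities:
        seen = set()
        while len(seen) < per_cardinality:
            seen.add(frozenset(rng.sample(concepts, size)))
        queries.extend(sorted(seen, key=sorted))
    return queries


# Placeholder vocabularies with the sizes reported for the two datasets.
flickr_queries = build_query_set([f"c{i}" for i in range(99)])
nuswide_queries = build_query_set([f"c{i}" for i in range(81)])
print(len(flickr_queries), len(nuswide_queries))
```

The resulting sizes, 99 + 1500 and 81 + 1500, agree with the 1599 and 1581 queries stated in the captions of Tables 1 and 2.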
From the results, we can see that our method, CMMR, performs better than the other methods. Clearly, multiconcept queries perform much worse than single-concept queries on both datasets. This is because detecting a multiconcept scene is more difficult than detecting a single-concept one. A multiconcept scene may have its own characteristic visual appearance, while the goal of traditional single-concept models is to achieve precise results of single-concept detection. To search for a holistic scene, traditional methods use a combination of single-concept technologies. However, in some cases, this may lose some semantics latently embedded in the holistic scene. Therefore, it is difficult to detect a sophisticated multiconcept scene by using only single-concept classifiers. This observation motivates us to jointly consider the multiconcept scene classifier and the single-concept classifier in devising our CNN. Moreover, the MRF-based fusion method can effectively learn the semantic correlation between the multiconcept scene classifier and the single-concept classifier, boosting the detection accuracy of a semantic scene.
We further conduct comparisons under different experimental settings. More specifically, we construct a comparative group consisting of a difficult query set, with fewer than 100 relevant images, and an easy query set, with more than 100 relevant images. The experimental results are shown in Figure 2. We find that our method still leads the search results. Figure 3 shows the PR curves of all compared methods on the two datasets, illustrating how precision varies with recall. As can be seen, our method CMMR has better precision than the compared methods at every level of recall.

Experiments on Rare Concept Queries.
Most existing approaches assume balanced concept distributions or equal misclassification costs. Nevertheless, a real-world dataset is commonly highly imbalanced [36]. When presented with complex imbalanced datasets, these methods fail to properly represent the distributive characteristics of the data and consequently provide unfavorable precision. On MIR Flickr 2011 and NUS-WIDE, the frequencies of most concepts are below average, leading a concept classifier to overclassify the frequent concepts in the learning stage. This makes it hard to derive a proper model for rare concepts with low occurrence frequencies. In such situations, a concept classifier commonly performs well on frequent concepts but very poorly on rare concepts. This observation suggests that, when developing a classifier, we should consider the varying frequencies of concepts.
Two groups of experiments are devised: rare concept queries and frequent concept queries. In the first group, the top-50 rare single concepts, the top-100 rare 2-concepts, and the top-100 rare 3-concepts from Q are selected as the sets of single-concept, 2-concept, and 3-concept rare queries, respectively denoted by $R_1$, $R_2$, and $R_3$. In the second group, the top-50 frequent single concepts, the top-100 frequent 2-concepts, and the top-100 frequent 3-concepts from Q are, respectively, chosen as the set $F_1$ of single-concept frequent queries, the set $F_2$ of 2-concept frequent queries, and the set $F_3$ of 3-concept frequent queries.
As shown in Figures 4 and 5, concept classifiers achieve higher MAPs and P@10s on the frequent concept sets $F_1$, $F_2$, and $F_3$ but far lower MAPs and P@10s on the rare concept sets $R_1$, $R_2$, and $R_3$, significantly impacting retrieval performance and user experience. For the rare concept sets $R_1$, $R_2$, and $R_3$ on MIR Flickr 2011, our approach outperforms the compared methods, with improvements of 30%, 24%, and 26% over the second best method in terms of MAP, respectively. On NUS-WIDE, a similar improvement is also observed. During rare concept detection with semantic scene $q$, a group of weighted concept classifiers of its detection context $N_c(q)$ take part in concept detection through the MRF-based fusion method. Among the concepts from $N_c(q)$, some concepts $q_i \in N_c(q)$ may be frequent concepts, which significantly boosts the relevance prediction of $q$ and makes the rare concept $q$ easier to detect. Moreover, our maximization of the log likelihood of semantic concepts compensates for the varying frequencies of concepts. Consequently, our approach can mitigate the issue of concept imbalance, thus boosting retrieval performance.

Conclusion
Searching semantic images with high accuracy has become important because of a vast number of real-world applications such as cognitive educational resource retrieval. As a key step, image scene detection plays an important role in semantic image retrieval. In this paper, we have presented a novel CNN framework for semantic image retrieval, which combines CNN and MRF in a way that enhances the capacity of multiconcept scene detection. Compared with previous methods, our CNN framework seamlessly incorporates three components: a single-concept classifier, a multiconcept scene classifier, and semantics. The combination of these three components can enhance the capability of a CNN for detecting semantic scenes. We have conducted comprehensive experiments on two public datasets. The favorable results indicate that our proposed method outperforms the compared approaches.
For future work, we intend to develop a better learning and fusion method for multiconcept scene detection.

Figure 2: Semantic retrieval performance (MAPs % and P@10s %) over the comparative group: a difficult query set and an easy query set on two datasets.

Figure 5: MAPs and P@10s (%) of semantic image retrieval for rare concepts and frequent concepts on NUS-WIDE.
For learning the multiconcept of a scene $q_j$, a positive sample set and a negative sample set are built from the annotations of the training images. In the loss function, $n_1$ is the number of images and $p(q_j \mid x_i)$ is the ground truth of the image $x_i$ for the $j$th multiconcept $q_j$: $p(q_j \mid x_i) = 1$ if $q_j$ appears in $x_i$ and $p(q_j \mid x_i) = 0$ otherwise.

CNN Training.
Training a CNN is a two-stage process: convolution layer training and classifier layer training. The former extracts deep features, while the latter learns a reasonable concept classifier. This process is time-consuming, especially for training on large image databases. Therefore, a publicly released pretrained GoogLeNet is employed to accelerate training. This procedure includes three steps. First, our CNN model is initialized with the pretrained GoogLeNet so that it is able to extract deep features. Next, these deep features are fed into the classifier layer, which is then trained. Finally, the CNN is retrained by freezing the bottom convolution blocks and fine-tuning the top convolution block and the classifier.
In the semantic probability $p(q \mid q_i)$, $f(q)$ and $f(q_i)$ are the occurrence frequencies of $q$ and $q_i$, respectively, and $f(q \cup q_i)$ is the number of images simultaneously including the two multiconcepts $q$ and $q_i$. Each multiconcept $q_i$ is seen as its own semantic neighbor and hence $p(q_i \mid q_i) = 1$. Second, we assign all subconcepts $q_i \subseteq q$ into the context set $N_c(q)$. Finally, we assign the top-ranked related concepts $q_i$ into the context set $N_c(q)$ from the rest of $N(q)$. Thus, the detection context $N_c(q)$ is generated, with $k_1$ elements. The interdependency probability $p(q \mid q_i)$ should then be normalized.

Optimization.
A widely used technique for parameter optimization is maximum likelihood, which chooses the parameters that maximize the joint probabilities over the training set. As such, we maximize the log-likelihood function $L_q$ of the query $q$. This determines the final relevance prediction $p(y_q \mid x; A, B)$ of the image $x$.

Input: training set $D$ with label vocabulary $V$, test set $T$, and query with multiconcept scene $q$
Output: ranked search result $R$
1 Compute a set $V^{+}$ of multiconcept scenes by Eqs. (1) and (2);
2 Train our CNN and obtain the multiconcept scene classifier A and the single-concept classifier B;
3 Construct the detection context $N_c(q)$;
7 Perform relevance prediction fusion of A and B, and compute the final prediction $p(y_q \mid x; A, B)$ by Eqs. (15) and (16);

Table 2: MAPs (%) and P@10s (%) of semantic image retrieval over all 1581 semantic queries on NUS-WIDE. MAP scores and P@10 scores are given in the format MAP/P@10.
Figure 4: MAPs and P@10s (%) of semantic image retrieval for rare concepts and frequent concepts on MIR Flickr 2011.