Content-based image retrieval (CBIR) systems require users to query images by their low-level visual content; this not only makes it hard for users to formulate queries, but also can lead to unsatisfactory retrieval results. To address this problem, image annotation was proposed. The aim of image annotation is to automatically assign keywords to images, so that image retrieval users are able to query images by keywords. Image annotation can be regarded as an image classification problem: images are represented by some low-level features, and supervised learning techniques are used to learn the mapping between the low-level features and high-level concepts (i.e., class labels). One of the most widely used feature representation methods is bag-of-words (BoW). This paper reviews related works on improving and/or applying BoW for image annotation. Moreover, many recent works (from 2006 to 2012) are compared in terms of the methodology of BoW feature generation and experimental design. In addition, several open issues in using BoW are discussed, and some important directions for future research are suggested.
Advances in computer and multimedia technologies allow digital images to be produced, and large repositories for image storage to be maintained, at little cost. This has led to a rapid increase in the size of image collections, including digital libraries, medical imaging, art and museum collections, journalism, advertising, home photo archives, and so forth. As a result, it is necessary to design image retrieval systems that can operate on a large scale. The main goal is to create, manage, and query image databases in an efficient and effective, that is, accurate, manner.
Content-based image retrieval (CBIR), which was proposed in the early 1990s, is a technique for automatically indexing images by extracting their (low-level) visual features, such as color, texture, and shape; the retrieval of images is then based solely upon the indexed image features [
The semantic gap is the gap between the low-level features extracted and indexed by computers and the high-level concepts (or semantics) of users’ queries. That is, the indexes built automatically by CBIR systems cannot readily be matched to users’ requests. The notion of similarity in the user’s mind is typically based on high-level abstractions, such as activities, entities/objects, events, or some evoked emotions, among others. Therefore, retrieval by similarity using low-level features like color or shape will not be very effective. In other words, human similarity judgments do not obey the requirements of the similarity metrics used in CBIR systems. In addition, general users usually find it difficult to search or query images using color, texture, and/or shape features directly. They usually prefer textual or keyword-based queries, since these are easier and more intuitive for representing their information needs [
Consequently, the semantic gap problem has been approached through automatic image annotation, in which computers learn which low-level features correspond to which high-level concepts. Specifically, the aim of image annotation is to enable computers to extract meaning from low-level features via a learning process over a given set of training data consisting of pairs of low-level features and their corresponding concepts. The computers can then assign the learned keywords to new images automatically. For a review of image annotation, please refer to Tsai and Hung [
Image annotation can be defined as the process of automatically assigning keywords to images. It can be regarded as an automatic classification of images by labeling images into one of a number of predefined classes or categories, where classes have assigned keywords or labels which can describe the conceptual content of images in that class. Therefore, the image annotation problem can be thought of as image classification or categorization.
More specifically, image classification can be divided into object categorization [
However, image annotation performance is heavily dependent on the image feature representation. The bag-of-words (BoW) or bag-of-visual-words model, a well-known and popular feature representation method for documents in information retrieval, was first applied to the field of image and video retrieval by Sivic and Zisserman [
The BoW feature is usually based on tokenizing keypoint-based features, for example, scale-invariant feature transform (SIFT) [
Since 2003, BoW has been used extensively in image annotation, but there has not yet been a comprehensive review of this topic. Therefore, the aim of this paper is to review the work on using BoW for image annotation from 2006 to 2012.
The rest of this paper is organized as follows. Section
The bag-of-words (BoW) methodology was first proposed in the text retrieval domain problem for text document analysis, and it was further adapted for computer vision applications [
Extracting the BoW feature from images involves the following steps: (i) automatically detect regions/points of interest, (ii) compute local descriptors over those regions/points, (iii) quantize the descriptors into words to form the visual vocabulary, and (iv) count the occurrences in the image of each specific word in the vocabulary in order to construct the BoW feature (a histogram of word frequencies) [
Four steps for constructing the bag-of-words for image representation.
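To make these four steps concrete, the following is a minimal sketch (an illustration only, not the pipeline of any particular surveyed work) using OpenCV's SIFT implementation and scikit-learn's k-means; the vocabulary size and image paths are placeholders.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_descriptors(image_paths):
    """Steps (i)-(ii): detect keypoints (DoG) and compute SIFT descriptors."""
    sift = cv2.SIFT_create()
    descriptor_sets = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            descriptor_sets.append(desc)
    return descriptor_sets

def build_vocabulary(descriptor_sets, k=1000):
    """Step (iii): quantize descriptors into k visual words via k-means."""
    return KMeans(n_clusters=k, n_init=3).fit(np.vstack(descriptor_sets))

def bow_histogram(descriptors, vocabulary):
    """Step (iv): count the occurrences of each visual word in one image."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist.astype(float) / max(hist.sum(), 1)  # TF-normalized histogram
```

In practice, the descriptors of all training images are pooled to learn the vocabulary, which is then frozen and reused to quantize both training and test images.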
The BoW model can be defined as follows. Given a training dataset of images, local descriptors are extracted from all images and quantized (typically by clustering) into a visual vocabulary of k visual words; each image is then represented by a k-dimensional vector whose ith element counts the occurrences of the ith visual word in that image.
The first step of the BoW methodology is to detect local interest regions or points. Features of interest points (or keypoints) are computed at predefined locations and scales. Several well-known region detectors described in the literature are discussed below [

Harris-Laplace regions are detected by the scale-adapted Harris function and selected in scale-space by the Laplacian-of-Gaussian operator. Harris-Laplace detects corner-like structures.

DoG regions are localized at local scale-space maxima of the difference-of-Gaussian. This detector is suitable for finding blob-like structures. In addition, the DoG point detector has previously been shown to perform well, and it is also faster and more compact (fewer feature points per image) than other detectors.

Hessian-Laplace regions are localized in space at the local maxima of the Hessian determinant and in scale at the local maxima of the Laplacian-of-Gaussian.

Salient regions are detected in scale-space at local maxima of the entropy. The entropy of pixel intensity histograms is measured for circular regions of various sizes at each image position.

Maximally stable extremal regions (MSERs) are components of connected pixels in a thresholded image. A watershed-like segmentation algorithm is applied to image intensities, and segment boundaries that are stable over a wide range of thresholds define the region.
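Two of these detectors are directly available in OpenCV, which makes them convenient starting points; a minimal sketch (the image path is a placeholder, and the function names are OpenCV's, not the original authors'):

```python
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image

# DoG keypoints (the detector underlying SIFT), suited to blob-like structures
dog_keypoints = cv2.SIFT_create().detect(img, None)

# Maximally stable extremal regions (MSERs)
regions, bboxes = cv2.MSER_create().detectRegions(img)

print(len(dog_keypoints), "DoG keypoints;", len(regions), "MSERs")
```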
In Mikolajczyk et al. [
On the other hand, according to Hörster and Lienhart [
Some authors believe that a very precise segmentation of an image is not required for the scene classification problem [
In most studies, a single type of local descriptor is extracted, with the scale-invariant feature transform (SIFT) descriptor being the most widely used [
In order to reduce the dimensionality of the SIFT descriptor, which is usually 128 dimensions per keypoint, principal component analysis (PCA) can be used; this both increases image retrieval accuracy and makes matching faster [
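As an illustration of this step, the PCA projection might look as follows; the target dimensionality of 36 and the random matrix standing in for pooled SIFT descriptors are assumptions of this sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for 128-D SIFT descriptors pooled from the training images.
descriptors = np.random.rand(10000, 128).astype(np.float32)

pca = PCA(n_components=36).fit(descriptors)  # learn the projection
reduced = pca.transform(descriptors)         # 36-D descriptors, faster to match

print(reduced.shape, pca.explained_variance_ratio_.sum())
```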
SIFT was found to work best [
In addition, Quelhas et al. [
When the keypoints are detected and their features are extracted, such as with the SIFT descriptor, the final step of extracting the BoW feature from images is based on vector quantization. In general, the k-means clustering algorithm is used to cluster the local descriptors, and the resulting cluster centers are taken as the visual words of the vocabulary.
To generate visual words, many studies focus on capturing spatial information in order to improve the limitations of the conventional BoW model, such as Yang et al. [
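A common way to capture such spatial information is to compute a separate BoW histogram per cell of successively finer grids and concatenate the results, in the spirit of the spatial pyramid scheme that recurs throughout this review; the grid sizes and normalization below are illustrative choices:

```python
import numpy as np

def spatial_bow(keypoint_xy, word_ids, img_w, img_h, k, grids=(1, 2, 4)):
    """Concatenate per-cell BoW histograms over successively finer grids."""
    feats = []
    for g in grids:                         # g x g grid at each level
        cell_w, cell_h = img_w / g, img_h / g
        hists = np.zeros((g, g, k))
        for (x, y), w in zip(keypoint_xy, word_ids):
            i = min(int(x // cell_w), g - 1)
            j = min(int(y // cell_h), g - 1)
            hists[i, j, w] += 1
        feats.append(hists.ravel())
    feat = np.concatenate(feats)            # k * (1 + 4 + 16) dimensions
    return feat / max(feat.sum(), 1)        # global L1 normalization
```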
However, Van de Sande et al. [
Uijlings et al. [
In their seminal work, Philbin et al. [
Chum et al. [
After the BoW feature is extracted from images, it is fed into a classifier for training or testing. Besides constructing discriminative models as classifiers for image annotation, some Bayesian text models based on latent semantic analysis [
The construction of discriminative models for image annotation is based on the supervised machine learning principle for pattern recognition. Supervised learning can be thought of as learning by examples or learning with a teacher [
The learning task is to compute a classifier or model that maps the BoW feature vector of an image to one of the predefined class labels (i.e., keywords).
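As a concrete instance of such a discriminative model, the sketch below trains an SVM with a Gaussian RBF kernel (the most common choice among the surveyed works) on BoW histograms; the data and hyperparameters are placeholders:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: one BoW histogram per training image; y: class labels (placeholder data).
X = np.random.rand(200, 1000)
y = np.random.randint(0, 5, size=200)

clf = SVC(kernel="rbf", C=10, gamma="scale")    # Gaussian RBF kernel SVM
print(cross_val_score(clf, X, y, cv=5).mean())  # estimate annotation accuracy

clf.fit(X, y)
predicted = clf.predict(X[:3])  # assign learned class labels (keywords) to images
```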
In text analysis, pLSA and LDA are used to discover topics in a document using the BoW document representation. For image annotation, documents and discovered topics are thought of as images and object categories, respectively. Therefore, an image containing instances of several objects is modeled as a mixture of topics. This topic distribution over the images is used to classify an image as belonging to a certain scene. For example, if an image contains “water with waves”, “sky with clouds”, and “sand”, it will be classified into the “coast” scene class [
Following the previous definition of BoW, pLSA is a latent variable model for co-occurrence data which associates an unobserved class variable (topic) z with each occurrence of a visual word w in an image d, so that the joint probability is modeled as P(w, d) = P(d) Σ_z P(z | d) P(w | z).
On the other hand, LDA treats the multinomial weights over topics as latent random variables drawn from a Dirichlet prior, rather than as fixed per-image parameters as in pLSA.
The goal of LDA is to maximize the likelihood

p(w | α, β) = ∫ p(θ | α) [ ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ] dθ,

where θ is the image-specific topic distribution drawn from a Dirichlet prior with parameter α, z_n is the topic of the nth visual word w_n, and β parameterizes the word distribution of each topic.
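In practice, such a topic model can be fitted to visual-word counts with off-the-shelf variational inference; a sketch using scikit-learn, where the number of topics and the random count matrix are placeholders:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder images-by-visual-words count matrix.
counts = np.random.randint(0, 5, size=(100, 1000))

lda = LatentDirichletAllocation(n_components=8, max_iter=20)
theta = lda.fit_transform(counts)  # per-image topic proportions

# theta can be fed to a classifier, or argmax'ed directly for a scene label.
print(theta[0].round(2))
```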
Bosch et al. [
However, it is interesting that Lu and Ip [
This section reviews the literature regarding using BoW for some related problems. They are divided into five categories, namely, feature representation, vector quantization, visual vocabulary construction, image segmentation, and others.
Since the annotation accuracy is heavily dependent on feature representation, using different region/point descriptors and/or the BoW feature representation will provide different levels of discriminative power for annotation. For example, Mikolajczyk and Schmid [
Due to the drawbacks that vector quantization may reduce the discriminative power of images and the BoW methodology ignores geometric relationships among visual words, Zhong et al. [
On the other hand, since the image feature generally carries mixed information of the entire image which may contain multiple objects and background, the annotation accuracy can be degraded by such noisy (or diluted) feature representations. Chen et al. [
Gehler and Nowozin [
Qin and Yung [
As there is a relation between the composition of a photograph and its subject, similar subjects are typically photographed in a similar style. Van Gemert [
In Rasiwasia and Vasconcelos [
In order to reduce the quantization noise, Jégou et al. [
As abrupt quantization into discrete bins causes some aliasing, Agarwal and Triggs [
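A generic way to soften hard quantization, in the spirit of the soft-weighting/soft-assignment schemes surveyed in this section, is to let each descriptor vote for several visual words with distance-based weights; the Gaussian kernel and its sigma parameter below are assumptions of this sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

def soft_bow(descriptors, centers, sigma=100.0):
    """Soft-assignment histogram: every descriptor votes for all visual
    words, weighted by a Gaussian on its distance to each word's center."""
    d = cdist(descriptors, centers)           # distances to all centers
    w = np.exp(-(d ** 2) / (2 * sigma ** 2))  # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True)         # normalize votes per descriptor
    return w.sum(axis=0)                      # accumulate into a histogram
```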
Similarly, Fernando et al. [
On the other hand, Wu et al. [
In de Campos et al. [
Zheng et al. [
Besides reducing the vector quantization noise, another severe drawback of the BoW model is its high computational cost. To address this problem, Moosmann et al. [
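A rough sketch of the randomized-forest idea, where tree leaves serve as visual words so that quantizing a descriptor costs only a few tree traversals instead of a nearest-neighbor search; the forest size, leaf budget, and the way labels are propagated to descriptors are assumptions of this example, not the authors' exact procedure:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Descriptors labeled with the class of their source image (placeholder data).
X = np.random.rand(5000, 128)
y = np.random.randint(0, 10, size=5000)

forest = ExtraTreesClassifier(n_estimators=4, max_leaf_nodes=256)
forest.fit(X, y)

# Each descriptor maps to one leaf per tree; each tree thus contributes
# its own sub-vocabulary of up to 256 visual words.
leaves = forest.apply(X)
print(leaves.shape)  # (n_descriptors, n_trees)
```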
Recently, Van de Sande et al. [
On the other hand, Hare et al. [
Since related studies, such as Jegou et al. [
Gavves et al. [
On the other hand, López-Sastre et al. [
Constructing visual codebook ensembles is another approach to improve image annotation accuracy. In Luo et al. [
Bae and Juang [
Since one major challenge in object categorization is to find class models that are “invariant” enough to incorporate naturally-occurring intraclass variations and yet “discriminative” enough to distinguish between different classes, Winn et al. [
Kesorn and Poslad [
Tirilly et al. [
Effective image segmentation can be an important factor affecting the BoW feature generation. Uijlings et al. [
Besides point detection, an image can be segmented into several or a fixed number of regions or blocks. However, very few studies have compared the effect of image segmentation on generating the BoW feature. In Cheng and Wang [
Similarly, Wu et al. [
In addition to using the BoW feature for image annotation, Larlus et al. [
Although the BoW model has been extensively studied for general object and scene categorization, it has also been considered in some domain-specific applications, such as human action recognition [
Farhadi et al. [
On the other hand, Sudderth et al. [
Chum et al. [
Based on the BoW feature representation, Jegou et al. [
Since the aim of image annotation is to support very large scale keyword-based image search, such as web image retrieval, it is very critical to assess existing approaches over some large scale dataset(s). Chum et al. [
Moreover, Philbin et al. [
On the other hand, Torralba and Efros [
Although modeling the spatial relationship between visual words can improve the recognition performance, the spatial features are expensive to compute. Liu et al. [
For the dimensionality reduction purpose, Elfiky et al. [
Bosch et al. [
In contrast to reducing the dimensionality of the feature representation, selecting more discriminative features (e.g., SIFT descriptors) from a given set of training images has been considered. Shang and Xiao [
Simultaneously learning object/scene category models and performing segmentation on the detected objects were studied in Cao and Fei-Fei [
On the other hand, Tong et al. [
Shotton et al. [
Romberg et al. [
In their study, Lee and Grauman [
Since interest point detection is an important step for extracting the BoW feature, Stottinger et al. [
This section compares related work in terms of how the BoW feature is generated and how the experiments are set up. These comparisons allow us to identify the most suitable interest point detector(s), clustering algorithm(s), and so forth for extracting the BoW feature from images. In addition, we can identify the most widely used dataset(s) and experimental settings for image annotation by BoW.
Table
Comparisons of interest point detection, visual words generation, and learning models.
Work | Region/point detection | Local descriptor | Clustering algorithm | No. of visual words | Weighting scheme | Learning model
---|---|---|---|---|---|---
2012 | | | | | |
de Campos et al. [ | DoG | SIFT | | | | Logistic regression
Elfiky et al. [ | Harris-Laplace | SIFT/HSV | | | | SVM
Fernando et al. [ | Harris-Laplace | PCA-SIFT/SIFT/SURF1 | | 2000 | | SVM
Gavves et al. [ | | SIFT/SURF | | 200000 | |
Kesorn and Poslad [ | DoG | SIFT | SLAC2 | | Binary/TF/TF-IDF | Naïve Bayes/SVM-linear/SVM-RBF
Lee and Grauman [ | NCuts3 | Texton | | 400 | | SVM
Qin and Yung [ | | Color SIFT | | | | SVM-linear/SVM-poly/SVM-RBF
Romberg et al. [ | | SIFT | | | | mm-pLSA4
Shang and Xiao [ | | SIFT | | 1000 | | SVM
Stottinger et al. [ | Harris-Laplace | RGB Harris with Laplacian scale selection | | 4000 | | SVM
Tong et al. [ | Harris-Laplace | SIFT | AKM5 | | |
2011 | | | | | |
Hare et al. [ | DoG/MSER | SIFT | AKM | 1000–100000 | IDF |
López-Sastre et al. [ | Hessian-Laplace | SIFT | CPM and Adaptive Refinement | 3818 | | SVM
Luo et al. [ | DoG | SIFT | | 500 | TF | SVM
Van Gemert [ | Harris and Hessian-Laplace | SIFT | | 2000 | |
Yang et al. [ | | SIFT | | 1000 | | SVM
Zhang et al. [ | DoG | SIFT | HKM6 | 32357 | TF-IDF |
Zhang et al. [ | DoG | SIFT | HKM | 32400 | TF-IDF |
2010 | | | | | |
Bae and Juang [ | Dense sampling | | | 171329 | |
Chen et al. [ | Hessian-Laplace | SIFT | GMM-BIC7 | 3500 | TF |
Cheng and Wang [ | Mean-shift8 | HSV color histogram and co-occurrence matrix | | | | SVM
Ding et al. [ | DoG | PCA-SIFT | | 2000 | | SVM
Jégou et al. [ | Hessian-Laplace | SIFT | | 200000 | TF-IDF |
Jiang et al. [ | DoG | SIFT | | 500–10000 | Binary/TF/TF-IDF/soft-weighting | SVM
Li and Godil [ | DoG | SIFT | | 500/700/800 | TF | pLSA
Qin and Yung [ | | PCA-SIFT | Accelerated k-means | 32/128/2048/4096 | | SVM
Tirilly et al. [ | Hessian-Laplace | SIFT | HKM | 6556–117151 | |
Uijlings et al. [ | | PCA-SIFT | | 4096 | | SVM
Wu et al. [ | | SIFT | | 2500–4500 | | Naïve Bayes/SVM
2009 | | | | | |
Chen et al. [ | DoG | SIFT | | 1000 | Spatial weighting |
Lu and Ip [ | Dense sampling | HSV color + Gabor texture | | 100/200 | | SVM
Lu and Ip [ | Dense sampling | HSV color + Gabor texture | | 100/200 | | LLP9/GLP10/SVM
S. Kim and D. Kim [ | Dense sampling | SIFT/SURF | | 500/1500/3000 | TF | pLSA/SVM
Uijlings et al. [ | Dense sampling | SIFT | | 4096 | | SVM
Xiang et al. [ | NCuts | 36 region features11 | | | | MRFA12
Zhang et al. [ | | SIFT | HKM | 32357 | TF-IDF |
2008 | | | | | |
Bosch et al. [ | Harris-Laplace | Color SIFT | | 1500 | |
Liu et al. [ | Harris-Laplace | SIFT | | 1000 | | SVM-linear
Marszałek and Schmid [ | Harris-Laplace | SIFT | | 8000 | | SVM
Rasiwasia and Vasconcelos [ | | DCT13 coefficients | | | | Hierarchical Dirichlet models/SVM
Tirilly et al. [ | | SIFT | HKM | 6556/61687 | TF-IDF | SVM
Van de Sande et al. [ | Harris-Laplace | Color SIFT | | 4000 | | SVM
Zheng et al. [ | DoG + Hessian-Laplace | SIFT + Spin14 | | 1010 | | SVM
2007 | | | | | |
Bosch et al. [ | Dense sampling | HSV color + co-occurrence + edge | | 700 | | pLSA
Chum et al. [ | Hessian-Laplace | SIFT | | | TF-IDF |
Gökalp and Aksoy [ | Dense sampling | HSV color | | | | Bayesian classifier
Hörster and Lienhart [ | DoG/dense sampling | Color SIFT | | | | LDA
Jegou et al. [ | | SIFT | | 30000 | |
Li and Fei-Fei [ | Dense sampling | SIFT | | 300 | TF | LDA
Lienhart and Slaney [ | | SIFT | | | TF | pLSA
Philbin et al. [ | Hessian-Laplace | SIFT | AKM | 1 M | |
Quelhas et al. [ | DoG | SIFT | | 1000 | | SVM/pLSA
Wu et al. [ | Dense sampling | Texture histogram | | | | Unigram/bigram/trigram models
Junsong et al. [ | DoG | PCA-SIFT | | 160/500 | |
2006 | | | | | |
Agarwal and Triggs [ | Dense sampling | SIFT | EM15 | | | LDA/SVM
Bosch et al. [ | Dense sampling | Color SIFT | | 1500 | |
Lazebnik et al. [ | Dense sampling | SIFT | | 200/400 | | SVM
Marszałek and Schmid [ | Harris-Laplace | SIFT | | 1000 | TF | SVM
Monay et al. [ | DoG | SIFT | | 1000 | TF | pLSA
Moosmann et al. [ | Dense sampling/DoG | HSV color + wavelet/SIFT | Extremely randomized trees | | | SVM
Perronnin et al. [ | DoG | PCA-SIFT | | 1024 | | SVM-linear
1Speeded up robust features [
2Search ant and labor ant clustering algorithm [
3Normalized cuts [
4Multilayer modality pLSA.
5Approximate k-means.
6Hierarchical k-means.
7Gaussian mixture model with Bayesian information criterion.
8Mean shift region segmentation algorithm [
9Local label propagation on the k-nearest-neighbor graph.
10Global label propagation on the complete graph.
11Region color and standard deviation, region average orientation energy (12 filters), region size, location, convexity, first moment, and ratio of region area to boundary length squared [
12Multiple Markov random fields.
13Discrete cosine transform.
14A rotation-invariant two-dimensional histogram of intensities within an image region [
15Expectation maximization.
From Table
On the other hand, several studies used some region segmentation algorithms, such as NCuts [
For the local feature descriptor used to describe interest points, most studies used the 128-dimensional SIFT feature; some considered using PCA to reduce the dimensionality of SIFT, while others “fused” the color feature with SIFT, resulting in higher-dimensional features than SIFT alone. Apart from SIFT-related features, some studies considered conventional color and texture features to represent local regions or points.
Regarding vector quantization, we can see that the k-means clustering algorithm is the most widely used method for generating visual words, although variants such as accelerated, approximate, and hierarchical k-means have also been adopted.
For the number of visual words, related works have considered various amounts of clusters during vector quantization. This may be because the datasets used in these works are different. In Jiang et al. [
On the other hand, the most and second most widely used weighting schemes are TF and TF-IDF. This is consistent with Jiang et al. [
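For reference, both schemes can be computed from the raw visual-word counts as follows (a minimal sketch; normalization and smoothing conventions vary across papers):

```python
import numpy as np

def tf(counts):
    """Term frequency: L1-normalized visual-word counts per image."""
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

def tf_idf(counts):
    """TF-IDF: down-weight visual words that occur in many images."""
    df = (counts > 0).sum(axis=0)                      # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))  # inverse document frequency
    return tf(counts) * idf
```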
Finally, SVM is no doubt the most popular classification technique as the learning model for image annotation. In particular, one of the most widely used kernel functions for constructing the SVM classifier is the Gaussian radial basis function. However, some other SVM classifiers, such as linear SVM and SVM with a polynomial kernel have also been considered in the literature.
Table
Comparisons of datasets used and annotation performance.
Work | Scene | Object | Dataset | No. of categories | No. of images | Baseline
---|---|---|---|---|---|---
2012 | | | | | |
de Campos et al. [ | | v | PASCAL′07/′0816 | 20 | 9292 |
Elfiky et al. [ | v | v | Sport event/15 scene/butterflies17/ | 15/20 | 6000/21000/2000/160k/4194k | Spatial pyramid
Fernando et al. [ | | v | PASCAL′06/Caltech 1018 | 10/10/11 | 5304/3044 | BoW
Gavves et al. [ | | v | Oxford 5k19 | 11 | 5062 |
Kesorn and Poslad [ | v | | Olympic organization website + Google images | 8 | 16000 | pLSA
Lee and Grauman [ | v | v | MSRC-v020/-v2/PASCAL′08/Corel/Gould′09 | 21/20/7/14 | 3457/591/1023/100/715 | LDA
Qin and Yung [ | v | | SCENE-8/-15 | 8/15 | 2688/4485 | BoW
Romberg et al. [ | v | v | Flickr-10M | >300 | 10080251 | pLSA
Shang and Xiao [ | | v | Caltech 256/MSRC | 20/20 | | BoW
Stottinger et al. [ | | | PASCAL′07 | 20 | 9963 |
Tong et al. [ | v | v | Tattoo dataset/Oxford/Flickr | | 101745/5062/1002805 | RS21/HKM/AKM
2011 | | | | | |
Hare et al. [ | v | v | UK Bench/MIR Flickr-2500022 | | | BoW
López-Sastre et al. [ | | v | Caltech 101 | 10 | 890 | Mikolajczyk et al. [
Luo et al. [ | | v | Caltech 4/Graz-0223 | 5/2 | 400/200 | Li and Perona [
Van Gemert [ | v | v | Corel/PASCAL′09 | 20 | 2000/7054 | BoW/spatial pyramid
Yang et al. [ | | v | PASCAL′08 | 20 | 8445 | Divvala et al. [
Zhang et al. [ | v | v | Google images/Caltech 101 and 256 | 15 | 376500 | BoW
Zhang et al. [ | v | v | ImageNet24 | 15 queries | 1.5 million | Nister and Stewenius [
2010 | | | | | |
Bae and Juang [ | v | | Corel | 15 | 20000 | LSA
Chen et al. [ | | v | Oxford buildings/Flickr 1k | 11 (55 queries)/7 | 5062/11282 | Sivic and Zisserman [
Cheng and Wang [ | v | | 6-scene dataset | 6 | 700 | Vogel and Schiele [
Ding et al. [ | | v | TRECVID′0625 | 20 | 61901 | Binary/TF/TF-IDF weighting
Jégou et al. [ | v | v | Holidays26/Oxford 5k/U. of Kentucky object recognition27 | 500/11 (55 queries) | 1491/5062/6376 | BoW by HE28/
Jiang et al. [ | | v | TRECVID′06 | 20 | 79484 |
Li and Godil [ | v | v | Corel | 50 | 5000 | Duygulu et al. [
Qin and Yung [ | v | v | | 8/13/15 | 2688/3759/4485 | Siagian and Itti [
Tirilly et al. [ | v | v | U. of Kentucky object recognition/Oxford 5k/Caltech 6 & 101 | 300/55/200 queries | 10200/5062/8197 | TF-IDF weighting
Uijlings et al. [ | | v | PASCAL′07/TRECVID′05/Caltech 101 | 20/101/15 | 9963/12914/4485 | BoW
Wu et al. [ | | v | LabelMe29/PASCAL′06 | 495/10 | | BoW; Bar-Hillel et al. [
2009 | | | | | |
Chen et al. [ | v | | LabelMe | 8 (448 queries) | 2689 | Yang et al. [
Lu and Ip (a) [ | v | | LabelMe + Web images | 3 | 1239 | k-NN; LDA
Lu and Ip (b) [ | v | v | Corel/histological images | 10/5 | 1000 | pLSA/SVM
S. Kim and D. Kim [ | v | v | Corel/histological images | 10/5 | 1000 | LLP/GLP/SVM/pLSA
Uijlings et al. [ | | v | PASCAL′07 | 20 | 9963 | BoW
Xiang et al. [ | | v | Corel/TRECVID′05 | 50/39 | 5000 | Feng et al. [
Zhang et al. [ | v | v | Google images/Corel/Caltech 101 and 256 | 1506 queries/50/15 | 376500/500/2250 | BoW
2008 | | | | | |
Bosch et al. [ | v | | 6-/8-/13-/15-scene | 6/8/13/15 | 2688/702 | BoW
Liu et al. [ | | v | PASCAL′06/Caltech 4/MSRC-v2 | 20/5/15 | | Savarese et al. [
Marszalek and Schmid [ | v | v | Caltech 256 | 256 | | Lazebnik et al. [
Rasiwasia and Vasconcelos [ | v | | 15-natural scene/Corel | 15/50 | | Bosch et al. [
Tirilly et al. [ | | v | Caltech 6 and 101 | 6/101 | 5435/8697 | SVM
Van de Sande et al. [ | v | v | PASCAL′07/TRECVID′05 | 20 | |
Zheng et al. [ | | v | Caltech 101/PASCAL′05 | 12/4 | | BoW
2007 | | | | | |
Bosch et al. [ | v | | Corel | 6 | 700 | Global and block-based features +
Chum et al. [ | v | v | Oxford + Flickr | | 104844 | BoW
Gökalp and Aksoy [ | v | | LabelMe | 7 | 1050 | Bag of individual regions/bag of region pairs
Hörster and Lienhart [ | v | | Flickr | 12 (60 queries) | 246348 | BoW/color based BoW
Jegou et al. [ | v | v | Object recognition benchmark30 | | 10200 | Object recognition benchmark
Li and Fei-Fei [ | v | | 8 events | 8 | 240 | LDA
Lienhart and Slaney [ | v | | Flickr | 12 (60 queries) | 253460 | LSA
Philbin et al. [ | v | v | Oxford 5 k/Flickr 1 and 2 | 11/145 and 450 tags | 5062/99782/1040801 | BoW
Quelhas et al. [ | v | | Corel + Web images | 5 | 6680/3805/9457/6364 | BoW; Vailaya et al. [
Wu et al. [ | v | v | Caltech 7/Corel | 8/6 | 600 | LDA/pLSA
Yuan et al. [ | | v | Caltech 101 | 2 | 558 | BoW
2006 | | | | | |
Agarwal and Triggs [ | | v | Caltech 7 + Graz/KTH-TIPS31/Cal-IPNP32 | 4/10/2 | 1337/810/360 | LDA
Bosch et al. [ | v | | 6-/8-/13-scene | 6/8/13 | 2688/702/1071 | BoW
Lazebnik et al. [ | v | v | 15-scene/Caltech 101/Graz | 15/101/2 | | Zhang et al. [
Marszalek and Schmid [ | | v | PASCAL′05 | | | Wang et al. [
Monay et al. [ | v | | Corel | 4 | 6600 |
Moosmann et al. [ | | v | Graz-02/PASCAL′05 | 3/4 | | BoW
Perronnin et al. [ | v | v | Corel | 10 | 1000 | BoW; Farquhar et al. [
16
17
18
19
20
21Random seed [
22
23
24
25
26
27
28Hamming embedding.
29
30
31
32
According to Table
Specifically, the PASCAL, Caltech, and Corel datasets are the three most widely used benchmarks for image classification. However, the datasets used in most studies usually contain a small number of categories and images, except for the studies focusing on retrieval rather than classification. That is, similarity-based queries are used to retrieve relevant images instead of training a learning model to classify unknown images into one specific category.
For the chosen baselines, most studies compared against plain BoW and/or spatial-pyramid-based BoW, since their aim was to propose novel approaches to improve these two feature representations. Specifically, Lazebnik et al. [
Besides improving the feature representation per se, some studies focused on improving the performance of LDA and/or pLSA discriminative learning models. Another popular baseline is that of Fei-Fei and Perona [
The above comparisons indicate several issues that were not examined in the literature. Since the local features can be represented using object-based regions by region segmentation [
In addition, the local feature descriptor is a key component in the success of image annotation, and the number of visual words (i.e., clusters) is another factor affecting image annotation performance. Although Jiang et al. [
The learning techniques can be divided into generative and discriminative models, but there are very few studies which assess their annotation performance over different kinds of image datasets, which is necessary in order to fully understand the value of these two kinds of learning models. On the other hand, a combination of generative and discriminative learning techniques [
For the experimental setup, the target of most studies was not image retrieval. In other words, the performance evaluation was usually for small scale problems based on datasets containing a small number of categories, say 10. However, image retrieval users will not be satisfied with a system providing only 10 keyword-based queries to search relevant images. Some benchmarks are much more suitable for larger scale image annotation, such as the Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) by ImageNet (
However, it is also possible to combine some smaller-scale datasets, composed of relatively small numbers of images and/or categories, into larger datasets. For example, the combination of Caltech 256 and Corel could be regarded as a benchmark that is closer to the real-world problem.
In this paper, a number of recent related works using BoW for image annotation are reviewed. We can observe that this topic has been extensively studied in recent years. For example, much effort has gone into improving the discriminative power of BoW feature representations through techniques such as image segmentation, vector quantization, and visual vocabulary construction. In addition, there are other directions for integrating the BoW feature into different applications, such as face detection, medical image analysis, 3D image retrieval, and so forth.
From comparisons of related work, we can identify the most widely used methodology for extracting the BoW feature, which can be regarded as a baseline for future research. That is, DoG is used as the keypoint detector and each keypoint is represented by the SIFT feature. The vector quantization step is then based on the k-means clustering algorithm.
On the other hand, for the dataset issue in the experimental design, which can affect the contribution and the final conclusions, the PASCAL, Caltech, and/or Corel datasets can be used for an initial study.
According to the comparative results, there are some future research directions. First, the local feature descriptor used for vector quantization, usually the point-based SIFT feature, can be compared with other descriptors, such as region-based features or a combination of different features. Second, a guideline should be provided for determining the appropriate number of visual words for different kinds of datasets. The third issue is to assess the performance of generative and discriminative learning models over different kinds of datasets, for example, different dataset sizes and different image contents (e.g., a single object per image versus multiple objects per image). Finally, it is worth examining the scalability of the BoW feature representation for large-scale image annotation.