With the fast-growing number of images uploaded every day, efficient content-based image retrieval has become important. Hashing methods, which represent images as binary codes and judge similarity by Hamming distance, are widely adopted for their advantages in storage and search speed. A good binary representation of images is the determining factor of retrieval quality. In this paper, we propose a new deep hashing method for efficient image retrieval. We propose an algorithm to calculate target hash codes that encode the relationship between images of different contents. The target hash codes are then fed to a deep network for training. Two variants of the deep network, DBR and DBR-v3, are proposed for databases of different image sizes and scales. After training, our deep network produces hash codes with large Hamming distance for images of different contents. Experiments on standard image retrieval benchmarks show that our method outperforms other state-of-the-art methods, including unsupervised, supervised, and deep hashing methods.
With the rapid development of storage techniques, millions of images are uploaded to and stored on the Internet every second. Given a query image, efficiently locating a certain number of images with similar content in a large database is a major challenge: speed and accuracy must be carefully balanced. This task is known as content-based image retrieval (CBIR) [
The binary representation of images is an emerging approach to handling both the storage and the search-speed requirements of CBIR. Such approaches are called hashing methods, and they work in three steps. First, a hash function maps the database images (gallery images) into binary codes, which are stored on the storage device; a typical code length is 48 bits. Second, the Hamming distance between the binary code of the query image and each stored binary code is computed. Third, the images with the smallest Hamming distance to the query image are assumed to have similar content and are retrieved. Some examples of proposed hashing methods are in [
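The distance computation and retrieval steps above can be sketched with packed binary codes; this is an illustrative sketch (the gallery here is random, and the 48-bit length follows the text), showing why binary codes are cheap to store and fast to compare:

```python
import numpy as np

# Gallery of 48-bit codes stored as packed bytes (6 bytes per image),
# illustrating the storage advantage of binary representations.
rng = np.random.default_rng(0)
gallery_bits = rng.integers(0, 2, size=(1000, 48), dtype=np.uint8)
gallery = np.packbits(gallery_bits, axis=1)      # shape (1000, 6), one row per image

query = np.packbits(gallery_bits[42])            # reuse one gallery code as the query

# Hamming distance = popcount of XOR, computed for the whole gallery at once.
xor = np.bitwise_xor(gallery, query)
dist = np.unpackbits(xor, axis=1).sum(axis=1)    # shape (1000,)

nearest = int(np.argmin(dist))                   # index of the most similar image
```

Because the query was copied from gallery entry 42, its distance to that entry is zero, so the nearest neighbor has distance zero.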
The critical part of a hashing method is the features it uses to derive the hash code. Every hashing method includes a feature extraction step, and the quality of the features directly affects retrieval accuracy. Recently, convolutional neural networks (CNNs) have proved remarkably effective in tasks that depend heavily on feature extraction, such as image classification [
In this paper, we propose a new supervised deep hashing method for learning compact hash codes for content-based image retrieval; we call it deep binary representation (DBR). This paper is an extended version of the work [
Hashing methods include data-independent methods [
Data-dependent methods can be further divided into unsupervised methods and supervised methods. Unsupervised methods include spectral hashing (SH) [
The hashing methods mentioned above use hand-crafted features, which are not powerful enough to capture more complicated semantic similarity. Moreover, their feature extraction procedure is independent of hash function learning. Recently, CNN-based hashing methods, called deep hashing methods, have been proposed to address these problems. A CNN can learn more representative features than hand-crafted ones. Furthermore, most deep hashing methods perform feature learning and hash function learning simultaneously and show great improvement over previous methods. Several deep hashing methods have been proposed and shown to achieve better accuracy in content-based image retrieval. For example, CNNH [
Our method aims to find a hash function solving the content-based image retrieval task. Given
(1)
(2)
Figure
The process flow of our work. First, we use our proposed target hash code generation algorithm to generate optimal hash code set. Next, we use hash code set and training images to train the hash learning CNN network, which is regarded as the hash function. Finally, we use the hash function to perform content-based image retrieval.
The CNN network of our proposal. The upper part is the DBR network, and the lower one is the DBR-v3 network based on inception-v3 net. First, target hash code set is generated based on hash length
Target hash code is the mathematically optimal code set with
To fit the target hash code length, we replace the last layer of original CNN classification model with a fully connected layer called hash layer which has
Since the training images are in
(Algorithm 1: generate the target hash code set by repeatedly choosing the candidate code with the maximum minimum Hamming distance to the codes already chosen.)
For instance, given code length
Target 12-bit hash code set for a 10-category dataset.
Label | Decimal code | Target hash code |
---|---|---|
0 | 504 | 000111111000 |
1 | 1611 | 011001001011 |
2 | 1652 | 011001110100 |
3 | 1932 | 011110001100 |
4 | 1971 | 011110110011 |
5 | 2709 | 101010010101 |
6 | 2730 | 101010101010 |
7 | 2898 | 101101010010 |
8 | 2925 | 101101101101 |
9 | 3294 | 110011011110 |
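The generation procedure can be sketched as a greedy farthest-point search; this is one plausible reading of the algorithm, not necessarily the paper's exact procedure (it starts from an arbitrary code, so it produces a different but comparably spread code set than the table above):

```python
from itertools import product  # (unused here; candidates are enumerated directly)

def hamming(a, b):
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")

def greedy_target_codes(n_labels, n_bits):
    """Greedily pick n_labels codes of n_bits each, maximizing the minimum
    pairwise Hamming distance (feasible for small n_bits)."""
    candidates = list(range(2 ** n_bits))
    chosen = [candidates[0]]  # start from an arbitrary code
    while len(chosen) < n_labels:
        # pick the candidate whose minimum distance to the chosen set is largest
        best = max(candidates, key=lambda c: min(hamming(c, s) for s in chosen))
        chosen.append(best)
    return chosen

codes = greedy_target_codes(10, 12)
dmin = min(hamming(a, b) for i, a in enumerate(codes) for b in codes[i + 1:])
```

Farthest-point selection of this kind is a standard 2-approximation for max-min dispersion, so the resulting minimum distance is within a factor of two of optimal.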
In some situations, the semantic relation between different labels is not evenly distributed. For example, image samples in a dataset are divided into 3 different categories and their labels are
To generate such target hash code set, we need further information called semantic relation matrix
(Algorithm 2: generate the semantically uneven target hash code set; the required Hamming distance between each pair of codes is weighted by the corresponding value in the semantic relation matrix, and codes are chosen accordingly.)
For instance, given code length
Semantically uneven target 12-bit hash code set for a 10-category dataset: Hamming distances between codes 0, 1, and 2 are small and those between codes 7, 8, and 9 are large.
Label | Decimal code | Target hash code |
---|---|---|
0 | 0 | 000000000000 |
1 | 15 | 000000001111 |
2 | 51 | 000000110011 |
3 | 252 | 000011111100 |
4 | 853 | 001101010101 |
5 | 874 | 001101101010 |
6 | 1430 | 010110010110 |
7 | 1449 | 010110101001 |
8 | 2714 | 101010011010 |
9 | 3174 | 110001100110 |
With the label information
After preparing the training samples, we build a deep network to learn to map images to hash codes. For small image datasets with the size of around
For the small-image deep network, we call our method deep binary representation (DBR). We take CIFAR-10 training as an example. We adopt a widely used simple CNN model for CIFAR-10 to enable fast retrieval. A CNN can learn powerful image features through the stacking of convolutional and fully connected layers. As shown in Figure
The target hash code carries all the information the network needs to learn from images, so the loss function need not be specially designed; a simple mean squared error (MSE) loss works well. For the training optimizer, we choose Adadelta [
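A minimal sketch of this training setup (not the paper's exact DBR architecture; the layer sizes here are placeholders, and the sigmoid hash-layer activation is an assumption matching the 0/1 target codes):

```python
import torch
import torch.nn as nn

k = 12  # hash length

# Small CNN ending in a k-unit "hash layer", for CIFAR-10-sized inputs.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, k), nn.Sigmoid(),  # hash layer; sigmoid keeps outputs in [0, 1]
)

images = torch.randn(4, 3, 32, 32)             # dummy batch of 32x32 RGB images
targets = torch.randint(0, 2, (4, k)).float()  # each image's target hash code

criterion = nn.MSELoss()                                # plain MSE, as in the text
optimizer = torch.optim.Adadelta(model.parameters())    # optimizer named in the text

# One training step: regress the network output onto the target codes.
optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```

At retrieval time the sigmoid outputs would be thresholded (e.g., at 0.5) to obtain binary codes.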
For input images with relatively large size like
We make some changes to the inception-v3 to make it fit our hash function. After the final global pooling layer, we add one fully connected layer with 1024 nodes activated with ReLU function. Following this layer is the hash layer, a fully connected layer with
This network accepts input images of
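The replacement head described above can be sketched as a standalone module; this is an illustrative sketch (the hash-layer activation is an assumption, chosen as sigmoid to match 0/1 target codes), which in practice would replace the final classification layer of an inception-v3 backbone:

```python
import torch
import torch.nn as nn

k = 48  # hash length

# DBR-v3 head sketch: inception-v3's 2048-d global-pooled feature goes through
# a 1024-unit ReLU layer, then the k-unit hash layer.
hash_head = nn.Sequential(
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, k), nn.Sigmoid(),
)

pooled = torch.randn(2, 2048)  # stand-in for inception-v3 global-pool output
codes = hash_head(pooled)      # (2, k); thresholded at retrieval time
```

With torchvision, one would assign such a head to the backbone's final fully connected attribute so the pooled feature flows through it.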
After training, we combine all the components together to perform image retrieval. Our trained network accepts an input image
For image retrieval, we regard training images as the image database and test images as query images. Image retrieval process searches top
Map
For each query image
Compare the similarities of retrieved images and the query image. Then evaluate the performance in MAP according to the result.
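The retrieval steps above can be sketched as a small ranking routine (the codes here are tiny toy examples, not outputs of the trained network):

```python
import numpy as np

def retrieve_topk(query_code, db_codes, k=5):
    """Rank database codes by Hamming distance to the query; return top-k indices."""
    dist = (db_codes != query_code).sum(axis=1)   # Hamming distance per gallery code
    return np.argsort(dist, kind="stable")[:k]    # stable sort keeps ties deterministic

db = np.array([[0, 0, 0, 0],
               [0, 0, 0, 1],
               [1, 1, 1, 1],
               [1, 1, 0, 1]])
top = retrieve_topk(np.array([0, 0, 0, 0]), db, k=2)
# nearest two codes: index 0 (distance 0), then index 1 (distance 1)
```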
In this section, we describe our experimental settings and results. We calculate the mean average precision (MAP) of image retrieval on different datasets and list the results in Table
MAP of image retrieval on the MNIST, CIFAR-10, and ImageNet datasets with 4 different bit lengths. We use 1000 query images and calculate the MAP within the top 5000 returned neighbors on MNIST and CIFAR-10. On ImageNet, we use images from 100 random categories, and all validation images of these categories serve as the query set.
Method | MNIST (MAP) | CIFAR-10 (MAP) | ImageNet (MAP) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
12 bits | 24 bits | 32 bits | 48 bits | 12 bits | 24 bits | 32 bits | 48 bits | 16 bits | 32 bits | 48 bits | 64 bits | |
| — | — | — | — | | | | | | | | |
| | | | | | | | | — | — | — | — |
HashNet | — | — | — | — | — | — | — | — | 0.442 | 0.606 | 0.663 | 0.684 |
DHN | — | — | — | — | 0.555 | 0.594 | 0.603 | 0.621 | 0.311 | 0.472 | 0.542 | 0.573 |
DNNH | — | — | — | — | 0.552 | 0.566 | 0.558 | 0.581 | 0.290 | 0.461 | 0.530 | 0.565 |
CNNH+ | 0.969 | 0.975 | 0.971 | 0.975 | 0.465 | 0.521 | 0.521 | 0.532 | — | — | — | — |
CNNH | 0.957 | 0.963 | 0.956 | 0.960 | 0.439 | 0.511 | 0.509 | 0.522 | 0.281 | 0.450 | 0.525 | 0.554 |
KSH | 0.872 | 0.891 | 0.897 | 0.900 | 0.303 | 0.337 | 0.346 | 0.356 | 0.160 | 0.298 | 0.342 | 0.394 |
ITQ-CCA | 0.659 | 0.694 | 0.714 | 0.726 | 0.264 | 0.282 | 0.288 | 0.295 | 0.266 | 0.436 | 0.548 | 0.576 |
MLH | 0.472 | 0.666 | 0.652 | 0.654 | 0.182 | 0.195 | 0.207 | 0.211 | — | — | — | — |
BRE | 0.515 | 0.593 | 0.613 | 0.634 | 0.159 | 0.181 | 0.193 | 0.196 | 0.063 | 0.253 | 0.330 | 0.358 |
SH | 0.265 | 0.267 | 0.259 | 0.250 | 0.131 | 0.135 | 0.133 | 0.130 | 0.207 | 0.328 | 0.395 | 0.419 |
ITQ | 0.388 | 0.436 | 0.422 | 0.429 | 0.162 | 0.169 | 0.172 | 0.175 | 0.326 | 0.462 | 0.517 | 0.552 |
LSH | 0.187 | 0.209 | 0.235 | 0.243 | 0.121 | 0.126 | 0.120 | 0.120 | 0.101 | 0.235 | 0.312 | 0.360 |
The MNIST dataset [
For MNIST, we use 32 convolution kernels of size 3
Our proposed method is compared with state-of-the-art hashing methods including data-independent method LSH [
We follow the experiment configurations of [
To evaluate retrieval performance, we use mean average precision (MAP). For each query image, we calculate the average precision of the retrieved images; MAP is the mean of these average precisions. Note that, for each query image, the correctness of highly ranked retrieved images counts more. The MAP results of our test are shown in Table
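The metric can be computed as follows; this sketch uses one common definition of average precision, where precision is averaged over the relevant hits found in the ranked list, so early hits contribute more:

```python
def average_precision(relevant):
    """AP over a ranked list of booleans (True = relevant); early hits count more."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Two query images' ranked retrieval results (True = same label as the query):
ap1 = average_precision([True, True, False, True])    # hits at ranks 1, 2, 4
ap2 = average_precision([False, True, False, False])  # hit at rank 2
map_score = (ap1 + ap2) / 2                           # MAP = mean of per-query APs
```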
CIFAR-10 [
Other than the methods we compared in Section
Top 12 retrieved images for two query image samples. The dataset is CIFAR-10 and the hash length is 24 bits. As shown in the figure, the precision of highly ranked retrieved images is very high.
Furthermore, we upsample images in CIFAR-10 to the size of
The MAP results are in Table
We also conduct experiments on semantically uneven situations. For ten categories in CIFAR-10 dataset, we suppose that the automobile and the truck are semantically similar. We set the value of
MAP of image retrieval on the CIFAR-10 dataset with 4 different bit lengths. We use 1000 query images and calculate the MAP within the top 5000 returned neighbors. The first row is the result for the semantically even setting and the second for the semantically uneven setting.
Method | CIFAR-10 (MAP) | |||
---|---|---|---|---|
16 bits | 24 bits | 32 bits | 48 bits | |
DBR-even | 0.612 | 0.648 | 0.658 | 0.680 |
DBR-uneven | 0.608 | 0.647 | 0.658 | 0.683 |
Following [
MAP of image retrieval on CIFAR-10 dataset with 4 different bit lengths. We use 10000 query images and calculate the MAP within top 50000 returned neighbors.
Method | CIFAR-10 (MAP) | |||
---|---|---|---|---|
16 bits | 24 bits | 32 bits | 48 bits | |
| | | | |
DPSH | 0.763 | 0.781 | 0.795 | 0.807 |
DRSCH | 0.615 | 0.622 | 0.629 | 0.631 |
DSCH | 0.609 | 0.613 | 0.617 | 0.620 |
DSRH | 0.608 | 0.611 | 0.617 | 0.618 |
The running time of each experiment. Training time is the time to train the hash function; hash time is the time to map one image to its hash code.
Dataset | Method | Epoch | Training time | Hash time |
---|---|---|---|---|
MNIST | DBR | 100 | 120 s | 80 µs
CIFAR-10 | DBR | 300 | 600 s | 160 µs
CIFAR-10 | DBR-v3 | 50 + 20 | 2870 s | 3 ms
ImageNet | DBR-v3 | 50 + 20 | 18 h | 3 ms
ImageNet is an image database with more than 1.2 million images in the training set and more than 50 thousand images in the validation set. Each image belongs to one of 1000 categories. Image sizes vary; typical images are several hundred pixels on each side. ImageNet is currently the largest image database for such tasks, and experiments on it demonstrate our method's ability to handle large-scale, high-definition images.
The network details, including the loss function and training optimizer, are stated in Section
Our proposed method is compared to state-of-the-art hashing methods including HashNet [
To evaluate the performance of retrieval, we use mean average precision (MAP), and the result is shown in Table
In this paper, we present a novel end-to-end hash-learning network for content-based image retrieval. We design the optimal target hash code for each label to feed the network the relation between different labels. Since the target hash codes of different labels have maximized Hamming distance, the deep network can map images of different categories to hash codes with large Hamming distance. For similar images, the network tends to produce exactly the same hash code. The deep network is based on a convolutional neural network. We design two variants of our method:
The authors declare that there are no conflicts of interest regarding the publication of this article.
This work is supported by NSFC (61671296, 61521062, and U1611461) and the National Key Research and Development Program of China (BZ0300013).