Deep Binary Representation for Efficient Image Retrieval

With the fast-growing number of images uploaded every day, efficient content-based image retrieval becomes important. Hashing, which represents images as binary codes and uses the Hamming distance to judge similarity, is widely adopted for its advantages in storage and search speed. A good binary representation of images is the determining factor in image retrieval. In this paper, we propose a new deep hashing method for efficient image retrieval. We propose an algorithm to calculate the target hash code, which indicates the relationship between images of different contents. The target hash code is then fed to the deep network for training. Two variants of the deep network, DBR and DBR-v3, are proposed for image databases of different sizes and scales. After training, our deep network produces hash codes with large Hamming distance for images of different contents. Experiments on standard image retrieval benchmarks show that our method outperforms other state-of-the-art methods, including unsupervised, supervised, and deep hashing methods.


Introduction
Millions of images are uploaded and stored on the Internet every second with the rapid development of storage techniques. Given a query image, efficiently locating a certain number of content-similar images in a large database is a big challenge, and speed and accuracy need to be carefully balanced. This kind of task is content-based image retrieval (CBIR) [1][2][3][4], a technique for retrieving images by automatically derived features such as colour, texture, and shape. There are also applications of CBIR such as free-hand sketch-based image retrieval [5], whose query images are abstract and ambiguous sketches. In CBIR, derived features are not easy to store, and searching through millions or even billions of images is very time-consuming.
The binary representation of images is an emerging approach to deal with both the storage and the searching speed of the CBIR task. This method is called hashing, and it works in three steps. First, use a hash function to map database images (gallery images) into binary codes and store them on the storage device; the typical length is 48 bits. Then calculate the Hamming distance between the binary code of the query image and the stored binary codes. Finally, retrieve the images with the smallest Hamming distance to the query image, since a small distance indicates similar content. Some examples of proposed hashing methods are in [6][7][8][9][10][11].
The critical part of a hashing method is the features it uses to derive the hash code. Every hashing method includes a feature extraction process, and the quality of the features directly affects the retrieval accuracy. Recently, convolutional neural networks (CNNs) have shown remarkable performance in tasks that depend heavily on feature extraction, like image classification [12], natural language processing [13], and video analysis [14]. CNN based methods outperform previous leading ones in these areas, which shows that CNNs can learn robust features representing the semantic information of images. A very natural idea is to use deep learning for learning compact binary hash codes. Following semantic hashing [15], deep hashing methods using CNNs show high performance in content-based image retrieval.
In this paper, we propose a new supervised deep hashing method for learning compact hash codes to perform content-based image retrieval; we call it deep binary representation (DBR). This paper is an extended version of the work [16]. Our method is an end-to-end learning framework with three main steps. The first step is to generate the optimal target hash code from pointwise label information. The second step is to learn image features and the hash function simultaneously through the training process of a carefully designed deep network. The third step is to map image pixels to compact binary codes through the hash function and perform image retrieval. Compared to other deep hashing methods, our method has the following merits.
(1) Our deep hash network is trained with a calculated target hash code. The target hash code is optimal in the sense that the Hamming distance between the codes of different labels is maximized. Methods like [17] derive hash codes from a middle layer of the deep network. Our method produces hash codes from the output layer. This approach is more direct and shows better performance.
(2) Our training process is pointwise; one training sample consists of one image and one target hash code. Compared to pairwise [18] and triplet methods [19,20], whose training process needs two or three images per training sample, the training time is largely shortened. The number of training samples in our method grows linearly with the number of images, whereas the number of candidate pairs or triplets grows quadratically or cubically.
(3) Our method reaches state-of-the-art performance on both small image datasets like CIFAR-10 and relatively large datasets like ImageNet. For large image datasets, we further propose an architecture based on inception-v3 net [21]; we call it DBR-v3. DBR-v3 achieves state-of-the-art performance on image retrieval of the ImageNet dataset. When we apply DBR-v3 to CIFAR-10, a 15 percent performance improvement is achieved.
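The sample-count argument in merit (2) can be made concrete with a small calculation. The sketch below is only illustrative: the function name is hypothetical, and it counts the maximum number of distinct pairs and triplets that could be formed, whereas pairwise and triplet methods in practice usually train on a subset of them.

```python
from math import comb


def training_sample_counts(n_images: int) -> dict:
    """Upper bound on distinct training samples for each supervision scheme."""
    return {
        "pointwise": n_images,           # one image plus its target hash code
        "pairwise": comb(n_images, 2),   # unordered image pairs
        "triplet": comb(n_images, 3),    # unordered image triplets
    }


# 5000 is the supervised training set size used in the experiments.
counts = training_sample_counts(5000)
for scheme, count in counts.items():
    print(f"{scheme:>9}: {count:,}")
```

With 5000 training images, the pointwise scheme sees 5000 samples per epoch, while the candidate pools for pairs and triplets are several orders of magnitude larger.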

Overview of Hashing Methods
Hashing methods include data-independent methods [22] and data-dependent methods [10][23][24][25][26]. Methods of the first category were proposed earlier. The most representative ones are locality-sensitive hashing (LSH) [22] and its variants. Their hash functions are not related to the training data. Instead, they use random projections to map images into a feature space. The second category learns the hash function from the training data. Because of the extra information, data-dependent methods outperform data-independent ones.
Data-dependent methods can be further divided into unsupervised methods and supervised methods. Unsupervised methods include spectral hashing (SH) [23] and iterative quantization (ITQ) [26]. These methods learn hash functions from unlabelled training sets. To deal with more complicated image databases, supervised methods are proposed to learn a better hash function from the label information of training images. For example, supervised hashing with kernels (KSH) [24] requires a limited amount of supervised information and achieves high-quality hashing. Minimal loss hashing (MLH) [25] is based on structured prediction with latent variables and a hinge-like loss function. Binary reconstructive embedding (BRE) [10] develops an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances. Asymmetric inner-product binary coding (AIBC) [27] is a special hashing method based on asymmetric hash functions. AIBC learns two different hash functions for query images and dataset images. It can be applied to both unsupervised datasets and supervised datasets.
The hashing methods mentioned above use hand-crafted features, which are not powerful enough for more complicated semantic similarity. Moreover, the feature extraction procedure is independent of hash function learning. Recently, CNN based hashing methods called deep hashing methods have been proposed to address these problems. CNNs can learn more representative features than hand-crafted ones. Furthermore, most deep hashing methods perform feature learning and hash function learning simultaneously and show great improvement compared to previous methods. Several deep hashing methods have been proposed and shown to have better accuracy in content-based image retrieval. For example, CNNH [18] proposes a two-stage deep hashing method. It simultaneously learns features and hash functions based on learned approximate hash codes. Deep pairwise-supervised hashing (DPSH) [28] performs learning based on pairwise labels. Reference [29] poses hash learning as a problem of regularized similarity learning and simultaneously learns the hash function and image features through triplet samples. The approach proposed in this paper outperforms the above methods.

Proposed Method
Our method aims to find a hash function solving the content-based image retrieval task. Given $N$ training images $X = \{x_1, x_2, \ldots, x_N\}$ belonging to $M$ categories, each $x_i$ is in the form of raw RGB values. Label information is denoted as $Y = \{y_1, y_2, \ldots, y_N\}$, $y_i \in \{1, 2, \ldots, M\}$. Our goal is to learn a function $H(\cdot)$ mapping input images to compact binary codes $b_i = H(x_i)$ with $b_i \in \{0, 1\}^L$, where $L$ stands for the hash length. The hash function $H(\cdot)$ satisfies the following:
(1) $b_i$ and $b_j$ are similar in the Hamming space when $y_i = y_j$.
(2) $b_i$ and $b_j$ are far away in the Hamming space when $y_i \neq y_j$.
Figure 1 shows the whole flowchart of our system and Figure 2 shows the proposed network. A target hash code generation component is proposed to generate the optimal hash code for training based on the code length and the number of categories. Our framework contains a CNN model as the main component. Normally the last layer of a CNN is a Softmax classification layer. We replace it with a hash layer of $L$ nodes. Since the output layer of the CNN model has been changed, we need new output information to replace the labels. The hash function $H(\cdot)$ is the trained model concatenated with a revised sgn function. Finally, we use the trained hash function to perform content-based image retrieval.

Target Hash Code Generation
3.1.1. Normal Situation. The target hash code is the mathematically optimal code set with $M$ codewords, in which the minimum Hamming distance between codewords is maximized. We use the target hash code together with raw images as training samples to train the whole network. We hope to get a network that accepts a raw image as input and maps it to a binary code close to its target hash code. The trained network, which is used as the hash function $H(\cdot)$, produces binary codes satisfying the goal. Learning the relationship between images with different labels is not our purpose. Instead, our purpose is to teach the network how to map an image to a binary code.
That is why we calculate the target hash codes and feed them to the network instead of letting the network learn from the original labels. This is the major difference between our method and others. Furthermore, the target hash code generation component makes our learning pointwise. We require no pairwise inputs like [18], and the training speed is much faster. The function of our target hash code is similar to that of the prototype code in adaptive binary quantization (ABQ) [30]. The difference is that, in ABQ, the binary code of any data point is represented by its nearest prototype, so the output binary code of the hash function lies in the prototype code set. In our method, the target hash code is only used for training, and the hash function can produce binary codes that are not in the target hash code set.
To fit the target hash code length, we replace the last layer of the original CNN classification model with a fully connected layer called the hash layer, which has $L$ nodes. How to generate the target hash code for images with $M$ different labels is the main focus of this part. The following is the detailed problem description and main algorithm.
Since the training images are in $M$ categories, our target is to find a binary code set with $M$ codewords. The minimum Hamming distance between any two codewords should be as large as possible. More specifically, given binary code length $L$ and codeword number $M$, we want to find a code set $C = \{c_1, c_2, \ldots, c_M\}$, $c_i \in \{0, 1\}^L$, whose minimum Hamming distance is maximized. This optimization problem can first be divided into smaller jobs: given code length $L$ and minimum Hamming distance $d$, find a code set with at least $M$ codewords. After that, repeat this process with larger $d$ until no such code set can be found. The last solvable $d$ is the maximized minimum Hamming distance. The whole process is described in Algorithm 1. Please note that this optimization problem is a complicated problem with no fixed result. For different $L$ and $d$, the scale of the code set may not be a certain number [31]. In our experiment cases, we have confirmed that our algorithm finds at least a second-best solution. Consider a 24-bit code set for a 12-category dataset: the best solution is a code set whose minimum Hamming distance is 13 bits, and our algorithm finds one with a minimum Hamming distance of 12 bits.
For instance, given code length $L = 12$ for a dataset with $M = 10$ categories, our algorithm with minimum Hamming distance $d = 6$ results in a code set $C$ with 16 codewords. We further try $d = 7$, which results in a set with 4 codewords and fails to meet our need. We then randomly choose 10 codewords from the former set ($L = 12$, $d = 6$), and the target hash code is constructed as Table 1 shows.
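The search described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: it assumes a greedy scan over all $L$-bit integers in increasing order, it stops each scan as soon as $M$ codewords are found instead of building the full set and sampling from it, and all function names are hypothetical.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")


def greedy_code_set(length: int, min_dist: int, limit: int) -> List[int]:
    """Greedily collect up to `limit` codewords with pairwise distance >= min_dist.

    Assumption: candidates are scanned in increasing integer order.
    """
    codes: List[int] = []
    for cand in range(1 << length):
        if all(hamming(cand, c) >= min_dist for c in codes):
            codes.append(cand)
            if len(codes) == limit:
                break
    return codes


def target_codes(length: int, n_labels: int) -> Tuple[int, List[int]]:
    """Raise the minimum distance until fewer than n_labels codewords fit."""
    best_d, best = 0, []
    for d in range(1, length + 1):
        codes = greedy_code_set(length, d, n_labels)
        if len(codes) < n_labels:
            break  # the previous d was the last solvable one
        best_d, best = d, codes
    return best_d, best


d, codes = target_codes(length=12, n_labels=10)
print("minimum Hamming distance:", d)
for c in codes:
    print(format(c, "012b"))
```

A greedy scan does not guarantee the optimal code set, which matches the remark above that the algorithm may return a second-best solution.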

Semantically Uneven Situation.
In some situations, the semantic relation between different labels is not evenly distributed. For example, image samples in a dataset are divided into 3 different categories with labels $Y = \{y_1, y_2, y_3\}$ = {cat, dog, car}. The images belonging to labels $y_1$ and $y_2$ are of different categories but quite similar. However, images of label $y_3$ are far away from both $y_1$ and $y_2$. When we input a cat as a query image, we hope to retrieve dogs before cars. The target hash code needs to be redesigned; an evenly distributed Hamming distance between each label is not reasonable. In this example, we need a target hash code set $C = \{c_1, c_2, c_3\}$ with a small Hamming distance between $c_1$ and $c_2$. The target hash code set $C = \{11001, 11010, 00000\}$ is a reasonable one, since the Hamming distance between cat and dog is 2 and the others are 3.
To generate such a target hash code set, we need further information called the semantic relation matrix $S$. $S$ is an $M \times M$ matrix. Each element $s_{i,j}$ in $S$ shows the semantic relation between label $y_i$ and label $y_j$, so $s_{i,j}$ always equals $s_{j,i}$ and $s_{i,i} = 0$. A negative number means a closer relation than average, for example, cat and dog. A positive number means a more dissimilar relation than between other labels. A zero value means the normal relation between two labels, and most values should be zero. For the {cat, dog, car} example, $S$ is a 3 × 3 matrix in which $s_{1,2}$ and $s_{2,1}$ are −1 and all other values are 0. To generate this kind of semantically uneven target hash code, we need a slight revision to Algorithm 1. The whole process is shown in Algorithm 2. The only difference is in line (5). When comparing the Hamming distance between the code being generated and an already generated code, we add the corresponding $s$ of these two codes defined in the semantic relation matrix $S$. In Algorithm 2, $s$ means the value in $S$ for the currently compared codes. For example, if we are generating the fifth code and comparing it to the first code in the current code set, $s$ is the value of $s_{5,1}$.
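One possible reading of this revision is sketched below. It is an assumption, not the paper's Algorithm 2: the sketch interprets "adding $s$" as relaxing or tightening the required distance to $d + s_{i,j}$ for the pair of labels being compared, scans candidates greedily in increasing integer order, and uses hypothetical function names and 0-based label indices.

```python
from typing import List, Sequence


def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")


def uneven_code_set(length: int, base_dist: int,
                    relation: Sequence[Sequence[int]]) -> List[int]:
    """Greedy code generation with a per-pair distance requirement.

    relation[i][j] is the semantic offset for labels i and j; the code for
    label i must be at least base_dist + relation[i][j] away from the code
    already assigned to label j (assumed interpretation).
    """
    n_labels = len(relation)
    codes: List[int] = []
    for cand in range(1 << length):
        i = len(codes)  # index of the label the candidate would be assigned to
        if all(hamming(cand, c) >= base_dist + relation[i][j]
               for j, c in enumerate(codes)):
            codes.append(cand)
            if len(codes) == n_labels:
                break
    return codes  # may be shorter than n_labels if no valid set is found


# {cat, dog, car}: cat and dog are semantically closer than average.
S = [[0, -1, 0],
     [-1, 0, 0],
     [0, 0, 0]]
codes = uneven_code_set(length=5, base_dist=3, relation=S)
print([format(c, "05b") for c in codes])
```

In this toy run the cat and dog codes end up closer to each other than either is to the car code, which is the ordering the retrieval step relies on.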

Learning Hash
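After preparing the training samples, we build a deep network to learn to map images to hash codes. For small image datasets with the size of around 30 * 30 pixels, we implement the DBR network based on relatively shallow convolutional neural networks. For large image datasets with image size of 299 * 299, we build our network called DBR-v3 based on inception-v3 [21].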

Deep Network for Small Images.
For the small image deep network, we call our method deep binary representation (DBR). We take CIFAR-10 training as an example. We adopt a widely used simple CNN model for CIFAR-10 for fast retrieval. A CNN has a powerful ability to learn image features through the concatenation of convolution layers and fully connected layers. As shown in Figure 2, we use four convolution layers with 32, 32, 64, and 64 kernels of size 3 × 3. 2 × 2 max pooling and 25% dropout are added after the 2nd and 4th convolution layers. The convolution layers are followed by two fully connected layers with 512 nodes and a 50% dropout after the first one. All these layers are activated with the ReLU function to add nonlinearity. The hash layer is a fully connected layer with $L$ nodes, depending on the length of the target hash code. For larger $L$, the network can be trained to learn more features from the input image, leading to better performance. Each node implies a hidden feature of the input image. The sigmoid function ranges in (0, 1) and for most cases lies in (0, 0.1) ∪ (0.9, 1), which makes it very suitable for indexing the output to binary codes. The target hash code includes all the information needed to learn features from images, so the loss function need not be specially designed; a simple mean squared error (MSE) loss function works well. For the training optimizer, we choose Adadelta [32] for its good balance of speed and convergence point. Without huge modifications to the CNN model, our proposed model can learn a robust hash function within hundreds of training epochs.
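As a rough illustration of the training objective described above, the following sketch computes the MSE between sigmoid outputs of a hash layer and a target hash code. The pre-activation values, the 4-bit target, and the function names are hypothetical and chosen only for the example; the actual network is trained with a deep learning framework rather than by hand.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function; its output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mse_loss(outputs: Sequence[float], target_bits: Sequence[int]) -> float:
    """Mean squared error between hash-layer outputs and the target hash code."""
    return sum((o - t) ** 2 for o, t in zip(outputs, target_bits)) / len(target_bits)


# Hypothetical pre-activation values of a 4-node hash layer and its target code.
pre_activations = [4.0, -3.5, 2.2, -5.0]
target = [1, 0, 1, 0]

outputs: List[float] = [sigmoid(z) for z in pre_activations]
loss = mse_loss(outputs, target)
print([round(o, 3) for o in outputs], round(loss, 5))
```

When the outputs saturate near 0 or 1 on the correct side, the loss approaches zero, which is why a plain MSE objective can drive the network toward the target codes.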

Deep Network for Large Images.
For input images of relatively large size, like 299 * 299 pixels or similar, we call our method DBR-v3. We build our deep network based on inception-v3 [21]. Inception-v3 is a very deep convolutional neural network with more than 20 layers, evolved from inception v1 [33]. This network achieves 21.2% top-1 and 5.6% top-5 error in ILSVRC for single frame evaluation, with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. We adopt the ImageNet pretrained inception-v3 model as our basic model and then apply our revisions and training.
We make some changes to inception-v3 to make it fit our hash function. After the final global pooling layer, we add one fully connected layer with 1024 nodes activated with the ReLU function. Following this layer is the hash layer, a fully connected layer with $L$ nodes. The weights of the newly added layers are randomly initialized and the others are the pretrained weights of the original inception-v3. The training is a two-step process. First, we train the whole network for several epochs. After the top layers are well trained, we freeze the bottom layers and fine-tune the top 2 inception-blocks for several epochs. The loss function and training optimizer are again MSE and Adadelta [32].
This network accepts input images of 299 * 299 pixels. For small image datasets, we use an upsampling algorithm to make the images fit the network. This yields a further performance improvement compared to the original shallow network in Section 3.2.1. The improvement comes from two aspects: the power of pretrained weights and the deeper network.

Image Retrieval.
After training, we combine all the components together to perform image retrieval. Our trained network accepts an input image $x$ in raw pixels and gives an output $o \in (0, 1)^L$. To convert the output $o$ to a binary hash code $h \in \{0, 1\}^L$, we redefine the sgn function so that it binarizes each component of $o$. The hash function $H(\cdot)$ is then the trained network followed by this sgn function, where $o$ is the output of our proposed model.
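The binarization step can be sketched as follows. The threshold of 0.5 is an assumption made for this example, chosen as the midpoint of the sigmoid range; the function names are hypothetical, and packing the bits into an integer is just one convenient storage choice.

```python
from typing import List, Sequence


def binarize(outputs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map each network output in (0, 1) to a bit (assumed threshold: 0.5)."""
    return [1 if o >= threshold else 0 for o in outputs]


def pack_bits(bits: Sequence[int]) -> int:
    """Pack a bit list into one integer, first bit most significant."""
    code = 0
    for bit in bits:
        code = (code << 1) | bit
    return code


# Hypothetical network outputs for a 4-bit hash layer.
bits = binarize([0.97, 0.03, 0.60, 0.40])
code = pack_bits(bits)
print(bits, format(code, "04b"))
```

Storing each code as an integer lets the Hamming distance be computed with a single XOR followed by a bit count.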
For image retrieval, we regard training images as the image database and test images as query images. The image retrieval process searches for the top $K$ most similar images in the database. Three steps are performed to do the image retrieval.
Step 1. For each database image $x_i$, calculate its hash code $b_i = H(x_i)$ and store it.
Step 2. For each query image $x_q$, first calculate $h_q = H(x_q)$ and then retrieve $K$ images ranked by the Hamming distance between $h_q$ and $b_i$; a smaller Hamming distance ranks higher.
Step 3. Compare the similarities of the retrieved images and the query image. Then evaluate the performance in MAP according to the result.
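The ranking and evaluation steps above can be sketched as below. This is a minimal sketch with hypothetical names and toy 4-bit codes; it assumes ties in Hamming distance are broken by database order, and it normalizes average precision by the number of relevant images among the retrieved ones, which is one common convention and may differ from the exact protocol used in the experiments.

```python
from typing import List, Sequence


def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")


def retrieve(query_code: int, db_codes: Sequence[int], top_k: int) -> List[int]:
    """Indices of the top_k database codes, smallest Hamming distance first."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    return order[:top_k]


def average_precision(query_label, ranked_labels: Sequence) -> float:
    """Average of precision@rank over the ranks where a relevant image appears."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels,
                           top_k: int) -> float:
    """Mean of the per-query average precisions."""
    aps = []
    for code, label in zip(query_codes, query_labels):
        ranked = [db_labels[i] for i in retrieve(code, db_codes, top_k)]
        aps.append(average_precision(label, ranked))
    return sum(aps) / len(aps)


# Toy database of 4-bit codes with labels.
db_codes = [0b0000, 0b0001, 0b1111, 0b0011]
db_labels = ["a", "a", "b", "a"]
top3 = retrieve(0b0000, db_codes, top_k=3)
map_score = mean_average_precision([0b0000, 0b0111], ["a", "a"],
                                   db_codes, db_labels, top_k=3)
print(top3, round(map_score, 4))
```

Because relevant images ranked near the top contribute higher precision values, this measure rewards correct results at high ranks more than at low ranks.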

Experiments
In this part, we state our experiment settings and results. We calculate the MAP (mean average precision) of image retrieval on different datasets and list it in Table 3. We apply our DBR method to MNIST and CIFAR-10, and our DBR-v3 method to ImageNet. Furthermore, we upsample the images in CIFAR-10 and apply DBR-v3 to them. For each method, we list the time each network takes to train and to calculate the hash code of one image. The running time is listed in Table 6. We choose not to measure the time of retrieving one image, because once the hash length is determined, the time to retrieve images according to the hash code is the same for all hashing methods. What matters is the time to map one image to its hash code. Please note that some results are missing because they are not available in the corresponding papers and these methods are not fully open-source.

Results on MNIST.
The MNIST dataset [34] consists of 70000 28 × 28 grey-scale images belonging to 10 categories of handwritten Arabic numerals from 0 to 9.
For MNIST, we use 32 convolution kernels of size 3 × 3 for each of the two convolution layers. 2 × 2 max pooling and 25% dropout are added after the 2nd convolution layer. The convolution layers are followed by two fully connected layers with 128 nodes and a 50% dropout after the first fully connected layer. The last layer is the hash layer, whose number of nodes is adjusted to the hash length. The model is trained with the Adadelta optimizer and the mean squared error loss function.
We follow the experiment configurations of [18] and derive results from the same resource. We randomly select 100 images per class, 1000 images in total, as test query images. For the unsupervised methods, we use all the remaining images as the training set. For supervised methods including CNNH, CNNH+, and ours, we select 5000 images (500 images per class) as the training set. The whole training process lasts around 120 s for 100 epochs on a GTX1060 6 GB graphics processing unit. It takes around 80 μs to map one MNIST image to its hash code.
To evaluate the performance of retrieval, we use mean average precision (MAP). For each query image, we calculate the average precision of the retrieved images. MAP is the mean value of these average precisions. Please note that, for each query image, the correctness of high-ranking retrieved images counts more. The MAP result of our test is shown in Table 3; the DBR column is our method. Our method outperforms the other methods in grey-scale image retrieval.

Results on CIFAR-10.
The CIFAR-10 dataset consists of 60000 32 × 32 images belonging to 10 categories including airplane, automobile, and bird. The layer information of the CIFAR-10 implementation is stated in Section 3.2.
Other than the methods we compared in Section 4.1, we compare our method with two more CNN based deep hashing methods, DHN [36] and DNNH [20], and we follow their experiment configuration. We randomly select 100 images per class as the query set. For the unsupervised methods, we use all the remaining images as the training set. For supervised methods, 5000 images (500 images per class) are randomly selected as the training set. The whole training process lasts around 600 s for 300 epochs on a GTX1060 6 GB graphics processing unit. It takes around 160 μs to map one CIFAR-10 image to its hash code. Two images are considered to be similar if they have the same label. The top 12 retrieved images of two query images are shown in Figure 3 as an illustration.
Furthermore, we upsample the images in CIFAR-10 to the size of 299 * 299 and apply the DBR-v3 network to them. We train the whole network for 50 epochs and perform the fine-tuning for 20 epochs. The total training time is 37 s * 50 + 51 s * 20 = 2870 s on a GTX1060 6 GB graphics processing unit. For DBR-v3, it takes around 3 ms to map one CIFAR-10 image to its hash code.
The MAP results are in Table 3; the results of our method are shown in columns DBR and DBR-v3. Our method is better than the other methods, including unsupervised methods, supervised methods, and deep methods with feature learning. DBR-v3 has a big advantage over DBR, because the network is much deeper and pretrained on ImageNet. However, this comes at the cost of much longer training and hash code calculation times.
We also conduct experiments on semantically uneven situations. For the ten categories in the CIFAR-10 dataset, we suppose that the automobile and the truck are semantically similar. We set $s_{\text{truck},\text{automobile}} = -2$ and all other values in the semantic relation matrix $S$ to 0. When the query image is an automobile, we observe that trucks are retrieved with higher rank than all categories other than the automobile. At the same time, Table 4 shows that the overall MAP result remains at the same level. This experiment indicates that our target hash code can encode the semantic relation between different categories.
Following [28], we further compare our method to some deep hashing methods under different experiment settings.

Results on ImageNet.
The network details, including the loss function and training optimizer, are stated in Section 3.2.2. For a fair comparison, we follow the experiment settings in [37]. We randomly select 100 categories; all the images of these categories in the training set are used as training images, and all the images of these categories in the validation set are used as query images. We train the whole network for 50 epochs and fine-tune the top 2 inception-blocks and the hash layer for 20 epochs. The total training time is about 18 hours on a GTX1060 6 GB graphics processing unit. It takes around 3 ms to map one ImageNet image to its hash code.
Our proposed method is compared to state-of-the-art hashing methods including HashNet [37] and most methods mentioned in Section 4.1. The data is derived directly from [37] and the test set is the same.
To evaluate the performance of retrieval, we use mean average precision (MAP), and the result is shown in Table 3. The result shows that our method has a great advantage over the other methods. This indicates that our DBR-v3 method can handle large-scale image retrieval for high-definition images.

Conclusion
In this paper, we present a novel end-to-end hash learning network for content-based image retrieval. We design the optimal target hash code for each label to feed the network with the relation between different labels. Since the target hash codes of different labels have maximized Hamming distance, the deep network can map images of different categories to hash codes separated by a large distance. For similar images, the network tends to produce exactly the same hash code. The deep network is based on a convolutional neural network. We design two variants of our method: (1) DBR for small images: this network trains fast and computes hash codes fast; with powerful clusters, online training is even possible. (2) DBR-v3 based on inception-v3 net: it benefits from the powerful learning ability of the inception net and performs very well on high-definition image retrieval. Finally, we conduct experiments on standard image retrieval benchmarks. The results show that our method outperforms previous works.

Figure 1: The process flow of our work. First, we use our proposed target hash code generation algorithm to generate the optimal hash code set. Next, we use the hash code set and training images to train the hash learning CNN network, which is regarded as the hash function. Finally, we use the hash function to perform content-based image retrieval.

Figure 2: The CNN network of our proposal. The upper part is the DBR network, and the lower one is the DBR-v3 network based on inception-v3 net. First, the target hash code set is generated based on the hash length $L$ and the number of image categories $M$. Then the deep network is trained with raw images and the target hash code. Finally, image retrieval is performed with the hash function, which is the trained network concatenated with a sgn function.

Figure 3: Top 12 retrieved images of two query image samples. The dataset is CIFAR-10 and the hash length is 24 bits. As shown in the figure, the precision of high-rank retrieved images is very high.

Table 1: Target 12-bit hash code set for a 10-category dataset.

Table 2: Semantically uneven target 12-bit hash code set for a 10-category dataset: Hamming distances between codes 0, 1, and 2 are small and those between codes 7, 8, and 9 are large.

Table 3: MAP of image retrieval on the MNIST, CIFAR-10, and ImageNet datasets with 4 different bit lengths. We use 1000 query images and calculate the MAP within the top 5000 returned neighbors on the MNIST and CIFAR-10 datasets. We use images from 100 random categories in the ImageNet dataset, and all validation images of these categories are used as the query set.

Table 4: MAP of image retrieval on the CIFAR-10 dataset with 4 different bit lengths. We use 1000 query images and calculate the MAP within the top 5000 returned neighbors. The first line is the result of the semantically even situation and the second is that of the semantically uneven situation.
thousand images in the validation set. Each image is in one of the 1000 categories. The image size varies, and the common size is hundreds by hundreds of pixels. ImageNet is currently the largest image database for various tasks. Experiments on ImageNet show the ability to deal with large-scale high-definition images.

Table 5: MAP of image retrieval on the CIFAR-10 dataset with 4 different bit lengths. We use 10000 query images and calculate the MAP within the top 50000 returned neighbors.

Table 6: The running time of each experiment. Training time means the time to train the hash function. Hash time means the time to map one image to its hash code.