In recent years, with the explosion of multimedia data from search engines, social media, and e-commerce platforms, there is an urgent need for fast retrieval methods for massive data. Hashing is widely used in large-scale and high-dimensional data search because of its low storage cost and fast query speed. Thanks to the great success of deep learning in many fields, deep learning has been introduced into hashing retrieval, where a deep neural network learns image features and hash codes simultaneously, achieving better performance than traditional hashing methods. However, existing deep hashing methods have limitations; for example, most consider only one kind of supervised loss, which leads to insufficient utilization of supervised information. To address this issue, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). JLTDH combines triplet likelihood loss with linear classification loss; moreover, it adopts triplet supervised labels, which contain richer supervised information than pointwise and pairwise labels. At the same time, to overcome the cubic growth in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method. The whole process is divided into two stages: in the first stage, the triplets generated by the triplet selection method are fed to three CNNs with shared weights for image feature learning, and the last layer of the network outputs a preliminary hash code. In the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision. We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO.
Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.
In recent years, because of the explosive growth of Internet big data, the Internet has been filled with multimedia resources, including images, videos, and text, and there is an urgent need for fast search methods over these massive collections. Approximate nearest neighbor (ANN) [
A series of hashing methods have been proposed to implement efficient ANN search using the Hamming distance [
Recently, triplet loss has been studied for computer vision problems. Triplet labels contain richer information than pairwise labels: each triplet label can be naturally decomposed into two pairwise labels. In particular, a triplet label ensures that, in the learned hash code space, the query image is close to the positive image and far from the negative image, whereas a pairwise label enforces only one of these constraints. The retrieval performance of triplet loss is better than that of pointwise and pairwise losses; therefore, the triplet likelihood loss is introduced in this paper.
At the same time, existing deep hashing methods still underutilize classification information: it influences only the deep network's image representation and has no direct impact on the optimization of the hash function. Therefore, this paper proposes a linear classification loss to address this shortcoming.
Therefore, combining triplet likelihood loss and linear classification loss, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). The supervised information used in this method takes the form of triplet labels. To fully utilize both the triplet information and the classification information, we propose a joint loss function consisting of two parts: the triplet negative log-likelihood loss and the linear classification loss. Through this joint loss function, the hash codes can be further optimized by the linear classifier, which captures the relationship between the label information and the hash codes. We choose a convolutional neural network (CNN), for example, AlexNet, ResNet, or VGG, as our deep learning model, which can learn the image representation and the hash function at the same time. The whole process is divided into two stages: in the first stage, the triplets generated by the triplet selection method are fed to three CNNs with shared weights for image feature learning, and the last layer of the network outputs a preliminary hash code. In the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.
This work is summarized as follows:
In this paper, a triplet deep hashing method with negative log-likelihood loss is proposed; the method performs both image feature representation and hash code learning in a convolutional neural network. To overcome the cubic growth in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method.
To fully utilize the supervised triplet information, JLTDH employs a joint loss function combining the triplet negative log-likelihood loss and the linear classification loss. Relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.
We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.
Hashing methods are divided into data-independent methods and data-dependent methods. Locality-sensitive hashing (LSH) [
Learning-to-hash (L2H) methods include two types: unsupervised hashing and supervised hashing [
By contrast, supervised hashing tries to learn the hash function by utilizing supervised information: it maps data points from the original space to the Hamming space under the guidance of supervised information, which steers the learning of the hash function toward better hash codes. Because it exploits supervision, supervised hashing achieves higher precision than unsupervised hashing, and in recent years more and more researchers have studied it in depth [
However, most traditional supervised hashing methods cannot extract features very well. In recent years, researchers have proposed deep hashing methods, which effectively extract image features to identify similar images and outperform traditional hashing methods. Representative deep hashing methods include convolutional neural network-based hashing (CNNH) [
Recently, some new deep hashing methods have emerged, such as cross-modal hashing and hashing based on generative adversarial networks. Representative methods include progressive generative hashing (PGH) [
In our proposed method JLTDH, the input of the convolutional neural network is triplets. We denote the image triplet set as
Our goal is to learn the hash codes
We introduce the proposed framework of JLTDH, an end-to-end hash learning framework based on the convolutional neural network.
As shown in Figure
Framework of JLTDH.
We generally generate triplets from the category information of the samples: the anchor and positive images are selected from the same category, and the negative image from a different category. However, as the dataset grows, the number of all possible triplets becomes very large; using all triplets is computationally difficult and not optimal, does not help training, and leads to slow convergence. Existing triplet hashing methods do not handle triplet selection well, so the mining and selection of triplets is an urgent problem. We adopt a novel triplet selection method, which is discussed in detail in Section
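The category-based generation described above can be sketched as follows. This is a generic illustration of sampling anchor/positive pairs from the same class and negatives from another class, not the paper's exact selection method; the function name and `num_per_anchor` parameter are hypothetical.

```python
import random

def sample_triplets(labels, num_per_anchor=2, seed=0):
    """Generic category-based triplet sampling (illustrative sketch only):
    for each anchor image, pick positives from the same class and a
    negative from a randomly chosen different class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    triplets = []
    for anchor, lab in enumerate(labels):
        positives = [i for i in by_class[lab] if i != anchor]
        negative_classes = [c for c in by_class if c != lab]
        if not positives or not negative_classes:
            continue  # classes with a single image yield no positive pair
        for _ in range(num_per_anchor):
            pos = rng.choice(positives)
            neg = rng.choice(by_class[rng.choice(negative_classes)])
            triplets.append((anchor, pos, neg))
    return triplets

# toy usage: 6 images in 2 classes
triplets = sample_triplets([0, 0, 0, 1, 1, 1], num_per_anchor=1)
```

Even this simple scheme makes clear why subsampling is essential: with n images, the number of valid triplets grows roughly cubically in n, so any practical method must draw a small informative subset.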
In this part, we use three CNNs with shared weights to extract feature representations suitable for binary hash code learning. We use the AlexNet [
Configuration of AlexNet in our method.
Layer  | #Filters | Filter size | Stride | Padding | Pooling
Conv1  | 64       | 11 × 11     | 4 × 4  | 2 × 2   | 3 × 3
Conv2  | 192      | 5 × 5       | 1 × 1  | 2 × 2   | 3 × 3
Conv3  | 384      | 3 × 3       | 1 × 1  | 1 × 1   | —
Conv4  | 256      | 3 × 3       | 1 × 1  | 1 × 1   | —
Conv5  | 256      | 3 × 3       | 1 × 1  | 1 × 1   | 3 × 3
Full6  | 9216
Full7  | 4096
Full8  | Hash code length
This part generates the hash code of the image from the image features of the previous part. The FC layer uses the sign function as its activation function, so the binary code is obtained directly. The length of the hash code
The joint loss function combines two kinds of supervised loss: the triplet negative log-likelihood loss and the linear classification loss. It is designed to further optimize the hash code so that the hash codes and the classification information maintain the semantic relationships between points. The joint loss is a weighted combination of the triplet label likelihood loss and the linear classification loss.
In large datasets like NUS-WIDE and MS-COCO, the number of all possible triplets is very large, so using all triplets is computationally difficult and not optimal. Specifically, it does not help training and leads to slow convergence. For example, for the MS-COCO dataset used in this paper, whose training set contains 10,000 images, the number of all possible triplets is approximately
Inspired by the study in [
Using the proposed triplet selection method, the number of selected triplets is much smaller than the number of all possible triplets in our datasets. The specific results are shown in Table
Number of triplets using the proposed triplet selection method.
Dataset   | Training images | Groups | Categories | Selected triplets | Time
CIFAR-10  | 5,000           | 20     | 10         | 124,000           | 86 s
NUS-WIDE  | 10,500          | 20     | 21         | 1,372,000         | 316 s
MS-COCO   | 10,000          | 20     | 80         | 1,719,000         | 492 s
The joint loss function combines two kinds of supervised loss functions: the triplet negative log-likelihood loss and the linear classification loss. We introduce them in the following.
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. It is often used to calculate the similarity between two hash codes. In other words, the smaller the Hamming distance between two hash codes, the more similar they are, and vice versa.
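The Hamming distance described above is straightforward to compute; the following minimal sketch shows both a bit-by-bit version for codes stored as ±1 vectors and an XOR/popcount version for codes packed into integers (the function names are illustrative, not from the paper).

```python
def hamming_distance(code_a, code_b):
    """Hamming distance between two equal-length hash codes given as
    sequences of bits (e.g., +1/-1 or 0/1): count differing positions."""
    assert len(code_a) == len(code_b), "codes must have equal length"
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

def hamming_distance_packed(x, y):
    """Fast variant for codes packed into integers: XOR marks the
    differing bit positions, then count the set bits."""
    return bin(x ^ y).count("1")

# usage: these two 4-bit codes differ in positions 1 and 3
d1 = hamming_distance([1, -1, 1, 1], [1, 1, 1, -1])      # 2
d2 = hamming_distance_packed(0b1011, 0b1110)             # 2
```

The packed variant is why Hamming-space retrieval is so fast in practice: one XOR and one popcount per database item.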
The supervised information used by DPSH [
We can define the triplet label likelihood function
We can learn the optimal
In order to solve
From equation (
According to equation (
By maximizing the triplet likelihood
The above optimization problem can minimize the Hamming distance between the anchor point and the positive point and simultaneously maximize the Hamming distance between the anchor point and the negative point. This exactly matches the goal of supervised hashing with the triplet label.
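This optimization goal can be illustrated with a small numeric sketch of a triplet negative log-likelihood. The specific form below (a sigmoid likelihood of the inner-product gap between anchor-positive and anchor-negative, with an optional margin) is one common formulation from the triplet-hashing literature and is an assumption for illustration; the paper's exact loss is given by its own equations.

```python
import math

def triplet_nll(b_anchor, b_pos, b_neg, margin=0.0):
    """Illustrative triplet negative log-likelihood (not necessarily the
    paper's exact equation): the likelihood is a sigmoid of how much
    closer, by inner product, the anchor is to the positive than to the
    negative, minus a margin. Minimizing it pulls the anchor toward the
    positive code and pushes it from the negative code."""
    inner_ap = sum(a * p for a, p in zip(b_anchor, b_pos))
    inner_an = sum(a * n for a, n in zip(b_anchor, b_neg))
    theta = inner_ap - inner_an - margin
    # -log sigmoid(theta): small when the triplet constraint is satisfied
    return math.log(1.0 + math.exp(-theta))

# a "good" triplet (positive matches the anchor) yields a small loss;
# a "bad" triplet (roles swapped) yields a large loss
good = triplet_nll([1, 1, -1], [1, 1, -1], [-1, -1, 1])
bad = triplet_nll([1, 1, -1], [-1, -1, 1], [1, 1, -1])
```

For ±1 codes, a larger inner product is exactly a smaller Hamming distance, so minimizing this loss matches the stated goal: small anchor-positive distance, large anchor-negative distance.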
Substituting formula (
A common problem with deep hashing methods is how to train the neural network to output binary codes [
This method is described below: we use
Then, we approximate equation (
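The difficulty is that sign() has zero gradient almost everywhere, so it cannot be trained by backpropagation directly. A common workaround in deep hashing, sketched below as an illustration (the scaling parameter `beta` and function names are assumptions, not the paper's notation), is to use a smooth surrogate such as tanh during training and apply the true sign function only when producing the final codes.

```python
import math

def relaxed_code(activations, beta=1.0):
    """Differentiable surrogate for sign() during training:
    tanh(beta * x) approaches sign(x) as beta grows, so gradients
    can flow through the hash layer. Illustrative sketch only."""
    return [math.tanh(beta * x) for x in activations]

def binary_code(activations):
    """Test-time binarization with the sign function (here mapping
    x >= 0 to +1 and x < 0 to -1)."""
    return [1 if x >= 0 else -1 for x in activations]

# with a large beta, the relaxation is already close to the binary code
relaxed = relaxed_code([2.0, -2.0], beta=5.0)
binary = binary_code([2.0, -2.0])   # [1, -1]
```

The gap between the relaxed outputs and their binarized versions is the quantization error that the second training stage aims to keep small.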
To integrate the above image feature representation part and the loss function part into a deep neural network framework, we set
Triplet label information was used to learn hash codes by equation (
Equation (
By fixing
Taking the derivative of
So once we solve
We can get a closed solution to a row of
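A closed-form row-wise (or whole-matrix) update of this kind is standard for linear classifiers over fixed hash codes. The sketch below assumes a squared-error classification loss of the common SDH-style form ||Y - W^T B||^2 + lam*||W||^2 with codes B fixed; this assumed form, the variable names, and the function itself are illustrative and may differ from the paper's exact formulation.

```python
import numpy as np

def update_classifier(B, Y, lam=1.0):
    """Closed-form ridge-regression update for the linear classifier W
    with the hash codes held fixed (illustrative sketch).
    B: (code_length, n) matrix of codes, one column per sample.
    Y: (num_classes, n) matrix of one-hot (or multi-hot) labels.
    Setting the gradient of ||Y - W^T B||^2 + lam*||W||^2 to zero
    gives (B B^T + lam I) W = B Y^T."""
    code_length = B.shape[0]
    return np.linalg.solve(B @ B.T + lam * np.eye(code_length), B @ Y.T)

# toy usage: 2-bit codes for 3 samples, 2 classes
B = np.array([[1., -1., 1.], [1., 1., -1.]])
Y = np.array([[1., 0., 1.], [0., 1., 0.]])
W = update_classifier(B, Y, lam=0.5)
```

Because this subproblem is solved exactly in closed form, alternating it with the network update cannot increase the joint objective, which is what makes the alternating optimization stable.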
Similarly, let
Putting equations (
According to equation (
The proposed method JLTDH is briefly summarized in Algorithm
Generate triplet training set:
Randomly sample a minibatch of points from
Calculate
Compute
Compute the binary code of
Compute derivatives for point
Update the parameters
Discrete cyclic coordinate descent (DCC) optimization:
Compute
Iteratively optimize and update
Update
Hash codes
In this section, we describe our experiments and results. Three commonly used datasets are used to verify the effectiveness of our algorithm. We calculate the precision and mean average precision (MAP) of the retrieval results and show the performance of our method on CIFAR-10, NUS-WIDE, and MS-COCO. Specifically, given an anchor
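The MAP metric used throughout the evaluation can be sketched as follows. This is the standard definition (mean over queries of the average precision of the ranked retrieval list); the function names are illustrative, and details such as cutoff ranks may differ from the paper's exact protocol.

```python
def average_precision(retrieved_relevant):
    """AP for one query: retrieved_relevant is a list of 0/1 flags in
    ranked order, where 1 means the item at that rank is relevant.
    AP averages the precision at each rank where a relevant item occurs."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(retrieved_relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_flags):
    """MAP: mean of the per-query average precision values."""
    return sum(average_precision(f) for f in per_query_flags) / len(per_query_flags)

# usage: relevant items at ranks 1 and 3 give AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1])
```

MAP rewards rankings that place relevant images early, which is why it is the standard figure of merit for hashing-based retrieval.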
Our server configuration is as follows: the CPU is an Intel Xeon E5-2687W @ 3.0 GHz with 12 cores, the GPU is an NVIDIA GTX 1080 with 8 GB of memory, and the RAM is 32 GB. The operating system is Ubuntu 16.04 Linux, and the deep learning framework is PyTorch [
We use three widely used benchmark datasets of different scales: CIFAR-10 [
The NUS-WIDE dataset contains 269,648 images in 81 categories. We used the 21 most common categories, randomly selecting 2,100 images (100 per class) as query points and 10,500 images (500 per class) as the training set.
MS-COCO is an image dataset widely used for image recognition, segmentation, and captioning. It contains 82,783 training images and 40,504 validation images, each labeled with some of 80 semantic concepts. We randomly selected 5,000 images as query points, used the remaining images as the database, and randomly sampled 10,000 database images for training. Table
Examples of the datasets.
Dataset   | Example labels (example images omitted)
CIFAR-10  | "airplane"; "cat"; "horse"
NUS-WIDE  | "clouds," "ocean," "sky," "water"; "person," "ocean," "water"
MS-COCO   | "person," "sheep," "dog"; "dog," "bottle," "barrel"
Dataset settings used in our experiment.
Dataset   | Total   | Training (random) | Testing (random) | Labels
CIFAR-10  | 60,000  | 5,000 (500 × 10)  | 1,000 (100 × 10) | 10
NUS-WIDE  | 269,648 | 10,500 (500 × 21) | 2,100 (100 × 21) | 21
MS-COCO   | 82,783  | 10,000            | 5,000            | 80
We compared our method with several representative hashing methods in terms of MAP. The comparison methods fall into two groups: traditional hashing methods and deep hashing methods. Traditional unsupervised hashing methods include SH [
As shown in Table
MAP results for different numbers of bits (12, 24, 32, and 48 bits) on the two benchmark image datasets (CIFAR-10 and NUS-WIDE).
Methods                    | CIFAR-10                          | NUS-WIDE
                           | 12 bits  24 bits  32 bits  48 bits | 12 bits  24 bits  32 bits  48 bits
Deep methods
  Ours                     |
  DTSH                     | 0.710    0.750    0.765    0.774   | 0.773    0.808    0.812    0.824
  DPSH                     | 0.713    0.727    0.744    0.757   | 0.752    0.790    0.794    0.812
  DQN                      | 0.554    0.558    0.564    0.580   | 0.768    0.776    0.783    0.792
  DHN                      | 0.555    0.594    0.603    0.621   | 0.708    0.735    0.748    0.758
  NINH                     | 0.552    0.566    0.558    0.581   | 0.674    0.697    0.713    0.715
  CNNH                     | 0.439    0.511    0.509    0.522   | 0.611    0.618    0.625    0.608
Non-deep methods
  FastH                    | 0.305    0.349    0.369    0.384   | 0.621    0.650    0.665    0.687
  SDH                      | 0.285    0.329    0.341    0.356   | 0.568    0.600    0.608    0.637
  KSH                      | 0.303    0.337    0.346    0.356   | 0.556    0.572    0.581    0.588
  LFH                      | 0.176    0.231    0.211    0.253   | 0.571    0.568    0.568    0.585
  SPLH                     | 0.171    0.173    0.178    0.184   | 0.568    0.589    0.597    0.601
  ITQ                      | 0.162    0.169    0.172    0.175   | 0.452    0.468    0.472    0.477
  SH                       | 0.127    0.128    0.126    0.129   | 0.454    0.406    0.405    0.400
MAP results for different numbers of bits on the three benchmark image datasets. (a) CIFAR-10. (b) NUS-WIDE. (c) MS-COCO.
MAP results for different numbers of bits on the MS-COCO dataset are shown in Table
MAP results for different numbers of bits on the MS-COCO dataset.

Methods  | 8 bits  16 bits  24 bits  32 bits
Ours     |
DQN      | 0.649   0.653    0.666    0.685
DHN      | 0.607   0.677    0.697    0.701
CNNH     | 0.505   0.564    0.569    0.574
SDH      | 0.541   0.555    0.560    0.564
KSH      | 0.492   0.521    0.533    0.534
ITQ-CCA  | 0.501   0.566    0.563    0.562
Our method achieves excellent performance with shorter hash codes. We also examined how performance changes with longer hash codes; since the compared methods do not report results for long hash codes, we discuss only our own method here. As can be seen from Figure
Effects of longer hash codes.
One might argue that the performance improvements come from the neural network rather than our approach. To further verify our method, we compared it with traditional hashing methods using the CNN-F network [
MAP results for different numbers of bits on the two benchmark image datasets (a) CIFAR-10 and (b) NUS-WIDE, compared with different traditional hashing methods using the CNN.
This paper adopts triplet labels; therefore, we focus on comparison with other deep hashing methods that use the triplet label. These methods include DSRH [
MAP results for different numbers of bits on the two benchmark image datasets (a) CIFAR-10 and (b) NUS-WIDE, compared with the triplet label methods.
The most important hyperparameters for JLTDH are
We report the MAP values for different
Effects of
Furthermore, we present the influence of
Effects of
By observing the convergence of the loss function, we can judge whether the selected loss function is reasonable or not. Figure
Convergence of three kinds of losses for different lengths of hash codes (epoch = 150). (a) Joint loss. (b) Triplet likelihood loss. (c) Linear classification loss.
In order to further reveal the respective influence of the two parts of the joint loss function, we have shown the loss curves of triplet likelihood loss and linear classification loss, respectively, in Figures
To confirm the contribution of the different losses to the final performance, we compare against a variant of JLTDH: JLTDH-T, whose loss function contains only the triplet loss and no linear classification loss. As can be seen from Figure
(a) Ablation effects of loss on CIFAR-10. (b) Ablation effects of loss on NUS-WIDE.
To better illustrate the performance improvement of the proposed method, we show the top 12 returned images. Figure
The top 12 images returned by the proposed method.
In this work, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). To fully utilize the supervised triplet information, we propose a joint loss function combining two kinds of supervised loss: the triplet negative log-likelihood loss and the linear classification loss. At the same time, to overcome the cubic growth in the number of triplets and make triplet training more effective, we design a triplet selection method. The whole process is divided into two stages: first, the last layer of the network outputs a preliminary hash code; second, relying on the joint loss function and the backpropagation algorithm, the parameters of the neural network are further updated so that the generated hash code has higher query precision. We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.
Data are owned by a third party: the experimental part of this paper uses three widely used public datasets (CIFAR-10, MS-COCO, and NUS-WIDE), which are publicly accessible. CIFAR-10 can be accessed at
The authors declare that they have no conflicts of interest.
This work was supported by the science and technology project of Chongqing Education Commission of China (No. KJ1600332), the humanities and social science project of Ministry of Education (18YJA880061), the humanities and social science project of Chongqing Education Commission of China (No. 19SKGH035), Chongqing Education Scientific Planning Project (No. 2017GX116), and Chongqing Graduate Education Reform Project (No. yjg193093).