Triplet Deep Hashing with Joint Supervised Loss Based on Deep Neural Networks

In recent years, with the explosion of multimedia data from search engines, social media, and e-commerce platforms, there is an urgent need for fast retrieval methods for massive data. Hashing is widely used in large-scale and high-dimensional data search because of its low storage cost and fast query speed. Thanks to the great success of deep learning in many fields, deep learning has been introduced into hashing retrieval, where a deep neural network learns image features and hash codes simultaneously. Compared with traditional hashing methods, it achieves better performance. However, existing deep hashing methods have some limitations; for example, most methods consider only one kind of supervised loss, which leads to insufficient utilization of supervised information. To address this issue, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH) in this work. JLTDH combines the triplet likelihood loss and the linear classification loss; moreover, it adopts triplet supervised labels, which contain richer supervised information than pointwise and pairwise labels. At the same time, in order to overcome the cubic increase in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method. The whole process is divided into two stages. In the first stage, the triplets generated by the triplet selection method are taken as the input of the CNN, three CNNs with shared weights are used for image feature learning, and the last layer of the network outputs a preliminary hash code. In the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision. We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.


Introduction
In recent years, because of the explosive growth of Internet data, the Internet is filled with a large number of multimedia resources, including pictures, videos, and text, and there is an urgent need for fast search methods for massive data. Approximate nearest neighbor (ANN) [1] search has been widely used in many fields such as image retrieval, computer vision, and data mining. Because of its speed and low memory cost, hashing has become an important branch of ANN search and is one of the most widely used techniques in image retrieval [2][3][4][5][6][7][8][9]. Hashing techniques encode images, documents, videos, or other types of data as short binary codes while preserving the similarity of the original data. The hashing method produces binary encodings that make the nearest neighbor search of large datasets easy. A series of different hashing methods have been proposed to implement efficient ANN search using the Hamming distance [3,5,8,9]. More recently, deep hashing methods [10,11,12,13,14] have shown that image representation and hash coding can be learned more effectively using deep neural networks, resulting in state-of-the-art results on many benchmark datasets.
Recently, triplet loss has been studied for computer vision problems. Triplet labels contain richer information than pairwise labels. Each triplet label can be naturally decomposed into two pairwise labels. In particular, a triplet label ensures that the query image is close to the positive image and far away from the negative image in the learned hash code space, whereas a pairwise label only ensures that one constraint is observed. The retrieval performance of triplet loss is better than that of pointwise and pairwise losses. Therefore, the triplet likelihood loss is introduced in this paper.
At the same time, the existing deep hashing methods still have some shortcomings in the utilization of classification information. The classification information only plays a role in the image representation learned by the deep neural network but has no direct impact on the optimization of the hash function. Therefore, this paper proposes a linear classification loss to deal with this situation. Combining the triplet likelihood loss and the linear classification loss, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). The supervised information used in this method is in the form of triplet labels. To fully utilize the triplet information and the classification information, we propose a joint loss function, which consists of two parts: the triplet negative log-likelihood loss and the linear classification loss. With this joint loss function, the hash codes can be further optimized by the linear classifier. The linear classifier indicates the relationship between the label information and the hash codes. We choose a convolutional neural network (CNN) as our deep learning model, for example, AlexNet, ResNet, or VGG, which can learn the image representation and the hash function at the same time. The whole process is divided into two stages. In the first stage, the triplets generated by the triplet selection method are taken as the input of the CNN, three CNNs with shared weights are used for image feature learning, and the last layer of the network outputs a preliminary hash code. In the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.
The contributions of this work are summarized as follows: (1) A triplet deep hashing method with negative log-likelihood is proposed, and the method performs both image feature representation and hash code learning in a convolutional neural network. In order to overcome the cubic increase in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method. (2) To fully utilize the supervised triplet information, JLTDH uses a joint loss function combining the triplet negative log-likelihood loss and the linear classification loss. Relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.

Related Works
Hashing methods are divided into data-independent methods and data-dependent methods. Locality-sensitive hashing (LSH) [1] is a typical representative of data-independent hashing methods. Its hash function is generated by a random map: random projection maps data points from the original space to the Hamming space, and the training data are not used to learn the hash functions. One drawback of LSH is that it usually requires a very long bit length, which leads to huge storage overhead. Data-dependent methods [15] learn the hash function from the features of the training data and are also known as learning to hash (L2H) methods. Compared to data-independent methods, L2H methods can achieve better accuracy with shorter hash codes. Therefore, L2H methods are more widely used than data-independent methods in practical applications. The L2H methods include two types: unsupervised hashing and supervised hashing [4,16]. Unsupervised hashing does not use supervised information for hash function learning; its purpose is to preserve the metric structure of the training data. Typical unsupervised approaches include iterative quantization (ITQ) [3], isotropic hashing (IsoHash) [4], discrete graph hashing (DGH) [15], and scalable graph hashing (SGH) [17]. Unsupervised hashing often fails to achieve satisfactory retrieval performance in practical applications.
On the contrary, supervised hashing tries to learn the hash function by utilizing supervised information. The purpose of supervised hashing is to map the data points in the original space to the Hamming space under the guidance of supervised information, which improves the learning of the hash function and thus yields better hash codes. Because it learns from supervised information, supervised hashing achieves higher precision than unsupervised hashing, so more and more researchers have studied it in depth [5,12,18]. Typical supervised hashing methods include supervised hashing with kernels (KSH) [5], fast supervised hashing (FastH) [8], supervised discrete hashing (SDH) [9], latent factor hashing (LFH) [19], adaptive binary quantization (ABQ) [20], hash bit selection [21], and structure-sensitive hashing (SSH) [10]. ABQ [20] jointly pursues a set of prototypes in the original space and a subset of binary codes in the Hamming space. The prototypes and the codes are correspondingly associated and together define the hash function for small hash codes. Hash bit selection [21] presents two related selection methods via dynamic programming and quadratic programming, incorporating bit reliability and complementarity. SSH [10] simultaneously captures the two types of structures among data in an alternating way. It learns discriminative hash functions that quantize data into the cluster prototypes associated with unique binary codes.
However, most traditional supervised hashing methods cannot extract features very well. In recent years, researchers have proposed deep learning hashing methods, which can effectively extract image features to identify similar images, and their performance is better than that of traditional hashing methods. Among representative deep hashing methods, convolutional neural network-based hashing (CNNH) [18] adopts a two-stage strategy: learning binary hash codes in the first stage and learning a deep-network-based hash function to fit the codes in the second stage. DNNH [12] improved CNNH with a simultaneous feature learning and hash coding pipeline such that deep representations and hash codes are optimized by the triplet loss. The deep hashing network (DHN) [14] improves DNNH by jointly preserving the pairwise semantic similarity and controlling the quantization error, simultaneously optimizing the pairwise cross-entropy loss and the quantization loss via a multitask approach. Other typical deep hashing methods include deep pairwise supervised hashing (DPSH) [13] and deep supervised hashing (DSH) [22].
Recently, some new deep hashing methods have emerged, such as cross-modal hashing and hashing based on generative adversarial networks. Among them, progressive generative hashing (PGH) [23] learns a discriminative hashing network in an unsupervised way, exploiting the power of hash-conditioned GANs and progressive learning. Triplet-based deep hashing (TDH) [24] is used for cross-modal retrieval, and triplet labels are exploited as supervised information to capture the relative semantic correlation between heterogeneous data from different modalities. UCH [25] is a cross-modal retrieval method, where the outer-cycle network is used to learn a powerful common representation and the inner-cycle network is used to generate reliable hash codes. Deep learning hashing methods can significantly outperform nondeep supervised hashing in many applications, so we focus on deep hashing.

The Framework of JLTDH
Problem Definition.
In our proposed method JLTDH, the input of the convolutional neural network is a set of triplets. We denote the image triplet set as $T = \{(t^a_i, t^p_i, t^n_i)\}$, where $t^a_i$ is the anchor image, the positive image $t^p_i$ is defined by $s_{ap} = 1$ ($t^p_i$ and $t^a_i$ are similar and belong to the same category), and the negative image $t^n_i$ is defined by $s_{an} = 0$ ($t^n_i$ and $t^a_i$ are dissimilar and belong to different categories). The distance between $t^a_i$ and $t^p_i$ is smaller than the distance between $t^a_i$ and $t^n_i$. Our goal is to learn the hash codes $B = \{b_i\}_{i=1}^{n} \in \{-1, 1\}^{L \times n}$ for the images X; the similarity between two hash codes is calculated using the Hamming distance. The hash codes B should satisfy all the triplet labels T in the Hamming space as much as possible; that is, for each triplet label, $\mathrm{dis}_H(b_{t^a_i}, b_{t^p_i})$ should be smaller than $\mathrm{dis}_H(b_{t^a_i}, b_{t^n_i})$, where $\mathrm{dis}_H(\cdot, \cdot)$ represents the Hamming distance between two hash codes.

Framework.
We introduce the proposed framework of JLTDH; this is an end-to-end hash learning framework based on the convolutional neural network.
As shown in Figure 1, we propose a triplet deep hashing method with joint supervised loss, a deep learning framework capable of both automatic feature learning and hash coding that joins triplet deep learning and linear classification quantization. This end-to-end approach is supervised by triplet labels and contains three main components: the image feature learning part, the hash code learning part, and the joint loss function part. It integrates these three components into the same end-to-end framework.
We generally generate triplets based on the category information of the samples: the anchor image and the positive image are selected from the same category, and the negative image is selected from a different category. However, as the dataset grows, the number of all possible triplets becomes very large; using all triplets is computationally difficult and not optimal, and it also does not help training and leads to slow convergence. The existing triplet hashing methods do not solve the problem of triplet selection very well. Therefore, the mining and selection of triplets is an urgent problem to be solved. We adopt a novel triplet selection method, which will be discussed in detail in Section 4.

Image Feature Learning Part.
In this part, we use three CNNs with shared weights to extract the appropriate feature representation for binary hash code learning. We use the AlexNet [26] network architecture for this part; VGG [27] and ResNet [28] can also be used. Each CNN contains five convolutional layers and three fully connected layers. The last layer of AlexNet is replaced with an FC (fully connected) layer, and the output of this last layer is projected to a hash code. The configuration of AlexNet is shown in Table 1.

Hash Code Learning Part.
This part generates the hash code of the image from the image features of the previous part. The FC layer uses the sign function as its activation function, and the binary code is obtained by applying the sign function. The length of the hash code is determined by the number of neurons in the last FC layer.
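As a small illustration of this step, the following Python sketch binarizes a real-valued last-layer output with the sign function and compares two codes by Hamming distance. It also checks the standard identity for codes in {-1, +1}: the Hamming distance equals half of the code length minus the inner product. The function names, the example vectors, and the convention sgn(0) = +1 are assumptions made for the sketch and are not taken from the paper.

```python
def sign(x):
    # Assumed convention: sgn(0) = +1, so every bit lies in {-1, +1}.
    return 1 if x >= 0 else -1


def binarize(u):
    """Turn a real-valued network output into a binary hash code."""
    return [sign(x) for x in u]


def hamming(b1, b2):
    """Number of positions at which two equal-length codes differ."""
    return sum(1 for x, y in zip(b1, b2) if x != y)


def hamming_from_inner(b1, b2):
    """Hamming distance computed as (L - <b1, b2>) / 2 for +/-1 codes."""
    inner = sum(x * y for x, y in zip(b1, b2))
    return (len(b1) - inner) // 2


# Hypothetical 4-bit outputs of the last FC layer for two images.
u_a = [0.7, -1.2, 0.1, -0.4]
u_b = [0.5, 0.9, -0.3, -0.8]
b_a, b_b = binarize(u_a), binarize(u_b)
print(b_a, b_b, hamming(b_a, b_b), hamming_from_inner(b_a, b_b))
```

Both distance functions agree on any pair of codes of equal length, which is why a larger inner product corresponds to a smaller Hamming distance.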

Joint Loss Function Part.
The joint loss function combines two kinds of supervised loss functions, the triplet negative log-likelihood loss and the linear classification loss, and is designed to further optimize the hash code so that the hash code and the classification information maintain the semantic relationship between points. The joint loss function is the weighted combination of the triplet label likelihood loss and the linear classification loss.

Triplet Selection Method
In large datasets like NUS-WIDE and MS-COCO, the number of all possible triplets is very large. Thus, using all triplets is computationally difficult and not optimal.

Computational Intelligence and Neuroscience
Specifically, it is not helpful for training and leads to slow convergence. For example, the MS-COCO training set used in this paper contains 10,000 images, so the number of all possible triplets is approximately $(1.0 \times 10^{4})^{3} = 1.0 \times 10^{12}$. This number is far too large to process in practice.
Inspired by the study in [29], we adopt a novel triplet selection method to reduce the computational cost. We randomly split the training data into several groups $\{G_i\}_{i=1}^{m}$ and select triplets only within each group. $G^a_i$, $G^p_i$, and $G^n_i$ represent the anchor points, positive points, and negative points in the i-th group, respectively. $G^p_i = \{p \in G_i : p \neq a,\ s_{pa} = 1\}$ is the group of positive samples, consisting of the samples similar to the anchor point $a \in G^a_i$ in the i-th group. We randomly choose negative samples from the group of negative samples $G^n_i = \{n \in G_i : \alpha - \mathrm{dist}(a, n) + \mathrm{dist}(a, p) > 0,\ s_{an} = 0\}$. We found that negative points far away from the anchor point were not helpful for training, so we excluded these negative points.
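The selection rule above can be sketched in a few lines of Python. This is only an illustration under several assumptions that the text does not fix: dist is taken to be the squared Euclidean distance between feature vectors, two samples are treated as similar when their single class labels match, one negative is drawn at random for each anchor-positive pair, and the function and parameter names (select_triplets, num_groups, alpha) are hypothetical.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance (an assumed choice for dist)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def select_triplets(features, labels, num_groups, alpha, seed=0):
    """Return (anchor, positive, negative) index triplets chosen within random groups."""
    rng = random.Random(seed)
    indices = list(range(len(features)))
    rng.shuffle(indices)
    # Randomly split the training data into groups G_1, ..., G_m.
    groups = [indices[g::num_groups] for g in range(num_groups)]
    triplets = []
    for group in groups:
        for a in group:
            for p in group:
                if p == a or labels[p] != labels[a]:
                    continue  # p must be a different sample with s_pa = 1
                d_ap = sq_dist(features[a], features[p])
                # Keep only negatives that violate the margin:
                # alpha - dist(a, n) + dist(a, p) > 0 and s_an = 0.
                candidates = [
                    n for n in group
                    if labels[n] != labels[a]
                    and alpha - sq_dist(features[a], features[n]) + d_ap > 0
                ]
                if candidates:
                    triplets.append((a, p, rng.choice(candidates)))
    return triplets


# Toy data: three classes, the third one far away from the other two.
features = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0),
            (5.0, 5.0), (0.0, 0.2), (1.0, 1.2), (5.1, 5.0)]
labels = [0, 0, 1, 1, 2, 0, 1, 2]
print(len(select_triplets(features, labels, num_groups=2, alpha=3.0)))
```

Restricting the search to groups and discarding far-away negatives is what keeps the number of selected triplets well below the cubic number of all possible triplets.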
Using the proposed triplet selection method, we find that the number of selected triplets is much smaller than the number of possible triplets in our datasets. The specific results are shown in Table 2. The triplet selection code was run on an Intel Xeon CPU E5-2687W @ 3.0 GHz with 12 cores and 32 GB of RAM. The time cost of the triplet selection method on the three datasets is low and acceptable.

Joint Loss Function
The joint loss function combines two kinds of supervised loss functions: the triplet negative log-likelihood loss and the linear classification loss. We introduce them in the following.

Triplet Negative Log-Likelihood Loss.
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. It is often used to calculate the similarity between two hash codes: the smaller the Hamming distance, the more similar the two codes. For two binary codes $b_i, b_j \in \{-1, 1\}^{L}$, the Hamming distance and the inner product are related by

$$\mathrm{dis}_H(b_i, b_j) = \frac{1}{2}\left(L - b_i^{T} b_j\right),$$

where L is the length of the hash code. From the above equation, it can be concluded that the larger the inner product of the two hash codes, the smaller the Hamming distance and the more similar they are. Let $\Omega_{ij} = \frac{1}{2} b_i^{T} b_j$ represent half of the inner product between two binary codes. DPSH [13] is a method for simultaneously learning feature representation and hash codes using the pairwise label likelihood function as the loss function, and the likelihood function is formulated as

$$p(s_{ij} \mid B) = \begin{cases} \sigma(\Omega_{ij}), & s_{ij} = 1, \\ 1 - \sigma(\Omega_{ij}), & s_{ij} = 0, \end{cases}$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. Two hash codes with a large inner product should have high similarity. The supervised information used by DPSH [13] is the pairwise label. Inspired by DPSH, this paper proposes a deep hashing method that uses a triplet label likelihood function. A set of triplet labels $T = \{(t^a_i, t^p_i, t^n_i)\}$ is given. Assuming that the triplets are conditionally independent of each other, by Bayes' theorem with some prior p(B), the posterior probability of B can be computed as follows:

$$p(B \mid T) \propto p(T \mid B)\, p(B).$$

We can define the triplet label likelihood function p(T | B) as follows:

$$p(T \mid B) = \prod_{i} p\left((t^a_i, t^p_i, t^n_i) \mid B\right).$$

We can learn the optimal B through maximum a posteriori estimation.
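In order to solve $p((t^a_i, t^p_i, t^n_i) \mid B)$, we consider the difference $\Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i}$. From equation (4), we can conclude that the smaller the distance between the anchor point and the positive point, and the greater the distance between the anchor point and the negative point, the larger the value of $\Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i}$. According to equation (4), we can make the following definition:

$$p\left((t^a_i, t^p_i, t^n_i) \mid B\right) = \sigma\left(\Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i} - \beta\right), \tag{5}$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and β is a superparameter whose role is to adjust the gap between $\Omega_{t^a_i t^p_i}$ and $\Omega_{t^a_i t^n_i}$. The smaller the distance between the anchor point and the positive point, and the greater the distance between the anchor point and the negative point, the larger the value of the triplet likelihood function, and vice versa. This is consistent with the intention of our objective function.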
By maximizing the triplet likelihood p(T | B), we can enforce the Hamming distance between the anchor image and the positive image to be smaller than that between the anchor image and the negative image. Taking the negative log-likelihood of the triplet labels, the following optimization problem can be obtained:

$$\min_{B} J = -\log p(T \mid B) = -\sum_{i} \log p\left((t^a_i, t^p_i, t^n_i) \mid B\right). \tag{6}$$

The above optimization problem minimizes the Hamming distance between the anchor point and the positive point and simultaneously maximizes the Hamming distance between the anchor point and the negative point. This exactly matches the goal of supervised hashing with the triplet label.
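To make the negative log-likelihood concrete, the Python sketch below evaluates it for binary codes, assuming the sigmoid form of the triplet likelihood, that is, p = σ(Ω_ap − Ω_an − β) with Ω_ij equal to half the inner product of the two codes. The helper names and the toy codes are hypothetical, and the numerically stable form of −log σ(x) is an implementation choice rather than something stated in the paper.

```python
import math


def half_inner(x, y):
    """Omega_ij: half of the inner product between two codes."""
    return 0.5 * sum(a * b for a, b in zip(x, y))


def neg_log_sigmoid(x):
    """-log(sigmoid(x)) = log(1 + exp(-x)), computed without overflow."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def triplet_nll(codes, triplets, beta):
    """Sum of -log p((a, p, n) | B) over all triplets of code indices."""
    loss = 0.0
    for a, p, n in triplets:
        gap = half_inner(codes[a], codes[p]) - half_inner(codes[a], codes[n]) - beta
        loss += neg_log_sigmoid(gap)
    return loss


# Toy 4-bit codes: index 0 is the anchor, 1 is close to it, 2 is its opposite.
codes = [[1, 1, -1, -1], [1, 1, -1, 1], [-1, -1, 1, 1]]
beta = 2.0  # half of the code length, as an example setting
good = triplet_nll(codes, [(0, 1, 2)], beta)   # positive near, negative far
bad = triplet_nll(codes, [(0, 2, 1)], beta)    # roles swapped
print(good, bad)
```

The loss is small when the anchor is closer to the positive than to the negative and grows when the two roles are swapped, which is the behaviour the optimization problem is meant to enforce.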
Substituting formula (5) into formula (6), the following formula can be obtained:

$$J = -\sum_{i} \left( \Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i} - \beta - \log\left(1 + e^{\Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i} - \beta}\right) \right). \tag{7}$$

A common problem with deep hashing methods is how to train the neural network so that its output is a binary code [11,20,25,26]. Minimizing the loss of equation (7) is a very difficult discrete optimization problem [30]. The usual method is to relax $b_i$ from discrete to continuous. In the last layer of the neural network, we use the sign function as the activation function to obtain the binary code. However, the sign function is nondifferentiable at zero and its gradient is zero everywhere else, so the loss cannot be backpropagated through it. A good method is to relax the binary codes and add a quantization error term to the objective function during training. This method is also utilized in [30,25].
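This method is described below. We use $u_i$ to denote the continuous output of the last layer before the sign function for image $x_i$, so that $b_i$ can be obtained by $b_i = \mathrm{sgn}(u_i)$. We relax $b_i$ to the continuous vector $u_i$, and we redefine $\Omega_{ij}$ as follows:

$$\Omega_{ij} = \frac{1}{2} u_i^{T} u_j. \tag{8}$$

Then, we approximate equation (7) as

$$J = -\sum_{i} \left( \Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i} - \beta - \log\left(1 + e^{\Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i} - \beta}\right) \right) + \eta \sum_{i=1}^{N} \left\| b_i - u_i \right\|_2^2, \tag{9}$$

where $\sum_{i=1}^{N} \| b_i - u_i \|_2^2$ represents the quantization error term and η is the superparameter used to balance the original objective function and the quantization error.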
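The relaxed objective can be sketched in Python as the triplet negative log-likelihood evaluated on the continuous outputs plus η times the quantization error. The sketch assumes the sigmoid triplet likelihood with margin β and Ω_ij equal to half the inner product of the continuous outputs; the function names and example values are hypothetical, and no gradient computation is shown.

```python
import math


def sgn(x):
    # Assumed convention: sgn(0) = +1.
    return 1.0 if x >= 0 else -1.0


def neg_log_sigmoid(x):
    """-log(sigmoid(x)), computed without overflow."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def relaxed_loss(outputs, triplets, beta, eta):
    """Triplet likelihood term on continuous outputs u_i plus eta * sum ||sgn(u_i) - u_i||^2."""
    def omega(i, j):
        return 0.5 * sum(a * b for a, b in zip(outputs[i], outputs[j]))

    likelihood_term = sum(
        neg_log_sigmoid(omega(a, p) - omega(a, n) - beta)
        for a, p, n in triplets
    )
    quantization = sum((sgn(x) - x) ** 2 for u in outputs for x in u)
    return likelihood_term + eta * quantization


# Hypothetical continuous outputs for three images (anchor, positive, negative).
outputs = [[0.9, -0.8], [0.7, -0.6], [-0.5, 0.4]]
print(relaxed_loss(outputs, [(0, 1, 2)], beta=1.0, eta=50.0))
```

When every output is already exactly ±1 the quantization term vanishes, so the relaxed loss coincides with the discrete one; the further the outputs drift from ±1, the more the term weighted by η penalizes them.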

To integrate the above image feature representation part and the loss function part into a deep neural network framework, we set

$$u_i = W^{T} \Theta(x_i; \phi) + \upsilon, \tag{10}$$

where ϕ stands for all the parameters of the neural network except for the last layer, $\Theta(x_i; \phi)$ represents the output of the neural network, W represents a weight matrix, and υ is a bias term.

Joint Learning with Linear Classification Quantization Loss.
Triplet label information is used to learn hash codes by equation (9), but the label information is not fully utilized. To fully utilize the label information, we jointly learn a linear classifier to further optimize the hash code. Inspired by the study in [31], we use the following classifier, which represents the relationship between the learned hash codes B and the label information Y:

$$Y = W^{T} B, \tag{11}$$

where $W = \{w_i\}_{i=1}^{c} \in \mathbb{R}^{L \times c}$ is the classifier weight and $Y = \{y_i\}_{i=1}^{N} \in \mathbb{R}^{c \times N}$ is the ground-truth label matrix, in which c is the number of categories in the dataset. We usually choose the l2 loss for the linear classifier, and we define the loss function as follows:

$$l_{\mathrm{linear}} = \left\| Y - W^{T} B \right\|_F^2 + \mu \left\| W \right\|_F^2, \tag{12}$$

where $l_{\mathrm{linear}}$ is the linear classifier loss function, μ is the regularization parameter, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. Equations (9) and (12) are combined by weight parameters, and the following formula can be obtained:

$$J = -\sum_{i} \left( \Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i} - \beta - \log\left(1 + e^{\Omega_{t^a_i t^p_i} - \Omega_{t^a_i t^n_i} - \beta}\right) \right) + \eta \sum_{i=1}^{N} \left\| b_i - u_i \right\|_2^2 + \lambda\, l_{\mathrm{linear}}, \tag{13}$$

where $\|\cdot\|_2$ is the l2 norm of a vector and λ is the trade-off parameter used to balance the triplet likelihood term and the linear classification loss.

Optimization.
Equation (13) is a joint loss optimization problem, which is still nonconvex and difficult to solve. Here, we adopt the discrete cyclic coordinate descent (DCC) optimization method. Equation (13) can be decomposed into two suboptimization problems, and the linear classification loss can be solved iteratively by alternating minimization. For equation (12), the linear classification loss can be rewritten as

$$\min_{B, W} \left\| Y - W^{T} B \right\|_F^2 + \mu \left\| W \right\|_F^2, \quad \text{s.t. } B \in \{-1, 1\}^{L \times N}. \tag{14}$$

By fixing B, using the matrix trace function tr(), the following formula is obtained:

$$J(W) = \mathrm{tr}\left( (Y - W^{T} B)^{T} (Y - W^{T} B) \right) + \mu\, \mathrm{tr}\left( W^{T} W \right). \tag{15}$$

Taking the derivative with respect to W and setting dJ(W)/dW = 0, we get the minimizer W:

$$W = \left( B B^{T} + \mu I \right)^{-1} B Y^{T}. \tag{16}$$

Once W is solved, we treat W as a constant matrix. By fixing W, equation (14) becomes

$$\min_{B} \left\| Y \right\|_F^2 - 2\, \mathrm{tr}\left( Y^{T} W^{T} B \right) + \left\| W^{T} B \right\|_F^2, \quad \text{s.t. } B \in \{-1, 1\}^{L \times N}. \tag{17}$$

We can get a closed-form solution for one row of B by fixing the other rows. We use the discrete cyclic coordinate descent method to iteratively solve B row by row. Let $z^{T}$ be the k-th row of B, k = 1, 2, ..., K (K is the length of the hash code), and $B'$ the matrix B excluding $z^{T}$. Let $v^{T}$ be the k-th row of W and $W'$ the matrix W excluding $v^{T}$. The third term in equation (17) can be written as follows:

$$\left\| W^{T} B \right\|_F^2 = \left\| v z^{T} + W'^{T} B' \right\|_F^2 = \mathrm{const} + 2\, v^{T} W'^{T} B' z. \tag{18}$$

Similarly, let Q = WY, and let $q^{T}$ be the k-th row of Q and $Q'$ the matrix Q excluding $q^{T}$. The second term in equation (17) can be written as follows:

$$\mathrm{tr}\left( Y^{T} W^{T} B \right) = \mathrm{tr}\left( B^{T} Q \right) = \mathrm{const} + q^{T} z. \tag{19}$$

Putting equations (17), (18), and (19) together, the subproblem for z is

$$\min_{z} \left( v^{T} W'^{T} B' - q^{T} \right) z, \quad \text{s.t. } z \in \{-1, 1\}^{N}, \tag{20}$$

which has the optimal solution

$$z = \mathrm{sgn}\left( q - B'^{T} W' v \right). \tag{21}$$

According to equation (21), each bit z is calculated from the other K − 1 bits $B'$ that have already been learned. We iteratively update each bit until the procedure converges to a better set of hash codes B. The proposed method JLTDH is briefly summarized in Algorithm 1.
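The row-by-row update can be illustrated with a short pure-Python sketch. It assumes that the subproblem being solved is the minimization of ||Y − WᵀB||²_F over B ∈ {−1, +1}^{L×N} with W held fixed, and that each row z of B is updated as z = sgn(q − B′ᵀW′v), where q is the corresponding row of Q = WY. Only this linear classification term is covered; the function names, the tie rule sgn(0) = +1, and the toy matrices are assumptions of the sketch.

```python
def sgn(x):
    # Assumed tie rule: sgn(0) = +1.
    return 1 if x >= 0 else -1


def objective(W, B, Y):
    """||Y - W^T B||_F^2 for W (L x c), B (L x N), Y (c x N) given as nested lists."""
    L, c, N = len(W), len(Y), len(Y[0])
    total = 0.0
    for j in range(c):
        for i in range(N):
            pred = sum(W[k][j] * B[k][i] for k in range(L))
            total += (Y[j][i] - pred) ** 2
    return total


def dcc_update(W, B, Y, sweeps=3):
    """Update the rows of B one at a time with z = sgn(q - B'^T W' v)."""
    L, c, N = len(W), len(Y), len(Y[0])
    Q = [[sum(W[k][j] * Y[j][i] for j in range(c)) for i in range(N)]
         for k in range(L)]
    B = [row[:] for row in B]  # work on a copy
    for _ in range(sweeps):
        for k in range(L):
            v = W[k]
            # (W' v)_m for every other row m of W.
            wv = {m: sum(W[m][j] * v[j] for j in range(c))
                  for m in range(L) if m != k}
            for i in range(N):
                s = sum(B[m][i] * wv[m] for m in wv)  # (B'^T W' v)_i
                B[k][i] = sgn(Q[k][i] - s)
    return B


# Toy problem: 3-bit codes, 2 classes, 4 samples with one-hot labels.
W = [[1.0, -0.5], [0.2, 0.8], [-0.6, 0.3]]
Y = [[1, 0, 1, 0], [0, 1, 0, 1]]
B0 = [[1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]]
B1 = dcc_update(W, B0, Y)
print(objective(W, B0, Y), objective(W, B1, Y))
```

Because each row update exactly minimizes the objective over that row while the other rows are fixed, the objective never increases from one update to the next.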

Experiment and Analysis
In this section, we describe our experiments and results. Three commonly used datasets are used to verify the effectiveness of our algorithm. We calculated the precision and mean average precision (MAP) of the retrieval results and showed the performance of our method on CIFAR-10, NUS-WIDE, and MS-COCO. Specifically, given a query $x_q$, we can calculate its average precision (AP) using the following equation:

$$\mathrm{AP}(x_q) = \frac{1}{R_k} \sum_{k} P(k)\, \Delta r(k),$$

where $R_k$ is the number of relevant samples, P(k) is the precision at cutoff k in the returned sample list, and Δr(k) is an indicator function which equals 1 if the k-th returned sample is a ground-truth neighbor of $x_q$ and 0 otherwise. Given Q queries, MAP is the mean of the APs of all queries; we can compute the MAP as follows:

$$\mathrm{MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(x_q).$$
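As a worked example of these two metrics, the Python sketch below computes AP from a ranked list of 0/1 relevance flags and MAP as the mean of the APs over queries. It assumes that the number of relevant samples is counted within the returned list and that a query with no relevant result contributes an AP of 0; both conventions, and the function names, are assumptions of the sketch.

```python
def average_precision(relevance):
    """AP for one query; relevance[k-1] is 1 if the k-th returned sample is relevant."""
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # precision at cutoff k
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """MAP: the mean of the per-query APs."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


# Two hypothetical queries with four returned samples each.
ranked = [[1, 0, 1, 1], [0, 1, 0, 0]]
print(average_precision(ranked[0]), mean_average_precision(ranked))
```

For the first query the relevant samples sit at ranks 1, 3, and 4, so its AP is the average of the precisions 1, 2/3, and 3/4.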

Experimental Settings.
Our server configuration is as follows: the CPU is an Intel Xeon CPU E5-2687W @ 3.0 GHz with 12 cores, the GPU is an NVIDIA GTX 1080 8 GB, and the RAM is 32 GB. The operating system is Ubuntu 16.04 Linux, and the deep learning framework is PyTorch [32]. We use three widely used benchmark datasets of different scales: CIFAR-10 [33], NUS-WIDE [34], and MS-COCO [35]. The CIFAR-10 dataset contains 60,000 images (50,000 training images and 10,000 test images) belonging to 10 categories. The size of each image is 32 × 32 pixels. We randomly selected 5,000 images (500 for each class) as the training set and 1,000 images (100 for each class) as the test query set. The NUS-WIDE dataset contains 269,648 images in 81 categories. We used the 21 most commonly used categories. We randomly selected 2,100 images (100 images per class) as the query set and 10,500 images (500 images per class) as the training set.
MS-COCO is an image dataset widely used for image recognition, segmentation, and captioning. It contains 82,783 training images and 40,504 validation images, in which each image is labeled by some of the 80 semantic concepts. We randomly selected 5,000 images as the query set and used the remaining images as the database, from which we randomly sampled 10,000 images for training.

ALGORITHM 1: The procedure of JLTDH.
Input: training images $X = \{x_i\}_{i=1}^{n}$; code length L; epochs = 150; superparameters β, η, μ, and λ; and minibatch = 128
Initialization: initialize neural network parameters ϕ, W, and υ with the AlexNet model and iteration number t
Generate triplet training set:
Randomly sample a minibatch of points from T, and for each sampled point $x_i$:
  (iv) Compute derivatives for point $x_i$;
  (v) Update the parameters W, υ, and ϕ by utilizing BP;
Discrete cyclic coordinate descent (DCC) optimization:
  (i) Compute W according to (16);
  (ii) Iteratively update B bit by bit using the DCC method according to (21) in the minibatch;
End for
Update $b_i = \mathrm{sgn}(u_i)$ in the minibatch according to the joint loss function in (13);
End for
Output: $u_i = W^{T} \Theta(x_i; \phi) + \upsilon$, with the parameters W, υ, and ϕ

Comparison to Other Deep Methods and Nondeep Methods.
As shown in Table 5 and Figure 2, the MAP is calculated based on the top 5,000 returned neighbors. The results of NINH, CNNH, KSH, and ITQ were obtained from the studies in [11,18]. The results of the other methods were obtained from the study in [42]. Our method performs better than the existing hashing methods on these three datasets; compared with nondeep learning methods, the improvement is significant. Our method further improves performance by 2-6% compared to the current best deep learning methods. At the same time, we found that our method performs better with shorter hash codes (≤32 bits). Most deep hashing methods have significant performance advantages over nondeep methods.
MAP results for different numbers of bits on the MS-COCO dataset are shown in Table 6; except for our method, the other results were obtained from the study in [29]. In order to be consistent with the settings in [29], we set the hash code length to 8 bits, 16 bits, 24 bits, and 32 bits. The images of MS-COCO are more complex, which makes classification more difficult and may lead to inaccurate feature extraction and classification results. As can be seen from Table 5, the MAP on the MS-COCO dataset decreased to a certain extent compared with the MAP on the NUS-WIDE dataset. In spite of this, our method is still much better than the compared methods.
Our method can achieve excellent performance under shorter hash codes. At the same time, we also discussed the performance change under the long hash code. Since the comparison method does not provide the results of long hash codes, we only discuss our own method here. As can be seen from Figure 3, CIFAR-10 gets a good MAP value at 32 bits, while NUS-WIDE gets a good MAP value between 48 and 64 bits. With the growth of hash code length, the MAP of CIFAR-10 is decreasing, while the MAP of NUS-WIDE is not decreasing, but there is no obvious increase.

Comparison to Nondeep Hashing Methods Using Deep Features.
One might say that the performance improvements come from the neural network, not our approach. In order to further verify our method, we compared it with traditional hashing methods that use the CNN-F network [43] to extract deep features. As shown in Figure 4, our method is significantly superior to the traditional methods. It should be

Comparison to Hashing Methods with Triplet Labels.
This paper adopts triplet labels; therefore, we focus on the comparison with other deep hashing methods that use the triplet label. These methods include DSRH [37], DSCH [38], DRSCH [38], and NINH [12]. The results of DSRH, DSCH, and DRSCH were taken from the study in [38]. As shown in Figure 5, compared with previous deep hashing methods based on the triplet label, our method is clearly better and leads the compared methods by a wide margin.

Sensitivity to Hyperparameters.
The most important hyperparameters for JLTDH are λ and η. λ is the trade-off parameter used to balance the triplet likelihood term and the linear classification loss. η is used to balance the likelihood term and the quantization error. We explore the influence of these two hyperparameters.
We report the MAP values for different λ in the range [0.1, 200] on two datasets with code lengths of 12 bits and 32 bits. We find that JLTDH is not sensitive to λ when 10 < λ < 50. According to Figure 6(a), CIFAR-10 obtains the best MAP performance when λ = 10. As shown in Figure 6(b), NUS-WIDE achieves better MAP performance when 10 < λ < 50.
Furthermore, we present the influence of η in Figures 7(a) and 7(b). As can be seen, there is a wide range of η in which our method performs well. When the hash code length is 12 bits, MAP is better on both datasets with η = 50, and when the hash code length is 32 bits, MAP is better on both datasets with 50 ≤ η ≤ 100. Other superparameter settings are obtained through cross-validation: μ is set to 0.1, and β is half the length of the hash code.

Analysis of Three Loss Functions.
By observing the convergence of the loss function, we can judge whether the selected loss function is reasonable or not. Figure 8 shows the change of three kinds of losses for different lengths of hash codes during training. We only take CIFAR-10 as an example; the results for the other two datasets are similar. It is reasonable to conclude that the joint loss combining the triplet likelihood loss and the linear classification quantization loss is successfully optimized during training. Figures 8(a) and 8(b) are similar: they show that the joint loss and the triplet likelihood loss converge rapidly and stay at a low level for different bits, that our method is reasonable and effective, and that the optimization process only needs a few iterations.
In order to further reveal the respective influence of the two parts of the joint loss function, we show the loss curves of the triplet likelihood loss and the linear classification loss in Figures 8(b) and 8(c), respectively. The magnitude of the loss value in Figure 8(c) is about $10^{-1}$ of that in Figure 8(b), because the triplet likelihood loss is used to optimize the first stage and plays a major role in training, while the linear classification loss is used to optimize the second stage, which is a fine-tuning step built on the first stage.

Ablation Study about Loss Function.
In order to confirm the contribution of the different losses to the final performance, we selected a variant of JLTDH for comparison: JLTDH-T, whose loss function contains only the triplet loss and no linear classification loss. As can be seen from Figure 9, on the CIFAR-10 and NUS-WIDE datasets, JLTDH-T can achieve good MAP performance with only the triplet loss. With the further optimization provided by the linear classification loss, JLTDH achieves better MAP performance by using the joint loss. JLTDH is about 2% ahead of JLTDH-T on average.

Visualization of Query Results.
We show the top 12 returned images to give a better understanding of the performance improvement of the proposed method. Figure 10 illustrates the top 12 returned images of the proposed method for three query images on the three datasets CIFAR-10, NUS-WIDE, and MS-COCO, with a hash code length of 32. It shows that the proposed method can preserve the features of an image in its hash code. For the query image in CIFAR-10, only one of the returned results is wrong, and the wrong image appears only at the bottom of the ranked list. Similarly, for the query image in NUS-WIDE, only 1 of the top 12 images is incorrect, and the top 12 images have a similar shape or similar color to the query image. For the query image in MS-COCO, ten of the top 12 images are correct; compared with the previous two datasets, the query accuracy is slightly reduced. A possible reason is that MS-COCO is a multi-object dataset. This shows that our method can provide the desired search results.

Conclusion
In this work, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). To fully utilize the supervised triplet information, this paper proposes a joint loss function combining two kinds of supervised loss functions: the triplet negative log-likelihood loss function and the linear classification loss function. At the same time, in order to overcome the cubic increase in the number of triplets and make triplet training more effective, we designed a triplet selection method. The whole process is divided into two stages. First, the last layer of the network outputs a preliminary hash code. Second, relying on the joint loss function and the backpropagation algorithm, the parameters of the neural network are further updated so that the generated hash code has higher query precision. We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.

Data Availability
Data are owned by a third party: the experimental part of this paper uses three widely used public datasets (CIFAR-10, MS-

Conflicts of Interest
The authors declare that they have no conflicts of interest.