Quadruplet-Based Deep Cross-Modal Hashing

Recently, benefitting from the storage and retrieval efficiency of hashing and the powerful discriminative feature extraction capability of deep neural networks, deep cross-modal hashing retrieval has drawn increasing attention. To preserve the semantic similarities of cross-modal instances during the hash mapping procedure, most existing deep cross-modal hashing methods learn their hashing networks with a pairwise loss or a triplet loss. However, these losses may not fully explore the similarity relations across modalities. To solve this problem, in this paper we introduce a quadruplet loss into deep cross-modal hashing and propose a quadruplet-based deep cross-modal hashing (termed QDCMH) method. Extensive experiments on two benchmark cross-modal retrieval datasets show that our proposed method achieves state-of-the-art performance and demonstrate the effectiveness of the quadruplet loss in cross-modal hashing.


Introduction
With the advent of the era of big data, massive multimedia data, such as images, videos, and texts, are surging on the Internet. These data usually exist in diversified modalities; for example, a textual description and an audio clip may describe the same video or image. As data from different modalities may have compact semantic relevance, cross-modal retrieval [1,2] has been proposed to retrieve semantically similar data from one modality when the query data come from a distinct modality. Benefitting from its high efficiency and low storage cost, hashing-based cross-modal retrieval (cross-modal hashing) [3][4][5][6] has drawn extensive attention. The goal of cross-modal hashing is to map heterogeneous modal data into a common binary space and to ensure that semantically similar/dissimilar cross-modal data have similar/dissimilar hash codes. Cross-modal hashing methods can usually achieve superior performance; nonetheless, most existing cross-modal hashing methods (such as cross-modal similarity sensitive hashing (CMSSH) [7], semantic correlation maximization (SCM) [8], semantics-preserving hashing (SePH) [9], and generalized semantic preserving hashing (GSPH) [10]) are based on handcrafted features, which cannot effectively capture the heterogeneous relevance between different modalities and thus may result in inferior performance.
In the last decade, deep convolutional neural networks [11,12] have been successfully utilized in many computer vision tasks, and therefore some researchers also deploy them in cross-modal hashing, such as deep cross-modal hashing (DCMH) [13], pairwise relationship guided deep hashing (PRDH) [14], self-supervised adversarial hashing (SSAH) [15], and triplet-based deep hashing (TDH) [16]. Cross-modal hashing methods with deep neural networks integrate hash representation learning and hash function learning into an end-to-end framework, which captures heterogeneous cross-modal relevance more effectively and thus achieves better cross-modal retrieval performance.
To date, most deep cross-modal hashing methods utilize a pairwise loss (such as [13][14][15]) or a triplet loss (such as [16]) to preserve semantic relevance during the hash representation learning procedure. Nevertheless, pairwise loss- and triplet loss-based hashing methods suffer from a weak generalization capacity from the training set to the testing set [17,18], as shown in Figure 1(a). In contrast, the quadruplet loss has been proposed and utilized in image hashing retrieval [17] and person reidentification [18], and these works prove that quadruplet loss-based models can enhance the generalization capability. Therefore, combining cross-modal hashing with the quadruplet loss is a natural solution to enhance cross-modal hashing performance, as shown in Figure 1(b).

To this end, in this paper, we introduce the quadruplet loss into cross-modal hashing and propose a quadruplet-based deep cross-modal hashing method (QDCMH). Specifically, QDCMH first defines a quadruplet-based cross-modal semantic preserving module. Afterwards, QDCMH integrates this module, hash representation learning, and hash code generation into an end-to-end framework. Finally, experiments on two benchmark cross-modal retrieval datasets are conducted to validate the performance of the proposed method. The main contributions of our proposed QDCMH include the following: (i) We introduce the quadruplet loss into cross-modal retrieval and propose a novel deep cross-modal hashing method. To the best of our knowledge, this is the first work to introduce the quadruplet loss into cross-modal hashing retrieval. (ii) We conduct extensive experiments on benchmark cross-modal retrieval datasets to investigate the performance of our proposed QDCMH.

The remainder of this paper is organized as follows. Section 2 elaborates our proposed quadruplet-based deep cross-modal hashing method. Section 3 presents the learning algorithm of QDCMH. Section 4 reports the experimental results and the corresponding analysis. Section 5 concludes our work.

Proposed Method
In this section, we elaborate our proposed quadruplet-based deep cross-modal hashing (QDCMH) method in the following subsections: notations, the quadruplet-based cross-modal semantic preserving module, hash representation learning and hash code learning, and the feature extraction networks. Figure 2 presents the flowchart of our proposed QDCMH, which incorporates the quadruplet-based cross-modal semantic preserving module, hash representation learning, and hash code generation into an end-to-end framework. In our proposed QDCMH method, we assume that each instance has two modalities, i.e., an image modality and a text modality, but the method can be easily extended to more modalities.

Figure 1: (a) Triplet loss-based cross-modal hashing methods suffer from a weak generalization capacity from the training set to the testing set because test instances belonging to a certain category cannot be mapped into compact binary codes (see the lower-right corner). (b) Quadruplet loss-based cross-modal hashing methods can project the test instances of that category into a compact binary space (see the lower-right corner).

Notations.
Assume that the training data comprise n image-text pairs, i.e., the original image features V ∈ R^{n×d_v} and the original text features T ∈ R^{n×d_t}, where d_v and d_t are the original dimensions of the image features and the text features, respectively. Besides, a label vector is associated with each image-text pair, and the label vectors of all training instances constitute a label matrix L ∈ R^{n×d_l}, where d_l is the total number of class categories. If the image-text pair (V_i, T_i) belongs to the jth category, then L_{ij} = 1; otherwise, L_{ij} = 0. The quadruplet (V_q, T_p, T_{n1}, T_{n2}) denotes that V_q is a query instance from the image modality and T_p, T_{n1}, T_{n2} are three retrieval instances from the text modality, where V_q and T_p share at least one common category, while (V_q, T_{n1}), (V_q, T_{n2}), and (T_{n1}, T_{n2}) are three pairs whose two instances have no common label.
Given the quadruplet (V_q, T_p, T_{n1}, T_{n2}), the target of our proposed QDCMH is to learn the corresponding hash codes (B_{V_q}, B_{T_p}, B_{T_{n1}}, B_{T_{n2}}), where B_{V_q}, B_{T_p}, B_{T_{n1}}, and B_{T_{n2}} are the hash codes of instances V_q, T_p, T_{n1}, and T_{n2}, respectively. To learn these hash codes, we first learn the hash representations (F_{V_q}, G_{T_p}, G_{T_{n1}}, G_{T_{n2}}) from the quadruplet with deep neural networks, where F_{V_q} = f(V_q; θ_V) and G_{T_p} = g(T_p; θ_T) are the hash representations of instances V_q and T_p, respectively; f(·; θ_V) and g(·; θ_T) are the hash representation learning functions for the image modality and the text modality, respectively; and θ_V and θ_T are the parameters of the deep feature extraction networks for the image modality and the text modality, respectively. Secondly, we can utilize the sign function to approximately map the hash representations into the corresponding hash codes, i.e., B_{V_q} = sign(F_{V_q}) and B_{T_p} = sign(G_{T_p}). In the same way, we can learn the hash codes of the quadruplet (T_q, V_p, V_{n1}, V_{n2}). For convenience, we denote the hash codes of all training image-text pairs, the hash representations of all training image instances, and the hash representations of all training text instances as B ∈ {−1, 1}^{n×k}, F ∈ R^{n×k}, and G ∈ R^{n×k}, respectively, where k is the length of the hash codes. The sign function used above is defined elementwise as

sign(x) = 1 if x ≥ 0, and sign(x) = −1 otherwise. (1)
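To make this code generation step concrete, the following minimal PyTorch sketch (PyTorch is the framework used in our experiments; the tensor shapes and the helper name are illustrative, not the exact implementation) maps real-valued hash representations to binary codes with the sign function of equation (1):

```python
import torch

def to_hash_codes(rep: torch.Tensor) -> torch.Tensor:
    """Map real-valued hash representations in R^{n x k} to binary
    codes in {-1, +1}^{n x k} via the sign function of equation (1)."""
    codes = torch.sign(rep)
    codes[codes == 0] = 1.0  # treat sign(0) as +1 so codes stay binary
    return codes

# Illustrative usage: 64-bit codes for a batch of 8 instances.
F_batch = torch.randn(8, 64)      # hash representations f(V; theta_V)
B_batch = to_hash_codes(F_batch)  # B = sign(F)
```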

Quadruplet-Based Cross-Modal Semantic Preserving
Module. In cross-modal hashing retrieval, given an image instance V_i and a text instance T_j, it is difficult to preserve their semantic relevance during the hash code learning procedure because of the huge semantic gap across modalities. To address this, DCMH [13] defines a pairwise loss to map similar/dissimilar image-text pairs into similar/dissimilar hash codes, and TDH [16] utilizes a triplet loss to learn similar hash codes for similar cross-modal instances and distinct hash codes for semantically irrelevant cross-modal instances. Both the pairwise loss and the triplet loss can preserve the relevance of the original instance space; however, pairwise loss- and triplet loss-based hashing methods often suffer from a weak generalization capability from the training set to the testing set [17,18]. To solve this problem, in this section, a quadruplet-based cross-modal semantic preserving module is proposed to boost the generalization capability and better preserve the semantic relevance for cross-modal hashing. For a quadruplet (V_q, T_p, T_{n1}, T_{n2}), we should keep the semantic relevance unchanged during hash representation learning, i.e., F_{V_q} should be similar to G_{T_p}, F_{V_q} should be distinct from G_{T_{n1}} and G_{T_{n2}}, and G_{T_{n1}} should be dissimilar from G_{T_{n2}}. Thus, we can define the following quadruplet loss for cross-modal hashing:

J_{I→T} = max(0, α_1 + ||F_{V_q} − G_{T_p}||² − ||F_{V_q} − G_{T_{n1}}||²) + max(0, α_2 + ||F_{V_q} − G_{T_p}||² − ||G_{T_{n1}} − G_{T_{n2}}||²), (2)

where V_q is a query instance from the image modality; T_p, T_{n1}, and T_{n2} are three retrieval instances from the text modality; V_q and T_p are semantically similar; and (V_q, T_{n1}), (V_q, T_{n2}), and (T_{n1}, T_{n2}) are three pairs whose two instances have distinct semantics. Equation (2) requires that the distance between the hash representations of a similar cross-modal pair be smaller, by a positive margin (α_1 or α_2), than the distance between those of a dissimilar pair (both intermodal and intramodal). This ensures that similar cross-modal instances have similar hash representations while dissimilar instances have distinct hash representations. With this quadruplet loss, the cross-modal semantic relevance can be preserved during the hash representation learning stage.
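A minimal PyTorch sketch of this intermodal quadruplet loss follows (the squared Euclidean distance and the default margin values are illustrative assumptions, not tuned settings from the paper):

```python
import torch

def quadruplet_loss(f_vq, g_tp, g_tn1, g_tn2, alpha1=1.0, alpha2=0.5):
    """Quadruplet loss of equation (2) for a batch of I->T quadruplets.

    f_vq:  hash representations of image queries V_q        (n, k)
    g_tp:  hash representations of positive texts T_p       (n, k)
    g_tn1, g_tn2: hash representations of negative texts    (n, k)
    alpha1, alpha2: positive margins (illustrative defaults).
    """
    d_pos = (f_vq - g_tp).pow(2).sum(dim=1)          # ||F_Vq - G_Tp||^2
    d_neg_inter = (f_vq - g_tn1).pow(2).sum(dim=1)   # ||F_Vq - G_Tn1||^2
    d_neg_intra = (g_tn1 - g_tn2).pow(2).sum(dim=1)  # ||G_Tn1 - G_Tn2||^2
    # the positive pair must be closer than the intermodal negative pair...
    inter = torch.clamp(alpha1 + d_pos - d_neg_inter, min=0)
    # ...and closer than the purely intramodal negative pair
    intra = torch.clamp(alpha2 + d_pos - d_neg_intra, min=0)
    return (inter + intra).mean()
```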
Figure 2: The flowchart of our proposed QDCMH, which consists of (1) a quadruplet-based cross-modal semantic preserving module; (2) a classical convolutional neural network used to learn image-modality features and the TxtNet in SSAH [15] adopted to learn text-modality features; and (3) an intermodal quadruplet loss utilized to efficiently capture the relevant semantic information during the feature learning process, together with a quantization loss used to decrease the information loss during the hash code generation procedure. (a) Quadruplet (V_q, T_p, T_{n1}, T_{n2}), which utilizes an image instance V_q to retrieve three text instances T_p, T_{n1}, and T_{n2}; V_q and T_p have at least one common label, while (V_q, T_{n1}), (V_q, T_{n2}), and (T_{n1}, T_{n2}) are three pairs whose two instances have no common label. (b) Quadruplet (T_q, V_p, V_{n1}, V_{n2}), which utilizes a text instance T_q to retrieve three image instances V_p, V_{n1}, and V_{n2}; T_q and V_p have at least one common label, while (T_q, V_{n1}), (T_q, V_{n2}), and (V_{n1}, V_{n2}) are three pairs whose two instances have no common label.

Similarly, given a quadruplet (T_q, V_p, V_{n1}, V_{n2}), we have the following cross-modal quadruplet loss:

J_{T→I} = max(0, α_3 + ||G_{T_q} − F_{V_p}||² − ||G_{T_q} − F_{V_{n1}}||²) + max(0, α_4 + ||G_{T_q} − F_{V_p}||² − ||F_{V_{n1}} − F_{V_{n2}}||²), (3)

where T_q is a query instance from the text modality; V_p, V_{n1}, and V_{n2} are three retrieval instances from the image modality; G_{T_q}, F_{V_p}, F_{V_{n1}}, and F_{V_{n2}} are the hash representations of instances T_q, V_p, V_{n1}, and V_{n2}, respectively; and α_3 and α_4 are two positive margins. Equation (3) differs from equation (2) in that the modality of the query instance and the modality of the retrieval instances are inverted.

Hash Representation Learning and Hash Code Learning.
For each quadruplet from the training set, we can learn its hash representations while fully preserving the semantic similarity with the above quadruplet-based cross-modal semantic relevance preserving module, so we have the following hash representation learning loss:

L_rep = Σ_{i=1}^{n_{I→T}} J_{I→T}(i) + β Σ_{j=1}^{n_{T→I}} J_{T→I}(j), (4)

where J_{I→T}(i) and J_{T→I}(j) are equations (2) and (3) evaluated on the ith and jth quadruplets, n_{I→T} is the number of quadruplets for utilizing images to retrieve texts, n_{T→I} is the number of quadruplets for utilizing texts to retrieve images, and β is a hyperparameter to balance the two parts. Additionally, to learn high-quality hash codes, we generate hash codes from the learned hash representations with the sign function in equation (1), and the final hash code matrix for all training image-text pairs is generated as follows:

B = sign(F + G). (5)

As F and G are real-valued features, to decrease the information loss from F and G to B in equation (5), it is necessary to force F and G to be as close as possible to B; thus, we introduce the following quantization loss:

L_quan = ||B − F||²_F + ||B − G||²_F. (6)

Integrating the hash representation loss and the quantization loss together, the whole loss function is

L = L_rep + γ L_quan, (7)

where γ is a hyperparameter to balance the hash representation loss and the quantization loss.
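The pieces of equations (4)-(7) can be assembled as in the sketch below (quad_loss_i2t and quad_loss_t2i stand for precomputed per-quadruplet losses from equations (2) and (3); the beta and gamma defaults and the function names are illustrative assumptions):

```python
import torch

def overall_loss(quad_loss_i2t, quad_loss_t2i, F, G, B,
                 beta=1.0, gamma=0.2):
    """Whole objective of equation (7)."""
    # equation (4): weighted sum of the two directional quadruplet losses
    rep_loss = quad_loss_i2t.sum() + beta * quad_loss_t2i.sum()
    # equation (6): quantization loss pulling F and G toward binary B
    quant_loss = (B - F).pow(2).sum() + (B - G).pow(2).sum()
    return rep_loss + gamma * quant_loss

def update_codes(F, G):
    """Closed-form hash code update of equation (5): B = sign(F + G)."""
    B = torch.sign(F + G)
    B[B == 0] = 1.0  # resolve ties to +1 so codes stay in {-1, +1}
    return B
```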

Feature Extraction Networks.
In QDCMH, feature extraction involves two deep neural networks: a classical convolutional neural network is used to extract the features of images, and a multiscale fusion model is utilized to learn features from texts. Specifically, for the image modality, we deploy AlexNet [11] pretrained on the ImageNet [19] dataset, replace its last layer with a new fully connected hash layer consisting of k hidden nodes, and fine-tune the network. Therefore, the learned deep features are embedded into a k-dimensional Hamming space. For the text modality, the TxtNet in SSAH [15] is used, which comprises a three-layer feedforward neural network and a multiscale (MS) fusion model (Input → MS → 4096 → 512 → k).
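A minimal sketch of the two branches, assuming torchvision's pretrained AlexNet for the image network and a plain feedforward stack standing in for TxtNet (the MS fusion block is omitted for brevity, and the tanh output activation is a common choice assumed here, not a detail confirmed by the paper):

```python
import torch
import torch.nn as nn
from torchvision import models

class ImgNet(nn.Module):
    """AlexNet pretrained on ImageNet, with the last fully connected
    layer replaced by a k-node hash layer."""
    def __init__(self, k: int):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # keep all classifier layers except the final 4096 -> 1000 one
        self.classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        self.hash_layer = nn.Linear(4096, k)

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(1)
        return torch.tanh(self.hash_layer(self.classifier(x)))  # in (-1, 1)

class TxtNet(nn.Module):
    """Simplified text branch: Input -> 4096 -> 512 -> k (the multiscale
    fusion block of SSAH is replaced by the raw text input here)."""
    def __init__(self, input_dim: int, k: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 512), nn.ReLU(),
            nn.Linear(512, k))

    def forward(self, t):
        return torch.tanh(self.net(t))
```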

Learning Algorithm of QDCMH
For QDCMH, we utilize an alternating strategy to learn the parameters θ_V of the deep neural network for the image modality, the parameters θ_T of the deep neural network for the text modality, and the hash code matrix B for all training image-text pairs. When we learn one of θ_V, θ_T, and B, we keep the other two fixed. The specific procedure is depicted in Algorithm 1.

Update θ V with θ T and B Fixed.
When θ_T and B are kept fixed, we utilize stochastic gradient descent and backpropagation to optimize the deep neural network parameters θ_V.

Update θ T with θ V and B Fixed.
When we fix the values of θ V and B, we use stochastic gradient descent and backpropagation to learn the deep neural network parameters θ T .

Update B with θ T and θ V Fixed.
When the deep neural networks' parameters θ T and θ V are kept unchanged, the hash codes matrix B can be optimized with equation (5).
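The three alternating steps can be summarized as in the following sketch (compute_loss is assumed to implement equation (7) on the current mini-batch; the data loaders, index handling, and SGD settings are illustrative, not the exact training script):

```python
import torch

def train_qdcmh(img_net, txt_net, image_loader, text_loader,
                compute_loss, F, G, B, max_epoch=500, lr=10**-1.5):
    """Alternating optimization of theta_V, theta_T, and B."""
    opt_img = torch.optim.SGD(img_net.parameters(), lr=lr)
    opt_txt = torch.optim.SGD(txt_net.parameters(), lr=lr)
    for epoch in range(max_epoch):
        # (1) update theta_V with theta_T and B fixed
        for idx, V_batch in image_loader:
            F_batch = img_net(V_batch)               # forward propagation
            loss = compute_loss(F_batch, G, B, idx)  # equation (7)
            opt_img.zero_grad(); loss.backward(); opt_img.step()
            F[idx] = F_batch.detach()                # refresh stored reps
        # (2) update theta_T with theta_V and B fixed
        for idx, T_batch in text_loader:
            G_batch = txt_net(T_batch)
            loss = compute_loss(G_batch, F, B, idx)
            opt_txt.zero_grad(); loss.backward(); opt_txt.step()
            G[idx] = G_batch.detach()
        # (3) update B with theta_V and theta_T fixed, equation (5)
        B = torch.sign(F + G)
    return B
```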

Datasets.
To investigate the performance of QDCMH, we conduct experiments on two benchmark cross-modal retrieval datasets: MIRFLICKR-25K [20] and Microsoft COCO2014 (MS-COCO2014) [21]. Brief descriptions of the two datasets are listed in Table 1.

Evaluation Metrics.
In our experiments, we utilize mean average precision (MAP), top N-precision curves (topN curves), and precision-recall curves (PR curves) as evaluation metrics; for detailed descriptions of these evaluation metrics, refer to [22,23].
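For concreteness, MAP under Hamming ranking can be computed as in the following generic NumPy sketch (a standard multi-label MAP implementation, not the exact evaluation script of [22,23]):

```python
import numpy as np

def mean_average_precision(query_codes, retrieval_codes,
                           query_labels, retrieval_labels):
    """MAP with Hamming ranking: a retrieved item is relevant if it
    shares at least one label with the query (multi-label setting)."""
    k = query_codes.shape[1]
    aps = []
    for i in range(query_codes.shape[0]):
        # Hamming distance between {-1,+1} codes via an inner product
        dist = 0.5 * (k - query_codes[i] @ retrieval_codes.T)
        order = np.argsort(dist)                       # rank by distance
        relevant = (query_labels[i] @ retrieval_labels[order].T) > 0
        if relevant.sum() == 0:
            continue                                   # no relevant items
        ranks = np.arange(1, len(relevant) + 1)
        precision_at_hit = np.cumsum(relevant) / ranks # precision@rank
        aps.append((precision_at_hit * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```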
All the experiments are implemented with the open-source deep learning framework PyTorch and run on an NVIDIA GTX Titan Xp GPU server. In our experiments, we set n_{I→T} = n_{T→I} = 10000, max_epoch = 500, and λ = 10^{−5}, and the learning rate is initialized to 10^{−1.5} and gradually decreased to 10^{−6} over the 500 epochs. For the handcrafted feature-based baselines, each image in the two datasets is represented by a 512-dimensional bag-of-words (BoW) histogram feature vector. Throughout the experiments, we use I → T to denote using an image query to retrieve texts and T → I to denote using a text query to retrieve images.

Performance Evaluation and Discussion.
Firstly, we investigate the performance of QDCMH with different values of the hyperparameters β and γ. To this end, we conduct experiments on MIRFLICKR-25K with hash code length k = 64 and record the corresponding MAPs under different values of β and γ, as shown in Figure 3. We find that high performance is acquired when β = 1 and γ = 0.2.
Secondly, to validate the performance of QDCMH, we compare QDCMH with the baseline methods in terms of MAP on the MIRFLICKR-25K and MS-COCO2014 datasets. Table 2 presents the MAPs of each method for different hash code lengths, i.e., 16, 32, and 64 bits; DSePH represents the SePH method whose features of the original images are extracted by CNN-F. From Table 2, we can observe the following. (1) The MAPs of our proposed QDCMH are higher than the MAPs of most baseline methods in most cases, which demonstrates the superiority of QDCMH. We can also observe that SSAH outperforms our proposed QDCMH in most cases, which is partly because SSAH takes self-supervised learning and generative adversarial networks into account during the hash representation learning procedure. (2) The MAPs of QDCMH are always higher than the MAPs of TDH, which shows that the quadruplet loss can better preserve semantic relevance than the triplet loss in cross-modal hashing retrieval. (3) Table 2.

Algorithm 1: QDCMH: quadruplet-based deep cross-modal hashing.

Input: training dataset {V, T, L}; the maximal number of epochs max_epoch; mini-batch size n_batch = 128.

Output: parameters θ_V, θ_T of the deep neural networks and the corresponding hash code matrix B.

(1) Generate n_{I→T} quadruplets (V_q, T_p, T_{n1}, T_{n2}) (named Quad_{I2T}) and n_{T→I} quadruplets (T_q, V_p, V_{n1}, V_{n2}) (named Quad_{T2I}) from the training set.
(2) Initialize the deep neural network parameters θ_V and θ_T, the hash representations F of all training images, the hash representations G of all training texts, the hash code matrix B, and the batch numbers batchnum_v = batchnum_t = (n_{I→T} + n_{T→I})/n_batch.
(3) repeat
(4) for j = 1 to batchnum_v do
(5) Randomly sample n_batch images from Quad_{I2T} ∪ Quad_{T2I} to construct a mini-batch of images.
(6) For each instance V_i in the mini-batch, calculate F_{V_i} = f(V_i; θ_V) by forward propagation.
(7) Update F.
(8) Calculate the derivative of equation (7) with respect to θ_V.
(9) Update the network parameters θ_V by backpropagation.
(10) end for
(11) for j = 1 to batchnum_t do
(12) Randomly sample n_batch texts from Quad_{I2T} ∪ Quad_{T2I} to construct a mini-batch of texts.
(13) For each instance T_i in the mini-batch, calculate G_{T_i} = g(T_i; θ_T) by forward propagation.
(14) Update G.
(15) Calculate the derivative of equation (7) with respect to θ_T.
(16) Update the network parameters θ_T by backpropagation.
(17) end for
(18) Update B using equation (5).
(19) until the maximal epoch number max_epoch is reached.

Conclusions
In this paper, we introduce a quadruplet loss into deep cross-modal hashing to fully preserve the semantic relevance of the original cross-modal quadruplet instances and propose a quadruplet-based deep cross-modal hashing method (QDCMH). QDCMH integrates a quadruplet-based cross-modal semantic relevance preserving module, hash representation learning, and hash code generation into an end-to-end framework. Experiments on two benchmark cross-modal retrieval datasets demonstrate the effectiveness of our proposed QDCMH.

Data Availability
The experimental datasets and the related settings can be found at https://github.com/SWU-CS-MediaLab/MLSPH. The experimental codes used to support the findings of this study will be deposited in the GitHub repository after the publication of this paper or can be provided by xitaozou@sanxiau.edu.cn.