Discrete Two-Step Cross-Modal Hashing through the Exploitation of Pairwise Relations

Cross-modal hashing maps heterogeneous multimodal data into compact binary codes that preserve semantic similarity, which can significantly enhance the convenience of cross-modal retrieval. However, the currently available supervised cross-modal hashing methods generally only factorize the label matrix and do not fully exploit the supervised information. Furthermore, these methods often use only a one-directional mapping, which results in an unstable hash learning process. To address these problems, we propose a new supervised cross-modal hash learning method called Discrete Two-step Cross-modal Hashing (DTCH) through the exploitation of pairwise relations. Specifically, this method fully exploits the pairwise similarity relations contained in the supervision information: for the label matrix, the hash learning process is stabilized by combining matrix factorization and label regression; for the pairwise similarity matrix, a semirelaxed and semidiscrete strategy is adopted to reduce the cumulative quantization errors while improving retrieval efficiency and accuracy. The approach further combines an exploration of fine-grained features in the objective function with a novel out-of-sample extension strategy to implicitly preserve the consistency between the different modal distributions of samples and the pairwise similarity relations. The superiority of our method was verified through extensive experiments on two widely used datasets.


Introduction
With the development of Internet technology in recent years, large quantities of multimodal data obtained from video, audio, image, text, and other sources are being disseminated rapidly across social networks. A common requirement in real scenarios is cross-modal retrieval, e.g., retrieving corresponding images or videos through text descriptions. Owing to the heterogeneity of multimodal data, cross-modal retrieval tasks must, unlike traditional retrieval tasks, bridge the semantic gap and acquire common and unified representations. At the same time, the rapidly growing mass of data has increased the time and space costs of retrieval, even as users generally expect to quickly obtain information related to their retrieval targets.
Owing to its high retrieval efficiency and low space cost, cross-modal hashing has become one of the primary methods in the field of cross-modal retrieval [1]. Cross-modal hash learning attempts to convert multimodal data into a set of short binary codes (called hash codes) in Hamming space while preserving the original sample relations and then to learn a set of mapping functions from each specific modality to the sample hash code. The binary code of the common space and the mapping from the specific modality to the common space can then be used to achieve cross-modal retrieval. As the similarities between hash codes are calculated as Hamming distances, the XOR operation can be implemented in hardware to significantly improve retrieval efficiency. Furthermore, the storage cost is reduced because the hash code length is relatively short. The existing cross-modal hashing approaches can be roughly divided into unsupervised and supervised methods. Unsupervised methods learn hash functions by exploiting the sample relations between and within modalities, which often yields low accuracy. Supervised methods attempt to exploit the supervision information contained in labels and use the common semantic information in these labels to guide the learning of the hash code, thereby improving its quality. However, most existing supervised methods do not fully exploit the supervision information [2,3]. For example, Xu et al. [4] used label matrix regression alone and ignored the similarity relations between samples. Furthermore, the one-directional regression used by supervised methods is not conducive to the full exploitation of supervision information and also causes the hash learning process to be unstable. Therefore, in this paper we propose a new supervised cross-modal hashing method that combines label matrix factorization and hash code regression to achieve bidirectional mapping.
In addition, a dual supervision approach is adopted to embed the label and pairwise similarity matrices into the hash learning process to further exploit the pairwise relations between modalities.
Another important problem in hash learning is the integer optimization problem caused by binary constraints. The introduction of the pairwise similarity matrix makes the optimization of the objective function more complicated. Currently available methods generally adopt a relaxation strategy in which the binary constraint is abandoned; instead, the real value is optimized, and a thresholding method is then used to obtain a solution. However, this approach causes cumulative quantization errors. In this work, we propose a semirelaxed, semidiscrete strategy to minimize quantization errors and improve retrieval accuracy while ensuring the smooth optimization of the objective function. Furthermore, inspired by the concept of heterogeneous modal feature fusion under the bilinear model [5], we propose a simple mapping learning scheme that fuses multimodal data to obtain more fine-grained high-order features, which are used in combination with the proposed out-of-sample extension strategy to implicitly preserve the similarity relations between samples.
In summary, the contributions of this study are as follows: (i) A new cross-modal supervised hashing framework using dual supervision information is designed to exploit the pairwise relations between samples. Furthermore, a semirelaxed and semidiscrete strategy is adopted to exploit the pairwise similarity matrix to reduce cumulative quantization errors. (ii) A new out-of-sample extension strategy with two novel optimization strategies is developed. By combining this strategy with the fine-grained features, the consistency between the different sample modal distributions and the pairwise similarity relations can be effectively preserved. (iii) Experiments were conducted on two widely used retrieval datasets to verify the effectiveness and superiority of the proposed method.
The remainder of the article is organized as follows. Section 2 briefly reviews related works. Section 3 gives the details of DTCH. Section 4 presents the experimental results and discussion, followed by the conclusion in Section 5.

Related Work
The currently available cross-modal hashing methods can be primarily divided into linear model-based [6][7][8] and deep model-based [9][10][11][12] methods. Although some recently proposed methods based on deep models have improved retrieval performance, such approaches generally exhibit high time and space complexity and poor interpretability. By contrast, linear models are more applicable to real scenarios owing to their high retrieval efficiency and strong interpretability. The existing linear cross-modal retrieval methods can be further divided into two main categories: unsupervised cross-modal hashing [13][14][15] and supervised cross-modal hashing [16][17][18][19].
The unsupervised cross-modal hashing methods primarily learn hash functions by mining sample feature information to obtain relations between and within sample modalities. For example, Sun et al. [13] extended the traditional spectral hashing approach to the multimodal field by minimizing the Hamming distances between sample pairs. Intermedia hashing [14] learns hash codes by maintaining the semantic consistencies between and within modalities. Zhou et al. [15] proposed a latent semantic sparse hashing method in which matrix factorization and sparse coding are combined to discover a common Hamming space.
In contrast to the unsupervised learning methods, the supervised cross-modal hashing methods use label information and pairwise similarity information to improve hash code quality. Zhang et al. [16] used the pairwise similarity matrices generated from labels to learn hash codes and then attempted to use these hash codes to reconstruct the matrices with the goal of maximizing the semantic correlations of the hash codes. Xu et al. [4] extended supervised discrete hashing (SDH) [20] to the cross-modal field and used the label matrix regression method to directly learn hash codes. However, SDH adopts a bitwise learning strategy to generate binary codes, making it time-consuming. Chen et al. [17] proposed a scalable cross-modal hashing method in which matrix factorization is applied to the cross-modal field. Generally speaking, the retrieval accuracies of supervised learning methods are significantly higher than those of unsupervised methods owing to the exploitation of label information. However, these methods are restricted by their weak representation ability: to obtain satisfactory accuracy, longer code lengths are often required, leading to greater storage and query costs.

Proposed Method
In this section, we introduce our method in detail in terms of its notation, binary code learning, optimization, out-of-sample extension, and time complexity. The framework is shown in Figure 1.
Notation.
Given a cross-modal dataset with N sample pairs, we use v_i ∈ R^d to represent the feature vector of the image modality in the i-th sample pair and t_i ∈ R^f to represent the feature vector of the text modality in the i-th sample pair, where d and f are the corresponding feature dimensions. Correspondingly, V ∈ R^{d×N} and T ∈ R^{f×N} represent the visual and text modal feature matrices, respectively. In addition, Y is used to represent the sample label matrix, where y_i = [y_{i1}, ..., y_{iC}] ∈ {0, 1}^C represents the label vector corresponding to the i-th sample and C represents the number of classes. If the i-th sample x_i belongs to the c-th class, then y_{ic} = 1, where c = 1, ..., C; otherwise, y_{ic} = 0. The label matrix Y is also used to construct the pairwise similarity matrix S ∈ {−1, +1}^{N×N}: if sample pairs i and j are similar, then S_{i,j} = 1; otherwise, S_{i,j} = −1.
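As a concrete illustration, the pairwise similarity matrix S described above can be built from the label matrix in a few lines. The sketch below assumes the usual multi-label convention (two samples are similar if they share at least one label); it is illustrative, not the authors' code.

```python
import numpy as np

def build_similarity(Y):
    """Build the pairwise similarity matrix S in {-1, +1}^{N x N}
    from a binary label matrix Y of shape (N, C): S[i, j] = +1 if
    samples i and j share at least one label, else -1."""
    overlap = Y @ Y.T            # (N, N) counts of shared labels
    return np.where(overlap > 0, 1, -1)

# Toy example: 3 samples, 2 classes.
Y = np.array([[1, 0],
              [1, 1],
              [0, 1]])
S = build_similarity(Y)
```

Note that S is symmetric by construction and its diagonal is +1, since every sample shares all of its labels with itself.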

Binary Code Learning.
The goal of cross-modal hashing is to map heterogeneous multimodal data onto compact binary codes while preserving the semantic similarity of the original space. Intuitively, multimodal data describe the same entity and, therefore, their high-level semantics should be consistent.
This paper proposes a new cross-modal hashing framework that fully exploits the pairwise relations of samples contained in the label and pairwise similarity matrices to learn a unified binary code.
Inspired by Multimodal Discriminative Binary Embedding (MDBE) [21], we use matrix factorization to explore the semantic information implicit in the label matrix. This process can be formalized as equation (1), where B ∈ {−1, 1}^{N×L} is the binary semantic representation learned from the label matrix, i.e., the hash code, and M is the auxiliary matrix. To avoid singular solutions, we add an L2-norm regularization term on M. In addition, by regressing the label matrix to the hash code, the label matrix can be embedded into the learning of the binary code as in equation (2), where W is the linear mapping matrix. Beyond further exploring the semantic information in the label matrix, equation (2) can be used to stabilize the hash learning process. The full exploitation of the label matrix reduces, to the greatest extent possible, the semantic gap caused by modal heterogeneity, making it more likely that the hash code expresses high-level semantics beyond any specific modality. In other words, the label information is not restricted to a specific modality, and the hash code learned from it should be a higher-level representation that can cross the semantic gap. Additional important supervision information for supervised hash learning is obtained from the pairwise similarity matrix. A common approach to constructing the pairwise similarity matrix for cross-modal supervised hashing is to reconstruct it from the sample label matrix: if two sample pairs share one or more labels, they are considered similar, and vice versa. As the inner product of the hash codes of two samples corresponds to the distance between the samples, it can be used as a measure of the similarity relation between them.
Therefore, the inner product of the hash codes is used to fit the pairwise similarity matrix, ensuring that the learned hash code maintains the similarity relations of the original space as much as possible, which is consistent with the original intention of cross-modal hash learning. This process can be modeled as equation (3), where B ∈ {−1, 1}^{N×L} is the binary semantic representation learned from the label matrix, i.e., the hash code, and L denotes the length of the hash code. Clearly, equation (3) is a nonconvex optimization problem that is difficult to solve. Many existing methods adopt a complete relaxation strategy that removes the discrete constraint on the hash code. However, this approach produces accumulated quantization errors that seriously affect the accuracy of hash retrieval. To solve this problem, we adopt a semidiscrete, semirelaxed strategy in which the real-valued information in equation (2) is used to replace the hash matrix B in equation (3). In this manner, the rich semantic information in the real values can be fully exploited without violating the discrete constraints on the hash code. This process can be formalized as equation (4). According to the bilinear model [5], high-order features obtained through the fusion of heterogeneous features can better characterize an original sample. Inspired by this idea, the proposed approach fuses data obtained from different modalities. It is worth noting that because different modal data exist in different feature spaces, there are semantic gaps between them. Therefore, a simple feature mapping must be learned prior to fusion to aid in the feature space transformation, that is, VPT^T. By combining this formulation with equation (4), this fine-grained feature can be embedded into hash learning as in equation (5), where V and T are the features of the different modalities and P is the linear projection.
Equation (5) reinforces the learning of hash codes using the fine-grained features VPT^T and improves the quality of the hash codes. At the same time, it can be applied in conjunction with the out-of-sample extension strategy so that the hash codes learned for different modal samples preserve the pairwise similarity relations as much as possible, as will be introduced in detail in Section 3.4. In summary, by combining equations (1), (2), (4), and (5), the final objective function, equation (6), can be obtained, where α, β, γ, and λ are tradeoff parameters. Equation (6) is evidently still a difficult-to-solve nonconvex optimization problem in the variables W, B, M, and P. However, solving for a single variable while fixing the other variables remains relatively straightforward. Therefore, we propose an alternating iteration strategy for optimization, with the goal of approaching the global optimum through local optimization. Each optimization step is introduced as follows:
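The combined objective, equation (6), did not survive extraction. The LaTeX sketch below is a plausible reconstruction inferred from the coefficient matrix Q given in the DCC step of Section 3.3; the exact scaling of the similarity term (the factor L) and the sign conventions are assumptions, not the authors' verbatim formulation:

```latex
\min_{B,\,M,\,W,\,P}\;
  \underbrace{\lVert Y - BM \rVert_F^2}_{\text{label factorization, eq. (1)}}
+ \alpha\,\underbrace{\lVert B - YW^{\top} \rVert_F^2}_{\text{label regression, eq. (2)}}
+ \beta\,\underbrace{\lVert LS - B\,(YW^{\top})^{\top} \rVert_F^2}_{\text{semirelaxed similarity fit, eq. (4)}}
+ \gamma\,\underbrace{\lVert VPT^{\top} - B\,(YW^{\top})^{\top} \rVert_F^2}_{\text{fine-grained fusion, eq. (5)}}
+ \lambda \lVert M \rVert_F^2,
\qquad \text{s.t.}\; B \in \{-1, 1\}^{N \times L}.
```

Differentiating the trace expansion of this objective with respect to B reproduces, up to constant factors, the matrix Q = YM^T + αYW^T + βSYW^T + γVPT^T·YW^T used in the DCC update, which is the consistency check behind this reconstruction.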

Optimization.
Step 1: first, the optimization of the mapping M is introduced. By fixing the remaining three variables, equation (6) can be simplified to equation (7). By taking the derivative of equation (7) with respect to M and setting it equal to zero, we obtain equation (8); solving this equation yields the closed-form (analytical) solution of M, equation (9).
Step 2: fix the three variables M, B, and P, and optimize the mapping W. In this case, the objective function can be simplified to equation (10).

Computational Intelligence and Neuroscience
By taking the derivative of equation (10) with respect to W and setting it equal to zero, we obtain equation (11); the closed-form solution of W is then given by equation (12).
Step 3: optimize variable B. By fixing the remaining variables, we obtain equation (13). As B carries binary constraints, equation (13) is still an integer optimization problem. Here, we introduce two approaches to optimizing B. The first optimization scheme uses discrete proximal linearized minimization (DPLM) [22]. After reconstruction and simplification, B can be solved through a simple sign function, as in equation (14), where B_j is the solution of B after the j-th iteration, μ is a hyperparameter, and ∇L(B) is given by equation (15). The second optimization scheme for B adopts the discrete cyclic coordinate descent (DCC) approach [20]. Although equation (13) is an integer optimization problem, the DCC algorithm can still solve for its discrete solution iteratively, bit by bit. As ‖B‖²_F = L·N, equation (13) can be rewritten as equation (16), where Q = YM^T + αYW^T + βSYW^T + γVPT^T·YW^T. Following the DCC algorithm, we define b as the l-th column of matrix B, l = 1, ..., L, and B′ as matrix B excluding b. Analogously, we define q as the l-th column of matrix Q, m as the l-th column of matrix M, and M′ as matrix M excluding m. Finally, we define H = WY^T, h as the l-th column of matrix H, and H′ as matrix H excluding h. Equation (16) can then be rewritten as equation (17). By taking the derivative, the analytical solution of b can be obtained as equation (18), where sgn(·) represents the sign function.
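To make the DPLM step concrete, the sketch below applies the sign-projection update B_{j+1} = sgn(B_j − ∇L(B_j)/μ) to a simple smooth surrogate ‖Y − BM‖²_F. The surrogate loss is an illustrative assumption — the paper's ∇L(B) includes all four objective terms — but the projection mechanics are the same.

```python
import numpy as np

def dplm_step(B, Y, M, mu):
    """One DPLM-style update: move against the gradient of the smooth
    surrogate L(B) = ||Y - B M||_F^2, then project back onto {-1, +1}
    with the sign function (zeros kept at their previous sign)."""
    grad = 2 * (B @ M - Y) @ M.T
    Bn = np.sign(B - grad / mu)
    return np.where(Bn == 0, B, Bn)  # keep the old bit where sign() is 0

rng = np.random.default_rng(1)
Y = np.sign(rng.standard_normal((8, 3)))   # toy label-like targets
M = rng.standard_normal((4, 3))            # toy auxiliary matrix
B = np.sign(rng.standard_normal((8, 4)))   # initial binary codes
for _ in range(10):                        # a few fixed-point iterations
    B = dplm_step(B, Y, M, mu=0.05)
```

The key property, in contrast to relax-then-threshold schemes, is that B remains exactly binary after every iteration, so no quantization error accumulates across updates.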
Step 4: fix the remaining variables and optimize the mapping P. In this case, the objective function is given by equation (19). By taking the derivative and setting it equal to zero, we obtain equation (20); the closed-form solution of P is equation (21).
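As a concrete illustration of the closed-form updates in Steps 1, 2, and 4, the sketch below solves Step 1's update for M, assuming the factorization term takes the standard form ‖Y − BM‖²_F + λ‖M‖²_F (the equation images did not survive extraction, so this form is an assumption rather than the authors' verbatim equation (7)).

```python
import numpy as np

def update_M(B, Y, lam):
    """Closed-form ridge solution of min_M ||Y - B M||_F^2 + lam ||M||_F^2,
    i.e., M = (B^T B + lam I)^{-1} B^T Y."""
    L = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(L), B.T @ Y)

rng = np.random.default_rng(0)
B = np.sign(rng.standard_normal((6, 4)))   # toy {-1, +1} codes, N=6, L=4
Y = rng.random((6, 3))                     # toy label matrix, C=3
M = update_M(B, Y, lam=1e-4)
```

Because the subproblem is an unconstrained ridge regression, the solution is exact: the gradient B^T(BM − Y) + λM vanishes at the returned M, which is why no iteration is needed for this variable.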

Out-of-Sample Extension.
In this section, we introduce the out-of-sample extension strategy. As shown in equation (6), DTCH is a two-step hashing method: after the offline training is completed, a mapping from features to hash codes must also be learned to encode query samples. As mentioned above, we propose a new out-of-sample extension strategy that, when combined with the objective function, helps ensure that the learned modality-specific out-of-sample mappings P_V and P_T preserve the similarity of the original space. Specifically, this strategy can be formalized as equation (22). The out-of-sample extension mapping for the visual modality can be expressed as equation (23); taking the derivative with respect to P_V yields its solution, equation (24). The out-of-sample extension mapping for the text modality can be expressed as equation (25); taking the derivative with respect to P_T yields its solution, equation (26).
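The closed-form expressions for P_V and P_T did not survive extraction. A regularized least-squares fit from each modality's features to the learned codes is the natural reading (the σ below matches the regularization parameter reported in the experimental settings), so the sketch below is a simplified stand-in for the paper's extension strategy, not its exact formulation.

```python
import numpy as np

def out_of_sample_map(X, B, sigma):
    """Ridge regression from modality features X (N x d) to codes B (N x L):
    P = (X^T X + sigma I)^{-1} X^T B. A query x is then encoded as
    sgn(x @ P). Simplified stand-in for the paper's extension strategy."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + sigma * np.eye(d), X.T @ B)

rng = np.random.default_rng(2)
V = rng.standard_normal((20, 5))            # toy image features (N=20, d=5)
B = np.sign(rng.standard_normal((20, 8)))   # learned binary codes (L=8)
P_V = out_of_sample_map(V, B, sigma=1e-7)
codes = np.sign(V[:2] @ P_V)                # encode two query samples
```

An analogous call on the text features T would produce P_T, so each modality gets its own mapping into the shared Hamming space.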

Time Complexity.
In the training process, we need to update the projections M, W, P, P_V, and P_T and the unified binary code matrix B. The total time complexity for learning M, W, P, P_V, and P_T is O(2f²N + 2d²N + LdN + LfN). In this study, we adopt two approaches to optimizing B: solving equations (15) and (18) requires O(NL² + LdN + LfN) and O(LdCN² + LfCN²), respectively. As N is usually much larger than C and L, the training time complexities of the proposed method with DPLM and with DCC can be simplified accordingly, where T denotes the number of iterations.

Experiments
To verify the effectiveness of the proposed method, we conducted extensive experiments using two widely used datasets. In the following sections, three aspects are introduced in detail: the datasets (including their modalities and class information), the experimental settings, and the experimental results and analyses.

Dataset.
To verify the effectiveness and superiority of the proposed method, we conducted extensive experiments using two widely used large-scale cross-modal retrieval datasets: MIR-Flickr [23] and NUS-WIDE [24].
The MIR-Flickr dataset contains 25,000 images in 24 classes, with each image forming an image-text pair with a corresponding text description. In this study, 15,902 sample pairs were selected as the training set, 836 sample pairs were selected as the test set, and the union of these sets was used as the retrieval set. Specifically, the image modality was represented by a 150-dimensional edge histogram, the text modality was represented by a 500-dimensional word vector, and the class information was represented by a 24-dimensional semantic label. The NUS-WIDE dataset contains 269,648 images and corresponding text descriptions collected from the Internet and includes 81 classes. In this study, the 10 most frequent classes and their 17,000 corresponding samples were selected for training, 994 samples were selected for testing, and 50,000 samples were selected for retrieval. Specifically, the image modality was represented by a 500-dimensional SIFT bag-of-visual-words vector [25], the text modality was represented by a 1,000-dimensional bag-of-words vector, and the class information was represented by a 10-dimensional semantic label.
For a fair comparison, the hyperparameters of all baseline methods were initialized according to the approaches used in the original papers; for all methods, including DTCH, the average performance over five runs was used for comparison. The following parameter settings were used for the proposed method: α = 2, β = 10^-7, γ = 10^-7, λ = 10^-4, μ = 0.05, and σ = 10^-7. As the proposed method is based on a linear model, deep models were not used as baselines. All experiments were conducted on a computer with an Intel Core i7-6700 3.40 GHz processor and 32 GB RAM using MATLAB R2019b.
To compare the performance of the methods, we tested each on two cross-modal retrieval tasks: (1) Img2Text, the retrieval of texts using images; and (2) Text2Img, the retrieval of images using texts. The average precision (AP) and mean average precision (mAP) were used as metrics. AP represents the average precision over the top K retrieved samples, where D is the number of correlated samples among the K retrieved samples and σ(r) indicates whether the r-th retrieved sample is correlated with the query. mAP is obtained by averaging the AP values over all query samples, where Z represents the number of query samples.
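The AP and mAP formulas themselves did not survive extraction; the sketch below implements the standard definitions that the symbols D, σ(r), and Z in the text correspond to, and is offered as an illustrative reading rather than the paper's exact evaluation code.

```python
import numpy as np

def average_precision(relevance):
    """AP over a ranked relevance list: (1/D) * sum_r P(r) * sigma(r),
    where sigma(r) is 1 if the r-th result is relevant, P(r) is the
    precision at rank r, and D is the number of relevant results."""
    relevance = np.asarray(relevance, dtype=float)
    D = relevance.sum()
    if D == 0:
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_r = np.cumsum(relevance) / ranks   # P(r) at each rank
    return float((precision_at_r * relevance).sum() / D)

def mean_average_precision(relevance_lists):
    """mAP: the mean of AP over all Z queries."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

ap = average_precision([1, 0, 1, 0])   # hits at ranks 1 and 3
```

For the example list, the relevant items sit at ranks 1 and 3, giving precisions 1 and 2/3 at the hits, so AP = (1 + 2/3)/2.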

Experimental Results and Analyses.
In this section, we provide a brief analysis of the experimental results. Ours-1 and Ours-2 denote the proposed method using DCC and DPLM, respectively, for solution optimization. It is seen from Table 1 that Ours-2 achieved the best performance at various code lengths on both datasets, indicating that the proposed method was able to reduce the semantic gap to a certain extent and improve cross-modal retrieval performance. Ours-1 also obtained convincing results; on the Text2Img task in particular, its performance was significantly improved relative to the previous methods. On the NUS-WIDE dataset, however, the performance of Ours-1 was slightly worse than that of SCRATCH, particularly at relatively short code lengths. One possible reason for this result is that Ours-1 was trapped in local optima during DCC optimization. In addition, its insufficient performance on the Img2Text task alone indicates that its characterization ability for the image modality is slightly lacking. Thus, in subsequent work it will be useful to enhance the image modal expression ability of the proposed method. Figure 2 shows a line chart plotting the average Precision@K indicator as a function of code length for each method on each task. Without loss of generality, the value of K was set to 50. It is seen from the results that Ours-1 and Ours-2 achieved optimal performance at nearly all code lengths. Furthermore, DTCH was able to achieve convincing results even when the code length was relatively short. Figure 3 shows a line chart plotting the average Precision@K indicator as a function of K for each method on each task. In this case, the code length was fixed at 32. It is seen that the proposed method outperformed all of the comparison methods, particularly at relatively small K values, and achieved significantly higher average precision than the other methods.

Conclusion
In this paper, we proposed a supervised cross-modal hashing method called DTCH. This method simultaneously embeds a label matrix and a pairwise similarity matrix into hash learning and fully exploits the pairwise relations between samples using the dual supervision approach. Specifically, to exploit the sample label matrix, the proposed method combines matrix factorization and label regression. This bidirectional mapping not only fully exploits the semantic information but also stabilizes the hash learning process. To exploit the pairwise similarity matrix, we adopt a semirelaxed, semidiscrete method to avoid the original nonconvex optimization problem; this also alleviates the significant cumulative quantization error that can arise from directly removing the binary constraint. We additionally designed a new out-of-sample extension strategy that is combined with the fine-grained features fused in the objective function. In this manner, the consistency between the different modal distributions of samples and the pairwise similarity relations is effectively preserved. Extensive experiments carried out on two datasets verified the excellent performance and efficiency of DTCH. Embedding deep learning into the DTCH framework as a nonlinear embedding technique would slow the original method; in future work, we plan to investigate how to combine them effectively and efficiently.

Data Availability
The research data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.