Abstraction and Association: Cross-Modal Retrieval Based on Consistency between Semantic Structures

Abstract. Cross-modal retrieval aims to find relevant data across different modalities, such as images and text. To bridge the modality gap, most existing methods require a large number of coupled sample pairs as training data. To reduce the demands for training data, we propose a cross-modal retrieval framework that utilizes both coupled and uncoupled samples. The framework consists of two parts: Abstraction, which aims to provide high-level single-modal representations with uncoupled samples, and Association, which links different modalities through a few coupled training samples. Moreover, under this framework, we implement a cross-modal retrieval method based on the consistency between the semantic structures of multiple modalities. First, both images and text are represented with a semantic structure-based representation, which represents each sample by its similarities to reference points generated from single-modal clustering. Then, the reference points of different modalities are aligned through an active learning strategy. Finally, the cross-modal similarity can be measured by the consistency between the semantic structures. The experimental results demonstrate that, given a proper abstraction of single-modal data, the relationship between different modalities can be simplified, and even limited coupled cross-modal training data are sufficient for satisfactory retrieval accuracy.


Introduction
Recent years have witnessed a surge of need in jointly analyzing multimodal data [1,2]. As one of the fundamental problems of many multimodal applications, cross-modal retrieval aims to find semantically similar items among objects of different modalities (such as text, images, or audio) [3]. The modality gap is the main challenge of cross-modal retrieval [4,5]. A common approach to bridging the modality gap is constructing a shared representation space in which multimodal samples can be represented uniformly [2]. However, this is not easy because it requires detailed knowledge of the content of each modality and the correspondence between them [6]. A variety of tools are used to construct the shared space, such as canonical correlation analysis (CCA) [1, 7-10], topic models [11-13], and hashing [14-18]. Among these methods, the deep neural network (DNN) has become the most popular one because of its strong learning ability [6, 19-24]. The performance of most of these methods, especially DNN-based methods, heavily depends on sufficient coupled cross-modal samples [25]. However, collecting coupled training data is labor-intensive and time-consuming.
Although it may not be explicitly stated, two types of relationships are essential considerations when constructing the shared representation space: the intermodal relation and the intramodal relation [5,10]. They play critical roles in preserving the cross-modal similarity and the single-modal similarity, respectively [5,10,25]. Also, the separate representation learning and shared representation learning stages in some existing works preserve these two relationships, respectively [26,27].
It should be noted that the information needed to maintain these two relationships is different: the correspondence between cross-modal samples is essential to preserve intermodal relations, while the similarity relation between single-modal samples is indispensable to preserve intramodal relations [10]. Most existing methods, such as [5, 25, 28-30], only use coupled cross-modal samples to preserve both intermodal and intramodal relations, while uncoupled single-modal samples are discarded.
In many cross-domain learning tasks, such as machine translation [31-33], unlabeled single-domain samples are valuable. As a typical cross-domain learning task, cross-modal retrieval should also benefit from uncoupled training samples. Besides, in contrast to coupled training samples, uncoupled ones are easier to obtain. Thus, it is necessary to introduce uncoupled training samples into the construction of the shared representation space, especially when coupled ones are insufficient.
Inspired by the discussion above, a two-stage cross-modal retrieval framework is proposed. As illustrated in Figure 1, the proposed Abs-Ass framework uses training samples in a different way. In existing methods, only coupled training samples are used to preserve intramodal and intermodal relations. In this framework, however, both coupled and uncoupled samples are used to maintain intramodal relations, and only a few coupled cross-modal sample pairs are used to maintain intermodal relations. Thus, the process of constructing the shared representation space is divided into two subprocesses: Abstraction, which preserves intramodal relations, and Association, which preserves intermodal relations.
The name Abstraction indicates that we need to consider the intramodal relation at the semantic level rather than the feature level. The name Association means that the process of preserving the intermodal relation is exactly finding the correlation between different modalities. Abstraction fully explores intramodal relations through uncoupled samples of each modality, which enables Association to recognize multimodal samples at a higher level; thus, Association can find the correlation between cross-modal samples much more easily, even though only a few coupled training samples are available. In the ideal case, high-level representations of different modalities can be associated even with a linear transformation [34].
Moreover, following the framework above, we propose a cross-modal retrieval method based on the reference-based representation and the correlation between the semantic structures of different modalities. Specifically, Abstraction is implemented by the reference-based representation [35], which represents multimodal objects through the semantic structure. The term semantic structure refers to all pairwise similarities of a set of n samples, for some similarity measure [36]. This representation scheme is modality-independent and can provide multimodal objects with a relatively isomorphic representation space. Moreover, we prove that if the reference points of different modalities are one-to-one matched, the semantic structures of different modalities are naturally correlated.
Thus, cross-modal similarity can be measured with the linear correlation between semantic structures of different modalities. In our implementation, the cross-modal relations have a fixed and straightforward form, and cross-modal sample pairs only play the role of the multimodal reference set; therefore, its performance has much lower dependence on coupled training samples.
Through this paper, we demonstrate the importance of uncoupled samples for preserving intramodal relations, as well as the correlation between semantic structures of different modalities, which together make cross-modal retrieval with limited coupled training samples possible. The main contributions can be summarized as follows:

(1) Abs-Ass cross-modal retrieval framework. We propose a two-stage framework consisting of Abstraction and Association that emphasizes the different roles of coupled and uncoupled training samples. In contrast to end-to-end learning models, the proposed framework separates the preservation of intramodal and intermodal relations into two stages and uses uncoupled single-modal samples and coupled cross-modal samples to learn them, respectively. Compared with existing methods, the Abs-Ass framework improves the efficiency with which training samples are used and has lower demands for coupled training data.

(2) Semantic structure-based cross-modal retrieval method. Following the Abs-Ass framework, we propose a cross-modal retrieval method by introducing the reference-based representation to represent multimodal data at the semantic level and by proving the positive correlation between the semantic structures of different modalities. Although some existing works also try to find the cross-modal correlation from the semantic view [1,3], the correlation between semantic structures exists naturally and has a fixed and straightforward pattern. Therefore, even a few coupled training samples are enough to align the semantic structures of different modalities. Besides, the proposed method is unsupervised because the reference-based representation scheme does not need class labels.

The remainder of this paper is organized as follows. Section 2 introduces the related works of the cross-modal retrieval task. Section 3 introduces the proposed implementation of the Abs-Ass framework. Section 4 tests the proposed method through experiments on public data sets.

CCA-Based Methods.
To the best of our knowledge, the first well-known cross-modal correlating model may be the CCA-based model proposed by Hardoon et al. [7]. It learns a linear projection to maximize the correlation between the representations of different modalities in the projected space. Inspired by this work, many CCA-based models have been designed for cross-modal analysis [1, 8-10, 37]. Rasiwasia et al. [1] utilized CCA to learn two maximally correlated subspaces, within which multiclass logistic regression was performed to produce the semantic spaces. Mroueh et al. [9] proposed a truncated-SVD-based algorithm to efficiently compute the full regularization path of CCA for multimodal retrieval. Wang et al. [10] developed a new hypergraph-based canonical correlation analysis (HCCA) to project low-level features into a shared space where intrapair and interpair correlations are maintained simultaneously. Liang et al. [37] incorporated group correspondence and CCA into cross-modal retrieval.

Topic Model Methods.
The topic model is also helpful for the uniform representation of multimodal data, assuming that objects of different modalities share some latent topics. Latent Dirichlet allocation- (LDA-) based methods establish the shared space through the joint distribution of multimodal data and the conditional relation between them [11,12]. Roller and Walde [12] integrated visual features into LDA and presented a multimodal LDA model to learn joint representations for text and visual data. Wang et al. [13] proposed the multimodal mutual topic reinforce model (M³R) to discover mutually consistent topics.

Hashing-Based Methods.
With the rapid growth of data volume, the cost of finding nearest neighbors cannot be dismissed. Hashing is a scalable method for finding approximate nearest neighbors [14]. It projects data into a Hamming space, where the neighbor search can be performed efficiently. To improve the efficiency of finding similar multimodal objects, many cross-modal hashing methods have been proposed [14-18, 38, 39]. Kumar and Udupa [15] proposed a cross-view hashing method to generate hash codes that minimize the Hamming distance between similar objects and maximize that between dissimilar ones. Yi et al. [16] used a coregularization framework to generate binary codes such that the hash codes from different modalities are consistent. Ou et al. [17] constructed a Hamming space for each modality and built the mapping between them with logistic regression. Wu et al. [18] proposed a sparse multimodal hashing method for cross-modal retrieval. Song et al. [38] proposed Self-Supervised Video Hashing (SSVH), which outperforms the state-of-the-art methods on unsupervised video retrieval. Ye and Peng [39] proposed Multiscale Correlation Sequential Cross-modal Hashing Learning (MCSCH) to utilize multiscale features of cross-modal data. Liu et al. [40] proposed Matrix Tri-Factorization Hashing (MTFH), which discards the unified Hamming space to obtain higher representation scalability.

Deep Learning Methods.
Due to the strong learning ability of the deep neural network, many deep models have been proposed for cross-modal learning, such as [6, 19-24, 26, 27, 41, 42]. Ngiam et al. [19] presented an autoencoder model to learn joint representations for speech audio and videos of lip movements. Srivastava and Salakhutdinov [20] employed the restricted Boltzmann machine to learn a shared space between data of different modalities. Frome et al. [22] proposed a deep visual-semantic embedding (DeViSE) model to identify visual objects using information from labeled images and unannotated text. Andrew et al. [21] introduced deep canonical correlation analysis to learn a nonlinear mapping between two views of data such that corresponding objects are linearly related in the representation space. Jiang et al. [23] proposed a real-time Internet cross-media retrieval method, in which deep learning was employed for feature extraction and distance detection. Due to the powerful representational ability of convolutional neural network visual features, Wei et al. [24] employed them together with a deep semantic matching method for cross-modal retrieval. Peng et al. [26,27] proposed two-stage frameworks to learn the separate representation and the shared representation, implemented by cross-media multiple deep networks (CMDN) and cross-modal correlation learning (CCL), respectively. Song et al. [43] proposed multimodal stochastic RNNs (MS-RNN) for the video captioning task, which addressed a critical deficiency of existing methods based on the encoder-decoder framework. Recently, the attention mechanism has played an important role in maintaining the intermodal and intramodal relations. Qi et al. [41] proposed a visual-language relation attention model to explore the intermodal and intramodal relations between fine-grained patches, as well as cross-media multilevel alignment to boost precise cross-media correlation learning. Gao et al. [42] proposed hierarchical LSTMs with adaptive attention for visual captioning.
Although these methods have achieved great success in multimodal learning, most of them need a large amount of training data to learn the complex correlation between objects from different modalities. To reduce the demand for training data, some methods have been proposed from different perspectives. Gao et al. [25] proposed an active similarity learning model for cross-modal data; nevertheless, without extra information, the improvement is limited. Chowdhury et al. [44] introduced additional web information into cross-modal retrieval.

Proposed Approach
The cross-modal retrieval task can be formalized as follows. The multimodal data set D(X, Y) consists of X = {x_1, ..., x_n} ∈ R^{n×d} and Y = {y_1, ..., y_m} ∈ R^{m×e}. Given a query set Q of either modality, the goal of cross-modal retrieval is to calculate the similarity between each query and a set T of all targets of the other modality and to retrieve similar samples by ranking all target samples according to this similarity [30]. We assume the availability of a small training set Tr = {(x_tr, y_tr) | x_tr ≈ y_tr}, where x_tr ≈ y_tr means that x_tr and y_tr are similar. Because this work focuses on unsupervised and few-coupled cross-modal retrieval, class labels are not available for either modality, and the size of the training set is much smaller than the whole data set.

The process of the proposed method can be described as equations (1)-(5). First, visual and text features are extracted with the tools in Section 3.1. Second, the to-be-matched objects (the nonred points in Figure 2) of X and Y are represented with the distributed representation in Section 3.2. As illustrated in Section 3.3, R_X and R_Y have been well abstracted and are highly isomorphic in semantics and form. Therefore, in Section 3.4, the representation spaces of the different modalities can be easily aligned by the coupled training samples, and the similarity between cross-modal samples can be measured with general similarity metrics.
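As a toy illustration of this formalization, the following sketch sets up X, Y, and a small coupled training set Tr; all sizes and data here are hypothetical placeholders, not the paper's actual features.

```python
# Hypothetical setup for the formalization above: X holds n image feature
# vectors, Y holds m text feature vectors, and only a small coupled
# training set Tr of similar pairs (x_tr, y_tr) is assumed available.
import numpy as np

rng = np.random.default_rng(0)
n, m, d, e = 100, 100, 64, 50    # assumed sizes; d, e are feature dimensions
X = rng.standard_normal((n, d))  # image features, X in R^{n x d}
Y = rng.standard_normal((m, e))  # text features,  Y in R^{m x e}

# A few coupled pairs; here rows with the same index play the role of
# similar cross-modal samples (x_tr ~ y_tr).
k = 10                           # far smaller than n and m
Tr = [(X[i], Y[i]) for i in range(k)]

assert len(Tr) == k and Tr[0][0].shape == (d,) and Tr[0][1].shape == (e,)
```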

Feature Extraction for Text and Images.
In early research on cross-modal learning, the limited effectiveness of low-level feature extraction was one of the main factors restricting retrieval accuracy. The application of CNN visual features has significantly improved the accuracy of cross-modal retrieval [4,24]. In contrast, some works still take BoW (bag-of-words) as the default tool to extract text features [5,29], which is not effective enough to model intramodal relations in the text modality. The consistency of the semantic structure is beneficial to transfer learning tasks, including cross-modal retrieval [45]; thus, we take a pretrained CNN model and sentence embedding with pretrained word vectors for feature extraction of images and text, respectively.

Pretrained Convolutional Neural Network for Feature Extraction of Images.
CNN has demonstrated outstanding performance in various computer vision tasks, such as image classification and object detection. Wei et al. proposed utilizing a pretrained CNN for visual feature extraction in cross-modal retrieval [24], which performs much better than low-level features. Because we aim to reduce the dependency on training data, we directly take the pretrained VGG19 [46] (not fine-tuned) to extract image features, namely, the mapping in equation (1).

Sentence Embedding for Feature Extraction of Text.
The advancement of NLP techniques provides powerful tools for text feature extraction. Given enough supervised information, a good end-to-end model can automatically extract the most important features; however, with limited training data, it is hard to train such a model. Instead, we take a pretrained word embedding and an unsupervised text embedding method for text feature extraction, which is the mapping in equation (2).
Many text embedding methods for general NLP tasks can be helpful; among them, smooth inverse frequency (SIF) is a simple but powerful sentence embedding method [47]. With pretrained word vectors (such as GloVe [48]), SIF provides a completely unsupervised method to embed sentences into the semantic space, which can be summarized as equations (6) and (7). Given a sentence s, each word w in s is represented as its word vector v_w; then, the sentence s is represented as the weighted average of all word vectors:

v_s = (1/|s|) Σ_{w∈s} (a / (a + p(w))) v_w, (6)

where p(w) is the probability that the word w is emitted in the sentence s. The computation of the parameter a is involved and can be found in [47]. Let X be a matrix whose columns are [v_s : s ∈ S] and u be the first singular vector of X, which can be computed by singular value decomposition (SVD); the final sentence embedding vector is obtained by

v_s ← v_s − u u^T v_s. (7)
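The two SIF steps above can be sketched as follows; the toy vocabulary, random word vectors, and a = 1e-3 are assumptions for illustration, standing in for pretrained GloVe vectors and corpus word probabilities.

```python
# Minimal SIF sketch: weighted averaging of word vectors (eq. (6)),
# then removal of the projection on the first singular vector (eq. (7)).
import numpy as np
from collections import Counter

def sif_embed(sentences, word_vec, word_prob, a=1e-3):
    # v_s = (1/|s|) * sum_{w in s} a/(a + p(w)) * v_w
    V = np.stack([
        np.mean([a / (a + word_prob[w]) * word_vec[w] for w in s], axis=0)
        for s in sentences
    ])
    # u = first singular vector of the matrix whose columns are the v_s.
    u = np.linalg.svd(V.T, full_matrices=False)[0][:, 0]
    return V - V @ np.outer(u, u)   # v_s <- v_s - u u^T v_s

# Toy corpus: estimate p(w) from word frequencies.
sents = [["a", "red", "car"], ["a", "blue", "car"], ["red", "sky"]]
counts = Counter(w for s in sents for w in s)
total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}
rng = np.random.default_rng(0)
vec = {w: rng.standard_normal(8) for w in counts}

emb = sif_embed(sents, vec, p)
assert emb.shape == (3, 8)
```

After the removal step, every sentence vector is orthogonal to u, which is what discards the dominant shared component.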

Semantic Structure-Based Representation for Single-Modality Data.
Although the extraction tools above provide more accurate features for image and text data, it is still hard to perform retrieval tasks directly on them, especially when only limited coupled training samples are available. Therefore, the feature-level representations X and Y need further abstraction, i.e., the representation learning in (3) and (4). Besides, we mainly consider unsupervised representation learning because training samples are not always labeled in real-world applications. To this end, we introduce an unsupervised representation scheme for image and text data that can preserve intramodal relations.
In the unsupervised setting, the semantic structure-based representation (also named space structure-based representation, SSR) is a simple but effective way to preserve intramodal relations, as some unsupervised learning methods have shown [49,50]. Given a set of samples X, in the semantic structure-based representation, each sample x_i ∈ X is represented as the similarity vector

x_i^r = (x_{i1}^r, ..., x_{in}^r), (8)

where x_{ij}^r is the similarity between x_i and x_j. In this paper, it is computed with the cosine similarity:

x_{ij}^r = (x_i · x_j) / (‖x_i‖ ‖x_j‖). (9)

The reason for choosing the cosine similarity lies in two aspects: on the one hand, the cosine similarity is normalized, and its value range is always [−1, 1], which makes it easier to measure the consistency between semantic structures in Section 3.3; on the other hand, both images and text are high-dimensional data, for which the cosine similarity performs well in both accuracy and efficiency.
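A small sketch of the full semantic structure-based representation in equation (8): each sample becomes its vector of cosine similarities to all samples. The data here are random placeholders.

```python
# Semantic structure-based representation (SSR): row i is the vector of
# cosine similarities between sample x_i and every sample in X.
import numpy as np

def ssr(X):
    # Row-normalize; the Gram matrix then holds all pairwise cosines.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T      # entry (i, j) = cos(x_i, x_j), in [-1, 1]

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
R = ssr(X)
assert R.shape == (6, 6)
assert np.allclose(np.diag(R), 1.0)          # self-similarity is 1
assert np.all(R <= 1 + 1e-9) and np.all(R >= -1 - 1e-9)
```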
In equation (8), the dimensionality is very high for large data sets, which leads to high computational complexity in both the representation and the follow-up task. Given a data set X ∈ R^{n×d}, the representing complexity is O(n²) [35]. Besides, not all samples are useful for the representation, and some of them may undermine the representing ability. Zheng et al. [35] argued that it is better to take some representative samples as the reference points rather than all of them and proposed a lower-dimensional SSR, the reference-based representation. As illustrated in Figure 3, six purple points are represented by their similarities (the dotted lines) to three reference points (the orange ones). With this representation scheme, an object x_i is represented as the distribution over some reference points:

x_i^r = (x_{i1}^{r′}, ..., x_{ik}^{r′}), (10)

where x_{ij}^{r′} is the cosine similarity between x_i and the reference point x_j′. The reference set X′ is a subset of X, which is selected by a clustering-based strategy in [35]. As Figure 4 shows, X is divided into groups through clustering, and the center of each cluster is selected as a reference point. The clustering method should generate cluster centers in the form of samples because the reference points are real samples of the data set. However, many popular clustering methods, such as k-means, can only generate cluster centers in the form of prototypes. Therefore, we choose a simple and effective clustering method that generates centers in the form of samples, namely, the k-medoids [51] method. Also, the cluster number is a significant consideration, which is directly related to the representation ability and the cost. Zheng et al. [35] provided two ways of deciding the cluster number: one by the canopy method [52], which can automatically give the number of clusters; the other is user-specified, where users can balance the performance and the cost themselves.
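The clustering-based reference selection and the representation in equation (10) can be sketched as follows. This uses a plain alternating k-medoids heuristic on cosine distance (an assumption; it is simpler than the full PAM algorithm in [51]) so that the chosen centers are real samples, as required above.

```python
# Reference selection by a simple k-medoids heuristic, then the
# reference-based representation: similarities to the k medoids only.
import numpy as np

def k_medoids(X, k, iters=20, seed=0):
    # Alternating k-medoids on cosine distance; returns medoid indices.
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T                      # cosine distance matrix
    med = rng.choice(len(X), k, replace=False)
    for _ in range(iters):
        assign = np.argmin(D[:, med], axis=1)
        new = med.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if len(members):                 # medoid = sample minimizing
                sub = D[np.ix_(members, members)]
                new[c] = members[np.argmin(sub.sum(axis=1))]
        if np.array_equal(new, med):
            break
        med = new
    return med

def ref_repr(X, ref_idx):
    # Equation (10): each sample as its cosine similarities to references.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn[ref_idx].T

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 8))
refs = k_medoids(X, 3)
R_X = ref_repr(X, refs)
assert R_X.shape == (30, 3)                  # k-dimensional, not n-dimensional
```

The key design point, matching the discussion above, is that each medoid is an actual row of X, so the references remain real samples that can later be matched across modalities.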
In the reference-based representations R_X and R_Y, image and text data are represented as their distributions over semantic prototypes; in this way, the correlation between them can be found more straightforwardly. In the next section, we prove that the semantic structures of different modalities are correlated, which can be used to measure the cross-modal similarity even with very limited coupled training samples.

Cross-Modal Similarity Computing Based on the Correlation between Semantic Structures.
The semantic structure-based representation provides different modalities with a homogeneous representation scheme. Moreover, if the reference sets X′ and Y′ are one-to-one matched, the corresponding dimensions of R_X and R_Y have the same meaning: the similarity to a semantic prototype. In this way, the similarity between cross-modal samples can be computed according to the correlation between the corresponding dimensions of their reference-based representations.
This section proves that, assuming R_X and R_Y share the reference points (i.e., the reference points of different modalities are one-to-one matched), the values in the corresponding dimensions of similar samples are positively correlated. Since a reference point in the reference-based representation is also a real sample, the assumption above holds if the semantic structures of different modalities are positively correlated. That is to say, if two images x_i, x_j ∈ X are similar to each other, their corresponding text descriptions y_i, y_j ∈ Y should be similar too, and vice versa.
Although this assumption seems intuitively reasonable, it is hard to prove in full generality because the cross-modal similarity relation cannot be defined uniformly at the feature level. For simplicity, we discuss the case in which similar cross-modal samples can be correlated through a linear transformation. The nonlinear case is not discussed because nonlinear mapping functions have much more complex and varied forms; thus, it is difficult to discuss the nonlinear case comprehensively in limited space. Besides, existing works [34,36] have shown that nonlinear mapping functions have no obvious advantage over linear mappings in correlating cross-modal samples.
Following existing works [1,25], we assume that similar cross-modal samples are correlated through a linear transformation:

y_i = x_i M, (11)

where M ∈ R^{d×e} is a mapping matrix. M is nonzero (not all of its elements are zero), because if M were zero, then y_i would always be zero, which is obviously unreasonable. The similarities between x_i and x_j and between y_i and y_j, denoted as s_X(i, j) and s_Y(i, j), can be measured with their inner products:

s_X(i, j) = x_i x_j^T,  s_Y(i, j) = y_i y_j^T. (12)

In this way, we have the following proposition.

Proposition 1. If the similar samples in X and Y are linearly correlated to each other as in equation (11), then s_X(i, j) and s_Y(i, j) (i, j = 1, 2, ..., n) are positively correlated.
Proof. We assume that X has already been preprocessed by whitening [53,54] and zero-centering; thus, the components x_{·φ} satisfy the i.i.d. (independent and identically distributed) and zero-mean assumptions; that is, x_{·φ} (φ = 1, ..., d) are independent random variables drawn from the same distribution p_θ with zero expectation. It should be noted that whitening and zero-centering do not affect the similarity between samples. The Pearson correlation coefficient is used to measure the correlation between s_X(i, j) and s_Y(i, j):

ρ = Cov(s_X(i, j), s_Y(i, j)) / sqrt(D(s_X(i, j)) D(s_Y(i, j))). (13)

Step 1. Proving that the denominator of equation (13) is positive. Because neither s_X(i, j) nor s_Y(i, j) is constant, their variances are greater than zero. Thus, the denominator of equation (13) is greater than zero:

sqrt(D(s_X(i, j)) D(s_Y(i, j))) > 0. (14)

Step 2. Factorizing the numerator of equation (13) by diagonalizing MM^T. The covariance between s_X(i, j) and s_Y(i, j) is

Cov(s_X(i, j), s_Y(i, j)) = Cov(x_i x_j^T, x_i MM^T x_j^T). (15)

Because MM^T is a real symmetric matrix, it can be diagonalized as

MM^T = P Λ P^T, (16)

where P = [p_1^T, ..., p_c^T, ..., p_d^T] ∈ R^{d×d}, p_c^T is the c-th eigenvector of MM^T, and Λ is a diagonal matrix whose diagonal elements are the eigenvalues of MM^T. Thus, we have

x_i MM^T x_j^T = Σ_{c=1}^{d} λ_c (x_i p_c)(x_j p_c), (17)

where λ_c is the c-th eigenvalue of MM^T. Substituting equation (17) into equation (15),

Cov(s_X(i, j), s_Y(i, j)) = Σ_{c=1}^{d} λ_c Cov(s_X(i, j), (x_i p_c)(x_j p_c)), (18)

where

Cov(s_X(i, j), (x_i p_c)(x_j p_c)) = Σ_φ Σ_μ Σ_ν p_{μc} p_{νc} Cov(x_{iφ} x_{jφ}, x_{iμ} x_{jν}). (19)

Step 3. Computing the covariance through a case-by-case discussion of equation (19). Because x_{iφ} (φ = 1, 2, ..., d) are independent of each other and drawn from the same distribution p_θ, we have the following conclusions.
If μ ≠ φ and ν ≠ φ, then x_{iφ}, x_{jφ}, x_{iμ}, and x_{jν} are independent of each other; hence, the covariance in equation (19) equals zero:

Cov(x_{iφ} x_{jφ}, x_{iμ} x_{jν}) = 0.

The same holds when exactly one of μ and ν equals φ, because all the variables are zero-centered. If μ = φ and ν = φ, the covariance in equation (19) is

Cov(x_{iφ} x_{jφ}, x_{iφ} x_{jφ}) = D(x_{iφ} x_{jφ}),

where D(x_{iφ} x_{jφ}) is larger than zero because x_{iφ} x_{jφ} is not a constant. Combining these cases, equation (18) reduces to

Cov(s_X(i, j), s_Y(i, j)) = Σ_c λ_c Σ_φ p_{φc}² D(x_{iφ} x_{jφ}).
Because MM^T is a positive semidefinite matrix, all λ_c are greater than or equal to zero:

λ_c ≥ 0, c = 1, ..., d. (24)

The sum of the eigenvalues equals the sum of the diagonal elements of MM^T, which is larger than zero because M is a nonzero matrix:

Σ_c λ_c = Σ_c m_cc > 0, (25)

where m_cc refers to the c-th diagonal element of MM^T. Equations (24) and (25) show that there exists at least one λ_c that is greater than zero:

∃ c: λ_c > 0. (26)

Step 4. Proving that the covariance is positive. Because the eigenvector p_c is nonzero,

Σ_φ p_{φc}² > 0. (27)

Finally, from equations (26) and (27), the covariance between s_X(i, j) and s_Y(i, j) is greater than zero:

Cov(s_X(i, j), s_Y(i, j)) > 0. (28)

Step 5. Proving that the Pearson coefficient is positive. From equations (14) and (28), the Pearson correlation coefficient is larger than zero:

ρ > 0. (29)

In conclusion, for any x_i, x_j ∈ X and y_i, y_j ∈ Y, if x_i ≈ y_i and x_j ≈ y_j, then s_X(i, j) is positively correlated with s_Y(i, j). □
In the proposition and its proof, apart from being nonzero, we place no requirement on the mapping matrix M. However, some properties of M may lead to stronger conclusions. For example, a low correlation between the columns of M is beneficial for a high correlation between s_X(i, j) and s_Y(i, j). In the most extreme case, if M is an orthogonal matrix, we have s_X(i, j) = s_Y(i, j).
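The proposition above can also be checked empirically (this check is our illustration, not part of the paper): draw i.i.d. zero-mean samples x_i, map them through a random linear M as in equation (11), and verify that the inner products s_X(i, j) and s_Y(i, j) are positively correlated over all pairs.

```python
# Empirical check of Proposition 1 on random data.
import numpy as np

rng = np.random.default_rng(0)
d, e, n = 16, 12, 200
X = rng.standard_normal((n, d))          # i.i.d., zero-mean samples
M = rng.standard_normal((d, e))          # nonzero mapping matrix
Y = X @ M                                # y_i = x_i M  (equation (11))

iu = np.triu_indices(n, k=1)             # all pairs i < j
sX = (X @ X.T)[iu]                       # s_X(i, j) = x_i x_j^T
sY = (Y @ Y.T)[iu]                       # s_Y(i, j) = y_i y_j^T
r = np.corrcoef(sX, sY)[0, 1]            # Pearson correlation coefficient
assert r > 0                             # positive, as the proposition claims
```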
From Proposition 1, it can be inferred that s_X(i, j) and s_Y(i, j) are positively correlated if x_i ≈ y_i and x_j ≈ y_j:

Corr(s_X(i, j), s_Y(i, j)) > 0. (30)

Because the reference points are also real samples in X and Y, the conclusion above also holds between reference points and nonreference points. Therefore, representing the multimodal samples x_i and y_i as in equation (10), if the reference points x_j′ and y_j′ are matched, the values of similar cross-modal samples in the corresponding dimensions should be positively correlated:

Corr(x_{ij}^{r′}, y_{ij}^{r′}) > 0. (31)

Thus, if all the reference points of X and Y are one-to-one matched, the similarity between cross-modal samples can be measured according to the linear correlation between their reference-based representations:

S_{X,Y}(i, j) = ((x_i^r − x̄^r) · (y_j^r − ȳ^r)) / (‖x_i^r − x̄^r‖ ‖y_j^r − ȳ^r‖), (32)

where x̄^r and ȳ^r are the mean vectors of x_i^r and y_j^r. Moreover, we have x̄^r = 0 and ȳ^r = 0 because the cosine similarity is normalized; then, the reference-based representations of both modalities can be considered homogeneous. Therefore, the cross-modal similarity S_{X,Y}(i, j) can be directly computed with the cosine similarity:

S_{X,Y}(i, j) = (x_i^r · y_j^r) / (‖x_i^r‖ ‖y_j^r‖). (33)

Although the analysis above is somewhat lengthy, the cross-modal similarity computation based on it is quite simple. The core of the similarity computation is a multimodal reference set R(X, Y) = {(x_1′, y_1′), ..., (x_i′, y_i′), ..., (x_k′, y_k′)}, where x_i′ and y_i′ are matched cross-modal samples. However, the reference selection method in Section 3.2 only suits single-modal data, and we cannot expect that reference sets generated separately will be one-to-one matched.
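The similarity computation in equation (33) can be sketched as follows, assuming the reference points of the two modalities are already one-to-one matched; all data below are random stand-ins for real image and text features.

```python
# Cross-modal similarity as the cosine between reference-based
# representations (equation (33)), given matched reference pairs.
import numpy as np

def ref_repr(F, refs):
    # Each sample as its cosine similarities to the k reference points.
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    return Fn @ Fn[refs].T

def cross_modal_sim(RX, RY):
    # S_{X,Y}(i, j) = cosine between the reference-based representations.
    RXn = RX / np.linalg.norm(RX, axis=1, keepdims=True)
    RYn = RY / np.linalg.norm(RY, axis=1, keepdims=True)
    return RXn @ RYn.T

rng = np.random.default_rng(3)
X, Y = rng.standard_normal((40, 10)), rng.standard_normal((50, 7))
refs_x = refs_y = np.arange(5)    # indices of matched reference pairs (x'_i, y'_i)
S = cross_modal_sim(ref_repr(X, refs_x), ref_repr(Y, refs_y))
assert S.shape == (40, 50)        # one similarity per (image, text) pair
```

Note that X and Y have different feature dimensions (10 and 7); the shared k-dimensional reference representation is what makes the direct cosine comparison meaningful.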

Multimodal Reference Selection Based on Active Learning.
The multimodal reference set R(X, Y) plays two roles, in Abstraction and in Association, respectively: on the one hand, the reference set guarantees a satisfactory abstract representation of each modality; on the other hand, the correspondence relation between the reference points is the basis of aligning the single-modal representations. Both roles must be considered comprehensively because both are crucial for accurate similarity computation.
In this section, we design an active learning-based strategy for the selection of the multimodal reference set R(X, Y), based on which the similarity computation in equation (33) can be achieved.
From the analysis in Section 3.3, there exists a positive correlation between the semantic structures of different modalities. Hence, the neighbor structures of different modalities should be similar. Therefore, if x_i is selected as a reference point of X, its correspondent y_i can also serve as a reference point of Y. Thus, we select the reference points for one single modality as in Section 3.2; then, the corresponding samples in the other modality, obtained by asking the oracle, are used as the reference points of that modality. The choice of which modality to select the reference points from is also important. It is recommended to choose the one that has a clear group structure, which can bring better performance. The cost of matching should also be considered; for example, the cost of querying images from text differs from that of querying text from images.
Finally, combining the similarity computing method in Section 3.3 with the reference selecting method above, we propose the semantic structure matching with the active learning (SSM-AL) method in Algorithm 1.
First, the multimodal reference set R(X, Y) is generated in Steps 1-5: divide X (or Y) into clusters by clustering, and take the centers of all clusters as the reference set X′; then, query the corresponding sample y_j of each x_j ∈ X′ and take these samples as the reference set Y′. Thus, X and Y can be represented as in equation (10) with the reference sets X′ and Y′. Finally, the cross-modal similarity matrix can be computed directly according to the linear correlation between the reference-based representations, as in equation (33).

The computational complexity of SSM-AL is analyzed as follows. Given the retrieval problem between X = {x_1, x_2, ..., x_n} ∈ R^{n×d} and Y = {y_1, y_2, ..., y_m} ∈ R^{m×e}, the complexity of k-medoids clustering on X is O(ndk). The complexity of computing the representations of X and Y is O(ndk) and O(mek), respectively. The complexity of the cross-modal similarity computation is O(nmk). Considering that d and e are constants and k is much smaller than m and n [35], the overall complexity of the SSM-AL method is O(ndk + ndk + mek + nmk) = O(mn).
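The steps above can be composed into an end-to-end sketch on synthetic coupled data. Two simplifications are ours, not the paper's: the oracle is simulated by index alignment of the two views, and the k-medoids step is replaced by a shorter farthest-point heuristic for reference selection.

```python
# End-to-end SSM-AL sketch: select references in X, "ask the oracle" for
# their coupled counterparts in Y, represent both modalities over the
# references, and compute the cosine similarity matrix of eq. (33).
import numpy as np

def cos_rows(A):
    return A / np.linalg.norm(A, axis=1, keepdims=True)

def farthest_point_refs(X, k):
    # Greedy stand-in for clustering: each new reference is the sample
    # farthest (in cosine distance) from those already chosen.
    Xn = cos_rows(X)
    refs = [0]
    for _ in range(k - 1):
        d = 1.0 - np.max(Xn @ Xn[refs].T, axis=1)
        refs.append(int(np.argmax(d)))
    return np.array(refs)

def ssm_al(X, Y, k):
    refs = farthest_point_refs(X, k)          # references chosen from X
    # Oracle step simulated: rows of X and Y with the same index are coupled.
    R_X = cos_rows(X) @ cos_rows(X)[refs].T   # equation (10) for X
    R_Y = cos_rows(Y) @ cos_rows(Y)[refs].T   # equation (10) for Y
    return cos_rows(R_X) @ cos_rows(R_Y).T    # similarity matrix, eq. (33)

rng = np.random.default_rng(4)
Z = rng.standard_normal((60, 5))              # shared latent semantics
X = Z @ rng.standard_normal((5, 20))          # "image" view of Z
Y = Z @ rng.standard_normal((5, 15))          # "text" view of Z
S = ssm_al(X, Y, k=8)
assert S.shape == (60, 60)

# Coupled pairs should tend to rank high within their own row.
ranks = (S >= S[np.arange(60), np.arange(60)][:, None]).sum(axis=1)
assert ranks.mean() < 30                      # clearly better than random (~30.5)
```

Because both views are linear images of the same latent Z, their semantic structures correlate, and the matched-reference representation preserves this, which is exactly the mechanism the method relies on.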

Experiments
In this section, we evaluate the performance of the proposed method through experiments on four widely used data sets.
Pascal-Sentences: a subset of Pascal VOC that contains 1,000 pairs of images and corresponding text descriptions from twenty categories.
Wikipedia: a data set containing 2,866 image-text pairs from ten categories; each pair is extracted from a Wikipedia article [1].
XMedia: a publicly available data set consisting of five media types (text, image, video, audio, and 3D model).

Mathematical Problems in Engineering
We only use the image and text data of XMedia in this paper, i.e., 5,000 pairs of images and text from twenty categories.
MSCOCO: a large data set containing 123,287 images and their annotated sentences; each image is annotated with five independent sentences.
Following the existing works [24,30], we take 20% of the samples as the testing set for Wikipedia, Pascal-Sentences, and XMedia. The testing set of MSCOCO is split as in [58,59]. The training sets are kept small because we aim to test the performance with insufficient training samples.

Evaluation Protocol.
We compare the retrieval performance of the proposed method with eight baselines.
CCA [1]: with canonical correlation analysis (CCA), a shared space is learned for different modalities in which they are maximally correlated.
HSNN [60]: the heterogeneous similarity is measured by the probability of two cross-modal objects belonging to the same semantic category, obtained by analyzing the homogeneous nearest neighbors of each object.
JRL [56]: through semisupervised regularization and sparse regularization, JRL learns a common space using semantic information.
JFSSL [61]: a multimodal graph regularization is used to preserve the intermodality and intramodality similarity relationships.
CMCP [62]: a cross-modal correlation propagation method that considers both the positive and negative relations between cross-modal objects.
JGRHML [63]: a joint graph-regularized heterogeneous metric learning method that integrates the structures of different modalities into a joint graph regularization.
VSEPP [59]: a visual-semantic embedding technique for cross-modal retrieval that introduces a simple change to the common loss functions used for multimodal embeddings.
GXN [34]: a cross-modal feature embedding method that incorporates generative processes and can well match images and sentences with complex content.
SSM-AL: the proposed method, with two settings: reference selection based on text clustering, denoted SSM-AL_T, and reference selection based on image clustering, denoted SSM-AL_I. Each cluster corresponds to one coupled training sample, so the cluster number is set to the number of training samples N.
Among these methods, CCA, VSEPP, GXN, and our proposed SSM-AL are unsupervised methods that do not use class labels at all; HSNN, JFSSL, and CMCP are supervised methods for which class labels are necessary; and JRL and JGRHML are semisupervised methods that need the class labels of some samples.
For Pascal-Sentences, Wikipedia, and XMedia, a query item and a target item are considered actually similar if they share the same class label [30]. Mean average precision (MAP), a widely used metric in information retrieval [64], is used to evaluate the performance on these data sets:

MAP = (1/|Q|) Σ_{i∈Q} AP(i),

where Q is the query set (for example, in the image-to-text retrieval task, Q refers to all the images in the testing set, regardless of class) and AP(i) is the average precision of query sample i. For the query x_i, the average precision can be computed as

AP(i) = (1/L_i) Σ_{j=1}^{|T|} P(j) δ(j),

where L_i denotes the number of target samples that are actually similar to the i-th query (for these three data sets, L_i is also the number of target items that share the same class label with the query), T is the set of all target items, P(j) is the precision at position j of the ranked target list, δ(j) = 1 if the j-th sample is similar to x_i, and δ(j) = 0 otherwise. In the cross-modal literature [1,4,62], two samples are considered similar if they share the same label. The MAP score comprehensively reflects the quality of the ranked target lists of all queries.

Algorithm 1: SSM-AL.
Require: two data sets X and Y, and reference size k
Ensure: cross-modal similarity matrix S_{X,Y}
(1) Divide X into k clusters
(2) X′ ⟵ the cluster centers of X
(3) for all x_i ∈ X′ do
(4)   Y′ ⟵ Y′ ∪ {y_i}, where y_i ∈ Y and x_i ≈ y_i, obtained by asking the oracle
(5) end for
(6) R_X ⟵ represent X with X′ as in equation (10)
(7) R_Y ⟵ represent Y with Y′ as in equation (10)
(8) S_{X,Y}(i, j) ⟵ compute the similarity between x_i^r and y_j^r as in equation (33)

Both MAP scores of bidirectional retrieval (image-to-text and text-to-image) are reported, and a higher MAP indicates better performance.
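The MAP computation described above can be illustrated with a short, self-contained sketch. The relevance lists below are hypothetical; `P(j)` is computed as the standard precision at position j (the fraction of the top j targets that are relevant).

```python
import numpy as np

def average_precision(relevance):
    """AP for one ranked target list; `relevance` is a 0/1 sequence in rank
    order, where 1 marks a target actually similar to the query."""
    relevance = np.asarray(relevance, dtype=float)
    L = relevance.sum()                    # L_i: number of similar targets
    if L == 0:
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_j = np.cumsum(relevance) / ranks         # P(j)
    return float((precision_at_j * relevance).sum() / L)  # (1/L_i) sum P(j)d(j)

def mean_average_precision(all_relevances):
    """MAP: mean of AP over all queries in the query set Q."""
    return float(np.mean([average_precision(r) for r in all_relevances]))

# Two hypothetical queries: relevant targets at ranks (1, 3) and at rank 2.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]]))  # (5/6 + 1/2)/2 = 2/3
```

Here the first query gets AP = (1 + 2/3)/2 = 5/6 and the second AP = 1/2, so MAP = 2/3, matching the formula term by term.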
In contrast to the other three data sets, MSCOCO has no definite class labels. Following [28,30], we consider a query and a target actually similar only if they are a coupled image-text pair from the data set, and we use Recall@K instead of MAP as the performance metric:

Recall@K(i) = (1/L_i) Σ_{j=1}^{K} ϕ(i, j),

where ϕ(i, j) = 1 if the j-th item in the ranked target list is actually similar to the i-th query and ϕ(i, j) = 0 otherwise, and L_i is the number of targets that are actually similar to the i-th query (for the MSCOCO data set, L_i = 1 because each query has only one similar target); the reported score is the average over all queries. Another metric, Precision@K, is not reported because it is closely related to Recall@K on this data set: since each query has exactly one similar target, Precision@K = Recall@K/K [28]. We only report Recall@K of the unsupervised methods (SSM-AL, CCA, VSEPP, and GXN) because the supervised and semisupervised methods cannot be conducted on MSCOCO.
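The Recall@K metric in the single-target setting (L_i = 1) reduces to the fraction of queries whose correct target appears in the top K ranked results. A minimal sketch, with a hypothetical 3-query similarity matrix:

```python
import numpy as np

def recall_at_k(similarity, true_target, k):
    """similarity: (n_queries, n_targets) score matrix; true_target[i] is the
    index of the single correct target for query i. Returns the fraction of
    queries whose correct target is ranked within the top k."""
    topk = np.argsort(-similarity, axis=1)[:, :k]   # top-k target indices per query
    hits = [true_target[i] in topk[i] for i in range(len(topk))]
    return float(np.mean(hits))

# Hypothetical similarity matrix; the correct targets are 0, 1, and 2.
S = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.1],
              [0.7, 0.2, 0.1]])
print(recall_at_k(S, [0, 1, 2], k=1))  # queries 0 and 1 hit at rank 1 -> 2/3
```

With exactly one similar target per query, each query contributes at most one hit among its top K results, which is why Precision@K = Recall@K/K holds as stated above.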

Retrieval Performance Comparisons.
In this section, we compare the bidirectional (image-to-text and text-to-image) retrieval performance of SSM-AL and the baselines. The number of coupled cross-modal training samples (which is also the number of clusters and reference points in SSM-AL) is denoted as N.
MAP values of some methods with small training sets are not reported in Tables 1-3 because these methods cannot finish with such limited training data; these entries are marked with "-". Besides, to evaluate the impact of the number of reference points on the retrieval performance, we draw the MAP-N curves of the best-performing SSM-AL method and three representative baselines that have regular trends in Figures 5-7.
(1) The result of Pascal-Sentences: in Table 1, SSM-AL_I outperforms all the baselines, including supervised, semisupervised, and unsupervised methods, in the retrieval tasks of both directions. When N = 10, the MAP values of SSM-AL_T are lower than those of SSM-AL_I but higher than those of the baselines. The MAP values of JRL, JFSSL, JGRHML, and HSNN are not reported because these methods cannot finish normally. CCA and CMCP perform worse than SSM-AL but better than VSEPP and GXN. When N = 50, SSM-AL_T and SSM-AL_I still perform the best; JFSSL and HSNN perform second-best in the image-to-text task and the text-to-image task, respectively. When N increases to 100, SSM-AL_I is still the best-performing method in both directions. The performance of CMCP increases considerably and exceeds that of SSM-AL_T in the image-to-text retrieval task but is still worse than that of SSM-AL_I. The MAP values of JRL, JFSSL, VSEPP, GXN, and CCA are obviously lower than those of the other methods. From Figure 5, in general, more reference points bring higher performance in both retrieval tasks. The rate of increase is high when N < 50 but then slows down. The MAP value of CMCP decreases first and then increases quickly when N > 25. The performance of both VSEPP and HSNN also increases with N, but the former increases much more slowly than the others.
(2) The result of Wikipedia: in Table 2, the MAP values of SSM-AL_T and SSM-AL_I are higher than those of the other methods, and SSM-AL_T performs best in both retrieval tasks. When N = 10, the retrieval performance of the two SSM-AL methods is obviously higher than that of the other four methods. When N = 50, SSM-AL_T and SSM-AL_I also outperform the eight baselines. JRL performs better than the other baselines but obviously worse than SSM-AL_T and SSM-AL_I. CMCP, JGRHML, and HSNN have similar MAP values in both retrieval tasks. The MAP values of CCA, VSEPP, and GXN in the text-to-image retrieval task are similar, while the MAP value of CCA in the image-to-text task is higher than those of the other two. JFSSL performs worst among all the methods. When N = 100, the MAP values of CMCP, JGRHML, and HSNN increase obviously but are still lower than those of SSM-AL_T and SSM-AL_I. The performance of CCA, JRL, JFSSL, VSEPP, and GXN does not show significant improvement. In Figure 6, the MAP values of SSM-AL_T in both tasks also increase with the number of reference points. The performance gain from more reference points is less obvious in the image-to-text task than in the text-to-image task.
The MAP value of CMCP also decreases first and then increases quickly. The performance of VSEPP in the two tasks does not show visible changes as N increases.
(3) The result of XMedia: in Table 3, SSM-AL_T and SSM-AL_I outperform all the baselines, and SSM-AL_T performs better than SSM-AL_I. When N = 40, the performance of CMCP is worse than that of SSM-AL_T and SSM-AL_I but obviously better than that of CCA, JGRHML, HSNN, VSEPP, and GXN. When N = 200, the MAP values of SSM-AL_T and SSM-AL_I in both retrieval tasks are still higher than those of the baselines. The MAP values of CMCP and JGRHML in the image-to-text retrieval task increase obviously and are higher than those of the other baselines; likewise, the MAP values of JFSSL, CMCP, and JGRHML in the text-to-image retrieval task are higher than those of the other baselines. When N = 400, SSM-AL_T and SSM-AL_I still perform the best. With larger N, the MAP values of JFSSL and JGRHML do not show significant improvement.
In Figure 7, the MAP value of SSM-AL_T increases with the number of reference points, especially when N is small. More reference points bring an obvious performance gain for SSM-AL when N < 100; when N is larger than 100, the rate of performance gain is much slower.

Although the MAP values of CMCP and HSNN also increase as N increases, they do so much more slowly than that of SSM-AL_T. The performance of VSEPP still does not show significant changes.
(4) The result of MSCOCO: in Figure 8, Recall@K in the image-to-text retrieval task generally increases with the number of training samples. Although, in Figure 8(a), Recall@1 of SSM-AL_T and SSM-AL_I increases first and then decreases slightly, it is still higher than the Recall@1 of the three baselines.
In Figure 9, Recall@K of SSM-AL_T, SSM-AL_I, VSEPP, and GXN increases as the number of training samples increases, and the SSM-AL methods still perform best in the text-to-image retrieval task. Although Recall@K of VSEPP and GXN increases rapidly in all four panels, it is always lower than that of the two SSM-AL methods. Recall@K of CCA is the lowest for K = 1, 5, 10, and 50 and shows no visible change with the number of training samples. Moreover, SSM-AL_T and SSM-AL_I obtain similar Recall@K scores in general; however, Recall@K of SSM-AL_I is slightly higher when K ≥ 5.
It can be concluded that the proposed SSM-AL_T and SSM-AL_I outperform all the baselines when matched training samples are insufficient, even though they do not use any label information. The experimental results demonstrate the importance of intramodal relation learning with uncoupled samples and of the simple correlation between high-level cross-modal concepts, as well as the effectiveness of the Abs-Ass framework. The performance gap between the proposed method and the baselines decreases as the amount of coupled training data increases; however, the proposed method is still an ideal choice with limited coupled training data. In addition, the DNN-based methods (VSEPP and GXN) do not show an advantage over the traditional ones when training samples are insufficient.
Overall, more reference points are beneficial to the retrieval performance of SSM-AL: in most cases, the performance of SSM-AL improves with more reference points. However, it does not hold that more reference points are always better. On the one hand, more reference points mean higher costs for matching cross-modal samples; on the other hand, the performance gain becomes limited when the number of reference points is large. In practice, the number of reference points should be decided according to the performance demand and the matching cost. The above results also show that the retrieval performance differs depending on which modality is used as the basis of reference point selection: SSM-AL_T performs better than SSM-AL_I on the Wikipedia and XMedia data sets, while SSM-AL_I performs better than SSM-AL_T on the Pascal-Sentences and MSCOCO data sets.

Conclusion
In this paper, we try to improve the performance of cross-modal retrieval when training data are insufficient. Different from existing works, our proposed framework and its implementation emphasize intramodal relation learning from the data itself; no additional information (such as class labels or annotations from the web) is used as a supplement. The idea of this work is meaningful especially when coupled training samples are insufficient; thus, it can be very helpful when the cost of application is an essential consideration. It can also be incorporated into other methods to solve the cold-start problem of the cross-modal retrieval task. The future work is twofold. On the one hand, this work can be improved by incorporating the class labels of a few samples when aligning the semantic structures of different modalities. On the other hand, we plan to extend this work to other modalities, such as video and audio.
Data Availability
The multimodal data supporting this work are from previously reported studies and data sets, which have been cited. All of them are open access and available on the Internet.

Conflicts of Interest
The authors declare that they have no conflicts of interest.