Image-Text Joint Learning for Social Images with Spatial Relation Model

The rapid development of sensor technology and mobile devices has brought a flourishing of social images, and large-scale social image collections have attracted increasing attention from researchers. Existing approaches generally rely on recognizing object instances individually with geo-tags, visual patterns, etc. However, a social image represents a web of interconnected relations; these relations between entities carry semantic meaning and help a viewer differentiate between instances of a substance. This article explores the joint learning of social images from the perspective of spatial relationships. Specifically, the model consists of three parts: (a) a module for deep semantic understanding of images based on the residual network (ResNet); (b) a deep semantic analysis module for text that goes beyond traditional bag-of-words methods; (c) a joint reasoning module in which the text weights are obtained from image features through self-attention, together with a novel tree-based clustering algorithm. Experimental results on the Flickr30k and Microsoft COCO datasets demonstrate the effectiveness of the method. Moreover, our method considers spatial relations while matching.


Introduction
With the rise of cheap sensors, mobile terminals, and social networks, research on social images is making good progress, including image retrieval, object classification, and scene understanding. Compared with images in traditional applications, social pictures are hard to understand using low-level features alone. Nevertheless, most existing methods capture only the local patterns of images by utilizing low-level features (e.g., color and texture). Intuitively, knowing the spatial relation among local elements may help predict what objects and scenes are presented in the visual content. It has recently been widely accepted in the vision community that contextual information, i.e., the relation between objects, improves the accuracy of object recognition [1]. Therefore, the geometric relation of objects in social images is usually exploited to conduct annotation, which depends on the similarity measurement of visual objects.
Significant efforts have been made to integrate visual and textual analyses [2][3][4]. For example, Wang et al. [5] present an algorithm to learn the relations between scenes, objects, and texts with the help of image-level labels. However, such a training process requires a large number of paired images and text data. Motivated by the success of the encoder-decoder network, studies have applied it to generate text descriptions of given images [6,7]. Nevertheless, such impressive performance relies on the assumption that the training data and the test data come from the same underlying data distribution. Some approaches [8,9] exploit the spatial relations of objects indicated by prepositions for image understanding. They suffer from the limitations that spatial relations have to be learned with the bounding boxes of objects and cannot be driven by the task goal.
Although several successful image-text learning or vision-based approaches exist for analyzing social images, the following problems are still not addressed: (1) Visual content and text are always learned separately, which makes the traditional methods hard to train end-to-end. (2) Learning tasks are converted to classification problems, empowered by large-scale annotated data and end-to-end training with neural networks, and such models are not capable of describing concepts unseen in the training pairs. (3) The spatial relations defined by prepositions have to be learned with the bounding boxes of objects, which are very challenging to obtain. Moreover, spatial relationships are very scarce in real textual descriptions.
Motivated by these observations, we aim to develop a method that learns the spatial relations across separate visual objects and texts for social image understanding. Therefore, this paper proposes a cross-modal framework that builds a joint model of texts and images to extract features and combines the advantages of the self-attention mechanism and deep learning models, generating interactive effects. In particular, we investigate (1) how to label social images with high-level features based on positional information in the image and (2) how to combine the text and visual content. The framework is established by taking spatial relationships as the basic unit of image-text joint learning. We use neural architectures to measure the semantic similarity between visual data, e.g., images or regions, and text data, e.g., sentences or phrases. Learning this similarity requires connecting low-level pixel values with high-level language descriptions and then building a joint latent space of common dimensionality in which matching image and text features have high cosine similarity, in order to explore the semantics hidden inside a social image.
Our contributions are twofold. (1) We propose a framework that jointly trains two dual tasks, the spatial semantics of images and text-to-image synthesis, which improves the supervision and the generalization performance of social image captioning. (2) We extend the conventional model by adding a top-down attention mechanism. With a novel tree-based clustering method, it demonstrates its effectiveness in learning the alignments between the visual concepts of images and the semantics of texts.

Related Works
The related works generally fall into two categories: image tagging and relational inference methods.

Image Tagging.
Image tagging has been widely studied for the past decade. Early image tagging models were built mainly from the viewpoint of statistics and probability. In practice, many image annotation works [10,11] assign the top-k class labels to each image; however, the number of class labels varies significantly from photo to photo, so fixed top-k annotation degrades the performance of image annotation. Besides, many works attempt to infer correlations or joint probability distributions between images and semantic concepts (or keywords). For example, Farhadi et al. [12] use detection methods to understand scene elements. Similarly, Li et al. [13] start with detections and piece together a final description using phrases containing detected objects and relationships.
Further, powerful language models based on language parsing have been used as well [14]. Recently, deep learning has achieved great success in the fields of image, text, and speech processing; one example is the m-RNN model [15], in which a multimodal component is introduced to explicitly connect the language model and the vision model through a one-layer representation.
Images can also be labeled at the pixel level, which has applications in intelligent video monitoring, self-driving cars, etc. More recently, the variable label number problem has been identified [16,17]. These solutions treat the image annotation problem as an image-to-text translation problem and solve it using an encoder-decoder model. The multiscale approaches [18] propose a novel multiscale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Instead of CNN features, some works use more semantic information obtained from the image as the input to the decoder [19,20]. In all, most methods still focus on recognizing objects separately. The spatial relationships between objects and scenes are always neglected.

Relational Inference.
The earliest form of reasoning can be traced to symbolic approaches, in which the relations between symbols are established in logical and mathematical language and interpreted in terms of deduction, arithmetic, and algebra. Because symbolic approaches suffer from the symbol grounding problem, they are not robust to small variations in task and input [21,22]. In many other methods, such as deep learning with a traditional inference network, the inference part may be a multilayer perceptron (MLP), a Long Short-Term Memory (LSTM) network, or an LSTM with attention, which often struggles when data are scarce [23,24]. Santoro et al. proposed a relation network to implement the reasoning part. This structure clearly expresses two ideas: the final answer depends on pairs of objects, and the problem itself influences how the objects are examined. As this is an active research field, some recent works have also applied neural networks to structured graph data or tried to standardize network output through relations and knowledge bases. We believe that visual data reasoning should be both local and global: abandoning the two-dimensional image structure to fit the task is neither useful nor valid. Beyond the object detection task that detects relationships for an image, the scene graph generation task [25,26] endows the model with an entire structured representation capturing both objects and their semantic relationships, where the nodes are object instances in images and the edges depict their pairwise correlations. Existing methods of this kind usually require additional annotations of relations, whereas our approach demands only image-level annotations.

Overview of Cross-Modal Tasks and Pipeline.
In this paper, we focus on two image-text tasks: spatial relation modeling and image-text matching. The former covers both the image-to-image and image-to-text scenarios, whose definitions are straightforward: given an input image (resp. sentence), the goal is to find the relationships between entities with semantic meaning. The second task is to find the sentences that best match the input images. The architecture for these tasks consists of a four-step pipeline, summarized as follows. The first step is to design functional feature extractors: we use a refined TF-IDF method [27] to calculate the word frequency and then combine it with the embedding vector, which lets us map it nonlinearly into another vector space to enhance the semantic association between words and lower its dimension. The second step is to generate full image features and geographical features. If the network is simply deepened continuously, accuracy saturates and then degrades rapidly [28]. The third step is to add a self-attention mechanism, using the image feature to obtain the weights of the words. Then we combine the image and text features to deduce the spatial relation of the picture. The final step performs instance recognition, which amounts to a deep semantic understanding of social images.
At a conceptual level, there are two branches to achieve the goal. One is to train the network to map images and texts into a joint embedding space. The second approach is to frame the correspondence between pictures and documents through cosine similarity, to obtain the probability that the two items match. Accordingly, we define a cross-modal reasoning framework that exploits image-text joint learning for social image retrieval with a spatial relation model. It includes two variants of two-branch networks that follow these two strategies (Figure 1): the embedding network and the similarity network.
Two networks are needed in this framework, one for cross-modal topic detection and the other for semantic matching, and they are trained one after the other. The embedding network performs cross-domain spatial relation modeling, as illustrated in Figure 1(a). The spatial relation covers both image-to-text and image-to-image, and the goal is to model spatial relationships in CNN-based detection. The similarity network is illustrated in Figure 1(b): we first pretrain on image data and text-image data, respectively, and then fine-tune on the target-domain training data via reinforcement learning. In detail, we use ResNet-50 as the CNN encoder of the framework to learn the social image annotations. By adding the cross-domain spatial relation model, the structure can attend to the crucial parts of the image when decoding different words of the caption, generating interactive effects. In the joint model, image-text similarity is given by cosine similarity [29].

Cross-Domain Spatial Relation Modeling.
The spatial relation between visual content and texts is presented at two levels. First, it is represented by the matching probability of the target object; second, it is represented by the spatial relationships between the image objects, expressed through the interaction of the objects in different tasks.

Spatial Relation between Images and Texts.
Motivated by attention-based deep methods [30,31], we first review a basic attention module called self-attention. As illustrated in Figure 2, the input consists of queries and keys of dimension $d_k$ and values of dimension $d_v$. The dot product is computed between the query and all keys to obtain their similarity. A softmax function is applied to obtain the weights on the values. Given keys and values, the output value is the weighted average over the input values:

$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i. \tag{1}$$

In our case, we apply modifications to the output values. Specifically, we define attention as follows: the source consists of a set of <Key, Value> pairs; given an element of the target, called the Query, we calculate the similarity or relevance between the Query and each Key to obtain the weight coefficient of the Value corresponding to that Key. The weighted sum is then taken over the values, where x is the vector of an input image or word, $L_x$ is the length of x, Query is the image vector, $\mathrm{Key}_i$ is the word vector, and the dot product is used to calculate similarity.
We apply the perceptron to calculate the weight of the word vector. In self-attention, each word can attend to all other terms. The aim is to learn the internal word dependencies and to capture the internal structure of the text. A characteristic of self-attention is that it ignores the distance between elements and computes their dependencies directly, which makes it suitable for studying the internal structure of a sentence. Further, parallel computation is relatively simple to realize.
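The weighting scheme described above can be illustrated with a minimal NumPy sketch. It assumes that the image feature acts as the Query, that the word vectors serve as both Keys and Values, and that similarity is a plain dot product followed by softmax; the function names and the toy 3-dimensional features are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def image_guided_attention(query, keys, values):
    """Weight each word by its similarity to the image query.

    query:  (d,)   image feature vector (the Query)
    keys:   (L, d) word vectors (the Keys), one row per word
    values: (L, d) vectors to be combined (here the same word vectors)
    """
    scores = keys @ query          # dot-product similarity per word
    weights = softmax(scores)      # normalized weights that sum to 1
    output = weights @ values      # weighted sum of the values
    return output, weights

# Toy example with hypothetical 3-dimensional features.
image_vec = np.array([1.0, 0.0, 0.0])
word_vecs = np.array([
    [2.0, 0.0, 0.0],   # word aligned with the image feature
    [0.0, 1.0, 0.0],   # unrelated word
    [0.0, 0.0, 1.0],   # unrelated word
])
attended, word_weights = image_guided_attention(image_vec, word_vecs, word_vecs)
```

In this toy case the first word receives the largest weight because it is the only one aligned with the image vector, which mirrors the idea of letting the image decide which words matter.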
Inspired by the idea of learning to rank [32,33], we measure the similarity between two samples using cosine distance, so that the two modalities are taken into account effectively. The ranking loss is defined over image-text pairs, where I denotes the visual input, T denotes the text input, and α is a margin. We then use the idea of the triplet loss [34], which is one of the most widely used similarity functions. The triplet loss function involves (a) a sample called the Anchor (written $x_a$) that is chosen randomly from the training dataset, (b) a sample of the same class as the Anchor, called the Positive (written $x_p$), and (c) a sample whose class differs from that of the Anchor, called the Negative (written $x_n$). The purpose of the triplet loss is to minimize the distance between $x_a$ and $x_p$ and to maximize the distance between $x_a$ and $x_n$. We call this kind of ranking loss the triplet ranking loss; it penalizes a triplet whenever the Positive is not closer to the Anchor than the Negative by at least the margin α.
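As a hedged illustration, the standard hinge form of a triplet loss is max(0, α + d(x_a, x_p) − d(x_a, x_n)), with d taken here to be the cosine distance. The sketch below assumes this standard form and a hypothetical margin of 0.2; the exact formulation and margin value used in the paper may differ.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: zero once the negative is farther than the positive by `margin`.

    The default margin is a hypothetical value chosen for illustration.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)

anchor = [1.0, 0.0]
positive = [1.0, 0.1]        # nearly the same direction as the anchor
easy_negative = [0.0, 1.0]   # orthogonal to the anchor
hard_negative = [1.0, 0.2]   # almost as close as the positive

easy_loss = triplet_ranking_loss(anchor, positive, easy_negative)
hard_loss = triplet_ranking_loss(anchor, positive, hard_negative)
```

The easy negative already satisfies the margin and contributes no loss, while the hard negative produces a positive loss, so training effort concentrates on triplets that are still confusable.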

Spatial Relation in Image Objects.
We now describe object relation computation. The basic idea of the spatial relationships between image objects is rooted in a crucial inference: visual or hidden objects in the image space tend to concentrate when they are relevant to one another, but are distributed randomly when they are not. Let an object consist of its coordinate feature $f_C$ and appearance feature $f_A$. In this work, $f_C$ is a 4-dimensional object bounding box with relative image coordinates, and $f_A$ depends on the task. Specifically, given an input set of N visual or hidden objects $\{(f_C^n, f_A^n)\}_{n=1}^{N}$, the relation feature $f_R(n)$ of the whole set with respect to the n-th object is computed as

$$f_R(n) = \sum_{m} \omega^{mn} \cdot (W_V \cdot f_A^m).$$

The output is a weighted sum of appearance features from other objects, linearly transformed by $W_V$, which corresponds to the values V in equation (1). The relation weight $\omega^{mn}$ indicates the impact of other objects. $W_V$ and $\omega^{mn}$ can be calculated by the object-relations model [35].
There are two steps. First, the coordinate features of the two objects are embedded in a high-dimensional representation by ResNet. Inspired by the widely used bounding box regression method DPM [36], the elements are transformed using log(·) to account for both distant and close-by objects. Second, the embedded feature is turned into a scalar weight and trimmed. The trimming operation restricts relations to objects with particular spatial relationships, which depend on the task and knowledge. The cross-domain spatial relation model has the same dimension for images and texts at the output, so it can serve as an essential building block in the framework.
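The relation computation can be sketched in NumPy as follows. This is a simplified illustration under several assumptions that are not stated in the text: boxes are given as (cx, cy, w, h) in relative coordinates, the geometric weight comes from a hypothetical fixed vector `w_g` applied to log-scaled box offsets and trimmed at zero, and the appearance weight uses raw features without learned projections. In the model described above these quantities are obtained from the object-relations model [35].

```python
import numpy as np

def log_geometry(boxes, eps=1e-3):
    """Pairwise log-scaled geometry, shape (N, N, 4), indexed [m, n].

    boxes: (N, 4) array of (cx, cy, w, h) in relative image coordinates.
    """
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    dx = np.log(np.maximum(np.abs(cx[:, None] - cx[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(cy[:, None] - cy[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def relation_features(f_a, boxes, w_v, w_g):
    """Compute f_R(n) = sum_m omega[m, n] * (W_V f_A^m) for every object n."""
    geom = log_geometry(boxes)
    geo_weight = np.maximum(geom @ w_g, 0.0)             # trimming at zero
    app_weight = f_a @ f_a.T / np.sqrt(f_a.shape[1])     # appearance similarity
    raw = geo_weight * np.exp(app_weight - app_weight.max())
    omega = raw / np.maximum(raw.sum(axis=0, keepdims=True), 1e-12)  # normalize over m
    return omega.T @ (f_a @ w_v.T), omega

rng = np.random.default_rng(0)
boxes = np.array([
    [0.20, 0.20, 0.10, 0.10],   # object 0
    [0.25, 0.20, 0.10, 0.10],   # object 1, close to object 0
    [0.90, 0.90, 0.10, 0.10],   # object 2, far from both
])
f_a = rng.normal(size=(3, 8))             # hypothetical appearance features
w_v = rng.normal(size=(4, 8))             # hypothetical value projection W_V
w_g = np.array([-1.0, -1.0, 0.0, 0.0])    # hypothetical: favors small offsets
relation, omega = relation_features(f_a, boxes, w_v, w_g)
```

With these toy boxes, the distant object 2 receives zero weight with respect to objects 0 and 1 after trimming, so only nearby objects contribute to each other's relation feature.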

Similarity Network
Image Representation. As the vanishing gradient problem prevents a deep network from being fully trained, we adopt ResNet-50, which avoids the loss of accuracy by learning residual functions that reformulate the layers with reference to their inputs instead of learning unreferenced functions. The building block is defined as

$$y = F(x, \{W_i\}) + x, \tag{7}$$

where x is the input vector and y is the output vector of the layers considered. The function $F(x, \{W_i\})$ represents the residual mapping to be learned. A building block can use either of two shortcut types. With the identity shortcut, the operation F + x is performed by the shortcut connection and element-wise addition, and in equation (7) the numbers of channels of x and y are equal. If they differ, we adopt the calculation

$$y = F(x, \{W_i\}) + W_s x, \tag{8}$$

where $W_s$ is a convolution operation that adjusts the channel dimension of x. We learn the image representation vectors from the lower convolutional layer. In this manner, the decoder can attend to specific parts of an image by selecting a subset of the feature vectors. In this study, we employ the 2048 feature maps of the 50-layer residual network.
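The two shortcut forms can be illustrated with a small NumPy sketch. It assumes a two-layer residual mapping built from fully connected layers as a stand-in for the convolutions of ResNet-50, and it omits the activation that usually follows the addition so that the output matches the block equations literally; all weights are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2, w_s=None):
    """Return F(x, {W_i}) + x, or F(x, {W_i}) + W_s x when a projection is given.

    x:   (d_in,) input vector
    w1:  (d_hidden, d_in) first layer of the residual mapping F
    w2:  (d_out, d_hidden) second layer of F
    w_s: optional (d_out, d_in) projection used when d_out != d_in
    """
    residual = w2 @ relu(w1 @ x)                 # F(x, {W_i}) with two layers
    shortcut = x if w_s is None else w_s @ x     # identity or projection shortcut
    return residual + shortcut

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0, 0.5])

# Identity shortcut: input and output dimensions are equal.
y_same = residual_block(x, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))

# Projection shortcut: W_s maps the 4-dimensional input to 2 dimensions.
y_proj = residual_block(
    x,
    rng.normal(size=(4, 4)),
    rng.normal(size=(2, 4)),
    w_s=rng.normal(size=(2, 4)),
)
```

If the residual mapping outputs zero, the block reduces to the shortcut alone, which is why stacking such blocks does not make the identity function harder to represent.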

Text Representation. A prevalent text representation in Natural Language Processing (NLP) is one-hot encoding.
Unfortunately, when the output space grows, the features cannot properly span the full feature space. Consequently, one-hot encoding may be insufficient for fine-grained tasks, since projecting the outputs into a higher-dimensional space dramatically increases the parameter space of the computed models. Also, for datasets with a large number of words, the ratio of samples per word is typically reduced.
A straightforward way to solve the limitations mentioned above is to relax the problem into a real-valued linear programming problem and then threshold the resulting solution. We combine the word embedding vector with the frequency given by TF-IDF [27], which reflects how often a word occurs across all texts, not just its number of occurrences. This allows each word to establish a global context and transforms the high-dimensional sparse vector into a low-dimensional dense vector. Specifically, we use self-attention, as it enables each word to learn its relation to other words.
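A small sketch of combining TF-IDF weights with word embeddings is given below. It uses plain TF-IDF (term frequency times log inverse document frequency) rather than the refined variant of [27], whose details are not given here, and the two-dimensional embeddings are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

def weighted_sentence_vector(doc_weights, embeddings):
    """Sum the word embeddings of a document, each scaled by its TF-IDF weight."""
    dim = len(next(iter(embeddings.values())))
    vector = [0.0] * dim
    for word, weight in doc_weights.items():
        for i, value in enumerate(embeddings[word]):
            vector[i] += weight * value
    return vector

captions = [
    ["a", "dog", "runs"],
    ["a", "cat", "sleeps"],
    ["a", "dog", "sleeps"],
]
# Hypothetical 2-dimensional embeddings, for illustration only.
embeddings = {
    "a": [1.0, 0.0], "dog": [0.0, 1.0], "cat": [0.5, 0.5],
    "runs": [1.0, 1.0], "sleeps": [0.2, 0.8],
}
weights = tf_idf(captions)
sentence_vec = weighted_sentence_vector(weights[0], embeddings)
```

Because "a" occurs in every caption, its weight is zero and it does not influence the sentence vector, whereas the rarer word "runs" dominates the first caption.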

Cross-Modal Matching
Cosine Similarity. The cosine of the angle between two vectors is insensitive to differences in overall magnitude, so it pays more attention to differences in direction across dimensions than to differences in numerical values. We use the extracted image feature vector to allocate the weights of the word vectors, rather than deriving the weight of each word from the text alone. We then take the dot product of the weighted word vectors with the image vector to obtain the similarity between image and text.
This allows the semantics of the image and text to interact.

Tree-based Clustering Vector Quantization Algorithm. Let $\mathcal{X} = \mathbb{R}^d$ be the d-dimensional instance space and $\mathcal{Y} = \{1, 2, \ldots, q\}$ be the class space with a set of q possible classes. Note that two tasks are needed here: tree-based training and retrieval. After the features of the images and texts are obtained, the image and text lie in the same vector space, and we can apply a scalable k-means++ clustering algorithm [37] to both image and text vectors.
We develop a tree-based algorithm for cross-modal learning, which presents a tree-based classification model for multiclass learning. A tree-based structure is constructed in which the root node corresponds to the classes in the training dataset. Each node v contains a set of k-means++ classifiers. The top node contains all the training data. At the top level, the whole dataset is partitioned into five data subsets {A, B, C, D, E}. The instances are recursively partitioned into smaller subsets while moving down the tree. Each internal node contains the training instances of its child nodes. In particular, each node in the tree contains two components: a cluster vector and a predictive class vector. The cluster vector $p_v$ is a real-valued vector that measures the clusters at a node. The predictive class vector is a Boolean vector indicating the membership of predictive classes at a node. Its value is 1 when $p_v(n)$ is larger than the threshold, which implies that the node is the proper class or subclass for the instances of node v. Note that $C_i$ and the threshold are obtained from the training process.
The algorithm uses three stopping criteria to determine when to stop growing the tree. (1) A node is a leaf node if it is identified as a predictive class. (2) A node is a leaf node if the data cannot be partitioned into more than three clusters using the classifiers. (3) The tree reaches the maximum depth.
Figure 3 shows the flowchart of the training and retrieval steps. During training, we calculate the position of each real image or text among the leaf nodes; the path to that leaf is the item's fingerprint. We save all fingerprints together with the tree model, which completes the construction. When matching, it is also necessary to extract the image features first. Starting from the root node, the clustering model at each level recursively predicts the category of the vector.
After the image has fallen into a leaf node, the path to that leaf is output as the fingerprint of the picture. We then use cosine distance to find the texts that share this fingerprint and sort them to obtain the annotation. The process is shown in Algorithms 1 and 2.
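The training and fingerprinting steps can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses a small NumPy k-means with k-means++ seeding, stops only on maximum depth or node size (the predictive-class criterion is omitted), and the branching factor, depth, and toy data are hypothetical.

```python
import numpy as np

def assign(data, centers):
    """Index of the nearest center for every row of data."""
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_pp(data, k, rng, n_iter=20):
    """Lloyd's k-means with k-means++ seeding; returns the (k, d) centers."""
    centers = [data[rng.integers(len(data))]]
    for _ in range(1, k):
        d2 = np.min([((data - c) ** 2).sum(axis=1) for c in centers], axis=0)
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(len(data), 1.0 / len(data))
        centers.append(data[rng.choice(len(data), p=probs)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(data, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

def build_tree(data, k, max_depth, min_size, rng, depth=0):
    """Recursively partition the data; a leaf is represented by None."""
    if depth >= max_depth or len(data) < max(k, min_size):
        return None
    centers = kmeans_pp(data, k, rng)
    labels = assign(data, centers)
    children = [
        build_tree(data[labels == j], k, max_depth, min_size, rng, depth + 1)
        for j in range(k)
    ]
    return {"centers": centers, "children": children}

def fingerprint(tree, vector):
    """Path of cluster indices followed from the root down to a leaf."""
    path, node = [], tree
    while node is not None:
        j = int(assign(vector[None, :], node["centers"])[0])
        path.append(j)
        node = node["children"][j]
    return tuple(path)

# Two well-separated toy groups of hypothetical 2-dimensional features.
blob_a = np.array([[0.01 * i, 0.0] for i in range(10)])
blob_b = np.array([[10.0 + 0.01 * i, 10.0] for i in range(10)])
data = np.vstack([blob_a, blob_b])
tree = build_tree(data, k=2, max_depth=2, min_size=4, rng=np.random.default_rng(0))
fp_a = fingerprint(tree, blob_a[0])
fp_b = fingerprint(tree, blob_b[0])
```

Items whose fingerprints match fall into the same leaf, so a retrieval step only needs to compare a query against that bucket instead of the whole collection, which is the source of the speed-up over brute-force search.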

Datasets and Evaluation.
We use the Flickr30k and Microsoft COCO datasets to evaluate our proposed method; both are widely used in caption-image retrieval and image caption generation tasks. As a popular benchmark for annotation generation and retrieval tasks, Flickr30k contains 31,783 images focusing mainly on people and animals, and 158,915 English captions (five per image). We randomly select 30,783 images for training and another 1,000 for testing.
Microsoft COCO is the largest existing dataset with both captions and region-level annotations. It consists of 82,783 training images and 40,504 validation images, and five sentences accompany each image. We use all 82,783 training images for training and a test split of 5,000 images. The test split consists of images with more than three instances, selected from the validation dataset. For each test image, we use the publicly available division, which is commonly used in caption-image retrieval and caption generation tasks.

Implementation Details.
We use ResNet-50 to extract 2048 feature maps with a size of 1×1 from "conv5_3", which helps us integrate the high-level features with the lower ones; the visualization results are provided in Figure 4. We then add a fully connected layer of dimension 128 and name its output, the image embedding, $v_1$.
In the text representation, the input layer is followed by two fully connected layers, with dimensions $d_1$ and $d_2$, respectively. The output of the last fully connected layer is the text module embedding, which we name $v_2$. We add a self-attention mechanism to the two embedding networks and compute $v_1'$ and $v_2'$. Then, $v_1'$ and $v_2'$ are connected to a triplet ranking loss. Further, we use the Adam optimizer to train the system.
After offline debugging, we set the length of the input layer of the text module to 100,000. After counting the occurrences of all words obtained from sentence segmentation, we keep the 99,999 most frequent words and map all remaining words to a single new word. In total, the input layer therefore has 100,000 neurons.
In the text branch, the second layer has 512 neurons and the third has 128. Correspondingly, the length of the image embedding is also 128. We set the margin in equation (6), and the parameters of the Adam algorithm are as follows: lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08, decay = 0.0.
To speed up training, we conduct negative sampling.
The sentence that matches the current image is the positive sample. To train the low-frequency sentences thoroughly, we randomly select some of the sentences that do not match the image as negative samples.
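The vocabulary truncation described above can be sketched in a few lines. The `<unk>` token name and the helper functions are hypothetical, and the toy vocabulary size of 3 stands in for the 100,000 used in the experiments.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences, vocab_size):
    """Keep the (vocab_size - 1) most frequent words plus one catch-all token."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    most_common = [word for word, _ in counts.most_common(vocab_size - 1)]
    vocab = {word: index for index, word in enumerate(most_common)}
    vocab["<unk>"] = len(vocab)   # hypothetical name for the "new word"
    return vocab

def encode(sentence, vocab):
    """Map each word to its index, sending rare words to the catch-all token."""
    unknown = vocab["<unk>"]
    return [vocab.get(word, unknown) for word in sentence]

sentences = [["a", "dog", "a"], ["a", "cat", "dog"], ["bird"]]
vocab = build_vocabulary(sentences, vocab_size=3)
encoded = encode(["a", "bird", "cat"], vocab)
# vocab   -> {"a": 0, "dog": 1, "<unk>": 2}
# encoded -> [0, 2, 2]
```

Capping the vocabulary this way fixes the width of the input layer regardless of how many distinct words appear in the captions.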

Result of Experiments.
Microsoft COCO consists of 123,287 images and 616,767 descriptions. Five text descriptions accompany each image. We train on the public training set of 82,783 images and hold out 5,000 images as test data. Figure 5 shows that our loss function is slightly better than the others once the number of iterations exceeds 400. The softmax loss is effective at enlarging the distance between classes, but it is weak at reducing the distance within a class. The triplet ranking loss addresses this problem to some extent.
As shown in Figure 6, words such as "climbs," "rest at," and "in the," which express the spatial relationships between objects, get significantly higher scores than other words, while words like "the" and "a" always have lower matching scores. Moreover, some prepositions that convey positional relations, such as "by" in the upper-right corner, have higher scores than other prepositions. The use of attention causes features related to spatial relation traits to be weighted more heavily. As shown in Figure 7, for example in the image in the upper-right corner, we can infer from the coins in the cup next to the person lying down that he is begging.

Qualitative Evaluation.
To evaluate the effectiveness of the proposed method, we first compare its classification performance with the state-of-the-art performance reported in the literature. We manually labeled the ground-truth mapping from the nouns to the object classes. By varying the threshold, the precision-recall curve can be sketched to measure accuracy. We use the common metrics "R@1", "R@5", and "R@10", i.e., recall rates at the top 1, 5, and 10 results.

Algorithm: tree-based training.
(2) compute R by ResNet to represent M;
(3) repeat
(4) create a node v;
(5) update R according to N clusters to M with k-means++;
(6) let D_j be the set of data in D satisfying outcome j;
(7) attach the node returned by T(D_j) to node v;

Figure 6: Similarity matching between image and text.

As shown in Tables 1 and 2, we compare our proposed method with some of the latest and most advanced methods on the Flickr30k and Microsoft COCO datasets. Our approach works better than the compared methods. Unlike DSPE + FV *, which uses external text corpora to learn discriminative sentence features, our model learns them directly from scratch in an end-to-end manner. Comparing the variants of our method, we can conclude the following: (a) our attention scheme is effective, since the model with attention consistently outperforms those without attention on both datasets; (b) using ResNet as the underlying network helps capture the deep semantics of images and obtain spatial relation features with the help of text context relations.
On the Microsoft COCO dataset, annotation takes 235 seconds with a brute-force algorithm and 18 seconds with the tree-based clustering algorithm.
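For reference, the recall rates R@1, R@5, and R@10 mentioned above can be computed as in the following sketch. It assumes that each query has a ranked list of retrieved item identifiers and a set of correct identifiers, and that a query counts as a hit if any correct item appears in the top K; the identifiers are hypothetical.

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Fraction of queries with at least one correct item among the top k results.

    ranked_results: {query_id: [item_id, ...]} ordered from best to worst match
    ground_truth:   {query_id: set of correct item_ids}
    """
    hits = sum(
        1
        for query, ranked in ranked_results.items()
        if set(ranked[:k]) & ground_truth[query]
    )
    return hits / len(ranked_results)

# Hypothetical image-to-caption rankings for two query images.
rankings = {
    "img1": ["c3", "c1", "c9"],
    "img2": ["c7", "c8", "c2"],
}
truth = {"img1": {"c1"}, "img2": {"c2"}}

r_at_1 = recall_at_k(rankings, truth, 1)   # 0.0: no correct caption ranked first
r_at_2 = recall_at_k(rankings, truth, 2)   # 0.5: img1 is matched within the top 2
r_at_3 = recall_at_k(rankings, truth, 3)   # 1.0: both are matched within the top 3
```

By construction the value can only grow as K increases, which is why results are usually reported at several cutoffs.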

Quantitative Evaluation.
We also report quantitative evaluation results with the frequently used BLEU metric [46] on the two datasets. The results for the Microsoft COCO and Flickr30k datasets are listed in Tables 3 and 4. The image encoders of all methods listed here are either VGG-Net or ResNet, which are prevalent in this field. We also report an ablation test in which the spatial relation model is discarded. The results demonstrate that our full method performs better than the variant without the spatial relation model. Besides, our approach is slightly better than the state-of-the-art methods, which verifies the effectiveness of topic conditioning in image captioning. Note that our model uses ResNet-50 as the encoder, which is a simple attention model. Thus, our approach is competitive with those models.
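To make the metric concrete, a simplified sentence-level BLEU can be sketched as below. It assumes a single reference, uniform n-gram weights, clipped n-gram precision, the standard brevity penalty, and no smoothing; published scores are normally computed at corpus level with multiple references, so this sketch is illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped precisions times brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0   # without smoothing, any empty precision gives zero
        log_precisions.append(math.log(clipped / total))
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = ["a", "dog", "runs", "in", "the", "park"]
perfect = bleu(reference, reference)                        # identical sentences
no_overlap = bleu(["x", "y", "z", "w"], reference)          # no shared words
clipped_unigram = bleu(["the", "the", "the"], ["the", "cat"], max_n=1)
```

The last call shows the effect of clipping: repeating "the" three times is credited only once because the reference contains it once, giving a unigram precision of one third.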

Discussion and Conclusion
This paper proposes an integrated model that recognizes instances and objects jointly by leveraging the associated textual descriptions, and presents a learning algorithm to estimate the model efficiently. The learning process requires only separate images and texts without high-level captions. We use the residual network to deepen the learning of image semantics and combine it with the text to obtain hidden relation features contained in the picture. By constructing a joint inference module with self-attention, we fuse local and global elements. We also show that integrating images and text for deep semantic understanding helps label the spatial relation features. Furthermore, the use of a tree clustering algorithm accelerates the matching process. Experiments verify that the proposed method achieves competitive results on two generic annotation datasets, Flickr30k and Microsoft COCO.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.