Context-Fused Guidance for Image Captioning Using Sequence-Level Training

Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in humanlike sentence generation. However, the explicit separation between encoder and decoder introduces a disconnect between the image and the sentence. It often leads to a rough image description: the generated caption covers only the main instances and unexpectedly neglects additional objects and scenes, which reduces the consistency between the caption and the image. To address this issue, we propose an image captioning system with context-fused guidance in this paper. It incorporates regional and global image representations as compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, the visual concept is employed. To avoid misleading the decoding process, a context fusion gate is introduced to compute the textual context by selectively aggregating the information of the visual concept and the word embedding. Subsequently, the context-fused image guidance is formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture is constructed to generate captions. Moreover, to overcome exposure bias, we train the proposed model through sequential decision-making. Experiments conducted on the MS COCO dataset show the outstanding performance of our work, and the linguistic analysis demonstrates that our model improves the consistency between captions and images.


Introduction
Image captioning, which automatically analyses and converts image content into a natural language description, is drawing considerable attention in the artificial intelligence field. As a typical multimodal task, image captioning combines computer vision and natural language processing. Therefore, a captioning system should not only recognize the salient image objects and other visual properties (attributes, locations, and relations) but also depict the image content with natural and coherent descriptions [1]. Over the past few years, image captioning has been applied in a wide range of areas, such as assistance for visually impaired people [2].
For current image captioning systems, the encoder-decoder architecture has been a widely adopted pipeline because of its strong performance. In general, it employs a convolutional neural network (CNN) to encode the image into a set of feature vectors and a long short-term memory (LSTM) network to generate the captions. Moreover, to steer the model toward focusing on and capturing informative visual features in particular image regions, attention mechanisms are introduced as well [3][4][5].
The encoder-decoder framework has achieved remarkable advances in humanlike caption generation, but there are still some issues to be addressed.
First, to capture the visual and textual information simultaneously, some prior networks [3,4] were designed to learn the sentence structure at a global level. Strictly speaking, the generated caption can only depict the image roughly, because during decoding the network may unexpectedly discard some useful image objects or scenes. This reduces the consistency between the image and the text description. As a solution, a guidance vector is adopted [6][7][8]. In [6], the time-independent guidance was implemented as a joint text-image embedding. However, as pointed out in [7], their approach falls short in two respects: (1) from the view of computer vision, visual evidence is not always essential for the decoder, because the description sentence usually contains salient objects that correspond to visual features; (2) the explicit separation between encoder and decoder usually leads to a representational disconnect between the learned feature vectors and the generated captions. To handle these issues, they constructed a semantic image guidance conditioned on the textual context and image features. It provides the decoder with semantic information at the n-gram word and sentence levels, and the resulting captions include richer image instances than those of [6]. Nevertheless, their approach neglects information about the motions and locations of image objects. In addition, although the sentence-level guidance achieved the best performance, it is not a very efficient approach because of the prepositions, articles, and conjunctions in the sentence. Considering the fact that the instances in image regions do not always correspond to words in the vocabulary, in [8] the global image representation was concatenated with the visual concept [9] as the guidance vector. The visual concept is a set of frequent words that describe the salient image objects, which enhances the correlation between image and text at the regional level. However, there is a latent drawback: an inappropriate word in the visual concept will mislead the language model into generating unexpected captions.
Second, as indicated in [10], models trained with maximum likelihood estimation (MLE) under the vanilla encoder-decoder framework may suffer from exposure bias. The error accumulation caused by MLE probably results in word mismatching during caption generation. To address this issue, the reinforcement learning (RL) strategy has been introduced into the image captioning task. However, due to the high variance of gradient estimation, it is extremely difficult to train the model with the RL strategy directly. To meet this criterion, the self-critical sequence training (SCST) framework [11] was proposed to apply the RL strategy through sequence-level training. During the inference stage, SCST utilizes the generated samples as the baseline to normalize the rewards. Consequently, the network can use nondifferentiable sequence-level metrics (e.g., CIDEr [12]) to evaluate the language quality rather than the word-level cross-entropy loss. Based on this framework, a number of approaches were proposed [13][14][15]. In particular, [14] proposed CAVP to accomplish the visual decision-making task. CAVP captures the visual context that is crucial for compositional reasoning and attends to complex visual compositions over time.
Through this, it significantly boosted the consistency of the caption with the image content. Therefore, to boost this consistency by utilizing reasonable semantic information and informative visual features, an image captioning system with context-fused guidance (CFG) is proposed in this paper. The main idea is illustrated in Figure 1. CFG utilizes compositional visual features for multilevel image learning.
By the context fusion gate, CFG adaptively combines the visual concept and the word embedding. Using the context-fused image guidance, our model can generate captions with comprehensive descriptions. In short, the main contributions of this paper are as follows: (1) An image captioning system using sequential decision-making is proposed for comprehensive caption generation. (2) A context-fused image guidance is formulated to improve the consistency of the caption with the image. It selectively aggregates the semantic information from the visual concept and the word embedding. (3) Evaluation on the MS COCO dataset shows that our approach outperforms the compared methods on most standard metrics. The linguistic analysis demonstrates that our method enhances the correlation between the generated captions and the images.

Related Work

Image Captioning.
In the past few years, image captioning systems based on the encoder-decoder framework have been deeply investigated [3,16]. In [16], they employed a CNN to encode the image and a recurrent neural network to output a sequence of words. Subsequently, many works were proposed to improve and extend this framework. In [17], they proposed a recurrent fusion network (RFNet) to exploit the complementary information from multiple encoders to understand the image comprehensively. In [18], they extracted image features at multiple levels to learn accurate subject predictions. In a very recent investigation [19], an editing network generates the image description by refining an existing caption rather than generating a new caption from scratch. Inspired by the attention mechanism applied in machine translation, several attention-based image captioning systems were proposed. In [3], they integrated the decoder with the proposed hard and soft attention mechanisms to capture the highlighted spatial image regions. In [4], they constructed a combined bottom-up and top-down attention mechanism, which computes attention feature vectors over the objects and other salient regions in the image. In [5], the attention-on-attention module employs an attention gate to transform the result of a standard attention mechanism. Moreover, to improve the semantic representation of the generated captions, some approaches also focus on utilizing specific semantic attributes, such as the visual concept [9]. In [8], the guidance vector is equipped with the visual concept to provide the decoder with high-level semantic information. In [20], they proposed a hierarchical attention network to enhance caption richness by incorporating the visual concept and other visual features.

Sequential Decision-Making.
Models trained with the vanilla CNN-LSTM framework often suffer from exposure bias [10]. To mitigate this, reinforcement learning has been applied to image captioning by introducing sequential decision-making: the agent takes into account the actions, states, and rewards over future sequences. In the case of image captioning, the action corresponds to choosing the next word; the state can be the visual context, the previous predictions, and other information; and the reward can be any evaluation metric, such as BLEU-N [21] and CIDEr [12]. Several works have applied sequential decision-making. In [10], REINFORCE is used to directly optimize a user-specified evaluation metric during training. However, it lacks adequate generality to other evaluation metrics. In [11], the self-critical sequence training (SCST) framework was proposed, in which the generated captions are evaluated at the sentence level. Afterwards, in [13], they incorporated a discriminative loss component into the training objective to produce captions with high discriminability. To capture crucial compositional information in an image, CAVP [14] was proposed to capture complex visual compositions over time. Recently, B-SCST [15] extended the SCST framework for image captioning models by incorporating Bayesian inference. From the distribution obtained by a Bayesian DNN model, B-SCST generates the baseline reward by averaging predictive quality metrics.

Proposed Approach
In this section, we introduce the proposed CFG network in detail. As presented in Figure 2, our model consists of five components: (1) a text encoder, which encodes the visual concept; (2) an image encoder, which encodes the region image features; (3) an attention module, which calculates the attentive compositional visual features; (4) a guidance formulation module, which obtains the fused textual context through the context fusion gate and calculates the context-fused image guidance; and (5) a captioner, an extension of the top-down captioner [4], for caption generation.

Text Encoder.
As the visual concept reveals the objects in an image explicitly, we introduce it to offset the separation between image and text. In this paper, the visual concept is denoted as A = {a_1, a_2, ..., a_m}, where m is the number of words in the visual concept and E is the dimension of the word embedding. Since each word a_j is isolated, a unidirectional LSTM is employed as the text encoder to process A:

w = LSTM(E(A)),

where E(·) is the word embedding layer and w ∈ R^{m×H}, with H the size of the hidden state. w contains the encoded semantic vector of each word in A and will be used to calculate the fused textual context in the guidance formulation module. (Figure 2: given the visual concept a_1, ..., a_m, a unidirectional LSTM is adopted to obtain the encoded vectors w. The region image features r are extracted by a Faster R-CNN, and the image representation r̄ is obtained by pooling over r. In the decoder, a two-layer LSTM architecture is adopted; s_t indicates the fused textual context, and both V_t^{comp} and the context-fused guidance g_t are passed into the language LSTM along with the hidden state h_t^v from the attention LSTM. The input vector X consists of r̄, w, the word embedding, and the hidden state of the language LSTM.)
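For concreteness, the following PyTorch-style sketch illustrates how the visual concept words could be encoded by a unidirectional LSTM as described above; the module name and the sizes E = 300 and H = 512 are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the text encoder (assumed sizes: E = 300, H = 512).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # E(.)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # unidirectional

    def forward(self, concept_ids: torch.Tensor) -> torch.Tensor:
        # concept_ids: (batch, m) indices of the m visual-concept words
        emb = self.embed(concept_ids)   # (batch, m, E)
        w, _ = self.lstm(emb)           # (batch, m, H) per-word semantic vectors
        return w

# Usage: w = TextEncoder(vocab_size=10000)(torch.randint(0, 10000, (2, 5)))
```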

Image Encoder.
For a given image I, to learn the visual information about objects, attributes, and relations, a pretrained Faster R-CNN [22] is adopted to extract the region image representation:

r = Faster R-CNN(I) = {r_1, r_2, ..., r_k}, r_i ∈ R^{2048},

where r_i represents the semantic information of an image region and k indicates the number of ROIs selected according to the ranking scores. To reduce the computational cost, a transformation matrix W_I ∈ R^{H×2048} is applied to r to convert its dimension to r ∈ R^{k×H}. Consistent with prior works, the image representation at the global level is formulated by a mean-pooling operation:

r̄ = (1/k) ∑_{i=1}^{k} r_i,

where r̄ ∈ R^H. Both r and r̄ are used to compute the attentive compositional visual features.
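As a sketch of the image encoder under the same assumptions (H = 512, pre-extracted bottom-up region features), the projection W_I and the mean pooling above could be implemented as follows.

```python
# Sketch of the image encoder: project pre-extracted Faster R-CNN region
# features (k x 2048) to dimension H and mean-pool them for the global vector.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self, region_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.project = nn.Linear(region_dim, hidden_dim, bias=False)  # W_I

    def forward(self, regions: torch.Tensor):
        # regions: (batch, k, 2048) bottom-up features of the k selected ROIs
        r = self.project(regions)   # (batch, k, H) region representation
        r_bar = r.mean(dim=1)       # (batch, H) global (mean-pooled) representation
        return r, r_bar
```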

Compositional Visual Features.
The compositional visual features contain image information at the regional and global levels. As shown in Figure 2 (framed in blue), for the image feature vectors r and r̄, an additive attention mechanism is applied to reduce the variance caused by sampling diverse image regions. Without loss of generality, we first introduce the general form of the attention computation used in this paper:

π = softmax(w_π^T tanh(W_{πq} q + W_{πh} h_t)),

where π indicates the attentive weights over the query vectors q, and h_t stands for the hidden state output by the LSTM unit; w_π^T, W_{πq}, and W_{πh} are the parameters to be learned. Accordingly, for the region image features r, the attention weights are computed as

α_t = softmax(w_α^T tanh(W_{αr} r + W_{αh} h_t^v)),

where w_α^T ∈ R^D, W_{αr} ∈ R^{D×H}, and W_{αh} ∈ R^{D×H}; D indicates the dimension of the attention layer, and h_t^v is the hidden state of the attention LSTM.
Then, the attentive region image feature z_t^r is computed as

z_t^r = ∑_{i=1}^{k} α_{t,i} r_i,

where z_t^r ∈ R^H. In particular, in contrast to previous works that only feed the global image representation into the first LSTM layer, z_t^{r̄} is computed, similarly to equation (5), as the attentive vector of r̄.
Then, we combine z_t^r with z_t^{r̄} as the compositional visual features:

V_t^{comp} = [z_t^r ; z_t^{r̄}],

where [;] indicates vector concatenation. The attentive compositional visual feature z_t^c is obtained as

β_t = softmax(w_β^T tanh(W_{βV} V_t^{comp} + W_{βh} h_t^v)),   z_t^c = ∑_j β_{t,j} V_{t,j}^{comp},

where the learned parameters are w_β^T ∈ R^D, W_{βV} ∈ R^{D×H}, and W_{βh} ∈ R^{D×H}. Compared with z_t^r alone, the decoder can capture more comprehensive visual information from z_t^c at each decoding step. Additionally, z_t^c is also utilized to modulate the guidance vector.
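The attention computations above (for z_t^r, z_t^c, and later z_t^w) share the same additive form. A minimal PyTorch sketch with assumed dimensions D = H = 512 is given below; it is a generic re-implementation of the formula, not the authors' code.

```python
# Generic additive attention: pi = softmax(w_pi^T tanh(W_q q + W_h h_t)),
# followed by a weighted sum of the query vectors.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim: int = 512, attn_dim: int = 512):
        super().__init__()
        self.W_q = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_pi = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        # query: (batch, n, H) feature vectors; h_t: (batch, H) LSTM hidden state
        scores = self.w_pi(torch.tanh(self.W_q(query) + self.W_h(h_t).unsqueeze(1)))
        pi = torch.softmax(scores, dim=1)   # (batch, n, 1) attentive weights
        return (pi * query).sum(dim=1)      # attended feature, e.g. z_t^r or z_t^c
```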

Guidance Formulation.
In [7], Zhou et al. conditioned the guidance information on the current word W_e(y_t) and used the text-conditional image feature V as the guidance:

g_t = V ⊙ W_e(y_t),   (9)

where W_e(·) is a text-conditional embedding matrix. Through this, the model can focus on a part of the semantic image feature when capturing a specific word. In this paper, we extend this formulation with the visual concept vectors w. Intuitively, modulating the semantic image guidance g_t on w only may mislead the generation process because of latent inappropriate words in the visual concept set. Hence, it is essential to adaptively incorporate the semantic information from the word embedding and the visual concept. Inspired by [23], a context fusion gate is introduced; its structure is presented in Figure 3. With this component, our model can learn how much to attend to the context from the two different sources. Utilizing the word embedding and the visual concept, the context fusion gate is defined as

s_t = f_t ⊙ (W_w z_t^w) + (1 − f_t) ⊙ (W_t E(y_{t−1})),

where s_t is the fused textual context, W_w ∈ R^{E×H} and W_t ∈ R^{E×E} are weight matrices, and ⊙ indicates element-wise multiplication. The factor f_t ∈ (0, 1) is calculated by a sigmoid activation function σ as

f_t = σ(W_f [W_w z_t^w ; W_t E(y_{t−1})]),

where W_f is a transformation matrix. z_t^w indicates the attentive semantic vector, which is computed as

c_t = softmax(w_c^T tanh(W_{cw} w + W_{ch} h_t^v)),   z_t^w = ∑_{j=1}^{m} c_{t,j} w_j,

where w_c^T ∈ R^D, W_{cw} ∈ R^{D×H}, and W_{ch} ∈ R^{D×H}. Through this, z_t^w is equipped with the attentive visual information. Taking the attentive compositional visual feature z_t^c (from V_t^{comp}) and s_t, the context-fused image guidance is formulated as

g_t = z_t^c ⊙ tanh(W_s^T s_t),

where W_s ∈ R^{E×H} is a transformation matrix. In comparison with equation (9), the context-fused image guidance g_t contains richer visual and textual context. It will be passed into the captioner as a time-dependent variable.
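To make the gate concrete, here is a minimal sketch of the context fusion gate under the assumptions E = 300 and H = 512; the exact input of the gate transformation W_f (here, the concatenation of the two projected contexts) is an assumption consistent with the description above, not a confirmed detail of the original model.

```python
# Context fusion gate sketch: a sigmoid gate f_t mixes the attentive
# visual-concept vector z_t^w with the previous word embedding E(y_{t-1}).
import torch
import torch.nn as nn

class ContextFusionGate(nn.Module):
    def __init__(self, hidden_dim: int = 512, embed_dim: int = 300):
        super().__init__()
        self.W_w = nn.Linear(hidden_dim, embed_dim, bias=False)  # projects z_t^w
        self.W_t = nn.Linear(embed_dim, embed_dim, bias=False)   # projects E(y_{t-1})
        self.W_f = nn.Linear(2 * embed_dim, embed_dim)           # gate transformation (assumed input)

    def forward(self, z_w: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
        vis = self.W_w(z_w)        # (batch, E) projected visual-concept context
        txt = self.W_t(word_emb)   # (batch, E) projected word-embedding context
        f_t = torch.sigmoid(self.W_f(torch.cat([vis, txt], dim=-1)))  # gate in (0, 1)
        return f_t * vis + (1.0 - f_t) * txt   # fused textual context s_t
```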

Captioner.
The captioner consists of two separate LSTM networks: an attention LSTM (AttLSTM) and a language LSTM (LangLSTM). The input of the AttLSTM is defined as the concatenation of the previous word embedding vector E(y_{t−1}), the previous hidden state h_{t−1}^l from the LangLSTM, the visual concept vector w, and the image representation r̄. That is,

h_t^v = AttLSTM([E(y_{t−1}) ; h_{t−1}^l ; w ; r̄], h_{t−1}^v),

where h_t^v is used to attend over the visual features and the semantic vectors, respectively. The AttLSTM provides the LangLSTM with feature vectors at the global level. In the LangLSTM, the network focuses on generating the caption with both the compositional image feature and the context-fused image guidance g_t:

h_t^l = LangLSTM([z_t^c ; g_t ; h_t^v], h_{t−1}^l).

Then, we apply a multilayer perceptron (MLP) followed by a softmax layer on the hidden state h_t^l to obtain the probability distribution over words:

p_t = softmax(MLP(h_t^l)),

where each value of p_t indicates the probability of the corresponding word in the vocabulary. Overall, our proposed network takes full advantage of image and text information to generate captions elaborately.
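One decoding step of the two-layer captioner described above can be sketched as follows. The dimensions, the use of a pooled visual-concept vector w̄ alongside r̄ in the attention LSTM, and the assumption that the guidance g_t has dimension H are illustrative choices; the attention and gate modules are the sketches given earlier.

```python
# Sketch of one decoding step of the two-layer captioner (AttLSTM + LangLSTM).
import torch
import torch.nn as nn

class Captioner(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.att_lstm = nn.LSTMCell(embed_dim + 3 * hidden_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(3 * hidden_dim, hidden_dim)
        self.mlp = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_emb, w_bar, r_bar, z_c, g_t, h_v, c_v, h_l, c_l):
        # Attention LSTM input: previous word embedding, previous language hidden
        # state, pooled visual-concept vector, and global image representation.
        x_v = torch.cat([word_emb, h_l, w_bar, r_bar], dim=-1)
        h_v, c_v = self.att_lstm(x_v, (h_v, c_v))
        # Language LSTM input: attentive compositional feature, guidance, and h_v.
        x_l = torch.cat([z_c, g_t, h_v], dim=-1)
        h_l, c_l = self.lang_lstm(x_l, (h_l, c_l))
        p_t = torch.softmax(self.mlp(h_l), dim=-1)   # word probability distribution
        return p_t, (h_v, c_v, h_l, c_l)
```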
Training Strategy.
Consistent with prior works [11], the sequence-level training strategy in this paper can be decomposed into two stages: standard supervised learning with the cross-entropy (XE) loss and reinforcement learning with a self-critical reward. The XE loss is formulated as

L_{XE}(θ) = −∑_{t=1}^{N} log p_θ(y_t* | y_{1:t−1}*),

where N is the length of the generated caption, y_{1:t−1}* is the target ground-truth sequence, and θ indicates the model parameters. The supervised model is trained by minimizing this value; the model with the best performance is then chosen as the initial network for the next training stage. During reinforcement learning, the negative expected reward is minimized:

L_{RL}(θ) = −E_{y_{1:T} ∼ p_θ}[r(y_{1:T})],

where r(·) is the standard metric evaluation (CIDEr [12] in this paper). According to SCST [11], the gradient of L_{RL}(θ) can be approximated as

∇_θ L_{RL}(θ) ≈ −(r(y_{1:T}^s) − r(ŷ_{1:T})) ∇_θ log p_θ(y_{1:T}^s),

where y_{1:T}^s is the caption sampled from the word distribution and ŷ_{1:T} is the caption generated by greedy search. The greedy-search reward r(ŷ_{1:T}) serves as the baseline score. The probability of each word in the sampled caption is increased if r(y_{1:T}^s) is higher than r(ŷ_{1:T}), and decreased otherwise.
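A minimal sketch of the self-critical update above: the reward of the greedy caption serves as the baseline for the reward of the sampled caption. The reward function (e.g., a CIDEr scorer) and the decoding routines are assumed to exist elsewhere; cider() in the usage comment is a hypothetical helper, not a specific library call.

```python
# Minimal SCST-style loss sketch (not the authors' training code).
import torch

def scst_loss(sample_log_probs: torch.Tensor,
              sample_reward: torch.Tensor,
              greedy_reward: torch.Tensor) -> torch.Tensor:
    """
    sample_log_probs: (batch,) sum of log p_theta(y^s_t | ...) over the sampled caption
    sample_reward:    (batch,) r(y^s), e.g., CIDEr of the sampled caption
    greedy_reward:    (batch,) r(y_hat), e.g., CIDEr of the greedy caption (baseline)
    """
    advantage = sample_reward - greedy_reward            # self-critical advantage
    return -(advantage.detach() * sample_log_probs).mean()

# Usage (hypothetical helpers):
# loss = scst_loss(logp, cider(sampled, refs), cider(greedy, refs)); loss.backward()
```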

Experiments
In this section, the dataset and the evaluation metrics are introduced first. Then, the implementation details and the compared models are described. Finally, we discuss the quantitative and qualitative experiments.

Dataset and Metrics.
The MS COCO dataset [24] is one of the most popular benchmark datasets for the image captioning task. It contains 82,783 training images, 40,504 validation images, and 40,775 test images. For a fair comparison, the "Karpathy" split (http://cs.stanford.edu/people/karpathy/deepimagesent/) is adopted in this paper. It contains 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. The statistics of these two splits are summarized in Table 1.

Preprocessing.
For the region image representation, we use the bottom-up features provided by [4], which extract the top k = 36 features in each image as salient regions. The visual concept is detected by a pretrained model [9]; only object attributes (nouns) are preserved. We convert all sentences to lowercase and replace punctuation with spaces,
In particular, as the visual features in [7] are extracted by a different CNN, to investigate the performance of different guidance formulations, we also conduct a study on the following ablation models: (1) CFG_V, which only preserves the compositional visual feature and removes the visual concept, context fusion gate, and context-fused image guidance; (2) CFG_E, which adopts the guidance defined in equation (9) and removes the visual concept and the context fusion gate (it is a 1-gram word-level guidance); and (3) CFG_A, in which the factor f_t is removed and the fused textual context s_t is computed by direct vector addition. Their performance is discussed in the Ablation Studies section.

Quantitative Analysis.
The evaluation results on the test portion of the Karpathy split are summarized in Tables 2 and 3. All scores were obtained by beam search with beam size 3. For cross-entropy training (Table 2), our model achieves scores competitive with RAtt-Soft [29]. For sequence-level optimization (Table 3), our model obtains the best scores across all metrics except ROUGE-L and SPICE. Optimized with CIDEr, the scores of CFG on all metrics increase in Table 3; in particular, the CIDEr score improves from 114.0 to 125.4. The comparison results indicate that our model can effectively improve captioning performance by leveraging the compositional visual feature and the context-fused image guidance. Besides, with sequence-level training, our network significantly improves the results on every evaluation metric and outperforms the other models. However, it should also be noted that our model fails to achieve the best SPICE score in both Table 2 and Table 3. As mentioned, SPICE is defined over objects, relations, and attributes. In [29], RAtt-Soft utilizes the scene graph and visual relation features to precisely map visual relationship information to the semantic description. This indicates a limitation of our proposed network.

Qualitative Analysis.
For an intuitive presentation of the captioning behaviour of the model with different guidance formulations, some examples are shown in Figure 4. Compared with CFG_E, the full model CFG can understand the image with the detected salient objects (with a rainbow, holding a racket, next to a glass of beer, and with luggage), whereas CFG_E neglects these instances and focuses only on the main content of the images. In addition, CFG can better recognize the object remote control, while CFG_E mistakes it for a computer keyboard. For the last image, CFG describes the image precisely with the objects pizza, broccoli, and vegetables, while CFG_E only captures the object broccoli and depicts the image at a general level. These examples demonstrate that, in comparison to the guidance modulated on the text-conditional embedding, the context-fused guidance better helps the model depict the image comprehensively. Nevertheless, there are also several shortcomings in our proposed network, shown in the images presented in red frames. For the first image, CFG succeeds in depicting the image with the main instances, but it mistakes the "desk" for a "table" and generates the inappropriate relation "standing around a table." Similarly, in the last image, our model describes the image with an incorrect positional phrase "in the water." This indicates that our network is insufficient for reasoning about accurate relationships, especially among multiple image objects. One possible solution is to introduce the scene graph [30], which contains a complex structural representation of images and sentences.
In Figure 5, we visualize the probabilities of the words in the generated sentence and in the visual concept set, along with the object attention maps. It can be seen that the visual concepts are well exploited to generate the captions. In the first example, the salient instances (man, horse, field, and cows) are captured, and the predicted words correspond closely to the detected visual concepts with high probabilities; the image content is well depicted by the generated sentence. This indicates that our model can exploit high-probability visual concepts to generate the relevant words in captions. For the second image, the weights of "bike" (0.34) and "sunset" (0.33) are much lower than those of "man" (0.86) and "dock" (0.93), but our model can also

Ablation Studies.
The evaluation results of the ablations are given in Table 4. Compared with CFG_V, CFG_E boosts SPICE from 20.3 to 20.5 under cross-entropy training, which suggests the benefit of the text-conditional guidance for image captioning. In comparison to CFG_E, CFG_A achieves a slight advantage under cross-entropy training; after CIDEr optimization, its BLEU-4 and SPICE scores are boosted from 37.8 to 38.1 and from 21.1 to 21.4, respectively. Among these models, CFG still achieves the best performance across all metrics. In particular, the CIDEr score is significantly improved after sequence-level training.
These results indicate the following: (1) the introduced visual concept helps to boost image captioning; (2) the compositional visual feature and the fused textual context are effective in improving captioning quality; and (3) the context fusion gate is beneficial for integrating the context from different sources for better image captioning performance.

Conclusions
In this paper, an image captioning system with context-fused guidance is proposed to enhance the consistency of the caption with the image. Through the compositional visual feature, the context fusion gate, and the context-fused image guidance, our model further boosts this consistency. Extensive experiments demonstrate that our proposed model significantly improves over the baseline method and outperforms other comparison approaches, which confirms the effect of explicitly using context-fused guidance.
However, the visual relation bias is not well handled. In the future, we will extend our network with a scene graph, because it provides a unified representation that connects the objects, attributes, and their relationships in an image or a sentence. With a scene graph, the model can depict an image with more accurate descriptions of object relationships.

Data Availability
The data used to support the findings of this study are included within the article.