Text to Realistic Image Generation with Attentional Concatenation Generative Adversarial Networks


Introduction
In recent years, with the rise of artificial intelligence and deep learning, natural language processing and computer vision have become hot research fields. Text-to-image generation, a basic problem at the intersection of these fields, has also attracted the attention of many researchers. Text-to-image generation is the generation of a realistic image that matches a given text description, which requires processing the fuzzy and incomplete information in natural language descriptions. It drives the development of multimodal learning and cross-modal generation and shows great potential in applications such as cross-modal information retrieval, photo editing, and computer-aided design.
Since Goodfellow et al. [1] proposed Generative Adversarial Networks (GANs) in 2014, the network model has received extensive attention from academia and industry. With their continuous development, GANs have been widely used to generate realistic, high-quality images from text descriptions. The commonly used method [2][3][4][5] encodes the entire text description into a global sentence vector, which is input to the generator as the condition variable of the GAN to generate an image. However, due to the large structural differences between text and images, the use of only word-level attention does not ensure the consistency of global semantics, and it is difficult to generate complex scenes; moreover, fine-grained word information is still not explicitly used for generating images. Therefore, the generated image does not contain enough details and is still significantly different from the real image.
To address this issue, this paper proposes Attentional Concatenation Generative Adversarial Networks (ACGAN). The network adopts a multilevel cascade structure in which the generator and discriminator at each level are composed of convolution units. New network layers are added level by level during training, so that the added generator and discriminator layers process the details of the higher-resolution image. At the same time, the deep attentional multimodal similarity model is introduced into the network, focusing on the fine-grained word-level information in the semantics. The word vector is used as an input of the generator; under the constraint of the word vector, the details of the generated image are emphasized while the overall shape of the image remains unchanged, the cross-modal consistency between image and semantics is maintained, and the generation process is smooth. Finally, a measure of diversity is added to a layer of the discriminator to influence its decision, so that the generator can obtain more diverse gradient directions, which increases the diversity and improves the quality of the generated samples. The contribution of our method is threefold: (i) A multilevel cascade structure is proposed, which improves the resolution of the generated image and can generate high-resolution images of up to 1024 × 1024. (ii) An attention mechanism model is introduced into the network, which makes the details of the generated image richer by attending to the fine-grained word-level information in the semantics. (iii) A measure of diversity is added to the discriminator, which increases the diversity and improves the quality of the generated samples.

Related Works
Generative image modeling is a fundamental problem in computer vision. There has been remarkable progress in this direction with the emergence of deep learning techniques. Variational Autoencoders (VAEs) [6,7] aim to maximize the lower bound of the data likelihood. Autoregressive models (e.g., PixelRNN) [8], which use neural networks to model the conditional distribution of the pixel space, have also generated appealing synthetic images. Recently, Generative Adversarial Networks (GANs) have shown promising performance in generating sharper images and video [9][10][11]. For example, Eghbal-zadeh et al. [12] proposed Mixture Density Generative Adversarial Networks to improve the clarity and quality of generated images. Gecer et al. [13] combined a generative adversarial network with a deep convolutional neural network to reconstruct a 3D facial structure from a single face image. However, training instability makes it hard for GAN models to generate high-resolution images. Much work has been proposed to stabilize training and improve image quality [14][15][16][17][18][19].
Generating high-resolution images from text descriptions, though very challenging, is important for many practical applications such as art generation and computer-aided design. Lyu et al. [9] learn a joint embedding to establish the relationship between natural language and real images and then train a GAN to generate 64 × 64 images conditioned on text descriptions. Cao et al. [10] proposed Stacked Generative Adversarial Networks, which decompose the complex problem of generating high-quality images into more manageable subproblems and generate 256 × 256 high-resolution images.
Recently, attention models have been widely used in computer vision and natural language processing, for example, in object detection [20,21], video captioning [22], and visual question answering [23,24]. Xu et al. [25] introduced the attention mechanism into the GAN and proposed Attentional Generative Adversarial Networks, which instruct the generator to focus on different word-level fine-grained information when generating different image subregions. Qiao et al. [26] proposed a global-to-local collaborative attention module that uses word attention and global sentence attention to enhance the consistency between generated images and semantics.

The Proposed Model.
The Attentional Concatenation Generative Adversarial Networks model proposed in this paper consists of two parts: the attentional concatenation generative adversarial networks and the deep attentional multimodal similarity model. As shown in Figure 1, the model is divided into multiple levels; each level contains a generator G and a discriminator D. Using a multilevel cascade structure, generators and discriminators are added level by level, and a new residual network layer is added continuously during training, corresponding to the generation of images from low resolution to high resolution.
The Deep Attentional Multimodal Similarity Model contains a common semantic space; it maps the subregions of the image and the word vectors of the sentence into this space and measures the word-level similarity between image and text. Instead of adopting a one-step approach, the training process of the entire model first generates low-resolution images, then continuously increases the resolution, and finally generates high-resolution, high-quality images.

Concatenation Generative Adversarial Networks.
The generative network has k generators $(G_0, G_1, ..., G_{k-1})$, which take the hidden states $(h_0, h_1, ..., h_{k-1})$ as input and generate images of different resolutions. The hidden states are computed from the following quantities: z is a noise vector, usually sampled from a standard normal distribution; $c_s$ is a global sentence vector; $c_w$ is a word vector; and $F^{ca}$ represents the Conditioning Augmentation [10] that converts the sentence vector $c_s$ to the conditioning vector. Training starts with both the generator G and the discriminator D at a low spatial resolution of 64 × 64 pixels. As training advances, we incrementally add layers to G and D, and all existing layers remain trainable throughout the process. When doubling the resolution of G and D, we fade in the new layers smoothly: during the transition, we treat the layers that operate on the higher resolution like a residual block whose weight increases linearly from 0 to 1. Then we add a new residual layer and transform the word features into the semantic space of the image features. Based on the hidden feature h of the image, a word context vector is calculated for each subregion of the image.
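The fade-in step described above can be illustrated with a minimal sketch. It assumes a plain linear blend between the upsampled output of the existing low-resolution path and the output of the newly added layers; the function names (`upsample2x`, `fade_in`, `alpha_schedule`) and the use of nested Python lists instead of TensorFlow tensors are illustrative assumptions, not the authors' implementation.

```python
def upsample2x(image):
    """Nearest-neighbour 2x upsampling of a 2-D grid (list of rows)."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))  # duplicate the widened row
    return out


def fade_in(low_res_output, new_layer_output, alpha):
    """Blend the upsampled low-resolution output with the new layers' output.

    alpha = 0 keeps only the old (upsampled) path;
    alpha = 1 keeps only the newly added high-resolution layers.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    upsampled = upsample2x(low_res_output)
    return [
        [(1.0 - alpha) * old + alpha * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(upsampled, new_layer_output)
    ]


def alpha_schedule(step, transition_steps):
    """Linear ramp of the blending weight from 0 to 1 over the transition."""
    return min(1.0, max(0.0, step / transition_steps))
```

In such a scheme the weight would be updated once per training step during the transition, so the new layers take over gradually rather than disturbing the already trained lower-resolution layers.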
Finally, the image features and the corresponding word context features are combined to generate an image in the next stage. In order to generate a realistic image with multiple levels (sentence level and word level) of conditions, the final objective function of the attention generation network is defined as

$$L = L_G + \lambda L_{DAMSM}, \qquad L_G = \sum_{i=0}^{k-1} L_{G_i}. \tag{2}$$

Here, λ is a hyperparameter to balance the two terms of equation (2). The first term is the GAN loss that jointly approximates conditional and unconditional distributions. At the i-th stage of the ACGAN, the adversarial loss for the generator $G_i$ consists of an unconditional loss and a conditional loss.

The unconditional loss determines whether the image is real or fake, while the conditional loss determines whether the image and the sentence match. As shown in Figure 2, for unconditional image generation, the discriminator is trained to distinguish between real images and generated images. For conditional image generation, images and condition variables are input to the discriminator to determine whether the image matches the condition, which guides the generator to approximate the conditional image distribution. Each discriminator $D_i$ is trained to classify its input as real or fake by minimizing a cross-entropy loss, where $x_i$ is drawn from the true image distribution $p_{data_i}$ at the i-th scale and $\hat{x}_i$ is drawn from the model distribution $p_{G_i}$ at the same scale. The discriminators $D_i$ of the ACGAN are structurally disjoint, so they can be trained in parallel, and each of them focuses on a single image scale.
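For reference, the stage-wise adversarial losses can be sketched as follows. This assumes that ACGAN follows the formulation of AttnGAN [25], on which it builds; the factor 1/2 and the exact notation are assumptions taken from that formulation and are not confirmed by the text above.

```latex
% Assumed form, following AttnGAN [25]; not confirmed by the text above.
% Generator loss at stage i: unconditional term + conditional term.
L_{G_i} = -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i)\big]
          -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i, c_s)\big]

% Discriminator loss at stage i: cross-entropy on real vs. generated samples,
% again with an unconditional and a conditional part.
L_{D_i} = -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{data_i}}\big[\log D_i(x_i)\big]
          -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log\big(1 - D_i(\hat{x}_i)\big)\big]
          -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{data_i}}\big[\log D_i(x_i, c_s)\big]
          -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log\big(1 - D_i(\hat{x}_i, c_s)\big)\big]
```

In each loss, the terms that take only an image are the unconditional part, and the terms that also take the sentence vector $c_s$ are the conditional part.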

Deep Attentional Multimodal Similarity Model.
The Deep Attentional Multimodal Similarity Model (DAMSM) [25] learns two neural networks that map subregions of the image and words of the sentence to a common semantic space, thus measuring the image-text similarity at the word level to compute a fine-grained loss for image generation.
This paper first uses a standard convolutional neural network to transform an image into a set of feature maps. Each feature map represents some subregions of the image. The dimension of the feature map is equal to the dimension of the word vector, and they are treated as equivalent entities. Next, for each token in the text, attention is applied to the feature map, and the weighted average of the region features is calculated. Finally, the DAMSM is trained to minimize the difference between this attention vector and the corresponding word vector.
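The attention step described above can be sketched as follows. The sketch assumes dot-product similarity followed by a softmax over image regions and uses cosine similarity to compare each word with its attended region vector; the smoothing factor `gamma` and all function names are illustrative assumptions, and the normalization details of the full DAMSM [25] are omitted.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word, regions, gamma=1.0):
    """Attention-weighted average of the region features for one word.

    word:    word vector of dimension D
    regions: list of region feature vectors, each of dimension D
    gamma:   assumed sharpening factor applied before the softmax
    """
    weights = softmax([gamma * dot(word, region) for region in regions])
    return [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(len(word))
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def word_region_relevance(words, regions, gamma=1.0):
    """Similarity between every word and its attended region vector."""
    return [cosine(word, attend(word, regions, gamma)) for word in words]
```

A word whose vector aligns with one region receives most of its attention weight from that region, so its relevance score approaches 1 as `gamma` grows; training would push these scores up for matching image-text pairs.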
In the word-level matching loss, the superscript "w" stands for "word". Symmetrically, we also minimize a second word-level loss, in which P is the posterior probability that sentence $S_i$ is matched with its corresponding image $M_i$. Finally, the DAMSM loss $L_{DAMSM}$ is defined by combining these matching losses. Using the attention mechanism, the DAMSM is able to compute the fine-grained text-image matching loss $L_{DAMSM}$. $L_{DAMSM}$ is applied only to the output of the last generator, because the ultimate goal of this paper is to generate high-resolution images through the last generator. If $L_{DAMSM}$ were applied to the images generated by all generators $(G_0, G_1, ..., G_{k-1})$, the computational cost would increase greatly while the performance would not improve.
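As a reference point, the matching losses can be sketched as below, assuming they take the same form as in the DAMSM of AttnGAN [25]. The batch size B and the sentence-level terms $L_1^{s}$ and $L_2^{s}$ are assumptions carried over from that formulation; the text above only names the word-level terms.

```latex
% Assumed form, following the DAMSM of AttnGAN [25].
% B is the number of image-sentence pairs in a batch (hypothetical symbol).
L_1^{w} = -\sum_{i=1}^{B} \log P(S_i \mid M_i), \qquad
L_2^{w} = -\sum_{i=1}^{B} \log P(M_i \mid S_i)

% Sentence-level counterparts L_1^{s}, L_2^{s} are defined analogously
% from the global sentence vector and the global image feature.
L_{DAMSM} = L_1^{w} + L_2^{w} + L_1^{s} + L_2^{s}
```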

Standard Deviation of Measuring Diversity.
GANs usually tend to capture only part of the variation found in the training data. To capture more of this variation, this paper greatly simplifies the "minibatch discrimination" approach while also improving the variation of the generated samples. Feature statistics are calculated not only from a single image but also across the entire minibatch, which encourages minibatches of generated images and training images to display similar statistics. A minibatch layer is added toward the end of the discriminator; the layer learns a large tensor that converts the input into a set of statistics. Finally, a separate set of statistics is produced for each instance and concatenated to the output of the layer so that the discriminator can use the statistics internally.
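A minimal sketch of a minibatch standard-deviation statistic is given below. It assumes the parameter-free variant suggested by the section title: the standard deviation of each feature over the minibatch is averaged into a single value and appended to every sample as an extra feature. The learned projection tensor mentioned above is not modeled, and the function name and flat list layout are illustrative assumptions.

```python
import statistics


def minibatch_stddev_feature(batch):
    """Append a minibatch diversity statistic to every sample.

    batch: list of N samples, each a list of F feature values.
    Returns a new batch in which every sample has F + 1 features; the
    extra feature is the per-feature standard deviation over the
    minibatch, averaged over all F features.
    """
    num_features = len(batch[0])
    per_feature_std = [
        statistics.pstdev([sample[f] for sample in batch])
        for f in range(num_features)
    ]
    mean_std = sum(per_feature_std) / num_features
    return [list(sample) + [mean_std] for sample in batch]
```

If the generator collapses to nearly identical samples, the appended value is close to zero for generated minibatches but not for real ones, which gives the discriminator a direct signal about missing diversity.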

Experimental Environment and Data.
The algorithm is implemented with the deep learning framework TensorFlow [27]. The experimental environment is the Ubuntu 14.04 operating system, with four NVIDIA 1080Ti graphics processing units (GPUs) to accelerate computation. All models were trained on the CUB [28] and Oxford [29] datasets. As shown in Table 1, the CUB dataset contains 200 species of birds with a total of 11,788 images. In this paper, 8855 images are used as the training set and 2933 images as the test set. Since 80% of the bird images in the dataset have an object-to-image size ratio of less than 0.5 [28], we preprocess all images before training to ensure that the bird region occupies more than 0.75 of the image size. The Oxford

Evaluation Metrics.
GAN models are usually evaluated qualitatively, that is, by manually inspecting the visual fidelity of the generated images. This method is time-consuming and subjective and can be misleading. Therefore, this paper mainly uses two evaluation criteria to evaluate the quality and diversity of the generated images.

Inception Score.
We choose the numerical assessment approach "inception score" [16] for quantitative evaluation:

$$I = \exp\left(\mathbb{E}_{x}\, D_{KL}\big(p(y \mid x)\,\|\,p(y)\big)\right),$$

where x denotes one generated sample, y is the label predicted for the sample by the Inception model, p(y) is the marginal distribution, and p(y|x) is the conditional distribution. The KL divergence between the marginal distribution p(y) and the conditional distribution p(y|x) should be large, which indicates that diverse, high-quality images are generated. In the experiments, an Inception model was applied to the CUB dataset, and samples from each model were evaluated.
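The inception score can be computed from predicted class distributions as in the following sketch. It assumes that the conditional distributions p(y|x) have already been obtained from a classifier and are passed in as plain lists; the function name and input layout are illustrative, and the common practice of averaging the score over several splits is omitted.

```python
import math


def inception_score(probs):
    """Inception score from per-sample class distributions.

    probs: list of N distributions p(y|x), each a list of class
           probabilities that sums to 1.
    Returns exp of the mean KL divergence between p(y|x) and the
    marginal p(y) estimated as the average of the rows.
    """
    n = len(probs)
    num_classes = len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        for c in range(num_classes):
            if p[c] > 0:  # terms with p = 0 contribute nothing
                kl_sum += p[c] * math.log(p[c] / marginal[c])
    return math.exp(kl_sum / n)
```

The score is 1 when every sample yields the same class distribution and reaches the number of classes when predictions are confident and evenly spread over all classes.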

Human Rank.
We use human rank for qualitative assessment. Fifty text descriptions were randomly selected from the CUB and Oxford test sets, and for each sentence, each generative model generated 5 images. The five images and the corresponding text descriptions were presented to different people, who ranked the image quality of the different methods; finally, the average ranking was calculated to evaluate the quality and diversity of the generated images.
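The averaging step can be sketched as follows, assuming each judgment is recorded as a mapping from method name to rank, with 1 as the best rank. The function name and data layout are hypothetical; the paper does not describe how the rankings were stored.

```python
def average_rank(judgments):
    """Average rank per method over all judgments.

    judgments: list of dicts, one per (judge, text description) pair,
               mapping a method name to its rank (1 = best).
    Returns a dict mapping each method name to its mean rank.
    """
    collected = {}
    for judgment in judgments:
        for method, rank in judgment.items():
            collected.setdefault(method, []).append(rank)
    return {method: sum(ranks) / len(ranks) for method, ranks in collected.items()}
```

With this convention, the method with the smallest average value is the one the judges preferred.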

Experimental Result
The inception scores and human ranks of various models on the CUB and Oxford datasets are compared in Table 2. As can be seen from the table, the inception score of the ACGAN model on the CUB dataset is 2.75% higher than that of the AttnGAN model (an increase from 4.36 to 4.48). The experimental results show that ACGAN achieves a higher inception score than the other GAN models and, from an intuitive visual point of view, a lower human rank than the other GAN models (a lower rank is better). This shows that the quality and diversity of the sample images generated by the model in this paper have been enhanced and that the images are closer to real images.
Subjective visual comparisons between the three models StackGAN++, AttnGAN, and ACGAN on the CUB dataset are presented in Figure 3. It can be seen that, for some examples, the images generated by StackGAN++ and AttnGAN lose details, the colors are inconsistent with the text descriptions (1st and 2nd row), and the shapes look strange (2nd and 3rd column). ACGAN achieved better results than AttnGAN, with more details and more consistent colors and shapes. For example, the wings are vivid in the 3rd and 4th row. By comparing ACGAN with AttnGAN, we can see that ACGAN contributes to producing fine-grained images with more details and better semantic consistency. For example, the color of the bird in the 2nd column was corrected to black. By comparing ACGAN (256 × 256) with ACGAN (1024 × 1024), we can see that the images generated by ACGAN (1024 × 1024) have higher definition, more vivid colors, and more lifelike details. Generally, images in the CUB dataset contain relatively little content; therefore, it is easier to generate visually realistic and semantically consistent results on CUB. These results confirm that ACGAN is better than AttnGAN and that the generated images are closer to real images.
Detailed (beak, wings) comparisons of the results of the three models StackGAN++, AttnGAN, and ACGAN on the CUB dataset are presented in Figure 4. It can be seen that the beak, wings, and feet of the bird are clearer and that the edges and details are more realistic in the images generated by the ACGAN in this paper. For example, the beak of the bird in the 4th column is more vivid and conforms to the text description. Compared with StackGAN++ and AttnGAN, ACGAN has achieved better results.
The corresponding visual comparisons on the Oxford dataset are presented in Figure 5, and detailed (petals) comparisons of the results are presented in Figure 6. It can be seen that, for some examples, the images generated by GAN-INT-CLS and StackGAN++ lose details and the shapes look strange (1st and 2nd row). ACGAN achieved better results than StackGAN++, with more details and more consistent colors and shapes. For example, the overall shape of the flowers is clearer, and the details of the petals are more obvious in the 4th row. These results confirm that ACGAN is better than StackGAN++ and that the generated images are closer to real images.

Conclusions
This paper adds an attention mechanism and a multilevel cascade structure to generative adversarial networks. The attention mechanism attends to the fine-grained word-level information in the semantics and enriches the details of the generated images, while the cascade structure generates higher-resolution images. Experiments show that, on the same dataset, the images generated by the Attentional Concatenation Generative Adversarial Networks have clearer edge details and local textures than those generated by the compared networks, making the generated images closer to real images. Although this method has achieved good results in generating images, it is still difficult to model complex real-life scenes, and how to deal with this problem needs further study. In addition, the generated images are similar to the training data and lack diversity. Therefore, we intend to combine zero-shot learning with generative adversarial networks to synthesize images of new categories, which will be the focus of the next step.

Data Availability
The data used in this article were downloaded from the Internet. Two datasets are used: (1) CUB is a public dataset that can be downloaded from http://www.vision.caltech.edu/visipedia/CUB-200-2011.html. (2) Oxford is a public dataset that can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html.
Discrete Dynamics in Nature and Society