CPGAN: An Efficient Architecture Designing for Text-to-Image Generative Adversarial Networks Based on Canonical Polyadic Decomposition

Text-to-image synthesis is an important and challenging application of computer vision. Many interesting and meaningful text-to-image synthesis models have been put forward. However, most of these works focus on the quality of the synthesized images and rarely consider model size. Large models have many parameters and high latency, which makes them difficult to deploy in mobile applications. To solve this problem, we propose CPGAN, an efficient architecture for text-to-image generative adversarial networks (GAN) based on canonical polyadic decomposition (CPD). It is a general method for designing lightweight text-to-image GAN architectures. To improve the stability of CPGAN, we introduce conditioning augmentation and the idea of the autoencoder into the training process. Experimental results show that CPGAN maintains the quality of generated images while reducing parameters and flops by at least 20%.


Introduction
Text-to-image synthesis is a challenging cross-modal generation task that generates images from given texts. It extracts the information shared by the two modalities from the text and transfers the semantic content into images. Text-to-image synthesis plays an increasingly important role in computer vision. In the past, images were edited using other images. With the development of text-to-image synthesis, images can also be edited with text, which greatly expands the applications of computer vision. Text-to-image synthesis can be widely applied in human-computer interaction, such as cross-modal retrieval [1] and artistic creation [2,3].
Traditional text-to-image synthesis used a variational autoencoder (VAE), an attention mechanism, and a recurrent neural network (RNN) to generate images step by step [4,5]. Limited by the generative ability of the VAE, the generated images are not as clear as real images. A new generative model, GAN, was proposed by Goodfellow et al. in 2014 [6]. GAN has become a popular model for image generation tasks because of its strong generative ability. Reed et al. [7] showed that GAN could be used to generate clear images from text descriptions and proposed GAN-int-cls. It uses DCGAN as the backbone and takes the text embedding and random noise as inputs of the generator. The generated images, the text embedding, and real images are the inputs of the discriminator. Subsequently, many sophisticated models were proposed. These models can generate images from general text, scene graphs, or dialog. The quality of generated images has improved considerably.
However, these models introduced many constraints and modules to generate realistic images. These greatly increase the parameters and floating-point operations (flops) of the models. Deploying them requires more and more hardware resources (CPU, GPU, memory, and bandwidth). High complexity also leads to high latency. This greatly limits the application of text-to-image GAN on mobile terminals. It is therefore necessary to compress text-to-image GAN. Among tensor decomposition methods, canonical polyadic decomposition (CPD) is an easy and efficient way to compress and accelerate a model. Many implementations of convolutional neural network (CNN) compression based on CPD [8][9][10] have already been proposed.
In this paper, we propose CPGAN, a general compressed architecture for designing text-to-image GAN with fewer parameters and flops. CPGAN redesigns each layer of the original neural network by using CPD. The original convolution layer is decomposed into three small convolution layers whose sizes depend on the chosen rank. A layer with a smaller rank has fewer parameters. According to the needs of the application, we can design architectures with different compression ratios by setting different ranks. When training models with different ranks, selecting an appropriate learning rate is time-consuming. To this end, we use the cyclical learning rate (CLR) [11] method to select the learning rate for the redesigned architecture. In addition, GAN training is known to be unstable. CPGAN is a deeper architecture than the classical GAN and is difficult to train from scratch. To solve this problem, we add a conditioning augmentation module and introduce the idea of the autoencoder.
Our contributions can be summarized as follows: (i) We propose CPGAN to reduce parameters while maintaining the generative ability of text-to-image GAN. It is a general method for designing lightweight GAN architectures. (ii) To avoid the high resource consumption of the decomposition operation, we train CPGAN from scratch and do not need a pretrained model. To the best of our knowledge, this is the first time CPD has been used to design a text-to-image GAN without a pretrained model. (iii) To stabilize the end-to-end training, we introduce the idea of the autoencoder. The added decoder modules can be removed after training.
Experimental results on two representative cross-modal datasets (Oxford-102 and CUB) show that CPGAN maintains the quality of generated images while effectively reducing the parameters and flops of the original model. On both Oxford-102 and CUB, CPGAN performs better than the original model in inception score (IS) and Fréchet inception distance (FID). On Oxford-102, it removes 8.8 × 10^9 flops and 1.31 × 10^6 parameters. These results show that our architecture can efficiently redesign text-to-image GAN without loss of image quality. The rest of the paper is organized as follows. Work related to our paper is introduced in Section 2. In Section 3, we propose CPGAN, the efficient architecture for text-to-image GAN based on CPD. Section 4 describes the experimental settings and results. Finally, we conclude this paper in Section 5.

Canonical Polyadic Decomposition.
The essence of a neural network is the transformation of an input data matrix by weight parameters. The weights of each layer form a large tensor, which can be decomposed into several small tensors. Canonical polyadic decomposition (CPD) is a standard tensor decomposition method. It was proposed by Hitchcock in 1927 [12]. It decomposes a tensor into a sum of rank-one tensors. CPD has been applied in psychometrics [13], signal processing [14], computer vision [15], data mining [16], and elsewhere. It also performs well in model compression.
Denton et al. [8] used CPD to approximate the original convolution kernel and presented two methods for improving the approximation criterion. They fine-tuned the decomposed kernels while keeping the other layers fixed. Jaderberg et al. [9] applied CPD to decompose a 4D kernel into two small kernels and used two methods to reconstruct the original filters. Lebedev et al. [10] used CPD with nonlinear least squares to decompose the 4D convolution kernel tensor into four small kernels and then replaced the original layer. Then, they fine-tuned the entire network using backpropagation. The method of Lebedev et al. [10] accelerated the second convolutional layer of AlexNet by 6.6 times at the cost of 1% accuracy loss. This exceeded the other two works: Denton et al. [8] obtained a 2 times speed-up and Jaderberg et al. [9] obtained a 4.5 times speed-up at the cost of 1% accuracy loss.
Astrid et al. [17] proposed CP-TPM, a CNN compression method based on CPD. It achieved a 6.98 times parameter reduction and a 3.53 times speed-up on AlexNet, which is better than the Tucker-based method [18] on the same network. Zhang et al. [19] and Tai et al. [20] also applied CPD to compress CNN. In the models of Astrid et al. [17], Zhang et al. [19], and Tai et al. [20], the original layers are pretrained, and the decomposition minimizes the difference between the decomposed layer and the original tensor. Because the CP decomposition operation consumes extensive resources, we do not decompose a pretrained weight tensor but directly use CP decomposition to design an efficient architecture for text-to-image GAN.

Text-to-Image Synthesis.
Text-to-image synthesis is a branch of computer vision that generates images from given texts. It can be used for image editing, cross-modal retrieval, and artistic creation. GAN has strong generative ability. It can generate realistic images and has been widely used for image generation. Since Reed et al. [7] first successfully used GAN for text-to-image generation, GAN has also become a popular model in text-to-image synthesis.
Reed et al. [7] proposed GAN-int-cls by revising DCGAN and successfully generated plausible 64 × 64 images of birds and flowers from texts. To produce high-resolution images, multistage generation was introduced into text-to-image synthesis, as in StackGAN [21], StackGAN++ [22], HDGAN [23], and LAPGAN [24]. StackGAN [21] stacked two conditional GANs to generate high-resolution and plain images in two stages. StackGAN++ [22] used multiple generators arranged in a tree structure to generate images at different scales. HDGAN [23] adopted hierarchically nested discriminators to help a single-stream generator produce high-resolution images. LAPGAN [24] put forward a Laplacian pyramid framework that integrates a set of generators.
Xu et al. [25] and Qiao et al. [26] added attention mechanisms to synthesize images with fine-grained details. In addition, Reed et al. [27] used bounding box and key part information to improve the quality of generated images. ACGAN [28] and TAC-GAN [29] used auxiliary class information to generate diverse images. Because these models show excellent cross-modal generative ability, text-to-image GAN has been used for image editing [30,31], cross-modal retrieval [1], story visualization [2], and painting [3]. However, these models are too complicated to be deployed on mobile devices. To this end, we propose an end-to-end compression framework based on CPD. In contrast to Shu et al. [32] and Li et al. [33], we do not need to pretrain the GAN model. We design and train the compressed model from scratch.

Canonical Polyadic Generative Adversarial Networks (CPGAN)
In this section, we introduce the design of the efficient architecture (CPGAN) and its training process. Section 3.1 describes how to replace 4-dimensional convolutional weight tensors with three small kernels. Section 3.2 describes techniques for stabilizing the training process of the redesigned architecture.

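Canonical Polyadic Decomposition.
A GAN generally consists of a generator and a discriminator, both of which are convolutional neural networks. The weight tensor of a convolution is a 4-dimensional tensor $W \in \mathbb{R}^{K \times K \times S \times T}$, which maps the input $X \in \mathbb{R}^{I \times J \times S}$ into another representation $Y \in \mathbb{R}^{X \times Y \times T}$. It can be written (shown here for unit stride and no padding) as
$$Y(x, y, t) = \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{s=1}^{S} W(i, j, s, t)\, X(x + i - 1,\, y + j - 1,\, s), \tag{1}$$
where the first two dimensions of $W(i, j, s, t)$ are the spatial dimensions ($K$ is typically 3 or 5), the third dimension is the input channel, and the fourth dimension is the output channel. CPD is an approximation method which decomposes a tensor into a sum of rank-one tensors. In CPD, the tensor $W \in \mathbb{R}^{K \times K \times S \times T}$ can be represented as
$$W \approx \sum_{r=1}^{R} w^{(1)}_{r} \circ w^{(2)}_{r} \circ w^{(3)}_{r} \circ w^{(4)}_{r}, \tag{2}$$
where $R$ is the tensor rank, $\circ$ denotes the vector outer product, and $w^{(1)}_{r}$, $w^{(2)}_{r}$, $w^{(3)}_{r}$, and $w^{(4)}_{r}$ are vectors of length $K$, $K$, $S$, and $T$, respectively. Each term of the sum is a rank-one tensor, that is, an outer product of vectors. Rank selection determines the compression ratio, and it is an NP-hard problem in rank decomposition.
In a convolutional layer, the spatial dimension $K$ does not have to be decomposed because the benefits of spatial decomposition are quite small. By using a variant of CP decomposition, the tensor can be decomposed as
$$W(i, j, s, t) \approx \sum_{r=1}^{R} W^{(1)}_{s,r}\, W^{(2)}_{i,j,r}\, W^{(3)}_{t,r}, \tag{3}$$
where $W^{(2)}_{i,j,r}$ is a tensor of size $K \times K \times R$, and $W^{(1)}$ and $W^{(3)}$ are matrices of size $S \times R$ and $T \times R$. Substituting equation (3) into equation (1), we obtain the following approximate representation of the convolution:
$$Y(x, y, t) \approx \sum_{r=1}^{R} \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{s=1}^{S} W^{(3)}_{t,r}\, W^{(2)}_{i,j,r}\, W^{(1)}_{s,r}\, X(x + i - 1,\, y + j - 1,\, s). \tag{4}$$
Rearranging and combining terms, we get the following three consecutive expressions:
$$Y^{(1)}(x, y, r) = \sum_{s=1}^{S} W^{(1)}_{s,r}\, X(x, y, s), \tag{5}$$
$$Y^{(2)}(x, y, r) = \sum_{i=1}^{K} \sum_{j=1}^{K} W^{(2)}_{i,j,r}\, Y^{(1)}(x + i - 1,\, y + j - 1,\, r), \tag{6}$$
$$Y(x, y, t) = \sum_{r=1}^{R} W^{(3)}_{t,r}\, Y^{(2)}(x, y, r), \tag{7}$$
where $Y^{(1)}$ and $Y^{(2)}$ are the intermediate tensors of size $I \times J \times R$ and $X \times Y \times R$, respectively. The original big layer can thus be decomposed into three small layers, as shown in Figure 1. For example, the third convolution layer of GAN-int-cls has 128 input channels, 512 output channels, and 3 × 3 filters (128 × 512 × 3 × 3); we can decompose it into three convolution layers with the following parameters: 128 × R × 1 × 1, R × R × 3 × 3, and R × 512 × 1 × 1. R is the rank, which can be set to different values according to the needs of the task.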
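The rearrangement into three consecutive steps can be checked numerically. The following is a minimal sketch in plain Python, assuming unit stride, no padding, and a middle step that filters each of the R channels separately (as a factor of size K × K × R implies). It rebuilds a full kernel as W(i, j, s, t) = Σ_r W1[s][r] · W2[i][j][r] · W3[t][r] and compares the direct convolution with the 1 × 1, K × K, 1 × 1 sequence. The function names and the tiny tensor sizes are illustrative assumptions and are not taken from the CPGAN implementation.

```python
import random


def conv_direct(X, W, K, S, T):
    """Direct convolution with a full K x K x S x T kernel (stride 1, no padding)."""
    out_h, out_w = len(X) - K + 1, len(X[0]) - K + 1
    return [[[sum(W[i][j][s][t] * X[x + i][y + j][s]
                  for i in range(K) for j in range(K) for s in range(S))
              for t in range(T)]
             for y in range(out_w)]
            for x in range(out_h)]


def conv_three_step(X, W1, W2, W3, K, S, T, R):
    """1x1 conv (S -> R), per-channel K x K conv, then 1x1 conv (R -> T)."""
    in_h, in_w = len(X), len(X[0])
    out_h, out_w = in_h - K + 1, in_w - K + 1
    # Step 1: mix the S input channels into R channels at every pixel.
    Y1 = [[[sum(W1[s][r] * X[x][y][s] for s in range(S)) for r in range(R)]
           for y in range(in_w)]
          for x in range(in_h)]
    # Step 2: spatial K x K filtering, applied to each of the R channels separately.
    Y2 = [[[sum(W2[i][j][r] * Y1[x + i][y + j][r]
                for i in range(K) for j in range(K))
            for r in range(R)]
           for y in range(out_w)]
          for x in range(out_h)]
    # Step 3: mix the R channels into T output channels.
    return [[[sum(W3[t][r] * Y2[x][y][r] for r in range(R)) for t in range(T)]
             for y in range(out_w)]
            for x in range(out_h)]


def max_abs_difference(seed=0):
    """Compare both paths on random data; the sizes are arbitrary small test values."""
    rng = random.Random(seed)
    K, S, T, R, in_h, in_w = 3, 4, 5, 2, 6, 6

    def rand():
        return rng.uniform(-1.0, 1.0)

    W1 = [[rand() for _ in range(R)] for _ in range(S)]                      # S x R
    W2 = [[[rand() for _ in range(R)] for _ in range(K)] for _ in range(K)]  # K x K x R
    W3 = [[rand() for _ in range(R)] for _ in range(T)]                      # T x R
    X = [[[rand() for _ in range(S)] for _ in range(in_w)] for _ in range(in_h)]
    # Rebuild the full kernel from its factors, indexed as W[i][j][s][t].
    W = [[[[sum(W1[s][r] * W2[i][j][r] * W3[t][r] for r in range(R))
            for t in range(T)] for s in range(S)]
          for j in range(K)] for i in range(K)]
    full = conv_direct(X, W, K, S, T)
    fast = conv_three_step(X, W1, W2, W3, K, S, T, R)
    return max(abs(a - b)
               for plane_f, plane_c in zip(full, fast)
               for row_f, row_c in zip(plane_f, plane_c)
               for a, b in zip(row_f, row_c))
```

Calling `max_abs_difference()` should return a value at the level of floating-point rounding, which is what the rearrangement predicts when the kernel has exactly this rank-R form. For a kernel that is only approximated by the factors, the two outputs would differ by the approximation error.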

Overall Framework.
We take the classical model GAN-int-cls as the original model to compress. This model is the most compact in structure and parameters. The main convolution layers of the generator in other text-to-image GAN models are similar to those of GAN-int-cls. We redesign GAN-int-cls to show the effectiveness and generality of our compression architecture. As shown in Figure 2, the proposed CPGAN contains two novel components that stabilize the training of the decomposed GAN: conditioning augmentation and the autoencoder module.
Conditioning augmentation (CA) was proposed by Zhang et al. [21] to alleviate the difficulty of GAN training caused by text embedding sparsity. CA randomly samples the hidden variables used as the input of the generator from the independent Gaussian distribution $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$, where $\varphi_t$ is the text embedding generated by encoding the text description, and $\mu(\varphi_t)$ and $\Sigma(\varphi_t)$ are the mean and diagonal covariance matrix functions of the text embedding $\varphi_t$, respectively. We use a pretrained char-CNN-RNN [34] to obtain the text embedding $\varphi_t$. Then, we feed $\varphi_t$ into CA and obtain $\mu(\varphi_t)$ and $\Sigma(\varphi_t)$. Similar to StackGAN [21], we also add the Kullback-Leibler (KL) divergence to our training objectives, which is the KL divergence between the standard Gaussian distribution $\mathcal{N}(0, I)$ and the conditioning Gaussian distribution $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$, as shown in the following equation:
$$\mathcal{L}_{KL} = D_{KL}\big(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\big). \tag{8}$$
The autoencoder (AE) is used for representation learning by reconstructing its input. The decomposed architecture is deeper than the original model, which increases the instability of training, so we use AE to stabilize the training process. An AE is generally composed of an encoder and a decoder. We regard each convolution layer as an encoder and add a corresponding decoder for each convolution layer. The training objective of the AE is the reconstruction loss. We use the mean square error (MSE) $\| x_1 - h(x_1) \|_2^2$ as the AE loss, where $x_1$ is the input of the layer and $h(\cdot)$ is the function of the AE. The decoders are removed after training.
The generator objective of the original GAN-int-cls contains a matching-aware loss and an interpolation loss, as shown in equation (9), where $z$ is the random noise, $t_1$ and $t_2$ are text embeddings, and $\beta$ is a decimal between 0 and 1 used to interpolate between the text embeddings $t_1$ and $t_2$.
In the generator objective of our model, we add the KL divergence and the MSE reconstruction loss to the original model objective, as shown in equation (10). The discriminator objective of both the original model and our model is the matching-aware loss, given in equation (11).
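The two added loss terms have simple closed forms, which the following minimal sketch illustrates in plain Python. It assumes that a learned layer (not shown, hypothetical here) outputs the mean and the log-variance of the diagonal Gaussian, as is common for conditioning augmentation. The function names and the weights `w_kl` and `w_rec` are illustrative assumptions, since the text does not state how the terms are weighted.

```python
import math
import random


def conditioning_augmentation(mu, log_var, rng):
    """Sample c = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reconstruction_loss(x, x_rec):
    """Mean of squared differences between a layer input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def generator_objective(adversarial_loss, kl, rec, w_kl=1.0, w_rec=1.0):
    """Original generator loss plus the two added terms (weights are assumptions)."""
    return adversarial_loss + w_kl * kl + w_rec * rec
```

The KL term is zero exactly when the conditioning distribution is the standard Gaussian, so it pulls the sampled conditioning variables toward a smooth, dense region around the origin, while the reconstruction term only depends on each layer's input and its decoded copy.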

Scientific Programming
We use the above scheme to train an efficient architecture from scratch. The training algorithm is shown in Algorithm 1. Firstly, the original convolutions are decomposed into three layers through equations (5)-(7). Secondly, each layer is regarded as an encoder and a corresponding decoder is added for each layer. Thirdly, we encode the matching text t and the mismatching text t̂ and obtain the text embeddings. Then, we use CA to process the text embeddings and obtain an independent Gaussian distribution. From this distribution, we sample variables and concatenate them with random noise. The rest of the training process is the same as in GAN-int-cls, but with a different generator objective: the objective function of our model adds the CA loss and the autoencoder loss to the original model's objective function. Once training is finished, we remove the added decoder layers and obtain the CPGAN model.

Experiments
We conduct extensive experiments to evaluate the proposed CPGAN. In Section 4.1, we introduce the experimental dataset and evaluation index. Section 4.2 describes the setting of learning rate and the other experimental hyperparameters. In Section 4.3, we compare our CPGAN with previous GAN-int-cls models for text-to-image synthesis.

Datasets and Evaluation Metrics.
To show the generality of our method, we choose the classic model GAN-int-cls as our original model. As with GAN-int-cls, our method is evaluated on CUB [35] and Oxford-102 [36]. The CUB dataset covers 200 kinds of birds, with 5,994 training images and 5,794 test images. In addition to category labels, each image is annotated with a bounding box, bird key part information, and bird attributes. Oxford-102 is a flower dataset containing 8,189 images. It is divided into 102 categories, and each category contains 40 to 258 images. The images have large scale, pose, and light variations. The dataset is divided into a training set, a validation set, and a test set. Both datasets are benchmark image datasets, and each image corresponds to 10 single-sentence descriptions.
To evaluate our model, we use the inception score (IS) and the Fréchet inception distance (FID) to assess the quality of the generated images. IS uses a pretrained InceptionNet-V3 to judge whether the generated images are clear and diverse. A high IS means that the images are clear and diverse. FID measures the feature distance between real and generated images and supplements IS; a lower FID indicates that the generated images are closer to the real ones. These two indicators are widely used to evaluate the quality of generated images.
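As an illustration of how IS rewards both clarity and diversity, the sketch below computes the score from per-image class probabilities, using the standard definition IS = exp(mean over images of KL(p(y|x) || p(y))), where p(y) is the average of the per-image predictions. In practice the probabilities would come from a pretrained Inception network; here they are plain Python lists, and the function name is an illustrative assumption.

```python
import math


def inception_score(probs):
    """Inception score from a list of per-image class-probability vectors p(y|x)."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y): average prediction over all images.
    marginal = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pk * math.log(pk / mk)
                      for pk, mk in zip(p, marginal) if pk > 0.0)
    return math.exp(kl_sum / n)
```

With this definition, confident predictions spread evenly over n classes give a score of n, while uniform (unclear) predictions or predictions that all fall into one class (no diversity) give a score of 1. Because the marginal p(y) is estimated from the generated samples themselves, the score becomes noisy when only a small number of samples is available.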

Implementation Details.
Learning rate is a very important hyperparameter in deep learning. A reasonable learning rate helps the model converge to a good minimum rather than a poor local optimum or a saddle point. In this paper, we use CLR [11] to set the learning rate and MultiStepLR to set the learning rate decay. CLR was proposed by Smith. It varies the learning rate periodically during training instead of keeping it fixed, and it can be used to find a suitable learning rate automatically instead of through manual experiments. We use CLR to obtain a learning rate setting. The CLR method requires three parameters: the minimum learning rate (min_lr), the maximum learning rate (max_lr), and the iteration count. min_lr and max_lr are the smallest and largest values of the learning rate, respectively. The iteration count is the number of test iterations at each learning rate. We increase the learning rate from 0.00001 to 0.001 and record the loss curve under different learning rates (see Figure 3).
We choose the appropriate learning rate according to the maximum absolute slope criterion. According to Figure 3, we select 0.0002 and 0.00015 as the learning rates for the Oxford-102 dataset and 0.0001 and 0.00008 as the learning rates for the CUB dataset.
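A learning rate sweep of this kind can be sketched as follows. This is a minimal illustration in plain Python under two assumptions that the text does not confirm: the sweep is spaced geometrically between min_lr and max_lr, and the "maximum absolute slope" is taken as the steepest decrease of the loss per unit of log10(learning rate). The function names are hypothetical, and any loss values passed in would come from short training runs at each learning rate.

```python
import math


def lr_sweep(min_lr, max_lr, steps):
    """Geometrically spaced learning rates from min_lr to max_lr (inclusive)."""
    return [min_lr * (max_lr / min_lr) ** (k / (steps - 1)) for k in range(steps)]


def steepest_descent_lr(lrs, losses):
    """Learning rate at the start of the interval where the loss falls fastest.

    Slopes are measured per unit of log10(lr); only decreasing intervals count.
    Returns None if the loss never decreases.
    """
    best_lr, best_slope = None, 0.0
    for k in range(len(lrs) - 1):
        step = math.log10(lrs[k + 1]) - math.log10(lrs[k])
        slope = (losses[k + 1] - losses[k]) / step
        if slope < best_slope:
            best_slope, best_lr = slope, lrs[k]
    return best_lr
```

For example, with made-up losses `[1.0, 0.9, 0.5, 0.8]` recorded at learning rates `[1e-5, 1e-4, 2e-4, 1e-3]`, the steepest drop lies between 1e-4 and 2e-4, so the function returns 1e-4. These numbers are synthetic and are not read from Figure 3.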
MultiStepLR is a learning rate decay method in PyTorch. We describe it with three hyperparameters: the initial learning rate (ini_lr), the epoch at which the learning rate is updated (epo), and the multiplication factor (mfc). ini_lr is the initial learning rate during training, epo is the epoch at which we change the learning rate, and mfc is the decay coefficient of the learning rate. In an experiment using MultiStepLR, the learning rate starts at ini_lr. After the experiment has run for epo epochs, the learning rate is changed to ini_lr * mfc.
In this paper, we set the MultiStepLR hyperparameters ini_lr, epo, and mfc to 0.0001, 600, and 0.8 on CUB and to 0.0002, 600, and 0.75 on Oxford-102. The batch size in our experiments is 64. The optimizer of CPGAN is Adam [37] with a momentum of 0.5.
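The resulting schedule can be written out directly. The sketch below mimics the step decay in plain Python so that it runs without PyTorch; `multistep_lr` is an illustrative helper, not a library function. In PyTorch itself the corresponding scheduler is `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[600], gamma=0.75)`, where `milestones` plays the role of epo and `gamma` the role of mfc.

```python
def multistep_lr(ini_lr, milestones, mfc, epoch):
    """Learning rate at a given epoch: ini_lr times mfc for every milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return ini_lr * mfc ** passed


# Settings quoted in the text: (ini_lr, epo, mfc) for each dataset.
settings = {"Oxford-102": (0.0002, 600, 0.75), "CUB": (0.0001, 600, 0.8)}
schedule = {name: (multistep_lr(lr, [epo], mfc, epo - 1),
                   multistep_lr(lr, [epo], mfc, epo))
            for name, (lr, epo, mfc) in settings.items()}
```

With these settings the learning rate drops from 0.0002 to 0.00015 on Oxford-102 and from 0.0001 to 0.00008 on CUB at epoch 600, which coincides with the pair of learning rates quoted for each dataset.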

Comparison with Original Model.
In CP decomposition, rank represents compression ratio and it is hard to select. Due to the need of text-to-image synthesis task, we design the lightweight model on the premise of ensuring the quality of generated images. We do extensive experiments to balance the performance and the compression ratio.
As shown in Table 1, we run a large number of experiments to find this balance. The ratio is the rank ratio, where 1.0 is full-rank decomposition and 0.9 means that the rank is about 0.9 times the number of input channels of the original layer. A layer with a smaller rank has fewer parameters. Table 1 shows that flops and parameters grow as the rank increases. When the rank ratio is close to 0.7, the parameters begin to exceed those of the original model (5.76 × 10^6). As the rank increases further, the quality of the generated images does not improve much. The FID first decreases and then changes only slightly as the number of model parameters increases, while IS is not stable. A possible reason is that computing IS requires the marginal distribution of the data, and the generated samples on Oxford-102 are not sufficient to estimate it accurately.
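How the rank ratio translates into layer size can be illustrated for a single layer. The sketch below counts weights (biases ignored) for the layer shapes quoted earlier, that is, an original c_in × c_out × k × k convolution replaced by c_in × R × 1 × 1, R × R × k × k, and R × c_out × 1 × 1 layers, with R taken as the rank ratio times the number of input channels. The function names and the rounding rule are illustrative assumptions.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def cp_params(c_in, c_out, k, rank):
    """Weights in the three replacement layers: 1x1, k x k on R channels, 1x1."""
    return c_in * rank + rank * rank * k * k + rank * c_out


def rank_from_ratio(c_in, ratio):
    """Rank as a fraction of the input channels (rounding is an assumption)."""
    return max(1, round(ratio * c_in))


# Single-layer example: 128 input channels, 512 output channels, 3 x 3 filters.
original = conv_params(128, 512, 3)
per_ratio = {ratio: cp_params(128, 512, 3, rank_from_ratio(128, ratio))
             for ratio in (0.3, 0.5, 0.7, 0.9, 1.0)}
```

For this layer the count grows with the rank ratio, from 77,824 weights at ratio 0.5 to 229,376 at ratio 1.0, against 589,824 for the original layer. These single-layer counts only illustrate the trend; the totals in Table 1 cover the whole model and cannot be derived from one layer.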
As shown in Table 1, FID reaches its best value when the rank ratio is 0.5. At this ratio, the model is compressed by about 23% in parameters and 29% in flops, and the generated images are better than those of the original model on both FID and IS. This indicates that our method can generate better images with fewer parameters than the original model. Using CP decomposition to reconstruct the model is an effective way to design a compact text-to-image GAN without loss of image quality. Although 8.8 × 10^9 flops and 1.31 × 10^6 parameters are removed, the images generated by CPGAN improve slightly on IS and FID. This shows that a model with more parameters does not necessarily generate better images. We therefore search around a rank ratio of 0.5 for a better model that preserves the quality of generated images. Table 2 compares our best generative model with the original model on IS, FID, parameters, and flops. On Oxford-102, the FID and IS of the original model are 79.55 and 2.66 ± 0.03, while those of our best model are 74.40 and 3.68 ± 0.08, respectively. On CUB, the images generated by our best model obtain 65.94 on FID and 5.03 ± 0.07 on IS, while those of the original model obtain 68.79 and 2.88 ± 0.04, respectively. Representative images on the Oxford-102 and CUB datasets are compared in Figures 4 and 5, respectively. The better images generated by CPGAN indicate that our proposed method can generate more realistic images from text descriptions. These results also suggest that existing text-to-image GAN models contain redundant parameters. A more concise and efficient text-to-image GAN model can be designed based on CPD.
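The quoted percentages can be cross-checked against the absolute figures given in the text. The first two lines below use only reported numbers; the last line is an inference, since the flops of the original model are not stated in the text, and it simply shows the baseline that a 29% reduction of 8.8 × 10^9 flops would imply.

```latex
\begin{align*}
\frac{1.31 \times 10^{6}}{5.76 \times 10^{6}} &\approx 0.227 \quad (\text{about } 23\% \text{ of the parameters removed}), \\
5.76 \times 10^{6} - 1.31 \times 10^{6} &= 4.45 \times 10^{6} \quad (\text{parameters remaining}), \\
\frac{8.8 \times 10^{9}}{0.29} &\approx 3.0 \times 10^{10} \quad (\text{implied flops of the original model}).
\end{align*}
```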
(1) Use equations (5)-(7) to decompose the original convolutional layers in the generator;
(2) Add the CA module for text embedding and add decoder layers;
(3) Select an appropriate learning rate for the decomposed model;
(4) For N = 1 to S do
(5) Encode the text description into embedding t;
(6) Feed t into CA and obtain N(μ(φ_t), Σ(φ_t));
(7) Sample c from N(μ(φ_t), Σ(φ_t)) and random noise z;
(8) Concatenate z and c and feed the result into the generator;
(9) Update discriminator D by equation (11);
(10) Update generator G by equation (10);
(11) End for
(12) Discard all decoders and obtain the trained CPGAN.
ALGORITHM 1: Overall scheme of CPGAN algorithm.

Conclusions
In this paper, we propose a simple and efficient architecture, CPGAN, based on CPD. CPGAN substantially reduces the parameters and flops of the original model while also improving the quality of generated images. In designing the CPGAN model, we replace each convolution layer with three small CP-decomposed layers to achieve compression. To stabilize the training process, we introduce conditioning augmentation to reduce the instability caused by text embedding sparsity. To further improve the end-to-end training of our model, the idea of the autoencoder is integrated into the model. Each decomposed layer is regarded as an encoder layer and is paired with an added decoder layer. The decoder layers can be removed after training. Experiments demonstrate that, on Oxford-102, CPGAN reduces parameters by about 23% and flops by about 29% with a slight improvement in generated image quality.
Extensive experimental results demonstrate that CPGAN can be used to design an efficient text-to-image GAN. We have also decomposed similar convolution layers in other GAN models, and the results were similar to those for GAN-int-cls. The main convolution layers of the generator in other text-to-image GAN models are similar to those of GAN-int-cls, so CPD is also applicable to other cross-modal GANs. In the existing methods, the rank is set manually, which is time-consuming. Therefore, the automatic selection of rank may be a future research direction.
Data Availability
The datasets used in this paper are public datasets which can be accessed via the following websites: http://www.vision.caltech.edu/visipedia/CUB-200-2011.html and https://www.robots.ox.ac.uk/~vgg/data/flowers/102/.

Figure 4: Representative images on the Oxford-102 dataset. Text descriptions shown:
This flower is pink and yellow in color, with petals that are rounded.
This flower has clusters of deep red blossoms with rounded petals arranged vertically along the stem.
This flower has small pointed green sepals holding wavy white with yellow petals.
This flower has wide yellow petals whose shapes are curved softly.
This flower has petals that are purple with white shading.
This flower is yellow in color and has petals that are skinny and bunched together.

Figure 5: Representative images on the CUB dataset (columns: text description, GT, GAN-int-cls, our model). Text descriptions shown:
This is a small bird, with blue primaries, red flank, and crown, with a large bill for its body.
This is a black bird with gray and white wings and a bright yellow belly and chest.
A grey backed bird with a white belly and narrow grey legs.
This bird has a shiny navy plumage on its back, and a white breast and belly.
This bird has a brown crown as well as a white belly.
A large bird with a black body, tarsus, and hooked black bill.