Weakly Supervised GAN for Image-to-Image Translation in the Wild



Introduction
Unsupervised image-to-image translation is the process of learning an arbitrary mapping between two categories, domains, or classes of images without labels. This greatly reduces the reliance on paired datasets and extends the range of applications for image translation tasks. For example, we can translate zebras into horses. Unsupervised image translation can therefore meet a variety of needs. Previous models assume that the shared latent space between different categories can be captured from the given categories (e.g., zebras to horses). Unfortunately, besides well-designed datasets from given categories, many examples come from wild categories (e.g., cats to dogs) and have unusual shapes and sizes (adversarial examples for short), so the shared latent space is difficult to capture, which causes these models to collapse.
Prior works such as CycleGAN [1] capture the cyclic space between well-designed categories (e.g., zebras to horses and summer to winter) without paired data. However, the cyclic mapping of CycleGAN is one-to-one, and its limitations section reports unnatural results on some special datasets. Recent works such as MUNIT [2] capture the shared latent content space between well-designed categories (e.g., cats to dogs) without paired data. Though MUNIT can build multiple mappings and generate natural results, it struggles to capture the shared latent content space for wild categories with many adversarial examples.
In this paper, we take a further step towards unsupervised image-to-image translation research for wild datasets with many adversarial examples. Our global model can be divided into two parts. For the first stage, inspired by FaceNet [3], we use SSIM (structural similarity) [4] [6]. These GAN models [7][8][9] can map from noise inputs to realistic images and have produced promising results in image translation. The Pix2pix model [10] applies a conditional GAN to model the mapping function. It has been applied to numerous tasks such as sketch to photo, image colorization, and photo to map. Although it produces high-quality results, training the model requires paired data.

Image-to-Image Translation.
The CycleGAN [1] model is proposed for unpaired image translation and relies on a cycle consistency loss term. It showed some success when applied to a range of classic image translation tasks such as zebras to horses. Its failure cases include overrecognizing objects and being unable to change the shape of an object during translation (e.g., outputting a cat-shaped dog). Other works tackle translation problems that involve larger shape changes.
The Contrast-GAN [11] model introduces an adversarial distance comparison objective for optimizing one conditional generator and several semantic-aware discriminators. The MUNIT [2] model assumes that images in different domains share a common content space but not the style space. The GANimorph [12] model introduces dilated convolutions in its discriminator architecture.
The discriminator output then facilitates a more fine-grained information flow from the discriminator to the generator. However, for wild datasets with many adversarial examples, all of the above models collapse.

Proposed Method
Given examples from a category, such as images of cats, our goal is to translate them into dogs. In this paper, $x$ and $y$ denote examples from categories $X$ and $Y$. Our method consists of two stages. In the first stage, we train a weakly supervised model that automatically reduces the number of adversarial examples within each category. Because the dogs dataset (http://www.recod.ic.unicamp.br/∼rwerneck/datasets/flickr-dog/) contains many adversarial examples, previous models collapsed on the cats-to-dogs task. Therefore, it is necessary to reduce the number of adversarial examples. $x_w$ and $y_w$ denote the weakly supervised examples from categories $X$ and $Y$. As long as the SSIM distance between the examples $\{x_w\}_{w=1}^{n}$ is greater than 0.3, we can get the global shared latent space. In our weakly supervised model, the source category and the target category are the same. Figure 1(a) shows the network structure of the first stage.
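The 0.3 threshold rule can be illustrated with a small sketch. Several points are assumptions rather than details taken from this paper: the "SSIM distance" is treated as the SSIM score itself, SSIM is computed over a single global window on flattened grayscale pixels in [0, 1], and the hypothetical helper `select_target` falls back to the weakly supervised example most similar to the candidate when the threshold is not met.

```python
from statistics import fmean


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM between two equally sized flat lists of pixel values."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean([(p - mu_a) ** 2 for p in a])
    var_b = fmean([(q - mu_b) ** 2 for q in b])
    cov = fmean([(p - mu_a) * (q - mu_b) for p, q in zip(a, b)])
    luminance = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (var_a + var_b + c2)
    return luminance * contrast_structure


def select_target(x2, weak_examples, threshold=0.3):
    """Pick the example the generator output should resemble (hypothetical helper)."""
    scores = [global_ssim(x_w, x2) for x_w in weak_examples]
    if min(scores) > threshold:
        return x2  # x2 is close enough to every weakly supervised example
    # Assumption: otherwise use the weak example that is most similar to x2.
    return weak_examples[scores.index(max(scores))]
```

A production pipeline would more likely use a windowed SSIM from an image library; the global version is used here only to keep the sketch dependency-free.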
In the second stage, an image-to-image translation model is trained. $C_x$ (e.g., the identity matrix) and $C_y$ (e.g., the inverse identity matrix) denote the inverse category codes of categories $X$ and $Y$. Inspired by multimapping [5], we assume that the two-dimensional $C_x$ and $C_y$ can constrain the local shared features across categories $X$ and $Y$. For wild datasets with many adversarial examples, the global shared latent space is difficult to capture. We use the two inverse category codes to hide some detailed features of each example and to reduce the effect of the adversarial examples, so that the correct mapping can be established. Though we only get the local shared features across categories, this prevents the model from collapsing. We then use two encoders $E_x$ and $E_y$ to capture the local shared latent space for unsupervised image-to-image translation. Figure 1(b) shows the network structure of the second stage.

Objective.
In the first stage, our model contains two mappings, $G: X \to Y$ with its discriminator $D_Y$ and $F: Y \to X$ with its discriminator $D_X$. To automatically reduce the number of adversarial examples for each category separately, $x_1, x_2$ come from the same category and $y_1, y_2$ come from the same category when we train the model in this stage. We first introduce the adversarial losses, which are used to judge whether the generated images and the input images are real or fake. The adversarial losses are expressed in equation (1).
CycleGAN argues that, for each example $x$ from category $X$, the image translation cycle should be able to bring $x$ back to the original image; that is, $x \to G(x) \to F(G(x)) \approx x$. This is called forward cycle consistency. Similarly, for each image $y$ from category $Y$, $G$ and $F$ should also satisfy backward cycle consistency: $y \to F(y) \to G(F(y)) \approx y$. The cycle consistency loss enforces both conditions.
For the first stage, the source examples and the target examples are selected randomly from the same dataset. To automatically reduce the number of adversarial examples, we use the weakly supervised examples $x_w$ (e.g., $x_1$, $x_2$ rather than $x$) to generate more normal examples from category $X$. For example, given $x_1$ from category $X$, the generated example $x_2' = G(x_1)$ should be similar to $x_2$ when the minimum SSIM distance between $x_w$ and $x_2$ is greater than 0.3. In other cases, the generated example $x_2' = G(x_1)$ should be similar to $x_w$. This leads us to propose a Sim loss, in which the subscript $s$ denotes the structural similarity of two tensors, so that $\|x_1 - x_2\|_s$ is the SSIM distance between $x_1$ and $x_2$. The loss uses the minimum SSIM distance between the input examples $x_1$, $x_2$, the output of the generator network $G(x_1)$, and all the weakly supervised examples. In the first stage, our Sim-GAN model learns maps from $X$ to $X$ and from $Y$ to $Y$.
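For reference, the adversarial and cycle consistency terms can be written in the form used by CycleGAN [1]. This is a sketch under the assumption that the first stage adopts the standard CycleGAN formulation unchanged (log-likelihood adversarial loss, L1 cycle loss); the exact expressions used in this work may differ.

```latex
\begin{align}
\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) &= \mathbb{E}_{y}\big[\log D_Y(y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D_Y(G(x))\big)\big] \\
\mathcal{L}_{\mathrm{cyc}}(G, F) &= \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big]
\end{align}
```

The first term of the cycle loss corresponds to forward cycle consistency and the second to backward cycle consistency.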
After the above process, we obtain the first-stage objective. In this stage, our global objective function consists of three parts: the GAN losses, the Cyc losses, and the Sim losses. The parameter values we used are $\lambda_1 = 10$ and $\lambda_2 = 60$. FGX and FGY mean that we use the first stage to reduce the number of adversarial examples for categories $X$ and $Y$.
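One plausible reading of the three-part objective is a weighted sum. The assignment of $\lambda_1$ to the cycle term and $\lambda_2$ to the Sim term is an assumption based on the order in which the terms are listed, and $\mathcal{L}_{\mathrm{Sim}}$ is a placeholder name for the Sim loss.

```latex
\begin{equation}
\mathcal{L}_{\mathrm{stage\,1}} = \mathcal{L}_{\mathrm{GAN}}
  + \lambda_1 \, \mathcal{L}_{\mathrm{cyc}}
  + \lambda_2 \, \mathcal{L}_{\mathrm{Sim}},
\qquad \lambda_1 = 10,\ \lambda_2 = 60
\end{equation}
```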
Figure 1: (a) In the first stage, given an example $x_1$, our goal is to translate it into an example $x_w$ or $x_2$. First, generator $G$ is used to translate an example $x_1$ into an example $x_2'$. Then discriminator $D$ distinguishes between the generated example $x_2'$ and the real one, $x_2$ or $x_w$. Finally, generator $F$ is used to translate $x_2'$ into $\hat{x}_1$, where $\hat{x}_1$ denotes the generated result. In this stage, FGX and FGY mean that we use the first stage to reduce the number of adversarial examples for categories $X$ and $Y$. Here, Cyc loss denotes the cycle consistency loss between $x_1$ and $\hat{x}_1$. (b) In the second stage, $E_x$ and $E_y$ denote the encoders of categories $X$ and $Y$. First, generator $G$ is used to translate $x_1$, the category code $C_x$, and the encoded representation of image $x_1$ into an example $y_1'$. Then the discriminator $D$ distinguishes between the generated example $y_1'$ and the real example $y_1$.
In the second stage, our Sim-GAN model learns maps between the two categories $X$ and $Y$. We introduce encoders $E_x$ and $E_y$ and category codes $C_x$ and $C_y$ for our model. Here, we use $s$ to denote the structural similarity of the two tensors, and the SSIM loss is built on this measure.
In this stage, we introduce variational autoencoder (VAE) [13] type encoders $E_x$ and $E_y$ to get the local shared latent space. Our goal is to use the random Gaussian distribution $\mathcal{N}(0, I)$ to represent the local shared features. The VAE loss is defined in terms of the Kullback-Leibler divergence $D_{KL}(p \| q) = \int p(z) \log \frac{p(z)}{q(z)} \, dz$, where $p$ and $q$ are the latent distributions and $z$ is the latent vector from the VAE-like encoder. To force the generator to use the latent vectors $z_x$ and $z_y$, we add a latent vector reconstruction loss. Specifically, when $y$ is input to $E_y$, we get $z_y$, which is then input to the generator network $G(y, z_y, C_y)$. $z_y'$ is paired with $z_y$ because $y' = G(y, z_y, C_y)$.
After the above process, we obtain the objective function for the second stage. TGX means that we use the second stage to learn maps from category $X$ to category $Y$. In this stage, our global objective function consists of five parts: the GAN losses, the Cyc losses, the SSIM losses, the VAE losses, and the reconstruction losses. The parameter values we used are $\lambda_1 = 10$, $\lambda_2 = 7$, $\lambda_3 = 0.01$, and $\lambda_4 = 10$. Finally, the workflow of example $x$ can be expressed as follows:
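When the encoder outputs a diagonal Gaussian, the KL divergence to $\mathcal{N}(0, I)$ has a well-known closed form. The sketch below assumes that parameterization (a mean vector and a log-variance vector per example), which is common for VAE-type encoders but is not stated explicitly here; the function names are illustrative.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
```

The divergence is zero exactly when the encoder output matches the standard normal prior and grows as the mean moves away from zero or the variance moves away from one.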

Generator Network.
The goal of the generator network is to generate learned features. For the first stage, we use the ResNet structure with an encoder-decoder framework, which contains two stride-2 convolution layers for downsampling, six residual blocks, and two stride-2 transposed convolution layers for upsampling. In order to get more local features, we use local response normalization [14] for all the convolutional layers. For the second stage, some details and spatial information may be lost in the downsampling process. We use the ResNet structure with a decoder framework, which contains two stride-2 convolution layers for downsampling, six residual blocks, and two stride-2 transposed convolution layers for upsampling. We replace all normalization layers except those in the upsampling layers with CBIN (central biasing instance normalization) layers [5]. CBIN adaptively adjusts the different distributions of the input feature maps with learnable parameters, which allows the category code to manage the different tasks. We use the category code to label the different mappings in the generator.
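The effect of the downsampling and upsampling layers on spatial resolution can be checked with simple arithmetic. The kernel size (3), padding (1), and output padding (1) below are assumptions chosen so that each stride-2 layer exactly halves or doubles the resolution; they are not specified in the text. The function names are illustrative.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a strided transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def generator_spatial_trace(size):
    """Spatial size after each part: input, 2 downsampling layers, residual blocks, 2 upsampling layers."""
    trace = [size]
    for _ in range(2):  # two stride-2 convolutions
        trace.append(conv_out(trace[-1]))
    trace.append(trace[-1])  # residual blocks keep the resolution unchanged
    for _ in range(2):  # two stride-2 transposed convolutions
        trace.append(conv_transpose_out(trace[-1]))
    return trace
```

Under these assumptions a 256-pixel side is reduced to 64 before the residual blocks and restored to 256 at the output, which is where fine spatial detail can be lost.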

Discriminator Network.
For the first stage, we use one discriminator network to distinguish between the real example and the weakly supervised example. For the second stage, we use two discriminator networks to discriminate between real and fake images at different scales.

Encoder.
Our encoders consist of three convolution layers followed by four residual blocks to downsample the input examples. In order to get more features, we use instance normalization for all the convolutional layers. Note that the output of the encoder is used in our generator network.

Experiments
To explore the generality of the Sim-GAN model, we test the method on a variety of tasks including human faces to animes, human faces to cats, human faces to dogs, and cats to dogs. We carry out the experiments for unpaired image-to-image translation on four open source datasets. We implement the Sim-GAN model in the open source TensorFlow framework and use GTX 1080 Ti GPUs for both training and testing. We first optimize the dataset to deal with the problem that the model does not converge on it, and then use the trained model to process the input data in the second stage and perform the image translation task. We use our model to handle the four tasks above and compare it experimentally with the most advanced models on the same tasks. Finally, we record various performance indicators for testing [15][16][17][18].

Datasets and Preprocessing.
Before starting the experiment, we resize the images to 256 × 256. Each training batch randomly loads one image from the source category and one image from the target category. We use a total of four public datasets for testing and comparison. The CelebA dataset [19] contains 202,599 celebrity face images (faces for short). The Getchu dataset [19] contains 26,752 anime character face images with a clean background (animes for short).
The Flickr-Dog dataset (http://www.recod.ic.unicamp.br/∼rwerneck/datasets/flickrdog/) has 42 classes and 374 photos (dogs for short). The CAT dataset [20] (cats for short) includes 10,000 cat images. In each image, the head of the cat is annotated with nine points: two for the eyes, one for the mouth, and six for the ears.
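The unpaired batch loading described above amounts to drawing one source item and one target item independently. A minimal sketch, assuming the datasets are available as lists of file paths (the function name and the path lists are hypothetical):

```python
import random


def sample_unpaired_batch(source_paths, target_paths, rng=random):
    """Draw one source item and one target item independently (batch size 1)."""
    if not source_paths or not target_paths:
        raise ValueError("both categories need at least one image")
    return rng.choice(source_paths), rng.choice(target_paths)
```

Because the two draws are independent, no correspondence between the source and target images is ever assumed, which is what makes the training unpaired.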
We conduct the cats to dogs, human faces to cats, human faces to dogs, and human faces to animes tasks separately. The experimental results of the first stage on the dogs to dogs task are shown in Figure 2.
As shown in Figure 2, Sim-GAN can generate dogs that are close to the weakly supervised dogs. In this way, we can automatically reduce the number of adversarial dogs. The experimental results of the second stage on these tasks are shown in Figure 3. In Figure 3, we find that Sim-GAN can generate objects closer to the target objects for the four tasks. This means that Sim-GAN can capture the local shared latent space and prevent the model from collapsing. In the first stage, we reduce more than

Evaluation Index.
Using the same evaluation metrics, we compare our method against several baselines qualitatively and quantitatively.

AMT.
For these tasks, we run "real vs fake" perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs. We follow the same perceptual study protocol from Isola et al. [10], and we gather data from 50 participants per algorithm we tested. Participants were shown a sequence of pairs of images, one a real image and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real.

Classification (Cf for Short).
We train three Xception [21] based binary classifiers for each image dataset. The baseline is the classification accuracy on real images. Higher classification accuracy means that the generated images may be easier to distinguish.

Consistency (Cs for Short).
We compare the domain consistency between real images and generated images by computing the average distance in feature space. We use the cosine similarity to evaluate the perceptual distance in the feature space of the VGG-16 network [22] pretrained on ImageNet [23]. We sum across the five convolution layers preceding the pooling layers. The larger the value, the more similar the two images. In the test stage, we randomly sample a real image and a generated image from the same domain to make up a data pair. We then compute the average distance over all pairs.
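The consistency score can be sketched as follows, assuming the five per-layer feature maps have already been extracted and flattened into plain lists of floats; the VGG-16 feature extraction itself is not shown, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equally sized vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_score(real_feats, fake_feats):
    """Sum of per-layer cosine similarities for one (real, generated) pair."""
    return sum(cosine_similarity(r, f) for r, f in zip(real_feats, fake_feats))


def mean_consistency(pairs):
    """Average consistency score over a list of (real_feats, fake_feats) pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(consistency_score(r, f) for r, f in pairs) / len(pairs)
```

With five layers, the score for a single pair lies between -5 and 5, and identical feature maps give the maximum value.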

Base Model Comparison.
Here, we evaluate the performance of different models. To be fair, we use the same dataset and ensure that each model reaches a convergence state. The experimental results of the Sim-GAN model on these tasks are shown in Figure 4 (the result "ours 2" indicates that only the second stage of our model is used to generate data).
Figure 4 shows that our model generates better examples that are closer to the target category than those of other models. The CycleGAN model cannot handle these tasks; it only learns part of the style mapping.
The Imo-GAN model completes a few tasks but lacks some details; it only learns most of the content mapping. The MUNIT model completes most tasks; it learns the right content mapping and style mapping. The experimental results show that, for the cats to human faces task, all the models except CycleGAN produce natural results, which means that the local shared latent space is close to part of the global shared latent space.
This is because the cats and faces datasets

[Figure panels: cats to dogs, human to cats, human to dogs, human to animes]
Furthermore, the reason for some similar image translation results between MUNIT and our method is that the local shared latent space and the global shared latent space may intersect under certain conditions. The comparison for the four tasks on the three evaluation metrics is shown in Table 1.
As can be seen from Table 1, our model achieves better numerical results than the other models. This means that we not only reduce the number of adversarial examples but also successfully capture the local shared latent space for unsupervised image-to-image translation.

Limitations.
Although our model is able to generate semantically plausible and visually pleasing examples for wild datasets with many adversarial examples, it has some limitations. The first limitation is that we are not able to generate desired results based on conditions. This will be addressed in the next study. Our model can avoid collapse on wild datasets, but the weakly supervised model reduces the number of examples. Therefore, our image translation results lack diversity, which will be discussed in future work. The second limitation is that the image translation results are not similar in head pose. The main reason for this dissimilarity is the dataset: the similarity belongs to the global latent space, while the dissimilarity in head pose belongs to the local latent space. The last limitation is the use of a network pretrained on ImageNet for the consistency evaluation.
Though we use a network pretrained on ImageNet for the consistency evaluation of dogs, cats, and animes, the VGG-Face model is critical for face consistency evaluation. We will use it for the consistency evaluation of the cats to human faces, dogs to human faces, and animes to human faces tasks.

Conclusion
This paper studies the use of a novel Generative Adversarial Network model for image-to-image translation in cases where other models collapse. We assume that the shared latent space can be classified as global and local, and we design a weakly supervised Similar GAN (Sim-GAN for short) to capture the local shared latent space rather than the global shared latent space. We first introduce a loss based on the SSIM (structural similarity) distance with weakly supervised examples, which Sim-GAN uses to automatically reduce the number of adversarial examples within each category. We then introduce the category codes to constrain the local shared features across categories and the encoders to capture the local shared latent space for unsupervised image-to-image translation. Experiments on four public datasets show that our model significantly outperforms state-of-the-art baseline methods.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
Mathematical Problems in Engineering