The Defense of Adversarial Example with Conditional Generative Adversarial Networks

Deep neural network approaches have made remarkable progress in many machine learning tasks. However, the latest research indicates that they are vulnerable to adversarial perturbations. An adversary can easily mislead the network models by adding well-designed perturbations to the input. The cause of adversarial examples is unclear; therefore, it is challenging to build a defense mechanism. In this paper, we propose an image-to-image translation model to defend against adversarial examples. The proposed model is based on a conditional generative adversarial network, which consists of a generator and a discriminator. The generator is used to eliminate adversarial perturbations in the input. The discriminator is used to distinguish generated data from original clean data to improve the training process. In other words, our approach maps adversarial images to clean images, which are then fed to the target deep learning model. The defense mechanism is independent of the target model, and the structure of the framework is universal. A series of experiments conducted on MNIST and CIFAR10 show that the proposed method can defend against multiple types of attacks while maintaining good performance.


Introduction
Deep learning [1][2][3][4][5] is a hierarchical machine learning method involving multilevel nonlinear transformations, and it is good at mining abstract and distributed feature representations from raw data. Deep learning can solve many problems that are considered challenging in machine learning. Recently, driven by the emergence of big data and hardware acceleration, deep learning has made significant progress in numerous domains, such as computer vision, natural language processing, edge computing [6][7][8][9][10], and services computing [11][12][13], and it has promoted the large-scale application of artificial intelligence technology in the real world. While deep learning has achieved great success, its performance and applications are also questioned due to its lack of interpretability [14]: we cannot reasonably explain the decisions made by deep learning models. This exposes deep learning-based artificial intelligence applications to potential security risks. Many studies have shown that deep learning is threatened by multiple attacks, such as membership inference attacks [15,16] and attribute inference attacks [17]. The most serious security threat to deep learning is the adversarial example [18], proposed by Szegedy et al. in 2013. An adversary can add small-magnitude perturbations that are imperceptible to humans to the inputs and thereby easily fool a well-performing deep learning model [19]. The perturbed inputs are called adversarial examples, and they make the target model report high confidence in incorrect predictions. Moreover, recent research shows that artificial intelligence applications in the real world can be exposed to adversarial examples [20], for example, face recognition systems [21] and the vision systems of autonomous cars [22].
With the in-depth study of adversarial examples, the field has developed along three main trends. (1) A growing number of methods for constructing adversarial examples are being proposed. According to adversarial specificity, these attack methods can be divided into targeted and nontargeted attacks. In a targeted attack, the adversary submits well-designed inputs that cause the target model to produce maliciously chosen outputs, as in R + LLC [23], JSMA [24], EAD [25], and C&W [26]. In a nontargeted attack, the adversary only needs the target model to misclassify well-designed inputs into any class other than the ground truth, as in FGSM [27], BIM [20], PGD [28], and DeepFool [29]. Worse, adversarial examples keep becoming more robust, which makes detection and defense increasingly difficult. (2) The cost of constructing adversarial examples is decreasing. Owing to the transferability [30] of adversarial examples, an adversary can launch a successful attack without background knowledge of the target model. (3) The range of attacks is expanding. Adversarial examples can also successfully attack other kinds of deep learning models, such as reinforcement learning models and recurrent neural networks. Moreover, attack scenarios are not limited to computer vision; the same security risks exist in text [31] and speech [32]. Therefore, building an effective defense mechanism against adversarial examples is crucial in deep learning.
There is no consensus on the cause of adversarial examples, which makes building a defense mechanism challenging. In general, there are two classes of approaches to defending against adversarial examples: (1) making deep neural networks more robust by adjusting their learning strategies, such as adversarial training [27,33] and defensive distillation [34]; and (2) detecting adversarial examples or eliminating adversarial noise after the deep neural networks are built, such as LID [35], Defense-GAN [36], MagNet [37], and ComDefend [38]. These defense mechanisms have several bottlenecks. First, some defense mechanisms are effective only against specific attacks. For example, defensive distillation is effective against gradient-based attacks, but it is defeated by the C&W attack. Second, some methods require large numbers of samples and high computational costs, which limits their application scenarios. Third, the difference between an adversarial example and the corresponding clean example is small, so it is difficult for current detection methods to distinguish them with high confidence. In summary, we seek a defense mechanism that performs well against most attacks at a low computational cost.
Our work makes progress toward a better defense mechanism against adversarial examples in computer vision. Adversarial examples mislead the target model mainly because the added noise changes the characteristics of the original inputs; thus, an intuitive approach is to remove the noise from adversarial examples by learning a mapping from adversarial examples to clean examples. In computer vision, this can be posed as "translating" an input image (the adversarial example) into a corresponding output image (the clean example). In this paper, we use the framework proposed by Isola et al. [39] as a defense mechanism. Based on conditional generative adversarial networks (conditional GANs) [40], the framework consists of a generator network that translates adversarial images into clean images and a discriminator network that ensures the generated images are realistic. Our method can effectively eliminate adversarial perturbations and restore the characteristics of the original clean images. An overview of the defense model is shown in Figure 1. The advantages of our method are as follows: (1) The proposed method is a general-purpose defense framework. On the one hand, the defense mechanism processes the input and is model independent, which means that the target model does not need to be retrained. On the other hand, the network structure of the defense framework is based on a general-purpose image-to-image translation solution, so it can be applied to different scenarios with only a few adjustments. (2) Our method is simple and easy to use, and it is effective against the commonly considered attack strategies, such as FGSM, DeepFool, JSMA, and C&W.
Moreover, this defense mechanism shows a degree of transferability: a defense built for a specific target model is still effective for other models. The remainder of the paper is organized as follows. Section 2 introduces related work on adversarial examples. Section 3 reviews the necessary theories and concepts of adversarial examples and conditional GANs. Section 4 describes the proposed defense framework in detail. Section 5 presents the experimental results, and Section 6 concludes the paper.

Related Works
In this section, we introduce applications of GANs in the field of adversarial examples: generating adversarial examples with GANs and defending against adversarial examples with GANs.

Generating Adversarial Examples with GANs.
Xiao et al. [41] proposed AdvGAN to generate adversarial examples. AdvGAN takes a clean image x as the input of the generator G and obtains the adversarial image as x + G(x).
The adversarial examples generated by AdvGAN achieve high attack success rates in both semi-whitebox and black-box attacks. Song et al. [42] designed an unrestricted approach to generating adversarial examples with an auxiliary classifier generative adversarial network (AC-GAN) [43]. Unlike perturbation-based attacks, this approach constructs adversarial examples entirely from scratch instead of perturbing an existing data point. In addition, the adversary can specify the style of the generated adversarial examples and the labels they are misclassified as by the target model. Zhao et al. [44] noticed that adversarial perturbations are often unnatural and not semantically meaningful. They proposed a framework consisting of a WGAN [45] and an inverter. The inverter maps a clean image to a dense latent vector z, and the WGAN generator takes a perturbed version of z as its input. The goal of the generator is to synthesize an image that is as close to the original image as possible. This method can generate natural and legible adversarial examples that lie on the data manifold. Hu and Tan [46] focused on adversarial examples in traditional security scenarios. They proposed MalGAN to generate adversarial malware examples that are able to bypass black-box machine learning-based detection models.

Defending against Adversarial Examples with GANs.
Lee et al. [47] introduced a novel adversarial training framework named generative adversarial trainer (GAT). The framework consists of a generator and a classifier. The generator attempts to generate adversarial perturbations that can easily fool the classifier, and the classifier attempts to correctly classify both original and generated adversarial images.
This approach can improve the robustness of the model, and it outperforms other adversarial training methods that use the fast gradient method. Santhanam and Grnarova [48] proposed Cowboy, an approach to defending against adversarial attacks with GANs.
This work shows that adversarial examples lie outside the data manifold learned by a GAN trained on the same dataset. They used the GAN's discriminator to detect adversarial examples and its generator to eliminate adversarial perturbations. Samangouei et al. [36] proposed a framework named Defense-GAN, which leverages the expressive capability of a generative model (WGAN) to defend against adversarial examples. Defense-GAN searches for a latent input whose generated image is close to the adversarial example and feeds this input to the WGAN generator. Then, the generated image is fed to the target model.

Background
In this section, we introduce four methods for generating adversarial examples. We also discuss GANs and their connection to our method.

Generating Adversarial Examples.
The main idea of generating adversarial examples is to add appropriate perturbations to input samples so that the perturbed samples remain as similar as possible to the original inputs yet mislead the target model. Briefly, for a given input image x, the adversary needs to find a minimal perturbation η and craft the adversarial example as x* = x + η. In recent years, many methods for generating adversarial examples have been proposed; here, we introduce some of the most well-known attacks.

Fast Gradient Sign Method (FGSM) [27]. Szegedy et al. first introduced adversarial examples against deep neural networks and proposed a method named L-BFGS [18] to generate them; however, it was time-consuming and impractical. In 2014, Goodfellow et al. argued that the primary cause of neural networks' vulnerability to adversarial perturbations is their linear nature. Based on this explanation, they proposed a simple and fast method for generating adversarial examples, named the fast gradient sign method (FGSM). Let θ be the parameters of a target model, x the input to the model, y the label associated with x, and J(θ, x, y) the cost function used to train the model. The adversarial example is generated as

$$x^* = x + \varepsilon \cdot \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right),$$

where ε is a parameter that determines the perturbation size.

DeepFool [29]. FGSM is simple and effective; however, it introduces relatively large perturbations to the inputs. Moosavi-Dezfooli et al. observed that the minimal perturbation that moves an input across the closest decision boundary is orthogonal to that boundary. They used an iterative method that approximates this perturbation by linearizing the classifier f around the current point $x_i$ at each iteration. For a binary classifier, the minimal perturbation at iteration i is computed as

$$\eta_i = -\frac{f(x_i)}{\lVert \nabla f(x_i) \rVert_2^2}\, \nabla f(x_i),$$

where $\eta_i$ is the perturbation whose norm is the distance from $x_i$ to the linearized decision boundary; the steps are accumulated until the predicted label changes.

Jacobian-Based Saliency Map Attack (JSMA) [24]. The previous two attack methods are both nontargeted. Papernot et al. observed that different input features have different degrees of influence on the output of the target model. If some features are found to drive a specific output of the target model, the model can be made to produce that output by strengthening those features in the input. Based on this idea, they proposed a simple iterative targeted attack named the Jacobian-based saliency map attack (JSMA). First, JSMA computes the forward derivative (the Jacobian of the model's output with respect to its input), which shows the influence of each input feature on the output. Then, it builds an adversarial saliency map from the forward derivative and uses it to find the input features that have the greatest impact on the desired output of the target model. Finally, small perturbations added to these features can fool the neural network.

Carlini and Wagner (C&W) [26]. Carlini and Wagner proposed a method for generating more robust adversarial examples that can bypass many advanced defense mechanisms. This method treats the adversarial example as an optimization variable, and two conditions must be met for the attack to succeed. First, the difference between the adversarial example and the corresponding clean sample should be as small as possible. Second, the adversarial example should cause the model to misclassify it with as high a confidence as possible. There are three attacks, for the $L_0$, $L_2$, and $L_\infty$ distance metrics; we briefly describe the $L_2$ attack:

$$\min_{x^*} \lVert x^* - x \rVert_2^2 + c \cdot g(x^*),$$

where the loss function g is defined as

$$g(x^*) = \max\left(\max_{i \neq t} Z(x^*)_i - Z(x^*)_t,\ -\kappa\right),$$

where Z denotes the logits (the output of the layer before the softmax), κ is a constant used to control the confidence (as κ increases, the adversarial examples are misclassified with higher confidence and become more powerful), t is the target label of misclassification, and the constant c can be chosen with binary search.

Figure 1: The overview of the defense model for adversarial examples proposed in this paper. Between the input images and the classifier, we add a generator, which eliminates the adversarial perturbations in the input images and maps the adversarial examples to clean images.
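To make these update rules concrete, here is a minimal pure-Python sketch on a toy linear classifier Z(x) = Wx + b, an assumption chosen so that every gradient has a closed form; a deep network would obtain the same quantities by backpropagation. It shows the FGSM update, one DeepFool step for a binary classifier, the JSMA saliency map for increasing features (following Papernot et al.), and the C&W loss g. The function names, toy weights, and the DeepFool overshoot of 0.02 are illustrative choices; the adversarial examples in our experiments were generated with advbox, not with this code.

```python
import math


def logits(W, b, x):
    """Z(x) = W x + b: a linear stand-in for a deep network's logits."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(W, b, x, y, eps):
    """x* = x + eps * sign(grad_x J), J = cross-entropy loss.
    For a linear model, grad_x J = W^T (softmax(Z(x)) - onehot(y))."""
    p = softmax(logits(W, b, x))
    err = [p_k - (1.0 if k == y else 0.0) for k, p_k in enumerate(p)]
    grad = [sum(W[k][j] * err[k] for k in range(len(W))) for j in range(len(x))]
    return [v + eps * sign(g) for v, g in zip(x, grad)]


def deepfool_step(w, b0, x, overshoot=0.02):
    """One DeepFool step for a binary classifier f(x) = w.x + b0:
    eta = -f(x) / ||grad f||^2 * grad f, with grad f = w.
    The small overshoot (illustrative) pushes x just past the boundary."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b0
    norm2 = sum(wi * wi for wi in w)
    eta = [-f / norm2 * wi for wi in w]
    return [xi + (1.0 + overshoot) * ei for xi, ei in zip(x, eta)]


def jsma_saliency(jac, t):
    """Saliency map for increasing features; jac[j][i] = dF_j / dx_i.
    A feature scores zero unless it raises class t and lowers the others."""
    scores = []
    for i in range(len(jac[0])):
        a = jac[t][i]
        rest = sum(jac[j][i] for j in range(len(jac)) if j != t)
        scores.append(0.0 if a < 0 or rest > 0 else a * abs(rest))
    return scores


def cw_margin(z, t, kappa=0.0):
    """C&W loss g = max(max_{i != t} Z_i - Z_t, -kappa) on logits z."""
    best_other = max(v for i, v in enumerate(z) if i != t)
    return max(best_other - z[t], -kappa)


W, b = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
x_adv = fgsm(W, b, [0.6, 0.4], y=0, eps=0.3)
print(x_adv, logits(W, b, x_adv))  # class 1 now has the larger logit
```

Minimizing g drives the target logit $Z_t$ above all others; once the margin reaches −κ the loss stops rewarding further change, which is how κ sets the confidence of the attack.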

Generative Adversarial Networks.
Generative adversarial networks (GANs) [49] are a successful framework for generative models and are widely used in many fields [50][51][52]. A GAN framework forces two networks to compete with each other: a generator G, which attempts to map a sample z from a noise distribution (z ∼ p_z(z)) to the data distribution (x ∼ p_data(x)), and a discriminator D, which estimates the probability that a sample came from the training data rather than from G. The goal of the generator G is to maximize the probability of D making a mistake. Thus, this framework plays a two-player minimax game with the following value function V(G, D):

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$

In the competition, both the generator and the discriminator improve until the discriminator cannot distinguish a generated sample from a data sample.
Mirza and Osindero [40] introduced the conditional version of generative adversarial networks (conditional GANs). A conditional GAN can be expressed as a mapping from an observed input x and random noise z to an output y, G: {x, z} → y. The value function V(G, D) of the conditional GAN is as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))].$$

By conditioning on the input x, the conditional GAN can direct the data generation process toward a specified result.
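The value function can be estimated from a batch of discriminator outputs. The pure-Python sketch below computes that estimate, the corresponding discriminator loss, and the non-saturating generator loss that Goodfellow et al. recommend in practice. It assumes the discriminator outputs probabilities strictly inside (0, 1); the function names are hypothetical.

```python
import math


def cgan_value(d_real, d_fake):
    """Monte Carlo estimate of the conditional GAN value function
        V(G, D) = E[log D(x, y)] + E[log(1 - D(x, G(x, z)))]
    d_real: discriminator probabilities on real pairs (x, y)
    d_fake: discriminator probabilities on generated pairs (x, G(x, z))"""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def discriminator_loss(d_real, d_fake):
    # D maximizes V, which is the same as minimizing -V.
    return -cgan_value(d_real, d_fake)


def generator_loss_nonsaturating(d_fake):
    # Instead of minimizing log(1 - D(x, G(x, z))), G maximizes log D(x, G(x, z)),
    # which gives stronger gradients early in training.
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# When D cannot tell real from generated samples it outputs 1/2 everywhere,
# and V equals -log 4, its value at the equilibrium of the minimax game.
print(cgan_value([0.5, 0.5], [0.5, 0.5]), -math.log(4))
```

A discriminator that separates the two sets well pushes V above −log 4, which is exactly the signal the generator is trained to remove.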

Proposed Method
In this section, we introduce the defense mechanism against adversarial examples in detail.

Motivation.
In computer vision, we can consider the attack on and defense of adversarial examples as an image-to-image translation process. For the adversary, the goal is to perturb clean images to generate adversarial images. For the defender, the usual idea is to transform the input adversarial images and eliminate the perturbations to restore them to clean images. Following this idea, we can apply image translation methods to the field of adversarial examples. In 2017, Isola et al. [39] proposed pix2pix, a generic approach to image-to-image translation problems based on the conditional GAN. They demonstrated that pix2pix is effective at, among other tasks, reconstructing objects from edge maps and colorizing images. In this paper, we use the same network framework as pix2pix and employ it as a defense mechanism that learns a mapping from adversarial images to clean images.

Framework.
The pix2pix framework is based on the conditional GAN.
This means that the framework consists of two main parts: a generator and a discriminator. As shown in Figure 2, we introduce the structure of our framework in terms of these two parts.

Generator.
We use the U-Net structure [53] as the generator; U-Net adds skip connections to an encoder-decoder network. Although the inputs (adversarial images) and outputs (clean images) differ slightly in surface appearance, their underlying structures are the same. Therefore, in this image-to-image task (adversarial images to clean images), the input and output should share the same underlying information. A traditional encoder-decoder generator passes all information through its bottleneck and lacks a path for low-level information, which causes some distortion in the outputs. Therefore, we add skip connections to the encoder-decoder network so that this underlying information is shared between the inputs and outputs, which brings the quality of the converted images closer to the expected result. Each skip connection simply concatenates all channels at layer i with those at layer n − i, where n is the total number of layers.
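The effect of the skip connections can be seen in a deliberately stripped-down sketch. It assumes 1-D feature channels, replaces the learned convolutions with average pooling and nearest-neighbour upsampling, and omits all weights, so it shows only the data flow: each decoder level concatenates the channels of the mirrored encoder level (the layer i with layer n − i pairing). The function names are hypothetical.

```python
def downsample(channel):
    """Average-pool a 1-D channel by 2 (stand-in for a strided convolution)."""
    return [(channel[i] + channel[i + 1]) / 2 for i in range(0, len(channel) - 1, 2)]


def upsample(channel):
    """Nearest-neighbour upsampling by 2 (stand-in for a transposed convolution)."""
    return [v for v in channel for _ in range(2)]


def toy_unet(channels, depth):
    """channels: list of equal-length 1-D feature channels.
    The encoder stores the features of every level; on the way up, each
    decoder level concatenates, along the channel axis, the stored encoder
    features of the mirrored level."""
    skips, h = [], channels
    for _ in range(depth):                 # encoder
        skips.append(h)
        h = [downsample(c) for c in h]
    for skip in reversed(skips):            # decoder
        h = [upsample(c) for c in h]
        h = skip + h                        # skip connection: channel concatenation
    return h


out = toy_unet([[1.0, 2.0, 3.0, 4.0]], depth=2)
for c in out:
    print(c)
```

The first output channel is the untouched input, so fine detail reaches the decoder directly instead of being squeezed through the bottleneck; in a real U-Net, convolutions after each concatenation learn how to combine the two sources.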

Discriminator.
We use the PatchGAN structure [39] as the discriminator. A traditional GAN discriminator judges the output image as a whole. PatchGAN instead restricts the discriminator to modeling high-frequency structure: it runs a convolutional network over the input image and tries to classify whether each N × N patch in the image is real or fake. It then averages all responses to provide the final output of the discriminator. In this way, the local features of the generated images can be well constructed.

Figure 2 illustrates the overall architecture of the defense mechanism for adversarial examples. We use paired data (x, y) for training; each pair contains a clean image y and its adversarial image x. The generator G takes the adversarial example x as its input and generates the image G(x). Then, (x, G(x)) and (x, y) are sent to the discriminator D, which is used to distinguish the generated data from the original instances. The adversarial loss can be written as follows:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))].$$
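The patch size N follows from the discriminator's layer stack. The sketch below assumes the 70 × 70 PatchGAN configuration of pix2pix (4 × 4 convolutions with strides 2, 2, 2, 1, 1 and padding 1); it computes the receptive field of one output unit, the size of the output grid of patch decisions, and the averaged output. The helper names are hypothetical, and the layer list is an assumption rather than a specification of our trained model.

```python
# Assumed layer stack of pix2pix's 70x70 PatchGAN, as (kernel, stride):
# C64-C128-C256-C512 followed by a final 1-channel convolution.
PATCHGAN_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def receptive_field(layers):
    """Side length of the input patch seen by one output unit, computed
    backwards from the output with r_in = (r_out - 1) * stride + kernel."""
    r = 1
    for kernel, stride in reversed(layers):
        r = (r - 1) * stride + kernel
    return r


def output_grid(size, layers, pad=1):
    """N of the N x N grid of patch decisions for a size x size input."""
    for kernel, stride in layers:
        size = (size + 2 * pad - kernel) // stride + 1
    return size


def patchgan_output(patch_probs):
    """Average per-patch real/fake probabilities into one discriminator output."""
    flat = [p for row in patch_probs for p in row]
    return sum(flat) / len(flat)


print(receptive_field(PATCHGAN_70))   # 70
print(output_grid(256, PATCHGAN_70))  # 30
print(patchgan_output([[0.9, 0.1], [0.8, 0.6]]))
```

Because each output unit only sees a 70 × 70 region, the discriminator penalizes local texture and detail, while the L1 term of the objective takes care of low-frequency correctness.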

Defending against Adversarial Examples.
The goal of G is not only to fool the discriminator but also to produce outputs near the ground truth. Therefore, we add the loss $\mathcal{L}_{L1}(G)$, which encourages the generated instances G(x) to be close to the clean images y:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_1\right].$$

The current objective function is

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),$$

where λ controls the relative importance of $\mathcal{L}_{L1}(G)$. As shown in Figures 3 and 4, our defense mechanism can eliminate adversarial perturbations in the images. However, for some complex datasets (such as CIFAR10), although the generated images are close to the original clean images, their performance on the target model f is not satisfactory. To solve this problem, we adjust the objective function. Our core goal is to eliminate the adversarial perturbations in x so that the target model's predictions on the generated images G(x) are close to its predictions on y. Therefore, we add a loss term $\mathcal{L}^{f}_{adv}(G)$ that penalizes the discrepancy between f(G(x)) and f(y).

The final objective function is

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G) + \mu \mathcal{L}^{f}_{adv}(G),$$

where μ controls the relative importance of $\mathcal{L}^{f}_{adv}$. In general, the loss functions $\mathcal{L}_{cGAN}(G, D)$ and $\lambda \mathcal{L}_{L1}(G)$ encourage the generated data to appear similar to the clean data, while $\mathcal{L}^{f}_{adv}$ improves the prediction accuracy of the generated images on the target model.
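A per-sample version of the generator's side of this objective can be sketched as follows. Because the text states only the goal of $\mathcal{L}^{f}_{adv}$, the KL divergence between the target model's output distributions f(y) and f(G(x)) used here is a hypothetical choice, as are the weights: λ = 100 is the default reported for pix2pix, and μ = 1 is an arbitrary placeholder. The adversarial term uses the common non-saturating form −log D(x, G(x)).

```python
import math


def l1_loss(generated, clean):
    """L_L1: mean absolute difference between G(x) and the clean image y."""
    return sum(abs(g - c) for g, c in zip(generated, clean)) / len(clean)


def prediction_loss(p_generated, p_clean, eps=1e-12):
    """Hypothetical L^f_adv: KL(f(y) || f(G(x))), which is zero when the target
    model predicts the same distribution for G(x) as for y."""
    return sum(t * math.log(t / max(q, eps))
               for t, q in zip(p_clean, p_generated) if t > 0)


def generator_objective(d_fake, generated, clean, p_generated, p_clean,
                        lam=100.0, mu=1.0):
    """Generator side of L_cGAN + lam * L_L1 + mu * L^f_adv for one sample.
    d_fake = D(x, G(x)); images are flattened lists of pixel values and
    p_* are the target model's softmax outputs."""
    adversarial = -math.log(d_fake)  # non-saturating generator term
    return (adversarial
            + lam * l1_loss(generated, clean)
            + mu * prediction_loss(p_generated, p_clean))


print(generator_objective(0.5, [0.0, 0.0], [0.1, 0.3], [0.7, 0.3], [0.7, 0.3]))
```

The three terms pull in complementary directions: the adversarial term sharpens local detail, the L1 term keeps G(x) near y pixel by pixel, and the prediction term only cares about what the target model sees, which is why it helps on CIFAR10 where pixel-level closeness alone was not enough.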

Experiment
In this section, we evaluate the defense mechanism against adversarial examples. All experiments are based on two datasets: MNIST and CIFAR10.
MNIST (the MNIST dataset used to support the findings of this study is public and available at http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits consisting of 60000 training examples and 10000 test examples. Each sample consists of 28 × 28 grayscale pixels. For MNIST, we trained two classifiers, Anet and Bnet, and used them as target models to generate adversarial examples and test our approach.
Their network structures are shown in Table 1. The prediction accuracies of Anet and Bnet on the test set are 98.96% and 99.74%, respectively. The CIFAR10 dataset (the CIFAR10 dataset used to support the findings of this study is public and available at https://www.cs.toronto.edu/~kriz/cifar.html) consists of 60000 32 × 32 color images in 10 classes, with 6000 images per class.
There are 50000 training images and 10000 test images. For CIFAR10, we trained two classifiers, ResNet (Rnet) [54] and DenseNet (Dnet) [55], and used them as target models to generate adversarial examples and test our approach. The prediction accuracies of Rnet and Dnet on the test set are 93.63% and 95.04%, respectively.

Implementation Details.
We used the adversarial examples generated from the training data, paired with the corresponding clean training images, as the training set for our framework. All attacks (FGSM, DeepFool, JSMA, and C&W) were implemented with advbox [56], a toolbox for benchmarking the vulnerability of deep learning systems to adversarial examples, and we used the interface provided by advbox to generate the adversarial examples. We experimented with ε = 0.15 on MNIST, ε = 0.1 on CIFAR10, and the $L_2$ attack for C&W. For the targeted attacks JSMA and C&W, we set a random target label for each sample. The network structure of our framework (including the generator and discriminator) is the same as that of pix2pix [39].

Figure 2: The training framework of the defense mechanism with the conditional GAN. The framework consists of a generator and a discriminator. The generator takes adversarial images x as input and eliminates the perturbations in x, producing G(x). The discriminator is used to distinguish the generated data (x, G(x)) from the original instances (x, y), where y denotes the clean images.
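The construction of the paired training set and the random target labels can be sketched as below. The `attack` callable is a hypothetical placeholder standing in for an advbox attack (advbox's real interface is not reproduced here), and drawing the target uniformly from the classes other than the true label is our assumption about what "a random target label" means.

```python
import random


def random_target(true_label, num_classes, rng):
    """Draw a target class uniformly from the classes other than the true one
    (assumption: a target equal to the true label would not be an attack)."""
    t = rng.randrange(num_classes - 1)
    return t if t < true_label else t + 1


def build_training_pairs(dataset, attack, num_classes, targeted, rng):
    """dataset: iterable of (clean_image, label).
    attack: hypothetical callable attack(image, label, target) -> adversarial
    image, with target=None for nontargeted attacks such as FGSM and DeepFool.
    Returns (x, y) pairs with x adversarial and y clean, as used to train G."""
    pairs = []
    for image, label in dataset:
        target = random_target(label, num_classes, rng) if targeted else None
        pairs.append((attack(image, label, target), image))
    return pairs


def toy_attack(image, label, target):
    # Placeholder perturbation; a real attack would depend on label/target.
    return [min(1.0, v + 0.1) for v in image]


rng = random.Random(0)
pairs = build_training_pairs([([0.2, 0.5], 3), ([0.9, 0.0], 7)], toy_attack, 10, True, rng)
print(pairs)
```

Mapping a draw from the remaining num_classes − 1 classes around the true label keeps the choice uniform without rejection sampling.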

Defending against Adversarial Examples.
To verify the effectiveness of the defense mechanism, we tested it on two datasets, MNIST and CIFAR10. For each dataset, we trained two defense frameworks for different target models. We generated adversarial examples on the test data and selected those that successfully attacked the target model as members of the test set; therefore, the prediction accuracy of the target model on the test set is 0%. In our defense mechanism, we sent the adversarial examples to a previously trained generator and then fed the generated data to the target model. Figures 5 and 6 show the prediction accuracy of the target model on adversarial examples under the defense mechanism, where epoch denotes the number of training epochs. The results indicate that our defense framework converges quickly during training. For the MNIST dataset, we take epoch = 20 as the final result, as shown in Table 2. Our defense mechanism is effective against different types of attacks: it improves the prediction accuracy of the target models (Anet and Bnet) on the adversarial examples from 0% to almost 98%. For the CIFAR10 dataset, we take epoch = 40 as the final result, as shown in Table 3. Since CIFAR10 is much more complex than MNIST, some information is lost in the denoising process; therefore, the defensive performance on CIFAR10 is lower than that on MNIST. The C&W attack is stronger than the other attacks, which makes defending against it more challenging; nevertheless, our defense mechanism still achieves good performance against it. In addition, we compare the adversarial perturbation and the defense loss on the MNIST dataset (epoch = 20) and the CIFAR10 dataset (epoch = 40). The adversarial perturbation is the average L1 norm loss between adversarial images and clean images, and the defense loss is the average L1 norm loss between the generated images and clean images.
Since our defense framework combines U-Net and PatchGAN, the generator can restore the details of the original clean data. As shown in Figures 7 and 8, our defense mechanism keeps the defense loss within a limited range. This ensures the high quality of the generated images and their similarity to the clean images.
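The evaluation protocol and the two L1 metrics can be summarized in a short sketch. `classify` and `generator` are hypothetical stand-ins for the target model and the trained generator, images are flattened lists of pixel values, and averaging the per-image L1 norm over the set (rather than also over pixels) is an assumption.

```python
def mean_l1(images_a, images_b):
    """Average over images of the per-image L1 norm ||a - b||_1."""
    total = sum(sum(abs(u - v) for u, v in zip(a, b))
                for a, b in zip(images_a, images_b))
    return total / len(images_a)


def evaluate_defense(classify, generator, clean, adversarial, labels):
    """classify(image) -> label plays the target model f; generator plays G."""
    # Keep only adversarial examples that fool the undefended model, so the
    # target model's accuracy on this test set is 0% by construction.
    kept = [(c, a, y) for c, a, y in zip(clean, adversarial, labels)
            if classify(a) != y]
    if not kept:
        raise ValueError("no successful adversarial examples to evaluate")
    kept_clean = [c for c, _, _ in kept]
    kept_adv = [a for _, a, _ in kept]
    restored = [generator(a) for a in kept_adv]  # defense: feed G(x) to f
    accuracy = sum(classify(r) == y for r, (_, _, y) in zip(restored, kept)) / len(kept)
    perturbation = mean_l1(kept_adv, kept_clean)  # adversarial perturbation
    defense_loss = mean_l1(restored, kept_clean)  # defense loss
    return accuracy, perturbation, defense_loss


def toy_classifier(img):  # hypothetical target model
    return int(sum(img) > 1.0)


def toy_generator(img):  # hypothetical trained generator
    return [max(0.0, v - 0.3) for v in img]


print(evaluate_defense(toy_classifier, toy_generator,
                       clean=[[0.2, 0.3], [0.6, 0.6]],
                       adversarial=[[0.7, 0.5], [0.6, 0.6]],
                       labels=[0, 1]))
```

In this toy run the second pair is discarded because it does not fool the classifier, and a defense loss smaller than the adversarial perturbation indicates that G moved the input back toward the clean image.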

Defense Transferability.
In this experiment, we tested the transferability of our defense mechanism. We used the adversarial examples generated by other target models to test the framework trained for the specific target model. Figures 9

Comparison with Other Defense Methods.
Following the experimental setup of Defense-GAN [36], we compared the proposed method with other defense mechanisms: Defense-GAN, MagNet [37], and adversarial training [27]. Adversarial training uses adversarial examples as part of the training set to build a more robust model. MagNet consists of a detector and a reformer: the detector detects adversarial examples, and the reformer transforms adversarial examples into clean examples. Since Defense-GAN is not claimed to be secure on CIFAR10, we only use MNIST and experiment with ε = 0.3 for FGSM and the $L_2$ attack for C&W. There are four target models, A, B, C, and D, whose structures are the same as those in Defense-GAN. The experimental results are shown in Table 6. The proposed method outperforms MagNet and adversarial training. Although our method is slightly inferior to Defense-GAN in some tests, it has certain advantages. (1) Our method is simpler than Defense-GAN: Defense-GAN requires two steps before feeding the input to the classifier, minimizing the reconstruction error and generating, whereas our method only requires generating. (2) Our defense mechanism is a general-purpose defense framework, which means that we can adapt it to different datasets or scenarios with only a few adjustments.
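The cost difference between the two pipelines can be illustrated with a scalar toy problem. The sketch mimics Defense-GAN's inference step by minimizing the reconstruction error ‖G(z) − x‖² over z with gradient descent from several random restarts before generating; the generator G(z) = 2z + 1, the step count, the restart count, and the learning rate are all hypothetical and do not reproduce Defense-GAN's actual settings. Our defense replaces this whole search with a single forward pass of the generator on x.

```python
import random


def defense_gan_reconstruct(x, gen, grad, steps=200, restarts=10, lr=0.05, seed=0):
    """Defense-GAN-style inference on a scalar toy: search for z minimizing
    (G(z) - x)^2 by gradient descent from several random starts, then return
    G(z*) together with the number of gradient steps spent."""
    rng = random.Random(seed)
    best_z, best_err, n_steps = None, float("inf"), 0
    for _ in range(restarts):
        z = rng.gauss(0.0, 1.0)
        for _ in range(steps):
            z -= lr * grad(z, x)
            n_steps += 1
        err = (gen(z) - x) ** 2
        if err < best_err:
            best_z, best_err = z, err
    return gen(best_z), n_steps


def toy_gen(z):  # hypothetical generator G(z) = 2z + 1
    return 2.0 * z + 1.0


def toy_grad(z, x):  # d/dz (G(z) - x)^2 = 4 * (2z + 1 - x)
    return 4.0 * (toy_gen(z) - x)


x_rec, cost = defense_gan_reconstruct(3.0, toy_gen, toy_grad)
print(x_rec, cost)  # reconstruction close to 3.0 after 2000 gradient steps
```

With these illustrative settings the search costs 2000 gradient steps per input, and the cost grows with both the step and restart counts, whereas an image-to-image generator needs one evaluation per input.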

Conclusions
In this paper, we propose a novel defense strategy that uses conditional GANs to enhance the robustness of classification models against adversarial examples. Our method is a universal defense framework. We tested it on different datasets and target models, and the experimental results show that our method is effective against the commonly considered attack strategies. In addition, compared with state-of-the-art defense methods, the proposed method has several advantages.
Although our method is a feasible and simple defense mechanism, there are still some practical difficulties in implementing and deploying it. For example, its performance degrades on complex datasets. In the future, we will focus on adjusting the network structure of the defense framework to improve its performance in complex scenarios.

Data Availability
The MNIST dataset used to support the findings of the study is public and available at http://yann.lecun.com/exdb/mnist/.
The CIFAR10 dataset used to support the findings of the study is public and available at https://www.cs.toronto.edu/~kriz/cifar.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest.