Dual-Tree Complex Wavelet Transform-Based Direction Correlation for Face Forgery Detection

With the rapid development of face synthesis techniques, high-quality fake face images have become indistinguishable to the human eye, which raises serious public trust and security problems. Thus, effective detection of face image forgeries is urgently needed. We observe that some subtle artifacts in the spatial domain can be easily recognized in the transform domain, that most facial features have an inherent directional correlation, and that generative models disturb this distribution pattern. Inspired by this, we propose a two-stream dual-tree complex wavelet-based face forgery network (DCWNet) to expose face image forgeries. Specifically, the dual-tree complex wavelet transform is exploited to obtain six directional features (±75°, ±45°, ±15°) of different frequency components from the original images, and a direction correlation extraction (DCE) block is presented to capture the direction correlation. Then, the direction pattern-aware clues and the original image are taken as two complementary network inputs. We also explore how specific frequency components work in face forgery detection and propose a new multiscale channel attention mechanism for feature fusion. The experimental results show that the proposed DCWNet outperforms state-of-the-art methods on open datasets such as FaceForensics++ and is highly robust against lossy image compression.


Introduction
In recent years, various deep learning technologies such as FaceSwap [1], Deepfake [2], and Face2Face [3] have been presented for facial image manipulation, which changes the attributes of face images. Besides, some generative adversarial network (GAN) [4] based works can even create fake faces without target images. As shown in Figure 1, these artificial products look so real that it is difficult to tell fake face images from real ones with the naked eye. This poses great threats to public information security. For example, these techniques might be used to produce pornographic videos or scams. Thus, how to distinguish real and fake face images has attracted increasing attention in the image content security community.
Many works have been proposed to use artificial intelligence (AI) to fight AI, namely, using deep learning methods to differentiate real images from fake ones. Among them, some sophisticated convolutional neural network (CNN) structures [7][8][9][10] were proposed, or they were combined with hand-crafted features [11][12][13] to achieve better performance. However, what makes CNNs so much more perceptive than humans? Some researchers have tried to explain this from the frequency domain [14][15][16][17]. Nevertheless, conventional frequency-domain transforms, such as FFT [18] and DCT [19], do not preserve the spatial information of the original image well. That is, images with distinct visual contents might have the same spectral amplitudes. Thus, vanilla CNN structures might be inapplicable. In [16], the frequency features extracted by frequency-aware decomposition (FAD) and local frequency statistics (LFS) were combined with sliding window DCT (SWDCT) to preserve the spatial structure of the image to some extent.
Wavelet transform has been widely used in various image applications such as denoising, compression, and texture classification. Compared with the fast Fourier transform (FFT) and other transforms, the wavelet transform preserves multiscale image spatial structure well, which is why it is known as a textual microscope. This suggests that the wavelet transform might be compatible with CNNs for face forgery detection tasks. Direction-related details such as facial contours, wrinkles, and light-shadow boundary lines are intuitive yet effective for face image forensics. The dual-tree complex wavelet transform (DTCWT) was proposed to overcome the translation sensitivity of traditional wavelets, and it also has higher directional selectivity [20]. We exploit the DTCWT to reveal the correlation between facial features in different directions. Moreover, the wavelet transform decomposes the original image into multiple scales. Among them, the low-level features provide richer details, whereas the high-level features provide more semantic information. It is well known that both low-frequency and high-frequency information is useful for image classification tasks [21]. Is it the same for face image forensics? If so, what role does each component play in face forgery detection, and how can we fuse multiscale features?
In this work, we propose a novel two-stream deep network for face image forgery detection. One stream exploits DTCWT to learn multiscale directional features. In Figure 2, we show the results of the two-stage DTCWT on the original face image. Each stage contains six different directional features. The other stream takes the original image as input, which provides low-frequency and pixel-level information for the network. Moreover, to fully exploit different frequency components, we propose a multiscale channel attention (MSCA) mechanism to fuse the multiscale frequency-domain features from the direction correlation extraction (DCE) block. The main contributions are threefold: (1) DTCWT is combined with CNN for face image forensics. It addresses face forgery detection from a new perspective, in which a novel DCE block is proposed to extract the correlation features. (2) An MSCA mechanism is proposed to improve feature fusion efficiency. (3) We demonstrate that face image forensics is different from image classification, and we study the influence of various frequency components on face forgery detection. The remainder of this paper is organized as follows: Section 2 summarizes the related works. Section 3 presents the proposed DCWNet. Section 4 reports the experimental results, and the conclusion is given in Section 5.

Related Work
Recent AI-enabled face forgeries can generate fake face images without any noticeable artifacts. CNNs have achieved great success compared with earlier works that exploit hand-crafted features [22,23]. Many face forgery detection works have been presented for better accuracy or interpretability.

Pixel-Level Forgery Detection.
The most widely used method is to input the original images into a CNN, either in RGB or HSV color space. In [24], Dang et al. proposed a CNN-based approach integrated with an attention mechanism to improve the feature maps. Inspired by image steganalysis, Nataraj et al. proposed to combine pixel co-occurrence matrices with a CNN for face forgery detection [13]. The model was trained on a dataset generated by CycleGAN [25] and was additionally tested on face images generated by a different GAN structure (StarGAN [26]). The experimental results showed that their work has good generalization capability. Afchar et al. proposed two networks, namely, Meso-4 and Meso-Inception-4, to exploit the mesoscopic properties of the images [27]. They achieved an ACC of up to 98.4%. Guo et al. proposed an adaptive manipulation trace extraction network (AMTEN) [14]. It predicts manipulation traces by an adaptive convolution layer, which are also reused to [...] Their approach involved two networks and used the recognition signals from these two networks to detect such discrepancies [28]. In addition, recurrent neural networks (RNNs) were also exploited by considering face images with temporal properties [29][30][31]. Some other works exploited visual artifacts such as incoherent 3D head poses for better explanations [32][33][34]. Chen et al. proposed an improved Xception model for GAN-generated faces [35]. They removed the four residual blocks of Xception to avoid overfitting, and replaced the common convolution layer with dilated convolution. The proposed model performed well on their locally GAN-based generated face (LGGF) dataset.

Frequency-Based Forgery Detection.
Image transformation refers to transforming an original image from the spatial domain to other domains such as the frequency domain. Common image transformations include the discrete cosine transform [19], the fast Fourier transform [18], and the wavelet transform [36], which are widely used in various image applications such as edge enhancement, image smoothing, and texture analysis.
In recent years, transform domain processing has been introduced into face forensics. Qian et al. proposed F3-Net [16], which exploits frequency-aware decomposed image components and local frequency statistics. F3-Net performs well on the FaceForensics++ dataset, especially for low-quality images. Liu et al. found that the phase spectrum is more sensitive to the up-sampling operation than the amplitude spectrum and proposed to expose up-sampling traces by exploiting the phase spectrum [37]. Gong et al. applied the 2D DCT to each RGB channel of the original image and then used AutoGAN [38] to synthesize GAN artifacts in any image without a pretrained model [15].

Attention Mechanism.
The attention mechanism generates a set of weighting coefficients, which are used to adaptively strengthen regions of interest and suppress irrelevant background regions.
There are three common attention mechanisms. The first one is channel attention. In SENet [39], global average pooling is used to obtain the mean value of each channel as the input of the following fully connected layer. In ECANet [40], a lightweight 1D convolution across channels replaces the fully connected layer to pay more attention to the relationship between adjacent channels. The second one is the spatial attention mechanism, which reinforces local areas in each channel. One of the most outstanding works is CBAM [41]. The third one is self-attention [42], which models the global context and effectively captures long-distance feature dependencies.

Direction Correlation Extraction Block.
Face images have rich directional information such as wrinkles, facial contours, and light and shadow boundaries. These features follow distribution patterns under specific facial movements. That is, there are spatial correlations among them. In AI-generated fake faces, these correlations might be weak. This can be used as a clue for face forensics, which motivates us to design a DCE block to expose it, as shown in Figure 3. Conv means the convolution operation, BN represents batch normalization, and ReLU is the activation function.
Directional correlation contains two parts: (1) local correlation inside each direction map and (2) correlation among different direction maps. For local features, we apply 3 × 3 convolutions to each type of directional feature map separately,
where $I_n$ are the face feature maps of the $n$th direction obtained by DTCWT; $C_i$ denotes the convolution kernels; and $f_{n,i}$ represents the features extracted with $C_i$ in direction $n$. In this work, both $m$ and $k$ are set to 6. For each input, we obtain the feature maps of six channels, which are concatenated to obtain $F_{\mathrm{local}}$.
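The equation that these symbol definitions refer to does not appear in the text above. A plausible form, offered only as a reconstruction, is given below. It assumes that $n$ indexes the $m$ directions, that $i$ indexes the $k$ kernels, and that $*$ denotes 2D convolution; none of these readings is confirmed by the surviving text.

```latex
f_{n,i} = C_i * I_n, \qquad n = 1, \ldots, m, \quad i = 1, \ldots, k,
\qquad
F_{\mathrm{local}} = \mathrm{concat}\left(f_{1,1}, f_{1,2}, \ldots, f_{m,k}\right)
```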
The SE block [39] is an existing channel attention method. The input multichannel feature maps are passed through global average pooling to obtain the weight array. Considering the characteristics of the wavelet coefficients,

Security and Communication Networks 3
MSCA is adopted to extract features among directional channels (MSCA is described in Subsection 3.2.2).
Note that the original 1 × 1 convolution in MSCA is replaced with a fully connected layer (MSCA$_{fc}$). The reason is that the 1 × 1 convolution pays more attention to the correlation among adjacent channels. In contrast, the fully connected layer is a point-to-multipoint relationship, which comprehensively describes the relationship between non-adjacent channels. Besides extracting the correlation between channels, the MSCA$_{fc}$ block also reduces redundant information in local features. Thus, DCE focuses on directional components. Then, we apply a 1 × 1 convolution operation $C_{1\times1}$ to further exploit interchannel correlation. In this manner, the same directional features share the convolution kernel in the wavelet transform.

Attention-Based Multiscale Feature Fusion.
In essence, the multiscale wavelet transform is a stepwise dichotomization of the original image frequency band. How does each frequency component contribute to the face forensics task, and how can the directional features obtained from the multiscale wavelet transform be fused effectively? To answer these questions, we propose a new attention-based feature fusion method.

3.2.1. The Impacts of Frequency Components on Face Forensics.
Face forgery detection is different from traditional image classification tasks. As claimed in [21], deep network models for image classification exploit both low-frequency and high-frequency information, and both contribute to the final classification. We conduct a preliminary experiment on 10k face images, half real and half fake. The fake face images are generated by four face forgery methods. ResNet18 is used for the experiments. The images are reconstructed from their FFT spectra by keeping only the central frequency components within a radius r (Figure 4(a)). The training and testing processes are recorded in Figure 4(b). The horizontal axis is the number of training epochs, and the vertical axis is the ACC. r is the radius of the mask: the larger r is, the more high-frequency components are retained. We can observe the following: (1) for low-frequency images, the network converges quickly, and three epochs are enough; (2) the initial accuracy improves continuously as more high-frequency components are included; (3) as even higher frequency components are introduced, the network benefits less, and the accuracy can even drop.
Observations (1) and (2) suggest that the network learns some features from low-frequency components. Note that the frequency components are exploited in parallel, which is different from conventional image classification [21]. This is also consistent with common sense: image classification usually operates at the semantic level, whereas face tampering detection is a fine-grained classification task. As for observation (3), since images often contain noise that usually resides in the high-frequency components, the accumulation of high-frequency components also makes network learning more difficult.
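The radius-based reconstruction used in this preliminary experiment can be sketched as follows. This is a minimal illustration with NumPy, not the authors' code: it assumes a single-channel image, a hard circular mask centred on the shifted spectrum, and a random array as a stand-in for a face crop. The function name `lowpass_reconstruct` is hypothetical.

```python
import numpy as np

def lowpass_reconstruct(image, r):
    """Rebuild `image` from the FFT coefficients within radius r of the centre."""
    h, w = image.shape
    # Shift the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    # Hard circular mask (an assumption): True inside the radius, False outside.
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

rng = np.random.default_rng(0)
face = rng.random((64, 64))              # stand-in for a grayscale face crop
low = lowpass_reconstruct(face, r=8)     # mostly low-frequency content
full = lowpass_reconstruct(face, r=64)   # mask covers the whole spectrum
```

With a small `r` only the coarse structure survives; increasing `r` gradually restores the high-frequency detail, which is the quantity varied along the curves described for Figure 4(b).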

Multiscale Channel Attention.
Wavelet transform can provide multiscale image description due to diverse frequency components. Both high-frequency and low-frequency components benefit face forgery detection. Thus, fusing features is a key issue. The weights of conventional channel attention mechanisms, e.g., SENet [39], are based on the mean values of channels. Although they work, they ignore some important local information in the less important feature channels. This drawback prevents the wavelet transform from exerting its capability of detail representation. Inspired by the receptive field of human visual cortex neurons, we propose a multiscale channel attention (MSCA) mechanism, which considers the importance of local features and minimizes the side effect of noise. Figure 5 shows the proposed MSCA. $C_n$ denotes the different DCE feature maps.
They are concatenated as $C_a$:

$$C_a = \mathrm{concat}(C_1, C_2, \ldots, C_n).$$

We perform max pooling with kernels of 3 × 3, 5 × 5, and 7 × 7 on $C_a$. For each pooling result, we get a 1 × 1 channel array by global average pooling.
Next, we transpose and concatenate them into 3 × 1 channels, and then we use a 1 × 1 convolution operation ($C_{1\times1}$) to obtain $w_f$. The final output is obtained by multiplying $C_a$ by $w_f$:

$$\mathrm{output} = w_f \odot C_a.$$
The max pooling strategy strengthens local features, while average pooling highlights global information. Thus, MSCA assigns the weight of each channel by considering both. Please note that the directional features use high-frequency components, whereas the experiment in Subsection 3.2.1 shows that the low-frequency components also play a role in model training. Thus, we use a two-stream network to exploit the low-frequency information and pixel-level features simultaneously.
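The MSCA steps described above can be sketched in NumPy as follows. This is only an illustration under several assumptions that the text does not state: the max pooling uses stride 1 with same-size padding, the 1 × 1 convolution over the three scale descriptors is modelled as a weighted sum with three hypothetical weights (`scale_weights`), and a sigmoid turns the result into the channel weights $w_f$.

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with a k x k window; x has shape (C, H, W)."""
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    _, h, w = x.shape
    out = np.full_like(x, -np.inf)
    # Take the running maximum over every offset inside the window.
    for dy in range(k):
        for dx in range(k):
            out = np.maximum(out, padded[:, dy:dy + h, dx:dx + w])
    return out

def msca(c_a, scale_weights, bias=0.0):
    """Return (w_f * c_a, w_f) for concatenated feature maps c_a of shape (C, H, W)."""
    # Max pooling at three scales, then global average pooling: shape (3, C).
    desc = np.stack([max_pool_same(c_a, k).mean(axis=(1, 2)) for k in (3, 5, 7)])
    # 1 x 1 convolution across the three scales, i.e. a weighted sum per channel.
    logits = scale_weights @ desc + bias
    w_f = 1.0 / (1.0 + np.exp(-logits))   # sigmoid gate (assumed)
    return w_f[:, None, None] * c_a, w_f

rng = np.random.default_rng(0)
c_a = rng.random((12, 16, 16))            # stand-in for concatenated DCE maps
out, w_f = msca(c_a, scale_weights=np.array([0.5, 0.3, 0.2]))
```

In a trained network the three scale weights would be learned; the fixed values here only serve to make the sketch runnable.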
Based on the above methods, we propose DCWNet. Figure 6 shows the complete framework.

Experimental Setting.
Image Dataset. FaceForensics++ is a recent face manipulation dataset, which has been widely used in existing works [33,43]. It is expanded from the FaceForensics dataset with three quality levels, namely, RAW (raw), HQ (high quality), and LQ (low quality). For the FaceForensics dataset, each level includes 1,000 videos, which are directly collected from YouTube without tampering. The same number of fake videos is generated by each of four face forgery methods, namely, Deepfake, Face2Face, FaceSwap, and NeuralTextures. In addition, the FaceForensics++ dataset also contains 363 real videos from 28 actors under 16 scenes. Thus, the FaceForensics++ dataset has 1,363 real videos and 4,000 fake videos for each quality level. We extract 60 frames from each real video at equal intervals and 16 frames from each fake video. MTCNN [44] is used to crop the face images. Thus, we have 63k fake face images and 63k real face images, 126k face images in total. We divide them into 85k, 35k, and 6k face images as the training set, the testing set, and the validation set, respectively. In addition, the DFDC preview dataset [45], which is a preview dataset of the Deepfake Detection Challenge, is also used for experiments. It contains 1,131 real videos and 4,119 fake videos. We obtain 120k face images from the DFDC preview dataset.
Evaluation Metrics. To evaluate the effectiveness of our model, we use two widely used metrics, namely, classification accuracy (ACC) and area under the receiver operating characteristic curve (AUC). The closer the ACC is to 100% and the AUC is to 1, the better the network performs.
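As a small worked example of the two metrics, the sketch below computes ACC and AUC in plain Python. The labels and scores are made up for illustration, the label convention (1 = fake, 0 = real) and the 0.5 decision threshold are assumptions, and AUC is computed through its pairwise-ranking interpretation rather than by integrating an ROC curve.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]                  # 1 = fake, 0 = real (assumed convention)
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]      # made-up detector outputs
acc_value = accuracy(labels, scores)         # 4 of 6 correct
auc_value = auc(labels, scores)              # 8 of 9 positive/negative pairs ranked correctly
```

The example also shows why the two metrics are reported together: ACC depends on the chosen threshold, while AUC only depends on how the scores rank fake samples relative to real ones.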
Experiment Details. ResNet34, pretrained on ImageNet [46], is used as the backbone for the two streams. The Kaiming Batch Normalization is used for initialization. The networks are optimized via SGD with a momentum of 0.9 and a weight decay of 0.0005. We set the base learning rate to 0.02 and use StepLR as the learning rate scheduler, which halves the learning rate at each step. The batch size is 64, and we train the model for about 14k iterations. The whole work is implemented in PyTorch 1.1.0 with two Nvidia GeForce GTX 1080 Ti GPUs. To speed up the training process, we save the results of the wavelet transform to local disk in NumPy format.
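The learning rate schedule described above can be illustrated with a few lines of plain Python. The base learning rate (0.02) and the halving factor come from the text; the step length is not stated, so `STEP_SIZE` below is a hypothetical value chosen only to make the example concrete.

```python
def step_lr(base_lr, iteration, step_size, gamma=0.5):
    """Learning rate after `iteration` updates under a StepLR-style schedule."""
    return base_lr * gamma ** (iteration // step_size)

BASE_LR = 0.02       # from the text
STEP_SIZE = 2000     # hypothetical: the step length is not given in the text
TOTAL_ITERS = 14000  # "about 14k iterations" in the text

# Learning rate at the start of each step interval.
schedule = [step_lr(BASE_LR, it, STEP_SIZE) for it in range(0, TOTAL_ITERS, STEP_SIZE)]
```

In PyTorch, the same decay rule is provided by `torch.optim.lr_scheduler.StepLR(optimizer, step_size=..., gamma=0.5)`, where the unit of `step_size` is the number of calls to `scheduler.step()`.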

Comparisons with the Existing Works.
The proposed DCWNet is tested on image datasets of different quality levels that consist of fake images produced by different image tampering methods. The proposed approach is compared experimentally with the existing works. For the FaceForensics++ dataset, the experimental results are shown in Table 1. The proposed DCWNet achieves a high ACC (98.73%) and AUC (0.999) on the FaceForensics++ (HQ) dataset. For the LQ dataset, DCWNet also achieves desirable results, with an ACC of 97.91% and an AUC of 0.994. Compared to the baseline network (ResNet34), DCWNet improves the ACC by about 2.05%. This indicates that the DCE block is effective. Figure 7 reports the ROC curves for different face forgery detection methods. We also conduct experiments on the DFDC preview dataset with the same experimental setting. Table 2 reports the experimental results.
We also test our model on each face manipulation method separately. Specifically, the fake images in the FaceForensics++ dataset are produced by four face manipulation methods. Each face manipulation method has 31k images. Among them, 22k, 8k, and 1k are used for training, testing, and validation, respectively. Similar experimental results are obtained, which are reported in Table 3.

Ablation Study
To verify the contribution of each component of the proposed DCWNet, an ablation study is conducted. We first explore the influence of the number of directions, and the experimental results are recorded in Table 4. Even with features from one direction, the DCE stream achieves high ACC and AUC. This indicates that the DCE block is powerful for local feature representation. With more features from multiple directions, the detection accuracies improve greatly. This implies that the features extracted from different directions are complementary to each other. We also compare the effect of the FC layer and the 1 × 1 convolution used in MSCA. We observe that as more directions are used, FC performs better than the 1 × 1 convolution. Figure 8 shows some feature maps extracted from the DCE block. We can notice that the attention responses of the fake images are scattered, whereas those of the real images are compact. The reason behind this is that the directional features are not strongly correlated in fake face images, while they are more uniform for real face images.

Methods                  ACC (%)   AUC
Meso-4 [27]              53.71     0.553
Meso-Incep [27]          58.16     0.654
HP-CNN [11]              61.49     0.675
Constrained Conv [47]    81.01     0.877
AMTEN [14]               88.83     0.892
XceptionNet [9]          89.37     0.969
ResNet34 [8]             94

[...] Table 5. Specifically, we conduct experiments for the first (S1) and second (S2) stages of the wavelet transform, respectively. Elementwise addition, squeeze-and-excitation (SE) channel attention, and MSCA are used for feature fusion. As shown in Table 5, MSCA achieves the best feature fusion. Figure 9 also compares the feature maps from the DCE stream between SE and MSCA.

Conclusion
In this work, we propose a two-stream DCWNet for face forgery detection. One stream uses the DCE block to exploit the multiscale directional correlation. To fuse the DCE feature maps of different scales, MSCA is proposed. The other stream uses the original image as input. The experimental results show that DCWNet achieves desirable results on the FaceForensics++ and DFDC preview datasets. From the ablation study, we observe that the DCE block learns different feature maps for real and fake faces. This indicates that the correlation of direction distribution is valuable for face forgery detection. Moreover, the effectiveness of the proposed MSCA is verified by comparisons with existing feature fusion methods. We also explore how different frequency components contribute to face forgery detection, which provides some interpretability for face forensics.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.