Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos

In recent years, the spread of fake videos has had a serious impact on individuals and even countries, so it is important to provide robust and reliable detection results for fake videos. Conventional detection methods produce results that are neither reliable nor robust for unseen videos. An alternative and more effective approach is to find the original video of a fake video. For example, fake videos from the Russia-Ukraine war and the Hong Kong law revision storm have been refuted by finding the original video. We propose an improved retrieval method, named ViTHash, to find the original video. Specifically, tracing the source of a fake video requires finding the unique original, which is difficult when the original videos differ only slightly. To solve this problem, we designed a novel loss, the Hash Triplet Loss. In addition, we designed a tool called Localizator to compare the differences between the traced original video and the fake video. We conducted extensive experiments on FaceForensics++, Celeb-DF, and DeepFakeDetection, as well as additional experiments on three datasets we built: DAVIS2016-TL (video inpainting), VSTL (video splicing), and DFTL (similar videos). Experiments show that our method outperforms state-of-the-art methods, especially in the cross-dataset setting, and that ViTHash is effective for various forgery types: video inpainting, video splicing, and deepfakes. Our code and datasets have been released on GitHub: \url{https://github.com/lajlksdf/vtl}.


Introduction
Video forgery has gained global attention, leading to increased focus on forgery detection [1-3]. Common techniques for video semantic editing include object removal (video inpainting), object addition (video splicing), and object swapping (face swapping) [4-7]. Malicious use of these technologies can cause harm to individuals and organizations, and fake videos can have serious consequences for politics, society, finance, and the law. Current methods output probability values but lack interpretability and have limitations in real-world applications [6, 8, 9]. Moreover, existing forgery detection methods perform poorly in independent testing and have poor robustness to common video processing techniques used on the Internet [6]. Therefore, a reliable and robust forgery detection method is essential.
Inspired by hash retrieval, we propose a hash-based source-tracing method. However, the discrete distribution of the hash space and the nonsmooth Hamming distance lead to a nondifferentiable optimization problem. Traditional hash retrieval is usually employed to find similar videos within the same category, where videos of different categories have significant semantic differences, making it easy to train distinct hash centers. The challenge in source tracing, however, is that videos in the dataset may be similar and their initial hash codes are difficult to differentiate, which makes it hard to train hash centers with significant differences. To address these challenges, we introduce a new loss function, the hash triplet loss, which replaces the Hamming distance with a differentiable function implemented in PyTorch. The hash triplet loss iteratively optimizes hash codes, gradually differentiating videos with subtle differences, even when those differences are not immediately apparent.
Figure 1 illustrates the approach for learning hash codes for triplet-based retrieval and tracing. Hash retrieval methods are based on triplets $(q, q^+, q^-)$, where $(q, q^+)$ is a positive pair and $q^-$ is a negative sample [10-12]. The method increases the distance between $(q, q^+)$ and $q^-$ within a triplet, decreases the distance between $q$ and $q^+$, and thus learns the local similarity between elements within the triplet.
Instead, we treat $X = \{s_x, f_{x_1}, f_{x_2}, \ldots, f_{x_n}\}$ as one class and train a hash center $Center_X$ for a real video $s_x$ and its associated fake videos $f_{x_i}$. The hash triplet loss is based on triplets $(s_x, f_{x_i}, f_{x_j})$, where $s_x$ is the real video and $f_{x_i}$ and $f_{x_j}$ are two randomly selected related fake videos. In each training iteration, the hash triplet loss increases the distance between the hash codes of different-class triplets and decreases the distance between the hash codes of the fake videos $(f_{x_i}, f_{x_j})$ and the real video $s_x$ in the same triplet. It learns the global similarity of a class of data.
Since each triplet always includes the real video $s_x$, the fake videos eventually generate a hash center around the real video. Therefore, our method reduces the reliance on a limited set of forged videos. By not relying on forgery traces, our method improves the robustness of detecting forged videos against various processing techniques and its generalization to various forgeries. Ultimately, the hash centers of different classes are far apart, and the hash codes of videos from the same class are clustered around their corresponding hash center.
As shown in Figure 2, the distribution of video hash codes is presented at different stages. Initially, the binary hash codes of real and fake videos in the dataset are mixed together, making them difficult to distinguish. During training, the hash codes of the real and related fake videos gradually converge, while the hash codes of unrelated videos separate. Eventually, a hash center is trained for each real video and its related fake ones, and the average Hamming distance between hash centers is close to half of the length of the hash code. The generated hash centers are close to the optimal hash distribution [13].
We use the pyramid vision transformer (PVT) v2 [14] as the backbone for feature extraction. PVT v2 is an effective network for learning image recognition features based on the vision transformer (ViT) architecture. To better capture the temporal information of videos, we design a temporal encoding module of the kind commonly used in ViT structures [15]. We first train the network model and hash centers using the hash triplet loss. Then, for the hash code of a detected video, we find the hash center with the minimum Hamming distance and retrieve the related real video through the index of that hash center. Finally, we use human-level comparison to judge the difference between the real and detected videos and determine whether the detected video is fake. When the found real video is not related to the detected video, detection fails. In summary, the contributions of this paper are as follows:

(i) Our method offers a more reliable alternative to probability-based detection techniques, making it a promising solution for real-world applications, particularly in critical scenarios involving individuals.

(ii) We have designed a novel loss function, the hash triplet loss, for forgery detection through source tracing. Extensive experimental results demonstrate that our method outperforms state-of-the-art forgery detection methods. Our code and models have been released on GitHub and have received considerable attention.

(iii) Our method does not rely on potential forgery artifacts, thereby improving the robustness and generalization of detection. We conducted extensive experiments on multiple datasets of three different types, demonstrating the effectiveness of our approach for detecting various types of synthetic forgeries, such as DeepFake, video splicing, and video inpainting.

Video Inpainting Detection.
Object inpainting has been widely applied in real-world applications such as object removal [27-29]. Methods based on 3D CNNs have shown poor performance in video inpainting. Recently, flow-based approaches have incorporated optical flow into networks used for video inpainting [30, 31]. This alleviates the temporal issues of video inpainting but inevitably leaves temporal artifacts in the generated results. Several works on video inpainting localization have been proposed recently.
Learning-based inpainting localization methods aim to extract semantic representations from a large amount of training data [32, 33]. However, the performance of these methods declines sharply on new datasets due to their reliance on large training sets. Other methods apply advanced features to enhance robustness. VIDNet [9] uses LSTM-based ELA and temporal structures to localize video inpainting. HPF [34] explores high-pass filtering to distinguish high-frequency noise and fake images. FAST [8] combines frequency-domain characteristics and a temporal ViT to improve the performance of video inpainting localization. However, these methods depend on the inherent artifacts of the inpainting manipulation process, making them ineffective when a new forgery method is proposed.

Video Splicing Detection.
Since splicing is a relatively simple operation, image/video splicing is usually performed manually with tools such as Photoshop. Due to the lack of video splicing datasets, there have been few studies on video splicing detection. Image splicing can be detected at the pixel level. PQMECNet [35] uses a local estimate of the JPEG quantization matrix to distinguish spliced regions taken from different sources. MVSS-Net [6] learns semantic-agnostic and more generalizable features by utilizing noise distribution and boundary artifacts around tampered regions. ComNet [36] is customized to approximate the JPEG compression operation, thereby improving performance against adversarial JPEG compression. The challenge of splicing localization is to improve robustness against various postprocessing operations [6] such as compression and blurring.

Hash Retrieval.
Hash retrieval methods map high-dimensional content features of images or videos to the Hamming (binary) space, reducing the memory requirements of image or video retrieval systems, improving retrieval speed, and meeting the demands of massive-scale retrieval [10, 12, 37, 38]. Retrieval methods based on image similarity matching are computationally expensive and time-consuming, as they must match a large number of key frames in videos [12, 37]. Changes in the semantics of fake videos are more obvious and significantly affect matching accuracy. In contrast, hash-based retrieval methods are faster and require fewer resources, and their accuracy mainly depends on the quality of the hash centers [11, 13]. Traditional triplet learning methods use $(q, q^+, q^-)$, capturing only local data similarity from two or three samples and ignoring global data similarity [10, 12]. Subregion [11] proposed a novel subregion-localized hashing approach to learn compact within-class and large between-class hash codes that capture fine-grained local information for efficient fine-grained image retrieval. DLTH [12] introduced a method for generating triplets through a knowledge distillation module, introducing more triplets during training, and proposed a new listwise triplet loss to capture relative similarities in the new triplets. Due to differences in processing logic, directly applying existing hash retrieval algorithms to source tracing is inappropriate.

Source-Tracing Detection.
In recent years, detecting fake data through source tracing has gradually attracted researchers' attention. These methods typically retrieve the source of the data under test from an existing database of real data and then judge authenticity by manually comparing the differences between the data under test and the real data. Shang et al. [39] use distributed blockchain technology to trace the source of fake news, which can effectively prevent the spread of fake news and provide reliable fake news detection. Dwivedi et al. [40] propose a social media framework based on blockchain and watermarking to control the spread of fake news; it helps reduce the spread of fake news by tracing its root or source on social media. Shrivastava et al. [41] propose a model to investigate the spread of fake news related to the COVID-19 pandemic, thereby alleviating the pressure on online social network users. Zhu et al. [42] propose a voice antifraud method; experimental results on the ASVspoof 2019 LA dataset show a 20% performance improvement over traditional binary spoofing detection methods. The work on news and voice demonstrates that source-tracing detection is not only effective but also highly applicable to real-world industry scenarios.

Vision Transformer.
Currently, networks based on ViT have achieved great success in various fields, including image and video tasks [15, 43-46]. ViT is an effective structure for feature extraction from sequential data, making it particularly suitable for extracting temporal features from videos [14, 47, 48]. In addition to the classic 3D CNN and hybrid 2D CNN architectures, ViT provides an alternative solution for video understanding tasks. ViViT [15] first proposed a pure ViT-based structure for video classification, which uses token temporal and positional encodings to more effectively extract spatiotemporal features from videos. In early research, pure ViT-based structures required larger datasets and more memory than CNN models. HRFormer [43] improves memory and computational efficiency by utilizing the multiresolution parallel design introduced in high-resolution convolutional networks, together with local window self-attention conducted on small nonoverlapping image windows. Recent studies have combined CNN and ViT to achieve better performance [49]. PoolFormer [44] turns the self-attention-based ViT structure into a hybrid CNN-ViT structure, significantly reducing computation. With the evolution of deep-learning architectures, the hybrid CNN-ViT architecture has become a popular choice. PVT v1 [48] inherits the advantages of both CNN and ViT and replaces the CNN backbone, making it a unified backbone for various visual tasks. It uses a progressive shrinking pyramid to reduce the computation of large feature maps, achieving better performance on multiple tasks [48]. PVT v2 [14] reduces the computational complexity of PVT v1 to linear and significantly improves basic visual tasks such as classification, detection, and segmentation. In this paper, we use PVT v2 as the detection backbone and combine token temporal encoding with PVT v2 for more effective video feature learning, given that PVT v2 is an image-task model.

Method
In this section, we describe the complete procedure of our approach. As shown in Figure 3, our method involves three main stages: data preprocessing, hash center learning, and fake video source tracing. Initially, we restructure the dataset videos to adapt them to training with the hash triplet loss. Next, we employ the hash triplet loss to learn the hash centers gradually and dynamically. Finally, we save the trained model and hash centers and use the hash code of a fake video to trace the corresponding real video.

Data Preprocessing
The data preprocessing step reorganizes and combines the dataset in a way that is suitable for training our method with the hash triplet loss. Each subclip of a video is used to train the hash center, so that the original real video can be accurately traced from any subclip of a tampered video. Given a dataset defined as $O$ in equation (1), we partition each $X = \{s_x, f_{x_1}, f_{x_2}, \ldots, f_{x_n}\}$ into a class and train a hash center $c_x$ for each class of data $X$. During training, we form triplets $U = \{s_x, f_{x_i}, f_{x_j}\}$, where $s_x$ is the real video and $(f_{x_i}, f_{x_j})$ are two randomly selected fake videos. To train independent hash centers for forgeries, each triplet unit always contains the original real video $s_x$. We recommend a triplet unit size of $U > 8$ to ensure a uniform and reasonable distribution of data across different classes. Note that data preprocessing is applied only during the training phase.
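As an illustration of this preprocessing, the sketch below shows one way the triplet units could be assembled in Python. The dataset layout (a mapping from each real video to the fake videos derived from it) and the function name are our assumptions for illustration, not the released implementation:

```python
import random

def build_triplets(dataset, units_per_class=8):
    """Assemble triplet units (s_x, f_xi, f_xj) for every class X.

    `dataset` is assumed to map each real video s_x to the list of
    fake videos derived from it; each class needs at least two fakes.
    """
    triplets = []
    for label, (real, fakes) in enumerate(dataset.items()):
        for _ in range(units_per_class):
            # Every unit contains the real video plus two fakes
            # sampled at random from the same class.
            f_i, f_j = random.sample(fakes, 2)
            triplets.append((real, f_i, f_j, label))
    return triplets
```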

Definition of Hash Triplet Loss.
Inspired by the K-means clustering algorithm, the process of training the hash center is similar to clustering: it gradually adjusts the hash center to cluster the real video and related fake videos of the same class. The main idea of the hash triplet loss is to increase the interclass loss and reduce the intraclass loss. The interclass loss refers to the Hamming distance between hash codes of videos from different classes, while the intraclass loss refers to the Hamming distance between the hash codes of a real video and its related fake videos. The process of computing the complete hash triplet loss is illustrated in Algorithm 1: we input the hash codes and labels $HL_s$ of the training instances, produced by ViTHash, along with the associated voted hash centers and labels $HC_s$, and then compute the intraclass loss and interclass loss using the function defined in Algorithm 1. The resulting loss $L$ combines the two terms, where $m$ is the number of videos in the same class (intraclass) and $n$ is the number of videos in different classes (interclass).
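Because the printed equation for $L$ is not reproduced above, the following is a minimal PyTorch sketch of the loss as described: a smooth mean absolute difference stands in for the Hamming distance, the intraclass term pulls each hash toward its class center, and the interclass term pushes normalized distances between distinct centers toward half the hash length. The exact weighting and normalization in the paper may differ.

```python
import torch

def hash_triplet_loss(hashes, labels, centers):
    """Sketch of the hash triplet loss.

    hashes:  (B, k) tanh outputs in [-1, 1]
    labels:  (B,)   class index of each video
    centers: (C, k) temporary hash centers, one per class
    """
    # Intraclass: pull every hash toward its own class center.
    intra = torch.mean(torch.abs(hashes - centers[labels]))

    # Interclass: push distinct centers apart until their normalized
    # L1 distance reaches 1, i.e., a Hamming distance of k/2 bits.
    k = centers.shape[1]
    dist = torch.cdist(centers, centers, p=1) / k
    off_diag = ~torch.eye(len(centers), dtype=torch.bool, device=centers.device)
    inter = torch.clamp(1.0 - dist[off_diag], min=0).mean()

    return intra + inter
```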

Voting Temporary Hash Centers.
During each training iteration, given a triplet unit $U = \{s_n, f_{n_j}, f_{n_k}\}$, the ViTHash network outputs the corresponding hash codes. Each triplet $U$ votes to generate a temporary hash center $h$, where $U_i$ denotes the $i$th column of the matrix of hash codes in $U$:
$$h_i = \begin{cases} 1, & \mathrm{mean}(U_i) > 0, \\ 0, & \text{otherwise}. \end{cases}$$
The hash codes of the same triplet are encouraged to be close to this temporary hash center through the intraclass loss, while the hash centers of different triplets are pushed far away from each other through the interclass loss, implemented using equation (3). Through repeated iterative training and optimization, the temporary hash center gradually approaches the optimal hash center with an average Hamming distance close to half of the hash code length [13]. The trained model and hash center file are saved for future use.
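In code, this column-wise vote is a one-liner; the sketch below assumes the three hash codes of a unit are stacked into a $(3, k)$ tensor:

```python
import torch

def vote_center(unit_hashes):
    """Vote a temporary hash center from one triplet unit.

    unit_hashes: (3, k) real-valued hash outputs for (s_x, f_xi, f_xj).
    Bit i of the center is 1 when the mean of column i is positive,
    and 0 otherwise, matching the voting rule described above.
    """
    return (unit_hashes.mean(dim=0) > 0).int()
```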

Nondifferentiable Optimization for Similar Videos.
Learning optimal hash centers through the network is challenging due to the high similarity of hash codes among similar videos. Nondifferentiable optimization often arises in deep learning-based hash code generation because similarity metrics such as the Hamming distance are nonsmooth; such problems can be handled using subdifferentials. For a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, its subdifferential at a point $x$ is defined as
$$\partial f(x) = \{\, g \in \mathbb{R}^n : f(y) \geq f(x) + \langle g, y - x \rangle, \ \forall y \in \mathbb{R}^n \,\},$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product and $\partial f(x)$ represents the subdifferential of $f$ at point $x$. The nonsmooth optimization problem can be written as $\min_{x \in \mathbb{R}^n} f(x)$, where $f(x)$ is a nonsmooth function.
Subgradient methods can be used to solve such nonsmooth optimization problems, including learning optimal hash centers. Specifically, the subgradient of the similarity metric with respect to the hash codes is computed and used to update the hash codes. The update rule of subgradient methods is
$$h_{t+1} = h_t - \eta_t \, g_t, \quad g_t \in \partial S(h_t),$$
where $h_t$ denotes the hash codes at iteration $t$, $S(h_t)$ denotes the similarity metric, $\eta_t$ denotes the learning rate, and $\partial S(h_t)$ denotes the subgradient of the similarity metric at $h_t$.
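A sketch of one subgradient step in PyTorch terms, for illustration (in practice the smooth surrogate of the next subsection makes ordinary autograd applicable):

```python
import torch

def subgradient_step(h, eta, S):
    """One subgradient update h_{t+1} = h_t - eta * g_t, g_t in dS(h_t).

    For absolute-value-based metrics, PyTorch's autograd returns a
    valid subgradient at the non-differentiable points.
    """
    h = h.detach().requires_grad_(True)
    S(h).backward()                 # populate h.grad with a subgradient
    return (h - eta * h.grad).detach()
```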
In our PyTorch implementation of the hash triplet loss, we calculate the vector distance as torch.mean(torch.abs(torch.sub(h, c))) instead of using the nonsmooth Hamming distance. The Hamming distance measures the similarity between two hash codes of the same length by counting the number of differing bits: the smaller the count, the higher the similarity. Our implementation measures similarity as the average absolute difference between two vectors, where smaller values indicate higher similarity to the hash center. Therefore, in theory, the two measures can be used interchangeably.
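The contrast between the two measures can be made explicit. A small sketch of both, for illustration:

```python
import torch

def hamming_distance(a, b):
    # Non-differentiable: counts the differing bits of two binary codes.
    return (a != b).sum()

def soft_distance(h, c):
    # Differentiable surrogate used during training: the average
    # absolute difference between the real-valued hash vectors.
    return torch.mean(torch.abs(torch.sub(h, c)))
```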
In practice, we have observed that even very similar videos exhibit slight differences in their hash codes. Increasing the length of the hash codes (to 512 bits) allows different bit elements to be optimized and better represents these slight differences. The interclass loss increases the Hamming distance between hash codes of different classes during each iteration to train optimal hash centers.

Fake Video Source Tracing.
After training the model and hash centers, source tracing becomes a straightforward task, but it requires human-level interaction to judge whether the detected video is forged. Once $HC$ is trained, we load both $HC$ and the trained model. Given a detected video $f$ and the hash code $h$ outputted by the trained model, we calculate the Hamming distance between $h$ and all hash centers in $HC$. We find the $HC_i$ with the minimum Hamming distance to $h$, along with its corresponding label, and use the label to retrieve the original genuine video $s_i$. Finally, we compare the detected video $f$ with the genuine video $s_i$ by human-level judgment. Since tampered videos always have obvious semantic modifications, it is easy to distinguish the difference between the detected video $f$ and the genuine video $s_i$; thus, one can judge whether it is forged through human-level interaction.
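Concretely, the lookup reduces to an argmin over Hamming distances to the saved centers; a minimal sketch, where the tensor layout of $HC$ is our assumption:

```python
import torch

def trace_source(hash_code, centers, labels):
    """Find the real video whose hash center is closest to `hash_code`.

    hash_code: (k,) binary code of the detected video.
    centers:   (C, k) trained binary hash centers (HC).
    labels:    labels[i] identifies the real video of center i.
    """
    dists = (hash_code != centers).sum(dim=1)  # Hamming distance to each center
    return labels[dists.argmin().item()]       # label of the closest center
```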

Networks
In this section, we introduce the network architecture of our approach, as well as some advantages of our method.

Overview of Networks
ViTHash.
As shown in Figure 4, ViTHash is used to train the hash centers and trace the source. The feature extraction of ViTHash consists of a series of spatiotemporal PVT v2 blocks [14] and multiple attention blocks. The first module, the spatial transformer, focuses on spatial features, while the second module, the temporal transformer, focuses on temporal features. Finally, the output is generated through the tanh function and then converted into binary codes using the sign() function in equation (8), where $k$ represents the number of hash bits.

Localizator.
As shown in Figure 4, the Localizator architecture is designed to facilitate comparison between real and fake videos. It serves as an auxiliary comparison network that outputs suspicious areas in grayscale, helping us distinguish the differences between the traced and detected videos. We observed that the ViT-based network disrupts the spatial continuity of pixels when trained on linear patch images [45]. CNN blocks excel at learning high-level features and focus on the correlation of local pixels, while ViT focuses on the long-range context and temporal-dimension features of videos. To improve performance, we designed a hybrid CNN-ViT structure. In addition, we used an upper sampling module to gain more detailed insights into the differences in the regions of interest.

Speed and Space Efficiency.
We assume that the time required for detection using different backbone networks is similar and denote it as $t$. For a traditional forgery detection method, the time cost is $t_1 = t$.
The hashing retrieval method requires a time cost of $t_2 = \lambda + t \approx t$, where $\lambda$ is the time needed to calculate the Hamming distance. In contrast, the content-matching retrieval method takes a time cost of $t_3 = n \times t$, where $n$ denotes the number of matched videos. In addition, hashing retrieval requires minimal storage space: only the hash code and the video index are stored. The hash code is a fixed-length binary string ($k$ bits), and the index is represented by a 32-bit integer, so the total storage space required is $(32 + k) \times n$ bits, where $n$ is the number of original videos (for example, $k = 512$ and $n = 10^6$ videos require about $544 \times 10^6$ bits, roughly 68 MB). The scalability of hashing methods enables them to handle large datasets with ease, making them ideal for big data applications.

Existing detection methods cannot provide fully reliable results, even when claiming to provide additional interpretable visual features. In critical scenarios involving high-profile individuals in government, the military, and business, it is difficult to eliminate the impact of public opinion without conclusive and reliable evidence. Establishing a database of related real videos for these individuals allows malicious tampering based on these videos to be traced back to the original real videos, and comparing the originals with the fake videos provides reliable evidence for tampering detection.

Experiment Setup.
We conduct two sets of experiments: the ViTHash detection performance evaluation and the Localizator evaluation. For ViTHash, six evaluation experiments and one ablation study are carried out. The six evaluation experiments are a DeepFake comparison experiment, a video tampering experiment, a video splicing experiment, a robustness experiment, a cross-dataset generalization experiment, and a similar-scene performance experiment; ViTHash detection performance is evaluated using detection accuracy (ACC) as the metric. The Localizator serves as an auxiliary comparison tool to facilitate the comparison of two videos: its evaluation experiment outputs the pixel-level suspect region between two videos and is compared against two known methods, using mean intersection over union (mIoU) as the evaluation metric.

Implementation.
Our model is implemented in PyTorch, and the code is released on GitHub. We use ffmpeg to extract frames from videos and train the model on a single NVIDIA RTX 3090 24 GB GPU. Each model is trained for 2-5 epochs on the dataset. In addition, we use the adaptive moment estimation (Adam) optimizer with a learning rate of 1e-5. Adam is computationally efficient, requires little memory, and performs well on large datasets.

Baseline Methods.
In the ViTHash comparative experiment, we use accuracy as the evaluation metric. As the necessary implementation codes were not available, we cite the experimental results of the compared methods from the relevant papers. Compared to existing forgery detection methods that directly output binary classification results, our method utilizes traceability to determine authenticity. For fairness, we use the Top-1 retrieval accuracy as the evaluation metric because there is only one correct result for traceability. To evaluate the cross-dataset generalization performance, we compare Xception [50], HRNet [51], Face X-ray [52], ADD [1], and Grad-CAM [53] on the five subdatasets of FaceForensics++.
For the Localizator comparison experiment, we chose mIoU as the evaluation metric and compared against two known methods, DMAC [62] and DMVN [63].

Datasets
DeepFake Dataset.
We evaluate our method on several publicly available datasets in the field of DeepFake detection. The FaceForensics++ (FF++) dataset [50] includes 1,000 real videos and 5,000 unique fake videos collected from YouTube. The Google/Jigsaw DeepFakeDetection (DFD) dataset [64] contains 363 original videos from 28 consenting actors and 3,068 fake videos. The Celeb-DF [65] dataset, which is part of the deep fake detection challenge, consists of 590 original videos and 5,639 fake videos.

Similar Scene Video Dataset.
We create a dataset called DeepFake of similar scenes (DFS) to evaluate the detection performance of the hash triplet loss on similar videos, as shown in Figure 5(a). DFS aims to simulate scenarios like news conferences, which are highly similar and thus challenging to detect. We paid 75 actors to shoot similar-scene videos in which they sit in front of the camera and give speeches while wearing similar clothing, with minor scene changes. Different actors were required to shoot in designated scenes such as offices, studies, and bedrooms. We used three DeepFake generation methods, namely, DeepFaceLab [66, 67], Faceswap [68, 69], and Faceswap-GAN [70], to generate 187 forged videos. DFS consists of 133 training videos and 54 test videos, totaling 578,613 frames. DFS is an Asian face dataset, and all actors authorized the modification of their recorded videos.

Video Inpainting Dataset.
Yu et al. [8] proposed a video inpainting dataset named DAVIS-VI based on DAVIS [71]. They used three video inpainting methods, namely, OPN [72], CPNET [73], and DVI [74], to remove the annotated objects from the DAVIS dataset and generate the corresponding inpainted videos. However, due to the limited number of original samples, we further augmented the DAVIS-VI dataset with three additional video inpainting methods: FGVC [31], DFGVI [30], and STTN [75]. As shown in Figure 5(c), DAVIS-VI contains 50 original videos and 300 inpainted videos, totaling 33,550 frames. The training set includes 200 inpainted videos, and the test set includes 100 inpainted videos.

Video Splicing Dataset.
Video splicing detection has received relatively little attention due to the lack of video splicing datasets. Compared to image splicing datasets, creating a video splicing dataset is challenging because it requires considering the position, size, color, and semantics of the spliced objects. As shown in Figure 5(b), we create a video splicing dataset called Video Splicing to evaluate the performance of our method in detecting video splicing forgery. The dataset contains 30 carefully and manually created videos of different scenes as the test set and 795 randomly spliced forgery videos based on these objects and real videos as the training set. We developed a Photoshop-like tool to create videos frame by frame. Given all the frames of a real video $A = \{a_1, a_2, \ldots, a_n\}$, where $n$ is the number of frames in the real video, and a set of frames for the object to be spliced $B = \{b_1, b_2, \ldots, b_m\}$, where $m$ is the number of object frames and $m < n$, the frames of the synthesized video are defined as $R = \{r_1, r_2, \ldots, r_m\}$. The production process of the forged video is defined as $R = A + (\mathrm{scale} \times B + \mathrm{pos})$, where scale is the scaling factor of the spliced object and pos is the position of the object $B$ in the forged video $R$.
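To make the compositing formula concrete, the sketch below pastes one scaled object frame onto one real frame. The nearest-neighbour resize is our simplification, and the blending, color matching, and boundary handling that a real tool would apply are omitted; the object is assumed to fit inside the frame at the given offset:

```python
import numpy as np

def splice_frame(a, b, scale, pos):
    """Composite one object frame b onto one real frame a.

    Follows R = A + (scale x B + pos): b is resized by `scale`
    and pasted at offset `pos` = (top, left).
    """
    r = a.copy()
    h = int(b.shape[0] * scale)
    w = int(b.shape[1] * scale)
    # Nearest-neighbour resize, to keep the sketch dependency-free.
    ys = (np.arange(h) / scale).astype(int)
    xs = (np.arange(w) / scale).astype(int)
    top, left = pos
    r[top:top + h, left:left + w] = b[ys][:, xs]
    return r
```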

Robustness Experiments.
The propagation of fake videos on the Internet inevitably involves various video processing techniques, such as compression, cropping, redrawing, and blurring. Improving the robustness of video detection against these operations has important practical significance.
As shown in Table 1, the performance on processed videos is almost the same as that on unprocessed videos. This is mainly because video processing techniques destroy the forgery traces of fake videos, while our method extracts features that are irrelevant to forgery, giving it better robustness. In addition, longer hash codes usually lead to better performance: more hash code elements can better capture small differences between videos, help generate better hash centers, and alleviate the nondifferentiable optimization problem of similar videos. However, a slight performance decrease is observed when the hash code length reaches 1,024, as the marginal utility of additional elements diminishes. When the hash code length exceeds 512, redundant information is learned, which may have a negative impact on source tracing. The experiments prove the robustness of our method against the video processing found in Internet scenarios.

Cross-Dataset Evaluations.
We conduct cross-dataset evaluations to further validate the generalization ability of our proposed method. As shown in Table 2, our method achieves comparable or better performance than recent works in the within-dataset setting and has a significant advantage in the cross-dataset setting. This is because those methods simply learn dataset-dependent forgery features from existing data, which may not apply to unknown forgery data, whereas our method aims to learn more general features that are independent of the forgery method. The experiments show that our method has better generalization ability for detecting unknown forgeries.

DeepFake Comparison Experiment.
To evaluate the performance of our method, we compare it with state-of-the-art methods on popular datasets including Celeb-DF, DeepFakeDetection, and FaceForensics++. Figure 6(a) presents several examples of correct results on the FaceForensics++ dataset, where "Fake" indicates forged videos and "Traced" represents the traced videos. As shown in Table 3, our method achieves comparable or better performance than the state-of-the-art methods. As shown in Table 4, our method performs consistently well on different qualities and types of DeepFake videos, achieving better performance than existing methods, especially on low-quality (LQ) videos. Existing methods rely on learning forgery features from the data, which yields good performance within the same dataset; their poor performance on the LQ dataset arises because low-quality compression damages the potential forgery features they have learned. In contrast, our method extracts features that are independent of forgery traces, resulting in better performance in detecting various types of forgeries and low-quality videos. The experiment shows that our method is effective in detecting DeepFake videos on multiple datasets and is more robust than existing methods.

Experiment on Video Inpainting Detection.
To verify the performance of our method in detecting video object removal, we conduct experiments on the DAVIS-VI dataset [8]. Existing video object removal detection methods are limited to pixel-level detection, and there are no corresponding comparison methods. Figure 6(b) presents several examples of correct results on the DAVIS-VI dataset, where "Fake" denotes forged videos and "Traced" refers to traced videos. As shown in Table 5, our method achieves nearly 100% accuracy. This is because the object removal dataset contains only 50 real videos, making it easy to find the original among them. Moreover, the large semantic differences between forged and real videos in object removal make the differences between videos easier to learn, rendering the source-tracing task relatively simple. The experiments show that our method is effective in detecting video inpainting.

Experiment on Video Splicing Detection.
To evaluate the performance of our proposed method for detecting video splicing, we conducted experiments on the video splicing dataset. Due to the lack of publicly available datasets for video splicing, there are no comparable methods. Figure 6(c) presents several examples of correct results on the video splicing dataset, where "Fake" denotes forged videos and "Traced" refers to traced videos. As shown in Table 5, our method achieved nearly 100% accuracy in this experiment, which is mainly due to the small size of the test set, consisting of only 30 videos. In addition, spliced videos often have significant semantic differences, making them easier to trace. These results demonstrate the effectiveness of our proposed method for detecting video splicing.

Experiment on Similar Scene Detection.
To evaluate the performance of our designed hash triplet loss in distinguishing similar videos on the DFS dataset, as shown in Table 5, we evaluated our method on 133 forged videos traced back to 54 real videos, achieving 100% accuracy. The large amount of data in the DFS dataset, which contains 578,613 frames, allows our method to fully exploit each video's unique features and exhibit excellent performance.
We also analyzed the experimental results on similar videos. Figure 7 shows male and female subclips with similar backgrounds captured from different angles in the same room. Despite their similarity, our method can accurately identify the original video. The results demonstrate that the hash triplet loss can effectively learn subtle differences in similar videos and address the nondifferentiable optimization problem in hash code learning.

Localizator Evaluation Experiment.
As shown in Table 5, our method outperforms DMAC and DMVN in localizing the suspicious regions between two videos. Since these two methods are relatively early, we applied an effective feature extraction network based on ViT and CNN to more easily mark the differing regions of the two videos. The experiment shows that the Localizator is effective in distinguishing the suspicious regions of two videos.

Ablation Study.
To validate the effectiveness of the hash triplet loss, we evaluated our method from three aspects: structure, activation function, and error analysis. Since the average Hamming distance has a significant impact on the quality of the generated hash centers, we used it as one of the evaluation metrics [13].

Hash Triplet Loss.
To validate the structure of the hash triplet loss, we evaluated the performance using only the interclass loss or only the intraclass loss on FaceForensics++. As shown in Figure 8(a), when trained with only the interclass loss, it is difficult to make the intraclass videos converge to the hash center; the hash center keeps changing, which does not meet our expectations. When trained with only the intraclass loss, despite the various improved algorithms and training strategies we attempted, the hash codes remain unstable and collapse toward the all-zero or all-one vector. When both losses are trained together, the average Hamming distance between hash centers gradually approaches half of the hash bits. The experimental results demonstrate that the structure of the hash triplet loss is reasonable and necessary.

Various Activation Functions
We evaluated the performance of different activation functions and their corresponding hash binarization functions: ReLU with equation (10), tanh with equation (11), and sigmoid with equation (12). As shown in Figure 8(b), with the help of the hash triplet loss, the Hamming distance for all of these activation functions quickly stabilizes at around half of the hash bits. This suggests that the influence of the activation function on the experimental results is minor, while the hash triplet loss is the more important factor.
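Since equations (10)-(12) are not reproduced above, the sketch below shows the customary pairing of each activation with a binarization threshold; the paper's exact forms may differ, so treat these as assumptions:

```python
import torch

def binarize(z, act="tanh"):
    # Plausible activation/threshold pairings for equations (10)-(12).
    if act == "relu":
        return (torch.relu(z) > 0).int()       # threshold at 0
    if act == "tanh":
        return (torch.tanh(z) > 0).int()       # sign of tanh
    if act == "sigmoid":
        return (torch.sigmoid(z) > 0.5).int()  # threshold at 0.5
    raise ValueError(f"unknown activation: {act}")
```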

Analysis of Incorrect Results.
As shown in Figure 9, we present examples of erroneous results on multiple datasets. The first three videos share similar backgrounds and human poses, except for differences in the faces and clothing. In the fourth video, two people swapped positions. The remaining videos have subtle differences that are imperceptible even to human observers, so these errors are reasonable and consistent with common sense. In our extended experiments, we found that expanding the scope of tracing (Top-10) can avoid these errors. The errors in these experimental results indicate that our method is sensitive to the structural content of videos, demonstrating that it effectively learns the semantic structure of the video rather than relying on forgery traces. This property is beneficial for improving the detection of unknown forged videos.

Conclusions
In this paper, we propose a reliable source-tracing-based method for detecting forged videos, which provides trustworthy and interpretable detection results. Our method is essential for scenarios that require reliable detection to prevent the spread of rumors on the Internet. We introduce the hash triplet loss to solve the nondifferentiable optimization problem of similar videos, which effectively improves source-tracing accuracy and the ability to distinguish similar videos. Experimental results on various types of datasets demonstrate that our method is capable of detecting video forgeries and exhibits good robustness to commonly used video processing techniques on the Internet. Since our method extracts forgery-independent features, it can easily be extended to detect other types of video synthesis forgeries. In conclusion, our proposed method provides an efficient and reliable solution for detecting forged videos and has great potential for future industrial applications.

Figure 3: The complete process of training hash centers and performing source tracing is divided into three steps. Step 1, data preprocessing: the data are organized into a format suitable for training with the hash triplet loss. Step 2, hash center learning: the hash centers are dynamically trained, and a temporary hash center is generated by voting during each training iteration. Step 3, source tracing: the trained model and hash centers are used to search for the real video with the smallest Hamming distance from the detected video.

Figure 4: Overview of our proposed networks. Our method consists of two networks, ViTHash and Localizator, and two basic modules, upper sampling and transformer block; ViTHash and Localizator are composed of these basic modules. ViTHash trains hash centers from triplet videos, which include the original video and two randomly selected related fake videos. The trained hash centers are used to trace the source of fake videos. The Localizator is designed to analyze the differences between the traced video and the fake video, which are not affected by video quality or cropping. The differing areas of the two videos are represented by generated masks.

Figure 5: Thumbnails of the datasets. (a) DFS: a dataset for detecting DeepFake of similar scene (DFS) videos. (b) Video splicing: a dataset for detecting video splicing. (c) DAVIS-VI: a dataset for detecting video inpainting based on DAVIS2016.

Figure 7: Example results of experiments on similar videos in two different scenarios: one with a similar background and another with the same background from a different angle.
ALGORITHM 1: The whole calculation process of the hash triplet loss.

Table 1: Robustness experiments with various video preprocessing operations and different hash bit lengths on FaceForensics++.

Table 3: Comparison of accuracy (ACC) with existing methods on multiple datasets. Bold values represent the best results on the corresponding dataset.

Table 4: Comparison of fine-grained accuracy (ACC) with recent works on the FaceForensics++ high-quality (HQ) and low-quality (LQ) datasets.

Table 5: Evaluation of classification and localization on different types of datasets. Bold values represent the best results.