Robust Frame Duplication Detection for Degraded Videos

To detect frame duplication in degraded videos, we propose a coarse-to-fine approach based on locality-sensitive hashing and image registration. The proposed method consists of a coarse matching stage and a duplication verification step. In the coarse matching stage, visually similar frame sequences are preclustered by locality-sensitive hashing and considered as potential duplication candidates. These candidates are then checked by a duplication verification step. Unlike existing methods, our duplication verification does not rely on a fixed distance (or correlation) threshold to judge whether two frames are identical. Instead, we resort to image registration, which is intrinsically a globally optimal matching process, to determine whether two frames coincide with each other. We integrate stability information into the registration objective function to make the registration process more robust for degraded videos. To test the performance of the proposed method, we created a dataset consisting of 3 subsets with different kinds of degradation and 117 forged videos in total. The experimental results show that our method outperforms state-of-the-art methods in most cases in our dataset and exhibits outstanding robustness under different conditions. Thanks to the coarse-to-fine strategy, the running time of the proposed method is also quite competitive.


Introduction
With various nonlinear editing tools such as Adobe Premiere, Microsoft Movie Maker, and Sony Vegas, it is now much easier for people to tamper with the content of a video. Many different kinds of detection methods have been proposed [1][2][3]. Among the different approaches to video forgery, frame duplication, which simply copies a sequence of frames to another position in the timeline, may be one of the most convenient yet effective means to hide or counterfeit events. The forged part of the video can easily be made visually natural and is therefore difficult to detect manually. Fortunately, since the source and target frames simultaneously exist in the video, frame duplication forgery can be exposed by detecting abnormal identical frame sequences. On this basis, several methods have been proposed [4][5][6][7][8]. These methods share a common methodology to judge whether a frame is the copy of another one: they extract features from the frames and set a distance threshold between the features. Such a methodology makes it difficult for these methods to perform robustly when applied to realistic frame duplication detection (FDD), where degradation is quite common. In degraded videos, the local structure of the frames can be altered slightly by various factors, and the values of the extracted features change correspondingly; therefore, a fixed distance threshold cannot always work well. For instance, an experienced attacker may add a little perturbation (e.g., additive noise) to either the source or the target frames; in this noisy scenario, it is quite probable that a threshold tuned for ordinary cases will miss some true matching frame pairs. In fact, even the lossy encoding process will result in substantial differences between the source and target frames; please see Figure 1 for an example.
Since a tampered video will be subject to at least double compression, this example indicates that degradation in video forgeries is almost unavoidable. This makes it quite complicated to stably detect frame duplication in realistic scenarios.
In this paper, we attempt quite a different methodology for FDD. In the proposed method, we no longer rely on a fixed threshold to decide whether a video subsequence is the duplication of another. The key idea is that, for any two frames f a and f b, if they contain the same objects and the corresponding objects' shape and position are identical, then f a and f b can be considered copies of each other. We resort to image registration to check whether the three aspects (objects, objects' shape, and objects' position) of two frames coincide with their counterparts. Specifically, the problem is solved by pixel-level globally optimal matching. When a given frame is aligned to its copy, there should not be any distortion in the resulting offset field, i.e., each pixel in the source frame matches the same location in the target frame. Our method is robust to a certain magnitude of video degradation, but the global matching procedure is not as fast as some feature extraction and comparison-based methods such as [4]. For acceleration, our pipeline involves a coarse matching step, which significantly improves the computational speed.

Figure 1: An example demonstrating the substantial difference between a frame and its copy. Such difference is caused only by the lossy encoding process itself. The video is H.264 encoded. (a, b) A pair of source and target frames. Although they are visually identical, nonnegligible error between them has been introduced during the encoding process. (c) The histogram of pixel-wise absolute differences. We can see that the error can be as large as 30, and the number of occurrences of errors larger than or equal to 10 is over 700,000 (the resolution of the video is 1920 × 1080).
To further demonstrate the impact of such differences, we build a 500-word vocabulary by k-means clustering of 7500 dense SIFT features extracted from 72 randomly selected images, and the vocabulary is lexicographically sorted so that neighbouring visual words in the vocabulary are also nearer in the feature space. We then extract dense SIFT features for each pixel in (a) and (b) and map the features to the indices of the visual words, which gives two visual word index maps (d, e) and the absolute difference map between them (f). The considerable number of bright spots in (f) indicates that the lossy compression substantially changed the local structure and hence the local features.
It should be noted that, like the other methods in this field [4][5][6][7][8], our method does not take static scenes into consideration. Therefore, our method can be used to expose counterfeit events. For instance, an attacker can copy a sequence of frames from a historical record to counterfeit the event of a man passing through a scene; the duplicated frames can then be used as evidence of being present or absent and therefore cast suspicion on that man or absolve him of guilt. The contributions of this paper are listed as follows. First, we propose a coarse-to-fine FDD scheme, whose first key step is preclustering perceptually similar sequences of frames by locality-sensitive hashing (LSH). Through this coarse matching step, the computational load for finer duplication verification can be reduced by several orders of magnitude.
Second, we use globally optimal matching for finer duplication verification, with the computational-cost benefit obtained in the coarse matching step. We integrate the stability information of different regions into the matching objective, resulting in a robust yet sensitive matcher for noisy environments. The rest of this paper is organized as follows: in section "Related Work," we briefly introduce related work, and then in section "Proposed Method," the proposed method is detailed. The experimental results are presented in section "Experimental Results." The conclusion and future directions are finally drawn in section "Conclusion and Future Work."

Related Work
To the best of our knowledge, there are only a few works on FDD. In [8], Wang and Farid proposed the first FDD method. The video is divided into overlapping short subsequences; then, for each subsequence, the temporal correlation between each pair of frames and the block-wise spatial correlation within each frame are calculated and used as features for subsequence comparison. The subsequences are compared with each other in a coarse-to-fine manner: high similarity in temporal correlation coefficients triggers the comparison between spatial correlation coefficients [9]. Given the number of frames, n, the time complexity of the subsequence comparison process is O(n · (n + 1)/2). Methods using the same framework include [4,6], which, respectively, use structural similarity (SSIM) [6] and histogram correlation to measure the similarity between subsequences.
Unlike [8], in [5,7], the authors lexicographically sorted the features (Tamura texture and local binary patterns are used as features in [5,7], respectively) corresponding to each frame; then, neighbouring features which are close enough to each other in the feature space correspond to duplicated frames. In this manner, the time cost spent on identifying matching frames is theoretically reduced to O(k · n · log n), where k is the feature dimension. It should be noted that, for lexicographical sorting, the features corresponding to the frames of the entire video have to be stored in memory simultaneously. From an implementation perspective, in memory-constrained environments, as k and n increase, the storage needed by lexicographical sorting can exceed the memory capacity. In such cases, the features have to be stored on disk instead, and the sorting procedure then involves frequent disk access, which is rather slow.
Taking the characteristics of the encoding process into consideration, Subramanyam and Emmanuel [10] extracted histogram of oriented gradients (HOG) features from block pairs colocated in neighbouring I, B and P, B frame pairs; high correlation between the HOG features then discloses duplicated blocks and hence frames. However, this method can only detect the case in which the source and target frames are placed next to each other, which is quite uncommon. The methods discussed above all detect the duplication behaviour with a fixed global threshold, which makes them less robust. Although some features can be, to some extent, robust to degradation, a threshold calibrated for one condition may not be suitable for others. This problem becomes particularly noticeable when processing degraded videos, since there are too many unstable factors caused by, e.g., compression artifacts or manually added noise.
One subject closely associated with FDD is near-duplicate identification (NDI) [11][12][13][14], whose major concern is copyright issues. Unlike FDD, in NDI, the video clip in the query is known, while the goal of FDD is to find all duplicated frame pairs within a video sequence in which every frame may possibly be forged; therefore, FDD is more challenging in terms of time complexity, which increases quadratically with the number of frames [11]. Another major difference between FDD and NDI is that, in NDI, the potential attacks can be much stronger than in FDD: the pirated videos can be geometrically transformed (e.g., picture-in-picture or recaptured videos) or overlaid with logos or subtitles. In this sense, methods for NDI should be more robust but less discriminative than those for FDD.

Proposed Method
3.1. The Pipeline. We detect frame duplication in a coarse-to-fine manner. The pipeline consists of two key steps, coarse matching and duplication verification, as shown in Figure 2.
The main concern of the coarse matching stage is to significantly reduce the computational burden of the second stage. Given an input video V, following [8], we first divide V into T − L + 1 overlapping subsequences, where T is the total number of frames and L is the subsequence length. For each subsequence t (1 ≤ t ≤ T − L + 1), using LSH, we identify from t's succeeding subsequences the ones that are visually similar to t. In this way, we cluster subsequence t and its duplication candidates into the same group. Then, in the duplication verification stage, we perform image registration between each pair of corresponding frames in t and the duplication candidates, respectively. Through image registration, we obtain a series of offset fields, and zero-valued offset fields verify duplications.
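The two-stage structure described above can be sketched as follows (a minimal illustration in Python; `coarse_match` and `verify_duplication` are hypothetical placeholders standing in for the LSH stage and the registration-based verification stage, not the authors' implementation):

```python
# A minimal structural sketch of the coarse-to-fine pipeline (illustrative).

def overlapping_subsequences(num_frames, L):
    """Index lists for the T - L + 1 overlapping subsequences of length L."""
    return [list(range(t, t + L)) for t in range(num_frames - L + 1)]

def detect_duplications(frames, L, coarse_match, verify_duplication):
    subseqs = overlapping_subsequences(len(frames), L)
    detections = []
    for t, seq in enumerate(subseqs):
        # Coarse stage: indices of visually similar later subsequences.
        for c in coarse_match(t, subseqs):
            # Fine stage: registration-based check of each candidate pair.
            if verify_duplication(frames, seq, subseqs[c]):
                detections.append((t, c))
    return detections
```

With T frames and length-L subsequences there are exactly T − L + 1 windows, matching the count in the text.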

3.2. Coarse Matching by LSH.
The left part of Figure 2 depicts the process of coarse matching for a given subsequence t. For subsequence t, we would like to find, among subsequences t + L to T − L + 1, the duplication candidates c t,1, c t,2, . . ., c t,k (where k ≤ T − t − 2 · (L − 1)) which are perceptually similar to t, such that the duplication verification procedure, which is more accurate but slower, can compare t only with c t,i (1 ≤ i ≤ k) instead of with all the succeeding subsequences of t. To this end, what we need is a feature which is sensitive to content change (i.e., objects, objects' shape, and objects' position) while robust against image degradation. Although many image hashing schemes could be used for this purpose (e.g., the wavelet-based [15] and SVD-based [16]), we find that a block-wise GIST [17] feature best meets our requirements in terms of the tradeoff between robustness, discriminative power, and computational time. For each frame, we extract the block-wise GIST descriptor; then, the descriptors extracted from the L frames in subsequences t and t + L + i (0 ≤ i ≤ T − L + 1 − t) are, respectively, concatenated to form one-dimensional features f s,t and f s,t+L+i for the corresponding subsequences.
We exploit LSH to determine whether a feature f s,t+L+i is sufficiently close to f s,t. Given an error probability P e > 0 and a distance threshold R, when ‖f s,t − f s,t+L+i‖ ≤ R, LSH guarantees that P c ≥ 1 − P e, where P c is the collision probability for the hash values of f s,t and f s,t+L+i. In this paper, we use the p-stable distribution-based LSH [18]:

h a,b (f) = ⌊(a · f + b)/ω⌋, (1)

where f is the feature vector being hashed, a is a real-valued vector whose elements are independently drawn from a standard normal distribution N(0, 1), which has been proven to be 2-stable [18] (since we use the l 2 -norm to measure the difference between features), and b is a real scalar uniformly drawn from [0, ω], where ω is a real scalar.
To produce more reliable results, we construct H hash tables, and subsequence t + L + i is considered as a duplication candidate of t only when the hash values of f s,t and f s,t+L+i collide more than H/2 times.
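The hashing scheme and the majority-vote rule can be illustrated with a small sketch (pure Python; the dimensionality, ω value, and seeds are arbitrary illustrative choices, not the paper's configuration):

```python
import random

def make_hash(dim, omega, seed):
    """One p-stable LSH function h(f) = floor((a . f + b) / omega), as in (1)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable Gaussian projection
    b = rng.uniform(0.0, omega)                    # random offset in [0, omega]
    def h(f):
        return int((sum(ai * fi for ai, fi in zip(a, f)) + b) // omega)
    return h

def is_candidate(f1, f2, hashes):
    """Majority vote over H hash tables, as described above."""
    collisions = sum(h(f1) == h(f2) for h in hashes)
    return collisions > len(hashes) / 2

H = 80
hashes = [make_hash(dim=4, omega=0.45, seed=i) for i in range(H)]
# Identical features collide in every table, so the vote always passes.
print(is_candidate([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], hashes))  # True
```

Distant features project to different buckets in almost every table, so the majority vote rejects them; nearby features collide often enough to pass.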
If the collected duplication candidates include ξ or more consecutive subsequences (ξ = 10 in our experiments, i.e., about 0.5 seconds), these consecutive subsequences are considered static scenes and discarded.
Note that this coarse matching stage also involves a distance threshold, R, but it differs from the thresholds used in existing works in that R is not used for making the final decision. The coarse matching step only eliminates unnecessary computations; therefore, when choosing R, we do not have to worry much about the tradeoff between robustness and distinctiveness; we should just guarantee that the duplications of t form a subset of c t,1, c t,2, . . ., c t,k. In fact, in practice, R does not necessarily have to be explicitly assigned, and we will discuss this in more detail in the Experimental Results section.
At the end of this stage, each subsequence t is associated with a set of potential duplication candidates c t,1, c t,2, . . ., c t,k. The duplication verification step is then performed between t and these candidates.

3.3. Duplication Verification.
For each pair consisting of a subsequence t and one of its duplication candidates c t,i, we perform image registration between the corresponding frames to check whether the two frames contain the same objects and whether the shape and position of the corresponding objects are identical. If so, the registration will yield zero-valued offset fields. However, it is not easy to stably obtain correct registration results for degraded images. As shown in Figure 1, even the lossy compression itself can result in substantial changes between a frame and its copy, which usually cause registration failures. To solve this problem, we propose to find the stable regions in the frames and rely on these regions more than on the unstable areas during the registration procedure. We use a variant of the Harris cornerness response proposed in [19] to measure the stability of the local structure around each pixel, obtaining a stability map M.

Figure 2: The pipeline of the proposed method. The pipeline consists of two key steps: coarse matching by LSH and duplication verification. This diagram depicts the process of identifying the duplication of a given subsequence t. Please see the text for details.
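The stability map can be sketched roughly as follows. This sketch uses the harmonic-mean cornerness det(A)/trace(A) and a simple box smoothing purely for illustration; the exact variant of [19] is not reproduced in this excerpt and may differ:

```python
import numpy as np

def stability_map(frame):
    """Harris-style stability M(x, y): large where the autocorrelation
    (structure tensor) matrix has two large eigenvalues, i.e., where the
    signal changes significantly in two orthogonal directions."""
    fy, fx = np.gradient(frame.astype(float))

    def box_smooth(img):
        # 3 x 3 box filter with edge padding (stand-in for Gaussian weighting).
        pad = np.pad(img, 1, mode="edge")
        out = np.zeros_like(img)
        for i in range(img.shape[0]):
            for j in range(img.shape[1]):
                out[i, j] = pad[i:i + 3, j:j + 3].mean()
        return out

    a = box_smooth(fx * fx)   # structure-tensor entries
    b = box_smooth(fy * fy)
    c = box_smooth(fx * fy)
    det = a * b - c * c
    trace = a + b
    return det / (trace + 1e-12)  # harmonic-mean cornerness
```

On a synthetic step corner, the response at the corner exceeds the response in flat regions, which is exactly the behaviour the weighting scheme below relies on.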
For a frame F, a large value of M(x, y) indicates that both eigenvalues of the autocorrelation matrix corresponding to F(x, y) are large. This means that the signal changes significantly in two orthogonal directions; such points have been shown to be stable under various conditions except for scale change [20,21]. We use M to weight different regions in a frame during the registration process, and the registration objective can be written as a weighted combination of a data term and a smoothness term, where D is the data term, which measures the difference between the local structures around the matching pixels, S is the smoothness term, which guarantees that neighbouring pixels have similar offsets, v(x, y) is the offset for point (x, y), and ⟨(m, n), (x, y)⟩ denotes an edge in the 4-neighbourhood system N.
W_D and W_S are weighting matrices derived from M′(x, y) = max −r≤a,b≤r M̄(x + a, y + b), with M̄ being the normalized version of M. We use this maximal filtering to diffuse the impact of the stable points to a small range around them. ε in (5) is quite a small value (1e−5 in our implementation) used to mask out excessively smooth regions when computing the data term (3). Although the Harris cornerness response gives smooth regions less weight, areas which are excessively flat still cause trouble during registration: the local structure of such regions can easily be changed by small perturbations. Figure 1(f) is a clear illustration of this phenomenon, where the large bright spots (significant differences between visual word indices) are all located on the wall or the floor, both of which are rather smooth. Such excessively smooth regions result in high data costs which are inconsistent with the real situation. Based on this observation, we set the threshold ε to remove the impact of the data costs in those areas. As a consequence, the offset field within those regions is completely controlled by the smoothness term (4); we therefore add a truncation value κ in (6) to guarantee that the smoothness constraint is always above a certain level. The data term (3) compares the feature at (x, y) in the source frame with that at (x, y) + v(x, y) in the target frame (we use the SIFT descriptor [22] extracted on a single scale as the feature for each pixel), and the smoothness term (4) penalizes differences between the offsets of neighbouring pixels, where v_x(x, y) and v_y(x, y) denote the offsets in the horizontal and vertical directions, respectively. We use the truncated l1-norm in (8) to account for discontinuities in the offset field, and α is used for balancing the data term (3) and the smoothness term (4) (α could also be folded into (5); we write it separately for clarity).
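For readability, a plausible shape of the objective consistent with the description above can be written out as follows. Equations (2)-(6) are not reproduced in this excerpt, so this block is a hedged reconstruction (an assumption), not the published formulas:

```latex
% Hedged reconstruction; the exact published forms of (2)-(6) may differ.
E(\mathbf{v}) \;=\; \sum_{(x,y)} W_D(x,y)\, D(x,y)
  \;+\; \sum_{\langle (m,n),(x,y)\rangle \in N} W_S(m,n,x,y)\, S(m,n,x,y),
\qquad
W_D(x,y) \;=\;
\begin{cases}
  M'(x,y), & M'(x,y) > \varepsilon,\\[2pt]
  0, & \text{otherwise,}
\end{cases}
\qquad
W_S(m,n,x,y) \;=\; \max\bigl(M'(m,n),\, M'(x,y),\, \kappa\bigr).
```

Here the ε-mask zeroes the data cost in excessively flat regions, and the κ floor keeps the smoothness weight above a certain level, matching the roles described for (5) and (6).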
We use the dual-layer loopy belief propagation in [23] to minimize the objective function E(v). By decoupling the smoothness term (4) into two parts in (8) corresponding to the two directions, the complexity of the message update in each iteration is reduced from O(nL^4) to O(nL^2), where n is the number of pixels in each frame and L is the number of possible offsets in each direction. The complexity is further reduced to O(nL) by the distance transform proposed in [24]. The multigrid message passing scheme in [21] is also exploited to significantly reduce the total number of iterations.
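The O(L) message update for a truncated-l1 pairwise term can be sketched with a two-pass distance transform of the kind proposed in [24] (a generic illustration: `alpha` is the slope and `tau` the truncation threshold, not necessarily the paper's κ or d):

```python
def dt_truncated_l1(costs, alpha, tau):
    """Lower envelope out[d] = min over d' of costs[d'] + min(alpha*|d - d'|, tau),
    computed in O(L) instead of O(L^2) by two linear passes."""
    L = len(costs)
    out = list(costs)
    for d in range(1, L):             # forward pass: propagate from smaller offsets
        out[d] = min(out[d], out[d - 1] + alpha)
    for d in range(L - 2, -1, -1):    # backward pass: propagate from larger offsets
        out[d] = min(out[d], out[d + 1] + alpha)
    ceiling = min(costs) + tau        # truncation caps every message
    return [min(v, ceiling) for v in out]
```

A brute-force O(L^2) evaluation of the same envelope gives identical results, which makes this routine easy to verify in isolation.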
Optical flow (e.g., [25]) or SIFT flow [23] could also be used for image registration, which is intrinsically a pixel-wise correspondence estimation process. However, neither of them obtains the expected results for degraded videos. The difference between our objective and those of optical flow and SIFT flow is clear: we encode the stability information of different regions into the matching objective, which makes our method quite robust against video degradation. Moreover, our objective contains no small-displacement term of the kind used in SIFT flow, so the registration can be more sensitive to subtle changes between two frames. Figure 3 shows two representative examples demonstrating the difference between the three methods. The offset fields are visualized with the color encoding scheme in [26]; please see Figure 4 for more details.
Compared with the typical methods for FDD [4][5][6][7][8], our method relies on image registration rather than the feature extraction and thresholding strategy. Conforming to the data similarity and smoothness constraints, the correspondences between the pixels of two frames are established in an "optimal" manner through a probabilistic inference process (i.e., the dual-layer loopy belief propagation). Furthermore, the pixels with high Harris cornerness response are usually located on the boundaries of objects; therefore, when the Harris cornerness response is integrated into the registration objective, the registration can, to some extent, be considered an object-level matching process. As a result, even though the registration objective involves more parameters (i.e., κ, α, and d) than typical feature extraction and thresholding-based methods do, we will show in the experimental part that, once the parameters are calibrated, the proposed method performs more robustly than the feature extraction and thresholding-based methods.

Experimental Results

4.1. The Dataset.
As far as we know, there is no publicly available dataset dedicated to FDD evaluation. Therefore, we created a dataset to evaluate the performance of the proposed method, especially for degraded cases. We captured five indoor and eight outdoor video clips (named "v01" to "v13," where "v01"∼"v05" are indoor scenes) with a Panasonic HDC-Z10000GK camcorder. The videos were shot in the Science Park of Harbin Institute of Technology. The clips are captured from different scenes, and their content includes characters, landscapes, buildings, and plants. Several screenshots from our dataset are shown in Figure 5. The video clips are H.264 encoded by the built-in codec, and we then converted these clips into the .mp4 format with Adobe Premiere Pro CS 5.5. The resolution of the clips is 1920 × 1080, and the frame rate is 25 FPS. Based on these original clips, we created three forgery subsets: the MCOMP subset, the MCOMP + AGN subset, and the MCOMP + INT subset. The details of these subsets are listed in Table 1. Each original clip corresponds to 9 forged versions, and the whole dataset consists of 117 forged videos in all. The magnitudes of the additive Gaussian noise and intensity change are moderate, so that they are hardly perceptible. The duration of the forged video clips varies from 8 to 30 seconds.

4.2. The Efficiency of LSH-Based Coarse Matching. As mentioned above, given a subsequence t, in the coarse matching stage, we exploit LSH to find C, the set of duplication candidates of t. In theory, to use the p-stable distribution-based LSH, we have to assign the distance threshold R and the error probability P e to determine the parameter ω in (1). However, since we only use LSH as a coarse matcher and the result of the coarse matching does not have to be highly accurate, we just need to make sure that d ∈ C, where d is the duplication of t. In this sense, we can determine ω by training rather than by first assigning R and P e and then calculating ω from them.
To make sure that d ∈ C, we define ϕ as the completeness of the correctly collected duplication candidates; for the training set, we should choose ω such that ϕ = n_c / n_g = 1, where n_c is the number of correctly collected duplication candidates and n_g is the actual number of duplicated subsequences. Given the premise in (9), we use the average number of collisions, n_ave, to measure the efficiency of the coarse matching step, where n_t is the total number of collisions. It is straightforward that n_ave monotonically increases with ω, and we prefer a smaller n_ave when (9) is guaranteed. We randomly select four video clips which are most seriously degraded (i.e., have been attacked by additive Gaussian noise with a standard deviation of 10 or by downscaling the intensity by 3%) from the MCOMP + AGN and MCOMP + INT subsets, respectively, to train the parameter ω. The block size of the block-wise GIST feature is 25 × 25, the subsequence length L is set to 5, and we scale each frame to 12.5% of its original size before feature extraction. We construct 80 hash tables for the coarse matching stage; therefore, a pair of subsequences is considered identical only when their hash values collide more than 40 times. Under this configuration, ϕ and n_ave vary with ω as shown in Figure 6. We set ω = 0.45, where ϕ reaches 1 and n_ave ≈ 0.094.

Figure 3: The comparison between the three methods for image alignment. Top row: (a, b) a frame and its copy. Bottom row: (a, b) two consecutive frames. The video is recorded with a still camera. The person in a black coat is walking toward the left, and the person's gesture changed slightly across the two frames. (c)-(e) In both rows, the offset fields calculated based on optical flow [25] (we use the implementation in [27]), SIFT flow [23], and our method, respectively.
Despite the impressive result for the motion of the walking person, optical flow is too sensitive to perturbations caused by lossy compression; it gets nonzero offsets all over the fields for both cases, particularly for the wall region in the first row. SIFT flow obtains better result than optical flow for the first case, but there is still a large area of nonzero values corresponding to the smooth wall. On the contrary, SIFT flow is not sensitive enough to subtle changes: it fails to detect the motion of the person in the second case. In contrast, our method correctly calculates both offset fields.
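The completeness and efficiency measures defined above can be computed as follows (a sketch; we read n_ave as the total number of collisions n_t divided by the number of subsequences, which is our interpretation of the text):

```python
def completeness_and_efficiency(candidates, ground_truth, num_subseqs):
    """phi = n_c / n_g and n_ave = n_t / (number of subsequences).
    `candidates` maps each subsequence index to its candidate set;
    `ground_truth` lists (subsequence, its true duplication) pairs."""
    n_c = sum(1 for t, d in ground_truth if d in candidates.get(t, set()))
    n_g = len(ground_truth)
    n_t = sum(len(c) for c in candidates.values())  # total collisions
    return n_c / n_g, n_t / num_subseqs
```

During training, ω is increased until phi reaches 1 on the training clips while keeping n_ave as small as possible.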
It is obvious that the values of ϕ and n_ave depend on the content of the videos. We list in Table 2 the values of ϕ and the mean n_ave for each scene.

For INT95, the value of ϕ was only 0.61. This is because the intensity of the pixels of the duplicated frames is large; therefore, even a small scaling factor can result in a remarkable intensity change. On average, 2% of the duplicated subsequences are missed during the coarse matching stage. On the other hand, the average of the mean n_ave over the different scenes is 0.10; this implies that, in the duplication verification stage, we need to perform only 0.1 comparisons per subsequence on average. In contrast, without the coarse matching stage, we would have to compare each subsequence with about T/2 other subsequences on average, where T is the number of frames. T is typically larger than 200 in our dataset and can be much larger in practice. In this sense, for our dataset, the computational load of the duplication verification stage is reduced by 3 orders of magnitude. We will show later that this coarse matching step is well worth performing, in spite of its O(T · (T + 1)/2) time complexity.
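The claimed saving follows directly from these numbers; as a back-of-the-envelope check:

```python
T = 200                      # typical number of frames in our dataset
brute_force = T / 2          # comparisons per subsequence without coarse matching
with_coarse = 0.10           # average n_ave reported above
print(round(brute_force / with_coarse))  # 1000 -> about 3 orders of magnitude
```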

4.3. The Detection Capability.
In this section, we investigate the detection capability of our method. In our implementation, we randomly select 7 forged clips to calibrate κ, α, and d, which are empirically set to 1.8, 635, and 12,800, respectively. The frames are resized by a factor of 25% (for acceleration) before registration. We binarize the resulting offset field images and mask out nonzero regions whose area is less than 0.1% of the whole field image to account for outliers. We use precision (10), recall (11), and F1-score (12) to evaluate the performance, where TP, FP, and FN are the number of correctly detected duplication frame pairs, the number of falsely detected duplication frame pairs, and the number of undetected duplication pairs, respectively. We compare our method with [4,8] (denoted Farid and Li, respectively), and the related parameters are set to be identical to those in [4,8], respectively. The comparison results for the MCOMP subset are shown in Tables 3-5. The label "v01" in the first row denotes the forged version of video clip v01 in the current forgery group, and "average" means the average value over v01 to v13, and the same hereinafter. For v07 to v12 in the MCOMP100 group, all three methods performed perfectly. For v05 and v13, our method obtained rather low precision; this is because some shots in v05 and v13 are almost still, and the duplication verification step failed to differentiate the excessively similar frames. On the other hand, [4] is quite effective in terms of discriminative power. Our method outperformed the other two in the first four cases. It should be noted that [8] detected none of the duplicated frames in v02 to v05 and v13, for differing reasons.
In [8], when the temporal correlations of the frames in a subsequence are all above a certain value, the subsequence is considered static and is discarded; therefore, in v05 and v13, the subsequences whose frames are quite similar to each other were never compared with other subsequences at all. In contrast, the duplicated frames in v02 to v04 were missed due to an inappropriate correlation threshold.
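The three metrics follow the standard definitions, sketched here in Python:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics as in (10)-(12): precision = TP / (TP + FP),
    recall = TP / (TP + FN), F1 = 2 * P * R / (P + R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, 8 correct detections with 2 false alarms and no misses give precision 0.8, recall 1.0, and F1 ≈ 0.89.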
For the MCOMP80 group, the precision of [4] for v05 and v13 dropped to 0.44 and 0.50, respectively, and the performance of [4] for the rest of the clips changed only slightly. The performance of [8] was rather poor for this group: it failed to detect 8 out of 13 forged video clips. In this group, the precision for v08 and the recall for v07 of our method decreased to 0.50 and 0.83, respectively, and the results for the other clips were almost the same as those in the MCOMP100 group.

Table 1: The description of the three forgery subsets.

Subset: MCOMP
Description: MCOMP stands for multiple compression. For each original clip, we use MATLAB R2014a to randomly copy k (5 ≤ k ≤ 35) consecutive frames, paste them to another random position in the timeline, and resave the clips with quality factors of 100, 80, and 60, respectively. Finally, each clip has been compressed three times (by the camcorder, Adobe Premiere, and MATLAB, respectively). We denote the three levels of compression as the MCOMP100, MCOMP80, and MCOMP60 groups, respectively.

Subset: MCOMP + AGN
Description: The same steps as for the MCOMP100 group are performed. In addition, before resaving the forged clips, additive Gaussian noise with standard deviations of 1, 5, and 10 is added to the target frames, i.e., each original clip corresponds to three forged clips subject to different levels of additive noise. For simplicity, in the rest of this paper, we refer to these three levels of forgeries as the AGN1, AGN5, and AGN10 groups, respectively.

Subset: MCOMP + INT
Description: The same steps as for the MCOMP100 group are performed. In addition, before resaving the forged clips, the intensity of the pixels in the target frames is downscaled to 99%, 97%, and 95% of their original values. In the following, we refer to these three levels of forgeries as the INT99, INT97, and INT95 groups, respectively.
When the quality factor of the last compression dropped to 60, the performance of [4] decreased significantly; specifically, 6 video clips were mistakenly judged as unchanged. In contrast, most results of our method maintained relatively stable values. The detection results for the MCOMP + AGN subset are shown in Tables 6-8. In the AGN1 group (Table 6), our method outperformed the other two for the first four cases. Our recall rate for v08 is a little lower than that of the other two. Besides, for v05 and v13, [4] still obtained better results than our method, although its precision for v05 dropped dramatically to less than 0.50. [8] obtained the worst results for v01 to v03, v05, v06, and v13. In Table 7, it is interesting that the precision of [4] for v05 in the AGN5 group is 1.00, which was expected to be less than 0.50 due to the stronger noise. According to our observation, the difference between the source and target frames in v05 in this group is indeed small, which may be caused by the encoding mechanism of the lossy compression. In Table 8, for most cases, the precision and recall of [4,8] dropped to 0, whereas our method remained robust. The detection results for the MCOMP + INT subset are shown in Tables 9-11. Except for v05 and v13, our method is on par with or exceeds the other two methods in most cases. Since [8] uses the correlation of pixel intensities as its feature, which is rather robust against intensity change, [8] performed better in this subset than it did in the MCOMP + AGN subset. On the contrary, [4] is rather sensitive to intensity change: for the INT95 group, [4] detected no duplicated frames at all for 10 out of 13 forged clips.
When comparing frames in degraded videos, the distances between the features can easily fall outside the range of a fixed threshold. As can be seen from Tables 3-11, as the degradation becomes stronger, more than half of the forged video clips are falsely judged as innocent by [4,8]: none of the duplicated frames is detected. By contrast, our method performed stably across the different test groups. Although our method occasionally performed worse than the other two methods (especially for v05 and v13), its average precision, recall, and F1-score, without exception, significantly outperformed those of [4,8]. Even for the strongest degradation groups, i.e., MCOMP60, MCOMP + AGN10, and MCOMP + INT95, the average precision, recall, and F1-score of our method are above 0.8 (in the worst case, F1 = 0.81 in INT95).
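As a minimal illustration of how the per-clip scores above are computed, the following sketch derives frame-level precision, recall, and F1-score from a set of detected frame indices and the ground truth. The set-based formulation and function name are our own for illustration; this is not the authors' MATLAB evaluation code.

```python
def detection_scores(detected, ground_truth):
    """Frame-level precision, recall, and F1-score for duplication detection.

    `detected` and `ground_truth` are collections of frame indices flagged
    as duplicated (illustrative formulation, not the paper's code).
    """
    detected, ground_truth = set(detected), set(ground_truth)
    tp = len(detected & ground_truth)                 # correctly flagged frames
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```

Note that a method that flags no frames at all (as [4,8] do for strongly degraded clips) scores 0 on all three metrics, which is how "falsely judged as innocent" shows up in the tables.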

4.4. The Running Time. The running time of the three methods is closely related to the content of the videos. When comparing subsequences, once any corresponding pair of frames is found not to be identical, the comparison between the current pair of subsequences terminates. Therefore, video clips whose frames are highly similar to each other require more processing time. The running times of the three methods are compared in Table 12. The frames are scaled by a factor of 25% for all three methods, and all experiments are conducted on a workstation with an Intel Core i7-2600 processor and 24 GB of RAM. We implemented the three methods in MATLAB R2014a. As mentioned earlier, in [8], if the correlation coefficients between the frames within a given subsequence are all above a predefined value, the subsequence is considered static and discarded. Such subsequences are not compared with other subsequences, which is why the running time of [8] is so short for scenes v05 and v13 (recall that [8] detected none of the forgeries corresponding to these two scenes).
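The early-termination behavior described above can be sketched as follows. The per-frame check is abstracted into a callback standing in for the expensive comparison (image registration in our method, structural similarity in [4], correlation in [8]); the names are illustrative, not taken from any of the implementations.

```python
def subsequences_identical(seq_a, seq_b, frames_identical):
    """Compare two equal-length frame subsequences pairwise.

    `frames_identical(fa, fb)` stands in for the expensive per-frame
    check. The loop aborts on the first mismatching pair, which is why
    clips whose frames are all mutually similar take longer: far fewer
    comparisons terminate early.
    """
    for fa, fb in zip(seq_a, seq_b):
        if not frames_identical(fa, fb):
            return False  # first mismatch ends this subsequence comparison
    return True
```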
In fact, both the structural similarity and the correlation coefficients used in [4] and [8], respectively, can be computed much faster than the image registration process in our method. For each pair of frames to be compared, the structural similarity and correlation coefficients can be calculated in about 0.03 and 0.05 seconds, respectively, whereas the image registration procedure in our method takes about 4 seconds per pair. Even so, our method is faster than the other two for most cases. The coarse matching step plays an important role in this acceleration: as demonstrated in Table 2, the total number of subsequences that need finer duplication verification can be reduced by several orders of magnitude, especially for video clips whose content changes rapidly across frames. The average running time of each step in our method is listed in Table 13. The coarse matching step accounts for about 42% of the total running time and takes less than one second per frame on average. This step is well worth performing: without coarse matching, even a 10-second video clip would take several hours to examine.
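To show how locality-sensitive hashing yields this candidate reduction, here is a minimal random-hyperplane LSH sketch: feature vectors that hash to the same bucket become duplication candidates, and only those pairs proceed to the expensive registration-based verification. The projection scheme, dimensions, and parameter values are illustrative assumptions, not the configuration used in the paper.

```python
import random

def lsh_buckets(features, n_bits=8, dim=16, seed=0):
    """Group frame feature vectors into hash buckets via random projections.

    Each feature vector is reduced to an n_bits signature of hyperplane
    signs; vectors sharing a signature land in the same bucket. Only
    within-bucket pairs need the costly fine verification step.
    Parameters here are illustrative, not the paper's.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    buckets = {}
    for idx, vec in enumerate(features):
        key = tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                    for plane in planes)
        buckets.setdefault(key, []).append(idx)
    return buckets
```

With N subsequences, exhaustive comparison costs O(N^2) registration calls, while bucketing restricts verification to pairs within each (typically small) bucket, which is where the orders-of-magnitude reduction of Table 2 comes from.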

Conclusion and Future Work
In this paper, we proposed a new method for frame duplication detection, particularly for degraded videos. Our method detects duplication forgeries in a coarse-to-fine manner and consists of two steps: coarse matching and duplication verification. In the coarse matching stage, we use locality-sensitive hashing to precluster visually similar subsequences; through coarse matching, the total number of subsequences that need finer duplication verification can be reduced by several orders of magnitude. The duplication verification step exploits image registration to identify identical subsequences. We encode the stability information of different regions into the registration objective function so that the registration works stably for degraded videos. Unlike existing methods, our detection process does not rely on a fixed distance threshold, which is typically unreliable for degraded videos. Experimental results show that our method outperformed state-of-the-art methods for most cases and exhibited outstanding robustness under different conditions. However, our method cannot distinguish between highly similar frames; as a result, for video clips whose content changes only slightly across frames, the precision can be rather low. Further efforts should be made to improve the discriminative power of the registration process.

Data Availability
The dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.