Video Surveillance Object Forgery Detection using PDCL Network with Residual-based Steganalysis Feature

Video surveillance has applications across many fields and industries. However, the rapid development of video processing technology has made video surveillance information susceptible to various malicious attacks. At present, state-of-the-art methods, including the latest deep learning techniques, cannot achieve satisfactory results on video surveillance object forgery detection (VSOFD) due to the following limitations: (i) the lack of VSOFD-specific features for effective processing and (ii) the lack of an effective deep network architecture designed explicitly for VSOFD. This paper proposes a new detection scheme to alleviate these limitations. The proposed approach first extracts VSOFD-specific features via the residual-based steganalysis feature (RSF) from the spatial-temporal-frequency domain. Key clues in video frames can be learned more effectively from the RSF than from raw frame images. Then, the RSF features are assembled into the residual-based steganography feature vector group (RSFVG), which serves as the input of the subsequent network. Finally, a new VSOFD-specific deep network architecture called the parallel-DenseNet-concatenated-LSTM (PDCL) network is designed, comprising improved CNN and RNN modules. The improved CNN module fuses and processes coarse-to-fine feature extraction while preserving the independence of each video frame. The improved RNN module learns the correlation features between adjacent frames to identify forgery frames. Experimental results show that the proposed scheme using the PDCL network with RSF achieves high performance in test error, precision, recall, and F1 score on our newly constructed dataset (SYSU-OBJFORG plus newly generated forgery video clips). Compared to existing SOTA methods, our framework achieves the best F1 score of 90.33%, an improvement of nearly 8%.


Introduction
Video surveillance has extensive applications across various industries and fields [1][2][3]. For instance, surveillance videos can serve as evidence in trials, providing a basis for subsequent investigations. Furthermore, video evidence can be used for news material, insurance claims, intelligence gathering, etc. However, with the continuous advancement of multimedia processing technology, the manipulation of videos has become commonplace. Multimedia forensic analysis therefore aims to ensure the authenticity of content. Due to the extensive usage of surveillance videos, verifying the authenticity of such information has become a critical component of multimedia forensic analysis. Object-based forgery is a common method of video forgery, since adding or removing an object in a surveillance video can significantly alter its meaning. It has seriously affected people's trust in the authenticity of court evidence, news reports, and other surveillance video evidence, with significant negative effects on society. For this reason, video surveillance object forgery detection (VSOFD) has become urgently needed.
Digital video forgery techniques have been studied previously as the targets of detection. Existing popular digital video forgery generally consists of frame-based forgery and object-based forgery [4], as portrayed in Figure 1. Frame-based forgery takes the whole frame as the operating unit and includes frame duplication (insertion) [5] and frame deletion [6], which duplicate or delete successive frames in the same video or different videos to highlight or conceal critical events. Many state-of-the-art (SOTA) detection methods are devoted to frame-based forgery and achieve satisfactory results, including frame correlation statistics difference [5], frame motion residual [6], and deep learning [7].
Unlike frame-based forgery, which leaves obvious traces (e.g., sudden lighting changes, temporal flickering, scene fluctuation), object-based forgery [8][9][10][11][12][13][14] with sophisticated techniques can achieve realistic forgery effects. Because video objects reflect a video's meaning and content, taking particular account of them is significant for object-based forgery detection. Generally, object-based forgery contains two subbranches, namely, splicing and copy-move forgeries (Figure 1). Splicing is a forgery with heterogeneous sources: the copied (splicing) source and the pasted frame (localization) originate from different video clips (Figure 2(a)). Many video splicing forgery detection (VSFD) methods reference image splicing detection schemes [15][16][17], which search for splicing traces between the heterogeneous splicing content(s) and the spliced background of the target video. Besides, SOTA methods also consider the temporal dimension, analyzing spatial-temporal block correlation [8] or using recurrent neural networks (RNN) with temporal correlation [9] to locate and distinguish forgery frames.
Copy-move is a forgery with a homogeneous source: the copied source and its pasted frame (localization) originate from the same video clip (Figure 2(b)) [10]. This forgery technique can consist of both intra-frame and inter-frame manipulations. In intra-frame forgery, the object(s) are copied from one frame and pasted into the same frame. In inter-frame forgery, the object(s) are copied from one frame and pasted into different frames within a short interval (e.g., 200 frames) to create a realistic forgery. These phenomena provide significant detection clues for searching for matching correlations, since the copied and pasted objects are in the same video clip. Recent video copy-move forgery detection (VCMFD) methods reference image copy-move forgery detection (CMFD) schemes [11] and consider the temporal dimension to address both intra-frame [12,13] and inter-frame forgeries [14].
As discussed previously, the relatively consistent content expression in video copy-move forgery can easily create realistic forgery effects. In practice, video surveillance object forgery resembles a film plot: malicious attackers generally employ copy-move forgery variations in which the copied object(s) and the pasted frames are separated by a relatively long time. For video forensics, a suspicious video clip is usually examined rather than a surveillance video of unlimited continuous length. Therefore, this forgery can make the suspicious video clip contain only the pasted object(s) without any clues about the copied source. As a result, it can easily bypass most existing detection methods, which cannot find the copied and pasted pairs. This special forgery variation (as shown in Figure 2(c)) is a popular form of video surveillance object forgery.
Existing video splicing forgery and copy-move forgery detection methods fail to detect video surveillance object forgery for two reasons: (1) In video splicing forgery, the content of the forgery video originates from two different videos and contains equipment information from different cameras. Given these characteristics, splicing detection methods can only identify the splicing traces between the splicing content(s) and the spliced background of the forgery video. However, the copy-move content of a video surveillance object forgery originates from the same video (its equipment information belongs to one camera) but from different video clips. Therefore, video splicing detection methods fail to identify copy-move forgery traces captured by the same camera in VSOFD.
(2) In video copy-move forgery, the copy-move content originates from the same video clip (its equipment information belongs to one camera). Given these characteristics, copy-move detection methods must find the one-to-one matching correlation between the copied and pasted objects. However, in video surveillance object forgery, the copied objects and the pasted frame do not appear in the same suspicious video clip (the source lies in other parts of the surveillance video). Therefore, this forgery can easily bypass most existing video copy-move forgery detection methods, because they cannot find the copied and pasted pairs in VSOFD.
Only a few methods [4,[18][19][20] have been proposed to address VSOFD. However, even the latest deep learning models in existing works cannot provide satisfactory detection accuracy due to two main challenges: (i) Features discriminative only in an isolated spatial, temporal, or frequency domain: methods based on motion residual [4,19] can only extract spatial- or temporal-domain features and do not consider changes in the frequency domain of the forged video. Similarly, current deep learning methods [12,18,21], whether CNN or RNN, are not specifically designed for VSOFD and only address feature extraction in the spatial or temporal domain, respectively. The features of the above algorithms cannot effectively discover the forgery trace. Therefore, extracting comprehensive and effective features for VSOFD is necessary.
(ii) Effective deep network architecture for VSOFD: A video is a time series of sequential and correlated frames, while each frame retains its independence. CNN is powerful in handling spatial characteristics, so it is natural to employ CNN for VSOFD. However, CNN takes only a single frame as input; it thus fails to keep frame coherence in the video, leading to unsatisfactory detection accuracy. Although RNN can maintain frame coherence, it does not possess the ability to handle spatial characteristics. Therefore, a new architecture with the abilities of both CNN and RNN is necessary.
To address these two defects, a novel VSOFD scheme integrating several newly designed techniques is proposed, including a comprehensive spatial-temporal-frequency feature and an effective deep network architecture. The effective residual-based steganalysis feature (RSF) is designed explicitly for the spatial-temporal-frequency domain. The residual-based signal (RS) is first extracted from the spatial-temporal domain; the RSF is then further extracted from the RS in the frequency domain. The RSF can effectively represent spatial-temporal-frequency perspectives with dimensionality reduction in the features. The RSF features are then assembled into the residual-based steganography feature vector group (RSFVG), which serves as the input of the following network.
A new learning framework called the parallel-DenseNet-concatenated-LSTM (PDCL) network is designed, which combines both CNN (parallel-DenseNet) and RNN (LSTM) structures. The proposed PDCL structure simultaneously preserves frame independence (spatial domain) and captures correlation (temporal domain). Furthermore, its CNN module has a parallel cross-layer-block structure that extracts coarse-to-fine fusion features for processing. Noteworthily, the CNN module in PDCL is compatible with different RSFs and preserves the RSF independence of each frame with a column convolutional kernel. Subsequently, the independent CNN frame features are concatenated as input for the RNN module to learn the correlation and coherence of the frame sequence.

The rest of the paper is organized as follows. The related work is presented in Section 2. The proposed RSF and parallel-DenseNet-concatenated-LSTM (PDCL) are detailed in Sections 3 and 4, respectively. The experiments and conclusions are provided in Sections 5 and 6, respectively.

Related Work

Related Video Surveillance Object Forgery Detection.
Video surveillance object forgery can achieve realistic visual results without leaving tampering traces. Unfortunately, there are very few research works on VSOFD in the literature. In recent years, Chen et al. [4] created the SYSU-OBJFORG dataset, introduced video object forgery detection, and proposed a temporal forgery detection algorithm based on motion residuals. All video clips in the available SYSU-OBJFORG are extracted from primitive video footage of several static surveillance cameras. Moreover, a substantial part of the forgery video clips in SYSU-OBJFORG has the same properties as our study issue (video surveillance object forgery). Therefore, video object forgery detection provides an excellent reference for video surveillance object forgery detection (VSOFD).
The inherent statistical properties of a video can be divided into two categories: the inherent intra-frame properties describing its spatial characteristics and the inherent inter-frame properties describing its temporal characteristics. Since motion residuals contain the inherent attributes of the corresponding frames both within and between them, they have become the primary analysis tool for VSOFD. Two hand-crafted automatic identification algorithms [4,19] rely on the motion residual (MR) as the feature of each frame and use a machine learning classifier to discriminate forgery frames from genuine ones. However, motion residuals can only reflect correlation in the spatial-temporal domain and ignore the frequency characteristic changes of a forged video (defect (i)). Meanwhile, these classifiers require many difficult hyper-parameter adjustments, are well-designed only for specific forgery datasets (e.g., SYSU-OBJFORG), and cannot provide satisfactory detection efficiency and accuracy (defect (ii)).
Yang et al. [18] propose a deep network based on a spatial rich model (SRM) and 3D convolution (C3D) to address an application similar to VSOFD. However, this general CNN structure does not include discriminative features in the spatial-temporal-frequency model (defect (i)); it still needs to extract the difference and coherence between successive frames. Furthermore, its general CNN structure (without RNN) is incompetent in addressing VSOFD effectively (defect (ii)) and cannot process slowly moving forgery objects. Jin et al. [21] propose dual-stream networks for video object-based forgery detection. This technique is similar to fast forgery detection [12]: it uses corresponding DNN modules to replace the hand-crafted feature extraction, processing, and tracking modules, e.g., dual-stream networks for feature extraction instead of exponential Fourier moments [12]. It is well-designed for both splicing and general copy-move forgery detection. Still, the features extracted from the convolutional layer lack the spatial-temporal-frequency perspective (defect (i)), leading to the failure of VSOFD detection. Moreover, these methods [12,21] cannot detect forgery with occlusive or smoothed backgrounds.
As discussed in this related work, previous studies lack both comprehensive and effective features and an effective deep network architecture for VSOFD. To address these two defects, a novel VSOFD scheme integrating newly designed techniques is proposed: (i) an effective residual-based steganalysis feature (RSF) is designed, which can effectively represent spatial-temporal-frequency perspectives with dimensionality reduction in the features, and (ii) the proposed PDCL structure simultaneously preserves frame independence (spatial domain) and captures correlation (temporal domain) to effectively improve VSOFD results.

Residual Signal and Steganalysis Feature.
The motion residual is a popular feature extraction method for video forensics. To address motion-compensated frame rate up-conversion (MC-FRUC), Ding et al. [22] build a residual signal to search for forgery splicing traces. The identification problem of MC-FRUC is then transformed into a classification or discrimination problem over differences in residual signals between video frames. Saddique et al. [19] also rely on the motion residual (MR) as the feature of each frame for discriminating forgery frames from genuine ones.
First, the feature extraction method uses an overlapping temporal window (an aggregation operation) that slides over the video sequence one frame at a time. Then, the minimum, maximum, and median of the motion residual inside a temporal frame window create the aggregated frames AF. For the k-th frame in the video,

$$AF_{x,y} = \operatorname{agg}\big(F^{k-Ls}_{x,y}, \ldots, F^{k}_{x,y}, \ldots, F^{k+Ls}_{x,y}\big), \quad \operatorname{agg} \in \{\min, \max, \operatorname{med}\}, \qquad RS_{x,y} = \big|F^{k}_{x,y} - AF_{x,y}\big|, \tag{1}$$

where min, max, and med denote the minimum, maximum, and median values; 2Ls + 1 is the number of aggregated frames F_{x,y}, which include the center (current) frame k; Ls is the number of successive previous and subsequent frames of frame k; and x and y are the pixel position in the corresponding frame. RS_{x,y} is the residual-based signal (RS).

The motion residual can reflect the correlation of fine-grained pixels between adjacent frames but lacks global vision. Some popular steganalysis techniques, e.g., DCT [23], CCPEV [24], the subtractive pixel adjacency matrix (SPAM) [25], the spatial domain rich model (SRMQ1) [26], and the spatial and color rich model (SCRMQ1) [27], are extracted from the frame residual as effective features to obtain the global vision. Therefore, combining motion residual and steganography effectively obtains both the local and global features for discriminating video forgery frames.
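As a concrete illustration of the aggregation described above, the following minimal NumPy sketch computes an aggregated-frame motion residual for one frame. The function name and the absolute-difference formulation of RS are our assumptions based on equation (1); the schemes in [4,19] differ in details.

```python
import numpy as np

def motion_residual(frames, k, Ls=1, agg="med"):
    """Aggregated-frame motion residual for frame k (hypothetical helper).

    frames : (W, H, Wd) array of gray-scale frames; requires Ls <= k < W - Ls
    Ls     : number of preceding/subsequent frames in the window
    agg    : "min", "max", or "med" aggregation over the 2*Ls + 1 window
    """
    window = frames[k - Ls : k + Ls + 1].astype(np.float64)  # 2*Ls + 1 frames
    if agg == "min":
        af = window.min(axis=0)
    elif agg == "max":
        af = window.max(axis=0)
    else:  # "med"
        af = np.median(window, axis=0)
    # Residual-based signal: absolute difference between frame k and the
    # aggregated frame AF (assumed formulation; see equation (1)).
    return np.abs(frames[k].astype(np.float64) - af)
```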

CNN + RNN (DenseNet and LSTM) Structure for Feature Processing and Forgery Frame Identification.
With the rapid development of convolutional neural networks (CNN), many effective CNN models, such as VGG16 [28] and ResNet [29], have shown powerful feature extraction abilities and excellent image classification in the spatial perspective. However, these classic CNN models have a feed-forward transfer structure in which the current network layer only receives the processed information from the preceding layer; the information from the current layer is then processed and transferred sequentially to the next layer. Moreover, these classic CNN models have some defects, e.g., many parameters, network layers, and large widths, which may easily cause overfitting. Huang et al. [30] proposed DenseNet, which replaces the VGG16 serial structure with a concatenated structure that transfers the concatenated feature maps of all preceding layers. DenseNet provides a good reference for the proposed VSOFD DNN architecture. Furthermore, a video is a time series: any video frame and its neighboring frames are temporally dependent. RNN techniques [31], especially LSTM, can capture this long- and short-term dependence between the preceding and current frames, making them suitable for correlation statistics between video frames.

The Proposed Residual-Based Steganalysis Feature Extraction
The proposed method includes two main stages: (1) residual-based steganalysis feature extraction (Figure 3) and (2) parallel-DenseNet-concatenated-LSTM for feature learning and forgery frame detection (Figure 4). In practice, a surveillance video clip contains stationary scenes and object motion parts. In video surveillance object forgery, static scenes cover the target object and remove its motion traces. This kind of object removal belongs to the spatial domain. Nevertheless, video is a continuous time series of frames, and each video frame strongly correlates with its adjacent frames. Therefore, video surveillance object forgery is a kind of attack that changes spatial-temporal coherence. In our work, a novel motion residual extraction strategy is proposed for RS, which differs substantially from the literature [4,19], with the following concerns: (i) Microscopic level in the spatial-temporal domain (Section 3.1): the fine-grained pixel difference between the detected pixel and the adjacent pixels in the spatial domain, and the fine-grained pixel difference between the current frame position and the adjacent frame positions in the temporal domain. (ii) Macroscopic level in the frequency domain (Section 3.2): steganography feature differences between the frame and adjacent frames in the energy and frequency domains.
The proposed residual-based steganalysis feature extraction is shown in Figure 3, where Figures 3(a)-3(c) present the residual-based signal (RS) extraction for concern (i) and the residual-based steganography feature (RSF) extraction for concern (ii), respectively.

Residual-Based Signal Extraction.
Given concern (i), the proposed RS extraction considers spatial-temporal coherence and difference analysis. First, the closer two frames are in a video sequence, the higher their correlation. Therefore, the RS extraction uses only a short temporal frame window Ls ∈ {1, 2}, instead of a long temporal frame window, e.g., Ls ≥ 10 in [19]. Considering that the Laplacian operator helps enhance forgery traces and sharpen image details, a Laplacian operator is employed to create the RS instead of calculating the minimum, maximum, and median of motion residual as in existing schemes [4,19]. Since the video frame sequence is a 1-dimensional (1-D) time series of frames, the proposed RS extraction only requires calculating a two-order discrete Laplacian operator in the 1-D temporal domain (equation (2)).
When Ls = 1,

$$RS^{k} = F^{k-1} - 2F^{k} + F^{k+1}, \tag{2}$$

where $F^{k}$ denotes the gray-scale values of all pixels in the k-th frame, and k is the center frame in a short temporal window. Following the literature [32], the convolution kernels of the 1-D and 2-D Laplacian operators are set to $[1, -2, 1]$ and $[1, 1, 1;\ 1, -8, 1;\ 1, 1, 1]$, respectively. Second, in the spatial domain, the two-order discrete Laplacian operator in equation (3) is applied to the 2-D image or video frame to remove subtle camera-shake interference, i.e., Gaussian noise, and to highlight areas where pixel values change rapidly:

$$RS_{x,y} = \sum_{i=-1}^{1}\sum_{j=-1}^{1} L_{2D}(i,j)\, RS^{k}_{x+i,\,y+j}, \tag{3}$$

where x, y are the pixel coordinates in the residual-frame map and $L_{2D}$ is the 2-D Laplacian kernel.
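A minimal sketch of the proposed RS extraction for Ls = 1 follows, combining the temporal Laplacian of equation (2) with the spatial Laplacian of equation (3). The function name is hypothetical, and the 8-neighbour form of the 2-D kernel is an assumption (the text fixes only the 1-D kernel [1, −2, 1] explicitly).

```python
import numpy as np
from scipy.ndimage import convolve

# 3x3 two-order Laplacian kernel of equation (3); the 8-neighbour form is
# assumed here.
LAPLACIAN_2D = np.array([[1,  1, 1],
                         [1, -8, 1],
                         [1,  1, 1]], dtype=np.float64)

def residual_signal(frames, k):
    """Residual-based signal (RS) for frame k with Ls = 1 (hypothetical helper).

    frames: (W, H, Wd) stack of gray-scale frames; requires 1 <= k <= W - 2.
    """
    f = frames.astype(np.float64)
    # Temporal step, equation (2): 1-D Laplacian [1, -2, 1] across frames.
    rs = f[k - 1] - 2.0 * f[k] + f[k + 1]
    # Spatial step, equation (3): 2-D Laplacian to suppress camera-shake
    # (Gaussian-like) noise and highlight rapid pixel changes.
    return convolve(rs, LAPLACIAN_2D, mode="nearest")
```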

Residual-Based Steganography Feature Extraction.
Steganalysis can analyze, find, and distill the hidden information of steganographic carriers at a macroscopic level. In addressing concern (ii), steganography feature selection is crucial for feature extraction. In our work, the residual-based steganography feature (RSF) extraction further uncovers the steganography feature from the residual-based signal (RS) as the feature representation. RSF extraction transforms a spatial-temporal feature matrix (e.g., RS) into a spatial-temporal-frequency feature vector, greatly reducing the amount of information (e.g., from 3-D to 2-D).
There are two reasons to employ the RSF as the VSOFD feature: (1) effectiveness: most SOTA steganography techniques are competent for this work, and (2) efficiency: a compact steganography feature (i.e., a short-length or low-dimensional feature) can relieve "the curse of length or dimensionality." The compact steganography feature is also suitable for the subsequent deep model training. Among the RSF techniques, DCT [23], CCPEV [24], and SPAM [25] are selected for steganography feature extraction. Figure 3(b) illustrates the 548-D CCPEV feature. Each RSF vector is N × 1 dimensional, where N is the feature dimension. SRMQ1 [26] and SCRMQ1 [27], with relatively long (high-dimensional) features, are used for the experimental comparisons. Algorithm 1 shows the RS and RSF extraction algorithms.
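The specific CCPEV/SPAM/SRMQ1 extractors are standard steganalysis tools; to give a flavor of what such a feature computes on the RS, here is a minimal first-order SPAM-style sketch (horizontal direction only, hypothetical function name). The full SPAM feature in [25] aggregates second-order Markov transitions over eight directions.

```python
import numpy as np

def spam_horizontal(rs, T=3):
    """Minimal first-order SPAM-style feature (horizontal direction only).

    Sketch of the idea behind [25]: truncate horizontal pixel differences
    of the residual map to [-T, T] and collect a normalized (2T+1)^2
    transition histogram of adjacent difference pairs.
    """
    d = np.clip(np.diff(np.rint(rs).astype(np.int64), axis=1), -T, T)
    left = d[:, :-1].ravel() + T      # shift indices into [0, 2T]
    right = d[:, 1:].ravel() + T
    hist = np.zeros((2 * T + 1, 2 * T + 1), dtype=np.float64)
    np.add.at(hist, (left, right), 1.0)
    hist /= max(hist.sum(), 1.0)      # normalize to a probability matrix
    return hist.ravel()               # one slice of an RSF vector
```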

Sample Processing Using the Residual-Based Steganography Feature Vector Group (RSFVG). This sample processing algorithm considers vision persistence. In the human psychophysical system (i.e., vision persistence), an object in a video clip appears for at least 0.1-0.4 seconds, i.e., 3-10 frames. For this reason, a certain number of successive frames (called a frame group) can better represent the temporal characteristics for deep model learning. In Section 3.2, the RSF vectors of all frames are concatenated into a W × N × 1 matrix, where W is the total number of frames in the detected video clip and N is the feature dimension of the RSF. We slide a local temporal vector window over the RSF matrix, one vector at a time, to obtain different combinations of successive RSF vectors in the corresponding frame group (Figure 3(c)), namely, the RSF vector group (RSFVG). The RSFVG helps provide sufficient samples for effective learning. The relations of RS, RSF, and RSFVG are illustrated in the right part of Figure 3. Based on the requirement of vision persistence, each RSFVG has M = 2L_M + 1 frames (RSF vectors), where L_M is no less than 3 and L_M > Ls.
The ground truth GT_F for each RSF vector (frame) is labeled with a binary code: 0 represents a genuine frame and 1 a forgery frame. The GT_F of each RSFVG is a GT vector of size M = (2L_M + 1) × 1 × 1. Algorithm 2 shows the sample processing algorithm.
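A minimal sketch of the RSFVG construction follows: a window of M = 2L_M + 1 consecutive RSF vectors slides over the W × N feature matrix one vector at a time, paired with the matching per-frame labels. Function and variable names are ours.

```python
import numpy as np

def build_rsfvg(rsf, gt, L_M=3):
    """Form RSF vector groups (RSFVG) and their ground-truth vectors.

    rsf : (W, N) matrix of per-frame RSF vectors
    gt  : (W,) array of per-frame labels (0 = genuine, 1 = forgery)
    """
    M = 2 * L_M + 1                          # frames per group (vision persistence)
    W = rsf.shape[0]
    # Slide one vector at a time -> W - M + 1 overlapping groups.
    groups = np.stack([rsf[i : i + M] for i in range(W - M + 1)])  # (W-M+1, M, N)
    labels = np.stack([gt[i : i + M] for i in range(W - M + 1)])   # (W-M+1, M)
    return groups, labels
```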

Parallel-DenseNet-Concatenated-LSTM (PDCL) Architecture
Similar to the RSF extraction, the deep model must also consider the spatial-temporal perspectives. For this reason, a novel architecture called parallel-DenseNet-concatenated-LSTM (PDCL), combining CNN and RNN, is proposed. The overview of PDCL is illustrated in Figure 4. DenseNet is a feed-forward serial structure that can only process concatenated features of the same size and therefore cannot simultaneously handle coarse and fine features (of different sizes). From this perspective, our parallel-DenseNet (PDN) is proposed to address this issue by concatenating the serial and parallel features (i.e., cross-layer and cross-block features) from the preceding layers and blocks. In this way, the coarse-to-fine features can be learned simultaneously. As mentioned in Section 3.3 and Figure 4, the input feature map is a 3-D tensor X ∈ R^{M×N×K}. Each box and its arrowed line with a different color represent the output of the corresponding layer of a PDN block. Each PDN block is a serial structure consisting of three layers A, B, and C; different PDN blocks and Ts layers form parallel structures. The colored dots in the bottom figure illustrate an RSFVG. M is the number of concatenated RSF vectors, i.e., the frame quantity in an RSFVG. Note that M (the width of the colored dot boxes at the bottom of Figure 4) is fixed to preserve feature independence in the PDN structure, whereas the feature dimension (the height of the colored dot boxes) is halved throughout the PDCL processing. This design benefits the following LSTM module in learning the temporal correlation between the frames in an RSFVG. The PDCL architecture, with adjustable capacity, is compatible with various RSF dimensions. Figure 3(b) takes the CCPEV steganalysis feature as an input RSF sample, e.g., N = 548, to simplify the analysis. The initial channel number K of the RSFVG feature map is one. Therefore, the feature map size of the PDCL input is M × 548 × 1.

Parallel-DenseNet
The PDN framework consists of the preprocessing (PP) layer, PDN blocks, and transition (Ts) layers. The PP layer is a Conv-BN-ReLU-AvgPool block, similar to the Ts layer. Noteworthily, the convolution (Conv) uses a column convolutional kernel of (1, c = 7) instead of a conventional square kernel of (c, c); this keeps M unchanged and each RSF vector independent. Conv also extends the feature map from 1 channel to 24 channels. Batch normalization (BN) then removes the internal covariate shift and retains the same distribution in each layer. The rectified linear unit (ReLU) is a nonlinear transformation that mitigates gradient vanishing and speeds up network training. Finally, an AvgPool operation filters the feature map. The stride (1, 2) of AvgPool likewise keeps the M independent RSF vectors in an RSFVG and compresses the feature dimension from N = 548 to N/2 = 274. The PP layer thus outputs a rough PDN feature map of size M × ⌊548/2⌋ × 24.
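In PyTorch terms, the PP layer can be sketched as below; the padding choice, which keeps N = 548 intact before pooling, is our assumption.

```python
import torch.nn as nn

class PPLayer(nn.Sequential):
    """Preprocessing layer sketch: Conv-BN-ReLU-AvgPool with a (1, 7) column
    kernel so the M frame axis is untouched; 1 -> 24 channels, N -> N // 2."""
    def __init__(self):
        super().__init__(
            nn.Conv2d(1, 24, kernel_size=(1, 7), padding=(0, 3)),  # keep M and N
            nn.BatchNorm2d(24),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=(1, 2), stride=(1, 2)),       # 548 -> 274
        )
```

For an input of shape (batch, 1, M, 548), this produces (batch, 24, M, 274), matching the M × ⌊548/2⌋ × 24 feature map described above.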
Then, the PDN blocks and the subsequent Ts layers constitute the backbone of PDN, as illustrated in Figure 4.

(Algorithm 1 input: the video frames. W is the number of frames in a detected video; the RS is extracted from the 2Ls + 1 frames comprising the center (current) frame k and its Ls preceding and Ls subsequent frames; X_M and Y_N are the width and height of the frame.)

Serial Structure.
Each PDN block has the same output depth of a modest 24 channels. In a PDN block, each layer has the same serial Conv-BN-ReLU structure (Figure 6). As shown in Figure 5(b), the column convolutional kernel of (1, c = 7) is used, similar to the Conv in the preprocessing layer. Since a frame RSF vector with a limited feature length is much simpler than image content with 3-D rich details, each PDN layer outputs only 8 channels. In a PDN block (Figure 6), the three layers A, B, and C each output 8 channels, namely, a growth rate g = 8, so each PDN block outputs 3 × 8 = 24 channels. The modest depth of the Conv layers reduces the network parameter size and avoids gradient vanishing. Finally, the PDN block output is a feature map y_j with 24 output channels:

$$y_j = \operatorname{Cat}\big(\big[y_{j,A},\, y_{j,B},\, y_{j,C}\big]\big), \tag{4}$$

where Cat represents concatenation and j indexes the j-th block.
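A PyTorch sketch of one PDN block follows. That layers A, B, and C are chained serially (each consuming the previous layer's 8-channel output) is our reading of the serial structure; equation (4) fixes only the concatenated 24-channel block output.

```python
import torch
import torch.nn as nn

class PDNLayer(nn.Sequential):
    """One PDN layer: Conv-BN-ReLU with the (1, 7) column kernel."""
    def __init__(self, in_ch, out_ch=8):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, 7), padding=(0, 3)),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

class PDNBlock(nn.Module):
    """Three serial layers A, B, C (8 channels each, growth rate g = 8);
    the block output concatenates them into 24 channels (equation (4))."""
    def __init__(self, in_ch=24, g=8):
        super().__init__()
        self.A = PDNLayer(in_ch, g)
        self.B = PDNLayer(g, g)
        self.C = PDNLayer(g, g)

    def forward(self, x):
        a = self.A(x)
        b = self.B(a)
        c = self.C(b)
        return torch.cat([a, b, c], dim=1)  # y_j = Cat([y_{j,A}, y_{j,B}, y_{j,C}])
```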

Parallel Structure.
The parallel-DenseNet transition layer receives the processed information from the last layer and all preceding blocks, serving as a transition connection between two PDN blocks. By contrast, the DenseNet transition layer only receives the preceding layer's features: feature transmission is serial, and each DenseNet block and its following transition layer only address features of a specific size and dimension. The PDN transition layer is therefore not entirely like the DenseNet transition layer, which only compresses the feature depths and sizes. Instead, a close feed-forward and parallel structure concatenates the cross-layer features from all preceding blocks and layers for fused learning (Figure 5). This series-parallel structure can also process multidimensional and multiscale dense features, which are fused in each transition layer. The output of the j-th PDN transition layer is given by

$$T_j = \operatorname{Conv}\Big(\operatorname{Cat}\big(\operatorname{Avg}_j\big([y_{j,A}, y_{j,B}, y_{j,C}]\big),\, \operatorname{Avg}_{j-1}\big([y_{j-1,A}, y_{j-1,B}, y_{j-1,C}]\big),\, \ldots,\, \operatorname{Avg}_1\big([y_{1,A}, y_{1,B}, y_{1,C}]\big)\big)\Big), \tag{5}$$

where T_j is the j-th transition layer output, j = 1, 2, 3, 4; [·] denotes the corresponding PDN block output; and Avg represents average pooling. Each transition layer j outputs 24 more channels than the preceding transition layer j−1, namely, a propagation rate ρ of 24.
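The following PyTorch sketch illustrates equation (5): the outputs of the current and all preceding PDN blocks are average-pooled to a common feature length (the M frame axis is untouched), concatenated, and fused. The pooling schedule and the 1 × 1 fusing convolution are our assumptions.

```python
import torch
import torch.nn as nn

class PDNTransition(nn.Module):
    """PDN transition layer j (sketch of equation (5)): average-pool each
    preceding block output to a common feature length, concatenate across
    channels, then fuse with a 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.pool = nn.AvgPool2d(kernel_size=(1, 2), stride=(1, 2))

    def forward(self, block_outputs):
        # block_outputs: list of (B, 24, M, n_j) tensors from blocks 1..j,
        # where n_j halves from block to block. Pool each output down to
        # half the newest block's length, preserving the M frame axis.
        target = block_outputs[-1].shape[-1] // 2
        pooled = []
        for y in block_outputs:
            while y.shape[-1] > target:
                y = self.pool(y)
            pooled.append(y)
        return self.fuse(torch.cat(pooled, dim=1))  # T_j, equation (5)
```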

Concatenated-LSTM Framework.
In this subsection, the focus shifts to the feature correlations between the current frame and its adjacent frames. Unlike DenseNet, PDCL adds an LSTM layer before the linear layer. The LSTM processes the RSFVG (the concatenated RSF vectors) by reshaping the 3-D feature map (R^{M×N×K}) to 2-D (R^{M×(N×K)}), namely, from M × ⌊N/32⌋ × 120 to M × (⌊N/32⌋ × 120). The LSTM outputs an M × 120-D feature map. Then, the following linear classification layer contains a 4096-unit fully connected (FC) layer with an attached SoftMax function. Finally, PDCL outputs an M × 2 matrix corresponding to the M vectors (frames) in the local temporal frame-group window, identifying each frame as genuine or forged.
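A PyTorch sketch of the concatenated-LSTM head follows; placing the 4096-unit FC between the 120-D LSTM output and the 2-way classification output is our interpretation of the linear classification layer.

```python
import torch
import torch.nn as nn

class ConcatLSTMHead(nn.Module):
    """Concatenated-LSTM head sketch: reshape the M x (N//32) x 120 CNN
    output to M x (N//32 * 120), run an LSTM with 120 hidden units over
    the M frame steps, then classify each frame via FC + SoftMax."""
    def __init__(self, feat_len, hidden=120, fc=4096):
        super().__init__()
        self.lstm = nn.LSTM(feat_len, hidden, batch_first=True)
        self.cls = nn.Sequential(nn.Linear(hidden, fc), nn.ReLU(inplace=True),
                                 nn.Linear(fc, 2))

    def forward(self, x):                                 # x: (B, 120, M, N//32)
        b, k, m, n = x.shape
        x = x.permute(0, 2, 3, 1).reshape(b, m, n * k)    # (B, M, N//32 * 120)
        h, _ = self.lstm(x)                               # (B, M, 120)
        return self.cls(h)   # (B, M, 2) logits; SoftMax applied at inference
```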

Loss Function. The loss function in equation (6) for training the model is a sum over the concatenation of all frames in an RSFVG:

$$\operatorname{Loss} = \sum_{j=1}^{M} L\big(O_j,\, GT_j\big), \tag{6}$$

where L(·) is the binary cross-entropy, O_j represents the final classification output of frame j (RSF vector j) in an RSFVG, GT_j represents the ground-truth label for vector j, M = 2L_M + 1 is the frame quantity in an RSFVG, and Loss is the total loss of PDCL.
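Equation (6) can be realized as below; using a two-class cross-entropy over the softmax outputs, which coincides with binary cross-entropy on the forgery probability, is an implementation choice of this sketch.

```python
import torch.nn as nn

def rsfvg_loss(outputs, gt):
    """Equation (6): sum of per-frame binary cross-entropy over an RSFVG.

    outputs : (B, M, 2) logits from the PDCL head
    gt      : (B, M) ground-truth labels (0 = genuine, 1 = forgery)
    """
    ce = nn.CrossEntropyLoss(reduction="sum")  # two-class CE == BCE here
    return ce(outputs.reshape(-1, 2), gt.reshape(-1).long())
```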

Experimental Results
This section first presents the existing SYSU-OBJFORG dataset, the proposed extension dataset, and the training strategies. Table 1 shows the effective RSFVGs and the corresponding frames in the complete dataset for L_M = 3, 5, 7, 10. Finally, the frames of the two datasets are split with a ratio of 8 : 1 : 1 for the training, validation, and testing stages.

Evaluation Criteria.
The primary network classifier evaluation criterion is the frame-level error, the complement of frame accuracy (FACC), given in the following equation:

$$\operatorname{Error} = \frac{\text{incorrectly detected frames}}{\text{all the frames}}.$$
Lower error indicates better classifier performance. Furthermore, precision, recall, and F1 [12] are common criteria in the forgery forensics field:

$$\operatorname{Precision} = \frac{\text{correctly detected forgery frames}}{\text{all detected forgery frames}}, \qquad \operatorname{Recall} = \frac{\text{correctly detected forgery frames}}{\text{all the forgery frames}}, \qquad F_1 = \frac{2 \cdot \operatorname{Precision} \cdot \operatorname{Recall}}{\operatorname{Precision} + \operatorname{Recall}},$$

where F1 is a comprehensive evaluation criterion that balances precision and recall. Higher precision, recall, and F1 indicate better performance. In summary, error, precision, recall, and F1 together provide a complete performance evaluation from various perspectives.
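For reference, the four criteria can be computed from per-frame 0/1 decisions as in the following sketch (hypothetical helper name):

```python
def frame_metrics(pred, gt):
    """Error, precision, recall, and F1 over per-frame 0/1 decisions."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # correctly detected forgeries
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # false alarms
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # missed forgeries
    error = sum(p != g for p, g in zip(pred, gt)) / len(gt)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return error, precision, recall, f1
```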

Performance Comparisons among the PDCL Derivative Structures and Other Schemes. Derivative structures are generated by considering different combinations of parameters and network structures, yielding different model performances.

Comparisons of PDCL Derivatives Based on Different Combinations of Ls and L_M. There are two parameters, Ls = 1 or 2 and L_M = 3, 5, 7, or 10, for RS and RSFVG, respectively. Different combinations of (Ls, L_M) ∈ {1, 2} × {3, 5, 7, 10} determine the effectiveness of the RSF extraction (i.e., the 548-D CCPEV features) and hence affect the network learning effectiveness and convergence speed. This work determines the optimal combination of (Ls, L_M) based on the validation errors of the proposed PDCL network under the different combinations. From Figure 8, PDCL with Ls = 1 and all values of L_M (except L_M = 3) achieves the best validation error (< 3%) on the complete dataset. The number B of PDCL blocks follows the rule 32 > ⌊N/2^{B+1}⌋ ≥ 16, where B is the number of PDCL blocks with transition layers and N is the RSF dimension. This rule keeps an approximately equal feature dimension as the input of the concatenated-LSTM for fair performance comparison. Besides, popular CNNs (e.g., VGG16 [28], ResNet [29], and DenseNet [30]) replace the proposed PDN to create different models (VGG16 + LSTM, ResNet + LSTM, and DenseNet + LSTM) for comparison. In Figure 9, the validation errors under (Ls, L_M) = (1, 5), (1, 7), and (1, 10) are better than that of (Ls, L_M) = (1, 3). Therefore, the PDCL derivatives use the feature extraction setting (Ls, L_M) = (1, 5). As shown in Figure 9(a), the proposed PDCL 4-CCPEV and PDCL 9-SCRMQ1 (the PDCL baselines) achieve the best validation error among the derivatives. Figure 9(b) shows that the proposed PDCL structures outperform the CNN + RNN structures, indicating that the PDCL structure is much more suitable for VSOFD.
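The block-count rule can be resolved programmatically as below (hypothetical helper). For the 548-D CCPEV feature it yields B = 4, matching the PDCL 4-CCPEV name.

```python
def num_pdn_blocks(N):
    """Smallest B with 32 > floor(N / 2**(B+1)) >= 16 (the rule above)."""
    B = 1
    while N // 2 ** (B + 1) >= 32:
        B += 1
    assert 16 <= N // 2 ** (B + 1) < 32, "no B satisfies the rule for this N"
    return B

print(num_pdn_blocks(548))  # -> 4, i.e., PDCL 4-CCPEV (548 // 2**5 = 17)
```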

Experimental Analysis and Discussion.
The performance of PDCL, its various derivative structures, and other SOTA schemes is evaluated on the complete dataset. The PDCL derivatives are based on different RSFs, including DCT, SPAM, CCPEV, SRMQ1, and SCRMQ1. All the PDCL derivatives use a combination of a short temporal frame window Ls = 1 and the RSFVG window 2L_M + 1 for the various RSF extractions, contributing to comprehensive performance comparisons. Nevertheless, there are few existing VSOFD methods and models. Therefore, some related video forensics methods are used for comparison with the proposed PDCL, including hand-crafted techniques and DNN models: the automatic identification and forged segment localization algorithm with the CCPEV feature (AIFSL-CCPEV) [4], fast forgery detection with exponential Fourier moments (FFD-EFMs) [12], the dense moment feature index and best match algorithm with Radial-Harmonic-Fourier moments (DMFIBM-RHFMs) [13], Patch-Match with Polar Cosine Transformation (PM-PCT) [14], and Motion Residual and Parasitic Layers (MRPL) [19]. These methods have been used in video copy-move forgery detection [12][13][14], video splicing detection [19], and other forensics fields.
Besides, Spatiotemporal Trident Networks (STNs) [18] are also compared in our work. Table 2 details the performance comparisons on the complete dataset. Figure 10 shows the visualization results of some VSOFD samples.

Effect of RSF (Spatial-Temporal-Frequency Feature).
The residual-based steganography feature (RSF) vector effectively extracts the implicit and unique features for classifying forgery video frames. From the results of Table 2, all the methods based on RSF achieve relatively good performance in precision, recall, and F1. For example, the PDCL 9-SCRMQ1 and PDCL 4-CCPEV models achieve the best performance scores of about 90% in F1, followed by SRMQ1 with 88.42% in F1. Among all the RSFs, even the worst-performing one achieves acceptable results. The spatial-temporal-frequency features used as video processing features, together with the RSFVG structure, highly reflect the spatial-temporal-frequency coherence and the differences between frame contents. STNs are suitable for addressing splicing forgery. However, STNs apply only 3 frames as a group to identify the forgery group/frame; such a short temporal frame group is incompetent in detecting forgery objects of slow motion. For example, PDCL_{1,3} in Figure 8(a) achieves weaker performance than PDCL_{1,5} and PDCL_{1,7}.

Effective Deep Network PDCL Architecture for VSOFD
(1) Effect of the Novel Parallel-DenseNet (PDN) Structure. From the results of Table 2, the proposed PDCL with the PDN structure achieves the overall best performance, including the lowest test error and the highest precision, recall, and F1 scores compared to VGG16, ResNet, and DenseNet. DenseNet [30] is a kind of CNN with excellent performance in pattern recognition and image/video classification tasks. However, DenseNet is a feed-forward serial structure that can only process concatenated features (of the same feature size); for this reason, DenseNet cannot simultaneously handle coarse and fine features (of different sizes). From this perspective, our parallel-DenseNet (PDN) addresses this issue by concatenating the serial and parallel features (i.e., cross-layer and cross-block features) from the preceding layers and blocks. In this way, the coarse-to-fine features can be learned simultaneously. The PDN structure (CNN module) in PDCL is compatible with different RSFs and preserves the RSF independence of each frame with a column convolutional kernel. It is suitable for the following LSTM to process each frame's coherence and difference and to learn the sequence's correlation and coherence. Therefore, the proposed PDCL derivatives perform better than VGG16+LSTM-CCPEV, ResNet+LSTM-CCPEV, and DenseNet+LSTM-CCPEV.

(2) Effect of the Concatenated-LSTM. A video is a time series: its frames have temporal coherence, and each frame retains its independence. CNN is powerful at handling spatial features but only accepts single-frame input; it therefore cannot retain the coherence features of video frames, resulting in unsatisfactory detection accuracy. Although RNN can maintain frame correlation, it cannot handle spatial features. Therefore, a new architecture with both CNN and RNN capabilities is necessary. The PDN processes the spatial content of each frame while keeping frame invariance and independence. In this way, the subsequent LSTM with long-short-term dependence can better focus on the temporal correlation among adjacent frames. In Table 2, DenseNet-CCPEV is much weaker than DenseNet+LSTM-CCPEV for addressing video surveillance object forgery. Similarly, the proposed PDCL derivatives, as well as VGG16+LSTM-CCPEV, ResNet+LSTM-CCPEV, and DenseNet+LSTM-CCPEV, all achieve better performance than the purely CNN-based MRPL. This verifies the effectiveness of LSTM in VSOFD.
In conclusion, the PDCL derivatives with (CNN + RNN)-RSF structures achieve much better scores than CNN-only structures, CNN-RSF (DenseNet-CCPEV), CNN-RS (MRPL), and hand-crafted methods (FFD-EFMs, DMFIBM-RHFMs, PM-PCT). The PDCL derivatives also achieve better scores than other CNN + RNN-RSF structures using the same RSF. Among the RSFs, PDCL with SCRMQ1 and CCPEV at (Ls, L_M) = (1, 5), achieving the best F1 score of 90.33%, serve as the PDCL baselines on our dataset.

Conclusion
This paper proposes a new detection scheme for VSOFD with a novel spatial-temporal-frequency feature representation called RSFVG and a newly designed PDCL network, which addresses the following critical issues: (1) Through the RSF, spatial-temporal-frequency features can be effectively represented with dimension reduction. (2) Through the PDCL network, highly discriminative information in each frame (via CNN) and temporal correlation features between adjacent frames (via LSTM) can be learned simultaneously while maintaining frame independence. This is a critical property for identifying forgery frames in a video clip.
From the experimental results, the proposed scheme using the PDCL network with RSF achieves high performance in test error, precision, recall, and F1 scores. Among the derivatives, PDCL 9-SCRMQ1 achieves the best F1 score of 90.33% on the complete dataset, an improvement of nearly 8% over existing SOTA methods.

Data Availability
Due to privacy issues, the database of this paper is not publicly available. The data supporting the current study are available from the corresponding author upon request.

Disclosure
Yan-Fen Gan and Ji-Xiang Yang are considered co-first authors.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Yan-Fen Gan and Ji-Xiang Yang contributed equally to this work.