Local Corner and Motion Key Point Trajectory Extraction for Facial Forgery Identification

At present, the development of deep forgery technology has brought new challenges to media content forensics, and using deep forgery identification methods to identify forged audio and video has become a significant research focus and difficulty. Deep forgery technology and forensic technology compete with each other and promote each other's development. This paper proposes a spatiotemporal local feature abstraction (STLFA) framework for facial forgery identification to address the challenges that deep forgery technology poses to the media industry. To make full use of local facial features, we combine facial key points, key point movement, and facial corner points to detect forged content. This paper establishes a spatiotemporal relation that realizes face forgery detection by identifying interframe abnormalities of facial key points and corner points. Meanwhile, we utilize RNNs to model the sequences of facial key point movement abnormalities and interframe corner point changes. Experimental results show that our method achieves better performance than some existing methods and good anticompression forgery face detection performance on FF++.


Introduction
Media content forgery has brought security problems to society. Especially with the development of autoencoders (AEs) [1] and generative adversarial networks (GANs) [2], media content forgery has become easy to achieve through deep forgery techniques. These techniques usually utilize deep learning methods to alter a person's identity in a video and synthesize a piece of media content that does not exist. Deep forgery identification techniques include both image-level detection and video-level detection.
Forgery detection of images or video frames mostly targets traces in the forged content, including color inconsistencies and semantic inconsistencies. According to the detection dimension, image forgery detection can be divided into detecting the image as a whole and detecting the facial area. Forgery detection of the image as a whole mainly examines the physical properties of the image, such as the direction of the image's light source [3], the saturated pixel frequency [4], and the spectral sensitivity [4]. Images are classified by judging the difference between forged and authentic images. Forgery detection for facial regions looks for inconsistent iris color, missing tooth gaps, and inconsistent eye reflections, and includes detection of facial artifacts using light estimation, global consistency, and geometric estimation [5], corneal highlight region consistency detection [6], and facial artifact detection [7].
The detection of video sequences is mainly performed by combining optical flow anomalies, motion incoherence, or anomalies between video frames. Forgery detection based on optical flow mainly calculates the optical flow field of the target in the video and detects inconsistencies in the optical flow field [8]. Some authors utilize eye blinks [9], abnormal head movements [10], and facial distortions [11] to detect incoherent motion or abnormal behaviors in consecutive frames.
However, the early works mainly focused on global features. Specifically, we notice that forgery traces are particularly evident in key facial organs such as the eyes, nose, and mouth [5,6,12]. For example, Xue et al. [12] found that using only facial organs such as the nose, lips, eyes, eyebrows, and chin can detect deep forgery very well.
Based on this, we first consider constructing the relation among the facial organs. These organs can be abstracted into local features and represented by sequential vectors. We then adopt recurrent neural networks (RNNs) to capture their internal properties or differences and obtain guidance that indicates whether the face is falsified. For comprehensive detection, we realize face forgery detection for key local facial regions such as the lips, eyes, nose, eyebrows, and chin, thus achieving impressive performance. The contributions of our work are summarized as follows:
(1) We propose a spatiotemporal local feature abstraction (STLFA) framework for facial forgery identification, which establishes the relation among local features via an organ-specific method.
(2) In STLFA, we combine abnormal facial movement detection and facial landmark time discontinuity detection to analyze the facial key point and corner point features frame by frame. Meanwhile, we evaluate the key point movement and the change in the number of corner points across video sequences to achieve forgery identification of images and videos.
(3) This paper demonstrates the effectiveness and robustness of the proposed method and analyzes the advantages and disadvantages of STLFA.

Deep Forgery Discrimination Based on Image or Video Frames.
Currently, most forgery detection of images or video frames is performed by detecting handcrafted features. The detection subject can be divided into two categories: detection over the whole image and inconsistency detection restricted to human faces. Image forgery detection mainly detects inconsistent lighting conditions and color inconsistencies in images. Chen et al. [13] proposed a robust dual-stream network based on an improved Xception model that integrates the RGB and YCbCr color spaces and considers both the luminance and chrominance components to enhance robustness. Johnson and Farid [3] proposed a method to detect lighting inconsistencies by estimating the direction of point light sources in a single image and checking the consistency of light sources across the whole image. McCloskey and Albright [4] analyzed the structure of a popular GAN network. They found that images generated by the GAN differ from captured images in color processing, and they proposed a forgery classification method based on saturated pixel frequency detection and spectral sensitivity detection.
Forgery detection of inconsistencies in a person's face exploits the incomplete consideration of semantics in the content generation process of deep forgery methods, which results in generated persons with inconsistent iris colors in the left and right eyes, inconsistent reflections, and uneven gaps in the teeth. Matern et al. [5] detected intraframe facial artifacts using light estimation, global consistency, and geometric estimation. Hu et al. [6] proposed a scheme that checks whether the highlight patterns on the corneas of the two eyes are consistent to determine whether a face is fake. Li and Lyu [7] identified forgery traces by detecting artifacts left by the affine transformation during face forgery.
In order to integrate the features of facial regions, some authors proposed novel approaches. Wang et al. [14] proposed a method that fuses facial region feature descriptors for forgery determination by extracting feature points of a person's face. Xue et al. [12] built a transformer model for an organ-based deepfake detection method to obtain the deepfake features. Yang et al. [15] proposed a method for detecting differences in face textures by amplifying the texture differences between genuine and fake images and using a bootstrap filter to enhance postprocessing-induced texture artifacts and expose the underlying features of the artifacts.

Deep Forgery Discrimination Based on Video Sequences.
Video sequence-based deep forgery detection approaches have more detection items than image-based approaches. The forged video generation process operates frame by frame, which leads to optical flow inconsistencies between the preceding and following frames and to motion anomalies.
In terms of forgery identification based on optical flow detection, Amerini et al. [8] proposed a forgery detection method based on optical flow anomalies between different frames by extracting the correlation of the optical flow field and using a CNN classifier for classification. Trinh et al. [16] proposed a forgery detection framework that superimposes optical flow fields on RGB images. Caldelli et al. [17] proposed a CNN-based classification method that uses optical flow fields to distinguish motion dissimilarities in the temporal structure of video sequences.
In terms of forgery identification based on abnormal motion detection, Li et al. [9] observed that GAN-based models could not represent blinking in fake synthetic videos, which enables the detection of blink inconsistencies. Yang et al. [10] proposed a detection method based on the inconsistency of 3D head pose estimation by extracting the coordinates of facial key points and calculating the direction vector difference between the center of the face and the coordinates of peripheral key points. Sun et al. [11] proposed a geometric feature calibration module that assesses the accuracy of interframe geometric features to identify abnormal facial movements.

Framework.
In this section, we provide a detailed illustration of our proposed method. Figure 1 illustrates the architecture of STLFA. We use facial preprocessing modules to crop the eight facial organ regions: the left eyebrow, right eyebrow, left eye, right eye, nose, mouth, inner mouth, and chin. We build a sequence group from the facial key points, the key point movement, the number of corner points, and the variation in the number of corner points. Meanwhile, an RNN model is trained for each region until it acquires detection ability. After that, we integrate the results from the RNNs and obtain the final prediction.

Security and Communication Networks

Facial Preprocessing.
The facial preprocessing module mainly contains three steps: face detection, face landmark detection, and landmark alignment. Following [11], we use tracking and denoising methods to match the key points across video frames and obtain the complete facial key point coordinates and coordinate movement. In the tracking method, we utilize the Lucas-Kanade (LK) operation to track the coordinate points and a forward-backward check to eliminate inaccurate predictions. Meanwhile, the denoising method addresses the noise introduced by the LK operation and ensures the stability of the landmarks, using a Kalman filter to integrate the prediction information.
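The forward-backward check mentioned above can be illustrated with a minimal sketch. It assumes two hypothetical callables, `track_forward` and `track_backward`, that stand in for an LK tracker run in each direction; the threshold `max_error` is also an assumed value, not one reported in this paper.

```python
import math

def forward_backward_filter(points, track_forward, track_backward, max_error=1.0):
    """Keep only tracks whose backward-tracked position returns near the start.

    points: list of (x, y) key points in the current frame.
    track_forward / track_backward: hypothetical stand-ins for an LK tracker.
    max_error: assumed pixel tolerance for the round-trip error.
    """
    kept = []
    for p in points:
        q = track_forward(p)          # predicted position in the next frame
        p_back = track_backward(q)    # q tracked back into the current frame
        if math.dist(p, p_back) <= max_error:
            kept.append((p, q))
    return kept

# Toy trackers: the backward tracker drifts for points far to the right.
def shift_right(p):
    return (p[0] + 2.0, p[1])

def drifting_back(q):
    return (q[0] - 2.0, q[1]) if q[0] < 50 else (q[0] + 5.0, q[1])

tracks = forward_backward_filter([(10.0, 10.0), (60.0, 20.0)], shift_right, drifting_back)
print(tracks)
```

In this toy run the first point survives because its round trip returns exactly to the start, while the second is rejected because its round-trip error exceeds the tolerance.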

Facial Key Points Coordinates Extraction.
The facial key point coordinate detection method requires cropping the preprocessed image. After that, we detect 68 facial key points representing the facial shape, as shown in Figure 2(a). Based on the 68 key points, we extract eight facial key organ regions, as demonstrated in Figure 2(b). We create a vector $v_p$ for each key organ region.
Each region $i$ can be expressed as $v_p^i$:
$v_p^i = [x_1^i, y_1^i, x_2^i, y_2^i, \ldots, x_n^i, y_n^i]$,
where $x_1^i$ is the horizontal coordinate of the first key point in region $i$, $y_1^i$ is the vertical coordinate of the first key point in region $i$, and $n$ is the number of key points in region $i$.

Corner Extraction
(1) Motivation for Using FAST Feature Points. The FAST algorithm is a corner detection algorithm mainly used to extract feature points from an image. Based on the feature point information, objects undergoing translation, distortion, and rotation can be associated across a series of dynamically captured images, which enables target tracking and positioning. Wang et al. [14] found that although the face in a fake video is highly similar to the face in the original video, it still loses many of the fine details that determine the FAST feature points, and that this phenomenon is more evident in local areas of the face. Based on this observation, we design a FAST feature descriptor to capture the occasional failure of face swapping in local areas of the fake video and thereby complete the face forgery detection.
(2) FAST Feature Point Extraction Algorithm. Features from accelerated segment test (FAST) [19] is an efficient corner point detection method mainly used for feature extraction of image corner points. The FAST method takes the intensity $I_p$ of a candidate pixel $p$, sets a threshold $t$, and considers the Bresenham circle of 16 pixels around $p$, as shown in Figure 3(a). Pixel $p$ is designated a corner point if the circle contains a set of $n$ consecutive pixels that are all brighter than $I_p + t$ or all darker than $I_p - t$.
In order to speed up the operation, the comparison with $I_p$ can be simplified to use only pixels 1, 5, 9, and 13 of the circle, as shown in Figure 3(b). This paper focuses on establishing FAST corner point detection for the eight extracted regions, such as the eyes, nose, lips, and eyebrows, as shown in Figure 3(c), and on establishing corner point comparisons between frames.
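The segment test described above can be sketched in plain Python. This is an illustrative sketch rather than the implementation used in the paper: the threshold `t = 20` and run length `n = 12` are assumed defaults, the image is a list of rows of gray values, and the high-speed pre-test on pixels 1, 5, 9, and 13 is omitted for brevity.

```python
# Offsets (dx, dy) of the 16 pixels on a radius-3 Bresenham circle, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def longest_circular_run(flags):
    """Length of the longest run of True values, treating the list as circular."""
    if all(flags):
        return len(flags)
    best = run = 0
    for f in flags + flags:  # doubling the list handles wrap-around runs
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def is_fast_corner(img, x, y, t=20, n=12):
    """Segment test: n contiguous circle pixels all brighter than I_p + t
    or all darker than I_p - t. The caller must keep (x, y) at least
    3 pixels away from the image border."""
    ip = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [v > ip + t for v in ring]
    darker = [v < ip - t for v in ring]
    return longest_circular_run(brighter) >= n or longest_circular_run(darker) >= n

# A single bright dot on a dark background versus a flat image.
dot = [[0] * 7 for _ in range(7)]
dot[3][3] = 200
flat = [[100] * 7 for _ in range(7)]
print(is_fast_corner(dot, 3, 3), is_fast_corner(flat, 3, 3))
```

Counting the pixels for which `is_fast_corner` holds inside one cropped organ region would give a per-region corner count of the kind used later in the framework.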

Abnormal Facial
The collection of key point coordinate vectors of the eight regions in frame $i$ can be expressed as $v_l^i$:
$v_l^i = [v_{l_1}^i, v_{l_2}^i, \ldots, v_{l_8}^i]$,
where $v_{l_1}^i \sim v_{l_8}^i$ represent the respective vectors of the eight regions in frame $i$. The corresponding key points are as follows: 6∼10 represent the chin, 17∼21 represent the left eyebrow, 22∼26 represent the right eyebrow, 36∼41 represent the left eye, 42∼47 represent the right eye, 27∼35 represent the nose, 48∼60 represent the mouth, and 61∼67 represent the inner mouth.
Then, we use $v_{l_1}^i \sim v_{l_8}^i$, extracted frame by frame, to provide clues for the subsequent temporal discontinuity detection of facial motion morphology.
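As a small illustration of the index ranges listed above, the following sketch slices a list of 68 landmarks into the eight regional coordinate vectors. The region names and the function `region_vectors` are hypothetical; only the index ranges come from the text.

```python
# Inclusive index ranges from the text, written as Python ranges (end exclusive).
REGION_INDICES = {
    "chin": range(6, 11),
    "left_eyebrow": range(17, 22),
    "right_eyebrow": range(22, 27),
    "left_eye": range(36, 42),
    "right_eye": range(42, 48),
    "nose": range(27, 36),
    "mouth": range(48, 61),
    "inner_mouth": range(61, 68),
}

def region_vectors(landmarks):
    """Map 68 (x, y) landmarks to flat [x1, y1, x2, y2, ...] vectors per region."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 facial key points")
    return {
        name: [coord for i in indices for coord in landmarks[i]]
        for name, indices in REGION_INDICES.items()
    }

# Synthetic landmarks so that the slicing is easy to verify by eye.
landmarks = [(float(i), float(2 * i)) for i in range(68)]
vectors = region_vectors(landmarks)
print(len(vectors["chin"]), vectors["chin"][:4])
```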

Facial Corner Abnormal Detection.
Following [14], we use FAST to obtain feature points with a descriptor des of 32 dimensions. Let $FD_k^i$ denote the number of corner points of the focus organ region $k$ in frame $i$. In this way, a feature vector $FD^i$ can be created for the eight regions:
$FD^i = [FD_1^i, FD_2^i, \ldots, FD_8^i]$,
where $FD_k^i$ is a statistic based on the corner points in region $k$, containing the number of corner points in region $k$ at frame $i$. We create time series based on $FD_k^i$ to detect clues of alternating authentic and forged faces in forgery videos.

We detect the temporal discontinuity of facial key point displacement based on the displacement information of facial key points between consecutive frames. We analyze the movement of key points in each region and build a key point coordinate movement vector $v_{m_k}^i$; each region can be expressed as follows:
$v_{m_k}^i = [\Delta x_1^i, \Delta y_1^i, \Delta x_2^i, \Delta y_2^i, \ldots, \Delta x_n^i, \Delta y_n^i]$.
The collection of key point coordinate movement vectors of the eight regions in frame $i$ can be expressed as $v_m^i$:
$v_m^i = [v_{m_1}^i, v_{m_2}^i, \ldots, v_{m_8}^i]$,
where $\Delta x_1^i$ is the variation of the horizontal coordinate between adjacent frames, calculated as $\Delta x_1^i = |x_1^{i+1} - x_1^i|$; $\Delta y_1^i$ is calculated in the same way.
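The movement vector amounts to an element-wise absolute difference between the coordinate vectors of adjacent frames. A minimal sketch, with the hypothetical helper name `movement_vector`:

```python
def movement_vector(coords_curr, coords_next):
    """Element-wise |next - curr| over flat [x1, y1, x2, y2, ...] vectors
    of the same region in frame i and frame i + 1."""
    if len(coords_curr) != len(coords_next):
        raise ValueError("coordinate vectors must have the same length")
    return [abs(b - a) for a, b in zip(coords_curr, coords_next)]

frame_i = [10.0, 20.0, 30.0, 40.0]      # two key points in frame i
frame_next = [12.0, 19.0, 30.0, 45.0]   # the same key points in frame i + 1
print(movement_vector(frame_i, frame_next))
```

Because the absolute value is taken, the vector records how far each coordinate moved but not in which direction.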

FAST Feature Time Discontinuity Detection.
The $FD_k^i$ in Section 3.4.2 is the corner number of the described local region, and we use it to build the corner number difference $\Delta FD_k^i$ between consecutive frames:
$\Delta FD_k^i = FD_k^i - FD_k^{i-1}$,
where $\Delta FD_k^i$ is the difference between the number of corners in region $k$ in frame $i$ and the number in region $k$ in frame $i - 1$. The statistical vector $\Delta FD^i$ of the difference in the number of corners over the whole facial region can be expressed as follows:
$\Delta FD^i = [\Delta FD_1^i, \Delta FD_2^i, \ldots, \Delta FD_8^i]$.
We use $\Delta FD^i$ to detect nonsmooth changes in the number of facial corners in the video.
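A short sketch of the corner number difference between consecutive frames follows. The function name and the two-region toy data are illustrative assumptions; the paper uses eight regions.

```python
def corner_count_differences(counts_per_frame):
    """counts_per_frame[i][k] is the corner count of region k in frame i.

    Returns one difference vector per frame i >= 1, where each entry is
    count(frame i) - count(frame i - 1) for the same region.
    """
    return [
        [curr - prev for prev, curr in zip(prev_counts, curr_counts)]
        for prev_counts, curr_counts in zip(counts_per_frame, counts_per_frame[1:])
    ]

# Three frames, two regions (toy data): region 0 suddenly loses corners in frame 2.
counts = [[5, 3], [5, 4], [1, 4]]
print(corner_count_differences(counts))
```

A sudden large negative entry, as in the last frame of the toy data, is the kind of nonsmooth change the difference vector is meant to expose.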

Facial Forgery Prediction
Based on $v_{l_k}^i$, $v_{m_k}^i$, $FD_k^i$, and $\Delta FD_k^i$ obtained in Sections 3.4 and 3.5, the local facial feature fusion vector $v_{f_k}^i$ is formed by concatenating the four types of feature vector sequences:
$v_{f_k}^i = [v_{l_k}^i, v_{m_k}^i, FD_k^i, \Delta FD_k^i]$.
Then, the local facial feature fusion vector for region $k$ over the entire video can be expressed as follows:
$v_{f_k} = [v_{f_k}^1, v_{f_k}^2, \ldots, v_{f_k}^N]$,
where $N$ is the number of frames extracted from the video. We utilize the series of local facial feature fusion vectors $v_{f_1} \sim v_{f_8}$ to represent the facial fusion features. After that, we use the concatenated feature vector $v_{f_k}$ to train a dual-stream RNN model for each of the eight regions to classify the forgery videos.
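The concatenation of the four feature types can be sketched as follows. The helper names are hypothetical, and treating the corner number and its difference as single scalars per region and frame is an assumption made for illustration.

```python
def fuse_frame_features(coords, movement, corner_count, corner_diff):
    """Concatenate the four per-frame features of one region into one vector."""
    return list(coords) + list(movement) + [corner_count, corner_diff]

def fuse_video_features(coords_seq, movement_seq, count_seq, diff_seq):
    """Build the per-region sequence of fused vectors, one entry per frame."""
    return [
        fuse_frame_features(c, m, n, d)
        for c, m, n, d in zip(coords_seq, movement_seq, count_seq, diff_seq)
    ]

# One region with a single key point, over two frames (toy data).
sequence = fuse_video_features(
    coords_seq=[[1.0, 2.0], [1.5, 2.0]],
    movement_seq=[[0.5, 0.0], [0.0, 0.0]],
    count_seq=[4, 3],
    diff_seq=[0, -1],
)
print(sequence)
```

Each inner list is one time step of the sequence that a regional RNN would consume.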

RNN-Based Deep Forgery Detection.
We utilize RNNs to model local facial feature sequences. In order to ensure an identical input dimension for the RNNs and to achieve deep forgery detection at the video level, each video sample used as input is cut to a fixed length, and a fixed number of key frames are extracted. Based on the extraction results, the RNN parameters are selected for training to achieve deep forgery detection of the overall video.
Through the embedding process, the RNNs are adopted to model the feature sequences of each local region, learning the shape movement pattern, the landmark difference pattern, and the FAST feature point variation pattern. Then, a fully connected (FC) network is connected to each RNN output layer. Furthermore, the average of the outputs of the eight FC layers is taken as the final prediction to achieve deep forgery detection based on the local regions of the face. We use F to denote this process.
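The final averaging step can be sketched as below. The regional scores are assumed to be probabilities of forgery in [0, 1] produced by each region's FC layer, and the decision threshold of 0.5 is a hypothetical choice that is not stated in the text.

```python
def fuse_region_scores(region_scores, threshold=0.5):
    """Average the eight regional outputs and apply an assumed decision threshold."""
    if len(region_scores) != 8:
        raise ValueError("expected one score per facial region (8 in total)")
    mean_score = sum(region_scores) / len(region_scores)
    return mean_score, mean_score > threshold

# Toy scores for: eyebrows (2), eyes (2), nose, mouth, inner mouth, chin.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.9, 0.8]
mean_score, is_forged = fuse_region_scores(scores)
print(round(mean_score, 3), is_forged)
```

Averaging lets a few weakly informative regions, such as a low-scoring nose or chin in this toy case, be outvoted by the remaining regions.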

Datasets
(1) FaceForensics++ (FF++) [18]: FF++ is one of the benchmark datasets for large-scale deep forgery detection, with a total of over 1,000 segments, more than 1.5 million frames, and over 1.5 TB of video data in the original video format. A face detector is used to filter the video footage. The FF++ dataset provides three video qualities (Raw, c23, and c40), contains many forged video segments, and covers a variety of deep forgery methods.
(2) Celeb-DF [20]: The Celeb-DF (v2) dataset is a large-scale deepfake forensic dataset that addresses the shortcomings of poor forged video quality, apparent forgery traces, and flickering video faces. The Celeb-DF (v2) dataset improves the deep forgery generation method and the face key point localization method to obtain stable fake video content quality. The dataset contains 590 raw videos collected from YouTube, covering subjects of different ages, races, and genders, and 5,639 HD deepfake videos of the same quality as online broadcast videos.
(3) DFDC preview dataset [21]: This dataset comes from the Deepfake Detection Challenge hosted by Facebook. It is the preliminary dataset for the competition. It consists of 5,214 videos, of which the ratio of true to false content is 1 : 0.28, and the forgery data are generated by two deep forgery methods. Each video is a clip of about 15 s.

Experiment Settings.
During preprocessing, DLIB was used for face cropping and face landmark detection, and the FAST detector and BRIEF descriptors were used for corner point detection and description. In the classification process, a bidirectional recurrent neural network is connected to the feature sequences of the respective regions. Each RNN in the detection framework consists of a GRU (gated recurrent unit) with a hidden layer feature output dimension of 64. A dropout layer is set between the input and the RNN, and a fully connected network is connected to the output of the RNN layer. Two dropout layers with dr = 0.5 are placed between the RNN layer and the fully connected layer and inside the latter; these experimental parameter settings partly refer to existing research results [22]. The ratio of training data to test data was 7 : 3, with 120 frames drawn from each video. The model was optimized using the Adam optimizer. We initialized the learning rate at 0.005, set the batch size to 1024, and set the maximum number of training epochs to 800. The experiments in this paper use AUC (area under curve) to evaluate the performance of the deep forgery detection model, and the AUC is calculated as follows:
$\mathrm{AUC} = \frac{\sum I(\mathrm{pred}_{pos} > \mathrm{pred}_{neg})}{\mathrm{positiveNum} \times \mathrm{negativeNum}}$,
where $\mathrm{pred}_{pos}$ is the predicted probability of a positive sample, $\mathrm{pred}_{neg}$ is the predicted probability of a negative sample, positiveNum is the number of positive samples, negativeNum is the number of negative samples, and $I(\cdot)$ equals 1 when its condition holds and 0 otherwise. The AUC is therefore the fraction of the positiveNum × negativeNum positive-negative sample pairs in which the predicted probability of the positive sample is greater than that of the negative sample.
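The pairwise definition of AUC given above can be checked with a direct sketch. The function name `pairwise_auc` is hypothetical, and ties are counted as zero here because the text uses a strict "greater than"; some conventions instead count a tie as one half.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive sample
    receives the strictly higher predicted probability."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]   # predicted probabilities for positive samples
neg = [0.7, 0.3]        # predicted probabilities for negative samples
print(pairwise_auc(pos, neg))
```

Here five of the six pairs are ordered correctly; only the pair (0.4, 0.7) is not. This direct form costs positiveNum × negativeNum comparisons, so rank-based formulations are usually preferred for large test sets.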

Partial Organ Comparison.
In this paper, experiments are conducted on the FF++ dataset to compare the detection effect of each organ region module and verify how well each organ region detects deep forgery. Following the idea of [14], eight key regions (the left eyebrow, right eyebrow, left eye, right eye, nose, mouth, inner mouth, and chin) were set up and compared, as shown in Table 1. The "Points" results are obtained using facial key point coordinate detection and facial key point coordinate movement detection: "Coordinate" indicates the detection result using only the facial key point coordinates, "Movement" indicates the detection result using only the facial key point coordinate movement, and "C + M" indicates the result obtained by combining the two. "Corners" is the result obtained using FAST corner number detection and corner number change detection. "All" means that the results of "Points" and "Corners" are combined, with the RNN of each region trained separately.
From Table 1, every local organ can be used individually on the FF++ dataset to detect whether the images contain forgeries. We observe that among the eight organ regions, the eyebrows, eyes, and mouth have the highest accuracy, while the nose and chin have low accuracy. Also, in the "Points" detection group, where three experiments were set up, "Coordinate" can perform the single-frame detection task with an average detection rate of 87.2%. "Movement" is the detection method that uses video sequences, with an average detection rate of 82.6%. Combining "Coordinate" and "Movement" unites abnormal facial movement detection with facial landmark time discontinuity detection, which allows more effective acquisition of key facial features and yields an accuracy of 91.1%.

Ablation Study.
To validate the proposed method, we use the frame-level AUC to verify the effectiveness of face key point detection and corner point detection for deep forgery detection. The models in the experiments are trained on FF++ (raw) and tested on three datasets: FF++, DFDC Preview, and Celeb-DF. The results are shown in Table 2.
The experimental results show that "Points" and "Corners" have similar detection results in terms of AUC, with averages of 71.3% and 74.1%, respectively, and the best detection was achieved by "All," with an AUC of 75.9%. Meanwhile, on each of the FF++, DFDC Preview, and Celeb-DF datasets, the AUC of "All" was higher than those of "Points" and "Corners." This supports the conclusion that the proposed method, which combines facial key point and corner point detection, is reasonable and effective.

Comparison Experiments.
Using frame-level AUC evaluation, we selected mainstream deep forgery detection methods based on full-frame face region forgery detection [18], fake face edge fusion region detection [23], facial landmark feature enhancement forgery detection [11], visual distortion detection [24], and capsule network forgery detection [25]. Tests were carried out on datasets such as FF++, DFDC Preview, and Celeb-DF. We refer to the detection results of [11,14], as shown in Table 3. In the FF++ dataset, "raw" represents the uncompressed data and "c40" represents the compressed low-quality (LQ) data.
As can be seen from Table 3, the AUC results of the proposed method on FF++ are better than those of mainstream methods such as Xception [18], Face X-ray [23], LRNet [11], DSP-FWA [24], and Capsule [25]. In particular, in the "c40" experimental group, the proposed method is more robust for low-quality forged video identification, with a 1.7% improvement over LRNet [11] and a 35.8% improvement over Face X-ray [23].
In anticompression forgery face detection, our work shows good performance. The method in this paper extracts the geometric features of local facial regions by combining the local facial key points and the corners. The extracted features are more robust and cheaper to compute, and they are highly sensitive to changes in the number of corners. The strategy designed in this paper, which performs face forgery detection through 8 local facial regions, improves the accuracy of overall face forgery detection by reducing the influence of the detection error of any single region. The effectiveness of our strategy is also verified on FF++ (Raw, c40).
The low-complexity and high-performance geometric feature extraction method designed in this paper can effectively reduce the impact of image compression on the face forgery detection task, and the experimental results further demonstrate this. We compared the training and testing results of this method and other methods on the FF++ (Raw, c40) dataset in Table 3. The results show that our method achieves better performance than some existing methods, with a difference of 0.4% in AUC compared to the Single XceptionNet [26] method on FF++ (c40), and has better anticompression forgery face detection performance. The detection performance suffers less interference on c40 data. The experimental results also show that our method can use individual organs alone to detect forged videos containing defilement and stains. Meanwhile, using all organ regions gives higher average accuracy. To further verify the ability of our method, we set up cross-dataset experiments to compare with state-of-the-art methods in Table 5.
time series of features to complete fake face detection, which verifies the effectiveness of the framework. Applying geometric features improves, to a certain extent, the sensitivity to facial feature point motion patterns and differential changes. Still, when the forged content changes the scene around the face across different datasets, the feature extraction method in this framework needs to be further optimized. Obtaining more effective forgery face features is the further optimization direction of this framework. Experimental results show that our method performs better than some existing methods and achieves good anticompression forgery face detection performance on FF++. At the same time, for the detection of face forgery, the generalization ability under cross-dataset testing is also important. Therefore, a robust method with strong generalization ability is the goal of our future work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

The bold values are used to highlight the experimental records that represent the best performance within each experimental group. Specifically, the bold values labeled as "Ours" indicate the results obtained from the experiments conducted using our proposed method.

Figure 1: The framework of STLFA. The face image in this figure comes from the FF++ dataset [18] obtained from open access.

Figure 3: FAST feature point extraction algorithm: (a) p is the selected corner point, and a Bresenham circle is established around the point p. (b) Simplified corner operations. (c) FAST corner point detection in eight regions. The face images are from FF++ [18].
The development of deep forgery technology has brought new challenges to the authenticity of media content. Deep forgery technology and forensics technology promote each other, and this interplay is central to addressing the challenges that deep forgery technology brings to the media industry. We focus on the consistency of the coordinates of facial key points and corner points and propose a spatiotemporal local feature abstraction (STLFA) framework for facial forgery identification. The framework establishes the relation among local features via an organ-specific method and combines abnormal facial movement detection and facial landmark time discontinuity detection to analyze the facial key point and corner point features frame by frame. It mainly detects the consistency of the movement of facial key point coordinates and the variations in the number of facial corner points. At the same time, the method utilizes a bidirectional RNN to model sequences in eight local facial regions, capturing the facial shape pattern, the key point movement pattern, and the corner point number variations.

Table 1: Comparison of local organs (Acc (%)). The bold values highlight the results of the experiments conducted for this study. Specifically, they represent the performance of the proposed method in each experimental group.

Table 2: Ablation experiments (AUC (%)). The bold values indicate the experimental records where the proposed method achieved the best results within each experimental group.

Table 3: AUC (%) results of the proposed method and mainstream methods on the FF++ dataset.
However, since the sample distribution of the FF++ dataset cannot represent all deep forgery techniques, the generalization of this method under a new data distribution is not guaranteed, which may lead to degraded performance in cross-dataset testing. Research on the generalization problem will be our future goal.
Discussion.
The proposed method utilizes RNNs to model local facial feature sequences, achieves deepfake discrimination through abnormal facial movement detection and facial landmark time discontinuity detection, and exhibits good detection performance and compression resistance. Our method mainly exploits the detection performance of each local face region for deep forgery and can effectively learn and model the forgery features and patterns of local face regions.

Table 4: Detection accuracy in cross-dataset experiments using only local organs and organ combinations (Acc (%)).