
Temporal-Spatial Modeling Transformer For Deepfake Detection

Posted on: 2024-03-16    Degree: Master    Type: Thesis
Country: China    Candidate: P Wang    Full Text: PDF
GTID: 2568306932955399    Subject: Cyberspace security

Abstract/Summary:
With the rapid development of deep learning, image and video tampering techniques based on deep neural networks have emerged, bringing significant benefits to the film industry. However, the public release of these techniques has led to the widespread dissemination of forged content in cyberspace, posing a serious threat to personal reputation, cybersecurity, and even political stability. Verifying the authenticity of media content on the network has therefore become a central concern for the community. Although many deepfake detection methods have been proposed in recent years, and many achieve relatively high detection accuracy, they share a common weakness: poor transferability in cross-dataset tests. Focusing on the transferability of the detection model, this thesis studies both image and video forgery detection. For images, a method is proposed based on a general cue: anomalies in the semantic information of the spatial domain. For videos, a method is proposed that detects face forgery by combining temporal inconsistency with spatial anomalies within forged videos. The goal is a universal face forgery detection framework with high transferability. The main contributions and innovations are summarized as follows:

1. A face forgery image detection method based on spatial information: Existing face forgery detection methods are mainly built on traditional convolutional neural networks, which tend to overfit to local texture information during training. Because these textures are tied to the specific generation algorithm, such detectors transfer poorly. This thesis proposes a highly transferable method built on the vision transformer, which analyzes the global semantic information of face images. Several plug-and-play enhancement modules are also proposed: an attention leading module that encourages the model to focus on the most discriminative regions, a variant residual connection that reduces redundant information while supplementing residual features, a multi-forensics module that integrates features from different levels, and a contrastive loss that strengthens supervision during training. Experimental results demonstrate that the framework achieves satisfactory performance and impressive transferability.
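The abstract does not give the exact form of the contrastive objective. A minimal sketch in PyTorch, assuming a supervised contrastive formulation in which embeddings of same-label samples (real or fake) are pulled together and different-label samples pushed apart, might look as follows; the function name and `temperature` value are illustrative, not the thesis's actual settings:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Hypothetical sketch: features (B, D) embeddings, labels (B,)
    with 0 = real, 1 = fake. Same-label pairs act as positives."""
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature            # (B, B) similarities
    b = features.size(0)
    eye = torch.eye(b, dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(eye, -1e9)                     # exclude self-pairs
    # Positives share the same label (self excluded).
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives.
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```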
2. A face forgery video detection method based on temporal-spatial information: Since most face forgery content spread on the network consists of videos, this thesis further targets forged videos and finds that they exhibit discontinuous and inconsistent information in both the temporal and spatial domains. The classic vision transformer is therefore modified into a spatio-temporal transformer that accepts video input and analyzes temporal and spatial information jointly. Because the transformer structure is not sensitive to texture information, a 3D convolutional neural network branch is introduced to supplement texture-level features. To exploit both branches fully and strengthen their interaction, a global-aware module and a cross-attention module are proposed: the former shares the global information of the transformer branch with the convolutional branch, and the latter merges the two branches' features. A series of comparative experiments show that this framework outperforms other state-of-the-art methods.
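The abstract does not detail how the vision transformer is adapted for video input. One common adaptation, shown here purely as an illustrative sketch, embeds a clip as non-overlapping spatio-temporal "tubelets" via a 3D convolution so that attention can span time and space jointly; the module name, patch sizes, and dimensions below are assumptions:

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Hypothetical sketch of a video patch embedding for a
    spatio-temporal transformer; settings are illustrative."""
    def __init__(self, in_ch=3, dim=768, t_patch=2, s_patch=16):
        super().__init__()
        # Non-overlapping 3D patches: stride equals kernel size.
        self.proj = nn.Conv3d(in_ch, dim,
                              kernel_size=(t_patch, s_patch, s_patch),
                              stride=(t_patch, s_patch, s_patch))

    def forward(self, video):
        # video: (B, C, T, H, W), e.g. (B, 3, 16, 224, 224)
        x = self.proj(video)                 # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim) tokens
```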
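Similarly, the cross-attention module that merges the two branches is only named in the abstract. A minimal sketch, assuming the transformer tokens query the 3D-CNN features through standard multi-head cross-attention with a residual merge, could be:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: transformer-branch tokens attend to
    3D-CNN-branch features so each representation is enriched
    with the other branch's cues."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, transformer_tokens, cnn_tokens):
        # transformer_tokens: (B, N, dim); cnn_tokens: (B, M, dim),
        # e.g. 3D-CNN feature maps flattened over (T', H', W') and
        # projected to `dim`.
        q = self.norm_q(transformer_tokens)
        kv = self.norm_kv(cnn_tokens)
        fused, _ = self.attn(q, kv, kv)      # queries from transformer branch
        return transformer_tokens + fused    # residual merge
```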
Keywords/Search Tags:DeepFake Detection, Transferability, Vision Transformer, Digital Image and Video Forensics