
Multi-modal Deepfake Detection Of Specific Individuals Based On Audio-visual Features

Posted on: 2024-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: B L Chu
Full Text: PDF
GTID: 2568306941984049
Subject: Cyberspace security

Abstract/Summary:
Deepfake technology uses deep learning to synthesize fake audio and video, producing highly realistic fabricated content of real people. Current deepfakes imitate real people so faithfully that they are difficult to distinguish with the naked eye. Used illegally, attackers can fabricate false news, carry out online violence, and conduct other activities that harm society, distorting people's judgment and decision-making in the real world. Designing effective deepfake detection technology is therefore particularly important, and it is the fundamental starting point of this research.

Existing detection techniques mainly rely on deep learning methods trained on large-scale samples. They use the multidimensional learning ability of neural networks to capture the artifacts and anomalies that deepfake techniques may leave in images or videos across different modalities. Moreover, current work in deepfake detection mainly addresses identity-agnostic image and video authentication: the samples a model encounters contain images or videos of various people, without considering who those people are. In practical applications, however, specific individuals, such as political figures, academic authorities, and celebrities, often have greater influence on social media. If deepfake technology is applied to these individuals, it may cause serious negative effects and endanger internet security.

To address these issues, this paper explores deepfake detection for specific individuals, applying widely used multimodal learning to model the speaking patterns of a given individual and thereby resist attacks from deepfake technology. The main contents of the work are as follows.

First, this paper studies the detection of deepfake videos targeting specific individuals. In contrast to existing deepfake detection methods, which rely on spotting the artifacts left by deepfake techniques, the proposed method achieves high accuracy by exploiting the high-dimensional consistency between facial movements and lip movements during speech. It thus overcomes the poor generalization to different classes of deepfake videos and remains highly robust to compressed videos. Specifically, the model uses a dual-stream structure: one stream extracts features related to facial movements and learns complex semantic relationships, while the other extracts features related to lip movements and learns motion patterns. The two streams are fused by concatenation along the feature dimension and fed into a backend network for classification. Because no publicly available dataset targets specific individuals, a dataset was constructed from video data downloaded from social media platforms: four US politicians were selected, and models were trained on their video data to generate high-quality deepfake samples for the task. Evaluated on this dataset and on the public FakeAVCeleb dataset, the proposed method outperforms existing methods for deepfake detection targeting specific individuals and maintains high detection performance on compressed videos.
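To make the dual-stream design concrete, below is a minimal PyTorch sketch of one way such a fusion model could be assembled. It is an illustration under assumptions, not the thesis's actual implementation: the class name DualStreamDetector, the feature dimensions, and the choice of an MLP face stream with a GRU lip stream are hypothetical placeholders.

# Hypothetical sketch of the dual-stream model described above; all module
# names and dimensions are illustrative assumptions, not the thesis's design.
import torch
import torch.nn as nn

class DualStreamDetector(nn.Module):
    def __init__(self, face_dim=512, lip_dim=256, num_classes=2):
        super().__init__()
        # Stream 1: facial-movement features (complex semantic relationships).
        self.face_stream = nn.Sequential(
            nn.Linear(face_dim, 256), nn.ReLU(), nn.Linear(256, 128))
        # Stream 2: lip-movement features (motion patterns over time).
        self.lip_stream = nn.GRU(lip_dim, 128, batch_first=True)
        # Backend classifier over the concatenated stream features.
        self.classifier = nn.Sequential(
            nn.Linear(128 + 128, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, face_feats, lip_feats):
        # face_feats: (batch, face_dim) per-clip facial-movement descriptor
        # lip_feats:  (batch, time, lip_dim) lip-movement sequence
        f = self.face_stream(face_feats)          # (batch, 128)
        _, h = self.lip_stream(lip_feats)         # h: (1, batch, 128)
        fused = torch.cat([f, h[-1]], dim=-1)     # concatenation along features
        return self.classifier(fused)             # real-vs-fake logits

# Example usage with random stand-in features:
# model = DualStreamDetector()
# logits = model(torch.randn(8, 512), torch.randn(8, 25, 256))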
Furthermore, this study investigates the model design of multi-modal deepfake detection using both the audio and visual modalities and proposes a corresponding network architecture for the task. A cross-modal alignment scheme was designed, and the self-supervised network Wav2Vec2 was used to extract effective features from the audio signal. Comparative experiments show that the proposed multi-modal approach outperforms existing detection models, indicating that the audio and visual modalities exhibit strong consistency in human speech and can serve as effective clues for deepfake detection, offering a new perspective for the field.
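As an illustration of the audio feature-extraction step, the following sketch shows one plausible way to obtain Wav2Vec2 features with the Hugging Face transformers library. The checkpoint name facebook/wav2vec2-base-960h and the 16 kHz mono-waveform assumption are placeholders, since the abstract does not specify the exact configuration.

# Illustrative Wav2Vec2 audio feature extraction (assumed setup; the
# checkpoint and pooling choices are placeholders, not the thesis's exact
# configuration).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
encoder.eval()

def audio_features(waveform, sample_rate=16000):
    # waveform: 1-D float tensor holding a mono speech clip
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate,
                       return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state.squeeze(0)  # (time, 768) frame-level features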
In summary, this paper models the speaking patterns of specific individuals through the strong correlation between multiple modalities, providing effective resistance against various tampering methods. The research fills a gap in the field of deepfake detection, which currently lacks studies that offer dedicated protection for specific individuals, and it demonstrates that each individual has a unique multi-modal consistency, providing new ideas for better deepfake detection.

Keywords: deepfake detection, identity protection, multimodal learning