Deep learning-based face generation technology can now synthesize highly realistic images and videos of human faces. This technology, commonly known as deepfake, poses a serious threat to public safety: malicious actors can use it to fabricate news statements or create pornographic videos. With the rise of short-video platforms, developing deepfake detection techniques to prevent such misuse has become increasingly important. In recent years, the deepfake detection task has drawn attention from both industry and academia, with large-scale deepfake video datasets and new detection methods emerging. In practical scenarios, however, these detection methods often have major limitations: accuracy drops significantly when a trained model is tested on unseen or compressed datasets, i.e., the models generalize poorly. Furthermore, most research stops at frame-level detection, and there is insufficient study of how to transfer frame-level detection models to video-level detection. This thesis proposes two deepfake detection methods to address these problems:

(1) Deepfake video detection based on dynamic composite feature enhancement. Deepfake videos contain artifacts in certain regions as a result of shortcomings in the generation algorithm, and the regions that contain artifacts may vary with the specific generation algorithm. To address this, the thesis dynamically extracts the most discriminative feature regions from the feature maps produced by convolutional neural networks, so as to learn composite features at different scales. The proposed method improves the generalization ability of the detection model and can roughly locate the manipulated regions in a video.

(2) Deepfake video detection based on multi-frame feature fusion. Most deepfake generation algorithms operate frame by frame, which introduces a degree of temporal discontinuity into the generated video. The thesis proposes a detection method that fuses features from multiple frames of a video: a temporal model extracts the temporal features carried within the video, and an attention mechanism selects the most important features from multiple sets of inputs, yielding a more discriminative feature representation.

The thesis conducts experiments on three large-scale deepfake video datasets: FaceForensics++, Celeb-DF, and DFDC. It then demonstrates the generalization advantages of the two detection methods through cross-compression-rate tests, cross-generation-algorithm tests, and cross-dataset tests.
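The "dynamic extraction of discriminative feature regions" in method (1) might be sketched as follows. This is a minimal, hypothetical simplification, not the thesis's actual algorithm: it collapses a CNN feature map to a saliency map, picks the k most-activated spatial cells, and pools a small patch around each into a composite feature vector. The function name, patch size, and use of a channel-mean saliency map are all assumptions for illustration.

```python
import numpy as np

def discriminative_regions(feature_map, k=2, patch=3):
    """Hypothetical sketch: select the k most-activated locations of a
    CNN feature map (C, H, W) and pool a local patch around each,
    concatenating the results into one composite feature vector."""
    C, H, W = feature_map.shape
    # Channel-wise mean gives a single spatial activation (saliency) map.
    saliency = feature_map.mean(axis=0)                    # (H, W)
    # Indices of the k highest-activation cells, in descending order.
    flat = np.argsort(saliency, axis=None)[::-1][:k]
    ys, xs = np.unravel_index(flat, (H, W))
    half = patch // 2
    parts = []
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - half), min(H, y + half + 1)
        x0, x1 = max(0, x - half), min(W, x + half + 1)
        # Average-pool the patch over space -> one C-dimensional vector.
        parts.append(feature_map[:, y0:y1, x0:x1].mean(axis=(1, 2)))
    return np.concatenate(parts)                           # (k * C,)

fmap = np.random.rand(8, 14, 14).astype(np.float32)
vec = discriminative_regions(fmap, k=2)
print(vec.shape)  # (16,)
```

Because the selected locations depend on the input's own activations, different generation artifacts can steer the model toward different regions, which is the intuition behind the dynamic selection described above.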
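The attention-based fusion in method (2) might look like the following minimal sketch. It is an assumption-laden simplification, not the thesis's implementation: per-frame features are scored by a (hypothetical) learned vector, the scores are softmax-normalized into attention weights, and the frames are combined by a weighted sum into one video-level representation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(frame_feats, w):
    """Hypothetical sketch: fuse per-frame features (T, D) into a single
    video-level feature (D,) via additive attention. `w` stands in for a
    learned scoring vector; a real model would train it end to end."""
    scores = frame_feats @ w        # one relevance score per frame, (T,)
    alpha = softmax(scores)         # attention weights, nonnegative, sum to 1
    return alpha @ frame_feats      # weighted sum of frames, (D,)

T, D = 16, 32                       # frames per clip, feature dimension
feats = np.random.rand(T, D)        # stand-in for temporal-model outputs
w = np.random.rand(D)
fused = attention_fuse(feats, w)
print(fused.shape)  # (32,)
```

Because the weights sum to one, the fused vector is a convex combination of the frame features, letting frames with stronger manipulation cues dominate the video-level decision.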