| With the development of deep learning, the visual quality of synthetic media has reached the point where it is difficult to distinguish from real content with the human eye. State-of-the-art generative models have been successfully applied in education, film, games, and other fields. However, malicious attackers use generative models to synthesize forged face videos and spread misinformation and disinformation. This technology, known as Deepfake, can synthesize highly deceptive forged videos. In addition, the rapid development of hardware, the massive growth of social media data, and the open-sourcing and commercialization of face editing software make Deepfakes even harder to control. Therefore, how to effectively detect forged face videos has become a research hotspot at home and abroad. For face forgery detection and localization, existing methods mainly use the error-threshold method to generate ground-truth maps that supervise the localization task. However, the error-threshold method suffers from problems such as partial omissions and weak forgery signals. Moreover, current methods do not consider multi-scale manipulation regions, which limits localization performance. Regarding the generalization of face forgery detection models, existing methods struggle to learn a unified representation for different types of manipulation, and their generalization ability needs further improvement. This paper therefore proposes solutions to these shortcomings in face forgery detection, localization, and generalization. The main contributions of this paper are as follows: (1) A novel face forgery detection and localization network named SIFDNet is proposed based on structural similarity (SSIM) error maps. It consists of an Adjacent Layer Aggregation Module (ALAM), a Dilated Convolution Module (DCM), and Gradient-enhanced Blocks (GEBs). The ALAM extracts features from multiple levels to estimate SSIM error maps with attention residual learning. The DCM is introduced to deal with
multi-scale manipulation regions. Finally, the Sobel stream, which consists of three GEBs, enhances forgery artifacts and the contours of manipulation regions. Experimental results and analysis on four benchmark datasets, i.e., FaceForensics++, FaceShifter, Celeb-DF, and DFDC, show that SIFDNet can effectively perceive the artifacts of Deepfake videos and their strength, and can thoroughly predict manipulation regions. It outperforms existing methods in face forgery detection and localization accuracy, and has great potential for application to forgery detection and localization in Deepfake videos. (2) A fake-side domain generalization framework based on spatio-temporal and frequency feature learning is proposed to improve the generalization ability of face forgery detection models. First, the framework consists of a spatio-temporal stream that learns spatio-temporal consistency features and a frequency stream that learns domain-invariant features; the spatio-temporal stream contains multiple multi-scale vision transformer blocks, while the frequency stream consists of a frequency decomposition and a convolutional neural network. Second, the framework learns a generalization space through a fake-side adversarial loss, in which the features of fake faces from different domains are compact and mixed, while the features of real faces are scattered around them. Extensive cross-domain experiments on several datasets show that the generalization performance of the proposed method is superior to that of previous methods. (3) A Deepfake video detection and manipulation localization system based on SIFDNet is designed and implemented. The main functions of the system include: a) uploading and persistence of face video data; b) real-time display of detection progress during frame-by-frame video analysis;
c) output and display of the final forgery detection and manipulation localization results. The research findings of this paper can be applied to the detection of Deepfake videos and the localization of manipulation regions, providing a new approach to Deepfake detection and improving the generalization performance of the model on unseen manipulation types and data domains. |
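
The abstract contrasts two kinds of supervision map for localization: a binary map obtained by thresholding the pixel error between a real face and its forged counterpart, and a continuous SSIM error map. The sketch below illustrates that contrast only; it is not the SIFDNet pipeline. The function names (`box_filter`, `threshold_mask`, `ssim_error_map`), the uniform 7x7 window, the threshold of 0.1, and the toy random images are all assumptions made for illustration, and the standard SSIM constants (0.01 and 0.03 times the data range, squared) are used.

```python
import numpy as np

def box_filter(img, k):
    """Mean of every k x k window (valid positions only), via an integral image."""
    ii = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return (ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]) / (k * k)

def threshold_mask(real, fake, tau=0.1):
    """Binary map: 1 where the absolute pixel error exceeds tau (assumed value)."""
    return (np.abs(real - fake) > tau).astype(np.float32)

def ssim_error_map(real, fake, k=7, data_range=1.0):
    """Continuous map 1 - SSIM, computed per k x k window (uniform weights)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_r, mu_f = box_filter(real, k), box_filter(fake, k)
    var_r = box_filter(real * real, k) - mu_r ** 2
    var_f = box_filter(fake * fake, k) - mu_f ** 2
    cov = box_filter(real * fake, k) - mu_r * mu_f
    ssim = ((2 * mu_r * mu_f + c1) * (2 * cov + c2)) / (
        (mu_r ** 2 + mu_f ** 2 + c1) * (var_r + var_f + c2)
    )
    return 1.0 - ssim

# Toy grayscale "faces" in [0, 1]: the fake differs only inside one square patch.
rng = np.random.default_rng(0)
real = rng.random((32, 32))
fake = real.copy()
fake[8:20, 8:20] = rng.random((12, 12))

mask = threshold_mask(real, fake)   # same size as the input, values in {0, 1}
err = ssim_error_map(real, fake)    # smaller than the input by k - 1 per axis
```

The thresholded map discards any difference below `tau`, whereas the SSIM error map keeps a graded value for every window, which is one way to read the abstract's point about weak forgery signals.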