| In recent years, deepfake videos have spread widely on the Internet, causing substantial harm and posing serious threats and challenges to individuals, society, and the state. At present, most research on deepfake video detection focuses on the spatial features of single frames; this approach ignores the temporal features specific to video, so the detection model does not extract forgery features comprehensively. Building on spatial feature extraction, this thesis proposes two deepfake video detection methods that fuse multi-frame temporal features to extract information more comprehensively. The main work of the thesis is as follows: (1) Video data preprocessing. First, frames are extracted from the video; then a face detection algorithm is used to locate and crop faces to form face images; for the small number of videos in which irrelevant faces are detected, data cleaning methods are proposed and data augmentation is performed. For the 3D convolution-based temporal feature detection algorithm, an algorithm is proposed that converts the images into 5-dimensional data as network input. (2) A deepfake video detection method based on 3D convolutional temporal features is proposed. 3D convolutional neural networks incur a significant increase in computation and parameter count when processing multiple frames simultaneously, and they ignore detail features during detection. To address these problems, the EfficientNet-B0 network is mapped to a 3D convolutional neural network by combining it with 3D convolution kernels to extract temporal features; the receptive fields of the temporal and spatial dimensions are balanced using asymmetric downsampling; finally, the overall receptive field is enlarged using dilated convolution. In the data preprocessing stage, an improved multiscale image detail enhancement method is used to enhance the detail features of image frames, which enables the model to better learn local detail features while learning temporal features. (3) A deepfake video detection method improved with GRU and Involution is proposed. To address the problems that the capsule network structure focuses only on spatial features, that its detection accuracy needs improvement, and that the standard convolution operator is spatially agnostic and channel-specific, VGG19_bn is first used as the backbone network to extract spatial features, and the Involution operator is embedded into the backbone to strengthen the spatial modeling of face images from the perspective of spatial and channel information, yielding a global feature extraction network. Then the primary capsule layer is used to learn the positional information of features, and a GRU performs frame-level learning on the CNN-Capsule output, so that the whole model can fully learn temporal features while attending to the positional information of features; finally, Focal Loss is used to balance the samples in the video classification stage. (4) Experiments. The two proposed detection methods are tested on the public datasets FaceSwap, DeepFakes, and Celeb-DF. The experimental results show that both methods can fully extract temporal feature information and improve detection accuracy; they also indicate that the method using 3D convolution to extract temporal features generalizes well, and that the method using GRU to extract temporal features is more stable. |
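
The preprocessing step that turns cropped face images into 5-dimensional network input can be illustrated with a minimal sketch. The abstract does not state the axis order, so the layout below, (clips, channels, frames, height, width), is an assumption borrowed from the common 3D-convolution convention; the helper names `shape_of`, `frame_to_chw`, and `pack_clips` are hypothetical, and plain nested lists stand in for real image arrays.

```python
def shape_of(nested):
    """Return the dimensions of a regular nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def frame_to_chw(frame):
    """Reorder one face crop from (H, W, C) to (C, H, W)."""
    height, width, channels = len(frame), len(frame[0]), len(frame[0][0])
    return [
        [[frame[y][x][ch] for x in range(width)] for y in range(height)]
        for ch in range(channels)
    ]


def pack_clips(frames, clip_len):
    """Group consecutive face crops into clips laid out as (N, C, T, H, W).

    Assumed layout, not taken from the thesis. Trailing frames that do not
    fill a complete clip are dropped.
    """
    clips = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        chw = [frame_to_chw(f) for f in frames[start:start + clip_len]]
        channels = len(chw[0])
        # Move the channel axis in front of time: (T, C, H, W) -> (C, T, H, W).
        clips.append(
            [[chw[t][ch] for t in range(clip_len)] for ch in range(channels)]
        )
    return clips


# Ten synthetic 2x5 RGB "face crops"; each value encodes its own coordinates.
frames = [
    [[[i * 1000 + y * 100 + x * 10 + c for c in range(3)] for x in range(5)]
     for y in range(2)]
    for i in range(10)
]
clips = pack_clips(frames, clip_len=4)
print(shape_of(clips))  # (2, 3, 4, 2, 5)
```

With ten frames and a clip length of four, two full clips are produced and the last two frames are discarded; a real pipeline might pad or overlap clips instead.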
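
The interplay between asymmetric downsampling and dilated convolution can be checked with simple receptive-field arithmetic. The sketch below uses the standard per-axis recurrence (the receptive field grows by (kernel - 1) x dilation x accumulated stride at each layer); the three-layer stacks are hypothetical and are not the layer configuration of the thesis.

```python
def receptive_field(layers):
    """Receptive field along ONE axis for a stack of (kernel, stride, dilation) layers."""
    rf, jump = 1, 1  # jump = spacing of adjacent outputs measured in input samples
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stacks, described per axis as (kernel, stride, dilation).
temporal = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]          # time axis is never downsampled
spatial = [(3, 2, 1), (3, 2, 1), (3, 1, 1)]           # height/width are downsampled twice
temporal_dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]  # same depth, dilated along time

print(receptive_field(temporal))          # 7
print(receptive_field(spatial))           # 15
print(receptive_field(temporal_dilated))  # 15
```

In this toy setting, leaving the short temporal axis undownsampled keeps its resolution but limits its receptive field to 7 frames, and adding dilation along time raises it to 15 without extra parameters.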
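
The role of Focal Loss in balancing samples can be made concrete with a small binary example. This is a sketch of the standard focal loss formula, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the defaults `gamma=2.0` and `alpha=0.25` are commonly used values assumed here for illustration, not settings reported in the thesis.

```python
import math


def focal_loss(p_fake, label, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one sample.

    p_fake: predicted probability that the video is fake (class 1).
    label:  1 for fake, 0 for real.
    gamma, alpha: assumed illustrative defaults, not values from the thesis.
    """
    p = min(max(p_fake, eps), 1.0 - eps)            # clamp to avoid log(0)
    p_t = p if label == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # class weighting factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confidently correct prediction
hard = focal_loss(0.30, 1)  # misclassified fake video
print(easy < hard)          # True
```

The modulating factor (1 - p_t)^gamma shrinks the loss of well-classified samples far more than that of hard ones, so training concentrates on difficult examples; setting `gamma=0` recovers alpha-weighted cross-entropy.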