Automatic Speaker Verification (ASV) systems have been widely used for biometric authentication, automatically verifying a speaker's identity by analyzing their speech. With the rapid advancement of deep learning, Deepfake speech produced by speech synthesis and voice conversion technologies has become difficult to discern. This leaves ASV systems vulnerable to spoofing attacks, in which an attacker impersonates the target speaker by presenting Deepfake speech that is nearly indistinguishable from bona fide speech. Such attacks pose a significant security threat to existing ASV systems, making technology for verifying the authenticity of speech an urgent need.

In recent years there has been substantial research progress on speech processing, and the academic community has proposed a series of AI-driven solutions covering feature engineering, back-end models, loss functions, model ensembling, and so on. However, research specifically targeting Deepfake speech detection remains scarce: for feature extraction, most studies rely on features borrowed from speech recognition and speech synthesis, while for models, most work adapts classic architectures designed for image recognition. This thesis analyzes in depth the differences between Deepfake and bona fide speech, and proposes effective Deepfake speech detection methods from both the feature extraction and the model construction perspectives. The contributions are as follows:

(1) To address the lack of features suited to Deepfake speech detection, this thesis first conducts an in-depth analysis of the frequency distributions of different phonemes in bona fide and Deepfake speech, and identifies the frequency bands where the differences are most significant. It then adjusts the density of the filter bank according to these differences, obtaining a new feature designed specifically for Deepfake speech detection: the Phoneme Frequency Cepstral Coefficient (PFCC). Furthermore, to verify the effectiveness of the PFCC feature and to reduce the influence of language differences on phoneme frequency distributions, experiments are conducted on both an English dataset (ASVspoof 2021 DF) and a Chinese dataset (FMFCC-A). The results demonstrate the effectiveness and feasibility of the proposed method.

(2) Pipeline neural network models contain numerous modules; in particular, feature engineering requires considerable phonetic expertise for manual design, and the selection and tuning of hyperparameters strongly affect detection performance. To address these issues with traditional neural network models, this thesis develops an end-to-end Deepfake speech detection model that improves on RawNet2. The proposed model introduces a Sinc layer to capture speaker features directly from the raw speech samples and stacks multiple FMS-ResNet layers, using feature map scaling (FMS), a mechanism similar to self-attention pooling, to scale feature maps with global contextual information inside the convolutional blocks, which enables the residual blocks to output more discriminative representations. Skip-layer connections between layers aggregate multi-level features while also reducing the total number of parameters, further enhancing the model's performance. On the Deepfake speech detection task of the ASVspoof 2021 DF dataset, the proposed model achieves an EER of 22.01%, outperforming the four baseline systems provided for the DF task.
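To illustrate the idea behind the PFCC feature in contribution (1), the following is a minimal sketch of a PFCC-style extractor in Python. It assumes a mel-style triangular filter bank whose filter density is simply re-allocated toward bands presumed to be discriminative; the band edges, filter counts, and framing parameters below are illustrative placeholders, not the thesis's actual design.

```python
# Minimal sketch of a PFCC-style feature extractor. The band layout and
# filter counts are hypothetical placeholders, not the published design.
import numpy as np
from scipy.fft import dct

def triangular_filterbank(band_edges_hz, n_fft, sr):
    """Build triangular filters from an edge list; denser edges in a
    band mean finer spectral resolution there."""
    bins = np.floor((n_fft + 1) * np.array(band_edges_hz) / sr).astype(int)
    fbank = np.zeros((len(bins) - 2, n_fft // 2 + 1))
    for i in range(1, len(bins) - 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def pfcc_like(wave, sr=16000, n_fft=512, hop=160, n_ceps=20):
    # Frame the waveform, window it, and take the power spectrum.
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hamming(n_fft), n_fft)) ** 2
    # Hypothetical edge layout: denser filters in 0-2 kHz and 6-8 kHz,
    # standing in for the bands the thesis finds most discriminative.
    edges = np.concatenate([np.linspace(0, 2000, 12),
                            np.linspace(2300, 5700, 6),
                            np.linspace(6000, 8000, 12)])
    fbank = triangular_filterbank(edges, n_fft, sr)
    # Log filter-bank energies followed by a DCT, as in MFCC extraction.
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

The pipeline mirrors standard cepstral extraction; only the filter spacing differs, which is the knob the abstract describes adjusting to obtain PFCC.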
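Similarly, for contribution (2), the sketch below shows one plausible FMS residual block in PyTorch, following the general RawNet2 recipe of a residual convolution stack followed by feature map scaling. The channel counts, kernel sizes, and activation choices are assumptions, not the exact configuration used in the thesis.

```python
# Sketch of an FMS residual block in the RawNet2 style. Layer sizes are
# illustrative assumptions, not the thesis's exact configuration.
import torch
import torch.nn as nn

class FMSResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(in_ch)
        self.bn2 = nn.BatchNorm1d(out_ch)
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.3)
        # 1x1 conv aligns channels when the residual shortcut changes width.
        self.short = (nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch
                      else nn.Identity())
        # FMS: a per-channel gate computed from time-pooled global context,
        # akin to a lightweight attention over feature maps.
        self.fms = nn.Linear(out_ch, out_ch)

    def forward(self, x):                            # x: (batch, ch, time)
        y = self.conv1(self.act(self.bn1(x)))
        y = self.conv2(self.act(self.bn2(y)))
        y = y + self.short(x)                        # residual connection
        s = torch.sigmoid(self.fms(y.mean(dim=-1)))  # global context gate
        s = s.unsqueeze(-1)
        return y * s + s                             # scale and shift by FMS

# Usage: stack several blocks after a Sinc front-end, then pool for a score.
block = FMSResBlock(20, 128)
feat = block(torch.randn(4, 20, 600))                # -> (4, 128, 600)
```

The gate `s` plays the role the abstract attributes to FMS: a self-attention-like scaling of each feature map by global context, applied both multiplicatively and additively as in the original RawNet2 block.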