Automatic Speaker Verification (ASV) systems authenticate a speaker's identity by analyzing their voice. Deep learning has made these systems popular, offering high accuracy and user-friendliness. However, ASV systems are vulnerable to spoofing attacks: attackers can exploit techniques such as replay, speech synthesis, and voice conversion to generate fake speech and undermine ASV systems. Consequently, it is crucial to investigate effective spoofed-speech detection methods to mitigate the threat posed by ASV attacks. Based on datasets from the ASVspoof 2019 and ASVspoof 2021 challenges, this paper conducts research on multiple aspects, including front-end feature extraction, back-end classifier selection, loss function design, and model fusion. The research achieves the following advancements:

(1) A synthetic-speech spoofing detection method based on online hard example mining is proposed. By selecting hard samples with high training loss values for online feedback training, the method effectively addresses the imbalanced distribution of easy and hard samples in the training set. Experimental results show that introducing the Online Hard Example Mining (OHEM) algorithm yields relative reductions in equal error rate (EER) of 42%, 28%, 25%, and 22% for four deep neural network models, namely ResNet18, ResNet50, SE-Res2Net, and Raw-Res2Net, respectively.

(2) A new network architecture called Raw-Res2Net is proposed. Compared with the RawNet2 model, this model replaces the residual blocks with Res2Net blocks and employs a squeeze-and-excitation mechanism for feature map scaling. Res2Net enhances the representation of multi-scale features and expands the receptive field of each layer, while squeeze-and-excitation blocks recalibrate channel-wise feature responses by explicitly modeling channel interdependencies. Experimental results demonstrate that, with the OHEM algorithm, the proposed model reduces the EER relative to the RawNet2 model by 35%. Compared with the two baseline systems of the ASVspoof 2019 challenge, the EER is relatively reduced by 63% and 68%, respectively.

(3) A replay-attack speech detection method based on a dual-input hierarchical fusion network is proposed. This method takes the original signal and its time-reversed version as the model's two inputs and introduces a hierarchical fusion module to effectively fuse the outputs of the corresponding residual blocks of the upper and lower branches. On the ASVspoof 2021 PA test set, this method achieves strong performance, with an EER of 24.46% and a min t-DCF of 0.6708. Compared with the four baseline systems of the ASVspoof 2021 challenge, the min t-DCF is relatively reduced by 28.9%, 31.0%, 32.6%, and 32.9%, respectively.
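The OHEM idea in (1) can be sketched as a simple per-batch selection step: compute a loss for every sample, then backpropagate only through the hardest ones. This is a minimal illustration, not the paper's implementation; the `keep_ratio` hyperparameter and the function name are assumptions.

```python
def ohem_select(losses, keep_ratio=0.5):
    """Return indices of the hardest samples in a batch.

    OHEM keeps only the samples with the highest per-sample loss,
    so that hard examples dominate the gradient update.
    `keep_ratio` is a hypothetical hyperparameter controlling how
    large a fraction of the batch is kept.
    """
    k = max(1, int(len(losses) * keep_ratio))
    # Sort sample indices by loss, descending; keep the top-k hardest.
    return sorted(range(len(losses)),
                  key=lambda i: losses[i], reverse=True)[:k]

# Example: a batch of six per-sample loss values.
losses = [0.1, 2.3, 0.05, 1.7, 0.4, 0.9]
print(ohem_select(losses, keep_ratio=0.5))  # hardest half: [1, 3, 5]
```

In a training loop, the selected indices would feed the backward pass, while the easy samples are discarded for that iteration.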
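The dual-input scheme in (3) can likewise be sketched in miniature: the second input is simply the time-reversed waveform, and a fusion module combines the corresponding block outputs of the two branches. The elementwise-mean fusion rule below is a hypothetical stand-in, since the abstract does not specify the exact fusion operation.

```python
def time_reverse(signal):
    # Second network input: the time-reversed copy of the waveform.
    return signal[::-1]

def hierarchical_fuse(upper_blocks, lower_blocks):
    """Fuse per-level outputs of the two branches.

    Hypothetical fusion rule: for each level, take the elementwise
    mean of the corresponding residual-block outputs from the upper
    (original-signal) and lower (reversed-signal) branches.
    """
    return [[(u + l) / 2.0 for u, l in zip(ub, lb)]
            for ub, lb in zip(upper_blocks, lower_blocks)]

# Toy example: two branches, each with two residual-block outputs.
upper = [[1.0, 2.0], [3.0, 4.0]]
lower = [[0.0, 2.0], [1.0, 0.0]]
print(hierarchical_fuse(upper, lower))  # [[0.5, 2.0], [2.0, 2.0]]
```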