
Research On Forged Speech Detection Technology Based On Feature Fusion In Time And Frequency Domain

Posted on: 2022-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Pan
GTID: 2518306752965329
Subject: Automation Technology
Abstract/Summary:
Speech information processing technology has developed rapidly under the impetus of deep learning. Combining speech synthesis with voice conversion makes it possible to produce real-time, high-fidelity speech for a specified speaker and specified content, which has broad application prospects in human-computer interaction, pan-entertainment and other fields. At the same time, deepfake technology driven by deep learning algorithms is booming. Forged speech generated by malicious use of this technology is becoming harder and harder to distinguish from genuine speech, and the resulting public security risks are gradually being exposed. Research on forged speech detection has therefore attracted increasing attention in recent years. To address the weak generalization ability and low detection accuracy of current forged speech detection technology, this paper conducts research from two aspects: speech feature extraction and forgery detection model construction. The main work is as follows:

(1) For speech feature extraction, a feature extraction method based on time-frequency domain information fusion is proposed. This paper studies the feature extraction process, including speech signal preprocessing and the principles and steps of mainstream speech feature extraction methods. On this basis, the Gammatone filter bank is selected for frequency-domain feature extraction; it simulates the signal processing of the auditory system more finely and has been shown to improve robustness against noise in speaker recognition tasks. First-order and second-order difference parameters of the features are introduced to better describe the correlation between adjacent speech frames, supplement the temporal information, and improve the completeness of the features. Comparative experiments demonstrate the effectiveness of the features extracted by this method for forgery detection. Based on the experimental results, the Mel-frequency cepstral coefficients with first-order difference parameters and the features extracted with the Gammatone filter bank are chosen as the optimal feature set for forged speech detection, which avoids the negative impact of stacking redundant features on detection efficiency.

(2) For forgery detection model construction, a forged speech detection method based on a composite model is proposed. The method integrates a ResNeXt network and a Bi-GRU network, analyzes feature changes at both the global and the local scale, and improves the recognition rate of forged speech. The ResNeXt network uses stacked convolutional layers to continuously expand the receptive field, so it can analyze the overall features. On this basis, an attention mechanism is introduced to further help the model find forgery cues. The Bi-GRU network focuses on abrupt changes in speech features between neighboring frames. In addition, the model adds input channels and mines multi-scale information from the different inputs, which improves detection effectiveness. The two networks are fused in parallel to obtain a more efficient and accurate detection model.

The experimental results show that the proposed detection method achieves 97.33% and 91.97% detection accuracy for logically forged samples (Logical Access, LA) and physically forged samples (Physical Access, PA), respectively, on the ASVspoof2019 dataset. Compared with the best detection results obtained by a single model, evaluation indicators such as accuracy and F1 value are improved.
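The first-order and second-order difference parameters described in item (1) can be illustrated with a minimal sketch. The abstract does not state which difference formula or window width the thesis uses, so the regression-style delta below, the `width` parameter, and the function names `delta` and `stack_with_deltas` are assumptions made for illustration only. The input is assumed to be a matrix of per-frame coefficients (for example MFCCs) of shape `(num_frames, num_coeffs)`.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-style difference over time (assumed formula, not from the thesis).

    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    `features` has shape (num_frames, num_coeffs); edges are padded by
    repeating the first and last frame.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    num_frames = features.shape[0]
    # Repeat edge frames so every frame has `width` neighbors on both sides.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_with_deltas(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Concatenate static, first-order and second-order difference features."""
    d1 = delta(features, width)
    d2 = delta(d1, width)  # second-order difference = difference of the first-order one
    return np.concatenate([features, d1, d2], axis=1)
```

Applied to 13 static coefficients per frame, `stack_with_deltas` would yield 39 values per frame. According to the abstract, the feature set finally selected keeps only the first-order difference for the Mel-frequency cepstral coefficients, which in this sketch would correspond to concatenating `features` and `delta(features)` only.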
Keywords/Search Tags:forged speech detection, speech information processing, feature fusion, deep learning, attention mechanism