Research On Speech Synthesis And Spoof Detection Based On Deep Learning

Posted on: 2023-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: H D Xiao
Full Text: PDF
GTID: 2568307103492774
Subject: Software engineering
Abstract/Summary:
Intelligent voice technology has become an important mode of human-computer interaction. Speech synthesis is a core component of it: it converts text into a speech signal that reads the text aloud, giving machines the ability to speak like a human. As speech synthesis technology develops and spreads, demand for personalized multi-speaker speech synthesis keeps growing. At the same time, criminals use speech synthesis and related technologies to forge other people's voices, attack voice-based identity authentication systems, and commit telecommunications fraud, which poses security risks to society. Studying speech forgery detection methods that accurately identify forged speech is therefore of great social significance for protecting citizens' property and privacy.

Existing deep-learning-based methods for speech synthesis and speech forgery detection have the following problems: (1) existing multi-speaker speech synthesis methods do not sufficiently fuse the features of the reference speech, and they place insufficient constraints on the timbre consistency of the synthesized speech; (2) speech forgery detection models generalize poorly and struggle to cope with increasingly mature speech synthesis and voice conversion methods. This thesis proposes a solution to each problem.

For problem (1), this thesis proposes a multi-speaker speech synthesis method based on adversarial learning. A text-speech feature fusion method based on affine transformation is designed, in which the affine transformation parameters are predicted from speech features. An adversarial mechanism is also introduced in the form of a timbre-aware discriminator, which is used when training the acoustic model to improve the similarity between the synthesized speech and the target speaker's speech.

For problem (2), this thesis presents a speech forgery detection method based on data augmentation. A data augmentation method based on frequency-domain exchange improves the generalization of the detection model, and a time-frequency attention mechanism is designed to recalibrate the speech features so that the model pays more attention to the time frames and frequency bands that carry significant forgery information. The resulting equal error rate on the ASVspoof LA dataset is 1.88%.

In addition, the proposed speech synthesis method is used to design and implement Slide2 Video, a system that automatically generates presentation videos from presentation slides and speech scripts, which is convenient for researchers taking part in online international academic conferences.
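
For readers unfamiliar with the equal error rate (EER) reported in the abstract, the following is a minimal sketch of how the metric can be computed from detection scores. It is not the thesis's evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely genuine", and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a spoof detector.

    Assumed convention (hypothetical, not taken from the thesis):
    a higher score means the utterance is more likely genuine.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: genuine speech scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: forged speech scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is the error rate at the threshold where FAR and FRR meet.
    # With finitely many scores they may not cross exactly, so take the
    # closest point and average the two rates.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2, thresholds[idx]


# Toy scores, invented for this example.
eer, threshold = equal_error_rate(
    bonafide_scores=[0.9, 0.8, 0.7, 0.3],
    spoof_scores=[0.1, 0.2, 0.4, 0.6],
)
print(f"EER = {eer:.2%} at threshold {threshold}")
# EER = 25.00% at threshold 0.6
```

In the toy example, a threshold of 0.6 rejects one of four genuine utterances and accepts one of four forged ones, so both error rates are 25%. By the same definition, an EER of 1.88% means that at the balancing threshold about 1.88% of genuine utterances are rejected and about 1.88% of forged utterances are accepted.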
Keywords/Search Tags:Deep learning, Speech Synthesis, Speech Spoofing Detection, Attention Mechanism