
Research And Application Of Voice Clone Technology Based On Deep Learning

Posted on: 2022-10-26    Degree: Master    Type: Thesis
Country: China    Candidate: F F Chen    Full Text: PDF
GTID: 2518306320484814    Subject: Engineering
Abstract/Summary:
Voice cloning is a relatively broad concept; techniques such as "voice style transfer" and "voice conversion" can all be called voice cloning. The voice cloning technology in this thesis extracts acoustic features from speech and synthesizes speech with specified content. Voice cloning has a wide range of applications. For example, the virtual anchor industry has boomed recently; a virtual anchor is a presenter who conducts live events through an avatar without appearing in person. Voice cloning can also shine in game dubbing, rehabilitation therapy for the disabled, and voice assistants for mobile phones.

At present, existing voice cloning schemes rely on large data sets and manual adjustment of prosody, which is demanding, time-consuming, and labor-intensive. High-quality open-source Chinese speech data is relatively scarce, and much of the available data is held by companies such as iFlytek. To address these problems, this thesis proposes a voice cloning technique based on deep learning. Unlike traditional approaches, it jointly models with three modules that are trained independently on different data sets. The method can use current open-source data sets, achieves good results on low-performance devices, and has a relatively fast generation speed. The main work of this thesis is as follows:

(1) This thesis designs a voice cloning algorithm composed of three modules: the encoder module converts the speaker's voice into a speaker embedding, extracting the voice features of the specified speaker; the synthesizer module converts the text and the speaker embedding output by the encoder into a mel-spectrogram; the vocoder module converts the mel-spectrogram into a waveform, generating high-quality, natural, and clear speech (a minimal sketch of this pipeline follows the abstract).

(2) The encoder module first preprocesses the speech, then uses speaker classification to preliminarily group the input speech, placing utterances from the same or similar speakers into one class before extracting speech features. This optimizes the encoder model and makes the extracted voice features more accurate.

(3) The synthesizer module controls explicit variables such as fundamental-frequency contours and pronunciation decisions to avoid entangling text and voice information, and it also controls latent variables such as the vector dictionary and the attention module between the text and the mel-spectrogram. Even though the data set contains no emotional training data, pitch and pronunciation decisions can be accurately controlled during training to generate more natural speech.

(4) The vocoder module uses an improved model based on WaveNet. It no longer focuses on modeling the spectral envelope; instead, digital signal processing handles the filter, so the neural network can concentrate on flattening the spectrum. In addition, a multi-band, multi-time strategy is adopted to greatly reduce complexity while preserving voice quality.

(5) The voice cloning algorithm of this thesis is applied in engineering practice with the design of a "virtual anchor voice cloning system". To demonstrate the reliability and practicality of the technique, a combination of subjective and objective methods is used to evaluate the generated speech. The comparison of the mel-spectrograms of the original voice and of the cloned voice generated by the algorithm, together with the testers' subjective scores, proves the rationality and effectiveness of the algorithm.
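The three-module pipeline in point (1) can be illustrated with a short code sketch. The following Python/PyTorch snippet is a minimal illustration under assumptions of my own: the module names, layer sizes, and character-level text encoding are placeholders, not the thesis implementation. A speaker encoder produces a fixed embedding from reference speech, a synthesizer maps text plus that embedding to a mel-spectrogram, and a vocoder converts the mel-spectrogram to a waveform.

# Minimal sketch of the encoder -> synthesizer -> vocoder pipeline.
# All module names, layer sizes, and the character-level text encoding
# are illustrative assumptions, not the thesis implementation.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a mel-spectrogram of reference speech to a fixed speaker embedding."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, num_layers=2, batch_first=True)

    def forward(self, ref_mels):                      # (B, T, n_mels)
        _, (h, _) = self.lstm(ref_mels)
        return nn.functional.normalize(h[-1], dim=-1) # (B, emb_dim)

class Synthesizer(nn.Module):
    """Maps text tokens plus a speaker embedding to a mel-spectrogram."""
    def __init__(self, vocab=256, emb_dim=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, emb_dim)
        self.decoder = nn.GRU(emb_dim * 2, 512, batch_first=True)
        self.to_mel = nn.Linear(512, n_mels)

    def forward(self, tokens, spk_emb):               # (B, L), (B, emb_dim)
        t = self.text_emb(tokens)                     # (B, L, emb_dim)
        s = spk_emb.unsqueeze(1).expand(-1, t.size(1), -1)
        out, _ = self.decoder(torch.cat([t, s], dim=-1))
        return self.to_mel(out)                       # (B, L, n_mels)

class Vocoder(nn.Module):
    """Maps a mel-spectrogram to a waveform (stand-in for the WaveNet-based vocoder)."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)        # hop audio samples per mel frame

    def forward(self, mels):                          # (B, T, n_mels)
        return torch.tanh(self.upsample(mels)).flatten(1)  # (B, T * hop)

# Cloning a voice: reference audio fixes the timbre, text fixes the content.
encoder, synth, vocoder = SpeakerEncoder(), Synthesizer(), Vocoder()
ref_mels = torch.randn(1, 120, 80)                    # reference speech features
tokens = torch.randint(0, 256, (1, 40))               # encoded target text
wav = vocoder(synth(tokens, encoder(ref_mels)))
print(wav.shape)                                      # torch.Size([1, 10240])

Because each module consumes only the other modules' inputs and outputs (speech features, embeddings, mel-spectrograms), the three parts can be trained independently on different data sets and combined afterwards, which is the property the joint-modeling design relies on.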
Keywords/Search Tags: deep learning, voice cloning, virtual anchor, joint model