
Research On Voice Clone Technology Based On Deep Learning

Posted on: 2022-03-19 | Degree: Master | Type: Thesis
Country: China | Candidate: Y X Zhang | Full Text: PDF
GTID: 2518306326494854 | Subject: Master of Engineering
Abstract/Summary:
Speech is the main way people communicate. With the increasing use of computers and intelligent electronic products, human-computer interaction has become an inevitable need, and voice is a convenient means of achieving it. Realizing human-computer interaction through speech involves two processes: first, smart electronic devices "understand" the voice information produced by humans, that is, speech recognition; second, smart electronic devices convert text into voice and "speak", that is, speech synthesis. Generally, during human-computer interaction, people only need to hear the voices emitted by smart electronic devices. However, with the rapid development of speech synthesis technology based on deep learning, customized and personalized speech, that is, voice cloning, has gradually become a demand. Voice cloning is a technology that converts text into a specific person's voice, and the naturalness and similarity of the cloned voice are the criteria for evaluating its quality. The voice cloning system based on speaker verification realizes voice cloning for a specific person. Because the similarity of the cloned voice in this system is not high, its naturalness is insufficient, and its training speed is slow, this thesis carries out the following work:

1. The voice cloning system based on speaker verification is composed of a speaker encoder network, a synthesizer network, and a vocoder network. The speaker encoder network uses the d-vector speaker embedding method to extract speaker information; the synthesizer network uses the sequence-to-sequence Tacotron2 architecture to convert text into a mel spectrogram; the vocoder network uses an improved WaveRNN architecture to convert the mel spectrogram into a speech waveform. The experimental results indicate that the naturalness and similarity of the cloned voice still leave room for improvement.

2. The voice cloning system based on speaker verification adopts the speaker encoding features described by the d-vector. The d-vector does not consider the correlation between adjacent speech frames, so its ability to characterize the speaker is insufficient, which limits the similarity of the cloned speech. To address this issue, this thesis proposes a voice cloning method based on x-vector speaker features. The x-vector extracts the speaker embedding with a time-delay neural network, takes the speech information of the whole utterance into account, and can represent the speaker's characteristics more accurately. The experimental results indicate that, in terms of embedding-vector similarity, the x-vector method yields a lower similarity value than the d-vector method for different speakers and a higher similarity value for the same speaker; in terms of the naturalness and similarity of the cloned speech, the x-vector method improves the naturalness of the final cloned voice by 0.32 and the similarity by 0.14 compared with the d-vector method.
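The x-vector features described above come from a time-delay neural network (TDNN) with a statistics-pooling layer that summarizes the whole utterance. The PyTorch code below is only an illustrative sketch of that idea, not the implementation used in the thesis; the layer sizes, the `n_mels` input dimension, and the class name `XVectorSketch` are assumptions.

```python
# Illustrative sketch (not the thesis implementation): an x-vector-style
# speaker-embedding extractor built from TDNN (dilated 1-D convolution)
# layers plus statistics pooling over the whole utterance.
import torch
import torch.nn as nn


class XVectorSketch(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 512):
        super().__init__()
        # Frame-level TDNN layers: 1-D convolutions with increasing dilation
        # so each frame's representation sees a wider temporal context.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level projection applied after pooling over time.
        self.segment = nn.Linear(2 * 1500, embed_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> frame-level features (batch, 1500, frames')
        h = self.frame_layers(mel)
        # Statistics pooling: mean and std over time summarize the utterance,
        # which is what lets the embedding use whole-sentence information.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.segment(stats)  # (batch, embed_dim) speaker embedding


if __name__ == "__main__":
    model = XVectorSketch()
    mel = torch.randn(2, 80, 200)   # two utterances of 200 mel frames each
    emb = model(mel)
    print(emb.shape)                # torch.Size([2, 512])
```

Cosine similarity between two such embeddings is the kind of score the comparison above refers to: it should be high for two utterances of the same speaker and low for utterances of different speakers.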
3. In the experiments on the speaker-verification-based voice cloning system, the vocoder adopts an improved WaveRNN architecture. WaveRNN is an autoregressive model that is difficult to train in parallel, so its training speed is slow. To address this issue, this thesis proposes using the HiFi-GAN architecture as the vocoder. HiFi-GAN is an audio generation model based on a generative adversarial network that can quickly convert a mel spectrogram into high-quality speech. The experimental results indicate that the naturalness of the cloned voice is improved by 0.37; combined with the x-vector method, the naturalness increases by a further 0.06 on this basis. In terms of cloning speed, it is ten times faster than the WaveRNN vocoder.
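HiFi-GAN replaces autoregressive sample-by-sample generation with a feed-forward generator, which is why the whole waveform can be produced in one pass and the cloning speed improves. The PyTorch sketch below only illustrates that generator shape (transposed-convolution upsampling plus dilated residual blocks); the channel counts, upsample factors, and the names `ResBlock` and `MelToWaveSketch` are assumptions, and the real HiFi-GAN also uses multi-receptive-field fusion and GAN discriminators that are omitted here.

```python
# Illustrative sketch (not the thesis implementation): a stripped-down
# HiFi-GAN-style generator that upsamples a mel spectrogram to a waveform.
# The upsample factors 8 * 8 * 2 * 2 = 256 assume 256 samples per mel frame.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    """A small stack of dilated convolutions with residual connections."""
    def __init__(self, channels: int):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, dilation=d, padding=d)
            for d in (1, 3, 5)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))
        return x


class MelToWaveSketch(nn.Module):
    def __init__(self, n_mels: int = 80, base_channels: int = 256):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, base_channels, kernel_size=7, padding=3)
        ups, blocks, ch = [], [], base_channels
        for factor in (8, 8, 2, 2):
            # Each transposed convolution stretches the time axis by `factor`.
            ups.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * factor,
                                          stride=factor, padding=factor // 2))
            ch //= 2
            blocks.append(ResBlock(ch))
        self.ups = nn.ModuleList(ups)
        self.blocks = nn.ModuleList(blocks)
        self.post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> waveform (batch, 1, frames * 256)
        x = self.pre(mel)
        for up, block in zip(self.ups, self.blocks):
            x = block(up(F.leaky_relu(x, 0.1)))
        return torch.tanh(self.post(x))


if __name__ == "__main__":
    gen = MelToWaveSketch()
    mel = torch.randn(1, 80, 100)   # 100 mel frames
    wav = gen(mel)                  # one feed-forward pass, no sample loop
    print(wav.shape)                # torch.Size([1, 1, 25600])
```

Because generation is a single feed-forward pass rather than one sample at a time, inference parallelizes easily, which is consistent with the reported speed advantage over the WaveRNN vocoder.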
Keywords/Search Tags:deep learning, voice cloning, speaker feature extraction, speaker embedding, x-vector, HiFi-GAN