A Voice Conversion Algorithm Based On Phonetic Posteriorgrams And Nonlinear Masking Post-processing

Posted on: 2022-10-01    Degree: Master    Type: Thesis
Country: China    Candidate: J T Zhang    Full Text: PDF
GTID: 2518306569475904    Subject: Software engineering
Abstract/Summary:
Voice conversion can generally be viewed as a speech synthesis problem: the source voice is converted into another form by changing some of its characteristics. There are two mainstream approaches to deep-learning-based voice conversion. The first is completely end-to-end, while the second integrates automatic speech recognition (ASR) and text-to-speech (TTS) models. The biggest advantage of the end-to-end model is that the mapping from source to target acoustic features can be constructed directly, so developers do not need to care about the details of the conversion. However, its disadvantage is also obvious: it relies on a large amount of training data, and it is difficult to decouple the modules of the model. ASR- and TTS-based approaches reduce voice conversion to a set of standardized procedures with distinct modules: source speech analysis, feature mapping, and target speech reconstruction. This method has two shortcomings. On the one hand, it relies on a TTS model, and the TTS model is prone to "repeated reading" problems in the converted speech because of its alignment method. On the other hand, due to the lack of standard speech-enhancement post-processing, the synthesized speech may contain noise or semantic distortion.

The method adopted in this paper belongs to the latter category and is improved to solve the two major problems mentioned above. By combining the ASR model and the TTS model, this paper proposes TDLSTM, which is based on phonetic posteriorgrams (PPGs), and Tacotron-NMLs, which is based on nonlinear masking post-processing. TDLSTM and Tacotron-NMLs are the concrete implementations of the PPG model and the TTS model, respectively.

This paper makes three main contributions. The first is to generate phonetic posteriorgrams that carry time information by exploiting the powerful "first packet" utilization and time-series modeling abilities of TDLSTM. The second is to use the time information provided by the phonetic posteriorgrams to help the TTS model find the acoustic features that should be assigned to each pronunciation state more effectively, thereby eliminating the "repeated reading" semantic errors in the converted speech. The third is to introduce nonlinear masking layers (NMLs) into voice conversion and propose Tacotron-NMLs; its speech enhancement and separation technique removes the noise and semantic errors in the converted speech and enhances the formant information in the spectrograms. Finally, a more natural and fluent time-domain monophonic signal is restored by an efficient neural vocoder, WaveRNN.

Based on the proposed models, this paper constructs a voice conversion pipeline of the form "TDLSTM + Tacotron-NMLs + WaveRNN" and carries out experiments on it. The in-set test shows that the converted speech produced by this pipeline has the lowest error against the reference speech, while the out-of-set test shows that the same pipeline achieves better naturalness and fluency in the subjective listening experience of the experimental listeners. Both gains come from the positive contributions of the phonetic posteriorgrams and the nonlinear masking post-processing.
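The "TDLSTM + Tacotron-NMLs + WaveRNN" chain can be pictured as three stages applied in sequence. The sketch below is a minimal illustration under that assumption; the function and module names are hypothetical stand-ins rather than the thesis code, shown only to indicate how PPGs connect the ASR front end to the synthesizer and vocoder.

```python
# Hypothetical three-stage conversion chain (illustrative only).
import torch

def convert(source_mel: torch.Tensor,
            ppg_model: torch.nn.Module,    # TDLSTM-style ASR front end (assumed interface)
            synthesizer: torch.nn.Module,  # Tacotron-style synthesizer with NMLs (assumed interface)
            vocoder: torch.nn.Module       # WaveRNN-style neural vocoder (assumed interface)
            ) -> torch.Tensor:
    """Map a source speaker's mel-spectrogram to a target-speaker waveform."""
    with torch.no_grad():
        ppgs = ppg_model(source_mel)    # frame-level phonetic posteriorgrams with time information
        target_mel = synthesizer(ppgs)  # target-speaker acoustic features, mask-enhanced
        waveform = vocoder(target_mel)  # time-domain monophonic signal reconstruction
    return waveform
```

Within that chain, the Tacotron-NMLs stage applies nonlinear masking to the predicted spectrogram. The following is a minimal sketch of one such masking layer, assuming a sigmoid-gated element-wise mask over mel bins; the class name, layer sizes, and gating choice are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical nonlinear masking layer for spectrogram post-processing (illustrative only).
import torch
import torch.nn as nn

class NonlinearMaskingLayer(nn.Module):
    """Predicts a bounded mask and applies it to the input mel-spectrogram,
    attenuating noisy bins while preserving formant regions."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Linear(n_mels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_mels),
            nn.Sigmoid(),          # mask values in (0, 1)
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels)
        mask = self.mask_net(mel)
        return mel * mask          # element-wise nonlinear masking
```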
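In practice, such a masking layer would sit after the synthesizer's spectrogram decoder and before the vocoder; a usage sketch under the same assumptions is `enhanced = NonlinearMaskingLayer()(target_mel)` followed by `vocoder(enhanced)`.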
Keywords/Search Tags:Voice Conversion, TDLSTM, Nonlinear Masking Layers, Tacotron-NMLs, WaveRNN