
A Study On Representation Learning Based Acoustic Modeling For Speech Synthesis

Posted on: 2022-10-17    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y J Zhang    Full Text: PDF
GTID: 1488306323964349    Subject: Information and Communication Engineering
Abstract/Summary:
Speech synthesis is an important and necessary technology for realizing human-machine communication, and generating highly intelligible and natural speech is the goal of speech synthesis at the current stage. Statistical parametric speech synthesis has the advantages of high automation and flexibility, and is currently one of the popular approaches to building speech synthesis systems. A traditional statistical parametric system usually consists of a front-end text analyzer, an acoustic feature prediction model, a duration model, and a vocoder. These modules are trained separately, which causes feature mismatch and error accumulation. In recent years, speech synthesis acoustic models based on sequence-to-sequence neural networks have integrated the acoustic model and the duration model into a single model for joint training, which not only reduces the difficulty of acoustic modeling but also improves the naturalness of synthesized speech.

Traditional sequence-to-sequence acoustic models are trained on <text, acoustic feature> pairs. The text features are usually the character or phoneme sequence of the current sentence together with prosodic annotations, and the acoustic features are usually manually designed features such as mel-spectrograms, cepstra, and fundamental frequency. These features still have shortcomings: the text features do not consider contextual or semantic information, and the acoustic features lack descriptions of high-level prosodic variation, which limits the naturalness of synthesized speech and makes prosody control difficult. On the other hand, neural-network-based representation learning methods have received widespread attention in recent years; by learning the underlying structure of the data, they transform raw data into representations that can be used effectively by machine learning methods. Therefore, this thesis focuses on representation-learning-based acoustic modeling for speech synthesis: it introduces representation learning into sequence-to-sequence acoustic models and improves the naturalness and controllability of synthesized speech by extracting and utilizing richer text and acoustic representations. The research contents of this thesis are as follows.

First, the thesis studies a style transfer and control method for speech synthesis based on a variational autoencoder. The style tokens of the global style token model are not disentangled, and the continuity of the token weight space is not guaranteed. Therefore, this thesis proposes a variational-autoencoder-based style transfer and control method for speech synthesis. The model learns an acoustic representation of speaking style in an unsupervised manner, and the variational autoencoder yields a continuous latent space with disentangled features. Style transfer and control of synthesized speech are then realized by flexibly manipulating the latent variable.
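As a rough illustration of the first contribution, the sketch below shows how a variational-autoencoder style encoder can be attached to a sequence-to-sequence acoustic model. This is a minimal sketch, not the thesis implementation: the module names, layer sizes, and latent dimension are assumptions made for the example.

```python
import torch
import torch.nn as nn

class StyleVAEEncoder(nn.Module):
    """Reference encoder mapping a mel-spectrogram to a Gaussian posterior
    over a low-dimensional style latent (sizes are illustrative)."""
    def __init__(self, n_mels=80, hidden=256, latent_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, mel):                       # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)                      # final hidden state summarizes the utterance
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return z, mu, logvar

def kl_loss(mu, logvar):
    """KL divergence between the posterior N(mu, sigma^2) and the prior N(0, I)."""
    return -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())

# Training: z is broadcast over time and combined with the text-encoder outputs
# before decoding; the total loss is reconstruction loss + beta * kl_loss.
```

At synthesis time, the latent variable can be copied from a reference utterance for style transfer, or its dimensions can be adjusted directly for style control; both rely on the continuous, disentangled latent space that the variational autoencoder provides.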
Second, the thesis studies an acoustic modeling method for speech synthesis that introduces a pre-trained language model and paragraph-level text representations. Traditional acoustic models usually take phoneme sequences and prosodic annotations as text input, which does not make full use of the target sentence or of the contextual semantic information surrounding it. Therefore, this thesis proposes a method that extracts deep and wide contextual representations with a pre-trained language model and combines them with a sequence-to-sequence acoustic model. The proposed method improves the naturalness of synthesized speech.

Finally, the thesis studies an acoustic modeling method for speech synthesis based on fine-grained acoustic latent variables. Most existing acoustic models built on acoustic latent variables learn only sentence-level prosodic representations and lack the ability to predict fine-grained prosody. Therefore, this thesis proposes an acoustic modeling method based on fine-grained acoustic latent variables: it introduces fine-grained discrete latent variables to describe word-level acoustic variation, and constructs an acoustic latent representation extractor based on a sequence-to-sequence acoustic model together with an acoustic latent representation predictor built on a pre-trained language model. Furthermore, an adversarial learning method for disentangling acoustic and text representations is explored. The proposed method improves the naturalness of synthesized speech.
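As a rough illustration of the third contribution, the sketch below shows one common way to realize word-level discrete acoustic latent variables together with a text-based predictor and an adversarial disentanglement term. It is a minimal sketch under explicit assumptions: a VQ-VAE-style codebook with a straight-through estimator stands in for the discrete latents, generic pre-trained language model word features (e.g., 768-dimensional BERT-style vectors) feed the predictor, and the adversarial part is reduced to a text classifier that the extractor learns to fool; none of these names or sizes are taken from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelVQ(nn.Module):
    """Quantizes word-level acoustic vectors against a discrete codebook
    (VQ-VAE-style straight-through estimator; sizes are illustrative)."""
    def __init__(self, dim=64, codebook_size=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h):                              # h: (batch, n_words, dim)
        dist = (h.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dist.argmin(-1)                          # nearest code per word
        q = self.codebook(idx)
        q_st = h + (q - h).detach()                    # straight-through gradients
        vq_loss = F.mse_loss(q, h.detach()) + 0.25 * F.mse_loss(h, q.detach())
        return q_st, idx, vq_loss

class LatentPredictor(nn.Module):
    """Predicts the word-level code indices from pre-trained language model
    word features, so no reference audio is needed at synthesis time."""
    def __init__(self, lm_dim=768, codebook_size=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(lm_dim, 256), nn.ReLU(),
                                 nn.Linear(256, codebook_size))

    def forward(self, lm_feats):                       # (batch, n_words, lm_dim)
        return self.net(lm_feats)                      # logits over codebook entries

class TextDiscriminator(nn.Module):
    """Adversarial classifier that tries to recover word identity from the
    acoustic latents; training the extractor to fool it pushes text content
    out of the latents so they mainly carry prosodic variation."""
    def __init__(self, dim=64, vocab_size=1000):
        super().__init__()
        self.fc = nn.Linear(dim, vocab_size)

    def forward(self, q):                              # (batch, n_words, dim)
        return self.fc(q)
```

At synthesis time the predictor selects a code for each word from the text alone, and the chosen code embeddings condition the sequence-to-sequence decoder in place of latents extracted from reference audio.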
Keywords/Search Tags:speech synthesis, neural network, representation learning, variational autoencoder, pre-trained language model, adversarial training