
Research On Neural Network Based Statistical Parametric Speech Synthesis

Posted on: 2019-04-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Hu
Full Text: PDF
GTID: 1318330542497982
Subject: Information and Communication Engineering
Abstract/Summary:
The objective of speech synthesis is to convert input text into fluent, high-quality speech. Statistical parametric speech synthesis (SPSS) has become a state-of-the-art speech synthesis method because of its many advantages, including automatic and fast system construction, a small footprint, and high flexibility. Traditional hidden Markov model (HMM) based SPSS can generate continuous, stable, and fluent speech with high intelligibility. However, the synthesized spectrum is generally over-smoothed, which degrades speech quality.

In recent years, deep learning has emerged as a new area of machine learning and has grown rapidly. Deep learning is a statistical modeling approach that uses many layers of artificial neural networks, and it has shown strong advantages over conventional methods on various tasks, including image recognition, computer vision, natural language processing, and automatic speech recognition. In speech synthesis, deep learning has been successfully applied to acoustic modeling, spectral representation, post-filtering, and waveform modeling, and has become the most popular family of methods for SPSS.

This thesis focuses on deep learning methods for SPSS, investigating two aspects: spectral representation and acoustic modeling. For spectral representation, deep learning models are introduced for representation extraction, and three neural-network-based spectral representations are proposed: a deep belief network (DBN) based representation, a convolutional neural network (CNN) based representation, and a representation based on a deep auto-encoder with binary distributed hidden units (BDAE). For acoustic modeling, a generative adversarial network (GAN) based method is proposed. The details are as follows.

First, to address the drawbacks of conventional spectral representations, where there is no nonlinear processing during mel-cepstrum extraction from spectral envelopes and the synthesized spectrum is over-smoothed, this thesis proposes a DBN based spectral representation for SPSS. An unsupervisedly trained DBN is adopted to model natural spectral envelopes, and samples from the top hidden layer of the DBN are used as the spectral representation for subsequent acoustic modeling. Experimental results show that the over-smoothing of the synthesized speech is alleviated.

Second, because conventional spectral representation extraction gives little consideration to formants and other local structures in spectral envelopes, this thesis proposes a CNN based spectral representation extraction method for SPSS. Exploiting the strong ability of CNNs to detect and extract local structures, a CNN based auto-encoder is used to extract the prominence and position representations of the peaks and valleys in spectral envelopes, which are then modeled separately during acoustic modeling. This method generates better local structures in spectral envelopes and improves the quality of synthetic speech.

Third, in conventional methods the spectral representation extraction is independent of the acoustic modeling. To address this problem, this thesis proposes a BDAE based spectral representation that takes acoustic modeling into account during representation extraction. By restricting the hidden units of the deep auto-encoder to be binary, the method reduces the effect of acoustic modeling errors on spectral envelope reconstruction and thus relieves the over-smoothing of synthetic speech. Experiments on various datasets demonstrate that this method significantly improves the quality of synthetic speech.

Fourth, a GAN based acoustic modeling method is proposed to overcome the over-smoothing caused by conventional statistical acoustic modeling under the maximum likelihood and minimum mean square error criteria. The method adopts a GAN to model the spectral envelope conditioned on the input text and the low-order mel-cepstra, and generates more stable synthetic speech with higher quality than conventional methods.
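To make the second contribution concrete: the CNN based representation is built around the positions and prominences of peaks and valleys in a spectral envelope. The following is a minimal NumPy sketch of what "peaks and valleys of an envelope" means on a toy log-envelope with three formant-like bumps; it is purely illustrative and is not the thesis's CNN auto-encoder, whose learned representations are far richer.

```python
import numpy as np

# Illustrative sketch only: the thesis uses a CNN auto-encoder to learn
# peak/valley representations; here we just locate the local extrema of
# a hand-built toy log-spectral envelope to show what is being encoded.

def local_extrema(env):
    """Return indices of local peaks and valleys of a 1-D envelope."""
    d = np.diff(env)
    peaks = [i for i in range(1, len(env) - 1) if d[i - 1] > 0 and d[i] < 0]
    valleys = [i for i in range(1, len(env) - 1) if d[i - 1] < 0 and d[i] > 0]
    return peaks, valleys

# Toy envelope: three "formant-like" Gaussian bumps on a declining slope.
freq = np.linspace(0, 1, 200)
env = (np.exp(-((freq - 0.1) / 0.03) ** 2)
       + 0.8 * np.exp(-((freq - 0.35) / 0.04) ** 2)
       + 0.6 * np.exp(-((freq - 0.7) / 0.05) ** 2)
       - 0.5 * freq)

peaks, valleys = local_extrema(env)
print("peak bins:", peaks)      # positions of the formant-like bumps
print("valley bins:", valleys)  # dips around and between the bumps
```

A learned representation additionally captures the prominence (depth relative to neighboring valleys) of each peak, which is what the separate acoustic modeling streams in the thesis operate on.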
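The third contribution rests on forcing the auto-encoder's hidden code to be binary. Below is a minimal NumPy sketch of that idea at inference time; all weights are random stand-ins for a trained model, and the layer sizes and function names are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

# Sketch of the BDAE idea: an auto-encoder whose hidden code is binary.
# Weights are random stand-ins; in the thesis the encoder/decoder are
# trained on natural spectral envelopes.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions: a 513-bin spectral envelope compressed to a 64-bit code.
n_spec, n_code = 513, 64
W_enc = rng.normal(0, 0.1, (n_spec, n_code))
b_enc = np.zeros(n_code)
W_dec = rng.normal(0, 0.1, (n_code, n_spec))
b_dec = np.zeros(n_spec)

def encode(envelope):
    """Map a spectral envelope to a binary hidden code (threshold at 0.5)."""
    probs = sigmoid(envelope @ W_enc + b_enc)
    return (probs >= 0.5).astype(np.float64)

def decode(code):
    """Reconstruct a spectral envelope from the binary code."""
    return code @ W_dec + b_dec

envelope = rng.normal(0, 1, n_spec)
code = encode(envelope)
recon = decode(code)

# Because the code is binary, a small acoustic-modeling error on a code
# unit either flips the bit or leaves it unchanged -- it cannot leak into
# the reconstruction as a continuous bias, which is the intuition behind
# the reduced over-smoothing reported in the thesis.
print(code.shape, recon.shape)
```

The design point is the thresholding step: a continuous-code auto-encoder would pass every small prediction error straight through the decoder, whereas the binary bottleneck quantizes it away.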
Keywords/Search Tags: Speech Synthesis, Neural Networks, Deep Learning, Deep Belief Network, Deep Auto-Encoder, Convolutional Neural Network, Generative Adversarial Network