
Research On Speech Synthesis Vocoders Using Convolutional Neural Networks

Posted on: 2019-06-12   Degree: Master   Type: Thesis
Country: China   Candidate: H C Wu   Full Text: PDF
GTID: 2428330542994090   Subject: Signal and Information Processing
Abstract/Summary:
The speech synthesis vocoder is an indispensable component of a statistical parametric speech synthesis system: it reconstructs the speech waveform from acoustic features such as the fundamental frequency and spectrum. In recent years, deep neural network acoustic models have effectively improved acoustic modeling accuracy and the naturalness of synthetic speech. However, vocoders based on the traditional source-filter structure, represented by STRAIGHT, still suffer from loss of spectral detail, hand-designed phase, and a linear filtering framework, which restricts further improvement of synthetic speech quality in statistical parametric speech synthesis.

In 2016, DeepMind researchers proposed causal dilated convolutional neural networks to model and generate speech waveforms directly. They predicted the speech waveform from text features with this new method and obtained better naturalness than traditional statistical parametric methods. Direct waveform modeling with convolutional neural networks compensates for the missing spectral detail and phase information, and deep neural networks offer flexible nonlinear processing capability. This provides a new way to build a synthesis vocoder.

This dissertation focuses on vocoders based on convolutional neural networks and covers three research points. First, it designs and implements a speaker-dependent speech synthesis vocoder based on convolutional neural networks. Second, it proposes a speaker-independent and adaptive training method for the neural vocoder, so that a high-quality vocoder can be trained with limited target-speaker data. Third, it proposes a multi-resolution hierarchical network structure to improve vocoder efficiency. The dissertation is organized as follows.

Chapter 1 is the introduction. It reviews speech synthesis research and introduces the mainstream approaches, including unit-selection waveform concatenation and statistical parametric synthesis. It then reviews popular speech synthesis vocoders and analyses their strengths and weaknesses.

Chapter 2 first introduces WaveNet, proposed by DeepMind researchers, and states our motivation for modeling the speech waveform with convolutional neural networks. It then introduces the proposed speech synthesis vocoder based on convolutional neural networks. The vocoder builds an upsampling network to match the sampling rate of the acoustic features to that of the output audio; the acoustic information is then added to the activation functions of the network to guide waveform generation.

Chapter 3 reviews the history of speaker adaptation technology and introduces the methods used in speech recognition and speech synthesis. It then introduces the speaker-independent and adaptive training method proposed in this dissertation. Finally, both natural and predicted acoustic features are used to reconstruct speech waveforms in experiments that demonstrate the effectiveness of adaptive training.

Chapter 4 first analyses the slow speech generation of neural network based vocoders. Second, bandwidth extension based on dilated convolutional neural networks is introduced. Third, a multi-resolution hierarchical generation network is proposed. Finally, the efficiency improvement of the model and the quality of the synthetic speech are evaluated.

Chapter 5 concludes the dissertation.
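The conditioning mechanism described for Chapter 2 (upsample the acoustic features to the audio sampling rate, then inject them into the gated activations of a causal dilated convolution) can be sketched as follows. This is a minimal toy illustration, not the dissertation's actual network: the weight values, the 80-samples-per-frame upsampling ratio, and the scalar conditioning projections are all assumptions made for the sake of a runnable example.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal dilated convolution with kernel size 2: each output
    sample depends only on x[t] and x[t - dilation], never on the future."""
    pad = np.concatenate([np.zeros(dilation), x])  # left-pad so no future leakage
    return w[0] * pad[:-dilation] + w[1] * pad[dilation:]

def gated_unit(x, h, wf, wg, vf, vg, dilation):
    """WaveNet-style conditional gated activation:
    z = tanh(Wf*x + Vf*h) * sigmoid(Wg*x + Vg*h),
    where h is the acoustic-feature conditioning signal at audio rate."""
    f = causal_dilated_conv(x, wf, dilation) + vf * h  # filter branch
    g = causal_dilated_conv(x, wg, dilation) + vg * h  # gate branch
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))

rng = np.random.default_rng(0)
frames = rng.standard_normal(5)    # toy acoustic feature track, one value per frame
h = np.repeat(frames, 80)          # upsample frame rate -> audio rate (80 samples/frame, assumed)
x = rng.standard_normal(400)       # toy waveform input at audio rate

wf = np.array([0.5, 0.3])          # toy filter weights
wg = np.array([-0.2, 0.7])         # toy gate weights
z = gated_unit(x, h, wf, wg, 0.1, 0.05, dilation=4)
```

Here `np.repeat` stands in for the learned upsampling network, and the gated unit shows how the upsampled acoustic features `h` bias both the filter and gate branches, steering the nonlinear waveform generation without breaking causality.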
Keywords/Search Tags:speech synthesis vocoder, convolutional neural network, speaker-independent model, adaptive training, multi-resolution hierarchical network