
A Study On Neural Vocoders For Speech Synthesis

Posted on: 2022-01-24  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y Ai  Full Text: PDF
GTID: 1488306323464344  Subject: Signal and Information Processing
Abstract/Summary:
Speech synthesis converts text into natural, fluent speech waveforms. It is a key technology for intelligent voice interaction and has important application value. In recent years, statistical parametric speech synthesis (SPSS), which consists of a text analysis module, an acoustic model, and a vocoder, has become one of the mainstream approaches to speech synthesis. First, the text analysis module analyzes the input text to obtain linguistic features; next, the acoustic model predicts the acoustic features of the speech from the linguistic features; finally, the vocoder converts the acoustic features into the output speech waveform. The performance of the vocoder significantly affects the quality of the synthesized speech and has become one of the core issues in speech synthesis research.

Conventional vocoders based on the source-filter theory of speech production (such as STRAIGHT and WORLD) rely on linear filtering assumptions and ignore spectral details and phase information, which degrades the quality of the synthesized speech. With the development of deep learning and its growing application in signal processing, Google proposed WaveNet in 2016, a neural-network-based raw waveform generation model, and built a speech synthesis vocoder on it. Compared with conventional vocoders, the WaveNet-based neural vocoder significantly improved the quality of reconstructed speech, but it was still limited in modeling accuracy and computational efficiency. In recent years, improving neural vocoders has become a research hotspot in speech synthesis. This thesis therefore focuses on neural vocoders for speech synthesis and proposes improvements for several problems in their development, as follows.

First, to improve the waveform modeling accuracy of the WaveNet vocoder, a vocoder based on hierarchical recurrent neural networks is studied. Compared with the convolutional network structure used in WaveNet, waveform generation models based on hierarchical recurrent neural networks, represented by SampleRNN, offer a wide receptive field and convenient conditional input. This thesis therefore designs and implements a neural vocoder based on hierarchical recurrent neural networks, which improves the waveform prediction accuracy of the WaveNet vocoder and the naturalness of the reconstructed speech. Furthermore, a bandwidth extension (BWE) method based on hierarchical recurrent neural networks is proposed, which performs speech bandwidth extension with waveform samples as the direct prediction target. Compared with conventional BWE methods that depend on frequency-domain features, it improves the subjective quality of the generated speech.

Secondly, in view of the limited number of quantization bits in early neural vocoders, a spectral enhancement method for low-bit neural vocoders is proposed. The method combines the neural vocoder with deep-learning-based speech enhancement: by enhancing the amplitude spectra of the reconstructed speech, it reduces the influence of quantization noise in the output of the low-bit neural vocoder and improves the quality of the reconstructed speech.

Thirdly, to address the efficiency bottleneck of neural vocoders, a neural vocoder based on hierarchical prediction of amplitude and phase spectra is proposed. The method first predicts short-time speech spectra from the input acoustic features and then generates the reconstructed speech waveform based on the short-time Fourier transform (STFT). Through frame-level amplitude spectrum prediction and phase spectrum prediction based on lightweight waveform modeling, the method effectively eliminates the sample-level computation of conventional neural vocoders and significantly improves generation efficiency while preserving the quality of the generated speech waveforms.

Finally, for scenarios where the input acoustic features are affected by reverberation and noise, corresponding neural vocoder construction methods are studied. To improve the quality of speech reconstructed from reverberant acoustic features, a reverberation control method for neural vocoders is proposed. By adding a reverberation modeling module to the vocoder network structure, the method predicts the room impulse response (RIR) from the log amplitude spectra (LAS), improving the accuracy and subjective quality of the waveforms reconstructed from reverberant features. To predict a clean speech waveform from noisy and reverberant acoustic features, a denoising and dereverberation method for neural vocoders is proposed. It improves the subjective and objective quality of the reconstructed waveforms by designing a denoising and dereverberation amplitude spectrum predictor and introducing a frequency band extension model and a frequency resolution extension model.
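The quantization noise targeted by the spectral enhancement method above arises from representing each waveform sample with only a few bits, as early neural vocoders did. A minimal sketch of low-bit waveform quantization using standard μ-law companding (8 bits here; the abstract does not state the specific bit depths used in the thesis, so this is purely illustrative):

```python
import numpy as np

def mu_law_encode(x, bits=8):
    """Compand a waveform in [-1, 1] and map it to integer codes [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(codes, bits=8):
    """Invert the companding; the quantization noise is not recoverable."""
    mu = 2 ** bits - 1
    y = 2 * (codes.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

# Quantize a 220 Hz sine to 8 bits and reconstruct it.
t = np.linspace(0, 1, 16000, endpoint=False)
x = 0.8 * np.sin(2 * np.pi * 220 * t)
x_hat = mu_law_decode(mu_law_encode(x))

# Signal-to-noise ratio of the reconstruction, in dB.
snr = 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))
```

The residual `x - x_hat` is the quantization noise whose audible effect the proposed amplitude-spectrum enhancement is designed to suppress.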
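The frame-level vocoder described in the third contribution ultimately converts predicted amplitude and phase spectra back into a waveform. A minimal sketch of that resynthesis step using SciPy's STFT/iSTFT, with ground-truth spectra standing in for the two predictors' outputs (the window and frame parameters are assumptions for illustration, not the thesis's settings):

```python
import numpy as np
from scipy.signal import stft, istft

# A toy two-tone signal at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

# Decompose into amplitude and phase spectra -- the two quantities that the
# frame-level predictors would output in such a vocoder.
_, _, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
amplitude, phase = np.abs(Z), np.angle(Z)

# Recombine and invert; with the true spectra this is near-lossless, and the
# whole operation is frame-level rather than sample-by-sample autoregression.
_, x_hat = istft(amplitude * np.exp(1j * phase), fs=fs, nperseg=512, noverlap=384)
err = np.max(np.abs(x - x_hat[:len(x)]))
```

The efficiency gain claimed in the abstract comes from this structure: only the spectra are predicted per frame, and the inverse STFT replaces the per-sample generation loop of autoregressive vocoders.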
Keywords/Search Tags: vocoder, waveform generation, neural network, speech synthesis, deep learning