
Research On Deep Learning Based Singing Voice Synthesis

Posted on: 2021-01-15  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Yi  Full Text: PDF
GTID: 2428330602494317  Subject: Information and Communication Engineering
Abstract/Summary:
Singing voice synthesis (SVS) aims to convert lyrics and musical score information (such as rhythm and pitch) into songs. Statistical parametric synthesis is currently the main approach to SVS; it can synthesize smooth singing voices from a small amount of singing data. However, traditional acoustic models for SVS, such as hidden Markov models (HMMs), are still limited in acoustic modeling accuracy, resulting in low synthesis quality. In recent years, deep learning methods such as deep neural networks (DNNs) have been widely applied to acoustic modeling in statistical parametric speech synthesis and have improved modeling accuracy significantly. This dissertation therefore studies deep learning-based singing voice synthesis, investigating acoustic modeling methods based on recurrent neural networks, deep autoregressive models, and sequence-to-sequence models.

First, this dissertation studies singing voice synthesis based on recurrent neural networks. This method uses a recurrent structure to model the complex context dependencies in singing synthesis, improving on the traditional DNN model in predicting fundamental frequency, spectrum, and duration.

Second, this dissertation proposes an acoustic modeling method for singing voice synthesis based on deep autoregressive models. To better capture the dependencies between acoustic features in consecutive frames, this method predicts fundamental frequency trajectories and spectral features autoregressively, further improving on the accuracy of acoustic modeling with recurrent neural networks. It can generate dynamic fundamental frequency characteristics such as vibrato and enhances the naturalness of the synthetic singing voice.

Finally, this dissertation designs and implements a singing voice synthesis method based on a sequence-to-sequence model. Building on the mainstream Tacotron2 model, this method achieves sequence-to-sequence singing voice synthesis with controllable durations by introducing a duration embedding layer and expanding the input text according to the note durations. Furthermore, it introduces a bidirectional decoding mechanism that constrains forward decoding and backward decoding to be consistent, which strengthens duration control and speeds up the convergence of the model parameters. Experimental results show that this method achieves better subjective quality of the synthesized singing voice than the deep autoregressive model.
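The "expanding the input text according to the duration" step can be illustrated with a minimal sketch: each phoneme embedding is repeated for the number of acoustic frames its note duration covers, yielding a frame-level decoder input with externally controlled length. This is an illustrative reconstruction under stated assumptions (phoneme-level embeddings, durations given in frames), not the thesis's actual implementation; all names here are hypothetical.

```python
import numpy as np

def expand_by_duration(phoneme_embeddings, durations):
    """Repeat each phoneme embedding for its duration (in frames),
    producing a frame-level input sequence of known total length."""
    # phoneme_embeddings: (num_phonemes, dim); durations: (num_phonemes,)
    return np.repeat(phoneme_embeddings, durations, axis=0)

# Toy example: 3 phonemes with 4-dim embeddings; durations from the score.
emb = np.arange(12, dtype=float).reshape(3, 4)
dur = np.array([2, 3, 1])
frames = expand_by_duration(emb, dur)
print(frames.shape)  # (6, 4): total frames = 2 + 3 + 1
```

Because the expanded length is fixed by the score durations rather than learned by an attention mechanism, the synthesized singing stays aligned with the music score, which is the duration controllability the abstract describes.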
Keywords/Search Tags: singing voice synthesis, parametric synthesis, recurrent neural network, autoregressive model, sequence-to-sequence neural network