
Research On Deep Learning Based Singing Voice Synthesis

Posted on: 2021-01-15  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Yi  Full Text: PDF
GTID: 2428330602494317  Subject: Information and Communication Engineering
Abstract/Summary:
Singing voice synthesis (SVS) aims to convert lyrics and musical score information (such as rhythm and pitch) into songs. Statistical parametric synthesis is currently the main approach to SVS; it can synthesize smooth singing voices from a small amount of singing data. However, traditional acoustic models for SVS, such as hidden Markov models (HMMs), are still limited in acoustic modeling accuracy, resulting in low synthesis quality. In recent years, deep learning methods such as deep neural networks (DNNs) have been widely applied to acoustic modeling in statistical parametric speech synthesis and have improved modeling accuracy significantly. This dissertation therefore studies deep learning-based singing voice synthesis, investigating acoustic modeling methods based on recurrent neural networks, deep autoregressive models, and sequence-to-sequence models.

First, this dissertation studies singing voice synthesis based on recurrent neural networks. This method uses a recurrent structure to model the complex context dependencies in singing synthesis, improving on the traditional DNN model in predicting fundamental frequency, spectrum, and duration.

Second, this dissertation proposes an acoustic modeling method for singing voice synthesis based on deep autoregressive models. To better capture the dependencies between acoustic features in consecutive frames, this method predicts fundamental frequency trajectories and spectral features autoregressively, further improving on the accuracy of acoustic modeling with recurrent neural networks. It can generate dynamic fundamental frequency characteristics such as vibrato and enhances the naturalness of the synthetic singing voice.

Finally, this dissertation designs and implements a singing voice synthesis method based on a sequence-to-sequence model. Building on the mainstream Tacotron2 model, this method achieves sequence-to-sequence singing voice synthesis with controllable durations by introducing a duration embedding layer and expanding the input text according to the note durations. Furthermore, it introduces a bidirectional decoding mechanism that constrains forward decoding and backward decoding to be consistent, which strengthens duration control and speeds up the convergence of the model parameters. Experimental results show that this method achieves better subjective quality of the synthesized singing voice than the deep autoregressive model.
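The "expanding the input text according to the duration" step can be illustrated with a minimal sketch: each phoneme embedding is repeated for the number of acoustic frames its note duration covers, yielding a frame-level decoder input with externally controlled length. This is an illustrative reconstruction under stated assumptions (phoneme-level embeddings, durations given in frames), not the thesis's actual implementation; all names here are hypothetical.

```python
import numpy as np

def expand_by_duration(phoneme_embeddings, durations):
    """Repeat each phoneme embedding for its duration (in frames),
    producing a frame-level input sequence of known total length."""
    # phoneme_embeddings: (num_phonemes, dim); durations: (num_phonemes,)
    return np.repeat(phoneme_embeddings, durations, axis=0)

# Toy example: 3 phonemes with 4-dim embeddings; durations from the score.
emb = np.arange(12, dtype=float).reshape(3, 4)
dur = np.array([2, 3, 1])
frames = expand_by_duration(emb, dur)
print(frames.shape)  # (6, 4): total frames = 2 + 3 + 1
```

Because the expanded length is fixed by the score durations rather than learned by an attention mechanism, the synthesized singing stays aligned with the music score, which is the duration controllability the abstract describes.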
Keywords/Search Tags: singing voice synthesis, parametric synthesis, recurrent neural network, autoregressive model, sequence-to-sequence neural network