
Research On Deep Learning Based End-to-End Chinese Speech Synthesis

Posted on: 2022-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: T He
Full Text: PDF
GTID: 2518306494450904
Subject: Electrical engineering

Abstract/Summary:
With the continuing spread of smart terminal devices such as smartphones and smart speakers, the application scenarios for human-computer interaction in daily life keep expanding. As one of the core technologies of human-machine voice interaction, speech synthesis is widely used in fields such as smart homes, public transportation, and cross-lingual communication, and plays an increasingly important role. This thesis studies deep-learning-based end-to-end Chinese speech synthesis and optimizes both the runtime performance of the synthesis system and the quality of the synthesized audio.

Aiming at a Chinese speech synthesis system with high real-time performance and good speech quality, the thesis first proposes a system based on autoregressive acoustic modeling, built on an RNN sequence-to-sequence architecture. It introduces a local attention mechanism that emphasizes the relevance of nearby context, adds a new DCBHG module to strengthen acoustic-feature prediction, and combines the acoustic model with an LPCNet vocoder to form a complete Chinese speech synthesis system. With comparable system performance, the proposed model reduces mel cepstral distortion (MCD) by 0.73 relative to the original model, raises the mean opinion score (MOS) for speech naturalness by 0.36 over the reference model, and narrows the MOS gap between the reference model and the original recordings by 72.0%, demonstrating improved synthesized speech.

To address the stability problems that error accumulation in typical autoregressive models can cause, the thesis then proposes a non-autoregressive acoustic modeling method based on the feedforward sequential memory network (FSMN). Unlike acoustic models built on the self-attention mechanism, this method combines FSMN modules into a new network architecture that focuses on local features while reducing the number of model parameters. It further adopts continuous interpolation of the fundamental frequency and a fundamental-frequency distillation strategy, with or without fundamental-frequency separation, for acoustic modeling. With a parameter count of only 59.2% of Tacotron2 and FastSpeech2, the proposed FSMN-based method improves Chinese speech synthesis, achieving better MCD and finer spectral detail.
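The abstract does not give the thesis's exact network configuration, but the memory block at the heart of an FSMN can be sketched to show why it captures local context with few parameters: each output frame is a learned weighted sum of a fixed window of past and future hidden states, rather than an attention score over the whole sequence. The function name, tap shapes, and zero-padding at sequence boundaries below are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def fsmn_memory(h, a, c):
    """Scalar-FSMN memory block (illustrative sketch).

    For each frame t:
        m[t] = sum_{i=0..N1} a[i] * h[t-i] + sum_{j=1..N2} c[j-1] * h[t+j]

    h : (T, D) hidden states over T frames
    a : (N1+1, D) lookback taps; a[0] weights the current frame
    c : (N2, D) lookahead taps
    Frames outside [0, T) are treated as zeros.
    """
    T, _ = h.shape
    n1, n2 = a.shape[0] - 1, c.shape[0]
    m = np.zeros_like(h)
    for t in range(T):
        # weighted sum over the current frame and N1 past frames
        for i in range(n1 + 1):
            if t - i >= 0:
                m[t] += a[i] * h[t - i]
        # weighted sum over N2 future frames
        for j in range(1, n2 + 1):
            if t + j < T:
                m[t] += c[j - 1] * h[t + j]
    return m
```

Because the taps are shared across all time steps, the block behaves like a depthwise temporal convolution, which is why stacking such modules keeps the parameter count well below that of a self-attention layer of comparable receptive field.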
Keywords/Search Tags:Speech Synthesis, Acoustic Modeling, Attention, Feedforward Sequential Memory Networks