
Research On Deep Learning Based End-to-End Chinese Speech Synthesis

Posted on: 2022-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: T He
Full Text: PDF
GTID: 2518306494450904
Subject: Electrical engineering

Abstract/Summary:
With the continuing spread of smart terminal devices such as smartphones and smart speakers, the application scenarios for human-computer interaction in daily life keep expanding. As one of the core technologies of human-machine voice interaction, speech synthesis is widely used in fields such as smart homes, public transportation, and cross-lingual communication, and plays an increasingly important role. This thesis studies deep-learning-based end-to-end Chinese speech synthesis and optimizes both the runtime performance of the synthesis system and the quality of the synthesized audio.

Aiming at a Chinese speech synthesis system with high real-time performance and good speech quality, the thesis first proposes a system based on autoregressive acoustic modeling, built on an RNN sequence-to-sequence architecture. It introduces a local attention mechanism that emphasizes the relevance of nearby context, adds a new DCBHG module to strengthen acoustic-feature prediction, and combines the acoustic model with an LPCNet vocoder to form a complete Chinese speech synthesis system. With comparable system performance, the proposed model reduces mel cepstral distortion (MCD) by 0.73 relative to the original model, raises the mean opinion score (MOS) for speech naturalness by 0.36 over the reference model, and narrows the MOS gap between the reference model and the original recordings by 72.0%, demonstrating improved synthesized speech.

To address the stability problems that error accumulation in typical autoregressive models can cause, the thesis then proposes a non-autoregressive acoustic modeling method based on the feedforward sequential memory network (FSMN). Unlike acoustic models built on the self-attention mechanism, this method combines FSMN modules into a new network architecture that focuses on local features while reducing the number of model parameters. It further adopts continuous interpolation of the fundamental frequency and a fundamental-frequency distillation strategy, with or without fundamental-frequency separation, for acoustic modeling. With a parameter count of only 59.2% of Tacotron2 and FastSpeech2, the proposed FSMN-based method improves Chinese speech synthesis, achieving better MCD and finer spectral detail.
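The abstract does not give the thesis's exact network configuration, but the memory block at the heart of an FSMN can be sketched to show why it captures local context with few parameters: each output frame is a learned weighted sum of a fixed window of past and future hidden states, rather than an attention score over the whole sequence. The function name, tap shapes, and zero-padding at sequence boundaries below are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def fsmn_memory(h, a, c):
    """Scalar-FSMN memory block (illustrative sketch).

    For each frame t:
        m[t] = sum_{i=0..N1} a[i] * h[t-i] + sum_{j=1..N2} c[j-1] * h[t+j]

    h : (T, D) hidden states over T frames
    a : (N1+1, D) lookback taps; a[0] weights the current frame
    c : (N2, D) lookahead taps
    Frames outside [0, T) are treated as zeros.
    """
    T, _ = h.shape
    n1, n2 = a.shape[0] - 1, c.shape[0]
    m = np.zeros_like(h)
    for t in range(T):
        # weighted sum over the current frame and N1 past frames
        for i in range(n1 + 1):
            if t - i >= 0:
                m[t] += a[i] * h[t - i]
        # weighted sum over N2 future frames
        for j in range(1, n2 + 1):
            if t + j < T:
                m[t] += c[j - 1] * h[t + j]
    return m
```

Because the taps are shared across all time steps, the block behaves like a depthwise temporal convolution, which is why stacking such modules keeps the parameter count well below that of a self-attention layer of comparable receptive field.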
Keywords/Search Tags:Speech Synthesis, Acoustic Modeling, Attention, Feedforward Sequential Memory Networks