
Research And Implementation Of End-to-end Chinese Speech Synthesis

Posted on: 2023-04-10, Degree: Master, Type: Thesis
Country: China, Candidate: Z F Gao, Full Text: PDF
GTID: 2568306914973499, Subject: Computer Science and Technology
Abstract/Summary:
Speech synthesis is the technology of converting text into speech, and it plays an important role in human-computer interaction. In recent years, with the continuous development of deep learning, end-to-end approaches based on neural networks have become the mainstream research direction in speech synthesis, but many problems remain. Mainstream models usually use an attention mechanism to align the text and speech sequences. However, the "soft alignment" of attention makes the model unstable during decoding, and the synthesized speech is prone to wrong words, omissions, repetitions, and outright synthesis failure. In addition, because Chinese pronunciation rules are complex and pronunciation depends on the textual context, the model needs stronger context modeling ability. To address these problems, this thesis carries out the following research work:

(1) An acoustic model combining attention and duration prediction is designed. The duration prediction mechanism is responsible for aligning phonemes with audio frames, while the attention mechanism extracts the contextual and semantic information at each aligned position, which aids modeling and resolves the decoding instability of the attention mechanism. Experiments show that the model largely eliminates unstable behavior such as wrong and missing words while maintaining the quality of the synthesized speech.

(2) An acoustic model based on a pre-trained model is designed. This thesis combines the pre-trained model BERT (Bidirectional Encoder Representations from Transformers) with the acoustic model and uses BERT to extract semantic features of the input text to assist pronunciation modeling, alleviating both the lack of text structure and semantic information and the small amount of data in speech datasets. Guided attention is also introduced to steer the attention mechanism, which addresses unstable model training and further improves the quality of the synthesized speech.

(3) A phoneme modeling scheme that separates vowels and tones is designed. Based on the pronunciation rules and characteristics of Chinese, vowels and tones are mapped separately into encoding vectors. This scheme explicitly reflects the pronunciation relationship between vowels and tones, so the model can learn the rules by which they combine, which improves its pronunciation modeling ability. Mapping tones to their own encoding vectors also improves the model's ability to model tone and makes the synthesized tones more accurate.
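As a rough illustration of the ideas in (1) and (3), the sketch below splits a numbered-pinyin syllable into an initial, a toneless final, and a separate tone token, and then expands a token sequence to frame level using per-token durations. It is not the thesis's implementation: the function names, the `T<n>` tone-token format, the use of tone 5 for the neutral tone, the treatment of `y`/`w` as part of the final, and the hand-written durations are all assumptions made for this example. In a real system the durations would come from a trained duration predictor.

```python
from typing import List, Sequence, Tuple

# Standard pinyin initials, two-letter ones first so "zh" matches before "z".
# Assumption: "y" and "w" are left inside the final rather than split off.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]


def split_syllable(syllable: str) -> Tuple[str, str, str]:
    """Split numbered pinyin such as 'zhong1' into (initial, final, tone)."""
    if syllable[-1].isdigit():
        body, tone = syllable[:-1], syllable[-1]
    else:
        body, tone = syllable, "5"  # assumption: unmarked means neutral tone
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # syllable with no initial, e.g. 'an4'


def to_tokens(text: str) -> List[str]:
    """Turn space-separated pinyin into initial / final / tone tokens.

    The tone is emitted as its own token ('T1'..'T5'), so the final 'ong'
    is one symbol regardless of which tone it carries.
    """
    tokens: List[str] = []
    for syllable in text.split():
        initial, final, tone = split_syllable(syllable)
        if initial:
            tokens.append(initial)
        tokens.append(final)
        tokens.append("T" + tone)
    return tokens


def length_regulate(tokens: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each token by its duration (in frames) to get a hard alignment."""
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    frames: List[str] = []
    for token, duration in zip(tokens, durations):
        frames.extend([token] * duration)
    return frames


tokens = to_tokens("zhong1 guo2")
# Hypothetical durations, one per token; a duration predictor would supply these.
frames = length_regulate(tokens, [3, 5, 0, 2, 6, 0])
```

Giving tone tokens a duration of zero here is only a convenience of this sketch; a model could instead add the tone embedding to the embedding of the final so that both share the same frames.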
Keywords/Search Tags: speech synthesis, attention mechanism, phoneme duration prediction, pre-trained model, text context modeling