
Research On Speech Synthesis Based On Deep Learning

Posted on: 2023-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: K Chen
Full Text: PDF
GTID: 2558306914982969
Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, with the rapid development of Internet information technology and the continuous improvement of computer hardware, artificial intelligence has been applied to a wide range of applications and has achieved a series of breakthroughs that greatly facilitate people's lives. Speech synthesis, an important task in natural language processing, converts input text into speech and is widely used in human-computer interaction scenarios such as intelligent assistants, voice navigation, reading education, and the emerging metaverse, where it plays a crucial role.

Most mainstream speech synthesis systems use end-to-end models, which avoid dependence on acoustic-domain expertise and reduce manual feature extraction and labeling. However, these models still have problems: attention mechanisms frequently misalign, models based on recurrent neural networks parallelize poorly, and emotional features cannot be extracted. These problems degrade synthesis speed and quality and can even cause mispronunciation and word skipping. Addressing them, this thesis studies traditional end-to-end speech synthesis and proposes an optimization scheme. The main work is as follows.

First, this thesis proposes a CRC (Convolution-Recurrent-Convolution) text-to-speech (TTS) model that takes Tacotron as the baseline and adds a fully convolutional network module and a dynamic convolution attention mechanism. The fully convolutional module stacks groups of one-dimensional dilated convolution layers with highway connections between them. Compared with the CBHG module in the baseline, it abandons the recurrent structure, whose training is inefficient, and instead uses multiple groups of dilated convolution layers for high-dimensional feature extraction, improving training and inference speed without degrading the quality of the synthesized speech.

The dynamic convolution attention mechanism is purely location-relative: when computing the alignment at each step, it refers only to the hidden state of the previous step and the current input state. It is an additive attention derived from Gaussian-mixture-model attention. Compared with the attention mechanism of the baseline model, it lowers computational cost, reduces the probability of alignment errors, and keeps the alignment monotonic during training, which improves both alignment quality and alignment speed and makes the model handle long input sequences better.
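As a minimal illustration of the fully convolutional module described above, the following sketch stacks one-dimensional dilated convolutions with highway (gated) connections between layers. Channel counts, kernel sizes, and dilation rates are illustrative assumptions, not the thesis's actual hyperparameters.

```python
# A minimal sketch of a dilated-convolution encoder with highway connections.
# Hyperparameters here are assumed for illustration only.
import torch
import torch.nn as nn

class HighwayDilatedConvBlock(nn.Module):
    """One dilated 1-D convolution followed by a highway (gated) connection."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # keep sequence length
        # Two output banks: one for the candidate activation, one for the gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=padding, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, time)
        h, gate = self.conv(x).chunk(2, dim=1)
        gate = torch.sigmoid(gate)
        # Highway connection: gated mix of transformed and original features.
        return gate * torch.tanh(h) + (1.0 - gate) * x

class FullyConvEncoder(nn.Module):
    """Stack of dilated conv blocks replacing the recurrent CBHG encoder."""
    def __init__(self, channels=256, n_blocks=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            HighwayDilatedConvBlock(channels, kernel_size=3, dilation=2 ** i)
            for i in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```

Because every layer is convolutional, all time steps are processed in parallel, which is where the training and inference speedup over a recurrent encoder comes from.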
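The sketch below illustrates the purely location-relative idea behind dynamic convolution attention: the new alignment is computed only from the previous step's alignment, convolved with static filters and with filters predicted from the current decoder state. Filter counts and kernel sizes are assumptions, and the prior filter of the published mechanism is omitted for brevity.

```python
# A simplified, illustrative sketch of dynamic convolution attention.
# Dimensions are assumed; this is not the thesis's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvAttention(nn.Module):
    def __init__(self, query_dim, attn_dim=128, static_ch=8, static_k=21,
                 dynamic_ch=8, dynamic_k=21):
        super().__init__()
        self.dynamic_ch, self.dynamic_k = dynamic_ch, dynamic_k
        # Static location filters, shared across all decoder steps.
        self.static_conv = nn.Conv1d(1, static_ch, static_k,
                                     padding=(static_k - 1) // 2, bias=False)
        # Dynamic filters are predicted from the current decoder state.
        self.dynamic_proj = nn.Linear(query_dim, dynamic_ch * dynamic_k)
        self.W_s = nn.Linear(static_ch, attn_dim, bias=False)
        self.W_d = nn.Linear(dynamic_ch, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, prev_align):
        # query: (batch, query_dim); prev_align: (batch, text_len)
        B, T = prev_align.shape
        f = self.static_conv(prev_align.unsqueeze(1))          # (B, Cs, T)
        # Grouped conv applies a different predicted filter to each batch item.
        filters = self.dynamic_proj(query).view(B * self.dynamic_ch,
                                                1, self.dynamic_k)
        g = F.conv1d(prev_align.view(1, B, T), filters,
                     padding=(self.dynamic_k - 1) // 2,
                     groups=B).view(B, self.dynamic_ch, T)     # (B, Cd, T)
        # Additive energy over location features only: no content term.
        e = self.v(torch.tanh(self.W_s(f.transpose(1, 2)) +
                              self.W_d(g.transpose(1, 2)))).squeeze(-1)
        return torch.softmax(e, dim=-1)                        # new alignment
```

Since the energies depend only on where attention was at the previous step, the alignment can only shift locally, which is what keeps it monotonic and robust on long inputs.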
Then, this thesis optimizes the CRC-TTS model by introducing GST (Global Style Token) and a guided attention mechanism. GST extracts the speaker's voice style. It is a relatively independent module whose purpose is to separate the speaker's style features from the speech content. The GST module first encodes the input speech into a reference embedding containing high-dimensional features of both content and speaking style, then uses multi-head attention to compare this embedding against a set of style tokens and compute their similarity. During training, all data share the style tokens, so the module learns to extract style from speech, which in turn makes the final synthesized speech more expressive and closer to real human speech.

Guided attention exploits the roughly linear relationship between output speech position and input text position when attention is aligned in a speech synthesis task: geometrically, an effective attention matrix should lie close to the diagonal. Training therefore uses a diagonal guidance matrix to compute an attention loss, and attention that drifts too far from the diagonal is heavily penalized. As a result, the model aligns faster than the original model during training, which effectively improves the overall training speed.
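As a minimal sketch of the GST layer described above: a reference embedding extracted from the input speech attends over a bank of shared, learnable style tokens via multi-head attention, and the similarity-weighted sum of tokens becomes the style embedding. Token count and dimensions are illustrative assumptions.

```python
# A minimal sketch of a Global Style Token layer. Sizes are assumed.
import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    def __init__(self, ref_dim=128, style_dim=256, n_tokens=10, n_heads=4):
        super().__init__()
        # Style tokens are shared by all training data, so they come to
        # encode the recurring speaking styles in the corpus.
        self.tokens = nn.Parameter(torch.randn(n_tokens, style_dim))
        self.attn = nn.MultiheadAttention(style_dim, n_heads,
                                          batch_first=True)
        self.query_proj = nn.Linear(ref_dim, style_dim)

    def forward(self, ref_embedding):                 # (batch, ref_dim)
        q = self.query_proj(ref_embedding).unsqueeze(1)        # (B, 1, D)
        kv = torch.tanh(self.tokens).unsqueeze(0).expand(q.size(0), -1, -1)
        # Multi-head attention scores the reference against every token.
        style, _ = self.attn(q, kv, kv)
        return style.squeeze(1)                       # (batch, style_dim)
```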
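The guided attention loss can be written compactly, as in the sketch below: a soft guidance matrix weights each attention cell by its distance from the diagonal, so off-diagonal alignments are punished in proportion to how far they drift. The sharpness value g = 0.2 is the commonly used setting, assumed here rather than taken from the thesis.

```python
# A short sketch of the guided attention loss over a decoder attention matrix.
import torch

def guided_attention_loss(attn, g=0.2):
    """attn: (batch, text_len, mel_len) attention matrix from the decoder."""
    B, N, T = attn.shape
    n = torch.arange(N, device=attn.device).float().unsqueeze(1) / N
    t = torch.arange(T, device=attn.device).float().unsqueeze(0) / T
    # Penalty grows with squared distance from the diagonal n/N == t/T,
    # so attention far off the diagonal is heavily punished.
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g * g))     # (N, T)
    return (attn * w.unsqueeze(0)).mean()
```

Added to the ordinary reconstruction loss, this term pushes the model toward diagonal alignments early in training, which is why alignment converges faster than in the unguided baseline.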
Keywords/Search Tags: deep learning, speech synthesis, attention mechanism, end-to-end