
Realization And Improvement Of End To End Speech Synthesis Algorithm

Posted on: 2021-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Liu
GTID: 2518306197955509
Subject: Signal and Information Processing
Abstract/Summary:
With the development of deep learning, synthesized speech can now sound like a real person. End-to-end speech synthesis algorithms based on DNNs give better synthesis quality and are simpler to train, but they are not widely used in products because they lack robustness and are prone to synthesis errors. Further improving the end-to-end algorithm is the main way to solve this problem. Based on the Tacotron2 model, this thesis explores and implements an improved end-to-end speech synthesis algorithm. The main work includes:

1. Detailed analysis of the Tacotron2 algorithm. Starting from the basic Seq2Seq machine translation algorithm and moving through each part of the speech synthesis algorithm, the composition of each module and the role of each part are analyzed item by item. On this basis, the shortcomings of Tacotron2 and the improvement methods that can be adopted are summarized.

2. Design and implementation of a speech autoencoder. Introducing speech features into text feature extraction is an important way to improve the encoder, and forced alignment between text and speech is the core of speech synthesis. This thesis therefore designs and implements a windowed convolution algorithm based on SincNet, and on this basis designs and implements a speech autoencoder. While solving the forced alignment problem, the speech autoencoder can compress the speech to the same length as the text, which makes it possible to introduce speech features into the encoder. Experimental results show that this sub-algorithm can effectively extract local prosody features, speech feature vectors, and text encoding vectors from a (speech, text) corpus.

3. Improvement of the decoder. Because of shortcomings in the Tacotron2 decoder, error accumulates gradually as the synthesized speech gets longer, and eventually degrades both the quality of the synthesized speech and the maximum length of text that can be synthesized. This thesis attributes the problem to two causes: one is the difference between the decoder's training process and its synthesis process; the other is the noise caused by the accumulation of very small values during forced alignment. To address these causes, this thesis proposes and implements the following three improvements:
(1) Taking Tacotron2 as the generator, a prosody discriminator, a Mel discriminator, and an alignment discriminator are designed to form a generative adversarial network algorithm. The model is trained directly with the synthesis process, which removes the difference between the training process and the synthesis process.
(2) The Random Down training algorithm, which combines the training process and the synthesis process, is designed and implemented.
(3) An attention weight windowing algorithm is introduced to solve the noise caused by the accumulation of very small values, which further strengthens the two methods above.

The experimental results show that, for text input of length 550, the generative adversarial network algorithm extends the length of text that can be synthesized from 200 to more than 400, while the Random Down algorithm successfully synthesizes the full 550-length text, although with distortion at the end. The final Random Down + attention weight windowing algorithm still synthesizes speech without errors when the input text length is 1100. The improved algorithm in this thesis greatly increases the length of speech that can be synthesized.
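The abstract does not give the details of the attention weight windowing step, so the following is only a minimal sketch of one common way such a window can work. It assumes, as an interpretation rather than something the abstract states, that the "very small values" are tiny attention weights on text positions far from the current one. The function name `window_attention` and the parameters `prev_peak`, `back`, and `ahead` are hypothetical and are not taken from the thesis.

```python
from typing import List


def window_attention(weights: List[float], prev_peak: int,
                     back: int = 1, ahead: int = 3) -> List[float]:
    """Keep attention weights only inside a window around the previous
    attention peak, zero the rest, and renormalize so they sum to 1.

    Illustrative assumption: the window spans
    [prev_peak - back, prev_peak + ahead], clipped to the input length.
    """
    if not weights:
        raise ValueError("weights must be non-empty")

    lo = max(0, prev_peak - back)
    hi = min(len(weights) - 1, prev_peak + ahead)

    # Zero every weight that falls outside the window.
    masked = [w if lo <= i <= hi else 0.0 for i, w in enumerate(weights)]

    total = sum(masked)
    if total <= 0.0:
        # No attention mass inside the window: fall back to a one-hot
        # weight at the (clipped) previous peak position.
        fallback = [0.0] * len(weights)
        fallback[min(max(prev_peak, 0), len(weights) - 1)] = 1.0
        return fallback

    # Renormalize so the surviving weights form a distribution again.
    return [w / total for w in masked]


# Example: small weights on distant positions are removed.
weights = [0.02, 0.60, 0.30, 0.02, 0.02, 0.02, 0.02]
windowed = window_attention(weights, prev_peak=1, back=1, ahead=1)
```

In this sketch the small weights at positions 3 to 6 are set to zero, so they cannot add up over a long input; the window widths are free parameters that would need tuning for a real model.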
Keywords/Search Tags:end-to-end speech synthesis, SincNet, window convolution, Random Down, alignment window