
Improved Tacotron2 Speech Synthesis Method Based On Forced Monotonic Attention Mechanism

Posted on: 2022-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Wang
Full Text: PDF
GTID: 2518306572450884
Subject: Computer Science and Technology
Abstract/Summary:
Speech is one of the most important ways for human beings to receive external information and express their own thoughts, and it plays an indispensable role in daily life. Speech technology has many applications; among them, speech synthesis is the technology of converting text information into a speech signal. Research on speech synthesis has a long history and has produced substantial results. In recent years, with the growth of computing power and the development of deep learning, speech synthesis has advanced rapidly, and many novel and efficient synthesis methods have emerged, among which end-to-end models have received wide attention. However, the attention mechanisms used in these models evolved from computer vision and machine translation and are not fully suited to the speech synthesis task; an unsuitable attention mechanism can severely degrade synthesis quality. An attention mechanism applied to speech synthesis needs to be monotonic, yet current research on how to enforce this requirement remains insufficient. This thesis therefore focuses on improving the attention mechanism in the end-to-end model to guarantee its monotonicity and make it more suitable for speech synthesis.

First, Tacotron 2, a representative end-to-end speech synthesis model, is analyzed, and its attention mechanism is studied in detail.

Then, an independent forced-monotonicity method is proposed, in which constraint vectors are designed purely from the monotonicity requirement. To guarantee monotonicity, a constraint vector is constructed at each decoder time step, and a neural network predicts a weight that dynamically adjusts the strength of the constraint. The attention vector at each step is the weighted sum of the output of the original attention mechanism and the constraint vector.

Finally, a method is proposed that uses phoneme duration information, which carries stronger speech characteristics, to guide the training of the attention mechanism. The duration prediction module of a traditional parametric speech synthesis system provides the required durations, which are converted into an alignment matrix similar to the attention matrix through a series of operations such as expansion according to frame length and relaxation around the peak position. By reducing the distance between the attention matrix and this alignment matrix, both the convergence speed of the attention mechanism and the correctness of the alignment information it represents are improved markedly.
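The abstract does not give implementation details, so the following is only a minimal sketch of the forced-monotonicity idea as described above: a constraint vector is built at each decoder step, a small network predicts a mixing weight, and the final attention weights are the weighted sum of the original attention output and the constraint vector. The class and argument names, the windowed form of the constraint vector, and the sigmoid gate are all assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class ForcedMonotonicAttention(nn.Module):
    """Sketch: blend a base attention distribution with a monotonic
    constraint vector whose weight is predicted from the decoder state.
    (Window width and gate network are illustrative assumptions.)"""

    def __init__(self, query_dim, window=3):
        super().__init__()
        self.window = window
        # small network that predicts the mixing weight g in [0, 1]
        self.gate = nn.Sequential(nn.Linear(query_dim, 1), nn.Sigmoid())

    def forward(self, base_weights, prev_peak, query):
        # base_weights: (B, T_enc) output of the original attention mechanism
        # prev_peak:    (B,) index of the previous step's attention peak
        # query:        (B, query_dim) current decoder state
        B, T = base_weights.shape
        positions = torch.arange(T, device=base_weights.device).unsqueeze(0)  # (1, T)
        # constraint vector: uniform mass on [prev_peak, prev_peak + window],
        # i.e. only positions at or slightly ahead of the previous peak are allowed
        mask = (positions >= prev_peak.unsqueeze(1)) & \
               (positions <= prev_peak.unsqueeze(1) + self.window)
        constraint = mask.float()
        constraint = constraint / constraint.sum(dim=1, keepdim=True)
        # dynamically predicted weight controls how strongly the constraint applies
        g = self.gate(query)                                  # (B, 1)
        weights = (1.0 - g) * base_weights + g * constraint   # weighted sum
        return weights / weights.sum(dim=1, keepdim=True)
```

Likewise, the duration-guided training step can be pictured as building an alignment matrix from per-phoneme frame counts and penalizing its distance to the attention matrix. The Gaussian relaxation width, the L1 distance, and the function names below are assumptions chosen for illustration; the thesis only states that the duration information is expanded by frame length, relaxed around the peak position, and compared with the attention matrix.

```python
import torch
import torch.nn.functional as F

def duration_to_alignment(durations, n_frames, sigma=2.0):
    """Expand per-phoneme durations (in frames) into a soft alignment matrix
    of shape (n_frames, n_phonemes); sigma is an assumed relaxation width."""
    ends = torch.cumsum(durations.float(), dim=0)
    starts = ends - durations.float()
    centres = (starts + ends) / 2.0                         # peak frame of each phoneme
    frames = torch.arange(n_frames).float().unsqueeze(1)    # (n_frames, 1)
    # Gaussian relaxation around each phoneme's centre frame
    align = torch.exp(-((frames - centres.unsqueeze(0)) ** 2) / (2 * sigma ** 2))
    return align / align.sum(dim=1, keepdim=True)

def guided_attention_loss(attention, durations):
    """L1 distance between the model's attention matrix (n_frames, n_phonemes)
    and the duration-derived alignment matrix."""
    target = duration_to_alignment(durations, attention.shape[0]).to(attention.device)
    return F.l1_loss(attention, target)
```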
Keywords/Search Tags: Speech synthesis, End-to-end model, Attention mechanism, Long short-term memory network, Phoneme duration prediction