
End-to-end Speech Synthesis

Posted on: 2021-05-05    Degree: Master    Type: Thesis
Country: China    Candidate: Y Q Zhang    Full Text: PDF
GTID: 2518306308972769    Subject: Control Science and Engineering
Abstract/Summary:
As a component of natural language processing, speech synthesis is of great significance to human-computer interaction. With the improvement of computing power and the explosive growth of data, traditional parametric synthesis methods based on hidden Markov models (HMM) and deep neural networks (DNN) have achieved good results, but they require multiple sub-models to be trained separately, the text front end demands strong expert knowledge, and the training pipeline is tedious and complicated. Sequence-to-sequence models address this by converting character sequences directly into acoustic feature sequences, simplifying the whole pipeline and realizing end-to-end synthesis. End-to-end speech synthesis still has problems, however, such as unstable alignment, large data requirements, and difficulty in introducing personalized information. This thesis studies end-to-end speech synthesis based on the Transformer model; the main work is as follows:

Firstly, a Transformer-based end-to-end Chinese speech synthesis system is built and optimized. According to the characteristics of the synthesis task, this thesis studies training techniques for end-to-end speech synthesis, such as scheduled sampling, multi-frame prediction, and stop-token prediction, which accelerate network convergence. Optimization experiments are carried out on the encoder, the decoder, and the attention mechanism: by comparing existing attention mechanisms, the forward attention mechanism and the multi-scale attention mechanism are merged, improving model stability.

Then, the problem of enriching input information for Chinese end-to-end speech synthesis is studied, and an implicit pronunciation-duration model is obtained by pre-training the decoder. In the speech synthesis task, the text input and the speech output carry unequal amounts of information, which makes the model difficult to learn; richer text-side information is conducive to accurate network modeling. This thesis pre-trains word vectors to enrich the text input, and regards the autoregressive structure of the Chinese end-to-end synthesis decoder as an implicit pronunciation-duration model trained on speech data. Experiments show that adding information in this way reduces the data requirements of the sequence-to-sequence model.

Lastly, style speech synthesis based on speaker style representation is studied, and a more discriminative triplet loss is introduced for training the speaker representation. After investigating speaker-dependent speech synthesis, this thesis implements speaker recognition based on the cross-entropy loss, with the aim of transferring the speaker representation from the recognition model into the end-to-end system. A triplet loss is then introduced to train the speaker representation: it pulls representations of the same speaker closer together, makes the model more discriminative, and yields more accurate synthesized speaker voices.
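The triplet-loss objective for speaker representation described above can be sketched as follows. This is a minimal PyTorch illustration, not the thesis's actual implementation: the function name, the margin value, and the use of Euclidean distance are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over speaker embeddings.

    anchor/positive: embeddings of two utterances by the SAME speaker;
    negative: an embedding of a DIFFERENT speaker. The loss pushes the
    same-speaker distance at least `margin` below the cross-speaker one.
    Shapes: (batch, embedding_dim). Margin is illustrative.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # same-speaker distance
    d_neg = F.pairwise_distance(anchor, negative)  # cross-speaker distance
    # Hinge: zero loss once d_neg exceeds d_pos by the margin.
    return F.relu(d_pos - d_neg + margin).mean()
```

Minimizing this hinge makes embeddings of the same speaker cluster while different speakers are pushed apart, which is what gives the speaker representation its discriminative power when transferred into the end-to-end synthesis system.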
Keywords/Search Tags:end-to-end, transformer, pre-training, triplet loss