
End-to-end Speech Synthesis

Posted on: 2021-05-05    Degree: Master    Type: Thesis
Country: China    Candidate: Y Q Zhang    Full Text: PDF
GTID: 2518306308972769    Subject: Control Science and Engineering
Abstract/Summary:
As a component of natural language processing, speech synthesis is of great significance to human-computer interaction. With the improvement of computing power and the explosive growth of data, traditional parametric synthesis methods based on hidden Markov models (HMM) and deep neural networks (DNN) have achieved good results, but they require multiple sub-models to be trained separately, the text front end demands strong expert knowledge, and the training pipeline is tedious and complicated. Sequence-to-sequence models address this by converting character sequences directly into acoustic feature sequences, simplifying the whole pipeline and realizing end-to-end synthesis. End-to-end speech synthesis still has problems, however, such as unstable alignment, large data requirements, and difficulty in introducing personalized information. This thesis studies end-to-end speech synthesis based on the Transformer model; the main work is as follows:

Firstly, a Transformer-based end-to-end Chinese speech synthesis system is built and optimized. According to the characteristics of the synthesis task, this thesis studies training techniques for end-to-end speech synthesis, such as scheduled sampling, multi-frame prediction, and stop-token prediction, which accelerate network convergence. Optimization experiments are carried out on the encoder, the decoder, and the attention mechanism: by comparing existing attention mechanisms, the forward attention mechanism and the multi-scale attention mechanism are merged, improving model stability.

Then, the problem of enriching input information for Chinese end-to-end speech synthesis is studied, and an implicit pronunciation-duration model is obtained by pre-training the decoder. In the speech synthesis task, the text input and the speech output carry unequal amounts of information, which makes the model difficult to learn; richer text-side information is conducive to accurate network modeling. This thesis pre-trains word vectors to enrich the text input, and regards the autoregressive structure of the Chinese end-to-end synthesis decoder as an implicit pronunciation-duration model trained on speech data. Experiments show that adding information in this way reduces the data requirements of the sequence-to-sequence model.

Lastly, style speech synthesis based on speaker style representation is studied, and a more discriminative triplet loss is introduced for training the speaker representation. After investigating speaker-dependent speech synthesis, this thesis implements speaker recognition based on the cross-entropy loss, with the aim of transferring the speaker representation from the recognition model into the end-to-end system. A triplet loss is then introduced to train the speaker representation: it pulls representations of the same speaker closer together, makes the model more discriminative, and yields more accurate synthesized speaker voices.
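The triplet-loss objective for speaker representation described above can be sketched as follows. This is a minimal PyTorch illustration, not the thesis's actual implementation: the function name, the margin value, and the use of Euclidean distance are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over speaker embeddings.

    anchor/positive: embeddings of two utterances by the SAME speaker;
    negative: an embedding of a DIFFERENT speaker. The loss pushes the
    same-speaker distance at least `margin` below the cross-speaker one.
    Shapes: (batch, embedding_dim). Margin is illustrative.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # same-speaker distance
    d_neg = F.pairwise_distance(anchor, negative)  # cross-speaker distance
    # Hinge: zero loss once d_neg exceeds d_pos by the margin.
    return F.relu(d_pos - d_neg + margin).mean()
```

Minimizing this hinge makes embeddings of the same speaker cluster while different speakers are pushed apart, which is what gives the speaker representation its discriminative power when transferred into the end-to-end synthesis system.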
Keywords/Search Tags:end-to-end, transformer, pre-training, triplet loss