
Research On Speech Synthesis Algorithm Based On Sequence To Sequence Model

Posted on: 2021-03-17  Degree: Master  Type: Thesis
Country: China  Candidate: W Q Xie  Full Text: PDF
GTID: 2518306047488264  Subject: Master of Engineering
Abstract/Summary:
Speech synthesis is a technology that converts text into speech. It is widely used in human-computer interaction systems such as voice navigation and audio e-books, bringing great convenience to people's lives. Diverse application scenarios set ever higher goals for the intelligibility, articulation, and naturalness of synthesized speech. Among existing speech synthesis algorithms, traditional waveform concatenation achieves high naturalness, but building the speech database is time-consuming and requires a lot of storage space. Statistical parametric speech synthesis based on hidden Markov models is highly flexible, but its modeling capability is limited, which easily causes the loss of fine speech-feature details. Speech synthesis based on deep neural networks improves model accuracy, but it still requires complex front-end processing, and the whole system consists of multiple modules such as an acoustic feature prediction model, a phoneme duration prediction model, and a vocoder, so training errors easily accumulate. The emergence of sequence-to-sequence models opened a new direction for speech synthesis: such a model eliminates the complex text-processing pipeline, remedies the shortcomings of existing algorithms, and maps text directly to acoustic features. It has become one of the mainstream approaches to speech synthesis today. Based on the sequence-to-sequence model, this thesis conducts in-depth research on speech synthesis technology.

First, Tacotron, the current typical sequence-to-sequence speech synthesis system, was explored. The model was applied to LJSpeech, an open-source single-speaker English dataset, and its problems were analyzed: the recurrent neural network structure severely restricts the model's running speed. The experimental results show that basically acceptable speech quality is obtained only after about 249.6 hours of training.

Secondly, to address the low training efficiency of recurrent neural networks, a speech synthesis system based on a sequence-to-sequence model and a convolutional neural network was studied and implemented. Starting from the system's front-end module, the effect of the character-embedding dimension on synthesized speech quality was studied. Because the same letter is pronounced differently in different words, the network needs to extract stronger contextual information, so the character embedding in the front-end module was replaced with a phoneme embedding. The experimental results show that phoneme embedding is superior in both training speed and synthesized speech quality. Compared with the Tacotron system, the convolutional system needs only about 9 hours of training to reach the desired speech quality, which greatly accelerates model training.

Then, a Chinese speech synthesis system based on a sequence-to-sequence model and a convolutional neural network was implemented. Because Chinese characters do not directly express pronunciation and many characters are polyphonic, a preprocessing module was added that annotates the text with Pinyin, converting it into a Pinyin sequence that is fed to the network for training; a minimal sketch of this step is shown below. The experimental results show that the subjective quality evaluation score of the synthesized speech reaches 4.15 and the mel-cepstral distortion is 4.528896, which satisfies practical requirements well.
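As an illustration of the Pinyin preprocessing described above, the sketch below converts Chinese text into tone-numbered Pinyin syllables. The thesis does not name its conversion tool; the pypinyin library and the helper text_to_pinyin are assumptions used here only for illustration.

```python
# Minimal sketch of the Pinyin preprocessing step (assumed implementation:
# the pypinyin library; the thesis does not specify its converter).
from pypinyin import lazy_pinyin, Style

def text_to_pinyin(text: str) -> str:
    """Convert Chinese characters to tone-numbered Pinyin syllables.

    Style.TONE3 appends the tone number to each syllable (e.g. "ni3"),
    turning polyphonic characters into an unambiguous,
    pronunciation-oriented input sequence for the network.
    """
    syllables = lazy_pinyin(text, style=Style.TONE3)
    return " ".join(syllables)

if __name__ == "__main__":
    print(text_to_pinyin("语音合成"))  # -> "yu3 yin1 he2 cheng2"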
Finally, a speech synthesis system based on a sequence-to-sequence model and depthwise separable convolution is proposed. A traditional convolutional layer performs feature extraction and cross-channel feature fusion at the same time, so when the network is very deep and has many hidden nodes, training remains slow. To address this, depthwise separable convolution is introduced in place of the traditional one-dimensional convolution to improve the original model. The results show that the improved model significantly reduces the number of parameters and speeds up training without degrading the quality of the synthesized speech.
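To make the parameter saving concrete, the following PyTorch sketch contrasts a standard one-dimensional convolution with a depthwise separable one. The channel sizes, kernel width, and the helper n_params are illustrative assumptions, not the thesis's actual hyperparameters.

```python
# Sketch: standard 1-D convolution vs. depthwise separable convolution.
# Channel sizes and kernel width below are assumed for illustration.
import torch
import torch.nn as nn

C_IN, C_OUT, K = 256, 256, 5

# Standard 1-D convolution: C_in * C_out * K weights.
standard = nn.Conv1d(C_IN, C_OUT, kernel_size=K, padding=K // 2)

# Depthwise separable convolution: a per-channel (depthwise) filter that
# extracts features, then a 1x1 (pointwise) convolution that fuses
# channels -- roughly C_in * K + C_in * C_out weights in total.
separable = nn.Sequential(
    nn.Conv1d(C_IN, C_IN, kernel_size=K, padding=K // 2, groups=C_IN),
    nn.Conv1d(C_IN, C_OUT, kernel_size=1),
)

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, C_IN, 100)            # (batch, channels, time)
assert standard(x).shape == separable(x).shape
print(n_params(standard), n_params(separable))  # ~328k vs ~67k parameters
```

The roughly fivefold reduction in weights at these (assumed) sizes is what allows the improved model to train faster while producing feature maps of the same shape.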
Keywords/Search Tags: Speech synthesis, Recurrent neural network, Convolutional neural network, Depthwise separable convolution, Sequence-to-sequence model, Attention mechanism