
Research On Speech Synthesis Algorithm Based On Sequence To Sequence Model

Posted on: 2021-03-17  Degree: Master  Type: Thesis
Country: China  Candidate: W Q Xie  Full Text: PDF
GTID: 2518306047488264  Subject: Master of Engineering
Abstract/Summary:
Speech synthesis is a technology that converts text into speech. It is widely used in human-computer interaction systems such as voice navigation and audio e-books, bringing great convenience to people's lives. Diverse application scenarios set ever higher goals for the intelligibility, articulation, and naturalness of synthesized speech. Among existing speech synthesis algorithms, traditional waveform concatenation achieves high naturalness, but building the speech database is time-consuming and requires a lot of storage space. Statistical parametric speech synthesis based on hidden Markov models is highly flexible, but its modeling capability is limited, which easily causes the loss of fine speech-feature details. Speech synthesis based on deep neural networks improves model accuracy, but it still requires complex front-end processing, and the whole system consists of multiple modules such as an acoustic feature prediction model, a phoneme duration prediction model, and a vocoder, so training errors easily accumulate. The emergence of sequence-to-sequence models opened a new direction for speech synthesis: such a model eliminates the complex text-processing pipeline, remedies the shortcomings of existing algorithms, and maps text directly to acoustic features. It has become one of the mainstream approaches to speech synthesis today. Based on the sequence-to-sequence model, this thesis conducts in-depth research on speech synthesis technology.

First, Tacotron, the current typical sequence-to-sequence speech synthesis system, was explored. The model was applied to LJSpeech, an open-source single-speaker English dataset, and its problems were analyzed: the recurrent neural network structure severely restricts the model's running speed. The experimental results show that basically acceptable speech quality is obtained only after about 249.6 hours of training.

Secondly, to address the low training efficiency of recurrent neural networks, a speech synthesis system based on a sequence-to-sequence model and a convolutional neural network was studied and implemented. Starting from the system's front-end module, the effect of the character-embedding dimension on synthesized speech quality was studied. Because the same letter is pronounced differently in different words, the network needs to extract stronger contextual information, so the character embedding in the front-end module was replaced with a phoneme embedding. The experimental results show that phoneme embedding is superior in both training speed and synthesized speech quality. Compared with the Tacotron system, the convolutional system needs only about 9 hours of training to reach the desired speech quality, which greatly accelerates model training.

Then, a Chinese speech synthesis system based on a sequence-to-sequence model and a convolutional neural network was implemented. Because Chinese characters do not directly express pronunciation and many characters are polyphonic, a preprocessing module was added that annotates the text with Pinyin, converting it into a Pinyin sequence that is fed to the network for training; a minimal sketch of this step is shown below. The experimental results show that the subjective quality evaluation score of the synthesized speech reaches 4.15 and the mel-cepstral distortion is 4.528896, which satisfies practical requirements well.
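As an illustration of the Pinyin preprocessing described above, the sketch below converts Chinese text into tone-numbered Pinyin syllables. The thesis does not name its conversion tool; the pypinyin library and the helper text_to_pinyin are assumptions used here only for illustration.

```python
# Minimal sketch of the Pinyin preprocessing step (assumed implementation:
# the pypinyin library; the thesis does not specify its converter).
from pypinyin import lazy_pinyin, Style

def text_to_pinyin(text: str) -> str:
    """Convert Chinese characters to tone-numbered Pinyin syllables.

    Style.TONE3 appends the tone number to each syllable (e.g. "ni3"),
    turning polyphonic characters into an unambiguous,
    pronunciation-oriented input sequence for the network.
    """
    syllables = lazy_pinyin(text, style=Style.TONE3)
    return " ".join(syllables)

if __name__ == "__main__":
    print(text_to_pinyin("语音合成"))  # -> "yu3 yin1 he2 cheng2"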
Finally, a speech synthesis system based on a sequence-to-sequence model and depthwise separable convolution is proposed. A traditional convolutional layer performs feature extraction and cross-channel feature fusion at the same time, so when the network is very deep and has many hidden nodes, training remains slow. To address this, depthwise separable convolution is introduced in place of the traditional one-dimensional convolution to improve the original model. The results show that the improved model significantly reduces the number of parameters and speeds up training without degrading the quality of the synthesized speech.
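To make the parameter saving concrete, the following PyTorch sketch contrasts a standard one-dimensional convolution with a depthwise separable one. The channel sizes, kernel width, and the helper n_params are illustrative assumptions, not the thesis's actual hyperparameters.

```python
# Sketch: standard 1-D convolution vs. depthwise separable convolution.
# Channel sizes and kernel width below are assumed for illustration.
import torch
import torch.nn as nn

C_IN, C_OUT, K = 256, 256, 5

# Standard 1-D convolution: C_in * C_out * K weights.
standard = nn.Conv1d(C_IN, C_OUT, kernel_size=K, padding=K // 2)

# Depthwise separable convolution: a per-channel (depthwise) filter that
# extracts features, then a 1x1 (pointwise) convolution that fuses
# channels -- roughly C_in * K + C_in * C_out weights in total.
separable = nn.Sequential(
    nn.Conv1d(C_IN, C_IN, kernel_size=K, padding=K // 2, groups=C_IN),
    nn.Conv1d(C_IN, C_OUT, kernel_size=1),
)

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, C_IN, 100)            # (batch, channels, time)
assert standard(x).shape == separable(x).shape
print(n_params(standard), n_params(separable))  # ~328k vs ~67k parameters
```

The roughly fivefold reduction in weights at these (assumed) sizes is what allows the improved model to train faster while producing feature maps of the same shape.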
Keywords/Search Tags: Speech synthesis, Recurrent neural network, Convolutional neural network, Depthwise separable convolution, Sequence-to-sequence model, Attention mechanism