Research On Speech Synthesis Technology For Chinese Advertisement Text

Posted on: 2019-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: J K Hou
GTID: 2428330590973910
Subject: Computer Science and Technology
Abstract/Summary:
Text to speech is the technology of converting natural-language text into a speech signal by machine. It is an important part of artificial intelligence and plays an indispensable role in human-computer interaction. Two main approaches have emerged over the development of the field: waveform concatenation synthesis and statistical parametric synthesis. The former selects candidate unit waveforms from a speech database by analyzing the prosodic features of the text and then concatenates them into the synthesized speech. The resulting speech quality is high, but the method requires a matching speech database, which makes it costly and hard to port. The latter uses hidden Markov models to model the relationship between text and speech parameters, predicts the acoustic features of the speech to be synthesized, and finally reconstructs the speech through a vocoder. This method depends on the quality of the vocoder, and the synthesized speech is not natural enough. In recent years, text to speech based on deep learning has become a research hotspot and has gradually shown clear performance advantages.

Against this background, this thesis focuses on deep-learning-based speech synthesis, explores efficient text-to-speech technology for Chinese advertising text, and builds a corresponding speech synthesis system. The system consists of two main modules: a text-to-acoustic-feature prediction module and a vocoder module that converts acoustic features into speech.

For the text-to-acoustic-feature prediction module, the thesis builds on end-to-end acoustic feature prediction. To address slow prediction, it proposes introducing an independently recurrent neural network (IndRNN) into the prediction module, which improves the speed of acoustic feature prediction. The thesis also finds that, although end-to-end prediction simplifies the acoustic feature prediction process, the final synthesized speech is monotonous and lacks prosodic information. To solve this problem, it introduces a Lattice LSTM network to fuse word, prosody, and other information from the text, which enriches the detail of the synthesized speech.

For the vocoder, the thesis explores the WaveNet model, which is based on an autoregressive deep generative network. To speed up generation, it adopts Parallel WaveNet, which is based on inverse autoregressive flow and enables the vocoder to convert acoustic features into the corresponding speech in real time. The thesis also proposes a multi-speaker speech synthesis technique based on speaker identification, which improves the training efficiency of the model and reduces its dependence on the length of the corpus. For speech synthesis of Chinese advertising text, the thesis establishes a complete synthesis pipeline that can synthesize clear and fluent advertising speech from the advertising text.
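
The abstract does not say how the independently recurrent neural network is wired into the prediction module, so the following is only a minimal sketch of the standard IndRNN recurrence, h_t = ReLU(W x_t + u * h_(t-1) + b), not the thesis's implementation. The function name, argument names, and the choice of ReLU are assumptions made for illustration. The point it shows is that each hidden unit depends only on its own previous state, so the recurrent term is an element-wise product instead of a full matrix multiplication.

```python
from typing import List, Sequence


def relu(v: float) -> float:
    """Rectified linear unit, the activation commonly paired with IndRNN."""
    return v if v > 0.0 else 0.0


def indrnn_forward(
    xs: Sequence[Sequence[float]],   # input sequence, one feature vector per step
    W: Sequence[Sequence[float]],    # input weights, shape (hidden, input)
    u: Sequence[float],              # recurrent weights, ONE scalar per hidden unit
    b: Sequence[float],              # bias, one scalar per hidden unit
) -> List[List[float]]:
    """Run a single IndRNN layer (illustrative sketch) and return all hidden states."""
    hidden = len(u)
    h = [0.0] * hidden               # initial hidden state is all zeros
    outputs: List[List[float]] = []
    for x in xs:
        new_h = []
        for i in range(hidden):
            # Input projection for unit i: row i of W times x, plus bias.
            proj = sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
            # Recurrent term uses only unit i's own previous state (u[i] * h[i]),
            # unlike a vanilla RNN, which mixes all units through a full matrix.
            new_h.append(relu(proj + u[i] * h[i]))
        h = new_h
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    states = indrnn_forward(
        xs=[[1.0, 1.0]] * 3,
        W=[[1.0, 0.0], [0.0, 1.0]],
        u=[0.5, -1.0],
        b=[0.0, 0.0],
    )
    for t, h in enumerate(states):
        print(t, h)
```

Because the recurrent weight is a vector, the per-step recurrent cost in this sketch grows linearly with the number of hidden units, which is consistent with the abstract's motivation of faster acoustic feature prediction.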
Keywords/Search Tags: Text to speech, Deep learning, IndRNN, Deep generative model, Inverse autoregressive flows