Research On Chinese Speech Synthesis Method Integrating Pause And Personal Information

Posted on:2022-02-22

Degree:Master

Type:Thesis

Country:China

Candidate:H H Fu

Full Text:PDF

GTID:2518306569494634

Subject:Computer Science and Technology

Abstract/Summary:

Advances in artificial intelligence have driven the commercial adoption of intelligent human-computer interaction products such as smart speakers and mobile phone assistants, bringing many conveniences to daily life. Speech synthesis, which converts input text into natural-sounding audio, is a key component of such interaction and an important research direction in intelligent speech generation. In recent years, end-to-end speech synthesis based on deep learning has gradually become mainstream: it reduces model complexity while improving the quality of the synthesized audio. However, current end-to-end systems still suffer from unnatural pauses, unstable synthesis of long and difficult sentences, and divergence in sound quality. Focusing on Chinese speech synthesis, this thesis therefore builds on the Tacotron family of end-to-end architectures to address these problems. The main work is as follows.

To alleviate the unstable synthesis of long and difficult sentences and the divergence in sound quality, a new acoustic model structure for Chinese speech synthesis, Evotron, is proposed. A variant structure based on the Transformer network encodes the phonetic sequence with context information to obtain richer text semantic information. A forward local position-sensitive attention mechanism alleviates instability when synthesizing long and difficult sentences. A diagonal guidance attention loss constrains the attention weights toward a diagonal alignment, which speeds up model convergence. A differential loss and a waveform loss impose time-frequency domain constraints on the model, improving the clarity of the synthesized spectrogram and alleviating the divergence in sound quality. Hybrid input techniques alleviate the exposure bias problem in the Seq2Seq decoding process. Experiments show that the proposed Chinese synthesis framework is superior to current mainstream synthesis frameworks in both speed and sound quality. The auxiliary optimization techniques each bring some performance improvement, with the waveform loss having a particularly significant effect.

To address unnatural pauses in Chinese speech synthesis, two aspects are studied: optimizing pause prosody prediction and modeling pause prosody within the acoustic model. For the Chinese pause prosody prediction task, BERT word vectors and syntactic features are introduced to enrich the semantic information of the input sequence. Because Chinese prosody levels are hierarchically nested, a hierarchical prosody prediction architecture is proposed to model the prosody structure more accurately. In addition, to control pause prosody information explicitly in the acoustic model, a multi-task learning strategy is designed that takes spectrogram generation as the main task and pause prosody prediction as the auxiliary task, guiding the model to perform pause prosody modeling. Ablation experiments on single-speaker and multi-speaker data sets show that the hierarchical prosody prediction structure outperforms the single prediction structure. In the single-speaker speech synthesis experiment, the multi-task learning strategy brings some improvement to both tasks at the same time. In the multi-speaker speech synthesis task, beyond the performance improvement, the joint prosody learning strategy better encourages the model to learn the pause rhythm unique to a specific speaker.
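The diagonal guidance attention loss mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the formula used in the thesis, so the code below assumes one commonly used formulation: a penalty matrix W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2g^2)) multiplied element-wise with the attention matrix and averaged. The function names and the width parameter `g` are hypothetical and chosen only for illustration.

```python
import math


def guided_attention_weights(num_text, num_frames, g=0.2):
    """Build a penalty matrix that is 0 on the diagonal and grows away from it.

    Assumed formulation (not taken from the thesis):
    W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 * g^2))
    """
    weights = []
    for n in range(num_text):
        row = []
        for t in range(num_frames):
            # Difference between the normalised text and frame positions.
            diff = n / num_text - t / num_frames
            row.append(1.0 - math.exp(-(diff ** 2) / (2.0 * g ** 2)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] for a (text x frames) attention matrix."""
    num_text = len(attention)
    num_frames = len(attention[0])
    w = guided_attention_weights(num_text, num_frames, g)
    total = sum(
        attention[n][t] * w[n][t]
        for n in range(num_text)
        for t in range(num_frames)
    )
    return total / (num_text * num_frames)


# Toy alignments: one perfectly diagonal, one running the wrong way.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
anti_diagonal = [[1.0 if n + t == 3 else 0.0 for t in range(4)] for n in range(4)]

loss_diagonal = guided_attention_loss(diagonal)
loss_anti_diagonal = guided_attention_loss(anti_diagonal)
```

Under this assumed formulation, a diagonal alignment incurs no penalty while an off-diagonal one is penalised. Adding such a term to the training loss would push attention toward monotonic alignment, which is consistent with the faster convergence the abstract reports.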

Keywords/Search Tags:

speech synthesis, Evotron, prosody prediction, hierarchical nesting, multi-task learning

Related items

1	The Research On Dai Prosody Prediction Module Of Speech Synthesis
2	Research On 3D Visible Speech Animation Driven By Prosody Text
3	Multi-level Prosody And Short-term Spectrum Transform For Emotional Speech Synthesis
4	The Research Of Speech Synthesis And Prosody Control In Wu-Dialect Text-to-Speech
5	A Research Of Prosody Modeling And Synthesis Method In Chinese TTS
6	An Improved Speech Synthesis Method
7	The Method And Implementation Of ToBI Automatic Prosodic Labeling In English Text To Speech System
8	Research On Mandarin Text-to-Speech Based On Deep Learning
9	Mongolian Speech Synthesis Based On Deep Learning
10	Research On The Prosody Boundary Prediction For Foreign Students Speaking Mandarin