
Research On Chinese Speech Synthesis Method Integrating Pause And Personal Information

Posted on: 2022-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: H H Fu
Full Text: PDF
GTID: 2518306569494634
Subject: Computer Science and Technology
Abstract/Summary:
The development of artificial intelligence has promoted the commercial application of intelligent human-computer interaction products such as smart speakers and mobile phone assistants, bringing many conveniences to daily life. Speech synthesis is one of the key components of intelligent human-computer interaction. Its task is to convert input text into natural-sounding audio, and it is an important research direction in the field of intelligent speech generation. In recent years, end-to-end speech synthesis methods based on deep learning have gradually become mainstream. These methods not only reduce model complexity but also improve the quality of the synthesized audio. However, current end-to-end systems still suffer from problems such as unnatural pauses, unstable synthesis of long and difficult sentences, and divergence of sound quality. Therefore, in the context of Chinese speech synthesis, this thesis builds on the Tacotron series of end-to-end model architectures to study solutions to these problems. The main work is as follows.

To alleviate the unstable synthesis of long and difficult sentences and the divergence of sound quality, a new acoustic model structure for Chinese speech synthesis, Evotron, is proposed. A variant structure based on the Transformer network encodes the phoneme sequence with context information to obtain richer semantic information from the text. A forward local position-sensitive attention mechanism is introduced to alleviate instability in the synthesis of long and difficult sentences. A diagonal guided attention loss constrains the attention weights toward the diagonal, which speeds up model convergence. A differential loss and a waveform loss impose time-domain and frequency-domain constraints on the model, which improves the clarity of the synthesized spectrogram and alleviates the divergence of sound quality. Hybrid input techniques are used to alleviate the exposure bias problem in the Seq2Seq decoding process. Experiments show that the proposed Chinese synthesis framework is superior to the current mainstream synthesis framework in terms of speed and sound quality. The auxiliary optimization techniques each bring a certain degree of performance improvement, of which the effect of the waveform loss is particularly significant.

To solve the problem of unnatural pauses in Chinese speech synthesis, research is conducted in two directions: optimizing pause prosody prediction and modeling pause prosody in the acoustic model. For the Chinese pause prosody prediction task, BERT word vectors and syntactic features are introduced to enrich the semantic information of the input sequence. Because Chinese prosody levels are hierarchically nested, a hierarchical prosody prediction architecture is proposed to model the prosodic structure more accurately. In addition, to control pause prosody information explicitly in the acoustic model, a multi-task learning strategy is designed that takes spectrogram generation as the main task and pause prosody prediction as the auxiliary task, guiding the model to perform pause prosody modeling. Ablation experiments were conducted on single-speaker and multi-speaker data sets, and they showed that the hierarchical prosody prediction structure is better than the single prediction structure. In the single-speaker speech synthesis experiment, the multi-task learning strategy brings a certain degree of improvement to both tasks at the same time. In the multi-speaker speech synthesis task, in addition to the performance improvement, the joint prosody learning strategy better encourages the model to learn the pause rhythm unique to a specific speaker.
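
The abstract does not give a formula for the diagonal guided attention loss. As a rough illustration only, the sketch below uses one widely used formulation of guided attention, in which each attention weight is multiplied by a penalty that is zero on the diagonal and grows toward 1 away from it. The function names, the `attention[t][n]` layout (decoder frame by input token), and the width parameter `g=0.2` are assumptions made for this example, not details taken from the thesis.

```python
import math


def guided_attention_penalty(n_text, n_frames, g=0.2):
    """Penalty matrix indexed [frame][token].

    Entries are 0 where the relative frame position equals the relative
    token position (the diagonal) and approach 1 far from it.
    `g` (assumed value) controls how wide the tolerated band is.
    """
    return [
        [
            1.0 - math.exp(-((t / n_frames - n / n_text) ** 2) / (2.0 * g * g))
            for n in range(n_text)
        ]
        for t in range(n_frames)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention weights multiplied by the off-diagonal penalty.

    attention[t][n] is the weight decoder frame t places on input token n.
    """
    n_frames = len(attention)
    n_text = len(attention[0])
    penalty = guided_attention_penalty(n_text, n_frames, g)
    total = sum(
        attention[t][n] * penalty[t][n]
        for t in range(n_frames)
        for n in range(n_text)
    )
    return total / (n_frames * n_text)


# A perfectly diagonal alignment incurs no penalty; a reversed one does.
diagonal = [[1.0 if t == n else 0.0 for n in range(4)] for t in range(4)]
reversed_alignment = [row[::-1] for row in diagonal]
print(guided_attention_loss(diagonal))
print(guided_attention_loss(reversed_alignment))
```

In training, a term like this would typically be added to the spectrogram reconstruction loss with a small weight, so that early alignments are nudged toward monotonic, roughly diagonal shapes.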
Keywords/Search Tags: speech synthesis, Evotron, prosody prediction, hierarchical nesting, multi-task learning