Development And Application Of Dialect Speech Synthesis System Based On Tacotron2

Posted on:2021-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:J Wu

Full Text:PDF

GTID:2518306047987019

Subject:Master of Engineering

Abstract/Summary:

As deep learning matures, more and more electronic devices use it to become more intelligent, and most smart devices now offer voice interaction to free the user's hands. However, voice interaction is generally carried out in Mandarin, so users who have a poor command of Mandarin or who speak only a dialect cannot use it well. To address this, some smart devices have added dialect speech synthesis and dialect speech recognition modules, but the dialect speech they currently synthesize is not clear or natural enough to meet users' needs. To improve the quality of dialect speech synthesis in smart devices, this thesis proposes an end-to-end dialect speech synthesis system based on the Tacotron2 model. Compared with traditional speech synthesis models, Tacotron2 synthesizes speech directly from the input text, without extensive manual intervention. The main research content covers the following four aspects:

(1) Data collection and processing: the text data set, speech data set and Mel spectrum data set required by the dialect speech synthesis system are collected. First, dialect text is collected and processed into 10,298 sentences. These sentences are then recorded, and the WavePad tool is used to split the recording sentence by sentence so that each text sentence corresponds to exactly one audio clip. Finally, the dialect text is converted into pinyin phonetic notation to form the text data set, and the segmented dialect recordings form the speech data set. The Mel spectrum prediction module is used to convert the dialect text into Mel spectra, which form the Mel spectrum data set.

(2) System requirements analysis: the business process of the dialect speech synthesis system is presented, and its functional and non-functional requirements are analyzed. The functional requirements are modeled, and the functions the system must implement are determined through requirements analysis.

(3) Design and implementation of the system: the functional modules of the system are designed and implemented. The system consists of three main modules: data processing, text-to-Mel-spectrum prediction, and Mel-spectrum-to-audio conversion. The text-to-Mel-spectrum module converts the input phonetic text into a Mel spectrum and is implemented with the Tacotron2 model. The Mel-spectrum-to-audio module converts the Mel spectrum from the frequency domain into time-domain audio and is implemented with the WaveGlow model.

(4) System testing and analysis: test cases are designed, the system is tested, and the results are analyzed. The text-to-Mel-spectrum model is trained for 175,000 steps, and the loss finally converges to 0.3495. Comparing the predicted Mel spectra with the real ones shows that the two are basically consistent. The system response time is then tested: the average response time per second of synthesized audio is 3 s. Finally, the quality of the synthesized speech is evaluated: the synthesized dialect speech reaches a MOS of 3.926, close to the MOS of 4.217 for the original speech.

The test results show that the dialect speech synthesis system meets its design requirements and that the synthesized speech has improved in naturalness and fluency. The synthesis quality is close to that of a real human voice, achieving the research goals set for the system.
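
The two evaluation figures reported in the abstract, the MOS and the response time per second of synthesized audio, can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the 1 to 5 rating scale, and the sample values are assumptions made for illustration, and the response-time figure is read here as computation time divided by audio duration.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, assuming the usual 1-5 MOS scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must lie between 1 and 5")
    return mean(ratings)


def response_time_per_audio_second(synthesis_seconds, audio_seconds):
    """Seconds of computation spent per second of synthesized audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical listener scores and timings, for illustration only
# (these are NOT the thesis data).
example_ratings = [4.0, 4.0, 5.0, 3.0, 4.0]
print(mean_opinion_score(example_ratings))
print(response_time_per_audio_second(15.0, 5.0))
```

Under this reading, a value of 3 would mean that a 5-second utterance takes about 15 seconds to synthesize.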

Keywords/Search Tags:

Speech Synthesis, Deep Learning, Tacotron2, WaveGlow, TTS, Dialect

Related items

1	Chinese Dialect Synthesis Method Based On Tacotron2 With Limit Samples And Lacking Pronunciation Dictionaries
2	Research On Speech Synthesis Of Shanghai Dialect Based On Deep Learning
3	Research And Implementation Of Speech Synthesis System Based On Deep Learning
4	The Design And Implementation Of The Speech Synthesis System Of Minnan Dialect
5	Research On The Speech Synthesis Of Tibetan Ando Dialect Based On HMM
6	An HMM-based Speech Synthesis System Applied To Tianjin Dialect
7	Application Research Of Deep Learning In Speech Recognition Of Sichuan Dialect
8	Research On Dialect Accent Classification Based On Deep Learning
9	Research On Acoustic Analysis And Speech Synthesis For Lanzhou-Dialect
10	Speech Recognition Of Hainan Dialect Based On Deep Learning