Chinese Dialect Synthesis Method Based On Tacotron2 With Limit Samples And Lacking Pronunciation Dictionaries

Posted on: 2024-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: K H Mu
Full Text: PDF
GTID: 2568307091965259
Subject: Software engineering
Abstract/Summary:
Chinese dialects are part of China's intangible cultural heritage, but as society develops and Mandarin becomes more widespread, people are gradually abandoning their use and transmission, leading to their decline. Synthesizing Chinese dialects with speech technology is an important means of protecting them. However, Tacotron2, an end-to-end speech synthesis model based on deep learning, was designed for languages written in the Latin alphabet, whereas Chinese characters are ideograms that represent ideas rather than sounds. To bridge this gap, researchers typically use a pronunciation dictionary to convert Chinese characters into pinyin or phoneme sequences as the model's input. The existing pronunciation dictionaries for Mandarin, however, do not apply to Chinese dialects, and building such a dictionary for a specific dialect requires not only that researchers fully understand the dialect and have extensive linguistic knowledge, but also substantial human and material resources. Chinese dialects are also low-resource languages, so collecting annotated data takes considerable time and effort. To address the lack of pronunciation dictionaries for Chinese dialects, which makes it difficult to convert Chinese characters into phoneme sequences and thereby hinders dialect speech synthesis, this paper proposes two models, WFSC-Tacotron2 and CSPCoding-Tacotron2. These models free Tacotron2 from its reliance on manually constructed pronunciation dictionaries and enable it to synthesize Chinese dialects from small samples without a pronunciation dictionary, reducing the cost of synthesis. The main contributions of this paper are as follows:

(1) The WFSC-Tacotron2 Chinese dialect speech synthesis method is proposed. The method uses the similarity between pronunciation frames to automatically construct a character frame difference set (CFD-Set), encodes each word according to the similarity between its pronunciation information and the CFD-Set vectors, and synthesizes Chinese dialects with Tacotron2. This method can synthesize high-quality Chinese dialect speech from small samples and without linguistic knowledge.

(2) The CSPCoding-Tacotron2 method for Chinese dialect speech synthesis is proposed. This method removes the WFSC encoding method's need to collect single-word speech. Its core is to extract the common pronunciation frames of the data set to determine the pronunciation information of a word. The whole process extracts pronunciation features for each word and encodes them automatically, significantly reducing the cost of synthesizing Chinese dialects.

(3) Speech synthesis experiments were carried out for three Chinese dialects (Cantonese, Hunan, and Hefei) using the WFSC-Tacotron2 and CSPCoding-Tacotron2 methods with 5 hours of sample data. Their MOS evaluation results exceeded 3.5. The two models were also used to synthesize Mandarin and compared with the traditional Pinyin-Tacotron2, which uses pinyin as the model's input.

The results show that WFSC-Tacotron2 and CSPCoding-Tacotron2 can synthesize high-quality Chinese dialect speech from small samples and effectively reduce the cost of synthesizing Chinese dialects. Simple and effective synthesis of Chinese dialects not only promotes local communication and helps preserve the dialects, but can also be applied to audio and video products, making them more convenient for middle-aged and elderly people who were not educated in Mandarin. Samples for Mandarin, Cantonese, Hefei, and Hunan dialects are available at http://www.buct-nlp-lab.top/#/all.
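The MOS figures cited in contribution (3) are mean opinion scores, conventionally the average of listener ratings on a 1 to 5 scale. The abstract does not describe the rating protocol, so the following is only a minimal sketch of how such a score is commonly computed. The function name `mean_opinion_score`, the normal-approximation confidence interval, and the ratings themselves are illustrative assumptions, not data or code from the thesis.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the half-width of a normal-approximation 95%
    confidence interval (1.96 * standard error of the mean).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one synthesized utterance set (not thesis data).
ratings = [4, 3, 4, 5, 3, 4, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, which is why a threshold such as 3.5 is usually read together with the number of raters and utterances behind it.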
Keywords/Search Tags: speech synthesis, Chinese dialects, Tacotron2, phoneme extraction, low-resource