
Cross-lingual Speech Synthesis Based On Statistical Models

Posted on: 2017-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: Q J Yu
Full Text: PDF
GTID: 2348330536958910
Subject: Computer Science and Technology

Abstract/Summary:
Speech synthesis is an important component of human-computer interaction. As global communication increases, the demand for multilingual interaction grows day by day. Cross-lingual speech synthesis, whose goal is to synthesize speech in different languages with the same speaker's voice, has therefore become a hot and difficult topic in the speech synthesis field. Different languages share information (e.g., similar pronunciations and homologous segmental structures), which provides the phonetic basis for cross-lingual speech synthesis. The pressing key problems and technical challenges are how to learn this shared information between languages and how to establish a cross-lingual mapping model. In recent years, with the development of statistical speech synthesis, cross-lingual mapping models have been built from large-scale bilingual corpora recorded by a single speaker. However, such bilingual corpora are very difficult to build, especially for low-resource languages, so cross-lingual speech synthesis based on small-scale corpora is an urgent research direction. The main work and contributions of this thesis can be summarized as follows:

1. We propose a cross-lingual mapping model based on speaker adaptation and implement cross-lingual speech synthesis from a small-scale corpus using hidden Markov models (HMMs). In HMM-based statistical speech synthesis, HMM states are commonly chosen as the shared units, based on the association between HMM state sequences and pronunciation units, and the cross-lingual mapping model is estimated from a single speaker's large-scale bilingual corpus. Such a corpus is hard to obtain, and it is difficult to build an accurate mapping model from different speakers' bilingual corpora because of the acoustic differences between speakers. To address this, the thesis establishes the cross-lingual mapping model with speaker adaptation on a single speaker's small-scale bilingual corpus; experimental results show that the model establishes more accurate mapping relationships between the shared units of different languages.

2. We propose a speech synthesis method based on bidirectional long short-term memory (BLSTM) networks and build a multilingual BLSTM speech synthesis system as a research platform for cross-lingual synthesis. BLSTM-based synthesis learns a mapping from input features to output acoustic features; the difficulty lies in constructing input features that carry rich context information. To solve this, we add linguistic, phonetic, and prosodic information to the input features. We study BLSTM models for different languages and build a multilingual BLSTM system that synthesizes both Mandarin and English, experimenting with different BLSTM structures and parameter configurations to find the optimal setup. Subjective experiments show that the BLSTM results are better than those of the HMM baseline.

3. We propose BLSTM-based speech synthesis for low-resource languages, using a multilingual BLSTM to learn cross-lingual shared information that benefits the low-resource language. Compared with HMM-based synthesis, BLSTM-based synthesis usually needs a large amount of training data to obtain an accurate model; with limited training data, especially for a low-resource language, the quality of the synthesized speech declines. We therefore design a multilingual BLSTM model that learns cross-lingual information from a rich-resource language and transfers it to the low-resource language, improving the accuracy of the predicted parameters. In objective evaluation, this approach significantly improves the accuracy of the predicted parameters of the low-resource BLSTM model: voiced/unvoiced swapping errors drop by 2%, the log-spectral distance by 2.3 dB, and the root-mean-square error by 7 Hz.
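In HMM-based cross-lingual synthesis, shared state-level units across languages must be paired somehow; one common way (sketched here independently of the speaker-adaptation approach the thesis actually proposes) is to match states by a distance between their output distributions. The sketch below pairs hypothetical single-Gaussian HMM states of two languages by symmetric Kullback-Leibler divergence; all state names and parameters are invented for illustration.

```python
import math

def sym_kld(m1, v1, m2, v2):
    """Symmetric KL divergence between 1-D Gaussians N(m1, v1) and N(m2, v2)."""
    kl12 = 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + math.log(v2 / v1))
    kl21 = 0.5 * (v2 / v1 + (m1 - m2) ** 2 / v1 - 1.0 + math.log(v1 / v2))
    return kl12 + kl21

# Hypothetical single-Gaussian HMM states, (mean, variance), for two languages.
mandarin_states = {"sh_s2": (1.2, 0.30), "a1_s3": (-0.4, 0.25)}
english_states  = {"s_s2": (1.1, 0.28), "aa_s3": (-0.5, 0.22), "t_s1": (3.0, 0.5)}

# Map every Mandarin state to its acoustically closest English state.
mapping = {
    src: min(english_states, key=lambda tgt: sym_kld(*params, *english_states[tgt]))
    for src, params in mandarin_states.items()
}
```

With these made-up parameters, the fricative and vowel states of one language pair up with their counterparts in the other, which is the kind of shared-unit mapping the adaptation-based model refines.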
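The BLSTM input features described in contribution 2 (phonetic identity plus linguistic and prosodic context) can be sketched as a per-frame vector builder. The phone set, feature choices, and dimensions below are illustrative assumptions, not the thesis's actual feature set; real systems derive hundreds of such features from context-dependent labels.

```python
# Hypothetical phone inventory; a real front end would use the full phone set.
PHONES = ["sil", "a", "i", "u", "sh", "t"]

def frame_features(phone, position_in_phone, tone, is_stressed):
    """Build one BLSTM input vector: phone one-hot + prosodic context."""
    one_hot = [1.0 if p == phone else 0.0 for p in PHONES]
    prosody = [position_in_phone,        # 0..1, fraction of the phone elapsed
               float(tone),              # e.g. Mandarin tone 0-4 (0 = none)
               1.0 if is_stressed else 0.0]  # lexical stress flag (English)
    return one_hot + prosody

# One frame a quarter of the way through an unstressed, toneless "sh".
vec = frame_features("sh", 0.25, 0, False)
```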
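The multilingual transfer idea of contribution 3, hidden layers shared across languages with language-specific output layers, can be sketched with plain feed-forward algebra. A ReLU layer stands in for the BLSTM, and all sizes and weights are made up; the point is only the parameter-sharing structure, under which training on the rich-resource language also updates the weights the low-resource language uses.

```python
import random

random.seed(0)

def linear(x, W):
    """y = W x for a list-of-lists weight matrix W and a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

IN, HID, OUT = 9, 16, 5  # illustrative dimensions

# Hidden layer shared across languages (the real model stacks BLSTM layers here).
shared_W = rand_matrix(HID, IN)
# Language-specific output layers: only these differ between languages.
heads = {"mandarin": rand_matrix(OUT, HID), "english": rand_matrix(OUT, HID)}

def predict(x, language):
    h = [max(0.0, v) for v in linear(x, shared_W)]  # ReLU stand-in for BLSTM
    return linear(h, heads[language])

x = [0.1] * IN
y_zh = predict(x, "mandarin")
y_en = predict(x, "english")
```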
Keywords/Search Tags: cross-lingual, speech synthesis, hidden Markov model, recurrent neural network, low-resource