
Research On Tibetan Voice Conversion Based On Deep Learning

Posted on: 2021-01-18    Degree: Master    Type: Thesis
Country: China    Candidate: G Y Zhao    Full Text: PDF
GTID: 2415330629488956    Subject: Engineering
Abstract/Summary:
Voice conversion is a speech waveform modification technique that changes non-semantic information while leaving the source speaker's semantic content unchanged, so that the converted voice carries the target speaker's personal characteristics. At present, most mainstream voice conversion techniques assume a parallel corpus. For a low-resource language such as Tibetan, however, parallel corpora are very expensive to acquire, and the alignment of acoustic features is also prone to errors. This thesis therefore studies Tibetan voice conversion under both parallel and non-parallel corpus conditions. The main work is as follows:

(1) The theory and workflow of voice conversion are reviewed, and the WORLD vocoder is used both to extract acoustic parameters from speech and to synthesize speech.

(2) The design of a text corpus in the Tibetan WeiZang dialect for voice conversion is studied, laying the foundation for Tibetan voice conversion. The text corpus should cover the various phoneme combinations of the WeiZang dialect and keep the frequencies of different phonemes as balanced as possible to avoid data sparseness. Once the text corpus is designed, the corresponding speech corpus is recorded and segmented.

(3) Under the parallel corpus condition, a Deep Neural Network (DNN) and a Generative Adversarial Network (GAN) are introduced for converting Tibetan speech spectral parameters. Extensive experiments show that both the DNN and the GAN can achieve Tibetan voice conversion, and that both perform better than conversion based on a Gaussian mixture model (GMM).

(4) Because Tibetan parallel corpora are limited, this thesis also studies more flexible and general Tibetan voice conversion under the non-parallel corpus condition. Building on the GAN above, Tibetan voice conversion methods based on the CycleGAN and StarGAN networks are proposed. Extensive experiments show that the conversion performance of the CycleGAN-based method is close to that of GMM-based conversion under parallel corpus conditions. Moreover, the CycleGAN-based method achieves bidirectional "one-to-one" conversion, whereas the GMM method performs only one-directional "one-to-one" conversion. The conversion performance of the StarGAN-based method is worse than that of GMM-based conversion under parallel corpus conditions, but the StarGAN method implements "many-to-many" conversion, which is more flexible and efficient.
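As an illustration of why a CycleGAN-style method can train without a parallel corpus, the sketch below computes the standard L1 cycle-consistency term in plain Python. It is a minimal example under stated assumptions, not the thesis's implementation: the list-of-floats frame representation, the function names, and the toy "generators" (simple arithmetic functions standing in for trained networks) are all hypothetical. Each frame is compared only with its own round-trip reconstruction, so the source and target sets need not be aligned or even equal in size, and both conversion directions are constrained at once.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one frame of spectral features (hypothetical representation)
Mapping = Callable[[Frame], Frame]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    source_frames: Sequence[Frame],
    target_frames: Sequence[Frame],
    g_src_to_tgt: Mapping,
    g_tgt_to_src: Mapping,
) -> float:
    """L1 cycle loss over two unpaired sets of frames.

    Forward cycle:  source -> target -> source
    Backward cycle: target -> source -> target
    """
    forward = [l1_distance(g_tgt_to_src(g_src_to_tgt(x)), x) for x in source_frames]
    backward = [l1_distance(g_src_to_tgt(g_tgt_to_src(y)), y) for y in target_frames]
    return sum(forward) / len(forward) + sum(backward) / len(backward)


# Toy stand-ins for the two generators (assumptions, not trained networks).
def shift_up(frame: Frame) -> Frame:
    return [v + 1.0 for v in frame]


def shift_down(frame: Frame) -> Frame:
    return [v - 1.0 for v in frame]


def halve(frame: Frame) -> Frame:
    return [v * 0.5 for v in frame]


# Unpaired data: three source frames, one target frame.
source = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
target = [[10.0, 11.0]]

# Exact inverses reconstruct every frame, so the cycle loss vanishes.
perfect = cycle_consistency_loss(source, target, shift_up, shift_down)

# Mappings that are not inverses of each other are penalized.
imperfect = cycle_consistency_loss(source, target, shift_up, halve)
```

In an actual CycleGAN this term is added to the adversarial losses of the two generators; the sketch covers only the cycle term.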
Keywords/Search Tags: Tibetan, Voice Conversion, parallel/non-parallel corpus, Generative Adversarial Network (GAN)