Rendering Speech Across Speaker And Language Difference

Posted on: 2020-01-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F L Xie
Full Text: PDF
GTID: 1368330590972974
Subject: Computer Science and Technology
Abstract/Summary:
Rendering speech across speaker and language differences is an important research topic in the field of speech signal processing. It comprises two sub-problems: 1) rendering speech across speakers, i.e., voice conversion; and 2) rendering speech across both speaker and language, i.e., cross-lingual TTS. Voice conversion changes the voice timbre and prosody of speech from a source speaker to a target speaker without changing the linguistic content. Cross-lingual TTS constructs a target speaker's L2 (second-language) text-to-speech system from his or her L1 (first-language) speech, with the help of speech from a native L2 speaker. Rendering speech across speaker and language has practical value and real demand in quite a few fields. However, owing to the limited training data available in real scenarios and to bottlenecks in the modeling methods, the naturalness and speaker similarity of speech rendered across speaker and language are still unsatisfactory, and there is a long way to go to meet industrial requirements. This thesis studies these two topics, voice conversion and cross-lingual TTS, in depth and systematically, covering both system construction and key technology improvements. The detailed research works are as follows.

Firstly, when parallel training data is available for the voice conversion task, neural-network-based voice conversion is implemented, and its performance is further improved by a newly proposed training criterion: the minimum sequence error (MSE) criterion. Minimum sequence error training not only considers the whole sequence when training the neural network with back-propagation, but also eliminates the inconsistency between the training objective function and the synthesis target at test time. In addition, pitch is transformed jointly with the spectral features in the neural network. Experimental results show that the minimum sequence error criterion outperforms the minimum frame error criterion when a neural network is used as the regression model for voice conversion: on the CMU ARCTIC test set, LSD is lowered by 0.15 dB relative to the frame-error-trained baseline, and the naturalness (91% vs. 6%) and speaker similarity (88% vs. 7%) of the converted speech are both significantly improved over that baseline.
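To make the contrast concrete, here is a minimal PyTorch sketch of the two criteria. The parameter-generation step is reduced to a fixed moving-average smoother purely for illustration, and all tensor shapes and names are assumptions rather than the thesis' actual implementation:

    import torch
    import torch.nn.functional as F

    def frame_error(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Minimum frame error: every frame is scored independently of its neighbours.
        return ((pred - target) ** 2).mean()

    def sequence_error(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Stand-in for parameter generation: smooth the predicted trajectory first,
        # then score it, so training back-propagates through the same sequence-level
        # output that is produced at synthesis time.
        dim = pred.shape[1]
        kernel = torch.full((dim, 1, 3), 1.0 / 3.0)  # 3-tap moving average per dimension
        traj = F.conv1d(pred.t().unsqueeze(0), kernel, padding=1, groups=dim)
        return ((traj.squeeze(0).t() - target) ** 2).mean()

    # Toy usage: 100 frames of 40-dimensional spectral features.
    pred = torch.randn(100, 40, requires_grad=True)
    target = torch.randn(100, 40)
    sequence_error(pred, target).backward()  # gradients couple neighbouring frames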
Secondly, a new voice conversion framework (KLD-DNN) that requires only the target speaker's speech is proposed, built on a speaker-independent neural network (SI-DNN) acoustic model and KL divergence. The ASR senone posterior space output by the speaker-independent network is used to equalize the speaker difference between the source and target speakers, and KL divergence is used to measure the phonetic distortion between acoustic units; a minimal sketch of this KLD matching is given after the summary below. For the target speaker's different acoustic units, 1) supervised TTS senones, 2) unsupervised phonetic clusters, and 3) unsupervised frames, different post-processing methods are proposed to smooth the acoustic trajectory. Experimental results show that this method outperforms the baseline approach, which uses a neural network as a regression model and requires parallel training data: on the CMU ARCTIC test set, LSD is lowered by 0.5 dB relative to the neural network baseline, and the naturalness (60% vs. 22%) and speaker similarity (65% vs. 35%) of the converted speech are both improved.

Thirdly, building on the idea of equalizing speaker differences with a speaker-independent neural network, frame selection in the SI-DNN phonetic space combined with a WaveNet vocoder is proposed for voice conversion. The WaveNet vocoder does not rely on the source-filter assumption of the speech production mechanism; instead it models the waveform sample sequence directly with a convolutional neural network. Experimental results show that frame-selection-based voice conversion with a WaveNet vocoder significantly outperforms the KLD-DNN method in subjective tests on the CMU ARCTIC database, in both naturalness (80% vs. 7%) and speaker similarity (76% vs. 8%).

Finally, a KLD- and SI-DNN-based approach to cross-lingual TTS synthesis is proposed. Based on the assumption that speech in different languages shares, to an extent, the same acoustic-phonetic space at the sub-phonemic or speech-frame level, a speaker-independent deep neural network trained only on L1 speech is used to equalize the speaker difference between the L2 reference speaker and the L1 target speaker. In the supervised mode, minimum KL divergence is used to map leaf nodes between the target speaker's L1 decision tree model and the reference speaker's L2 decision tree model; in the unsupervised mode, the target speaker's L2 frames are used to fill the leaf nodes of the reference speaker's L2 decision tree, weighted by KL divergence. The target speaker's L2 decision tree model can then be constructed, achieving speech rendering across both speaker and language. Experimental results show that the proposed method significantly outperforms the trajectory-tiling baseline: LSD on the test set is lowered by 0.89 dB, and the DMOS for speaker similarity in subjective tests improves by 0.6 (2.9 -> 3.5).

In summary, this thesis proposes three voice conversion methods: 1) neural network conversion trained with sequence error minimization, 2) KLD-DNN-based conversion, and 3) frame-selection and WaveNet-based conversion. They respectively address 1) the mismatch between the neural network's training objective and its test-time objective, 2) the equalization of speaker differences between speakers, and 3) the poor naturalness of converted speech caused by traditional vocoders, and step by step they significantly improve the naturalness and speaker similarity of the converted speech. KLD-DNN-based cross-lingual TTS synthesis is also proposed to equalize speaker differences across both speakers and languages, significantly improving the speaker similarity of cross-lingual synthesized speech.
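As a concrete illustration of the KLD matching shared by the last three methods, the sketch below represents each acoustic unit by an averaged senone posterior vector from a speaker-independent network and matches units by symmetric KL divergence. The function names and toy data are hypothetical; this is a minimal sketch, not the thesis implementation:

    import numpy as np

    def symmetric_kld(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
        # Symmetric KL divergence between two senone posterior vectors.
        p = np.clip(p, eps, None); p = p / p.sum()
        q = np.clip(q, eps, None); q = q / q.sum()
        return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    def match_units(src_posts: np.ndarray, tgt_posts: np.ndarray) -> np.ndarray:
        # For each source unit, pick the target unit with minimum symmetric KLD,
        # i.e. the phonetically closest unit in the speaker-equalized space.
        return np.array([
            int(np.argmin([symmetric_kld(p, q) for q in tgt_posts]))
            for p in src_posts
        ])

    # Toy usage: 5 source units and 8 target units over a 4-senone posterior space.
    rng = np.random.default_rng(0)
    src = rng.dirichlet(np.ones(4), size=5)
    tgt = rng.dirichlet(np.ones(4), size=8)
    print(match_units(src, tgt))  # index of the closest target unit per source unit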
Keywords/Search Tags: voice conversion, deep neural network, KL divergence, WaveNet vocoder, cross-lingual TTS