
Non-parallel Many-to-Many Voice Conversion Based On SE-ResNet Combining Speaker Embedding

Posted on: 2021-04-21  Degree: Master  Type: Thesis
Country: China  Candidate: P Cao  Full Text: PDF
GTID: 2428330614965800  Subject: Electronic and communication engineering
Abstract/Summary:
Speech carries semantic content as well as rich speaker-characteristic and emotional information. The main goal of voice conversion (VC) is to convert speech from a source speaker so that it sounds like a target speaker while leaving the semantic content unchanged; it is an active area of personalized speech generation. VC has many applications, such as video dubbing, personalized text-to-speech, and anti-spoofing. Depending on the corpus required for training, VC is divided into parallel-corpus and non-parallel-corpus conversion. Collecting a large number of parallel training utterances in advance is time-consuming and difficult. Moreover, parallel utterances are not available in cross-language and medical assistance systems, which severely restricts the practical application of voice conversion. Research on VC with non-parallel corpora has therefore become both a hotspot and a difficult problem in the field, with wide application prospects and practical significance but also great challenges.

A voice conversion method is considered successful when the identity of the source speaker is effectively converted to that of the target speaker while speech quality and linguistic content are maintained. Existing non-parallel VC methods have two major problems: the speaker similarity of the converted speech is not good enough, and its quality is unsatisfactory. This thesis therefore focuses on non-parallel VC based on the star generative adversarial network (StarGAN) and proposes a series of improvements to raise the speaker similarity and speech quality of converted speech.

First, to further improve speaker similarity, this thesis proposes a StarGAN voice conversion method that incorporates x-vector embeddings. The StarGAN model represents speaker identity with a traditional one-hot vector, which limits the speaker similarity of converted speech. In the proposed method, the x-vector, which carries rich speaker characteristics, is introduced as a speaker representation for many-to-many VC as a complement to the one-hot vector. The x-vector provides abundant speaker information for the synthesized speech, while the one-hot vector serves as an exact label that accurately distinguishes between speakers, so the two are complementary. Objective and subjective experiments show that, compared with the StarGAN baseline, the average mel-cepstral distortion (MCD) of converted speech decreases by 5.41%, the mean opinion score (MOS) increases by 6.64%, and the ABX score increases by 5.12%, indicating that the proposed method significantly improves speaker similarity and also helps improve the quality of converted speech.

Furthermore, to address network degradation in the StarGAN baseline, this thesis proposes a novel voice conversion model based on SE-ResNet StarGAN. The core of this method is an SE-ResNet network placed between the encoding and decoding networks of the generator. With its attention and gating mechanism, SE-ResNet models the dependencies between channels, learns a weight for each feature channel from global information, and selectively strengthens features that contain useful information while suppressing useless ones, which further enhances the representation ability of the model. Objective and subjective experiments show that, compared with the StarGAN baseline, the average MCD decreases by 7.82%, the MOS increases by 11.89%, and the ABX score increases by 3.35%, indicating that the proposed method effectively improves the speech quality and speaker similarity of converted speech.

Finally, a voice conversion method based on SE-R StarGAN-x is proposed by introducing the x-vector into the improved model above. Objective and subjective experiments show that, compared with the StarGAN baseline, the average MCD of converted speech decreases by 9.53%, the MOS increases by 19.58%, and the ABX score increases by 8.66%, demonstrating that the proposed method greatly improves speech quality while effectively improving speaker similarity.
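
The channel re-weighting that the abstract attributes to SE-ResNet can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy weights, the omission of biases, and the identity stand-in for the convolutional residual branch are all assumptions made for illustration. The sketch follows the usual squeeze-and-excitation pattern: pool each channel to a single global value, pass the pooled vector through a small bottleneck to obtain one gate per channel, scale the branch output by those gates, and add the skip connection.

```python
import math


def squeeze(features):
    """Global average pooling: one scalar summary per channel."""
    return [sum(channel) / len(channel) for channel in features]


def excite(summary, w_reduce, w_expand):
    """Bottleneck FC -> ReLU -> FC -> sigmoid: one gate in (0, 1) per channel.

    Biases are omitted here for brevity (an assumption of this sketch).
    """
    hidden = [max(0.0, sum(w * z for w, z in zip(row, summary)))
              for row in w_reduce]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]
    return [1.0 / (1.0 + math.exp(-a)) for a in logits]


def se_residual_block(x, w_reduce, w_expand, branch=lambda feats: feats):
    """Compute out = x + gate * branch(x), channel by channel.

    `branch` stands in for the convolutional layers of a residual unit;
    the identity default is a placeholder, not the thesis architecture.
    """
    u = branch(x)
    gates = excite(squeeze(u), w_reduce, w_expand)
    out = [[xi + g * ui for xi, ui in zip(x_ch, u_ch)]
           for x_ch, u_ch, g in zip(x, u, gates)]
    return out, gates


# Toy input: 2 channels with 2 feature values each (hypothetical numbers).
x = [[1.0, 3.0], [2.0, 2.0]]
w_reduce = [[0.5, 0.5]]      # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channel gates
out, gates = se_residual_block(x, w_reduce, w_expand)
print(gates)
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first is strengthened relative to the second before the skip connection is added. In a trained model the gates would be learned from data rather than fixed by hand.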
Keywords/Search Tags: voice conversion, x-vector, SE-ResNet, StarGAN, WORLD, non-parallel corpora, many-to-many