| Voice is one of the most important means of human communication. Voice conversion is an important research direction within speech synthesis. Its goal is to process an utterance with an algorithm so that it sounds as if it were spoken by another person, while preserving the original meaning. Voice conversion is widely used in scenarios such as voice interaction, voice customization, and the entertainment industry. With the development of deep learning in recent years, voice conversion has made remarkable progress. As one of its most important sub-directions, zero-shot voice conversion has attracted extensive attention. Although researchers have proposed many voice conversion algorithms for various scenarios, zero-shot voice conversion remains a challenging task. In recent years, zero-shot voice conversion methods have mostly been based on an auto-encoder framework with a carefully designed bottleneck. However, this approach is not sufficiently generative, which limits further improvement of zero-shot voice conversion. To address these problems, this thesis proposes a zero-shot voice conversion method based on generative adversarial networks. The main research contents are as follows: (1) A zero-shot voice conversion framework based on a generative adversarial network is proposed. For speakers who do not appear in the dataset, the algorithm uses a timbre encoder to extract timbre features from the input speech and a content encoder to generate content distribution features from the speech of any other speaker. The algorithm separates timbre and content information through conversion-reconstruction cycle training and learns to synthesize new speech. At the same time, an adversarial loss improves the quality of voice conversion and the generalization performance. Experimental results show that the proposed algorithm achieves higher-quality zero-shot voice conversion. (2) In
the work described in (1), speech is decomposed only into timbre and content, which is not accurate enough. From an acoustic perspective, the information in speech can be decomposed more completely into four components: content, timbre, rhythm, and pitch. Therefore, building on research content (1), this thesis proposes a zero-shot voice conversion framework for arbitrary components based on a generative adversarial network. The four kinds of speech information are decomposed and embedded through an information encoder and sequential rescaling, and are then reconstructed by a generator. Experimental results show that the algorithm achieves voice conversion of arbitrary components, with better applicability and generality in zero-shot voice conversion. This research expands and improves the practical application of voice conversion technology. |