Non-Parallel Many-to-many Voice Conversion Method Based On Adaptive Trans-StarGAN

Posted on: 2022-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z T He
GTID: 2518306557968969
Subject: Electronics and Communications Engineering
Abstract/Summary:
Voice conversion is an important branch of speech synthesis. Its purpose is to convert a source speaker's speech so that it sounds like the target speaker's; that is, the converted speech retains the semantic content of the original while taking on the target speaker's personality characteristics. By type of training corpus, voice conversion methods can be divided into parallel and non-parallel methods. Non-parallel methods do not require the target speaker to utter the same content as the source speaker, so they are more practical. In recent years, with the rapid development of deep learning, various neural network models have been successfully applied to voice conversion, and the star generative adversarial network (StarGAN) is one of the popular models. Two important indicators for evaluating voice conversion methods are naturalness and speaker similarity. This paper analyzes the StarGAN model with respect to these two indicators and proposes a series of improvements.

To address the insufficient semantic representation produced by the multi-layer convolutional neural networks of the StarGAN model, this paper proposes a voice conversion method based on the Transitive StarGAN model. Inspired by the idea of shortcut connections, the improved method establishes connections between the encoder and decoder networks of the generator. It makes full use of the hierarchical features of the multi-layer convolutional networks, which strengthens the semantic learning ability of the generator and gives the generated spectrum more complete fundamental frequency and harmonic information. Experimental results show that, compared with the baseline method, the spectrogram of the speech converted by the proposed method retains more structural information and has a more natural and clearer texture. The generator loss converges faster and to a lower value. The MOS and ABX results increase by 21.62% and 3.90%, respectively, indicating that the proposed model has good generation quality.

The StarGAN model treats voice conversion as a domain conversion task. When it learns the mapping for many-to-many voice conversion, it does not sufficiently learn speaker personality characteristics. To solve this problem, this paper proposes a voice conversion method based on the Adaptive Trans-StarGAN model, an improved version of Transitive StarGAN. The proposed method uses a speaker style network to extract a speaker embedding from the target speaker's speech, and feeds the embedding as the style input to adaptive instance normalization (AdaIN) layers, which perform style transfer on the hierarchical features of the encoder network. This effectively improves the generator's ability to learn personality characteristics. Experimental results show that, compared with the baseline method, the spectrogram produced by the proposed method is closer in texture to that of the target speaker, and the proposed model maintains a low loss value with almost no increase in computational cost. The MOS and ABX results increase by 24.66% and 6.65%, respectively, indicating that the proposed model is effective in improving speech quality and learning speaker personality characteristics.
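
The adaptive instance normalization step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `style_from_embedding` and `adain`, the linear projection from speaker embedding to per-channel scale and shift, and the `[channels][frames]` feature layout are all assumptions made for illustration, and plain Python lists stand in for the tensors a real model would use.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # assumed layout: [channels][frames]


def style_from_embedding(
    embedding: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Hypothetical linear projection of a speaker embedding.

    The first half of the projected vector is used as the per-channel
    scale (gamma) and the second half as the per-channel shift (beta).
    """
    projected = [
        sum(w * e for w, e in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]
    half = len(projected) // 2
    return projected[:half], projected[half:]


def adain(
    features: Matrix,
    gamma: Sequence[float],
    beta: Sequence[float],
    eps: float = 1e-5,
) -> Matrix:
    """Normalize each channel over time, then apply the style scale and shift."""
    styled: Matrix = []
    for channel, g, b in zip(features, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        styled.append([g * (x - mean) / std + b for x in channel])
    return styled


# Toy example: 2 channels, 4 frames, 2-dimensional speaker embedding.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 14.0, 14.0]]
embedding = [0.5, -1.0]
weight = [[2.0, 0.0], [0.0, -1.0], [1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0, 0.0, 0.0]

gamma, beta = style_from_embedding(embedding, weight, bias)
styled = adain(features, gamma, beta)
```

After this step, each channel's mean over time equals the shift derived from the speaker embedding and its standard deviation is close to the derived scale, so the content features carry the target speaker's statistics rather than the source speaker's.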
Keywords/Search Tags: voice conversion, StarGAN, shortcut connection, speaker embedding, AdaIN