Fast One-shot Cross-lingual Voice Conversion Based On Dual Encoders

Posted on: 2022-08-05    Degree: Master    Type: Thesis
Country: China    Candidate: L L Xu    Full Text: PDF
GTID: 2518306557469804    Subject: Electronics and Communications Engineering
Abstract/Summary:
Voice conversion transforms the personality characteristics of speech while preserving its semantic content, so that the converted speech carries the same semantic information as the source speech and the same personality characteristics as the target speaker. As research on voice conversion has deepened, many representative non-parallel conversion methods have been proposed, among which methods based on the variational autoencoder (VAE) and the generative adversarial network (GAN) have become mainstream and achieve good naturalness and similarity in the converted speech. However, three major problems remain. First, mainstream methods mainly study intra-lingual conversion, and cross-lingual voice conversion remains a research difficulty. Second, to produce better conversion results, most models require a large number of training sentences, which increases the user's operational burden in applications and greatly reduces product user-friendliness; the one-shot problem has therefore been raised in recent years and has become a research hotspot in voice conversion. Third, enlarging the network to improve conversion quality places high demands on equipment performance, which hinders engineering deployment of the model.

To address these problems, this paper takes the variational autoencoder as the base model, analyzes its overall structure, and proposes a series of improvements.

First, to break through the language restriction and realize cross-lingual voice conversion, this paper proposes a dual-encoder method based on disentangled and interpretable representations. The model combines two encoders with different functions, so that the input information is disentangled: the sentences to be converted, in different languages, are fed to a content encoder and a speaker encoder, which yield a content representation and a speaker representation respectively. These representations are then passed to the decoder, which generates the converted speech, enabling cross-lingual voice conversion. Because the two encoders encode content information and speaker information separately, the speaker encoder can encode speaker information dynamically, removing the constraint that speaker labels must be available at training time and thus realizing one-shot cross-lingual voice conversion. One-shot means that the source and target speakers need provide only a few sentences during training, or none at all, which improves the model's user-friendliness in applications. Objective evaluation shows that the proposed speaker encoder represents speaker information well. Subjective evaluation shows that the speech converted by the proposed method attains an average MOS of 3.52 and an average ABX preference of 82.50%, indicating that the method achieves good speaker similarity and speech quality for one-shot cross-lingual voice conversion.
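The dual-encoder structure described above can be summarized in code. The following PyTorch sketch is a minimal illustration under stated assumptions, not the thesis's actual architecture: the module names (ContentEncoder, SpeakerEncoder, Decoder), the layer sizes, and the use of instance normalization to strip speaker statistics from the content path are all assumptions made for exposition.

    # Minimal dual-encoder sketch (illustrative; names and sizes are assumptions).
    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):
        """Encodes what is said; instance norm strips speaker statistics."""
        def __init__(self, n_mels=80, dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, dim, kernel_size=5, padding=2),
                nn.InstanceNorm1d(dim), nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.InstanceNorm1d(dim), nn.ReLU(),
            )
        def forward(self, mel):               # mel: (batch, n_mels, frames)
            return self.net(mel)              # (batch, dim, frames)

    class SpeakerEncoder(nn.Module):
        """Encodes who speaks; temporal pooling gives one vector per utterance."""
        def __init__(self, n_mels=80, dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            )
        def forward(self, mel):
            return self.net(mel).mean(dim=2)  # (batch, dim)

    class Decoder(nn.Module):
        """Recombines content frames with a speaker embedding."""
        def __init__(self, n_mels=80, dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(2 * dim, dim, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(dim, n_mels, kernel_size=5, padding=2),
            )
        def forward(self, content, speaker):
            # Broadcast the utterance-level speaker vector over all frames.
            spk = speaker.unsqueeze(2).expand(-1, -1, content.size(2))
            return self.net(torch.cat([content, spk], dim=1))

    def convert(content_enc, speaker_enc, dec, src_mel, tgt_mel):
        """One-shot conversion: content from the source utterance,
        speaker identity from a single target-speaker utterance."""
        return dec(content_enc(src_mel), speaker_enc(tgt_mel))

The key design point is that the speaker encoder pools over time into a single utterance-level vector, so one target sentence suffices to characterize a new speaker at conversion time; this is what makes one-shot conversion possible without speaker labels during training.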
Furthermore, to reduce the model's demands on equipment performance, this paper proposes a fast one-shot cross-lingual voice conversion method based on depthwise separable convolution, replacing some conventional convolutions in the model with depthwise separable convolutions so as to reduce the parameter count and improve running efficiency. Experiments show that, after introducing depthwise separable convolution into the network structure, training is accelerated by 26.13% and the parameter count is reduced by 27.97%, while the average MOS is 3.51 and the average ABX preference is 80.00%. In other words, the method reduces the model's parameter count and training time while essentially maintaining the quality of the converted speech, providing theoretical and simulation groundwork for deploying the model on multiple terminals.
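For reference, the sketch below shows how a depthwise separable convolution factors a conventional convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution. The channel counts and kernel size are illustrative, and the printed parameter counts apply to this single layer only, not to the thesis's full model.

    # Depthwise separable convolution vs. a conventional convolution
    # (illustrative channel counts and kernel size).
    import torch.nn as nn

    class DepthwiseSeparableConv1d(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size, padding=0):
            super().__init__()
            # Depthwise: one filter per input channel (groups=in_ch).
            self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                       padding=padding, groups=in_ch)
            # Pointwise: 1x1 convolution mixes information across channels.
            self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    def n_params(module):
        return sum(p.numel() for p in module.parameters())

    conventional = nn.Conv1d(128, 128, kernel_size=5, padding=2)
    separable = DepthwiseSeparableConv1d(128, 128, kernel_size=5, padding=2)
    print(n_params(conventional))  # 128*128*5 + 128 = 82,048
    print(n_params(separable))     # (128*5 + 128) + (128*128 + 128) = 17,280

For this illustrative layer the factorization removes roughly 79% of the parameters; the thesis's smaller whole-model reduction of 27.97% is consistent with replacing only some of the conventional convolutions.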
Keywords/Search Tags:voice conversion, cross-lingual, one-shot, disentangled representation, variational autoencoder, fast, depthwise separable convolution