
Cross-lingual Voice Conversion Based On Mutual Information And SE Attention Mechanism

Posted on: 2024-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Hu
Full Text: PDF
GTID: 2568307136491914
Subject: Electronic information
Abstract/Summary:
Speech is the external form of language and carries rich content information, as well as the speaker's personality and emotional information. The goal of voice conversion is to replace the source speaker's personality information with that of a target speaker while keeping the source speaker's content information unchanged. As an important branch of voice conversion, cross-lingual voice conversion has great application value in academic exchange, international trade and medical assistance. However, because languages differ significantly in phonemes, tone and stress, research on cross-lingual voice conversion is difficult and challenging. With the continuous development of deep neural networks, cross-lingual voice conversion methods based on them have emerged in recent years and achieved good conversion results, but most models cannot perform conversion in the open-set case, that is, for speakers not seen during training. In practical applications, a cross-lingual voice conversion model must be applicable to any speaker, so realizing cross-lingual voice conversion in the open-set case has become an urgent problem. In view of this, this paper proposes a series of improvements addressing both the open-set problem and the performance of cross-lingual voice conversion.

Firstly, to realize cross-lingual voice conversion in the open-set case, this paper proposes a cross-lingual voice conversion model based on mutual information. The model consists of four modules: a content encoder, a speaker encoder, a pitch extractor and a decoder. In the training stage, the content, speaker and pitch representations of the training data are extracted by the content encoder, speaker encoder and pitch extractor respectively. Mutual information is introduced as a correlation measure, and minimizing a mutual information loss reduces the correlation between the three representations so that they are appropriately disentangled. In the conversion stage, the content and pitch representations of the source voice are extracted by the content encoder and pitch extractor, and the speaker representation of the target voice is extracted by the speaker encoder. These representations are fed into the decoder, which fuses and decodes them to obtain the converted Mel spectrum. Finally, the converted voice is generated by the Parallel WaveGAN vocoder. Experimental simulation results show that, compared with a one-shot cross-lingual voice conversion model based on dual encoders, the proposed mutual-information-based model performs significantly better: the average MOS increases by 8.69%, the average ABX increases by 7.70%, and the average MCD decreases by 7.70%. This indicates that the model can realize cross-lingual voice conversion in the open-set case with good performance.

Secondly, to further improve the quality of the converted speech, this paper proposes a cross-lingual voice conversion model based on mutual information and the SE attention mechanism. This model introduces Squeeze-and-Excitation Networks into the content encoder, embedding global context information into the content representation through squeeze and excitation operations. Experimental simulation results show that, compared with the mutual-information-based model, performance improves further: the average MOS increases by 2.29%, the average ABX increases by 3.99%, and the average MCD decreases by 2.83%. This indicates that the model further improves both the quality and the speaker similarity of the converted speech.

In summary, the cross-lingual voice conversion model based on mutual information and the SE attention mechanism proposed in this paper can realize cross-lingual voice conversion in the open-set case and achieves good conversion results, providing an important theoretical discussion and simulation basis for moving cross-lingual voice conversion technology towards practical application.
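
The squeeze and excitation operations mentioned in the abstract can be illustrated with a small sketch. It is a generic, dependency-free version of a squeeze-and-excitation block, assuming the content encoder yields a channels × frames feature map. The function name `se_block`, the weight matrices, the reduction ratio and the toy input are all hypothetical and are not taken from the thesis, whose actual layer sizes and placement are not given here.

```python
import math


def se_block(features, w1, w2):
    """Apply squeeze-and-excitation to a channels x frames feature map.

    features: list of C lists, each holding T frame values.
    w1: (C // r) x C weight matrix of the bottleneck layer (hypothetical).
    w2: C x (C // r) weight matrix of the expansion layer (hypothetical).
    """
    # Squeeze: average each channel over time to get one global descriptor.
    squeezed = [sum(channel) / len(channel) for channel in features]

    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]

    # Excitation, step 2: expansion linear layer followed by a sigmoid gate.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]

    # Scale: reweight every frame of each channel by that channel's gate.
    scaled = [[g * x for x in channel] for g, channel in zip(gates, features)]
    return scaled, gates


# Toy input: 4 channels, 3 frames, reduction ratio r = 2 (all values made up).
features = [[1.0, 2.0, 3.0],
            [0.0, 0.0, 0.0],
            [-1.0, -2.0, -3.0],
            [4.0, 4.0, 4.0]]
w1 = [[0.5, 0.0, 0.0, 0.5],
      [0.0, 0.5, 0.5, 0.0]]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [-1.0, 0.0],
      [0.0, -1.0]]

scaled, gates = se_block(features, w1, w2)
```

Because each gate is computed from the time-averaged descriptor of all channels, every frame of the output is rescaled using information from the whole utterance, which is the sense in which global context is embedded into the representation. In a trained model the two weight matrices would be learned rather than fixed by hand.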
Keywords/Search Tags: voice conversion, cross-lingual voice conversion, mutual information, SE attention mechanism, disentanglement, global context information