High-quality Voice Conversion From Non-parallel Corpora Based On Variational Auto-encoder And Bottleneck Feature

Posted on: 2019-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Ling
GTID: 2428330566999285
Subject: Electronic and communication engineering
Abstract/Summary:
Speech is the signal a speaker produces when vocalizing. It is easy to collect and carries several kinds of natural information, including semantic content, the speaker's personal characteristics, and emotion. Voice conversion changes the personality characteristics of the source speaker's speech so that it takes on those of the target speaker, while keeping the semantic information unchanged.

In recent years, deep learning has attracted great attention and a worldwide research boom. Some researchers have applied deep learning models to voice conversion and made encouraging progress, since such models can capture the intrinsic features of complex signals and thereby improve research efficiency. As research on deep learning intensifies, new concepts and models are being applied to voice conversion and solve various practical problems. Applying deep learning to voice conversion can also help advance other areas of speech signal processing and further improve the efficiency of intelligent speech devices and intelligent human-computer interaction. The study of voice conversion with deep learning methods therefore has broad prospects and far-reaching theoretical and practical value.

This thesis focuses on a voice conversion model based on a variational auto-encoder (VAE) and Bottleneck features. In the training stage of the VAE decoder, the label feature in the hidden layer is not fully utilized. To address this, the Bottleneck feature obtained from a deep neural network (DNN) is used as the speaker label. This algorithm takes full advantage of the label features in the VAE model and improves voice conversion performance. Furthermore, for the case where the training data of the target speaker is limited, a method of intervening in the DNN training process is proposed, which addresses the many-to-many (M2M) voice conversion problem by enriching the target speaker's personality feature space.

Experimental analysis shows that the Mel-cepstral distortion (MCD) of the proposed method is lower than that of the baseline system, decreasing by 5.39% on average under non-parallel corpora training conditions, which reflects better spectral similarity between the converted speech and the target speech. The PESQ-MOS value is also higher, increasing by 24% on average, which indicates that the voice quality of the model is better. In the VAE+Bottleneck experiment where the target speaker is not fully trained and the DNN training process is intervened in, 29.0% of the listening-test results show no difference between the sufficient and limited training data conditions. Overall, the analytical and experimental results show that the converted speech obtained by the proposed method has higher spectral similarity and higher PESQ-MOS values than the baseline, indicating an improvement in both spectral similarity and speech quality.
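
For readers unfamiliar with the MCD metric reported above, the following is a minimal sketch of one commonly used definition, not the thesis's own evaluation code. It assumes the target and converted Mel-cepstral frames are already time-aligned, and it assumes the 0th (energy) coefficient is excluded; the abstract states neither detail. The function names `mel_cepstral_distortion` and `relative_reduction_percent` are hypothetical.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    target: Sequence[Sequence[float]],
    converted: Sequence[Sequence[float]],
    skip_c0: bool = True,
) -> float:
    """Average MCD in dB over time-aligned frames.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2).
    Assumption: frames are already aligned one-to-one.
    """
    if len(target) != len(converted) or len(target) == 0:
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    start = 1 if skip_c0 else 0  # assumption: drop the energy term c0
    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        if len(t_frame) != len(c_frame):
            raise ValueError("frames must have the same dimension")
        squared = sum(
            (t - c) ** 2 for t, c in zip(t_frame[start:], c_frame[start:])
        )
        total += scale * math.sqrt(2.0 * squared)
    return total / len(target)


def relative_reduction_percent(baseline: float, proposed: float) -> float:
    """Relative decrease of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline
```

Under this definition a lower MCD means the converted spectrum is closer to the target, and a figure such as "decreased by 5.39%" would correspond to `relative_reduction_percent(baseline_mcd, proposed_mcd)` averaged over the test conditions.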
Keywords/Search Tags: AHOcoder, MFCC, Variational Auto-encoder (VAE), Bottleneck feature, Many-to-Many (M2M) voice conversion