One-shot Voice Conversion Algorithm Design And Implementation Based On Representations Separation

Posted on:2021-05-15

Degree:Master

Type:Thesis

Country:China

Candidate:Y Chen

Full Text:PDF

GTID:2428330611465562

Subject:Engineering

Abstract/Summary:

Voice conversion (VC), a branch of speech recognition and speech synthesis, plays an important role in text-to-speech, the film and television industry, information security, and speech translation. In recent years, continuous innovation in deep learning and neural network methods has further accelerated voice conversion research. Voice conversion converts a source speaker's speech into a target speaker's voice without changing the speech content. Current voice conversion technology has the following problems: (1) most voice conversion algorithms apply only to a limited number of speakers and cannot convert between arbitrary speakers, which greatly restricts their use; (2) mainstream techniques do not separate the representations of the source speaker's speech and the target speaker's speech well; (3) the quality of the converted speech produced by most models still has problems.

To address problem (1), this thesis builds a voice conversion model based on an encoder-decoder structure. The encoder consists of two parts: a speaker encoder and a content encoder. The speaker encoder extracts the target speaker's features from the target speaker's speech and generates a representation carrying the target speaker's timbre, called the speaker representation (or speaker features). The content encoder extracts the content features from the source speaker's speech and generates a representation containing the source speaker's speech content, called the content representation. The decoder decodes the speaker representation and the content representation to generate speech in the target speaker's voice with the source speaker's speech content. The model needs only the speech of an arbitrary source speaker and an arbitrary target speaker as input to convert between any two speakers, which is also known as one-shot voice conversion.

To address problem (2), this thesis optimizes the basic speaker verification model to obtain SVINGE2E (Speaker Verification with Instance Normalization using Generalized End-to-End loss). Compared with the basic speaker verification model, the optimized model improves by 41.72%. The trained SVINGE2E is used as the speaker encoder in the voice conversion model and can effectively extract the speaker's timbre information. During training of the voice conversion model, the content encoder is optimized with a content loss function so that it effectively removes the source speaker information from the speech and extracts the content information in the source speaker's speech.

To address problem (3) and improve the quality of the generated speech, a progressive training method is proposed for the voice conversion model. The first step uses the reconstruction loss function as the model loss to train the model's ability to reconstruct the speech spectrum. The second step uses the reconstruction loss function and the content loss function together as the model loss, so the model optimizes the content encoder while reconstructing the speech spectrum. Experiments show that the progressive training method produces better voice quality.

Through these improvements, this thesis constructs and implements an arbitrary-speaker voice conversion algorithm based on representation separation. Experimental results verify the effectiveness of the algorithm, and the conversion quality reaches a good level.
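
The two-step progressive training objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the abstract does not give the form of either loss, so the mean absolute error, the idea of comparing the content representation of the input with that of the reconstructed spectrum, the function names, and the `content_weight` parameter are all assumptions made for this example.

```python
from typing import Sequence

# A spectrogram or a content representation, stored as frames x feature bins.
Matrix = Sequence[Sequence[float]]


def mean_abs_error(a: Matrix, b: Matrix) -> float:
    """Mean absolute difference between two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("inputs must have the same shape")
    count = sum(len(row) for row in a)
    if count == 0:
        raise ValueError("inputs must not be empty")
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / count


def progressive_loss(
    stage: int,
    mel: Matrix,
    mel_rec: Matrix,
    content: Matrix,
    content_rec: Matrix,
    content_weight: float = 1.0,  # hypothetical weight, not stated in the abstract
) -> float:
    """Loss for one training step under the two-step schedule.

    Stage 1 uses the reconstruction loss only.
    Stage 2 adds the content loss to the reconstruction loss.
    """
    reconstruction = mean_abs_error(mel, mel_rec)
    if stage == 1:
        return reconstruction
    if stage == 2:
        # Assumed content loss: distance between the content representation
        # of the input and that of the reconstructed spectrum.
        return reconstruction + content_weight * mean_abs_error(content, content_rec)
    raise ValueError("stage must be 1 or 2")
```

In a training loop, `stage` would be 1 until spectrum reconstruction is adequate and 2 afterwards. When to switch, and how to weight the two terms, are choices the abstract does not specify.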

Keywords/Search Tags:

One-shot Voice Conversion, Speaker Verification, Speaker Representations, Content Representations, Representations Separation, Progressive Training Method

Related items

1	Research On A New Method Of Speaker Verification
2	Research On Voiceprint Verification Technology In Multi-speaker Scenarios Based On Deep Learning
3	Joint time-frequency representations of nonstationary signals
4	Incorporating indexicality and contingency into the design of representations for computer-mediated collaboration
5	Multiscale and directional representations of high-dimensional information content in remotely sensed data
6	The Research Of The Speaker Recognition System Using Low-Dimensional Vector Representations
7	Neural representations used by brain regions underlying speech production
8	Speaker Verification Under Emotional Voice And Implementation
9	The effects of event organization in database representations on user data retrieval performance
10	Media Construction and Representations of Legitimate/Illegitimate Citizens