Research Of Voice Style Transfer For End-to-End Speech Synthesis

Posted on:2021-07-07

Degree:Master

Type:Thesis

Country:China

Candidate:S J Yan

GTID:2518306017959849

Subject:Computer technology

Abstract/Summary:

As artificial intelligence technology has developed, the mode of human-computer interaction has changed several times. From the earliest keyboard-and-mouse mode, to touch screens and handwriting input, and then to today's intelligent voice interaction, intelligent products and devices have become far easier to use. Speech synthesis is an important part of intelligent voice interaction and plays a decisive role in making smart devices convenient and comfortable to use. Within speech synthesis, personalized speech synthesis remains a technical problem to be overcome. Traditional speech synthesis adopts a front-end and back-end architecture, which requires detailed design of both models and a certain background in linguistics and acoustics. This thesis instead takes an end-to-end speech synthesis model, whose design does not require expert knowledge of a specific field, as the baseline for studying personalized speech synthesis, and on that basis studies speech style transfer and speaker transfer.

The baseline end-to-end system consists of the Tacotron2 feature prediction model and the WaveRNN vocoder. A speech style transfer module is added to it. Multiple global style tokens are used to represent the global style information of the speech, with each token representing a different level of style. Combining the tokens with different weights yields a global style representation of the speech. This representation is fed to the Tacotron2 feature prediction model, which predicts the acoustic features, and the WaveRNN vocoder then synthesizes speech in the specified style. The experimental results show that, given a designated style reference utterance, the speech style transfer model can synthesize speech whose style is highly similar to that of the reference.

To generate voices with timbres outside the training data set, this thesis adds to the speech style transfer model a speaker identification network taken from the field of text-independent speaker verification. The network extracts timbre representations of different speakers and feeds them to the speech style transfer model, where they control the timbre of the synthesized speech. The experimental results show that, given a speaker reference utterance and a style reference utterance, the speech style transfer model combined with speaker transfer can synthesize speech that has the timbre of the designated speaker and a style highly similar to that of the style reference.
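The token-weighting idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the thesis implementation: the function names, the toy vectors, the use of single-head scaled dot-product attention to obtain the weights, and the choice to condition the encoder by concatenation are all assumptions made for illustration. The abstract states only that tokens are combined with different weights and that the style and timbre representations are sent to the feature prediction model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(reference, tokens):
    """Assumed weighting scheme: scaled dot product between a reference
    embedding and each style token, normalized with softmax."""
    scale = math.sqrt(len(reference))
    scores = [sum(r * t for r, t in zip(reference, tok)) / scale for tok in tokens]
    return softmax(scores)

def style_embedding(tokens, weights):
    """Global style representation as a weighted sum of the style tokens."""
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

def condition_encoder(encoder_outputs, style, speaker):
    """Assumed conditioning: append the same style and speaker vectors
    to every encoder time step."""
    return [frame + style + speaker for frame in encoder_outputs]

# Toy values (hypothetical; real tokens are learned during training).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference = [2.0, 0.0]            # stand-in for a reference-encoder output
speaker = [0.5, -0.5, 0.25]       # stand-in for a speaker (timbre) embedding
encoder_outputs = [[0.1, 0.2], [0.3, 0.4]]

weights = attention_weights(reference, tokens)
style = style_embedding(tokens, weights)
conditioned = condition_encoder(encoder_outputs, style, speaker)
```

At synthesis time the weights could also be set by hand instead of being derived from a reference utterance, for example `style_embedding(tokens, [1.0, 0.0, 0.0])` to select a single token's style.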

Keywords/Search Tags:

Speech synthesis, end-to-end model, speech style transfer, speaker transfer

Related items

1	Speech Style Transfer And Emotion Synthesis Based On Deep Learning And Transfer Learning
2	Research On Personalized Speech Synthesis Based On Deep Speech Representations
3	Research And Implementation Of Speech Synthesis Method For Helping Old Robots
4	The Application Of HMM In Parameter-Based Text-To-Speech System
5	Research On Statistical Parametric Mandarin-Tibetan Cross-lingual Speech Synthesis
6	Research And Implementation Of Multi-Speaker Speech Synthesis System For Audio Novels
7	Research On Statistical Parametric Speech Synthesis Integrating Speech Production Mechanisms
8	Whisper To Speech Conversion And Whisper Recognition Modeling Method
9	Research On Statistical Parametric Speech Synthesis Based On Speaker Adaptive Training
10	Research And Implementation Of Speech Synthesis Based On FastSpeech