
Research on Voice Style Transfer for End-to-End Speech Synthesis

Posted on: 2021-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: S J Yan
Full Text: PDF
GTID: 2518306017959849
Subject: Computer technology
Abstract/Summary:
As artificial intelligence technology has developed, the mode of human-computer interaction has changed several times. From the earliest keyboard and mouse, to touch screens and handwriting input, to today's intelligent voice interaction, smart products and devices have become much easier to use. Speech synthesis is an important part of intelligent voice interaction and plays a decisive role in making smart devices convenient and comfortable to use. Within speech synthesis, personalized speech synthesis is a technical problem that still needs to be solved. Traditional speech synthesis uses a front-end and back-end architecture, which requires detailed design of both models and background knowledge of linguistics and acoustics. This thesis uses an end-to-end speech synthesis model, whose design does not require expert knowledge of a specific field, as the baseline for studying personalized speech synthesis, and on that basis investigates speech style transfer and speaker transfer.

The baseline end-to-end system consists of the Tacotron2 feature prediction model and a WaveRNN vocoder. A speech style transfer module is added to it. Multiple global style tokens represent the global style information of the speech, with each token representing a different level of style. Combining the tokens with different weights yields the global style representation of an utterance. This representation is fed to the Tacotron2 feature prediction model to predict acoustic features, and the WaveRNN vocoder then synthesizes speech in the specified style. The experimental results show that, given a style reference utterance, the style transfer model can synthesize speech whose style is highly similar to that of the reference.

To generate voices with timbres outside the training data set, the thesis adds to the style transfer model a speaker identification network from the field of text-independent speaker verification. The network extracts timbre representations of different speakers and feeds them to the style transfer model, where they control the timbre of the synthesized speech. The experimental results show that, given a speaker reference utterance and a style reference utterance, the style transfer model combined with speaker transfer can synthesize speech that has the timbre of the designated speaker and a style highly similar to that of the style reference.
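The weighted combination of global style tokens described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the token count, the embedding size, the random token values, and the use of scaled dot-product attention to derive the weights are all assumptions made for illustration, and the names `attention_weights` and `style_embedding` are hypothetical. It shows only the core idea that a style embedding is a weighted sum of a shared bank of token vectors.

```python
import math
import random


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, tokens):
    """Score each token against a reference encoding (assumed: scaled dot product)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    return softmax(scores)


def style_embedding(tokens, weights):
    """Return the weighted sum of the style token vectors."""
    if len(tokens) != len(weights):
        raise ValueError("need exactly one weight per style token")
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


# Hypothetical sizes; the abstract does not state the real ones.
NUM_TOKENS, DIM = 4, 8
rng = random.Random(0)

# Stand-ins for learned token vectors and for a reference-encoder output.
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
reference_encoding = [rng.uniform(-1, 1) for _ in range(DIM)]

# Style taken from a reference utterance: weights come from attention.
weights = attention_weights(reference_encoding, tokens)
embedding = style_embedding(tokens, weights)

# Style chosen by hand: put all the weight on the first token.
manual = style_embedding(tokens, [1.0, 0.0, 0.0, 0.0])
```

In a full system the resulting vector would condition the feature prediction model; how it is attached to Tacotron2 in the thesis is not stated in the abstract, so that step is left out here.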
Keywords/Search Tags:Speech synthesis, end-to-end model, speech style transfer, speaker transfer