
Robust Speech Synthesis Based On Small Amount Of Corpus

Posted on: 2022-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Tian
Full Text: PDF
GTID: 2518306749471894
Subject: Computer Software and Application of Computer
Abstract/Summary:
As a key technology for human-computer interaction, speech synthesis is one of the important research directions in artificial intelligence, and speech synthesis based on deep learning is becoming increasingly mature. Its core is to build a non-linear model from text to speech that maps a text sequence to a sequence of speech spectrum frames. However, current mainstream speech synthesis models face a contradiction between the large demand for and short supply of high-quality single-speaker training data, as well as robustness problems such as repeated or missing pronunciations when synthesizing long texts. Aiming at the lack of high-quality single-speaker corpora and the poor robustness of long-text synthesis, this thesis carries out the following work:

(1) Based on a time-delay neural network, x-vector speaker-discriminative embeddings were extracted and combined with a variational autoencoder to build a speaker-representation extraction network. Speaker-representation latent variables were extracted from a small amount of target-speaker corpus, decoupling the acoustic representation under low-resource conditions.

(2) Since Tacotron2, the mainstream speech synthesis model, cannot synthesize speech in the target speaker's timbre from a small amount of corpus, the speaker-representation extraction network was fused with Tacotron2 to build a speech synthesis model that can synthesize speech from a small amount of target-speaker data.

(3) To improve the robustness of long-text synthesis, the attention scores of the previous speech frame were used to smooth abnormal attention scores of the current frame, on the basis of the forward-attention method. Further, considering that preceding and following speech frames exert different degrees of influence, a forward-attention method with a constraint factor was introduced to smooth abnormal attention scores adaptively.

The model was trained on the Biaobei Chinese female-voice dataset and tested with a small number of target-speaker utterances randomly selected from the THCHS-30 dataset, verifying that it can synthesize speech from a small amount of target-speaker data. The synthesized speech achieved MOS scores of 3.65 for naturalness and 3.723 for similarity, close to 80% of the scores of real speech. For robustness, 100 texts were selected for verification; the sentence error rate was only 2%, which was 24% lower than that of the model without the forward-attention method. Under the forward-attention method, the speaking rate of the synthesized speech was inversely proportional to the bias of the constraint factor, demonstrating that the model can control speaking rate. In terms of naturalness, the scores increased by 6.0% and 8.5% respectively; in terms of similarity, they improved by 5.1% and 7.0% respectively, both of which show that the robustness of the model was improved. These experiments verify the effectiveness of the proposed robust speech synthesis method based on a small amount of corpus.
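The abstract does not give the exact formulation of the constrained forward-attention smoothing, so the following is only a minimal NumPy sketch of the general idea: the previous frame's attention distribution is turned into a forward prior (attention may stay on the same text position or advance by one), and a blending weight `lam`, used here as a stand-in for the thesis's constraint factor, controls how strongly that prior suppresses abnormal scores in the current frame. The function name and the blending rule are assumptions for illustration, not the thesis's definitive implementation.

```python
import numpy as np

def forward_attention_smooth(alpha_prev, alpha_cur, lam=0.5):
    """Sketch of forward-attention smoothing with a constraint factor.

    alpha_prev, alpha_cur: 1-D arrays of attention weights over the N text
    positions for the previous and current speech frame, each summing to 1.
    lam: hypothetical constraint factor in [0, 1]; 0 keeps the raw current
    scores, 1 relies fully on the forward prior built from the previous frame.
    """
    # Forward prior: probability mass can stay on the same text position
    # or move one position ahead (the monotonic forward-attention assumption).
    shifted = np.concatenate(([0.0], alpha_prev[:-1]))
    prior = alpha_prev + shifted

    # Blend the raw current scores with the prior-weighted scores; a larger
    # lam lets the previous frame smooth out abnormal spikes more strongly.
    blended = (1.0 - lam) * alpha_cur + lam * prior * alpha_cur

    # Renormalize so the smoothed weights again form a distribution.
    return blended / (blended.sum() + 1e-8)
```

In this sketch, tuning `lam` trades off fidelity to the raw attention scores against the monotonic prior, which is one plausible way the constraint factor could influence how quickly attention advances through the text, and hence the speaking rate reported in the experiments.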
Keywords/Search Tags:Speech synthesis, Small amount of corpus, Variational autoencoder, Long text robustness, Forward attention