
Study On Chinese Speech Synthesis Methods Based On Deep Learning

Posted on: 2022-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2518306509477354
Subject: Information and Communication Engineering
Abstract/Summary:
Speech synthesis is a technique that converts given text into speech; it is widely used in mobile phone voice assistants, audiobooks, song synthesis, map navigation, and other fields. In recent years, with the rapid development of neural network theory, speech synthesis methods based on deep learning have become a research hotspot and have made important progress. These methods usually adopt an end-to-end model, which can synthesize speech with high quality and good naturalness. However, end-to-end models typically have many parameters and require heavy computation, demanding large storage space and high computing capability from hardware devices, so it is difficult to achieve real-time synthesis on devices with low computing power. Aiming at low-complexity Chinese end-to-end speech synthesis, this thesis proposes solutions based on an autoregressive model and a feed-forward model, respectively. The main work of this thesis is as follows:

(1) An autoregressive speech synthesis model based on depthwise separable convolution (DSC) and a gated residual network (GRN) is proposed. The depthwise separable convolution effectively reduces the number of parameters and the amount of computation in the model. The gated residual network stacks multiple DSC layers with different dilation coefficients to enlarge the convolutional receptive field, so that the encoder and decoder can extract longer-term context from sequences, improving the model's ability to fit text features and spectral features. The model also uses a multi-head attention mechanism to improve the alignment stability between text features and spectral features. For Chinese speech synthesis, a Chinese text preprocessing method is introduced, and the influence of different input types on model performance is compared.

(2) To address the training difficulty of the DSC-based model and the slow inference of autoregressive models, a feed-forward speech synthesis model based on the Ghost module and a residual network is proposed. The model is fully convolutional and contains a duration predictor. The Ghost module replaces the depthwise separable convolution, so the parameter and computation amounts of the model can be effectively reduced by adjusting the module's compression ratio. For alignment, the duration predictor enforces hard alignment between text features and spectral features, which effectively reduces mispronunciations, skipped words, and repetitions. The impact of ground-truth duration sequences extracted by different methods on model performance is also compared.

Different evaluation metrics are used to evaluate the proposed solutions. Experimental results show that, compared with mainstream autoregressive models, the proposed autoregressive model has fewer parameters and faster synthesis while preserving the quality of the synthesized speech. The proposed feed-forward model further reduces the number of parameters and, with a lightweight vocoder, greatly improves synthesis speed: on a single CPU core, it synthesizes speech 24 times faster than real-time playback, and the mean opinion score (MOS) of the synthesized speech is 3.98, only about 0.1 lower than that of mainstream large-parameter feed-forward models. In addition, the model adapts well to corpora from different speakers; fine-tuned with a small amount of data, it can synthesize speech with high naturalness and similarity.
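To make the parameter savings of depthwise separable convolution and the receptive-field growth of stacked dilated layers concrete, the following is a minimal arithmetic sketch. The channel count (256) and kernel sizes are illustrative assumptions, not values taken from the thesis:

```python
def conv1d_params(c_in, c_out, k):
    """Weight count of a standard 1-D convolution (bias ignored)."""
    return c_in * c_out * k

def dsc_params(c_in, c_out, k):
    """Depthwise separable convolution: one k-tap filter per input
    channel, followed by a 1x1 pointwise convolution."""
    return c_in * k + c_in * c_out

def receptive_field(kernel, dilations):
    """Receptive field of a stack of dilated convolutions,
    as used in a gated residual network."""
    return 1 + sum((kernel - 1) * d for d in dilations)

c = 256
print(conv1d_params(c, c, 5))            # 327680 weights (standard conv)
print(dsc_params(c, c, 5))               # 66816 weights (~4.9x smaller)
print(receptive_field(3, [1, 2, 4, 8]))  # 31 frames from only 4 layers
```

Stacking DSC layers with exponentially growing dilation coefficients widens the context window roughly linearly in the sum of dilations while the parameter count grows only linearly in depth, which is why the encoder and decoder can see long-range context cheaply.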
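The Ghost module's compression works by producing only a fraction of the output channels with an ordinary convolution and generating the rest with cheap depthwise operations. A rough 1-D parameter-count sketch, with assumed channel counts and a hypothetical cheap-operation kernel size `d`:

```python
def ghost_params(c_in, c_out, k, s, d=3):
    """Approximate weight count of a 1-D Ghost module.

    s is the compression ratio: a primary convolution produces
    c_out // s 'intrinsic' channels, and cheap d-tap depthwise
    filters generate the remaining (s - 1) ghost maps per channel.
    """
    m = c_out // s            # intrinsic channels from the primary conv
    primary = c_in * m * k    # ordinary convolution weights
    cheap = m * (s - 1) * d   # one depthwise filter per ghost feature map
    return primary + cheap

print(ghost_params(256, 256, 5, s=2))  # 164224 weights
print(ghost_params(256, 256, 5, s=4))  # 82496 weights
```

Doubling the compression ratio roughly halves the dominant primary-convolution cost, which matches the thesis's observation that the model size can be tuned by adjusting the module's compression ratio.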
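The hard alignment performed by a duration predictor can be sketched as a length regulator: each text-side feature is simply repeated for its predicted number of spectral frames, so no attention-based soft alignment (and none of its skipping or repeating failure modes) is involved. The pinyin tokens below are illustrative placeholders for real feature vectors:

```python
def expand_by_duration(text_feats, durations):
    """Hard alignment: repeat each text-side feature vector
    'duration' times so the sequence matches the number of
    spectral frames the decoder must produce."""
    frames = []
    for feat, dur in zip(text_feats, durations):
        frames.extend([feat] * dur)
    return frames

# "ni3" spans 3 frames, "hao3" spans 2 frames:
expand_by_duration(["ni3", "hao3"], [3, 2])
# -> ["ni3", "ni3", "ni3", "hao3", "hao3"]
```

Because the output length is fixed by the durations up front, every spectral frame is deterministically tied to exactly one input token, which is why this scheme avoids mispronounced, skipped, or repeated syllables.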
Keywords/Search Tags:End-to-End Speech Synthesis, Autoregressive Model, Feedforward Model, Depthwise Separable Convolution, Ghost Module