
Study On Chinese Speech Synthesis Methods Based On Deep Learning

Posted on: 2022-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2518306509477354
Subject: Information and Communication Engineering
Abstract/Summary:
Speech synthesis is a technique that converts given text into speech; it is widely used in mobile phone voice assistants, audiobooks, song synthesis, map navigation, and other fields. In recent years, with the rapid development of neural network theory, speech synthesis methods based on deep learning have become a research hotspot and have made important progress. These methods usually adopt an end-to-end model, which can synthesize speech with high quality and good naturalness. However, end-to-end models typically have many parameters and require heavy computation, demanding large storage space and high computing capability from hardware devices, so it is difficult to achieve real-time synthesis on devices with low computing power. Aiming at low-complexity Chinese end-to-end speech synthesis, this thesis proposes solutions based on an autoregressive model and a feed-forward model, respectively. The main work of this thesis is as follows:

(1) An autoregressive speech synthesis model based on depthwise separable convolution (DSC) and a gated residual network (GRN) is proposed. The depthwise separable convolution effectively reduces the number of parameters and the amount of computation in the model. The gated residual network stacks multiple DSC layers with different dilation coefficients to enlarge the convolutional receptive field, so that the encoder and decoder can extract longer-term context from sequences, improving the model's ability to fit text features and spectral features. The model also uses a multi-head attention mechanism to improve the alignment stability between text features and spectral features. For Chinese speech synthesis, a Chinese text preprocessing method is introduced, and the influence of different input types on model performance is compared.

(2) To address the training difficulty of the DSC-based model and the slow inference of autoregressive models, a feed-forward speech synthesis model based on the Ghost module and a residual network is proposed. The model is fully convolutional and contains a duration predictor. The Ghost module replaces the depthwise separable convolution, so the parameter and computation amounts of the model can be effectively reduced by adjusting the module's compression ratio. For alignment, the duration predictor enforces hard alignment between text features and spectral features, which effectively reduces mispronunciations, skipped words, and repetitions. The impact of ground-truth duration sequences extracted by different methods on model performance is also compared.

Different evaluation metrics are used to evaluate the proposed solutions. Experimental results show that, compared with mainstream autoregressive models, the proposed autoregressive model has fewer parameters and faster synthesis while preserving the quality of the synthesized speech. The proposed feed-forward model further reduces the number of parameters and, with a lightweight vocoder, greatly improves synthesis speed: on a single CPU core, it synthesizes speech 24 times faster than real-time playback, and the mean opinion score (MOS) of the synthesized speech is 3.98, only about 0.1 lower than that of mainstream large-parameter feed-forward models. In addition, the model adapts well to corpora from different speakers; fine-tuned with a small amount of data, it can synthesize speech with high naturalness and similarity.
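To make the parameter savings of depthwise separable convolution and the receptive-field growth of stacked dilated layers concrete, the following is a minimal arithmetic sketch. The channel count (256) and kernel sizes are illustrative assumptions, not values taken from the thesis:

```python
def conv1d_params(c_in, c_out, k):
    """Weight count of a standard 1-D convolution (bias ignored)."""
    return c_in * c_out * k

def dsc_params(c_in, c_out, k):
    """Depthwise separable convolution: one k-tap filter per input
    channel, followed by a 1x1 pointwise convolution."""
    return c_in * k + c_in * c_out

def receptive_field(kernel, dilations):
    """Receptive field of a stack of dilated convolutions,
    as used in a gated residual network."""
    return 1 + sum((kernel - 1) * d for d in dilations)

c = 256
print(conv1d_params(c, c, 5))            # 327680 weights (standard conv)
print(dsc_params(c, c, 5))               # 66816 weights (~4.9x smaller)
print(receptive_field(3, [1, 2, 4, 8]))  # 31 frames from only 4 layers
```

Stacking DSC layers with exponentially growing dilation coefficients widens the context window roughly linearly in the sum of dilations while the parameter count grows only linearly in depth, which is why the encoder and decoder can see long-range context cheaply.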
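The Ghost module's compression works by producing only a fraction of the output channels with an ordinary convolution and generating the rest with cheap depthwise operations. A rough 1-D parameter-count sketch, with assumed channel counts and a hypothetical cheap-operation kernel size `d`:

```python
def ghost_params(c_in, c_out, k, s, d=3):
    """Approximate weight count of a 1-D Ghost module.

    s is the compression ratio: a primary convolution produces
    c_out // s 'intrinsic' channels, and cheap d-tap depthwise
    filters generate the remaining (s - 1) ghost maps per channel.
    """
    m = c_out // s            # intrinsic channels from the primary conv
    primary = c_in * m * k    # ordinary convolution weights
    cheap = m * (s - 1) * d   # one depthwise filter per ghost feature map
    return primary + cheap

print(ghost_params(256, 256, 5, s=2))  # 164224 weights
print(ghost_params(256, 256, 5, s=4))  # 82496 weights
```

Doubling the compression ratio roughly halves the dominant primary-convolution cost, which matches the thesis's observation that the model size can be tuned by adjusting the module's compression ratio.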
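The hard alignment performed by a duration predictor can be sketched as a length regulator: each text-side feature is simply repeated for its predicted number of spectral frames, so no attention-based soft alignment (and none of its skipping or repeating failure modes) is involved. The pinyin tokens below are illustrative placeholders for real feature vectors:

```python
def expand_by_duration(text_feats, durations):
    """Hard alignment: repeat each text-side feature vector
    'duration' times so the sequence matches the number of
    spectral frames the decoder must produce."""
    frames = []
    for feat, dur in zip(text_feats, durations):
        frames.extend([feat] * dur)
    return frames

# "ni3" spans 3 frames, "hao3" spans 2 frames:
expand_by_duration(["ni3", "hao3"], [3, 2])
# -> ["ni3", "ni3", "ni3", "hao3", "hao3"]
```

Because the output length is fixed by the durations up front, every spectral frame is deterministically tied to exactly one input token, which is why this scheme avoids mispronounced, skipped, or repeated syllables.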
Keywords/Search Tags:End-to-End Speech Synthesis, Autoregressive Model, Feedforward Model, Depthwise Separable Convolution, Ghost Module