
A Study On Speech Synthesis And Visual Speech Synthesis Based On Neural Networks

Posted on: 2017-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: B Fan
Full Text: PDF
GTID: 2348330536452860
Subject: Computer technology
Abstract/Summary:
Speech synthesis is a technology that transforms text into speech; it is one of the core technologies for building a human-machine speech interface (HMSI) system. In visual speech synthesis, the input features (text or speech) are transformed into facial animation to achieve a multi-modal HMSI. The hidden Markov model (HMM) is widely used in both speech synthesis and visual speech synthesis, but the HMM assumes that features can be clustered, which leads to an inaccurate characterization of the feature space and to over-smoothing of the generated feature parameters. To address these problems, we choose the neural network as the statistical model and apply it successfully to both speech synthesis and visual speech synthesis.

Firstly, this thesis presents speech synthesis systems based on neural networks in detail. After a study of the fundamental principles of neural networks, two speech synthesis systems are built, based on a deep neural network (DNN) and a recurrent neural network (RNN) respectively, with HMM-based speech synthesis as the baseline. Both subjective and objective experiments show that, compared with the baseline, the neural-network-based systems perform better. In particular, the RNN is essentially a sequential learner and therefore performs best among the three systems.

Secondly, a high-quality speech synthesis framework is proposed. To parameterize time-domain speech signals into speech features, a vocoder is typically used. Most vocoders adopt a minimum-phase hypothesis, which ignores the natural mixed-phase characteristics of speech signals and causes an apparent degradation of waveform quality. To achieve high-quality synthesis, we propose a phase-embedded waveform representation framework that requires joint magnitude-phase modeling; the quality of the synthesized speech is clearly improved, and experimental analysis confirms the effectiveness of the proposed approach.

Finally, a visual speech synthesis system based on neural networks is proposed. We use the active appearance model (AAM) to model the face image, which provides a good solution for modeling face images directly. The relation between the input features and the AAM parameters can be learned by a statistical model, where the input features can be text, speech, or both. The performances of the HMM and the RNN are compared and analyzed through experiments. The visual parameters predicted by the statistical model are over-smoothed, which makes the synthesized facial animation slightly blurred. This problem is solved by trajectory tiling, which selects the optimal image sequence from a database of real images.
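The abstract gives no implementation detail; as a minimal sketch of why an RNN is a sequential learner, the toy Elman cell below (scalar weights, pure Python; all names and values are illustrative, not the thesis's model) shows that identical inputs at different time steps produce different outputs, because the recurrent hidden state carries the input history forward:

```python
import math

def rnn_forward(xs, w_xh, w_hh, w_hy):
    """Toy Elman RNN forward pass: each output depends on the whole
    input history through the recurrent hidden state h."""
    h = 0.0
    ys = []
    for x in xs:
        h = math.tanh(w_xh * x + w_hh * h)  # hidden state mixes input and memory
        ys.append(w_hy * h)                 # predicted acoustic parameter
    return ys

# Toy scalar weights; a real system uses weight matrices trained by
# backpropagation through time on linguistic/acoustic feature pairs.
outputs = rnn_forward([1.0, 0.0, 0.0], w_xh=1.0, w_hh=0.9, w_hy=2.0)
```

The second and third inputs are both 0.0, yet their outputs differ because the hidden state still remembers the initial pulse; a frame-wise DNN, by contrast, would map equal inputs to equal outputs.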
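The phase-embedded representation is only described at a high level here; the sketch below (illustrative, not the thesis's vocoder) merely shows the magnitude-phase decomposition that joint modeling would operate on, and why a magnitude-only (phase-discarding) representation cannot reconstruct the original spectrum:

```python
import cmath

def to_mag_phase(spectrum):
    """Split complex spectral bins into magnitude and phase sequences."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def from_mag_phase(mags, phases):
    """Rebuild complex spectral bins from magnitude and phase."""
    return [m * cmath.exp(1j * p) for m, p in zip(mags, phases)]

spectrum = [1 + 1j, -2 + 0.5j]          # toy complex spectral bins
mags, phases = to_mag_phase(spectrum)
rebuilt = from_mag_phase(mags, phases)   # exact round trip
zero_phase = from_mag_phase(mags, [0.0] * len(mags))  # phase discarded
```

Keeping both streams allows an exact round trip, while `zero_phase` differs from the original; this is the information a minimum-phase vocoder throws away and that joint magnitude-phase modeling retains.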
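Trajectory tiling is likewise only named; a hypothetical one-dimensional sketch of the idea is given below, selecting one real-database unit per frame by dynamic programming over a target cost (closeness to the predicted trajectory) and a concatenation cost (smoothness between chosen units). The function name and the scalar cost model are assumptions for illustration, not the thesis's implementation:

```python
def tile_trajectory(target, database, join_weight=1.0):
    """Viterbi-style unit selection: pick one database value per frame,
    minimising target cost + join_weight * concatenation cost."""
    n, m = len(target), len(database)
    cost = [[0.0] * m for _ in range(n)]  # cost[t][j]: best total ending in unit j
    back = [[0] * m for _ in range(n)]    # backpointers for the optimal path
    for j in range(m):
        cost[0][j] = abs(target[0] - database[j])
    for t in range(1, n):
        for j in range(m):
            best_i = min(
                range(m),
                key=lambda i: cost[t - 1][i]
                + join_weight * abs(database[i] - database[j]),
            )
            cost[t][j] = (
                abs(target[t] - database[j])
                + cost[t - 1][best_i]
                + join_weight * abs(database[best_i] - database[j])
            )
            back[t][j] = best_i
    # Trace back the optimal sequence of database units.
    j = min(range(m), key=lambda jj: cost[n - 1][jj])
    path = [j]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [database[j] for j in path]
```

With `join_weight=0.0` the selection degenerates to nearest-neighbour lookup per frame; a positive weight trades fidelity to the (over-smoothed) prediction for smooth joins between real samples, which is what removes the blur while keeping the animation sharp.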
Keywords/Search Tags:speech synthesis, visual speech synthesis, hidden Markov model, neural network, active appearance model