
A Study On Speech Synthesis And Visual Speech Synthesis Based On Neural Networks

Posted on: 2017-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: B Fan
Full Text: PDF
GTID: 2348330536452860
Subject: Computer technology
Abstract/Summary:
Speech synthesis is a technology that transforms text into speech; it is one of the core technologies for building a human-machine speech interface (HMSI) system. In visual speech synthesis, the input features (text or speech) are transformed into facial animation to achieve a multi-modal HMSI. The hidden Markov model (HMM) is widely used in both speech synthesis and visual speech synthesis, but the HMM assumes that features can be clustered, which leads to an inaccurate characterization of the feature space and to over-smoothing of the generated feature parameters. To address these problems, we choose the neural network as the statistical model and apply it successfully to both speech synthesis and visual speech synthesis.

Firstly, this thesis presents speech synthesis systems based on neural networks in detail. After a study of the fundamental principles of neural networks, two speech synthesis systems are built, based on a deep neural network (DNN) and a recurrent neural network (RNN) respectively, with HMM-based speech synthesis as the baseline. Both subjective and objective experiments show that, compared with the baseline, the neural-network-based systems perform better. In particular, the RNN is essentially a sequential learner and therefore performs best among the three systems.

Secondly, a high-quality speech synthesis framework is proposed. To parameterize time-domain speech signals into speech features, a vocoder is typically used. Most vocoders adopt a minimum-phase hypothesis, which ignores the natural mixed-phase characteristics of speech signals and causes an apparent degradation of waveform quality. To achieve high-quality synthesis, we propose a phase-embedded waveform representation framework that requires joint magnitude-phase modeling; the quality of the synthesized speech is clearly improved, and experimental analysis confirms the effectiveness of the proposed approach.

Finally, a visual speech synthesis system based on neural networks is proposed. We use the active appearance model (AAM) to model the face image, which provides a good solution for modeling face images directly. The relation between the input features and the AAM parameters can be learned by a statistical model, where the input features can be text, speech, or both. The performances of the HMM and the RNN are compared and analyzed through experiments. The visual parameters predicted by the statistical model are over-smoothed, which makes the synthesized facial animation slightly blurred. This problem is solved by trajectory tiling, which selects the optimal image sequence from a database of real images.
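The abstract gives no implementation detail; as a minimal sketch of why an RNN is a sequential learner, the toy Elman cell below (scalar weights, pure Python; all names and values are illustrative, not the thesis's model) shows that identical inputs at different time steps produce different outputs, because the recurrent hidden state carries the input history forward:

```python
import math

def rnn_forward(xs, w_xh, w_hh, w_hy):
    """Toy Elman RNN forward pass: each output depends on the whole
    input history through the recurrent hidden state h."""
    h = 0.0
    ys = []
    for x in xs:
        h = math.tanh(w_xh * x + w_hh * h)  # hidden state mixes input and memory
        ys.append(w_hy * h)                 # predicted acoustic parameter
    return ys

# Toy scalar weights; a real system uses weight matrices trained by
# backpropagation through time on linguistic/acoustic feature pairs.
outputs = rnn_forward([1.0, 0.0, 0.0], w_xh=1.0, w_hh=0.9, w_hy=2.0)
```

The second and third inputs are both 0.0, yet their outputs differ because the hidden state still remembers the initial pulse; a frame-wise DNN, by contrast, would map equal inputs to equal outputs.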
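The phase-embedded representation is only described at a high level here; the sketch below (illustrative, not the thesis's vocoder) merely shows the magnitude-phase decomposition that joint modeling would operate on, and why a magnitude-only (phase-discarding) representation cannot reconstruct the original spectrum:

```python
import cmath

def to_mag_phase(spectrum):
    """Split complex spectral bins into magnitude and phase sequences."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def from_mag_phase(mags, phases):
    """Rebuild complex spectral bins from magnitude and phase."""
    return [m * cmath.exp(1j * p) for m, p in zip(mags, phases)]

spectrum = [1 + 1j, -2 + 0.5j]          # toy complex spectral bins
mags, phases = to_mag_phase(spectrum)
rebuilt = from_mag_phase(mags, phases)   # exact round trip
zero_phase = from_mag_phase(mags, [0.0] * len(mags))  # phase discarded
```

Keeping both streams allows an exact round trip, while `zero_phase` differs from the original; this is the information a minimum-phase vocoder throws away and that joint magnitude-phase modeling retains.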
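Trajectory tiling is likewise only named; a hypothetical one-dimensional sketch of the idea is given below, selecting one real-database unit per frame by dynamic programming over a target cost (closeness to the predicted trajectory) and a concatenation cost (smoothness between chosen units). The function name and the scalar cost model are assumptions for illustration, not the thesis's implementation:

```python
def tile_trajectory(target, database, join_weight=1.0):
    """Viterbi-style unit selection: pick one database value per frame,
    minimising target cost + join_weight * concatenation cost."""
    n, m = len(target), len(database)
    cost = [[0.0] * m for _ in range(n)]  # cost[t][j]: best total ending in unit j
    back = [[0] * m for _ in range(n)]    # backpointers for the optimal path
    for j in range(m):
        cost[0][j] = abs(target[0] - database[j])
    for t in range(1, n):
        for j in range(m):
            best_i = min(
                range(m),
                key=lambda i: cost[t - 1][i]
                + join_weight * abs(database[i] - database[j]),
            )
            cost[t][j] = (
                abs(target[t] - database[j])
                + cost[t - 1][best_i]
                + join_weight * abs(database[best_i] - database[j])
            )
            back[t][j] = best_i
    # Trace back the optimal sequence of database units.
    j = min(range(m), key=lambda jj: cost[n - 1][jj])
    path = [j]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [database[j] for j in path]
```

With `join_weight=0.0` the selection degenerates to nearest-neighbour lookup per frame; a positive weight trades fidelity to the (over-smoothed) prediction for smooth joins between real samples, which is what removes the blur while keeping the animation sharp.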
Keywords/Search Tags:speech synthesis, visual speech synthesis, hidden Markov model, neural network, active appearance model