
Research On Speech Generation Using Articulatory Features And Deep Learning

Posted on: 2019-01-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z C Liu
Full Text: PDF
GTID: 1318330542997979
Subject: Signal and Information Processing
Abstract/Summary:
Articulatory features are quantitative descriptions of the positions and motions of the articulators, including the tongue, teeth, and lips, during human speech production. In the three-level hierarchy of the speech production process, they convey information at the physiological level. They are closely related to acoustic features, have explicit physical interpretations, and are robust to environmental noise. Research on speech-related tasks that integrate articulatory features has therefore attracted considerable attention in recent years. This dissertation focuses on two speech generation tasks using articulatory features: articulatory-to-acoustic conversion and statistical parametric speech synthesis (SPSS) integrating articulatory features.

Articulatory-to-acoustic conversion models the mapping from articulatory features to acoustic features, aiming to restore intelligible and natural speech waveforms when only articulatory features are available. It can be applied to silent speech interfaces (SSI), controllable speech synthesis, speaker and accent conversion, and related tasks. Most existing research has focused on articulatory-to-spectrum conversion, while conversion from articulatory features to excitation-related features such as power, unvoiced/voiced decisions, and fundamental frequency has rarely been studied. In addition, the Gaussian mixture model (GMM) has been the most popular model for describing the mapping from articulatory features to acoustic features, which may lead to poor prediction accuracy and degraded generated speech. Statistical parametric speech synthesis integrating articulatory features introduces articulatory features into the acoustic model of statistical parametric speech synthesis to improve the accuracy of predicting acoustic features from text and the naturalness of synthetic speech. Statistical parametric speech synthesis is now the mainstream approach to text-to-speech (TTS): it is highly automatic, produces smooth synthetic speech, and is flexible across applications. Existing research on statistical parametric speech synthesis integrating articulatory features has mainly been conducted under hidden Markov model (HMM) or hidden trajectory model (HTM) frameworks. Deep learning methods such as the deep feedforward network (DFN) and the recurrent neural network (RNN) have been successfully applied to statistical parametric speech synthesis in recent years, but articulatory features have not yet been involved.

This dissertation studies deep learning-based speech generation methods integrating articulatory features, covering two aspects: articulatory-to-acoustic conversion and statistical parametric speech synthesis integrating articulatory features. The detailed contents are as follows.

First, a deep learning-based articulatory-to-acoustic conversion method is studied. To overcome the GMM's weakness in modeling the non-linear relationship between features, this dissertation proposes to apply the DFN and the RNN to articulatory-to-acoustic conversion. Experiments show that the proposed methods outperform the GMM in prediction accuracy and in the quality of the generated speech. Moreover, this dissertation investigates the feasibility of predicting excitation-related features, including power, unvoiced/voiced decisions, and fundamental frequency, from articulatory features. The goal of generating speech waveforms from articulatory features alone is thereby achieved.
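To make the first contribution concrete, the sketch below shows a minimal feed-forward (DFN-style) regression network that maps per-frame articulatory measurements to acoustic parameters. This is an illustrative sketch, not the dissertation's actual configuration: the dimensions (18 articulatory channels in, 41 acoustic parameters out, e.g., spectral coefficients plus excitation features) and the layer sizes are assumed placeholders.

```python
# A minimal sketch of a DFN-style articulatory-to-acoustic regressor.
# All dimensions and hyperparameters below are hypothetical assumptions.
import torch
import torch.nn as nn

class ArticulatoryToAcousticDFN(nn.Module):
    def __init__(self, art_dim=18, acoustic_dim=41, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(art_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, acoustic_dim),  # linear output for regression
        )

    def forward(self, art_frames):
        # art_frames: (batch, art_dim) articulatory measurements per frame
        return self.net(art_frames)

model = ArticulatoryToAcousticDFN()
dummy = torch.randn(4, 18)   # four hypothetical articulatory frames
print(model(dummy).shape)    # torch.Size([4, 41])
```

An RNN variant would replace the hidden layers with a recurrent layer operating over the frame sequence, which is how the dissertation exploits the temporal continuity of articulator motion.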
Second, an articulatory-to-acoustic conversion method with linguistic knowledge and cascaded prediction is studied. Articulatory features are limited in conveying all the information necessary for articulatory-to-acoustic conversion. To overcome this limitation, this dissertation proposes two strategies to augment the input features. On the one hand, a classifier that maps articulatory features to phoneme labels is trained, and linguistic knowledge extracted from this classifier is added to the input of the conversion model. On the other hand, a cascaded architecture is proposed, in which spectral features are predicted first and then used to boost the prediction accuracy for excitation-related features. Experiments demonstrate the effectiveness of the proposed methods.

Next, a deep learning-based acoustic modeling method integrating articulatory features for speech synthesis is studied. This dissertation proposes to introduce articulatory features into deep learning-based acoustic modeling for speech synthesis, improving the conventional acoustic model's prediction accuracy and the naturalness of synthetic speech. Within a multi-task learning framework, three architectures are investigated: a simple multi-task learning-based acoustic model, a hierarchical speech production multi-task learning-based acoustic model, and a structured output layer (SOL) multi-task learning-based acoustic model. Experimental results show that, compared with a conventional deep learning-based acoustic model, all three proposed methods improve the naturalness of synthetic speech to different extents, and the SOL multi-task learning-based acoustic model achieves the best performance both objectively and subjectively.

Finally, a distillation learning-based acoustic modeling method for speech synthesis is studied. Distillation learning is a recently proposed method for knowledge transfer. This dissertation studies the distillation algorithm for regression tasks under the neural network framework and proposes a distillation-based acoustic modeling method for speech synthesis to further exploit the potential of articulatory features in acoustic modeling. Experiments show that, compared with common multi-task learning-based methods, the proposed method further improves the prediction accuracy of acoustic features and the naturalness of the synthetic speech. Moreover, when articulatory features are not available, the proposed method can also work by substituting other acoustic features, such as line spectral pairs (LSP) and short-time Fourier transform (STFT) spectra, for them.
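To illustrate the knowledge-transfer idea behind the final contribution, the following is a minimal sketch of a distillation loss for regression, assuming a trained teacher model whose frame-level acoustic predictions guide a student that lacks articulatory input. The function name, the MSE criterion, and the weight `alpha` are illustrative assumptions, not the dissertation's exact formulation.

```python
# A minimal sketch of distillation learning for a regression task:
# the student fits a blend of the ground-truth acoustic targets and the
# teacher's predictions ("soft targets"). alpha is a hypothetical weight.
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    hard = F.mse_loss(student_out, target)                 # ground-truth term
    soft = F.mse_loss(student_out, teacher_out.detach())   # teacher term
    return alpha * hard + (1.0 - alpha) * soft
```

Setting `alpha = 1.0` recovers ordinary supervised training, so the teacher term can be read as a regularizer that injects what the teacher learned from articulatory data into the student.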
Keywords/Search Tags:speech generation, acoustic model, articulatory features, deep learning