Research on the Modeling and Generation of Fundamental Frequencies in Statistical Parametric Speech Synthesis

Posted on: 2016-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: L Gao
Full Text: PDF
GTID: 2308330470957751
Subject: Information and Communication Engineering
Abstract/Summary:
Hidden Markov model (HMM) based statistical parametric speech synthesis has been one of the most popular speech synthesis methods in recent decades. In the training stage, acoustic models are trained on recorded speech databases to describe the distributions of acoustic features, such as spectral parameters and fundamental frequency (F0), in different contexts. In the synthesis stage, given a phone context sequence generated by text analysis, the corresponding HMMs predicted from the trained statistical acoustic models are concatenated, and spectral parameters and F0 are then generated. Finally, a parametric synthesizer is used to generate the speech waveform. Compared with unit selection and waveform concatenation synthesis, HMM-based speech synthesis has many advantages, such as fast and automatic system construction, smooth synthetic speech, and small system size. However, the synthetic speech still sounds less natural than natural speech.

F0 is the frequency of vocal fold vibration during the production of voiced speech, and it is an important acoustic feature. In HMM-based speech synthesis, the accuracy of F0 prediction directly influences the naturalness of synthetic speech. In emotional speech synthesis, F0 differences also play an important role in reflecting different types of emotion. Unlike spectral features, F0 is a suprasegmental feature: F0 trajectories are affected by long-term prosodic features such as intonation, phrase boundaries, and stress. In conventional HMM-based statistical speech synthesis, however, F0 is extracted and modeled in a unified framework with the spectral parameters. This framework does not take long-term F0 features into consideration, which degrades the naturalness of synthetic speech.

In this dissertation, the author investigates F0 modeling and generation methods built on conventional HMM-based parametric speech synthesis.
Syllable-level F0 features, including length-normalized log F0 vectors (FV) and Target Approximation (TA) features, also called qTA parameters, are used in three lines of research. First, we design and implement an F0 modeling method based on qTA parameters. Second, we propose a post-filtering method that uses syllable-level F0 features to improve the naturalness of synthetic speech. Third, the qTA parameters and a post-filtering method based on Gaussian Bidirectional Associative Memories (GBAM) are applied to emotional conversion from synthetic neutral speech to natural emotional speech. For the happy and angry emotions, the proposed method is superior to the conventional adaptation technique in the expressiveness of emotions.

The rest of this dissertation is organized as follows:

Chapter one is the introduction. It reviews the history of speech synthesis techniques and gives a brief introduction to several mainstream speech synthesis methods, emotional speech synthesis, and some background on fundamental frequency.

Chapter two introduces the HMM-based statistical speech synthesis method in detail, including a summary of the method, the core algorithms of the training and synthesis stages, and some existing problems. Finally, the motivation of our research work is stated.

Chapter three introduces an F0 modeling method based on qTA parameters in detail. In the training stage, F0 curves are parameterized at the syllable level based on the TA model, and context-dependent decision trees are then constructed to describe the distributions of qTA parameters in different contexts. In the synthesis stage, the predicted qTA parameters are used to synthesize F0 curves based on the TA model. Finally, the resulting F0 curves are used to generate the final speech waveform, together with the spectral features and durations generated by the conventional HMM-based speech synthesis method.
The experimental results indicate that the proposed method can generate natural synthetic speech, but it may miss some details of the F0 curves.

Chapter four introduces a post-filtering method based on syllable-level F0 features in detail. In the training stage, we first extract syllable-level F0 features, including FV and qTA parameters, from both the F0 curves predicted by the conventional HMM-based synthesis method and the natural F0 curves. Post-filtering models, which comprise a global mean variance equation (GMVE) model, a GBAM model, and an error compensation model, are then constructed to map syllable-level F0 features from synthetic F0 to natural F0. In the synthesis stage, the syllable-level F0 features extracted from synthetic speech are post-filtered to obtain the final converted F0 curves. Subjective experimental results show that the proposed method can significantly improve the naturalness of synthetic speech.

Chapter five introduces an emotional conversion method from synthetic neutral speech to natural emotional speech, based on the TA model. This method is proposed to enable emotional speech synthesis when only small emotional databases are available. We construct a mapping model of qTA parameters from neutral synthetic speech to emotional natural speech, and then convert the F0 features of synthetic neutral speech to those of emotional natural speech. In this dissertation, GBAM is adopted as the mapping model. Experimental results show that for the happy and angry emotions, the proposed method is superior to the conventional Maximum Likelihood Linear Regression (MLLR) based adaptation method in the expressiveness of emotions.

Chapter six concludes the whole dissertation.
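
The synthesis step described for Chapter three, generating a syllable-level F0 curve from qTA parameters, can be illustrated with a minimal sketch. The abstract does not give the model equations, so the code below assumes the commonly published qTA formulation: within each syllable, F0 approaches a linear pitch target m*t + b as the response of a third-order critically damped system with rate lam, and the final F0 value, velocity, and acceleration of one syllable become the initial state of the next. The function name, argument names, and example values are hypothetical and are not taken from the dissertation.

```python
import math


def qta_syllable(m, b, lam, duration, state=(0.0, 0.0, 0.0), n=20):
    """Generate one syllable of F0 under the assumed qTA formulation.

    m, b     : slope and height of the linear pitch target m*t + b
    lam      : rate at which F0 approaches the target
    duration : syllable duration in seconds
    state    : (F0, velocity, acceleration) at syllable onset
    n        : number of evenly spaced samples to return (n >= 2)

    Returns (samples, end_state); end_state is the (F0, velocity,
    acceleration) triple at syllable offset, to be passed to the next syllable.
    """
    y0, dy0, ddy0 = state
    # Coefficients of the transient term, fixed by the onset conditions.
    c1 = y0 - b
    c2 = dy0 + c1 * lam - m
    c3 = (ddy0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    def evaluate(t):
        # f0(t) = (m*t + b) + (c1 + c2*t + c3*t^2) * exp(-lam*t)
        p = c1 + c2 * t + c3 * t * t
        dp = c2 + 2.0 * c3 * t
        ddp = 2.0 * c3
        decay = math.exp(-lam * t)
        y = p * decay + m * t + b
        dy = (dp - lam * p) * decay + m
        ddy = (ddp - 2.0 * lam * dp + lam ** 2 * p) * decay
        return y, dy, ddy

    times = [duration * i / (n - 1) for i in range(n)]
    samples = [evaluate(t)[0] for t in times]
    return samples, evaluate(duration)


# Two consecutive syllables with hypothetical targets (arbitrary pitch units).
first, carry = qta_syllable(m=0.0, b=5.0, lam=40.0, duration=0.3)
second, _ = qta_syllable(m=-10.0, b=2.0, lam=40.0, duration=0.25, state=carry)
```

Passing `carry` into the second call keeps the F0 contour continuous across the syllable boundary. In the system the abstract describes, m, b, and lam for each syllable would be predicted by the context-dependent decision trees; that prediction step is not part of this sketch.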
Keywords/Search Tags: speech synthesis, F0 modeling, F0 generation, emotional speech synthesis