
Research On Neural Network-based Acoustic Modeling For Speech Synthesis

Posted on: 2017-05-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Yin    Full Text: PDF
GTID: 1108330485451558    Subject: Information and Communication Engineering
Abstract/Summary
In recent decades, statistical parametric speech synthesis (SPSS) has become a mainstream text-to-speech method, alongside the unit selection and waveform concatenation approach. The hidden Markov model (HMM)-based method is the most popular approach to SPSS. It borrows algorithms from automatic speech recognition, and several key techniques have been proposed specifically for synthesis, such as the multi-space probability distribution HMM and maximum likelihood parameter generation. Compared with unit selection and waveform concatenation, it offers many advantages: fast and automatic system construction, a small system footprint, high smoothness, high flexibility, and so on. However, there is still a large gap between speech synthesized by SPSS and that of unit selection and waveform concatenation in naturalness and quality. One of the main causes is the inadequacy of acoustic modeling in HMM-based SPSS.

With the successful application of deep neural networks (DNNs) in automatic speech recognition, neural networks have also been applied to speech synthesis and have made steady progress since 2013. Compared with the hidden Markov models and decision-tree-clustered state-level Gaussian distributions used in conventional statistical parametric speech synthesis, neural networks have a stronger capacity to model the cross-dimensional correlations of high-dimensional acoustic features and the complex dependencies between input context features and output acoustic features. This dissertation therefore focuses on neural network-based acoustic modeling methods for statistical parametric speech synthesis.
To capture the cross-dimensional correlations of high-dimensional spectral envelopes in spectrum modeling, a neural autoregressive distribution estimator (NADE)-based HMM state-level distribution modeling method and a deep conditional restricted Boltzmann machine-based spectrum modeling method are proposed, which improve the quality and naturalness of synthesized speech. To account for the hierarchically additive property of F0 production and the long-term property of F0 perception, a DNN-based hierarchical F0 modeling method is proposed, which reduces F0 prediction errors and improves the naturalness of synthesized speech. Finally, an end-to-end speech synthesis method is explored: using an attention-based recurrent sequence generator, feature alignment and prediction are modeled jointly within the neural network-based synthesis framework.

The dissertation is organised as follows.

Chapter 1 is the introduction. It briefly reviews the speech production mechanism and the history of speech synthesis research, and introduces several speech synthesis methods.

Chapter 2 first introduces the HMM-based statistical parametric synthesis method, including the fundamental principles of HMMs, the speech synthesis system framework, and four key techniques in the system. The advantages and disadvantages of this method are then analysed. Next, the history of neural networks and several existing applications to speech synthesis are reviewed. Finally, the motivation of the research is stated.

Chapter 3 proposes a neural autoregressive distribution estimator (NADE)-based state-level spectrum modeling method. The existing restricted Boltzmann machine (RBM)-based state-level method replaces the Gaussian distribution with an RBM to describe the distribution of spectral features at each HMM state, and gains some improvement.
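To make the state-level density idea concrete, here is a minimal sketch of how an RBM's free energy can serve as an unnormalized log-density over spectral features at one HMM state. It assumes a Gaussian-Bernoulli RBM with unit-variance visible units; the dimensions and randomly initialized parameters are hypothetical, not taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D-dimensional spectral envelope, H binary hidden units.
D, H = 40, 16
W = rng.normal(scale=0.1, size=(D, H))  # visible-hidden weights
b = np.zeros(D)                          # visible (Gaussian) biases
c = np.zeros(H)                          # hidden (Bernoulli) biases

def free_energy(v):
    """Free energy F(v) of a Gaussian-Bernoulli RBM with unit-variance
    visibles: F(v) = ||v - b||^2 / 2 - sum_j softplus(c_j + v . W_j).
    Then -F(v) is the log-density of v up to the partition function."""
    quad = 0.5 * np.sum((v - b) ** 2)
    hidden = np.sum(np.logaddexp(0.0, c + v @ W))  # numerically stable softplus
    return quad - hidden

v = rng.normal(size=D)  # one spectral feature vector observed at this state
print(free_energy(v))
```

Lower free energy means higher (unnormalized) probability; the intractable part, as noted above, is the partition function needed for the exact likelihood, which is precisely what motivates NADE in the next chapter.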
However, the RBM has deficiencies in calculating the model likelihood and parameter gradients. NADE overcomes these limitations by decomposing the marginal probability of the observations into a product of tractable conditional probabilities, so this chapter proposes to use NADE instead of the RBM. Objective and subjective evaluations show that the proposed method improves the adequacy of acoustic modeling and the quality of synthesized speech.

Chapter 4 considers the inability of current DNN-based SPSS to model the multi-modal property and cross-dimensional correlations of spectral features. It proposes a deep conditional restricted Boltzmann machine (DCRBM)-based spectrum modeling and prediction method and analyses different pre-training strategies for the DCRBM through experiments. The method uses an RBM to model the acoustic features at the output layer of a DNN, combining the advantages of DNN-based dependency modeling and RBM-based distribution representation. It represents not only the multi-modal property of the conditional probability of acoustic features given linguistic features, but also the cross-dimensional correlations of high-dimensional spectral envelopes. Experimental results show that the proposed method produces better speech quality than HMM-based, DNN-based, and deep mixture density network (DMDN)-based synthesis methods.

Chapter 5 investigates DNN-based hierarchical F0 modeling methods. After analysing the deficiencies of several existing F0 modeling methods, this chapter models the hierarchically additive property of the F0 generation mechanism and the long-term property of F0 perception with DNNs. For the proposed hierarchical F0 model covering all prosodic layers, two training frameworks are designed and realized: cascade DNN and parallel DNN.
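The contrast between the two training topologies can be sketched as follows. This is a toy illustration under stated assumptions: each per-layer "DNN" is reduced to a random linear map, and the layer names, dimensions, and cascade coupling are hypothetical stand-ins, not the dissertation's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the per-layer predictors: each "DNN" is reduced to a
# random linear map so the two topologies can be contrasted cheaply.
ctx_dim, T = 8, 5                      # context feature size, number of frames
layers = ["phrase", "word", "syllable", "phone"]   # coarse-to-fine prosodic layers
weights = {name: rng.normal(scale=0.1, size=(ctx_dim,)) for name in layers}
coupling = {name: rng.normal(scale=0.1) for name in layers}

ctx = rng.normal(size=(T, ctx_dim))    # frame-level linguistic context features

def predict_parallel(ctx):
    """Parallel topology: every layer predicts its own F0 component from the
    context alone; the final contour is the sum of all layer components."""
    return sum(ctx @ weights[name] for name in layers)

def predict_cascade(ctx):
    """Cascade topology: each layer additionally conditions on the running
    sum produced by the coarser layers above it."""
    f0 = np.zeros(T)
    for name in layers:                # coarse to fine
        f0 = f0 + ctx @ weights[name] + coupling[name] * f0
    return f0

print(predict_parallel(ctx))
print(predict_cascade(ctx))
```

Both variants realize the additive hierarchy described above; they differ in whether the fine layers see the coarse layers' predictions during training.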
Objective and subjective tests show that the proposed method effectively reduces F0 prediction errors and improves the naturalness of synthetic speech.

Chapter 6 explores end-to-end speech synthesis. The aim is to integrate front-end text analysis and back-end acoustic modeling, realizing a direct transduction between asynchronous feature sequences, such as context features and acoustic features. This chapter proposes to adopt an attention-based recurrent sequence generator to jointly model feature alignment and prediction in neural network-based speech synthesis. The built system can synthesize moderately smooth and fairly intelligible speech without depending on HMMs.

Chapter 7 concludes the dissertation.
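As an illustration of the soft alignment computed inside an attention-based recurrent sequence generator like the one in Chapter 6: at every decoding step, the decoder state is scored against each encoded context position and a softmax turns the scores into alignment weights. The sizes and the simple dot-product scoring below are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: L encoded context positions, decoder state of size dec_dim.
L, enc_dim, dec_dim = 6, 4, 3
enc = rng.normal(size=(L, enc_dim))            # encoded linguistic context sequence
Wq = rng.normal(scale=0.1, size=(dec_dim, enc_dim))  # projects decoder state to encoder space

def attend(dec_state):
    """Content-based attention: score every encoder position against the
    current decoder state, normalize with a softmax, and return the expected
    context vector together with the soft alignment itself."""
    scores = enc @ (Wq.T @ dec_state)          # one scalar score per position
    scores = scores - scores.max()             # stabilize the softmax
    align = np.exp(scores) / np.exp(scores).sum()
    context = align @ enc                      # alignment-weighted context vector
    return context, align

context, align = attend(rng.normal(size=dec_dim))
print(align)
```

The alignment vector replaces the explicit HMM state alignment: each decoding step learns where in the context sequence to look, which is what lets alignment and prediction be trained jointly.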
Keywords/Search Tags: speech synthesis, hidden Markov model, parametric synthesis, neural autoregressive distribution estimator, deep neural networks, deep conditional restricted Boltzmann machine, attention-based recurrent sequence generator