
Research On Speech Synthesis Methods Using Auditory Perception Related Measurements

Posted on: 2017-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: X H Sun
Full Text: PDF
GTID: 2308330485451794
Subject: Signal and Information Processing
Abstract/Summary:
Speech synthesis technology has developed rapidly in recent years and has been applied in an increasing number of real-world scenarios. Corpus-based unit selection with waveform concatenation and statistical parametric speech synthesis are currently the two most popular approaches. These methods typically extract acoustic features and build statistical models at training time, then perform unit selection or parameter generation according to specific criteria at synthesis time. Commonly used acoustic features include the fundamental frequency, mel-cepstral coefficients, and line spectral pairs; commonly used unit selection and parameter generation criteria include the maximum output probability criterion. These features and criteria are designed on the basis of the speech production mechanism and statistical methods, with little consideration of the speech perception mechanism. Moreover, the quality of synthetic speech is still evaluated through subjective listener scoring. The absence of perception-related measurements in the acoustic features and generation criteria restricts further improvement of current speech synthesis methods.

Therefore, this dissertation focuses on speech synthesis methods based on speech perception measurements. On the one hand, a unit selection and waveform concatenation method using subjective perception data is studied: by constructing a synthetic error detector, subjective perception information about the synthetic speech is integrated into the unit selection criteria, which improves the naturalness of the synthetic speech. On the other hand, acoustic modeling and parameter generation based on acoustic features related to the speech perception mechanism is studied for statistical parametric speech synthesis.
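The idea of folding a perception-based term into the unit selection criterion can be illustrated with a minimal sketch. All names here (target_cost, join_cost, detector_score, the 0.5 weight, and the toy f0-based costs) are illustrative assumptions, not the actual cost functions used in the dissertation.

```python
# Hypothetical sketch: augmenting a unit-selection cost with a
# synthetic-error detector score. The toy costs below compare a single
# scalar feature (f0); a real system would use full feature vectors.

def target_cost(unit, target):
    # Toy target cost: squared f0 mismatch between candidate and target.
    return (unit["f0"] - target["f0"]) ** 2

def join_cost(prev_unit, unit):
    # Toy concatenation cost: f0 discontinuity at the join point.
    return abs(prev_unit["f0"] - unit["f0"])

def detector_score(unit):
    # Stand-in for a learned error detector's probability that this
    # candidate would produce an audible synthesis error.
    return unit.get("error_prob", 0.0)

def select_unit(candidates, target, prev_unit, w_detector=0.5):
    """Pick the candidate minimizing target + concatenation cost,
    penalized by the detector's predicted-error probability."""
    def total_cost(unit):
        return (target_cost(unit, target)
                + join_cost(prev_unit, unit)
                + w_detector * detector_score(unit))  # perception term
    return min(candidates, key=total_cost)
```

With two otherwise identical candidates, the detector term breaks the tie toward the unit less likely to sound wrong, which is the intended effect of integrating subjective perception data.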
The traditional acoustic modeling and parameter generation methods are improved using the modulation spectrum and multiresolution spectrotemporal analysis, which improves both the subjective and objective quality of the synthesized speech.

The dissertation is organized as follows:

Chapter one is the introduction, which briefly presents the concept, significance, and development history of speech synthesis, reviews the current state of speech synthesis research, and states the research objectives and contents of this dissertation.

Chapter two proposes a unit selection speech synthesis method based on subjective perception data. First, a crowdsourcing platform is used to collect large-scale perceptual data efficiently. Then, a synthetic error detector is trained on these data. Finally, the detector's scores are integrated into the unit selection criteria. Experimental results show that the proposed method effectively improves the naturalness of synthetic speech.

Chapter three introduces statistical parametric speech synthesis based on acoustic features related to auditory perception. First, the basic concepts of the modulation spectrum and multiresolution spectrotemporal analysis are introduced, including their auditory-physiological background and computational methods. Then, a modulation spectrum compensation method for quality enhancement of parametric speech synthesis is introduced, and several strategies for extracting the modulation spectrum of line spectral pairs are developed. The experimental results indicate that computing modulation spectrum vectors from mel-cepstral coefficients derived from line spectral pairs achieves the best performance and effectively improves the naturalness of synthetic speech. Finally, acoustic modeling based on the parameters of multiresolution spectrotemporal analysis is studied, using a deep neural network with multi-task learning.
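The modulation spectrum mentioned above is, in essence, the spectrum of an acoustic-feature trajectory over time. A minimal sketch is shown below; the choice of FFT length, windowing, and normalization are assumptions for illustration and may differ from the procedure in the dissertation.

```python
# Hedged sketch: per-dimension modulation spectrum as the magnitude DFT
# of an acoustic-feature trajectory across frames.
import numpy as np

def modulation_spectrum(trajectory, n_fft=64):
    """trajectory: (frames, dims) matrix, e.g. mel-cepstral coefficients.

    Returns an (n_fft // 2 + 1, dims) array: the magnitude spectrum of
    each feature dimension's temporal trajectory."""
    traj = np.asarray(trajectory, dtype=float)
    traj = traj - traj.mean(axis=0)  # remove each dimension's DC offset
    return np.abs(np.fft.rfft(traj, n=n_fft, axis=0))
```

A trajectory oscillating at k cycles over n_fft frames produces a peak at modulation bin k; over-smoothed parameter trajectories from statistical models show attenuated high modulation bins, which is what modulation spectrum compensation aims to restore.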
The auditory spectrum and the auditory cortical output are used, respectively, as the secondary task during model training. The experimental results show that using the auditory spectrum as the secondary task improves the prediction accuracy of the mel-cepstral coefficients.

Chapter four concludes the dissertation and presents prospects for future work.
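The multi-task setup described above can be sketched as a network with a shared representation feeding two output heads: a primary head predicting mel-cepstral coefficients and a secondary head predicting the auditory spectrum. The layer sizes, single hidden layer, and the 0.3 secondary-loss weight below are illustrative assumptions, not values from the thesis.

```python
# Hedged sketch of multi-task acoustic modeling: one shared hidden layer,
# a primary head (mel-cepstra) and a secondary head (auditory spectrum).
import numpy as np

rng = np.random.default_rng(0)

def init_mtl_net(n_in, n_hidden, n_mcep, n_aud):
    # Small random weights; a real system would train these by backprop.
    return {
        "W_shared": rng.standard_normal((n_in, n_hidden)) * 0.1,
        "W_mcep":   rng.standard_normal((n_hidden, n_mcep)) * 0.1,
        "W_aud":    rng.standard_normal((n_hidden, n_aud)) * 0.1,
    }

def forward(net, x):
    h = np.tanh(x @ net["W_shared"])        # shared representation
    return h @ net["W_mcep"], h @ net["W_aud"]

def mtl_loss(net, x, mcep_target, aud_target, w_secondary=0.3):
    # Combined objective: primary MSE plus down-weighted secondary MSE.
    mcep_pred, aud_pred = forward(net, x)
    primary = np.mean((mcep_pred - mcep_target) ** 2)
    secondary = np.mean((aud_pred - aud_target) ** 2)
    return primary + w_secondary * secondary
```

The secondary loss only shapes the shared layer during training; at synthesis time, only the primary (mel-cepstral) head would be used.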
Keywords/Search Tags: speech synthesis, synthetic error detection, modulation spectrum, post-filtering, multiresolution spectrotemporal analysis, deep neural network, multi-task learning