
A Study On Concatenative Speech Synthesis Based On Unit Embeddings

Posted on: 2022-04-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Zhou
Full Text: PDF
GTID: 1488306323982009
Subject: Signal and Information Processing
Abstract/Summary:
Speech is one of the most ideal human-computer interaction methods. Speech synthesis, which realizes the conversion from text to speech waveform, is a core technology of intelligent speech interaction. Statistical parametric speech synthesis (SPSS) and concatenative speech synthesis (CSS) are the two mainstream speech synthesis methods. In CSS, based on the information obtained by analyzing the input text, the optimal unit sequence is selected from a pre-recorded and labeled corpus, and its waveforms are concatenated to produce the final synthetic speech. One of the key steps is unit selection, which measures the suitability of candidate units with cost functions and searches for the best candidate sequence using a dynamic programming algorithm. Since the waveform of each unit in the final synthetic speech is copied directly from the corpus, the advantage of CSS over SPSS is that the sound quality of the original recording is maintained.

Traditional CSS relies on shallow models such as the hidden Markov model (HMM) for acoustic modeling and cost function computation. In recent years, deep learning models, represented by deep neural networks (DNNs), have shown clear advantages over HMMs and have been applied to CSS, improving the accuracy of acoustic feature prediction and better characterizing the units when calculating the cost functions. However, current deep learning-based CSS methods still have shortcomings: frame-level acoustic models have difficulty describing the long-term dependencies among consecutive phone units; sequence-to-sequence (Seq2Seq) acoustic models remain underutilized and underexplored for this task; and the quality of synthetic speech is still limited by the size and unit coverage of the corpus.

Therefore, this paper investigates deep learning-based CSS methods, focusing on the learning and modeling of unit embeddings. A unit embedding is defined as a fixed-length embedding vector for a variable-length, phone-sized unit in the corpus. We study a DNN-based unit embedding extraction and modeling method to better describe the long-term dependencies among consecutive phone units. We also study a unit selection method based on a Seq2Seq model and propose a two-layer autoregressive decoding model to realize the joint modeling of unit embeddings and cost functions. We further study a hybrid CSS method combined with waveform generation to improve the quality of synthetic speech when sufficient candidate units are unavailable. The main research of this paper includes the following aspects.

First, this paper investigates a DNN-based unit embedding and unit selection method. The method constructs a DNN model to learn unit embeddings for the candidate units in the corpus and derives the cost functions for unit selection by modeling those embeddings. It then improves the training criterion of the unit embedding extraction model, designing loss functions with multiple prediction targets, such as acoustic features, unit durations, and monophone and tone identifiers, which improves naturalness over the baseline HMM-based CSS method.
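To make the cost-based search concrete, the sketch below shows unit selection as a Viterbi-style dynamic program over per-position candidate lists, with both costs computed from unit embeddings. This is a minimal illustration, not code from the dissertation: the names `target_cost` and `concat_cost`, the Euclidean distances, and the weights `w_t`/`w_c` are assumptions chosen for clarity.

```python
import numpy as np

def target_cost(cand_emb, tgt_emb):
    # How well a candidate unit's embedding matches the embedding
    # predicted from the input text for this position (assumed metric).
    return float(np.linalg.norm(cand_emb - tgt_emb))

def concat_cost(prev_emb, cand_emb):
    # Mismatch at the join between two consecutive candidate units.
    return float(np.linalg.norm(prev_emb - cand_emb))

def select_units(candidates, targets, w_t=1.0, w_c=1.0):
    """Viterbi search for the lowest-cost unit sequence.

    candidates: list over positions; each entry is a list of unit embeddings.
    targets:    list of target embeddings predicted from the input text.
    Returns the index of the chosen candidate at each position.
    """
    n = len(targets)
    # cost[i][j]: best accumulated cost ending at candidate j of position i
    cost = [np.full(len(c), np.inf) for c in candidates]
    back = [np.zeros(len(c), dtype=int) for c in candidates]
    for j, emb in enumerate(candidates[0]):
        cost[0][j] = w_t * target_cost(emb, targets[0])
    for i in range(1, n):
        for j, emb in enumerate(candidates[i]):
            tc = w_t * target_cost(emb, targets[i])
            joins = [cost[i - 1][k] + w_c * concat_cost(p, emb)
                     for k, p in enumerate(candidates[i - 1])]
            back[i][j] = int(np.argmin(joins))
            cost[i][j] = joins[back[i][j]] + tc
    # Trace back the lowest-cost path from the final position.
    path = [int(np.argmin(cost[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

The two weights trade off fidelity to the text-predicted targets against smoothness at the concatenation points; in practice they are tuned on held-out listening tests.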
Second, this paper investigates a unit embedding and unit selection method based on a Seq2Seq model. The method applies the Seq2Seq acoustic model used in SPSS to CSS: the encoder outputs are used as unit embeddings, and corresponding unit selection cost functions are designed, achieving better naturalness of synthetic speech than the unit embeddings learned by the DNN.

Third, this paper investigates a unit embedding and unit selection method based on a two-layer autoregressive decoding model. This model is a Seq2Seq acoustic model designed specifically for CSS. By placing autoregressive structures at both the phone and frame levels in the decoder, it realizes the joint modeling of unit embeddings and cost functions and improves the quality of concatenative synthesis. In addition, the phone-level autoregressive decoding structure also improves the robustness of SPSS.

Finally, this paper investigates a hybrid CSS method combined with waveform generation. Built on the two-layer autoregressive decoding model, this method combines the two technical routes of SPSS and CSS and designs an online expansion strategy for candidate units, achieving better naturalness than either SPSS or CSS alone.
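As an illustration of the two-layer autoregressive idea, the sketch below runs one recurrence per phone, whose hidden state can be read out as the unit embedding, and a second recurrence per frame, conditioned on that phone state, to predict acoustic features. This is a sketch under assumed dimensions and layer choices (GRU cells, 80-dimensional acoustic frames, the class name `TwoLayerARDecoder`), not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class TwoLayerARDecoder(nn.Module):
    """Phone-level plus frame-level autoregressive decoder (illustrative)."""

    def __init__(self, enc_dim=256, phone_dim=256, frame_dim=256, acoustic_dim=80):
        super().__init__()
        # Phone-level recurrence: one step per phone; its hidden state
        # serves as the unit embedding used in unit-selection costs.
        self.phone_rnn = nn.GRUCell(enc_dim + acoustic_dim, phone_dim)
        # Frame-level recurrence: autoregressive over a phone's acoustic frames.
        self.frame_rnn = nn.GRUCell(phone_dim + acoustic_dim, frame_dim)
        self.frame_out = nn.Linear(frame_dim, acoustic_dim)

    def forward(self, enc_outputs, frames_per_phone):
        """enc_outputs: (num_phones, enc_dim) encoder states of one utterance.
        frames_per_phone: number of frames to generate for each phone."""
        acoustic_dim = self.frame_out.out_features
        phone_h = enc_outputs.new_zeros(self.phone_rnn.hidden_size)
        last_frame = enc_outputs.new_zeros(acoustic_dim)
        unit_embeddings, frames = [], []
        for p, enc in enumerate(enc_outputs):
            # Phone-level step: conditioned on the encoder state and the last
            # frame of the previous phone (the phone-level autoregression).
            phone_h = self.phone_rnn(
                torch.cat([enc, last_frame]).unsqueeze(0),
                phone_h.unsqueeze(0)).squeeze(0)
            unit_embeddings.append(phone_h)
            frame_h = enc_outputs.new_zeros(self.frame_rnn.hidden_size)
            for _ in range(frames_per_phone[p]):
                # Frame-level step: conditioned on the phone state and the
                # previously generated frame (the frame-level autoregression).
                frame_h = self.frame_rnn(
                    torch.cat([phone_h, last_frame]).unsqueeze(0),
                    frame_h.unsqueeze(0)).squeeze(0)
                last_frame = self.frame_out(frame_h)
                frames.append(last_frame)
        return torch.stack(unit_embeddings), torch.stack(frames)
```

The design point this sketch tries to capture is that the phone-level state summarizes each variable-length unit into one fixed-length vector, so the same model can drive both parametric generation (the predicted frames) and embedding-based unit selection.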
Keywords/Search Tags: speech synthesis, deep learning, neural network, unit selection, waveform concatenation