
Research On Unit Selection Concatenation Speech Synthesis Method Based On Deep Learning

Posted on: 2020-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: W B Cai
GTID: 2518306452971779
Subject: Electronics and Communications Engineering
Abstract:
Speech synthesis is the technique of converting text into the corresponding audio; its goal is to give machines the ability to speak. The two mainstream approaches today are statistical parametric speech synthesis and speech synthesis based on unit-selection waveform concatenation. Statistical parametric synthesis has the advantages of high flexibility and low memory consumption, but the synthesized speech tends to be over-smoothed, which makes it sound mechanical and reduces its overall naturalness. Unit-selection synthesis instead uses a statistical acoustic model to guide the selection and concatenation of real recorded speech segments, which avoids the over-smoothing problem and yields more natural, higher-quality speech than traditional parametric synthesis.

The traditional unit-selection method models the acoustic signal with hidden Markov models, whose limited number of states makes the modeling effect unsatisfactory. Furthermore, when conventional static and dynamic acoustic features (mel-cepstral coefficients, fundamental frequency) take part in model training and cost computation, splicing multiple features together increases the computational complexity, and the predicted acoustic features differ from the numerical distribution of the real features, so the quality of the synthesized audio decreases.

To address these problems, this thesis takes deep-learning-based unit-selection waveform-concatenation speech synthesis as its research entry point, with the aim of improving the quality of the synthesized audio. The specific work and contributions are as follows:

Firstly, a flexible concatenation synthesis system was constructed. To improve experimental efficiency, each system component was modularized; the system is divided into six submodules: front-end text analysis, phoneme search database, phoneme pre-selection, model building, cost function, and phoneme concatenation. The text and audio data were aligned at the frame level, the phoneme is the minimal concatenation unit, and a researcher can focus on a single module at a time (a minimal sketch of such a cost computation is given below).

Secondly, a variety of deep network models were constructed. A duration model was built to predict the number of frames per phoneme, and an acoustic model was used to predict the acoustic features. Contrast experiments verified the effect of different acoustic models on the quality of the final synthesized speech: the results show that acoustic models based on LSTM-RNN and GAN are more powerful than the traditional DNN at acoustic-signal modeling and improve the quality of the synthesized audio (see the model sketch below).
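The abstract does not reproduce the thesis's cost formulas. Unit selection of this kind conventionally combines a target cost (distance between the features predicted by the acoustic model and the features of a candidate unit) with a concatenation cost (mismatch at the join between consecutive units), and searches the candidate lattice by Viterbi-style dynamic programming. The Python sketch below illustrates only that conventional scheme; the function names, Euclidean distances, and equal weights are assumptions, not the thesis's implementation.

```python
import numpy as np

def target_cost(pred_feat, cand_feat, w_t=1.0):
    """Distance between model-predicted features and a candidate unit."""
    return w_t * np.linalg.norm(pred_feat - cand_feat)

def concat_cost(prev_feat, cand_feat, w_c=1.0):
    """Mismatch at the joint between two consecutive candidate units."""
    return w_c * np.linalg.norm(prev_feat - cand_feat)

def select_units(pred_feats, candidates):
    """Viterbi search over pre-selected candidate phoneme units.

    pred_feats -- one predicted feature vector per target phoneme
    candidates -- candidates[i] lists the feature vectors of units
                  pre-selected for target phoneme i
    Returns the chosen candidate index for each target phoneme.
    """
    n = len(pred_feats)
    cost = [[target_cost(pred_feats[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for c in candidates[i]:
            joins = [cost[i - 1][j] + concat_cost(p, c)
                     for j, p in enumerate(candidates[i - 1])]
            best = int(np.argmin(joins))
            row.append(target_cost(pred_feats[i], c) + joins[best])
            ptr.append(best)
        cost.append(row)
        back.append(ptr)
    path = [int(np.argmin(cost[-1]))]      # trace back the cheapest path
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```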
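As a rough illustration of the second contribution, the sketch below shows an LSTM-RNN acoustic model of the kind compared against the DNN baseline: it maps frame-level linguistic features to frame-level acoustic features. All layer sizes and dimensions are hypothetical, and the GAN variant is not shown.

```python
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    """Maps frame-level linguistic features to acoustic features
    (e.g. mel-cepstra, log-F0, aperiodicity)."""
    def __init__(self, in_dim=300, hidden=256, out_dim=60):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):        # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)      # h: (batch, frames, 2 * hidden)
        return self.proj(h)      # (batch, frames, out_dim)

model = LSTMAcousticModel()
out = model(torch.randn(4, 100, 300))   # -> shape (4, 100, 60)
```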
Thirdly, the acoustic features were explored. On the one hand, the conventional cepstral-coefficient, fundamental-frequency, and aperiodicity features were extracted, and the effect of removing the dynamic features on synthesized audio quality was analyzed. On the other hand, a fine-tuned bottleneck-feature system was constructed: the model outputs low-dimensional bottleneck features from a middle layer to take part in the target-cost computation, which reduces the total computational complexity of the cost function, improves the accuracy of the features predicted by the fine-tuned model, and improves the quality of the synthesized audio (a sketch of bottleneck-feature extraction is given below).

Fourthly, the performance of the experimental systems was evaluated. The Blizzard Challenge 2018 English synthesis corpus served as the experimental data, and the mean opinion score (MOS) and mel-cepstral distortion (MCD) were used as the subjective and objective measures of system performance. By computing these indicators, the strengths and weaknesses of each experimental system and the quality of its synthesized audio were analyzed. The results show that the bottleneck-feature system, fine-tuned after training on a similar corpus, achieved good results on both the subjective and the objective indicators: the predicted bottleneck features better characterize the concatenation units, guiding the target cost to select suitable candidate units and improving the quality of the synthesized speech.
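The bottleneck idea can be pictured as a network with a deliberately narrow middle layer whose activations double as compact unit descriptors for the target cost. The sketch below is a minimal illustration under assumed dimensions (a 32-dimensional bottleneck and a 60-dimensional acoustic output); the thesis's actual architecture and fine-tuning procedure are not reproduced here.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Acoustic model with a narrow middle layer; the bottleneck
    activations serve as low-dimensional features for the target cost."""
    def __init__(self, in_dim=300, hidden=256, bottleneck=32, out_dim=60):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU())   # narrow layer
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        z = self.encoder(x)      # low-dimensional bottleneck features
        y = self.decoder(z)      # full acoustic features
        return y, z
```

At synthesis time only the low-dimensional z would be compared between the predicted target and each candidate unit, which is what shrinks the target-cost computation relative to comparing full acoustic feature vectors.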
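MCD conventions vary slightly across papers (coefficient range, time alignment, handling of the energy coefficient). The sketch below implements the common definition, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) averaged over time-aligned frames, assuming the 0th (energy) coefficient has already been removed; it is not necessarily the exact variant used in the thesis.

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Frame-averaged MCD in dB between two mel-cepstral sequences.

    ref, syn -- arrays of shape (frames, D), time-aligned, with the
    0th (energy) coefficient already dropped.
    """
    diff = ref - syn
    const = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    return const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```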
Keywords: Speech synthesis, Unit selection, Cost function, Acoustic model, Acoustic features