In recent years, with the continuous development of science, technology, and society, voice interaction has become the most direct, efficient, and accessible mode of human-robot interaction. Text-to-Speech (TTS), also known as text-based speech generation, is one of the core technologies of human-computer interaction. It has attracted sustained attention from researchers and has gradually become a key research direction across a variety of speech tasks, with important practical value in mobile phone voice assistants, AI audio novels, emotional companion robots, and voice map navigation.

With the rapid development of deep learning, end-to-end speech synthesis methods based on this technology have gradually become mainstream. At present, deep-learning-based Chinese speech synthesis can be divided into autoregressive and non-autoregressive working modes. However, problems remain, including unstable synthesized speech, poor naturalness, slow synthesis speed, and poor personalized speech synthesis quality, which fail to meet the demands of some practical application scenarios. In view of these problems, this thesis carries out research on Chinese speech synthesis and Chinese personalized speech synthesis. The research content is mainly divided into the following aspects.

First, to address the instability, poor naturalness, and low synthesis efficiency of Chinese speech synthesis, this thesis proposes an end-to-end Chinese speech synthesis model, F-MelGAN. A post-processing network refines the Mel-spectrogram predicted by the decoder and alleviates Mel-spectrogram distortion, improving the naturalness and stability of the synthesized speech. MelGAN is used as the model's vocoder, giving the model good real-time performance. In objective evaluation, the Mel cepstral distortion of the synthesized speech is 9.53, and the real-time factor of speech generation on a GPU is 0.155.

Second, due to the lack of high-quality Chinese
speech datasets, and in order to solve the problem of poor personalized Chinese speech synthesis, this thesis proposes combining an acoustic condition network, a speaker encoder network (GCNet), and a feedback-constraint training method to realize Chinese personalized voice customization. Experimental results show that the full model can generate high-quality speech with high speaker similarity, both for speakers seen during training and for speakers never seen during training. Meanwhile, the real-time factor of speech synthesis on a GPU is 0.278, which meets the requirement of real-time speech synthesis.

Finally, to verify the practicability of the personalized speech synthesis model in human-machine voice dialogue scenarios, this thesis uses the Qiming robot as the robot carrier and PyTorch to build a chit-chat personalized voice dialogue system, realizing an initial practical application of the model.
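For reference, the two objective metrics quoted above can be computed as follows. This is a minimal sketch of the standard definitions of Mel cepstral distortion (MCD, in dB, over one pair of mel-cepstral frames) and the real-time factor (RTF); it is not the thesis's evaluation code, and details such as which cepstral coefficients are included and how frames are aligned are assumptions that vary between implementations.

```python
import math

def mel_cepstral_distortion(ref, syn):
    """Standard per-frame MCD in dB between two mel-cepstral
    coefficient vectors (the energy term c0 is usually excluded).
    MCD = (10 / ln 10) * sqrt(2 * sum_d (ref_d - syn_d)^2)."""
    assert len(ref) == len(syn)
    squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_error)

def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = wall-clock time spent synthesizing / duration of the
    audio produced. RTF < 1 means faster than real time."""
    return synthesis_seconds / audio_seconds

# Identical frames give zero distortion.
print(mel_cepstral_distortion([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0

# E.g. 1.55 s of compute for 10 s of audio gives an RTF of 0.155,
# matching the scale of the figures reported above.
print(real_time_factor(1.55, 10.0))
```

In corpus-level evaluation, the per-frame MCD is typically averaged over time-aligned frames of the reference and synthesized utterances; the lower the MCD and RTF, the better.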