Speech synthesis is a technology that converts input text into natural-sounding speech. Compared with traditional speech synthesis, personalized speech synthesis is more complex: it must effectively control personalized information, such as speaker information and accent information, in the synthesized speech while preserving its intelligibility and naturalness. Owing to this complexity and to growing user demand, multi-speaker speech synthesis and accented speech synthesis, which aim to control speaker information and accent information respectively, have become important research topics.

At present, multi-speaker speech synthesis models built around a speaker encoder derived from speaker verification or recognition models can effectively control the speaker information of synthesized speech. However, a speaker encoder trained on a speaker classification task ignores the richness of speech, such as its linguistic information and the speaker's dynamic information, which degrades the naturalness of the synthesized speech. At the same time, the speaker encoder's strong dependence on speaker verification/recognition models limits the further development of multi-speaker speech synthesis. Research on accented speech synthesis is still scarce. During accent transfer learning, traditional end-to-end speech synthesis methods rely heavily on large-scale accent data, make little use of accent prior knowledge, and entangle accent information with other information.

To address these problems, this paper proposes to control speaker information and accent information with deep speech representations, consisting of a rich speech embedding that carries both speaker and linguistic information and a deep accent representation. The main contributions of this paper are as follows:

(1) A new multi-speaker speech synthesis model based on a rich speech embedding. A speech-recognition-based embedding extraction model is used to extract a rich speech embedding containing both speaker and linguistic information, and speaker labels are added during model training to further strengthen the embedding's control over speaker information. Feature visualization together with subjective and objective experiments shows that the proposed model not only controls the speaker information of synthesized speech but also significantly improves its naturalness.

(2) A new accented speech synthesis model based on prior-knowledge guidance and a deep accent embedding. A self-supervised accent encoder that uses speaker labels and tone-related acoustic features as soft labels extracts deep accent representations, and the prediction of tone-related acoustic features is added to the acoustic model of the synthesizer to improve the modeling and control of accent information. Unsupervised data filtering and a progressive data augmentation strategy are adopted throughout model training. Experimental results show that the proposed model can effectively control accent information.

In summary, the personalized speech synthesis technology based on deep speech representations proposed in this paper has both theoretical research value and practical application value.
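To make the conditioning idea behind contributions (1) and (2) concrete, the following is a minimal sketch of an acoustic model that consumes a rich speech embedding and a deep accent embedding and adds an auxiliary head for tone-related acoustic features. It is an illustrative assumption, not the thesis's actual architecture: all module names, dimensions, and the simple LSTM/GRU backbone are hypothetical stand-ins, and the embeddings are represented by random tensors rather than outputs of real ASR-based or self-supervised encoders.

```python
import torch
import torch.nn as nn


class ConditionedAcousticModel(nn.Module):
    """Toy acoustic model conditioned on a rich speech embedding (assumed to come
    from an ASR-based extractor, carrying speaker + linguistic information) and a
    deep accent embedding (assumed to come from a self-supervised accent encoder).
    All sizes are illustrative."""

    def __init__(self, n_phonemes=64, d_text=256, d_speech=256, d_accent=64,
                 n_mels=80, n_tone_feats=4):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, d_text)
        self.text_encoder = nn.LSTM(d_text, d_text, batch_first=True)
        # Decoder sees the text encoding concatenated with both utterance-level
        # conditioning vectors at every time step.
        self.decoder = nn.GRU(d_text + d_speech + d_accent, 512, batch_first=True)
        self.mel_head = nn.Linear(512, n_mels)          # main target: mel-spectrogram frames
        self.tone_head = nn.Linear(512, n_tone_feats)   # auxiliary target: tone-related acoustic features

    def forward(self, phoneme_ids, speech_emb, accent_emb):
        text_hidden, _ = self.text_encoder(self.phoneme_embedding(phoneme_ids))
        T = text_hidden.size(1)
        cond = torch.cat([speech_emb, accent_emb], dim=-1)   # (B, d_speech + d_accent)
        cond = cond.unsqueeze(1).expand(-1, T, -1)           # broadcast over time
        dec, _ = self.decoder(torch.cat([text_hidden, cond], dim=-1))
        return self.mel_head(dec), self.tone_head(dec)


# Minimal usage with random tensors standing in for real embeddings.
model = ConditionedAcousticModel()
phonemes = torch.randint(0, 64, (2, 20))    # (batch, phoneme sequence length)
speech_emb = torch.randn(2, 256)            # rich speech embedding per utterance
accent_emb = torch.randn(2, 64)             # deep accent embedding per utterance
mel, tone = model(phonemes, speech_emb, accent_emb)
print(mel.shape, tone.shape)                # torch.Size([2, 20, 80]) torch.Size([2, 20, 4])
```

In this reading, the auxiliary tone head corresponds to adding the prediction of tone-related acoustic features to the acoustic model, so that accent-relevant prosodic cues are modeled explicitly rather than left entangled with other speech information.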