With the rapid development of speech synthesis technology, people's requirements have moved beyond naturalness and intelligibility: speech synthesis systems are now expected to be diverse and personalized, in particular to support the customization of specific speakers' voices, a task also known as voice cloning. However, personalized Chinese speech synthesis systems are often limited to few-shot scenarios during construction, so their generalization and accuracy cannot be guaranteed. First, audio from the target speaker is scarce, which makes large-scale data-driven voice cloning impractical, and existing few-shot voice cloning methods suffer from complex training processes and overfitting. In addition, the polyphone disambiguation module of a personalized Chinese speech synthesis system is also plagued by data shortages: existing methods often require a large-scale corpus, manual annotation is expensive, and the only open-source dataset is small and of poor quality, so classification accuracy is hard to ensure. In view of these problems, this article carries out the following work.

(1) A few-shot voice cloning method based on phoneme-level speaker features is proposed. The method exploits the fine-grained speaker characteristics in the target speaker's audio and introduces an attention mechanism to transfer speaker features between phonemes, supplemented by a random-sampling training strategy, to improve the utilization of the target speaker's data and achieve high-quality voice cloning in few-shot scenarios.

(2) A polyphone disambiguation method based on meta-learning is proposed. Instead of treating polyphone disambiguation as an ordinary machine-learning classification task, the method compares and distinguishes the semantic features of the different pronunciations of a polyphone, and determines the pronunciation by feature comparison.

Experimental results show that our voice cloning method generalizes better than existing methods, and that the generated speech achieves high similarity to the target speaker's voice as well as high naturalness. Our meta-learning-based polyphone disambiguation method shows excellent disambiguation performance even on the low-quality training dataset, generalizes better than existing methods, and performs well on unseen polyphones.
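The abstract does not give implementation details for the attention mechanism in contribution (1). As one plausible reading, transferring speaker features between phonemes can be sketched as scaled dot-product attention: the embedding of a query phoneme attends over the phoneme embeddings observed in the target speaker's few reference utterances, and the attention weights mix their phoneme-level speaker features. All function names, dimensions, and the choice of dot-product scoring below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def transfer_speaker_feature(query_phoneme, ref_phonemes, ref_speaker_feats):
    """Sketch of attention-based speaker-feature transfer (assumed design).

    query_phoneme:     (d,)    embedding of the phoneme to synthesize
    ref_phonemes:      (n, d)  phoneme embeddings from reference audio
    ref_speaker_feats: (n, k)  phoneme-level speaker features from reference audio
    Returns the attended speaker feature (k,) and the attention weights (n,).
    """
    d = query_phoneme.shape[-1]
    scores = ref_phonemes @ query_phoneme / np.sqrt(d)  # similarity to each reference phoneme
    weights = softmax(scores)                           # non-negative, sum to 1
    return weights @ ref_speaker_feats, weights

# Illustrative usage with random data
rng = np.random.default_rng(0)
query = rng.normal(size=8)
refs = rng.normal(size=(5, 8))       # 5 reference phonemes
feats = rng.normal(size=(5, 16))     # their phoneme-level speaker features
speaker_feat, attn = transfer_speaker_feature(query, refs, feats)
```

The attended `speaker_feat` would then condition the synthesizer for the query phoneme, so phonemes unseen in the reference audio still borrow speaker characteristics from acoustically similar ones.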
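Contribution (2) replaces a classifier head with feature comparison, which resembles metric-based meta-learning (e.g. prototypical networks). A minimal sketch of that pattern, under the assumption that a semantic encoder has already produced feature vectors: each candidate pronunciation gets a prototype (the mean of its few support features), and a query is assigned to the pronunciation whose prototype is most similar. The pronunciation labels, dimensions, and cosine-similarity choice below are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def disambiguate(query_feat, support_feats):
    """Pick a pronunciation by feature comparison instead of a trained classifier.

    query_feat:    (d,) semantic feature of the polyphone in its sentence
    support_feats: dict mapping pronunciation -> (n_shots, d) support features
    Returns the pronunciation whose prototype is closest in cosine similarity.
    """
    q = l2_normalize(query_feat)
    best, best_sim = None, -np.inf
    for pron, feats in support_feats.items():
        proto = l2_normalize(feats.mean(axis=0))  # pronunciation prototype
        sim = float(q @ proto)
        if sim > best_sim:
            best, best_sim = pron, sim
    return best

# Illustrative usage with toy 2-D features and made-up pinyin labels
support = {
    "shuo1": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "shui4": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
print(disambiguate(np.array([0.95, 0.05]), support))
```

Because classification reduces to nearest-prototype comparison, a polyphone unseen during training can still be handled given a few support examples per pronunciation, which matches the abstract's claim of good performance on unseen polyphones.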