| Speech synthesis converts a text sequence into a speech sequence. A text sequence carries not only content information but also rich semantic and grammatical information, while a speech sequence carries speaker and prosody information. Speech synthesis has a wide range of practical applications, such as navigation announcements and e-book reading. At present, however, speech synthesis relies heavily on high-quality databases and fails to make full use of the information contained in text and speech sequences. Based on the idea of information supplementation, this paper adds content, prosodic, speaker, and semantic information from the perspectives of the text sequence and the speech sequence to improve the quality of synthesized speech. 1) Speech synthesis under data-limited conditions is studied. The database plays a key role in training a speech synthesis model, and database quality is usually positively correlated with model performance. Data limitations can take the form of outdated text, uneven speaking rate, incorrect labeling, silent segments, and so on. A speech synthesis model trained under such limitations models content and prosody poorly. From the perspective of information supplementation, this paper studies different model learning strategies, such as transfer learning, multi-task learning, and pre-trained models, to supplement content and prosodic information. An analysis of speaker similarity shows that the choice of learning strategy affects speaker information modeling. A speaker recognition model is therefore adopted to supplement speaker information, and methods based on speaker embedding vectors and on a speaker recognition loss function are studied respectively. The experimental results show that the proposed scheme improves the MCD (mel-cepstral distortion) index by 1.016. 2) A semantic information supplementation method based on pre-trained models is studied. This chapter studies the addition of semantic information when a standard database is used. Speech modeled on standard text-speech data is of high quality, but in Chinese the same sentence can be pronounced differently in different contexts, and the limited text in a standard database is insufficient for the model to learn this. After investigating and comparing different methods, this paper adopts pre-trained models to add semantic information. First, the overall semantic information of the text sequence is modeled with a pre-trained BERT model. Considering the semantic information contained in the category information output by the BERT model, the position at which the semantic information is introduced into the speech synthesis model is studied; an implicit prosody extractor is trained with the semantic information, and the semantic information is supplemented at the test stage. Second, based on modeling of the prosodic words in the text sequence, the influence of pre-trained word vectors, in terms of text word segmentation, word-vector modeling, character-vector alignment, and the information fusion mode of the speech synthesis model, on the experimental results is studied. The experimental results show that speech synthesized with the above two pre-trained models improves the MCD index by about 0.436. |
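
Both results above are reported as changes in the MCD index. The abstract does not say how MCD is computed, so the following is only a minimal sketch of the conventional mel-cepstral distortion formula, with lower values indicating synthesized speech that is spectrally closer to the reference. It assumes that the reference and synthesized mel-cepstral frames are already time-aligned (for example by dynamic time warping, which is not shown) and that the 0th (energy) coefficient is excluded. The function name `mel_cepstral_distortion` and the input layout are hypothetical choices for illustration, not the author's implementation.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]],
    syn: Sequence[Sequence[float]],
) -> float:
    """Average mel-cepstral distortion in dB over time-aligned frames.

    Each argument is a list of frames; each frame is a list of
    mel-cepstral coefficients. Assumption: frames are already aligned
    one-to-one, and coefficient 0 (energy) is skipped.
    """
    if len(ref) != len(syn) or len(ref) == 0:
        raise ValueError("ref and syn need the same non-zero number of frames")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for ref_frame, syn_frame in zip(ref, syn):
        if len(ref_frame) != len(syn_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D (skip c0).
        squared = sum((a - b) ** 2 for a, b in zip(ref_frame[1:], syn_frame[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref)


if __name__ == "__main__":
    reference = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    synthesized = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
    print(mel_cepstral_distortion(reference, synthesized))
```

Under this definition, an improvement in MCD such as the ones reported above would correspond to a lower average distortion between the synthesized and reference speech on the same test utterances.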