| The output speech of speech synthesis (TTS) systems can be evaluated from various aspects, chiefly intelligibility and naturalness. Intelligibility depends mainly on the performance of the text processing module at the front end of speech synthesis, and although existing technologies have reached a good level, there is still room for improvement. Naturalness depends on the performance of the prosodic structure prediction module, which cannot yet simulate the prosody of natural language well, leaving a large gap between the overall naturalness of synthesized speech and the prosody of human pronunciation. The construction of a TTS text corpus is therefore the foundation of a TTS system and an important means of improving the intelligibility and naturalness of synthesized speech. This thesis investigates key techniques for constructing a TTS text corpus, namely three subtasks in the text processing module of the speech synthesis front end: grapheme-to-phoneme conversion, phoneme balance measurement, and prosodic structure prediction. These three subtasks are studied and improved in order to further improve the intelligibility and naturalness of synthesized speech. The main work of this thesis is as follows. (1) Most current grapheme-to-phoneme conversion studies are based on a single language. This thesis studies the Transformer architecture for multilingual (Chinese, English, Japanese, Korean, and Cantonese) grapheme-to-phoneme conversion on mixed-language text, addressing the limitation that previous models perform well only on a single language. In addition, this thesis builds its own Cantonese dataset and optimizes and expands the existing Korean and Japanese datasets. Experimental results show that the model significantly reduces the phoneme error rate and word error rate compared with the monolingual case. (2) Most studies have applied the N-gram model at the word level, and this thesis
proposes applying the N-gram model to phoneme balance measurement, a phoneme-level task. The combination probabilities obtained from the unigram, bigram, and trigram models are used to judge whether a phoneme combination is plausible, which addresses the oversized phoneme table and high computational complexity that the traditional method incurs when a language has too many phoneme measurement units. The phoneme balance of the grapheme-to-phoneme conversion results from the first task is measured to ensure coverage of the relevant phonemes while keeping the constructed corpus as small as possible. (3) A character-level span-based model is used for prosodic structure prediction. It accepts Chinese characters directly as input without relying on complex feature engineering, avoids the error accumulation caused by the prerequisite word segmentation module, and improves the prediction capability of the model. The model also performs better on the label-revised Chinese standard female voice database. In addition, compared with previous methods, the model improves the F-scores for prosodic phrases and intonation phrases by 1.05% and 1.27%, respectively. |
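
The phoneme-level N-gram idea in item (2) can be illustrated with a minimal sketch. It assumes add-one smoothing, an end-of-sequence marker, and a toy phoneme inventory, none of which are specified in the abstract. The function names `count_phoneme_ngrams` and `ngram_probability` are hypothetical and do not come from the thesis.

```python
from collections import Counter


def count_phoneme_ngrams(sequences, max_n=3):
    """Count phoneme n-grams of order 1..max_n (unigram, bigram, trigram).

    An end marker is appended to every sequence so that each history
    is always followed by some symbol.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for seq in sequences:
        padded = list(seq) + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[n][tuple(padded[i:i + n])] += 1
    return counts


def ngram_probability(counts, gram, vocab_size):
    """Add-one smoothed probability of the last phoneme given its history.

    Smoothing is an assumption of this sketch, not the thesis's method.
    """
    n = len(gram)
    numerator = counts[n][gram] + 1
    if n == 1:
        denominator = sum(counts[1].values()) + vocab_size
    else:
        denominator = counts[n - 1][gram[:-1]] + vocab_size
    return numerator / denominator


# Toy phoneme sequences (illustrative only, not real G2P output).
toy_corpus = [["n", "i", "h", "ao"], ["n", "i", "m", "en"]]
counts = count_phoneme_ngrams(toy_corpus)
vocab_size = len(counts[1])  # distinct phoneme types, including "</s>"

seen = ngram_probability(counts, ("n", "i"), vocab_size)
unseen = ngram_probability(counts, ("i", "n"), vocab_size)
print(f"P(i | n) = {seen:.3f}, P(n | i) = {unseen:.3f}")
```

A combination observed in the data receives a higher probability than an unobserved one, which is the kind of signal that could be thresholded to judge whether a phoneme combination is plausible. How the thesis actually combines the three model orders is not stated in the abstract.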