
Research And Implementation Of Multi-Speaker Speech Synthesis System For Audio Novels

Posted on: 2024-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Sun
Full Text: PDF
GTID: 2568306944457044
Subject: Computer technology
Abstract/Summary:
Multi-speaker speech synthesis is a technology that converts text into speech while producing the voices of designated speakers. It is one of the most common human-computer interaction scenarios. With the rapid development of computer science and artificial intelligence, multi-speaker speech synthesis based on deep neural networks has become a mainstream research direction. A multi-speaker speech synthesis system aims to integrate the ability to synthesize the voices of different speakers into one system. Built on such a system, an audio-novel speech synthesis system demands higher accuracy in the synthesized speech and higher similarity between the synthesized voice and the target speaker. However, many problems remain throughout the multi-speaker speech synthesis pipeline. In most speech synthesis systems, text normalization is designed around rules or character-level classification; these solutions can give low accuracy on ambiguous symbols, which degrades the synthesized speech. When training a multi-speaker TTS model, the representations extracted by the speaker feature extraction model may be weak, and weak features lead to poor synthesis quality. To address these two problems, this thesis carries out the following work:

(1) Designed a text normalization system based on non-standard-token-level classification. The classification granularity is promoted from single characters to whole ambiguous non-standard tokens. Before classification, a scanning component locates all non-standard tokens in a sentence using given regular expressions; the ambiguous tokens are then classified by a deep learning model. The proposed strategy improves classification accuracy on ambiguous symbols and reduces the computational load of the model by pre-scanning the whole sentence.

(2) Designed a self-attention-based speaker feature extraction model. A self-attention-based voice feature extraction module is added to the traditional speaker feature extraction model. Several voice feature embeddings are trained in this module; each feature computed by the speaker feature extractor is scored for similarity against every voice feature embedding to obtain a weight, and the final speaker feature is the weighted sum of the voice feature embeddings. This additional module strengthens feature representation and alleviates poor synthesis quality.

(3) Designed and implemented the complete multi-speaker speech synthesis system, including text processing, the acoustic model, the vocoder, and the speaker feature extractor. Each deep learning model is trained and deployed in a separate process, and parallel scheduling and other methods improve the throughput of the whole system.
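The pre-scan-then-classify normalization strategy in (1) can be sketched as follows. The patterns and category labels here are illustrative assumptions, and a simple rule stub stands in for the deep-learning classifier the thesis actually uses:

```python
import re

# Illustrative regular expressions that locate non-standard tokens.
# A real system would use the project's own pattern inventory.
NST_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),   # dates like 2024-06-30
    re.compile(r"\d+(?:\.\d+)?%"),      # percentages
    re.compile(r"\d+"),                 # bare digit runs (ambiguous)
]

def scan_non_standard_tokens(sentence):
    """Pre-scan: return (start, end, text) spans of non-standard tokens,
    skipping spans already covered by an earlier, more specific pattern."""
    spans, taken = [], set()
    for pattern in NST_PATTERNS:
        for m in pattern.finditer(sentence):
            if not any(i in taken for i in range(m.start(), m.end())):
                spans.append((m.start(), m.end(), m.group()))
                taken.update(range(m.start(), m.end()))
    return sorted(spans)

def classify_token(token):
    """Stand-in for the deep-learning classifier over ambiguous tokens."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", token):
        return "date"
    if token.endswith("%"):
        return "percent"
    return "cardinal"

sentence = "The model reached 95% accuracy on 2024-06-30 over 12 test sets."
tokens = scan_non_standard_tokens(sentence)
labels = [(text, classify_token(text)) for (_, _, text) in tokens]
```

Because only the scanned spans reach the classifier, the model never processes the standard words of the sentence, which is where the reduction in computational load comes from.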
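The attention pooling in (2) can be illustrated with a minimal sketch: similarity scores against the learned voice feature embeddings are softmax-normalized into weights, and the speaker feature is the weighted sum of those embeddings. The 2-D embeddings below are toy values; real extractors operate on learned vectors of hundreds of dimensions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attentive_speaker_feature(extracted_feature, voice_embeddings):
    """Score the extracted feature against every voice feature embedding,
    turn the scores into attention weights, and return the weighted sum."""
    scores = [dot(extracted_feature, e) for e in voice_embeddings]
    weights = softmax(scores)
    dim = len(voice_embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, voice_embeddings))
            for i in range(dim)]

# Toy illustration: two trained voice feature embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
feature = attentive_speaker_feature([2.0, 0.0], embeddings)
```

An input closer to the first embedding receives most of the attention weight, so the pooled speaker feature is pulled toward that embedding.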
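The throughput idea in (3) can be sketched with stage stubs and a worker pool. The thesis deploys each model in its own process; threads and the function names below are used only to keep the sketch self-contained and are not the thesis implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Stage stubs standing in for the deployed models (names are illustrative).
def text_frontend(text):       return f"norm({text})"
def acoustic_model(norm_text): return f"mel({norm_text})"
def vocoder(mel):              return f"wav({mel})"

def synthesize(text):
    """Run one request through frontend -> acoustic model -> vocoder."""
    return vocoder(acoustic_model(text_frontend(text)))

def synthesize_batch(texts, workers=4):
    """Schedule independent requests in parallel to raise system throughput."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(synthesize, texts))

outputs = synthesize_batch(["hello", "world"])
```

Since each request is independent, overlapping them hides per-stage latency; a production system would additionally pipeline the stages across processes so the acoustic model and vocoder work on different requests at the same time.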
Keywords/Search Tags:multi-speaker speech synthesis, attention mechanism, text normalization, speaker feature extraction