
Research And Implementation Of Multi-Speaker Speech Synthesis System For Audio Novels

Posted on: 2024-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Sun
Full Text: PDF
GTID: 2568306944457044
Subject: Computer technology
Abstract/Summary:
Multi-speaker speech synthesis is a technology that converts text into speech while producing the voices of designated speakers. It is one of the most common human-computer interaction scenarios. With the rapid development of computer science and artificial intelligence, multi-speaker speech synthesis based on deep neural networks has become a mainstream research direction. A multi-speaker speech synthesis system aims to integrate the ability to synthesize the voices of different speakers into one system. Built on such a system, an audio-novel speech synthesis system demands higher accuracy in the synthesized speech and higher similarity between the synthesized voice and the target speaker. However, many problems remain throughout the multi-speaker speech synthesis pipeline. In most speech synthesis systems, text normalization is designed around rules or character-level classification; these solutions can give low accuracy on ambiguous symbols, which degrades the synthesized speech. When training a multi-speaker TTS model, the representations extracted by the speaker feature extraction model may be weak, and weak features lead to poor synthesis quality. To address these two problems, this thesis carries out the following work:

(1) Designed a text normalization system based on non-standard-token-level classification. The classification granularity is promoted from single characters to whole ambiguous non-standard tokens. Before classification, a scanning component locates all non-standard tokens in a sentence using given regular expressions; the ambiguous tokens are then classified by a deep learning model. The proposed strategy improves classification accuracy on ambiguous symbols and reduces the computational load of the model by pre-scanning the whole sentence.

(2) Designed a self-attention-based speaker feature extraction model. A self-attention-based voice feature extraction module is added to the traditional speaker feature extraction model. Several voice feature embeddings are trained in this module; each feature computed by the speaker feature extractor is scored for similarity against every voice feature embedding to obtain a weight, and the final speaker feature is the weighted sum of the voice feature embeddings. This additional module strengthens feature representation and alleviates poor synthesis quality.

(3) Designed and implemented the complete multi-speaker speech synthesis system, including text processing, the acoustic model, the vocoder, and the speaker feature extractor. Each deep learning model is trained and deployed in a separate process, and parallel scheduling and other methods improve the throughput of the whole system.
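The pre-scan-then-classify normalization strategy in (1) can be sketched as follows. The patterns and category labels here are illustrative assumptions, and a simple rule stub stands in for the deep-learning classifier the thesis actually uses:

```python
import re

# Illustrative regular expressions that locate non-standard tokens.
# A real system would use the project's own pattern inventory.
NST_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),   # dates like 2024-06-30
    re.compile(r"\d+(?:\.\d+)?%"),      # percentages
    re.compile(r"\d+"),                 # bare digit runs (ambiguous)
]

def scan_non_standard_tokens(sentence):
    """Pre-scan: return (start, end, text) spans of non-standard tokens,
    skipping spans already covered by an earlier, more specific pattern."""
    spans, taken = [], set()
    for pattern in NST_PATTERNS:
        for m in pattern.finditer(sentence):
            if not any(i in taken for i in range(m.start(), m.end())):
                spans.append((m.start(), m.end(), m.group()))
                taken.update(range(m.start(), m.end()))
    return sorted(spans)

def classify_token(token):
    """Stand-in for the deep-learning classifier over ambiguous tokens."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", token):
        return "date"
    if token.endswith("%"):
        return "percent"
    return "cardinal"

sentence = "The model reached 95% accuracy on 2024-06-30 over 12 test sets."
tokens = scan_non_standard_tokens(sentence)
labels = [(text, classify_token(text)) for (_, _, text) in tokens]
```

Because only the scanned spans reach the classifier, the model never processes the standard words of the sentence, which is where the reduction in computational load comes from.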
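The attention pooling in (2) can be illustrated with a minimal sketch: similarity scores against the learned voice feature embeddings are softmax-normalized into weights, and the speaker feature is the weighted sum of those embeddings. The 2-D embeddings below are toy values; real extractors operate on learned vectors of hundreds of dimensions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attentive_speaker_feature(extracted_feature, voice_embeddings):
    """Score the extracted feature against every voice feature embedding,
    turn the scores into attention weights, and return the weighted sum."""
    scores = [dot(extracted_feature, e) for e in voice_embeddings]
    weights = softmax(scores)
    dim = len(voice_embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, voice_embeddings))
            for i in range(dim)]

# Toy illustration: two trained voice feature embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
feature = attentive_speaker_feature([2.0, 0.0], embeddings)
```

An input closer to the first embedding receives most of the attention weight, so the pooled speaker feature is pulled toward that embedding.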
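The throughput idea in (3) can be sketched with stage stubs and a worker pool. The thesis deploys each model in its own process; threads and the function names below are used only to keep the sketch self-contained and are not the thesis implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Stage stubs standing in for the deployed models (names are illustrative).
def text_frontend(text):       return f"norm({text})"
def acoustic_model(norm_text): return f"mel({norm_text})"
def vocoder(mel):              return f"wav({mel})"

def synthesize(text):
    """Run one request through frontend -> acoustic model -> vocoder."""
    return vocoder(acoustic_model(text_frontend(text)))

def synthesize_batch(texts, workers=4):
    """Schedule independent requests in parallel to raise system throughput."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(synthesize, texts))

outputs = synthesize_batch(["hello", "world"])
```

Since each request is independent, overlapping them hides per-stage latency; a production system would additionally pipeline the stages across processes so the acoustic model and vocoder work on different requests at the same time.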
Keywords/Search Tags:multi-speaker speech synthesis, attention mechanism, text normalization, speaker feature extraction