Research And Implementation Of Chinese Text-to-speech Technology Based On Deep Learning

Posted on: 2022-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: D S He
GTID: 2518306740983269
Subject: Software engineering
Abstract/Summary:
As a key research area in machine intelligence, text-to-speech addresses the problem of making machines speak. As intelligent voice devices become part of everyday life, the quality of machine-generated speech matters more and more. In recent years, researchers have applied deep learning to speech synthesis, which has greatly improved the quality of generated speech compared with traditional speech synthesis techniques. However, several problems remain: research is concentrated mainly on English, so Chinese speech synthesis still leaves considerable room for exploration; the naturalness of synthesized speech needs to be improved; and the requirements for personalized speech generation are rising. This thesis addresses these issues by studying Chinese speech synthesis based on deep learning. The main work and contributions are as follows:

(1) Chinese text-to-speech data. A method for building Chinese speech datasets is proposed and used to build a Chinese text-to-speech dataset. The method draws on the large number of high-quality speech sources available on the Internet and introduces automated steps into production. Using it, a high-quality Chinese text-to-speech dataset named CFNAS was produced. A Transformer-based text-to-speech system built on this dataset achieves highly natural speech, which verifies the validity of the CFNAS dataset. Compared with traditional ways of producing speech datasets, the proposed method improves production efficiency while reducing production cost.

(2) Quality of synthesized speech. To reduce errors in Chinese text-to-speech, a local attention mechanism is introduced into the model to guide it to learn the alignment between the input text and the speech frames more effectively. Experiments show that the Transformer model based on local attention effectively reduces word skipping, repetition, and unnatural prosody in synthesized sentences, and improves the model's performance on long sentences.

(3) Personalized text-to-speech from a small amount of data. Two solutions are proposed. The first is adaptive training, which uses about 10 minutes of the target speaker's speech to adapt a pre-trained model without changing the model; it is simple and yields synthesized speech with high similarity to the target speaker. The second is based on speaker coding: a speaker coding network extracts features from a few seconds of the target speaker's speech, and these features are merged into the text-to-speech model so that the network can synthesize personalized speech conditioned on them. This method requires a large amount of multi-speaker speech data during training, but it greatly reduces the amount of data needed from the target speaker, and a single model can be applied to all speakers.

In summary, the Chinese text-to-speech methods researched and implemented in this thesis improve, to a certain extent, both the production efficiency of Chinese text-to-speech datasets and the quality of synthesized speech. In addition, adaptive training and speaker coding are proposed as two methods for personalized text-to-speech based on a small amount of data, and both have strong application value.
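
The abstract does not say how the local attention mechanism is defined. As an illustration of the general idea only, the following minimal sketch restricts each output speech frame to attend to input tokens near an expected alignment position. The linear, monotonic alignment prior, the `window` half-width, and the function name `local_attention_weights` are all assumptions made for this example and are not taken from the thesis.

```python
import numpy as np

def local_attention_weights(scores, window=3):
    """Softmax over attention scores restricted to a band around the diagonal.

    scores: (T_out, T_in) raw scores between output frames and input tokens.
    window: hypothetical half-width (in tokens) of the allowed band; should be >= 0.5
            so that every frame keeps at least one token.
    """
    t_out, t_in = scores.shape
    # Assumed prior: frame i aligns near token i * (T_in - 1) / (T_out - 1).
    centers = np.arange(t_out) * (t_in - 1) / max(t_out - 1, 1)
    positions = np.arange(t_in)
    # True where a token lies within `window` of the frame's expected position.
    mask = np.abs(positions[None, :] - centers[:, None]) <= window
    # Tokens outside the band get -inf so they receive zero weight.
    masked = np.where(mask, scores, -np.inf)
    # Numerically stable row-wise softmax.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)
```

Masking distant tokens in this way prevents a frame from jumping far ahead or back in the text, which is one plausible route to fewer skipped or repeated words. A soft penalty, such as a Gaussian bias around the expected position, would be an alternative design.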
Keywords/Search Tags: deep learning, text-to-speech, Transformer model, personalized text-to-speech