Research On Prosodic Structure Prediction Based On Deep Neural Network

Posted on: 2017-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2308330482479277
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Chinese prosodic prediction plays an important role in the naturalness of synthetic speech. The goal of this paper is to improve the prediction accuracy of the prosodic structure. With previous statistical prosodic prediction models, researchers need to do a lot of feature engineering. In addition, because the word representations carry no correlation between words, a "lexical gap" arises: even two synonyms show no correlation. Therefore, we need representations that reflect the relationships between words, and we use them as the input features of the model. Hence, this paper uses a deep neural network as the prosodic prediction model.

In this paper, we first use Gensim to train lexical word embeddings and then learn prosodic word embeddings by composing the lexical word embeddings. Second, we improve the hidden layer of the traditional neural network model so that it better captures word-word interaction. The main work is as follows:
(1) Using Gensim to train word embeddings for lexical words, using the lexical word embeddings to learn prosodic word embeddings, and using word embeddings at different levels to capture the prosodic structure information in the context;
(2) Training the neural network model on labeled data, using the lexical word embeddings, prosodic word embeddings, tag embeddings and length embeddings as the input features to improve the prediction ability of the model;
(3) Adding a tensor to the hidden layer to improve the ability of the model. The tensor captures the word-word interaction and the interaction between different prosodic levels.

The experimental results show that compound input features perform better than a single input feature: the ER (error rate) of prosodic words decreases by 3.2 percentage points (from 15.3% to 12.1%), and the ER of prosodic phrases decreases by 5 percentage points (from 40.3% to 35.3%). After adding the tensor to the hidden layer, the ER of prosodic words decreases by a further 0.5 percentage points (from 12.1% to 11.6%).
The results show that compound input features reduce the ER of prosodic prediction, and that, compared with the traditional hidden layer, a hidden layer with a tensor captures more information across different prosodic levels.
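
As a rough illustration of the two ideas in the abstract (compound input features and a tensor in the hidden layer), the following is a minimal sketch, not the thesis's actual implementation. It assumes the compound input is a plain concatenation of the four embeddings, and that the tensor layer has the common bilinear form h[k] = tanh(x^T T[k] x + W[k] x + b[k]); the abstract does not give either detail. The function names, dimensions, and all numbers are hypothetical.

```python
import math
import random


def concat_features(*vectors):
    """Concatenate several feature vectors into one flat input vector."""
    return [value for vector in vectors for value in vector]


def tensor_hidden_layer(x, T, W, b):
    """Return h where h[k] = tanh(x^T T[k] x + W[k] . x + b[k]).

    Assumed form only: T holds one n_in x n_in matrix per hidden unit,
    W holds one weight row per hidden unit, b holds one bias per unit.
    """
    hidden = []
    for T_k, W_k, b_k in zip(T, W, b):
        # Bilinear term: every pair of input dimensions (i, j) gets its
        # own weight, which is what lets the layer model interactions.
        bilinear = sum(
            x[i] * T_k[i][j] * x[j]
            for i in range(len(x))
            for j in range(len(x))
        )
        # Linear term: the pre-activation of an ordinary hidden layer.
        linear = sum(w * v for w, v in zip(W_k, x))
        hidden.append(math.tanh(bilinear + linear + b_k))
    return hidden


# Toy 2-dimensional embeddings (made-up values, not trained ones).
lexical = [0.5, -0.2]
prosodic = [0.1, 0.3]
tag = [1.0, 0.0]
length = [0.4, 0.4]
x = concat_features(lexical, prosodic, tag, length)

# Small random parameters with a fixed seed so the run is repeatable.
n_in, n_hidden = len(x), 3
rng = random.Random(0)
T = [[[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
      for _ in range(n_in)]
     for _ in range(n_hidden)]
W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
b = [0.0] * n_hidden

h = tensor_hidden_layer(x, T, W, b)
print(h)
```

Setting every entry of T to zero reduces this layer to an ordinary tanh hidden layer, so the bilinear term is the only difference between the two variants in this sketch.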
Keywords/Search Tags: Speech synthesis, Prosodic structure prediction, Word embedding, Deep Neural Network