
Deep Learning Based Speech Emotion Recognition By Fusing Acoustic Features And Transcription Clues

Posted on: 2021-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: T Tian
Full Text: PDF
GTID: 2428330605464172
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the development of artificial intelligence in recent years, a great deal of research has been carried out on speech emotion recognition, with the goal of improving human-computer interaction and enabling intelligent devices to perceive human emotion from speech. In a conversation, both the speech signal and its transcription convey information about the speaker's emotional state, and human listeners draw on these cues to infer the speaker's psychological state. Early work in affective computing used traditional machine learning algorithms to recognize the emotional state of an utterance. In recent years, with the rapid development of deep learning, deep neural networks have been widely studied and applied to emotion analysis, and their performance greatly exceeds that of traditional affective computing methods.

This thesis proposes a deep neural network that jointly exploits the acoustic and textual information of an utterance to recognize the speaker's emotional state. First, we extract acoustic features from the speech signal, such as Mel-frequency cepstral coefficients and linear prediction coefficients, together with textual features of the transcription, such as word embedding vectors, and feed them into a bimodal emotion recognition model. The proposed network comprises three components: an acoustic part, a textual part, and a bimodal fusion part. In the acoustic part, we construct a temporal global feature extractor and combine it with an attention layer to extract high-level acoustic features. In the textual part, we use a bidirectional long short-term memory network with an attention layer to extract high-level textual features. In the bimodal fusion part, we apply feature-level fusion and decision-level fusion, respectively, to combine the high-level features or the decision results of the acoustic and textual parts, and we compare the performance of the two fusion strategies.

In the experiments, we evaluate the proposed network on the IEMOCAP and EMODB datasets. Over four emotion categories, the single-modal acoustic model achieves an accuracy of 64.2% and the single-modal textual model achieves 65.8%. Feature-level fusion raises the bimodal recognition accuracy to 72.3%, while decision-level fusion achieves up to 74.8%. Compared with single-modal emotion recognition, the accuracy of bimodal fusion is significantly improved.
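As a rough illustration of the feature extraction step described above, the following Python sketch computes Mel-frequency cepstral coefficients and linear prediction coefficients with librosa. The thesis does not name a toolkit; librosa, the file path, the sampling rate, and the coefficient counts here are placeholder assumptions, not the thesis's actual configuration.

```python
# Illustrative acoustic feature extraction (not the thesis's exact setup).
# "speech.wav", sr=16000, n_mfcc=13, and order=12 are assumed placeholders.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # mono waveform at 16 kHz

# Mel-frequency cepstral coefficients: shape (n_mfcc, frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Linear prediction coefficients over the utterance: shape (order + 1,)
lpc = librosa.lpc(y, order=12)

# Frame-level acoustic feature matrix for the network: (frames, n_mfcc)
acoustic_feats = mfcc.T.astype(np.float32)
print(acoustic_feats.shape, lpc.shape)
```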
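The three-component architecture can be sketched in PyTorch as follows. This is a minimal reconstruction under stated assumptions: the convolution-plus-BiLSTM form of the temporal global extractor, the additive attention, all layer sizes, and the four-class output are illustrative choices inferred from the abstract, not the thesis's exact design.

```python
# Minimal sketch of the bimodal model: acoustic part, textual part,
# and feature-level fusion. All hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Additive attention pooling over a sequence of hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                          # h: (batch, time, dim)
        w = torch.softmax(self.score(h), dim=1)    # weights: (batch, time, 1)
        return (w * h).sum(dim=1)                  # pooled: (batch, dim)

class AcousticBranch(nn.Module):
    """Temporal feature extractor (1-D conv + BiLSTM) with attention."""
    def __init__(self, n_feats=39, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, hidden, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = Attention(2 * hidden)

    def forward(self, x):                          # x: (batch, frames, n_feats)
        x = F.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(x)
        return self.attn(h)                        # (batch, 2 * hidden)

class TextualBranch(nn.Module):
    """Bidirectional LSTM over word embeddings with attention."""
    def __init__(self, vocab=10000, emb=300, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = Attention(2 * hidden)

    def forward(self, tokens):                     # tokens: (batch, words)
        h, _ = self.lstm(self.emb(tokens))
        return self.attn(h)                        # (batch, 2 * hidden)

class BimodalModel(nn.Module):
    """Feature-level fusion: concatenate branch outputs, then classify.
    For decision-level fusion, classify each branch separately and
    average (or weight) the two per-branch softmax outputs instead."""
    def __init__(self, hidden=128, n_classes=4):
        super().__init__()
        self.acoustic = AcousticBranch(hidden=hidden)
        self.textual = TextualBranch(hidden=hidden)
        self.classifier = nn.Linear(4 * hidden, n_classes)

    def forward(self, audio_feats, tokens):
        fused = torch.cat([self.acoustic(audio_feats),
                           self.textual(tokens)], dim=-1)
        return self.classifier(fused)              # logits: (batch, n_classes)

# Example forward pass with random inputs.
model = BimodalModel()
audio = torch.randn(2, 100, 39)                    # (batch, frames, feat dims)
tokens = torch.randint(0, 10000, (2, 20))          # (batch, words)
logits = model(audio, tokens)                      # (batch, 4)
```

The docstring of BimodalModel notes the contrast the thesis draws: feature-level fusion merges the two branches before a single classifier, while decision-level fusion keeps two classifiers and combines their outputs, which the abstract reports as the stronger strategy (74.8% vs 72.3%).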
Keywords/Search Tags:Emotion recognition, Acoustic features, Textual features, Feature level fusion, Decision level fusion