
Deep Learning Based Speech Emotion Recognition By Fusing Acoustic Features And Transcription Clues

Posted on: 2021-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: T Tian
Full Text: PDF
GTID: 2428330605464172
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the development of artificial intelligence in recent years, a great deal of research has been carried out on speech emotion recognition, with the goal of improving human-computer interaction and enabling intelligent devices to perceive human emotion from speech. In a conversation, both the speech signal and its transcription convey information about the speaker's emotional state, and human listeners draw on these cues to infer the speaker's psychological state. Early work in affective computing used traditional machine learning algorithms to recognize the emotional state of an utterance. In recent years, with the rapid development of deep learning, deep neural networks have been widely studied and applied to emotion analysis, and their performance greatly exceeds that of traditional affective computing methods.

This thesis proposes a deep neural network that jointly exploits the acoustic and textual information of an utterance to recognize the speaker's emotional state. First, we extract acoustic features from the speech signal, such as Mel-frequency cepstral coefficients and linear prediction coefficients, together with textual features of the transcription, such as word embedding vectors, and feed them into a bimodal emotion recognition model. The proposed network comprises three components: an acoustic part, a textual part, and a bimodal fusion part. In the acoustic part, we construct a temporal global feature extractor and combine it with an attention layer to extract high-level acoustic features. In the textual part, we use a bidirectional long short-term memory network with an attention layer to extract high-level textual features. In the bimodal fusion part, we apply feature-level fusion and decision-level fusion, respectively, to combine the high-level features or the decision results of the acoustic and textual parts, and we compare the performance of the two fusion strategies.

In the experiments, we evaluate the proposed network on the IEMOCAP and EMODB datasets. Over four emotion categories, the single-modal acoustic model achieves an accuracy of 64.2% and the single-modal textual model achieves 65.8%. Feature-level fusion raises the bimodal recognition accuracy to 72.3%, while decision-level fusion achieves up to 74.8%. Compared with single-modal emotion recognition, the accuracy of bimodal fusion is significantly improved.
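As a rough illustration of the feature extraction step described above, the following Python sketch computes Mel-frequency cepstral coefficients and linear prediction coefficients with librosa. The thesis does not name a toolkit; librosa, the file path, the sampling rate, and the coefficient counts here are placeholder assumptions, not the thesis's actual configuration.

```python
# Illustrative acoustic feature extraction (not the thesis's exact setup).
# "speech.wav", sr=16000, n_mfcc=13, and order=12 are assumed placeholders.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # mono waveform at 16 kHz

# Mel-frequency cepstral coefficients: shape (n_mfcc, frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Linear prediction coefficients over the utterance: shape (order + 1,)
lpc = librosa.lpc(y, order=12)

# Frame-level acoustic feature matrix for the network: (frames, n_mfcc)
acoustic_feats = mfcc.T.astype(np.float32)
print(acoustic_feats.shape, lpc.shape)
```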
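The three-component architecture can be sketched in PyTorch as follows. This is a minimal reconstruction under stated assumptions: the convolution-plus-BiLSTM form of the temporal global extractor, the additive attention, all layer sizes, and the four-class output are illustrative choices inferred from the abstract, not the thesis's exact design.

```python
# Minimal sketch of the bimodal model: acoustic part, textual part,
# and feature-level fusion. All hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Additive attention pooling over a sequence of hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                          # h: (batch, time, dim)
        w = torch.softmax(self.score(h), dim=1)    # weights: (batch, time, 1)
        return (w * h).sum(dim=1)                  # pooled: (batch, dim)

class AcousticBranch(nn.Module):
    """Temporal feature extractor (1-D conv + BiLSTM) with attention."""
    def __init__(self, n_feats=39, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, hidden, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = Attention(2 * hidden)

    def forward(self, x):                          # x: (batch, frames, n_feats)
        x = F.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(x)
        return self.attn(h)                        # (batch, 2 * hidden)

class TextualBranch(nn.Module):
    """Bidirectional LSTM over word embeddings with attention."""
    def __init__(self, vocab=10000, emb=300, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = Attention(2 * hidden)

    def forward(self, tokens):                     # tokens: (batch, words)
        h, _ = self.lstm(self.emb(tokens))
        return self.attn(h)                        # (batch, 2 * hidden)

class BimodalModel(nn.Module):
    """Feature-level fusion: concatenate branch outputs, then classify.
    For decision-level fusion, classify each branch separately and
    average (or weight) the two per-branch softmax outputs instead."""
    def __init__(self, hidden=128, n_classes=4):
        super().__init__()
        self.acoustic = AcousticBranch(hidden=hidden)
        self.textual = TextualBranch(hidden=hidden)
        self.classifier = nn.Linear(4 * hidden, n_classes)

    def forward(self, audio_feats, tokens):
        fused = torch.cat([self.acoustic(audio_feats),
                           self.textual(tokens)], dim=-1)
        return self.classifier(fused)              # logits: (batch, n_classes)

# Example forward pass with random inputs.
model = BimodalModel()
audio = torch.randn(2, 100, 39)                    # (batch, frames, feat dims)
tokens = torch.randint(0, 10000, (2, 20))          # (batch, words)
logits = model(audio, tokens)                      # (batch, 4)
```

The docstring of BimodalModel notes the contrast the thesis draws: feature-level fusion merges the two branches before a single classifier, while decision-level fusion keeps two classifiers and combines their outputs, which the abstract reports as the stronger strategy (74.8% vs 72.3%).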
Keywords/Search Tags:Emotion recognition, Acoustic features, Textual features, Feature level fusion, Decision level fusion