Research On Emotion Recognition Based On Multi-modal Feature Fusion

Posted on: 2020-12-06    Degree: Master    Type: Thesis
Country: China    Candidate: Y Q Feng    Full Text: PDF
GTID: 2438330578977076    Subject: Education Technology

Abstract
Emotion recognition plays an important role in human-computer interaction. Generally speaking, people's emotions are expressed mainly through facial expressions, gestures, and speech. As one of the most important channels of human expression, speech conveys emotion effectively and has been used successfully in automatic emotion recognition. However, speech is only one mode of emotional expression and does not carry all of the emotional information; text can also convey the speaker's feelings. Emotion recognition based on multi-modal feature fusion is therefore an important research direction.

The objective of this study is to improve the accuracy of emotion recognition by fusing speech and text features. To this end, the following experiments were designed. First, the speech data were preprocessed, low-level acoustic features were extracted, and statistical functionals were applied to these low-level features to construct global acoustic features, which were then used for speech emotion recognition. The model trained on speech alone serves as the baseline system against which the subsequent models are compared. Second, the text sentences were preprocessed and three types of features were extracted, namely bag-of-words features, word vectors, and sentence vectors, for text emotion recognition. The text features with the highest recognition accuracy were selected for subsequent fusion with the speech features. Finally, the speech features and the best-performing text features were fused for emotion recognition, and performance was compared on the IEMOCAP dataset. Two fusion methods were used: feature-level fusion and decision-level fusion. The study then compared the emotion recognition results after fusing speech and text features with the results of the single speech channel, and compared the influence of
the fusion method on the recognition results.

The experimental results show that the emotion recognition model trained on fused speech and text features achieves better recognition performance and higher accuracy than models trained on single-modality features. Specifically, the recognition rate of the fused speech-and-text model is higher than that of both the speech-only and the text-only emotion recognition models. Furthermore, decision-level fusion outperforms feature-level fusion: the recognition rate of the model fused at the decision level is higher than that of the model fused at the feature level. In general, compared with single-modality speech or text emotion recognition, multi-modal feature fusion effectively improves the accuracy of emotion recognition.
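The pipeline described above can be sketched in code. This is an illustrative sketch, not the thesis's implementation: the classifier (scikit-learn's LogisticRegression), the toy data, and the particular statistical functionals (mean, standard deviation, minimum, maximum) are all assumptions chosen to show how global acoustic features are built and how feature-level fusion (concatenation) differs from decision-level fusion (averaging class probabilities).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def global_acoustic_features(lld):
    """Build one fixed-length utterance vector from frame-level
    low-level descriptors (lld: shape (n_frames, n_descriptors))
    by applying statistical functionals along the time axis."""
    funcs = [np.mean, np.std, np.min, np.max]
    return np.concatenate([f(lld, axis=0) for f in funcs])

rng = np.random.default_rng(0)
n_utt = 40
# Toy data standing in for real utterances: 13 frame-level acoustic
# descriptors per frame, and a 50-dim text vector (e.g. bag-of-words).
X_speech = np.stack([global_acoustic_features(rng.normal(size=(100, 13)))
                     for _ in range(n_utt)])
X_text = rng.normal(size=(n_utt, 50))
y = np.arange(n_utt) % 4          # 4 emotion classes, all present

# Feature-level fusion: concatenate modalities, train one classifier.
feat_clf = LogisticRegression(max_iter=1000).fit(
    np.hstack([X_speech, X_text]), y)

# Decision-level fusion: one classifier per modality,
# then average the predicted class probabilities.
sp_clf = LogisticRegression(max_iter=1000).fit(X_speech, y)
tx_clf = LogisticRegression(max_iter=1000).fit(X_text, y)
probs = (sp_clf.predict_proba(X_speech) + tx_clf.predict_proba(X_text)) / 2
decision_pred = probs.argmax(axis=1)
```

In a real experiment the classifiers would be evaluated on a held-out split of IEMOCAP rather than on the training data; the sketch only shows where the two fusion strategies diverge in the pipeline.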
Keywords/Search Tags: speech emotion recognition, text emotion recognition, feature-level fusion, decision-level fusion