
Deep Emotion Recognition Based On Speech And Semantics

Posted on: 2022-10-21  Degree: Master  Type: Thesis
Country: China  Candidate: G Shen  Full Text: PDF
GTID: 2518306353484174  Subject: Software engineering
Abstract/Summary:
With the growing popularity of human-computer interaction, emotion recognition has drawn increasing attention from researchers. For example, chatbots are now widely used in customer service, and accurately detecting users' emotions from their utterances is key to a better user experience. Emotion recognition is also used to help children with autism overcome difficulties in recognizing and expressing emotions. Because of this practical importance, the task has attracted growing interest from both academia and industry. However, the inherent subtlety of human emotion makes it a challenging problem. Humans express emotion in a multimodal way: while earlier studies considered only a single modality, recent research confirms that combining modalities improves recognition performance. Audio and text are the two most commonly used modalities for emotion recognition.

Nevertheless, multimodal emotion recognition based on audio and text still faces several problems. First, the modalities are often fused independently, ignoring the interaction between them. Second, traditional speech feature processing ignores silent frames and background noise in the speech signal, which limits how well the extracted features express emotion. Third, deep-learning-based speech emotion representations improve recognition performance, but long audio is truncated during feature processing, so emotional information is lost. Finally, because emotion is expressed unevenly between speech and text, word-level dynamic interaction between the two modalities is needed to improve performance.

To address these shortcomings, this thesis proposes the following solutions. 1. A speech emotion representation model based on hierarchical learning: an attention mechanism aggregates frame-level, phoneme-level, and word-level features into a high-level emotional representation of speech, improving its expressive power and reducing the impact of noise. 2. A word-level dynamic interaction mechanism: an interaction matrix between the cells of a long short-term memory (LSTM) network learns the word-level dynamic consistency of the audio and text feature sequences, modeling how emotion evolves over time. 3. A deep multimodal fusion model based on a temporal attention mechanism, which learns the weights of the audio and text hidden states at each word position over time and produces a high-level joint emotional representation of speech and text, so that the two modalities complement each other. Finally, these components are combined into a word-level interaction-based multimodal fusion network for speech emotion recognition. Experimental results show that the proposed model outperforms recent methods on the standard IEMOCAP emotion recognition dataset.
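To illustrate the first component, the following is a minimal PyTorch sketch of hierarchical attention aggregation: frame-level acoustic features are attention-pooled into phoneme-level vectors, phoneme vectors into word-level vectors, and word vectors into an utterance-level speech representation. The module names, feature dimensions, and segmentation inputs are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Weights a variable-length sequence of vectors and sums them."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim) -> pooled vector: (dim,)
        weights = torch.softmax(self.score(x).squeeze(-1), dim=0)
        return (weights.unsqueeze(-1) * x).sum(dim=0)


class HierarchicalSpeechEncoder(nn.Module):
    """Frame -> phoneme -> word -> utterance attention pooling."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.frame_pool = AttentionPool(dim)   # frames within one phoneme
        self.phone_pool = AttentionPool(dim)   # phonemes within one word
        self.word_pool = AttentionPool(dim)    # words within the utterance

    def forward(self, frames, phoneme_spans, word_spans):
        # frames: (num_frames, dim) acoustic features (e.g. after a frame encoder)
        # phoneme_spans: list of (start, end) frame indices for each phoneme
        # word_spans: list of (start, end) phoneme indices for each word
        phonemes = torch.stack(
            [self.frame_pool(frames[s:e]) for s, e in phoneme_spans])
        words = torch.stack(
            [self.phone_pool(phonemes[s:e]) for s, e in word_spans])
        return words, self.word_pool(words)  # word-level sequence + utterance vector


# Usage with random features: 50 frames, 10 phonemes, 4 words.
enc = HierarchicalSpeechEncoder(dim=128)
frames = torch.randn(50, 128)
phoneme_spans = [(i * 5, (i + 1) * 5) for i in range(10)]
word_spans = [(0, 3), (3, 5), (5, 8), (8, 10)]
word_feats, utt_feat = enc(frames, phoneme_spans, word_spans)
print(word_feats.shape, utt_feat.shape)  # torch.Size([4, 128]) torch.Size([128])
```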
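For the second component, the sketch below shows one way the word-level dynamic interaction could be wired: two LSTM cells, one per modality, are stepped word by word, and at each step a bilinear interaction matrix relates the audio and text hidden states; the interaction signal is fed into the next step's inputs so each modality conditions on the other. The bilinear coupling and all dimensions are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn


class WordLevelInteraction(nn.Module):
    def __init__(self, audio_dim=128, text_dim=300, hidden=128):
        super().__init__()
        self.hidden = hidden
        # Each cell also receives the interaction vector from the previous step.
        self.audio_cell = nn.LSTMCell(audio_dim + hidden, hidden)
        self.text_cell = nn.LSTMCell(text_dim + hidden, hidden)
        self.interaction = nn.Bilinear(hidden, hidden, hidden)  # interaction matrix

    def forward(self, audio_words, text_words):
        # audio_words: (batch, num_words, audio_dim); text_words: (batch, num_words, text_dim)
        batch, num_words, _ = audio_words.shape
        h_a = c_a = h_t = c_t = audio_words.new_zeros(batch, self.hidden)
        inter = audio_words.new_zeros(batch, self.hidden)
        a_out, t_out = [], []
        for i in range(num_words):
            h_a, c_a = self.audio_cell(
                torch.cat([audio_words[:, i], inter], dim=-1), (h_a, c_a))
            h_t, c_t = self.text_cell(
                torch.cat([text_words[:, i], inter], dim=-1), (h_t, c_t))
            inter = torch.tanh(self.interaction(h_a, h_t))  # cross-modal consistency signal
            a_out.append(h_a)
            t_out.append(h_t)
        return torch.stack(a_out, dim=1), torch.stack(t_out, dim=1)


# Usage: batch of 2 utterances with 12 aligned words each.
model = WordLevelInteraction()
a_states, t_states = model(torch.randn(2, 12, 128), torch.randn(2, 12, 300))
print(a_states.shape, t_states.shape)  # torch.Size([2, 12, 128]) each
```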
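For the third component, this minimal sketch shows temporal-attention fusion at the word level: audio and text hidden-state sequences are attended over their word positions, and the attended vectors are concatenated into a joint emotional representation for classification. The names, hidden sizes, and the four-class setup are illustrative assumptions rather than the thesis's code.

```python
import torch
import torch.nn as nn


class TemporalAttentionFusion(nn.Module):
    def __init__(self, hidden=128, num_classes=4):
        super().__init__()
        self.audio_attn = nn.Linear(hidden, 1)  # score per word position (audio)
        self.text_attn = nn.Linear(hidden, 1)   # score per word position (text)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def attend(self, states, scorer):
        # states: (batch, num_words, hidden) -> context: (batch, hidden)
        weights = torch.softmax(scorer(states).squeeze(-1), dim=1)
        return torch.bmm(weights.unsqueeze(1), states).squeeze(1)

    def forward(self, audio_states, text_states):
        # Hidden-state sequences, e.g. the outputs of WordLevelInteraction above.
        a_ctx = self.attend(audio_states, self.audio_attn)
        t_ctx = self.attend(text_states, self.text_attn)
        joint = torch.cat([a_ctx, t_ctx], dim=-1)  # joint audio-text representation
        return self.classifier(joint)              # emotion logits


# Usage: fuse the word-level hidden states of 2 utterances with 12 words each.
fusion = TemporalAttentionFusion()
logits = fusion(torch.randn(2, 12, 128), torch.randn(2, 12, 128))
print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the attention weights are learned per word position for each modality, so a word whose emotion is carried mostly by prosody can dominate the audio context while its textual counterpart contributes less, which is the complementary behaviour the abstract describes.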
Keywords/Search Tags:Emotion Recognition, Word Level, Temporal Attention, Multimodal, Phoneme