
Deep Emotion Recognition Based On Speech And Semantics

Posted on: 2022-10-21  Degree: Master  Type: Thesis
Country: China  Candidate: G Shen  Full Text: PDF
GTID: 2518306353484174  Subject: Software engineering
Abstract/Summary:
With the growing popularity of human-computer interaction, emotion recognition has drawn increasing attention from researchers. For example, chatbots are now widely used in customer service, and accurately detecting users' emotions from their utterances is key to a better user experience. Emotion recognition is also used to help children with autism overcome difficulties in recognizing and expressing emotions. Because of this practical importance, the task has attracted growing interest from both academia and industry. However, the inherent subtlety of human emotion makes it a challenging problem. Humans express emotion in a multimodal way: while earlier studies considered only a single modality, recent research confirms that combining modalities improves recognition performance. Audio and text are the two most commonly used modalities for emotion recognition.

Nevertheless, multimodal emotion recognition based on audio and text still faces several problems. First, the modalities are often fused independently, ignoring the interaction between them. Second, traditional speech feature processing ignores silent frames and background noise in the speech signal, which limits how well the extracted features express emotion. Third, deep-learning-based speech emotion representations improve recognition performance, but long audio is truncated during feature processing, so emotional information is lost. Finally, because emotion is expressed unevenly between speech and text, word-level dynamic interaction between the two modalities is needed to improve performance.

To address these shortcomings, this thesis proposes the following solutions. 1. A speech emotion representation model based on hierarchical learning: an attention mechanism aggregates frame-level, phoneme-level, and word-level features into a high-level emotional representation of speech, improving its expressive power and reducing the impact of noise. 2. A word-level dynamic interaction mechanism: an interaction matrix between the cells of a long short-term memory (LSTM) network learns the word-level dynamic consistency of the audio and text feature sequences, modeling how emotion evolves over time. 3. A deep multimodal fusion model based on a temporal attention mechanism, which learns the weights of the audio and text hidden states at each word position over time and produces a high-level joint emotional representation of speech and text, so that the two modalities complement each other. Finally, these components are combined into a word-level interaction-based multimodal fusion network for speech emotion recognition. Experimental results show that the proposed model outperforms recent methods on the standard IEMOCAP emotion recognition dataset.
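To illustrate the first component, the following is a minimal PyTorch sketch of hierarchical attention aggregation: frame-level acoustic features are attention-pooled into phoneme-level vectors, phoneme vectors into word-level vectors, and word vectors into an utterance-level speech representation. The module names, feature dimensions, and segmentation inputs are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Weights a variable-length sequence of vectors and sums them."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim) -> pooled vector: (dim,)
        weights = torch.softmax(self.score(x).squeeze(-1), dim=0)
        return (weights.unsqueeze(-1) * x).sum(dim=0)


class HierarchicalSpeechEncoder(nn.Module):
    """Frame -> phoneme -> word -> utterance attention pooling."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.frame_pool = AttentionPool(dim)   # frames within one phoneme
        self.phone_pool = AttentionPool(dim)   # phonemes within one word
        self.word_pool = AttentionPool(dim)    # words within the utterance

    def forward(self, frames, phoneme_spans, word_spans):
        # frames: (num_frames, dim) acoustic features (e.g. after a frame encoder)
        # phoneme_spans: list of (start, end) frame indices for each phoneme
        # word_spans: list of (start, end) phoneme indices for each word
        phonemes = torch.stack(
            [self.frame_pool(frames[s:e]) for s, e in phoneme_spans])
        words = torch.stack(
            [self.phone_pool(phonemes[s:e]) for s, e in word_spans])
        return words, self.word_pool(words)  # word-level sequence + utterance vector


# Usage with random features: 50 frames, 10 phonemes, 4 words.
enc = HierarchicalSpeechEncoder(dim=128)
frames = torch.randn(50, 128)
phoneme_spans = [(i * 5, (i + 1) * 5) for i in range(10)]
word_spans = [(0, 3), (3, 5), (5, 8), (8, 10)]
word_feats, utt_feat = enc(frames, phoneme_spans, word_spans)
print(word_feats.shape, utt_feat.shape)  # torch.Size([4, 128]) torch.Size([128])
```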
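For the second component, the sketch below shows one way the word-level dynamic interaction could be wired: two LSTM cells, one per modality, are stepped word by word, and at each step a bilinear interaction matrix relates the audio and text hidden states; the interaction signal is fed into the next step's inputs so each modality conditions on the other. The bilinear coupling and all dimensions are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn


class WordLevelInteraction(nn.Module):
    def __init__(self, audio_dim=128, text_dim=300, hidden=128):
        super().__init__()
        self.hidden = hidden
        # Each cell also receives the interaction vector from the previous step.
        self.audio_cell = nn.LSTMCell(audio_dim + hidden, hidden)
        self.text_cell = nn.LSTMCell(text_dim + hidden, hidden)
        self.interaction = nn.Bilinear(hidden, hidden, hidden)  # interaction matrix

    def forward(self, audio_words, text_words):
        # audio_words: (batch, num_words, audio_dim); text_words: (batch, num_words, text_dim)
        batch, num_words, _ = audio_words.shape
        h_a = c_a = h_t = c_t = audio_words.new_zeros(batch, self.hidden)
        inter = audio_words.new_zeros(batch, self.hidden)
        a_out, t_out = [], []
        for i in range(num_words):
            h_a, c_a = self.audio_cell(
                torch.cat([audio_words[:, i], inter], dim=-1), (h_a, c_a))
            h_t, c_t = self.text_cell(
                torch.cat([text_words[:, i], inter], dim=-1), (h_t, c_t))
            inter = torch.tanh(self.interaction(h_a, h_t))  # cross-modal consistency signal
            a_out.append(h_a)
            t_out.append(h_t)
        return torch.stack(a_out, dim=1), torch.stack(t_out, dim=1)


# Usage: batch of 2 utterances with 12 aligned words each.
model = WordLevelInteraction()
a_states, t_states = model(torch.randn(2, 12, 128), torch.randn(2, 12, 300))
print(a_states.shape, t_states.shape)  # torch.Size([2, 12, 128]) each
```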
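For the third component, this minimal sketch shows temporal-attention fusion at the word level: audio and text hidden-state sequences are attended over their word positions, and the attended vectors are concatenated into a joint emotional representation for classification. The names, hidden sizes, and the four-class setup are illustrative assumptions rather than the thesis's code.

```python
import torch
import torch.nn as nn


class TemporalAttentionFusion(nn.Module):
    def __init__(self, hidden=128, num_classes=4):
        super().__init__()
        self.audio_attn = nn.Linear(hidden, 1)  # score per word position (audio)
        self.text_attn = nn.Linear(hidden, 1)   # score per word position (text)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def attend(self, states, scorer):
        # states: (batch, num_words, hidden) -> context: (batch, hidden)
        weights = torch.softmax(scorer(states).squeeze(-1), dim=1)
        return torch.bmm(weights.unsqueeze(1), states).squeeze(1)

    def forward(self, audio_states, text_states):
        # Hidden-state sequences, e.g. the outputs of WordLevelInteraction above.
        a_ctx = self.attend(audio_states, self.audio_attn)
        t_ctx = self.attend(text_states, self.text_attn)
        joint = torch.cat([a_ctx, t_ctx], dim=-1)  # joint audio-text representation
        return self.classifier(joint)              # emotion logits


# Usage: fuse the word-level hidden states of 2 utterances with 12 words each.
fusion = TemporalAttentionFusion()
logits = fusion(torch.randn(2, 12, 128), torch.randn(2, 12, 128))
print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the attention weights are learned per word position for each modality, so a word whose emotion is carried mostly by prosody can dominate the audio context while its textual counterpart contributes less, which is the complementary behaviour the abstract describes.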
Keywords/Search Tags:Emotion Recognition, Word Level, Temporal Attention, Multimodal, Phoneme