
A Study Of Deep Learning Based Multimodal Emotion Recognition

Posted on: 2020-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zhang
Full Text: PDF
GTID: 2428330572987272
Subject: Information and Communication Engineering

Abstract/Summary:
Emotion recognition plays an important role in building intelligent human-computer interaction systems and has an impact on education, security assistance, personal entertainment, and other fields. Humans usually express emotions through multimodal signals such as speech and facial expressions. In a traditional single-modal emotion recognition system, manually designed features are extracted from the signal and then used to train a classifier. However, these features cannot adequately characterize emotional information, which limits system performance. Recently, with the development of deep learning, deep learning based emotion recognition systems have demonstrated their superiority. Most of them use convolutional neural networks or long short-term memory networks to process the audio or video input directly. These methods do not take into account the sparsity of emotion, i.e., that emotion tends to exist only in local segments of a long signal, which makes such systems inefficient. The most common multimodal emotion recognition systems fuse the modalities at the decision level, i.e., by combining the outputs of separate single-modal systems, or in a middle layer, using linear fusion such as concatenation to integrate the features of different modalities. These strategies cannot capture the association between modalities deeply. This study is carried out to solve the two problems above.

First, an attention based fully convolutional network is proposed for speech emotion recognition. The fully convolutional network processes the speech spectrogram directly, so that no length normalization of the speech is needed. The attention mechanism estimates how strongly each time-frequency region of the spectrogram correlates with the emotion and assigns attention weights to the regions accordingly, so that the system focuses on the emotionally salient parts of the spectrogram. Transfer learning is introduced to cope with the scarcity of emotional speech data. In addition, we employ a scaled-softmax function to address the difficulty of training the attention mechanism with insufficient data. Our system achieves state-of-the-art results on the IEMOCAP database. We find that the attention mechanism ignores the non-speech segments of an utterance and assigns attention weights to time-frequency regions according to the emotional information they contain; the weights are usually smaller in the high-frequency regions of the spectrogram.

Second, we introduce the attention mechanism to the video emotion recognition task. The attention mechanism computes the importance of every frame in a video for the emotion and assigns attention weights accordingly, which focuses the system on the emotion-relevant frames. As in the speech system, a scaled-softmax function is used to ease the training of the attention mechanism on insufficient data. We validate the effectiveness of the attention mechanism on the AFEW8.0 video database. By analyzing the attention weight curves of videos, we find that the attention mechanism ignores anomalous frames and assigns each frame a weight according to its relevance to the emotional state.
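The abstract does not give the exact formulation of the attention mechanism, but the behaviour it describes (a relevance score per position, a softmax whose logits are multiplied by a constant scale factor to ease training on small datasets, and weighted pooling) can be sketched as follows. The class name, scoring network, scale value, and dimensions are illustrative assumptions, not the thesis's actual code; the same module covers both use cases, pooling over spectrogram time-frequency regions or over video frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledSoftmaxAttention(nn.Module):
    # Pools a variable number of feature vectors (time-frequency regions of a
    # spectrogram, or frames of a video) into one utterance/clip-level vector.
    # `scale` is the scaled-softmax factor: values > 1 sharpen the attention
    # distribution, which can help when training data are scarce.
    # NOTE: the scoring layer and scale=2.0 are assumptions for illustration.
    def __init__(self, feat_dim: int, scale: float = 2.0):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one relevance score per position
        self.scale = scale

    def forward(self, feats: torch.Tensor):
        # feats: (batch, num_positions, feat_dim)
        logits = self.score(feats).squeeze(-1)           # (batch, num_positions)
        weights = F.softmax(self.scale * logits, dim=-1)
        pooled = (weights.unsqueeze(-1) * feats).sum(dim=1)  # (batch, feat_dim)
        return pooled, weights  # weights support the weight-curve analysis

# Example: pool 120 frame-level features of dimension 512 per clip.
pool = ScaledSoftmaxAttention(feat_dim=512)
clip_vec, attn = pool(torch.randn(8, 120, 512))
```

Returning the weights alongside the pooled vector is what makes the attention-curve analyses above possible, since the per-region or per-frame weights can be plotted directly.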
Finally, we integrate the speech emotion recognition system and the video emotion recognition system described above, and introduce factorized bilinear pooling to fuse the speech and video features. Our system achieves state-of-the-art results on the AFEW8.0 audio-video database. By comparing the attention weight curves of the stand-alone video system with those of the video sub-system in the audio-video fusion system, we find that, owing to the bilinear pooling and joint training, the attention of the video sub-system is influenced by the audio signal, i.e., the audio and video are fused deeply.
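The abstract does not specify the fusion details, but a minimal sketch of low-rank factorized bilinear pooling, in the spirit of multi-modal factorized bilinear (MFB) pooling, is given below. The projection dimensions, rank, and normalization steps are assumptions for illustration, not the thesis's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    # Fuses an audio feature and a video feature with a low-rank bilinear
    # interaction: project both modalities, multiply element-wise, then
    # sum-pool groups of `rank` channels into the output dimension.
    # NOTE: out_dim=256 and rank=4 are illustrative assumptions.
    def __init__(self, audio_dim: int, video_dim: int,
                 out_dim: int = 256, rank: int = 4):
        super().__init__()
        self.rank = rank
        self.proj_a = nn.Linear(audio_dim, out_dim * rank)
        self.proj_v = nn.Linear(video_dim, out_dim * rank)

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # a: (batch, audio_dim), v: (batch, video_dim)
        joint = self.proj_a(a) * self.proj_v(v)          # (batch, out_dim*rank)
        joint = joint.view(-1, joint.size(1) // self.rank, self.rank).sum(-1)
        # Signed square-root and L2 normalization, a common stabilizer
        # for bilinear features.
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)
        return F.normalize(joint, dim=-1)                # (batch, out_dim)

# Example: fuse a 512-d audio vector with a 512-d video vector.
fbp = FactorizedBilinearPooling(audio_dim=512, video_dim=512)
fused = fbp(torch.randn(8, 512), torch.randn(8, 512))
```

Because both modality features pass through a shared multiplicative interaction before classification, gradients from the joint loss reach both branches, which is consistent with the observation that the video sub-system's attention is influenced by the audio signal under joint training.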
Keywords/Search Tags:Emotion Recognition, Attention Mechanism, Fully Convolutional Network, Multi-modal Fusion, Bilinear Pooling