
Research On Multimodal Emotion Recognition Based On The Fusion Of Temporal And Spatial Features

Posted on: 2023-02-08    Degree: Master    Type: Thesis
Country: China    Candidate: X N Gu    Full Text: PDF
GTID: 2568306836472224    Subject: Electronic and communication engineering
Abstract/Summary:
With the continuous development of science and technology, human society has entered the era of artificial intelligence. Artificial intelligence has progressed from machine intelligence to perceptual intelligence and is gradually moving toward cognitive intelligence. As an expression of cognitive intelligence, human-computer interaction is inseparable from the support of emotion recognition technology. Most earlier emotion recognition methods are single-modal and suffer from low recognition rates and poor robustness, so more and more research focuses on multimodal emotion recognition. Multimodal emotion recognition hinges on two problems: how to extract discriminative emotional features, and how to effectively fuse information from different modalities. The main work of this thesis is therefore to study multimodal emotion recognition by extracting discriminative emotional features and fusing emotional information from different modalities.

This thesis uses the Multimodal database and the RAMAS database to study multimodal emotion recognition from facial expression, voice, and gesture. First, the audio-visual data in the two databases are preprocessed. For the expression modality, the face region is cropped from each video sequence and key frames are selected at equal intervals to form a facial-expression image sequence. For the voice modality, the audio is extracted from the video, segmented into clips of equal duration, and each clip is transformed into a spectrogram, yielding a spectrogram sequence. For the gesture modality, key frames are selected at equal intervals from the original data to form a gesture feature sequence. Taking these three modalities as the basis for emotion research, this thesis studies feature extraction and multimodal fusion. The main research contents are as follows:

(1) A CNN-based model with an asymmetric non-local module, an efficient channel attention module, and a long short-term memory (LSTM) network is proposed to extract discriminative emotional features. Because the gesture data in the databases are human skeleton points, whose format differs from that of the other modalities, they cannot be fed into this network; only the facial expression and voice data are used as its input. The network contains three modules: the first is an asymmetric non-local module that captures long-range dependencies; the second is an efficient channel attention module that realizes local cross-channel interaction without dimensionality reduction and increases the nonlinear expression ability of the emotional features; the third is a space-time LSTM network that learns the spatial correlation of emotional features and the temporal correlation of emotional feature sequences, promoting information interaction between time and space. The entire network is trained in an end-to-end manner to extract discriminative emotional features.
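As a rough illustration of how the extraction network in (1) might be organized, the following is a minimal PyTorch-style sketch. All module names, layer sizes, and kernel sizes are illustrative assumptions rather than the thesis's actual configuration, and a plain LSTM stands in for the space-time LSTM described above.

```python
# Minimal sketch (assumed hyperparameters, not the thesis's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ECA(nn.Module):
    """Efficient channel attention: local cross-channel interaction, no dimensionality reduction."""
    def __init__(self, channels, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):                                   # x: (B, C, H, W)
        y = F.adaptive_avg_pool2d(x, 1)                     # (B, C, 1, 1) global descriptor
        y = y.squeeze(-1).transpose(1, 2)                   # (B, 1, C)
        y = self.conv(y).transpose(1, 2).unsqueeze(-1)      # (B, C, 1, 1)
        return x * torch.sigmoid(y)                         # re-weight channels


class AsymmetricNonLocal(nn.Module):
    """Asymmetric non-local block: key/value maps are spatially subsampled to reduce cost."""
    def __init__(self, channels, reduction=2, sample=8):
        super().__init__()
        inter = channels // reduction
        self.query = nn.Conv2d(channels, inter, 1)
        self.key = nn.Conv2d(channels, inter, 1)
        self.value = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)
        self.pool = nn.AdaptiveMaxPool2d(sample)            # asymmetric: only key/value are pooled

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)              # (B, HW, C')
        k = self.key(self.pool(x)).flatten(2)                     # (B, C', S)
        v = self.value(self.pool(x)).flatten(2).transpose(1, 2)   # (B, S, C')
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)   # long-range dependencies
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                              # residual connection


class EmotionFeatureNet(nn.Module):
    """CNN backbone + asymmetric non-local + ECA, with an LSTM over the frame/spectrogram sequence."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            AsymmetricNonLocal(128),
            ECA(128),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(128, 256, batch_first=True)     # temporal modelling over the sequence
        self.fc = nn.Linear(256, num_classes)               # head for end-to-end training

    def forward(self, clips):                               # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)    # per-frame features: (B*T, 128)
        seq, _ = self.lstm(feats.view(b, t, -1))            # (B, T, 256)
        return self.fc(seq[:, -1])                          # clip-level emotion logits
```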
(2) To better fuse the features of the individual modalities into a discriminative emotional representation, this thesis proposes a multi-level, multi-stage fusion network based on an asymmetric non-local network and a deep belief network (DBN) to capture the correlations and differences between modalities. In the first stage, the three modalities of facial expression, voice, and gesture are combined in pairs. Within each pair, one modality serves as the dominant feature and the other as the auxiliary feature; both are fed into the asymmetric non-local module simultaneously, which finds the emotional information in the auxiliary feature that benefits the dominant feature, yielding a fused feature that captures the correlation between the two modalities. In the second stage, the three pairwise fusion features from the first stage are fed into the DBN fusion network together with the original three modal features to obtain fused features that capture the differences between modalities. The network performs bottom-up unsupervised training followed by error back-propagation to achieve global optimization and strengthen the nonlinear expression ability of the fused features, and the result is finally passed to a softmax layer to complete emotion classification. A simplified sketch of this two-stage fusion is given below.

Experiments show that the CNN-based model with the asymmetric non-local module, efficient channel attention module, and LSTM can extract discriminative emotional features, and that the multi-level, multi-stage fusion network based on the asymmetric non-local network and DBN can effectively fuse multimodal emotional information and improve the effect of emotion recognition.
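The following PyTorch-style sketch illustrates the two-stage fusion idea in (2). The cross-modal block is a simplified stand-in for the asymmetric non-local fusion, the DBN is approximated by a plain stack of fully connected layers (the layer-wise unsupervised pre-training is omitted), and all dimensions and module names are assumptions for illustration only.

```python
# Simplified sketch of the two-stage multimodal fusion (assumed dimensions).
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Stage 1: the dominant modality queries the auxiliary modality for
    emotion-relevant information (simplified non-local style cross-attention)."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, dominant, auxiliary):                 # (B, Td, D), (B, Ta, D)
        q, k, v = self.q(dominant), self.k(auxiliary), self.v(auxiliary)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        fused = dominant + attn @ v                         # dominant feature enriched by the auxiliary
        return fused.mean(dim=1)                            # pool over time -> (B, D)


class TwoStageFusionNet(nn.Module):
    """Stage 2: concatenate the three pairwise fusion features with the three
    unimodal features and classify; an MLP stands in for the DBN + softmax."""
    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.face_voice = CrossModalFusion(dim)
        self.face_gesture = CrossModalFusion(dim)
        self.voice_gesture = CrossModalFusion(dim)
        self.dbn = nn.Sequential(                           # stand-in for the DBN fusion network
            nn.Linear(6 * dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, face, voice, gesture):                # each: (B, T, D) per-modality sequences
        pair = [
            self.face_voice(face, voice),
            self.face_gesture(face, gesture),
            self.voice_gesture(voice, gesture),
        ]
        uni = [m.mean(dim=1) for m in (face, voice, gesture)]
        z = torch.cat(pair + uni, dim=-1)                   # pairwise + original features: (B, 6*D)
        return self.dbn(z)                                  # logits; softmax applied in the loss
```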
Keywords/Search Tags:Multimodal emotion recognition, asymmetric non-local neural network, efficient channel attention neural network, space-time LSTM network, multi-level and multi-stage fusion network