
Research On Multi-Modal Emotion Recognition Based On Deep Learning And Feature Fusion

Posted on: 2022-05-14
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhu
Full Text: PDF
GTID: 2518306557969959
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, the artificial intelligence industry has entered a stage of rapid development. Governments and well-known universities around the world have steadily increased their investment in related fields, and affective computing is an important subject among them. Topics such as face recognition have undergone long-term research and development: recognition accuracy under conventional conditions has reached a high level, and some applications are already widely used in daily life. However, people are no longer satisfied with using machines unilaterally; many hope to communicate with robots as naturally as in science-fiction films, which current technology cannot yet achieve. To approach this goal, an emotion recognition model must be able to extract more effective features and, at the same time, gather richer information from different sources for affective computation.

Based on the Multimodal database and the RAMAS database, this thesis studies multimodal emotion recognition from facial expression, speech, and gesture. For the expression data, an enhanced CNN space-time LSTM deep network based on space-time feature points is proposed to extract features. The network combines traditional space-time feature algorithms with a convolutional neural network to strengthen feature extraction in high-attention regions of the input expression image sequence, uses multiple LSTM networks to model the spatial and temporal correlation of the features, and finally outputs the expression feature. This thesis examines single-modal facial expression recognition under different space-time cube parameters, as well as multimodal emotion recognition that fuses expression, voice, and gesture using the expression features extracted by this network. For the voice data, the openSMILE toolkit is used to extract emobase2010 features; for the posture data, human skeleton data collected by a Kinect v2 device is used to form posture features.

Secondly, this thesis proposes Supervised Least Squares Multiset Kernel Canonical Correlation Analysis (SLSMKCCA) and Sparse Supervised Least Squares Multiset Kernel Canonical Correlation Analysis (SSLSMKCCA) for expression-speech-gesture multimodal fusion emotion recognition. SLSMKCCA uses emotional label information, in the form of a label matrix, to supervise training and is optimized in a least-squares form to compute the correlation among expression, speech, and posture features. Building on SLSMKCCA, SSLSMKCCA incorporates a sparsity mechanism for feature screening, measuring the degree of sparseness with an L1-norm penalty while computing the correlation among the three modal features.

Finally, experiments show that the expression features extracted by the enhanced CNN space-time LSTM deep network based on space-time feature points recognize emotions well, and that using SLSMKCCA and SSLSMKCCA for three-modal (expression, speech, and posture) feature-fusion emotion recognition achieves relatively good results. Compared with the single-modality results on the two databases, the highest accuracy is improved by 20% and 5.19%, respectively. Compared with previous bimodal and three-modal fusion methods, the highest accuracy is improved by 1.99% and 0.11%, respectively. However, compared with previous fusion methods, the two proposed methods require more parameters and computation.
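To make the expression branch concrete, the following is a minimal PyTorch sketch of the general idea only: a per-frame CNN followed by an LSTM over time, with an optional attention mask standing in for the space-time feature-point enhancement. All layer sizes, the masking scheme, and the names (CnnLstmFeatureExtractor, attention_mask) are illustrative assumptions, not the exact network described in the thesis.

```python
# Hedged sketch: per-frame CNN + LSTM over time for sequence-level expression features.
import torch
import torch.nn as nn

class CnnLstmFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=128, lstm_hidden=256):
        super().__init__()
        # Frame-level CNN backbone applied to every expression frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (B*T, 64, 1, 1)
        )
        self.proj = nn.Linear(64, feat_dim)
        # Temporal model: an LSTM over the per-frame features.
        self.lstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True)

    def forward(self, frames, attention_mask=None):
        # frames: (B, T, 3, H, W).
        # attention_mask (optional, assumed here): (B, T, 1, H, W) weights,
        # e.g. derived from space-time interest points, emphasising
        # high-attention regions before the CNN.
        if attention_mask is not None:
            frames = frames * (1.0 + attention_mask)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 64)
        x = self.proj(x).view(b, t, -1)                 # (B, T, feat_dim)
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]                                  # (B, lstm_hidden) sequence feature

# Example: a batch of 2 clips, each 16 frames of 64x64 RGB.
feats = CnnLstmFeatureExtractor()(torch.randn(2, 16, 3, 64, 64))
```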
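For the fusion step, one common way to write a least-squares, label-supervised multiset kernel CCA objective is sketched below purely as an illustration of the SLSMKCCA/SSLSMKCCA idea: pairwise agreement between kernel projections of the three modalities, a label-supervision term, and an added L1 penalty for the sparse variant. The symbols K_m, alpha_m, Y, W, lambda, and gamma are assumptions; the thesis's exact objective, constraints, and regularisation may differ.

```latex
% Illustrative (assumed) least-squares form of supervised multiset kernel CCA.
% K_m: kernel matrix of modality m (expression, speech, posture);
% \alpha_m: projection coefficients; Y: label matrix; W: label projection;
% \lambda, \gamma: trade-off parameters.
\begin{aligned}
\text{SLSMKCCA:}\quad
&\min_{\{\alpha_m\},\,W}\;
  \sum_{1 \le m < n \le 3} \lVert K_m \alpha_m - K_n \alpha_n \rVert_2^2
  \;+\; \lambda \sum_{m=1}^{3} \lVert Y W - K_m \alpha_m \rVert_2^2 \\
&\text{s.t.}\;\; \alpha_m^{\top} K_m^{2}\, \alpha_m = 1,\qquad m = 1,2,3, \\[4pt]
\text{SSLSMKCCA:}\quad
&\text{add an } \ell_1 \text{ penalty } \gamma \sum_{m=1}^{3} \lVert \alpha_m \rVert_1
  \text{ to the objective for sparse feature screening.}
\end{aligned}
```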
Keywords/Search Tags:Multimodal emotion recognition, enhanced convolutional neural network, space-time LSTM, SLSMKCCA, SSLSMKCCA