
Research Of Multi-Modal Emotion Recognition Based On Deep Learning

Posted on: 2021-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Chen
Full Text: PDF
GTID: 2428330605968108
Subject: Electronic and communication engineering

Abstract/Summary:
With the development of artificial intelligence in recent years, people hope that computers can recognize different emotions the way humans do, so as to serve them more conveniently, which has made emotion recognition an important technology. Emotion recognition integrates a variety of disciplines such as speech signal processing, psychology, pattern recognition, and video image processing, and can be applied in fields such as education, transportation, and medical treatment. Because single-modal emotion recognition makes insufficient use of the available information and achieves low recognition accuracy, more and more researchers are turning to multimodal emotion recognition. However, extracting discriminative features and fusing modalities through effective interaction remain challenging problems. Based on a large number of video samples collected from human-computer interaction, this thesis separates the original videos into text, speech, and video modalities, and applies deep learning to multimodal emotion recognition by exploring and improving feature extraction, modal interaction, and information fusion. The main research contents of the thesis are as follows:

(1) The preprocessing and feature extraction techniques for the three modalities (text, speech, and video) are analyzed, compared, and studied. To obtain word vectors containing as much semantic and grammatical information as possible, text data are preprocessed and embedded with the pre-trained GloVe model. Audio preprocessing and feature extraction are based on the Covarep toolkit; the most important features are the MFCCs, complemented by other effective time-domain and frequency-domain features. For video, the OpenFace 2.0 open-source toolkit is used to obtain 68 facial key points, facial shape parameters, head-pose estimates, gaze estimates, facial action units, and HoG features. To support temporal interaction of multimodal information, the P2FA criterion is then applied to align the three modalities in the time dimension, and Z-score standardization is applied to the aligned data so that gradient-descent training converges faster and the model becomes more accurate (see the preprocessing sketch below).

(2) A multimodal emotion recognition algorithm based on a Double Attention Network (DAN) and a Gated Memory Network (GMN) is proposed. First, the three modal time series are encoded with an LSTM coding system, a recurrent neural network. On top of this coding system, an improved attention mechanism, the Delta-Time Attention Network (DTAN), is proposed to discover the cross-modal and temporal interactions between different dimensions of the memories in the LSTM system. A GMN is then a natural addition for updating and storing the modal and temporal interaction information produced by DTAN; experimental results show that a gating mechanism composed of neural networks has stronger expressive power and helps the model converge. Finally, the Global-Time Attention Network (GTAN), a global attention mechanism, computes the correlations between the different frames of each modality and assigns them different weights, encouraging the model to focus on the frames most significant for emotion recognition; by supplementing the information of DTAN and GMN, it makes the entire model more expressive. A sketch of this attention-plus-gated-memory pattern is given after the preprocessing example below.
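The abstract does not include the preprocessing code, so the following is only a minimal sketch of the standardization step it describes, applied to features assumed to be already extracted (GloVe, Covarep, OpenFace 2.0) and word-aligned via P2FA. The feature dimensions (300, 74, 35) and variable names are illustrative assumptions, not values stated in the thesis.

import numpy as np

def zscore_normalize(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score standardize a (time_steps, feature_dim) matrix per dimension.

    Centering and rescaling each feature dimension speeds up convergence
    of gradient-descent training, as the thesis notes.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Illustrative usage: features for one utterance, already aligned to a
# common number of time steps (e.g. 20 word-level segments via P2FA).
text_feats = np.random.randn(20, 300)   # GloVe word vectors (assumed dim)
audio_feats = np.random.randn(20, 74)   # Covarep features incl. MFCCs (assumed dim)
video_feats = np.random.randn(20, 35)   # OpenFace 2.0 descriptors (assumed dim)

text_feats, audio_feats, video_feats = map(
    zscore_normalize, (text_feats, audio_feats, video_feats))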
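The exact DTAN/GMN architecture is not published with this abstract, so the PyTorch sketch below only illustrates the general pattern the abstract describes: per-modality LSTM encoders, an attention network over the hidden states of two consecutive time steps (the "delta-time" idea), and a gated memory that decides how much of the attended interaction to retain. All layer sizes, the class name, and the wiring are assumptions for illustration.

import torch
import torch.nn as nn

class AttentiveGatedFusion(nn.Module):
    """Sketch (not the thesis model): LSTM encoders per modality,
    attention over consecutive-step hidden states, gated memory update."""

    def __init__(self, dims, hidden=32, mem=64):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.LSTM(d, hidden, batch_first=True) for d in dims])
        cat = hidden * len(dims)
        # Attention over [h_{t-1}; h_t] concatenated across all modalities.
        self.attn = nn.Sequential(nn.Linear(2 * cat, 2 * cat), nn.Softmax(dim=-1))
        self.proposal = nn.Linear(2 * cat, mem)
        self.gate = nn.Sequential(nn.Linear(2 * cat + mem, mem), nn.Sigmoid())

    def forward(self, inputs):
        # inputs: list of (batch, time, dim) tensors, one per modality,
        # time-aligned so they can be concatenated along the feature axis.
        hs = torch.cat([enc(x)[0] for enc, x in zip(self.encoders, inputs)], dim=-1)
        batch, time, _ = hs.shape
        memory = hs.new_zeros(batch, self.proposal.out_features)
        for t in range(1, time):
            pair = torch.cat([hs[:, t - 1], hs[:, t]], dim=-1)    # cross-step view
            attended = self.attn(pair) * pair                     # reweight interactions
            candidate = torch.tanh(self.proposal(attended))       # proposed update
            g = self.gate(torch.cat([attended, memory], dim=-1))  # how much to overwrite
            memory = g * candidate + (1 - g) * memory             # gated memory update
        return memory  # fused multimodal representation for a classifier head

# Illustrative usage with the assumed feature dimensions from the sketch above.
model = AttentiveGatedFusion(dims=[300, 74, 35])
fused = model([torch.randn(4, 20, 300), torch.randn(4, 20, 74), torch.randn(4, 20, 35)])

The gated update is the key design choice here: a sigmoid gate lets the network interpolate between keeping its stored interaction memory and overwriting it, which is consistent with the abstract's observation that a learned gating mechanism aids expressiveness and convergence.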
(3) Complete experiments are carried out to evaluate the proposed method by (a) comparing recognition rates among single-modal, dual-modal, and multi-modal settings, (b) conducting ablation experiments over different variables, and (c) comparing against multiple baseline methods. The results show that dual-modal recognition outperforms single-modal recognition, and tri-modal recognition outperforms dual-modal recognition, demonstrating that simply introducing additional modalities can improve emotion recognition accuracy. An ablation study of the three distinctive components, DTAN, GMN, and GTAN, shows that each component significantly improves the overall effect of multimodal emotion recognition. The model achieves a two-class accuracy of 77.4% on the MOSI dataset and a 6-class accuracy of 83.1% on MOSEI, verifying the feasibility and effectiveness of the proposed model.
Keywords/Search Tags:Multi-modal emotion recognition, deep learning, recurrent neural network, attention mechanism, gated memory network