
Research On Emotion Recognition Method Based On Multi-modal Information Fusion Of Speech And Image

Posted on: 2022-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: G H Chen
Full Text: PDF
GTID: 2518306536963379
Subject: Information and Communication Engineering
Abstract/Summary:
As a core component of Human-Computer Interaction (HCI) systems, emotion recognition has important application value in intelligent driving, remote teaching, smart home, health monitoring, travel recommendation, and intelligent robot systems. Humans generally express emotions through speech and facial images, so carrying out multi-modal emotion recognition on speech and images, and improving its recognition rate, has both theoretical significance and practical value. This thesis takes the emotion recognition model as its research object, analyzes the relationship between the emotional information in speech and images and human emotional states, and studies in depth the key frame extraction and multi-modal feature fusion methods used in speech-image multi-modal information fusion. The goal is to fully fuse the emotional features of speech and images and thereby improve the recognition rate of multi-modal emotion recognition.

With the advent of the multimedia information era and the resulting mass of emotional video, extracting speech and image key frames from emotional video datasets is particularly important for improving multi-modal emotion recognition performance. Traditional key frame extraction methods, however, suffer from key frame redundancy and loss of important emotional information. To address this, this thesis proposes a speech and image key frame extraction method for multi-modal emotion recognition. First, preliminary speech key frames are extracted with a Voice Activity Detection (VAD) algorithm. Second, information entropy is used to characterize the generation of human emotion as a continuous process, and preliminary image key frames are extracted with a perceptual hash algorithm. Finally, the final speech and image key frames are obtained by exploiting the complementarity between the speech and image modalities in a speech-image key frame alignment step. Experiments on the RML, eNTERFACE05, and BAUM-1s datasets (covering speech Mel-Frequency Cepstral Coefficient (MFCC) extraction, expression image extraction, expression image information entropy, and speech-image key frame extraction) show that the proposed method not only effectively reduces the redundancy of speech and image key frames but also preserves important emotional information.

Current feature-layer fusion does not fully consider the correlation between the speech and image modalities, which degrades multi-modal emotion recognition performance. This thesis therefore proposes a multi-modal emotion recognition method that fuses the correlation features of speech and images. First, the key frame extraction method proposed above is used to extract speech and image key frames. Second, the MFCC features of the speech key frames and the facial expression sequences of the image key frames are fed into a two-dimensional and a three-dimensional convolutional neural network, respectively, to extract high-level emotional features. Third, an improved Canonical Correlation Analysis (CCA) feature fusion method is proposed that adds class information to standard CCA in the feature fusion stage; the correlation between the speech and image modalities is used to construct a weighting matrix K and a new inter-class divergence matrix S_b to distinguish similar emotion classes. Finally, a Support Vector Machine (SVM) performs emotion classification on the fused speech-image correlation features. Ablation, key frame extraction, correlation feature fusion, and comparison experiments on the RML, eNTERFACE05, and BAUM-1s datasets show that the proposed key frame extraction and correlation feature fusion methods effectively improve the recognition rate of multi-modal emotion recognition.
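To make the image key frame step concrete, the following is a minimal sketch of perceptual-hash-based key frame selection together with a frame information entropy measure, as described in the abstract. The specific hash (an average hash), the Hamming-distance threshold, and the function names are illustrative assumptions, not the thesis's exact implementation; the alignment with speech key frames and the VAD stage are omitted.

```python
import numpy as np

def average_hash(frame, hash_size=8):
    """Perceptual (average) hash: block-average the grayscale frame down to
    hash_size x hash_size, then threshold each cell at the global mean."""
    h, w = frame.shape
    small = frame[:h - h % hash_size, :w - w % hash_size]
    small = small.reshape(hash_size, small.shape[0] // hash_size,
                          hash_size, small.shape[1] // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming(h1, h2):
    """Number of differing hash bits."""
    return int(np.count_nonzero(h1 != h2))

def frame_entropy(frame, bins=256):
    """Shannon entropy of the grayscale histogram; a proxy for how much
    visual (emotional) information a frame carries."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def extract_key_frames(frames, threshold=10):
    """Keep a frame only if its hash differs enough from the last kept
    key frame, which suppresses redundant near-duplicate frames."""
    keys, last = [], None
    for i, f in enumerate(frames):
        h = average_hash(f)
        if last is None or hamming(h, last) > threshold:
            keys.append(i)
            last = h
    return keys
```

On a clip of five identical dark frames followed by five identical half-bright frames, `extract_key_frames` keeps only indices 0 and 5, illustrating how redundancy is reduced while the visual change is retained.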
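The abstract's improved CCA adds class information (the weighting matrix K and inter-class divergence matrix S_b), whose exact construction is not given here. As background, the sketch below implements only standard CCA, the baseline being improved upon: it finds one projection per modality that maximizes the correlation between the projected speech and image features, then concatenates the projections into a fused feature vector. All names and the regularization constant are assumptions for illustration.

```python
import numpy as np

def cca_fuse(X, Y, dim=2, reg=1e-3):
    """Standard CCA feature fusion.

    X: (n_samples, dx) speech features, Y: (n_samples, dy) image features.
    Returns the concatenated projected features and the canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # regularized within-modality and cross-modality covariances
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # whitening transforms: Wx.T @ Sxx @ Wx = I (via Cholesky factors)
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Syy)).T
    # SVD of the whitened cross-covariance gives the canonical directions
    U, s, Vt = np.linalg.svd(Wx.T @ Sxy @ Wy)
    A = Wx @ U[:, :dim]      # projection for the speech modality
    B = Wy @ Vt.T[:, :dim]   # projection for the image modality
    fused = np.hstack([Xc @ A, Yc @ B])
    return fused, s[:dim]
```

The fused vector would then be passed to the SVM classifier; the thesis's improvement modifies the objective so that samples of similar emotion classes are pushed apart via S_b, which plain CCA cannot do since it ignores labels.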
Keywords/Search Tags:multi-modal emotion recognition, speech information, image information, key frame, correlation