
Research On Emotion Recognition Method Based On Multi-modal Information Fusion Of Speech And Image

Posted on: 2022-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: G H Chen
Full Text: PDF
GTID: 2518306536963379
Subject: Information and Communication Engineering
Abstract/Summary:
As a core component of Human-Computer Interaction (HCI) systems, emotion recognition has important application value in intelligent driving, remote teaching, smart home, health monitoring, travel recommendation, and intelligent robot systems. Humans generally express emotions through speech and facial images, so carrying out multi-modal emotion recognition on speech and images, and improving its recognition rate, has both theoretical significance and practical value. This thesis takes the emotion recognition model as its research object, analyzes the relationship between the emotional information in speech and images and human emotional states, and studies in depth the key frame extraction and multi-modal feature fusion methods used in speech-image multi-modal information fusion. The goal is to fully fuse the emotional features of speech and images and thereby improve the recognition rate of multi-modal emotion recognition.

With the advent of the multimedia information era and the resulting mass of emotional video, extracting speech and image key frames from emotional video datasets is particularly important for improving multi-modal emotion recognition performance. Traditional key frame extraction methods, however, suffer from key frame redundancy and loss of important emotional information. To address this, this thesis proposes a speech and image key frame extraction method for multi-modal emotion recognition. First, preliminary speech key frames are extracted with a Voice Activity Detection (VAD) algorithm. Second, information entropy is used to characterize the generation of human emotion as a continuous process, and preliminary image key frames are extracted with a perceptual hash algorithm. Finally, the final speech and image key frames are obtained by exploiting the complementarity between the speech and image modalities in a speech-image key frame alignment step. Experiments on the RML, eNTERFACE05, and BAUM-1s datasets (covering speech Mel-Frequency Cepstral Coefficient (MFCC) extraction, expression image extraction, expression image information entropy, and speech-image key frame extraction) show that the proposed method not only effectively reduces the redundancy of speech and image key frames but also preserves important emotional information.

Current feature-layer fusion does not fully consider the correlation between the speech and image modalities, which degrades multi-modal emotion recognition performance. This thesis therefore proposes a multi-modal emotion recognition method that fuses the correlation features of speech and images. First, the key frame extraction method proposed above is used to extract speech and image key frames. Second, the MFCC features of the speech key frames and the facial expression sequences of the image key frames are fed into a two-dimensional and a three-dimensional convolutional neural network, respectively, to extract high-level emotional features. Third, an improved Canonical Correlation Analysis (CCA) feature fusion method is proposed that adds class information to standard CCA in the feature fusion stage; the correlation between the speech and image modalities is used to construct a weighting matrix K and a new inter-class divergence matrix S_b to distinguish similar emotion classes. Finally, a Support Vector Machine (SVM) performs emotion classification on the fused speech-image correlation features. Ablation, key frame extraction, correlation feature fusion, and comparison experiments on the RML, eNTERFACE05, and BAUM-1s datasets show that the proposed key frame extraction and correlation feature fusion methods effectively improve the recognition rate of multi-modal emotion recognition.
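To make the image key frame step concrete, the following is a minimal sketch of perceptual-hash-based key frame selection together with a frame information entropy measure, as described in the abstract. The specific hash (an average hash), the Hamming-distance threshold, and the function names are illustrative assumptions, not the thesis's exact implementation; the alignment with speech key frames and the VAD stage are omitted.

```python
import numpy as np

def average_hash(frame, hash_size=8):
    """Perceptual (average) hash: block-average the grayscale frame down to
    hash_size x hash_size, then threshold each cell at the global mean."""
    h, w = frame.shape
    small = frame[:h - h % hash_size, :w - w % hash_size]
    small = small.reshape(hash_size, small.shape[0] // hash_size,
                          hash_size, small.shape[1] // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming(h1, h2):
    """Number of differing hash bits."""
    return int(np.count_nonzero(h1 != h2))

def frame_entropy(frame, bins=256):
    """Shannon entropy of the grayscale histogram; a proxy for how much
    visual (emotional) information a frame carries."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def extract_key_frames(frames, threshold=10):
    """Keep a frame only if its hash differs enough from the last kept
    key frame, which suppresses redundant near-duplicate frames."""
    keys, last = [], None
    for i, f in enumerate(frames):
        h = average_hash(f)
        if last is None or hamming(h, last) > threshold:
            keys.append(i)
            last = h
    return keys
```

On a clip of five identical dark frames followed by five identical half-bright frames, `extract_key_frames` keeps only indices 0 and 5, illustrating how redundancy is reduced while the visual change is retained.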
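The abstract's improved CCA adds class information (the weighting matrix K and inter-class divergence matrix S_b), whose exact construction is not given here. As background, the sketch below implements only standard CCA, the baseline being improved upon: it finds one projection per modality that maximizes the correlation between the projected speech and image features, then concatenates the projections into a fused feature vector. All names and the regularization constant are assumptions for illustration.

```python
import numpy as np

def cca_fuse(X, Y, dim=2, reg=1e-3):
    """Standard CCA feature fusion.

    X: (n_samples, dx) speech features, Y: (n_samples, dy) image features.
    Returns the concatenated projected features and the canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # regularized within-modality and cross-modality covariances
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # whitening transforms: Wx.T @ Sxx @ Wx = I (via Cholesky factors)
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Syy)).T
    # SVD of the whitened cross-covariance gives the canonical directions
    U, s, Vt = np.linalg.svd(Wx.T @ Sxy @ Wy)
    A = Wx @ U[:, :dim]      # projection for the speech modality
    B = Wy @ Vt.T[:, :dim]   # projection for the image modality
    fused = np.hstack([Xc @ A, Yc @ B])
    return fused, s[:dim]
```

The fused vector would then be passed to the SVM classifier; the thesis's improvement modifies the objective so that samples of similar emotion classes are pushed apart via S_b, which plain CCA cannot do since it ignores labels.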
Keywords/Search Tags:multi-modal emotion recognition, speech information, image information, key frame, correlation