
Multimodal Emotion Recognition Based On Audio And Video

Posted on: 2023-08-10  Degree: Master  Type: Thesis
Country: China  Candidate: B Y Zhou  Full Text: PDF
GTID: 2568306788964349  Subject: Electronic and communication engineering
Abstract/Summary:
With the growing demand for human-computer interaction, emotion recognition has received extensive attention from academia and has been widely applied in traffic monitoring, teaching guidance, medical treatment, and other fields. Most existing methods rely on a single modality, such as speech or video signals, which offers good practicality and generality in certain scenarios. However, application scenarios have become more complex and data volumes are growing rapidly, so single-modal emotion recognition can no longer meet practical needs: once the amount of data reaches a certain scale, the more complete the set of modalities, the better the model's recognition performance. This thesis therefore integrates two single-modal emotion recognition methods to study the effectiveness of multimodal emotion recognition.

For speech emotion recognition, the feature extraction of Mel Frequency Cepstral Coefficients (MFCC) is improved, and an MFCC based on wavelet packet decomposition is proposed. By setting thresholds to remove noise and adaptively selecting frequency parameters, the wavelet-packet MFCC alleviates the loss of high-frequency information and extracts more representative speech features. Experimental results on the public IEMOCAP dataset show a Weighted Accuracy (WA) of 71.93% and an Unweighted Accuracy (UA) of 69.86%, outperforming other speech features and mainstream algorithms.

For video emotion recognition, an algorithm based on multi-feature extraction is proposed to make full use of the correlation between features. Scene features and expression features are first extracted from the image information and fed into two separate GRU networks; an attention mechanism then fuses the two feature streams, and a classifier outputs the emotion recognition result. The experiment is verified in two parts on the IEMOCAP dataset: the first part shows that the introduced scene information exploits the complementarity between modalities to improve recognition results; the second part shows that the attention mechanism mitigates the long-distance dependency problem and reduces the misjudgment rate of emotion recognition.

For multimodal emotion recognition, an algorithm based on the Bidirectional Gated Recurrent Unit (Bi-GRU) and an attention mechanism is proposed to combine the audio and video feature vectors. Bi-GRU captures temporal information in both directions and thus obtains more valuable information than a unidirectional GRU. The experiments are verified from two aspects: on the one hand, comparing multimodal with single-modal results confirms the superiority of multimodal recognition, with an average improvement of 7.82%; on the other hand, comparison with other models confirms the effectiveness of the proposed model.

This thesis contains 33 figures, 11 tables, and 84 references.
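To make the wavelet-packet MFCC described above concrete, here is a minimal Python sketch of how such a front end might look, using PyWavelets and librosa. The wavelet choice (db4), decomposition depth, and soft universal threshold are illustrative assumptions, not the thesis's exact settings.

```python
import numpy as np
import pywt
import librosa

def wavelet_packet_mfcc(path, sr=16000, wavelet="db4", level=3, n_mfcc=13):
    """Denoise speech via wavelet packet thresholding, then compute MFCCs."""
    y, sr = librosa.load(path, sr=sr)

    # Decompose the signal into a full wavelet packet tree.
    wp = pywt.WaveletPacket(data=y, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)

    # Universal soft threshold, with the noise level estimated from the
    # median absolute deviation of the finest detail band (an assumption).
    detail = wp["d" * level].data
    sigma = np.median(np.abs(detail)) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(y)))
    for node in wp.get_level(level, order="natural"):
        node.data = pywt.threshold(node.data, thr, mode="soft")

    # Reconstruct the denoised waveform and extract MFCCs from it,
    # so high-frequency subbands contribute to the final features.
    y_denoised = wp.reconstruct(update=False)[: len(y)]
    return librosa.feature.mfcc(y=y_denoised, sr=sr, n_mfcc=n_mfcc)
```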
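For the video branch, one possible reading of the described architecture (two parallel GRUs over scene and expression feature sequences, fused by attention before classification) is sketched below in PyTorch. The class name, hidden size, and concatenation-along-time attention scheme are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DualStreamGRU(nn.Module):
    """Hypothetical two-stream GRU with attention fusion over time steps."""
    def __init__(self, scene_dim, face_dim, hidden=128, n_classes=4):
        super().__init__()
        self.scene_gru = nn.GRU(scene_dim, hidden, batch_first=True)
        self.face_gru = nn.GRU(face_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)      # scores each fused time step
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, scene_seq, face_seq):
        s, _ = self.scene_gru(scene_seq)       # (B, T, H)
        f, _ = self.face_gru(face_seq)         # (B, T, H)
        fused = torch.cat([s, f], dim=1)       # stack streams along time: (B, 2T, H)
        w = torch.softmax(self.attn(fused), dim=1)  # attention weights (B, 2T, 1)
        ctx = (w * fused).sum(dim=1)           # attention-weighted context (B, H)
        return self.fc(ctx)                    # emotion logits (B, n_classes)
```

A call such as `DualStreamGRU(512, 256)(scene, face)` with `scene` of shape (B, T, 512) and `face` of shape (B, T, 256) would return class logits; the feature dimensions are placeholders.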
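The multimodal stage could similarly be sketched as a Bi-GRU over frame-aligned, concatenated audio and video features with attention pooling. Again, the dimensions and the frame-alignment assumption are hypothetical, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class BiGRUFusion(nn.Module):
    """Hypothetical Bi-GRU multimodal fusion with attention pooling."""
    def __init__(self, audio_dim, video_dim, hidden=128, n_classes=4):
        super().__init__()
        self.bigru = nn.GRU(audio_dim + video_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each bidirectional state
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio_seq, video_seq):
        # Assumes both modalities are aligned to the same T frames.
        x = torch.cat([audio_seq, video_seq], dim=-1)   # (B, T, A+V)
        h, _ = self.bigru(x)                            # (B, T, 2H), both directions
        w = torch.softmax(self.attn(h), dim=1)          # attention over time
        ctx = (w * h).sum(dim=1)                        # pooled context (B, 2H)
        return self.fc(ctx)
```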
Keywords/Search Tags: emotion recognition, GRU network, attention mechanism, multimodal fusion