Emotion recognition has important research value in human-computer interaction, intelligent healthcare, and driving assistance. Facial expressions have achieved remarkable results in emotion recognition owing to their intuitiveness and rich emotional content. However, expressions are easy to disguise, so expression-based recognition results are not necessarily objective and truthful; moreover, a single modality cannot capture complete emotional information, and more and more researchers are turning to multimodal emotion recognition. In recent years, some researchers have addressed these problems by fusing facial expressions with physiological signals, but such physiological signals must be measured with contact-based equipment, which not only affects the subjects' emotions but also limits the usage scenarios. Therefore, this paper extracts the pulse wave from video in a non-contact manner and fuses it with the facial expressions in the same video to realize dual-modal emotion recognition based on expression and pulse wave, improving the accuracy and objectivity of emotion recognition. The specific work of this paper is as follows:

(1) To better extract facial features from videos, this paper studies both static and dynamic facial expression recognition. VGG16 and ViT-B/16 networks are used for static facial expression recognition on single-frame face images, and a 3D convolutional block attention module (3D-CBAM) is introduced into the C3D network to realize dynamic facial expression recognition. The experimental results show that the 3D-CBAM-based C3D network performs best, with accuracy rates of 65.14% and 65.51% in the arousal and valence dimensions, respectively. The network achieves promising results, but there is still room for improvement.

(2) The pulse wave signal is extracted from the face video through imaging photoplethysmography (IPPG), and wavelet transform together with a narrow band-pass filter is used to remove noise. Heart rate variability (HRV) features are then extracted and classified with four machine learning models. The experimental results show that the support vector machine performs best, with accuracy rates of 61.09% and 53.31% in the arousal and valence dimensions, respectively, indicating that the IPPG signal has great application potential in emotion recognition.

(3) To explore how the emotional information in facial expressions and IPPG signals can be fused, both feature-level fusion and decision-level fusion are investigated. For feature-level fusion, a network model based on 3D convolution and 1D convolution is proposed: the 3D-CBAM-based C3D network extracts the spatio-temporal features of facial expressions from the video, a 1D convolutional neural network extracts the features of the IPPG signal, and the two feature sets are then fused for classification. For decision-level fusion, facial expressions and IPPG signals are combined using the voting method, Bayesian fusion, and ensemble learning. The experimental results show that fusing facial expressions and IPPG signals effectively improves the accuracy of emotion recognition, but when the performance of the two modalities differs too much, decision-level fusion may not achieve ideal results. The proposed feature-level fusion method outperforms decision-level fusion, achieving accuracy rates of 72.37% and 70.82% in the arousal and valence dimensions, respectively, which are 7.23% and 5.31% higher than the single-modality results.
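As an illustration of the feature-level fusion model described in (3), the following is a minimal PyTorch-style sketch, not the thesis implementation: the backbone depth, layer widths, clip length, and IPPG window length are illustrative assumptions, and the 3D-CBAM attention blocks are omitted for brevity.

```python
# Illustrative sketch (not the thesis code): a two-branch network that fuses
# 3D-convolutional facial-expression features with 1D-convolutional IPPG features.
# Layer sizes, clip length, and the IPPG window length are placeholder assumptions.
import torch
import torch.nn as nn

class ExpressionIPPGFusion(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # 3D-convolutional branch: spatio-temporal facial-expression features
        # (a C3D-style backbone; the 3D-CBAM attention module is omitted here).
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),            # -> (B, 64, 1, 1, 1)
        )
        # 1D-convolutional branch: temporal features of the IPPG pulse wave.
        self.ippg_branch = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # -> (B, 32, 1)
        )
        # Feature-level fusion: concatenate the two feature vectors and classify.
        self.classifier = nn.Sequential(
            nn.Linear(64 + 32, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, clip, ippg):
        # clip: (B, 3, T, H, W) face video clip; ippg: (B, 1, L) pulse-wave segment
        v = self.video_branch(clip).flatten(1)   # (B, 64)
        p = self.ippg_branch(ippg).flatten(1)    # (B, 32)
        return self.classifier(torch.cat([v, p], dim=1))

# Example forward pass with dummy shapes (16-frame 112x112 clip, 256-sample IPPG window).
model = ExpressionIPPGFusion(num_classes=2)
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 1, 256))
print(logits.shape)  # torch.Size([2, 2])
```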
Compared with most multimodal emotion recognition methods, the method in this paper has certain advantages in both recognition performance and data acquisition. Whereas such methods typically require multiple data sources, the proposed method needs only video as its data source to realize the fusion of facial expressions and physiological signals, and it acquires the data in a non-contact manner, which gives it great practical value.
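For comparison, the decision-level fusion by voting mentioned in (3) can be sketched as a simple weighted soft vote over the per-modality class probabilities. This is a minimal sketch under assumed inputs: the weights and example probabilities below are placeholders, not values from the experiments.

```python
# Illustrative sketch of decision-level fusion by weighted soft voting:
# average the class-probability outputs of the expression model and the IPPG
# model and take the argmax. Weights and probabilities are placeholders.
import numpy as np

def soft_vote(p_expr, p_ippg, w_expr=0.5, w_ippg=0.5):
    """Fuse per-sample class probabilities from the two modalities."""
    p_expr = np.asarray(p_expr, dtype=float)
    p_ippg = np.asarray(p_ippg, dtype=float)
    fused = w_expr * p_expr + w_ippg * p_ippg   # weighted average of probabilities
    return fused.argmax(axis=-1)                # predicted class per sample

# Two samples with binary labels (e.g. low/high arousal): the modalities disagree
# on the second sample and the fused decision follows the more confident branch.
p_expr = [[0.7, 0.3], [0.4, 0.6]]
p_ippg = [[0.6, 0.4], [0.9, 0.1]]
print(soft_vote(p_expr, p_ippg))  # [0 0]
```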