As an important aspect of human-computer interaction, emotion recognition has become a prominent research direction in artificial intelligence. At present, emotion recognition falls into two types: single-modal and multi-modal. Single-modal emotion recognition targets a single modality that conveys human emotion, such as facial expressions, speech, or gestures, and therefore suffers from limited feature information, low error tolerance, and poor robustness. Multi-modal emotion recognition combines expressions, speech, text, and other modalities, exploiting the diversity and richness of emotional features to overcome these shortcomings, and has become a major research hotspot in this field.

This thesis studies end-to-end multi-modal emotion recognition based on speech and image information: deep learning is used to learn and train the features of each modality, the features are integrated through a fusion network, and six basic emotions are finally recognized. The main work covers the following three aspects:

1) Considering that a 3D convolutional neural network (3D-CNN) can effectively extract the spatial-temporal features of image sequences, this thesis proposes using 3D-CNNs to extract and train features from both the Mel spectrograms of speech and the facial expression sequences. Unlike traditional methods, this approach trains the speech and facial-expression features separately and then combines the trained features to realize multi-modal emotion recognition (a sketch of such a 3D-CNN branch is given below).

2) To address the weak discriminability of Mel-spectrogram features and the distortion introduced by audio segmentation, this thesis proposes a multi-modal emotion recognition method based on the Mel-frequency cepstral coefficients (MFCCs) of the audio signal together with facial expressions. The facial-expression branch still uses a 3D-CNN for feature extraction and training, while the audio branch applies a 2D convolutional neural network (2D-CNN) followed by a long short-term memory network (LSTM) to the MFCCs, thereby enriching the spatial-temporal characteristics of the speech signal (a sketch of this audio branch is given below).

3) During changes of facial expression, the muscles around the eyes and mouth change most noticeably. This thesis therefore proposes a multi-modal emotion recognition method that combines MFCCs with key facial regions (a sketch of the region cropping is given below).

For the above methods, simulation experiments are carried out on three public datasets, SAVEE, RAVDESS, and eNTERFACE'05, and the results demonstrate the effectiveness of the proposed methods.
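As a concrete illustration of the 3D-CNN branch in 1), the following PyTorch sketch extracts spatial-temporal features from a stack of frames, whether video frames or sliced Mel-spectrogram segments. It is a minimal sketch, not the thesis's architecture: the layer sizes, the 16-frame clip length, and the 112x112 input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Extracts spatial-temporal features from a stack of frames
    (video frames or sliced Mel-spectrogram segments)."""
    def __init__(self, in_channels=3, num_emotions=6):
        super().__init__()
        self.features = nn.Sequential(
            # Conv3d kernels span (time, height, width), so each filter
            # responds to motion across frames as well as spatial structure.
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),           # now pool time and space
            nn.AdaptiveAvgPool3d(1),               # -> (N, 64, 1, 1, 1)
        )
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, x):                          # x: (N, C, T, H, W)
        feats = self.features(x).flatten(1)        # (N, 64) modal features
        return self.classifier(feats)

# Example: a batch of 2 clips, each 16 RGB frames of 112x112.
clip = torch.randn(2, 3, 16, 112, 112)
logits = Simple3DCNN()(clip)                       # (2, 6) emotion scores
```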
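The audio branch in 2) can be sketched in the same spirit: a small 2D-CNN treats the MFCC matrix as an image, and an LSTM then reads the CNN output along the time axis, so both local spectral patterns and their temporal evolution are captured. The MFCC count, frame count, and layer sizes below are assumptions for illustration, not values from the thesis.

```python
import torch
import torch.nn as nn

class MFCCBranch(nn.Module):
    """2D-CNN over the MFCC matrix, then an LSTM along the time axis."""
    def __init__(self, n_mfcc=40, hidden=128, num_emotions=6):
        super().__init__()
        self.cnn = nn.Sequential(
            # Treat the (n_mfcc, T) MFCC matrix as a one-channel image.
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                       # halves both axes
        )
        # Each pooled time step carries 32 * (n_mfcc // 2) values.
        self.lstm = nn.LSTM(32 * (n_mfcc // 2), hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, x):                          # x: (N, 1, n_mfcc, T)
        f = self.cnn(x)                            # (N, 32, n_mfcc//2, T//2)
        f = f.permute(0, 3, 1, 2).flatten(2)       # (N, T//2, 32*n_mfcc//2)
        _, (h, _) = self.lstm(f)                   # h: (1, N, hidden)
        return self.classifier(h[-1])              # classify last hidden state

# MFCCs could be computed with, e.g., librosa.feature.mfcc(y=y, sr=sr,
# n_mfcc=40); here a random tensor stands in for 2 utterances of 200 frames.
mfcc = torch.randn(2, 1, 40, 200)
logits = MFCCBranch()(mfcc)                        # (2, 6) emotion scores
```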
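For the key-region idea in 3), one common way to isolate the eye and mouth areas is to crop boxes around the corresponding facial landmarks. The sketch below assumes the widely used 68-point landmark convention (points 36-47 cover the eyes, 48-67 the mouth) and a hypothetical padding value; the thesis's actual region definition may differ.

```python
import numpy as np

def crop_region(frame, points, pad=8):
    """Crop a padded bounding box around a set of landmark points."""
    x0, y0 = points.min(axis=0) - pad
    x1, y1 = points.max(axis=0) + pad
    h, w = frame.shape[:2]
    return frame[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)]

def key_regions(frame, landmarks):
    """landmarks: (68, 2) integer array of (x, y) facial landmarks,
    e.g., from dlib's 68-point shape predictor."""
    eyes = crop_region(frame, landmarks[36:48])    # both eyes
    mouth = crop_region(frame, landmarks[48:68])   # outer and inner lips
    return eyes, mouth

frame = np.zeros((224, 224, 3), dtype=np.uint8)        # dummy face frame
landmarks = np.random.randint(60, 160, size=(68, 2))   # dummy landmarks
eyes, mouth = key_regions(frame, landmarks)            # per-region crops
```

The cropped eye and mouth sequences can then be stacked over time and fed to a 3D-CNN in the same way as the full-face frames in 1).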
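Finally, the fusion network mentioned above can be as simple as concatenating the per-modality feature vectors and classifying jointly. The sketch below assumes the feature dimensions produced by the two branches sketched earlier; it illustrates feature-level fusion in general, not necessarily the exact fusion design used in the thesis.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Concatenates per-modality feature vectors and classifies jointly."""
    def __init__(self, audio_dim=128, face_dim=64, num_emotions=6):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + face_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, num_emotions),
        )

    def forward(self, audio_feat, face_feat):
        # The branches are trained separately; the fused head then learns
        # how the two modalities complement each other.
        return self.fuse(torch.cat([audio_feat, face_feat], dim=1))

audio_feat = torch.randn(2, 128)   # e.g., LSTM hidden state from branch 2)
face_feat = torch.randn(2, 64)     # e.g., pooled 3D-CNN features
logits = FusionNet()(audio_feat, face_feat)   # (2, 6) fused emotion scores
```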