As an important aspect of human-computer interaction, emotion recognition has become a prominent research direction in artificial intelligence. At present, emotion recognition falls into two types: single-modal and multi-modal. Single-modal emotion recognition targets a single modality that conveys human emotion, such as facial expressions, speech, or gestures, and therefore suffers from limited feature information, low error tolerance, and poor robustness. Multi-modal emotion recognition combines expressions, speech, text, and other modalities, exploiting the diversity and richness of emotional features to overcome these shortcomings, and has become a major research hotspot in this field.

This thesis studies end-to-end multi-modal emotion recognition based on speech and image information: deep learning is used to learn and train the features of each modality, the features are integrated through a fusion network, and six basic emotions are finally recognized. The main work covers the following three aspects:

1) Considering that a 3D convolutional neural network (3D-CNN) can effectively extract the spatial-temporal features of image sequences, this thesis proposes using 3D-CNNs to extract and train features from both the Mel spectrograms of speech and the facial expression sequences. Unlike traditional methods, this approach trains the speech and facial-expression features separately and then combines the trained features to realize multi-modal emotion recognition (a sketch of such a 3D-CNN branch is given below).

2) To address the weak discriminability of Mel-spectrogram features and the distortion introduced by audio segmentation, this thesis proposes a multi-modal emotion recognition method based on the Mel-frequency cepstral coefficients (MFCCs) of the audio signal together with facial expressions. The facial-expression branch still uses a 3D-CNN for feature extraction and training, while the audio branch applies a 2D convolutional neural network (2D-CNN) followed by a long short-term memory network (LSTM) to the MFCCs, thereby enriching the spatial-temporal characteristics of the speech signal (a sketch of this audio branch is given below).

3) During changes of facial expression, the muscles around the eyes and mouth change most noticeably. This thesis therefore proposes a multi-modal emotion recognition method that combines MFCCs with key facial regions (a sketch of the region cropping is given below).

For the above methods, simulation experiments are carried out on three public datasets, SAVEE, RAVDESS, and eNTERFACE'05, and the results demonstrate the effectiveness of the proposed methods.
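As a concrete illustration of the 3D-CNN branch in 1), the following PyTorch sketch extracts spatial-temporal features from a stack of frames, whether video frames or sliced Mel-spectrogram segments. It is a minimal sketch, not the thesis's architecture: the layer sizes, the 16-frame clip length, and the 112x112 input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Extracts spatial-temporal features from a stack of frames
    (video frames or sliced Mel-spectrogram segments)."""
    def __init__(self, in_channels=3, num_emotions=6):
        super().__init__()
        self.features = nn.Sequential(
            # Conv3d kernels span (time, height, width), so each filter
            # responds to motion across frames as well as spatial structure.
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),           # now pool time and space
            nn.AdaptiveAvgPool3d(1),               # -> (N, 64, 1, 1, 1)
        )
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, x):                          # x: (N, C, T, H, W)
        feats = self.features(x).flatten(1)        # (N, 64) modal features
        return self.classifier(feats)

# Example: a batch of 2 clips, each 16 RGB frames of 112x112.
clip = torch.randn(2, 3, 16, 112, 112)
logits = Simple3DCNN()(clip)                       # (2, 6) emotion scores
```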
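The audio branch in 2) can be sketched in the same spirit: a small 2D-CNN treats the MFCC matrix as an image, and an LSTM then reads the CNN output along the time axis, so both local spectral patterns and their temporal evolution are captured. The MFCC count, frame count, and layer sizes below are assumptions for illustration, not values from the thesis.

```python
import torch
import torch.nn as nn

class MFCCBranch(nn.Module):
    """2D-CNN over the MFCC matrix, then an LSTM along the time axis."""
    def __init__(self, n_mfcc=40, hidden=128, num_emotions=6):
        super().__init__()
        self.cnn = nn.Sequential(
            # Treat the (n_mfcc, T) MFCC matrix as a one-channel image.
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                       # halves both axes
        )
        # Each pooled time step carries 32 * (n_mfcc // 2) values.
        self.lstm = nn.LSTM(32 * (n_mfcc // 2), hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, x):                          # x: (N, 1, n_mfcc, T)
        f = self.cnn(x)                            # (N, 32, n_mfcc//2, T//2)
        f = f.permute(0, 3, 1, 2).flatten(2)       # (N, T//2, 32*n_mfcc//2)
        _, (h, _) = self.lstm(f)                   # h: (1, N, hidden)
        return self.classifier(h[-1])              # classify last hidden state

# MFCCs could be computed with, e.g., librosa.feature.mfcc(y=y, sr=sr,
# n_mfcc=40); here a random tensor stands in for 2 utterances of 200 frames.
mfcc = torch.randn(2, 1, 40, 200)
logits = MFCCBranch()(mfcc)                        # (2, 6) emotion scores
```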
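For the key-region idea in 3), one common way to isolate the eye and mouth areas is to crop boxes around the corresponding facial landmarks. The sketch below assumes the widely used 68-point landmark convention (points 36-47 cover the eyes, 48-67 the mouth) and a hypothetical padding value; the thesis's actual region definition may differ.

```python
import numpy as np

def crop_region(frame, points, pad=8):
    """Crop a padded bounding box around a set of landmark points."""
    x0, y0 = points.min(axis=0) - pad
    x1, y1 = points.max(axis=0) + pad
    h, w = frame.shape[:2]
    return frame[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)]

def key_regions(frame, landmarks):
    """landmarks: (68, 2) integer array of (x, y) facial landmarks,
    e.g., from dlib's 68-point shape predictor."""
    eyes = crop_region(frame, landmarks[36:48])    # both eyes
    mouth = crop_region(frame, landmarks[48:68])   # outer and inner lips
    return eyes, mouth

frame = np.zeros((224, 224, 3), dtype=np.uint8)        # dummy face frame
landmarks = np.random.randint(60, 160, size=(68, 2))   # dummy landmarks
eyes, mouth = key_regions(frame, landmarks)            # per-region crops
```

The cropped eye and mouth sequences can then be stacked over time and fed to a 3D-CNN in the same way as the full-face frames in 1).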
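Finally, the fusion network mentioned above can be as simple as concatenating the per-modality feature vectors and classifying jointly. The sketch below assumes the feature dimensions produced by the two branches sketched earlier; it illustrates feature-level fusion in general, not necessarily the exact fusion design used in the thesis.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Concatenates per-modality feature vectors and classifies jointly."""
    def __init__(self, audio_dim=128, face_dim=64, num_emotions=6):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + face_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, num_emotions),
        )

    def forward(self, audio_feat, face_feat):
        # The branches are trained separately; the fused head then learns
        # how the two modalities complement each other.
        return self.fuse(torch.cat([audio_feat, face_feat], dim=1))

audio_feat = torch.randn(2, 128)   # e.g., LSTM hidden state from branch 2)
face_feat = torch.randn(2, 64)     # e.g., pooled 3D-CNN features
logits = FusionNet()(audio_feat, face_feat)   # (2, 6) fused emotion scores
```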