
Research On Emotion Recognition Of Single-Modal Speech And Multi-Modal Speech-Vision Based On Transfer Learning

Posted on: 2022-05-16    Degree: Master    Type: Thesis
Country: China    Candidate: L Lin    Full Text: PDF
GTID: 2518306320954219    Subject: Software engineering
Abstract/Summary:
Emotion recognition is the task of having a computer collect data such as a person's voice, facial expressions, and behavior through sensors and analyze it to infer the person's emotional state. Humans recognize other people's emotions mainly through the visual and auditory modalities. Emotion recognition plays an important supporting role in human-computer interaction, medicine, criminal investigation, aerospace, and other fields. Single-modal speech emotion recognition infers the emotional state from the speech signal collected by a sensor alone; multi-modal speech-visual emotion recognition infers the emotional state from two or more of the collected speech, expression, and behavior signals.

Existing research at home and abroad on single-modal speech emotion recognition and multi-modal speech-visual emotion recognition faces two main problems. First, for single-modal speech emotion recognition, differences in the distributions of speech emotion databases, the large amount of training data required, high computational complexity, and low recognition rates limit its practical effectiveness. Second, for multi-modal speech-visual emotion recognition, the mutual interference between the features of different modalities and the problem of missing modalities limit its practical effectiveness. In view of these two problems, the main work of this research is divided into the following three aspects.

For problem 1, this thesis proposes transfer-learning single-modal speech emotion recognition based on the Mel cepstrum. The method combines a single-layer LSTM with a transferred Inception-v3 network. Mel cepstrum maps from a multi-corpus speech dataset are data-augmented and used as input; after forward propagation through the single-layer LSTM, the representation enters the pre-trained Inception-v3 model for feature extraction and is then passed to newly defined fully connected and classification layers for training, with the parameters of the last layer fine-tuned, finally yielding the classification result (a sketch of such a pipeline is given below). In experiments, the recognition rate on multi-distribution speech emotion data reached 67%, and the differences among the areas under the per-class, macro-averaged, and micro-averaged ROC curves were small. The method works well for multi-distribution emotion recognition and does not degrade model performance.

For problem 2, this thesis proposes multi-modal emotion recognition based on feature reconstruction and particle swarm feature fusion. The method uses the eNTERFACE'05 audio-visual emotion dataset, which contains the two modalities of speech and vision. A CNN extracts high-level emotion features from visual face keyframes and from speech Mel cepstrum maps; transfer learning is then applied to the high-level emotion features of one modality to reconstruct the high-level emotion features of the other modality, the particle swarm algorithm fuses the features into multi-modal shared emotional features, and a softmax classifier is trained on the fused features to complete emotion recognition (a sketch of the particle swarm fusion step is also given below). Experiments show that the method can reconstruct missing modalities, alleviates the mutual interference between the features of different modalities, and achieves a higher emotion recognition rate and greater robustness than single-modality recognition, improving the overall effect of emotion recognition.
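Below is a minimal PyTorch sketch of the kind of pipeline described above for problem 1: a single-layer LSTM over Mel-cepstrum frames feeding a pre-trained Inception-v3 used as a mostly frozen feature extractor, followed by newly defined fully connected and classification layers. The tensor shapes, hidden size, class count, the choice of block to fine-tune, and the reshaping of the LSTM output into an image-like input are assumptions, not details from the thesis.

```python
import torch
import torch.nn as nn
from torchvision import models

class SpeechTransferNet(nn.Module):
    """Sketch: single-layer LSTM over Mel-cepstrum frames, then a pre-trained
    Inception-v3 as feature extractor, then new fully connected layers.
    Shapes and hyperparameters are illustrative assumptions."""
    def __init__(self, n_mels=64, hidden=256, num_classes=7):
        super().__init__()
        # Single-layer LSTM over the time frames of the Mel cepstrum map.
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                            num_layers=1, batch_first=True)
        # Pre-trained Inception-v3; all layers frozen except the last block.
        self.backbone = models.inception_v3(weights="DEFAULT")
        self.backbone.fc = nn.Identity()               # drop original classifier
        for p in self.backbone.parameters():
            p.requires_grad = False
        for p in self.backbone.Mixed_7c.parameters():  # fine-tune last block only
            p.requires_grad = True
        # Newly defined fully connected and classification layers.
        self.classifier = nn.Sequential(
            nn.Linear(2048, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes))

    def forward(self, mel):                            # mel: (B, T, n_mels)
        seq, _ = self.lstm(mel)                        # (B, T, hidden)
        # Treat the LSTM output as a one-channel image, repeat it to 3 channels
        # and resize so it matches Inception-v3's expected 299x299 input.
        img = seq.unsqueeze(1).repeat(1, 3, 1, 1)      # (B, 3, T, hidden)
        img = nn.functional.interpolate(img, size=(299, 299))
        feats = self.backbone(img)
        if isinstance(feats, tuple):                   # training mode returns (logits, aux)
            feats = feats[0]
        return self.classifier(feats)                  # (B, num_classes)
```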
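The particle swarm fusion step described above could, for example, search for per-dimension fusion weights scored by how well a classifier performs on the fused features. The sketch below assumes that formulation; the fitness callback `eval_fn` (e.g. validation accuracy of a softmax classifier trained on the fused features) is a hypothetical placeholder, since the exact objective used in the thesis is not specified here.

```python
import numpy as np

def pso_feature_fusion(feat_audio, feat_visual, labels, eval_fn,
                       n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5):
    """Sketch of particle swarm feature fusion.  Each particle holds
    per-dimension weights alpha in [0, 1]; the fused feature is
    alpha * audio + (1 - alpha) * visual.  `eval_fn(fused, labels)` returns
    a fitness score (higher is better) and is an assumed placeholder."""
    dim = feat_audio.shape[1]
    pos = np.random.rand(n_particles, dim)             # particle positions (weights)
    vel = np.zeros((n_particles, dim))                 # particle velocities
    pbest, pbest_fit = pos.copy(), np.full(n_particles, -np.inf)
    gbest, gbest_fit = pos[0].copy(), -np.inf

    for _ in range(n_iter):
        for i in range(n_particles):
            fused = pos[i] * feat_audio + (1 - pos[i]) * feat_visual
            fit = eval_fn(fused, labels)
            if fit > pbest_fit[i]:                     # update personal best
                pbest[i], pbest_fit[i] = pos[i].copy(), fit
            if fit > gbest_fit:                        # update global best
                gbest, gbest_fit = pos[i].copy(), fit
        r1, r2 = np.random.rand(*pos.shape), np.random.rand(*pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)             # keep weights in [0, 1]

    # Return the shared (fused) features under the best weights found.
    return gbest * feat_audio + (1 - gbest) * feat_visual, gbest
```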
For problem 2, this thesis also proposes a second solution: transfer-learning multi-modal emotion recognition based on feature reconstruction and decision-level fusion. The audio-visual emotion dataset and the extraction and reconstruction of high-level emotion features from visual face keyframes and speech Mel cepstrum maps are the same as in the previous method. The difference is that the features of the two modalities are fed separately into softmax classifiers for classification training to obtain a set of probability matrices, and emotion classification is then completed through fusion rules at the decision level (a sketch of such rule-based fusion follows below). Experiments again verify the reconstruction of missing modalities; among the fusion rules, decision-level fusion with the minimum rule achieves the highest multi-modal speech-visual emotion recognition rate, reaching 85.8%. To a certain extent, this method addresses the problems of missing modalities and mutual interference between modalities.
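Decision-level fusion of the two per-modality probability matrices can be sketched as follows. Only the minimum rule is named above as the best-performing one; the other rules shown (maximum, product, mean) are common alternatives included for illustration.

```python
import numpy as np

def decision_level_fusion(prob_audio, prob_visual, rule="min"):
    """Fuse two (n_samples, n_classes) softmax probability matrices with a
    simple rule and return the predicted class per sample."""
    if rule == "min":                        # rule reported best above
        fused = np.minimum(prob_audio, prob_visual)
    elif rule == "max":
        fused = np.maximum(prob_audio, prob_visual)
    elif rule == "product":
        fused = prob_audio * prob_visual
    else:                                    # mean / sum rule
        fused = (prob_audio + prob_visual) / 2.0
    return fused.argmax(axis=1)
```

The minimum rule is a conservative choice: a class only scores highly when both modalities assign it high probability, which may explain its robustness when one modality is noisy or missing.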
Keywords/Search Tags:Single-modal speech emotion recognition, Multi-modal speech visual emotion recognition, Transfer learning, Multi-modality, Feature reconstruction, Shared emotional features, Decision-level fusion