
Research On Speech Emotion Recognition Based On Multimodal Information Fusion

Posted on: 2022-06-14    Degree: Master    Type: Thesis
Country: China    Candidate: D L Jiang    Full Text: PDF
GTID: 2518306482455114    Subject: Computer application technology
Abstract/Summary:
Communication is the primary way for human beings to express their thoughts, and among all forms of communication, spoken language is the most widespread and effective. IoT applications are developing rapidly, ranging from simple wearable devices and small components to complex autonomous vehicles and a wide variety of automation equipment, and they bring great convenience to daily life. These intelligent applications are interactive: users must issue specific operation instructions, and voice input is the main channel through which an intelligent device is driven. A speech perception module can detect the speaker's gender, age, language, emotion, and other information, which creates the necessary conditions for computer applications to understand human speech. To analyse the speaker's emotional state, many applications run an existing speech recognition system and an emotion detection system side by side. The performance of the emotion detection system reflects how well an IoT application serves its users and indicates where it can be improved.

Improving the multimodal fusion mechanism is a decisive factor in improving the performance of an emotion recognition system. Most existing multimodal emotion recognition systems simply concatenate the features extracted from different modalities. For traditional classification algorithms, the main problem with this approach is that the information carried by the different modalities can conflict or be redundant. In addition, concatenating the feature vectors of different modalities into a single high-dimensional vector ignores the implicit correlation between the modalities. The primary task is therefore to minimise the impact of information conflict and redundancy between the audio and visual modalities on the multimodal emotion recognition system.

To address these problems, this research proposes a new hybrid fusion method that combines audiovisual content with user comment text. The method fuses the audio and visual signals at the feature level in a latent space, computing the correlation between the two modalities to remove redundant features, and then uses Dempster-Shafer (DS) evidence theory to fuse the audiovisual and text modalities at the decision level. This resolves the information redundancy and conflict between audio and video. Within the proposed method, Marginal Fisher Analysis (MFA) is introduced and compared with Cross Modal Factor Analysis (CFA) and Canonical Correlation Analysis (CCA); the experimental results show that the proposed method performs better. Although some earlier studies address the redundancy problem in feature-level fusion by preserving the statistical correlation between modalities, they do not extend it to decision-level fusion: existing methods either use feature-level latent-space fusion or use evidence theory to fuse the audiovisual and text modalities, but not both. Experiments on the DEAP dataset show that the proposed method outperforms ordinary decision-level fusion and fusion without a latent space, and that MFA yields better feature-level audiovisual fusion than CFA and CCA.
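For the feature-level stage, the abstract compares Marginal Fisher Analysis (MFA) against CCA and CFA in a latent-space fusion scheme, but gives no implementation details. The thesis's MFA formulation is not reproduced here; as a minimal sketch of the general idea, the snippet below uses the CCA baseline named in the abstract (via scikit-learn) to project two modalities into a correlated latent space before concatenation. The feature dimensions, sample counts, and random data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Assumed shapes: 500 samples, 128-dim audio features, 256-dim visual features.
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((500, 128))
visual_feats = rng.standard_normal((500, 256))

# Project both modalities into a shared latent space where their
# correlation is maximised, then concatenate the projections as the
# fused feature-level representation.
cca = CCA(n_components=32)
audio_latent, visual_latent = cca.fit_transform(audio_feats, visual_feats)
fused = np.hstack([audio_latent, visual_latent])
print(fused.shape)  # (500, 64)
```

In the thesis's setting, the correlation measured in this latent space is also used to discard redundant dimensions before classification; that selection step is not shown here.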
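The abstract then applies DS evidence theory to fuse the audiovisual branch with the text branch at the decision level. The sketch below shows only the generic Dempster rule of combination applied to two hypothetical mass functions over emotion classes; the class labels, mass values, and function name are illustrative assumptions, not taken from the thesis.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset hypotheses to
    masses) with Dempster's rule of combination."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("Total conflict: sources cannot be combined")
    # Normalise by 1 - K, where K is the total conflicting mass.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

# Hypothetical masses from the audiovisual and text branches
# over three emotion classes (values are made up).
audio_visual = {frozenset({"happy"}): 0.6, frozenset({"sad"}): 0.3,
                frozenset({"happy", "sad", "neutral"}): 0.1}
text = {frozenset({"happy"}): 0.5, frozenset({"neutral"}): 0.3,
        frozenset({"happy", "sad", "neutral"}): 0.2}

fused = dempster_combine(audio_visual, text)
decision = max(fused, key=fused.get)
print(fused, decision)
```

The hypothesis receiving the largest combined mass would then be taken as the fused emotion decision.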
Keywords/Search Tags: speech emotion recognition, decision-level fusion, latent space plane, multimodality, Dempster-Shafer