
Dual Fusion Speech Emotion Recognition Based On Deep Learning

Posted on: 2022-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Qian
Full Text: PDF
GTID: 2518306482973219
Subject: Master of Engineering
Abstract/Summary:
Emotion is an important part of the cognitive process. Giving computers the human-like ability to perceive all kinds of emotion is an emerging topic in artificial intelligence. Speech emotion recognition (SER) is an important research direction in the field of emotion recognition, aiming to enable computers to understand human emotions and achieve smooth communication between humans and machines. However, the field still faces problems such as the lack of effective emotional feature sets and of effective emotion recognition models. To improve the recognition performance of SER models, this thesis designs a dual-fusion speech emotion recognition system based on deep learning. The main contributions are as follows:

(1) Fused features have better classification performance. A speech emotion recognition system based on a recurrent convolutional neural network (RCNN) is constructed. Prosodic features, spectral features, and the fusion of the two are evaluated on the CASIA database to study the performance of different feature combinations. The analysis shows that the fusion of prosodic and spectral features yields better classification performance.

(2) Several speech emotion recognition models are proposed. First, exploiting the fact that global average pooling (GAP) reduces computational complexity and the risk of overfitting, the RCNN-GAP model is constructed. Second, by reducing the depth of the RCNN and adding a deep neural network (DNN) to reduce the risk of vanishing gradients, the DNN-RCNN-GAP model is proposed. Finally, the DNN-ARCNN-GAP model is proposed, which uses an attention mechanism to focus on more emotion-related information; the generalization of the model is verified on the EMODB database, and the SER system is built on this model.

(3) Decision fusion and model fusion are used to further improve system performance. To further improve the emotion recognition rate of the model, the unweighted voting method and the stacking algorithm are used to conduct decision-fusion and model-fusion experiments respectively, and the recognition rates of the two methods are verified. Experimental results show that the four SER models (RCNN, RCNN-GAP, DNN-RCNN-GAP, and DNN-ARCNN-GAP) achieve recognition rates of 80.44%, 81.25%, 83.33%, and 84.58% respectively on the CASIA database. Decision fusion and model fusion can further improve the emotion recognition rate, which proves that the SER models proposed in this study have good performance.
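The combination of attention pooling and GAP-style pooling described in contribution (2) can be illustrated with a minimal sketch. This is not the thesis's exact DNN-ARCNN-GAP architecture; the module name, layer sizes, and the use of PyTorch are all illustrative assumptions. It only shows the core idea: scoring each time step of the convolutional/recurrent feature map, pooling with those attention weights instead of a flatten-plus-dense head, and classifying the pooled vector.

```python
import torch
import torch.nn as nn

class AttentionGAPHead(nn.Module):
    """Illustrative attention-weighted pooling head for SER.

    Replaces flatten + large dense layers with a pooled summary of the
    feature sequence, which (like GAP) keeps the parameter count small,
    while the attention scores let the model emphasize emotion-related
    time steps.
    """

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)           # one score per time step
        self.classify = nn.Linear(channels, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, channels), e.g. the output of RCNN layers
        weights = torch.softmax(self.score(feats), dim=1)   # (batch, time, 1)
        pooled = (weights * feats).sum(dim=1)               # weighted average
        return self.classify(pooled)                        # (batch, classes)
```

With uniform attention weights this reduces to plain global average pooling over time, so the sketch covers both the GAP and the attention variants discussed above.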
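The unweighted voting used for decision fusion in contribution (3) can be sketched as follows. The function name and the label-matrix layout are assumptions for illustration; the thesis's stacking-based model fusion would instead train a meta-classifier on the base models' outputs (e.g. scikit-learn's `StackingClassifier`), which is omitted here.

```python
import numpy as np

def unweighted_vote(predictions) -> np.ndarray:
    """Decision fusion by unweighted majority voting.

    predictions: array-like of shape (n_models, n_samples) holding the
    integer class label each base SER model assigns to each utterance.
    Returns the per-sample label with the most votes.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For each sample (column), count votes per class and take the mode.
    return np.array([np.bincount(votes, minlength=n_classes).argmax()
                     for votes in predictions.T])

# Example: three models voting on three utterances.
fused = unweighted_vote([[0, 1, 2],
                         [0, 1, 1],
                         [1, 1, 2]])
# fused is [0, 1, 2]: each sample gets the majority label.
```

Ties resolve to the lowest class index via `argmax`; a weighted variant would scale each model's vote by its validation accuracy.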
Keywords/Search Tags:Speech emotion recognition, Feature extraction, Attention mechanism, Convolutional neural network