
Research On Speech Emotion Recognition Based On Deep Learning

Posted on: 2022-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: H N Xu
Full Text: PDF
GTID: 2518306533495414
Subject: Electronic information
Abstract/Summary:
With the development of deep learning (DL) and artificial intelligence (AI), emotional expression has become increasingly important in the field of human-computer interaction, and speech, as the most direct way to express emotion, is an important prerequisite for achieving natural human-computer interaction. How to make a computer automatically recognize human emotion, and how to use deep learning to automatically extract the key features that represent speech emotion, are hot topics in current research. In this paper, we construct a model, based on currently popular deep learning networks, to extract features from the speech signal and recognize its emotion, focusing on finding high-level emotional features that effectively represent the speaker's emotions and on simulating the human attention mechanism for emotion recognition. The main contributions are as follows:

(1) Aiming at the problems of single-type feature extraction and low classification accuracy in the speech emotion recognition (SER) task, an emotion recognition method based on time-frequency feature fusion is proposed. The 3-D Log-Mel feature set, synthesized from the Log-Mel features together with their first-order and second-order differential features, is taken as the input of a BCNN-LSTM-attention network to extract frequency-domain features, while the speech signal is divided into equal-length segments and fed into a CNN-LSTM network to obtain time-domain features. The frequency-domain and time-domain features are then fused. Experiments on the IEMOCAP and EMO-DB databases show that the recognition rate of the multi-feature fusion algorithm is higher than that of algorithms extracting only frequency-domain or only time-domain features.

(2) Retaining the 3-D Log-Mel feature set extracted in (1), a speech emotion recognition algorithm based on self-attention spatio-temporal features is proposed to model the key spatio-temporal dependencies. The optimal spatio-temporal representations of speech signals are automatically learned by a Bilinear Convolutional Neural Network (BCNN) and a Long Short-Term Memory network (LSTM), and a multi-head attention mechanism is introduced to exploit key frame information. Experiments on the IEMOCAP and EMO-DB databases show that the recognition rate of the spatio-temporal feature fusion algorithm is higher than that of algorithms extracting only spatial or only temporal features, and that the multi-head attention mechanism improves the performance of the whole system.

(3) An online speech emotion recognition system based on self-attention spatio-temporal features is designed. All functional modules are realized by calling EXE executable files. Experimental results demonstrate the superiority of the algorithm and the effectiveness of the speech emotion recognition system.
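The 3-D Log-Mel feature set described in (1) can be sketched as follows. This is a minimal illustration, not the thesis's exact pipeline: it assumes a log-Mel spectrogram (frames × Mel bands) has already been computed (e.g. with a library such as librosa), and the delta computation below uses the standard regression-style delta formula; the window width and dimensions are illustrative.

```python
import numpy as np

def deltas(feat, width=2):
    """Regression-style delta (differential) features over the time axis
    of a (frames, mels) matrix, with edge padding at the boundaries."""
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, width + 1))
    T = feat.shape[0]
    return sum(i * (padded[width + i:width + i + T]
                    - padded[width - i:width - i + T])
               for i in range(1, width + 1)) / denom

def make_3d_logmel(log_mel):
    """Stack static log-Mel, first-order, and second-order differential
    features into a 3-channel (3, frames, mels) input tensor."""
    d1 = deltas(log_mel)       # first-order differential
    d2 = deltas(d1)            # second-order differential
    return np.stack([log_mel, d1, d2], axis=0)

# Illustrative input: 100 frames x 40 Mel bands
log_mel = np.random.randn(100, 40)
feat_3d = make_3d_logmel(log_mel)
print(feat_3d.shape)  # (3, 100, 40)
```

The resulting 3-channel tensor plays the same role as a color image for the convolutional front end, which is why it can be fed directly into a CNN-style network.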
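The multi-head attention mechanism introduced in (2) can likewise be sketched in NumPy. This is a generic scaled dot-product multi-head self-attention over a sequence of frame-level features (such as LSTM outputs); the projection matrices, dimensions, and head count here are illustrative assumptions, not the thesis's trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product multi-head self-attention over a sequence of
    frame-level features X of shape (frames, d_model)."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    # Project and split into heads: (n_heads, frames, d_head)
    Q = (X @ Wq).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, T)
    weights = softmax(scores, axis=-1)   # each frame attends to all frames
    heads = weights @ V                  # (heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
T, d_model, n_heads = 100, 64, 4   # e.g. 100 LSTM output frames
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1
                  for _ in range(4))
out, attn = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape, attn.shape)  # (100, 64) (4, 100, 100)
```

The attention weight matrix of each head sums to 1 over the frame axis, so frames carrying stronger emotional cues can receive proportionally larger weight in the pooled representation.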
Keywords/Search Tags:Speech Emotion Recognition, Multi-features Fusion, Spatio-Temporal Modeling, Attention Mechanism, Emotion Recognition System