Research On Speech Emotion Recognition Method Based On Time Series Deep Learning Model

Posted on: 2019-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: X M Chen
GTID: 2428330566998101
Subject: Computer Science and Technology
Abstract/Summary:
As speech recognition technology matures, demand for speech emotion recognition (SER) is growing, because progress in SER will make machines interact in a more human way. SER is valuable in many areas, such as in-car driving assistance, medical services, distance education, and disease diagnosis. However, current SER technology has not yet reached a practical level. On the one hand, emotional activity is itself a complex physiological process; on the other hand, the databases and models used for SER need further development.

This thesis starts from the SER model and addresses the problem that the traditional Long Short-Term Memory (LSTM) model learns from all speech frames without distinction. The frame sequence of emotional speech is instead regarded as a mix of emotional frames and non-emotional frames. On this basis, the LSTM-CTC time series deep learning model is proposed, in which the emotion label is aligned to the emotional frames in the speech by the automatic alignment ability of the Connectionist Temporal Classification (CTC) method. To measure the performance of the model, speaker-independent experiments were performed on the IEMOCAP emotion database using four emotion classes (happy, sad, neutral, and angry), achieving recognition performance of 65.7% (UAR) and 64.2% (WAR). Compared with the state-of-the-art LSTM-ELM model, this is an improvement of 2.3% (UAR) and 1.8% (WAR).

Next, to address the problem that the LSTM-CTC model treats all emotional frames equally, although the amount of emotional information carried by each frame differs, the Att RNN-RNN time series deep learning model is proposed from the perspective of the attention mechanism. The Att RNN-RNN model treats the SER process as an encoder-decoder task. Considering that human attention shifts from global to local, an LSTM is used as the decoder, and attention is computed at each time step to perform emotion recognition inference and to simulate this shift of attention. The model overcomes the equal treatment of emotional frames in the LSTM-CTC model and achieves 67.6% (UAR) and 67.5% (WAR) on the IEMOCAP database for four-class emotion recognition, outperforming the LSTM-CTC model.

Furthermore, to take full advantage of the ability of CTC to automatically align emotion labels with speech frames, this thesis introduces the CTC method into the Att RNN-RNN model and proposes an Attention-CTC fusion model in which the two branches share an emotional semantic encoder. The CTC method is responsible for aligning the emotional key frames in the speech, while the attention mechanism is responsible for extracting different amounts of information from different emotional frames; the two objective functions are optimized simultaneously. This model achieves 70.3% (UAR) and 65.1% (WAR) on the IEMOCAP database.

Finally, this thesis implements an online speech emotion recognition system, OESERS, which turns the above research results into a practical application. The system adopts a client/server structure and offers good recognition performance, a friendly human-machine interface, and large-scale concurrent task processing. It provides speech emotion recognition support for the Samsung Bixby voice assistant.

The research in this thesis provides effective improvements for key problems in the field of speech emotion recognition. Experiments show that the proposed time series deep learning models significantly improve performance on speech emotion recognition tasks. The work also offers new ideas and directions for applying deep learning to time series problems.
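The results above are reported as UAR (unweighted average recall) and WAR (weighted average recall). As a minimal sketch of how these two metrics are commonly defined, and not the thesis's own evaluation code, the snippet below computes both from label lists. The function name `uar_war` and the toy labels are hypothetical and chosen only for illustration; UAR is taken as the mean of per-class recalls and WAR as overall accuracy, which is the usual convention but an assumption about the thesis.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) as fractions in [0, 1].

    UAR: mean of per-class recalls, so every class counts equally.
    WAR: recall weighted by class frequency, i.e. overall accuracy.
    Hypothetical helper for illustration; not taken from the thesis.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Number of reference utterances per class.
    total = Counter(y_true)
    # Number of correctly recognised utterances per class.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    uar = sum(correct[c] / total[c] for c in total) / len(total)
    war = sum(correct.values()) / len(y_true)
    return uar, war


# Toy, made-up labels with an imbalanced class distribution.
y_true = ["neutral"] * 6 + ["happy"] * 2 + ["sad"] * 1 + ["angry"] * 1
y_pred = ["neutral"] * 6 + ["happy", "neutral"] + ["sad"] + ["neutral"]

uar, war = uar_war(y_true, y_pred)
print(f"UAR={uar:.3f} WAR={war:.3f}")  # prints "UAR=0.625 WAR=0.800"
```

On imbalanced data the two metrics can diverge, as in the toy example: missing every utterance of a rare class lowers UAR sharply while barely affecting WAR. This is why both figures are typically reported side by side.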
Keywords/Search Tags: Speech emotion recognition, Neural network, Deep learning, Attention mechanism