
Speech Emotion Recognition Based On Deep Learning

Posted on: 2022-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2518306764978149
Subject: Automation Technology
Abstract/Summary:
Speech emotion recognition usually refers to the process by which a machine automatically recognizes human emotions from speech. It is widely used in human-computer interaction systems such as customer service centers, in-vehicle systems, and smart speakers. In recent years, with industry's growing demand for more intelligent human-computer interaction, speech emotion recognition has gradually become a research hotspot. Previous studies have usually applied deep learning to speech emotion recognition with convolutional neural networks or recurrent neural networks. Building on time-delay neural networks and bidirectional encoder representations, this thesis makes the following three contributions to speech emotion recognition:

(1) Based on ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in TDNN), an ECAPA-TDNN-LSTM model is proposed and applied to speech emotion recognition. On the IEMOCAP dataset, the ECAPA-TDNN-LSTM model achieves a weighted accuracy (WA) of 72.1% and an unweighted accuracy (UA) of 69.0%. Compared with a convolutional neural network (CNN) benchmark model, this is a relative improvement of 9.15% in WA and 5.73% in UA; compared with the ECAPA-TDNN model, a relative improvement of 4.34% in WA and 3.92% in UA.

(2) Assuming that the text information provided in the IEMOCAP dataset comes from speech recognition results, the BERT pre-trained model, based on bidirectional encoder representations, is fine-tuned for text emotion classification and achieves 66.5% WA and 67.6% UA.

(3) Using decision-level fusion (sketched below), the ECAPA-TDNN-LSTM model from (1) and the fine-tuned BERT model from (2) are combined into the ETL-BERT model, which achieves 80.5% WA and 79.9% UA. Compared with ECAPA-TDNN-LSTM, this is a relative improvement of 11.65% in WA and 15.80% in UA; compared with the BERT model, 21.05% in WA and 18.20% in UA.
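
The decision-level fusion in (3) combines the class posterior probabilities produced by the acoustic model and the text model before the final decision is taken. The abstract does not state the fusion rule, the fusion weights, or the emotion class set, so the following is only a minimal sketch of weighted score averaging; the four classes, the probability values, and the equal weights are assumptions made for illustration, not the thesis configuration.

    import numpy as np

    # Decision-level (late) fusion sketch: combine the class posteriors of the
    # acoustic model and the text model, then take the argmax as the final label.
    # The four emotion classes, the probability values, and the 0.5/0.5 weights
    # below are illustrative assumptions, not values reported in the thesis.
    labels = ["angry", "happy", "neutral", "sad"]

    p_acoustic = np.array([0.10, 0.55, 0.25, 0.10])  # e.g. ECAPA-TDNN-LSTM posteriors
    p_text     = np.array([0.05, 0.40, 0.45, 0.10])  # e.g. fine-tuned BERT posteriors

    w_acoustic, w_text = 0.5, 0.5                    # assumed fusion weights
    p_fused = w_acoustic * p_acoustic + w_text * p_text

    print(labels[int(np.argmax(p_fused))])           # fused decision: "happy"

Weighted averaging of posteriors is one common form of decision-level fusion; majority voting or a small learned combiner over the two score vectors would fit the same interface.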
Keywords/Search Tags:Deep Learning, Speech Emotion Recognition, Model Fusion