
Research On Speech Emotion Recognition Based On Deep Neural Network

Posted on: 2022-12-19  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Fan  Full Text: PDF
GTID: 2518306752493404  Subject: Automation Technology
Abstract/Summary:
Emotion recognition has important application value in safe driving, medical treatment, emotional robots, distance teaching, and other fields, and it is an indispensable step toward artificial intelligence. Speech is one of the most natural and effective ways for humans to communicate emotions and thoughts, so research on speech emotion recognition (SER) is particularly important. However, the deep neural networks commonly used in SER, such as RNNs and CNNs, each have their own shortcomings. Building on deep neural networks, this thesis therefore improves on the performance of any single network by constructing different network models. The main contributions of this thesis are as follows:

(1) The speech emotion recognition model P-CBGRU is constructed. To address the loss of spatial information in RNNs and the neglect of temporal information in CNNs, an acoustic model P-CBGRU is constructed. First, the model CBGRU is built by cascading a CNN with a Bi-GRU network, but cascading still loses some information. The two networks are therefore combined in parallel to obtain P-CBGRU, whose two branches provide complementary information and achieve better results than the cascaded version.

(2) A new speech emotion recognition model ATCN-GRU is proposed. Because simply recombining existing networks yields only a modest improvement in SER performance, a new acoustic model ATCN-GRU is proposed. The model cascades a temporal convolutional network (TCN), a bi-directional gated recurrent unit (Bi-GRU), and an attention mechanism. First, the TCN overcomes both the spatial-information loss of RNNs and the neglect of temporal information in CNNs, and selects the most representative and robust features from the manually extracted features. Second, the Bi-GRU learns the context-related information of each speech sample, and the attention mechanism learns the degree of association between the model's input and output sequences, giving more weight to the informative parts. Finally, emotions are classified by a Softmax layer.

(3) Focal loss is introduced. Introducing focal loss mitigates the uneven recognition results caused by the imbalanced training samples of the EMODB database, raising the average accuracy of ATCN-GRU on EMODB to 86.26%.

Experimental results show that the proposed P-CBGRU achieves an average recognition accuracy of 86.46% on the CASIA database, while ATCN-GRU achieves average accuracies of 88.17% and 85.98% on the CASIA and EMODB databases, respectively. Both models achieve better recognition performance than previous research results.
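The parallel fusion in P-CBGRU can be sketched as follows. This is a minimal PyTorch illustration of the idea only: the layer sizes, feature dimension (40 frame-level acoustic features), number of emotion classes, and fusion by concatenation are assumptions for the example, not specifications from the thesis.

```python
import torch
import torch.nn as nn

class PCBGRU(nn.Module):
    """Illustrative parallel CNN + Bi-GRU model (hypothetical sizes)."""
    def __init__(self, n_feats=40, n_classes=6, hidden=64):
        super().__init__()
        # CNN branch: captures local spatial patterns in the feature map
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),                       # -> (batch, 16)
        )
        # Bi-GRU branch: captures temporal context across frames
        self.bigru = nn.GRU(n_feats, hidden, batch_first=True,
                            bidirectional=True)
        # Fuse the two branches by concatenation, then classify
        self.fc = nn.Linear(16 + 2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, frames, n_feats)
        c = self.cnn(x.unsqueeze(1))            # add a channel dimension
        g, _ = self.bigru(x)                    # (batch, frames, 2*hidden)
        g = g[:, -1, :]                         # summary: last time step
        return self.fc(torch.cat([c, g], dim=1))

model = PCBGRU()
out = model(torch.randn(2, 100, 40))            # 2 utterances, 100 frames
print(out.shape)                                # torch.Size([2, 6])
```

Because the two branches see the same input and their outputs are concatenated before the classifier, neither branch's information is discarded, which is the stated advantage over the cascaded CBGRU.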
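The TCN → Bi-GRU → attention pipeline of ATCN-GRU can likewise be sketched. All dimensions, the two-layer dilated-convolution stand-in for the TCN, and the simple additive frame-level attention are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class ATCNGRU(nn.Module):
    """Illustrative TCN + Bi-GRU + attention model (hypothetical sizes)."""
    def __init__(self, n_feats=40, n_classes=7, ch=32, hidden=32):
        super().__init__()
        # Stacked dilated 1-D convolutions standing in for a TCN block;
        # increasing dilation widens the temporal receptive field
        self.tcn = nn.Sequential(
            nn.Conv1d(n_feats, ch, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=2, dilation=2), nn.ReLU(),
        )
        self.bigru = nn.GRU(ch, hidden, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # per-frame attention score
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, frames, n_feats)
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.bigru(h)                    # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # weights over frames
        ctx = (w * h).sum(dim=1)                # attention-weighted context
        return self.fc(ctx)                     # logits; Softmax in the loss

net = ATCNGRU()
logits = net(torch.randn(3, 50, 40))            # 3 utterances, 50 frames
print(logits.shape)                             # torch.Size([3, 7])
```

The attention weights sum to one over the frame axis, so the context vector emphasizes the frames most associated with the emotion label, matching the role the abstract assigns to the attention mechanism.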
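Focal loss down-weights well-classified examples so that training concentrates on hard and minority-class samples, which is how it addresses the class imbalance in EMODB. A standard sketch (the focusing parameter gamma=2.0 is the common default, not necessarily the thesis's setting):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Cross-entropy per sample, then scale by (1 - p_t)^gamma:
    # easy examples (p_t near 1) contribute almost nothing,
    # so rare-class / hard examples dominate the gradient.
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                 # model probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

logits = torch.randn(8, 7)              # batch of 8 over 7 emotion classes
targets = torch.randint(0, 7, (8,))
loss = focal_loss(logits, targets)
```

With gamma = 0 the expression reduces to ordinary cross-entropy; larger gamma shifts more weight onto misclassified samples.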
Keywords/Search Tags: speech emotion recognition, deep neural networks, attention mechanisms, focal loss