
Research On Speech Emotion Recognition Algorithm Based On Deep Learning

Posted on: 2020-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Liang
Full Text: PDF
GTID: 2428330599962095
Subject: Information and Communication Engineering
Abstract/Summary:
Speech Emotion Recognition (SER) has become a research hotspot in artificial intelligence in recent years, with broad application prospects in emotional robots, online education, customer service centers, assisted driving, and criminal investigation. Although much progress has been made, building a reasonable and efficient SER network model remains one of the main open problems. Starting from an analysis of the mainstream Convolutional Recurrent Neural Network (CRNN) recognition model, this thesis therefore pursues improvements in three areas: handling samples of unequal length, handling class-imbalanced samples, and handling the uneven distribution of emotion-bearing frames within an utterance. The main research work is as follows:

Firstly, for unequal-length samples, a variable-length input strategy is adopted. This avoids the emotion-category confusion and broken temporal continuity caused by segmenting long samples in a fixed-length input model, and it effectively improves recognition performance. In four-class emotion recognition experiments on the IEMOCAP corpus (neutral, happy, sad, angry), the model achieves 66.59% UAR (Unweighted Average Recall) and 69.33% WAR (Weighted Average Recall), improvements of 8.61% and 5.86% respectively over the fixed-length input model.

Secondly, for class-imbalanced samples, the focal loss function is used in place of inverse-frequency-weighted cross-entropy to train the model. This strengthens the model's ability to mine hard samples and to learn effectively from imbalanced data, yielding 68.66% UAR and 69.67% WAR, which is 2.06% and 0.34% higher than the "baseline" model.

Finally, for the uneven distribution of emotion-bearing frames, the Connectionist Temporal Classification (CTC) method is introduced into the "baseline" model. CTC aligns the emotion labels to the emotional frames, so that the model concentrates on learning from those frames, which effectively improves recognition performance: the experiment achieves 69.75% UAR and 70.42% WAR, increases of 1.09% and 0.75% over the "baseline" model. Since CTC learns all emotional frames to the same degree, an Attention Mechanism (AM) is further introduced into the "baseline" model; it assigns different attention weights to speech frames according to their emotional content, so that frames are learned to different degrees. This model achieves 71.77% UAR and 71.60% WAR, outperforming the CTC model above.
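The abstract does not specify the exact loss configuration used in the thesis; a minimal sketch of the focal-loss idea it invokes, assuming plain softmax class probabilities and no per-class weighting term, might look like this (the function name and signature are illustrative, not the author's code):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Multi-class focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t).

    probs   : (N, C) array of softmax probabilities
    targets : (N,) integer class labels
    gamma   : focusing parameter; gamma = 0 recovers plain cross-entropy
    """
    # Probability the model assigned to the true class of each sample.
    p_t = probs[np.arange(len(targets)), targets]
    # (1 - p_t)^gamma shrinks the loss of easy, well-classified samples,
    # so training effort concentrates on hard (often minority-class) ones.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

Because the modulating factor `(1 - p_t)^gamma` is at most 1, the focal loss never exceeds cross-entropy on the same batch, and easy samples contribute almost nothing, which is the hard-sample-mining effect the abstract describes.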
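The attention mechanism's frame weighting can be sketched in the simplest form consistent with the abstract: one scalar score per frame, softmax-normalized, then a weighted sum over time. The single scoring vector `w` is an assumed parameterization for illustration; the thesis may use a different attention formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, w):
    """Pool frame features into one utterance vector via attention.

    frames : (T, D) frame-level features, e.g. from the recurrent layers
    w      : (D,) learnable scoring vector (hypothetical parameterization)
    Returns a (D,) utterance-level representation.
    """
    scores = frames @ w      # (T,) one salience score per frame
    alpha = softmax(scores)  # attention weights, non-negative, sum to 1
    return alpha @ frames    # frames rich in emotion dominate the sum
```

With a zero scoring vector every frame gets equal weight and the pooling reduces to a plain mean, which mirrors the CTC model's equal treatment of frames; non-uniform scores are what let the network learn frames "to different degrees".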
Keywords/Search Tags:Speech Emotion Recognition, Convolutional Recurrent Neural Network, Focal Loss, Connectionist Temporal Classification, Attention Mechanism