
Study On Attention Based Speech Emotion Recognition

Posted on: 2021-05-10  Degree: Master  Type: Thesis
Country: China  Candidate: Z T Bao  Full Text: PDF
GTID: 2428330623471426  Subject: Computer application technology
Abstract/Summary:
The development of speech emotion recognition has made human-machine interfaces more humanized, and the technology is now widely applied in many aspects of social life; the field has become one of the most significant research directions in artificial intelligence. In recent years, with the rapid progress of deep learning, remarkable achievements have been made in a series of domains, including pattern recognition, speech recognition, and natural language processing. Deep learning has also been successfully applied to speech emotion recognition, where neural network architectures are used to extract more robust features and to build powerful models that exploit temporal information; models adapted from related domains have further improved performance. However, several problems remain. First, current deep learning methods consider only temporal information or only contextual information and lack an effective way to combine the two. Second, models applied to speech emotion recognition usually treat the task as a sequence-to-label problem, which ignores subtle emotional fluctuations within an utterance. Finally, existing work rarely exploits prior knowledge from related domains to improve performance.

In this thesis, we propose the Attention-BLSTM-CNN model, which combines the temporal information extracted by a BLSTM with the contextual information extracted by a CNN. An attention-based fusion method computes a weight for each branch and produces the emotion prediction as a weighted combination, thereby solving the problem of combining temporal and contextual features. In addition, we reformulate the original sequence-to-label task as a sequence-to-sequence task and apply Connectionist Temporal Classification (CTC) to capture subtle emotional fluctuations; the attention mechanism is also used here to generate better features and to improve the performance of the CTC structure. Furthermore, based on the theory of knowledge transfer, we incorporate prior information from the speech recognition domain to guide learning in speech emotion recognition, using a Teacher-Student network to transfer attention weights as prior knowledge. Finally, building on these results, we propose an attention-based multi-model fusion method that assembles the advantages of the preceding methods in a parallel structure to generate the final emotion prediction.

Experiments were conducted on the IEMOCAP and FAU-AEC datasets, using spectral features obtained via the Short-Time Fourier Transform. The attention-based fusion method achieved 72.5% unweighted accuracy (UA) and 71.5% weighted accuracy (WA) on IEMOCAP, and 52.1% UA on FAU-AEC. These results demonstrate the robustness and effectiveness of the attention mechanism.
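The attention-weighted fusion of the BLSTM and CNN branches can be sketched as follows. This is a minimal illustration only: the scoring function (a learned vector `w` dotted with each branch's feature vector), the feature dimension, and all variable names are assumptions for demonstration, not the thesis's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fusion(h_blstm, h_cnn, w, b=0.0):
    """Fuse a temporal (BLSTM) and a contextual (CNN) feature vector.

    Hypothetical scoring: score_i = w . h_i + b for each branch i;
    the fused representation is the attention-weighted sum of branches.
    """
    feats = np.stack([h_blstm, h_cnn])   # (2, d): one row per branch
    scores = feats @ w + b               # (2,): unnormalized branch scores
    alpha = softmax(scores)              # (2,): attention weights, sum to 1
    fused = alpha @ feats                # (d,): weighted combination
    return fused, alpha

# Illustrative usage with random 8-dimensional features.
rng = np.random.default_rng(0)
h_blstm = rng.normal(size=8)
h_cnn = rng.normal(size=8)
w = rng.normal(size=8)
fused, alpha = attention_fusion(h_blstm, h_cnn, w)
```

In a full model the fused vector would feed a classification layer; here the weights `alpha` show how much each branch contributes to the prediction.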
Keywords/Search Tags: Speech Emotion Recognition, Deep Learning, Connectionist Temporal Classification, Attention Mechanism, Attention Transfer