
Research On Speech Emotion Recognition Based On Neural Network And Attention Mechanism

Posted on: 2022-01-01  Degree: Master  Type: Thesis
Country: China  Candidate: W J Song  Full Text: PDF
GTID: 2518306545490184  Subject: Electronic Science and Technology
Abstract/Summary:
As the most natural form of communication between people, speech usually carries rich emotional information. In the field of emotion recognition, speech emotion recognition (SER) studies how the various emotions in speech are formed and how they change. Even with the development of deep learning, recognizing emotion from speech remains a challenging problem, because people express emotions in different ways and the features that distinguish emotions are not clear-cut. At present, extracting the emotion features most relevant to speech and improving the hierarchical structure of the model are the mainstream research directions in speech emotion recognition, and these choices directly affect the recognition accuracy of the whole system. Building on study and practice of this prior work, a speech emotion recognition system based on a neural network and an attention mechanism is proposed to improve the recognition performance of existing models.

In the preliminary stage of the research, we surveyed the basic theory in the related literature, which provided sufficient theoretical support for the subsequent work. Exploring the influence of the emotional features in the CHEAVD 2.0 speech emotion dataset on the recognition model, we found that the features in the low-frequency part of the spectrogram discriminate better between emotion categories. To further explore the performance of spectrogram features, we trained on the IEMOCAP dataset. Focal loss is used to balance the contribution of each emotion category to the total loss. Compared with the baseline neural network model, the model achieved improvements of 1.59% (WA) and 4.41% (UA) on the IEMOCAP corpus, and recognition of the happiness category improved by 7.9%.

To reduce the impact of non-emotional information in the IEMOCAP dataset on the recognition performance of the model, multi-head attention is introduced into the speech emotion recognition model. By applying different transformations to the input, multi-head attention lets the model learn different aspects of the features, capturing emotional features more comprehensively and further improving recognition performance. The experimental results show that the neural network model with the multi-head attention mechanism improves WA by 7.16% and UA by 8.73%. However, although this model improves the accuracy of emotion classification on the IEMOCAP dataset, it takes more time to train, especially on long input sequences. Locality-sensitive hashing (LSH) can quickly find nearest neighbors in high-dimensional space; in other words, it can reduce the complexity of the attention layer from O(L²) in the sequence length L to roughly O(L·log L). Whereas the training time of the multi-head attention model grows approximately linearly with the length of the input sequence, the training speed of the LSH-attention speech emotion recognition model remains relatively stable. This effectively reduces the training time of the entire model, while its recognition accuracy is almost the same as that of the multi-head attention model.
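The focal loss mentioned above down-weights easy, well-classified examples so that rare emotion categories (such as happiness) contribute more to the total loss. A minimal NumPy sketch of the standard formulation FL(p_t) = -α·(1 - p_t)^γ·log(p_t) is below; the `alpha` and `gamma` values are illustrative defaults, not the thesis's actual hyperparameters:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Focal loss FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t).

    probs:   (N, C) softmax class probabilities
    targets: (N,)   integer class labels
    The (1 - p_t)**gamma factor shrinks the loss of confident
    predictions, so hard / minority-class examples dominate.
    """
    p_t = probs[np.arange(len(targets)), targets]
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))
```

For a confidently correct prediction (p_t = 0.9) the modulating factor (1 - 0.9)² = 0.01 reduces its loss a hundredfold relative to plain cross-entropy, which is how minority emotion classes gain relative weight.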
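Multi-head attention, as described above, projects the same input into several subspaces so each head can attend to a different aspect of the features. A self-contained NumPy sketch of the standard mechanism (not the thesis's exact architecture; weight shapes and head count are illustrative):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product attention computed over n_heads subspaces.

    x: (T, d_model) frame-level features; Wq/Wk/Wv/Wo: (d_model, d_model).
    Each head sees a different linear projection of the input, so the
    heads can learn different aspects of the emotional features.
    """
    T, d = x.shape
    dh = d // n_heads

    def split(m):  # (T, d) -> (n_heads, T, dh)
        return m.reshape(T, n_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)      # (H, T, T)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # softmax rows
    out = (w @ v).transpose(1, 0, 2).reshape(T, d)       # concat heads
    return out @ Wo
```

Note the per-head score matrix is T x T: every frame attends to every other frame, which is the quadratic cost that motivates the LSH variant discussed next in the abstract.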
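The LSH idea can be illustrated with the random-rotation hash used in Reformer-style LSH attention: vectors that point in similar directions land in the same bucket, so attention only needs to be computed within buckets instead of over the full sequence. This is a sketch of the hashing step only, under assumed shapes, not the thesis's implementation:

```python
import numpy as np

def lsh_buckets(q, n_buckets, rng):
    """Assign each query vector to a bucket via a random projection.

    q: (T, d) query vectors. Projecting onto random directions and
    taking the argmax over [proj, -proj] means vectors with high
    cosine similarity tend to share a bucket, so full T x T attention
    can be restricted to within-bucket attention: roughly O(T log T)
    work instead of O(T^2).
    """
    R = rng.standard_normal((q.shape[1], n_buckets // 2))
    proj = q @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)
```

Because the hash depends only on a vector's direction, identical (or very similar) frames are guaranteed to receive the same bucket id, which is why restricting attention to buckets loses little accuracy while keeping the cost of each attention step nearly constant as sequences grow.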
Keywords/Search Tags: Speech emotion recognition, multi-head attention, CNN, GRU, LSH