
Research On Speech Emotion Recognition Based On Multi-Feature Fusion

Posted on: 2022-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhao
Full Text: PDF
GTID: 2518306350482394
Subject: Control Science and Engineering
Abstract/Summary:
With the development of the times and the maturation of artificial intelligence technology, human-computer interaction is receiving increasing attention, and speech is one of its most direct channels, as seen in the smart speakers and voice assistants that have emerged in recent years. If speech emotion recognition can be realized, machines can provide more humanized services. Speech emotion recognition also has a wide range of applications in fields such as customer-service quality monitoring and dialogue robots. Although speech emotion recognition has developed rapidly with the rise of deep learning and its accuracy has improved over earlier work, shortcomings remain: a single speech emotion feature makes limited use of the original speech signal, and emotional information is unevenly distributed within it. How to design a network that addresses these problems is the central question of this thesis, which proposes solutions in the following three aspects.

First, traditional MFCC frame-level features are used as the initial representation of the speech signal, and one-dimensional convolutional neural networks of different depths are compared to select a baseline network. To address the uneven distribution of emotion, channels and key frames are treated separately: an SE module serves as the channel attention module and a BiLSTM serves as the temporal attention module, generating weights that retain key information and suppress redundancy. Because the serial structure of the whole model causes vanishing gradients when the temporal attention module is used, a residual structure is introduced and a feature fusion stage is added. Experimental comparison shows improved recognition rates on CASIA and IEMOCAP.

Second, features are extracted from both the frequency domain and the time domain of the speech signal using a spectrogram-based method, with a convolutional neural network as the baseline. From the perspective of attention, channel and spatial attention are adopted as the improvement scheme, the deep and shallow features of the network are analyzed, and the deep-shallow feature fusion modules PANet and BiFPN are tried. To address the class imbalance among emotion categories and the difficulty of identifying certain classes, the Focal loss function is introduced. Experimental comparison shows that the final recognition performance improves.

Finally, from the multi-feature perspective of speech emotion, traditional speech emotion features are added on top of the different feature extraction models and improvement schemes above; in preparation for subsequent feature fusion, a feature alignment module is proposed for dimension alignment. Experiments then compare ensemble learning, feature concatenation, and BiLSTM-based element-level and decision-level fusion methods. Compared with the previous single networks, emotion recognition improves, and the speech signal can be analyzed from multiple angles.
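To make the first design concrete, the following is a minimal PyTorch sketch of an SE channel attention module paired with a BiLSTM temporal attention module over MFCC features with a residual shortcut. All layer sizes, module names, and tensor shapes here are illustrative assumptions, not the thesis's actual implementation.

```python
# Sketch only: SE channel attention + BiLSTM temporal attention on MFCCs.
# Shapes assume input (batch, n_mfcc, frames); names are hypothetical.
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-Excitation channel attention for 1-D feature maps."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, T)
        w = self.fc(x.mean(dim=2))             # squeeze over time -> (B, C)
        return x * w.unsqueeze(2)              # re-weight each channel

class TemporalAttention(nn.Module):
    """BiLSTM followed by a learned soft weight over frames."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, x):                      # x: (B, C, T)
        h, _ = self.lstm(x.transpose(1, 2))    # (B, T, 2*hidden)
        a = torch.softmax(self.score(h), dim=1)  # per-frame weights (B, T, 1)
        return (a * h).sum(dim=1)              # weighted pooling -> (B, 2*hidden)

class MFCCEmotionNet(nn.Module):
    def __init__(self, n_mfcc=39, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(inplace=True),
        )
        self.proj = nn.Conv1d(n_mfcc, 64, kernel_size=1)  # residual projection
        self.se = SEBlock1d(64)
        self.temporal = TemporalAttention(64)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                      # x: (B, n_mfcc, T)
        f = self.se(self.conv(x)) + self.proj(x)  # residual shortcut eases gradients
        return self.head(self.temporal(f))
```

The residual projection mirrors the abstract's remedy for vanishing gradients in the serial attention pipeline: the shortcut gives gradients a path around the attention-weighted branch.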
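For the spectrogram branch of the second study, a combined channel-and-spatial attention block in the spirit of CBAM could look like the sketch below; the reduction ratio, kernel size, and use of paired average/max descriptors are assumptions rather than the thesis's confirmed design.

```python
# Sketch only: channel + spatial attention for a 2-D spectrogram feature map.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                      # x: (B, C, F, T)
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).unsqueeze(2).unsqueeze(3)
        # Spatial attention: 7x7 conv over channel-wise avg/max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```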
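The Focal loss mentioned in the second study down-weights easy, well-classified examples so that training concentrates on hard and under-represented emotion classes. A minimal multi-class sketch follows; the gamma value and optional per-class alpha weights are illustrative.

```python
# Sketch only: multi-class focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt    # (1 - p_t)^gamma shrinks easy examples
    if alpha is not None:                     # optional class weights, shape (n_classes,)
        loss = alpha[targets] * loss
    return loss.mean()
```

With gamma = 0 and no alpha this reduces to ordinary cross-entropy, which is why it slots in as a drop-in replacement for the baseline loss.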
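Finally, the dimension-alignment idea of the third study, projecting heterogeneous feature streams (CNN embeddings, traditional acoustic statistics) to a common width before fusing them, might be sketched as follows. The module names and the two fusion variants shown are hypothetical illustrations of feature concatenation, element-level fusion, and decision-level fusion, not the thesis's code.

```python
# Sketch only: align feature streams to a common dimension, then fuse.
import torch
import torch.nn as nn

class AlignAndFuse(nn.Module):
    """Project each feature stream to a common width, then fuse."""
    def __init__(self, in_dims, common_dim=128, n_classes=6, mode="concat"):
        super().__init__()
        self.aligners = nn.ModuleList([nn.Linear(d, common_dim) for d in in_dims])
        self.mode = mode
        fused = common_dim * (len(in_dims) if mode == "concat" else 1)
        self.head = nn.Linear(fused, n_classes)

    def forward(self, feats):                  # feats: list of (B, d_i) tensors
        z = [a(f) for a, f in zip(self.aligners, feats)]
        if self.mode == "concat":              # feature splicing
            fused = torch.cat(z, dim=1)
        else:                                  # element-level fusion (sum)
            fused = torch.stack(z, dim=0).sum(dim=0)
        return self.head(fused)

def decision_fusion(logit_list):
    """Decision-level fusion: average class posteriors of separate models."""
    probs = [torch.softmax(l, dim=1) for l in logit_list]
    return torch.stack(probs, dim=0).mean(dim=0)
```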
Keywords/Search Tags:Speech Emotion Recognition, Convolutional Neural Network, Recurrent Neural Network, Multi-feature Fusion, Attention Mechanism