
Research On Speech Emotion Recognition Based On Feature Fusion Of CNN And BLSTM

Posted on: 2021-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: H L Lv
Full Text: PDF
GTID: 2428330629953011
Subject: Electronic and communication engineering
Abstract/Summary:
Speech emotion recognition plays an important role in anthropomorphizing computers so that they can sense human emotions and adaptively provide a comfortable conversational environment for interlocutors. As one of the main communication media of human beings, speech carries not only basic textual information but also rich emotional information. How to extract this emotional information from speech signals is therefore of great significance for speech emotion recognition. However, owing to difficulties in building emotion databases, searching for emotion features, designing modeling algorithms, and other factors, speech emotion recognition remains full of challenges.

Traditional speech emotion recognition research focuses mainly on feature extraction, and most such studies concentrate on designing the most discriminative hand-crafted features for emotion recognition. Since the birth of deep learning, many deep neural networks have been rapidly and widely applied in speech recognition, image recognition, natural language processing, and other fields, bringing a new idea to speech emotion recognition: obtaining the best feature representation through deep learning. Building on traditional speech emotion recognition methods and driven by existing research progress on deep neural networks, this thesis uses Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BLSTM) networks, and feature fusion to realize speech emotion recognition. The specific research contents are as follows:

(1) A BLSTM network combines the advantages of the long short-term memory network and the bidirectional recurrent neural network, and can learn the temporal context information of a speech sequence. Since every layer of a BLSTM produces an output, combining the outputs of all layers fuses shallow features with deep features; adding the features of each BLSTM layer in effect supplements the high-level network information with low-level network information. A multi-output BLSTM network model for speech emotion recognition is therefore proposed to make full use of the context information from each layer; a minimal sketch is given below. On the 7 emotion classes of the EMO-DB emotion database, it obtained a weighted accuracy of 92.27% and an unweighted accuracy of 91.30%. The same network model was tested on CASIA, where the weighted and unweighted accuracies both reached 85.56%, showing that the multi-output BLSTM network model still transfers well to a Chinese speech environment. These experimental results show that the context information is fully utilized.
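The following is a minimal sketch of the multi-output BLSTM idea: each BLSTM layer's output is summed with the outputs of the layers below it, so low-level information supplements high-level features before classification. PyTorch is used for illustration only (the thesis does not specify a framework), and the input dimension, hidden size, layer count, and time-pooling choice are illustrative assumptions rather than the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class MultiOutputBLSTM(nn.Module):
    """Sketch: BLSTM stack whose per-layer outputs are additively fused."""
    def __init__(self, input_dim=39, hidden_dim=128, num_layers=3, num_classes=7):
        super().__init__()
        # Stack BLSTM layers individually so each layer's output is accessible.
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            in_dim = input_dim if i == 0 else 2 * hidden_dim
            self.layers.append(
                nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)
            )
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                      # x: (batch, time, input_dim)
        fused = None
        for lstm in self.layers:
            x, _ = lstm(x)                     # (batch, time, 2 * hidden_dim)
            # Additive fusion: shallow-layer outputs supplement deep ones.
            fused = x if fused is None else fused + x
        pooled = fused.mean(dim=1)             # average over time steps
        return self.classifier(pooled)         # emotion logits

model = MultiOutputBLSTM()
logits = model(torch.randn(4, 100, 39))        # 4 utterances, 100 frames each
```

Because every BLSTM layer emits a sequence of the same width (2 * hidden_dim), the per-layer outputs can be summed directly, which is what realizes the shallow-plus-deep feature fusion described above.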
(2) Although the multi-output BLSTM model performs well on EMO-DB and CASIA, its performance on IEMOCAP declines significantly. Deep-learning-based speech emotion recognition is often limited to using a spectrogram or hand-crafted features as input and therefore cannot capture enough emotional information. To address this defect, a feature fusion method based on CNN and BLSTM is proposed to learn richer emotional features by combining spatial features with contextual features. The log-mel spectrogram is used as the input of the CNN to extract the spatial features of the speech signal, and statistical features are used as the input of the BLSTM to extract its contextual features. The two models perceive different emotional information from different perspectives and jointly learn emotional characteristics with better recognition performance. In the recognition test on the IEMOCAP emotion database, the weighted and unweighted accuracies were 74.14% and 65.62%, respectively. In addition, the effectiveness of the CNN-BLSTM feature fusion model is verified by comparing it with existing models.

(3) Finally, a speech emotion recognition method that applies a deep neural network directly to the raw signal is proposed. The raw speech data carries the emotional information, two-dimensional spatial information, and temporal context information of the speech signal. The model is trained in an end-to-end manner, and the network automatically learns a feature representation of the raw speech signal without manual feature extraction; a sketch of this architecture follows below. The network model combines the advantages of both the CNN and BLSTM neural networks: a CNN learns spatial features from the raw speech data, and a BLSTM is then added to learn contextual features. To evaluate the effectiveness of the system, recognition tests were carried out on three different emotion databases: IEMOCAP, EMO-DB, and CASIA. The experimental results show that the proposed method is superior to the baseline model in both weighted accuracy and unweighted accuracy.
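Below is a hedged sketch of the end-to-end CNN + BLSTM pipeline on raw speech: 1-D convolutions learn local spatial features directly from the waveform, and a BLSTM then models temporal context over the resulting feature sequence. The kernel sizes, strides, channel counts, and 4-class head are illustrative assumptions, not the thesis's reported settings.

```python
import torch
import torch.nn as nn

class RawCNNBLSTM(nn.Module):
    """Sketch: end-to-end emotion recognizer on the raw waveform."""
    def __init__(self, num_classes=4):
        super().__init__()
        # CNN front end: raw waveform (batch, 1, samples) -> frame-level features.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.BatchNorm1d(64), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3), nn.BatchNorm1d(128), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # BLSTM back end: temporal context over the CNN feature sequence.
        self.blstm = nn.LSTM(128, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, num_classes)

    def forward(self, wave):                    # wave: (batch, 1, samples)
        feats = self.cnn(wave)                  # (batch, 128, frames)
        feats = feats.transpose(1, 2)           # (batch, frames, 128)
        out, _ = self.blstm(feats)              # (batch, frames, 256)
        return self.classifier(out.mean(dim=1)) # utterance-level emotion logits

model = RawCNNBLSTM()
logits = model(torch.randn(2, 1, 16000))        # two 1-second clips at 16 kHz
```

The same CNN-for-spatial-features plus BLSTM-for-context division of labor also underlies the feature fusion model of item (2); here the two are simply chained on the raw signal so no hand-crafted features are needed.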
Keywords/Search Tags:speech emotion recognition, CNN, BLSTM, feature fusion, raw signal