
Speech Emotion Recognition Based On Deep Learning And Multi-Feature Fusion

Posted on: 2022-12-30    Degree: Master    Type: Thesis
Country: China    Candidate: Z Q Bao    Full Text: PDF
GTID: 2518306767977499    Subject: Automation Technology
Abstract/Summary:
With the rapid development of speech emotion recognition, the technology has gradually entered a variety of production scenarios and plays an irreplaceable role in some fields. For example, it can assist teachers by monitoring the state of students as they answer questions in class; it can remind drivers to drive safely by combining their voice, facial expressions, and behavior; and it can support sentiment analysis of different dialects to help achieve more accurate translation.

When applying deep learning to speech emotion recognition, most researchers build network models directly on hand-crafted feature sets and spectrograms extracted from the original speech. Few researchers perform speech emotion recognition directly on the raw speech signal with deep learning, possibly because a single utterance contains so many samples that a deep network struggles to extract emotional information effectively. At the same time, much current work relies only on hand-crafted features and spectrogram features to build recognition models. These manually extracted features lose part of the information contained in the original speech, which in turn degrades emotion recognition.

To address the difficulty of extracting effective deep features from raw speech, this thesis designs a feature extraction method based on convolutional neural networks that simulates the filters used in speech processing. A parallel combination of one-dimensional convolution and dilated convolution extracts the local and global features of the raw speech, while also encouraging the model to learn more diverse speech representations. Secondly, building on a study of the speech spectrogram, deep semantic information relating frequency and amplitude is extracted from the spectrogram with an unsupervised feature extraction method. Thirdly, a network model is designed to learn deep features from the hand-crafted features: through multi-dimensional learning it captures both the temporal information within the hand-crafted features and the relationships between them. Finally, a two-stage training strategy for multi-feature fusion is proposed. In the first stage, the individual models are trained separately to their best performance; in the second stage, joint training with feature fusion fine-tunes the model parameters. The result is a speech emotion recognition model driven jointly by the raw speech, the spectrogram, and the hand-crafted features.

The experimental data come from the IEMOCAP dataset, and the hand-crafted features are extracted with the eGeMAPS feature set in the openSMILE toolkit. The multi-feature speech emotion recognition model based on deep learning extracts deep feature information from the raw speech, the spectrogram, and the hand-crafted features, respectively, and achieves 65.3% unweighted accuracy and 64.0% weighted accuracy on the IEMOCAP dataset. This demonstrates that combining raw speech, hand-crafted features, and spectrograms through deep learning allows the representations to guide one another to a certain extent and leads to better results.
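The abstract describes a raw-waveform front end that pairs an ordinary one-dimensional convolution (local features) with a dilated convolution (global features) in parallel. The sketch below is a minimal illustration of that idea, assuming PyTorch as the framework; the channel counts, kernel sizes, strides, and dilation rate are illustrative assumptions, not the configuration used in the thesis.

```python
# Hypothetical sketch of a parallel 1-D / dilated-convolution front end for
# raw speech, in the spirit of the method described in the abstract.
import torch
import torch.nn as nn

class ParallelConvFrontEnd(nn.Module):
    def __init__(self, out_channels: int = 64):
        super().__init__()
        # Local branch: ordinary 1-D convolution over the raw waveform.
        self.local_branch = nn.Sequential(
            nn.Conv1d(1, out_channels, kernel_size=11, stride=5, padding=5),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
        )
        # Global branch: dilated 1-D convolution widens the receptive field
        # without adding parameters, capturing longer-range context.
        self.global_branch = nn.Sequential(
            nn.Conv1d(1, out_channels, kernel_size=11, stride=5,
                      dilation=4, padding=20),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples)
        local = self.local_branch(waveform)
        global_ = self.global_branch(waveform)
        # Concatenate along the channel axis so later layers see both the
        # local and the global view of the signal.
        return torch.cat([local, global_], dim=1)
```

With the padding chosen above, both branches produce feature maps of the same length, so concatenation simply doubles the channel dimension before the deeper emotion-recognition layers.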
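The hand-crafted features come from the eGeMAPS feature set of the openSMILE toolkit. The following is a minimal sketch of one way to obtain those features, assuming the Python wrapper of openSMILE (the "opensmile" package); the thesis may equally have used the SMILExtract command-line tool with the eGeMAPS configuration file, and "utterance.wav" is a placeholder path, not an IEMOCAP file.

```python
# Extract eGeMAPS functionals for one utterance with the openSMILE Python API.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # eGeMAPS functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file("utterance.wav")  # placeholder audio path
print(features.shape)  # one 88-dimensional functional vector per utterance
```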
Keywords/Search Tags:Speech Emotion Recognition, Unsupervised Learning, Long Short-Term Memory, Attention Mechanism, Multi-Feature Fusion