
Research On Speech Emotion Recognition Based On Automatic Speech Frame Tagging And Domain Knowledge Transfer

Posted on: 2021-01-27  Degree: Master  Type: Thesis
Country: China  Candidate: H Q Liao  Full Text: PDF
GTID: 2428330611965591  Subject: Computer technology
Abstract/Summary:
In recent years, with the rise of deep learning, various deep learning methods have been applied to the study of speech emotion recognition. Although much research has been done in this field, several challenges remain. This thesis addresses three of them.

First, the judgment of speech emotion is subjective, and different people interpret the same utterance differently. This problem is inherent to the task, and the only practical way to handle it is to take the dataset labels as the standard. To fit the data as well as possible, the third chapter of this thesis selects an appropriate convolutional neural network as the backbone and experimentally compares several popular deep learning models and methods, including classic convolutional neural network architectures, the commonly used recurrent neural network and Transformer, and different pooling techniques.

Second, the frame-level training method in speech emotion recognition cuts an utterance into shorter speech frames and trains on these frames. As a result, all frames of an utterance are given the same emotion label, even though some of them may actually carry other emotions, such as a neutral one. To deal with this problem, the fourth chapter of this thesis draws on multi-instance learning methods from the field of sound event detection: it combines utterance-level and frame-level training, uses the model obtained by utterance-level training to select appropriate frames and tag them, and then carries out frame-level training on the tagged frames (see the sketch below). This method improves performance on three datasets.

Finally, current speech emotion datasets are very small. To make full use of the existing information despite the shortage of data, the fifth chapter of this thesis uses the VGGish model, pretrained on a large-scale audio dataset, to bring knowledge from the field of sound event detection into the speech emotion recognition task (see the fusion sketch below). Experiments on the three datasets show that introducing VGGish features into the model improves accuracy over the method of the fourth chapter. Compared with other state-of-the-art methods, the proposed method is close to the best-performing one on one dataset and achieves good results on the other two. In addition, experiments show that the methods in this thesis improve accuracy under different backbone models.
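The abstract does not give implementation details of the frame-tagging step, so the following is only a minimal sketch of the general idea: use a model trained at the utterance level to score individual frames, keep the frames that confidently support the utterance's label, and tag them for frame-level training. It assumes PyTorch, a hypothetical model `utt_model` that returns per-frame emotion posteriors, and an illustrative confidence threshold; none of these names or values come from the thesis.

```python
# Hedged sketch of frame tagging with an utterance-level model (assumptions:
# PyTorch; utt_model(x) returns per-frame posteriors of shape (1, T, C);
# the 0.5 threshold is illustrative, not the thesis's actual criterion).
import torch

@torch.no_grad()
def tag_frames(utt_model, features, utt_label, threshold=0.5):
    """Select frames that confidently agree with the utterance-level label.

    features  : (T, D) tensor of frame-level acoustic features for one utterance
    utt_label : int, emotion label annotated for the whole utterance
    returns   : (selected_frames, frame_labels) for subsequent frame-level training
    """
    utt_model.eval()
    frame_posteriors = utt_model(features.unsqueeze(0)).squeeze(0)  # (T, C)
    probs = frame_posteriors.softmax(dim=-1)                        # per-frame class probabilities
    confident = probs[:, utt_label] >= threshold                    # frames supporting the utterance label
    selected_frames = features[confident]
    frame_labels = torch.full((selected_frames.size(0),), utt_label, dtype=torch.long)
    return selected_frames, frame_labels
```

Frames rejected by the threshold could be discarded or treated as neutral; the abstract does not specify the exact selection rule, so the thresholding above is an assumption.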
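Likewise, the abstract does not state how the pretrained VGGish features are combined with the backbone. The sketch below shows one plausible fusion scheme, simple concatenation of a frozen VGGish embedding with the backbone's utterance representation before classification; the class names, dimensions, and number of emotion classes are assumptions for illustration only.

```python
# Hedged sketch of introducing pretrained VGGish embeddings as auxiliary features
# (assumptions: PyTorch; VGGish embeddings are precomputed 128-d vectors; fusion by
# concatenation is one plausible choice, not necessarily the thesis's exact scheme).
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, backbone, backbone_dim, vggish_dim=128, num_classes=4):
        super().__init__()
        self.backbone = backbone                                   # trainable emotion backbone (e.g., a CNN)
        self.classifier = nn.Linear(backbone_dim + vggish_dim, num_classes)

    def forward(self, spectrogram, vggish_embedding):
        emotion_feat = self.backbone(spectrogram)                  # (B, backbone_dim) utterance representation
        fused = torch.cat([emotion_feat, vggish_embedding], dim=-1)  # append sound-event-detection knowledge
        return self.classifier(fused)                              # (B, num_classes) emotion logits
```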
Keywords/Search Tags:Speech Emotion Recognition, Convolutional Neural Network, Speech Frame Tagging, Domain Knowledge Transfer