Research On Speech Emotion Recognition Method Based On Multi-feature Fusion

Posted on: 2022-05-09  Degree: Master  Type: Thesis
Country: China  Candidate: Y Wang  Full Text: PDF
GTID: 2518306323993779  Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of artificial intelligence technology, more natural human-computer interaction has attracted wide attention. Speech is one of the most common modes of human-computer interaction, and its key requirement is that the machine can fully understand human emotion. Research on speech emotion recognition has therefore become an important task in the field of artificial intelligence. Current research on speech emotion recognition faces three main problems. First, most existing work considers only acoustic features or only semantic features; the emotional information captured is insufficient because the two are not combined. Second, most feature-fusion methods currently used in speech emotion recognition simply concatenate features, ignoring the differences between feature types. Third, existing speech emotion datasets are small, which leads to overfitting of speech emotion recognition models. To address these problems, this thesis carries out the following three tasks (minimal code sketches illustrating each step follow this abstract):

(1) Extraction of multiple emotional features based on acoustic and semantic information. Two types of emotional features are extracted: acoustic features and semantic features. For the acoustic features, in order to describe emotional information from different angles, this thesis extracts high-level statistical functions (HSFs) computed from low-level descriptors (LLDs), uses a DNN to extract deep features from spectral features, and uses a CNN to extract deep features from filter-bank features. For the semantic features, a LAS automatic speech recognition model built on the encoder-decoder framework serves as the semantic feature extractor, and a BiLSTM learns higher-level features from the output of the encoder.

(2) A feature-level/decision-level fusion model based on an attention mechanism. First, the three types of acoustic features are treated as independent features and fused with the semantic features at the feature level, where a Huffman-tree construction method is introduced to generate the feature-level fused features, which are then used for speech emotion recognition. Second, decision-level fusion with weighted voting is applied to exploit the complementary strengths of the different features and thereby improve the recognition rate. Finally, a feature-level/decision-level fusion model based on the attention mechanism is proposed: the attention mechanism assigns weights to the different results and integrates the outputs obtained from feature-level fusion and decision-level fusion.

(3) Three data augmentation methods for enlarging speech emotion datasets. Because existing speech emotion datasets are small and emotion annotation is subjective, constructing new speech emotion datasets is costly. This thesis expands the data by adding noise, perturbing the audio duration, and modifying the audio, which reduces this cost and improves the accuracy of speech emotion recognition. The proposed method achieves a recognition accuracy of 76.85% on IEMOCAP, an improvement of 3.85%, which demonstrates its effectiveness.
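What follows is a minimal sketch of the acoustic side of task (1), assuming librosa is available: frame-level LLDs (here MFCCs, frame energy, and zero-crossing rate; the abstract does not specify the thesis's exact LLD set) are summarized into utterance-level HSFs, and a log filter-bank map is prepared as input for the CNN branch.

```python
# Sketch only: LLD choices and statistics are illustrative, not the thesis's set.
import numpy as np
import librosa

def extract_hsfs(wav_path, sr=16000, n_mfcc=13):
    """Frame-level LLDs -> utterance-level statistics (mean/std/min/max)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    rms = librosa.feature.rms(y=y)                           # frame energy, (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)              # (1, T)
    llds = np.vstack([mfcc, rms, zcr])                       # (D, T)
    stats = [llds.mean(axis=1), llds.std(axis=1),
             llds.min(axis=1), llds.max(axis=1)]
    return np.concatenate(stats)                             # fixed-length HSF vector

def extract_fbank(wav_path, sr=16000, n_mels=40):
    """Log mel filter-bank map, the usual 2-D input for a CNN branch."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                          # (n_mels, T)
```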
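For the semantic branch, a sketch of how a BiLSTM can pool the hidden states of a pretrained LAS encoder into a single semantic feature vector; `SemanticHead`, `enc_dim`, and the mean-pooling step are assumptions for illustration, not the thesis implementation.

```python
# Assumes a LAS encoder pretrained elsewhere produces (batch, T, enc_dim) states.
import torch
import torch.nn as nn

class SemanticHead(nn.Module):
    def __init__(self, enc_dim=256, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(enc_dim, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, enc_out):
        # enc_out: (batch, T, enc_dim) hidden states from the LAS encoder
        h, _ = self.bilstm(enc_out)   # (batch, T, 2*hidden)
        return h.mean(dim=1)          # mean-pool over time -> (batch, 2*hidden)
```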
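For task (2), a sketch of the two fusion paths: weighted voting for decision-level fusion, and a small attention module that weights and combines the class-probability outputs of the feature-level and decision-level paths. The Huffman-tree feature construction and the individual branch classifiers are elided, and all names here are hypothetical.

```python
import torch
import torch.nn as nn

def weighted_vote(probs, weights):
    """Decision-level fusion: weighted average of per-branch class probabilities."""
    # probs: list of (batch, n_classes) tensors; weights: scalars summing to 1
    return sum(w * p for w, p in zip(weights, probs))

class AttentionFusion(nn.Module):
    """Attention over the two fusion paths' class-probability outputs."""
    def __init__(self, n_classes=4):
        super().__init__()
        # scores one attention logit per path from its probability vector
        self.score = nn.Linear(n_classes, 1)

    def forward(self, p_feat, p_dec):
        # p_feat, p_dec: (batch, n_classes) outputs of the feature-level
        # and decision-level paths, respectively
        stacked = torch.stack([p_feat, p_dec], dim=1)      # (batch, 2, n_classes)
        alpha = torch.softmax(self.score(stacked), dim=1)  # (batch, 2, 1) path weights
        return (alpha * stacked).sum(dim=1)                # (batch, n_classes)
```

Because the attention weights sum to 1 across the two paths, the fused output remains a valid probability distribution over emotion classes.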
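For task (3), a sketch of the three augmentation styles named above, with pitch shifting standing in as one plausible reading of "modifying the audio"; the SNR, rate, and semitone values are illustrative only.

```python
import numpy as np
import librosa

def add_noise(y, snr_db=20.0):
    """Mix in white noise at a target signal-to-noise ratio (dB)."""
    noise = np.random.randn(len(y))
    scale = np.sqrt((y ** 2).mean() / (10 ** (snr_db / 10) * (noise ** 2).mean()))
    return y + scale * noise

def speed_perturb(y, rate=1.1):
    """Time-stretch the waveform (changes duration, keeps pitch)."""
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr=16000, n_steps=2):
    """Shift pitch by n_steps semitones (keeps duration)."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
```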
Keywords/Search Tags:Speech emotion recognition, Acoustic features, Semantic features, Feature level-Decision level fusion, Data augmentation