Speech Emotion Recognition Based On Three-layer Model

Posted on: 2021-07-20  Degree: Master  Type: Thesis
Country: China  Candidate: D H Miao  Full Text: PDF
GTID: 2518306302953969  Subject: Applied Statistics
Abstract/Summary:
Emotion has always played an important role in everyday interpersonal communication. Computer technology is now relatively mature, and applications of artificial intelligence let people interact with machines. It is therefore important for machines to have "emotions", or at least to understand human emotions, and this has become an active research topic. Humans commonly express emotions such as anger, fear, happiness, neutrality, sadness, and surprise, and they do so through several channels, including facial expressions, voice, and body language. Speech signals (human voices) are a good source for affective computing because of their inherent advantages, which has led to the rise of speech emotion recognition (SER) technology. Automatic speech recognition (ASR), the well-known neighboring field, has for decades performed only speech-to-text conversion and has ignored the emotional information contained in speech. A mature SER technology can therefore be expected to contribute substantially to progress in speech recognition.

An SER system consists of three main parts: first, pre-processing the speech data; second, extracting emotional features from the speech; and finally, using the extracted features together with emotion labels to train an emotion classifier. Because speakers are affected by their own physical condition and by the external environment, emotions in speech are diverse and volatile. Speech emotion recognition is therefore a difficult task and remains challenging at the current level of artificial intelligence. From early traditional features such as MFCC to later deep features extracted with neural networks, researchers have achieved measurable gains in recognition accuracy.

For speech feature extraction, this paper considers traditional features such as Mel-frequency cepstral coefficients (MFCC) and also extracts short-term energy, pitch frequency, and zero-crossing rate for multi-feature fusion. Building on deep neural network (DNN) technology, this paper proposes a speech emotion recognition method based on a three-layer model and studies the robustness of the model to speech in different languages. The model can mine deep emotional information from speech signals and can also extract more discriminative features for easily confused emotions. It first obtains a confusion matrix through coarse classification and uses it to calculate the degree of confusion between emotions. A threshold on that degree determines the structure of a decision tree. Different DNNs are then trained for the different emotion groups (the leaf nodes of the decision tree) to extract the bottleneck features used to train each XGBOOST classifier in the tree.

Based on the proposed three-layer model, several speech emotion classification experiments and comparative experiments are carried out. The model is first evaluated on the Chinese Academy of Sciences emotional speech database (CASIA). The experimental results show that the average emotion recognition rate of this method is 4.1% higher than that of the shallow feature + XGBOOST method and 1.6% higher than that of the bottleneck feature + XGBOOST method. These results indicate that the method reduces the confusion between emotions and improves the speech emotion recognition rate. The German data set (EMO_DB) and the English data set (RAVDESS) are then added to build multilingual models. It is found that speech in different languages has similar characteristics for the same emotion. All the shallow features extracted in this paper can be used to construct a multilingual speech emotion recognition model, and the model recognizes emotions well in the languages involved in training.
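The grouping step described above (confusion matrix, degree of confusion, threshold, emotion groups) can be illustrated with a minimal Python sketch. The abstract does not give the formula for the degree of confusion or the threshold value, so everything specific here is an assumption made for illustration: the degree between two emotions is taken as the mean of their two mutual misclassification rates, emotions are merged whenever that degree exceeds the threshold, and the function names, the toy confusion matrix, and the threshold of 0.10 are hypothetical.

```python
from typing import Dict, List, Sequence


def confusion_degree(cm: Sequence[Sequence[int]]) -> List[List[float]]:
    """Symmetric degree of confusion between every pair of classes.

    Assumption (not from the thesis): degree(i, j) is the mean of the
    row-normalised rates P(pred=j | true=i) and P(pred=i | true=j).
    """
    n = len(cm)
    rates = []
    for row in cm:
        total = sum(row)
        rates.append([count / total if total else 0.0 for count in row])
    return [
        [0.0 if i == j else (rates[i][j] + rates[j][i]) / 2 for j in range(n)]
        for i in range(n)
    ]


def group_emotions(
    cm: Sequence[Sequence[int]], labels: Sequence[str], threshold: float
) -> List[List[str]]:
    """Merge emotions whose degree of confusion exceeds the threshold."""
    degree = confusion_degree(cm)
    parent = list(range(len(labels)))  # union-find forest over class indices

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if degree[i][j] > threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[str]] = {}
    for idx, label in enumerate(labels):
        groups.setdefault(find(idx), []).append(label)
    return sorted(groups.values())


# Hypothetical toy data: rows are true emotions, columns are predictions.
TOY_LABELS = ["angry", "fear", "happy", "neutral", "sad", "surprise"]
TOY_CM = [
    [80, 2, 12, 2, 1, 3],
    [3, 70, 2, 3, 20, 2],
    [14, 2, 75, 3, 1, 5],
    [2, 3, 3, 88, 3, 1],
    [1, 18, 1, 4, 75, 1],
    [4, 2, 6, 1, 1, 86],
]

print(group_emotions(TOY_CM, TOY_LABELS, threshold=0.10))
# [['angry', 'happy'], ['fear', 'sad'], ['neutral'], ['surprise']]
```

In this toy case the two multi-emotion groups would each receive their own DNN and XGBOOST classifier, while the single-emotion groups need no further splitting. A lower threshold yields larger groups and a shallower tree; a higher one leaves every emotion on its own.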
Keywords/Search Tags:Speech Emotion Recognition, MFCC, Zero Crossing Rate, DNN, XGBOOST