With the rapid development of information technology and the growing maturity of artificial intelligence, more and more intelligent devices have entered the public's view. Emotion is considered one of the three major challenges facing the development of AI and has attracted wide attention from both academia and industry. Speech emotion recognition obtains an individual's emotional state directly from the speech signal, thereby endowing machines with the ability to perceive and express human emotions. At present, the field is dominated by end-to-end methods based on deep learning, which aim to build highly robust speech emotion recognition systems. Starting from the production and perception process of emotional speech and from individual cognitive characteristics, this thesis explores the essential characteristics that effectively describe different emotion categories, namely the invariance of emotional information in speech. On this basis, the main research contents of this thesis comprise the following two parts:

1. The relative invariance of emotion. This part studies the influence of speaker and content factors on the invariance of speech emotion and proposes a speech emotion recognition method based on the introduction of relative information. Because speech signals contain a large amount of emotion-irrelevant content and speaker information, the thesis first extracts speaker and content codes with Deep Speaker and a Transformer, respectively, then experimentally explores the position, weight coefficient, and fusion mode with which these codes are introduced into the model, and finally presents a method that uses this relative information to improve the model's emotion recognition performance. Through an analysis of misrecognitions across emotion categories, the specific effect of the introduced relative information on each emotion is summarized, and the relative invariance of emotion is examined in depth. The results show that introducing speaker and content information improves the recognition accuracy of anger to a certain extent.
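The following is a minimal, illustrative sketch of how such a fusion could look; it is not the thesis's actual implementation. The module name RelativeInfoSER, the embedding dimensions, and the fusion_weight coefficient are all assumptions, and the speaker and content codes are assumed to be precomputed by Deep Speaker and a Transformer encoder.

```python
import torch
import torch.nn as nn

class RelativeInfoSER(nn.Module):
    """Hypothetical sketch: fuse speaker and content codes (relative
    information) with acoustic emotion features before classification."""

    def __init__(self, acoustic_dim=128, spk_dim=512, content_dim=256,
                 num_emotions=4, fusion_weight=0.3):
        super().__init__()
        self.fusion_weight = fusion_weight            # assumed weight coefficient
        self.acoustic_proj = nn.Linear(acoustic_dim, 256)
        self.relative_proj = nn.Linear(spk_dim + content_dim, 256)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(256, num_emotions))

    def forward(self, acoustic_feat, spk_code, content_code):
        # Concatenate the two emotion-irrelevant codes and add them,
        # scaled by the weight coefficient, to the projected acoustic features.
        relative = torch.cat([spk_code, content_code], dim=-1)
        fused = self.acoustic_proj(acoustic_feat) \
                + self.fusion_weight * self.relative_proj(relative)
        return self.classifier(fused)                  # emotion logits
```

In this sketch, the fusion point (a single additive merge before the classifier) and the weight coefficient are exactly the kinds of design choices the thesis reports exploring experimentally.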
2. Cognition-based emotional invariance of speech. To address the currently insufficient modeling of emotion cognition, and drawing on the developmental characteristics of individual cognition, this part first adopts curriculum learning: sample difficulty is graded either by model error or by the relativity of individual emotion perception (the degree of divergence among individuals), so as to fit the gradual process by which individuals learn emotions. Experiments show that the method based on individual divergence outperforms the one based on model error. Furthermore, building on the integrated nature of individual cognition and learning from external feedback, and combining cognitive commonality with differences in emotional expression, the thesis explores the influence of center loss, triplet center loss, and focal loss on the model's emotional invariance features (a minimal sketch of one such loss term is given after this summary). When classifying basic emotions, the resulting model captures the invariant features shared within an emotion category, so that its samples cluster in the feature space, while also capturing the essential differences between categories, so that their samples remain relatively dispersed, thereby improving the model's ability to describe the invariance of emotion.

In summary, based on the characteristics of emotion expression and cognition, and combined with the processes of speech production and perception, this thesis proposes a speech emotion recognition method that introduces relative information; then, based on the developmental and integrated characteristics of individual cognition, curriculum learning and several loss functions are adopted to explore cognition-based invariance of speech emotion in depth.
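As an illustration of the feature-space objectives mentioned in part 2, the sketch below shows center loss, which pulls each sample toward a learnable center of its emotion class; the class count, feature dimension, and weighting constant are assumptions, not values from the thesis.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Sketch of center loss: one learnable center per emotion class;
    features of the same class are pulled toward their center so that
    same-emotion samples cluster in the feature space."""

    def __init__(self, num_classes=4, feat_dim=256):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        batch_centers = self.centers[labels]                 # (B, feat_dim)
        return 0.5 * ((features - batch_centers) ** 2).sum(dim=1).mean()

# Typical usage (assumed weighting): total = cross_entropy + 0.01 * center_loss,
# balancing inter-class discriminability against intra-class compactness.
```

Triplet center loss and focal loss would replace or complement this term in the same position of the total objective.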