
Study On Emotion Recognition For Spoken And Written Language Considering Physiological And Behavioral Traits

Posted on: 2021-11-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z C Peng    Full Text: PDF
GTID: 1488306548974389    Subject: Computer application technology

Abstract/Summary:
The coming era of the Internet of Everything offers enormous opportunities for the field of human-robot interaction. Speech is the most natural and convenient modality of interaction and usually comprises spoken text and spoken voice. Emotional information can help a robot monitor the speaker's mental state during human-robot interaction and understand the speaker's true intention. In the Internet environment, emotional information in speech can be transmitted through spoken text (such as microblog posts) and spoken voice. Sentiment in spoken text usually refers to the speaker's conscious emotional expression; from the perspective of speech production, it reflects the speaker's physiological and behavioral state, and speakers in different emotional states differ from normal individuals in both their spoken text and their interaction behavior. Emotion in spoken voice refers to the speaker's unconscious, involuntary emotional expression; from the perspective of speech perception, the speech signal is transferred to the auditory cortex through a series of transformations in the listener's auditory system, and the pitch, intensity, and duration of different emotions elicit different physiological and behavioral responses in that system. This study therefore focuses on emotion recognition from spoken text and spoken voice with the aid of physiological and behavioral traits. Guided by the characteristics of text production and speech perception, physiological and behavioral features are integrated into speech emotion recognition from two perspectives: in sentiment analysis of spoken text, the physiological and psychological state of microblog users is mined from the text content and interaction behavior of their microblogs; in speech emotion recognition, the listener's auditory mechanism is used to extract effective emotion-related features that improve recognition accuracy. The main contents and innovations include the following four aspects:

(1) This study proposes a sentiment analysis method for spoken text based on speech interaction behaviors, applied to identifying depressed users from their spoken text. Reflecting the characteristics of spoken text, a depression emotion dictionary covering colloquial expressions and emoticons is constructed, and text feature representations are extracted based on this dictionary. Multi-kernel learning is then used to find the optimal mapping between the heterogeneous features and emotion in order to identify depressed users. Experimental results show that combining the text feature representation with interaction behavior features is an effective approach to emotion mining.

(2) This study first proposes an emotion recognition method based on cochlear filtering. Although this method achieves better results than an MFCC-based method, it has obvious shortcomings, so the study further proposes an emotion recognition method based on modulation filtering. Modulation filtering is introduced to generate multi-dimensional temporal modulation cues, and a 3D convolutional neural network (CNN) then learns joint spectral-temporal features directly from these cues. The experimental results show that the 3D CNN can effectively extract emotion-discriminative auditory representations from the temporal modulation cues.
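As an illustration of the heterogeneous-feature fusion described in (1), the sketch below combines an RBF kernel over lexicon-based text features with an RBF kernel over interaction-behavior features using a fixed weight, and feeds the result to an SVM with a precomputed kernel. This is a simplified stand-in for the dissertation's multi-kernel learning (which learns the kernel weights); the feature dimensions, the weight w, and the toy data are all assumptions.

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    def combined_kernel(text_a, text_b, behav_a, behav_b, w=0.6):
        # Fixed-weight sum of an RBF kernel on lexicon-based text features
        # and an RBF kernel on interaction-behavior features (w is assumed).
        return w * rbf_kernel(text_a, text_b) + (1.0 - w) * rbf_kernel(behav_a, behav_b)

    # Toy data: 100 users, 300-dim lexicon features, 8-dim behavior features (assumed sizes).
    rng = np.random.default_rng(0)
    X_text = rng.normal(size=(100, 300))
    X_behav = rng.normal(size=(100, 8))
    y = rng.integers(0, 2, size=100)          # 1 = depressed, 0 = control

    K_train = combined_kernel(X_text, X_text, X_behav, X_behav)
    clf = SVC(kernel="precomputed").fit(K_train, y)

    # At prediction time the kernel is computed between new users and the training users.
    K_test = combined_kernel(X_text[:10], X_text, X_behav[:10], X_behav)
    print(clf.predict(K_test))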
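For the modulation-filter method in (2), the following PyTorch sketch shows one way a small 3D CNN could learn joint spectral-temporal features from a volume of temporal modulation cues (modulation channel × frequency band × time frame). It is not the dissertation's network; the layer sizes, input dimensions, and number of emotion classes are assumptions.

    import torch
    import torch.nn as nn

    class Modulation3DCNN(nn.Module):
        # Small 3D CNN over a (modulation-channel, frequency, time) volume of
        # temporal modulation cues; all sizes are illustrative only.
        def __init__(self, n_classes=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
            )
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):               # x: (batch, 1, mod, freq, time)
            h = self.features(x).flatten(1)
            return self.classifier(h)

    # Example: 2 utterances, 8 modulation channels, 64 gammatone bands, 128 frames (assumed).
    logits = Modulation3DCNN()(torch.randn(2, 1, 8, 64, 128))
    print(logits.shape)                     # torch.Size([2, 4])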
(3) Inspired by the mechanism of human auditory attention, this study proposes an attention-based sliding recurrent neural network (ASRNN) model for speech emotion recognition. Continuous attention is realized through a sliding window that extracts continuous segment-level internal representations, and selective attention is then realized through a temporal attention model. Finally, the correlation between the attention model and human auditory attention is analyzed through a subjective evaluation experiment. The experimental results show that the model can effectively capture salient emotional regions in the auditory representation.

(4) Inspired by the multi-scale modulation mechanism of the human auditory system, this study proposes a dimensional emotion recognition method based on multi-resolution modulation-filtered cochleagram (MMCG) features. MMCG encodes temporal modulation cues into modulation-filtered cochleagrams at different resolutions in order to capture both temporal and contextual modulation cues. Because each modulation-filtered cochleagram in MMCG carries modulation cues at a different scale, a parallel LSTM network structure is designed to model multiple temporal dependencies across the different-resolution features and to track the dynamics of emotion over time. Experimental results show that MMCG features capture multi-scale emotion information and that the parallel LSTM effectively tracks the temporal dynamics of emotion.
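To make the ASRNN idea in (3) concrete, the sketch below implements a sliding-window RNN with temporal attention pooling: overlapping windows of frames are encoded into segment-level representations by a GRU, a learned attention layer weights the segments, and the attended utterance representation is classified. The window length, hop size, hidden size, and feature dimension are illustrative assumptions, not the thesis settings.

    import torch
    import torch.nn as nn

    class SlidingAttentionRNN(nn.Module):
        # Sliding-window RNN with temporal attention pooling (illustrative sizes).
        def __init__(self, n_feats=40, hidden=64, n_classes=4, win=25, hop=10):
            super().__init__()
            self.win, self.hop = win, hop
            self.segment_rnn = nn.GRU(n_feats, hidden, batch_first=True)
            self.att = nn.Linear(hidden, 1)          # scores each segment representation
            self.classifier = nn.Linear(hidden, n_classes)

        def forward(self, x):                        # x: (batch, frames, n_feats)
            segs = x.unfold(1, self.win, self.hop)   # (batch, n_seg, n_feats, win)
            segs = segs.permute(0, 1, 3, 2)          # (batch, n_seg, win, n_feats)
            b, s, w, f = segs.shape
            _, h = self.segment_rnn(segs.reshape(b * s, w, f))
            seg_repr = h[-1].reshape(b, s, -1)       # segment-level internal representations
            alpha = torch.softmax(self.att(seg_repr), dim=1)   # temporal attention weights
            utt = (alpha * seg_repr).sum(dim=1)      # attended utterance representation
            return self.classifier(utt), alpha

    logits, weights = SlidingAttentionRNN()(torch.randn(2, 300, 40))
    print(logits.shape, weights.shape)               # (2, 4) and (2, n_seg, 1)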
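For the MMCG method in (4), the sketch below wires up a parallel LSTM in PyTorch: each modulation-filtered cochleagram resolution is fed to its own LSTM branch, and the branch outputs are concatenated for frame-level valence/arousal regression. The number of resolutions, channel counts, and regression targets are assumptions rather than the dissertation's configuration.

    import torch
    import torch.nn as nn

    class ParallelLSTMRegressor(nn.Module):
        # One LSTM branch per MMCG resolution; fused outputs drive frame-level
        # valence/arousal regression (all dimensions are assumed).
        def __init__(self, feat_dims=(64, 64, 64, 64), hidden=64, n_targets=2):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.LSTM(d, hidden, batch_first=True) for d in feat_dims
            )
            self.regressor = nn.Linear(hidden * len(feat_dims), n_targets)

        def forward(self, xs):                       # xs: list of (batch, frames, dim) tensors
            outs = [lstm(x)[0] for lstm, x in zip(self.branches, xs)]
            fused = torch.cat(outs, dim=-1)          # (batch, frames, hidden * n_branches)
            return self.regressor(fused)             # frame-level arousal/valence predictions

    # Four MMCG resolutions for one 500-frame utterance (placeholder shapes).
    mmcg = [torch.randn(1, 500, 64) for _ in range(4)]
    print(ParallelLSTMRegressor()(mmcg).shape)       # torch.Size([1, 500, 2])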
Keywords/Search Tags: Affective recognition, spoken text, spoken voice, human auditory characteristics, interaction behavior