Research On Speaker-independent Speech Emotion Recognition Based On Deep Learning

Posted on: 2024-04-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Lu
Full Text: PDF
GTID: 1528307364967819
Subject: Information and Communication Engineering (Signal and Information Processing)
Abstract/Summary:
As a common means of communication between human beings, speech contains rich emotional information. Therefore, enabling machines to automatically and accurately recognize emotions in speech, i.e., speech emotion recognition (SER), is a key technology for realizing natural human-computer interaction (HCI). Since the speech signal is a short-term stationary signal with time-varying characteristics, it is difficult to capture the emotional variables within it, and it is easily disturbed by other acoustic factors, e.g., noise, speaker, and language. Among these, speaker information is a prominent factor. Different speakers carry specific information in identity, age, accent, language, etc., which is often reflected in acoustic features such as timbre and rhythm. As a result, speaker features are easily confused with emotional ones and are difficult to decouple. Improving speaker-independent SER is therefore an effective means of promoting the generalization of SER models, and it has attracted wide attention from researchers in recent years.

This dissertation focuses on two challenging problems in speaker-independent SER: (1) how to extract more discriminative speech emotion features, and (2) how to eliminate the feature distribution discrepancy caused by speakers. A series of well-performing deep neural networks is proposed to address speaker-independent SER effectively. The main work of this dissertation covers the following four aspects:

1. We propose a novel Multi-level Time-Frequency Feature Learning (MTFFL) method for speaker-independent SER, approached from the perspective of discriminative speech representation extraction. By mining the emotional information in the speech signal under the interference of multiple factors (especially speaker information), MTFFL enhances emotional discrimination and thereby curbs the influence of speaker information. To this end, MTFFL takes full advantage of the fact that emotional information is distributed across both the time and frequency domains of speech signals, and it simultaneously models multiple levels of emotional information in speech, e.g., frame-level phonemic features, segment-level word/phrase features, and sentence-level semantic features. A nested Transformer model then performs feature fusion learning from local to global, capturing long-range emotional dependencies across the multi-level features to enhance the emotional discrimination of speech features, as illustrated in the sketch below.
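The following is a minimal sketch of such local-to-global, multi-level fusion with nested Transformer encoders, assuming log-Mel frame features as input; the class name, layer sizes, and mean-pooling segmentation are illustrative assumptions rather than the dissertation's actual configuration.

```python
import torch
import torch.nn as nn

class NestedTransformerSketch(nn.Module):
    """Hypothetical local-to-global fusion: frame -> segment -> sentence."""
    def __init__(self, d_model=128, n_heads=4, seg_len=10, n_emotions=4):
        super().__init__()
        def enc():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
                num_layers=2)
        self.frame_enc = enc()    # local: frame-level (phoneme-scale) dependencies
        self.segment_enc = enc()  # middle: segment-level (word/phrase-scale) dependencies
        self.utter_enc = enc()    # global: sentence-level semantic dependencies
        self.seg_len = seg_len
        self.cls = nn.Linear(d_model, n_emotions)

    def forward(self, x):  # x: (batch, n_frames, d_model) frame features
        b, t, d = x.shape
        h = self.frame_enc(x)
        # pool consecutive frames into fixed-length segments, then model segment context
        t_trim = (t // self.seg_len) * self.seg_len
        seg = h[:, :t_trim].reshape(b, -1, self.seg_len, d).mean(dim=2)
        g = self.utter_enc(self.segment_enc(seg)).mean(dim=1)  # utterance summary
        return self.cls(g)

# usage: NestedTransformerSketch()(torch.randn(2, 300, 128))  -> (2, 4) logits
```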
2. Building on the time-frequency properties of speech, the sparse distribution of emotional information in the time and frequency domains of speech signals is further explored. Inspired by this property, we propose an Attention Time-Frequency Neural Network (ATFNN) for speaker-independent SER. ATFNN first models the time domain and the frequency domain of speech separately to preserve the integrity of the information in each domain. Then, an attention network in the time-frequency domain is designed to capture the key frequency bands and time frames that are highly related to emotion. In addition, unrelated factors (e.g., speaker information and noise) can be masked through a sparsity constraint. Finally, discriminative emotion features of speech are obtained through a joint time-frequency learning strategy (see the ATFNN sketch below).

3. From the perspective of the feature distribution discrepancy caused by different speakers, we borrow feature distribution adaptation from transfer learning and propose a Domain-Invariant Feature Learning (DIFL) framework for speaker-independent SER. DIFL first reformulates the speaker-independent SER problem as a multi-source unsupervised domain adaptation (UDA) problem. In detail, we embed a hierarchical distribution alignment layer with a strong and a weak distribution alignment strategy into the backbone network. In addition, a domain adversarial layer with an emotion discriminator, a domain discriminator, and a speaker discriminator is designed to eliminate the domain shifts caused by speakers, both between the source and target domains and among the multiple speakers within the source domain, while maintaining the emotional discrimination of the speech features (see the DIFL sketch below).

4. Under the multi-source UDA setting of speaker-independent SER, marginal distribution adaptation and conditional distribution adaptation are further adopted to better handle the domain shift caused by speakers. Motivated by this consideration, we propose an Adaptive Joint Distribution Adaptation (AJDA) method for speaker-independent SER. AJDA mainly addresses two issues in multi-source UDA caused by different speakers: one is how to adapt the fine-grained distribution shift of each emotion; the other is how to quantify the contributions of different types of distribution adaptation when handling speech samples collected from new speakers. To this end, AJDA aligns the global and local distribution shifts more precisely through a joint distribution adaptation strategy under multi-source UDA, and then adopts an adaptive balance coefficient to quantify the contributions of marginal and conditional distribution adaptation (see the AJDA sketch below).
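A rough sketch of ATFNN-style time-frequency attention over a spectrogram follows. Here the sparsity constraint is approximated by sigmoid gates with an L1 penalty, which is an assumption for illustration rather than the dissertation's exact mechanism; all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class TimeFreqAttentionSketch(nn.Module):
    """Hypothetical gating over frequency bands and time frames of a spectrogram."""
    def __init__(self, n_mels=64, n_frames=300, n_emotions=4):
        super().__init__()
        self.freq_att = nn.Linear(n_frames, 1)  # one gate per frequency band
        self.time_att = nn.Linear(n_mels, 1)    # one gate per time frame
        self.cls = nn.Linear(n_mels + n_frames, n_emotions)

    def forward(self, spec):  # spec: (batch, n_mels, n_frames)
        gate_f = torch.sigmoid(self.freq_att(spec).squeeze(-1))                  # (b, n_mels)
        gate_t = torch.sigmoid(self.time_att(spec.transpose(1, 2)).squeeze(-1))  # (b, n_frames)
        freq_feat = (gate_f.unsqueeze(-1) * spec).sum(dim=1)  # keep emotion-salient bands
        time_feat = (gate_t.unsqueeze(1) * spec).sum(dim=2)   # keep emotion-salient frames
        # L1 penalty drives gates toward zero, masking speaker/noise-dominated regions
        sparsity_loss = gate_f.mean() + gate_t.mean()
        logits = self.cls(torch.cat([freq_feat, time_feat], dim=-1))  # joint learning
        return logits, sparsity_loss
```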
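Next, a DANN-style sketch of DIFL's adversarial layer: a gradient-reversal function feeds the domain and speaker discriminators, so gradients that would help them separate domains or speakers instead push the encoder toward domain- and speaker-invariant features. The hierarchical strong/weak alignment layer is omitted, and the head shapes are assumed.

```python
import torch.nn as nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DIFLHeadsSketch(nn.Module):
    """Hypothetical emotion/domain/speaker heads over shared encoder features."""
    def __init__(self, d=128, n_emotions=4, n_speakers=8, n_domains=2):
        super().__init__()
        self.emotion_head = nn.Linear(d, n_emotions)  # keeps features emotion-discriminative
        self.domain_head = nn.Linear(d, n_domains)    # source vs. target domain (adversarial)
        self.speaker_head = nn.Linear(d, n_speakers)  # source speakers (adversarial)

    def forward(self, feat, lam=1.0):
        # reversed gradients train the encoder to confuse domain/speaker discriminators
        rev = GradReverse.apply(feat, lam)
        return self.emotion_head(feat), self.domain_head(rev), self.speaker_head(rev)
```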
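Finally, a sketch of an AJDA-style joint adaptation loss, where a balance coefficient mu weights conditional (per-emotion) against marginal (whole-domain) alignment. Since target emotion labels are unavailable under UDA, per-class alignment relies on pseudo-labels; the feature-mean discrepancy stands in for a proper MMD, and the learnable sigmoid parameterization of mu is an assumption, not necessarily how the dissertation computes the coefficient.

```python
import torch
import torch.nn as nn

def mean_discrepancy(a, b):
    """Squared distance between feature means; a linear stand-in for MMD."""
    return (a.mean(dim=0) - b.mean(dim=0)).pow(2).sum()

class AJDALossSketch(nn.Module):
    """Hypothetical joint loss: (1 - mu) * marginal + mu * conditional alignment."""
    def __init__(self, n_emotions=4):
        super().__init__()
        self.n_emotions = n_emotions
        self.mu_logit = nn.Parameter(torch.zeros(1))  # adaptive balance coefficient

    def forward(self, src_feat, src_y, tgt_feat, tgt_pseudo_y):
        # global (marginal) alignment over whole source/target domains
        marginal = mean_discrepancy(src_feat, tgt_feat)
        # local (conditional) alignment per emotion, using target pseudo-labels
        conditional = src_feat.new_zeros(())
        used = 0
        for c in range(self.n_emotions):
            s = src_feat[src_y == c]
            t = tgt_feat[tgt_pseudo_y == c]
            if len(s) > 0 and len(t) > 0:
                conditional = conditional + mean_discrepancy(s, t)
                used += 1
        if used > 0:
            conditional = conditional / used
        mu = torch.sigmoid(self.mu_logit)  # mu in (0, 1), adapted during training
        return (1 - mu) * marginal + mu * conditional
```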
Keywords/Search Tags:speech emotion recognition, speaker-independent speech emotion recognition, deep learning, time-frequency feature learning, unsupervised domain adaptation