
Speech Emotion Recognition Via Domain Adaptation

Posted on: 2021-02-03 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Elias Nii Noi Ocquaye | Full Text: PDF
GTID: 1368330623979233 | Subject: Computer application technology
Abstract/Summary:
Speech Emotion Recognition (SER), also called acoustic emotion recognition, has advanced significantly over the past few decades of research on speech, speaker identity, and emotion. A wide range of tasks has been investigated with the aim of building effective, efficient, human-like SER systems for real-world use. Nevertheless, the many publicly and privately available speech emotion corpora differ in numerous respects, so the standard premise that training and testing samples are drawn from the same distribution and parameterized over the same feature space rarely holds in practice. As a result, SER systems face large distribution disparities when they are trained and tested on different corpora. In addition, the cross-lingual setting, in which training and testing use different languages, remains an open challenge. To address these challenges, this dissertation proposes three novel unsupervised domain adaptation SER models:

1) To address domain disparity and to build a model that explicitly captures domain shift without target-domain labels, we propose an unsupervised domain adaptation method based on a Coupled Deep Convolutional Neural Network (CDCNN) architecture. The architecture minimizes the correlation alignment loss (CORAL loss) between the source and target distributions, without target labels, to reduce domain shift and learn strong nonlinear transformations. The weights of corresponding layers in the two streams are not shared but are kept related, which is effective for modeling the shift from one domain to the other. We evaluate the method using the INTERSPEECH 2009 Emotion Challenge's FAU Aibo Emotion Corpus as the target dataset and two publicly available corpora (ABC and Emo-DB) as source datasets. Experimental results show that the approach outperforms other state-of-the-art SER methods, with an unweighted average recall (UAR) of 62.51% and 64.96% on the ABC and Emo-DB corpora, respectively.
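In its standard formulation (Deep CORAL, Sun and Saenko, 2016), the correlation alignment loss named above is the squared Frobenius distance between the source and target feature covariance matrices. Below is a minimal PyTorch sketch; the function name `coral_loss` and the assumption that features arrive as (batch, dim) tensors are illustrative choices, not taken from the dissertation.

```python
import torch

def coral_loss(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """CORAL loss: squared Frobenius distance between the covariance
    matrices of source and target features, scaled by 1 / (4 d^2)."""
    d = source.size(1)

    def covariance(x: torch.Tensor) -> torch.Tensor:
        n = x.size(0)
        xm = x - x.mean(dim=0, keepdim=True)  # centre the features
        return (xm.t() @ xm) / (n - 1)

    diff = covariance(source) - covariance(target)
    return (diff ** 2).sum() / (4 * d * d)
```

Minimizing this term pulls the second-order statistics of the two streams together even though no target labels are available.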
2) To extract salient, detailed feature representations and improve model robustness, we propose a Dual Exclusive Attentive Transfer (DEAT) architecture for deep convolutional neural networks, again in an unsupervised domain adaptation setting. The architecture applies an unshared attentive transfer procedure to the convolutional adaptation of the source and target domains. It also performs dual domain adaptation on the convolutional and fully connected layers by aligning the second-order correlation statistics of the two domains, learning effective nonlinear transformations while capturing discriminative features. To model the shift between dissimilar domains, the weights of corresponding layers are kept exclusive but related. The model jointly minimizes the classification loss on the labeled source domain and the correlation alignment losses of the convolutional and fully connected layers (a sketch of this combined objective follows the abstract). We evaluate the architecture using the INTERSPEECH 2009 Emotion Challenge FAU Aibo Emotion Corpus as the target dataset and two publicly available corpora (ABC and Emo-DB) as source datasets. Our experimental results show that the proposed domain adaptation method is superior to other state-of-the-art methods, with an unweighted average recall (UAR) of 65.02% and 67.79% on the ABC and Emo-DB corpora, respectively.

3) Finally, we address the cross-lingual challenge, in which a model tested on a corpus in a different language performs poorly. We propose a triple attentive asymmetric convolutional neural network for cross-lingual and cross-corpus speech emotion recognition in an unsupervised setting. The method adopts the joint supervision of a softmax loss and a center loss to learn highly discriminative feature representations for the target domain through the use of high-quality pseudo-labels. The model uses three attentive convolutional neural networks asymmetrically: two of the networks, trained on labeled source samples, artificially label the unlabeled target samples, and the third network learns salient, discriminative target features from the pseudo-labeled target samples. We evaluate the method on datasets in three languages (English, German, and Italian). Compared with other state-of-the-art methods, the proposed method achieves the highest emotion recognition accuracy in both evaluation scenarios.
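One plausible wiring of the DEAT objective in 2) is sketched below, reusing `coral_loss` from the previous sketch. The signature, the flattening of convolutional feature maps, and the weighting factors `lam_conv`/`lam_fc` are assumptions made here for illustration; the dissertation states only that the source classification loss and the correlation alignment losses of the convolutional and fully connected layers are minimized jointly.

```python
import torch
import torch.nn.functional as F

def deat_style_loss(logits_src, labels_src,
                    conv_src, conv_tgt,      # convolutional feature maps
                    fc_src, fc_tgt,          # fully connected activations
                    lam_conv=1.0, lam_fc=1.0):
    """Joint objective: supervised loss on the labelled source domain plus
    CORAL alignment at a convolutional and a fully connected layer."""
    cls = F.cross_entropy(logits_src, labels_src)
    align = lam_conv * coral_loss(conv_src.flatten(1), conv_tgt.flatten(1))
    align = align + lam_fc * coral_loss(fc_src, fc_tgt)
    return cls + align
```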
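For 3), two building blocks are standard enough to sketch: the center loss of Wen et al. (2016) and an agreement-based rule for selecting high-quality pseudo-labels from the two labelling networks. The confidence threshold and the exact agreement rule below are illustrative assumptions; the dissertation does not spell them out.

```python
import torch

class CenterLoss(torch.nn.Module):
    """Center loss: pulls each feature toward the centre of its
    (pseudo-)class; used jointly with the softmax loss."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        return 0.5 * (feats - self.centers[labels]).pow(2).sum(dim=1).mean()

def agreement_pseudo_labels(logits_a, logits_b, threshold=0.9):
    """Keep a target sample only when both labelling networks predict the
    same class with confidence above the threshold (assumed rule)."""
    conf_a, pred_a = logits_a.softmax(dim=1).max(dim=1)
    conf_b, pred_b = logits_b.softmax(dim=1).max(dim=1)
    mask = (pred_a == pred_b) & (torch.minimum(conf_a, conf_b) >= threshold)
    return pred_a[mask], mask
```

The third network would then be trained on the masked target samples with a softmax (cross-entropy) loss plus this center-loss term.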
Keywords/Search Tags: attention transfer, correlation alignment, speech emotion recognition, unsupervised domain adaptation, cross-lingual, center loss, triple attentive asymmetric convolutional neural network