The research of speech emotion recognition (SER) aims to endow machines with the intelligence to perceive and understand emotional states, e.g., happiness, surprise, and sadness, from human speech signals. It has become a hot topic in the research fields of affective computing, pattern recognition, and speech signal processing. However, most existing SER works design and evaluate their methods on a single speech emotion corpus, and hence the generalization performance of these methods may not satisfy the requirements of real-world applications. For this reason, this dissertation focuses on a more challenging but interesting task in SER, i.e., unsupervised cross-corpus SER, in which the labeled training (source) and unlabeled testing (target) speech samples belong to different speech emotion corpora. To address this issue, we conduct extensive research and propose a novel idea: jointly enhancing the emotion discriminability and corpus independence of SER models so as to eliminate the feature distribution difference between the source and target speech datasets and thereby improve their generalization. Following this idea, we propose three novel transfer learning methods for unsupervised cross-corpus SER. The main contributions of this dissertation can be summarized as follows:

(1) We propose a novel model called Joint Distribution Adaptive Regression (JDAR) to solve the problem of cross-corpus SER. The basic idea of JDAR is to design a feature distribution difference metric based on the first-order statistical moment (the mean value) to serve as a regularization term that considers both the marginal and the emotion-label-guided conditional probability distributions. By resorting to this well-designed regularization term, a linear regression model trained on the labeled source speech samples can predict the emotion labels of the target ones. Moreover, we also propose an extension of JDAR called Emotion Wheel Knowledge guided Joint
Distribution Adaptive Regression (EWK-JDAR). Different from JDAR, we design an additional conditional probability distribution adaptation constraint for EWK-JDAR based on the high and low valence label information of the source and target speech samples. Experimental results showed that the generalization performance of EWK-JDAR can be further enhanced compared with the original JDAR, owing to its consideration of the confusion of speech samples with respect to the valence label information.

(2) We propose a deep transfer learning model with stronger corpus adaptability and higher emotion discriminability, called Progressively Distribution Aligned Neural Networks (PDAN), to solve cross-corpus SER. The PDAN model can be seen as a deep learning version of the EWK-JDAR model. With the help of the powerful nonlinear mapping ability and hierarchical feature learning of deep neural networks, PDAN replaces the regression coefficient matrix in EWK-JDAR with a deep neural network and directly builds the relationship between speech spectrograms and emotion labels. Thus, handcrafted features for describing speech signals are no longer needed in dealing with cross-corpus SER tasks; in other words, PDAN handles cross-corpus SER in an end-to-end way. Meanwhile, PDAN makes use of the three types of distribution adaptation terms from EWK-JDAR in a different way, i.e., aligning them with different feature layers of the deep neural network so that each term respectively regularizes and guides the feature learning of its layer. This coarse-to-fine adaptation approach can make full use of the marginal distribution, the coarse emotion label guided conditional distribution, and the fine-grained emotion label guided conditional distribution measurements. Consequently, PDAN can achieve more promising performance in coping with cross-corpus SER tasks.

(3) We propose a Progressively Discriminative Transfer Neural Networks (PDTN) model to solve the cross-corpus SER task. Different from
works (1) and (2), the PDTN model focuses on exploring the feasibility of further improving the emotion discriminability of the model to help it cope with cross-corpus SER tasks. Inspired by the progressive distribution adaptation idea used in PDAN, we design an additional loss function, called Progressive Center Loss, which serves as a constraint together with the distribution difference elimination loss to guide the learning of the deep neural network. Therefore, PDTN can enforce the features learned in the shallow fully connected layer to cluster around the coarse emotion centers and, meanwhile, the features learned in the deep fully connected layer to cluster around the fine emotion centers. Thus, PDTN can learn more discriminative features for cross-corpus SER. The experimental results showed that the PDTN model achieves more promising performance in dealing with cross-corpus SER than PDAN, which verifies the feasibility of further improving the emotion discriminability of the model to solve the cross-corpus SER task.
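The mean-based joint distribution adaptation idea shared by JDAR, EWK-JDAR, and PDAN can be sketched as follows. This is a minimal NumPy illustration, not the dissertation's implementation: the function name is ours, and the use of model-predicted pseudo-labels for the unlabeled target samples in the conditional term is an assumption on our part.

```python
import numpy as np

def joint_distribution_discrepancy(Xs, Xt, ys, yt_pseudo, n_classes):
    """Illustrative mean-based distribution gap in the spirit of JDAR.

    Marginal term: squared distance between the feature means of the
    source (Xs) and target (Xt) sample sets (first-order moment).
    Conditional term: the same distance computed per emotion class,
    where the unlabeled target samples use pseudo-labels (yt_pseudo)
    predicted by the current model.
    """
    # Marginal adaptation: difference of first-order moments (means).
    marginal = np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2)

    # Conditional adaptation: per-class mean difference, guided by
    # source labels and target pseudo-labels.
    conditional = 0.0
    for c in range(n_classes):
        Xs_c, Xt_c = Xs[ys == c], Xt[yt_pseudo == c]
        if len(Xs_c) and len(Xt_c):
            conditional += np.sum((Xs_c.mean(axis=0) - Xt_c.mean(axis=0)) ** 2)
    return marginal + conditional
```

In the linear-regression setting of JDAR this quantity acts as a regularizer on the regression coefficients; in PDAN, analogous terms are attached to successive feature layers of the network.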
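The Progressive Center Loss of PDTN can likewise be sketched in NumPy. This is a simplified illustration under our own naming and layout assumptions (fixed, precomputed class centers; a mapping `coarse_of` from fine emotion classes to coarse, e.g. valence-level, groups), whereas in a trained network the centers would be learned jointly with the features.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Mean squared distance of each feature vector to its class center."""
    diffs = features - centers[labels]
    return np.mean(np.sum(diffs ** 2, axis=1))

def progressive_center_loss(shallow_f, deep_f, fine_labels, coarse_of,
                            coarse_centers, fine_centers):
    """Illustrative Progressive Center Loss: shallow-layer features are
    pulled toward coarse emotion centers, while deep-layer features are
    pulled toward fine emotion centers, yielding coarse-to-fine
    discriminative feature learning."""
    coarse_labels = coarse_of[fine_labels]
    return (center_loss(shallow_f, coarse_labels, coarse_centers)
            + center_loss(deep_f, fine_labels, fine_centers))
```

In training, this term would be added to the classification loss and the distribution difference elimination loss to guide the network.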