
Research On Speech Emotion Recognition Methods Based On Deep Learning And Transfer Learning

Posted on: 2017-05-23    Degree: Master    Type: Thesis
Country: China    Candidate: W T Xue    Full Text: PDF
GTID: 2308330509452543    Subject: Computer application technology

Abstract/Summary:
As an important means of interaction and a medium for emotion expression, speech has long been a major research direction in artificial intelligence. In traditional speech emotion recognition (SER), how to extract the most discriminative features has attracted many researchers. One challenge is to disentangle the emotion-related factors from the emotion-unrelated factors (e.g., environment, speakers) during feature extraction, so that the extracted emotion features are more robust. Traditional SER assumes that the training and test sets come from the same corpus, i.e., that they share the same data distribution. However, speech data obtained with different devices and under different recording conditions differ greatly in language, type of emotion, and labeling scheme. In that case, the training and test sets have different data distributions, and traditional SER approaches cannot handle this problem well. Domain adaptation (DA), a special form of transfer learning, has proven effective for dealing with the divergence between datasets.

In this paper, for traditional SER, we propose a neural-network-based method for speech emotion feature extraction; for cross-corpus SER, we propose a semi-supervised domain adaptation method based on priors sharing, and an unsupervised domain adaptation method based on label supervision and feature disentangling. Details are as follows:

1) We propose a method for extracting discriminative speech emotion features based on a neural network. The aim is to disentangle the emotion-related factors from the emotion-unrelated factors and thus extract emotion-related features. Four steps are involved. First, we carry out speech pre-processing and obtain spectrograms. Second, we perform unsupervised feature learning: many patches are randomly selected from the spectrograms and used for unsupervised learning, which yields the kernels (weights and biases).
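The kernels learned from random spectrogram patches are then applied to whole spectrograms by convolution and mean-pooling. A minimal NumPy sketch of that stage, under our own simplifying assumptions (a ReLU nonlinearity, non-overlapping pooling windows, and illustrative kernel/pool sizes not taken from the thesis):

```python
import numpy as np

def conv_mean_pool(spectrogram, kernels, biases, pool=4):
    """Convolve a spectrogram with learned kernels, then mean-pool.
    spectrogram: (F, T); kernels: list of (f, t) arrays, one per patch size.
    The pooling size and nonlinearity are illustrative assumptions."""
    feats = []
    for K, b in zip(kernels, biases):
        f, t = K.shape
        F, T = spectrogram.shape
        # valid convolution (cross-correlation) over the whole spectrogram
        out = np.empty((F - f + 1, T - t + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(spectrogram[i:i+f, j:j+t] * K) + b
        out = np.maximum(out, 0.0)                 # nonlinearity (assumed ReLU)
        # mean-pooling over non-overlapping pool x pool regions
        ph, pw = out.shape[0] // pool, out.shape[1] // pool
        pooled = out[:ph*pool, :pw*pool].reshape(ph, pool, pw, pool).mean(axis=(1, 3))
        feats.append(pooled.ravel())
    # stack pooled features of the different kernels into one rough representation
    return np.concatenate(feats)
```

Because kernels of different sizes produce differently shaped maps, pooling and flattening before concatenation gives one fixed-length "rough" feature vector per spectrogram.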
With patches of different sizes, we learn different kernels. Using these kernels, we perform convolution and mean-pooling on a whole spectrogram, and we stack the pooled features of the different kernels to obtain a rough feature representation. Third, we perform semi-supervised feature learning: the rough features are fed into a semi-supervised framework and disentangled into two parts, one emotion-related and the other emotion-unrelated. The overall loss function consists of four parts: a reconstruction penalty, an orthogonal penalty, a discriminative penalty, and a verification penalty. The reconstruction and orthogonal penalties disentangle the rough features into emotion-specific and emotion-unrelated features; we then place further constraints on the emotion-related part. The discriminative penalty increases the variation between the emotion-specific features of different emotions, while the verification penalty reduces intra-emotion variation. Finally, the emotion-related features learned by the semi-supervised framework, together with their labels, are used to train classifiers. The effectiveness of our approach is evaluated on the INTERSPEECH 2009 Emotion Challenge 5-class problem. Compared with the basic acoustic features, our approach achieves an improvement under the same conditions.

2) We propose a semi-supervised domain adaptation method based on priors sharing, aiming to let classes with few labeled examples in the target domain borrow knowledge from related classes in the source domain. The proposed model is a two-layer neural network: the first layer performs feature extraction and the second is a softmax classifier. Generally, the classifier parameters of one class are independent of those of all other classes, which works well when plenty of labeled examples per class are available.
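The four-part loss of the semi-supervised disentangling step can be sketched as follows. This is a NumPy sketch under our own simplifying assumptions (linear projections, squared-distance penalties, and a margin-based discriminative term); the thesis's exact formulation may differ:

```python
import numpy as np

def disentangling_loss(x, y, We, Wu, Wd, margin=1.0):
    """Four-part loss: reconstruction + orthogonal + discriminative
    + verification penalties. x: (n, d) rough pooled features;
    y: (n,) emotion labels; We, Wu: (d, k) projections to the
    emotion-related / emotion-unrelated parts; Wd: (2k, d) decoder.
    All parameter names and forms here are illustrative."""
    e = x @ We                                   # emotion-related part
    u = x @ Wu                                   # emotion-unrelated part
    # 1) reconstruction penalty: both parts together must rebuild x
    recon = np.mean((x - np.hstack([e, u]) @ Wd) ** 2)
    # 2) orthogonal penalty: the two parts should carry disjoint information
    ortho = np.sum((e.T @ u) ** 2) / len(x)
    # 3) discriminative penalty: push apart the means of different emotions
    # 4) verification penalty: pull same-emotion samples toward their mean
    disc, verif = 0.0, 0.0
    means = {c: e[y == c].mean(axis=0) for c in np.unique(y)}
    for c, mu in means.items():
        verif += np.mean(np.sum((e[y == c] - mu) ** 2, axis=1))
        for c2, mu2 in means.items():
            if c2 != c:
                disc += max(0.0, margin - np.sum((mu - mu2) ** 2))
    return recon + ortho + disc + verif
```

The first two terms make the split into the two parts well defined; the last two act only on the emotion-related part, enlarging inter-emotion variation while shrinking intra-emotion variation.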
However, in semi-supervised DA there are only a small number of labeled examples in the target domain, which is not enough to train a robust classifier. We therefore introduce sharing priors between the classifier parameters of related classes (i.e., the weights of related classes are drawn from the same distribution). Three steps are involved. First, we perform speech pre-processing and obtain a feature representation with 384 attributes. Second, the unlabeled source and target data are used to train a shared-hidden-layer autoencoder, whose first-layer weights initialize the first layer of our model. Finally, the labeled source and target data are used to train the model. The effectiveness of our approach is evaluated on the INTERSPEECH 2009 Emotion Challenge 2-class problem, with ABC or Emo-DB as the source set and FAU AEC as the target set. When only a small number of labeled target-domain examples are available, the experimental results show that our approach achieves a higher unweighted average recall (UAR) than the same model without priors sharing and than traditional machine learning methods.

3) We propose an unsupervised domain adaptation method based on label supervision and feature disentangling, aiming to learn domain-invariant and emotion-discriminative features. The model is a feed-forward neural network with three main parts: a feature extractor, an emotion label predictor, and a domain label predictor. The input data are first disentangled into two parts: emotion-discriminative features and emotion-unrelated features. The emotion-discriminative features then undergo a hierarchical non-linear transformation to produce a high-level feature representation, which is used to predict the emotion and domain labels respectively. The effectiveness of our approach is evaluated on the INTERSPEECH 2009 Emotion Challenge 2-class problem, again with ABC or Emo-DB as the source set and FAU AEC as the target set.
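One common way to realize "weights of related classes are drawn from the same distribution" is a Gaussian shared prior, which turns into a regularizer pulling each related class's weight vector toward the group mean rather than independently toward zero. A minimal NumPy sketch, with illustrative names and an assumed penalty strength λ (the thesis's exact prior may differ):

```python
import numpy as np

def shared_prior_penalty(W, groups, lam=0.1):
    """Priors-sharing regularizer for a softmax classifier.
    W: (d, C) weight matrix, one column per class.
    groups: list of lists of column indices; each group holds related
    classes (e.g., a source-domain class and its target counterpart).
    Each weight vector is pulled toward its group's shared mean.
    lam and the quadratic form are illustrative assumptions."""
    penalty = 0.0
    for g in groups:
        mu = W[:, g].mean(axis=1, keepdims=True)   # shared prior mean
        penalty += np.sum((W[:, g] - mu) ** 2)
    return lam * penalty
```

Added to the softmax cross-entropy, this term lets a target-domain class with few labels inherit a sensible weight vector from its better-trained source-domain relative; with a single class per group the penalty vanishes and training reduces to the unshared case.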
Compared with traditional machine learning methods and several other DA methods, our approach achieves a higher UAR.
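The interplay of the emotion label predictor and the domain label predictor can be sketched as a domain-adversarial objective: the shared high-level features should support emotion prediction while making the two domains indistinguishable. A NumPy sketch under our own assumptions (linear predictor heads, a trade-off weight λ, and a sign flip standing in for gradient reversal; parameter names are illustrative):

```python
import numpy as np

def domain_adversarial_objective(h, y_emo, d_dom, Wy, Wd, lam=0.5):
    """Objective seen by the feature extractor: minimize the emotion
    loss while maximizing the domain loss (the minus sign plays the
    role of gradient reversal). h: (n, k) high-level features;
    y_emo: emotion labels (source samples); d_dom: 0/1 domain labels;
    Wy, Wd: (k, C) linear heads. All names are illustrative."""
    def softmax_xent(logits, labels):
        z = logits - logits.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()
    L_emo = softmax_xent(h @ Wy, y_emo)   # emotion label predictor
    L_dom = softmax_xent(h @ Wd, d_dom)   # domain label predictor
    # extractor minimizes L_emo - lam * L_dom; the domain head itself
    # would separately minimize L_dom on its own parameters
    return L_emo - lam * L_dom
```

In a full training loop the emotion loss would use labeled source samples only, while the domain loss sees unlabeled samples from both corpora; driving the domain predictor toward chance is what makes the learned features domain-invariant.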
Keywords/Search Tags:Speech emotion recognition, Feature learning, Deep learning, Transfer learning, Domain adaptation