
Research On Speech Emotion Classifier Based On Deep Learning

Posted on: 2021-07-14    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y F Xiao    Full Text: PDF
GTID: 1488306458976909    Subject: Computer Science and Technology
Abstract/Summary:
Speech emotion recognition (SER) is one of the research hotspots in speech processing and human-computer interaction. It aims to recognize the emotional state of a speaker by analyzing and classifying the emotional information contained in the speech signal, so that machines can perceive and process emotional information as humans do. To build an SER system, researchers have traditionally applied classical machine learning methods as classifiers. Recently, deep learning has been introduced into SER and has improved recognition performance. Research on deep learning-based SER models with better recognition performance is therefore of great significance for promoting the development of SER.

Compared with traditional machine learning, SER based on deep learning achieves a significant improvement in recognition performance, but it still faces several major challenges. First, owing to differences in data collection methods and speakers, the training set (source domain) and the test set (target domain) may have inconsistent data distributions, a problem known as domain mismatch, which can lead to poor generalization. Second, labeled training data are scarce. Training a deep learning-based SER model is a supervised learning process that relies on large amounts of labeled data to optimize the parameters, and a scarce labeled training set is likely to cause overfitting. Third, the models are highly complex. Deep learning requires a large number of floating-point operations and a large amount of memory to store model parameters, which makes it difficult to deploy deep models on mobile platforms with limited computing and storage resources.

In response to these problems, this dissertation uses domain adversarial training to reduce the discrepancy between the source-domain and target-domain data distributions in the feature space, thereby alleviating the domain mismatch problem. To reduce the dependence on labeled data, it uses deep semi-supervised learning models to learn the inherent distribution and discriminative features from both labeled and unlabeled data. At the same time, the deep learning model is compressed through a binarization function to reduce its complexity. The main research content and contributions of this dissertation include the following aspects:

(1) The domain adversarial neural network (DANN) can reduce the distribution discrepancy between source- and target-domain data in the feature space, so the features learned by DANN are domain-invariant. However, these features are only a mapping of the input into the feature space and are easily affected by input disturbances. This dissertation proposes two novel domain adaptation models, the generalized domain adversarial neural network (GDANN) and the class-aligned generalized domain adversarial neural network (CGDANN). They integrate a variational autoencoder into DANN so that the model focuses on learning the feature distribution: the variational inference network of the variational autoencoder serves as the feature generator, which alleviates the influence of input disturbances on the domain-invariant features and ultimately improves the generalization of the model. Unlike GDANN, CGDANN performs class alignment with additional labeled target-domain data so that the learned feature distribution approaches the target-domain category distribution. Experimental results show that the generalization of GDANN and CGDANN is improved.
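The common ingredient of DANN-style models such as GDANN and CGDANN is a gradient reversal layer that trains the feature extractor adversarially against a domain classifier. The following is a minimal, illustrative PyTorch sketch of that mechanism only; the names GradReverse, feature_net, label_clf, domain_clf, and the weighting lambda_ are hypothetical and are not taken from the dissertation.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None

def domain_adversarial_loss(feature_net, label_clf, domain_clf,
                            x_src, y_src, x_tgt, lambda_=1.0):
    # Emotion classification loss on labeled source-domain data.
    f_src = feature_net(x_src)
    cls_loss = nn.functional.cross_entropy(label_clf(f_src), y_src)

    # Domain classification loss on source (label 0) and target (label 1) features;
    # the reversed gradient pushes the feature extractor toward domain-invariant features.
    f_all = torch.cat([f_src, feature_net(x_tgt)], dim=0)
    d_labels = torch.cat([torch.zeros(len(x_src), dtype=torch.long, device=f_all.device),
                          torch.ones(len(x_tgt), dtype=torch.long, device=f_all.device)])
    d_logits = domain_clf(GradReverse.apply(f_all, lambda_))
    dom_loss = nn.functional.cross_entropy(d_logits, d_labels)
    return cls_loss + dom_loss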
(2) To reduce the dependence on labeled training data, this dissertation proposes a semi-supervised adversarial variational autoencoder (SSAVAE) for SER, which consists of a generative network and a variational inference network and can learn inherent features and emotion-category information from both labeled and unlabeled data. SSAVAE handles two cases according to whether the data are labeled. For labeled data, the input and the label are treated as observable vectors generated from a latent feature vector, whose posterior probability is learned by the variational inference network. For unlabeled data, the label is treated as an additional latent vector; the data are generated from the latent feature vector and the label vector through the generative network, and their joint posterior probability is learned by the variational inference network. The latent feature vector is shared between the two cases, and the parameters of SSAVAE are optimized with a joint objective function. Since the feature vector and the label vector are independent, the posterior probability of the label can be used directly for emotion recognition. To reduce the dependence of the feature-vector distribution on the input data, a generative adversarial network is applied to fit the probability distribution of the feature vector, which improves feature quality. Experimental results show that SSAVAE outperforms other semi-supervised learning methods and approaches the performance of supervised learning methods.

(3) This dissertation proposes a semi-supervised generative adversarial network (SSGAN) for SER, which extends the classification categories of the discriminator in a generative adversarial network. SSGAN can therefore not only learn the probability distribution of the input data but also perform emotion recognition, reducing the dependence on labeled training data. However, when the input is perturbed slightly along the adversarial direction, the model may produce a wrong classification result. To address this problem, this dissertation proposes a smooth semi-supervised generative adversarial network (SSSGAN) and a virtual smooth semi-supervised generative adversarial network (VSSSGAN), which smooth the adversarial direction through adversarial training; as a result, the labels of adversarial samples are corrected, which improves robustness. Among them, VSSSGAN uses virtual labels for smoothing, further reducing the dependence on label information. Experimental results show that the robustness of the smoothed semi-supervised models is improved.

(4) To reduce the complexity of deep learning models for SER, this dissertation proposes a compressed SER model based on binarization, named the binary convolutional recurrent neural network (BCRNN). BCRNN converts the real-valued inputs and weights of a convolutional recurrent neural network (CRNN) into single-bit values of -1/+1 through a binarization function, which decreases the memory required to store the model. In addition, the complex convolution operations are replaced by faster bitwise XOR operations that require little computation. To alleviate the information loss caused by binarization, the model introduces a scale factor that makes the binary values approximate the corresponding real values. Theoretical analysis shows that the storage space of BCRNN is 1/8 that of CRNN, and experimental results show that BCRNN achieves a large model compression rate with impressive recognition performance.
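For the adversarial smoothing in contribution (3), the abstract does not give implementation details. The sketch below shows one standard way such a "virtual" smoothing term is computed (virtual-adversarial-training style), using the model's own predictions as virtual labels; the function name, the hyperparameters xi and eps, and the power-iteration shortcut are illustrative assumptions, not the dissertation's exact SSSGAN/VSSSGAN formulation.

import torch
import torch.nn.functional as F

def virtual_adversarial_loss(model, x, xi=1e-6, eps=2.5):
    """Encourage the classifier to be smooth along its most sensitive (virtual adversarial) direction."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)          # current predictions serve as virtual labels

    # Estimate the adversarial direction with one power-iteration step.
    d = torch.randn_like(x)
    d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(x)

    # Smoothing loss: predictions at x + r_adv should match the virtual labels at x.
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p, reduction="batchmean")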
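For the binarization with a scale factor in contribution (4), a minimal sketch of the usual formulation is shown below: weights are binarized to alpha * sign(w) with alpha = mean(|w|), and a straight-through estimator passes gradients during training. The class names and the use of a floating-point convolution to simulate the binary convolution are illustrative assumptions; the actual bitwise speedup described in the dissertation would require dedicated XOR/popcount kernels.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeWithScale(torch.autograd.Function):
    """Binarize a real-valued tensor to alpha * sign(w), where alpha = mean(|w|)."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()
        return alpha * torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient where |w| <= 1, block it elsewhere.
        return grad_output * (w.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    """Convolution whose weights are binarized (with a scale factor) at every forward pass."""
    def forward(self, x):
        w_bin = BinarizeWithScale.apply(self.weight)
        return F.conv2d(x, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)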
Keywords/Search Tags: domain adaptation, semi-supervised learning, model compression, deep learning, speech emotion recognition