
Probabilistic Modeling Of Emotion Reconstruction For Speaker Recognition

Posted on: 2017-03-01 | Degree: Master | Type: Thesis
Country: China | Candidate: H Chen | Full Text: PDF
GTID: 2308330482481776 | Subject: Computer Science and Technology
Abstract/Summary:
This thesis is dedicated to solving the emotion mismatch problem in speaker recognition, both by explaining the underlying theory and by summarizing the empirical facts. I show that the effect of emotion on speaker features differs from ordinary channel effects, and explain why state-of-the-art speaker recognition algorithms cannot handle emotion mismatch. A statistical inference system based on probabilistic models is proposed. Previous emotion model reconstruction theory is also improved, and its results on the MASC corpus surpass those of other known algorithms. To address the overfitting problem of joint factor analysis models, a fully Bayesian training procedure is devised and a nonparametric Indian Buffet Process model is proposed, together with the corresponding iterative training and sampling algorithms. Without losing precision, the number of parameters is reduced to 30% of that of the original model.

The major contributions of this paper are as follows:

1. Explains the theory and summarizes the empirical facts of emotion mismatch. There is still no generalized model that describes how emotion changes the features of different speakers under different emotions. The emotion effect is highly non-linear with respect to voice entries (phonemes), speaker identities, emotion characteristics and so on. Unlike ordinary channel mismatch, there is no way to describe identity and emotion in separate subspaces and then add them up to obtain the complete feature. Because of the scarcity of data, it is infeasible to use dimension reduction techniques such as LDA to extract emotion-independent features. Remarkably, the emotion transform preserves neighborhoods quite well: speakers whose features are similar in one emotion are, with high probability, also similar in other emotions.

2. Formalizes the probabilistic model of speaker recognition and proposes a new classification model based on statistical distance. Contemporary Universal Background Model techniques rely on Gaussian Mixture Models with a very high number of mixture components; such models require a large amount of enrollment data and relatively similar lengths of training and testing data. Using Bayesian inference with conjugate priors of exponential families, we can estimate the generative model of the speech data without the EM iterations of GMM training, which gives an advantage in computational simplicity. In addition, with the help of model selection techniques such as AIC and BIC, we develop a set of statistical distance measures to evaluate model similarity; these measures are robust to recording length and text content, which would otherwise introduce session variability (a sketch of such a BIC-based distance follows this list).

3. Devises a new emotional model reconstruction algorithm based on manifold learning, which improves on nearest-neighbor reconstruction. Exploiting the neighborhood-preserving property of voice features, we use the emotional recordings of enrollment speakers similar to the training target to reconstruct the target's corresponding emotional model. I introduce an optimal neighborhood reconstruction algorithm that solves a constrained second-order optimization problem and reconstructs the emotional model from the enrollment speaker models (a sketch of this weight-solving step follows this list).
This reconstruction has good invariance properties under several kinds of transformations, making the combination scheme considerably stable across different emotion spaces, so it can be used to reconstruct the training target's models in emotions other than the one in which the combination is learned.

4. Proposes a fully Bayesian estimation of joint factor analysis models and devises a nonparametric model to improve robustness. Joint factor analysis models are typically solved by the EM algorithm, which alternates between estimating the latent factor vector and maximizing the likelihood with respect to the factor loading matrix. Since there is no constraint on the sparsity of the parameters, and the factor loading matrix has far more parameters than the factor vector, the model is very prone to overfitting, driving the factor vector arbitrarily close to zero. By placing a prior on the factor loading matrix, we can estimate the model with a coordinate descent algorithm; empirical studies show that this effectively alleviates the overfitting problem. Based on this result, I propose a new nonparametric joint factor model of GMMs based on the Indian Buffet Process, which automatically learns the proper speaker dimensionality and leaves out Gaussian component parameters irrelevant to particular speakers, thereby increasing robustness (an IBP sampling sketch follows this list). With this method we achieve the same results as the original JFA models with only 30% of the parameters.
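To make the statistical-distance idea in contribution 2 concrete, the following minimal sketch (my own illustration, not code from the thesis) scores two sets of feature vectors with a BIC-based model-selection criterion under single full-covariance Gaussians; the function names and the single-Gaussian simplification are assumptions standing in for the conjugate exponential-family models described above.

import numpy as np

def gaussian_log_likelihood(x, mean, cov):
    # Total log-likelihood of the rows of x under a full-covariance Gaussian.
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return -0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))

def delta_bic(x1, x2, penalty=1.0):
    # BIC comparison of "one shared Gaussian" vs. "two separate Gaussians":
    # larger values indicate the two feature sets are better explained by
    # different speaker models, i.e. a larger statistical distance.
    x = np.vstack([x1, x2])
    n, d = x.shape
    eye = 1e-6 * np.eye(d)
    ll_joint = gaussian_log_likelihood(x, x.mean(0), np.cov(x.T) + eye)
    ll_split = (gaussian_log_likelihood(x1, x1.mean(0), np.cov(x1.T) + eye)
                + gaussian_log_likelihood(x2, x2.mean(0), np.cov(x2.T) + eye))
    k = d + d * (d + 1) / 2          # extra mean and covariance parameters of the split model
    return ll_split - ll_joint - 0.5 * penalty * k * np.log(n)

Because the criterion compares models rather than raw frame counts, it tolerates mismatched recording lengths better than likelihood scores alone, which is the robustness property the abstract refers to.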
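The weight-solving step of contribution 3 can be illustrated with an LLE-style closed form. This is a hypothetical sketch of one way to solve the constrained second-order problem, assuming the weights are learned over fixed-dimensional speaker model vectors and then reused in the target emotion; the actual algorithm in the thesis may differ in details.

import numpy as np

def reconstruction_weights(target, neighbors, reg=1e-3):
    # Solve min_w || target - sum_j w_j * neighbors[j] ||^2  s.t.  sum_j w_j = 1,
    # in closed form via the local Gram matrix, as in locally linear embedding.
    diff = neighbors - target                               # (k, d) differences
    gram = diff @ diff.T                                    # (k, k) local Gram matrix
    gram += reg * np.trace(gram) * np.eye(len(neighbors))   # regularize for stability
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()                                      # enforce the sum-to-one constraint

def reconstruct_emotional_model(weights, neighbor_emotional_models):
    # Reuse the weights learned in the enrollment emotion to combine the same
    # neighbors' models in the target emotion.
    return weights @ np.asarray(neighbor_emotional_models)

Because the weights depend only on differences between the target and its neighbors and are constrained to sum to one, they are unchanged under translations, rotations and uniform scalings of the feature space, which is the kind of invariance that keeps the combination stable across emotion spaces.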
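For contribution 4, the key nonparametric ingredient is the Indian Buffet Process prior on which latent factors each speaker activates. The sketch below is illustrative only: the parameter alpha is a generic concentration value, and the full model in the thesis would couple this prior with the GMM factor loadings and the Metropolis-Hastings / coordinate-descent inference mentioned above. It simply draws a factor-allocation matrix from the IBP to show that the factor count is inferred rather than fixed in advance.

import numpy as np

def sample_ibp(num_speakers, alpha=2.0, seed=None):
    # Draw a binary factor-allocation matrix Z from the Indian Buffet Process.
    # The number of columns (active factors) grows with the data, which is what
    # lets the model learn the speaker dimensionality automatically.
    rng = np.random.default_rng(seed)
    Z = np.zeros((num_speakers, 0), dtype=int)
    for i in range(num_speakers):
        if Z.shape[1] > 0:
            # Reuse an existing factor k with probability m_k / (i + 1).
            probs = Z[:i].sum(axis=0) / (i + 1)
            Z[i] = rng.random(Z.shape[1]) < probs
        # Activate Poisson(alpha / (i + 1)) brand-new factors for this speaker.
        new = rng.poisson(alpha / (i + 1))
        if new > 0:
            block = np.zeros((num_speakers, new), dtype=int)
            block[i] = 1
            Z = np.hstack([Z, block])
    return Z

Z = sample_ibp(10)
print(Z.shape[1])   # the number of active factors is inferred, not pre-set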
Keywords/Search Tags: Emotional Speaker Recognition, Channel Mismatch, Bayesian Statistics, Nonparametric Models, Metropolis-Hastings Sampling, Joint Factor Analysis, Model Selection, Statistical Distance, Exponential Family, Novelty Detection