Research On Deep Learning Based Speaker Recognition Modeling

Posted on: 2017-08-07  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y Feng  Full Text: PDF
GTID: 1318330536950953  Subject: Control theory and control engineering
Abstract/Summary:
Speaker recognition uses the characteristics of the human voice to distinguish speakers and thereby determine a speaker's identity. Because speech is readily available, speaker recognition has broad application prospects in finance, security, public security, justice, military affairs, and information services. Under complex conditions (multiple environments and multiple transmission channels), the i-vector framework, which integrates the Gaussian Mixture Model-Universal Background Model (GMM-UBM), the Total Variability Model (TVM), and Linear Discriminant Analysis (LDA), has become the mainstream technique in speaker recognition. In this framework, the Gaussian supervector obtained from the GMM-UBM describes the distribution of the speech features; the TVM uses factor analysis to reduce the high-dimensional Gaussian supervector to a low-dimensional representation of the speaker's identity, called the total variability factor (i.e., the i-vector); and LDA performs channel compensation on the i-vector by maximizing the distance between different classes and minimizing the distance within the same class. Together, these three models improve speaker recognition performance significantly.
However, both TVM and LDA rest on the assumption that speaker information and channel information are linearly separable. In practice, it is hard to separate them effectively and accurately through a linear relationship alone, which limits recognition performance under real, complex conditions. In recent years, thanks to its capacity for deep information extraction and nonlinear modeling, deep learning has been applied successfully in many areas of machine learning. To further improve the performance and robustness of text-independent speaker recognition, this dissertation introduces deep learning into the speaker recognition modeling framework.
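For reference, the linear i-vector framework summarized above is commonly written as follows. The symbols are standard notation from the i-vector literature and are assumptions here, since the abstract itself gives no formulas: M is the utterance-dependent GMM supervector, m the UBM supervector, T the low-rank total variability matrix, w the latent factor whose posterior mean is the i-vector, and S_b and S_w the between-class and within-class scatter matrices.

```latex
% Total variability model: the supervector M is a low-rank linear
% offset from the UBM supervector m, driven by the latent factor w.
\[
  M = m + T w, \qquad w \sim \mathcal{N}(0, I)
\]

% LDA: find the projection A that maximizes between-class scatter
% relative to within-class scatter of the i-vectors.
\[
  A^{*} = \arg\max_{A} \,
  \operatorname{tr}\!\left( \left( A^{\top} S_w A \right)^{-1} A^{\top} S_b A \right)
\]
```

Both steps are linear maps, which is the limitation the dissertation targets.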
Exploiting the deep nonlinear structure of deep learning models, we apply them to i-vector modeling and to channel compensation, respectively. We also evaluate and analyze their recognition performance with massive data and a large number of voiceprints. The main achievements and innovations are as follows:
1. For i-vector modeling, because linear dimension reduction has difficulty preserving the nonlinear characteristics of the original data, a Restricted Boltzmann Machine (RBM) based TVM modeling algorithm is proposed to replace the traditional i-vector model. The method assumes that the visible and hidden layers follow Gaussian or Bernoulli distributions, which yields a mathematical representation similar to the i-vector. On this basis, a Gaussian-Bernoulli or Gaussian-Gaussian RBM speaker feature vector extractor (i.e., RBM-i-vector) is proposed, which maps the high-dimensional Gaussian supervector to a low-dimensional representation by nonlinear dimension reduction. Adding an LDA module yields better performance. Evaluation also showed that the more layers the RBM network has, the better the recognition performance. Fusing the RBM-i-vector model with the traditional i-vector system further improves speaker verification performance.
2. For channel compensation, because the LDA model lacks sufficient discriminative ability, a deep neural network based nonlinear metric learning method is proposed to replace the traditional LDA model. Unlike traditional linear metric learning, the proposed method stacks RBM or Independent Subspace Analysis (ISA) layers to form a deep neural network and uses its nonlinearity to transform the i-vector into a subspace in which channel compensation is achieved.
Meanwhile, the side-information constraint used in metric learning is combined with the deep neural network. On this basis, the similarity of two utterances is computed to obtain a better discrimination result. Evaluation results showed that the proposed method effectively improves the discriminative ability of speaker recognition modeling and achieves better recognition performance.
3. Combining the two modeling methods above, a method that integrates RBM-i-vector modeling with ISA-based nonlinear metric learning is proposed (named the RBM_ISA model). The RBM_ISA model is an alternative to the traditional i-vector and LDA pipeline. It first reduces the Gaussian supervector to a low-dimensional representation, the RBM-i-vector, and then uses nonlinear metric learning to classify speaker utterances, improving the discriminative ability of the speaker recognition system. Compared with the deep learning models above and the traditional i-vector model, the RBM_ISA model achieves better speaker verification performance.
4. Because current evaluations of speaker recognition systems are mostly performed on small or medium-scale corpora, with almost none on a large-scale voiceprint corpus, we construct a large-scale corpus of about 400 k speakers. The corpus is used to evaluate the traditional i-vector modeling framework and the proposed RBM_ISA model. Speaker identification results are reported for 400 k voiceprints and for 400 k test utterances, respectively. The effect of channel mismatch on speaker recognition with massive test data is also analyzed. These evaluations provide valuable analysis and reference for real-world applications of speaker recognition.
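The nonlinear mapping and scoring steps described in the items above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. It assumes a single Gaussian-Bernoulli RBM layer with unit-variance visible units, uses random untrained weights, and substitutes plain cosine similarity for the ISA-based metric learning stage. All function names, dimensions, and data are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def rbm_project(supervectors, weights, hidden_bias):
    """Map supervectors to hidden-unit activation probabilities.

    For a Gaussian-Bernoulli RBM with unit-variance visible units,
    P(h_j = 1 | v) = sigmoid(hidden_bias_j + sum_i v_i * weights_ij).
    """
    return sigmoid(supervectors @ weights + hidden_bias)


def cosine_scores(test_vector, enrolled_vectors):
    """Cosine similarity between one test vector and every enrolled vector."""
    test_unit = test_vector / np.linalg.norm(test_vector)
    enrolled_unit = enrolled_vectors / np.linalg.norm(
        enrolled_vectors, axis=1, keepdims=True
    )
    return enrolled_unit @ test_unit


def identify(test_vector, enrolled_vectors):
    """Return the index of the best-scoring enrolled speaker."""
    return int(np.argmax(cosine_scores(test_vector, enrolled_vectors)))


# Toy data: the sizes and the random, untrained weights are hypothetical.
rng = np.random.default_rng(0)
supervector_dim, hidden_dim, num_speakers = 20, 5, 4
weights = rng.normal(scale=0.1, size=(supervector_dim, hidden_dim))
hidden_bias = np.zeros(hidden_dim)
enrolled_supervectors = rng.normal(size=(num_speakers, supervector_dim))

# Enroll every speaker, then score a probe taken from speaker index 2.
enrolled = rbm_project(enrolled_supervectors, weights, hidden_bias)
probe = rbm_project(enrolled_supervectors[2], weights, hidden_bias)
print(identify(probe, enrolled))
```

In this sketch, identification against a larger voiceprint set only changes the number of rows in `enrolled`; the scoring remains a single matrix-vector product per test utterance.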
Keywords/Search Tags: Speaker Recognition, Deep Learning, Restricted Boltzmann Machine, Independent Subspace Analysis, Metric Learning