Research On Universal Background Model And Preliminary Study On Deep Learning In Speaker Recognition

Posted on: 2020-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: W X Mei
GTID: 2428330572496506
Subject: Computer technology
Abstract/Summary:
Speaker recognition is an important research direction in speech signal processing. Its main purpose is to automatically determine a speaker's identity from his or her voice, and it is widely used in banking, public security systems, and smart homes. Current mainstream algorithms are based on probabilistic models. The GMM-UBM model performs well when the background corpus is sufficient and only a single channel is involved; in practical applications, however, noise and channel mismatch cause its performance to drop dramatically. The i-vector method was proposed to address these problems to some extent. Both algorithms are built on the GMM-UBM and have achieved good results in the NIST evaluations, but some problems remain unsolved. On the one hand, training the universal background model (UBM) requires substantial computing resources, which makes rapid deployment in new environments difficult. On the other hand, the theoretical basis of UBM training has not been studied in depth; the usual practice is simply to collect data from as many different speakers as possible so as to fill the feature space as fully as possible.

This paper focuses on text-independent speaker verification and on the problem of selecting the corpus for the universal background model. The main work and innovations are as follows:

Firstly, speaker verification systems based on the GMM-UBM model and on the i-vector/PLDA method are constructed. Feature preprocessing, UBM training, MAP adaptation, estimation of the i-vector total variability matrix, and PLDA-based scoring are described in detail, and the effects of the GMM model order and the MFCC feature dimension on system performance are discussed. The experimental results show that the systems constructed in this paper reach the performance of mainstream open-source implementations.

Secondly, an N-support speaker selection algorithm based on GMM supervector clustering is proposed. The core idea is to select speakers whose speech feature distributions differ from one another as much as possible, so that together they cover the entire feature space. To this end, a separate GMM is trained on the data of each background speaker, its GMM supervector is used to approximate that speaker's feature distribution, and clustering algorithms are then used to find the set of speakers that are farthest apart from each other. Experiments show that, on the AISHELL, MASC, and TIMIT datasets, the algorithm needs only 8.807%, 8.6%, and 4.3% of the benchmark speaker corpus, respectively, to build a UBM with baseline performance.

Thirdly, a background speaker corpus selection algorithm based on the GMM token ratio is implemented. Another approach to UBM data selection is to screen data directly at the frame level. The current mainstream algorithm is the IFS (Intelligent Feature Selection) algorithm proposed by Hansen et al., which dynamically estimates the probability distribution of Euclidean distances between frames of the background corpus. This paper takes a different approach: starting from GMM tokens, which reflect phonetic characteristics, it proposes a background corpus selection algorithm based on token ratios. Experiments show that, on the AISHELL, MASC, and TIMIT datasets, the algorithm needs only 18.1%, 10.0%, and 9.1% of the benchmark speaker corpus, respectively, to build a UBM with baseline performance.

Fourthly, a speaker identification system based on the Mel spectrogram and a convolutional neural network is constructed. Mainstream speaker verification methods currently rely on hand-crafted features such as MFCCs. This paper instead proposes a speaker identification system based on a convolutional neural network that takes the Mel spectrogram directly as its input. The experimental results show that as the amount of training data increases, the performance of this system gradually approaches and then exceeds that of the traditional probabilistic model. Specifically, on the MASC corpus, the identification rate (IR) of the method reaches 90% when the ratio of training data to test data is 8:2, and 95.7% when the ratio is 9:1, exceeding the identification rate of the GMM-UBM system.
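The GMM-UBM pipeline described above rests on two steps: MAP adaptation of the UBM to a target speaker, and scoring a test utterance against both models. The sketch below shows one common form of each, relevance-MAP adaptation of the means only and the average frame log-likelihood ratio (LLR). It is an illustration under stated assumptions, not the thesis's implementation: diagonal covariances, a relevance factor of 16 (a common choice, not taken from the thesis), and a hand-set two-component 2-D "UBM" in place of one trained by EM on MFCCs. All function names are hypothetical. Concatenating the adapted means gives the GMM supervector that the speaker-selection idea relies on.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def weighted_logliks(x, weights, means, variances):
    """log(w_c) + log N(x; mu_c, Sigma_c) for every component c."""
    return [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means; weights and variances are kept."""
    n_comp, dim = len(means), len(means[0])
    occupancy = [0.0] * n_comp                        # zeroth-order stats N_c
    first = [[0.0] * dim for _ in range(n_comp)]      # first-order stats F_c
    for x in frames:
        terms = weighted_logliks(x, weights, means, variances)
        total = logsumexp(terms)
        for c in range(n_comp):
            gamma = math.exp(terms[c] - total)        # posterior P(c | x)
            occupancy[c] += gamma
            for d in range(dim):
                first[c][d] += gamma * x[d]
    adapted = []
    for c in range(n_comp):
        alpha = occupancy[c] / (occupancy[c] + relevance)  # data-dependent weight
        if occupancy[c] > 0.0:
            data_mean = [first[c][d] / occupancy[c] for d in range(dim)]
        else:
            data_mean = list(means[c])
        adapted.append([alpha * data_mean[d] + (1.0 - alpha) * means[c][d]
                        for d in range(dim)])
    return adapted

def average_llr(frames, weights, speaker_means, ubm_means, variances):
    """Mean over frames of log p(x | speaker GMM) - log p(x | UBM)."""
    total = 0.0
    for x in frames:
        total += logsumexp(weighted_logliks(x, weights, speaker_means, variances))
        total -= logsumexp(weighted_logliks(x, weights, ubm_means, variances))
    return total / len(frames)

# Toy usage: a hand-set 2-component UBM over 2-D "features" (assumed values).
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [5.0, 5.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]
enroll = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
speaker_means = map_adapt_means(enroll, ubm_weights, ubm_means, ubm_vars)
supervector = [v for mean in speaker_means for v in mean]  # concatenated means
```

With only three enrollment frames, the first component moves a short way toward the speaker's data and the second barely moves, which is the intended behaviour of relevance MAP. A positive average LLR favours the claimed speaker; a real system would tune the decision threshold on development data.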
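The N-support algorithm aims to pick background speakers whose supervectors lie far apart. The abstract does not say which clustering algorithm it uses, so the sketch below substitutes greedy farthest-point (max-min) selection. This is a different and simpler technique, used here only to illustrate the "maximally spread set" objective. The supervectors are hypothetical 4-dimensional toy vectors; in practice each would be the concatenated means of one speaker's GMM. Function and speaker names are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_selection(supervectors, k, start=None):
    """Greedy max-min selection of k speakers from {speaker_id: supervector}.

    Each step adds the speaker whose distance to its nearest already-selected
    speaker is largest, so the chosen set is spread out in supervector space.
    """
    ids = list(supervectors)
    first = ids[0] if start is None else start
    selected = [first]
    # Distance from every speaker to its closest selected speaker.
    nearest = {s: euclidean(supervectors[s], supervectors[first]) for s in ids}
    while len(selected) < min(k, len(ids)):
        candidates = [s for s in ids if s not in selected]
        best = max(candidates, key=lambda s: nearest[s])
        selected.append(best)
        for s in ids:
            nearest[s] = min(nearest[s], euclidean(supervectors[s], supervectors[best]))
    return selected

# Hypothetical toy supervectors: two near-duplicate pairs and one outlier.
toy = {
    "spk_a": [0.0, 0.0, 0.0, 0.0],
    "spk_b": [0.1, 0.0, 0.0, 0.0],
    "spk_c": [5.0, 5.0, 0.0, 0.0],
    "spk_d": [5.0, 5.1, 0.0, 0.0],
    "spk_e": [0.0, 0.0, 6.0, 6.0],
}
chosen = farthest_point_selection(toy, 3)
```

Here the result keeps one speaker from each near-duplicate pair (a/b and c/d) plus the outlier. Redundant speakers are skipped, and that saving is what allows a UBM to be trained on a small fraction of the corpus.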
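For the frame-level approach, a GMM token is commonly defined as the index of the best-scoring Gaussian component for a frame. The token ratio is then the fraction of frames mapped to each component. The sketch below assumes that definition and computes both quantities for a hypothetical one-dimensional, three-component GMM. The abstract does not state how the thesis turns token ratios into a selection decision, so that step is deliberately not shown.

```python
import math
from collections import Counter

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_tokens(frames, weights, means, variances):
    """Map each frame to the index of its best-scoring weighted component."""
    tokens = []
    for x in frames:
        scores = [math.log(w) + log_gauss_diag(x, m, v)
                  for w, m, v in zip(weights, means, variances)]
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens

def token_ratio(tokens, n_components):
    """Fraction of frames assigned to each component (sums to 1)."""
    counts = Counter(tokens)
    return [counts[c] / len(tokens) for c in range(n_components)]

# Toy 1-D GMM with three well-separated components (assumed values).
weights = [1 / 3, 1 / 3, 1 / 3]
means = [[0.0], [5.0], [10.0]]
variances = [[1.0], [1.0], [1.0]]
frames = [[0.1], [0.2], [4.9], [10.3]]
tokens = gmm_tokens(frames, weights, means, variances)
ratios = token_ratio(tokens, len(means))
```

Because UBM components tend to align with phone-like regions of the feature space, a token-ratio vector gives a rough phonetic profile of a set of frames. That profile is what a token-based selection criterion can reason about.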
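The CNN system replaces hand-crafted MFCCs with a Mel spectrogram. The following is a minimal, dependency-free sketch of that kind of input. The log compression, the Hamming window, the HTK-style Mel formula, and the default settings (16 kHz, 25 ms window, 10 ms hop, 512-point DFT, 40 bands) are all assumptions, not the thesis's configuration. A naive O(n²) DFT stands in for an FFT to keep the example self-contained, so it is slow on real recordings.

```python
import math

def hz_to_mel(f):
    """HTK-style Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters, one row per Mel band, over n_fft // 2 + 1 bins."""
    top = hz_to_mel(sr / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sr / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for i in range(1, n_mels + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        row = []
        for f in bin_freqs:
            if lo <= f <= mid:
                row.append((f - lo) / (mid - lo))      # rising edge
            elif mid < f <= hi:
                row.append((hi - f) / (hi - mid))      # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def power_spectrum(frame, n_fft):
    """|DFT|^2 of the zero-padded frame via a naive O(n_fft^2) DFT."""
    padded = frame + [0.0] * (n_fft - len(frame))
    spectrum = []
    for k in range(n_fft // 2 + 1):
        re = sum(x * math.cos(2.0 * math.pi * k * n / n_fft) for n, x in enumerate(padded))
        im = sum(x * math.sin(2.0 * math.pi * k * n / n_fft) for n, x in enumerate(padded))
        spectrum.append(re * re + im * im)
    return spectrum

def log_mel_spectrogram(signal, sr=16000, win=400, hop=160, n_fft=512, n_mels=40):
    """Return a list of frames, each a list of n_mels log Mel-band energies."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (win - 1)) for n in range(win)]
    bank = mel_filterbank(n_mels, n_fft, sr)
    features = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + win], window)]
        spectrum = power_spectrum(frame, n_fft)
        features.append([math.log(sum(b * p for b, p in zip(row, spectrum)) + 1e-10)
                         for row in bank])
    return features

# Toy usage: 32 ms of a 1 kHz tone at 8 kHz with small, fast settings.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / 8000.0) for n in range(256)]
feats = log_mel_spectrogram(tone, sr=8000, win=64, hop=32, n_fft=64, n_mels=8)
```

The result is a frames × bands matrix that a 2-D convolutional network can treat like a single-channel image. For a pure tone, the energy concentrates in the band whose triangle covers 1 kHz.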
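The reported identification rate (IR) is assumed here to follow the usual closed-set definition: the share of test utterances for which the top-scoring enrolled speaker is the true speaker. The sketch below computes it from a score matrix. The scores could be CNN softmax outputs or GMM log-likelihoods, and the values shown are hypothetical.

```python
def identification_rate(score_rows, true_labels, speaker_ids):
    """Closed-set IR: share of test utterances whose top-scoring speaker is correct.

    score_rows[i][j] is the score of test utterance i against enrolled speaker
    speaker_ids[j] (for example a softmax output or a GMM log-likelihood).
    """
    correct = 0
    for scores, truth in zip(score_rows, true_labels):
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += speaker_ids[best] == truth
    return correct / len(true_labels)

# Hypothetical scores for four test utterances against three enrolled speakers.
speakers = ["spk1", "spk2", "spk3"]
scores = [[0.90, 0.05, 0.05], [0.20, 0.70, 0.10], [0.60, 0.30, 0.10], [0.10, 0.10, 0.80]]
truth = ["spk1", "spk2", "spk2", "spk3"]
ir = identification_rate(scores, truth, speakers)  # 3 of 4 correct -> 0.75
```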
Keywords/Search Tags: Speaker Recognition, Universal Background Speaker Data Selection, GMM Token, Mel Spectrogram, ConvNets