
Research On Key Algorithms Of Speaker Recognition Based On Deep Learning

Posted on: 2021-03-25    Degree: Master    Type: Thesis
Country: China    Candidate: W J Zhang    Full Text: PDF
GTID: 2518306050454464    Subject: Communication and Information System
Abstract/Summary:
Voice is one of the most direct and natural channels of human-computer interaction. Speaker recognition is a biometric technology that distinguishes speakers by their voices and thereby realizes identity authentication. Deep learning methods have powerful feature-abstraction and data-modeling abilities and have achieved great success in image recognition, machine translation and many other fields. This thesis studies the key algorithms of speaker recognition with the aim of improving the overall performance of the speaker recognition system. The most mainstream acoustic feature, Mel-Frequency Cepstral Coefficients (MFCC), is used throughout. Combined with deep learning theory, the thesis covers three topics: traditional speaker recognition based on the Gaussian Mixture Model (GMM), classification-based speaker recognition using deep learning, and encoding-based speaker recognition using deep features.

GMM is currently the most widely used traditional speaker recognition algorithm. This thesis builds a traditional speaker recognition system based on GMM and tests it on our laboratory speech database and on the Aishell speech database. On Aishell, the variation of recognition accuracy is further examined under different numbers of training samples, different numbers of speakers, and different noise environments.

Because the non-linear data-modeling ability of GMM is limited, this thesis replaces it with deep learning models. Specifically, two network structures, Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, are tried for speaker classification. Dropout and Batch Normalization layers are used during training to prevent overfitting and improve the generalization ability of the networks. Compared with the GMM algorithm, the CNN performs better with a small training set and a large number of speakers to be recognized, while the LSTM has better noise robustness. On this basis, a score-fusion speaker recognition algorithm is proposed: the input data are fed into the two recognition networks in parallel, and the final output is obtained by taking the arithmetic average of the two networks' normalized scores. Experiments verify that the score-fusion algorithm performs best under all test conditions.

To handle the uncertain number of speakers to be recognized and the insufficient corpus of target speakers in practical applications, the x-vector framework is selected to implement the encoding-based speaker recognition algorithm using deep features, and the trained neural network serves as an encoder that transforms MFCC features into identity-encoding vectors. The original x-vector framework aggregates frame-level statistics with an average pooling layer and treats all frames as equally important, which is not reasonable. This thesis therefore improves the statistics pooling layer with an attention mechanism: by computing a weighted mean and a weighted standard deviation, better experimental results are obtained.
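To make the improved pooling concrete, the following is a minimal PyTorch-style sketch of attention-based statistics pooling. It is an illustrative reading of the abstract, not the thesis's actual code; the module name AttentiveStatsPooling, the hidden size, and the tensor layout (batch, channels, frames) are assumptions. A small scoring network assigns a weight to each frame, and the layer outputs the concatenation of the weighted mean and the weighted standard deviation.

# Sketch only: attention-based statistics pooling, assuming PyTorch and
# frame-level features shaped (batch, channels, frames).
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Replace plain average pooling with a learned per-frame weighting,
    then pool a weighted mean and a weighted standard deviation."""

    def __init__(self, channels, hidden=128):
        super().__init__()
        # Small scoring network that assigns one attention logit per frame.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, channels, frames) frame-level features from the encoder
        w = torch.softmax(self.attention(x), dim=2)       # (batch, 1, frames)
        mean = torch.sum(w * x, dim=2)                     # weighted mean
        var = torch.sum(w * x ** 2, dim=2) - mean ** 2     # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))              # weighted std
        return torch.cat([mean, std], dim=1)               # (batch, 2 * channels)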
The standard softmax loss is generally good at enlarging inter-class differences but poor at reducing intra-class differences. This thesis introduces three new loss functions to address this problem and experimentally explores how their hyperparameter values affect algorithm performance, so as to guide parameter settings in future work. The results show that the discriminative ability is greatly improved with the three new loss functions, and that AM-Softmax and Arc-Softmax losses perform slightly better than A-Softmax loss.
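For reference, the following is a minimal PyTorch-style sketch of the additive-margin softmax (AM-Softmax) loss mentioned above. It is an illustrative sketch rather than the thesis's implementation; the margin 0.2 and scale 30.0 are hypothetical defaults, not the hyperparameter values explored in the thesis.

# Sketch only: AM-Softmax loss, assuming PyTorch; margin/scale are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSoftmaxLoss(nn.Module):
    def __init__(self, embedding_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.margin = margin
        self.scale = scale
        # Class weight vectors, L2-normalized at use time.
        self.weight = nn.Parameter(torch.randn(num_speakers, embedding_dim))

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin only from the target-class cosine, then rescale.
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.scale * (cosine - self.margin * one_hot)
        return F.cross_entropy(logits, labels)

The other two losses differ only in where the margin enters: Arc-Softmax adds it to the angle, cos(theta + m), while A-Softmax multiplies the angle, cos(m * theta).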
Keywords/Search Tags:Speaker Recognition, Deep Learning, MFCC, Score Fusion, Deep Features