
Research On Technologies Of Speaker Recognition Based On Sparse Decomposition

Posted on: 2018-05-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L T Xu
Full Text: PDF
GTID: 1368330566995805
Subject: Signal and Information Processing
Abstract/Summary:
Speaker recognition, also known as voiceprint recognition, is a biometric technology. Since voice is natural, unique, and easy to collect, speaker recognition has become more popular over the years and is now at the leading edge of technological development. It is a multidisciplinary research topic, closely related to psychology, physiology, digital signal processing, pattern recognition, artificial intelligence, and other fields. Speaker recognition can be applied in authentication, Internet services, communications, call centers, and so on.

This dissertation studies how to use sparse decomposition in speech signal processing, especially in speaker recognition. State-of-the-art speaker recognition already reaches an acceptable accuracy on long-duration speech, but there is still much room for improvement on short utterances. This dissertation proposes a new method to improve the accuracy of speaker recognition on short utterances. For practical applications, since the storage capacity and processing speed of mobile and embedded devices are much lower than those of large computer systems, this dissertation also proposes new methods to optimize the memory use and run-time efficiency of the speaker recognition algorithm. The main purpose of this dissertation is therefore to improve the accuracy of speaker recognition on short utterances while reducing the memory requirement and the running time. In detail, the innovations of this dissertation are as follows.

Firstly, this dissertation investigates the speaker identification task for short utterances. Based on log filter-bank energy (log FBE) features, we use a sparse dictionary model to represent the subspace of every speaker. Based on this dictionary model, we make two contributions. (1) We propose two approaches to reduce the size of the dictionary. The first approach uses an under-complete dictionary as the speaker model. The second approach employs a statistical method to reduce the dictionary size. Experimental results show that the former achieves better performance than the latter. (2) We also propose a speaker recognition system that is robust in noisy environments. Its model is built by stacking together several dictionary models, each of which is trained on noisy speech at a different SNR, so as to improve noise robustness.

Secondly, we investigate the use of sparse coding to reduce the redundancy in representing the total variability space, in order to reduce memory consumption and increase decoding speed. Two innovations are proposed based on this idea. (1) We propose two approaches to I-vector computation on the sparsely decomposed total variability matrix. The first approach computes the I-vector directly (direct computation). The second approach performs sparse decomposition followed by a diagonal approximation, so as to reduce the memory consumed in computing the posterior precision matrix (approximate computation). Experimental results show that direct computation achieves the same accuracy as the baseline system, while approximate computation achieves a speed-up of one order of magnitude at the expense of a small and tolerable reduction in verification accuracy. (2) We propose an algorithm named "eigen decomposition like factorization" (EDLF) on top of the sparse decomposition of the total variability matrix. The EDLF method is no slower than the approximate-computation system and improves the identification accuracy.

Finally, in order to reduce the memory requirement and the running time, this dissertation reformulates the estimation of the I-vector for rapid speaker recognition. The widely used I-vector extraction assumes a standard normal prior. In this dissertation, the fast I-vector computation instead replaces the standard Gaussian prior with a subspace-orthonormalizing prior. The fast method combines the subspace-orthonormalizing prior with a uniform-scaling assumption and costs 1/10 of the running time of the baseline system. Furthermore, we show that occupancy re-weighting can be accomplished in conjunction with whitening and centering as part of the pre-processing step applied to the sufficient statistics. The unified formulation is applicable to fast I-vector extraction and increases the recognition accuracy.
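
The per-speaker dictionary model described in the abstract can be illustrated with a small toy example. The sketch below is not the dissertation's implementation: it assumes that each speaker is represented by an under-complete dictionary (fewer atoms than feature dimensions), that frames are coded with a plain orthogonal matching pursuit, and that the speaker whose dictionary gives the lowest reconstruction error is chosen. The function names `omp` and `identify`, the synthetic orthonormal dictionaries, and all sizes are hypothetical.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy orthogonal matching pursuit: approximate x using at most
    n_nonzero atoms (columns) of D. Returns (coefficients, residual)."""
    residual = x.copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Refit all selected atoms by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef, residual

def identify(frames, dictionaries, n_nonzero=3):
    """Return (best speaker index, per-speaker total squared residual)."""
    errors = []
    for D in dictionaries:
        err = sum(float(np.linalg.norm(omp(D, x, n_nonzero)[1]) ** 2) for x in frames)
        errors.append(err)
    return int(np.argmin(errors)), errors

# Toy data (assumption): 20-dimensional features stand in for log FBE frames,
# and each speaker gets 8 orthonormal atoms, i.e. an under-complete dictionary.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))
D0, D1 = Q[:, :8], Q[:, 8:16]

# Ten short "frames" generated from speaker 1's atoms plus a little noise.
frames = []
for _ in range(10):
    idx = rng.choice(8, size=2, replace=False)
    x = D1[:, idx] @ rng.uniform(1.0, 2.0, size=2) + 0.01 * rng.standard_normal(20)
    frames.append(x)

best, errors = identify(frames, [D0, D1])
```

In this toy setup the two dictionaries span orthogonal subspaces, so the decision is easy; real speaker dictionaries would be learned from data and overlap considerably.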
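
For readers unfamiliar with the quantity being accelerated, the standard I-vector point estimate can be written as below. This is the textbook formulation, not one taken from the dissertation, and the symbols are assumptions for illustration: $N_c$ and $\mathbf{f}_c$ are the zeroth-order and centered first-order statistics of mixture component $c$, $\mathbf{T}_c$ is the corresponding block of the total variability matrix, and $\boldsymbol{\Sigma}_c$ is the component covariance.

```latex
% Standard I-vector estimate (posterior mean) for one utterance.
% Notation is assumed for illustration, not the dissertation's own.
\begin{aligned}
\mathbf{L} &= \mathbf{I} + \sum_{c=1}^{C} N_c\, \mathbf{T}_c^{\top} \boldsymbol{\Sigma}_c^{-1} \mathbf{T}_c, \\
\mathbf{w} &= \mathbf{L}^{-1} \sum_{c=1}^{C} \mathbf{T}_c^{\top} \boldsymbol{\Sigma}_c^{-1} \mathbf{f}_c .
\end{aligned}
```

Here $\mathbf{L}$ is the posterior precision matrix and the identity term comes from the standard normal prior. Forming and inverting $\mathbf{L}$ for every utterance dominates the cost, which is the step that the diagonal approximation and the alternative prior named in the abstract address; the abstract does not give the details of how.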
Keywords/Search Tags: Speaker recognition, Gaussian mixture model, sparse coding, universal background model, latent variability, standard normal distribution, identity vector, prior information, fast method