
Text-Independent Speaker Verification Based On GMM And High-Level Information

Posted on: 2010-03-11  Degree: Doctor  Type: Dissertation
Country: China  Candidate: D X Xu
GTID: 1118360275955508  Subject: Circuits and Systems
Abstract/Summary:
To benchmark the latest technology in text-independent speaker recognition, the National Institute of Standards and Technology (NIST) conducts the Speaker Recognition Evaluation (SRE), which represents the state of the art in the field. To explore suitable solutions under different conditions, NIST SRE defines several tasks. It supplies a common evaluation criterion together with a shared corpus of telephone speech collected over multiple channels, in varied environments, and from a large number of speakers. One task is to recognize speakers from long speech samples. This task aims to exploit high-level information for text-independent speaker recognition, and it has become a focus for many research institutes abroad. High-level information is inherently text-dependent, so the key problem is how to extract features from it that can be used in text-independent recognition. This thesis studies in depth how to extract features from high-level information, such as prosody and articulatory properties, and how to use them in text-independent speaker recognition. Given the nature of the text-independent task, a text-dependent prosodic trajectory X(t), where X is some speech feature and t is time, can be regarded as a combination of suprasegmental feature units. The thesis therefore adopts a probabilistic statistical model to describe the distribution of these units and to recognize speakers. It proposes a method to extract suprasegmental features with multi-resolution wavelet analysis and applies it to both excitation and vocal tract prosodic trajectories.
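The sketch below illustrates the multi-resolution idea. It applies an orthonormal Haar wavelet to sliding windows of a frame-level contour such as F0(t). For each window it keeps the coarsest approximation coefficients, which capture the slow trend, and the detail coefficients, which capture faster variation. Several choices here are hypothetical and were made only for this example; they are not parameters taken from the thesis:

- the Haar wavelet;
- the 12-frame window and 6-frame hop;
- the two-level decomposition;
- the function names.

With these settings each window happens to yield 3 + 3 = 6 coefficients. The sketch also assumes a continuous contour, for example F0 with its unvoiced gaps interpolated.

```python
import math
from typing import List, Sequence, Tuple


def haar_step(x: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar DWT; len(x) must be even."""
    if len(x) % 2:
        raise ValueError("Haar step needs an even-length input")
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]  # low-pass
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]  # high-pass
    return approx, detail


def haar_coarsest(x: Sequence[float], levels: int) -> Tuple[List[float], List[float]]:
    """Repeatedly split the approximation branch and return the coarsest
    approximation and detail coefficients."""
    approx: List[float] = list(x)
    detail: List[float] = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
    return approx, detail


def suprasegmental_features(contour: Sequence[float], win: int = 12,
                            hop: int = 6, levels: int = 2) -> List[List[float]]:
    """Hypothetical helper: describe each window of a contour by its coarsest
    approximation (trend) plus detail (change) coefficients."""
    if win % (2 ** levels):
        raise ValueError("win must be divisible by 2**levels")
    feats = []
    for start in range(0, len(contour) - win + 1, hop):
        a, d = haar_coarsest(contour[start:start + win], levels)
        feats.append(a + d)  # 3 approximation + 3 detail values for the defaults
    return feats


if __name__ == "__main__":
    f0 = [120.0 + 10.0 * math.sin(0.3 * t) for t in range(48)]  # toy contour
    vecs = suprasegmental_features(f0)
    print(len(vecs), len(vecs[0]))  # prints 7 6
```

The same routine could be applied to each MFCC dimension separately, keeping only the approximation part, which is one way to mirror the per-dimension treatment described for PMFCC. The resulting vectors could then be modeled with a GMM.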
The approximation coefficients of the F0 trajectory F0(t), which represent its low-frequency components, and its detail coefficients, which represent its high-frequency components, together form a 6-dimensional suprasegmental prosodic feature termed PF0. MFCC is high-dimensional, but its dimensions are weakly correlated and the vocal tract changes slowly. Each MFCC dimension is therefore analyzed separately, and the resulting approximation coefficients are combined to form the vocal tract suprasegmental feature PMFCC. Experiments on the NIST SRE 2006 8side-1side task show two results:

- PF0 reduces the EER by 23.66% compared with the short-time F0 feature.
- PMFCC performs on par with the short-time spectral feature MFCC.

Because excitation and vocal tract features complement each other, the thesis also studies their combination, PMFCCF0. On the NIST SRE 2006 8side-1side task, the PMFCCF0-based system reduces the EER by 40% compared with MFCC, and experiments on the MSRA database show that PMFCCF0 is more robust. Linear fusion of the scores from the two systems improves performance further. In NIST SRE 2008, we obtained the best DET curve under the telephone-training, telephone-testing condition by linearly fusing systems based on PMFCCF0 and on short-time features.

The thesis also studies how to extract articulatory position features from spectral features and how to apply them to text-independent speaker recognition. It proposes extracting the articulatory position feature by feature-space mapping. A multi-layer perceptron (MLP) mapping network is trained on standard pronunciations from many speakers, so a single network can be shared by all speakers. The articulatory feature (AF) is then obtained by passing spectral features through this network. AF reflects how a speaker produces sounds, so it depends both on the physical properties of the speaker's articulatory organs and on his or her manner of articulation. AF therefore carries speaker information and is more robust. MFCCAF, the combination of AF and MFCC, improves both the performance and the robustness of the speaker verification system.
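The sketch below makes the EER figures and linear score fusion concrete.

- `fuse` forms a weighted sum of two systems' scores for the same trials.
- `eer` sweeps a decision threshold over the observed scores and reports the error rate where the false-rejection and false-acceptance rates are closest.

The function names, the fusion weight `w = 0.5`, and the toy scores are hypothetical. In practice scores are usually normalized first and the fusion weight is tuned on development data. The abstract does not say how its fusion weights were chosen.

```python
from typing import List, Sequence


def fuse(scores_a: Sequence[float], scores_b: Sequence[float],
         w: float = 0.5) -> List[float]:
    """Linear fusion: weighted sum of two systems' scores for the same trials."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must score the same trials")
    return [w * a + (1.0 - w) * b for a, b in zip(scores_a, scores_b)]


def eer(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Equal error rate via a threshold sweep; label 1 = target, 0 = impostor."""
    n_tar = sum(1 for lab in labels if lab == 1)
    n_non = len(labels) - n_tar
    if n_tar == 0 or n_non == 0:
        raise ValueError("need both target and impostor trials")
    best_gap, best_eer = float("inf"), 1.0
    for thr in sorted(set(scores)):
        # A trial is accepted when its score is >= thr.
        fr = sum(1 for s, lab in zip(scores, labels) if lab == 1 and s < thr) / n_tar
        fa = sum(1 for s, lab in zip(scores, labels) if lab == 0 and s >= thr) / n_non
        if abs(fr - fa) < best_gap:
            best_gap, best_eer = abs(fr - fa), (fr + fa) / 2.0
    return best_eer


if __name__ == "__main__":
    labels = [1, 1, 0, 0]          # two target trials, two impostor trials
    sys_a = [2.0, 0.0, 0.8, -1.0]  # toy scores: system A misses the 2nd target
    sys_b = [0.0, 2.0, 0.8, -1.0]  # toy scores: system B misses the 1st target
    print(eer(sys_a, labels), eer(sys_b, labels), eer(fuse(sys_a, sys_b), labels))
    # prints 0.5 0.5 0.0
```

The toy scores are built so that each system alone errs on a different target trial. Their fused score then separates targets from impostors. This is the kind of complementarity that makes fusing excitation-based and spectral systems worthwhile.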
Keywords/Search Tags: Super-segment feature, articulatory feature, text-independent, probability statistical model, speaker verification