
Effective data selection technology for robust speaker recognition

Posted on: 2013-03-13
Degree: Ph.D
Type: Dissertation
University: The University of Texas at Dallas
Candidate: Suh, Jun-Won
GTID: 1458390008483924
Subject: Engineering

Abstract/Summary:
Speaker recognition performance degrades when only limited amounts of training and test data are available. Classification accuracy is further reduced when sparse training and test data are combined with nonstationary background noise. Methods such as MAP and Eigenvoice adaptation can address short-duration (10 sec) training data for speaker recognition, but even more effective adaptation strategies are needed when only half that amount of training data (5 sec) and 2 ∼ 6 sec of test data are available. In this study, the limited training data is analyzed with a Gaussian Mixture Tagger (GMT) in order to estimate a complete acoustic information inventory. With this knowledge, this dissertation proposes two speaker adaptation schemes, TDBU and MRNC. TDBU separates the speaker model into "Top" and "Bottom" models: the method reinforces the available training data for the "Top" and fills the acoustic holes in the "Bottom" by borrowing data from acoustically similar speakers. MRNC balances the cohort speakers' data to represent conversational speech, so that the MRNC speaker model covers the acoustic space more naturally. As the second contribution, a leveraged speaker and background noise system is proposed to address noise present in the sparse data. Under the assumption that a speaker produces speech in a consistent environment, the proposed system analyzes the background noise together with the speaker information to balance the output evaluation results. TDBU achieves a relative +14% EER improvement over traditional MAP adaptation, and MRNC achieves a relative +16% EER improvement over MAP and Eigenvoice using only 5 sec of training and 6 sec of test data from the Fisher corpus. Results with the in-vehicle CU-Move corpus show similar performance improvements for both speaker modeling methods.
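The "Top"/"Bottom" split described above can be illustrated with a rough sketch. The dissertation's actual GMT and TDBU procedures are not reproduced here; the names, dimensions, occupancy threshold, and hard-assignment simplification below are all illustrative assumptions, showing only the general idea of detecting under-observed mixture components and filling them with frames borrowed from a cohort speaker:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: an 8-component speaker model trained on ~5 sec of
# speech. Component occupancy counts reveal which regions of acoustic
# space the limited data actually covered.
n_components, dim = 8, 2
means = rng.normal(size=(n_components, dim))

def occupancy(frames, means):
    """Hard-assign frames to the nearest component and count per component."""
    d = np.linalg.norm(frames[:, None, :] - means[None, :, :], axis=2)
    return np.bincount(d.argmin(axis=1), minlength=len(means))

# Sparse training data that only covers part of the acoustic space.
train_frames = rng.normal(loc=means[:3].mean(axis=0), size=(500, dim))
counts = occupancy(train_frames, means)

# "Top" components are well observed; "Bottom" components are acoustic holes.
threshold = 0.05 * counts.sum() / n_components  # assumed cutoff, not from the thesis
top = np.flatnonzero(counts >= threshold)
bottom = np.flatnonzero(counts < threshold)

# Fill the holes by borrowing frames from an acoustically similar cohort
# speaker: keep only cohort frames whose nearest component is a hole.
cohort_frames = rng.normal(size=(1000, dim))
d = np.linalg.norm(cohort_frames[:, None, :] - means[None, :, :], axis=2)
borrowed = cohort_frames[np.isin(d.argmin(axis=1), bottom)]

augmented = np.vstack([train_frames, borrowed])
```

In this simplified view, the "Top" components would be adapted with the speaker's own reinforced data, while the "Bottom" components are re-estimated from the borrowed cohort frames.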
The noise leverage system improves performance by a relative +27% EER using 10 sec of train/test data on the NIST SRE evaluation.

For the third contribution, advancements in negative speaker selection are addressed. A discriminatively trained model such as the Support Vector Machine (SVM) requires effective selection of negative examples to reach optimum classification performance. An SVM system cannot achieve optimum performance on unseen evaluation data when it is designed around a fixed-size set of negative examples drawn from development data. To address this, the ErrDiff method is proposed. Evaluated within the NIST SRE corpus framework, ErrDiff achieves a relative +6% improvement over the best SVM baseline system without a hard configuration of the system. The strength of this advancement is the consistency between development testing and open testing results.

The main contributions of this dissertation, TDBU and MRNC, impact speaker modeling with limited-duration train/test sets for speaker recognition. The leveraged noise system can be applied in many speech applications exposed to noisy environments. Finally, ErrDiff can impact SVM-based speaker and language ID problems as well as text classification. These contributions provide meaningful steps toward improving speech, speaker, and language identification systems for human-machine interaction.
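The exact ErrDiff criterion is not spelled out in this abstract. As a heavily simplified sketch, under assumed data, a toy nearest-mean scorer in place of an SVM, and an invented error-difference ranking, the general idea of scoring candidate negative examples by the development-set error change their inclusion produces might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: target-speaker vectors, a labeled development set, and
# a large pool of candidate negative (impostor) examples. All dimensions
# and distributions are illustrative.
dim = 4
target = rng.normal(loc=1.0, size=(40, dim))
dev_pos = rng.normal(loc=1.0, size=(30, dim))
dev_neg = rng.normal(loc=-1.0, size=(30, dim))
pool = rng.normal(loc=-0.5, size=(200, dim))

def train_and_error(negatives):
    """Train a trivial nearest-mean scorer and return the dev error rate."""
    w = target.mean(axis=0) - negatives.mean(axis=0)
    b = -0.5 * (target.mean(axis=0) + negatives.mean(axis=0)) @ w
    score = lambda x: x @ w + b
    errs = np.sum(score(dev_pos) <= 0) + np.sum(score(dev_neg) > 0)
    return errs / (len(dev_pos) + len(dev_neg))

base = train_and_error(pool[:20])  # fixed-size baseline negative set

# Error-difference ranking: score each remaining candidate by the change in
# dev error its inclusion produces, then keep the most error-reducing ones.
diffs = np.array([
    train_and_error(np.vstack([pool[:20], c[None]])) - base
    for c in pool[20:]
])
selected = pool[20:][np.argsort(diffs)[:10]]
```

The point of the sketch is only the selection loop: rather than freezing one negative set chosen on development data, candidates are ranked by their measured effect on error, which is what allows the selection to track the evaluation conditions.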
Keywords/Search Tags: Speaker, Data, Recognition, System, Train, Classification, Performance, MRNC