Study On Session Variability Modeling For Speaker Verification

Posted on: 2017-01-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L P Chen
GTID: 1108330485451555
Subject: Signal and Information Processing
Abstract/Summary:
Techniques for speaker verification on long utterances in complex channel conditions are now well developed, laying the foundation for real-world applications. Among them, total variability modeling, which is based on the Gaussian mixture model and followed by channel compensation techniques such as probabilistic linear discriminant analysis (PLDA), has become the mainstream method for speaker verification because of its simple variability modeling and effective verification performance.

The idea of total variability modeling is to first represent the session variability of an utterance with a low-dimensional vector that contains nonspeaker information (mainly channel variability) as well as speaker information. The channel compensation technique in the backend is crucial for removing the influence of nonspeaker variability on speaker comparison. The crux of total variability modeling therefore lies in how to extract the session variability and how to implement the channel compensation for better speaker comparison, which is the focus of this work. This dissertation is organized as follows.

Total variability modeling represents the session variability of an utterance as a whole, while ignoring the specific variability contained in it. In this dissertation, we first propose to extract the local session variability that cannot be modeled by the total variability model and apply it to speaker verification. In the proposed local session variability modeling, we extract the session variability contained in every single Gaussian and in every dimension of the acoustic features, respectively. We then tie the dimensions of the acoustic features for variability extraction. We find that, since the total and local session variability models focus on different aspects of the session, the variability they model is complementary. Fusing them at the system level and at the model level each yields better performance than any single model.

Total variability modeling is effective in speaker verification tasks where there is no content mismatch among the utterances used for model training, speaker enrollment, and testing, e.g., text-independent speaker verification on long utterances and text-dependent speaker verification on short utterances. When a mismatch exists, as in text-independent speaker verification on short utterances, the performance of the total variability model degrades because it cannot remove the influence of text variability on speaker comparison. To address this problem, we propose to estimate the local session variability with respect to the phonemes in an utterance, using the deep neural network based acoustic model for speech recognition. We apply monophone and triphone acoustic models, respectively, for phone-centric local variability estimation. In the backend, we select the local vectors according to the phones present in the utterances, thus solving the content-mismatch problem. We then treat the words recognized by an automatic speech recognizer as the units for local session variability estimation. With this, the exploration of content-aware session variability modeling is brought closer to completeness.

The mainstream technique for compensating nonspeaker variability is the PLDA model, a linear probabilistic model. In this dissertation, we study channel compensation techniques for the total variability model. First, we propose a speaker adaptation scoring model for PLDA that is equivalent to the existing state-of-the-art scoring model. Based on this speaker adaptation scoring model, we propose to adapt the speaker model with the prior distribution parameters instead of the posterior parameters used in conventional speaker adaptation. This solves the problems of multi-session speaker verification tasks. Furthermore, we introduce the idea of channel adaptation to the models being compared: the speaker-adapted and global PLDA models in every trial are adapted to the channel condition given by the test utterance before scoring. In this way, the specific information of each trial is taken into account, rather than scoring under a channel condition that is general to all trials. Finally, we propose to apply a deep neural network for channel compensation on total variability, replacing the existing linear and shallow models and bringing better performance.
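
As a rough illustration of what PLDA scoring of a verification trial involves, the following is a minimal sketch and not the dissertation's model. It assumes a simplified two-covariance PLDA in which the between-speaker and within-speaker covariances are diagonal, so the log-likelihood ratio decomposes over dimensions. The function name `plda_llr`, its parameters, and the toy vectors are all hypothetical.

```python
import math


def plda_llr(x1, x2, mu, between_var, within_var):
    """Log-likelihood ratio that x1 and x2 come from the same speaker.

    Assumes a diagonal two-covariance PLDA: each vector is
    speaker factor + residual, with per-dimension variances
    between_var (speaker) and within_var (residual, must be > 0).
    """
    score = 0.0
    for v1, v2, m, bv, wv in zip(x1, x2, mu, between_var, within_var):
        a1, a2 = v1 - m, v2 - m        # centered coordinates
        total = bv + wv                # variance of a single vector
        det = total * total - bv * bv  # det of [[total, bv], [bv, total]]
        # Same-speaker hypothesis: (a1, a2) jointly Gaussian with
        # covariance [[total, bv], [bv, total]].
        quad = total * a1 * a1 - 2.0 * bv * a1 * a2 + total * a2 * a2
        same = -0.5 * math.log(det) - 0.5 * quad / det
        # Different-speaker hypothesis: a1 and a2 are independent.
        diff = -math.log(total) - 0.5 * (a1 * a1 + a2 * a2) / total
        # The log(2*pi) constants are identical in both terms and cancel.
        score += same - diff
    return score


# Toy, made-up parameters and vectors for illustration only.
mu = [0.0, 0.0]
between_var = [1.0, 0.5]
within_var = [0.2, 0.3]
close_pair = plda_llr([1.0, -0.5], [0.9, -0.4], mu, between_var, within_var)
far_pair = plda_llr([1.0, -0.5], [-1.0, 0.6], mu, between_var, within_var)
```

In this sketch a pair of nearby vectors receives a higher score than a pair on opposite sides of the mean, and the score is zero when the between-speaker variance is zero, since the vectors then carry no speaker information.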
Keywords/Search Tags: speaker verification, session variability modeling, local session variability, short utterances, content-matching, PLDA scoring model, nonlinear channel compensation, deep neural network