This dissertation focuses on the time-varying issue in speaker recognition and explores time-varying robustness. The major efforts and contributions are:
1. A proper longitudinal voiceprint database that specifically focuses on the time-varying issue. After analyzing existing speech databases with the time-varying attribute, we designed and created a fixed-text read speech database with 16 recording sessions over a time span of 3 years. Since the time-varying effect was the only focus, other factors, such as recording equipment, software, conditions and environment, were kept as constant as possible throughout all recording sessions. Gradient time intervals were used, with the length of the intervals increasing gradually.
2. Performance evaluation index for a time-varying speaker recognition system. A time-varying speaker verification task generally yields a series of EERs, one for each recording session. Comparing the performance of two systems therefore means comparing two arrays of EERs, and it is natural to use the mean and standard deviation of each array to evaluate the overall performance of a system. The mean serves as an indicator of the average performance over sessions, while the standard deviation serves as an indicator of the time-varying robustness across sessions. Specifically, in this dissertation, the product of those two values is used to evaluate the overall time-varying speaker verification performance.
3. Time-varying robust feature extraction algorithms with discrimination sensitivity of frequency bands calculated through F-ratio. The concept of overall discrimination sensitivity of frequency bands for the time-varying speaker recognition task was proposed. Efforts were made to identify frequency bands that revealed high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for time-varying session-specific information. 
F-ratio was employed as an intermediary criterion to calculate the overall discrimination sensitivity based on the log-energy spectrum. Based on this overall discrimination sensitivity, time-varying robust feature extraction algorithms were presented that place different emphasis on different frequency bands during the extraction of cepstral coefficients, from two aspects: pre-filtering frequency warping and post-filtering weighting of filter-bank outputs. Experimental results showed that the two algorithms outperformed the baseline MFCC by 26.90% and 5.45%, respectively.
4. Performance-driven feature extraction algorithm based on frequency warping. This algorithm evaluated the overall discrimination sensitivity of frequency bands from a performance-driven point of view instead of using the F-ratio criterion. Specifically, the overall discrimination sensitivity of a designated frequency band is determined by the overall performance of a time-varying speaker recognition system that uses the frequency-warping approach to emphasize solely the designated frequency band, leaving the others unchanged. Finally, frequency warping was performed, and experimental results showed that it yielded a better result than MFCC, with a gain of 32.47 in overall performance.
5. Discriminative feature extraction algorithm based on filter-bank outputs weighting. This was also a performance-driven approach, but it was designed for the filter-bank outputs weighting method. After an initial series of weights was assigned to the filter-bank outputs, speaker modeling and utterance scoring were performed; then, according to the performance feedback, the series of weights was adjusted by the proposed MCE*MSV criterion. After several iterations of this process, the best series of weights was found automatically. The MCE*MSV criterion was proposed to minimize a target optimization function of the error rates of the recording sessions and their standard deviation. 
The best series of weights was applied to the filter-bank outputs, and experimental results showed that it outperformed MFCC by 34.08%.
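
As an illustration of the evaluation index in contribution 2 and the F-ratio criterion in contribution 3, the following is a minimal Python sketch, not the implementation used in this work. The function names are hypothetical. The use of the population standard deviation, the relative-gain formula, and the textbook form of the F-ratio (variance of the group means divided by the mean within-group variance) are assumptions, since the text above does not specify them. How the speaker-wise and session-wise F-ratios are combined into an overall discrimination sensitivity is not stated above and is therefore left out.

```python
from statistics import mean, pstdev, pvariance


def overall_index(eers):
    """Product of the mean and standard deviation of per-session EERs.

    Lower is better: a low mean indicates good average performance and a
    low standard deviation indicates robustness across sessions.
    Assumption: population standard deviation (pstdev).
    """
    return mean(eers) * pstdev(eers)


def relative_gain(baseline_index, system_index):
    """Relative improvement over the baseline, in percent (assumed formula)."""
    return 100.0 * (baseline_index - system_index) / baseline_index


def f_ratio(groups):
    """Textbook F-ratio for one frequency band.

    `groups` is a list of lists of log-energy values, one inner list per
    group (e.g. per speaker, or per session of the same speaker).
    """
    group_means = [mean(g) for g in groups]
    between = pvariance(group_means)             # variance of the group means
    within = mean([pvariance(g) for g in groups])  # mean within-group variance
    return between / within


if __name__ == "__main__":
    # Illustrative numbers only, not results from this work.
    baseline_eers = [4.0, 6.0]
    system_eers = [3.0, 4.0]
    base = overall_index(baseline_eers)
    new = overall_index(system_eers)
    print("baseline index:", base)
    print("system index:", new)
    print("relative gain (%):", relative_gain(base, new))
    print("F-ratio:", f_ratio([[1.0, 3.0], [5.0, 7.0]]))
```

Under this reading, a frequency band is of interest when its F-ratio computed over speakers is high while its F-ratio computed over sessions of the same speaker is low.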