Use of speaker location features in meeting diarization

Posted on: 2009-12-02
Degree: Ph.D.
Type: Thesis
University: University of Washington
Candidate: Otterson, Scott
Full Text: PDF
GTID: 2448390002496935
Subject: Engineering
Abstract/Summary:
This thesis proposes several improvements to the correlation-based location features recently used in meeting speaker diarization (answering the question, "Who spoke when?"). The problem of leveraging time delay information is examined for multi-microphone meeting environments, where microphones are placed at unknown, widely spaced, ad hoc locations. In addition, conversational speech is challenging because of its many short utterances and speaker overlaps. Finally, assuming no room constraints, the microphone configuration and acoustic environment change from meeting to meeting. Together, these conditions make it impractical to apply standard localization and beamforming techniques. To address these challenges, we first consider what combination of channel pairs and signal processing to use for location information extraction. Initially, we consider all pairs, then de-emphasize low-quality pairs through feature vector dimension reduction. We also develop an approach for fusing speaker ID information as viewed by different physical processes. Two views, a new time delay estimate and multi-band energy ratios, are cues to location; a third is a vector of mel-warped cepstral coefficients (MFCCs), related to vocal tract characteristics. We find that both MFCCs and energy ratios can improve time delay information when jointly transformed using canonical correlation analysis (CCA). Oracle experiments show that the location feature dimension producing the best diarization error varies from meeting to meeting. Therefore, we evaluate automatic methods for determining the output dimension of the feature reduction. In addition, we separately consider reducing the feature dimension by explicitly selecting subsets of channel pairs using estimated signal-to-noise ratio (SNR) and information-theoretic feature selection methods.

Location features are also employed to detect speaker overlap, a significant cause of increased speaker diarization error. First, monaural overlap features are developed for a single-channel beamformer output. These features are then compared to overlap detector features that make use of location information, but neither type provides good performance, owing to a high degree of variation across meetings. We also develop a simple nearest-neighbor overlap processing scheme which, when given accurate overlap detection, improves diarization accuracy. Together, these results underscore the need for dynamic models to handle variable room and recording configurations.
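As a rough illustration of what a correlation-based location feature over all channel pairs can look like, the sketch below estimates one time delay per microphone pair by picking the lag that maximizes a plain cross-correlation. This is not the thesis's new time delay estimate, whose details the abstract does not give; the function names, the synthetic white-noise source, and the integer-sample delays are all assumptions made for the example.

```python
from itertools import combinations
import random


def estimate_delay(x, y, max_lag):
    """Return the lag (in samples) at which y best aligns with x.

    A positive result means y is a delayed copy of x.
    Plain cross-correlation peak picking; an assumed stand-in, not the
    time delay estimator developed in the thesis.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate x[n] with y[n + lag] over the overlapping samples.
        score = 0.0
        for n in range(len(x)):
            m = n + lag
            if 0 <= m < len(y):
                score += x[n] * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def pairwise_delay_features(channels, max_lag):
    """One delay per channel pair, in the order produced by combinations()."""
    return [
        estimate_delay(channels[i], channels[j], max_lag)
        for i, j in combinations(range(len(channels)), 2)
    ]


def delayed(signal, d):
    """Hypothetical helper: shift a signal right by d samples, zero-padded."""
    return [0.0] * d + signal[: len(signal) - d]


# Synthetic example: one white-noise "speaker" reaching three microphones
# with delays of 0, 3, and 7 samples (assumed values, for illustration only).
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(400)]
channels = [delayed(source, d) for d in (0, 3, 7)]

features = pairwise_delay_features(channels, max_lag=10)
print(features)
```

For three channels this yields a three-element feature vector, one delay for each of the pairs (0, 1), (0, 2), and (1, 2). The number of pairs grows quadratically with the number of microphones, which is one reason the dimension reduction and channel-pair selection described above become relevant.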
Keywords/Search Tags: Diarization, Location, Meeting, Speaker