Sequential organization in computational auditory scene analysis

Posted on: 2008-03-12
Degree: Ph.D.
Type: Thesis
University: The Ohio State University
Candidate: Shao, Yang
Full Text: PDF
GTID: 2448390005463800
Subject: Computer Science
Abstract/Summary:
A human listener has the ability to follow a speaker's voice while others are speaking simultaneously. In particular, the listener can organize the time-frequency (T-F) energy of the same speaker into a single stream. This aspect of auditory perception is termed auditory scene analysis (ASA). ASA comprises two organizational processes: segmentation and grouping. Segmentation decomposes the auditory scene into T-F segments. Grouping combines segments from the same source into a single perceptual stream. Within the grouping process, simultaneous organization integrates segments that overlap in time, while sequential organization groups segments across time.

Inspired by ASA research, computational auditory scene analysis (CASA) aims to organize sound according to ASA principles. CASA systems seek to segregate target speech from a complex auditory scene; however, almost all existing systems focus on simultaneous organization. This dissertation presents a systematic effort on sequential organization. The goal is to organize T-F segments that originate from the same speaker but are separated in time into a single stream. To this end, the study proposes to employ speaker characteristics for sequential organization.

This study first explores bottom-up, feature-based methods for sequential grouping. Subsequently, a speaker-model-based sequential organization framework is proposed and shown to yield better grouping performance than the feature-based methods. Specifically, a computational objective is derived for sequential grouping in the context of cochannel speaker recognition; cochannel speech occurs when two utterances are transmitted over a single communication channel. This formulation leads to a grouping system that searches for the optimal assignment of separated speech segments to speakers. To reduce the search space and computation time, a hypothesis-pruning method is then proposed that achieves performance close to that of exhaustive search.
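As a rough illustration of this formulation, the grouping problem can be viewed as a search over binary assignments of separated segments to two speaker models, with hypothesis pruning sketched here as a simple beam search in place of exhaustive enumeration. The toy feature representation, scoring function, and all names below are hypothetical stand-ins, not the dissertation's actual models:

```python
import itertools
import math

def segment_score(segment, model):
    # Hypothetical scorer: log-likelihood of a segment's discrete features
    # under a speaker model represented as a dict of per-feature log-probs.
    return sum(model.get(f, math.log(1e-6)) for f in segment)

def exhaustive_grouping(segments, model_a, model_b):
    """Score every binary assignment of segments to the two speakers."""
    best, best_score = None, -math.inf
    for labels in itertools.product((0, 1), repeat=len(segments)):
        score = sum(
            segment_score(seg, model_a if lab == 0 else model_b)
            for seg, lab in zip(segments, labels)
        )
        if score > best_score:
            best, best_score = labels, score
    return best, best_score

def beam_grouping(segments, model_a, model_b, beam=4):
    """Hypothesis pruning: keep only the top `beam` partial groupings."""
    hyps = [((), 0.0)]
    for seg in segments:
        expanded = []
        for labels, score in hyps:
            expanded.append((labels + (0,), score + segment_score(seg, model_a)))
            expanded.append((labels + (1,), score + segment_score(seg, model_b)))
        expanded.sort(key=lambda h: h[1], reverse=True)
        hyps = expanded[:beam]
    return hyps[0]
```

Because pruning keeps only the highest-scoring partial groupings, its cost grows linearly rather than exponentially in the number of segments, while in easy cases it recovers the same assignment as exhaustive search.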
Systematic evaluations show that the proposed system improves not only grouping performance but also speech recognition accuracy.

The model-based grouping system is then extended to handle multi-talker as well as non-speech intrusions using generic models. This generalization is shown to perform well regardless of the interference type and the number of interfering sources. The grouping system is further extended to deal with noisy inputs from unknown speakers. Specifically, it employs a speaker-quantization method that extracts representative speakers from a large speaker space and performs sequential grouping using the resulting generic models. The resulting grouping performance is only moderately lower than that obtained with known speaker models.

In addition to sequential grouping, this dissertation presents a systematic effort in robust speaker recognition. A novel usable-speech extraction method is proposed that significantly improves recognition performance. Missing-data recognition is then combined with CASA as a front-end processor, yielding substantial performance improvements in speaker recognition evaluations under various noisy conditions. Finally, a general solution is proposed for robust speaker recognition in the presence of additive noise. Novel speaker features are derived from auditory filtering and cepstral analysis, and are used in conjunction with an uncertainty decoder that accounts for the mismatch introduced by front-end processing. Systematic evaluations show that the proposed system achieves significant performance improvements over typical speaker features combined with a state-of-the-art robust front-end processor for noisy speech.
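The speaker-quantization step can be illustrated, under strong simplifying assumptions, as vector quantization over per-speaker parameter vectors. The k-means sketch below uses a hypothetical flat-vector speaker representation and invented function names; the dissertation's actual speaker models are richer:

```python
def squared_dist(p, q):
    # Euclidean distance squared between two equal-length vectors.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def quantize_speakers(speaker_vecs, k, iters=20):
    """K-means sketch: collapse many speaker parameter vectors into
    k centroids that serve as generic speaker models."""
    # Deterministic initialization: seed centroids with the first k vectors.
    centroids = [list(v) for v in speaker_vecs[:k]]
    for _ in range(iters):
        # Assign each speaker vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in speaker_vecs:
            j = min(range(k), key=lambda j: squared_dist(v, centroids[j]))
            clusters[j].append(v)
        # Recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids
```

At grouping time, an unknown speaker would be scored against the k centroids instead of against every enrolled model, which is one plausible reading of "representative speakers from a large speaker space."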
Keywords/Search Tags: Speaker, Auditory scene, Sequential organization, Grouping, Performance, Speech, Computational, ASA