Integrating computational auditory scene analysis and automatic speech recognition

Posted on: 2007-05-31
Degree: Ph.D
Type: Thesis
University: The Ohio State University
Candidate: Srinivasan, Soundararajan
Full Text: PDF
GTID: 2458390005489635
Subject: Artificial Intelligence
Abstract/Summary:
We present a schema-based model for phonemic restoration. The model employs missing-data ASR to decode speech based on unmasked portions and activates word templates that contain the masked phoneme via dynamic time warping. An activated template is then used to restore the masked phoneme. A systematic evaluation shows that the model is able to restore both voiced and unvoiced phonemes with a spectral quality close to that of original phonemes.

Missing-data ASR relies on a binary mask generated by bottom-up CASA to label the speech-dominant time-frequency (T-F) regions of a noisy mixture as reliable and the rest as unreliable. However, errors in mask estimation cause degradation in recognition accuracy. Hence, we propose a two-pass ASR system that performs segregation and recognition in tandem. In the first pass, an n-best lattice, consistent with bottom-up speech separation, is generated. The lattice is then re-scored using a model-based hypothesis test to improve mask estimation and recognition accuracy concurrently. This two-pass system leads to significant improvement in recognition performance.

By combining a monaural CASA system with missing-data ASR, we present a model that simulates listeners' ability to attend to a target speaker when degraded by the effects of energetic and informational masking in multitalker environments. Energetic masking refers to the phenomenon that a stronger signal masks a weaker one within a critical band. Informational masking occurs when the listener is unable to segregate target from interference. Missing-data ASR is used to account for energetic masking. The effects of informational masking are modeled by the output degradation of the CASA system in binary mask estimation. The model successfully simulates several quantitative aspects of listener performance including the differential effects of energetic and informational masking on multitalker perception.

While missing-data ASR performs well on small vocabulary tasks, previous studies have not examined the effect of vocabulary size. In this dissertation, we investigate the performance of the missing-data ASR on a larger vocabulary task and compare its results to those of conventional ASR. For conventional ASR, we extract the speech signal from a noisy mixture by estimating a Wiener filter based on estimated interaural time and intensity differences within a T-F unit. For missing-data ASR, the same estimation is used to produce a binary T-F mask. We find that while missing-data recognition outperforms conventional ASR on a small vocabulary task, the performance of conventional ASR is significantly better when the vocabulary size is increased. (Abstract shortened by UMI.)...
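The abstract contrasts two uses of a per-unit speech/noise estimate: a binary T-F mask for missing-data ASR and a Wiener filter for conventional ASR. The following minimal sketch illustrates that difference only. It assumes the speech and noise energies in each T-F unit are already known, whereas the dissertation estimates them from interaural time and intensity differences. The function names, the 0 dB threshold, and the toy values are illustrative assumptions, not taken from the dissertation.

```python
import math

def binary_mask(speech_energy, noise_energy, threshold_db=0.0):
    """Label a T-F unit 1 (reliable) when its local SNR exceeds threshold_db, else 0.

    Inputs are lists of rows (frequency channels) of per-frame energies.
    The 0 dB default is an assumed criterion, not the dissertation's.
    """
    eps = 1e-12  # avoids log of zero and division by zero
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask

def wiener_gain(speech_energy, noise_energy):
    """Soft gain S / (S + N) per T-F unit; 0.0 where both energies are zero."""
    gains = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        gains.append([s / (s + n) if (s + n) > 0 else 0.0
                      for s, n in zip(s_row, n_row)])
    return gains

# Toy example: 2 frequency channels x 2 time frames (hypothetical energies).
speech = [[4.0, 1.0], [0.5, 9.0]]
noise = [[1.0, 4.0], [0.5, 1.0]]
mask = binary_mask(speech, noise)
gains = wiener_gain(speech, noise)
```

The binary mask makes a hard reliable/unreliable decision per unit, which a missing-data recognizer consumes, while the Wiener gain attenuates each unit in proportion to its estimated speech share before conventional recognition.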
Keywords/Search Tags:ASR, Speech, Recognition, Mask, Model, Vocabulary, Performance