
Learning discriminant narrow-band temporal patterns for automatic recognition of conversational telephone speech

Posted on: 2006-04-07
Degree: Ph.D.
Type: Thesis
University: University of California, Berkeley
Candidate: Chen, Barry Yue
Full Text: PDF
GTID: 2458390005992585
Subject: Engineering
Abstract/Summary:
Typical automatic speech recognition (ASR) systems extract features from the full spectrum of speech over relatively short time spans (from about 25 milliseconds to approximately 100 milliseconds). They rely on the short-term spectral envelope of speech to model speech sounds, and this dependence may account for the fact that ASR systems still fall short of human recognition ability. Variability in the speech signal comes from environmental sources (such as noise and reverberation) as well as from the speaker herself/himself (such as accent and speaking style), and it creates difficult problems for typical ASR systems that rely on the short-term spectral envelope. This thesis further explores the extraction of discriminant speech information from long-term narrow-frequency energy trajectories of speech. These trajectories stretch over 500 milliseconds of speech, and each is one critical band wide in frequency. Previous work on extracting information from these long-term trajectories led to the development of a neural network architecture called Neural TRAP [52, 112]. Neural TRAP consists of two stages of multi-layer perceptrons (MLPs), each of which is a fully-connected MLP with a single hidden layer. The first stage is trained to estimate the phone posterior probabilities within each critical band, while the second stage uses the critical-band level phone probabilities to produce an overall estimate of the full-spectrum phone posterior probabilities. On its own this system was competitive with conventional ASR systems, and in combination with conventional systems it significantly improved ASR performance. We extend the Neural TRAP work along two major directions in this thesis. First, we develop two new Neural TRAP-like architectures that extract different critical-band level information.
The first new architecture, Hidden Activation TRAP (HAT), is like Neural TRAP except that instead of using the outputs of the critical-band MLPs, which estimate critical-band level phone probabilities, it uses the outputs of the critical-band hidden units, which represent probabilities of certain discriminant energy trajectories. The second new architecture, Tonotopic Multi-Layer Perceptron (TMLP), has the same network topology as HAT, but the critical-band hidden unit parameters, and the discriminant energy trajectories that they model, are not constrained to learn critical-band level phone posteriors. Instead, they are free to learn whatever critical-band discriminant patterns are useful for estimating the full-band phone posteriors. The second major extension in this thesis is the integration of the long-term narrow-band systems with a conventional ASR system for the recognition of conversational telephone speech (CTS). By augmenting conventional short-term features with features derived from a combination of phone posteriors estimated by the long-term systems and by more conventional intermediate-term systems, we achieve word error rate reductions of about 9% relative on CTS, which is considered impressive for this task.
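As a rough illustration of the difference between the Neural TRAP and HAT data flows described above, the following sketch runs one forward pass through toy, untrained networks. It is not the thesis implementation: the number of bands, trajectory length, layer sizes, phone set size, sigmoid and softmax nonlinearities, random weights, and all function names are assumptions chosen only to make the shapes concrete.

```python
import math
import random

# All sizes below are illustrative assumptions, not values from the thesis.
NUM_BANDS = 15      # assumed number of critical bands
TRAJ_FRAMES = 51    # assumed: roughly 500 ms at a 10 ms frame step
BAND_HIDDEN = 8     # assumed hidden units per critical-band MLP
MERGER_HIDDEN = 16  # assumed hidden units in the second-stage MLP
NUM_PHONES = 5      # assumed toy phone set size


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a toy layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    """Apply weights and biases to input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp(x, hidden_layer, output_layer):
    """Single-hidden-layer MLP; returns (hidden activations, posteriors)."""
    hidden = sigmoid(affine(hidden_layer, x))
    return hidden, softmax(affine(output_layer, hidden))


rng = random.Random(0)

# Stage 1: one MLP per critical band, each fed that band's energy trajectory.
band_nets = [
    (make_layer(TRAJ_FRAMES, BAND_HIDDEN, rng), make_layer(BAND_HIDDEN, NUM_PHONES, rng))
    for _ in range(NUM_BANDS)
]
trajectories = [[rng.gauss(0.0, 1.0) for _ in range(TRAJ_FRAMES)] for _ in range(NUM_BANDS)]
stage1 = [mlp(traj, hid, out) for traj, (hid, out) in zip(trajectories, band_nets)]

# Neural TRAP: the second stage sees the band-level phone posteriors.
trap_input = [p for _, posteriors in stage1 for p in posteriors]
# HAT: the second stage sees the band-level hidden activations instead.
hat_input = [a for hidden, _ in stage1 for a in hidden]

trap_merger = (make_layer(len(trap_input), MERGER_HIDDEN, rng),
               make_layer(MERGER_HIDDEN, NUM_PHONES, rng))
hat_merger = (make_layer(len(hat_input), MERGER_HIDDEN, rng),
              make_layer(MERGER_HIDDEN, NUM_PHONES, rng))

_, trap_posteriors = mlp(trap_input, *trap_merger)
_, hat_posteriors = mlp(hat_input, *hat_merger)

print(len(trap_input), len(hat_input))
```

In this sketch a TMLP forward pass would look the same as the HAT one, since the two share a topology. The difference described in the abstract lies in training: the TMLP band-level hidden units are not first trained on critical-band phone targets.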
Keywords/Search Tags:Speech, Systems, Recognition, ASR, Neural TRAP, Phone, Discriminant, Short-term spectral envelope