
Kinematic measurement and feature sets for automatic speech recognition

Posted on: 2002-07-01
Degree: Ph.D
Type: Thesis
University: California Institute of Technology
Candidate: Fain, Daniel Clark
Full Text: PDF
GTID: 2468390011992727
Subject: Computer Science
Abstract/Summary:
This thesis examines the use of measured and inferred kinematic information in automatic speech recognition and lipreading, and investigates the relative information content and recognition performance of vowels and consonants. The kinematic information describes the motions of the organs of speech, the articulators. The contributions of this thesis include a new device and set of algorithms for lipreading (their design, construction, implementation, and testing); the incorporation of direct articulator-position measurements into a speech recognizer; and a reevaluation of some assumptions regarding vowels and consonants.

The motivation for including articulatory information is to improve the modeling of coarticulation and to reconcile multiple input modalities for lipreading. Coarticulation, a ubiquitous phenomenon, is the process by which speech sounds are modified by the preceding and following sounds.

To be useful in practice, a recognizer will have to infer articulatory information from sound, video, or both. Previous work made progress toward recovering articulation from sound. The present project assumes that such recovery is possible and examines the advantage of joint acoustic-articulatory representations over acoustic-only ones. Also reported is an approach to recovery from video in which camera placement (side view, head-mounted) and lighting are chosen to obtain lip-motion information robustly.

Joint acoustic-articulatory recognition experiments were performed using the University of Wisconsin X-ray Microbeam Speech Production Database. Speaker-dependent monophone recognizers, based on hidden Markov models, were tested on paragraphs each lasting about 20 seconds. Results were evaluated at the phone level and tabulated by several classes (vowel, stop, and fricative). Measured articulator coordinates were transformed by principal components analysis, and velocity and acceleration were appended.
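The feature-construction step described above (principal components analysis of the articulator coordinates, with velocity and acceleration appended) can be sketched as follows. This is a minimal illustration, not the thesis's exact recipe: the number of retained components, the frame count, and the use of finite differences for the derivatives are all assumptions.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project frames onto the top principal components.
    X: (n_frames, n_dims) array of articulator coordinates."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def append_deltas(F):
    """Append velocity and acceleration, computed as first and
    second finite differences along the time axis."""
    vel = np.gradient(F, axis=0)
    acc = np.gradient(vel, axis=0)
    return np.hstack([F, vel, acc])

# Hypothetical stream: 8 pellet coordinates over 200 frames
coords = np.random.randn(200, 8)
feats = append_deltas(pca_transform(coords, n_components=6))
# feats has 18 columns: 6 PCA dims + 6 velocity + 6 acceleration
```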
Concatenating the transformed articulatory information with a standard acoustic (cepstral) representation reduced the error rate by 7.4%, a result with across-speaker statistical significance (p = 0.018). Articulation improved recognition more for male speakers than for female speakers, and more for vowels than for fricatives or stops.

The analysis of vowels, stops, and fricatives included both the articulatory recognizer of chapter 3 and other recognizers for comparison. The information content of the different classes was also estimated. Some previous assumptions about recognition performance are shown to be false, and the findings on information content require consonants to be defined to include vowel-like sounds.
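The joint acoustic-articulatory representation amounts to frame-wise concatenation of the two feature streams. A minimal sketch, assuming both streams are already aligned to the same frame rate; the dimensionalities used here (13 cepstral, 18 articulatory) are hypothetical.

```python
import numpy as np

def joint_features(cepstra, artic):
    """Frame-wise concatenation of an acoustic (cepstral) stream
    and an articulatory stream; both must share a frame rate."""
    assert cepstra.shape[0] == artic.shape[0], "streams must be frame-aligned"
    return np.hstack([cepstra, artic])

cepstra = np.random.randn(200, 13)  # hypothetical cepstral features
artic = np.random.randn(200, 18)    # hypothetical articulatory features
joint = joint_features(cepstra, artic)  # 200 frames, 31 dims each
```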
Keywords/Search Tags: Recognition, Information, Speech, Kinematic