Robust and efficient techniques for audio-visual speech recognition

Posted on: 2003-03-08
Degree: Ph.D
Type: Dissertation
University: Clemson University
Candidate: Gurbuz, Sabri
GTID: 1468390011485174
Subject: Engineering
Abstract/Summary:
This dissertation describes the development of a multi-modal automatic speech recognition (ASR) system that provides effective recognition performance in acoustically noisy environments.

The dissertation contributes to the following five aspects of audio-visual ASR system development: (1) formulation and development of an adaptive, real-time lip tracking algorithm; (2) definition of a set of general principles for visual speech feature extraction methods; (3) formalization and implementation of a contour-based, affine-invariant Fourier descriptor algorithm; (4) formalization and implementation of a pixel-based 2D kurtosis measure of a sub-image block's frequency profile; and (5) proposal of a multi-stream, state-synchronous Hidden Markov Model (HMM) based audio-visual integration scheme that adapts the stream weighting values on the fly using the noise type and SNR level.

A Bayesian framework is used to develop the real-time lip tracking algorithm. The proposed method has a strong mathematical foundation, enables parameter adaptation on the fly, and is a practical approach to real-time lip tracking. A set of general principles is defined, and these principles are satisfied by the proposed visual speech feature extraction algorithms.

The performance of the proposed state-synchronous, noise-adaptive audio-visual integration algorithm is compared with that of the late integration scheme. It extends an existing audio-only automatic speech recognizer into a state-synchronous multi-stream audio-visual ASR system. The proposed method forms a multi-stream feature vector from the audio-visual data, computes the statistical model probabilities on the basis of the multi-stream audio-visual features, and performs dynamic programming jointly on the multi-stream Hidden Markov Models using a stream weighting value based on the noise type and signal-to-noise ratio (SNR).
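The multi-stream combination described above can be illustrated with a minimal sketch. In the standard multi-stream HMM formulation, a state's combined observation likelihood is the product of the per-stream likelihoods raised to their stream weights, i.e. log b_j(o) = λ_a·log b_aj(o_a) + λ_v·log b_vj(o_v) with λ_a + λ_v = 1. The function names and the linear SNR-to-weight mapping below are illustrative assumptions, not the dissertation's actual scheme (which also conditions the weights on the noise type):

```python
def weighted_log_likelihood(log_b_audio, log_b_video, lambda_audio):
    """Exponentially weighted combination of per-stream HMM state
    log-likelihoods: lambda_a * log b_a + (1 - lambda_a) * log b_v."""
    lambda_video = 1.0 - lambda_audio
    return lambda_audio * log_b_audio + lambda_video * log_b_video


def audio_weight_from_snr(snr_db, snr_floor=-5.0, snr_ceil=25.0):
    """Illustrative (assumed) mapping from SNR in dB to the audio stream
    weight: trust the audio stream more as the SNR rises, clipped to
    [0, 1]. The real system also uses the detected noise type."""
    t = (snr_db - snr_floor) / (snr_ceil - snr_floor)
    return min(1.0, max(0.0, t))


# In clean conditions the recognizer leans on the audio stream; as the
# SNR drops, the weight shifts toward the visual stream.
clean = weighted_log_likelihood(-10.0, -30.0, audio_weight_from_snr(20.0))
noisy = weighted_log_likelihood(-50.0, -30.0, audio_weight_from_snr(0.0))
```

During Viterbi decoding, this weighted score simply replaces the single-stream state likelihood, so the dynamic programming itself is unchanged; only the per-frame emission score adapts to the acoustic conditions.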
Results are presented demonstrating that the distinct information available from the visual subsystem allows optimal joint decisions, in the maximum-likelihood sense and based on the SNR and noise type, to exceed the performance of both the audio and video subsystems in nearly all noisy environments.
Keywords/Search Tags: Speech, Audio-visual, Performance, ASR, System, Real-time lip tracking, Noise, Development