
Multimodal fusion with applications to audio-visual speech recognition

Posted on: 2004-07-05
Degree: Ph.D.
Type: Dissertation
University: University of Illinois at Urbana-Champaign
Candidate: Chu, Stephen Mingyu
Full Text: PDF
GTID: 1458390011457792
Subject: Engineering
Abstract/Summary:
This study considers the fundamental problem of multimodal fusion in the context of pattern recognition tasks in human-computer interfaces (HCI). Specifically, the research stems from two basic recognition problems: first, automatic speech recognition; and second, biometrics, i.e., person recognition. In both cases, multiple cues carried in different modalities are often available for the recognition targets. Thus, the multiple information sources may be modeled or evaluated jointly to improve recognition performance, especially under adverse ambient conditions. This motivation leads, respectively, to audio-visual speech recognition and multichannel biometrics. A crucial problem that arises in these multimodal approaches is how to carry out fusion so as to best exploit the available information.

Differences in the characteristics of the intermodal couplings in audio-visual speech recognition and in multichannel biometrics defy a universal fusion method for both applications. For audio-visual speech modeling, we propose a novel sensory fusion method based on coupled hidden Markov models (CHMMs). The CHMM framework allows the fusion of two temporally coupled information sources to take place as an integral part of the statistical modeling process. An important advantage of the CHMM-based fusion method lies in its ability to model asynchronies between the audio and visual channels. We describe two approaches to carry out inference and learning in CHMMs. The first is an exact algorithm derived by extending the forward-backward procedure used in hidden Markov model (HMM) inference. The second method relies on a model transformation strategy that maps the state space of a CHMM onto the state space of a classic HMM, and therefore facilitates the development of sophisticated audio-visual speech recognition systems using existing infrastructures.
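As a rough illustration (not taken from the dissertation itself), the model-transformation idea can be sketched as follows: the joint state of a two-chain CHMM, a pair (i, j), is flattened into a single composite index, and the coupled transition distributions of the two chains are multiplied to form the transition matrix of an equivalent classic HMM. The tensor shapes and the assumption of independent initial states below are illustrative choices, not the dissertation's specification.

```python
import numpy as np

def chmm_to_hmm(A1, A2, pi1, pi2):
    """Map a two-chain coupled HMM onto an equivalent classic HMM.

    A1[i, j, i2] = P(chain-1 moves to state i2 | chain-1 in i, chain-2 in j)
    A2[i, j, j2] = P(chain-2 moves to state j2 | chain-1 in i, chain-2 in j)
    The composite state is the pair (i, j), flattened to k = i * N2 + j.
    Assumes (illustratively) that the two chains start independently.
    """
    N1 = A1.shape[0]
    N2 = A2.shape[2]
    A = np.zeros((N1 * N2, N1 * N2))
    for i in range(N1):
        for j in range(N2):
            for i2 in range(N1):
                for j2 in range(N2):
                    # Composite transition: product of the coupled marginals.
                    A[i * N2 + j, i2 * N2 + j2] = A1[i, j, i2] * A2[i, j, j2]
    pi = np.outer(pi1, pi2).reshape(-1)  # independent-start assumption
    return A, pi

# Small demo with random row-stochastic coupled transitions.
rng = np.random.default_rng(0)
N1, N2 = 2, 3
A1 = rng.random((N1, N2, N1))
A1 /= A1.sum(axis=2, keepdims=True)
A2 = rng.random((N1, N2, N2))
A2 /= A2.sum(axis=2, keepdims=True)
pi1 = np.array([0.6, 0.4])
pi2 = np.array([0.2, 0.3, 0.5])
A, pi = chmm_to_hmm(A1, A2, pi1, pi2)
```

Once the composite matrix `A` and initial distribution `pi` are built, standard HMM forward-backward and Viterbi routines apply unchanged, which is the practical appeal of the transformation.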
For multichannel biometrics, we introduce a general formulation based on the late integration paradigm and address the environmental robustness issue through multichannel fusion. Based on this formulation, two effective approaches to carry out environment-adaptive decision fusion are developed: the environmental confidence weighting method and the optimal channel weighting method.
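The late-integration paradigm with confidence weighting can be sketched in a few lines: each channel produces per-class log-likelihood scores, and the fused decision maximizes their weighted sum, with weights reflecting the estimated reliability of each channel under the current environment. The weight values and channel names below are illustrative assumptions, not the dissertation's trained parameters.

```python
import numpy as np

def fuse_decisions(log_likelihoods, weights):
    """Late-integration decision fusion with confidence weights.

    log_likelihoods: dict channel -> array of per-class log-likelihoods
    weights: dict channel -> confidence weight (non-negative, summing to 1)
    Returns (index of the winning class, fused score vector).
    """
    fused = sum(weights[c] * np.asarray(log_likelihoods[c])
                for c in log_likelihoods)
    return int(np.argmax(fused)), fused

# Toy example: a reliable audio channel favouring class 1,
# and a noisier video channel that is down-weighted accordingly.
scores = {"audio": np.array([-12.0, -3.0, -9.0]),
          "video": np.array([-5.0, -6.0, -4.0])}
w = {"audio": 0.8, "video": 0.2}  # e.g. high audio SNR -> high confidence
best, fused = fuse_decisions(scores, w)
# best == 1: the confident audio channel dominates the decision.
```

Environment adaptation then amounts to choosing the weights as a function of measured conditions (for example, acoustic SNR), so that a degraded channel contributes less to the fused decision.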
Keywords/Search Tags: Fusion, Recognition, Audio-visual speech, Multimodal, Method