
Audio Visual Information Fusion for Human Activity Analysis

Posted on: 2011-07-28
Degree: Ph.D.
Type: Dissertation
University: University of California, San Diego
Candidate: Thagadur Shivappa, Shankar
Full Text: PDF
GTID: 1448390002960759
Subject: Engineering
Abstract/Summary:
Human activity analysis in unconstrained environments using far-field sensors is a challenging task. The fusion of audio and visual cues enables us to build robust and efficient human activity analysis systems. Traditional fusion schemes, including feature-level, classifier-level, and decision-level fusion, have been explored in task-specific contexts to provide robustness to sensor and environmental noise. However, human activity analysis involves extracting information from audio and visual cues at multiple levels of semantic abstraction, which naturally leads to a hierarchical fusion framework.

In this dissertation, the limitations of existing fusion schemes are explored and new algorithms are developed to address some of these limitations. The iterative decoding algorithm (IDA) fuses the audio and video modalities at the decision level but, unlike other schemes, uses an iterative strategy to infer the joint likelihood of the hidden states from the unimodal likelihoods. Iterative decoding is advantageous compared to joint modeling and other decision-level fusion schemes in terms of ease of model training and performance in low-SNR scenarios. Extending the IDA to more complex tasks, such as audio-visual person tracking and meeting scene analysis, leads to hierarchical fusion frameworks.

The multilevel iterative decoding framework for audio-visual person tracking (MID-AVT) applies the iterative decoding framework to track multiple subjects using both audio and visual cues from multiple cameras and microphone arrays. Local sensor-level tracks are fused using the IDA to obtain globally consistent tracks. The MID-AVT framework is robust to sensor calibration errors and requires only a rough calibration step to learn the correspondences between different sensors.

The location-specific speaker modeling (LSSM) framework for audio-visual meeting scene analysis augments the tracking information with speaker recognition information. Speaker recognition using far-field microphones is a challenging task; the LSSM framework addresses it by using the speaker's location to select the corresponding location-specific speaker recognition model. In practice, training such contextual models requires intensive labeling of audio-visual datasets, so semi-supervised techniques for model learning and sensor calibration are presented to address this issue. A particular case, learning the LSSM models using face recognition information, is explored in detail and found to perform well in practice. The overall contribution of this dissertation is the exploration of various aspects of hierarchical fusion in audio-visual human activity analysis and the extensive evaluation of these hierarchical fusion frameworks on real-world audio-visual testbeds.
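The decision-level exchange at the heart of the IDA can be illustrated with a small sketch. The code below is not the dissertation's implementation; it is a minimal illustration, assuming a shared hidden activity-state space, a hand-picked transition matrix, and random stand-in unimodal likelihoods. Two chains, one driven by audio emission likelihoods and one by video emission likelihoods, repeatedly re-run forward-backward inference with the other modality's current state posteriors folded in as soft priors; the posterior exchange here is deliberately simplified and omits the extrinsic-information bookkeeping of a full turbo-style decoder.

import numpy as np

def forward_backward(trans, emis_like, init):
    """Per-frame state posteriors for one modality via forward-backward.
    trans: (S, S) transition matrix, emis_like: (T, S) emission likelihoods, init: (S,) prior."""
    T, S = emis_like.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = init * emis_like[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = emis_like[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emis_like[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def iterative_decode(trans, audio_like, video_like, init, n_iters=5):
    """Exchange state posteriors between the audio and video chains:
    each pass re-weights one modality's emission likelihoods by the other's
    current posterior before re-running forward-backward."""
    audio_post = forward_backward(trans, audio_like, init)
    video_post = forward_backward(trans, video_like, init)
    for _ in range(n_iters):
        audio_post = forward_backward(trans, audio_like * video_post, init)
        video_post = forward_backward(trans, video_like * audio_post, init)
    joint = audio_post * video_post
    return joint / joint.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, S = 20, 3                              # 20 frames, 3 hidden activity states
    trans = np.full((S, S), 0.1) + 0.7 * np.eye(S)  # sticky, illustrative transitions
    init = np.full(S, 1.0 / S)
    audio_like = rng.random((T, S))           # stand-in unimodal likelihoods
    video_like = rng.random((T, S))
    states = iterative_decode(trans, audio_like, video_like, init).argmax(axis=1)
    print(states)

The appeal of this arrangement, as the abstract notes, is that each unimodal model can be trained independently, while the fusion happens only at decoding time through the exchange of state posteriors.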
Keywords/Search Tags:Human activity analysis, Fusion, Audio, Visual, Information, Using, Framework, Iterative decoding