
Audio-Visual Asynchrony Modeling and Analysis for Speech Alignment and Recognition

Posted on: 2012-07-12    Degree: Ph.D    Type: Thesis
University: Northwestern University    Candidate: Terry, Louis
GTID: 2458390008493826    Subject: Speech communication
Abstract/Summary:
This work investigates perceived audio-visual asynchrony, specifically anticipatory coarticulation, in which the visual cues (e.g., lip rounding) of a speech sound may occur before the acoustic cues. This phenomenon often gives the impression that the visual and acoustic signals are asynchronous. The effect can be accounted for using models based on multiple hidden Markov models with synchrony constraints linking states in different modalities, though generally only within phones and not across phone boundaries. In this work, we consider several such models, implemented as dynamic Bayesian networks (DBNs). We study the models' ability to accurately locate phone and viseme (audio and video sub-word units, respectively) boundaries in the audio and video signals, and compare them with human labels of these boundaries. This alignment task is important in its own right for linguistic analysis, where it serves as both an analysis tool and a convenience tool for linguists. Furthermore, advances in alignment systems can carry over into the speech recognition domain.

This thesis makes several contributions. First, it presents a new set of manually labeled phonetic boundary data for words expected to display asynchrony; analysis of these data confirms our expectations about the phenomenon. Second, it presents a new software program, AVDDisplay, which allows audio, video, and alignment data to be viewed simultaneously and in sync; this tool is essential for the alignment analysis detailed in this work. Third, new DBN-based models of audio-visual asynchrony are presented that incorporate linguistic context into the asynchrony model. Fourth, alignment experiments compare system performance against the hand-labeled ground truth. Finally, the performance of these models in a speech recognition context is examined. This work finds that the newly proposed models outperform previously suggested asynchrony models on both alignment and recognition tasks.
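The abstract does not give the model structure in detail, but the asynchrony mechanism it describes (two state streams whose misalignment is bounded within a phone) can be illustrated with a minimal sketch. The code below is a hypothetical joint Viterbi decode over an audio and a video state stream, where the asynchrony constraint simply bounds the difference between the two streams' state indices; the state counts, transition matrix, and MAX_ASYNC parameter are illustrative assumptions, not the thesis's actual DBN implementation.

    import numpy as np

    # Hypothetical sketch of a coupled two-stream model with a bounded-asynchrony
    # constraint, in the spirit of the DBN models described above. The thesis
    # models are richer (linguistic context, within-phone constraints); here the
    # constraint is simply |audio_state - video_state| <= MAX_ASYNC.

    MAX_ASYNC = 1   # maximum allowed state-index lag between streams (assumed)
    N_STATES = 3    # states per stream for one phone/viseme pair (assumed)

    def joint_viterbi(log_lik_a, log_lik_v, log_trans):
        """Joint Viterbi over (audio_state, video_state) pairs.

        log_lik_a, log_lik_v : (T, N_STATES) per-frame log-likelihoods per stream.
        log_trans            : (N_STATES, N_STATES) shared log transition matrix.
        Returns the best joint state sequence as a list of (a_state, v_state) pairs.
        """
        T = log_lik_a.shape[0]
        NEG = -np.inf
        # delta[t, i, j] = best log score with audio in state i, video in state j
        delta = np.full((T, N_STATES, N_STATES), NEG)
        back = np.zeros((T, N_STATES, N_STATES, 2), dtype=int)

        for i in range(N_STATES):
            for j in range(N_STATES):
                if abs(i - j) <= MAX_ASYNC:
                    delta[0, i, j] = log_lik_a[0, i] + log_lik_v[0, j]

        for t in range(1, T):
            for i in range(N_STATES):
                for j in range(N_STATES):
                    if abs(i - j) > MAX_ASYNC:
                        continue  # asynchrony constraint: prune misaligned pairs
                    best, arg = NEG, (0, 0)
                    for pi in range(N_STATES):
                        for pj in range(N_STATES):
                            score = (delta[t - 1, pi, pj]
                                     + log_trans[pi, i] + log_trans[pj, j])
                            if score > best:
                                best, arg = score, (pi, pj)
                    delta[t, i, j] = best + log_lik_a[t, i] + log_lik_v[t, j]
                    back[t, i, j] = arg

        # Backtrace from the best final state pair
        i, j = np.unravel_index(np.argmax(delta[-1]), (N_STATES, N_STATES))
        path = [(int(i), int(j))]
        for t in range(T - 1, 0, -1):
            i, j = back[t, i, j]
            path.append((int(i), int(j)))
        return path[::-1]

Under this sketch, setting MAX_ASYNC = 0 forces the two streams into lockstep (a standard synchronous HMM), while larger values allow the visual stream to lead or lag the acoustic stream, which is the behavior the thesis attributes to anticipatory coarticulation.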
Keywords/Search Tags: Asynchrony, Alignment, Speech, Recognition, Models