
Audio-Visual Asynchrony Modeling and Analysis for Speech Alignment and Recognition

Posted on: 2012-07-12    Degree: Ph.D    Type: Thesis
University: Northwestern University    Candidate: Terry, Louis
GTID: 2458390008493826    Subject: Speech communication
Abstract/Summary:
This work investigates perceived audio-visual asynchrony, specifically anticipatory coarticulation, in which the visual cues (e.g., lip rounding) of a speech sound may occur before the acoustic cues. This phenomenon often gives the impression that the visual and acoustic signals are asynchronous. The effect can be accounted for using models based on multiple hidden Markov models with synchrony constraints linking states in different modalities, though generally only within phones and not across phone boundaries. In this work, we consider several such models, implemented as dynamic Bayesian networks (DBNs). We study the models' ability to accurately locate phone and viseme (audio and video sub-word units, respectively) boundaries in the audio and video signals, and compare them with human labels of these boundaries. This alignment task is important in its own right for linguistic analysis, where it serves as both an analysis tool and a convenience tool for linguists. Furthermore, advances in alignment systems can carry over into the speech recognition domain.

This thesis makes several contributions. First, it presents a new set of manually labeled phonetic boundary data for words expected to display asynchrony; analysis of these data confirms our expectations about the phenomenon. Second, it presents a new software program, AVDDisplay, which allows audio, video, and alignment data to be viewed simultaneously and in sync; this tool is essential for the alignment analysis detailed in this work. Third, new DBN-based models of audio-visual asynchrony are presented that incorporate linguistic context into the asynchrony model. Fourth, alignment experiments compare system performance against the hand-labeled ground truth. Finally, the performance of these models in a speech recognition context is examined. This work finds that the newly proposed models outperform previously suggested asynchrony models on both alignment and recognition tasks.
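The abstract does not give the model structure in detail, but the asynchrony mechanism it describes (two state streams whose misalignment is bounded within a phone) can be illustrated with a minimal sketch. The code below is a hypothetical joint Viterbi decode over an audio and a video state stream, where the asynchrony constraint simply bounds the difference between the two streams' state indices; the state counts, transition matrix, and MAX_ASYNC parameter are illustrative assumptions, not the thesis's actual DBN implementation.

    import numpy as np

    # Hypothetical sketch of a coupled two-stream model with a bounded-asynchrony
    # constraint, in the spirit of the DBN models described above. The thesis
    # models are richer (linguistic context, within-phone constraints); here the
    # constraint is simply |audio_state - video_state| <= MAX_ASYNC.

    MAX_ASYNC = 1   # maximum allowed state-index lag between streams (assumed)
    N_STATES = 3    # states per stream for one phone/viseme pair (assumed)

    def joint_viterbi(log_lik_a, log_lik_v, log_trans):
        """Joint Viterbi over (audio_state, video_state) pairs.

        log_lik_a, log_lik_v : (T, N_STATES) per-frame log-likelihoods per stream.
        log_trans            : (N_STATES, N_STATES) shared log transition matrix.
        Returns the best joint state sequence as a list of (a_state, v_state) pairs.
        """
        T = log_lik_a.shape[0]
        NEG = -np.inf
        # delta[t, i, j] = best log score with audio in state i, video in state j
        delta = np.full((T, N_STATES, N_STATES), NEG)
        back = np.zeros((T, N_STATES, N_STATES, 2), dtype=int)

        for i in range(N_STATES):
            for j in range(N_STATES):
                if abs(i - j) <= MAX_ASYNC:
                    delta[0, i, j] = log_lik_a[0, i] + log_lik_v[0, j]

        for t in range(1, T):
            for i in range(N_STATES):
                for j in range(N_STATES):
                    if abs(i - j) > MAX_ASYNC:
                        continue  # asynchrony constraint: prune misaligned pairs
                    best, arg = NEG, (0, 0)
                    for pi in range(N_STATES):
                        for pj in range(N_STATES):
                            score = (delta[t - 1, pi, pj]
                                     + log_trans[pi, i] + log_trans[pj, j])
                            if score > best:
                                best, arg = score, (pi, pj)
                    delta[t, i, j] = best + log_lik_a[t, i] + log_lik_v[t, j]
                    back[t, i, j] = arg

        # Backtrace from the best final state pair
        i, j = np.unravel_index(np.argmax(delta[-1]), (N_STATES, N_STATES))
        path = [(int(i), int(j))]
        for t in range(T - 1, 0, -1):
            i, j = back[t, i, j]
            path.append((int(i), int(j)))
        return path[::-1]

Under this sketch, setting MAX_ASYNC = 0 forces the two streams into lockstep (a standard synchronous HMM), while larger values allow the visual stream to lead or lag the acoustic stream, which is the behavior the thesis attributes to anticipatory coarticulation.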
Keywords/Search Tags: Asynchrony, Alignment, Speech, Recognition, Models