
Unsupervised Alignment of Natural Language with Video

Posted on: 2016-03-09
Degree: Ph.D.
Type: Thesis
University: University of Rochester
Candidate: Naim, Iftekhar
Full Text: PDF
GTID: 2478390017981162
Subject: Computer Science
Abstract/Summary:
Today we encounter large amounts of video data, often accompanied by text descriptions (e.g., cooking videos and recipes, videos of wet-lab experiments and protocols, movies and scripts). Extracting meaningful information from these multimodal sequences requires aligning the video frames with the corresponding sentences in the text. Previous methods for connecting language and video relied on manual annotations, which are tedious and expensive to collect. In this thesis, we focus on automatically aligning sentences with the corresponding video frames without any direct human supervision.

We first propose two hierarchical generative alignment models, which jointly align each sentence with the corresponding video frames and each noun in a sentence with the corresponding object in those frames. Next, we propose several latent-variable discriminative alignment models, which incorporate rich features involving verbs and video actions and outperform the generative models. Our alignment algorithms are primarily applied to aligning biological wet-lab videos with text instructions. Furthermore, we extend our alignment models to automatically align movie scenes with their associated scripts and to learn word-level translations between language pairs for which bilingual training data is unavailable.

Thesis: By exploiting the temporal ordering constraints between a video and its associated text, it is possible to automatically align the sentences in the text with the corresponding video frames without any direct human supervision.
Keywords/Search Tags: Video, Text, Align, Language