
Discovering audio-visual associations in narrated videos of human activities

Posted on: 2009-06-10
Degree: Ph.D.
Type: Dissertation
University: University of Illinois at Urbana-Champaign
Candidate: Oezer, Tuna
Full Text: PDF
GTID: 1448390002996006
Subject: Computer Science
Abstract/Summary:
This research presents a novel method for learning the lexical semantics of action verbs. The primary focus is on actions directed towards objects, such as kicking a ball or pushing a chair. Specifically, this dissertation presents a robust and scalable method for acquiring grounded lexical semantics by discovering audio-visual associations in narrated videos. The narration associated with the video contains many words, including verbs unrelated to the action; the actual name of the depicted action is only occasionally mentioned by the narrator. More generally, this research presents an algorithm that can reliably and autonomously discover an association between two events, such as the utterance of a verb and the depiction of an action, even when the two events are only loosely correlated with each other.

Semantics is represented in a grounded way by association sets: collections of sensory inputs associated with a high-level concept. Each association set links video sequences that depict a given action with utterances of the name of that action. The association sets are discovered in an unsupervised way. This dissertation also shows how to extract suitable features from the video and audio for this purpose.

Extensive experimental results are presented. The experiments use several hours of video depicting a human performing 13 actions with 6 objects. The performance of the algorithm was also tested on data provided by an external research group, and the unsupervised learning algorithm presented in this dissertation was compared to standard supervised learning algorithms. This dissertation introduces a number of relevant experimental parameters and various new analysis techniques.

The experimental results show that the algorithm successfully discovers the correct associations between video scenes and audio utterances in an unsupervised way despite the imperfect correlation between the video and audio, and that it outperforms standard supervised learning algorithms. Among other things, this research shows that the performance of the algorithm depends mainly on the strength of the correlation between video and audio, the length of the narration associated with each video scene, and the total number of words in the language.
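The core idea of linking loosely correlated events can be illustrated with a minimal sketch. This is not the dissertation's algorithm, only a toy illustration of the setting: each scene pairs a visually clustered action with the words uttered around it, the action's name appears only in some scenes amid unrelated words, and word/action pairs are scored by pointwise mutual information. All scene data and names below are invented.

```python
import math
from collections import Counter

# Hypothetical narrated scenes: (action cluster from video, words uttered).
# The verb naming the action appears in only some of its scenes.
scenes = [
    ("kick", ["he", "kicks", "the", "ball"]),
    ("kick", ["kicks", "it", "hard"]),
    ("kick", ["watch", "him", "go"]),
    ("push", ["she", "pushes", "the", "chair"]),
    ("push", ["pushes", "it", "slowly"]),
    ("push", ["there", "it", "goes"]),
]

def associate(scenes):
    """For each action cluster, return the narration word with the highest
    pointwise mutual information (ties broken by co-occurrence count)."""
    pair, word_n, action_n = Counter(), Counter(), Counter()
    for action, words in scenes:
        action_n[action] += 1
        for w in set(words):          # per-scene presence, not word frequency
            pair[(action, w)] += 1
            word_n[w] += 1
    n = len(scenes)
    best = {}
    for (action, w), c in pair.items():
        pmi = math.log(c * n / (action_n[action] * word_n[w]))
        if action not in best or (pmi, c) > best[action][1]:
            best[action] = (w, (pmi, c))
    return {a: w for a, (w, _) in best.items()}

print(associate(scenes))              # → {'kick': 'kicks', 'push': 'pushes'}
```

Words such as "the" or "it" that co-occur with several actions score at or below zero, while a word exclusive to one action cluster scores highest, so the correct verb is recovered even though it is absent from a third of its scenes.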
Keywords/Search Tags: Video, Audio, Action, Association