Modeling data-source variability for content-based video retrieval using hidden Markov models

Posted on: 2010-02-13
Degree: Ph.D
Type: Dissertation
University: The Johns Hopkins University
Candidate: Ghoshal, Arnab
Full Text: PDF
GTID: 1448390002476925
Subject: Engineering
Abstract/Summary:
This dissertation is about developing simple statistical models of visual data for the purpose of content-based image and video retrieval with textual queries. We develop two hidden Markov model (HMM) based methods for mapping the visual content in images (or video frames) to a set of visual concepts. We also show that source-dependent characteristics in video data can be modeled to significantly improve retrieval performance.

Retrieval of images or videos based on their visual content requires modeling the visual information in a way that can be matched with the queries specified by a user. Queries for content-based image retrieval (CBIR) can take several forms: rough sketches, complete images or image patches, natural language, etc. This dissertation focuses on the situation where the user specifies the query in natural language; specifically, the query is assumed to be drawn from a pre-defined vocabulary of keywords. Such a setup is at the core of many content-based multimedia retrieval systems, and requires a mapping from the space of images to the set of keywords. It is common practice to represent images by the visual features extracted from them, usually in the form of vectors, so one needs a mapping from a vector space to the set of keywords.

We propose two HMM systems for mapping visual feature vectors to a set of keywords, inspired by two different viewpoints of the problem. The first model attempts to explain the observed feature vectors through a generative process whose underlying variables correspond to the objects or visual concepts present in the image; we call this the joint visual+caption model. The second attempts to determine whether a given image contains a particular concept by comparing the appearance of different parts (or locations) of the image frame with those of images known to contain the concept and those known not to contain it; we call this the location-specific model. Both models can be tuned to the source of the visual data using HMM adaptation techniques that have been widely studied in the field of automatic speech recognition.

Our experiments use two standard datasets: the Corel Stock Photo Library and the TRECVID video dataset provided by the National Institute of Standards and Technology (NIST). We show that the retrieval performance of our models is comparable to state-of-the-art systems on these datasets, while providing a considerable reduction in computation. We observe that video data collected from various sources exhibit source-dependent variability. On one hand, we show that modeling such variability greatly improves retrieval performance; on the other hand, we demonstrate that the source-dependent characteristics of the visual data can severely degrade retrieval performance for a source when no annotated data is available for that source. We present unsupervised data selection techniques that improve retrieval for several visual concepts. Lastly, when limited amounts of labeled data are available for a source, we show that model adaptation techniques considerably improve retrieval performance.
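To make the keyword-mapping and source-adaptation ideas concrete, here is a minimal sketch, not the dissertation's actual system: it fits one Gaussian-emission HMM per keyword using the hmmlearn library (an assumed stand-in; the abstract does not name a toolkit), ranks the keyword vocabulary for a new image by log-likelihood, and applies a simple MAP-style update of the Gaussian means as a crude stand-in for the ASR-derived adaptation techniques discussed above. All function names and parameters are hypothetical.

```python
# Minimal sketch (assumes the hmmlearn toolkit; not the dissertation's code).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_concept_hmms(features_by_concept, n_states=4):
    """Fit one Gaussian-emission HMM per keyword.

    features_by_concept: dict mapping a keyword to a list of 2-D arrays,
    one array per training image annotated with that keyword (each row
    is one region's feature vector, e.g. in raster-scan order).
    """
    models = {}
    for concept, images in features_by_concept.items():
        X = np.vstack(images)                       # all regions stacked
        lengths = [img.shape[0] for img in images]  # per-image sequence lengths
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                          n_iter=20, random_state=0)
        hmm.fit(X, lengths)
        models[concept] = hmm
    return models

def rank_keywords(models, image_features):
    """Rank the keyword vocabulary for one image by HMM log-likelihood."""
    scores = {c: m.score(image_features) for c, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

def map_adapt_means(hmm, X, tau=10.0):
    """MAP-style adaptation of Gaussian means toward a new data source;
    tau controls how strongly the source-independent means are trusted."""
    gamma = hmm.predict_proba(X)                    # state posteriors
    counts = gamma.sum(axis=0)                      # per-state occupancy
    num = gamma.T @ X                               # posterior-weighted sums
    hmm.means_ = (tau * hmm.means_ + num) / (tau + counts)[:, None]
    return hmm
```

Note that this sketch scores each concept independently, whereas the joint visual+caption model couples concepts in a single generative process and the location-specific model contrasts positive and negative examples per image location; the sketch only illustrates the underlying likelihood-ranking and adaptation machinery.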
Keywords/Search Tags:Retrieval, Data, Model, Video, Source, Content-based, Visual, Image