
An information-theoretic framework towards large-scale video structuring, threading, and retrieval

Posted on: 2008-05-07    Degree: Ph.D    Type: Thesis
University: Columbia University    Candidate: Hsu, Winston H    Full Text: PDF
GTID: 2448390005965750    Subject: Computer Science
Abstract/Summary:
Video and image retrieval has been an active and challenging research area due to the explosive growth of online video data, personal video recordings, digital photos, and broadcast news videos. To manage and use such enormous multimedia resources effectively, users need to be able to access, search, or browse video content at the semantic level. Current solutions rely primarily on text features and do not utilize rich multimodal cues. Works exploring multimodal features often use manually selected features and/or ad hoc models, and thus lack scalability to general applications. To fully exploit the potential of integrating multimodal features and to ensure the generality of the solutions, this thesis presents a novel, rigorous framework and new statistical methods for video structuring, threading, and search in large-scale video databases.

We focus on several fundamental problems in video indexing and retrieval: (1) How to select and fuse a large number of heterogeneous multimodal features from image, speech, audio, and text? (2) How to automatically discover and model mid-level features for multimedia content? (3) How to model similarity between multimodal documents such as news videos or multimedia web documents? (4) How to exploit unsupervised methods in video search to boost performance in an automatic fashion?

To address these challenging problems, our main contributions are the following. First, we extend the Maximum Entropy model to fuse diverse perceptual features from multiple levels and modalities, and demonstrate significant performance improvement in broadcast news video segmentation. Second, we propose an information-theoretic approach to automatically construct mid-level representations; it is the first work to remove the dependency on the manual, labor-intensive processes used to develop mid-level feature representations from low-level features.
Third, we introduce new multimodal representations based on visual duplicates, cue word clusters, high-level concepts, etc., to compute similarity between multimedia documents. Using such new similarity metrics, we demonstrate significant gains in multi-lingual, cross-domain topic tracking. Last, to improve automatic image and video search performance, we propose two new methods for reranking initial video search results obtained from text keywords only. At the image/video level, we apply the information bottleneck principle to discover image clusters in the initial search results, and then rerank the images based on cluster-level relevance scores and the occurrence frequency of images. This method is efficient and generic, applicable to reranking initial results produced by other search approaches, such as content-based image search or semantic concept-based search. At the multimedia document level, building on the multimodal document similarities, we propose a random walk framework for reranking the initial text-based video search results. Significant performance improvement is demonstrated in comparison with text-based reranking methods. In addition, we study the application and optimal parameter settings of the power method for solving the multimodal random walk problems. All of our experiments are conducted on large-scale, diverse video data such as the TRECVID benchmark data set, which includes more than 160 hours of broadcast videos from multiple international channels.
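To illustrate the random-walk reranking idea in the abstract, the sketch below runs the power method on a small, entirely hypothetical row-stochastic transition matrix derived from document similarities, biased toward the initial text-based scores via a restart probability. The matrix values, the damping factor `alpha`, the tolerance, and the function name are illustrative assumptions, not the thesis's actual parameters or implementation.

```python
def power_method_rerank(transition, text_scores, alpha=0.85, tol=1e-9, max_iter=1000):
    """Rerank documents by the stationary distribution of a biased random walk.

    transition:  row-stochastic matrix of multimodal document similarities
                 (transition[j][i] = probability of walking from doc j to doc i).
    text_scores: initial text-based relevance scores; they act as the restart
                 vector that keeps the walk anchored to the text-search results.
    alpha:       probability of following a similarity link rather than
                 jumping back to the text-based scores.
    Returns document indices sorted by descending stationary score.
    """
    n = len(transition)
    total = sum(text_scores)
    base = [s / total for s in text_scores]   # normalized restart vector
    scores = base[:]                          # start the walk from text scores
    for _ in range(max_iter):
        new = [
            alpha * sum(scores[j] * transition[j][i] for j in range(n))
            + (1 - alpha) * base[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, scores)) < tol:
            scores = new
            break
        scores = new
    return sorted(range(n), key=lambda i: -scores[i])

# Hypothetical 3-document example: doc 2 has a weak text score but is
# strongly linked from the top text hit, so the walk can promote it.
T = [[0.1, 0.2, 0.7],
     [0.3, 0.4, 0.3],
     [0.6, 0.1, 0.3]]
ranking = power_method_rerank(T, [0.6, 0.3, 0.1])
```

The restart term plays the same role as personalization in PageRank-style walks: with `alpha` near 1 the multimodal similarities dominate, while `alpha` near 0 reproduces the original text-based ordering.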
Keywords/Search Tags: Video, Search, Image, Large-scale, Framework, Features