
Multi-level Video Annotation And Retrieval

Posted on: 2009-12-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Yuan
Full Text: PDF
GTID: 1118360242495815
Subject: Signal and Information Processing
Abstract/Summary:
With the development of multimedia, computing, and the internet, video data are growing explosively. Storing, managing, and indexing these massive video collections efficiently requires more effective Content-Based Video Retrieval (CBVR) algorithms. Video annotation is a preliminary step for video retrieval and search. This dissertation investigates how to exploit machine learning techniques and video features to perform content-based video annotation at different levels of video structure.

Video structure has four levels: video, scene, shot, and frame. Annotation is typically performed at the video level and the shot level. Video-level annotation assigns genre information to each video clip. Shot-level annotation assigns a semantic concept to each shot based on the key frame extracted from it, and is further divided into image-level and object-level annotation, according to whether the annotated concept belongs to the image level or the object level. This dissertation investigates problems in video annotation at the video level, image level, and object level. The main contributions and innovations are summarized as follows:

1. For video-level annotation, existing work usually covers only a few genres, or the sub-genres within a single genre, and the classifiers employed are often overly simple. We define a relatively comprehensive video genre ontology, and analyze and extract a series of spatial and temporal features related to video genre. Furthermore, we propose a locally and globally optimal SVM binary tree for multi-class SVM classification to improve classification accuracy (a binary-tree sketch follows this list).

2. Existing work on video-level annotation usually adopts passive learning, which demands large-scale training data and time-consuming human labeling. We incorporate active learning into video genre classification and propose an SVM active learning algorithm based on posterior probability. We first use the posterior probability output by the SVM classifier to compute the confidence of each unlabeled sample, and then select the "most unconfident" samples, which are always the "most valuable" samples for the classifier, for the user to label. With this active learning strategy, far fewer training samples achieve classification accuracy comparable to that obtained with large-scale training sets, alleviating the user's labeling effort (see the selection sketch after this list).

3. For image-level (key-frame) video annotation, we address a typical case in video annotation: learning the target concept from only a small number of positive samples. We propose a novel manifold-ranking-based scheme to tackle this problem. However, video annotation needs large-scale video data and a large feature pool to achieve good performance, and in this setting manifold ranking suffers from two problems: intractable computation cost and the curse of dimensionality. We incorporate two modules, pre-filtering and feature selection, to tackle these two problems respectively. The scheme is extensible and flexible in terms of adding new features to the feature pool, introducing human interaction in feature selection, and defining new concepts (see the ranking sketch after this list).
4. In object-level video annotation, the training data are usually labeled at the image level while the semantic concepts live at the region level, so typical single-instance supervised learning cannot learn the target concept directly. If we regard each image as a labeled bag of multiple instances, and the objects in the image as the instances in the bag, object-level video annotation becomes a typical multiple-instance learning (MIL) problem. However, conventional multiple-instance learning in video annotation neglects concept dependencies, i.e., the relationships between positive and negative concepts. We therefore propose an existence-based MIL formulation to model concept dependencies, and present a MIL algorithm, MI-AdaBoost, built on this formulation. MI-AdaBoost first maps each training bag into a feature vector in a new bag-level feature space, translating the MIL problem into a standard single-instance problem. Since this mapping yields a high-dimensional, noisy feature vector for each bag, we use AdaBoost to perform feature selection and build the final classifier (see the bag-mapping sketch after this list).

5. Because the effective features usually differ greatly across semantic concepts, feature selection is a key problem in video annotation. Typical feature selection algorithms for single-instance settings usually cannot be adapted directly to multi-instance settings, and previous work on MIL in video annotation has often neglected feature selection under MIL. We propose a feature selection algorithm for the MIL setting, named EBMIL, which can also select among different raw feature sources (color, texture, etc.) while selecting mapped bag-level features, and thus achieves better video annotation performance.
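The following is a minimal sketch of the binary-tree multi-class SVM idea from contribution 1: each internal node splits the remaining genres into two groups and trains a binary SVM, and prediction descends the tree. The grouping rule used here (2-means on the class mean vectors) is an illustrative stand-in only; the abstract does not specify the dissertation's locally and globally optimal partitioning.

```python
# Hypothetical sketch of a binary-tree multi-class SVM; the 2-means class
# grouping is an assumption, not the dissertation's partitioning method.
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

def build_tree(X, y, classes):
    if len(classes) == 1:
        return classes[0]                      # leaf: a single genre
    # Split the classes into two groups by clustering their mean vectors.
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    side = KMeans(n_clusters=2, n_init=10).fit_predict(means)
    left = [c for c, s in zip(classes, side) if s == 0]
    right = [c for c, s in zip(classes, side) if s == 1]
    mask = np.isin(y, left)                    # True = sample's class is in 'left'
    clf = SVC(kernel="rbf").fit(X, mask)       # binary SVM at this node
    return (clf, build_tree(X[mask], y[mask], left),
                 build_tree(X[~mask], y[~mask], right))

def predict_one(node, x):
    if not isinstance(node, tuple):
        return node                            # reached a leaf genre
    clf, left, right = node
    return predict_one(left if clf.predict(x[None])[0] else right, x)

# Usage (hypothetical data):
# tree = build_tree(X_train, y_train, sorted(set(y_train)))
# genre = predict_one(tree, x_new)
```

A balanced tree of binary SVMs needs only about log2(C) node evaluations per prediction for C genres, which is the usual argument for tree schemes over one-vs-one SVM decompositions.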
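Contribution 2's sample selection can be sketched as below, assuming scikit-learn's probability-calibrated SVM; the batch size, features, and labeling oracle are placeholders, not details taken from the dissertation.

```python
# A minimal sketch of posterior-probability-based SVM active learning:
# query the samples the current classifier is least confident about.
import numpy as np
from sklearn.svm import SVC

def select_most_unconfident(clf, X_unlabeled, batch_size=10):
    """Return indices of the unlabeled samples with the lowest confidence."""
    proba = clf.predict_proba(X_unlabeled)      # posterior probabilities
    confidence = proba.max(axis=1)              # confidence = top-class probability
    return np.argsort(confidence)[:batch_size]  # least confident first

# One active-learning round (hypothetical data and oracle):
# clf = SVC(kernel="rbf", probability=True).fit(X_labeled, y_labeled)
# query_idx = select_most_unconfident(clf, X_unlabeled)
# ...ask the user to label X_unlabeled[query_idx], move those samples into
# the labeled pool, retrain, and repeat until the labeling budget runs out.
```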
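For contribution 3, here is a minimal sketch of the manifold-ranking computation the scheme builds on, using the standard closed form f* = (I - alpha*S)^(-1) y over a normalized affinity graph. The dissertation's pre-filtering and feature-selection modules are omitted, and the kernel width is a placeholder.

```python
# A minimal sketch of manifold ranking from a few positive examples.
import numpy as np

def manifold_ranking(X, positive_idx, sigma=1.0, alpha=0.99):
    """Rank all samples by relevance to the labeled positive samples."""
    # Affinity matrix with an RBF kernel (zero self-affinity).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    d = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d[:, None] * d[None, :]
    # Query vector: 1 on the labeled positives, 0 elsewhere.
    y = np.zeros(len(X))
    y[positive_idx] = 1.0
    # Closed-form ranking scores f* = (I - alpha*S)^(-1) y.
    return np.linalg.solve(np.eye(len(X)) - alpha * S, y)
```

The linear solve is O(n^3) in the number of samples, which illustrates why pre-filtering to a smaller candidate set matters at video scale.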
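Finally, a minimal sketch of the bag-to-vector idea behind MI-AdaBoost in contribution 4: each bag is mapped into a bag-level feature space, and a boosted classifier then performs feature selection while learning. The similarity-to-prototype mapping below is an assumption for illustration; the dissertation's existence-based mapping may differ.

```python
# Hypothetical bag-level feature mapping followed by AdaBoost, turning a
# multiple-instance problem into a standard single-instance one.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def bags_to_vectors(bags, prototypes, sigma=1.0):
    """Map each bag (an array of instance vectors) to a fixed-length vector:
    the maximum similarity of any instance in the bag to each prototype."""
    feats = []
    for bag in bags:
        d2 = ((bag[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
        feats.append(np.exp(-d2 / (2 * sigma ** 2)).max(axis=0))
    return np.vstack(feats)

# With bags now ordinary vectors, a single-instance learner applies; boosting
# over decision stumps picks out the informative bag-level features.
# X = bags_to_vectors(train_bags, prototypes)
# clf = AdaBoostClassifier(n_estimators=200).fit(X, bag_labels)
```

Boosting over such mapped features is also where a MIL-aware selector like EBMIL would operate, choosing among bag-level features derived from different raw feature sources.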
Keywords/Search Tags: Content-Based Video Retrieval (CBVR), Video Annotation, Video Genre Categorization, Multiple-Instance Learning (MIL), Feature Selection, Semi-Supervised Learning (SSL), Active Learning