
Research On Several Issues In Video Semantic Annotation

Posted on: 2009-01-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Tang
Full Text: PDF
GTID: 1118360242495761
Subject: Signal and Information Processing
Abstract/Summary:
Digital video collections have grown rapidly in recent years, driven by cheaper storage devices, higher transmission rates, and improved compression techniques. The demand for solutions to manage video databases is therefore increasing tremendously. A common theme is to develop automatic analysis techniques that derive metadata describing video content at the semantic level. With the help of such metadata, tools and systems for video retrieval, summarization, delivery, and manipulation can be built effectively. Automatic semantic annotation (also called high-level feature extraction in the TRECVID benchmark) of videos and video segments is a fundamental step in obtaining this metadata. When only visual information is considered, it is closely related to work on image annotation. Because manual annotation of a large video archive is labor-intensive and time-consuming, many learning-based automatic approaches have recently been proposed to annotate video shots with given concepts.

Although learning-based annotation has achieved some success in recent years, owing to the well-known "semantic gap" several problems remain, including training set construction, leveraging large amounts of unlabeled data, mining contextual knowledge in video data, and typicality ranking. This thesis studies these issues in depth and makes the following contributions:

(1) For training set construction, we argue that a training set covering most of the temporal and spatial distribution of the whole dataset will achieve satisfactory performance even when the training set is small, which greatly reduces manual labeling effort. To capture the geometric distribution characteristics of a given video collection, we propose four metrics for constructing an optimal training set, together with a set of optimization rules that capture the most distribution information of the whole data for a training set of a given size. Experiments conducted on a home video dataset show the effectiveness of these construction rules.

(2) We propose a novel graph-based semi-supervised learning (SSL) method, named Structure-Sensitive Anisotropic Manifold Ranking (SSAniMR), based on a structure-sensitive similarity measure. Instead of using Euclidean distance alone, SSAniMR takes local structural differences into account to measure pairwise similarity more accurately. Furthermore, we show that SSAniMR can also be derived from a partial differential equation (PDE) based anisotropic diffusion framework, which demonstrates that label propagation in SSAniMR is anisotropic and thus intrinsically different from the isotropic label propagation of general graph-based SSL methods. Experiments conducted on the TRECVID dataset demonstrate that SSAniMR outperforms SVM and other popular graph-based semi-supervised learning methods for video annotation.

(3) Motivated by the great success of the kernel trick in pattern recognition, a novel graph-based semi-supervised learning method named kernel linear neighborhood propagation (KLNP) is proposed and applied to video annotation. This approach combines the consistency assumption, the basic assumption in semi-supervised learning, with the locally linear embedding (LLE) method in a nonlinear kernel-mapped space. KLNP improves the recently proposed linear neighborhood propagation (LNP) method by addressing the limitation of its locally linear assumption on the distribution of semantics.
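To make the shared machinery concrete, the following is a minimal sketch of the generic graph-based label propagation that methods such as SSAniMR and KLNP refine. It implements the standard normalized-affinity manifold-ranking iteration with a plain Gaussian similarity; the function name, parameters, and the Euclidean kernel are illustrative assumptions, not the thesis's structure-sensitive or kernelized formulations, which replace exactly this similarity step.

```python
import numpy as np

def manifold_ranking(X, Y, alpha=0.99, sigma=1.0, n_iter=100):
    """Generic graph-based label propagation (manifold ranking).

    X : (n, d) feature matrix for labeled + unlabeled shots.
    Y : (n, c) initial label matrix; zero rows for unlabeled shots.
    Returns F : (n, c) propagated concept scores.
    """
    # Pairwise Gaussian affinity on Euclidean distance; SSAniMR would
    # replace this with a structure-sensitive, anisotropic similarity,
    # and KLNP would reconstruct neighborhoods in a kernel-mapped space.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt

    # Iterative propagation: spread scores over the graph while
    # anchoring each node to its initial labels Y.
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F
```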
(4) We exploit two kinds of contextual knowledge in video data, namely temporal consistency and semantic correlation, and incorporate each into a learning method. Two methods, a temporally consistent Gaussian random field and multi-relational graph-based label propagation, are proposed. Experiments conducted on the TRECVID dataset demonstrate that incorporating this contextual knowledge significantly improves annotation performance.

(5) We address the issue of typicality ranking for video annotation and propose a novel criterion, Average Typicality Precision (ATP), to replace the frequently used Average Precision (AP) for evaluating video annotation algorithms. General annotation methods care only about the number of true-positive samples at the top of the ranked list and ignore the order among those samples. We argue that it is more reasonable to rank "typical" true positives higher than non-typical ones, which is what ATP evaluates. A typicality ranking framework for video annotation is proposed. Besides, we also propose a multiple-instance semi-supervised typicality ranking method for natural scene annotation that combines multiple-instance learning and semi-supervised learning.
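The abstract does not spell out ATP's formula. As a sketch under that caveat, the code below contrasts standard AP with one plausible typicality-weighted reading of ATP, in which each true positive contributes in proportion to a typicality score in [0, 1]; when all scores equal 1 it reduces to AP. The function names and the weighting scheme are hypothetical illustrations, not the thesis's exact definition.

```python
import numpy as np

def average_precision(relevant):
    """Standard AP over a ranked list; relevant is a 0/1 array."""
    relevant = np.asarray(relevant, dtype=float)
    hits = np.cumsum(relevant)
    precisions = hits / (np.arange(len(relevant)) + 1)
    return (precisions * relevant).sum() / max(relevant.sum(), 1.0)

def average_typicality_precision(relevant, typicality):
    """Hypothetical typicality-weighted AP: each true positive is
    weighted by its typicality in [0, 1], so placing typical positives
    higher in the ranked list yields a larger score.
    (Illustrative only; the thesis's exact ATP may differ.)"""
    relevant = np.asarray(relevant, dtype=float)
    typicality = np.asarray(typicality, dtype=float)
    weights = relevant * typicality
    weighted_hits = np.cumsum(weights)
    precisions = weighted_hits / (np.arange(len(relevant)) + 1)
    return (precisions * weights).sum() / max(weights.sum(), 1e-12)
```

Under this reading, a ranked list that places high-typicality positives first accumulates weighted hits earlier and so scores higher, matching the motivation stated in contribution (5).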
Keywords/Search Tags: Video semantic annotation, training set construction, semi-supervised learning, label propagation, anisotropic, linear neighborhood propagation, kernel trick, temporal consistency, semantic correlation, typicality ranking