
Video Semantic Understanding With Multi-modality Feature Fusion And Variable Selection

Posted on: 2011-12-05 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y N Liu | Full Text: PDF
GTID: 1118360302474594 | Subject: Computer Science and Technology
Abstract/Summary:
With recent advances in computer technology and Internet applications, the number of multimedia files and archives has increased dramatically, and video data constitute the majority. Efficient and fast content-based video storage, management, indexing, browsing, and retrieval have therefore become important research topics. Video data carry rich semantics such as people, objects, events, and stories. In general, video is composed of three low-level modalities: image, audio, and text. These multiple modalities are in essence characterized by temporal associated co-occurrence (TAC). Exploiting the TAC of the multiple modalities of video data, this dissertation proposes effective feature fusion and variable selection schemes to better analyze video semantic content.

The interaction and integration of multi-modality media types such as visual, audio, and textual data are the essence of video content analysis, and a great deal of research has focused on utilizing multi-modality features for a better understanding of video semantics. This dissertation proposes a new approach to detecting semantic concepts in video using Co-Occurrence Data Embedding (CODE), SimFusion, and Locality Preserving Projections (LPP) on the temporally associated, co-occurring multimodal media data in video. CODE embeds objects of different types into the same low-dimensional Euclidean space based on their co-occurrence statistics. SimFusion is an effective algorithm for reinforcing and propagating similarity relations between multiple modalities. LPP is a dimensionality reduction method that combines the strengths of linear and nonlinear techniques by preserving local manifold structure with a linear projection. Our experiments show that employing these techniques improves the performance of video semantic concept detection and yields better video semantics mining results.

Traditionally, the multimodal media features in video are represented simply as concatenated vectors, whose high dimensionality causes the "curse of dimensionality". In addition, over-compression occurs when the sample vectors are very long and the number of training samples is small, resulting in a loss of information during dimension reduction. This dissertation proposes a higher-order tensor framework for video analysis and understanding. In this framework, the three modalities of a video shot, namely image frames, audio, and text, are represented as data points by a 3rd-order tensor. We then propose a novel video representation and dimension reduction method, called the TensorShot approach, which explicitly considers the manifold structure of the tensor space formed by temporally sequenced, associated co-occurring multimodal media data. Semi-supervised learning uses large amounts of unlabeled data together with the labeled data to build better classifiers. We propose a new transductive support tensor machine algorithm to train effective classifiers, along with an active-learning-based contextual and temporal post-refinement strategy to enhance detection accuracy. Our algorithm preserves the intrinsic structure of the submanifold from which tensor shots are sampled and can map out-of-sample data points directly; moreover, the use of unlabeled data yields better classifiers.
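The abstract does not spell out how a shot is packed into a 3rd-order tensor or how that tensor is reduced, so the following is only a minimal sketch under stated assumptions: each shot tensor is formed, hypothetically, as the outer product of image, audio, and text feature vectors, and a truncated higher-order SVD (HOSVD, listed in the keywords) performs the multilinear dimension reduction. All names, dimensions, and the `hosvd_compress` helper are illustrative, not the dissertation's actual TensorShot implementation.

```python
# Minimal sketch (not the thesis implementation): truncated HOSVD of a
# hypothetical 3rd-order "shot" tensor built from per-modality features.
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: arrange the tensor as a matrix with mode-n fibers as rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    out = M @ unfold(T, mode)
    shape = [M.shape[0]] + [T.shape[i] for i in range(T.ndim) if i != mode]
    return np.moveaxis(out.reshape(shape), 0, mode)

def hosvd_compress(T, ranks):
    """Truncated HOSVD: per-mode SVD of the unfoldings, then project onto the leading factors."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_dot(core, U.T, m)
    return core, factors

# Hypothetical per-modality features of one video shot (dimensions are made up).
img = np.random.rand(64)   # image-frame features
aud = np.random.rand(32)   # audio features
txt = np.random.rand(16)   # text features

# One possible 3rd-order shot tensor: the outer product of the three modality vectors.
shot = np.einsum('i,j,k->ijk', img, aud, txt)

core, factors = hosvd_compress(shot, ranks=(8, 8, 4))
print(core.shape)          # (8, 8, 4): a compact multilinear representation of the shot
```

The point of the sketch is only that the tensor representation keeps the three modalities in separate modes and compresses each mode with its own projection, rather than flattening everything into one long concatenated vector.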
Experimental results show that our method improves the performance of video semantic concept detection.

Based on compressive sensing and sparse representation theory, and drawing on the ideas of nonnegative matrix factorization and supervised learning, this dissertation develops a novel approach to image and video representation, classification, and retrieval, which we call group sparse representation. The basic idea is to represent a test image as a weighted combination of all the training images. In particular, we introduce two sets of weight coefficients, one for each training image and one for each class, the latter performing variable selection at the class level. Moreover, because image and video features are naturally nonnegative, we impose nonnegativity constraints on the coefficients, which makes the classifier an additive and more interpretable model. Specifically, we formulate this as a group nonnegative garrote model. The resulting representations are sparse and well suited to discriminant analysis.
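For concreteness, a standard group nonnegative garrote objective can be written as below, instantiated in the classification setting the abstract describes; the dissertation's exact formulation may differ in details such as the loss, the penalty weights, or whether the per-image coefficients are fixed from an initial estimate.

```latex
\min_{\{d_g\},\,\{\boldsymbol{\beta}_g\}}\;
  \Bigl\| \mathbf{y} - \sum_{g=1}^{G} d_g\, \mathbf{X}_g \boldsymbol{\beta}_g \Bigr\|_2^2
  + \lambda \sum_{g=1}^{G} d_g
\quad \text{s.t.}\quad d_g \ge 0,\;\; \boldsymbol{\beta}_g \ge \mathbf{0},\;\; g = 1,\dots,G
```

Here y is the feature vector of the test image, X_g stacks the training images of class g as columns, the β_g are the per-image weights, and d_g is the per-class weight; λ controls sparsity, and d_g = 0 removes class g entirely, which is the class-level variable selection described above.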
Keywords/Search Tags: multi-modality, temporal associated co-occurrence (TAC), subspace correlation propagation, TensorShot, higher-order SVD (HOSVD), active learning, compressive sensing, group sparse representation, nonnegative garrote, l1-minimization