
Research On Context-Based Audio And Video Annotation

Posted on: 2015-10-04  Degree: Doctor  Type: Dissertation
Country: China  Candidate: C C Zhong  Full Text: PDF
GTID: 1488304322450454  Subject: Signal and Information Processing
Abstract/Summary:
With great advances in computer and network technologies, multimedia data are growing explosively. To exploit and organize these massive data conveniently, researchers describe multimedia content in terms of low-level features, video structure, and semantic features. Among these descriptions, semantic features, which directly support human understanding, have received the most attention. Consequently, machine learning-based audio and video annotation, as the most effective and efficient way to derive such descriptions, is highly desired and has been extensively explored. However, owing to the well-known semantic gap between low-level features and high-level semantics, satisfactory annotation performance is difficult to achieve merely by improving the learning algorithms. It is therefore necessary to make full and effective use of the contextual cues underlying the rich content of audio and video, such as semantic correlation, temporal correlation, and multi-modal correlation, so as to bridge the semantic gap and enhance annotation.

Focusing on context-based audio and video annotation, this thesis analyzes the existing problems and conducts an in-depth study of the exploration, modeling, and exploitation of the three contextual cues mentioned above. The main contributions are as follows:

(1) We propose a new model, the Correlated-Aspect Gaussian Mixture Model, for multi-label audio concept detection, and explore a topic feedback-based keyword spotting method. Both aim to exploit semantic correlation-based contextual cues (such as the association between two annotation units), which are neglected in most cases, to boost audio annotation. Oriented to generic audio data, the former models concept correlation within the Gaussian Mixture Model framework, whereby the detection of concepts that are hard to detect is enhanced by those that are easily detected. For speech, the latter exploits topics derived from text categorization to model the original intentions of the speakers and uses them as high-level semantic context to refine the initial keyword spotting results. The effectiveness of this algorithm is demonstrated in a spoken document retrieval application.

(2) We propose a data-specific two-view concept correlation estimation procedure for video annotation refinement. Concept correlation is crucial as guidance for annotation, but the commonly used generic correlation, applied uniformly to all data, is not as useful in practice as expected. This procedure therefore infers the spatial and temporal concept correlations underlying a specific shot and shot pair, respectively, by formulating the estimation as a problem of data decomposition and reconstruction. The two types of data-specific correlations are incorporated into a probability calculation-based video annotation refinement scheme; experiments on the TRECVID 2006-2008 datasets show that they characterize the semantic content of specific data well and effectively refine the initial results produced by individual concept detectors (a minimal sketch of this refinement idea is given below).
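The following Python sketch is offered only as orientation to the general idea of correlation-driven score refinement: the variable names, the linear blending rule, and the parameter alpha are assumptions for illustration, not the thesis's actual decomposition-and-reconstruction formulation, which moreover derives data-specific spatial and temporal correlations rather than using a fixed generic matrix.

    # Illustrative sketch: refine initial per-concept scores with a concept
    # correlation matrix. All names and the update rule are assumed for
    # illustration; they are not taken from the thesis.
    import numpy as np

    def refine_scores(initial, corr, alpha=0.7):
        """Blend each concept's own score with evidence propagated from
        correlated concepts, then rescale to [0, 1]."""
        propagated = corr @ initial  # evidence contributed by correlated concepts
        refined = alpha * initial + (1 - alpha) * propagated
        span = refined.max() - refined.min()
        return (refined - refined.min()) / (span + 1e-12)

    # Toy example with three concepts, the first two strongly correlated.
    initial = np.array([0.9, 0.3, 0.1])
    corr = np.array([[1.0, 0.8, 0.0],
                     [0.8, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
    print(refine_scores(initial, corr))

In such a scheme, a concept with weak direct evidence (the second entry) is lifted by its correlation with a confidently detected concept (the first entry), which is the intuition behind letting easily detected concepts help the difficult ones.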
(3) We propose graph regularized probabilistic Latent Semantic Analysis with Gaussian Mixtures (GRGM-pLSA) to model video temporal consistency, and further present a feature conversion algorithm for video concept detection. Originating from pLSA with Gaussian Mixtures (GM-pLSA), GRGM-pLSA employs graph-based manifold regularization to model the previously neglected intrinsic interdependence between terms (a generic form of such a regularization term is sketched after contribution (4)). In this way, video temporal consistency, i.e., the fact that temporally consecutive video segments usually have similar visual content and express similar semantic meanings, can be modeled in terms of term correlation. Beyond feature mapping, GRGM-pLSA is also applied as a generative model: grounded in the contextual cue underlying video structure, a GRGM-pLSA-based visual-to-textual feature conversion algorithm is proposed, which offers a new perspective on applying probabilistic modeling-based annotation to video. Extensive experiments on YouTube and TRECVID datasets demonstrate the effectiveness of our approaches.

(4) We propose multi-modal pLSA with Gaussian Mixtures (MMGM-pLSA) to exploit the multi-modal correlation-based contextual cue, and extend it to a generalized model, graph regularized MMGM-pLSA (GRMMGM-pLSA). Because the multi-modal features extracted from one video segment are correlated with each other, a reasonable fusion scheme should preserve the characteristics of each modality as well as the intrinsic interdependence between them. To this end, MMGM-pLSA introduces multiple GMMs, each depicting the feature distribution of one modality, and is used for audio-visual fusion in classification-based video annotation. Furthermore, to capture the intrinsic correlation between multi-modal terms, GRMMGM-pLSA is derived as a generalization of GM-pLSA, our GRGM-pLSA, and MMGM-pLSA, and thus models the contextual cues of multiple modalities and temporal consistency simultaneously.
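For orientation, graph-based manifold regularization of the kind that GRGM-pLSA and GRMMGM-pLSA build on is commonly expressed as a Laplacian penalty added to the base model's log-likelihood. The exact objective, graph weights, and notation used in the thesis are given in the full text, so the LaTeX form below is only a generic sketch in which \lambda, W_{ij}, and f_i are assumed notation.

    % Generic graph-regularized objective (assumed notation, not the thesis's exact formula):
    % \mathcal{L}_{\mathrm{GM\text{-}pLSA}} is the base-model log-likelihood,
    % W_{ij} the affinity between terms i and j (e.g., larger for terms of
    % temporally adjacent segments), f_i the latent representation of term i,
    % and \lambda a trade-off weight.
    \max_{\Theta}\; \mathcal{L}_{\mathrm{GM\text{-}pLSA}}(\Theta)
      \;-\; \frac{\lambda}{2}\sum_{i,j} W_{ij}\,\lVert f_i - f_j \rVert^{2}

Under such a penalty, terms connected by large affinities are pushed toward similar latent representations, which is one way temporal consistency, and in the multi-modal extension cross-term correlation, can be encoded through the graph.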
Keywords/Search Tags: Audio annotation, video annotation, context, concept correlation, temporal consistency, multi-modality