
Semantic Information Extraction And Analysis Of Digital Video

Posted on: 2011-06-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Guo
Full Text: PDF
GTID: 1118330332478700
Subject: Military Intelligence
Abstract/Summary:
With advances in information processing techniques and the availability of low-cost multimedia recording devices, digital video collections in daily life have grown rapidly in recent years. This overwhelming amount of video data creates a demand for solutions to manage, organize, and search video databases. Video content analysis emphasizes processing, analyzing, and understanding video data at a conceptual level, which allows specific videos to be retrieved automatically, effectively, and efficiently. Unfortunately, most content analysis methods rely on low-level features or a single modality, which are quite different from the semantic concepts in human thought. Automatic techniques for indexing and understanding video with conceptual labels suffer from the difficulty of mapping low-level features directly to semantic concepts. This semantic gap has hampered the development of automatic processing in the multimedia domain.

Keeping pace with newly developed video content analysis techniques, this thesis focuses on generic techniques for extracting and analyzing video semantic information. To this end, the whole semantic space is abstracted hierarchically: visual, text, and audio semantic information are extracted in each layer, and all clues are fused to obtain a comprehensive understanding of video content. The main contributions of this thesis are summarized as follows:

1. A generic framework for video semantic content extraction and analysis using multimodal and multigranular information is proposed. The framework is not restricted to any particular genre of video data and can provide users with semantic concepts in different modal and granular spaces. It consists of two sub-models: a video content representation sub-model, which provides the theoretical foundation of the framework, and a technology sub-model, which offers the implementation methods for semantic extraction.

2. In accordance with human perception mechanisms, a multimodal fusion method based on multi-concept selection is proposed for video semantic analysis. The multimodal features of a video are divided into three basic groups, and semantic concepts are obtained within each group. A correlation-based multi-concept selection scheme is then applied to discard incorrect concepts. The fused feature vector is composed of a new value, the concept importance measurement, which replaces the raw probability (a toy sketch of this step appears below). Experiments comparing the new method with other fusion methods show that it exploits the temporal character of, and the correlation between, concepts and can effectively recover high-level semantic content.

3. Visual semantic extraction and analysis at the cognitive and affective levels is studied in depth. Following visual attention mechanisms, spatio-temporal salient regions are proposed to describe the content of a video shot. After analyzing the disadvantages of the support vector machine (SVM), a dynamic selective SVM ensemble classification model based on rough sets and clustering is proposed (see the sketch following this item); cognitive-level semantic concepts are then obtained with this improved classifier. To detect high-level affective semantics, affective features are extracted from keyframes and shots and recognized by the same SVM ensemble.
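To make contribution 2 concrete, the following is a minimal sketch of correlation-based multi-concept selection and fusion. The importance formula, the keep_ratio parameter, and all names are illustrative assumptions; the thesis's exact concept importance measurement is not reproduced here.

```python
# Minimal sketch of correlation-based multi-concept selection and fusion.
# The importance formula and parameters are assumptions for illustration,
# not the thesis's exact definitions.
import numpy as np

def fuse_concepts(scores, cooccur, keep_ratio=0.75):
    """scores: (n_modalities, n_concepts) per-modality concept probabilities.
    cooccur: (n_concepts, n_concepts) concept correlation matrix estimated
    from training annotations. Returns a fused 'concept importance' vector."""
    mean_score = scores.mean(axis=0)               # agreement across modalities
    support = cooccur @ mean_score                 # fit with correlated concepts
    importance = mean_score * support              # assumed importance measure
    # Multi-concept selection: zero out the weakest concepts.
    k = max(1, int(len(importance) * keep_ratio))
    cutoff = np.sort(importance)[-k]
    importance[importance < cutoff] = 0.0
    return importance / (importance.max() + 1e-9)  # normalized fusion vector

# Toy example: 3 modalities (visual, text, audio), 4 concepts.
scores = np.array([[0.9, 0.1, 0.7, 0.2],
                   [0.8, 0.3, 0.6, 0.1],
                   [0.7, 0.2, 0.8, 0.9]])   # audio disagrees on concept 4
cooccur = np.eye(4) + 0.3                   # assumed correlation estimates
print(fuse_concepts(scores, cooccur))
```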
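Contribution 3's classifier can be pictured as follows: a minimal sketch of a dynamically selected SVM ensemble, assuming bootstrap-trained members, a k-means partition of the training set, and per-cluster member selection. The rough-set feature reduction step from the thesis is omitted, and all parameters are illustrative.

```python
# Sketch of a dynamically selected SVM ensemble: members are trained on
# bootstrap samples, the training set is clustered, and at test time only
# the members that validated best on the query's nearest cluster vote.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_ensemble(X, y, n_members=7, n_clusters=5, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample
        members.append(SVC(kernel="rbf").fit(X[idx], y[idx]))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # Accuracy of each member on each training cluster, used for selection.
    acc = np.zeros((n_members, n_clusters))
    for c in range(n_clusters):
        mask = km.labels_ == c
        for m, svm in enumerate(members):
            acc[m, c] = (svm.predict(X[mask]) == y[mask]).mean()
    return members, km, acc

def predict(x, members, km, acc, top_k=3):
    c = km.predict(x.reshape(1, -1))[0]            # query's nearest cluster
    chosen = np.argsort(acc[:, c])[-top_k:]        # best members on that cluster
    votes = [members[m].predict(x.reshape(1, -1))[0] for m in chosen]
    return max(set(votes), key=votes.count)        # majority vote
```

The dynamic selection step is what distinguishes this from a plain bagged SVM: the committee changes per query, so members that are weak in one region of feature space do not drag down predictions there.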
Experimental results show that the proposed SVM ensemble scheme achieves satisfactory recognition capability, and that spatio-temporal salient regions correctly describe video content and accord with human perception.

4. A spatio-temporal algorithm is proposed to detect, locate, and extract overlay captions in digital video. The text detection algorithm adopts the EQSDD measurement and a binary-search procedure to determine the initial and final frames containing the same text within one shot (the search is sketched below). From these two frames, two frame-transition pairs and two difference images are obtained, and text regions are localized coarse-to-fine on the edge maps of the two difference images. To binarize the text regions, an automatic threshold selection algorithm based on region background complexity is proposed. Experimental results show that the algorithm achieves high detection speed and localization precision, and that the binarized text provides a sound foundation for OCR.

5. A generic technique for audio semantic extraction and analysis at the cognitive and affective levels is proposed. Using Gaussian mixture models, cognitive-level audio semantics are recognized in terms of sound genre (sketched below). The cognitive-level concepts are then classified into recessive, prominent, and neutral semantics according to the emotion carried by each sound genre. The recessive and prominent semantics are segmented into affective units of varying duration, in which audio segments of the same type are concatenated. Scene-level affective semantics are then extracted by synthetically analyzing the emotional state of all affective units.
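The binary search from contribution 4 can be sketched as below. Here same_text stands in for the EQSDD-based test of whether a frame carries the same caption as a reference frame; its definition, and the assumption that the caption occupies one contiguous frame range within the shot, are illustrative.

```python
# Sketch of the binary search that bounds the first and last frames carrying
# the same overlay caption within a shot. `same_text` is a stand-in for the
# EQSDD-based similarity test, whose exact definition is not reproduced here.

def find_caption_span(shot_len, ref, same_text):
    """ref: index of a frame known to contain the caption.
    same_text(i): True if frame i shows the same caption as frame `ref`.
    Returns (first, last) frame indices of the caption's lifetime."""
    # Leftmost frame with the caption: lower-bound search in [0, ref].
    lo, hi = 0, ref
    while lo < hi:
        mid = (lo + hi) // 2
        if same_text(mid):
            hi = mid          # caption already present at mid
        else:
            lo = mid + 1
    first = lo
    # Rightmost frame with the caption: upper-bound search in [ref, shot_len - 1].
    lo, hi = ref, shot_len - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if same_text(mid):
            lo = mid          # caption still present at mid
        else:
            hi = mid - 1
    return first, lo

# Toy check: caption visible on frames 40..120 of a 200-frame shot.
print(find_caption_span(200, 80, lambda i: 40 <= i <= 120))  # -> (40, 120)
```

Because each boundary is found in O(log n) similarity tests rather than a frame-by-frame scan, this kind of search is what gives the reported gain in detection speed.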
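Contribution 5's genre recognition step follows the standard one-GMM-per-class pattern; the sketch below assumes pre-computed frame-level features (e.g., MFCCs) and an illustrative genre list, and is not the thesis's exact model configuration.

```python
# Minimal sketch of GMM-based sound-genre recognition: one Gaussian mixture
# is fit per genre on training feature frames, and a new clip is labeled by
# the model with the highest total log-likelihood. Feature extraction and
# the genre list are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_genre_models(train_sets, n_components=8, seed=0):
    """train_sets: dict mapping genre name -> (n_frames, n_dims) feature array."""
    return {genre: GaussianMixture(n_components=n_components,
                                   covariance_type="diag",
                                   random_state=seed).fit(feats)
            for genre, feats in train_sets.items()}

def classify(clip_feats, models):
    """clip_feats: (n_frames, n_dims) features of one audio segment."""
    # score() returns the mean per-frame log-likelihood under each model.
    return max(models, key=lambda g: models[g].score(clip_feats))

# Toy example with synthetic 2-D 'features' for two genres.
rng = np.random.default_rng(0)
models = train_genre_models({
    "speech": rng.normal(0.0, 1.0, (500, 2)),
    "music":  rng.normal(3.0, 1.0, (500, 2)),
})
print(classify(rng.normal(3.0, 1.0, (50, 2)), models))  # likely 'music'
```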
Keywords/Search Tags:Digital video, Semantic analysis, Cognitive semantics, Affective semantics, Visual information, Text information, Audio information, Multigranular, Multimodal, Fusion