
Research On Key Techniques Of Visual Semantic Understanding

Posted on: 2020-06-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Xu
Full Text: PDF
GTID: 1488306131967169
Subject: Signal and Information Processing
Abstract/Summary:
In recent years, with the rapid development of Internet technology, and especially the spread of public security surveillance systems and online video sharing platforms, the amount of visual data has grown explosively. In the era of big data, how to automatically analyze the complex semantics contained in visual data, and how to advance from the recognition of independent semantic concepts to the generation of human-like language descriptions, have become important research topics in computer vision and artificial intelligence. Such techniques are also of significant practical value in fields such as public security and the regulation of the online culture market.

Given image or video data, visual semantic understanding aims to analyze different types of semantic concepts, such as objects, behaviors, and scenes, based on domain-specific knowledge, and to further generate human-like natural language descriptions. This maps the visual modality to the textual modality and narrows the visual semantic gap. Based on a thorough analysis of current domestic research, this thesis proposes a theoretical framework for visual semantic understanding and conducts in-depth research on three central scientific issues. 1) At the level of visual semantic concept recognition, to address the difficulty of modeling semantics with similar visual patterns, the thesis discovers and formulates multi-semantic implicit correlations to build an interaction mechanism between data and knowledge. The proposed method recognizes complex semantic concepts through dual associative modeling of visual-semantic and semantic-semantic relations. 2) At the level of human-like visual analysis, to address multi-modal fusion in natural language description generation, the thesis explores the complementary cues and fusion strategies within multi-modal data. Corresponding deep sequence generation networks and optimization algorithms are designed to automatically map visual content into human-like language descriptions. 3) To address the absence of evaluation-criteria guidance in the above data-driven visual analysis models, the thesis proposes a feedback mechanism that jointly learns the visual analysis models and the criteria models, grounding the human-like analysis models in objective evaluation criteria.

In view of the above framework, this thesis proposes solutions to the corresponding scientific issues. The main contributions are summarized as follows.

1. The thesis proposes a visual semantic concept modeling method based on multi-semantic implicit correlations. First, a sparse transfer learning method is proposed to learn a co-embedded space for multi-domain data. Then, multi-domain feature learning and multi-semantic concept modeling are jointly optimized under a multi-task learning mechanism, with an objective function designed to model the multi-semantic implicit correlations. The proposed method is evaluated on multi-view and multi-modal human action representation tasks, and extensive experiments demonstrate its effectiveness.

2. For multi-modal fusion in human-like visual analysis, the thesis studies three subproblems: the inherent associations among modalities, hierarchical attention over sequence data, and the asynchrony of multi-modal changes. Corresponding deep sequence generation networks are designed to discover complementary multi-modal cues and to fuse the multi-modal data from different perspectives.

3. For the absence of evaluation-criteria guidance in data-driven visual analysis modeling, the thesis proposes a reinforcement-learning-based human-like visual analysis framework with a multi-level policy network driven by a multi-level reward function. The framework can integrate current representative human-like analysis models, language metrics, and visual-semantic functions, and therefore offers strong generalization and flexibility. It is evaluated across different human-like analysis models and language metrics, and the experiments demonstrate its effectiveness.
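To make the third contribution concrete, the following is a minimal, self-contained sketch of the general idea of using a language metric as a reinforcement-learning reward for caption generation. The vocabulary, reference caption, unconditional bag-of-words policy, and unigram-precision "metric" are all illustrative assumptions for this sketch, not the thesis's actual multi-level policy network or reward function.

```python
import math
import random

# Toy setup (assumed for illustration): a policy over a tiny vocabulary is
# trained with REINFORCE so that an overlap metric, standing in for language
# metrics such as BLEU or CIDEr, directly supervises generation.
VOCAB = ["a", "man", "rides", "horse", "dog", "runs"]
REFERENCE = ["a", "man", "rides", "horse"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_caption(logits, length, rng):
    """Sample `length` words independently from the policy distribution."""
    probs = softmax(logits)
    caption = []
    for _ in range(length):
        r, acc = rng.random(), 0.0
        for word, p in zip(VOCAB, probs):
            acc += p
            if r <= acc:
                caption.append(word)
                break
        else:  # guard against floating-point rounding in the cumulative sum
            caption.append(VOCAB[-1])
    return caption

def reward(caption):
    """Toy language metric: unigram precision against the reference."""
    return sum(w in REFERENCE for w in caption) / len(caption)

def train(steps=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(VOCAB)
    baseline = 0.0  # moving-average baseline for variance reduction
    for _ in range(steps):
        caption = sample_caption(logits, len(REFERENCE), rng)
        advantage = reward(caption) - baseline
        baseline = 0.9 * baseline + 0.1 * reward(caption)
        probs = softmax(logits)  # pre-update distribution
        # REINFORCE: move sampled words' log-probabilities by the advantage.
        for w in caption:
            for i, v in enumerate(VOCAB):
                indicator = 1.0 if v == w else 0.0
                logits[i] += lr * advantage * (indicator - probs[i])
    return logits

logits = train()
probs = softmax(logits)
# After training, most probability mass should sit on reference words.
ref_mass = sum(p for w, p in zip(VOCAB, probs) if w in REFERENCE)
```

The moving-average baseline here is a simple variance-reduction choice; self-critical sequence training, a representative method in this line of work, instead uses the reward of the greedily decoded caption as the baseline.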
Keywords/Search Tags: Visual Semantic Understanding, Visual Semantic Concept Recognition, Human-Like Visual Analysis, Multi-Task Learning, Multi-Modal Fusion, Deep Sequence Generation Modeling, Deep Reinforcement Learning