
Research On Intelligent Semantics Generation For Visual Data

Posted on: 2021-04-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Bin    Full Text: PDF
GTID: 1368330647960768    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid digital transformation of society, visual data such as images and videos are constantly produced and accumulated in all aspects of modern life because of their direct and powerful expression of information. Beyond creating and sharing, people are increasingly interested in the rich semantic information underlying visual data. Analyzing the semantic information in visual data effectively and efficiently has therefore become an important research problem in computer vision. Visual semantic analysis and generation have recently received extensive research attention in various tasks, including image/video annotation, visual relationship analysis, and visual captioning. Visual captioning is an advanced form of visual semantic analysis and generation: it aims to explicitly explain visual content with a natural language description, so that the semantic information in the visual content is conveyed more clearly and directly.

In this thesis, towards effective description of visual content, we focus on the visual captioning task and study it from several aspects. First, we improve the visual representation for captioning by applying a bidirectional temporal modelling method to video snippets. Second, we propose an adaptive attention mechanism that discriminates between “visual concept words” and “function words” during description generation and selectively absorbs information from visual content and linguistic knowledge. Third, to improve the comprehensiveness and completeness of video captioning, given the rich information underlying videos, we introduce a novel captioning task, dubbed Multi-Perspective Video Captioning (MPVC); we comprehensively investigate and analyze MPVC through problem definition, data collection, solution, and evaluation. Finally, focusing on visual content of interest and semantic consistency, we propose to fill captions with masked visual entity slots by jointly understanding and analyzing vision and language. Specifically, the main contents of this thesis are as follows:

(1) This work presents an attention-based bidirectional long short-term memory (BiLSTM) model for video captioning. The proposed approach applies two LSTMs, combining the information of the forward and backward passes, to enhance the feature representation during video encoding. In addition, the BiLSTM integrates a temporal attention mechanism to select important video snippets during bidirectional encoding and language generation. This operation enables the BiLSTM to simultaneously utilize global and local video information and improves the local relevance between visual content and the generated description.
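To make the bidirectional encoding with temporal attention concrete, the following is a minimal PyTorch-style sketch, not the thesis implementation: the feature dimensions, module names, and the use of precomputed snippet features are assumptions made only for illustration.

import torch
import torch.nn as nn


class BiLSTMTemporalAttention(nn.Module):
    """Illustrative bidirectional video encoder with soft temporal attention."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Forward and backward passes over the snippet sequence.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim + hidden_dim, 1)

    def forward(self, snippets, decoder_state):
        # snippets: (batch, T, feat_dim) precomputed snippet features
        # decoder_state: (batch, hidden_dim) current language-model state
        enc, _ = self.encoder(snippets)                       # (batch, T, 2*hidden_dim)
        query = decoder_state.unsqueeze(1).expand(-1, enc.size(1), -1)
        scores = self.attn(torch.cat([enc, query], dim=-1))   # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)
        # The weighted sum emphasizes the snippets most relevant to the next word.
        context = (weights * enc).sum(dim=1)                  # (batch, 2*hidden_dim)
        return context, weights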
(2) This work presents a novel adaptive attention mechanism for language generation, which adaptively selects visual and linguistic information via a “visual gate”. We observe that, in visual captioning, notional words are usually related to visual content, while function words are more relevant to linguistic knowledge. However, the traditional attention mechanism ignores this difference between words and attends to the visual content for every word. To address this issue, we first obtain the linguistic knowledge for the current word by mapping all previous hidden states into a semantic embedding space, and then devise a “visual gate” that adaptively attends to visual information and linguistic knowledge for each word (a gating sketch is given after the abstract). The adaptive attention effectively improves the performance of word and description generation.

(3) This work presents a new captioning task, termed multi-perspective video captioning, which targets comprehensively describing a video from different perspectives. To this end, we first collect and annotate the VidOR-MPVC dataset, which includes 3,136 videos and 41,031 descriptions, for multi-perspective video captioning. We also propose a perspective-aware captioner employing recurrent neural networks to mine all the perspectives of a given video and describe the entire video from each perspective. Besides, we devise an evaluation strategy for the new MPVC task based on traditional captioning metrics, which evaluates the semantic similarity, completeness, and compactness of the MPVC results.

(4) This work presents a novel visual caption infilling task to study visual understanding and language consistency. Different from classic visual captioning, which “translates” an image into a natural language description, the proposed caption infilling task requires the model to simultaneously perceive the image and the slotted caption, and to generate an appropriate text snippet to fill in the slot. For the new visual caption infilling task, we construct an entity slot filling captioning dataset based on an existing captioning dataset, in which descriptions are slotted by masking visual entity text snippets (a masking sketch follows the abstract). We also propose an adaptive dynamic attention multi-modal fusion network to perceive the cross-modal information between vision and language and generate a text snippet that completes the slotted description.

At the end of this thesis, we briefly summarize these research works and discuss possible directions and ideas for further research.
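For contribution (2), the “visual gate” can be pictured as a learned sigmoid mixture between an attended visual context and a linguistic context built from past decoder states. The following PyTorch-style sketch is illustrative only; the projections, dimensions, and the mean pooling over past hidden states are assumptions, not the thesis architecture.

import torch
import torch.nn as nn


class VisualGate(nn.Module):
    """Illustrative gate mixing visual and linguistic context per word."""

    def __init__(self, hidden_dim=512, ctx_dim=512):
        super().__init__()
        # Hypothetical mapping of decoder states into a semantic embedding space.
        self.lang_proj = nn.Linear(hidden_dim, ctx_dim)
        self.gate = nn.Linear(hidden_dim + ctx_dim, 1)

    def forward(self, h_t, past_hiddens, visual_ctx):
        # h_t: (batch, hidden_dim) current decoder state
        # past_hiddens: (batch, t, hidden_dim) all previous decoder states
        # visual_ctx: (batch, ctx_dim) attended visual features
        lang_ctx = self.lang_proj(past_hiddens).mean(dim=1)   # linguistic knowledge
        beta = torch.sigmoid(self.gate(torch.cat([h_t, visual_ctx], dim=-1)))
        # beta near 1 leans on vision (notional words);
        # beta near 0 leans on language (function words).
        return beta * visual_ctx + (1.0 - beta) * lang_ctx, beta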
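For contribution (4), the dataset construction step can be illustrated by a simple slot-masking routine. This is only an assumption about how an entity span might be masked, not the released data tooling; the slot token and function name are hypothetical.

def mask_entity(caption, entity, slot_token="<SLOT>"):
    """Replace one visual entity mention with a slot token.

    Returns the slotted caption and the ground-truth snippet to be infilled.
    """
    if entity not in caption:
        raise ValueError(f"entity '{entity}' not found in caption")
    return caption.replace(entity, slot_token, 1), entity


# Example: a slotted training pair built from a plain caption.
slotted, target = mask_entity("a brown dog chases a red ball", "red ball")
# slotted == "a brown dog chases a <SLOT>", target == "red ball"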
Keywords/Search Tags: Visual captioning, bidirectional long short-term memory, adaptive attention mechanism, multi-perspective video captioning, visual caption infilling