
Research On Deep Image Captioning Technology With Semantic Guidance

Posted on: 2022-04-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Chen
Full Text: PDF
GTID: 1528307169977369
Subject: Software engineering
Abstract/Summary:
Image captioning is a cross-domain research task spanning computer vision and natural language processing, which aims to automatically generate natural language descriptions for input images. It comprises two parts: visual understanding and natural language generation. In the visual understanding part, the vision model must comprehend the image content comprehensively, e.g., objects, their attributes, and their relations, which are then encoded as features. In the language generation part, the captioning model uses the encoded visual features to bridge the image and text modalities, generating syntactically and semantically correct sentences. Image captioning has become a research hotspot in recent years because of its important theoretical significance and application value. Current research has achieved promising results; however, there is always a "semantic gap" between vision and text: the two modalities express semantics differently, and the span between them is difficult to unify. Therefore, image captioning remains a challenging research task. Based on an in-depth analysis and discussion of the image captioning task, and aiming to solve four key technical problems, we combine deep computer vision and natural language processing techniques to study image captioning from the perspective of semantic guidance, so as to generate semantically rich captions. The research focuses on the following problems: semantic mining and representation, global semantic perception, and knowledge utilization. Note that, for the multi-feature fusion problem, each research method explores different fusion techniques designed for its semantic characteristics. The main contributions of this thesis are as follows:

1) Aiming at the problem of semantic representation and mining, we propose a topic-guided rethinking method for image captioning (Hierarchical Topic semantics-Guided Image Captioning, HTGIC). Unlike semantic representations obtained by manual extraction and annotation, topics are annotated by an unsupervised topic model, which greatly reduces the dependence on manual annotation and is transferable. Compared with simple semantic representations such as attributes and concepts, hierarchical topic features contain richer semantic information. To introduce hierarchical topic semantics effectively and further enhance the semantic expression ability of the captioning model, we design an attention-based "rethinking" mechanism that models visual and hierarchical topic features interactively and exploits the complementarity between them during generation. Through rethinking, the proposed captioning model achieves optimized visual attention and generates better captions.

2) To obtain global semantic perception, a novel image captioning method based on global semantics is proposed (Global Semantics-Guided Image Captioning, GSGIC). Due to the nature of sequence models, existing works mostly generate the current word based on the previously generated word and the current model state. However, these methods fail to perceive the subsequent semantics and thus cannot grasp the global semantics of the image to guide the generation of more accurate words. This work explores the effective perception of global semantics and proposes to use a pre-trained base model to generate an initial caption describing the overall content of the image. The initial caption is encoded as the global semantic feature. This work then designs a Temporal-Free Semantic-Guided attention mechanism (TFSG) to grasp the global semantics of the image in a two-round generation process. Specifically, in the second, optimization round, new words are generated based on the global semantics of the first round. The captioning model thereby selectively optimizes and updates the current words, improving the quality of the final caption.

3) Aiming at the problem of utilizing internal prior knowledge, an AMR-based captioning method is proposed (Abstract Meaning Representation-based Image Captioning, AMRIC). Advanced methods mostly rely on image visual features and implicit relationship modeling, while ignoring explicit relationship modeling and being unable to expand related semantics. To solve this problem, we propose to use abstract meaning representation (AMR) for image captioning from the perspective of structured semantic guidance and the association of internal prior knowledge. By encoding and modeling the visual and abstract semantic representation features of an image, and exploiting both implicit and explicit relationships, multi-feature fusion is realized to effectively improve the performance of the captioning model. Furthermore, from the perspective of semantic association, an internal AMR prior knowledge graph is constructed from the labeled image caption corpus of the training set. Using the predicted AMR node pairs of an image as queries, association information is obtained through path finding in the AMR prior knowledge graph and serves as expanded semantics, realizing fusion modeling of the image's visual features and internal prior knowledge. This provides richer semantic guidance for caption generation and enhances the semantic richness of the generated captions.

4) Aiming at the problem of utilizing external commonsense knowledge, a captioning method based on commonsense knowledge is proposed (General Knowledge-Enhanced Image Captioning, GKEIC). Mainstream research methods currently address the mapping between image and text space and use the encoded image information to generate captions word by word. However, in some cases, the generated captions miss important words or contain incorrect information. Considering that existing large-scale commonsense knowledge graphs (e.g., ConceptNet) contain rich relationships between concepts (or entities), this problem can be addressed from the perspective of semantic association. The AMR-based method above explored the effectiveness of internal knowledge (AMR prior knowledge) for image captioning; this work further considers expanded semantics based on external commonsense knowledge. Specifically, based on pre-generated concept pairs, the associated information is first extracted from the commonsense knowledge graph. Then, we design a multi-feature fusion attention mechanism to enhance the encoding of visual and external commonsense knowledge features. Through the commonsense semantic expansion features extracted from the limited pre-generated information, more relevant semantics are introduced into the proposed captioning model, generating more accurate and semantically richer captions.

To sum up, this thesis focuses on four key problems: semantic mining and representation, global semantic perception, knowledge utilization, and multi-feature fusion. For semantic mining and representation, a topic-guided rethinking method for image captioning is proposed (Chapter 3); for global semantic perception, a novel image captioning method based on global semantics is proposed (Chapter 4); for knowledge utilization, an AMR-based captioning method (Chapter 5) and a novel captioning method based on commonsense knowledge (Chapter 6) are proposed. Because the proposed methods comprehensively consider both visual and semantic features, the multi-feature fusion problem runs through all of our work: according to the semantic characteristics used in each research method, we design different fusion methods to better encode multiple features.
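The attention-based "rethinking" of contribution 1 can be illustrated with a minimal sketch: the decoder state first attends over hierarchical topic features, and the resulting topic-informed query then re-attends over the visual features. The function names (`attend`, `rethink_step`) and the additive query refinement are hypothetical simplifications for illustration, not the thesis's actual architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    # Scaled dot-product attention over a list of feature vectors.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def rethink_step(state, visual_feats, topic_feats):
    # First pass: attend to hierarchical topic features to form a semantic context.
    topic_ctx = attend(state, topic_feats, topic_feats)
    # "Rethinking": re-attend to visual features with the topic-informed query,
    # letting topic semantics reshape the visual attention distribution.
    refined_query = [s + t for s, t in zip(state, topic_ctx)]
    return attend(refined_query, visual_feats, visual_feats)
```

Because the refined query mixes in the topic context, the second visual attention pass is biased toward regions consistent with the inferred topics.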
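Contribution 2's Temporal-Free Semantic-Guided attention can be sketched as plain attention over the encoded first-round caption: unlike causal decoding, the query at every second-round step may attend to all first-round tokens, including those "in the future". The helper names (`tfsg_context`, `first_round_feats`) are assumptions made for this sketch.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def tfsg_context(decoder_state, first_round_feats):
    # Attend over ALL first-round token features with no causal mask,
    # so the second-round decoder perceives global sentence semantics
    # rather than only the already-generated prefix.
    d = len(decoder_state)
    scores = [sum(q * k for q, k in zip(decoder_state, f)) / math.sqrt(d)
              for f in first_round_feats]
    weights = softmax(scores)
    return [sum(w * f[i] for w, f in zip(weights, first_round_feats))
            for i in range(d)]
```

The returned context vector summarizes the whole initial caption and can then be fed into the second-round word predictor alongside the usual visual context.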
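The path-finding step of contribution 3 (querying the internal AMR prior knowledge graph with a predicted node pair) can be sketched as a breadth-first search: the intermediate nodes on a short path between the two query concepts serve as the expanded semantics. The graph shown and the hop limit are illustrative assumptions, not the thesis's actual graph construction.

```python
from collections import deque

def find_expansion(graph, src, dst, max_hops=3):
    # BFS over the AMR prior knowledge graph (adjacency-list dict).
    # Returns the shortest path from src to dst within max_hops, or None;
    # intermediate nodes on the path act as expanded semantic cues.
    queue = deque([(src, [src])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        if len(path) > max_hops:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None
```

For a predicted node pair ("dog", "play"), a path such as dog → ball → play would contribute "ball" as an expanded semantic feature for caption generation.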
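One common way to realize the multi-feature fusion of contribution 4 is an element-wise gate that arbitrates between the visual context and the external commonsense context. This is a generic gated-fusion sketch under assumed scalar weights (`w_v`, `w_k`, `bias`); the thesis's actual fusion attention mechanism is not specified at this level of detail.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual_ctx, knowledge_ctx, w_v, w_k, bias):
    # Element-wise gate computed from both contexts: a gate near 1 trusts
    # visual evidence, a gate near 0 trusts external commonsense knowledge.
    fused = []
    for v, k, wv, wk, b in zip(visual_ctx, knowledge_ctx, w_v, w_k, bias):
        g = sigmoid(wv * v + wk * k + b)
        fused.append(g * v + (1.0 - g) * k)
    return fused
```

In a full model the gate parameters would be learned, letting the captioning decoder fall back on commonsense associations exactly where the visual evidence is weak.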
Keywords/Search Tags:Image captioning, Deep neural networks, Topic model, Attention mechanism, Semantic guidance, Abstract meaning representation, Internal AMR prior knowledge, External commonsense knowledge, Reinforcement learning