
Research on Image Captioning Methods Based on Causal Inference and Part-of-Speech Tagging

Posted on: 2024-02-07 | Degree: Master | Type: Thesis
Country: China | Candidate: D Wang | Full Text: PDF
GTID: 2568307118482494 | Subject: Computer technology
Abstract/Summary:
Image captioning, a conditional generation task that aims to generate grammatically correct descriptions of images, has attracted great attention in the field of image understanding. The task is challenging because it requires first recognizing the objects in a given image and the relationships between them, and then properly organizing and describing them in natural language. Owing to its wide applications in image content retrieval, human-computer interaction, smart healthcare, and related fields, image captioning has become an active research topic. In recent years, with the advent of deep learning, the encoder-decoder framework inspired by neural machine translation has been widely adopted and has achieved great progress in image captioning. However, most popular deep learning methods focus on the correlation between the image and text modalities while ignoring causality, and the way grammatical structure governs caption generation remains unmodeled, which makes these methods difficult to generalize and apply in practical scenarios. To promote the practical application of image captioning methods, this thesis focuses on spurious correlation, disordered grammatical structure, and single mapping in image captioning tasks, surveys the current state of image captioning research, and proposes corresponding image captioning methods. The main contributions of this thesis are as follows:

(1) To alleviate the spurious correlation between the image and text modalities, this thesis proposes a novel image captioning framework based on confounder decomposition and causal inference, which consists of a visual confounder-oriented feature extraction model and a visual-and-linguistic confounder decomposition model that jointly confront the visual and linguistic confounders. In the feature extraction stage, the visual confounder-oriented feature extraction model disentangles the region-based visual features by deconfounding the visual confounder. In the captioning stage, the visual-and-linguistic confounder decomposition model introduces causal intervention into the transformer-based framework and deconfounds the visual and linguistic confounders simultaneously (a minimal code sketch of such an intervention is given below).

(2) To address the disordered grammatical structure of image descriptions, this thesis proposes a part-of-speech guided transformer framework for image captioning, which boosts captioning performance by separating syntax and semantics in the prediction of each word. First, a visual sub-encoder, a language sub-encoder, and a self-attention part-of-speech prediction network model the image, its corresponding caption sentences, and the part-of-speech sequences to obtain visual features, language signals, and part-of-speech information, respectively. Subsequently, a part-of-speech guided attention mechanism is proposed that enables the decoder to adaptively attend to the visual features and language signals according to the part-of-speech information provided by the part-of-speech predictor, thereby guiding the generation of the next word (see the sketch below).
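To make the causal intervention in method (1) concrete, the following is a minimal, hypothetical sketch rather than the thesis's actual implementation: it approximates the backdoor adjustment P(y | do(x)) = Σ_z P(y | x, z) P(z) by softly attending over a dictionary of confounder prototypes. The class name, dimensions, and the uniform prior are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackdoorAdjustmentLayer(nn.Module):
    """Hypothetical sketch: approximate P(y | do(x)) by softly averaging
    over a dictionary of confounder prototypes z (backdoor adjustment)."""

    def __init__(self, feat_dim: int, num_confounders: int):
        super().__init__()
        # Confounder dictionary, e.g. class-wise mean visual features;
        # here a learnable placeholder.
        self.confounders = nn.Parameter(torch.randn(num_confounders, feat_dim))
        # P(z), assumed uniform for this sketch.
        self.prior = nn.Parameter(torch.ones(num_confounders) / num_confounders,
                                  requires_grad=False)
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, regions, feat_dim) region-based visual features
        q = self.query(x)                               # (B, R, D)
        k = self.key(self.confounders)                  # (Z, D)
        attn = torch.einsum('brd,zd->brz', q, k) / q.size(-1) ** 0.5
        # Weight the attention over z by the prior P(z): E_z[P(y|x,z) P(z)]
        weights = F.softmax(attn, dim=-1) * self.prior  # (B, R, Z)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        z_ctx = torch.einsum('brz,zd->brd', weights, self.confounders)
        return x + z_ctx  # deconfounded features fed to the captioner
```

The weighted average over prototypes stands in for the sum over z in the backdoor formula; the thesis additionally deconfounds linguistic confounders, which this sketch omits.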
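For method (2), here is a similarly hedged sketch of one plausible form of part-of-speech guided attention: a predicted POS distribution produces a gate that balances how much the decoder draws on visual features versus language signals at each step. The module names and the scalar-gate design are assumptions for illustration, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class POSGuidedAttention(nn.Module):
    """Hypothetical sketch: use a part-of-speech distribution to balance
    attention over visual features and language signals per decoding step."""

    def __init__(self, hid_dim: int, num_pos_tags: int):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(hid_dim, num_heads=8,
                                                 batch_first=True)
        self.lang_attn = nn.MultiheadAttention(hid_dim, num_heads=8,
                                               batch_first=True)
        # Map the POS distribution to a scalar gate in [0, 1].
        self.pos_gate = nn.Sequential(nn.Linear(num_pos_tags, 1), nn.Sigmoid())

    def forward(self, h, visual_feats, lang_signals, pos_logits):
        # h: (B, 1, D) decoder state; pos_logits: (B, num_pos_tags)
        v_ctx, _ = self.visual_attn(h, visual_feats, visual_feats)
        l_ctx, _ = self.lang_attn(h, lang_signals, lang_signals)
        g = self.pos_gate(F.softmax(pos_logits, dim=-1)).unsqueeze(1)  # (B,1,1)
        return g * v_ctx + (1.0 - g) * l_ctx  # context for next-word prediction
```

Intuitively, content words such as nouns and adjectives should push the gate toward visual features, while function words should lean on the language signal.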
(3) To address the issues that existing transformer-based image captioning models are limited to a single image-to-caption mapping and to handcrafted evaluation metrics, this thesis proposes an end-to-end conditional variational Transformer framework with introspective adversarial learning, which consists of a variational inference encoder and a generator. Specifically, the variational inference encoder employs a sequential variational autoencoder to learn a latent space for each word and model one-to-many relationships between the image space and the caption space. Meanwhile, introspective adversarial learning is introduced into the captioning framework, which enables the framework to self-assess the generated captions without extra discriminators and encourages the generation of high-quality captions.

Experimental results show that methods (1) and (2) significantly boost the performance of Transformer-based image captioners and surpass previous state-of-the-art records on the MSCOCO dataset. In addition, method (3) is superior to current mainstream image captioning methods in terms of caption quality and diversity.
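As an illustration of the one-to-many mapping in method (3), the following sketch shows a per-word conditional variational step with the standard reparameterization trick; it follows the generic sequential-CVAE recipe, not necessarily the thesis's exact architecture, and all names are assumptions.

```python
import torch
import torch.nn as nn

class WordLevelCVAEStep(nn.Module):
    """Hypothetical sketch: a sequential-CVAE step that samples a latent
    code per word, enabling diverse captions for the same image."""

    def __init__(self, hid_dim: int, latent_dim: int, vocab_size: int):
        super().__init__()
        self.to_mu = nn.Linear(hid_dim, latent_dim)      # posterior mean
        self.to_logvar = nn.Linear(hid_dim, latent_dim)  # posterior log-variance
        self.out = nn.Linear(hid_dim + latent_dim, vocab_size)

    def forward(self, h: torch.Tensor):
        # h: (B, D) decoder state conditioned on the image and previous words
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        logits = self.out(torch.cat([h, z], dim=-1))             # next-word scores
        # KL term against a standard normal prior, used in the training loss
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return logits, kl
```

Sampling different z for the same decoder state yields distinct but plausible next words, which is the source of caption diversity; the introspective adversarial component, which lets the framework score its own generations without an extra discriminator, is omitted here.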
Keywords/Search Tags: transformer, image captioning, causal inference, part-of-speech, conditional variational autoencoder