Neural Networks Based Image Captioning Models For Obtaining Accurate Descriptions

Posted on: 2024-09-11

Degree: Doctor

Type: Dissertation

Institution: University

Candidate: Amr Abdussalam Sallam

GTID: 1528306932458794

Subject: Information and Communication Engineering

Abstract/Summary:

Image captioning has attracted considerable interest in recent years from researchers in both natural language processing and computer vision. It is the task of producing semantically and syntactically meaningful annotations for an image, and it is considered challenging because it requires a comprehensive understanding of the image’s rich contents. In general, image captioning serves as a link between computer vision and natural language processing. The model needs to correctly capture the salient objects in the image, recognize the objects’ characteristics and attributes, and express the interactions between the detected objects. Typically, image captioning networks are based on the encoder-decoder framework. The encoder is a CNN-based module that extracts visual features and representations from the input image, while the decoder is an RNN-based module responsible for generating the image’s textual description.

Despite the considerable progress achieved by previous image captioning methods, they still lag in some aspects, including the generation of multiple diverse captions for a single image and the efficient exploitation of the input image’s visual representations. In our research, we aim to design image captioning models that can obtain multiple annotations for a single image, build image captioning networks that effectively utilize the dataset images similar to the input image, and develop image descriptors that exploit the visual features of images efficiently. These proposed methods can contribute to boosting the performance of image captioning models and generating high-quality descriptions.

The first proposed work is a number-controlled multi-caption image captioning model. Existing image captioning models are primarily trained to generate one caption per image. However, an image may contain rich contents, and one caption cannot express its full details. A better solution is to describe an image with multiple captions, with each caption focusing on a specific aspect of the image. In this regard, we introduce a new number-based image captioning model that describes an image with multiple sentences. An image is annotated with multiple ground-truth captions; thus, we assign an external number to each caption to distinguish its order. Given an image-number pair as input, we can obtain different captions for the same image under different numbers. First, a number is attached to the image features to form an image-number vector (INV). Then, this vector and the corresponding caption are embedded using the order-embedding approach. Afterward, the INV’s embedding is fed to a language model to generate the caption. Incorporating numbers equips the proposed model with the ability to leverage the multiple ground-truth descriptions available for each training image to produce distinct captions for the input image. To show the efficiency of the numbers incorporation strategy, we conduct extensive experiments on the MS-COCO, Flickr30K, and Flickr8K datasets. The results demonstrate that our method is competitive with a range of state-of-the-art models and validate its ability to produce different descriptions under different given numbers. In addition, a qualitative analysis of the generated captions confirms the model’s ability to produce distinct, high-quality captions for an image.

Second, we propose a visually-directed image captioning model using common images. Most image captioning models leverage only the input image’s visual features to produce the image’s description, leaving the visual features of similar images in the dataset unused. Exploiting the visual features of images similar to the current input image helps enrich the overall visual semantics of the model and improve its performance. To achieve this goal, we introduce an image captioning model that utilizes the input image’s visual features together with the visual features of its similar images in the dataset to generate an informative caption. First, a set of images similar to most of the training set’s images, named common images (CIs), is selected. Then, the nearest CIs (NCIs) to the input image are chosen and fed together with the input image into the image captioning model. The proposed framework contains three primary modules: the NCI detector (NCID), the NCI visual attention (NCIVA), and the NCI concatenator (NCIC). The NCID selects the NCIs for the input image. The NCIVA attends to the visual features of either the input image or the NCIs in accordance with the current context. The NCIC fuses the visual information of both the input image and the NCIs. The NCIs provide additional visual semantics that can be combined with the input image’s features to direct the captioning network toward better captions. We conduct our experiments on the benchmark MS-COCO dataset. The results indicate that incorporating CIs into the captioning task helps direct the captioning network to generate more descriptive expressions and improves its performance in terms of standard evaluation metrics. Additionally, a qualitative analysis of the generated captions demonstrates the model’s ability to produce high-quality descriptions for images.

Our third proposed work is an image captioning network with a visual vectors selector. The visual features of input images play a significant role in producing high-quality captions. Most previous works use visual attention to adaptively attend to the local representations of the input image and locate the most relevant image region at every time step. However, giving the image captioning model the ability to individually reweight the generated context vectors along with the extracted visual representations, and to selectively attend to them, can promote the utilization of the input image’s visual features, which in turn boosts the captioning network’s performance. To this end, we propose an image captioning framework that leverages the image’s extracted representations efficiently. Our framework is made up of three components: the extended visual attention module (EVA), the visual vectors selector module (VVS), and the language model. The EVA attends to the local visual features, yielding two context vectors, i.e., the weighted context vector (WCV) and the selected context vector (SCV). The VVS individually reweights the resulting context vectors along with the image’s global visual representation, and adaptively attends to them to produce the refined context vector (RCV). Then, the language model uses the RCV to generate an informative description. The VVS module provides an additional level of processing for the visual features, which contributes to improving the image captioning model’s performance. We conduct extensive experiments on the well-known MS-COCO benchmark dataset. The scores demonstrate that the proposed framework is competitive with several recent state-of-the-art methods, and verify the ability of our captioning model to effectively exploit the image’s visual representations and improve performance with respect to the evaluation metrics. Furthermore, a qualitative analysis of the generated captions demonstrates that the proposed model is capable of generating high-quality and expressive descriptions.

Overall, in this dissertation, we introduce three novel neural-network-based image captioning models. The proposed approaches can contribute to generating multiple and diverse annotations for images, effectively utilizing the visual features of images, improving the quality of the produced descriptions, and enhancing the performance of captioning models in terms of the standard evaluation metrics. Additionally, qualitative analysis of the proposed methods shows that the quality of the captions they produce is comparatively high.
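
The two-stage selection of common images and nearest common images described in the abstract can be illustrated with a minimal sketch. The abstract does not state how similarity is measured or how many images are kept, so the cosine similarity, the mean-similarity ranking, and the parameters `m` and `k` below are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_common_images(features, m):
    """Return indices of the m images most similar, on average, to the rest.

    Assumption: "similar to most of the training set" is approximated by the
    mean cosine similarity to every other image.
    """
    scores = []
    for i, f in enumerate(features):
        others = [cosine(f, g) for j, g in enumerate(features) if j != i]
        mean_sim = sum(others) / len(others) if others else 0.0
        scores.append((mean_sim, i))
    # Highest mean similarity first; ties broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:m]]


def nearest_common_images(query, features, ci_indices, k):
    """Return the k common images closest to the query image's features."""
    ranked = sorted(ci_indices, key=lambda i: -cosine(query, features[i]))
    return ranked[:k]


# Toy 2-D features standing in for CNN features (hypothetical data).
train_features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
cis = select_common_images(train_features, m=2)
ncis = nearest_common_images([1.0, 0.0], train_features, cis, k=1)
```

In this toy example the outlier image at index 3 is never chosen as a common image, and the query is matched to the common image whose features point in the most similar direction.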

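The reweight-then-attend behaviour attributed to the visual vectors selector (VVS) in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the scalar sigmoid gates, the dot-product scoring against a decoder hidden state, and the function name `refine_context` are hypothetical choices, not details given in the abstract.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def refine_context(wcv, scv, global_feat, hidden, gates):
    """Blend WCV, SCV and the global feature into a refined context vector.

    Assumptions: each candidate gets its own scalar gate (a stand-in for a
    learned reweighting), and attention scores are dot products with the
    decoder hidden state. All vectors share the same length.
    """
    candidates = [wcv, scv, global_feat]
    # Step 1: reweight each candidate individually with a gate in (0, 1).
    reweighted = [[sigmoid(g) * x for x in vec]
                  for g, vec in zip(gates, candidates)]
    # Step 2: attend over the reweighted candidates given the current context.
    scores = [sum(a * b for a, b in zip(vec, hidden)) for vec in reweighted]
    alphas = softmax(scores)
    # Step 3: the refined context vector is the attention-weighted sum.
    rcv = [sum(a * vec[d] for a, vec in zip(alphas, reweighted))
           for d in range(len(hidden))]
    return rcv, alphas


rcv, alphas = refine_context(
    wcv=[1.0, 0.0], scv=[0.0, 1.0], global_feat=[1.0, 1.0],
    hidden=[1.0, 0.0], gates=[0.0, 0.0, 0.0],
)
```

With these toy inputs the candidates that align with the hidden state receive larger attention weights than the one that does not, which is the selective behaviour the abstract describes.
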
Keywords/Search Tags:

Numbers Incorporation Strategy, Encoder-Decoder Framework, Image Captioning, Natural Language Processing, Order-Embedding, Common Images, Nearest Common Images, Visual Semantic Embeddings, Visual Attention, Visual Vectors Selector, Context Vector

Related items

1	Research On Image Captioning Algorithm Guided By Attention And Visual Common Sense
2	Hierarchical Visual Semantic Embedding For Image Captioning
3	Research On Video Captioning Methods Based On Encoder-decoder Structure
4	Research On The Theory And Method Of Visual Captioning
5	Multimodal Natural Language Generation For Human-computer Interaction
6	The Research On Visual Captioning Based On Attention Mechanism
7	Image Captioning Based On Adaptive Visual Attention Mechanism
8	Research On Image Description Generation Based On Visual Attention
9	Research On Image Captioning Method Based On Temporal Collaboration Attention
10	Research On Semantic-Attentive Deep Image Captioning Method