Research On Image Captioning Based On Image Feature Fusion

Posted on:2024-05-27

Degree:Master

Type:Thesis

Country:China

Candidate:Q Li

GTID:2568307091497294

Subject:Computer technology

Abstract/Summary:

Image captioning is the task of having a computer generate natural-language descriptions of the visual content of an image. It requires a comprehensive understanding of the visual entities in the input image and their interactions, as well as the establishment of fine-grained visual cues and of the hidden correlations among the generated words. With the development and application of deep neural networks in computer vision and natural language processing, new methods have inspired researchers to explore the intersection of these previously separate fields.

The first challenge in image captioning is to develop effective image encoding methods, i.e., to extract features that describe the image content. Many studies have improved the visual perception of captioning models by improving image feature extraction, especially through the introduction of attention mechanisms. Attention allows the model to focus on important information and reduces interference from irrelevant visual information, which greatly enhances captioning performance. However, current attention mechanisms still suffer from incorrect localization: the attention model may fail to focus on the region relevant to the current time step, or may even attend to non-salient background regions, causing the language model to generate incorrect target words. In addition, current image feature extraction methods inevitably downsample the original input hierarchically and produce a single global or local feature, so fine-grained information is lost to varying degrees. As a result, generated image descriptions often overlook important details.

To address these problems, this thesis makes the following contributions:

(1) A mixed attention-based image captioning method is proposed to address inaccurate attention. The method combines machine attention with human descriptive attention: it encodes the rich information that humans perceive when performing the image captioning task and uses it to reweight the bottom-up attention. This mitigates the problem of "illusory" (hallucinated) descriptions generated by the model and improves the diversity of image descriptions. Experimental results on the MS COCO dataset show that the method improves the performance of existing image captioning methods.

(2) A multi-granularity feature fusion-based image captioning method is proposed to address the loss of fine-grained image information. The method uses a visual transformer to extract multi-granularity image features and introduces a dynamic feature fusion mechanism that fuses features of different granularities, retaining deep image feature information while avoiding the loss of image detail. Experimental results on the MS COCO dataset show that the method improves the performance of image captioning methods.
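The abstract does not give formulas for either contribution, so the following is only a minimal sketch of the two general ideas, not the thesis's method. It assumes that attention reweighting is a product of machine attention weights with a human-attention prior followed by renormalization, and that dynamic fusion is a per-dimension sigmoid gate blending a coarse and a fine feature vector. The names `reweight_attention`, `gated_fusion`, `human_prior`, and `gate_logits` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_attention(machine_scores, human_prior, eps=1e-8):
    """Reweight machine attention over image regions by a human prior.

    Assumption: the combination is an element-wise product followed by
    renormalization; the thesis does not state its actual formula.
    """
    machine = softmax(machine_scores)
    mixed = [m * (h + eps) for m, h in zip(machine, human_prior)]
    total = sum(mixed)
    return [w / total for w in mixed]


def gated_fusion(coarse, fine, gate_logits):
    """Fuse two same-length feature vectors with a per-dimension gate.

    Assumption: a gate g in (0, 1) blends the inputs as
    g * coarse + (1 - g) * fine. In a trained model the gate logits
    would be predicted from the features; here they are passed in.
    """
    fused = []
    for c, f, z in zip(coarse, fine, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        fused.append(g * c + (1.0 - g) * f)
    return fused


if __name__ == "__main__":
    # Three regions: the machine prefers region 0, the human prior region 1.
    weights = reweight_attention([2.0, 1.0, 0.5], [0.1, 0.8, 0.1])
    print([round(w, 3) for w in weights])

    # Zero logits give g = 0.5, i.e. a plain average of the two inputs.
    print(gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0]))  # [0.5, 0.5]
```

In this toy setting the prior shifts the largest attention weight from the region the machine favoured to the one the human prior favours, which is the kind of correction the mixed-attention idea aims for. The gate shows how a fusion step can keep both coarse and fine information instead of discarding one of them.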

Keywords/Search Tags:

image captioning, attention mechanism, multi-granularity feature, feature fusion

Related items

1	Research On Video Captioning Methods Based On Feature Fusion And Attention Mechanism
2	Research On Image Captioning Algorithm Guided By Attention And Visual Common Sense
3	Image Captioning By Multi-feature Fusion
4	Research Of Video Captioning On Egocentric Videos
5	Research On Person Re-Identification Method Based On Multi-Granularity Feature Fusion And Local Attention
6	The Research Of Person Re-identification Based On Multi-granularity Feature Fusion And Local Information Enhancement
7	Image Semantic Segmentation Based On Multi-level Feature Fusion And Attention Mechanism
8	Research On Image Caption Algorithm Based On Fusion Of Multi-attention Mechanism
9	Research On Image Deraining Algorithm Based On Attention Mechanism And Feature Fusion
10	Research On Social Image Captioning Based On Deep Learning