Research On Image Captioning Based On Image Feature Fusion

Posted on: 2024-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Q Li
Full Text: PDF
GTID: 2568307091497294
Subject: Computer technology
Abstract/Summary:
Image captioning is the task of having a computer generate natural language descriptions of the visual content of an image. It requires a comprehensive understanding of the visual entities in the input image and their interactions, and it must establish the hidden correlations between fine-grained visual cues and each generated word. With the application and development of deep neural networks in computer vision and natural language processing, new methods have inspired researchers to explore the intersection of these previously separate fields.

The first challenge in image captioning is to develop effective image encoding methods, i.e., to extract features that describe the image content. Many studies have improved models' visual perception by improving image feature extraction, especially through the introduction of attention mechanisms. Attention allows the model to focus on important information and reduces interference from irrelevant visual information, greatly enhancing the performance of image captioning models. However, current attention mechanisms still suffer from incorrect localization: the attention model may fail to focus on the area that is relevant at the current time step, or may even focus on non-salient background areas, causing the language model to generate incorrect target words. In addition, current image feature extraction methods all downsample the original input hierarchically and produce a single global or local feature, which loses fine-grained information to varying degrees. As a result, generated image descriptions often overlook important details.

This thesis addresses the above problems and makes the following contributions:

(1) A mixed attention-based image captioning method is proposed to address inaccurate attention. The method combines machine attention with human descriptive attention: it encodes the rich information that humans perceive in the image captioning task and uses it to reweight the bottom-up attention. This effectively reduces the "hallucinated" descriptions generated by the model and improves the diversity of image descriptions. Experimental results on the MS COCO dataset show that the method effectively improves the performance of existing image captioning methods.

(2) A multi-granularity feature fusion-based image captioning method is proposed to address the loss of fine-grained image information. The method uses a visual Transformer to extract multi-granularity image features and designs a dynamic feature fusion mechanism to fuse features of different granularities, retaining deep image feature information while avoiding the loss of image detail. Experimental results on the MS COCO dataset show that the method effectively improves the performance of image captioning methods.
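
The two ideas named in the abstract, reweighting attention with a human-derived prior and dynamically fusing features of different granularities, can be illustrated with a small sketch. This is not the thesis's implementation, which the abstract does not specify. The function names (`reweight_attention`, `fuse_granularities`), the multiplicative reweighting rule, and the dot-product scoring used for fusion are all assumptions chosen for illustration, written in plain Python so the arithmetic is visible.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_attention(machine_attn, human_prior, eps=1e-8):
    """Hypothetical mixed attention: scale each region's machine attention
    weight by a human-attention prior, then renormalize to sum to 1.
    The multiplicative rule is an assumption, not the thesis's formula."""
    if len(machine_attn) != len(human_prior):
        raise ValueError("one prior value is required per region")
    mixed = [a * (h + eps) for a, h in zip(machine_attn, human_prior)]
    total = sum(mixed)
    return [m / total for m in mixed]


def fuse_granularities(features, query):
    """Hypothetical dynamic fusion: score each granularity's feature vector
    against a query vector (e.g. the decoder state), softmax the scores,
    and return the weighted sum together with the fusion weights."""
    scores = [sum(f_d * q_d for f_d, q_d in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
    return fused, weights


if __name__ == "__main__":
    # Three image regions: the machine attends mostly to region 0,
    # while the (made-up) human prior favors region 1.
    print(reweight_attention([0.5, 0.3, 0.2], [0.1, 0.8, 0.1]))

    # Two granularities (coarse, fine) with 2-dimensional toy features.
    fused, weights = fuse_granularities([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
    print(fused, weights)
```

In this toy setup the fusion weights change with the query, which is the sense in which the fusion is "dynamic"; a real model would learn the scoring function and operate on Transformer feature maps rather than hand-written vectors.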
Keywords/Search Tags: image captioning, attention mechanism, multi-granularity feature, feature fusion