Research On Image Captioning Using Semantic Enhanced Features And Negative Examples Mining

Posted on: 2021-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: W J Cai
Full Text: PDF
GTID: 2428330611965680
Subject: Software engineering
Abstract/Summary:
Converting visual information into text establishes a connection between images and language and has a wide range of practical applications. The current "encoder-decoder" framework, based on convolutional neural networks and recurrent neural networks, is an effective solution. However, it has two shortcomings: 1) the image features are not representative, and 2) the training data are unbalanced and insufficient. These shortcomings lead to inaccurate generated captions. To address these issues, this work makes three contributions: Image Captioning with Panoptic Segmentation-based Attention, Image Captioning with Semantic-Enhanced Module, and Image Captioning with Extremely Hard Negatives.

1. Image Captioning with Panoptic Segmentation-based Attention. To enhance image features, attention mechanisms have been widely adopted in image captioning. However, in existing models with detection-based attention, the rectangular attention regions are not fine-grained: they contain irrelevant areas around the object (e.g., background or overlapping regions), which leads the model to generate inaccurate captions. To address this issue, we propose panoptic segmentation-based attention, which performs attention at the mask level (i.e., over the shape of the main part of an instance). Our approach extracts feature vectors from the corresponding segmentation regions, which is more fine-grained than current attention mechanisms. Moreover, we propose a dual-attention module that processes the features of foreground and background classes independently. With panoptic segmentation-based attention, our model can recognize overlapping objects and understand the scene better.

2. Image Captioning with Semantic-Enhanced Module. In existing image captioning models, the generated captions usually lack semantic discriminability. Semantic discriminability is difficult to achieve because it requires the model to capture detailed differences between images. In this work, we propose an image captioning framework with a semantic-enhanced module. The module consists of an image-text matching sub-network and a feature fusion layer. The image-text matching sub-network provides the similarity between the image and the generated captions. The feature fusion layer fuses the features from the object detection model with the features from the image-text matching model, yielding rich semantic features. With the semantic-enhanced module, our model can generate semantically discriminative captions.

3. Image Captioning with Extremely Hard Negatives. Training deep learning models requires a large amount of labeled data, and labeling demands considerable manpower and time. To address this issue and enable the model to capture the subtle differences between images, we propose a method for automatically generating extremely hard negative training examples. Each extremely hard negative example differs from its corresponding positive example by only one noun. Through training on extremely hard negatives, the model leverages feedback from the image-text matching network and learns the semantic differences between images. This ability to discriminate differences helps the model generate more accurate captions.
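The one-noun difference described in the third contribution can be illustrated with a minimal Python sketch. The abstract does not say how nouns are identified or how replacements are chosen, so the fixed noun vocabulary `NOUN_VOCAB`, the whitespace tokenization, the random choice of replacement, and the function name `make_hard_negative` are all assumptions made for illustration, not the thesis's actual procedure.

```python
import random

# Hypothetical noun vocabulary (an assumption; the abstract does not
# specify how nouns are detected or which replacements are allowed).
NOUN_VOCAB = ["dog", "cat", "horse", "bench", "sofa", "frisbee", "ball"]


def make_hard_negative(caption, noun_vocab=NOUN_VOCAB, rng=None):
    """Return a copy of `caption` with exactly one noun swapped.

    Returns None when the caption contains no noun from `noun_vocab`
    or when no alternative noun is available.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    tokens = caption.split()  # naive whitespace tokenization (assumption)
    vocab = set(noun_vocab)
    noun_positions = [i for i, tok in enumerate(tokens) if tok in vocab]
    if not noun_positions:
        return None
    pos = rng.choice(noun_positions)
    # Candidate replacements: every known noun except the original one.
    candidates = [n for n in noun_vocab if n != tokens[pos]]
    if not candidates:
        return None
    tokens[pos] = rng.choice(candidates)
    return " ".join(tokens)


positive = "a dog catches a frisbee on the grass"
negative = make_hard_negative(positive, rng=random.Random(42))
print("positive:", positive)
print("negative:", negative)
```

In a pipeline like the one the abstract describes, each (image, negative caption) pair would presumably serve as a mismatched example for the image-text matching network; that wiring is outside the scope of this sketch.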
Keywords/Search Tags:Image captioning, Attention mechanism, Semantic enhancement, Hard negative examples