
Image Captioning Theories And Methods

Posted on: 2024-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Hu
Full Text: PDF
GTID: 2568307079955539
Subject: Information and Communication Engineering
Abstract/Summary:
Multi-modal analysis is a relatively new area of computer vision that requires deep models to understand both visual and linguistic information. Image captioning is a typical vision-language multi-modal interpretation task: models must generate natural-language descriptions of input images. Several subtasks derive from image captioning, such as grounded image captioning, which requires models to localize the corresponding object in the image when predicting an object word, and crowd scene captioning, which focuses specifically on the behaviors and states of humans and crowds. This thesis studies these three tasks and develops models that achieve higher performance. The main contributions are as follows:

(1) Image Captioning. Most previous image captioning methods process visual features extracted by a backbone network. However, because some region proposals are deviated, enlarged, or only partially cover an object, the corresponding visual features are not accurate enough and lack class distinctiveness. This thesis therefore proposes a feature encoder with contrastive learning and class enhancement: the class-enhancing encoder improves intra-class similarity, and a contrastive learning loss forces features of the same class closer together in feature space, yielding more class-distinctive features. Experiments show that the proposed method improves model performance effectively.

(2) Grounded Image Captioning. Previous grounded image captioning methods mostly employ an encoder-decoder structure operating on visual features alone. This thesis argues that high-dimensional features by themselves cannot provide adequate information, and proposes a spatial-semantic attention module that uses spatial and semantic information to help the model judge the importance of objects and refine the attention weights. A grounding loss is also designed to supervise the training of the spatial-semantic attention module. Experiments demonstrate that the proposed method achieves higher performance on both captioning and grounding.

(3) Crowd Scene Captioning. To make deep models serve people better, the crowd scene captioning task was introduced. It focuses exclusively on the behaviors and states of humans and crowds, whereas existing methods have no special design for recognizing and analyzing humans. To this end, this thesis proposes a human-aware crowd scene captioning method that extracts human body keypoint features and designs a human-aware feature encoder to explore the deep relationships between visual and human-body features. Experiments show that the proposed method achieves better performance on this task.
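Contribution (1) rests on a contrastive objective that pulls proposal features of the same class together. As a minimal illustrative sketch (not the thesis's actual implementation; the function name, NumPy formulation, and temperature value are assumptions), a supervised contrastive loss over class-labeled features might look like:

```python
import numpy as np

def class_contrastive_loss(features, labels, temperature=0.1):
    """Toy supervised contrastive loss: same-class features are pulled
    together, different-class features pushed apart.
    features: (N, D) array; labels: (N,) array of class ids."""
    # L2-normalise so the dot product is cosine similarity
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature                        # (N, N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss, count = 0.0, 0
    for i in range(len(labels)):
        pos = (labels == labels[i]) & (np.arange(len(labels)) != i)
        if pos.any():
            loss += -log_prob[i, pos].mean()           # positives should dominate
            count += 1
    return loss / max(count, 1)
```

The loss is low when same-class features are nearly parallel and high when they point apart, which is the "closer in feature space" behavior the thesis describes.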
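Contribution (2) refines attention weights over object proposals with spatial and semantic cues. One plausible sketch of such a refinement (the function name and the additive-logit formulation are assumptions, not the thesis's module) is to treat the extra cues as logit offsets and renormalise:

```python
import numpy as np

def refine_attention(base_attn, spatial_scores, semantic_scores):
    """Sketch of spatial-semantic refinement: base attention weights over
    proposals are modulated by spatial and semantic importance scores,
    then renormalised with a softmax."""
    logits = np.log(base_attn + 1e-9) + spatial_scores + semantic_scores
    exp = np.exp(logits - logits.max())                # stable softmax
    return exp / exp.sum()
```

A proposal whose spatial and semantic scores are both high gains attention mass relative to the base distribution, which is the kind of importance re-weighting the module is meant to learn (in the thesis, supervised by a grounding loss).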
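Contribution (3) fuses visual features with human body keypoint features in a human-aware encoder. As a hedged sketch only (the projection-and-fuse design, function name, and dimensions are assumptions; the actual encoder is more elaborate), the core idea of mapping both modalities into a shared space can be shown as:

```python
import numpy as np

def human_aware_encode(visual_feats, keypoint_feats, W_v, W_k):
    """Sketch of human-aware fusion: project visual features (N, Dv) and
    body-keypoint features (N, Dk) into a shared space and fuse them
    additively, so the decoder can attend to human-specific cues."""
    return np.tanh(visual_feats @ W_v + keypoint_feats @ W_k)
```

The fused output has one row per proposal in the shared dimension, ready to be consumed by a captioning decoder in place of purely visual features.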
Keywords/Search Tags: Vision-Language Interpretation, Multi-modal Analysis, Image Captioning, Grounded Image Captioning, Crowd Scene Captioning