
Scene Graph With 3D Information For Change Captioning

Posted on: 2023-10-14  Degree: Master  Type: Thesis
Country: China  Candidate: Z M Liao  Full Text: PDF
GTID: 2568306794981439  Subject: Control engineering
Abstract/Summary:
With the development of related technologies in computer vision and natural language processing, a number of tasks that merge the two have emerged, such as image captioning, image-text matching, and visual question answering. Change captioning involves both image difference detection and language generation: it aims to identify the changed target between a pair of input images and describe it in natural language, and the task is of great research value and considerable challenge.

Existing work falls into two types: two-stage schemes and end-to-end frameworks. Most of these works use convolutional neural networks to encode the images. Because of viewpoint change between the image pair, the subtraction of the image features cannot represent the differences accurately, leading to the generation of incorrect captions. In addition, existing models do not explicitly model the relative position relationships between objects, resulting in inaccurate directional descriptions, so an observer cannot reliably locate the difference in the image from the description.

To solve these problems, this thesis proposes a 3D-information-aware Scene Graph based Change Captioning (SGCC) model, which contains three modules: a scene graph generation module, a scene graph embedding module, and a change captioning module. First, to avoid the problems caused by encoding the image with convolutional neural networks, this thesis employs a scene graph to represent the image. To make the scene graph accurately reflect the input image, the scene graph generation module contains an image feature extraction network, an image attribute detection network, and an image depth value extraction network. Consequently, in addition to the necessary elements, the obtained scene graph contains both the relative position relationships between objects and the depth values of objects, so the model can better describe the relative position relationships between objects and determine whether objects have moved. Then, the scene graph embedding module encodes the scene graph to obtain a semantic representation of the image. Finally, the change captioning module generates the difference description from the representations of the image pair.

This thesis reports comparison experiments, ablation experiments, visualization analysis, and case studies on the CLEVR-Change dataset and the Spot-the-Diff dataset. The results show that the proposed model achieves the best results on most metrics. This not only demonstrates that representing images with scene graphs has unique advantages for the change captioning task, but also verifies that explicitly introducing the relative position relationships between objects helps the model describe those relationships more accurately. Moreover, with the spatial information, the model can alleviate the viewpoint change problem to some extent. These findings offer ideas for solving the viewpoint change problem in change captioning and further facilitate its development.
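The pipeline described above can be sketched in miniature: scene graph nodes carry appearance features augmented with a depth value (the "3D information"), a graph convolution propagates information along object relationships, and the difference between the pooled "before" and "after" graph embeddings is what a caption decoder would consume. This is a minimal illustrative sketch, not the thesis's actual architecture; all dimensions, the single-layer GCN, and the mean-pooling choice are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(H, A, W):
    """One graph-convolution layer: add self-loops, row-normalize the
    adjacency, aggregate neighbor features, then apply a linear map
    followed by ReLU. (Simplified GCN propagation rule.)"""
    A_hat = A + np.eye(A.shape[0])                 # self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True) # row normalization
    return np.maximum(0.0, (D_inv * A_hat) @ H @ W)

def embed_scene_graph(node_feats, depths, A, W):
    """Append each object's depth value to its appearance feature,
    run one GCN layer, and mean-pool nodes into a graph embedding."""
    H = np.concatenate([node_feats, depths[:, None]], axis=1)
    return gcn_layer(H, A, W).mean(axis=0)

# Toy "before"/"after" scene graphs with 3 objects; object 1 changes.
feats_before = rng.normal(size=(3, 4))   # hypothetical appearance features
feats_after = feats_before.copy()
feats_after[1] += 0.5
depths = np.array([1.2, 3.0, 2.1])       # per-object depth values
A = np.array([[0, 1, 0],                 # object relationship edges
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
W = rng.normal(size=(5, 8))              # 4 appearance dims + 1 depth dim

g_before = embed_scene_graph(feats_before, depths, A, W)
g_after = embed_scene_graph(feats_after, depths, A, W)
diff = g_after - g_before                # fed to the caption decoder
```

In the full model, the node features, attributes, and depths would come from the three networks of the scene graph generation module, and the decoder would generate the change description from `diff` together with the two graph embeddings.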
Keywords/Search Tags: Image change captioning, Scene graph, 3D information, Graph convolutional network, Explicit feature extraction