
Complex Scene Reasoning Based On Multi-modal Attention Mechanism

Posted on: 2022-03-27    Degree: Master    Type: Thesis
Country: China    Candidate: G H Xu    Full Text: PDF
GTID: 2518306569481654    Subject: Software engineering

Abstract/Summary:
Multimodal information reasoning in complex scenes is one of the hot topics in artificial intelligence. It combines computer vision and natural language processing and has become a focus of attention in both academia and industry. Specifically, given a scene (an image or a video), a reasoning model is expected to understand the complex multimodal information in the scene (objects and texts) and then either generate a description that conforms to the semantics of the scene or answer questions related to it. In this sense, reasoning tasks in complex scenes can be divided into image captioning and visual question answering (VQA). Image captioning can be used to caption movies automatically and to help visually impaired people quickly understand their surroundings. VQA can help people explore unknown environments interactively, and it can also be used for visual navigation and chatbots. Studying and solving reasoning tasks in complex scenes therefore has important practical significance; it is also a technological high ground that many domestic and foreign enterprises and research institutions are competing to seize.

However, complex scene reasoning still faces the following challenges. 1) Most existing methods lack the ability to "read" and often ignore the text information in the scene. 2) A complex scene may contain a large number of objects that occlude each other, and these objects may carry rich text information; how to better model and exploit such multimodal scene information remains an open problem. 3) Existing reasoning models tend to describe only one or two salient objects in the scene, which makes them liable to ignore important (or genuinely interesting) objects and texts. 4) It is difficult to correctly understand the complex logic of a question and to further capture the relationship between the question and the multimodal contents of the scene.

To address these challenges, this thesis proposes new reasoning methods based on the multimodal attention mechanism. 1) For image captioning, we propose a novel Anchor-Captioning method based on anchor-centred graphs (ACGs); specifically, we perform multi-view caption generation to improve the content diversity of the generated captions. 2) For VQA, we propose a cascading reasoning method that gradually incorporates the information of different modalities during fusion; based on semantic understanding of the scene, it extracts the key clues needed to answer a question and suppresses interference from unrelated information. Experimental results show that our methods effectively handle both image captioning and VQA and significantly improve the reasoning performance of existing models in complex scenes.
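To make the anchor-centred-graph idea concrete, here is a minimal, hypothetical sketch of ACG construction. The abstract does not specify how graphs are built, so this assumes each scene element carries a bounding-box centre and uses a simple nearest-neighbour heuristic as a stand-in; the names `Element`, `build_acg`, and the parameter `k` are all illustrative, not the thesis's actual method.

```python
# Hypothetical sketch of anchor-centred graph (ACG) construction: each OCR
# token is taken as an anchor in turn and linked to its spatially closest
# objects, so a captioner conditioned on different ACGs can describe
# different views of the same scene (multi-view caption generation).
from dataclasses import dataclass
import math

@dataclass
class Element:
    label: str   # object class or recognized scene text
    cx: float    # bounding-box centre x (normalized)
    cy: float    # bounding-box centre y (normalized)

def build_acg(anchor: Element, objects: list[Element], k: int = 3):
    """Return the anchor together with its k nearest objects."""
    nearest = sorted(
        objects,
        key=lambda o: math.hypot(o.cx - anchor.cx, o.cy - anchor.cy),
    )[:k]
    return {"anchor": anchor, "neighbours": nearest}

ocr_tokens = [Element("STOP", 0.3, 0.4), Element("Main St", 0.7, 0.2)]
objects = [Element("sign", 0.32, 0.42), Element("car", 0.8, 0.6),
           Element("pole", 0.31, 0.55)]

# One graph per anchor -> one caption "view" per graph
for tok in ocr_tokens:
    acg = build_acg(tok, objects, k=2)
    print(acg["anchor"].label, "->", [o.label for o in acg["neighbours"]])
```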
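The cascading fusion idea for VQA can likewise be sketched as stacked cross-attention stages. This is a minimal PyTorch sketch, not the thesis's actual architecture: the dimensions, module names, and the two-stage order (visual objects first, then scene-text tokens) are assumptions made only to illustrate how modalities might be folded in gradually rather than fused all at once.

```python
# Illustrative cascaded multimodal attention: the question first attends
# over detected-object features, then the fused result attends over
# scene-text (OCR) features, so each modality is incorporated step by step.
import torch
import torch.nn as nn

class CascadedMultimodalAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Stage 1: question -> visual objects
        self.obj_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: (question + objects) -> scene text (OCR tokens)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, question, objects, ocr_tokens):
        # question:   (B, Lq, D) encoded question tokens
        # objects:    (B, No, D) region features of detected objects
        # ocr_tokens: (B, Nt, D) embeddings of text recognized in the scene
        vis, _ = self.obj_attn(question, objects, objects)
        fused = self.norm1(question + vis)       # fold in visual clues
        txt, _ = self.txt_attn(fused, ocr_tokens, ocr_tokens)
        fused = self.norm2(fused + txt)          # then fold in scene text
        return fused                             # (B, Lq, D) answer clues

# Toy usage with random features
B, D = 2, 512
mod = CascadedMultimodalAttention(D)
out = mod(torch.randn(B, 10, D), torch.randn(B, 36, D), torch.randn(B, 12, D))
print(out.shape)  # torch.Size([2, 10, 512])
```

The residual-plus-LayerNorm wiring after each stage is a common transformer convention; the key point of the cascade is that later stages see a question representation already conditioned on earlier modalities, which helps suppress unrelated information.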
Keywords/Search Tags: Complex scene reasoning, Multi-modal attention mechanism, Image captioning, Visual question answering