
Complex Scene Reasoning Based On Multi-modal Attention Mechanism

Posted on: 2022-03-27    Degree: Master    Type: Thesis
Country: China    Candidate: G H Xu    Full Text: PDF
GTID: 2518306569481654    Subject: Software engineering

Abstract/Summary:
Multimodal information reasoning in complex scenes is one of the hot topics in artificial intelligence. It combines computer vision and natural language processing and has become a focus of attention in both academia and industry. Specifically, given a scene (an image or a video), a reasoning model is expected to understand the complex multimodal information in the scene (objects and texts) and then either generate a description that conforms to the semantics of the scene or answer questions related to it. In this sense, reasoning tasks in complex scenes can be divided into image captioning and visual question answering (VQA). Image captioning can be used to caption movies automatically and to help visually impaired people quickly understand their surroundings. VQA can help people explore unknown environments interactively, and it can also be used for visual navigation and chatbots. Studying and solving reasoning tasks in complex scenes therefore has important practical significance; it is also a technological high ground that many domestic and foreign enterprises and research institutions are competing to seize.

However, complex scene reasoning still faces the following challenges. 1) Most existing methods lack the ability to "read" and often ignore the text information in the scene. 2) A complex scene may contain a large number of objects that occlude each other, and these objects may carry rich text information; how to better model and exploit such multimodal scene information remains an open problem. 3) Existing reasoning models tend to describe only one or two salient objects in the scene, which makes them liable to ignore important (or genuinely interesting) objects and texts. 4) It is difficult to correctly understand the complex logic of a question and to further capture the relationship between the question and the multimodal contents of the scene.

To address these challenges, this thesis proposes new reasoning methods based on the multimodal attention mechanism. 1) For image captioning, we propose a novel Anchor-Captioning method based on anchor-centred graphs (ACGs); specifically, we perform multi-view caption generation to improve the content diversity of the generated captions. 2) For VQA, we propose a cascading reasoning method that gradually incorporates the information of different modalities during fusion; based on semantic understanding of the scene, it extracts the key clues needed to answer a question and suppresses interference from unrelated information. Experimental results show that our methods effectively handle both image captioning and VQA and significantly improve the reasoning performance of existing models in complex scenes.
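To make the anchor-centred-graph idea concrete, here is a minimal, hypothetical sketch of ACG construction. The abstract does not specify how graphs are built, so this assumes each scene element carries a bounding-box centre and uses a simple nearest-neighbour heuristic as a stand-in; the names `Element`, `build_acg`, and the parameter `k` are all illustrative, not the thesis's actual method.

```python
# Hypothetical sketch of anchor-centred graph (ACG) construction: each OCR
# token is taken as an anchor in turn and linked to its spatially closest
# objects, so a captioner conditioned on different ACGs can describe
# different views of the same scene (multi-view caption generation).
from dataclasses import dataclass
import math

@dataclass
class Element:
    label: str   # object class or recognized scene text
    cx: float    # bounding-box centre x (normalized)
    cy: float    # bounding-box centre y (normalized)

def build_acg(anchor: Element, objects: list[Element], k: int = 3):
    """Return the anchor together with its k nearest objects."""
    nearest = sorted(
        objects,
        key=lambda o: math.hypot(o.cx - anchor.cx, o.cy - anchor.cy),
    )[:k]
    return {"anchor": anchor, "neighbours": nearest}

ocr_tokens = [Element("STOP", 0.3, 0.4), Element("Main St", 0.7, 0.2)]
objects = [Element("sign", 0.32, 0.42), Element("car", 0.8, 0.6),
           Element("pole", 0.31, 0.55)]

# One graph per anchor -> one caption "view" per graph
for tok in ocr_tokens:
    acg = build_acg(tok, objects, k=2)
    print(acg["anchor"].label, "->", [o.label for o in acg["neighbours"]])
```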
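The cascading fusion idea for VQA can likewise be sketched as stacked cross-attention stages. This is a minimal PyTorch sketch, not the thesis's actual architecture: the dimensions, module names, and the two-stage order (visual objects first, then scene-text tokens) are assumptions made only to illustrate how modalities might be folded in gradually rather than fused all at once.

```python
# Illustrative cascaded multimodal attention: the question first attends
# over detected-object features, then the fused result attends over
# scene-text (OCR) features, so each modality is incorporated step by step.
import torch
import torch.nn as nn

class CascadedMultimodalAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Stage 1: question -> visual objects
        self.obj_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: (question + objects) -> scene text (OCR tokens)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, question, objects, ocr_tokens):
        # question:   (B, Lq, D) encoded question tokens
        # objects:    (B, No, D) region features of detected objects
        # ocr_tokens: (B, Nt, D) embeddings of text recognized in the scene
        vis, _ = self.obj_attn(question, objects, objects)
        fused = self.norm1(question + vis)       # fold in visual clues
        txt, _ = self.txt_attn(fused, ocr_tokens, ocr_tokens)
        fused = self.norm2(fused + txt)          # then fold in scene text
        return fused                             # (B, Lq, D) answer clues

# Toy usage with random features
B, D = 2, 512
mod = CascadedMultimodalAttention(D)
out = mod(torch.randn(B, 10, D), torch.randn(B, 36, D), torch.randn(B, 12, D))
print(out.shape)  # torch.Size([2, 10, 512])
```

The residual-plus-LayerNorm wiring after each stage is a common transformer convention; the key point of the cascade is that later stages see a question representation already conditioned on earlier modalities, which helps suppress unrelated information.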
Keywords/Search Tags: Complex scene reasoning, Multi-modal attention mechanism, Image captioning, Visual question answering