
Research On Visual Question Answering Models Based On Top-down Attention

Posted on: 2021-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: R C Zhou
Full Text: PDF
GTID: 2428330614460444
Subject: Software engineering
Abstract/Summary:
Visual Question Answering (VQA) has been one of the popular research directions in artificial intelligence in recent years. The key question in VQA is how to find relationships between images and questions. A semantic gap exists between the two, and it hinders the organic integration of the semantic information they carry. A basic idea for bridging this gap is to process the image and the question simultaneously, so that strong correlations between the two kinds of information can be found. This dissertation therefore studies solutions for the organic fusion of images and questions and for mining the relationships between them, and designs two new VQA models accordingly. The main contributions of this dissertation are as follows:

(1) The attention mechanism is widely used in VQA models to highlight key information and suppress irrelevant information. However, most existing models use question information to guide how the image is processed, while image information is rarely used to guide how the question is processed. As a result, the extraction of key information from the question lacks grounding in the image, which affects the overall performance of the VQA model. A new model based on cascading top-down attention is therefore proposed in this dissertation. The model uses the question to guide image attention, highlighting the key areas of the image, and also uses the image to guide question attention, highlighting the key words of the question. It can effectively highlight the question words related to the image, bringing the image and the question into closer correspondence. The model is tested on two public VQA data sets, and the experimental results show that it effectively improves the overall performance of VQA.

(2) Most existing VQA models apply a normalization operation in their attention mechanism. The research in this dissertation indicates that this may be a weakness of these models: when dealing with complex questions that involve information from multiple image regions at the same time, the normalization operation reduces the model's ability to focus on multiple regions simultaneously. A multi-overlay attention model is therefore proposed in this dissertation. The model uses the question to attend to the image multiple times and thereby highlights multiple image areas related to the question. This helps answer questions that require information from multiple image areas, thereby improving the overall performance of VQA. The model is tested on two public VQA data sets, and the experimental results show that it effectively improves the overall performance of VQA.
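The two-way guidance described in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's architecture, which the abstract does not specify. It assumes dot-product scoring, mean pooling of word vectors as the initial question summary, and word and region features of equal dimension. All function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, items):
    # Assumed scoring: dot product between the query and each item,
    # normalized with softmax. Returns the weights and the weighted sum.
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    summary = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return weights, summary

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def cascaded_attention(word_vectors, region_vectors):
    # Step 1: the question (mean-pooled, an assumption) guides image attention.
    question_summary = mean_pool(word_vectors)
    region_weights, image_summary = attend(question_summary, region_vectors)
    # Step 2: the attended image guides attention over the question words.
    word_weights, question_attended = attend(image_summary, word_vectors)
    return region_weights, word_weights, image_summary, question_attended

# Toy data (hypothetical): three question words and two image regions.
# The second word and the first region share the same dominant feature.
words = [[0.1, 0.1, 0.0], [0.0, 2.0, 0.0], [0.1, 0.0, 0.1]]
regions = [[0.0, 3.0, 0.0], [3.0, 0.0, 0.0]]

region_weights, word_weights, _, _ = cascaded_attention(words, regions)
print("region weights:", region_weights)
print("word weights:", word_weights)
```

In this toy case the first region receives most of the image attention, and the second word then receives most of the question attention, which mirrors the idea of highlighting question words that relate to the image.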
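The normalization concern in contribution (2) can be made concrete with a small sketch. Assuming the normalization is a softmax over image regions (the abstract does not name the operation), the weights of one pass sum to 1, so at most one region can receive a weight above 0.5. The repeated-attention function below is only a hypothetical stand-in for attending multiple times: it masks the top region after each pass and overlays the passes with an element-wise maximum. The thesis's actual multi-overlay model may differ in both respects.

```python
import math

def softmax(scores):
    # Numerically stable softmax; a score of -inf yields a weight of 0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def overlaid_attention(scores, passes):
    # Hypothetical overlay scheme (an assumption, not the thesis's method):
    # run several softmax passes, mask the strongest region after each pass,
    # and keep the element-wise maximum of all attention maps.
    remaining = list(scores)
    overlay = [0.0] * len(scores)
    for _ in range(min(passes, len(scores))):
        weights = softmax(remaining)
        overlay = [max(o, w) for o, w in zip(overlay, weights)]
        top = max(range(len(weights)), key=weights.__getitem__)
        remaining[top] = float("-inf")  # exclude this region from later passes
    return overlay

# Toy relevance scores (hypothetical): regions 0 and 1 both matter
# for the question, regions 2 and 3 do not.
scores = [5.0, 4.0, 0.0, 0.0]

single = softmax(scores)
overlay = overlaid_attention(scores, passes=2)
print("single pass:", single)
print("two passes overlaid:", overlay)
```

With a single normalized pass, the second relevant region is left with a small share of the attention. With two overlaid passes, both relevant regions end up strongly highlighted while the irrelevant regions stay near zero.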
Keywords/Search Tags: Visual Question Answering, Deep Learning, Attention Mechanism, Computer Vision