
Visual Question Answering Based on Interpretation and Attention Mechanism

Posted on: 2022-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Hou
Full Text: PDF
GTID: 2518306542481104
Subject: Computer technology
Abstract/Summary:
In the era of big data, massive multi-modal data is everywhere, and how to mine the value hidden in such data through complementary learning across modalities is a central concern of current big-data research. Within this field, image caption generation and visual question answering both seek a bridge between the image and text modalities.

Image caption generation asks a machine to automatically produce a meaningful sentence that accurately describes the content of an image; it lies at the intersection of computer vision and natural language processing. Most existing work encodes the image with a CNN, decodes text with an RNN, and adds a traditional attention mechanism on top. However, these methods attend only to different spatial regions of the feature map and ignore its channels, which easily causes attention deviation. To address this, this thesis proposes a new attention model that incorporates a non-dimensionality-reduction attention mechanism (FND-ICG). The FND-ICG encoder encodes images through three parallel attention mechanisms. First, similar to traditional attention, it computes regional attention values, but improves on this by using fully connected layers for feature fusion so that image information is extracted more completely. Second, unlike traditional attention, it computes channel attention values: the degree of coupling between the currently embedded word and the hidden state assigns corresponding weights to the different channels of the feature map. Third, the two-dimensional feature map is flattened into a vector and processed with a one-dimensional attention mechanism that avoids dimensionality reduction. Visualization tools localize the image regions attended to as each caption word is generated, thereby explaining the model's behavior. The experimental results show an improvement on the specified evaluation metrics, indicating that the FND-ICG model generates more fluent and accurate natural sentences.

Visual question answering aims to let a computer automatically answer natural-language questions after understanding the content of an image. Questions fall into two categories: those whose answer can be obtained directly from the image, and those whose answer requires external knowledge. Current visual question answering research generally achieves high accuracy on the first category, but techniques for answering the second are not yet mature. To broaden the range of answerable questions, this thesis proposes a new model, Exp-VQA. First, feature vectors are generated for the image regions relevant to the question, replacing the full image. Second, the "question-answer" pairs in the data set and the annotated object bounding boxes let the machine mine triples representing the relationships between entities. Then visual and textual explanations are generated: the visual explanation consists of the bounding boxes most relevant to the question, and the textual explanation consists of triples from an external knowledge base together with captions generated from the image. Finally, several modules jointly generate the answer. Comparative experiments show that Exp-VQA outperforms the baseline models on the data set.
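The parallel attention branches described above can be illustrated with a minimal NumPy sketch. The shapes, single-head formulation, and function names here are assumptions for illustration only; the thesis does not publish its exact equations. The first function is conventional spatial (regional) attention driven by the decoder hidden state; the second gates feature-map channels with a 1D convolution over pooled channel descriptors, which is the general non-dimensionality-reduction idea (no FC bottleneck that shrinks the channel dimension).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def region_attention(regions, hidden):
    """Traditional spatial attention: score each region feature against
    the decoder hidden state, then return the weighted sum.
    regions: (N, D) region features; hidden: (D,) decoder state."""
    scores = regions @ hidden          # (N,) relevance of each region
    weights = softmax(scores)          # attention distribution over regions
    return weights @ regions           # (D,) attended context vector

def channel_attention_no_reduction(feature_map, kernel):
    """Channel attention without a dimensionality-reducing FC layer:
    a 1D convolution across pooled channel descriptors yields one
    sigmoid-gated weight per channel. feature_map: (C, H, W)."""
    C = feature_map.shape[0]
    k = len(kernel)
    desc = feature_map.mean(axis=(1, 2))             # (C,) global-average-pooled descriptors
    padded = np.pad(desc, k // 2, mode="edge")       # same-length 1D conv
    conv = np.array([padded[i:i + k] @ kernel for i in range(C)])
    gates = 1.0 / (1.0 + np.exp(-conv))              # per-channel weight in (0, 1)
    return feature_map * gates[:, None, None]
```

Because each channel's gate depends only on its k neighbouring descriptors, the channel dimension is never compressed and re-expanded, which is the property the abstract emphasizes.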
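The abstract states only that several Exp-VQA modules jointly generate the answer; one common way to realise such joint generation is late fusion over a shared answer vocabulary. The sketch below is a hypothetical illustration of that idea — the module split, the `alpha` weight, and the fusion rule are assumptions, not the thesis's actual design.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_answer(visual_logits, knowledge_logits, answers, alpha=0.5):
    """Hypothetical late fusion: each module scores every candidate
    answer, and a weighted mixture of the two distributions selects
    the final one. `alpha` balances image evidence against the
    external-knowledge module."""
    mixed = alpha * softmax(visual_logits) + (1 - alpha) * softmax(knowledge_logits)
    return answers[int(np.argmax(mixed))]
```

For a knowledge-dependent question, the knowledge module's confident score can override a weak visual guess, e.g. `joint_answer(np.array([2.0, 0.1, 0.0]), np.array([0.0, 0.1, 3.0]), ["red", "blue", "two"])` selects `"two"`.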
Keywords/Search Tags:image caption generation, attention mechanism, visual question answering, explanation mechanism