
Research On Visual Question Answering Algorithm Based On Image Description And Multi-level Attention Mechanism

Posted on: 2020-10-25 | Degree: Master | Type: Thesis
Country: China | Candidate: W L Cai | Full Text: PDF
GTID: 2438330602452745 | Subject: Computer application technology
Abstract/Summary:
In recent years, continued research in deep learning has driven rapid progress in computer vision and natural language processing, and new tasks such as image description (image captioning) and visual question answering (VQA) have emerged. Humans can identify the people and objects in an image, understand their spatial positions, predict their attributes and relationships, and reason about each object in its context. A VQA system aims to do these things in place of a person to some extent. VQA is a new task closely tied to computer vision (CV) and natural language processing (NLP): the input is an image together with a question about it, and the output is an answer of one or more words.

This thesis reviews the current state of VQA research in China and abroad and analyzes two problems in existing VQA algorithms. First, most algorithms are based on an attention mechanism, but they attend only to spatial regions of the image and pay too little attention to the question information. Second, current VQA tasks mainly aim to output a single-word answer, which is not friendly enough for human-computer interaction. We believe that a single-word output does not let visually impaired users fully understand the connection between the image and the question.

To address these problems, this thesis proposes a VQA algorithm based on image description and a multi-level attention mechanism. The algorithm not only predicts the answer effectively but also explains the image and the answer. First, the proposed multi-level attention model combines image and question information well and attends to the image at two levels: spatial attention and convolutional channel attention. Second, introducing the image description task makes VQA more user friendly: the output is not only an answer to the question but also a description of the image that is relevant to the question. Previous image description methods have no question information as guidance and generate the description from image information alone, whereas our model generates the description with question-guided attention, so this thesis effectively combines the two tasks of image description and VQA. Specifically, our model uses a deep convolutional neural network, a long short-term memory (LSTM) network, and the multi-level attention mechanism to generate a semantically guided image description. It then fuses the description with multiple features of the image and the question to obtain the answer, and outputs the description as well.

We compare our model with mainstream algorithms on COCO-VQA and VQA. The experimental results show that the proposed model predicts answers more accurately than previous models and outputs an image description closely related to the question, which improves the user's understanding of the answer.
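The two attention levels named in the abstract, spatial attention and convolutional channel attention, can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation. The function name multi_level_attention, the projection matrices W_c and W_s, the dot-product scoring, and the channel-then-spatial ordering are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_level_attention(feat, q, W_c, W_s):
    """Question-guided channel attention followed by spatial attention.

    feat: (C, H, W) convolutional feature map of the image.
    q:    (D,) question embedding (e.g. a final LSTM state).
    W_c, W_s: (C, D) hypothetical projections of the question into
              channel space; in a real model these would be learned.
    """
    C, H, W = feat.shape

    # Channel level: weight each feature channel by how well its
    # globally pooled response matches the projected question.
    pooled = feat.mean(axis=(1, 2))              # (C,)
    ch_weights = softmax(pooled * (W_c @ q))     # (C,), sums to 1
    feat_c = feat * ch_weights[:, None, None]    # reweighted map

    # Spatial level: score every location against the question and
    # take the attention-weighted sum over locations.
    flat = feat_c.reshape(C, H * W)              # (C, H*W)
    sp_weights = softmax((W_s @ q) @ flat)       # (H*W,), sums to 1
    attended = flat @ sp_weights                 # (C,)

    return attended, ch_weights, sp_weights.reshape(H, W)


# Example with random inputs standing in for real CNN and LSTM outputs.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
q = rng.standard_normal(16)
W_c = rng.standard_normal((8, 16))
W_s = rng.standard_normal((8, 16))
attended, ch_w, sp_map = multi_level_attention(feat, q, W_c, W_s)
```

In this sketch the attended vector would be the question-aware image feature passed on to caption generation or answer prediction. The two weight arrays show which channels and which image locations the question emphasized.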
Keywords/Search Tags: Visual question answering, Attention mechanism, Image captioning