
Multimodal Visual Question Answering Methods Based on Action Semantics

Posted on: 2020-10-04    Degree: Master    Type: Thesis
Country: China    Candidate: J W Lian    Full Text: PDF
GTID: 2428330590973934    Subject: Computer Science and Technology
Abstract/Summary:
Human experience of the world is multimodal, involving images, sounds, smells, and more. To obtain information more efficiently, computers are expected to understand and process multimodal data. Visual question answering is a popular research direction for multimodal data that combines computer vision and natural language processing: given an input image and a question, it produces a corresponding answer, and it has promising applications in fields such as security and children's education.

Current multimodal visual question answering methods cannot further understand image content in terms of specific application scenarios, and the scenarios they target are too broad. Although these methods can distinguish different types of questions across different scenes and give relevant answers, their accuracy on related questions within the same scene is still unsatisfactory. On the other hand, when extracting features from multimodal data, current methods do not fully consider the characteristics of the visual question answering task: they simply extract features from each single modality, and their feature representations are too weak to capture deep semantic information.

To address these shortcomings, we propose a multimodal visual question answering method based on action semantics. In real application scenarios, people's questions about images often concern interaction information. To tackle the overly broad application scenario, we propose ASI-Net, a multi-branch action semantic information extraction network based on an attention mechanism, which helps the model focus on learning interaction information. Through the attention mechanism, the context surrounding human and object instances is further extracted and integrated with the spatial information of those instances to detect interactions in the image, so that the model extracts action semantic information.

To address the insufficient feature representation of multimodal data in current visual question answering methods, we propose a feature extraction method based on a bidirectional attention mechanism. First, the model automatically detects object instances in the image and extracts features at the corresponding positions. Then, different weights are dynamically assigned to the different object-instance features under the guidance of the question. This improves the model's ability to represent multimodal data and allows it to learn richer semantic information.

Both the action semantic information extraction network and the bidirectional attention feature extraction method aim to improve the visual question answering model. In this thesis, the action semantic information extraction network and the multimodal feature extraction network are fused into ASM-Net, a multimodal visual question answering model based on action semantics. Experiments show that our method reaches 70.13% accuracy on open-ended questions, higher than mainstream visual question answering methods, and its accuracy on interaction-related questions exceeds current models by 2.18 percentage points.
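To make the ASI-Net idea concrete, the following is a minimal sketch of one interaction branch: an attention over the context surrounding detected human/object regions, fused with a spatial-geometry branch to score the interaction. It assumes PyTorch; the module name InteractionBranch, the feature dimensions, and the number of action classes are illustrative assumptions, not the thesis implementation.

# Hedged sketch of an ASI-Net-style interaction branch (assumed PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionBranch(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_actions=117):
        super().__init__()
        # attention over context features surrounding the human/object boxes
        self.ctx_attn = nn.Linear(feat_dim, 1)
        # spatial branch: encodes the relative geometry of the box pair
        self.spatial_mlp = nn.Sequential(
            nn.Linear(8, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.classifier = nn.Linear(feat_dim * 2 + hidden_dim, num_actions)

    def forward(self, human_feat, object_feat, ctx_feats, box_pair_geom):
        # human_feat, object_feat: (B, feat_dim) pooled region features
        # ctx_feats: (B, N, feat_dim) features of surrounding regions
        # box_pair_geom: (B, 8) normalized coordinates of the human/object boxes
        attn = F.softmax(self.ctx_attn(ctx_feats), dim=1)        # (B, N, 1)
        ctx = (attn * ctx_feats).sum(dim=1)                      # (B, feat_dim)
        spatial = self.spatial_mlp(box_pair_geom)                # (B, hidden_dim)
        fused = torch.cat([human_feat + ctx, object_feat, spatial], dim=-1)
        return self.classifier(fused)                            # action scores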
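The question-guided step of the bidirectional attention feature extraction can be sketched as follows: per-object region features from a detector are re-weighted under the guidance of an encoded question. Again this is a minimal PyTorch sketch under assumed dimensions (300-d word embeddings, a GRU question encoder); it is not the exact network used in the thesis.

# Hedged sketch of question-guided attention over detected object features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, vis_dim=2048, q_dim=1024, hidden_dim=512):
        super().__init__()
        self.q_encoder = nn.GRU(300, q_dim, batch_first=True)   # word embeddings -> question vector
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, q_embeds):
        # region_feats: (B, K, vis_dim) per-object features from a detector
        # q_embeds:     (B, T, 300) word embeddings of the question
        _, q_state = self.q_encoder(q_embeds)                    # (1, B, q_dim)
        q = q_state.squeeze(0)                                   # (B, q_dim)
        joint = torch.tanh(self.vis_proj(region_feats)
                           + self.q_proj(q).unsqueeze(1))        # (B, K, hidden_dim)
        weights = F.softmax(self.attn(joint), dim=1)             # (B, K, 1)
        attended = (weights * region_feats).sum(dim=1)           # (B, vis_dim)
        return attended, q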
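Finally, a fusion head in the spirit of ASM-Net could combine the attended visual feature, the question feature, and the action semantic feature to predict answer scores. The element-wise-product fusion and the answer-vocabulary size below are assumptions made for the sketch, not the thesis' exact design.

# Hedged sketch of an ASM-Net-style fusion head (assumed PyTorch).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, vis_dim=2048, q_dim=1024, act_dim=512,
                 hidden_dim=1024, num_answers=3129):
        super().__init__()
        self.vis_fc = nn.Linear(vis_dim, hidden_dim)
        self.q_fc = nn.Linear(q_dim, hidden_dim)
        self.act_fc = nn.Linear(act_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers))

    def forward(self, vis_feat, q_feat, act_feat):
        # element-wise product fuses visual and question features;
        # the action semantic feature is added as an extra cue
        joint = torch.tanh(self.vis_fc(vis_feat)) * torch.tanh(self.q_fc(q_feat))
        joint = joint + torch.tanh(self.act_fc(act_feat))
        return self.classifier(joint)                            # answer scores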
Keywords/Search Tags:attention mechanisms, action semantic understanding, multimodal visual question-answering methods