
Research On Visual Question Answering Technology Based On Knowledge Graph

Posted on: 2022-10-03
Degree: Master
Type: Thesis
Country: China
Candidate: J F Li
Full Text: PDF
GTID: 2518306335458414
Subject: Computer Software and Application of Computer
Abstract/Summary:
Visual question answering (VQA) requires a model to understand an input image and a text question and then give the corresponding answer. Unlike text question answering, which only processes single-modal information, VQA must fuse information from the visual and text modalities. The task is closer to how humans face problems in real scenes and to a form of artificial intelligence with reasoning ability. It has high research value and broad application prospects in medical assistive equipment, security, and early childhood education.

VQA currently faces the following challenges: how to efficiently process input from two different modalities, image and natural language, and obtain accurate image feature representations, text feature representations, or joint image-text representations; how to semantically align high-dimensional image features with text features; and how the model can extract the object attributes or object relationship features in the image that the question refers to and reason over them. These problems hinder further progress on VQA. In response, this thesis proposes improvements to the VQA model by simulating human perception and cognitive reasoning when facing real-world problems. The main research contents are as follows:

(1) This thesis constructs an image-related knowledge graph from the annotation data in the dataset by extracting the objects, attributes, and object relationships in the dataset's images, and designs the entity relationship weights of this knowledge graph by combining different semantic similarity measures in WordNet. A VQA framework based on knowledge graph feature embedding and attention enhancement is proposed. It combines the structured knowledge features of the image scene in the knowledge graph with the question features and image features, which effectively solves the problem of image-text semantic alignment.

(2) This thesis proposes a VQA framework based on cross-modal pre-training and knowledge graph feature alignment. It introduces a Transformer structure to encode image and text information and designs multiple pre-training tasks, including knowledge graph entity prediction, knowledge graph relationship prediction, knowledge graph attribute prediction, masked image region-of-interest (ROI) category prediction, and image-text matching. These tasks allow the model to learn joint features of images, text, and knowledge graphs, effectively addressing multi-modal feature fusion and finer-grained image-text semantic features.

Experimental results show that adding knowledge graph features containing image scene information to a VQA model or framework significantly improves performance on the VQA task.
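The graph construction in contribution (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the annotation format, the function names, and the tiny hand-written taxonomy are all hypothetical, and the taxonomy stands in for WordNet so the example runs without external data. The weighting uses Wu-Palmer similarity, which is one of several WordNet-style measures; the abstract does not say which measures the thesis combines or how.

```python
from collections import defaultdict

# Hypothetical toy is-a taxonomy (child -> parent), a stand-in for WordNet.
TOY_HYPERNYMS = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "frisbee": "artifact",
    "sofa": "artifact",
}


def path_to_root(word):
    """Return [word, parent, ..., root] in the toy taxonomy."""
    path = []
    while word is not None:
        path.append(word)
        word = TOY_HYPERNYMS[word]
    return path


def depth(word):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(word))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first ancestor of `a` that is also an ancestor of `b` is the
    # lowest common subsumer, because the path runs from `a` up to the root.
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def build_graph(annotations):
    """Build {subject: [(relation, object, weight), ...]} from annotations.

    Assumed (hypothetical) annotation format per image:
      {"attributes": [(object, attribute)], "relations": [(subj, rel, obj)]}
    Attribute edges get weight 1.0; relation edges are weighted by the
    similarity of the two entities they connect.
    """
    graph = defaultdict(list)
    for ann in annotations:
        for obj, attr in ann.get("attributes", []):
            graph[obj].append(("has_attribute", attr, 1.0))
        for subj, rel, obj in ann.get("relations", []):
            graph[subj].append((rel, obj, wu_palmer(subj, obj)))
    return dict(graph)


annotations = [
    {
        "attributes": [("dog", "brown")],
        "relations": [("dog", "catches", "frisbee"), ("cat", "on", "sofa")],
    }
]
graph = build_graph(annotations)
```

With real data, the toy taxonomy would be replaced by an actual WordNet lookup, and the weighted triples would then be embedded and fused with the question and image features.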
Keywords/Search Tags: Visual question answering, Knowledge graph, Image understanding, Multi-modal fusion