Visual question answering (VQA), a typical multi-modal task, lies at the intersection of computer vision and natural language processing and has been widely studied in recent years. Given an input image and a natural language question about that image, VQA aims to generate a natural language answer consistent with human understanding. VQA is highly valued in many domains, such as medical images, natural-light images, and indoor scene images: it can assist doctors in clinical decision-making and help visually impaired people perceive the world, so it has extensive research value. A VQA model must understand the input image in order to answer the given question. However, existing VQA algorithms often suffer from three problems that degrade performance: (1) they fail to fully understand the semantic information in the input image and question; (2) they ignore the large semantic gap between the two modalities; and (3) they struggle to properly integrate the information of the two modalities for answer reasoning. This thesis addresses these three problems, focusing on understanding multi-modal input information, and image information in particular. Three domains are studied: medical images, natural-light images, and indoor scene images. Two image-enhanced VQA algorithms are designed to improve overall performance by strengthening the model's ability to understand image information. Finally, building on the optimized VQA algorithms, this thesis implements a multi-domain VQA system. The research comprises the following three parts:

(1) For the medical image domain, this thesis proposes a medical visual question answering (Med-VQA) method based on multi-view fusion. Unlike traditional methods that obtain image features from a single image encoder, this method extracts multi-view features with several image encoders and dynamically fuses them with attention weighting to obtain a high-quality image feature representation. In addition, a collaborative attention (co-attention) module is proposed to model the relationship between question and image features, fully mining the attribute information contained in each modality; it also performs semantic alignment between the two modalities, reducing their semantic gap and helping the model integrate the two modalities better, thereby improving the overall performance of the Med-VQA model.

(2) For the natural-light and indoor scene image domains, this thesis proposes a general-domain VQA method based on joint grid and region features. Unlike traditional VQA methods that use only region features or only grid features, this method introduces a cross-attention module that models the relationships between the two kinds of image features simultaneously. By constructing geometric alignment maps between grid and region features, the interaction between the two feature types is constrained, so that their complementary advantages are combined while semantic noise is suppressed, yielding a complete, fine-grained image feature representation. In addition, a multi-modal fusion module based on fusion-vector re-attention is proposed; it dynamically merges the grid-text and region-text intermediate joint embeddings with attention weighting to generate the final joint representation, which is used to predict the correct answer to the input question.
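To make the attention-weighted multi-view fusion in part (1) concrete, the following is a minimal PyTorch sketch of question-guided weighting over features from several image encoders. The module name, the dimensions, and the choice of one fused vector per image are illustrative assumptions, not the thesis's exact implementation.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Attention-weighted fusion of multi-view image features.

    Each of V image encoders yields one feature vector per image
    (which encoders are used is an assumption here); the module
    scores every view against the question and mixes the views.
    """

    def __init__(self, img_dim: int, q_dim: int, hidden: int = 512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, views: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # views: (batch, V, img_dim); q: (batch, q_dim)
        h = torch.tanh(self.img_proj(views) + self.q_proj(q).unsqueeze(1))
        w = torch.softmax(self.score(h), dim=1)   # (batch, V, 1) view weights
        return (w * views).sum(dim=1)             # (batch, img_dim) fused feature
```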
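Likewise, for part (2), here is a sketch of geometry-constrained cross-attention between grid and region features. The overlap-based binary alignment map, the single-head attention, and all names are assumptions for illustration; the thesis's actual module may differ (for example, multi-head attention or soft alignment). The fusion-vector re-attention module would follow the same attention-weighting pattern as the multi-view sketch above, applied to the grid-text and region-text joint embeddings.

```python
import torch
import torch.nn as nn

def alignment_mask(grid_boxes: torch.Tensor, region_boxes: torch.Tensor) -> torch.Tensor:
    """Binary geometric alignment map: 1 where a grid cell and a region box
    overlap, 0 elsewhere. Boxes are (N, 4) tensors of (x1, y1, x2, y2) in
    image coordinates (an assumed convention)."""
    gx1, gy1, gx2, gy2 = grid_boxes.unsqueeze(1).unbind(-1)    # each (G, 1)
    rx1, ry1, rx2, ry2 = region_boxes.unsqueeze(0).unbind(-1)  # each (1, R)
    inter_w = (torch.min(gx2, rx2) - torch.max(gx1, rx1)).clamp(min=0)
    inter_h = (torch.min(gy2, ry2) - torch.max(gy1, ry1)).clamp(min=0)
    return (inter_w * inter_h > 0).float()                     # (G, R)

class AlignedCrossAttention(nn.Module):
    """Grid features attend to region features, restricted by the map."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, grid, region, mask):
        # grid: (G, dim); region: (R, dim); mask: (G, R)
        att = self.q(grid) @ self.k(region).t() / grid.size(-1) ** 0.5
        # -1e9 rather than -inf, so a cell overlapping no region degrades
        # to near-uniform attention instead of producing NaN.
        att = att.masked_fill(mask == 0, -1e9)
        return torch.softmax(att, dim=-1) @ self.v(region)
```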
(3) For the application of VQA systems, this thesis builds on the above work and, while improving the accuracy of the existing models, designs and implements a VQA application system for the medical, natural-light, and indoor scene image domains. The system is written in Python, encapsulates the complex algorithm models, and implements its front-end pages with the PyQt toolkit. The system has been tested on the Windows 10 platform, and the results show that it answers VQA questions in multiple domains correctly; it is simple to operate, user-friendly, and practical.
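As an illustration of how such a front end might be wired up, here is a minimal, self-contained PyQt5 sketch. The window layout, widget names, and the `answer_fn` stub standing in for the encapsulated VQA model are assumptions, not the system's actual design.

```python
import sys
from PyQt5.QtWidgets import (QApplication, QFileDialog, QLabel, QLineEdit,
                             QPushButton, QVBoxLayout, QWidget)

class VQAWindow(QWidget):
    """Minimal front end: pick an image, type a question, show an answer."""

    def __init__(self, answer_fn):
        super().__init__()
        self.answer_fn = answer_fn   # callable wrapping the VQA model
        self.image_path = None
        self.setWindowTitle("Multi-domain VQA")
        layout = QVBoxLayout(self)
        pick = QPushButton("Choose image")
        pick.clicked.connect(self.pick_image)
        self.question = QLineEdit()
        self.question.setPlaceholderText("Ask a question about the image...")
        ask = QPushButton("Answer")
        ask.clicked.connect(self.answer)
        self.result = QLabel("The answer will appear here.")
        for widget in (pick, self.question, ask, self.result):
            layout.addWidget(widget)

    def pick_image(self):
        self.image_path, _ = QFileDialog.getOpenFileName(self, "Open image")

    def answer(self):
        if self.image_path and self.question.text():
            self.result.setText(self.answer_fn(self.image_path,
                                               self.question.text()))

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = VQAWindow(lambda img, q: "demo answer")  # stub model for testing
    window.show()
    sys.exit(app.exec_())
```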