Visual Question Answering (VQA) is a multi-modal learning task involving computer vision, natural language processing, knowledge representation and reasoning. Given an image and a natural language question about the content of the image, a VQA model is required to produce an accurate natural language answer. With the rapid development of AI technology and related fields, and the continued efforts of researchers, VQA systems can now correctly answer questions that require complex reasoning and external general knowledge, and their achievements have far exceeded expectations. However, existing VQA models model only object-level visual representations while ignoring the relationships between visual objects, and their attention is distracted because they model interactions between every image region and every question word. Moreover, in the Affective Visual Question Answering Network (AVQAN), it is difficult to separate question-guided attention from mood-guided attention because the question words and the mood labels are concatenated. To address these problems, this paper studies VQA systems from three aspects: visual relationship reasoning, attention mechanisms and affective computing, and proposes the Multi-Modal Co-Attention Relation Network, the Multi-Modal Explicit Sparse Attention Network, the threshold-based Sparse Co-Attention Visual Question Answering Network and the Double-Layer Affective Visual Question Answering Network. We implement the corresponding VQA systems and verify the validity and interpretability of the proposed models through comparative experiments and ablation studies on mainstream VQA datasets. Finally, we design and implement a simple intelligent medical diagnosis system that combines technologies such as information management, transfer learning, VQA and human-computer interaction. The main research contents of this paper are as follows:

(1) Current mainstream VQA models model only object-level visual representations and ignore the relationships between visual objects. To solve this problem and make effective use of the position information of visual objects and their relative geometric relations, we propose a Multi-Modal Co-Attention Relation Network (MCARN) that combines co-attention with visual object relation reasoning. MCARN uses the co-attention mechanism to learn the textual features and object-level visual representations that are most critical for correctly answering the input question, and then uses a visual object relation module to model the visual representations at the relation level. Building on MCARN, we stack its visual object relation module to further improve accuracy on Number questions. Inspired by MCARN, we also propose two models, RGF-CA and Cos-Sin+CA, which combine the co-attention mechanism with the relative geometry features of visual objects and achieve, respectively, excellent overall performance and higher accuracy on Other questions. This work verifies the synergy between co-attention and visual object relation modeling in the VQA task, as illustrated by the sketch below.
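The following is a minimal PyTorch sketch of the kind of relative geometry encoding and relation-aware attention described above. It uses the widely adopted log-ratio encoding of bounding-box pairs; the module name, dimensions and the way the geometry bias is injected into the attention scores are illustrative assumptions, not the exact MCARN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_geometry_features(boxes):
    """Pairwise relative geometry features for K boxes given as (x, y, w, h).

    Returns a (K, K, 4) tensor with the common log-ratio encoding:
    log(|dx|/w_i), log(|dy|/h_i), log(w_j/w_i), log(h_j/h_i).
    """
    x, y, w, h = boxes.unbind(-1)                                          # each (K,)
    dx = torch.log(torch.clamp((x[:, None] - x[None, :]).abs(), min=1e-3) / w[:, None])
    dy = torch.log(torch.clamp((y[:, None] - y[None, :]).abs(), min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)                           # (K, K, 4)

class ObjectRelationModule(nn.Module):
    """Attention over object features with a geometry-based bias (illustrative dims)."""
    def __init__(self, dim=1024, geo_dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.geo = nn.Sequential(nn.Linear(4, geo_dim), nn.ReLU(), nn.Linear(geo_dim, 1))

    def forward(self, feats, boxes):                                       # feats: (K, dim), boxes: (K, 4)
        scores = self.q(feats) @ self.k(feats).T / feats.size(-1) ** 0.5   # appearance scores (K, K)
        geo_bias = self.geo(relative_geometry_features(boxes)).squeeze(-1) # geometry scores (K, K)
        # Combine appearance and (clipped) geometry weights before the softmax.
        attn = F.softmax(scores + torch.log(torch.clamp(F.relu(geo_bias), min=1e-6)), dim=-1)
        return attn @ self.v(feats)                                        # relation-level features (K, dim)
```

Stacking several such modules, as done for the Number questions above, simply means feeding the relation-level output of one module into the next.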
(2) Advanced VQA approaches model dense interactions between image regions and question words through co-attention mechanisms to achieve better accuracy. However, modeling interactions between every image region and every question word forces the model to compute irrelevant information and distracts its attention. To solve this problem, we propose a Multi-Modal Explicit Sparse Attention Network (MESAN), which concentrates the model's attention by explicitly selecting the parts of the input features that are most relevant to answering the input question. This top-k selection reduces the interference caused by irrelevant information and ultimately helps the model achieve better performance. Attention visualization results further show that our model obtains better attended features than other advanced models. Our work demonstrates that models combined with sparse attention mechanisms can also achieve competitive results in the VQA task.

(3) Most existing VQA models learn the co-attention between the input image and the input question by modeling dense interactions between every image region and every question word. However, correctly answering a natural language question about an image usually requires understanding only a few keywords of the question and capturing the visual information contained in a few regions of the image. The noise generated by interactions between image regions unrelated to the question and question words unrelated to the correct answer distracts VQA models and degrades their performance. To solve this problem, we propose a threshold-based Sparse Co-Attention Visual Question Answering Network (SCAVQAN). SCAVQAN concentrates the model's attention by setting a threshold on attention scores so that only the image features and question features most helpful for predicting the correct answer are retained, and thereby improves the overall performance of the model. Minimal sketches of the top-k and threshold-based filtering ideas follow.
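The sketch below illustrates the top-k selection idea behind MESAN on a single scaled dot-product attention head; the function name, the default k and the masking strategy are illustrative assumptions rather than the exact MESAN implementation.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=8):
    """Scaled dot-product attention that keeps only the top-k scores per query.

    q: (n_q, d), k/v: (n_kv, d). Scores outside the top-k are masked to -inf
    before the softmax, so attention mass concentrates on the selected keys.
    """
    scores = q @ k.T / q.size(-1) ** 0.5                        # (n_q, n_kv)
    top_k = min(top_k, scores.size(-1))
    kth_score = scores.topk(top_k, dim=-1).values[..., -1:]     # k-th largest score per row
    masked = scores.masked_fill(scores < kth_score, float("-inf"))
    return F.softmax(masked, dim=-1) @ v                        # attended features (n_q, d)
```

Masking before the softmax (rather than zeroing weights afterwards) keeps the surviving weights normalized without an extra renormalization step.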
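For comparison, the following sketch shows the threshold-based filtering idea behind SCAVQAN, under the assumption that the threshold is applied to softmax-normalized attention weights; the function name, the default threshold and the fallback rule for empty rows are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def threshold_sparse_attention(q, k, v, threshold=0.1):
    """Scaled dot-product attention that discards attention weights below a threshold.

    The surviving weights are renormalized so each query still distributes a total
    weight of 1 over the kept keys; if a row loses all its weights, the single
    strongest key is kept as a fallback.
    """
    weights = F.softmax(q @ k.T / q.size(-1) ** 0.5, dim=-1)            # (n_q, n_kv)
    kept = weights * (weights >= threshold).to(weights.dtype)
    empty = kept.sum(-1, keepdim=True) == 0
    best = F.one_hot(weights.argmax(-1), weights.size(-1)).to(weights.dtype)
    kept = torch.where(empty, best, kept)
    return (kept / kept.sum(-1, keepdim=True)) @ v                      # attended features (n_q, d)
```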
(4) Building on recent advances, AVQAN enriches the understanding and analysis of VQA models by using the emotional information contained in input images to produce emotionally sensitive answers, while maintaining the same level of accuracy as ordinary VQA baseline models. Integrating the emotional information contained in input images into VQA is a relatively new task. However, because AVQAN concatenates the question words with the mood labels, it is difficult to separate question-guided attention from mood-guided attention, and this concatenation is believed to harm the performance of the model. To mitigate this effect, we propose a Double-Layer Affective Visual Question Answering Network (DAVQAN) that divides the task of generating emotional answers into two simpler subtasks, the generation of non-emotional answers and the prediction of mood labels, and handles them with two independent layers. We also introduce a more advanced word embedding method and a more fine-grained image feature extractor into AVQAN and DAVQAN, which further improves their performance over the original models and shows that, like general VQA models, VQA integrated with affective computing can be improved by strengthening these two modules.

(5) To alleviate the problems caused by the shortage of medical resources in China, such as frequent medical disputes and the difficulty of implementing medical insurance, we propose an intelligent medical diagnosis system that provides efficient medical diagnosis services, promotes the integration of medical information, and helps medical staff improve the quality and efficiency of medical services. The system is built on our proposed visual object relation module, the threshold-based multi-head scaled dot-product attention and DAVQAN's idea of dividing a complex task into simpler subtasks, and it uses technologies such as transfer learning to collect, process, analyze and understand the input medical diagnosis information. The system also draws on its internal experiential knowledge to answer natural language questions about medical diagnosis on medical images. In addition, it can accumulate, refine, learn and update this experiential knowledge through interaction with medical diagnosis information in the external environment, thereby achieving autonomous learning. Because the system performs medical diagnosis tasks automatically, users cannot intuitively judge its reliability; we therefore demonstrate the validity and interpretability of the system through attention visualization results. Finally, we point out the shortcomings of the intelligent medical diagnosis system and take them as the main content and direction of future research.