Visual question answering (VQA) is an emerging topic that integrates natural language processing and computer vision. It focuses on generating accurate answers from a given image and a question about its content. As a domain-specific task, medical visual question answering resembles general-purpose VQA in that both require effective fusion and processing of multimodal data to predict correct answers. The main difficulty in medical VQA is that medical datasets are small and highly specialized, which makes it hard for conventional VQA models to perform well on medical tasks. To address this problem, this paper adapts the model and algorithms to the characteristics of medical data and implements a medical visual question answering network based on small-sample medical data augmentation. The main work is as follows: (1) This paper proposes an augmentation algorithm for multimodal medical data. For images, the I-FGSM algorithm generates augmented samples through multiple iterations of adversarial attacks; for text, a Seq2Seq-based machine translation model performs bilingual translation, and a similarity restriction filters valid samples. The effectiveness of the algorithm on small-sample data is then verified. (2) The feature extraction algorithm is improved for the characteristics of medical data. This paper introduces the MeSH word list to handle complex subword splitting in word vector training, and uses a pre-trained model based on multi-task learning with multi-task inference to extract medical image features, which improves the effectiveness of feature extraction on specialized medical data. (3) Based on self-attention, this paper implements multimodal fusion and augmentation of attention and counting information. Integrating all of this work into a basic visual question answering model yields a medical visual question answering network based on small-sample professional data augmentation, called SPAN (Small-sample Professional Data Augmented Network). Answer accuracy is validated on the open-source dataset VQA-RAD. The network achieves an accuracy of 76.8% on closed-domain questions and 55.6% on open-domain questions. Compared with the most widely used attention-based network, Up-Down, accuracy improves by 1.7% on closed-domain questions and 11.7% on open-domain questions. These results demonstrate the effectiveness of the proposed network on medical data. Experiments also compare it with the generic model to verify the advantages of the augmentation algorithm under small-sample conditions.
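
As an illustration of the image augmentation step in (1), the following is a minimal sketch of the I-FGSM update rule, not the implementation used in this paper. It assumes a toy logistic classifier in place of the medical image model so that the input gradient has a closed form; the function names, the step size `alpha`, the perturbation bound `eps`, and the iteration count `steps` are all hypothetical choices made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_grad_wrt_input(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input x
    for a toy logistic model p = sigmoid(w.x + b) (an assumption)."""
    p = sigmoid(np.dot(w, x) + b)
    return (p - y) * w


def i_fgsm(x, y, w, b, eps=0.03, alpha=0.01, steps=5):
    """Iterative FGSM: repeat small signed-gradient steps that increase
    the loss, keeping the result within an eps-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        grad = loss_grad_wrt_input(x_adv, y, w, b)
        x_adv = x_adv + alpha * np.sign(grad)      # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay close to the original
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep valid pixel values
    return x_adv


# Toy "image": 16 pixel intensities in [0, 1), with fixed random weights.
rng = np.random.default_rng(0)
image = rng.random(16)
weights = rng.normal(size=16)
augmented = i_fgsm(image, y=1.0, w=weights, b=0.0)
```

In an augmentation setting, `augmented` would be added to the training set with the same label as `image`; the bound `eps` controls how far the new sample may drift from the original.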