With the resurgence of artificial intelligence, many previously difficult tasks have returned to the spotlight, and Visual Question Answering (VQA) is one of them. VQA combines the two major research directions of natural language processing and computer vision: it takes image features and question-text features as input, fuses the two modalities and models their interaction, and finally produces an answer to a question about the image. The task has important practical significance. In medicine, even experienced doctors rely on computer systems to help confirm a patient's specific condition when questions require prior knowledge, and such systems also help patients understand their own situation more easily. Medical visual question answering (VQA-Med) arose to meet this need.

This thesis studies medical visual question answering in depth from several angles, including text and image feature extraction, effective inter-modal fusion and interaction, and compensating for the shortcomings of multi-modal processing, and on this basis innovates on and improves the model. The main research contents are as follows.

For the VQA task, this thesis first follows the traditional processing direction, namely the effective extraction of multi-modal features and effective fusion and interaction between modalities. For feature extraction, BioBERT is used to encode the question; multi-modal factorized high-order pooling, which performed best in our comparison, is used for fusion; and a co-attention mechanism is used for inter-modal interaction.

The thesis then changes perspective and works from the angle of complementing the multi-modal branch. Because of the differences between modalities and the feature loss and interference that arise during fusion and interaction, the characteristics of the dataset are analyzed and the data are first ordered by difficulty. Drawing on the idea of curriculum learning and on DenseNet, a model with a single-modal branch and a multi-modal branch is designed, and Mixup data augmentation and global average pooling are used so that the features of the two branches complement each other; performance is finally improved through ensembling. The second new direction addresses the severe class imbalance in the dataset: an improved, unified bilateral-branch network, combined with cross-validation and stronger model generalization, improves the final prediction accuracy.

The proposed models are evaluated on two commonly used medical VQA datasets, ImageCLEF 2021 VQA-Med and VQA-RAD. The experimental results show that the proposed models achieve good results on both datasets: accuracies of 69.8% and 69.6%, respectively, on ImageCLEF 2021 VQA-Med, and 72.9% and 72.6%, respectively, on VQA-RAD, demonstrating the effectiveness of the proposed methods for the medical VQA task.
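As a minimal illustration of the fusion step described above, the sketch below implements one multi-modal factorized bilinear (MFB) block, the building block that factorized high-order pooling cascades. The PyTorch framework, feature dimensions, factor size, and output size are illustrative assumptions, not the settings used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Minimal multi-modal factorized bilinear (MFB) fusion block.

    img_dim, txt_dim, k, and out_dim are placeholder values for illustration,
    not the configuration reported in the thesis.
    """
    def __init__(self, img_dim=2048, txt_dim=768, k=5, out_dim=1000):
        super().__init__()
        self.k = k
        self.out_dim = out_dim
        self.img_proj = nn.Linear(img_dim, k * out_dim)
        self.txt_proj = nn.Linear(txt_dim, k * out_dim)

    def forward(self, img_feat, txt_feat):
        # Element-wise product of the two projected modalities
        joint = self.img_proj(img_feat) * self.txt_proj(txt_feat)
        # Sum-pool over the factor dimension k
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)
        # Signed square-root (power) normalization, then L2 normalization
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint, dim=1)

# Example: fuse a 2048-d image feature with a 768-d BioBERT question feature
fusion = MFBFusion()
img = torch.randn(4, 2048)
txt = torch.randn(4, 768)
print(fusion(img, txt).shape)  # torch.Size([4, 1000])
```

The high-order variant stacks several such blocks, feeding the intermediate product of one block into the next before pooling.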
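The single-modal/multi-modal branch design also relies on Mixup data augmentation. The following sketch shows standard Mixup applied to a batch; the Beta parameter alpha and the loss-mixing usage are illustrative assumptions rather than the thesis settings.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix a batch with a shuffled copy of itself (standard Mixup).

    alpha=0.2 is an illustrative default, not the value used in the thesis.
    Returns the mixed inputs, both label sets, and the mixing coefficient.
    """
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    return mixed_x, y, y[perm], lam

# Usage: the loss is mixed with the same coefficient, e.g.
#   mixed_x, y_a, y_b, lam = mixup_batch(x, y)
#   loss = lam * criterion(model(mixed_x), y_a) + (1 - lam) * criterion(model(mixed_x), y_b)
```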