With the rapid advancement of medical artificial intelligence, processing information from a single modality is no longer sufficient to meet the growing demand for precise and intelligent healthcare. Cross-modal modeling of vision and language has therefore emerged as a research frontier in medical artificial intelligence, owing to its flexible interaction patterns and high model capacity. One typical cross-modal task in medical information processing is text recognition and question answering over multimodal medical visual information. This task has significant scientific value and broad application prospects, as it involves the fusion of vision and language at different levels. In this work, we address the typical efficiency and accuracy problems in medical text recognition and visual question answering by introducing large-scale weak annotation through weakly supervised learning. We strengthen prior knowledge and optimize the information processing pipeline to improve the performance of medical visual text recognition and question-answering methods from multiple perspectives.

First, this study addresses the efficiency and accuracy issues of attention mechanisms in text line recognition, with a focus on the medical field. To this end, we conduct a comprehensive comparative analysis of various encoders, decoders, and attention mechanisms, with and without a coverage model, based on millions of real-world text lines. We provide a detailed comparison between the optimized attention model and the Connectionist Temporal Classification (CTC) model, evaluating their performance on printed and handwritten text at both line and word level. Additionally, we analyze the typical errors made by the attention mechanism, providing insights into its underlying operation.

Moreover, this study presents an efficient attention mechanism guided by the weakly supervised alignment results of the CTC algorithm. We design chunking and truncation mechanisms that exploit the weakly supervised locations of spaces and characters, combining the benefits of the attention mechanism and CTC. The chunking mechanism randomly splits the feature sequence at space positions; the resulting feature chunks are then decoded in parallel to speed up recognition while largely preserving the full context. To reduce the computational redundancy of calculating attention weights, the truncation mechanism restricts the attention area to the nearby context rather than the entire feature sequence, guided by the weakly supervised character locations, allowing for more efficient computation.
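To make the chunking and truncation ideas concrete, the following is a minimal PyTorch sketch under our own assumptions; it is not the thesis implementation, and the function names (`chunk_at_spaces`, `truncated_attention`), the window size, and the way space and character positions are read from the CTC alignment are illustrative.

```python
# Hypothetical sketch: chunking a feature sequence at CTC-predicted space
# positions and restricting attention to a local window around a
# CTC-predicted character position. Names and shapes are assumptions.
import torch
import torch.nn.functional as F

def chunk_at_spaces(features, space_positions):
    """Split a (T, D) feature sequence at frames where the CTC alignment
    predicts a space; chunks can then be decoded in parallel."""
    chunks, start = [], 0
    for pos in space_positions:
        if pos > start:
            chunks.append(features[start:pos])
        start = pos + 1
    if start < features.size(0):
        chunks.append(features[start:])
    return chunks

def truncated_attention(query, features, center, window=5):
    """Attend only to frames within `window` of the CTC-predicted character
    position `center`, instead of the entire feature sequence."""
    lo = max(0, center - window)
    hi = min(features.size(0), center + window + 1)
    local = features[lo:hi]                          # (W, D)
    scores = local @ query / query.size(-1) ** 0.5   # (W,)
    weights = F.softmax(scores, dim=0)
    return weights @ local                           # (D,) context vector

# Toy usage: 100 frames of 256-d features, spaces predicted at frames 30 and 64.
feats = torch.randn(100, 256)
chunks = chunk_at_spaces(feats, space_positions=[30, 64])
context = truncated_attention(torch.randn(256), feats, center=42)
```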
Second, this work addresses the challenge of prior knowledge learning in medical visual question answering. To tackle this problem, we propose a solution that leverages information learned from a large-scale image captioning dataset. Specifically, we construct an image captioning and weakly supervised semantic locating model that helps answer medical visual questions more effectively. By integrating insights from image captioning, our model better understands the semantics of medical images, and by using weakly supervised learning, it identifies important regions of interest within these images. We first present a novel image captioning and weakly supervised semantic locating model that incorporates multiple attention mechanisms. The model builds upon the strengths of both the Transformer decoder and the soft-attention decoder to effectively model global context while also enabling focused analysis of local regions. By combining these two approaches, the model generates more accurate and informative image captions while also identifying important areas within the image. Then, based on the generated caption text and the regions of interest, we design a caption-aware medical visual question answering model, in which similarity analysis of the regions of interest guides the model to focus on the key semantic regions during visual feature extraction. To further improve accuracy, a progressive compact bilinear interaction model is proposed to efficiently fuse the features from three sources: the image, the question, and the generated caption.

Lastly, based on the observation that medical images typically contain anomalies or lesions, this study proposes an anomaly-oriented medical visual question answering model that utilizes anomaly location results and healthy images generated through weakly supervised anomaly detection. We first collect three kinds of medical images that commonly appear in medical visual question answering, curating a large dataset of hundreds of thousands of labelled images categorized as either "healthy" or "diseased". Using this dataset, we evaluate the anomaly location results and healthy images generated by Fixed-Point GAN and VAE methods for medical visual question answering. We then propose our anomaly-oriented medical visual question answering model, which improves answer accuracy in both anomaly-related and anomaly-unrelated scenarios. For anomaly-related questions, we design a multiplication anomaly-sensitive module that uses the anomaly location results as a filter, allowing the model to emphasize abnormality information and thereby improving its accuracy on such queries. For anomaly-unrelated questions, we propose a residual anomaly-sensitive module that learns the anomaly feature by computing the difference between the input image and a generated healthy image, while also retaining the features of the input image, which is useful for answering anomaly-unrelated questions.
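As an illustration of the two anomaly-sensitive modules, the sketch below shows one plausible PyTorch realization; module names, feature shapes, and the 1x1 projection layer are our own assumptions, not the thesis implementation.

```python
# Hypothetical sketch of the multiplication and residual anomaly-sensitive
# modules described above; names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiplicationAnomalyModule(nn.Module):
    """Uses a weakly supervised anomaly location map as a soft filter,
    emphasizing abnormal regions for anomaly-related questions."""
    def forward(self, image_feat, anomaly_map):
        # image_feat: (B, C, H, W); anomaly_map: (B, 1, H, W) in [0, 1]
        return image_feat * anomaly_map

class ResidualAnomalyModule(nn.Module):
    """Learns an anomaly feature as the difference between the input image
    and its generated healthy counterpart, while keeping the original
    feature for anomaly-unrelated questions."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, image_feat, healthy_feat):
        anomaly_feat = self.proj(image_feat - healthy_feat)
        return image_feat + anomaly_feat  # residual: original + anomaly cue

# Toy usage with 512-channel 7x7 feature maps.
x = torch.randn(2, 512, 7, 7)
healthy = torch.randn(2, 512, 7, 7)
mask = torch.rand(2, 1, 7, 7)
out_related = MultiplicationAnomalyModule()(x, mask)
out_unrelated = ResidualAnomalyModule(512)(x, healthy)
```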