Vision and language are two of the most important ways for humans to understand the world, and they are the main information channels through which humans interact with the external world. In recent years, multimodal learning tasks combining vision and language have attracted widespread attention. With the development of Optical Character Recognition (OCR) technology, the scene-text-based visual question answering (Text-VQA) task was proposed, which poses a great challenge to a model's scene text understanding and multimodal learning abilities. Given an image and a question, the Text-VQA task aims to answer the question by understanding the visual and textual information in the scene of the image. One of the core problems in Text-VQA is therefore how to understand scene text semantics. However, current Text-VQA methods still have many shortcomings and open problems in scene text understanding. Specifically, the Text-VQA task presents the following three challenges. First, scene text semantic modeling: unlike natural language text, the text in a scene image appears in isolation, so a model must construct its textual context in order to understand its semantic information. Second, semantic completion of scene text with missing characters: natural scene text in an image may be occluded, blurred, or partially missing, and the model must understand and complete the scene text semantics from the context and the remaining characters. Third, semantic extension under the guidance of prior knowledge: because text is a high-level human language, it carries a great deal of prior knowledge, and many questions about real scenes cannot be answered from the image alone; the model must combine scene information with external prior knowledge for joint reasoning to reach the correct answer.

Given the above problems, this article proposes scene text visual question answering techniques based on text understanding, which realize contextual semantic modeling of complex scene text, alleviate the semantic understanding bias caused by scene text recognition errors, and effectively use prior knowledge to extend scene text semantics. The main research contents and innovations are as follows.

1) For text semantic modeling in the Text-VQA task, this article proposes a Text-VQA method based on text reading comprehension. The method constructs the context of the scene text according to the natural reading order of the text, fully mines the contextual information of the scene text through a machine reading comprehension model, and thereby realizes contextual semantic modeling of the scene text. Experimental results on multiple Text-VQA datasets show that the method effectively learns the contextual information of texts and achieves excellent performance. Moreover, the method is highly general and can be integrated effectively with other methods.

2) To deal with character occlusion and missing characters in scene text, this article proposes a Text-VQA method based on contrastive learning for semantic completion. The method artificially constructs misspelled OCR text during training, which makes the model more robust to OCR character errors, and proposes an OCR text-and-word contrastive learning (TWC) task to pre-train the scene text representation. Extensive experiments show that the method significantly outperforms state-of-the-art methods on both the TextVQA and ST-VQA datasets, and effectively alleviates the semantic bias caused by character errors in OCR text.

3) To address the difficulty of text semantic understanding in Text-VQA, this article proposes a prior-knowledge-guided Text-VQA method. The method introduces a prior knowledge retrieval system driven jointly by OCR text and questions, together with a prior knowledge verification module based on prompt learning to rank the retrieved knowledge. Experimental results show that the method effectively selects the prior knowledge most relevant to the current scene while avoiding the introduction of irrelevant noise, which benefits the joint reasoning ability of the whole model.
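The reading-order context construction in contribution 1) can be illustrated with a minimal sketch. The token format, the normalized-coordinate layout, and the line-grouping tolerance below are illustrative assumptions, not the article's actual implementation: OCR tokens are grouped into lines top-to-bottom, then read left-to-right within each line, and joined into a single context string.

```python
# Sketch: build a scene-text context string by sorting OCR tokens into
# natural reading order (top-to-bottom, then left-to-right within a line).
# Token format and the line-grouping tolerance are illustrative assumptions.

def reading_order_context(tokens, line_tol=0.5):
    """tokens: list of dicts with 'text' and a normalized box (x, y, w, h)."""
    # Sort by vertical position first so tokens can be grouped into lines.
    tokens = sorted(tokens, key=lambda t: (t["y"], t["x"]))
    lines, current = [], []
    for tok in tokens:
        # Start a new line when the vertical gap exceeds the tolerance.
        if current and abs(tok["y"] - current[-1]["y"]) > line_tol * current[-1]["h"]:
            lines.append(current)
            current = []
        current.append(tok)
    if current:
        lines.append(current)
    # Within each line, read left to right.
    ordered = [t["text"] for line in lines for t in sorted(line, key=lambda t: t["x"])]
    return " ".join(ordered)

tokens = [
    {"text": "COFFEE", "x": 0.4, "y": 0.10, "w": 0.2, "h": 0.05},
    {"text": "OPEN",   "x": 0.1, "y": 0.50, "w": 0.1, "h": 0.05},
    {"text": "DAILY",  "x": 0.3, "y": 0.51, "w": 0.1, "h": 0.05},
]
print(reading_order_context(tokens))  # → COFFEE OPEN DAILY
```

The resulting string can then be fed to a machine reading comprehension model as the passage over which the question is answered.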
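The artificially misspelled OCR text used for training in contribution 2) can be sketched as follows. The confusion table, corruption rate, and deletion rule are illustrative assumptions for common OCR-style errors; the article does not publish its exact corruption settings here.

```python
# Sketch: corrupt OCR words during training to simulate recognition
# errors (visually confusable substitutions plus occasional deletions).
# The confusion pairs and rate p are illustrative assumptions.
import random

# Visually confusable character pairs typical of OCR mistakes.
CONFUSIONS = {"o": "0", "0": "o", "l": "1", "1": "l", "i": "l", "s": "5", "5": "s"}

def corrupt_word(word, rng, p=0.3):
    """Randomly substitute confusable characters and maybe drop one character."""
    chars = list(word)
    for i, c in enumerate(chars):
        if c.lower() in CONFUSIONS and rng.random() < p:
            chars[i] = CONFUSIONS[c.lower()]
    # Occasionally delete a character to mimic occlusion or a missing glyph.
    if len(chars) > 2 and rng.random() < p:
        del chars[rng.randrange(len(chars))]
    return "".join(chars)

rng = random.Random(0)
# Each (corrupted, clean) pair serves as a positive pair for a
# TWC-style contrastive pre-training objective.
pairs = [(corrupt_word(w, rng), w) for w in ["hotel", "coffee", "stop"]]
print(pairs)
```

Training the text encoder to pull each corrupted word toward its clean counterpart (and away from other words) is what makes the representation robust to OCR character errors.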
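The retrieval side of contribution 3) can be sketched with a toy word-overlap scorer. The knowledge base and the overlap scoring below are illustrative assumptions standing in for a real retrieval system; in the article, a prompt-learning verification module then ranks the retrieved candidates.

```python
# Sketch: retrieve prior-knowledge sentences with a query built jointly
# from the question and the OCR text. Word-overlap scoring and the toy
# knowledge base are illustrative assumptions, not the article's system.

def retrieve(question, ocr_tokens, knowledge_base, top_k=2):
    """Score knowledge entries by word overlap with question + OCR text."""
    query = set(question.lower().split()) | {t.lower() for t in ocr_tokens}
    scored = []
    for entry in knowledge_base:
        overlap = len(query & set(entry.lower().split()))
        scored.append((overlap, entry))
    scored.sort(key=lambda s: s[0], reverse=True)
    # Keep only entries that actually share words with the query.
    return [entry for score, entry in scored[:top_k] if score > 0]

kb = [
    "starbucks is a coffee shop chain",
    "a stop sign instructs drivers to halt",
    "paris is the capital of france",
]
print(retrieve("what does this sign tell drivers", ["STOP"], kb))
# → ['a stop sign instructs drivers to halt']
```

Driving the query with both the question and the OCR text is the key point: either source alone may miss the scene-specific terms needed to pull in the relevant knowledge.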