Multimodal learning is an important research area in deep learning. Compared with traditional multimodal tasks, multimodal tasks based on Optical Character Recognition (OCR) have a different character: the OCR tokens in an image carry both visual and textual attributes, so they not only help the model understand the scene and content of the image but can also directly form part of the answer, which broadens the usual research perspective on multimodal tasks. Because OCR corresponds to text that actually appears in the image, such tasks also have practical value for visually impaired users. In this paper we focus on two OCR-based multimodal tasks: visual question answering and image captioning. Both involve the visual and textual modalities and require output in the same textual modality; however, because their input modalities are structured differently and their datasets are distributed differently, similar but not identical approaches are needed to solve them. Compared with traditional multimodal tasks, OCR-based multimodal tasks must additionally model the OCR modality, which makes the input more complex and constitutes both the difficulty and the focus of this line of work.

To enrich the input modalities and strengthen their representations, this paper uses object tags and OCR cluster features. An object tag is the category name of an object in the image, that is, visual information expressed as a word; as a noun, it can help the model understand the question and can also appear directly in the answer. OCR cluster features are obtained by running the mean shift clustering algorithm on the OCR bounding boxes of an image. Feeding these features into the model gives OCR tokens a notion of cluster membership, which helps the model avoid dropping OCR words that belong to the same region when predicting an answer; improving the completeness of OCR words in the predicted answer benefits the model considerably.

To further improve performance when predicting long sentences, the model needs to fully understand sentence structure and strengthen the internal consistency between words. By training with multiple decoders and replacing the original cross-entropy loss with a multi-decoder loss function, the model takes longer-range relationships between words into account during training; this loss further improves the model on top of the original architecture.

This paper therefore studies the OCR-based visual question answering and image captioning tasks. We carry out the relevant research by analysing the characteristics and difficulties of the tasks and building on advanced models in current multimodal research. The main research contents are as follows. For the OCR-based visual question answering task, we optimise the model in two respects: the input modalities and the model structure. We run an object tag extraction model on the TextVQA images to obtain object tags, and apply mean shift clustering to the OCR bounding boxes of each image to obtain OCR cluster features. By adapting the model structure, these richer modalities are integrated into the model and learned jointly. Relevant statistics and experimental results demonstrate the necessity and effectiveness of these improvements.
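The abstract does not fix an implementation for the OCR cluster feature, so the following is only a minimal sketch of the idea: given the OCR boxes of one image as an (N, 4) array, mean shift (here from scikit-learn) is run on the box centres, and each OCR token is paired with its cluster id and cluster centroid. The function name, the bandwidth heuristic, and the returned features are illustrative assumptions rather than the paper's actual code.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def ocr_cluster_features(ocr_boxes):
    """Sketch: ocr_boxes is an (N, 4) array of [x1, y1, x2, y2] OCR boxes for one image."""
    # Cluster on the geometric centres of the boxes.
    centers = np.stack([(ocr_boxes[:, 0] + ocr_boxes[:, 2]) / 2.0,
                        (ocr_boxes[:, 1] + ocr_boxes[:, 3]) / 2.0], axis=1)
    # The bandwidth controls how large a spatial neighbourhood counts as one cluster;
    # estimate_bandwidth is only a reasonable default, not necessarily the paper's choice.
    bandwidth = estimate_bandwidth(centers, quantile=0.3)
    ms = MeanShift(bandwidth=bandwidth if bandwidth > 0 else None)
    labels = ms.fit_predict(centers)              # cluster id per OCR token
    centroids = ms.cluster_centers_[labels]       # (N, 2) cluster centre per token
    return labels, centroids
```

For example, the cluster id could be embedded and added to the other OCR features fed to the model, giving each OCR token the notion of cluster membership described above.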
For the OCR-based image captioning task, captions are usually long sentences, so the decoder needs a stronger ability to capture the relationships between words within a sentence; only then can the generated caption approach the reference both in word choice and in sentence structure. Building on the object tags and OCR cluster features, we therefore pay particular attention to the design of the loss function during training. Using the loss function derived from multiple decoders helps the model capture longer-range relationships between words within a sentence and further improves the decoding ability of the decoder.
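The abstract describes the multi-decoder loss only at a high level, so the sketch below shows one plausible reading: several decoder heads predict the same target caption and their token-level cross-entropy losses are averaged. The function name, shape conventions, and averaging rule are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_decoder_loss(decoder_logits, targets, ignore_index=-100):
    """
    Sketch of a multi-decoder training loss.
    decoder_logits: list of (B, T, V) tensors, one per decoder head.
    targets:        (B, T) tensor of gold token ids.
    """
    losses = []
    for logits in decoder_logits:
        # Standard token-level cross-entropy for each decoder against the same targets.
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      targets.reshape(-1),
                                      ignore_index=ignore_index))
    # Average over decoders; a weighted sum would be an equally plausible variant.
    return torch.stack(losses).mean()
```

In this reading, every decoder receives the same supervision, so the shared encoder is pushed toward representations that stay predictive over longer spans; whether the paper combines the decoders exactly this way is not stated in the abstract.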