
Research On Cross-modal Fusion Between Vision And Language

Posted on: 2024-04-07  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L S Zhang  Full Text: PDF
GTID: 1528307376485094  Subject: Computer application technology
Abstract/Summary:
As an important capability of the human brain, the combination of vision and language has long been a central problem in artificial intelligence and a core technique for industry and information technology. Cognitive science research shows that the human brain performs cross-modal association. Compared with processing vision or language alone, fusing the two modalities greatly increases the amount of information perceived by an intelligent system, improves the efficiency of human-machine interaction, and enables more accurate decision-making. With recent advances in deep neural networks, vision and language models have converged on the self-attention architecture, suggesting that cross-modal fusion applications are imminent. Because text and images are heterogeneous in data format, an intelligent system must first establish a semantic relationship between the two modalities and then combine their information in an effective way to improve performance on specific tasks. Following the way humans think, this dissertation first establishes a representation-learning foundation for multi-modal association and then explores cross-modal fusion and its applications on top of that representation.

(1) A replacement-based self-supervised objective for vision-and-language pre-training is proposed to address the problem of learning fine-grained representations under weak supervision. The method substitutes a word in a genuinely aligned sentence and then, conditioned on the image, predicts through a language-modeling task whether each word has been replaced, thereby learning word-level fine-grained alignment. Substitution tasks are set up for both the single-modal and multi-modal stages. In addition, a homonym-based sentence rewriting strategy is proposed to increase the difficulty of the substitution language-modeling task. The effectiveness of the method is demonstrated by experiments on multiple multi-modal downstream tasks.

(2) A vision-language decomposition approach is proposed to address the trade-off between the performance of single-tower models and the efficiency of dual-tower models. The method divides the construction of cross-modal embeddings into two stages: an early interactive fusion stage adopts a single-tower structure, which is responsible for fully fusing visual and linguistic information and learning high-quality cross-modal representations, while a decoupling stage decomposes the model into a dual-tower structure for efficient and flexible inference. Experimental results on public datasets show that the vision-language decomposition method significantly outperforms dual-tower models in retrieval accuracy and, compared with single-stream models, achieves acceleration while retaining most of their accuracy.

(3) A semi-supervised visual imagination and integration framework is proposed to enhance pre-trained language models. The framework uses cross-modal retrieval to simulate human visual imagination and can be integrated into downstream tasks in a plug-and-play manner. Experimental results on natural language inference and reading comprehension tasks show that the framework improves the performance of widely used strong baseline models.

(4) A text-guided image inpainting method is proposed to address the ill-defined and hard-to-control nature of image inpainting. The method extracts explicit semantic information about the damaged region through a dual multi-modal attention mechanism and enforces semantic similarity between the generated image and the text with an image-text matching loss. Experimental results on public datasets show that the model achieves state-of-the-art performance on both quantitative and qualitative metrics for image inpainting. By further extending the model with image editing functionality, the consistency between the semantics of the restored image and the guiding text is verified, and language-controlled image erasing and modification is achieved for the first time.

In summary, this dissertation follows the framework of "establishing a multi-modal representation foundation, then fusing multi-modal knowledge into single-modal tasks." We first study single-tower and dual-tower vision-language representation models and achieve advanced performance on cross-modal retrieval and other tasks. We then study a semi-supervised visual integration framework with a human-like imagination mechanism, and we propose the first text-guided image inpainting model. The proposed methods are validated on multiple international public datasets and achieve excellent performance. In addition, the text-guided image inpainting task has been supported by international AI creativity projects, and the research on cross-modal retrieval has been deployed in industrial scenarios and has led to patent applications.
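To make the replaced-word objective in contribution (1) concrete, the following is a minimal illustrative sketch of a replaced-word detection head on top of a generic multi-modal encoder. The class name, shapes, and the encoder interface are assumptions for illustration only, not the dissertation's actual architecture.

# Minimal sketch (illustrative only): per-word replaced/original prediction
# conditioned on the image, the core idea behind the substitution objective.
import torch
import torch.nn as nn

class ReplacedWordDetector(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                 # any multi-modal transformer encoder (assumed interface)
        self.head = nn.Linear(hidden_dim, 2)   # per-token classes: original vs. replaced

    def forward(self, image_feats, token_ids, replaced_labels):
        # The encoder fuses image regions and word tokens into per-token states.
        token_states = self.encoder(image_feats, token_ids)   # (B, T, H)
        logits = self.head(token_states)                       # (B, T, 2)
        # Two-class cross-entropy over tokens: was each word substituted or not?
        loss = nn.functional.cross_entropy(
            logits.view(-1, 2), replaced_labels.view(-1))
        return loss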
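The efficiency argument behind the dual-tower inference stage in contribution (2) can also be illustrated with a short retrieval sketch: image embeddings are pre-computed once by an image tower, and a text query is scored against all of them with a single matrix-vector product. The encode_image and encode_text functions stand in for the decomposed towers and are assumptions, not the dissertation's models.

# Minimal sketch (illustrative only) of dual-tower retrieval inference.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(text_query, image_gallery, encode_text, encode_image, top_k=5):
    # Image embeddings can be pre-computed offline once, which is what makes
    # the dual-tower stage efficient and flexible at inference time.
    img_emb = torch.stack([encode_image(im) for im in image_gallery])   # (N, D)
    img_emb = F.normalize(img_emb, dim=-1)

    txt_emb = F.normalize(encode_text(text_query), dim=-1)              # (D,)

    # Cosine similarity reduces to one matrix-vector product over the gallery.
    scores = img_emb @ txt_emb                                          # (N,)
    return scores.topk(top_k).indices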
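For the image-text matching loss used in contribution (4) to keep the inpainted image consistent with the guiding text, one common formulation is a symmetric contrastive loss over paired image and text embeddings; the dissertation's exact formulation may differ, and the embeddings below are assumed to come from hypothetical encoders.

# Minimal sketch (illustrative only) of a contrastive image-text matching loss.
import torch
import torch.nn.functional as F

def image_text_matching_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE-style loss over a batch of paired embeddings (B, D):
    # each image should score highest with its own text and vice versa.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2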
Keywords/Search Tags:Multi-modal Representation, Vision and Language Pre-training, Cross-modal Retrieval, Cross-modal Generation