
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Posted on: 2021-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: G Li
Full Text: PDF
GTID: 2428330620968583
Subject: Software engineering
Abstract/Summary:
Our world is multimodal: information often exists simultaneously as language, sound, images, and more. Artificial intelligence has been developing rapidly, with significant breakthroughs in natural language processing (NLP), automatic speech recognition (ASR), and computer vision (CV), yet breakthroughs confined to a single field remain far removed from our real, multimodal world. To build artificial intelligence that truly understands the human world, a system needs the ability to recognize and respond to multimodal signals. In recent years, natural language processing and computer vision have begun to converge, and many cross-modal research topics have sprung up, such as image retrieval and visual question answering. However, most models in these two areas are designed for pure NLP or CV tasks (for example, the pre-trained language model BERT in NLP and the image-classification network ResNet in CV) and cannot describe the connection between linguistic and visual content well. One alternative is to learn the association from annotated data for each downstream task, but this has obvious disadvantages, such as high annotation cost.

This thesis focuses on the fusion of linguistic and visual information. We design Unicoder-VL, a universal encoder for vision and language: a multi-layer Transformer based on the self-attention mechanism that learns joint representations of linguistic and visual information. Building on this architecture, we pre-train on large-scale image-caption pairs with several objectives: masked language modeling (MLM), masked object classification over image regions (MOC), visual-linguistic matching (VLM, i.e. image-text matching), and masked region feature generation (MRFG). Through this general cross-modal pre-training followed by fine-tuning, the model learns the intrinsic connection between language and vision, fusing the two modalities into a better joint representation. Because it produces joint vector representations of cross-modal information, this unified pre-trained model transfers well to downstream tasks, and we achieve state-of-the-art results on multiple tasks such as image-text retrieval, visual question answering, and visual commonsense reasoning.
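The encoder just described is a single-stream design: word-piece tokens and detected image-region features are embedded into a shared space, concatenated into one sequence, and passed through a stack of Transformer self-attention layers so attention can flow freely across modalities. The following sketch (in PyTorch; every class, dimension, and default here is an illustrative assumption, not the thesis's actual implementation) shows this fusion together with heads for two of the four objectives, MLM and VLM.

# A minimal sketch, assuming PyTorch, of a single-stream cross-modal
# Transformer encoder in the spirit of the model described above.
# All names and dimensions are illustrative assumptions, not the
# thesis's actual code; positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, region_feat_dim=2048,
                 hidden=768, layers=12, heads=12):
        super().__init__()
        # Text side: BERT-style word-piece token embeddings.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Image side: project detected-region features (e.g. from an
        # object detector) into the same hidden space as the text.
        self.region_proj = nn.Linear(region_feat_dim, hidden)
        # Segment embeddings tell the encoder which modality each
        # position belongs to (0 = text, 1 = image region).
        self.segment_emb = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Heads for two of the pre-training objectives: masked language
        # modeling over text, and image-text (visual-linguistic) matching.
        self.mlm_head = nn.Linear(hidden, vocab_size)
        self.vlm_head = nn.Linear(hidden, 2)

    def forward(self, token_ids, region_feats):
        # token_ids:    (batch, text_len) word-piece ids, with a
        #               [CLS]-style token assumed at position 0
        # region_feats: (batch, n_regions, region_feat_dim)
        t = self.token_emb(token_ids) + self.segment_emb(
            torch.zeros_like(token_ids))
        v = self.region_proj(region_feats)
        v = v + self.segment_emb(torch.ones(v.shape[:2], dtype=torch.long,
                                            device=v.device))
        # Single-stream fusion: one concatenated sequence, so every
        # self-attention layer attends across both modalities.
        h = self.encoder(torch.cat([t, v], dim=1))
        text_h = h[:, :t.size(1)]   # hidden states of the text positions
        return self.mlm_head(text_h), self.vlm_head(h[:, 0])

During pre-training, a fraction of the text tokens (and, for MOC and MRFG, of the region features) would be masked and the corresponding heads trained to reconstruct them, while the VLM head scores whether the image and caption genuinely match, with mismatched pairs typically created by randomly re-pairing captions and images. The single-stream choice keeps the model simple and lets every layer mix the two modalities, at the cost of attention over the longer combined sequence.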
Keywords/Search Tags:Deep learning, Cross-modal, Pre-training, Image-text matching, Visual question answering