
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Posted on: 2021-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: G Li
Full Text: PDF
GTID: 2428330620968583
Subject: Software engineering
Abstract/Summary:
Our world is multimodal: information often exists simultaneously as language, sound, images, and more. Artificial intelligence has been developing rapidly, with significant breakthroughs in natural language processing (NLP), automatic speech recognition (ASR), and computer vision (CV), yet breakthroughs confined to a single field remain far removed from our real, multimodal world. To build artificial intelligence that truly understands the human world, a system needs the ability to recognize and respond to multimodal signals. In recent years, natural language processing and computer vision have begun to converge, and many cross-modal research topics have sprung up, such as image retrieval and visual question answering. However, most models in these two areas are designed for pure NLP or CV tasks (for example, the pre-trained language model BERT in NLP and the image-classification network ResNet in CV) and cannot describe the connection between linguistic and visual content well. One alternative is to learn the association from annotated data for each downstream task, but this has obvious disadvantages, such as high annotation cost.

This thesis focuses on the fusion of linguistic and visual information. We design Unicoder-VL, a universal encoder for vision and language: a multi-layer Transformer based on the self-attention mechanism that learns joint representations of linguistic and visual information. Building on this architecture, we pre-train on large-scale image-caption pairs with several objectives: masked language modeling (MLM), masked object classification over image regions (MOC), visual-linguistic matching (VLM, i.e. image-text matching), and masked region feature generation (MRFG). Through this general cross-modal pre-training followed by fine-tuning, the model learns the intrinsic connection between language and vision, fusing the two modalities into a better joint representation. Because it produces joint vector representations of cross-modal information, this unified pre-trained model transfers well to downstream tasks, and we achieve state-of-the-art results on multiple tasks such as image-text retrieval, visual question answering, and visual commonsense reasoning.
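The encoder just described is a single-stream design: word-piece tokens and detected image-region features are embedded into a shared space, concatenated into one sequence, and passed through a stack of Transformer self-attention layers so attention can flow freely across modalities. The following sketch (in PyTorch; every class, dimension, and default here is an illustrative assumption, not the thesis's actual implementation) shows this fusion together with heads for two of the four objectives, MLM and VLM.

# A minimal sketch, assuming PyTorch, of a single-stream cross-modal
# Transformer encoder in the spirit of the model described above.
# All names and dimensions are illustrative assumptions, not the
# thesis's actual code; positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, region_feat_dim=2048,
                 hidden=768, layers=12, heads=12):
        super().__init__()
        # Text side: BERT-style word-piece token embeddings.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Image side: project detected-region features (e.g. from an
        # object detector) into the same hidden space as the text.
        self.region_proj = nn.Linear(region_feat_dim, hidden)
        # Segment embeddings tell the encoder which modality each
        # position belongs to (0 = text, 1 = image region).
        self.segment_emb = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Heads for two of the pre-training objectives: masked language
        # modeling over text, and image-text (visual-linguistic) matching.
        self.mlm_head = nn.Linear(hidden, vocab_size)
        self.vlm_head = nn.Linear(hidden, 2)

    def forward(self, token_ids, region_feats):
        # token_ids:    (batch, text_len) word-piece ids, with a
        #               [CLS]-style token assumed at position 0
        # region_feats: (batch, n_regions, region_feat_dim)
        t = self.token_emb(token_ids) + self.segment_emb(
            torch.zeros_like(token_ids))
        v = self.region_proj(region_feats)
        v = v + self.segment_emb(torch.ones(v.shape[:2], dtype=torch.long,
                                            device=v.device))
        # Single-stream fusion: one concatenated sequence, so every
        # self-attention layer attends across both modalities.
        h = self.encoder(torch.cat([t, v], dim=1))
        text_h = h[:, :t.size(1)]   # hidden states of the text positions
        return self.mlm_head(text_h), self.vlm_head(h[:, 0])

During pre-training, a fraction of the text tokens (and, for MOC and MRFG, of the region features) would be masked and the corresponding heads trained to reconstruct them, while the VLM head scores whether the image and caption genuinely match, with mismatched pairs typically created by randomly re-pairing captions and images. The single-stream choice keeps the model simple and lets every layer mix the two modalities, at the cost of attention over the longer combined sequence.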
Keywords/Search Tags:Deep learning, Cross-modal, Pre-training, Image-text matching, Visual question answering