
Visual Grounding Based On Deep Learning

Posted on: 2020-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: C C Xiang
Full Text: PDF
GTID: 2428330605467979
Subject: Computer Science and Technology

Abstract/Summary:
Visual grounding aims to localize the object in an image that is referred to by a textual query phrase. A method addressing this problem must understand the semantics of both the visual content and the query phrase, bridge the large gap between the two modalities, and predict the location of the referred object in the image. A general framework for this problem consists of four parts: 1) textual representation; 2) proposal generation and visual representation; 3) multi-modal feature fusion; 4) object localization. This thesis proposes three deep-learning-based approaches:

(1) A Region-based End-to-End Network (R-EEN) for visual grounding. In contrast to most existing two-stage approaches, R-EEN is an end-to-end approach: it generates object proposals and their corresponding visual features simultaneously with a Region Proposal Network (RPN). For the cross-modal fusion module, R-EEN uses Multi-modal Factorized Bilinear pooling (MFB) to fuse the multi-modal features effectively.

(2) A Diversified and Discriminative Proposal Network (DDPN) for visual grounding. This part mainly discusses what properties make a good proposal generator. We introduce diversity and discrimination simultaneously when generating proposals, leading to the DDPN model. Based on DDPN, we propose a high-performance baseline model for visual grounding.

(3) An enhanced DDPN model based on a mixture of detectors (MoD) and MFB. This method uses a multi-detector model to extract ensemble image features for each proposal region, and utilizes the more expressive MFB model to fuse the visual and textual features more effectively.

To verify the effectiveness of our approaches, we conduct experiments on several visual grounding datasets (Flickr30k, ReferItGame, RefCOCO, RefCOCO+). Experimental results demonstrate that our approaches outperform existing state-of-the-art methods on all tested datasets.
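As a minimal sketch of the MFB fusion step mentioned above: MFB projects each modality into a shared k*o-dimensional space, multiplies the projections element-wise, sum-pools over the factor dimension k, and applies power- and L2-normalization. The dimensions, random projection matrices, and weight layout below are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def mfb_fuse(x_vis, x_txt, U, V, k, o):
    """Multi-modal Factorized Bilinear pooling (illustrative sketch).

    x_vis: visual feature (m,);  x_txt: textual feature (n,)
    U: (m, k*o) and V: (n, k*o) are learned projections (random here).
    Returns a fused o-dim vector.
    """
    joint = (x_vis @ U) * (x_txt @ V)               # element-wise product, (k*o,)
    pooled = joint.reshape(o, k).sum(axis=1)        # sum-pool over the k factors
    z = np.sign(pooled) * np.sqrt(np.abs(pooled))   # power normalization
    return z / (np.linalg.norm(z) + 1e-12)          # L2 normalization

# Toy example with assumed dimensions (e.g. 2048-d visual, 300-d textual).
rng = np.random.default_rng(0)
m, n, k, o = 2048, 300, 5, 1000
U = rng.standard_normal((m, k * o)) * 0.01
V = rng.standard_normal((n, k * o)) * 0.01
fused = mfb_fuse(rng.standard_normal(m), rng.standard_normal(n), U, V, k, o)
print(fused.shape)
```

In a trained model, U and V would be learned jointly with the rest of the network; the power/L2 normalization stabilizes the magnitudes produced by the bilinear interaction.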
Keywords/Search Tags:Deep Learning, Visual Grounding, Multi-modal, Cross-media