
Visual Grounding Based On Deep Learning

Posted on: 2020-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: C C Xiang
Full Text: PDF
GTID: 2428330605467979
Subject: Computer Science and Technology

Abstract/Summary:
Visual grounding aims to localize the object in an image that is referred to by a textual query phrase. A method addressing this problem must understand the semantics of both the visual content and the query phrase, bridge the large gap between the two modalities, and predict the location of the referred object in the image. A general framework for this problem consists of four parts: 1) textual representation; 2) proposal generation and visual representation; 3) multi-modal feature fusion; 4) object localization. This thesis proposes three deep-learning-based approaches:

(1) A Region-based End-to-End Network (R-EEN) for visual grounding. In contrast to most existing two-stage approaches, R-EEN is an end-to-end approach: it generates object proposals and their corresponding visual features simultaneously with a Region Proposal Network (RPN). For the cross-modal fusion module, R-EEN uses Multi-modal Factorized Bilinear pooling (MFB) to fuse the multi-modal features effectively.

(2) A Diversified and Discriminative Proposal Network (DDPN) for visual grounding. This part mainly discusses what properties make a good proposal generator. We introduce diversity and discrimination simultaneously when generating proposals, leading to the DDPN model. Based on DDPN, we propose a high-performance baseline model for visual grounding.

(3) An enhanced DDPN model based on a mixture of detectors (MoD) and MFB. This method uses a multi-detector model to extract ensemble image features for each proposal region, and utilizes the more expressive MFB model to fuse the visual and textual features more effectively.

To verify the effectiveness of our approaches, we conduct experiments on several visual grounding datasets (Flickr30k, ReferItGame, RefCOCO, RefCOCO+). Experimental results demonstrate that our approaches outperform existing state-of-the-art methods on all tested datasets.
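As a minimal sketch of the MFB fusion step mentioned above: MFB projects each modality into a shared k*o-dimensional space, multiplies the projections element-wise, sum-pools over the factor dimension k, and applies power- and L2-normalization. The dimensions, random projection matrices, and weight layout below are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def mfb_fuse(x_vis, x_txt, U, V, k, o):
    """Multi-modal Factorized Bilinear pooling (illustrative sketch).

    x_vis: visual feature (m,);  x_txt: textual feature (n,)
    U: (m, k*o) and V: (n, k*o) are learned projections (random here).
    Returns a fused o-dim vector.
    """
    joint = (x_vis @ U) * (x_txt @ V)               # element-wise product, (k*o,)
    pooled = joint.reshape(o, k).sum(axis=1)        # sum-pool over the k factors
    z = np.sign(pooled) * np.sqrt(np.abs(pooled))   # power normalization
    return z / (np.linalg.norm(z) + 1e-12)          # L2 normalization

# Toy example with assumed dimensions (e.g. 2048-d visual, 300-d textual).
rng = np.random.default_rng(0)
m, n, k, o = 2048, 300, 5, 1000
U = rng.standard_normal((m, k * o)) * 0.01
V = rng.standard_normal((n, k * o)) * 0.01
fused = mfb_fuse(rng.standard_normal(m), rng.standard_normal(n), U, V, k, o)
print(fused.shape)
```

In a trained model, U and V would be learned jointly with the rest of the network; the power/L2 normalization stabilizes the magnitudes produced by the bilinear interaction.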
Keywords/Search Tags:Deep Learning, Visual Grounding, Multi-modal, Cross-media