Visual localization aims to locate a region in an image referred to by a natural language query. It consists of two subtasks: referring expression comprehension (REC) and referring expression segmentation (RES). In recent years, visual localization has attracted increasing attention. The fine-grained alignment between images and sentences constructed by a visual localization model helps downstream tasks, such as visual question answering and vision-language navigation, to better understand multimodal data. This paper studies multimodal understanding and reasoning for one-stage visual localization. Existing one-stage methods extract visual feature maps and text features separately, then use multimodal reasoning to predict the bounding box of the referred object. These methods have the following drawbacks. First, the pre-trained visual feature extractor introduces text-irrelevant visual signals into the visual features, which hinders multimodal interaction. Second, their reasoning process lacks visual guidance for language modeling. Third, previous REC or RES approaches are limited either by the performance of two-stage designs or by the complexity of one-stage architectures, and no one-stage approach supports simple and efficient joint learning of the REC and RES tasks.

To address the problem of visual noise, TVNF is proposed to reduce the influence of text-irrelevant visual noise on inference. It uses two modules, channel attention and spatial attention, to enhance the representation of textual information in images, filter out a large amount of text-irrelevant visual noise, significantly strengthen fine-grained interaction between image and text information, and improve the accuracy and generalization ability of the visual localization model. The effectiveness of TVNF is verified by comparison and ablation experiments.

To address localization errors under long and complex referring expressions, this paper proposes a recursive interactive text-encoding model. Starting from image features, the intermediate understanding of each reasoning round is represented as text-conditioned visual features. Through multiple rounds of recursive reasoning between image and text information, the referential ambiguity of visual localization in complex scenes is gradually reduced, producing more accurate localization predictions. The effectiveness of this method is verified by comparison and ablation experiments.

For the REC and RES subtasks of visual localization, this paper proposes a one-stage dual-path multi-level interactive multi-task network (DMIMN). DMIMN uses low-order interaction to filter text-irrelevant visual noise and high-order interaction to perform multi-step reasoning. The contextual representation of the sentence is introduced into the low-order interactive extraction of visual features, and language modeling is enhanced using high-order interactive visual features. Meanwhile, DMIMN links the REC and RES tasks to maximize their collaborative learning. Experiments show that the model is real-time and effective under REC and RES multi-task training.
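The text-guided filtering idea behind TVNF can be illustrated with a minimal NumPy sketch: a channel-attention gate suppresses channels with low affinity to the sentence embedding, and a spatial-attention gate re-weights locations by their text-visual similarity. The function names and the specific gating scheme below are illustrative assumptions, not the thesis's actual TVNF architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def text_guided_filter(visual, text):
    """visual: (C, H, W) feature map; text: (C,) sentence embedding.
    Returns a feature map with text-irrelevant responses suppressed."""
    C, H, W = visual.shape
    # Channel attention: gate each channel by its affinity with the text.
    chan_gate = sigmoid(visual.reshape(C, -1).mean(axis=1) * text)   # (C,)
    v = visual * chan_gate[:, None, None]
    # Spatial attention: weight each location by text-visual similarity.
    sim = (v * text[:, None, None]).sum(axis=0)                      # (H, W)
    spat_gate = softmax(sim.reshape(-1)).reshape(H, W)
    return v * spat_gate[None, :, :]
```

In a real model both gates would be produced by learned projections of the text embedding rather than raw dot products; the sketch only shows where the text signal enters the visual pathway.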
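The recursive reasoning scheme, in which each round's intermediate understanding is a text-conditioned visual feature that refines the query for the next round, can be sketched as follows. The update rule and the fixed 0.5 mixing weight are simplifying assumptions for illustration, not the thesis's actual formulation.

```python
import numpy as np

def recursive_ground(visual, text, rounds=3):
    """visual: (N, D) region features; text: (D,) query embedding.
    Each round re-weights regions by similarity to a query vector that is
    itself updated from the attended (text-conditioned) visual evidence."""
    q = text.copy()
    for _ in range(rounds):
        scores = visual @ q                    # (N,) region-query similarity
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                     # soft region weights
        attended = attn @ visual               # (D,) text-conditioned visual feature
        q = 0.5 * q + 0.5 * attended           # refine the query with visual evidence
    return attn  # final weights; the argmax region is the predicted referent
```

Each extra round lets ambiguous queries ("the man left of the red car") sharpen their attention as the query vector absorbs visual context from the previous round.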