Visual localization aims to locate a region in an image referred to by a natural language query. It consists of two subtasks: referring expression comprehension (REC) and referring expression segmentation (RES). In recent years, visual localization has attracted increasing attention. The fine-grained alignment between images and sentences constructed by a visual localization model helps downstream tasks, such as visual question answering and vision-language navigation, to better understand multimodal data. This paper studies multimodal understanding and reasoning for one-stage visual localization. Existing one-stage methods extract visual feature maps and text features separately, then use multimodal reasoning to predict the bounding box of the referred object. These methods have the following drawbacks. First, the pre-trained visual feature extractor introduces text-irrelevant visual signals into the visual features, which hinders multimodal interaction. Second, their reasoning process lacks visual guidance for language modeling. Third, previous REC or RES approaches are limited either by the performance of two-stage designs or by the complexity of one-stage architectures, and no one-stage approach supports simple and efficient joint learning of the REC and RES tasks.

To address the problem of visual noise, TVNF is proposed to reduce the influence of text-irrelevant visual noise on inference. It uses two modules, channel attention and spatial attention, to enhance the representation of textual information in images, filter out a large amount of text-irrelevant visual noise, significantly strengthen fine-grained interaction between image and text information, and improve the accuracy and generalization ability of the visual localization model. The effectiveness of TVNF is verified by comparison and ablation experiments.

To address localization errors under long and complex referring expressions, this paper proposes a recursive interactive text-encoding model. Starting from image features, the intermediate understanding of each reasoning round is represented as text-conditioned visual features. Through multiple rounds of recursive reasoning between image and text information, the referential ambiguity of visual localization in complex scenes is gradually reduced, producing more accurate localization predictions. The effectiveness of this method is verified by comparison and ablation experiments.

For the REC and RES subtasks of visual localization, this paper proposes a one-stage dual-path multi-level interactive multi-task network (DMIMN). DMIMN uses low-order interaction to filter text-irrelevant visual noise and high-order interaction to perform multi-step reasoning. The contextual representation of the sentence is introduced into the low-order interactive extraction of visual features, and language modeling is enhanced using high-order interactive visual features. Meanwhile, DMIMN links the REC and RES tasks to maximize their collaborative learning. Experiments show that the model is real-time and effective under REC and RES multi-task training.
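The text-guided filtering idea behind TVNF can be illustrated with a minimal NumPy sketch: a channel-attention gate suppresses channels with low affinity to the sentence embedding, and a spatial-attention gate re-weights locations by their text-visual similarity. The function names and the specific gating scheme below are illustrative assumptions, not the thesis's actual TVNF architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def text_guided_filter(visual, text):
    """visual: (C, H, W) feature map; text: (C,) sentence embedding.
    Returns a feature map with text-irrelevant responses suppressed."""
    C, H, W = visual.shape
    # Channel attention: gate each channel by its affinity with the text.
    chan_gate = sigmoid(visual.reshape(C, -1).mean(axis=1) * text)   # (C,)
    v = visual * chan_gate[:, None, None]
    # Spatial attention: weight each location by text-visual similarity.
    sim = (v * text[:, None, None]).sum(axis=0)                      # (H, W)
    spat_gate = softmax(sim.reshape(-1)).reshape(H, W)
    return v * spat_gate[None, :, :]
```

In a real model both gates would be produced by learned projections of the text embedding rather than raw dot products; the sketch only shows where the text signal enters the visual pathway.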
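The recursive reasoning scheme, in which each round's intermediate understanding is a text-conditioned visual feature that refines the query for the next round, can be sketched as follows. The update rule and the fixed 0.5 mixing weight are simplifying assumptions for illustration, not the thesis's actual formulation.

```python
import numpy as np

def recursive_ground(visual, text, rounds=3):
    """visual: (N, D) region features; text: (D,) query embedding.
    Each round re-weights regions by similarity to a query vector that is
    itself updated from the attended (text-conditioned) visual evidence."""
    q = text.copy()
    for _ in range(rounds):
        scores = visual @ q                    # (N,) region-query similarity
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                     # soft region weights
        attended = attn @ visual               # (D,) text-conditioned visual feature
        q = 0.5 * q + 0.5 * attended           # refine the query with visual evidence
    return attn  # final weights; the argmax region is the predicted referent
```

Each extra round lets ambiguous queries ("the man left of the red car") sharpen their attention as the query vector absorbs visual context from the previous round.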