Referring Expression Understanding (REU) is the task of localizing an object in a visual scene based on an unambiguous description. Such a description is called a referring expression (RE); it bridges the gap between natural language and the real world and plays an indispensable role in everyday life. To better simulate human understanding of REs, more attention must be given to modeling the relationships between objects in an image. Research on relationship modeling is still at an early stage and has not yet exploited fine-grained, structured information for REU. In this paper, we therefore propose a relation-driven REU model by studying relationship modeling methods and applying them to the REU task. For spatial relationships, we use fine-grained mask information to model the relationships between objects, improving on the coarse-grained five-dimensional vector encoding used originally. A phrase-guided object attention mechanism identifies absolute and relative relationships, which are ultimately unified and trained more fully within the same block. For absolute relationships, a multi-level message passing method further improves recognition ability. Experiments show that this model is a general, modular spatial relationship model that can locate objects through spatial relation-driven modeling. For structural relationships, the text-side model adopts a structured, text-guided phrase generator based on dependency and constituency trees to generate structured phrases. The visual-side model applies an improved graph attention network which, in conjunction with the text-side model, gives the model the ability to perform multi-hop and long-range reasoning. Meanwhile, a weakly supervised reverse inference training method alleviates the problem that unannotated objects cannot be trained, thereby constraining object localization. Thus, the forward text-side structural relationships, the improved graph network, and reverse inference can serve as a closed-loop system for the flow of relationship information. Experiments show that the advantages of fine-grained mask information extend to well-designed graph networks, and that structured phrases and reverse inference can lead to a competitively performing model.
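
To make the contrast between the coarse five-dimensional box encoding and fine-grained mask information concrete, the following is a minimal sketch under stated assumptions. The five-dimensional vector is assumed to be the normalized box corners plus relative area, a form common in earlier REU work; the abstract does not spell out its exact layout. The mask-level feature (`mask_relation`, centroid offset plus mask IoU) is a hypothetical illustration and not the relation encoding proposed in the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Mask = List[List[int]]  # binary mask, mask[row][col] in {0, 1}


def box_encoding(box: Box, img_w: float, img_h: float) -> List[float]:
    """Coarse 5-d location vector (assumed layout): normalized corners + relative area."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    return [x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h, area / (img_w * img_h)]


def mask_centroid(mask: Mask) -> Tuple[float, float]:
    """Centroid (x, y) of the foreground pixels of a binary mask."""
    pts = [(c, r) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("empty mask")
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def mask_relation(subject: Mask, other: Mask) -> List[float]:
    """Hypothetical mask-level relation feature: normalized centroid offset and mask IoU.

    Both masks are assumed to have the same height and width.
    """
    h, w = len(subject), len(subject[0])
    sx, sy = mask_centroid(subject)
    ox, oy = mask_centroid(other)
    inter = sum(1 for r in range(h) for c in range(w) if subject[r][c] and other[r][c])
    union = sum(1 for r in range(h) for c in range(w) if subject[r][c] or other[r][c])
    return [(ox - sx) / w, (oy - sy) / h, inter / union]


# Two toy objects side by side in a 4x2 image.
left = [[1, 1, 0, 0], [1, 1, 0, 0]]
right = [[0, 0, 1, 1], [0, 0, 1, 1]]
coarse = box_encoding((0, 0, 2, 2), img_w=4, img_h=4)
fine = mask_relation(left, right)
```

The box encoding only sees the rectangle around an object, whereas a mask-level feature is computed from the pixels the object actually occupies, which is the kind of finer-grained signal the abstract refers to.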