| Visual Question Answering (VQA) is a challenging task that requires cross-modal understanding of images and questions, together with relational reasoning, to arrive at the correct answer. To bridge the semantic gap between the two modalities, previous works align words and regions over all possible pairs without paying more attention to the truly corresponding word-object pairs. In this paper, relation-consistent pairs are exploited to promote the learning of the deep neural network and further improve model performance. We propose a Cross-modal Relational Reasoning Network (CRRN) that masks inconsistent attention maps and highlights the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks for inter-modal and intra-modal highlighting, which infer the more and less important words in a sentence or regions in an image. By masking unaligned relations, the attention interrelationship of consistent pairs is enhanced as the learning focus shifts toward them. We then propose two novel losses, L_CMAM and L_SMAM, with explicit supervision to capture the fine-grained interplay between vision and language. Meanwhile, contextual information in the question is vital for guiding accurate visual attention, so our network is further equipped with a novel gate mechanism that assigns higher weight to contextual information. In addition, existing works give little consideration to the long-tailed data distribution of common VQA datasets: the extreme class imbalance biases training to perform well on head classes but fail on tail classes. We therefore propose a unified Adaptive Re-balancing Network (ARN) that handles classification in both head and tail classes, further improving VQA performance. Specifically, two training branches are introduced to perform their own duties iteratively: the network first learns universal representations and then progressively emphasizes the tail data through the re-balancing branch with adaptive learning. Experimental results on common benchmarks such as VQA-v2 and GQA demonstrate the superiority of our method over the state of the art. |
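
To make the masking-and-gating idea concrete, the following is a minimal sketch, not the authors' implementation: it shows a cross-modal attention layer in which low-consistency word-region pairs are masked out before the softmax, and a gate derived from the question context re-weights the attended visual features. The module name, the top-k keep ratio, and the use of a mean-pooled question summary as the gating context are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's released code): masked
# cross-modal attention with a question-context gate, in PyTorch.
import torch
import torch.nn as nn


class MaskedCrossModalAttention(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projects question-word features
        self.k_proj = nn.Linear(dim, dim)   # projects image-region features
        self.v_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)     # context gate from the question summary
        self.keep_ratio = keep_ratio        # fraction of word-region pairs kept (assumption)

    def forward(self, words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # words:   (B, Nw, dim) question-word features
        # regions: (B, Nr, dim) image-region features
        q = self.q_proj(words)
        k = self.k_proj(regions)
        v = self.v_proj(regions)

        # Raw word-region affinity scores.
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5   # (B, Nw, Nr)

        # Relational mask: keep only the top-k most consistent regions per word,
        # suppressing unaligned pairs so attention concentrates on consistent ones.
        k_keep = max(1, int(self.keep_ratio * scores.size(-1)))
        thresh = scores.topk(k_keep, dim=-1).values[..., -1:]
        masked_scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = masked_scores.softmax(dim=-1)

        attended = attn @ v                                   # (B, Nw, dim)

        # Context gate: a pooled question summary decides how much visual
        # evidence each word position receives versus its own representation.
        context = words.mean(dim=1, keepdim=True)             # (B, 1, dim)
        g = torch.sigmoid(self.gate(context))
        return g * attended + (1 - g) * words
```

In this sketch the hard top-k threshold stands in for the paper's relational masks, and the sigmoid gate stands in for the proposed gate mechanism that up-weights contextual information; the actual CRRN losses (L_CMAM, L_SMAM) and the two-branch ARN training schedule are not reproduced here.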