| Visual Question Answering (VQA) is a challenging task that requires cross-modal understanding of images and questions, together with relational reasoning, to arrive at the correct answer. To bridge the semantic gap between the two modalities, previous works align words and regions over all possible pairs without paying more attention to the truly corresponding word-object pairs. In this paper, relation-consistent pairs are exploited to promote the learning of the deep neural network and further improve model performance. We propose a Cross-modal Relational Reasoning Network (CRRN) that masks inconsistent attention maps and highlights the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks for inter-modal and intra-modal highlighting, which infer the more and less important words in a sentence or regions in an image. By masking unaligned relations, the attention interrelationship of consistent pairs is enhanced as the learning focus shifts toward them. We then propose two novel losses, L_CMAM and L_SMAM, with explicit supervision to capture the fine-grained interplay between vision and language. Meanwhile, contextual information in the question is vital for guiding accurate visual attention, so our network is further equipped with a novel gate mechanism that assigns higher weight to contextual information. In addition, existing works give little consideration to the long-tailed data distribution of common VQA datasets: the extreme class imbalance biases training to perform well on head classes but fail on tail classes. We therefore propose a unified Adaptive Re-balancing Network (ARN) that handles classification in both head and tail classes, further improving VQA performance. Specifically, two training branches are introduced to perform their own duties iteratively: the network first learns universal representations and then progressively emphasizes the tail data through the re-balancing branch with adaptive learning. Experimental results on common benchmarks such as VQA-v2 and GQA demonstrate the superiority of our method over the state of the art. |
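
To make the masking-and-gating idea concrete, the following is a minimal sketch, not the authors' implementation: it shows a cross-modal attention layer in which low-consistency word-region pairs are masked out before the softmax, and a gate derived from the question context re-weights the attended visual features. The module name, the top-k keep ratio, and the use of a mean-pooled question summary as the gating context are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's released code): masked
# cross-modal attention with a question-context gate, in PyTorch.
import torch
import torch.nn as nn


class MaskedCrossModalAttention(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projects question-word features
        self.k_proj = nn.Linear(dim, dim)   # projects image-region features
        self.v_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)     # context gate from the question summary
        self.keep_ratio = keep_ratio        # fraction of word-region pairs kept (assumption)

    def forward(self, words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # words:   (B, Nw, dim) question-word features
        # regions: (B, Nr, dim) image-region features
        q = self.q_proj(words)
        k = self.k_proj(regions)
        v = self.v_proj(regions)

        # Raw word-region affinity scores.
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5   # (B, Nw, Nr)

        # Relational mask: keep only the top-k most consistent regions per word,
        # suppressing unaligned pairs so attention concentrates on consistent ones.
        k_keep = max(1, int(self.keep_ratio * scores.size(-1)))
        thresh = scores.topk(k_keep, dim=-1).values[..., -1:]
        masked_scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = masked_scores.softmax(dim=-1)

        attended = attn @ v                                   # (B, Nw, dim)

        # Context gate: a pooled question summary decides how much visual
        # evidence each word position receives versus its own representation.
        context = words.mean(dim=1, keepdim=True)             # (B, 1, dim)
        g = torch.sigmoid(self.gate(context))
        return g * attended + (1 - g) * words
```

In this sketch the hard top-k threshold stands in for the paper's relational masks, and the sigmoid gate stands in for the proposed gate mechanism that up-weights contextual information; the actual CRRN losses (L_CMAM, L_SMAM) and the two-branch ARN training schedule are not reproduced here.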