
Research On Visual Question Answering With Deep Metric Learning

Posted on: 2023-12-05 | Degree: Master | Type: Thesis
Country: China | Candidate: H Y Liang | Full Text: PDF
GTID: 2568306794482134 | Subject: Control Science and Engineering
Abstract/Summary:
Visual Question Answering (VQA) is a core multi-modal task that aims to deduce the correct answer from a given question and its corresponding image. With the rapid development of computer vision and natural language processing, many current VQA methods have achieved remarkable improvements on widely used VQA datasets. However, because annotation artifacts are inevitable in real image datasets, recent studies have found that most VQA models tend to over-rely on spurious linguistic correlations in the training set and usually perform poorly when transferred to out-of-distribution test sets. Existing approaches for mitigating this language bias can be roughly categorized into ensemble-based methods and data-balanced methods. Ensemble-based methods introduce an additional question-only module to assign large weights to minority-class samples and to suppress biased samples. Although these methods achieve satisfactory performance, they distort the data distribution from the outset, which increases the risk of over-fitting the minority-class samples and damages the universal representations learned from the whole data. Data-balanced methods automatically generate additional question-image pairs to balance the training data. Further, some data-augmentation methods use contrastive learning to model the relationship between the original samples and the augmented samples. Compared with other methods, these data-balanced methods have achieved dominant performance. Nonetheless, the sample generation process is blind and may introduce new biases, such as sampling bias, because the augmented samples may carry incorrect answer labels.

In this thesis, we propose a novel Decoupling Representation and classification Network (DRCN) and a novel Lateral Network equipped with deep Metric Learning (LNML). The training process of the decoupled network includes two stages: the first stage focuses on learning universal representations, and the second stage focuses on learning a balanced classifier based on the representations learned in the first stage. The end-to-end trainable lateral network simultaneously retains the universal representations learned from the whole data and improves the discrimination of the samples in each class. The lateral network consists of three branches, termed the original branch, the positive-based branch, and the negative-based branch, and each branch has its own duty. Because LNML resamples from the training set, it does not introduce new biases. In particular, the metric learning imposes additional constraints on the joint embedding space and the answer prediction space of the lateral network, which leads the model to focus more on the content of the samples. Unlike conventional contrastive learning on a triplet, we additionally consider the relationship between the positive and negative samples so that the samples are moved more effectively. That is, each unit in the triplet is constrained by its relative distances to the other units in the triplet, which we call Double-Reference Contrastive Learning (DRCL). Extensive experiments on the VQA-CP v2 benchmark dataset demonstrate that DRCN and LNML can serve as plug-and-play components that improve the robustness of VQA models. Additionally, LNML achieves performance competitive with state-of-the-art methods without extra generated or annotated supplementary data.
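
The abstract does not give the DRCL objective, so the following is only a minimal sketch of one possible reading of "each unit in the triplet is constrained by relative distances from the other units". It assumes Euclidean distance, a hinge with a fixed margin, and two reference points (the original sample and the positive sample). The function names, the margin value, and the exact form of the second term are hypothetical and are not taken from the thesis.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(original, positive, negative, margin=1.0):
    """Conventional triplet loss with the original sample as the only reference.

    Pushes d(original, positive) below d(original, negative) by `margin`.
    """
    d_op = euclidean(original, positive)
    d_on = euclidean(original, negative)
    return max(0.0, d_op - d_on + margin)


def double_reference_loss(original, positive, negative, margin=1.0):
    """Hypothetical double-reference variant (an assumption, not the thesis formula).

    Adds a second hinge that uses the positive sample as the reference, so the
    positive-negative distance is constrained as well.
    """
    d_op = euclidean(original, positive)
    d_on = euclidean(original, negative)
    d_pn = euclidean(positive, negative)
    original_ref = max(0.0, d_op - d_on + margin)  # reference: original sample
    positive_ref = max(0.0, d_op - d_pn + margin)  # reference: positive sample
    return original_ref + positive_ref


if __name__ == "__main__":
    o, p, n = [0.0, 0.0], [1.0, 0.0], [3.0, 0.0]
    print(triplet_loss(o, p, n, margin=1.5))
    print(double_reference_loss(o, p, n, margin=1.5))
```

In this toy triplet the conventional term is already satisfied, while the added positive-referenced term is still active because the negative sits too close to the positive. That is the kind of extra constraint the positive-negative relationship can contribute under these assumptions.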
Keywords/Search Tags: Visual question answering, Robust VQA, Decoupling, Contrastive learning, Language bias