
Attention & Meta-learning Based Visual Question Answering

Posted on: 2022-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Li
Full Text: PDF
GTID: 2518306524490254
Subject: Master of Engineering
Abstract/Summary:
Visual question answering (VQA) has been a hot research area in deep learning. The task is defined as follows: a VQA system involves both visual and textual processing, taking a natural image and a free-form natural language question as input and generating a natural language answer as output. Current VQA methods are usually built on object detection models, which are computationally slow and lack interpretability; their training also relies on large sample sets and lacks the ability to learn from few samples. In this thesis, to reduce the computational cost, image features are extracted with a pure Transformer structure or with convolution combined with a Transformer, and the key information in the features is extracted with an attention method. At the same time, a meta-learning method further improves the few-shot learning ability. The main research contents of this thesis are as follows:

Firstly, this thesis re-examines the influence of different visual feature extraction methods and finds that convolution and Transformer layers can replace the region-selection and region-feature-computation modules, which greatly improves computational efficiency. Compared with traditional VQA methods, this approach also has higher explainability: by visualizing the attention information in the model, one can clearly see the important regions in the image and the important words in the question during inference.

Secondly, traditional VQA methods rely on large training sets, while the types and forms of questions in the VQA task are unpredictable, so traditional methods lack the ability to deal with unfamiliar questions. To enhance the few-shot learning ability, this thesis groups questions according to their similarity and compares a group of similar questions through a meta-learning method, so as to infer the likelihood that these questions share the same answer.

In general, this thesis mainly uses attention-based methods to extract textual and visual information, realizes a multi-modal co-attention mechanism, and enhances accuracy in the few-shot setting through meta-learning. Finally, experiments show that the proposed model is superior to traditional VQA methods in both accuracy and computational efficiency.
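As a rough illustration of the first contribution, the sketch below shows how a strided-convolution patch embedding plus Transformer encoders can stand in for region selection and region feature computation, with a cross-modal attention step whose weights can be visualized. This is a minimal PyTorch sketch under assumed settings: the class name CoAttentionVQA, all dimensions, layer counts, and vocabulary sizes are hypothetical and not taken from the thesis.

import torch
import torch.nn as nn

class CoAttentionVQA(nn.Module):
    def __init__(self, dim=256, num_heads=4, vocab=10000, num_answers=1000):
        super().__init__()
        # A strided convolution splits the image into 16x16 patches and
        # projects each patch to a dim-d token -- no region proposals needed.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.word_embed = nn.Embedding(vocab, dim)
        # Self-attention within each modality, then cross-modal co-attention.
        self.img_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True),
            num_layers=2)
        self.txt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True),
            num_layers=2)
        self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_ids):
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        v = self.img_encoder(patches)
        q = self.txt_encoder(self.word_embed(question_ids))
        # Question tokens attend to image patches; attn_weights is a
        # per-word-per-patch map that can be visualized for interpretability.
        fused, attn_weights = self.co_attn(query=q, key=v, value=v)
        logits = self.classifier(fused.mean(dim=1))
        return logits, attn_weights

model = CoAttentionVQA()
logits, attn = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))

Because the patch embedding and encoders run in one dense forward pass, the cost of per-region proposal and feature extraction is avoided, which matches the abstract's efficiency claim.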
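The second contribution is described only at a high level, so the following is one possible reading in a relation-network-style metric-learning form; the names AnswerRelation and episode_loss are invented for illustration. Within a group of similar questions, a small network scores how likely a query question is to share its answer with each labelled support question.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerRelation(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Scores a (query, support) pair of fused question-image embeddings.
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, support, query):
        # support: (S, dim) embeddings with known answers
        # query:   (Q, dim) embeddings with unknown answers
        S, Q = support.size(0), query.size(0)
        pairs = torch.cat([query.unsqueeze(1).expand(Q, S, -1),
                           support.unsqueeze(0).expand(Q, S, -1)], dim=-1)
        return self.score(pairs).squeeze(-1)  # (Q, S) same-answer logits

def episode_loss(logits, support_answers, query_answers):
    # A query should match exactly those support items sharing its answer.
    target = (query_answers.unsqueeze(1) == support_answers.unsqueeze(0)).float()
    return F.binary_cross_entropy_with_logits(logits, target)

Training over many such episodes, each built from one similarity group, would teach the model to transfer answer evidence between related questions even when each individual question type has few labelled examples.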
Keywords/Search Tags: VQA, vision transformer, attention, meta-learning, few-shot learning