Research On Image Content Understanding And Visual Reasoning Algorithm Based On Attention Mechanism

Posted on:2022-06-23

Degree:Master

Type:Thesis

Country:China

Candidate:T Zhang

GTID:2518306524480944

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet and the rapid growth in the volume of information, traditional single-modal information is gradually being replaced by multi-modal media that fuse audio, video, images, and text. Deep learning methods designed for single-modal information cannot effectively solve real problems in such complex settings. Exploring methods for processing multi-modal information has therefore become an important research direction, and image content understanding and visual reasoning are typical tasks within it. These tasks take an image and a natural language question about that image as input, integrate the image and the target question into a multi-modal feature, "reason" over the image content using that feature, and finally output a natural language answer.

This thesis takes image content understanding and visual reasoning as its research subject. To address the insufficient multi-modal information fusion in existing methods, it proposes a new multi-modal feature fusion algorithm that significantly improves the extent of multi-modal feature fusion through a purpose-designed multi-modal feature fusion loss and multi-modal feature alignment. In addition, to address the low level of image semantic understanding in existing methods, the thesis designs a new algorithm for extracting high-level semantic features from images. Based on these two algorithms, the thesis builds an end-to-end multi-modal feature fusion model and runs experiments on mainstream datasets to verify its feasibility. The results show that the model performs significantly better than existing mainstream models.

The main contents of this thesis are as follows:

1. For image content understanding, an algorithm for extracting high-level semantic features from images is proposed. The algorithm takes an image and a natural language question about the image (i.e., the target question) as input. The model learns the features of the target question and uses them to guide the learning of deeper information in the image (the objects themselves, as well as high-level semantic information such as behaviors and events relevant to the target question), and finally outputs the high-level semantic features of the image.

2. For multi-modal feature fusion, a definition of "multi-modal feature alignment" is proposed, the concept of "multi-modal feature fusion error" is introduced, and a new multi-modal feature fusion algorithm is implemented. The algorithm takes the high-level semantic image features and the question features as input, builds on the bilinear pooling method, and constructs its loss from the MSE distance and the cross-entropy function. Through "multi-modal feature alignment", the image feature and the question feature are fused into a stable multi-modal feature.

3. Based on the above two algorithms, an overall model for image content understanding and visual reasoning is constructed and verified experimentally. Ablation studies on the commonly used VQA-v2 dataset explore how different parameters affect the model's expressiveness and thereby analyze its effectiveness. The model is also compared with existing mainstream models on three commonly used datasets (CLEVR, GQA, and VQA-v2). These horizontal and vertical experimental analyses show that the model constructed in this thesis improves significantly on existing models.
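
The abstract names the ingredients of the approach (question-guided feature extraction, bilinear pooling, and a loss built from MSE distance and cross-entropy) but not their exact formulation. The following is only a minimal sketch of how such pieces could fit together, not the thesis's actual algorithm. Everything specific in it is an assumption made for illustration: dot-product attention as the question-guidance step, a low-rank (elementwise-product) form of bilinear pooling, the MSE between the two projected features as the "alignment" term, the weight `lam`, and all function and variable names.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, q):
    """Question-guided attention (assumed form): weight each image region
    by the softmax of its dot product with the question feature q."""
    scores = [sum(r_i * q_i for r_i, q_i in zip(r, q)) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return pooled, weights

def fuse(v, q, U, V):
    """Low-rank bilinear pooling (assumed variant): elementwise product
    of a linear projection of v and a linear projection of q."""
    pv, pq = matvec(U, v), matvec(V, q)
    return [a * b for a, b in zip(pv, pq)], pv, pq

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, target):
    """Cross-entropy of the softmax over logits against a target index."""
    return -math.log(softmax(logits)[target])

def total_loss(regions, q, U, V, W_out, target, lam=0.1):
    """Answer-classification loss plus an assumed alignment penalty:
    cross-entropy + lam * MSE(projected image, projected question)."""
    v, _ = attend(regions, q)
    z, pv, pq = fuse(v, q, U, V)
    logits = matvec(W_out, z)
    return cross_entropy(logits, target) + lam * mse(pv, pq)

def rand_matrix(rows, cols):
    """Random matrix with entries in [-1, 1], for the toy example only."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

if __name__ == "__main__":
    random.seed(0)
    regions = rand_matrix(5, 4)         # 5 image regions, 4-dim features
    q = [random.uniform(-1, 1) for _ in range(4)]
    U, V = rand_matrix(3, 4), rand_matrix(3, 4)
    W_out = rand_matrix(2, 3)           # 2 candidate answers
    print(total_loss(regions, q, U, V, W_out, target=1))
```

In this sketch the cross-entropy term drives answer prediction, while the MSE term pulls the projected image and question features toward each other; how the thesis actually defines "multi-modal feature alignment" and "multi-modal feature fusion error" is stated only in the full text.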

Keywords/Search Tags:

image content understanding and visual reasoning, attention mechanism, high-level image semantics, multi-modal feature fusion

Related items

1	Research On Image Caption Method Based On High-level Image Semantic And Attention
2	Complex Scene Reasoning Based On Multi-modal Attention Mechanism
3	Research On Image-Text Cross-Modal Matching Based On Attention Mechanism
4	Research On Image Caption Method Based On High Level Semantic Extraction And Attention Mechanism
5	Research On Visual Question Answering Based On Deep Learning
6	Research On Infrared And Visible Image Fusion Algorithm Based On Modal Characteristics And Multi-layer Feature Fusion
7	Research And Implementation Of Image Captioning Algorithm With High-level Semantics Based On Deep Learning
8	Research On Image Captioning Models Based On High-Level Semantics
9	Research And Implementation Of Scene Graph Generation Algorithm Based On Attention Mechanism
10	Image-text Translation Based On Cross-modal Related Semantics And Attention Mechanism