
Research On Image Content Understanding And Visual Reasoning Algorithm Based On Attention Mechanism

Posted on: 2022-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: T Zhang
Full Text: PDF
GTID: 2518306524480944
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and the fast growth in the amount of information, traditional single-modal information is gradually being replaced by multi-modal media that fuses audio, video, image, and text. Current deep learning methods for processing single-modal information cannot effectively solve real problems in such complex settings. Exploring methods for processing multi-modal information has therefore become an important research direction, and image content understanding and visual reasoning are typical tasks within it. Image content understanding and visual reasoning take an image and a natural language question about that image as input, integrate the image and the target question into a multi-modal feature, "reason" over the image content using that feature, and finally output a natural language answer.

This thesis takes image content understanding and visual reasoning as its research subject. To address the insufficient fusion of multi-modal information in existing methods, it proposes a new multi-modal feature fusion algorithm that improves the degree of multi-modal feature fusion through a purpose-built multi-modal feature fusion loss and multi-modal feature alignment. To address the low level of image semantic understanding in existing methods, it also designs a new algorithm for extracting high-level semantic features from images. Based on these two algorithms, the thesis builds an end-to-end multi-modal feature fusion model and conducts experiments on mainstream datasets to verify its feasibility. The results show that the model performs significantly better than existing mainstream models.

The main contents of this thesis are as follows:

1. For image content understanding, an algorithm for extracting high-level semantic features from images is proposed. The algorithm takes an image and a natural language question about the image (i.e., the target question) as input. The model learns the features of the target question and uses them to guide the learning of deeper information in the image (the objects themselves, plus high-level semantic information relevant to the target question, such as behaviors and events), and finally outputs the high-level semantic features of the image.

2. For multi-modal feature fusion, a definition of "multi-modal feature alignment" is proposed, the concept of "multi-modal feature fusion error" is introduced, and a new multi-modal feature fusion algorithm is implemented. The algorithm takes the high-level semantic features of the image and the question features as input, builds on the bilinear pooling method, and constructs its loss from the MSE distance and the cross-entropy function. Through "multi-modal feature alignment", the image feature and the question feature are finally fused into a stable multi-modal feature.

3. Based on the above two algorithms, an overall model for image content understanding and visual reasoning is constructed and verified experimentally. The thesis conducts ablation studies on the commonly used VQA-v2 dataset to examine how different parameters affect the model's expressiveness and to analyze its effectiveness. It also compares the model with existing mainstream models on three commonly used datasets (CLEVR, GQA, and VQA-v2). Both the comparison with other models and the ablation analysis indicate that the model constructed in this thesis improves significantly on existing models.
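
The question-guided feature extraction in item 1 and the bilinear-pooling fusion with an MSE plus cross-entropy loss in item 2 can be illustrated with a small NumPy sketch. This is not the thesis's implementation, which the abstract does not specify. It assumes scaled dot-product attention over image region features, a low-rank (Hadamard-product) form of bilinear pooling, an MSE term that measures the distance between the projected image and question features as a stand-in for the "multi-modal feature fusion error", and a weighting factor `alpha`. All function names, dimensions, and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(regions, q):
    # regions: (k, d) image region features, q: (d,) question feature.
    # Assumption: scaled dot-product scores; the thesis may use another form.
    scores = regions @ q / np.sqrt(q.shape[0])
    weights = softmax(scores)            # (k,), sums to 1
    return weights @ regions, weights    # attended image feature (d,)

def low_rank_bilinear_fusion(v, q, U, V):
    # Low-rank bilinear pooling: project both modalities into a shared
    # space of size h and combine them with an elementwise product.
    pv = v @ U                           # (h,)
    pq = q @ V                           # (h,)
    return pv, pq, pv * pq

def fusion_loss(pv, pq, logits, answer, alpha=0.5):
    # Assumption: MSE between the two projections acts as the alignment
    # term, cross-entropy over answer candidates as the task term.
    mse = np.mean((pv - pq) ** 2)
    ce = -np.log(softmax(logits)[answer])
    return ce + alpha * mse, mse, ce

# Hypothetical sizes: 36 regions, 64-dim features, 32-dim shared space,
# 10 candidate answers. Random weights stand in for learned parameters.
k, d, h, n_answers = 36, 64, 32, 10
regions = rng.normal(size=(k, d))
q = rng.normal(size=d)
U = rng.normal(size=(d, h)) * 0.1
V = rng.normal(size=(d, h)) * 0.1
W = rng.normal(size=(h, n_answers)) * 0.1

v, weights = question_guided_attention(regions, q)
pv, pq, fused = low_rank_bilinear_fusion(v, q, U, V)
logits = fused @ W
loss, mse, ce = fusion_loss(pv, pq, logits, answer=3)
```

In a trained model, `U`, `V`, and `W` would be learned by minimizing `loss` over a dataset such as VQA-v2. The sketch only shows how the pieces connect in a single forward pass.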
Keywords/Search Tags: image content understanding and visual reasoning, attention mechanism, high-level image semantics, multi-modal feature fusion