
Research On Image-based Visual Reasoning Technology

Posted on: 2023-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: M D Wu
Full Text: PDF
GTID: 2558306908950549
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology and computer multimedia technology, large amounts of image and text data are collected on Internet platforms and in major data centers. These image and text data often contain valuable information. How to fully mine and exploit the correlation information contained in multimodal data, and to carry out deeper reasoning over it, has become an important research hotspot. With the continuous development of deep learning technology, computers can deeply mine the information in single-modality image or text data. However, mining the association information in multimodal data composed of images and texts remains a challenging task for computers. To improve the machine's ability to understand such multimodal data, many researchers have studied visual reasoning tasks based on scene understanding and relational reasoning. By analyzing the characteristics of multimodal data composed of images and text, this thesis proposes solutions to the problems faced by current visual reasoning tasks. The main research contents of this thesis are summarized as follows:

(1) The network structures of existing typical visual reasoning models and their performance under the current evaluation index are discussed. This thesis points out that the current evaluation index only considers the proportion of correctly answered reasoning problems, and ignores the probability assigned to semantically similar options, the probability assigned to semantically different options, and the probability assigned to the correct option when the model answers incorrectly. It therefore cannot adequately evaluate whether the model fully understands the current scene. To address these problems, this thesis proposes a model evaluation index based on text similarity. First, the text similarity between each of the other options and the correct option is calculated with a deep-learning-based method; then, calculation rules are designed to combine the text similarity with the probability values, yielding the model's visual-reasoning score for the current scene. Finally, the scores of all scenes are averaged with weights to obtain the final evaluation of the model.

(2) Since the visual reasoning task requires a model not only to understand image data and text data individually, but also to understand the multimodal information formed by their combination, a visual reasoning model based on a two-stream attention network is designed in this thesis. First, a feature extraction module processes the image data and text data to extract preliminary features. Then a two-stream attention network, composed of a self-attention module and an image-text joint attention module, is designed to strengthen the model's understanding of both unimodal and multimodal data. On this basis, this thesis also designs three pretraining tasks, namely masked region modeling, masked language modeling, and image-text matching, which further improve the model's understanding of image and text data. Finally, the proposed method is verified on a visual reasoning dataset; the experiments comprise two parts, a horizontal comparison experiment and an ablation experiment. The accuracy of the method on the three subtasks of the dataset is 75.3, 77.2, and 58.0 respectively, and its scores under the weighted similarity measure proposed in this thesis are 6.843, 6.765, and 3.892 respectively, showing excellent performance.
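The abstract does not give the weighted similarity measure in closed form. A minimal sketch of one plausible reading, in which the probability mass the model places on each option is credited in proportion to that option's text similarity to the correct answer, might look like the following (the function names and the exact combination rule are illustrative assumptions, not the thesis's definitions):

```python
def weighted_similarity_score(probs, sims):
    """Hypothetical per-scene score: probability mass weighted by
    each option's text similarity to the correct option.

    probs: the model's probabilities over the answer options.
    sims:  text similarity of each option to the correct option
           (the correct option's similarity to itself is 1.0).
    """
    return sum(p * s for p, s in zip(probs, sims))


def dataset_score(scene_scores, weights=None):
    """Weighted average of per-scene scores (uniform by default),
    matching the abstract's final averaging step."""
    if weights is None:
        weights = [1.0] * len(scene_scores)
    total = sum(w * s for w, s in zip(weights, scene_scores))
    return total / sum(weights)
```

Under this reading, an incorrect but semantically close answer still earns partial credit, which is exactly the information the plain accuracy metric discards.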
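The two-stream design described above, self-attention within each modality followed by image-text joint (cross) attention, can be sketched with plain scaled dot-product attention. This is a single untrained block with all learned projection weights omitted; the real model's layer layout and dimensions are assumptions here:

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v


def two_stream_block(img, txt):
    """One hypothetical two-stream block.

    img: image region features, shape (num_regions, d).
    txt: text token features, shape (num_tokens, d).
    """
    # Self-attention stream: each modality attends within itself.
    img_sa = attention(img, img, img)
    txt_sa = attention(txt, txt, txt)
    # Joint-attention stream: each modality attends to the other.
    img_ca = attention(img_sa, txt_sa, txt_sa)  # image queries text
    txt_ca = attention(txt_sa, img_sa, img_sa)  # text queries image
    return img_ca, txt_ca
```

The cross-attention step is what lets image regions be described in terms of text tokens and vice versa, which is the "image-text joint attention" the abstract credits with the multimodal understanding.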
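Of the three pretraining tasks, masked language modeling is the most standard. A sketch of the usual BERT-style token masking it relies on follows; the 80/10/10 replacement split is the conventional recipe, not necessarily the thesis's exact choice, and the vocabulary here is a toy placeholder:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "sits", "runs", "the"]  # placeholder vocabulary


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking for a masked-language-modeling objective.

    Returns (masked_tokens, labels), where labels holds the original
    token at each masked position and None elsewhere.
    """
    rng = rng or random.Random(0)
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                out.append(MASK)                  # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(rng.choice(TOY_VOCAB))  # 10%: random token
            else:
                out.append(tok)                   # 10%: keep original
        else:
            out.append(tok)
            labels.append(None)
    return out, labels
```

Masked region modeling applies the same idea to image region features, and image-text matching trains a binary classifier on aligned versus mismatched pairs; together the three tasks push the two streams to predict each modality from the other.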
Keywords/Search Tags:Visual reasoning, Attention mechanism, Evaluation index, Multimodal data