Research On Scene Image Fine-grained Classification Based On Multimodal Fusion

Posted on: 2024-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wen
Full Text: PDF
GTID: 2568307085470594
Subject: Communication and Information System
Abstract/Summary:
At present, datasets containing millions of images enable machine learning algorithms to achieve strong performance in natural scene image classification. However, building datasets at this scale is costly: it demands substantial manpower and material resources as well as long-term accumulation. Preprocessing the data and training on it repeatedly is also expensive, requiring long training runs on high-performance GPUs. These factors limit the practical deployment of machine learning and deep learning in engineering, so many algorithms never leave the laboratory stage. At the same time, text is everywhere in urban and social environments. It carries rich semantic information that is indispensable for complete scene understanding. This thesis therefore focuses on how to classify scene images effectively with less data by fusing textual and visual information. The details are as follows:

(1) Although Chinese is one of the most widely used languages in the world, research on Chinese text recognition and classification remains limited. The first obstacle is the lack of datasets: image datasets containing Chinese text are very rare, and scene image datasets especially so. In view of this, a scene image dataset based on Chinese text information is established for the fine-grained image classification task. The construction and preprocessing of this dataset help to promote research on semantic extraction from images containing Chinese text.

(2) An adaptive weighted decision-level fusion method based on confidence level is proposed. First, the scene text in each image is recognized with Baidu's PaddlePaddle OCR toolkit (PaddleOCR). After the image-based and text-based classification tasks are completed separately, we observe that the relative importance of image and text information for the final decision differs from image to image. Some images contain rich visual information, but their text is scarce or difficult to extract. In others, the text is clear and easy to obtain, but the visual information is not distinctive enough to determine the category. This thesis therefore develops an adaptive weighted decision fusion algorithm in which the decision weights assigned to image and text vary per image. By adjusting the weight ratio adaptively, the method exploits the stronger modality in each case to achieve the best classification result. Compared with the visual-only classification model, the decision-level fusion algorithm increases the F1 score from 0.77 to 0.83, which demonstrates that introducing the text modality can effectively improve scene image classification.

(3) A multimodal feature-level fusion model based on a dual attention mechanism is proposed. This is a deeper form of fusion. On top of image and text feature extraction, a multi-hop attention mechanism is added to repeatedly extract key features, and the text and image features are then fused into a joint feature representation. A modal fusion attention mechanism strengthens the learning of correlations between the modal features and further improves classification, raising the F1 score from the 0.83 obtained with the decision-level fusion of Chapter 4 to 0.89. Finally, ablation experiments verify the effectiveness of the two attention sub-modules and their impact on classification accuracy.
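The confidence-based adaptive weighting described in item (2) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual weighting formula, so everything here is an assumption made for illustration: confidence is taken to be the largest class probability of each classifier, the two weights are those confidences normalized to sum to one, and the names `confidence` and `fuse_decisions` are hypothetical.

```python
from typing import List, Sequence, Tuple


def confidence(probs: Sequence[float]) -> float:
    """Assumed confidence measure: the largest class probability."""
    return max(probs)


def fuse_decisions(
    p_img: Sequence[float], p_txt: Sequence[float]
) -> Tuple[List[float], int]:
    """Fuse image and text class probabilities with per-sample weights.

    Each modality is weighted by its own confidence, so the weights change
    from one image to the next (an assumed rule, not the thesis's formula).
    Returns the fused distribution and the index of the predicted class.
    """
    if len(p_img) != len(p_txt):
        raise ValueError("both classifiers must score the same classes")

    c_img = confidence(p_img)
    c_txt = confidence(p_txt)
    total = c_img + c_txt

    # Fall back to equal weights if both confidences are zero.
    w_img = c_img / total if total > 0 else 0.5
    w_txt = 1.0 - w_img

    fused = [w_img * a + w_txt * b for a, b in zip(p_img, p_txt)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label


if __name__ == "__main__":
    # Ambiguous visual evidence, clear text evidence for class 1.
    image_probs = [0.40, 0.35, 0.25]
    text_probs = [0.05, 0.90, 0.05]
    fused_probs, predicted = fuse_decisions(image_probs, text_probs)
    print(fused_probs, predicted)
```

In this sketch, an image whose recognized text is unreadable would yield a near-uniform text distribution, hence a low text confidence and a small text weight, which mirrors the behavior the abstract describes. Other confidence measures, such as entropy or the margin between the top two classes, would fit the same structure.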
Keywords/Search Tags: Multimodal Fusion, Scene Classification, FGIC, Decision Fusion, Feature Fusion