Research On Image Retrieval Methods Based On Vision Transformer

Posted on: 2024-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: B N Xiu
GTID: 2568306917996999
Subject: Software engineering
Abstract/Summary:
Image Retrieval (IR) is an important research task in computer vision. In recent years, more challenging subtasks such as Fine-grained Image Retrieval (FGIR) and person Re-identification (ReID) have been proposed and have attracted growing attention. In both subtasks, models based on Convolutional Neural Networks (CNNs) have achieved impressive performance. With the help of CNNs, these methods can make full use of the global features of images. For FGIR and ReID, however, local features also play a very important role in the retrieval process. More recently, approaches based on the Vision Transformer (ViT) have achieved great success in traditional image analysis, which is attributed to ViT's natural advantage in capturing important regions and focusing on fine-grained features in an image. How to apply ViT to these more challenging tasks still requires further exploration. This thesis therefore carries out ViT-based research on the FGIR and ReID tasks.

First, the author uses ViT as the backbone and proposes a ViT-based fine-grained image retrieval method that makes better use of the local features of an image. In this method, the author designs a Local Aligned Loss (LAL) that dynamically calculates the minimum distance between paired regions of two images, thereby aligning their local regions. The distance between the two images can then be calculated accurately, so that their similarity is better measured. In this way, the discriminative regions of an image are captured effectively and its fine-grained features are better utilized. The method also introduces a twice-sorting approach, which improves retrieval efficiency while preserving the accuracy of the retrieval results as far as possible.

On this basis, to utilize both the global and the local fine-grained information of images, the author introduces a novel hybrid ViT framework for fine-grained image retrieval and further explores how CNN and ViT can work together effectively in this architecture. Specifically, the author proposes a Critical Patches Reanalysis (CPReA) module, which uses the CNN to guide the selection of critical patches in ViT so that more representative global features can be generated. In addition, the author designs a Cross Network Feature Fusion (CNFF) module to integrate the features of ViT and CNN effectively, so that the output features are more informative. The author also proposes a Global-Local Aligned Loss (GLAL) function that enhances LAL and better measures the similarity between two images.

To verify the generalization ability of the proposed hybrid ViT architecture on a different task, the author proposes a hybrid ViT framework for person Re-identification and tests its performance on the ReID task. In this method, the author designs a Hierarchical Feature Fusion (HiFF) module to make full use of the image features generated by the intermediate layers of the CNN and ViT. With this module, the final features used for retrieval contain richer coarse-grained and fine-grained information. Moreover, a Self-supervised Optimization Ranking (SSOR) module is introduced to further improve the retrieval efficiency and accuracy of the model.

To evaluate the proposed methods, the author conducts extensive comparative and ablation experiments on two typical fine-grained datasets (CUB-200-2011 and Cars-196) and two typical person ReID datasets (DukeMTMC and MSMT17). The results demonstrate the effectiveness of the proposed methods.
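The abstract does not give the formula for LAL or the details of the twice-sorting step, so the Python sketch below is only one possible reading of the idea, not the thesis's method. It assumes that each image has one global feature vector and a list of local region vectors, that region alignment means matching every region to its nearest region in the other image, and that twice-sorting means a coarse ranking by global distance followed by a re-ranking of a shortlist by the local distance. All function names and the `shortlist_size` parameter are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def local_aligned_distance(regions_a, regions_b):
    """Assumed alignment rule: match each region to its nearest region in the
    other image, average those minimum distances, and symmetrise."""
    def one_way(src, dst):
        return sum(min(euclidean(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (one_way(regions_a, regions_b) + one_way(regions_b, regions_a))


def two_stage_retrieval(query, gallery, shortlist_size):
    """Assumed twice-sorting: rank the gallery by global distance, then
    re-rank only the top `shortlist_size` items by local aligned distance.

    `query` and each gallery item are dicts with a 'global' vector and a
    'local' list of region vectors. Returns gallery indices, best first.
    """
    # Stage 1: cheap coarse ranking over the whole gallery.
    coarse = sorted(
        range(len(gallery)),
        key=lambda i: euclidean(query["global"], gallery[i]["global"]),
    )
    shortlist, rest = coarse[:shortlist_size], coarse[shortlist_size:]
    # Stage 2: more expensive local comparison on the shortlist only.
    reranked = sorted(
        shortlist,
        key=lambda i: local_aligned_distance(query["local"], gallery[i]["local"]),
    )
    return reranked + rest
```

Under these assumptions, the second stage compares local regions only for the shortlisted images, which is how a two-stage ranking can reduce cost while keeping the fine-grained comparison where it matters most. A training loss such as LAL would be built on top of a distance like this one; the sketch covers the distance and the ranking only.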
Keywords/Search Tags:Fine-grained Image Retrieval, Person Re-identification, Vision Transformer, Convolutional Neural Network