
Research On Cross-Modal Retrieval Algorithm Based On Fine-Grained Semantic Preservation

Posted on: 2024-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Li
Full Text: PDF
GTID: 2568307136451564
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of artificial intelligence and the widespread adoption of big data applications, enormous volumes of multimedia data are generated on the Internet. As the scale of data continues to grow, the curse of dimensionality becomes increasingly severe, and how to retrieve and manage such large, complex data quickly and efficiently has attracted considerable attention. Image-text matching plays a key role in cross-modal information processing. Because the semantic differences between vision and language are hard to bridge, a key challenge is learning a unified and comprehensive image-text representation. In multimodal information processing tasks, how to integrate information from different modalities effectively, and how to achieve semantic consistency and accuracy, are pressing open problems. The specific research content of this thesis is as follows:

(1) Most methods use only regional image features and ignore global image features, which loses important background information contained in the whole image. To strengthen the association between global and local concepts and obtain more accurate visual features, a multi-level semantic alignment method with self-attention is proposed. First, local image features are extracted to capture the fine-grained information in the image. Then, global image features are extracted to bring environmental information into network learning, so that visual relationships at different levels are captured and the joint visual features carry more information. The image features are then combined, and finally the combined visual features are aligned with the text features to obtain a more accurate similarity representation. Extensive experiments and analysis on two public datasets demonstrate the effectiveness of the method.

(2) Most methods focus only on entity alignment and ignore the alignment of entity relationships and attributes. To address this problem, a Cross-Modal Semantically Augmented Network for Image-Text Matching is proposed. First, regional and global image features are extracted to obtain the fine-grained semantic information and the salient background information in the image, respectively. Second, an adaptive word type prediction model is proposed, which assigns each word to one of four types and outputs the predicted probability of each type. Finally, the model performs local alignment, global alignment and relational alignment on the regional image features, global image features and word type prediction features, respectively, to strengthen the semantic association between images and texts and obtain a more accurate similarity representation. Extensive experiments demonstrate the effectiveness and superiority of the proposed method.

(3) Most methods preserve only absolute similarity in hash codes and fail to capture high-order neighborhood information among the training data. A Discrete Multi-similarity Consistent Matrix Factorization Hashing (DMCMFH) method is proposed. Specifically, an individual subspace is first learned for each modality by matrix factorization with multi-similarity consistency. Then, the subspaces are aligned through a shared semantic space to generate homogeneous hash codes. Finally, an iterative discrete optimization scheme is presented to reduce the quantization loss. Quantitative experiments are conducted on three datasets: MSCOCO, MIRFlickr-25K and NUS-WIDE. Compared with supervised baseline methods, DMCMFH achieves improvements of 0.22%, 3.00% and 0.79% on the image-query-text task, and of 0.21%, 1.62% and 0.50% on the text-query-image task, on the three datasets respectively.
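
As a rough illustration of why homogeneous hash codes matter for the retrieval step in (3), the sketch below quantizes real-valued shared-space embeddings with the sign function and ranks text items for an image query by Hamming distance. It is not DMCMFH: the matrix factorization, multi-similarity consistency and iterative discrete optimization are all omitted, and the function names and toy vectors are assumptions made purely for illustration.

```python
from typing import List, Sequence


def binarize(embedding: Sequence[float]) -> List[int]:
    """Quantize a real-valued vector to a {-1, +1} code with the sign function."""
    return [1 if value >= 0 else -1 for value in embedding]


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Count the positions where two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def quantization_loss(embedding: Sequence[float], code: Sequence[int]) -> float:
    """Squared error between a real-valued vector and its binary code."""
    return sum((value - bit) ** 2 for value, bit in zip(embedding, code))


def rank_by_hamming(query_code: Sequence[int],
                    database_codes: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted by ascending Hamming distance to the query."""
    distances = [hamming_distance(query_code, code) for code in database_codes]
    return sorted(range(len(database_codes)), key=lambda index: distances[index])


# Toy shared-space embeddings (made-up numbers, 4-bit codes).
image_query = [0.9, -0.2, 0.4, -0.7]
text_database = [
    [-0.8, 0.3, -0.5, 0.6],  # opposite sign to the query in every position
    [0.7, -0.1, 0.5, -0.9],  # same sign pattern as the query
    [0.6, 0.2, 0.3, -0.4],   # differs from the query in one position
]

query_code = binarize(image_query)
text_codes = [binarize(text) for text in text_database]
ranking = rank_by_hamming(query_code, text_codes)
```

Because both modalities are mapped to codes of the same length in the same space, an image query can be compared with text items directly. In this toy setup `ranking` places the text item whose code matches the query first, and `quantization_loss` measures the gap between an embedding and its code, which is the quantity a discrete optimization scheme would try to keep small.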
Keywords/Search Tags: Image-Text Matching, Fine-Grained Semantic Information, Adaptive Word Type Prediction Model, Matrix Factorization