
Research on Multi-Scale Fusion Cross-Modal Retrieval Based on Deep Learning

Posted on: 2022-11-08    Degree: Master    Type: Thesis
Country: China    Candidate: K Q Zhao    Full Text: PDF
GTID: 2518306743474194    Subject: Computer technology
Abstract/Summary:
With the explosive growth of multi-modal data in recent years, how to establish semantic associations across multi-modal data for better management has become a hot topic in deep learning research. Cross-modal retrieval has applications in image-text retrieval, sketch-based search, and recipe retrieval. Because multi-modal data have different representations and underlying structures, it is difficult to measure the similarity between multi-modal data directly. In this thesis, cross-modal retrieval is studied from the two perspectives of common-space learning and correlation learning, and the following work is accomplished.

We propose a method named Dual-Scale Similarity with Rich Features for Cross-Media Retrieval (DSRF), which fuses the similarity of category labels with the similarity of contained objects to measure the similarity of multi-modal data. Most existing methods map data of different modalities into a common space using category labels and pairwise relationships; however, other discriminative information contained in multi-modal data is ignored. In this thesis, results that belong to the same category as the query sample but contain few identical objects receive an appropriate penalty, while correct results (with the same labels and many identical objects) receive greater rewards. In addition, a new semantic feature extraction framework is designed to provide rich semantic information. Multiple attention maps are created to obtain multiple semantic features. Unlike other works that cumulatively average multiple semantic representations, an LSTM with only forgetting gates is used to eliminate redundant information. Specifically, a forgetting factor is generated for each semantic feature, and unimportant semantics are assigned a larger forgetting factor. The mAP and R@K scores on MSCOCO are increased, improving retrieval accuracy significantly.

A multi-scale alignment cross-modal retrieval method (MACMR) is proposed, which measures the relevance of multi-modal data through fused alignment at three levels: global, local object, and action-position relationship. Most existing works focus on alignment at the global level or the local level and ignore the relationship information (action and position) between locally significant regions, which is very important for cross-modal retrieval. In this thesis, relationship-level alignment is added to the global- and local-level alignment. Specifically, a cross-modal multi-path network is constructed to extract relevant information at the global, local, and relationship levels respectively. Object regions are obtained by object detection, the intersection regions between objects are taken as relationship regions, and both object regions and relationship regions are aligned with the corresponding descriptors in the text data. Image regions and text keywords that cannot be matched are removed by a joint attention mechanism, achieving better cross-modal image-text retrieval by aligning image and text data at the three scales adaptively. Extensive experiments conducted on the MSCOCO dataset improve the R@K scores significantly.
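To make the dual-scale idea in DSRF concrete, below is a minimal Python sketch (not the thesis implementation) of fusing label-level similarity with object-level overlap, so that results sharing both the query's labels and many of its objects score higher than results that share only the labels. The function names, the Jaccard overlap, and the fusion weight are assumptions for illustration.

    # Minimal sketch: fuse category-label similarity with object-level overlap.
    # `object_overlap`, `dual_scale_similarity`, and `label_weight` are hypothetical names.

    def object_overlap(query_objects: set, result_objects: set) -> float:
        """Jaccard overlap between the object sets detected in two samples."""
        if not query_objects and not result_objects:
            return 0.0
        return len(query_objects & result_objects) / len(query_objects | result_objects)

    def dual_scale_similarity(query_labels: set, result_labels: set,
                              query_objects: set, result_objects: set,
                              label_weight: float = 0.5) -> float:
        """Results with the same labels and many shared objects are rewarded;
        results with the same labels but few shared objects are penalised."""
        label_sim = 1.0 if query_labels & result_labels else 0.0
        overlap = object_overlap(query_objects, result_objects)
        return label_weight * label_sim + (1.0 - label_weight) * label_sim * overlap

    # Same category, but the second result shares more objects and scores higher.
    print(dual_scale_similarity({"sports"}, {"sports"}, {"dog", "frisbee"}, {"dog"}))
    print(dual_scale_similarity({"sports"}, {"sports"}, {"dog", "frisbee"}, {"dog", "frisbee"}))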
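The "LSTM with only forgetting gates" described for DSRF can likewise be sketched. The following PyTorch code is an assumed, simplified reading of that mechanism: each attention-derived semantic feature receives its own forgetting factor, and semantics judged redundant are weighted down rather than cumulatively averaged. The class name ForgetOnlyAggregator and the exact gating formula are hypothetical.

    # Assumed sketch of aggregating K semantic features with per-feature forgetting factors.
    import torch
    import torch.nn as nn

    class ForgetOnlyAggregator(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            # Produces a forgetting factor for each incoming semantic feature,
            # conditioned on the feature itself and the running summary.
            self.forget_gate = nn.Linear(2 * dim, dim)

        def forward(self, semantics: torch.Tensor) -> torch.Tensor:
            # semantics: (K, dim) -- K semantic features from K attention maps.
            state = torch.zeros(semantics.size(1))
            for s in semantics:
                f = torch.sigmoid(self.forget_gate(torch.cat([state, s])))
                # Unimportant or redundant semantics receive a larger forgetting
                # factor f, so they contribute less to the fused representation.
                state = state + (1.0 - f) * s
            return state

    agg = ForgetOnlyAggregator(dim=256)
    features = torch.randn(4, 256)   # e.g. 4 attention maps -> 4 semantic vectors
    summary = agg(features)          # fused semantic representation, shape (256,)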
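For MACMR, the relationship regions taken as intersections of detected object regions can be illustrated with a small geometric helper. This is a sketch under the assumption that detections are axis-aligned (x1, y1, x2, y2) boxes; relationship_regions is a hypothetical name, not part of the thesis code.

    # Assumed sketch: pairwise intersections of object boxes serve as relationship regions.
    from itertools import combinations

    def box_intersection(a, b):
        """Intersection of two boxes (x1, y1, x2, y2); None if they do not overlap."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        if x1 >= x2 or y1 >= y2:
            return None
        return (x1, y1, x2, y2)

    def relationship_regions(object_boxes):
        """Collect the non-empty pairwise intersections of detected object boxes."""
        regions = []
        for a, b in combinations(object_boxes, 2):
            inter = box_intersection(a, b)
            if inter is not None:
                regions.append(inter)
        return regions

    # Two overlapping detections yield one relationship region; the third is isolated.
    print(relationship_regions([(0, 0, 50, 50), (30, 30, 80, 80), (100, 100, 120, 120)]))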
Keywords/Search Tags:Common Space Learning, Correlation Learning, Cross-media Retrieval, Multi-scale Fusion, Semantic Feature Extraction