
Research on Multi-Modal and Multi-Grained Network-Based Cross-Media Retrieval

Posted on: 2022-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: S J Yuan
Full Text: PDF
GTID: 2518306746996249
Subject: Master of Engineering
Abstract/Summary:
With the widespread use of multimedia devices and the rapid development of the Internet, a great number of users post multimedia data (such as short videos, images, and texts) on social media platforms. There is an enormous demand for accurately retrieving information from this massive multimedia data, which has brought cross-media retrieval extensive attention and profound study in academia. Cross-media retrieval uses one type of media as the query and returns results that share similar semantic information but belong to different media types. Because different media types have different encodings, cross-modal similarity cannot be measured directly; this heterogeneous gap makes cross-media retrieval a difficult and challenging task. Existing research on cross-media retrieval has made great progress, yet it still fails to close the heterogeneous gap. Furthermore, existing studies have not deeply examined the positive role of multi-modal and multi-grained data in bridging this gap. To reduce the heterogeneous gap, different data modalities can be connected by increasing the cross-modal similarity between modalities that have a higher semantic correlation with each other. In addition, since the semantic information contained in multi-grained data is complementary, it is of great importance to narrow the heterogeneous gap by fully mining this complementary information and thereby achieving semantic enhancement. This paper proposes two cross-media retrieval networks that make full use of multi-modal and multi-grained data:

(1) Most existing works focus on single-grained data and exploit only binary values to distinguish correlations between cross-modal data. To address these problems, this paper proposes a cross-media retrieval network based on a multi-margin triplet loss function and coarse- and fine-grained feature fusion. The network is divided into two parts: Network I, a coarse- and fine-grained feature fusion network based on Deep Belief Networks; and Network II, a multi-modal data fusion network based on the multi-margin triplet loss function. We innovatively propose a multi-margin triplet loss function: according to the margins in a margin set, features belonging to different modalities and different semantic categories are separated from the anchors in a multi-margin manner. Comparisons with existing methods and ablation experiments demonstrate that the proposed method improves cross-media retrieval performance. In particular, the strategies of fusing coarse- and fine-grained data and of distinguishing irrelevant data in a multi-margin manner are effective.

(2) Most existing works have not fully considered the complementary relationship between foreground objects and background information. Although several cross-media retrieval works focus on fusing coarse- and fine-grained data, research on fusing more than two data granularities is still lacking. Hence, this paper proposes a cross-media retrieval network that combines object detection with multi-grained data alignment. The network is divided into two parts: an object detection sub-network and multi-grained sub-networks. First, we use object detection to extract foreground objects and innovatively build an object detection sub-network. Second, we further divide multi-grained data into multi-level fine-grained and coarse-grained data: a sliding-window strategy divides images and texts into different fine-grained levels, from which we construct the multi-level fine-grained sub-networks and the coarse-grained sub-network. Finally, we linearly fuse the similarity matrices of the object detection sub-network and the multi-grained sub-networks; the fused matrix reflects the complementary relationship between foreground objects and multi-level backgrounds. Comparisons with existing methods show that the proposed method effectively improves cross-media retrieval performance. The ablation experiments further prove that each sub-network has a positive effect on retrieval performance, and they also confirm the effectiveness of the sliding-window strategy for dividing multi-grained data.

In summary, this paper studies cross-media retrieval based on multi-modal and multi-grained networks. First, we design a multi-margin triplet loss function to constrain the relationships among multi-modal data. We then divide data into multiple granularity levels and explore the complementarity between image foreground objects and multi-level backgrounds. Overall, we fully utilize semantic correlation to model multi-modal data relationships, and further mine and fuse the complementary semantic information among multi-grained data. This work plays an important and positive role in narrowing the heterogeneous gap and effectively improving cross-media retrieval performance.
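The multi-margin triplet loss of contribution (1) can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the Euclidean distance, the mean reduction, and the rule for assigning each negative its margin (e.g. a larger margin for negatives that differ in both modality and semantic category) are assumptions.

```python
import numpy as np

def multi_margin_triplet_loss(anchor, positive, negatives, margins):
    """Hinge-style triplet loss in which each negative is pushed away
    from the anchor by its own margin, drawn from a margin set according
    to how the negative relates to the anchor (hypothetical rule)."""
    d_pos = np.linalg.norm(anchor - positive)
    losses = [max(0.0, d_pos - np.linalg.norm(anchor - neg) + m)
              for neg, m in zip(negatives, margins)]
    return float(np.mean(losses))
```

With a single shared margin this reduces to the ordinary triplet loss; the margin set is what lets the network separate "wrong modality" and "wrong category" negatives to different extents.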
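The sliding-window strategy of contribution (2) can be illustrated on a token sequence; the window and stride values below are hypothetical, and the same idea applies to image regions.

```python
def sliding_windows(tokens, window, stride):
    """Split a sequence (e.g. text tokens; image patches work the same
    way) into overlapping fine-grained segments. Varying the window size
    yields the different fine-grained levels."""
    return [tokens[i:i + window]
            for i in range(0, len(tokens) - window + 1, stride)]
```

At each level the segments would feed the corresponding fine-grained sub-network, while the undivided input feeds the coarse-grained sub-network.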
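The final fusion step of contribution (2) is a linear combination of similarity matrices; the fusion weights here are hypothetical hyperparameters, not values from the thesis.

```python
import numpy as np

def fuse_similarity(matrices, weights):
    """Weighted linear fusion of the similarity matrices produced by the
    object-detection sub-network and the multi-grained sub-networks."""
    fused = np.zeros_like(matrices[0], dtype=float)
    for w, m in zip(weights, matrices):
        fused += w * np.asarray(m, dtype=float)
    return fused
```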
Keywords/Search Tags:Cross-media Retrieval, Multi-Margin Loss Function, Multi-Modal Data, Multi-Grained Data, Attention Mechanism