
Image-text Retrieval Based On Multi-modal Feature Fusion

Posted on: 2022-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Wang
Full Text: PDF
GTID: 2518306764966909
Subject: Computer Software and Computer Application
Abstract/Summary:
With the rapid growth of Internet and social media technologies, the volume of multi-modal data is increasing dramatically, and the demand for mutual retrieval between data of different modalities is growing accordingly. One widely applied task is image-text retrieval, which refers to automatically retrieving the most relevant data of one modality (e.g., text) given a query from another modality (e.g., an image) with a trained model. Nevertheless, the intrinsic difference in data distribution between images and texts, also called the "heterogeneity gap", makes measuring the semantic correlation between images and texts challenging. Most models adopt a late fusion scheme, which enhances image features and text features independently and fuses the features from the two modalities only at the end to compute image-text similarity. Although these models have achieved success, they still suffer from the following problems. First, late fusion schemes usually ignore the latent interactions across modalities and therefore cannot fully bridge the modality gap. Second, existing early fusion schemes mostly aggregate global features and thus miss fine-grained intra-modal information. Third, these models usually concentrate on either global matching or local matching, while local-global fusion across modalities is not fully explored.

For the first two problems, we propose a hybrid fusion framework that combines early fusion and late fusion. Based on the raw image and text features, the early fusion module integrates localized visual regions in images with global information in texts, so that the fused representation preserves the interactions between local visual information and global textual semantics; the subsequent network adapts to the fused features automatically. Meanwhile, the hybrid framework also performs late fusion on images and texts to enhance the semantic information within each modality.

For the third problem, we propose a hybrid fusion architecture based on inter-modal fusion and intra-modal fusion. An attention flow propagates inter-modal and intra-modal correlations between visual features and textual features. In addition, the intra-modal fusion module employs a gate mechanism to dynamically control the aggregation of local information within each modality, conditioned on global information from the other modality.

The two methods are evaluated on two large-scale cross-modal datasets, Flickr30K and MSCOCO. Fair comparisons with previous work demonstrate that our methods improve retrieval performance, and a series of ablation studies with corresponding analyses validates the rationality of the specific designs in the two proposed methods.
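To make the two fusion ideas above concrete, here is a minimal PyTorch sketch. It is not the thesis's actual architecture: every module, the mean pooling used for global features, and the 0.5 branch weighting are illustrative assumptions. The sketch shows a gate mechanism that aggregates local features of one modality conditioned on the other modality's global feature, and a hybrid similarity that combines this early-fusion branch with a plain late-fusion branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedIntraModalFusion(nn.Module):
    """Aggregate local features of one modality; the gate is driven by the
    other modality's global feature (the 'gate mechanism' described above)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, local_feats, other_global):
        # local_feats:  (batch, n_local, dim), e.g. region or word features
        # other_global: (batch, dim), global feature of the other modality
        context = other_global.unsqueeze(1).expand_as(local_feats)
        # Sigmoid gate per local position, conditioned on cross-modal context.
        g = torch.sigmoid(self.gate(torch.cat([local_feats, context], dim=-1)))
        return (g * local_feats).mean(dim=1)  # (batch, dim) fused vector


class HybridFusionSimilarity(nn.Module):
    """Combine an early-fusion branch (locals gated by the other modality's
    global) with a late-fusion branch (similarity of independent globals)."""

    def __init__(self, dim):
        super().__init__()
        self.img_fuse = GatedIntraModalFusion(dim)
        self.txt_fuse = GatedIntraModalFusion(dim)

    def forward(self, img_regions, txt_words):
        # Simple mean pooling stands in for the global feature extractors.
        img_global = img_regions.mean(dim=1)
        txt_global = txt_words.mean(dim=1)
        # Early fusion: each modality's locals gated by the other's global.
        early_sim = F.cosine_similarity(
            self.img_fuse(img_regions, txt_global),
            self.txt_fuse(txt_words, img_global),
            dim=-1,
        )
        # Late fusion: similarity of the independently pooled globals.
        late_sim = F.cosine_similarity(img_global, txt_global, dim=-1)
        return 0.5 * (early_sim + late_sim)


# Toy usage: 4 image-text pairs, 36 regions, 12 words, 512-d features.
model = HybridFusionSimilarity(512)
sims = model(torch.randn(4, 36, 512), torch.randn(4, 12, 512))  # shape (4,)
```

The design point the sketch illustrates is that the gate lets cross-modal context decide how much each local feature contributes before aggregation, while the late-fusion branch keeps a purely intra-modal signal, so the final similarity draws on both.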
Keywords/Search Tags:Image-text Retrieval, Multi-modal Fusion, Early Fusion, Late Fusion, Attention Mechanism