
A Self Attention Guided Network For Cross-modal Matching

Posted on: 2022-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: X F Qi
Full Text: PDF
GTID: 2518306509493044
Subject: Electronics and Communications Engineering

Abstract/Summary:
In recent years, with the rapid development of multimedia, cross-modal matching has received increasing attention. As an important foundational problem, it plays a critical role in many related cross-modal tasks, such as cross-modal retrieval, image inpainting, image captioning, and visual/video question answering. Traditional matching methods mainly start from statistical analysis: canonical correlation analysis and partial least squares are frequently applied to measure the relationship between different modalities. Although these methods are theoretically interpretable, they cannot deeply understand the semantic content of the image or text modality, which severely limits their performance.

Recently, researchers have found deep learning methods to be more effective and flexible. Typically, a convolutional neural network can extract high-level, multi-scale feature maps from an image, showing superior performance in image processing, while a recurrent neural network and its variants can effectively learn sequential features and understand their semantic content. Building on deep learning, how to extract more reasonable representations and how to measure their content similarity become the key issues to be addressed. Many methods start from matching image regions and words: they first compute local similarities and then synthesize an overall image-text similarity. However, not all regions or words contribute equally; each has a different importance in semantic expression.

To address this problem, this thesis introduces a combination of a self-attention mechanism and a cross-attention mechanism. The former distinguishes local information from the context within the same modality and learns self-attention weights; the latter uses data from the other modality as context and learns cross-attention weights under the premise of cross-modal content alignment. In addition, we observe that statistical features such as word frequency strongly influence word importance, so TF-IDF is introduced as a preprocessing step, which provides statistical prior information and substantially improves the performance of the overall model. We evaluate our method on the MSCOCO and Flickr30K datasets and compare its results with recent work. The results demonstrate the effectiveness of our algorithm, which better captures the critical content of each modality and obtains more accurate matching results.
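To make the self-attention side concrete, below is a minimal PyTorch sketch of self-attention pooling within a single modality: each local feature (an image region or a word) is scored against its same-modal context, and the learned weights determine its contribution to the pooled representation. The module name, hidden size, and scoring network are illustrative assumptions, not the thesis's exact architecture.

    import torch
    import torch.nn as nn

    class SelfAttentionPool(nn.Module):
        """Learn a self-attention weight for each local feature (region or
        word) from its same-modal context, then pool with those weights."""

        def __init__(self, dim, hidden=256):
            super().__init__()
            # Small scoring network: one scalar importance score per element.
            self.score = nn.Sequential(
                nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
            )

        def forward(self, feats):
            # feats: (n, dim) local features from one modality.
            weights = torch.softmax(self.score(feats), dim=0)  # (n, 1)
            pooled = (weights * feats).sum(dim=0)              # (dim,)
            return pooled, weights

For example, SelfAttentionPool(1024)(torch.randn(36, 1024)) would pool 36 region features into a single vector while exposing the learned importance weights.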
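The cross-attention side can be sketched similarly: each word uses the image regions as its context, attends over them, and the resulting word-level local similarities are synthesized into an overall image-text score, mirroring the local-then-global matching described above. The temperature value and the plain averaging are assumptions for illustration, not the thesis's confirmed aggregation scheme.

    import torch
    import torch.nn.functional as F

    def cross_attention_similarity(regions, words, temperature=9.0):
        # regions: (n_regions, dim) image features; words: (n_words, dim).
        regions = F.normalize(regions, dim=-1)
        words = F.normalize(words, dim=-1)

        # Word-to-region cosine similarities: the image modality serves
        # as the context for every word.
        sim = words @ regions.t()                        # (n_words, n_regions)
        attn = torch.softmax(temperature * sim, dim=-1)  # cross-attention weights

        # Attended image context per word, then local word-level similarity.
        attended = attn @ regions                        # (n_words, dim)
        local = F.cosine_similarity(words, attended, dim=-1)

        # Synthesize the overall image-text similarity from local scores.
        return local.mean()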
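For the TF-IDF preprocessing, a standard implementation such as scikit-learn's TfidfVectorizer can supply the statistical prior. This sketch assumes the per-word TF-IDF scores are later used to reweight word features before attention, which is one plausible reading of the abstract rather than the thesis's confirmed pipeline; the captions are made-up examples.

    from sklearn.feature_extraction.text import TfidfVectorizer

    captions = [
        "a man riding a horse on the beach",
        "two dogs playing with a ball in the park",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(captions)       # (n_captions, vocab_size)
    vocab = vectorizer.get_feature_names_out()

    # Per-word TF-IDF scores for the first caption; such scores could act
    # as a statistical prior that reweights word features before attention.
    row = tfidf[0].toarray().ravel()
    print({w: round(row[i], 3) for i, w in enumerate(vocab) if row[i] > 0})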
Keywords/Search Tags: Cross-modal Matching, Attention Mechanism, Representation Learning