
A Self Attention Guided Network For Cross-modal Matching

Posted on: 2022-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: X F Qi
Full Text: PDF
GTID: 2518306509493044
Subject: Electronics and Communications Engineering

Abstract/Summary:
In recent years, with the rapid development of multimedia, cross-modal matching has received increasing attention. As an important foundational problem, it plays a critical role in many related cross-modal tasks, such as cross-modal retrieval, image inpainting, image captioning, and visual/video question answering. Traditional matching methods mainly start from statistical analysis: canonical correlation analysis and partial least squares are frequently applied to measure the relationship between different modalities. Although these methods are theoretically interpretable, they cannot deeply understand the semantic content of the image or text modality, which severely limits their performance.

Recently, researchers have found deep learning methods to be more effective and flexible. Typically, a convolutional neural network can extract high-level, multi-scale feature maps from an image, showing superior performance in image processing, while a recurrent neural network and its variants can effectively learn sequential features and understand their semantic content. Building on deep learning, how to extract more reasonable representations and how to measure their content similarity become the key issues to be addressed. Many methods start from matching image regions and words: they first compute local similarities and then synthesize an overall image-text similarity. However, not all regions or words contribute equally; each has a different importance in semantic expression.

To address this problem, this thesis introduces a combination of a self-attention mechanism and a cross-attention mechanism. The former distinguishes local information from the context within the same modality and learns self-attention weights; the latter uses data from the other modality as context and learns cross-attention weights under the premise of cross-modal content alignment. In addition, we observe that statistical features such as word frequency strongly influence word importance, so TF-IDF is introduced as a preprocessing step, which provides statistical prior information and substantially improves the performance of the overall model. We evaluate our method on the MSCOCO and Flickr30K datasets and compare its results with recent work. The results demonstrate the effectiveness of our algorithm, which better captures the critical content of each modality and obtains more accurate matching results.
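To make the self-attention side concrete, below is a minimal PyTorch sketch of self-attention pooling within a single modality: each local feature (an image region or a word) is scored against its same-modal context, and the learned weights determine its contribution to the pooled representation. The module name, hidden size, and scoring network are illustrative assumptions, not the thesis's exact architecture.

    import torch
    import torch.nn as nn

    class SelfAttentionPool(nn.Module):
        """Learn a self-attention weight for each local feature (region or
        word) from its same-modal context, then pool with those weights."""

        def __init__(self, dim, hidden=256):
            super().__init__()
            # Small scoring network: one scalar importance score per element.
            self.score = nn.Sequential(
                nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
            )

        def forward(self, feats):
            # feats: (n, dim) local features from one modality.
            weights = torch.softmax(self.score(feats), dim=0)  # (n, 1)
            pooled = (weights * feats).sum(dim=0)              # (dim,)
            return pooled, weights

For example, SelfAttentionPool(1024)(torch.randn(36, 1024)) would pool 36 region features into a single vector while exposing the learned importance weights.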
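The cross-attention side can be sketched similarly: each word uses the image regions as its context, attends over them, and the resulting word-level local similarities are synthesized into an overall image-text score, mirroring the local-then-global matching described above. The temperature value and the plain averaging are assumptions for illustration, not the thesis's confirmed aggregation scheme.

    import torch
    import torch.nn.functional as F

    def cross_attention_similarity(regions, words, temperature=9.0):
        # regions: (n_regions, dim) image features; words: (n_words, dim).
        regions = F.normalize(regions, dim=-1)
        words = F.normalize(words, dim=-1)

        # Word-to-region cosine similarities: the image modality serves
        # as the context for every word.
        sim = words @ regions.t()                        # (n_words, n_regions)
        attn = torch.softmax(temperature * sim, dim=-1)  # cross-attention weights

        # Attended image context per word, then local word-level similarity.
        attended = attn @ regions                        # (n_words, dim)
        local = F.cosine_similarity(words, attended, dim=-1)

        # Synthesize the overall image-text similarity from local scores.
        return local.mean()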
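For the TF-IDF preprocessing, a standard implementation such as scikit-learn's TfidfVectorizer can supply the statistical prior. This sketch assumes the per-word TF-IDF scores are later used to reweight word features before attention, which is one plausible reading of the abstract rather than the thesis's confirmed pipeline; the captions are made-up examples.

    from sklearn.feature_extraction.text import TfidfVectorizer

    captions = [
        "a man riding a horse on the beach",
        "two dogs playing with a ball in the park",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(captions)       # (n_captions, vocab_size)
    vocab = vectorizer.get_feature_names_out()

    # Per-word TF-IDF scores for the first caption; such scores could act
    # as a statistical prior that reweights word features before attention.
    row = tfidf[0].toarray().ravel()
    print({w: round(row[i], 3) for i, w in enumerate(vocab) if row[i] > 0})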
Keywords/Search Tags: Cross-modal Matching, Attention Mechanism, Representation Learning