With the rapid development of artificial intelligence and deep learning, image-text matching has gradually become an important task in cross-modal research. Matching images and text correctly requires a strong understanding of the correspondence between vision and language. In recent years, deep-learning-based image-text matching methods have achieved remarkable success, but existing methods still face the following problems. First, image-text matching requires both a deep understanding of intra-modal information, such as the relationships among image objects and the long-range dependencies among text words, and an exploration of the fine-grained alignment between image regions and text words; how to integrate these two aspects in a single model remains an open problem. Second, existing fine-grained image-text matching methods compute similarities over all possible region-word pairs. Although their accuracy is much higher than that of coarse-grained matching, the repeated similarity computations make the models expensive and introduce many unnecessary redundant alignments, so balancing accuracy and efficiency effectively is an urgent problem. Third, prior knowledge has recently been applied in many areas of deep learning, where it can enhance the representation ability and interpretability of a model and reduce its internal complexity; however, how to construct and exploit prior knowledge effectively to guide image-text matching remains a difficult question. The specific research contents of this thesis are therefore as follows:

(1) To fuse intra-modal and inter-modal information in a single model, this thesis proposes an image-text matching method based on self-attention. The method uses the self-attention mechanism to model the information within each modality and uses cross-attention to model the interaction between image regions and text words. Experiments show that the method effectively improves the accuracy of image-text matching.
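To make the attention scheme concrete, the following is a minimal PyTorch sketch rather than the thesis implementation: it assumes pre-extracted region features and word embeddings projected to a common dimensionality (512 here), applies self-attention within each modality, and scores an image-text pair by letting every word attend over the image regions; the feature sizes, number of heads, and softmax temperature are illustrative assumptions.

```python
# Minimal sketch of intra-modal self-attention plus inter-modal cross-attention.
# Dimensions, temperature, and region/word counts are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalEncoder(nn.Module):
    """Self-attention over one modality (image regions or text words)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):               # x: (batch, seq, dim)
        out, _ = self.attn(x, x, x)     # relations inside the modality
        return self.norm(x + out)

def cross_attention_score(regions, words, temperature=9.0):
    """regions: (n_regions, d), words: (n_words, d) -> scalar image-text score."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    attn = F.softmax(temperature * words @ regions.t(), dim=-1)  # (n_words, n_regions)
    context = attn @ regions                                     # attended regions per word
    return F.cosine_similarity(words, context, dim=-1).mean()

# Toy usage with random features standing in for detector / text-encoder outputs
enc_v, enc_t = IntraModalEncoder(), IntraModalEncoder()
img = enc_v(torch.randn(1, 36, 512)).squeeze(0)   # 36 region features
txt = enc_t(torch.randn(1, 12, 512)).squeeze(0)   # 12 word features
print(cross_attention_score(img, txt).item())
```

Aggregating the per-word similarities into one scalar keeps the inter-modal interaction fine-grained while still yielding a single matching score that a ranking loss can optimise.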
(2) To address the excessive computation and the lack of multi-view matching in existing fine-grained methods, this thesis proposes a multi-view image-text matching method based on the Transformer architecture. The method stacks multi-layer Transformers for the image branch and the text branch and shares the weights of the two branches at the last layer. In the similarity-measurement stage, it uses dilated convolution to construct multi-view matching, so that the model can understand an image from different perspectives. Because the method does not stack interactions between image regions and text words, the outputs of the image and text branches are compact vectors, which reduces the computational cost of the model. Experimental results show that, thanks to the multi-layer Transformer architecture and multi-view matching, the method effectively improves accuracy.

(3) To address the problem of constructing prior knowledge to guide image-text matching, this thesis proposes an image-text matching method based on a prior knowledge graph. The effective use of prior knowledge not only reduces the computational cost of the model but also enhances its ability to understand the real world rather than merely fit a particular dataset. The method uses graph convolution to encode the prior knowledge graph, establishing deep connections among the prior concepts. In addition, it adopts the self-attention mechanism and one-dimensional convolution in the image and text feature-extraction stages, which strengthens the reasoning ability of the model. Experiments show that the method effectively improves the accuracy of image-text matching on two benchmark datasets.
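As an illustration of how a prior knowledge graph can be encoded with graph convolution, the following minimal PyTorch sketch (an assumption-laden example, not the thesis code) propagates randomly initialised concept embeddings over a hypothetical co-occurrence adjacency matrix using two stacked GCN layers; the concept count, feature dimensions, and adjacency matrix are invented for demonstration.

```python
# Minimal sketch of encoding a prior knowledge graph with graph convolution.
# Nodes are concept embeddings; the adjacency matrix stands in for prior
# relations such as concept co-occurrence (values here are made up).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvLayer(nn.Module):
    """One GCN layer: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=-1).pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt                 # symmetric normalisation
        return F.relu(a_norm @ self.linear(h))

# Toy prior graph: 5 concepts with edges from hypothetical co-occurrence statistics
concepts = torch.randn(5, 300)            # e.g. word-vector-initialised concept nodes
adj = torch.tensor([[0, 1, 1, 0, 0],
                    [1, 0, 0, 1, 0],
                    [1, 0, 0, 0, 1],
                    [0, 1, 0, 0, 1],
                    [0, 0, 1, 1, 0]], dtype=torch.float)
layer1, layer2 = GraphConvLayer(300, 512), GraphConvLayer(512, 512)
node_feats = layer2(layer1(concepts, adj), adj)   # knowledge-aware concept features
print(node_feats.shape)                           # torch.Size([5, 512])
```

The symmetric normalisation of the adjacency matrix with added self-loops follows the standard GCN formulation and keeps the scale of node features stable as layers are stacked; in an image-text matching model, the resulting knowledge-aware concept features could then be fused with the region and word features produced by the self-attention and one-dimensional convolution stages.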