With the rapid development of Internet technology, people create and share an ever-growing variety of multimedia material (e.g., images and texts) in their daily lives. Faced with the explosively growing amount of multi-modal data on the Internet, traditional single-modal retrieval has gradually become unable to meet users' needs. How to enable computers to perform cross-modal retrieval by understanding the associations among multi-modal data has become a critical research topic in the multimedia understanding community. Therefore, by exploiting semantics and commonsense knowledge, this thesis addresses the task of cross-modal image-text retrieval. Briefly, the main contributions of this thesis can be summarized as follows:

(1) We propose a Stacked Multi-modal Attention Network (SMAN) for cross-modal image-text retrieval, which makes use of a stacked attention mechanism to exploit the hierarchical fine-grained interdependencies between image and text. Specifically, we jointly employ intra-modal and multi-modal information as query guidance to perform multi-step attention reasoning, so that the multi-level fine-grained correlation between image and text can be captured. Moreover, we present a novel bi-directional ranking loss, which imposes a distance constraint on pairwise multi-modal instances to preserve the manifold structure of the multi-modal data distribution in the joint embedding space.

(2) We propose a Stacked Squeeze-and-Excitation Recurrent Residual Network (SER²-Net) for cross-modal image-text retrieval. First, an efficient multi-level representation module, built by combining multiple semantic enhancing operations, is presented to produce a series of semantically discriminative features. Besides, to capture the implicit correlations among multi-level features, we propose a novel objective, namely the Cross-modal Semantic Discrepancy (CMSD) loss, which exploits the interdependency among different semantic levels to narrow the distribution discrepancy between heterogeneous data.

(3) We propose a Consensus-aware Visual-Semantic Embedding (CVSE) model for cross-modal image-text retrieval. Specifically, consensus information is exploited by computing the statistical co-occurrence correlations between semantic concepts in the image captioning corpus and feeding the constructed concept correlation graph into a graph convolution network to yield consensus-aware concept (CAC) representations. Afterwards, CVSE can simultaneously pinpoint the high-level concepts and generate unified consensus-aware concept representations for both modalities, thereby achieving more precise semantic alignment between image and text.

(4) We propose a Commonsense Aided Visual-semantic Embedding (COVE) model for cross-modal image-text retrieval. By combining knowledge graphs, statistical relations, and graph neural networks, both logical commonsense knowledge and statistical commonsense knowledge can be captured simultaneously, so that multiple kinds of commonsense knowledge are introduced into the learning of cross-modal representations.

Extensive experiments on two public benchmark datasets, i.e., Flickr30k and MSCOCO, demonstrate the superiority of all four proposed approaches.
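The multi-step attention reasoning described in contribution (1) can be illustrated with a minimal sketch: a query vector repeatedly attends over image region features and is refined with the attended context at each step. This is a generic stacked-attention loop, not the exact SMAN architecture; the function name, the additive refinement rule, and the number of steps are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stacked_attention(query, regions, steps=2):
    """Multi-step attention reasoning (illustrative sketch):
    at each step the query attends to the image region features,
    and the attended context refines the query, so later steps
    can focus on progressively finer-grained correlations."""
    for _ in range(steps):
        scores = regions @ query          # (num_regions,) relevance scores
        weights = softmax(scores)         # attention distribution over regions
        context = weights @ regions       # attended region feature
        query = query + context           # refine the query for the next step
    return query
```

In practice each step would use learned projections for queries and keys; the plain dot-product form above only conveys the iterative refinement idea.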
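The bi-directional ranking loss of contribution (1) builds on the standard hinge-based objective used in visual-semantic embedding: for a batch of matched image-text pairs, every non-matching caption is a negative for an image and vice versa. The sketch below shows that standard bi-directional hinge loss over a similarity matrix; the margin value is an assumption, and the thesis's additional manifold-preserving distance constraint is not reproduced here.

```python
import numpy as np

def bidirectional_ranking_loss(sim, margin=0.2):
    """Hinge-based bi-directional ranking loss over an N x N
    image-text similarity matrix whose diagonal holds the
    matched (positive) pairs. Illustrative sketch only."""
    n = sim.shape[0]
    pos = np.diag(sim).reshape(n, 1)               # s(image_i, text_i)
    # image -> text direction: captions j != i are negatives for image i
    cost_i2t = np.maximum(0.0, margin + sim - pos)
    # text -> image direction: images j != i are negatives for caption i
    cost_t2i = np.maximum(0.0, margin + sim - pos.T)
    mask = 1.0 - np.eye(n)                         # exclude the positive pairs
    return float(((cost_i2t + cost_t2i) * mask).sum() / n)
```

When every positive pair outscores all negatives by at least the margin, the loss is zero, which is the training target of such ranking objectives.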
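The consensus exploitation in contribution (3) can likewise be sketched: count how often concepts co-occur in a caption corpus, normalize the counts into a correlation graph, and propagate concept embeddings through a graph-convolution step. The toy corpus, vocabulary, embedding size, and single-layer propagation below are all illustrative assumptions, not the CVSE configuration.

```python
import numpy as np

# Toy caption corpus: each caption is tagged with its semantic concepts.
captions = [
    {"dog", "grass"},
    {"dog", "ball", "grass"},
    {"cat", "sofa"},
]
vocab = sorted({c for cap in captions for c in cap})
idx = {c: i for i, c in enumerate(vocab)}
n = len(vocab)

# Statistical co-occurrence counts between pairs of concepts.
co = np.zeros((n, n))
for cap in captions:
    for a in cap:
        for b in cap:
            if a != b:
                co[idx[a], idx[b]] += 1

# Row-normalize counts into conditional co-occurrence probabilities,
# add self-loops, and run one graph-convolution step H' = ReLU(A_hat @ H @ W)
# to yield consensus-aware concept representations.
row_sums = co.sum(axis=1, keepdims=True)
adj = np.divide(co, row_sums, out=np.zeros_like(co), where=row_sums > 0)
a_hat = adj + np.eye(n)
rng = np.random.default_rng(0)
h = rng.normal(size=(n, 8))                  # initial concept embeddings
w = rng.normal(size=(8, 8))                  # learnable projection (random here)
h_next = np.maximum(0.0, a_hat @ h @ w)      # consensus-aware concept features
```

The resulting concept features can then serve as a shared vocabulary for representing both images and texts, which is the role the CAC representations play in the model.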