With the development of mobile devices and computer hardware, multimedia data have been growing at an unprecedented rate. Within such massive multimedia data, users often want to use a sample in one modality to retrieve samples of other modalities related to the same topic. Existing multimedia retrieval algorithms usually face the following problems: 1) multi-modal data are heterogeneous, and the data distributions of different modalities differ; 2) semantics are usually abstract, and in many cases a topic needs multiple modalities to elaborate and complement one another; for example, a piece of news usually contains not only text but also pictures or videos of the event. To address these problems, our work builds on the graph convolutional network and integrates the ideas of the generative adversarial network and the attention mechanism; it effectively fits the distribution of multi-modal data, fuses features from multiple modalities, and achieves good results in narrowing the heterogeneity between modalities. Our work is described as follows:

(1) An adversarial graph convolutional network for cross-modal retrieval is proposed. The completeness of semantic expression plays an important role in cross-modal retrieval, since it helps align cross-modal data and thus narrow the modality gap. However, owing to the abstractness of semantics, the same topic may have several aspects that need to be described, so a single sample may express its semantics incompletely. To obtain complementary semantic information and strengthen shared information among samples with the same semantics, a graph convolutional network (GCN) is utilized to reconstruct each sample representation from the adjacency relationships between the sample and its neighborhood. A local graph is constructed for each instance, and a novel Graph Feature Generator, built from a GCN and a fully-connected network, reconstructs node features from the local graph and maps the features of the two modalities into a common space. The Graph Feature Generator and a Graph Feature Discriminator play a minimax game to generate modality-invariant graph feature representations. Experiments on three benchmark datasets demonstrate the superiority of the proposed model over several state-of-the-art methods.
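The sketch below is a minimal illustration, under stated assumptions, of how a GCN-based Graph Feature Generator and a modality discriminator can be trained with a minimax objective. The one-layer graph convolution, the feature dimensions (4096 for images, 300 for text), and the binary cross-entropy adversarial loss are illustrative choices, not the exact configuration of the proposed model.

```python
# Minimal sketch (PyTorch): adversarial graph feature learning for two modalities.
# Assumptions: a one-layer graph convolution over a per-instance local graph,
# a shared common space, and a binary modality discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphFeatureGenerator(nn.Module):
    """Reconstructs node features from a local graph (GCN) and maps them
    into the common space with a fully-connected layer."""
    def __init__(self, d_in, d_hidden, d_common):
        super().__init__()
        self.gcn_weight = nn.Linear(d_in, d_hidden, bias=False)
        self.fc = nn.Linear(d_hidden, d_common)

    def forward(self, x, adj):
        # x: (k, d_in) features of the instance and its k-1 neighbors
        # adj: (k, k) adjacency of the local graph (self-loops included)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = F.relu(self.gcn_weight((adj / deg) @ x))   # one GCN layer
        return self.fc(h[0])                           # common-space feature of the center node

class ModalityDiscriminator(nn.Module):
    """Predicts which modality a common-space feature came from."""
    def __init__(self, d_common):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_common, d_common // 2),
                                 nn.ReLU(),
                                 nn.Linear(d_common // 2, 1))

    def forward(self, z):
        return self.net(z)  # logit: image (1) vs. text (0)

# One minimax step (illustrative): the discriminator learns to tell modalities
# apart, while the generators learn modality-invariant features that fool it.
gen_img = GraphFeatureGenerator(4096, 1024, 256)
gen_txt = GraphFeatureGenerator(300, 1024, 256)
disc = ModalityDiscriminator(256)
opt_g = torch.optim.Adam(list(gen_img.parameters()) + list(gen_txt.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

x_img, adj_img = torch.randn(5, 4096), torch.eye(5)   # toy local graph (image modality)
x_txt, adj_txt = torch.randn(5, 300), torch.eye(5)    # toy local graph (text modality)

z_img, z_txt = gen_img(x_img, adj_img), gen_txt(x_txt, adj_txt)
d_loss = bce(disc(z_img.detach()), torch.ones(1)) + bce(disc(z_txt.detach()), torch.zeros(1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

g_loss = bce(disc(z_img), torch.zeros(1)) + bce(disc(z_txt), torch.ones(1))  # fool the discriminator
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```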
(2) An iterative graph attention memory network for cross-modal retrieval is proposed. How to eliminate the semantic gap between multi-modal data and fuse them effectively is the key problem of cross-modal retrieval, and the abstractness of semantics makes any single semantic representation one-sided. To obtain complementary semantic information for samples with the same semantics, a local graph is constructed for each instance and a graph feature extractor (GFE) reconstructs the sample representation from the adjacency relationships between the sample and its neighbors. Because some cross-modal methods focus only on learning from paired samples and cannot integrate further cross-modal information from the other modality, a cross-modal graph attention strategy generates a graph attention representation for each sample from the local graph of its paired sample. To eliminate the heterogeneity gap between modalities, the features of the two modalities are fused by a recurrent gated memory network, which selects prominent features from the other modality and filters out unimportant information to obtain a more discriminative feature representation in the common latent space. Experiments on four benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art cross-modal methods.
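The following sketch shows, under stated assumptions, one way the cross-modal graph attention and the gated memory fusion described in contribution (2) could be realized. The scaled dot-product attention, the GRU-style gating cell, the number of refinement steps, and all dimensions are illustrative assumptions rather than the exact design of the proposed network.

```python
# Minimal sketch (PyTorch): cross-modal graph attention followed by a
# GRU-style gated memory update that fuses the two modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphAttention(nn.Module):
    """Attends from a query feature of one modality over the local-graph
    nodes of its paired sample in the other modality."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)

    def forward(self, query, paired_graph_nodes):
        # query: (d,)   paired_graph_nodes: (k, d) common-space node features
        scores = self.k(paired_graph_nodes) @ self.q(query) / paired_graph_nodes.size(1) ** 0.5
        alpha = F.softmax(scores, dim=0)              # (k,) attention weights
        return alpha @ self.v(paired_graph_nodes)     # (d,) graph attention representation

class GatedMemoryFusion(nn.Module):
    """Recurrent gated update: keeps prominent cross-modal information and
    filters out unimportant features from the incoming representation."""
    def __init__(self, d):
        super().__init__()
        self.cell = nn.GRUCell(d, d)

    def forward(self, memory, incoming, steps=3):
        # memory: (1, d) current sample representation; incoming: (1, d) cross-modal feature
        for _ in range(steps):                        # iterative refinement
            memory = self.cell(incoming, memory)
        return memory

d = 256
attn = CrossModalGraphAttention(d)
fuse = GatedMemoryFusion(d)

img_feat = torch.randn(d)        # common-space image feature (e.g. from the GFE)
txt_graph = torch.randn(6, d)    # local-graph node features of the paired text sample

cross_feat = attn(img_feat, txt_graph)                          # cross-modal graph attention
fused = fuse(img_feat.unsqueeze(0), cross_feat.unsqueeze(0))    # gated memory fusion
print(fused.shape)               # torch.Size([1, 256]): fused, more discriminative feature
```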