As multimedia data grows explosively, people are increasingly surrounded by data of various modalities, such as images, videos, text, and audio. People constantly generate multimodal data, which drives the application of cross-modal retrieval. Although considerable progress has been made in cross-modal retrieval research, finding heterogeneous data with relevant content remains challenging. This thesis studies image-text cross-modal retrieval using interactive methods and graph matching methods. The main research contents are as follows:

1. An image-text retrieval model based on semantic filtering and adaptive pooling is proposed. This thesis uses a cross-attention mechanism to realize cross-modal interaction between images and texts, and implements a semantic filtering module that exploits the matching information and local similarity of image-text pairs to reduce the attention weights assigned to fragments in mismatched image-text pairs, thereby aligning image-text fragments with relevant semantics. When aggregating local features into global features, traditional mean or max pooling cannot achieve optimal results; this thesis therefore implements a learnable pooling module that adapts to different feature forms and adjusts the pooling method adaptively to aggregate local features into global features. Experiments were conducted on the Flickr30K and MSCOCO datasets, and the results show that, compared with existing image-text retrieval models, the proposed model based on semantic filtering and adaptive pooling improves retrieval accuracy.

2. A graph matching-based image-text retrieval model is proposed. First, salient regions in images and words in texts are used to model graph nodes, and graph convolutional networks are then used to infer the intra-modality relationships of graph nodes and extract intra-modality correlations. Second, a cross-modal feature extraction method is introduced in which the matching information between image regions and words is allowed to flow
through the graph, extracting features that contain cross-modal matching information and fully exploiting both intra-modality and inter-modality information. Finally, graph structure matching and image-text global similarity calculation are performed, and different levels of image-text matching relationships are learned from the graph structure matching information and the global similarity information. Experiments were conducted on the Flickr30K and MSCOCO datasets, and the results show that, compared with existing graph matching-based image-text retrieval models, the proposed model based on cross-graph structure matching improves retrieval accuracy.

3. This thesis designs and implements a cross-modal retrieval system, which provides cross-modal retrieval services through a browser/server architecture. The system's cross-modal retrieval model adopts the methods proposed in this thesis and can perform cross-modal retrieval on the images or texts uploaded by users, providing both text-to-image and image-to-text retrieval functions.
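The learnable pooling idea in the first contribution can be made concrete with a minimal sketch. The thesis does not specify the exact pooling form, so the example below uses generalized-mean (GeM) pooling as one well-known learnable scheme, not the thesis's actual module: a single parameter p interpolates between mean pooling (p = 1) and max pooling (p → ∞), so the model can tune the aggregation of local fragment features into a global feature during training.

```python
import numpy as np

def gem_pool(features, p):
    """Generalized-mean pooling over local features.

    features: array of shape (n_fragments, dim) with non-negative entries
              (e.g. post-ReLU region or word features).
    p:        pooling exponent; p = 1 gives mean pooling, and as p grows
              the result approaches per-dimension max pooling.
    """
    return np.power(np.mean(np.power(features, p), axis=0), 1.0 / p)

# Three local fragment features of dimension 2 (illustrative values).
x = np.array([[0.1, 2.0],
              [0.3, 4.0],
              [0.2, 6.0]])

mean_pool = gem_pool(x, 1.0)    # identical to x.mean(axis=0)
near_max = gem_pool(x, 100.0)   # close to x.max(axis=0)
```

In a trained model, p would be a learnable parameter updated by backpropagation (e.g. a `torch.nn.Parameter`), letting the network choose the pooling behavior per feature form rather than fixing mean or max in advance.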