Cross-modal retrieval aims to retrieve results in one modality given a query from another modality, and is an important research direction in the multimodal field. The first problem to be solved in cross-modal retrieval is how to effectively measure the distance between heterogeneous data, that is, the heterogeneity-gap problem. Although recent research on cross-modal retrieval has achieved many excellent results, how to effectively capture the complex semantic associations between images and texts, further improve retrieval accuracy and efficiency, and reduce the loss of model generalization caused by noisy samples remain important problems that researchers urgently need to solve. To address these problems, this paper proposes a cross-modal retrieval method based on semantic consistency learning. The main contributions are as follows:

(1) To effectively capture complex semantics and improve retrieval accuracy and efficiency, an Efficient Cross-Modal Retrieval Model for Complex Semantic Interaction is proposed. The method consists of a multi-scale global interaction module and a prototype-based local interaction module, which perform semantically rich global interaction and fine-grained local interaction simultaneously. The former preserves hierarchical global semantic information by exploring the multi-scale correspondence between images and texts, while the latter reduces computation by decomposing region-word interactions into region-prototype and word-prototype interactions. Finally, jointly training the multi-scale global interaction module and the prototype-based local interaction module in a unified model allows the two to reinforce each other and further improves retrieval performance.

(2) To improve the generalization of cross-modal retrieval models under noisy-label conditions, a Residual-Network-Facilitated Noise-Robust Cross-Modal Retrieval Method is proposed, which guides the learning of multimodal prototypes through a noise-reduction classification module, so that the prototypes continuously approach the class centers of the clean distribution and the model's generalization in the presence of noise improves. Specifically, noisy samples are first denoised by a residual structure to generate reliable denoised pseudo-labels. Then, under the guidance of these pseudo-labels, the multimodal prototype representations are gradually optimized via momentum updates, and a sample-prototype discriminant loss mitigates the model's overfitting to noisy samples. In addition, an instance-wise discriminant loss learns the correspondence between images and texts, further reducing the influence of noisy samples while bridging the modality gap.

(3) A cross-modal image-text retrieval system is designed, and the methods proposed in this paper are encapsulated through hybrid programming with React and Python to realize bidirectional retrieval between images and texts.

In view of the above three challenges, a cross-modal retrieval method based on semantic consistency learning is proposed, in which both the efficient cross-modal retrieval method for complex semantic interaction and the noise-robust cross-modal retrieval method with prototype-matching consistency improve on existing methods. Finally, a cross-modal image-text retrieval system is designed around these two research methods.
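The computational saving claimed for the prototype-based local interaction can be illustrated with a minimal sketch: instead of scoring every region against every word (cost proportional to n_r × n_w), each modality is scored against a small set of K shared prototypes (cost proportional to (n_r + n_w) × K). All function names, the pooling scheme, and the toy dimensions below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors to unit length for cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def direct_similarity(regions, words):
    # Dense region-word interaction: O(n_r * n_w * d) similarity entries.
    sim = l2norm(regions) @ l2norm(words).T           # (n_r, n_w)
    return float(sim.max(axis=1).mean())              # best word per region, averaged

def prototype_similarity(regions, words, prototypes):
    # Decomposed interaction: O((n_r + n_w) * K * d), K << n_r * n_w.
    r_aff = l2norm(regions) @ l2norm(prototypes).T    # (n_r, K) region-prototype affinities
    w_aff = l2norm(words) @ l2norm(prototypes).T      # (n_w, K) word-prototype affinities
    # Pool each modality into a K-dim prototype descriptor, then compare descriptors.
    r_desc = l2norm(r_aff.max(axis=0))
    w_desc = l2norm(w_aff.max(axis=0))
    return float(r_desc @ w_desc)

rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 256))    # e.g. 36 detected image regions
words = rng.normal(size=(12, 256))      # e.g. 12 word tokens
prototypes = rng.normal(size=(8, 256))  # K = 8 shared prototypes (assumed)
score = prototype_similarity(regions, words, prototypes)
```

In this sketch the region-word similarity matrix is never materialized; both modalities are routed through the shared prototype space, which is what makes the local interaction cheap.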
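The momentum update of the multimodal prototypes under denoised pseudo-labels can be sketched as an exponential moving average toward each class's current feature mean; the function name, the coefficient m = 0.9, and the toy data below are illustrative assumptions rather than the thesis's actual formulation.

```python
import numpy as np

def momentum_update_prototypes(prototypes, features, pseudo_labels, m=0.9):
    """EMA-style update: each class prototype drifts toward the mean feature
    of the samples that the denoised pseudo-labels assign to that class."""
    updated = prototypes.copy()
    for c in np.unique(pseudo_labels):
        class_mean = features[pseudo_labels == c].mean(axis=0)
        updated[c] = m * updated[c] + (1.0 - m) * class_mean
    return updated

rng = np.random.default_rng(1)
prototypes = rng.normal(size=(3, 4))                        # 3 classes, 4-dim features
features = rng.normal(size=(10, 4))
pseudo_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])    # denoised pseudo-labels
new_protos = momentum_update_prototypes(prototypes, features, pseudo_labels)
```

Because each step moves a prototype only a fraction (1 − m) toward the class mean, the prototypes change smoothly across iterations, which is what lets them gradually approach the clean class centers instead of jumping to possibly noisy batch statistics.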