| In recent years, fine-grained semantic matching between text and images has attracted increasing attention from both industry and academia. Fine-grained text-image alignment information (for example, aligning target objects in an image with the phrase entities mentioned in the text) is useful in many important applications, such as multimodal retrieval, multimodal sentiment analysis, personalized recommendation systems, and the digitization of offline stores. Traditional fine-grained text-image matching tasks aim to align fine-grained entities in images and text without analyzing the deeper semantics of those entities. Through corpus analysis, this dissertation finds that most text phrases used to describe image objects carry shallow semantics and do not depend on contextual information. The matching relationships for such phrases are therefore easy for a model to capture, which makes them a poor test of whether the model truly understands multimodal semantics. However, the corpus also contains a large number of phrases with deep and sparse semantics that are closely tied to the context and even to external knowledge. The matching relationships between these phrases and image objects are difficult to capture, so they can effectively evaluate a model's ability to truly understand multimodal semantics. Based on this observation, and unlike previous studies, this dissertation proposes a new fine-grained text-image matching task oriented to implicit scenes, which focuses on text-image pairs whose fine-grained matching relationships can only be identified by relying on context or additional external knowledge. For this new task, the dissertation formulates a corresponding corpus annotation specification and annotates a fine-grained text-image matching dataset for implicit scenes. The critical problems in implicit fine-grained text-image matching include: (1) phrases exhibiting implicit phenomena tend to have deep and sparse semantics, so their semantic information is difficult to understand fully; (2) in most cases, an implicit phrase describes not a specific type of object but an emotion or atmosphere, which is difficult to map to an object region in the image; (3) corpora with fine-grained annotations are scarce, so the data available for learning is limited, and these limited resources hinder improvement of the model's ability to understand implicit matching relationships. In response to these challenges, this dissertation presents the following three research contributions. First, to address the deep and sparse semantics of phrases with implicit phenomena, this dissertation proposes a method based on multimodal interaction that also lets the model draw on external pre-trained knowledge to better learn the semantic information of implicit text phrases. The method encodes the input text-image pair with pre-trained models to better extract text and image features. It then uses an attention mechanism to let the features of text phrases and image objects interact fully and extracts a joint representation of the two. The learned visual attention distribution between text phrases and image objects is compared with the ground-truth correspondences, so that the model fits the correspondences between entities in text-image pairs and fine-tunes the pre-trained models, thereby better learning implicit fine-grained text-image matching. Second, because the object described by an implicit phrase is in most cases not a specific entity, this dissertation proposes a method based on visual contrastive attention, which takes both successfully and unsuccessfully matched samples as references and helps the model learn the implicit matching relationships between fine-grained entities from multiple perspectives. The method designs a language-guided visual contrastive attention mechanism that fully captures the matching and mismatching relationships between text and image entities and expands the implicit semantic information, allowing the model to correctly determine the relationships between implicit entities and thereby improving performance on the implicit matching task. Finally, given the scarcity of fine-grained implicit matching corpora, this dissertation proposes a method based on weakly supervised learning that does not rely on a fine-grained annotated corpus and learns fine-grained matching directly from input text-image pairs. During training, the method uses a general-purpose object detector for knowledge distillation: it matches text phrases against the labels of detected objects, generates "pseudo" labels linking text phrases to image objects, and learns fine-grained matching from them. In addition, the method designs a contrastive learning framework that learns both phrase-object matching and text-image matching to help the model learn fine-grained matching. In short, this dissertation proposes a series of solutions to the key problems and challenges of the implicit matching task. Experimental results show that the proposed methods achieve significant performance improvements over the baseline model. |
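To make the attention-based matching idea in the first contribution more concrete, the following is a minimal sketch and not the dissertation's implementation. It assumes scaled dot-product attention from one phrase vector over a set of region vectors, and a cross-entropy loss against a single annotated region. The scoring function, the loss, the function names, and the toy feature values are all illustrative assumptions; the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phrase_to_region_attention(phrase_vec, region_vecs):
    """Attention distribution of one phrase over all image regions.

    Assumption: scaled dot-product scoring; the source only says that an
    attention mechanism is used.
    """
    scale = math.sqrt(len(phrase_vec))
    scores = [dot(phrase_vec, r) / scale for r in region_vecs]
    return softmax(scores)


def attention_supervision_loss(attention, gold_region):
    """Cross-entropy between the attention distribution and the annotated
    region index (assumed loss; a small epsilon avoids log(0))."""
    return -math.log(attention[gold_region] + 1e-12)


# Hypothetical toy features standing in for pre-trained encoder outputs.
phrase = [1.0, 0.0, 1.0, 0.0]
regions = [
    [0.9, 0.1, 0.8, 0.0],  # region 0: most similar to the phrase
    [0.0, 1.0, 0.0, 1.0],  # region 1: orthogonal to the phrase
    [0.2, 0.2, 0.2, 0.2],  # region 2: weakly similar
]

attn = phrase_to_region_attention(phrase, regions)
predicted = max(range(len(attn)), key=attn.__getitem__)
loss = attention_supervision_loss(attn, gold_region=0)
print("attention:", attn)
print("predicted region:", predicted)
print("loss:", loss)
```

In a trainable model, the loss would be back-propagated through the encoders to fine-tune them; this sketch only shows how an attention distribution can be compared with an annotated correspondence.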