
Towards Explicit And Implicit Fine-Grained Image-Text Matching

Posted on: 2024-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: W Liu
Full Text: PDF
GTID: 2568306941463724
Subject: Computer technology
Abstract/Summary:
Improving the accuracy of fine-grained matching between images and text is an important research direction in multimodal matching. However, traditional methods often focus only on direct matching between fine-grained entities in images and text, while ignoring the deeper semantic relationships between them. To address this issue, this thesis applies machine learning and deep learning techniques to improve the performance and accuracy of fine-grained image-text matching from a deeper semantic perspective.

Through corpus analysis, we found that most textual phrases used to describe objects in images are semantically explicit: matching them requires no deep understanding of contextual relationships during fine-grained matching training, which limits what the model can learn. At the same time, the corpus also contains fine-grained entities that carry richer and more complex semantics; these are often difficult to match but are more challenging and effective for evaluating model performance. This thesis therefore focuses on this challenging subset of the data and discusses how to achieve more precise fine-grained matching by gaining a deeper understanding of the contextual semantic relationships between these entities.

To achieve accurate matching, this thesis introduces the concepts of explicit and implicit relations. Implicit relations hold between fine-grained entities whose contextual semantics are rich but not stated directly, and are therefore difficult for a model to learn; explicit relations carry sparse semantic information and are easier to match. Building on this distinction, the thesis proposes a fine-grained image-text matching task for explicit and implicit scenarios, focusing on learning the implicit relations present in fine-grained matching by understanding the context more deeply or by leveraging external knowledge, so as to achieve precise matching between fine-grained entities. To highlight the difficulty of matching fine-grained entities linked by implicit relations, the same experiments are also conducted on corpora containing explicit relations.

In terms of methodology, this thesis combines multimodal interaction with pre-trained language models, which improves predictive accuracy on both explicit and implicit corpora. It further introduces a contrastive distillation method that uses external knowledge as pseudo-labels to supervise training and strengthen the model's ability to learn fine-grained image-text matching. In addition, a curriculum learning based training strategy progressively focuses on difficult samples, further improving the model's learning capability.

In summary, this thesis identifies explicit and implicit phenomena in the corpus, proposes a fine-grained image-text matching task oriented towards explicit and implicit scenarios, and presents corresponding solutions for the different challenges. Moreover, because coarse-grained image-text alignment information is relatively scarce, a weakly supervised training approach is adopted. Experimental results demonstrate that the proposed model improves task performance and contributes to research on multimodal matching.
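The contrastive distillation idea described above, where external-knowledge similarity scores act as pseudo-labels supervising the student matcher, can be sketched as a row-wise KL divergence between softened similarity distributions. This is a minimal illustrative sketch, not the thesis's exact formulation; all function and argument names are assumptions.

```python
import math

def softmax_row(row, temperature):
    """Temperature-scaled softmax over one row of similarity scores."""
    scaled = [x / temperature for x in row]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def contrastive_distillation_loss(student_sim, teacher_sim, temperature=0.1):
    """Mean row-wise KL(teacher || student) over image-text similarity matrices.

    teacher_sim plays the role of pseudo-labels derived from external
    knowledge; student_sim comes from the matching model being trained.
    (Names and the exact loss form are illustrative, not from the thesis.)
    """
    total = 0.0
    for t_row, s_row in zip(teacher_sim, student_sim):
        t = softmax_row(t_row, temperature)
        s = softmax_row(s_row, temperature)
        total += sum(ti * (math.log(ti) - math.log(si))
                     for ti, si in zip(t, s))
    return total / len(teacher_sim)
```

The loss is zero when the student reproduces the teacher's similarity distribution exactly, and grows as the student's image-text rankings diverge from the knowledge-derived pseudo-labels.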
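The curriculum learning strategy mentioned above can likewise be sketched as a schedule that trains on the easiest samples first and gradually admits harder ones. The linear pacing function and all names below are illustrative assumptions; the abstract does not specify the thesis's actual schedule.

```python
import math

def curriculum_schedule(difficulties, epoch, total_epochs, start_frac=0.3):
    """Return the indices of training samples used at a given epoch.

    Samples are ordered from easiest to hardest by a precomputed difficulty
    score; the included fraction grows linearly from start_frac to 1.0.
    (A hypothetical pacing function, not the thesis's exact strategy.)
    """
    n = len(difficulties)
    progress = epoch / max(1, total_epochs - 1)
    frac = min(1.0, start_frac + (1.0 - start_frac) * progress)
    k = max(1, math.ceil(frac * n))
    order = sorted(range(n), key=lambda i: difficulties[i])  # easiest first
    return order[:k]
```

Early epochs thus see only low-difficulty pairs (for example, entities linked by explicit relations), while later epochs progressively add the implicit-relation samples that are harder to match.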
Keywords/Search Tags:Image-Text Matching, Explicit and Implicit Relations, Cross-Modal Interaction, Knowledge Distillation, Curriculum Learning