Research On Synergizing Vision And Text Semantics For Composed Image Retrieval
Posted on: 2024-12-13 | Degree: Doctor | Type: Dissertation | Country: China | Candidate: Y H Xu | GTID: 1528307373970069 | Subject: Computer Science and Technology

Abstract/Summary:

Traditional image retrieval commonly uses a single image or a single text description to search for the desired image in a large dataset. A single input usually carries limited information and cannot fully reflect the user's search intent, so traditional single-input retrieval cannot meet users' complex retrieval requirements. To give users the flexibility to express their search intent, composed image retrieval (CIR) has recently been proposed and is attracting growing attention in both academia and industry. CIR uses a reference image and a modification text as composed inputs to precisely reflect the user's search intent and retrieve the target images. This requires the CIR system to 1) comprehensively mine the semantic information within the composed inputs; 2) efficiently fuse the composed inputs into a joint representation; and 3) establish accurate semantic correlations between the composed inputs and the target images. This dissertation proposes several methods to address these three key issues in CIR. Concretely, its content mainly includes the following aspects:

(1) This dissertation proposes to explore hierarchical semantic information in composed image retrieval. It first introduces a scene graph, consisting of entity, attribute, and relationship nodes, to represent the image structure. It then performs hierarchical composition learning by fusing the modification text and the reference image in a global-entity-structure manner, which takes advantage of the complementary information among the three levels. The experimental results demonstrate that the proposed method can comprehensively mine the semantic information in each modality and achieve a performance improvement.

(2) This dissertation proposes a multimodal transformer-based architecture for composed image retrieval. Instead of the complicated composition designs of traditional methods, a simple yet effective multimodal transformer is adopted to homogeneously fuse the composed inputs at various scales. Moreover, this dissertation introduces an efficient global-local feature constraint loss to narrow the distance between the composed inputs and the target image. The loss not only considers the divergence in the global joint embedding space but also forces the model to focus on local detail differences. Extensive experiments on three real-world datasets demonstrate the superiority of the proposed method.

(3) This dissertation proposes a dual composition-and-decomposition paradigm for composed image retrieval. The composition module is designed to fuse the composed inputs into the joint representation and learn the correlation between the joint representation and the target image. The decomposition module is proposed to disentangle the target image into subspaces corresponding to the reference image and the modification text, and to learn the partial similarity related to each query element. The composition and decomposition modules form a closed loop and synergistically improve performance. The experimental results show that the proposed dual composition-and-decomposition paradigm outperforms traditional unidirectional CIR models.

(4) This dissertation proposes many-to-many matching among multiple semantic meanings and diverse instances for composed image retrieval. It first adopts set-based diverse queries to learn the various semantic meanings within a sample. The many-to-many matching among multiple semantic meanings is constructed through fine-grained query-wise alignment. Moreover, this dissertation introduces an uncertainty regularization module to sample positive instances. The many-to-many matching among multiple positive instances is constructed from a probabilistic view. Through this many-to-many matching, the proposed method can effectively tackle the inherent ambiguity problem in the CIR task.

Finally, a summary of the dissertation is provided, followed by an outlook on the future of composed image retrieval and potential directions for further research.

Keywords/Search Tags: Composed Image Retrieval, Multi-modal Learning, Image Retrieval, Transformer, Semantic Matching Analysis
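
The basic CIR setup described in the abstract (fuse a reference image and a modification text into a joint representation, then rank candidate target images by their similarity to it) can be sketched as follows. This is a minimal illustration only and is not any of the dissertation's methods: the embeddings are hand-written toy vectors rather than encoder outputs, the additive fusion in `compose` is a placeholder assumption, and the names `l2_normalize`, `compose`, `cosine`, and `rank_targets` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def compose(ref_img_emb, mod_text_emb):
    """Fuse the composed inputs into one joint query representation.

    Assumption: element-wise sum followed by normalization stands in for a
    learned fusion module; it is a placeholder, not the dissertation's design.
    """
    return l2_normalize([a + b for a, b in zip(ref_img_emb, mod_text_emb)])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def rank_targets(query_emb, gallery):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings (hypothetical values for illustration).
reference = [1.0, 0.0, 0.0]     # reference image, e.g. a dress
modification = [0.0, 1.0, 0.0]  # modification text, e.g. "make it red"
gallery = {
    "same_dress": [1.0, 0.0, 0.0],
    "red_dress": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

query = compose(reference, modification)
ranking = rank_targets(query, gallery)
print([name for name, _ in ranking])
```

In this toy gallery the candidate that combines the reference content with the requested change ranks above the unmodified reference, which is the behavior a CIR system aims for; a real system would obtain the embeddings from trained image and text encoders and learn the fusion.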