In recent years, the rapid development of Internet technology and social media has led to an explosive growth of multimodal data. The demand for retrieving such data has driven the development of cross-modal retrieval, which aims to retrieve data samples of one modality given a query from another modality. With the development of deep learning, many advanced algorithmic models have been proposed to address the inherent distribution differences between modalities, known as the "heterogeneous gap". However, these methods still face the challenge of generalization, performing poorly on unseen data beyond the training data. Zero-shot sketch-based image retrieval (ZS-SBIR) is a cross-modal retrieval task that focuses on the generalization ability of retrieval methods: it retrieves natural images relevant to a sketch query, and testing is performed on a dataset consisting of categories unseen during training. ZS-SBIR therefore faces both the heterogeneous gap and the "semantic gap", namely the difference between seen and unseen categories.

Currently, most methods adopt similar strategies: they bridge the heterogeneous gap in the training data by embedding both modalities into a shared space, and bridge the semantic gap with additional "semantic embeddings", i.e., features of seen-class names extracted by natural language models. Although these methods have made some progress on ZS-SBIR, they still face the following problems. First, semantic embeddings are coarse-grained representations that cannot accommodate the diversity of intra-class samples. Second, existing methods ignore the most important information shared between sketches and natural images, namely the global structural information of objects. Third, these works focus only on the limited training data and overlook the knowledge already learned during pre-training. Consequently, existing methods cannot effectively bridge the heterogeneous gap between modalities or the semantic gap between seen and unseen categories.

To address these issues, this thesis proposes two methods. The first is a three-way vision Transformer model based on multi-modal hypersphere learning. It pre-trains a separate model for the sketch and natural-image modalities to capture the global structural information of each sample, and then builds a three-way vision Transformer architecture to align the representations of the two modalities on a hypersphere. Meanwhile, a Transformer token-based distillation strategy is designed to preserve the model's existing knowledge during training and improve its generalization to unseen data.

The second method is a zero-shot sketch-based image retrieval approach that adaptively balances discriminability and generalizability. It preserves the global structural information from pre-training through a task-independent teacher model and retains rich semantic information through a task-dependent teacher model; the whole model is jointly optimized via self-distillation and an adaptive weighting strategy based on information entropy. The two methods have an inheritance relationship: in light of the limitations of the first method, the second is proposed as its improvement to strengthen zero-shot sketch-based retrieval.

Both methods are evaluated on three benchmark datasets, Sketchy, TU-Berlin, and QuickDraw. They are compared with
existing mainstream methods, and their effectiveness and superiority are demonstrated through a series of experimental analyses. In addition, these experiments show that the second method achieves a significant performance improvement over the first method and other baselines.
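The hypersphere alignment used by the first method can be illustrated with a minimal sketch. The PyTorch snippet below is a hypothetical example rather than the thesis implementation: it assumes paired sketch/image features (for instance, Transformer class tokens of dimension 512), projects both modalities onto the unit hypersphere by L2 normalization, and pulls matched pairs together with a simple cosine loss; the function name, feature dimension, and exact loss form are illustrative assumptions.

```python
# Minimal sketch (not the thesis implementation) of hypersphere alignment:
# both modalities are L2-normalized onto a shared unit hypersphere and
# matched sketch/image pairs are pulled together with a cosine loss.
import torch
import torch.nn.functional as F


def hypersphere_align_loss(sketch_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
    """Align paired sketch/image features on the unit hypersphere.

    Both inputs have shape (batch, dim); features at the same index are
    assumed to come from the same object or category.
    """
    # Project both modalities onto the unit hypersphere.
    s = F.normalize(sketch_feat, dim=-1)
    v = F.normalize(image_feat, dim=-1)
    # Maximize cosine similarity of matched pairs (1 - cosine similarity).
    return (1.0 - (s * v).sum(dim=-1)).mean()


# Example usage with random features standing in for Transformer class tokens.
if __name__ == "__main__":
    sketch_feat = torch.randn(8, 512)
    image_feat = torch.randn(8, 512)
    print(hypersphere_align_loss(sketch_feat, image_feat).item())
```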
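The information-entropy-based adaptive weighting in the second method can be sketched in a similarly hedged way. The snippet below assumes two teacher logit streams (task-independent and task-dependent), temperature-scaled KL-divergence distillation, and a confidence weight defined as one minus the normalized entropy of each teacher's predictions; the actual weighting rule and loss terms in the thesis may differ.

```python
# Minimal sketch (assumed design choices, not the thesis implementation) of
# entropy-driven adaptive weighting between a task-independent teacher and a
# task-dependent teacher during distillation.
import torch
import torch.nn.functional as F


def entropy_weight(logits: torch.Tensor) -> torch.Tensor:
    """Confidence weight in [0, 1]: one minus the normalized Shannon entropy."""
    p = F.softmax(logits, dim=-1)
    ent = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)                 # per-sample entropy
    ent_norm = ent / torch.log(torch.tensor(float(logits.size(-1))))  # normalize by log(C)
    return (1.0 - ent_norm).mean()


def adaptive_distill_loss(student_logits, teacher_ti_logits, teacher_td_logits, tau=4.0):
    """Weight the two KL distillation terms by each teacher's confidence."""
    def kl(teacher_logits):
        return F.kl_div(
            F.log_softmax(student_logits / tau, dim=-1),
            F.softmax(teacher_logits / tau, dim=-1),
            reduction="batchmean",
        ) * tau * tau

    w_ti = entropy_weight(teacher_ti_logits)  # task-independent teacher weight
    w_td = entropy_weight(teacher_td_logits)  # task-dependent teacher weight
    return w_ti * kl(teacher_ti_logits) + w_td * kl(teacher_td_logits)
```

In this sketch, a lower-entropy (more confident) teacher receives a larger weight, which is one plausible way to balance the discriminability contributed by the task-dependent teacher against the generalizability contributed by the task-independent teacher during joint optimization.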