Deep learning models trained on massive data have revolutionized computer vision. However, collecting a large-scale dataset is often expensive and time-consuming, increasing deployment costs and limiting adoption in real-world applications. In particular, obtaining comprehensive and rich training images is nearly impossible in many specialized fields due to privacy protection and the scarcity of scenes. Humans can quickly understand novel objects from a few images (or limited interactions) by relating them to known concepts, an ability for efficient learning that current image recognition methods lack. Moreover, images vary greatly: images of the same category, or even of the same object, can differ significantly due to shooting angle, background, lighting, and object deformation. Building a robust and accurate image recognition model from few training samples is therefore a highly challenging and urgent research problem. The goal of few-shot image recognition is to reduce dependence on large amounts of training data, opening up visual understanding in real-world scenarios, but it faces many challenges. Existing few-shot image recognition relies on the transfer learning paradigm: a transferable model (or knowledge) is learned from additional large-scale upstream data and then applied to the downstream few-shot image recognition task. However, existing methods still have shortcomings in upstream (pre-)training or downstream transfer. This thesis proposes three works targeting these two aspects, focusing on upstream self-supervised pre-training, downstream transfer of vision-language pre-trained models, and the multi-domain problem in the downstream task. The main content and innovations of the thesis are summarized as follows:

(1) The thesis proposes a self-supervised few-shot image recognition method based on mutual information (MI). The method leverages a low-bias MI estimator to perform self-supervised pre-training, which captures the intrinsic structure of
data to learn comprehensive and generalized features. From an information-theoretic perspective, the thesis revisits the different roles of supervised and self-supervised pre-training in few-shot image recognition, showing that they maximize two different MI objectives. With a low-bias MI estimator, the thesis proposes a novel self-supervised pre-training method that improves on existing contrastive learning methods, effectively learning visual features that generalize to unknown downstream tasks while avoiding the overfitting to known classes seen in supervised pre-training. From the same unified MI perspective, the thesis further proposes a self-supervised distillation method, which employs a pre-trained large model to guide the self-supervised learning of a small model and improve its recognition performance. In a fair comparison, the proposed method matches (or even outperforms) state-of-the-art supervised methods without using pre-training labels.

(2) The thesis proposes prompt distribution learning to effectively transfer a vision-language pre-trained model to downstream few-shot image recognition tasks. Vision-language models can exploit prior human knowledge expressed in natural language for image recognition, but their recognition performance depends heavily on the provided prompt template. As a data-driven method, the proposed prompt distribution learning avoids tedious manual prompt design and is better aligned with the visual task context. Rather than learning a single prompt, the proposed method implicitly learns a distribution over diverse prompts to handle varied visual content. The thesis further proposes an efficient learning strategy: learning the distribution of the output text features of category descriptions. In contrast to modeling the distribution of the original text prompts, which would require a complex sequence generation model, the output text features can be modeled with a simple distribution such as a multivariate Gaussian. Based on the
Gaussian assumption, the thesis derives a surrogate objective, an upper bound of the original optimization objective, for effective training.

(3) The thesis proposes mutual embedding optimization, a meta-learning method for the multi-domain problem in downstream few-shot image recognition, where training and testing samples come from different distributions (i.e., multiple domains). The goal is to address the potential distribution shift between the limited training samples and the real data distribution, while investigating how related data from other domains can be leveraged to facilitate recognition in the current domain. To learn transferable knowledge across domains from few training samples, the thesis introduces a novel mutual embedding optimization, which employs mutual parameter embeddings as structural priors between domains and learns meta-priors during meta-training. The proposed method decouples the learning of domain-specific and domain-shared information. The domain-specific information obtained from meta-learning guides the adaptation of the model in the downstream task, optimizing the mutual embedding to efficiently capture task-relevant transferable information while avoiding negative transfer across domains. The proposed method also extends to a generalized cross-modal few-shot image recognition setting.
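To make the MI view of contrastive pre-training in contribution (1) concrete, the following is a minimal sketch of the standard InfoNCE objective, the widely used lower-bound MI estimator that contrastive methods build on. It is not the thesis's proposed low-bias estimator; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE contrastive loss (illustrative sketch).

    anchors, positives: (N, D) L2-normalized feature batches, where row i
    of each is a different augmented view of the same image. InfoNCE is a
    lower-bound estimator of MI: I(X; Y) >= log(N) - loss.
    """
    # Cosine-similarity logits between every anchor and every candidate;
    # the matching view sits on the diagonal, all others act as negatives.
    logits = anchors @ positives.T / temperature          # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    # Row-wise cross-entropy with the diagonal as the target class.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(anchors))
    return -log_prob[idx, idx].mean()
```

When the two views of each image agree, the loss is small and the implied MI lower bound `log(N) - loss` is large; mismatched pairs drive the loss up. Supervised pre-training, by contrast, maximizes MI between features and known labels, which is the distinction the thesis draws.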
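The Gaussian modeling idea behind contribution (2) can also be sketched in a few lines. This toy example fits a Gaussian (diagonal covariance, for stability with few prompts) to the text features of several prompt templates per class, then scores an image feature against samples from each class's distribution. All function names and shapes are hypothetical; the thesis's surrogate upper-bound training objective is not reproduced here.

```python
import numpy as np

def fit_prompt_gaussian(text_features):
    """Fit a Gaussian to the (K, D) text features of K diverse prompts
    for one class, as produced by a text encoder (hypothetical input)."""
    mu = text_features.mean(axis=0)
    # Diagonal covariance keeps the estimate stable when K is small.
    var = text_features.var(axis=0) + 1e-6
    return mu, var

def sample_prompt_features(mu, var, n_samples, rng):
    """Draw surrogate prompt features from the fitted Gaussian."""
    return rng.normal(mu, np.sqrt(var), size=(n_samples, mu.shape[0]))

def classify(image_feature, class_gaussians, rng, n_samples=32):
    """Score each class by the average cosine similarity between the
    L2-normalized image feature and sampled prompt features."""
    scores = []
    for mu, var in class_gaussians:
        samples = sample_prompt_features(mu, var, n_samples, rng)
        samples /= np.linalg.norm(samples, axis=1, keepdims=True)
        scores.append(float((samples @ image_feature).mean()))
    return scores
```

Sampling many prompt features per class plays the role of an implicit prompt ensemble, which is why a distribution can cover varied visual content where a single learned prompt cannot.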