| Big data is flourishing on the Internet, where hundreds of millions of multimedia items, such as images and texts, are generated every day. As one of the fundamental tasks of multimedia data mining, cross-modal retrieval has received extensive attention. Moreover, Internet users increasingly demand customization, which imposes higher requirements on the refinement of retrieval results. For example, a user may want to identify the species of a flower from its picture. As a result, fine-grained cross-modal retrieval has become a research hotspot in recent years. In the fine-grained cross-modal retrieval setting, a user feeds a query sample to the retrieval method, which is expected to return candidate samples that belong to the same subcategory as the query. The query and the candidates can have different modalities, for example an image query with text results. Fine-grained data are characterized by “small divergence between different subclasses but large variance within the same subclass,” which poses a great challenge to traditional retrieval methods. Moreover, heterogeneous data are inconsistent and exhibit a media gap, which further increases the difficulty of fine-grained cross-modal retrieval. This thesis addresses these challenges by enhancing single-modal feature discrimination and cross-modal semantic correlation. Specifically, we start with the basic fine-grained image retrieval task, then generalize it to fine-grained image-text retrieval, and finally extend fine-grained image-text retrieval with multiple features to improve the fine-grained retrieval capability of machine vision and multimedia systems. The main contributions are summarized as follows: 1. For the fine-grained image retrieval task, a method named local correlation feature learning for fine-grained image retrieval is proposed. This method learns a global-local aware feature representation. On the one hand, the global feature can capture the object coarsely and
discard background clutter. On the other hand, the local feature mines the correlation among different parts. Further, an aggregated feature is designed to learn the global-local aware feature representation. Consequently, the discriminative ability among different fine-grained classes is enhanced. Extensive experimental results demonstrate the effectiveness of the method. In particular, it achieves a 2% accuracy improvement over state-of-the-art baselines on the Aircraft dataset. 2. For the fine-grained image-text retrieval task, a method named discriminative latent space learning for fine-grained image-text retrieval is proposed. This method first extracts image and text features that capture the subtle differences in fine-grained data. Subsequently, based on the extracted features, it performs coupled dictionary learning to align the heterogeneous data in a unified latent space. To make this alignment discriminative enough for the fine-grained task, the learned latent space is endowed with a discriminative property by learning a discriminative map. Extensive experiments demonstrate the effectiveness of the proposed method, and its overall results on two tasks surpass those of state-of-the-art methods. 3. For the fine-grained image-text retrieval task, a method named multi-features shared semantic space learning for fine-grained image-text retrieval is proposed. This method extracts global features and contextual features of images and texts, respectively, and learns them jointly to enhance feature discrimination. Further, to enhance the semantic correlation of heterogeneous data, it constructs a multi-feature shared semantic space. Remarkably, this method achieves a 15% accuracy improvement over state-of-the-art baselines on the instance-specific task. |
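
The retrieval setting described above, in which a query in one modality returns candidates of another modality that share its subcategory, can be illustrated with a minimal sketch. It assumes that image and text features have already been mapped into a common space; the toy vectors, the cosine-similarity ranking, and the helper names `cosine`, `retrieve`, and `precision_at_k` are illustrative assumptions, not the methods proposed in this thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, gallery, top_k=3):
    """Rank gallery items (item_id, subcategory, vector) by similarity to the query."""
    ranked = sorted(gallery, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return ranked[:top_k]

def precision_at_k(query_label, results):
    """Fraction of returned items whose subcategory matches the query's."""
    if not results:
        return 0.0
    hits = sum(1 for _, label, _ in results if label == query_label)
    return hits / len(results)

# Hypothetical text embeddings in the shared space (toy values).
text_gallery = [
    ("t1", "rose", [0.9, 0.1, 0.0]),
    ("t2", "rose", [0.8, 0.2, 0.1]),
    ("t3", "tulip", [0.1, 0.9, 0.0]),
    ("t4", "daisy", [0.0, 0.1, 0.9]),
]

# Hypothetical embedding of an image query whose true subcategory is "rose".
image_query = [1.0, 0.1, 0.0]

top = retrieve(image_query, text_gallery, top_k=2)
print([item_id for item_id, _, _ in top])
print(precision_at_k("rose", top))
```

In this toy case the two "rose" texts rank first, so every returned candidate shares the query's subcategory. The difficulty the thesis targets is learning embeddings for which such a ranking remains correct when subclasses differ only subtly.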