Image retrieval is a popular research topic. Fine-grained image retrieval is a subfield of image retrieval with broad application prospects, such as e-commerce image search and scientific research on animals and plants. Compared with the coarse categories of traditional image retrieval, fine-grained image retrieval pays more attention to the subtle local features of an image. With the success of deep learning in computer vision, deep learning algorithms have also appeared for fine-grained image retrieval. Feature aggregation is the process of encoding multiple local features into one global feature in image retrieval. Based on the aggregation of features from deep convolutional neural networks, we propose an unsupervised retrieval algorithm and a supervised metric learning algorithm. The main contributions of this paper are as follows.

For the task of fine-grained image retrieval, this paper proposes an unsupervised algorithm that requires no image labels to train the feature extraction network and that automatically selects the local deep features covering the main object in the image. We count how often the maximum response occurs at each spatial position on the accumulated feature map, and apply a threshold to discard local descriptors on the background, thereby separating the local features of the subject from those of the background. We then apply generalized-mean pooling to the selected features. First, we study the relationship between the threshold parameter of feature selection and generalized-mean pooling. Second, we study the effect of different feature combinations on retrieval performance. We compare our approach with state-of-the-art image retrieval aggregation methods on the CUB200-2011, Stanford Dogs, Oxford Flowers, Oxford Pets, and Aircraft datasets. The experimental results demonstrate the effectiveness of the proposed algorithm. Finally, we study the effects of feature dimension reduction and whitening. For the features obtained by our aggregation method, feature dimension reduction and whitening may improve retrieval performance.

Unsupervised fine-grained image retrieval uses pre-trained networks. For a specific dataset, however, the pre-trained network needs to be retrained with supervision such as labels. We use metric learning to train the network. Existing work on deep metric learning focuses on designing better loss functions and on how to build training pairs. Inspired by the aggregation of convolutional features in traditional image retrieval, we explore whether aggregated convolutional-layer features are more suitable as the input to the training loss function. We design three network structures to validate this idea. The first is VGG-16 BN, with the last fully connected layer used as the input to the loss function. The second is VGG-16 BN with the fully connected layers removed and a feature aggregation layer attached, which serves as both the input to the loss function and the output used for retrieval at test time. The third attaches a fully connected layer after the feature aggregation layer; the fully connected layer is the input to the loss function, and the outputs of both the feature aggregation layer and the fully connected layer are evaluated at test time. The experimental results show that the choice of network structure can greatly improve retrieval performance, and that the output of the feature aggregation layer performs better than the output of the fully connected layer. We compare the third network structure with state-of-the-art deep metric learning methods that use the triplet loss function on the CUB200-2011, Cars-196, Stanford Online Products, and In-shop datasets. Our approach outperforms the compared methods. In general, feature aggregation can contribute to fine-grained image retrieval tasks that use metric learning.
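
The unsupervised pipeline described above (counting maximum responses per spatial position, thresholding, then generalized-mean pooling) can be illustrated with a minimal NumPy sketch. It is one plausible reading of the description, not the paper's implementation: it assumes the maximum response is located per channel, and the function names `max_response_counts` and `select_and_gem`, the default `threshold`, the exponent `p`, and the fallback to all positions when nothing passes the threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def max_response_counts(fmap):
    """Count, for each spatial position, how many channels peak there.

    fmap: array of shape (C, H, W) with non-negative activations
    (e.g. the output of a ReLU convolutional layer).
    Returns an (H, W) integer array of counts.
    """
    c, h, w = fmap.shape
    flat = fmap.reshape(c, h * w)
    # Position of the maximum response in each channel (assumed reading).
    peak_positions = flat.argmax(axis=1)
    counts = np.bincount(peak_positions, minlength=h * w)
    return counts.reshape(h, w)


def select_and_gem(fmap, threshold=1, p=3.0, eps=1e-6):
    """Keep positions whose count reaches `threshold`, then GeM-pool them.

    Generalized-mean pooling per channel: (mean(x ** p)) ** (1 / p).
    p = 1 gives average pooling; large p approaches max pooling.
    Returns an L2-normalized descriptor of shape (C,).
    """
    c, h, w = fmap.shape
    mask = max_response_counts(fmap) >= threshold
    if not mask.any():
        # Hypothetical fallback: use every position if none is selected.
        mask = np.ones((h, w), dtype=bool)
    selected = fmap[:, mask]                 # shape (C, N_selected)
    selected = np.clip(selected, eps, None)  # keep x ** p well defined
    pooled = np.mean(selected ** p, axis=1) ** (1.0 / p)
    return pooled / np.linalg.norm(pooled)
```

With descriptors computed this way for a query and a gallery, retrieval reduces to ranking gallery images by the dot product of their normalized descriptors with the query descriptor. The interaction between `threshold` and `p` is the relationship the abstract says is studied.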