Object localization is a fundamental computer vision problem that aims to discover and locate objects of interest in a visual scene. Localization quickly identifies meaningful foreground objects and reduces interference from background information. It is therefore a preprocessing step for many high-level vision tasks and benefits a wide variety of applications, e.g., object tracking, video understanding, and autonomous driving. Benefiting from the tremendous success of deep learning, fully supervised object localization and detection have achieved remarkable performance. However, deep learning models usually require strong annotations in the form of precise bounding boxes. Obtaining such annotations at a large scale can be costly, time-consuming, or even infeasible, which greatly limits the application of object localization in practical scenarios. One of the most urgent problems for the object localization community is therefore to find solutions that perform well while using as few annotations as possible (i.e., less supervision), so as to reduce the dependence on accurate annotations. To this end, this dissertation focuses on object localization under less supervision. We explore weakly supervised object localization requiring only image-level labels, unsupervised object localization without any annotations, and unsupervised part localization. Finally, the object localization and part localization methods are applied to fine-grained image classification. Specifically, the main work and contributions of this dissertation are as follows:

(1) A frequency-based method for weakly supervised object localization. This work addresses two critical issues in weakly supervised object localization: failure to localize the integral regions of target objects, and low localization accuracy. From the perspectives of spatial attention and channel attention, we study how the spatial locations of highly responsive features in a convolutional feature layer, and their activation frequencies across all channels, affect object localization. Based on this study, we propose a simple yet effective method, called Frequent Class Activation Map (Freq CAM), for weakly supervised localization. Freq CAM takes the importance of spatial information into account when fusing information across channels, so it reduces the loss of regional information and obtains more accurate object regions, which ultimately improves overall localization performance. Experiments on standard fine-grained datasets show that the proposed method effectively improves the performance of weakly supervised object localization methods. Moreover, unlike existing state-of-the-art methods, Freq CAM is a plug-and-play module: it requires no complex network design, no new loss functions, no modification of existing architectures, and not even any backpropagation. It can therefore be plugged directly into any standard weakly supervised framework, which gives it good generality.

(2) A pattern-mining-based method for unsupervised object localization. Weakly supervised object localization learns object locations in a given image from image-level labels. However, most real-world images belong to unknown categories or have no labels at all. To break through the limitations imposed by annotations and take full advantage of unlabeled data, this work studies unsupervised object localization, which locates possible objects in an image without any annotations. The previously proposed Freq CAM mainly considers the spatial and channel responses of a single position in the feature map, ignoring the correlation between adjacent pixels in the image. To tackle this problem, we propose a simple but effective pattern-mining-based method, called Object Location Mining (OLM), which exploits the strengths of data mining and the hidden structural information in the feature maps of pretrained convolutional neural networks (CNNs). By introducing frequent item mining, OLM makes full use of the correlation between adjacent pixels within an object and mines both semantic and spatial information. OLM outperforms other unsupervised methods by a large margin. We also evaluate its localization ability on unsupervised saliency detection and achieve competitive performance. These results show that OLM is effective and generalizes well in unsupervised object localization.

(3) An unsupervised part localization method and a knowledge-distillation-based fine-grained visual categorization method. We further study unsupervised part localization and apply it to fine-grained image classification. OLM can accurately locate objects in an image without any labels, and the support values in the resulting localization map reflect the importance of each position. Based on this observation, we propose a cluster-based unsupervised part localization method, called Part Localization Mining (PLM), which exploits both the support information of the localization map and the feature maps of the image. Experiments show that PLM achieves competitive performance compared with weakly supervised part localization methods. Since locating accurate discriminative regions plays a key role in fine-grained classification, we further propose, based on the mined part regions, a fine-grained image classification method built on part mining and knowledge distillation. Comprehensive experiments on fine-grained datasets show competitive performance compared with state-of-the-art methods. Our method can therefore not only localize discriminative parts but also learn features that are more distinctive and more powerful for classification, significantly improving both part localization and fine-grained classification performance.
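The abstract does not give the Freq CAM formulas, so the following is only a rough illustration of the idea behind contribution (1). For each spatial position, it counts the fraction of channels in which that position is highly responsive, reweights an ordinary CAM with that frequency map, and reads off a box. Several pieces are assumptions made for illustration rather than the dissertation's definitions: the "above the channel mean" activation rule, the element-wise fusion, and the helper names `frequency_map`, `fuse_with_cam`, and `bounding_box`. Note that everything here uses forward-pass activations only, consistent with the claim that no backpropagation is required.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: a feature map indexed as [channel][row][col].
FeatureMap = List[List[List[float]]]
Grid = List[List[float]]


def frequency_map(features: FeatureMap) -> Grid:
    """Fraction of channels in which each position is 'highly responsive'.

    Assumption: a position counts as highly responsive in a channel when its
    activation is strictly above that channel's mean activation.
    """
    num_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    counts = [[0] * width for _ in range(height)]
    for channel in features:
        mean = sum(sum(row) for row in channel) / (height * width)
        for i in range(height):
            for j in range(width):
                if channel[i][j] > mean:
                    counts[i][j] += 1
    return [[c / num_channels for c in row] for row in counts]


def fuse_with_cam(cam: Grid, freq: Grid) -> Grid:
    """Assumed fusion: element-wise product, rescaled so the maximum is 1."""
    fused = [[a * b for a, b in zip(cam_row, freq_row)]
             for cam_row, freq_row in zip(cam, freq)]
    peak = max(max(row) for row in fused)
    return fused if peak <= 0 else [[v / peak for v in row] for row in fused]


def bounding_box(score: Grid, threshold: float) -> Optional[Tuple[int, int, int, int]]:
    """Tightest (top, left, bottom, right) box around positions scoring >= threshold."""
    hits = [(i, j) for i, row in enumerate(score) for j, v in enumerate(row) if v >= threshold]
    if not hits:
        return None
    rows, cols = zip(*hits)
    return min(rows), min(cols), max(rows), max(cols)


# Toy example: 2 channels on a 3x3 grid, plus a toy CAM.
features = [
    [[0, 0, 0], [0, 5, 5], [0, 5, 5]],
    [[0, 0, 0], [0, 4, 0], [0, 4, 4]],
]
cam = [[1, 1, 1], [1, 2, 2], [1, 2, 2]]
freq = frequency_map(features)    # [[0.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 1.0, 1.0]]
fused = fuse_with_cam(cam, freq)  # [[0.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 1.0, 1.0]]
box = bounding_box(fused, 0.5)    # (1, 1, 2, 2)
```

Because the frequency map depends only on the feature tensor, a sketch like this can be bolted onto any CAM-producing network without retraining. That matches the plug-and-play property described above, though the real fusion rule may differ.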
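Contribution (2) describes OLM only at a high level. The sketch below shows one generic way frequent item mining can turn CNN activations into a localization score; it is not OLM itself. Each spatial position becomes a transaction whose items are the channels active at that position or at one of its 4-neighbours, which is a crude way to encode adjacent-pixel correlation. Itemsets of size at most 2 are then counted by brute force, and each position is scored by the summed support of the frequent itemsets its transaction contains. The activation rule, the neighbourhood, the scoring, and the names `build_transactions`, `frequent_itemsets`, and `support_map` are illustrative assumptions. In practice, transactions would more likely be pooled over many images, and a proper miner (Apriori, FP-growth) would replace the brute-force count.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

FeatureMap = List[List[List[float]]]  # hypothetical layout: [channel][row][col]
TransactionGrid = List[List[Set[int]]]


def build_transactions(features: FeatureMap) -> TransactionGrid:
    """One transaction per position: channels active there or at a 4-neighbour.

    Assumption: a channel is 'active' at a position when its activation is
    strictly above that channel's mean activation.
    """
    height, width = len(features[0]), len(features[0][0])
    active: TransactionGrid = [[set() for _ in range(width)] for _ in range(height)]
    for c, channel in enumerate(features):
        mean = sum(sum(row) for row in channel) / (height * width)
        for i in range(height):
            for j in range(width):
                if channel[i][j] > mean:
                    active[i][j].add(c)
    grid: TransactionGrid = []
    for i in range(height):
        row = []
        for j in range(width):
            items = set(active[i][j])
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < height and 0 <= nj < width:
                    items |= active[ni][nj]  # share items with adjacent positions
            row.append(items)
        grid.append(row)
    return grid


def frequent_itemsets(transactions: List[Set[int]], min_support: float,
                      max_size: int = 2) -> Dict[FrozenSet[int], float]:
    """Brute-force support counting for itemsets up to max_size (not Apriori)."""
    counts: Dict[FrozenSet[int], int] = {}
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    n = len(transactions)
    return {k: c / n for k, c in counts.items() if c / n >= min_support}


def support_map(grid: TransactionGrid,
                frequent: Dict[FrozenSet[int], float]) -> List[List[float]]:
    """Score each position by the summed support of frequent itemsets it contains."""
    return [[float(sum(s for items, s in frequent.items() if items <= t)) for t in row]
            for row in grid]


# Toy example: two channels that both fire only at the centre of a 3x3 grid.
features = [
    [[0, 0, 0], [0, 9, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 9, 0], [0, 0, 0]],
]
grid = build_transactions(features)
frequent = frequent_itemsets([t for row in grid for t in row], min_support=0.5)
scores = support_map(grid, frequent)  # centre and its 4-neighbours: 3 * 5/9; corners: 0.0
```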
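Contribution (3) says PLM clusters positions using the support values of the localization map together with the image's feature maps. The sketch below is one plausible reading of that sentence, not PLM itself. It keeps positions whose support reaches a threshold, runs plain k-means (Lloyd's algorithm) on their per-position feature vectors, and returns one bounding box per cluster as a candidate part. The threshold, the use of k-means, the deterministic farthest-point initialization, the fixed iteration count, and the names `kmeans` and `localize_parts` are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right)


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iters: int = 20) -> List[int]:
    """Lloyd's k-means with deterministic farthest-point initialization (assumption)."""
    k = min(k, len(points))
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from every centroid chosen so far.
        far = max(points, key=lambda p: min(_sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):  # fixed number of assignment/update rounds
        labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def localize_parts(support: List[List[float]], features: List[List[List[float]]],
                   threshold: float, k: int) -> List[Box]:
    """Cluster high-support positions by their feature vectors; one box per cluster."""
    positions = [(i, j) for i, row in enumerate(support)
                 for j, s in enumerate(row) if s >= threshold]
    if not positions:
        return []
    # Feature vector of a position = its activations across all channels.
    vectors = [[channel[i][j] for channel in features] for i, j in positions]
    labels = kmeans(vectors, k)
    clusters: Dict[int, List[Tuple[int, int]]] = {}
    for pos, lab in zip(positions, labels):
        clusters.setdefault(lab, []).append(pos)
    boxes = []
    for lab in sorted(clusters):
        rows, cols = zip(*clusters[lab])
        boxes.append((min(rows), min(cols), max(rows), max(cols)))
    return boxes


# Toy example: a 1x5 support map whose first four positions belong to the object,
# with channel 0 firing on the left half and channel 1 on the right half.
support = [[0.9, 0.8, 0.7, 0.6, 0.1]]
features = [[[10, 10, 0, 0, 5]], [[0, 0, 10, 10, 5]]]
parts = localize_parts(support, features, threshold=0.5, k=2)  # [(0, 0, 0, 1), (0, 2, 0, 3)]
```

Thresholding on support first means that only object positions compete for part clusters, which mirrors the abstract's point that OLM's support values indicate where the object is.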
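The abstract does not state which distillation objective the fine-grained classifier in contribution (3) uses. For reference, the sketch below implements the widely used soft-target formulation of Hinton et al. In this formulation, a student is trained to match a teacher's temperature-softened class distribution: L = (1 - α)·CE(y, softmax(z_s)) + α·T²·KL(softmax(z_t / T) ‖ softmax(z_s / T)). Several details are assumptions for illustration: treating a part-based model as the teacher, the default values of `temperature` and `alpha`, and the function names.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def kd_loss(student_logits: List[float], teacher_logits: List[float], label: int,
            temperature: float = 4.0, alpha: float = 0.5) -> float:
    """Hinton-style distillation loss for a single example (assumed objective)."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student))
    # Ordinary cross-entropy against the hard label at temperature 1.
    ce = -log_softmax(student_logits)[label]
    # The T^2 factor keeps soft-target gradients on the same scale as the hard loss.
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


same = kd_loss([0.0, 0.0], [0.0, 0.0], label=0)  # KL term vanishes: 0.5 * ln 2
```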