| Benefiting from improvements in computational power, deep learning methods have achieved great success in computer vision over the past decade. However, this success depends heavily on large-scale, well-labeled training data. The requirement for large quantities of human-labeled training data limits the practicality and scalability of deep learning models, especially for fine-grained visual recognition. In contrast to human-labeled training data, web image search engines are a free source of extensive training images. Learning from web images can ease the extreme dependence of deep learning methods on labeled data. Consequently, training fine-grained recognition models with web images has attracted growing research attention. This dissertation focuses on training fine-grained visual recognition models using web images under deep learning frameworks. It first proposes a dynamically visual disambiguation method for web image collection and uses it to create three web-image-based fine-grained datasets for subsequent research. It then studies in depth the two major challenges of this topic, namely label noise and fine-grained feature learning. This dissertation first proposes a sample selection method to tackle the label noise issue. Then, a category similarity-based distributed labeling method is proposed to promote fine-grained feature learning implicitly. Finally, this dissertation proposes a hybrid approach that addresses label noise and fine-grained feature learning simultaneously. The main research content of this dissertation includes: (1) This dissertation proposes a dynamically visual disambiguation method to tackle visual polysemy in the keyword-based image collection process. This method consists of three steps: discovering text queries, filtering text queries, and removing outliers. Using this method, three web-image-based fine-grained datasets (i.e., Web-Aircraft, Web-Bird, and Web-Car) are created. Baseline experiments on these datasets are
conducted, and the results are reported for future comparison. (2) This dissertation proposes a peer-learning-based sample selection approach to cope with the label noise issue. This approach maintains two networks simultaneously. During training, on the one hand, the two networks teach themselves by updating their parameters with samples on which their predictions disagree; on the other hand, they teach each other by exchanging selected clean samples for network updates. These clean samples are selected from those on which the two networks make identical predictions. Experiments on a real-world web-image-based fine-grained dataset (i.e., Web-Bird) show a 1.19% performance improvement over the state-of-the-art method (i.e., JoCoR), demonstrating the effectiveness of this approach. (3) This dissertation proposes a category similarity-based distributed labeling method to promote fine-grained feature learning. When training fine-grained models, joint supervision by the cross-entropy loss and the center loss is adopted to enhance the compactness of the learned features. Meanwhile, the feature centers produced by the center loss are leveraged for distributed labeling based on feature similarity. Finally, the calculated distributed labels replace the conventional one-hot label distribution when computing the cross-entropy loss, which promotes fine-grained feature learning implicitly. Experiments on widely used fine-grained datasets (i.e., CUB200-2011, FGVC Aircraft, and Stanford Cars) show that performance is boosted by 0.8%, 0.5%, and 0.4%, respectively, compared with the state-of-the-art method (i.e., DCL). (4) This dissertation proposes a hybrid approach to address label noise and fine-grained feature learning simultaneously. To address the label noise issue, this approach adopts a two-step sample selection process to select clean samples and reusable ones. The reusable samples are then label-corrected before being fed into the networks together with the clean ones. To address fine-grained feature learning, a cross-layer
attention-based feature refinement module is proposed. This module enhances the representation ability by taking full advantage of the rich semantic information in the higher layer’s feature maps and the rich spatial content in the lower layer’s feature maps. Experiments on our web datasets (i.e., Web-Aircraft, Web-Bird, and Web-Car) show performance improvements of 4.5%, 6.7%, and 6.5%, respectively, over the state-of-the-art method (i.e., JoCoR), which verifies the superiority of this approach. Through the aforementioned research, this dissertation proposes an overall solution to web-image-based fine-grained visual recognition under deep learning frameworks. Specifically, this dissertation decouples the topic into two sub-topics (i.e., label noise and fine-grained feature learning) and proposes methods for each. Comprehensive experiments verify the effectiveness of each algorithm. These algorithms have been published in journals and conferences (e.g., Pattern Recognition and ACM MM). |
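
The distributed labeling idea in item (3) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not say how feature similarity is measured or normalized, so the cosine similarity, the temperature softmax, the `temperature` value, the function names, and the toy feature centers below are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distributed_label(centers, target, temperature=0.1):
    """Soft label for class `target` built from similarities between class centers.

    Assumption: similarity is cosine similarity, normalized with a temperature
    softmax. The source text does not specify either choice.
    """
    sims = [cosine_similarity(centers[target], c) for c in centers]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits, soft_label):
    """Cross-entropy between a soft label and softmax(logits)."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    log_probs = [z - log_norm for z in logits]
    return -sum(q * lp for q, lp in zip(soft_label, log_probs))


# Hypothetical 2-D feature centers for three classes; classes 0 and 1 are similar.
centers = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
label = distributed_label(centers, target=0)
loss = soft_cross_entropy([2.0, 1.0, -1.0], label)
```

In this sketch the target class keeps the largest share of the label mass, a visually similar class receives a smaller share, and a dissimilar class receives almost none. Passing a one-hot list as `soft_label` reduces `soft_cross_entropy` to the conventional cross-entropy loss.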