Fine-Grained Visual Categorization With Weak Supervision

Posted on: 2022-05-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S B Min
Full Text: PDF
GTID: 1488306323982459
Subject: Cyberspace security
Abstract/Summary:
Fine-Grained Visual Categorization (FGVC) is an important task in computer vision whose goal is to recognize the subordinate categories of a basic category. Because fine-grained categories provide abundant semantic clues, FGVC is widely applied to sensitive image filtering, medical image classification, dangerous goods detection, and similar tasks, which gives it both research significance and social value. Compared with traditional basic-category recognition, FGVC places higher demands on the representation ability of a vision model, which must capture the subtle differences between fine-grained categories. To this end, traditional methods rely on large amounts of manually labeled data with careful local part annotations to learn subtle visual clues. In practice, however, it is hard to obtain enough labeled data with part annotations for all fine-grained categories, since such labeling requires more expert knowledge and costs more than labeling basic-category data. Based on this analysis, this dissertation focuses on three challenges of FGVC: weakly-annotated fine-grained visual categorization, semi-supervised fine-grained visual categorization, and zero-shot fine-grained visual categorization.

1. Weakly-Annotated Fine-Grained Visual Categorization based on Bilinear Feature Normalization

The visual differences between fine-grained categories usually lie in local part regions of an image, but fine-grained annotations are expensive to label. The first research approach of this dissertation therefore focuses on how to discover potentially discriminative visual clues from category labels alone and learn discriminative fine-grained visual features. First, we propose a cross-attention mechanism, which improves feature discrimination through spatial redundancy reduction via an S-net and feature power enhancement via a P-net. The spatial redundancy reduction makes the model focus on important local regions without part annotations, while the feature power enhancement enables the model to mine information from hard samples. Then, the two enhanced features from the P-net and the S-net are fused and projected into a second-order space via bilinear pooling. Finally, we develop a Multi-Objective Lagrange Normalization (MOLN) method, which regularizes second-order features in terms of matrix square root, low rank, and sparsity simultaneously. This not only stabilizes the second-order information during training but also improves feature generalization. Experiments on three FGVC benchmarks show that our methods achieve new state-of-the-art performance, e.g., 89.7% top-1 accuracy on the CUB-200-2011 dataset.

2. Semi-Supervised Fine-Grained Visual Categorization based on Two-Stream Mutual Attention

Although the weakly-annotated methods above reduce the dependence on part annotations, they still require large numbers of manually labeled images (e.g., with category labels) to provide fine-grained visual clues. Compared with expensive manually labeled data, unlabeled data is cheap and readily available. Since unlabeled data also contains much useful visual knowledge, the second research approach of this dissertation explores automatically generated machine labels for unlabeled images to learn more of this latent knowledge and improve recognition performance. First, we propose a Hierarchical Distillation (HD) model that improves the automatically generated pseudo labels by exploiting complementary information between different data views and recognition models. Then, a Two-Stream Mutual Attention Network (TSMAN) is designed to improve the robustness of the model to noisily labeled data. Specifically, TSMAN uses the disagreement between two student models, in terms of both predictions and intermediate features, to indicate potentially noisy gradients, thereby avoiding the negative effects of noisy labels. Finally, HD and TSMAN form an integrated semi-supervised learning framework that extracts more useful visual knowledge from unlabeled images than previous methods. With only a small portion of labeled data, this dissertation obtains superior performance on two medical benchmark datasets. For example, on the HVSMR2016 dataset, we obtain 82.1% pixel recognition accuracy.

3. Zero-Shot Fine-Grained Visual Categorization based on Domain-Aware Disentanglement

The third research approach of this dissertation is zero-shot FGVC, in which only a portion of the categories (the seen categories) is available during training and the model must recognize unseen categories during testing. This dissertation aims to construct semantic relationships between known and unknown categories via semantic labels, e.g., category descriptions, and to learn fine-grained visual features with strong transferability. However, when known and unknown categories are connected in this way, category discrimination is also weakened because semantic labels are themselves only weakly discriminative. To address this, the dissertation first proposes a decomposed semantic projection model, which consists of three separate sub-functions, to capture domain-shared and domain-specific information. This improves the discrimination of semantic features. Then, we further decompose the visual feature into two complementary and separate sub-features, i.e., weak semantic and strong semantic, to handle known and unknown categories, respectively. The strong discrimination of the weak semantic features enables domain detection for an input sample and, specifically, recognition of seen-domain categories. The semantic relations of the strong semantic features can be used to recognize unseen-domain categories. Finally, for the case in which unknown categories are subordinate to known basic categories, we design a Bi-Granularity Semantic Projection Network (BigSPN) to accommodate the visual differences between samples of the two granularities. By designing different visual projection functions, BigSPN learns transferable visual features that generalize well to the recognition of unknown subordinate categories. Experiments on four public zero-shot benchmarks show that our methods obtain an average gain of 5.7%, which demonstrates their effectiveness.

For the three FGVC settings above, and building on existing methods, this dissertation explores bilinear feature complementarity, two-stream network disagreement, and semantic-visual disentanglement, and improves both the discrimination and the transferability of fine-grained visual features. This provides a theoretical foundation and technical support for FGVC in different practical scenarios.
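
To make the second-order pipeline of the first approach more concrete, the following is a minimal NumPy sketch of bilinear pooling followed by a matrix square root and L2 normalization. It is not the MOLN method from the dissertation: the low-rank and sparsity objectives are omitted, the square root is computed by a plain eigendecomposition, and the function names, feature shapes, and the use of a single already-fused feature map are all assumptions made for illustration.

```python
import numpy as np


def bilinear_pool(feat_a, feat_b):
    """Average outer product of two sets of local descriptors.

    feat_a: (N, Ca) and feat_b: (N, Cb), one row per spatial position.
    Returns a (Ca, Cb) second-order feature matrix.
    """
    num_positions = feat_a.shape[0]
    return feat_a.T @ feat_b / num_positions


def matrix_sqrt_psd(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    sym = (mat + mat.T) / 2.0                # remove tiny asymmetries
    eigvals, eigvecs = np.linalg.eigh(sym)
    eigvals = np.clip(eigvals, 0.0, None)    # guard against small negatives
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def normalized_second_order(fused_feat):
    """Pool a fused feature map with itself, then apply the matrix
    square root and flatten to a unit-length vector (hypothetical helper)."""
    pooled = bilinear_pool(fused_feat, fused_feat)   # symmetric PSD
    root = matrix_sqrt_psd(pooled)
    vec = root.reshape(-1)
    return vec / (np.linalg.norm(vec) + 1e-12)


# Example: a 7x7 feature map with 8 channels, flattened to 49 positions.
rng = np.random.default_rng(0)
fused = rng.standard_normal((49, 8))
descriptor = normalized_second_order(fused)
```

Pooling a feature map with itself yields a symmetric positive semi-definite matrix, which is why the eigendecomposition-based square root is well defined here. A method that also targets low rank and sparsity, as the abstract describes, would add further terms that this sketch does not attempt to reproduce.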
Keywords/Search Tags: Fine-Grained Visual Categorization, Weakly-Supervised Learning, Attention Mechanism, Semi-Supervised Learning, Zero-Shot Learning, Transfer Learning, Convolutional Neural Network