In recent years, with the development and popularization of Convolutional Neural Networks (CNNs) and deep learning, computer vision has achieved great success in various image and video research fields, and many specific application scenarios have emerged. Fine-Grained Visual Classification (FGVC) addresses the problem of classifying sub-categories of objects within the same super-category. It has recently attracted extensive attention owing to its wide range of applications, such as intelligent retail and intelligent transportation, and it poses serious challenges: low inter-class variance, high intra-class variance, and a lack of samples. Existing approaches mainly focus on distilling information from high-level features; however, further improvement is limited because low-level information goes unused. In this work, we conduct in-depth research on this issue, and our findings indicate that multi-level information can effectively improve FGVC performance through enhanced feature representation and accurately located discriminative regions. We propose a dual-pathway hierarchy structure built upon backbone networks, with a top-down feature pathway and a bottom-up attention pathway, which generates a feature pyramid and an attention pyramid by learning both high-level semantic and low-level detailed feature representations. Taking this structure as the basis, we further conduct feature encoding and feature refinement to exploit multi-level information. We adapt Multimodal Bilinear Pooling (MBP) to the multi-level features of the feature pyramid to encode pairwise correlations between channels and thus enhance feature representations. We design a Region Proposal Generator (RPG) upon the multi-level attentions of the attention pyramid to locate discriminative region proposals in a weakly supervised fashion. We propose an adaptive Non-Maximum Suppression (NMS) strategy to remove overlapping proposals and merge correlated ones. We then conduct ROI-guided refinement with ROI-guided DropBlock and an ROI-guided zoom-in operation, which refines features by enhancing discriminative local regions and eliminating background noise. The proposed method can be trained end-to-end in a weakly supervised fashion and achieves competitive results on two CNN backbones and three widely used FGVC datasets.
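
To illustrate the proposal-filtering step mentioned above, the following is a minimal sketch of standard greedy NMS over axis-aligned boxes. It is not the adaptive NMS strategy proposed in this work, whose merging of correlated proposals is not specified here. The box format `(x1, y1, x2, y2)`, the function names `iou` and `greedy_nms`, and the default threshold of 0.5 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    Standard greedy NMS: visit boxes in descending score order and keep
    a box only if it does not overlap any already-kept box by more than
    iou_threshold. The threshold value is a hypothetical default.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical region proposals: the first two overlap heavily.
proposals = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
confidences = [0.9, 0.8, 0.7]
kept = greedy_nms(proposals, confidences)
```

In this example the second proposal overlaps the first with an IoU of about 0.68, so it is suppressed and `kept` contains indices 0 and 2. An adaptive variant would additionally decide when overlapping proposals should be merged rather than discarded.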