| Traditional image recognition requires manual features,which requires high cost.Some image features are not universal,since they are designed according to specific scenarios.With the rapid development of deep learning,convolutional neural networks(CNNs),as a branch of it,effectively avoids artificial design and manual extraction,and have been widely used in the field of image recognition.However,coarse-grained image classification cannot meet people’s deeper understanding of images,and fine-grained image classification methods have emerged.The task of fine-grained image classification requires the identification of image details.Low inter-class variances and large intra-class variances make it difficult to distinguish hundreds of subordinates even for human beings.In addition,complicated image backgrounds,different poses,views and illumination conditions in one class make the task much more challenging.Building a suitable CNN structure can achieve end-to-end fine-grained image classification tasks.First of all,we introduce the basic theory of CNNs and bilinear CNNs.Then,on the datasets of CUB-200-1011 and Stanford Car,we verified the classification accuracy of bilinear CNN on single-label fine-grained images,by adjusting the network hierarchy,regularization method,the size of input image to optimize the classification task.The result shows that larger sizes of images have higher accuracy.The method of bilinear CNNs are not adaptive for more complex finegrained image data,such as multi-label,data imbalance,etc.We propose a coarse-and-fine mixed method which changes the sampling method of training set and builds a deep network.The advantage of this novel method is combining the spatial pyramid pool and feature concatenation.This network extracts model’s feature maps by different convolution layers and combines through the spatial pyramid pooling layer.The discriminative regions are automatically detected by using the global and local information of images.Great improvement can be obtained for the classification of fine-grained images.The method of ensemble learning is used to get further optimization.Compared with the classical CNN model,experimental results show that the coarse-and-fine mixed classification method has a better performance in accuracy and F1 score on the dataset of human protein atlas(hpa). |