| Compared with natural scene images, remote sensing scene images are captured from high altitude, so they contain smaller key objects, larger object deformations, more constituent objects and more complex background information. Existing convolutional neural networks are limited by their receptive field, which leads to fragmentation between the extracted features, while transformer networks that learn from the whole image understand local semantics insufficiently. Moreover, applying these methods to remote sensing scene classification requires pre-training on ImageNet, which makes research on remote sensing scene image classification dependent on large models; training a classifier on remote sensing scene datasets alone remains extremely challenging. This thesis introduces the attention-based multiple instance learning framework and studies lightweight attention-based multiple instance networks for remote sensing scene classification from both the global-image and local-image perspectives. (1) To address the problem of numerous constituent objects and complex background information, a multiple instance network that fuses instance-level and scene-level information is proposed and explored without pre-training. The method consists of a densely connected transformer module and an instance location-aware module: the former combines convolution and self-attention to learn global and local relationships, while the latter learns the potential locations of key instances. A highly discriminative classification model is obtained by combining the classification losses of the two components. (2) To address the problem of large object deformation, a mixed-attention multiple instance learning network is proposed. Features are extracted with a lightweight network and then enhanced with a mixed-attention multiple instance pooling method, forming an end-to-end neural network; a global classification loss is constructed by combining the two parts to obtain a highly discriminative classification model. The research is validated on three publicly available remote sensing datasets: UCM, AID and NWPU-RESISC. Compared with other studies, the proposed approach improves remote sensing scene classification performance while reducing the number of model parameters and the computational cost. The method has also been partially evaluated on the natural scene datasets CIFAR-10 and ImageNet 2012, where it achieved improved results. |
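
To make the idea of attention-based multiple instance pooling more concrete, the following is a minimal sketch of the generic attention pooling commonly used in multiple instance learning. It is not the thesis's mixed-attention module, whose details are not given here. An image is treated as a bag of instance feature vectors, each instance receives a learned attention score, and the bag-level representation is the attention-weighted sum of the instances. The function name `attention_mil_pool`, the weight names `V` and `w`, and the random values standing in for learned weights and features are all assumptions made for illustration.

```python
import math
import random


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance feature vectors into one bag-level vector.

    instances: list of K feature vectors, each of length D
    V: projection with L rows of length D (hypothetical learned weights)
    w: scoring vector of length L (hypothetical learned weights)
    Returns (bag_vector, attention_weights).
    """
    scores = []
    for h in instances:
        # Hidden representation tanh(V h), then a scalar score w^T tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wl * ul for wl, ul in zip(w, hidden)))

    # Softmax over the instances, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Bag-level feature: attention-weighted sum of the instance features.
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attn, instances)) for d in range(dim)]
    return bag, attn


# Toy bag: K instances with D features each (random stand-ins for the
# features a lightweight backbone would produce).
random.seed(0)
K, D, L = 6, 4, 3
instances = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L)]
w = [random.gauss(0, 1) for _ in range(L)]

bag, attn = attention_mil_pool(instances, V, w)
print("attention weights:", [round(a, 3) for a in attn])
print("bag feature:", [round(b, 3) for b in bag])
```

In a trained network the bag feature would feed a classifier, and the attention weights indicate which instances (image regions) contributed most to the scene-level decision.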