With the rapid development of Earth observation technology in China, the volume of high-resolution remote sensing image data has grown geometrically. These high-resolution, multimodal remote sensing data, with their high timeliness, clear spatial structure, and rich detail, are widely used in many fields of national production and daily life. Data at this scale have promoted the development of remote sensing applications and placed higher demands on intelligent interpretation techniques for remote sensing images. Traditional remote sensing image understanding methods rely on hand-crafted features and machine learning models to obtain labels for pixels, targets, and scenes; they have limited generalization ability and ignore target attributes and the relationships between targets in an image. How to use massive high-resolution remote sensing data to mine the attributes of image targets and the relationships between them, and thereby understand the content of remote sensing images at the semantic level, has become a topic that urgently needs to be explored. Semantic description of remote sensing images offers a new way to understand high-resolution remote sensing images and to bridge the semantic gap between low-level image features and high-level semantics. Most existing semantic understanding methods for remote sensing images focus on scene classification, target recognition, and image segmentation, without addressing the attributes of the targets themselves or mining the relationships between targets.

At present, the semantic description of high-resolution remote sensing images faces several problems. (1) Existing methods mostly use convolutional neural networks to extract global image features, without considering the typical target features of remote sensing
images, so the relationships between targets may be expressed inaccurately or multiple targets may be represented incompletely. (2) The exposure bias problem: the model's behavior during training does not match its behavior during testing. (3) Most existing methods do not consider the integrity and fluency of the generated text: automatic evaluation metrics assess the generated sequences word by word rather than evaluating whole sentences, and the cross-entropy loss cannot assess the fluency of the generated utterances.

In response to the above problems, the main contents and innovations of this research cover three aspects. (1) To address the failure of existing remote sensing image description methods to fuse typical targets with their contextual information, a scene-object feature fusion method for remote sensing image description generation is proposed. A feature pyramid pooling module extracts image scene features with contextual semantics, and a target detection model optimized with the lightweight and efficient Triplet attention module extracts the typical target features of the image. The two kinds of features are fused by a two-layer long short-term memory (LSTM) network to achieve information complementarity between targets, and between targets and their context, thereby improving the accuracy of the generated description text. (2) To address exposure bias in the text generation process, an image-retrieval-driven semantic description method for remote sensing images is proposed. By retrieving an image-text sample library, representative text descriptions of images similar to the test image are selected as reference text for training. The retrieval model and the text generation model based on the scene-object feature fusion method together serve as the generator of a generative adversarial network; the discriminator compares the generated description with the reference text and judges whether an utterance comes from the sample library or was machine-generated. This avoids the exposure bias caused by directly using the cross-entropy loss and improves the accuracy of the generated text. (3) To improve the evaluation mechanism for remote sensing image description generation and the fluency of the generated text, a description generation method incorporating semantic relevance is proposed. Semantic relevance serves as the constraint of a conditional generative adversarial network, and the generative model with fused image retrieval serves as its generator. When the discriminator judges the generated text, a semantic relevance score is obtained by computing the correlation between the text generated for a test image and the reference text library corresponding to all images in the dataset, and this score is used as the evaluation index of the generated text. Meanwhile, the loss function of the discriminator is optimized by introducing the logarithmic correlation of unrelated texts, which improves the diversity and fluency of the text sequences generated for remote sensing images.
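The semantic relevance score in contribution (3) can be illustrated with a minimal sketch. The abstract does not specify the correlation measure used, so the bag-of-words cosine similarity below, and the choice of taking the maximum over the reference library, are assumptions for illustration only; the function names are likewise hypothetical.

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two sentences
    (an illustrative stand-in for the thesis's correlation measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_relevance(generated: str, reference_library: list[str]) -> float:
    """Score a generated caption by its strongest correlation with any
    reference description in the library."""
    return max(bow_cosine(generated, ref) for ref in reference_library)

# Toy reference text library (invented examples).
references = [
    "many planes are parked at the airport terminal",
    "a dense residential area with rows of houses",
]
score = semantic_relevance("several planes parked near the terminal", references)
print(round(score, 3))
```

In the full method this score would condition the discriminator rather than be printed, and a learned sentence-level similarity would replace the bag-of-words cosine; the sketch only shows how a generated caption can be ranked against a reference library.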