
Remote Sensing Image Captioning Based on Semantic Prior Information

Posted on: 2024-07-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X T Ye
Full Text: PDF
GTID: 1522307340473864
Subject: Circuits and Systems
Abstract/Summary:
Remote sensing image captioning is an important problem in remote sensing image interpretation: given a remote sensing image, the task is to generate a natural language description of its content. Compared with traditional intelligent interpretation tasks, remote sensing image captioning not only identifies the objects in an image but also explores the spatial and semantic relationships among them. It therefore has a wide range of potential applications in environmental monitoring, urban planning, disaster assessment, and related areas. Research on remote sensing image captioning is still in its early stages, and further exploration is needed to integrate the characteristics of remote sensing data, the specifics of remote sensing applications, and the requirements of real-world environments. This dissertation studies semantic prior information-based methods that incorporate the rich image and text prior information available in remote sensing data, with the goal of improving the accuracy of remote sensing image captions. The main research contents and innovative contributions of this dissertation are as follows:

(1) To address semantic ambiguity in remote sensing image captioning, a multi-label semantic feature fusion method is proposed. This method introduces multi-label semantic information into two-stage remote sensing image captioning for the first time, alleviating semantic ambiguity in complex remote sensing scenes and exploiting more comprehensive and accurate prior semantic information. A robust multi-label semantic attribute extraction algorithm is designed to capture multi-label semantic attributes in remote sensing images, and two cross-modal semantic feature fusion operators are proposed to obtain discriminative feature representations for captioning. The effectiveness of the method is validated through extensive experiments.

(2) To tackle the mutual interference between tasks in previous two-stage remote sensing image captioning methods, a jointly trained two-stage captioning method is proposed. First, a differentiable sampling operator replaces the traditional non-differentiable sampling operation, so that the image captioning and multi-label classification tasks can be optimized together through back-propagation. A dynamic contrastive loss function maintains a probability gap between positive and negative labels during sampling, effectively improving the accuracy and stability of multi-label classification. An attribute-guided decoder filters the multi-label prior information produced by the sampling operator, so that the generated captions align better with the image content. Extensive experiments demonstrate the effectiveness of the method.

(3) To address the neglect of multi-scale prior semantic information in remote sensing change captioning, a multi-scale differential semantic prior information-based method is proposed. First, the multi-temporal remote sensing images are divided into multiple scales, and a pre-trained image-text retrieval model extracts semantic features for the multi-scale changes. A cross-modal information interaction module fuses image information with differential semantic information, exploiting the complementary nature of the two modalities to enhance feature representation. A multi-level semantic information aggregation network better utilizes differential semantic information at different scales and alleviates the imbalance among scales. Experimental results demonstrate that the method achieves state-of-the-art performance on the multi-temporal remote sensing image change captioning task.

(4) To address the underutilization of structured multi-level prior information from image and text semantics in remote sensing image captioning, a multi-modal and multi-level prior information-based captioning method is proposed. The method systematically organizes image and text prior information into three levels: global, local, and associated semantic. Through task simplification, prior information extraction is converted into multi-modal retrieval, so that the three types of prior features can be extracted from remote sensing images at reduced computational cost. Attention-based cross-level information aggregation networks and cross-modal information fusion networks aggregate prior features from different levels and modalities, thereby better representing the semantic relationships between images and text. A multi-level prior information-based decoder guides the generation process with multi-level prior information, yielding more accurate captions. Extensive experimental results demonstrate the effectiveness of the method.

(5) To address the inability of existing remote sensing image captioning methods to meet the demands of open-world scenarios, a task-oriented continual novel object captioning method is proposed. The dissertation analyzes why existing remote sensing image captioning tasks fail to meet open-world requirements and introduces a new task: continual novel object captioning for remote sensing images. The proposed method learns novel objects according to task needs and overcomes the catastrophic forgetting problem in continual learning. A pseudo pair generation strategy lets the model obtain pseudo image-sentence pairs containing novel objects from unpaired data, avoiding expensive manual annotation. A compositional decoder simulates the human learning process, enabling the model to gradually learn to describe novel objects. A feature reconstruction model captures the data distribution of different tasks during continual learning, mimicking human memory. Extensive experiments on the UCM-Captions, RSICD, held-out MS-COCO, and Open Images V4 datasets prove the effectiveness of the proposed method.
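The abstract does not specify the form of the two cross-modal fusion operators in contribution (1). As an illustrative sketch only (not the dissertation's actual operator), a gated residual fusion of an image feature with a multi-label attribute embedding could look like the following; the weight `W` and bias `b` are hypothetical learned parameters:

```python
import numpy as np

def gated_fusion(v_img, a_sem, W, b):
    """Fuse an image feature v_img (d,) with a multi-label semantic
    attribute embedding a_sem (d,) via a learned sigmoid gate.
    W: (d, 2d) weight, b: (d,) bias -- illustrative parameters."""
    z = np.concatenate([v_img, a_sem])           # (2d,)
    gate = 1.0 / (1.0 + np.exp(-(W @ z + b)))    # element-wise gate in (0, 1)
    return v_img + gate * a_sem                  # gated residual fusion

# toy usage with d = 4
rng = np.random.default_rng(0)
d = 4
v = rng.standard_normal(d)
a = rng.standard_normal(d)
fused = gated_fusion(v, a, rng.standard_normal((d, 2 * d)), np.zeros(d))
print(fused.shape)  # (4,)
```

The gate lets the model decide, per feature dimension, how much semantic prior information to inject into the visual representation.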
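Contribution (2) hinges on replacing non-differentiable label sampling with a differentiable surrogate, but the abstract does not name the operator. One common choice for independent binary labels is the Gumbel-sigmoid (Binary Concrete) relaxation with a straight-through step; the sketch below shows only the forward computation in numpy, whereas in an autograd framework the soft relaxation is what carries gradients back to the classifier logits:

```python
import numpy as np

def relaxed_bernoulli_sample(logits, tau, rng):
    """Binary Concrete / Gumbel-sigmoid relaxation: a differentiable
    surrogate for sampling each label independently. In an autograd
    framework this expression lets gradients flow to `logits`."""
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)               # Logistic(0, 1) noise
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

def straight_through(soft):
    """Hard 0/1 labels for the forward pass; gradients would be taken
    w.r.t. the soft relaxation (straight-through trick)."""
    return (soft > 0.5).astype(soft.dtype)

rng = np.random.default_rng(42)
logits = np.array([3.0, -3.0, 0.5])   # per-label classification scores
soft = relaxed_bernoulli_sample(logits, tau=0.5, rng=rng)
hard = straight_through(soft)
print(hard)
```

The temperature `tau` trades off smoothness of the relaxation against fidelity to discrete sampling; this is a generic sketch of the technique, not the dissertation's specific operator.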
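The dynamic contrastive loss of contribution (2) is described only as maintaining a probability gap between positive and negative labels. A minimal hinge-style version of that idea (with a fixed margin, since the dynamic scheduling is not specified in the abstract) could be sketched as:

```python
import numpy as np

def probability_gap_loss(probs, pos_mask, margin=0.2):
    """Hinge-style loss that is zero only when the least-confident
    positive label outscores the most-confident negative label by
    at least `margin`. pos_mask: boolean array marking positives."""
    min_pos = probs[pos_mask].min()
    max_neg = probs[~pos_mask].max()
    return max(0.0, margin - (min_pos - max_neg))

probs = np.array([0.9, 0.7, 0.3, 0.1])
pos = np.array([True, True, False, False])
print(probability_gap_loss(probs, pos))  # gap 0.7 - 0.3 = 0.4 >= 0.2 -> 0.0
```

When the gap shrinks below the margin, the loss grows linearly, pushing positive-label probabilities up and negative-label probabilities down during sampling.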
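Contributions (3) and (4) both rely on cross-modal interaction between image features and retrieved semantic prior features. The abstract does not give the module's architecture; a standard building block for this kind of fusion is single-head scaled dot-product attention with a residual connection, sketched here as an assumption rather than the author's design:

```python
import numpy as np

def cross_modal_attention(img_feats, sem_feats):
    """Single-head scaled dot-product attention: image features act as
    queries, retrieved semantic prior features as keys and values.
    img_feats: (n, d), sem_feats: (m, d)."""
    d = img_feats.shape[-1]
    scores = img_feats @ sem_feats.T / np.sqrt(d)     # (n, m)
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    attended = weights @ sem_feats                    # (n, d)
    return img_feats + attended                       # residual fusion

rng = np.random.default_rng(1)
out = cross_modal_attention(rng.standard_normal((5, 8)),
                            rng.standard_normal((3, 8)))
print(out.shape)  # (5, 8)
```

Each image region attends to the prior features most relevant to it, so complementary semantic information is injected without discarding the original visual representation.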
Keywords/Search Tags: Remote sensing image captioning, Remote sensing change captioning, Multi-modal learning, Multi-task learning, Feature fusion, Image prior information, Text prior information, Continual learning