
Research On Key Technologies Of Image Captioning Based On Semantic Relation Enhancement

Posted on: 2024-03-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N N Hu
Full Text: PDF
GTID: 1528306944470164
Subject: Electronic Science and Technology
Abstract/Summary:
Image captioning (IC) aims to enable computers to perceive the content semantics and detailed entity relationships in given images, align visual and textual semantics, and then map the visual content into textual sentences through language inference. It is a cross-modal understanding task that bridges computer vision (CV) and natural language processing (NLP) and a key research direction in the development of cognitive intelligence. With a wide range of applications, such as multimodal image-text retrieval and visual assistance for the visually impaired, image captioning has significant academic and industrial value.

The goal of image captioning is to establish a semantic mapping from images to sentences. As an important basis for semantic reasoning and association analysis, context-based semantic relation learning plays a key role in this cross-modal mapping. Existing image captioning models explore latent semantic correlations and realize cross-modal semantic mapping between vision and text mainly within the encoder-decoder framework. However, these models still learn semantic relationships insufficiently during visual encoding, cross-modal semantic alignment, and language decoding, which poses serious challenges to the rapid generation of high-quality sentences. This dissertation focuses on enhancing the learning of semantic relationships in three respects: visual semantic relationships, cross-modal relationships, and semantic relationships between words. The main innovative contributions are as follows:

(1) Research on visual semantic relationship enhancement based on a multi-head association attention mechanism. Based on an analysis of the leading visual encoding methods in existing captioning models, and to address the missing image-element relationship features and the insufficient understanding of element relationships caused by independent, parallel modeling across attention heads, this dissertation proposes a Transformer-based Multi-head Association Attention Enhancement Network (MAENet). The model first introduces association parameters to compute correlation weights between different attention heads; through cross-branch modeling, it effectively captures and integrates element relationships across attention channels, enhancing the semantic encoding of relationships in the image content. Additionally, to keep the attention results consistent with the encoding expectations, a dual-level aggregation technique over the channel and spatial dimensions enhances the multi-head association attention results, thereby improving the guidance of effective attention during decoding. The proposed model is validated on the widely used MS COCO dataset. The X-Linear-attention-enhanced model scores 38.3% and 28.8% on BLEU-4 and METEOR under cross-entropy loss training, 1% and 1.3% higher than the best comparison model, respectively. After reinforcement-learning fine-tuning, the highest METEOR and SPICE scores are 29.6% and 23.5%, 0.1% and 0.5% higher than the best comparison model. This indicates that the proposed model enhances visual element relation learning and makes the generated captions contain more diverse relation-attribute representations. A minimal sketch of the cross-head association idea is given below.
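The abstract does not give MAENet's exact equations, so the following is only a hedged PyTorch sketch of the cross-branch idea: a learnable head-association matrix (an assumed parameterization, like all module and variable names here) mixes the per-head attention outputs so that relationship evidence can flow between attention channels. The channel/spatial dual-level aggregation is omitted.

```python
# Hypothetical sketch: multi-head attention whose per-head outputs are mixed by a
# learned head-association matrix before the output projection (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociationMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Association parameters: entry (i, j) weights how much head j's output
        # contributes to head i (cross-branch modeling).
        self.assoc = nn.Parameter(torch.eye(num_heads))
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, heads, N, d_k): standard scaled dot-product attention per head.
        q, k, v = (t.view(B, N, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        heads = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1) @ v
        # Cross-branch mixing with normalized association weights.
        w = F.softmax(self.assoc, dim=-1)                  # (h, h)
        heads = torch.einsum("ij,bjnd->bind", w, heads)
        return self.out(heads.transpose(1, 2).reshape(B, N, self.h * self.d_k))
```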
(2) Research on cross-modal semantic relationship enhancement based on dual attention fusion. Based on an analysis of the factors affecting visual-text cross-modal semantic alignment in existing captioning models, and to address the missing features and insufficient cross-modal semantic matching caused by single-feature-stream embedding, this dissertation proposes a Triple-Stream feature Fusion Network (TSFNet) based on dual attention. A novel three-level encoder-decoder framework is designed to obtain a complete representation of the image content. To enhance cross-modal semantic alignment, the model introduces a novel dual attention mechanism that combines a self-attention branch and a soft-attention branch; this mechanism incorporates visual content interaction to strengthen the relationship between the three feature streams and the textual description. Furthermore, a multi-element residual module fuses the attention features from the three streams, and the fused features are fed into a global-perception decoder so that the global image content can guide the generation of description sentences. The proposed model is validated on the widely used MS COCO dataset. TSFNet scores 78.9% and 39.3% on BLEU-1 and BLEU-4 under cross-entropy loss training, 0.8% and 0.9% higher than the best comparison model, respectively. After reinforcement-learning fine-tuning, TSFNet scores 81.7% and 40.3% on BLEU-1 and BLEU-4, further improvements of 0.3% and 0.5% over the best comparison model. This shows that the proposed model enhances the learning of semantic relationships between image and text, makes cross-modal semantic alignment more adequate, and generates more accurate and comprehensive descriptions of the image content. A minimal sketch of the dual attention branch combination is given below.
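As a hedged illustration only (the exact formulation is not given in this abstract, and all names below are assumptions), one way to combine a self-attention branch, which models relations within a visual feature stream, with a soft-attention branch, which attends over the same features conditioned on the decoder's text state:

```python
# Hypothetical sketch of a dual attention block (assumed design): a self-attention
# branch over visual features plus an additive soft-attention branch scored against
# the current text state, with the two branch outputs fused by a linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    def __init__(self, d: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.w_v = nn.Linear(d, d)   # projects visual features for scoring
        self.w_h = nn.Linear(d, d)   # projects the decoder text state
        self.score = nn.Linear(d, 1)
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, feats: torch.Tensor, text_state: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, d) features of one stream; text_state: (B, d).
        sa, _ = self.self_attn(feats, feats, feats)       # intra-stream relations
        e = self.score(torch.tanh(self.w_v(feats) + self.w_h(text_state).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)                       # (B, N, 1) soft attention
        ctx = (alpha * feats).sum(dim=1, keepdim=True)    # text-guided context vector
        # TSFNet would apply this per stream, then fuse the three streams with a
        # multi-element residual module (omitted here).
        return self.fuse(torch.cat([sa, ctx.expand_as(sa)], dim=-1))
```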
(3) Research on enhancing semantic relationships between words based on capsule networks in semi-autoregressive decoding. Based on an analysis of the language decoding process required by existing captioning models, and to address the problem that non-autoregressive language inference ignores inter-word relationships, leaving rapid language generation without inter-word guidance, this dissertation introduces a capsule network layer into the captioning framework for the first time and proposes a novel Semi-Autoregressive Transformer with a Capsule network layer (SATC). A group mask attention layer is first designed to control the model's semi-autoregressive behavior: words within a group are output in parallel, while groups are emitted from left to right, so the whole model retains fast inference. More importantly, the model uses the dynamic routing mechanism of the capsule network layer to model the history, future, and other temporal attributes of the candidate information for the whole sequence, thereby enhancing the learning of inter-word dependencies during fast inference and guiding language generation. In addition, an n-gram loss based on continuous group masks is designed for training and combined with a distillation loss into a joint loss function, further strengthening the guidance of fragment-level and sentence-level information during language inference and alleviating the word repetition caused by parallel inference. The proposed model is validated on the MS COCO captioning dataset. The results show that with the mask grouping set to K=2, the BLEU-1 gap to the autoregressive comparison model is reduced from 1.5% to 0.6% under cross-entropy loss training. After reinforcement-learning fine-tuning, the ROUGE-L score is 58.6%, 0.2% higher than the semi-autoregressive comparison model. Meanwhile, SATC's inference is 3.13 times faster than the autoregressive comparison model and 0.87 times faster than the semi-autoregressive comparison model. This indicates that the quality of language obtained by rapid inference improves significantly once inter-word relational guidance is enhanced. A minimal sketch of the group mask is given below.
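The group mask follows directly from the description above; here is a small, self-contained sketch (the helper name and boolean layout are assumptions) showing the block-causal pattern for the reported setting K=2, where tokens inside a group attend to one another and to all earlier groups.

```python
# Hypothetical sketch of a block-causal "group mask": positions in the same group of
# size K are generated in parallel and may attend to one another, while groups remain
# strictly left-to-right.
import torch

def group_causal_mask(seq_len: int, k: int) -> torch.Tensor:
    groups = torch.arange(seq_len) // k            # group index of each position
    # Position i may attend to position j iff j's group is not after i's group.
    return groups.unsqueeze(1) >= groups.unsqueeze(0)   # (seq_len, seq_len) bool

print(group_causal_mask(4, 2).int())
# tensor([[1, 1, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 1],
#         [1, 1, 1, 1]])
```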
To summarize, by enhancing semantic relations in visual encoding, cross-modal semantic alignment, and language inference, this dissertation improves the richness and comprehensiveness of generated captions and significantly improves the quality of sentences obtained from fast inference. The work provides technical support for other multimodal understanding tasks, such as video description and cross-modal retrieval, and promotes the application of image captioning in real-time intelligent scenarios.

Keywords/Search Tags: image captioning, semantic relationship enhancement, association attention enhancement, dual attention fusion, capsule network