
Research On Multimodal Fusion Based On Latent Structure Representation Learning

Posted on: 2024-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Cheng
Full Text: PDF
GTID: 2568307106984089
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of artificial intelligence, the information people encounter in daily life has gradually shifted from single-modal text or images to multi-modal data that integrates vision and language. Such data involve not only images and text but also rich visual scenes, so how to understand and reason over them efficiently has become a major concern for researchers. The image captioning task combines two active research directions, natural language processing and computer vision: it requires a computer to recognize an input image and output fluent natural-language sentences describing its content. It therefore has broad application prospects in autonomous driving, intelligent assistance, and other areas. Research on image captioning is developing rapidly and many different methods have emerged, but existing methods still suffer from coarse feature granularity, incomplete feature acquisition, and inaccurate contextual relations in the generated sentences. Starting from multi-modal fusion based on latent structure representation learning, this thesis analyzes current research methods in image processing and natural language processing and carries out the following work on two aspects, image feature extraction and textual context:

(1) Feature extraction based on object detection networks still lacks fine-grained and contextual information, which leads to inaccurate recognition of object details in the image and degrades language generation. An image captioning method based on latent-space feature enhancement is therefore proposed. An efficient convolutional network, the Multi-order Feature Aggregation Net (MOAN), is designed and applied to the object regions of the image to efficiently capture multi-order feature interactions. The multi-order features are then fused with the regional features to obtain enhanced features carrying more detailed information, which are fed into the decoder to generate the description. Experiments show that this method generates more detailed descriptions.

(2) Most current models rely on the attention mechanism to capture interactions among salient regional features, so they fail to fully mine the information in non-salient regions of the image (incomplete feature acquisition) and ignore contextual relationships when generating descriptions, which weakens the model's reasoning about spatial relations in the image and semantic relations between contexts. An image captioning model based on cross-modal Multi-dimensional Relationship Enhancement (MRE) is therefore proposed. First, a feature diversity module is designed to mine sub-salient regional features related to the salient regions, strengthening the representation of the image's spatial-relationship features. A context-guided attention module is then designed to guide the model in learning the linguistic context relations of the image, achieving cross-modal relationship alignment. Experiments on the MSCOCO dataset show that the proposed model achieves better performance, and the generated sentences are improved in completeness and relational accuracy.
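As a concrete illustration of the first contribution, the following is a minimal PyTorch-style sketch of multi-order feature aggregation over object-region features. It is not the thesis code: the class name MultiOrderFeatureAggregation, the use of depthwise convolutions of increasing kernel size to model interaction orders, and the gated residual fusion are all assumptions made for illustration, based only on the abstract's description of MOAN.

import torch
import torch.nn as nn

class MultiOrderFeatureAggregation(nn.Module):
    """Illustrative stand-in for MOAN: aggregate multi-order interactions
    among region features and fuse them back into the originals."""
    def __init__(self, dim=2048, orders=3):
        super().__init__()
        # One depthwise 1-D convolution per interaction order, applied
        # along the region axis with a growing receptive field.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=2 * k + 1, padding=k, groups=dim)
            for k in range(1, orders + 1)
        )
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions):                     # (batch, n_regions, dim)
        x = regions.transpose(1, 2)                 # (batch, dim, n_regions)
        multi_order = sum(conv(x) for conv in self.convs).transpose(1, 2)
        g = self.gate(regions)                      # per-feature fusion gate
        # Gated residual fusion: the enhanced output keeps the original
        # regional information and adds multi-order detail on top.
        return regions + g * self.proj(multi_order)

The gated residual keeps the original region features intact, so the decoder always receives at least the baseline detector output; this is a common fusion pattern, not a claim about the thesis design.

Similarly, a hedged sketch of the second contribution's context-guided attention, reusing the imports above. The idea of a language-context vector (for example, a summary of previously generated words) modulating the attention query is an assumption inferred from the phrase "context-guided attention"; all names and dimensions are illustrative.

class ContextGuidedAttention(nn.Module):
    """Illustrative cross-modal attention in which a language-context
    vector steers the query over visual region features."""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.ctx = nn.Linear(dim, dim)

    def forward(self, hidden, regions, context):
        # hidden: (batch, dim), regions: (batch, n, dim), context: (batch, dim)
        q = self.q(hidden) + self.ctx(context)      # context-modulated query
        scores = torch.einsum('bd,bnd->bn', q, self.k(regions))
        attn = torch.softmax(scores / regions.size(-1) ** 0.5, dim=-1)
        return torch.einsum('bn,bnd->bd', attn, self.v(regions))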
Keywords/Search Tags: Image caption generation, latent structure feature extraction, contextual semantics, multidimensional relationship