
Research On Cross-modal Semantic Alignment For Vision And Language

Posted on: 2024-04-03    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S Y Wu    Full Text: PDF
GTID: 1528306932457744    Subject: Information and Communication Engineering
Abstract/Summary:
Visual modality is an important way for humans to understand the world, while language is an important bridge for communication between humans. To enable artificial intelligence models to understand visual and language information, scholars have conducted extensive research in computer vision and natural language processing and have achieved considerable results. However, with the continuing maturation of Internet technology and the widespread adoption of intelligent mobile terminals, information dissemination has gradually evolved from a single visual or language modality into a multi-modal form that integrates vision and language. This brings a growing demand for intelligently processing multi-modal data that fuses visual and language information. Methods based on deep neural networks have made remarkable progress in learning visual and language representations; however, in the cross-modal domain of vision and language, how to learn semantic alignment across modalities is still a challenging problem. The keys to enabling artificial intelligence models to achieve cross-modal semantic alignment are: (i) learning visual and language representations with semantic consistency; (ii) fully exploring the semantic information contained in the visual and language modalities; (iii) establishing accurate associations between the visual and language modalities. To address these three key issues, this thesis studies cross-modal semantic alignment methods for vision and language, aiming to enhance the ability of artificial intelligence models to understand cross-modal information in different application scenarios. The main research contents of this thesis are as follows:

· Propose a method for image paragraph generation based on a hierarchical policy network and a hierarchical supervision mechanism. Image paragraph generation aims to generate a semantically rich and coherent paragraph describing the content of an image. Existing methods do not fully exploit the fine-grained semantic alignment between matched image-paragraph pairs in the training data. To this end, this thesis proposes a method based on a hierarchical policy network and a hierarchical supervision mechanism. The hierarchical policy network generates paragraphs top-down by exploiting the "paragraph-sentence-word" hierarchy of a paragraph. The hierarchical supervision mechanism provides fine-grained supervision for the generated paragraphs and their components at the paragraph, sentence, and word levels. In particular, to fully exploit the semantic alignment between cross-modal information, the hierarchical supervision mechanism uses the distance between visual and textual information in a shared semantic space, together with evaluation metrics that measure the similarity between generated and ground-truth paragraphs, as bridges connecting the two modalities. Experimental results show that, when optimized with the supervision signals provided by the hierarchical supervision mechanism, the hierarchical policy network generates paragraphs with richer and more coherent semantics.
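To make the top-down "paragraph-sentence-word" decoding idea concrete, the following is a minimal sketch only, not the thesis implementation or its policy-gradient training: it assumes a pooled image feature, a sentence-level GRU that emits a topic vector and a stop logit per sentence, and a word-level GRU that decodes each sentence from its topic. The class name HierarchicalParagraphDecoder and all dimensions are illustrative assumptions.

# A minimal sketch (not the thesis implementation) of top-down
# "paragraph -> sentence -> word" decoding with two nested GRUs.
# All module names, dimensions, and inputs are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalParagraphDecoder(nn.Module):
    def __init__(self, img_dim=2048, hid_dim=512, vocab_size=10000,
                 max_sents=6, max_words=20):
        super().__init__()
        self.hid_dim, self.max_sents, self.max_words = hid_dim, max_sents, max_words
        # Sentence-level RNN: one step per sentence, emits a topic vector
        # and a continue/stop logit that governs paragraph length.
        self.sent_rnn = nn.GRUCell(img_dim, hid_dim)
        self.topic = nn.Linear(hid_dim, hid_dim)
        self.stop = nn.Linear(hid_dim, 1)
        # Word-level RNN: unrolls within one sentence, conditioned on its topic.
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.word_rnn = nn.GRUCell(hid_dim * 2, hid_dim)
        self.word_out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feat):                          # img_feat: (B, img_dim)
        B = img_feat.size(0)
        h_sent = img_feat.new_zeros(B, self.hid_dim)
        word_logits, stop_logits = [], []
        for _ in range(self.max_sents):
            h_sent = self.sent_rnn(img_feat, h_sent)      # sentence-level step
            topic = torch.tanh(self.topic(h_sent))        # topic vector for this sentence
            stop_logits.append(self.stop(h_sent))
            h_word = torch.zeros_like(h_sent)
            word = torch.zeros(B, dtype=torch.long)       # <BOS> token id assumed to be 0
            sent_logits = []
            for _ in range(self.max_words):
                inp = torch.cat([self.embed(word), topic], dim=-1)
                h_word = self.word_rnn(inp, h_word)       # word-level step
                logits = self.word_out(h_word)
                sent_logits.append(logits)
                word = logits.argmax(dim=-1)              # greedy decoding, for the sketch only
            word_logits.append(torch.stack(sent_logits, dim=1))
        return torch.stack(word_logits, dim=1), torch.cat(stop_logits, dim=1)

# Example: decode paragraphs for two pooled image features.
feats = torch.randn(2, 2048)
words, stops = HierarchicalParagraphDecoder()(feats)
print(words.shape, stops.shape)   # (2, 6, 20, 10000) and (2, 6)

In such a setup, the hierarchical supervision described above would attach losses at each level: a stop loss per sentence, word-level losses per sentence, and a paragraph-level similarity or embedding-distance reward; those objectives are not reproduced here.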
· Propose a method for visual grounding based on cross-modal adaptive information fusion. Visual grounding requires the machine to localize, with a bounding box, the image content described by a referring expression. Existing visual grounding methods rely on conventional object detection frameworks and cannot fully exploit visual contextual information or the attribute information in the referring expression, and therefore fail to learn cross-modal semantic alignment. To address this issue, this thesis proposes a visual grounding method based on cross-modal information interaction and fusion. The method does not rely on traditional object detection frameworks built on pre-defined positions or anchor points; instead, it uses a Transformer network to model visual contextual relationships. Furthermore, to fully learn cross-modal semantic alignment, the method uses a cross-modal information interaction network to guide the model in understanding the referring expression based on the visual content while focusing attention on the visual regions related to the expression, and a cross-modal feature fusion network to learn semantically consistent joint representations of visual and language features (a minimal illustration of this kind of cross-modal attention is sketched after the contributions below). Experimental results show that the proposed method improves the accuracy of visual grounding.

· Propose a method for vision-and-language navigation based on latent semantic alignment learning. Vision-and-language navigation requires an agent to navigate in an environment according to a given instruction and form a trajectory. Existing methods do not fully mine the fine-grained semantic alignment between trajectory-instruction pairs in the data, make poor use of the data, and yield agents that lack robustness in unseen environments. Therefore, this thesis proposes a vision-and-language navigation method based on latent semantic alignment learning. The method uses a Transformer network to model the cross-modal semantic alignment between vision and language. In particular, to improve the network's ability to mine cross-modal semantic alignment relations, three novel pre-training methods are proposed. These pre-training methods improve the agent's reasoning ability from vision to language and from language to vision, and its ability to build a consistent perception of the environment. Training the agent with the proposed pre-training methods improves the interpretability of its behavior, its utilization of the training data, and its robustness in unseen environments. Experimental results also show that the proposed method improves the navigation success rate of the agent in unseen environments.
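Both the grounding and navigation contributions rely on Transformer-style interaction between visual and language tokens. The sketch below illustrates only the general pattern, under the assumption of a single block in which each modality attends to the other and the streams are pooled into a joint representation; the class name CrossModalFusionBlock, the dimensions, and the pooling choice are hypothetical and not taken from the thesis.

# A minimal sketch (not the thesis architecture) of Transformer-style
# cross-modal interaction: visual tokens attend to language tokens and
# vice versa, and the two streams are fused into a joint representation.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Language-guided attention over visual tokens, and the reverse.
        self.vis_from_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens:  (B, Nv, dim), e.g. flattened image-patch features
        # lang_tokens: (B, Nl, dim), e.g. embedded words of an expression or instruction
        v, _ = self.vis_from_lang(vis_tokens, lang_tokens, lang_tokens)
        l, _ = self.lang_from_vis(lang_tokens, vis_tokens, vis_tokens)
        v = self.norm_v(vis_tokens + v)          # residual + norm on each stream
        l = self.norm_l(lang_tokens + l)
        # Pool each stream and fuse into one joint, semantically aligned vector.
        joint = self.fuse(torch.cat([v.mean(dim=1), l.mean(dim=1)], dim=-1))
        return joint

# Example: fuse 49 visual tokens with a 12-word expression for a batch of 2.
vis = torch.randn(2, 49, 256)
lang = torch.randn(2, 12, 256)
print(CrossModalFusionBlock()(vis, lang).shape)   # torch.Size([2, 256])

In a grounding setting, such a joint vector would typically feed a box-regression head; in navigation, a similar fused representation would score candidate actions. Either head is outside this sketch.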
In summary, in the cross-modal domain of vision and language, learning cross-modal semantic alignment relationships is a challenging problem. To address it, this thesis proposes three cross-modal semantic alignment methods that enhance image paragraph generation, visual grounding, and vision-and-language navigation.

Keywords/Search Tags: Vision and Language, Cross-modal Semantic Alignment, Image Paragraph Generation, Visual Grounding, Vision-and-Language Navigation