
Research And Implementation Of Multi-modal Fusion Method For Vision And Language

Posted on: 2022-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: S P Yu
Full Text: PDF
GTID: 2518306764967549
Subject: Computer Software and Applications
Abstract/Summary:
Vision and language navigation is a complex multi-modal task that aims to build a navigation agent that follows natural language instructions and moves through a visual environment to reach a destination; the task extends naturally to a variety of real-world scenarios. Most existing vision and language navigation approaches use a Transformer to learn fused vision and language multi-modal representations and decide navigation actions from the fused features. This significantly improves performance over traditional approaches, but such models still generalize poorly in unseen environments. This thesis focuses on improving the generalization ability of vision and language navigation models in unseen environments, and investigates vision and language multi-modal fusion from two perspectives: improving the internal network structure of the model and improving the external reward generation mechanism. The specific research contributions are as follows.

1. Vision and language alignment via a spatial-temporal Transformer with causal attention. To address the problem that dataset bias misleads the model into learning spurious correlations and thereby weakens its generalization ability, a vision and language alignment method based on a spatial-temporal Transformer with causal attention is proposed. The method comprises 1) a vision and language alignment sub-network, whose causal-attention Transformer units mine the causal relationships among panoramic visual observations, natural language instructions, and navigation actions, so that the agent reasons about actions consistent with the causal effect and generalizes better; and 2) a gated update sub-network, which uses a gating mechanism to filter key-moment information and supply historical context for navigation decisions (a minimal sketch of these two sub-networks follows this abstract). The method is validated on the public R2R dataset and the Matterport3D simulation platform; success rate (SR) improves by 2.15% in seen environments and 2.07% in unseen environments over the existing baseline model.

2. Intrinsic reward via self-supervised auxiliary tasks based on Transformer. Environmental feedback rewards in vision and language navigation tasks are ambiguous and cannot provide effective supervision for vision and language alignment. To address this, an intrinsic reward method based on self-supervised auxiliary tasks built on a Transformer policy network is proposed. Three self-supervised auxiliary tasks suited to the Transformer policy network encourage the model to spontaneously summarize environmental semantic information and the agent's internal operating mechanism; the resulting intrinsic rewards provide additional training signals and improve the model's learning efficiency and generalization ability (a sketch of the reward shaping follows this abstract). The effectiveness of the method is verified on the public R2R dataset: compared with several existing baseline models, SR improves by 5.58% in seen environments and 1.28% in unseen environments.

3. Indoor vision and language navigation system. To meet the application requirements of vision and language navigation in real-life scenarios, we design the overall system architecture and its functional modules and build an indoor vision and language navigation system with Vue, Flask, PyMySQL, and other development frameworks. The system invokes the two vision and language navigation methods proposed in this thesis to realize the navigation function (an illustrative service endpoint sketch follows this abstract).
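To make the structure of point 1 concrete, the following is a minimal PyTorch-style sketch of one navigation step: cross-modal attention fuses instruction tokens with panoramic view features, and a gating mechanism updates a history state used to score candidate actions. All module names, dimensions, and the gating form are assumptions; the abstract does not specify the thesis's exact causal-attention formulation, so standard multi-head attention stands in for it here.

```python
# Hypothetical sketch of the alignment and gated-update sub-networks (point 1).
# The thesis's causal-attention unit is approximated by standard cross-attention.
import torch
import torch.nn as nn

class GatedHistoryUpdate(nn.Module):
    """Gating mechanism that filters key-moment information into a history state."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.cand = nn.Linear(2 * dim, dim)

    def forward(self, history, fused):
        joint = torch.cat([history, fused], dim=-1)
        z = torch.sigmoid(self.gate(joint))          # per-dimension keep/overwrite gate
        h_tilde = torch.tanh(self.cand(joint))       # candidate history state
        return (1 - z) * history + z * h_tilde

class AlignmentStep(nn.Module):
    """One navigation step: cross-modal attention over panorama and instruction,
    followed by the gated history update and candidate-action scoring."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.update = GatedHistoryUpdate(dim)
        self.action_head = nn.Linear(dim, dim)

    def forward(self, history, lang_tokens, view_feats):
        # Attend from instruction tokens to panoramic views, then pool.
        fused, _ = self.cross_attn(lang_tokens, view_feats, view_feats)
        fused = fused.mean(dim=1)
        history = self.update(history, fused)
        # Score each candidate view against the updated history state.
        logits = torch.einsum('bd,bnd->bn', self.action_head(history), view_feats)
        return history, logits

# Usage with dummy shapes: 20 instruction tokens, 36 panoramic views.
step = AlignmentStep(dim=768)
history = torch.zeros(2, 768)
lang = torch.randn(2, 20, 768)
views = torch.randn(2, 36, 768)
history, logits = step(history, lang, views)  # logits score navigation actions
```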
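For point 2, the following sketch shows one way auxiliary-task losses can be turned into a dense intrinsic reward that supplements the sparse environment feedback. The abstract does not name the three auxiliary tasks or the shaping function; the exponential shaping, the weights, and the `beta` coefficient below are illustrative assumptions.

```python
# Hypothetical intrinsic-reward shaping from self-supervised auxiliary losses (point 2).
import torch

def intrinsic_reward(aux_losses, weights):
    """Map weighted auxiliary losses to a bounded reward: lower loss -> higher reward."""
    total = sum(w * l for w, l in zip(weights, aux_losses))
    return torch.exp(-total)  # lies in (0, 1]; one of many possible shapings

def shaped_return(extrinsic, aux_losses, weights, beta=0.1):
    """Combine sparse environment feedback with the dense intrinsic signal."""
    return extrinsic + beta * intrinsic_reward(aux_losses, weights)

# Usage with three placeholder auxiliary losses (e.g. progress estimation,
# instruction-trajectory matching, next-view prediction -- names assumed).
aux = [torch.tensor(0.4), torch.tensor(0.9), torch.tensor(0.2)]
r = shaped_return(torch.tensor(1.0), aux, weights=[1.0, 1.0, 1.0])
```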
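For point 3, a minimal sketch of how a Flask backend might expose the navigation models to a Vue frontend is shown below. The route name, payload fields, and the `run_navigation` wrapper are assumptions; the abstract only states that Vue, Flask, and PyMySQL are used.

```python
# Hypothetical Flask endpoint for the indoor navigation system (point 3).
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_navigation(instruction, method):
    """Stub standing in for the two trained navigation models; returns a dummy path."""
    return ["viewpoint_0", "viewpoint_1"]

@app.route("/navigate", methods=["POST"])
def navigate():
    payload = request.get_json()
    instruction = payload["instruction"]            # natural language command
    method = payload.get("method", "causal")        # which of the two models to invoke
    trajectory = run_navigation(instruction, method)
    return jsonify({"trajectory": trajectory})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```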
Keywords/Search Tags: vision and language alignment, reinforcement learning, causal inference, self-supervised learning