| Visual Question Answering (VQA) is an important task in vision-language learning that aims to automatically answer natural language questions based on image content. As a field at the intersection of computer vision and natural language processing, VQA has a wide range of practical applications in areas such as human-computer interaction and intelligent transportation. A key aspect of VQA is the need to reason about the relationships between visual entities in the image in the context of the question. Existing VQA methods do not determine multi-hop relationships between image entities accurately enough for complex questions and cannot provide a clear reasoning process, which leaves the models with limited interpretability. To address these issues, this thesis applies graph networks to VQA and conducts the following research: (1) To address the challenge of performing multi-hop reasoning that captures the interrelationships and interactions between visual entities when answering complex questions, this thesis proposes a Question-Guided Multi-hop Reasoning Graph Network (QMRGT). It represents the multi-hop reasoning process of VQA as multiple rounds of question-guided dynamic interaction and updating between image entities. At each reasoning step, the network updates the question instructions and the visual entity representations in both directions and captures the relationships between visual entities on the graph with a question-guided message-passing algorithm, which keeps the multi-hop reasoning coherent and consistent. The interpretability of the method is demonstrated by analysing how the weights of the question-related visual entities change at each reasoning step. (2) To further enhance the robustness of the model’s reasoning and the reliability and correctness of the reasoning chain, this thesis builds on the framework of work (1) and proposes Adaptive Path Reinforced Reasoning Graph Networks (APRGT). It transforms the multi-hop reasoning process of VQA into a reasoning-path expansion task in order to learn multi-hop relations between visual entities. Using a self-adaptive path expansion method, the network independently explores and expands reasoning paths on the graph according to the question and makes accurate and transparent reasoning decisions along the reasoning chain. By analysing the adaptive expansion of reasoning paths, the method presents a complete reasoning process clearly, further enhancing the interpretability and reasoning robustness of the model. Finally, this thesis conducts a series of comparative and ablation experiments on the public VQA datasets GQA, GQA-OOD and VQA2.0, which verify the effectiveness of the proposed methods on VQA tasks, especially on complex questions that require multi-hop reasoning and on questions that test out-of-distribution generalization. Qualitative experimental analysis also demonstrates that the approach enhances interpretability. |
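
The question-guided, multi-hop update described in (1) can be illustrated with a small sketch. This is not the thesis's QMRGT implementation: the dot-product relevance score, the fixed 0.5 mixing factor, and the names `reasoning_step` and `multi_hop` are all assumptions chosen for illustration. It only shows the general pattern of updating entity representations and the question instruction in both directions at each hop, while recording the per-hop entity weights that an interpretability analysis could inspect.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reasoning_step(nodes, edges, instruction, mix=0.5):
    """One hypothetical reasoning hop.

    nodes: list of entity feature vectors
    edges: edges[i] lists the neighbour indices of entity i
    instruction: current question instruction vector
    """
    # Question -> entities: score each entity against the instruction.
    weights = softmax([dot(h, instruction) for h in nodes])

    new_nodes = []
    for i, h in enumerate(nodes):
        nbrs = edges[i]
        if not nbrs:
            new_nodes.append(list(h))  # isolated entity: nothing to receive
            continue
        # Neighbours send messages scaled by their question relevance.
        norm = sum(weights[j] for j in nbrs)
        msg = [sum(weights[j] / norm * nodes[j][k] for j in nbrs)
               for k in range(len(h))]
        new_nodes.append([(1 - mix) * a + mix * b for a, b in zip(h, msg)])

    # Entities -> question: fold the attended entity summary into the instruction.
    summary = [sum(w * h[k] for w, h in zip(weights, new_nodes))
               for k in range(len(instruction))]
    new_instruction = [(1 - mix) * q + mix * s
                       for q, s in zip(instruction, summary)]
    return new_nodes, new_instruction, weights

def multi_hop(nodes, edges, instruction, hops=3):
    """Run several hops and keep the entity weights of each hop."""
    trace = []
    for _ in range(hops):
        nodes, instruction, weights = reasoning_step(nodes, edges, instruction)
        trace.append(weights)
    return nodes, instruction, trace

if __name__ == "__main__":
    toy_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy entities
    toy_edges = [[1], [0, 2], [1]]                    # a simple chain graph
    _, _, trace = multi_hop(toy_nodes, toy_edges, [1.0, 0.0], hops=3)
    for step, weights in enumerate(trace, start=1):
        print(step, [round(w, 3) for w in weights])
```

The recorded `trace` plays the role of the per-step entity weights mentioned in the abstract: comparing it across hops shows which entities the question attends to as reasoning proceeds. A learned model would replace the dot-product score and fixed mixing factor with trained parameters.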