According to the WHO, at least 2.2 billion people worldwide are visually impaired or blind. China has the largest blind population of any country, accounting for 18%-20% of the world total, with up to 450,000 new cases of blindness every year. Traditional guidance tools suffer from poor interactivity, difficulty of popularization, and low portability, and therefore struggle to meet the needs of blind people. Integrating visual question answering (VQA) technology into guidance equipment could better assist blind people in their daily lives. In this paper, VQA technology predicts answers from scene images and their corresponding natural-language questions, and is applied to study the scene information found in the daily environments of blind people. With the development of artificial intelligence, related techniques in computer vision and natural language processing are advancing rapidly, making it feasible to integrate VQA technology into the blind-guide task.

The research object of this paper is a VQA system for indoor scenes. For the construction of the VQA scene, feature extraction strategies and feature fusion strategies are studied in depth. The main research contents are as follows:

1) A functional requirements analysis is conducted for the indoor-scene VQA system, detailing the tasks and functions the system must complete. The overall architecture of the VQA system is then designed, followed by a detailed design of all functional modules: the dataset, the data collection module, the data processing module, the feature fusion module, the answer prediction module, and the display module. In the feature fusion module, two different feature fusion strategies are designed for different input data.

2) To address the inadequate representation ability of existing models on images and questions, and the insufficient information interaction
between modalities, a VQA algorithm based on an attention mechanism is proposed. The algorithm applies self-attention to both images and questions to obtain relational information within each modality; at the same time, an attention guidance mechanism is applied across image and question features to establish cross-modal information interaction, enhancing the expressive power of both the image features and the question features. Finally, the effectiveness of each module is verified through ablation experiments, and comparative experiments demonstrate the model's superiority over other VQA models.

3) Stacking attention modules can improve the accuracy of a VQA algorithm within a certain range, but once the number of layers exceeds a certain limit, model performance drops sharply. To solve this problem, and inspired by capsule networks, this paper proposes a capsule-network-based VQA model that realizes multi-step attention through a single attention layer. Without reducing VQA performance, this reduces the model's parameter count and improves its compactness and robustness. Finally, ablation experiments verify the advantages of the capsule-network-based VQA model.

4) A demonstration system for the blind-guide task is designed to verify the accuracy of the VQA algorithm and to allow its performance to be experienced intuitively. A camera captures the real scene image, a speech conversion module converts the spoken question into text, both are passed to the VQA system as inputs, and the resulting answer is fed back to the blind user in the form of voice output. This work ultimately builds a VQA
system that simulates functions such as intelligent information collection, intelligent user-intention reasoning, and intelligent information presentation. The aim is to build an information-accessible interaction system for visually impaired individuals, laying a theoretical foundation for the subsequent integration of the system into guidance aids.
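The attention scheme described in item 2) can be illustrated with a minimal NumPy sketch. All names and dimensions below are assumptions for illustration (36 image-region features and 14 question-token features, each projected to a common 512-d space); the thesis's actual projections, heads, and layer counts may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention: each query row attends over key/value rows."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)      # (n_query, n_key) relevance matrix
    return softmax(scores, axis=-1) @ value  # weighted sum of value rows

rng = np.random.default_rng(0)
img = rng.standard_normal((36, 512))   # assumed image-region features
ques = rng.standard_normal((14, 512))  # assumed question-token features

# Self-attention within each modality captures intra-modal relations.
img_sa = attention(img, img, img)
ques_sa = attention(ques, ques, ques)

# Guided attention: question features steer attention over image regions,
# establishing the cross-modal interaction described above.
img_guided = attention(ques_sa, img_sa, img_sa)  # shape (14, 512)
```

A full model would add learned projection matrices, multiple heads, and residual/normalization layers around these operations; the sketch only shows the information flow.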
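The demonstration loop in item 4) — camera capture, speech-to-text, answer prediction, voice feedback — can be sketched as a pipeline of injected components. The interfaces below are hypothetical placeholders, not the thesis's actual modules or any real library API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class GuideVQAPipeline:
    """Skeleton of the blind-guide loop: all four components are
    caller-supplied callables (assumed interfaces for illustration)."""
    capture: Callable[[], Any]            # camera -> scene image
    speech_to_text: Callable[[Any], str]  # spoken question -> text
    vqa_model: Callable[[Any, str], str]  # (image, question) -> answer
    text_to_speech: Callable[[str], Any]  # answer text -> voice output

    def answer_once(self, audio: Any) -> Any:
        image = self.capture()                     # information collection
        question = self.speech_to_text(audio)      # user intention as text
        answer = self.vqa_model(image, question)   # answer prediction
        return self.text_to_speech(answer)         # voice feedback to the user

# Usage with stub components standing in for the real modules:
pipe = GuideVQAPipeline(
    capture=lambda: "scene-image",
    speech_to_text=lambda audio: "where is the door?",
    vqa_model=lambda img, q: "on your left",
    text_to_speech=lambda s: f"[spoken] {s}",
)
result = pipe.answer_once(b"raw-audio")
```

Keeping the components as injected callables lets each module (camera driver, ASR, VQA model, TTS) be swapped or tested independently.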