With the rapid development of deep learning and the sharp increase in the amount of information in modern society, intelligent processing of a single information medium no longer satisfies human needs. Multimodal learning, which jointly processes multiple information carriers such as images, text, video, and speech, has become a research hotspot in academia in recent years. Because human perception is inherently multimodal, the development of multimodal deep learning is bound to help artificial intelligence better understand and perceive the world. As an important task in the multimodal field, visual question answering has attracted extensive attention from researchers, and research on it is therefore of great significance. Visual question answering combines computer vision and natural language processing: the computer must answer a question posed about an input image. The task centers on improving a model's image understanding and reasoning capabilities, which is quite challenging. This thesis carries out the following research work:

(1) We survey the development of the field of visual question answering. Current mainstream methods rely on attention networks that make the model focus on key objects in the image or keywords in the question text. However, after reproducing a number of these models and visualizing their attention, we find that their attention distributions tend to concentrate on similar regions, which causes a redundancy problem and makes it difficult for the model to capture important entities. To address this issue, this thesis proposes a multi-head attention fusion network, named MHAFN, which aims to achieve multi-level, multi-gradient, and multi-angle multimodal fusion. Its multiple branches capture fine-grained and complex relationships at different levels: words, regions, and the interactions between them. It also produces a more discrete attention distribution, allowing it to focus on several distinct visual and textual components, which helps the model better infer the final answer. Extensive experiments on the VQA v2.0 dataset show that the proposed MHAFN model achieves competitive performance.

(2) Building on MHAFN, this thesis develops a corresponding intelligent visual question answering assistance system for the blind. The system collects and uploads pictures and questions from a blind user, processes them with visual question answering techniques, and returns a final answer, thereby helping the user better perceive the surrounding environment. Extensive testing shows that the system is convenient for blind users and improves their daily life in terms of efficiency and well-being, achieving the desired effect.
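The abstract describes MHAFN only at a high level, so the following is a minimal NumPy sketch of the generic multi-head attention that such a fusion network builds on, where each head can attend to a different subset of image regions or question words. All names, dimensions, and the question-attends-to-regions setup are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key, value, num_heads):
    """Scaled dot-product attention split across several heads.

    query: (n_q, d_model); key, value: (n_kv, d_model).
    Each head operates on its own d_model // num_heads slice, so
    different heads can focus on different regions or words.
    """
    n_q, d_model = query.shape
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        q, k, v = query[:, s], key[:, s], value[:, s]
        scores = q @ k.T / np.sqrt(d_head)   # (n_q, n_kv) similarity
        weights = softmax(scores, axis=-1)   # per-head attention distribution
        outputs.append(weights @ v)          # (n_q, d_head) attended features
    return np.concatenate(outputs, axis=-1)  # (n_q, d_model)

# Hypothetical setup: question words attend over image-region features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 64))  # e.g. 36 detected image regions
words = rng.normal(size=(14, 64))    # e.g. 14 question-word embeddings
fused = multi_head_attention(words, regions, regions, num_heads=8)
print(fused.shape)  # (14, 64)
```

In a real VQA model the query, key, and value would first pass through learned linear projections, and a fusion network like MHAFN would combine several such attention branches (word-level, region-level, and cross-modal) before the answer classifier.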