| Visual Question Answering (VQA) is a multimodal task that combines computer vision and natural language processing. It aims to enable a computer to return the correct answer given an image and an open-ended text question about it. VQA still faces several problems. For example, answering a complex question such as "How many mammals are in the picture?" requires deciding whether each animal in the image is a mammal, which depends on external knowledge. Existing methods struggle to retrieve the small amount of relevant supporting knowledge from a massive knowledge base, and training the multimodal fusion module produces a high-dimensional tensor that is expensive to compute. This thesis therefore designs a knowledge embedding method based on question query to introduce key knowledge, and proposes a joint embedding method based on tensor decomposition to reduce the number of parameters and improve computational efficiency. The main contributions of this thesis are as follows. (1) We design a knowledge embedding method based on question query. The method uses a lightweight convolutional network to encode the image into an embedding vector and uses BERT embeddings to encode the text. We design query rules for the text and introduce knowledge embeddings according to these rules. In the multimodal fusion module, the text, knowledge, and image information is learned jointly with a multi-head self-attention mechanism. Experiments show that the method introduces relevant and valid knowledge and improves answer accuracy: it achieves an overall accuracy of 72.13% on the VQAv2 test set and 79.39% on the VQA-abstract test set. (2) Building on the knowledge embedding method, we propose a multimodal fusion approach based on joint embedding. The approach performs a trilinear joint embedding of the feature matrices output by the multi-head self-attention module to form a new global feature vector. Because the joint embedding yields a high-dimensional tensor that is difficult to compute, a tensor decomposition method is introduced to express that tensor as a sum of several low-rank tensors. Experiments show that the method improves overall accuracy by 1.47 percentage points on the OK-VQA test set and by 1.43 percentage points on the VizWiz test set. (3) On the basis of these VQA techniques, this thesis combines speech recognition, machine translation, and speech synthesis to design and implement a VQA system based on WeChat mini programs. The system answers questions, entered by voice or handwriting, about photos the user takes or images chosen from the user's album. In summary, to address the low accuracy of VQA, this thesis proposes a knowledge-embedded VQA method based on question query. To address the large number of parameters, the costly training computation, and the difficulty of deploying current algorithms in practice, it proposes a joint embedding multimodal fusion method based on tensor decomposition. Finally, a VQA system based on WeChat mini programs is designed and implemented on top of the proposed algorithms. |
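
The parameter saving behind the second contribution can be illustrated with a minimal NumPy sketch. It assumes a CP-style factorization (a sum of rank-one terms) and small made-up dimensions. The factor matrices `A`, `B`, `C`, `D`, the rank `R`, and the vectors `q`, `k`, `v` are hypothetical stand-ins for the text, knowledge, and image features; this is not the thesis's actual implementation, which the abstract does not specify.

```python
import numpy as np


def full_trilinear(T, q, k, v):
    """Contract a full 4-way weight tensor T[i, j, l, o] with three feature vectors."""
    return np.einsum("ijlo,i,j,l->o", T, q, k, v)


def low_rank_trilinear(A, B, C, D, q, k, v):
    """Rank-R form: project each input to R dims, multiply elementwise, map to output."""
    return ((q @ A) * (k @ B) * (v @ C)) @ D


rng = np.random.default_rng(0)

# Toy sizes (assumptions): three input feature dims, output dim, and rank R.
dq, dk, dv, do, R = 8, 6, 5, 4, 3

# Factor matrices of the low-rank decomposition.
A = rng.standard_normal((dq, R))
B = rng.standard_normal((dk, R))
C = rng.standard_normal((dv, R))
D = rng.standard_normal((R, do))

# The full tensor these factors represent: a sum of R rank-one tensors.
T = np.einsum("ir,jr,lr,ro->ijlo", A, B, C, D)

# Stand-ins for text, knowledge, and image feature vectors.
q = rng.standard_normal(dq)
k = rng.standard_normal(dk)
v = rng.standard_normal(dv)

y_full = full_trilinear(T, q, k, v)
y_low = low_rank_trilinear(A, B, C, D, q, k, v)

full_params = dq * dk * dv * do           # weights in the full tensor
low_rank_params = R * (dq + dk + dv + do)  # weights in the four factors

print(np.allclose(y_full, y_low), full_params, low_rank_params)
```

With these toy sizes the full tensor holds 960 weights while the factors hold 69, and both paths produce the same output vector. The low-rank path also never materializes the full tensor, which is the source of the computational saving.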