
Research And Implementation Of VQA Based On Priori Attention Mechanism

Posted on: 2021-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Xu
Full Text: PDF
GTID: 2428330611473248
Subject: Computer Science and Technology
Abstract/Summary:
Understanding video content has become a core technology for many real-world applications, such as recognizing human behavior in surveillance systems or analyzing customer behavior in automated stores. However, because of the large data volume and temporal structure of video, understanding its content remains a challenging problem. In recent years, attention mechanisms have become an important approach to video question answering (VQA), but current methods have four shortcomings. First, some methods extract features from the entire video; this captures all of the video's information, but the redundancy in video makes training so expensive that the cost outweighs the gain. Second, some methods describe the video using extracted frames, which still contain considerable redundancy. Third, question processing is coarse, and stop words in the question are not handled. Fourth, the complexity and logical structure of the video question answering task are not taken into account. These shortcomings significantly reduce the generalization performance and accuracy of the models.

Based on deep learning, this thesis first proposes a prior MASK attention mechanism model. On this basis, it proposes two video question answering schemes: one based on a multi-attention mechanism with the prior MASK, and one based on a graph attention mechanism with the prior MASK.

The multi-attention scheme combines three kinds of attention mechanisms with the prior MASK method. First, the scheme uses frame features to select the key frames of the video. It then applies Faster R-CNN and a residual network to the key frames to obtain visual features and object labels, encodes the question with word2vec and an LSTM, and feeds the extracted video features, object labels, and question text features into the prior MASK attention model described above to obtain the answer to the question. The model took part in the Tianchi ZJB competition and won first place. It is also compared with existing methods at the end of the thesis, and extensive experiments indicate that it outperforms them.

The graph attention scheme uses a graph to represent the relationships between objects in the video. Faster R-CNN extracts the coordinates and categories of the objects in the key frames, and these objects serve as the nodes of a graph built with a node attention mechanism and an edge attention mechanism. The question features and graph features are then embedded into the prior MASK model to obtain the final answer. Experimental results show that the graph representation greatly reduces the number of parameters in the network model without loss of accuracy. This scheme can be used in scenarios where speed requirements are high and accuracy requirements are less strict.
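The abstract does not specify how the prior MASK is constructed or where it enters the attention computation. As a minimal illustrative sketch only, the code below assumes the mask is a binary vector over candidate items (for example key-frame regions, or question tokens with stop words marked 0) and that it is applied to the attention scores before the softmax. The names `masked_attention` and `prior_mask` are hypothetical and do not come from the thesis.

```python
import math


def masked_attention(query, keys, values, prior_mask):
    """Scaled dot-product attention with a binary prior mask.

    Assumption (not from the thesis): items whose mask entry is 0 are
    excluded before the softmax, so they receive exactly zero weight.
    """
    if not any(prior_mask):
        raise ValueError("prior_mask must keep at least one item")

    scale = math.sqrt(len(query))
    # Masked items get a score of -inf, which becomes weight 0 after softmax.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if keep else -math.inf
        for key, keep in zip(keys, prior_mask)
    ]

    # Numerically stable softmax over the remaining items.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention output: weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Toy example: one question vector attending over three region features,
# with the second region removed by the prior mask.
question_vec = [1.0, 0.0]
region_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = masked_attention(question_vec, region_feats, region_feats, [1, 0, 1])
```

In this toy case the masked region contributes nothing to `context`, and the remaining weights still sum to 1. A learned or soft-valued mask would be an equally plausible reading of the abstract; the binary form is chosen here only for simplicity.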
Keywords/Search Tags: video question answering, deep learning, prior MASK, attention mechanism, graph attention