
Research On Deep Learning Algorithm For Automatic Question Answering

Posted on: 2021-01-04  Degree: Master  Type: Thesis
Country: China  Candidate: S Hou  Full Text: PDF
GTID: 2428330611455203  Subject: Engineering
Abstract/Summary:
With the rapid development of artificial intelligence, research is no longer confined to single fields but increasingly spans multiple disciplines and directions. Visual question answering and video question answering, for example, combine image processing with natural language processing. Visual question answering has attracted wide academic attention in recent years, but video question answering, its extension, remains under-studied, for three main reasons. First, video is three-dimensional, so its features are more complex and richer in information, which makes them hard to extract. Second, video features and question features belong to different modalities, which makes effective interaction between them difficult. Third, the model must consider the global semantic features of the question when predicting the final answer, yet traditional semantic feature extraction models have high time complexity, so reducing the cost of feature extraction is a further problem.

To address these problems, this thesis proposes the following solutions, drawing on image processing and natural language processing.

(1) Feature extraction: video features are extracted along two channels, static and dynamic. Existing models use VGG for static features and C3D for dynamic features; to address the shortcomings of that combination, this thesis uses Faster R-CNN to extract the static features of the video and P3D to extract the dynamic features. The extracted video features are then passed through a multi-head self-attention module, so that the model can capture sequence dependencies in different dimensions of the video features.

(2) Interaction between video features and question features: to help the model better understand both, this thesis proposes a multi-stage bidirectional attention memory unit built on two attention mechanisms. The first is a word-granularity attention mechanism, which strengthens the influence of each question word on answer prediction and, guided by each word, selects the most relevant features from the two video feature channels, greatly reducing computational complexity. The second is a time-step-based bidirectional attention mechanism, which computes attention in both directions between the question features and the video features at the current time step, realizing information interaction across the two modalities.

(3) Optimization of the semantic feature extraction model: this thesis proposes a bidirectional gated convolutional network that extends the existing gated convolutional network. Compared with a traditional recurrent neural network, the bidirectional gated convolutional network further reduces the time complexity of training while preserving model accuracy.

Finally, in experiments on the ZJB and MSVD-QA datasets, the proposed model outperforms many baseline models, which supports its effectiveness.
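The abstract does not specify how the bidirectional gated convolutional network in (3) is built, so the following is only a minimal sketch under stated assumptions. It uses the standard gated linear unit, `(X * W) ⊗ sigmoid(X * V)`, over a causal 1D convolution, and assumes that "bidirectional" means running one such convolution left to right and another right to left, then concatenating the two outputs per position. The function names, kernel shapes, and the concatenation step are illustrative choices, not the thesis's actual design; bias terms are omitted for brevity.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_gated_conv(seq, w_lin, w_gate):
    """Gated linear unit over a causal 1D convolution.

    seq:    list of T vectors, each of length d_in
    w_lin:  kernel indexed [k][d_in][d_out] for the linear path
    w_gate: kernel of the same shape for the gate path
    Returns a list of T vectors, each of length d_out.
    """
    k = len(w_lin)
    d_in = len(seq[0])
    d_out = len(w_lin[0][0])
    # Left-pad with zeros so that position t only sees positions <= t.
    padded = [[0.0] * d_in for _ in range(k - 1)] + seq
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]
        row = []
        for o in range(d_out):
            lin = sum(window[j][i] * w_lin[j][i][o]
                      for j in range(k) for i in range(d_in))
            gate = sum(window[j][i] * w_gate[j][i][o]
                       for j in range(k) for i in range(d_in))
            # The sigmoid gate decides how much of the linear path passes.
            row.append(lin * sigmoid(gate))
        out.append(row)
    return out


def bidirectional_gated_conv(seq, fwd_kernels, bwd_kernels):
    """ASSUMPTION: bidirectional = forward pass + pass over the reversed
    sequence, concatenated per position (output width is 2 * d_out)."""
    forward = causal_gated_conv(seq, *fwd_kernels)
    backward = causal_gated_conv(seq[::-1], *bwd_kernels)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def random_kernel(k, d_in, d_out, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(d_out)]
             for _ in range(d_in)] for _ in range(k)]


rng = random.Random(0)
fwd = (random_kernel(3, 4, 2, rng), random_kernel(3, 4, 2, rng))
bwd = (random_kernel(3, 4, 2, rng), random_kernel(3, 4, 2, rng))
question = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
encoded = bidirectional_gated_conv(question, fwd, bwd)
```

Unlike a recurrent network, every output position here depends only on a fixed window of inputs, so all positions can be computed independently; this is the usual argument for why gated convolutions train faster than recurrent encoders.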
Keywords/Search Tags:Video question answering, Faster R-CNN, P3D, Attention mechanism, Gated convolutional network