
Video Captioning Aimed at Objects

Posted on: 2020-10-28 | Degree: Master | Type: Thesis
Country: China | Candidate: Z X Gan | Full Text: PDF
GTID: 2428330596976498 | Subject: Engineering
Abstract/Summary:
With the success of deep learning, the field of artificial intelligence has made revolutionary progress, and research in many areas has advanced rapidly. Work at the intersection of computer vision and natural language processing has likewise received extensive attention from scholars. Video captioning is one such task: it requires the model not only to extract useful information from the video, but also to combine that information effectively and establish the correct correspondence with natural language.

To build a high-performance video captioning algorithm, this thesis proposes first constructing a scene graph for each video frame and then encoding its features with graph convolution. When building the scene graph, the Faster R-CNN object detector is used to obtain the position and category of every object in the frame. Driven by this detection information, a simple fully connected model then predicts the attributes of each object, and a relationship detection model predicts the associations between objects. Because object associations are sparse, checking every object pair incurs excessive detection overhead; to reduce it, this thesis proposes a self-attention-based pruning model. From all of the detection results, a scene graph is constructed containing object nodes, attribute nodes, and relationship nodes. Such a graph captures almost all of the semantic information in a video frame, and it is then encoded by a graph convolutional network.

When encoding video frames with graph convolution, this thesis simplifies the scene graph through embedding, so that the graph contains only object nodes and object associations are represented by directed edges. Graph convolution is then extended to operate on this directed graph, and a multiplicative attention mechanism is added so that each node can better weigh its relationships with its neighbors during aggregation. With this encoding, features are refined down to the individual objects in the image and include the correlations among them, which is more robust than the conventional global feature extracted from a video frame by a convolutional network.

To learn the long-range dependencies between video frames and the words of the text description, this thesis replaces the traditional recurrent neural network with a Transformer, which improves both the model's sequence-learning ability and its training efficiency. The final experiments show that the proposed algorithm generates descriptions that stay closer to the objects in the video and achieves good results on the MSR-VTT dataset.
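The abstract does not give the exact equations of the attention-augmented directed graph convolution, so the sketch below is a plausible reconstruction rather than the thesis's actual formulation: a single PyTorch layer in which each object node computes dot-product (multiplicative) attention over its predecessors and successors in the directed scene graph, then aggregates them through separate linear transforms. The class name, layer structure, and all dimensions are assumptions.

import torch
import torch.nn as nn

class AttentiveDirectedGraphConv(nn.Module):
    """Hypothetical sketch: one graph-convolution layer over a directed
    scene graph, with multiplicative attention over neighbors. Not the
    thesis's published formulation."""
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)  # transform for the node itself
        self.w_out = nn.Linear(dim, dim)   # transform for successors (i -> j)
        self.w_in = nn.Linear(dim, dim)    # transform for predecessors (j -> i)
        self.scale = dim ** -0.5

    def forward(self, x, adj):
        # x:   (N, dim) features of the N detected objects in one frame
        # adj: (N, N) adjacency; adj[i, j] = 1 if there is an edge i -> j
        scores = (x @ x.t()) * self.scale  # dot-product attention logits
        a_out = torch.softmax(scores.masked_fill(adj == 0, float("-inf")), dim=-1)
        a_in = torch.softmax(scores.masked_fill(adj.t() == 0, float("-inf")), dim=-1)
        # nodes with no edges in one direction get NaN rows; zero them out
        a_out, a_in = torch.nan_to_num(a_out), torch.nan_to_num(a_in)
        # direction-aware aggregation: self, successors, and predecessors
        # pass through separate transforms, so edge direction is preserved
        h = self.w_self(x) + a_out @ self.w_out(x) + a_in @ self.w_in(x)
        return torch.relu(h)

Usage on one frame's detections (all values hypothetical):

conv = AttentiveDirectedGraphConv(dim=512)
feats = torch.randn(5, 512)   # e.g. RoI features of 5 Faster R-CNN detections
adj = torch.zeros(5, 5)
adj[0, 1] = 1.0               # a subject -> object relation kept by the pruning model
adj[2, 3] = 1.0
refined = conv(feats, adj)    # (5, 512) relation-aware object features

Splitting the incoming and outgoing transforms is one way to let a node's role as subject versus object of a relation shape the messages it receives, a distinction an undirected graph convolution would conflate.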
Keywords/Search Tags: Video Captioning, Graph Convolutional Network, Scene Graph, Deep Learning