
Dense Video Captioning Based on Part-of-Speech Tagging and Attention

Posted on: 2021-10-27    Degree: Master    Type: Thesis
Country: China    Candidate: Z. J. Zhu    Full Text: PDF
GTID: 2518306107453374    Subject: Computer technology
Abstract/Summary:
In recent years, with the gradual popularization of high-definition video surveillance and the rapid development of short-video social platforms and live-streaming software, video data has grown explosively. How to analyze this massive volume of video data to extract key feature information has gradually become a research focus in intelligent visual analysis. For example, government departments can analyze surveillance footage to obtain the behavioral characteristics of the people in it, and video reviewers can quickly audit content through video descriptions. Research on this problem is therefore of great significance to the development of intelligent video analysis.

Dense video captioning refers to locating the sequence of actions contained in an input video, including the start and end time of each action, and describing these actions in natural language. This thesis studies two aspects of the problem. The first is temporal action proposal generation, i.e., accurately obtaining the start and end times of the actions contained in a video. The second is video description, i.e., describing those temporal actions in natural language.

Current temporal action proposal algorithms consider only features propagated in the forward direction of the video and fail to effectively incorporate features from the reverse direction, resulting in a low recall rate for the generated proposals. At the same time, existing video description algorithms fail to fully fuse video features and temporal action features into a dynamic video feature, and they ignore the temporal information carried by the part-of-speech (POS) tags of words, which makes the generated sentences less accurate.

To address these challenges, this thesis proposes a dense video captioning algorithm based on part-of-speech tagging and an attention mechanism (PosA-DVC). For temporal action proposal generation, a bidirectional Single-Stream Temporal action proposal algorithm based on attention (BiA-SST) is proposed: the forward and reverse features of the temporal actions are extracted by two sequential network models and combined with an attention mechanism, ultimately improving the recall rate of the generated proposals. For description generation, an attention mechanism is used to fuse video features and motion features into dynamic video features; POS tagging information is combined to generate POS temporal features; and finally the POS temporal features, dynamic video features, and word features are combined to dynamically generate the corresponding natural-language description, thereby improving description accuracy.

Experiments with the BiA-SST and PosA-DVC algorithms are carried out on the THUMOS-14 and ActivityNet Captions datasets, respectively. The experimental results are analyzed and compared with related algorithms, demonstrating the feasibility of both BiA-SST and PosA-DVC.
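The attention-based fusion of forward and reverse temporal features described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the scoring vector `w`, the feature shapes, and the per-time-step softmax over the two streams are all assumptions made for the example.

```python
import numpy as np

def attention_fuse(forward_feats, backward_feats, w):
    """Fuse forward and reverse temporal features (T, D) with a
    hypothetical learned attention vector `w` of size D."""
    # Score each stream at every time step: s_t = w . f_t  -> (2, T)
    scores = np.stack([forward_feats @ w, backward_feats @ w], axis=0)
    # Softmax over the two streams at each time step -> weights in (0, 1)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    # Convex combination of the two streams per time step -> (T, D)
    return alpha[0][:, None] * forward_feats + alpha[1][:, None] * backward_feats

rng = np.random.default_rng(0)
T, D = 5, 8                                # 5 time steps, 8-dim features
fwd = rng.standard_normal((T, D))          # stand-in forward-pass features
bwd = rng.standard_normal((T, D))          # stand-in reverse-pass features
w = rng.standard_normal(D)                 # stand-in attention parameters
fused = attention_fuse(fwd, bwd, w)
print(fused.shape)                         # (5, 8)
```

Because the softmax weights sum to one at each time step, every fused feature vector is a convex combination of the forward and reverse features at that step; in the thesis, the combined representation then feeds the proposal scorer in place of the forward-only features used by a single-direction SST model.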
Keywords/Search Tags: Dense video captioning, temporal action proposal generation, video description, POS tagging, attention