
Temporal Action Detection And Video Caption Algorithm Based On Deep Learning

Posted on: 2020-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: X N Liu
Full Text: PDF
GTID: 2428330575456375
Subject: Information and Communication Engineering

Abstract/Summary:
With the rapid development and popularization of large-capacity storage, multimedia technology, digital devices, computer networks, and communication technologies, the volume of video data on the network has exploded. How to analyze this large and unorganized video data faster and better has become a research hotspot in computer vision. Due to the limitations of traditional video analysis methods and the advantages of deep learning techniques in extracting high-level semantic information from images, video comprehension based on deep learning has become the mainstream approach to intelligent video analysis. Current research on video comprehension includes video action recognition, temporal action detection, object tracking, video summarization, and video captioning.

This paper focuses on two issues: how to effectively detect actions in untrimmed videos from real scenes, and how to establish the connection between visual information and natural language. Current temporal action detection and video captioning algorithms face the following challenges: 1) fixed feature maps give temporal action detection a low recall on actions of varied duration; 2) dense video captioning based on temporal action proposals is divided into two stages, which destroys the interaction between the two tasks. To address these challenges, this paper proposes a multi-scale temporal action detection algorithm based on a feature pyramid network (FPN-TAD) and a jointly optimized dense event captioning algorithm based on descriptive regression (DR-DVC). The FPN-TAD algorithm detects candidate action regions on multi-scale feature maps by introducing the FPN structure, which effectively improves the recall of actions of varied duration. DR-DVC introduces a descriptive loss into the event proposal stage, which encourages proposals to contain more description-relevant information. Considering that different video frames contribute differently to the description results, the descriptive scores can also serve as attention weights for event description, improving description accuracy.

Finally, the paper validates the effectiveness of FPN-TAD and DR-DVC on ActivityNet and other datasets. Comparing the proposed algorithms with the baseline and mainstream algorithms, the results show a clear performance improvement over the baseline and better performance than most mainstream algorithms, which demonstrates the feasibility and effectiveness of the proposed approach.
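The multi-scale idea behind FPN-TAD can be illustrated with a minimal NumPy sketch: a top-down feature pyramid built over 1D temporal feature maps, so that actions of different durations can be detected at different temporal resolutions. The function names, the pointwise lateral projections, and the factor-2 nearest-neighbour upsampling here are illustrative assumptions, not the thesis's actual implementation:

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) projection of a temporal feature map: (T, C_in) @ (C_in, D)."""
    return x @ w

def upsample2x(x):
    """Nearest-neighbour temporal upsampling by a factor of 2: (T, D) -> (2T, D)."""
    return np.repeat(x, 2, axis=0)

def temporal_fpn(features, laterals):
    """Build a top-down feature pyramid over 1D temporal feature maps.

    features: list of (T_i, C_i) arrays ordered fine-to-coarse (T halves per level).
    laterals: matching list of (C_i, D) lateral projection matrices.
    Returns a list of (T_i, D) pyramid maps, one per input scale.
    """
    # Start from the coarsest level and merge top-down into finer levels.
    pyramid = [conv1x1(features[-1], laterals[-1])]
    for feat, w in zip(reversed(features[:-1]), reversed(laterals[:-1])):
        merged = conv1x1(feat, w) + upsample2x(pyramid[-1])
        pyramid.append(merged)
    return pyramid[::-1]  # restore fine-to-coarse order

# Example: three temporal scales (8, 4, 2 steps) projected to a common width D=8.
rng = np.random.default_rng(0)
features = [rng.normal(size=(8, 16)), rng.normal(size=(4, 32)), rng.normal(size=(2, 64))]
laterals = [rng.normal(size=(16, 8)), rng.normal(size=(32, 8)), rng.normal(size=(64, 8))]
pyramid = temporal_fpn(features, laterals)
```

Each output map keeps its own temporal resolution but shares the channel width, so a single proposal head could be applied at every scale; long actions would then be picked up on the coarse maps and short ones on the fine maps.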
Keywords/Search Tags: deep learning, temporal action detection, video caption, multi-scale feature, joint optimization