
Research On Temporal Action Detection Algorithm Based On Deep Learning

Posted on: 2021-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: K X Li
Full Text: PDF
GTID: 2518306107952959
Subject: Control Engineering
Abstract/Summary:
The information contained in videos is rich and complex, and video analysis has received great research attention. With the rapid development of deep learning technology, deep learning networks significantly outperform traditional machine learning models on video analysis tasks, and are therefore widely used in this area. Temporal action detection is a fundamental and challenging task in video analysis, and progress on it benefits both deep-learning-based detection tasks and the development of video understanding more broadly. Taking R-C3D, a typical end-to-end deep learning model for temporal action detection, as its basis, this thesis studies several problems in temporal action detection along three lines: temporal multi-scale structure, optical flow feature extraction, and the fusion of RGB and optical flow features.

Action instances in videos often vary widely in duration, which limits the performance of detection algorithms that operate at a single temporal scale. The thesis designs two temporal multi-scale detection modules: one fuses feature maps across temporal scales, while the other uses separate branches at the end of the detection model to process and merge actions of different durations. Both modules are applied to the R-C3D network. Experiments on the THUMOS14 and ActivityNet-1.3 temporal action detection datasets show that both modules improve the detection ability of the R-C3D model.

Video recognition and video detection tasks often use a two-stream network, which takes RGB images and optical flow images as two separate inputs and fuses the processing results of the two streams to obtain the final prediction. However, existing methods do not analyze in depth why the two-stream design improves performance, and existing optical flow computation methods are computationally expensive, which limits their practicality. The thesis visually analyzes and compares temporal action detection networks that take RGB and optical flow as input, showing that optical flow images provide the network with information complementary to RGB images, so combining the two improves detection performance. Furthermore, a temporal action detection model is proposed that takes only RGB images as input and uses optical flow and optical flow features as intermediate supervision: an encoder-decoder structure extracts optical flow features directly from RGB images, reducing the cost of optical flow computation.

In the traditional two-stream network, the output scores of the RGB and optical flow branches are fused to improve performance, but whether fusing the feature maps of the two branches is effective has not been studied. This thesis studies the fusion of RGB features and optical flow features experimentally: the two kinds of features are fused at different stages of the network, and three fusion methods are proposed: feature map addition, feature map multiplication, and feature map concatenation. Experiments show that fusing the feature maps of the two streams achieves better detection results than the traditional two-stream network, with feature map concatenation at the end of the feature extraction module performing best.
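The multi-scale feature-map fusion idea described in the abstract can be illustrated with a toy NumPy sketch: pool a temporal feature map at several strides, upsample each pooled map back to the original length, and average. The pooling strides, the averaging step, and the (channels, time) feature shape are illustrative assumptions, not the thesis's actual module.

```python
import numpy as np

def temporal_pool(feat, stride):
    """Average-pool a (C, T) feature map along the temporal axis."""
    C, T = feat.shape
    T2 = T // stride
    return feat[:, :T2 * stride].reshape(C, T2, stride).mean(axis=2)

def multiscale_fuse(feat, strides=(1, 2, 4)):
    """Pool at several temporal scales, upsample each result back to
    length T, and average. Assumes T is divisible by every stride."""
    fused = np.zeros_like(feat)
    for s in strides:
        pooled = temporal_pool(feat, s)        # (C, T // s)
        fused += np.repeat(pooled, s, axis=1)  # nearest-neighbor upsample
    return fused / len(strides)

feat = np.random.rand(64, 16)  # toy feature map: 64 channels, 16 time steps
out = multiscale_fuse(feat)
assert out.shape == feat.shape
```

Coarser strides blur short actions but give a wider temporal context, which is why combining several scales can help detect actions of very different durations.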
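The three feature-fusion strategies (addition, multiplication, concatenation) can be sketched minimally in NumPy. The function name, the (channels, time) feature shape, and the choice of the channel axis for concatenation are illustrative assumptions rather than the thesis's implementation.

```python
import numpy as np

def fuse_features(rgb, flow, method="concat"):
    """Fuse RGB and optical-flow feature maps of identical shape (C, T)."""
    if method == "add":
        return rgb + flow
    if method == "mul":
        return rgb * flow
    if method == "concat":
        return np.concatenate([rgb, flow], axis=0)  # stack along channels
    raise ValueError(f"unknown fusion method: {method}")

rgb = np.random.rand(256, 16)
flow = np.random.rand(256, 16)
assert fuse_features(rgb, flow, "add").shape == (256, 16)
assert fuse_features(rgb, flow, "concat").shape == (512, 16)
```

Note that addition and multiplication preserve the channel count, while concatenation doubles it, so any layer after a concatenation fusion must accept the wider feature map.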
Keywords/Search Tags: Deep learning, Temporal action detection, Multi-scale detection, Optical flow